Thursday 13 August 2015

Building a Mutation History Tree with STR data

In the previous blog post we explored Genetic Distance and how it can be used to group people together into the same Genetic Family (or Lineage). We also saw how a lot of your relevant non-matches only show up in surname projects because they are outside the FTDNA matching threshold. Lastly, we introduced the concept of the Mutation History Tree (MHT) and how traditional genealogies could be hung on its branches to give a combined tree that uses mutations when named individuals run out.

In this blog post, we will look at how to build a Mutation History Tree using STR data.

Building a simple Mutation History Tree

A huge thank you is due to project member Lisa Little who generated a lot of the data discussed below, and to Nigel McCarthy (Admin of the Munster Irish Project) who has offered invaluable advice.

It is clear from the Results Page of the Gleason DNA Project that each Lineage has a distinctive coloured pattern by virtue of the values for its STR markers. This distinctive coloured pattern reflects the unique marker values for each Lineage, and distinguishes one Lineage from another. The Modal Haplotype for each Lineage represents the "signature tune" for that particular Lineage.

The distinctive coloured patterns of Lineages I, II & III
(from the Results Page on the World Families.Net site)

Furthermore, within each Lineage, there are subtle differences among members. In other words, most people are not an exact match to the Modal Haplotype - everyone in the group sings the "signature tune" just ever so slightly differently. They're all in the same choir, but some hit a bum note! These differences between project members in their STR marker values allow us to construct a diagram that suggests how the various project members might be related to each other.

So, for example, if we take the first 12 markers in Lineage II we could construct a Mutation History Tree (MHT) that separates out the members of the group into different branches. Reading from left to right across the tree in the diagram below, the first 4 members (G22, G05, G57, and G64) all match the Modal Haplotype at 12 markers (Branch 1). The fifth member (G42) has a single mutation (in red) at marker 385b, which has mutated from the modal value of 14 to a new value of 15 (Branch 2). This may have happened in the previous generation or many, many generations ago. The 6th member (G21) apparently has 2 mutations* from the modal (on markers 389i and 389ii), and the remaining members all have a mutation at marker 390, with 3 of them having an additional mutation on marker 392 (Branch 5).

12-marker MHT for Lineage II

This diagram gives a pictorial representation of how the different members of Lineage II may be related. The branching pattern derived from the DNA mutations may very well correspond to the branching pattern that one might see in the traditional Family History Tree if we were able to trace it all the way back with documentary evidence to the MRCA (Most Recent Common Ancestor). Thus the Mutation History Tree can give us important clues regarding which individuals are likely to be on the same branch of the overall tree, and who is more closely related to whom. This in turn can help focus further documentary research. 
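The grouping behind a simple tree like this can be sketched in a few lines of Python. This is only an illustration, not the method used to draw the diagram: the modal values for 385b, 389i and 389ii and the results for G22, G42 and G21 are the ones quoted above, but the modal values for 390 and 392 and the members "M1" and "M2" are hypothetical placeholders. Note too that the sketch treats G21's two differences as they are reported; the footnote on DYS389 explains why they are really a single mutation.

```python
from collections import defaultdict

# Modal values for 385b, 389i and 389ii are quoted in the post;
# the values for 390 and 392 are hypothetical placeholders.
MODAL = {"385b": 14, "389i": 14, "389ii": 30, "390": 24, "392": 13}

# Each member is stored as the markers where he differs from the modal.
# G22, G42 and G21 follow the post; "M1" and "M2" are hypothetical members.
MEMBERS = {
    "G22": {},
    "G42": {"385b": 15},
    "G21": {"389i": 13, "389ii": 29},
    "M1": {"390": 25},
    "M2": {"390": 25, "392": 14},
}


def mutation_signature(differences):
    """Describe a member's differences from the modal as 'marker modal>value'."""
    return tuple(
        sorted(
            f"{marker} {MODAL[marker]}>{value}"
            for marker, value in differences.items()
            if value != MODAL[marker]
        )
    )


def group_into_branches(members):
    """Put members with the same set of mutations on the same branch."""
    branches = defaultdict(list)
    for name, differences in members.items():
        branches[mutation_signature(differences)].append(name)
    return dict(branches)


branches = group_into_branches(MEMBERS)
for signature, names in branches.items():
    print(names, "->", signature or "matches the Modal Haplotype")
```

Members who share a signature sit on the same branch, and a signature that contains another one (such as 390 plus 392 versus 390 alone) marks an off-shoot of that branch.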

In the example above, the project members in the last branch on the right (Branch 5) are more closely related to each other than to anyone else in the project - they should get together and try to figure out how they are related. Their MRCA is a lot closer in time than the MRCA they share with (for example) the first group in the tree (on the far left, Branch 1). Similarly, because Branch 5 is in fact an off-shoot of Branch 4, the MRCA for these two groups is going to be closer in time than the MRCA either shares with any of the other groups. Thus, for example, Branch 5 members may share an MRCA born in 1750, Branches 4 & 5 share an MRCA born in 1610, Branches 4 & 3 share an MRCA born in 1390, and the MRCA for the entire group was born in 1125.
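The nesting of MRCAs described above can be written down as a set of nested groups, each with the birth year of its common ancestor. The sketch below uses only the illustrative years from the paragraph above (they are examples, not estimates), and it assumes that Branches 1 and 2 join the rest of the tree only at the MRCA for the entire group.

```python
# Nested groups of branches with the illustrative MRCA birth years from the
# text. Assumption: Branches 1 and 2 only join the others at the overall MRCA.
CLADES = [
    (frozenset({5}), 1750),
    (frozenset({4, 5}), 1610),
    (frozenset({3, 4, 5}), 1390),
    (frozenset({1, 2, 3, 4, 5}), 1125),
]


def mrca_birth_year(branch_a, branch_b):
    """Return the year for the smallest group that contains both branches."""
    wanted = {branch_a, branch_b}
    candidates = [(len(clade), year) for clade, year in CLADES if wanted <= clade]
    return min(candidates)[1]


print(mrca_birth_year(5, 5))  # two members of Branch 5
print(mrca_birth_year(4, 5))
print(mrca_birth_year(3, 4))
print(mrca_birth_year(1, 5))
```

The smaller the group that contains two members, the more recent their shared ancestor, which is exactly how the tree is read.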

But this is the pattern based on the values of just 12 STR markers. What happens when we use 37 markers? Or 67? Or 111? The more markers that are used to generate the Mutation History Tree, the more accurate the picture is likely to become and the more likely we are to approximate the branching pattern in the actual Family History Tree.

However, not everyone in the group has tested to the same level of STR markers. Six people have tested to 111 markers, 3 people have tested to 67 markers, and 2 people have tested to 37 markers. And of the 4 people in "Possible Lineage II", one has tested to 25 markers and 3 have tested to only 12 markers. This obviously creates difficulties in accurately allocating people to the correct branch of the Mutation History Tree and such allocations are likely to change as these members upgrade their results and more data become available.

These are very important points that need to be borne in mind and are worth repeating:
  • Mutation History Trees are only as good as the data available 
  • They are liable to change as more data becomes available
  • The more data used to generate the tree, the more likely it is to approximate reality

Building a more complex Mutation History Tree

To generate a more advanced Mutation History Tree, using up to 111 markers, we can use programmes such as Fluxus to generate a phylogenetic tree / cladogram that produces a "best fit" model of the tree (i.e. with the fewest branches, and hence the fewest mutation events, needed to explain the known mutations). This is often called a "maximum parsimony" approach and the principle is akin to that of Occam's razor, which states that - all else being equal - the simplest hypothesis that explains the data should be the one that is selected. It may not be the correct one (there are usually other possible alternatives with varying degrees of plausibility), but it is the most economical explanation of the data and so the natural working hypothesis.
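One way to make the "best fit" idea concrete is to count the mutation events that a candidate tree requires and prefer the tree that needs the fewest. The sketch below is a toy illustration only: it does not describe how Fluxus works internally, the haplotypes and the members P, Q and R are hypothetical, and it simply counts each marker that differs between a node and its parent as one event.

```python
# All haplotypes below are hypothetical, chosen only to illustrate the scoring.
HAPLOTYPES = {
    "modal": {"390": 24, "392": 13},
    "P": {"390": 25, "392": 13},
    "Q": {"390": 25, "392": 14},
    "R": {"390": 24, "392": 14},
}

# Each candidate tree maps a node to its parent.
CANDIDATE_TREES = {
    "nested": {"P": "modal", "Q": "P", "R": "modal"},  # Q descends from P
    "star": {"P": "modal", "Q": "modal", "R": "modal"},  # all straight off the modal
}


def mutation_events(parent_of, haplotypes):
    """Count markers that differ between each node and its parent."""
    events = 0
    for child, parent in parent_of.items():
        for marker, value in haplotypes[child].items():
            if value != haplotypes[parent][marker]:
                events += 1
    return events


def most_parsimonious(candidates, haplotypes):
    """Return the name of the candidate tree needing the fewest events."""
    return min(candidates, key=lambda name: mutation_events(candidates[name], haplotypes))


for name, tree in CANDIDATE_TREES.items():
    print(name, mutation_events(tree, HAPLOTYPES))
print("preferred:", most_parsimonious(CANDIDATE_TREES, HAPLOTYPES))
```

Even the preferred tree here needs the same change on marker 392 to happen twice, and a third arrangement (Q descending from R) would score just as well, so the "best fit" is not always unique.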

Using the Fluxus software is not easy - the process is multi-step and complicated, it is not user-friendly, and it takes time. Nevertheless, we will look at the output of this software in a separate blog post.

Below is a summary of the various STR mutations that occur within Lineage II (courtesy of Lisa Little). Mutations from the Modal Haplotype are highlighted in yellow and beige. The markers are divided into their various "Panels" by bold dark lines. Mutations among the first 12 markers are in the first 5 rows (Panel 1); the next 3 rows are mutations among markers 13-25 (Panel 2); and the following 4 rows have the mutations among markers 26 to 37 (Panel 3).

From these it is possible to build a more evolved Mutation History Tree than the one generated using only 12 marker data. As all 11 members have tested to 37 markers, let's construct a tree based on 37 markers and compare it to the one generated from the 12-marker data.

Lineage II's STR mutations (from Lisa Little)
37-marker Mutation History Tree
The previous 12-marker Mutation History Tree (for comparison)

In the new 37-marker Tree, additional branches and additional mutations are indicated in pink. The branches have doubled and are now numbered 1 through 10. Branch 3 is an exact match to the Modal Haplotype (but no members sit on this particular branch). There are several important points to note when comparing this new 37-marker Tree to the previous 12-marker Tree:
  • The tree with more data (the 37-marker Tree) has more branches, more definition, more granularity, more fine detail
  • The new branches allow us to revise our estimates for when the MRCA for the various branches may have been born. For example, what was Branch 5 has now split into two branches (9 & 10), and the members in Branch 9 (G39, G51) share an MRCA who was possibly born several generations after the MRCA for Branches 9 & 10.
  • There are several Parallel Mutations in the 37-marker Tree (i.e. identical mutations that occur in several different branches)
    • 464b (16>17) occurs on 2 branches (Branches 4 & 8)
    • 464c (17>16) occurs on 2 branches (Branches 1 & 10)
    • CDYa (39>38) occurs on 5 branches (Branches 2, 4, 6, 7/8, & 10)
    • CDYb (40>39) occurs on 2 branches (Branches 1 & 9)
    • 456 (16>15) occurs on 3 branches (Branches 2, 6 & 7)
  • The apparently large number of Parallel Mutations may be because this is not a "maximum parsimony" tree, and there may be another way of arranging the data that would produce a better "fit" with fewer branches. The Fluxus software could help clarify this. 
  • Alternatively, this may be a very accurate reflection of the real Family History Tree (i.e. how people are actually related ancestrally) and the large number of Parallel Mutations are due to the rapid mutation rate of the STR markers in question. This is entirely plausible as several markers are known to mutate back and forth fairly frequently from one generation to the next (e.g. the CDYa and CDYb markers). FTDNA identifies such rapidly mutating markers by colouring them in dark red on the DNA Results page.
  • Back Mutations may be present in the tree, but are hidden ... and thus the tree may be wrong
    • for illustrative purposes, in Branch 2, theoretically there may have been a mutation in one of the members' ancestral lines such as CDYb (40>39) in (say) 1470, and a few generations after that a Back Mutation CDYb (39>40) in (say) 1610. This would place these members on 2 different branches rather than on the same branch where they currently sit.
    • And even though in reality they share a Common Ancestor in 1470, the Back Mutation masks this and makes them look much more closely related than they actually are, with a Common Ancestor that appears to be sometime in the 1700s perhaps, 300 years later than it actually is!
    • This latter point is a common experience when working with any type of DNA - the Common Ancestor is further back than he/she looks.
Fast-mutating markers (in dark red) among the first 37 markers of FTDNA's Y-DNA test
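Spotting Parallel Mutations is a simple counting exercise once each branch's mutations are written down. The sketch below uses only the Parallel Mutations listed above (the other mutations in the tree are left out), and it keeps "7/8" as its own label because that is how it appears in the list.

```python
from collections import Counter

# Parallel mutations carried by each branch, transcribed from the list above.
# "7/8" is kept as its own label because the post gives it that way.
BRANCH_MUTATIONS = {
    "1": ["464c 17>16", "CDYb 40>39"],
    "2": ["CDYa 39>38", "456 16>15"],
    "4": ["464b 16>17", "CDYa 39>38"],
    "6": ["CDYa 39>38", "456 16>15"],
    "7": ["456 16>15"],
    "7/8": ["CDYa 39>38"],
    "8": ["464b 16>17"],
    "9": ["CDYb 40>39"],
    "10": ["464c 17>16", "CDYa 39>38"],
}


def parallel_mutations(branch_mutations):
    """List mutations seen on more than one branch, most frequent first."""
    counts = Counter(m for muts in branch_mutations.values() for m in muts)
    repeated = [(m, n) for m, n in counts.items() if n > 1]
    return sorted(repeated, key=lambda item: (-item[1], item[0]))


for mutation, n_branches in parallel_mutations(BRANCH_MUTATIONS):
    print(f"{mutation} occurs on {n_branches} branches")
```

A mutation that turns up on many branches is a hint either that the marker mutates quickly or that the tree could be rearranged into a better fit.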

Examples of Mutation History Trees generated using 37-marker results can be found in the Allen Patrilineage Project for their Patrilineage I and Patrilineage II.

When building a Mutation History Tree with larger numbers of markers (67 or 111), software programmes such as Fluxus become indispensable because doing it by hand is much more difficult. If we generated a tree based on 67-marker data it would become even more detailed, and a 111-marker tree would be more detailed still. And, as stated previously, adding more data is likely to change the branching pattern within the tree as a new "best fit" model is identified.

And this essential point will be perfectly illustrated when we add SNP marker data into the mix - it throws our 37-marker "best fit" Mutation History Tree into a completely new configuration.

I don't know about you but I can hardly contain myself.

Maurice Gleeson
13 Aug 2015

* the 2 apparent mutations are in fact only 1. The marker DYS389 is a single STR marker that has four parts: m, n, p, and q. At FamilyTreeDNA they have two tests for DYS389. The first test looks at the first two parts of marker DYS389 (m and n). This is what they call DYS389I. The second test looks at all four parts of DYS389 (m, n, p, and q). This is what they call DYS389II. There are, by scientific convention, two ways to display the result of DYS389II. The way FTDNA display the result is by showing the total for the entire DYS389 marker (m+n+p+q). This is described in their Learning Centre FAQ here. Member G21 has a mutation in 389i - which is indicated by his value of 13 rather than the modal 14. For marker 389ii his value of 29 rather than 30 indicates that in the entire marker (parts m+n+p+q) he still only has a single mutation - that same single mutation which is reported as 389i. In reality, since 389ii includes all four parts of the marker, we should just drop 389i off our table. Any mutation in 389i will be included in 389ii. My thanks to Lisa Little for pointing this out.
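The arithmetic in this footnote can be checked by subtracting 389i from 389ii, which leaves just the p+q part of the marker. The values below are the ones quoted above for the modal and for G21; the function names are made up for this illustration.

```python
def dys389_parts(dys389i, dys389ii):
    """Split FTDNA-style values into the (m+n) part and the (p+q) part."""
    return dys389i, dys389ii - dys389i


def dys389_mutated_parts(member, modal):
    """Count how many of the two parts differ from the modal."""
    return sum(1 for mine, theirs in zip(member, modal) if mine != theirs)


modal_parts = dys389_parts(14, 30)  # modal 389i = 14, 389ii = 30
g21_parts = dys389_parts(13, 29)    # G21   389i = 13, 389ii = 29

print("modal:", modal_parts)
print("G21:  ", g21_parts)
print("parts that differ:", dys389_mutated_parts(g21_parts, modal_parts))
```

Both men have 16 for the p+q part, so the only real difference is the single step in 389i.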

Friday 7 August 2015

Genetic Distance, Genetic Families, & Mutation History Trees

In this blog post, we examine how people are grouped together into Lineages (sometimes called Genetic Families) and how the relationship between people within a Lineage can be mapped out and represented in a Mutation History Tree.

Grouping People Together

The members of each lineage within the project have been grouped together because their genetic signatures (aka haplotypes) are similar, suggesting a common ancestor some time in the past several hundred years. The degree of similarity between any two individuals can be assessed by the Genetic Distance between them, as discussed in a previous blog post and reproduced again below:
Who qualifies as a match to you? Anyone whose marker values are sufficiently similar that they meet the criteria set by FTDNA to be declared "a match". And here are those criteria:
  • a GD of 2 at 25 markers
The ISOGG Wiki has a very nice summary of Genetic Distance and the criteria for matching.
However, Genetic Distance is not the only possible criterion for grouping people into the same "Genetic Family" or "Lineage". Other considerations include traditional genealogical indicators such as having the same surname (the main criterion for surname studies), or having an ancestor who came from the same location as the ancestors of other group members. Additional genetic criteria may include having the same rare marker values, or having the same terminal SNP. These considerations can also serve as indicators that Lineage members have been grouped correctly, i.e. members may be grouped together on the basis of one criterion (e.g. Genetic Distance) and subsequently are found to share a second criterion (e.g. the same rare marker value, ancestors from the same location, or even the same MDKA). You can read more about some of the possible criteria for grouping people together into the same genetic family here.

Why do I match some Project Members but not others?

You will probably notice that not everyone in your Lineage turns up in your list of matches on your Matches Page. The reason for this is that they do not meet the FTDNA criteria for a "match" to you, but they do meet the criteria as a match either to the Modal Haplotype* for the Lineage or to other members of the project. This is one of the key benefits of joining a surname project (such as the Gleason/Gleeson DNA Project) - it can connect you to people within the FTDNA database who do not show up in your list of matches but to whom you are still likely to be related.


For example, in Lineage II, my Dad (G21; N74958; yellow dot in the diagram above) is a match (at 37 markers) to only 5 of the other 10 members: G22, G57, G64, G55, and G66 (green dots). He does not match the members with red dots. In other words, his genetic distance to the green dot matches is 4/37 or less, whereas his GD to the red dot matches is 5/37 or greater (in fact, reading down, his GD to each of the red dot members is 5/37, 6/37, 5/37, 6/37, & 6/37 respectively).
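As a rough sketch of the idea, Genetic Distance can be thought of as the number of steps by which two sets of marker values differ, added up across the markers. The code below assumes a simple step-by-step count on every marker (FTDNA's own calculation may treat some markers differently) and uses the 4/37 threshold described above. The two short haplotypes are hypothetical; the list of distances is the one quoted for the red dot members.

```python
def genetic_distance(haplotype_a, haplotype_b):
    """Sum the step differences over the markers both people have tested.

    Assumption: every marker is counted step by step; FTDNA's own
    calculation may handle some markers differently.
    """
    shared = haplotype_a.keys() & haplotype_b.keys()
    return sum(abs(haplotype_a[m] - haplotype_b[m]) for m in shared)


def is_match_at_37(gd, threshold=4):
    """A GD of 4/37 or less counts as a match, as described in the text."""
    return gd <= threshold


# Hypothetical values for three markers, purely for illustration.
person_a = {"385b": 14, "390": 24, "392": 13}
person_b = {"385b": 15, "390": 24, "392": 15}
print(genetic_distance(person_a, person_b))

# The GDs quoted in the text for the red dot members.
red_dot_gds = [5, 6, 5, 6, 6]
print([is_match_at_37(gd) for gd in red_dot_gds])
```

A distance of 5/37 misses the threshold by a single step, which is why such close relatives can be absent from a match list.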

But let's look at the member closest to the Modal Haplotype* for Lineage II. This is the person with the fewest mutations compared to the Modal Haplotype, or (in other words) the smallest Genetic Distance from the Modal Haplotype. There are no exact matches (i.e. 0/37) to the Modal Haplotype in Lineage II, but there are two members who are the closest (GD = 1/37) and these are members G57 and G64 (kits 60393 & 365763). They are uncle/nephew, both carry the surname Little, and they are believed to have a genetic Gleason ancestor somewhere in the last several hundred years. As Project Admin, I have access to their pages and I can look at their matches. This tells me that each of them matches 7 of the 10 other members in Lineage II. I can also see that their Genetic Distance to the remaining 3 members of Lineage II is only 5/37, just outside the FTDNA threshold for declaring them a match. Looking at these 3 latter members, one matches two other people within the project and the other two match 5 project members.

So everyone within the project is a match to at least one other person within the project, but the distance between any two project members can vary considerably - some are very close, others are very distant.

And this creates a wonderful possibility - by analysing the degree of Genetic Distance between members we can construct a diagram of the branching pattern within the group which shows how closely or how distantly each member may be related to each other member. Such a diagram is variously called a phylogenetic tree, a phylogram, a cladogram, or a Mutation History Tree. The same process is used to generate evolutionary diagrams showing where all living creatures sit on the Tree of Life.

Interactive version of the Tree of life 

Mutation History Trees

This concept has important potential implications for genealogy. Theoretically, it should be possible to take the known genealogies of each member within a Lineage and hang them onto the appropriate branch within the Mutation History Tree. In this way we will have a combined tree that starts off in modern times with named individuals, goes back along each ancestral line until the named individuals run out (at each branch's Brick Wall or MDKA, Most Distant Known Ancestor), and then the tree continues back in time using genetic marker mutations instead of people, culminating at the Modal Haplotype for that particular Lineage.

A combined Family History Tree & Mutation History Tree

In the combined tree above, named individuals appear in the blue boxes, starting with living individuals (born about 1960) and going back in time to the MDKA for each branch. Most branches have an MDKA born about 1810 to 1840 (which is typical for Irish research), and some have a Brick Wall at a later point (Branches 5, 6, & 7 have an MDKA about 1870). One lucky branch can trace their line back to 1690 which is highly unusual but something we all hope for. You never know when a new member will join your particular Lineage who happens to be in possession of the Family Bible! And because you are genetically related to him, his Family Bible pertains to your family too. In this way, all members within a particular Genetic Family can "piggyback" onto the pedigree of the member with the longest pedigree.

Where the named individuals end, DNA marker mutations take over. STR markers are in yellow, SNP markers are in pink, but this is just a crude representation of what the tree might look like. In reality it would be much more complicated than this.

The person sitting at the intersecting point of all the branches is the MRCA (Most Recent Common Ancestor) and he is likely to have the Modal Haplotype for that particular Lineage. And as we go back in time, we would identify NPEs (Non-Paternity Events), and we would identify other surnames to which we are directly related but with whom our common ancestor is prior to the time of surnames. And the tree would continue even further back than that, based on the SNP discoveries that are continuously being made in the ongoing Haplogroup Projects.

Eventually, by superimposing known genealogies on top of the Mutation History Tree, we could build a comprehensive evolutionary tree of all Mankind, that travels back in time to "Genetic Adam" and travels forward to culminate with each living person today.

Maurice Gleeson
7 August 2015

* Modal Haplotype - this is the haplotype (i.e. your genetic signature, your sequence of STR marker values) that is derived from the most frequent value for each of the STR markers in turn among members of the same Lineage. It is likely that the Modal Haplotype is identical or almost identical to the haplotype that the Common Ancestor of the Lineage would have had. In other words, he would have passed on the identical marker values to most descendants and only some of them would have developed the occasional mutation along the way.