Thursday 13 August 2015

Building a Mutation History Tree with STR data

In the previous blog post we explored Genetic Distance and how it can be used to group people together into the same Genetic Family (or Lineage). We also saw how a lot of your relevant non-matches only show up in surname projects because they are outside the FTDNA matching threshold. Lastly, we introduced the concept of the Mutation History Tree (MHT) and how traditional genealogies could be hung on its branches to give a combined tree that uses mutations when named individuals run out.

In this blog post, we will look at how to build a Mutation History Tree using STR data.

Building a simple Mutation History Tree

A huge thank you is due to project member Lisa Little who generated a lot of the data discussed below, and to Nigel McCarthy (Admin of the Munster Irish Project) who has offered invaluable advice.

It is clear from the Results Page of the Gleason DNA Project that each Lineage has a distinctive coloured pattern by virtue of the values for its STR markers. This distinctive coloured pattern reflects the unique marker values for each Lineage, and distinguishes one Lineage from another. The Modal Haplotype for each Lineage represents the "signature tune" for that particular Lineage.

The distinctive coloured patterns of Lineages I, II & III
(from the Results Page on the World Families.Net site)

Furthermore, within each Lineage, there are subtle differences among members. In other words, most people are not an exact match to the Modal Haplotype - everyone in the group sings the "signature tune" just ever so slightly differently. They're all in the same choir, but some hit a bum note! These differences between project members in their STR marker values allow us to construct a diagram that suggests how the various project members might be related to each other.

So, for example, if we take the first 12 markers in Lineage II, we could construct a Mutation History Tree (MHT) that separates the members of the group into different branches. Reading from left to right across the tree in the diagram below, the first 4 members (G22, G05, G57, and G64) all match the Modal Haplotype at 12 markers (Branch 1). The fifth member (G42) has a single mutation (in red) at marker 385b, which has mutated from the modal value of 14 to a new value of 15 (Branch 2). This may have happened in the previous generation or many, many generations ago. The 6th member (G21) apparently has 2 mutations* from the modal (on markers 389i and 389ii), and the remaining members all have a mutation at marker 390, with 3 of them having an additional mutation on marker 392 (Branch 5).

12-marker MHT for Lineage II

This diagram gives a pictorial representation of how the different members of Lineage II may be related. The branching pattern derived from the DNA mutations may very well correspond to the branching pattern that one might see in the traditional Family History Tree if we were able to trace it all the way back with documentary evidence to the MRCA (Most Recent Common Ancestor). Thus the Mutation History Tree can give us important clues regarding which individuals are likely to be on the same branch of the overall tree, and who is more closely related to whom. This in turn can help focus further documentary research. 
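The sorting step behind a simple tree like this can be sketched in a few lines of Python. This is only an illustration, not the method used to draw the diagram: the 385b and 389ii values mirror the ones quoted above, but the 390 and 392 values and the member "Gxx" are hypothetical placeholders, and the function names are my own. Marker 389i is left out because, as the footnote explains, any change in it is already counted in 389ii.

```python
from collections import Counter, defaultdict

# Illustrative excerpt of four of the first 12 markers.
# 385b and 389ii values follow the post; the 390 and 392 values and the
# member "Gxx" are hypothetical placeholders, not real project data.
MODAL_LIKE = {"385b": 14, "389ii": 30, "390": 24, "392": 13}
haplotypes = {
    "G22": dict(MODAL_LIKE),
    "G05": dict(MODAL_LIKE),
    "G57": dict(MODAL_LIKE),
    "G64": dict(MODAL_LIKE),
    "G42": {**MODAL_LIKE, "385b": 15},
    "G21": {**MODAL_LIKE, "389ii": 29},
    "Gxx": {**MODAL_LIKE, "390": 25},
    "G39": {**MODAL_LIKE, "390": 25, "392": 14},
    "G51": {**MODAL_LIKE, "390": 25, "392": 14},
}


def modal_haplotype(haps):
    """Return the most common value at each marker across all members."""
    markers = next(iter(haps.values())).keys()
    return {
        m: Counter(h[m] for h in haps.values()).most_common(1)[0][0]
        for m in markers
    }


def mutations_from_modal(hap, modal):
    """List the markers where one member differs from the modal value."""
    return tuple(
        sorted(f"{m} ({modal[m]}>{v})" for m, v in hap.items() if v != modal[m])
    )


def group_by_mutations(haps):
    """Put members with an identical set of mutations on the same branch."""
    modal = modal_haplotype(haps)
    groups = defaultdict(list)
    for member, hap in haps.items():
        groups[mutations_from_modal(hap, modal)].append(member)
    return dict(groups)


groups = group_by_mutations(haplotypes)
for mutations, members in groups.items():
    print(mutations or "matches modal", members)
```

Members who share the same set of mutations land in the same group, which is the raw material for the branches. Deciding how the groups nest inside one another (for example, that the 390 + 392 group is an off-shoot of the 390 group) is still a judgement made when drawing the tree.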

In the example above, the project members in the last branch on the right (Branch 5) are more closely related to each other than to anyone else in the project - they should get together and try to figure out how they are related. Their MRCA is a lot closer in time than the MRCA they share with (for example) the first group in the tree (on the far left, Branch 1). Similarly, because Branch 5 is in fact an off-shoot of Branch 4, the MRCA for these two groups is going to be closer in time than the MRCA either shares with any of the other groups. Thus, for example, Branch 5 members may share an MRCA born in 1750, Branches 4 & 5 share an MRCA born in 1610, Branches 4 & 3 share an MRCA born in 1390, and the MRCA for the entire group was born in 1125.

But this is the pattern based on the values of just 12 STR markers. What happens when we use 37 markers? or 67? or 111? The more markers that are used to generate the Mutation History Tree, the more accurate the picture is likely to become and the more likely we are to approximate the branching pattern in the actual Family History Tree.

However, not everyone in the group has tested to the same level of STR markers. Six people have tested to 111 markers, 3 people have tested to 67 markers, and 2 people have tested to 37 markers. And of the 4 people in "Possible Lineage II", one has tested to 25 markers and 3 have tested to only 12 markers. This obviously creates difficulties in accurately allocating people to the correct branch of the Mutation History Tree and such allocations are likely to change as these members upgrade their results and more data become available.

These are very important points that need to be borne in mind and are worth repeating:
  • Mutation History Trees are only as good as the data available
  • They are liable to change as more data become available
  • The more data used to generate the tree, the more likely it is to approximate reality
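To make the problem of different testing levels concrete, here is a minimal sketch of comparing two members only on the markers both have tested. The marker values are hypothetical placeholders, the function name is my own, and counting one step per unit of difference is a simplification rather than the rule any testing company applies.

```python
def compare_on_shared_markers(hap_a, hap_b):
    """Compare two members using only the markers both have tested.

    Returns (distance, number_of_markers_compared). Each unit of difference
    counts as one step, which is a simplifying assumption.
    """
    shared = sorted(set(hap_a) & set(hap_b))
    distance = sum(abs(hap_a[m] - hap_b[m]) for m in shared)
    return distance, len(shared)


# Hypothetical values: one member tested further (includes CDYa), one did not.
tested_further = {"393": 13, "390": 24, "CDYa": 39}
tested_less = {"393": 13, "390": 25}

print(compare_on_shared_markers(tested_further, tested_less))
```

The second number matters as much as the first: a distance measured over a handful of markers says far less about where someone belongs on the tree than the same distance measured over 37 or 111.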

Building a more complex Mutation History Tree

To generate a more advanced Mutation History Tree, using up to 111 markers, we can use programmes such as Fluxus to generate a phylogenetic tree / cladogram that produces a "best fit" model of the tree (i.e. with the fewest number of branches possible for the known mutations). This is often called a "maximum parsimony" approach and the principle is akin to that of Occam's razor which simply states that - all else being equal - the simplest hypothesis that explains the data should be the one that is selected. It may not be the correct one (there are usually other possible alternatives with varying degrees of plausibility), but it has the highest probability of being the correct one.
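The idea of preferring the arrangement that needs the fewest mutations can be illustrated with a small sketch. It scores a candidate tree with Fitch's counting method, a textbook way of finding the minimum number of changes on a fixed tree. This is an assumption-laden toy, not a description of how the Fluxus software works internally: members A to D and their values are hypothetical, every change counts as one event regardless of step size, and the trees are written as nested pairs.

```python
def fitch(tree, leaf_values):
    """Return (possible_ancestral_values, changes_needed) for one marker.

    `tree` is either a member name (a leaf) or a pair (left, right).
    """
    if isinstance(tree, str):
        return {leaf_values[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, leaf_values)
    right_set, right_changes = fitch(right, leaf_values)
    common = left_set & right_set
    if common:
        # Both sides can share a value, so no extra mutation is needed here.
        return common, left_changes + right_changes
    # No shared value: at least one mutation separates the two sides.
    return left_set | right_set, left_changes + right_changes + 1


def total_changes(tree, haplotypes):
    """Sum the minimum number of changes over every marker."""
    markers = next(iter(haplotypes.values())).keys()
    return sum(
        fitch(tree, {name: hap[m] for name, hap in haplotypes.items()})[1]
        for m in markers
    )


# Hypothetical members and values, for illustration only.
haplotypes = {
    "A": {"385b": 14, "390": 24},
    "B": {"385b": 14, "390": 24},
    "C": {"385b": 15, "390": 25},
    "D": {"385b": 15, "390": 25},
}

tree_1 = (("A", "B"), ("C", "D"))  # groups the look-alikes together
tree_2 = (("A", "C"), ("B", "D"))  # splits the look-alikes apart

print(total_changes(tree_1, haplotypes))  # 2
print(total_changes(tree_2, haplotypes))  # 4
```

The first arrangement explains the data with 2 mutations and the second needs 4, so a parsimony approach would prefer the first. With 11 members and 37 markers the number of possible arrangements becomes far too large to check by hand.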

Using the Fluxus software is not easy - the process is multi-step and complicated, it is not user-friendly, and it takes time. Nevertheless, we will look at the output of this software in a separate blog post.

Below is a summary of the various STR mutations that occur within Lineage II (courtesy of Lisa Little). Mutations from the Modal Haplotype are highlighted in yellow and beige. The markers are divided into their various "Panels" by bold dark lines. Mutations among the first 12 markers are in the first 5 rows (Panel 1); the next 3 rows are mutations among markers 13-25 (Panel 2); and the following 4 rows have the mutations among markers 26 to 37 (Panel 3).

From these it is possible to build a more evolved Mutation History Tree than the one generated using only 12 marker data. As all 11 members have tested to 37 markers, let's construct a tree based on 37 markers and compare it to the one generated from the 12-marker data.

Lineage II's STR mutations (from Lisa Little)
37-marker Mutation History Tree
The previous 12-marker Mutation History Tree (for comparison)

In the new 37-marker Tree, additional branches and additional mutations are indicated in pink. The branches have doubled and are now numbered 1 through 10. Branch 3 is an exact match to the Modal Haplotype (but no members sit on this particular branch). There are several important points to note when comparing this new 37-marker Tree to the previous 12-marker Tree:
  • The tree with more data (the 37-marker Tree) has more branches, more definition, more granularity, more fine detail
  • The new branches allow us to revise our estimates for when the MRCA for the various branches may have been born. For example, what was Branch 5 has now split into two branches (9 & 10), and the members in Branch 9 (G39, G51) share an MRCA who was possibly born several generations after the MRCA for Branches 9 & 10.
  • There are several Parallel Mutations in the 37-marker Tree (i.e. identical mutations that occur in several different branches)
    • 464b (16>17) occurs on 2 branches (Branches 4 & 8)
    • 464c (17>16) occurs on 2 branches (Branches 1 & 10)
    • CDYa (39>38) occurs on 5 branches (Branches 2, 4, 6, 7/8, & 10)
    • CDYb (40>39) occurs on 2 branches (Branches 1 & 9)
    • 456 (16>15) occurs on 3 branches (Branches 2, 6 & 7)
  • The apparently large number of Parallel Mutations may be because this is not a "maximum parsimony" tree, and there may be another way of arranging the data that would produce a better "fit" with fewer branches. The Fluxus software could help clarify this. 
  • Alternatively, this may be a very accurate reflection of the real Family History Tree (i.e. how people are actually related ancestrally) and the large number of Parallel Mutations are due to the rapid mutation rate of the STR markers in question. This is entirely plausible as several markers are known to mutate back and forth fairly frequently from one generation to the next (e.g. the CDYa and CDYb markers). FTDNA identifies such rapidly mutating markers by colouring them in dark red on the DNA Results page.
  • Back Mutations may be present in the tree, but are hidden  ... and thus the tree may be wrong
    • for illustrative purposes, in Branch 2, theoretically there may have been a mutation in one of the members' ancestral lines such as CDYb (40>39) in (say) 1470, and a few generations after that a Back Mutation CDYb (39>40) in (say) 1610. This would place these members on 2 different branches rather than on the same branch where they currently sit.
    • And even though in reality they share a Common Ancestor in 1470, the Back Mutation masks this and makes them look much more closely related than they actually are, with a Common Ancestor that appears to be sometime in the 1700s perhaps, 300 years later than it actually is!
    • This latter point is a common experience when working with any type of DNA - the Common Ancestor is further back than he/she looks.
Fast-mutating markers (in dark red) among the first 37 markers of FTDNA's Y-DNA test
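Spotting Parallel Mutations is a simple tallying exercise once the mutations on each branch are written down. The sketch below records only the Parallel Mutations listed above (the other mutations on the 37-marker Tree are omitted), treats the shared "7/8" stem as its own label, and uses a function name of my own choosing.

```python
from collections import defaultdict

# Branch -> mutations on that branch, limited to the Parallel Mutations
# listed in the post. "7/8" is kept as its own label, as written there.
branch_mutations = {
    "1": ["464c (17>16)", "CDYb (40>39)"],
    "2": ["CDYa (39>38)", "456 (16>15)"],
    "4": ["464b (16>17)", "CDYa (39>38)"],
    "6": ["CDYa (39>38)", "456 (16>15)"],
    "7": ["456 (16>15)"],
    "7/8": ["CDYa (39>38)"],
    "8": ["464b (16>17)"],
    "9": ["CDYb (40>39)"],
    "10": ["464c (17>16)", "CDYa (39>38)"],
}


def parallel_mutations(tree):
    """Return each mutation that appears on more than one branch."""
    seen_on = defaultdict(list)
    for branch, mutations in tree.items():
        for mutation in mutations:
            seen_on[mutation].append(branch)
    return {m: b for m, b in seen_on.items() if len(b) > 1}


for mutation, branches in parallel_mutations(branch_mutations).items():
    print(f"{mutation}: {len(branches)} branches ({', '.join(branches)})")
```

A tally like this can flag the markers that deserve suspicion, but it cannot reveal a Back Mutation: a marker that went 40>39 and later 39>40 simply looks unchanged.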

Examples of Mutation History Trees generated using 37-marker results can be found in the Allen Patrilineage Project for their Patrilineage I and Patrilineage II.

When building a Mutation History Tree with larger numbers of markers (67 or 111), software programmes such as Fluxus become indispensable because doing it by hand is much more difficult. And as stated previously, the tree is likely to change as more data is used to generate the branching pattern. If we generated a tree based on 67 marker data it would become even more detailed, and more so too with a 111-marker based tree. Furthermore, adding more data is likely to change the branching pattern within the tree as a new "best fit" model is identified. 

And this essential point will be perfectly illustrated when we add SNP marker data into the mix - it throws our 37-marker "best fit" Mutation History Tree into a completely new configuration.

I don't know about you but I can hardly contain myself.

Maurice Gleeson
13 Aug 2015

* the 2 apparent mutations are in fact only 1. The marker DYS389 is a single STR marker that has four parts: m, n, p, and q. FamilyTreeDNA has two tests for DYS389. The first test looks at the first two parts of marker DYS389 (m and n). This is what they call DYS389I. The second test looks at all four parts of DYS389 (m, n, p, and q). This is what they call DYS389II. There are, by scientific convention, two ways to display the result of DYS389II. The way FTDNA displays the result is by showing the total for the entire DYS389 marker (m+n+p+q). This is described in their Learning Centre FAQ here. Member G21 has a mutation in 389i, which is indicated by his value of 13 rather than the modal 14. For marker 389ii, his value of 29 rather than 30 indicates that in the entire marker (parts m+n+p+q) he still only has a single mutation: that same single mutation which is reported as 389i. In reality, since 389ii includes all four parts of the marker, we should just drop 389i from our table. Any mutation in 389i will be included in 389ii. My thanks to Lisa Little for pointing this out.
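The arithmetic in this footnote can be checked in a couple of lines. This is just a worked example using the values quoted above (modal 14 and 30, G21's 13 and 29); the function name is my own.

```python
def dys389_second_part(dys389i, dys389ii):
    """Repeats in parts p+q only: the whole marker minus parts m+n."""
    return dys389ii - dys389i


modal = {"389i": 14, "389ii": 30}
g21 = {"389i": 13, "389ii": 29}

modal_pq = dys389_second_part(modal["389i"], modal["389ii"])
g21_pq = dys389_second_part(g21["389i"], g21["389ii"])

print(modal_pq, g21_pq)  # 16 16
```

Both come out at 16, so parts p and q are unchanged in G21 and the whole difference sits in parts m+n: one mutation, reported twice.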


  1. Wonderful! We've been working on a similar analysis with our Towne DNA project and I've been trying to articulate and illustrate for years what you've done so well here. We have the benefit of 30+ men with solid paper trails back to their shared immigrant ancestor b 1599. Parallel and back-mutations have been a real disappointment, though. We are now working with SNPs as well, with one Big-Y in the group. Thank you for this great article!

    1. Thanks Margaret. It's an exciting time for Mutation History Trees now that SNP data is beginning to come in. The next post will show how SNP data paints an entirely new picture and raises some important issues about our previous assumptions! :-)

    2. A fascinating article Maurice. Thank you. I will take some time to read and absorb this information. Thanks again Damian

  2. An excellent article Maurice. I shall be fascinated to see how the picture changes with the inclusion of SNP data!

    1. Thanks Debbie. I suspect the hand-drawn Mutation History Tree (MHT) is likely to be biased by the fact that we tend to start with the Panel 1 markers (1-12), then proceed to the Panel 2 set (13-25), etc. We might get a different tree if we started at the other end i.e. Panel 5 first (markers 68-111), then Panel 4 (38-67), etc.

      Also, the rapidly-mutating markers may easily skew the results if added in to the MHT too soon. I think a less risky approach might be to leave them till last, and start with the slow mutating markers. This will be particularly important for larger genetic families, with a common ancestor pre-surnames.