Friday, 7 August 2015

Genetic Distance, Genetic Families, & Mutation History Trees

In this blog post, we examine how people are grouped together into Lineages (sometimes called Genetic Families) and how the relationship between people within a Lineage can be mapped out and represented in a Mutation History Tree.

Grouping People Together

The members of each lineage within the project have been grouped together because their genetic signatures (aka haplotypes) are similar, suggesting a common ancestor some time in the past several hundred years. The degree of similarity between any two individuals can be assessed by the Genetic Distance between them, as discussed in a previous blog post and reproduced below:
Who qualifies as a match to you? Anyone whose marker values are sufficiently similar that they meet the criteria set by FTDNA to be declared "a match". And here are those criteria:
  • a GD of 2 at 25 markers
The ISOGG Wiki has a very nice summary of Genetic Distance and the criteria for matching.
However, Genetic Distance is not the only possible criterion for grouping people into the same "Genetic Family" or "Lineage". Other considerations include traditional genealogical indicators such as having the same surname (the main criterion for surname studies), or having an ancestor who came from the same location as the ancestors of other group members. Additional genetic criteria may include having the same rare marker values, or having the same terminal SNP. These considerations can also serve as indicators that Lineage members have been grouped correctly, i.e. members may be grouped together on the basis of one criterion (e.g. Genetic Distance) and are subsequently found to share a second criterion (e.g. the same rare marker value, ancestors from the same location, or even the same MDKA). You can read more about some of the possible criteria for grouping people together into the same genetic family here.

Why do I match some Project Members but not others?

You will probably notice that not everyone in your Lineage turns up in your list of matches on your Matches Page. The reason for this is that they do not meet the FTDNA criteria for a "match" to you, but they do meet the criteria as a match either to the Modal Haplotype* for the Lineage or to other members of the project. This is one of the key benefits of joining a surname project (such as the Gleason/Gleeson DNA Project) - it can connect you to people within the FTDNA database who do not show up in your list of matches but to whom you are still likely to be related.


For example, in Lineage II, my Dad (G21; N74958; yellow dot in the diagram above) is a match (at 37 markers) to only 5 of the other 10 members: G22, G57, G64, G55, and G66 (green dots). He does not match the members with red dots. In other words, his genetic distance to the green dot matches is 4/37 or less, whereas his GD to the red dot matches is 5/37 or greater (in fact, reading down, his GD to each of the red dot members is 5/37, 6/37, 5/37, 6/37, & 6/37 respectively).
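
To make the arithmetic concrete, here is a minimal Python sketch of the match rule as it is used in this example. It assumes that Genetic Distance is simply the sum of the step differences at each marker (the real FTDNA calculation may treat some markers differently), and the threshold table holds only the two values mentioned in this post. The function names and the haplotypes are made up for illustration.

```python
# Hypothetical threshold table holding only the two values stated in this post:
# a GD of 2 at 25 markers (criteria list) and 4 at 37 markers (Lineage II example).
MATCH_THRESHOLDS = {25: 2, 37: 4}


def genetic_distance(haplotype_a, haplotype_b):
    """Sum of absolute differences across markers (simple stepwise counting)."""
    if len(haplotype_a) != len(haplotype_b):
        raise ValueError("haplotypes must cover the same markers")
    return sum(abs(a - b) for a, b in zip(haplotype_a, haplotype_b))


def is_match(haplotype_a, haplotype_b, thresholds=MATCH_THRESHOLDS):
    """True if the GD is within the threshold for this number of markers."""
    n_markers = len(haplotype_a)
    if n_markers not in thresholds:
        raise KeyError(f"no threshold listed for {n_markers} markers")
    return genetic_distance(haplotype_a, haplotype_b) <= thresholds[n_markers]


# Made-up 37-marker haplotypes, for illustration only.
base = [13] * 37
four_steps = [14, 14, 12, 12] + [13] * 33  # differs by one step at four markers
five_steps = [15, 14, 12, 12] + [13] * 33  # two steps at one marker, one at three

print(genetic_distance(base, four_steps), is_match(base, four_steps))  # 4 True
print(genetic_distance(base, five_steps), is_match(base, five_steps))  # 5 False
```

A GD of 4/37 falls inside the threshold and 5/37 falls just outside it, which is the difference between the green dots and the red dots.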

But let's look at the member closest to the Modal Haplotype* for Lineage II. This is the person with the fewest mutations compared to the Modal Haplotype, or (in other words) the smallest Genetic Distance from the Modal Haplotype. There are no exact matches (i.e. 0/37) to the Modal Haplotype in Lineage II, but there are two members who are the closest (GD = 1/37) and these are members G57 and G64 (kits 60393 & 365763). They are uncle/nephew, both carry the surname Little, and they are believed to have a genetic Gleason ancestor somewhere in the last several hundred years. As Project Admin, I have access to their pages and I can look at their matches. This tells me that each of them matches 7 of the 10 other members in Lineage II. I can also see that their Genetic Distance to the remaining 3 members of Lineage II is only 5/37, just outside the FTDNA threshold for declaring them a match. Looking at these 3 latter members, one matches two other people within the project and the other two match 5 project members.

So everyone within the project is a match to at least one other person within the project, but the distance between any two project members can vary considerably - some are very close, others are very distant.

And this creates a wonderful possibility - by analysing the degree of Genetic Distance between members we can construct a diagram of the branching pattern within the group which shows how closely or how distantly each member may be related to each other member. Such a diagram is variously called a phylogenetic tree, a phylogram, a cladogram, or a Mutation History Tree. The same process is used to generate evolutionary diagrams showing where all living creatures sit on the Tree of Life.
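
As a rough illustration of the idea, the sketch below builds a branching pattern from pairwise Genetic Distances using average-linkage clustering (often called UPGMA). This is only one of several tree-building methods and is not necessarily the one behind any tree shown in this post. The member labels, the distances, and the function names are all hypothetical.

```python
from itertools import combinations


def build_tree(names, gd):
    """Average-linkage (UPGMA-style) clustering on genetic distances.

    names: list of member labels.
    gd:    function (name_a, name_b) -> genetic distance.
    Returns a nested tuple; members joined deepest are the closest.
    """
    # Each cluster is stored as (subtree, list_of_member_names).
    clusters = [(name, [name]) for name in names]

    def average_distance(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(gd(a, b) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the two clusters with the smallest average distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        # Remove the two old clusters (higher index first), then add the merge.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
    return clusters[0][0]


# Made-up pairwise distances between four hypothetical members.
table = {
    ("A", "B"): 1, ("C", "D"): 2,
    ("A", "C"): 5, ("A", "D"): 6,
    ("B", "C"): 5, ("B", "D"): 6,
}


def lookup(a, b):
    """Distance lookup that works whichever way round the pair is given."""
    return table.get((a, b), table.get((b, a)))


tree = build_tree(["A", "B", "C", "D"], lookup)
print(tree)  # (('A', 'B'), ('C', 'D'))
```

Here A and B pair up first because they are closest, then C and D, and the two pairs join last. A real tree would need many more members and a more careful treatment of back mutations and parallel mutations.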

Interactive version of the Tree of Life

Mutation History Trees

This concept has important potential implications for genealogy. Theoretically, it should be possible to take the known genealogies of each member within a Lineage and hang them onto the appropriate branch within the Mutation History Tree. In this way we will have a combined tree that starts off in modern times with named individuals, goes back along each ancestral line until the named individuals run out (at each branch's Brick Wall or MDKA, Most Distant Known Ancestor), and then the tree continues back in time using genetic marker mutations instead of people, culminating at the Modal Haplotype for that particular Lineage.

A combined Family History Tree & Mutation History Tree

In the combined tree above, named individuals appear in the blue boxes, starting with living individuals (born about 1960) and going back in time to the MDKA for each branch. Most branches have an MDKA born about 1810 to 1840 (which is typical for Irish research), and some have a Brick Wall at a later point (Branches 5, 6, & 7 have an MDKA about 1870). One lucky branch can trace their line back to 1690 which is highly unusual but something we all hope for. You never know when a new member will join your particular Lineage who happens to be in possession of the Family Bible! And because you are genetically related to him, his Family Bible pertains to your family too. In this way, all members within a particular Genetic Family can "piggyback" onto the pedigree of the member with the longest pedigree.

Where the named individuals end, DNA marker mutations take over. STR markers are in yellow, SNP markers are in pink, but this is just a crude representation of what the tree might look like. In reality it would be much more complicated than this.

The person sitting at the intersecting point of all the branches is the MRCA (Most Recent Common Ancestor) and he is likely to have the Modal Haplotype for that particular Lineage. And as we go back in time, we would identify NPEs (Non-Paternity Events), and we would identify other surnames to which we are directly related but with whom our common ancestor is prior to the time of surnames. And the tree would continue even further back than that, based on the SNP discoveries that are continuously being made in the ongoing Haplogroup Projects.

Eventually, by superimposing known genealogies on top of the Mutation History Tree, we could build a comprehensive evolutionary tree of all Mankind, that travels back in time to "Genetic Adam" and travels forward to culminate with each living person today.

Maurice Gleeson
7 August 2015

* Modal Haplotype - this is the haplotype (i.e. your genetic signature, your sequence of STR marker values) that is derived from the most frequent value for each of the STR markers in turn among members of the same Lineage. It is likely that the Modal Haplotype is identical or almost identical to the haplotype that the Common Ancestor of the Lineage would have had. In other words, he would have passed on the identical marker values to most descendants and only some of them would have developed the occasional mutation along the way.
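
The definition above can be sketched in a few lines of Python. The same sketch also includes the alternative raised in the comments below, which picks the value needing the fewest total mutation steps rather than the most frequent one. The function names and marker values are hypothetical, and ties are broken by taking the smaller value, which is an arbitrary assumption.

```python
from collections import Counter


def modal_value(values):
    """Most frequent value; ties go to the smallest value (an assumption)."""
    counts = Counter(values)
    best = max(counts.values())
    return min(v for v, c in counts.items() if c == best)


def min_mutation_value(values):
    """Observed value with the smallest total stepwise distance to all members."""
    candidates = sorted(set(values))
    return min(candidates, key=lambda c: sum(abs(v - c) for v in values))


def modal_haplotype(haplotypes):
    """Apply modal_value to each marker position in turn."""
    return [modal_value(column) for column in zip(*haplotypes)]


# Three made-up 3-marker haplotypes from one hypothetical Lineage.
members = [[13, 24, 14], [13, 25, 14], [13, 24, 15]]
print(modal_haplotype(members))  # [13, 24, 14]

# The two frequency patterns from the first comment, using made-up marker values.
example_1 = [12] * 7 + [13] * 10 + [14] * 8
example_2 = [12] * 7 + [13] * 8 + [14] * 9 + [15] * 10
print(modal_value(example_1), min_mutation_value(example_1))  # 13 13
print(modal_value(example_2), min_mutation_value(example_2))  # 15 14
```

In the second pattern the mode sits at the edge of the distribution (the value with frequency 10), whereas the value with frequency 9 needs fewer mutation steps in total, so the two methods disagree.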


  1. I suggest a method other than the modal haplotype to calculate the genetic signature of a Lineage group. For each marker, instead of simply picking the mode, pick the value that minimizes the number of mutations. Example 1: For a given marker, the frequency of consecutive values is 7, 10, 8. In this case, both methods result in selecting the modal value corresponding with the frequency of 10. Example 2: For a given marker, the frequency of consecutive values is 7, 8, 9, 10. In this case, the new suggested method would select the value corresponding with the frequency of 9, NOT 10. You might loosely call this selection the "mean-weighted mode." Granted, it is often the same as the modal haplotype and is somewhat harder to calculate, but perhaps it is a more accurate approximation of the genetic signature. I'm curious what your thoughts are on this?

    Thank you,
    Allan Westreich

    1. Thanks for the comment, Allan. The concept of Ancestral Modal Haplotype is fraught. The modal haplotype for any group is a fictional haplotype made up of the modal value for each of the markers shared by the members in a particular group. Traditionally, this has been taken to approximate the haplotype of the common ancestor of a particular group of people. This might be the case for groups where the common ancestor is within the last several hundred years (e.g. colonial ancestry in the US dating from the 1600s) but for more distant common ancestry, it gets more and more unlikely that the modal haplotype is a true representation of the haplotype of the group's common ancestor.

      The technique you describe above is a very reasonable approach to overcome some of the problems inherent in trying to identify the most likely haplotype of the common ancestor but it may not be sufficient to get much "closer to the truth".

      Another problem is the possible over-representation of certain downstream descendants of the common ancestor and this is an issue that has been addressed by John Robb in his excellent article here ... John proposes a new methodology and the term Root Modal Haplotype to account for the possible biases introduced by over-representation of certain family branches.

  2. Thanks for your reply, Maurice. I do agree that calculating a representative haplotype for the Most Recent Common Ancestor is fraught with difficulties. If I understand John Robb's work correctly, I believe what I am proposing is quite similar ... in his terms, calculating a synthetic Root Prototype Haplotype (RPH) which minimizes the genetic distance to all of the group members. As you noted, he wisely added the method of "pruning the tree" to remove bias.

    Thanks again,