Saturday 16 May 2020

Methodology behind the Mutation History Tree v4 for Lineage II

In the previous post I introduced the new version of the Mutation History Tree (MHT) for Lineage II (the North Tipperary Gleeson's). The previous version of the MHT for Lineage II was published back in August 2017 (see blog post here), so let's take a look at what has changed since then, how the new version was put together, and what it tells us about the North Tipperary Gleeson's.

You can download a high resolution pdf version of the new MHT (version 4) here ... https://www.dropbox.com/s/8mssf5d8b142wli/L2%20MHT%20v4e.pdf?dl=0

There are now 52 members in Lineage II, an increase of 16 members since the last version of the MHT. The data used in each version of the MHT are summarised in the table below. You can see how the available data has grown over the past 5 years.



We use a combination of STR marker mutations, SNP marker mutations and user-submitted genealogical information to build the tree. The Big Y-700 provides the most comprehensive information for this analysis - it assesses 838 STR markers and over 200,000 SNP markers. Only 4 people have done this test, but 8 people have done the previous version of the Big Y (the Big Y-500), which assessed 561 STR markers and roughly 100,000 SNPs. It is difficult to know if Big Y-700 data will provide additional differentiating information over and above the data from the Big Y-500.

For this fourth version of the MHT, I analysed the data from scratch so that I could compare it with previous versions to see if and where I had done things differently. I also experimented with a different approach that generated a different format of the MHT - this has pros & cons compared to previous formats. Let me take you through it step-by-step.

The new format consists of 5 sections that will look very familiar to you as it is based on the Results Page for the Gleason DNA Project on the FTDNA website:
  1. Personal Details
  2. Genealogical Information
  3. SNP Marker data
  4. STR Marker data
  5. Mutation History Tree 
A pdf version of the entire MHT v4 can be downloaded from this Dropbox link here. This is the best way of viewing the MHT and should be referenced as you read the text below.


1) Personal Details consist of the FTDNA kit number, the old G-number (a numbering system we used when the project ran on the now defunct WorldFamiliesNetwork website), surname of the test-taker, and whether the test-taker has their STR results on the public Results Page (anyone who doesn't has their kit number & G-number blanked out). If you want to change your settings, just let me know. Or you can change them yourself by going to Account Settings > Project Preferences > Project Sharing > Group Project Profile and click / drag the slider to Opt in to Sharing.

Each branch of the Tree is indicated by a coloured border and highlighted background (green in the example below, indicating Branch F).



2) Genealogical information is whatever details have been submitted by the user. All members should have posted their direct male line pedigree on our Pedigrees Page. One of the most important pieces of information is frequently missing, and that's the ancestral location of the MDKA (Most Distant Known Ancestor). This should preferably be where he was born, but failing that his marriage location or where his children were born can be used as a substitute. This information may help us identify if a specific sub-branch comes from a particular area (and that in turn may help people in their documentary research).

Abbreviated Direct Male Line pedigree information is included in the pdf document, along with MDKA info (under FTDNA's heading "Paternal Ancestor Name"), and an additional column specifies the Paternal Ancestral Location (which lists country, county, district and town or townland, extracted from the MDKA information).

Known cousins have the background highlighted in pastel colours. There are none in the first branch but 3 in the second.




3) SNP Marker data includes the Terminal SNP for each member (under FTDNA's heading "Haplogroup"), details of any SNP test taken (SP, SNP Pack; BY500, Big Y-500; BY700, Big Y-700), the number of Unique SNPs and the associated SNP Sequence (or SNP Progression) - this is simply the sequence of SNP markers that characterises each branching point on the Tree of Mankind, starting at a distant "upstream" branch (in the past) and progressing all the way "downstream" (i.e. towards the present) to the Terminal SNP. Think of this string of SNPs as a line of ancestors coming forward in time until it reaches you in the present day. Comparing the SNP Sequences of two people helps us see exactly on which branches each person sits on the Tree of Mankind relative to each other. And this in turn tells us how closely or how distantly they are related to each other.

Each SNP in a SNP Sequence often represents several SNPs in a "SNP Block". Thus the SNP A5629 (for example) is actually the first SNP in a block of 5 SNPs, and the entire block takes its name from the first SNP - A5629. You can see these various SNP Blocks quite clearly in the Big Tree's version of the portion of the Tree of Mankind where the Lineage II Gleeson's sit.



The numbers in brackets after each SNP below represent the number of SNPs in that particular "SNP Block". FTDNA & The Big Tree have different methods for deciding how many SNPs there are in a SNP Block so you will find 2 numbers after each SNP below - the first is the Big Tree estimate and the second is FTDNA's (taken from their Big Y Block Tree). So for example, A5631 is a 1-SNP block according to the Big Tree and a 3-SNP block according to FTDNA. Knowing the number of SNPs in a SNP Block is helpful for calculating the age of a SNP Block (i.e. when was the first SNP in the block formed) ... so having different estimates for the number of SNPs in a block can lead to confusion and inaccurate age estimation (which is going to be very crude anyway).




Knowing the SNP Sequences allows us to build a SNP-based "family tree" (as in the diagram below, which is taken from FTDNA's public Y-DNA Haplotree). A5631 is the overarching SNP for the Lineage II Gleeson's - the common ancestor of all North Tipperary Gleeson's carried this DNA marker and passed it on to all his descendants. All group members sit on branches below this particular SNP.

 


4) The STR Marker Data is taken directly from the Results Page on FTDNA's website for the first 111 markers. Data from Big Y STR markers (markers 112 to 838) had to be downloaded from each individual member's webpage. There are up to 838 STR markers, and in the first row, I have numbered them 1 to 838. (Note that any extra multi-copy markers are labelled with the previous number followed by a letter e.g. dys464e & dys464f become 25e & 25f - this preserves the numbering system for the subsequent STR markers up to 838).
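
To make the numbering rule concrete, here is a minimal Python sketch of it. It is only an illustration, not the spreadsheet procedure actually used for the MHT. The function name `label_markers`, the short list of marker names, and the position of dys460 straight after the dys464 copies are all assumptions made for the example.

```python
def label_markers(names, extra_copies, start=1):
    """Assign a column label to each marker name, in order.

    Ordinary markers get the next number. Extra copies of a multi-copy
    marker reuse the previous number plus their own copy letter, so the
    numbering of all later markers is unchanged.
    """
    labels = {}
    number = start - 1
    for name in names:
        if name in extra_copies:
            # Extra copy: previous number + last character of the name (e.g. "e").
            labels[name] = f"{number}{name[-1]}"
        else:
            number += 1
            labels[name] = str(number)
    return labels


# Hypothetical slice of the marker row, starting at dys464a (marker 22).
names = ["dys464a", "dys464b", "dys464c", "dys464d",
         "dys464e", "dys464f", "dys460"]
labels = label_markers(names, extra_copies={"dys464e", "dys464f"}, start=22)
print(labels["dys464e"], labels["dys464f"], labels["dys460"])  # 25e 25f 26
```

The marker after the extra copies still gets the number it would have had without them, which is the point of the lettering scheme.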

Only those markers with a mutation are included - all others are deleted from this tabular summary. The names of each marker are in the 2nd row.

Numbering the markers (rather than writing out their full name, which becomes very cumbersome) makes it easier to locate specific markers within the MHT diagram and discuss them in the text below. For example, CDYa is marker 34.


I also include the modal haplotype for two "upstream" SNPs - Z16437 & Z255. These come before the overarching SNP for Lineage II (A5631). You can see this in the more extensive SNP Sequence for Lineage II below (taken from Rob Spencer's Admin Utilities tool on his Tracking Back website):
  • R-L21 > DF13 > ZZ10_1 > Z16423 > Z255 > Z16437 > BY2853 > Z16438 > BY2852 > A5631
Including these modal haplotypes allows us to see which STR mutations occurred prior to the emergence of the SNP marker A5631, and therefore which STR marker values were ancestral and which ones were descendant. This also identifies a Unique STR Pattern (USP) - relative to the Z255 modal - that applies to all Gleeson's below A5631. This is highlighted in light yellow (or is it beige?) and identifies the 5-marker USP for Lineage II as follows:
  • marker 9 (dys 439) changed from a value of 12 to 13 (written as 12 > 13)
  • marker 13 (dys 458) 17 > 16
  • marker 14 (dys 459a) 9 > 8
  • marker 23 (dys 464b) 15 > 16
  • marker 68 (dys 710) 36 > 37
Anyone who has this Unique STR Pattern (USP) is likely to test positive for the SNP marker A5631 and is also likely to be a Gleeson from North Tipperary. (Note that this USP is based only on the first 111 markers. Additional STR marker values may be present in STR markers 112-838 that could form part of the USP, but I was not able to locate the modal haplotype for these markers - it is not displayed on the FTDNA webpages and YFULL only has modal values for the first 111 STR markers, but excluding the multi copy markers - I had to obtain the latter from the relevant haplogroup projects).
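
For anyone who wants to check a set of results against this USP, the comparison can be sketched in a few lines of Python. The five markers and their values are taken from the list above. The function name `usp_matches` and the example kit are hypothetical, and a match is only suggestive, not proof of being A5631-positive.

```python
# Lineage II USP relative to the Z255 modal, as listed above:
# marker number -> (marker name, Z255 modal value, Lineage II value)
LINEAGE_II_USP = {
    9: ("dys439", 12, 13),
    13: ("dys458", 17, 16),
    14: ("dys459a", 9, 8),
    23: ("dys464b", 15, 16),
    68: ("dys710", 36, 37),
}


def usp_matches(haplotype):
    """Return (matched, tested) counts for the USP markers.

    haplotype maps marker number -> value; untested markers are
    either missing or set to None.
    """
    tested = [m for m in LINEAGE_II_USP if haplotype.get(m) is not None]
    matched = [m for m in tested if haplotype[m] == LINEAGE_II_USP[m][2]]
    return len(matched), len(tested)


# Hypothetical kit: marker 13 shows 17 instead of 16, marker 68 not tested.
example = {9: 13, 13: 17, 14: 8, 23: 16}
print(usp_matches(example))  # (3, 4)
```

Reporting how many USP markers were actually tested matters because someone with only 37 markers cannot be assessed on marker 68.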

As well as the USP that defines the A5631 group and distinguishes it from other groups, there are additional USPs within the group that define sub-groups (and hence sub-branches) among the North Tipperary Gleeson's. These sub-USPs are indicated by boxes with bold black borders. In the screenshot above, you can see such boxes around markers 2, 4, 14 & 15 (which have values of 23, 10, 9 & 9). There is also a possible sub-sub-group indicated by the box encompassing the values for two men on marker 13 (which has a value of 17).

The resulting subgroups were checked against the genealogical data and corroborating genealogical evidence was found for the groupings on several sub-branches (e.g. Branches B & A1).

Again, each branch of the Tree is indicated by a coloured border (green in the example above, indicating Branch F), as well as a coloured box with the Branch Name within it.


5) The Mutation History Tree sits in between the SNP data and the STR data. Building the MHT proceeds in 3 stages:
  1. defining the sub-groups using SNPs, genealogical data, and USPs
  2. defining the branching structure: which STR mutation came first - the chicken or the egg?
  3. defining the dates for each branching point (using various iterations of SAPP)

Once I had obtained the data from the Results Page and put it into an Excel spreadsheet, I grouped people together according to their Terminal SNP. As a result, 28 of the 51 project members were sorted into 10 distinct groups. To these were added known relatives (who had no downstream SNP data) - this allowed an additional 8 people to be grouped (bringing the total to 36 out of 51). Lastly, I visually inspected the STR mutations to identify sub-groups with a USP and sorted an additional 7 project members accordingly (bringing the total up to 43 out of 51). The remaining 8 project members were placed as accurately as possible within the MHT, on branches where their genetic distance to their neighbours was minimal (lack of available data prevented more accurate placement).

Once we had our groups, the next step was to try to judge how they all fitted together. I examined each STR marker column by column, assessing any mutations to try to determine which came first in the sequence from upstream to downstream, using a "maximum parsimony" approach (i.e. one that required the fewest mutations). I employed two exceptions to this general rule:
  1. I ignored mutations on markers 34 & 35 (CDYa & CDYb) because these are fast-mutating markers and their values can easily "flip-flop" back and forth from generation to generation.
  2. I avoided Back Mutations as these are rare compared to Parallel Mutations. Dave Vance has calculated that within the past 1000 years or so (which covers the era of surnames in Ireland), one would expect the ratio of Parallel Mutations to Back Mutations to be somewhere around 25:1 (see his article discussing this here). In other words, there should be 25 times more Parallel Mutations than Back Mutations in the MHT. And so I deliberately avoided assigning Back Mutations if at all possible. In fact, I only put in one such mutation (but this was more as a token gesture than anything else).
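
The "fewest mutations" idea can be illustrated with a small Python sketch. This uses Fitch's standard small-parsimony count on one marker column. It is a swapped-in textbook technique for illustration only: the MHT itself was built by eye, and this sketch treats marker values as unordered states and makes no distinction between Back and Parallel Mutations. The kit names, values and tree shapes are hypothetical.

```python
def fitch(tree, values):
    """Return (possible ancestral values, minimum mutation count) for one marker.

    tree is either a kit name (a leaf) or a tuple of two sub-trees.
    values maps kit name -> value of the marker for that kit.
    """
    if isinstance(tree, str):
        return {values[tree]}, 0
    left_states, left_cost = fitch(tree[0], values)
    right_states, right_cost = fitch(tree[1], values)
    shared = left_states & right_states
    if shared:
        # Both sides can share a value: no extra mutation needed here.
        return shared, left_cost + right_cost
    # No value in common: at least one mutation at this branching point.
    return left_states | right_states, left_cost + right_cost + 1


# Hypothetical values of a single marker for four kits.
values = {"kit1": 16, "kit2": 16, "kit3": 17, "kit4": 17}
tree_a = (("kit1", "kit2"), ("kit3", "kit4"))  # like values grouped together
tree_b = (("kit1", "kit3"), ("kit2", "kit4"))  # like values split apart
print(fitch(tree_a, values)[1], fitch(tree_b, values)[1])  # 1 2
```

The first arrangement explains the column with a single mutation and the second needs two, so the first is the more parsimonious branching structure for this marker.
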
Some branches were easy to characterise, either because they had a lot of data (e.g. Big Y-700), or they had distinctive USPs, or both. For Branches H & G, it was much more difficult to elucidate a branching structure and thus the connections between people on these branches are unlikely to be accurate reflections of reality. Furthermore, the SNP markers that characterise these branches are quite far upstream - BY5706 goes back to a common ancestor who lived about 1250 AD and A5629 to an ancestor who lived about 1150.

More data (i.e. Big Y-700) for everyone on these branches would be needed in order to better characterise their relationship to each other. In addition, it was not possible to confidently allocate 4 people to any of the 11 named branches. And again, Big Y data would be needed to do so.

It is highly likely that some of the people on Branches H & G, and/or the 4 outliers, will belong to new (possibly isolated) branches of the MHT. Many branches will have become extinct over time and others will only have very few surviving descendants. And some surviving branches may not have anyone who has tested, and so they are not currently represented in the project.

Gleeson Lineage II MHT version 4 - download the MHT as a pdf document here

Once the branching structure was defined, the next step was to date each branching point within the overall tree structure. Dave Vance's SAPP Programme was used to generate TMRCA dates for each branching point (TMRCA, Time to Most Recent Common Ancestor). This is never a straightforward task as there are many hurdles to overcome.

Firstly, the quantity of data for each project member is variable. Some have only tested 12 STR markers, others have tested 838 as well as 200,000 SNP markers. However, there are some tricks to get around this:
  1. if close relatives have tested, the STR marker values for one relative can be extrapolated to the other. This is obviously not foolproof but it captures most of the relevant shared mutations and helps recognise USPs.
  2. if someone else has a similar USP to others who have tested to 838 STRs, the additional STRs can be imputed for the person with fewer STR markers tested and the missing marker values can be completed / filled in.
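
A minimal Python sketch of this kind of imputation is shown below. It assumes each kit is held as a mapping of marker number to value, with untested markers missing or set to None. The function name `impute_from_relative` and both example kits are hypothetical. It keeps a list of which values were imputed, so they can be suppressed again in a later iteration.

```python
def impute_from_relative(kit, relative):
    """Fill untested markers in kit with the relative's values.

    Returns (filled, imputed): a new dict with the gaps filled in, and
    the list of marker numbers that were copied from the relative.
    Tested values in kit are never overwritten.
    """
    filled = dict(kit)
    imputed = []
    for marker, value in relative.items():
        if filled.get(marker) is None and value is not None:
            filled[marker] = value
            imputed.append(marker)
    return filled, imputed


# Hypothetical kits: a 12-marker tester and a cousin tested on more markers.
kit_y12 = {2: 23, 4: 10}
cousin = {2: 23, 4: 10, 14: 9, 15: 9}
filled, imputed = impute_from_relative(kit_y12, cousin)
print(filled)   # {2: 23, 4: 10, 14: 9, 15: 9}
print(imputed)  # [14, 15]
```
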
The SAPP Programme is very sensitive to data input - small changes in input can produce big changes in output. Several versions of the input file were created with varying degrees of data imputation and data suppression. The outputs (i.e. MHT) of sequential iterations of the input file were examined for consistency and differences. 

Here is a summary of each input file and at the end of this post are the SAPP-generated MHTs associated with each input file:
  • GL2 v4 ... no calls (-) replaced with "n" prior to pasting into txt file, 2 kits ignored (both above A5631), 464e&f were removed and modal values for the remaining dys464 markers (a thru d) were used for the 2 affected kits
  • GL2 v4a ... additional values for known relatives and USP-defined branches were imputed, extreme outliers were ignored (Y-12)
  • GL2 v4b ... Z16437 modal added (with values for dys464e & f deleted)
  • GL2 v4c ... ignore non-SNP & non-USP participants (mainly branches H&G)
  • GL2 v4d ... ignore CDYa & CDYb
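
Two of the clean-up steps above (replacing no calls with "n" and ignoring selected markers such as CDYa & CDYb) are simple enough to sketch in Python. This only illustrates the data preparation on one kit's values. It does not reproduce the actual layout of a SAPP input file, and the function name `prepare_values` and the example row are hypothetical.

```python
def prepare_values(row, ignore=()):
    """Clean one kit's marker values before building an input file.

    row maps marker name -> value as text. Markers listed in ignore
    are dropped, and no calls ("-") are replaced with "n".
    """
    cleaned = {}
    for marker, value in row.items():
        if marker in ignore:
            continue
        value = value.strip()
        cleaned[marker] = "n" if value == "-" else value
    return cleaned


# Hypothetical row with one no call and the two CDY markers.
row = {"dys393": "13", "dys390": "-", "CDYa": "37", "CDYb": "38"}
print(prepare_values(row, ignore={"CDYa", "CDYb"}))
# {'dys393': '13', 'dys390': 'n'}
```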

Interestingly, once "hard-to-place" project members were removed, the central estimate (midpoint estimate) for each of the major branching points did not differ substantially between iterations. The most noticeable change was the 5th iteration (GL2 v4d - CDY markers ignored) which reduced the TMRCA by about 50-200 years for each branching point. 

The 4th iteration (using input file GL2 v4c) was taken to be the version most likely to be closest to the true TMRCAs. This had to be massaged slightly as the first 3 SNPs had the same TMRCA - A5631 was adjusted to 1100 (50 years earlier), A5629 stayed at 1150, and BY5706 was adjusted to 1250 (100 years later). Also, A13119 was adjusted up by 100 years from 1250 to 1350.


Further problems arose when trying to calculate the formation date for some of the SNP markers, and in this regard we need to make several important points: 
  • It is essential to appreciate that the TMRCA estimate goes back to a common ancestor who was born with the particular SNP marker in question. In other words, the mutation did not arise in that ancestor - it arose in a previous generation before the birth of that common ancestor. It may have arisen in the ancestor's father or grandfather or 10 times great grandfather. Therefore the formation date (i.e. the date when the SNP mutation emerged) is usually going to be older than the TMRCA, sometimes a lot older. The only exception is when the SNP mutation occurred in the common ancestor's father.
  • SNP Counting was used and 84 was taken to be the average number of years per SNP. 
  • If there is only 1 SNP characterising a branching point, then the formation date refers to just that single SNP. However, if there is a block of SNPs at a branching point, the formation date represents the formation date of the very first SNP in the block. Thus SNP marker A660 represents a block of 6 or 7 SNPs and even though the TMRCA is about 1700, the formation date is estimated to be about 1200 (i.e. 6.5 SNPs x84 = 546 years ... so 1700-546 is roughly 1200 AD).
  • SNP formation dates were crudely calculated by taking the TMRCA as the end point of the block and counting back from it by the number of SNPs in the block x 84 years to reach the starting point. These crude formation date estimates were physically impossible in some instances because they were deemed to have occurred before the TMRCA of the preceding SNP. Therefore, such nonsensical estimates were constrained by the TMRCA of the upstream SNP, which was deemed to be more accurate. Some of these estimates had to be constrained by 200-300 years.
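
The arithmetic in the points above can be written as a short Python sketch. The 84 years per SNP and the A660 figures (TMRCA about 1700, 6.5 SNPs) come from the text. The function name `formation_year` and the upstream TMRCA of 1250 used in the second call are hypothetical, chosen only to show how the constraint works.

```python
YEARS_PER_SNP = 84  # average number of years per SNP used in this post


def formation_year(tmrca_year, snps_in_block, upstream_tmrca_year=None):
    """Crude formation year of the first SNP in a block.

    Counts back from the block's TMRCA by snps_in_block x 84 years.
    If an upstream TMRCA is given, the estimate is not allowed to fall
    before it, since the block cannot predate the upstream ancestor.
    """
    estimate = tmrca_year - snps_in_block * YEARS_PER_SNP
    if upstream_tmrca_year is not None:
        estimate = max(estimate, upstream_tmrca_year)
    return estimate


# A660 block: TMRCA about 1700, 6 or 7 SNPs (6.5 taken as the midpoint).
print(formation_year(1700, 6.5))  # 1154.0
# Same block with a hypothetical upstream TMRCA of 1250 as the constraint.
print(formation_year(1700, 6.5, upstream_tmrca_year=1250))  # 1250
```

The unrounded result of 1154 is the figure that is rounded to "about 1200" above, which shows how coarse these estimates are.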

As a result, a very crude timeline was generated for SNP formation dates and these need to be taken with a large pinch of salt (e.g. with average ranges of about +/-200 years). Nevertheless, they provide an approximate timeline for the evolution of the MHT and add further interest to the picture.


Conclusions

The outcome of this new approach used for generating this version of the MHT did not differ substantially from previous outcomes. The same number of branches were identified and all the major branches were successfully characterised.

The format stays very close to the format of the Results Page on the FTDNA website and hopefully this helps project members better understand where the data comes from and how it is used to build the tree. It may even help software developers to automate the process (or parts of it at least).

Using successive iterations of the SAPP-generated tree allowed better age estimations for each of the branching points within the tree. There were significant problems using SNP Counting to estimate Formation Dates for the SNPs, but the TMRCA estimates seemed reasonably stable, especially when non-SNP & non-USP members were excluded from the SAPP analysis.

At the end of all this, we have a MHT that charts the evolution of the Gleeson surname from a time close to its formation, through major periods in Irish history including the Norman Conquest, the Black Death, the English Plantations, Cromwell's Conquest, the End of the Gaelic Era, the Great Famine, and the large scale emigration that formed the present-day Irish Diaspora of 80 million people.

In some cases, project members within individual branches have managed to break through Brick Walls and establish relationships with other members on the same branch. A good example of this is the guest article by Lisa Little describing the connection between her Little line and the Gleeson's of Branch B.

However, the connections between the various branches are beyond the reach of surviving documentary records and sadly these ancestors are lost in the mists of time. All we have to show for them is the genetic legacy that they have passed on to their Gleeson descendants living today. But even this can help place new members on a particular branch of the "genetic family tree" for all North Tipperary Gleeson's and allow them to connect at least partially with the rich legacy of their ancestors.

Maurice Gleeson
May 2020

Below are the SAPP-generated MHTs associated with each input file. The branches are colour-coded to match the Branches of the MHT v4 ...

GL2 MHT v4 (Gleeson Lineage II Mutation History Tree version 4) - 50 kits

GL2 MHT v4a - 42 kits (there's a glitch & some of the colours came out wrong)

GL2 MHT v4b - 42 kits

GL2 MHT v4c - 39 kits

GL2 MHT v4d - 39 kits














2 comments:

  1. Amazing piece of work. I had some difficulty enlarging it enough to read the names. I would love to be able to plug in my Doyle ancestry to this tree. I'm going to look at the SNP's in my Y700 and yFull and see how it parallels your data. You are a true master. Thanks, Larry

    Replies
    1. Thanks Larry. The images can be enlarged if you click on them, so you might want to read the text and view the images in two separate web browsers or tabs. Alternatively you can download the pdf document and refer to that instead ... https://www.dropbox.com/s/ohgzy0225z2x1w2/L2%20MHT%20v4.pdf?dl=0 (just copy & paste into your browser's address bar at the top)
