Monday 15 February 2016

Big Y SNP Markers of Lineage II

In the previous post, we reported on the three new sets of Big Y results, where these results placed those members in Alex Williamson's Big Tree, and how the new data changed the overall structure of our own little Lineage II portion of the human evolutionary tree. I include the diagram again below, just to recap.

In this post, we will take a closer look at the actual SNP markers - both the Shared SNPs within the group and the Unique SNPs for each individual.

On the left, the 9 Lineage II members who tested on Big Y & their relation to each other

Shared SNPs among Lineage II members

The diagram above contains only the SNPs that are shared between individual members - therefore, I call this the "Shared SNP" portion of Alex's tree. But in addition, each member has unique SNPs, particular to that specific individual, that are not shared with anyone else in the world (at least for the time being ... but that is liable to change and I will explain why below). We'll take a look at the Shared SNPs first and the Unique SNPs afterwards.

Alex produces fabulous tables and matrices of the results on his Big Tree website and these are really informative. The table below shows all 9 members of Lineage II who have undertaken the Big Y test. The positions of the relevant "Gleeson" SNPs on the Y chromosome are shown in the first column, and the SNP name (if it has one) is in the second column. Many SNPs will not have a name because they are only newly discovered and no one has got around to naming them yet! This is "cutting-edge science" after all - we are on the crest of the wave of new scientific discoveries and there will not be immediate answers to all our questions. Of the 26 SNPs in the table, only 7 have been named (currently).

The third column tells us what "block" the SNP belongs in, followed by the specific region of the Y chromosome where it is found. And the rest of the columns are the actual data for our 9 members, with the kit number, family name, & terminal SNP (or terminal SNP block) for each member at the top of each column.

It is important to appreciate what a mammoth task this is. The table below represents a distillation of approximately 26,000 SNPs from each of the 9 members of Lineage II. The SNPs in the table below are shared only among these 9 members of Lineage II and by none of the other 2000+ people who have undergone NGS (Next Generation Sequencing) testing with the Big Y and similar tests. [1]

To see the table above, go to this link (www.ytree.net/DisplayTree.php?blockID=16), find the Gleeson portion in the bottom right of the diagram, click on the A5629 SNP block, and on the next page click on Show Mutation Matrix.

Taking the first row of data as an example, all 9 members are positive for this SNP, which lies at position 18606400. Furthermore, the usual "base" found at this position in most people is C (cytosine) but in our case it is a T (thymine). The SNP name is A5629 and it sits within the A5629 SNP block (each block is usually named after the SNP that appears at the top of the list of SNPs within it, which in this instance is A5629 - you can see this in the tree diagram at the top of the page).

There are two different types of classification system in operation in these tables. Some cells have a white, pink or grey background; and on top of this they may contain +, *, **, or ***. Here's an explanation for these classification systems:
  1. The colour of the background gives an indication of the degree of "coverage" achieved by the particular Big Y test. In other words, the Big Y test measures the entire Y chromosome in small chunks, usually 30-100 times per chunk, but (purely by chance) some chunks are read only a few times and (purely by chance) some chunks are not read at all. So a white background indicates good enough coverage, a grey background indicates no coverage, and a pink background indicates questionable coverage. These pink regions often indicate that the individual may be positive for a SNP even if it does not show up in his data. This is important to appreciate because it has a very significant implication: just because someone does not test positive for a given SNP doesn't mean it is not there! 
  2. The +, *, **, and *** signs mean slightly different things for the Big Y test and the FGC tests. In very crude terms, I like to think of them as "definite, probable, possible, and unlikely". In other words, they give some indication of the probability that the SNP in question is a genuine SNP (and not a "false positive"). But that's just a rough guide. You can read Alex's more comprehensive explanation below. [2]
Applying these classification systems to the SNP markers in the table, you can see that 15 of them are "definite" (a + on a white background) and the rest of them are unlikely (***). If you wanted to put a probability on how likely is "unlikely" ... if you said 10% you might not be far off. In other words, "unlikely" SNPs may be genuine SNPs 10% of the time, and "false positives" 90% of the time. Hopefully time (and further testing) will tell. But for now, this raises a second important point worth appreciating: just because someone tests "positive" for a SNP does not mean it is there!
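To make the two classification systems concrete, here is a minimal Python sketch that turns a table cell (background colour plus symbol) into a plain-English reading. It is only an illustration: the function name `describe_cell` and the lookup tables are my own invention, the likelihood figures are the FGC ones quoted in footnote [2] (Big Y kits use the symbols slightly differently), and the wording for cells with no symbol is my assumption about how to read them.

```python
# Informal labels from the text, paired with the FGC likelihoods quoted in
# footnote [2]. Using them as a lookup table is an assumption for illustration.
SYMBOL_MEANING = {
    "+": ("definite", "over 99%"),
    "*": ("probable", "over 95%"),
    "**": ("possible", "about 40%"),
    "***": ("unlikely", "about 10%"),
}

# Background colour of the cell -> degree of coverage, as described in the text.
COVERAGE = {"white": "good", "pink": "questionable", "grey": "none"}


def describe_cell(background, symbol=""):
    """Return a rough plain-English reading of one cell in the matrix."""
    coverage = COVERAGE[background]
    if coverage == "none":
        # Nothing was read at this position, so no conclusion is possible.
        return "no coverage: the SNP cannot be assessed"
    if symbol == "":
        if coverage == "questionable":
            return "no call, but coverage is questionable: the SNP may still be present"
        return "no call with good coverage: the SNP is probably absent"
    label, likelihood = SYMBOL_MEANING[symbol]
    return f"{label} ({likelihood} likely genuine, {coverage} coverage)"


print(describe_cell("white", "+"))
print(describe_cell("white", "***"))
print(describe_cell("pink"))
```

The first call describes a "definite" SNP, the second an "unlikely" one, and the third is the awkward case where a SNP may be present even though it does not show up in the data.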

By comparing which SNPs are shared among which members, it is possible to separate out members who are more closely related to each other than to other members in the group. And in this way it is possible to construct a family tree for these people based on their SNP markers. This produces a Mutation History Tree based on SNPs alone, which is essentially what Alex's tree is - a SNP-based Mutation History Tree.

Two things will happen over time: 1) some SNPs will be reclassified (in terms of "definite, probable, possible, and unlikely"), and 2) as more people test on the Big Y (or similar tests), some of the SNP blocks will be split into smaller blocks or individual SNPs, just like we saw when member 411177 (Glisson) joined the project - the addition of his results made SNP A5628 split away from the rest of the A5629 block (see the previous post). 
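The idea of a SNP block, and of a block splitting when a new person tests, can be sketched in a few lines of Python. This is a toy illustration only, not how Alex builds his tree: the kit and SNP names are hypothetical placeholders, and the function `snp_blocks` is my own. It simply treats a block as the set of SNPs carried by exactly the same group of testers.

```python
from collections import defaultdict


def snp_blocks(carriers):
    """Group SNPs that are carried by exactly the same set of kits."""
    blocks = defaultdict(list)
    for snp, kits in carriers.items():
        blocks[frozenset(kits)].append(snp)
    return sorted(sorted(snps) for snps in blocks.values())


# Hypothetical data: which kits are positive for which SNP.
before = {
    "snp_1": {"kit_A", "kit_B", "kit_C"},
    "snp_2": {"kit_A", "kit_B", "kit_C"},
    "snp_3": {"kit_A", "kit_B"},
}
print(snp_blocks(before))  # [['snp_1', 'snp_2'], ['snp_3']]

# A hypothetical new tester (kit_D) is positive for snp_1 but not snp_2,
# so the two SNPs can no longer sit in the same block.
after = {snp: set(kits) for snp, kits in before.items()}
after["snp_1"].add("kit_D")
print(snp_blocks(after))  # [['snp_1'], ['snp_2'], ['snp_3']]
```

Before the new tester arrives, snp_1 and snp_2 cannot be told apart and sit in one block. Afterwards they separate, which is the same kind of split described above for A5628 and the A5629 block.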

In other words, as more people test, the tree will branch and subdivide. And because we currently have 16 SNPs in 4 SNP blocks, it is likely we will have an additional 16 branches added to the tree.


Unique SNPs among Lineage II members

If we go to the Lineage II portion of Alex's tree (on the right side of the diagram here) and click on any of the individual names, this launches a table with that particular individual's unique SNPs. The same classification systems for SNPs apply in terms of coverage (white, grey, pink) and probability of being genuine (+, *, **, and ***).

From the tables below for each of the 9 individual members (each person's kit number appears in the last column), it is apparent that some have no "definite" unique SNPs (the first two members below, who are brothers) whilst others have up to 5 unique SNPs each (3 people). There are 71 "unique" SNPs in the various tables below, of which 22 are "definite" unique SNPs and the rest are mainly "unlikely" SNPs.

Two things will happen over time: 1) as above, some SNPs will be reclassified (in terms of "definite, probable, possible, and unlikely"), and 2) as more people test on the Big Y (or similar tests), some of these unique SNPs will appear in the new test results and will therefore not be "unique" any more - they will move up from the "Unique SNP" tables below into the "Shared SNP" portion of the tree.
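Here is a small Python sketch of the second point, showing how a "unique" SNP stops being unique once someone else turns up with it. The kit names, SNP names and the function `unique_snps` are all hypothetical and purely for illustration.

```python
from collections import Counter


def unique_snps(results):
    """For each kit, list the SNPs that no other kit in the data shares."""
    counts = Counter(snp for snps in results.values() for snp in snps)
    return {
        kit: sorted(snp for snp in snps if counts[snp] == 1)
        for kit, snps in results.items()
    }


# Hypothetical results: each kit mapped to the SNPs it is positive for.
results = {
    "kit_A": {"shared_1", "private_x"},
    "kit_B": {"shared_1", "private_y"},
}
print(unique_snps(results))
# {'kit_A': ['private_x'], 'kit_B': ['private_y']}

# A hypothetical new tester also carries private_x, so it is no longer unique.
results["kit_C"] = {"shared_1", "private_x"}
print(unique_snps(results))
# {'kit_A': [], 'kit_B': ['private_y'], 'kit_C': []}
```

Once kit_C joins, private_x is carried by two people, so it leaves the "Unique SNP" tables and would mark a new shared branch for kit_A and kit_C.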

So the Take Home message is this: the unique SNPs will become important over time in identifying further sub-divisions and sub-branches within the Lineage II portion of the human evolutionary tree.

Because we have 22 "definite" unique SNPs, this could potentially translate into 22 new sub-branches.

And if we add these 22 sub-branches to the 16 sub-branches that would develop from splitting the 4 current SNP blocks within the tree, we should (in time) see an extra 38 sub-branches develop within the Gleason Lineage II SNP-based Mutation History Tree.

However, I suspect that it may be many, many more than 38 sub-branches.

Time will tell.




Maurice Gleeson
Feb 2016




[1] This is not entirely true. Sometimes (rarely?) mutations can occur in the same SNP in entirely different populations, just by chance.

[2] The +, *, **, and *** symbols have slightly different meanings depending on whether the kit is a BigY kit, a FGC kit, or a manual entry of my own.
  • For FTDNA kits, + implies a "PASS" result with just one possible variant, * indicates a "PASS" but with multiple variants, ** indicates "REJECTED" with just a single variant, and *** indicates "REJECTED" with multiple possible variants. The multiple variant mutations tend to fall in repetitive regions.
  • For FGC kits, the meaning of the symbols is the same as it is from the FGC interpretation files. + indicates over 99% likely genuine (95% for INDELs); * over 95% likely genuine (90% for INDELs); ** about 40% likely genuine; *** about 10% likely genuine.
  • Manual entries read directly from a BAM file will be either + indicating positive or * indicating that the data show a mixture of possible variants.





2 comments:

  1. Thanks a lot. My big takeaway is that I can now find my 13 unique SNPs.

    1. It may be 13, it may be less, it may be more. But all of us are likely to have unique SNP marker mutations that no one else in the world shares. As more people test, it will become clear which SNP markers are shared and which are unique. And that will help place us on the Evolutionary Tree of Mankind.
