Thursday, March 19, 2015

Triangulated Small Segments are Identical by Descent

   Autosomal DNA segment matching is a complex issue.  Through testing and observation, it is obvious that some segment matches are false positives.  Computer algorithms will detect any matching allele with no knowledge of whether the allele is of paternal or maternal origin.


   If we said that the left columns are from the father’s side and the right from the mother’s, we would see that none of the columns match.  Obviously, we can’t just draw a line down the middle and say one side is the mother’s DNA.  To determine which DNA came from mom and which came from dad, the autosomal results would need to be phased.  To phase an autosomal sample, it must be compared to at least one parent’s result.  By difference, the child result can be split into its paternal and maternal contributions.
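   The "by difference" step can be sketched in a few lines of Python.  This is only an illustration, assuming each unphased genotype is a two-letter string and that only one parent has been tested; the function name phase_by_parent and the sample genotypes are hypothetical.

```python
def phase_by_parent(child, parent):
    """Split a child's unphased genotype into (from_tested_parent, from_other_parent).

    Returns None when the site cannot be resolved from one parent alone
    (for example, child and parent are both heterozygous for the same alleles)
    or when the genotypes are inconsistent.
    """
    a, b = child
    if a == b:
        # Homozygous child: each parent contributed the same allele.
        return (a, b) if a in parent else None
    a_possible = a in parent
    b_possible = b in parent
    if a_possible and not b_possible:
        return (a, b)
    if b_possible and not a_possible:
        return (b, a)
    return None  # both alleles (or neither) could have come from this parent


# Hypothetical genotypes at four SNPs for a child and the tested father.
child_genotypes = ["AG", "CC", "CT", "AG"]
father_genotypes = ["AA", "CT", "CT", "GG"]
phased = [phase_by_parent(c, f) for c, f in zip(child_genotypes, father_genotypes)]
print(phased)
```

   Sites where the child and the tested parent are both heterozygous stay unresolved in this sketch, which is one reason a second parent result helps.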


   If it were possible to phase every sample to be matched, false positives by computer algorithm would be eliminated.  Unfortunately, phasing every sample is not always possible.  A person’s parents may be deceased or even unknown.

   Another method of reducing or eliminating false positives is to triangulate each matching segment.  If a segment from autosomal sample A matches the corresponding segment from sample B and sample B matches sample C and sample C matches the original sample A, then the segment is considered triangulated and identical by descent.  How confident are we that the triangulated matches aren’t just a circular series of false positives? 
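   The triangulation test described above can be sketched as follows.  This is a simplified illustration, assuming each pairwise match has already been reduced to a single (chromosome, start, end) segment and checking only that the three segments overlap in position; the names overlap and triangulate and the coordinates are hypothetical.

```python
from itertools import combinations


def overlap(seg1, seg2):
    """Return the overlapping part of two (chromosome, start, end) segments, or None."""
    chrom1, start1, end1 = seg1
    chrom2, start2, end2 = seg2
    if chrom1 != chrom2:
        return None
    start, end = max(start1, start2), min(end1, end2)
    return (chrom1, start, end) if start < end else None


def triangulate(pairwise, a, b, c):
    """Return the region shared by the A-B, A-C and B-C matches, or None."""
    region = None
    for x, y in combinations((a, b, c), 2):
        seg = pairwise.get(frozenset((x, y)))
        if seg is None:
            return None  # one of the three pairs does not match at all
        region = seg if region is None else overlap(region, seg)
        if region is None:
            return None  # the pairwise matches do not share a common region
    return region


# Hypothetical pairwise matches on chromosome 3.
pairwise = {
    frozenset(("A", "B")): (3, 1_000_000, 3_500_000),
    frozenset(("B", "C")): (3, 1_200_000, 3_400_000),
    frozenset(("A", "C")): (3, 900_000, 3_000_000),
}
shared = triangulate(pairwise, "A", "B", "C")
print(shared)
```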

   Let’s look at a segment on chromosome 3 that starts at rs6796502 and is 2.5 cM and 946 SNPs.  For this exercise, any chromosome segment could be used.

Table 1.  Allele frequencies of 20 loci on chromosome 3.
   On that segment, there are 20 published locations with allele frequencies (NCBI).  Table 1 shows how often a certain allele combination (AA, AC, AG etc.) appears for a European population.  Based on allele frequency, the most common combination of alleles in this section of chromosome 3 for a population of European descent is listed in Table 2.  I have artificially selected the most common combination to simulate a large portion of the population with European descent.  About 1 in 3,400, or about 300,000 people, should have this combination.

Table 2.  Predicted allele combination.
   Imagine for a moment that you roll six dice.  The first die comes up with a one and the second is a two and so on.  The probability of rolling a one on the first die is 1/6 (one side up on a six-sided die).  The probability of rolling a one and then a two is 1/6 times 1/6 or 1/36.  On average, it will happen once every 36 rolls.  The combination illustrated on six dice would happen once in every 46,656 rolls.  Now imagine that is your DNA and we are looking for a match.  The other person would need one through six in the same order.  To calculate that probability we multiply 46,656 by 46,656 and get 2,176,782,336.  DNA matching actually has a better probability of matching.
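   The dice arithmetic is easy to check with exact fractions.  This small sketch only reproduces the numbers quoted in the analogy; the variable names are mine.

```python
from fractions import Fraction

p_one_die = Fraction(1, 6)               # one specific face on one die
p_two_in_order = p_one_die ** 2          # a one, then a two
p_six_in_order = p_one_die ** 6          # one through six in order
p_both_people = p_six_in_order ** 2      # the 46,656 x 46,656 figure from the text

print(p_two_in_order, p_six_in_order, p_both_people)
```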


   Table 3 lists the most common alleles again along with potential alleles that would generate a half match and the corresponding summed frequency.  The probability of the set of 20 potential combinations existing is equal to the product of the frequencies: 0.759.  This probability has to be extrapolated from 20 loci to 946, giving us 2.45 x 10^-6 or 1 in 400,000.  There is a 1 in 400,000 chance of a completely random match on this section of chromosome 3 for the alleles with the highest frequency.  It is well within reason to expect false positives for this one-to-one match.
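   One way to read that extrapolation is to treat the 946 SNPs as independent repeats of the 20 sampled loci and raise the 20-locus product to the power 946/20.  That independence is an assumption of this sketch, and the function name extrapolate is hypothetical.  Because the exponent is large, rounding in the 0.759 product shifts the result noticeably, so the output lands near, not exactly on, the figure quoted above.

```python
def extrapolate(p_sample, n_sample, n_total):
    """Scale a probability computed over n_sample loci up to n_total loci,
    assuming every block of n_sample loci behaves like the sampled one."""
    return p_sample ** (n_total / n_sample)


p_20_loci = 0.759   # product of the 20 summed half-match frequencies (rounded, from the text)
p_segment = extrapolate(p_20_loci, n_sample=20, n_total=946)

print(f"probability of a random half match over the segment: {p_segment:.3g}")
print(f"about 1 in {1 / p_segment:,.0f}")
```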

Table 3.  Probability of a half match within a European population.
   In the event of a three-way match (triangulation), we multiply by 2.45 x 10^-6 again, giving us a probability of 1 in 167 billion.  Now we are outside of what is statistically reasonable.

   The most common set of European alleles doesn't produce the highest probability of a random match.  When the alleles are not the same (AC, AG, CT etc.), there is a higher chance of an autosomal half match.  Table 4 shows an actual set of alleles and the corresponding set of alleles to generate a half match.
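   The effect of mixed alleles on a half match can be illustrated for a single SNP.  The sketch below sums the frequencies of every genotype that shares at least one allele with the tested genotype, which is how I read the "summed frequency" column; the genotype frequencies here are hypothetical and not taken from the tables.

```python
def half_match_probability(genotype, genotype_freqs):
    """Sum the frequencies of all genotypes sharing at least one allele with `genotype`."""
    return sum(
        freq
        for other, freq in genotype_freqs.items()
        if set(other) & set(genotype)
    )


# Hypothetical genotype frequencies for one SNP.
freqs = {"AA": 0.60, "AG": 0.30, "GG": 0.10}

homozygous = half_match_probability("AA", freqs)     # matches AA and AG
heterozygous = half_match_probability("AG", freqs)   # matches AA, AG and GG

print(f"AA half-matches {homozygous:.2f} of the population")
print(f"AG half-matches {heterozygous:.2f} of the population")
```

   A sample with mixed alleles at a two-allele SNP half-matches everyone at that SNP, so segments with many such SNPs are easier to match by chance.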

Table 4.  Probability of a half match within a European population using actual sample.
   This actual sample takes us from a false positive probability of 1 in 400,000 to 1 in 5,900 (0.000169).  A probability of 1 in 5,900 indicates that we should be seeing completely random matches that have no genetic relationship on a regular basis.  Considering a population of about 1.6 million autosomal tests taken, each of us would have 270 false positive matches on a segment similar to the one shown.     

   Triangulated matches exist for this segment of chromosome 3.  For the probability of this triangulated segment, we multiply by 0.000169 again, giving us 2.87 x 10^-8 or about 1 in 35 million.  Considering the number of results available for matching (about 1.6 million), it is not realistic that we are matching randomly.  In fact, most triangulated matches involve more than three test results.  If four test results are triangulated, the probability goes to 1 in 205 billion.  These probabilities indicate that triangulated results cannot be random and are matching due to common genetic descent.
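   The figures for this actual sample can be reproduced with a short calculation.  The sketch follows the approach used in the text, multiplying by the one-to-one probability once for each additional test result, and compares each probability with a database of about 1.6 million tests.  The function name is hypothetical, and small differences from the quoted figures come from rounding.

```python
def random_match_probability(p_pair, n_kits):
    """Probability that n_kits all match by chance, multiplying by p_pair
    once for each kit beyond the first (the approach used in the text)."""
    return p_pair ** (n_kits - 1)


p_pair = 0.000169          # one-to-one false positive probability from the text
database_size = 1_600_000  # approximate number of autosomal tests from the text

for n_kits in (2, 3, 4):
    p = random_match_probability(p_pair, n_kits)
    expected = p * database_size
    print(f"{n_kits} kits: about 1 in {1 / p:,.0f}, "
          f"expected random matches in the database: {expected:.3g}")
```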

   I have intentionally used two examples that have a higher probability of having false positive matches.  As soon as we look at matches that don’t have the higher frequency European alleles, the probability of a false positive diminishes. 

Table 5.  Probability of a half match within a European population with a Mediterranean sub-component.
   Table 5 shows a typical set of alleles.  There are two alleles at rs7630053 and rs4558783 that are not typically European and may indicate a Mediterranean ethnicity.  The probability of a one-to-one match on this segment being a false positive calculates to be 1 in 7 quadrillion.

   Currently, we cannot examine the allele frequency for every SNP in every match we attempt.  When looking for autosomal matches consider phasing or triangulation.  Phasing the data is very valuable, yet the resources are not always available.  I’ve shown that triangulation eliminates false positives and those matches are statistically identical by descent.  Triangulated small segment matching is very valuable in our research.



References:

Maglio, MR (2015) Autosomal DNA and the Triangulation of Small Segments:  A Statistical Approach (Link)

© 2015 Michael Maglio and OriginsDNA.  All Rights Reserved. 

Thursday, March 5, 2015

Breaking Through the Autosomal DNA Generation Barrier: Connecting to Distant Ancestors

   There has been much debate over the use of small autosomal DNA segments.  It is important to understand where they come from and how they can be used for genetic genealogy.  Small segments are often dismissed as noise and false matches.  There are too many small matches to make sense of, but they are not necessarily false matches.  These segments have been in the population for longer than we thought.  When I match someone at 2 cM it is very likely that they are a 12th cousin, not a 5th cousin.  There is no reason for us to look for small segment matches until we understand where these segments originated.

   When we talk about autosomal DNA, we often oversimplify the process of genetic inheritance.  The simple answer is that we inherit half of our DNA from dad and half from mom.  The common message is that with every generation the DNA contribution from an ancestor is randomized and reduced until it is insignificant.  Genetic inheritance is actually much more complex than that.  Complex in a great way.  There is a tremendous amount of ancestral information that we are just beginning to tap into.

   We inherit DNA from our parents and their ancestors in large sections.  Take a look at the graphic below.  Each example is the comparison of a grandchild to a set of paternal grandparents.  You can see in the first example that the grandchild inherited over two-thirds of their grandfather’s first chromosome intact (blue bars).  The remaining section of the first chromosome is from their grandmother.  In the third example, the grandchild has inherited the entire chromosome 14 from their grandmother.  It is physically possible that this grandchild could someday give one of their children the grandmother’s complete chromosome 14.  


In an effort not to oversimplify, this is just half the story.  That grandchild has an equal contribution from their maternal grandparents.

   In the examples above, we can visualize what happens when DNA recombines.  The first example shows where one section of the grandfather’s DNA swapped places with the grandmother’s DNA before it was inherited by the grandchild.  This is called crossover.  In the examples, a) is a single crossover, b) is a double crossover and c) has no crossover.  On average, each of our chromosomes experienced 2 or 3 crossovers before we inherited them.

   Where DNA crossover takes place on a chromosome is not random.  There are approximate locations where the chromosome is more likely to split.  These locations are cleavage sites. 


These locations exist because there are groups of genes along a chromosome that have a tendency to stay together.  These groups are part of gene linkage.  These linked genes only allow for chromosome splits at either end of their linked section.  In my research, the minimum size for one of these gene-linked sections is about 2.5 cM.  These small segments then travel in larger groups.


   In the graphic above, the blue bar represents about a 60 cM match.  The intersection between the black and orange ovals is about 2.5 cM and represents a minimum segment.  In this crossover recombination, the large segment actually split to the right of the minimum segment.  In a future crossover, the chromosome could split on the left side of the minimum segment, giving a large segment bound by the orange oval.

   Why are these minimum segments important?  My research shows that these segments stay in the gene pool for dozens of generations.  Over time, naturally occurring SNP mutations take place.  These minimum inherited segments (MIS) can be differentiated into family groups.

   In my research, I started with 28 well known US colonial surnames and 393 autosomal kits.  For each surname, the associated kits were triangulated.  If three or more kits match on the same segment, you can deduce that it came from a common ancestor.  Each of the surnames investigated had 6 to 13 distinct triangulated segments.  Taken together, these triangulated ancestral segments represent an autosomal haplotype that can be used to identify a descendant’s genetic connection to an ancestor.  Across all of the surnames, these distinct segments appear at recurring locations on each chromosome.  I have listed 21 of these ancestral loci in my paper.
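   The rule of three or more kits matching on the same segment can be sketched as a simple grouping step.  This is not the procedure from my paper, only an illustration: it assumes each match has already been reduced to a kit identifier plus the chromosome and start position of the segment, and the kit names, positions and function name are hypothetical.

```python
from collections import defaultdict


def triangulated_loci(matches, min_kits=3):
    """Group (kit_id, chromosome, start) records by locus and keep the loci
    carried by at least `min_kits` distinct kits."""
    kits_by_locus = defaultdict(set)
    for kit_id, chromosome, start in matches:
        kits_by_locus[(chromosome, start)].add(kit_id)
    return {
        locus: kits
        for locus, kits in kits_by_locus.items()
        if len(kits) >= min_kits
    }


# Hypothetical matches for one surname group.
matches = [
    ("kit01", 2, 135_000_000),
    ("kit02", 2, 135_000_000),
    ("kit03", 2, 135_000_000),
    ("kit01", 9, 72_000_000),
    ("kit04", 9, 72_000_000),
]
loci = triangulated_loci(matches)
print(loci)
```

   Only the chromosome 2 locus survives here, because the chromosome 9 locus is carried by just two kits.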

   Not all ancestral segments are the same type.  The segments can be categorized into three groups.  The first category is Common to All.  The surnames in this study are predominantly European.  One segment has been identified on chromosome 2 that triangulates across all surnames.  This segment correlates to a Western Atlantic ethnicity and I call it the Western Atlantic Autosomal Haplotype (WAAH).  The Western Atlantic Autosomal Haplotype should not be confused with ancestry informative markers (AIMs).  The WAAH is composed of about 800 SNPs and there are only about 100 AIM SNPs in that same stretch of chromosome 2.

   The next category is Shared.  Some segments can be attributed to two or more surnames.  There was considerable intermarriage between US colonial families.  That period was a bottleneck genealogically and genetically.  As two major families married, their combined DNA segments entered the gene pool and were reinforced as their descendants intermarried. 

   The third category is Unique.  These shared segments cannot be attributed to intermarriage of families.  Yet the resulting familial autosomal haplotypes are not composed of a single surname.  In the case of Benjamin Franklin, the genetic proximity to his wife, Deborah Read, and his mother, Abiah Folger, may make it impossible to distinguish between Folger, Franklin and Read DNA.  Therefore, the haplotype represents the combined inheritance.

   Here is one of my case studies.  Augustine Bearse was born in England in 1618 and died in Barnstable, MA before 1697.  The Bearse family was chosen due to my familiarity with the genealogy and the debate surrounding Augustine’s wife.  His wife Mary was supposedly the granddaughter of the Chief of the Cape Cod Native American tribes.  The goal was twofold: to identify the autosomal haplotype for the Bearse family and to determine whether any of the ancestral segments had Native American ethnicity.

   The Bearse study was composed of 48 autosomal samples.  These samples were collected based on claimed genealogical connections.  The triangulated samples generated 8 ancestral loci and indicated an additional 5 loci that had the potential to triangulate with more samples.  The resulting Bearse autosomal haplotype is found below.

Bearse Autosomal Haplotype

   The Bearse haplotype contains the Western Atlantic Autosomal Haplotype (chromosome 2) which is common to all haplotypes in the study.  The other 12 loci are more valuable for genealogical validation.  One of the Bearse descendants triangulates on six of the ancestral segments.  It is highly unlikely that a descendant would match on all of the segments.  Although ancestral segments survive over the generations, the randomness of their distribution makes it difficult for any one person to have received them all.  Yet, triangulating on just one segment unique to Bearse is enough to indicate and validate a relationship.  Lack of a match could mean that an ancestral segment was not inherited or that a non-familial event (adoption, infidelity, etc.) has occurred and the individual’s family tree is incorrect.

   In order to investigate the origins of Augustine’s wife Mary, each ancestral segment from the haplotype was evaluated for ethnicity.  Only the segment on chromosome 6 at location 55850885 had any Native American ethnicity.  This ancestral segment had not fully triangulated, yet a few of the samples match exactly on Native American SNPs.  With additional samples, the segment could triangulate.  Once validated, the segment might be shared across multiple surnames or unique to Bearse, indicating Native American genes in the Bearse descendants.

   While the amount of autosomal DNA received by each successive generation is only half from each parent, that does not mean that given enough generations a distant ancestor’s genetic contribution will become negligible.  Through genetic linkage, portions of DNA are inherited intact.  Naturally occurring cleavage sites allow for ancestral segments averaging 2.5 cM to be passed from generation to generation as a minimum inherited segment (MIS). 

   Ancestral segment analysis is invaluable for the identification of distant ancestors.  All of the triangulated ancestral locations combine to become a Familial Autosomal Haplotype (FAH) that can be used to validate family history.

   Since finishing my initial research, I have gone on to identify over 50 ancestral loci and over 700 autosomal haplotypes for US colonial ancestors.  Stay tuned for further advances in autosomal research.

References:

Maglio, MR (2015) Minimum Inherited DNA Segment Size and the Introduction of Familial Autosomal Haplotypes (Link)

© 2015 Michael Maglio and OriginsConnector.  All Rights Reserved.