Efficient Genotype Elimination via Adaptive Allele Consolidation

Share Embed


Descripción

1

Efficient genotype elimination via Adaptive Allele Consolidation Nicoletta De Francesco, Giuseppe Lettieri and Luca Martini Abstract—We propose the technique of Adaptive Allele Consolidation, that greatly improves the performance of the Lange-Goradia algorithm for genotype elimination in pedigrees [1], while still producing equivalent output. Genotype elimination consists in removing from a pedigree those genotypes that are impossible according to the Mendelian law of inheritance. This is used to find errors in genetic data and is useful as a preprocessing step in other analyses (such as linkage analysis or haplotype imputation). The problem of genotype elimination is intrinsically combinatorial, and Allele Consolidation is an existing technique where several alleles are replaced by a single “lumped” allele in order to reduce the number of combinations of genotypes that have to be considered, possibly at the expense of precision. In existing Allele Consolidation techniques, alleles are lumped once and for all before performing genotype elimination. The idea of Adaptive Allele Consolidation is to dynamically change the set of alleles that are lumped together during the execution of the Lange-Goradia algorithm, so that both high performance and precision are achieved. We have implemented the technique in a tool called Celer and evaluated it on a large set of scenarios, with good results. Index Terms—Genotype elimination, allele consolidation, pedigree.



1 I NTRODUCTION A pedigree is a data structure recording the parental relations among a given set of individuals and some genetic information for each individual. We assume the genetic information is relative to a single locus, so that each genotype consists of a pair of alleles. The pedigree lists, for each individual, the set of genotypes that the individual may have. This information may be obtained by observing the phenotype of the individual and then listing the genotypes compatible with that phenotype, or by a genetic analysis directly pointing to a pair of alleles. Some individuals may be untyped, i.e., their genetic information is completely missing and their list of possible genotypes consists of all pairs of alleles. Finally, the genetic information may contain errors. Fig. 1 shows an example pedigree with 8 individuals with alleles in the set A = {A, B, C, D}. Male and female individuals are represented by boxes and circles, respectively. Individuals that mate are connected to a dot with an arc, and this dot is then connected to their offspring. Each individual has a list of “phase known” genotypes, i.e., genotypes in which the origin of each allele (either paternal or maternal) is known. We represent phase known genotypes as ordered pairs of alleles, with the paternal allele coming first. The lists of individuals 6 and 7 contain two pairs. This is used to model the fact that we actually do not know which allele is the paternal one in their genotype. Untyped individuals are marked • N. De Francesco and G. Lettieri are with the Dipartimento di Ingegneria dell’Informazione: Informatica, Elettronica, Telecomunicazioni, Universit`a di Pisa, Italy. E-mail: {n.defrancesco,g.lettieri}@iet.unipi.it • L. Martini is with List S.p.A., Pisa, Italy. E-mail: [email protected]

1 ?

4 (A,A)

2 (A,A)

3 (B,B)

7 (A,B) (B,A)

5 ?

6 (C,D) (D,C)

8 ?

Fig. 1. An example pedigree.

with “?” and their lists contain all 16 pairs in A × A, corresponding to 10 phase unknown genotypes. In the following we will omit the “phase known” or “phase unknown” specification when clear from context. We recall that the law of Mendelian inheritance states that, in a genotype, one allele must come from the father and the other allele must come from the mother. Many of the pairs of alleles listed in a pedigree are actually impossible, since they are incompatible with this law when considered together with the genotypes of the other individuals. We call these pairs inconsistent. For example, all pairs of alleles for individual 5 in Fig. 1 must have A as maternal allele. The aim of genotype elimination [1] is to remove from the list of each indi-

2

vidual some, and preferably all, inconsistent pairs. This is used to find errors in the data and is also useful as a preprocessing step in the analysis of a pedigree, such as linkage analysis and haplotype reconstruction [2], [3]. These analyses generally have to compute all possible combinations of the genotypes listed in the pedigree, and thus their cost is greatly reduced if these lists are shrinked. In [2], Aceto et al. consider the related problem of genotype consistency checking and show that it is NPcomplete if there are at least three alleles. The problem of genotype consistency checking considers pedigrees with typed and untyped individuals (like the one in Fig. 1) and asks if all untyped individuals can be assigned a genotype so that the resulting pedigree is consistent with the laws of Mendelian inheritance. This problem can be reduced in polynomial time to the problem of genotype elimination described above: apply genotype elimination to the pedigree and observe if any of the individuals that were originally typed has been left with no genotype at all. This shows that genotype elimination is NP-hard and thus that it is very unlikely that efficient, polynomial algorithms may be found that are always able to remove all inconsistent genotypes from any pedigree. A first contribution to this problem has been given by Lange and Goradia in [1], who have shown that their algorithm is able to remove all inconsistent genotypes, but only from pedigrees that have no loops (i.e., sequences of arcs in the pedigree leading from an individual to himself). For pedigrees with loops the algorithm is still useful, since it is nonetheless able to remove many inconsistent genotypes. Moreover, it may be used as a substep in algorithms that work on general pedigrees [4]. The LangeGoradia algorithm has a wost-case complexity which is polynomial in the number of individuals in the pedigree and the number of alleles [2]. Allele Consolidation [5] is a technique aimed at reducing the computation time of the Lange-Goradia and other algorithms, possibly at the expense of precision. The idea is that many alleles are not actually seen in any given pedigree, and little or no precision is lost if they are all replaced by a single “lumped” allele. This reduces the number of alleles, on which the complexity of the Lange-Goradia algorithm depends. In [6] the idea of Allele Consolidation is extended. Instead of consolidating the same set of alleles for the entire pedigree, a different set of alleles is consolidated for each untyped individual, by looking at the alleles that do not occur in a subset of the pedigree near the individual. We say that every untyped individual has a private lumped allele. Private lumped alleles are generally much bigger then a single, global lumped allele. This considerably reduces the number of alleles seen by the Lange-Goradia algorithm. This technique is implemented in the PedCheck tool [7], so we refer to it as PedCheck-style Allele Consolidation (PAC from now on). PAC generally executes much faster than the original

Lange-Goradia algorithm, but it removes fewer genotypes. This is true even for loopless pedigrees like the one in Fig. 1. The source of imprecision is the use of different private lumped alleles among the components of a nuclear family. We show in Subsection 3.1 that if all components of the nuclear family use the same lumped allele, then the result computed with or without Allele Consolidation is the same. Based on this fact, we propose Adaptive Allele Consolidation (AAC) as a technique that is able to obtain similar performance improvements as PAC, while still removing all the genotypes that are removed by the unmodified Lange-Goradia algorithm. In AAC we still use private lumped alleles for each individual in the pedigree, but the lumped alleles are changed dynamically in order to always use the same ones before examining each nuclear family.

2

T HE L ANGE -G ORADIA

AND

PAC

ALGO -

RITHMS

We repeat the Lange-Goradia algorithm in Fig. 4 for reference. The main idea is to focus on a single nuclear family at a time (step B). In each nuclear family all combinations of genotypes are examined by first fixing a pair of genotypes for the parents (step 1) and then examining all the genotypes of the children (step b). By fixing the genotypes of the parents each child can be examined separately, without the need to first produce every combination of the genotypes of the children. For example, consider the pedigree of Fig. 1. The nuclear family loop for nuclear family {1, 2, 4, 5} performs 10 iterations of step 1 in Fig. 4, one for each of the 10 possible genotypes for the untyped individual 1. Let us denote by Gi the set of phase known genotypes of individual i. At the end of the 10 iterations we will have G1 = {(A, A), (A, B), (A, C), (A, D), (B, A), (C, A), (D, A)}, G2 = G4 = {(A, A)} and G5 = {(A, A), (B, A), (C, A), (D, A)}. Nuclear family {3, 4, 7} requires only one iteration, removing genotype (A, B) from the genotype set of individual 7. Finally, nuclear family {5, 6, 8} requires 4 iterations. The genotype set of individuals 5 and 6 are unchanged, while G8 becomes {(C, A), (C, B), (C, C), (C, D), (D, A), (D, B), (D, C), (D, D)}. We now give a short description of the PAC algorithm. The algorithm starts by determining all private lumped alleles. The private lumped allele of each individual is determined as follows: first, the descendants of the individual are examined, noting the first typed individual (if any) encountered in each line of descent; then, the alleles not seen in any of the genotypes of these noted descendants are consolidated in the private lumped allele. Since unseen alleles are determined by examining only a subset of the entire pedigree, this method produces bigger lumped alleles and thus smaller genotype lists than plain Allele Consolidation. However, since the genotypes of different individuals may now be expressed using different kinds of lumped alleles, the Mendelian law of inheritance and some steps of the

3

Lange-Goradia algorithm have to be modified. Following [6], we first recode each allele as a set: lumped alleles are coded with the set of alleles that they represent, while normal alleles are coded as singleton sets containing the original allele. Then, we introduce a compatibility relation among the (recoded) genotypes of a childmother-father triple. Assume we have fixed a motherfather genotype pair: we say that a child genotype is compatible (with the given mother-father genotype pair) if one of the maternal recoded alleles is contained in one of the recoded alleles of the child genotype, and one of the paternal recoded alleles is contained into the other recoded allele of the child genotype. This compatibility relation is called “fuzzy inheritance” in [6]. Finally, we replace steps a–c in Fig. 4 with the following. ab’. If each child in the nuclear family has one or more compatible genotypes among his current list of genotypes, then save the parental genotypes. Also save any child compatible genotype. c’. If any child has no compatible genotype among his current list of genotypes take no action to save any genotypes. For example, consider Fig. 1. Let us call ℓi the private lumped allele of individual i. We have ℓ1 = {B, C, D} and ℓ5 = ℓ8 = A. Typed individuals do not have private lumped alleles. The recoded genotype  sets are: G1 = (ℓ1 , ℓ1 ), (ℓ1 , {A}), ({A}, ℓ1), ({A}, {A})  , G2 = G4 = 5 , ℓ5 )}, G3 = ({B}, {B}) , G6 = ({A}, {A}) , G5 = {(ℓ ({C}, {D}),  ({D}, {C}) , G7 = ({A}, {B}), ({B}, {A}) and G8 = (ℓ8 , ℓ8 ) . Note that the number of iterations of step 1 in Fig. 4 has been reduced from 10 to 3 for nuclear family {1, 2, 4, 5}. The iteration that chooses (ℓ1 , ℓ1 ) for the father does not save any genotype, since individual 4 has no compatible genotype (according to the definition of “compatible” given above). All other iterations save all sets are G′1 =  genotypes, thus the new genotype (ℓ1 , {A}), ({A}, ℓ1), ({A}, {A}) , G′2 = G2 , G′4 = G4 and G′5 = G5 . Consider also the nuclear family {5, 6, 8}. Only one iteration of step 1 is performed. However, no genotype at all is removed. In particular, for individual 5 the algorithm computes a final genotype of {(ℓ5 , ℓ5 )}, which we should interpret as ℓ5 ×ℓ5 = A×A, while using LangeGoradia we obtained {(A, A), (B, A), (C, A), (D, A)}.

3

A DAPTIVE A LLELE C ONSOLIDATION

In order to understand the reasoning behind AAC it is useful to visualize genotype sets as follows. If we model phase known genotypes as ordered pairs of alleles, then we can understand sets of genotypes as a relation among alleles. We can represent this relation with its incidence matrix, the matrix having a 1 in position (i, j) if (i, j) is in the corresponding relation (i.e., genotype set), and a 0 otherwise. The matrix on the left in Fig. 2 represents the genotype set {(A, B), (A, D), (B, A), (C, D), (D, C)} over the alleles {A, B, C, D}. In this representation, Allele Consolidation is related to the concept of (diagonal) block matrix, i.e., a square matrix together

A  A 0 B 1  C 0 D 0

B 1 0 0 0

C 0 0 0 1

D  1 0  1 0

{A, B} {C} {D}

{A, B} 1 0 0

{C} 0 0 1

{D} 1 ! 1 0

Fig. 2. Matrix representation of a genotype set (left) and its compressed representation after the consolidation of the alleles {A, B} (right). A A 1 B 1 C 0 D 0

B 1 1 0 0

C 0 0 0 1

D 1 1 1 0

Fig. 3. A block matrix which can be compressed without loss of information.

with a partition of its horizontal and vertical indices. Here we are using the alleles as indices of the matrix, so we can refer to partitions of the set of alleles. In particular, Allele Consolidation implies a special form of block matrix in which the partition of the alleles contains only one block with size greater than 1, namely the lumped allele. Here, instead, we allow for general block matrices, arising from any partition of the alleles. This corresponds to the use of multiple private lumped alleles for each individual (one for each block with size greater than 1 in the partition). Allele Consolidation can be thought of as a “compression” of a block matrix. The compression replaces all blocks full of zeros with a single 0, and all other blocks with a single 1. In the matrix on the left of Fig. 2 we have shown the blocks resulting from the consolidation of the alleles A and B into a single lumped allele {A, B}. The matrix on the right is the corresponding compressed form. We can partially order the compressed forms of a given genotype set, based on the standard partial ordering of their corresponding partitions of alleles: coarser partitions give rise to more compressed forms. Compression may lose information: if we “uncompress” the matrix on the right of Fig. 2 we obtain the matrix in Fig. 3, which contains genotypes that were not contained in the original matrix. Of special interest are the “lossless” compressions of a given genotype set. An example is given by the matrix in Fig. 3 when we choose as lumped allele any subset of {A, B}: compression followed by uncompression will give the original matrix back. We call a lossless compression a “representation”. Note that for any genotype set there always is at least one representation, namely the compression arising from the partition with no lumped alleles. In general, a genotype set may have many representations. It can be shown that among all representations of a given genotype set there

4

A. For each pedigree member, list only those genotypes compatible with his or her phenotype. B. For each nuclear family: 1. Consider each mother-father genotype pair. a. Determine which zygote genotypes can result. b. If each child in the nuclear family has one or more of these zygote genotypes among his current list of genotypes, then save the parental genotypes. Also save any child genotype matching one of the created zygote genotypes. c. If any child has none of these zygote genotypes among his current list of genotypes—i.e., is incompatible with the current parental pair of genotypes—take no action to save any genotypes. 2. For each person in the nuclear family, exclude any genotypes not saved during step 1 above. C. Repeat part B until no more genotypes can be excluded. Fig. 4. The Lange-Goradia [1] algorithm.

is one which is the most compressed. If we have two genotype sets G1 and G2 we can also speak of their “most compressed common representation”. This is a compression that is lossless for both G1 and G2 , and it is the most compressed among all compressions with this property. The notion can be extended to any number of genotype sets G1 , . . . , Gn . High compressions give good performance, but may become lossy. In PAC we start with a lossless compression of the initial state of the Lange-Goradia algorithm, but the compression of the intermediate states becomes more and more lossy after the examination of each nuclear family. For example, in the pedigree of Fig. 1 the compressed genotype set of individual 5 after the examination of nuclear family {1, 2, 4, 5} is {(A, A)}. If we uncompress it we obtain A × A, while at the corresponding step of the Lange-Goradia algorithm we have {(A, A), (B, A), (C, A), (D, A)}. The idea of AAC is to dynamically adjust the compression for each individual in order to preserve both losslessness and performance. In Fig. 5 we describe the AAC algorithm informally, as a modification of the Lange-Goradia algorithm of Fig. 4. At the beginning of the algorithm (step A’) we recode the genotype set of each individual (either typed or untyped) into its most compressed representation. Before entering the nuclear family loop (steps 1 and 2 in Fig.s 4 and 5) we compute the most compressed common representation of the genotype sets of all members of the family, and recode them accordingly (step 0). After the recoding, all members of the nuclear family will use the same set of private lumped alleles. As shown in Subsection 3.1, this step guarantees that the nuclear family

loop will compute the same result as the corresponding loop in the original Lange-Goradia algorithm, starting from the corresponding uncompressed genotype sets. At the exit of the loop, each genotype set is recoded to its most compressed representation (step 3). This guarantees that the compression level is kept high in the following iterations of step B, thus improving performance. As an example, consider again the pedigree of Fig. 1. A single private lumped allele for each individual is sufficient for this example. In step A’ we let ℓ1 = ℓ5 = ℓ8 = A, ℓ2 = ℓ4 = {B, C, D}, ℓ3 = {A, C, D}, ℓ6 = {A, B} and ℓ7 = {C, D}. Accordingly, the initial genotype sets  are recoded as G1  = G5 = G8 = {(A, A)}, G = G = ({A}, {A}) , 2 4  G3 = ({B}, {B}) , G6 = ({C}, {D}), ({D}, {C}) and G7 = ({A}, {B}), ({B}, {A}) . This concludes step A’. Now consider step B for nuclear family N = {1, 2, 4, 5}. In step 0 we compute ℓ′ = ℓ1 ∩ ℓ2 ∩ ℓ4 ∩ ℓ5 = {B, C, D} and recode the relevant genotype sets accordingly: G′1 =  ′ ′ ′ ′ ′ ′ G5 = ({A}, {A}), ({A}, ℓ ), (ℓ , {A}), (ℓ , ℓ ) , G2 = G2 and G′4 = G4 . Step 1 requires 3 iterations, at the end of  which we have G′′1 = ({A}, {A}), ({A}, ℓ′), (ℓ′ , {A}) ,  G′′2 = G2 , G′′4 = G4 and G′′5 = ({A}, {A}), (ℓ′ , {A}) . Note that if we uncompress these genotype sets we obtain the same sets obtained in the corresponding step of the Lange-Goradia algorithm. These genotype sets are already in their most compressed representation, so step 3 leaves them unchanged. Now consider step B for nuclear family {5, 6, 8}. Step 0 computes ℓ′′ = ℓ′ ∩ℓ6 ∩ℓ8 = {B} and recodes G′′5 , G6 and G8 accordingly. For example, G′′5 is recoded to ({A}, {A}), (ℓ′′ , {A}), ({C}, {A}), ({D}, {A}) . For this nuclear family AAC performs the same number of steps as Lange-Goradia, since a lumped allele of size 1 is equivalent to no compression. Nuclear family {3, 4, 6} has no untyped individuals, so AAC performs the same steps as Lange-Goradia. 3.1 Proof of correctness In [8] we give a formalization of the technique in the framework of Abstract Interpretation [9], with compression and uncompression formalized as abstraction and concretization maps. However, the proof of correctness is actually very simple and can be outlined as follows. To prove that AAC computes the same result of NAC it is sufficient to show that the nuclear family loop (steps 1 and 2 of Fig. 4 and Fig. 5) preserve representations, i.e., assuming that we have a lossless compression at the beginning of the nuclear family loop, then we have a lossless compression also at the end. Then the conclusion will follow from the fact that AAC starts with a representation of the initial state of NAC, and that all other actions of AAC just convert a representation into another representation. We now proceed to show that the nuclear family loop of AAC preserves representations. In the following N = {1, . . . , n} is a fixed nuclear family where 1 is the father and 2 is the mother.

5

A. For each pedigree member, list only those genotypes compatible with his or her phenotype. A’. For each pedigree member, recode his or her list of genotypes to its most compressed representation. B. For each nuclear family: 0. Recode the genotypes lists of all members of the nuclear family to their most compressed common representation. 1. Consider each mother-father genotype pair. a. Determine which zygote genotypes can result. b. If each child in the nuclear family [. . . ] c. If any child has none of these zygote genotypes [. . . ] 2. For each person in the nuclear family, exclude any genotypes not saved during step 1 above. 3. For each person in the nuclear family, recode his or her list of genotypes to its most compressed representation. C. Repeat part B until no more genotypes can be excluded. Fig. 5. The Adaptive Allele Consolidation algorithm. Definition 1. Let g¯ = (g1 , . . . , gn ) be a vector of genotypes. We say that g¯ is Mendelian (on N ) if gi is a possible zygote of g1 and g2 for all 3 6 i 6 n. Note that, since we are using ordered pairs of alleles as genotypes, gi = (xi , yi ) is a possible zygote of g1 = (x1 , y1 ) and g2 = (x2 , y2 ) iff xi ∈ {x1 , y1 } and yi ∈ {x2 , y2 }. The Lange-Goradia algorithm ensures the following. Lemma 2. Consider step B in Fig. 4 for nuclear family N . Let ¯ = (G1 , . . . , Gn ) be the vector of genotypes sets immediately G ¯ ′ = (G′ , . . . , G′ ) be the corresponding before step 1 and let G 1 n vector immediately after step 2 (clearly G′i ⊆ Gi for all i ∈ N ). Then, for any k ∈ N , we have g ∈ G′k iff there is a Mendelian genotype vector g¯ = (g1 , . . . , gk−1 , g, gk+1 , . . . , gn ) with gi ∈ Gi for all i ∈ N with i 6= k. Proof: See the original paper by Lange and Goradia [1]. Now let A be a set of alleles and fix a partition P of A. Let us call compressed genotypes the ordered pairs (X, Y ) of blocks X, Y ∈ P . If c = (X, Y ) is a compressed genotype, its uncompression u(c) is given by u(c) = X × Y . A compressed genotype set is just a set of compressed genotypes. Therefore, the uncompression of a compressed genotype set C is [ [ X × Y. (1) u(C) = u(c) = c∈C

(X,Y )∈C

A compressed genotype set C is a lossless compression, or representation, of a genotype set G if G = u(C). Note that the nuclear family loop of AAC is the same as in NAC, but for the use of compressed genotypes

in place of normal genotypes. Compressed genotypes differ from normal genotypes only in the kind of alleles they are made of. Since the exact nature of the alleles is irrelevant for the definitions, we can transfer the notion of possible zygote and Mendelian vector to compressed genotypes and compressed genotype vectors. Moreover, we can readily reuse the proof of Lemma 2 to obtain the following. Lemma 3. Consider step B in Fig. 5 for nuclear family N . Let C¯ = (C1 , . . . , Cn ) be the vector of compressed genotype sets immediately before step 1 and let C ′ = (C1′ , . . . , Cn′ ) be the corresponding vector immediately after step 2 (clearly Ci′ ⊆ Ci for all i ∈ N ). Then, for any k ∈ N we will have that c ∈ Ck′ iff there is a Mendelian compressed genotype vector c¯ = (c1 , . . . , ck−1 , c, ck+1 , . . . , cn ) with ci ∈ Ci for all i ∈ N with i 6= k. Assume c¯ = (c1 , . . . , cn ) is a compressed genotype vector according to some given partition P . We say that a (not compressed) genotype vector g¯ = (g1 , . . . , gn ) is extracted from c¯ if gi ∈ u(ci ) for all i ∈ N . A key link between genotype vectors and compressed genotype vectors is given by the following Lemma. Lemma 4. Let c¯ be a vector of genotypes compressed according to a given partition P and g¯ be a genotype vector extracted from c¯. If g¯ is Mendelian, then so is c¯. Proof: Let c¯ = (c1 , . . . , cn ) and g¯ = (g1 , . . . , gn ). Consider any child i ∈ N . Since g¯ is Mendelian, the first allele of gi , be it xi , comes from g1 . Let ci = (Xi , Yi ) for two blocks Xi and Yi of P . Since gi ∈ u(ci ), we have xi ∈ Xi by (1). Now, we know that xi is also an allele of g1 , the genotype of the father of i, and also that g1 is in u(c1 ). Since xi cannot belong to two different blocks of P , this means that Xi is one of the two alleles of c1 . A similar reasoning applies to the maternal allele of i, thus the compressed genotypes c1 , c2 and ci follow the law of Mendelian inheritance. This shows that c¯ is a Mendelian vector, as claimed. Note that Lemma 4 can be readily generalized to arbitrary pedigrees, but nuclear families are sufficient for the proof of the following Theorem. ¯ G ¯ ′ be as in Lemma 2 and C, ¯ C¯ ′ be Theorem 5. Let G, as in Lemma 3. Assume Gi = u(Ci ) for all i ∈ N . Then G′i = u(Ci′ ) for all i ∈ N . Proof: We split this in G′k ⊆ u(Ck′ ) and u(Ck′ ) ⊆ G′k for any fixed k ∈ N . ′ ′ ′ • Consider Gk ⊆ u(Ck ) first. Take any g ∈ Gk . By Lemma 2 we can find a Mendelian vector g¯ = (g1 , . . . , gk−1 , g, gk+1 , . . . , gn ) with gi ∈ Gi for all i ∈ N , with i 6= k. Moreover, g ∈ Gk since G′k ⊆ Gk . By hypothesis, Gi = u(Ci ) for all i ∈ N and so, by (1), we can find a vector c¯ = (c1 , . . . , cn ) such that gi ∈ u(ci ) for all i 6= k, and g ∈ u(ck ). By Lemma 4 we now know that c¯ is itself Mendelian and, by Lemma 3 this implies ck ∈ Ck′ . This finally gives g ∈ u(Ck′ ) by (1).

6



To show that u(Ck′ ) ⊆ G′k , take any g = (x, y) ∈ u(Ck′ ). We need to show that g ∈ G′k . This can be accomplished using Lemma 2 if we are able to build a Mendelian vector g¯ = (g1 , . . . , gn ) with gk = g and gi ∈ Gi for all i ∈ N . By (1) there is c = (X, Y ) ∈ Ck′ such that x ∈ X and y ∈ Y . By Lemma 3, there is a Mendelian vector c¯ = (c1 , . . . , cn ) with ck = c = (X, Y ) and ci ∈ Ci for all i ∈ N . We proceed by cases on k. Assume k > 3 (i.e., k is a child in N ). By Mendelianity, X ∈ {X1 , Y1 } and Y ∈ {X2 , Y2 }, where c1 = (X1 , Y1 ) and c2 = (X2 , Y2 ). By hypothesis, X1 ×Y1 ⊆ G1 and X2 ×Y2 ⊆ G2 , so we can always choose g1 and g2 such that g is a possible zygote. Once g1 and g2 are fixed, we can see in a similar way that we can always choose gi ∈ u(ci ) for the remaining children so that the resulting g¯ is Mendelian. The cases for k = 1 and k = 2 are similar.

Note that we have actually proved that each nuclear family loop (i.e., step B in Fig. 4 and Fig. 5) of AAC computes the same result as the corresponding step of NAC. From this we can obtain a guarantee on the number of times that AAC executes the main body of the algorithm (steps a–c in Fig. 5). In fact, we have proved that AAC and NAC examine each nuclear family the same number of times and, whenever they are both examining the same nuclear family for the n-th time, AAC will have genotype sets which are a compressed representation of the corresponding, uncompressed, genotype sets of NAC. Since the size of a compressed genotype set is no bigger than the size of the corresponding genotype set, AAC will execute steps a–c at most as many times as NAC. If we can achieve good compressions, AAC will accordingly execute these steps fewer times than NAC. If the time thus gained offsets the time spent in steps A’, 0 and 3 (which are unique to AAC) this will translate in a reduction of the running time of AAC w.r.t. NAC.

4

I MPLEMENTATION

We have implemented Adaptive Allele Consolidation in the Celer tool [10] ∗ . In the actual implementation of AAC in Celer we introduced a number of compromises w.r.t. the ideal algorithm of Fig. 5. These compromises are motivated by the need to provide a reasonably fast implementation of steps 0 and 3 of Fig. 5. This is obtained by giving up the requirement of always computing the most compressed representations for all genotype sets. We are allowed to relax this requirement since the proof of correctness of AAC (Subsection 3.1) makes no use of the compression level, as long as the compression is lossless. The main implementation compromise is that we allow for just a single private lumped allele for each individual, as in PAC. This corresponds to restricting ∗. http://www.iet.unipi.it/g.lettieri/celer/.

N: Gi : Tk , G′i , Z: ℓi : ℓ′ , old i : 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20:

nuclear family; parents are 1, 2 ∈ N bitmatrix of individual i ∈ N temporary bitmatrices (Gi ’s initially empty) private lumped allele of individual i ∈ N temporary lumped alleles

T ℓ′ := i∈N ℓi foreach i ∈ N convert (ℓi , ℓ′ , Gi ) old i := ℓi ℓi := ℓ′ foreach g1 ∈ G1 foreach g2 ∈ G2 Z := mate(g1 , g2 ) foreach child k ∈ N Tk := Gk ∩ Z if Tk = ∅ goto next 7 insert (g1 , G′1 ) insert (g2 , G′2 ) foreach child k ∈ N G′k := G′k ∪ Tk foreach i ∈ N if Gi = G′i convert (ℓ′ , old i , Gi ) ℓi := old i else Gi := G′i

Fig. 6. The implementation of AAC in Celer.

the possible partitions of the set of alleles to those with only one block with a size greater than 1. Step A’ of Fig. 5 is implemented as follows: for each individual i with (uncompressed) genotype set Gi , let ℓi = A if i is untyped; otherwise, let ℓi equal to A minus the set of alleles appearing in Gi . This clearly gives the most compressed representation for untyped individuals, as required. For typed individuals this is always a representation (i.e., a lossless compression), and it is indeed the most compressed representation in the normal case of Gi = {(a, a)} or Gi = {(a, b), (b, a)} for some alleles a, b ∈ A. In the other cases this choice may not give the most compressed representation (which is not even well defined, given the limitation to just one private lumped allele). Fig. 6 shows the pseudo-code for the nuclear family loop of AAC in Celer. Step 0 is implemented in lines 1–5. First, we compute the intersection of the private lumped alleles of all members of the family (line 1) then we convert all their compressed genotype sets to this new lumped allele (lines 2–5). Again, this is not their most compressed common representation in general, but it is still a common representation for all the members of the family. Finally, step 3 is implemented in lines 3 and 16–20 as follows: before converting the compressed genotype sets in step 0, the old private lumped alleles are saved (line 4); at the end of the nuclear family loop all compressed genotype sets that have not been modified are converted back to their saved private lumped allele

7

(lines 16–20). A key implementation decision in Celer is to use intersections and unions of genotype sets as much as possible. Then, by representing genotype sets as bitmaps as suggested by Fig. 2 and Fig. 3, most steps of the algorithm can be implemented using fast bitwise operations. Lines 6–15 in Fig. 6 show the implementation of step 1 of Fig. 5 in Celer. At the beginning of each nuclear family loop an empty output set G′i is prepared for each individual i in the family N . Line 8 (step a in Fig. 5) produces a set Z containing the zygotes of the current father-mother genotype pair (g1 , g2 ), which is then intersected at line 10 with the input set Gk of each child to obtain the set Tk of “saved” genotypes for the current iteration (step b in Fig. 5). If all intersections are not empty the program reaches line 12, where the current father and mother genotype and the saved genotypes for each child are accumulated in their respective output sets. Celer also implements algorithms PAC and LangeGoradia. The reference implementation of PAC is PedCheck [7]. In [10] we have shown that Celer, even without AAC, is already faster than PedCheck, and often an order of magnitude faster. This is due to the use of simple, compact bitwise operations in Celer vs. complex and memory consuming list operations in PedCheck. Thus, this is only an implementation and not an algorithmic issue. Therefore, we have decided to reimplement PAC in Celer in order to compare the algorithms on a fair ground. The implementation of the nuclear family loop for PAC is similar to lines 6–15 of Fig. 6, but the use of different private lumped alleles in parents and children requires some changes: line 8 is removed and line 10 is replaced by Tk := compatible(Gk , g1 , g2 ), which assigns to Tk the genotypes contained in Gk which are compatible with g1 and g2 , according to the definition of “compatible” given in Section 1. A command line switch allows the user to choose among AAC, PAC, and no allele consolidation at all (NAC from now on). The nuclear family loop of NAC consists of lines 6–15 of Fig. 6, and thus implements the original Lange-Goradia algorithm using fast bitwise operations.

5

P ERFORMANCE

EVALUATION

We have compared PAC, AAC and NAC running Celer on three different pedigrees. Following the methodology described in [11], we have simulated genetic data by picking founder alleles from the uniform distribution, applying randomly the Mendelian laws down the pedigree to calculate non-founder alleles, and, finally, deleting the genotype information of some individuals. The first pedigree we considered is composed by 221 individuals. It is a human pedigree that traces the ancestors of two individuals affected by hypophosphatasia (HOPS). The pedigree comes from the Hutterite population living in North America, and it has been used previously in [12], [11].

NAC PAC AAC

HOPS 1h 11m 3m 24s 3m 6s

APE 10h 35m 1h 21m 1h 5m

QMSIM1 1h 12m 1h 5m 1h 2m

TABLE 1 Total running times.

Then, we tested a larger pedigree composed by 4921 individuals. This pedigree was also studied in [11] and has been simulated with the method of Gasbarra et al. [13]. It has been used as a benchmark for the tool Allelic Path Explorer (APE). The pedigree contains 159 founders, and 75 percent of individuals were inbred. The last pedigree we tested is even bigger. It is composed by 8420 individuals and has been generated with the tool QMSIM [14]. It is composed of 10 generations. The founders are 420 individuals (400 females and 20 males). For each pedigree we analyzed 1000 datasets for each combination of the number of alleles (5, 10, 20, 30, 40, 50 and 60) and of the percentage of untyped individuals (10, 20, 30, 40, 50 and 60 percent), for a total of 3 × 42000 datasets. The tests were run on a 64 bit Intel Core i5 2.67 GHz machine equipped with 8 GiB of RAM and running Gentoo Linux (kernel version 3.0.6-gentoo). Celer was compiled by gcc version 4.5.3 with -O2 optimization level. In Table 1 we show the total running time over the 42000 datasets for each combination of algorithm and pedigree. From the table is already clear that PAC and AAC have comparable execution times, and that both are able to greatly outperform NAC. Recall, however, that PAC pays the price of a reduced precision w.r.t. to NAC, while AAC does not. In Fig. 7 we examine the relations among the running times in more detail. The first column of Fig. 7 compares the average running times of NAC and AAC in all the tested configurations. Each average is the geometric mean of the corresponding 1000 running times. Each diagram has the number of alleles on the x axis and shows a different plot for each of the sampled percentages of untyped individuals. On the y axis we show the speedup, i.e., the ratio of the average running time of NAC over the average running time of AAC. Fig. 7(a) and 7(c) show the results for the HOPS and APE pedigrees. We can see that adaptive consolidation can be as much as 90 times faster than the unmodified Lange-Goradia algorithm. As anticipated by the total running times in Table 1, the performance improvement for the synthetic QMSIM1 pedigree (Fig. 7(e)) is less dramatic. However, we can still observe a speedup of up to 60% in the extreme cases. The general rule that can be observed in Fig. 7 is that the greatest performance benefits of AAC over NAC are obtained for configurations with many alleles and/or many untyped individuals, as expected. Moreover, note that we have never observed significant slowdowns of

8

HOPS

HOPS

100

100 10% 20% 30% 40% 50% 60%

speedup AAC vs. PAC

speedup AAC vs. NAC

10% 20% 30% 40% 50% 60%

10

1

10

1

10

20

30

40

50

60

10

20

(a)

30

APE

60

40

50

60

40

50

60

100 10% 20% 30% 40% 50% 60%

speedup AAC vs. PAC

10% 20% 30% 40% 50% 60%

speedup AAC vs. NAC

50

APE

100

10

1

10

1

10

20

30

40

50

60

10

20

(c)

30

(d)

QMSIM1

QMSIM1

100

100 10% 20% 30% 40% 50% 60%

speedup AAC vs. PAC

10% 20% 30% 40% 50% 60%

speedup AAC vs. NAC

40

(b)

10

1

10

1

10

20

30

(e)

40

50

60

10

20

30

(f)

Fig. 7. Speedup NAC vs. AAC (first column) and PAC vs. AAC (second column) for varying number of alleles (x axis) and varying percentage of untyped individuals.

9

ACC w.r.t. to NAC. The implementation of AAC has a potential performance advance also over PAC. When examining a nuclear family in PAC the implementation has to account for the fact that each individual may have a different private lumped allele. In AAC, instead, all private lumped alleles are translated to a common one before entering the nuclear family loop. The nuclear family loop of AAC, then, is much more streamlined and potentially faster than the corresponding loop of PAC. The effect should be more pronounced for large families with many children. On the other hand, PAC generally lumps together more alleles than AAC, and consequently has smaller consolidated genotype sets. This reduces the average number of father-mother genotype pairs that have to be considered in the nuclear family loop. It is then interesting to compare the running times of PAC and NAC, even if they compute different results. The second column of Fig. 7 shows the speedup of AAC vs. PAC. Fig. 7(d) shows that this implementation factor can cause AAC to run more than 3 times faster than PAC in the extreme cases for the APE pedigree. Fig. 7(b) and 7(f), on the other hand, show smaller speedups (around 10% and up to 20%) and sometimes a small slowdown (never more than 8%) for the HOPS and QMSIM1 pedigrees. This difference among the pedigrees is probably due to the fact that APE contains 454 families with 3 children or more, while HOPS contains only 11 and QMSIM1 none at all. According to Table 1 and Fig. 7 the running times of AAC and PAC are similar, thus they both give similar improvements over NAC. However, AAC has the advantage that no approximation w.r.t. NAC is introduced. To compare the precision of AAC w.r.t. PAC we have computed the size of the uncompressed and compressed genotypes sets in the output pedigree of the two algorithms, averaged over all the individuals in the pedigree. The average is a geometric mean, since the complexity of the analyses that use the results of genotype elimination usually depend on the product of the sizes of the genotype sets. Note that AAC and NAC have the same averages for the uncompressed sets, since they compute the same result, while the average of the compressed sets is not applicable to NAC. Fig. 8 shows the ratio of the average genotype set size of PAC over AAC, for the same configurations of Fig. 7. We think that the interesting measure is the average for the uncompressed genotype set size (first column in Fig. 8). A ratio above 1 means that the sets of abstract genotypes computed by PAC hide concrete genotypes that could have been removed, if consolidation had not been used. Indeed, all cases show that the number of these hidden, not removed genotypes can be very significant. The second column in Fig. 8 shows the size ratios for the compressed genotype sets. Here we can see that PAC generally produces smaller compressed sets, as expected. However, AAC succeeds in finding compressed sets whose size is close to those found by PAC.

As a whole, Fig. 7 and Fig. 8 support the claim that AAC is able to take the best of NAC (i.e., precise result) and PAC (i.e., fast response).

ACKNOWLEDGMENTS The authors would like to thank the anonymous referees for their valuable comments and suggestions to improve the quality of the paper.

R EFERENCES [1] [2]

[3] [4] [5] [6]

[7]

[8]

[9]

[10]

[11] [12] [13]

[14]

K. Lange and T. Goradia, “An algorithm for automatic genotype elimination,” Am. J. Human Genetics, vol. 40, pp. 250–256, 1987. L. Aceto, J. A. Hansen, A. Ingolfsd ´ ottir, ´ J. Johnsen, and J. Knudsen, “The complexity of checking consistency of pedigree information and related problems,” J. Comput. Sci. Technol., vol. 19, no. 1, pp. 42–59, 2004. J. Li and T. Jiang, “Efficient rule-based haplotyping algorithms for pedigree data,” in RECOMB, 2003, pp. 197–206. J. R. O’Connell and D. E. Weeks, “An optimal algorithm for automatic genotype elimination,” Am. J. Human Genetics, vol. 65, no. 6, pp. 1733–1740, 1999. K. Lange, Applied Probability. Springer, 2003. J. R. O’Connell and D. E. Weeks, “The vitesse algorithm for rapid exact multilocus linkage analysis via genotype set-recoding and fuzzy inheritance,” Nature Genetics, vol. 11, no. 4, pp. 402 – 408, 1995. ——, “Pedcheck: A program for identification of genotype incompatibilities in linkage analysis,” The American Journal of Human Genetics, vol. 63, no. 1, pp. 259 – 266, 1998. [Online]. Available: http://www.sciencedirect.com/science/article/B8JDD4R1WP1V-17/2/b556d7a79c50d44c4f200e65a4eac506 N. D. Francesco, G. Lettieri, and L. Martini, “Allele consolidation via abstract interpretation,” Dipartimento di Ingegneria dell’Informazione, Universit`a di Pisa, Tech. Rep. IET-11-01, 2011. [Online]. Available: http://www.ing.unipi.it/ a080224/report1101.pdf P. Cousot and R. Cousot, “Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints,” in 4th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages Proceedings. Los Angeles, California, 1977, pp. 238–252. N. D. Francesco, G. Lettieri, and L. Martini, “Celer: an efficient program for genotype elimination,” in AMCA-POP, ser. EPTCS, P. Milazzo and M. de J. P´erez Jim´enez, Eds., vol. 33, 2010, pp. 56–70. M. Pirinen and D. Gasbarra, “Finding consistent gene transmission patterns on large and complex pedigrees,” IEEE/ACM Trans. Comput. Biol. Bioinformatics, vol. 3, no. 3, pp. 252–262, 2006. Y. Luo and S. Lin, “Finding starting points for markov chain monte carlo analysis of genetic data from large and complex pedigrees,” Genet. Epidemiol., vol. 25, no. 1, pp. 14–24, 2003. D. Gasbarra, M. J. Sillanp¨aa¨ , and E. Arjas, “Backward simulation of ancestors of sampled individuals,” Theoretical Population Biology, vol. 67, no. 2, pp. 75 – 83, 2005. [Online]. Available: http://www.sciencedirect.com/science/article/B6WXD4F6F67C-1/2/134b19fb4e742340bb5b97813e0308b8 M. Sargolzaei and F. S. Schenkel, “Qmsim: a large-scale genome simulator for livestock,” Bioinformatics, vol. 25, no. 5, pp. 680–681, March 2009. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btp045

10

HOPS

HOPS

100

100 10% 20% 30% 40% 50% 60%

ratio of mean gen. set size (compressed)

ratio of mean gen. set size (uncompressed)

10% 20% 30% 40% 50% 60%

10

1

10

1

10

20

30

40

50

60

10

20

(a)

30

APE

60

40

50

60

40

50

60

100 10% 20% 30% 40% 50% 60%

ratio of mean gen. set size (compressed)

10% 20% 30% 40% 50% 60%

ratio of mean gen. set size (uncompressed)

50

APE

100

10

1

10

1

10

20

30

40

50

60

10

20

(c)

30

(d)

QMSIM1

QMSIM1

100

100 10% 20% 30% 40% 50% 60%

ratio of mean gen. set size (compressed)

10% 20% 30% 40% 50% 60%

ratio of mean gen. set size (uncompressed)

40

(b)

10

1

10

1

10

20

30

(e)

40

50

60

10

20

30

(f)

Fig. 8. Ratio of mean genotype set size for PAC vs AAC: uncompressed (first column) and compressed (second column).

11

Nicoletta De Francesco was born in Florence in 1951. In 1974 she obtained the degree in Computer Science (cum laude) from the University of Pisa. From 1981 she was researcher PLACE at the Dipartimento di Informatica of the UniPHOTO versity of Pisa, till 1989. From 1989 till 2000, HERE november, she was associate professor, first at the University of Salerno, and then at the University of Pisa. From december 2000 she is full professor of Computer Engineering at the Dipartimento di Ingegneria della Informazione: Elettronica, Informatica, telecomunicazioni of Engineering Faculty of the University of Pisa. From 2003 to 2010 she was delegate for teaching of the University of Pisa and from november 2010 she is vice-rector of the University of Pisa. She made her research in the field of the specification and verification of concurrent and distributed systems. Her research interests included the formal verification of specifications by model checking and temporal logic. After, she moved to the security field, and in particular the verification of security properties in programs by means of formal models such as abstract interpretation. Her actual research interest is the application of formal verification techniques to biological systems and languages. She has been involved in several reasearch projects under the financial support of the Italian Ministero dell’Universita’ e della Ricerca, the Italian National Research Council (CNR)and the European Community. In 2010 she winned the Computer Journal Wilkes Award 2010.

Giuseppe Lettieri received the PhD degree in Computer Systems Engineering from the University of Pisa, Italy, in 2002. Since 2004 he is a researcher with the Dipartimento di Ingegneria PLACE dell’Informazione: Elettronica, Informatica, TelePHOTO comunicazioni of the University of Pisa. He has HERE worked on distributed Operating Systems and memory management, then he moved to formal methods, with special interest in model checking and applications of the theory of Abstract Interpretation to Java bytecode verification and secure information flow. Giuseppe Lettieri has been involved in research projects under the finantial support of the Italian Ministero dell’Universit`a e della Ricerca and the Cassa di Risparmio di Pisa.

Luca Martini received the Phd degree in Computer Systems Engineering from the University of Pisa, Italy, in 2006. From 2006 to 2010 he has been with the Dipartimento di Ingegneria PLACE dell’Informazione: Elettronica, Informatica, TelePHOTO comunicazioni of the University of Pisa as a HERE Post-Doc Researcher. Since 2010 he is working as a R&D Software Architect for List Group, a company that provides innovative solutions for the financial industry in the areas of Capital Markets and Governance Risk Compliance. Luca Martini is mainly interested in formal methods. So far, his research has been in the field of language-based security. In particular, his activities focussed on Java bytecode verification and on checking securing information flow properties in programs via static analysis.

Lihat lebih banyak...

Comentarios

Copyright © 2017 DATOSPDF Inc.