A novel 2-D graphical representation of DNA sequences of low degeneracy

July 8, 2017 | Author: Subhash Basak | Category: Technology, Physical sciences, Chemical Sciences, Graphical Representation, Nucleic Acid, DNA sequence

14 December 2001

Chemical Physics Letters 350 (2001) 106–112 www.elsevier.com/locate/cplett

A novel 2-D graphical representation of DNA sequences of low degeneracy ☆

Xiaofeng Guo a,c,*, Milan Randic b,c, Subhash C. Basak c

a Institute of Mathematics and Physics, Xinjiang University, Wulumuqi, Xinjiang 830046, PR China
b Ames Laboratory-DOE, Iowa State University, Ames, Iowa 50011, USA
c Natural Resources Research Institute, University of Minnesota Duluth, Duluth, MN 55811, USA

Received 24 July 2001

Abstract

Several 2-D and 3-D graphical representations of DNA sequences, which give visual characterizations of the sequences, have been proposed by Nandy, by Leong and Morgenthaler, and by Randic et al. In this Letter, we introduce a novel graphical representation of DNA sequences in which four special vectors in 2-D space represent the four nucleic acid bases, so that a DNA sequence is represented on a plane by a successive vector sequence, which is also a directed walk on the plane. It is shown that the novel graphical representation of DNA sequences has lower degeneracy and less overlapping. © 2001 Elsevier Science B.V. All rights reserved.

1. Introduction

A DNA sequence is a sequence of the four letters A, T, G, C, which respectively denote the four nucleic acid bases adenine, thymine, guanine and cytosine. DNA sequences, even when considered for relatively short segments, do not yield an immediately useful or informative characterization. Comparison of DNA sequences, even those with fewer than a hundred bases, can be quite difficult (see the

☆ The project is supported by NSFC and Grant F-49620-96-10330 from the United States Air Force.
* Corresponding author. E-mail addresses: [email protected] (X. Guo), milan. [email protected] (M. Randic), [email protected] (S.C. Basak).

list of the first exons of beta-globin genes for eight different species shown in Table 1). In order to give a visual characterization of DNA sequences, many attempts have been made [1–25]. One of the first attempts was that of Hamori and Ruskin [7,8], using the G- and H-curves. The G-curves are generated in a virtual 5-D space whose coordinates are assigned to the four DNA nucleotides and to an integer characterizing the position of a nucleotide in a DNA sequence. The H-curves, on the other hand, represent a projection of the cryptic G-curve into humanly comprehensible 3-D space. In this method, the information content of a nucleotide sequence is converted from the four-letter (A, T, G, C) description into a 3-D space curve, called the H-curve. The positive z-direction is used to count the number of nucleotides in the

0009-2614/01/$ - see front matter © 2001 Elsevier Science B.V. All rights reserved. PII: S0009-2614(01)01246-5

X. Guo et al. / Chemical Physics Letters 350 (2001) 106–112

Table 1
DNA sequences of the first exons of beta-globin genes for eight different species

A. Human beta-globin (92 bases)
ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATTAAGTTGGTGGTGAGGCCCTGGGCAG

B. Goat alanine beta-globin (86 bases)
ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGGCTTCTGGGGCAAGGTGAAAGTGGATGAAGTTGGTGCTGAGGCCCTGGGCAG

C. Opossum beta-hemoglobin beta-M-gene (92 bases)
ATGGTGCACTTGACTTCTGAGGAGAAGAACTGCATCACTACCATCTGGTCTAAGGTGCAGGTTGACCAGACTGGTGGTGAGGCCCTTGGCAG

D. Gallus gallus beta-globin (92 bases)
ATGGTGCACTGGACTGCTGAGGAGAAGCAGCTCATCACCGGCCTCTGGGGCAAGGTCAATGTGGCCGAATGTGGGGCCGAAGCCCTGGCCAG

E. Lemur beta-globin (92 bases)
ATGACTTTGCTGAGTGCTGAGGAGAATGCTCATGTCACCTCTCTGTGGGGCAAGGTGGATGTAGAGAAAGTTGGTGGCGAGGCCTTGGGCAG

F. Mouse beta-a-globin (94 bases)
ATGGTTGCACCTGACTGATGCTGAGAAGTCTGCTGTCTCTTGCCTGTGGGCAAAGGTGAACCCCGATGAAGTTGGTGGTGAGGCCCTGGGCAGG

G. Rabbit beta-globin (90 bases)
ATGGTGCATCTGTCCAGTGAGGAGAAGTCTGCGGTCACTGCCCTGTGGGGCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGGC

H. Rat beta-globin (92 bases)
ATGGTGCACCTAACTGATGCTGAGAAGGCTACTGTTAGTGGCCTGTGGGGAAAGGTGAACCCTGATAATGTTGGCGCTGAGGCCCTGGGCAG

sequence. At each point z, the four corners of the corresponding xy-plane (NW, NE, SE and SW, as four points of the compass) are taken to represent the four bases. The basic rule for constructing the sequence map is to move one unit in the direction corresponding to the nucleotide being plotted and to draw a connected line through all such points, one for each unit in the z-direction. The H-curve can uniquely represent a DNA sequence. However, a computer is needed to display 2-D projections or 3-D stereo projections of DNA sequences.

The 2-D graphical representation of DNA sequences was first proposed by Gates [2], and rediscovered independently by Nandy [3,4] and by Leong and Morgenthaler [5]. Their method is based on choosing the four cardinal directions in the (x, y) coordinate system to represent the four bases in DNA sequences. The method essentially consists of plotting a point corresponding to a base by moving one unit in the positive or negative x- or y-direction, depending on the defined association of a base with a cardinal direction; the cumulative plot of such points produces a graph that corresponds to the sequence of bases in the gene fragment under consideration. In the Gates axes system, one moves one unit in the positive x-direction for a C, in the positive y-direction for a T, in the negative x-direction for a G and in the negative y-direction for an A, implying a cumulative plot of the instantaneous count of C–G against T–A. The Nandy axes system associates G with the positive x-direction, C with the positive y-direction, A with the negative x-direction, and T with the negative y-direction. In the Leong and Morgenthaler axes system, A is associated with the positive x-direction, T with the positive y-direction, C with the negative x-direction, and G with the negative y-direction. It was pointed out by Nandy and Nandy [6] that there are three possible independent axes systems to plot a 2-D graph of a gene sequence,


and in fact the three systems mentioned above cover the three orthogonal systems.

The 2-D graphical representation of DNA sequences has high degeneracy, because the graphical representation of a short DNA sequence may also correspond to many other DNA sequences. For example, the sequences AG, AGA, AGAG, AGAGA, AGAGAG, ... all have the same graphical representation. In the graphical representation of the DNA sequence of the opossum beta-globin gene in Table 1, some points and lines corresponding to bases in the DNA sequence also overlap, so there are circuits in the graphical representation. It is not difficult to find other DNA sequences, different from that of the opossum beta-globin gene, which have the same graphical representation.

Recently, Randic et al. [9] generalized the above 2-D graphical representation to a 3-D graphical representation. They place the origin of the Cartesian (x, y, z) coordinate system at the center of a cube so that four corners of the cube have the coordinates (+1, −1, −1), (−1, +1, −1), (−1, −1, +1), (+1, +1, +1), and assign the four nucleic acid bases as follows:

(+1, −1, −1) → A;  (+1, +1, +1) → T;  (−1, +1, −1) → G;  (−1, −1, +1) → C.

To obtain the graphical representation of a DNA sequence, they start from the origin and move in (x, y, z) space in the direction that the above assignment dictates. Hence a DNA sequence corresponds to a curve in 3-D space, some parts of which may overlap. The curve corresponding to a DNA sequence can be denoted by an (x, y, z) coordinate sequence. For example, the DNA sequence of 12 letters, ATGGTGCACCTG, corresponds to the coordinate sequence (0, 0, 0), (+1, −1, −1), (+2, 0, 0), (+1, +1, −1), (0, +2, −2), (+1, +3, −1), (0, +4, −2), (−1, +3, −1), (0, +2, −2), (−1, +1, −1), (−2, 0, 0), (−1, +1, +1), (−2, +2, 0). If the graphical representation of a DNA sequence is projected onto the 2-D Cartesian subspaces, three different 2-D graphical representations of the DNA sequence are obtained: the one on the (x, y) coordinate plane is identical with the 2-D graphical representation of Nandy, the one on the (x, z) coordinate plane is identical with that of Leong and Morgenthaler, and the one on the (y, z) coordinate plane is identical with that of Gates. Fig. 1 shows the three projections of the 3-D graphical representation of the above DNA sequence with 12 letters. The degeneracy of the 3-D graphical representation of DNA sequences is still high. For example, the sequences AGTC, AGTCA, AGTCAG, AGTCAGT, AGTCAGTC, ... have the same graphical representation.

Fig. 1. Three 2-D graphical representations of a DNA sequence of length 12, which are, respectively, the three projections of the 3-D graphical representation of the sequence.

In general, if a graphical representation of a DNA sequence has no circuit, then the DNA sequence can be uniquely determined from its graphical representation. So we use the minimum length of all the DNA sequences that form a circuit in a graphical representation to measure the relative degree of degeneracy of different graphical representations. The smaller the minimum circuit length, the higher the degeneracy. In Nandy's 2-D graphical representation of DNA sequences, the minimum length of a circuit is equal to 2, because any one of AG, TC, GA, CT forms a circuit with the minimum length. In Randic's 3-D graphical representation of DNA sequences, the minimum length of a circuit is equal to 4, because ATGC forms a circuit with the minimum length. Both graphical representations therefore have high degeneracy. Moreover, a 3-D graphical representation of a DNA sequence cannot be drawn directly on a plane. Therefore, there is continuing interest in finding novel 2-D graphical representations of DNA sequences which have low degeneracy. In this Letter, we introduce such a novel 2-D graphical representation of DNA sequences by designing four special vectors to represent the four nucleic acid bases A, T, G, C. It is shown that the minimum length of a circuit in the graphical representation of a DNA sequence is equal to either 4d for an even integer d or 2d for an odd integer d.
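The unit-step walks described above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `NANDY_2D`, `RANDIC_3D`, `walk` and `min_circuit_length` are hypothetical and do not come from the cited papers, and the shortest circuit is taken to be the shortest sequence whose walk returns to the origin, found by brute force.

```python
from itertools import product

# Step assigned to each base, as stated in the text above.
# Dictionary and function names are illustrative assumptions.
NANDY_2D = {"G": (1, 0), "C": (0, 1), "A": (-1, 0), "T": (0, -1)}
RANDIC_3D = {"A": (1, -1, -1), "T": (1, 1, 1), "G": (-1, 1, -1), "C": (-1, -1, 1)}


def walk(sequence, steps):
    """Return the list of points visited by the walk, starting at the origin."""
    dim = len(next(iter(steps.values())))
    point = (0,) * dim
    points = [point]
    for base in sequence:
        point = tuple(p + s for p, s in zip(point, steps[base]))
        points.append(point)
    return points


def min_circuit_length(steps, max_len=6):
    """Length of the shortest sequence whose walk returns to the origin.

    Brute force over all sequences up to max_len; returns None if none closes.
    """
    dim = len(next(iter(steps.values())))
    origin = (0,) * dim
    for n in range(1, max_len + 1):
        for seq in product("ATGC", repeat=n):
            if walk(seq, steps)[-1] == origin:
                return n
    return None


if __name__ == "__main__":
    print(walk("ATGGTGCACCTG", RANDIC_3D))
    print(min_circuit_length(NANDY_2D), min_circuit_length(RANDIC_3D))
```

Running the module prints the 3-D coordinate sequence of the 12-letter example together with the two minimum circuit lengths, which can be compared with the values quoted in the text.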


2. A novel 2-D graphical representation of DNA sequences

Let $S_1, S_2, S_3, \ldots, S_{i-1}, S_i, \ldots, S_k$ be $k$ vectors in the Cartesian $(x, y)$ coordinate system starting from the origin, each of which can be denoted by the coordinates of its terminal point. A vector sequence $S_1 S_2 S_3 \cdots S_{i-1} S_i \cdots S_k$ is said to be a successive vector sequence if $S_2, S_3, \ldots, S_{i-1}, S_i, \ldots, S_k$ are translated so that, for $2 \le i \le k$, the initial point of $S_i$ coincides with the terminal point of $S_{i-1}$, step by step. Clearly, the sum of the vectors $S_1, S_2, S_3, \ldots, S_{i-1}, S_i, \ldots, S_k$ can be obtained from the successive vector sequence $S_1 S_2 S_3 \cdots S_{i-1} S_i \cdots S_k$.

We design four special vectors in the Cartesian $(x, y)$ coordinate system to represent the four nucleic acid bases A, T, G, C as follows:

$$\left(-1, +\tfrac{1}{d}\right) \to \mathrm{A}, \qquad \left(+\tfrac{1}{d}, -1\right) \to \mathrm{T}, \qquad \left(+1, +\tfrac{1}{d}\right) \to \mathrm{G}, \qquad \left(+\tfrac{1}{d}, +1\right) \to \mathrm{C},$$

where $d$ is a positive integer (see Fig. 2). Hence a DNA sequence of the four letters A, T, G, C with length $n$ can be regarded as a successive vector sequence of length $n$ consisting of the four vectors corresponding to A, T, G, C. We use the successive vector sequence of a DNA sequence $S$ as a novel 2-D graphical representation of the DNA sequence, denoted by $G_d(S)$. The graphical representation $G_d(S)$ of $S$ may be regarded as a directed graph embedded on the plane whose vertices are the endpoints of all vectors in $S$ and whose arcs are the vectors in $S$. The graphical representation of $S$ may then be regarded as a directed walk in the directed graph.

Fig. 2. Four special vectors in the Cartesian (x, y) coordinate system which, respectively, represent the four nucleic acid bases A, T, G, C.

The graphical representations of two of the DNA sequences in Table 1, human beta-globin and opossum beta-hemoglobin, are illustrated in Fig. 3, in which the graphical representations obtained by Nandy's method are also given for comparison.

Fig. 3. The novel graphical representations and Nandy's graphical representations of the two DNA sequences for human beta-globin and opossum beta-hemoglobin genes. (a) Human beta-globin; (b) opossum beta-hemoglobin.

It should be mentioned here that there are also three possible independent axes systems for the novel graphical representation of DNA sequences, which correspond, respectively, to the axes systems of Nandy, Gates, and Leong and Morgenthaler. The axes system of the graphical representation given above corresponds to Nandy's axes system. The other two axes systems are shown in Fig. 4. It is interesting that a DNA sequence has three different graphical representations in the three independent axes systems. Any one or all of them can be used in various applications.

Fig. 4. The other two axes systems for the novel 2-D graphical representation.

3. The degeneracy of the novel 2-D graphical representation of DNA sequences

In order to determine the degeneracy of the novel 2-D graphical representation of DNA sequences, we need to calculate the minimum length of all the DNA sequences that form a circuit in their graphical representation. Let $S$ be a DNA sequence which forms a circuit in the graphical representation. Let $f_A$, $f_T$, $f_G$, $f_C$ denote the frequencies of A, T, G, C in $S$, respectively. By the above assumption and the definition of the graphical representation of $S$, we have the following system of equations:

$$\begin{cases} f_G - f_A + \dfrac{f_C}{d} + \dfrac{f_T}{d} = 0, \\[4pt] \dfrac{f_G}{d} + \dfrac{f_A}{d} + f_C - f_T = 0 \end{cases} \;\Longrightarrow\; \begin{cases} d f_G - d f_A + f_C + f_T = 0, \\ f_G + f_A + d f_C - d f_T = 0. \end{cases} \tag{1}$$

Adding the two equations in (1), we have $(d+1)(f_G + f_C) = (d-1)(f_A + f_T)$. If $d$ is even, then $\gcd(d+1, d-1) = 1$, implying that $f_A + f_T = c(d+1)$, where $c$ is a positive integer. Adding this equation to (1) and solving the system of equations, we have that

$$\begin{cases} f_G = (d-1)c - f_C, \\[2pt] f_A = \dfrac{1}{d+1}\left[(d^2+1)c - (d-1) f_C\right], \\[6pt] f_T = \dfrac{1}{d+1}\left[2dc + (d-1) f_C\right]. \end{cases} \tag{2}$$

From (2), it can be obtained that

$$|S| = f_A + f_T + f_G + f_C = 2dc. \tag{3}$$

Now we need to determine the minimum integer $c$ for which (2) has a solution in non-negative integers. From (2), we have

$$\begin{cases} f_C \le (d-1)c, \\ 2dc + (d-1) f_C = (d+1)x, \end{cases}$$

where $x$ is a positive integer. If $c = 1$, then $0 \le f_C \le d-1$, and

$$f_C = \frac{1}{d-1}\left[(d+1)x - 2d\right] = x - \frac{2(d-x)}{d-1}.$$

Since $d$ is even, $d-1$ is odd. So $f_C$ has no integer solution in this range. Hence we have $c \ge 2$. If $c = 2$, then

$$\begin{cases} 0 \le f_C \le 2(d-1), \\[2pt] f_C = \dfrac{1}{d-1}\left[(d+1)x - 4d\right] = x + \dfrac{2(x-2d)}{d-1}, \end{cases}$$

where $x$ is a positive integer. Then $f_C$ has an integer solution if and only if $x = d+1$, and so $f_C = f_G = d-1$, $f_A = f_T = d+1$, and $|S| = 4d$.

If $d$ is an odd number, then $\gcd(d+1, d-1) = 2$, implying that $f_A + f_T = \tfrac{1}{2}(d+1)c$, where $c$ is a positive integer. By an argument similar to the above, we have that $|S| = 2d$, and $f_A = f_T = \tfrac{1}{2}(d+1)$, $f_G = f_C = \tfrac{1}{2}(d-1)$.
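The derivation above can be checked numerically for small d. The following Python sketch is not the authors' procedure: it scales the four vectors by d so that all coordinates are integers (a convenience assumed here), builds the walk for a sequence, and searches by brute force for the smallest non-trivial base counts that satisfy system (1). The function names are hypothetical.

```python
def special_vectors(d):
    """The four base vectors multiplied by d, so that every coordinate is an integer."""
    return {"A": (-d, 1), "T": (1, -d), "G": (d, 1), "C": (1, d)}


def walk_points(sequence, d):
    """Points of the directed walk for `sequence`, in units of 1/d, from the origin."""
    vectors = special_vectors(d)
    x, y = 0, 0
    points = [(x, y)]
    for base in sequence:
        dx, dy = vectors[base]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def min_closed_frequencies(d, max_len=64):
    """Smallest length n and counts (fA, fT, fG, fC) that satisfy system (1).

    Exhaustive search over all count tuples with fA + fT + fG + fC = n,
    for n = 1, 2, ..., max_len. Returns None if no solution is found.
    """
    for n in range(1, max_len + 1):
        for fa in range(n + 1):
            for ft in range(n - fa + 1):
                for fg in range(n - fa - ft + 1):
                    fc = n - fa - ft - fg
                    closes_x = d * fg - d * fa + fc + ft == 0
                    closes_y = fg + fa + d * fc - d * ft == 0
                    if closes_x and closes_y:
                        return n, (fa, ft, fg, fc)
    return None


if __name__ == "__main__":
    for d in (1, 2, 3, 4):
        print(d, min_closed_frequencies(d))
```

Integer scaling avoids floating-point comparisons when testing whether a walk returns to its starting point; the same check could be written with exact fractions instead.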


Now we can give the following theorem.

Theorem 1. Let $S$ be a DNA sequence whose graphical representation $G_d(S)$ forms a circuit with the minimum length. Then
1. $|S| = 2d$, $f_A = f_T = \tfrac{1}{2}(d+1)$, $f_G = f_C = \tfrac{1}{2}(d-1)$, if and only if $d$ is odd;
2. $|S| = 4d$, $f_C = f_G = d-1$, $f_A = f_T = d+1$, if and only if $d$ is even.

The above theorem shows that the larger the even number $d$, the lower the degeneracy of the graphical representation $G_d(S)$ of a DNA sequence $S$. If $d$ is equal to 4 (resp. 8), the length of a circuit in $G_d(S)$ is greater than or equal to 16 (resp. 32). For drawing a figure of $G_d(S)$ on a plane, it is convenient to take $d$ as 4 or 8, although $G_d(S)$ has lower degeneracy if $d$ is an even number greater than 8. In Fig. 3, the graphical representations of the two DNA sequences take $d$ equal to 4.

For an individual DNA sequence, we can also measure the degree of degeneracy of a graphical representation by the quotient of the number of bases in the sequence and the number of lines (or edges) in the graphical representation. For the two DNA sequences in Fig. 3, the novel graphical representations have no circuit, so the corresponding quotient is equal to one, implying that the graphical representations uniquely determine the corresponding DNA primary sequences. In contrast, in Nandy's graphical representations of the same two DNA sequences, there are closed walks and repeated overlapping at numerous points and edges. The quotients for Nandy's graphical representations of human beta-globin and opossum beta-hemoglobin are, respectively, equal to 92/57 and 92/42, that is, 1.614 and 2.190. Hence for the human beta-globin exon 1 every edge is on average traversed 1.614 times, and for the opossum beta-hemoglobin exon 1 every edge is on average traversed 2.190 times. As we see, the degeneracy in both cases is very high, and a considerable loss of information accompanies such degeneracy.
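The quotient of bases to edges can be computed as in the sketch below. It assumes that an edge is an undirected segment between two consecutive points of the walk and that a retraced segment counts once; the text does not spell out this convention, so the sketch has not been checked against the authors' counts of 57 and 42. The names `NANDY_2D`, `novel_vectors` and `edge_quotient` are hypothetical.

```python
# Nandy's unit steps, as described in the Introduction.
NANDY_2D = {"G": (1, 0), "C": (0, 1), "A": (-1, 0), "T": (0, -1)}


def novel_vectors(d):
    """The four special vectors scaled by d (integer coordinates)."""
    return {"A": (-d, 1), "T": (1, -d), "G": (d, 1), "C": (1, d)}


def edge_quotient(sequence, steps):
    """Number of bases divided by the number of distinct undirected edges drawn."""
    if not sequence:
        raise ValueError("sequence must contain at least one base")
    x, y = 0, 0
    edges = set()
    for base in sequence:
        dx, dy = steps[base]
        nxt = (x + dx, y + dy)
        # A frozenset of the two endpoints ignores the direction of travel.
        edges.add(frozenset(((x, y), nxt)))
        x, y = nxt
    return len(sequence) / len(edges)


if __name__ == "__main__":
    print(edge_quotient("AGAG", NANDY_2D))          # one edge retraced four times
    print(edge_quotient("AGAG", novel_vectors(4)))  # four distinct edges
```

A quotient of exactly one means that no edge is reused, which is the situation reported above for the novel representation of the two sequences in Fig. 3.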


4. Discussion

Characterization, comparison, and similarity analysis of DNA sequences are still important and difficult tasks. Many investigations have been made in this area [1–25], and novel concepts are needed to make sense of the enormous amount of data generated by the human genome project and other gene sequence research efforts. The novel graphical representation of DNA sequences in the present Letter, which has been shown to have lower degeneracy, can give a better visual characterization of DNA sequences. Based on this graphical representation, some new invariants can be introduced to give numerical characterizations of DNA sequences. It is expected that they will find applications in the comparison and similarity analysis of DNA sequences.

Acknowledgements

We would like to thank the National Natural Science Foundation of China for supporting our international collaborative research and Guo's visit to the Natural Resources Research Institute of the University of Minnesota Duluth. This is contribution number 291 from the Center for Water and the Environment of the Natural Resources Research Institute. Research reported in this Letter was also supported by grant F-49620-96-1-0330 from the United States Air Force.

References

[1] A. Roy, C. Raychaudhury, A. Nandy, Bioscience 23 (1998) 55.
[2] M.A. Gates, J. Theor. Biol. 119 (1986) 319.
[3] A. Nandy, Curr. Sci. 66 (1994) 309.
[4] A. Nandy, Curr. Sci. 66 (1994) 821.
[5] P.M. Leong, S. Morgenthaler, Comput. Appl. Biosci. 12 (1995) 503.
[6] A. Nandy, P. Nandy, Curr. Sci. 68 (1995) 75.
[7] E. Hamori, J. Ruskin, J. Biol. Chem. 258 (1983) 1318.
[8] E. Hamori, BioTechniques 7 (1989) 710.
[9] M. Randic, M. Vracko, A. Nandy, S.C. Basak, J. Chem. Inf. Comput. Sci., to appear.
[10] J.B. Kruskal, in: D. Sankoff, J.B. Kruskal (Eds.), Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparisons, Addison-Wesley, London, 1983, p. 1.
[11] M.S. Waterman, Bull. Math. Biol. 46 (1984) 473.
[12] T.F. Smith, M.S. Waterman, Adv. Appl. Math. 2 (1981) 482.
[13] T.F. Smith, M.S. Waterman, J. Mol. Biol. 147 (1981) 195.
[14] W.R. Pearson, D.J. Lipman, Proc. Natl. Acad. Sci. USA 85 (1988) 2444.
[15] B. Jerman-Blazic, I. Fabic, M. Randic, J. Comput. Chem. 7 (1986) 176.
[16] B. Jerman-Blazic, I. Fabic, M. Randic, in: D. Hadzi, B. Jerman-Blazic (Eds.), QSAR in Drug Design and Toxicology, Elsevier, Amsterdam, 1987, p. 52.
[17] M. Randic, A. Nandy, S.C. Basak, J. Math. Chem., submitted.
[18] C. Raychaudhury, A. Nandy, J. Chem. Inf. Comput. Sci. 39 (1999) 243.
[19] M. Randic, G. Krilov, Int. J. Quantum Chem. 75 (1999) 1017.
[20] A. Nandy, Curr. Sci. 70 (1996) 661.
[21] C. Raychaudhury, A. Nandy, J. Chem. Inf. Comput. Sci. 39 (1999) 243.
[22] M. Randic, M. Vracko, J. Chem. Inf. Comput. Sci. 40 (2000) 599.
[23] M. Randic, Chem. Phys. Lett. 317 (2000) 29.
[24] M. Randic, S.C. Basak, J. Chem. Inf. Comput. Sci., submitted.
[25] M. Randic, Xiaofeng Guo, S.C. Basak, J. Chem. Inf. Comput. Sci., submitted.
