Polymerization of non-complementary RNA: Systematic symmetric nucleotide exchanges mainly involving uracil produce mitochondrial RNA transcripts coding for cryptic overlapping genes


BioSystems xxx (2013) xxx–xxx


Hervé Seligmann a,b,∗

a National Natural History Museum Collections, The Hebrew University of Jerusalem, 91904 Jerusalem, Israel
b Department of Life Sciences, Ben Gurion University, 84105 Beer Sheva, Israel

Article history: Received 24 October 2012. Received in revised form 24 January 2013. Accepted 29 January 2013.

Abstract

Usual DNA→RNA transcription exchanges T→U. Assuming different systematic symmetric nucleotide exchanges during transcription, some GenBank RNAs exactly match human mitochondrial sequences (exchange rules listed in order of decreasing transcript frequency): C↔U, A↔U, A↔U+C↔G (two nucleotide pairs exchanged), G↔U, A↔G, C↔G; none were found for A↔C, A↔G+C↔U, and A↔C+G↔U. Most unusual transcripts involve exchanging uracil. Independent measures of rates of rare replicational enzymatic DNA nucleotide misinsertions predict frequencies of RNA transcripts systematically exchanging the corresponding misinserted nucleotides. Exchange transcripts self-hybridize less than other gene regions, and self-hybridization increases with length, suggesting endoribonuclease-limited elongation. Blast detects stop codon-depleted putative protein coding overlapping genes within exchange-transcribed mitochondrial genes. These align with existing GenBank proteins (mainly of metazoan origin; prokaryotic and viral origins are underrepresented). These GenBank proteins frequently interact with RNA/DNA, are membrane transporters, or are typical of mitochondrial metabolism. Nucleotide exchange transcript frequencies increase with overlapping gene densities and stop densities, indicating finely tuned counterbalancing regulation of expression of systematic symmetric nucleotide exchange-encrypted proteins. Such expression necessitates combined activities of suppressor tRNAs matching stops, and nucleotide exchange transcription. Two independent properties confirm predicted exchanged overlap coding genes: discrepancy of third codon nucleotide contents from replicational deamination gradients, and codon usage according to circular code predictions. Predictions from both properties converge, especially for frequent nucleotide exchange types. Nucleotide exchanging transcription apparently increases coding densities of protein coding genes without lengthening genomes, revealing unsuspected functional DNA coding potential.
© 2013 Published by Elsevier Ireland Ltd.

Keywords: Expressed sequence tags; Nucleotide misinsertion; Human DNA polymerase gamma; Genome compression; Antitermination tRNA; Termination codon

1. Introduction


The question ‘why are there several stop codons?’ (Krizek and Krizek, 2012) has an apparently satisfying answer: off frame, protein coding genes include numerous stops (Seligmann and Pollock, 2004a,b; Singh and Pardasani, 2009; Tse et al., 2010), which decrease protein synthesis costs due to unprogrammed ribosomal slippage (Seligmann, 2007, 2010a; Warnecke and Hurst, 2011). In addition, the genetic code’s codon–amino acid assignments maximize off frame stop numbers (Itzkovitz and Alon, 2007), and third codon positions that are part of off frame stops tend to mutate less than comparable positions (Seligmann, 2012a). However, this explanation hides a further function that stop codons play in off frame sequences: it seems that when antitermination (suppressor) tRNAs are active in translation, the regular genetic code is de facto transformed into another, stopless genetic code (Seligmann, 2010b). Translating sequences into proteins according to that overlapping code reveals numerous previously undetected genes and proteins, their number coevolving with capacities of antitermination tRNAs (tRNAs with anticodons matching stops) to translate the stops they include (Faure et al., 2011; Seligmann, 2011a, 2012a,b). Inclusion of stop codons in the regular genetic code enables a double coding system, based on the same sequences, whose expression is efficiently regulated by the presence or absence of suppressor (antitermination) tRNAs. That way, numbers of coded proteins can be high while keeping a relatively short genome, by switching from the regular genetic code to a stopless code.

Genome length is an important factor limiting replication and cellular multiplication rates, apparently also affecting developmental rates of metazoan organisms (Sessions and Larson, 1987; Gregory and Hebert, 1999; Chipman et al., 2001). Ample data suggest that even at the level of single amino acids, protein sequences minimize metabolic synthesis costs (Akashi and Gojobori, 2002; Seligmann, 2003; Barton et al., 2010), notably of cognate amino acids (Perlstein et al., 2007; Alves and Savageau, 2005; Seligmann, 2012b). Protein length reduction apparently follows similar principles (Brocchieri and Karlin, 2005; Warringer and Blomberg, 2006; Seligmann, 2012b). Considering this, it is very probable that similar forces decrease genome length. Accordingly, there would be a strong advantage in being able to code for more proteins while keeping the genome short, a phenomenon that increases coding density by coding compression, such as overlapping genes, including those induced by antitermination tRNA activity (Seligmann, 2011a, 2012c, in press-a; Faure et al., 2011). Recent analyses suggest that mitochondrial genomes include several overlapping genes coded in the 3′-to-5′ direction of regular protein coding genes, apparently expressed upon putative ‘invertase’ activity, which would invert the sequence polymerized into RNA in the 3′-to-5′ direction (Seligmann, 2012d). A further mechanism apparently increasing coding density is that of protein coding genes based on tetracodons, quadruplet codons recognized by (among others) tRNAs with expanded anticodons (Seligmann, 2012e). Mitochondrial genes for ribosomal RNAs also seem to include overlapping protein coding genes (Seligmann, 2012g).

It is in this context that a group of phenomena called RNA recoding is considered here. These typically imply changing frames (Namy et al., 2005) and various phenomena of exon/intron reshuffling (e.g., Jin et al., 2007; Lev-Maor et al., 2007). In some cases, recoding alters the nucleotides used, such as adenosine-to-inosine RNA editing (Reenan, 2005; Paz et al., 2007; Daniel et al., 2011).

∗ Correspondence address: National Natural History Museum Collections, The Hebrew University of Jerusalem, 91904 Jerusalem, Israel. E-mail address: [email protected]

0303-2647/$ – see front matter © 2013 Published by Elsevier Ireland Ltd. http://dx.doi.org/10.1016/j.biosystems.2013.01.011

Please cite this article in press as: Seligmann, H., Polymerization of non-complementary RNA: Systematic symmetric nucleotide exchanges mainly involving uracil produce mitochondrial RNA transcripts coding for cryptic overlapping genes. BioSystems (2013), http://dx.doi.org/10.1016/j.biosystems.2013.01.011


1.1. Nucleotide exchanges as a working hypothesis for cryptic overlapping genes

The systematic replacement of T (thymidine) by U (uracil) in transcription from DNA to RNA is also a type of recoding: DNA→RNA polymerases systematically replace T by U, and reverse transcriptases replace U by T. This suggests the hypothesis that coding density might be increased by other types of systematic nucleotide exchanges, e.g. A by C and C by A (or any other symmetric exchange of this type). The fact that during regular DNA replication, ribonucleotides are frequently inserted instead of deoxynucleotides by the mitochondrial DNA polymerase gamma (Kasiviswanathan and Copeland, 2011) indicates that polymerases have some flexibility in that respect. Misinsertion of non-complementary nucleotides is also a basic property of polymerase (mis)function (Lee and Johnson, 2006). The possibility of polymerase activity implying systematic misinsertions, producing non-complementary DNA and/or RNA strands, cannot be excluded. Such recoded RNA, based on the template of the regular DNA sequence, could code for additional protein coding gene(s). Interestingly, if this occurred at the DNA level, it could be a mechanism for producing new genes; here, however, the assumed mechanism of transcription exchanging nucleotides implies that genes code according to both ‘direct’ (non-exchanging) and exchange transcription. In some ways, the former can be seen as explicit and the latter as implicit coding; nevertheless, both levels would be simultaneously inherent to the gene’s primary structure. Hence if such nucleotide exchanging activity exists, by some kind of unknown or modified DNA→RNA polymerases during RNA polymerization or editing, inducing such activity might unleash a very large coding potential, enabling coding for additional proteins without increasing genome size.

In addition, this system implies very simple regulation, as each set of genes associated with a given type of nucleotide exchange would be induced by the expression of its specific ‘nucleotide exchanger’ polymerase/editing activity.

Table 1
The nine different RNA sequences produced from transcription of a single DNA sequence (5′-ACGT-3′) according to the nine types of symmetric nucleotide exchange rules. The amino acid coded by the three first nucleotides according to the vertebrate mitochondrial genetic code is also indicated (the initial codon ACG codes for Thr), as well as the percentage of nucleotides that remain identical after that type of exchange transcription.

Exchange rule | RNA from initial DNA 5′-ACGT-3′ | Amino acid (initial codon: Thr) | Similarity to initial DNA sequence
A↔C | 5′-CAGU-3′ | Gln | 50%
A↔G | 5′-GCAU-3′ | Ala | 50%
A↔U | 5′-UCGA-3′ | Ser | 50%
C↔G | 5′-AGCU-3′ | Ser | 50%
C↔U | 5′-AUGC-3′ | Met | 50%
G↔U | 5′-ACUG-3′ | Thr | 50%
A↔C and G↔U | 5′-CAUG-3′ | His | 0%
A↔G and C↔U | 5′-GUAC-3′ | Val | 0%
A↔U and C↔G | 5′-UGCA-3′ | Cys | 0%
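The entries of Table 1 can be reproduced with a short script. The following is a minimal sketch, not the procedure used in the study: the function names and the `A<->C` rule labels are illustrative assumptions, amino acids are given in one-letter code, and the codon table is the standard NCBI translation table 2 (vertebrate mitochondrial).

```python
from itertools import product

# The nine symmetric exchange rules; each rule lists the nucleotide pairs that swap.
# The rule labels are illustrative names, not notation from a library.
RULES = {
    "A<->C": [("A", "C")],
    "A<->G": [("A", "G")],
    "A<->U": [("A", "U")],
    "C<->G": [("C", "G")],
    "C<->U": [("C", "U")],
    "G<->U": [("G", "U")],
    "A<->C+G<->U": [("A", "C"), ("G", "U")],
    "A<->G+C<->U": [("A", "G"), ("C", "U")],
    "A<->U+C<->G": [("A", "U"), ("C", "G")],
}

# NCBI translation table 2 (vertebrate mitochondrial), codons ordered U, C, A, G.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def exchange_transcribe(dna, rule):
    """Return the RNA obtained by the usual T->U step plus the given exchange rule."""
    swaps = {}
    for x, y in RULES[rule]:
        swaps[x] = y
        swaps[y] = x
    regular_rna = dna.upper().replace("T", "U")
    return "".join(swaps.get(n, n) for n in regular_rna)


def similarity(dna, rule):
    """Percentage of positions left unchanged compared with regular transcription."""
    regular_rna = dna.upper().replace("T", "U")
    exchanged = exchange_transcribe(dna, rule)
    same = sum(a == b for a, b in zip(regular_rna, exchanged))
    return 100.0 * same / len(regular_rna)


for rule in RULES:
    rna = exchange_transcribe("ACGT", rule)
    print(rule, rna, CODON_TABLE[rna[:3]], similarity("ACGT", rule))
```

Similarity is measured here against the regular transcript (T and U counted as identical), which is one reading of the last column of Table 1.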

In total, considering only the four usual nucleotides, nine symmetric nucleotide exchanges are possible, multiplying by nine the coding potential of any single sequence. Six of these involve only two types of nucleotides (A↔C, A↔G, A↔U, C↔G, C↔U, G↔U) and three involve all four types of nucleotides, implying two symmetric exchanges (A↔C+G↔U, A↔G+C↔U, and A↔U+C↔G). Table 1 shows the different RNA sequences produced by each of these rules from a single, given initial DNA sequence. Note that this procedure alters at least 50% of the nucleotides in the initial sequence used in Table 1, and that the amino acid coded by the three first nucleotides in that sequence is changed in almost all cases after systematic symmetric nucleotide exchange. Along the same lines, asymmetric nucleotide recodings are also possible, such as an exchange rule including three nucleotide exchanges (e.g., A→C, C→G and G→A); in total, 14 asymmetric exchange possibilities exist (including also rules with four asymmetric nucleotide exchanges). For practical reasons, I explore here only symmetric exchanges. Separating symmetric from asymmetric exchanges is also justified by the possibility that symmetric and asymmetric nucleotide exchanges may depend upon different types of polymerization (or editing) mechanisms. First, I explore GenBank’s EST (expressed sequence tags) RNA databank for sequences matching the ‘exchanged’ human mitochondrial genome according to each of the nine symmetric exchange rules and report the results for the various types of exchanges. Then Blast alignment analyses explore whether RNA recoded by each of these exchanges could be coding for proteins, using various bioinformatics methods to indicate whether the detected putative overlapping genes seem functional or not.
A meta-analysis of the data shows that frequencies of RNAs associated with the different types of symmetric exchanges are proportional to the bioinformatics estimations of overlap protein coding gene functionalities, indicating that coding compression through RNA exchange/editing occurs, at different frequencies for different types of nucleotide exchanges. Most notably, DNA nucleotide misinsertion rates during replication predict rates of nucleotide exchanging RNA transcription.
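The counts of nine symmetric and 14 asymmetric exchange rules can be checked by enumeration. The sketch below assumes that an exchange rule is a one-to-one mapping (a permutation) of the four nucleotides, and calls a rule symmetric when applying it twice restores every nucleotide; the function name `classify` is illustrative.

```python
from itertools import permutations

NUCLEOTIDES = "ACGU"


def classify(perm):
    """Classify a one-to-one nucleotide mapping as identity, symmetric or asymmetric."""
    mapping = dict(zip(NUCLEOTIDES, perm))
    if all(mapping[n] == n for n in NUCLEOTIDES):
        return "identity"  # regular transcription, nothing exchanged
    if all(mapping[mapping[n]] == n for n in NUCLEOTIDES):
        return "symmetric"  # every exchanged pair swaps back (X<->Y)
    return "asymmetric"  # contains a 3-cycle or a 4-cycle, e.g. A->C, C->G, G->A


counts = {"identity": 0, "symmetric": 0, "asymmetric": 0}
for perm in permutations(NUCLEOTIDES):
    counts[classify(perm)] += 1

print(counts)  # {'identity': 1, 'symmetric': 9, 'asymmetric': 14}
```

Of the 24 permutations, the 14 asymmetric ones consist of eight rules with three exchanged nucleotides and six rules with four.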


2. Materials and methods


2.1. Sequence manipulations and alignments with existing RNA transcripts


All analyses are done for GenBank’s reference complete human mitochondrial genome (NC_012920). Its entire sequence was copied from GenBank and pasted into a blank Microsoft Word file. In Word, the genome sequence was altered using the software’s ‘Replace’ function, mimicking a putative systematic nucleotide exchange. For example, for the symmetric exchange rule A↔C, ‘Replace’ was used to replace all ‘A’s in the genome by ‘X’, then all ‘C’s by ‘A’, and then all ‘X’s by ‘C’. The intermediate stage using ‘X’ (or any other arbitrary symbol differing from the four letters that symbolize the four nucleotides) is necessary to prevent ‘A’s changed into ‘C’s at the first step from being changed back into ‘A’ at the second step. The resulting sequence, where all ‘A’s present in the initial genome are replaced by ‘C’ and all ‘C’s in the initial genome are replaced by ‘A’, was copied from Word and pasted into GenBank’s online alignment software ‘Blastn’. Blastn was then requested to search its EST database, according to standard default search parameters, for RNA sequences matching that altered sequence, i.e. the human mitochondrial genome assuming systematic symmetric nucleotide exchange A↔C. This procedure combining Word and Blastn was repeated for each of the nine systematic symmetric exchange rules in Table 1.
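The three-step replacement described above, and the reason the intermediate symbol is needed, can be illustrated on a short toy string. This is a sketch only: the function names and the toy sequence are illustrative assumptions, and the study itself used Word’s ‘Replace’ function rather than code.

```python
def swap_with_placeholder(seq, x, y, placeholder="X"):
    """Swap all x and y in seq using an intermediate symbol, as in the Word procedure."""
    if placeholder in seq:
        raise ValueError("placeholder already occurs in the sequence")
    return seq.replace(x, placeholder).replace(y, x).replace(placeholder, y)


def naive_swap(seq, x, y):
    """Two-step replacement without placeholder: the first step is undone by the second."""
    return seq.replace(x, y).replace(y, x)


toy = "GATCACAGGT"  # toy fragment used only for illustration

print(swap_with_placeholder(toy, "A", "C"))  # GCTACACGGT (A and C exchanged)
print(naive_swap(toy, "A", "C"))             # GATAAAAGGT (all A and C end up as A)

# A single-pass alternative that needs no placeholder.
print(toy.translate(str.maketrans("AC", "CA")))  # GCTACACGGT
```

The single-pass `str.translate` call gives the same result as the placeholder method because each character is mapped exactly once.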


2.2. Prediction of secondary structure


Mfold (Zuker, 2003) is used to predict secondary structure formation. This is done for the complete (exchanged) sequence of genes for which exchange transcripts are detected. Mfold’s output presents the secondary structures that are within 5% of the optimal (most stable) secondary structure. The number of secondary structures in which a site does not participate in self-hybridization is indicated by Mfold’s ‘ss-number’. This number is averaged across all nucleotides and divided by the total number of secondary structures predicted by Mfold. The result represents the average ‘loopiness’ (tendency to form loops) of that RNA sequence. It is calculated separately for regions belonging to RNA transcripts that are transcribed according to a systematic nucleotide exchange rule, and for the rest of that gene. The difference between these two values estimates the loopiness of the exchange transcribed region as compared to the rest of the gene, assuming the latter were also exchange transcribed. This subtraction can indicate whether regions that were transcribed according to nucleotide exchange rules differ in secondary structure formation propensities from other regions of that gene.
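The loopiness calculation can be sketched as follows. The input is assumed to be a list holding, for each site, the number of predicted structures in which that site is unpaired; the function names, the toy numbers and the sign convention (transcribed region minus rest of the gene, as on the y axis of Fig. 1) are assumptions made for illustration.

```python
def loopiness(ss_counts, n_structures):
    """Mean fraction of predicted structures in which a site is unpaired."""
    if not ss_counts or n_structures <= 0:
        raise ValueError("need at least one site and one structure")
    return sum(ss_counts) / len(ss_counts) / n_structures


def relative_loopiness(ss_counts, n_structures, start, end):
    """Loopiness of the region [start, end) minus loopiness of the rest of the gene."""
    region = ss_counts[start:end]
    rest = ss_counts[:start] + ss_counts[end:]
    return loopiness(region, n_structures) - loopiness(rest, n_structures)


# Toy example: 8 sites, 4 predicted structures, exchange transcript covering sites 2..5.
toy_counts = [4, 4, 0, 0, 2, 2, 4, 4]
print(relative_loopiness(toy_counts, 4, 2, 6))  # -0.75: the region is less loopy than the rest
```

A negative value means the exchange transcribed region forms stems more readily than the rest of the gene, a positive value the opposite.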


2.3. Candidate overlapping genes detected by Blastp alignments


In order to investigate potential protein coding by nucleotide exchange, I translated into putative protein sequences all six frames of all 13 human mitochondrial protein coding genes, after exchanging nucleotides according to each of the nine exchange rules. Translation of exchanged RNAs was done with the online software ‘transeq’ at the EMBL-EBI site (http://www.ebi.ac.uk/Tools/st/), according to the regular vertebrate mitochondrial genetic code, inserting asterisks (*) where stop codons occur in the exchange transcribed sequence. Hence putative proteins do not determine the identities of amino acids inserted where stop codons occur in the exchange transcribed RNA. For any single gene, a total of 6 × 9 = 54 hypothetical protein sequences were produced across frames and exchange rules, and a total of 13 × 54 = 702 hypothetical protein sequences were examined for the 13 regular protein coding sequences in the human mitochondrial genome. These 702 hypothetical protein sequences were analyzed by GenBank’s Blastp (Altschul et al., 1997, 2005) using standard default parameters, as used and described in previous publications (Seligmann, 2011a, 2012c,d,e, in press-a, in press-b). Blastp indicates whether the putative peptide is similar to proteins existing in GenBank. It produces a homology hypothesis that indicates the candidate overlapping genes coded after nucleotide exchange transcription.
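A six-frame translation with the vertebrate mitochondrial code can be sketched as below. This is not the transeq implementation: the function names are illustrative, the input is assumed to be an RNA string that has already been exchange transcribed, and the three reverse frames are taken from its reverse complement, which is an assumption about how the frames are defined.

```python
from itertools import product

# NCBI translation table 2 (vertebrate mitochondrial), codons ordered U, C, A, G.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def translate(rna):
    """Translate complete codons from position 0; stop codons are written as '*'."""
    codons = (rna[i:i + 3] for i in range(0, len(rna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def six_frames(rna):
    """Return the three forward and three reverse-complement frame translations."""
    reverse = rna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:]) for strand in (rna, reverse) for offset in range(3)]


print(translate("AUGAGAUAA"))  # M** (AGA and UAA are stops in this code)
print(len(six_frames("AUGAGAUAA")))  # 6 frames per sequence
print(6 * 9, 13 * 6 * 9)  # 54 translations per gene, 702 for the 13 genes
```

Keeping stops as asterisks, rather than truncating at the first stop, mirrors the way the putative proteins are described in the text.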


2.4. Duration spent single stranded by DNA during replication by mitochondrial protein coding genes

Some analyses below describe patterns in nucleotide contents due to replicational deamination gradients along a gradient of duration of DNA single strandedness during DNA replication. This is because single strandedness increases A→G and C→T deamination rates more than it increases the opposite mutations G→A and T→C (Fredrico et al., 1990). These spontaneous mutations are counterselected at coding sites, but have detectable effects on nucleotide contents at third codon positions in protein coding genes (Krishnan et al., 2004a,b; Seligmann et al., 2006). Third codon positions usually also have an additional coding role when a sequence is involved in overlap coding. Systematic nucleotide exchanges may reveal such overlapping protein coding genes. The replicational gradient should not be detectable in such overlap coding regions if these are functional (Seligmann, 2012a,d). Hence analyses of effects of replicational gradients on nucleotide contents at third codon positions should highlight the coding status of overlapping genes. For that purpose, durations spent single stranded during replication are calculated for each human mitochondrial protein coding gene, using the gene’s midpoint location along the genome. Duration spent single stranded by a site is a function of the distance of that site from the heavy strand replication origin (OH) and the light strand replication origin (OL). This duration spent single stranded is 2 × b/N for the genes ND1 and ND2 (genes located between the OH and OL), where b is the midpoint location of the gene, in number of nucleotides counted from the OH in the 5′→3′ direction of the genome’s heavy strand, and N is the total genome length.
Note that standard mitochondrial genome annotations in GenBank indicate the numberings according to the light strand, which may cause some confusion in calculating durations spent single stranded during replication (Tanaka and Ozawa, 1994; Raina et al., 2005; Seligmann et al., 2006). For the other genes, replicational single strandedness is 2 × (OL − b)/N, where OL indicates the midpoint location of the light strand replicational origin, according to heavy strand numbering.
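The two cases can be written compactly as follows. The symbol D_ss for the duration spent single stranded is introduced here only for illustration; b, N and OL are as defined in the text.

```latex
D_{ss} =
\begin{cases}
  \dfrac{2\,b}{N}, & \text{for ND1 and ND2 (located between } O_H \text{ and } O_L\text{)},\\[6pt]
  \dfrac{2\,(O_L - b)}{N}, & \text{for the other genes.}
\end{cases}
```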


2.5. Circular code analyses


The circular code theory indicates that a set of 20 autocomplementary codons (the 20 codons include the inverse complement of each of these codons) is overrepresented in the coding frame of regular protein coding genes (Arqués and Michel, 1996, 1997). Coded alphabetical communication in human languages typically consists of letters forming words, and of punctuation symbols (comma, question mark, etc.). Besides stop codons, codons coding for amino acids apparently also have ‘punctuation’ roles in the genetic code: the circular code codons apparently regulate the reading frame, as suggested also by circular code properties of ribosomal RNA that interacts with the mRNA (Michel, 2012). It seems that when more than one frame in a sequence is coding, the statistical property of overrepresentation of circular code codons is lost, perhaps because ‘punctuation’ signals of several frames are mixed, or absent to facilitate passage between frames. On the other hand, homopolymer codons (AAA, CCC, GGG, TTT), which tend to cause frameshifts (one of the main mechanisms for switching between coding frames), are relatively overrepresented in overlap coding regions (Ahmed et al., 2007; Ahmed and Michel, 2011). Sequences composed solely of these 20 codons have a non-redundant feature: if nucleotide triplets are not read according to the frame of the codons that compose the sequence, one will soon find a codon that is not part of the initial set of 20 codons, indicating that the reading frame is ‘incorrect’. This lack of redundancy between frames is one of the characteristics of circular codes, and could be related to the reason why circular code codons are underrepresented in overlapping genes. Hence the proportion of homopolymers among the sum of homopolymers and circular code codons should be greater in predicted overlap coding sequences than in adjacent regions of a gene.
Statistical confirmation of this prediction by sequence data should be considered as constituting independent evidence for the function of that sequence as overlap coding, in this case after systematic symmetric nucleotide exchange. Note that the natural circular code is characterized by a set of 20 autocomplementary codons associated with each frame, where the 20 circular codons of frames +1 and +2 are produced by specific permutations of the nucleotides in the circular codons of frame ‘0’ (frame +1: the first nucleotide of the frame ‘0’ circular codon is permuted to the third position; frame +2: the third nucleotide of the frame ‘0’ circular codon is permuted to the first codon position). None of these three sets of 20 circular codons includes any of the four homopolymers. Hence tests performed here are on averages of homopolymer/circular code proportions calculated over the three frames for each set of 20 circular code codons (frame 0, +1 and +2 circular codes).
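The homopolymer proportion described above can be sketched as follows. Several points are assumptions made for illustration: the codon set `X0` is the 20-codon set reported for the common circular code by Arqués and Michel (1996), which is not necessarily the set used for the mitochondrial analyses here; the function names are hypothetical; and averaging the proportion over the three permuted sets for one reading of the sequence is only one interpretation of the averaging over frames.

```python
# Illustrative frame 0 circular code (20 codons), DNA alphabet; an assumption, see lead-in.
X0 = frozenset(
    "AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC "
    "GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC".split()
)
HOMOPOLYMERS = frozenset({"AAA", "CCC", "GGG", "TTT"})


def shifted_codes(x0):
    """Return the frame 0, +1 and +2 codon sets obtained by cyclic permutation."""
    x1 = frozenset(c[1:] + c[0] for c in x0)  # first nucleotide moved to third position
    x2 = frozenset(c[2] + c[:2] for c in x0)  # third nucleotide moved to first position
    return [x0, x1, x2]


def homopolymer_proportion(seq, code):
    """Homopolymers / (homopolymers + circular code codons), or None if neither occurs."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    homo = sum(c in HOMOPOLYMERS for c in codons)
    circ = sum(c in code for c in codons)
    return homo / (homo + circ) if homo + circ else None


def mean_homopolymer_proportion(seq, x0=X0):
    """Average the proportion over the three permuted codon sets, skipping empty cases."""
    values = [homopolymer_proportion(seq, code) for code in shifted_codes(x0)]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None


toy = "AAAGAACCCTTC"  # codons AAA, GAA, CCC, TTC: two homopolymers, two X0 codons
print(homopolymer_proportion(toy, X0))  # 0.5
```

Under the prediction in the text, this proportion would be compared between a predicted overlap coding region and the adjacent regions of the same gene.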


2.6. Kinetics of nucleotide misinsertions and systematic nucleotide exchanges


It is plausible that rates (or frequencies) of the various types of systematic nucleotide exchanges during RNA transcription correspond to known rates of occasional nucleotide misinsertions by polymerases. These are not known for the RNA polymerase, but one might use as proxy those known for the human mitochondrial DNA polymerase gamma (Lee and Johnson, 2006). These kinetic parameters are indicated as kd and kpol in Table 2 of Lee and Johnson (2006). For each type of systematic symmetric nucleotide exchange, I averaged the corresponding kds and, separately, the kpols from Lee and Johnson (2006). For example, for the systematic symmetric nucleotide exchange A↔C, the kds averaged were 160 (A→C), 540 (C→A), 150 (G→G) and 57 (T→T), resulting in a mean kd of 226.75 μM for that type of nucleotide exchange. One expects some proportionality between these averages and independent estimates of frequencies or rates of nucleotide exchange polymerization. Positive results would be strong confirmation of the working hypothesis, as they would explain observations on transcripts existing in GenBank by independent parameters of DNA misinsertion polymerization kinetics.
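The averaging in the A↔C example can be reproduced directly. Only the four kd values quoted in the text are used; the dictionary name and its keys are illustrative labels.

```python
from statistics import mean

# kd values (in μM) quoted in the text for the A<->C exchange rule:
# the two exchanged insertions plus the two unchanged (regular) insertions.
KD_A_C_EXCHANGE = {"A->C": 160, "C->A": 540, "G->G": 150, "T->T": 57}

mean_kd = mean(KD_A_C_EXCHANGE.values())
print(mean_kd)  # 226.75
```

The same averaging would be applied to the kpol values and to each of the other exchange rules, using the corresponding entries of Lee and Johnson (2006).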


3. Results and discussion


3.1. RNAs in GenBank


A priori, there is no evidence that systematic nucleotide exchanges occur, but the large online databases of RNA sequences (expressed sequence tags, EST, in GenBank) allow searching for RNAs that match the assumed exchange-based recoding of regular genes. I explore, for all 9 symmetric nucleotide exchanges presented in Table 1, whether such RNAs exist in the database for the complete human mitochondrial genome. Table 2 presents all RNAs detected by Blastn (Zhang et al., 2000) for GenBank’s EST database that align with some parts of the human mitochondrial genome, after that genome has been recoded according to each of the nine systematic symmetric nucleotide exchanges. There are 51 such RNA sequences originating from 12 independent studies of RNA expression. No RNA sequence was detected for

293

Please cite this article in press as: Seligmann, H., Polymerization of non-complementary RNA: Systematic symmetric nucleotide exchanges mainly involving uracil produce mitochondrial RNA transcripts coding for cryptic overlapping genes. BioSystems (2013), http://dx.doi.org/10.1016/j.biosystems.2013.01.011

238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273

276 277 278 279 280 281 282 283 284 285 286 287 288 289 290

294 295 296 297 298 299 300 301 302 303 304 305

ARTICLE IN PRESS

G Model BIO 3360 1–19

H. Seligmann / BioSystems xxx (2013) xxx–xxx

4

Table 2 Human RNA transcripts detected by Blastn in GenBank’s EST database and aligning with human mitochondrial genome sequences after symmetrically exchanging nucleotides in the sequence. Columns are: 1. exchange nucleotide rule; 2. gene, and DNA strand matching EST transcript; 3. alignment first and last nucleotides; 4. alignment length; 5. similarity (%) between aligning sequences; 6. description of EST; 7. EST entry in GenBank; 9. EST reference. Sub

Gene

Loc

N

Si

Origin

ID

Ref

AG AG AG AG AU AU

ND1− ND5− 12s− Ser4− ND1+ ND1+

77 95 88 100 99 97 93 99

AI940581 57

5

AU

CO1−

122–634

513

99

AW176982 50

5

AU AU AU AU CG CG

ND4+ 12s+ 16s+ Leu2+ AT6+ 16s+

852–1240 2–268 1405–1559 1–75 486–669 565–715

391 271 156 75 188 151

99 97 99 100 83 97

CK327105 54 BF798660 51, BF798653 51, BF798647 (–) 52 BF798658 52, BF798678 53 BF798658 52, BF798678 53 BX457166 47 N41204 38, AJ574341 36, AJ574283 49

6 6 6 6 7 8, 4

CU CU CU CU CU

ND1+ CO1+ ND4+ ND4+ CytB+

770–952 1509–1542 851–1042 1265–1378 852–1133

183 34 196 114 285

99 100 95 96 99

Similar to NADH1, renal cell tumor Homo sapiens hypothalamus Normalized rat brain Female pectoral muscle after mastectomy Colon Colon ins Colon ins Head neck, FAPESP/LICR Human Cancer Genome Project Colon, The FAPESP/LICR Human Cancer Genome Project Colon Colon ins Colon ins Colon ins Thymus Adult heart, female pectoral muscle after mastectomy Female pectoral muscle after mastectomy Female pectoral muscle after mastectomy Prostate normal Female pectoral muscle after mastectomy Hypothalamus

1 2 3 4 5 6

ND2+

131 165 138 68 495 231 210 395

AI367501 62 AV721614 30 AI230934 53 AJ574322 50, AJ574357 51 AW176957 53 BF798658 52, BF798678 53 BF798657 50

AU

687–812 1423–1577 403–634 2–69 357–851 1–231 27–231 202–593

4 4 6 4 2, 4

CU CU CU

16s+ 16s+ 16s+

24–401 361–556 411–776

378 196 367

97 95 96

AJ574326 63 AJ574346 48, AJ574322 45 BF370011 56 AJ574311 60 AV722273 59, AJ574321 56, AJ574347 57, AJ574344 58, AJ574371 55, AJ574334 53, AJ574333 54 AV722267 45 AJ574370 46 AV721363 48, AJ574341 46, AJ574327 45

CU CU GU GU GU

16s+ 16s+ ND1+ ND2− AT6+

122–382 724–937 12–377 564–902 32–150

211 548 369 353 155

99 99 92 84 88

AJ574335 40, AJ574332 42 AJ574378 49, AJ574291 49 AI525967 49 BI032899 59 CF425368 36

4 4 9 6 10

GU GU GU GU GU

AT6− CytB+ 16s+ 16s+ 16s+

386–536 188–331 328–780 46–423 57–1130

120 147 456 302 281

80 94 95 96 80

AW381066 61 AA413440 46 AI541277 48 AI525977 60 C15855 41

5 11 9 9 12

Hypothalamus Female pectoral muscle after mastectomy Hypothalamus, female pectoral muscle after mastectomy Female pectoral muscle after mastectomy Female pectoral muscle after mastectomy Cell line Nervous normal Gastric epithelial progenitor Mus musculus, ATP6 Head neck Adult heart Tissue culture Cell line Human aorta polyA+ mRNA

2 4 2, 4

1. Strausberg, 1997. National Cancer Institute, Cancer Genome Anatomy Project (CGAP), Tumor Gene Index, unpub. 2. Gu, Y., Peng, Y., Song, H., Huang, Q., Yang, Y., Gao, G., Xiao, H., Xu, X., Li, N., Qian, B., Liu, F., Qu, J., Gao, X., Cheng, Z., Xu, Z., Zeng, L., Xu, S., Gu, W., Tu, Y., Jia, J., Fu, G., Ren, S., Zhong, M., Lu, G., Hu, R., Chen, J., Chen, Z., Han, Z., 2000. Homo sapiens cDNA HTB clones, unpub. 3. Lee, N.H., Glodek, A., Chandra, I., Mason, T.M., Quackenbush, J., Kerlavage, A.R., Adams, M.D., 1998. Rat Genome Project: Generation of a Rat EST (REST) Catalog & Rat Gene Index, unpub. 4. Laveder, P., De Pitta, C., Vitulo, N., Valle, G., Lanfranchi, G., 2003. Oligo-directed RNase H cleavage of abundant mRNAs in skeletal muscle, unpub. 5. Simpson, A.J.G., 1999 The FAPESP/LICR Human Cancer Genome Project, unpub. 6. Dias Neto et al. (2000). 7. Li, W.B., Gruber, C., Jessee, J., Polayes, D. 2001. Full-length cDNA libraries and normalization, unpub. 8. Lui et al. (1995). 9. Huang et al. (1999). 10. Tidwell, R., Clifton, S., Marra, M., Hillier, L., Pape, D., Martin, J., Wylie, T., Theising, B., Bowers, Y., Gibbons, M., Ritter, E., Bennet, J., Ronko, I., Tsagareishvili, R., Belaygorod, L., Grow, A., Maguire, L., Waterston, R., Wilson, R., 2002. Unpublished. 11. Liew et al. (1994). 12. Fujiwara, T., Hirano, H., Katagiri, T., Kawai, A., Kuga, Y., Nagata, M., Okuno, S., Ozaki, K., Shimizu, F., Shimada, Y., Shinomiya, H., Takaichi, A., Takeda, S., Watanabe, T., Takahashi, E., Hirai, Y., Maekawa, H., Shin, S., Nakamura, Y., 1995, unpub.


three exchange types: the exchange A↔C, and two of the three exchange rules involving all four nucleotide types, A↔C+G↔U, and A↔G+C↔U. Among nucleotide exchanges involving only two nucleotides, the most common were RNAs where recoding exchanges involve uracil: C↔U (21 sequences), A↔U (14 sequences) and G↔U (8 sequences). The systematic exchanges A↔G and C↔G were found in 5 RNA sequences each. The exchange involving all four nucleotides A↔U+C↔G was quite common (10 sequences, data not presented in Table 2) and is analyzed in detail separately (Seligmann, 2012d). It is first of all notable that the three most common exchanges are those between uracil and the three other nucleotides. Hence uracil, which replaces thymine during regular transcription, also seems most frequently involved in ‘unusual’ exchanging transcription. Blastn analyses detect in total 61 ‘exchanged’ sequences, including the 10 for the A↔U+C↔G exchange rule, which are not presented in Table 2 (these 10 transcripts are presented in Table 1 of Seligmann, 2012d). This is 0.56% of the 10899 ESTs annotated as of human mitochondrial origin in GenBank’s database by June 2012. It would also be very interesting in this context to explore the high accuracy transcript data available for the human mitochondrial transcriptome (Mercer et al., 2011a,b). These data, available at http://mitochondria.matticklab.com, are not searchable at this point along the guidelines of nucleotide exchanging

Please cite this article in press as: Seligmann, H., Polymerization of non-complementary RNA: Systematic symmetric nucleotide exchanges mainly involving uracil produce mitochondrial RNA transcripts coding for cryptic overlapping genes. BioSystems (2013), http://dx.doi.org/10.1016/j.biosystems.2013.01.011


G Model BIO 3360 1–19

ARTICLE IN PRESS H. Seligmann / BioSystems xxx (2013) xxx–xxx

Fig. 1. Loopiness of transcripts in Table 2 as a function of their relative length. Secondary structure predictions estimate the average number of structures in which the average site does not form a stem by self-hybridization in RNA (loopiness), assuming symmetric exchanging transcription. The y axis is the subtraction of that mean for gene regions that are not within such nucleotide exchanging transcripts, from the mean loopiness of the regions transcribed by exchanging transcription and listed in Table 2. The x axis is the relative proportion the exchanging transcript represents from the total length of that gene. Gene identities, and the types of symmetric nucleotide exchange, are indicated next to each datapoint.


RNA transcription, but this database would probably yield additional insights into the frequencies of the various types of exchange transcriptions.

3.1.1. Exchanging artifacts

The various sequences in Table 2 originate from 12 studies of RNA, with RNAs matching different types of exchanges originating in some cases from the same study, and RNAs matching some types of exchanges originating from several studies. These data are important material evidence suggesting a family of previously undescribed RNA recoding types, a potentially major discovery for genomics and molecular biology. For that reason, these sequence data are examined for a number of possible artifacts. First, if all or most sequences originated from a single study, one could suggest that the exchanges were due to specific conditions associated with that study. Possibly, erroneous sequence manipulation, perhaps incorrect or only partial inverse complementing of sequences by semi-automated methods, could have created the sequences in Table 2. For example, the only symmetric exchange involving all four nucleotides for which corresponding RNA has been detected (A↔U+C↔G) can result from complementing a sequence without reversing the nucleotide order, a possible sequence manipulation error that could create the ten BLAST hits matching this exchange rule (which are not reported in Table 2). For analyses excluding the possibility of artifacts for these 10 sequences, see Fig. 1 in Seligmann (2012d), which shows that their length increases with their relative secondary structure formation capacities. Erroneous partial complementing (of A↔U or C↔G) could create the RNAs detected and matching these two additional types of nucleotide exchanges. However, these annotation artifacts could not explain the occurrences of RNA corresponding to A↔G, C↔U and G↔U. It is most probable that the data in Table 2 are not the result of such in silico sequence manipulations, especially since as many as 12 studies produced such sequences.
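The complementing artifact discussed above can be made concrete with a short sketch. It assumes sequences written in the RNA alphabet; the function names and the example sequence are hypothetical and serve only as illustration. It shows that complementing a sequence without reversing it is the same operation as the A↔U+C↔G exchange, which is why that exchange type in particular could arise from a manipulation error.

```python
# Illustrative sketch only: names and the example sequence are hypothetical.
COMPLEMENT = str.maketrans("ACGU", "UGCA")  # Watson-Crick pairs A-U and C-G


def complement(seq: str) -> str:
    """Complement every nucleotide while keeping the original order."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Correct inverse complement: complement, then reverse the order."""
    return complement(seq)[::-1]


def exchange(seq: str, a: str, b: str) -> str:
    """Symmetric exchange: every a becomes b and every b becomes a."""
    return seq.translate(str.maketrans(a + b, b + a))


seq = "AUGGCCAUUCGA"  # hypothetical transcript fragment

# Applying A<->U and then C<->G gives the four-nucleotide exchange rule.
both_pairs = exchange(exchange(seq, "A", "U"), "C", "G")

# Complementing without reversing equals the A<->U + C<->G exchange.
print(complement(seq) == both_pairs)  # True

# The correct inverse complement differs from it only by the reversal step.
print(reverse_complement(seq) == both_pairs[::-1])  # True
```

No combination of complementing steps swaps A with G, C with U, or G with U, which matches the argument that annotation errors of this kind cannot account for those exchange types.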


Another possibility is that of a statistical artifact. The exchanged sequences usually exchange between two nucleotides, so they remain identical to the original, regular sequence for the two other nucleotides. Hence, if all four nucleotides were equally frequent, half of the nucleotides would be exchanged, and the expected mean similarity between the exchanged RNA and the regular transcript would be 50%. However, all (but one) similarities in Table 2 are >80%, and only 7 are below 90%. Nucleotide ratios vary locally, so high similarities that do not imply exchange transcription are possible because locally, in these sequences, the exchanged nucleotides might have very low frequencies. However, the sequences in Table 2 and their high similarities are not compatible with extreme local nucleotide biases creating the illusion of exchange transcription: the exchanged nucleotides never represent less than 30% of the EST sequence, which would yield at best a similarity of 70% with the regular transcript if no nucleotide exchange actually occurred, that is, if the RNA reported in Table 2 resulted not from exchanging transcription but from a low local proportion of the exchanged nucleotides in its composition. In fact, all the RNA sequences include all four nucleotides, in proportions that seem incompatible with the high similarities observed if no systematic transcriptional exchange occurred (see Table 2; percentages are indicated next to GenBank entries). Hence nucleotide biases did not create false positives for exchanging transcription for the vast majority of transcripts presented in Table 2. Therefore, most data in Table 2 do not result from statistical artifacts involving local nucleotide biases. The specific cases of potential exceptions, three transcripts in Table 2 with low similarities, are examined in some detail in a section below.
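The bound used in this argument can be illustrated with a small sketch. When a sequence is compared with the version of itself in which two nucleotides are systematically exchanged, only positions holding the two other nucleotides can match, so the identity equals one minus the combined frequency of the exchanged nucleotides. The function names and the toy sequence are assumptions made for illustration; they are not data from Table 2.

```python
# Illustrative sketch only: names and the toy sequence are hypothetical.
def exchange(seq: str, a: str, b: str) -> str:
    """Symmetric exchange: every a becomes b and every b becomes a."""
    return seq.translate(str.maketrans(a + b, b + a))


def exchanged_fraction(seq: str, a: str, b: str) -> float:
    """Combined frequency of the two exchanged nucleotides in seq."""
    return (seq.count(a) + seq.count(b)) / len(seq)


def identity(s1: str, s2: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)


seq = "AAAACCGGUU"  # toy sequence: C and U together make up 40% of it
frac = exchanged_fraction(seq, "C", "U")
ident = identity(seq, exchange(seq, "C", "U"))

# Identity with the exchanged version is 1 minus the exchanged fraction.
print(f"{frac:.2f} {ident:.2f}")  # 0.40 0.60
```

With exchanged nucleotides making up at least 30% of a sequence, the same calculation caps the identity at 70%, which is the threshold stated in the text.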


3.1.2. Alternative biological explanations

The next potential problems with the nucleotide exchange interpretation of the data in Table 2 are biological in nature. It is possible that regular transcription of other, nuclear DNA sequences produces the transcripts in Table 2. This possibility cannot be totally ruled out a priori. The RNAs in Table 2 have high and even very high similarities with the mitochondrial sequences after assuming exchanging transcription. This means that if these RNAs are produced by regular transcription of nuclear (or cytosolic) sequences, and not by exchanging transcription of mitochondrial sequences, these nuclear sequences resulted from exchanging reverse transcription of mitochondrial RNA into nuclear DNA, or from some other exchanging process that created a nuclear mitochondrial (pseudo)gene through systematic nucleotide exchanges. Hence even if the RNAs in Table 2 were not the result of exchanging transcription (=exchanging RNA polymerization), they would reflect exchanging DNA polymerization. BLAST searches of GenBank’s human genome data do not yield any positive hits for the mitochondrial sequences transformed according to any of the nine symmetric exchanging transcription rules. This negative result does not totally rule out the possibility that regular transcription of nuclear or cytosolic DNA is at the origin of the RNAs in Table 2, but there is no evidence to support this possibility at this point. Hence this biological interpretation seems unlikely. However, even if this nuclear origin were true, it would indicate that DNA polymerization following exchanging rules occurs. Such exchanging DNA polymerization would still be important indirect evidence in favor of the working hypothesis that exchanging RNA polymerization occurs, and would be compatible with the existence of protein coding genes within these exchanged sequences. It would be direct evidence for the creation of new genes through nucleotide exchanges.
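As an illustration of how the nine symmetric exchange rules can be enumerated and applied to build search queries, here is a minimal sketch. It assumes the RNA alphabet; the function names and the template sequence are hypothetical, and the sketch is not the procedure actually used for the BLAST searches described above.

```python
from itertools import combinations

NUCLEOTIDES = "ACGU"


def symmetric_exchange_rules():
    """Return the nine symmetric rules as tuples of exchanged pairs."""
    pairs = list(combinations(NUCLEOTIDES, 2))  # six two-nucleotide exchanges
    rules = [(pair,) for pair in pairs]
    for pair in pairs:
        rest = tuple(n for n in NUCLEOTIDES if n not in pair)
        if pair < rest:  # keep each four-nucleotide exchange only once
            rules.append((pair, rest))
    return rules


def apply_rule(seq, rule):
    """Exchange nucleotides in seq according to every pair in the rule."""
    src = "".join(a + b for a, b in rule)
    dst = "".join(b + a for a, b in rule)
    return seq.translate(str.maketrans(src, dst))


def rule_name(rule):
    """Readable label such as 'A<->U+C<->G'."""
    return "+".join(f"{a}<->{b}" for a, b in rule)


template = "AUGCCCUAAG"  # hypothetical sequence, for illustration only
for rule in symmetric_exchange_rules():
    print(rule_name(rule), apply_rule(template, rule))
```

Because every rule is symmetric, applying it twice restores the original sequence, so the same function converts a candidate transcript back to the regular sequence for comparison.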
The last considered biological alternative to exchanging transcription relates to the fact that all the transcripts in Table 2 originate from studies that date from before the year 2004. This suggests that the RNAs might result from rare dysfunctions by the


reverse transcriptases used in the creation of cDNA libraries from extracted RNA transcripts, which form the EST databases. It is indeed possible that such flaws were discovered at some point and that EST libraries produced after 2003 are free of these flaws. If this is the case, the data in Table 2 do not directly reflect exchanging transcription activity, but exchanging reverse transcription. At this stage, the correct interpretation of the data presented in Table 2 could be that natural exchanging transcription occasionally occurs, or that exchanging reverse transcription by the enzymes used to create the EST libraries occasionally occurs. However, even in the latter case, such exchanging reverse transcription would be at least indirect evidence that exchanging transcription and associated protein coding genes might exist, as RNA and DNA polymerases have great similarities. It is notable in this context that the human mitochondrial polymerase gamma, which usually replicates the mitochondrial genome, also has reverse transcription activity (Kasiviswanathan and Copeland, 2011). Even if the RNAs in Table 2 are not of natural origin, but result from some kind of dysfunction of the reverse transcriptase used to produce the EST libraries, the frequencies of the various types of nucleotide exchanges suggested by the data in Table 2 would still be informative from a biological point of view: these dysfunction frequencies would probably indicate the occurrence of natural dysfunctions of these types. In the next section, analyses of secondary structures formed by exchanged transcripts suggest that the reverse transcriptase ‘artefact’ is the less likely explanation.

3.1.3. Secondary structure formation by exchanging transcripts

According to the scenarios described in Section 3.1.2, the data in Table 2 would reflect a biological reality, which has 2–3 alternative interpretations, all based on the principle of ‘exchanging’ polymerization: of RNA on the basis of DNA (transcription), of DNA on the basis of DNA (replication), or of DNA on the basis of RNA (reverse transcription). A further analysis confirms that the sequences in Table 2 reflect a biological phenomenon, most probably due to RNA polymerization, though transcript editing cannot be ruled out. Using Mfold (Zuker, 2003), secondary structure formation was predicted for the complete (exchanged) sequence of each gene for which exchange transcripts were found (Table 2), assuming transcription according to the specific exchanging rule. I calculated the mean percentage participation of nucleotides in loops separately for the regions belonging to the RNA sequences presented in Table 2 and for the rest of that gene. Loopiness of the ‘exchanged RNA’ was greater in regions that underwent exchanging transcription according to Table 2 than in the rest of the gene (assuming exchanging transcription for that region too, though no such exchange transcription was detected for it) in 23 of the 33 (62%) sequences for which ‘exchanging’ transcripts were detected. Table 2 lists more sequences because in some cases several transcripts were found matching the same genome region. This slight majority is statistically significant according to a one tailed sign test (P = 0.047), suggesting that the transcripts produced by exchanging transcription usually tend to form less secondary structure than the rest of the gene (assuming it was also exchange transcribed). This means that the transcripts in Table 2 have a common feature, and are not a random sample of potential transcripts.
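The loopiness comparison described above can be sketched as follows. This sketch assumes that the predicted structures are available in dot-bracket notation (a '.' marks an unpaired site); that input format, the function names and the three toy structures are assumptions made for illustration, and Mfold output would first need converting to this form.

```python
# Illustrative sketch only: input format, names and structures are assumed.
def site_loopiness(structures):
    """Per-site fraction of structures in which the site is unpaired."""
    length = len(structures[0])
    if any(len(s) != length for s in structures):
        raise ValueError("all structures must have the same length")
    n = len(structures)
    return [sum(s[i] == "." for s in structures) / n for i in range(length)]


def relative_loopiness(structures, start, end):
    """Mean loopiness inside [start, end) minus the mean for the rest."""
    per_site = site_loopiness(structures)
    inside = per_site[start:end]
    outside = per_site[:start] + per_site[end:]
    return sum(inside) / len(inside) - sum(outside) / len(outside)


# Three hypothetical alternative structures for one 10-nucleotide sequence.
structures = [
    "((....))..",
    "(((..)))..",
    "..((..))..",
]

# A positive value means the region is loopier than the rest of the gene.
print(relative_loopiness(structures, 2, 6))
```

The difference returned by `relative_loopiness` corresponds to the quantity plotted on the y axis of Fig. 1, under the stated assumptions about the input.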
The tendency for high loopiness differed among the various types of exchange transcription: it is weakest for A↔U exchanges (33%), intermediate for C↔G and G↔U exchanges (50%), 60% for A↔U+C↔G exchanges (not in Table 2), and strongest for A↔G and C↔U exchanges (100 and 90%, respectively). These differences are not statistically significant, as the number of cases is too low to even consider statistical tests (one transcript for A↔G and two for C↔G).

However, for C↔U, with 10 cases, a one tailed sign test indicates a statistically significant tendency for greater loopiness in regions that underwent C↔U exchanging transcription than in other regions of the same gene, assuming they too had undergone nucleotide exchange transcription along the C↔U rule. For the 10 transcripts following the C↔U rule, loopiness is greater than for the rest of the gene in 9 of 10 cases; according to a binomial distribution (the distribution used in sign tests), this yields the statistical significance P = 0.0054. In other words, assuming that loopiness in exchange transcribed regions is as likely to be above as below the loopiness in surrounding regions, the probability of getting 9 among 10 exchange transcripts with greater loopiness is about half a percent. Hence it is unlikely that exchange transcribed regions have on average the same loopiness as other regions. This tendency indicates that self-hybridization disfavours the production of ‘exchanged’ transcripts. This strengthens the possibility that exchanges result from editing of transcripts, where secondary structure might prevent or at least impede editing after polymerization. However, this does not preclude that exchanges occurred during transcription itself. If RNA polymerization that systematically exchanges nucleotides is relatively slow, it might be particularly impeded by secondary structure formation, and hence loopiness might promote it.
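The logic of a one tailed sign test can be sketched as an exact binomial tail with success probability 0.5. The function name is hypothetical, and the sketch only illustrates the principle: the value it returns depends on the exact counts and on the tail convention, so it is not claimed to reproduce the P values reported above.

```python
from math import comb


def sign_test_one_tailed(successes: int, n: int) -> float:
    """P(X >= successes) for X ~ Binomial(n, 0.5)."""
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    favourable = sum(comb(n, k) for k in range(successes, n + 1))
    return favourable / 2 ** n


# Example call: 9 regions out of 10 with greater loopiness.
p_value = sign_test_one_tailed(9, 10)
print(p_value)
```

The null hypothesis encoded here is that a region is equally likely to be more or less loopy than the rest of its gene, as stated in the text.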


3.1.4. The length of exchanging transcripts

An additional observation, which relates the capacity for secondary structure formation by the RNAs in Table 2 to their length, might give a clue on the nature of the process involved: the loopiness of exchanging transcripts, as compared to the rest of the gene sequence, decreases with the relative length of the exchanging transcript (Fig. 1). This suggests that exchange transcription (or editing) is favored by free access to the elongating RNA polymer, but that in order for that polymer to reach a sizeable proportion of the total length of the gene, it should form secondary structure. The ‘paradox’ between the requirement that an exchange-transcribed region forms little secondary structure, and the requirement, for its elongation, that it self-hybridizes, could explain why transcripts produced by systematic nucleotide exchanges are rare. I propose in this context the following interpretation. By definition, exchanging transcription does not produce RNA that is the inverse complement of its template DNA strand, and hence the elongating RNA does not form a duplex with DNA during its elongation. As a result, it is single stranded and open to digestion by endoribonucleases, which would shorten it. Hence in order to reach relatively great lengths, polymerization of non-complementary RNA (or DNA) requires protection by self-hybridization (secondary structure formation), as it cannot be protected by hybridization with existing DNA (or RNA) as for regular transcripts. Regular transcripts are protected both by hybridization with the ‘maternal’ strand and by self-hybridization, but for transcripts produced by exchanging transcription, complementarity is much lower, and hence elongation is much more dependent on protection due to self-hybridization.
In the extreme case of the exchange rule that involves two pairs of exchanged nucleotides (A↔U+C↔G), there is no complementarity at all, and protection can only result from self-hybridization. Therefore for the 10 transcripts following that rule, the correlation between relative loopiness and transcript length is much stronger than for the other exchange transcription types: r = −0.65. The association for the rest of the transcripts, in Fig. 1, is much weaker and is only statistically significant if transcripts are split into two groups, those below and those above the relative length of 20% of the length of their gene. A one tailed Fisher exact test indicates that there are more transcripts with more loopiness in the exchange transcribed part than


the rest of the gene (positive loopiness values on the y axis in Fig. 1) for transcripts with relative length below 20% (P = 0.052). Excluding the extreme length outlier indicated by a triangle in Fig. 1 yields P = 0.042. Including the ten sequences of the A↔U+C↔G exchange type (from Table 1 in Seligmann, 2012d), the test yields P = 0.023. These negative correlations between exchange transcript lengths and loopiness indicate that endoribonucleases (or other enzymes with similar activities) are active during exchanging transcript production. This situation is not totally incompatible with the possibility that the sequences listed in Table 2 were produced by occasional dysfunctional reverse transcription to create the cDNA libraries in GenBank, but seems more in line with transcription occurring under natural physiological circumstances. Hence the working hypothesis that various types of exchanging transcription occasionally occur under natural physiological conditions seems the most probable explanation for the data in Table 2, and is not incompatible with the alternative explanations that could not be totally ruled out (exchanging replication or exchanging reverse transcription).

3.1.5. Putative protein coding genes in exchanging transcripts

Table 2 presents some data favoring the working hypothesis of exchange transcription. The working hypothesis is formulated on the basis of an evolutionary principle of minimizing costs due to genome size, assuming that overlap coding (which in this case results from exchange transcription) increases the number of genes and the genome’s coding density without increasing its size.
Hence evidence confirming that transcripts of protein coding genes, after systematic nucleotide exchange, potentially include regions that code for proteins would strengthen the hypothesis on two grounds: first, because it would confirm the basic evolutionary principle underlying the advantage associated with exchange transcription by indicating its role in revealing coding potential; and second, because consistent patterns in (exchange transcription) overlap coding genes would be in themselves evidence that exchange transcription occurs, independently of physical evidence for RNA transcripts presumably produced by exchange transcription (Table 2). In addition, if analyses of coding properties of RNA after nucleotide exchange converge with those in Table 2, for example if coding seems more probable for exchange types that are relatively more represented in Table 2, and less probable in those for which no transcripts were detected, this coherence between different types of independent data and analyses would, in the context of a meta-analysis, be strong evidence for (overlap) protein coding based on exchange transcription. There are 702 hypothetical peptides for the 13 human mitochondrial protein coding genes (13 genes × 6 frames × 9 exchange rules). These were analyzed by GenBank’s Blastp (Altschul et al., 1997, 2005) and hits with proteins existing in GenBank were recorded (Table 3). These analyses produced numerous hits, from 9 for A↔C exchanges to 36 for G↔U exchanges, in total between 483 codons (for A↔C exchanges) and 2801 codons (for A↔G exchanges) putatively involved in overlap coding associated with exchange transcription. It is notable that several hits, mainly for exchanges involving the transitions C↔U and A↔G, were for the frame corresponding to the gene’s regular main frame, and with proteins that are homologous to the protein coded by the regular main frame of that gene. 
These cases may be of interest, but are excluded from analyses of overlapping genes presented here, and also from the statistics on putative overlapping genes at the beginning of this paragraph. It is notable that the average length of putative alignments detected by Blastp for a type of nucleotide exchange is proportional to the number of transcripts detected for that type of exchange as reported in Table 2 (Pearson parametric correlation coefficient r = 0.747, P = 0.0104; Spearman nonparametric correlation

Fig. 2. Mean length of putative overlapping protein coding genes predicted by Blastp analyses (from Table 3) as a function of the number of exchanging transcripts according to that exchange rule (from Table 2). The type of nucleotide exchange assumed by analyses is indicated next to each datapoint, followed by the number of alignments with GenBank proteins interacting with DNA or RNA, membrane proteins, and proteins with physiological functions typical of mitochondria.

coefficient rs = 0.75, P = 0.0166, one tailed tests, Fig. 2). This result is a type of meta-analysis of the data for all exchange transcription types, and it indicates that, overall, overlap coding by nucleotide exchange might occur in proportion to the observed frequency of exchange transcription.
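For readers who want to see how the two coefficients are obtained, here is a minimal sketch of the Pearson and Spearman correlations. The function names and the numbers in the example are hypothetical placeholders, not the values behind Fig. 2, so the sketch does not reproduce r = 0.747 or rs = 0.75.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical placeholder data: transcript counts and mean alignment lengths.
transcripts = [0, 0, 5, 5, 8, 14, 21]
mean_length = [54, 60, 70, 65, 78, 90, 104]
print(pearson(transcripts, mean_length), spearman(transcripts, mean_length))
```

Spearman's coefficient depends only on the ordering of the values, which is why it is reported alongside the parametric Pearson coefficient when the number of points is small.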


3.1.6. Functions of proteins coded by ‘nucleotide exchange’ encrypted overlapping genes

Table 3 suggests that Blastp alignment analyses of the 702 peptides translated from the exchange transcribed human protein coding genes detect 168 previously undetected polypeptides putatively coded by exchange overlap coding. This means that 23.9% of the hypothetical translated sequences (the percentage ranges from 11.5% for A↔C exchanges, to 46.2% for A↔U exchanges) are potentially protein coding. For the sake of comparison, note that for ‘regular’ overlap coding in the same sequences of that species, induced by suppressor tRNA activity, the same Blastp analyses yield 24 putative overlapping genes (36.9% of the hypothetical translated sequences from the five alternative frames for the 13 genes, see Seligmann, 2011a). According to Table 3, these putative exchange overlapping genes include alignments with 11 proteins interacting with DNA or RNA: 4 for G↔U exchanges, 3 each for A↔G and A↔U+C↔G exchanges (data for the latter are not in Table 3; that exchange type is analyzed in detail by Seligmann, 2012d), and one for A↔G+C↔U exchanges. These putative overlapping genes might themselves code for protein(s) involved in the production of the exchange transcripts. This could be indicated by the positive correlation between their percentage within the sample of putative overlapping genes and observed exchange transcript numbers (r = 0.45, not statistically significant even at P < 0.20). Fig. 2 indicates the number of such candidate overlapping genes for each type of nucleotide exchange, the number of predicted membrane proteins, and the number of proteins with functions frequently associated with typical mitochondrial metabolism. The latter are most numerous, 29 in total, occur in all nucleotide exchange types (least (one) for A↔U, most (6) for G↔U), and include sequences aligning with an alkyl hyperoxide reductase for G↔U exchange


Table 3. List of GenBank proteins aligning according to Blastp with putative peptide sequences translated from mitochondrial protein coding genes assuming nucleotide exchanging transcription. Columns are: 1. gene identity and frame (1–3, + strand; 4–6, − strand); 2. first and last amino acids in alignment; 3. alignment length; 4. alignment similarity; 5. entry of GenBank protein aligning with peptide translated from exchanging transcription; 6. description of GenBank protein; 7. number of stops in alignments; 8. type of exchanging transcription.

| Gene | Loc | N | Si | Id | Origin | Ter | Exchange |
|---|---|---|---|---|---|---|---|
| ND1 1 | 48–180 | 139 | 45 | AEQ35744 | ND1 Pan troglodytes | 5 | G-T |
| ND1 3 | 5–98 | 104 | 41 | AAB39554 | Nitrate reductase Agrostemma githago | 1 | G-T |
| ND1 4 | 49–282 | 246 | 35 | EFC39457 | Hyp. Naegleria gruberi | 4 | G-T |
| ND1 6 | 188–290 | 104 | 49 | EGH13632 | Exonuclease Pseudomonas syringa | 0 | G-T |
| ND2 1 | 3–341 | 353 | 41 | AAL48391 | ND2 Homo sapiens | 29 | G-T |
| ND2 5 | 297–340 | 47 | 57 | EFZ00925 | Hyp. Metarhizium anisopliae | 0 | G-T |
| ND2 6 | 169–282 | 126 | 46 | ABM38496 | Antitermination Polaromonas naphthalenivorans | 0 | G-T |
| CO1 6 | 150–220 | 73 | 51 | EFQ64751 | Hyp. Pseudomonas fluorescens | 0 | G-T |
| CO1 6 | 307–460 | 154 | 43 | XP002804370 | Hyp. Macaca mulatta | 3 | G-T |
| CO2 3 | 14–107 | 96 | 42 | XP003230408 | Brevican core -like Anolis carolinensis | 0 | G-T |
| CO2 6 | 7–35 | 29 | 69 | EGT47210 | Hyp. Caenorhabditis brenneri | 0 | G-T |
| AT8 1 | 2–68 | 67 | 55 | ACR09286 | AT8 Homo sapiens | 7 | G-T |
| AT8 4 | 1–58 | 61 | 46 | EGF97967 | Hyp. Melampsora larici-populina | 0 | G-T |
| AT8 4 | 28–68 | 40 | 58 | AEE71795 | Lipoprotein Propionibacterium acnes | 0 | G-T |
| AT8 5 | 1–45 | 45 | 64 | AAG44787 | DC48 Homo sapiens | 1 | G-T |
| AT8 5 | 41–59 | 19 | 84 | EEV38750 | Glycosyl transferase I Enterococcus casseliflavus | 0 | G-T |
| AT8 6 | 20–51 | 32 | 59 | BAL01249 | Sodium/glutamate symporter Oscillibacter valericigenes | 0 | G-T |
| AT8 6 | 7–57 | 51 | 51 | CAM64590 | Hyp. Mycobacterium abscessus | 0 | G-T |
| AT8 6 | 14–59 | 47 | 55 | EEY51301 | Glycosyltransferase 2 Bacteroides sp. 2 1 33B | 0 | G-T |
| AT6 1 | 11–226 | 227 | 43 | ADU77956 | AT6 Homo sapiens | 22 | G-T |
| AT6 3 | 80–149 | 70 | 47 | AF467769 | Glycoprotein precursor Crimean-Congo hemorrhagic fever virus | 3 | G-T |
| AT6 5 | 115–170 | 58 | 52 | EGV19704 | Hyp. Thiocapsa marina | 0 | G-T |
| AT6 6 | 151–224 | 74 | 61 | AAG44787 | DC48 Homo sapiens | 1 | G-T |
| AT6 6 | 9–75 | 68 | 49 | EED92930 | Hyp. Thalassiosira pseudonana | 0 | G-T |
| AT6 6 | 115–172 | 59 | 53 | CCB67068 | Hyp. Hyphomicrobium sp. MC1 | 0 | G-T |
| CO3 1 | 2–158 | 170 | 40 | ADL31200 | CO3 Homo sapiens | 12 | G-T |
| CO3 3 | 16–166 | 119 | 50 | BAB93516 | OK/SW-CL.16, Homo sapiens | 6 | G-T |
| CO3 5 | 111–169 | 59 | 49 | EGU76154 | Hyp. Fusarium oxysporum | 1 | G-T |
| CO3 5 | 103–141 | 39 | 64 | EDP05187 | Hyp. Chlamydomonas reinhardtii | 0 | G-T |
| ND3 4 | 31–95 | 65 | 43 | EGW6025 | Transcriptional regulator Dechlorosoma suillum | 0 | G-T |
| ND3 6 | 41–76 | 36 | 53 | CAJ86300 | H0124B04.17 Oryza sativa Indica | 0 | G-T |
| ND4l 3 | 39–75 | 37 | 62 | EGU85298 | Hyp. Fusarium oxysporum | 0 | G-T |
| ND4l 5 | 4–92 | 89 | 46 | EEA93813 | Alkyl hyperoxide reductase Pseudovibrio | 1 | G-T |
| ND4l 6 | 31–92 | 62 | 48 | BQ76407 | Diguanylate cyclase/phosphodiesterase with PAS/PAC sensor Pseudomonas putida | 0 | G-T |
| ND4 1 | 2–451 | 463 | 39 | ADL31476 | ND4 Homo sapiens | 50 | G-T |
| ND4 3 | 92–216 | 125 | 46 | BAC5228 | Hyp. Homo sapiens | 6 | G-T |
| ND4 3 | 137–265 | 121 | 44 | AAG44628 | DC24 Homo sapiens | 3 | G-T |
| ND4 5 | 246–368 | 138 | 42 | EAT38723 | Hyp. Aedes aegypti | 0 | G-T |
| ND5 1 | 2–581 | 614 | 43 | ACU09622 | ND5 Homo sapiens | 50 | G-T |
| ND5 6 | 188–255 | 71 | 46 | ADH63862 | O-Acetylhomoserine/O-acetylserine sulfhydrylase Meiothermus silvanus | 1 | G-T |
| ND6 6 | 112–172 | 66 | 45 | CAB07382 | Caenorhabditis elegans | 0 | G-T |
| CytB 6 | 199–276 | 78 | 46 | Y86845 | Serine esterase family Metarhizium acridum | 0 | G-T |
| CytB 6 | 229–293 | 67 | 31 | ACA19730 | Transcriptional regulator Methylobacterium | 1 | G-T |
| ND1 1 | 21–229 | 218 | 42 | ACT75317 | ND1 Phaeoceros laevis | 4 | C-T |
| ND1 1 | 229–305 | 80 | 46 | EEB33262 | Hyp. Desulfovibrio piger | 1 | C-T |
| ND2 6 | 51–93 | 45 | 56 | ADQ43189 | Oligopeptide transporter Eutrema parvulum | 0 | C-T |
| ND2 6 | 212–319 | 108 | 47 | AFB2830 | Hyp. Rickettsia rickettsii | 6 | C-T |
| CO1 1 | 11–395 | 398 | 32 | ACT75318 | CO1 Phaeoceros laevis | 4 | C-T |
| CO1 6 | 157–326 | 171 | 38 | CAZ61577 | CO1 Sciadicleithrum variabilum | 8 | C-T |
| CO1 6 | 316–504 | 190 | 41 | AEI55877 | CO1 Penicillium polonicum | 13 | C-T |
| CO2 1 | 19–124 | 109 | 39 | ABG40996 | Redoxin Pseudoalteromonas atlantica | 1 | C-T |
| CO2 1 | 66–149 | 90 | 48 | EFH6873 | Possible ribosomal prot. Clostridium difficile | 1 | C-T |
| CO3 1 | 57–250 | 194 | 39 | ACS71775 | CO3 Isoetes engelmannii | 7 | C-T |
| ND3 1 | 51–106 | 59 | 56 | EFQ96882 | Ankyrin repeat domain-containing Arthroderma gypseum | 1 | C-T |
| ND3 6 | 11–60 | 53 | 58 | EFW38918 | Efflux ABC transporter, permease Treponema phagedenis | 0 | C-T |
| ND4l 2 | 3–62 | 61 | 49 | XP002121272 | Hyp. Ciona intestinalis | 0 | C-T |
| ND4l 2 | 10–49 | 40 | 50 | ADN36160 | Glycosyl transferase Methanoplanus petrolearius | 0 | C-T |
| ND5 6 | 60–125 | 78 | 41 | ABF33578 | Oligohyaluronate lyase Streptococcus pyogenes | 1 | C-T |
| ND6 1 | 26–174 | 149 | 41 | ADT82255 | ND6 Hylobates muelleri | 0 | C-T |
| ND6 5 | 38–114 | 83 | 48 | EDY73983 | GA28377 Drosophila pseudoobscura | 1 | C-T |
| CytB 1 | 15–292 | 281 | 41 | ACI01099 | Apocytochrome b Isoetes engelmannii | 2 | C-T |
| ND1 1 | 1–317 | 317 | 58 | CAA66304 | ND1 Pongo pygmaeus | 20 | A-G |
| ND1 2 | 161–224 | 69 | 46 | EEU48753 | Hyp. Nectria haematococca | 1 | A-G |
| ND1 2 | 28–117 | 109 | 42 | BAJ78673 | RNA polymerase II largest subunit Bemisia tabaci | 2 | A-G |
| ND1 3 | 85–197 | 113 | 41 | BAE91117 | Macaca fascicularis | 2 | A-G |
| ND2 1 | 1–347 | 347 | 57 | AEL64185 | ND2 Homo sapiens | 17 | A-G |
| ND2 2 | 187–286 | 138 | 34 | CAF93389 | Tetraodon nigroviridis | 3 | A-G |
| ND2 3 | 45–208 | 182 | 37 | ADZ45521 | GREBP cGMP-response element-binding Homo sapiens | 1 | A-G |
| ND2 6 | 119–333 | 229 | 40 | XP002611624 | Hyp. Branchiostoma floridae | 13 | A-G |
| CO1 1 | 7–505 | 499 | 54 | CAC37979 | Co1 Macaca sylvanus | 23 | A-G |
| CO1 3 | 337–440 | 104 | 41 | XP002801723 | Hyp. Macaca sylvanus | 2 | A-G |
| CO1 4 | 83–193 | 111 | 47 | EAL47953 | DENN domain protein Entamoeba histolytica | 1 | A-G |

Please cite this article in press as: Seligmann, H., Polymerization of non-complementary RNA: Systematic symmetric nucleotide exchanges mainly involving uracil produce mitochondrial RNA transcripts coding for cryptic overlapping genes. BioSystems (2013), http://dx.doi.org/10.1016/j.biosystems.2013.01.011

ARTICLE IN PRESS

G Model BIO 3360 1–19

H. Seligmann / BioSystems xxx (2013) xxx–xxx


Table 3 (Continued)
Gene  Loc  N  Si  Id  Origin  Ter

CO2 Homo sapiens Ammonia monooxygenase B uncultured crenarchaeote FAD dependent oxidoreductase Sphingomonas AT8 Pan paniscus DNA gyrase B subunit Myxococcus xanthus N-acetylmuramyl-l-alanine amidase, negative regulator of AmpC, AmpD Opitutus terrae KH domain-containing, RNA-binding, signal transduction-associated Danio rerio Interferon-activable protein 203-like Cricetulus griseus Hyp. Arthroderma otae AT6 Homo sapiens Nucleolar GTP-binding protein 1 Canis lupus familiaris Hyp. uncultured marine microorganism Type VI secretion system Vgr Pectobacterium wasabiae CO2 Homo sapiens OK/SW-CL.16 Homo sapiens Potassium efflux system KefA protein/small-conductance mechanosensitive channel Pelagibacterium halotolerans ND3 Pan troglodytes Regulatory protein Emericella nidulans 4-Alpha-glucanotransferase Haemophilus haemolyticus Zinc finger protein Clonorchis sinensis Histone acetylase complex Aspergillus clavatus Hyp. Sebaldella termitidis ND4l Nomascus siki Peptidyl-prolyl cis-trans isomerase Ajellomyces capsulatus Hyp. Populus trichocarpa Hyp. Phytophthora sojae Phosphoketolase Clostridium carboxidivorans Sodium/potassium/calcium exchanger 1 Heterocephalus glaber ND4 Pan troglodytes Homo sapiens DC24 Homo sapiens Mitochondrial import inner membrane translocase Tim22-like Amphimedon queenslandica ND5 Homo sapiens ND6 Papio hamadryas Flagella associated membrane Chlamydomonas reinhardtii CytB Homo sapiens Hyp. Macaca fascicularis Zonadhesin Mus musculus CO1 Macaca nemestrina GCN5-related N-acetyltransferase Cyanothece Hyp. Toxoplasma gondii Hyp. Physcomitrella patens Hyp. Trichomonas vaginalis Arginine exporter Rhodobacter capsulatus Adenosine monophosphate deaminase Gallus gallus Hyp. Glomerella graminicola Hyp. Culex quinquefasciatus Efflux transporter Brucella Synaptopodin-2 Equus caballus Alpha adducin Dictyostelium discoideum Hyp. Saccoglossus kowalevskii Hyp. Glycine max Photosystem I reaction center subunit psi-N Arabidopsis lyrata Hyp. 
Vitis vinifera CO1 Dolichopoda euxina Hyp. Botryotinia fuckeliana Glutamate decarboxylase Magnetospirillum magnetotacticum Short chain dehydrogenase Mycobacterium kansasii Short-chain dehydrogenase/reductase SDR Mycobacterium Hyp. Dictyostelium purpureum NLP/P60 Pseudoxanthomonas spadix ND4l Rhinopithecus avunculus Hyp. Thalassiosira pseudonana Homo sapiens UDP-Gal:betaGlcNAc beta 1,4-galactosyltransferase 6 variant Homo sapiens ND5 Homo sapiens ND6 Homo sapiens Hyp. Daphnia pulex CO1 Tetropium fuscum CO2 Homo sapiens Hyp. Streptomyces venezuelae Polyketide synthase Fluoribacter dumoffii MmpL4 7 Mycobacterium colombiense

15 3 1 3 1 1

A-G A-G A-G A-G A-G A-G

0 2 2 5 5 0 1 17 0 1

A-G A-G A-G A-G A-G A-G A-G A-G A-G A-G

5 0 0 0 5 0 2 1 3 0 1 2 22 1 1 0

A-G A-G A-G A-G A-G A-G A-G A-G A-G A-G A-G A-G A-G A-G A-G A-G

21 15 0 14 1 7 5 2 0 0 1 0 0 0 0 1 5 3 2 1 1 1 12 1 0 1 1 0 0 6 1 0 0 34 2 3 0 0 0 0 1

A-G A-G A-G A-G A-G A-T A-T A-T A-T A-T A-T A-T A-T A-T A-T A-T A-T A-T A-T C-G C-G C-G C-G C-G C-G C-G C-G C-G C-G C-G C-G C-G C-G C-G C-G C-G A-C A-C A-C A-C A-C

CO2 1 CO2 2 CO2 3 AT8 1 AT8 2 AT8 3

6–227 32–143 7–73 1–68 10–54 11–57

222 116 72 68 51 48

55 37 39 60 41 46

ABB78341 ABL11420 EGI53545 AEQ36342 CAA04176 ACB74250

AT8 3 AT8 4 AT8 6 AT6 1 AT6 2 AT6 3 AT6 3 CO3 1 CO3 3 CO3 5

1–42 2–67 12–56 4–225 7–108 44–104 146–223 1–256 15–148 11–86

42 66 45 222 102 62 83 256 134 77

50 42 56 62 39 47 42 54 48 47

XP003199307 XP003500291 EEQ30817 ADG46521 XP535203 ABZ06095 ACX89269 ABU64439 BAB93516 AEQ53502

ND3 1 ND3 2 ND3 2 ND3 3 ND3 5 ND3 6 ND4l 1 ND4l 2 ND4l 2 ND4l 3 ND4l 3 ND4l 4 ND4 1 ND4 3 ND4 3 ND4 3

1–102 2–31 23–74 67–103 1–101 52–81 1–97 30–67 2–98 1–25 24–89 12–97 1–459 100–216 135–214 61–108

102 30 52 37 104 30 97 39 100 58 66 87 459 117 80 48

69 57 44 51 43 70 65 59 39 52 45 43 58 43 46 52

AEQ35815 AAA33306 EGT79756 GAA30488 EAW14848 ACZ10040 ADT82590 EGC44959 EEE79199 EGZ30743 EET89248 EHA98470 ABU47843 BAC85228 AAG44628 XP003388243

ND5 1 ND6 1 ND6 6 CytB 1 CytB 3 ND1 2 CO1 1 CO2 1 CO2 2 CO2 3 AT8 1 CO3 2 ND3 2 ND3 2 ND3 5 ND4l 5 ND4 3 CytB 1 CytB 1 ND1 3 ND1 5 ND1 5 CO1 1 AT8 5 AT6 5 CO3 5 CO3 5 ND3 6 ND3 6 ND4l 1 ND4l 4 ND4 3 ND4 6 ND5 1 ND6 1 ND6 3 CO1 1 CO2 1 CO2 1 CO2 1 CO2 3

16–602 5–173 82–129 1–372 239–371 21–319 23–228 68–129 67–131 3–76 14–69 77–139 20–85 4–73 42–78 22–44 53–201 79–151 8–141 191–266 203–266 49–126 237–395 18–67 120–171 136–211 146–228 13–71 2–46 3–97 39–74 103–205 81–159 101–526 1–174 9–105 15–477 100–227 2–58 143–175 28–64

587 169 51 372 134 320 210 62 65 75 59 83 66 72 40 24 150 91 136 76 64 78 159 50 52 76 84 62 48 95 37 103 79 426 174 101 463 128 57 35 37

55 54 61 57 40 40 44 48 55 51 51 46 47 46 60 54 42 49 38 55 58 42 47 50 54 41 40 55 58 58 59 56 43 48 55 46 45 39 51 66 65

CAR95863 CAA77005 EDP00960 ABU67123 BAB12147 EDL19272 AAY22220 ACK71749 EEE25729 EDQ71631 EAY11801 ADE84631 NP001186517 EFQ33388 EDS34822 EFM59912 XP001503289 EAL66426 XP002733786 ACU14018 EFH42847 CAN60489 AAX37529 CCD46646 ZP00052904 ZP04750338 ABL93860 EGC31827 EHB94143 ADZ37133 EED89742 BAC85228 BAD92431 ABB97838 ADO19968 EFX75312 ACY39526 ACM71926 CCA59686 ZP10139372 EGT86488

Table 3 (Continued)
Gene  Loc  N  Si  Id  Origin  Ter

AT8 1 AT8 3 CO3 1 ND3 3 ND4l 3 ND5 1 ND6 1 ND6 2 ND6 2 CytB 1 ND1 4 ND1 5 CO2 6 AT8 2 AT8 3 AT8 4 AT8 6 ND3 4 ND4l 5 ND4 5 ND6 1 CytB 5 CytB 6 ND1 3 ND2 2 CO1 1 CO1 6 CO2 2 CO2 3 AT8 1 AT8 4 AT8 5 AT8 6 AT6 2 AT6 3 AT6 6 CO3 6 ND3 6 ND4l 3 ND4l 5 ND6 5 ND6 6

29–58 8–58 10–260 66–101 33–95 19–468 1–170 91–172 41–128 30–351 165–256 54–99 90–139 18–69 25–69 20–56 19–49 15–54 23–70 5–46 104–186 88–120 139–199 260–319 86–149 264–344 173–218 275–356 41–99

32 51 253 37 63 455 170 83 88 324 110 49 51 54 51 37 31 40 48 42 85 33 61 60 74 88 46 82 61

63 61 43 59 56 39 61 46 45 43 44 61 59 56 51 57 55 58 56 64 52 48 61 47 41 51 57 45 54

EAU75829 EAA32833 ABU64439 CCB89733 CAG82479 ADH8404 AFF86056 EDX74154 CAG99131 CAP58774 XP002807608 CBJ30960 ZP08847098 BAD93095 XP003582797 AAC49419 AAO43936 ACB76837 ZP06460666 CBK22878 EAQ38580 AES66793 CAI76264 EGI65165 ACR37944 EAZ06402 GAB46836 CBY23304 ADE11632

0 1 9 0 1 16 1 1 0 9 3 0 0 1 1 2 1 0 0 1 1 0 1 0 1 0 0 1 0

14–65 4–48 1–47 9–42 39–68 117–151 95–178 17–71 60–119 111–161 4–42 36–63 18–47 47–79 15–93 41–151 29–120

58 46 50 34 32 35 103 56 115 56 39 28 32 33 83 110 97

43 57 42 56 66 60 38 48 45 45 51 64 72 55 49 42 48

EHL87055 EHI12376 EAQ88006 EFK66562 EEN61262 AAH73232 EGO64651 CCC14912 EEW37665 XP002826409 ACAZ21358 CCG92029 EGC17145 EFV05470 EEC06142 ZP09248514 ACQ69531

81–131 53–133

51 81

51 48

AEA47822 AFI46966

AGAP012039-PA Anopheles gambiae Nitrate reductase Neurospora crassa CO3 Homo sapiens Hyp. Simkania negevensis YALI0C22946p Yarrowia lipolytica ND5 Sinogastromyzon sichangensis ND6 Homo sapiens Hyp. Microcoleus chthonoplastes KLLA0E02135p Kluyveromyces lactis CytB Cobitis sinensis Adenylate cyclase type 3 Callithrix jacchus Ectocarpus siliculosus Pectate lyase Anaerophaga thermohalophila TEA domain family member 1Homo sapiens Leucine-rich repeat transmembrane Bos taurus Repellent protein Ustilago maydis Ca2+ homeostasis Arabidopsis thaliana Multi-sensor hybrid histidine kinase Opitutus terrae Major facilitator transporter Pseudomonas syringae ABC transporter type 1 Blastocystis hominis Hyp. Dokdonia donghaensis Cysteine-rich receptor-like kinase Medicago truncatula Proton translocating inorganic pyrophosphatase Theileria annulata Purity of essence Acromyrmex echinatior Zea mays Hyp. Oryza sativa Sodium/sulphate symporter Gordonia terrae Oikopleura dioica Diguanylate cyclase/phosphodiesterase with PAS/PAC sensor(s) Sideroxydans lithotrophicus Aspartate-ammonia ligase Tannerella Regulator Mycobacterium thermoresistibile Hyp. Chaetomium globosum ABC transporter ATP-binding Streptomyces Hyp. Branchiostoma floridae MGC80562 Xenopus laevis DEAD 2 domain protein Acetonema longum Sordaria macrospora HMP/thiamine-binding Granulicatella adiacens Cardiotrophin-2-like Pongo abelii Oxidoreductase Sanguibacter keddieii Phosphoenolpyruvate carboxylase Methylacidiphilum fumariolicum Transcriptional regulator Thiocapsa marina Beta-glucosidase Prevotella salivae Dihydrodipicolinate synthase Ixodes scapularis Ammonium transporter Acaryochloris 7TM receptor with intracellular metal dependent phosphohydrolase Exiguobacterium Phosphoesterase RecJ domain Archaeoglobus veneficus Drug resistance transporter Pasteurella multocida

A-C A-C A-C A-C A-C A-C A-C A-C A-C A-C ACGT

AGCT

0 0 1 1 0 0 1 0 3 1 0 0 0 1 3 0 1 0 0

coding, a redoxin for C↔U exchange coding, a FAD dependent oxidoreductase for A↔G exchange coding, and a short chain dehydrogenase for C↔G exchange coding. A detailed discussion of each case would not be constructive at this preliminary stage of exploration of nucleotide exchange coding. However, the distribution of functions does not seem random, especially in relation to functions typically associated with mitochondrial metabolism, and this also holds for nucleotide exchanges for which no or few transcripts were found in Table 2. Hence protein alignment data suggest that one cannot exclude the occurrence of any type of nucleotide exchange, though some seem more frequent than others. All regular mitochondrial main frame-encoded proteins are membrane proteins, and these are also frequent among the alignment data in Table 3 (25 cases). These include numerous transporters and symporters, for example the mitochondrial import inner membrane translocase Tim22-like for the A↔G nucleotide exchange. Here again, the data at hand suggest protein functions that seem non-random in relation to known mitochondrial functions in the cell's metabolism. Note that for the A↔C+G↔U exchange, a type of nucleotide exchange for which no transcript was detected, alignments with membrane proteins were most numerous (7), while no alignments with proteins interacting with DNA or RNA were found, and only few (2) with physiological functions associated with mitochondrial metabolism. This suggests that this type of RNA recoding by nucleotide exchange specifically produces membrane bound proteins, under apparently rare conditions that induce that type of transcriptional nucleotide exchange. This confirms that at this preliminary stage, no type of nucleotide exchange should be excluded, even if RNA alignment data in Table 2 are non-existent for that type of nucleotide exchange. It is indeed plausible that each type of nucleotide exchange is induced by specific, perhaps stress- and/or ontogeny-associated conditions, and some might be rarer than others. Further bioinformatics analyses of the putative nucleotide exchange overlap coding sequences yield clues in this respect.
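The recoding step that underlies the alignments and stop counts in Table 3 can be illustrated with a minimal sketch. It assumes that a symmetric exchange rule is written as a string such as `A<->C+G<->U` (a notation chosen here for convenience, not taken from the paper), and that stops are the four stop codons of the vertebrate mitochondrial genetic code. The function names `exchange_map`, `recode` and `count_stops` are hypothetical helpers, not the author's pipeline.

```python
# Stop codons of the vertebrate mitochondrial genetic code (NCBI table 2).
MITO_STOPS = {"UAA", "UAG", "AGA", "AGG"}


def exchange_map(rule: str) -> dict:
    """Build a substitution map from a rule such as 'A<->C+G<->U'.

    The '<->' and '+' notation is an assumption made for this sketch.
    """
    mapping = {base: base for base in "ACGU"}
    for pair in rule.split("+"):
        x, y = pair.split("<->")
        mapping[x], mapping[y] = y, x  # symmetric: x becomes y, y becomes x
    return mapping


def recode(rna: str, rule: str) -> str:
    """Apply a symmetric nucleotide exchange to every position of a sequence."""
    mapping = exchange_map(rule)
    # DNA input is accepted by reading T as U.
    return "".join(mapping[b] for b in rna.upper().replace("T", "U"))


def count_stops(rna: str, frame: int = 0) -> int:
    """Count in-frame stop codons, reading from offset `frame` (0, 1 or 2)."""
    codons = (rna[i:i + 3] for i in range(frame, len(rna) - 2, 3))
    return sum(codon in MITO_STOPS for codon in codons)
```

Because each rule is symmetric, applying the same rule twice returns the original sequence, which is a quick sanity check when recoding whole genes before translation and alignment.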

3.1.7. Origins of proteins in Table 3

The distribution of the proteins in Table 3 among broad systematic groups is informative to some extent. Table 3 includes only one alignment with a protein of viral origin and one with a protein of archaeal origin (0.6% each). Most common were bacterial origins (41.5%), followed by metazoan (25.7%), fungal (17.5%) and 'vegetal' (from Viridiplantae) origins (11.11%). Fig. 3 compares these with the relative representation of proteins from these respective origins in GenBank's database. This clearly shows that eukaryotic origins (organisms that contain mitochondria) are overrepresented in Table 3, especially metazoan origins. This general ranking and pattern varied little between different types of nucleotide exchange codings, though for A↔G+A↔U, metazoan origins were more common than bacterial origins, and for G↔U, proteins from Viridiplantae were more frequent than from Metazoa. However, these variations might be stochastic and are small in relation to the overall pattern found when pooling all data from Table 3, presented in Fig. 3.

Fig. 3. Relative proportion of proteins from broad systematic evolutionary classes of organisms and viruses in Table 3 as a function of their proportion in GenBank's protein database. Proteins from eukaryotic origins (especially metazoan) are overrepresented in Table 3. The line indicates x = y.

Considering that the sequences analyzed are of metazoan origin (Homo sapiens), this suggests that overlap coding through nucleotide exchange transcription is usually not due to horizontal transfers (including by viruses), but probably evolves gradually within phylogenetic groups, and that occasionally the genome comes to include a gene coding for that protein in the regular (non-exchange) form. This might result from occasional reverse transcription of nucleotide exchange transcripts and their integration in the nuclear genome of organisms that possess mitochondria. This phenomenon would be compatible with the positive correlation between detected protein alignment lengths and exchange RNA transcripts in Fig. 2. If this is the mechanism underlying the integration of a gene coding without nucleotide exchange for a protein usually coded by mitochondrial nucleotide exchange transcription, the fact that viruses are underrepresented in Table 3 suggests that proteins coded by nucleotide exchange are probably proteins with some adaptive physiological function. This is consistent with the observation that Table 3 includes numerous proteins that seem adequate for mitochondrial metabolism.
Fig. 4. Mean number of stops per putative protein coding region as detected by Blastp for nucleotide exchange recoded RNAs of human mitochondrial protein coding genes (from Table 3) as a function of the number of exchanging transcripts according to that exchange rule (from Table 2). Relatively common nucleotide exchange transcripts tend to include more stops, indicating that protein expression is limited by the fact that transcription frequency is counterbalanced by the presence of stop codons necessitating translational activity by suppressor tRNAs.

Along that rationale, the organism indicated in Table 2 in which the protein is coded directly by DNA, without nucleotide exchange, and whose protein aligns with the human mitochondrial sequence translated after nucleotide exchange, could be an organism where the physiological function of the protein coded by nucleotide exchange became more frequently required, justifying the inclusion of a gene that codes explicitly (i.e., without nucleotide exchange) for that protein. The methods used here would not detect any nucleotide exchange-encoded protein coding gene if this did not occur occasionally. It is probable that not all actual nucleotide exchange-encoded genes have been detected by this method, because direct integration of nucleotide exchange coding contents into the genome might not yet have occurred for all nucleotide exchange-encrypted genes, and because GenBank may not include sequences of organisms where this has occurred. Hence it is very likely that Table 3 underestimates numbers of nucleotide exchange-encoded genes. The difference between pro- and eukaryotic origins could have an alternative explanation, namely that exchange transcription and coding are rarer in prokaryotes. Though this possibility exists, it is not very likely, especially since the genome analyzed here is mitochondrial and probably reflects its prokaryotic ancestor. The evolutionary scenario for overlapping genes is a more probable explanation for the overrepresentation of proteins from eukaryotic origins in Table 3.
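The comparison behind Fig. 3 amounts to dividing each group's share in Table 3 by its share in the reference database, with ratios above 1 falling above the x = y line. The sketch below only encodes that ratio. The Table 3 shares are the percentages quoted in this section, while the database shares are not given here and must be supplied by the reader. The name `representation_ratio` is a hypothetical helper.

```python
# Shares of Table 3 alignments by origin, in percent, as quoted in the text.
TABLE3_SHARE = {
    "Bacteria": 41.5,
    "Metazoa": 25.7,
    "Fungi": 17.5,
    "Viridiplantae": 11.11,
    "Viruses": 0.6,
    "Archaea": 0.6,
}


def representation_ratio(observed: dict, database: dict) -> dict:
    """Observed share divided by database share for each group.

    A ratio above 1 means the group lies above the x = y line (overrepresented).
    Groups with a missing or zero database share are skipped.
    """
    ratios = {}
    for group, share in observed.items():
        base = database.get(group)
        if base:
            ratios[group] = share / base
    return ratios
```

Calling `representation_ratio(TABLE3_SHARE, database_share)` with a dictionary of database percentages for the same groups would return one ratio per group; the database values themselves are deliberately left out of this sketch.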

3.1.8. Stop codons in putative overlap coding genes

Table 3 indicates stop codon numbers within putative overlap coding gene sequences. Considering the density of stops within these sequences, pooled across genes but separately for each type of nucleotide exchange, stops are much less frequent within putative overlap coding regions than within the rest of the genes after nucleotide exchange. This ranges from 2.26 times less frequent for putative proteins coded by C↔U exchange transcription, to 8.71 times less frequent for those coded by A↔C exchange transcription. The former is the nucleotide exchange type represented by the least, the latter by the most frequent RNA transcript data in Table 2. Hence it seems that stops within putative overlap coding sequences according to exchange transcription modulate the expression of overlapping genes associated with each exchange transcription type, by constraining this expression to conditions where suppressor tRNA activity occurs. Indeed, Fig. 4 shows that stop codon numbers per overlapping gene (from Table 3) increase with numbers of exchange transcripts observed for that type of exchange transcription (from Table 2): r = 0.747, P = 0.0104; rs = 0.854, P = 0.0078, one tailed tests. This suggests that the expression of proteins coded by nucleotide exchanges is finely tuned by two regulatory mechanisms: one positive, the frequency of specific nucleotide exchange transcription and the associated frequency of transcripts it produces (from Fig. 2), and one negative, the frequency of stops, which makes the expression of the proteins dependent on the joint activities of nucleotide exchange transcription and suppressor tRNAs, both probably rare events. The close match between transcript frequencies and mean numbers of stops per putative protein coding gene (Fig. 4) suggests that the system is finely tuned so that the two regulatory forces, one positive, one negative, balance each other: translation of transcripts produced by rare types of nucleotide exchange transcription is relatively unhindered by stop signals, while relatively frequent exchange transcription is constrained by numerous stops. This suggests that expression levels of proteins associated with the various types of nucleotide exchanges are regulated so as to be relatively equal, despite differences in the frequencies at which the different types of nucleotide exchange transcription apparently occur. This highly structured pattern between a positive and a negative regulatory mechanism is a strong indication that the analyses reveal real biological coding phenomena with important, even if probably rare, physiological functions. Hence it seems that the expression of genes associated with frequent types of exchange transcription depends on a further condition, that of suppressor tRNA activity. The pattern in Fig. 4 is so strong that it suggests an adaptive component, where suppressor tRNA activity downregulates these types of nucleotide exchanges, especially if transcripts are frequent. This result is a further indication that such transcription and expression is a physiological reality in very specific and unknown conditions.

3.1.9. Deamination along replicational gradients in genomic single strandedness and nucleotide exchange overlapping genes in human mitochondrial protein coding genes

Single stranded DNA is very mutable, as compared to duplex DNA. This situation occurs when DNA is replicated, and when RNA is transcribed. Mitochondrial DNA replication is unidirectional, involving a heavy strand and a light strand replication origin (OH and OL). Distances of sites in relation to each of OH and OL determine the duration sites remain single stranded during replication (see for example Krishnan et al., 2004a,b; Seligmann et al., 2006; Seligmann, 2008, 2011b). In the single stranded state, hydrolytic deaminations A→G and C→T are most frequent (note that in this context, A→G and C→T are spontaneous mutations occurring at the DNA level; these are not systematic nucleotide exchanges during RNA transcription). Replicational single strandedness creates gradients in mitochondrial genome nucleotide contents that reflect these spontaneous mutations (partial review in Seligmann, 2012a). Their effect on nucleotide contents is counterbalanced by functional constraints when the nucleotide has crucial coding functions at the protein level, and hence nucleotide contents at second codon positions barely reflect the mutational single stranded gradients. However, these gradients are detectable at third codon positions, and the situation is intermediate for first codon positions (Seligmann et al., 2006). Analysing nucleotide contents at third codon positions of the regular human main frame protein coding genes in relation to overlapping genes predicted by Blastp analyses assuming suppressor tRNA activity confirmed that these regions are involved in overlap coding, as they fit deamination gradients less well than the adjacent regions that are not involved in overlap protein coding (Seligmann, 2012a).
This method also confirmed overlapping genes coded by codons of four nucleotides (tetragenes coded by tetracodons, Seligmann, 2001) and protein coding genes coded in the 3′-to-5′ direction of mitochondrial sequences (Seligmann, 2012). This method is used here to confirm the existence of the putative overlapping genes associated with exchange transcription and predicted by Blastp (Table 3), for each of the nine symmetric nucleotide exchange types. Only analyses for two nucleotide exchange types are presented here, though such detailed analyses were done for each of the nine types of nucleotide exchange. The replicational deamination gradient test expects that if a region functions as an overlapping gene, it does not fit the deamination gradient well. However, if candidate overlapping genes are not expected to be frequently expressed and hence are not expected to be functional, these sequences should fit well within the replicational deamination gradient observed for other regions, which are not expected to function as overlapping genes and are involved only in regular main frame coding. The overlapping genes coded by A↔C nucleotide exchange transcription are expected to be the least functional ones according to both criteria available at this point: no RNA transcripts were detected fitting this type of exchange transcription, and Blastp analyses of hypothetical protein sequences translated from these exchange transcripts yield the lowest number of alignments with proteins existing in GenBank. No RNA transcripts fitting predictions from nucleotide exchange transcription types along the rules A↔C+G↔U and A↔G+C↔U were detected, but RNAs transcribed along these exchange transcription rules apparently code for numerous proteins, according to Table 3. Hence their situation is much less clear; they might be more functional than indicated by transcript numbers in Table 2.

Fig. 5. A/(A+G) nucleotide ratios at 3rd codon positions (light strand DNA) in human mitochondrial protein coding genes as a function of the time spent single stranded during replication by that gene. The base ratios reflect the C→T deamination that occurs on heavy strand DNA during replication, until the complementary lagging (light) strand is polymerized. Filled datapoints are for predicted overlap coding regions after A↔C nucleotide exchange transcription, as presented in Table 2. Hollow datapoints are for the rest of the gene, not predicted to be involved in overlap coding after A↔C exchange transcription. Both datasets fit well the same predicted deamination gradient, which suggests the putative overlap coding genes are not functional. Functionality would imply that overlap coding regions cannot mutate according to the replicational gradients, in order to preserve coding properties, and should hence not fit the gradient observed for other genome regions. This lack of difference is compatible with the fact that no RNA transcripts fitting human mitochondrial genes have been found in GenBank for nucleotide exchange rule A↔C (Table 2), and that A↔C exchange transcription is predicted to include few overlap coding genes according to Table 3.

Fig. 5 plots the A/(A+G) nucleotide content ratio at the third codon position (according to regular main frame codons of the regular protein coding gene) for human mitochondrial protein coding genes as a function of the duration spent single stranded during replication by that gene, separately for putative overlapping genes (filled symbols) and for the rest of the genes (open symbols), for A↔C nucleotide exchange transcripts. Note that the nucleotide ratios are for the regular human mitochondrial DNA contents of the light strand, which encodes most of the regular human mitochondrial protein coding genes, and not for contents after A↔C exchange. The replicational gradient in light strand A/(A+G) reflects the increase in T after C→T (deamination) mutations in single stranded heavy strand DNA (Krishnan et al., 2004a,b). The correlation for putative overlapping genes is r = 0.50. This Pearson correlation coefficient is stronger than that for the rest of the genome (r = 0.43). This is the opposite of what would be expected for overlap coding genes, which should not fit a gradient. In addition, if one excludes from analyses the outlier datapoint for the non-overlap coding region of the gene AT8, which is based on very few codons because most of that gene is predicted to be involved in overlap coding, the regression lines for putative overlap and non-overlap coding regions become almost identical (dashed line in Fig. 5). This means that the deamination gradient test does not confirm the predicted overlap coding status for A↔C exchange transcripts. These regions behave exactly along the predictions of the deamination gradients, as do the other regions of the same genes. In other terms, deamination mutations occur according to the same rules in these regions as in regions not expected to be involved in overlap coding after A↔C exchanges.

Fig. 6. A/(A+G) nucleotide ratios at 3rd codon positions (light strand DNA) in human mitochondrial protein coding genes as a function of the time spent single stranded during replication by that gene for overlapping genes coded by G↔U exchange transcription. Filled datapoints are for predicted overlap coding regions after G↔U nucleotide exchange transcription, as presented in Table 2. Hollow datapoints are for the rest of the gene, not predicted to be involved in overlap coding after G↔U exchange transcription. The latter fit the predicted deamination gradient well, but the former much less, as expected if overlap coding genes were functional. This fits with the fact that Tables 1 and 2 include numerous G↔U exchange transcripts and predicted overlapping genes, respectively. This pattern contrasts with the one observed in Fig. 5, which jointly confirms that the test's results reflect overlapping gene functionalities.

At the other extreme, G↔U exchange transcribed genes are predicted to include the largest number of overlapping protein coding genes, and several RNA transcripts were detected in GenBank matching G↔U exchange transcription of the human mitochondrial genome. Fig. 6 presents the replicational deamination gradient analysis of the same human mitochondrial protein coding genes as in Fig. 5, but separating putative overlapping genes from the rest of the gene for overlap coding predicted for G↔U (and not A↔C) transcribed genes. The gradient is clear for regions not predicted to be involved in overlap coding (r = 0.55, one tailed P = 0.0398), but weaker for predicted overlap coding regions (r = 0.43, one tailed P = 0.0548). This situation fits what is predicted if overlap coding genes are functional. Similar analyses for the other nucleotide exchange types yield qualitatively similar results (deamination gradient weaker for predicted overlap coding regions than for other regions) in all the remaining seven types of symmetric nucleotide exchanges. Hence qualitatively, only for A↔C nucleotide exchanges do gradient analyses not fit predictions that overlapping genes are functional (analysis presented in Fig. 5). This functions as a kind of negative positive control (negative because overlapping and other regions behave similarly, positive because the null hypothesis expects the detection of a deamination gradient). According to a one tailed sign test, the probability of obtaining by chance the qualitative result expected if predicted overlap coding genes are functional (a gradient weaker for predicted overlapping regions than for other regions) in 8 among 9 cases is P = 0.0098. Hence the replication gradient analyses apparently confirm that overlap coding according to nucleotide exchange transcription predicted by Blastp analyses (Table 3) is generally not an artifact. The only type of nucleotide exchange transcription for which the comparison between deamination gradients observed for predicted overlap coding regions and other regions does not confirm the functionality of the overlapping genes is the nucleotide exchange type that according to other analyses (in Tables 1 and 2) is the least likely to occur. This also strengthens the working hypothesis, as well as the adequacy of deamination gradient analyses as a test for functionality of predicted mitochondrion-encoded overlapping genes.
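The gradient comparison described above can be sketched in a few lines: compute the A/(A+G) ratio at third codon positions for each region, then compare the Pearson correlation with time spent single stranded between predicted overlap coding regions and the rest of the genes. This is a minimal illustration, not the author's code. It assumes each sequence fragment starts at a first codon position of the main frame, and that the single stranded durations are supplied as numbers computed elsewhere. All function names are hypothetical.

```python
from math import sqrt


def third_position_ratio(seq: str, numerator: str = "A", partner: str = "G"):
    """Return numerator/(numerator+partner) at third codon positions.

    Assumes `seq` starts at a first codon position of the main frame.
    Returns None when neither base occurs at third positions.
    """
    thirds = seq.upper()[2::3]
    n, p = thirds.count(numerator), thirds.count(partner)
    return n / (n + p) if n + p else None


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient of two equally long numeric sequences.

    Assumes both sequences vary; constant input would divide by zero.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def gradient_weaker_in_overlap(t_overlap, r_overlap, t_rest, r_rest) -> bool:
    """True when the gradient is weaker in predicted overlap coding regions."""
    return pearson_r(t_overlap, r_overlap) < pearson_r(t_rest, r_rest)
```

Passing `numerator="C", partner="T"` to `third_position_ratio` gives the C/(C+T) variant of the same analysis.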
One obtains qualitatively similar results when performing these gradient analyses for C/(C+T) nucleotide contents at third codon positions. In that case, the replicational gradient is stronger in regular regions than in predicted overlap coding regions for seven among nine types of nucleotide exchanges, which is also a statistically significant majority of cases according to a one tailed sign test (P = 0.0449). While weaker, this result nevertheless confirms overlap coding status for most predicted candidate overlap coding regions and most types of nucleotide exchanges.
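For readers who want to recompute a sign test of this kind, the sketch below gives the plain upper tail of a binomial distribution with success probability 0.5, which is one common definition of a one tailed sign test. It is an assumption that this matches the procedure used in the paper. Under this definition, 8 of 9 gives 10/512 (about 0.0195) and 7 of 9 gives 46/512 (about 0.0898). The P values quoted in the text are half of these, so they were presumably obtained under a different convention, which this sketch does not attempt to reproduce. The function name is hypothetical.

```python
from math import comb


def sign_test_upper_tail(successes: int, trials: int) -> float:
    """P(X >= successes) for X ~ Binomial(trials, 0.5)."""
    favourable = sum(comb(trials, k) for k in range(successes, trials + 1))
    return favourable / 2 ** trials
```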


3.2. Circular code analyses confirm overlap coding


The possibility that nucleotide exchange transcription increases the coding potential of genes could be a major discovery, but at this point, evidence on proteins translated from the predicted overlapping genes is still entirely missing. For that reason, an additional computational test is used to strengthen the status of the predicted overlap coding genes presented in Table 3. This test rests on a theoretical background that differs entirely from that of the deamination gradient analyses presented in the previous section and is based on different information and sequence properties; hence it is independent of the deamination gradient test. Empirical observations have shown that some codons are overrepresented in overlapping genes, as compared to regular genes, while other codons are underrepresented (Ahmed et al., 2007, 2010; Ahmed and Michel, 2011). The overrepresented codons are the homopolymer codons AAA, CCC, GGG and UUU. The underrepresented ones form a circular code (Arqués and Michel, 1996, 1997; Michel, 2008; Ahmed and Michel, 2011; Gonzalez et al., 2011). The reasons for that remain unclear, but from an empirical point of view, this makes it possible to test whether codon usages in predicted overlap coding genes are indeed optimized along the lines of avoiding circular code codons and preferred usage of homogenous


Please cite this article in press as: Seligmann, H., Polymerization of non-complementary RNA: Systematic symmetric nucleotide exchanges mainly involving uracil produce mitochondrial RNA transcripts coding for cryptic overlapping genes. BioSystems (2013), http://dx.doi.org/10.1016/j.biosystems.2013.01.011


codons. These reasons might relate to ribosomal frame maintenance, as parts of ribosomal RNA involved in interacting with the mRNA also form circular codes (Michel, 2012). Independently of the interesting theoretical underpinnings of the links between circular codes and overlap coding, circular codes can be used to test whether codon usages in predicted overlap coding regions fit the circular code. In that context, I analyzed each of the three frames of each human protein coding gene, in relation to each set of 20 ‘circular code codons’, each set associated with one frame, separating predicted overlap coding regions (according to Table 3) from other regions, for each of the nine nucleotide exchange types. Homogenous codons were scored −1, and codons belonging to the circular code for that frame were scored 1. These scores were averaged and compared between putative overlap coding regions and other regions, expecting lower scores for overlapping genes (as predicted by analyses presented in Table 2) than for other regions if the predicted codons are over- and underrepresented within the predicted overlapping gene as compared to regular coding regions. Mean scores for the two types of regions could be compared by t-tests, but here I restrict analyses to the statistically robust nonparametric sign test. Comparing the mean score obtained for each putative overlap coding region with the mean score of the rest of that gene, I tested whether the score was lower in the overlap coding region significantly more often than the 50% expected if no pattern exists in the data.
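The scoring scheme described above can be sketched as follows. This is only an illustration under stated assumptions: codons that are neither homopolymers nor members of the circular code are scored 0 and included in the average (the text does not specify this), the circular code is passed in as a parameter, and the two-codon set used in the example is a hypothetical placeholder rather than one of the actual 20-codon circular codes.

```python
HOMOPOLYMER_CODONS = frozenset({"AAA", "CCC", "GGG", "UUU"})


def circular_code_score(sequence: str, circular_code: frozenset, frame: int = 0) -> float:
    """Mean codon score of a region read in the given frame (0, 1 or 2).

    Homopolymer codons score -1, codons of the supplied circular code score +1,
    all other codons score 0 (assumption) and still count in the mean.
    """
    rna = sequence.upper().replace("T", "U")
    codons = [rna[i:i + 3] for i in range(frame, len(rna) - 2, 3)]
    if not codons:
        raise ValueError("sequence too short for the requested frame")
    total = 0
    for codon in codons:
        if codon in HOMOPOLYMER_CODONS:
            total -= 1
        elif codon in circular_code:
            total += 1
    return total / len(codons)


if __name__ == "__main__":
    toy_code = frozenset({"GAU", "GCC"})  # hypothetical placeholder set
    overlap_region = "AAAGAUCCCUUU"       # hypothetical sequences
    other_region = "GAUGCCGAUCUG"
    difference = (circular_code_score(overlap_region, toy_code)
                  - circular_code_score(other_region, toy_code))
    # A negative difference is the direction expected for functional overlap coding.
    print(difference)
```

The sign of the difference between a predicted overlap coding region and the rest of its gene is the quantity that would then enter the sign test.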
This yields three nucleotide exchange types with P < 0.05 according to one tailed sign tests: A↔U exchange (14 positive results among 21, P = 0.0473); A↔G+C↔U (27 positive among 39 comparisons between regular and predicted overlap coding regions, P = 0.0059); and A↔U+C↔G (37 positive among 45 comparisons, P = 0.0000014). Note that several transcripts for two of these types of nucleotide exchanges have been detected (A↔U, in Table 2; and A↔U+C↔G, Seligmann, 2012d), and though Table 2 does not include any transcript fitting A↔G+C↔U exchange transcription, this type of nucleotide exchange has the most numerous overlap coding genes according to analyses in Table 3. One can test the working hypothesis by combining the one tailed P values obtained from sign tests for all nine types of nucleotide exchanges. Fisher’s method for combining P values sums −2 × ln Pi, where Pi is the P value obtained for the ith test, and i ranges from 1 to k. Under the null hypothesis, this sum follows a chi-square distribution with 2 × k degrees of freedom; in the present case it equals 43.82, which with 18 degrees of freedom has P = 0.00061. Hence the null hypothesis for the combined data is rejected: predicted overlap coding genes tend to avoid circular code codons and prefer homogenous codons, as compared to regular coding regions, when considering all types of nucleotide exchanges together. This confirms their coding status according to the circular code approach.

3.3. Convergence between functionality predictions of overlap coding genes by deamination gradient and circular code tests

Examination of Figs. 5 and 6 shows that nucleotide ratios at third codon positions fit replicational deamination gradients better for some predicted overlap coding genes than for others. The extent to which a datapoint digresses from the deamination gradient might be proportional to gene functionality.
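Returning to the combined test of Section 3.2, Fisher's method for combining P values can be sketched as follows. The function names are hypothetical; the chi-square tail probability uses the closed form that holds for an even number of degrees of freedom, which is always the case here because the degrees of freedom equal 2 × k.

```python
from math import exp, log


def chi2_sf_even_df(statistic: float, df: int) -> float:
    """Upper tail probability of a chi-square variable with even df.

    For df = 2m: P(X > x) = exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!
    """
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return exp(-half) * total


def fisher_combined(p_values):
    """Return (statistic, df, combined P) for a list of independent P values."""
    if not p_values or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("P values must lie in (0, 1]")
    statistic = -2.0 * sum(log(p) for p in p_values)
    df = 2 * len(p_values)
    return statistic, df, chi2_sf_even_df(statistic, df)


if __name__ == "__main__":
    # Tail probability for the statistic and degrees of freedom quoted in the text.
    print(f"{chi2_sf_even_df(43.82, 18):.5f}")
```

Evaluating the tail probability at 43.82 with 18 degrees of freedom gives a value close to the P = 0.00061 reported in the text.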
In this respect, predicted overlap coding regions match the gradient approximately as well as other regions for A↔C exchange transcription, while digressions were much greater for G↔U exchange transcription, which seems to match the greater abundance of G↔U RNA transcripts in Table 2. By extension, this rationale could apply to different overlapping genes from the same type of nucleotide exchange. Possibly, those with

Fig. 7. Circular code overlapping gene score versus absolute residual of third codon position A/(A+G) from deamination gradient. The y axis is the subtraction of the circular code score calculated for gene regions coding only in the regular main frame, from the score obtained for regions predicted involved in overlap coding according to G↔U exchange transcription (from Table 3). Presumably, the lower this score is, the greater the functionality of the predicted overlap coding gene. The x axis is the absolute residual of the A/(A+G) base ratio for the same overlap coding regions for G↔U exchange transcription from the replicational deamination gradient presented in Fig. 6. Functionality of overlap coding genes is assumed proportional to this absolute residual. Hence the negative association in Fig. 7 suggests that functionality estimates for the same putative overlapping genes, but from different methods, tend to converge.

greater absolute digression from the gradient are more functional than those matching the gradient more closely. Hence functionality might be proportional to the absolute value of the residual of the A/(A+G) ratio at third codon position for a putative overlap coding gene from the deamination gradient observed for regular regions. A similar rationale can be developed for the subtraction of the mean ‘circular code’ scores for gene regions involved only in regular coding from the ‘circular code’ score obtained for predicted overlap coding genes in that gene. One might assume that overlap coding functionality decreases the more positive the value obtained from that subtraction. According to these functionality rationales, absolute residuals (from deamination gradients) and circular code score subtractions should be negatively correlated, because according to that interpretation, they would estimate the same phenomenon. It is important to recall in this context that the two tests are totally independent from each other in terms of theoretical backgrounds, and analyze different properties of the sequences. Hence a positive result (meaning a negative correlation in this context) is not trivial. Fig. 7 plots the circular code score for predicted overlap coding regions for G↔U exchange transcription according to Table 3, as a function of the absolute value of residuals for A/(A+G) ratios at third codon positions for these putative overlap coding regions from the deamination gradient analysis in Fig. 6. The presumed functionality estimates from these independent tests are indeed negatively correlated (r = −0.6272, one tailed P = 0.0082; but note that rs = −0.36, one tailed P = 0.095), as one would expect if these estimates reflect functionality of the different predicted overlap coding genes. Hence gene-wise results for the two tests of overlapping gene functionality might confirm each other.
The fact that the more robust but less sensitive nonparametric Spearman rank correlation analysis, rs, does not confirm the result of the parametric analysis does not invalidate the principle, but at this point does not allow high confidence in the result.
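The contrast between the parametric Pearson coefficient r and the nonparametric Spearman rank coefficient rs can be illustrated with a short sketch. The function names and the data in the example are hypothetical and unrelated to Fig. 7; ties are handled by average ranks, which is one common convention and an assumption here.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rs(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical data in which one extreme point dominates the parametric estimate.
    x = [1, 2, 3, 4, 10]
    y = [2, 1, 4, 3, -20]
    print(pearson_r(x, y), spearman_rs(x, y))
```

In such data a single extreme point yields a strongly negative r while rs stays close to zero, which is why agreement between the two coefficients increases confidence in a trend.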


Analyses similar to those in Fig. 7 were done for each of the eight remaining types of nucleotide exchanges, and the correlation was negative (as expected) in 5 among 9. It was statistically significant according to a one tailed test for Pearson correlation coefficients for putative overlap coding genes due to C↔U exchange transcription (r = −0.538, P = 0.044) and those due to A↔U+C↔G exchange transcription (r = −0.675, P = 0.0029). Hence convergence between functionality estimates from replicational mutation gradient analyses and circular score analyses was statistically significant at P < 0.05 for three among the nine types of potential transcriptional nucleotide exchanges. According to Table 2, all three are types of nucleotide exchanges that are relatively frequently encountered at the level of RNA transcripts. The fact that two additional analyses yield qualitatively the same statistically significant result does increase the confidence level in the result, despite inconclusive confirmation of the trend in Fig. 7 by the nonparametric rs analysis. This is because the replicability of a result, by independent tests, is the best insurance against false rejection of the null hypothesis.

3.4. A meta-analysis of exchange nucleotide transcripts and coding

The analysis in Fig. 7 indicates that deamination gradient and circular code analyses converge for putative overlap coding genes according to G↔U nucleotide exchanges. Similar levels of convergence were found for two other nucleotide exchanges (C↔U, and A↔U+C↔G); all three are among the nucleotide exchanges with the most abundant transcript data in Table 2. It is possible that the level of convergence between deamination gradient and circular code analyses, as estimated by r² as in Fig. 7, is inversely proportional to RNA transcription. Fig. 8 plots this r² while keeping the sign of the correlation coefficient (the more negative, the more convergence, adding the value 1 to avoid negative numbers), as a function of the number of genome regions for which RNA transcripts were found (Table 2). The negative trend expected is detected by Pearson’s parametric correlation coefficient r = −0.611, one tailed P = 0.04, but cannot be statistically confirmed at P < 0.05 by Spearman nonparametric rank correlation rs = −0.5, one tailed P = 0.078. Nevertheless, the fact that nucleotide exchange types for which a high level of convergence between tests for overlap coding exists are also those for which RNA transcription is relatively frequent is not at all trivial. It is not simply a confirmation of cryptic coding after nucleotide exchange, and of nucleotide exchange transcription. It shows that the independent evidence for each of these phenomena tends to be coherently integrated. Hence Fig. 8 integrates all the evidence presented here, and shows consistency between all the types of analyses. Hence, despite the speculative impression given by the working hypothesis due to its presumed revolutionary meaning in relation to accepted principles of molecular biology, the data at hand are a strong confirmation that the working hypothesis is a valid approach for understanding coding properties of DNA, and the way these are expressed. This implies that the number of protein coding genes is approximately one order of magnitude greater than believed until now in the presumably well known vertebrate mitochondrial genome.

3.5. Human DNA gamma polymerase misinsertion polymerization rates and systematic symmetric nucleotide exchange polymerization

There is a further important piece of evidence confirming the working hypothesis, in relation to the existence of nucleotide exchange polymerization.
Unlike the analyses of putative overlapping genes, this evidence is solely based on direct empirical experimental observation, and is therefore a very strong argument

Fig. 8. Convergence between deamination gradient and circular code analyses as a function of the number of genome regions that are exchange transcribed according to Table 2. The y axis is the Pearson correlation coefficient r (+1) between the absolute value of residual A/(A+G) at third codon positions for regions predicted to function as overlapping genes according to a given nucleotide exchange rule (according to Table 3) and the circular code score for that putative overlapping gene, for each of the nine types of nucleotide exchanges. The lower the value according to the y axis, the greater the convergence between deamination and circular code analyses in confirming overlap coding for that type of nucleotide exchange. Fig. 7 shows the data used to calculate that Pearson correlation coefficient for G↔U exchange transcription. Analyses similar to those in Fig. 7 were done for each of the nine nucleotide exchange rules, and Pearson correlation coefficients from these analyses are used in the y axis of Fig. 8. The x axis is the number of genome regions for which RNA was detected according to that specific nucleotide exchange rule. The trend in Fig. 8 shows that nucleotide exchange types according to which RNA has been detected for numerous regions are also those for which analyses such as those in Fig. 7 indicate a high degree of convergence between deamination and circular code analyses. This shows that convergence between two types of independent bioinformatics analyses converges with detected frequencies of RNA transcripts, an ‘experimental’ confirmation (x axis) of complex computational results (y axis).

favoring the working hypothesis. It is plausible that systematic nucleotide exchanges during RNA polymerization follow, in principle, physico-chemical and enzymatic processes very similar to those of occasional nucleotide misinsertions (corresponding to the same replacing and replaced nucleotides as in exchange transcription), as these are known for the human mitochondrial DNA polymerase gamma (Lee and Johnson, 2006) and some other polymerases (e.g., Bertram et al., 2010; Zamft et al., 2012). Hence this approach assumes that properties of misinsertions, such as their rate parameters, should be proportional to the abundance of RNAs produced by systematic nucleotide exchanges corresponding to the replaced and replacing nucleotides of that DNA misinsertion. In short, systematic nucleotide exchanges should follow kinetic principles observed for occasional (erroneous) nucleotide exchanges (misinsertions). Transcription is a DNA→RNA directed process, but no data on the fidelity of the mitochondrial RNA polymerase are available. Because DNA and RNA are quite similar, misinsertion rates by the mitochondrial DNA polymerase gamma were used for these analyses. This is also adequate because one cannot exclude that this enzyme is responsible for nucleotide exchanging RNA polymerization, perhaps in combination with specific conditions and/or other proteins. Which type of systematic nucleotide exchanging RNA polymerization occurs could be determined by such interactions with the polymerase(s) responsible for nucleotide exchanging RNA polymerization. According to a simplistic Michaelis–Menten approach to enzymatic reaction kinetics,


Fig. 9. Mean kd for nucleotide misinsertions by the human mitochondrial DNA polymerase gamma as a function of numbers of RNA transcripts with systematic symmetric RNA nucleotide exchanges corresponding to nucleotide misinsertions in the DNA. Transcript abundances are from Table 2, mean kds (affinities) are from Table 2 in Lee and Johnson (2006). This shows that RNAs produced by nucleotide exchange transcription are predicted by misinsertion kinetics of exchanged nucleotides.


reactions are parametrized according to the affinity and the maximal reaction rate, the first reflecting initial reaction rates when the substrate is rare (the medium has few free nucleotides for insertion, and the enzyme is relatively frequent as compared to its substrate), the second reflecting rates when the enzyme is saturated (the medium is rich in free nucleotides to be (mis)inserted). These parameters are indicated as kd and kpol, respectively, in Table 2 from Lee and Johnson (2006). For each type of systematic symmetric nucleotide exchange, I averaged the corresponding kds, and separately, kpols from Lee and Johnson (2006). The mean polymerization kd (calculated for each type of nucleotide exchange) is negatively correlated with the mean length of RNA transcripts detected for the corresponding nucleotide exchanges (r = −0.689, P = 0.02; rs = −0.6, P = 0.045, one tailed tests, Fig. 9). The association with kpol is statistically weaker, and positive, which is not surprising because kd and kpol are inversely proportional, a well known phenomenon in kinetics: a high enzymatic specificity for its substrate (high affinity, kd) comes at the expense of its maximal rate. Hence results suggest that frequent types of nucleotide exchanges correspond to nucleotide misinsertions with high kpol (low kd). The observation that, statistically, correlations are strongest with kd suggests that symmetric systematic nucleotide exchanges are limited by conditions where nucleotides are relatively rare. Putatively, these systematic symmetric nucleotide exchanges occur when nucleotides are relatively scarce, hence explaining a stronger effect of kd than kpol on their elongation. This result is remarkable because it means that some physicochemical and/or enzymatic principles inherent to nucleotide misinsertions coherently explain the data in Table 2. This excludes the possibility that artifacts created the RNAs in Table 2. The phenomena described here are thus shown to be meaningful on both chemical and biological grounds.
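The role of the two parameters at low and at saturating nucleotide concentrations can be illustrated with the usual hyperbolic rate equation. This is a minimal sketch, assuming the standard form rate = kpol × [N] / (kd + [N]); the function name and all numerical values are hypothetical and are not taken from Lee and Johnson (2006).

```python
def polymerization_rate(concentration: float, kpol: float, kd: float) -> float:
    """Hyperbolic (Michaelis-Menten type) rate: kpol * [N] / (kd + [N])."""
    if concentration < 0 or kd <= 0:
        raise ValueError("concentration must be >= 0 and kd must be > 0")
    return kpol * concentration / (kd + concentration)


if __name__ == "__main__":
    kpol, kd = 10.0, 5.0  # hypothetical parameter values, arbitrary units
    scarce, saturating = 1e-3, 1e6
    # When nucleotides are scarce the rate is close to (kpol / kd) * [N],
    # so kd weighs heavily on the outcome.
    print(polymerization_rate(scarce, kpol, kd), (kpol / kd) * scarce)
    # When nucleotides are abundant the rate approaches kpol and kd matters little.
    print(polymerization_rate(saturating, kpol, kd), kpol)
```

Under this form, the rate at scarce nucleotide concentrations depends on the ratio kpol/kd, whereas at saturation only kpol matters, which is the distinction drawn in the text between the two regimes.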
The mean kd also predicts levels of expression of predicted overlapping genes, as these are estimated by the difference between the strength of the replicational deamination gradients observed at (main frame) third codon positions for regions that are predicted to be involved in overlap coding (after systematic nucleotide exchange) versus third codon positions in other regions of the

Fig. 10. Difference between strengths of replicational deamination gradient in regions not involved in overlap coding and in those involved in overlap coding as a function of mean kd for nucleotide misinsertions by the human mitochondrial DNA polymerase gamma corresponding to the nucleotide exchanges observed in the RNA. Open circles are for A→G, closed circles for C→T deamination gradients (light strand annotation, not to be confused with nucleotide exchanges during RNA transcription, A→G and C→T in this case represent spontaneous mutations by deaminations during DNA replication). The y axis is calculated, for each type of nucleotide exchange, from an analysis as that presented in Figs. 5 and 6. The x axis is identical to that in Fig. 9. The result shows as for Fig. 8 that computational results from bioinformatics analyses converge with misinsertion kinetics of exchanged nucleotides.

same genes (see analyses in Section 2.2.4 and corresponding Figs. 5 and 6). Along that approach, the stronger the gradient for non-overlap coding regions as compared to predicted overlap coding regions, the weaker the expression of the predicted overlapping genes encoded by that type of symmetric systematic nucleotide exchange. Fig. 10 plots this difference as a function of the mean kd, after a z transformation of the Pearson correlation coefficients (Amzallag, 2001) that estimate the strengths of the replicational deamination gradients; the z transformation accounts for sample size effects (Seligmann et al., 2007). The gradient analyses were done separately for two types of transitions predicted to follow the replicational gradient, A→G and C→T (hollow circles and filled symbols in Fig. 10, respectively). Note that in this case A→G and C→T are mutations due to deaminations that occur during DNA replication, not nucleotide exchanges occurring during RNA transcription. For both the A→G and C→T gradients, the replicational gradient is stronger for regions not expected to be involved in overlap coding than for those expected to be involved in overlap coding in a majority of types of symmetric nucleotide exchanges (values above ‘1’ on the y axis in Fig. 10), and this difference increases with mean kd (A→G gradient, r = 0.622, P = 0.037, rs = 0.533, P = 0.0655; C→T gradient, r = 0.722, P = 0.014, rs = 0.717, P = 0.021, one tailed tests). Hence types of nucleotide exchange polymerizations that are expected to have high rates of polymerization at low nucleotide concentrations seem to be most expressed, and therefore predicted overlapping protein coding genes are proportionally more conserved as compared to the replicational deamination gradient observed in other regions of the genes. The same principle is observed in relation to circular code analyses (Section 2.3 and y axis of Fig. 7) as estimating expression of predicted overlapping protein coding genes.
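A z transformation of correlation coefficients that accounts for sample size can be sketched as follows. This assumes that the transformation meant is Fisher's r-to-z transformation, with standard error 1/√(n − 3); the exact form used for the y axis of Fig. 10 is not specified in the text, and the function names and sample sizes below are hypothetical.

```python
from math import atanh, sqrt


def fisher_z(r: float) -> float:
    """Fisher r-to-z transformation: z = atanh(r) = 0.5 * ln((1 + r) / (1 - r))."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return atanh(r)


def z_difference(r1: float, n1: int, r2: float, n2: int) -> float:
    """Standardized difference between two independent correlation coefficients."""
    if n1 <= 3 or n2 <= 3:
        raise ValueError("each sample needs more than 3 observations")
    standard_error = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / standard_error


if __name__ == "__main__":
    # Gradient strengths quoted for Fig. 5; the sample sizes (13 genes each) are assumed.
    print(z_difference(0.55, 13, 0.43, 13))
```

A positive value indicates a stronger gradient in the first set of regions than in the second, on a scale that is comparable across exchange types with different numbers of regions.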
The relative usage of homopolymers as opposed to circular code codons is expected to be greater in expressed overlap coding regions than in other regions,


Fig. 11. Difference between proportions of homopolymers among circular code codons in predicted overlap coding regions and in other regions for different types of systematic symmetric nucleotide exchanges as a function of the mean kpol of corresponding nucleotide misinsertions (from Table 2 in Lee and Johnson, 2006). Overlapping genes expected more expressed according to the y axis correspond to types of nucleotide misinsertions with high DNA polymerization rates (x axis). The result shows as for Figs. 8 and 9 that computational results from bioinformatics analyses converge with the known DNA misinsertion kinetics of the RNA exchanged nucleotides.


and this difference is expected to increase with expression levels. Fig. 11 shows that high mean kpols correspond to types of systematic nucleotide exchanges where this difference is large, and vice versa (r = 0.708, P = 0.016; rs = 0.70, P = 0.024, one tailed tests). Hence here, bioinformatics analyses estimating high expression levels correspond to types of nucleotide exchange polymerizations that are expected to have high rates of polymerization at high nucleotide concentrations. These results suggest that deamination gradient analyses estimate more expression of predicted overlapping genes encrypted by systematic symmetric nucleotide exchanges at low nucleotide concentrations. Hence these might be associated with stressful conditions such as low resource availability and low metabolism, as suggested for other types of alternative mitochondrial gene expressions (Seligmann, 2010c, 2011a), which putatively would favor deaminations. Figs. 7 and 8 show that deamination gradient and circular code analyses tend to converge in their overall patterns, yet it seems that each approach fits better specific conditions. Circular code analyses seem to estimate better expression of overlapping genes encrypted by nucleotide exchanges at high concentrations of free nucleotides.


4. General discussion


The analyses presented above confirm the hypothesis that transcription that systematically exchanges nucleotides (in a symmetric manner) reveals protein coding genes that were not detected until now in the human mitochondrial genome. A number of lines of evidence suggest this: (1) RNA transcripts fitting polymerization according to several nucleotide exchange rules are detected in GenBank’s EST database (Table 2); (2) Blastp analyses of putative polypeptides translated from ‘exchange transcribed’ sequences yield numerous alignments with proteins existing in


GenBank; (3) identities of proteins aligning seem non-random in relation to mitochondrial metabolism and include numerous proteins interacting with DNA and RNA (putatively, future studies will find that some of the proteins responsible for exchange transcription are among these); (4) these putative overlapping protein coding genes include few stop codons; (5) bias against stop codons within putative overlapping protein coding genes is inversely proportional to transcript abundances of nucleotide exchange types, suggesting a balance between positive and negative regulations of expression of overlapping genes coded by nucleotide exchange transcription (upregulation) and stop codon presence (downregulation); (6) replicational deamination gradient analyses tend to confirm the coding status of putative overlapping protein coding genes; (7) circular code analyses of codon usages in putative overlap coding regions also confirm this status; (8) results of 6 and 7 tend to converge; (9) that level of convergence is consistent with the number of genome regions that are found ‘exchange transcribed’; (10) frequencies and lengths of RNA transcripts corresponding to different types of nucleotide exchanges are explained by kinetic parameters of occasional nucleotide misinsertions by the human mitochondrial DNA polymerase gamma that reflect the assumed transcriptional nucleotide exchanges. It is particularly notable that results for each of the 10 levels are independent, yet yield a highly integrated overall picture. This confirms that the coding system is much more complex than usually believed (Mercer et al., 2011a,b), and that some types of coding/recoding events, though apparently rare or very rare, actually exist. At this point, the next major steps are similar analyses for nucleotide exchanges that are not symmetric, and to investigate whether the proteins predicted by the analyses can be found and extracted from mitochondria. 
It is important to note that analyses suggest that some of the overlap coding genes seem more optimized than others. This could have two meanings: their expression level is greater, and/or their function is more important. It is not certain that nucleotide exchange types that seem more frequent are necessarily those that are most important from a functional point of view. Hence transcript abundance does not need to be perfectly correlated with optimization. It seems plausible that the importance of a coding system associated with a given type of nucleotide exchange is not only reflected by the abundance of transcripts detected (Table 2). It is also reflected by the number of putative protein coding genes detected (Table 3), and by the extent to which overlap coding is independent of transcription. The analyses compare different types of nucleotide exchanges. It is possible that these are not all variations of the same phenomenon. Besides the fact that six nucleotide exchange rules involve only a pair of nucleotides, and that three involve two pairs, some of these pairs exchange between nucleotides of the same type (purine to purine, or pyrimidine to pyrimidine), while others do not. This might imply mechanisms of different natures. In addition, the nucleotide exchange A↔U+C↔G could be compatible with a different type of polymerization, which does not necessarily imply nucleotide exchange, but would result in the same transcript sequence. It might result from regular 5′-to-3′ RNA polymerization where the progression follows the 3′-to-5′ direction of a sequence, a phenomenon that has not yet been observed (but note that 3′-to-5′ directed RNA polymerization occurs (Jackman et al., 2012), including in mitochondria), but for which evidence exists (Seligmann, 2012d).
Such RNA is also compatible with RNA that forms DNA-RNA triplexes according to antiparallel Hoogsteen base pairings, which have been observed in vertebrate mitochondria (Annex and Williams, 1990; Rocher et al., 2002; Takamatsu et al., 2002). There are other notable observations pertaining to overlap coding through systematic nucleotide exchanges. Most proteins aligning with sequences translated from such exchange transcribed human mitochondrial sequences have eukaryotic, mainly


metazoan origins. The alignment data suggest that, occasionally, nucleotide exchange-coded genes are recoded and integrated in the genome so that the protein is coded without nucleotide exchange. This seems to occur relatively rarely because, in most cases, only one protein from one organism aligns with the protein translated from nucleotide exchange transcripts. The data indicate some phyletic clustering for mitochondrially nucleotide-exchange encoded overlap coding genes and those that are encoded without nucleotide exchange: organisms possessing mitochondria are overrepresented among the latter. The analyses exclude the possibility that transcripts detected as exchanging nucleotides result from annotation errors or statistical artifacts (see Fig. 1), especially because exchange transcription rates are predicted by rate parameters of misinsertion kinetics for the corresponding nucleotides: systematic nucleotide exchange rates during RNA transcription are proportional to occasional replicational mutation rates due to DNA misinsertion of the same nucleotide types (Figs. 9–11). Hence nucleotide exchanging transcription fits basic biochemical nucleotide properties that also affect DNA misinsertion rate kinetics. However, these ESTs could also be products of dysfunctional polymerases during the creation of the cDNA libraries. In that case, the data in Table 2 would not directly reflect frequencies of naturally occurring nucleotide exchanging transcription in the mitochondrion. These frequencies would only be estimated indirectly, from the production of cDNAs by RNA→DNA reverse transcription. Both possibilities are plausible and are not mutually exclusive.
However, even if the ‘unnatural’ scenario for exchange transcript production were correct, the transcript abundances produced by that ‘unnatural’ mechanism are proportional to computational predictions of overlap protein coding genes embedded in nucleotide exchange RNA transcripts (Figs. 2, 4, 8, 10 and 11). This coherence between gene contents and transcript abundance indicates that abundances from Table 2 reflect a natural reality of mitochondria (and cells), even if RNA→DNA reverse transcription, and not DNA→RNA transcription, produced the suspected transcripts. In that case, occasional RNA→DNA reverse transcriptase dysfunctions would have given insights into the existence of a previously unknown family of related types of polymerization. Nucleotide exchange coding, as a way to encode more genes without increasing genome length, seems particularly adequate for the dense vertebrate mitochondrial genome. However, there is no a priori ground to assume that such coding is limited to the mitochondrial genome. It is very probable that, at various levels, this type of coding also occurs in the nucleus and in prokaryotes. Hence protein coding genes encoded by genomes might be much more numerous than believed.
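The nine symmetric exchange rules discussed above (six involving a single pair of nucleotides, three involving two pairs) can be illustrated with a minimal sketch. It is not the pipeline used for the analyses; the function names (`symmetric_exchange_rules`, `apply_rule`, `same_type`) and the example sequence are hypothetical, introduced only for illustration. The sketch enumerates the rules, applies one to an RNA string, and classifies a pair as exchanging nucleotides of the same type (purine with purine, pyrimidine with pyrimidine) or not.

```python
from itertools import combinations

NUCLEOTIDES = "ACGU"
PURINES = {"A", "G"}  # C and U are pyrimidines


def symmetric_exchange_rules():
    """Return the nine symmetric exchange rules as lists of nucleotide pairs."""
    pairs = list(combinations(NUCLEOTIDES, 2))
    single = [[p] for p in pairs]  # six rules exchanging one pair
    # Three rules exchanging two disjoint pairs (all four nucleotides change).
    double = [[p, q] for p, q in combinations(pairs, 2) if not set(p) & set(q)]
    return single + double


def rule_name(rule):
    """Format a rule in the notation of the text, e.g. 'A↔U+C↔G'."""
    return "+".join(f"{a}↔{b}" for a, b in rule)


def apply_rule(rna, rule):
    """Systematically exchange nucleotides of an RNA string under a rule."""
    table = {}
    for a, b in rule:
        table[a], table[b] = b, a
    return "".join(table.get(n, n) for n in rna)


def same_type(pair):
    """True if both nucleotides are purines or both are pyrimidines."""
    a, b = pair
    return (a in PURINES) == (b in PURINES)


if __name__ == "__main__":
    example = "AUGCCGUA"  # hypothetical sequence, illustration only
    for rule in symmetric_exchange_rules():
        print(rule_name(rule), apply_rule(example, rule))
```

Under these assumptions, the rule A↔U+C↔G returns the base-by-base complement of the input in the same orientation, which is why its product could also be explained by a polymerization mechanism that involves no nucleotide exchange, as noted in the text.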


References


Ahmed, A., Frey, G., Michel, C.J., 2007. Frameshift signals in genes associated with the circular code. In Silico Biol. 7, 155–168.
Ahmed, A., Frey, G., Michel, C.J., 2010. Essential molecular functions associated with circular code evolution. J. Theor. Biol. 264, 613–622.
Ahmed, A., Michel, C.J., 2011. Circular code signal in frameshift genes. J. Comp. Sci. Syst. Biol. 4, 7–15.
Akashi, H., Gojobori, T., 2002. Metabolic efficiency and amino acid composition in the proteomes of Escherichia coli and Bacillus subtilis. Proc. Natl. Acad. Sci. U.S.A. 99, 3695–3700.
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. 25, 3389–3402.
Altschul, S.F., Wootton, J.C., Gertz, E.M., Agarwala, R., Morgulis, A., Schäffer, A.A., Yu, Y.K., 2005. Protein database searches using compositionally adjusted substitution matrices. FEBS J. 272, 5101–5109.
Alves, R., Savageau, M.A., 2005. Evidence of selection for low cognate amino acid bias in amino acid biosynthetic enzymes. Mol. Microbiol. 56, 1017–1034.
Amzallag, G.N., 2001. Data analysis in plant physiology: are we missing the reality? Plant Cell Environ. 24, 881–890.


Annex, B.H., Williams, R.S., 1990. Mitochondrial DNA structure and expression in specialized subtypes of mammalian striated muscle. Mol. Cell. Biol. 10, 5671–5678.
Arqués, D.G., Michel, C.J., 1996. A complementary circular code in the protein coding genes. J. Theor. Biol. 182, 45–58.
Arqués, D.G., Michel, C.J., 1997. A circular code in the protein coding genes of mitochondria. J. Theor. Biol. 189, 273–290.
Barton, M.D., Delneri, D., Oliver, S.G., Rattray, M., Bergman, M.C., 2010. Evolutionary systems biology of amino acid biosynthetic cost in yeast. PLoS One 5, e11935.
Bertram, J.G., Oertell, K., Petruska, J., Goodman, M.F., 2010. DNA polymerase fidelity: comparing direct competition of right and wrong dNTP substrates with steady state and pre-steady state kinetics. Biochemistry 49, 20–28.
Brocchieri, L., Karlin, S., 2005. Protein length in eukaryotic and prokaryotic proteomes. Nucl. Acids Res. 33, 3390–3400.
Chipman, A.D., Khaner, O., Haas, A., Tchernov, E., 2001. The evolution of genome size: what can be learned from anuran development? J. Exp. Zool. A 291, 364–374.
Daniel, C., Wahlstedt, H., Ohlson, J., Bjork, P., Ohman, M., 2011. Adenosine-to-inosine RNA editing affects trafficking of the γ-aminobutyric acid type A (GABAA) receptor. J. Biol. Chem. 286, 2031–2040.
Dias Neto, E., Garcia Correa, R., Verjovski-Almeida, S., Briones, M.R., Nagai, M.A., da Silva Jr., W., Zago, M.A., Bordin, S., Costa, F.F., Goldman, G.H., Carvalho, A.F., Matsukuma, A., Baia, G.S., Simpson, D.H., Brunstein, A., deOliveira, P.S., Bucher, P., Jongeneel, C.V., O’Hare, M.J., Soares, F., Brentani, R.R., Reis, L.F., de Souza, S.J., Simpson, A.J., 2000. Shotgun sequencing of the human transcriptome with ORF expressed sequence tags. Proc. Natl. Acad. Sci. U.S.A. 97, 3491–3496.
Faure, E., Delaye, L., Tribolo, S., Levasseur, A., Seligmann, H., Barthélémy, R.-M., 2011. Probable presence of an ubiquitous cryptic mitochondrial gene on the antisense strand of the cytochrome oxidase I gene. Biol. Direct 6, 56.
Fredrico, A., Kunkel, T.A., Shaw, B.R., 1990. A sensitive genetic assay for the detection of cytosine deamination: determination of rate constants and the activation energy. Biochemistry 29, 2532–2537.
Gonzalez, D.L., Giannerini, S., Rosa, R., 2011. Circular codes revisited: a statistical approach. J. Theor. Biol. 275, 21–28.
Gregory, T.R., Hebert, P.D.N., 1999. The modulation of DNA content: proximate causes and ultimate consequences. Genome Res. 9, 317–324.
Huang, G.M., Ng, W., Farkas, l., He, J., Liang, L., Gordon, H.A., Yu, D., Hood, J.L., 1999. Prostate cancer expression profiling by cDNA sequencing analysis. Genomics 59, 178–186.
Itzkovitz, S., Alon, Y., 2007. The genetic code is nearly optimal for allowing additional information within protein-coding sequences. Genome Res. 17, 405–412.
Jackman, J.E., Gott, J.M., Gray, M.W., 2012. Doing it in the reverse: 3′-to-5′ polymerization by the Thg1 superfamily. RNA 18, 886–899.
Jin, Y., Tian, N., Cao, J., Liang, J., Yang, Z., Mv, J., 2007. RNA editing and alternative splicing of the insect nAChR subunit alpha6 transcript: evolutionary conservation, divergence and regulation. BMC Evol. Biol. 7, 98.
Kasiviswanathan, R., Copeland, W.C., 2011. Ribonucleotide discrimination and reverse transcription by the human mitochondrial DNA polymerase. J. Biol. Chem. 286, 31490–31500.
Krishnan, N.M., Seligmann, H., Raina, S.Z., Pollock, D.D., 2004a. Detecting gradients of asymmetry in site-specific substitutions in mitochondrial genomes. DNA Cell Biol. 23, 707–714.
Krishnan, N.M., Seligmann, H., Raina, S.Z., Pollock, D.D., 2004b. Phylogenetic analyses detect site-specific perturbations in asymmetric mutation gradients. Curr. Comput. Mol. Biol. 2004, 266–267.
Krizek, M., Krizek, P., 2012. Why has nature invented three stop codons of DNA and only one start codon? J. Theor. Biol. 304, 183–187.
Lee, H.R., Johnson, K.A., 2006. Fidelity of the human mitochondrial DNA polymerase. J. Biol. Chem. 281, 36236–36240.
Lev-Maor, G., Sorek, R., Levanon, E.Y., Paz, N., Eisenberg, E., Ast, G., 2007. RNA-editing-mediated exon evolution. Genome Biol. 8, R29.
Liew, C.C., Hwang, D.M., Fung, Y.W., Laurenssen, C., Cukerman, E., Tsui, S., Lee, C.Y., 1994. A catalogue of genes in the cardiovascular system as identified by expressed sequence tags. Proc. Natl. Acad. Sci. U.S.A. 91, 10645–10649.
Lui, V.W.Y., Luk, S.C.W., Tsui, S.K.W., Tung, C.K.C., Yam, N.Y.H., Liew, C.C., Lee, C.Y., 1995. Gene expression of adult human heart as revealed by random sequencing of cDNA library. In: Miami Winter BioTechnol. Symp. Proc., vol. 6, p. 90.
Mercer, T.R., Dinger, M.E., Crawford, J., Smith, M.A., Shearwood, A.M., Haugen, E., Bracken, C.P., Rackham, O., Stamatoyannopoulos, J.A., Filipovska, A., Mattick, J.S., 2011a. The human mitochondrial transcriptome. Cell 146, 645–658.
Mercer, T.R., Gerhardt, D.J., Dinger, M.E., Crawford, J., Trapnell, C., Jeddeloh, J.A., Mattick, J.S., Rinn, J.L., 2011b. Targeted RNA sequencing reveals the deep complexity of the human transcriptome. Nat. Biotechnol. 30, 99–104.
Michel, C.J., 2012. Circular code motifs in transfer RNA and 16S ribosomal RNAs: a possible translation code in genes. Comput. Biol. Chem. 34, 24–37.
Namy, O., Lecointe, F., Grosjean, H., Rousset, J.-H., 2005. Translational recoding and RNA modifications. Fine-tuning of RNA functions by modification and editing. Top. Curr. Genet. 12, 2005–2340.
Paz, N., Levanon, E.Y., Amariglio, N., Heimberger, A.B., Ram, Z., Constantini, S., Barbashi, Z.S., Adamsky, K., Safran, M., Hirschberg, A., Krupsky, M., BenDov, I., Cazacu, S., Mikkelsen, T., Brodie, C., Eisenberg, E., Rechavi, G., 2007. Altered adenosine-to-inosine RNA editing in human cancer. Genome Res. 17, 1586–1595.
Perlstein, E.O., de Bivort, B.L., Schreiber, S.L., 2007. Evolutionary conserved optimization of amino acid biosynthesis. J. Mol. Evol. 98, 186–196.


Raina, S.Z., Faith, J.J., Disotell, T.R., Seligmann, H., Stewart, C.B., Pollock, D.D., 2005. Evolution of base-substitution gradients in primate mitochondrial genomes. Genome Res. 15, 665–673.
Reenan, R.A., 2005. Molecular determinants and guided evolution of species-specific RNA editing. Nature 434, 409–413.
Rocher, C., Letellier, T., Copeland, W.C., Lestienne, P., 2002. Base composition at mtDNA boundaries suggests a DNA triple helix model for human mitochondrial DNA large-scale rearrangements. Mol. Genet. Metab. 76, 123–132.
Seligmann, H., 2007. Cost minimization of ribosomal frameshifts. J. Theor. Biol. 249, 162–167.
Seligmann, H., 2008. Hybridization between mitochondrial heavy strand tDNA and expressed light strand tRNA modulates the function of heavy strand tDNA as light strand replication origin. J. Mol. Biol. 379, 188–199.
Seligmann, H., 2010a. The ambush hypothesis at the whole-organism level: off frame, ‘hidden’ stops in vertebrate mitochondrial genes increase developmental stability. Comp. Biol. Chem. 34, 80–85.
Seligmann, H., 2010b. Avoidance of antisense antiterminator tRNA anticodons in vertebrate mitochondria. Biosystems 101, 42–50.
Seligmann, H., 2010c. Undetected antisense tRNAs in mitochondrial genomes? Biol. Direct 5, 39.
Seligmann, H., 2011a. Two genetic codes, one genome: frameshifted primate mitochondrial genes code for additional proteins in presence of antisense antitermination tRNAs. Biosystems 106, 271–286.
Seligmann, H., 2011b. Mutation patterns due to converging mitochondrial replication and transcription increase lifespan, and cause growth rate-longevity tradeoffs. In: Seligmann, H. (Ed.), DNA Replication—Current Advances. InTech, pp. 151–180 (Chapter 6).
Seligmann, H., 2012a. Coding constraints modulate chemically spontaneous mutational replication gradients in mitochondrial genomes. Curr. Genomics 13, 37–54.
Seligmann, H., 2012b. Positive and negative cognate amino acid bias affects compositions of aminoacyl-tRNA synthetases and reflects functional constraints on protein structure. BIO 2, 11–26.
Seligmann, H., 2012c. An overlapping genetic code for frameshifted overlapping genes in Drosophila mitochondria: antisense antitermination tRNAs UAR insert serine. J. Theor. Biol. 296, 61–76.
Seligmann, H. Overlapping genetic codes for overlapping frameshifted genes in testudines, and Lepidochelys olivacea as a special case. Comp. Biol. Chem., in press.
Seligmann, H., 2012d. Overlapping genes coded in the 3′-to-5′ direction in mitochondrial genes and 3′-to-5′ polymerization of non-complementary RNA by an ‘invertase’. J. Theor. Biol. 315, 38–52.


Seligmann, H., 2012e. Putative mitochondrial polypeptides coded by expanded quadruplet codons, decoded by antisense tRNAs with unusual anticodons. Biosystems 110, 84–106.
Seligmann, H. Putative protein-encoding genes within mitochondrial rDNA and the D-loop region. In: Lin, Z., Liu, W. (Eds.), Ribosomes: Molecular Structure, Role in Biological Functions and Implications for Genetic Diseases, Nova Publishers, in press.
Seligmann, H., Anderson, S.C., Autumn, K., Bouskila, A., Saf, R., Tuniyev, B.S., Werner, Y.L., 2007. Analysis of the locomotor activity of a nocturnal desert lizard (Reptilia: Gekkonidae: Teratoscincus scincus) under varying moonlight. Zoology 110, 104–117.
Seligmann, H., Krishnan, N.M., Rao, B.J., 2006. Possible multiple origins of replication in primate mitochondria: alternative role of tRNA sequences. J. Theor. Biol. 241, 321–332.
Seligmann, H., Pollock, D.D., 2004a. The ambush hypothesis: hidden stop codons prevent off-frame gene reading. In: Midsouth Computational Biology and Bioinformatics Society, vol. 36, Abstract.
Seligmann, H., Pollock, D.D., 2004b. The ambush hypothesis: hidden stop codons prevent off-frame gene reading. DNA Cell Biol. 23, 701–705.
Sessions, S.K., Larson, A., 1987. Developmental correlates of genome size in plethodontid salamanders and their implications for genome evolution. Evolution 41, 1239–1251.
Singh, T.R., Pardasani, K.R., 2009. Ambush hypothesis revisited: evidences for phylogenetic trends. Comput. Biol. Chem. 33, 239–244.
Takamatsu, C., Umeda, S., Ohsato, T., Ohno, T., Abe, Y., Fukuoh, A., Shinagawa, H., Hamasaki, N., Kang, D., 2002. Regulation of mitochondrial D-loops by transcription factor A and single-stranded DNA-binding protein. EMBO Rep. 3, 451–456.
Tanaka, M., Ozawa, T., 1994. Strand asymmetry in human mitochondrial mutations. Genomics 22, 327–335.
Tse, H., Cai, J.J., Tsoi, H.-W., Lam, E.P.T., Yuen, K.-Y., 2010. Natural selection retains overrepresented out-of-frame stop codons against frameshift peptides in prokaryotes. BMC Genomics 11, 491.
Warnecke, T., Hurst, L.D., 2011. Error prevention and mitigation as forces in the evolution of genes and genomes. Nat. Rev. Genet. 12, 875–881.
Warringer, J., Blomberg, A., 2006. Evolutionary constraints on yeast protein size. BMC Evol. Biol. 6, 61.
Zamft, B.M., Marblestone, A.H., Kording, K., Schmidt, D., Martin-Alarcon, D., Tyo, K., Boyden, E.S., Church, G., 2012. Measuring cation dependent DNA polymerase fidelity landscapes by deep sequencing. PLoS One 7, e43876.
Zhang, Z., Schwartz, S., Wagner, L., Miller, W., 2000. A greedy algorithm for aligning DNA sequences. J. Comp. Biol. 7, 203–214.
Zuker, M., 2003. Mfold web server for nucleic acid folding and hybridization prediction. Nucl. Acids Res. 31, 3406–3415.
