Local RNA base pairing probabilities in large sequences

Share Embed


Descripción

BIOINFORMATICS APPLICATIONS NOTE

Vol. 22 no. 5 2006, pages 614–615 doi:10.1093/bioinformatics/btk014

Genome analysis

Local RNA base pairing probabilities in large sequences Stephan H. Bernhart1, Ivo L. Hofacker1, and Peter F. Stadler1,2,3 1

Institut fu¨r Theoretische Chemie, Universita¨t Wien, Wa¨hringerstr. 17, A-1090 Wien, Austria, 2Bioinformatics Group, Department of Computer Science and Interdisciplinary Center for Bioinformatics, University of Leipzig, Ha¨rtelstr. 16-18, D-04107 Leipzig, Germany and 3Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA

Received on October 30, 2005; revised and accepted on December 15, 2005 Advance Access publication December 20, 2005 Associate Editor: Alfonso Valencia

The standard backtracking procedure for the partition function folding algorithm (McCaskill, 1990) can be expressed as pij ¼

XX ^ i‚j Zj+1‚n Z 1‚i1 Z + pkl Xij‚kl : Z 1‚n kj

Here Xij,kl is the ratio of the two partition functions Z^ ij‚kl with the constraint that both i, j and k, l pair, and Z^ kl . The first term describes the case in which the (i, j) pair is external, i.e. not enclosed by another pair, the second (sum) term considers all possible base pairs (k, l) that could enclose (i, j). In the simplest case, i.e. for energies dependent on individual base pairs only, we have Z^ ij‚kl ¼ Zk+1‚i1 Z^ ij Zj+1‚l1 zkl ‚

1

INTRODUCTION

Computational approaches to detect and classify structured RNAs at genomic scales require efficient ways of computing local RNA secondary structures for both computational and biological reasons: (1) Long-range base pairs in large transcripts are disfavored kinetically relative to short-range pairs (Flamm et al., 2000). (2) Global approaches to RNA folding are limited to sequence length 20 000 on most hardware because of memory consumption. (3) In general, the exact boundaries of the transcript are unknown, so that global folds cannot add to the accuracy of the structure prediction relative to folding individual sequence windows. A recent algorithm for microRNA detection is based upon the idea to consider the stability of secondary structure against changes in the immediate environment (Pfeffer et al., 2005; Sewer et al., 2005). More precisely, this approach considers the frequency with which a certain base pair (i, j) occurs in local minimum energy structures that are computed from sequence windows with a given size L. Here we combine this idea with a recently developed algorithm for local minimum energy structure predictions (Hofacker et al., 2004b). More precisely, we derive recursions for the average equilibrium probability of a base pair (i, j) over all fixed-size sequence windows.

2

ALGORITHM

Denote by Zij the partition function over all secondary structures on the sequence interval [i, j], write Z^ij for the partition function subject to the constraint that i and j pair and let pij be the probability that i and j are actually paired in thermodynamic equilibrium. 

To whom correspondence should be addressed.

614

ð1Þ

ð2Þ

where zkl is the Boltzmann factor of the pairing energy for the closing base pair (k, l). In the standard energy model, described e.g. by Mathews et al. (1999), Z^ ij‚kl is a sum over contributions for the different loop types (interior loops, bulges and multi-branched loops) as detailed by McCaskill (1990); for given i, j, k, l it can be computed in constant time from the tabulated partition functions of subsequences. Let us now turn to interactions localized within a sequence window. We denote by Ziju‚L the partition function over all secondary structures on the sequence interval [i, j] when the sequence window [u, u + L] is folded. Similarly, Z^ iju‚L denotes the partition function with the additional constraint that positions i and j are paired. Furthermore, we write pu‚L ij for the probability that i and j form a base pair when the sequence window [u, u + L] is folded. Since the partition functions on a subsequence are independent of the external structures as long as the subsequence is contained in the folded sequence window, we have  Zij if ½i‚ j  ½u‚ u + L Ziju‚L ¼ ð3Þ 0 otherwise: Furthermore, we observe that pu‚L ij ¼ 0 unless [i, j]  [u, u + L]. We can immediately restrict Equation (1) to a sequence window [u, u + L] since the recursions for Zij depend only on sub-sequences within the interval [i, j] (McCaskill, 1990). Thus pu‚L ij ¼

^u‚L u‚L Zu‚L 1‚i1 Zi‚j Z j+1‚n Zu‚L u‚u+L

+

XX

u‚L pu‚L kl Xij‚kl

kj

XX Zu‚i1 Z^i‚j Zj+1‚u+L ¼ + pu‚L kl Xij‚kl : Zu‚u+L kj

ð4Þ

 The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

Downloaded from http://bioinformatics.oxfordjournals.org/ by guest on January 13, 2016

ABSTRACT Summary: The genome-wide search for non-coding RNAs requires efficient methods to compute and compare local secondary structures. Since the exact boundaries of such putative transcripts are typically unknown, arbitrary sequence windows have to be used in practice. Here we present a method for robustly computing the probabilities of local base pairs from long RNA sequences independent of the exact positions of the sequence window. Availability: The program RNAplfold is part of the Vienna RNA Package and can be downloaded from http://www.tbi.univie.ac. at/RNA/ Contact: [email protected]

Local base pairing probabilities in large RNAs

mir-106a

mir-18b

mir-20b

mir-19b-2

mir-92-2

C C U U G G C C A U G U A A A A G U G C U U A C A G U G C A G G U A G C U U U U U G A G A U C U A C U G C A A U G U A A G C A C U U C U U A C A U U A C C A U G G U G A U U U A G U C A A U G G C U A C U G A G A A C U G U A G U U U G U G C A U A A U U A A G U A G U U G A U G C U U U U G A G C U G C U U C U U A U A A U G U G U C U C U U G U G U U A A G G U G C A U C U A G U G C A G U U A G U G A A G C A G C U U A G A A U C U A C U G C C C U A A A U G C C C C U U C U G G C A C A G G C U G C C U A A U A U A C A G C A U U U U A A A A G U A U G C C U U G A G U A G U A A U U U G A A U A G G A C A C A U U U C A G U G G U U U G U U U U U U G C C U U U U U A U U G U U U G U U G G G A A C A G A U G G U G G G G A C U G U G C A G U G U A C A G U U G U G U A C A G A G G A U A A G A U U G G G U C C U A G U A G U A C C A A A G U G C U C A U A G U G C A G G U A G U U U U G G C A U G A C U C U A C U G U A G U A U G G G C A C U U C C A G U A C U C U U G G A U A A C A A A U C U C U U G U U G A U G G A G A G A A U A U U C A A A G A C A U U G C U A C U U A C A A U U A G U U U U G C A G G U U U G C A U U U C A G C G U A U A U A U G U A U A U G U G G C U G U G C A A A U C C A U G C A A A A C U G A U U G U G A U A A U G U G U G C U U C C U A C G U C U G U G U G A A C A C A C C U U C A U G C G U A U C U C C A G C A C U C A U G C C C A U U C A U C C C U G G G U G G G G A U U U G U U G C A U U A C U U G U G U U C U A U A U A A A G U A U U G C A C U U G U C C C G G C C U G U G G A A G A

133029828

Human chromosome X (minus strand)

133029088

Fig. 1. Local structures (L ¼ 100) in a 740 nt region of human X chromosome containing five miRNAs annotated in miRBase 7.1. (Griffiths-Jones, 2004).

i XX X 1 + pu‚L Xij‚kl L  ð j  iÞ + 1 u¼jL kj kl i1 iX +L X k X

¼ pL ij +

pu‚L kl Xij‚kl L  ð j  iÞ + 1 k¼jL l¼j + 1 u¼lL

¼ pL ij +

i1 iX +L X L  ðk  lÞ + 1 L p Xij‚kl : L  ð j  iÞ + 1 kl k¼jL l¼j + 1

ð6Þ

Again, modified expressions apply to the sequence intervals close to the 50 and 30 ends.

3

PERFORMANCE AND APPLICATIONS

Equation (6) implies that the n · L matrix pLij can be computed in O(n · L3) time and O(n · L) memory for any fixed L. As is usual in most dynamic programming implementations of RNA folding, CPU requirements can be reduced further to O(n · L2) by applying the restriction that interior loops are bounded by a maximal size M and assuming that multi-loop energies are linear in loop size and loop degree. Forward and backward recursions can be interleaved in such a way that only O(L2) partition function values have to be stored at any given time, reducing the memory requirement to O(n + L2). Thus, our approach is equivalent to a sliding window approach with increment 1, but is faster by a factor of L. The recursions (6) are implemented in the ANSI C program RNAplfold. Dot plots representing the values of pLij are provided in Postscript format and can be used to visually inspect the results, Figure 1. The program is fast enough for genome-wide applications at least when run on a Linux cluster. Figure 2 shows that 3 Mb of genomic sequence per hour can be scanned on a single CPU. RNAplfold has a number of obvious applications, which we are currently exploring. Along the lines of Pfeffer et al. (2005) it can be used to efficiently retrieve candidates for microRNA precursors or

400

t ~ 0.0315 n 100

2

t ~ 0.35 L + 0.0021 L

300 200

10

10

L=100 3

10

4

10

5

n=40000

100 10

6

sequence length n

100

200

300

400

window size L

Fig. 2. Performance of RNAplfold on a Pentium 4 with 3.2 GHz confirms the theoretical scaling of the CPU time as O (n · L2).

other structured RNAs from genomic sequence data. The parameter L should be a bit larger than the structures of interest. More generally, however, genome-wide tables of pLij provide valuable a priori structure information that can be exploited by algorithms that search for RNA sequence/structure patterns such as e.g. a local variant of marna (Siebert and Backofen, 2005) or pmcomp (Hofacker et al., 2004a). As such RNAplfold provides a starting point for alternative approaches to purely comparative RNA annotation strategies.

ACKNOWLEDGEMENT Financial support by the Austrian GEN-AU bioinformatics integration network sponsored by bm:bwk (200.067/3-VI/1/2002) and the German DFG Bioinformatics Initiative (BIZ-6/1-2) is gratefully acknowledged. Conflict of Interest: none declared.

REFERENCES Flamm,C. et al. (2000) RNA folding at elementary step resolution. RNA, 6, 325–338. Griffiths-Jones,S. (2004) The microRNA Registry. Nucleic Acids Res., 32, D109–D111. Hofacker,I.L. et al. (2004a) Alignment of RNA base pairing probability matrices. Bioinformatics, 20, 2222–2227. Hofacker,I.L. et al. (2004b) Prediction of locally stable RNA secondary structures for genome-wide surveys. Bioinformatics, 20, 186–190. Mathews,D. et al. (1999) Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol., 288, 911–940. McCaskill,J.S. (1990) The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29, 1105–1119. Pfeffer,S. et al. (2005) Identification of microRNAs of the herpesvirus family. Nat. Methods., 2, 269–276. Sewer,A. et al. (2005) Identification of clustered microRNAs using ab initio prediction method. BMC Bioinformatics, 6, 267. Siebert,S. and Backofen,R. (2005) MARNA: multiple alignment and consensus structure prediction of RNAs based on sequencestructure comparisons. Bioinformatics, 21, 3352–3359.

615

Downloaded from http://bioinformatics.oxfordjournals.org/ by guest on January 13, 2016

For i + L > n and j  L < 1 the sequence windows are shorter, hence Equation (5) has to be modified accordingly. Substitution of Equation (4) yields in the generic case i ^u‚L u‚L X Z u‚L 1 1‚i1 Zi‚j Z j+1‚n pLij ¼ u‚L L  ð j  iÞ + 1 u¼jL Z1‚n |fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl} pL ij

500

1000

CPU time [s]

Next, we define the average probability of an (i, j) pair over all folding windows containing the sequence interval [i, j] as i X 1 pu‚L : ð5Þ pLij ¼ L  ð j  iÞ + 1 u¼jL ij

Lihat lebih banyak...

Comentarios

Copyright © 2017 DATOSPDF Inc.