Alternative expression analysis by RNA sequencing

Descripción

Articles

Alternative expression analysis by RNA sequencing

© 2010 Nature America, Inc. All rights reserved.

Malachi Griffith1,2, Obi L Griffith1,2, Jill Mwenifumbo1, Rodrigo Goya1, A Sorana Morrissy1,2, Ryan D Morin1, Richard Corbett1, Michelle J Tang1, Ying-Chen Hou1, Trevor J Pugh1,2, Gordon Robertson1, Suganthi Chittaranjan1, Adrian Ally1, Jennifer K Asano1, Susanna Y Chan1, Haiyan I Li1, Helen McDonald1, Kevin Teague1, Yongjun Zhao1, Thomas Zeng1, Allen Delaney1, Martin Hirst1, Gregg B Morin1,2, Steven J M Jones1,2, Isabella T Tai1,3 & Marco A Marra1,2 In alternative expression analysis by sequencing (ALEXA-seq), we developed a method to analyze massively parallel RNA sequence data to catalog transcripts and assess differential and alternative expression of known and predicted mRNA isoforms in cells and tissues. As proof of principle, we used the approach to compare fluorouracil-resistant and -nonresistant human colorectal cancer cell lines. We assessed the sensitivity and specificity of the approach by comparison to exon tiling and splicing microarrays and validated the results with reverse transcription–PCR, quantitative PCR and Sanger sequencing. We observed global disruption of splicing in fluorouracil-resistant cells characterized by expression of new mRNA isoforms resulting from exon skipping, alternative splice site usage and intron retention. Alternative expression annotation databases, source code, a data viewer and other resources to facilitate analysis are available at http://www.alexaplatform.org/alexa_seq/.

Expression of multiple distinct mRNA transcripts from a single gene locus by alternative transcript initiation, alternative splicing and alternative polyadenylation (collectively referred to as ‘alternative expression’) is widely recognized as a source of proteome diversity in eukaryotic species1. PCR, microarray and sequencing technologies have been applied to the study of transcript diversity arising from alternative expression1. As sensitive genome-wide methods have become available1, the prevalence and diversity of alternative expression has become increasingly apparent. Microarrays sensitive to alternative splicing 2–4 have made it possible to comprehensively detect and measure the abundance of known and predicted mRNA isoforms. More recently, massively parallel RNA sequencing, known as ‘RNA-seq’5 or wholetranscriptome shotgun sequencing6, has been proposed as a method with potential advantages over microarray-based methods. Several groups have reported proof-of-principle experiments conducted in yeast, human and mouse. These reports described the use of RNA-seq to perform de novo transcriptome annotation7, estimate expression of specific isoforms8, compare gene expression between a pair of contrasting conditions 9–11, comprehensively identify the expression of exons and exon junctions

in a single cell type6,12, and catalog the diversity of known and new transcripts across a range of tissues and individuals5,13,14. Although several tools can be used to identify exon junctions in RNA-seq data (split-read aligners such as TopHat 15, minimal match on either side of exon junction (MMES)16, G-Mo-R-Se17, SplitSeek18, SpliceMap19 and genomic short-read nucleotide alignment program (GSNAP)20), current research efforts are aimed at developing tools to comprehensively quantify mRNA isoforms and measure their differential expression in biological samples. As a result of these efforts, Scripture21 and Cufflinks22 have been developed to use the output of split-read aligners to reconstruct the population of expressed isoforms in a given RNA-seq library, but neither is optimized for comparing expression of individual isoforms between samples. Here we describe a method of analyzing RNA-seq data to allow assessment of the expression, differential expression and alternative expression of known and predicted mRNA isoforms including metrics for identifying reciprocal expression of alternative isoforms. RESULTS Annotation of features and read mapping The alternative expression analysis by sequencing (ALEXA-seq) approach comprises (i) creation of a database of expression and alternative expression sequence ‘features’ (Supplementary Fig. 1a and Supplementary Table 1a), (ii) mapping of short paired-end sequence reads to these features (Supplementary Fig. 1b), (iii) identification of features that are expressed above background noise while taking into account locus-by-locus noise (Fig. 1), (iv) identification of features that are differentially expressed in samples (Fig. 1), (v) identification of the subset of differentially expressed features that correspond to alternative expression of mRNA isoforms (Fig. 1) and (vi) visualization of the outcome of these analyses. The alternative expression database defines expression ‘features’ derived from sequence data that can be informative of alter native expression events such as exon skipping, alternative exon boundary usage, inclusion of cryptic exons and intron retention

1Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, Canada. 2Department of Medical Genetics, University of British Columbia, Vancouver, Canada. 3Department of Medicine, University of British Columbia, Vancouver, Canada. Correspondence should be addressed to M.A.M. ([email protected]).

Received 4 May; accepted 20 August; published online 12 September 2010; doi:10.1038/nmeth.1503

nature methods | VOL.7 NO.10 | OCTOBER 2010 | 843

Articles a

b

mRNA and EST alignments (UCSC)

EnsEMBL gene models

Sample 1

Sample 2

Define and annotate sequence features Genome coordinates, EST support, protein content, sequence conservation, exon skipping, feature names and others

Exon regions

Exon junctions

Exon boundaries

Intron regions

Intergenic regions

Map reads to genome, transcriptome and junction database

Feature database

© 2010 Nature America, Inc. All rights reserved.

c

Generate paired-end read data

Assign reads to features by coordinate comparison Calculate average coverage for all features

Assigned reads

Feature coverage values

Expressed above background?

Sample

Determine intergenic and locus-specific (intragenic) noise levels

Noise estimates

Expressed features

Top features

Differentially expressed features

Candidate genes

Visualize data in ALEXA-seq viewer

Database Process Decision Output data

Alternatively expressed features

Known isoforms New isoforms

Figure 1 | Overview of ALEXA-seq analysis. (a) Input data for creation of ALEXA-seq feature annotation databases. UCSC, University of California Santa Cruz genome browser. (b) Overview of RNA sequencing and mapping. (c) Steps involved in alternative expression analysis.

(Supplementary Fig. 1a). We defined a total of ~3.8 million such features for the human transcriptome and annotated each feature with information describing its size, repeat content, proteincoding content, sequence conservation, mRNA and expressed sequence tag (EST) support, and a descriptive feature name. Of these features, 16% corresponded to known EnsEMBL transcripts, 11% were not known to EnsEMBL but had EST or mRNA support, and 73% represented hypothetical events defined in our database but not currently supported by an EnsEMBL transcript, EST or mRNA data (Supplementary Table 1b). As proof of concept, we used ALEXA-seq to compare fluorouracil (5-FU)–resistant and –nonresistant human colorectal cancer cell lines, MIP101 (ref. 23) and MIP/5-FU24 (Online Methods). We sequenced Illumina paired-end RNA-seq libraries constructed from poly(A)+ RNA isolated from these lines. We assessed the quality of paired 36- and 42-nucleotide sequence libraries by examining the cDNA fragment size (inferred from the mapping distance between read pairs), base quality score at each position in each read, distribution of read-to-transcript alignment lengths, read position bias in known transcripts and redundant reads indicating possible amplification artifacts (Supplementary Results and Supplementary Fig. 2). After quality-filtering 262 million paired reads, we mapped the remaining 257 million reads (97.8%) to a database of sequences encompassing all of the 844 | VOL.7 NO.10 | OCTOBER 2010 | nature methods

features described above as well as all human and ancestral repeat elements25. Briefly, 2.3% of these reads mapped unambiguously to known repeat elements obtained from Repbase25, 59% to known or predicted transcripts, 0.23% to new junctions or boundaries of known exons, 5.2% to introns and 2.8% to intergenic sequence (Supplementary Fig. 3). Of these reads, 5.2% mapped to more than one sequence feature, 10% had similarity to human sequences but did not meet our alignment thresholds (these reads had too many gaps or mismatches in the alignments), and the remaining 13% could not be assigned with high confidence to a human sequence. Some of these unmapped reads may represent low-level contamination in our cell cultures or contamination from non human libraries sequenced at our facility. Based on an analysis of mitochondrial sequences, we estimate that no more than 0.3% of all reads mapped unambiguously to nonhuman sequences. The remaining unmapped reads may correspond to library sequencing or data extraction artifacts, but they might also be new exon-exon junctions, not represented in our junction sequence database. Feature expression, differential and alternative expression To maximize the sensitivity for alternative expression detection, the redundancy of base coverage over the exons of the transcriptome needs to be high. We therefore assessed the extent to which the transcriptome of each sample was comprehensively sequenced (Supplementary Fig. 4) as well as the extent to which the expression of each feature could be distinguished from background noise (Supplementary Figs. 5 and 6 and Supplementary Methods). Having identified all features expressed above background, we next performed differential expression analysis for all expressed features between 5-FU–sensitive and –resistant cells (Supplementary Results, Supplementary Fig. 7 and Supplementary Table 2). We next sought to identify alternative expression of mRNA isoforms from the differentially expressed features (Fig. 1). We calculated the expression of known transcripts for 10,151 multi transcript genes with sequence features unique to each transcript: 5,389 genes had evidence for at least one expressed isoform and 1,975 (37%) had evidence for expression of multiple known isoforms in the MIP101 cell line. An additional 5,862 loci had evidence for expression of new isoforms (1,372 with new junctions and 4,490 with new alternative boundaries, new exons or intron retentions). Alternative expression is predicted to influence protein-coding potential in 90% of alternative expression events involving exons and 98% of events affecting exon junctions (Supplementary Table 1b). A total of 68% of all multiexon expressed genes showed evidence for expression of multiple isoforms in our data. Although recent studies have predicted that 90–95% of human genes undergo alternative splicing13,14, these studies assessed these statistics in diverse cell types, whereas our estimate corresponds to clonally related cell lines of colorectal cancer origin. The number of exons skipped in putatively novel, expressed exon skipping junctions was most often one or two (Supplementary Fig. 8) as is observed in known exon-skipping isoforms. Furthermore, these events were often supported by one or more ESTs in the public EST database, dbEST (Supplementary Results and Supplementary Table 3). ALEXA-seq relies on accurate expression measurements for whole genes as well as individual features of those genes such as exons and exon junctions. We compared RNA-seq data to

© 2010 Nature America, Inc. All rights reserved.

Articles F R 1

2

10 ALEXA-seq differential expression (log2 difference)

Affymetrix exon array and ALEXA splicing microarray results2 generated with the same input RNAs and found high correlation in alternative expression analysis. In addition, RNA-seq provided improvements in sensitivity, specificity, signal-to-noise ratio and dynamic range (Supplementary Methods, Supplementary Results, Supplementary Fig. 9 and Supplementary Table 4). Overall agreement between the splicing microarrays and RNAseq was 77.6–99.1% depending on feature type (Supplementary Table 4c). Correlation of the expression values was 0.84–0.86 (Spearman correlations; Supplementary Fig. 9e). The majority of exons and canonical junctions identified as expressed by RNAseq were confirmed by the splicing microarrays (Supplementary Fig. 9f and Supplementary Table 4c). We also attempted to validate 381 predicted events, representing a wide range of expression levels, using a combination of reverse-transcription (RT)-PCR, quantitative (q)PCR and Sanger sequencing. qPCR confirmed 88% of predicted alternatively expressed exon regions, and the correlation between differential expression values from qPCR and RNA-seq was 0.93 (Fig. 2). RT-PCR confirmed 85% of predicted alternatively expressed junctions and new alternative exon junctions predicted to be expressed in either cell line. Sanger sequencing of a subset of products from these RT-PCR experiments confirmed the identity of 61 canonical and alternative exon junctions (Supplementary Fig. 10). We processed the MIP101 and MIP/5-FU sequence libraries with the recently released ‘Cufflinks’22 and ‘Scripture’21 RNAseq tools and compared the results to ALEXA-seq where possible (Supplementary Methods, Supplementary Fig. 11a–c and Supplementary Table 5). For example, we compared the exon junctions identified by ALEXA-seq, Cufflinks and Scripture and found that despite substantial differences in the methods, 87,726 distinct exon junctions were detected by all three methods. Both Cufflinks and Scripture detected more exon junctions than ALEXA-seq. The concordance with known transcripts was less than 50% for both Cufflinks and Scripture, but it improved dramatically if we applied an expression level cutoff (Supplementary Table 5). Both methods did not detect many of the experimentally validated events predicted by ALEXA-seq. Specifically, of the 189 exon junctions identified by ALEXA-seq and assayed by RT-PCR and Sanger sequencing, 67 (35%) were found by Cufflinks and 93 (49%) by Scripture (Supplementary Results). In addition to expressed and differentially expressed features, a subset of differential expression events identified corresponded to cases in which specific features such as exons or exon-skipping junctions were differentially expressed, but the expression of the gene they are encoded by was not substantially changed between the two libraries, or the change was in the opposite direction (Supplementary Table 2). Such events may indicate differential expression of a specific isoform at a locus that expresses multiple isoforms. Of the top 50 differentially or alternatively expressed genes identified in our pairwise comparison of MIP101 and MIP/5-FU RNA-seq libraries (Supplementary Table 6), 23 consisted of exon-skipping, alternative transcript initiation or polyadenylation, alternative 5′ or 3′ splice site usage, or intron retentions. We identified 306 genes with these kinds of events by combining differential expression analysis with use of a ‘splicing index’2. We refined this approach by developing ‘reciprocity index’ and ‘percent feature contribution’ (PFC) scores to help identify cases in which multiple isoforms were reciprocally

5

3

1

F R 2a

F R 2b

1

2

3

4

5

Validated (P < 0.05) Not validated Housekeeping controls Twofold difference Pearson R = 0.91 Spearman R = 0.93 Validation rate = 88%

+

0

–5

–10

–10

–5

0

5

10

qPCR differential expression (log2 difference)

Figure 2 | Quantitative RT-PCR validation of 192 alternatively expressed exons. qRT-PCR was performed for 192 exons identified as alternatively expressed by ALEXA-seq. Schematic shows exon-specific primers (forward (F) and reverse (R)); differential expression of each alternative exon reported by ALEXA-seq and qPCR are plotted.

ifferentially expressed with respect to overall gene expression or d in which a single isoform appeared to be differentially expressed while the overall expression of the gene was constant. To demonstrate the utility of ‘reciprocity index’ and PFC scores, we provide examples of alternatively expressed exons and new exon junctions with high ‘reciprocity index’ and PFC scores, respectively (Supplementary Table 7). Global disruption of splicing We determined the total number of features expressed above background for the MIP101 (5-FU–sensitive) and MIP/5-FU (5-FU–resistant) libraries. Although the MIP101 library was approximately twice the size of the MIP/5-FU library, this translated into only a small difference in the number of genes, transcripts, exons and known exon junctions being identified as expressed (Supplementary Table 2). Specifically, whereas the MIP/5-FU library had ~57% of the depth (number of sequence reads produced) of the MIP101 library, we detected on average 98.5% of the features. We observed the same slight drop for ‘silent’ intergenic regions (no mRNA or EST evidence for expression). These findings are consistent with the observation that these libraries had effectively reached a point of saturation at which additional depth resulted in diminishing discovery of expressed features (Supplementary Fig. 4). For genes that were differentially expressed between 5-FU–sensitive and 5-FU–resistant cells, there was a trend toward loss of expression in resistant cells with 171 underexpressed genes in 5-FU–resistant cells compared to 81 overexpressed genes. We observed a similar bias toward loss of expression in 5-FU–resistant cells for known exons and known exon junctions. For all feature types indicating expression of new transcripts, the number of such events detected was higher in 5-FU–resistant cells nature methods | VOL.7 NO.10 | OCTOBER 2010 | 845

Articles

Total features expressed (proportion)

a

mutations acquired in the 5-FU–resistant cell line may be the major source of new isoforms in the 5-FU–resistant line.

1.0 MIP101 MIP/5-FU

0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 Known Known exon exon junction

Exon Alternative skipping boundary usage

Cryptic exons

Intron retention

b UMPS gene model: 6 exons, 14,416 bases, 2,246 exon bases

Coding sequence end

e1 Expression level (log2 average coverage)

© 2010 Nature America, Inc. All rights reserved.

Coding sequence e1–e3 start e1–e2

e2

e2–e3

e3

e4

e5

e6

8 6 MIP101 MIP/5-FU

4 2 0

e1 e1–e2 e1–e3 e2 e2–e3 e3 e3–e4 e4 e4–e5 e5 e5–e6 e6 Exon or junction

Figure 3 | Alternative expression in 5-FU–sensitive versus 5-FU–resistant cells. (a) Proportion of all expressed features identified in the indicated cell types summarized for feature types corresponding to known isoforms (left) and predicted new isoforms (right). The proportion is the number of expressed features observed in one cell line divided by the number of features expressed in either cell line. (b) A screen shot from the ALEXAseq data viewer. Alternatively expressed features are highlighted red in the gene model. The line plot displays the expression of each exon (e1, e2 and so on) and exon junction (e1–e2 and others), and alternative expression events are highlighted yellow.

(Fig. 3a). For example, the MIP/5-FU library had 31.3% more new exon junctions, 31.9% more new exon boundaries, and 37.9% more intron retentions than MIP101. One possible explanation for this observation is that the amount of genomic DNA and/or unprocessed RNA contamination was elevated in the MIP/5-FU library. However, the comparable detection of intergenic elements in both libraries (Supplementary Table 2) suggests that intergenic background noise was not substantially different in the two libraries. Furthermore, increased contamination would not account for the increased expression of new exon-exon junctions as these sequences occur in mature (spliced) mRNAs but do not generally occur in either the genome or unprocessed RNA. If we consider the 3,802 expressed new exon-skipping junctions observed in either library and eliminate those that were present in both libraries, 1,713 remained, and we observed 75% of these only in the MIP/5-FU library. We observed the same trend for alternative exon boundaries and intron retentions (Fig. 3a and Supplementary Table 2). We ensured that the library complexities were comparable (Supplementary Fig. 2f). Based on the lack of changes in the splicing machinery (Supplementary Table 8), we speculate that locus-specific 846 | VOL.7 NO.10 | OCTOBER 2010 | nature methods

DISCUSSION We found that the quality of RNA-seq data was primarily limited by sequencing depth, and the expressed features inferred from these data contained a wealth of details on the structure and complexity of mRNA transcripts generated from each gene locus. A recent publication describing a survey of cancer-associated alternative splice events by RT-PCR had suggested that RT-PCR remains the only sensitive high-throughput method for validation of alternative splicing candidates26. As we achieved high correlations and validation rates when comparing ALEXA-seq predictions for MIP101 and MIP/5-FU to those from splicing microarrays, RT-PCR, qPCR and Sanger sequencing, we believe that with sufficient depth, ALEXA-seq represents a feasible alternative to RT-PCR and microarray based approaches. Furthermore, it has several advantages. RNA-seq does not require preselection of target genes or transcripts, thus allowing a relatively unbiased assessment of the transcript isoforms expressed from each locus. The data generated are not limited by our knowledge of the transcriptome at the time of data generation and allow base level resolution and mutation detection6. Both microarrays and RT-PCR experiments require design of oligonucleotides based on specific gene models. As annotations improve or new predictions are made, RNA-seq data can be reanalyzed to accommodate the changing knowledge landscape. Furthermore, the presence of unknown polymorphisms in the target sequences of oligo nucleotides interfere with their performance. Similarly, design of oligonucleotides to profile small exons or alternative exons with subtle differences can be problematic. Profiling millions of known and predicted exon-exon connections as we describe here is not practical by RT-PCR. The sensitivity of RNA-seq to genes with low expression depends on sufficient depth. With the depth reported in our experiments we achieved detection of ~25,000 genes, half of which were detected above background noise. Included among our expressed genes were well known low-copy genes such as telomerase reverse transcriptase for which we detected a new retention of intron 11. Whereas RNA-seq data are not fundamentally limited by the quality or completeness of the genome annotation, interpretation of RNA-seq data in the context of alternative expression needs accurate gene models. For this reason, our analysis emphasized supplementing existing gene annotations to allow a more comprehensive characterization of alternative expression. In addition, we believe that an annotation, alignment and expression analysis strategy that combines both genome and transcriptome resources as we report is desirable for RNA sequence data. One area of future work is to couple annotation from existing databases with entirely de novo annotation driven by the sequence data themselves. Preliminary work in this area using de novo transcript assembly27 seems promising, and it should be possible to incorporate these methods in the near future to enhance the comprehensiveness of our approach. By generating RNA-seq libraries representing contrasting conditions, we showed that it is possible to elucidate changes in expression of entire transcripts, subtle shifts in the ratio of expressed isoforms and new transcripts. ALEXA-seq allowed the

© 2010 Nature America, Inc. All rights reserved.

Articles i dentification of potentially important events that we would have missed if we relied solely on differential gene expression analysis. Previous proof-of-principle studies have reported the use of massively parallel RNA sequencing to study alternative splicing12–14. However, this work to our knowledge is the first to make available tools that allow (i) creation and annotation of databases of sequences to facilitate alternative expression analysis, (ii) comparison of isoform expression between paired conditions and (iii) visualization of these isoform expression patterns. Alternative expression analysis by RNA-seq will be useful not just in paired comparisons of disease states, tissue types and other studies but can be easily applied to more global analyses such as disease classification and should be more robust than simple gene expression estimates typically used for these analyses to date. Using ALEXA-seq, we detected a global pattern of aberrant splicing as well as specific candidate isoform markers of resistance (Fig. 3b). Profiling the global regulation of splicing and identifying specific isoforms to be used in a diagnostic, prognostic, predictive or therapeutic context is increasingly cited as an important area of disease research, particularly in cancer26,28,29. Methods Methods and any associated references are available in the online version of the paper at http://www.nature.com/naturemethods/. Accession codes. Gene Expression Omnibus (GEO): GSE23776 (raw read data and processed expression data files). Note: Supplementary information is available on the Nature Methods website. Acknowledgments We are grateful for funding provided by the University of British Columbia, the Michael Smith Foundation for Health Research, the Natural Sciences and Engineering Research Council, Genome British Columbia, the Terry Fox Foundation, the Canadian Institutes of Health Research, the National Cancer Institute of Canada and the British Columbia Cancer Foundation. AUTHOR CONTRIBUTIONS M.G. and M.A.M. wrote the manuscript. M.G. conducted the analysis, created the figures and participated in all experimental design, laboratory work, programming, and database and website design. O.L.G., R.G., G.R. and R.D.M. contributed to analysis concepts and assisted with programming and code testing. A.S.M., R.G and R.C. assisted with comparisons to other alternative expression tools. M.J.T. provided cell lines and assisted with tissue culture. T.J.P. and Y.-C.H. contributed to laboratory experiments. H.I.L. and A.D. performed the mitochondrial cross species contamination and other quality control analyses. K.T. assisted in the implementation of website hosting and indexing. S.Y.C., J.K.A. and A.A. performed microarray hybridizations and scanning. Y.Z., H.M., T.Z. and M.H. performed Illumina library construction and sequencing. S.C. and J.M. assisted with experimental validations. S.J.M.J. and G.B.M. contributed concepts and experimental designs. I.T.T. provided cell lines and contributed concepts and experimental designs. M.A.M. supported the project and contributed concepts and experimental designs. COMPETING FINANCIAL INTERESTS The authors declare no competing financial interests. Published online at http://www.nature.com/naturemethods/. Reprints and permissions information is available online at http://npg.nature. com/reprintsandpermissions/. 1. Griffith, M. & Marra, M.A. Alternative expression analysis: experimental and bioinformatic approaches for the analysis of transcript diversity. in

Genes, Genomes & Genomics. (eds., Thangadurai, D., Tang, W. & Pullaiah, T.) 201–242 (Regency Publications, New Delhi, 2007). 2. Griffith, M. et al. ALEXA: a microarray design platform for alternative expression analysis. Nat. Methods 5, 118 (2008). 3. Johnson, J.M. et al. Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science 302, 2141–2144 (2003). 4. Pan, Q. et al. Revealing global regulatory features of Mammalian alternative splicing using a quantitative microarray platform. Mol. Cell 16, 929–941 (2004). 5. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nat. Methods 5, 621–628 (2008). 6. Morin, R. et al. Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. Biotechniques 45, 81–94 (2008). 7. Yassour, M. et al. Ab initio construction of a eukaryotic transcriptome by massively parallel mRNA sequencing. Proc. Natl. Acad. Sci. USA 106, 3264–3269 (2009). 8. Jiang, H. & Wong, W.H. Statistical inferences for isoform expression in RNA-seq. Bioinformatics 25, 1026–1032 (2009). 9. Tang, F. et al. mRNA-seq whole-transcriptome analysis of a single cell. Nat. Methods 6, 377–382 (2009). 10. Cloonan, N. et al. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat. Methods 5, 613–619 (2008). 11. Li, H. et al. Determination of tag density required for digital transcriptome analysis: application to an androgen-sensitive prostate cancer model. Proc. Natl. Acad. Sci. USA 105, 20179–20184 (2008). 12. Sultan, M. et al. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science 321, 956–960 (2008). 13. Pan, Q., Shai, O., Lee, L.J., Frey, B.J. & Blencowe, B.J. Deep surveying of alternative splicing complexity in the human transcriptome by highthroughput sequencing. Nat. Genet. 40, 1413–1415 (2008). 14. Wang, E.T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476 (2008). 15. Trapnell, C., Pachter, L. & Salzberg, S.L. TopHat: discovering splice junctions with RNA-seq. Bioinformatics 25, 1105–1111 (2009). 16. Wang, L. et al. A statistical method for the detection of alternative splicing using RNA-seq. PLoS ONE 5, e8529 (2010). 17. Denoeud, F. et al. Annotating genomes with massive-scale RNA sequencing. Genome Biol. 9, R175 (2008). 18. Ameur, A., Wetterbom, A., Feuk, L. & Gyllensten, U. Global and unbiased detection of splice junctions from RNA-seq data. Genome Biol. 11, R34 (2010). 19. Au, K.F., Jiang, H., Lin, L., Xing, Y. & Wong, W.H. Detection of splice junctions from paired-end RNA-seq data by SpliceMap. Nucleic Acids Res. 38, 4570–4578 (2010). 20. Wu, T.D. & Nacu, S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 26, 873–881 (2010). 21. Guttman, M. et al. Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat. Biotechnol. 28, 503–510 (2010). 22. Trapnell, C. et al. Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010). 23. Niles, R.M. et al. Isolation and characterization of an undifferentiated human colon carcinoma cell line (MIP-101). Cancer Invest. 5, 545–552 (1987). 24. Tai, I.T., Dai, M., Owen, D.A. & Chen, L.B. Genome-wide expression analysis of therapy-resistant tumors reveals SPARC as a novel target for cancer therapy. J. Clin. Invest. 115, 1492–1502 (2005). 25. Jurka, J. et al. Repbase update, a database of eukaryotic repetitive elements. Cytogenet. Genome Res. 110, 462–467 (2005). 26. Venables, J.P. et al. Cancer-associated regulation of alternative splicing. Nat. Struct. Mol. Biol. 16, 670–676 (2009). 27. Birol, I. et al. De novo transcriptome assembly with ABySS. Bioinformatics 25, 2872–2877 (2009). 28. Klinck, R. et al. Multiple alternative splicing markers for ovarian cancer. Cancer Res. 68, 657–663 (2008). 29. Venables, J.P. et al. Identification of alternative splicing markers for breast cancer. Cancer Res. 68, 9525–9531 (2008).

nature methods | VOL.7 NO.10 | OCTOBER 2010 | 847

© 2010 Nature America, Inc. All rights reserved.

ONLINE METHODS Tissue culture and RNA preparation. A cell line sensitive to 5-FU (MIP101)23 was passaged in the presence of increasing concentrations of 5-FU resulting in selection of a clone resistant to the drug (MIP/5-FU)24. The resulting cells were profiled to test the utility of our method in detecting alternative expression events. The resistant cell line exhibited a tenfold increase in the half maximal inhibitory concentration (IC50) for 5-FU, and as a clonal derivative of the sensitive cell line was expected to be highly related except for polymorphisms arising during exposure to 5-FU. Both cell lines were maintained in Dulbecco’s Modified Eagle Medium (Invitrogen) supplemented with 10% new born calf serum (Invitrogen), 1% kanamycin (Invitrogen) and 1% penicillin-streptomycin (Invitrogen) at 37 °C in a humidified environment of 5% CO2. The medium for MIP/5-FU cultures was supplemented with the chemotherapy drug 5-FU to a final concentration of 5 μM to maintain resistance. Cultures used for RNA isolation were seeded at a density of 10–30% and grown in medium without drug or antibiotics for ~48 h. Total RNA was isolated using an ‘RNeasy’ kit (Qiagen). Total RNA was treated with DNase I (Qiagen) during RNA isolation according to Qiagen’s instructions. RNA was quantified and assessed for degradation using an RNA 6000 Nano assay (Agilent) and was not used for library construction if it had an RNA integrity score less than nine out of ten. For each sample, polyadenylated RNA was purified from 16.8 μg of DNase I–treated total RNA using the MACS mRNA Isolation kit (Miltenyi Biotec). Illumina library construction and sequencing. Poly(A)+ RNA was purified from total RNA isolated from cell lines (see above) and used for cDNA synthesis followed by fragmentation into 190–210 bp fragments and generation of RNA-seq libraries as follows. Double-stranded cDNA was synthesized from purified poly(A)+ RNA using a Superscript Double-Stranded cDNA Synthesis kit (Invitrogen) and random hexamer primers (Invitrogen) at a concentration of 5 μM. The resulting cDNA was sheared using a Sonic Dismembrator 550 for 5 min at amplitude setting ‘7’ (Fisher Scientific) and size-separated by 8% PAGE. The 190–210-bp DNA fraction was excised, eluted overnight at 4 °C in 300 μl of elution buffer (5:1, LoTE buffer (3 mM Tris-Cl (pH 7.5) and 0.2 mM EDTA) with 7.5 M ammonium acetate) and purified using a QIAquick purification kit (Qiagen). The sequencing library was prepared following the Illumina Genome Analyzer paired end library protocol with ten cycles of PCR amplification. PCR products were purified on QIAquick MinElute columns (Qiagen) and assessed and quantified using a DNA 1000 series II assay (Agilent) and Qubit fluorometer (Invitrogen), respectively. The resulting libraries were sequenced on an Illumina Genome Analyzer II following the manufacturer’s instructions. Data preprocessing. Image analysis and basecalling was performed by the Genome Analyzer pipeline v1.0 (Illumina) using phasing and matrix values calculated from a control phiX174 library run on each flowcell. Raw quality scores were calibrated by alignment to the reference human genome (US National Center for Biotechnology Information (NCBI) build 36.1, hg18) using ELAND (efficient local alignment of nucleotide data; Illumina). Illumina paired-end sequencing produces paired reads from opposite strands of the ends of double-stranded cDNA fragments nature methods

(that is, +/− or –/+ orientations). At the outset of our analysis, the second read of each pair was reverse-complemented (resulting in +/+ or −/− orientations). Since double-stranded cDNA was sequenced, the source strand information of RNAs converted to cDNAs was lost in this approach, and we expected an equal representation of both strands in our sequence alignments. In our alignments to known transcripts, 98% of all read pairs exhibited the expected strand orientation (+/+ or −/−). There was no apparent bias toward one strand or the other (49.3% of read pairs were +/+ and 49.1% were −/−). Before further analysis, reads were first filtered on the following basic criteria: (i) poor quality (arbitrarily defined as more than one ambiguous base call); (ii) low complexity (2/3 or more bases of a read identified as low complexity by mdust30 using default parameters) and (iii) duplicate reads of a pair (read 1 and read 2 of a pair are identical or a reverse complement of each other). Creation and annotation of an alternative expression database. To facilitate interpretation of RNA-seq data in the context of alternative expression, we created an annotation database to define and characterize sequences representing expression events for each locus. This database seeks to describe the characteristics of all known and predicted transcripts, exon regions, exon junctions, alternative exon boundaries, introns and intergenic regions associated with every gene in a genome. Throughout this work we collectively refer to these sequences as sequence ‘features’ (Supplementary Fig. 1a). To create these databases we first retrieved gene, transcript and exon sequences and coordinates from EnsEMBL31. We then defined regions of each type and used existing mRNA and EST sequence data to supplement the information in EnsEMBL. Each feature was defined as ‘known’ or ‘novel’ relative to EnsEMBL data. For each feature, the number of supporting mRNAs and ESTs was also determined. This sequence support was assessed for the target species and then for all other species to assess conservation of each feature. In addition to storing the number of EST and mRNA sequences supporting each feature, the number of species with a supporting sequence was also noted. For all feature types, the size of the feature was noted as well as the number of these bases that correspond to repeatmasked positions according to EnsEMBL (all bases identified by mdust, RepeatMasker and tandem repeat finder)31. Whenever the number of bases was needed to perform a calculation for a feature, such as determining average read coverage, the number of unmasked bases was used. Exon regions were defined by identifying clusters of overlapping exons to create a ‘block’ of exon content. This block was then divided into ‘exon regions’ using the boundaries of all exons in the cluster. This approach divides overlapping exons into pieces that are often specific to particular isoforms. Exon regions were then labeled from 5′ to 3′ starting at 1 and ending at N (where N is the number of exon clusters for the gene). In the case of overlapping exons resulting in multiple exon regions, these were labeled ‘a’ to ‘z’. For example, a gene with three exons and two known transcripts with an alternate exon 2 boundary might have exon regions defined as: ‘ER1’, ‘ER2a’, ‘ER2b’ and ‘ER3’ (Supplementary Fig. 1a). Exon-exon junctions were defined by identifying all exon boundaries in a gene and generating all possible pairwise connections of these exons that could theoretically result from alternative doi:10.1038/nmeth.1503

© 2010 Nature America, Inc. All rights reserved.

splicing of an exon donor site to an exon acceptor site. A gene with n exons has n!/(2!(n – 2)!) possible junctions. In the case of overlapping exons, multiple acceptor or donor sites within an exon cluster were labeled ‘a’ to ‘z’, and the resulting junction name consisted of the exon numbers of two connected exons as well as the donor and acceptor site involved (for example, ‘E3a–E4b’ represents the connection of exon 3 and 4 using the 3′ most donor site of exon 3 and the second acceptor site of exon 4). We chose the size of exon-exon features to be 62 bases to accommodate the length of our sequence reads (42-mers). This length was chosen so that a read aligning across its full length would overlap the center of the junction by at least ten bases. Alternative exon boundary features were defined and named in a similar way to that used for exon-exon junctions. An alternative exon boundary feature was defined for every exon splice acceptor or donor site annotated in EnsEMBL. The size of these features was the same as for exon junctions (62-mers). An alternative exon boundary corresponding to a cluster of overlapping exons was named according to the exon number from 5′ to 3′ and multiple splice donor or acceptor sites were labeled ‘a’ to ‘z’. For example, ‘E3_Da’ refers to a feature spanning from the 5′ end of exon 3 into the downstream intron at the 3′ most donor site of exon 3. All introns and intergenic regions were defined as those portions of the genome not corresponding to EnsEMBL exons. Introns are those regions between known EnsEMBL exons and were numbered from 5′ to 3′ in each gene. Intergenic regions were defined across each chromosome and labeled as such, and the identities of genes flanking each intergenic region were stored. For both introns and intergenic regions, alignments of mRNAs and ESTs were used to supplement the EnsEMBL gene annotations. Specifically, alignments of these sequences to the genome were used to define ‘active’ portions of each intron or intergenic region (‘active intron regions’ and ‘active intergenic regions’). mRNA or EST alignments were only considered if they had more than 300 matching bases, less than five mismatches or gaps and less than two ambiguous bases. Regions that did not exhibit any such evidence of expression from mRNAs or ESTs were defined as ‘silent’ portions of each intron or intergenic region (‘silent intron regions’ and ‘silent intergenic regions’). Finally, although transcript features were already defined in EnsEMBL, for each of these transcripts we identified those exon regions and exon-exon junctions that were specific to each transcript (where possible). The feature collection defining each transcript was used to calculate expression for that transcript. All features were assigned unique feature identifiers and stored in a standardized format including the following characteristics for each feature: source gene identifier, gene name, feature name, chromosome, strand, coordinates relative to the source gene (in 5′ to 3′ orientation), chromosome coordinates, feature size, number of unmasked bases, number of protein coding bases, number of supporting EnsEMBL transcripts, mRNAs and ESTs (for both the target species and all other species), and if applicable the identifier of the transcript to which this feature was unique. Exon junction features also indicate the number of exons skipped. Currently, ALEXA-seq annotation databases as described here have been generated for the human genome (used in this analysis) as wells as chimp, chicken, mouse, rat, fruit fly, yeast and zebrafish (Supplementary Table 1a). Additional annotation databases can be generated to accommodate any species annotated by EnsEMBL doi:10.1038/nmeth.1503

upon request. Annotation databases, including fasta files representing all sequences are available at http://www.alexaplatform. org/alexa_seq/. Calculation of feature expression values. Expression estimates for genes as well as individual exon, intron and intergenic features were determined by mapping reads to transcript or genomic sequences and then calculating the observed average coverage of mapped reads across the base positions of the feature coordinates (Supplementary Equation 1). These estimates were then adjusted to account for varying depth between libraries, resulting in a normalized average coverage (Supplementary Equation 2). The following describes the strategy for measuring specific feature types. Exon junction and boundary expression estimates were determined by mapping reads to discrete N-mer sequences generated from EnsEMBL annotations. This resulted in a systematic bias toward lower expression estimates for exon junctions and boundaries. This is expected because, although a particular 62-mer junction sequence covered by all theoretically possible 42-mer reads has a coverage potential of 923 sequences bases (cumulative coverage of all possible distinct reads corresponding to the junction sequence), an exon of the same size within the bounds of a transcript can have up to 2,604 bases of potential coverage (that is, ~2.8 times). This assumes that we require perfect read matching and that the exon region is in a transcript large enough to allow all possible overlapping 42-mers from both sides. Neither of these assumptions were entirely true in our analysis, so we used the data to empirically measure the bias between exon and junction expression values. Specifically we used the exon and canonical exon junction expression estimates for genes expressed above background to estimate the observed bias. The median expression estimate for exons was ~2.2 times that of exon junctions. When displaying or summarizing exon junction values, they were corrected by this factor to make them more directly comparable to exon expression estimates. To estimate the expression for an entire gene locus, all of the nonredundant exon base positions within the bounds of the EnsEMBL gene were considered. In contrast, transcript-specific expression was calculated by taking the average of exon region and exon-exon junction expression estimates identified as unique to a particular transcript. In EnsEMBL (v.53), there are 36,953 genes with 62,371 total transcripts. 25,879 of these genes have only a single transcript and for these transcripts, calculating the expression simply uses all exon regions and junctions. However, 36,492 transcripts are distributed among the remaining 11,074 multitranscript genes and expression could be individually assessed for 28,636 (78.5%) of them. Transcript-specific exon regions and exon-exon junctions were close to equally represented in the pool of features used for these measurements (62% of the transcripts had a specific exon region, and 59% had a specific exon-exon junction). Differential expression analysis. Differential expression (DE) values were determined as the log2 difference in normalized average coverage values (Supplementary Equation 2) between 5-FU–sensitive and –resistant cells for all feature types. Before calculating differential expression values, all features that were not expressed above background in at least one library were removed. A value of 1 was then added to normalized average nature methods

© 2010 Nature America, Inc. All rights reserved.

coverage values before converting to log2 scale (to stabilize variance). The log2 difference was converted to a more intuitive foldchange value by raising the log2 difference to the power of 2. The MIP/5-FU library was chosen as a reference for differential expression values such that a negative value implies loss of expression in 5-FU–resistant cells relative to 5-FU–sensitive cells, and a positive value implies a gain of expression in resistant cells. Features were considered to be differentially expressed if their expression values differed by a factor of two and their corrected P value was less than 0.05 (by Fisher’s exact test). Alternative expression analysis. Alternative expression analysis was performed as an extension to the differential expression analysis described above. The purpose of this analysis was to identify genes exhibiting differential expression of features that might be indicative of a shift in splicing, transcript initiation or polyadenylation of a specific isoform rather than a shift in expression across the entire locus. Candidate differential splicing events were assessed by calculating a ‘splicing index’ (SI) value for each feature (Supplementary Equation 3). Briefly, this value provides a measure of the change in expression of a feature between two conditions relative to the change in expression of the entire gene locus between the two conditions (5-FU–sensitive and 5-FU– resistant cells in this case). SI values were only calculated for a feature if the feature and the gene to which it corresponded were expressed above background in at least one of the two conditions being compared. In addition to the SI value, for each feature we also calculated a ‘reciprocity index’ (RI) and ‘percent feature contribution’ (PFC) value (Supplementary Equations 4 and 5). The purpose of the RI value was to identify those cases in which the change in expression of a feature was consistent with reciprocal differential expression of an isoform relative to the gene locus. In other words if a gene was upregulated overall between 5-FU–sensitive and –resistant cells while a particular feature such as an exon-skipping junction was downregulated, a high RI value would be produced. Similarly, the PFC value gauges the magnitude of differential expression of a feature relative to the differential expression of the locus. For example, if the expression of a gene is unchanged overall, but an individual feature

nature methods

such as a single exon is changing dramatically, this will result in a large PFC (approaching 100% if the overall gene expression is completely unchanged). Features considered to be candidate differential splicing events were defined as those features having an SI value of at least 1, a significant DE value (P < 0.05 after multiple testing correction) and a PFC value of at least 50. The candidate differential expression and alternative expression gene lists were combined and ranked according to the maximum DE or SI value for all features of each gene. The top 50 genes from this list are shown in Supplementary Table 6, and additional details for each gene are available at our website (http://www. alexaplatform.org/). Statistical analysis. Generation of all statistics, figures and graphs was performed in R using the ‘Base’, ‘caTools’, ‘gcrma’, ‘geneplotter’, ‘multtest’, ‘nortest’, ‘RColorBrewer’ and Bioconductor32 packages. All statistical tests were two-tailed. When comparing values between platforms, Spearman’s rank correlation coefficients were determined. Multiple testing correction was accomplished by Benjamini and Hochberg’s step-up false discovery rate controlling procedure33 using the ‘multtest’ package of R. A multiple testing corrected P < 0.05 was considered significant. A two-sided Fisher’s exact test was used to calculate a P value for the difference in feature expression between two libraries of RNA-seq data. As this test accounts for differences in counts between libraries, raw average coverage values were used when calculating P values. A two-tailed t-test was used to calculate P values for differences in expression assessed by a qPCR assay.

30. Hancock, J.M. & Armstrong, J.S. SIMPLE34: an improved and enhanced implementation for VAX and Sun computers of the SIMPLE algorithm for analysis of clustered repetitive motifs in nucleotide sequences. Comput. Appl. Biosci. 10, 67–70 (1994). 31. Hubbard, T.J. et al. Ensembl 2009. Nucleic Acids Res. 37, D690–D697 (2009). 32. Gentleman, R.C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80 (2004). 33. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. A Stat. Soc. 57, 289–300 (1995).

doi:10.1038/nmeth.1503

Lihat lebih banyak...

Alternative expression analysis by RNA sequencing

Descripción

Comentarios