BioCloneDB: A Database Application to Manage DNA Sequence and Gene Expression Data

May 19, 2017 | Author: Oded Yarden | Category: Biological Sciences, Mathematical Sciences, Gene Expression Data, DNA sequence


Appl Bioinformatics 2005; 4 (4): 277-280 1175-5636/05/0004-0277/$34.95/0

APPLICATION NOTE

© 2005 Adis Data Information BV. All rights reserved.

Eli Reuveni,1 Dena Leshkowitz2 and Oded Yarden1

1 Department of Plant Pathology and Microbiology, Faculty of Agricultural, Food and Environmental Quality Sciences, The Hebrew University of Jerusalem, Rehovot, Israel
2 The Hebrew University Bioinformatics Unit, The Hebrew University of Jerusalem, Rehovot, Israel

Abstract

BioCloneDB is a user-friendly database with a web interface that helps molecular genetics laboratories manage a local repository of sequence information linked to DNA clones. The tool is designed to support high-throughput sequence and gene expression projects by linking the two types of information. The unique feature of the application is the automation of batch sequence annotation following BLAST® searches, which is supported by easy-to-use web interfaces. Furthermore, any set of sequences can be annotated against any sequence database. This removes the need to run and analyse individual web BLAST® searches, or to learn how to run batch searches and analyse their output in a UNIX® operating system. BioCloneDB is open-source software that can be installed on Linux or UNIX® operating systems. To test the application, we used 1400 expressed sequence tags obtained from the filamentous fungus Neurospora crassa. The results were analysed and compared with published results; they show a significant change due to the accumulation of data in the nr database (ftp://ftp.ncbi.nih.gov/blast/db/).

Availability: BioCloneDB is available for academic use along with documentation, screenshots, database schema and readme files at http://bioclonedb.agri.huji.ac.il/

Contact: Oded Yarden ([email protected])

Many laboratories utilising molecular biology tools are challenged by the continuous accumulation of sequence and gene expression data. We have developed an application named BioCloneDB that facilitates the integration of large amounts of sequence and gene expression data. It allows sharing, querying, report production and efficient laboratory management of both types of data. The main strength of BioCloneDB is its automatic annotation tool, which assists in predicting the biological role of given sequences. Sequence annotation is performed by user-defined BLAST® [1,2] searches and filtering criteria. The results of the annotation procedure can be cross-referenced with experimental expression data. The database will be most beneficial for projects that comprise many initially non-annotated clones, such as cDNA libraries spotted on slides or filter membranes.

There are many commercial and public domain tools and databases that deal with gene expression data[3,4] (GEO [Gene Expression Omnibus]; http://www.ncbi.nlm.nih.gov/geo/) and other applications that deal with sequence annotation (Gene Ontology™ [GO]; http://www.geneontology.org/doc/GO.tools.html).[5-8] BioCloneDB was developed to integrate sequence and gene expression data, allowing an annotation procedure to be performed dynamically and automatically, starting at the level of the initial sequence.

Implementation

BioCloneDB is a web-based application with a client-server architecture. It consists of Perl common gateway interface (CGI) scripts, dynamic and static HTML pages and JavaScript. The CGI scripts access a relational database, which is managed by MySQL® (http://www.mysql.com). The database schema includes ten tables: organism, project, clones, sequences, BLAST® annotation, sequence databanks, array project, array treatment, array experimental data and users. Several external programs are integrated in the application, including BLAST® 2.2.3, which is included in the download package (as an essential component), and cross_match (http://www.phrap.org/phredphrapconsed.html#block_phrap). CGI is a stateless technology, so the application includes session management to define the user environment. Some of the application features run as daemon processes, which allows them to escape the web server time-out limits. A status information screen notifies the user of the progress of the process.

BioCloneDB is a user-friendly web application that contains approximately two dozen web forms allowing the user to input and import data to the database. Other forms allow the user to query, create reports and export data. The database supports import and export of high-throughput data in various formats, such as a user-defined delimited file for array data and a FASTA file (http://www.ncbi.nlm.nih.gov/BLAST/fasta.html) containing a description line for sequences.

Application Flowchart

The hierarchical scheme used to populate the database starts with defining an organism and a project (figure 1). One can then import a concatenated FASTA file or enter sequence data individually. To provide a prediction of the biological role of a clone, one can annotate the clones using the BLAST® annotation module. Populating the experimental data field is done by first defining an array chip project. Each chip can be linked to one or more treatments. Next, the experimental data of each treatment can be imported to the database. At this stage the database is populated with a project that contains sequences, annotations and experimental expression data.
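As a minimal sketch of the import step, the following Python function splits a concatenated FASTA file into records consisting of an identifier, a description and a sequence. BioCloneDB itself is implemented as Perl CGI scripts; the function name `parse_fasta` and the rule of taking the identifier as the text before the first space are assumptions made for illustration, not the application's actual code.

```python
def parse_fasta(text):
    """Yield (identifier, description, sequence) for each record in FASTA text.

    Hypothetical helper: the identifier is assumed to be the header text
    before the first space, and the rest of the header is the description.
    """
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between records are ignored
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        yield _record(header, chunks)


def _record(header, chunks):
    """Split a header line and join the sequence lines of one record."""
    identifier, _, description = header.partition(" ")
    return identifier, description.strip(), "".join(chunks).upper()
```

Each record produced this way could then be stored as a clone with its sequence under the chosen organism and project.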

[Figure 1: flowchart with the fields Organism, Project, Clones*, Sequences, Sequence databanks, BLAST® annotation, Vector contamination, Filter parameters, EBI annotation retrieval, Array chip project, Array experiment details, Experimental results and Report.]

Fig. 1. Information flow between the fields in the BioCloneDB application. Each line represents a one-to-many relationship. * indicates that two types of clones have been designated: real and virtual. Virtual clones are those for which sequence information is accessible but no corresponding DNA clone is available. Dashed lines indicate optional modules. EBI = European Bioinformatics Institute.
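The one-to-many relationships of figure 1 can be illustrated with a small relational sketch. This example rests on several assumptions: it uses Python's built-in sqlite3 module in place of MySQL®, covers only four of the ten tables, and all table and column names (including the `is_virtual` flag for virtual clones) are hypothetical rather than taken from the BioCloneDB schema.

```python
import sqlite3

# Hypothetical subset of the schema: organism -> project -> clone -> sequence,
# each arrow being a one-to-many relationship enforced by a foreign key.
SCHEMA = """
CREATE TABLE organism (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE project  (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                       organism_id INTEGER NOT NULL REFERENCES organism(id));
CREATE TABLE clone    (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                       is_virtual INTEGER NOT NULL DEFAULT 0,
                       project_id INTEGER NOT NULL REFERENCES project(id));
CREATE TABLE sequence (id INTEGER PRIMARY KEY, residues TEXT NOT NULL,
                       clone_id INTEGER NOT NULL REFERENCES clone(id));
"""


def build_demo_db():
    """Create an in-memory database with one organism, one project, two clones."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(SCHEMA)
    org = conn.execute(
        "INSERT INTO organism (name) VALUES (?)", ("Neurospora crassa",)
    ).lastrowid
    proj = conn.execute(
        "INSERT INTO project (name, organism_id) VALUES (?, ?)",
        ("conidial cDNA", org),
    ).lastrowid
    for name, is_virtual in (("c1", 0), ("c2", 1)):
        conn.execute(
            "INSERT INTO clone (name, is_virtual, project_id) VALUES (?, ?, ?)",
            (name, is_virtual, proj),
        )
    conn.commit()
    return conn


def clones_per_project(conn):
    """Return (project name, number of clones) pairs, ordered by project name."""
    return conn.execute(
        "SELECT p.name, COUNT(c.id) FROM project p "
        "LEFT JOIN clone c ON c.project_id = p.id "
        "GROUP BY p.id, p.name ORDER BY p.name"
    ).fetchall()
```

With foreign keys enabled, a clone cannot be inserted without an existing project, which mirrors the requirement to define an organism and a project before importing sequences.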

BioCloneDB contains two BLAST® modules: stand-alone local BLAST® and annotation BLAST®. Both modules use sequence databases stored on the server where the application is installed. BioCloneDB can download a sequence database from a user-defined web location and convert it to the appropriate BLAST® format. The stand-alone local BLAST® runs an individual sequence against a downloaded sequence database, or against sequences belonging to a certain project or all projects. The search results are presented in an HTML page.

The BLAST® annotation module consists of three parts: query, filter parameters and BLAST® sequence database. The query can be a group of clones, a project or all projects. It is possible to incorporate filtering parameters such as e-value, percentage positives, and descriptions of entries to be ignored or selected using Boolean conditions. For example, one can deselect entries containing an annotation such as unknown, putative or hypothetical. The BLAST® database can consist of a project, all projects or any user-defined database (as detailed for the stand-alone local BLAST®). The user is notified by e-mail when the annotation run is complete, and receives a report of the annotation results.

In general, the annotation for clones consists of a list of the single best homology fit for each BLAST® run that passed the user-defined parameters. Once an entry has passed the annotation criteria, statistical and alignment values from the BLAST® search are extracted, and additional information such as protein function, subcellular location, similarity information, GO accession number and Enzyme Commission (EC) number is imported from the European Bioinformatics Institute (EBI; http://srs.ebi.ac.uk/).
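The filtering and best-hit selection described above might look roughly like the following Python sketch. The `Hit` record, the default cut-offs and the function names are assumptions chosen for illustration; the source does not specify whether filtering is applied before or after choosing the best hit, so this version filters first and then keeps the lowest e-value per query.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST hit for a query clone (hypothetical record layout)."""
    query_id: str
    description: str
    evalue: float
    pct_positives: float


# Description terms to deselect, following the example given in the text.
EXCLUDED_TERMS = ("unknown", "putative", "hypothetical")


def passes_filter(hit, max_evalue=1e-5, min_pct_positives=40.0,
                  excluded_terms=EXCLUDED_TERMS):
    """Combine the e-value, percentage positives and description criteria."""
    description = hit.description.lower()
    return (
        hit.evalue <= max_evalue
        and hit.pct_positives >= min_pct_positives
        and not any(term in description for term in excluded_terms)
    )


def best_hits(hits, **criteria):
    """Return {query_id: Hit} holding the lowest e-value hit that passes the filter."""
    best = {}
    for hit in hits:
        if not passes_filter(hit, **criteria):
            continue
        current = best.get(hit.query_id)
        if current is None or hit.evalue < current.evalue:
            best[hit.query_id] = hit
    return best
```

Filtering before selection means that a clone whose top hit is an uninformative "hypothetical protein" can still receive the next best informative annotation; a clone with no passing hit simply stays unannotated.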
It should be noted that running the BLAST® annotation module on the same sequence input and database will not change the annotation unless the user has upgraded the BLAST® database or requested deletion of the previous annotation.

BioCloneDB stores and links clone annotation and expression data, allowing the user to query the data. The user can select clones that have a certain expression level in various treatments. One can then narrow the selection to clones that have a specific annotation. Meaningful queries can be built using the annotation data attributes of function, cellular location, similarity and identification, using a Structured Query Language (SQL)-like clause.

Data Integration

BioCloneDB can produce several report formats, including a tab-delimited file that contains annotation and expression information, or an HTML version. These reports can be streamed to other bioinformatics platforms for advanced analysis, for instance categorising a sequence set by metabolic pathway using the application PathFinder[9] or streaming data from the report into the DAVID[10] application to categorise the sequences according to the GO.[11]

Using BioCloneDB for Expressed Sequence Tag Analysis

Recently, the genome of the filamentous ascomycete Neurospora crassa was fully sequenced.[12,13] Because a complete gene dataset of this organism is now available, we used N. crassa for our application testing procedures. To test the efficiency of the application, we retrieved 1400 expressed sequence tag (EST) sequences from a conidial cDNA library submitted to GenBank® [14] and imported them to BioCloneDB. Automatic annotation was performed using the BLAST® annotation module against the nr database (ftp://ftp.ncbi.nih.gov/blast/db/), and a report containing the annotation data was produced. The GO results from the cDNA project were streamed into the Spotfire DecisionSite application (http://www.spotfire.com/) to obtain a table of the GO hierarchy categorised by molecular function (table I). The majority of the entries correspond to catalytic activity and binding. The BLASTx hit results were categorised using the following threshold values: >10⁻² assigned as not significant, 10⁻² to 10⁻⁴ as weakly significant, 10⁻⁵ to 10⁻¹⁹ as moderately significant and ≤10⁻²⁰ as highly significant (table II). The output results were compared with previously published results that were obtained from the same cDNA library[15] and processed manually. Comparing these category counts with the previous analysis reveals an increase in the number of significant hits and a decrease in the number of non-significant hits. This is expected, as the data accumulating in the nr database has significantly increased since the first analysis performed in 1997. These results illustrate the necessity for an automatic tool that will dynamically update the information stored in the database using the newly accumulating annotation and sequence data in the public domain.

Table I. Distribution of cDNAs from a conidial library of Neurospora crassa (with highly or moderately significant relationships to characterised genes) falling into the ten functional Gene Ontology™ categories. The results were obtained by utilising the BioCloneDB BLAST® annotation module and the Spotfire DecisionSite application

Molecular function                  Percentage
Catalytic activity                  28
Binding                             26
Structural molecule activity        15
Transcription regulator activity    12
Translation regulator activity      7
Obsolete molecular function         6
Transporter activity                4
Motor activity                      1
Chaperone activity                  0.5
[Not included]                      0.5

Table II. Summary of BLASTx p-values for a Neurospora crassa conidial library containing 1400 expressed sequence tags. A comparison is shown between the newly processed information using the BioCloneDB BLAST® annotation module and the previously published results

Similarity level (BLASTx p-values)         BioCloneDB (%)    Nelson et al.[15] (%)
Highly significant (≤10⁻²⁰)                66                41
Moderately significant (10⁻⁵ to 10⁻¹⁹)     17                14.9
Weakly significant (10⁻² to 10⁻⁴)          1                 5.7
No significant match (>10⁻²)               16                38.4

Conclusions

BioCloneDB is a user-friendly web application designed to easily import, export and analyse multiple sequence and expression datasets. It is a free software package and is accompanied by installation and user manuals. The download package is customisable and can be easily extended with new modules based on the existing kernel. Future modules can assist in biological analysis of the data (e.g. contigs for EST information, domain searches, multiple alignment applications and other homology search algorithms such as FASTA).

Acknowledgements

This research was supported by BARD, the US-Israel Binational Agricultural Research and Development fund. DL was supported by Israel Ministry of Science grant no. 1424 to Center of Knowledge for Bioinformatics Infrastructure (COBI). We thank Dvorah Weisman and Arye Harel for helpful advice and Zahi Paz for assistance in web design. The authors have no conflicts of interest that are directly relevant to the content of this article.

References

1. Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool. J Mol Biol 1990; 215: 403-10
2. Altschul SF, Madden TL, Schaffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997; 25: 3389-402
3. Dudoit S, Gentleman RC, Quackenbush J. Open source software for the analysis of microarray data. Biotechniques 2003 Mar; Suppl.: 45-51
4. Kaminski M, editor. Microarray resources on the Web (second of many sections) [online]. Available from URL: http://www.thoracic.org/geneexpress/gene0203.asp [Accessed 2003 Aug]
5. Hoersch S, Leroy C, Brown NP, et al. The GeneQuiz web server: protein functional analysis through the Web. Trends Biochem Sci 2000; 25: 33-5
6. Soanes DM, Skinner W, Keon J, et al. Genomics of phytopathogenic fungi and the development of bioinformatic resources. Mol Plant Microbe Interact 2002; 15: 421-7
7. Moller S, Leser U, Fleischmann W, et al. EDITtoTrEMBL: a distributed approach to high-quality automated protein sequence annotation. Bioinformatics 1999; 15: 219-27
8. Meskauskas A, Lehmann-Horn F, Jurkat-Rott K. Sight: automating genomic datamining without programming skills. Bioinformatics 2004; 20: 1718-20
9. Goesmann A, Haubrock M, Meyer F, et al. PathFinder: reconstruction and dynamic visualization of metabolic pathways. Bioinformatics 2002; 18: 124-9
10. Dennis Jr G, Sherman BT, Hosack DA, et al. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 2003; 4 (5): P3
11. Bard JL, Rhee SY. Ontologies in biology: design, applications and future challenges. Nat Rev Genet 2004; 5: 213-22
12. Galagan JE, Calvo SE, Borkovich KA, et al. The genome sequence of the filamentous fungus Neurospora crassa. Nature 2003; 422: 859-68
13. Borkovich KA, Alex LA, Yarden O, et al. Lessons from the genome sequence of Neurospora crassa: tracing the path from genomic blueprint to multicellular organism. Microbiol Mol Biol Rev 2004; 68: 1-108
14. Benson DA, Boguski MS, Lipman DJ, et al. GenBank. Nucleic Acids Res 1998; 26: 1-7
15. Nelson MA, Kang S, Braun EL, et al. Expressed sequences from conidial, mycelial and sexual stages of Neurospora crassa. Fungal Genet Biol 1997; 21: 348-63

Correspondence and offprints: Dr Oded Yarden, Department of Plant Pathology and Microbiology, Faculty of Agricultural, Food and Environmental Quality Sciences, The Hebrew University of Jerusalem, Rehovot, 76100, Israel. E-mail: [email protected]
