Micado - a network-oriented database for microbial genomes


Vol. 13 no. 4 1997 Pages 431-438

CABIOS

Veronique Biaudet, Franck Samson and Philippe Bessieres¹

Abstract

Introduction

Building a database for genomic information stems from our involvement with microbial molecular genetics. In the first place, this concerned the Bacillus subtilis genome sequencing project (Devine, 1995), for which we produced the genetic map (Biaudet et al., 1996). A new functional analysis program of unknown genes has followed (Harwood and Wipat, 1996), and we manage mutant data generated by the European framework. We also pursue comparative genetic goals and studies of the stability of genomes, and support research in food sciences and environmental microbiology. Lastly, we are involved in a collaboration with mathematicians, undertaking systematic approaches to the analysis of genomes. These include statistical detection, with Markov chains, of abnormal

Génétique Microbienne, Institut National de la Recherche Agronomique, Jouy-en-Josas cedex, 78352, France

¹To whom correspondence should be addressed

© Oxford University Press

System and methods

Database

A selective information retrieval system was needed, with precise multicriteria access to heterogeneous data, integrated into a fine-grained database schema. Diverse biological entities are concerned, together with their related bibliography and experimental or calculated data. The schema had to link information input from local experiments, from collaborative research projects, and imported from general or specialized data repositories. To provide standard SQL access, and to meet performance and reliability constraints, we decided on the ORACLE Relational DataBase Management System (RDBMS). RDBMSs offer mature environments, with a variety of tools and methods. Large primary sequence and genome collections have now moved to relational databases (Markovitz and Ritter, 1995). We chose Entity-Relationship (ER) modelling to design the conceptual information schema (Teorey, 1994). We used a CASE (Computer Aided Software Engineering) tool, AMC*Designor, to help in building the ER schemas. These are viewed as diagrams, and instantiated into relational databases through translation to SQL CREATE statements.
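The translation of an ER schema into SQL CREATE statements can be illustrated with a minimal sketch. It is not the output of the CASE tool: the entity names, columns and the `create_statement` helper are hypothetical, and SQLite stands in for ORACLE so that the example runs without a server.

```python
import sqlite3

# Hypothetical, simplified entities; NOT the actual Micado schema.
ENTITIES = {
    "gene": {
        "columns": {"gene_id": "INTEGER", "name": "TEXT", "map_position": "REAL"},
        "primary_key": ["gene_id"],
        "foreign_keys": {},
    },
    "mutant": {
        "columns": {"mutant_id": "INTEGER", "gene_id": "INTEGER", "vector": "TEXT"},
        "primary_key": ["mutant_id"],
        # A one-to-many relationship gene -> mutant becomes a foreign key.
        "foreign_keys": {"gene_id": ("gene", "gene_id")},
    },
}


def create_statement(name, entity):
    """Translate one entity description into a CREATE TABLE statement."""
    parts = [f"{col} {sqltype}" for col, sqltype in entity["columns"].items()]
    parts.append(f"PRIMARY KEY ({', '.join(entity['primary_key'])})")
    for col, (ref_table, ref_col) in entity["foreign_keys"].items():
        parts.append(f"FOREIGN KEY ({col}) REFERENCES {ref_table} ({ref_col})")
    return f"CREATE TABLE {name} (\n  " + ",\n  ".join(parts) + "\n)"


def build_database(entities):
    """Instantiate every entity as a table in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    for name, entity in entities.items():
        conn.execute(create_statement(name, entity))
    return conn


if __name__ == "__main__":
    for entity_name, description in ENTITIES.items():
        print(create_statement(entity_name, description))
```

Keeping the entity description separate from the generated SQL mirrors the split between the conceptual schema and its relational instance; a real CASE tool also handles many-to-many relationships, indexes and vendor-specific types, which this sketch ignores.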


Downloaded from bioinformatics.oxfordjournals.org by guest on July 13, 2011

Motivation: We created Micado, a database for managing genomic information, as part of the Bacillus subtilis genome programs. Its content will be progressively extended to the whole microbial world.

Results: A relational schema is defined for selective queries. It links eubacterial and archaeal sequences, genetic maps for Bacillus subtilis and Escherichia coli, and information on mutants. The latter comes from a new functional analysis project of unknown genes in B.subtilis, and the database allows the community to curate information. To help queries from users, a graphical interface is built on SQL access to the database, and provided through the WWW. We have automated imports of microbial sequences, and of the E.coli genetic map, by programming parsers of flat file distributions. These ensure smooth updates from molecular biology repositories on the Internet. Hyperlinks are created as a complement, to reference other general and specialized related information resources.

Availability: http://locus.jouy.inra.fr/micado

Contact: E-mail: [email protected]

motif occurrences (Schbath et al., 1995), and ruptures in the local properties of sequences. Such a variety of projects would benefit from integration of information about related biological entities inside a database (Davidson et al., 1995). This enables the convergence of different and complementary insights into molecular biology and microbes. Access to a unified information structure would then allow the building of integrated views for analysis, and observation of new correlations. In our opinion, the project had to fulfil some practical requirements to be efficient, such as providing selective access to information, and a graphical interface. In addition, adapting to the extension capabilities provided by the Internet would give Micado a delocalized access interface, regular updating of integrated data, and links to external information. Moreover, the current evolution of information technology would open new opportunities for distributed environments for the database.

V.Biaudet, F.Samson and P.Bessieres

[Figure residue, Figs 1 and 2. Diagram labels: HTML display forms; Common Gateway Interface (CGI); Imagemap, GIF, xfig and PostScript outputs; WWW server; gateway Perl script; WWW interface; SQL gateway programs; relational database; hyperlinks; local tables. Field lists: Access N°, Taxons, Features, Keywords, Biblio; Name, Sequence, Position, Keywords, Biblio, Methods.]

Fig. 2. Schematic database information content. Dashed boxes represent planned extensions.


Networking

The database is interfaced to a WWW server, the CERN httpd 3.1 from the W3 Consortium, with a 2 Mbit/s connection to the Internet. Both run on a UNIX server from Sun. To access the ORACLE database, Perl programs use the SQL DBI (Database Interface) library, formerly known as DBperl. HTML pages sent to the WWW client browsers are dynamically generated by the Perl gateway scripts. They are written to display results, and accept queries, both using text and graphics, with menus and clickable images. A graphical user interface has been written (Figure 1). Displays are first created as virtual, device-independent objects, through an intermediate layer specifying common physical characteristics. Specialized libraries (methods) are then called for output in GIF (Murray and Van Ryper, 1996), Imagemap (Liu et al., 1994), PostScript (Adobe, 1992) and xfig formats. These are used, respectively, to display graphics, specify clickable regions in them, print them, and edit them, through the WWW client browsers. All but the last have been specifically written for this project. We opted for the gd graphics library (Boutell, 1996) to generate GIF images. To improve interactivity and portability, we have started the development of a version in the Java language.

Implementation

Information structure and content

Information on genes, sequences and mutants (Figure 2) is organized within an accurate relational information


Fig. 1. Interface of the database to the WWW server, with Perl graphical and SQL libraries.

model. The part open to user queries is built with 32 tables: genes use three tables, sequences 21, and mutants eight (Figure 3). We store genetic maps for B.subtilis (1114 genes) and Escherichia coli (1843 genes). The former was constructed in our laboratory (Biaudet et al., 1996); the latter was imported from ECD (Wahl et al., 1994), with regular updates retrieved via the Internet. Genetic map data are linked to bacterial DNA sequences within the database; sequences of 276 and 108 genes, respectively, are still not fully identified. DNA sequences come from primary databank entries of GenBank (Benson et al., 1996), and are completed with the content of EMBL entries (Rodriguez-Tome et al., 1996), such as cross-references to SwissProt records (Bairoch and Apweiler, 1996). The way we structure sequences is similar to the schema published by GSDB (Genome Sequence DataBase) (Keen et al., 1996). The database includes 29 912 primary DNA sequences, from GenBank 98, accounting for 65 368 445 bp. Sequences are shared between 3035 eubacterial, 167 archaeal and 290 unidentified species, plus 324 plasmids, 45 transposons and 68 insertion sequences. A total of 136 705 features, among them 49 636 CDS (coding sequences), and 405 557 qualifiers, are defined for the sequences. To improve the quality of information, a clean, controlled and non-redundant set of sequences for B.subtilis is imported from the SubtiList specialized database (Moszer et al., 1995). Very recently, we added records of complete sequences for seven microbial genomes, including E.coli and yeast. Associated with this core information, Micado supports data on mutants for unknown genes in B.subtilis. A total of 1200 mutants will be produced by specific interruptions, in the context of the functional analysis project started in 1995. To date, more than 100 mutants are registered in the database, and the number should quickly reach 500. Currently, we store details of growth characteristics of

Micado database for microbial genomes

[Figure 3 content, reconstructed from the figure labels]

Features tables: data schema

  Feature:    accession, code_feat, type_feat, location operator attributes
  Location:   accession, code_feat, code_loc, start_loc, stop_loc, operator attributes
  Qualifier:  accession, code_feat, code_qual, type_qual, value
  DNA:        accession, code_feat, code_loc, origin_seq, replace_seq, comments

Record

  source   1..2505    /strain="PAO1" /map="57 min"
  RBS      128..132
  CDS      136..834   /EC_number="4.1.1.23" /gene="pyrF"

Tables

  Feature
    key   type_feat
    1     source
    2     RBS
    3     CDS

  Location
    key   start   stop
    1     1       2505
    2     128     132
    3     136     834

  Qualifier
    key   type_qual   value
    1     strain      PAO1
    1     map         57
    3     EC_number   4.1.1.23
    3     gene        pyrF

mutants, information on the interruption vector used, reporter gene activity, controls of strain construction, such as images of gel electrophoresis Southern blots, and bibliographic references. The ability to manage these experimental data is facilitated by the specification of standard procedures to produce and evaluate the mutants.

Hyperlinks

Links between databanks are annotated, pointing to biologically related records, and the WWW provides a general interactive cross-referencing capability, through the mechanism of hyperlinks. Micado offers hyperlinks to five information resources (Figure 2). They include SwissProt (Bairoch and Apweiler, 1996), from which to retrieve protein sequences. Links from the B.subtilis genetic map are bidirectional, as SwissProt points to the genes from associated protein sequence records. The database links the annotated bibliographic references of Medline, another resource of prime importance. Also, hyperlinks include access to original information already integrated into the database, GenBank sequence records, and the contigs of SubtiList, newly interfaced to WWW. This is added to existing links on NRSub (Perriere et al., 1994), another non-redundant B.subtilis sequence set. The navigable map of links describes a global view of genomic information (Karp, 1996), with rich nodes like SwissProt. This is the best example of using links to describe logical associations, from the protein sequence record to structures, DNA sequences, genes, and other biochemical databanks. Establishing links allows SRS

(Etzold and Argos, 1993a), on the Internet for retrieving sequences, to build indexes (Etzold and Argos, 1993b) that cross-reference a large set of data sources. This offers an easy alternative to importing related information into the existing schema. The links extend the database to a supplementary level of information access, making the queries a structured entry, pointing to a browsable federation of databases (Letovsky, 1995). Nevertheless, links may cause problems, either because the target disappears, or links are automatically generated, and target content is semantically different from that of the source. We experienced the latter problem with SwissProt, when the genes of B.subtilis were recently renamed according to the nomenclature of E.coli (Williams, 1997). Such a situation calls for more stringent definition, exemplified by the heavyweight link solution proposed in Ecocyc (Karp, 1996). We continue to cope with the old names of B.subtilis genes, as synonyms for the new ones, to avoid losing access to previous, often invaluable work, in which old names were used.

Data warehouse

Many information repositories are accessible on the Internet. We warehouse the genetic map of ECD, and DNA sequence information of bacteria in GenBank and EMBL (Figure 4). Input of external data from flat files is automated with parsers based on Context-Free Grammars (CFGs) (Hopcroft and Ullman, 1979). They define the record structures of data distributions, and are used to produce table files loaded into the relational model of the


Fig. 3. Schema and tables for the sequence features, and an example of an instance, from the translation of a databank record.
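The translation of a databank record into feature tables shown in Figure 3 can be sketched as follows. This is an illustration under stated assumptions, not the lex/yacc parser of the paper: the regular expressions, the simplified three-table layout (the accession column is omitted) and the use of SQLite instead of ORACLE are choices made for the example.

```python
import re
import sqlite3

# Feature lines of the record shown in Figure 3, in a simplified layout.
RECORD = '''\
source          1..2505
                /strain="PAO1"
                /map="57 min"
RBS             128..132
CDS             136..834
                /EC_number="4.1.1.23"
                /gene="pyrF"
'''


def parse_features(text):
    """Split feature lines into rows for the three tables."""
    features, locations, qualifiers = [], [], []
    key = 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        qual = re.fullmatch(r'/(\w+)="([^"]*)"', line)
        if qual:
            # A qualifier belongs to the most recently read feature.
            qualifiers.append((key, qual.group(1), qual.group(2)))
            continue
        feat = re.fullmatch(r"(\w+)\s+(\d+)\.\.(\d+)", line)
        if feat is None:
            raise ValueError(f"unrecognized line: {line!r}")
        key += 1
        features.append((key, feat.group(1)))
        locations.append((key, int(feat.group(2)), int(feat.group(3))))
    return features, locations, qualifiers


def load(features, locations, qualifiers):
    """Create the tables in memory and insert the parsed rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE feature (code_feat INTEGER PRIMARY KEY, type_feat TEXT);
        CREATE TABLE location (code_feat INTEGER, start_loc INTEGER, stop_loc INTEGER);
        CREATE TABLE qualifier (code_feat INTEGER, type_qual TEXT, value TEXT);
    """)
    conn.executemany("INSERT INTO feature VALUES (?, ?)", features)
    conn.executemany("INSERT INTO location VALUES (?, ?, ?)", locations)
    conn.executemany("INSERT INTO qualifier VALUES (?, ?, ?)", qualifiers)
    return conn


# A selective query: name and position of every annotated coding sequence.
CDS_QUERY = """
    SELECT q.value, l.start_loc, l.stop_loc
    FROM feature f
    JOIN location l ON l.code_feat = f.code_feat
    JOIN qualifier q ON q.code_feat = f.code_feat
    WHERE f.type_feat = 'CDS' AND q.type_qual = 'gene'
"""

if __name__ == "__main__":
    connection = load(*parse_features(RECORD))
    print(connection.execute(CDS_QUERY).fetchall())
```

Because every feature key, location and qualifier is stored as its own data item, the query selects on the feature type and qualifier name directly instead of matching strings in free text.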

[Figure residue, Fig. 4. Origins: microbial DNA (GenBank + EMBL); genetic maps of E. coli (ECD) [Kroger et al.] and B. subtilis [Biaudet et al.]; SubtiList [Moszer et al.]; B. subtilis mutants of the European program. Interfaces: automatic input of parsed flat files (lex, yacc, perl); user input of images, menus and scans (WWW forms, perl, ftp); relational database.]

[Figure residue, Fig. 5. Screen capture of a specific feature query in Pseudomonas aeruginosa, listing protein_bind features of entry D37883 (ANR protein binding sites at 49..62 and 1769..1782) with their notes and sequences.]

Fig. 4. Automatic updates and user input.

Fig. 5. Extraction of features, with the example of annotated protein binding sites.

syntactic inconsistencies in the databank records. The efficiency of the data warehouse largely relies on the degree of automation for updates: routinely, >99.9% of sequence records are accepted from GenBank. Twenty-four sequences of 29 936 are rejected from version 98 (December 1996), because of missing information. Grammars are adaptable, allowing the evolution of the information structure from data resources to be followed. As specialized databases proliferate in diverse software environments, generating heterogeneity, grammars propose standard descriptions of, and tools for, program data interchange.

Extraction of sequence features

Micado offers extraction of sequences based on the annotations, with interactive access through the WWW (Figure 5). Retrieving sequences according to their features (Sakamoto et al., 1993) was an important purpose, as classical tools lack precision. In the GCG package (Devereux et al., 1984), for instance, information is searched by string pattern matching on unstructured text annotations. Feature extraction also illustrates the use of highly selective queries, and the role of exhaustive parsing of data text. Moreover, the search may be crossed with other selection criteria, such as taxa and authors, and related to gene or mutant properties in the data schema. Searching sequences from the annotations may be viewed as complementary to extracting sequences by comparison of alignments (Altschul et al., 1994). Noticeable evolution towards feature extraction may be seen


database. We used the Backus-Naur Form (BNF) specification language, and programmed with lex and yacc, standard generators of lexical and syntactic analyzers (Aho et al., 1986; Levine et al., 1992). Parsing of sequence data was first reported by the NCBI (Karp, 1990), on the ASN.1 format (Ostell, 1990). SORTEZ (Hart et al., 1994) parses ASN.1 data to a relational database, allowing SQL access to sequences, and is based on the DCG (Definite Clause Grammar) logic grammar formalism, implemented in the Prolog language. QGB (Overton et al., 1994) extends this work to error checking and features completion. This is derived from new linguistic approaches in sequence analysis that encompass the problem of data interchange and controls (Searls, 1993; Dong and Searls, 1994). CFGs allow the writing of formal specifications of record syntaxes that provide complete and extensive parsing of integrated records into the database. This contributes to a fine specification of the schema, and selective access to information, a consequence of the accuracy brought by controlling elementary data item levels. The recursive nature of the CFGs facilitates processing the hierarchical organization of annotations found in the sequence databanks, especially the field of DNA features (DDBJ, EMBL and GenBank Staffs, 1996). A common module is used to parse the features of both EMBL and GenBank, thanks to their standardization. It recognizes the set of operators applied to the location descriptors of sequence features, to store them in the data model (Figure 3). Processing of the sequences is mainly done at the parsing step, the complement() operator, applied to coding sequences (CDS), being the most commonly used. The parsers are reliable for automatic updates; the robustness is due to the detection and correction of

[Figure residue, Fig. 6. Diagram labels: WWW interface; SQL/Perl and Java gateway programs; hyperlinked menus, selectable lists and clickable images; remote records; local public and private data; private access; screens.]

Fig. 6. User interface of Micado on the WWW.
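The role of the gateway programs behind this interface, turning an SQL query result into an HTML page, can be sketched as below. The original gateways are Perl scripts using DBI against ORACLE; this Python version with SQLite is only an analogue, and the `gene` table, its columns and the function names are hypothetical.

```python
import html
import sqlite3


def render_table(cursor):
    """Render the rows of an executed SELECT as an HTML table."""
    headers = [column[0] for column in cursor.description]
    lines = ["<table>"]
    lines.append(
        "<tr>" + "".join(f"<th>{html.escape(h)}</th>" for h in headers) + "</tr>"
    )
    for row in cursor.fetchall():
        cells = "".join(f"<td>{html.escape(str(value))}</td>" for value in row)
        lines.append(f"<tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


def gene_page(conn, name_prefix):
    """Build the page answering a query on gene names.

    The prefix is passed as a bound parameter, never pasted into the SQL
    text, so that form input cannot alter the statement.
    """
    cursor = conn.execute(
        "SELECT name, position FROM gene WHERE name LIKE ? ORDER BY position",
        (name_prefix + "%",),
    )
    return "<html><body>\n" + render_table(cursor) + "\n</body></html>"


def demo_connection():
    """In-memory table with made-up rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gene (name TEXT, position REAL)")
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?)",
        [("demoB", 42.0), ("demoA", 10.5), ("testC", 77.0), ("demo<x>", 50.0)],
    )
    return conn


if __name__ == "__main__":
    print(gene_page(demo_connection(), "demo"))
```

Escaping every value before it is written into the page keeps database content from being interpreted as markup by the browser.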

Graphical interface

Information on sequences, genes and mutants is searched

Discussion

The WWW interface of Micado is accessed nearly 10 000 times a month, from 300 sites, of which 30% use the nascent Java interface. A total of 70% of the connections come from Europe, mainly from people involved in the B.subtilis genome programs, E.coli, or lactic acid bacteria. We

[Figure residue, Fig. 7. Panel titles: contig map (Java applet window); genetic map of the Bacillus subtilis chromosome (47% sequenced), with a link to get the PostScript file; physical map, with a legend for terminators, promoters, known CDS and scale.]

Fig. 7. Contig, genetic and physical maps on the WWW, with clickable contigs, genes and sequence features. The three instances shown here are respectively generated in Java for the contig map, and from Perl, the genetic map in printable PostScript, and the physical map in GIF format.


with ACNUC (Gouy et al., 1984, 1985). An adaptation of ACNUC, for the databanks under GCG, provides a useful complement to the basic search of annotations proposed by the sequence analysis package. This function is also available in the most recent version of SRS.
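The handling of location operators during parsing, and the extraction of a sequence from a feature location, can be illustrated with a small recursive-descent sketch. It covers only plain ranges and the complement() and join() operators of the feature table syntax; the grammar subset, the tuple-based tree and the function names are assumptions of this example, not the lex/yacc grammar used for Micado.

```python
import re

# Grammar subset (an assumption of this sketch):
#   location := range | "complement(" location ")"
#             | "join(" location ("," location)* ")"
#   range    := INT ".." INT
TOKEN = re.compile(r"\s*(complement|join|\.\.|\d+|[(),])")
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def tokenize(text):
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def _expect(tokens, value):
    if not tokens or tokens[0] != value:
        raise ValueError(f"expected {value!r}")
    return tokens[1:]


def _location(tokens):
    """Parse one location; return (tree, remaining tokens)."""
    if not tokens:
        raise ValueError("unexpected end of location")
    head = tokens[0]
    if head == "complement":
        inner, rest = _location(_expect(tokens[1:], "("))
        return ("complement", inner), _expect(rest, ")")
    if head == "join":
        part, rest = _location(_expect(tokens[1:], "("))
        parts = [part]
        while rest and rest[0] == ",":
            part, rest = _location(rest[1:])
            parts.append(part)
        return ("join", parts), _expect(rest, ")")
    if head.isdigit():
        rest = _expect(tokens[1:], "..")
        if not rest or not rest[0].isdigit():
            raise ValueError("expected end position")
        return ("range", int(head), int(rest[0])), rest[1:]
    raise ValueError(f"unexpected token {head!r}")


def parse_location(text):
    tree, rest = _location(tokenize(text))
    if rest:
        raise ValueError("trailing tokens after location")
    return tree


def extract(tree, sequence):
    """Return the subsequence described by a parsed location (1-based, inclusive)."""
    kind = tree[0]
    if kind == "range":
        _, start, stop = tree
        return sequence[start - 1:stop]
    if kind == "join":
        return "".join(extract(part, sequence) for part in tree[1])
    # complement: read the inner location on the opposite strand.
    return extract(tree[1], sequence).translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    dna = "AACCGGTTAC"
    print(extract(parse_location("complement(join(1..2,9..10))"), dna))
```

The recursion in `_location` mirrors the nesting allowed by the grammar: operators can wrap other operators to any depth, which a flat pattern match cannot follow.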

through hyperlinked menu pages, composed of selectable lists and options, text input, and graphical navigation on clickable images (Figure 6). The importance of graphics for biologists has been demonstrated by the success of ACeDB (Durbin and Thierry-Mieg, 1991) for genomic databases. Interactive queries to Micado dynamically generate graphical representations of genetic, physical and contig maps, which are hierarchically related (Figure 7). These represent the full chromosome, or defined chromosomal regions, where users click on representations of genes, sequences and their features, mostly to access exportable text data and graphics. Information display on mutants of B.subtilis is also based on graphics, for growth and activity curves, or images of electrophoresis controls, sent from the laboratories (Figure 8). The WWW server is used as a database front-end for images, and in the future we expect to provide protein 2D gel electrophoresis (Appel et al., 1994) and RNA transcript 1D gels, generated in the functional analysis project.
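How gene positions might be turned into the clickable regions of a map image can be sketched as follows. The sketch assumes a client-side HTML image map and a linear scale; the `/gene?name=` URL pattern, the pixel geometry and the function names are hypothetical and are not taken from the Micado implementation, which produced its Imagemap output from Perl libraries.

```python
import html
import urllib.parse


def map_areas(genes, region_start, region_stop, width_px, top_px=20, height_px=12):
    """Convert (name, start, stop) positions into <area> elements.

    Positions are scaled linearly from the chromosomal region to the
    image width; every gene becomes a rectangle on a single track.
    """
    scale = width_px / (region_stop - region_start)
    areas = []
    for name, start, stop in genes:
        x1 = round((start - region_start) * scale)
        x2 = round((stop - region_start) * scale)
        href = "/gene?name=" + urllib.parse.quote(name)  # hypothetical URL
        areas.append(
            f'<area shape="rect" coords="{x1},{top_px},{x2},{top_px + height_px}" '
            f'href="{href}" alt="{html.escape(name)}">'
        )
    return areas


def image_map(map_name, areas):
    """Wrap the areas in a <map> element that an <img usemap> can reference."""
    return f'<map name="{map_name}">\n' + "\n".join(areas) + "\n</map>"


if __name__ == "__main__":
    example = map_areas([("pyrF", 136, 834)], 0, 2500, 500)
    print(image_map("physical_map", example))
```

Computing the rectangles from the same coordinates used to draw the image keeps the clickable regions aligned with the picture, whatever output format is chosen for the drawing itself.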

[Figure residue, Fig. 8. Screen capture of a browser window showing a GIF image of a gel, a growth curve in rich medium with its growth coefficient (mutant/wild type), and the entry date and entry name of the record.]

expect user interest to grow, as we have extended the database to all sequences of Bacteria and Archaea. The most typical queries concern the genetic maps, consultation of mutant properties, physical maps of sequences, and extraction of sequence features. The database successfully combines different uses, and provides a comfortable environment for genome analysis projects. The logic of network interfaces favours delocalized data input and administration, as well as user feedback. Cooperation between genome projects is improved, as the laboratories become providers of on-line information. While for general use WWW client browsers allow multiplatform and anonymous access, private accounts are dedicated to data input and confidential queries about sequences and mutants, shared by the European genomic programs on B.subtilis. Participants in the project enter information on new mutants through HTML FORMs, and FTP is used to load gel electrophoresis images into Micado (Figure 4). The genetic map of B.subtilis has been extensively checked. The graphical interface on the Internet offered active controls of the map, to amend the genetic linkage distances measured between the genes. Many modifications were due to the availability of sequences, especially helped by the teams having newly sequenced regions of the chromosome. This allowed the genes to be ordered at the locus level, and the genetic map positions were automatically compared with published sequences in the


Conclusion

The design of our database relies on standards to preserve its evolution. It is open enough to allow extension from proteins to microbial strain repository information, and able to support data from large-scale laboratory experiments. The functional analysis of unknown genes presently drives Micado towards comparative genetics, biochemistry and physiology, and we need to link our genomic data to metabolic pathway information resources. However, both approaches used to enlarge the system, data integration and hyperlinks, have drawbacks. We restrict GenBank and EMBL imports to their major releases, on a bimonthly basis. The interval of time between updates is greater, and less regular, for specialized resources, such as ECD and SubtiList. However, as integration into the data schema provides accurate search on information, the problem resides in the diversity of new sources that continue to appear, and are relevant to the database. Even with highly automated procedures of translation, this may impose a heavy maintenance burden on a small bioinformatics team. In


Fig. 8. Mutant characteristics, with a gel electrophoresis Southern blot and growth curves. The peak shows the activity of the interrupted gene revealed by a reporter gene.

database. In this way, the network gives Micado dynamic data editing and correction. Meaningful results depend on the data quality, an essential requirement of many applications, and a database management system offers support to curate and improve biological information. Notably, comparable annotated features of overlapping primary B.subtilis sequences, from independent sources, when combined into a contig feature table, match in only about half of the cases. Another quarter match after correction of locations, due to shifts between the sequences. Remaining features that are in conflict at the same locus, even if automatically controllable, should be reported in sequence annotations. Information circulates from primary sources, and literature, to specialized databases, with multiple successive controls and integration steps. We read already curated data, themselves including other sources, like the genetic map of E.coli (Rudd, 1996), and an open question is the distribution of upgrades. A critical point for the sequences is how primary data resources integrate corrections and additions, as many features remain to be identified in the sequence records. We believe that structuring information with databases on the network, at the level of biologist communities, constitutes the first step of the answer. Such an organization may, in the future, play an important role of mediation between research teams and international structures in charge of large general resources.
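The comparison of features annotated on overlapping sequences can be pictured with the sketch below. The paper does not describe the procedure it used, so the three outcomes, the uniform-shift rule and the `max_shift` tolerance are assumptions chosen only to make the categories of the text concrete.

```python
from collections import Counter


def classify(feature_a, feature_b, max_shift=3):
    """Compare two (type, start, stop) features placed on contig coordinates.

    max_shift is a hypothetical tolerance, in bases, for a location offset
    between two independently submitted sequences.
    """
    type_a, start_a, stop_a = feature_a
    type_b, start_b, stop_b = feature_b
    if type_a != type_b:
        return "unrelated"
    if (start_a, stop_a) == (start_b, stop_b):
        return "match"
    shift = start_b - start_a
    if abs(shift) <= max_shift and stop_b - stop_a == shift:
        # Same length, both ends displaced by the same small offset.
        return "match after shift"
    if start_a <= stop_b and start_b <= stop_a:
        # Same feature type on overlapping positions, different boundaries.
        return "conflict"
    return "unrelated"


def summarize(pairs, max_shift=3):
    """Count the outcome of each compared pair of features."""
    return Counter(classify(a, b, max_shift) for a, b in pairs)


if __name__ == "__main__":
    pairs = [
        (("CDS", 136, 834), ("CDS", 136, 834)),
        (("CDS", 136, 834), ("CDS", 138, 836)),
        (("CDS", 136, 834), ("CDS", 200, 900)),
    ]
    print(summarize(pairs))
```

Pairs left in the "conflict" category are the ones that, following the text, would need to be reported in the sequence annotations rather than resolved automatically.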


Acknowledgements

We thank Dusko Ehrlich for his active support and contribution to the project, Shahinaz Gas and Jakub Zimmermann, students involved in the development of the database, and the B.subtilis community and Amos Bairoch for their comments and suggestions. We are grateful to Etienne Dervyn for help and discussions, particularly for the functional analysis project, and Patricia Rodriguez-Tome for reading the manuscript. V.B. is the recipient of a Ministere de la Recherche et de l'Enseignement Superieur fellowship. This work was supported in part by a Groupement de Recherche et d'Etudes sur les Genomes grant, and by a Ministere de la Recherche et de l'Enseignement Superieur grant (decisions 92.H.0912 and ACC/SV 1995).

References

Adobe Systems Inc. (1992) PostScript Language Reference Manual, 2nd edn. Addison-Wesley Publishing Company, Reading, MA.
Aho,A., Sethi,R. and Ullman,J. (1986) Compilers: Principles, Techniques and Tools. Addison-Wesley Publishing Company, Reading, MA.
Altschul,S.F., Boguski,M.S., Gish,W. and Wootton,J.C. (1994) Issues in searching molecular sequence databases. Nature Genet., 6, 119-129.
Appel,R.D., Bairoch,A. and Hochstrasser,D. (1994) A new generation of information retrieval tools for biologists: the example of the ExPASy WWW server. Trends Biochem. Sci., 19, 258-260.
Bairoch,A. and Apweiler,R. (1996) The SWISS-PROT protein sequence data bank and its new supplement TREMBL. Nucleic Acids Res., 24, 21-25.
Benson,D.A., Boguski,M., Lipman,D.J. and Ostell,J. (1996) GenBank. Nucleic Acids Res., 24, 1-5.
Biaudet,V., Samson,F., Anagnostopoulos,C., Ehrlich,S.D. and Bessieres,P. (1996) Computerized genetic map of Bacillus subtilis. Microbiology, 142, 2669-2729.
Boutell,T. (1996) CGI Programming in C and Perl. Addison-Wesley Publishing Company, Reading, MA.
Davidson,S.B., Overton,C. and Buneman,P. (1995) Challenges in integrating biological data sources. J. Comput. Biol., 2, 557-572.
DDBJ, EMBL and GenBank Staffs (1996) The DDBJ/EMBL/GenBank Feature Table: Definition. DNA Data Bank of Japan, Mishima, Japan; EMBL Data Library, Cambridge, UK; GenBank, NCBI, Bethesda, MD. Technical Report (version 1.09).
Devereux,J., Haeberli,P. and Smithies,O. (1984) A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res., 12, 387-395.
Devine,K.M. (1995) The Bacillus subtilis genome project: aims and progress. Trends Biotechnol., 13, 210-216.
Dong,S. and Searls,D.B. (1994) Gene structure prediction by linguistic methods. Genomics, 23, 540-551.
Durbin,R. and Thierry-Mieg,J. (1991) A Caenorhabditis elegans Database. Documentation, code and data available from anonymous FTP servers at: lirmm.lirmm.fr, cele.mrc-lmb.cam.ac.uk and ncbi.nlm.nih.gov
Etzold,T. and Argos,P. (1993a) SRS - an indexing and retrieval tool for flat file data libraries. Comput. Applic. Biosci., 9, 49-57.
Etzold,T. and Argos,P. (1993b) Transforming a set of biological flat file libraries to a fast access network. Comput. Applic. Biosci., 9, 59-64.
Gouy,M., Milleret,F., Mugnier,C., Jacobzone,M. and Gautier,C. (1984) ACNUC: a nucleic acid sequence data base and analysis system. Nucleic Acids Res., 12, 121-127.
Gouy,M., Gautier,C., Attimonelli,M., Lanave,C. and di Paola,G. (1985) ACNUC - a portable retrieval system for nucleic acid sequence databases: logical and physical designs and usage. Comput. Applic. Biosci., 1, 167-172.
Hart,K.W., Searls,D.B. and Overton,G.C. (1994) SORTEZ: a relational translator for NCBI's ASN.1 database. Comput. Applic. Biosci., 10, 369-378.
Harwood,C.R. and Wipat,A. (1996) Sequencing and functional analysis of the genome of Bacillus subtilis strain 168. FEBS Lett., 389, 84-87.
Hopcroft,J.E. and Ullman,J.D. (1979) Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Publishing Company, Reading, MA.
Karp,P.D. (1990) The ASN.1 Printfile Parser and Path Manipulation Package. NCBI Technical Report Series, Natl Library of Medicine, NIH. Technical Report Number 5, 30 pp.
Karp,P.D. (1996) Database links are a foundation for interoperability. Trends Biotechnol., 2, 273-279.
Keen,G. et al. (1996) The Genome Sequence DataBase (GSDB): meeting the challenge of genomic sequencing. Nucleic Acids Res., 24, 13-16.
Letovsky,S. (1995) Beyond the information maze. J. Comput. Biol., 2, 539-546.
Levine,J., Mason,T. and Brown,D. (1992) lex and yacc. O'Reilly and Associates, Sebastopol, CA.


comparison, hyperlinks through the WWW offer an alternative to loading the data into Micado, but at the cost of a loss of granularity for the queries. Basically, the unit of retrieved information is an entire databank record (Letovsky, 1995; Karp, 1996). In such a context, we are looking for new tools and interoperability solutions proposed by the bioinformatics community. We hope networked environments will help, by allowing easier and more cooperative interactions between heterogeneous databases, from data interchanges to real distributed systems. Aside from the problem of organizing databases inside federated systems, we are also attentive to general standardized connections of the database to application programs, such as display and processing tools, that can be easily adapted. In this respect, the emerging CORBA standard (Common Object Request Broker Architecture), from the OMG (Object Management Group), constitutes a new approach compared to classical solutions. Nevertheless, efficient implementations need to define semantic equivalences among the different information representations of biological entities. This requires close coordination between groups addressing database developments, an important social issue for programmers. The EBI is in charge of distributing a large collection of information resources, and its research and development group is promoting and investing in CORBA. We hope that such a major bioinformatics centre will be helpful in coordinating specifications of common shared data schemas. Availability of general resources such as SwissProt, through CORBA protocols, would greatly facilitate interoperation of satellite sites such as ours. Ideally, this would offer the same quality of access to data as the classical integrated approach, while avoiding redundancy in the maintenance costs.
Whatever the future for CORBA, the concept of a non-proprietary and multiplatform solution, to build distributed software systems, should stimulate the bioinformatics community in the search for an adequate answer to the problems of interoperation.


Received on December 30, 1996; revised on February 26, 1997; accepted on March 14, 1997


Liu,C., Peek,J., Jones,R., Buus,B. and Nye,A. (1994) Managing Internet Information Services. O'Reilly and Associates, Sebastopol, CA.
Markovitz,V.M. and Ritter,O. (1995) Characterizing heterogeneous molecular biology database systems. J. Comput. Biol., 2, 547-556.
Moszer,I., Glaser,P. and Danchin,A. (1995) SubtiList: a relational database for the Bacillus subtilis genome. Microbiology, 141, 261-268.
Murray,J.D. and van Ryper,W. (1996) Encyclopedia of Graphics File Formats, 2nd edn. O'Reilly and Associates, Sebastopol, CA.
Ostell,J. (1990) GenInfo ASN.1 Syntax: Sequences. NCBI Technical Report Series, Natl Library of Medicine, NIH. Technical Report Number 1 (version 0.5), 37 pp.
Overton,G.C., Aaronson,J.S., Haas,J. and Adams,J. (1994) QGB: a system for querying sequence database fields and features. J. Comput. Biol., 1, 3-14.
Perriere,G., Gouy,M. and Gojobori,T. (1994) NRSub: a non-redundant database for the Bacillus subtilis genome. Nucleic Acids Res., 22, 5525-5529.
Rodriguez-Tome,P., Stoehr,P.J., Cameron,G.N. and Flores,J.P. (1996) The European Bioinformatics Institute (EBI) databases. Nucleic Acids Res., 24, 6-12.
Rudd,K.E. (1996) Escherichia coli K-12 on the Internet. Trends Genet., 12, 156-157.
Sakamoto,N., Takagi,T. and Sakaki,Y. (1993) Development of the Overlapping Oligonucleotide Database and its application to signal sequence search of the human genome. Comput. Applic. Biosci., 9, 427-434.
Schbath,S., Prum,B. and de Turckheim,E. (1995) Exceptional motifs in different Markov chain models for a statistical analysis of DNA sequences. J. Comput. Biol., 2, 417-437.
Searls,D.B. (1993) The computational linguistics of biological sequences. In Hunter,L. (ed.), Artificial Intelligence and Molecular Biology. AAAI Press/The MIT Press, Menlo Park, USA, pp. 47-120.
Teorey,T.J. (1994) Database Modeling and Design, 2nd edn. Morgan Kaufmann Publishers, San Francisco, CA.
Wahl,R., Rice,P., Rice,C.M. and Kroger,M. (1994) ECD - a totally integrated database of Escherichia coli K12. Nucleic Acids Res., 22, 3450-3455.
Williams,N. (1997) How to get databases talking the same language. Science, 275, 301-302.
