Random-access compression of annotated DNA sequences

July 9, 2017 | Autor: Ioan Tabus | Categoría: Random Access, Genome sequence, DNA sequence

Descripción

RANDOM-ACCESS COMPRESSION OF ANNOTATED DNA SEQUENCES Gergely Korodi and Ioan Tabus Institute of Signal Processing, Tampere University of Technology P.O Box 553, FIN-33101 Tampere, Finland, e-mail: [email protected], [email protected] During the pruning process at each step we substitute a set of nodes by another node. At each point let R(s) stand for the set of nodes that are substituted by node s after all pruning steps done so far. Initially, R(s) is {s} for leaves and ∅ otherwise. Whenever a node q is omitted from the tree, we set R(F (q)) ← R(F (q))∪R(q). Consider now a Maximal Tree Machine [1] created from a training sample. For each node s assign a value d(s), and initialize these as

ABSTRACT This article investigates the efficiency of randomly accessible coding for annotated genome files and compares it to universal coding. The result is an encoder which has excellent compression efficiency on annotated genome sequences, provides instantaneous access to functional elements in the file, and thus it serves as a basis for further applications, such as indexing and searching for specified feature entries.

d(s) =

1. INTRODUCTION

0

a∈A(s)

na|s log

P (a|s) P (a|F (s))

(s is leaf) (otherwise)

(1)

The pruning algorithm now proceeds as

One of the most popular representation forms of annotated files is the GenBank file format [4]. This describes the DNA sequence and the annotated feature in a plain and readable text format. However, this format is not adequate for most of the tasks which are important for GenBank files. Two such tasks are file distribution and storage; in an optimal scenario these would use a representation which is as compact as possible. Another useful task is information retrieval; here the important features of the sequence and their properties should be easily accessible, which commonly requires some sort of indexing. A file must be converted to various formats, according to the different tasks, usually performed sequentially. This conversion increases the difficulties and complexity of the work with DNA sequences. The objective of the present article is to provide a general annotated file representation format which is adequate for various biological and technical tasks, thereby eliminating the need for format conversion. Our aim is to provide a compact representation of the annotated file which also allows for easy and fast access of designated features. A recent contribution on this subject has been made in [2], which introduced a new pruning condition for Tree Machines in order to generate efficient external models from training samples. In Section 2 we overview a method which generates models for efficient compression of short sequences [2]. In Section 3 we introduce a variable-length block-based encoder, which uses these models. Finally, Section 4 illustrates the performance of our algorithm on GenBank files with random-access to functional elements, comparing it to general purpose compression programs, which do not provide this feature.

1. Set H to be the set of leaves in the Maximal Tree 2. Until the desired tree size has been reached 2.1 Let q = arg mins {d(s) | s ∈ H} P (a|s) 2.2 Increase d(F (q)) by r∈R(q) a∈A(r) na|r log P (a|F (s)) 2.3 Set H ← (H ∪ {F (q)}) \ R(q) 2.4 Omit R(q) from the tree This algorithm has been shown to minimize the difference of code lengths of the training data between two subsequent pruning steps [2]. As a result, the algorithm creates a model which can be used for efficient compression of short data sequences. The exact details of this are given in [1]. 3. ALGORITHM DESCRIPTION For improved efficiency our algorithm relies on the special structure of GenBank files [4] , which are composed of three parts that we call in order the Header, Feature Table and Sequence sections. Since the Header section is usually very small in comparison to the other two parts, we do not compress it. The Feature Table contains the annotated feature entries to the DNA sequence. These obey certain grammatical rules; consequently, it is easy to create models as described in Section 2 which capture well the general characteristics of feature entries. The last section of any GenBank file contains the raw DNA sequence to which the features refer. Apart from the first “ORIGIN” line, the whole section is unambiguously constructed from only the DNA sequence. This sequence is broken into units of 60 bases, and each line holds one such unit. A space is inserted in front of every ten bases in the units, and the position of the first base in the unit is written in 9 characters to the beginning of the line (filling from the left side with spaces, if necessary). The DNA sequence is composed of the symbols A, C, G and T corresponding to the four nucleotides, as well as several other wildcard symbols that have special meaning. However, the wildcard symbols are rare, and it is both easy and efficient to filter them

2. CREATING A MODEL Let us consider a Tree Machine created from certain training data. We denote the set of all nodes (or contexts) in the tree by S, and the one-symbol-shorter suffix of each s ∈ S by F (s). Furthermore, each s ∈ S has an associated set of symbols A(s) ⊆ A and occurrence counts {na|s > 0, | a ∈ A(s)}. These counts induce a probability distribution {P (a|s) | ∀a ∈ A : P (a|s) > 0} for each node s, as described in [1].

1424403855/06/$20.00 ©2006 IEEE

69

GENSIPS 2006

out to a list [3]. After this step, each symbol in the sequence can be represented in 2 bits. Based on these properties, we propose the following algorithm for encoding annotated DNA sequences with random access to functional elements: Step 1 – Initialization. Separate the annotated file into the Header, Feature Table and Sequence parts. Apart from these, we also add a Status file in the archive; this file contains auxiliary information about the sequence, such as whether certain parts of the annotated file conform to the GenBank specification. The Header and Status parts are usually very short, so we always include them in a clear representation. Step 2 – The Sequence Part. This section is checked for compliance to the syntactical rule described earlier. A flag bit indicating this is recorded in the Status part. If the test was successful, we omit all the numbers and white space from this part, remove the wildcard characters from the DNA sequence, and store the resulting sequence in the archive in clear (2 bits per nucleotide), along with the list of wildcard characters. Otherwise, the whole part is stored in clear in the archive. Step 3 – Translation Sequences. Before encoding the Feature Table, we create from it a Filtered Table by omitting redundant translations in the coding sequences. To do this, we repeat the following procedure for each feature entry, one after the other: 1. If the entry is not CDS, or the location in its descriptor is not correct syntactically, then copy the whole entry to the Filtered Table. 2. If the entry is a CDS entry with a proper location in its description line (and possible continuation lines), then (a) If the last qualifier is not a translation sequence, then write a flag bit 0 in Status and copy the whole entry to the Filtered Table. (b) Otherwise, compare this translation sequence to that one which we obtain from the DNA sequence and the location. If these sequences do not match, then again write a 0 in Status and copy the unchanged entry to the Filtered Table. (c) Otherwise, indicate the matching sequences by a 1 in Status, and write only the part before the translation line into the Filtered Table.

Table 1. Comparison of the compression performance (in bits per character) obtained by the algorithms bzip2 (switch “-9”), Rar (switches “-m5 -md4096”), and our method dubbed RandAx. File Size Features bzip2 Rar RandAx MP 256,258 35.4% 2.077 1.978 1.543 VV 393,334 37.7% 2.245 2.173 1.189 SC 632,149 34.0% 2.231 2.163 1.417 CG 903,535 29.3% 2.303 2.241 1.340 PF 2,022,439 24.5% 2.119 1.994 1.471 SP 4,488,804 42.4% 2.053 2.001 1.237 BS 10,197,831 47.6% 1.909 1.844 1.208 EC 11,285,911 47.8% 1.919 1.857 1.339 CE 23,898,982 20.1% 1.955 1.848 1.695 DM 32,354,472 12.3% 2.124 2.052 1.612 AT 47,496,308 20.9% 2.211 2.104 1.493

stantaneous access to each entry in the Feature Table (the average block size was 325 bytes), for the following sequences: chromosome 1 of A. thaliana (AT), chromosome I of C. elegans (CE), chromosome B of C. glabrata (CG), chromosome 2L of D. melanogaster (DM), chloroplast of M. polymorpha (MP), chromosome 4 of P. falciparum (PF), chromosome III of S. cerevisiae (SC), and the complete genomes of B. subtilis (BS), E. coli K-12 (EC), S. pneumoniae R6 (SP), and Vaccinia virus (VV). For each annotated sequence we generated the training file for the pruning algorithm from all of the other files (on paradigm of the cross-validation, leave-one-out method). Using the principles described in [1], from each training file we have created a primary model of 300 kbytes and a secondary model of 100 kbytes large, so our combined models did not exceed 400 kbytes in size, in any case. The sizes of the final archived, block-based randomly accessible files are shown in Table 1. The file sizes are given in bytes, and the “Features” column shows the size of the Feature Table compared to the whole file. The latter information shows that our method is the most efficient for sequences that contain a large amount of annotation, which becomes more important as the biological study and analysis of sequences progress in the future. Even at the current state, our method significantly outperforms general purpose compression programs, while providing extra functionalities for the fast accessing of information inside the archive. Moreover, considering our small model sizes, Table 1 indicates that as an option, we can store the models in the archived data, thus achieving modelindependent compression, while for large files still being much better than other, widely used algorithms.

Step 4 – The Feature Table. At this point we segment the Filtered Table obtained in the previous step for random access. This is done by putting every important functional element (e.g. feature entry) in its own block. These blocks are then independently compressed using the algorithm of [1] and the pruning condition in Section 2. Step 5 – Archivation. At this point we have the compressed Header, wildcard list, filtered DNA sequence and Status files, and the encoded blocks of the Filtered Table. We place these in a final archive with independent access to each entity. After these steps we have created an archive where each entry in the Feature Table can be accessed in the encoded file independently of any other entries (apart from possible references to the DNA sequence in CDS blocks). The DNA sequence is stored in clear, so any part of it can be retrieved instantaneously. For full random access to the features, the individual blocks inside the Filtered Table should be indexed for key information, although we do not consider indexing in this paper.

5. REFERENCES [1] G. Korodi, J. Rissanen and I. Tabus, “Lossless data compression using optimal tree machines,” Proc. IEEE Data Compression Conference (DCC 2005), pp. 348–357, 2005. [2] G. Korodi and I. Tabus, “An Improved Pruning Condition for Tree Machines and Applications to Random-Access Coding,” in Proc. ISCCSP 2006. [3] H.E. Williams and J. Zobel, “Compression of nucleotide databases for fast searching,” Computer Applications in the Biosciences, vol. 13, no. 5, pp. 549–554, 1997.

4. RESULTS

[4] The DDBJ/EMBL/GenBank Feature Table: Definition, http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html

In this section we illustrate the results of the algorithm described in Section 3 on various GenBank files. In the tests we provide in-

70

Lihat lebih banyak...

Random-access compression of annotated DNA sequences

Descripción

Comentarios