KungFQ: A Simple and Powerful Approach to Compress fastq Files


IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, 2012

Elena Grassi, Federico Di Gregorio, and Ivan Molineris

E. Grassi and I. Molineris are with the Department of Genetics, Biology and Biochemistry, Molecular Biotechnology Center, via Nizza 52, Torino 10126, Italy. F. Di Gregorio is with DNDG srl, via Maria Vittoria 2, Torino 10123, Italy. Manuscript received 14 Nov. 2011; revised 21 Aug. 2012; accepted 8 Sept. 2012; published online 25 Sept. 2012. Digital Object Identifier no. 10.1109/TCBB.2012.123.

Abstract—Nowadays, storing data derived from deep sequencing experiments has become pivotal, and standard compression algorithms do not exploit their structure in a satisfying manner. A number of reference-based compression algorithms have been developed, but they are less adequate when approaching new species without fully sequenced genomes, or nongenomic data. We developed a tool that takes advantage of fastq characteristics and encodes them in a binary format optimized to be further compressed with standard tools (such as gzip or lzma). The algorithm is straightforward and does not need any external reference file; it scans the fastq only once and has a constant memory requirement. Moreover, we added the possibility to perform lossy compression, losing some of the original information (IDs and/or qualities) but resulting in smaller files; it is also possible to define a quality cutoff under which the corresponding base calls are converted to N. We achieve 2.82 to 7.77 compression ratios on various fastq files without losing information, and 5.37 to 8.77 when losing IDs, which are often not used in common analysis pipelines. In this paper, we compare the algorithm's performance with known tools, usually obtaining higher compression levels.

Index Terms—Biology and genetics, algorithms for data and knowledge management

1 INTRODUCTION

Deep sequencing techniques have posed new problems to bioinformaticians: up until now, experimental improvements had never created amounts of data requiring hardware upgrades beyond easily achievable ones, since memory and storage capabilities were rising at an adequate pace. In September 2010, the Sequence Read Archive [1] held more than 500 billion reads, corresponding to 60 trillion base pairs, and it nearly had to close because it was not able to keep up with the pace of sequencing output. Every researcher interested in working on data downloaded from SRA or derived from individual experiments must be able to manage gigabytes of files for each single experiment; since most common approaches in bioinformatics require using data derived from many different experiments, this easily becomes a problem. Our interest, thus, lies in developing a tool capable of compressing fastq files, the most common format used by downstream analysis software (such as alignment tools like bowtie [2]), exploiting their peculiar characteristics and creating files that can be further compressed by standard software (like gzip and lzma), which is not designed specifically for this format and thus does not exploit all compression possibilities. Ideally, this tool should be simple, fast (particularly when decompressing, because all typical downstream analyses will start from already compressed files), and not too memory-hungry. We decided to follow a simple approach to compress read IDs, sequences (in base space format, which is the most used; moreover, the algorithm could easily be adapted to color space), and base quality data from fastq files.


Dealing with IDs is a problem because there is no standard format, and therefore it is hard to define a fixed algorithm to compress them; we decided to create ad hoc procedures to encode two of the most frequent formats and leave the rest to the chosen downstream compression algorithm. Since sometimes one is only interested in the read sequences, we added the possibility of not encoding the IDs and/or even the quality values, and in the latter case of choosing whether the base calls corresponding to qualities lower than a defined cutoff should be converted to N.

The mainstream approach when compressing genomic data is to annotate the difference between the data to be compressed and a reference genome, and it yields good results [3], [4], [5], [6]. This is efficient and feasible when dealing with species that have a valid genomic reference, or with experiments aimed at obtaining the whole genome of the given sample, but it is less adequate when approaching new species or nongenomic data due to the lack of a standard reference to align reads against; for instance, RNA sequencing data should be aligned against the entire "theoretically possible" transcriptome, and it is not practical to account for every possible alternative transcript splicing or unknown exon. Since our approach is completely different from those based on a reference sequence, we compared our performance against the best available software with a comparable approach: DSRC [7], which outperforms other specific (G-SQZ [8]) and generic tools (gzip and bzip2) in terms of compression ratios and therefore represents the best alternative. We did not take SOLiDzipper [9] into consideration because it uses different input files (csfasta and QV, thus color space) and its main purpose is high speed in order to use NGS data with cloud computing; its compression ratios are therefore considered less important and are lower than the G-SQZ ones. Similarly, we did not compare ourselves with DNACompress [10], as it is only able to compress DNA sequences and is an implementation of the Lempel-Ziv algorithm, which we use as a downstream compression tool.

DSRC is a C++ tool that is able to compress fastq files, decompress them, and access single reads in the compressed ones. The whole approach is based on dividing the reads into blocks and superblocks and computing statistics over each superblock in order to summarize their common parts, for example, storing only once data shared by all the IDs of the reads contained in a superblock. For sequences, it allows characters other than A, C, T, G, and N, handling the rare occurrences of IUPAC codes by moving their encoding into the quality stream; when a superblock contains many "extra symbols" the approach changes to Huffman encoding, while other superblocks are treated with LZ77-style encoding. Qualities are divided into three categories and then encoded according to the predefined category structure, applying Huffman or run length encoding.

2 METHODS

2.1 General Approach

The compression algorithm relies on treating IDs, sequences, and qualities as three independent streams. For sequences and quality data the basic idea is to use a run length encoding: when there are more than n equal characters, one can annotate the number of repetitions and the repeated character. The first bit of each byte is used as a flag to distinguish bytes encoding single characters from those encoding run lengths. Quality values and sequences are considered as two long streams; single reads are reconstructed when decoding by knowing the read length, which is the same for every read in a fastq produced by recent technologies. This approach should ensure similar compression ratios with the different read lengths produced by different sequencing platforms. After recognizing the ID format, the fixed part is encoded only once in a header of the encoded file, while the variable portion is encoded for every read.


Fig. 1. The encoding process: steps of the compression algorithm on portions of a sample read.

The whole encoded result is then saved as is or further compressed with gzip or lzma, according to the options used. Fig. 1 is a schematic representation of the encoding phase.
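As an illustration of the flag-bit scheme, the following Python sketch run-length encodes a concatenated stream, emitting a literal token for isolated characters and a run token once at least four identical characters repeat. The threshold of four and the token representation are simplifying assumptions for this sketch; the actual byte layouts differ between the sequence and quality streams and are described in the following sections.

```python
def rle_encode(stream, min_run=4):
    """Toy flag-bit run-length encoder over a concatenated stream.

    Emits ('lit', char) for single characters and ('run', count, char)
    for runs of at least `min_run` identical characters; the real tool
    packs these tokens into bytes whose first bit is the literal/run flag.
    """
    out = []
    i = 0
    while i < len(stream):
        j = i
        while j < len(stream) and stream[j] == stream[i]:
            j += 1
        run = j - i
        if run >= min_run:
            out.append(('run', run, stream[i]))
        else:
            out.extend(('lit', stream[i]) for _ in range(run))
        i = j
    return out

# Qualities and sequences are each treated as one long stream; reads are
# re-split on decode using the constant read length stored in the header.
print(rle_encode("ACGGGGGGTT"))
# [('lit', 'A'), ('lit', 'C'), ('run', 6, 'G'), ('lit', 'T'), ('lit', 'T')]
```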

2.2 IDs Encoding

The most frequent formats for fastq read IDs that we encountered are the ones used by SRA and the ENCODE Project [11]; therefore, when encoding a fastq we try to detect via regular expressions whether one of those formats is used. In the former case, we store the constant portion of the IDs only once and encode only the variable parts of every read, while in the latter we simply pass the IDs to the chosen downstream compressor. Example for the SRA format:

@SRX000571_SRR002322.18437692 080317_CM-KID-LIV-2-REPEAT_0003:7:330:466:87 length=36

The first part up until the dot and "080317_CM-KID-LIV-2-REPEAT_0003" are constant, while the other numbers are variable (the one just after the dot seems to be a simple counter, but in a single file it skipped a value, so we had to encode it for every read) and "length=36" is optional, so one just needs to store in the header the information about its presence.
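As a sketch of this detection step, the fragment below splits an SRA-style ID into a constant part (stored once in the header) and the per-read variable numbers. The regular expression and the field names are illustrative assumptions, not the exact pattern used by the tool.

```python
import re

# Hypothetical pattern for SRA-style IDs; the paper does not spell out the
# exact regular expression, so this is only an approximation of the idea.
SRA_ID = re.compile(
    r'@(?P<run>\w+\.)(?P<counter>\d+) '
    r'(?P<machine>.+_\d{4}):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)'
    r'(?P<length> length=\d+)?$'
)

def split_id(read_id):
    """Split an ID into the constant prefix, stored once in the header,
    and the variable fields, stored for every read."""
    m = SRA_ID.match(read_id)
    if m is None:
        return None  # unknown format: pass the ID through to the compressor
    constant = (m.group('run'), m.group('machine'), m.group('length') is not None)
    variable = tuple(int(m.group(k)) for k in ('counter', 'lane', 'tile', 'x', 'y'))
    return constant, variable

example = ('@SRX000571_SRR002322.18437692 '
           '080317_CM-KID-LIV-2-REPEAT_0003:7:330:466:87 length=36')
# constant: run prefix, machine name, and whether "length=" is present;
# variable: counter, lane, tile, x, y (encoded for every read).
print(split_id(example))
```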

2.3 Sequence Encoding

Sequences have only five possible characters (A, C, G, T, and N), and therefore in a single byte we can store a single bit flag and up to three base calls (there are 5^3 = 125 possible nucleotide triplets and 2^7 = 128 usable bit combinations) or a run length for repetitions of at least four bases; the bit flag is necessary to discriminate between these two cases. We encode in a single byte repetitions of up to 19 bases, since we use 1 bit for the flag, 3 bits to encode which character is repeated, and the remaining 4 bits for the length: the largest representable integer is 15 (the shortest run lengths are of four bases, therefore we always add 4 to the encoded length when decoding). Repetitions longer than 19 bases are encoded in two separate bytes. An example of the process is depicted in Fig. 1: the "0" flag at the beginning of the first byte indicates that we are encoding a triplet, "CCT," which corresponds to the bit sequence "0100010."


Fig. 2. Histogram representing the quality data of SRR002321 from SRX000571.

The run of "G" stored in the second byte uses a "1" flag followed by the length of the run (5 - 4 = 1, therefore "0001") and then "G" encoded as "010." Another possible approach, which in the end was not adopted because it yielded worse results with regard to final compression ratios, was to break the "byte modulus" and use only the 7 bits needed to encode a triplet, and two of the three (128 - 125 = 3) remaining combinations as flags for run lengths and end of sequence. The end of sequence flag is needed because when decoding one does not know how many reads have to be decoded, and without a flag that says to stop it is possible to add a spurious read at the end (in the last encoded sequence block the final 1, 2, or 3 bases may have to be encoded by padding with "N" to yield a complete triplet). Analyses of run length and triplet statistics in our sample fastq files showed that run lengths were, on average, rarer than triplets (1 run length every 26 triplets, on average, in SRA data) and that 60 triplets represented 95 percent of the total occurrences, with the other 65 accounting for the remainder. Thus, it seemed sensible to try an approach that uses less space to encode the 60 most frequent triplets, using only the 6 bits needed to encode them and reserving three 6-bit codes to indicate that a rare triplet or a run length follows. Rare triplets in this schema are represented by 7 bits, and there are two classes of run lengths, one for repetitions from 4 up to 19 bases and the other for longer ones (between 20 and 259 bases), which are encoded in 11 bits. This solution was developed but was not chosen for the final tool due to its limited gain in compression ratios with respect to the simpler one explained at the beginning, and due to its dependence on the statistical characteristics of fastq files, which could change in different experimental settings (RNA-seq, DNA-seq, ChIP-seq, etc.).
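A minimal sketch of the two byte layouts adopted in the final tool is shown below. The mapping of bases to integer codes and the order of the bit fields within the run byte are assumptions made for illustration (the bit values of Fig. 1 may come from a different mapping), but the structure follows the description above: a 0 flag followed by one of the 5^3 = 125 triplet codes, or a 1 flag followed by a 3-bit base code and a 4-bit length stored as length minus 4.

```python
BASES = "ACGTN"                # 5 symbols: 5**3 = 125 triplet codes fit in 7 bits
RUN_BITS = {b: i for i, b in enumerate(BASES)}   # assumed 3-bit codes for run bytes

def triplet_byte(triplet):
    """Byte 0ttttttt: flag 0, then the triplet index written in base 5 (0..124)."""
    code = 0
    for base in triplet:
        code = code * 5 + BASES.index(base)
    return code                # the high bit is already 0 because code < 128

def run_byte(base, length):
    """Byte 1bbbllll: flag 1, 3 bits for the repeated base, 4 bits for length - 4."""
    assert 4 <= length <= 19, "longer runs are split over two bytes"
    return 0x80 | (RUN_BITS[base] << 4) | (length - 4)

# "CCT" as a triplet byte and a run of five "G" as a run byte, mirroring the
# example of Fig. 1 (the exact bit values depend on the assumed mapping).
print(format(triplet_byte("CCT"), "08b"))   # -> 00100001 with this mapping
print(format(run_byte("G", 5), "08b"))      # -> 10100001: flag 1, base G, 5 - 4 = 1
```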

2.4 Quality Encoding

Quality data have a broader range of characters (between 33 and 126, depending on the ASCII encoding used by different sequencing platforms), so we only adopted the run length approach: single characters are encoded in one byte, while run lengths use two bytes. The first bit is always the flag and the remaining 7 bits encode the character, possibly followed by a byte storing the number of repetitions, up to 255 in this case.
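The sketch below applies these two layouts to a quality stream; the minimum run length that triggers the two-byte form is not stated in the paper, so the value of four used here is an assumption.

```python
def encode_qualities(qual_stream, min_run=4):
    """Toy quality encoder following the layouts described above:
    0qqqqqqq            -> a single quality character
    1qqqqqqq cccccccc   -> a run: the character plus a count byte (up to 255)
    """
    out = bytearray()
    i = 0
    while i < len(qual_stream):
        j = i
        while (j < len(qual_stream) and qual_stream[j] == qual_stream[i]
               and j - i < 255):
            j += 1
        run = j - i
        q = ord(qual_stream[i]) & 0x7F     # ASCII qualities fit in 7 bits
        if run >= min_run:
            out += bytes([0x80 | q, run])  # flagged character + repetition count
        else:
            out += bytes([q]) * run        # unflagged single characters
        i = j
    return bytes(out)

print(encode_qualities("IIIIIIII#").hex())  # one two-byte run plus one literal byte
```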

2.5 Merging and Compressing the Encoded Streams

The compressed streams are created together while reading the input fastq file and are written as fixed-size blocks interleaved in the output file, always using the first byte of each written block as a flag that tells whether it is a sequence or a quality compressed block (repetitions of the same kind of block are possible because the encoding may be more efficient on sequences or on quality data in different parts of the fastq). The output stream is directly compressed with gzip using SharpZipLib [12] or with lzma using the LZMA SDK [13]. When using gzip, a compression block size equal to our blocks is adopted in order to reach the maximum possible gain in compression.
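The following sketch mimics this block interleaving. The 1 Mb block size is the one mentioned in Section 4, while the flag values and the strictly alternating emission order are assumptions of the sketch; in the tool, blocks are written as each stream's buffer fills, so consecutive blocks of the same kind can occur.

```python
import gzip  # stand-in for SharpZipLib; the tool can also use the LZMA SDK

BLOCK_SIZE = 1 << 20          # 1 Mb encoded blocks, as mentioned in Section 4
SEQ_FLAG, QUAL_FLAG = 0, 1    # assumed one-byte flag values for this sketch

def interleave_and_compress(seq_encoded, qual_encoded, block_size=BLOCK_SIZE):
    """Prefix each fixed-size block with a flag byte telling which stream it
    belongs to, interleave the blocks, then compress the whole result."""
    out = bytearray()
    streams = [(SEQ_FLAG, seq_encoded), (QUAL_FLAG, qual_encoded)]
    offsets = [0, 0]
    while any(off < len(data) for off, (_, data) in zip(offsets, streams)):
        for k, (flag, data) in enumerate(streams):
            if offsets[k] < len(data):
                chunk = data[offsets[k]:offsets[k] + block_size]
                out.append(flag)
                out += chunk
                offsets[k] += len(chunk)
    return gzip.compress(bytes(out))

# Toy usage with already-encoded byte strings standing in for the two streams.
blob = interleave_and_compress(b"\x21" * (3 * BLOCK_SIZE), b"\xc9\x08" * BLOCK_SIZE)
print(len(blob))
```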

2.6 Lossy Compression

Lossy compression is performed by simply not encoding the ID or quality data blocks, and by modifying the stored sequence if a quality cutoff is given. To help with the choice of the cutoff value we added an option which creates a figure with a histogram depicting the occurrences of the different quality values, together with a text file with the raw counts; see Fig. 2 for an example.
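The quality cutoff amounts to the transformation sketched below; the cutoff of 42 (the ASCII character '*') is the value used in the tests of Section 3, and Sanger-encoded quality characters are assumed.

```python
def apply_quality_cutoff(sequence, qualities, cutoff=42):
    """Replace base calls whose quality character is below the cutoff with 'N'.
    Qualities are assumed to be Sanger-encoded ASCII characters."""
    return ''.join('N' if ord(q) < cutoff else base
                   for base, q in zip(sequence, qualities))

print(apply_quality_cutoff("ACGTACGT", "IIII##II"))  # -> ACGTNNGT
```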

3 RESULTS AND DISCUSSION

3.1 Comparisons with the Gold Standard

We compared the compression ratios of our tool, with the encoded stream compressed with lzma and without any lossy options, with those of DSRC. Test data are from RNA sequencing experiments obtained from SRA and the ENCODE Project, plus some genomic data from the DDBJ Sequence Read Archive (38 files in total); three files were the same used for comparison in [7] and [8]. Our compression ratios are always larger than the DSRC ones for ENCODE data, while only sometimes smaller for SRA data and genomic data; in these latter cases the origin of our less effective compression could be the additional number, part of the IDs, that we had to store for every read because in a single file it skipped a value, while in all other files it seemed a simple counter. The average gain in encoded file size on these data is 66 Mb: our compressed files altogether use 2.5 Gb less than those compressed by DSRC and 170 Gb less than the original fastq files (total size: 228 Gb); Tables 1 and 3 show summarized data with comparisons between specific (DSRC, G-SQZ, and BAM) and generic (gzip, bzip2, 7zip, and lzma) compression formats. When fastq files had a repetition of the ID on the third line of every entry (which is redundant, since in the fastq format this line only has to contain a single "+"), it was removed prior to evaluating the original fastq sizes, to get fair compression ratios. On some files (SRA data and one of the genomic ones) G-SQZ failed, thus we were able to evaluate its ratios and running times only on a subset of the whole test set.


TABLE 1 Compression Ratios Comparisons between KungFq, DSRC, G-SQZ, and BAM

TABLE 2 Running Times (Encode and Decode, in Seconds) Comparisons between KungFq, DSRC, G-SQZ, and BAM

Fig. 3. Comparison of compression ratios between lossy and non lossy compression.

Since long range regularities of the data are more easily identified when working on separate streams of similar data (IDs, sequences, and quality values), to further investigate the compression gain derived from our encoding prior to lzma compression we listed the compression ratios obtained with lzma compression on the three separated streams ("split ratios" in Table 3): the mean ratio obtained with KungFq is still larger.

TABLE 3 Compression Ratios Comparisons between gzip, bzip2, 7zip, and lzma

As expected, results are notably better when performing lossy compression: we suppose that many end users will at least ignore the IDs, which yields a further space saving of about 12 percent on encoded files and brings the total space saving to 83 percent with respect to the original fastq size. The gain is even higher when quality data are not stored (for the tests with a cutoff we chose 42, the ASCII character '*', which is a fairly low quality for the Sanger encoding used in SRA data; ratios do not change whether a cutoff is used or not when losing quality values): see Fig. 3 and Tables 5, 6, and 7. We compared the results of our tool with generic compression tools (compressing only the relevant portions of the fastq files: sequences and quality values, IDs and sequences, or only sequences) to point out that its performance is better even with the lossy options. All these lossy ratios are evaluated only on a subset of the fastq files, namely the SRA data.


TABLE 4 Running Times (Encode and Decode, in Seconds) Comparisons between gzip, bzip2, 7zip, and lzma

TABLE 5 Compression Ratios Comparisons between KungFq and Generic Compression Tools Losing IDs

TABLE 6 Compression Ratios Comparisons between KungFq and Generic Compression Tools Losing Quality Values

TABLE 7 Compression Ratios Comparisons between KungFq and Generic Compression Tools Losing IDs and Quality Values

Our prototype in C# is somewhat slower than DSRC, which is written in C++ (almost 10 times when encoding and 3 times when decoding, with DSRC times on the order of 10 minutes; see Table 2 for a list of encoding and decoding times for all the tested specific tools and Table 4 for the generic ones), but performance gains are possible by rewriting it in C or by algorithmic improvements, for example encoding ID, sequence, and quality data blocks in parallel. When using gzip instead of lzma, encoding times are better but the compression ratio is lower, thus one can choose the preferred option depending on which factor is more important. Encoding and decoding times are listed as reported by the Unix tool time ("elapsed" values were used). Supplementary material with encoding and decoding times for the lossy options is provided separately and can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TCBB.2012.123.

4 APPROACHES TO GO FURTHER

Lzma is already a very good compression algorithm, but our encoding managed to improve on its results, and using it as a downstream compressor allowed us to reach compression levels on average higher than DSRC. We tried two approaches that do not write whole bytes and achieved a smaller encoding, but in this way lzma (and gzip) compression became considerably less efficient. To overcome this problem, we developed a modified version of the simpler of these alternatives: we "fill" each 7-bit triplet code up to a whole byte with bits taken from the following codes, so that eight codes in a row occupy seven bytes (to perform this test we used only the short run length class, which needs 7 bits to be encoded, so we always had 7-bit blocks); but even in this way the performance of lzma and gzip dropped too much, supposedly because this "padding" did not create a sufficiently uniform distribution of bytes. In the end, the first approach yields the best compression ratios.
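For reference, the "filling" variant just described boils down to packing 7-bit codes into whole bytes so that eight codes in a row occupy seven bytes; the sketch below only illustrates this bit packing idea and is not the tool's actual code, which in any case discarded this variant.

```python
def pack_7bit_codes(codes):
    """Pack 7-bit codes (0..127) into bytes: eight codes fit in seven bytes.
    This mirrors the discarded variant in which triplets and short run
    lengths all occupy exactly 7 bits."""
    buffer, nbits, out = 0, 0, bytearray()
    for code in codes:
        buffer = (buffer << 7) | (code & 0x7F)
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((buffer >> nbits) & 0xFF)
    if nbits:                                  # flush the incomplete last byte
        out.append((buffer << (8 - nbits)) & 0xFF)
    return bytes(out)

print(len(pack_7bit_codes([33] * 8)))   # 8 codes -> 7 bytes instead of 8
```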

We also tried to use a recent compression software, lrzip [16], to compress our encoded output. This tool scans the file to be compressed as far as it can (depending on the physical memory available) and reorganizes it, in order to exploit long range repetitions of patterns, before compressing it with other standard algorithms, including lzma. It gave results similar to those of lzma (mean ratio 3.99), thus we performed the main comparison using lzma, also considering that lrzip is not available as a library, that its compression ratios depend on the amount of RAM available, and that its approach of loading in memory as much data as it can to prescan the file to be compressed is at odds with our algorithm, which proceeds with small (1 Mb) encoded blocks. Nonetheless, if one wants to use lrzip there is an option which excludes the compression phase with lzma or gzip and just creates the encoded file, which can then be compressed with any other tool.

Our tool supports only fastq files with equal read lengths, like the other specific tools that we compare to (DSRC [7] and G-SQZ [8]). We suppose that the most frequent use of compression tools would precede any trimming of adapter sequences and low-quality base calls, since one would like to keep the original data. Nonetheless, our approach could be modified to support variable read lengths: since we are using only 125 of the 128 possible 7-bit codes for our triplet encoding, we could use one of the spare ones to signal the end of a read, even if this would lead to a less efficient compression rate.

5 CONCLUSIONS

In this work, we propose a simple and linear algorithm that reaches good compression levels on base space fastq files.


The tool should be useful for those who need to reduce the space required to store large quantities of next generation sequencing data. We think that the definitive approach to fastq compression does not exist yet, but in the present work we provide a proof of concept that the pivotal step of our algorithm (run length and triplet encoding) may be an important piece of the puzzle. The source code and the executables are freely available from SourceForge (http://quicktsaf.sourceforge.net/).

ACKNOWLEDGMENTS

The authors would like to thank Paolo Provero for his useful suggestions. This work was supported by the Italian Ministry of University and Research, FIRB piattaforme/reti rbpr05zk2z_006.

REFERENCES

[1] R. Leinonen, H. Sugawara, M. Shumway, and the Int'l Nucleotide Sequence Database Collaboration, "The Sequence Read Archive," Nucleic Acids Research, vol. 39, pp. D19-D21, Jan. 2011.
[2] B. Langmead, C. Trapnell, M. Pop, and S. Salzberg, "Ultrafast and Memory-Efficient Alignment of Short DNA Sequences to the Human Genome," Genome Biology, vol. 10, no. 3, p. R25, 2009.
[3] C. Kozanitis, C. Saunders, S. Kruglyak, V. Bafna, and G. Varghese, "Compressing Genomic Sequence Fragments Using Slimgene," Research in Computational Molecular Biology, vol. 6044, pp. 310-324, 2010.
[4] S. Christley, Y. Lu, C. Li, and X. Xie, "Human Genomes as Email Attachments," Bioinformatics, vol. 25, no. 2, pp. 274-275, Jan. 2009.
[5] K. Daily, P. Rigor, S. Christley, X. Xie, and P. Baldi, "Data Structures and Compression Algorithms for High-Throughput Sequencing Technologies," BMC Bioinformatics, vol. 11, no. 1, article 514, 2010.
[6] M. Hsi-Yang Fritz, R. Leinonen, G. Cochrane, and E. Birney, "Efficient Storage of High Throughput DNA Sequencing Data Using Reference-Based Compression," Genome Research, vol. 21, no. 5, pp. 734-740, May 2011.
[7] S. Deorowicz and S. Grabowski, "Compression of DNA Sequence Reads in FASTQ Format," Bioinformatics, vol. 27, no. 6, pp. 860-862, Mar. 2011.
[8] W. Tembe, J. Lowey, and E. Suh, "G-SQZ: Compact Encoding of Genomic Sequence and Quality Data," Bioinformatics, vol. 26, pp. 2192-2194, July 2010.
[9] Y.J. Jeon, S.H. Park, S.M. Ahn, and H.J. Hwang, "SOLiDzipper: A High Speed Encoding Method for the Next-Generation Sequencing Data," Evolutionary Bioinformatics Online, vol. 7, pp. 1-6, 2011.
[10] X. Chen, M. Li, B. Ma, and J. Tromp, "DNACompress: Fast and Effective DNA Sequence Compression," Bioinformatics, vol. 18, pp. 1696-1698, 2002.
[11] S.E. Celniker, L.A.L. Dillon, M.B. Gerstein, K.C. Gunsalus, S. Henikoff, G.H. Karpen, M. Kellis, E.C. Lai, J.D. Lieb, D.M. MacAlpine, G. Micklem, F. Piano, M. Snyder, L. Stein, K.P. White, and R.H. Waterston, "Unlocking the Secrets of the Genome," Nature, vol. 459, no. 7249, pp. 927-930, June 2009.
[12] M. Krueger, "SharpZipLib," http://www.sharpdevelop.net/OpenSource/SharpZipLib/, 2010.
[13] I. Pavlov, "LZMA SDK," http://www.7-zip.org/sdk.html, 2011.
[14] Picard, http://picard.sourceforge.net, 2012.
[15] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, and the 1000 Genome Project Data Processing Subgroup, "The Sequence Alignment/Map Format and SAMtools," Bioinformatics, vol. 25, no. 16, pp. 2078-2079, Aug. 2009.
[16] C. Kolivas, "lrzip," http://freshmeat.net/projects/lrzip, 2011.
