
Universal Lossless Data Compression With Side Information by Using a Conditional MPM Grammar Transform

En-hui Yang, Senior Member, IEEE, Alexei Kaltchenko, Student Member, IEEE, and John C. Kieffer, Fellow, IEEE

Abstract—A grammar transform is a transformation that converts any data sequence to be compressed into a grammar from which the original data sequence can be fully reconstructed. In a grammar-based code, a data sequence is first converted into a grammar by a grammar transform and then losslessly encoded. Among several recently proposed grammar transforms is the multilevel pattern matching (MPM) grammar transform. In this paper, the MPM grammar transform is first extended to the case of side information known to both the encoder and decoder, yielding a conditional MPM (CMPM) grammar transform. A new simple linear time and space complexity algorithm is then proposed to implement the MPM and CMPM grammar transforms. Based on the CMPM grammar transform, a universal lossless data compression algorithm with side information is developed, which can achieve asymptotically the conditional entropy rate of any stationary, ergodic source pair. It is shown that the algorithm's worst case redundancy/sample against the $k$-context conditional empirical entropy among all individual sequences of length $n$ is upper-bounded by $d\,(1/\log n)$, where $d$ is a constant. The proposed algorithm with side information is the first in the coming family of conditional grammar-based codes, whose expected high efficiency is due to the efficiency of the corresponding unconditional codes.

Index Terms—Arithmetic coding, coding with side information, entropy, grammar-based source codes, multilevel pattern matching (MPM), redundancy, string matching, universal data compression.

I. INTRODUCTION

Consider the typical data compression diagram shown in Fig. 1 with side information known to both the encoder and decoder. The sequences $x = x_1 x_2 \cdots x_n$ and $y = y_1 y_2 \cdots y_n$, where $n$ is the length of the sequences, are regarded as a sequence to be compressed and side information, respectively. Since $y$ is assumed known to both the encoder and decoder, the encoder can use its knowledge about $y$ to conditionally encode $x$ into a binary codeword. Upon receiving the codeword, the decoder can fully recover the original sequence $x$ with the help of its knowledge about $y$. From information theory, the advantage of using side information, if any, for data compression is obvious; one can considerably reduce the compression rate if the side information $y$ is highly correlated with the original sequence $x$. In many applications, side information known to both the encoder and decoder is naturally present. For instance, in video compression, one can use previous frames as side information for the current frame; in lossless resolution-scalable progressive image coding (or multiresolution image coding), one may use low-resolution images as side information for high-resolution images [6]; and in bioinformatics, especially in DNA sequence analysis [4], [9], it is desirable to estimate the information distance between two DNA sequences (in this case, either DNA sequence can be regarded as side information for the other sequence). The challenge is often how to utilize efficiently the available side information. This is the issue that will be addressed in this paper and its companions.

Previously, the side-information problem has been theoretically studied for memoryless source pairs [16] and for individual sequences [23], [24]. There has also been an attempt [15] to create a conditional Lempel-Ziv algorithm; nonetheless, neither a practical algorithm nor a complete compression performance analysis has been developed. Within the design framework of grammar-based codes [7], [18], [20], [21], in this paper we extend the multilevel pattern matching (MPM) grammar transform [8] to the case of side information known to both the encoder and decoder, yielding a conditional MPM (CMPM) grammar transform. We then propose a new simple algorithm for implementing the MPM and CMPM grammar transforms. Our algorithm implementation has linear storage and time complexities for MPM grammars with any number of levels, while the original MPM transform is only proven to have linear complexity when the number of levels is suitably restricted. Based on the CMPM grammar transform, we develop a universal lossless data compression algorithm with side information, which will be called a CMPM code and can achieve asymptotically the conditional entropy rate of any stationary, ergodic source pair. It is also shown that the algorithm's worst case redundancy/sample against the $k$-context conditional empirical entropy among all individual sequences of length $n$ is upper-bounded by $d\,(1/\log n)$, where $d$ is a constant.

Manuscript received June 6, 2000; revised March 12, 2001. This work was supported in part by the Natural Science and Engineering Research Council of Canada under Grant RGPIN203035-98, by the Communications and Information Technology Ontario, Canada, by the Premier's Research Excellence Awards of Ontario, and by the National Sciences Foundation under Grants NCR-9627965 and CCR-9902081. E.-h. Yang and A. Kaltchenko are with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada (e-mail: [email protected]; [email protected]). J. C. Kieffer is with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455 USA (e-mail: [email protected]). Communicated by M. Weinberger, Associate Editor for Source Coding. Publisher Item Identifier S 0018-9448(01)05460-8.


Fig. 1. Data compression diagram with side information.

Fig. 2. Structure of a grammar-based code.

The MPM grammar transform and its corresponding MPM code are a special case of general grammar transforms and grammar-based codes. Another interesting grammar transform is the so-called Yang–Kieffer greedy grammar transform [18], [20], based on which several efficient universal lossless compression algorithms have been proposed. These algorithms now represent some of the best known lossless compression algorithms. In forthcoming papers, we will extend the Yang–Kieffer greedy grammar transform and its corresponding algorithms to the case of side information and provide a unified theoretic framework for general conditional grammar-based codes. This paper is organized as follows. In Section II, we review the structure of the (unconditional) MPM code and its performance. In Section III, we present a simple linear complexity algorithm for implementing the MPM grammar transform. This algorithm is extended in Section IV to construct our CMPM grammar transform and CMPM code; in that section, we also analyze the performance of the CMPM code. Finally, in Section V, we present our simulation results.

II. REVIEW OF THE MPM CODE

As alluded to in the last section, the MPM grammar-based code presented in [8] is a special case of general grammar-based codes [7]. Following [20], a general grammar-based code has the structure shown in Fig. 2. Let $\mathcal{A}$ be the input data sequence alphabet. For any positive integer $n$, $\mathcal{A}^n$ denotes the set of all sequences of length $n$ from $\mathcal{A}$; a sequence from $\mathcal{A}^n$ is sometimes called an $n$-sequence. The original data sequence $x \in \mathcal{A}^n$ is first transformed into a context-free grammar $G_x$ from which $x$ can be fully recovered. The grammar $G_x$ also gives rise to a nonoverlapping, variable-length parsing of $x$. The sequence $x$ is then compressed by using a zero-order adaptive arithmetic code to compress either $G_x$ or the corresponding sequence of parsed phrases. To get an appropriate $G_x$, string matching is often used in some manner. If $G_x$ is obtained by performing multilevel pattern matching, then one gets the MPM grammar transform and the MPM code.

A. The MPM($r$, $I$) Grammar Transform

Let us now briefly review the MPM grammar transform. Let $r$ and $I$ be two positive integers that control how multilevel pattern matching is performed; as a result, the corresponding grammar transform is called the MPM($r$, $I$) grammar transform. To get the transformed grammar $G_x$ or its equivalent form given by the MPM($r$, $I$) grammar transform, we perform multilevel pattern matching. First, we partition the string $x = x_1 x_2 \cdots x_n$ from left to right into nonoverlapping substrings of the lengths $k_I r^I, k_{I-1} r^{I-1}, \ldots, k_0 r^0$, where the lengths are obtained from the $r$-ary expansion of the integer $n$. We denote these substrings by $x(I), x(I-1), \ldots, x(0)$. Thus, the concatenation $x(I)\,x(I-1)\cdots x(0)$ gives the original string $x$. Let

$$n = \sum_{i=0}^{I} k_i\, r^i$$

be the $r$-ary expansion of $n$, where, for each $i < I$, $k_i$ is an integer between $0$ and $r-1$ inclusively, and some $k_i$ may be zero. For example, for $r = 2$, $I = 3$, and $n = 31$, we have

$$31 = 3 \cdot 2^3 + 1 \cdot 2^2 + 1 \cdot 2^1 + 1 \cdot 2^0.$$

At each level $i$, processing¹ takes place in which a sequence of subblocks of length $r^i$ is identified and then represented by a sequence of labels.

¹For brevity, we will occasionally say "level $i$ is processed (created, updated, etc.)," meaning that appropriate sequences of blocks, labels, and/or tokens at level $i$ are processed.
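As a quick check (ours, not from the paper) of the $r$-ary expansion that fixes the partition lengths, the coefficients $k_I, \ldots, k_0$ can be computed greedily in a few lines of Python:

    # Coefficients of n = sum_{i <= I} k_i * r^i, computed greedily.
    n, r, I = 31, 2, 3
    ks, rem = [], n
    for i in range(I, -1, -1):
        ks.append(rem // r**i)
        rem -= ks[-1] * r**i
    print(ks)  # [3, 1, 1, 1] = k_3, k_2, k_1, k_0
    assert sum(k * r**i for k, i in zip(ks, range(I, -1, -1))) == n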

Fig. 3. Grammar $G_x$; $n = 31$, $I = 3$, $r = 2$.

To perform the processing at the top level $I$, we partition the sequence $x(I)$ into $k_I$ blocks of $\mathcal{A}$-symbols of length $r^I$. From left to right, we visit every block and label the first appearance of each distinct block by the symbol "$s$." If the same block appears again, we label it by an integer which points to this block's first appearance. That is, all identical blocks, except the leftmost one, will be labeled by the same integer, which is just the number of distinct blocks up to the first appearance of the block inclusively. This natural ordering will provide the unique correspondence between blocks denoted by "$s$" and those denoted by integers.

Suppose processing has taken place at levels $I, I-1, \ldots, i+1$, where $i+1 \le I$. To perform the processing at the next level $i$, do the following.

Step 1: Partition every $r^{i+1}$-block from the level $i+1$ labeled by "$s$" into $r$ subblocks of length $r^i$.

Step 2: Partition the sequence $x(i)$ into $k_i$ blocks of length $r^i$. Form a sequence of blocks of length $r^i$ by listing, in the indicated order, all the subblocks of length $r^i$ constructed in Step 1 and all the blocks of length $r^i$ arising from the partition of $x(i)$. This produces the sequence of blocks of $\mathcal{A}$-symbols at level $i$.

Step 3: Go through the sequence of blocks at level $i$ from left to right. Mark every first-appearing distinct block with the symbol "$s$." If the same block appears again, replace it with an integer which points to this block's first appearance using the natural ordering described earlier.

Repeat the above procedure up through the processing of the last level, level $0$. Fig. 3 illustrates the above procedure for $n = 31$, $I = 3$, and $r = 2$.

Note that the transformed MPM grammar $G_x$ can be naturally represented by an $r$-ary tree. Following [8], we denote,² for every level $i$, the sequence of block labels consisting of integers and the symbol "$s$" by $T^{(i)}$. We call each block label a "token," and we call $T^{(i)}$ the "token sequence" corresponding to the level $i$. The sequence $(T^{(I)}, T^{(I-1)}, \ldots, T^{(0)})$, called the multilevel representation of $x$ in [8], represents the grammar $G_x$. For our example in Fig. 3, we have the token sequences $T^{(3)}, T^{(2)}, T^{(1)}, T^{(0)}$ shown there.

²In this paper, we have reversed the ordering of the token sequences with respect to the ordering used in [8] so that now $T^{(0)}$ is the bottom sequence.
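The top-to-bottom construction just described is compact enough to sketch in code. The following Python sketch is ours (the paper gives no code for this step); all names and the 31-symbol example string are illustrative and are not the string of Fig. 3.

    def tokenize(blocks):
        """Label first appearances with 's'; label repeats with the 1-based
        index (in order of first appearance) of the matching distinct block."""
        seen, tokens = {}, []
        for b in blocks:
            if b in seen:
                tokens.append(seen[b])
            else:
                seen[b] = len(seen) + 1
                tokens.append('s')
        return tokens

    def mpm_transform(x, r, I):
        n, pos, parts = len(x), 0, []
        for i in range(I, -1, -1):                 # substrings x(I), ..., x(0)
            k = (n - pos) // r**i
            parts.append((i, x[pos:pos + k * r**i]))
            pos += k * r**i
        T, carried = {}, []
        for i, xi in parts:
            own = [xi[j:j + r**i] for j in range(0, len(xi), r**i)]
            blocks = carried + own                 # Step 2: carried subblocks first
            T[i] = tokenize(blocks)                # Step 3: label the blocks
            if i > 0:                              # Step 1 (next round): split every
                carried = [b[j:j + r**(i - 1)]     # 's'-labeled block into r subblocks
                           for b, t in zip(blocks, T[i]) if t == 's'
                           for j in range(0, len(b), r**(i - 1))]
        return T                                   # multilevel representation

    print(mpm_transform('0110100110010110011010010110100', 2, 3))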

To create blocks at level 3, we partition $x(3)$ into blocks of length $2^3$ and label each block as described above, obtaining the corresponding token sequence $T^{(3)}$ (II.1).

To create blocks at level 2, we partition the blocks at level 3 labeled by "$s$" into subblocks of length $2^2$ and then concatenate them to the block $x(2)$. Labeling each block at level 2 gives the corresponding token sequence $T^{(2)}$ (II.2).

To create blocks at level 1, we partition the blocks at level 2 labeled by "$s$" into subblocks of length $2$ and then concatenate them to the block $x(1)$. Labeling each block at level 1 gives the corresponding token sequence $T^{(1)}$ (II.3).

To create blocks at level 0, we partition the blocks at level 1 labeled by "$s$" into single characters and then concatenate them to the character $x(0)$. Labeling each block at level 0 gives the corresponding token sequence $T^{(0)}$ (II.4). The values of these sequences for the string of Fig. 3 appear in that figure.

Combining (II.1)-(II.4), we obtain the multilevel representation $(T^{(3)}, T^{(2)}, T^{(1)}, T^{(0)})$ of $x$.

B. Arithmetic Encoder

In any grammar-based code, the original sequence $x$ is compressed indirectly by compressing the transformed grammar $G_x$. In the case of the MPM code, we separately encode each token sequence in the multilevel representation via the multilevel arithmetic coding algorithm proposed in [19]. Since every token sequence except the bottom one begins with the symbol "$s$," before encoding we simply remove the first symbol "$s$" from each such token sequence. For the levels $i = I, I-1, \ldots, 1$, let $\tilde T^{(i)}$ be the token sequence $T^{(i)}$ with the first symbol "$s$" removed; $|T^{(i)}|$ denotes the length of $T^{(i)}$, and $s(i)$ denotes the number of times the symbol "$s$" appears in $T^{(i)}$. Let us agree that $\tilde T^{(0)} = T^{(0)}$. For our example in Fig. 3, these quantities can be read off the figure.

The arithmetic encoder sequentially and independently encodes the sequences $\tilde T^{(I)}, \tilde T^{(I-1)}, \ldots, \tilde T^{(0)}$, restarting at the beginning of each $\tilde T^{(i)}$, and concatenates the binary codewords for each sequence in the indicated order into one single binary codeword. Each of the sequences $\tilde T^{(i)}$, $i = I, \ldots, 1$, is encoded as follows. We associate each symbol $\beta$ with a counter $c(\beta)$. Initially, $c(\beta)$ is set to $1$ if $\beta \in \{s, 1\}$ and to $0$ otherwise. The initial³ alphabet used by the arithmetic code is $\{s, 1\}$. Encode each symbol $\beta$ in the sequence $\tilde T^{(i)}$ and update the related counters according to the following steps.

Step 1: Encode $\beta$ by using the probability

$$\frac{c(\beta)}{\sum_{\alpha} c(\alpha)} \qquad \text{(II.5)}$$

where the summation is taken over $\alpha \in \{s, 1, 2, \ldots, m+1\}$, and $m$ is the number of times that "$s$" occurs before the position of this $\beta$. Note that the alphabet used at this point by the arithmetic code is $\{s, 1, 2, \ldots, m+1\}$.

Step 2: Increase the counter $c(\beta)$ by $1$.

Step 3: If $\beta = s$, increase the counter $c(m+2)$ from $0$ to $1$, where $m$ is defined in Step 1.

Repeat the above procedure until the entire sequence $\tilde T^{(i)}$ is encoded.

To encode the bottom level sequence $\tilde T^{(0)}$, we use the following two-step adaptive arithmetic coding.

Step 1: Encode $\beta$ by using the probability $c(\beta)\big/\sum_{\alpha \in \mathcal{A}} c(\alpha)$.

Step 2: Increase the counter $c(\beta)$ by $1$.

Repeat the above procedure until the entire sequence $\tilde T^{(0)}$ is encoded. For our example given in Fig. 3, the product of the probabilities used in the arithmetic coding process of the entire grammar is obtained by multiplying the probabilities from (II.5) over all encoded tokens.

³The initial alphabet includes the symbol "$1$" due to the fact of removing the first appearance of the symbol "$s$" at every level.
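The counter bookkeeping of Steps 1-3 can be sketched compactly. The following Python fragment is ours and only computes the ideal coding probability of each token under the growing-alphabet model; driving an actual arithmetic coder with these probabilities (as in [19]) is omitted.

    from fractions import Fraction

    def token_probabilities(tokens):
        """Yield (token, probability) pairs: alphabet starts as {s, 1};
        each encoded 's' admits one new integer token with counter 1."""
        counts = {'s': 1, 1: 1}      # initial counters c(s) = c(1) = 1
        m = 0                        # number of 's' seen so far
        probs = []
        for beta in tokens:
            total = sum(counts.values())
            probs.append((beta, Fraction(counts[beta], total)))  # Step 1
            counts[beta] += 1                                    # Step 2
            if beta == 's':                                      # Step 3
                m += 1
                counts[m + 1] = 1    # admit the next integer token
        return probs

    # Example: probabilities for the token sequence s, s, 1, 2, s, 3
    for beta, p in token_probabilities(['s', 's', 1, 2, 's', 3]):
        print(beta, p)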

Remark 1: When decoding the binary codeword, the arithmetic decoder must know the length of each sequence $\tilde T^{(i)}$ before it starts processing the part of the codeword corresponding to $\tilde T^{(i)}$. To satisfy this requirement, the decoder sequentially computes all the lengths $|\tilde T^{(I)}|, |\tilde T^{(I-1)}|, \ldots, |\tilde T^{(0)}|$ as follows. Assume that the length $n$ of the original sequence is known to the decoder. Then, the length of the top level token sequence is given by $|T^{(I)}| = k_I$. Once the sequence $\tilde T^{(i)}$ is decoded, the length of the next sequence is calculated according to the following formula:

$$|T^{(i-1)}| = r \cdot s(i) + k_{i-1} \qquad \text{(II.6)}$$

since each of the $s(i)$ blocks labeled "$s$" at level $i$ contributes $r$ subblocks to level $i-1$, and the partition of $x(i-1)$ contributes $k_{i-1}$ further blocks.

C. Bounds on the Size of the Multilevel Grammar Representation

In this section, we make references to some results from [8], which will be used for complexity and compression performance analysis. Let us agree that, unless written explicitly, all logarithm bases throughout the paper are assumed to be $2$.
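As a toy illustration of the decoder's length recursion (II.6), consider the following Python fragment; the counts of "$s$" used below are hypothetical and are not those of Fig. 3.

    r, ks = 2, [3, 1, 1, 1]      # k_3 ... k_0 for n = 31, I = 3 (Section II-A)
    s = {3: 2, 2: 2, 1: 2}       # hypothetical counts of 's' at levels 3 ... 1
    length = ks[0]               # |T^(3)| = k_3
    for i, k in zip((3, 2, 1), ks[1:]):
        print(i, length)
        length = r * s[i] + k    # (II.6): |T^(i-1)| = r * s(i) + k_(i-1)
    print(0, length)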


From the description of the MPM code, there are simple relations, (II.7) and (II.8), between the numbers $s(i)$ and the lengths $|T^{(i)}|$. Also, the following bound on the total size of the multilevel representation is proved in [8]: (II.9). Note that $|T^{(0)}|$ cannot exceed $n$. To minimize the order of the expression given by the right-hand side of (II.9), the integer parameter $I$ must be chosen to be at least of the order $\log n$. For such $I$, we obtain the bound (II.10), and for sufficiently large $n$ we simplify this bound to (II.11), which bounds the total number of tokens by a quantity of order $n/\log n$; the constant involved is as given in [8].

D. Computational Complexity of the Arithmetic Coding

Using the results from the previous section, we are ready to upper-bound the computational complexity of the arithmetic encoding and decoding part of the MPM code. Recall that the time required for the "classical" adaptive arithmetic coding algorithm [17] to encode a sequence of length $n$ drawn from an alphabet $\mathcal{A}$ grows with the size of the alphabet. Consequently, in view of (II.8) and (II.11), the time required for that algorithm to encode the token sequences $\tilde T^{(i)}$, $i = 0, 1, \ldots, I$, would be superlinear. Another problem associated with the classical adaptive arithmetic coding algorithm is that it cannot handle a dynamic alphabet which grows without bound. To get around these two problems, we instead apply the multilevel arithmetic coding algorithm, proposed in [19], to encode the token sequences. One of the advantages⁴ of the multilevel arithmetic coding algorithm over the classical adaptive arithmetic coding algorithm is that the time it requires to encode a sequence of length $n$ is linear in $n$, and the storage required is linear as well. In our case, for each level $i$ we have the bound (II.12). In view of the relations (II.8), (II.11), and (II.12), we obtain that the arithmetic encoding and decoding part of the MPM code has $O(n)$-space and $O(n)$-time complexities.

⁴For the other advantages of the multilevel arithmetic coding algorithm, as well as for its implementation details and complexity issues, please see [19].

Remark 2: It should be pointed out that when we measure the time and space complexity of the arithmetic coding part of the MPM code, i.e., the multilevel arithmetic coding, a standard random-access machine (RAM) model with uniform cost criterion [1], [2] is assumed. This assumption is made in almost all papers addressing the implementation of arithmetic coding, such as [11], [17], and [19], and is certainly valid if only finite-precision implementation is concerned. To the best of our knowledge, the only exceptions so far are the papers [13], [14], where the logarithmic cost criterion, i.e., bitwise computation, is used, and where the tradeoff between redundancy and bitwise complexity is addressed.

E. Compression Rate Related to the Grammar Entropy

Let $\bar T^{(i)}$ be the sequence obtained from $\tilde T^{(i)}$ by removing all symbols "$s$." The unnormalized empirical entropy of $\bar T^{(i)}$ is defined as

$$\hat H\bigl(\bar T^{(i)}\bigr) = \sum_{\beta} n_\beta \log \frac{|\bar T^{(i)}|}{n_\beta}$$

where $n_\beta$ denotes the number of occurrences of $\beta$ in $\bar T^{(i)}$, and $|\bar T^{(i)}|$ denotes the length of $\bar T^{(i)}$. A similar definition also applies to any other sequence.

The probability used to encode the symbol $\beta$ in the sequence $\tilde T^{(i)}$ is given by (II.5): for an integer token it is the ratio of $c_{j-1}(\beta)$, the number of occurrences of $\beta$ in the prefix preceding the current position $j$, to the sum of the current counters, and for $\beta = s$ it is determined by the number⁵ of occurrences of "$s$" in that prefix. Assume that exact arithmetic is used in the arithmetic coding. Then the number of bits needed to encode $\tilde T^{(i)}$, $i = I, \ldots, 1$, is upper-bounded as in (II.13); in the derivation, the first inequality is due to the bound on the size of a type class, and the equality follows from the identity $H(U, V) = H(V) + H(U \mid V)$ for any random variables $U$ and $V$. Analogously, the number of bits needed to encode the bottom level $\tilde T^{(0)}$ is upper-bounded as in (II.14).

Definition II.1: The entropy of an MPM grammar $G_x$ is defined as the sum, over all levels $i = 0, 1, \ldots, I$, of the unnormalized empirical entropies $\hat H(\bar T^{(i)})$.

In terms of Definition II.1, therefore, we have proved that the length of the binary codeword is upper-bounded by the grammar entropy plus two sums of lower order terms (II.15). To go further, we have to upper-bound the two sums on the right-hand side of (II.15) in terms of the length $n$ of the original sequence $x$.

⁵The counter $c(s)$ in (II.5) is equal to $c_{j-1}(s) + 1$.

F. Redundancy of the MPM Code

Following the approach from [7], [8], [20], we compare the compression performance of the algorithm with that of the best arithmetic coding algorithm with $k$ contexts operating letter by letter (not phrase by phrase). Let $S$ be a finite set of $k$ elements; each element $s \in S$ is regarded as an abstract context. Let $q$ be a transitional probability function which, given the current context, assigns probabilities to pairs consisting of the next letter and the next context; random transitions between contexts are allowed. For any sequence $x = x_1 x_2 \cdots x_n$ from the finite source alphabet $\mathcal{A}$, the compression rate in bits per letter resulting from the arithmetic coding algorithm with transition probability $q$ is given by

$$-\frac{1}{n} \sum_{j=1}^{n} \log q(x_j, s_j \mid s_{j-1})$$

where $s_0$ is the initial context. The $k$-context empirical entropy of $x$, $H_k^*(x)$, defined in [20] and given by (II.16), gives the smallest compression rate among all arithmetic coding algorithms with $k$ contexts operating letter by letter.
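The unnormalized empirical entropy used in the definitions above is easy to illustrate numerically; the following short Python check is ours.

    from collections import Counter
    from math import log2

    def unnormalized_empirical_entropy(u):
        """H(u) = sum_beta n_beta * log2(|u| / n_beta), in bits."""
        counts = Counter(u)
        return sum(c * log2(len(u) / c) for c in counts.values())

    print(unnormalized_empirical_entropy([1, 1, 2, 3, 1, 2]))  # about 8.755 bits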


We want to compare the compression rate of the MPM code with $H_k^*(x)/n$ for any $k$ and any $x \in \mathcal{A}^n$. Fix $k$. Let $q$ be a transitional probability function which maximizes (II.16); $q$, of course, depends on $x$. For any $\mathcal{A}$-string $u$ and any $s \in S$, define $\gamma(u \mid s)$ to be the largest probability that the arithmetic coder with transition probability $q$ can assign to $u$ when started from the context $s$. We note that for every $\mathcal{A}$-string $u$ and every parsing of $u$ into phrases $u_1, \ldots, u_t$, the following log-subadditivity relation holds: (II.17). It follows from the definition of the function $\gamma$ that (II.18). For each $s \in S$, we normalize $\gamma(\cdot \mid s)$ so that (II.19) the result $\pi(\cdot \mid s)$ is a probability distribution satisfying (II.20). From (II.18)-(II.20), it is easy to see that the normalizing constant depends only on $k$ and satisfies the required bound.

Recall [7], [20] that a grammar transform induces a so-called partition sequence. The following recursive procedure takes the root node of the corresponding tree representation of the transformed grammar (see the example depicted in Fig. 3) as an input parameter and creates the partition sequence.

Postorder(Node)
Begin
  if the Node has no children, then
    append this Node's token to the output list
  else
    Postorder(Node's 1st child);
    Postorder(Node's 2nd child);
    .............................
    Postorder(Node's last child);
End

In our example in Fig. 3, the partition sequence consists of integer tokens, where a subscript of each integer variable indicates the grammar level for this variable. Since each token represents some substring of $x$, the partition sequence induces a parsing of $x$ if each symbol in it is replaced with the corresponding substring of $x$. The concatenation of the phrases in this parsing is $x$. Clearly, the partition sequence will not contain the symbol "$s$." Consequently, the partition sequence is a permutation of the sequence obtained by concatenating $\bar T^{(0)}, \bar T^{(1)}, \ldots, \bar T^{(I)}$.

Let us agree that, when used as an argument in the functions $\gamma$, $\pi$, and $\hat H$, the record $\beta$ denotes the $\mathcal{A}$-sequence represented by the token $\beta$. Then, due to the log-subadditivity relation (II.17), we have (II.21). In view of the information inequality [3, Theorem 2.6.3, p. 26], we also have (II.22) and (II.23), where the minimum is over all probability distributions on the relevant token alphabet. Combining (II.21)-(II.23) together, we get (II.24). Using the inequalities (II.15), (II.7), and (II.8), we obtain (II.25).
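The partition sequence is simply the leaf-token sequence produced by the postorder traversal above. A minimal Python rendering (our names; the tuple-based tree encoding is hypothetical) follows:

    def postorder(node, out):
        """node is (token, children); leaves have children == []."""
        token, children = node
        if not children:
            out.append(token)          # leaf: emit its token
        else:
            for child in children:     # internal 's' node: recurse left to right
                postorder(child, out)
        return out

    # Tiny hypothetical two-level tree: root 's' with a leaf token 1 and an
    # 's' block whose children are leaf tokens 2 and 1.
    tree = ('s', [(1, []), ('s', [(2, []), (1, [])])])
    print(postorder(tree, []))   # [1, 2, 1]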

By substituting the bound for the total number of tokens from (II.11) into (II.25), and recalling that $H_k^*(x)$ maximizes (II.16), we get (II.26). Finally, we find the following upper bound on the worst case redundancy (per sample) of the algorithm against the $k$-context empirical entropy defined in [20]:

$$\frac{1}{n}\,L(x) - \frac{1}{n}\,H_k^*(x) \;\le\; d\,\frac{1}{\log n}$$

for large $n$, where $L(x)$ denotes the codeword length and $d$ is a constant.

Remark 3: The notion of the worst case redundancy of a compression algorithm is rather a strong one. As shown in [7], [8], the worst case redundancy bound also implies the same-order bound for several different notions of redundancy. In particular, an MPM code has $O(1/\log n)$ maximal individual redundancy relative to the class of all finite-state information sources with alphabet $\mathcal{A}$. For comparison, although the same $O(1/\log n)$ bound has been obtained [10] for the average redundancy of the Lempel-Ziv LZ78 code [25], its maximal individual redundancy is only known [12] to be $O(\log \log n / \log n)$. The exact order of the average redundancy of the MPM code, which is obviously no worse than its worst case redundancy, still remains an open problem.

III. A NEW ALGORITHM FOR IMPLEMENTING THE MPM GRAMMAR TRANSFORM

In Section II-A, we have described how to get the transformed grammar $G_x$ given by the MPM grammar transform from top to bottom. In this section, we present a linear-time algorithm which constructs the transformed grammar $G_x$ from bottom to top. We name this new algorithm the "Bottom-to-Top algorithm" in order to distinguish it from the original one, described in [8]. A nice feature of the new algorithm is that no string comparison is made explicitly. In the Bottom-to-Top algorithm, every level is processed in two stages: a creating stage and an updating stage. When the $i$th level is being updated, the $(i+1)$th level is being created. That is, this is a "sliding-window" algorithm that processes two adjacent levels at a time while moving up. Let the token sequence $T^{(0)}$ at the level $0$ be the original string $x$. So we say that the token sequence $T^{(0)}$ has just been created, but not updated yet. Suppose, more generally, that the token sequence $T^{(i)}$ has been created, but not updated yet.⁶ The token sequence $T^{(i+1)}$ is created and the token sequence $T^{(i)}$ is updated as follows.

⁶The notion of the token sequence $T^{(i)}$ is used here abusively to denote both the unupdated token sequence and the updated token sequence. The original token sequence $T^{(i)}$ defined in Section II-A is now represented by the updated token sequence.

Creating $T^{(i+1)}$: From left to right, we partition the sequence $T^{(i)}$ into blocks of length $r$, so we will have $\lfloor |T^{(i)}|/r \rfloor$ such blocks. Each block will be replaced with a token from an abstract token alphabet, so that distinct blocks (when treated as supersymbols) are assigned distinct tokens. In practice, we use the set of natural numbers as the token alphabet. All newly created tokens will form the token sequence $T^{(i+1)}$ corresponding to the next, $(i+1)$th, level. For later use, we note that in the corresponding tree representation of the grammar, the node corresponding to a token in the sequence $T^{(i+1)}$ will be a parent of the nodes corresponding to those $r$ tokens that form the block replaced by it.

Updating $T^{(i)}$: Traverse the sequence $T^{(i)}$ from left to right. Those tokens in the sequence $T^{(i)}$ whose parental tokens in the sequence $T^{(i+1)}$ have already appeared are deleted (in reality, they will not be deleted, but replaced with the integer "$0$"). While traversing the sequence $T^{(i)}$, we rename the token alphabet according to the natural ordering defined in Section II. Finally, every first (leftmost) appearance of a distinct token is replaced with the symbol "$s$."

Note 1: When having created the level $i$, the algorithm has actually built the MPM($r$, $i$) grammar. Therefore, we say that to get the MPM($r$, $I$) grammar for any given $r$ and $I$, the Bottom-to-Top algorithm sequentially creates the grammars MPM($r$, $0$), MPM($r$, $1$), $\ldots$, MPM($r$, $I$), and for any natural numbers $i < j$, the grammar MPM($r$, $j$) is the natural extension of the grammar MPM($r$, $i$) in the sense that they have identical token sequences at the common levels.

Remark 4: The described implementation of the MPM($r$, $I$) grammar transform is "off-line." That is, the entire sequence $x$ has to be observed before it can be processed. However, the MPM($r$, $I$) grammar transform can also be implemented [19] sequentially, with linear computational complexity. Based on the sequentially implemented MPM($r$, $I$) grammar transform, a sequential MPM code will have, similarly to the 1978 Lempel-Ziv code, a delay between the encoder input and the decoder output.

A. Actual Implementation of the Bottom-to-Top Algorithm

In actual implementation, every token sequence $T^{(i)}$ will be represented by an array $T^{(i)}[1 \ldots \lfloor n/r^i \rfloor]$. To avoid explicit string comparison, both the creation process and the updating process are helped with some lists $L^{(i)}_m$ that are defined for every distinct token symbol $m$ in the token sequence $T^{(i)}$. All the lists will be constructed so that every token symbol will belong to the set of integers, so we simply write $L^{(i)}_m$, where $m = 1, 2, \ldots$. Recall that in Section II-B we denoted the number of all distinct symbols in the token sequence $T^{(i)}$ by $f^s_i$. According to its definition, the list $L^{(i)}_m$ will consist of the integer $m$ at the first position, followed by the indexes of all those array elements which are equal to $m$. Let $S^{(i)}$ be the set $\{L^{(i)}_1, \ldots, L^{(i)}_{f^s_i}\}$. Clearly, the set $S^{(i)}$ uniquely specifies the array $T^{(i)}$. Consequently, if we can construct $S^{(i+1)}$, then we will be able to obtain the array $T^{(i+1)}$.

Bottom level creation: Assume there is a one-to-one mapping of the alphabet $\mathcal{A}$ to the set of integers $\{1, \ldots, |\mathcal{A}|\}$; denote this mapping by $\phi$. The algorithm initializes the array $T^{(0)}[1 \ldots n]$ and fills it with integers as follows: $T^{(0)}[j] = \phi(x_j)$, $j = 1, \ldots, n$. Originally, the lists in the set $S^{(0)}$ are initialized as follows: $L^{(0)}_m = \{m\}$, $m = 1, \ldots, |\mathcal{A}|$.

The following piece of pseudocode appends the indexes of the elements of the array $T^{(0)}$ to the lists $L^{(0)}_1, \ldots, L^{(0)}_{|\mathcal{A}|}$:

Begin
  Create the array Table[1 ... |A|] of pointers to the lists L_1^(0), ..., L_|A|^(0);
  For every j, j = 1 ... n:
    append index j to the list pointed by the pointer Table[T^(0)[j]];
End.
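Before turning to the detailed linked-list pseudocode for the higher levels, here is a simplified, list-free Python illustration (ours) of one creating/updating pass in the bisection case; it zeroes the children of repeated pairs as described above, but omits the renaming to natural ordering and the "$s$" marking.

    def create_and_update(T):
        """Create the next level from T (pairing adjacent tokens) and update T:
        slots whose parent pair repeats an earlier pair are zeroed out."""
        pairs = {}                  # (left, right) pair -> parent token id
        parent = []                 # next level, before its own update
        keep = [False] * len(T)
        for j in range(0, len(T) - 1, 2):
            key = (T[j], T[j + 1])
            if key in pairs:                    # repeated block: children deleted
                parent.append(pairs[key])
            else:                               # first appearance: children kept
                pairs[key] = len(pairs) + 1
                parent.append(pairs[key])
                keep[j] = keep[j + 1] = True
        updated = [t if keep[j] else 0 for j, t in enumerate(T)]
        return parent, updated

    # Bottom level: map the alphabet to integers, then climb two levels.
    x = 'abab' * 4
    T0 = [{'a': 1, 'b': 2}[c] for c in x]
    T1, T0u = create_and_update(T0)
    T2, T1u = create_and_update(T1)
    print(T0u, T1u, T2)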

Having created $S^{(0)}$ and $T^{(0)}$, we say that the level $0$ has been created, but not yet updated.

Creating the level $i+1$ and updating the level $i$: Now suppose that the levels $0, \ldots, i-1$ have been created and updated and that the level $i$ has been created, but not yet updated. That is, the set $S^{(i)}$ and the array $T^{(i)}$ both have just been created. To simplify our discussion, let us first look at the case of $r = 2$; see the remarks below for the general case. The pseudocode shown below creates the level $i+1$ and updates the $i$th level for the case of $r = 2$, for which the corresponding MPM grammar transform is called a bisection grammar transform. As the set $S^{(i+1)}$ is being created, its size grows from zero to $f^s_{i+1}$, and the current size is stored in the variable $s(i+1)$.

Begin
1   Set s(i+1) := 0;
2   Initialize the array T^(i+1)[1 ... floor(n/2^(i+1))] of integers;
3   initialize the array INDICATOR[1 ... f_i^s] of pointers to Null;
4   For m = 1 to f_i^s do
      skip the first element of the list L_m^(i),
5     traverse it from left to right and do
6     For every odd index j <= floor(n/2^i) in the list L_m^(i),
7       if INDICATOR[T^(i)[j+1]] = Null, then
8         s(i+1) := s(i+1) + 1; create the new
9         list L_s(i+1)^(i+1) = {s(i+1), (j+1)/2}
10        and append it to the set S^(i+1);
11        make the pointer INDICATOR[T^(i)[j+1]] point to this list L_s(i+1)^(i+1);
12        T^(i+1)[(j+1)/2] := s(i+1) (later this element will be replaced by the symbol "s");
        else
13        append the index (j+1)/2 to the list pointed by the pointer INDICATOR[T^(i)[j+1]] and
14        replace the array's element T^(i+1)[(j+1)/2] by the first member of this list;
15        T^(i)[j] := 0 and T^(i)[j+1] := 0 (i.e., delete elements T^(i)[j] and T^(i)[j+1]);
      skip the first element of the list L_m^(i),
16    traverse it from left to right and do
17    For every odd index j <= floor(n/2^i) in the list L_m^(i),
18      INDICATOR[T^(i)[j+1]] := Null.
19  Destroy the array INDICATOR[1 ... f_i^s] and release the memory allocated for it.
20  Destroy the set S^(i) and all the lists L_1^(i), ..., L_(f_i^s)^(i), and release the corresponding allocated memory.
21  Traverse the array T^(i)[...], skip all zero elements, rename the token alphabet according to the natural ordering, and replace the first appearance of each distinct token by the symbol "s." This modified array T^(i)[...] is the algorithm output at the level i.
End.

Fig. 4. Converting the $r$-ary tree ($r = 5$) into binary.

Repeat the above procedure for $i = 0, 1, \ldots, I-1$. Since the last value of $i$ passed to this procedure is $I-1$, the level $I$ will not be updated, but just created and renamed.

Remark 5: The renaming of the token symbols at level $i$ is done while this level is being traversed from left to right. Consequently, the time required to rename the tokens at level $i$ is linear in the length of level $i$.

Remark 6: Just to avoid confusion, we have not mentioned earlier that the set $S^{(i)}$ itself can be implemented as an array of pointers.

Remark 7: We call the reader's attention to the fact that although every block of tokens can be associated with a supersymbol, we do not, however, explicitly compare token strings when determining identical supersymbols. Such explicit string comparison would significantly increase the computational complexity. Instead of comparing blocks of tokens as strings, we sequentially compare single tokens forming those blocks. Moreover, in our actual linked-list implementation there is no explicit integer multiplication, division, or summation except incrementing by $1$ and integer bitwise shift by $1$ (division or multiplication by two).

Remark 8: The pseudocode for $r = 2$ can be easily modified to handle an arbitrary $r$. We just need to convert the $r$-ary tree into a binary ($r = 2$) tree by inserting some intermediate levels, as shown in Fig. 4. One may see that the number of added intermediate binary nodes is no more than the number of the nodes (tokens) at the corresponding level. This property will be used later to establish the linear complexity of the Bottom-to-Top algorithm for an arbitrary $r$.

Fig. 5. Grammar MPM(2, 0).
Fig. 6. Intermediate grammar.
Fig. 7. Intermediate grammar.
Fig. 8. Grammar MPM(2, 1).

Example 1: Let the string $x$ be the same as that shown in Fig. 3. The Bottom-to-Top algorithm sequentially builds the MPM($2, 0$), MPM($2, 1$), MPM($2, 2$), and MPM($2, 3$) grammars, where the final MPM($2, 3$) grammar is the same as the grammar shown in Fig. 3. The token sequences representing the grammars are denoted by $T^{(i)}$. For illustrative purposes, each first-appearing distinct token is not replaced by the symbol "$s$," but just underlined. The algorithm begins with the MPM($2, 0$) grammar shown in Fig. 5, which consists of only one token sequence $T^{(0)}$. First, the algorithm produces the MPM($2, 1$) grammar by creating the sequence $T^{(1)}$ and updating the sequence $T^{(0)}$.

Below, we illustrate the process of creating the level $1$ and updating the level $0$ in detail; all other levels are processed similarly. To create $T^{(1)}$ and update $T^{(0)}$, the algorithm does three passes. Let us think of the sequences $T^{(0)}$ and $T^{(1)}$ as arrays. Recall that for each even number $j$, the token $T^{(1)}[j/2]$ is called a "parent" of the tokens $T^{(0)}[j-1]$ and $T^{(0)}[j]$. At the beginning, the counter $s(1)$ is set to zero and the elements of $T^{(1)}$ are not yet defined.

Pass 1: From left to right, the algorithm visits those even elements of $T^{(0)}$ (accentuated by a check mark in Fig. 6) whose immediate left neighbor is the token $1$. If a token pair being visited has already appeared earlier, the parent of this pair is set equal to the parent of its first appearance. Otherwise, the counter $s(1)$ is incremented by $1$, and the parent of the pair being visited is set to $s(1)$ and underlined. The resulting grammar is shown in Fig. 6.

Pass 2: From left to right, the algorithm visits those even elements of $T^{(0)}$ (accentuated by a check mark in Fig. 7) whose immediate left neighbor is the token $2$, and proceeds exactly as in Pass 1. The resulting grammar is shown in Fig. 7.

Pass 3: From left to right, the algorithm traverses $T^{(1)}$ and renames tokens to preserve the natural ordering. For this particular sequence, the algorithm switches tokens 2 and 3. At the same time, the algorithm sets to zero (i.e., deletes) those tokens at the level $0$ whose parents are not underlined. The resulting grammar is shown in Fig. 8.

In a similar manner, by creating the sequence $T^{(2)}$ and updating the sequence $T^{(1)}$, the algorithm produces the MPM($2, 2$) grammar shown in Fig. 9. Analogously, the algorithm produces the MPM($2, 3$) grammar, shown in Fig. 10, by creating the sequence $T^{(3)}$ and updating the sequence $T^{(2)}$. By replacing the underlined symbols with the symbol "$s$," we get the resulting output grammar (III.1).

Fig. 9. Grammar MPM(2, 2).
Fig. 10. Grammar MPM(2, 3).

Before encoding the grammar sequences (III.1), the algorithm removes the leftmost symbol "$s$" in each $T^{(i)}$, $i \ge 1$. The resulting sequences $\tilde T^{(i)}$ are then encoded by the arithmetic encoder.

B. The Complexity of the Bottom-to-Top Algorithm

Following [1], [2], unless specified otherwise, a RAM model with uniform cost criterion,⁷ on which most practical computers are based, is assumed when we measure the time and space complexity of our algorithms. This assumption implies that each basic operation in our pseudocode, such as node creation, deletion, and insertion, as well as node data assigning and reading, takes a constant amount of time. This viewpoint reflects [2] how the pseudocode would be implemented on most actual computers. It also should be noted that, as we discussed in Section III-A, in the Bottom-to-Top algorithm we use only the following explicit arithmetic operations⁸ on integers: assignment, comparison, incrementing by $1$, and integer bitwise shift by $1$ (division or multiplication by two).

To estimate the time complexity of the Bottom-to-Top algorithm, we introduce the following notation for elementary time costs:
• the execution time of node creation, deletion, insertion, and visiting (when traversing a list);
• the time required to read or write an array element;
• the time required to compare two integers;
• the time required to increment an integer by $1$;
• the execution time of integer bitwise shift by $1$.

We consider two cases.

1) Case $r = 2$ (Bisection): The execution time of the bottom level creation is the sum of the amounts of time required to create the array $T^{(0)}$, the lists $L^{(0)}_1, \ldots, L^{(0)}_{|\mathcal{A}|}$, the array $Table$, and the set $S^{(0)}$. Thus, the time required to create the bottom level is linear in $n$.

⁷In complexity analysis, a RAM model with logarithmic cost criterion is also used sometimes; see [1], [2] for discussions regarding the difference and appropriateness of these two cost criteria.
⁸It is possible, however, to eliminate integer arithmetic in our pseudocode almost entirely by introducing auxiliary linked-list structures while keeping linear computational complexity.

TABLE I
EXECUTION TIME FOR THE MAJOR PSEUDOCODE IN SECTION III-A

To calculate the time required to create level $i+1$ and update level $i$, we sum⁹ the running times for the corresponding pseudocode lines, which are listed in Table I. We obtain an expression

⁹With the exception of branching if/else, where we take the maximum.

in which the dominant terms are proportional to the sizes of the linked lists involved, where $|L|$ denotes the size of a linked list $L$. Consequently, we obtain that the Bottom-to-Top algorithm time complexity for $r = 2$ is upper-bounded by a linear function of $n$. Similarly, the space complexity is bounded by a linear function of $n$, where the constants involved do not depend on $n$.

2) Arbitrary $r$: As we already mentioned, the Bottom-to-Top algorithm with an arbitrary $r$ is reduced to the bisection algorithm by converting the $r$-ary tree to the binary tree. The number of added intermediate binary nodes between levels $i$ and $i+1$ is still bounded by the length of the token sequence which has been created, but not yet updated. Therefore, the time complexity of the entire grammar transform is bounded by a linear function of $n$. The space complexity is also linear.

IV. THE CONDITIONAL MPM GRAMMAR TRANSFORM AND THE CONDITIONAL MPM CODE

Having defined the [unconditional] MPM grammar transform, we now extend it to the conditional transform, called the CMPM transform. The input to the conditional grammar transform is the sequence of pairs $(x_1, y_1)(x_2, y_2)\cdots(x_n, y_n)$ from a joint alphabet $\mathcal{A} \times \mathcal{B}$, where $\mathcal{B}$ is the alphabet from which the side information sequence $y$ is drawn. To help the reader understand what the transformed CMPM grammar is, we first explain how it can be constructed from top to bottom, as we did in Section II-A for the unconditional grammar. First, we partition the sequences $x$ and $y$ from left to right into nonoverlapping subsequences of the lengths $k_I r^I, \ldots, k_0 r^0$, where the lengths are obtained from the $r$-ary expansion of the integer $n$ as shown in Section II-A. We denote these subsequences by $x(I), \ldots, x(0)$ and $y(I), \ldots, y(0)$, respectively. Accordingly, the concatenation of $x(I) \cdots x(0)$ gives the sequence $x$, and the concatenation of $y(I) \cdots y(0)$ gives the sequence $y$. The CMPM transform generates the top level of the conditional grammar according to the following three steps.

Step 1: Partition $x(I)$ into blocks of $\mathcal{A}$-symbols of length $r^I$. Analogously, partition $y(I)$ into blocks of $\mathcal{B}$-symbols of length $r^I$.

For brevity, we will call a block of $\mathcal{A}$-symbols an "$x$-block" and a block of $\mathcal{B}$-symbols a "$y$-block."

Step 2: Visit every $y$-block in the sequence from left to right, and label all identical $y$-blocks with the same integer and all distinct $y$-blocks with distinct integers in increasing order, starting with $1$. Denote the label, or $y$-token, corresponding to a $y$-block by $\beta$. For every distinct $y$-token $\beta$, let the conditional subsequence of $x$ corresponding to $\beta$ be the subsequence of $x$-blocks whose matching $y$-blocks are labeled by $\beta$; it can be regarded as the sequence of $x$-blocks conditioned on the $y$-token $\beta$. All conditional subsequences of $x$ are processed independently from each other at Step 3 below.

Step 3: For each distinct $y$-token $\beta$, visit every $x$-block in the corresponding conditional subsequence from left to right and label the first appearance of each distinct $x$-block in this subsequence by a special symbol "$s$." If the same $x$-block appears in the subsequence again, label it by an integer so that all identical $x$-blocks in the subsequence, except for the leftmost one, will be labeled by the same integer, which is just the number of distinct $x$-blocks in the subsequence up to the first appearance of the $x$-block inclusively.

Apply similar notation to levels $I-1, \ldots, 1$. The CMPM transform generates these levels by repeating the following four steps for each level $i$.

Step 1: For every position at level $i+1$ whose $x$-block is labeled by "$s$," partition that $x$-block and the matching $y$-block into subblocks of length $r^i$.

Step 2: Partition the sequence $x(i)$ into $x$-blocks of length $r^i$ and concatenate these $x$-blocks to the $x$-subblocks of length $r^i$ constructed at Step 1 above; this is the concatenated sequence of $x$-blocks at the level $i$. Similarly, partition the sequence $y(i)$ into $y$-blocks of length $r^i$ and concatenate these $y$-blocks to the $y$-subblocks of length $r^i$ constructed at Step 1 above; this is the concatenated sequence of $y$-blocks at the level $i$.

Step 3: Visit every $y$-block in the sequence from left to right, and label all identical $y$-blocks with the same integer and all distinct $y$-blocks with distinct integers in increasing order, starting with $1$. Denote the label, or $y$-token, corresponding to a $y$-block by $\beta$, and define the conditional subsequence corresponding to each distinct $\beta$ as above. All conditional subsequences are processed independently from each other at Step 4 below.

Step 4: For each distinct $y$-token $\beta$, visit every $x$-block in the corresponding conditional subsequence from left to right and label the first appearance of each distinct $x$-block in this subsequence by a special symbol "$s$." If the same $x$-block appears again, label it by an integer so that all identical $x$-blocks, except for the leftmost one, will be labeled by the same integer, which is just the number of distinct $x$-blocks up to the first appearance of the $x$-block inclusively.

For the bottom level, we perform only Steps 1 and 2 described above; instead of performing Steps 3 and 4, we let the bottom-level token sequences be the resulting block sequences themselves, where each $x$-block consists of only one $\mathcal{A}$-symbol and each $y$-block consists of only one $\mathcal{B}$-symbol. Thus, in the CMPM grammar, every level $i$ is represented by a sequence of token pairs.

Fig. 11. Conditional grammar $G_{x|y}$; $n = 31$, $I = 3$, $r = 2$.

Fig. 11 illustrates the above procedure for $n = 31$, $I = 3$, and $r = 2$.

The transformed conditional grammar $G_{x|y}$ for the sequence $x$ with the side information sequence $y$ is very much related to the transformed [unconditional] grammar $G_{(x,y)}$ for the joint sequence $(x, y)$, where the grammar $G_{(x,y)}$ is obtained by applying the MPM grammar transform, described in Section II-A, to the sequence of supersymbols given by $(x_1, y_1)(x_2, y_2)\cdots(x_n, y_n)$. Specifically, the number of token pairs at every level of $G_{x|y}$ is equal to the number of token supersymbols at the same level of $G_{(x,y)}$, and any token pair in the grammar $G_{x|y}$ and the token supersymbol in the same position in the grammar $G_{(x,y)}$ both correspond to the same block of pairs from the joint alphabet $\mathcal{A} \times \mathcal{B}$. If the grammar $G_{(x,y)}$ has the token symbol "$s$" at some position, then the grammar $G_{x|y}$ also has the token symbol "$s$" at the same position. We will use the similarities between the grammars $G_{x|y}$ and $G_{(x,y)}$ later, in the compression performance analysis of the conditional algorithm.

A. Conditional Bottom-to-Top Algorithm

In the previous subsection, we briefly explained how the CMPM grammar can be constructed from top to bottom. Now we present our new conditional Bottom-to-Top algorithm, which is an extension of the Bottom-to-Top algorithm described in Section III. Similarly to the unconditional algorithm, the conditional Bottom-to-Top algorithm processes two adjacent levels at a time while moving up. That is, when the $i$th level is being updated, the $(i+1)$th level is being created. The algorithm will create and use the following structures.

• Arrays $T_x^{(i)}$, corresponding to the token sequences of the unconditional transform of $x$, are an auxiliary structure used in intermediate steps of our algorithm.
• Arrays¹⁰ $T_{x|y}^{(i)}$ and $T_y^{(i)}$ are the output of our conditional grammar transform and represent the conditional CMPM grammar being built.
• Lists $L_{x,m}^{(i)}$ and $L_{y,m}^{(i)}$ of the arrays' indexes are defined as in Section III-A, where $f_{x,i}^s$ is the number of distinct integers in the array $T_x^{(i)}$ and $f_{y,i}^s$ is the number of distinct integers in the array $T_y^{(i)}$.
• Sets $S_x^{(i)} = \{L_{x,1}^{(i)}, \ldots, L_{x,f_{x,i}^s}^{(i)}\}$ and $S_y^{(i)} = \{L_{y,1}^{(i)}, \ldots, L_{y,f_{y,i}^s}^{(i)}\}$ of the lists are defined accordingly.

¹⁰The notations $T_{x|y}^{(i)}[1 \ldots \lfloor n/r^i \rfloor]$ and $T_y^{(i)}[1 \ldots \lfloor n/r^i \rfloor]$ are used here abusively to denote both the unupdated token sequence and the updated token sequence.

Bottom level creation: We denote a mapping from the alphabet $\mathcal{A}$ to the set of integers $\{1, \ldots, |\mathcal{A}|\}$ and a mapping from the alphabet $\mathcal{B}$ to the set of integers $\{1, \ldots, |\mathcal{B}|\}$. The conditional algorithm creates the arrays $T_x^{(0)}$, $T_y^{(0)}$, and $T_{x|y}^{(0)}$ and the sets of lists $S_x^{(0)}$ and $S_y^{(0)}$ in the same way as does the unconditional algorithm, with the only distinction that now $T_x^{(0)}$ does not need to be updated. The lists $L_{x,m}^{(0)}$ and $L_{y,m}^{(0)}$ are created with the help of the auxiliary arrays.

Creating the level $i+1$ and updating the level $i$: Suppose that the level $i$ has just been created. That is, the sets of lists $S_x^{(i)}$ and $S_y^{(i)}$ and the arrays $T_x^{(i)}$, $T_{x|y}^{(i)}$, and $T_y^{(i)}$ have been created, but not yet updated. The algorithm will create $S_x^{(i+1)}$, $S_y^{(i+1)}$, $T_x^{(i+1)}$, $T_y^{(i+1)}$, and $T_{x|y}^{(i+1)}$, and will update $T_{x|y}^{(i)}$ and $T_y^{(i)}$ according to the following steps.

Step 1: Use the set of lists $S_x^{(i)}$ and the array $T_x^{(i)}$ to create the set of lists $S_x^{(i+1)}$ and the array $T_x^{(i+1)}$. Then, destroy the set $S_x^{(i)}$ and the array $T_x^{(i)}$. The piece of pseudocode below implements Step 1 for the case of the bisection grammar ($r = 2$).

Begin
1   Set s(i+1) := 0;
2   Initialize the array Tx^(i+1)[1 ... floor(n/2^(i+1))] of integers;
3   initialize the array INDICATOR[1 ... f_(x,i)^s] of pointers to Null;
4   For m = 1 to f_(x,i)^s do
      skip the first element of the list Lx,m^(i),
5     traverse it from left to right and do
6     For every odd index j <= floor(n/2^i) in the list Lx,m^(i),
7       if INDICATOR[Tx^(i)[j+1]] = Null, then
8         s(i+1) := s(i+1) + 1;
9         Create the new list Lx,s(i+1)^(i+1) = {s(i+1), (j+1)/2}
10        and append it to the set Sx^(i+1);
11        make the pointer INDICATOR[Tx^(i)[j+1]] point to this list Lx,s(i+1)^(i+1);
12        Tx^(i+1)[(j+1)/2] := s(i+1);
        else
13        append the index (j+1)/2 to the list pointed by the pointer INDICATOR[Tx^(i)[j+1]] and
14        replace the array's element Tx^(i+1)[(j+1)/2] by the first member of this list;
      skip the first element of the list Lx,m^(i),
15    traverse it from left to right and do
16    For every odd index j <= floor(n/2^i) in the list Lx,m^(i),
17      INDICATOR[Tx^(i)[j+1]] := Null.
18  Destroy the array INDICATOR[1 ... f_(x,i)^s] and release the corresponding allocated memory.
19  Destroy the set Sx^(i), all the lists Lx,1^(i), ..., Lx,f_(x,i)^s^(i),
20  and the array Tx^(i)[1 ... floor(n/2^i)], and release the corresponding allocated memory.
End.

Step 2: Use the set of lists $S_y^{(i)}$ and the array $T_y^{(i)}$ to create the set of lists $S_y^{(i+1)}$ and the array $T_y^{(i+1)}$ in the exact same way as in Step 1, and then destroy the set $S_y^{(i)}$.

Step 3: Use the set of lists $S_y^{(i+1)}$ and the array $T_x^{(i+1)}$ to create the array $T_{x|y}^{(i+1)}$ and update the arrays $T_{x|y}^{(i)}$ and $T_y^{(i)}$ according to the following piece of pseudocode.

Begin
  Initialize the array Tx|y^(i+1)[1 ... floor(n/2^(i+1))] of integers;
  initialize the array INDICATOR[1 ... f_(x,i+1)^s] of integers to all zeros;
  For m = 1 to f_(y,i+1)^s do
    set s(i+1) := 0;
    skip the first element of the list Ly,m^(i+1), traverse it from left to right and do
    For every index j in the list Ly,m^(i+1),
      if INDICATOR[Tx^(i+1)[j]] = 0, then
        s(i+1) := s(i+1) + 1;
        INDICATOR[Tx^(i+1)[j]] := s(i+1);
        Tx|y^(i+1)[j] := "s";
      else
        Tx|y^(i+1)[j] := INDICATOR[Tx^(i+1)[j]];
        Tx|y^(i)[2j-1] := 0 and Tx|y^(i)[2j] := 0;
        Ty^(i)[2j-1] := 0 and Ty^(i)[2j] := 0
        (i.e., delete tokens Tx|y^(i)[2j-1], Tx|y^(i)[2j], Ty^(i)[2j-1], and Ty^(i)[2j]);
    traverse the list Ly,m^(i+1) from left to right and do
    For every index j in the list Ly,m^(i+1),
      INDICATOR[Tx^(i+1)[j]] := 0;
  Destroy the array INDICATOR[1 ... f_(x,i+1)^s] and release the corresponding allocated memory.
End.

The arrays $T_{x|y}^{(i)}$ and $T_y^{(i)}$ are the output of the grammar transform for level $i$. Repeat the above three steps for $i = 0, 1, \ldots, I-1$. Since the last value of $i$ passed to this procedure is $I-1$, the level $I$ will not be updated, but just created.

Remark 9: Although not natural, the alphabet ordering in every array $T_{x|y}^{(i)}$ is unique in the sense that the decoder will construct exactly the same arrays as does the encoder. Therefore, we do not have to rename the integers in the array $T_{x|y}^{(i)}$ according to the natural ordering. On the other hand, since the indices in each list $L_{y,m}^{(i+1)}$ are placed (at Step 2) in strictly increasing order, Step 3 of the algorithm implicitly provides the conditional natural alphabet ordering of the array $T_{x|y}^{(i+1)}$.

Remark 10: The described implementation of the CMPM($r$, $I$) transform is "off-line." Similarly to its unconditional prototype, the CMPM($r$, $I$) transform itself can be implemented sequentially with linear computational complexity.

Recall that the sequences of token pairs can be obtained from the sequences of pairs $(T_{x|y}^{(i)}[j], T_y^{(i)}[j])$ by deleting pairs of the form $(0, 0)$. For example, for the conditional grammar $G_{x|y}$ shown in Fig. 11, the conditional Bottom-to-Top algorithm will produce the following token pair sequences:

(IV.1)

B. Complexity of the Conditional Bottom-to-Top Algorithm

To estimate the computational complexity of the conditional Bottom-to-Top algorithm, we follow the approach we took for the unconditional algorithm analysis and use the same notation for the time costs. First, we consider the case $r = 2$ (bisection). The execution time of the bottom level creation is the sum of the amounts of time required to create the arrays $T_x^{(0)}$, $T_y^{(0)}$, and $T_{x|y}^{(0)}$, the sets of lists $S_x^{(0)}$ and $S_y^{(0)}$, and the corresponding auxiliary structures. Thus, the time required to create the bottom level is linear in $n$.

TABLE II
EXECUTION TIME FOR STEP 1 OF THE MAJOR PSEUDOCODE IN SECTION IV-A

To calculate the execution time of the pseudocode at Step 1 of creating level $i+1$ and updating level $i$, we sum all the time entries in Table II; we obtain a bound proportional to the total size of the lists involved. In a very similar manner, we can see that the execution time of the pseudocode at Steps 2 and 3 of creating level $i+1$ and updating level $i$ is upper-bounded by an expression of the same form. As a result, we obtain that the conditional Bottom-to-Top algorithm time complexity for $r = 2$ is upper-bounded by $c \cdot n$, where $c$ is a constant. It is easy to see that the storage complexity of the conditional Bottom-to-Top algorithm is linear with respect to the sum of the lengths of $x$ and $y$. In a manner similar to that discussed in Remark 8, the conditional Bottom-to-Top algorithm with an arbitrary $r$ can be reduced to the conditional bisection ($r = 2$) algorithm by converting the $r$-ary tree to the binary tree. Therefore, the time and storage complexities of the conditional Bottom-to-Top algorithm with an arbitrary $r$ are also linear with respect to the sum of the lengths of $x$ and $y$.

C. Arithmetic Encoder

Since every conditional subsequence begins with the symbol "$s$," the full sequence can be easily recovered from its reduced version with those first appearances removed. Therefore, the arithmetic encoder encodes the reduced sequences instead of the full ones. Let us denote the token sequences constituting the token pairs in the conditional grammar as follows: (IV.2). Let $\tilde T_{x|y}^{(i)}$ be the sequence obtained from $T_{x|y}^{(i)}$ by removing, within each conditional subsequence, the first (leftmost) appearance of the pair. Continuing our example, for every distinct $y$-token $\beta$ in (IV.1) we will have the corresponding conditional subsequences, one of which is empty.

Since the conditional grammar $G_{x|y}$ can be transformed back into the original string $x$ only in the top-to-bottom order, the arithmetic encoder encodes the grammar in this, top-to-bottom, order. Thus, the sequences $\tilde T_{x|y}^{(I)}, \ldots, \tilde T_{x|y}^{(0)}$ are encoded in the indicated order by a conditional arithmetic encoder. Specifically, the arithmetic encoder encodes every $x$-token in the sequence $\tilde T_{x|y}^{(i)}$ conditioning on the corresponding $y$-token $\beta$; therefore, the conditional subsequences are encoded independently for every distinct token $\beta$. The encoder restarts at the beginning of each next sequence $\tilde T_{x|y}^{(i)}$ and concatenates the binary codewords for each sequence in the indicated order into one single binary codeword. For $i = I, \ldots, 1$, the $x$-token sequence $\tilde T_{x|y}^{(i)}$ is encoded as follows. We associate each distinct pair of symbols $(\beta, \alpha)$, where $\beta$ is a $y$-token and $\alpha$ belongs to the set of all distinct $x$-token symbols in the conditional subsequence corresponding to $\beta$, with a counter $c(\beta, \alpha)$. Initially, $c(\beta, \alpha)$ is set to $1$ if $\alpha \in \{s, 1\}$ and to $0$ otherwise. The initial¹¹ alphabet used by the arithmetic code is $\{s, 1\}$. We encode each token in the sequence $\tilde T_{x|y}^{(i)}$ and update the related counters according to the following steps.

Step 1: Encode $\alpha$ by using the probability

$$\frac{c(\beta, \alpha)}{\sum_{\alpha'} c(\beta, \alpha')}$$

where the summation is taken over $\alpha' \in \{s, 1, \ldots, m+1\}$, and $m$ is the number of times that the pair $(\beta, s)$ has occurred before the current position. Note that the alphabet used at this point by the arithmetic code is $\{s, 1, \ldots, m+1\}$.

Step 2: Increase the counter $c(\beta, \alpha)$ by $1$.

Step 3: If $\alpha = s$, increase the counter $c(\beta, m+2)$ from $0$ to $1$, where $m$ is defined in Step 1.

Repeat the above procedure until the entire sequence $\tilde T_{x|y}^{(i)}$ is encoded.

To encode the bottom level array $\tilde T_{x|y}^{(0)}$, we use a two-step conditional adaptive arithmetic coding with the fixed alphabet $\mathcal{A}$, described next.

¹¹The initial alphabet includes the symbol "$1$" due to the fact of removing the first appearance of the symbol "$s$" from each conditional subsequence of tokens.
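Before turning to the bottom level, note that the per-$\beta$ counter bookkeeping of Steps 1-3 above can be sketched compactly. The following Python fragment is ours and computes only the ideal coding probabilities; it is not the multilevel arithmetic coder of [19].

    from fractions import Fraction

    class ConditionalModel:
        """Each y-token beta keeps its own growing alphabet of x-tokens,
        maintained per conditional subsequence x|beta."""
        def __init__(self):
            self.counts = {}   # beta -> {x-token: counter}
            self.num_s = {}    # beta -> number of 's' seen under beta

        def prob(self, beta, tok):
            c = self.counts.setdefault(beta, {'s': 1, 1: 1})
            return Fraction(c[tok], sum(c.values()))

        def update(self, beta, tok):   # Steps 2-3 for the pair (tok | beta)
            c = self.counts[beta]
            c[tok] += 1
            if tok == 's':
                self.num_s[beta] = self.num_s.get(beta, 0) + 1
                c[self.num_s[beta] + 1] = 1    # admit one new integer token

    m = ConditionalModel()
    for beta, tok in [(1, 's'), (2, 's'), (1, 1), (2, 's'), (2, 2)]:
        p = m.prob(beta, tok); m.update(beta, tok)
        print(beta, tok, p)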


Step 1: Denoting the pair at the current position by $(\beta, \alpha)$, encode $\alpha$ by using the probability

$$\frac{c(\beta, \alpha)}{\sum_{\alpha' \in \mathcal{A}} c(\beta, \alpha')} \qquad \text{(bottom level)}.$$

Step 2: Increase the counter $c(\beta, \alpha)$ by $1$.

Repeat the above procedure until the entire sequence $\tilde T_{x|y}^{(0)}$ is encoded.

Example 2: The product of the probabilities used in the arithmetic coding of the entire conditional grammar represented by the token sequences (IV.2) is given by multiplying the probabilities from the steps above over all encoded tokens.

Remark 11: To decode the binary codeword, the decoder must know the side information sequence $y$, which in turn implies the knowledge of the length $n$ of the sequence. Therefore, the decoder will be able to compute the digits $k_i$ in the $r$-ary expansion of $n$.

Remark 12: Similarly to that of an MPM code, the arithmetic coding part of a CMPM code has linear computational complexity. This is achieved by using the multilevel arithmetic coding algorithm [19] to encode the conditional token subsequences.

D. Compression Rate Related to the Empirical Entropy of the Conditional Grammar

For each level $i$, let $n_{\beta\alpha}$ denote the number of occurrences of the distinct pair $(\beta, \alpha)$, and let $n_\beta$ denote the number of occurrences of the $y$-token $\beta$; the conditional subsequence corresponding to $\beta$ is the subsequence formed by the pairs of the form $(\beta, \cdot)$. Let us agree that $0 \log 0 = 0$. The unnormalized conditional empirical entropy of the sequence $T_{x|y}^{(i)}$ given $T_y^{(i)}$ is then defined, analogously to Section II-E, as (IV.3).

Definition IV.1: We define the conditional entropy $H(G_{x|y})$ of the CMPM grammar $G_{x|y}$ as the sum, over all levels, of the unnormalized conditional empirical entropies of the sequences $T_{x|y}^{(i)}$ given $T_y^{(i)}$.

Lemma IV.1: Let $L(x \mid y)$ be the size of the output binary codeword for the input sequence $x$. Then $L(x \mid y)$ is upper-bounded by $H(G_{x|y})$ plus terms proportional to the total number of token pairs.

Proof: If we fix a symbol $\beta$, then, as it was shown in (II.13) and (II.14), the number of bits required to encode the conditional subsequence corresponding to $\beta$ is bounded as in the unconditional case, where $n_{\beta\alpha}$ is the number of occurrences of the symbol $\alpha$ in the subsequence and the length of the subsequence plays the role of the sequence length. Then, the number of bits required to encode $\tilde T_{x|y}^{(i)}$ is obtained by summing over all distinct $\beta$. Analogously, the bottom level is handled via the fixed-alphabet bound. Finally, the total number of bits in the output binary codeword is bounded as claimed by summing over all levels.

E. Redundancy of the CMPM Code

Here we extend the finite-state arithmetic encoder model, which we used in Section II-F, by conditioning on side information sequences. Let $q$ be a transitional conditional probability function satisfying the normalization condition for any context in $S$ and any side information symbols. Random transitions between contexts from the set $S$ are allowed. For any sequences $x$ and $y$, the compression rate in bits per letter resulting from the arithmetic encoder with transition probability $q$ is given by the corresponding product of conditional probabilities, where $s_0$ is the initial context and prescribed initial and final segments of the side information sequence are used; for the rest of the paper, we assume these segments fixed.

Remark 13: The conditional arithmetic encoder, mentioned above, is a finite-state machine which encodes each symbol conditionally based on the context and the side information . Thus, the parameter defines the subsequence . sliding window of size The following definition extends the notion of the -context [unconditional] empirical entropy to the -context conditional

E. Redundancy of the CMPM Code

Here we extend the finite-state arithmetic encoder model, which we used in Section II-F, by conditioning on side information sequences. Let $p$ be a transitional conditional probability function satisfying the natural normalization constraints, defined on the context set obtained by removing the excluded pairs for any context and any symbol; random transitions between contexts from this set are allowed. For any sequences $x$ and $y$, the compression rate in bits per letter resulting from the arithmetic encoder with transition probability $p$ is given by
$$-\frac{1}{n} \log_2 \prod_{i=1}^{n} p\bigl(x_i \,\big|\, s_i,\, y_{i-w}^{\,i+w}\bigr),$$
where $s_1$ is the initial context, and where prescribed initial and final segments of the side information sequence are assumed known. For the rest of the paper, we assume these segments fixed.

Remark 13: The conditional arithmetic encoder mentioned above is a finite-state machine which encodes each symbol conditionally, based on the context and on a subsequence of the side information; thus, the parameter $w$ defines a sliding window of size $2w+1$ on the side information sequence. The following definition extends the notion of the $k$-context (unconditional) empirical entropy to the $k$-context conditional empirical entropy with a sliding window of size $2w+1$.

Definition IV.2: The $k$-context conditional empirical entropy of $x$ given $y$, with a sliding window of size $2w+1$, is defined as
$$H_k(x|y) = -\frac{1}{n} \log_2 \max_{p} \prod_{i=1}^{n} p\bigl(x_i \,\big|\, s_i,\, y_{i-w}^{\,i+w}\bigr), \eqno(\mathrm{IV.4})$$
where the maximization is taken over all transitional conditional probability functions from the class just described, with contexts of length $k$.

Generally speaking, $H_k(x|y)$ depends on the window parameter $w$; nonetheless, it satisfies the bounds needed below, as is easy to see.
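To make Definition IV.2 concrete, here is a small sketch that computes a plug-in (empirical-frequency) variant of the $k$-context conditional empirical entropy. It replaces the maximization over all transitional probability functions with the empirical estimator, which attains that maximum for a fixed context map; the windowing and boundary conventions are assumptions of this sketch.

```python
from collections import Counter
from math import log2

def cond_empirical_entropy(x, y, k=1, w=0):
    """Plug-in k-context conditional empirical entropy of x given y,
    in bits per letter. The context of x[i] is the pair
    (x[i-k:i], y[i-w:i+w+1]); boundary positions are skipped.
    Empirical-frequency special case, not the paper's full class
    with random transitions between contexts."""
    n = len(x)
    joint, ctx = Counter(), Counter()
    positions = range(max(k, w), n - w)
    for i in positions:
        c = (x[i - k:i], y[i - w:i + w + 1])
        joint[(c, x[i])] += 1   # count (context, next symbol)
        ctx[c] += 1             # count context alone
    total = len(positions)
    bits = 0.0
    for (c, a), m in joint.items():
        bits += -(m / total) * log2(m / ctx[c])
    return bits

x = "abababababab"
y = "cccccccddddd"
print(round(cond_empirical_entropy(x, y, k=1, w=0), 3))  # 0.0: x is
# deterministic given its previous character
```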

Definition IV.3: We define the worst case redundancy (per sample) of the algorithm against the $k$-context conditional empirical entropy as the largest difference, over all individual sequence pairs of length $n$, between the compression rate of the CMPM code and the $k$-context conditional empirical entropy per sample; denote it by $R_n$.

Theorem IV.1:
$$R_n \;\le\; \frac{d}{\log n}, \eqno(\mathrm{IV.5})$$
where $d$ is a constant whose value is obtained at the end of the proof.

Proof: Fix $x$ and $y$. Let $p^*$ be a transitional probability function which maximizes (IV.4); $p^*$, of course, depends on $x$ and $y$. For any context and any token, define

(IV.6)

For each context, we normalize the quantities in (IV.6) so that a probability distribution over the token alphabet is obtained for any context; consequently, the normalized quantities form conditional probability distributions. Now we are ready to extend our reasoning in (II.21)–(II.25) to the conditional case. Let us agree that, when used as arguments of the functions above, the records $\beta$ and $s$ denote the $x$-sequence represented by the token $\beta$ and the $y$-sequence represented by the token $s$, respectively. Then

(IV.7)

and

(IV.8)

The information inequality [3, Theorem 2.6.3, p. 26], together with the definition of the traditional conditional entropy (see also [3, pp. 23, 27]), implies

(IV.9)

where the minimum is over all conditional probability distributions on the token alphabet, and the normalized function defined above can be viewed as one such distribution. Combining (IV.7)–(IV.9) together, we get

(IV.10)


Combining inequality (IV.10) with Lemma IV.1 and utilizing relations (II.7) and (II.8), we get

(IV.11)

Since the relevant counts for our conditional grammar are identical to those for the unconditional grammar of the joint sequence formed by pairing $x$ with $y$, we have, for all sufficiently large $n$,

(IV.12)

Substituting (IV.12) into (IV.11) yields an upper bound on the redundancy; therefore,

(IV.13)

Finally, we obtain the value of the constant $d$ from the right-hand side of the above expression for sufficiently large $n$.

Corollary 1: For any stationary, ergodic source pair $(X, Y)$ with a finite alphabet,
$$\lim_{n \to \infty} \frac{L(x|y)}{n} = H(X|Y) \quad \text{with probability one},$$
and the normalized grammar entropy converges to the same limit with probability one as $n \to \infty$, where $H(X|Y)$ denotes the conditional entropy rate of $X$ given $Y$, defined as
$$H(X|Y) = \lim_{n \to \infty} \frac{1}{n}\, H(X_1 \cdots X_n \mid Y_1 \cdots Y_n).$$

Proof: Fix $k \ge 1$ and a window half-width $w$. For any pair of words of the appropriate lengths, let its frequency be the number of its occurrences as a conglomerate pattern in the joint sequence $(x, y)$, computed in a cyclic manner (indices reduced modulo $n$), where, as a convention, $0 \log 0 = 0$ and the corresponding terms vanish whenever the frequency is zero. We define the $k$th-order conditional empirical entropy with sliding window of size $2w+1$ as the conditional entropy of this empirical distribution, the conditioning being on the $k$ most recent past $x$-characters and the associated window of $y$-characters. Since we can always define a context in (IV.4) at time $i$ as the $k$ most recent past characters, there always exists a probability function such that

(IV.14)

Thus, from Theorem IV.1, the compression rate is bounded by the $k$th-order conditional empirical entropy plus $d/\log n$. Letting $n \to \infty$ and applying the ergodic theorem, we obtain the corresponding bound with probability one. Letting $k \to \infty$ and $w \to \infty$ in the above inequality yields, with probability one,

(IV.15)

On the other hand, from source coding theory [5], [22], we have, with probability one,

(IV.16)

Inequalities (IV.15) and (IV.16) together imply the assertion of the corollary with probability one.
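A minimal restatement of the closing sandwich argument, in the notation used above (with $L(x|y)$ denoting the codeword length; the precise forms of (IV.15) and (IV.16) are assumed here, not quoted):

```latex
\[
  \limsup_{n\to\infty} \frac{L(x|y)}{n} \;\le\; H(X|Y)
    \quad \text{a.s.} \qquad (\mathrm{IV.15})
\]
\[
  \liminf_{n\to\infty} \frac{L(x|y)}{n} \;\ge\; H(X|Y)
    \quad \text{a.s.} \qquad (\mathrm{IV.16})
\]
\[
  \Longrightarrow\quad
  \lim_{n\to\infty} \frac{L(x|y)}{n} \;=\; H(X|Y)
    \quad \text{with probability one.}
\]
```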



V. SIMULATION RESULTS

In this section, we present simulation results on random binary sequences. We demonstrate the asymptotic convergence of the compression rate and of the normalized grammar entropy to the conditional entropy rate $H(X|Y)$ for so-called compound first-order Markov binary source pairs $(x, y)$. In such a pair, the sequence $x$ is obtained by transmitting a first-order Markov source $y$ through a memoryless binary-symmetric channel. We generate $y$ and $x$ as follows:
$$y_i = y_{i-1} \oplus v_i, \qquad x_i = y_i \oplus u_i,$$
where $u$ is independent and identically distributed (i.i.d.) with the probability of symbol $1$ being $1-p$, $v$ is i.i.d. with the probability of symbol $1$ being $1-q$, and $u$ and $v$ are jointly independent. The notation $\oplus$ stands for modulo-$2$ addition, and the initial distribution of $y_0$ is uniform. Thus, the transition matrix of the Markov source $y$ is
$$\begin{pmatrix} q & 1-q \\ 1-q & q \end{pmatrix},$$
and the conditional entropy rate is equal to the binary entropy function $H(\cdot)$ evaluated at the bias $p$:
$$H(X|Y) = H(p) = -p \log_2 p - (1-p)\log_2 (1-p).$$

Tables III–VI list simulation results for source pairs with different generating parameters $p$ and $q$. Parameter $r$ (the number of the grammar levels) is set to $\lfloor \log_2 n \rfloor$.
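As a hedged illustration of the source model just described (the parameter roles follow the tables, with $H(p)$ the target conditional entropy rate and $q$ the Markov bias; the paper's exact conventions may differ), one can generate test pairs as follows:

```python
import random
from math import log2

def binary_entropy(p):
    """H(p) = -p log2 p - (1-p) log2 (1-p)."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def generate_pair(n, p, q, seed=0):
    """Compound first-order Markov source pair: y is first-order Markov
    (it repeats its previous bit with probability q), and x is y observed
    through a binary-symmetric channel that flips each bit with
    probability 1-p, so that H(X|Y) = H(1-p) = H(p)."""
    rng = random.Random(seed)
    y, x = [], []
    prev = rng.randint(0, 1)                 # uniform initial distribution
    for _ in range(n):
        v = 0 if rng.random() < q else 1     # P(v = 0) = q
        yi = prev ^ v
        u = 0 if rng.random() < p else 1     # P(u = 1) = 1 - p
        y.append(yi)
        x.append(yi ^ u)
        prev = yi
    return x, y

x, y = generate_pair(10**5, p=0.7, q=0.6, seed=1)
print(round(binary_entropy(0.7), 4))  # target H(X|Y) = 0.8813
```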

TABLE III: RESULTS FOR A SOURCE PAIR WITH $p = 0.7$, $q = 0.6$, AND $H(X|Y) = H(0.7) = 0.8813$

TABLE IV: RESULTS FOR A SOURCE PAIR WITH $p = 0.8$, $q = 0.6$, AND $H(X|Y) = H(0.8) = 0.7219$

TABLE V: RESULTS FOR A SOURCE PAIR WITH $p = 0.9$, $q = 0.6$, AND $H(X|Y) = H(0.9) = 0.4690$

TABLE VI: RESULTS FOR A SOURCE PAIR WITH $p = 0.95$, $q = 0.8$, AND $H(X|Y) = H(0.95) = 0.2864$

Recall that the normalized grammar entropy is the conditional entropy of the CMPM grammar (Definition IV.1) normalized by the sequence length $n$. Entropy and rates are expressed in bits per letter. For all the tables, as the length $n$ of the sequences increases, we can clearly see the following two trends:

1) the compression rate decreases but is always above the conditional entropy rate $H(X|Y)$ of the source pair;
2) the normalized grammar entropy increases but is always below $H(X|Y)$.

The first trend simply illustrates that the compression rate approaches the conditional entropy rate from above. Similarly, the second trend indicates the convergence of the normalized grammar entropy to $H(X|Y)$; that the convergence is from below is due to the fact that certain pairs are ignored in the calculation of the grammar entropy.

REFERENCES

[1] A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer Algorithms. Reading, MA: Addison-Wesley, 1974.
[2] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms. Cambridge, MA: MIT Press, 1997.
[3] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1993.
[4] M. Li, X. Chen, J. H. Badger, S. Kwong, P. Kearney, and H. Zhang, “An information-based sequence distance and its application to whole genome phylogeny,” preprint.
[5] J. C. Kieffer, “Sample converses in source coding theory,” IEEE Trans. Inform. Theory, vol. 37, pp. 263–268, Mar. 1991.
[6] R. Stites and J. Kieffer, “Resolution scalable lossless progressive image coding via conditional quadrisection,” in Proc. ICIP 2000, Vancouver, BC, Canada, 2000.
[7] J. C. Kieffer and E.-H. Yang, “Grammar-based codes: A new class of universal lossless source codes,” IEEE Trans. Inform. Theory, vol. 46, pp. 737–754, May 2000.
[8] J. C. Kieffer, E.-H. Yang, G. Nelson, and P. Cosman, “Lossless compression via multilevel pattern matching,” IEEE Trans. Inform. Theory, vol. 46, pp. 1227–1245, July 2000.
[9] J. K. Lanctot, M. Li, and E.-H. Yang, “Estimating DNA sequence entropy,” in Proc. SODA 2000, San Francisco, CA, Jan. 9–11, 2000, pp. 409–418.
[10] G. Louchard and W. Szpankowski, “On the average redundancy rate of the Lempel–Ziv code,” IEEE Trans. Inform. Theory, vol. 43, pp. 2–8, Jan. 1997.
[11] A. Moffat, “Linear time adaptive arithmetic coding,” IEEE Trans. Inform. Theory, vol. 36, pp. 401–406, Mar. 1990.


[12] E. Plotnik, M. Weinberger, and J. Ziv, “Upper bounds on the probability of sequences emitted by finite-state sources and on the redundancy of the Lempel–Ziv algorithm,” IEEE Trans. Inform. Theory, vol. 38, pp. 66–72, Jan. 1992.
[13] B. Ryabko and A. Fionov, “An efficient method for adaptive arithmetic coding of sources with large alphabets,” Probl. Inform. Transm., vol. 35, no. 4, pp. 95–108, 1999.
[14] ——, “Fast and space-efficient adaptive arithmetic coding,” in Cryptography and Coding, Proc. 7th IMA Int. Conf. (Lecture Notes in Computer Science, no. 1746), Cirencester, U.K., Dec. 1999, pp. 270–279.
[15] P. Subrahmanya and T. Berger, “A sliding window Lempel–Ziv algorithm for differential layer encoding in progressive transmission,” in Proc. 1995 IEEE Int. Symp. Inform. Theory, Whistler, BC, Canada, p. 266.
[16] H. S. Witsenhausen, “The zero-error side information problem and chromatic numbers (Corresp.),” IEEE Trans. Inform. Theory, vol. IT-22, pp. 592–593, Sept. 1976.
[17] I. H. Witten, R. M. Neal, and J. G. Cleary, “Arithmetic coding for data compression,” Commun. Assoc. Comput. Mach., vol. 30, pp. 520–540, 1987.
[18] E.-H. Yang and D.-K. He, “Efficient universal lossless data compression algorithms based on a greedy sequential grammar transform—Part two: With context models,” IEEE Trans. Inform. Theory, submitted for publication.


[19] E.-H. Yang and Y. Jia, “Universal lossless coding of sources with large or unbounded alphabets,” in Numbers, Information and Complexity, I. Althöfer et al., Eds. Norwell, MA: Kluwer, Feb. 2000, pp. 421–442.
[20] E.-H. Yang and J. C. Kieffer, “Efficient universal lossless data compression algorithms based on a greedy sequential grammar transform—Part one: Without context models,” IEEE Trans. Inform. Theory, vol. 46, pp. 755–777, May 2000.
[21] ——, “Universal source coding theory based on grammar transforms,” in Proc. 1999 IEEE Information Theory and Communications Workshop, Kruger National Park, South Africa, June 20–25, 1999, pp. 75–77.
[22] E.-H. Yang and S. Shen, “Shannon information content of a single event and infinite random sequences (I),” Sci. in China, ser. A, vol. 34, pp. 1183–1193, 1991.
[23] J. Ziv, “Fixed-rate encoding of individual sequences with side information,” IEEE Trans. Inform. Theory, vol. IT-30, pp. 348–352, Mar. 1984.
[24] ——, “Universal decoding for finite-state channels,” IEEE Trans. Inform. Theory, vol. IT-31, pp. 453–460, July 1985.
[25] J. Ziv and A. Lempel, “Compression of individual sequences via variable rate coding,” IEEE Trans. Inform. Theory, vol. IT-24, pp. 530–536, Sept. 1978.
