Operators over Compressed Integer Encodings

Carlos Bedregal Lizárraga†, University of Chile. Email: [email protected]

Abstract—Compact representation of integer values is a key feature for data compression. A compressed representation allows us to store more integers in less space, and therefore to work in faster levels of the memory hierarchy. Adaptive searching yields compressed representations of integers. In scenarios where the cost of updating the encoding is high, approaches that perform operations over the integers dynamically are preferable. In this paper we present a framework to create integer encodings, based on augmenting search algorithms in the comparison model. We show that such algorithms share common properties that allow supporting arithmetic and logic operations over the encoded integers. Finally, we propose a practical approach to support these operations over the Elias integer encodings.

Eric Sadit Tellez, Universidad Michoacana. Email: [email protected]

I. INTRODUCTION

Searching for the position of a key k inside a list L of ordered items is a particularly interesting problem, pervasive in most applications of computer science. Let N = |L| be the size of L. The naïve algorithm sequentially checks every item in the list and reports the occurrence position. The search stops as soon as the order certifies that k cannot be inside L. Clearly, it takes p comparisons to determine whether k exists in L, where p is the insertion position of k. The maximum number of comparisons is N. This algorithm is commonly named sequential search.

A more clever way to use the order inside L is to binary search for k. The procedure is simple and well known: compare k with the median of L at position M = ⌈N/2⌉; if k < L[M] search for k in L[1..M−1], else search for it in L[M..N]. We obtain a single candidate in lg N steps, and we can check for equality with one additional comparison, yielding a total cost of lg N + 1 comparisons. This algorithm, called binary search, is a classic example of an optimal worst-case algorithm. Figure 1(a) shows an example of sequential searching, and a simple example of the steps in binary searching is shown in Figure 1(b).

Unfortunately, binary search is not aware of the potential simplicity of the instance. For example, if k is in the first or second position of the array, binary search still spends lg N + 1 comparisons to find it. An adaptive algorithm can take advantage of these instances, for example when k is close to the beginning, and the measure of difficulty is the smallest codification of p using that information.

Doubling search is an algorithm that takes advantage of the position p of k. It consists in determining a range bounded by p_lower and p_upper where k can lie, and finally performing binary search for k inside this range. Starting at position 1 of L, the algorithm checks for k doubling the position until k is bounded; in other words, it needs to find an i such that L[2^(i−1)] < k ≤ L[2^i].¹ Then p_lower = 2^(i−1) and p_upper = 2^i, taking i = ⌈lg p⌉ comparisons to determine these bounds. Since p_upper − p_lower = p_lower = 2^(⌈lg p⌉−1) ≤ 2^(lg p) = p, the binary search on this range costs at most lg p + 1 comparisons. The complete cost of doubling search for a k found at position p is then 2⌈lg p⌉ + 1. As we can observe, the complexity is independent of N and depends only on the insertion position p of k in L.

Both algorithms, binary search and doubling search, have worst-case complexity O(lg N) over instances of size N. Doubling search is faster for small values of p, in which case it is a better option than binary search: exactly when 2⌈lg p⌉ + 1 ≤ ⌈lg N⌉ + 1, with the simpler sufficient condition 2(⌊lg p⌋ + 1) ≤ ⌊lg N⌋ + 1. So, doubling search is the preferable choice over binary search when p is constrained by the following relation:

2⌊lg p⌋ ≤ ⌊lg N⌋ − 1  ⟺  ⌊lg p⌋ ≤ (1/2)⌊lg(N/2)⌋  ⟺  p ≤ √(N/2)

† Supported by a Conicyt Grant.
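The doubling search described above can be sketched briefly; this is an illustrative implementation assuming 0-based Python lists, with the final step delegated to the standard `bisect` module:

```python
import bisect

def doubling_search(L, k):
    """Find the insertion position of k in the sorted list L.

    Doubles the probed position until k is bracketed (about lg p
    probes), then binary-searches only inside that bracket, so the
    total cost depends on p rather than on len(L)."""
    if not L or k <= L[0]:
        return 0
    hi = 1
    while hi < len(L) and L[hi] < k:
        hi *= 2                      # galloping phase: probe 1, 2, 4, 8, ...
    lo = hi // 2                     # now k lies in L[lo .. min(hi, n-1)]
    return bisect.bisect_left(L, k, lo, min(hi + 1, len(L)))

L = list(range(0, 1000, 2))          # 0, 2, 4, ..., 998
print(doubling_search(L, 6))         # -> 3, found cheaply near the front
```

Note the two phases match the analysis: ⌈lg p⌉ comparisons to bracket, then a binary search bounded by the bracket size.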

The last binary search (of doubling search) can be replaced by a nested machinery of doubling search algorithms, applied many times, giving a family of adaptive search algorithms, as described by Bentley and Yao [BY76b]. Figure 1(c) illustrates the search pattern of the 2-nested doubling search algorithm. A special modification can be made when we expect p to be large, with N much larger or unbounded: we can apply galloping using larger steps, i.e., probing positions 2^(2^i), where i is the step.

¹ For simplicity, we suppose N to be a power of two; it can even be unbounded.

Obviously, this can be extended to K levels of powers, e.g., steps 2^(2^(2^i)) when K = 3. Suppose that K = 2, i.e., the steps are determined by 2^(2^i). Similarly to doubling search, we bound the k at position p, so we advance until L[2^(2^(i−1))] < k ≤ L[2^(2^i)], performing ⌈lg lg p⌉ ≤ lg lg p + 1 comparisons for it. Now we need to search in a range of 2^(2^(i−1)) = 2^(2^(⌈lg lg p⌉−1)) items. Finally, we can apply binary search on it, needing lg 2^(2^(⌈lg lg p⌉−1)) + 1 = 2^(⌈lg lg p⌉−1) + 1 comparisons. Since 2^(⌈lg lg p⌉−1) + 1 ≤ 2^(⌊lg lg p⌋) + 1 ≤ lg p + 1, we obtain a total of lg p + lg lg p + 1 comparisons. Again, this is competitive when lg p + lg lg p + 1 ≤ 2 lg p + 1 (against doubling search), and when lg p + lg lg p + 1 ≤ lg N + 1 (against binary search). This variant is also described by Bentley and Yao [BY76b].

This behavior defines a C-adaptive algorithm: an algorithm running in the worst case within a constant factor of the optimal algorithm for the problem, but solving a certain family of inputs using C as the main complexity measure, i.e., the class of instances characterized as easy under the difficulty measure C. In our case C is the position of the queried integer; under this point of view doubling search, and the exposed variants, are adaptive to the position.

Enhancements of searching algorithms are possible if we take advantage of the domain. For example, with numbers, using the distribution produces interpolation search from binary search and extrapolation search from doubling search. Both modifications are based on estimating the position p from hint values. Algorithms for searching in ordered lists are covered in depth by Knuth [Knu73].
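The doubly exponential probing (the K = 2 case above) can be sketched as follows; the probe schedule and function name are illustrative, assuming the sorted sequence is materialized as a Python list:

```python
import bisect

def double_galloping_search(L, k):
    """Bracket k by probing positions 1, 2, 4, 16, 256, ... (about
    lg lg p probes of the form 2^(2^i)), then binary-search the
    bracket. Useful when p is large and N is huge or unbounded."""
    if not L or k <= L[0]:
        return 0
    i, pos, prev = 0, 1, 0
    while pos < len(L) and L[pos] < k:
        prev = pos
        pos = 2 ** (2 ** i)          # next probe: 2, 4, 16, 256, 65536, ...
        i += 1
    return bisect.bisect_left(L, k, prev, min(pos + 1, len(L)))
```

For example, on `list(range(100))` with k = 50, the probes visit positions 1, 2, 4, 16, 256 before the final localized binary search.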

II. INTEGER ENCODING

Typically, an integer is represented by a fixed quantity of bits dependent on the architecture of the computer, e.g., 8, 16, 32 or 64 bits are common values. This number of bits bounds the range of integers that can be represented by an architecture: it is the architecture or native representation of integers. The native representation is sometimes larger than necessary, and in those cases it is an expensive technique. Clever representations of integers allow storing more integers in the same space, at the cost of a little more effort at the implementation stage and a CPU overhead to encode and decode the integers. Since we can represent more items in less space, we can store more items in faster levels of the memory hierarchy, often being many times faster than naïve implementations, even with the undesired overhead of the coding-decoding scheme. There exist some common ways to encode integers, like truncated binary coding, γ-coding, δ-coding, Golomb coding and Rice coding:

Unary code. It represents the number p using a string of p bits.

Truncated binary code. It encodes an integer inside a range using only the bits necessary to identify a value in that range, instead of the fixed number of bits used by the native representation; i.e., b bits can encode 2^b integers, so a range from 0 to 2^b − 1 can be represented using b bits. The main drawback is that we need to know the range. We call it simply binary code when the context allows it.

Elias-γ code. It encodes each item using its own range. The length of the binary code is specified in unary, followed by the binary code of the number [Eli75].

Elias-δ code. The main problem of the γ-code arises when the integers are large: it needs many bits to specify the length in unary. The δ-code [Eli75] solves this problem by specifying the length using the γ-code.

Golomb code. It uses blocks of size M to partition the integer space, representing the block number in unary and the position inside the block in binary. So we can represent every integer as two integers: q, the quotient or block identifier, and r, the remainder. Then, if u is the integer to encode, q = ⌊(u − 1)/M⌋ and r = u − qM − 1, which yields a space of q + ⌈lg M⌉ bits per integer. It is fully described and analyzed in [Gol66].

Rice coding. It is a special case of Golomb codes where M is a power of two. This restriction allows computing q and r faster. The interested reader is referred to [Ric76].

There is a strong connection between integer encoding and searching algorithms. As will be evident in the next sections, all of these integer encodings have a corresponding search algorithm.

III. SEARCHING IS CODING

Let L be a sorted list containing the first N natural numbers, and let A be a search algorithm in the comparison model with input parameter k, the integer to be found in L. A outputs 1 whenever it tests k against some x ∈ L obtaining k > x, and it outputs 0 otherwise. We formalize our claim with the following proposition.

Proposition 1: A search algorithm in the comparison model can be used to encode any integer k ∈ L. The same algorithm can be used to recover the encoded integer k.

Proof. A is deterministic, L is in total order, and k ∈ L. The execution of algorithm A defines the code CA(k), a bitstring of size ℓ, where the i-th bit indicates the result of the i-th comparison. After each comparison the working range [lower, upper] is stretched; when upper = lower, both pointers reference the position of k.² Hence, given A, L and CA(k), we can decode CA(k), obtaining k unequivocally, as follows: we pass a special k′ to A that answers each comparison as specified by CA(k), i.e., the i-th comparison performed by A is evaluated as CA(k)[i]. After ℓ comparisons A finishes and the upper (and lower) variables point to k. As a completeness remark, CA(k) = CA(k′). □

² This defines a prefix-free code.
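Proposition 1 can be illustrated with binary search over L = [0..N−1]: recording the outcome of each comparison produces the code, and replaying the recorded bits drives the same algorithm back to k. A minimal sketch, with illustrative names and half-open ranges:

```python
def encode(k, N):
    """Run binary search for k on [0..N-1], recording one bit per
    comparison: 1 if k is at least the pivot, else 0 (Proposition 1)."""
    bits, lo, hi = [], 0, N          # search the half-open range [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if k >= mid:                 # comparison in the comparison model
            bits.append(1)
            lo = mid
        else:
            bits.append(0)
            hi = mid
    return bits

def decode(bits, N):
    """Replay the recorded comparison outcomes; the single surviving
    position of the range is the encoded integer."""
    lo, hi = 0, N
    for b in bits:
        mid = (lo + hi) // 2
        if b == 1:
            lo = mid
        else:
            hi = mid
    return lo

for k in range(16):                  # the code is prefix-free and invertible
    assert decode(encode(k, 16), 16) == k
```

For N a power of two, the recorded bits coincide with the fixed-width binary representation of k, matching the claim that binary search induces the binary code.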

Figure 1. Skip patterns followed by search algorithms working on sorted arrays: (a) sequential search, scanning from MinValue towards MaxValue; (b) binary search in the full range; (c) nested doubling search, with skips over a very large sequence followed by a localized binary search.

Additionally, we consider that every A defines a binary search tree (BST), TA. In this sense, CA(k) is the description of the path from the root to k, where left edges are labeled 0 and right edges 1. Prefix-free codes define trees with the data in the leaves. A balanced tree suggests good worst-case algorithms, such as the binary search algorithm (see Figure 1(b)). On the other hand, an unbalanced tree suggests adaptiveness for items closer to the root, at the cost of larger complexities for the remaining items³ (see Figure 1(c)).

A. Compressing integers

Let us define a compressed integer as an integer value encoded in a number of bits depending on the represented value and not on the size of the machine word. Section III shows that searching algorithms in the comparison model produce integer encoding schemes, as well as binary search trees (Figure 2). Any integer coding algorithm induces a search algorithm, but not necessarily in the comparison model. The proof is constructive and similar to that of Proposition 1 (or as demonstrated by Beigel [Bei90]).

B. Relation between doubling search coding and Elias-γ

Consider Cγ(x), the Elias-γ code representing the integer x. This representation can be decomposed as Cγ(x) = L(x)B(x), where B(x) is the representation of x in truncated binary, and L(x) is the representation of the number of bits u = ⌈lg x⌉ required by B(x). For Elias-γ, L(x) is represented using unary code: a string of u − 1 zeros followed by a single 1. This codification corresponds to the galloping steps of doubling search minus one, but encoded as the complement. The codification of B(x) corresponds to the output of the binary search. Some interesting properties arise from Elias-γ:

• It encodes x = 1 using a single bit: B(x) is empty and the integer 1 is inferred when L(x) = 1.
• The last bit of L(x) is merged with the most significant bit of B(x): since there is no representation for 0, B(x) always has a 1 as its most significant bit, so it is possible to save a bit in the representation.

³ Adaptive algorithms will create trees with heights in O(lg N).
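A compact sketch of the Cγ(x) = L(x)B(x) decomposition, including the shared bit between the unary part and the leading 1 of the binary part; this assumes x ≥ 1 and uses string-based bit handling for clarity:

```python
def gamma_encode(x):
    """Elias-γ: u-1 zeros followed by the u-bit binary of x, whose
    leading 1 doubles as the unary terminator (the saved bit)."""
    assert x >= 1
    b = bin(x)[2:]                   # binary of x; its leading bit is always 1
    return "0" * (len(b) - 1) + b    # L(x) and B(x) share that leading 1

def gamma_decode(code):
    u = code.index("1") + 1          # count of zeros + 1 = length of B(x)
    return int(code[u - 1:u - 1 + u], 2)

print(gamma_encode(1), gamma_encode(5))   # -> 1 00101
assert all(gamma_decode(gamma_encode(x)) == x for x in range(1, 100))
```

The output for x = 5 matches the γ column of Table I (which lists the code of i + 1 in the row of integer i).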

Most operations are performed using native integer operations. This is true since L(x) is encoded as the complement, and B(x) uses the extra 1 from the separator to become the full binary representation. So encoding, decoding and operations are simple and efficient. Even though Elias-γ needs fewer bits to encode some integers, doubling search coding produces codes almost as good. This is an example of how the searching-is-coding scheme helps to create new integer codification schemes, helping in the task of creating adaptive codings for particular distributions. Table I shows the encodings obtained by Elias-γ coding and doubling search coding for the integers from 0 to 15. The BST defined by the doubling search algorithm is shown in Figure 3(a); as stated, the code represents the decision path to reach the leaf containing the number. In Figure 3(c) the BST for binary search is shown, and sequential search is represented by its associated BST in Figure 3(b).

IV. OPERATIONS OVER ENCODED INTEGERS

Binary code is a well-known way to encode integers. Induced by the binary search algorithm, it can be simulated by a balanced search tree (Figure 3(c)). Binary code is an optimal code maximizing the possible number of encoded integers within a given number of bits, and it is adopted as the native representation of integers in most computers. There exist applications where binary encoding is not the most efficient way to encode integers, mainly in domains working with variable-length integers. For these applications, we can develop algorithms performing arithmetic operations directly on the exact representation of the integers. This is useful to construct specialized processing units working with these codings, e.g., using FPGA devices. Another interesting scheme comes from exploiting the capabilities of large parallel architectures, e.g., GPGPUs, to work with very large datasets of variable-size integers, skipping the decoding step. In general, if we want to work directly with compressed representations of integers, we need a better comprehension of how they work.

In this section we define operators for any encoding produced by applying Proposition 1 literally. In order to be general, we restrict ourselves to the most basic tools and properties found in every integer encoding algorithm in the class of algorithms described before. We show how to work in these constrained situations while still defining a complete gamut of operators. A core property is that every encoding derived from Proposition 1 maintains the same lexicographic order as the numerical order, seeing each bit as a symbol. Formally:

Proposition 2: Let x, y ∈ N and CA(x), CA(y) be their corresponding codes using Proposition 1 (written x̂, ŷ for short); then:

x̂ ≺ ŷ ⟺ x < y
x̂ = ŷ ⟺ x = y
x̂ ≻ ŷ ⟺ x > y

where ≺, =, ≻ stand for the lexicographic comparisons smaller than, equal to, and greater than, respectively.

Proof. It is true by construction, using Proposition 1. □

Figure 2. Relation between integer encoding, binary search trees and search algorithms in sorted lists. The half arrow leaving from integer encoding indicates that not every integer encoding algorithm can be expressed in terms of search algorithms in the comparison model.

Proposition 2 will be used to define operations involving simple navigation inside the BST.

A. Basic operations

Consider a prefix-free coding scheme CA based on a search algorithm A in the comparison model, and its associated binary search tree TA (see Figure 3 for some examples). Consider CA(a) and CA(b), the codes representing the integers a and b under encoder A. We define some easily implementable operations using TA.

Successor is defined as succ(CA(a)) = CA(a + 1), which corresponds to the next leaf after a in an in-order walk over TA. It can be easily implemented by navigating from CA(a). If the algorithm encodes a finite range [0..N − 1], succ(N − 1) is not defined.

Predecessor is defined as pred(CA(a)) = CA(a − 1). Its definition is analogous to successor, so succ(pred(CA(a))) = CA(a). Note that pred(·) is not defined for the first representable integer.

Predicate equal to, CA(a) = CA(b). Since a code can represent only a unique integer, checking for equality requires proving that every single bit position is identical in CA(a) and CA(b). In a practical implementation, this operation is computed in constant time using bit-wise operations for integers represented in O(1) words of memory.

Predicate less than, CA(a) < CA(b), produces true if a < b and false otherwise. Clearly, CA(a) < CA(b) is true if repeated predecessor operations over CA(b) eventually reach CA(a). This is an expensive solution, with a maximum cost of b predecessor operations. We can get a faster solution, in time bounded by the smaller code.⁴ Since both codes were induced by comparisons in the same tree TA, we iterate from the most significant bit to the least significant one and stop at the first mismatching bit: if at that position CA(a) contains a 0 then the predicate is true, and false otherwise. We can implement this operator using bit-parallelism and a pre-computed table in O(1) time, if the table can be stored in O(1) words. The predicates >, ≤, ≥ can be defined analogously in O(1) time.

⁴ This must be O(lg N) for a code induced by an adaptive or a worst-case algorithm.

B. Arithmetic for any integer coding scheme

Consider arithmetic operations of the form CA(c) = CA(a) ⊙ CA(b), for ⊙ one of the operators +, −, ×, /, corresponding to addition, subtraction, multiplication and

Integer | Elias-γ code | # of bits | Doubling search code | # of bits
--------|--------------|-----------|----------------------|----------
0       | 1            | 1         | 00                   | 2
1       | 010          | 3         | 01                   | 2
2       | 011          | 3         | 100                  | 3
3       | 00100        | 5         | 101                  | 3
4       | 00101        | 5         | 11000                | 5
5       | 00110        | 5         | 11001                | 5
6       | 00111        | 5         | 11010                | 5
7       | 0001000      | 7         | 11011                | 5
8       | 0001001      | 7         | 1110000              | 7
9       | 0001010      | 7         | 1110001              | 7
10      | 0001011      | 7         | 1110010              | 7
11      | 0001100      | 7         | 1110011              | 7
12      | 0001101      | 7         | 1110100              | 7
13      | 0001110      | 7         | 1110101              | 7
14      | 0001111      | 7         | 1110110              | 7
15      | 000010000    | 9         | 1110111              | 7

Table I. 16 integers encoded with Elias-γ code and doubling search code (the γ column encodes i + 1 in the row of integer i).
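The lexicographic-order property of Proposition 2, and the less-than predicate built on it, can be checked directly on the doubling search code of Table I. A sketch, where `dcode` is an illustrative generator reproducing the table's column:

```python
def dcode(n):
    """Doubling search code of Table I: a galloping prefix 1^g 0
    followed by the offset of n inside its group [2^g, 2^(g+1))."""
    g = 0 if n < 2 else n.bit_length() - 1
    offset = n if n < 2 else n - (1 << g)
    return "1" * g + "0" + format(offset, "0%db" % max(g, 1))

codes = [dcode(n) for n in range(16)]
assert codes == sorted(codes)            # lexicographic order = numeric order
assert (dcode(3) < dcode(9)) == (3 < 9)  # less-than is a plain string compare
print(codes[:6])   # -> ['00', '01', '100', '101', '11000', '11001']
```

The mismatching-bit rule of the less-than predicate is exactly what the string comparison performs: it scans from the most significant bit and decides at the first difference.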

division respectively. In general, arithmetic operations can be performed following one of these strategies:
1) Decode both operands, perform the operation over the decoded numbers, and finally encode the result.
2) Work with an encoded value and a non-encoded value.
3) Operate directly on the encoded values.

The first approach has a complexity of O(lg N): the time to decode CA(a) and CA(b) and then generate CA(c). Although easy to implement in practice, this approach is not efficient for integers bigger than the machine's native representation. The second approach is used to work dynamically with integer encodings: we have an encoded integer CA(a) that is modified by an integer b and updated to obtain the code CA(c). The third approach must use the information of the encoder A (i.e., the shape of the tree TA) to correctly perform the operations. In this scenario we note that even when A is not completely known, operations can be supported based on navigation inside the tree, as this is universally available for any encoder. If we suppose that information about the coder is not known, but we can still perform the basic operations inside the tree, it is possible to generalize the operations to any search algorithm in the comparison model. Even if we do not know CA(c) a priori, we can rely on the successor and predecessor operators given in Section IV-A to perform more complex operations. Predecessor and successor operations can be computed in O(1) amortized time, with a worst case of O(lg N) per operation if the tree TA has O(lg N) search complexity; both are solved using Proposition 2, i.e., navigating inside the tree.
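The first strategy (decode, operate, re-encode) can be sketched generically; this illustrative wrapper uses an Elias-γ codec as the concrete encoder, with all names being assumptions of the sketch:

```python
import operator

def gamma_encode(x):                  # Elias-γ of x >= 1, as a bit string
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(c):
    u = c.index("1") + 1
    return int(c[u - 1:u - 1 + u], 2)

def operate(code_a, code_b, op):
    """Strategy 1: decode both operands, apply op on plain integers,
    then re-encode. O(lg N) per operation, simple and always correct."""
    return gamma_encode(op(gamma_decode(code_a), gamma_decode(code_b)))

c = operate(gamma_encode(6), gamma_encode(7), operator.add)
assert gamma_decode(c) == 13
```

Strategies 2 and 3 avoid the full decode, which is the point of the operators developed in the remainder of the section.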

The addition first ensures CA(a) ≤ CA(b) with a constant-time compare and swap (a full decode would cost O(lg N)). Addition is then solved by applying min{a, b} successor operations, accumulating into the max{a, b} argument. The subtraction is computed by applying and accumulating predecessor calls; if b > a we stop at the smallest representable integer, so it performs min{a, b} predecessor operations. Both operations, addition and subtraction, are efficient only when a small operand is involved. Unfortunately, under these general suppositions, multiplication and division have an even worse complexity: CA(c) = CA(a) × CA(b) can be computed by adding a into an accumulator b times, and division follows a similar procedure using subtraction, counting the number of times that b can be subtracted from a. The logarithm can also be implemented using the basic operations, yet not efficiently. In the next section we discuss a practical and more specific approach for operations over compressed representations of integers.

V. OPERATIONS OVER ELIAS INTEGER ENCODINGS

We now implement the models proposed to support operations on compressed representations of integers for the Elias [Eli75] encoders, as they are isomorphic with the algorithms for unbounded searching of Bentley and Yao [BY76a]. Consider the operation CA(a) ⊙ b = CA(a′), where a, b, a′ are positive integers, CA(i) represents the encoding of the integer i, and ⊙ is one of the operators +, −, ×, /, corresponding to addition, subtraction, multiplication and division respectively. As A represents one of the Elias encoders, we decompose the code as CA(i) = L(i)B(i), where B(i) is the part of CA(i) representing i in truncated binary, and L(i) is the representation of the number of bits u = ⌈lg i⌉ required by B(i).
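The L(i)B(i) decomposition for Elias-δ can be sketched as follows; this is the common textbook formulation (the length is γ-coded and the binary part drops its implicit leading 1), which may differ in off-by-one conventions from the notation above:

```python
def gamma(x):                         # Elias-γ of x >= 1, as before
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def delta_encode(x):
    """Elias-δ: L(x) is the γ-code of the binary length u of x,
    and B(x) is the binary of x without its (implicit) leading 1."""
    assert x >= 1
    b = bin(x)[2:]
    return gamma(len(b)) + b[1:]

print(delta_encode(1), delta_encode(9), delta_encode(17))
# -> 1 00100001 001010001
```

Because L(i) is itself a γ-code here, the update rules below for L(a) reduce to recursive applications of the γ rules.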

Figure 3. Some prefix-free codes generated by search algorithms/binary search trees: (a) induced tree shape for doubling search, (b) induced tree shape for sequential search, (c) induced tree shape for binary search.

The cost of each operation is analyzed in terms of the number of flipped bits inside the codes, i.e., how many bits change from 0 to 1, or conversely from 1 to 0. This measure is useful for applications running on devices whose functionality is bounded by a limited number of write operations, e.g., flash memories or DVD-RW discs. Another interesting application lies in compressed data structures (such as compressed inverted indexes) that manage compressed integers as a core building block [WMB99]. Comparison operations CA(a) • CA(b), such as >, <, and =, are supported as described in Section IV.

A. Addition

For the Elias encodings the addition CA(a) + b = CA(a′) is performed as follows:
1) ⌈lg a′⌉ = ⌈lg a⌉: there is no need for additional bits to represent a′, so we perform B(a) + b directly, with at most ⌈lg a⌉ flips of bits in B(a).
2) ⌈lg a′⌉ > ⌈lg a⌉: the representation of a′ will add bits to CA(a).
   • ⌈lg b⌉ ≤ ⌈lg a⌉: this is the basic scenario, since we need to add only one bit. First we add a bit 1 at the beginning of B(a) and perform at most ⌈lg a⌉ flips of bits in B(a), depending on the value of a′.
   • ⌈lg b⌉ > ⌈lg a⌉: in general, if we need an arbitrary number of additional bits to represent a′, we insert the most significant bits of b at the beginning of B(a) and then perform at most ⌈lg b⌉ flips of bits on B(a) to reflect the new resulting number a′.

Depending on the encoding used, in case 2 we must also update the bits in L(a). Let b′ = ⌈lg(a + b)⌉ − ⌈lg a⌉; then:
• Elias-γ: add b′ bits 0 at the beginning of L(a).
• Elias-δ: considering that Cδ(a) = Cγ(⌈lg a⌉)B(a), we perform Cγ(⌈lg a⌉) + b′ using the addition algorithm.
• Elias-ω: considering that Cω,k(a) = Cω,k−1(⌈lg a⌉)B(a), we perform Cω,k−1(⌈lg a⌉) + b′ recursively, repeating the algorithm until b′ = 0.
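The flip-bit cost measure can be observed empirically by comparing the old and new codes; a sketch for Elias-γ (illustrative names, string-based bits):

```python
def gamma(x):                         # Elias-γ of x >= 1, as a bit string
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def flips(a, b):
    """Cost of updating Cγ(a) to Cγ(a + b): bits added at the front
    of the code, plus position-wise flips in the aligned tail."""
    old, new = gamma(a), gamma(a + b)
    added = len(new) - len(old)          # bits inserted at the beginning
    changed = sum(o != n for o, n in zip(old, new[added:]))
    return added, changed

print(flips(5, 1))   # 00101 -> 00110: 0 bits added, 2 bits flipped
```

When the sum crosses a power of two (e.g., 7 + 1), `added` becomes positive and the L(a) update rules above kick in.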

B. Subtraction

The procedure is very similar to the addition algorithm. For the Elias encodings the subtraction CA(a) − b = CA(a′) is performed as follows:
1) ⌈lg a′⌉ = ⌈lg a⌉: there is no need to remove bits to represent a′, so we perform B(a) − b directly, with at most ⌈lg a⌉ − 1 flips of bits in B(a).
2) ⌈lg a′⌉ < ⌈lg a⌉: in order to consistently represent a′, we are required to remove bits from CA(a).
   • ⌈lg b⌉ < ⌈lg a⌉ − 1: this is the basic scenario, since we need to remove only one bit. First we remove the most significant bit from B(a) and perform at most ⌈lg a⌉ − 1 flips of bits in B(a), depending on the value of a′.
   • ⌈lg b⌉ = ⌈lg a⌉ or ⌈lg b⌉ = ⌈lg a⌉ − 1: in general, if we need to remove an arbitrary number of bits to represent a′, we remove from the beginning of B(a) the 0 bits originated by the subtraction and then perform at most ⌈lg b⌉ flips of bits on B(a) to reflect the new resulting number a′.

For case 2, as in the addition algorithm, we need to update the preceding bits L(a). Let b′ = ⌈lg a⌉ − ⌈lg a′⌉; then:
• Elias-γ: remove b′ 0 bits from the beginning of CA(a).
• Elias-δ: repeat the algorithm performing Cγ(⌈lg a⌉) − b′.
• Elias-ω: repeat the algorithm performing the subtractions Cω,k−1(⌈lg a⌉) − b′ until b′ = 0.

C. Multiplication

The multiplication CA(a) × b is performed as follows:
1) b is a power of 2, with p = lg b: this case is trivial, since we just append p bits 0 at the end of B(a); no flips of bits are needed.
2) b = 2^p + b′, with p = ⌊lg b⌋: this is the general case.
   a) c = a.
   b) Append p bits 0 at the end of B(a).
   c) Decompose b′ = 2^p′ + b′′, with p′ = ⌊lg b′⌋.
   d) Perform B(a) + (c · 2^p′) using the addition algorithm.
   e) Repeat from step c) with b′ = b′′.
The total number of iterations performed is at most ⌊lg b⌋. Depending on the encoding used, the update of the bits that precede B(a) is conducted in the same way as in the addition algorithm, since it requires b′ = ⌈lg(ab)⌉ − ⌈lg a⌉ additional bits to represent a′.

D. Division

In the same fashion as in the multiplication algorithm, division by an integer b that is a power of 2 is performed by removing p = lg b bits from the end of B(a). For the general case, considering division as a variant of the previous algorithm is very expensive, because the iterative decomposition of b into negative powers of 2 (as in case 2 of the multiplication algorithm) generates infinite iterations. One approach is to limit the number of iterations, which still gives an accurate answer since we are working with integer numbers. The update of the bits preceding B(a) in CA(a) is conducted in the same way as in the subtraction algorithm, as we are required to remove b′ = ⌈lg a⌉ − ⌈lg(a/b)⌉ bits from the beginning of CA(a).

E. Other operations

The logarithm to base 2, ⌈lg CA(a)⌉, is trivial, as the answer is explicitly stored in L(a) by all the Elias encoding schemes. The logarithm to base b, ⌈lg_b CA(a)⌉, gives an exact answer only if a is a power of b; otherwise the operator is not defined. The power operation, CA(a)^b, is trivial when a is a power of 2. For the other cases, the naive approach reduces the problem to multiplication, performing CA(a) × a, b − 1 times, using the multiplication algorithm.

As the size of the code changes, dynamic representations for integers are required to support the operators efficiently in practice. Current approaches for dynamic bit vectors of size n allow the operations rank, select, insert, delete and update in O(lg n) time and O(n) space [lCkHwL04], [MN06].
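The shift-and-add decomposition behind the multiplication algorithm of Section V-C can be sketched on plain bit strings; this is an illustrative model, not an implementation over the full Elias codes:

```python
def multiply_binary(Ba, b):
    """Shift-and-add over the truncated-binary part B(a): each set
    bit p of b contributes a shifted copy of a, i.e., B(a) with p
    zeros appended, accumulated with the addition algorithm."""
    a = int(Ba, 2)
    acc, p = 0, 0
    while b:
        if b & 1:
            acc += a << p            # "append p zeros" then add
        b >>= 1
        p += 1
    return bin(acc)[2:]

assert multiply_binary("101", 6) == "11110"   # 5 * 6 = 30
```

As in the text, the number of additions performed is bounded by the number of set bits of b, at most ⌊lg b⌋ + 1.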

Given the close relation between encoding and searching algorithms, our results can be extended to other encoding schemes. The support of operations over Elias encodings can be extended to other coders with similar behavior, such as Golomb [Gol66] and Rice [Ric76]. The analysis was done considering the number of bit flips required to update the code to its new state. Considering a machine word of size w, it is possible to express the analysis in terms of word updates (a word can be updated, modifying its bits, in a single operation). Therefore, an integer n coded using O(lg n) bits can be updated in O(lg n / w) time. This observation is more relevant when working with integers where lg n ≫ w; in these scenarios our approach updates a minimum number of machine words instead of the whole set of words used for the codification of n. These operations can be used to allow more dynamism in compressed structures, as it is common practice that compressed structures have a static behavior, where any modification of the data implies the reconstruction of the structure.

VI. CONCLUSIONS AND FUTURE WORK

We have presented a general framework to produce integer encoding algorithms, allowing us to produce integer encodings adapted to particular distributions. A set of arithmetic and logic operators was defined for the general case and, in particular, for the popular integer encodings Elias-γ, Elias-δ and Elias-ω. It is of interest to evaluate the practical advantages of supporting operations over compressed integers under these Elias encodings in distinct scenarios, noting how the particularities of each encoding scheme could impact the performance of the operations.

It is interesting to note that the use of randomized binary search trees will yield integer representation schemes useful for cryptography. We can expect integer encodings of O(lg N) bits on average, additionally supporting arithmetic and logic operators. This can be exploited in devices handling sensitive data, allowing, for example, operating without explicitly decoding the data. Finally, a generalization of the scheme allowing codes with larger alphabets, i.e., redefining the comparison function to a mapping function, yields a new variety of encoding schemes supporting logic and arithmetic operations.

ACKNOWLEDGMENTS

The authors would like to thank Prof. Jérémy Barbay for the ideas and suggestions for this paper. The motivation for this study was based on his lectures on "Analysis and design of adaptive algorithms" at the Department of Computer Science, University of Chile. The authors would also like to thank Prof. Gonzalo Navarro for his helpful observations during the early stages of this study.

REFERENCES

[Bei90] Richard Beigel. Unbounded searching algorithms. SIAM J. Comput., 19(3):522–537, 1990.

[BY76a] Jon Louis Bentley and Andrew Chi-Chih Yao. An almost optimal algorithm for unbounded searching. Inf. Process. Lett., 5(3):82–87, 1976.

[BY76b] Jon Louis Bentley and Andrew Chi-Chih Yao. An almost optimal algorithm for unbounded searching. Inf. Process. Lett., 5(3):82–87, 1976.

[Eli75] P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194–203, 1975.

[Gol66] S. W. Golomb. Run-length encodings. IEEE Transactions on Information Theory, 12(3):399–401, 1966.

[Knu73] Donald E. Knuth. The Art of Computer Programming, Volume III: Sorting and Searching. Addison-Wesley, 1973.

[lCkHwL04] Ho-Leung Chan, Wing-Kai Hon, and Tak-Wah Lam. Compressed index for a dynamic collection of texts. In Proc. CPM'04, LNCS 3109, pages 445–456, 2004.

[MN06] V. Mäkinen and G. Navarro. Dynamic entropy-compressed sequences and full-text indexes. In Proc. 17th Annual Symposium on Combinatorial Pattern Matching (CPM), LNCS 4009, pages 307–318, 2006.

[Ric76] R. F. Rice. Some practical universal noiseless coding techniques. JPL Publication 79–22, Jet Propulsion Laboratory, Pasadena, California, 1976.

[WMB99] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, second edition, 1999.
