An efficient hardware design for intra-prediction in H.264/AVC decoder


Muhammad Nadeem, Stephan Wong, and Georgi Kuzmanov
Computer Engineering Laboratory, Delft University of Technology
Mekelweg 4, 2628CD Delft, The Netherlands
{M.Nadeem, J.S.S.M.Wong, G.K.Kuzmanov}@tudelft.nl

Abstract—The H.264/AVC intra-frame codec is widely used to compress image/video data for applications like Digital Still Camera (DSC), Digital Video Camera (DVC), television studio broadcast, and surveillance video. Intra-prediction is one of the top three compute-intensive processing functions in the H.264/AVC baseline decoder [6] and, therefore, consumes a significant number of compute cycles on a processor. In this paper, we propose a configurable, high-throughput, and area-efficient hardware design for the intra-prediction unit. The intra-prediction algorithm is optimized to significantly reduce the redundancy in addition operations (e.g., a 27% reduction compared with the state of the art in the literature [12]). The area requirement of our hardware implementation of the optimized intra-prediction algorithm is further reduced by employing a configurable design that reuses data paths for mutually exclusive processing scenarios. The proposed design is described in VHDL and synthesized under 0.18µm CMOS standard cell technology. Working at a clock frequency of 150 MHz, it easily meets the throughput requirements of HDTV resolutions and consumes only 21K gates.

Keywords—intra-prediction; H.264/AVC decoder; image and video compression; inverse integer transform.

I. INTRODUCTION

The Advanced Video Coding standard H.264/AVC, also known as MPEG-4 Part 10, was jointly developed by the ITU-T VCEG and ISO/IEC MPEG [1]. It achieves significantly higher compression efficiency than previous video coding standards such as H.263 and MPEG-2/4. H.264/AVC provides subjective video quality similar to that of MPEG-2 with at least a 50% reduction in bit-rate [2], [3], and up to 30% better compression than H.263+ and MPEG-4 Advanced Simple Profile (ASP) [4]. This higher compression efficiency comes at the cost of additional computational complexity in the H.264/AVC coding algorithms, approximately 10 times that of the MPEG-4 Simple Profile [4]. H.264/AVC supports multiple directional intra-prediction modes (4 modes for the luma 16 × 16 / chroma 8 × 8 block types and 9 modes for the 4 × 4 block type) to reduce the spatial redundancy in the video signal. These multiple intra-prediction modes significantly improve the encoding performance of an H.264 intra-frame encoder. Studies [6] have shown that H.264/AVC outperforms JPEG2000, a state-of-the-art still-image coding standard, in terms of subjective as well as objective image quality. This makes the H.264/AVC intra-frame codec an attractive choice for an image compression engine. Applications like Digital Still Camera (DSC) employ intra-frame encoding to compress high-resolution images.

In a video frame encoded in intra mode, the current macroblock (MB, a 16×16 pixel block) is predicted from the previously encoded neighboring macroblocks of the same video frame. Therefore, an intra-frame with all intra-MBs does not depend on any other video frame(s) and can be decoded independently. A video encoded with intra-frames only is easier to edit than video with inter-frames (frames predicted from past or future video frames). Similarly, in many surveillance systems the video is compressed using the intra-frame encoding mode for legal reasons: courts in many countries do not accept predicted image frames as legal evidence [5]. As a result, a typical video surveillance system compresses video using intra-encoded frames only. Consequently, intra-only video coding is a widely used coding technique in television studio broadcast, digital cinema, and surveillance video applications.

The H.264/AVC baseline decoder is approximately 2.5 times more time-complex than the H.263 baseline decoder [6]. According to run-time profiling of the H.264/AVC baseline decoder sub-functions, intra-prediction is one of the top three compute-intensive functions [6]. A high-throughput intra-frame processing chain is, therefore, an important requirement for an H.264/AVC intra-frame decoder in real-time video processing applications. The demanding characteristics of the intra-prediction algorithm suggest a hardware implementation of this function for high-definition video applications, where even larger frame sizes at higher frame rates must be processed in real time.
Several different hardware architectures have been proposed in the literature in the last few years, and most of them are developed from the encoder's requirements point of view. In a video decoder, while decoding the compressed input video bit-stream, the block type and intra-prediction mode for the current macroblock are

already known. Therefore, this information can be used to reduce the on-chip area cost by designing overlapped data paths for mutually exclusive processing scenarios. In this paper, we focus on the realization of a high-throughput and area-efficient design of the intra-prediction unit in an H.264/AVC video decoder and provide the following specific contributions:

• A proposal to optimize the intra-prediction algorithm for 4 × 4 luma blocks by decomposing the filter kernels. The proposed decomposition significantly reduces the addition operations for its hardware implementation (27% - 60% reduction).

• A configurable hardware design of the intra-prediction module that reduces on-chip area by sharing hardware between mutually exclusive processing scenarios (approx. 21K gates).

Furthermore, we also optimize the common equations in the intra-prediction algorithm to compute the prediction samples for 16 × 16 luminance and 8 × 8 chrominance blocks.

The remainder of this paper is organized as follows. Related work is presented in Section II. Section III provides an overview of the intra-prediction algorithm and describes the proposed optimizations. The proposed configurable hardware design is presented in Section IV, and the design evaluation is provided in Section V. Finally, Section VI concludes this paper.

II. RELATED WORK

In the last few years, researchers have proposed a number of optimized algorithms and efficient hardware implementations for the intra-prediction unit in H.264/AVC. For instance, a hardware implementation with five-stage registers and three configurable data paths for intra-prediction is proposed in [8]; it can process approximately 100 VGA frames in real time. In [9], an efficient intra-frame codec is proposed that can process 720p @ 30 fps in real time and can be used for both encoder and decoder implementations. This solution, however, excludes the plane mode for 16 × 16 luma and 8 × 8 chroma blocks and can therefore be used only in matched encoder-decoder scenarios where it is guaranteed that the plane mode is not used for intra-prediction. High-throughput hardware designs for an H.264/AVC decoder are proposed in [10] and [11]; they can process high-definition (HD) video in real time. Since no attempt was made to optimize the intra-prediction algorithm to reduce the arithmetic operations, the hardware implementation of these designs

Figure 1. Functional block diagram of the H.264/AVC decoder.
costs a significant amount of hardware resources (approximately 29K gates). Similarly, in [12], an efficient hardware implementation of the intra-prediction unit is proposed. The design targets the H.264/AVC encoder and uses a so-called combined-module approach to generate a subset of intra-prediction modes in parallel. Furthermore, the intra-prediction algorithm is optimized to significantly reduce the number of arithmetic operations for the computation of prediction samples. The proposed solution, however, does not fully eliminate the redundancy in the algorithm and, again, primarily targets the H.264/AVC encoder.

In short, most of the hardware implementations presented in the literature do not specifically target the H.264/AVC decoder and, therefore, fail to fully exploit the prior intra-prediction mode information to reduce area cost. Some of them do not support the computation of all of the intra-prediction modes and, therefore, cannot be used in a general H.264/AVC video decoder, where any of the intra-prediction modes can appear in the encoded bit-stream. Others have made attempts to reduce the redundancy in the intra-prediction algorithm; however, with the decomposition proposed in this paper, a further reduction in arithmetic operations is possible.

III. INTRA-PREDICTION ALGORITHM

In this section, we briefly introduce the intra-frame processing chain in an H.264/AVC decoder. Subsequently, an overview of the intra-prediction algorithm for 4 × 4 luminance blocks, 16 × 16 luminance blocks, and 8 × 8 chrominance blocks is provided in separate subsections. The optimizations to reduce the number of operations in the intra-prediction algorithm are explained along with the algorithm overview.

The functional block diagram of an H.264/AVC intra-frame decoder is depicted in Fig. 1. The entropy decoder unit parses the input bit-stream and decodes the intra-prediction mode used for the current MB.
This intra-prediction mode information is passed on to the intra-prediction unit to generate the prediction block.
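In software terms, the decoder-side situation can be pictured as a dispatch on an already-decoded mode index: exactly one prediction routine runs per block, which is what motivates sharing a single datapath between mutually exclusive modes in hardware. The following minimal sketch (our own illustrative names, not the paper's RTL) covers the three simplest 4×4 modes, assuming all boundary pixels are available:

```python
# Illustrative software analogue of mode-driven prediction for a 4x4 block.
# top:  the 4 pixels directly above the block; left: the 4 pixels to its left.

def predict_vertical(top):
    # Mode 0 (VT): every row repeats the pixels above the block.
    return [list(top[:4]) for _ in range(4)]

def predict_horizontal(left):
    # Mode 1 (HZ): every column repeats the pixels left of the block.
    return [[left[y]] * 4 for y in range(4)]

def predict_dc(top, left):
    # Mode 2 (DC): rounded average of the 8 valid boundary pixels.
    dc = (sum(top[:4]) + sum(left[:4]) + 4) >> 3
    return [[dc] * 4 for _ in range(4)]

def intra_predict_4x4(mode, top, left):
    # The mode is already decoded from the bit-stream, so only one
    # (mutually exclusive) path is exercised per block.
    if mode == 0:
        return predict_vertical(top)
    if mode == 1:
        return predict_horizontal(left)
    if mode == 2:
        return predict_dc(top, left)
    raise NotImplementedError("modes 3-8 use the 2-/3-tap filter kernels")
```

Because only one branch can be taken per block, a hardware realization can overlap the adders of the different branches instead of instantiating them side by side.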

Figure 2. Illustration of the nine 4x4 luminance prediction modes: Mode 0 Vertical (VT), Mode 1 Horizontal (HZ), Mode 2 DC, Mode 3 Diagonal Down Left (DDL), Mode 4 Diagonal Down Right (DDR), Mode 5 Vertical Right (VR), Mode 6 Horizontal Down (HD), Mode 7 Vertical Left (VL), Mode 8 Horizontal Up (HU).

Figure 3. Comparison: addition operations for the 4x4 luminance intra-prediction modes.

The residual block can be computed in a parallel data path using the inverse quantization (Q-1) and the inverse integer transform (T-1) processing units, as depicted in Fig. 1. Once the intra-prediction block is made available by the intra-prediction processing unit, the current pixel block is reconstructed by adding the residual block to the predicted block. The unfiltered reconstructed pixels of the current block are used as neighbor or boundary pixels for the generation of the prediction samples for the next block in the video frame.

The H.264/AVC video coding standard supports multiple intra-prediction modes for the 4 × 4 and 16 × 16 luminance pixel block types and the 8 × 8 chrominance pixel block type, as explained in the following subsections.
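The reconstruction step described above amounts to an element-wise addition followed by clipping to the 8-bit sample range. A minimal sketch (function names are ours):

```python
def clip_8bit(v):
    # Clamp a reconstructed sample to the valid 8-bit pixel range [0, 255].
    return max(0, min(255, v))

def reconstruct_block(pred, residual):
    # pred and residual are same-sized 2-D lists; the residual comes out of
    # the inverse quantization (Q^-1) / inverse transform (T^-1) path and
    # may be negative, hence the clip after the add.
    return [[clip_8bit(p + r) for p, r in zip(prow, rrow)]
            for prow, rrow in zip(pred, residual)]
```

The unclipped, unfiltered reconstructed samples of this block then serve as the boundary pixels for predicting the next block.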


A. Intra-prediction 4 × 4 Luminance Block

There are 9 intra-prediction modes for a 4 × 4 luminance pixel block in H.264/AVC. Each of these prediction modes generates a 4 × 4 predicted pixel block using some or all of the neighboring pixels, as depicted in Fig. 2. Since the prediction samples for the Vertical (VT) and Horizontal (HZ) prediction modes are the same as the boundary pixels, these modes do not require any processing and are easy to implement. Similarly, the DC prediction mode computes the average value of the valid boundary pixels and uses it for all the prediction samples. The remaining 6 modes (modes 3 to 8), however, compute the prediction samples using 3- or 2-tap filter kernels. The filter equations for these modes in the H.264/AVC video coding standard require 59 addition operations [1] for the computation of the prediction samples. Efforts have been made to reduce the addition operations by efficient reuse of intermediate results; the optimized data path proposed in [12] requires 33 addition operations, the lowest number reported in the literature.

In this paper, we propose to decompose the filter kernels in the unique intra-prediction equations for luminance intra 4×4 modes 3 to 8. The final 24 filter equations after decomposition are depicted in Fig. 4. With the proposed

1.  s1  = x-1 + x0 + 1
2.  s2  = x0 + x1 + 1
3.  s3  = x1 + x2 + 1
4.  s4  = x2 + x3 + 1
5.  s5  = x3 + x4 + 1
6.  s6  = x4 + x5 + 1
7.  s7  = x5 + x6 + 1
8.  s8  = x6 + x7 + 1
9.  s9  = x-1 + y0 + 1
10. s10 = y0 + y1 + 1
11. s11 = y1 + y2 + 1
12. s12 = y2 + y3 + 1

13. t1  = s1 + s2  = (x-1 + 2x0 + x1 + 2)
14. t2  = s2 + s3  = (x0 + 2x1 + x2 + 2)
15. t3  = s3 + s4  = (x1 + 2x2 + x3 + 2)
16. t4  = s4 + s5  = (x2 + 2x3 + x4 + 2)
17. t5  = s5 + s6  = (x3 + 2x4 + x5 + 2)
18. t6  = s6 + s7  = (x4 + 2x5 + x6 + 2)
19. t7  = s7 + s8  = (x5 + 2x6 + x7 + 2)
20. t8  = s8 + 2x7 + 1 = (x6 + 3x7 + 2)

21. t9  = s9 + s10  = (x-1 + 2y0 + y1 + 2)
22. t10 = s10 + s11 = (y0 + 2y1 + y2 + 2)
23. t11 = s11 + s12 = (y1 + 2y2 + y3 + 2)
24. t12 = s12 + 2y3 + 1 = (y2 + 3y3 + 2)

Figure 4. Proposed decomposed filter kernels for 4x4 intra-prediction modes.
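The decomposition in Fig. 4 can be checked directly in software: each si is a rounded 2-tap sum, each ti reuses two adjacent si values (one extra addition each) and expands to the standard rounded 3-tap filter. A minimal sketch, with a data layout of our choosing (x as a dict so that x[-1] denotes the corner pixel, not Python list indexing):

```python
def decomposed_kernels(x, y):
    # x: dict of boundary pixels x[-1]..x[7] (corner + top / top-right row)
    # y: list of boundary pixels y[0]..y[3] (left column)
    s = [None] * 13                    # s[1]..s[12], Fig. 4 numbering
    for i in range(1, 9):              # s1..s8: 2-tap rounded sums, top row
        s[i] = x[i - 2] + x[i - 1] + 1
    s[9] = x[-1] + y[0] + 1            # corner joins the left column
    for i in range(10, 13):
        s[i] = y[i - 10] + y[i - 9] + 1

    t = [None] * 13                    # t1..t12: 3-tap sums, one add each
    for i in range(1, 8):
        t[i] = s[i] + s[i + 1]
    t[8] = s[8] + 2 * x[7] + 1         # boundary case: x7 weighted by 3
    for i in range(9, 12):
        t[i] = s[i] + s[i + 1]
    t[12] = s[12] + 2 * y[3] + 1       # boundary case: y3 weighted by 3
    return s, t
```

Counting one addition per si and one per ti gives the 24 additions claimed in the text (treating the +1 rounding as a carry-in, as a hardware adder would).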

decomposition, the number of additions to compute the predicted samples for these modes is reduced to 24. Our approach provides a 60% reduction in addition operations when compared with the standard unique intra-prediction equations and a 27% reduction when compared with the state of the art [12]. A comparison of the number of addition operations required for the 4x4 intra-prediction modes is provided in Fig. 3. In Section IV, we show that the proposed decomposition helps to design common data paths for the intra-prediction unit, which further reduces the on-chip area cost of its hardware implementation.
5. predc[x,y] = Clip (a + b * (x - 3) + c * (y - 3) + 16 >> 5) 6. 7. Where 8. a = 16 * (p[-1, 7] + p[7,-1]) 9. b = (5 * H + 32) >> 6 10. c = (5 * V + 32) >> 6 11. 12. and H and V are specified as 13. H = Σ (x’ +1) * (p[4+x’, -1] - p[2-x’, -1], where x’ = 0 .. 3 14. V = Σ (y’ +1) * (p[-1, 4+y’] - p[-1, 2-y’], where y’ = 0 .. 3 15. 16 . End If

1. Intra Chroma Plane Prediction Mode-3 2. If (p[x, -1] and p[-1, y] with x = 0 .. 7 and, y = -1 ..7 are available) Then 3. 4. The values of the prediction samples predc[x,y] are derived as follows: 5. predc[x,y] = Clip (a + b * (x - 3) + c * (y - 3) + 16 >> 5) 6. 7. Where 8. a = 16 * (p[-1, 7] + p[7,-1]) 9. b = (5 * H + 32) >> 6 10. c = (5 * V + 32) >> 6 11. 12. and H and V are specified as 13. H = Σ (x’ +1) * (p[4+x’, -1] - p[2-x’, -1], where x’ = 0 .. 3 14. V = Σ (y’ +1) * (p[-1, 4+y’] - p[-1, 2-y’], where y’ = 0 .. 3 15. 16 . End If

Figure 5.

Intra-prediction algorithm for 8x8 luminance block.
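The plane-mode pseudocode of Fig. 5 translates almost line for line into software. In the sketch below, `p` is a hypothetical boundary-pixel accessor of our own devising, with `p(x, -1)` the row above the block and `p(-1, y)` the column to its left; Python's `>>` on negative integers is an arithmetic (floor) shift, matching the intent of the standard's formulas:

```python
def clip255(v):
    # Clip() of Fig. 5: clamp to the 8-bit sample range.
    return max(0, min(255, v))

def chroma_plane_prediction(p):
    # Gradients along the top row (H) and left column (V), Fig. 5 lines 13-14.
    H = sum((i + 1) * (p(4 + i, -1) - p(2 - i, -1)) for i in range(4))
    V = sum((i + 1) * (p(-1, 4 + i) - p(-1, 2 - i)) for i in range(4))
    a = 16 * (p(-1, 7) + p(7, -1))     # line 8
    b = (5 * H + 32) >> 6              # line 9
    c = (5 * V + 32) >> 6              # line 10
    # Line 5: evaluate the fitted plane at every sample position.
    return [[clip255((a + b * (x - 3) + c * (y - 3) + 16) >> 5)
             for x in range(8)] for y in range(8)]
```

For a constant boundary the gradients vanish and every predicted sample equals the boundary value, which is a quick sanity check on the rounding terms.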

1. Optimized Intra Chroma Plane Prediction Mode-3
2.
3.    H = (x4 - x2) + 2 * (x5 - x1) + 3 * (x6 - x0) + 4 * (x7 - x-1)
4. => H = (x4 + 2*x5 + 3*x6 + 4*x7 + 4) - (x2 + 2*x1 + 3*x0 + 4*x-1 + 4)
5. => H = ((x4 + 2*x5 + x6 + 2) + 2*(x6 + x7 + 1) + 2*x7) - ((x2 + 2*x1 + x0 + 2) + 2*(x0 + x-1 + 1) + 2*x-1)
6. => H = (t6 + 2*s8 + 2*x7) - (t2 + 2*s1 + 2*x-1)
7. => H = (t6 - t2) + 2 * ((s8 - s1) + (x7 - x-1))
8.
9. Similarly, V is derived using the corresponding vertical (y) boundary-pixel sums.
10.
11. predc[x,y] = Clip ((k + b*x + c*y) >> 5), from line 5 of Fig. 5
12. where k = a - 3*(b + c) + 16
13. => k = 16 * (x7 + y7) - 3*(b + c) + 16, from line 8 of Fig. 5
14. => k = ((x7 + y7 + 1) << 4) - 3*(b + c)

Figure 6. Proposed optimizations for the chroma plane prediction mode.
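The algebra above can be sanity-checked numerically: the refactored H must match the direct weighted sum of Fig. 5, and the folded constant k must equal a - 3(b + c) + 16. A sketch under the Fig. 4 definitions of the s/t sums (data layout ours):

```python
# Verify the optimized chroma-plane identities on the top boundary row.
# x: dict of boundary pixels x[-1]..x[7]; s/t follow the Fig. 4 definitions.

def check_plane_h(x):
    s1 = x[-1] + x[0] + 1
    s8 = x[6] + x[7] + 1
    t2 = x[0] + 2 * x[1] + x[2] + 2
    t6 = x[4] + 2 * x[5] + x[6] + 2
    # Direct definition (Fig. 5, line 13) vs. optimized form (line 7 above).
    h_direct = sum((i + 1) * (x[4 + i] - x[2 - i]) for i in range(4))
    h_opt = (t6 - t2) + 2 * ((s8 - s1) + (x[7] - x[-1]))
    return h_direct, h_opt

def check_plane_k(x7, y7, b, c):
    a = 16 * (x7 + y7)
    k_direct = a - 3 * (b + c) + 16           # line 12 above
    k_opt = ((x7 + y7 + 1) << 4) - 3 * (b + c)  # line 14: +16 folded into <<4
    return k_direct, k_opt
```

Reusing t6, t2, s8, and s1 from the 4×4 datapath is what lets the plane-mode gradients share adders with the directional modes.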

IV. THE HARDWARE ACCELERATOR ORGANIZATION

In this section, the different building blocks of the top-level hardware design are first introduced. Subsequently, the data flow for the generation of the different intra modes in the core processing block is explained.

The top-level hardware organization of the intra-prediction unit is depicted in Fig. 7. It consists of the Neighbor Pixel Buffer block, the Prediction Samples Selection block, the Control block, and the Intra-prediction Core Processing block. The Neighbor Pixel Buffer, as the name suggests, stores the boundary pixels for the current block. For a 4 × 4 block, depending upon the decoding order, block position, and encoding conditions, up to 13 boundary pixels (8 bits each) are stored in the buffer, whereas for the 16 × 16 block it provides storage registers to hold at most 33 boundary pixels. The valid boundary pixels are provided to the intra-prediction core processing block for the computation of the prediction samples. The same buffer is also used to store partial results during the processing of the 16×16/8×8 luminance/chrominance DC and plane modes. All of the processing to compute any given intra-prediction mode takes place in the intra-prediction core processing block.
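The role of the Neighbor Pixel Buffer can be mimicked in software: it holds up to 13 boundary samples for a 4×4 block and up to 33 for a 16×16 block, and it must track which boundary positions are actually valid (neighbors may be missing at frame edges or slice borders). The class below is an illustrative reading of the description above; the sizing constants come from the text, everything else is our own naming:

```python
class NeighborPixelBuffer:
    # Illustrative model of the boundary-pixel storage described in the text:
    # up to 13 entries for a 4x4 block, up to 33 for a 16x16 block.
    def __init__(self, block_size):
        assert block_size in (4, 16)
        self.capacity = 13 if block_size == 4 else 33
        self.pixels = {}                 # position label -> 8-bit sample

    def store(self, label, sample):
        assert 0 <= sample <= 255, "samples are 8-bit"
        assert label in self.pixels or len(self.pixels) < self.capacity
        self.pixels[label] = sample

    def valid(self, label):
        # Boundary pixels may be unavailable at frame edges / slice borders,
        # so consumers must check validity before using a position.
        return label in self.pixels
```

In the actual design the same registers double as scratch storage for the partial DC and plane-mode results, which this sketch does not model.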
