MPEG2 video decompression on a multiprocessing VLIW microprocessor

August 11, 2017 | Author: S Sudharsanan | Category: Computer Architecture, Data Compression, Video Compression, CPU, Instruction Set Architecture, VLIW, Decoding, Instruction Sets


MPEG-2 Video Decompression on a Multi-processing VLIW Microprocessor

Parthasarathy Sriram, Subramania Sudharsanan and Amit Gulati
Sun Microsystems, Inc., 901 San Antonio Road, Palo Alto, CA 94303, USA

Abstract
Compression and decompression of MPEG-2 video play a major role in entertainment delivery devices. This paper investigates software MPEG-2 video decoding on a multiprocessing microprocessor, MAJC-5200. MAJC (Microprocessor Architecture for Java Computing) is a very long instruction word (VLIW) processor with an instruction set geared towards multimedia computing. In this paper, we briefly describe the microprocessor and show that less than 50% of a single CPU, out of the two available, is sufficient for decoding 4 Mb/s video sequences.

1. Introduction

Many implementations of MPEG-2 for VLIW processors have been reported in the industry [1]. Many of these VLIW processors contain MPEG-2 specific assist blocks such as variable length decoders. A salient point of MAJC microprocessors is their multilevel parallelism: thread, instruction, and single instruction multiple data (SIMD) levels. This paper describes how MPEG-2 decoding can be performed completely in software by exploiting the multilevel parallelism of MAJC. In the next section, we provide a brief description of a MAJC microprocessor and its multimedia features. In Section 3, we briefly discuss the MPEG-2 decode implementation for both uni- and multiprocessor operation. Section 4 provides results and discussion.

Presented at ICCE, June 2000, Los Angeles, CA

2. MAJC-5200

MAJC [2] is scalable, exploits parallelism at a hierarchy of levels, and is modular for ease of implementation. At the highest level of parallelism, the architecture provides inherent support for multiple processors on a chip. The next level is vertical micro-threading, which is attained through hardware support for rapid, low-overhead context switching. Context switches can be triggered either by a long-latency memory fetch or by other events. The next level of parallelism comes from an improved very long instruction word (VLIW) architecture. The instruction packets can vary in length, up to 128 bits, with a maximum of four 32-bit instructions per packet. The lowest level of parallelism comes from single instruction multiple data (SIMD), or sub-word, parallelism.

The first implementation of MAJC, MAJC-5200, is a 500 MHz, dual-CPU multimedia processor with a high I/O bandwidth. It implements several key parallelism features of MAJC. The two CPUs share a coherent four-way set-associative 16-KB data cache and common external interfaces. Each of these CPUs is a four-issue MAJC VLIW engine and contains its own two-way set-associative instruction cache of 16 KB. The high throughput requirement is addressed by a multitude of interfaces with built-in controllers. The main memory is a direct Rambus DRAM (DRDRAM) with an interface supporting a peak transfer rate of 1.6 GB/s. A direct interface to 32-bit/66 MHz PCI provides DMA and programmed I/O (PIO) capabilities to transfer up to 264 MB/s. There are two other high-speed parallel interfaces (64 bits at 250 MHz), North and South UPA, that support up to 4.0 GB/s. (Universal Port Architecture, or UPA, has been an interface for graphics and multi-processor configurations of UltraSparc-based systems [3].) The NUPA block contains a 4 KB input FIFO buffer that can also be accessed by both CPUs. The other specialized block in the chip is a graphics preprocessor (GPP). The GPP has built-in support for real-time 3D geometry decompression, data parsing, and load balancing between the two processors. An on-chip Data Transfer Engine (DTE) provides DMA capabilities among these various memory and I/O devices, with the bus interface unit acting as a central crossbar.

MAJC-5200 provides support for 64-, 32-, 16-, and 8-bit integers in addition to 32- and 16-bit fixed-point numbers. Both single (32-bit) and double (64-bit) IEEE 754-1985 floating-point numbers are supported as well. The core is capable of issuing four standard arithmetic and logical operations per cycle. One load or store operation on 1-, 2-, 4-, or 8-byte quantities, along with three fused multiply-add operations, can be performed in a single cycle. SIMD versions operate on 16-bit short integer pairs or on S.15 or S2.13 fixed-point formats (Sign.integer.fraction). Another unique set of SIMD instructions provides six-cycle parallel divide and reciprocal square root for S2.13 fixed-point data. These instructions form a basis for very powerful DSP, media, and graphics capabilities: the fused multiply-add and dot-product instructions aid filtering and transform operations. At a 500 MHz clock frequency, the peak performance for the dual CPUs is more than 12.33 GOPS for 16-bit quantities and 6.16 GFLOPS for single-precision floating-point operations. MAJC-5200 also implements instructions to facilitate bit and byte manipulations. A two-level, 4K-entry branch prediction array with 12-bit history minimizes branch-taken latencies. In addition, predication instructions along with a large register file (224 logical registers) help minimize branches in the instruction stream for improved performance.
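To make the Sign.integer.fraction notation concrete, the following Python sketch converts values to and from 16-bit S.15 and S2.13 words and packs two such words into one 32-bit register image, as a SIMD pair would be held. It assumes two's-complement storage and saturation on conversion, and it models the lane-wise add as wrapping modulo 2^16; none of these details are stated in the paper, and all function names are hypothetical.

```python
from typing import Tuple

S15_FRAC_BITS = 15    # S.15: sign bit + 15 fraction bits, range [-1, 1)
S2_13_FRAC_BITS = 13  # S2.13: sign bit + 2 integer bits + 13 fraction bits, range [-4, 4)


def to_fixed(x: float, frac_bits: int) -> int:
    """Quantize x to a signed 16-bit word (assumed two's complement, saturating)."""
    raw = round(x * (1 << frac_bits))
    return max(-32768, min(32767, raw))


def from_fixed(raw: int, frac_bits: int) -> float:
    """Convert a signed 16-bit fixed-point word back to a float."""
    return raw / (1 << frac_bits)


def pack_pair(hi: int, lo: int) -> int:
    """Pack two signed 16-bit values into one 32-bit word."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def unpack_pair(word: int) -> Tuple[int, int]:
    """Split a 32-bit word into two sign-extended 16-bit lanes."""
    def sign_extend(v: int) -> int:
        return v - 0x10000 if v & 0x8000 else v
    return sign_extend((word >> 16) & 0xFFFF), sign_extend(word & 0xFFFF)


def simd_add16(a: int, b: int) -> int:
    """Lane-wise add of two packed pairs; each lane wraps (an assumption)."""
    a_hi, a_lo = unpack_pair(a)
    b_hi, b_lo = unpack_pair(b)
    return pack_pair(a_hi + b_hi, a_lo + b_lo)
```

The same 16-bit pattern means different things in the two formats: the word 12288 is 1.5 in S2.13 but 0.375 in S.15, so the format is a convention carried by the code rather than by the data.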

© IEEE 2000

3. MPEG-2 Implementation

The implementation of MPEG-2 decoding closely follows MAJC's hierarchy of multiple levels of parallelism. This is made feasible by exploiting MPEG's bitstream hierarchy. Slices in an MPEG stream can be located through their slice headers and decoded independently, without any serial dependency. Using this, we create our decoder as shown in Figure 1a. As the first step, the video syntax analyzer extracts the appropriate header information and performs computations to balance the load across K independent processing groups. This information is used by a monitor process to create K threads that handle independent decoding of portions of the picture. A synchronization process finally ensures the completion of the threads before sending the decoded picture for display. Each thread handles the decoding of a prescribed number of macroblocks (MBs). This model naturally fits into a multiprocessor environment.

MAJC-5200 has a shared, dual-ported data cache that enables fast synchronization between the two on-chip CPUs. This suits the above decoder model, permitting simultaneous execution of two threads. In addition, this mechanism helps utilize the high memory bandwidth of 1.6 GB/s that the DRDRAM provides. Each sub-process is further able to exploit the lower levels of parallelism the chip provides.

Processing of each MB consists of variable length decoding (VLD) of the Huffman-coded, quantized DCT coefficients; inverse zig-zag scanning (IZZ) of the decoded coefficients; inverse quantization (IQ); inverse DCT (IDCT); and motion compensation (MC). To use MAJC's instruction-level parallelism effectively, VLD, IZZ, and IQ are merged together using a pipelining mechanism. Similarly, the IDCT and MC functions are combined for improved parallelism. IDCT and MC also make heavy use of MAJC's SIMD instructions for sub-word parallelism. Figure 1.b depicts the described MB decoding procedure. Within this process, it is possible either to do all steps for each MB or to repeat each step for several MBs. This flexibility helps adjust for optimum cache performance. Figure 1.c shows how each sub-process or thread can be split into multiple (N) iterations for a group of MBs.
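As an illustration of merging stages, the sketch below fuses inverse zig-zag scanning and inverse quantization into a single pass over already-decoded (run, level) symbols, so each coefficient is placed and rescaled in one step instead of in two separate loops over the block. It is a minimal sketch under stated assumptions, not the paper's implementation: VLD is assumed to have produced the (run, level) pairs, the scan is the conventional 8x8 zig-zag order, and the rescaling is a simplified form (level x weight x quantiser scale / 16, truncated toward zero) that omits DC handling, saturation, and mismatch control. All names are hypothetical.

```python
from typing import List, Sequence, Tuple


def zigzag_order(n: int = 8) -> List[Tuple[int, int]]:
    """Return (row, col) positions in conventional zig-zag scan order."""
    order = []
    for s in range(2 * n - 1):
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        if s % 2 == 0:
            diag.reverse()  # even anti-diagonals run from bottom-left to top-right
        order.extend(diag)
    return order


ZIGZAG = zigzag_order()


def izz_iq_block(run_levels: Sequence[Tuple[int, int]],
                 weights: Sequence[Sequence[int]],
                 qscale: int) -> List[List[int]]:
    """Place and dequantize each decoded symbol in one pass (simplified IQ)."""
    block = [[0] * 8 for _ in range(8)]
    pos = 0
    for run, level in run_levels:
        pos += run                      # skip 'run' zero coefficients
        row, col = ZIGZAG[pos]          # inverse zig-zag: scan index -> position
        prod = level * weights[row][col] * qscale
        magnitude = abs(prod) // 16     # truncate toward zero
        block[row][col] = magnitude if prod >= 0 else -magnitude
        pos += 1
    return block
```

For example, with a flat weight matrix of 16 and `qscale = 2`, the symbols `[(0, 4), (1, -2), (0, 1)]` land at scan positions 0, 2, and 3 and are rescaled to 8, -4, and 2. Because zero coefficients are never visited, the cost of this loop tracks the number of coded symbols rather than the block size.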

[Figure 1a: Decoder structure. Block labels: Buffered Video; Syntax analyzer; Monitor process; Sub-processes 0, 1, ..., K-1; Synch. Process; Display Process; MB decode.]
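The structure in Figure 1a (analyze, split into K groups, decode in parallel, synchronize, display) can be sketched in Python as follows. This is only a structural illustration: the paper does not say how the load is balanced, so the greedy largest-first assignment and the per-slice cost estimate are assumptions, and `decode_slice` is a hypothetical placeholder for the real macroblock decoding. Threads in CPython also do not run CPU-bound work in parallel, so the sketch shows the control flow rather than the speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence


def balance(slice_costs: Sequence[int], k: int) -> List[List[int]]:
    """Assign slices to k groups, largest first, always to the lightest group."""
    groups: List[List[int]] = [[] for _ in range(k)]
    loads = [0] * k
    for idx in sorted(range(len(slice_costs)), key=lambda i: -slice_costs[i]):
        g = loads.index(min(loads))
        groups[g].append(idx)
        loads[g] += slice_costs[idx]
    return groups


def decode_slice(idx: int) -> str:
    """Hypothetical stand-in for decoding all macroblocks of one slice."""
    return f"slice-{idx}"


def decode_group(group: Sequence[int]) -> Dict[int, str]:
    """Work done by one sub-process: decode its slices independently."""
    return {idx: decode_slice(idx) for idx in group}


def decode_picture(slice_costs: Sequence[int], k: int = 2) -> List[str]:
    """Monitor: spawn k workers; synchronize before handing off for display."""
    groups = balance(slice_costs, k)
    decoded: Dict[int, str] = {}
    with ThreadPoolExecutor(max_workers=k) as pool:
        for result in pool.map(decode_group, groups):
            decoded.update(result)      # all groups finished once the loop ends
    return [decoded[i] for i in range(len(slice_costs))]
```

With `slice_costs = [5, 3, 3, 1]` and `k = 2`, the balancer produces the groups `[[0, 3], [1, 2]]`, each with a total cost of 6, and `decode_picture` returns the slices in picture order regardless of which worker finished first.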

[Figures 1.b and 1.c: MB decoding procedure and per-thread iteration over groups of MBs. Recoverable labels: SubProc. 0; MB header, VLD/IZZ/IQ, and MC/IDCT, each marked "Repeat M times"; Header Extract; Repeat N times; Ref. Frame; MB dec.]

4. Results

The decoder performance testing was carried out using an instruction-accurate simulator (IS) and a cycle-accurate simulator (CS) for the processor. The IS gives the maximum possible performance, while the CS gives the performance including memory effects. The performance results for decoding the Mobile sequence at Main-Profile/Main-Level resolution, obtained using the IS at different encoding bit-rates, are shown in Table 1. In addition, the performance of the three key computational blocks of this implementation is also shown in Table 1. The experiment was conducted using a single thread with M=2 (refer to Figure 1.b).

Table 1. Raw Performance in Million Cycles per Second (MC/s)

Operation        4.2 Mb/s Sequence   9.5 Mb/s Sequence   15.3 Mb/s Sequence
Total Decode     210 MC/s            256 MC/s            282 MC/s
MC/IDCT          164 MC/s            164 MC/s            164 MC/s
VLD/IZZ/IQ       38 MC/s             58 MC/s             76 MC/s
Header + Cntrl   4 MC/s              16 MC/s             32 MC/s

It is evident from Table 1 that the decoding of a 9.5 Mb/s MP@ML MPEG-2 sequence takes only approximately 50% of one MPU of MAJC-5200 (or 25% of the chip overall). The MC/IDCT procedure accounts for a large share of the overall computation for the 4.2 Mb/s sequence (about 78%). The primary reason is that all MBs were treated as having coded coefficients, so IDCTs were always performed for all the blocks. In addition, since the MC and IDCT processes were interleaved, the processor stalls incurred by loading reference data not present in the data cache were greatly reduced. A good amount of instruction-level parallelism was also obtained in this procedure: the average issue width was about 2.95 instructions/cycle.

Another significant result to note from Table 1 is that the computational requirements of the MC/IDCT block are not affected by the change in bit-rate. As a result, the computational requirements for decoding higher bit-rate sequences do not scale as much as the ratio of the bit-rate increase. For example, the overall computational complexity increased by only 22% and 34% at 9.5 and 15.3 Mb/s over that at 4.2 Mb/s, even though the bit-rates increased by 126% and 264% respectively. This was due to the fact that the VLD/IZZ/IQ process at higher bit-rates was designed to decode a symbol per iteration.

The estimated performance results for the real machine, obtained from the CS for decoding two MP@ML 4.2 Mb/s MPEG-2 sequences independently (one in each MPU), are shown in Table 2. In addition, the performance for decoding the sequences separately (when the other processor is idle) is shown. It is evident from this table that the computational requirements increase by about 38% due to the interactions with the memory transactions of the second MPU.

Table 2. Estimated Real Performance

                 CPU0        CPU1
Sequence         Mobile      Fluent
Rate             4.2 Mb/s    4.2 Mb/s
Separate         264 MC/s    225 MC/s
Simultaneous     367 MC/s    315 MC/s
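The percentages quoted alongside Tables 1 and 2 can be re-derived from the table entries with a few lines of arithmetic. The sketch below does so in Python, taking the 500 MHz clock from Section 2 as the per-CPU cycle budget; the variable names are illustrative only. Computed from the Table 2 entries as reproduced here, the per-CPU increase under simultaneous decoding comes out at roughly 39 to 40%, close to the figure of about 38% quoted in the text.

```python
CLOCK_MC_PER_S = 500  # one 500 MHz CPU supplies 500 million cycles per second

# Table 1 (instruction-accurate simulator), in million cycles per second
total_decode = {4.2: 210, 9.5: 256, 15.3: 282}
mc_idct = 164


def pct_increase(new: float, old: float) -> float:
    """Percentage increase of new over old."""
    return 100.0 * (new / old - 1.0)


# Share of MC/IDCT in the 4.2 Mb/s decode
mc_idct_share = 100.0 * mc_idct / total_decode[4.2]

# Fraction of one CPU consumed at each bit-rate
cpu_utilization = {rate: 100.0 * mc / CLOCK_MC_PER_S
                   for rate, mc in total_decode.items()}

# Growth in cycles versus growth in bit-rate, relative to 4.2 Mb/s
cycle_growth = {rate: pct_increase(total_decode[rate], total_decode[4.2])
                for rate in (9.5, 15.3)}
bitrate_growth = {rate: pct_increase(rate, 4.2) for rate in (9.5, 15.3)}

# Table 2 (cycle-accurate simulator), in million cycles per second
separate = {"CPU0": 264, "CPU1": 225}
simultaneous = {"CPU0": 367, "CPU1": 315}
sharing_overhead = {cpu: pct_increase(simultaneous[cpu], separate[cpu])
                    for cpu in separate}

if __name__ == "__main__":
    print(f"MC/IDCT share at 4.2 Mb/s: {mc_idct_share:.1f}%")
    for rate in (9.5, 15.3):
        print(f"{rate} Mb/s: cycles +{cycle_growth[rate]:.0f}%, "
              f"bit-rate +{bitrate_growth[rate]:.0f}%")
    for cpu, pct in sharing_overhead.items():
        print(f"{cpu} overhead when both CPUs decode: {pct:.1f}%")
```

The utilization figures also show why the result holds at the chip level: even the simultaneous case in Table 2 (367 and 315 MC/s) stays below the 500 MC/s available on each CPU.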


We have shown in this paper that a multi-processing VLIW microprocessor with a hierarchy of parallelism can be exploited to decode MPEG-2 sequences efficiently. Results from simulators show that a 500 MHz dual-CPU chip needs less than 50% of its compute power for an MP@ML decode.

REFERENCES

1. Philips Trimedia SDE Reference Manual, http://www.trimedia.philips.com
2. M. Tremblay, "A microprocessor architecture for a new millennium," Hot Chips 11, August 1999.
3. Sun Microsystems, http://www.sun.com/microelectronics
4. B. G. Haskell, A. Puri and A. N. Netravali, Digital Video: An Introduction to MPEG-2, Chapman and Hall, New York, NY, 1997.
