Ultra low-power multimedia processor for mobile multimedia applications


ESSCIRC 2002

ULTRA LOW-POWER MULTIMEDIA PROCESSOR FOR MOBILE MULTIMEDIA APPLICATIONS

D. Alfonso, A. Artieri, A. Capra, M. Mancuso, F. Pappalardo, F. Rovati and R. Zafalon
STMicroelectronics

Abstract

This paper deals with a set of technologies that allowed us to achieve one of the most power-efficient devices running multimedia applications in portable devices. We will demonstrate that if proper system solutions, methodology, and power-conscious design and architectures are taken into account from the early phase of the design, the trade-off between performance and power can be successfully resolved, achieving leading-edge performance. An integrated approach allowing full multimedia (audio and video) processing, as well as a complex operating system, to run on a single VLSI die with power dissipation confined to a few milliwatts will be thoroughly described in this paper. Optimization starts from algorithm definition and proceeds down to low-level HW implementation, giving system designers the best trade-off in terms of performance and flexibility.

1. INTRODUCTION

The booming of the portable electronic device market is motivating an increased emphasis on design techniques that enable effective low-power VLSI and system design. However, a host of advanced embedded systems, such as Personal Digital Assistants and communicators, multimedia messaging [2], PIM applications, wearable smart appliances, new-generation home entertainment systems, networked home appliances and portable multimedia devices (DVD players, still cameras, etc.), still require a large amount of data-processing throughput. For multimedia devices, the predominant goal of low-power design is to minimize the total power dissipation of the system, given fixed data throughput and price requirements. With portable computing, low power requires a system design that balances the ergonomics of the user model with speed and power constraints, forcing designers to re-evaluate performance metrics and criteria for this growing class of systems. While visual communication and PIM applications will enhance the communication model and the functionality of the device, they will require a camera and a bigger, better-quality display. The designer who wishes to develop an efficient image/video acquisition and encoding system for mobile

devices has to deal with the constraints imposed by the working modes (still/video capture) and the related codecs. Still-picture acquisition requires a higher sensor resolution than the one imposed by the adopted standard video codecs. Solutions must be found to avoid incurring the power consumption of a VGA sensor, and its associated processing, while capturing a QCIF video clip, for example. The available bandwidth will be in the order of 64 kb/s to 384 kb/s: high-performance algorithms, such as those the MPEG-4 [1] standard uses, must then be adopted to achieve the highest compression without compromising the quality of the encoded video sequences. Since motion estimation is one of the most computation-intensive tasks in video encoding, it is crucial to find a solution that is efficient in terms of both algorithm and hardware implementation. During the last decade we have seen an impressive technology enhancement, allowing regular chip-density growth (still matching Moore's law!) as well as supply-voltage scaling and power-per-gate reduction. By adopting such "technology brute force", meeting the first-order power constraints that have arisen so far has in fact been quite easy, and certainly faster (though more expensive) than considering the whole power-budgeting issue from a true design perspective. Unfortunately, we now see that power constraints may well dictate the ultimate boundary of the IC super-integration product market. Cost is severely impacted by the necessity of dissipating heat to guarantee correct operation and sufficient system lifetime, as well as by signal-integrity issues and electromagnetic compatibility. Portability is hampered by the bulky battery systems needed by power-hungry electronic devices to achieve an acceptable time between recharges. By considering the integration rate and projecting the future supply-voltage scaling, it becomes clear that, from a full-chip perspective, power per logic function scales down much more slowly than integration density grows.
As a consequence, the total power budget per chip will grow steadily, according to industrial experience as well as to the SIA Roadmap 2000 [3]. Given the above highlights and projected trends, there is evidence that, in order to continue the cost-performance challenge for the next generation of embedded computing, industry needs to adopt a thorough


“Power Conscious” design strategy in all product segments, not just in mobile or hand-held units. In other words, new systems-on-silicon for embedded computing will be required to provide exponentially increasing computational capabilities at a roughly constant price. Such a need will become pervasive tomorrow (if it is not present today), requiring a tight link between design for power AND performance at all levels, including algorithms and protocols, processor and DSP microarchitecture, as well as multiple clock-domain optimization, ultra-low-voltage circuits, new logic families (multi-Vt, dynamic Vt), efficient leakage control and, on top of it all, enhanced modeling capabilities to reliably explore the Area-Time-Power (ATP) design space.

• Low-power methodology should span the whole range

A really effective low-power methodology should span and optimize all of the design abstraction levels across the flow: high-level system design (including power-aware SW and RT-OS), run-time power management, architectural exploration, memory hierarchy and global bus optimization, down to process and technology. The designer should make optimizations at all levels of the design space, which have a largely cumulative effect on system power saving. As of today, the shared experience of several design teams in the field of low-power design gives evidence that circuit- and gate-level optimization techniques typically have less than a 2X impact on power, while appropriate approaches to architecture, algorithms, run-time power management and SW applications offer savings of 10X to 100X and more. We would like to draw attention to energy-aware Run Time Management (RTM) techniques, since many current SoCs are already complex enough to require a significant amount of deeply embedded software to control their operation, and many silicon vendors now consider RTM development an integral part of the SoC design process.
The development of RTM software (for embedded platforms) aims to coordinate the activity of complex SoCs with the primary objective of achieving energy-efficient utilization of on-chip hardware resources, fully exploiting system flexibility to achieve high energy efficiency while, at the same time, meeting the performance requirements. The RTM software (especially the SW layer represented by the RT-OS and SW drivers) becomes an integral part of the SoC design, in the same way a hardwired controller is considered a part of the hardware unit it controls. Another important class of embedded systems includes applications, such as video image processing, audio digital filtering and speech recognition, which are extremely memory dominant. In such systems, a


significant amount of power is consumed during memory accesses. This consideration also leads us to explore the best memory hierarchy and uP-to-memory bus bandwidth for the embedded system, which can considerably impact the total power budget of the whole design. In the following, Section 2 addresses the basic stream flow of the image-capturing subsystem, with a particular focus on bus-encoding optimization for low power. Section 3 provides an overview of the SLIMPEG motion estimation subsystem.

2. Image Capture

Off-chip global bus lines are generally loaded with large capacitances, up to three orders of magnitude larger than the average on-chip interconnect capacitance in VLSI circuits. When using standard CMOS signalling, the switching power dissipated over the bus wires can be expressed as follows:

P_Bus = (1/2) · Σ_i α_i · C_i · V_dd² · f_clk        (1)

Although by lowering the power rail (V_dd) the designer may achieve a quadratic saving in energy, it is widely recognized in industry that a multiple-V_dd strategy can severely impact the signal interfaces (level shifters need to be carefully inserted and managed), the physical implementation and the power grids of the IC (leading to higher design risk and cost). As a consequence, the main focus for system-level communication channels today is to minimize the information entropy of the transaction while strictly preserving the actual semantics. Hence, a way of reducing power dissipation in bus drivers is to minimize the average transition activity (α_i) by encoding the data sent over the bus according to a suitable policy. The switching power will then decrease linearly with the average transition rate. Based on this observation, several encoding schemes have been proposed, aiming to reduce the average transition activity across the bus under a rigorous lossless policy [4].
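The dependencies in Eq. (1) can be made concrete with a short numerical sketch. The function below evaluates the formula directly; the line capacitances, activity factors, supply voltage and clock frequency are illustrative values chosen by us, not measurements from the paper.

```python
def bus_power(alphas, caps_farad, vdd, f_clk):
    """Average switching power (W) of a bus per Eq. (1):
    P_Bus = 1/2 * sum_i(alpha_i * C_i) * Vdd^2 * f_clk."""
    assert len(alphas) == len(caps_farad)
    return 0.5 * sum(a * c for a, c in zip(alphas, caps_farad)) * vdd ** 2 * f_clk

# 10-bit off-chip bus, 10 pF per line, average activity 0.25,
# 2.5 V supply, 10 MHz transfer clock (all assumed for illustration).
p = bus_power([0.25] * 10, [10e-12] * 10, vdd=2.5, f_clk=10e6)
print(f"{p * 1e3:.3f} mW")  # 0.781 mW

# Halving the transition activity halves the power: the linear
# dependence that bus encoding exploits.
p_half = bus_power([0.125] * 10, [10e-12] * 10, vdd=2.5, f_clk=10e6)
print(p_half == p / 2)  # True
```

Note how the quadratic V_dd term dominates, which is why voltage scaling is so attractive, and why, when V_dd cannot be lowered further, reducing α_i is the remaining system-level lever.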

[Figure 1. Dedicated bus between the CMOS sensor (Bayer R/G/B pattern) and the Image Processing Unit, with an encoder on the sensor side and a decoder on the processing-unit side.]

It is worth highlighting that all the presented encoding schemes assume some knowledge of the statistical properties of the streams to be encoded. For that reason, they are primarily devoted to optimizing application-specific embedded systems.

Some of those encodings [5],[6],[7] exploit spatial redundancy by increasing the number of bus lines (i.e. extra routing area), while others exploit temporal redundancy by increasing the number of bits transmitted in successive bus cycles (i.e. extra latency) [8]. Finally, a few encoding techniques do not rely on spatial/temporal redundancy at all [9][10]. Theoretical issues in bus encoding for low transition activity are investigated in [11]. In that work, the authors introduce an information-theoretic framework for studying low-transition encoding, and prove a useful lower bound on the minimum achievable average transition activity. Several redundant and irredundant codes are then analyzed and compared to the theoretical bounds to assess their quality. In [12][13], the same authors introduce a generic encoder-decoder architecture that can be specialized to obtain an entire class of low-transition coding schemes. A few personalizations of the generic architecture are described, and the reductions in transition activity are compared. A comprehensive set of algorithms for the synthesis of codec interface logic minimizing the average number of transitions on heavily loaded bus lines is presented in [14][15]. In [15] the authors also propose a distinguishing feature that finds the optimum without relying solely on the designer's intuition, using instead a tool which automatically constructs low-transition-activity codes, together with hardware implementations of the encoders and decoders, for the given traffic statistics. In our paper, an accurate method applicable to narrow buses is presented, as well as approximate methods that scale well with bus width, in addition to an adaptive architecture that automatically adjusts the encoding on buses whose word-level statistics are not known a priori. A critical domain, where the transition activity of the channel is remarkably high, is the image capture section of a mobile multimedia system.

In particular, considerable results can be achieved by encoding the data sent by the Bayer sensor to the Image Processing Unit, as depicted in Figure 1. Assuming a VGA 10 bpp Bayer sensor that acquires 15 frames per second, the required throughput across the bus is more than 5.6 MB/s. This can heavily impact the total power budget of the system. Using a number of specific data types, some of the encoding algorithms reviewed in the literature have been investigated. In Bus-Invert coding [5], if the Hamming distance (i.e. the number of bits changed during a transition) between the new pattern to be transferred and the old one currently on the bus is larger than half the bus width, then the new pattern is transferred after inverting each bit. An additional line is used to flag to the decoder side whether the pattern is inverted or not. This technique becomes less efficient as the bus width gets larger (i.e. a 5-bit bus is encoded more efficiently than a 10-bit one). If we increase the bus width and the switching activity is distributed in a non-uniform fashion, it is more appropriate to exclude the bus lines that have little switching activity. In the case of unknown activity distribution and line correlation, a static selection of the bus width to encode (BI) is not efficient. For these cases, the set of bus lines to be actually encoded needs to be adapted to the specific pattern under transmission, by means of a weighted bus mask. The i-th bus line may (or may not) be included in the encoded subset of bus lines, and the mask is suitably adapted as time (and bus traffic) goes on. The efficiency of this Adaptive Partial BI encoding (APBI) [16] relies on the selection of the coding mask: according to the mask, only the bus lines showing a high switching activity become candidates for inversion by the BI algorithm. Unlike the basic BI technique, the APBI technique is more efficient on a larger bus width.

[Figure 2. Switching activity reduction for a bus connected to a Bayer sensor (10 bpp) using either BI or APBI: on a 5-bit bus, BI achieves 16.82% versus 7.31% for APBI; on a 10-bit bus, BI achieves 6.01% versus 13.06% for APBI.]

As shown in Figure 2, BI proves to be the optimal choice for reducing the switching activity on a narrow bus, also taking into account the low complexity of the coding/decoding logic required to implement BI. However, the overall bus power saving can be largely improved by leveraging, when possible, the spatio-temporal correlation of the transmitted data. Thus, by applying the BI coder to a narrow bus (i.e. half the size of a pixel element) and sending data through the channel in a reordered sequence (this requires a small line-buffer memory to store the latest pixels in the encoder, but can have a positive impact on data correlation), it is possible to reduce the original switching activity by up to 55%.
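The Bus-Invert rule described above is simple enough to sketch in a few lines. This is an illustrative software model of the codec pair (the function names are ours); the real implementation is, of course, a small logic circuit on each side of the bus.

```python
def bus_invert_encode(prev_bus, pattern, width):
    """Bus-Invert coding: if the Hamming distance between the new
    pattern and the value currently on the bus exceeds half the bus
    width, drive the inverted pattern and raise the invert flag."""
    mask = (1 << width) - 1
    hamming = bin((prev_bus ^ pattern) & mask).count("1")
    if hamming > width / 2:
        return (~pattern) & mask, 1  # send the inverted pattern
    return pattern & mask, 0

def bus_invert_decode(bus_value, invert_flag, width):
    """Recover the original pattern on the decoder side."""
    mask = (1 << width) - 1
    return bus_value ^ (mask if invert_flag else 0)

# Worst-case pattern: all 10 lines would toggle, so BI inverts and
# no data line toggles at all (only the flag line does).
sent, flag = bus_invert_encode(0b0000000000, 0b1111111111, 10)
print(sent, flag)                         # 0 1
print(bus_invert_decode(sent, flag, 10))  # 1023
```

With this rule the number of data-line transitions per cycle never exceeds half the bus width, at the cost of one redundant line; the APBI variant applies the same test only to a masked, adaptively chosen subset of lines.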


Figure 3. Switching activity reduction (%) versus the line-memory size used for data reordering

By adopting this solution, we found a good compromise between cost (the BI codec circuit is less expensive than APBI or other encoders) and overall performance.


A further crucial corner where we recognized an opportunity to reduce the power consumption is shown in Figure 4. When a QCIF image is required, a classic IGU implementation scales the image after the Image Generation Processor (IGP). This means that the IGP would handle a higher-resolution image (i.e. VGA) although a lower-resolution picture is eventually required (i.e. QCIF). Our strategy is depicted in Figure 5, where the image is first downscaled to the required output dimensions, preventing any extra computation effort from wasting time and power throughout the following processing chain.

[Figure 4. Bandwidth required from the sensor to the output of an unoptimized IGU architecture: the sensor sends VGA Bayer data at 10 bits (5.6 MB/s) to the Image Generation Pipeline, which produces VGA RGB at 8 bits per component (13.2 MB/s) that is only then scaled down to QCIF RGB (1.1 MB/s).]

In [17],[18] the authors propose a policy to optimize the solution shown in Figure 5. A pre-processing unit is integrated within the image sensor in order to produce a scaled intermediate format. In this way, the throughput and the total number of operations required by the IGP are drastically reduced.

[Figure 5. Optimized IGU: the sensor internally scales the VGA 15 fps Bayer data (5.6 MB/s) down to QCIF 15 fps Bayer at 10 bits (0.45 MB/s) before the Image Generation Pipeline, which outputs QCIF 15 fps RGB at 8 bits per component (1.1 MB/s), achieving a huge saving in computation, bandwidth and, thus, power consumption.]

As a matter of fact, the successful combination of bus encoding and image downscaling allowed us to reduce the bus traffic between the sensor and the IGP by more than 93% with respect to the classical, unoptimized configuration. This brings, of course, a proportional saving in bus communication energy.
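The bandwidth figures quoted for the two IGU configurations follow from simple frame-rate arithmetic. The sketch below reproduces them; it ignores sensor blanking intervals, which is why the raw numbers come out slightly above the rounded values in the figures, and it shows that downscaling alone accounts for roughly 92% of the sensor-bus traffic reduction, with bus encoding contributing the rest of the quoted 93%+.

```python
def bandwidth_mb_s(width, height, bits_per_pixel, fps):
    """Raw video bandwidth in MB/s (no blanking, 8 bits per byte)."""
    return width * height * bits_per_pixel * fps / 8 / 1e6

vga_bayer  = bandwidth_mb_s(640, 480, 10, 15)  # sensor -> IGP, Figure 4
qcif_bayer = bandwidth_mb_s(176, 144, 10, 15)  # sensor -> IGP, Figure 5
qcif_rgb   = bandwidth_mb_s(176, 144, 24, 15)  # IGU output, both cases

saving = 1 - qcif_bayer / vga_bayer
print(f"VGA Bayer : {vga_bayer:.2f} MB/s")   # 5.76 MB/s
print(f"QCIF Bayer: {qcif_bayer:.2f} MB/s")  # 0.48 MB/s
print(f"QCIF RGB  : {qcif_rgb:.2f} MB/s")    # 1.14 MB/s
print(f"Sensor-bus traffic saving from downscaling alone: {saving:.0%}")
```

Running this prints a saving of about 92%, consistent with the figures once the additional switching-activity reduction from Bayer bus encoding is added on top.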

3. Motion estimation in the video stream

Motion estimation is the tool that eliminates the temporal redundancy between adjoining frames of a video sequence. It is based on the assumption that each frame can be locally expressed as a translation of previous or successive frames of the sequence. The current frame is divided into non-overlapping square regions of 16x16 pixels, called macroblocks. Each macroblock is then differentially coded with respect to a similar region in a reference frame, called the predictor, pointed to by a motion vector: the


actual task of motion estimation is to locate the best predictor for each macroblock in the current frame. The traditional exhaustive-search approach, also known as Full Search, which evaluates every motion vector inside a given area called the search window, is simply unfeasible due to its high computational complexity, so over the past years many algorithms have been proposed to perform motion estimation with a reduced number of operations. Different techniques are known to reduce the computational complexity of the motion estimation process, mainly based on the decimation of the motion field. Other techniques decimate the pixel domain, computing an approximate value of the matching error. Among several options, spatio-temporal correlation algorithms have proved to offer excellent performance in terms of overall video quality versus the number of operations performed, and they are also suitable for efficient hardware implementations [19][20][21][22]. In this section, a novel motion estimator subsystem for Simple Profile MPEG-4 video encoding is proposed. It exploits a low-complexity motion estimation algorithm, achieving a video quality comparable to the Full Search approach with only a very small fraction of the original computational load.
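The complexity gap between Full Search and a candidate-based scheme comes down to how many motion vectors are evaluated per macroblock. The quick count below assumes a representative candidate budget of 10 vectors (candidates plus refinement updates) for the predictive scheme; that exact budget is our illustrative assumption, not a figure from the paper, but it reproduces the order of magnitude of the 99% claim.

```python
# Full Search over a [-R, +R] integer-pel window tests every vector.
search_range = 16
full_search = (2 * search_range + 1) ** 2  # 33 x 33 = 1089 SADs/macroblock

# A predictive scheme tests a small, fixed set of spatio-temporal
# candidates plus a few refinement updates (budget assumed here).
predictive = 10

reduction = 1 - predictive / full_search
print(f"Full Search: {full_search} SAD evaluations per macroblock")
print(f"Predictive : {predictive} SAD evaluations per macroblock")
print(f"Reduction  : {reduction:.1%}")  # about 99%
```

Because the Full Search count grows quadratically with the window radius while the predictive budget stays fixed, the saving only improves as the search window widens.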

3.1 Power-conscious algorithm

For mobile multimedia applications we faced the challenge of developing a very efficient algorithm, able to achieve high video quality even at the low available bandwidth, and with the lowest possible number of operations in order to minimize the power consumption. Our algorithm exploits the spatial and temporal correlation existing between neighbouring motion vectors belonging to the same frame (spatial correlation) or to adjacent frames (temporal correlation), operating in two steps. The first step is the identification of the best motion vector within a set of candidates, chosen among the available spatially and temporally correlated motion vectors, re-using the results of the motion estimation already performed on previous macroblocks. In order to reduce the computational workload, only the nearest spatial and temporal vectors are tested and, among them, the best one is selected. A refinement step is then applied around the position pointed to by the selected vector, looking for the optimal motion vector. The motion vectors actually tested, called updates, depend adaptively on the amount of motion contained in the video sequence under coding, while the number of vectors actually tested is fixed and predetermined for each macroblock, regardless of the search window range. In this way we obtain many advantages. First, we are able to track with great accuracy the real movement contained in video sequences, achieving excellent quality results in terms of both subjective evaluation and objective measures of PSNR (Peak Signal-to-Noise Ratio). Second, we dramatically reduce the computational complexity: for the typical search range of [-16;+16] pixels, our algorithm is 99% less complex than Full Search. Considering that the Full Search workload grows quadratically with the window size, our algorithm shows additional savings as the search window is widened. Third, our algorithm produces well-aligned motion fields, making the motion-vector differential coding strategy of MPEG-4 easier to apply; the result is a remarkable bit-rate saving. Fourth, by exploiting the behavioural characteristics of the algorithm, we were able to design an efficient system architecture for the motion estimator of the video encoder, highly optimised for power consumption and silicon area.

3.2 Architecture

[Figure 6. Motion estimator architecture: Motion Vector Processor, Predictor Builder, SAD Engine, Cache Memory and Memory Decoder; a dedicated bus carries the current macroblock data, while the predictor data arrives from the system bus through the memory decoder.]

The motion estimator, shown in Figure 6, is composed of five major blocks. The Motion Vector Processor is a programmable application-specific processor designed to run motion estimation algorithms. It has a hardware complexity comparable to a hardwired FSM explicitly designed to execute a specific motion estimation algorithm but, being programmable, it has the flexibility of a general-purpose CPU from the motion estimation point of view, and it can be easily reprogrammed to target different applications and video encoding standards. The chosen motion estimation algorithm is encoded using 27 instructions of 24 bits each, for a total memory size of 81 bytes. The Predictor Builder receives the predictor data from the frame buffer through a local cache memory, and produces the final 16x16 predictor through data alignment, sub-pixel interpolation and, if the vector points to an area partially outside the frame (unrestricted motion vector), border extension. The cache approach greatly reduces the bandwidth on the system bus and minimizes the conflicts of the motion estimator with the other blocks of the system. Furthermore, thanks to the algorithm's independence from the search window, the cache allows the search window size to be widened, increasing the quality performance while maintaining a fixed hardware complexity in terms of local memory size and bandwidth required on the system bus. A simple 2 KByte, 2-way set-associative cache is enough for QCIF and CIF video encoding. In order to further scale down the power consumption, we also introduced a feature to cap the peak bandwidth, allowing only a fixed and programmable number of cache refills during a macroblock period. The effect on the algorithm is that some predictors can be lost, because the cache controller does not allow their data to be fetched from the main memory. Simulations showed that the average quality loss caused by the bandwidth limitation is not visible at all when the peak bandwidth is halved. The overall bandwidth, and therefore the switching activity, is further reduced thanks to a lossy data-compression algorithm (Figure 7) which adaptively operates on luminance and chrominance samples, achieving a fixed compression ratio of nearly 50%. Data is stored in the frame buffer after compression and is decoded before entering the cache, thus halving the frame buffer size with negligible video quality loss (Figure 8). Finally, the SAD Engine computes the sum of absolute differences between the predictor and the current macroblock on a pixel-by-pixel basis, with a parallelism of eight operations per clock cycle. To evaluate the goodness of a predictor we chose the Sum of Absolute Differences instead of the Mean Square Error, because of the good reliability of its results and the lower computational power required. The motion estimator subsystem has a die area smaller than one square millimetre in 0.18 µm technology. It needs a clock frequency of less than 2 MHz for QCIF encoding at 15 fps and less than 7 MHz for CIF at the same rate.

[Figure 7. Average bandwidth in MB/s without (white bars) and with memory compression (black bars) for QCIF at 15 fps, across the test sequences (carphone, children, foreman, monitor, missa, mother, news, renata, silent, teeny). The dashed line is the bandwidth without cache.]
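The cost function evaluated by the SAD Engine is simple enough to show as a reference model. This is a plain-Python sketch for clarity, not the hardware datapath, which computes eight of these absolute differences per clock cycle.

```python
def sad(current, predictor):
    """Sum of Absolute Differences between two 16x16 pixel blocks,
    each given as a nested list of 8-bit luminance samples."""
    return sum(
        abs(current[y][x] - predictor[y][x])
        for y in range(16)
        for x in range(16)
    )

# A predictor identical to the macroblock scores 0 (perfect match);
# a predictor off by one grey level everywhere scores 256.
block = [[(x + y) % 256 for x in range(16)] for y in range(16)]
shifted = [[(x + y + 1) % 256 for x in range(16)] for y in range(16)]
print(sad(block, block))    # 0
print(sad(block, shifted))  # 256
```

Unlike the Mean Square Error, SAD needs no multiplications, which is what keeps the engine small and power-efficient while still ranking predictors reliably.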


Figure 8. Average Y PSNR gain in dB of the proposed solution vs. the Full Search, without (white bars) and with memory compression (black bars). Values are for QCIF at 64kb/s, 15 fps.

4. CONCLUSIONS

An innovative and comprehensive low-power design strategy for an image acquisition and MPEG-4 encoding processing platform for mobile applications has been presented in this work, within the frame of a thorough "power conscious" design methodology. The Bayer bus encoding combined with the optimized IGU (Figure 5) allows us to achieve a saving of more than 93% in data throughput (and, thus, communication energy) on the pipeline from the image sensor to the image encoder (connected to the IGU output). As far as the MPEG-4 motion estimator architecture is concerned, while providing a quality of results comparable to the common Full Search approach, our solution achieves remarkable savings in terms of computational workload (99% less than Full Search), internal and external memory size, silicon area, as well as bandwidth requirements on the system bus.

ACKNOWLEDGEMENTS

The authors gratefully acknowledge the AST Lab teams in Agrate and Catania and the CMG Imaging Division in Edinburgh for their valuable and synergic support in doing this work.

BIBLIOGRAPHY