Unsupervised - Neural Network Approach for Efficient Video Description


Giuseppe Acciani, Ernesto Chiarantoni, Daniela Girimonte, Cataldo Guaragnella

D.E.E. – Electro-technology and Electronics Department, Politecnico di Bari, Via E. Orabona 4, I-70125 Bari, Italy
{acciani,chiarantoni,guaragnella}@poliba.it

Abstract. MPEG-4 object-oriented video codec implementations are rapidly emerging as a solution for compressing audio-video information efficiently, making them suitable for narrowband applications. A different view is proposed in this paper: many images in a video sequence are very close to each other. Each image of the sequence can be seen as a vector in a hyperspace, and the whole video can be considered a curve traced by the image-vector over time. The curve can be sampled to represent the whole video, and its evolution through the video space can be reconstructed from these video-samples. Any image in the hyperspace can be obtained by means of a reconstruction algorithm, in analogy with the reconstruction of an analog signal from its samples; here, however, the multi-dimensional nature of the problem requires knowledge of the position in the space and a suitable interpolating kernel function. The definition of an appropriate Video Key-frames Codebook (VKC) is introduced to simplify video reproduction; a good-quality prediction of each image of the sequence can then be obtained with few information parameters. Once the VKC has been created and stored, the generic image in the video sequence can be referred to the selected key-frames in the codebook and reconstructed in the hyperspace from its samples. The focus of this paper is on the analysis phase of a given video sequence. Preliminary results seem promising.

KEYWORDS – MPEG-4, video coding, video streaming, internet TV, video analysis.

1. Introduction

MPEG-4 is becoming increasingly important in the video coding framework for its ability to deal with high to very low bit rate applications ([1-5]). The general structure of an MPEG-based codec exploits the temporal redundancy between the images of a sequence to greatly reduce the amount of information that must be transmitted to the receiving end. Its success rests on the object-oriented video coding structure, which allows several bit streams to be multiplexed together into a single video frame structure for transmission. Each of the separate streams carries information about a single Video Object (VO), defined by means of arbitrary shape, motion and residual error coding. Depending on the application, the time-varying available bandwidth can cause image freezes and/or degradation of the received image quality, so research very often addresses very low bit rate coding and/or bandwidth-adaptive coding algorithms to overcome such annoying situations.

In this paper a novel technique for video reproduction is presented, which views the coding problem through a sub-space description of the video sequence. The goal of transmitting a given video is to reproduce good perceived quality, in a metric given by the end user. The video to be reproduced is segmented into shots, each one described as a video sequence; for each shot, an appropriately small number of key frames is selected to constitute the skeleton of the image sequence; each frame of the sequence is then referred to the selected key frames to obtain a good prediction of the image to be coded. The coder adapts its coding structure to the peculiarities of the video at hand, creating a Video Key-frames Codebook (VKC), and then reduces the coding requirements to the transmission of very little information over the network. Only a few images of the sequence need to be chosen as key-frames to represent the video, and any other frame can be related to the selected key-frames in order to reproduce the true images along the whole video sequence. In this way, the reproduction of a generic video sequence can be regarded as the problem of reconstructing an analog signal from its discrete-time samples. The VKC creation is based on the analysis of the video in a vector space: the generic image of the video sequence is first segmented into color-coherent zones by means of an unsupervised neural network approach; subsequently, the image feature-vector is used to represent the image in a vector space, and all the images of the video sequence are clustered in the feature space in order to select the smallest set of Video Key-frames to be used in the definition of the VKC. The key-frame for each cluster is selected on the basis of the minimal distance from the centroid of that cluster. Once the VKC has been created, the video can be reproduced by a proper interpolation of the VKC images to obtain the generic image of the sequence.

The paper is structured as follows: section 2 describes the unsupervised NN used for image feature extraction and the subsequent time clustering of images in the feature space; section 3 addresses the video description application and presents the video reconstruction with the "nearest neighbor" approach, together with a quantification of the minimal bit rate for sequence reproduction, the simulation environment and preliminary results. Conclusions and future work close the paper.

2. The Unsupervised Neural Network approach

Several algorithms have been proposed in the literature in the field of video reordering, skimming, summarization and storyboarding: all such techniques require the selection of a proper subset of images of the video sequence that is representative of the complete video. This amounts to selecting images that each create a "video spot", meaning that all the images in a time-neighborhood of such images (wherever located in the video) can be well described by them; it requires the definition of the minimal subset of maximally distant images presenting this characteristic. The problem of selecting a metric is not easily addressed, as it depends mainly on the subjective evaluation of the storyboard obtained from the complete video sequence. An unsupervised approach has therefore been used, based on a neural network, to obtain the clusters of the sequence and, as a second step, to extract the video key-frames from the cluster definition. Similarly to standard competitive neural networks, the network used here ([6]) is composed of M processing elements, where each unit receives an input signal x = (x1, x2, ..., xN) from an external database and is characterized by the weight vector w = (w1, w2, ..., wN). When the input is received, the unit computes the Euclidean distance d(x, w) between the input and the weight. Then, for a single element of the network, the competitive learning algorithm reported in Table 1 is proposed.

In Table 1, y is the neuron output; f is the logistic function, with f(u) = 0 if u ≤ 0 and 0 < f(u) < 1 if u > 0; α(k) is the learning coefficient; β(k) is a coefficient that changes at the same rate as α; δ is a constant value; and σ(k) is an adaptive threshold, whose behavior will be clarified later on. From the first equation of Table 1 it is clear that the neuron activates (y > 0) only if the extreme of the input vector x lies within the hyper-sphere with radius σ and center at the extreme of w. When y = 0, equation (2) is driven by the value of α. On the other hand, when α goes to zero, the value of y predominates and becomes a "winner" indicator. In this way it is possible to implement a learning law that generates a transition from a phase of "weak" undifferentiated learning (y = 0), where every datum influences the learning, to a phase of "locally" selective learning, where the competitive signal βy enables the unit to learn only in the local "winner region". In order to adapt the "winner region" to the features of the input space, an adaptive threshold is introduced: the idea is to increase σ when there is no activation, and vice versa. During a phase of undifferentiated learning (y = 0), equation (3) is driven by α. On the other hand, as w moves toward its center, selective learning (y > 0) takes place, α goes to zero, σ decreases, the neighborhood size shrinks and, consequently, only input vectors in a reduced neighborhood of w contribute to the learning.

Table 1.

y(k) = f(σ(k) − d(x(k), w(k)))                         (1)

∆w(k) = (α(k) + β(k) y(k)) (x − w(k))                  (2a)
w(k+1) = w(k) + ∆w(k)                                  (2b)

∆σ(k) = (α(k) + β(k) y(k)) (d(x, w(k)) − σ(k))         (3a)
σ(k+1) = σ(k) + ∆σ(k)                                  (3b)

∆α(k) = α(k) y(k) δ                                    (4a)
α(k+1) = α(k) + ∆α(k)                                  (4b)

∆β(k) = −∆α(k)
β(k+1) = β(k) + ∆β(k)

Iterate from step 1 until ∆σ(k) < ε.
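The single-unit update rules of Table 1 can be sketched in code as follows. This is a hypothetical illustration, not the authors' implementation: the function and variable names are ours, a standard logistic is assumed for f, and δ is taken negative so that α decays once the unit starts winning, as the text describes.

```python
import numpy as np

def unit_step(x, w, sigma, alpha, beta, delta):
    """One iteration of the single-unit learning rule of Table 1 (a sketch).

    x, w are N-dimensional vectors; sigma is the activation threshold;
    alpha, beta are the learning coefficients; delta is a negative constant
    so that alpha decays during selective learning (y > 0)."""
    d = np.linalg.norm(x - w)                       # Euclidean distance d(x, w)
    u = sigma - d
    y = 1.0 / (1.0 + np.exp(-u)) if u > 0 else 0.0  # logistic f, zero for u <= 0
    gain = alpha + beta * y
    w = w + gain * (x - w)                          # (2a)-(2b): weight update
    sigma = sigma + gain * (d - sigma)              # (3a)-(3b): threshold update
    d_alpha = alpha * y * delta                     # (4a)
    alpha = alpha + d_alpha                         # (4b)
    beta = beta - d_alpha                           # beta moves opposite to alpha
    return w, sigma, alpha, beta, y
```

Note how the two regimes emerge: when y = 0 the unit still learns weakly at rate α and σ grows toward the observed distances; once y > 0 the βy term dominates and σ shrinks around the winner region.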

Network with Local Connections: Differently from WTA (Winner Take All) paradigms, the network proposed herein is an array of N units characterized by local inhibitory connections. At the beginning each unit receives samples from the input space, but only the first unit learns, since it has no input inhibitory links, whereas all the other units are inhibited by the presence of such links. The first inhibitory link is then turned off and the second unit can start its learning phase. This means that the input samples that activate the first unit continue to produce learning only for the first unit, whereas the remaining samples produce learning for the second unit. This architecture combines the advantages of both sequential cluster search and unsupervised competitive networks (UCNN). It also yields a major advantage over the performance of unsupervised competitive neural networks: due to the presence of the inhibitory links, lower-priority units cannot affect higher-priority units. As a consequence, if two networks are composed of N and (N+M) units, the common N units will behave in the same way in both networks and will find exactly the same centroids. It is therefore always possible to add extra units (in order to check whether some cluster in the data structure has been missed) without modifying the behavior of the first units.

Notice that UCNNs lack this property: their final partition of the data depends on the number of network elements.
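The priority property can be demonstrated with a toy sequential implementation. This is our own simplified sketch, not the network of [6]: it uses a hard activation and an explicit sample-removal step to mimic the inhibitory links, so that samples captured by a unit never reach lower-priority units.

```python
import numpy as np

def train_sequential(data, n_units, steps=2000, seed=0):
    """Toy sketch of the locally connected network: units learn in priority
    order, and samples captured by a unit are removed (the inhibitory link),
    so later units cannot affect earlier centroids. The hard activation and
    fixed coefficients are simplifying assumptions."""
    rng = np.random.default_rng(seed)
    remaining = [s.astype(float) for s in data]
    centroids = []
    for _ in range(n_units):
        if not remaining:
            break                                   # no samples left to cluster
        w = remaining[0].copy()
        sigma, alpha = 1.0, 0.2
        for _ in range(steps):
            s = remaining[rng.integers(len(remaining))]
            d = np.linalg.norm(s - w)
            y = 1.0 if d < sigma else 0.0           # hard "winner region" test
            gain = alpha + 0.8 * y
            w += gain * (s - w)
            sigma += gain * (d - sigma)
            alpha *= 0.999                          # undifferentiated learning fades
        centroids.append(w)
        # inhibitory link: drop the samples this unit has captured
        remaining = [s for s in remaining if np.linalg.norm(s - w) >= sigma]
    return centroids
```

Because each unit trains only on the samples left over by its predecessors, re-running with more units reproduces the earlier centroids exactly, which is the property UCNNs lack.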

Fig. 1: The clusters obtained for the simplified image description.

This NN structure has been used to simplify the image description, in order to obtain in a second step the selection of the Video Key-frames. The results presented refer to the sequence "Akiyo". For each image of the sequence, only the luminance has been considered in the clustering procedure for image simplification (feature extraction). Seven clusters per image have been considered a good trade-off between the order of the hyperspace of the images and a good segmentation of the image into coherent regions. Figure 1 shows a typical clustering of an image of the sequence.
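The paper does not spell out how the 7-cluster segmentation becomes a feature vector; one plausible reading is that each frame is described by the fraction of its luminance pixels assigned to each cluster. The sketch below is a hypothetical descriptor built on that assumption, with names of our own choosing.

```python
import numpy as np

def image_feature(luma, centroids):
    """Hypothetical 7-D descriptor for one frame: assign every luminance
    pixel to the nearest of the 7 cluster centroids and return the cluster
    occupancy fractions (they sum to 1)."""
    d = np.abs(luma.reshape(-1, 1).astype(float) - np.asarray(centroids, float))
    labels = d.argmin(axis=1)               # nearest-centroid label per pixel
    return np.bincount(labels, minlength=len(centroids)) / labels.size
```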

Fig. 2: The five prototype images of the Akiyo sequence.

Once the feature description phase has been carried out over all the images in the sequence, a time clustering procedure in the simplified 7-D hyperspace has been carried out. The goal of this procedure is the detection of clusters inside the sequence. Each cluster prototype, i.e. the cluster centroid, represents a typical image but does not coincide with an actual image of the sequence; each key-frame has therefore been chosen as the image closest to the detected centroid. Such images can well represent the video sequence in a summary report. We refer to the set of images closest to the centroids found by the time clustering procedure as the VKC, the Video Key-frames Codebook. Fig. 2 reports the obtained key-frames; by observing the video, it can be seen that they represent the whole sequence well. In a predictive video coding framework, the construction of the predicted image at the receiving end is based on the knowledge of the motion field the images are experiencing; motion fields are transmitted to the receiver as side information. In the proposed coding scheme, once the VKC has been defined, the side information needed to obtain the predicted image reduces to the transmission of a pointer (the LUT entry in the VKC of the stored key-frame assumed as prediction) if a zero-order interpolating kernel is assumed, or of a vector of distances from the key-frames in the more general case. Loss-less coding also requires an efficient technique for the transmission of the prediction error, i.e. the DFD (Displaced Frame Difference).

Substantially, the clustering procedure creates a look-up table (LUT) to be used in the coding phase in place of the usual prediction phase.
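Under a zero-order kernel, the per-frame side information reduces to a nearest-neighbor lookup in the VKC plus the bits needed to address the codebook. A minimal sketch (function names are ours, not the paper's):

```python
import numpy as np

def vkc_pointer(frame_feat, vkc_feats):
    """Return the LUT entry of the key-frame whose feature vector is closest
    (Euclidean distance) to the current frame's feature vector."""
    d = np.linalg.norm(np.asarray(vkc_feats, float) - np.asarray(frame_feat, float), axis=1)
    return int(d.argmin())

def pointer_bits(n_keyframes):
    """Bits per frame needed to address an N-entry codebook: ceil(log2 N)."""
    return int(np.ceil(np.log2(n_keyframes)))
```

For the 5- and 10-key-frame cases of the experiments, the pointer costs 3 and 4 bits per frame respectively, which is why the pointer itself is negligible next to the residual coding.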

3. The video coding experimental set-up

The performance evaluation of the proposed video reproduction technique assumes loss-less entropy coding of the difference between the "interpolated" image and the true one. In a raw description, any image can be approximated by its key frame in the VKC. This corresponds to using the zero-order interpolator in the hyperspace, i.e. substituting the nearest-neighbor key frame for the generic frame of the sequence. The bit rate required to code the video in this way is truly negligible if the initialization phase of the connection is neglected: if N key-frames have been selected to represent the whole video, the required bit rate can be computed as log2(N) b/frame, a very low bit rate indeed.

Table 2.

  Key-frames   Time average    Time average pixel       Time average bit
  number       squared error   mean entropy (b/pixel)   rate (kb/frame)
  5            21.42           2.16                     54.7
  10           12.29           1.91                     48.1

The use of a zero-order interpolating function produces high prediction errors; the choice of a better interpolating kernel function might allow lower information requirements. In order to test the validity of our assumptions, fig. 3 reports the time evolution of the entropy in the case of a zero-order interpolating function. In this work only preliminary results of the proposed idea are presented. Experimental tests have been conducted on the Akiyo standard video sequence in QCIF (144×176, 300 frames), luminance-only video format. As a measure of the required bit rate we chose the entropy of the prediction error obtained by substituting the most similar image of the video sequence for the true one. The prediction error entropy is used as a measure of the bit rate required to obtain the perfectly reconstructed video sequence. Coding results are reported both in visual and quantitative form, respectively as a bit rate time-evolution diagram and, in table 2, as average values for the two distinct cases of 5 and 10 selected key frames. The bit rates per frame are very high, but it should be noted that:
→ they refer to a zero-order interpolation kernel (the nearest neighbor is used as prediction);
→ loss-less coding is assumed, as the entropy of the prediction error has been used as a measure of the required bit rate.
With this kind of coding structure, the coding performance can be tuned through several handles: the VKC size, the required received image quality and the amount of information required to describe the image to be reproduced.
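The entropy figures above come from the first-order entropy of the prediction error. A sketch of that measure, assuming 8-bit luminance frames (the exact estimator used by the authors is not specified):

```python
import numpy as np

def dfd_entropy(frame, prediction):
    """First-order entropy (bits/pixel) of the displaced frame difference,
    used here as a proxy for the loss-less bit rate of the residual."""
    err = frame.astype(np.int16) - prediction.astype(np.int16)
    _, counts = np.unique(err, return_counts=True)  # histogram of error values
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())
```

On a frame that coincides with its key-frame prediction the residual entropy is zero, which is why the curve of fig. 3 nulls at the selected key-frames.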

Fig. 3: Time evolution of the frame-difference entropy obtained for a zero-order interpolating function. It is obviously null at the selected key-frames.

4. Conclusion and discussion

The creation of a Video Key-frames Codebook is introduced here to reduce the coding information to be transmitted over very narrowband channels, such as those experienced in traffic congestion situations. The proposed space approach can be considered a straightforward generalization of the forward-backward motion compensation mechanism included in the MPEG-4 standard, with the fundamental difference that no time dependency exists between the image to be predicted and its preceding and following ones. The use of a zero-order multidimensional interpolating function produces high prediction errors and thus a high bit rate for the difference coding; the choice of a better interpolating kernel function might allow lower bandwidth requirements, on the one hand, even if it implies higher requirements on the receiver hardware performance. Further research is devoted to the definition of a good-quality multi-dimensional interpolating kernel, able to reproduce the generic frame of a given sequence simply by interpolation carried out on the video key frames.

References

[1] P. Salembier, F. Marques, "Region-based representation of image and video: Segmentation tools for multimedia services", IEEE Trans. Circuits and Systems for Video Technology, invited paper, vol. 9, no. 8, Dec. 1999.
[2] ISO/IEC DIS 13818-2, Information Technology – Generic Coding of Moving Pictures and Associated Audio Information – Part 2: Video, ISO, 1994.
[3] MPEG Video Group, MPEG-4 Video Verification Model Version 4.0, ISO/IEC JTC1/SC29/WG11/M1380, Proceedings of Chicago meeting, October 1996.
[4] CCITT SG XV, Recommendation H.261 – Video Codec for Audiovisual Services at p×64 kbit/s, COM XV-R37-E, Int. Telecommunication Union, August 1990.
[5] ITU-T Draft, Recommendation H.263 – Video Coding for Low Bit Rate Communication, Int. Telecommunication Union, November 1995.
[6] G. Acciani, E. Chiarantoni, M. Minenna, "A New Non-Competitive Unsupervised Neural Network for Clustering", Proc. of Intern. Symp. on Circuits and Systems, vol. 6, pp. 273-276, London, May 1994.
[7] C. Guaragnella, E. Di Sciascio, "Object Oriented Motion Estimation by Sliced-Block Matching Algorithm", Proc. IEEE 15th Intl. Conf. on Pattern Recognition, vol. 3, Image, Speech and Signal Processing, pp. 865-869, Barcelona, Sept. 3-7, 2000.
[8] C. Cafforio, E. Di Sciascio, C. Guaragnella, "Motion Estimation and Modeling for Video Sequences", Proc. of EUSIPCO 98, 8-11, Rhodes, Greece.
[9] A. Guerriero, V. Di Lecce, "An Evaluation of the Effectiveness of Image Features for Image Retrieval", J. Visual Communication and Image Representation, 10, 351-362, 1999.
[10] A. Del Bimbo, P. Pala, "Visual Image Retrieval by Elastic Matching of User Sketches", IEEE Trans. Pattern Anal. Mach. Intell., 19(2), 1997.
