Empty Speech Pause Detection Algorithms' Comparison

June 28, 2017 | Author: Anna Esposito | Category: Information Systems
International Journal of Advanced Intelligence, Volume 2, Number 1, pp. 145-160, July 2010. © AIA International Advanced Information Institute

Empty Speech Pause Detection Algorithms' Comparison

Vojtěch Stejskal (1), Nikolaos Bourbakis (2) and Anna Esposito (3)

(1) Department of Telecommunications, University of Technology, Czech Republic, [email protected]
(2) ATRC, Wright State University, Dayton, OH, USA, [email protected]
(3) Department of Psychology and IIASS, Second University of Naples, Via Vivaldi 43, 81100 Caserta, Italy, [email protected]

Received (January 2010), Revised (May 2010)

Nowadays, empty speech pause detection algorithms play an important role in several speech processing fields such as speech recognition, speech enhancement, and speech coding. This work describes two new pause detection algorithms and compares their performance with four standard Voice Activity Detection (VAD) methods represented by the adaptive Long Term Spectral Divergence (LTSD) algorithm, the Likelihood Ratio Test (LRT) algorithm, Neural Network thresholding, and G.729. The proposed algorithms exploit the concept of adaptation in order to handle adverse conditions. The test data are recordings of spontaneous speech made in noisy environments. The experimental results show that the performance of the proposed algorithms on noisy and even artificially cleaned speech is superior to that achieved by the standard methods reported in the literature.

Keywords: Speech recognition; Speech processing; Neural nets; Voice activity detection.

1. Introduction

A characteristic of spontaneous speech, as well as of other types of speech, is the presence of silent intervals (empty pauses) and vocalizations (filled pauses) that do not have a lexical meaning. Empty and filled pauses are likely to coincide with boundaries, realized as silent intervals of varying length, at the clause and paragraph level [1,2,3], and often mark the boundaries of narrative units [4,5,6]. Pauses in speech are typically a multi-determined phenomenon attributable to physical, socio-psychological, communicative, linguistic, and cognitive causes (see [7,8] for more details). An accurate detection of empty speech pauses is crucial for most of today's speech processing methods and involves three main application categories: speech recognition, speech enhancement, and speech coding. Nowadays, the detection and modeling of empty and filled speech pauses play a significant role in


the development of interactive dialog systems and avatars. Moreover, empty pauses can provide useful biometric information [9].

The discrimination between speech and non-speech segments composing empty pauses is not as trivial as it might appear at first sight: most detection algorithms fail due to the combination of background noise and the speaker's coarticulation effects. Earlier algorithms were frequently based on energy thresholding, pitch detection, zero-crossing rate, periodicity measures, cepstral features, spectrum analysis, and Linear Prediction Coding (LPC), or on combinations of these parameters (for more details see [17]). Efforts to enhance detection performance have led to the implementation of statistical models with decision rules derived from the Likelihood Ratio Test (LRT) applied to a set of hypotheses. Recently, the Gaussian statistical model was improved with the incorporation of an effective hang-over scheme based on a Markov chain model in order to achieve more reliable results [14,15,16]. A different approach exploits a set of fuzzy rules implemented in the detection algorithm [18]. It has also been shown that algorithm robustness can be improved by using the Signal-to-Noise Ratio (SNR) and long-term information about the speech/non-speech signal, measured separately on each filtered spectral band, to formulate the appropriate decision rule for the problem under examination [19].

Each of the solutions above has proved to give satisfactory performance when tested on standard databases, such as TIMIT, NTIMIT, or the Aurora framework, where the noise has constant attributes that do not change from one recording to another or during a recording, and where the pauses present in the recordings are essentially articulatory pauses [20,21,22]. However, the performance of the above algorithms decreases when the environmental noise changes due to variations in the recording environment.
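To make the classical baseline concrete, a fixed energy-threshold detector of the kind mentioned above can be sketched in a few lines. This is a minimal illustration only, not one of the algorithms compared in this paper; the frame length and threshold values are arbitrary choices for the sketch:

```python
import numpy as np

def energy_vad(signal, frame_len=320, threshold_db=-40.0):
    """Classical fixed-threshold energy VAD: a frame is labeled speech
    when its log energy exceeds a global threshold expressed in dB
    relative to the maximum frame energy."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sum(frames.astype(float) ** 2, axis=1) + 1e-12
    log_e = 10.0 * np.log10(energy / energy.max())
    return log_e > threshold_db  # True = speech, False = empty pause
```

The weakness of such a fixed threshold is exactly what the paper addresses: when the noise floor drifts between or within recordings, a single global level misclassifies frames, which motivates the adaptive thresholds of Section 2.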
The present work proposes two adaptive energy thresholding algorithms that overcome the problems mentioned above and compares their performance against standard pause detection algorithms.

2. Proposed Pause Detection Methods

Considering the relevant number of applications involving the modeling and detection of speech pauses in spontaneous dialogs, we developed an algorithm for their detection. To this aim, the input speech signal was divided into frames using a 20 ms Hamming window with 10 ms overlap, and the sample log-spectral energy values were computed for each time window. Next, the computed spectrum was divided into four Mel frequency scale sub-bands to match the human psychoacoustical ability to resolve sounds with respect to frequency. The signal parameters of interest were described in the 0-4 kHz frequency range, which contains an adequate amount of vocal activity, vocal tract articulatory features, and non-speech segment information (see details in [28,29]). A decision on whether the processed frame is a speech or non-speech segment was made by applying to the output of each filter a thresholding algorithm based


on the following principles (for more details see [28]). The algorithm starts by computing the threshold T as the mean of all band energy values in the first frame. The first silent segment is detected when all the band energy features fall below the value T. The thresholds Ts and Tp are then introduced and set equal to T and 1.2 T, respectively. This initial set-up may result in a minor misclassification when the first detected silent frame shows a slight offset with respect to its manually detected position. A non-speech frame is detected when the energy value in each band falls below the current Tp value; conversely, for a speech frame to be detected, the energy in each band has to be greater than the current Ts value. Ts and Tp are re-computed whenever a non-speech frame is detected, and both threshold values change according to the amount of energy in non-speech frames. We developed and experimentally tested several algorithms for the calculation and adaptation of the thresholds, in order to identify a procedure capable of preventing the threshold-value fluctuations that might arise from either random noise or long-term silent segments. As a result of this process, we devised the min/max and the spectral flatness methods.

Min/Max Method

The min/max algorithm is based on the ratio of the predicted minimal noise energy in a detected non-speech region to the maximal noise energy computed on recently detected non-speech regions. Since the entire detection system runs in real-time, the minimum noise value must be predicted on-line, concurrently with the detection of silent pauses. The min/max ratio allows an adaptation of the generalized threshold level in response to changes in the noise level. The computation of the generalized threshold is described by eq. (1):

T_k(n) = \bar{N}_k + \left[ 1 - \frac{N_{k,\min}(p)}{N_{k,\max}} \right] \cdot \bar{S}_{k,u}    (1)

where T_k(n) is the threshold value computed for the kth band and nth frame, \bar{N}_k is the mean of the noise energy computed on recently detected pauses, \bar{S}_{k,u} is the mean value of the long-term speech energy computed from the input signal, and N_{k,\max} and N_{k,\min}(p) are, respectively, the maximum noise energy value from recently detected pauses and the minimum noise energy value in the currently detected silent pause p. The thresholds T_k^s(n) and T_k^p(n) are introduced and set equal to T_k(n) and 1.2 T_k(n), respectively, and are protected against overflow and underflow through appropriate energy levels computed on the previous 10 s and 2 s of the input signal, respectively.

Spectral Flatness Method

This method introduces into the adaptive algorithm a correction factor:


T_k(n) = \lambda \bar{N}_k + \frac{\gamma}{4} \left( 1 - F_c(n) \right)    (2)

where F_c is the threshold correction obtained from the spectral flatness function described as:

F_c = \sum_{i=0}^{N-1} \log |S(i, n)| - \log \sum_{i=0}^{N-1} |S(i, n)|    (3)

\lambda is a constant SNR correction estimated only once, during the first detected non-speech segment. The SNR is computed as the ratio of the average speech energy to the noise energy in the first detected silent pause, and it reflects energy variations in the speaker's voice and in the environment. S(i, n) is the spectral energy in each frequency band i of the FFT with size N and time window n. \gamma is a constant taking on two different values: \gamma_{Ts} = 1 when the algorithm is processing speech frames and \gamma_{Tp} = 1.6 otherwise.

The output of the above-described detection algorithms is a vector of 4 binary thresholded values. This output is then processed by a mapping algorithm that indicates whether the frame under examination is a speech or a silent segment (for more details see [17]). A backward analysis is performed after each newly detected silent pause, to avoid including in the energy and slope vector computation the acoustic features of speech segments (such as weak fricatives and/or reduced vowels) whose energy may fall below the thresholds defined for the silent intervals.
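The per-band decision logic of Section 2 and the min/max threshold update of eq. (1) can be sketched as follows. This is a simplified illustration, assuming the band energies and the noise/speech statistics are already available; the function and variable names are ours, not from the original implementation:

```python
import numpy as np

def minmax_threshold(noise_mean, speech_mean, noise_max, noise_min):
    """Generalized threshold of eq. (1):
    T_k = N_mean + (1 - N_min / N_max) * S_mean.
    When the noise level is stable (N_min close to N_max) the threshold
    stays near the noise mean; fluctuating noise raises it toward the
    long-term speech energy mean."""
    return noise_mean + (1.0 - noise_min / noise_max) * speech_mean

def classify_frame(band_energies, Ts, Tp):
    """Per-frame decision over the four sub-bands: non-speech when every
    band falls below Tp, speech when every band exceeds Ts
    (Tp is initialized to 1.2 * Ts). Mixed cases are left for the
    mapping / backward-analysis stage."""
    band_energies = np.asarray(band_energies)
    if np.all(band_energies < Tp):
        return "pause"
    if np.all(band_energies > Ts):
        return "speech"
    return "undecided"
```

The "undecided" branch stands in for the mapping algorithm the paper applies to the vector of binary thresholded values; the sketch only shows the two unanimous cases the text describes explicitly.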

3. Long Term Spectral Divergence

Successful speech segmentation by thresholding methods depends on a reliable representation of the speech signal. Since thresholding methods are sensitive to rapid spectral changes, a compromise between a reliable spectral representation and its smoothed version has to be found: smoothing should not lessen the discriminative ability of segment recognition, since that would cause misclassifications. An alternative procedure that reduces the signal variance is spectral envelope smoothing, the Long Term Spectral Envelope (LTSE) algorithm [10,11]:

LTSE_M(k, n) = \max_{j=-M}^{+M} X(k, n + j)    (4)

where X(k, n) represents the modulus of the kth spectral coefficient in the nth time frame. The spectral envelope in each frequency bin is estimated as the maximum spectral energy value over the neighboring time frames at the same frequency bin, their number being limited to \pm M. The value M is called the "order" (typical values are 6-12, i.e. 60-120 ms with a 10 ms window step). When M is smaller than typical, speech/pause transitions may be classified wrongly: a transition may not be recognized properly or, conversely, short pauses may be missed. The LTSE alone does not provide a sufficient discriminative function between segments consisting of speech mixed with noise and segments containing only noise. Therefore, an algorithm based on the ratio between the smoothed long-term spectral envelope and the smoothed long-term noise value was introduced, the Long-Term Spectral Divergence (LTSD):

LTSD_M(n) = 10 \log_{10} \left[ \frac{1}{N} \sum_{k=0}^{N-1} \frac{LTSE^2(k, n)}{N^2(k, n)} \right]    (5)

where N is the number of FFT samples. The variable N(k, n) represents the exponentially smoothed noise value of the kth spectral bin in the nth time frame, updated only inside detected pause segments:

N(k, n) = \alpha N(k, n - 1) + (1 - \alpha) N_K(k)    (6)

The current value N_K is computed for each spectral bin as the average noise value over the time frames neighboring the nth frame. The decision on whether the computed frame is speech or pause is made by a linear thresholding function that depends on the noise energy from the previously detected pause and on empirically estimated threshold limits.

4. Neural Network

Another approach to pause detection is a neural network, specifically a Multi-Layer Perceptron (MLP). There is no general rule defining the network configuration, but the number of hidden layers is usually chosen as a compromise between a slow training process (too many layers) and possible misclassification (too few layers). Moreover, the training process can get stuck in local minima when a layer contains a high number of neurons. A three-layer feed-forward perceptron trained with back-propagation was chosen for pause detection (see Fig. 1). The input layer is connected to the outputs of five Mel filter banks in the 0-4 kHz frequency range. The input features are normalized according to the range of the sigmoid function at the MLP output; each neuron is activated by a sigmoid function as well. An input segment is labeled as a pause when the output value y(t) > -0.1. The MLP is trained in a boosting fashion on two thirds of the speech records and tested on the remaining third, rotating the split so that all records can be tested. The backward pause analysis method is applied to the obtained results (as for all the methods mentioned above).

Fig. 1. Empty pause detection with MLP.

5. Likelihood Ratio Test

The main idea of the Likelihood Ratio Test method is to decide whether the signal is speech or noise. The decision is based on the ratio of two probability density functions computed from speech features and on Maximum Likelihood (ML) classification [12]. The distribution of the speech and noise time samples is assumed to be Gaussian, and the spectral coefficients are treated as asymptotically independent Gaussian random variables [13]. Probability density functions conditioned on two hypotheses, H_0 and H_1, are considered, assuming the presence of additive noise in the signal, X(k) = S(k) + N(k), where S(k) represents the speech spectral component and N(k) the noise. The first hypothesis H_0 assumes that only noise is present in the signal:

p(\vec{X}|H_0) = \prod_{k=0}^{L-1} \frac{1}{\pi \sigma_N^2(k)} \, e^{-\frac{|X(k)|^2}{\sigma_N^2(k)}}    (7)

where \sigma_N^2(k) is the variance of the kth spectral coefficient of X(k) and L is the number of FFT samples. The second hypothesis H_1 covers the presence of speech mixed with noise:

p(\vec{X}|H_1) = \prod_{k=0}^{L-1} \frac{1}{\pi (\sigma_N^2(k) + \sigma_S^2(k))} \, e^{-\frac{|X(k)|^2}{\sigma_N^2(k) + \sigma_S^2(k)}}    (8)

where \sigma_S^2(k) is the variance of the kth spectral coefficient [14]. The likelihood ratio for each frequency component is then

\Lambda_k = \frac{p(X_k|H_1)}{p(X_k|H_0)} = \frac{1}{1 + \xi_k} \, e^{\frac{\gamma_k \xi_k}{1 + \xi_k}}    (9)

where \xi_k = \sigma_S^2(k)/\sigma_N^2(k) is the a priori SNR and \gamma_k = |X_k|^2/\sigma_N^2(k) is the a posteriori SNR. The decision rule is defined as the mean value of the log-likelihood ratios over all frequency components:

\ln \Lambda = \frac{1}{L} \sum_{k=0}^{L-1} \ln \Lambda_k \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \eta    (10)
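The LRT decision described by eqs. (7)-(10) reduces to a compact per-frame rule once the noise and speech variances are known. The sketch below assumes \sigma_N^2(k) and \sigma_S^2(k) are given (in practice they must be estimated, e.g. during previously detected pauses); the function name and the threshold value are illustrative only:

```python
import numpy as np

def lrt_decision(X, sigma_n2, sigma_s2, eta=0.5):
    """Likelihood Ratio Test over L spectral bins.
    xi    = a priori SNR  (speech variance / noise variance)
    gamma = a posteriori SNR (|X_k|^2 / noise variance)
    ln Lambda_k = gamma*xi/(1+xi) - ln(1+xi), averaged over bins
    and compared against the threshold eta (True -> speech, H1)."""
    xi = sigma_s2 / sigma_n2
    gamma = np.abs(X) ** 2 / sigma_n2
    log_lambda_k = gamma * xi / (1.0 + xi) - np.log(1.0 + xi)
    return np.mean(log_lambda_k) > eta
```

Note that averaging the per-bin log ratios, rather than thresholding each bin, is what gives the method its robustness to isolated noisy bins.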