
September 6, 2017 | Author: Jose R. Beltran | Category: Image Processing, Audio Signal Processing, Blind Source Separation, Wavelets



Blind Separation of Overlapping Partials in Harmonic Musical Notes Using Amplitude and Phase Reconstruction

Jesús Ponce de León ([email protected])
José Ramón Beltrán ([email protected])

Abstract. In this work, a new method for Blind Audio Source Separation of monaural musical signals is presented. The input (mixed) signal is processed using a flexible analysis and synthesis algorithm (CWAS, the Complex Wavelet Additive Synthesis algorithm), which is based on the Complex Continuous Wavelet Transform. When the information from two or more sources overlaps in a certain frequency band (or group of bands), a new technique based on amplitude similarity criteria is used to obtain an approximation to the original partial information. Our main goal is to show that the CWAS algorithm can be a powerful tool in Blind Audio Source Separation, due to its strong coherence in both the time and frequency domains. A set of 20 synthetically mixed monaural signals containing 2 or 3 sources has been analyzed using this method. The obtained results show the strength of the technique.

1 Introduction

Figure 1: The general BASS task, with N mixed sources and M sensors.

Blind Audio Source Separation (BASS) has been receiving increasing attention in recent years. BASS techniques try to recover the source signals from a mixture when the mixing process is unknown. Blind means that very little information is needed to carry out the separation, although it is in fact necessary to make assumptions about the statistical nature of the sources or of the mixing process itself. In the most general case (see Figure 1), separation deals with N sources and M mixtures (sensors, microphones). The number of mixtures defines each particular case, and for each situation the literature provides several methods of separation. Probably the most studied and useful case is the underdetermined one, where N > M (N > M does not always imply worse results). For example, in stereo separation (through the DUET algorithm [1] and other time-frequency masking evolutions [2] [3] [4]), the delay and attenuation between the left- and right-channel information can be used to discriminate the present sources and some of the scene layout [5]. In other applications, when a monaural solution is needed (that is, when M = 1), the mathematical indetermination of the mixture greatly increases the difficulty of the task. Hence, monaural separation is probably the most difficult challenge in BASS; but even in this case, the human auditory system itself can somehow segregate the acoustic signal into separate streams [6].

Several techniques for solving the BASS problem in general (and monaural separation in particular) have been developed. Psychoacoustic studies, like Computational Auditory Scene Analysis (CASA) [7] [8], inspired by Auditory Scene Analysis (ASA) [6], attempt to explain the cited capability of the human auditory system for selective attention. Psychoacoustics also suggests that temporal and spectral coherence between sources can be used to discriminate them [9]. Within the statistical techniques, Independent Component Analysis (ICA) [10] [11] assumes statistical independence among sources, while Independent Subspace Analysis (ISA) [8] extends ICA to single-channel source separation. Sparse Decomposition (SD) [12] assumes that a source is a weighted sum of bases from an overcomplete set, considering that most of these bases are inactive most of the time [13], that is, their relative weights are presumed to be mostly zero. Non-negative matrix factorization (NMF) [14] [15] attempts to find a mixing matrix (with sparse weights [16] [17]) and a source matrix with non-negative elements such that the reconstruction error is minimized. Finally, sinusoidal modeling techniques assume that every sound is a linear combination of sinusoids (partials) with time-varying frequencies, amplitudes and phases; therefore, sound separation requires the adequate estimation of these parameters for each source present in the mixture [18] [19] [20], or some a priori knowledge, i.e. rough pitch estimates of each source [21] [22]. One of the most important applications is monaural speech enhancement and separation [23]; such methods are generally based on some analysis of speech or interference and subsequent speech amplification or noise reduction. Most authors use the STFT to analyze the mixed signal in order to obtain its main sinusoidal components or partials. Auditory-based representations [24] can also be used. One of the most important and difficult problems in the separation of pitched musical sounds is that of overlapping harmonics, that is, when the frequencies of two harmonics are approximately the same.
The problem of overlapping harmonics has been studied for decades [25], but it is in recent years that research on this topic has increased significantly. As the information in overlapped regions is unreliable, several recent systems attempt to use the information of the neighboring non-overlapped harmonics. Some systems assume that the spectral envelope of instrument sounds is smooth [26] [27]; hence the amplitude of an overlapped harmonic can be estimated from the amplitudes of non-overlapped harmonics of the same source, via interpolation [26] [20] or a weighted sum [19]. The spectral smoothness assumption is, however, often violated in real instrument recordings. A different approximation is known as Common Amplitude Modulation (CAM) [21], which assumes that the amplitude envelopes of different harmonics of the same source tend to be similar. In this work, we use an experimentally less restrictive version of the CAM assumption within a sinusoidal model generated using complex band-pass filtering of the signal. Non-overlapping harmonics are obtained using a binary masking approach based on the Complex Continuous Wavelet Transform (CCWT). The separated amplitudes of overlapping harmonics are reconstructed proportionally from the non-overlapping harmonics, following energy criteria in a least-squares framework. In this way, it is possible to relax the phase restrictions, and the instantaneous phase for each overlapping source can also be constructed from the phase of non-overlapping partials. This paper is organized as follows: Section 2 gives a brief introduction to the CCWT and the CWAS algorithm, including the interpretation of its results and the additive synthesis process. The proposed separation algorithm is presented in Section 3; a detailed example of the proposed technique and the experimental results are shown in Section 4. Finally, the main conclusions and current and future lines of work are presented in Section 5.

2 Complex Bandpass Filtering

2.1 The Complex Continuous Wavelet Transform

The Complex Continuous Wavelet Transform (CCWT) can be defined in several ways [28]. For a certain input signal x(t), it can be written as:

W_x(a, b) = \int_{-\infty}^{+\infty} x(t) \, \Psi^*_{a,b}(t) \, dt   (1)

where * denotes the complex conjugate and \Psi_{a,b}(t) is the mother wavelet, frequency-scaled by a factor a and temporally shifted by a factor b:

\Psi_{a,b}(t) = \frac{1}{\sqrt{a}} \, \Psi\!\left(\frac{t-b}{a}\right)   (2)

In our case, we choose a complex analyzing wavelet (concretely, the Morlet wavelet). The Morlet wavelet is a complex exponential modulated by a Gaussian of width 2\sqrt{2}/\sigma, centered at the frequency \omega_0/a. Its Fourier transform is:

\hat{\psi}_a(\omega) = C e^{-\sigma^2 (a\omega - \omega_0)^2 / 2}   (3)

where C is a normalization constant which can be calculated independently of the input signal in order to conserve the energy of the transform [29].

Figure 2: Top: waveform of the analyzed signal (the mixture of a tenor trombone playing a C5 note and a trumpet playing a D5 vibrato). Bottom left: the wavelet spectrogram, that is, the modulus of the CCWT coefficients (modulus of the CWT matrix); the bright zones are the different detected partials. Bottom right: scalogram of the signal.
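Equation (1) with the Morlet wavelet of Equation (3) amounts to a complex band-pass filter bank, which can be implemented efficiently in the frequency domain. The sketch below is illustrative, not the CWAS implementation: the normalization C is taken as 1, and the values of sigma, w0 and the scale grid are assumptions.

```python
import numpy as np

def morlet_ccwt(x, fs, scales, w0=5.0, sigma=1.0):
    """Complex CWT of Equation (1) with the Morlet wavelet of Equation (3).

    Row i of the result holds W_x(a_i, b) for scale a_i; b (time) runs
    along the columns.  Each scale is a multiplication in the frequency
    domain by a Gaussian centred at w0 / a_i (Equation (3), with C = 1).
    """
    n = len(x)
    X = np.fft.fft(x)
    w = 2 * np.pi * np.fft.fftfreq(n, d=1.0 / fs)  # angular frequency grid
    coeffs = np.empty((len(scales), n), dtype=complex)
    for i, a in enumerate(scales):
        psi_hat = np.exp(-sigma**2 * (a * w - w0) ** 2 / 2)
        coeffs[i] = np.fft.ifft(X * psi_hat)
    return coeffs
```

The modulus of `coeffs` is the wavelet spectrogram of Figure 2; its temporal sum gives the scalogram.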

2.2 The Complex Wavelet Additive Synthesis algorithm

In the Complex Wavelet Additive Synthesis (CWAS) algorithm [30], a complex mother wavelet allows us to analyze the complex coefficients of Equation (1), stored in a matrix (the CWT matrix) in modulus and phase, obtaining directly the instantaneous amplitude and the instantaneous phase of each detected component [29] [31]. A single parameter, the number of divisions per octave D (a vector with as many dimensions as octaves present in the signal's spectrum), controls the frequency resolution of the analysis. Figure 2 (bottom left) depicts the modulus of the complex wavelet coefficients (also called the wavelet spectrogram) for the mixture of a tenor trombone playing a C5 note and a trumpet playing a D5 vibrato. In the figure, the bright zones are associated with the main trajectories of information (each one related to a partial). The temporal addition of the modulus of the wavelet coefficients constitutes the scalogram of the signal. The scalogram presents a certain number of peaks, each one related to a detected component of the signal.

In the CWAS algorithm, we use a definition of partial which differs from the classical one. In our model, a partial contains all the information situated between an upper and a lower frequency limit. Figure 3 depicts the scalogram of a guitar playing an E4 note (330 Hz). Each maximum of the scalogram is marked with a black point. Its associated upper and lower frequency limits are marked with red stars; they are located at the minimum points between adjacent maxima. For a certain peak i of the scalogram, its complex partial function P_i is the summation of the complex wavelet coefficients between its related frequency limits [30]. As b is the temporal evolution of the wavelet coefficients, we can write:

P_i(t) = \sum_{m = m_i^{low}}^{m_i^{up}} W_x(a_m, t)   (4)

Studying the complex-valued function P_i(t) in modulus and phase, we can obtain the instantaneous amplitude A_i(t) and the instantaneous phase \Phi_i(t) of each detected partial, being:

A_i(t) = \|P_i(t)\|

[...] its related source. For each overlapping partial, we know the sources that share it. In order to separate the sources, it is necessary somehow to obtain [...] In these equations, \theta_m is an energy (amplitude) threshold.
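The grouping of coefficients into partials described above, Equation (4), can be sketched as follows; the peak and limit detection here is a simplified stand-in for the CWAS procedure.

```python
import numpy as np

def extract_partials(coeffs):
    """Group CWT coefficients into partials, Equation (4).

    The scalogram (temporal sum of the modulus) is scanned for local
    maxima; the limits of each partial are the minima between adjacent
    maxima.  Returns (A_i(t), Phi_i(t)) per detected peak.
    """
    scalogram = np.abs(coeffs).sum(axis=1)
    peaks = [i for i in range(1, len(scalogram) - 1)
             if scalogram[i] > scalogram[i - 1] and scalogram[i] >= scalogram[i + 1]]
    partials = []
    for p in peaks:
        lo = p
        while lo > 0 and scalogram[lo - 1] < scalogram[lo]:
            lo -= 1                               # descend to the lower minimum
        hi = p
        while hi < len(scalogram) - 1 and scalogram[hi + 1] < scalogram[hi]:
            hi += 1                               # descend to the upper minimum
        P = coeffs[lo:hi + 1].sum(axis=0)         # Equation (4)
        partials.append((np.abs(P), np.unwrap(np.angle(P))))
    return partials
```

The modulus and unwrapped angle of each P_i give the instantaneous amplitude A_i(t) and phase Phi_i(t).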

Consider a certain partial P_m, of median frequency f_m. Let K_m be the set of overlapping sources in this mixed partial. K_m can be written as:

K_m = \{ s_k(t) \in x(t) \mid (\cdot)_{mk} < \theta_a \}   (21)

that is, the set of sources of which P_m is a natural harmonic. Let N_k be the cardinal of K_m (that is, the number of present sources). The mixed partial can be written as follows:

P_m(t) = A_m(t) e^{j\phi_m(t)} = \sum_{k \in K_m} P_{s_k}(t) = \sum_{k \in K_m} A_{s_k}(t) e^{j\phi_{s_k}(t)}   (22)

where P_{s_k}(t) are the original harmonics which overlap on the mixed partial. In Equation (22), the only accessible information is the instantaneous amplitude and phase of the mixed partial, that is, A_m(t) and \phi_m(t).

From each isolated set of partials P_k, with k \in K_m, we will search for a partial with an energy as similar to the energy of P_m as possible, and with a median frequency as close to f_m as possible. If \Delta(E_{j,m}) = |E_j - E_m| and \Delta(f_{j,m}) = |f_j - f_m|, these conditions can be written as:

P_{k,win} = \{ P_j \in P_k \mid \Delta(E_{j,m})|_{min}, \; \Delta(f_{j,m})|_{min} \}   (23)

The energy condition is evaluated in the first place; only in doubtful cases is the frequency condition evaluated. The energy of a partial is calculated using Equation (7). For simplicity, let P_{w_k} denote the selected (winner) isolated partial of each source. It can be written:

P_{w_k}(t) = A_{w_k}(t) e^{j\phi_{w_k}(t)} \quad \forall k \in K_m   (24)

3.2.2.1 Assumptions. In order to obtain the envelopes and phases of an overlapping partial related to each source, we will assume two approximations. If P_1 and P_2 are harmonics of median frequencies f_1 and f_2 respectively, related to the same fundamental frequency f_{0k}, it follows:

\frac{f_2}{f_1} = \frac{p}{q}, \quad \exists\, p, q \in \mathbb{N}   (25)

with p = f_2/f_{0k} and q = f_1/f_{0k}. The first approximation is then:

- Given p, q of Equation (25), the phases of partials P_1 and P_2 are approximately proportional with the same ratio (except for an initial phase \phi_0):

\phi_2(t) \approx \frac{p}{q} \phi_1(t) + \phi_0   (26)

We have found that, in the proposed model of the audio signal, even knowing the envelopes of the original overlapping harmonics, an error in the initial phase \phi_0 of 1 part in 10^3 is enough to fail to adequately reconstruct the mixed partial. On the other hand, each partial has a random initial phase (that is, there is no relation between the initial phases of partials belonging to the same source). But, as the instantaneous frequency of the mixed harmonics can be retrieved with accuracy independently of the value of the initial phase, the original and the synthetically mixed partials (using the separated contribution from each source) sound similar, provided that the second assumption is true. This second assumption is a slightly less restrictive version of the CAM principle, which asserts that the amplitude envelopes of spectral components from the same source are correlated [21]:

- The amplitudes (envelopes) of two harmonics P_1 and P_2 with similar energy, E_1 \approx E_2, belonging to the same source, are correlated.

While this approximation holds, we will have better separation results. In fact, as we are using the global signal information, the correlation factor between the strongest harmonic (and/or the fundamental partial) and the other harmonics decreases as the amplitude/energy differences increase [21]. Hence, choosing for the reconstruction non-overlapping harmonics whose energy is similar to that of the overlapping harmonic suggests that the correlation factor between the involved partials will be higher. In fact, as the correlation between high-energy partials tends to be high, while the errors related to this assumption in lower-energy partials tend to be energetically negligible, in most cases the quality measurement parameters have a high value, and the acoustic differences between the original and the separated source are acceptable.
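The winner selection of Equation (23) and the phase reconstruction of Equations (25)-(27) can be sketched as follows. The tie-breaking threshold `tie_ratio` is an illustrative assumption: the paper resolves "doubtful cases" by the frequency condition without specifying a numeric threshold.

```python
import numpy as np

def select_winner(partials, E_m, f_m, tie_ratio=0.1):
    """Pick, from one source's isolated partials, the one whose energy is
    closest to that of the mixed partial (Equation (23)).  `partials` is a
    list of (energy, median_frequency) pairs; the median frequency is used
    only to break near-ties in the energy condition."""
    dE = [abs(E - E_m) for E, f in partials]
    best = min(range(len(partials)), key=lambda j: dE[j])
    # doubtful case: several candidates with almost the same energy distance
    close = [j for j in range(len(partials))
             if dE[j] <= (1 + tie_ratio) * dE[best]]
    if len(close) > 1:
        best = min(close, key=lambda j: abs(partials[j][1] - f_m))
    return best

def reconstructed_phase(phi_w, p, q, phi0=0.0):
    """Equation (27): phase of the separated contribution, obtained from
    the winner's phase scaled by the harmonic ratio p/q (phi0 taken as 0)."""
    return (p / q) * np.asarray(phi_w) + phi0
```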

3.2.2.2 Reconstruction. The main idea is quite simple. An overlapping harmonic P_m(t), given by Equation (22), leads us to one partial from each overlapping source, P_{w_k}(t), given by Equation (24). The reconstructed phases \phi_{s_k}(t) of each source contribution can be obtained from Equations (25) and (26). We will take:

\phi_{s_k}(t) \approx \frac{p_k}{q_k} \phi_{w_k}(t) \quad \forall k \in K_m   (27)

In all cases we suppose \phi_0 = 0, but in fact a random initial phase can be inserted without significant difference in either the numerical or the acoustical results.

The envelope of the overlapping harmonic, A_m(t), can be described by a discrete set of real numbers:

A_m(t) = \{ A_m(t_1), A_m(t_2), \dots, A_m(t_{l_m}) \}   (28)

where t_{l_m} is the last temporal sample of the partial. The amplitudes of the selected reconstruction partials can be written in a similar way:

A_k(t) = \{ A_k(t_1), A_k(t_2), \dots, A_k(t_{l_1}) \}   (29)

We want to find the best linear combination of A_k(t) \forall k \in K_m that minimizes the error in the obtention of A_m(t), that is:

A_m(t_i) = \sum_{k \in K_m} \alpha_k A_k(t_i) \quad \forall t_i   (30)

Equation (30) is equivalent to the least-squares solution in the presence of known covariance of the system:

A \alpha = b   (31)

where A is an l_{com} \times N_k matrix which contains the values of the envelopes of the N_k selected (winner) partials described by Equation (23), \alpha = (\alpha_1, \dots, \alpha_{N_k}), and b = A_m(t). This equation is evaluated at the points where all the involved partials, P_{w_k} \forall k \in K_m and P_m, have non-null values (a total of l_{com} samples, where the onset and offset times of the source, Equations (19) and (20), are taken into account to avoid the noisy background after the played note has ended). Then \alpha is in general the N_k-by-1 vector that minimizes the sum of squared errors (b - A\alpha)'(b - A\alpha), where ' denotes transposition. Once the mixture vector \alpha has been obtained for the partial P_m using Equations (30) or (31), the separated contributions of each present source are defined as:

P_{s_k}(t) = \alpha_k A_k(t) e^{j\phi_{s_k}(t)}   (32)

This process can be repeated for each overlapping partial of the mixed signal x(t), obtaining the set of separated partials for each source, P_{s_k}. Each one of the N separated signals is synthesized using an additive process:

s_k(t) = \sum_{P_k \in \mathcal{P}_k} P_k(t) + \sum_{P_{s_k} \in \mathcal{P}_{s_k}} P_{s_k}(t)   (33)

3.2.2.3 Main characteristics, advantages and limitations. Except for the crossed information recovered in the isolated partials (fundamentals and harmonics), which in general presents only one energetic source, there is no information wrongly assigned to any source, because the overlapping partials' information is never used to separate, but to estimate the separated synthetic partials from the information of the pure isolated sources. This means that the interference terms in the separation process will in general be negligible. On the other hand, the reconstruction process tends to generate artifacts and distortions when the envelopes used to generate the separated harmonics bear little resemblance to the expected harmonics. These results will be numerically confirmed in Section 4.

The advantages of this separation process are mainly two. First, the processes for the separation of overlapping harmonics (partial selection, calculation of the best linear combination, source reconstruction) are not computationally expensive; in fact, the obtention of the wavelet coefficients and their separation into partials takes much more computation time. The second advantage is that the separation is completely blind: we do not need any a priori characteristic of the input signal, neither the pitch contours of the original sources nor their relative energies, the number of present sources, etc.

At its current stage, the proposed technique can be used to separate two or more musical instruments

playing each one a single note. The final quality of the separation depends on the number of mixed sources, due to the use of isolated partials to reconstruct the overlapping harmonics: the higher the number of sources, the lower the number of isolated harmonics, and the worse the final musical timbre of the separated sources. This limitation is very important if we want to resynthesize the original sources, but it could be less important in other applications, such as automatic music transcription.
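The envelope fit of Equations (28)-(32) reduces to an ordinary least-squares problem; `numpy.linalg.lstsq` computes the alpha minimizing (b - A alpha)'(b - A alpha). A sketch, assuming the partials have already been restricted to the l_com common samples:

```python
import numpy as np

def separate_overlapping_partial(A_m, winners):
    """Split a mixed envelope among sources, Equations (30)-(32).

    A_m     : envelope of the overlapping partial (the vector b), l_com samples
    winners : list of (A_k, phi_sk) per source: winner envelope and phase
              already reconstructed through Equation (27)
    Returns the complex separated contributions alpha_k * A_k * e^{j phi_sk}.
    """
    A = np.column_stack([A_k for A_k, _ in winners])      # l_com x N_k
    alpha, *_ = np.linalg.lstsq(A, np.asarray(A_m), rcond=None)
    return [a_k * A_k * np.exp(1j * phi)
            for a_k, (A_k, phi) in zip(alpha, winners)]
```

Each returned contribution is then added to its source's partial set and resynthesized with the additive process of Equation (33).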

4 Experimental results

A total of 20 signals, 15 with 2 and 5 with 3 synthetically mixed sources (most of the original signals have again been chosen from the University of Iowa database [36]), have been analyzed and separated using this technique. The musical instrument set (see Table 3) includes flutes, clarinets, sax, trombones, trumpets, oboe, bassoon, horn, tuba, violin, viola, guitar and piano². In this Section, we show the obtained experimental results, including quality separation measurements.

4.1 Detailed Example

To clarify the specified separation process, we will use a concrete example, chosen arbitrarily from the set of analyzed signals (see the next Section): the separation of a mixture composed of a trumpet playing a D5 (587 Hz) vibrato and a tenor trombone playing a C5 (523 Hz). Hence, Equation (10) becomes:

x(t) = s_1(t) + s_2(t)   (34)

The waveform, modulus of the CWT matrix and scalogram of this signal can be seen in Figure 2. The numerical quality separation measurements for this signal are given in the next section.

²The piano is not really a harmonic instrument, and the separation process will relocate the mixed partials at the expected harmonic frequency. Consequently, the final spectrum of the piano will be distorted; but, in fact, most of the other acoustic characteristics of the signal are preserved. We have analyzed 3 signals containing a piano note (not included in the cited set of 20 separated signals), see Section 4.2.2.

In the example, we will focus on a single overlapping partial. The related isolated original partials will also be used to test the robustness of the proposed method. The main characteristics of the overlapping partial are shown in Tables 1 and 2, at the end of this section. The exact results given by the fundamental frequency estimator (Section 3.1) are f01 = 589.2527 Hz for the trumpet and f02 = 525.9583 Hz for the trombone. The instantaneous amplitudes of these partials are shown in Figure 7.

Figure 7: Envelopes of the fundamental partials. Blue line: trumpet fundamental. Green line: tenor trombone fundamental.
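Since the test mixtures are synthetic, Equation (34) is simply a sample-wise sum of the isolated recordings; a minimal sketch:

```python
import numpy as np

def mix_sources(sources):
    """Synthetic monaural mixture, Equation (34): x(t) = s1(t) + ... + sN(t).
    Shorter sources are implicitly zero-padded to the longest length."""
    n = max(len(s) for s in sources)
    x = np.zeros(n)
    for s in sources:
        x[:len(s)] += s
    return x
```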

After the harmonic analysis, some of the detected partials are clearly natural harmonics of one source or the other. The instantaneous amplitudes of the sets P1 and P2, both fulfilling Equation (18), are depicted in Figure 8. Note that the fundamental partials are included in these sets; their instantaneous amplitudes are depicted with thick lines. The separation process takes the overlapping partials one by one, selecting for each one a winner from each present source, chosen from the sets P1 and P2. The winner partials are also marked in Figure 8 (labeled partials #8 and #39, respectively). Using the fundamental frequencies f01 and f02 and the median frequency of the mixed partial, through Equation (13), the proportionality ratios (p1/q1, p2/q2) are calculated through Equation (25). The result is the estimated instantaneous phase for each separated partial. From these phases, using Equation (8), we can obtain the corresponding estimated instantaneous frequency. In Figure 9, the instantaneous frequencies of the isolated original partials and the estimated instantaneous frequency of each separated contribution are shown.

Figure 8: Envelopes of the isolated set of harmonics from each source. The fundamental envelopes are marked with thick lines. In blue: trumpet. In green: tenor trombone.

Figure 9: Comparison between the original (isolated) instantaneous frequencies and the estimated (separated) instantaneous frequencies. (a): results for the trumpet source (in blue, the original f_ins; in red, the estimated one). (b): results for the tenor trombone source.

In the next step, the least-squares solution of Equation (31) is used to find the best linear combination of the estimated contributions which fits the overlapping partial data. This estimation is composed of the scaled amplitudes of each winner partial, Equation (30), and the corresponding estimated phases,

Equation (27). In Figure 10, the waveforms of the original partials (obtained from the analysis of the isolated signals) which overlap in the mixture signal are depicted in blue; in red, the separated contributions for each source.

Figure 10: (a) and (c): Waveforms of the original partials (in blue: trumpet and tenor trombone, respectively). (b) and (d): Waveforms of the separated partials (in red: trumpet and tenor trombone, respectively).

Once each separated partial has been obtained using the presented technique, it is added to its corresponding source. This iterative process eventually results in the separated sources. Figure 11 shows the final waveforms of the separation of the vibrato trumpet (D5) and the tenor trombone (C5): in the first graph, Figure 11 (a), the original mixed signal; in Figure 11 (b) and (d), the original isolated signals; in Figure 11 (c) and (e), the separated sources (in red). In Figure 12, the Fourier spectra of the different signals are shown. The first spectrum, Figure 12 (a), is from the mixed signal. The next graphs, Figure 12 (b) and (c), show the trumpet and tenor trombone spectra, respectively: in each graph, the original isolated spectrum is plotted in blue and the spectrum of the separated source in red.

Figure 11: Final waveforms. (a): Mixed signal. (b) and (c): The original (isolated) trumpet (blue) and the separated trumpet (red). (d) and (e): The original (isolated) tenor trombone signal (blue) and the separated tenor trombone (red).

Figure 12: (a) Spectrum of the mix. (b) Spectra of the isolated trumpet (blue) and the separated trumpet (red). (c) Spectra of the isolated tenor trombone (blue) and the separated tenor trombone (red).

As can be seen in the spectra, most of the harmonic part of each source has been properly separated. The incorrectly estimated harmonics do not, in general, have enough energy to be relevant. However, temporal errors in the envelopes of the high-frequency partials can sometimes result in audible timbral differences. Tables 1 and 2 present the main results of the detailed example of the separation process.

Source    f0 (Hz)   fj (Hz)   p   q   phi0 (r/s)   RMS (dB)
s1 (D5)   589.25    4712.1    8   9   -2.36        -79.78
s2 (C5)   525.96    4733      5   7   -2.3         -94.68

Table 2: Main data from the mixed signal and the separated contributions to the overlapping partial of the example (f0 is mixed-signal global data; the remaining columns are separated-partial data).

Signal        f0 (Hz)   fj (Hz)   phi0 (r/s)   RMS (dB)
Trumpet D5    589.27    4712      2.97         -78.46
Trombone C5   525.94    4732.8    -0.86        -87.55

Table 1: Main data from the original (isolated) signals and the overlapping partial of the example (f0 is global data of each signal; fj, phi0 and RMS refer to the overlapping partial).

In Table 1, the data corresponding to the isolated signals are presented, including their fundamental frequencies, the median frequency of the mixed overlapping partial of the example, the initial phases from the wavelet coefficients, and the RMS value of the temporal envelope of the partials. In Table 2, the numerical results of the separation process appear, including the fundamental frequencies detected in the mixed signal, the median frequencies of the separated partials, the experimental values of p and q for each source, the initial phases of the separated partials, and the RMS value of the amplitude of each separated partial. As general conclusions: the fundamental frequencies are recovered with great accuracy; the instantaneous (and hence the median) frequency of each separated partial is also well recovered; and most of the energy of the expected partial has been correctly obtained. The initial phases of the isolated partials and the separated partials are, obviously, different. This last result means that this technique is not able to recover the exact waveform of the isolated


partials, but only their general shape. In general, this is not a great disadvantage, because if the envelopes and instantaneous frequencies are similar enough, the separated partials will sound similar. Figures 13 and 14 depict the wavelet spectrograms and scalograms (obtained from the CWAS algorithm) corresponding to the isolated signals (tenor trombone and trumpet, respectively) and their related separated sources. From the spectrograms (modulus of the CWT matrix), it can be observed that most of the harmonic information has been properly recovered. This conclusion is reinforced by the scalogram information. Note that the harmonic reconstruction produces an artificial scalogram (red line) harmonically coincident with the original scalogram (blue line). In the figures, the separated wavelet spectrogram shows that only the harmonic partials have been recovered. When the inharmonic partials carry important (non-noisy) information, the synthetic signal can sound somewhat different (as happened with the possible envelope errors in the high-frequency partials). The values of the standard quality measurement parameters for this example and the rest of the analyzed signals are detailed in Section 4.2.1.

4.2 Quality Separation Measurement

In this work, we assume that the errors committed in the separation process can have three different origins: interference between sources, distortions inserted in the separated signal, and artifacts introduced by the separation algorithm itself. Noise will not be taken into account. From this point of view, and given a mixture of N sources, Equation (10), let s_k be the original source and \hat{s}_k the separated source. The total relative distortion can be defined as [35] [37] [38]:

D_{total} = \frac{\|\hat{s}_k\|^2 - |\langle \hat{s}_k, s_k \rangle|^2}{|\langle \hat{s}_k, s_k \rangle|^2}   (35)

Assuming the orthogonal decomposition:

\hat{s}_k = \langle \hat{s}_k, s_k \rangle s_k + e_{interf} + e_{artif}   (36)

where \langle \hat{s}_k, s_k \rangle s_k is the contribution of the true source, e_{interf} is the error term due to interference from the other present sources, and e_{artif} is the error term due to the artifacts generated by the separation algorithm, the relative distortion due to interferences can be written as:

D_{interf} = \frac{\|e_{interf}\|^2}{|\langle \hat{s}_k, s_k \rangle|^2}   (37)

while the relative distortion due to artifacts is:

D_{artif} = \frac{\|e_{artif}\|^2}{\|\langle \hat{s}_k, s_k \rangle s_k + e_{interf}\|^2}   (38)

We have used three standard parameters related to these distortions to test the final quality of the separation results obtained with the proposed method: the signal-to-interference ratio (SIR), the signal-to-distortion ratio (SDR) and the signal-to-artifacts ratio (SAR):

SIR = 10 \log_{10}(D_{interf}^{-1})   (39)

SDR = 10 \log_{10}(D_{total}^{-1})   (40)

SAR = 10 \log_{10}(D_{artif}^{-1})   (41)
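The decomposition of Equations (35)-(41) can be sketched as follows. This is a simplified projection-based stand-in for the BSS_EVAL toolbox, assuming the true sources are available and treating the interference subspace as the span of the other true sources:

```python
import numpy as np

def bss_metrics(s_true, s_est, others):
    """Simplified SDR / SIR / SAR in dB, after Equations (35)-(41).

    The estimate is split, per Equation (36), into a target part
    (projection on the normalized true source), an interference part
    (least-squares projection on the other true sources) and an
    artifact residual.
    """
    s = s_true / np.linalg.norm(s_true)
    target = np.dot(s_est, s) * s
    B = np.column_stack(others)
    coef, *_ = np.linalg.lstsq(B, s_est - target, rcond=None)
    interf = B @ coef
    artif = s_est - target - interf
    sdr = 10 * np.log10(np.sum(target**2) / np.sum((interf + artif) ** 2))
    sir = 10 * np.log10(np.sum(target**2) / np.sum(interf**2))
    sar = 10 * np.log10(np.sum((target + interf) ** 2) / np.sum(artif**2))
    return sdr, sir, sar
```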

The following quality separation measurements have been obtained with the MATLAB toolbox BSS_EVAL, developed by Févotte, Gribonval and Vincent and distributed online under the GNU Public License [37]. The analyzed set of signals includes 15 signals with two sources (Section 4.2.1) and 5 signals with three sources (Section 4.2.3). A special case has been analyzed separately (Section 4.2.2): the separation of an inharmonic instrument (concretely, a piano playing a G#3 note). All the analyzed signals are real recordings of musical instruments, most of them extracted from [36]. To label the signals, we have followed the nomenclature presented in Table 3: the name of a signal lists each present instrument and the musical note it plays. For example, in the signal FnvC#5+GB4, the first source is a flute with no vibrato playing a C#5 note, and the second source is a guitar playing a B4. All the analyzed signals have been sub-sampled to fs = 22050 Hz (in order to save processing time), then

Figure 13: Spectrograms of the tenor trombone signals. (a) Wavelet spectrogram of the original (isolated) tenor trombone. (b) Blue line: original scalogram; red line: scalogram of the separated source; upper box: main information about the original signal. (c) Wavelet spectrogram of the separated source.

Label prefix   Musical instrument
AF             Alto Flute
AS             Alto Sax
B              Bassoon
BC             Bass Clarinet
BF             Bass Flute
Cb             Bb Clarinet
Ce             Eb Clarinet
Fv             Flute (vibrato)
Fnv            Flute (no vibrato)
G              Guitar (*)
H              Horn
O              Oboe
P              Piano (*)
S              Sax (*)
SS             Soprano sax
Tv             Trumpet (vibrato)
Tnv            Trumpet (no vibrato)
TrB            Bass trombone
TrT            Tenor trombone
Tu             Tuba
V              Violin
Vi             Viola

Table 3: Nomenclature of musical instruments. Instruments marked with an asterisk (*) do not come from [36].

synthetically mixed. The number of divisions per octave D and all the thresholds used in the CWAS and separation algorithms are the same for all the analyzed signals; concretely, we have taken D = {16; 32; 64; 128; 128; 100; 100; 100; 100}. Observe that the number of divisions per octave depends on the octave, so we have a variable resolution. The quality measurement results appear in Tables 4 to 6. In these tables, each row presents the numerical results of the SDR, SIR and SAR parameters, respectively, for each separated source, together with the median value of the parameter for the signal. The first numerical value corresponds to the first separated source (which is also the first musical instrument in the name of the signal). As anticipated at the end of Section 3.2.2, the expected SIR values are high, while the SDR and SAR parameters are expected to be lower.

4.2.1 Mixture of two sources

In the first 15 rows of Tables 4 to 6, there appear the SDR1,2, SIR1,2 and SAR1,2 values for each separated source, and the median SDR, SIR and SAR values for each analyzed signal, in the case of mixtures of two sources. The median value of each parameter can be used

Mixture of two harmonic sources

Signal          SDR1      SDR2      SDR
FnvC#5+GB4     30.9780   15.0499   23.0139
BD4+HG4        22.1792   12.2720   17.2256
TrTC5+TvD5     32.9437   28.1849   30.5643
VG4+GB4        16.4476   16.5802   16.5139
CeC#4+FvB4     26.2550   17.3567   21.8059
TnvC5+CbC#4    30.8267   30.8069   30.8168
SC3+FnvC#5     19.1246   24.9896   22.0571
TnvB5+OF5      25.3510   33.8416   29.5963
FvB3+SSG#3     16.6133   17.9595   17.2864
HF#4+BD#4      29.5651   29.0708   29.3180
AFA#5+OF#5     27.8945   18.0306   22.9625
TuC4+ViA#3     25.2563   27.1284   26.1924
BCD2+BFC3      12.5086   17.3728   14.9407
AFE4+AFF#5     27.1284   25.2563   26.1924
VA#4+CeG#4     20.3137   23.0616   21.6876

Mixture of two sources (one inharmonic instrument)

Signal          SDR1      SDR2      SDR
TnvB5+PG#3     24.6972   10.3093   17.5033
CbD#4+PG#3     20.7072   -3.3710    8.6681
FvB3+PG#3      13.9715   -7.1017    3.4349

Mixture of three harmonic sources

Signal                  SDR1      SDR2      SDR3      SDR
ASC#4+BE4+GB4          20.3099   21.7320    6.3413   16.1278
TvC5+FnvA4+OD5         22.1084   12.1312   13.1094   15.7830
TrBF#3+SSvG#3+FvC#4    18.1797   18.9864   11.5468   16.2376
HC4+VA4+CeD4           22.7484    2.9655    2.4187    9.3775
TnvE5+CbD5+TrTC5       19.6558   35.5402   16.2889   23.8283

Table 4: Numerical results of the SDR parameter.
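The SDR, SIR and SAR values in Tables 4 to 6 follow the BSS EVAL decomposition of each estimated source into a target component, interference and artifacts [37, 38]. As a rough illustration only, a single-channel, time-invariant-gain simplification of that decomposition can be sketched as follows (the actual toolbox admits more general allowed distortions):

```python
import numpy as np

def bss_eval_simple(est, refs, j):
    """SDR/SIR/SAR (dB) of estimate `est` for source j, given `refs`
    (n_sources x n_samples) of true sources.  Time-invariant-gain
    simplification of the BSS EVAL decomposition."""
    s = refs[j]
    # Target: orthogonal projection of the estimate onto the true source.
    s_target = (est @ s) / (s @ s) * s
    # Projection onto the span of all true sources.
    coeffs, *_ = np.linalg.lstsq(refs.T, est, rcond=None)
    p_all = refs.T @ coeffs
    e_interf = p_all - s_target    # energy leaked from the other sources
    e_artif = est - p_all          # everything outside the source span
    sdr = 10 * np.log10(np.sum(s_target**2) / np.sum((e_interf + e_artif)**2))
    sir = 10 * np.log10(np.sum(s_target**2) / np.sum(e_interf**2))
    sar = 10 * np.log10(np.sum((s_target + e_interf)**2) / np.sum(e_artif**2))
    return sdr, sir, sar

# Toy check: an estimate of source 0 with a little interference and noise.
fs, n = 22050, 22050
t = np.arange(n) / fs
s1, s2 = np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 660 * t)
rng = np.random.default_rng(0)
est = s1 + 0.1 * s2 + 0.01 * rng.standard_normal(n)
sdr, sir, sar = bss_eval_simple(est, np.vstack([s1, s2]), 0)
# sir ≈ 20 dB here, since the interference amplitude is one tenth of the target
```

Because the artifact term is orthogonal to the source span, SDR never exceeds SIR, consistent with the pattern of high SIR and lower SDR/SAR seen in the tables.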

Mixture of two harmonic sources

Signal          SIR1      SIR2      SIR
FnvC#5+GB4     64.0949   86.3455   75.2202
BD4+HG4        72.4347   60.1982   66.3165
TrTC5+TvD5     90.5903   84.8376   87.7139
VG4+GB4        55.0862   50.9414   53.0138
CeC#4+FvB4    110.5218   78.9238   94.7228
TnvC5+CbC#4    76.1676   66.9925   71.5801
SC3+FnvC#5     64.7522   58.5935   61.6729
TnvB5+OF5      85.6674   74.6080   80.1377
FvB3+SSG#3     71.0086   80.8125   75.9105
HF#4+BD#4      65.9539   71.7516   68.8527
AFA#5+OF#5     70.7206   68.1027   69.4117
TuC4+ViA#3     82.3606   64.5388   73.4497
BCD2+BFC3      62.1065   55.8376   58.9720
AFE4+AFF#5     64.5388   82.3606   73.4497
VA#4+CeG#4     66.2725   66.7609   66.5167

Mixture of two sources (one inharmonic instrument)

Signal          SIR1      SIR2      SIR
TnvB5+PG#3     66.8617   67.1622   67.0120
CbD#4+PG#3     50.5429   72.4518   61.4974
FvB3+PG#3      22.9468   37.2872   30.1170

Mixture of three harmonic sources

Signal                  SIR1      SIR2      SIR3      SIR
ASC#4+BE4+GB4          41.7025   51.0600   60.3403   51.0343
TvC5+FnvA4+OD5         39.4132   45.7920   53.6480   46.2844
TrBF#3+SSvG#3+FvC#4    67.8138   69.7338   67.6455   68.3977
HC4+VA4+CeD4           58.3956   41.2671   32.0106   43.8911
TnvE5+CbD5+TrTC5       81.3426   71.7489   69.1010   74.0642

Table 5: Numerical results of the SIR parameter.

Mixture of two harmonic sources

Signal          SAR1      SAR2      SAR
FnvC#5+GB4     30.9801   15.0499   23.0150
BD4+HG4        22.1792   12.2721   17.2257
TrTC5+TvD5     32.9437   28.1849   30.5643
VG4+GB4        16.4483   16.5818   16.5150
CeC#4+FvB4     26.2550   17.3567   21.8059
TnvC5+CbC#4    30.8269   30.8080   30.8174
SC3+FnvC#5     19.1247   24.9915   22.0581
TnvB5+OF5      25.3510   33.8420   29.5965
FvB3+SSG#3     16.6133   17.9595   17.2864
HF#4+BD#4      29.5661   29.0710   29.3186
AFA#5+OF#5     27.8947   18.0306   22.9627
TuC4+ViA#3     25.2563   27.1292   26.1928
BCD2+BFC3      12.5087   17.3734   14.9410
AFE4+AFF#5     27.1292   25.2563   26.1928
VA#4+CeG#4     20.3138   23.0617   21.6878

Mixture of two sources (one inharmonic instrument)

Signal          SAR1      SAR2      SAR
TnvB5+PG#3     24.6974   10.3093   17.5034
CbD#4+PG#3     20.7117   -3.3710    8.6704
FvB3+PG#3      14.5814   -7.1008    3.7403

Mixture of three harmonic sources

Signal                  SAR1      SAR2      SAR3      SAR
ASC#4+BE4+GB4          20.3418   21.7371    6.3413   16.1401
TvC5+FnvA4+OD5         22.1905   12.1332   13.1098   15.8112
TrBF#3+SSvG#3+FvC#4    18.1798   18.9864   11.5468   16.2377
HC4+VA4+CeD4           22.7496    2.9664    2.4262    9.3807
TnvE5+CbD5+TrTC5       19.6558   35.5412   16.2889   23.8287

Table 6: Numerical results of the SAR parameter.

Figure 14: Spectrograms of the Trumpet signals. (a) Wavelet spectrogram of the original (isolated) trumpet. (b) Blue line: Original scalogram. Red line: Scalogram of the separated source. Upper box: main information about the original signal. (c) Wavelet spectrogram of the separated source.

to measure the overall quality of the separation in the case of two mixed sources. These global values are:

• SDR2s = 23.3449 dB.
• SIR2s = 71.7961 dB.
• SAR2s = 23.3453 dB.

4.2.2 An inharmonic instrument: the piano

The reconstruction process includes the phase and instantaneous frequency reconstruction of the overlapping partials, considering them natural harmonics of the present fundamental frequencies. If the musical instrument is not harmonic, the reconstruction process will incorrectly relocate the overlapping partials in the spectrum, producing an increase in artifacts and distortions. These results are reflected in rows 16 to 18 of Tables 4 to 6. In the first signal (TnvB5+PG#3), the number of overlapping harmonics is low, so the final separation reaches values similar to the results presented in Section 4.2.1. As the number of overlapping partials increases, the standard parameter values for the piano signal decrease significantly (columns SDR2 and SAR2 of Tables 4 and 6, respectively).

4.2.3 Mixture of three sources

In the last 5 rows of Tables 4 to 6 appear the SDR1,2,3, SIR1,2,3 and SAR1,2,3 values for each separated source, and the mean SDR, SIR and SAR values for each analyzed signal; in these rows, each analyzed signal contains three sources. The mean values of the standard parameters in the case of three mixed sources are:

• SDR3s = 16.2709 dB.
• SIR3s = 56.7343 dB.
• SAR3s = 16.2797 dB.

These results are coherent with the situation. For the same precision in the frequency axis, the higher the number of sources, the smaller the separation between partials and the higher the probability of interference (lower SIR). Hence, the final distortions and artifacts tend to increase.
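The global figures are the arithmetic means of the per-signal values listed in Tables 4 to 6. For instance, the SDR summaries can be reproduced directly from the last column of Table 4:

```python
import numpy as np

# Per-signal mean SDR values, transcribed from the last column of Table 4.
sdr_two_sources = [23.0139, 17.2256, 30.5643, 16.5139, 21.8059, 30.8168,
                   22.0571, 29.5963, 17.2864, 29.3180, 22.9625, 26.1924,
                   14.9407, 26.1924, 21.6876]
sdr_three_sources = [16.1278, 15.7830, 16.2376, 9.3775, 23.8283]

sdr_2s = np.mean(sdr_two_sources)    # ≈ 23.3449 dB, the SDR2s reported above
sdr_3s = np.mean(sdr_three_sources)  # ≈ 16.2709 dB, the SDR3s reported above
```

The small discrepancies in the last decimal come from averaging already-rounded table entries.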

5 Conclusions

In this work, a new BASS technique for monaural musical notes has been presented. The differences between the proposed algorithm and the existing ones are mainly two. First, the time-frequency analysis tool is based not on the STFT but on the CCWT, which offers a highly coherent model of the audio signal in both the time and frequency domains. This tool allows us to obtain with great accuracy the instantaneous evolution (in time and frequency) of the isolated harmonics, which are easily assignable to the sources present in the mixture. Second, the overlapping partials can be entirely reconstructed from the isolated partials by searching for the linear combination that minimizes the amplitude error in the mixing process, under the CAM principle. Non-overlapping partials with energy similar to that of the overlapping partials are used: if the overlapping partial has high energy, the correlation factor tends to be high, and if the energy is low, the errors associated with the low correlation are usually acceptable. In this way, phase reconstruction is not as critical as in other techniques, and the separated sources present both high separation quality measurements and high acoustic resemblance to the original signals.
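The amplitude-reconstruction step summarized above can be illustrated with a toy sketch: under the CAM assumption, the envelope of an overlapped band is modeled as the best linear combination, in the least-squares sense, of the envelopes of isolated partials. (The envelope shapes and weights below are invented for illustration; the actual algorithm operates on CWAS band envelopes.)

```python
import numpy as np

def split_overlap(mix_env, candidate_envs):
    """Least-squares weights a minimizing ||mix_env - candidate_envs.T @ a||,
    i.e. the linear combination of isolated-partial envelopes that best
    explains the envelope of an overlapped frequency band."""
    a, *_ = np.linalg.lstsq(candidate_envs.T, mix_env, rcond=None)
    return a

t = np.linspace(0, 1, 1000)
env_a = np.exp(-3 * t)                               # isolated partial, source A
env_b = np.exp(-3 * t) * (1 + 0.5 * np.sin(8 * t))   # isolated partial, source B
mix = 0.7 * env_a + 0.3 * env_b                      # overlapped band (synthetic)

weights = split_overlap(mix, np.vstack([env_a, env_b]))
# weights ≈ [0.7, 0.3]: each source's share of the band is weights[i] * env_i
```

Since the synthetic mixture is an exact linear combination of two linearly independent envelopes, the least-squares fit recovers the weights to numerical precision; on real envelopes the residual is what drives the distortion terms measured above.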

The proposed technique can currently be used to separate two or more sources, each playing a single (and different) note. In order to develop a complete source separation algorithm, it would be necessary to implement the fundamental frequency estimator in a frame-to-frame context, dynamically obtaining the onset and offset time of each played note in order to separate them. If no input data other than the mixed signal is to be used (that is, knowing neither the number of present sources nor the pitch or energy distribution of the sources), it would also be essential to develop a timbre classification algorithm, attempting to assign the different notes played by the same musical instrument to the same source. These challenges remain as future work.

Acknowledgments

This work has been supported by the Spanish project TEC2006-13883-C04-01.

References

[1] S. Rickard, Blind Speech Separation, chapter 8: The DUET Blind Source Separation Algorithm, pp. 217–241, Springer Netherlands, 2007.

[2] T. Melia, Underdetermined Blind Source Separation in Echoic Environments Using Linear Arrays and Sparse Representations, Ph.D. thesis, School of Electrical, Electronic and Mechanical Engineering, University College Dublin, National University of Ireland, 2007.

[3] M. Cobos and J. J. López, "Stereo audio source separation based on time-frequency masking and multilevel thresholding," Digital Signal Processing, vol. 18, pp. 960–976, 2008.

[4] M. Cobos, Application of Sound Source Separation Methods to Advanced Spatial Audio Systems, Ph.D. thesis, Universidad Politécnica de Valencia, 2009.

[5] Ö. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Transactions on Signal Processing, vol. 52, No. 7, pp. 1830–1847, 2004.

[6] A. S. Bregman, Auditory Scene Analysis: The Perceptual Organization of Sound, MIT Press, 1990.

[7] G. J. Brown and M. Cooke, "Computational Auditory Scene Analysis," Computer Speech & Language, Elsevier, vol. 8, No. 4, pp. 297–336, 1994.

[8] D. Wang and G. J. Brown, Computational Auditory Scene Analysis: Principles, Algorithms, and Applications, Wiley-IEEE Press, 2006.

[9] G. Cauwenberghs, "Monaural Separation of Independent Acoustical Components," Proceedings of the 1999 IEEE International Symposium on Circuits and Systems (ISCAS '99), vol. 5, pp. 62–65, 1999.

[10] S. Amari and J. F. Cardoso, "Blind Source Separation – Semiparametric Statistical Approach," IEEE Transactions on Signal Processing, vol. 45, No. 11, pp. 2692–2700, 1997.

[11] J. F. Cardoso, "Blind signal separation: Statistical principles," Proceedings of the IEEE, vol. 86, pp. 2009–2025, October 1998.

[12] M. G. Jafari, S. A. Abdallah, M. D. Plumbley, and M. E. Davies, "Sparse Coding for Convolutive Blind Audio Source Separation," Lecture Notes in Computer Science – Independent Component Analysis and Blind Signal Separation, vol. 3889, pp. 132–139, 2006.

[13] S. A. Abdallah, Towards Music Perception by Redundancy Reduction and Unsupervised Learning in Probabilistic Models, Ph.D. thesis, King's College London, 2002.

[14] T. Virtanen, "Monaural Sound Source Separation by Non-Negative Matrix Factorization with Temporal Continuity and Sparseness Criteria," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, No. 3, pp. 1066–1074, 2007.

[15] M. N. Schmidt and M. Mørup, "Nonnegative Matrix Factor 2-D Deconvolution for Blind Single Channel Source Separation," Proceedings of the 6th International Conference on Independent Component Analysis and Blind Signal Separation (ICA'06), vol. 3889 of Lecture Notes in Computer Science, pp. 700–707, 2006.

[16] D. D. Lee and H. S. Seung, "Learning the Parts of Objects by Non-negative Matrix Factorization," Nature, vol. 401, pp. 788–791, 1999.

[17] M. N. Schmidt and R. K. Olsson, "Single-channel Speech Separation Using Sparse Non-negative Matrix Factorization," International Conference on Spoken Language Processing (ICSLP'06), pp. 2614–2617, 2006.

[18] T. Virtanen and A. Klapuri, "Separation of Harmonic Sound Sources Using Sinusoidal Modeling," Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '00), vol. 2, pp. 765–768, 2000.

[19] T. Virtanen, Sound Source Separation in Monaural Music Signals, Ph.D. thesis, Tampere University of Technology, 2006.

[20] M. R. Every and J. E. Szymanski, "Separation of Synchronous Pitched Notes by Spectral Filtering of Harmonics," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, No. 5, pp. 1845–1856, 2006.

[21] Y. Li, J. Woodruff, and D. L. Wang, "Monaural Musical Sound Separation Based on Pitch and Common Amplitude Modulation," IEEE Transactions on Audio, Speech and Language Processing, vol. 17, No. 7, pp. 1361–1371, 2009.

[22] J. Woodruff, Y. Li, and D. L. Wang, "Resolving Overlapping Harmonics for Monaural Musical Sound Separation Using Pitch and Common Amplitude Modulation," Proceedings of the International Conference on Music Information Retrieval, pp. 538–543, 2008.

[23] G. Hu and D. Wang, "Monaural Speech Segregation Based on Pitch Tracking and Amplitude Modulation," IEEE Transactions on Neural Networks, vol. 15, No. 5, pp. 1135–1150, 2004.

[24] J. J. Burred and T. Sikora, "On the Use of Auditory Representations for Sparsity-Based Sound Source Separation," Fifth International Conference on Information, Communications and Signal Processing (ICICS 2005), pp. 1466–1470, 2005.

[25] T. W. Parsons, "Separation of Speech from Interfering Speech by means of Harmonic Selection," The Journal of the Acoustical Society of America, vol. 60, No. 4, pp. 911–918, 1976.

[26] T. Virtanen and A. Klapuri, "Separation of Harmonic Sounds Using Multipitch Analysis and Iterative Parameter Estimation," IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics, pp. 83–86, 2001.

[27] A. Klapuri, "Multiple Fundamental Frequency Estimation Based on Harmonicity and Spectral Smoothness," IEEE Transactions on Speech and Audio Processing, vol. 11, No. 6, pp. 804–816, 2003.

[28] I. Daubechies, Ten Lectures on Wavelets, vol. 61 of CBMS-NSF Regional Conference Series in Applied Mathematics, SIAM, 1992.

[29] J. R. Beltrán and J. Ponce de León, "Analysis and Synthesis of Sounds through Complex Bandpass Filterbanks," Proceedings of the 118th Convention of the Audio Engineering Society (AES'05), Preprint 6361, May 2005.

[30] J. R. Beltrán and J. Ponce de León, "Estimation of the Instantaneous Amplitude and the Instantaneous Frequency of Audio Signals using Complex Wavelets," Signal Processing, vol. 90, No. 12, pp. 3093–3109, 2010.

[31] J. R. Beltrán and J. Ponce de León, "Extracción de Leyes de Variación Frecuenciales Mediante la Transformada Wavelet Continua Compleja," Proceedings of the XX Simposium Nacional de la Unión Científica Internacional de Radio (URSI'05), 2005.

[32] B. Boashash, "Estimating and Interpreting the Instantaneous Frequency of a Signal. Part 1: Fundamentals," Proceedings of the IEEE, vol. 80, No. 4, pp. 520–538, April 1992.

[33] J. R. Beltrán and J. Ponce de León, "Blind Source Separation of Monaural Musical Signals Using Complex Wavelets," Proceedings of the 12th International Conference on Digital Audio Effects (DAFx-09), 2009.

[34] J. R. Beltrán, J. Ponce de León, N. Degara, and A. Pena, "Localización de Onsets en Señales Musicales a través de Filtros Pasobanda Complejos," Proceedings of the XXIII Simposium Nacional de la Unión Científica Internacional de Radio (URSI'08), 2008.

[35] R. Gribonval, E. Vincent, C. Févotte, and L. Benaroya, "Proposals for performance measurement in source separation," Proceedings of the International Conference on Independent Component Analysis and Blind Source Separation (ICA), pp. 763–768, 2003.

[36] L. Fritts (Electronic Music Studios, University of Iowa), http://theremin.music.uiowa.edu/MIS.html.

[37] C. Févotte, R. Gribonval, and E. Vincent, "BSS EVAL Toolbox User Guide – Revision 2.0," Tech. Rep., IRISA Technical Report 1706, Rennes, France, 2005.

[38] E. Vincent, R. Gribonval, and C. Févotte, "Performance Measurement in Blind Audio Source Separation," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, No. 4, pp. 1462–1469, 2006.
