830 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 4, MAY 2009

A Spatio–Temporal Speech Enhancement Technique Based on Generalized Eigenvalue Decomposition

Malay Gupta, Member, IEEE, and Scott C. Douglas, Senior Member, IEEE

Abstract—We present a new spatio–temporal algorithm for speech enhancement using microphone arrays. Our technique uses an iterative method for computing the generalized eigenvectors of the multichannel data as measured from the microphone array. Coefficient adaptation is performed using the spatio–temporal correlation coefficient sequences of the observed data. The technique avoids large matrix–vector multiplications and has lower computational resource requirements as compared to competing methods. The technique also does not require a calibrated microphone array and is applicable to a wide variety of noise types, including stationary correlated noise and nonstationary speech-like (e.g., babble) background noise. Application of the method to microphone array data in various environmental settings indicates that the procedure can achieve significant gains in signal-to-interference ratios (SIRs) even in low SIR environments, without introducing musical tone artifacts in the enhanced speech.

Index Terms—Acoustic arrays, adaptive arrays, eigenvalues and eigenfunctions, decorrelation, speech enhancement.

I. INTRODUCTION

In many applications involving speech signals, reverberation effects and background acoustic noise corrupt the speech signal, degrading speech intelligibility. When using a cellular telephone, the presence of acoustic noise in the environment often lowers the quality of the transmitted speech. Acoustic noise can also lower the performance of many commercial speech recognition systems used in automated toll-free telephone lines, voice activated global positioning system (GPS) navigation control systems, voice dialers for portable communication devices, speaker identification systems, and so on.
Speech enhancement is an important feature for such devices as well as in applications for medicine (e.g., hearing aids) and law enforcement (e.g., forensics).

Manuscript received February 11, 2008; revised October 20, 2008. Current version published April 01, 2009. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Tim Fingscheidt. M. Gupta was with the Department of Electrical Engineering, Southern Methodist University (SMU), Dallas, TX 75275 USA. He is now with Research in Motion Corporation, Rolling Meadows, IL 60008 USA (e-mail: mgupta@rim.com). S. C. Douglas is with the Department of Electrical Engineering, Southern Methodist University, Dallas, TX 75275 USA (e-mail: douglas@engr.smu.edu). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TASL.2008.2012327

The first speech enhancement techniques developed by signal processing engineers used a single microphone for simplicity. Boll’s spectral subtraction algorithm [1] and its numerous variants can reduce background acoustic noise in single-channel speech signals; however, these methods can create distortion in the form of musical tones in the enhanced speech, leading to lower intelligibility. This effect is particularly prominent when the noise spectrum is nonstationary or when the input signal-to-noise ratio (SNR) is low (e.g., less than 5 dB) [2]. Several engineers have addressed this issue by using constrained optimization techniques in an attempt to reduce musical noise artifacts while maintaining the residual noise energy below some acceptable threshold [3]–[7].

Signal-subspace-based processing [8] is an alternative to spectral subtraction in which the noisy speech is decomposed into two mutually orthogonal subspaces: the signal-plus-noise subspace, and the noise-only subspace.
Common methods for this calculation are the singular value decomposition (SVD), the eigenvalue decomposition (EVD), or the Karhunen–Loève transform (KLT) applied to the second-order statistics of the observed noisy speech and/or any noise processes corrupting the speech. Speech enhancement is performed by nulling signals within the noise subspace and enhancing signals within the signal-plus-noise subspace via the use of a spectral gain function. Subspace approaches assume that 1) speech fits a low-rank model, and 2) the noise signal corrupting the speech is uncorrelated in time [8]. The first assumption is often reasonable in practice; however, the corrupting noise component is typically time-correlated, resulting in lower performance in practical scenarios. Extensions of the algorithm in [8] to the correlated noise case are described in [9], [10]. The techniques in [9] and [10] assume that clean speech can be represented by the eigenvectors of the observed data, a fact that is difficult to show to be true except in the uncorrelated noise case. Alternatively, one can estimate generalized eigenvectors that are common to both the signal-plus-noise subspace and the noise-only subspace [11], [12]. These methods combine voice activity detection, generalized singular-value decomposition (GSVD) or generalized eigenvalue decomposition (GEVD) processing, and/or spectral domain manipulations. The techniques in [11], [12] integrate noise prewhitening within the algorithms and typically provide better speech enhancement in correlated noise scenarios as compared to previously described approaches.

Microphone arrays [13] have recently attracted much interest in the speech enhancement community due to their ability to combine spatial beamforming [14] with temporal processing for more effective speech enhancement. A multimicrophone subspace algorithm based on the GSVD [15] is an extension of the method in [11] to the multichannel case.
Computing the GSVD is nontrivial and requires specialized algorithms [16], [17]. Similarly, computing the GEVD also requires more work as compared to the EVD due to an integrated whitening procedure for the former method. As such, convolutive time-domain extensions of the methods like [12] to the multimicrophone case are computationally expensive, particularly given the dimensional increases in the calculations due to multisensor data sets. For example, in an -microphone system with an -tap long filter per channel, the direct computation of the GEVD of an correlation matrix can be difficult for typical data processing window sizes, such that most existing methods are practical only for short windows (e.g., 40 to 80 samples long) that limit their overall effectiveness. One widely used technique to deal with computational issues in convolutive multichannel signal processing problems is to transform the problem into the frequency domain. One such approach was described in [18], where a gradient algorithm was employed to compute the GEVD corresponding to data in each frequency bin to perform acoustic beamforming for speech signal enhancement.

1558-7916/$25.00 © 2009 IEEE Authorized licensed use limited to: SUN YAT-SEN UNIVERSITY. Downloaded on April 3, 2009 at 21:22 from IEEE Xplore. Restrictions apply.
GUPTA AND DOUGLAS: SPATIO–TEMPORAL SPEECH ENHANCEMENT TECHNIQUE 831

In this paper, we develop a novel multimicrophone time-domain speech enhancement technique based on an iterative methodology to compute the generalized eigenfilters for enhancing speech from spatio–temporal correlation sequences. The method requires measurements of the noise-only signal field as heard at the microphone array and incorporates a clever time-domain filter update for performing joint diagonalization of the spatio–temporal correlation statistics of both the noisy speech and the background noise signals [19].
The advantage of this technique is in its use of a single eigenfilter for representing an entire -dimensional signal subspace by time shifts of the corresponding filter impulse response. Our technique does not involve large matrix–vector multiplications or any matrix inversions and hence is computationally attractive for real-time processing. It also does not require a calibrated microphone array. Application of the method to microphone array data measured both in a laboratory environment and in an ordinary conference room indicates that the procedure can achieve significant gains in signal-to-interference ratios (SIRs) even in low SIR environments, without introducing musical tone artifacts in the enhanced speech.

II. SPATIO–TEMPORAL EIGENFILTERING

Let denote a clean speech signal measured at the output of an -microphone array in the presence of a time-correlated noise signal . The th microphone signal is written as (1) where are the coefficients of the time-invariant acoustic impulse response between the speech source and the th microphone. The and signals represent the filtered speech and the noise component at the output of the th microphone, respectively. The additive noise is uncorrelated with the clean speech and has an unknown correlation structure. A vector model incorporating signals from all the microphones is given by (2) where , , and are -dimensional vectors corresponding to the observed signal, the clean speech signal and the noise signal at the microphone array at time instant .

We model the speech enhancement problem to be an iterative multichannel linear filtering task, in which the multichannel filter output at iteration is (3) where the matrix sequence , , contains the coefficients of a multichannel adaptive filter. For ease of notation, we constrain to be even-valued. To perform speech enhancement, the sequence is adjusted such that the total SIR of the multichannel signal outputs is maximized. SIR maximization for correlated noise interference is related to the GEVD and has been referred to as oriented principal component analysis (OPCA) [20]. OPCA solves for generalized eigenvectors that, when applied to the data, maximize the signal variance and minimize the noise variance in the multichannel output.

We can express the total power in the elements of as (4) where is the length of the data block used to compute the power estimate at iteration and denotes the matrix trace. The sequence denotes the multichannel autocorrelation sequence of and is defined as (5) In the above relations, is zero outside of the range , and is constrained to be zero outside of the range . Note that the value of depends upon acoustic mixing conditions, in which highly reverberant environments require larger values of for speech enhancement. By substituting (2) in (3), we obtain (6) Assuming that speech and noise signals are uncorrelated with each other, the total output signal power can be written as , where (7) (8) and and are the multichannel autocorrelation sequences of the speech and noise signals received at the microphone array.

In practice, and are not directly available; however, we assume that there exist speech silence periods whereby can be estimated from the sensors using data during these silence periods. Let the length of the available multichannel noise-only data block be , such that (9) As we do not have access to , to compute we replace by which is an estimate of the total signal power that depends on the autocorrelation sequence of the noisy signal .
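The link between output power and the multichannel autocorrelation sequence described above can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: it filters synthetic three-channel data with a random FIR matrix sequence and confirms that the directly measured output power equals a trace expression built from the estimated autocorrelation sequence. The function names (`circular_autocorr`, `filter_output`, `power_from_correlation`), the synthetic data, and the use of circular time shifts (chosen so that the identity holds exactly on a finite block) are assumptions made for illustration, and the lag and transpose conventions may differ from those of (4) and (5).

```python
import numpy as np

def circular_autocorr(x, max_lag):
    """R[tau] = (1/N) * sum_l x(l) x(l - tau)^T for tau = -max_lag..max_lag.

    x has shape (channels, N). Time shifts are circular (an assumption
    made here so the power identity is exact on a finite block).
    """
    n_samples = x.shape[1]
    # np.roll(x, tau, axis=1)[:, l] equals x[:, l - tau] (circularly)
    return {tau: x @ np.roll(x, tau, axis=1).T / n_samples
            for tau in range(-max_lag, max_lag + 1)}

def filter_output(W, x):
    """z(l) = sum_p W[p] x(l - p), using the same circular shifts."""
    return sum(W[p] @ np.roll(x, p, axis=1) for p in range(len(W)))

def power_from_correlation(W, R):
    """Total output power as tr{ sum_p sum_q W[q] R(p - q) W[p]^T }."""
    taps = len(W)
    return sum(np.trace(W[q] @ R[p - q] @ W[p].T)
               for p in range(taps) for q in range(taps))

rng = np.random.default_rng(0)
channels, n_samples, taps = 3, 2000, 5
x = rng.standard_normal((channels, n_samples))            # synthetic data
W = [rng.standard_normal((channels, channels)) for _ in range(taps)]

z = filter_output(W, x)
direct_power = np.sum(z ** 2) / n_samples                 # measured on z
corr_power = power_from_correlation(W, circular_autocorr(x, taps - 1))
print(direct_power, corr_power)                           # the two should agree
```

Because the power depends on the data only through the correlation sequence, the same expression can be evaluated with a correlation estimate taken from a noise-only block, which is what makes the silence-period assumption in the text useful.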
In order to maximize the speech signal power, given the knowledge about the noise correlation sequence we propose to find the coefficient sequence that maximizes the following metric: (10) The function is a spatio–temporal extension of the Rayleigh quotient [17], such that the sequence that maximizes (10) corresponds to the generalized eigenfilters of the multichannel autocorrelation sequence pair . Hence, at the stationary point of (10), the sequence satisfies if otherwise (11) if otherwise (12) where and denote the generalized eigenvalues and eigenfilters of . In other words, the coefficient sequence simultaneously diagonalizes the sequences and in both space and time. We now describe an algorithm that attempts to solve (11)–(12) in an iterative fashion.

III. ITERATIVE COMPUTATION OF GENERALIZED EIGENVALUE DECOMPOSITION

The GEVD performs simultaneous eigendecomposition of two matrices. In our application, these matrices are obtained from the correlation statistics of the speech-plus-noise signal and the noise-only signal as heard at the microphone array. The GEVD usually involves the calculation of a generally nonorthogonal matrix that whitens the noise-only signal and diagonalizes the correlation statistics of the speech-plus-noise signal. While many numerical methods for the GEVD have been developed [17], they can be computationally prohibitive when the dimensions of the matrices involved are large. We propose an iterative algorithm for solving (11)–(12) that combines two iterative procedures for spatio–temporal whitening of the noise correlation statistics and the spatio–temporal diagonalization of the signal-plus-noise correlation statistics in a single coefficient update. Our algorithm uses spatio–temporal correlation coefficient sequences of dimension , thus avoiding the manipulation of block Toeplitz autocorrelation matrices common in other approaches [15]. The algorithm we propose for solving (11)–(12) is inspired by the family of adaptive algorithms presented in [21] and is a nontrivial extension of the adaptive eigenvalue decomposition method described in [21] to solving the generalized eigenvalue problem in the space of multichannel filters. For descriptive simplicity, we first develop the algorithm in the spatial-only case before extending the approach to the temporal domain.

A. Spatial Case

Consider the signal enhancement system in (6) for . We first assume that the noise statistics are spatially uncorrelated, such that . Initially, we have (13) where is some nonsingular (e.g., orthogonal) matrix. Our goal is to compute the matrix such that after sufficiently many iterations (14) and (15) where is an diagonal matrix with as its diagonal entries, and is an identity matrix. A widely known solution to this joint diagonalization problem corresponds to the GEVD of the matrix pair ( ) [17]. The GEVD solution in (14)–(15) corresponds to whitening and rotation operations on and , respectively. In [21], various iterative updates for the inverse Cholesky and eigenvalue decompositions of a symmetric matrix were proposed. A differential equation for computing the inverse Cholesky decomposition of a positive definite symmetric matrix such that and is upper triangular¹ is given by (16) where denotes the upper triangular part of the matrix . A differential equation for computing the lower triangular² matrix with unity diagonal elements such that is a diagonal matrix is given by (17) where denotes the strictly lower triangular³ part of . While two separate systems could be used to implement (16) and (17), we recognize the following key point: a common stationary point to both differential equations is the generalized eigendecomposition of the matrix pair ( ). However, in order to develop a practical, real-time version of the algorithm, discrete-time versions of differential equations (16) and (17) are required. It has previously been noted in [22], [23] in the context of spatio–temporal subspace analysis that substituting finite differences for differentials can lead to numerical problems due to error accumulation, and as such requires normalizations/modifications in the algorithm to ensure stability of the resulting algorithm. These realizations lead us to consider the coupled update given by (18) (19) (20) (21) (22) (23) where , , and are particular scaling factors used to stabilize such algorithms [24]. The and are th elements of matrices and , respectively, and is a step-size parameter. These scaling factors are chosen to impose an a posteriori norm constraint on and that affects the stability of the gradient update. These scaling factors are easy to compute and impose little additional computational complexity on the algorithm [24].

¹ Upper triangular part of a matrix including the diagonal elements.
² Lower triangular part of a matrix including the diagonal elements.
³ Lower triangular part of a matrix excluding the diagonal elements.

However, the algorithm in (18)–(23) is unwieldy as it involves several matrix multiplications. We desire a more elegant and simpler update relation that combines the two differential forms in (16) and (17). Our strategy is inspired by the EASI algorithm for blind source separation [25] and uses the combined matrix (24) where (25) The terms and are time-dependent scaling factors used to adjust the average magnitude of the elements of to match that of and are defined as (26) The coefficient updates are then given by (27) where , , and is the th element of the matrix [24].

Fig. 1 shows the convergence of both algorithms with a step size of , in which and are (3 × 3) symmetric positive definite matrices as given in the figure caption. The generalized eigenvalues of the matrix pair ( ) were computed using the MATLAB command and were compared against the diagonal elements of for each iterative method described in this section. The generalized eigenvalues computed with MATLAB were 26.68, 0.95, and 0.03 and are plotted as solid lines, and iterative values are plotted with dotted lines. The algorithms appear to work as desired; however, their behaviors require a careful analysis to justify the observed performance. A local stability analysis of the proposed method in (27) is presented in Section IV.

Fig. 1. Convergence of the proposed iterative generalized eigenvalue procedures. The x-axis in each subplot denotes the number of iterations, and the y-axis corresponds to the estimated diagonal element of the quadratic form. The top row corresponds to (14), and the bottom row corresponds to (15). The top row shows the generalized eigenvalues (1 ≤ i ≤ 3) of the data matrix pair in solid lines and their iterative approximations in dotted lines. The bottom row shows the convergence of the quadratic form in (15) to an identity matrix. The two data matrices are given by [[1 0.4 0.3] [0.4 0.7 0.8] [0.3 0.8 1]] and [[0.8 0.5 0.7] [0.5 0.6 0.2] [0.7 0.2 0.9]]. (a) Convergence of (18)–(23) for 1 ≤ k ≤ 100. (b) Convergence of (24)–(27) for 1 ≤ k ≤ 100.

B. Spatio–Temporal Extensions

An extension of the procedure in (27) to the multichannel filter structure in (3) is now described. It requires spatio–temporal extensions of the update term (24) and produces coefficient updates for the matrix sequence , . We first define spatio–temporal extensions of (18) and (25) in terms of multichannel convolution operations involving the coefficient sequence and spatio–temporal correlation sequences corresponding to the noise-only process and the speech-plus-noise process . The th element of the multichannel correlation sequence corresponding to and are computed as if otherwise if (28) otherwise if otherwise if (29) otherwise. In the above set of equations, is a windowing operation given by , where , is a centered Bartlett window. This windowing operation was found to be necessary to ensure the validity of the estimated autocorrelation sequences at each iteration and to allow the algorithm to converge to a stable stationary point. Based on the above set of equations, we define the th update term as (30) The terms and are scaling factors used to adjust the average magnitude of to match that of and are defined as (31) (32) where and are the elements of the matrix sequences and , respectively. Finally, we define a correction term for the coefficient updates as (33) The coefficient updates are then given by (34) where , and [24]. Typically, step sizes in the range are chosen and appear to work well for a broad range of values of and in different environments, as illustrated in the simulations section. Note that the correction term in (33) and the spatio–temporal update equation in (34) closely follow the algorithm developed for the spatial-only case in (27), in which matrix multiplications of the spatial-only case are replaced by filtering operations in the spatio–temporal case and follow the rules of FIR matrix algebra, a detailed description of which can be found in [26].

Hence, drawing an analogy from the spatial-only case, the stationary points of (34) correspond to the generalized eigenfilters of the correlation sequence pair and correspond to the solution (11)–(12), such that upon convergence . Once the algorithm has converged, the output signals contain filtered versions of the speech-plus-noise signal when speech is present. Clearly, the signal with highest SIR is chosen as the output of the system.

In order to simplify the notation of the above algorithm, one can define the -transform of the multichannel coefficient sequence at iteration as (35) The algorithm update in -domain form can be written as (36)–(37). In this relation, the matrix polynomials and are the -transforms of the sequences and , respectively, and are defined equivalently to (35). In addition, the expression corresponds to truncation of the polynomial to only include polynomial terms of order .

Extensive simulations indicate that this algorithm achieves the stationary point for typical data sets and the convergence speed of the algorithm is similar to that of [24]. To illustrate this behavior, Fig. 2 shows plots of , , and after 100 iterations of the algorithm operating on single-talker data taken from a three-microphone laboratory setup under babble noise conditions. The microphones were arranged in a linear array with 4-cm spacing and the mixture had an initial SIR of approximately dB at each microphone and . Each of the nine plots shows a for and such that the -axis of each plot is . The last column of plots shows the entire sequence, indicating that .

Fig. 2. Sequences , , and obtained after 100 iterations of (27) as applied to three-microphone data; see text for explanation.

The presented spatio–temporal version of our algorithm requires an initial computation of a pair of correlation sequences which are then used further in the algorithm to find the generalized eigenfilters. FFT-based fast techniques are used to perform correlation and convolution operations needed in the algorithm. The initial computation of the biased multichannel correlation sequences for speech-plus-noise and noise-only sequences requires multiplies and adds and then the algorithm iterates on these multichannel sequences (constrained to length ) to optimize the multichannel filter coefficient sequences so that they converge to generalized eigenfilters of the two multichannel correlation sequences. The total number of operations needed to perform a single iteration [of (28)–(34)] is approximately , which is less than the computational complexity of [15] which is of the order of at least .

IV. ALGORITHM ANALYSIS

In this section, we analyze the local stability properties of the iterative GEVD algorithm given in (30)–(34). We first consider the special case of corresponding to the algorithm given in (24)–(27). The structure of the following closely follows a similar proof of stability given in [21]. Consider the update rule in (27) for . Extensive computer simulations indicate that this algorithm causes , which results in and ; thus, we set and in what follows. We now prove the following.

Theorem 1: The algorithm in (24)–(27) is locally asymptotically stable about the solution satisfying and (38)

Proof: To analyze the algorithm in (24)–(27), we study its associated averaged ODE given by (39) To study the local stability properties of (39) about a solution satisfying (38), define (40) where is a small perturbation matrix whose entries satisfy . Substituting this value of in (39), utilizing the joint diagonalization property of , postmultiplying both sides of the right-hand side of (39) by and ignoring second- and higher order terms denoted by , we obtain a linearized version of (39) as (41) where is a positive multiplicative constant which after neglecting terms of is given as (42) where are the diagonal entries of . An expansion of (41) in its matrix components reveals that the evolutions of the diagonal entries of are uncoupled and follow the relation (43) The solution to the above scalar ODE is (44) It is easy to see that converges exponentially to zero as . The evolutions of the off-diagonal entries of are pairwise-coupled and satisfy for the equation (45) where . The valid solutions to this differential equation are exponential and are of the form (46) where and are the eigenvalues and eigenvectors of the (2 × 2) matrix on the right-hand side of (45). The eigenvalues are (47) Note that if and only if . Since for , . This result coupled with the convergence of to zero guarantees the local stability of at the solution .

Fig. 3. Measurement environment used for numerical evaluations. (a) Acoustic chamber showing loudspeakers 6, 7, 8 and the microphone array. Wall treatments were used to control the reverberation time in the chamber to 300 ms. (b) Conference room showing speakers 5 to 8 and the microphone array. The reverberation time of the conference room was around 600 ms as measured using the pseudo noise sequences.

Remark: The above proof shows that the proposed technique is a locally stable procedure for GEVD computation for matrix pair .
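The joint diagonalization that the iterative update is meant to reach can be cross-checked with a direct computation. The sketch below is not the iterative algorithm in (27); it is the textbook whiten-then-rotate construction (a Cholesky factor of one matrix followed by a symmetric eigendecomposition), applied to the two (3 × 3) matrices listed in the Fig. 1 caption. Which of the two matrices plays the noise-only role is an assumption here: the assignment used below is the one that reproduces the generalized eigenvalues 26.68, 0.95, and 0.03 quoted in Section III. The names `R_noise`, `R_noisy`, and `gevd_by_whitening` are hypothetical.

```python
import numpy as np

# Matrices from the Fig. 1 caption. The role assigned to each one is an
# assumption, chosen because it matches the eigenvalues quoted in the text.
R_noise = np.array([[1.0, 0.4, 0.3],
                    [0.4, 0.7, 0.8],
                    [0.3, 0.8, 1.0]])
R_noisy = np.array([[0.8, 0.5, 0.7],
                    [0.5, 0.6, 0.2],
                    [0.7, 0.2, 0.9]])

def gevd_by_whitening(R_x, R_v):
    """Return (lam, W) with W R_v W^T = I and W R_x W^T = diag(lam)."""
    C = np.linalg.cholesky(R_v)        # R_v = C C^T with C lower triangular
    C_inv = np.linalg.inv(C)           # whitening transform for R_v
    M = C_inv @ R_x @ C_inv.T          # R_x expressed in whitened coordinates
    lam, Q = np.linalg.eigh(M)         # rotation that diagonalizes M
    order = np.argsort(lam)[::-1]      # sort eigenvalues in descending order
    lam, Q = lam[order], Q[:, order]
    return lam, Q.T @ C_inv            # rows of W are generalized eigenvectors

lam, W = gevd_by_whitening(R_noisy, R_noise)
print(np.round(lam, 2))

# The first row of W maximizes the ratio of the two quadratic forms.
w = W[0]
rayleigh = (w @ R_noisy @ w) / (w @ R_noise @ w)
```

The first row of `W` is the spatial-only analogue of the SIR-maximizing filter: its generalized Rayleigh quotient equals the largest generalized eigenvalue, and any other vector gives a smaller ratio.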
The proof does not guarantee that the spatio–temporal extension to the multichannel filter impulse response will have the same behavior. Our experience with the algorithm, however, indicates that the convergence properties in the extension are maintained, and that the algorithm achieves fast convergence, as indicated by the instantaneous example shown in Fig. 1. This fact is underscored by the notation used in (36)–(37), which closely follows that of the algorithm for if .

V. NUMERICAL EVALUATIONS

To study the performance and robustness of the proposed method, we performed extensive tests in a speech enhancement task involving microphone array data recorded from a laboratory environment as well as from a conference room. In all cases, we use actual recordings as the experimental data. The impulse responses in Fig. 4 are shown for illustrative purposes only. The description and test procedure in both environments are as follows.

1) Acoustic Laboratory Setting: Data for these experiments has been collected in the acoustic chamber within the Multimedia Systems Laboratory at SMU. In this setup, the acoustic laboratory has up to three loudspeakers, of which one simulates the speech source and the other two simulate the noise sources. Fig. 3(a) shows a picture of the lab in which the microphones are labeled as 1 to 4 from right to left, and the loudspeakers are labeled as 6 to 8 also from right to left. The wall treatments for the experiments were chosen to obtain a reverberation time of 300 ms. The microphone array employs between two and four omnidirectional lapel microphones in an approximate linear array with a nominal 4-cm spacing. For the case in which the speech source is corrupted by a single noise source, the directions of arrival (DOA) of the speech source and the noise source are approximately 30 and 30 , respectively, from the array normal.
For the case in which two noise sources corrupt the speech source, the DOAs of the noise sources are 30 and 0 , whereas the DOA of the speech source is 30 . All sound sources are equidistant from the microphone array and are located 1.25 m away from the array.

2) Conference Room Setting: Data for these experiments has been collected in a conference room within the SMU School of Engineering. The conference room has untreated walls with a nominal reverberation time of 600 ms. Fig. 3(b) shows a picture of the conference room. The loudspeakers in the picture are numbered from 5 to 8 from left to right. The microphones are numbered from 1 to 4 from left to right. For the case when the speech source is corrupted by a single noise source, loudspeaker 5 simulates the noise source and loudspeaker 7 simulates the speech source, corresponding to the DOAs of approximately 15 and 15 , respectively. In this case, the noise source and the speech source are equidistant from the microphone array at a distance of 2.34 m. For the case when the speech source is corrupted by two noise sources, loudspeakers 5 and 7 simulate the noise sources corresponding to the DOAs of 15 and 15 , respectively, and loudspeaker 8 simulates the speech source with a DOA of 30 . The distances of the two noise sources and the speech source are 2.34, 2.34, and 1.22 m, respectively.

All measurements were made using 10 s of data per channel at a 48-kHz sampling rate and were downsampled to an 8-kHz sampling rate for processing. For each data set, the first 3 s contains only noise, whereas the last 7 s contains speech plus noise. Thus, the initial 3 s and last 7 s of data can be used to estimate and for the proposed algorithm. The algorithm was allowed to run for 100 iterations in every case with , after which the signal was found to largely contain the speech source of interest.
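The SIR improvements reported in this section require splitting each recorded or processed signal into a speech part and an interference part. The paper states only that least-squares methods were used for this, so the following is one plausible sketch under stated assumptions, not the authors' procedure: a short FIR filter is fitted from the known source signal to the observed signal, the fitted part is treated as speech, and the residual is treated as interference. The function name `ls_sir_db`, the 32-tap fit length, the toy impulse response, and the white-noise stand-ins for speech and interference are all hypothetical.

```python
import numpy as np

def ls_sir_db(reference, observed, taps=32):
    """Least-squares SIR estimate (in dB) of `observed` given the source signal."""
    n = len(observed)
    # Column p holds the reference delayed by p samples: S[l, p] = reference[l - p]
    S = np.zeros((n, taps))
    for p in range(taps):
        S[p:, p] = reference[:n - p]
    h, *_ = np.linalg.lstsq(S, observed, rcond=None)   # FIR fit, source -> observed
    speech_part = S @ h                                # part explained by the source
    interference = observed - speech_part              # everything else
    return 10.0 * np.log10(np.sum(speech_part ** 2) / np.sum(interference ** 2))

rng = np.random.default_rng(1)
n = 8000                                  # 1 s at the 8-kHz processing rate
speech = rng.standard_normal(n)           # stand-in for the recorded source
noise = rng.standard_normal(n)            # stand-in for the interference
room = np.array([1.0, 0.5, 0.25])         # toy acoustic impulse response
at_mic = np.convolve(speech, room)[:n]    # source as heard at a microphone

initial_sir = ls_sir_db(speech, at_mic + noise)
final_sir = ls_sir_db(speech, at_mic + 0.1 * noise)   # interference cut by 20 dB
sir_gain = final_sir - initial_sir                    # Final SIR minus Initial SIR
print(round(sir_gain, 1))
```

In this toy setup the interference is attenuated by a known factor, so the estimated gain can be compared against the 20 dB that was built in. With real recordings the fit length would need to cover the room response plus the enhancement filter.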
The total processing time with our MATLAB implementation in this setup varied from approximately 1.5 s for to 25 s for on a 3.6-GHz single-core Pentium PC. Since recorded speech was employed, least-squares methods were used to estimate the contributions of this speech signal before and after processing to determine initial SIRs and the SIR improvement obtained by the algorithm. Extensive experiments were run to understand the algorithm’s behavior to variations in 1) Noise distribution [pink, babble, and pink plus babble], 2) Number of noise sources [1 or 2], 3) Initial SIR [from −10 dB to 10 dB], 4) Number of microphones [from 2 to 4], 5) Filter length [from 256 to 1024], 6) Reverberation time [300 ms or 600 ms], and 7) Duration of the reference noise segment [from 0.5 s to 3 s]. In addition, we compared the performance of the procedure with a multichannel Wiener filter method [15].

Fig. 4. Impulse responses. (a) Acoustic chamber impulse response for speakers 6, 7, 8, and microphones 1, 2, 3, 4 at a sampling frequency of Fs = 8 kHz. (b) Conference room impulse response for speakers 5, 7, 8, and microphones 1, 2, 3, 4 at a sampling frequency of 8 kHz.

TABLE I. SIR GAIN IN dB WITH A SINGLE PINK NOISE INTERFERER

Tables I–III tabulate the SIR improvement computed as (Final SIR − Initial SIR) for numerous test cases. Based on these results, we can conclude the following.

1) The algorithm’s performance is impressive across all SIRs and environments considered. For example, it provides over 20 dB of SIR gain for a four-microphone array in laboratory environment and over 12 dB in a conference room for an initial SIR of 10 dB with babble noise interference (see Table II). This performance saturates when the initial SIR is high and when the filter length parameter is large.
2) A larger number of microphones does not lead to a proportional increase in performance. We observe that, for a given initial SIR and filter length, the SIR improvement in going from two microphones to three is greater than that obtained in going from three microphones to four.
3) There is a natural tradeoff between the number of microphones and the filter length needed to achieve a given level of performance in the presence of a fixed number of interfering signals. For the same level of enhancement, a system with more microphones requires a smaller filter length. Similar observations were made in [27] via the application of the MINT theorem [28].
4) The algorithm performs best under babble noise conditions. Since babble noise is nonstationary, our estimate of the noise correlation statistics is clearly not an exact match to the actual noise correlation statistics during the speech-plus-noise period. Thus, our algorithm appears not to be highly sensitive to estimation errors in the noise correlation statistics.
5) As expected, the performance of the algorithm in a highly reverberant conference room is degraded compared to that in a controlled laboratory environment. In reverberant environments, longer filters are needed to compensate for the additional reflections caused by the increased reverberation. Moreover, for a fixed number of microphones, increasing the filter length produces more gain than increasing the number of microphones for a given filter length.

In all cases, the enhanced speech heard at the first output was found to be free of any musical tones. The only artifact produced by the proposed algorithm appears to be a slight whitening or spectral flatness of the enhanced speech signals. Similar to frequency-domain methods [18], time-domain postfiltering methods could also be developed to compensate for this spectral flatness.

Next, we investigate the effect of the duration of the reference noise segment on the performance of the algorithm. In this experiment, we vary the noise segment duration from 0.5 s to 3 s in intervals of 0.5 s. The results are shown in Fig. 5. From the figure, it is seen that the algorithm is robust to the duration of the available reference noise segment. Going from a noise segment of 0.5 s duration to one of 3 s duration changes the gain in performance by roughly 1 dB at the tested initial SIR with a single babble noise interferer.

For reference, we also show the performance of the multichannel Wiener filter of [15], given by the relation (48), in which the two matrices are symmetric block-Toeplitz matrices corresponding to the second-order statistics of the noisy speech and the noise, respectively. The middle row of the resulting filter matrix was then used as the multichannel filter to compute the enhanced speech. The performance of this system for data from the 300-ms laboratory environment at an initial SIR of 0 dB is given in Table IV. Clearly, this method does not perform as well as our proposed approach. As the spectral outputs of the methods differ, we also investigated signal whitening of the Wiener filter output before measuring the SIR performance of the method; however, no significant changes in SIR performance were observed.

TABLE II. SIR GAIN IN dB WITH A SINGLE BABBLE NOISE INTERFERER

TABLE III. SIR GAIN IN dB WITH A BABBLE NOISE INTERFERER AND A PINK NOISE INTERFERER

TABLE IV. REFERENCE PROCEDURE: SIR GAIN IN dB WITH THE MULTICHANNEL WIENER FILTER, RT = 300 ms

Fig. 5. SIR improvement as a function of the duration of the available noise segment.

VI. CONCLUSION

This paper describes a novel method for multimicrophone speech enhancement that uses knowledge of the spatio–temporal characteristics of the noise field in an iterative procedure. The algorithm does not involve matrix inverses or large-scale matrix-vector multiplications, converges quickly, and requires little fine-tuning. Extensive numerical experiments under different scenarios show significant SIR gains for a broad range of initial SIR conditions, without introducing musical tone artifacts in the enhanced speech. A local stability analysis of the algorithm has also been presented to verify its good convergence properties.

REFERENCES

[1] S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acoust., Speech, Signal Process., vol. 27, no. 2, pp. 113–120, Apr. 1979.
[2] Y. Hu and P. C. Loizou, “A comparative intelligibility study of speech enhancement algorithms,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Honolulu, HI, Apr. 15–20, 2007, pp. 561–564.
[3] M. Berouti, R. Schwartz, and J. Makhoul, “Enhancement of speech corrupted by acoustic noise,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Apr. 1979, vol. 4, pp. 208–211.
[4] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-32, no. 6, pp. 1109–1121, Dec. 1984.
[5] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Trans. Acoust., Speech, Signal Process., vol. 33, no. 2, pp. 443–445, Apr. 1985.
[6] Z. Goh, K. C. Tan, and B. T. G. Tan, “Postprocessing method for suppressing musical noise generated by spectral subtraction,” IEEE Trans. Speech Audio Process., vol. 6, no. 3, pp. 287–292, May 1998.
[7] N. Virag, “Single channel speech enhancement based on masking properties of the human auditory system,” IEEE Trans. Speech Audio Process., vol. 7, no. 2, pp. 126–137, Mar. 1999.
[8] Y. Ephraim and H. L. Van Trees, “A signal subspace approach for speech enhancement,” IEEE Trans. Speech Audio Process., vol. 3, no. 4, pp. 251–266, Jul. 1995.
[9] U. Mittal and N. Phamdo, “Signal/noise KLT based approach for enhancing speech degraded by colored noise,” IEEE Trans. Speech Audio Process., vol. 8, no. 2, pp. 159–167, Mar. 2000.
[10] A. Rezayee and S. Gazor, “An adaptive KLT approach for speech enhancement,” IEEE Trans. Speech Audio Process., vol. 9, no. 2, pp. 87–95, Feb. 2001.
[11] S. H. Jensen, P. C. Hansen, S. D. Hansen, and J. A. Sorensen, “Reduction of broad-band noise in speech by truncated QSVD,” IEEE Trans. Speech Audio Process., vol. 3, no. 6, pp. 439–448, Nov. 1995.
[12] Y. Hu and P. C. Loizou, “A generalized subspace approach for enhancing speech corrupted by colored noise,” IEEE Trans. Speech Audio Process., vol. 11, no. 4, pp. 334–341, Jul. 2003.
[13] M. Brandstein and D. Ward, Eds., Microphone Arrays: Signal Processing Techniques and Applications. New York: Springer, 2001.
[14] B. D. Van Veen and K. M. Buckley, “Beamforming: A versatile approach to spatial filtering,” IEEE ASSP Mag., vol. 5, no. 2, pp. 4–24, Apr. 1988.
[15] S. Doclo and M. Moonen, “GSVD-based optimal filtering for single and multimicrophone speech enhancement,” IEEE Trans. Signal Process., vol. 50, no. 9, pp. 2230–2244, Sep. 2002.
[16] F. T. Luk, “A parallel method for computing the generalized singular value decomposition,” J. Parall. Distrib. Comput., vol. 2, no. 3, pp. 250–260, Aug. 1985.
[17] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Baltimore, MD: Johns Hopkins Univ. Press, 1996.
[18] E. Warsitz and R. Haeb-Umbach, “Blind acoustic beamforming based on generalized eigenvalue decomposition,” IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 5, pp. 1529–1539, Jul. 2007.
[19] M. Gupta and S. C. Douglas, “An iterative spatio-temporal speech enhancement algorithm for microphone arrays,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Las Vegas, NV, 2008, pp. 81–84.
[20] K. I. Diamantaras and S. Y. Kung, Principal Component Neural Networks: Theory and Applications. New York: Wiley-Interscience, 1996.
[21] S. C. Douglas, “Simple adaptive algorithms for Cholesky, $LDL^T$, QR, and eigenvalue decompositions of autocorrelation matrices for sensor array data,” in Proc. 35th Asilomar Conf. Signals Syst. Comput., Pacific Grove, CA, Nov. 2001, vol. 2, pp. 1134–1138.
[22] S. C. Douglas, S.-Y. Kung, and S. Amari, “A self-stabilized minor subspace rule,” IEEE Signal Process. Lett., vol. 5, no. 12, pp. 328–330, Dec. 1998.
[23] S. C. Douglas, S. Amari, and S.-Y. Kung, “Gradient adaptive paraunitary filter banks for spatio-temporal subspace analysis and multichannel blind deconvolution,” J. VLSI Signal Process., vol. 37, pp. 247–261, 2004.
[24] S. C. Douglas and M. Gupta, “Scaled natural gradient algorithms for instantaneous and convolutive blind source separation,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Honolulu, HI, Apr. 15–20, 2007, vol. 2, pp. II-637–II-640.
[25] J.-F. Cardoso and B. H. Laheld, “Equivariant adaptive source separation,” IEEE Trans. Signal Process., vol. 44, no. 12, pp. 3017–3030, Dec. 1996.
[26] R. Lambert, “Multichannel blind deconvolution: FIR matrix algebra and separation of multipath mixtures,” Ph.D. dissertation, Dept. of Elect. Eng., Univ. of Southern California, Los Angeles, 1996.
[27] J. Benesty, J. Chen, and Y. Huang, “On microphone-array beamforming from a MIMO acoustic signal processing perspective,” IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 3, pp. 1053–1065, Mar. 2007.
[28] M. Miyoshi and Y. Kaneda, “Inverse filtering of room acoustics,” IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 2, pp. 145–152, Feb. 1988.

Malay Gupta (S’00–M’06) was born in Kanpur, India. He received the B.E. degree (with distinction) in electrical engineering from the University of Roorkee (now Indian Institute of Technology), Roorkee, India, in 1997, and the M.S. and Ph.D. degrees in electrical engineering from the University of New Mexico, Albuquerque, in 2002 and 2006, respectively. He joined Research in Motion Corporation in February 2008, where he works in the DSP Algorithms Group. From December 2005 to February 2008, he was with Southern Methodist University, Dallas, TX, where he worked as a Postdoctoral Researcher. From August 1997 to June 2000, he was with the Control and Automation Division, Larsen and Toubro, Ltd., Mumbai, India. His research interests include multiuser communication systems, blind source separation, independent component analysis, and speech enhancement.

Scott C. Douglas (S’88–M’92–SM’98) received the B.S. (with distinction), M.S., and Ph.D. degrees in electrical engineering from Stanford University, Stanford, CA, in 1988, 1989, and 1992, respectively. He is currently an Associate Professor in the Department of Electrical Engineering, Southern Methodist University, Dallas, TX. He is the author or coauthor of two books, six book chapters, and more than 150 articles in journals and conference proceedings. He frequently consults with industry and offers short courses and tutorials in the areas of signal processing and source separation. He has served as an editor or coeditor of several books and conference proceedings, including the Digital Signal Processing Handbook (CRC, 1997), the International Symposium on Active Control of Sound and Vibration, and the IEEE Workshops on Machine Learning for Signal Processing conference records. His research activities include source separation, speech enhancement, adaptive computational imaging devices, and hardware implementation of DSP systems. Dr. Douglas is the recipient of the 2002 IEEE Signal Processing Society Best Paper Award in Audio and Electroacoustics and the 2003 Gerald J. Ford Research Fellowship. He has served on several technical committees of the IEEE Signal Processing Society and is a Past Chair of the Neural Networks for Signal Processing Technical Committee. He has helped organize several meetings of the IEEE, including the ICASSP and ISCAS conference series, the Digital Signal Processing and Signal Processing Education Workshop series, and the Machine Learning for Signal Processing Workshop series. He is the General Chair of the International Conference on Acoustics, Speech, and Signal Processing 2010 in Dallas. He is also one of the key authors and developers of curriculum and technology for The Infinity Project, a joint effort between educational, civic, and government organizations to implement engineering curricula at the precollege level. He is a member of Phi Beta Kappa and Tau Beta Pi.
