Document summary

Microphone Arrays: Signal Processing Techniques (echo cancellation, sidelobe cancellation)

Document preview

Digital Signal Processing

Michael Brandstein · Darren Ward (Eds.)

Microphone Arrays
Signal Processing Techniques and Applications

With 149 Figures

Springer-Verlag Berlin Heidelberg GmbH
Engineering Online Library: http://www.springer.de/engine/

Series Editors

Prof. Dr.-Ing. ARILD LACROIX
Johann-Wolfgang-Goethe-Universität, Institut für angewandte Physik
Robert-Mayer-Str. 2-4, D-60325 Frankfurt

Prof. Dr.-Ing. ANASTASIOS VENETSANOPOULOS
University of Toronto, Dept. of Electrical and Computer Engineering
10 King's College Road, M5S 3G4 Toronto, Ontario, Canada

Editors

Prof. MICHAEL BRANDSTEIN
Harvard University, Div. of Eng. and Applied Sciences
33 Oxford Street, MA 02138 Cambridge, USA
e-mail: msb@hrl.harvard.edu

Dr. DARREN WARD
Imperial College, Dept. of Electrical Engineering
Exhibition Road, SW7 2AZ London, GB
e-mail: d.ward@ic.ac.uk

ISBN 978-3-642-07547-6
ISBN 978-3-662-04619-7 (eBook)
DOI 10.1007/978-3-662-04619-7

CIP data applied for

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in other ways, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag Berlin Heidelberg GmbH. Violations are liable for prosecution under the German Copyright Law.

http://www.springer.de
© Springer-Verlag Berlin Heidelberg 2001
Originally published by Springer-Verlag Berlin Heidelberg New York in 2001
Softcover reprint of the hardcover 1st edition 2001

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting: Camera-ready copy by authors
Cover design: de'blik, Berlin
SPIN: 10836055 62/3020 543210
Printed on acid-free paper

Preface

The study and implementation of microphone arrays originated over 20 years ago. Thanks to the research and experimental developments pursued to the present day, the field has matured to the point that array-based technology now has immediate applicability to a number of current systems and a vast potential for the improvement of existing products and the creation of future devices.

In putting this book together, our goal was to provide, for the first time, a single complete reference on microphone arrays. We invited the top researchers in the field to contribute articles addressing their specific topic(s) of study. The reception we received from our colleagues was quite enthusiastic and very encouraging. There was the general consensus that a work of this kind was well overdue. The results provided in this collection cover the current state of the art in microphone array research, development, and technological application.

This text is organized into four sections which roughly follow the major areas of microphone array research today. Parts I and II are primarily theoretical in nature and emphasize the use of microphone arrays for speech enhancement and source localization, respectively. Part III presents a number of specific applications of array-based technology. Part IV addresses some open questions and explores the future of the field.

Part I concerns the problem of enhancing the speech signal acquired by an array of microphones.
For a variety of applications, including human-computer interaction and hands-free telephony, the goal is to allow users to roam unfettered in diverse environments while still providing a high quality speech signal and robustness against background noise, interfering sources, and reverberation effects. The use of microphone arrays gives one the opportunity to exploit the fact that the source of the desired speech signal and the noise sources are physically separated in space. Conventional array processing techniques, typically developed for applications such as radar and sonar, were initially applied to the hands-free speech acquisition problem. However, the environment in which microphone arrays are used is significantly different from that of conventional array applications. Firstly, the desired speech signal has an extremely wide bandwidth relative to its center frequency, meaning that conventional narrowband techniques are not suitable. Secondly, there is significant multipath interference caused by room reverberation. Finally, the speech source and noise signals may be located close to the array, meaning that the conventional far-field assumption is typically not valid. These differences (amongst others) have meant that new array techniques have had to be formulated for microphone array applications.

Chapter 1 describes the design of an array whose spatial response does not change appreciably over a wide bandwidth. Such a design ensures that the spatial filtering performed by the array is uniform across the entire bandwidth of the speech signal. The main problem with many array designs is that a very large physical array is required to obtain reasonable spatial resolution, especially at low frequencies. This problem is addressed in Chapter 2, which reviews so-called superdirective arrays. These arrays are designed to achieve spatial directivity that is significantly higher than that of a standard delay-and-sum beamformer.
Chapter 3 describes the use of a single-channel noise suppression filter on the output of a microphone array. The design of such a post-filter typically requires information about the correlation of the noise between different microphones. The spatial correlation functions for various directional microphones are investigated in Chapter 4, which also describes the use of these functions in adaptive noise cancellation applications. Chapter 5 reviews adaptive techniques for microphone arrays, focusing on algorithms that are robust and perform well in real environments. Chapter 6 presents optimal spatial filtering algorithms based on the generalized singular-value decomposition. These techniques require a large number of computations, so the chapter presents techniques to reduce the computational complexity and thereby permit real-time implementation. Chapter 7 advocates a new approach that combines explicit modeling of the speech signal (a technique which is well known in single-channel speech enhancement applications) with the spatial filtering afforded by multi-channel array processing.

Part II is devoted to the source localization problem. The ability to locate and track one or more speech sources is an essential requirement of microphone array systems. For speech enhancement applications, an accurate fix on the primary talker, as well as knowledge of any interfering talkers or coherent noise sources, is necessary to effectively steer the array, enhancing a given source while simultaneously attenuating those deemed undesirable. Location data may be used as a guide for discriminating individual speakers in a multi-source scenario. With this information available, it would then be possible to automatically focus upon and follow a given source on an extended basis. Of particular interest lately is the application of the speaker location estimates for aiming a camera or series of cameras in a video-conferencing system.
In this regard, the automated localization information eliminates the need for a human or number of human camera operators. Several existing commercial products apply microphone-array technology in small-room environments to steer a robotic camera and frame active talkers.

Chapter 8 summarizes the various approaches which have been explored to accurately locate an individual in a practical acoustic environment. The emphasis is on precision in the face of adverse conditions, with an appropriate method presented in detail. Chapter 9 extends the problem to the case of multiple active sources. While again considering realistic environments, the issue is complicated by the presence of several talkers. Chapter 10 further generalizes the source localization scenario to include knowledge derived from non-acoustic sensor modalities. In this case both audio and video signals are effectively combined to track the motion of a talker.

Part III of this text details some specific applications of microphone array technology available today. Microphone arrays have been deployed for a variety of practical applications thus far and their utility and presence in our daily lives are increasing rapidly. At one extreme are large aperture arrays with tens to hundreds of elements designed for large rooms, distant talkers, and adverse acoustic conditions. Examples include the two-dimensional, harmonic array installed in the main auditorium of Bell Laboratories, Murray Hill and the 512-element Huge Microphone Array (HMA) developed at Brown University. While these systems provide tremendous functionality in the environments for which they are intended, small arrays consisting of just a handful (usually 2 to 8) of microphones and encompassing only a few centimeters of space have become far more common and affordable.
These systems are intended for sound capture in close-talking, low to moderate noise conditions (such as an individual dictating at a workstation or using a hands-free telephone in an automobile) and have exhibited a degree of effectiveness, especially when compared to their single microphone counterparts. The technology has developed to the point that microphone arrays are now available in off-the-shelf consumer electronic devices for under $150. Because of their growing popularity and feasibility we have chosen to focus primarily on the issues associated with small-aperture devices.

Chapter 11 addresses the incorporation of multiple microphones into hearing aid devices. The ability of beamforming methods to reduce background noise and interference has been shown to dramatically improve the speech understanding of the hearing impaired and to increase their overall satisfaction with the device. Chapter 12 focuses on the case of a simple two-element array combined with postfiltering to achieve noise and echo reduction. The performance of this configuration is analyzed under realistic acoustic conditions and its utility is demonstrated for desktop conferencing and intercom applications. Chapter 13 is concerned with the problem of acoustic feedback inherent in full-duplex communications involving loudspeakers and microphones. Existing single-channel echo cancellation methods are integrated within a beamforming context to achieve enhanced echo suppression. These results are applied to single- and multi-channel conferencing scenarios. Chapter 14 explores the use of microphone arrays for sound capture in automobiles. The issues of noise, interference, and echo cancellation specifically within the car environment are addressed and a particularly effective approach is detailed. Chapter 15 discusses the application of microphone arrays to improve the performance of speech recognition systems in adverse conditions.
Strategies for effectively coupling the acoustic signal enhancements afforded through beamforming with existing speech recognition techniques are presented. A specific adaptation of a recognizer to function with an array is presented. Finally, Chapter 16 presents an overview of the problem of separating blind mixtures of acoustic signals recorded at a microphone array. This represents a very new application for microphone arrays, and is a technique that is fundamentally different to the spatial filtering approaches detailed in earlier chapters. In the final section of the book, Part IV presents expert summaries of current open problems in the field, as well as personal views of what the future of microphone array processing might hold. These summaries, presented in Chapters 17 and 18, describe both academically-oriented research problems, as well as industry-focused areas where microphone array research may be headed. The individual chapters that we selected for .the book were designed to be tutorial in nature with a specific emphasis on recent important results. We hope the result is a text that will be of utility to a large audience, from the student Or practicing engineer just approaching the field to the advanced researcher with multi-channel signal processing experience. Cambridge MA, USA London, UK January 2001 Michael Brandstein Darren Ward Contents Part I. Speech Enhancement 1 Constant Directivity Beamforming Darren B. Ward, Rodney A. Kennedy, Robert C. Williamson ........ 3 1.1 Introduction................................................ 3 1.2 Problem Formulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.3 Theoretical Solution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.3.1 Continuous sensor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.3.2 Beam-shaping function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
8 1.4 Practical Implementation .................................... 9 1.4.1 Dimension-reducing parameterization. . . . . . . . . . . . . . . .. . . . 9 1.4.2 Reference beam-shaping filter. . . . . . . . . . . . . . . . . . . . . . . . . .. 11 1.4.3 Sensor placement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 12 1.4.4 Summary of implementation . . . . . . . . . . . . . . . . . . . . . . . . . . .. 12 1.5 Examples;................................................. 13 1.6 Conclusions .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 16 References ..................................................... 16 2 Superdirective Microphone Arrays Joerg Bitzer, K. Uwe Simmer .................................... 19 2.1 Introduction................................................ 19 2.2 Evaluation of Beamformers... .. . . . . .. .. . . . . . . . . . . . .. . . . . .. ... 20 2.2.1 Array-Gain........................................... 21 2.2.2 Beampattern.......................................... 22 2.2.3 Directivity............................................ 23 2.2.4 Front-to-Back Ratio ................................... 24 2.2.5 White Noise Gain ..................................... 24 2.3 Design of Superdirective Beamformers . . . . . . . . . . . . . . . . . . . . . . . .. 24 2.3.1 Delay-and-Sum Beamformer ............................ 26 2.3.2 Design for spherical isotropic noise. . . . . . . . . . . . . . . . . . . . . .. 26 2.3.3 Design for Cylindrical Isotropic Noise . . . . . . . . . . . . . . . . . . .. 30 2.3.4 Design for an Optimal Front-to-Back Ratio.. .. . . . . . . . . ... 30 2.3.5 Design for Measured Noise Fields. . .. . . . . . .. . . . . .. . . . . ... 32 2.4 Extensions and Details ...................................... 33 2.4.1 Alternative Form. . .. . . .. . . . . . . . . . . . .. . . . . . . . . . . . . . . . .. 33 X Contents 2.4.2 Comparison with Gradient Microphones. . . . . . . . . . . . 
. . . . .. 35 2.5 Conclusion................................................. 36 References ..................................................... 37 3 Post-Filtering Techniques K. Uwe Simmer, Joerg Bitzer, Claude Marro. .. . . . . . . . .. . . . . . . . . . .. 39 3.1 Introduction................................................ 39 3.2 Multi-channel Wiener Filtering in Subbands . . . . . . . . . . . . . . . . . . .. 41 3.2.1 Derivation of the Optimum Solution. . . . . . .. . . . . .. . . . . . .. 41 3.2.2 Factorization of the Wiener Solution . . . . . . . . . . . . . . . . . . . .. 42 3.2.3 Interpretation......................................... 45 3.3 Algorithms for Post-Filter Estimation ......................... 46 3.3.1 Analysis of Post-Filter Algorithms. . . . . . . . . . . . . . . . . . . . . .. 47 3.3.2 Properties of Post-Filter Algorithms ..................... 49 3.3.3 A New Post-Filter Algorithm ........................... 50 3.4 Performance Evaluation ..................................... 51 3.4.1 Simulation System. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 52 3.4.2 Objective Measures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 52 3.4.3 Simulation Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 54 3.5 Conclusion................................................. 57 4 Spatial Coherence Functions for Differential Microphones in Isotropic Noise Fields Gary W. Elko .... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 61 4.1 Introduction................................................ 61 4.2 Adaptive Noise Cancellation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 61 4.3 Spherically Isotropic Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 65 4.4 Cylindrically Isotropic Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 
73 4.5 Conclusions................................................ 77 References ..................................................... 84 5 Robust Adaptive Beamforming Osamu Hoshuyama, Akihiko Sugiyama. . . . . . . . . . . . . . . . . . . . . . . . . . . .. 87 5.1 Introduction................................................ 87 5.2 Adaptive Beamformers . .... . . . . . . . . . . .. . . . . . . . . . .. . . .. . . . . . .. 88 5.3 Robustness Problem in the GJBF . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 90 5.4 Robust Adaptive Microphone Arrays - Solutions to Steering- Vector Errors .............................................. 92 5.4.1 LAF-LAF Structure ................................... 92 5.4.2 CCAF-LAF Structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 94 5.4.3 CCAF-NCAF Structure.......... ... ...... ............. 95 5.4.4 CCAF-NCAF Structure with an AMC ................... 97 5.5 Software Evaluation of a Robust Adap~ive Microphone Array. . . .. 99 5.5.1 Simulated Anechoic Environment.. .. . . . . . .. . . . . .. . . . . . .. 99 5.5.2 Reverberant Environment .............................. 101 Contents XI 5.6 Hardware Evaluation of a Robust Adaptive Microphone Array .... 104 5.6.1 Implementation ....................................... 104 5.6.2 Evaluation in a Real Environment ....................... 104 5.7 Conclusion ................................................. 106 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 6 GSVD-Based Optimal Filtering for Multi-Microphone Speech Enhancement Simon Doclo, Marc Moonen ...................................... 111 6.1 Introduction ................................................ 111 6.2 GSVD-Based Optimal Filtering Technique ..................... 113 6.2.1 Optimal Filter Theory ................................. 114 6.2.2 General Class of Estimators ............................. 
116 6.2.3 Symmetry Properties for Time-Series Filtering ............ 117 6.3 Performance of GSVD-Based Optimal Filtering ................. 118 6.3.1 Simulation Environment ................................ 118 6.3.2 Spatial Directivity Pattern .............................. 119 6.3.3 Noise Reduction Performance ........................... 121 6.3.4 Robustness Issues ...................................... 121 6.4 Complexity Reduction ....................................... 122 6.4.1 Linear Algebra Techniques for Computing GSVD .......... 122 6.4.2 Recursive and Approximate GSVD-Updating Algorithms ... 123 6.4.3 Downsampling Techniques .............................. 125 6.4.4 Simulations ........................................... 125 6.4.5 Computational Complexity ............................. 126 6.5 Combination with ANC Postprocessing Stage ................... 127 6.5.1 Creation of Speech and Noise References ................. 127 6.5.2 Noise Reduction Performance of ANC Postprocessing Stage. 128 6.5.3 Comparison with Standard Beamforming Techniques ....... 129 6.6 Conclusion ................................................. 129 References ..................................................... 130 7 Explicit Speech Modeling for Microphone Array Speech Acquisition Michael Brandstein, Scott Griebel . ................................ 133 7.1 Introduction ................................................ 133 7.2 Model-Based Strategies ...................................... 136 7.2.1 Example 1: A Frequency-Domain Model-Based Algorithm .. 137 7.2.2 Example 2: A Time-Domain Model-Based Algorithm ....... 140 7.3 Conclusion ................................................. 148 References ..................................................... 151 Part II. Source Localization XII Contents 8 Robust Localization in Reverberant Rooms Joseph H. DiBiase, Harvey F. Silverman, Michael S. Brandstein ..... 157 8.1 Introduction ................................................ 
157 8.2 Source Localization Strategies ................................ 158 8.2.1 Steered-Beamformer-Based Locators ..................... 159 8.2.2 High-Resolution Spectral-Estimation-Based Locators ....... 160 8.2.3 TDOA-Based Locators ................................. 161 8.3 A Robust Localization Algorithm ............................. 164 8.3.1 The Impulse Response Model ........................... 164 8.3.2 The GCC and PHAT Weighting Function ................ 166 8.3.3 ML TDOA-Based Source Localization .................... 167 8.3.4 SRP-Based Source Localization .......................... 169 8.3.5 The SRP-PHAT Algorithm ............................. 170 8.4 Experimental Comparison. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172 References ..................................................... 178 9 Multi-Source Localization Strategies Elio D. Di Claudio, Raffaele Parisi . ............................... 181 9.1 Introduction ................................................ 181 9.2 Background ................................................ 184 9.2.1 Array Signal Model .................................... 184 9.2.2 Incoherent Approach ................................... 185 9.2.3 Coherent Signal Subspace Method (CSSM) ............... 185 9.2.4 Wideband Weighted Subspace Fitting (WB-WSF) ......... 186 9.3 The Issue of Coherent Multipath in Array Processing ............ 187 9.4 Implementation Issues ....................................... 188 9.5 Linear Prediction-ROOT-MUSIC TDOA Estimation ............ 189 9.5.1 Signal Pre-Whitening .................................. 189 9.5.2 An Approximate Model for Multiple Sources in Reverberant Environments ......................................... 191 9.5.3 Robust TDOA Estimation via ROOT-MUSIC ............. 192 9.5.4 Estimation of the Number of Relevant Reflections ......... 194 9.5.5 Source Clustering ...................................... 
195 9.5.6 Experimental Results .................................. 196 References ..................................................... 198 10 Joint Audio-Video Signal Processing for Object Localization and Tracking Norbert Strobel, Sascha Spors, Rudolf Rabenstein . ................... 203 10.1 Introduction ................................................ 203 10.2 Recursive State Estimation ................................... 205 10.2.1 Linear Kalman Filter .................................. 206 10.2.2 Extended Kalman Filter due to a Measurement Nonlinearity 210 10.2.3 Decentralized Kalman Filter ............................ 212 10.3 Implementation ............................................. 218 Contents XIII 10.3.1 System description ..................................... 218 10.3.2 Results ............................................... 219 10.4 Discussion and Conclusions .................................. 221 References ..................................................... 222 Part III. Applications 11 Microphone-Array Hearing Aids Julie E. Greenberg, Patrick M. Zurek . ............................. 229 11.1 Introduction ................................................ 229 11.2 Implications for Design and Evaluation ........................ 230 11.2.1 Assumptions Regarding Sound Sources ................... 230 11.2.2Implementation Issues .................................. 231 11.2.3Assessing Performance ................................. 232 11.3 Hearing Aids with Directional Microphones .................... 233 11.4 Fixed-Beamforming Hearing Aids ............................. 234 11.5 Adaptive-Beamforming Hearing Aids .......................... 235 11.5.1 Generalized Sidelobe Canceler with Modifications .......... 236 11.5.2 Scaled Projection Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242 11.5.3Direction of Arrival Estimation .......................... 243 11.5.4 Other Adaptive Approaches and Devices ................. 
243 11.6 Physiologically-Motivated Algorithms .......................... 244 11.7 Beamformers with Binaural Outputs .......................... 245 11.8 Discussion ................................................. 246 References ..................................................... 249 12 Small Microphone Arrays with Postfilters for Noise and Acoustic Echo Reduction Rainer Martin. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255 12.1 Introduction ................................................ 255 12.2 Coherence of Speech and Noise ............................... 257 12.2.1 The Magnitude Squared Coherence ...................... 257 12.2.2 The Reverberation Distance ............................ 258 12.2.3 Coherence of Noise and Speech in Reverberant Enclosures .. 259 12.3 Analysis of the Wiener Filter with Symmetric Input Signals ...... 263 12.3.1 No Near End Speech ................................... 265 12.3.2 High Signal to Noise Ratio .............................. 265 12.4 A Noise Reduction Application ............................... 266 12.4.1 An Implementation Based on the NLMS Algorithm ........ 266 12.4.2 Processing in the 800 - 3600 Hz Band .................... 268 12.4.3 Processing in the 240 - 800 Hz Band ..................... 269 12.4.4 Evaluation ............................................ 269 12.4.5 Alternative Implementations of the Coherence Based Postfilter271 12.5 Combined Noise and Acoustic Echo Reduction .................. 271 XIV Contents 12.5.1 Experimental Results .................................. 274 12.6 Conclusions ................................................ 275 References ..................................................... 276 13 Acoustic Echo Cancellation for Beamforming Microphone Arrays Walter L. Kellermann . .......................................... 281 13.1 Introduction ................................................ 
281 13.2 Acoustic Echo Cancellation .................................. 282 13.2.1 Adaptation algorithms ................................. 284 13.2.2 AEC for multi-channel sound reproduction ................ 287 13.2.3AEC for multi-channel acquisition ....................... 287 13.3 Beamforming ............................................... 288 13.3.1 General structure ...................................... 288 13.3.2Time-invariant beamforming ............................ 290 13.3.3 Time-varying beamforming ............................. 291 13.3.4 Computational complexity .............................. 292 13.4 Generic structures for combining AEC with beamforming ........ 292 13.4.1 Motivation ............................................ 292 13.4.2 Basic options ......................................... 293 13.4.3 'AEC first' ............................................ 293 13.4.4 'Beamforming first' .................................... 296 13.5 Integration of AEC into time-varying beamforming ............. 297 13.5.1 Cascading time-invariant and time-varying beamforming .... 297 13.5.2 AEC with GSC-type beamforming structures ............. 301 13.6 Combined AEC and beamforming for multi-channel recording and multi-channel reproduction ................................... 302 13.7 Conclusions ................................................. 303 References ........................................ . . . . . . . . . . . . . 303 14 Optimal and Adaptive Microphone Arrays for Speech Input in Automobiles Sven Nordholm, Ingvar Claesson, Nedelko Grbic .................... 307 14.1 Introduction: Hands-Free Telephony in Cars .................... 307 14.2 Optimum and Adaptive Beamforming ......................... 309 14.2.1 Common Signal Modeling ...... ; ....................... 309 14.2.2 Constrained Minimum Variance Beamforming and the Gen- eralized Sidelobe Canceler .............................. 310 14.2.3 In Situ Calibrated Microphone Array (ICMA) ............. 
312 14.2.4 Time-Domain Minimum-Mean-Square-Error Solution ....... 313 14.2.5Frequency-Domain Minimum-Mean-Square-Error Solution .. 314 14.2.6 Optimal Near-Field Signal-to-Noise plus Interference Beam- former ............................................... 316 14.3 Subband Implementation of the Microphone Array .............. 317 14.3.1 Description of LS-Subband Beamforming ................. 318 Contents XV 14.4 Multi-Resolution Time-Frequency Adaptive Beamforming ........ 319 14.4.1 Memory Saving and Improvements ....................... 319 14.5 Evaluation and Examples .................................... 320 14.5.1 Car Environment ...................................... 320 14.5.2Microphone Configurations ............................. 321 14.5.3Performance Measures ................................. 321 14.5.4Spectral Performance Measures .......................... 322 14.5.5 Evaluation on car data ................................. 323 14.5.6Evaluation Results ..................................... 323 14.6 Summary and Conclusions ................................... 324 References ..................................................... 326 15 Speech Recognition with Microphone Arrays Maurizio Omologo, Marco Matassoni, Piergiorgio Svaizer ............ 331 15.1 Introduction ................................................ 331 15.2 State of the Art ............................................ 332 15.2.1 Automatic Speech Recognition .......................... 332 15.2.2Robustness in ASR .................................... 336 15.2.3Microphone Arrays and Related Processing for ASR ....... 337 15.2.4 Distant-Talker Speech Recognition ....................... 339 15.3 A Microphone Array-Based ASR System ....................... 342 15.3.1 System Description .................................... 342 15.3.2 Speech Corpora and Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345 15.3.3 Experiments and Results ............................... 
346 15.4 Discussion and Future Trends ................................ 348 References ..................................................... 349 16 Blind Separation of Acoustic Signals Scott C. Douglas . ............................................... 355 16.1 Introduction ................................................ 355 16.1.1 The Cocktail Party Effect .............................. 355 16.1.2 Chapter Overview ..................................... 356 16.2 Blind Signal Separation of Convolutive Mixtures ................ 357 16.2.1 Problem Structure ..................................... 357 16.2.2 Goal of Convolutive BSS ............................... 359 16.2.3Relationship to Other Problems ......................... 360 16.3 Criteria for Blind Signal Separation ........................... 362 16.3.1 Overview of BSS Criteria ............................... 362 16.3.2 Density Modeling Criteria .............................. 362 16.3.3 Contrast Functions .................................... 364 16.3.4 Correlation-Based Criteria .............................. 366 16.4 Structures and Algorithms for Blind Signal Separation ........... 367 16.4.1 Filter Structures ....................................... 367 16.4.2Density Matching BSS Using Natural Gradient Adaptation. 368 16.4.3 Contrast-Based BSS Under Prewhitening Constraints ...... 370 XVI Contents 16.4.4 Temporal Decorrelation BSS for Nonstationary Sources ..... 372 16.5 Numerical Evaluations ....................................... 373 16.6 Conclusions and Open Issues ................................. 375 References ..................................................... 378 Part IV. Open Problems and Future Directions 17 Future Directions for Microphone Arrays Gary W. Elko .................................................. 383 17.1 Introduction ................................................ 383 17.2 Hands-Free Communication .................................. 
  17.3 The "Future" of Microphone Array Processing
  17.4 Conclusions

18 Future Directions in Microphone Array Processing
Dirk Van Compernolle
  18.1 Lessons From the Past
  18.2 A Future Focused on Applications
    18.2.1 Automotive
    18.2.2 Desktop
    18.2.3 Hearing Aids
    18.2.4 Teleconferencing
    18.2.5 Very Large Arrays
    18.2.6 The Signal Subspace Approach - An Alternative to Spatial Filtering?
  18.3 Final Remarks

Index

List of Contributors

Joerg Bitzer, Houpert Digital Audio, Bremen, Germany
Michael S. Brandstein, Harvard University, Cambridge MA, USA
Ingvar Claesson, Blekinge Inst. of Technology, Ronneby, Sweden
Joseph H. DiBiase, Brown University, Providence RI, USA
Elio D. Di Claudio, University of Rome "La Sapienza", Rome, Italy
Simon Doclo, Katholieke Universiteit Leuven, Leuven, Belgium
Scott C. Douglas, Southern Methodist University, Dallas TX, USA
Gary W. Elko, Agere Systems, Murray Hill NJ, USA
Nedelko Grbic, Blekinge Inst. of Technology, Ronneby, Sweden
Julie E. Greenberg, Massachusetts Inst. of Technology, Cambridge MA, USA
Scott M. Griebel, Harvard University, Cambridge MA, USA
Osamu Hoshuyama, NEC Media Research Labs, Kawasaki, Japan
Walter L. Kellermann, University Erlangen-Nuremberg, Erlangen, Germany
Rodney A. Kennedy, The Australian National University, Canberra, Australia
Claude Marro, France Telecom R&D, Lannion, France
Rainer Martin, Aachen University of Technology, Aachen, Germany
Marco Matassoni, Istituto per la Ricerca Scientifica e Tecnologica, Povo, Italy
Marc Moonen, Katholieke Universiteit Leuven, Leuven, Belgium
Sven Nordholm, Curtin University of Technology, Perth, Australia
Maurizio Omologo, Istituto per la Ricerca Scientifica e Tecnologica, Povo, Italy
Raffaele Parisi, University of Rome "La Sapienza", Rome, Italy
Rudolf Rabenstein, University Erlangen-Nuremberg, Erlangen, Germany
Harvey F. Silverman, Brown University, Providence RI, USA
K. Uwe Simmer, Aureca GmbH, Bremen, Germany
Sascha Spors, University Erlangen-Nuremberg, Erlangen, Germany
Norbert Strobel, Siemens Medical Solutions, Erlangen, Germany
Akihiko Sugiyama, NEC Media Research Labs, Kawasaki, Japan
Piergiorgio Svaizer, Istituto per la Ricerca Scientifica e Tecnologica, Povo, Italy
Dirk Van Compernolle, Katholieke Universiteit Leuven, Leuven, Belgium
Darren B. Ward, Imperial College of Science, Technology and Medicine, London, UK
Robert C. Williamson, The Australian National University, Canberra, Australia
Patrick M. Zurek, Sensimetrics Corporation, Somerville MA, USA

Part I. Speech Enhancement

1 Constant Directivity Beamforming

Darren B. Ward (1), Rodney A. Kennedy (2), and Robert C. Williamson (2)
(1) Imperial College of Science, Technology and Medicine, London, UK
(2) The Australian National University, Canberra, Australia

Abstract. Beamforming, or spatial filtering, is one of the simplest methods for discriminating between different signals based on the physical location of the sources. Because speech is a very wideband signal, covering some four octaves, traditional narrowband beamforming techniques are inappropriate for hands-free speech acquisition. One class of broadband beamformers, called constant directivity beamformers, aims to produce a constant spatial response over a broad frequency range.
In this chapter we review such beamformers, and discuss implementation issues related to their use in microphone arrays.

M. Brandstein et al. (eds.), Microphone Arrays © Springer-Verlag Berlin Heidelberg 2001

1.1 Introduction

Beamforming is one of the simplest and most robust means of spatial filtering, i.e., discriminating between signals based on the physical locations of the signal sources [1]. In a typical microphone array environment, the desired speech signal originates from a talker's mouth, and is corrupted by interfering signals such as other talkers and room reverberation. Spatial filtering can be useful in such an environment, since the interfering sources generally originate from points in space separate from the desired talker's mouth. By exploiting the spatial dimension of the problem, microphone arrays attempt to obtain a high-quality speech signal without requiring the talker to speak directly into a close-talking microphone.

In most beamforming applications two assumptions simplify the analysis: (i) the signals incident on the array are narrowband (the narrowband assumption); and (ii) the signal sources are located far enough away from the array that the wavefronts impinging on the array can be modeled as plane waves (the farfield assumption). For many microphone array applications, the farfield assumption is valid. However, the narrowband assumption is never valid, and it is this aspect of the beamforming problem that we focus on in this chapter (see [2] for techniques that also lift the farfield assumption).

To understand the inherent problem in using a narrowband array for broadband signals, consider a linear array with a fixed number of elements separated by a fixed inter-element distance. The important dimension in measuring array performance is its size in terms of operating wavelength. Thus for high frequency signals (having a small wavelength) a fixed array will appear large and the main beam will be narrow. However, for low frequencies (large wavelength) the same physical array appears small and the main beam will widen. This is illustrated in Fig. 1.1, which shows the beampattern of an array designed for 1.5 kHz, but operated over a frequency range of 300 Hz to 3 kHz. If an interfering signal is present at, say, 60°, then ideally it should be attenuated completely by the array. However, because the beam is wider at low frequencies than at high frequencies, the interfering signal will be low-pass filtered rather than uniformly attenuated over its entire band. This "spectral tilt" results in a disturbing speech output if used for speech acquisition, and thus, such a narrowband array is unacceptable for speech applications. Another drawback of this narrowband design is that spatial aliasing is evident at high frequencies.^1

[Fig. 1.1. Response of a narrowband array operated over a wide bandwidth. Axes: frequency (Hz) and angle (degrees).]

^1 Spatial aliasing comes about if a sensor spacing wider than half a wavelength is used. It is analogous to temporal aliasing in discrete-time signal processing.

To overcome this problem, one must use a beamformer that is designed specifically for broadband applications. In this chapter we focus on a specific class of broadband beamformers, called constant directivity beamformers (CDB), designed such that the spatial response is the same over a wide frequency band. The response of a typical CDB is shown in Fig. 1.6 on page 15.

Several techniques have been proposed to design a CDB. Most techniques are based on the idea that at different frequencies, a different array should be used that has total size and inter-sensor spacing appropriate for that particular frequency. An example of this idea is the use of harmonically-nested subarrays, e.g., [3-5].
In this case, the array is composed of a set of nested equally-spaced arrays, with each subarray being designed as a narrowband array. The outputs of the various subarrays are then combined by appropriate bandpass filtering. The idea of harmonic nesting is to reduce the beampattern variation to that which occurs within a single octave. This approach can be improved by using a set of subarray filters to interpolate to frequencies between the subarray design frequencies [6].

A novel approach to CDB design was proposed by Smith in [7]. Noting that, for a given array, the beamwidth narrows at high frequencies, Smith's idea was to form several beams and to steer each individual beam in such a way that the width of the overall multi-beam was kept constant. Thus, as the individual beams narrow at higher frequencies, they are progressively "fanned" outwards in an attempt to keep the overall beamwidth constant. Unless a very large number of beams are formed, at high frequencies this fanning will result in notches in the main beam where the progressively narrower beams no longer overlap. This approach was applied to the design of microphone arrays in [8].

The first approach to CDB design that attempted to keep a constant beampattern over the entire spatial region (not just for the main beam) was presented by Doles and Benedict [9]. Using the asymptotic theory of unequally-spaced arrays [10,11], they derived relationships between beampattern characteristics and functional requirements on sensor spacings and weightings. This results in a filter-and-sum array, with the sensor filters creating a space-tapered array: at each frequency the non-zero filter responses identify a subarray having total length and spacing appropriate for that frequency. Although this design technique results in a beampattern that is frequency-invariant over a specified frequency band, it is not a general design technique, since it is based on a specific array geometry and beampattern shape.
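The frequency dependence of a fixed array's beam, described in the introduction, can be checked numerically. The following is a minimal sketch, not the design behind Fig. 1.1: the 9-element array, the uniform weights, the broadside look direction, the speed of sound and the sign of the phase term are all assumptions made for illustration.

```python
import cmath
import math

C = 343.0  # assumed speed of sound in air (m/s)


def uniform_array(num_sensors, design_freq_hz):
    """Sensor positions (m), half-wavelength spacing at the design frequency."""
    spacing = C / (2.0 * design_freq_hz)
    centre = (num_sensors - 1) / 2.0
    return [(n - centre) * spacing for n in range(num_sensors)]


def spatial_response(positions, weights, theta_deg, freq_hz):
    """b(theta, f) = w^H a(theta, f).

    Assumed phase convention: a_n = exp(+j 2 pi f p_n cos(theta) / c),
    with theta measured relative to the array axis.
    """
    u = math.cos(math.radians(theta_deg)) / C
    return sum(complex(w).conjugate() * cmath.exp(2j * math.pi * freq_hz * p * u)
               for w, p in zip(weights, positions))


def gain_db(positions, weights, theta_deg, freq_hz):
    """Magnitude of the spatial response in dB (floored to avoid log of zero)."""
    magnitude = abs(spatial_response(positions, weights, theta_deg, freq_hz))
    return 20.0 * math.log10(max(magnitude, 1e-12))


# Hypothetical 9-element array designed for 1.5 kHz, uniform weights
# (broadside look direction, i.e. theta = 90 degrees).
positions = uniform_array(9, 1500.0)
weights = [1.0 / len(positions)] * len(positions)

for freq in (300.0, 1500.0, 3000.0):
    print(f"{freq:6.0f} Hz: gain towards 60 deg = "
          f"{gain_db(positions, weights, 60.0, freq):6.1f} dB")
```

With these assumed values the interferer at 60° is attenuated far less at 300 Hz than at 1.5 kHz, which is the low-pass effect described above, and at 3 kHz a full-gain grating lobe appears along the array axis because the spacing exceeds half a wavelength.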
Other recent techniques for CDB design include [12] (based on a two-dimensional Fourier transform property [13] which exists for equally-spaced arrays) and [14] (based on a beamspace implementation). Prompted by the work of Doles and Benedict, we derived in [15] a very general design method for CDBs, suitable for three-dimensional array geometries. In this chapter we outline this technique, and discuss implementation issues specific to microphone array applications.

Time-domain versus frequency-domain beamforming

There are two general methods of beamforming for broadband signals: time-domain beamforming and frequency-domain beamforming. In time-domain beamforming an FIR filter is used on each sensor, and the filter outputs are summed to form the beamformer output. For an array with $M$ sensors, each feeding an $L$-tap filter, there are $ML$ free parameters. In frequency-domain beamforming the signal received by each sensor is separated into narrowband frequency bins (either through bandpass filtering or data segmentation and discrete Fourier transform), and the data in each frequency bin is processed separately using narrowband techniques. For an array with $M$ sensors, with $L$ frequency bins within the band of interest, there are again $ML$ free parameters. As with most beamformers, the method that we describe in this chapter can be formulated in either domain. A time-domain formulation has previously been given in [16], and hence, we restrict our attention to frequency-domain processing here.

1.2 Problem Formulation

Consider a linear array of $M = 2N + 1$ sensors located at $p_n$, $n = -N, \ldots, N$. Assume that the data received at the $n$th sensor is separated into narrowband frequency bins, each of width $\Delta f$. Let the center frequency of the $i$th bin be $f_i$, and denote the frequencies within the bin as $F_i = [f_i - \Delta f/2,\; f_i + \Delta f/2)$.
The array data received in the $i$th bin at time $k$ is given by the $M$-vector

$$ \mathbf{x}_i(k) = \mathbf{a}(\theta, f_i)\, s_i(k) + \mathbf{v}_i(k). $$

The desired source signal is represented by $s_i(k)$, and the $M$-vector $\mathbf{v}_i(k)$ represents the interfering noise (consisting of reverberation and other unwanted noise sources). The array vector $\mathbf{a}(\theta, f)$ represents the propagation of the signal source to the array, and its $n$th element is given by

[equation missing from the source]

where $c$ is the speed of wave propagation, and $\theta$ is the direction to the desired source (measured relative to the array axis). To simplify notation we will drop the explicit dependence on $k$ in the sequel.

The beamformer output is formed by applying a weight vector to the received array data, giving

$$ y_i = \mathbf{w}_i^H \mathbf{x}_i, $$

where $H$ denotes Hermitian transpose, and $\mathbf{w}_i$ is the $M$-vector of array weights to apply to the $i$th frequency bin.^2 The spatial response of the beamformer is given by

$$ b(\theta, f) = \mathbf{w}_i^H \mathbf{a}(\theta, f), \quad f \in F_i, \tag{1.2} $$

which defines the transfer function between a source at location $\theta \in [-\pi, \pi)$ and the beamformer output. Also of interest is the beampattern, defined as the squared magnitude of the spatial response.

^2 Note that it is a notational convention to use $\mathbf{w}^H$ rather than $\mathbf{w}^T$ [1].

The problem of designing a CDB can now be formulated as finding the array weights in each frequency bin such that the resulting spatial response remains constant over all frequency bins of interest. One simple (but not very illuminating) approach to solving this problem is to perform a least-squares optimization in each frequency bin, i.e.,

$$ \min_{\mathbf{w}_i} \int_{-\pi}^{\pi} \left| \mathbf{w}_i^H \mathbf{a}(\theta, f_i) - b_{FI}(\theta) \right|^2 d\theta, \tag{1.3} $$

where $b_{FI}(\theta)$ is the desired frequency-invariant response. Thus, in each frequency bin there are $M$ free parameters to optimize. Although this is a standard least-squares optimization problem and the required array weights are easily found, the solution provides very little insight into the problem.
Specifically, there is no suggestion of any inherent structure in the CDB, and many important questions are left unanswered, such as how many sensors are required, and what range of frequencies can be used.

In an attempt to provide some insight into the problem of designing a CDB, we take an alternative theoretical approach in the following section, and then relate these theoretical results back to the problem of finding the required filter coefficients. As we will see, there is in fact a very strong implicit structure in the CDB, and exploiting this structure enables us to reduce the number of design parameters and find efficient implementations.

1.3 Theoretical Solution

It is well known that the important dimension in determining the array response is the physical array size, measured in wavelengths. Thus, to obtain the same beampattern at different frequencies requires that the array size remains constant in terms of wavelength. Specifically, consider a linear array with $N$ elements located at $p_n$, $n = 1, \ldots, N$, and assume the array weights are chosen to produce a desired beampattern $b(\theta)$ at a frequency $f_1$. Then, at a frequency $f_2$, the same beampattern $b(\theta)$ will be produced if the same array weights are used in an array with elements located at $p_n (f_1/f_2)$, $n = 1, \ldots, N$. In other words, the size of the array must scale directly with wavelength (inversely with frequency) to obtain the same beampattern.^3 To obtain the same beampattern over a continuous range of frequencies would theoretically require a continuum of sensors.

^3 This is precisely the idea used in the harmonically-nested subarray technique.

1.3.1 Continuous sensor

Motivated by this interpretation, we consider the response of a theoretical continuous sensor. Assume that a signal $x(p, f)$ is received at a point $p$ on the sensor at frequency $f$, and a weight $w(p, f)$ is applied to the sensor at this point and frequency.
The output of the sensor is

$$ y(f) = \int w(p, f)\, x(p, f)\, dp, $$

and the spatial response for a source at angle $\theta$ is

$$ b(\theta, f) = \int w(p, f)\, e^{-j 2\pi f c^{-1} p \cos\theta}\, dp. \tag{1.4} $$

We assume that the aperture has finite support in $p$, and thus the integration can be taken over infinite limits. Let $u = c^{-1} \cos\theta$. The response of the continuous sensor can now be written

$$ b_u(u, f) = \int w(p, f)\, e^{-j 2\pi f p u}\, dp. $$

Let the sensor weighting function be given by

$$ w(p, f) = f\, B(pf), \tag{1.5} $$

where $B(\cdot)$ is an arbitrary, absolutely-integrable, finite-support function. Substitution gives

$$ b_u(u, f) = f \int B(pf)\, e^{-j 2\pi f p u}\, dp. \tag{1.6} $$

With the change of variable $\zeta = pf$, and noting that $d\zeta = f\, dp$, it is easily seen that the resulting spatial response is now independent of frequency, i.e.,

$$ b_{FI}(u) = \int B(\zeta)\, e^{-j 2\pi \zeta u}\, d\zeta. \tag{1.7} $$

This is an important result, since it states that if the weighting function is given by (1.5), then the resulting spatial response will be independent of frequency. In other words, (1.5) defines the weighting function for a CDB. It was shown in [15] that not only does (1.5) provide a sufficient condition, but it is in fact the necessary condition for a frequency-invariant spatial response.

1.3.2 Beam-shaping function

Equation (1.7) defines a Fourier transform relationship between $B(\cdot)$ and $b_{FI}(\cdot)$. To achieve some desired spatial response, the required function $B(\zeta)$ is thus easily found by taking the inverse Fourier transform of $b_{FI}(u)$. We will refer to $B(\cdot)$ as the beam-shaping (BS) function, since it has a fundamental role in determining the spatial response.

Because of its symmetry with respect to space and frequency, the BS function can be interpreted as either a filter response at a certain point, i.e., $H_p(f) = B(pf)$, or equivalently, as an aperture weighting function at a certain frequency, i.e., $A_f(p) = B(pf)$. We will assume that the BS function is Hermitian symmetric, i.e., $B(-\zeta) = B^*(\zeta)$. This implies that the resulting spatial response is real-valued.
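The frequency invariance stated by (1.5)-(1.7) can be verified numerically. The sketch below integrates (1.6) with a midpoint rule for a hypothetical triangular BS function; the triangle, the aperture of Q = 4 wavelengths, the speed of sound and the step count are illustrative assumptions, not values from the chapter.

```python
import cmath
import math

C = 343.0                      # assumed speed of sound (m/s)
Q = 4.0                        # assumed aperture size in wavelengths
HALF_WIDTH = Q * C / 2.0       # support of B(zeta) is [-HALF_WIDTH, HALF_WIDTH]


def beam_shaping(zeta):
    """Hypothetical BS function: a real, even triangle with finite support."""
    return max(0.0, 1.0 - abs(zeta) / HALF_WIDTH)


def continuous_response(theta_deg, freq_hz, steps=2000):
    """Midpoint-rule approximation of (1.6) with w(p, f) = f * B(p f)."""
    u = math.cos(math.radians(theta_deg)) / C
    p_max = HALF_WIDTH / freq_hz          # aperture shrinks as frequency rises
    dp = 2.0 * p_max / steps
    total = 0j
    for k in range(steps):
        p = -p_max + (k + 0.5) * dp
        weight = freq_hz * beam_shaping(p * freq_hz)
        total += weight * cmath.exp(-2j * math.pi * freq_hz * p * u) * dp
    return total


for theta in (30.0, 60.0, 90.0):
    low = continuous_response(theta, 500.0)
    high = continuous_response(theta, 2000.0)
    print(f"theta = {theta:4.0f} deg: |difference between 500 Hz and 2 kHz| = "
          f"{abs(low - high):.2e}")
```

The two frequencies give the same response up to rounding error, and because the assumed BS function is real and even (hence Hermitian symmetric), the imaginary part of the response is numerically zero, in line with the remark at the end of Section 1.3.2.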
1.4 Practical Implementation

Whilst we have shown theoretically that it is possible to produce a beampattern that is exactly frequency-invariant using a continuous sensor, in practice we must attempt to approximate such a response using a finite array of discrete sensors. The problem of approximating a continuous aperture by a discrete array has been considered in [17]. One simple but effective technique is to approximate the integral in (1.6) using a Riemann sum; this is the approach we take here. In particular, we use trapezoidal integration to approximate the integral (1.6) by a summation of the form:

$$ \hat{b}_{FI}(u) = f \sum_{n=-N}^{N} B(p_n f)\, e^{-j 2\pi f p_n u}\, \Delta_n, \tag{1.8} $$

where $p_n$ is the location of the $n$th discrete sensor, and $\hat{b}_{FI}$ denotes an approximation of $b_{FI}$. We assume that the array is Hermitian symmetric about the origin, so that $B(-pf) = B(pf)^*$, and $p_{-n} = -p_n$. Although the technique is suitable for an arbitrary array geometry, a symmetric geometry simplifies implementation, and ensures that the position of the array phase center does not vary with frequency. The length of the $n$th subinterval is

$$ \Delta_n = \frac{p_{n+1} - p_{n-1}}{2}, \tag{1.9} $$

which we refer to as the spatial weighting term. Relating (1.8) to the response of a general array (1.2), we find that for a CDB the weight on the $n$th sensor in the $i$th frequency bin is

$$ w_{i,n} = f_i\, \Delta_n\, B(p_n f_i), \tag{1.10} $$

where, recall, $p_n$ is the location of the sensor, and $f_i$ is the center frequency of the bin.

1.4.1 Dimension-reducing parameterization

Define the reference beam-shaping filter response as

$$ H(f) = B(p_{\text{ref}} f), \tag{1.11} $$

where $p_{\text{ref}}$ is some reference location (to be defined later). Also define the beam-shaping filter response of the $n$th sensor as $H_n(f) = B(p_n f)$, $n = -N, \ldots, N$. It immediately follows that the BS filters satisfy the following dilation property:

$$ H_n(f) = H(\gamma_n f), \tag{1.12} $$

where $\gamma_n = p_n / p_{\text{ref}}$.

[... text missing from the source, including equations (1.13)-(1.16) ...]

[...] $\gamma_n > 1$ results in compression in the frequency domain, whereas $\gamma_n < 1$ results in frequency expansion.
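The discrete approximation (1.8) with the spatial weighting term (1.9), and the dilation property (1.12), can be sketched as follows. The triangular BS function, the dense 201-sensor uniform array and the one-sided half-interval used for the two end sensors are assumptions for illustration; the closed form used for comparison is the standard Fourier transform of a triangle.

```python
import cmath
import math

C = 343.0                      # assumed speed of sound (m/s)
Q = 4.0                        # assumed aperture size in wavelengths
HALF_WIDTH = Q * C / 2.0       # support of the hypothetical B(zeta)


def beam_shaping(zeta):
    """Hypothetical BS function: real, even triangle of unit height."""
    return max(0.0, 1.0 - abs(zeta) / HALF_WIDTH)


def spatial_weights(positions):
    """Delta_n from (1.9); end sensors use a one-sided half-interval (assumption)."""
    last = len(positions) - 1
    deltas = []
    for n, p in enumerate(positions):
        left = positions[n - 1] if n > 0 else p
        right = positions[n + 1] if n < last else p
        deltas.append((right - left) / 2.0)
    return deltas


def discrete_response(positions, theta_deg, freq_hz):
    """Approximate frequency-invariant response, equation (1.8)."""
    u = math.cos(math.radians(theta_deg)) / C
    deltas = spatial_weights(positions)
    return freq_hz * sum(
        beam_shaping(p * freq_hz) * cmath.exp(-2j * math.pi * freq_hz * p * u) * d
        for p, d in zip(positions, deltas))


def exact_response(theta_deg):
    """Fourier transform of the triangle: HALF_WIDTH * sinc^2(HALF_WIDTH * u)."""
    x = HALF_WIDTH * math.cos(math.radians(theta_deg)) / C
    sinc = 1.0 if abs(x) < 1e-12 else math.sin(math.pi * x) / (math.pi * x)
    return HALF_WIDTH * sinc ** 2


# Hypothetical dense symmetric array: 201 sensors, 2 cm apart.
positions = [n * 0.02 for n in range(-100, 101)]

for freq in (1000.0, 2000.0):
    for theta in (0.0, 45.0, 90.0):
        error = abs(discrete_response(positions, theta, freq) - exact_response(theta))
        print(f"f = {freq:6.0f} Hz, theta = {theta:4.0f} deg: error = {error:.3f}")

# Dilation property (1.12): H_n(f) = H(gamma_n * f) with gamma_n = p_n / p_ref.
p_ref, p_n, f = 0.5, 0.2, 1200.0
gamma_n = p_n / p_ref
h_n = beam_shaping(p_n * f)                 # H_n(f) = B(p_n f)
h_ref = beam_shaping(p_ref * gamma_n * f)   # H(gamma_n f) = B(p_ref gamma_n f)
```

The array here is deliberately much denser than a practical design, so that the Riemann-sum error stays small at both test frequencies; it says nothing about how few sensors a well-chosen geometry needs.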
Since the discrete-time frequency response $H(f)$ is periodic, it follows that frequency compression may cause aliasing; this is extremely undesirable. Aliasing can be avoided in one of two ways. First, choosing $p_{\text{ref}} = \max |p_n|$ ensures that $\gamma_n \le 1$, $\forall n$, thus avoiding aliasing altogether; however, this requires additional constraints on the reference BS coefficients to impose the low-pass property (1.16). Alternatively, for sensors having $\gamma_n > 1$, the weights $w_{i,n}$ are set to zero for frequency bins $f_i > f_n$; the reference BS weights are now potentially unconstrained. Of these two approaches, the second is preferable, since it removes any constraints on the BS coefficients. Moreover, the requirement that the sensor weights within certain bins are always zero does not complicate implementation.

Assume that the frequency response of the reference BS filter is non-zero for all frequencies up to $f_s/2$, the Nyquist frequency; this is the most general case of $H(f)$. From (1.16), it follows that a sensor with non-zero frequency response up to $f_s/2$ would be positioned at $|p_n| = Qc/f_s$. Thus, for the most general case of $H(f)$ the reference location is chosen as

$$ p_{\text{ref}} = \frac{Qc}{f_s}. \tag{1.17} $$

The reference BS coefficients can be found by using the Fourier transform relationship defined by (1.7). Specifically, the BS function $B(\zeta)$ is found by taking the inverse Fourier transform of the desired frequency-invariant spatial response $b_{FI}(u)$. Setting $f = \zeta / p_{\text{ref}}$, $B(\zeta)$ now defines the frequency response of the reference BS filter. The BS coefficient vector $\mathbf{h}$ is found using any standard FIR filter design technique. In practice, low-order implementations of the reference BS filter are generally to be preferred; this point is demonstrated in the following section.

1.4.3 Sensor placement

The most common geometry for array processing applications is typically an equally-spaced array, usually with a spacing of one half-wavelength at the highest frequency of operation.
Although such a geometry is valid for a CDB, fewer sensors are required if a logarithmically spaced array is used. In choosing an appropriate sensor geometry, the most important consideration is to ensure that at any frequency spatial aliasing is avoided. The idea is to start with an equally-spaced array that is used at the highest frequency, and then progressively add more sensors with wider spacings as frequency decreases (and the wavelength increases). At any frequency $f$, the total active aperture size should be $Qc/f$, and the largest spacing within the active array should be $c/(2f)$. These requirements are met (using the least number of sensors) with the following symmetric array geometry:

$$ p_n = n\, \frac{c}{2 f_U}, \quad [\ldots] \tag{1.18a} $$

[the remaining conditions of (1.18a) and equations (1.18b)-(1.18c) are missing from the source]

Note that a harmonically-nested subarray geometry is only produced if $Q = 2$.

1.4.4 Summary of implementation

1. Choose a set of $L$ reference BS coefficients, $\mathbf{h}$.
2. Position the sensors according to (1.18a)-(1.18c).
3. In the $i$th frequency bin, the weight on the $n$th sensor is

$$ w_{i,n} = \begin{cases} f_i\, \Delta_n\, H_n(f_i), & [\ldots] \\ 0, & [\ldots] \end{cases} $$

[... text missing from the source: the remainder of Chapter 1, Chapter 2 and the beginning of Chapter 3 ...]

$$ \mathbf{w}_{\text{opt}} = [\ldots] = \left[ \frac{\phi_{ss}}{\phi_{ss} + \left(\mathbf{d}^H \boldsymbol{\Phi}_{vv}^{-1} \mathbf{d}\right)^{-1}} \right] \frac{\boldsymbol{\Phi}_{vv}^{-1} \mathbf{d}}{\mathbf{d}^H \boldsymbol{\Phi}_{vv}^{-1} \mathbf{d}}. \tag{3.19} $$

Equation (3.19) shows that the multi-channel Wiener filter (3.10) can be written as the product of the weight vector of the MVDR beamformer (see Chapter 2) and a real-valued scalar factor. A similar result is used in [36] and [1] to show that the multi-channel Wiener and the MVDR solution yield the same SNR if the input is narrowband. In this case the MVDR beamformer is preferable since it is data independent (i.e., completely defined by the spatial configuration of signal and noise sources), whereas the Wiener solution is data dependent ($\phi_{ss}$ must be known or estimated) and is therefore much more difficult to handle. However, MVDR and Wiener solutions yield the same SNR only if the input consists of a single frequency.
For the broadband case (which has already been discussed in [37]), the scalar factor becomes a subband or frequency domain post-filter that may significantly improve the SNR.

Simmer et al.

To show that the optimum post-filter is also a Wiener filter that operates on the single-channel output data, we evaluate the power of the desired signal at the output of the MVDR processor as

$$ \phi_{s_o s_o} = \phi_{ss} \left| \mathbf{w}_{\text{mvdr}}^H \mathbf{d} \right|^2 = \phi_{ss}. \tag{3.20} $$

This demonstrates the distortionless magnitude response. Furthermore, we determine the power of the output noise as

$$ \phi_{v_o v_o} = \mathbf{w}_{\text{mvdr}}^H \boldsymbol{\Phi}_{vv} \mathbf{w}_{\text{mvdr}} = \left( \mathbf{d}^H \boldsymbol{\Phi}_{vv}^{-1} \mathbf{d} \right)^{-1}. \tag{3.21} $$

Substituting (3.20) and (3.21) into (3.19), we can finally factorize the optimum MMSE solution into the following expression:

$$ \mathbf{w}_{\text{opt}} = \underbrace{\left[ \frac{\phi_{s_o s_o}}{\phi_{s_o s_o} + \phi_{v_o v_o}} \right]}_{\text{Wiener post-filter}} \; \underbrace{\frac{\boldsymbol{\Phi}_{vv}^{-1} \mathbf{d}}{\mathbf{d}^H \boldsymbol{\Phi}_{vv}^{-1} \mathbf{d}}}_{\text{MVDR array}}. \tag{3.22} $$

Equation (3.22) includes the complex weight vector of the MVDR beamformer

$$ \mathbf{w}_{\text{mvdr}}(k, i) = \frac{\boldsymbol{\Phi}_{vv}^{-1}(k, i)\, \mathbf{d}(k, i)}{\mathbf{d}^H(k, i)\, \boldsymbol{\Phi}_{vv}^{-1}(k, i)\, \mathbf{d}(k, i)}, \tag{3.23} $$

and the scalar, single-channel Wiener post-filter that depends on the SNR at the output of the beamformer:

$$ H_{\text{post}}(k, i) = \frac{\phi_{s_o s_o}(k, i)}{\phi_{s_o s_o}(k, i) + \phi_{v_o v_o}(k, i)} = \frac{\mathrm{SNR}_{\text{out}}(k, i)}{1 + \mathrm{SNR}_{\text{out}}(k, i)}. \tag{3.24} $$

The output signal $z(k, i)$ of the factorized MMSE filter is the product of the output signal $y(k, i)$ of the MVDR array:

$$ y(k, i) = \mathbf{w}_{\text{mvdr}}^H(k, i)\, \mathbf{x}(k, i), \tag{3.25} $$

and the transfer function $H_{\text{post}}(k, i)$ of a single-channel post-filter:

$$ z(k, i) = y(k, i)\, H_{\text{post}}(k, i). \tag{3.26} $$

The MVDR solution (3.23) maximizes the directivity index if $\boldsymbol{\Phi}_{vv}$ equals the correlation matrix of the diffuse sound field. The resulting system may therefore be called "superdirective array with Wiener post-filter" (although the term superdirectivity originated in the context of analog microphones). Since the definition (3.13) of the propagation vector does not include any farfield assumptions, (3.23) may also be used to design a near-field superdirective array.
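The factorization (3.22) can be checked numerically in a single subband. The sketch below assumes that the multi-channel Wiener filter (3.10), which is not reproduced in this excerpt, has the standard form $\mathbf{w} = \boldsymbol{\Phi}_{xx}^{-1} \phi_{ss} \mathbf{d}$ with $\boldsymbol{\Phi}_{xx} = \phi_{ss} \mathbf{d}\mathbf{d}^H + \boldsymbol{\Phi}_{vv}$; the three-microphone propagation vector, the noise correlation matrix and the signal power are made-up test values.

```python
import cmath


def solve(matrix, rhs):
    """Solve matrix * x = rhs (complex) by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0j] * n
    for r in range(n - 1, -1, -1):
        acc = aug[r][n] - sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = acc / aug[r][r]
    return x


def inner(a, b):
    """Hermitian inner product a^H b."""
    return sum(x.conjugate() * y for x, y in zip(a, b))


# Hypothetical single-subband test data (three microphones).
d = [cmath.exp(-0.4j * n) for n in range(3)]          # propagation vector
phi_vv = [[1.0 + 0j, 0.5 + 0j, 0.2 + 0j],
          [0.5 + 0j, 1.0 + 0j, 0.5 + 0j],
          [0.2 + 0j, 0.5 + 0j, 1.0 + 0j]]             # noise correlation matrix
phi_ss = 2.0                                          # desired-signal power

# MVDR weights (3.23) and output noise power (3.21).
phi_inv_d = solve(phi_vv, d)
kappa = inner(d, phi_inv_d).real                      # d^H Phi_vv^{-1} d
w_mvdr = [v / kappa for v in phi_inv_d]
noise_out = 1.0 / kappa

# Wiener post-filter (3.24) and the factorized solution (3.22).
h_post = phi_ss / (phi_ss + noise_out)
w_factorized = [h_post * w for w in w_mvdr]

# Direct multi-channel Wiener solution (assumed form of (3.10)).
phi_xx = [[phi_ss * d[n] * d[m].conjugate() + phi_vv[n][m] for m in range(3)]
          for n in range(3)]
w_direct = solve(phi_xx, [phi_ss * dn for dn in d])

max_difference = max(abs(a - b) for a, b in zip(w_factorized, w_direct))
print(f"largest weight difference: {max_difference:.2e}")
```

The two weight vectors agree to rounding error, and $\mathbf{w}_{\text{mvdr}}^H \mathbf{d} = 1$ confirms the distortionless response used in (3.20).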
3 Post-filtering Techniques

3.2.3 Interpretation

Although the above results are clearly related to Wiener's work on optimum filtering [38], some basic assumptions were different. First of all, Wiener considered continuous-time signals, which leads to the Wiener-Hopf integral equation. The corresponding equation in matrix form (3.10) usually determines the filter coefficients for an optimum discrete-time FIR filter of order $N$. In our case, the delay line is defined by the spatial arrangement of the acoustic sensors and the taps are realized by the $N$ microphones. The array and the weight vector form a spatial filter.

Wiener assumed that signal and noise are ergodic and stationary random processes and he used the Fourier transform to find a solution for the optimum filter. This leads to a linear, time-invariant filter. Such a filter is not appropriate for speech signals that may be modeled as short-time stationary processes only. The derivation used here is based on ensemble averages (expectations) and does not assume stationarity. In practice, however, only an approximate realization of such a filter is possible. There are two main sources of errors: the analysis and synthesis filter-bank, and the procedures to estimate the time-varying signal and noise powers in the individual subbands.

For the design of the filter-banks, a compromise between frequency and time resolution has to be made. High resolution in the frequency domain leads to poor resolution in the time domain and vice versa. Therefore, the highest possible frequency resolution that does not violate the short-term stationarity of speech should be chosen. Furthermore, the minimum error in the time domain is only reached if the filters have non-overlapping frequency regions (see the discussion of subband methods in [39]). Since such filters are physically unrealizable, overlapping of subbands cannot be avoided.
As a result, the suppression of a noise-only subband may affect adjacent subbands containing desired signal components. In the following, we will use windowing, Fast Fourier Transform (FFT) and the overlap-add method to implement the filter-bank. However, (3.22) is general enough to allow any complex or real valued filter-bank method. If overlap-add is used, circular convolution should be avoided by zero padding and by constraints imposed on the estimated transfer function.

In the derivation of the optimum filter, expectations are used to estimate the parameters. This is a theoretical construction since the ensemble averages cannot be computed in practice. An approximation proposed in [9] is the recursive Welch periodogram:

$$ \hat{\phi}_{xy}(k, i) = \alpha\, \hat{\phi}_{xy}(k - 1, i) + (1 - \alpha)\, x(k, i)\, y^*(k, i), \tag{3.27} $$

where $\alpha = \exp(-D / [T_\alpha f_s])$ is defined by the decimation factor $D$ of the filter-bank, the time constant $T_\alpha$ (ms), and the sampling frequency $f_s$ (kHz). The time constant is again a compromise. If $T_\alpha$ is low, artifacts may occur due to the variation of the transfer function estimate. On the other hand, if a high time constant $T_\alpha$ is chosen, the assumption of short-time stationarity is violated and the output speech signal may sound reverberant.

Unfortunately, the factorized result (3.22) does not give any indication of how the Wiener post-filter could be estimated. A possible solution, which we discuss in the next section, is based on the observation that the correlation between two microphone signals is low if the sound field is diffuse and the microphone distance is large enough.

3.3 Algorithms for Post-Filter Estimation

Figure 3.1 shows the block diagram of the studied algorithms. The microphone signals are time aligned and decomposed by a frequency subband transform (FT). The coefficients $w_n$ represent the weight vector $\mathbf{w}$ of the beamformer and $H$ represents the post-filter. The inverse subband transform (IFT) synthesizes the output signal.
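The recursive estimate (3.27) described above can be sketched for one subband as follows. The decimation factor, time constant and sampling frequency are hypothetical values chosen only to exercise the formula, and the constant test input is artificial.

```python
import math


def smoothing_factor(decimation, time_constant_ms, sample_rate_khz):
    """alpha = exp(-D / (T_alpha * f_s)), with T_alpha in ms and f_s in kHz."""
    return math.exp(-decimation / (time_constant_ms * sample_rate_khz))


def update_cross_psd(previous, x, y, alpha):
    """One step of (3.27) for a single subband sample pair (x, y)."""
    return alpha * previous + (1.0 - alpha) * x * y.conjugate()


# Hypothetical settings: D = 128, T_alpha = 50 ms, f_s = 8 kHz.
alpha = smoothing_factor(128, 50.0, 8.0)

# Feed a constant subband value; the estimate approaches |x|^2 = 2.
estimate = 0j
for _ in range(200):
    estimate = update_cross_psd(estimate, 1 + 1j, 1 + 1j, alpha)

print(f"alpha = {alpha:.3f}, steady-state estimate = {estimate.real:.6f}")
```

A smaller time constant gives a smaller alpha, so the estimate tracks changes faster but fluctuates more, which is the trade-off described for $T_\alpha$.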
The coefficients $f_n$ for post-filter estimation form a vector $\mathbf{f}$. Unless otherwise noted we assume that $\mathbf{f} = \mathbf{w}$.

[Fig. 3.1. General block diagram of the examined post-filters.]

We begin our analysis of multi-microphone post-filters by recalling some results on the performance of arrays from Chapter 2, since these results are needed later. We generally assume that the coefficients are normalized so that $\mathbf{w}^H \mathbf{1}\mathbf{1}^H \mathbf{w} = 1$ and $\mathbf{f}^H \mathbf{1}\mathbf{1}^H \mathbf{f} = 1$, where $\mathbf{1}$ is the $N$-vector of ones. Therefore, the array gain equals the noise reduction of the array. For convenience, we define a noise power attenuation factor that equals the inverse of the array gain:

$$ A_\Gamma = \mathbf{w}^H \boldsymbol{\Gamma}_{vv} \mathbf{w} = G^{-1}, \tag{3.28} $$

where the coherence matrix $\boldsymbol{\Gamma}_{vv}$ is the normalized noise correlation matrix $\boldsymbol{\Gamma}_{vv} = \boldsymbol{\Phi}_{vv} N / \operatorname{trace}[\boldsymbol{\Phi}_{vv}]$, and all quantities are assumed to be frequency dependent. An examination of (3.28) shows that the noise attenuation of the array is the weighted sum of the complex coherence functions of all sensor pairs. Thus, all products appear in conjugate pairs $\Gamma_{mn} + \Gamma_{nm} = 2\,\mathrm{Re}\{\Gamma_{nm}\}$. As a result, the noise reduction of the array is actually a function of the real part of the complex coherence between the sensors. The knowledge of the magnitude squared coherence is not sufficient.

The white noise gain is the array gain for spatially uncorrelated noise, where $\boldsymbol{\Gamma}_{vv} = \mathbf{I}$. Thus, the attenuation factor for spatially white noise is

$$ A_I = \mathbf{w}^H \mathbf{w}. \tag{3.29} $$

The additional noise attenuation of the post-filter is given by

$$ A_{\text{post}} = |H_{\text{post}}|^2. \tag{3.30} $$

The total noise attenuation of the combined system is the product of the attenuation of the array and the attenuation of the post-filter, or the respective sum in dB:

$$ A_{\text{total}}\big|_{\text{dB}} = 10 \log_{10}(A_\Gamma) + 10 \log_{10}(A_{\text{post}}). \tag{3.31} $$

3.3.1 Analysis of Post-Filter Algorithms

The first method for post-filter estimation we study is a generalized version of Zelinski's algorithm that was discussed by Marro et al. [15]. It covers several other algorithms as a special case.
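[equation (3.32) missing from the source]

Equation (3.32) includes Danilenko's [2] idea to use the ratio of cross-correlation $\phi_{x_n x_m}$ and power $\phi_{x_n x_n}$ for suppressing incoherent noise, the complex subband approach of Allen et al. [9], Zelinski's proposal to average over all microphone pairs $m > n$ [11], and Marro's [40] extension to complex shading coefficients $w_n$. To write this algorithm in matrix notation, we note that

$$ 2\,\mathrm{Re}\left\{ \sum_{n=0}^{N-2} \sum_{m=n+1}^{N-1} w_n w_m^* \phi_{x_n x_m} \right\} = \sum_{n=0}^{N-1} \sum_{m=0}^{N-1} w_n w_m^* \phi_{x_n x_m} - \sum_{n=0}^{N-1} w_n w_n^* \phi_{x_n x_n}. $$

This is a Hermitian form of the shading coefficients $w_n$ and the correlation matrix $\boldsymbol{\Phi}_{xx}$, minus the weighted sum of diagonal elements of $\boldsymbol{\Phi}_{xx}$. The algorithm (3.32) requires that the relative time-delay differences and gain ratios between the microphone signals have been compensated in advance so that $\mathbf{d} = \mathbf{1}$. This leads to a modified noise correlation matrix $\boldsymbol{\Phi}_{xx}$ (see Chapter 2). The transfer function of the post-filter (3.32) may now conveniently be written in matrix form as

[equation (3.33) missing from the source]

where $\boldsymbol{\Phi}_{xx}^{\circ}$ is a diagonal matrix of the diagonal elements of $\boldsymbol{\Phi}_{xx}$. If the sound field is homogeneous, we have the same input power at each microphone, i.e., $\boldsymbol{\Phi}_{xx}^{\circ} = \phi_{xx} \mathbf{I}$, and may write

$$ H_{zm} = \frac{\mathbf{w}^H \boldsymbol{\Phi}_{xx} \mathbf{w} - \phi_{xx}\, \mathbf{w}^H \mathbf{w}}{\phi_{xx} \left( \mathbf{w}^H \mathbf{1}\mathbf{1}^H \mathbf{w} - \mathbf{w}^H \mathbf{w} \right)}. \tag{3.34} $$

If signal and noise are uncorrelated we have $\boldsymbol{\Phi}_{xx} = \boldsymbol{\Phi}_{ss} + \boldsymbol{\Phi}_{vv}$. Therefore,

$$ H_{zm} = \frac{\mathbf{w}^H \boldsymbol{\Phi}_{ss} \mathbf{w} + \mathbf{w}^H \boldsymbol{\Phi}_{vv} \mathbf{w} - (\phi_{ss} + \phi_{vv})\, \mathbf{w}^H \mathbf{w}}{(\phi_{ss} + \phi_{vv}) \left( \mathbf{w}^H \mathbf{1}\mathbf{1}^H \mathbf{w} - \mathbf{w}^H \mathbf{w} \right)}. \tag{3.35} $$

Assuming that the coefficients are normalized such that $\mathbf{w}^H \mathbf{1}\mathbf{1}^H \mathbf{w} = 1$ and that the desired signal is coherent, i.e., $\boldsymbol{\Phi}_{ss} = \phi_{ss} \mathbf{1}\mathbf{1}^H$, and with the noise correlation matrix being $\boldsymbol{\Phi}_{vv} = \phi_{vv} \boldsymbol{\Gamma}_{vv}$, where $\phi_{vv} = \operatorname{trace}[\boldsymbol{\Phi}_{vv}] / N$, we finally obtain

$$ H_{zm} = \frac{\phi_{ss}}{\phi_{ss} + \phi_{vv}} + \frac{\phi_{vv} \left( \mathbf{w}^H \boldsymbol{\Gamma}_{vv} \mathbf{w} - \mathbf{w}^H \mathbf{w} \right)}{(\phi_{ss} + \phi_{vv}) \left( 1 - \mathbf{w}^H \mathbf{w} \right)}. \tag{3.36} $$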
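The step from the matrix form (3.34) to the attenuation-factor form, and the double-sum identity used above, can be checked with a small real-valued example. The four-microphone delay-and-sum weights, the exponential coherence model with $\rho = 0.6$ and the signal and noise powers are assumptions chosen for illustration only.

```python
N = 4
w = [1.0 / N] * N                       # delay-and-sum weights, sum(w) = 1
rho = 0.6                               # hypothetical coherence decay
gamma_vv = [[rho ** abs(n - m) for m in range(N)] for n in range(N)]
phi_ss, phi_vv = 1.0, 0.5               # hypothetical signal and noise powers


def quadratic_form(weights, matrix):
    """w^H M w for real-valued weights and a real symmetric matrix."""
    size = len(weights)
    return sum(weights[n] * matrix[n][m] * weights[m]
               for n in range(size) for m in range(size))


# Coherent desired signal (d = 1) plus noise: Phi_xx = phi_ss 1 1^H + phi_vv Gamma_vv.
phi_xx_matrix = [[phi_ss + phi_vv * gamma_vv[n][m] for m in range(N)]
                 for n in range(N)]
phi_xx = phi_ss + phi_vv                # common input power at each microphone

a_gamma = quadratic_form(w, gamma_vv)   # (3.28)
a_i = sum(x * x for x in w)             # (3.29)
ones_form = sum(w) ** 2                 # w^H 1 1^H w, equal to 1 here

# Matrix form (3.34).
h_matrix = ((quadratic_form(w, phi_xx_matrix) - phi_xx * a_i)
            / (phi_xx * (ones_form - a_i)))

# Attenuation-factor form obtained from (3.36) with (3.28) and (3.29).
h_attenuation = (phi_ss / (phi_ss + phi_vv)
                 + phi_vv * (a_gamma - a_i) / ((phi_ss + phi_vv) * (1.0 - a_i)))

# Double-sum identity: pairs m > n versus full Hermitian form minus diagonal.
pair_sum = 2.0 * sum(w[n] * w[m] * phi_xx_matrix[n][m]
                     for n in range(N - 1) for m in range(n + 1, N))
full_minus_diagonal = (quadratic_form(w, phi_xx_matrix)
                       - sum(w[n] * w[n] * phi_xx_matrix[n][n] for n in range(N)))

print(f"H_zm (matrix form) = {h_matrix:.6f}")
print(f"H_zm (attenuation) = {h_attenuation:.6f}")
```

Both forms of $H_{zm}$ agree, and the value exceeds the Wiener gain $\phi_{ss}/(\phi_{ss} + \phi_{vv})$ because the assumed noise is positively correlated between microphones, so that $A_\Gamma > A_I$.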
(3.36) Although the designs of the MVDR array and the post-filter estimation algorithm do not seem to have much in common, the transfer function of the post-filter may be expressed as a function of the attenuation factors of the array by substituting (3.28) and (3.29) into (3.36): H _ ¢ss + ¢vv (Ar - AI) zm - ¢ss + ¢vv (¢ss + ¢vv) (1 - AI) (3.37) This is also true for the slightly modified version of Zelinski's algorithm [13]: (3.38) 3 Post-filtering Techniques 49 where c/lyy = c/lss + c/lvvAr is the output power of the array. The modified post-filter can be expressed as Hsm = c/lss + c/lvvAr (Ar - AI) c/lss + c/lvvAr (c/lss + c/lvvAr) (1 - AI) (3.39) These rather surprising results were first derived in [15]. They are used in the following section to discuss the properties of a large class of post-filtering algorithms. 3.3.2 Properties of Post-Filter Algorithms First of all, we note that the shading coefficients Wn form a weight vector w that generally can be computed by using the design rule of the MVDR array. It is not necessary, however, to use the same design for array processor and post-filter (see Fig. 3.1). Both the MVDR weight vector and the array gain are functions of the noise correlation matrix. It should be noted that the correlation matrix that is used for the design may differ from the correlation matrix of the environment in which the array operates. Therefore, three different correlation matrices may be involved: a first one for the design of the array processor, a second one for the design of the post-filter, and a third one to determine the performance in the actual environment. Analyzing (3.37) and (3.39) leads to the following conclusions: • Optimum performance is only reached if Ar = AI: The difference of the two attenuation factors is zero only if the noise is spatially uncorrelated which was Danilenko's initial assumption in the design of his suppression system. 
In this case, (3.37) becomes a Wiener filter for the input signal of the beamformer. On the other hand, (3.39) becomes a Wiener filter for the beamformer output and therefore represents the MMSE solution for uncorrelated noise if the delay and sum beamformer is used. All other coefficient sets, including superdirective solutions, yield suboptimal performance. In a diffuse sound field, the noise is correlated at low frequencies, which leads to poor performance for low frequency noise.

• Negative post-filter if $A_\Gamma < A_I$: In a diffuse noise field, or if coherent sources are present, the difference of the attenuation factors $(A_\Gamma - A_I)$ may cause a negative transfer function. If negative parts of the transfer functions are set to zero, which is a common strategy, signal cancellation may occur.

• Infinite post-filter if $A_I = 1$: This is usually the case with superdirective designs which amplify uncorrelated noise at low frequencies.

Fig. 3.2. Theoretical noise attenuation (in dB, versus frequency from 0 to 4000 Hz) of an end-fire array for a diffuse noise field. Left: delay and sum beamformer coefficients. Right: superdirective coefficients.

To demonstrate the preceding results, we computed the theoretical performance of a four microphone end-fire array with 8 cm inter-microphone distance in a diffuse noise field ($\phi_{ss} = 0$). Figure 3.2 shows the attenuation factors $A_\Gamma$ and $A_I$ of the beamformer and the noise attenuation $A_{post}$ of the post-filter (3.37). The left part depicts the attenuation for delay and sum beamformer coefficients ($\mathbf{f} = \mathbf{w} = \mathbf{1}/N$) and the right part depicts the attenuation for superdirective coefficients ($\mathbf{f} = \mathbf{w}_{MVDR}$).
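The delay and sum case of this demonstration can be sketched in a few lines. This is only an illustration under stated assumptions, not the code used for Fig. 3.2: it assumes the standard sin(kd)/(kd) coherence of a spherically isotropic (diffuse) field, a sound speed of 343 m/s, and steering delays that have already been compensated for the end-fire look direction. The function names are hypothetical. With $\phi_{ss} = 0$, (3.37) reduces to the ratio of the difference of the array attenuation factor (3.28) and the white noise attenuation factor (3.29) to one minus the white noise attenuation factor.

```python
import math


def sinc(x):
    """Unnormalized sinc, sin(x)/x, with the limit value 1 at x = 0."""
    return 1.0 if x == 0.0 else math.sin(x) / x


def attenuation_factors(freq_hz, n_mics=4, spacing_m=0.08, c=343.0):
    """Return (A_Gamma, A_I) for uniform weights w = 1/N, end-fire steering.

    Assumption: diffuse-field coherence sinc(k*d) between sensors. After
    end-fire delay compensation each entry picks up a phase exp(j*k*d);
    with real uniform weights only the real part sinc(k*d)*cos(k*d)
    contributes to the Hermitian form w^H Gamma w.
    """
    k = 2.0 * math.pi * freq_hz / c          # wavenumber
    w = 1.0 / n_mics                          # delay and sum weight
    a_gamma = 0.0
    for m in range(n_mics):
        for n in range(n_mics):
            d = (m - n) * spacing_m
            a_gamma += w * w * sinc(k * d) * math.cos(k * d)
    a_white = n_mics * w * w                  # w^H w, see (3.29)
    return a_gamma, a_white


def zelinski_noise_gain(a_gamma, a_white):
    """Post-filter transfer function (3.37) for phi_ss = 0 (noise only)."""
    return (a_gamma - a_white) / (1.0 - a_white)


def to_db(power_ratio):
    """10*log10 of a positive power ratio."""
    return 10.0 * math.log10(power_ratio)


if __name__ == "__main__":
    for f in (100.0, 500.0, 1000.0, 2000.0, 3000.0):
        a_g, a_i = attenuation_factors(f)
        h = zelinski_noise_gain(a_g, a_i)
        # A_post = |H|^2, see (3.30); H can be negative when A_Gamma < A_I.
        print(f, to_db(a_g), to_db(a_i), h)
```

Under these assumptions the post-filter gain approaches one near 0 Hz (no attenuation of low frequency noise), and it is zero at f = c/(2d), where the off-diagonal coherence terms vanish and both attenuation factors equal 1/N.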
The performance of the delay and sum beamformer and the respective post-filter is poor at low frequencies. At high frequencies the coherence of a diffuse noise field approaches zero. Therefore, $A_\Gamma$ is close to $A_I$ and both post-filters perform nearly optimally. The superdirective beamformer performs particularly well at low frequencies. The respective post-filter, however, does not benefit from using superdirective coefficients. The performance gets even worse at low frequencies and the transfer function is infinite at the frequency where $A_I$ crosses 0 dB.

3.3.3 A New Post-Filter Algorithm

To derive an improved algorithm we note that in all cases the subtraction of the white noise attenuation $A_I$ in (3.37) is causing the trouble. It reduces the performance for superdirective coefficients and is responsible for negative or infinite post-filters. Our straightforward approach for solving these problems is to replace the difference $A_\Gamma - A_I$ with $A_\Gamma$, since $A_\Gamma$ is the parameter that is actually minimized by the design of the MVDR beamformer. Substituting $A_I = 0$ in (3.37) results in

$$H_{apab} = \frac{\phi_{ss} + \phi_{vv} A_\Gamma}{\phi_{ss} + \phi_{vv}}$$

[...]

Fig. 3.5. Graphical description of the complete simulation system.

[...] enhancement (SNRE):

$$\mathrm{SNRE}(l) = \mathrm{SNR}_{in}(l) - \mathrm{SNR}_{out}(l). \tag{3.43}$$

The segmental SNR is computed from consecutive samples with block-length $B = 256$ at a sampling frequency of 8 kHz:

$$\mathrm{SNR}_{in}(l) = 10 \log_{10} \frac{\sum_{k=lB+1}^{(l+1)B} s^2(k)}{\sum_{k=lB+1}^{(l+1)B} v^2(k)}, \tag{3.44}$$

$$\mathrm{SNR}_{out}(l) = 10 \log_{10} \frac{\sum_{k=lB+1}^{(l+1)B} y_s^2(k)}{\sum_{k=lB+1}^{(l+1)B} y_v^2(k)}. \tag{3.45}$$

The second objective measure is the log-area-ratio distance (LAR), which has been tested with good results in [42]. This quantity can be computed in three steps:

1. Estimate the PARtial CORrelation coefficients (PARCOR) of a block of samples. The block-size should be small enough to hold the assumption of stationarity but large enough to reduce bias and variance of the estimated values. A good choice is a block-size of 256 for a model order of $P = 12$. An algorithm for estimating PARCOR coefficients is the well-known Burg algorithm [35].
2. Determine the area-coefficients by

$$g(p, l) = \frac{1 + k(p, l)}{1 - k(p, l)} \quad \forall\; 1 \le p \le 12, \tag{3.46}$$

where $k(p, l)$ is the $p$th PARCOR coefficient of block $l$.
3. Compute the LAR of block $l$

(3.47)

The final quantity we use is a speech degradation measure, which can be defined by the LAR of the input and the output speech signals only

(3.48)

It includes the room reverberation, the signal distortion caused by the tested algorithm, and the dereverberation features of the tested algorithm only. Finally, the average of all blocks containing speech is computed.

3.4.3 Simulation Results

The described simulation system was used to evaluate the performance of four different post-filter algorithms:

1. Zel88: The algorithm by Zelinski in the frequency-domain implementation [21].
2. Sim92: The algorithm by Simmer described in [13].
3. APAB: The adaptive post-filter for an arbitrary beamformer, described in section 3.3 with a constrained MVDR-beamformer designed for an isotropic noise field in three dimensions (superdirective beamformer). The constraining parameter is set to $\mu = 0.01$ (see Chapter 2).
4. APES: The adaptive post-filter extension for superdirective beamformers [32].

For comparison, we include the results of the case in which no algorithm is used (No NR). The speech sample we used is the sentence "I am now speaking to you from a distance of 50 cm from the microphone" spoken by an adult male. The length of this file leads to 98 blocks containing speech. The noise file was white Gaussian noise, used in order to give technical results which can be reproduced by other researchers.
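Two of the building blocks of the objective measures in Section 3.4.2 are simple enough to sketch directly: the block-wise SNR of (3.44) and (3.45), and the area coefficients of (3.46). The following is a minimal illustration with hypothetical function names. It assumes that the separately available signal and noise components are given as plain Python sequences and that every block has non-zero power; the PARCOR estimation itself (for example by the Burg algorithm) is not shown.

```python
import math


def segmental_snr_db(signal, noise, block_len=256):
    """Block-wise SNR in dB, following the structure of (3.44)/(3.45).

    The text indexes samples k = l*B+1 ... (l+1)*B (1-based); here the
    same blocks are addressed with 0-based slices. Assumes every block
    of both sequences has non-zero power.
    """
    n_blocks = min(len(signal), len(noise)) // block_len
    snr = []
    for l in range(n_blocks):
        start, stop = l * block_len, (l + 1) * block_len
        p_signal = sum(x * x for x in signal[start:stop])
        p_noise = sum(x * x for x in noise[start:stop])
        snr.append(10.0 * math.log10(p_signal / p_noise))
    return snr


def area_coefficients(parcor):
    """Area coefficients g = (1 + k) / (1 - k) of (3.46), valid for |k| < 1."""
    return [(1.0 + k) / (1.0 - k) for k in parcor]


if __name__ == "__main__":
    # Toy data: constant sequences with a power ratio of 4 in every block.
    s = [2.0] * 512
    v = [1.0] * 512
    print(segmental_snr_db(s, v))
    print(area_coefficients([0.0, 0.5, -0.5]))
```

Passing the input components $s$, $v$ gives the input SNR per block, and passing the processed components $y_s$, $y_v$ gives the output SNR per block; restricting the average to blocks that contain speech is left to the caller.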
The input SNR was computed only for blocks containing speech by using the segmental SNR.

In the first experiment, the broadside array shown on the left side of Fig. 3.4 is examined. Figure 3.6 depicts the results for the SNRE. The left side shows the dependence on the input-SNR if the reverberation time is set to $\tau_{60} = 300$ ms. The right figure shows the results for SNR = 5 dB as a function of the reverberation time. This provides information on the behavior of the algorithms for different spatial conditions. The noise-field is coherent for low reverberation time and approximately diffuse for high values.

Fig. 3.6. Left: SNRE vs. input-SNR ($\tau_{60} = 300$ ms). Right: SNRE vs. reverberation time $\tau_{60}$ (SNR = 5 dB) (Broadside).

Although not optimal, the Zel88 algorithm performs quite well, especially for high reverberation times where it provides the best results of all tested algorithms (if only the SNRE is considered). At low reverberation times APAB and APES can benefit from the better suppression at low frequencies by using a superdirective beamformer instead of a standard delay and sum beamformer.

Fig. 3.7. Left: SD vs. input-SNR ($\tau_{60} = 300$ ms). Right: SD vs. reverberation time $\tau_{60}$ (SNR = 5 dB) (Broadside).

Fig. 3.8. Left: LAR vs. input-SNR ($\tau_{60} = 300$ ms). Right: LAR vs. reverberation time $\tau_{60}$ (SNR = 5 dB) (Broadside).

Fig. 3.9. Left: SNRE vs. input-SNR ($\tau_{60} = 300$ ms). Right: SNRE vs. reverberation time $\tau_{60}$ (SNR = 5 dB) (Endfire).

If we take into account the next two measures shown in Fig. 3.7 and 3.8, which describe the performance in terms of speech quality, the results are different. All algorithms enhance the speech quality in comparison to the unprocessed input signal (smaller values indicate better quality). However, the algorithm with the highest SNRE does not produce the best LAR. A closer look at Fig. 3.7 explains this behavior. Since these figures show the speech degradation only, the non-processed signal is constant versus the SNR and reduces to zero if no reverberation is added to the speech signal. The algorithms cause signal distortion at low SNR and the algorithm with the highest performance in SNRE induces the largest distortion, whereas APAB and APES provide the best speech quality (LAR). At very good conditions (SNR > 15 dB), these algorithms are able to suppress reverberation without introducing speech degradation. The lack of artifacts was corroborated through informal listening tests.

In a second experiment (right side of Fig. 3.4), we changed the orientation of the array and the inter-microphone distance. Additionally, only four microphones were used to reduce the array size. In Fig. 3.9 the SNRE results of the simulation are shown.
The performance of the Sim92 and Zel88 algorithms degrades drastically, since the inherent delay and sum beamformer does not perform well at low frequencies due to the small array size. On the other hand, APAB and APES perform well under all conditions. The SNRE for APES at high reverberation time is close to the result for the broadside-experiment although the number of microphones is reduced. Thus, we conclude that end-fire steering is preferable for this algorithm.

3.5 Conclusion

Wiener post-filtering of the output signal of an MVDR beamformer provides an optimum MMSE solution for signal enhancement. A large number of published algorithms for post-filter estimation are based on the assumption of spatially uncorrelated noise. This assumption leads to post-filtering algorithms with suboptimal performance in coherent and diffuse noise fields. In this chapter we presented a new algorithm which performs considerably better in correlated noise fields by using the gain of an arbitrary array. Small size end-fire arrays comprising an MVDR beamformer and optimized post-filters showed the best performance in our simulations.

References

1. R. A. Monzingo and T. W. Miller, Introduction to Adaptive Arrays, John Wiley and Sons, New York, 1980.
2. L. Danilenko, Binaurales Hören im nichtstationären diffusen Schallfeld, PhD thesis, RWTH Aachen, Aachen, Germany, 1968.
3. G. von Békésy, Experiments in Hearing, McGraw-Hill, New York, 1960.
4. S. Gierl, Geräuschreduktion bei Sprachübertragung mit Hilfe von Mikrofonarraysystemen, PhD thesis, Universität Karlsruhe, Karlsruhe, Germany, 1990.
5. S. Gierl, "Noise reduction for speech input systems using an adaptive microphone-array", in Int. Symp. Automotive Tech. and Automation (ISATA), Florence, Italy, May 1990, pp. 517-524.
6. H.-Y. Kim, F. Asano, Y. Suzuki, and T. Sone, "Speech enhancement based on short-time spectral amplitude estimation with two-channel beamformer", IEICE Trans. Fundament., vol. E79-A, no. 12, pp. 2151-2158, Dec. 1996.
7. M. Dörbecker and S. Ernst, "Combination of two-channel spectral subtraction and adaptive Wiener post-filtering for noise reduction and dereverberation", in Proc. EURASIP European Signal Proc. Conf. (EUSIPCO), Trieste, Italy, Sept. 1996.
8. K. Kroschel, A. Czyzewski, M. Ihle, and M. Kuropatwinski, "Adaptive noise cancellation of speech signals in a noise reduction system based on a microphone array", in 102nd Audio Eng. Soc. Conv., preprint 4450, Munich, Germany, Mar. 1997.
9. J. B. Allen, D. A. Berkley, and J. Blauert, "Multimicrophone signal-processing technique to remove room reverberation from speech signals", J. Acoust. Soc. Amer., vol. 62, no. 4, pp. 912-915, Oct. 1977.
10. Y. Kaneda and M. Tohyama, "Noise suppression signal processing using 2-point received signals", Electron. Communicat. Japan, vol. 67-A, no. 12, pp. 19-28, Apr. 1984.
11. R. Zelinski, "A microphone array with adaptive post-filtering for noise reduction in reverberant rooms", in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Proc. (ICASSP), New York, USA, Apr. 1988, pp. 2578-2581.
12. R. Zelinski, "Noise reduction based on microphone array with LMS adaptive post-filtering", Electron. Lett., vol. 26, no. 24, pp. 2036-2037, Nov. 1990.
13. K. U. Simmer and A. Wasiljeff, "Adaptive microphone arrays for noise suppression in the frequency domain", in Second Cost 229 Workshop Adapt. Alg. Communicat., Bordeaux, France, Oct. 1992, pp. 185-194.
14. Y. Mahieux and C. Marro, "Comparison of dereverberation techniques for videoconferencing applications", in 100th Audio Eng. Soc. Conv., preprint 4231, Copenhagen, Denmark, May 1996.
15. C. Marro, Y. Mahieux, and K. U. Simmer, "Analysis of noise reduction and dereverberation techniques based on microphone arrays with postfiltering", IEEE Trans. Speech and Audio Processing, vol. 6, no. 3, pp. 240-259, May 1998.
16. R. Le Bouquin and G. Faucon, "On using the coherence function for noise reduction", in Proc. EURASIP European Signal Proc. Conf. (EUSIPCO), Barcelona, Spain, Sept. 1990, pp. 1103-1106.
17. R. Le Bouquin and G. Faucon, "Study of a noise cancellation system based on the coherence function", in Proc. EURASIP European Signal Proc. Conf. (EUSIPCO), Brussels, Belgium, Aug. 1992, pp. 1633-1636.
18. G. Faucon and R. Le Bouquin-Jeannes, "Optimization of speech enhancement techniques coping with uncorrelated and correlated noise", in Proc. IEEE Int. Conf. on Communication Technology (ICCT-96), Beijing, China, May 1996, pp. 416-419.
19. R. Le Bouquin-Jeannes, A. A. Azirani, and G. Faucon, "Enhancement of speech degraded by coherent and incoherent noise using a cross-spectral estimator", IEEE Trans. Speech and Audio Processing, vol. 5, no. 5, pp. 484-487, Sept. 1997.
20. P. Kuczynski, Mehrkanal-Analyse von Sprachsignalen zur adaptiven Störunterdrückung, PhD thesis, University of Bremen, Shaker Verlag, Aachen, Germany, Sept. 1995.
21. K. U. Simmer, P. Kuczynski, and A. Wasiljeff, "Time delay compensation for adaptive multichannel speech enhancement systems", in Proc. Int. Symp. Signals, Syst. Electron. ISSSE-92, Paris, France, Sept. 1992, pp. 660-663.
22. M. Drews and M. Streckfuß, "Multi-channel speech enhancement using an adaptive post-filter with channel selection and auditory constraints", in Proc. Int. Workshop Acoust. Echo and Noise Control, London, UK, Sept. 1997, pp. 77-80.
23. M. Drews, Mikrofonarrays und mehrkanalige Signalverarbeitung zur Verbesserung gestörter Sprache, PhD thesis, Technische Universität Berlin, Berlin, Germany, 1999.
24. K. U. Simmer, S. Fischer, and A. Wasiljeff, "Suppression of coherent and incoherent noise using a microphone array", Annals of Telecommunications, vol. 49, no. 7/8, pp. 439-446, July 1994.
25. S. Fischer and K. U. Simmer, "Beamforming microphone arrays for speech acquisition in noisy environments", Speech Commun., vol. 20, no. 3-4, pp. 215-227, Dec. 1996.
26. A. Hussain, D. R. Campbell, and T. J. Moir, "A new metric for selecting subband processing in adaptive speech enhancement systems", in Proc. ESCA European Conf. Speech Communicat. Tech. (EUROSPEECH), Rhodes, Greece, Sept. 1997, pp. 1489-1492.
27. R. Atay, E. Mandridake, D. Bastard, and M. Najim, "Spatial coherence exploitation which yields non-stationary noise reduction in subband domain", in Proc. EURASIP European Signal Proc. Conf. (EUSIPCO), Rhodes, Greece, Sept. 1998, pp. 1489-1492.
28. J. Gonzales-Rodriquez, J. L. Sanchez-Bote, and J. Ortega-Garcia, "Speech dereverberation and noise reduction with a combined microphone array approach", in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Proc. (ICASSP), Istanbul, Turkey, Apr. 2000, pp. 1489-1492.
29. D. Mahmoudi and A. Drygajlo, "Combined Wiener and coherence filtering in wavelet domain for microphone array speech enhancement", in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Proc. (ICASSP), Atlanta, USA, May 1998, pp. 1489-1492.
30. D. Mahmoudi and A. Drygajlo, "Wavelet transform based coherence function for multi-channel speech enhancement", in Proc. EURASIP European Signal Proc. Conf. (EUSIPCO), Rhodes, Greece, Sept. 1998, pp. 1489-1492.
31. J. Bitzer, K. U. Simmer, and K. D. Kammeyer, "An alternative implementation of the superdirective beamformer", in Proc. IEEE Workshop Applicat. Signal Processing to Audio and Acoust., New Paltz, New York, Oct. 1999, pp. 7-10.
32. J. Bitzer, K. U. Simmer, and K. D. Kammeyer, "Multi-microphone noise reduction by post-filter and superdirective beamformer", in Proc. Int. Workshop Acoust. Echo and Noise Control, Pocono Manor, USA, Sept. 1999, pp. 100-103.
33. I. A. McCowan, C. Marro, and L. Mauuary, "Robust speech recognition using near-field superdirective beamforming with post-filtering", in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Proc. (ICASSP), Istanbul, Turkey, Apr. 2000.
34. J. P. Burg, "Three-dimensional filtering with an array of seismometers", Geophysics, vol. 29, no. 5, pp. 693-713, Oct. 1964.
35. S. Haykin, Adaptive Filter Theory, Prentice Hall, 3rd edition, 1996.
36. L. W. Brooks and I. S. Reed, "Equivalence of the likelihood ratio processor, the maximum signal-to-noise ratio filter, and the Wiener filter", IEEE Trans. Aerosp. Electron. Syst., vol. AES-8, no. 5, pp. 690-692, Sept. 1972.
37. D. J. Edelblute, J. M. Fisk, and G. L. Kinnison, "Criteria for optimum-signal-detection theory for arrays", J. Acoust. Soc. Amer., vol. 41, no. 1, pp. 199-205, Jan. 1967.
38. N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series with Engineering Applications, Wiley, New York, 1949.
39. W. Kellermann, "Analysis and design of multirate systems for cancelling of acoustic echoes", in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Proc. (ICASSP), Munich, Germany, Apr. 1988, pp. 2570-2573.
40. C. Marro, Y. Mahieux, and K. U. Simmer, "Performance of adaptive dereverberation techniques using directivity controlled arrays", in Proc. EURASIP European Signal Proc. Conf. (EUSIPCO), Trieste, Italy, Sept. 1996, pp. 1127-1130.
41. J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics", J. Acoust. Soc. Amer., vol. 65, no. 4, pp. 943-950, Apr. 1979.
42. S. R. Quackenbush, T. P. Barnwell, and M. A. Clements, Objective Measures of Speech Quality, Prentice-Hall, Englewood Cliffs, NJ, 1988.

4 Spatial Coherence Functions for Differential Microphones in Isotropic Noise Fields

Gary W. Elko
Media Signal Processing Research, Agere Systems, Murray Hill NJ, USA

Abstract. The spatial correlation function between directional microphones is useful in the design and analysis of the performance of these microphones in actual acoustic noise fields. These correlation functions are well known for omnidirectional receivers, but not well known for directional receivers.
This chapter investigates the spatial correlation functions for Nth-order differential microphones in both spherically and cylindrically isotropic noise fields. The results are used to calculate the amount of achievable cancellation from an adaptive noise cancellation application using combinations of differential microphones to remove unwanted noise from a desired signal. The results are useful in determining signal-to-noise ratio gains from arbitrarily positioned differential microphone elements in microphone array applications.

4.1 Introduction

The spatial correlation function is important in the design of optimal beamformers that maximize the signal-to-noise ratio (SNR), source direction finding algorithms, the calculation of actual SNR gain from arrays, and other array signal processing areas. The space-time correlation functions are well known for omnidirectional receivers in two specific environments: spherically and cylindrically isotropic noise fields. One area of large concern that has been a topic of ongoing work has been the design and performance of directional differential microphone systems. One application of these systems is in adaptive noise cancellation schemes. In order to predict the expected performance gains of these adaptive cancellation systems, the spatial correlation functions between directional microphones are required. Results are presented here for the specific cases of general orientation for first-order differential microphones in both spherically and cylindrically isotropic fields. Specific results are given for the general Nth-order cases for differential arrays that have collinear axes.

4.2 Adaptive Noise Cancellation

The use of adaptive noise cancellation in communication devices has been under investigation for more than two decades [1], [2]. The early studies predicted SNR gains on the order of 10 dB and higher. However, it was quickly learned that these predictions were not realized when devices were actually tested in real acoustic environments [2]. One of the problems that was encountered was the lack of coherence between a noise-alone sensor and the noise signal that was corrupting the desired signal. This lack of coherence was due to time-varying multipath, multiple uncorrelated noise sources, and nonlinearities in the transmission path to the signal channel [3].

M. Brandstein et al. (eds.), Microphone Arrays © Springer-Verlag Berlin Heidelberg 2001

Fig. 4.1. Schematic model of adaptive noise cancellation system.

Figure 4.1 shows the typical model of an adaptive noise cancellation system. It can be seen from this model that the adaptive noise cancellation problem is equivalent to the acoustic echo cancellation problem as described by Sondhi [4]. The desired output signal is $s(t)$. This signal is, however, corrupted by the noise signal $n(t)$, and the measured noise signal $x(t)$ convolved with the transmission path $h$ from the measured noise channel to the signal pick-up channel. The adaptive cancellation algorithm estimates the transmission path $h$ and this estimated filter is represented by $\hat{h}$. It is assumed that the signals $s(t)$, $n(t)$, and $x(t)$ are uncorrelated stationary random processes. The output signal is $e(t)$, and if $\hat{h} \approx h$, the output signal $e(t) \approx s(t)$. If it is further assumed that the filter $h$ is time-invariant, the optimum filter $H_{opt}$ is the Wiener filter given by [1],

$$H_{opt}(\omega) = \frac{S_{xd}(\omega)}{S_{xx}(\omega)}, \tag{4.1}$$

where $S_{xd}$ is the cross-spectrum between signals $x$ and $d$, and $S_{xx}$ is the auto-spectrum of signal $x$. If this filter is used in the model shown in Fig. 4.1 then the output auto-spectrum is,

$$S_{ee}(\omega) = S_{dd}(\omega) - |H_{opt}(\omega)|^2 S_{xx}(\omega) = S_{dd}(\omega)\left[1 - |\gamma_{xd}(\omega)|^2\right], \tag{4.2}$$

where $\gamma_{xd}$ is the complex coherence function between the signals $x(t)$ and $d(t)$ and is defined as,

$$\gamma_{xd}(\omega) = \frac{S_{xd}(\omega)}{\left[S_{xx}(\omega)\, S_{dd}(\omega)\right]^{1/2}}. \tag{4.3}$$

The amount of cancellation is equal to the ratio of the primary corrupted signal power to the output signal power,

$$R(\omega) = \frac{S_{dd}(\omega)}{S_{ee}(\omega)} = \frac{1}{1 - |\gamma_{xd}(\omega)|^2}. \tag{4.4}$$

Fig. 4.2. Adaptive cancellation in dB versus the magnitude-squared coherence between the noise signal $x(t)$ and the signal $d(t)$ as defined in Fig. 4.1.

The results presented in (4.4) are well known [2] and a plot of this equation is shown in Fig. 4.2. As can be seen in Fig. 4.2, the magnitude-squared coherence value must be greater than 0.9 if the cancellation $R$ is to be larger than 10 dB. In Fig. 4.1 it can be seen that if $s(t)$ and $n(t)$ are zero, then the cancellation will become infinite. However, in the case of a multipath field with many independent noise sources, the cancellation will be diminished since the coherence between the signals $x$ and $d$ will decrease. To see this, it is illustrative to examine the case of two independent noise sources $n_1$ and $n_2$ as shown in Fig. 4.3.

Fig. 4.3. Schematic model of two independent sources $n_1$ and $n_2$ combining through filters to form signals $x$ and $d$.

For this case the auto-spectral densities are,

$$S_{xx}(\omega) = \sum_{i=1}^{2} S_{ii}(\omega)\, |H_{ix}(\omega)|^2 \tag{4.5}$$

and

$$S_{dd}(\omega) = \sum_{i=1}^{2} S_{ii}(\omega)\, |H_{id}(\omega)|^2. \tag{4.6}$$

The cross-spectral density is,

$$S_{xd}(\omega) = \sum_{i=1}^{2} S_{ii}(\omega)\, H_{ix}^*(\omega)\, H_{id}(\omega), \tag{4.7}$$

where the superscript $*$ denotes the complex conjugate. The magnitude-squared coherence between $x$ and $d$ is therefore,

$$|\gamma_{xd}(\omega)|^2 = \frac{\left|\sum_{i=1}^{2} S_{ii}(\omega)\, H_{ix}^*(\omega)\, H_{id}(\omega)\right|^2}{\left[\sum_{i=1}^{2} S_{ii}(\omega)\, |H_{ix}(\omega)|^2\right]\left[\sum_{i=1}^{2} S_{ii}(\omega)\, |H_{id}(\omega)|^2\right]} \le 1. \tag{4.8}$$

The coherence function given in (4.8) has a value of 1 only if $H_{1x} = H_{1d}$ and $H_{2x} = H_{2d}$. In general, for $L$ independent sources the limit of the sums in (4.8) would be $L$. The model as explained above is a reasonable approximation to what is typically found in practice for acoustic environments in which people work and communicate. Thus, the loss of coherence between sensors in adaptive noise cancellation will most likely be due to this multiple-independent-noise condition. As such, an analysis of the loss of coherence between sensors for different acoustic noise fields is important. This chapter investigates the achievable cancellation for adaptive noise cancellation using differential sensors in both spherically and cylindrically isotropic noise fields. It is expected that these two types of fields will yield results that are representative of what can be obtained in real-world acoustic noise fields.

A practical example of interest in telephony is the use of adaptive noise cancellation for noise removal from the transmitter (microphone) in a telephone handset. A recent patent application [5] has explicitly proposed the use of a secondary directional microphone mounted on the handset such that the null of this noise-alone microphone is aimed in the direction of the "desired" signal. The output from this "noise-alone" microphone is then used to cancel the correlated noise in the microphone that is used to pick up the desired signal. In order to predict the cancellation from this proposed arrangement of transducers, it is necessary to calculate the spatial coherence between these sensors.

In a typical adaptive noise cancellation implementation the transfer function $H$ is approximated as an all-zero filter, i.e., the impulse response $h$ is estimated by an adaptive finite-impulse response (FIR) filter. One advantage of making this system adaptive is to allow for the possibility of a time varying impulse response $h(t)$. There are several problems that occur in this implementation. One major problem is the presence of the desired signal and/or uncorrelated noise signal $n(t)$ when the adaptive filter is attempting to adapt to the measured noise-to-primary input transfer function.
This problem is the same as the "double-talk" problem in the field of acoustic echo cancellation [4]. Another problem is that the signals $s(t)$, $n(t)$, and $x(t)$ are typically nonstationary. Finally, another problem that can limit the cancellation performance is low coherence between the signals $x(t)$ and the signal $d(t)$, even when $s(t)$ and $n(t)$ are small in signal power compared to the power of the noise signal $x(t)$. This lack of coherence has been postulated to be due to nonlinearities and strong nonstationary (time-varying) multipath environments [3],[10].

4.3 Spherically Isotropic Coherence

The spatio-temporal autocorrelation and cross-correlation functions are very useful quantities in sensor array processing. Perhaps the simplest and historically most prominent calculation was the correlation between two omnidirectional microphones in an isotropic noise field. The initial calculation was published by R. K. Cook et al. [6]. For completeness and to develop the notation this well-known result will now be derived.

The space-time correlation function for stationary random processes $p_1$ and $p_2$ is defined as,

(4.9)

where $E$ is the expectation operator, $\mathbf{s}$ is the position of the sensor measuring acoustic pressure $p_1$, and $\mathbf{r}$ is the displacement vector to the sensor measuring pressure $p_2$. For a plane-wave incident field with wavevector $\mathbf{k}$ ($\|\mathbf{k}\| = k = \omega/c$, where $c$ is the speed of sound), $R_{12}$ can be written as

(4.10)

where $R$ is the temporal autocorrelation function of the acoustic pressure $p$. The cross-spectral density is the Fourier transform of the cross-correlation function,

(4.11)

If we assume that the acoustic field is spatially homogeneous (the correlation function is not dependent on the absolute position of the sensors), and also assume that the field is spherically isotropic (uncorrelated signals from all directions), the vector $\mathbf{r}$ can be replaced with a scalar variable $r$ which is the spacing between the two measurement locations.
Thus the cross-spectral density for an isotropic field is the average cross-spectral density for all spherical directions, (), ,w)T;(B,,w)e-jkrcosB sin BdBd, and (10 10 D 12 (r,w) = 1r 2tr I T1(B,,w) 12 sinBdBd,w) 12 sinBdBd The denominator is inversely proportional to the geometric mean of the two microphone directivity factors Q1 and Q2 [8]. Therefore the denominator D12 is, ( 4.17) A general closed-form solution for the spatial coherence between any Nth and M th-order differential array if the differential axes are collinear has been found and is presented in a subsequent section. First, however, a general result for first-order differential arrays will be discussed. For this particular differential order, a solution is presented that allows the calculation of the spatial coherence for any arbitrary orientation of first-order differential arrays. The directional response for a first-order differential microphone can be written as [8], (4.18) where '¢i is the angle between the incident wave and the axis of the ith first-order microphone. Defining Ui as the unit vector indicating the spatial orientation of differential microphone i, and defining :k = kj II k II as a unit vector, results in the following definitions in spherical coordinates: :k = (cos sin B, sin sin B, cos B) Ui = (cos i sin Bi , sin i sin Bi , cos Bi)· (4.19) 68 Elko Thus, the cosine term in (4.18) can be written as (4.20) Using (4.15), (4.16), (4.18), (4.19), and (4.20) and again assuming, without loss of generality, that the microphones lie along the z-axis, yields r1 (27r NI2(kr) = 47r 10 10 [0:1 + (1 - o:d(Xl cos

Fig. 4.6. Magnitude-squared coherence (MSC) for various orientations of cardioid microphones in a spherically isotropic noise field. (Horizontal axis: kr.)

From Appendix A, the result is

$$I_n = \frac{n!}{2(jkr)^{n+1}}\left[e^{jkr}\sum_{m=0}^{n}\frac{(-jkr)^m}{m!} - e^{-jkr}\sum_{m=0}^{n}\frac{(jkr)^m}{m!}\right]. \qquad (4.29)$$

Fig. 4.7. Maximum cancellation (dB) for various orientations of cardioid microphones for spherically isotropic noise fields. (Horizontal axis: kr.)

Fig. 4.8. Magnitude-squared coherence (MSC) for various orientations of omnidirectional and dipole and cardioid microphones in a spherically isotropic noise field. (Horizontal axis: kr.)

4 Spatial Coherence Functions 73

The numerator of (4.21) is a sum of integrals given by (4.29). The denominator is inversely proportional to the square root of the product of the directivity factors as given in (4.24). Therefore the solution to (4.21) for a general combination of collinear differential arrays is

(4.30)

Fig. 4.9. Maximum cancellation (dB) for various orientations of omnidirectional and dipole and cardioid microphones for spherically isotropic noise fields. (Horizontal axis: kr.)

Plots of the coherence function for second and third-order dipole and cardioid microphones are shown in Figs. 4.10 and 4.11.

4.4 Cylindrically Isotropic Fields

The previous section dealt with spherically isotropic acoustic noise fields.
It has been proposed that some room acoustic fields may be more closely modeled as a cylindrically isotropic field [8]. As a result, it is useful to derive theoretical spatial coherence functions for this type of field. The coherence function for any general field was given in (4.15). To derive the forms for the cylindrical field, the only difference from the previous development for the spherically isotropic case is the integration implied by the expectation operator E. For the cylindrically isotropic field the expectation involves only the integration in one dimension, the cylindrical angle $\phi$.

Fig. 4.10. Magnitude-squared coherence for second and third-order collinear dipoles in a spherically isotropic noise field. (Horizontal axis: kr.)

Fig. 4.11. Magnitude-squared coherence for second and third-order collinear cardioids in a spherically isotropic noise field. (Horizontal axis: kr.)

The directional responses of the two first-order differential arrays with general orientation are

$$T_i(\phi) = \alpha_i + (1-\alpha_i)(\cos\phi\cos\phi_i + \sin\phi\sin\phi_i), \quad i = 1, 2. \qquad (4.31)$$

The numerator for the coherence function is the integral of the product of the two directional responses given in (4.31) and is (assuming without loss of generality that the microphones lie along the z-axis),

$$N_{12}(kr) = \frac{1}{2\pi}\int_0^{2\pi} [\alpha_1 + (1-\alpha_1)(\cos\phi\cos\phi_1 + \sin\phi\sin\phi_1)] \times [\alpha_2 + (1-\alpha_2)(\cos\phi\cos\phi_2 + \sin\phi\sin\phi_2)] \times e^{-jkr\cos\phi}\,d\phi. \qquad (4.32)$$

The integration of (4.32) is rather tedious and is given in Appendix B. The resulting numerator for the coherence function is

$$N_{12}(kr) = \alpha_1\alpha_2 J_0(kr) + (\alpha_1-1)(\alpha_2-1)\cos\phi_1\cos\phi_2\,[J_0(kr) - J_2(kr)]/2 + (\alpha_1-1)(\alpha_2-1)\sin\phi_1\sin\phi_2\,[J_0(kr) + J_2(kr)]/2 + j[\alpha_2\cos\phi_1(1-\alpha_1) + \alpha_1\cos\phi_2(1-\alpha_2)]J_1(kr) \qquad (4.33)$$

where $J_n$ are the Bessel functions of the first kind of integer order n.
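The closed form (4.33) is easy to evaluate numerically. The following is a minimal sketch, not the author's implementation: the helper names `bessel_j` and `numerator_n12` are hypothetical, and the Bessel functions are computed from their power series (an assumption made here to stay within the Python standard library; the series is adequate for the plotted range of kr up to about 10).

```python
import math

def bessel_j(n, x, terms=60):
    """Bessel function of the first kind J_n(x), integer n >= 0, from its power series."""
    total = 0.0
    for m in range(terms):
        total += ((-1) ** m / (math.factorial(m) * math.factorial(m + n))
                  * (x / 2.0) ** (2 * m + n))
    return total

def numerator_n12(kr, alpha1, alpha2, phi1, phi2):
    """Numerator of the cylindrical-field coherence, term by term as printed in (4.33)."""
    j0, j1, j2 = (bessel_j(n, kr) for n in range(3))
    real = (alpha1 * alpha2 * j0
            + (alpha1 - 1) * (alpha2 - 1) * math.cos(phi1) * math.cos(phi2) * (j0 - j2) / 2
            + (alpha1 - 1) * (alpha2 - 1) * math.sin(phi1) * math.sin(phi2) * (j0 + j2) / 2)
    # Sign of the imaginary part follows (4.33) as printed.
    imag = (alpha2 * math.cos(phi1) * (1 - alpha1)
            + alpha1 * math.cos(phi2) * (1 - alpha2)) * j1
    return complex(real, imag)

# Two omnidirectional elements (alpha = 1): the numerator reduces to J0(kr).
omni_pair = numerator_n12(1.0, 1.0, 1.0, 0.0, 0.0)

# Two orthogonal dipoles (alpha = 0, axes 90 degrees apart): the numerator vanishes.
orthogonal_dipoles = numerator_n12(2.0, 0.0, 0.0, 0.0, math.pi / 2)

print(omni_pair, abs(orthogonal_dipoles))
```

With both $\alpha_i$ equal to one the expression collapses to $J_0(kr)$, and for orthogonal dipoles every term is zero up to rounding, so such a pair offers no coherence for an adaptive canceller to exploit.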
The denominator for the coherence function for first-order differential arrays is easily derived and is,

(4.34)

A closed-form solution can also be found for the general Nth-order differential array in a cylindrically correlated field if the differential microphones have axes that are collinear. The numerator for the coherence function is the integral of the product of the individual directional responses given in (4.27). This product of polynomials can itself be expressed as a polynomial of order equal to the sum of the two individual directivity polynomial orders. In general, the solution for the numerator requires the evaluation of the integral

(4.35)

From Appendix C, $I_n$ is,

$$I_n = \frac{1}{2^{n-1}}\sum_{m=0}^{n/2} c_m(-j)^{n-2m}C(n,m)J_{n-2m}(kr), \quad \text{for } n \text{ even},$$

$$I_n = \frac{1}{2^{n-1}}\sum_{m=0}^{(n-1)/2} (-j)^{n-2m}C(n,m)J_{n-2m}(kr), \quad \text{for } n \text{ odd}, \qquad (4.36)$$

where $c_m$ is defined as,

$$c_m = \begin{cases} 1, & m \neq n/2, \\ 1/2, & m = n/2, \end{cases} \qquad (4.37)$$

and the function C is the binomial coefficient [7]

$$C(n,m) = \frac{n!}{(n-m)!\,m!}. \qquad (4.38)$$

The numerator of the coherence function is

$$N_{12}(kr) = \sum_{n=0}^{2N} d_n I_n, \qquad (4.39)$$

where the coefficients $d_n$ are components of the vector

$$\mathbf{d} = \mathbf{a} * \mathbf{b}. \qquad (4.40)$$

The symbol * indicates the convolution and the vectors a and b are from the directivity response polynomials as defined in (4.27). The denominator term has previously been shown as equal to the inverse of the directivity factor. The directivity factor for a differential array in a cylindrically isotropic sound field is [8]

$$Q_{cyl}(a_0, \ldots, a_{N-1}) = \frac{\mathbf{a}^T\mathbf{G}\mathbf{a}}{\mathbf{a}^T\mathbf{H}\mathbf{a}}, \qquad (4.41)$$

where the superscript T denotes the transpose operator, the subscript on Q indicates a cylindrical field, G is an (N + 1) x (N + 1) matrix whose elements are

(4.42) (4.43)

and H is a Hankel matrix given by,

$$H_{i,j} = \begin{cases} \dfrac{(i+j-1)!!}{(i+j)!!}, & \text{if } i+j \text{ even}, \\ 0, & \text{otherwise}. \end{cases} \qquad (4.44)$$

The double factorial function is defined as [7]: $(2n)!! = 2\cdot 4\cdots(2n)$ for n even, and $(2n+1)!! = 1\cdot 3\cdots(2n+1)$ for n odd. The denominator $D_{12}$ is

$$D_{12} = [Q_{cyl1}\,Q_{cyl2}]^{-1/2}.$$
(4.45)

The quotient of (4.39) and (4.45) yields the general result for the coherence function between any arbitrarily oriented first-order differential microphones spaced at a distance r. If the two values of $\alpha_i$ are both unity, the spatial coherence reduces to the well-known value for omnidirectional elements in a cylindrically isotropic noise field [6]

$$\gamma_{12}(kr) = J_0(kr), \qquad (4.46)$$

where $J_0$ is the zero-order Bessel function of the first kind.

Figure 4.12 shows the coherence between an omnidirectional microphone and variously oriented dipole microphones as a function of the dimensionless parameter kr. Figure 4.13 shows the amount of possible cancellation attainable with these various orientations of the dipole microphones. In general the curves for the cylindrically isotropic noise fields are similar to those of the spherically isotropic fields except that the values are higher for the cylindrical case as a function of kr. This result should not be too surprising since the integration region has now been confined to a plane, and not over all spherical directions. Figure 4.14 shows the coherence between various orientations of cardioid microphones as a function of kr. Figure 4.15 shows the amount of possible cancellation attainable with these various orientations of the cardioid microphones. Figure 4.16 shows the coherence between various orientations of omnidirectional microphones and dipole and cardioid microphones as a function of kr. Figure 4.17 shows the amount of possible cancellation attainable with these various orientations of the omnidirectional and dipole and cardioid microphones. Plots of the coherence function for second and third-order dipole and cardioid microphones are shown in Figs. 4.18 and 4.19. The coherence functions decay more slowly for higher-order differential arrays that are collinear.
This is due to the narrower beamwidth and the commensurate higher weighting of the noise field in the direction along the microphone axes.

Fig. 4.12. Magnitude-squared coherence (MSC) for omnidirectional and various orientations of dipole microphones in a cylindrically isotropic noise field. (Horizontal axis: kr.)

Fig. 4.13. Maximum cancellation (dB) for omnidirectional and various orientations of dipole microphones for cylindrically isotropic fields. (Horizontal axis: kr.)

4.5 Conclusions

It has been shown that adaptive noise cancellation schemes that utilize low-order differential microphones in isotropic noise fields require care in the orientation of the sensors. As an example, the use of orthogonal dipole microphones or an omnidirectional and an appropriately rotated dipole microphone will yield no noise cancellation at all. In general, adaptive cancellation will occur only for small values of kr (frequency-spacing product). It has been argued that strong multipath (reverberant) acoustic fields exhibit statistics similar to isotropic fields [10]. As a result, it should be expected that adaptive noise cancellation schemes will show limited SNR improvements in isotropic fields over a wide bandwidth. There is also the problem of signal cancellation that occurs with adaptive algorithms in multipath acoustic fields, which further limits the performance of adaptive noise cancellation in reverberant acoustic fields. The results presented here can be used to predict the maximum attainable noise reduction for adaptive noise cancellation implementations in isotropic fields. If the field is significantly non-isotropic it can be expected that higher cancellation can be achieved.
This is especially true if the noise field is generated by a dominant noise source close to the microphone array, i.e., the direct field of the noise dominates.

Fig. 4.14. Magnitude-squared coherence (MSC) for various orientations of cardioid microphones in a cylindrically isotropic noise field. (Horizontal axis: kr.)

Fig. 4.15. Maximum cancellation (dB) for various orientations of cardioid microphones for cylindrically isotropic fields. (Horizontal axis: kr.)

Fig. 6.11. Comparison of unbiased SNR for different signal enhancement algorithms for different reverberation times (N = 4, L = 20, $L_{ANC}$ = 800, Q = 1, SNR = 0 dB). (Curves: microphone signal, delay-and-sum, Generalized Sidelobe Canceller, GSVD, GSVD + ANC postprocessing. Horizontal axis: reverberation time $T_{60}$ (ms), 0 to 1500.)

References

1. M. Dendrinos, S. Bakamidis, and G. Carayannis, "Speech enhancement from noise: A regenerative approach," Speech Communication, vol. 10, no. 2, pp. 45-57, Feb. 1991.
2. Y. Ephraim and H. L. Van Trees, "A Signal Subspace Approach for Speech Enhancement," IEEE Trans. Speech, Audio Processing, vol. 3, no. 4, pp. 251-266, July 1995.
3. S. H. Jensen, P. C. Hansen, S. D. Hansen, and J. A. Sørensen, "Reduction of Broad-Band Noise in Speech by Truncated QSVD," IEEE Trans. Speech, Audio Processing, vol. 3, no. 6, pp. 439-448, Nov. 1995.
4. U. Mittal and N. Phamdo, "Signal/Noise KLT Based Approach for Enhancing Speech Degraded by Colored Noise," IEEE Trans. Speech, Audio Processing, vol. 8, no. 2, pp. 159-167, Mar. 2000.
5. F. Asano, S. Hayamizu, T. Yamada, and S.
Nakamura, "Speech Enhancement Based on the Subspace Method," IEEE Trans. Speech, Audio Processing, vol. 8, no. 5, pp. 497-507, Sept. 2000.
6. S. Doclo and M. Moonen, "SVD-based optimal filtering with applications to noise reduction in speech signals," in Proc. of the IEEE Workshop Applicat. Signal Processing to Audio and Acoust. (WASPAA '99), New Paltz, NY, USA, Oct. 1999, pp. 143-146.
7. S. Doclo and M. Moonen, "Robustness of SVD-based Optimal Filtering for Noise Reduction in Multi-Microphone Speech Signals," in Proc. of the 1999 IEEE Int. Workshop Acoust. Echo and Noise Control (IWAENC'99), Pocono Manor, PA, USA, Sept. 1999, pp. 80-83.

6 GSVD-Based Optimal Filtering 131

8. S. Doclo and M. Moonen, "Noise Reduction in Multi-Microphone Speech Signals using Recursive and Approximate GSVD-based Optimal Filtering," in Proc. IEEE Benelux Signal Processing Symp. (SPS2000), Hilvarenbeek, The Netherlands, Mar. 2000.
9. S. Doclo, E. De Clippel, and M. Moonen, "Multi-microphone noise reduction using GSVD-based optimal filtering with ANC postprocessing stage," in Proc. of DSP2000 Workshop, Hunt, TX, USA, Oct. 2000.
10. S. Doclo, E. De Clippel, and M. Moonen, "Combined Acoustic Echo and Noise Reduction using GSVD-based Optimal Filtering," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing (ICASSP2000), Istanbul, Turkey, June 2000, vol. 2, pp. 1061-1064.
11. S. Van Gerven and F. Xie, "A Comparative Study of Speech Detection Methods," in Proc. EUROSPEECH, Rhodos, Greece, Sept. 1997, vol. 3, pp. 1095-1098.
12. S. G. Tanyer and H. Ozer, "Voice activity detection in nonstationary noise," IEEE Trans. Speech, Audio Processing, vol. 8, no. 4, pp. 478-482, July 2000.
13. F. T. Luk, "A parallel method for computing the generalized singular value decomposition," Internat. J. Parallel Distr. Comp., vol. 2, pp. 250-260, 1985.
14. G. H. Golub and C. F. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, MD, 3rd edition, 1996.
15. P. Butler and A.
Cantoni, "Eigenvalues and eigenvectors of symmetric centrosymmetric matrices," Linear Algebra and its Applications, vol. 13, pp. 275-288, Mar. 1976.
16. I. Dologlou and G. Carayannis, "Physical Representation of Signal Reconstruction from Reduced Rank Matrices," IEEE Trans. Signal Processing, vol. 39, no. 7, pp. 1682-1684, July 1991.
17. S. Doclo and M. Moonen, "SVD-based optimal filtering with applications to noise reduction in speech signals," Tech. Rep. ESAT-SISTA/TR 1999-33, ESAT, K.U.Leuven, Belgium, Apr. 1999.
18. J. Allen and D. Berkley, "Image method for efficiently simulating small-room acoustics," J. Acoust. Soc. Amer., vol. 65, pp. 943-950, Apr. 1979.
19. C. C. Paige, "Computing the generalized singular value decomposition," SIAM J. Sci. Statist. Comput., vol. 7, pp. 1126-1146, 1986.
20. C. Van Loan, "Computing the CS and the Generalized Singular Value Decomposition," Numer. Math., vol. 46, pp. 479-491, 1985.
21. J. P. Charlier, M. Vanbegin, and P. Van Dooren, "On efficient implementations of Kogbetliantz's algorithm for computing the singular value decomposition," Numer. Math., vol. 52, pp. 279-300, 1988.
22. M. Moonen, P. Van Dooren, and J. Vandewalle, "A Singular Value Decomposition Updating Algorithm for Subspace Tracking," SIAM Journal on Matrix Analysis and Applications, vol. 13, no. 4, pp. 1015-1038, Oct. 1992.
23. M. Moonen, P. Van Dooren, and J. Vandewalle, "A systolic algorithm for QSVD updating," Signal Processing, vol. 25, pp. 203-213, 1991.
24. W. M. Gentleman, "Least squares computation by Givens transformations without square roots," J. Inst. Math. Appl., vol. 12, pp. 329-336, 1973.
25. M. Moonen, P. Van Dooren, and J. Vandewalle, "A systolic array for SVD updating," SIAM Journal on Matrix Analysis and Applications, vol. 14, no. 2, pp. 353-371, 1993.

132 Doclo and Moonen

26. L. J. Griffiths and C. W. Jim, "An alternative approach to linearly constrained adaptive beamforming," IEEE Trans. Antennas Propag., vol. 30, pp. 27-34, Jan. 1982.
27.
K. M. Buckley, "Broad-Band Beamforming and the Generalized Sidelobe Canceller," IEEE Trans. Acoust., Speech, and Signal Processing, vol. 34, no. 5, pp. 1322-1323, Oct. 1986.
28. J. Bitzer, K. U. Simmer, and K.-D. Kammeyer, "Theoretical Noise Reduction Limits of the Generalized Sidelobe Canceller (GSC) for Speech Enhancement," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing (ICASSP), Phoenix, AZ, USA, May 1999, vol. 5, pp. 2965-2968.
29. D. Van Compernolle, "Switching Adaptive Filters for Enhancing Noisy and Reverberant Speech from Microphone Array Recordings," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing (ICASSP), Albuquerque, USA, Apr. 1990, vol. 2, pp. 833-836.
30. S. Haykin, Adaptive Filter Theory, Prentice Hall, 3rd edition, 1996.

7 Explicit Speech Modeling for Microphone Array Applications

Michael Brandstein and Scott Griebel
Harvard University, Cambridge MA, USA

Abstract. In this chapter we address the limitations of current approaches to using microphone arrays for speech acquisition and advocate the development of multi-channel techniques which employ an explicit model of the speech signal. The goal is to combine the advantages of spatial filtering achieved through beamforming with knowledge of the desired time-series attributes. We offer two examples of algorithms which incorporate this principle. The first is a frequency domain approach illustrating the utility of model-based microphone array processing for improved speech spectral analysis. The second is a time domain procedure which highlights the benefits of speech modeling/spatial filtering fusion for the purpose of improving speech quality.

7.1 Introduction

A major thrust of microphone array research has focused on improving the spatial filtering capability of the system in its operating environment. The previous chapters have directly addressed a number of the methods which have proven beneficial in this regard.
The goal of this chapter is to present an alternative and complementary strategy which emphasizes the incorporation of explicit speech modeling into the microphone array processing. As will be shown, by combining specific knowledge of the desired time-series attributes with the advantages of multi-channel spatial filtering, this approach can significantly improve the quality of the speech signal and its derived quantities obtained in challenging acoustic environments.

Much of this work is motivated by current research in the field of single-channel speech systems where, due to necessity perhaps, improvements have been obtained by focusing on the underlying content of the signal of interest in addition to its environmental degradations. There is a rich history of work addressing the use of single-channel methods for speech enhancement. Summaries of these techniques may be found in texts on the subject, such as [1-3]. While capable of improving perceived quality in restrictive environments (additive noise, no multipath, high to moderate signal-to-noise ratio (SNR), single source), these approaches do not perform well in the face of reverberant distortions, competing sources, and severe noise conditions. In recent years, sophisticated speech models have been applied to the enhancement problem. In addition to utilizing the periodic features of the speech, as in the case of comb filtering, these systems exploit the signal's mixture of harmonic and stochastic components [4-6]. Such model-based techniques offer improved performance, both in speech quality and intelligibility. Additionally, these methods, by virtue of their specific parameterization of the speech signal, offer some applicability to the more general acquisition problem. Currently, however, these model-based estimation schemes have been limited to single-channel applications.
By employing spatial filtering in addition to temporal processing, microphone arrays offer a distinct performance advantage over single-channel techniques in the presence of additive noise, interfering sources, and distortions due to multipath channel effects. Fixed and adaptive beamforming techniques generally assume the desired source is slowly varying and at a known location. While dynamic localization schemes, like those of [7], and robust weighting constraints may be incorporated into the adaptation procedure, these methods are very sensitive to steering errors which limit their noise source attenuation performance and frequently distort or cancel the desired signal. Furthermore, these algorithms are oriented solely toward noise reduction and have limited effectiveness at enhancing a desired signal corrupted by reverberations. Similarly, the class of post-filtering algorithms detailed in Chapter 3 is limited by its assumption that the additive noise is uncorrelated. Once again the emphasis here is additive noise reduction. The performance (and appropriateness) of these post-filters quickly degrades in the presence of multi-path distortions and coherent noise.

Another general approach is based upon attempting to undo the effects of multipath propagation. The channel responses themselves are in general not minimum phase and are thus non-invertible. By beamforming to the direct path and the major images, it is possible to use the multipath reflections constructively to increase SNRs well beyond those achieved with a single beamformer. The result is a matched filtering process [8] which is effective in enhancing the quality of reverberant speech and attenuating noise sources. Unfortunately, this technique has a number of practical shortcomings. The matched filter is derived from the source location-dependent room response and as such is difficult to estimate dynamically.
The channel responses obtained in this manner do not address the issue of non-stationary or unknown source locations or changing acoustic environments. These problems are addressed by Affes and Grenier [9] in attempting to adaptively estimate the channel responses and incorporate the results into an adaptive beamforming process.

In general, beamforming research has dealt with algorithms to attenuate undesired sources and noise, track moving sources, and deconvolve channel effects. These approaches, while effective to some degree, are fundamentally limited by the nature of the distant talker environment. Array design methods are overly sensitive to variations in their assumptions regarding source locations and radiation patterns, and are inflexible to the complex and time-varying nature of the enclosure's acoustic field. Motion as little as a few centimeters or a talker turning his or her head is frequently sufficient to compromise the optimal behavior of these schemes in practical scenarios [10,11]. Similarly, matched filter processing, while shown to be capable of tracking source motion to a limited degree, requires significant temporal averaging and is not adaptable at rates sufficient to effectively capture the motions of a realistic talker.

Fig. 7.1. Spectra of a voiced speech segment and its individual channels simulated at source locations spaced 10 cm apart. (Panels: Source 1 and Source 2; curves: original and individual channels; horizontal axis: frequency (Hz), 0 to 2000.)

To illustrate the acoustic environment's sensitivity to source location variations, a simple example is presented. Two source locations spaced 10 cm apart are simulated in the center of a noiseless 4m x 4m x 3m rectangular
The enclosure is assumed to have plane reflective surfaces and uniform, frequency-independent reflection coefficients equivalent to a 400ms reverberation time. Room impulse responses are generated for 8 microphones with 25cm spacing positioned along one wall of the enclosure using the Allen and Berkley image model technique [12] with intra-sample interpolation and up to sixth order reflections. Both the microphones and sources are assumed to have cardioid patterns and the sources are oriented toward the center of the array. Figure 7.1 plots the spectra of a voiced speech segment generated at each source location. The bold lines correspond to the spectrum of the original speech while the dotted lines plot the spectra of the data received at each of 8 microphones. The reverberation effects are multiplicative in the frequency domain and vary considerably from channel to channel. The results of this simulation show that there are significant variations in the spectra of the individual channels (up to 10dB for some frequencies) when the source is moved just a few inches. An implication of this example is that any system which attempts to estimate the reverberation effects and apply some means of inverse filtering would have to be adaptable on almost a frame-by-frame basis to be effective. However, the temporal averaging required by these processes prohibits adaptation at such a high rate. 7.2 Model-Based Strategies Single-channel techniques exploit various features of the speech signal; multichannel methods focus primarily on improving the quality of the spatial filtering process. In [11] we proposed an alternative to the traditional microphone array methods by explicitly incorporating the Dual Excitation Speech Model [5] into the beamforming process. 
Our work in [13] extended this idea using a multi-channel version of the Multi-Pulse Linear Predictive Coding (MPLPC) model [14,15] and a nonlinear event-based processing method to discriminate impulses in the received signals due to channel effects from those present in the desired speech. These concepts were then expanded upon in [16,17] by employing the wavelet domain for a multi-resolution signal analysis and reconstruction of the LPC residual. These works illustrated the ability of this approach to suppress the deleterious effects of both reverberations and additive noise without explicitly identifying the channel while being adaptive on a frame-by-frame basis. This allows the model-based processing paradigm to be applied for effective speech analysis and enhancement under general conditions. Two examples of this prior work are summarized here. The first is a frequency domain approach illustrating the utility of model-based microphone array processing for improved speech spectral analysis. The second is a time domain procedure which highlights the benefits of the speech modeling/spatial filtering fusion advocated for the purpose of improving speech quality.

7.2.1 Example 1: A Frequency-Domain Model-Based Algorithm

In the single-channel Dual Excitation (DE) Speech Model [5], a windowed segment of speech, s[n], is represented as the sum of two components: a voiced signal, v[n], and an unvoiced signal, u[n]. In the frequency domain, the relationship may be expressed as:

$$S(\omega) = V(\omega) + U(\omega) \qquad (7.1)$$

where $S(\omega)$, $V(\omega)$, and $U(\omega)$ correspond to the Fourier transforms of s[n], v[n], and u[n], respectively.
The voiced portion is assumed to be periodic over the time window and may be represented as the sum of the harmonics of a fundamental frequency, $\omega_0$:

$$V(\omega) = \sum_{m=-M}^{M} A_m W(\omega - m\omega_0) \qquad (7.2)$$

where $W(\omega)$ is the Fourier transform of the analysis window, $A_m$ is the complex spectral amplitude of the mth harmonic, and M is the total number of harmonics ($M = \lfloor \pi/\omega_0 \rfloor$). Following [18], the fundamental frequency and harmonic amplitudes are estimated through minimization of the mean-squared error criterion:

$$E = \frac{1}{2\pi}\int_{-\pi}^{\pi} |S(\omega) - V(\omega)|^2\,d\omega \qquad (7.3)$$

$$\;\; = \frac{1}{2\pi}\int_{-\pi}^{\pi} \Big|S(\omega) - \sum_{m=-M}^{M} A_m W(\omega - m\omega_0)\Big|^2\,d\omega. \qquad (7.4)$$

This non-linear optimization problem may be decoupled efficiently by noting that for a given fundamental frequency, the harmonic amplitudes which minimize the error are found through the solution of a set of linear equations. The optimal parameter set may then be calculated through global minimization of the error function in (7.4) versus all fundamental frequencies of interest. To effectively model the spectrum of the higher harmonics, a fundamental frequency resolution of less than 1 Hz is typically required. In practice, this exhaustive procedure may be computationally prohibitive. A more efficient approach is to evaluate a coarse, integer pitch estimate via a traditional time-domain pitch estimation procedure and then use the above frequency-domain analysis-by-synthesis procedure to refine the fundamental frequency estimate. The estimated unvoiced signal plus noise spectrum, $\hat{U}(\omega)$, is then found from the difference spectrum:

$$\hat{U}(\omega) = S(\omega) - \hat{V}(\omega) \qquad (7.5)$$

where $\hat{V}(\omega)$ is the estimated voiced spectrum derived from (7.2) using the estimated values of $\omega_0$ and $A_m$.

The utility of the DE model for improving speech degraded by background noise lies in its independent enhancement of the voiced and unvoiced components of the speech.
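The decoupled search described for (7.4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of [18]: by Parseval's relation the criterion is evaluated in the time domain as a window-weighted least-squares fit of complex exponentials, only three harmonics are fitted rather than $M = \lfloor\pi/\omega_0\rfloor$, the test frame is synthetic, and the search range stands in for the neighbourhood of a coarse pitch estimate. The names `solve_linear`, `harmonic_fit`, and `refine_pitch` are hypothetical.

```python
import cmath
import math

def solve_linear(a, b):
    """Solve a small complex system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0j] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x

def harmonic_fit(x, w, omega0, num_harm):
    """For one candidate omega0 (rad/sample): least-squares amplitudes A_m and the error."""
    count = len(x)
    basis = [[cmath.exp(1j * k * omega0 * n) for n in range(count)]
             for k in range(-num_harm, num_harm + 1)]
    w2 = [wi * wi for wi in w]
    # Normal equations of the window-weighted least-squares problem (the linear step).
    gram = [[sum(w2[n] * ba[n].conjugate() * bb[n] for n in range(count)) for bb in basis]
            for ba in basis]
    rhs = [sum(w2[n] * ba[n].conjugate() * x[n] for n in range(count)) for ba in basis]
    amps = solve_linear(gram, rhs)
    err = 0.0
    for n in range(count):
        v = sum(a * b[n] for a, b in zip(amps, basis))
        err += w2[n] * abs(x[n] - v) ** 2
    return amps, err

def refine_pitch(x, w, fs, f_lo, f_hi, step, num_harm):
    """Grid search over candidate fundamentals (Hz); returns the one with the smallest error."""
    best_f0, best_err = None, None
    for i in range(int(round((f_hi - f_lo) / step)) + 1):
        f0 = f_lo + i * step
        _, err = harmonic_fit(x, w, 2 * math.pi * f0 / fs, num_harm)
        if best_err is None or err < best_err:
            best_f0, best_err = f0, err
    return best_f0

# Synthetic voiced frame: three harmonics of 123.4 Hz sampled at 8 kHz (assumed test data).
fs, count, true_f0 = 8000.0, 256, 123.4
frame = [sum(amp * math.cos(2 * math.pi * h * true_f0 * n / fs + 0.3 * h)
             for h, amp in ((1, 1.0), (2, 0.5), (3, 0.25))) for n in range(count)]
window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (count - 1)) for n in range(count)]
estimate = refine_pitch(frame, window, fs, 120.0, 127.0, 0.1, num_harm=3)
print(estimate)
```

The inner step is linear in the amplitudes, so only the one-dimensional search over the fundamental is non-linear; restricting that search to a narrow band around a coarse estimate is what keeps the sub-1 Hz resolution affordable.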
Assuming that the degrading noise is independent of the harmonic structure, the voiced spectrum is subjected to only a minor thresholding operation relative to the background noise power. The bulk of the enhancement is achieved by nulling out the unvoiced portions of strongly voiced harmonics and applying a modified Wiener filter to the remaining unvoiced spectral regions.

The Dual Excitation model is extended here for the multi-channel problem to improve its effectiveness for the additive noise case and to address the more general distant-talker scenario involving multipath channels and multiple sources. Consider first the extension of the DE error criterion in (7.4) to include data from N channels:

(7.6)

where $G_i(\omega)$ is the filter weighting associated with the ith channel and $S_i(\omega)$ is the short-term spectrum of the data received at the ith microphone. Alternatively, for environments where the dominant degradation effect is reverberant, it may be advantageous to recast the above error criterion as the $L_2$ norm in the log spectrum domain. The voiced signal estimate, $\hat{V}_N(\omega)$, derived from the parameters minimizing (7.6) would then be used to produce the unvoiced signal plus noise spectrum from:

$$\hat{U}_N(\omega) = \frac{1}{N}\sum_{i=1}^{N} H_i(\omega)\,[G_i(\omega)S_i(\omega) - \hat{V}_N(\omega)] \qquad (7.7)$$

The channel weightings, $G_i(\omega)$, could be designed to provide appropriate spatial filtering, addressing issues of noise reduction, attenuation of interfering sources, and dereverberation. Additionally, the channel-dependent weighting filters, $H_i(\omega)$, could be incorporated as a multi-channel post-processor to exploit known signal characteristics. In the simplest case of independent additive noise, the extension of the Dual Excitation model to a plurality of channels would stand to improve its enhancement performance by virtue of the data averaging alone. With the inclusion of the spatial filtering afforded through (7.6) and (7.7) it is possible to give the DE model a robustness to channel effects and interfering sources.
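The benefit of "data averaging alone" can be illustrated with a small numerical experiment. The sketch below is an assumption-laden toy, not part of the chapter's algorithm: it uses unit-variance Gaussian noise that is independent across channels, unit channel weightings, and a hypothetical helper name `noise_power_after_averaging`. Under those assumptions the noise power in the channel average falls to roughly 1/N of the single-channel power, about 9 dB for N = 8.

```python
import math
import random

def noise_power_after_averaging(num_channels, num_samples, seed=0):
    """Return (single-channel noise power, noise power of the N-channel average)."""
    rng = random.Random(seed)
    single = 0.0
    averaged = 0.0
    for _ in range(num_samples):
        # Independent unit-variance noise on every channel (assumed noise model).
        noise = [rng.gauss(0.0, 1.0) for _ in range(num_channels)]
        single += noise[0] ** 2
        averaged += (sum(noise) / num_channels) ** 2
    return single / num_samples, averaged / num_samples

p_single, p_avg = noise_power_after_averaging(num_channels=8, num_samples=20000)
ratio = p_avg / p_single
gain_db = -10.0 * math.log10(ratio)
print(ratio, gain_db)
```

Correlated noise or reverberation breaks the independence assumption, which is where the spatial weightings $G_i(\omega)$ of (7.6) and (7.7) are meant to help.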
With regard to multiple sources, the error criterion in (7.6) could be extended explicitly to include L sources and N channels by:

where $G_{ij}(\omega)$ is the spatial filter associated with the ith channel and jth signal source, $\omega_{0j}$ is the fundamental frequency of the jth source, and $A_{mj}$ is the amplitude of the mth harmonic associated with the jth source. Using this approach it would be possible to track individual sources through a combination of location and pitch data. Such a multi-channel DE model would have the ability to isolate and enhance a desired source signal by employing both spatial and signal-content information.

Fig. 7.2. Spectra of the Original, Delay and Sum Beamformer result, and Voiced Multi-Channel Dual Excitation result for the voiced segment in Figure 7.1. (Horizontal axis: frequency (Hz), 0 to 2000.)

To illustrate the potential of such an approach, again consider the example of the voiced speech segment in Fig. 7.1. Figure 7.2 shows the relationship between the Delay and Sum Beamformer and the voiced signal estimate, $\hat{V}_N(\omega)$, derived from the proposed multi-channel scheme for the two closely-spaced source locations. The pair of results was generated using delays appropriate for the source 1 location. This would correspond to a 10 cm mis-aim in the source 2 case. The Delay and Sum method, like any beamforming technique, lacks any signal-dependent constraints on the output produced. As the plots suggest, by exploiting the periodic structure of the desired signal, the Multi-Channel Dual Excitation Model is significantly more robust to the local spectral variations produced by channel reverberations. Unlike the Delay and Sum method, the approach is relatively insensitive to imperfect knowledge of the source location, suggesting a robustness to the small, but nominal, variations encountered in a practical operating environment.
This result is confirmed by more quantitative methods, such as SNR and log spectral distortion scores.

7.2.2 Example 2: A Time-Domain Model-Based Algorithm

The reverberant speech signal, $x_i[n]$, observed at the $i$th microphone ($i = 1, 2, \ldots, J$) can be modeled in the time domain as:

$$x_i[n] = h_i[n] * s[n] + u_i[n] \tag{7.8}$$

where $s[n]$ is the clean speech utterance, $u_i[n]$ is noise, $h_i[n]$ is the room impulse response between the speech source and the $i$th microphone, and $*$ denotes convolution. Under all but ideal circumstances, the room impulse response is a complicated function of the environment's acoustical and geometrical properties. The noise term, $u_i[n]$, is assumed to be uncorrelated, both temporally and spatially.

A very general model for speech production approximates the vocal tract as a time-varying all-pole filter [2]. In the case of voiced speech, the filter excitation is modeled as a quasi-periodic impulse train where the average spacing between consecutive impulses is the pitch period. For unvoiced speech, the excitation signal is approximated by random noise. The proposed algorithm relies on the assumption that the detrimental effects of additive noise and reverberations introduce only zeros into the overall system and will primarily affect only the nature of the speech excitation sequence, not the all-pole filter. It is also assumed that the noise and errant impulses contributed to the excitation sequences are relatively uncorrelated across the individual channels, while the excitation impulses due to the original speech are correlated after performing time-delay compensation. Essentially, the approach is to identify the clean speech excitation signal from a set of corrupted excitation signals. The enhanced speech is then reconstructed by using the enhanced excitation signal as input to an estimate of the all-pole filter representing the vocal system.
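The all-pole analysis and resynthesis loop described above can be sketched as follows. This is an illustration only: it uses the textbook autocorrelation method for the linear-prediction coefficients, which is an assumption rather than the authors' stated estimator, it omits the excitation enhancement step entirely, and the function names `lpc`, `inverse_filter` and `all_pole_filter` are hypothetical.

```python
import numpy as np

def lpc(x, order):
    """Autocorrelation-method linear prediction.

    Returns a with a[0] = 1 such that e[n] = sum_k a[k] * x[n - k]
    is the prediction residual (the excitation estimate).
    """
    x = np.asarray(x, dtype=float)
    # Autocorrelation values r[0], ..., r[order].
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    # Toeplitz normal equations: sum_k a[k] * r[|i - k|] = -r[i], i = 1..order.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    coeffs = np.linalg.solve(R, -r[1:order + 1])
    return np.concatenate(([1.0], coeffs))

def inverse_filter(x, a):
    """Apply the all-zero inverse filter A(z) to obtain the residual."""
    return np.convolve(x, a)[:len(x)]

def all_pole_filter(e, a):
    """Apply the all-pole filter 1/A(z): x[n] = e[n] - sum_{k>=1} a[k] * x[n - k]."""
    x = np.zeros(len(e))
    for n in range(len(e)):
        acc = e[n]
        for k in range(1, min(n, len(a) - 1) + 1):
            acc -= a[k] * x[n - k]
        x[n] = acc
    return x
```

In this sketch an enhanced excitation would take the place of the residual before the call to `all_pole_filter`; passing the unmodified residual back through the filter simply reproduces the input.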
The proposed algorithm offers an effective method for estimating and then reconstructing the excitation signal by employing a class of wavelets to decompose the LPC residual signals. In [19], quadratic spline wavelets [...]

[Figure: block diagram of the algorithm; diagram not reproduced. Recoverable block labels: Inverse LPC Filter, Wavelet Transform, Short-Term Coherence, Long-Term Coherence, Extrema Clustering, LPC Filter.]

8 Robust Localization in Reverberant Rooms (DiBiase et al.)

Fig. 8.1. A close-up of a 10-millisecond segment of a room impulse response measured in a typical conference room. The direct-path component and some strong reflected components are highlighted. [Plot not reproduced; horizontal axis: time (ms).]

8.3.2 The GCC and PHAT Weighting Function

For a pair of microphones, $n = 1, 2$, their associated TDOA, $\tau_{12}$, is defined as

$$\tau_{12} = \tau_2 - \tau_1. \tag{8.3}$$

Applying this definition to their associated received microphone signal models yields

$$x_1(t) = \frac{1}{r_1} s(t - \tau_1) * g_1(q_s, t) + v_1(t)$$
$$x_2(t) = \frac{1}{r_2} s(t - \tau_1 - \tau_{12}) * g_2(q_s, t) + v_2(t). \tag{8.4}$$

If the modified impulse responses for the microphone pair are similar, then (8.4) shows that a scaled version of $s(t - \tau_1)$ is present in the signal from microphone 1 and a time-shifted (and scaled) version of $s(t - \tau_1)$ is present in the signal from microphone 2. The cross-correlation of the two signals should show a peak at the time lag where the shifted versions of $s(t)$ align, corresponding to the TDOA, $\tau_{12}$. The cross-correlation of the signals $x_1(t)$ and $x_2(t)$ is defined in (8.5). The GCC function, $R_{12}(\tau)$, is defined as the cross-correlation of two filtered versions of $x_1(t)$ and $x_2(t)$ [29].
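The alignment argument above can be tried out in discrete time. The sketch below estimates the delay of one signal relative to another from the peak of their cross-correlation, computed through the FFT. Several things here are assumptions: the function name `gcc_delay` and its parameters are hypothetical, the lag convention (a positive result means the second signal lags the first) is chosen for the sketch, and the optional phase-only normalisation stands in for the PHAT weighting of [29] that the section goes on to discuss.

```python
import numpy as np

def gcc_delay(x1, x2, fs, max_tau, phat=True):
    """Estimate the delay (seconds) of x2 relative to x1 from the peak of
    c12[tau] = sum_t x1[t] * x2[t + tau], optionally with phase-only weighting."""
    n = len(x1) + len(x2)              # zero-pad so circular correlation equals linear
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    cross = X2 * np.conj(X1)           # spectrum of c12 under NumPy's FFT sign convention
    if phat:
        # Divide by the magnitude |X1 X2*| so every frequency counts equally.
        cross /= np.maximum(np.abs(cross), 1e-12)
    cc = np.fft.irfft(cross, n)
    max_shift = int(round(max_tau * fs))
    # Reorder so that index 0 holds lag -max_shift and the last index lag +max_shift.
    cc = np.concatenate((cc[n - max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(cc)) - max_shift
    return lag / fs
```

Restricting the search to `max_tau` mirrors the physical bound on the TDOA set by the microphone spacing; the resolution of this sketch is one sample, and a practical system would typically interpolate around the peak.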
With the Fourier transforms of these filters denoted by $G_1(\omega)$ and $G_2(\omega)$, respectively, the GCC function can be expressed in terms of the Fourier transforms of the microphone signals, as in (8.6). Rearranging the order of the signals and filters and defining the frequency-dependent weighting function, $\Psi_{12}(\omega) \equiv G_1(\omega) G_2^*(\omega)$, gives the form of the GCC function in (8.7). Ideally, $R_{12}(\tau)$ will exhibit an explicit global maximum at the lag value which corresponds to the relative delay. The TDOA estimate is calculated from

$$\hat{\tau}_{12} = \arg\max_{\tau \in D} R_{12}(\tau). \tag{8.8}$$

The range of potential TDOA values is restricted to a finite interval, $D$, which is determined by the physical separation between the microphones. In general, $R_{12}(\tau)$ will have multiple local maxima which may obscure the true TDOA peak and, subsequently, produce an incorrect estimate. The amplitudes and corresponding time lags of these erroneous maxima depend on a number of factors, typically ambient noise levels and reverberation conditions. The goal of the weighting function, $\Psi_{12}$, is to emphasize the GCC value at the true TDOA value over the undesired local extrema. A number of such functions have been investigated. As previously stated, for realistic acoustical conditions the PHAT weighting [29] defined by

$$\Psi_{12}(\omega) \equiv \frac{1}{|X_1(\omega) X_2^*(\omega)|} \tag{8.9}$$

has been found to perform considerably better than its counterparts designed to be statistically optimal under specific non-reverberant, noise conditions. The PHAT weighting whitens the microphone signals to equally emphasize all frequencies. The utility of this strategy and its extension to steered beamforming form the basis of the SRP-PHAT algorithm that follows.

8.3.3 ML TDOA-Based Source Localization

Consider the $i$th pair of microphones with spatial coordinates denoted by the 3-element vectors, $p_{i1}$ and $p_{i2}$, respectively. For a signal source with known
spatial location, $q_s$, the true TDOA relative to the $i$th sensor pair will be denoted by $T(\{p_{i1}, p_{i2}\}, q_s)$, and is calculated from the expression

$$T(\{p_{i1}, p_{i2}\}, q_s) = \frac{|q_s - p_{i2}| - |q_s - p_{i1}|}{c} \tag{8.10}$$

where $c$ is the speed of sound in air. The estimate of this true TDOA, the result of a TDE procedure involving the signals received at the two microphones, will be given by $\hat{\tau}_i$. In practice, the TDOA estimate is a corrupted version of the true TDOA and, in general, $\hat{\tau}_i \neq T(\{p_{i1}, p_{i2}\}, q_s)$.

For a single microphone pair and its TDOA estimate, the locus of potential source locations in 3-space which satisfy (8.10) corresponds to one-half of a hyperboloid of two sheets. This hyperboloid is centered about the midpoint of the microphones and has $p_{i2} - p_{i1}$ as its axis of symmetry. For sources with a large source-range to microphone-separation ratio, the hyperboloid may be well approximated by a cone with a constant direction angle relative to the axis of symmetry. The corresponding estimated direction angle, $\hat{\theta}_i$, for the microphone pair is given by (8.11). In this manner each microphone pair and TDOA estimate combination may be associated with a single parameter which specifies the angle of the cone relative to the sensor pair axis. For a given source and TDOA estimate, $\hat{\theta}_i$ is referred to as the DOA relative to the $i$th pair of microphones.

Given a set of $M$ TDOA estimates derived from the signals received at multiple pairs of microphones, the problem remains as to how best to estimate the true source location, $q_s$. Ideally, the estimate will be an element of the intersection of all the potential source loci. In practice, however, for more than two pairs of sensors this intersection is, in general, the empty set.
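Because the loci rarely intersect exactly, one pragmatic option is to pick the candidate point whose predicted TDOAs, computed from (8.10), come closest to the estimates in the sum-of-squares sense, which is the criterion the text goes on to call the LS-TDOA error. The sketch below does this by exhaustive search over a supplied candidate set. The names `tdoa` and `ls_tdoa_search`, the value used for the speed of sound, and the brute-force search itself are assumptions made for illustration; the chapter only states that some search method is required.

```python
import numpy as np

C = 343.0  # assumed speed of sound in air, m/s

def tdoa(p1, p2, q, c=C):
    """True TDOA of (8.10) for microphones at p1, p2 and a source at q."""
    q = np.asarray(q, dtype=float)
    return (np.linalg.norm(q - p2) - np.linalg.norm(q - p1)) / c

def ls_tdoa_search(pairs, tau_hat, candidates, c=C):
    """Return the candidate location minimising the squared mismatch between
    the estimated TDOAs tau_hat and the TDOAs predicted by (8.10), plus that error.

    pairs:      list of (p1, p2) microphone position pairs.
    tau_hat:    array of TDOA estimates, one per pair.
    candidates: iterable of hypothesised source locations.
    """
    best_q, best_err = None, np.inf
    for q in candidates:
        pred = np.array([tdoa(p1, p2, q, c) for p1, p2 in pairs])
        err = float(np.sum((tau_hat - pred) ** 2))
        if err < best_err:
            best_q, best_err = np.asarray(q, dtype=float), err
    return best_q, best_err
```

A grid search of this kind scales poorly with resolution and room size, so it is best read as a reference against which faster search strategies or closed-form approximations can be checked.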
This disparity is due in part to imprecision in the knowledge of system parameters (TDOA estimate and sensor location measurement errors) and in part to unrealistic modeling assumptions (point source radiator, ideal medium, ideal sensor characteristics, etc.). With no ideal solution available, the source location must be estimated as the point in space which best fits the sensor-TDOA data or, more specifically, minimizes an error criterion that is a function of the given data and a hypothesized source location. If the time-delay estimates at each microphone pair are assumed to be independently corrupted by zero-mean additive white Gaussian noise of equal variance, then the ML location estimate can be shown to be the position which minimizes the least squares error criterion

$$E(q) = \sum_{i=1}^{M} \left( \hat{\tau}_i - T(\{p_{i1}, p_{i2}\}, q) \right)^2. \tag{8.12}$$

The location estimate is then found from

$$\hat{q}_s = \arg\min_{q} E(q). \tag{8.13}$$

The criterion in (8.12) will be referred to as the LS-TDOA error. As stated earlier, the evaluation of $\hat{q}_s$ in this manner involves the optimization of a non-linear function and necessitates the use of search methods. Closed-form approximations to this method were given earlier.

8.3.4 SRP-Based Source Localization

The microphone signal model in (8.2) shows that for an array of $N$ microphones in the reception region of a source, a delayed, filtered, and noise-corrupted version of the source signal, $s(t)$, is present in each of the received microphone signals. The delay-and-sum beamformer time-aligns and sums together the $x_n(t)$ in an effort to preserve unmodified the signal from a given spatial location while attenuating to some degree the noise and convolutional components.
It is defined simply as

$$y(t, q_s) = \sum_{n=1}^{N} x_n(t + \Delta_n) \tag{8.14}$$

where $\Delta_n$ are the steering delays appropriate for focusing the array to the source spatial location, $q_s$, and compensating for the direct path propagation delay associated with the desired signal at each microphone. In practice, the delays relative to a reference microphone are used instead of the absolute delays. This makes all shifting operations causal, which is a requirement of any practical system, and implies that $y(t, q_s)$ will contain an overall delayed version of the desired signal, which in practice is not detrimental. The use of a single reference microphone means that the steering delays may be determined directly from the TDOAs (estimated or theoretical) between each microphone and the reference. This implies that knowledge of the TDOAs alone is sufficient for steering the beamformer without an explicit source location.

In the most ideal case, with no additive noise and channel effects, the output of the delay-and-sum beamformer represents a scaled and potentially delayed version of the desired signal. For the limited case of additive, uncorrelated, and uniform variance noise and equal source-microphone distances this simple beamformer is optimal. These are certainly very restrictive conditions. In practice, convolutional channel effects are nontrivial and the additive noise is more complicated. The degree to which these noise and reverberation components of the microphone signals are suppressed by the delay-and-sum beamformer is frequently minimal and difficult to analyze. Other methods have been developed to extend the delay-and-sum concept to the more general filter-and-sum approach, which applies adaptive filtering to the microphone signals before they are time-aligned and summed. Again, these methods tend not to be robust to non-theoretical conditions, particularly with regard to the channel effects.
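A discrete-time version of the delay-and-sum operation in (8.14) can be sketched as below. It is an offline illustration under assumptions of its own: the function name `delay_and_sum` is hypothetical, steering delays are taken relative to a reference microphone and assumed non-negative, and they are rounded to whole samples, whereas a practical system would normally use fractional-delay interpolation.

```python
import numpy as np

def delay_and_sum(x, delays, fs):
    """Evaluate (8.14) on sampled data: y[t] = sum_n x_n[t + D_n].

    x:      array of shape (N, T), one row per microphone.
    delays: N steering delays in seconds, assumed non-negative.
    fs:     sampling rate in Hz.
    """
    x = np.asarray(x, dtype=float)
    shifts = np.rint(np.asarray(delays, dtype=float) * fs).astype(int)
    T = x.shape[1] - shifts.max()      # samples left once every channel is advanced
    y = np.zeros(T)
    for row, d in zip(x, shifts):
        y += row[d:d + T]              # advance this channel by d samples, then add
    return y
```

When the steering delays match the true propagation delays, the desired signal adds coherently across the $N$ channels while uncorrelated noise does not, which is the source of the modest gain described above.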
The output of an $N$-element, filter-and-sum beamformer can be defined in the frequency domain as

$$Y(\omega, q) = \sum_{n=1}^{N} G_n(\omega) X_n(\omega) e^{j\omega\Delta_n} \tag{8.15}$$

where $X_n(\omega)$ and $G_n(\omega)$ are the Fourier transforms of the $n$th microphone signal and its associated filter, respectively. The microphone signals are phase-aligned by the steering delays appropriate for the source location, $q$. This is equivalent to the time-domain beamformer version. The addition of microphone and frequency-dependent filtering allows for some means to compensate for the environmental and channel effects. Choosing the appropriate filters depends on a number of factors, including the nature of the source signal and the type of noise and reverberations present. As will be seen, the strategy used by the PHAT of weighting each frequency component equally will prove advantageous for practical situations where the ideal filters are unobtainable.

The beamformer may be used as a means for source localization by steering the array to specific spatial points of interest in some fashion and evaluating the output signal, typically its power. When the focus corresponds to the location of the sound source, the SRP should reach a global maximum. In practice, peaks are produced at a number of incorrect locations as well. These may be due to strong reflective sources or merely a byproduct of the array geometry and signal conditions. In some cases, these extraneous maxima in the SRP space may obscure the true location and, in any case, complicate the search for the global peak. The SRP for a potential source location can be expressed as the output power of a filter-and-sum beamformer by

$$P(q) = \int_{-\infty}^{+\infty} |Y(\omega, q)|^2 \, d\omega \tag{8.16}$$

and the location estimate is found from

$$\hat{q}_s = \arg\max_{q} P(q). \tag{8.17}$$

8.3.5 The SRP-PHAT Algorithm

Given this background, the SRP-PHAT algorithm may now be defined. With respect to GCC-based TDE, the PHAT weighting has been found to provide an enhanced robustness in low to moderate reverberation conditions.
While improving the quality of the underlying delay estimates, it is still not sufficient to render TDOA-based localization effective under more adverse conditions. The delay-and-sum SRP approach requires shorter analysis intervals and exhibits an elevated insensitivity to environmental conditions, though again, not to a degree that allows for its use under excessive multipath. The filter-and-sum version of the SRP adds flexibility, but the design of the filters is typically geared towards optimizing SNR in noise-only conditions and is excessively dependent on knowledge of the signal and channel content.

The SRP-PHAT algorithm, originally introduced in [5], aims to combine the advantages of the steered beamformer for source localization with the signal and condition independent robustness offered by the PHAT weighting. The SRP of the filter-and-sum beamformer can be expressed as in (8.18), where $\Psi_{lk}(\omega) = G_l(\omega) G_k^*(\omega)$ is analogous to the two-channel GCC weighting term in (8.7). The corresponding multi-channel version of the PHAT weighting is given by

$$\Psi_{lk}(\omega) = \frac{1}{|X_l(\omega) X_k^*(\omega)|} \tag{8.19}$$

which in the context of the filter-and-sum beamformer (8.15) is equivalent to the use of the individual channel filters

$$G_n(\omega) = \frac{1}{|X_n(\omega)|}. \tag{8.20}$$

These are the desired SRP-PHAT filters. They may be implemented from the frequency-domain expression above. Alternatively, it may be shown that (8.18) is equivalent to the sum of the GCCs of all possible $N$-choose-2 microphone pairings. This means that the SRP of a 2-element array is equivalent to the GCC of those two microphones. Hence, as the number of microphones is increased, SRP naturally extends the GCC method from a pairwise to a multi-microphone technique. Denoting $R_{lk}(\tau)$ as the PHAT-weighted GCC of the $l$th and $k$th microphone signals, a time-domain version of the SRP-PHAT functional can now be expressed as

$$P(q) = 2\pi \sum_{l=1}^{N} \sum_{k=1}^{N} R_{lk}(\Delta_k - \Delta_l). \tag{8.21}$$

This is the sum of all possible pairwise GCC permutations, time-shifted by the differences in the steering delays. Included in this summation is the sum of the $N$ autocorrelations, which is the GCC evaluated at a lag of zero. These terms contribute only a DC offset to the steered response power since they are independent of the steering delays.

Given either method of computation, SRP-PHAT localization is performed in a manner similar to the standard SRP-based approaches. Namely, [...]

[Figure: Room Layout, 3D View; plot not reproduced. Recoverable labels: Microphone Array, Whiteboard.]
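The frequency-domain route through (8.15), (8.16) and (8.20) can be sketched as follows: each channel spectrum is divided by its own magnitude, the whitened spectra are phase-steered to a candidate location, and the output power is summed over frequency. This is a sketch under assumptions, not the authors' implementation: the function name `srp_phat`, the value of the speed of sound, the use of a single analysis frame, the choice of the nearest microphone as delay reference, and the summation over non-negative frequency bins only (which scales the power without moving its maximum) are all choices made here for brevity.

```python
import numpy as np

C = 343.0  # assumed speed of sound in air, m/s

def srp_phat(x, mics, candidates, fs, c=C):
    """Steered response power with PHAT filters G_n = 1/|X_n|, per candidate.

    x:          array of shape (N, T), one frame per microphone.
    mics:       array of shape (N, 3), microphone positions in metres.
    candidates: sequence of hypothesised source positions.
    """
    x = np.asarray(x, dtype=float)
    mics = np.asarray(mics, dtype=float)
    n_fft = 2 * x.shape[1]
    X = np.fft.rfft(x, n_fft, axis=1)
    Xw = X / np.maximum(np.abs(X), 1e-12)            # whitened spectra G_n * X_n
    omega = 2 * np.pi * np.fft.rfftfreq(n_fft, d=1.0 / fs)
    power = np.empty(len(candidates))
    for i, q in enumerate(candidates):
        dist = np.linalg.norm(mics - np.asarray(q, dtype=float), axis=1)
        delta = (dist - dist.min()) / c              # steering delays relative to nearest mic
        # Phase-align each channel as in (8.15) and sum across microphones.
        Y = np.sum(Xw * np.exp(1j * omega[None, :] * delta[:, None]), axis=0)
        power[i] = np.sum(np.abs(Y) ** 2)            # discrete counterpart of (8.16)
    return power
```

The location estimate is then the candidate with the largest returned power, for example `candidates[int(np.argmax(power))]`. Evaluating every candidate directly is costly for fine grids, so the search strategy matters as much as the functional itself.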
