Single-Sensor Imaging: Methods and Applications for Digital Cameras


IMAGE PROCESSING SERIES
Series Editor: Phillip A. Laplante, Pennsylvania State University

Published Titles
Adaptive Image Processing: A Computational Intelligence Perspective (Stuart William Perry, Hau-San Wong, and Ling Guan)
Color Image Processing: Methods and Applications (Rastislav Lukac and Konstantinos N. Plataniotis)
Image Acquisition and Processing with LabVIEW™ (Christopher G. Relf)
Image and Video Compression for Multimedia Engineering, Second Edition (Yun Q. Shi and Huifang Sun)
Multimedia Image and Video Processing (Ling Guan, S.Y. Kung, and Jan Larsen)
Shape Analysis and Classification: Theory and Practice (Luciano da Fontoura Costa and Roberto Marcondes Cesar Jr.)
Single-Sensor Imaging: Methods and Applications for Digital Cameras (Rastislav Lukac)
Software Engineering for Image Processing Systems (Phillip A. Laplante)

Single-Sensor Imaging: Methods and Applications for Digital Cameras
Edited by Rastislav Lukac
Boca Raton London New York
CRC Press is an imprint of the Taylor & Francis Group, an informa business

MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This book’s use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2009 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works
Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2 1

International Standard Book Number-13: 978-1-4200-5452-1 (Hardcover)

This book contains information obtained from authentic and highly regarded sources.
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged, please write and let us know so we may rectify it in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data
Single-sensor imaging : methods and applications for digital cameras / editor, Rastislav Lukac.
p. cm. -- (Image processing series ; 9)
Includes bibliographical references and index.
ISBN 978-1-4200-5452-1 (hardback : alk. paper)
1. Digital cameras. 2. Image processing--Digital techniques. 3. Image converters. I. Lukac, Rastislav.
TR256.S556 2009
771.3’3--dc22
2008026505

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

Artists can color the sky red because they know it’s blue. Those of us who aren’t artists must color things the way they really are or people might think we’re stupid.
Jules Feiffer, cartoonist and satirist

Dedication
To my wife Ivana

Preface
Over the past two decades, advances in hardware and software technology have allowed for massive sales of consumer electronics based on the concept of converting analog information into digital form. As in many other application areas where digital devices have replaced their analog predecessors, manufacturers and consumers have been losing interest in conventional film cameras and turning instead to digital cameras. This is mainly because capturing and developing photos with chemical and mechanical processes cannot offer the conveniences of digital cameras, which record, store, and manipulate photographs electronically using image sensors and built-in computers. Features such as displaying an image immediately after it is recorded, storing thousands of images on a small memory device, deleting images from that device so that it can be reused, editing images, and even recording them with sound make digital cameras very attractive consumer electronic products.

To create an image of a scene, digital cameras use a series of lenses that focus light onto a sensor, which samples the light and records electronic information that is subsequently converted into digital data. The sensor is an array of light-sensitive spots, called photosites, which record the total intensity of the light that strikes their surfaces. Unfortunately, common image sensors are monochrome devices that cannot record color information.
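The statement that a photosite records only the total intensity of the incoming light can be made concrete with a toy model. The sketch below assumes spectra sampled at just three wavelengths, and all sensitivity, filter, and light values are hypothetical numbers chosen for illustration: three differently colored lights give identical readings on a bare monochrome photosite, while a single red filter placed over the photosite separates the reddish light from the others.

```python
# Toy model of a monochrome photosite. All numbers are hypothetical, for illustration only.
# Spectra are sampled at three assumed wavelengths: 450 nm (blue), 550 nm (green), 650 nm (red).
SENSITIVITY = (0.5, 1.0, 0.5)   # assumed relative spectral sensitivity of the photosite
RED_FILTER = (0.0, 0.0, 1.0)    # assumed ideal red filter transmittance


def photosite_response(spectrum, transmittance=(1.0, 1.0, 1.0)):
    """Total signal: light power weighted by filter transmittance and sensor sensitivity."""
    return sum(p * t * s for p, t, s in zip(spectrum, transmittance, SENSITIVITY))


lights = {
    "bluish": (2.0, 0.5, 0.0),
    "greenish": (0.0, 1.5, 0.0),
    "reddish": (0.0, 0.5, 2.0),
}

for name, spectrum in lights.items():
    # Unfiltered reading, then reading through the red filter.
    print(name, photosite_response(spectrum), photosite_response(spectrum, RED_FILTER))
# bluish 1.5 0.0
# greenish 1.5 0.0
# reddish 1.5 1.0
```

A single filter color still cannot tell the bluish light from the greenish one (both read 0.0 through the red filter), which is why a color filter array interleaves filters of several colors over neighboring photosites.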
Among the technologies developed to overcome this problem, single-sensor imaging offers a favorable trade-off among performance, complexity, and cost. Thus, most of today’s digital cameras are single-sensor devices that capture visual scenes in color using a monochrome image sensor in conjunction with an array of color filters.

It should not be surprising that single-sensor digital camera imaging is considered one of the most rapidly developing research and application fields, and numerous commercial products capitalizing on its principles have already appeared in diverse markets. The enormous and still growing popularity of consumer single-sensor digital cameras has boosted research in digital color image acquisition, processing, and storage. Single-sensor camera image processing methods are becoming increasingly important due to the development and proliferation of emerging digital camera imaging applications and commercial devices, such as consumer digital still and video cameras, image-enabled mobile phones and personal digital assistants, sensor networks, and surveillance and automotive apparatus. The surge of emerging applications, such as digital photography, visual communications, machine vision, multimedia, digital cinema, art, visual surveillance, medical imaging, and astronomy, suggests that the demand for single-sensor imaging and digital camera image processing solutions will grow considerably in the next decade.

The purpose of this book is to fill the existing gaps in the literature and comprehensively cover the system design, implementation, and application aspects of single-sensor imaging and digital camera image processing. Due to rapid developments in specialized areas of single-sensor imaging and digital camera image processing, the book is a contributed volume in which well-known experts deal with specific research and application problems.
It presents both the state of the art and the most recent trends in digital camera imaging and applications, and it serves the needs of readers at different levels. It can be used as a textbook in support of a graduate course in digital imaging and visual data processing or as a stand-alone reference for graduate students, researchers, and practitioners. For example, a researcher can use it as an up-to-date reference, since it offers a broad survey of the relevant literature. Development engineers and technical managers may find it useful in the design and implementation of various digital camera image and video processing tasks.

This book details recent advances in single-sensor imaging and digital camera image processing methods and explores their applications. The book begins by focusing on single-sensor imaging fundamentals, a reusable embedded software platform for versatile digital cameras, and digital camera image processing chain design. The next part of the book presents optical antialiasing filter design, spatio-spectral sampling and color filter array design, and mosaicking and demosaicking for multispectral digital cameras. Moving along the camera image processing pipeline, the book then covers frequency-domain analysis of color filter array sampling for the design of demosaicking algorithms, linear minimum mean square error demosaicking, and color filter array image analysis for joint demosaicking and denoising. This is followed by automatic white balancing, enhancement of digital photographs using color transfer techniques, and exposure correction. The next part of the book focuses on image storage, targeting three areas: digital camera image storage formats, modelling of image processing pipelines from a data compression point of view, and lossless compression of color mosaic images and videos.
Then, the reader’s attention is turned to optional but frequently used steps in the camera image processing pipeline, such as automatic red-eye removal for digital photography and single-sensor image resizing. Finally, the remaining chapters explore video processing approaches across a broad spectrum of single-sensor imaging applications, ranging from video-demosaicking, through simultaneous demosaicking and resolution enhancement, to image and video stabilization.

Chapters 1 through 3 discuss concepts and technologies that allow for effective design and high performance of single-sensor imaging devices. Single-sensor digital color imaging fundamentals are essential for understanding image formation using a color filter array and a monochrome image sensor. As demonstrated by numerous examples, although finished digital photographs are produced from captured sensor data through extensive image processing, they often suffer from various visual impairments due to the shortcomings of image acquisition systems, the constraints imposed on imaging devices, and a lack of information during image processing. To improve visual quality, processing solutions should fully exploit the various characteristics of the image.

A reusable embedded software platform for versatile single-sensor digital cameras plays an important role in designing an imaging device, as it usually supports both attractive features in user operation mode and calibration and test functions in engineering mode. Using such a platform, embedded software designers can easily grasp the overall camera hardware architecture without being sidetracked by the study of detailed hardware specifications. The embedded software architecture allows designers to move quickly into practical camera design. In addition, the embedded self-calibration flow and the sensor and shutter calibration algorithms provide a valuable reference for efficiently building a commercial camera for mass production.
In practice, digital camera image processing chain design is usually seen as taking relatively simple, well-known image processing operations and staging them in a manner that produces the best synergistic effects. In an image processing chain that transforms raw digital camera sensor data into a full-color, fully processed image, the possible orderings of the individual operations and the associated implementation details have a great impact on both image quality and computational efficiency. Image processing operations that are highly effective may not be viable candidates for image processing chains in constrained computing environments. Therefore, the design challenge is to balance the opposing requirements of desirable image quality and modest use of computing resources.

Chapters 4 through 6 cover the basics of visual information sampling and review recent advances in this area. Sampled imaging systems such as digital cameras often produce aliasing artifacts. Once an image is sampled, the aliased low-frequency content is difficult to correct automatically because it has characteristics similar to those of actual low-frequency content. To prevent such artifacts, most cameras use optical antialiasing filters to band-limit the spatial frequencies of the optical image. Limiting the image spatial frequencies is equivalent to blurring the image, so these filters are sometimes called blur filters. Analysis of antialiasing filter performance must include all capture system parameters, particularly the pixel aperture size, lens performance, and interpolation technique. Ideally, the antialiasing filter and the interpolation technique should be co-optimized to maximize the system modulation transfer function (MTF) below the Nyquist frequency and minimize it above the Nyquist frequency.

In single-sensor digital cameras, visual information is sampled by a color filter array and an image sensor.
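The aliasing argument above can be checked with a one-dimensional sketch. Assuming unit pixel pitch and ideal point sampling (no pixel aperture, lens blur, or antialiasing filter), a cosine at 0.9 cycles per pixel, well above the Nyquist frequency of 0.5, produces the same samples as a genuine 0.1 cycles-per-pixel pattern, so nothing in the sampled data reveals which one was in the scene. The function names are illustrative, not taken from the book.

```python
import math


def sample_cosine(freq, num_samples):
    """Point-sample cos(2*pi*freq*x) at integer pixel positions (unit pixel pitch assumed)."""
    return [math.cos(2 * math.pi * freq * x) for x in range(num_samples)]


def aliased_frequency(freq, sampling_rate=1.0):
    """Fold a frequency into the baseband [0, sampling_rate / 2] observed after sampling."""
    f = freq % sampling_rate
    return min(f, sampling_rate - f)


fine = sample_cosine(0.9, 16)                       # above the Nyquist frequency (0.5)
coarse = sample_cosine(aliased_frequency(0.9), 16)  # a genuine 0.1 cycles/pixel pattern
print(round(aliased_frequency(0.9), 6))             # 0.1
print(max(abs(a - b) for a, b in zip(fine, coarse)) < 1e-9)  # True: identical samples
```

A color filter array makes matters worse for individual color channels: in a Bayer-type array, red and blue are sampled only at every other pixel in each direction, so their Nyquist frequency is 0.25 cycles per pixel and they fold at even lower scene frequencies.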
Spatio-spectral sampling and color filter array design can be seen as the problem of simultaneously maximizing the spectral support of the luminance and chrominance channels subject to their mutual exclusivity in the Fourier domain. Key to this design paradigm is the notion that the measurement process, an inner product between the color filter array and the image data, induces a modulation in the frequency domain. Modulating the chrominance spectra away from the baseband luminance channel provides a basis for designing physically realizable color filter arrays by specifying these modulation frequencies directly. This method generates panchromatic color filter arrays that mitigate aliasing and admit favorable trade-offs between demosaicking quality and computational efficiency.

It is probably no surprise that some ideas from single-sensor imaging can be adopted in multispectral imaging, which extends the capability of color cameras to capture spectral information at multiple wavelengths other than those of visible light. The chapter on mosaicking and demosaicking in the design of multispectral digital cameras focuses on the design of multispectral filter arrays and the development of the corresponding demosaicking algorithms. The binary tree-driven multispectral filter array generation process guarantees that the pixel distributions of the different spectral bands are uniform and highly correlated. These spatial features facilitate the design of a generic demosaicking method based on the same tree, which considers three interrelated issues: band selection, pixel selection, and interpolation. The development of a generic demosaicking algorithm enables cost-effective multispectral imaging.

Chapters 7 through 9 address important issues in demosaicking, a crucial step in the single-sensor imaging pipeline that restores the color image from the raw mosaic sensor data.
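As a concrete baseline for the demosaicking problem, the sketch below simulates capture through a Bayer color filter array and then restores full color by bilinear interpolation. It assumes an RGGB layout and small images stored as nested Python lists; bilinear interpolation is the simplest textbook method, not one of the algorithms presented in Chapters 7 through 9, and the function names are hypothetical.

```python
def bayer_channel(row, col):
    """Channel index (0=R, 1=G, 2=B) sampled at (row, col) in an assumed RGGB Bayer layout."""
    if row % 2 == 0 and col % 2 == 0:
        return 0
    if row % 2 == 1 and col % 2 == 1:
        return 2
    return 1


def mosaic(rgb):
    """Simulate CFA capture: keep only the one color sample each photosite sees."""
    return [[rgb[r][c][bayer_channel(r, c)] for c in range(len(rgb[0]))]
            for r in range(len(rgb))]


def bilinear_demosaic(cfa):
    """Fill each missing sample with the mean of same-color samples in its 3x3 window.

    In the image interior this equals classic bilinear interpolation for the Bayer
    layout; at the borders the window is clipped. The image must be at least 2x2.
    """
    h, w = len(cfa), len(cfa[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            pixel = []
            for ch in range(3):
                if bayer_channel(r, c) == ch:
                    pixel.append(float(cfa[r][c]))  # measured sample is kept unchanged
                    continue
                neighbours = [cfa[rr][cc]
                              for rr in range(max(r - 1, 0), min(r + 2, h))
                              for cc in range(max(c - 1, 0), min(c + 2, w))
                              if bayer_channel(rr, cc) == ch]
                pixel.append(sum(neighbours) / len(neighbours))
            row.append(pixel)
        out.append(row)
    return out


flat = [[(10, 20, 30)] * 4 for _ in range(4)]  # a uniform 4x4 test image
restored = bilinear_demosaic(mosaic(flat))
print(restored[1][2])  # [10.0, 20.0, 30.0]
```

On a horizontal ramp, the interior red value at a green site is also recovered exactly, because bilinear interpolation is exact for linear signals. Near edges and fine textures, however, this simple averaging tends to produce false colors and zipper artifacts, which more advanced demosaicking methods are designed to avoid.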
Color filter array sampling of color images involves spatial-domain multiplexing of three or more color components of a color image, each on a subset of the lattice consisting of all sensor elements. In the frequency domain, this same operation can be viewed as the multiplexing of a luma component at baseband and two or more chrominance components centered at certain spatial modulation frequencies. This view leads to efficient demosaicking algorithms that would not normally be evident from the spatial-domain representation.

Linear minimum mean square error demosaicking is another computationally efficient method for reconstructing the color information of a captured image. This method is applicable to single-sensor data obtained using different color filter arrays. It takes advantage of a model of spatio-chromatic sampling applicable to both the human visual system and digital cameras, and it allows the construction of both linear and data-adaptive demosaicking solutions.

Fourier and wavelet-packet filterbank-based analysis of color filter array images for joint demosaicking and denoising reveals that the observed data consist of a mixture of baseband luminance signals, spectrally shifted difference images, and noise. It is well known that noise can significantly affect the perceptual quality of the output of a digital camera. Because demosaicking aims to preserve the sharpness of edges and textures, it often amplifies noise, as noise patterns may form false edge structures. The problem of estimating the noise-free image signal from a set of incomplete, noise-corrupted observations of pixel components can be approached from the standpoint of Bayesian statistics. Such an approach allows for different design regimes that can be thought of as simultaneous demosaicking and image denoising, and it can yield better performance than handling these two processing steps separately.
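The frequency-domain view of color filter array sampling described at the start of this passage can be verified numerically for a Bayer-type array. Assuming a GRBG layout, with m the row and n the column index (the exact layout and sign conventions used in Chapter 7 may differ), every CFA sample equals a baseband luma term L = (R + 2G + B)/4, plus a chrominance term C1 = (-R + 2G - B)/4 modulated by (-1)^(m+n), plus a second chrominance term C2 = (R - B)/4 modulated by (-1)^m - (-1)^n. These factors oscillate at half the sampling frequency, which places the chrominance spectra at the corner and edge midpoints of the frequency plane, away from the baseband luma. The function names are hypothetical.

```python
# Assumed GRBG Bayer layout: G where (m + n) is even, R on even rows, B on odd rows.
def cfa_sample(rgb, m, n):
    """Value a Bayer sensor records at row m, column n for the local scene color (R, G, B)."""
    r, g, b = rgb
    if (m + n) % 2 == 0:
        return g
    return r if m % 2 == 0 else b


def luma_chroma_model(rgb, m, n):
    """The same sample written as baseband luma plus two modulated chrominance terms."""
    r, g, b = rgb
    luma = (r + 2 * g + b) / 4  # baseband component
    c1 = (-r + 2 * g - b) / 4   # modulated by (-1)^(m+n): frequency (1/2, 1/2)
    c2 = (r - b) / 4            # modulated by (-1)^m - (-1)^n: frequencies (1/2, 0), (0, 1/2)
    return luma + c1 * (-1) ** (m + n) + c2 * ((-1) ** m - (-1) ** n)


color = (37, 80, 12)
for m in range(2):
    print([luma_chroma_model(color, m, n) for n in range(2)])
# [80.0, 37.0]
# [12.0, 80.0]
```

Because the identity holds for any color at any pixel, it also holds for a whole image with spatially varying L, C1, and C2. Broadly speaking, frequency-domain demosaicking designs build on this: they separate the components with filtering and demodulation instead of interpolating each color plane independently.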
Chapters 10 through 12 focus on color and exposure corrections. White balancing adjusts the colors of the captured image to compensate for the color shifts that the ambient illumination introduces relative to the colors perceived in the scene. In manual mode, users can choose from white balance settings predetermined by the camera manufacturer for typical lighting scenarios or define a custom white balance reference. In automatic mode, digital cameras can use special sensors to dynamically detect the color temperature of the ambient light and compensate for its effects. Cost-effective cameras achieve white balance solely through an image processing algorithm that sets the white balancing parameters in a fully automated manner, based on image content and statistics.

Enhancement of digital photographs using color transfer techniques is an advanced approach to altering the colors of captured images. This approach transforms the captured image so that its final colors match the palette of a target image, regardless of the content of the two pictures. One way to treat this recoloring problem is to find a one-to-one color mapping that is applied to every pixel in the captured image, producing an image that is identical in every aspect to the original captured image, except that it now exhibits the same color statistics, or palette, as the target image.

Exposure correction is essential in imaging devices to compensate for improper exposure of the sensor to light. Digital consumer devices use ad hoc strategies and heuristics to derive exposure setting parameters. Typically, such techniques are completely blind with respect to the specific content of the scene. Unfortunately, it is not rare for images to be acquired with a nonoptimal or incorrect exposure due to complex lighting conditions in the visual scene or poor optics, resulting in images that are too dark or too bright.
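A minimal sketch of correcting such an image, assuming linear intensities in [0, 1] and a rectangular region of interest chosen by hand (a real system would need to locate the important regions automatically): one global gain is computed so that the region's mean intensity lands in the middle of the range. The function name, the ROI convention, and the pixel values are all hypothetical.

```python
def correct_exposure(image, roi, target=0.5):
    """Apply one global gain so the mean of a region of interest maps to `target`.

    image: 2-D list of linear intensities in [0, 1]
    roi:   (top, left, bottom, right) with bottom/right exclusive (hypothetical convention)
    """
    top, left, bottom, right = roi
    region = [image[y][x] for y in range(top, bottom) for x in range(left, right)]
    mean = sum(region) / len(region)
    gain = target / mean if mean > 0 else 1.0
    corrected = [[min(1.0, v * gain) for v in row] for row in image]  # clip to valid range
    return corrected, gain


dark = [[0.05, 0.10, 0.80],
        [0.10, 0.15, 0.90],
        [0.05, 0.10, 0.85]]
# Assume the two left-hand columns hold the important subject.
fixed, gain = correct_exposure(dark, roi=(0, 0, 3, 2))
roi_mean = sum(fixed[y][x] for y in range(3) for x in range(2)) / 6
print(round(roi_mean, 6))  # 0.5
print(fixed[1][2])         # 1.0 (the bright highlight is clipped)
```

The clipped right-hand column shows the weakness of a single linear gain: bright areas saturate. Practical exposure-correction methods typically rely on nonlinear tone mappings instead, which this sketch does not attempt.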
Correcting the exposure thus often means reproducing the most important regions, selected according to contextual or perceptual criteria, with intensities roughly in the middle of the available range.

Chapters 13 through 15 address the important issues of image storage and data compression. Adopting a common standard image file format enables readers, including computers and photo printers, to make use of the image data. Current digital camera image storage formats evolved over time, adopting some features first introduced in their predecessors. According to the standard, captured files are named and organized into folders and contain image data along with metadata created by the camera. One of the two current standard formats stores the sensor image data prior to demosaicking, thus providing higher image quality but requiring the reader to perform the camera image processing. The other standard format stores the developed photograph, after the complete camera image processing has been performed on the sensor data, in an image format compatible with existing imaging hardware and software.

The position of the image compression step with respect to the demosaicking step can be used as the basis for modelling image processing pipelines in single-sensor digital cameras. The current development of digital camera image processing is typically guided by empirical performance evaluations. However, the results of empirical evaluations are usually limited to the training set and are often inconclusive. Therefore, mathematical models linking the performance of camera image processing to image content and algorithm settings can provide a better understanding of design issues.

With image quality and computational efficiency in mind, lossless compression of single-sensor mosaic images and videos is of paramount interest in a number of applications, ranging from digital photography and cinema to medical imaging.
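The difficulty posed by interleaved colors can be illustrated with the integer Haar transform (often called the S-transform) written in lifting form. It maps integers to integers and can be inverted exactly, which is what lossless coding requires. This is a generic illustration, not the wavelet packet scheme of Chapter 15, and the sample values are invented. Applied directly to a mosaic row in which green and red samples alternate, the transform produces large detail coefficients; applied to each color's samples separately, the details become small.

```python
def s_transform(samples):
    """Forward integer Haar (S-transform) via lifting: each pair (a, b) -> (low, high)."""
    if len(samples) % 2:
        raise ValueError("expected an even number of samples")
    low, high = [], []
    for a, b in zip(samples[0::2], samples[1::2]):
        d = b - a         # predict step: detail (difference) coefficient
        s = a + (d >> 1)  # update step: floor((a + b) / 2) using integers only
        low.append(s)
        high.append(d)
    return low, high


def inverse_s_transform(low, high):
    """Exactly invert s_transform with integer arithmetic, so nothing is lost."""
    samples = []
    for s, d in zip(low, high):
        a = s - (d >> 1)
        samples.extend([a, a + d])
    return samples


# Hypothetical row of a Bayer mosaic: G and R samples alternate.
row = [120, 60, 122, 61, 121, 63, 123, 62]
_, high_interleaved = s_transform(row)
_, high_green = s_transform(row[0::2])  # G samples only
_, high_red = s_transform(row[1::2])    # R samples only

print(sum(map(abs, high_interleaved)))                # 240
print(sum(map(abs, high_green + high_red)))           # 6
print(inverse_s_transform(*s_transform(row)) == row)  # True
```

Small detail coefficients generally cost fewer bits in a subsequent entropy coder, while the exact integer inverse guarantees that the original mosaic is recovered bit for bit.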
Technically, lossless compression of color filter array mosaic images poses the unique challenge of spectrally decorrelating spatially interleaved samples of three or more colors. Among the reversible lossless transforms that can remove statistical redundancies in both the spectral and spatial domains, the Mallat wavelet packet transform is an ideal solution, and it can be extended to the lossless compression of a time sequence of color filter array mosaic frames.

Chapters 16 and 17 deal with two popular optional steps in the camera imaging pipeline. Automatic red-eye detection and removal is needed to eliminate red-eye effects in digital or digitized film photographs. These effects are caused by light reflected from the blood vessels in the retina when a strong and sudden light, such as the camera flash, strikes the eye. A common technique to reduce red-eye effects is to use multiple flashes to contract the pupils before the final shot. However, the flash consumes a significant amount of power, and the effects cannot be completely removed this way. Therefore, red-eye removal techniques, which locate and correct red eyes in captured photos digitally using image processing and pattern recognition solutions, are implemented directly in cameras, or externally in printer drivers and in software running on a personal computer.

Image resizing is another optional step in the camera imaging pipeline. The spatial resolution of digital camera images is often modified by the user or as a result of application constraints. For instance, images are downsampled to reduce the bandwidth needed for their transmission, to fit more pictures on the storage media, or to obtain the appropriate resolution for printing. On the other hand, images may be upsampled to overcome the limited optical capabilities of inexpensive cameras with fixed-zoom lenses or to allow close visual inspection of fine details and areas of interest.
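A minimal sketch of upsampling by bilinear interpolation for a single image plane (grayscale, or one channel of a full-color image), using the common "align corners" coordinate mapping in which the corner pixels of the input and output coincide. It is a generic method, not one of the single-sensor resizing solutions of Chapter 17, and the function name is hypothetical.

```python
def resize_bilinear(image, new_h, new_w):
    """Resize a single-channel image (2-D list) by bilinear interpolation (align-corners)."""
    h, w = len(image), len(image[0])
    out = []
    for i in range(new_h):
        # Map the output row back to a (fractional) input row.
        y = i * (h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        fy = y - y0
        row = []
        for j in range(new_w):
            x = j * (w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            fx = x - x0
            # Interpolate horizontally on two rows, then vertically between them.
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


small = [[0, 10],
         [20, 30]]
print(resize_bilinear(small, 3, 3))
# [[0.0, 5.0, 10.0], [10.0, 15.0, 20.0], [20.0, 25.0, 30.0]]
```

The same function can also downsample, but without prior low-pass filtering, large reduction factors can alias fine detail, so the antialiasing considerations discussed earlier apply here as well.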
A specific spatial resolution of the visual input may also be required by some processing steps, such as scene analysis and object recognition, to achieve the desired performance.

Finally, Chapters 18 through 20 discuss various issues in single-sensor video processing. Recently, video-demosaicking has attracted the interest of the digital camera imaging community. With advances in hardware and software, more and more cameras allow for the recording of digital video. Since digital video represents a three-dimensional image signal, or a time sequence of two-dimensional images, video-demosaicking goes one step beyond traditional spatial demosaicking by utilizing spectral, spatial, temporal, and motion characteristics during processing to produce high-quality color video.

A multiframe approach is also essential for simultaneous demosaicking and resolution enhancement. The goal of multiframe processing here is to estimate a high-resolution image from a collection of low-resolution images, which typically suffer from noise as well as warping, blurring, downsampling, and color-sampling effects. To mitigate the shortcomings of imaging systems, the presented multiframe approach solves the demosaicking and resolution enhancement problems jointly, offering improved performance.

Image and video stabilization has become increasingly important as cameras have become more portable, which usually results in less stable image and video capture. Unwanted position fluctuations of the camera affect the visual quality of captured image sequences and reduce the performance of automated machine vision systems. To accurately estimate both camera fluctuations and the motion of objects in the scene, and to compensate for unwanted camera shake, digital cameras use various solutions, such as optical, electronic, and digital image stabilization.
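A minimal sketch of the motion-estimation core of digital image stabilization, assuming the unwanted motion is a pure integer translation of the whole frame. It uses exhaustive block matching with the mean absolute difference over the overlapping pixels; the function names and the synthetic test frame are hypothetical. Because it estimates a single global shift, independently moving objects are not handled, and a practical stabilizer would also separate intentional panning from jitter, for example by smoothing the estimated motion over time, which this sketch omits.

```python
def estimate_global_shift(ref, cur, max_shift=3):
    """Find the integer (dy, dx) that best explains cur[y][x] ~ ref[y - dy][x - dx]."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            total, count = 0, 0
            # Only compare pixels that are inside both frames for this candidate shift.
            for y in range(max(0, dy), min(h, h + dy)):
                for x in range(max(0, dx), min(w, w + dx)):
                    total += abs(cur[y][x] - ref[y - dy][x - dx])
                    count += 1
            cost = total / count  # mean absolute difference
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best[1], best[2]


def compensate(cur, dy, dx, fill=0):
    """Shift cur back by (dy, dx) so it lines up with the reference frame."""
    h, w = len(cur), len(cur[0])
    return [[cur[y + dy][x + dx] if 0 <= y + dy < h and 0 <= x + dx < w else fill
             for x in range(w)]
            for y in range(h)]


ref = [[8 * y + x for x in range(8)] for y in range(8)]  # synthetic 8x8 test frame
# Simulated shake: content moves down 1 row and left 2 columns; uncovered pixels become 0.
cur = [[ref[y - 1][x + 2] if 0 <= y - 1 < 8 and 0 <= x + 2 < 8 else 0 for x in range(8)]
       for y in range(8)]
dy, dx = estimate_global_shift(ref, cur)
print(dy, dx)  # 1 -2
stable = compensate(cur, dy, dx)
```

Once the shift is known, `compensate` moves the current frame back into alignment; the uncovered border (here, the last row and the first two columns) is filled with a constant, and real stabilizers usually crop or otherwise fill this area.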
Among these three popular approaches, digital image stabilization is the most cost-effective, as it is implemented purely in software and requires no motion sensors or adjustable prisms.

The bibliographic links included in all chapters of the book provide a good basis for further exploration of the presented topics. The volume includes numerous examples and illustrations of single-sensor image processing results, as well as tables summarizing the results of quantitative analysis studies. Complementary material is available online at http://www.colorimageprocessing.org.

I would like to thank the contributors for their effort, valuable time, and motivation to enhance the profession by providing material for a wide audience while still offering their individual research insights and opinions. I am very grateful for their enthusiastic support, timely responses, and willingness to incorporate suggestions from me and from other contributing authors who served as reviewers and helped to improve the quality of the contributions. I also appreciate the understanding and support of my colleagues at Epson Edge, particularly Ian Clarke, Sohaib Sajid, and Graham Sellers. Finally, a word of appreciation goes to CRC Press / Taylor & Francis for giving me the opportunity to edit a book on single-sensor imaging. In particular, I would like to thank Dr. Phillip A. Laplante for his encouragement, Nora Konopka for initiating and supporting this project, and Shashi Kumar for his LaTeX assistance.

Rastislav Lukac
Epson Canada Ltd., Toronto, ON, Canada
E-mail: lukacr@colorimageprocessing.com
Web: www.colorimageprocessing.com

The Editor
Rastislav Lukac (www.colorimageprocessing.com) received the M.S. (Ing.) and Ph.D. degrees in Telecommunications from the Technical University of Kosice, Slovak Republic, in 1998 and 2001, respectively. From February 2001 to August 2002, he was an assistant professor with the Department of Electronics and Multimedia Communications at the Technical University of Kosice.
From August 2002 to July 2003, he was a researcher with the Slovak Image Processing Center in Dobsina, Slovak Republic. From January 2003 to March 2003, he was a postdoctoral fellow with the Artificial Intelligence and Information Analysis Laboratory, Aristotle University of Thessaloniki, Greece. From May 2003 to August 2006, he was a postdoctoral fellow with the Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada. Since September 2006, he has been a senior image processing scientist at Epson Edge, Epson Canada Ltd., Toronto, Canada. He has contributed to seven books and has published over 200 papers in the areas of digital camera image processing, color image and video processing, multimedia security, and microarray image processing.

Dr. Lukac is a member of the Institute of Electrical and Electronics Engineers (IEEE) and of the IEEE Circuits and Systems, IEEE Consumer Electronics, and IEEE Signal Processing societies. He is an editor of the book Color Image Processing: Methods and Applications (Boca Raton, FL, CRC Press / Taylor & Francis, 2006). He is a guest editor of Real-Time Imaging (Special Issue on Multi-Dimensional Image Processing), Computer Vision and Image Understanding (Special Issue on Color Image Processing), International Journal of Imaging Systems and Technology (Special Issue on Applied Color Image Processing), and International Journal of Pattern Recognition and Artificial Intelligence (Special Issue on Facial Image Processing and Analysis). He is an associate editor for the Journal of Real-Time Image Processing. He serves as a technical reviewer for various scientific journals and is a member of numerous international conference committees.
He is the recipient of the 2003 North Atlantic Treaty Organization / Natural Sciences and Engineering Research Council of Canada (NATO/NSERC) Science Award and the Most Cited Paper Award for the Journal of Visual Communication and Image Representation for the years 2005–2007.

Contributors
Rastislav Lukac, Epson Canada Ltd., Toronto, ON, Canada
Wen-Chung Kao, National Taiwan Normal University, Taipei, Taiwan
Hung-Hsin Wu, National Taiwan Normal University, Taipei, Taiwan
Sheng-Yuan Lin, Cheng-Wei Precision Industry Co., Ltd., Taipei, Taiwan
James E. Adams, Jr., Eastman Kodak Company, Rochester, NY, USA
John F. Hamilton, Jr., Eastman Kodak Company, Rochester, NY, USA
Russ Palum, Eastman Kodak Company, Rochester, NY, USA
Keigo Hirakawa, Harvard University, Cambridge, MA, USA
Patrick J. Wolfe, Harvard University, Cambridge, MA, USA
Lidan Miao, University of Tennessee, Knoxville, TN, USA
Hairong Qi, University of Tennessee, Knoxville, TN, USA
Wesley E. Snyder, North Carolina State University, Raleigh, NC, USA
Eric Dubois, University of Ottawa, Ottawa, ON, Canada
David Alleysson, Université Pierre-Mendès France, Grenoble, France
Brice Chaix de Lavarène, Université Joseph Fourier, Grenoble, France
Sabine Süsstrunk, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
Jeanny Hérault, Université Joseph Fourier, Grenoble, France
Edmund Y. Lam, The University of Hong Kong, Hong Kong
George S. K. Fung, The University of Hong Kong, Hong Kong
François Pitié, Trinity College, Dublin, Ireland
Anil Kokaram, Trinity College, Dublin, Ireland
Rozenn Dahyot, Trinity College, Dublin, Ireland
Sebastiano Battiato, University of Catania, Catania, Italy
Giuseppe Messina, ST Microelectronics, Catania, Italy
Alfio Castorina, ST Microelectronics, Catania, Italy
Kenneth A. Parulski, Eastman Kodak Company, Rochester, NY, USA
Robert Reisch, Eastman Kodak Company, Rochester, NY, USA
Nai-Xiang Lian, Nanyang Technological University, Singapore
Vitali Zagorodnov, Nanyang Technological University, Singapore
Yap-Peng Tan, Nanyang Technological University, Singapore
Ning Zhang, IMAX Corporation, Mississauga, ON, Canada
Xiaolin Wu, McMaster University, Hamilton, ON, Canada
Lei Zhang, The Hong Kong Polytechnic University, Kowloon, Hong Kong
Francesca Gasparini, University of Milano-Bicocca, Milano, Italy
Raimondo Schettini, University of Milano-Bicocca, Milano, Italy
Wei Lian, The Hong Kong Polytechnic University, Hong Kong
Sina Farsiu, Duke University Eye Center, Durham, NC, USA
Dirk Robinson, Ricoh Innovations Inc., Menlo Park, CA, USA
Michael Elad, The Technion – Israel Institute of Technology, Haifa, Israel
Peyman Milanfar, University of California, Santa Cruz, CA, USA

Contents
1 Single-Sensor Digital Color Imaging Fundamentals (Rastislav Lukac)
2 Reusable Embedded Software Platform for Versatile Single-Sensor Digital Cameras (Wen-Chung Kao, Hung-Hsin Wu, and Sheng-Yuan Lin)
3 Digital Camera Image Processing Chain Design (James E. Adams, Jr. and John F. Hamilton, Jr.)
4 Optical Antialiasing Filters (Russ Palum)
5 Spatio-Spectral Sampling and Color Filter Array Design (Keigo Hirakawa and Patrick J. Wolfe)
6 Mosaicking and Demosaicking in the Design of Multispectral Digital Cameras (Lidan Miao, Hairong Qi, and Wesley E. Snyder)
7 Color Filter Array Sampling of Color Images: Frequency-Domain Analysis and Associated Demosaicking Algorithms (Eric Dubois)
8 Linear Minimum Mean Square Error Demosaicking (David Alleysson, Brice Chaix de Lavarène, Sabine Süsstrunk, and Jeanny Hérault)
9 Color Filter Array Image Analysis for Joint Demosaicking and Denoising (Keigo Hirakawa)
10 Automatic White Balancing in Digital Photography (Edmund Y. Lam and George S. K. Fung)
11 Enhancement of Digital Photographs Using Color Transfer Techniques (François Pitié, Anil Kokaram, and Rozenn Dahyot)
12 Exposure Correction for Imaging Devices: An Overview (Sebastiano Battiato, Giuseppe Messina, and Alfio Castorina)
13 Digital Camera Image Storage Formats (Kenneth A. Parulski and Robert Reisch)
14 Modelling of Image Processing Pipelines in Single-Sensor Digital Cameras (Nai-Xiang Lian, Vitali Zagorodnov, and Yap-Peng Tan)
15 Lossless Compression of Color Mosaic Images and Videos (Ning Zhang, Xiaolin Wu, and Lei Zhang)
16 Automatic Red-Eye Removal for Digital Photography (Francesca Gasparini and Raimondo Schettini)
17 Image Resizing Solutions for Single-Sensor Digital Cameras (Rastislav Lukac)
18 Video-Demosaicking (Lei Zhang and Wei Lian)
19 Simultaneous Demosaicking and Resolution Enhancement from Under-Sampled Image Sequences (Sina Farsiu, Dirk Robinson, Michael Elad, and Peyman Milanfar)
20 An Overview of Image / Video Stabilization Techniques (Wen-Chung Kao and Sheng-Yuan Lin)
Index

1 Single-Sensor Digital Color Imaging Fundamentals
Rastislav Lukac
1.1 Introduction
1.2 Color Image Acquisition in Digital Cameras
1.2.1 Three-Sensor Digital Cameras
1.2.2 Single-Sensor Digital Cameras
1.3 From Raw Sensor Data to Digital Photographs
1.3.1 Pipelining Image Processing Solutions
1.3.2 Design Alternatives
1.4 Visual Artifacts in Digital Camera Images
1.4.1 Image Noise
1.4.2 Demosaicking Artifacts
1.4.3 Coloration Shifts
1.4.4 Exposure Shifts
1.4.5 Image Compression Artifacts
1.5 What Is Really Important in Digital Camera Image Processing?
1.5.1 Spatial Characteristics
1.5.2 Structural Characteristics
1.5.3 Spectral Characteristics
1.5.4 Temporal Characteristics
1.6 Conclusion
References

1.1 Introduction
Capturing visual scenes in color and producing digital photographs that are faithful representations of the original scene is quite challenging due to a number of constraints under which digital cameras operate.
Differences in characteristics between image acquisition systems and the human visual system constitute an underlying problem in digital color imaging. In today's digital cameras, most challenges result from sampling visual information spectrally, spatially, tonally, and temporally. Spectral sampling, realized with a color filter array, reduces the available color information to light of certain wavelengths, which can then be acquired by a monochrome image sensor. Spatial sampling reduces the angle of view that the camera sees to a rectangular array of pixels in the captured image. This is realized by an image sensor representing a two-dimensional array of light-sensitive spots that record the total intensity of the light striking their surfaces. Tonal sampling characterizes the quantization process used to represent the original, continuously varying visual information by discrete values. Finally, temporal sampling characterizes the exposure of the sensor to light for a certain amount of time. Clearly, the first step in achieving high-quality images is to design sampling procedures that allow for a precise digital representation of various visual scenes. However, as with other real-life imaging systems, this is not fully achievable, and extensive processing of the acquired sampled data is needed to compensate for the shortcomings of the imaging system and to produce digital photographs with a natural appearance matching the original scene. To facilitate the following discussion of technical challenges in digital cameras, this chapter presents the fundamentals of single-sensor color imaging and digital camera image processing. More specifically, Section 1.2 describes popular color image acquisition technologies. Of particular interest are consumer digital cameras, which use color filter arrays to capture visual scenes in color using only one monochrome image sensor.
Since color filter arrays distinguish this type of digital camera from other image acquisition solutions, this section also discusses color filter array design issues and digital image representations. Section 1.3 presents image processing pipelines for single-sensor color imaging devices. The data acquired by such devices constitutes a grayscale image with a mosaic structure that follows the underlying pattern of the color filter array. A pipeline consists of a number of processing steps necessary to produce a finished photograph from the acquired sensor data. Depending on the image storage format and user requirements, such pipelines can be implemented in the camera or in software running on personal computers. The examples of real digital camera images presented there suggest that both the choice and the order of the processing steps employed in the pipeline greatly influence the quality of the resulting digital photographs. Section 1.4 focuses on typical image quality issues in single-sensor digital cameras. These issues mainly relate to noise introduced into the image during its acquisition and to various color shifts and artifacts caused by insufficient image processing. Section 1.5 discusses important characteristics of color images and videos. These characteristics, which reflect different types of pixel correlation, relate to the spatial, spectral, structural, and temporal properties of the captured visual data; ignoring them during processing usually results in visual artifacts in the finished images. Thus, using as much of the information available in the image as possible is crucial for achieving high visual quality. Finally, Section 1.6 concludes the chapter by summarizing the main ideas of single-sensor color imaging and digital camera image processing.

1.2 Color Image Acquisition in Digital Cameras
Digital cameras acquire a scene by first focusing and then transmitting light through the optical system.
Once the light reaches the sensor surface, it is sampled by the sensor, and the corresponding digital representation of the sensor values is obtained through subsequent analog-to-digital conversion. The acquired digital data undergo a series of image processing operations [1], [2], which are realized by an application-specific integrated circuit and a microprocessor. To reduce various artifacts due to the sampling of visual information, camera manufacturers often place blur filters in the optical system to reduce the high-frequency content of the image. Chapter 4 discusses optical issues in detail.

FIGURE 1.1 (See color insert.) Three-sensor digital camera architecture.

Human vision is based on three types of color photoreceptor cone cells, implying that three numerical components are necessary and sufficient to describe a color [3]. However, common image sensors, such as charge-coupled devices (CCD) [4], [5], charge-injection devices (CID) [6], [7], or complementary metal oxide semiconductor (CMOS) sensors [8], [9], are monochromatic devices. Therefore, to capture color information using such sensors, digital camera manufacturers place a color filter on top of each sensor cell.¹ The two most popular camera technologies for color image acquisition, three-sensor and single-sensor solutions, are described below.

¹ There also exists a layered sensor which directly captures the complete color information at each spatial location in an image during a single exposure. This is possible by stacking and ordering color filters vertically according to the energy of the photons absorbed by silicon [10], [11]. The layered sensor is an alternative to earlier technology which rotates a series of red, blue, and green filters in front of a single sensor in order to record three separate images in rapid succession.

1.2.1 Three-Sensor Digital Cameras
To overcome the sensor's monochromatic nature and capture the visual scene in color, predecessors of today's consumer digital cameras used a beam splitter to separate incoming light onto three optical paths, each having its own red (R), green (G), or blue (B) color filter with different spectral transmittances [12] and its own sensor for sampling the filtered light (Figure 1.1). Each optical path deals with the same visual information; however, because of the filters, each sensor responds to only one of the primary colors. Thus, each of the three sensors acquires a monochromatic image corresponding to one channel of the color image output by the camera. The resulting color image is produced by registering the three grayscale (monochromatic) images, which requires precise mechanical and optical alignment. Figure 1.2 shows the registration procedure.

FIGURE 1.2 (See color insert.) Three-sensor imaging. (a) Registration of three grayscale images on the left, top and bottom of the figure which were acquired using sensors with red, green and blue color filters, respectively. (b) Final full-color image achieved by registering three sensor images.

The captured photo can be considered a $K_1 \times K_2$ RGB digital color image $\mathbf{x}: \mathbb{Z}^2 \to \mathbb{Z}^3$ representing a two-dimensional matrix of three-component pixels $\mathbf{x}_{(r,s)} = [x_{(r,s)1}, x_{(r,s)2}, x_{(r,s)3}]^T$. As noted, each individual channel of $\mathbf{x}$ is a $K_1 \times K_2$ monochromatic image $\mathbf{x}_k: \mathbb{Z}^2 \to \mathbb{Z}$, for $k = 1, 2, 3$. The pixel $\mathbf{x}_{(r,s)}$ represents the color vector [13] indexed by its spatial location $(r, s)$, with $r = 1, 2, \ldots, K_1$ denoting the image row and $s = 1, 2, \ldots, K_2$ denoting the image column. The value of the R ($k = 1$), G ($k = 2$), or B ($k = 3$) component $x_{(r,s)k}$, defined in the integer domain $\mathbb{Z}$, is an integer ranging from 0 to 255 in the standard 8-bit-per-component representation and denotes the contribution of the $k$-th primary to $\mathbf{x}_{(r,s)}$.
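The notation above maps directly onto nested arrays. The sketch below is a hypothetical illustration in plain Python (no imaging library is assumed, and the function names are made up): it stores a $K_1 \times K_2$ RGB image as rows of [R, G, B] lists and converts the 1-based indices $(r, s)$ and $k$ of the text to Python's 0-based indexing.

```python
from typing import List

Pixel = List[int]          # [R, G, B], each component an integer in 0..255
Image = List[List[Pixel]]  # K1 rows, each holding K2 pixels


def make_image(k1: int, k2: int, fill: Pixel) -> Image:
    """Create a K1 x K2 RGB image in which every pixel equals `fill`."""
    return [[list(fill) for _ in range(k2)] for _ in range(k1)]


def pixel(x: Image, r: int, s: int) -> Pixel:
    """Return the color vector x_(r,s) for 1-based row r and column s."""
    return x[r - 1][s - 1]


def channel(x: Image, k: int) -> List[List[int]]:
    """Return channel x_k (k = 1: R, 2: G, 3: B) as a K1 x K2 monochromatic image."""
    if k not in (1, 2, 3):
        raise ValueError("k must be 1, 2, or 3")
    return [[pix[k - 1] for pix in row] for row in x]


if __name__ == "__main__":
    x = make_image(2, 3, [10, 20, 30])  # K1 = 2 rows, K2 = 3 columns
    x[0][2] = [255, 0, 0]               # set x_(1,3) to pure red
    print(pixel(x, 1, 3))               # [255, 0, 0]
    print(channel(x, 1))                # [[10, 10, 255], [10, 10, 10]]
```

Copying `fill` with `list(fill)` keeps every pixel independent, so changing one pixel does not silently change the others.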
The process of displaying an image creates a graphical representation of the image matrix in which the pixel values represent particular colors in the visible spectrum.

1.2.2 Single-Sensor Digital Cameras
It is probably no surprise that the sensor is the most expensive component of the digital camera, usually accounting for 10% to 25% of the total cost [14]. To reduce expenses and allow for high sales volumes, current consumer digital cameras use only one image sensor, covered with a mosaic of color filters, to capture all the necessary colors at the same time. Figure 1.3 shows the sensor with the color filter array (CFA) proposed by B.E. Bayer in 1976 [15], which has been the most widely used CFA since its introduction. Examples of various CFAs used by camera manufacturers can be found in References [16] and [17], and in Chapters 5 and 8. Reference [17] also presents detailed discussions of CFA design issues together with performance evaluations of a number of CFAs.

FIGURE 1.3 (See color insert.) (left) Image sensor covered by a Bayer color filter array. (right) The concept of acquiring the visual information using color filters.

One of the most important features in the design of CFAs is the choice of a color system [17]. Popular CFA configurations are usually constructed using tristimulus (RGB) color filters. There also exist configurations based on mixed primary and complementary colors (e.g., mixtures of magenta, green, cyan, and yellow). More recently, CFAs with four or more colors have been introduced; these typically contain white or colors with shifted spectral sensitivity, which may be combined with primary or secondary colors. Systems with four or more colors may produce a more accurate hue gamut than tristimulus systems; unfortunately, they often limit the useful range of darker colors [16].
Regardless of which color system is employed, the acquired CFA sensor readings constitute a $K_1 \times K_2$ monochromatic image $\mathbf{z}: \mathbb{Z}^2 \to \mathbb{Z}$ with pixel values $z_{(r,s)}$. Figure 1.4a shows a simulated example. The image has a mosaic-like structure dictated by the CFA and is either passed through the camera image processing pipeline to develop a final digital photograph or stored as a so-called raw camera file. In either case, the arrangement of color filters in the actual CFA is known, and thus the pixels in the CFA image can be mapped to the corresponding channels of a color image of the same size [2]. For example, a CFA image $\mathbf{z}$ acquired using the Bayer CFA with GR phase in odd rows and BG phase in even rows corresponds to a color image $\mathbf{x}$ with RGB pixels

$$
\mathbf{x}_{(r,s)} =
\begin{cases}
[z_{(r,s)}, 0, 0]^T & \text{for odd } r \text{ and even } s,\\
[0, z_{(r,s)}, 0]^T & \text{for odd } r \text{ and odd } s, \text{ or even } r \text{ and even } s,\\
[0, 0, z_{(r,s)}]^T & \text{for even } r \text{ and odd } s.
\end{cases}
$$

The two zeros in each $\mathbf{x}_{(r,s)}$ mark the two missing components, which therefore contribute nothing to the coloration of the image $\mathbf{x}$ shown in Figure 1.4b, a color version of the CFA image shown in Figure 1.4a. These two missing components at each pixel location must be estimated from adjacent pixels using a digital image processing solution called demosaicking [2], [18], which restores the full-color information.² Thus, demosaicking is an integral step in the single-sensor imaging pipeline. Depending on the employed algorithm, the demosaicked image, such as the one shown in Figure 1.4c, may suffer from color shifts, aliasing effects, blur, and other visual artifacts. These effects can be suppressed, if not eliminated, via demosaicked image postprocessing [19], [20], [21]. Figure 1.4d shows an example of a postprocessed demosaicked image.

FIGURE 1.4 (See color insert.) Single-sensor imaging: (a) grayscale mosaic image, (b) color version of the mosaic image, (c) demosaicked full-color image, and (d) postprocessed demosaicked image with improved visual quality.

² Other terms known from the literature are demosaicing, color interpolation, and CFA interpolation.

Apart from demosaicking and demosaicked image postprocessing, which constitute mandatory and optional steps, respectively, in the single-sensor imaging pipeline, the rest of the pipeline is more or less identical to the pipeline in a three-sensor digital camera. Chapter 3 discusses digital camera image processing chain design in detail.

Another important factor in CFA design is the arrangement of color filters in the array, because it affects the cost-effectiveness of color reconstruction by demosaicking, immunity to color artifacts and color moiré, the reaction of the array to image sensor imperfections, and immunity to optical and electrical cross talk between neighboring pixels. A small basic repetitive unit in the CFA usually allows for relatively simple demosaicking. For example, the Bayer CFA shown in Figure 1.3 is constructed by repeating a 2 × 2 square composed of one red filter, one blue filter, and two green filters located on the diagonal. Given that the wavelength of the green color band is close to the peak of the human luminance frequency response,³ many CFAs contain more green filters than filters of other colors in order to reduce demosaicking artifacts [22]. The sensitivity of the array to color artifacts in the demosaicked image can be reduced by surrounding each color filter with color filters of all other types, thus minimizing the size of the local neighborhood and allowing demosaicking to be performed in a more local fashion.

³ In the literature, G components are often referred to as the luminance, whereas R and B components are known as the chrominance.
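The Bayer mapping given earlier in this section can be checked with a short sketch. The Python below is hypothetical (it is not code from the chapter, and the function names are made up); it uses 1-based $(r, s)$ as in the text, simulates CFA acquisition from a full-color image, and then builds the color version of the mosaic, with two zeros per pixel marking the missing components.

```python
from typing import List

R, G, B = 1, 2, 3  # channel indices k used in the text


def bayer_channel(r: int, s: int) -> int:
    """Channel k sampled at 1-based (r, s): GR phase in odd rows, BG phase in even rows."""
    if r % 2 == 1:                    # odd row:  G R G R ...
        return G if s % 2 == 1 else R
    return B if s % 2 == 1 else G     # even row: B G B G ...


def mosaic(x: List[List[List[int]]]) -> List[List[int]]:
    """Simulate CFA acquisition: keep only the component passed by each filter."""
    return [[x[r - 1][s - 1][bayer_channel(r, s) - 1]
             for s in range(1, len(x[0]) + 1)]
            for r in range(1, len(x) + 1)]


def cfa_to_color(z: List[List[int]]) -> List[List[List[int]]]:
    """Place each CFA sample z_(r,s) in its channel; the other two components stay 0."""
    out = []
    for r in range(1, len(z) + 1):
        row = []
        for s in range(1, len(z[0]) + 1):
            pix = [0, 0, 0]
            pix[bayer_channel(r, s) - 1] = z[r - 1][s - 1]
            row.append(pix)
        out.append(row)
    return out


if __name__ == "__main__":
    x = [[[50, 100, 150]] * 2 for _ in range(2)]  # flat 2 x 2 color image
    z = mosaic(x)
    print(z)                 # [[100, 50], [150, 100]]
    print(cfa_to_color(z))   # [[[0, 100, 0], [50, 0, 0]], [[0, 0, 150], [0, 100, 0]]]
```

Counting the sites returned by `bayer_channel` over any even-sized block confirms the 2 × 2 repeating unit described above: half of the sites are green, and red and blue each take a quarter.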
Introducing some degree of randomness or aperiodicity into the arrangement of color filters can make CFAs more robust to color moiré effects; unfortunately, this comes at the expense of increased computational complexity due to the larger size of the minimum repetitive pattern. Image sensor imperfections are typically observed along rows or columns of sensor cells, suggesting that a diagonal version of the Bayer CFA, as well as various diagonal stripe patterns, can be more immune to them. Finally, CFAs with a fixed number of neighbors for each type of color filter are usually more robust against optical and electrical cross talk between neighboring pixels than pseudo-random patterns. This concludes the overview of current CFA design issues. A novel view on the problem of designing CFAs is presented in Chapter 5.

1.3 From Raw Sensor Data to Digital Photographs
As previously discussed, once the sensor image has been acquired, it can be stored as raw data or passed through the camera image processing pipeline to produce a digital photograph. A number of digital cameras, typically digital single lens reflex (SLR) cameras, use the former approach by following the Tagged Image File Format for Electronic Photography (TIFF-EP) [23]. By applying lossless compression to raw sensor image data and storing the compressed image data together with metadata containing information about the camera settings, TIFF-EP allows high-quality digital photographs to be developed on a companion personal computer (PC) using sophisticated solutions under different settings, with the image reprocessed until certain quality criteria are met.
Thus, this approach may have quite different design and performance characteristics compared to the latter one, in which the sensor image, immediately after its acquisition, undergoes in-camera real-time image processing to produce the final image, which is typically stored using Joint Photographic Experts Group (JPEG) compression [24] in the Exchangeable Image File (EXIF) format [25] together with the metadata. Chapter 13 discusses camera formats in detail.

1.3.1 Pipelining Image Processing Solutions
The way an imaging pipeline is constructed can vary significantly between camera and imaging software manufacturers, depending on many factors such as the selection and order of pipelined image processing steps, preferences regarding the visual appearance of digital photographs, and implementation constraints. Typically, early processing stages [26] aim at detecting defective pixels caused by the failure of individual sensor photo-elements and correcting them by means of image interpolation. A linearization step may be needed if the captured data reside in a nonlinear space due to the involved electronics. In low-exposure images, where signal and noise levels may be comparable, it is essential to compensate for dark current noise, which is introduced into the signal through thermally generated electrons in the sensor substrate. The rest of the pipeline consists of image processing steps that critically influence the visual quality of final digital photographs. Figure 1.5 shows the images produced by typical components of the imaging pipeline. Namely, the CFA mosaic image is shown in Figure 1.5a.

FIGURE 1.5 (See color insert.) Images stored at different stages of the single-sensor camera image processing pipeline: (a) mosaic CFA image, (b) demosaicked image, (c) white-balanced image, (d) color-corrected image, and (e) tone / scale-rendered image.
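Two of the early stages mentioned above, dark current compensation and defective pixel correction, can be sketched on a Bayer mosaic as follows. This is a minimal illustration under stated assumptions, not a production method and not the procedure of Reference [26]. It assumes that a dark frame (captured with no light reaching the sensor, at the same exposure and temperature) and a map of defective pixel positions are available. Each defect is replaced by interpolation, here the average of its nearest same-color samples, which in a Bayer CFA lie two pixels away horizontally and vertically. All names are hypothetical.

```python
from typing import List, Set, Tuple

Mosaic = List[List[float]]


def subtract_dark(z: Mosaic, dark: Mosaic) -> Mosaic:
    """Dark current compensation: subtract a dark frame and clip negatives to zero."""
    return [[max(v - d, 0) for v, d in zip(z_row, d_row)]
            for z_row, d_row in zip(z, dark)]


def correct_defects(z: Mosaic, defects: Set[Tuple[int, int]]) -> Mosaic:
    """Replace each known defective sample (0-based row, col) by interpolation:
    the mean of valid same-color neighbors at offsets of +/-2 pixels.
    (Green sites also have diagonal same-color neighbors; they are ignored here.)"""
    h, w = len(z), len(z[0])
    out = [row[:] for row in z]  # do not modify the input mosaic
    for i, j in defects:
        vals = [z[i + di][j + dj]
                for di, dj in ((-2, 0), (2, 0), (0, -2), (0, 2))
                if 0 <= i + di < h and 0 <= j + dj < w
                and (i + di, j + dj) not in defects]
        if vals:  # leave the sample unchanged if no valid neighbor exists
            out[i][j] = sum(vals) / len(vals)
    return out


if __name__ == "__main__":
    z = [[100] * 5 for _ in range(5)]
    z[2][2] = 4095                                  # a "hot" photosite stuck at full scale
    print(correct_defects(z, {(2, 2)})[2][2])       # 100.0
    print(subtract_dark([[10, 3]], [[4, 5]]))       # [[6, 0]]
```

Working on the mosaic before demosaicking keeps a single defective photosite from spreading into its neighbors' interpolated colors.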
Figure 1.5b shows the full-color image restored by demosaicking the CFA mosaic data, the step that distinguishes the single-sensor camera image processing pipeline from other camera pipelines. Figure 1.5c shows the image after white balancing, which is the process of adjusting the image values to compensate for the scene illuminant, recovering the true scene coloration [27]. Since the spectral sensitivities of the camera are not identical to the human color-matching functions [26], the pipeline uses color correction to adjust the values of color pixels from those corresponding to accurate scene reproduction to those corresponding to visually pleasing scene reproduction. Figure 1.5d shows a color-corrected image. Finally, Figure 1.5e shows the image obtained after subsequent tone / scale rendering, which transforms the color image from unrendered spaces with twelve- to sixteen-bit representations to a rendered (mostly sRGB [28]) space with an eight-bit representation, as required by most output media. This step also makes the tonality of the finished image match the nonlinear characteristics of the human visual system. Additional information on white balancing can be found in Chapter 10, and on color correction and tone / scale rendering in Chapter 3. The purpose of Figure 1.5 is to illustrate the basic camera image processing chain. The processing steps included in the chain are essential for producing visually pleasing images and can be found in practically all digital cameras. Depending on available resources and camera manufacturer preferences, the processing chain can be extended by adding a series of image quality enhancement steps. As shown in Figure 1.4, the visual quality of the demosaicked image can be improved through postprocessing.
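White balancing and tone / scale rendering can be illustrated on a few linear RGB pixels. The sketch below swaps in the gray-world assumption (the average scene color is taken to be neutral), which is only one of many possible white-balancing methods and not necessarily what any particular camera or Chapter 10 uses. It then applies the standard sRGB transfer function (IEC 61966-2-1) to produce eight-bit codes. Color correction is omitted, input values are assumed to be linear and normalized to [0, 1], and the function names are hypothetical.

```python
from typing import List

RGB = List[float]


def gray_world_gains(pixels: List[RGB]) -> List[float]:
    """Per-channel gains that equalize the channel means, with the green gain fixed at 1."""
    n = len(pixels)
    means = [sum(p[k] for p in pixels) / n for k in range(3)]
    return [means[1] / m if m > 0 else 1.0 for m in means]


def white_balance(pixels: List[RGB], gains: List[float]) -> List[RGB]:
    """Scale each channel by its gain (a diagonal 3 x 3 transform)."""
    return [[p[k] * gains[k] for k in range(3)] for p in pixels]


def srgb_encode(v: float) -> float:
    """sRGB transfer function: linear value in [0, 1] -> nonlinear value in [0, 1]."""
    v = min(max(v, 0.0), 1.0)
    if v <= 0.0031308:
        return 12.92 * v                       # linear segment near black
    return 1.055 * v ** (1 / 2.4) - 0.055      # power-law segment


def render_8bit(pixels: List[RGB]) -> List[List[int]]:
    """Tone / scale rendering of linear RGB to eight-bit sRGB codes."""
    return [[round(255 * srgb_encode(c)) for c in p] for p in pixels]


if __name__ == "__main__":
    raw = [[0.2, 0.4, 0.1], [0.4, 0.8, 0.2]]  # linear values with a green cast
    gains = gray_world_gains(raw)             # approximately [2.0, 1.0, 4.0]
    balanced = white_balance(raw, gains)      # each pixel becomes (nearly) neutral
    print(render_8bit(balanced))
```

Applying white balance on linear data before the nonlinear sRGB encoding mirrors the order shown in Figure 1.5, where tone / scale rendering comes last.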
To enhance structural content, such as edges and color transitions, a sharpening step [29] should be included, whereas noise and insignificant details can be suppressed or removed via denoising or low-pass filtering [30], [31]. As the visually pleasing appearance of photographs also depends on proper exposure settings, exposure correction also needs to be included in the pipeline. Chapter 12 describes popular exposure correction solutions for digital cameras. It should be noted that professional photographers and advanced digital SLR camera users prefer no sharpening or denoising in order to preserve the natural appearance of photographs. In addition, professional photographers tend to control white balancing and exposure correction manually in order to achieve desired visual effects. On the other hand, slim compact digital cameras, image-enabled mobile phones, and personal digital assistants (PDAs) have to rely on denoising because their miniature sensors have poorer noise characteristics than the large sensors in SLR cameras. The functionality of the camera image processing pipeline is often further enhanced by adding optional steps, such as image resizing [32], [33] to alter the spatial resolution of captured images.

1.3.2 Design Alternatives
As already noted, the order of processing steps has a great impact on the overall performance of the imaging pipeline. Unfortunately, there is no ideal way of pipelining the processing steps, because the choice of a solution for each particular image processing step also strongly affects both the design and performance characteristics of the imaging pipeline. To simplify the problem of designing the single-sensor camera imaging pipeline, the position of the processing steps under consideration can be related to the position of demosaicking. Practically any processing step can be placed before or after demosaicking.
Performing steps before demosaicking can allow for significant computational savings due to the grayscale nature of CFA image data, whereas performing the same operation on demosaicked color data basically triples the number of calculations.

FIGURE 1.6 (See color insert.) Image sharpening. Finished images generated by the pipeline depicted in Figure 1.5: (a) original pipeline, i.e., no sharpening, (b) original pipeline with added image sharpening after demosaicking, and (c) original pipeline with added image sharpening before demosaicking.

Figure 1.6 shows the influence of image sharpening on the visual quality of digital camera images. Figure 1.6a is a cropped area from the image in Figure 1.5e. Comparing Figure 1.6a and Figure 1.6b reveals that adding an image sharpening step after demosaicking increases the sharpness of the camera image while amplifying the color noise and artifacts present in the image after demosaicking. On the other hand, as can be seen in Figure 1.6c, performing the same image sharpening operation before demosaicking can produce even more significant sharpening effects than the previous approach. In this case, however, the sharpening step may amplify sensor noise present in the captured CFA image data. In the context of the TIFF-EP and EXIF storage formats, image compression is a typical example of a processing task that can be placed either before or after demosaicking in the single-sensor imaging pipeline. Refer to Chapters 3, 14, and 15 for details on compression schemes. Image resizing, denoising, sharpening, white balancing, and tone / scale rendering can likewise be implemented in either position.
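The three-fold cost argument can be made concrete with a simple per-plane filter. The sketch below is a hypothetical example that uses Laplacian-based sharpening (one common choice, not necessarily the operation behind Figure 1.6). The same routine runs once over the $K_1 \times K_2$ CFA samples but once per channel, i.e., three times, on demosaicked data. How a real pipeline adapts such a kernel to the mosaic layout (for example, by using only same-color neighbors) is not modeled here.

```python
from typing import List

Plane = List[List[float]]


def sharpen(plane: Plane, amount: float = 1.0) -> Plane:
    """Add `amount` times a 4-neighbor Laplacian detail term to every interior pixel.
    Border pixels are copied unchanged to keep the example short."""
    h, w = len(plane), len(plane[0])
    out = [row[:] for row in plane]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            detail = (4 * plane[i][j] - plane[i - 1][j] - plane[i + 1][j]
                      - plane[i][j - 1] - plane[i][j + 1])
            out[i][j] = plane[i][j] + amount * detail
    return out


def filter_evaluations(k1: int, k2: int, planes: int) -> int:
    """Per-pixel filter evaluations for `planes` planes of size K1 x K2 (borders included)."""
    return k1 * k2 * planes


if __name__ == "__main__":
    cfa_cost = filter_evaluations(3000, 4000, 1)  # before demosaicking: one plane
    rgb_cost = filter_evaluations(3000, 4000, 3)  # after demosaicking: R, G, B planes
    print(rgb_cost // cfa_cost)                   # 3
```

The 3000 × 4000 size is an arbitrary example; the ratio is three for any image size.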
Some processing operations can also be implemented jointly with demosaicking, thus potentially reducing implementation cost, enhancing the performance of processing tasks, and producing higher visual quality. Usually, implementing various processing steps in a joint process is possible if they employ similar digital signal processing concepts. A good example is demosaicking and image resizing, which both essentially perform interpolation [32], [34], [35]. Another example of a joint process can be constructed by treating demosaicking and denoising from the signal estimation perspective [30]. Details on these two joint processes and performance comparisons between joint implementations and the corresponding traditional cascades can be found in Chapters 9 and 17. Table 1.1 lists other demosaicking-based processing configurations implementable in today's imaging pipelines. It should be noted that some integration can also be done independently of demosaicking. Examples include simultaneous white balancing and color correction, and tasks based on the filtering concept, such as simultaneous image smoothing and sharpening [31], [36] or simultaneous image resizing and edge enhancement [37], both achieved by combining filters with low-pass and high-pass characteristics. Integrating and performing different processing steps in the camera imaging pipeline jointly is interesting, in particular from the design point of view. However, depending on the nature of the processing steps to be integrated, this may be impractical due to increased memory requirements and design complexity, as well as reduced flexibility and the large number of calculations to be repeated if some of the steps have to be rerun with different settings.

TABLE 1.1 Feasibility of some interesting processing configurations for today's imaging pipelines. Configurations: before demosaicking, after demosaicking, joint with demosaicking. Processing tasks: image compression, gamma correction, white balancing, denoising, image sharpening, image resizing, superresolution reconstruction, digital stabilization, red-eye detection / removal, face detection, face recognition, digital rights management, image encryption, forensics.

Some designs can flexibly accommodate various image, video, and multimedia processing operations that are either used occasionally or may be required in the near future as default components of the imaging pipeline. Examples of the former are red-eye removal [38], [39], which aims at detecting and correcting defects in photographs caused by strong, sudden light reflecting off the blood vessels in the retina, and video stabilization, which compensates for undesired camera movements [40], [41]. For details, refer to Chapters 16 and 20, respectively. The latter include emerging CFA video compression, used to store captured video data by reducing the spectral and spatiotemporal redundancies present in single-sensor captured image sequences [42]; video-demosaicking, which restores full-color video from its CFA-sampled version [43], [44], [45]; and resolution enhancement, which aims at producing images or frames of higher resolution than the input [44], [46]. Detailed discussions of these topics can be found in Chapters 15, 18, and 19, respectively. In terms of multimedia, many digital cameras support audio, either alone or in conjunction with video. Image-enabled consumer electronic devices that greatly benefit from audio and video support are mobile video phones and PDAs.
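Simultaneous white balancing and color correction, listed above among integrations that do not involve demosaicking, is possible because both are linear 3 × 3 transforms applied to each pixel, so they can be merged into a single matrix ahead of time. In the sketch below, the white-balance gains and the color-correction matrix are made-up values for illustration (the rows of the hypothetical matrix sum to 1 so that neutral colors stay neutral). The point is only that one merged matrix reproduces the two-step cascade.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Product of two 3 x 3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def apply(m: Matrix, p: Vector) -> Vector:
    """Apply a 3 x 3 matrix to an RGB vector."""
    return [sum(m[i][k] * p[k] for k in range(3)) for i in range(3)]


# Hypothetical white-balance gains, written as a diagonal matrix W.
W = [[2.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.5]]

# Hypothetical color-correction matrix C (each row sums to 1).
C = [[1.6, -0.4, -0.2],
     [-0.3, 1.5, -0.2],
     [0.0, -0.5, 1.5]]

# Merged transform: C applied after W equals (C W) applied once; computed once per image.
M = matmul(C, W)

if __name__ == "__main__":
    p = [0.2, 0.5, 0.3]
    print(apply(C, apply(W, p)))  # two-step cascade
    print(apply(M, p))            # single merged step, same result up to rounding
```

Per pixel, the cascade needs 3 multiplications for W plus 9 for C, whereas the merged matrix needs 9; the trade-off noted above still applies, since changing the white-balance setting requires recomputing M.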
It is quite common to employ in-camera face detection [47], [48] to improve auto-focusing and to optimize exposure and flash output. Face detection can also help improve the performance of automatic red-eye removal and produce visually pleasing photographs with enhanced color and tonal quality by setting the optimal white balance and color correction. In addition to face detection, face recognition [49], [50] is used in digital photo archives to organize and retrieve photos. Finally, as in other areas dealing with digital media, there are already certain needs in digital camera imaging for digital rights management (DRM) [51], [52], [53], encryption [54], [55], and forensics [56], [57] in order to ensure digital photograph integrity, secure the transmission of photos over public communication networks, and protect intellectual property rights.

1.4 Visual Artifacts in Digital Camera Images
Digital images are often corrupted by noise and various artifacts, which significantly degrade the value of the captured visual information, decrease the perceptual fidelity of an image, and complicate many image processing and analysis tasks. In practice, the relations between the sources of these defects are very complex: reducing or eliminating one type of defect can significantly affect the appearance of another. The following discussion focuses on common types of visual impairments present in single-sensor captured images, namely image noise, demosaicking artifacts, coloration and exposure shifts, and compression artifacts. Discussions of other, mostly optics-based types of defects, such as spherical and chromatic aberrations, vignetting, and flare effects, can be found elsewhere (e.g., Reference [58]).

1.4.1 Image Noise
Noise in digital camera images usually appears as random speckles in otherwise smooth regions, altering both the tone and the color of the original pixels.
Typically, noise is caused by random sources associated with quantum signal detection, signal-independent fluctuations, and inhomogeneity in the responsiveness of the sensor elements. The appearance of noise varies among digital camera models. Noise increases with the sensitivity (ISO) setting, the length of the exposure, and temperature. It can also vary within an individual image; darker regions usually suffer more from noise than brighter ones. The level of noise further depends on the characteristics of the camera electronics and the physical size of the photosites in the sensor. Larger photosites usually have better light-gathering abilities, thus producing a stronger signal and a higher signal-to-noise ratio. A survey of these issues, supported by numerous examples, can be found in Reference [59].

According to the tristimulus theory of color representation, the RGB color vector $\mathbf{x}_{(r,s)} = [x_{(r,s)1}, x_{(r,s)2}, x_{(r,s)3}]^T$ is uniquely defined by its length (magnitude)

$$M_{\mathbf{x}_{(r,s)}} = \|\mathbf{x}_{(r,s)}\| = \left(x_{(r,s)1}^{2} + x_{(r,s)2}^{2} + x_{(r,s)3}^{2}\right)^{1/2}$$

and its orientation (direction)

$$D_{\mathbf{x}_{(r,s)}} = \frac{\mathbf{x}_{(r,s)}}{\|\mathbf{x}_{(r,s)}\|} = \frac{\mathbf{x}_{(r,s)}}{M_{\mathbf{x}_{(r,s)}}}$$

in a three-dimensional vector space, where $\|D_{\mathbf{x}_{(r,s)}}\| = 1$, i.e., $D_{\mathbf{x}_{(r,s)}}$ lies on the unit sphere. Both the direction and the magnitude of $\mathbf{x}_{(r,s)}$ significantly influence the perception of its color by a human observer, and both are affected by noise [13]. Since noisy pixels deviate from their noise-free neighbors, evaluating the magnitude and directional differences between vectors in a local image area forms the basis of a number of noise filtering techniques; Reference [31] surveys popular filtering approaches in detail. The magnitude and direction of a color vector are commonly associated with its luminance (intensity) and chrominance (color) characteristics, respectively. Thus, noise can be seen as fluctuations in intensity and color, and can be handled separately in the luminance and chrominance domains.
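The magnitude/direction decomposition, and the way magnitude and directional differences drive noise filtering, can be sketched in plain Python. The vector median filter and the basic vector directional filter below are classic order-statistics filters of the kind surveyed in Reference [13]. They are illustrations only, not the specific filters used in the cameras discussed here, and all function names are hypothetical.

```python
import math

def magnitude(x):
    """Length M_x = (x1^2 + x2^2 + x3^2)^(1/2) of an RGB vector x."""
    return math.sqrt(sum(c * c for c in x))

def direction(x):
    """Unit vector D_x = x / M_x on the unit sphere; None for the zero vector."""
    m = magnitude(x)
    return None if m == 0 else tuple(c / m for c in x)

def angle(x, y):
    """Angle (radians) between two RGB vectors, i.e., their directional difference."""
    mx, my = magnitude(x), magnitude(y)
    if mx == 0 or my == 0:
        return 0.0  # direction undefined; treated as no directional difference
    cos_a = sum(a * b for a, b in zip(x, y)) / (mx * my)
    return math.acos(max(-1.0, min(1.0, cos_a)))  # clamp floating-point rounding

def distance(x, y):
    """Euclidean distance: sensitive to both magnitude and direction differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def vector_median(window):
    """Vector median output: the window sample with the smallest
    aggregated Euclidean distance to all samples in the window."""
    return min(window, key=lambda x: sum(distance(x, y) for y in window))

def vector_directional(window):
    """Basic vector directional filter output: the window sample with the
    smallest aggregated angle to all samples in the window."""
    return min(window, key=lambda x: sum(angle(x, y) for y in window))

# A 3x3 window of similar pixels containing one impulsive outlier.
window = [(100, 120, 80)] * 4 + [(102, 118, 82)] * 4 + [(250, 10, 10)]
print(vector_median(window))       # → (102, 118, 82)
print(vector_directional(window))  # one of the similar pixels, never the outlier
```

The vector median ranks samples by Euclidean distance, so it reacts to both magnitude and direction; the directional filter ranks by angle only, which favors chromaticity preservation but ignores brightness differences.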
The relative amounts of color and luminance noise differ significantly among digital camera models. As shown in Figure 1.7, color noise can be completely eliminated; however, suppressing luminance noise can result in unnatural-looking images and excessive blur. For additional discussions on noise suppression in digital camera images, refer to Chapters 3 and 9.

Single-Sensor Digital Color Imaging Fundamentals 13

FIGURE 1.7 (See color insert.) Cropped parts of a color checker image captured with the ISO 1600 setting: (a) captured noisy image, (b) luminance noise suppression, (c) color noise suppression, and (d) both luminance and color noise suppression.

1.4.2 Demosaicking Artifacts

Demosaicking is the core of image processing in single-sensor digital color cameras. Since its goal is to restore both the color and the structural content of an image from mosaic sensor data, the quality of demosaicking significantly influences the amount of detail and artifacts in finished digital photographs. Figure 1.8 shows several examples of typical demosaicking issues. Figure 1.8a shows an image which suffers from zipper effects. These effects usually appear along abrupt edges when pixels from both sides of an edge are used in demosaicking. Figure 1.8b shows an image with isolated color artifacts, which are often introduced in detail-rich regions due to the lack of spectral and structural information during demosaicking. Because of their localized nature, both of these defects are less apparent than other impairments when an image is printed or displayed at its natural resolution. As can be seen in Figure 1.8c, this is not the case for aliasing artifacts or color moiré patterns, which usually form large, visually annoying regions. Therefore, these artifacts cannot be removed using traditional low-pass filters, which rely on local image characteristics.
Aliasing artifacts appear in areas where the resolution limit of the sensor has been reached and where color sampling prevents the correct detection of edge orientations. This is particularly true in fine-texture regions, where aliasing artifacts often take the form of repeating patterns of false colors.

FIGURE 1.8 (See color insert.) Typical demosaicking defects: (a) zipper effects, (b) color shifts, (c) aliasing artifacts, and (d) blur effects.

Finally, Figure 1.8d shows an image with apparent resolution loss due to excessive blurring of its structural content by a demosaicking method with insufficient edge-preserving characteristics.

As discussed above, demosaicking artifacts vary in their characteristics, appearance, and size. Considering how difficult it is to reverse color sampling in areas with complex structural content, demosaicking artifacts may never be fully avoided in real-life situations. Therefore, many digital camera designers focus on achieving trade-offs among noise, image sharpness, demosaicking artifacts, and processing time rather than emphasizing any single one of these issues.

1.4.3 Coloration Shifts

Image sensors are calibrated for certain light characteristics. Whenever an image is shot under light of a color temperature different from the one for which the sensor was calibrated, the image coloration is shifted from the perceived coloration of the scene. This is readily observable for neutral (i.e., achromatic) colors, particularly white, which is one of the most recognizable colors due to the high and approximately equal contributions of all three color primaries. Unlike noise and demosaicking artifacts, which are localized, an incorrect white balance setting affects the appearance of the whole image. Therefore, many consider white balance the most important characteristic of captured images.
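A global white-balance correction can be illustrated with the gray-world assumption, i.e., that the average color of a scene is achromatic. This is a swapped-in classic technique chosen only for illustration (Chapter 10 covers the actual approaches). The function name, the use of G as the reference channel, and the clipping value are assumptions. Because one gain per channel multiplies every pixel, the sketch also shows why a white-balance error affects the whole image rather than local regions.

```python
def gray_world_balance(pixels, max_value=255.0):
    """Gray-world white balance sketch.

    pixels: list of (R, G, B) tuples in linear units.
    Scales R and B so that their channel means match the G mean.
    Returns (balanced_pixels, gains).
    """
    n = len(pixels)
    means = [sum(p[k] for p in pixels) / n for k in range(3)]
    # One global gain per channel; a channel with zero mean is left unchanged.
    gains = [means[1] / m if m > 0 else 1.0 for m in means]
    balanced = [
        tuple(min(max_value, c * g) for c, g in zip(p, gains))  # clip to the valid range
        for p in pixels
    ]
    return balanced, gains

# A bluish (cool) cast: the B mean exceeds the G mean, and the R mean falls short of it.
balanced, gains = gray_world_balance([(50, 100, 150), (100, 200, 250)])
print(gains)     # → [2.0, 1.0, 0.75]
print(balanced)  # → [(100.0, 100.0, 112.5), (200.0, 200.0, 187.5)]
```

After correction all three channel means are equal (150 in this example), which is exactly the gray-world condition. Scenes dominated by a single color violate the assumption, one reason practical cameras rely on more elaborate methods.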
FIGURE 1.9 (See color insert.) Coloration shifts due to incorrect white balance settings: (a) cool appearance, (b) warm appearance, (c) grayish appearance, (d) saturation effects. See also Figure 1.5e, which corresponds to the as-shot white balance setting.

To produce photographs with a natural color tint, an image processing operation called white balancing is performed on captured images. Digital SLR camera users usually set the white balance parameters in their camera according to the shooting situation. Alternatively, color balance can also be adjusted in software used to process camera images stored in a raw format. Popular white-balancing approaches are discussed in detail in Chapter 10. Figure 1.9 shows examples of incorrect white balancing. As demonstrated, such images often appear bluish (Figure 1.9a) or reddish (Figure 1.9b), which is referred to as a cool or a warm appearance, respectively. Some algorithms produce images with a grayish appearance (Figure 1.9c) or with saturated colors (Figure 1.9d). Such images are clearly less visually pleasing than the image obtained using the as-shot white balance setting (Figure 1.5e).

1.4.4 Exposure Shifts

The exposure setting is another important characteristic which affects the global appearance of captured images. By opening and closing the aperture, the camera controls the amount of light reaching the sensor. By deciding how long to leave the shutter open, it controls the period during which the sensor is exposed to light and collects photons. Finally, adjusting the ISO setting also affects the exposure. Depending on the exposure settings, images can range from dark, an effect known as underexposure, to bright, which is referred to as overexposure. Figure 1.10 shows images of the same scene captured with different exposure settings.

FIGURE 1.10 (See color insert.) Influence of exposure settings on image quality: (a) underexposure, (b) normal exposure, and (c) overexposure.

To ensure proper exposure, digital cameras use various compensation methods which first measure the amount of light reaching the image sensor and then adjust the exposure accordingly. These methods, however, often fail in complex scenarios where different subjects have different reflectivity. Therefore, cameras and PC software also allow manual control of the exposure by simply multiplying or dividing sensor readings by factors of two. Refer to Chapter 12 for additional information on exposure compensation methods.

1.4.5 Image Compression Artifacts

Finished digital camera images are commonly stored in the EXIF format using JPEG compression to reduce their file size and allow more pictures to fit on a memory card. Depending on the compression settings, JPEG can reduce the size of the original files tenfold, and even more for images with solid-color backgrounds. Due to lossy coding, JPEG-compressed images typically have a blocky appearance, which is often referred to as image compression artifacts. These artifacts (as well as the compression ability of JPEG itself) result from discarding the fine differences between neighboring pixels with similar luminance and chrominance components, in a manner which prevents recovering their original values. Since JPEG and other lossy compression formats destroy fine details and edges, many consider compression artifacts a bigger problem than sensor noise. Fortunately, many of today's digital cameras allow captured images to be stored in a raw format following the TIFF-EP standard. Some raw formats do not use compression at all, whereas others apply lossy compression to the CFA image data, very often using quantization and filtering, which results in a loss of resolution.
However, most raw formats rely on lossless compression to reduce file size without affecting image quality. Figure 1.11 demonstrates the effect of compression on image quality. As shown in Figure 1.11a, compressing full-color data using JPEG produces block artifacts and reduces the original structural content. Due to the low-pass nature of lossy compression, it also suppresses potential demosaicking artifacts. Applying the same compression scheme to CFA mosaic data using a structure conversion [60], [61] results in a less blocky final image (Figure 1.11b). This may, however, be accompanied by a higher level of noise and demosaicking artifacts, because demosaicking is performed after lossy compression. Finally, Figure 1.11c shows that compression artifacts can be avoided by using lossless coding. Chapters 14 and 15 discuss image compression issues in detail.

FIGURE 1.11 (See color insert.) Influence of data compression on image quality for the same demosaicking method: (a) lossy compression of a full-color image, (b) lossy compression of a CFA image, and (c) lossless compression.

1.5 What Is Really Important in Digital Camera Image Processing?

To prevent the introduction of various artifacts into finished digital camera images, well-designed processing solutions should follow image characteristics as closely as possible. In still digital photography, this means utilizing spatial, structural, and spectral characteristics; in digital video capture, temporal characteristics should also be considered. The need to use spatial characteristics results from the fact that spatially neighboring pixels are usually highly correlated. Structural characteristics should be followed because edges and fine details convey essential information about a scene.
Using spectral characteristics is essential since a typical natural image exhibits significant correlation among its R, G, and B color planes. Finally, temporal characteristics are good indicators of scene changes and object motion in digital video. To support the above discussion, Figure 1.12 illustrates the importance of using structural and spectral characteristics during demosaicking. As this example shows, omitting either of these characteristics usually results in excessive blur, color shifts, and aliasing effects.

FIGURE 1.12 (See color insert.) Different visual quality of images demosaicked using: (a) spatial characteristics, (b) spatial and spectral characteristics, (c) spatial and structural characteristics, and (d) spatial, structural, and spectral characteristics.

1.5.1 Spatial Characteristics

Natural images are nonstationary due to the noise and blur processes encountered during image formation. The presence of edges and fine details results in additional variations between neighboring regions. To reduce processing errors, many camera image processing solutions (e.g., demosaicking, image resizing, noise filtering, and edge sharpening) operate in small localized image areas, each of which can be treated as stationary. Such areas are localized by placing a supporting window [13], usually centered on the pixel location $(r,s)$ under consideration. The window, defined as the set $\zeta$ of pixel locations $(i,j)$ in a local neighborhood of $(r,s)$, slides over the entire image, successively placing every target pixel at the center of a local neighborhood denoted by $\zeta$. The procedure generates the output of the processing operation as a function $f(\cdot)$ of the samples identified by $\zeta$, defined as $f(z_{(i,j)}; (i,j) \in \zeta)$ for CFA data or $f(\mathbf{x}_{(i,j)}; (i,j) \in \zeta)$ for full-color data.
In this way, processing solutions can minimize local distortion in the output image. The performance of a processing solution based on the windowing concept is generally influenced by the size of the local area and by the actual number of samples in it that are used as input to the processing solution. Note that these two quantities often differ. The former can be seen as an indicator of memory requirements, as it determines how many image lines (or columns) must be buffered. The latter indicates the processing speed, as it determines how many normalized operations (e.g., additions and multiplications), as dictated by the function $f(\cdot)$, have to be performed on the pixels localized by $\zeta$. For example, it is common in demosaicking to use $\zeta = \{(r-2,s), (r-1,s), (r,s-2), (r,s-1), (r,s), (r,s+1), (r,s+2), (r+1,s), (r+2,s)\}$. Implementing such demosaicking may require buffering five image lines and accessing five columns within these lines to read all necessary values. In the literature, such demosaicking is often referred to as 5 × 5 demosaicking, although only nine of the twenty-five available samples are actually used in $f(\cdot)$.

FIGURE 1.13 (See color insert.) Correlation characteristics of the image shown in Figure 1.5e. Brighter values correspond to higher correlations inside the supporting window. (a) RGB values denote spatial correlations in the red, green, and blue channels of the original image, respectively. (b) RGB values denote spectral correlations between the red and green, green and blue, and red and blue channels of the original image, respectively.

Pixels inside the supporting window usually exhibit significant spatial correlation, meaning that their values are very similar. As shown in Figure 1.13a, this is especially true in flat and slowly changing regions. On the other hand, spatial correlations are usually weaker in areas rich in edges and fine details.
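The cross-shaped window from the demosaicking example above, and the difference between its footprint (lines and columns to buffer) and the number of samples actually passed to f(·), can be sketched as follows. The image layout (a list of rows), the border replication, and all names are assumptions for illustration; the mean used as f(·) merely stands in for a real demosaicking or filtering function.

```python
# Cross-shaped supporting window zeta as (row, column) offsets from the
# center (r, s): it spans 5 lines x 5 columns but holds only 9 samples.
ZETA_CROSS = [(-2, 0), (-1, 0), (0, -2), (0, -1), (0, 0),
              (0, 1), (0, 2), (1, 0), (2, 0)]

def window_samples(image, r, s, offsets=ZETA_CROSS):
    """Samples u(i, j), (i, j) in zeta, for the window centered at (r, s).

    Border handling is an assumption: coordinates are clamped to the image,
    which replicates edge pixels.
    """
    rows, cols = len(image), len(image[0])
    samples = []
    for dr, ds in offsets:
        i = min(max(r + dr, 0), rows - 1)
        j = min(max(s + ds, 0), cols - 1)
        samples.append(image[i][j])
    return samples

def sliding_window(image, f, offsets=ZETA_CROSS):
    """Slide the window over every pixel and apply f to the samples it selects."""
    return [[f(window_samples(image, r, s, offsets)) for s in range(len(image[0]))]
            for r in range(len(image))]

def footprint(offsets):
    """(lines to buffer, columns read per line, samples passed to f)."""
    lines = max(dr for dr, _ in offsets) - min(dr for dr, _ in offsets) + 1
    columns = max(ds for _, ds in offsets) - min(ds for _, ds in offsets) + 1
    return lines, columns, len(offsets)

image = [[5 * r + s for s in range(5)] for r in range(5)]
print(footprint(ZETA_CROSS))                                   # → (5, 5, 9)
print(sliding_window(image, lambda v: sum(v) / len(v))[2][2])  # → 12.0
```

The footprint drives line-buffer memory, while the sample count drives the number of operations per output pixel; that is why a "5 × 5" method can be considerably cheaper than a full 25-sample window.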
Thus, spatial characteristics play an important role in many image processing operations, such as image compression, demosaicking, filtering, and resizing.

1.5.2 Structural Characteristics

Edges and flat regions constitute the basic structural content of a digital image. Edges can be seen as discontinuities in the vector space representing the color image. Edges separate image regions of different color or intensity; thus, they are essential for human perception. It is therefore important to preserve edges and fine details during image processing. Popular edge operators process the information contained in a local image neighborhood, which is determined by the supporting window $\zeta$. Fast and robust edge detection can be achieved using image-agnostic edge operators, which require no prior information about the image structure. Edge detectors for color images can be divided into two basic classes: scalar operators, which process each channel of a color image separately or require a color-to-luminance conversion before edge detection, and vector operators, which fully utilize the spectral correlation and process the pixels of a color image as vectors. Both approaches are surveyed in References [31] and [62]. Their good trade-off between computational complexity and detection performance makes scalar edge operators ideal candidates to support various camera image processing steps. Typical steps which rely on some form of edge detection are demosaicking, image resizing, denoising, and image sharpening. Edge detection may also help in image compression and white balancing if more localized processing is needed in these steps. Scalar edge detectors either use the concept of gradients (the first-order directional derivatives of the image) to determine the edge contrast used in edge map formation, or the concept of zero crossings of the second-order directional derivatives to identify edge locations.
Through the gradient of the image function $U_{(r,s)}$,

$$\nabla U_{(r,s)} = \left[\frac{\partial U_{(r,s)}}{\partial r}, \frac{\partial U_{(r,s)}}{\partial s}\right], \qquad (1.1)$$

the first derivative uses the gradient magnitude

$$\|\nabla U_{(r,s)}\| = \sqrt{\left(\frac{\partial U_{(r,s)}}{\partial r}\right)^{2} + \left(\frac{\partial U_{(r,s)}}{\partial s}\right)^{2}} \qquad (1.2)$$

to provide information on the rate of change of image intensity, and the gradient direction

$$\theta_{(r,s)} = \arctan\left(\frac{\partial U_{(r,s)}}{\partial r} \middle/ \frac{\partial U_{(r,s)}}{\partial s}\right) \qquad (1.3)$$

to determine the orientation of an edge. Since the second derivative is zero where the first derivative reaches a maximum, edges can also be localized by finding the zero crossings of the second derivatives of $U_{(r,s)}$. One possible implementation of second-order derivatives is given by

$$\Delta U_{(r,s)} = \nabla^{2} U_{(r,s)} = \frac{\partial^{2} U_{(r,s)}}{\partial r^{2}} + \frac{\partial^{2} U_{(r,s)}}{\partial s^{2}}. \qquad (1.4)$$

With respect to single-sensor imaging, edge detection can be performed on mosaic CFA data or on full-color data. In either case, popular edge operators can be implemented as follows:

$$m_{(r,s)} = \mathbf{w} * \mathbf{U}_{(r,s)} = \sum_{(i,j) \in \zeta} w_{(i,j)}\, u_{(i,j)} \qquad (1.5)$$

where $m_{(r,s)}$ is the detector response, $u_{(\cdot,\cdot)}$ denotes the image data, and $w_{(\cdot,\cdot)}$ denotes a coefficient of the convolution mask $\mathbf{w}$ used to approximate the first- or second-order directional derivative edge operator. Each coefficient $w_{(i,j)}$ is associated with the spatial location $(i,j)$. The term $\mathbf{U}_{(r,s)} = \{u_{(i,j)}; (i,j) \in \zeta\}$ denotes the set of (CFA or full-color) pixels used as input when performing detection at the pixel location $(r,s)$ under consideration. Typically, the masks of first-order directional derivative operators are defined so that the gradient magnitude can be determined in each of two orthogonal directions. For example, the convolution masks of the well-known Sobel operator are defined as follows:

$$\mathbf{w} = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \quad \mathbf{w} = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}, \quad \text{or} \quad \mathbf{w} = \begin{bmatrix} -2 & -1 & 0 \\ -1 & 0 & 1 \\ 0 & 1 & 2 \end{bmatrix}, \quad \mathbf{w} = \begin{bmatrix} 0 & 1 & 2 \\ -1 & 0 & 1 \\ -2 & -1 & 0 \end{bmatrix} \qquad (1.6)$$

The first pair of convolution masks allows for edge detection in the horizontal and vertical directions, whereas diagonal edges can be located using the second pair.
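Equation 1.5 with the first pair of standard Sobel masks, combined with Equations 1.2 and 1.3, can be sketched in plain Python as below. The mask is applied in the correlation form in which Equation 1.5 is written (no kernel flip), borders are handled by clamping coordinates, and the image is a list of rows; these choices and all names are assumptions.

```python
import math

# First pair of Sobel masks: derivative along columns (s) and along rows (r).
SOBEL_S = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_R = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def mask_response(image, r, s, w):
    """Eq. (1.5): m(r,s) = sum of w(i,j) * u(i,j) over the 3x3 window centered at (r, s).

    Coordinates outside the image are clamped (edge replication, an assumption).
    """
    rows, cols = len(image), len(image[0])
    m = 0
    for a in range(3):
        for b in range(3):
            i = min(max(r + a - 1, 0), rows - 1)
            j = min(max(s + b - 1, 0), cols - 1)
            m += w[a][b] * image[i][j]
    return m

def sobel_gradient(image, r, s):
    """Gradient magnitude (Eq. 1.2) and direction (Eq. 1.3) at (r, s)."""
    d_r = mask_response(image, r, s, SOBEL_R)  # approximates dU/dr up to a scale factor
    d_s = mask_response(image, r, s, SOBEL_S)  # approximates dU/ds up to a scale factor
    magnitude = math.hypot(d_r, d_s)
    # atan2 is the quadrant-aware form of arctan((dU/dr) / (dU/ds)); the edge
    # itself runs perpendicular to this gradient direction.
    theta = math.atan2(d_r, d_s)
    return magnitude, theta

step = [[0, 0, 10, 10] for _ in range(4)]  # vertical edge between columns 1 and 2
print(sobel_gradient(step, 1, 1))          # → (40.0, 0.0)
```

The diagonal masks can be passed to `mask_response` in the same way; using `atan2` instead of a plain division avoids a division by zero when the horizontal derivative vanishes.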
A similar approach is to perform edge detection using second-order directional derivative operators. Equation 1.4 represents the well-known Laplacian operator, which in practice is approximated for a four- and an eight-neighborhood, respectively, as follows:

$$\mathbf{w} = \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}, \quad \text{or} \quad \mathbf{w} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{bmatrix} \qquad (1.7)$$

FIGURE 1.14 (See color insert.) Structural content of the image shown in Figure 1.5e: (a) horizontal edges, and (b) vertical edges.

Each of the two convolution masks above can be expressed as a sum of directional filters with coefficients [1, −2, 1]: two (horizontal and vertical) for the four-neighborhood mask, and four (adding the two diagonals) for the eight-neighborhood mask. This is an important property, because determining edge orientations is essential in various interpolation and filtering steps to avoid processing errors. Figure 1.14 shows the result of edge detection in the horizontal and vertical directions. As can be seen, the directional edge operators differ significantly in their response to the structural content of an image. Therefore, a number of image processing algorithms first detect the edge orientation and then perform the processing operation along the edge in order to preserve it. This concept is quite common in camera image processing operations such as demosaicking, image resizing, and filtering.

1.5.3 Spectral Characteristics

Figure 1.13a has already shown that images consist of regions where neighboring pixels are well correlated in a spatial sense. Small regions in natural images are also commonly well correlated in a spectral sense, meaning that the different color channels in these regions exhibit similar dynamics, which basically relates to the structural content of the image. As shown in Figure 1.13b, significant spectral correlations can typically be observed in slowly changing regions, whereas spectral correlations are lower in regions with color edges.
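The chapter does not state how the correlation maps in Figure 1.13 were computed. One plausible way to quantify local spectral correlation, assumed here for illustration, is the Pearson correlation coefficient between two color channels over the pixels of one supporting window, as sketched below with hypothetical names and sample values.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equally long sample lists.

    Returns None when either list is constant, where correlation is undefined.
    """
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return None
    return cov / math.sqrt(va * vb)

def local_spectral_correlation(window_pixels, k1, k2):
    """Correlation between channels k1 and k2 (0 = R, 1 = G, 2 = B) of the RGB
    pixels inside one supporting window."""
    return pearson([p[k1] for p in window_pixels], [p[k2] for p in window_pixels])

# Slowly changing region: G and B follow R.
smooth = [(r, 2 * r + 5, r + 30) for r in (10, 20, 30, 40)]
# Color edge: R steps up while G steps down across the window; B is constant.
edge = [(10, 200, 50), (12, 198, 50), (240, 20, 50), (238, 22, 50)]
print(local_spectral_correlation(smooth, 0, 1))  # → 1.0
print(local_spectral_correlation(edge, 0, 1))    # strongly negative
print(local_spectral_correlation(edge, 0, 2))    # → None
```

In the smooth window G tracks R, giving a coefficient of 1; across the color edge R rises while G falls, so the coefficient becomes strongly negative; and a constant channel has no defined correlation, so the sketch reports `None` rather than an arbitrary number.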
A number of camera image processing steps, such as those based on interpolation, filtering, and color manipulation concepts, assume significant spectral correlation among the RGB planes of natural color images and exploit the spectral characteristics of the captured image during its processing.

Chromaticity is one of the most characteristic features of color pixels. It relates to the directional characteristics of the three-component vectors representing color pixels in an RGB space [63], [64]. Thus, it is reasonable to assume that two color vectors $\mathbf{x}_{(r,s)}$ and $\mathbf{x}_{(i,j)}$ occupying spatially neighboring locations $(r,s)$ and $(i,j)$ have the same chromaticity characteristics if they are collinear in the RGB color space [2]. Based on the definition of the dot product, $\mathbf{x}_{(r,s)} \cdot \mathbf{x}_{(i,j)} = M_{\mathbf{x}_{(r,s)}} M_{\mathbf{x}_{(i,j)}} \cos\angle(\mathbf{x}_{(r,s)}, \mathbf{x}_{(i,j)})$, where $\angle(\mathbf{x}_{(r,s)}, \mathbf{x}_{(i,j)})$ denotes the angle between the RGB color vectors $\mathbf{x}_{(r,s)}$ and $\mathbf{x}_{(i,j)}$, it follows that

$$\angle\left(\mathbf{x}_{(r,s)}, \mathbf{x}_{(i,j)}\right) = 0 \;\Leftrightarrow\; \frac{\sum_{k=1}^{3} x_{(r,s)k}\, x_{(i,j)k}}{\sqrt{\sum_{k=1}^{3} x_{(r,s)k}^{2}}\,\sqrt{\sum_{k=1}^{3} x_{(i,j)k}^{2}}} = 1 \qquad (1.8)$$

The above concept can be further extended by considering both the magnitude and the directional characteristics of color vectors, as both are essential for human perception. References [65] and [66] showed that this can be achieved by enforcing the underlying modelling principle of identical color chromaticity on linearly shifted variants of the vectors $\mathbf{x}_{(r,s)}$ and $\mathbf{x}_{(i,j)}$:

$$\angle\left(\mathbf{x}_{(r,s)} + \gamma\mathbf{I}, \mathbf{x}_{(i,j)} + \gamma\mathbf{I}\right) = 0 \;\Leftrightarrow\; \frac{\sum_{k=1}^{3} \left(x_{(r,s)k} + \gamma\right)\left(x_{(i,j)k} + \gamma\right)}{\sqrt{\sum_{k=1}^{3} \left(x_{(r,s)k} + \gamma\right)^{2}}\,\sqrt{\sum_{k=1}^{3} \left(x_{(i,j)k} + \gamma\right)^{2}}} = 1 \qquad (1.9)$$

where $\mathbf{I} = [1, 1, 1]^T$ is a unity vector and $x_{(\cdot,\cdot)k} + \gamma$ is the $k$-th component of the linearly shifted vector $\mathbf{x}_{(\cdot,\cdot)} + \gamma\mathbf{I} = [x_{(\cdot,\cdot)1} + \gamma, x_{(\cdot,\cdot)2} + \gamma, x_{(\cdot,\cdot)3} + \gamma]^T$. The linear shift $\gamma$ is a design parameter which controls the influence of the directional and magnitude characteristics of color vectors during processing.
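The left-hand ratio of Equations 1.8 and 1.9 is simply the cosine of the angle between the (shifted) vectors, so the collinearity test and the effect of the design parameter γ can be checked numerically. The function name and the sample vectors are hypothetical.

```python
import math

def shifted_cosine(x, y, gamma=0.0):
    """Cosine of the angle between x + gamma*I and y + gamma*I, with I = [1, 1, 1]^T.

    gamma = 0 gives the ratio in Eq. (1.8); a value of 1 means the (shifted)
    vectors are collinear, i.e., they share the same chromaticity under the model.
    """
    xs = [c + gamma for c in x]
    ys = [c + gamma for c in y]
    dot = sum(a * b for a, b in zip(xs, ys))
    norm = math.sqrt(sum(a * a for a in xs)) * math.sqrt(sum(b * b for b in ys))
    return dot / norm

scaled = ((10, 20, 30), (20, 40, 60))  # same direction, different magnitude
offset = ((10, 20, 30), (20, 30, 40))  # same differences between components
for gamma in (0.0, 1000.0):
    print(gamma, shifted_cosine(*scaled, gamma), shifted_cosine(*offset, gamma))
```

With γ = 0 the scaled pair is collinear while the offset pair is not. With a large γ both shifted vectors approach the direction of I, so the offset pair becomes nearly collinear while the scaled pair drifts slightly away from collinearity. This matches the behavior of the γ = 0 (color-ratio) and γ → ∞ (color-difference) limits of the two-component model that follows.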
It can be shown that any component of the RGB vector $\mathbf{x}_{(r,s)}$ can be derived from Equation 1.8 or Equation 1.9, based on the two other components of $\mathbf{x}_{(r,s)}$ and all three components of $\mathbf{x}_{(i,j)}$, by solving a quadratic equation. Using full-color information in the calculations makes this solution attractive and potentially highly accurate. Unfortunately, the required number of multiplications may prevent its implementation in current real-time camera imaging pipelines. To reduce the complexity, the concept behind Equation 1.9 can be applied to two-component vectors, which can be created from RGB color vectors by omitting one of their three components. It is straightforward to show that for two-component vectors, each comprising one luminance (G) component and one chrominance (R or B) component, the approach reduces to the following:

$$\frac{x_{(r,s)k} + \gamma}{x_{(i,j)k} + \gamma} = \frac{x_{(r,s)2} + \gamma}{x_{(i,j)2} + \gamma} \qquad (1.10)$$

where $k = 1$ refers to R components and $k = 3$ refers to B components. This constitutes the normalized color-ratio model [67]. Setting the design parameter to $\gamma = 0$ or $\gamma \to \infty$ allows for further simplification of the above expression. For $\gamma = 0$, it reduces to the color-ratio model of Reference [68]:

$$\frac{x_{(r,s)k}}{x_{(i,j)k}} = \frac{x_{(r,s)2}}{x_{(i,j)2}} \qquad (1.11)$$

whereas for $\gamma \to \infty$, the color-difference model of Reference [69] is approximated:

$$x_{(r,s)k} - x_{(i,j)k} = x_{(r,s)2} - x_{(i,j)2} \qquad (1.12)$$

Equations 1.11 and 1.12 constitute early, yet still popular, spectral modelling approaches due to their relatively good performance and high computational efficiency, which is particularly true for Equation 1.12. The spectral models presented in this section can be used in a number of color image processing operations.
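A small worked example shows how Equations 1.10 to 1.12 estimate a missing R value, and how γ moves the normalized color-ratio model between the other two. The values are assumed: G = 120 at the target location (r, s), and R = 60, G = 100 at one neighbor (i, j). Solving Equation 1.10 for the unknown gives x(r,s)k = (x(i,j)k + γ)(x(r,s)2 + γ)/(x(i,j)2 + γ) − γ. Function names are hypothetical, and practical solutions combine estimates from several neighbors rather than relying on one.

```python
def estimate_normalized_ratio(g_target, c_nb, g_nb, gamma):
    """Eq. (1.10) solved for the missing chrominance value (R or B) at (r, s).

    g_target: G at (r, s); c_nb, g_nb: chrominance and G at a neighbor (i, j).
    """
    return (c_nb + gamma) * (g_target + gamma) / (g_nb + gamma) - gamma

def estimate_ratio(g_target, c_nb, g_nb):
    """Eq. (1.11), the gamma = 0 case: keep the chrominance/G ratio (needs g_nb != 0)."""
    return c_nb * g_target / g_nb

def estimate_difference(g_target, c_nb, g_nb):
    """Eq. (1.12), the gamma -> infinity limit: keep the chrominance - G difference."""
    return c_nb + (g_target - g_nb)

print(estimate_ratio(120, 60, 100))       # → 72.0
print(estimate_difference(120, 60, 100))  # → 80
for gamma in (0.0, 256.0, 1e6):
    # For these values the estimate equals (80 * gamma + 7200) / (gamma + 100):
    # 72 at gamma = 0, approaching 80 as gamma grows.
    print(gamma, estimate_normalized_ratio(120, 60, 100, gamma))
```

The difference model needs only additions, which is one reason Equation 1.12 is described as particularly efficient; the ratio model also needs a guard against division by zero in dark neighbors.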
The most typical color image processing operations using spectral models are those based on interpolation and filtering concepts; however, spectral modelling also helps in white balancing and image compression. In general, any processing operation driven by spectral modelling can be defined as follows:

$$\mathbf{x}_{(r,s)} = \Lambda^{-1}\left(\mathbf{x}_{(r,s)}, f\left(\Lambda(\mathbf{x}_{(i,j)}); (i,j) \in \zeta\right)\right) \qquad (1.13)$$

where $\Lambda(\cdot)$ and $\Lambda^{-1}(\cdot)$ denote the spectral modelling and inverse spectral modelling functions, respectively, and $f(\cdot)$ is the processing operation performed in the spectral domain over the samples in the neighborhood defined by $\zeta$. In many demosaicking solutions, $f(\cdot)$ takes the form of an averaging-like operation performed on color-difference quantities obtained via $\Lambda(\mathbf{x}_{(i,j)})$, for $(i,j) \in \zeta$. In this case, $\Lambda(\cdot)$ can subtract the chrominance component from the luminance component of $\mathbf{x}_{(i,j)}$, which implies that $\Lambda^{-1}(\cdot)$ adds the chrominance component of $\mathbf{x}_{(r,s)}$ to the output of $f(\cdot)$ in order to obtain the resulting luminance component of $\mathbf{x}_{(r,s)}$.

FIGURE 1.15 Typical color-based approaches in digital camera image processing: (a-c) luminance-chrominance approach and (d-f) color-difference approach. These approaches were applied to the color image shown in Figure 1.5e to obtain: (a) luminance image, (b,c) two chrominance images, (d) green-red color difference image, (e) green-blue color difference image, and (f) red-blue color difference image. Middle intensities in images (b) to (f) correspond to zero values of the displayed signals; low and high intensities correspond to negative and positive signal values, respectively.

Spectral modelling is not the only color-based approach used in digital camera image processing.
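Equation 1.13 with the color-difference choice of Λ(·) can be sketched as a generic higher-order function plus one instance that estimates G at a location where only R is known. This sketch simplifies by assuming that the neighbors in ζ already carry both R and G values (for example, after an initial interpolation pass). Names and sample values are hypothetical.

```python
def spectral_model_process(center, neighbors, f, forward, inverse):
    """Generic form of Eq. (1.13): inverse(center, f([forward(x) for x in zeta]))."""
    return inverse(center, f([forward(x) for x in neighbors]))

def color_difference_green(r_center, neighbors):
    """Estimate G at (r, s) from R at (r, s) and (R, G) pairs at neighbors in zeta.

    Lambda: G - R (luminance minus chrominance); f: mean of the differences;
    Lambda^-1: add R at (r, s) back to recover G at (r, s).
    """
    return spectral_model_process(
        r_center,
        neighbors,
        f=lambda diffs: sum(diffs) / len(diffs),
        forward=lambda rg: rg[1] - rg[0],
        inverse=lambda r, avg_diff: r + avg_diff,
    )

neighbors = [(60, 100), (70, 110), (50, 95), (65, 105)]
print(color_difference_green(80, neighbors))  # → 121.25
```

For comparison, plain averaging of the neighboring G values would give 102.5. The color-difference estimate instead tracks the R level at (r, s), in line with the assumption of strong spectral correlation between the channels.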
A number of processing steps, such as demosaicking, filtering, and compression, can operate by first transforming the color data into one luminance and two chrominance components, and then processing the luminance and chrominance signals separately. The final RGB color image is obtained through the inverse transform of the processed luminance and chrominance images. Figure 1.15 allows the two approaches to be compared. As can be seen in Figure 1.15a, the luminance image contains almost the complete structural content of the color image shown in Figure 1.5e. On the other hand, as seen in Figure 1.15b to Figure 1.15f, both the chrominance and the color-difference images lack most details and small edges, suggesting that these signals have a low-pass nature. Processing images with such reduced structural content can mitigate processing errors and allow for higher performance than operating directly on the RGB color channels.

1.5.4 Temporal Characteristics

Compared to still images, digital video has an additional dimension, as it consists of two-dimensional images, or frames, captured over a certain period of time. This suggests that processing digital video may require extending traditional still-image approaches to higher-dimensional signals, or using special approaches to effectively deal with its unique temporal characteristics. These additional characteristics can be expressed as changes between consecutive frames. Typically, most of these changes are caused by motion, either of objects in the scene or of both the camera and the objects. Therefore, popular video processing techniques first derive motion information and then adjust the processing operation accordingly, in order to prevent motion blur and artifacts and to increase performance.
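Motion information is often derived by block matching. The sketch below is a classic full-search variant that minimizes the sum of absolute differences (SAD). It is chosen here for illustration and is not necessarily the method used in the video chapters of this book. The frame layout (lists of rows of intensities), block size, search range, and names are assumptions.

```python
def sad(prev, curr, top, left, dy, dx, size):
    """Sum of absolute differences between the size x size block of curr at
    (top, left) and the block of prev displaced by (dy, dx)."""
    return sum(
        abs(curr[top + i][left + j] - prev[top + dy + i][left + dx + j])
        for i in range(size)
        for j in range(size)
    )

def block_motion(prev, curr, top, left, size=2, search=2):
    """Full-search block matching; returns the motion (down, right) of the block.

    The zero displacement is evaluated first and kept on ties, so flat or static
    blocks report no motion. The block at (top, left) must lie inside curr.
    """
    rows, cols = len(prev), len(prev[0])
    best_cost, best_dy, best_dx = sad(prev, curr, top, left, 0, 0, size), 0, 0
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            inside = (0 <= top + dy and top + dy + size <= rows
                      and 0 <= left + dx and left + dx + size <= cols)
            if not inside:
                continue  # candidate block would leave the previous frame
            cost = sad(prev, curr, top, left, dy, dx, size)
            if cost < best_cost:
                best_cost, best_dy, best_dx = cost, dy, dx
    # Content found at prev[top + dy][left + dx] now sits at curr[top][left].
    return -best_dy, -best_dx

# A bright 2x2 object moves one row down and two columns right between frames.
prev = [[0] * 6 for _ in range(6)]
curr = [[0] * 6 for _ in range(6)]
for i in range(2):
    for j in range(2):
        prev[1 + i][1 + j] = 9
        curr[2 + i][3 + j] = 9
print(block_motion(prev, curr, 2, 3))  # → (1, 2)
print(block_motion(prev, curr, 4, 0))  # → (0, 0)
```

Preferring the zero vector on ties is a common safeguard against spurious motion in flat regions, where many displacements match equally well. Full search is simple but costly, so real-time pipelines typically restrict or accelerate the search.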
To avoid the problems encountered in still-image processing, temporal characteristics should be used together with spatial, structural, and spectral characteristics, taking advantage of the different types of correlation. Detailed discussions of typical video processing issues in digital cameras can be found in Chapters 18 to 20.

1.6 Conclusion

This chapter aimed at summarizing the fundamentals of single-sensor color imaging and digital camera image processing. The concept of sampling visual information using a color filter array placed on top of a monochrome image sensor became one of the most important developments in the history of digital imaging due to its good performance, effectiveness, and relatively low cost. Therefore, it is no surprise that this concept plays a key role in popular consumer electronic devices with image-capturing capabilities, such as digital still and video cameras, mobile phones, and personal digital assistants. It is equally unsurprising that many color imaging applications, for example, digital photography, printing, visual communication, machine vision, digital cinema, medical imaging, and astronomy, benefit from using such devices. As described in this chapter, the data captured by a color filter array sensor architecture has to undergo a number of image processing steps in order to produce digital photographs. These steps constitute the digital camera image processing pipeline and can be implemented in imaging devices or performed on personal computers, thus flexibly providing the user with a number of options. In either case, each processing step has its own design, performance, and implementation challenges. Most of these challenges relate to effectively using the information available in the captured visual data in order to produce visually pleasing finished images at the output of the pipeline.
Exploiting the spatial, structural, spectral, and even motion characteristics of the data is therefore essential in modern imaging systems that attempt to mimic human perception of the visual environment. Meeting such objectives requires considerable effort in designing a pipeline whose components achieve the best collaborative effect and which is robust enough to deal with the virtually unlimited variations of the visual scene.

2 Reusable Embedded Software Platform for Versatile Single-Sensor Digital Cameras

Wen-Chung Kao, Hung-Hsin Wu, and Sheng-Yuan Lin

2.1 Introduction
2.2 Hardware Platform Overview
2.2.1 Zoom Lens Module and Auto Focus Control
2.2.1.1 Overview of Zoom Lens Control
2.2.1.2 Design Considerations of Auto Focus Algorithms
2.2.2 Image Sensor and Auto Exposure
2.2.2.1 Overview of Image Sensors
2.2.2.2 Design Considerations of Auto Exposure Algorithms
2.2.3 Camera Signal Processor
2.3 Embedded Software Platform
2.3.1 Software Programming Layers
2.3.2 Software Design Reuse
2.3.3 Application Program Interface
2.3.4 Device Driver Interface
2.4 Software Design Methodology
2.4.1 Task Scheduling by Real-Time Operating System
2.4.2 DSP Subsystem Management
2.4.3 Hardware Accelerator Management
2.4.4 Dynamic Memory Management
2.5 Software Module Design Guidelines
2.5.1 Available Hardware Resources Analysis
2.5.1.1 Previewing Images on Color LCD
2.5.1.2 MPEG Audio/Video Playback
2.5.2 Job Scheduling and Resource Allocation
2.5.3 Background Processing and Data Buffering
2.5.3.1 Continuous Still Image Capture
2.5.3.2 MPEG Audio/Video Recording
2.5.4 Power Aware Design
2.6 Embedded Software Design of Built-in Automatic Camera Calibration
2.6.1 Automatic Camera Calibration Flow
2.6.2 Mechanical Shutter Delay Calibration
2.6.3 Image Sensor Calibration
2.7 Conclusion
Acknowledgment
References

2.1 Introduction

Many versatile point-and-shoot digital cameras, which support attractive functions with satisfactory system performance, have been announced in the consumer electronics market [1], [2], [3], [4].
However, camera system designers still face the lack of hardware architecture standards and of a good software design methodology. According to the product definition, camera designers must carefully select several key hardware components, such as the camera signal processor (CSP) [5], [6], [7], [8], [9], [10], [11], the lens module [12], [13], [14], [15], [16], the charge-coupled device (CCD) [17], [18], [19] or complementary metal oxide semiconductor (CMOS) image sensor [20], [21], [22], the analog front end (AFE) chip [23], [24], and the liquid crystal display (LCD). Among these components, the CSP plays the most important role in the entire system. A typical CSP consists of an embedded microprocessor (EMP), hardware engines, peripherals, and other programmable computing units, such as digital signal processors (DSPs) for real-time image/video processing. Scheduling and allocating these heterogeneous computational resources is a complex problem in embedded software design. In recent years, the functions demanded of digital cameras have become numerous and complex, development time has been forced to decrease, and maintaining the entire software system has become increasingly difficult. Many integrated signal processors have been proposed to simplify the design of digital still cameras (DSCs), but very few systematic software platforms are available to support the development of multi-function DSCs [25], [26]. The lack of an industry-standard software platform remains a significant obstacle in developing digital cameras: the embedded software team must construct several software systems for a variety of camera models that are designed around specific CSPs. It is estimated that up to 60% of the development time may be spent on software coding, so the software design phase has become a bottleneck in developing a DSC. A modern digital camera is no longer a simple imaging system.
Beyond basic functions such as capture, playback, display, and storage, a high-performance camera supports many attractive functions while satisfying various timing constraints and the power budget. These features include digital zoom in both capture and playback modes, continuous shots with instant audio recording, MPEG video/audio compression and real-time data streaming, direct print incorporating output formatting and color postprocessing, and parsing or preparing configuration files for the captured pictures so that they can be mailed out and printed automatically from a personal computer (PC). Some of these functions depend closely on the resolutions of the image sensor and the display device. Without a good software architecture, replacing one of these devices may have effects that the existing software system cannot accommodate. Apart from functionality, performance is also important for a camera system: the speed of capturing an image and the response time of the human-machine interface both affect how a camera is evaluated. An advanced consumer DSC supports fast continuous shots while recording audio at the same time, and runs fast auto exposure (AE) [27], [28], [29] and auto focus (AF) [30], [31], [32] between successive shots. Such a camera has to handle starting and stopping audio recording, immediate capture of the next picture, and data storage, while avoiding accidentally recording the sound of the mechanical shutter. Users also expect a fast response from the camera whenever they press a button. It is particularly critical to keep the delay between pressing the shutter button and actually capturing the image short.
The challenge behind this performance index is that many tasks are executed simultaneously, including periodic AE and automatic white balancing (AWB) in preview mode, pre-capture exposure/focus estimation, readout of the fine-resolution raw image data, audio recording, image processing, the key scan task, and real-time streaming of compressed data. A robust embedded software platform that flexibly accommodates these changeable hardware components and software modules is therefore key to producing versatile digital cameras. Beyond the advanced functions and performance requirements, from the system designer's viewpoint the most important features of an ideal embedded software platform include flexibility, reusability [33], [34], [35], [36], and calibration/diagnosis capability [37].

This chapter describes an integrated embedded software platform for developing consumer DSCs. Although different camera models have diverse graphical user interfaces (GUIs) or adopt different key hardware components, many functions are common to all cameras, even if they are implemented with different operations. The platform should remain reusable when some of the hardware components are replaced or the image resolution is changed. Furthermore, an ideal software platform must share common functions between the normal operation mode and the engineering modes that calibrate camera parameters and diagnose or verify the system specifications. Each camera can then be calibrated on the production line without being connected to a computer, which drastically reduces the number of computer-based test fixtures.

2.2 Hardware Platform Overview

The hardware organization of a typical consumer camera system is shown in Figure 2.1. The image captured by the CCD/CMOS sensor is first stored in synchronous dynamic random access memory (SDRAM) and then processed by the CSP.
Audio, in turn, is recorded through the microphone and converted into a digital signal by an analog-to-digital converter (ADC). After signal processing and compression, the audio/video data are stored in the internal flash memory or on external flash cards. The color LCD provides a friendly GUI for viewing the pictures to be taken in the field. A few DSCs support printing pictures directly to printers through a USB interface in host or slave mode. This direct print mode enables connection to specific printers and simplifies printing, because the user does not have to set up the print mode manually. The following subsections introduce several key components of a camera system: the lens module, the image sensor, and the color image signal processor. The design considerations of the AF and AE algorithms are also discussed.

FIGURE 2.1 The hardware organization of a consumer digital camera (© 2005 IEEE). [Block diagram labels: camera signal processor, zoom lens, M-shutter, strobe, SDRAM, audio D/A and A/D, microphone, speaker, LCD display, TV, user I/F & button, internal flash memory, flash card, USB host, USB slave, DPS printer, computer, traditional printer.]

2.2.1 Zoom Lens Module and Auto Focus Control

A zoom lens allows the photographer to change the angle of view from the same spot. With a zoom lens camera, the photographer does not need to carry many lenses of different focal lengths to take pictures at different angles of view. In general, with a longer focal length the angle of view is narrower, and more detail in a smaller area can be captured clearly. Conversely, with a shorter focal length a wider scene can be captured, but the same object is rendered at lower resolution. Since every lens has a limited depth of field (DOF), the range of object distances that can be projected sharply onto the image plane is limited.
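The focal length versus angle-of-view trade-off described above follows from the standard rectilinear-lens relation AOV = 2·atan(w / 2f), where w is the sensor width and f the focal length. The sketch below evaluates it at the two ends of the 5.8 to 17.4 mm zoom range used as an example in Section 2.2.1.1. The 5.76 mm sensor width (roughly a 1/2.5-inch sensor) is a hypothetical value, and the relation assumes the lens is focused at infinity.

```python
import math


def angle_of_view_deg(sensor_width_mm: float, focal_length_mm: float) -> float:
    """Horizontal angle of view of a rectilinear lens focused at infinity.

    AOV = 2 * atan(w / (2 f)); the sensor width is an illustrative input,
    not a value taken from the text.
    """
    return math.degrees(2.0 * math.atan(sensor_width_mm / (2.0 * focal_length_mm)))


# Hypothetical ~1/2.5-inch sensor, about 5.76 mm wide.
SENSOR_WIDTH_MM = 5.76
for f in (5.8, 17.4):  # wide and tele ends of the 3x zoom example
    aov = angle_of_view_deg(SENSOR_WIDTH_MM, f)
    print(f"f = {f:4.1f} mm -> horizontal angle of view ~ {aov:.1f} deg")
```

Tripling the focal length does not exactly divide the angle by three, because of the arctangent, but for narrow angles the relation is close to inversely proportional.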
To capture scenes at different distances, the distance between the lens and the image sensor has to be adjusted so that the projected image of the main subject falls accurately on the sensor. This action is called auto focus (AF). In practice, the lens is moved instead of the image sensor, because controlling the lens motor is much easier.

2.2.1.1 Overview of Zoom Lens Control

A zoom lens is usually composed of several lens groups. As the individual groups move along the optical axis and change their relative distances, the effective focal length (EFL) of the lens changes. The relation between the zoom position and the corresponding EFL is inherently nonlinear, but lens makers are able to provide a look-up table (LUT) of EFL versus motor position for the software designer to use. To simplify zoom lens control, most manufacturers split the range of zoom motor motion into a few steps, and the control LUT contains data only for these steps. In practice, test patterns can be designed to calibrate the EFL values against the motor positions.

The zoom ratio is the ratio of the longest EFL to the shortest EFL. For example, if the EFL range of a digital camera is 5.8–17.4 mm, the zoom ratio is 17.4/5.8 = 3, and the lens is marked as a 3× zoom lens. When the subject is far from the camera, switching the EFL from 5.8 mm to 17.4 mm makes the image of the subject approximately three times wider. Note that changing the EFL of a zoom lens also changes the effective aperture: a longer EFL results in a smaller aperture and a larger F-number. The main consequence of a different F-number is that the light intensity falling on the image sensor changes, and the AE algorithm needs to take this change into account when changing zoom steps.
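A minimal sketch of how control software might use such a zoom LUT: compute the zoom ratio, pick the tabulated zoom step closest to a requested EFL, and report how many stops of light are lost to the larger F-number when zooming in, which is the change the AE algorithm must absorb. `ZOOM_LUT`, its intermediate EFL and F-number values, and both helper functions are hypothetical; only the 5.8 mm and 17.4 mm end points come from the example above.

```python
import math

# Hypothetical zoom-control LUT: (zoom-motor position, EFL in mm, F-number).
# Only the 5.8 mm and 17.4 mm end points come from the text; the rest are
# invented to illustrate a nonlinear EFL curve and a rising F-number.
ZOOM_LUT = [
    (0, 5.8, 2.8),
    (40, 7.1, 3.2),
    (80, 8.9, 3.6),
    (120, 11.2, 4.0),
    (160, 14.0, 4.8),
    (200, 17.4, 5.6),
]


def zoom_ratio(lut):
    """Zoom ratio = longest EFL / shortest EFL."""
    efls = [row[1] for row in lut]
    return max(efls) / min(efls)


def nearest_zoom_step(target_efl_mm, lut):
    """Index of the tabulated zoom step whose EFL is closest to the request."""
    return min(range(len(lut)), key=lambda i: abs(lut[i][1] - target_efl_mm))


def aperture_loss_ev(lut, from_step, to_step):
    """Stops of light lost moving between steps: 2 * log2(N_to / N_from)."""
    return 2.0 * math.log2(lut[to_step][2] / lut[from_step][2])


print(f"zoom ratio: {zoom_ratio(ZOOM_LUT):.1f}x")
step = nearest_zoom_step(10.0, ZOOM_LUT)
print(f"10 mm requested -> step {step} (motor position {ZOOM_LUT[step][0]})")
print(f"wide -> tele costs {aperture_loss_ev(ZOOM_LUT, 0, 5):.1f} EV")
```

Because the camera only stops at tabulated steps, a lookup rather than interpolation is used here; a calibration routine could refine the table entries without changing this interface.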
For consumer digital still cameras, the AE algorithm relies on the image sensor itself to sense and analyze the brightness of the scene. The effect of a changed F-number is therefore measured automatically, and it is not a big problem for AE control. Changing the EFL also affects the lens fall-off characteristic. Lens fall-off means that the light intensity at the corners of the image is typically lower than at the center, and lenses with shorter EFLs show more pronounced fall-off. One way to deal with this issue is to perform lens fall-off compensation in the image processing pipeline, applying different compensation factors at different zoom steps.

The relationship between EFL and lens position is usually nonlinear. As the zoom motor moves, the focus position also changes, so the focus needs to be readjusted. However, running the zoom and focus motors at the same time may cause temporarily very high power consumption. Hence, in practice the zoom motor and the focus motor are not enabled simultaneously, and a fine AF adjustment is executed to achieve more accurate focus after the zoom motor stops at the target position.

2.2.1.2 Design Considerations of Auto Focus Algorithms

Auto focus is usually achieved by moving the lens to multiple focus positions and taking several test images to evaluate the focus conditions. A better focus condition means that an image retains more detail. The measure of the focus condition is called either the focus index (FI) or the figure of merit (FOM), and it is usually implemented with a two-dimensional high-pass digital filter. In traditional photography, a small region of interest (ROI) is defined at the center of the image and the AF program evaluates the FOM only in this ROI, so the photographer must point the camera at the main subject to focus correctly. The camera scans several focus motor positions and calculates the FOM value at each position to generate a focus curve, as shown in Figure 2.2.
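The contrast-based AF procedure described here (a high-pass FOM over a central ROI, a coarse sweep of focus positions, and a second-order polynomial refinement of the peak, as discussed next) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 4-neighbour Laplacian is just one possible 2-D high-pass filter, the three-point parabola is the simplest second-order fit, and `measure_fom`, `roi_frac`, and the synthetic Gaussian focus curve are hypothetical stand-ins for moving the focus motor and grabbing a frame.

```python
import math


def figure_of_merit(img, roi_frac=0.5):
    """Focus FOM: energy of a 4-neighbour Laplacian (high-pass) in a centred ROI.

    img is a 2-D list of luminance values (at least 3x3); roi_frac is the
    fraction of width/height covered by a hypothetical central AF window.
    """
    h, w = len(img), len(img[0])
    rh = min(h, max(3, int(h * roi_frac)))
    rw = min(w, max(3, int(w * roi_frac)))
    top, left = (h - rh) // 2, (w - rw) // 2
    fom = 0.0
    for y in range(top + 1, top + rh - 1):
        for x in range(left + 1, left + rw - 1):
            lap = (4 * img[y][x] - img[y - 1][x] - img[y + 1][x]
                   - img[y][x - 1] - img[y][x + 1])
            fom += lap * lap
    return fom


def coarse_search(measure_fom, positions):
    """Step through lens positions until the FOM starts to fall.

    measure_fom(pos) stands in for 'move focus motor, grab frame, compute FOM'.
    Returns the index of the coarse peak and the FOM values measured so far.
    """
    foms = []
    for i, pos in enumerate(positions):
        foms.append(measure_fom(pos))
        if i > 0 and foms[-1] < foms[-2]:
            return i - 1, foms  # curve changed direction: local peak passed
    return len(foms) - 1, foms


def refine_peak(positions, foms, k):
    """Vertex of the parabola through the coarse peak and its two neighbours.

    Assumes equally spaced positions around index k.
    """
    if k == 0 or k >= len(foms) - 1:
        return positions[k]
    f0, f1, f2 = foms[k - 1], foms[k], foms[k + 1]
    denom = f0 - 2 * f1 + f2
    if denom == 0:
        return positions[k]
    h = positions[k] - positions[k - 1]
    return positions[k] + 0.5 * h * (f0 - f2) / denom


# Synthetic focus curve standing in for real FOM measurements (hypothetical).
def synthetic_fom(pos):
    return 20000.0 * math.exp(-((pos - 47.0) / 15.0) ** 2)


positions = list(range(0, 81, 10))  # coarse step; on a camera, tied to the DOF
k, foms = coarse_search(synthetic_fom, positions)
best = refine_peak(positions, foms, k)
print(f"coarse peak at {positions[k]}, refined estimate {best:.1f}")
```

On a real camera, each call to `measure_fom` costs a motor move and a frame, which is why the coarse step size and the early stop when the curve turns down matter for AF speed. For a focus curve that is exactly quadratic near its peak, the three-point fit recovers the peak exactly.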
The peak of the curve is usually regarded as the best focus position. Several ways have been proposed to speed up the search for this position. One is to estimate the slope of the curve and choose the next focus trial with a better increment. The coarse search step in AF is chosen according to the DOF of the camera at the current zoom step. When the curve changes direction, a local peak has been found, and the remaining task is a fine search for the peak position on the focus curve. A modified approach uses second-order polynomial curve fitting to quickly predict the peak of the focus curve after the coarse search. This approach gives promising results under general conditions and has been widely applied in commercial camera designs.

Advanced digital still cameras, on the other hand, frequently use a larger ROI for AF. The ROI is split into a few smaller regions, and the FOM value of each region is calculated individually. These FOM values are then compared and the best focus is selected.

FIGURE 2.2 The figure of merit (FOM) curve in auto focus. [Plot: FOM value (0 to 20000) versus lens position (0 to 80).]

2.2.2 Image Sensor and Auto Exposure

2.2.2.1 Overview of Image Sensors

There are two popular types of commercial image sensors: the charge-coupled device (CCD) and the complementary metal-oxide-semiconductor (CMOS) image sensor. In both types, the transducer is a photo diode coupled with a capacitor; they differ in how the image signals are transferred to the outside circuit. When light falls on the sensor, some photons are absorbed by the silicon and electron-hole pairs are generated. These electron-hole pairs form an electric current that is, in general, proportional to the light intensity falling on the photo diode.
The charge accumulated in the capacitor is the time integral of the current flowing into it. Since the current is proportional to the image light intensity and the output voltage is proportional to the accumulated charge, the output voltage of a photo diode is proportional to the product of the light intensity and the integration time. The integration time of a photo diode is controlled by its reset timing. The control circuit periodically sends a reset pulse to clear the charge accumulated in the capacitor. When the effective exposure starts, the control mechanism simply stops the reset signal, and the newly generated charge is accumulated. This exposure control mechanism is also called an electronic shutter. The exposure ends when the capacitor is connected to a sense-and-read amplifier through another transistor. After the voltage is sensed and converted to a digital value, the photo diode capacitor is cleared again and the next exposure can start. This kind of electronic shutter is very precise and can control extremely short durations: the shortest exposure time of a traditional mechanical shutter is around 1/2000 second, while an electronic shutter can easily achieve an exposure time resolution of 1/50000 second.

There are several CCD architectures, including frame transfer, full frame transfer, frame interline transfer, and interline transfer. The architecture most widely adopted in consumer cameras is interline transfer, shown in Figure 2.3. The pixels are arranged as a two-dimensional array. Between neighboring vertical lines of photo diodes there is a line of vertical CCDs, which are masked by a metal layer and do not respond to incident light.

FIGURE 2.3 Basic interline transfer CCD architecture. [Figure labels: photodiode, vertical CCD, vertical shift, metal mask, horizontal CCD, reset, VBIAS, output, first pixel.]

The pixels in the entire frame are exposed, and the accumulated charges in the photo diodes are transferred to the neighboring vertical CCDs at the same time. The pixel charges in each horizontal line are then shifted vertically downward to the horizontal CCDs. Finally, the signals stored in the horizontal CCDs are amplified and sent out at the pixel rate.

The main difference between CCD and CMOS image sensors is that a CMOS sensor lacks a charge buffer in its readout mechanism: once the exposure is completed, the charge is read out immediately. Since there is at most one row of ADCs at the bottom of the pixel array and the data readout is sequential, only one row of pixels can be read out at a time. The exposure of the second row must therefore be completed exactly when the readout of the first row is completed, and so on for the remaining rows. To keep the exposure time the same for all rows, the start of exposure of each row also has to be shifted accordingly. As a result, the entire image array is not exposed at the same time. This exposure mechanism is called rolling exposure: the upper rows are exposed earlier and the lower rows later. Its impact is that the captured image is distorted if the subject moves horizontally; that is, a vertical line becomes slanted. This effect is especially significant for higher-resolution images, because the time lag from the first row to the last row is longer than in low-resolution sensors.

FIGURE 2.4 The dynamic range of the image sensor and the problem of auto exposure control. [Plot: number of pixels versus brightness (low to high), with exposure ranges E1, E2, and E3 marked.]
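Returning to the rolling exposure of CMOS sensors, the row timing and the resulting slant of a moving vertical edge can be sketched as below. This is a simplified model under stated assumptions: a constant per-row readout time, a subject moving at constant horizontal speed, and hypothetical numbers for the row time, row counts, and speed. Each row starts its exposure one row time after the previous one, so an edge is displaced by speed × row delay in every successive row.

```python
def rolling_exposure_schedule(n_rows, exposure_s, row_time_s):
    """(start, end) of exposure for each row of a rolling-exposure sensor.

    Row r is read out r * row_time_s after row 0, so its exposure start is
    delayed by the same amount to keep every row's exposure time equal.
    """
    return [(r * row_time_s, r * row_time_s + exposure_s) for r in range(n_rows)]


def edge_slant_px(n_rows, row_time_s, speed_px_per_s):
    """Horizontal shift between the first and last rows of a vertical edge
    moving sideways at constant speed (the 'slanted line' artifact)."""
    return speed_px_per_s * (n_rows - 1) * row_time_s


ROW_TIME = 20e-6  # hypothetical 20 us readout per row
SPEED = 500.0     # hypothetical subject speed in pixels per second
for rows in (480, 1944):
    lag_ms = (rows - 1) * ROW_TIME * 1e3
    print(f"{rows} rows: first-to-last lag {lag_ms:.1f} ms, "
          f"slant {edge_slant_px(rows, ROW_TIME, SPEED):.1f} px")
```

Holding the row time fixed, the sensor with more rows accumulates a longer first-to-last lag and therefore a larger slant, which matches the observation that the effect grows with resolution.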
2.2.2.2 Design Considerations of Auto Exposure Algorithms

Although high-quality CCD/CMOS sensors have a larger dynamic range, most commercial imaging systems have a narrower exposure latitude than color negative film. The dynamic range of a camera system is usually limited by the noise of the sensor as well as the precision of the ADC. The luminance level in the brightest area of a scene can be millions of times that in the darkest area. In typical scenes, the dynamic range, i.e., the ratio of the highest to the lowest light intensity, usually exceeds what most image sensors can measure. The task of auto exposure (AE) is to determine the aperture size, gain setting, and exposure time such that the exposure value best fits the dynamic range of the current scene. As shown in Figure 2.4, the luminance distribution of the scene is wider than the dynamic range of the sensor. Consequently, the exposure must be adjusted so that the sensor captures the proper range of luminance data. E1, E2, and E3 represent three different exposure conditions. If E1 is selected, many pixels corresponding to the dark area are underexposed, so their output codes are zero. Conversely, selecting E3 may produce many saturated pixels whose values are clipped to the maximum output of the ADC, and their brightness information is lost as well. Therefore, the exposure must be adjusted properly to acquire as much brightness information from the scene as possible.

A typical AE algorithm adopts the Additive System of Photographic Exposure (APEX) [38]. Under APEX, the relationship between the shutter time, aperture, luminance level, and camera sensitivity is

$$\frac{A_f^2}{T_s} = \frac{B_s S_f}{K} \qquad (2.1)$$

where $A_f$ is the F-number of the optical lens, $T_s$ is the electronic shutter time (or exposure time), $B_s$ is the scene brightness, $S_f$ is the sensitivity of the camera, and $K$ is a scaling factor.
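Taking log2 of both sides of Eq. (2.1) turns its products and quotients into sums, $2\log_2 A_f + \log_2(1/T_s) = \log_2 B_s + \log_2 S_f - \log_2 K$, which is the additive form behind the APEX name. Writing $A_v = \log_2 A_f^2$ and $T_v = \log_2(1/T_s)$ gives an exposure value $EV = A_v + T_v$, so doubling $T_s$ lowers $T_v$, and hence EV, by exactly one. The C sketch below works in whole stops so that it needs no math library; its function names are illustrative assumptions, and nominal F-numbers such as f/2.8 are treated as exact stops.

```c
/* APEX values are log2 quantities; for whole stops they are integers:
 *   Av = log2(A_f^2): f/1 -> 0, f/1.4 -> 1, f/2 -> 2, f/2.8 -> 3, f/4 -> 4
 *   Tv = log2(1/T_s): 1 s -> 0, 1/2 s -> 1, ..., 1/128 s -> 7            */

/* Exposure value of an aperture/shutter pair: EV = Av + Tv. */
int exposure_value(int av, int tv)
{
    return av + tv;
}

/* Shutter time in seconds that realizes a target EV at aperture value av:
 * Tv = EV - Av and T_s = 2^(-Tv). */
double shutter_for_ev(int ev, int av)
{
    int tv = ev - av;
    double t = 1.0;
    for (int i = 0; i < tv; i++)
        t /= 2.0;   /* each +1 in Tv halves the exposure time */
    for (int i = 0; i > tv; i--)
        t *= 2.0;   /* each -1 in Tv doubles the exposure time */
    return t;
}
```

For example, f/2.8 (Av = 3) at 1/128 s (Tv = 7) gives EV 10; holding EV 10 while opening up to f/2 (Av = 2) calls for 1/256 s, which is the kind of trade-off an AE control curve walks along.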
In practical system design, a log2 representation is used for computational reasons. Its unit is the exposure value (EV): a difference of 1 EV corresponds to changing the shutter time by a factor of two. The task of AE control can then be stated as follows: given a good brightness measurement $B_s$ of the current image, find a proper combination of $A_f$, $T_s$, and $S_f$ for capturing the next image such that its exposure is optimal.

[FIGURE 2.5 Auto exposure: (a) an example of the weighting matrix, and (b) auto exposure control curve (F-number versus exposure time, from short to long, with slanted lines for EV7 to EV17). © 2005 IEEE]

In preview mode, the brightness is measured from the raw data of the CCD sensor operating in draft mode, in which only a few lines rather than the entire sensor data are read out. Since the importance of different areas of the scene usually varies, a popular approach is to evaluate an importance index by analyzing the stability of brightness. As shown in Figure 2.5a, the raw image is divided into 5 × 5 regions, and the final measurement is calculated from the weighted regional values. The exposure time and gain setting must be adjusted before the next frame starts its exposure. A typical adjustment algorithm is based on the lookup curve shown in Figure 2.5b. In this figure, the slanted lines represent different scene brightness levels, and the AE algorithm moves along the AE curve to find the proper combination of aperture size, exposure time, and gain setting. In the example, when the scene brightness increases, the algorithm follows the solid trace, initially keeping the large aperture while reducing the exposure time.
Then the algorithm switches to a smaller aperture at EV13 and increases the exposure time to maintain the same exposure. When the scene brightness increases further, the algorithm keeps the smaller aperture and further reduces the exposure time. The reverse path is similar but not identical. As the scene brightness decreases, the algorithm holds the small aperture while increasing the exposure time to match the exposure. However, it does not switch the aperture setting back until it reaches EV11. In practice, the exposure time at this point is 1/60 second, which is kept as a limit to prevent image blur. Note that the sensor sensitivity and register settings usually differ between draft mode and fine mode. The final exposure value for capturing a still image must take this difference into account; it should be provided in the specification of the image sensor.

2.2.3 Camera Signal Processor

A CSP is designed for fast processing of large volumes of image, video, and audio data and provides various peripherals to support user interaction and I/O extensions. Versatile camera features are realized by combining these hardware resources. Figure 2.6 shows a block diagram of a typical CSP, which includes the EMP, DSP, peripherals, and other engines.

[FIGURE 2.6 A typical color image signal processor (CSP): EMP with instruction and data caches, DSP with internal memory, SDRAM controller and external SDRAM, DMA, DCT accelerator, VL encoder and decoder, quantizer and dequantizer, live view engine, OSD engine, audio encoder, and CCD, LCD, and MIC interfaces.]

The EMP hosts the camera system as a whole, handling tasks such as OS scheduling, interrupt handling, and interfacing to the I/O connections of the DSC system. The DSP, as its name suggests, excels at executing signal processing programs. A multi-core solution makes it possible to run jobs in parallel.
Although the multi-core architecture boosts performance through parallel processing, it also makes the cores more difficult to control. The designer should carefully study the inherent properties and limitations of each hardware resource. An EMP is especially suitable for executing conditional branches, such as if-else or switch-case statements. Most EMPs used in digital cameras are designed with a reduced instruction set computer (RISC) architecture. Although the EMP is able to execute most camera functions, it cannot outperform a programmable DSP when processing image and video data. The EMP requires more instruction cycles to process visual data, whereas a DSP is particularly efficient at applying the same algorithm to large amounts of data with the same properties. A typical DSP has four or more high-speed multiply-accumulate (MAC) units, hardware looping, and several on-chip memory buses, which is why multiple image and video data streams can be processed simultaneously in a DSP.

In addition to a programmable DSP, a CSP may include several dedicated engines for compressing image/video data according to popular standards. Each engine executes only one type of computation; examples include discrete cosine transform (DCT), quantizer/dequantizer, and variable length (VL) coding engines. Although the programmable DSP can also perform these computations, dedicated engines allow visual data to be processed in parallel pipeline stages.

Previewing scenes at 30 frames/second is a basic requirement for a DSC. However, the performance of a programmable DSP may not be high enough to meet this requirement. A dedicated functional module, called the preview or live view engine (or accelerator), is therefore often built into the CSP to support the camera preview and video recording modes.
Since the live view engine converts raw CCD data into a formatted image, it is in effect a fairly simple image pipeline. Once the live view engine has generated the formatted data, the on-screen display (OSD) engine manages the display and writes the data to the LCD driver. The role of the SDRAM controller and the direct memory access (DMA) controllers is to manage data I/O among the processors, peripherals, and external SDRAM. Many hardware modules, including the live view engine, OSD engine, DSP, and EMP, are connected to the SDRAM controller. However, transferring large amounts of data under the control of the EMP is too slow to meet the timing requirements. A CSP is therefore often equipped with several DMA units to remove the data-transfer bottleneck among hardware devices. The DMAs efficiently transfer blocks of data between SDRAM and the DSP or peripherals.

2.3 Embedded Software Platform

Although most camera functions are executed by specialized hardware engines, software plays an increasingly important role in the design of camera systems. Implementing more of the system in software brings several advantages, such as the flexibility to change hardware components or add new user operations. Software also facilitates the reuse of previously designed modules, independent of the selected hardware platform. Design reuse can be achieved by designing the software modules at an abstraction level that is independent of the processor and of the real-time operating system (RTOS). This section describes an embedded software platform for developing versatile camera systems. To improve programmability and reusability, the platform is divided into three programming layers and two interfaces: the application layer, functional layer, system layer, application program interface (API), and device driver interface (DDI), as shown in Figure 2.7.
To keep application-specific features independent of the functional operations, the application layer is developed using only the APIs supported by the modules in the functional layer. No lower-level functions can be called by the modules in the application layer. This prevents top-level routines from executing lower-level operations or accessing hardware resources directly without any protection mechanism. Direct function calls may cause unexpected state transitions that leave the system hung in an unknown state. Even the modules in the functional layer can request services from other modules only through API calls.

2.3.1 Software Programming Layers

In the application layer, two modules, the GUI and the manufacture/calibration interface (MCI), are executed independently, each with its own control flow. The state diagram, data flow, and dynamic behavior of the GUI module are defined by the GUI design specification.

[FIGURE 2.7 An example of an embedded software platform for digital cameras: application layer (graphic user interface; manufacture & calibration interface), API, functional layer (master state engine, still image capture, audio annotation, playback, AE, AWB, AF, man machine interface, direct print, USB, MPEG), DDI, and system layer (color LCD driver, CCD/TG & AFE driver, RTOS & file system, flash card drivers).]

A typical GUI design specification defines the operational modes as well as their state diagrams. The common modes include system initialization, preview, still image capture, playback, video recording, direct print, and connection to a computer [39]. Although the GUI module is designed as several state machines, all of them are scheduled by a single dedicated task.
When the user presses the shutter button, the GUI task receives a message and passes it to the corresponding submodule of the GUI module.

The operational flow of the MCI module is designed around the manufacturing process adopted on the production lines. Its objective is to achieve high throughput, good product quality, and better production reliability. A good approach to designing the MCI is to define common interfaces between the computer and the camera for the manufacture- and calibration-related APIs. The MCI modules can then be controlled either by a computer through a USB cable or by function calls inside the camera. With this approach, production engineers or line operators can monitor the manufacture/calibration progress and collect production data.

In the functional layer, modules are designed to support versatile camera functions such as the master state engine (MSE), still image capture, audio annotation, playback, man machine interface, USB, MPEG video and audio, and direct print. Each module runs under the scheduling of one or more tasks, which have their own message boxes, state diagrams, and data flows. Each particular operation, such as still image capture with audio annotation or MPEG video and audio recording, is implemented with APIs supported by one or several functional modules. Two modules communicate through message boxes or event flags.

In the system layer, the major design objectives are hardware abstraction and the construction of the software programming environment. Since many cameras have similar hardware architectures with different types of image sensors, color LCDs, and flash cards, the software related to these key components should be designed to be configurable or replaceable. Most of the code in the upper programming layers is then reusable.
In addition, running the system on an RTOS lets the programming environment separate upper-level functional operations from the details of timing, scheduling, and file allocation.

2.3.2 Software Design Reuse

The difficulties of embedded software reuse stem from the following facts:
i) Hardware components, such as the zoom lens, CCD sensor, color LCD module, analog front end chip, and type of storage card, might be replaced when the next generation of digital cameras is developed.
ii) The CSP or RTOS may be changed for performance or cost reasons.
iii) The GUI flow is designed to be unique for each camera series in order to differentiate product lines, so the design team must develop a new GUI module for each new camera series.
iv) A robust camera development platform must provide manufacture and diagnosis interfaces so that design engineers can monitor device calibration parameters and system performance indices.

To maximize software reuse, the embedded system platform is abstracted at a level where the basic functional modules use a device-independent interface to the hardware. The application software, which implements the various GUI specifications, is designed without regard to the implementation details of timing, scheduling, and resource allocation on the physical hardware platform. Figure 2.8 summarizes the basic idea of a camera embedded software platform. Given a new camera system specification, designers first map the specification onto the hardware platform by choosing a suitable family of key hardware components. The designers then develop drivers for the specific components and link them with the predefined DDI functions. Meanwhile, the application programmers analyze the GUI specification and design the state machines, menu system, and artwork.
The key to concurrent hardware and software development lies in these predefined API/DDI functions, which abstract the behavior of functional modules and isolate hardware-specific features and their implementation details.

2.3.3 Application Program Interface

The application program interface (API) provides an abstract representation of the functional modules, with their implementation details hidden. With such APIs, the application software can easily be reused when developing new products. It is also possible to change the program modules in the application and functional layers to accommodate new application features. A simple way to implement an API is:

```c
void API_ModuleName_FunctionName(void)
{
    SendMessage(ModuleName, FunctionName);  /* ask the owning module to run FunctionName */
    WaitMessage(ModuleName, Finished);      /* block until that module replies "Finished" */
}
```

[FIGURE 2.8 The design reuse of a camera embedded software platform: an application instance in the application space is mapped through the API onto the software platform, which is mapped through the DDI onto a hardware platform instance in the system platform space. © 2005 IEEE]

In this implementation style, an API need not contain the code of the function it requests; it only sends a message to the module that executes the requested function. The module receiving the message is invoked and switched into the ready state to carry out the task. Note that a task calling such an API is switched to the blocked state and can be reactivated only after it receives the task-finishing message. One consequence is that the calling task cannot run any other code until that message arrives, which is unacceptable for some tasks. For example, a periodic task, which is automatically invoked by a timer interrupt, monitors the camera status and can send requests to stop the running processes. Because this task only triggers other modules and does not wait for responses, it should not be blocked by any RTOS mechanism.
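The message-based API pattern above can be tried outside an RTOS. The following C sketch is a hypothetical stand-in, not the chapter's implementation: POSIX threads play the role of RTOS tasks, and a one-slot mailbox built from a mutex and a condition variable plays the role of an RTOS message box. `API_Capture_TakePicture`, `capture_task`, and the message codes are invented names; the API mirrors the SendMessage/WaitMessage pattern and blocks its caller, while `post_message` is a non-blocking send of the kind a periodic monitoring task would need.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-slot mailbox standing in for an RTOS message box. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;   /* signalled whenever the slot fills or empties */
    bool            full;
    int             msg;
} Mailbox;

#define MAILBOX_INIT { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false, 0 }

/* Blocking send: waits while the slot is still occupied. */
void send_message(Mailbox *mb, int msg)
{
    pthread_mutex_lock(&mb->lock);
    while (mb->full)
        pthread_cond_wait(&mb->changed, &mb->lock);
    mb->msg = msg;
    mb->full = true;
    pthread_cond_broadcast(&mb->changed);
    pthread_mutex_unlock(&mb->lock);
}

/* Non-blocking send for tasks that must never block, such as a periodic
 * monitor: returns false instead of waiting when the slot is occupied. */
bool post_message(Mailbox *mb, int msg)
{
    bool ok = false;
    pthread_mutex_lock(&mb->lock);
    if (!mb->full) {
        mb->msg = msg;
        mb->full = true;
        ok = true;
        pthread_cond_broadcast(&mb->changed);
    }
    pthread_mutex_unlock(&mb->lock);
    return ok;
}

/* Blocking receive: the caller sleeps (the "blocked" state) until a message arrives. */
int wait_message(Mailbox *mb)
{
    pthread_mutex_lock(&mb->lock);
    while (!mb->full)
        pthread_cond_wait(&mb->changed, &mb->lock);
    int msg = mb->msg;
    mb->full = false;
    pthread_cond_broadcast(&mb->changed);
    pthread_mutex_unlock(&mb->lock);
    return msg;
}

enum { MSG_CAPTURE = 1, MSG_STOP = 2, MSG_FINISHED = 100 };

static Mailbox capture_inbox = MAILBOX_INIT;  /* requests to the module */
static Mailbox capture_reply = MAILBOX_INIT;  /* "Finished" replies */
static int captures_done = 0;

/* Task that owns a hypothetical still-capture module. */
void *capture_task(void *arg)
{
    (void)arg;
    for (;;) {
        int msg = wait_message(&capture_inbox);
        if (msg == MSG_STOP)
            return NULL;
        if (msg == MSG_CAPTURE) {
            captures_done++;                 /* stand-in for the real capture work */
            send_message(&capture_reply, MSG_FINISHED);
        }
    }
}

/* API in the style of the text: send a request, then block until it is done. */
void API_Capture_TakePicture(void)
{
    send_message(&capture_inbox, MSG_CAPTURE);
    (void)wait_message(&capture_reply);      /* reactivated only by MSG_FINISHED */
}
```

A periodic monitoring task would call `post_message` and simply retry on its next tick if the slot is occupied, so it never enters the blocked state. A real RTOS message box would typically queue several messages; the single slot here is a simplification.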
2.3.4 Device Driver Interface

Similar to the API, the device driver interface (DDI) provides an abstract representation of hardware devices or hardware platform instances. The DDI isolates device-specific features and unifies device behaviors so that programs in the functional layer can execute without regard to the type of device (e.g., storage media) in the system. For example, when a different type of storage card is adopted, only the low-level driver that directly controls the storage card needs to be modified; all program functions that access data on storage cards can remain the same or need only minor changes.

2.4 Software Design Methodology

2.4.1 Task Scheduling by Real-Time Operating System

A modern camera executes several functions simultaneously to fully utilize all available resources on the hardware platform. Embedded software with an RTOS that schedules all jobs is common in most digital cameras. Unlike more powerful real-time kernels, the RTOS adopted in digital cameras is usually scalable, so that only the services actually used by the camera are included in the run-time image. The RTOS is loaded and executed by the EMP, yet the jobs it schedules include not only those running on the EMP but also those on the DSP or hardware engines. The RTOS creates the illusion of concurrent processing by rapidly switching the EMP among tasks; each task manages some jobs in a software module or controls the allocation of a hardware engine. Resource control mechanisms provided by the RTOS, such as semaphores or event flags, are commonly used to maintain the execution order of the required jobs.

As shown in Figure 2.9, a task can be in the running, blocked, or ready state, and four transitions are possible among these three states. Transition A (running to blocked) occurs when a task waits for an occupied resource or for a required event flag that has not been triggered.
When the resource becomes available or the event occurs, transition B (blocked to ready) takes place. If no other task is running at that instant, transition C (ready to running) is triggered and the task immediately enters the running state. Transition D (running to ready) occurs only in a preemptive scheduling scheme, when the scheduler decides that the running task has occupied a resource long enough and it is time to release the resource and activate another task. Transition D never happens in non-preemptive scheduling: a task releases the resource only when it finishes or when it is blocked through transition A.

Preemptive scheduling is more powerful than non-preemptive scheduling, but analyzing state transitions and resource allocation becomes more difficult, because the execution sequence for a specific operation scenario is unpredictable. It is therefore harder to predict and meet timing constraints when using preemptive scheduling in a camera. Although an RTOS may allow priorities to be assigned to tasks, it is infeasible to change a task's priority dynamically to meet the timing constraints of different scenarios.

[FIGURE 2.9 The states of a task: running, blocked, and ready, connected by transitions A, B, C, and D.]

[FIGURE 2.10 Sequence diagrams: (a) the sequence diagram of auto exposure in preview mode, and (b) the modified sequence diagram of auto exposure in preview mode. In (a), the preview thread reads CCD pixel values and sends calculate to the evaluation thread, which returns the exposure value before the preview thread sets the CCD exposure value; in (b), a calculation thread computes the exposure value while the preview thread continues reading CCD pixel values.]
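The rules for the four task-state transitions can be written down as a small C function. This is an illustrative sketch rather than RTOS code: the enum and function names are hypothetical, the labels A to D follow Figure 2.9 as described in the text, and transition D is allowed only when preemptive scheduling is enabled.

```c
/* Task states and the four transitions of Figure 2.9. */
typedef enum { TASK_RUNNING, TASK_READY, TASK_BLOCKED } TaskState;
typedef enum { TR_A, TR_B, TR_C, TR_D } Transition;

/* Returns the new state, or -1 if the transition is not allowed from `s`.
 * `preemptive` selects whether transition D (preemption) exists at all. */
int task_transition(TaskState s, Transition t, int preemptive)
{
    switch (t) {
    case TR_A: /* running task waits for an occupied resource or an event flag */
        return s == TASK_RUNNING ? TASK_BLOCKED : -1;
    case TR_B: /* resource freed or event occurred */
        return s == TASK_BLOCKED ? TASK_READY : -1;
    case TR_C: /* scheduler dispatches the ready task */
        return s == TASK_READY ? TASK_RUNNING : -1;
    case TR_D: /* scheduler preempts the running task */
        return (preemptive && s == TASK_RUNNING) ? TASK_READY : -1;
    }
    return -1;
}
```

Under non-preemptive scheduling, transition A is the only way out of the running state apart from finishing the task, which is what makes the execution order easier to predict.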
Non-preemptive scheduling combined with some exception handling mechanisms is another good approach to digital camera design. Non-preemptive scheduling eases the difficulties of design, debugging, reliability testing, and timing optimization. Its main drawback is that a task may monopolize the EMP if it is trapped in an unknown state or an infinite loop. Fortunately, such bugs can easily be detected in the development and verification phase. The timing constraint problem can be solved by adding extra code to interrupt routines with time-consuming jobs.

The software modules in the functional and application layers are managed by one or more dedicated tasks. Scheduled by the RTOS, these tasks run independently. The initial state of a task is the blocked state, and it can only be invoked by a message sent from another module. A typical example is one task sending a specific message to another task to trigger a specific action. Figure 2.10a shows the interactions of auto exposure as a sequence diagram in Unified Modeling Language (UML) notation. The preview task sends a message called CALCULATE to trigger the evaluation task after the CCD raw data readout is completed. After the evaluation task finishes the calculation, it sends the evaluated exposure value back to the preview task, which then sets the new CCD exposure time accordingly. In this simple design, auto exposure is executed entirely by the EMP, and the DSP is not involved. In an alternative design, the evaluation task running on the EMP establishes communication between the EMP and the DSP. The task does not evaluate the exposure value itself; instead, it only manages the DSP resources and lets the DSP calculate the exposure value. The modified sequence diagram in Figure 2.10b shows the resulting parallelism of running the EMP and the DSP simultaneously.
While the DSP is calculating the exposure value for the previous frame, the preview task, executed by the EMP, sets the new exposure and acquires the next frame data.

2.4.2 DSP Subsystem Management

Because the program memory of the DSP subsystem is limited, the DSP program is partitioned into several sections, such as preview and capture, playback, and MPEG video/audio recording. Only one section is loaded into the DSP at a time, according to the current camera operation mode. The EMP and the DSP interact through interrupts, and both have interrupt service routines (ISRs) to handle the communication. Since a software module running on the EMP is driven by messages or events, the ISR triggered by the DSP sends messages to the corresponding modules. As shown in the following sample code, the module first loads the corresponding DSP code, and its task then blocks while waiting for the DSP response. After the DSP semaphore has been acquired, the routine LoadDSP loads the new DSP code. Before the DSP is started, the event flag DSPFree is cleared; it is used to control the execution of the task. The RTOS function call WaitDSPEventFlag in routine ProcessStillImage blocks the corresponding task until the DSP finishes its job. The task can be reactivated only after the event flag is set again in the ISR, which the DSP subsystem triggers once the assigned jobs are finished.
```c
void ProcessStillImage(void)
{
    WaitSemaphore(DSP);         // Wait until the DSP resource is freed
    SetSemaphore(DSP);          // Take the DSP resource (many RTOSs combine
                                // this wait-and-take into one atomic call)
    ClearEventFlag(DSPFree);    // DSPFree indicates whether the DSP is free
    LoadDSP(Capture);           // Load the DSP code of the capture section
    StartImageProcess();        // Trigger the DSP to run image processing
    WaitDSPEventFlag(DSPFree);  // Block until the ISR sets DSPFree again
}
```

2.4.3 Hardware Accelerator Management

A CSP may include several additional hardware accelerators, such as a live view engine for scene previewing, a variable length encoder and decoder, and a quantizer and dequantizer for fast image processing. These accelerators play important roles in real-time image/video processing. Their configuration registers are set through the EMP, which also manages the data flow and protects the accelerators from conflicting access by several software modules. Controlling a hardware accelerator involves the following two steps:
1. The EMP first writes parameters into the corresponding registers of the accelerator. The accelerator is then switched from the idle state to the active state by setting the activating register. The register settings may not take effect immediately; they are usually activated by a predefined synchronizing signal.
2. The accelerator changes its state from active to idle after completing the job. Two popular mechanisms are used to check the state of an accelerator: polling and interrupts. With polling, the EMP runs a loop that repeatedly checks the ready bit of a control register in the accelerator, which may occupy all computational resources of the EMP. An interrupt is more efficient than polling because the EMP can execute other tasks until the accelerator interrupts it.
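The two control steps and the polling method can be sketched in C against a simulated accelerator. Everything here is hypothetical: the register layout (`AccelRegs`), the function names, and `accel_sim_step`, which stands in for the hardware by finishing the job after a given number of cycles. Unlike the unbounded loop described above, the polling routine takes a maximum poll count so that a hung accelerator cannot trap the EMP indefinitely; that bound is a design choice of this sketch.

```c
#include <stdint.h>

/* Hypothetical register block of a hardware accelerator (e.g. a quantizer). */
typedef struct {
    volatile uint32_t param[4];  /* job parameters written by the EMP */
    volatile uint32_t activate;  /* write 1: idle -> active */
    volatile uint32_t ready;     /* set to 1 by the hardware when the job is done */
    uint32_t busy_cycles;        /* simulation only: cycles left in the job */
} AccelRegs;

/* Simulation only: one hardware clock step; the job ends after busy_cycles. */
static void accel_sim_step(AccelRegs *r)
{
    if (r->activate && r->busy_cycles > 0 && --r->busy_cycles == 0) {
        r->activate = 0;  /* back to idle */
        r->ready = 1;     /* job complete */
    }
}

/* Step 1: write the parameters, then set the activating register. */
void accel_start(AccelRegs *r, const uint32_t p[4], uint32_t job_cycles)
{
    for (int i = 0; i < 4; i++)
        r->param[i] = p[i];
    r->ready = 0;
    r->busy_cycles = job_cycles;  /* simulation only */
    r->activate = 1;
}

/* Step 2 by polling: check the ready bit until it is set. Returns the number
 * of polls spent, or -1 if max_polls is exceeded. */
long accel_poll(AccelRegs *r, long max_polls)
{
    for (long n = 1; n <= max_polls; n++) {
        accel_sim_step(r);  /* on real hardware, time simply passes here */
        if (r->ready)
            return n;
    }
    return -1;
}
```

The value returned by `accel_poll` counts polls in which the EMP did no useful work; with an interrupt, the EMP would instead run other tasks and be notified when the accelerator returns to the idle state.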
2.4.4 Dynamic Memory Management

The system may violate real-time constraints if the software uses too many system calls to allocate memory dynamically from the system heap. When a task calls a dynamic memory allocation routine, it enters the blocked state to wait for the RTOS service. The RTOS then searches for suitable free space and allocates the required memory. For reasons of memory size and execution efficiency, most RTOSs adopted in cameras are quite small, and they usually lack a sophisticated garbage collection mechanism to compact memory fragments. An alternative approach is to handle dynamic memory allocation in the application software itself instead of in the RTOS. One reason is that the memory arrangement differs between operation modes. Another is that the memory allocation and free operations of the RTOS may not achieve acceptable performance. To solve these problems, several basic memory allocation strategies can be adopted, including analyzing the memory usage of each camera mode so that the memory maps are customized for each mode, or statically assigning addresses to the larger memory buffers and leaving the remaining space for dynamic allocation. As shown in Figure 2.11, all camera operations are assigned to four basic modes with different memory maps in the camera software platform. In these maps, a few large memory buffers are assigned to fixed contiguous locations, and the remaining space is used as heap memory for dynamic allocation. Since a typical RTOS cannot dynamically change the heap size and location, the only way to realize these strategies is to develop a dedicated memory management routine that manages the heap size and location itself.
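The per-mode memory maps and the dedicated heap routine can be sketched in C as follows. This is a simplified, hypothetical illustration: the buffer sizes in `mode_maps` are invented, the fixed buffers of each mode are lumped into one area at the start of the region, and the heap is a bump allocator that is reset wholesale whenever the camera switches modes, so it cannot fragment across modes but also offers no per-block free.

```c
#include <stddef.h>
#include <stdint.h>

/* The four basic modes of Figure 2.11. */
typedef enum { MODE_PREVIEW_CAPTURE, MODE_PLAYBACK, MODE_MPEG, MODE_USB, MODE_COUNT } CameraMode;

typedef struct {
    size_t fixed_bytes;  /* total size of the statically placed large buffers */
} ModeMap;

/* Hypothetical sizes, for illustration only. */
static const ModeMap mode_maps[MODE_COUNT] = {
    { 24u << 20 },  /* preview buffers, raw capture buffers, bit streams */
    { 16u << 20 },  /* playback buffers, JPEG/audio bit streams, image operations */
    { 20u << 20 },  /* preview buffers, video raw data, audio bit stream */
    { 0 },          /* USB: no fixed buffers, since usage is unpredictable */
};

typedef struct {
    uint8_t *base;      /* start of the mode-dependent region */
    size_t   size;      /* size of the region in bytes */
    size_t   heap_top;  /* next free offset of the bump allocator */
} ModeMemory;

/* Switching modes discards the previous layout; the heap starts right after
 * the fixed buffers. Returns 0 on success, -1 if the fixed buffers do not fit. */
int mode_memory_enter(ModeMemory *m, uint8_t *base, size_t size, CameraMode mode)
{
    if (mode_maps[mode].fixed_bytes > size)
        return -1;
    m->base = base;
    m->size = size;
    m->heap_top = mode_maps[mode].fixed_bytes;
    return 0;
}

/* Bump allocation from the remaining space, 8-byte aligned offsets;
 * returns NULL when the region is exhausted. */
void *mode_heap_alloc(ModeMemory *m, size_t n)
{
    size_t start = (m->heap_top + 7u) & ~(size_t)7u;
    if (start > m->size || n > m->size - start)
        return NULL;
    m->heap_top = start + n;
    return m->base + start;
}
```

Because the USB map reserves no fixed buffers, the whole region is available to the heap in that mode. Alignment is computed on offsets, so the sketch assumes the region itself starts on an 8-byte boundary.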
Such a dedicated routine is particularly useful for the USB connectivity mode, because several user functions are supported when the camera is connected to a personal computer or printer through a USB cable, and the required memory size and partitioning are unpredictable. For example, the memory buffer can be used for processing several pictures for printing with different aspect ratios or layout formats, or for processing commands to compose an electronic mail with pictures. In this mode, the dedicated routine allows memory to be allocated flexibly and efficiently.

[FIGURE 2.11 Examples of the memory maps corresponding to the preview/capture mode, playback mode, MPEG mode, and USB connectivity mode (from left to right). All four maps contain the RTOS kernel, system heap, system stack, system parameters, on-screen display area, dynamic heap, and main program; the preview/capture map adds preview buffers, capture raw data buffers, and image/audio bit streams; the playback map adds playback buffers, JPEG and audio bit streams, and image operation buffers; the MPEG map adds preview buffers, video raw data, and an audio bit stream. © 2005 IEEE]

2.5 Software Module Design Guidelines

The previous section focused on the general design methodology of an entire software platform. The remaining question is how to design each software module to fully utilize all available hardware resources so that all jobs are completed within the specified timing constraints. The first step in designing a software module is to study which hardware resources are available for accomplishing the desired operations. As described above, a modern CSP integrates several powerful processors and hardware engines into a single chip.
The software running on the embedded microprocessor plays an important role in coordinating the execution of all hardware resources. The programming environment of camera systems differs from that of traditional computers in some respects. Although both use operating systems to manage resources, a digital camera has several computational resources available and must satisfy many timing constraints. The camera software designer should analyze the available hardware resources and assign jobs to suitable hardware modules.

2.5.1 Available Hardware Resources Analysis

Analyzing the available hardware resources for each camera operation is the first step of software module design. The challenge of this step is to assign suitable hardware resources to specific jobs and to consider the data flow of each camera operation. The hardware engines are controlled and configured through the EMP. The DSP subsystem can perform operations only after a program has been uploaded into its internal program memory. The program on the DSP can be changed dynamically when the camera operates in different modes, and some jobs can be executed on either the EMP or the DSP. The designers should optimize system performance by balancing the load among the EMP, DSP, and other accelerators. In addition to these engines, the DMA controller and SDRAM controller play important roles in sharing data among them. To illustrate the analysis of available hardware resources for software module design, two simple examples, previewing images on a display device and MPEG audio and video playback, are described in the following subsections.

2.5.1.1 Previewing Images on Color LCD

The function of the camera preview mode is to show real-time images with exposure and white-balance corrections.
The live view engine is designed to support the camera preview and video recording modes. The EMP and the DSP are involved only in a few low-complexity calculations, such as auto-exposure (AE) and auto-white-balance (AWB), which are performed only two or three times per second.

50 Single-Sensor Imaging: Methods and Applications for Digital Cameras

[FIGURE 2.12 The hardware resource allocation in preview mode.]

As shown in Figure 2.12, several hardware resources are enabled to process the image signals from the image sensor: the CCD/CMOS sensor, the CCD timing controller, the live view accelerator, the SDRAM and its controller, the EMP, the DSP, the video encoder, and the color LCD. To analyze the resource allocation, a data flow graph is drawn in which arrowed lines represent the flow of the captured image signals inside the camera:

1. The voltage signals of the CCD are first amplified and digitized by the AFE chip. By setting appropriate parameters in the CCD controller or timing generator, only the image data corresponding to the active area of the image sensor are read out and passed to the next stage.
2. The live view engine processes the image data screened by the CCD controller. It contains a simple image pipeline that performs color interpolation, white balance, noise filtering, color correction, and tone and gamma correction. This special pipeline is designed for fast processing and omits the various image enhancement steps. It is rarely used to process final pictures, because the engine lacks the powerful noise filtering and tone and color reproduction that are typical of CSPs.
3. A few frames each second are used for AE and AWB measurement. The DSP measures the scene brightness and color temperature from the raw frame data.
   The EMP reads the estimated parameters, adjusts the exposure time of the CCD, and periodically sets new white-balance gains in the engine.

2.5.1.2 MPEG Audio/Video Playback

Since decoding MPEG video bit streams is still too computationally complex for the EMP, the CSP may include dedicated hardware modules for inverse quantization and the inverse discrete cosine transform (IDCT). The key points of hardware resource allocation and data flow design are how to fully utilize DMA for transferring data in the background and how to synchronize the EMP, the DSP, and the other hardware modules. The MPEG playback data flow, from retrieving the MPEG bit stream to displaying video frames on the LCD, is shown in Figure 2.13 and can be summarized as follows:

1. The compressed MPEG audio and video bit streams are transferred from the storage device to SDRAM through a DMA channel managed by the DMA controller.
2. The compressed audio and video data stored in SDRAM are decompressed by the DSP subsystem. Since audio and video decoding is much less complex than encoding, both are usually performed by the DSP.
3. The decoded data are transferred from the DSP to SDRAM through a DMA channel. The DMA controller signals an interrupt once the video and audio data have been completely decoded.
4. The decoded video data are displayed on the LCD, while the audio data are played through an audio codec and speaker.

[FIGURE 2.13 The hardware resource allocation in MPEG playback mode.]

2.5.2 Job Scheduling and Resource Allocation

In a camera system, several software modules may execute simultaneously.
The design may adopt several modelling tools, such as finite state machines, Petri nets, and collaboration and sequence diagrams. Rather than discussing the technical details of the various modelling tools, this section introduces one of them as applied to digital camera design.

[FIGURE 2.14 State diagram based on: (a) user's point of view, and (b) designer's point of view. In (a), the preview, AE/AF lock, and capture states are linked by the events "button S1 pressed", "button S1 released", "button S2 pressed", and "capture finished". In (b), the preview state is refined into AE, AWB, and AF substates, the AE/AF lock state into pre-capture AE and AF, and the capture state into CCD reset, CCD exposure, and raw readout, with actions such as "S1 pressed / stop 3A" and "capture finished / enable IPP".]

Job scheduling and resource allocation in cameras can be modelled by a state diagram. A complete state diagram must define states, input events, state transitions, and output actions. A state diagram can be designed from either the user's or the designer's point of view. Figure 2.14a shows a state diagram designed from the user's point of view. The default state is the preview state, in which the camera shows real-time images with proper exposure and white-balance correction. To explain how a state diagram models jobs, a scenario for capturing still images is presented as an example. A digital camera has a two-stage shutter button whose stages are called S1 and S2. When the user presses button S1, the camera enters the AE/AF lock state (also known as the pre-capture state), in which it measures scene brightness and object distance.
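The user-level diagram of Figure 2.14a can be expressed as a small transition table. The sketch below is a minimal illustration, not the platform's code: the type and function names (CamState, CamEvent, cam_step) are hypothetical, and the transition targets are our reading of the figure labels (S1 pressed, S1 released, S2 pressed, capture finished). An event that has no arrow in a given state leaves the state unchanged.

```c
#include <assert.h>

/* States and events of the user-level still-capture diagram (Figure 2.14a).
   Names are hypothetical; transitions follow our reading of the figure. */
typedef enum { ST_PREVIEW, ST_AEAF_LOCK, ST_CAPTURE, ST_COUNT } CamState;
typedef enum { EV_S1_PRESSED, EV_S1_RELEASED, EV_S2_PRESSED,
               EV_CAPTURE_FINISHED, EV_COUNT } CamEvent;

/* next_state[state][event]; an entry equal to its own row's state means
   the event is ignored in that state. */
static const CamState next_state[ST_COUNT][EV_COUNT] = {
    /*               S1 pressed    S1 released   S2 pressed    capture finished */
    [ST_PREVIEW]   = { ST_AEAF_LOCK, ST_PREVIEW,   ST_PREVIEW,   ST_PREVIEW   },
    [ST_AEAF_LOCK] = { ST_AEAF_LOCK, ST_PREVIEW,   ST_CAPTURE,   ST_AEAF_LOCK },
    [ST_CAPTURE]   = { ST_CAPTURE,   ST_CAPTURE,   ST_CAPTURE,   ST_PREVIEW   },
};

/* Return the state reached from s when event e occurs. */
CamState cam_step(CamState s, CamEvent e)
{
    assert(s < ST_COUNT && e < EV_COUNT);
    return next_state[s][e];
}
```

A table like this keeps all transitions in one place, which makes it easy to check against the diagram; a nested switch statement is an equivalent alternative.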
The state diagram in Figure 2.14a is easily understood by end users, but it is too coarse to represent the detailed control flow of the software module. Figure 2.14b shows a refined state diagram of the image capture module. It contains three composite states, each decomposed into several sequential substates. The default state is the preview state, in which the camera performs 3A (AE, AWB, and AF) periodically. When the S1 shutter button is pressed, the key scan task detects it and sends a message (S1 Pressed) to the image capture module. On receiving this message, the image capture module stops the current 3A control loop and the other activities of the DSP, and switches to the AE/AF lock state. Note that preview AE/AWB/AF, pre-capture AE/AF, and the image processing pipeline are typically executed on the same DSP, but only one of them can run on the DSP at any one time. Since the preview AE algorithm is entirely different from pre-capture AE, the capture module task calls the Stop3A() routine to stop the periodic 3A task and then runs pre-capture AE once. When Stop3A() is called, the EMP sends an interrupt to the DSP regardless of what the DSP is currently executing, and the DSP immediately switches to the pre-capture AE metering code. The diagram does not show the execution details of the image processing pipeline (IPP); the capture module only sends an enabling message to the IPP task by calling the EnableIPP routine. IPP execution is not a real-time job but a background process. Each job running in the background processing task can be interrupted by a shutter button press event, in which case its current status is temporarily saved and the remaining unprocessed data stay queued in buffers until the DSP is free again. The modules located in the functional and application layers are each assigned one or more tasks.
Scheduled by the RTOS, these tasks run independently, and their initial state is idle. An idle task can be invoked only after it receives a trigger message from another module. Button press events and hardware and software timer interrupts are the triggering sources of the entire camera system. To illustrate the module design concept, an example of the programming structure is shown below, listing part of the code of the module ImageCaptureModuleTask. The module contains an infinite loop that waits for messages from the RTOS with the routine WaitMessage(). Each received message is dispatched to the function that handles messages in the current state, so that the right action is performed in the current operation mode. This coding style is the fundamental mechanism for scheduling tasks without preemption in the software system. The other modules in the application and functional layers adopt a similar programming structure, so that tasks do not occupy computational resources until they receive messages. Note that messages or events are normally triggered by external hardware signals or internal timer interrupts, and the corresponding interrupt service routines send messages to specific modules. The hardware interrupt sources include key-press and key-release events, the ADC/DAC, the real-time clock (RTC), the CCD timing generator, and the DSP subsystem.
```c
/* Partial listing of the image capture module. The MSG and STATE types and
   the enumerator spellings are illustrative; WaitMessage(), StopDSP(), and
   EnterState() are routines provided by the platform. */
typedef enum { S1_Pressed /* , other messages ... */ } MSG;
typedef enum { PreviewState, AEAFLockState, CaptureState /* , ... */ } STATE;

extern MSG   WaitMessage(void);   /* blocks until the RTOS delivers a message */
extern void  StopDSP(void);
extern void  EnterState(STATE next);
extern STATE CurrentState;

void PreviewStateMsgFunc(MSG Message);
void AEAFLockStateMsgFunc(MSG Message);
void CaptureStateMsgFunc(MSG Message);

void ImageCaptureModuleTask(void)
{
    for (;;) {                          /* the task never terminates */
        MSG Message = WaitMessage();    /* sleep until a message arrives */
        switch (CurrentState) {         /* dispatch by the current state */
        case PreviewState:
            PreviewStateMsgFunc(Message);
            break;
        case AEAFLockState:
            AEAFLockStateMsgFunc(Message);
            break;
        case CaptureState:
            CaptureStateMsgFunc(Message);
            break;
        /* ... */
        }
    }
}

void PreviewStateMsgFunc(MSG Message)
{
    switch (Message) {
    case S1_Pressed:
        StopDSP();                  /* stop the 3A loop and other DSP activity */
        EnterState(AEAFLockState);  /* S1 leads to the AE/AF lock state */
        break;
    /* case xxButton_Pressed: ... */
    default:
        break;
    }
}
```

2.5.3 Background Processing and Data Buffering

Data buffering plays a key role in background signal processing within the camera. Some camera operations have tighter real-time constraints than others, and users care about the response time when they press particular buttons. Among all operations of a digital camera, color image and video processing and data compression are the most time-consuming. Fortunately, a camera can easily be made to respond faster to key press events by leaving previously taken images unprocessed. All unprocessed images are first queued into a buffer, and the image processing and compression jobs are executed sequentially under the scheduling of the background processing task.

The design of the background processing task must consider memory requirements, the complexity of the individual tasks, and the availability of hardware resources. Memory requirements are a critical issue in the camera, because they determine how many pictures can be left unprocessed in the buffer. For example, storing the raw data of a six-megapixel picture requires 12 MB of memory if an ADC with 10-bit or higher precision is adopted, since each sample then occupies two bytes. If 64 MB of memory is installed in the camera, fewer than five images can be queued, because the program code, system areas, and other buffers also occupy memory. Another application case is recording audio/video of unlimited length, bounded only by the capacity of the storage card.
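The buffer sizing in the still-image example above is simple arithmetic, sketched here in C under stated assumptions: each raw sample is stored in a 16-bit word (so any ADC of 10 to 16 bits costs two bytes per pixel), megabytes are decimal, and the memory reserved for the program, RTOS, heaps, and other buffers is a free parameter. The function names are hypothetical.

```c
#include <assert.h>

/* Bytes for one raw frame, assuming every sample of a 10- to 16-bit ADC
   is stored in a 16-bit word, i.e. two bytes per pixel. */
unsigned long raw_frame_bytes(unsigned long pixels)
{
    return pixels * 2UL;
}

/* Whole raw frames that fit into the memory left over after the reserved
   regions (program, RTOS kernel, heaps, stacks, display buffers) are
   subtracted. All sizes are in bytes. */
unsigned long max_queued_frames(unsigned long installed_bytes,
                                unsigned long reserved_bytes,
                                unsigned long pixels)
{
    assert(pixels > 0);
    if (reserved_bytes >= installed_bytes)
        return 0;                       /* nothing left for raw frames */
    return (installed_bytes - reserved_bytes) / raw_frame_bytes(pixels);
}
```

For a six-megapixel sensor, raw_frame_bytes(6000000UL) gives 12,000,000 bytes, so 64 MB holds at most five raw frames even with nothing else in memory. Reserving, say, 16 MB for everything else (an arbitrary figure chosen only for illustration) leaves room for four, which is consistent with the statement that fewer than five images can be queued.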
In such cases, instead of allocating a huge memory pool, two ping-pong buffers can be used so that data processing and data streaming proceed in parallel. Because the memory size required for ping-pong buffers is flexible, they are much easier to allocate and may lead to better memory usage. The following subsections present two design cases that explain the concepts of continuous still image capture and real-time MPEG audio/video recording.

2.5.3.1 Continuous Still Image Capture

An advanced digital camera can take several pictures in succession, and users are usually concerned about how fast pictures can be taken in this continuous mode. The interval between two successive shots may be too short to finish processing or compressing a picture. In practice, therefore, the image capture task switches back to preview mode immediately after the raw data readout is done. As described in the previous section, the image capture task sends an enabling message to the IPP task by calling the EnableIPP routine, and the remaining color image processing and real-time streaming jobs are taken over by the IPP task, which is a background processing job. Like the other tasks in the embedded system, the IPP task is triggered when it receives an enabling message. As shown in Figure 2.15, the captured raw image data are first queued into buffers, and the remaining jobs are then taken over by the IPP task. While the images are being processed, the user may press the shutter buttons (S1 and S2) again, which switches the DSP to executing pre-capture AE and AF. As a result, the raw data stored in the image data buffers may not be processed in time, and two or more images may be queued in the raw data buffers. Since the Joint Photographic Experts Group (JPEG) format, the most popular image file format, uses macroblocks as its basic data unit, a straightforward approach is for the IPP task to use macroblocks of 16 × 16 pixels as its basic processing unit as well.
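The macroblock-based IPP job described above, together with the requirement from Section 2.5.2 that background jobs can be interrupted by a shutter press and resumed later, can be sketched as follows. This is a hypothetical illustration: IppJob, ipp_run, and the commented-out process_macroblock step are invented names, the actual pipeline work is omitted, and the saved "current status" is reduced to the index of the next macroblock.

```c
#include <assert.h>

#define MB_SIZE 16  /* macroblock edge length in pixels */

/* Number of 16x16 macroblocks needed to tile a width x height image;
   partial blocks at the right and bottom edges count as whole blocks. */
unsigned long macroblock_count(unsigned width, unsigned height)
{
    unsigned long cols = (width + MB_SIZE - 1) / MB_SIZE;
    unsigned long rows = (height + MB_SIZE - 1) / MB_SIZE;
    return cols * rows;
}

/* Hypothetical background job: next_mb is the status saved on interruption. */
typedef struct {
    unsigned long next_mb;   /* index of the next macroblock to process */
    unsigned long total_mb;  /* macroblocks in the image */
} IppJob;

/* Process at most 'budget' macroblocks, stopping early if *preempt is
   nonzero (e.g. set by a shutter-button handler). Returns 1 once the whole
   image has been processed, 0 otherwise. */
int ipp_run(IppJob *job, unsigned long budget, const volatile int *preempt)
{
    assert(job != 0);
    while (budget > 0 && job->next_mb < job->total_mb) {
        if (preempt && *preempt)
            break;                  /* keep next_mb as the resume point */
        /* process_macroblock(job->next_mb);  -- hypothetical pipeline step */
        job->next_mb++;
        budget--;
    }
    return job->next_mb == job->total_mb;
}
```

With a 3000 × 2000 image, for example, macroblock_count returns 188 × 125 = 23,500 blocks. Saving only an index works here because blocks are processed strictly in order; a real pipeline would likely also need to preserve any encoder state carried from one block to the next.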
The DSP processes the raw image data in the image processing pipeline and writes the compressed bit stream into one of the ping-pong buffers. Once the current bit stream buffer is full, the DSP interrupts the microprocessor to enable the background storing task, which controls the DMA that streams the data onto the storage card. While the DMA streams data to the card, the DSP continues processing raw image data unless another action is enabled. This achieves parallel color image processing, compression, and real-time streaming. Note that the actions of the background processing task may be interrupted at any time by a message indicating that the user has pressed the shutter button.

[FIGURE 2.15 Background processing for still images.]

2.5.3.2 MPEG Audio/Video Recording

Video signal processing, compression, and real-time streaming are the most time-consuming jobs in a digital camera. Since the DSP subsystem is fully occupied with motion estimation and motion stabilization calculations for video compression, other data processing jobs, such as audio signal processing, are better assigned to the EMP. In addition to the fundamental video signal processing steps, bit rate control strongly affects the final quality of the recorded video. The target bit rate must be a compromise among video quality, bit stream size, and the write speed of the storage card, so the background processing should continuously monitor the average write speed of the inserted storage card. Audio data compression is triggered by DMA interrupts and executed on the microprocessor.
The analog-to-digital converter (ADC) digitizes the analog signal from the microphone and stores the digital data in a pair of ping-pong buffers. The DMA raises an interrupt once one of the buffers is full, and the EMP then takes over the remaining audio noise filtering and compression. Video data acquisition, on the other hand, is triggered by the vertical sync signal of the CCD, and the data are processed in both the live view engine and the DSP subsystem. Note that video data are typically sent from CCD sensors at 30 frames/s, but not all frames are processed because of limited computational capability. Meanwhile, the microprocessor performs AE/AWB adjustment based on the results sent from the DSP and synchronizes the video and audio bit streams to generate MPEG files. Because the bit rate of the video bit stream varies widely, as does the write speed of different brands and types of storage cards, streaming audio/video data to storage cards in real time is not as easy as on personal computers, which usually have much larger buffer memories and faster CPUs. In many cases, even flash cards of the same brand have different write speeds depending on their capacity, due to the effect of their internal SRAM buffers. In a practical design, a rate control algorithm can be combined with the buffering scheme described here to automatically balance video quality against the write speed of the storage card. As shown in Figure 2.16, the available SDRAM space is organized as a circular buffer for the video bit stream and ping-pong buffers for the audio bit stream.

[FIGURE 2.16 Audio/video bit stream buffers.]
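The audio ping-pong scheme described above can be sketched in C as follows. It is a minimal, single-threaded model under simplifying assumptions: the PingPong type and pp_* functions are hypothetical names, pp_put stands in for the ADC/DMA side, the moment a half becomes full marks where the DMA interrupt would fire, pp_take stands in for the EMP side, and a tiny buffer size is used so the behavior is easy to trace. Real firmware would fill the halves by DMA and synchronize the two sides through the interrupt rather than through function calls.

```c
#include <assert.h>
#include <stddef.h>

#define PP_SAMPLES 4   /* tiny half-buffer size, for illustration only */

/* Two equally sized halves: the producer fills one while the consumer
   works on the other. */
typedef struct {
    short  buf[2][PP_SAMPLES];
    int    fill_idx;   /* half currently being filled */
    size_t count;      /* samples already written into buf[fill_idx] */
    int    ready_idx;  /* half handed to the consumer, or -1 if none */
} PingPong;

void pp_init(PingPong *pp)
{
    assert(pp != NULL);
    pp->fill_idx = 0;
    pp->count = 0;
    pp->ready_idx = -1;
}

/* Producer side: store one sample. When the active half becomes full it is
   marked ready (where the DMA interrupt would fire) and the producer swaps
   to the other half. Returns 1 if a half became ready. Overrun, i.e. the
   consumer still holding the other half, is not handled in this sketch. */
int pp_put(PingPong *pp, short sample)
{
    assert(pp != NULL);
    pp->buf[pp->fill_idx][pp->count++] = sample;
    if (pp->count < PP_SAMPLES)
        return 0;
    pp->ready_idx = pp->fill_idx;
    pp->fill_idx ^= 1;             /* switch to the other half */
    pp->count = 0;
    return 1;
}

/* Consumer side: take the ready half, or NULL if none is ready yet. */
const short *pp_take(PingPong *pp)
{
    const short *half;
    assert(pp != NULL);
    if (pp->ready_idx < 0)
        return NULL;
    half = pp->buf[pp->ready_idx];
    pp->ready_idx = -1;
    return half;
}
```

The key property is that the producer never writes into the half that has just been handed to the consumer, so filtering and compression of one block can overlap with the acquisition of the next, as long as the consumer finishes before the other half fills up.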
Starting from a default video compression rate, the DSP begins MPEG video compression, and the DMA is enabled to stream the data to the storage card whenever the data cross a block boundary. The ratio between the total numbers of audio and video frames can be calculated from the recorded time stamps, and the target bit rate is adjusted dynamically according to the free space in the buffers.

2.5.4 Power Aware Design

Portable devices rely on batteries for power, so power consumption is an important issue in achieving long battery life. Besides selecting good battery technology and low-power components, much can be done in the embedded software to achieve optimal performance. The events with the highest power consumption are changing the zoom step, running auto-focus, enabling the mechanical shutter driver, switching the aperture size, and charging the strobe capacitor. It is easy, for instance, to ensure that the aperture and the mechanical shutter are never activated at the same time. In general, designers should keep several features of batteries and power circuits in mind. First, a battery always has an effective series resistance. The higher the current drawn from the battery, the larger the voltage drop across this series resistance and the lower the terminal voltage of the battery. A lower terminal voltage usually reduces the efficiency of the power circuits, which then draw even more current from the battery for the same output current. A larger voltage drop across the series resistance not only wastes energy but also dissipates more heat. It is therefore always a good strategy to keep the battery current low. Second, the effective series resistance is not constant, and the characteristics of the battery change drastically close to energy depletion.
As the battery voltage drops, there is a threshold voltage below which the battery stops providing power. The software must be able to detect this threshold and shut the device down before it is reached. Several things need to be taken into consideration when developing embedded software for a digital camera:

i) turn off the power to circuits that are not in use;
ii) monitor the battery level and adjust the power control accordingly;
iii) when possible, gauge the remaining energy stored in the battery;
iv) arrange the timing so that circuits with high power consumption are not activated at the same time.

2.6 Embedded Software Design of Built-in Automatic Camera Calibration

The characteristics of some hardware components, such as sensors, lens modules, and mechanical shutters, usually vary from one camera unit to another, even when the components come from the same manufacturer. The component specifications guarantee these characteristics only within an acceptable range. Unfortunately, the final picture quality depends strongly on several characteristics of the image sensor and lens module, namely: the variation in sensitivity among color channels; the sensor saturation voltage in preview and capture modes; the effective aperture ratio for different F-numbers of the lens; the percentage of lens falloff and the best active window aligned with the sensor; the photo response non-uniformity among the pixels of the same sensor; bad pixel identification; and the extra DC voltage generated by the sensor dark current and circuit offset. It is difficult to produce good images if any of these items is not accurately calibrated or otherwise compensated. Since a camera model can only be designed around a nominal set of component parameters, the variations in the hardware components must be calibrated.
The best approach is to incorporate an automatic calibration function into the software platform. Each camera can then be calibrated on the production line without being connected to a computer, or with computers needed only at the start of the line to initialize the calibration process. This drastically reduces the number of computer-based test fixtures required, and in many cases the same test environment can accommodate several cameras and calibrate them simultaneously. If calibration data need to be collected for statistical analysis, they can be stored either on the camera or on a memory card and uploaded later at a final quality control station. This system increases throughput dramatically and reduces mistakes on the production line. The following sections first describe the system flow and then discuss detailed calibration methods for the two major calibration items.

2.6.1 Automatic Camera Calibration Flow

Figure 2.17 shows the calibration flow of the proposed system. Since the program is embedded in the camera, the camera can run all five steps automatically. The only equipment needed for these calibration items is a standard light box that provides stable and uniform light; the intensity and color temperature of the light box need to be measured and calibrated periodically.

[FIGURE 2.17 The calibration flow of the proposed system: black level → mechanical shutter delay and aperture ratio → minimum AGC gain → sensor white balance and active window → bad pixel. © 2005 IEEE]

The design methodology of this camera calibration flow is based on the following observations:

• The captured raw data contain DC components caused by the sensor dark current and other offset voltages inherent in analog circuits. These components must be removed before the data can be used in the other calibration processes.
  Even though many commercially available analog front-end processors already perform very well in removing the optical black level, it is still good practice to verify the result, so this step is performed first.
• The minimum automatic gain control (AGC) setting determines the linear region of the optoelectronic conversion function (OECF) of the sensor response in preview and capture modes. It is needed for the sensor to provide the highest dynamic range for image capture. However, to ensure that still image capture works correctly, the mechanical shutter delay must be calibrated before the minimum AGC setting.
• Mechanical shutter delay calibration must be executed for all aperture sizes, since each aperture has a different mechanical shutter delay. The captured raw data can therefore also be used to calculate the aperture ratio.
• Sensor white balance calibration relies on accurate AGC settings to prevent the sensor from operating in its nonlinear region. Thus, the minimum AGC setting must be calibrated before the sensor white balance.
• Active window adjustment can be combined with sensor white balance calibration, because the optimal exposure time and gain setting have already been determined at this stage.
• Bad pixel identification is carried out as the last step because its exposure conditions must be well controlled.

For the black level, sensor white balance, active window, and bad pixel calibrations, only one or two raw images need to be taken for simple statistical analysis. This section therefore discusses only the mechanical shutter delay and minimum AGC setting calibrations.

2.6.2 Mechanical Shutter Delay Calibration

A typical timing diagram of a CCD-based digital still camera is shown in Figure 2.18.
As long as the pixels are exposed to incident light, image charge accumulates, and the amount of charge is proportional to the exposure time. The electronic shutter consists of a sequence of reset pulses (RP), which clear the image charge accumulated on the sensor elements, called photosites. With consecutive RPs, the image charge stored on the photosites keeps being cleared until the last RP finishes. The time from the last RP to the instant at which the final image charge is transferred to the vertical CCDs is the so-called effective electronic shutter time. In high-resolution commercial digital still cameras, the vertical CCDs cannot transfer the entire array of image charges at once. Consequently, the image array is usually split into two or more fields, and the image data are read out field by field. While the first field is being read out, the remaining fields are still exposed to the incident light unless it is blocked, which results in uneven exposure. A popular solution is to use a mechanical shutter that blocks the incident light completely. The result is a combination of an electronic shutter that controls the start of the exposure and a mechanical shutter that controls its end.

[FIGURE 2.18 Typical CCD exposure timing, showing the VD, RP, and MS_CTL signals along the time axis t. The term $T_{D\_MS}$ denotes the mechanical shutter delay time, $T_{W\_MS}$ the mechanical shutter control signal wait time, $T_{EE}$ the electronic shutter exposure time, $T_{E\_MS}$ the mechanical shutter exposure time, and $T_{VD}$ the VD cycle time. © 2005 IEEE]

[FIGURE 2.19 Shutter closure and delay time: shutter opening (100%, 90%, 50%, 10%) versus time, with the intervals $T_{ES}$, $T_{D\_MS}$, and $T_{CL}$, the closure line $L_{CL}$, the equivalent perfect-shutter line $L_P$, and the areas A and B. © 2005 IEEE]

A typical mechanical shutter (MS) is driven by a solenoid, whose speed is finite, as shown in Figure 2.19.
This is because the solenoid needs to accumulate enough energy to overcome static friction and move the shutter blades. In addition, the delay time and the shutter closure time vary from component to component. In real applications, if the effective opening area of the mechanical shutter is plotted as a function of time, the shutter closure curve can be approximated by a slanted line $L_{CL}$. The time from 90% to 10% of the full opening of the mechanical shutter is usually called the closure time $T_{CL}$. Assuming that area A is approximately equal to area B, the closure of the MS can be replaced with an equivalent perfect shutter that closes at the instant of 50% opening, represented by the vertical line $L_P$. The effective exposure time therefore consists of two parts: the time $T_{ES}$ during which the mechanical shutter remains completely open, and roughly $T_{CL}/2$. In normal applications, the reset pulses should not fall within the shutter closure period.

Referring to Figure 2.18, the mechanical shutter is calibrated as follows. The period $T_{VD}$ between VD pulses is known and is usually designed to be close to 1/30 s. Define the electronic shutter time $T_{EE}$ as the time between the falling edge of the last RP and the start of the next VD pulse; this time can be set in software. We set a mechanical shutter control signal wait time $T_{W\_MS}$ and try to measure the mechanical shutter delay time $T_{D\_MS}$. The effective shutter time resulting from the electronic and mechanical shutters is designated $T_{E\_MS}$. These variables are related by

$$T_{VD} - T_{EE} + T_{E\_MS} = T_{W\_MS} + T_{D\_MS} \tag{2.2}$$

The calibration procedure tries different values of the electronic shutter time $T_{EE}$ to find the point at which the effective shutter time $T_{E\_MS}$ becomes zero. At that point, the CCD output voltage also becomes zero.
Under these conditions, substituting $T_{E\_MS} = 0$ and $T_{EE} = T_{eez}$ into Equation 2.2 gives

$$T_{VD} - T_{eez} = T_{W\_MS} + T_{D\_MS} \tag{2.3}$$

[FIGURE 2.20 Shutter characteristic: CCD output level $G_{AV}$ versus exposure time $T_{EE}$, with the linear approximation L1 crossing the time axis at the zero point ZP ($T_{eez}$). © 2005 IEEE]

From Equation 2.3, the mechanical shutter delay time $T_{D\_MS}$ can be calculated. The value of $T_{W\_MS}$ can then be adjusted so that the instant at which the mechanical shutter actually closes is fixed relative to the leading edge of VD. The difficulty in finding the closing point ZP in Figure 2.20 is that the output level $G_{AV}$ of the CCD is very low there: the data tend to be corrupted by noise when the effective shutter time is extremely short. In addition, when the electronic shutter time is close to the mechanical shutter closure time, the effective exposure is affected by the mechanical shutter tolerance. We can therefore rely only on the region where the effective exposure time is long enough for the CCD output level to be relatively high. In real applications with a 10-bit ADC, the starting electronic shutter time $T_{ee1}$ is adjusted so that the output level $G_{AV1}$ is approximately 800. A minimum mean square error (least-squares) fit then gives a linear approximation to the measured data points. The fitted line, L1, is extended until it intersects the horizontal axis at the crossing point ZP. Line L1 has the form

$$G_{AV} = \alpha \, T_{EE} + \beta \tag{2.4}$$

and the location $T_{eez}$ of the point ZP is

$$T_{eez} = -\frac{\beta}{\alpha} \tag{2.5}$$

where $\alpha$ and $\beta$ are the slope and the offset of line L1, respectively.
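Equations 2.3 to 2.5 translate directly into a few lines of code. The sketch below fits line L1 by ordinary least squares, takes its zero crossing as $T_{eez}$, and then solves Equation 2.3 for $T_{D\_MS}$. The function names are hypothetical, the input arrays are assumed to hold averaged output levels measured at several electronic shutter times in the high-output region only, and all times share one unit chosen by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Least-squares fit of G_AV = alpha * T_EE + beta (Equation 2.4) over n
   samples. Returns 0 on success, -1 if the fit is degenerate (fewer than
   two samples or all exposure times equal). */
int fit_line(const double *tee, const double *gav, size_t n,
             double *alpha, double *beta)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0, denom;
    size_t i;
    assert(tee && gav && alpha && beta);
    for (i = 0; i < n; i++) {
        sx  += tee[i];
        sy  += gav[i];
        sxx += tee[i] * tee[i];
        sxy += tee[i] * gav[i];
    }
    denom = (double)n * sxx - sx * sx;
    if (n < 2 || denom == 0.0)
        return -1;
    *alpha = ((double)n * sxy - sx * sy) / denom;
    *beta  = (sy - *alpha * sx) / (double)n;
    return 0;
}

/* Zero crossing ZP of line L1 (Equation 2.5): T_eez = -beta / alpha. */
double zero_crossing(double alpha, double beta)
{
    return -beta / alpha;
}

/* Mechanical shutter delay from Equation 2.3: T_D_MS = T_VD - T_eez - T_W_MS. */
double shutter_delay(double t_vd, double t_eez, double t_w_ms)
{
    return t_vd - t_eez - t_w_ms;
}
```

For instance, samples lying exactly on $G_{AV} = 100\,T_{EE} - 200$ yield $\alpha = 100$, $\beta = -200$, and $T_{eez} = 2$; with $T_{VD} = 33.3$ and $T_{W\_MS} = 20$ (illustrative values in milliseconds), the estimated delay is 11.3.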
2.6.3 Image Sensor Calibration

The OECF curve of a typical CCD output contains linear, nonlinear, and saturation regions, as shown in Figure 2.21a. Only the linear region is suitable for image processing [40]. Pixels operating outside the linear region are very difficult, if not impossible, to handle with typical image processing pipelines, mainly because the nonlinearity is usually not consistent among different sensors, which makes it difficult to characterize and exploit.

[FIGURE 2.21 (a) Optoelectronic conversion function of CCD sensors (output voltage versus exposure; linear, nonlinear, and saturation regions). (b) Minimum AGC setting calibration (ADC output, full scale 4095, versus exposure; curves C0 and C1, linear approximation L3, points P and Q). © 2005 IEEE]

To provide better sensitivity and a wider dynamic range, the CCD sensor should be operated over the full range of its linear region. This can be achieved by finding the AGC setting at which the upper edge of the linear range matches the maximum input voltage $V_R$ of the ADC. The corresponding AGC parameter is called the minimum AGC setting $A_{MIN}$. Under normal lighting, the AGC is always set to $A_{MIN}$; higher AGC settings are used to increase the effective sensitivity in darker environments. This setting also guides the auto-exposure control in keeping the sensor out of its nonlinear region. The proposed calibration process analyzes the ADC output. Figure 2.21b shows a typical example using a 12-bit ADC. First, a relatively low AGC gain $A_0$ is set so that the ADC output in the saturation region is well below the full-scale ADC output, which is 4095 in this case.
Then a few raw images are taken with different shutter times, and the center green pixels are averaged to check the output level of the CCD. The associated OECF curve is denoted C0, and a linear approximation line L3 is derived from these data points. Next, the shutter time is increased further, and the average value of each captured raw image is compared with the value predicted by the linear approximation. When, at time $T_0$, the actual value $D_P$ at point P is found to be about 3% lower than the estimated value $D_L$, this point is taken as the end of the linear region of the OECF. The additional gain needed to bring point P to point Q is then simply $A_1 = 4095/D_P$. With the overall AGC gain set to $A_{MIN} = A_0 \times A_1$, the resulting OECF curve is C1.

2.7 Conclusion

Designing versatile digital cameras that support both attractive features in user operation mode and calibration/testing functions in engineering mode is a complex process. Robust embedded software platforms can speed up the development of digital cameras and shorten the design cycle. This chapter described a camera software platform that has been used successfully in developing several consumer cameras. The descriptions of the major hardware components and of the operation modes supported by this platform make the camera hardware architecture and practical camera design easy to understand. In addition, the proposed embedded self-calibration flow and the sensor and shutter calibration algorithms offer a valuable reference for efficiently building consumer cameras on mass production lines.

Acknowledgment

Figure 2.1, Figure 2.8, and Figure 2.11 are reprinted from Reference [25], Figure 2.5 is reprinted from Reference [26], and Figure 2.17 to Figure 2.21 are reprinted from Reference [37], with the permission of IEEE.

References

[1] S. Kawamura, “Capturing images with digital still cameras,” IEEE Micro, vol. 18, no.
6, pp. 14–19, November-December 1998. [2] N. Nakano, R. Nishimura, H. Sai, A. Nishizawa, and H. Komatsu, “Digital still camera system for megapixel CCD,” IEEE Transactions on Consumer Electronics, vol. 44, no. 3, pp. 581– 586, August 1998. [3] S. Okada, Y. Matsuda, T. Yamada, and A. Kobayashi, “System on a chip for digital still camera,” IEEE Transactions on Consumer Electronics, vol. 45, no. 3, pp. 584–590, August 1999. [4] M.J. Loinaz, K.J. Singh, A.J. Blanksby, D.A. Inglis, K. Azadet, and B.D. Ackland, “A 200mW, 3.3-V, CMOS color camera IC producing 352 × 288 video at 30 frames/s,” IEEE Journal of Solid-State Circuits, vol. 33, no. 12, pp. 2092–2103, December 1998. [5] D. Talla, C.Y. Hung, R. Talluri, F. Brill, D. Smith, D. Brier, B. Xiong, and D. Huynh, “Anatomy of portable digital mediaprocessor,” IEEE Micro, vol. 24, no. 2, pp. 32–39, MarchApril 2004. [6] K. Illgner, H.G. Gruber, P. Gelabert, J. Liang, Y. Yoo, W. Rabadi, and R. Talluri, “Programmable DSP platform for digital still cameras,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Phoenix, AZ, USA, March 1999, vol. 4, pp. 2235–2238. [7] COACH-9m, Camera-on-a-chip, Digital camera processor. Technical Report, Zoran Corporation, 2005. [8] S. Miyashita, H. Teshirogi, T. Sato, T. Nakajima, and K. Nishio, “A camera chipset for dualmode mega-pixel camcoders,” in Proceedings of the IEEE International Conference on Consumer Electronics, Los Angeles, CA, USA, June 2000, pp. 174–175. [9] TMS320DM310, Digital Media (DSP) Technical Reference Manual. Texas Instruments, 2003. [10] S. Okada, Y. Matsuda, T. Yamada, and A. Kobayashi, “System on a chip for digital still camera,” IEEE Transactions on Consumer Electronics, vol. 45, no. 3, pp. 584–590, August 1999. Reusable Embedded Software Platform for Versatile Single-Sensor Digital Cameras 65 [11] H. Mori, T. Hanagata, H. Nakada, N. Gamou, Y. Taura, T. Nakajima, D. Kumagai, and N. 
Osawa, “A digital color camera LSI chip set for multiple applications,” IEEE Transactions on Consumer Electronics, vol. 43, no. 3, pp. 725–731, August 1997. [12] G.H. Smith, Camera Lenses: From Box Camera to Digital. Bellingham, WA: SPIE Press, 2006. [13] S.C. Park and R.R. Shannon, “Zoom lens design using lens modules,” Optical Engineering, vol. 35, no. 6, pp. 1668–1676, June 1996. [14] Y. Ogata, “Zoom lens system,” U.S. Patent 4 789 226, December 1988. [15] I.A. Neil and E.I. Betensky, “High performance zoom lens system,” U.S. Patent 6 122 111, September 2000. [16] N. Nanba, “Zoom lens system and camera having the same,” U.S. Patent 6 931 207, August 2005. [17] P.L.P. Dillon, D.M. Lewis, and F.G. Kaspar, “Color imaging system using a single CCD area array,” IEEE Journal of Solid-State Circuits, vol. 13, no. 1, pp. 28–33, August 1978. [18] J.R. Janesick, Scientific Charge-Coupled Devices. Bellingham, WA: SPIE Press, 2001. [19] J. Nakamura (ed.), Image Sensors and Signal Processing for Digital Still Cameras. Boca Raton, FL: CRC Press, 2005. [20] A.E. Gamal and H. Eltoukhy, “CMOS image sensors,” IEEE Circuits and Devices Magazine, vol. 21, no. 3, pp. 6–20, May-June 2005. [21] K. Yoon, C. Kim, B. Lee, and D. Lee, “Single-chip CMOS image sensor for mobile applications,” IEEE Journal of Solid State Circuits, vol. 37, no. 12, pp. 1839–1845, December 2002. [22] E.R. Fossum, “CMOS image sensors: electronic camera-on-a-chip,” IEEE Transactions on Electron Devices, vol. 44, no. 10, pp. 1689–1698, October 1997. [23] Y. Fujimoto, H. Tani, M. Maruyama, H. Akada, H. Ogawa, and M. Miyamoto, “A low-power switched-capacitor variable gain amplifier,” IEEE Journal of Solid-State Circuits, vol. 39, no. 7, pp. 1213–1216, July 2004. [24] M. Koen, “An analog-to-digital processor for camcorders and digital still cameras,” IEEE Transactions on Consumer Electronics, vol. 44, no. 3, pp. 570–580, August 1998. [25] W.C. Kao, C.C. Kao, C.K. Lin, T.H. Sun, and S.Y. 
Lin, “Reusable embedded software platform for versatile camera systems,” IEEE Transactions on Consumer Electronics, vol. 51, no. 4, pp. 1379–1386, November 2005. [26] W.C. Kao, S.H. Chen, T.H. Sun, T.Y. Chiang, and S.Y. Lin, “An integrated software architecture for real-time video and audio recording systems,” IEEE Transactions on Consumer Electronics, vol. 51, no. 3, pp. 879–884, August 2005. [27] W.C. Kao and S.Y. Lin, “Various auto exposure control strategies for digital cameras,” Images & Recognition, vol. 12, 2006. [28] S. Shimizu, T. Kondo, T. Kohashi, M. Tsuruta, and T. Komuro, “A new algorithm for exposure control based on fuzzy logic for video cameras,” IEEE Transactions on Consumer Electronics, vol. 38, no. 3, pp. 617–623, August 1992. [29] T. Kuno, H. Sugiura, and N. Matoba, “A new automatic exposure system for digital still cameras,” IEEE Transactions on Consumer Electronics, vol. 44, no. 1, pp. 192–199, February 1998. [30] C.M. Chen, C.M. Hong, and H.C. Chuang, “Efficient auto-focus algorithm utilizing discrete difference equation prediction model for digital still cameras,” IEEE Transactions on Consumer Electronics, vol. 52, no. 4, pp. 1135–1143, November 2006. 66 Single-Sensor Imaging: Methods and Applications for Digital Cameras [31] J.S. Lee, Y.Y. Jung, B.S. Kim, and S.J. Ko, “An advanced video camera system with robust AF, AE, and AWB control,” IEEE Transactions on Consumer Electronics, vol. 47, no. 3, pp. 694–699, August 2001. [32] J.H. Lee, K.S. Kim, and B.D. Nam, “Implementation of a passive automatic focusing algorithm for digital still camera,” IEEE Transactions on Consumer Electronics, vol. 41, no. 3, pp. 449–454, August 1995. [33] E. Johansson, A. Wesslen, L. Bratthall, and M. Host, “The importance of quality requirements in software platform development - A survey,” in Proceedings of the 34th Hawaii International Conference on System Sciences, Maui, HI, USA, January 2001, pp. 1–10. [34] K.H. 
Kim, “APIs for real-time distributed object programming,” IEEE Computer, vol. 3, no. 6, pp. 72–80, June 2000. [35] A.S. Vincentelli and G. Martin, “Platform-based design and software design methodology for embedded systems,” IEEE Design and Test of Computers, vol. 18, no. 6, pp. 23–33, November-December 2001. [36] P.G. Paulin, C. Lien, M. Cornero, F. Nacabal, and G. Goossens, “Embedded software in realtime signal processing systems: application and architecture trends,” Proceedings of the IEEE, vol. 85, no. 3, pp. 419–435, March 1997. [37] W.C. Kao, C.M. Hong, and S.Y. Lin, “Automatic sensor and mechanical shutter calibration for digital still cameras,” IEEE Transactions on Electron Devices, vol. 51, no. 4, pp. 1060–1066, November 2005. [38] D.A. Kerr, APEX-The Additive System of Photography Exposure, July 2006. [39] P. Bigioi, G. Susanu, P. Corcoran, and I. Mocanu, “Digital camera connectivity solutions using the picture transfer protocol,” IEEE Transactions on Consumer Electronics, vol. 48, no. 3, pp. 417–427, August 2002. [40] H.C. Lee, Introduction to Color Imaging Science. New York: Cambridge University Press, 2005. 3 Digital Camera Image Processing Chain Design James E. Adams, Jr. and John F. Hamilton, Jr. 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 3.2 A First Image Processing Path and the Basic Building Blocks . . . . . . . . . . . . . . . . . 69 3.2.1 Cost and User Sensitivity Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.2.2 Systematic Sensor Data Correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 3.2.2.1 Dark Floor Subtraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 3.2.2.2 Structured Noise Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 3.2.3 CFA Data Correction . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 3.2.3.1 Stochastic Noise Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 3.2.3.2 Exposure and White Balance Correction . . . . . . . . . . . . . . . . . . . . . . . . 73 3.2.4 Adjusted CFA Image and Image Data Calibration . . . . . . . . . . . . . . . . . . . . . . . 74 3.2.4.1 Color Filter Array Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 3.2.4.2 Stochastic Color Noise Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 3.2.4.3 Color Correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 3.2.5 Image Space Rendering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 3.2.5.1 Tone Scale and Gamma Correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 3.2.5.2 Edge Enhancement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 3.3 Variations on the First Image Processing Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 3.3.1 Luminance-Chrominance Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 3.3.2 Spatial Frequency Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 3.3.3 Computing Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 3.3.3.1 Intermediate Data Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 3.3.3.2 Physical Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 3.3.4 Resizing and Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 3.3.4.1 Image Resizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. 90 3.3.4.2 Image Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 3.3.5 Other Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 3.3.5.1 Bit Depth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 3.3.5.2 Nonlinear Photometric Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 3.3.5.3 Extended Dynamic Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 3.4 How Video Differs from Still Photography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 67 68 Single-Sensor Imaging: Methods and Applications for Digital Cameras 3.1 Introduction The transformation of digital camera raw sensor image data into a full-color fully processed image involves a complex chain of computations. The possible orderings of individual operations and associated implementation details that constitute the image processing chain can lead to a sea of permutations. However, despite this seemingly immense number of available degrees of freedom, the problem of image processing chain design is overconstrained. Image quality must be maximized while compute resource use must be minimized. It is the minimization of required computational effort that, in fact, severely restricts the number of degrees of freedom in the image processing chain design problem. Consequently, image processing operations that are highly effective may not be viable candidates for image processing chain for constrained compute environments. 
In the end, the process of designing an image processing chain becomes one of taking relatively simple, well-known image processing operations and staging them in a manner that produces the best synergistic effects. This chapter begins with Section 3.2, which presents in detail a basic image processing chain that will be used as a reference for the balance of the chapter. After a discussion of cost and user sensitivity considerations (Section 3.2.1), the reference path is artificially segmented into four main stages: correcting image errors resulting from flaws of the sensor hardware (Section 3.2.2), correcting image errors caused by the image capture conditions (Section 3.2.3), creating a standardized processed image (Section 3.2.4), and rendering an output device-specific, fully processed image (Section 3.2.5). Each section is further subdivided into component operations such as dark floor subtraction, structured noise reduction, stochastic noise reduction, exposure and white balance correction, color filter array interpolation (demosaicking), stochastic color noise reduction, color correction, tone scale and gamma correction, and edge enhancement. With the reference path fully presented, a number of variations are discussed in detail in Section 3.3. Section 3.3.1 presents the idea of splitting the chain in part along the lines of independent and interdependent luminance-chrominance image processing paths. Section 3.3.2 follows up with a similar discussion of the merits of splitting the chain along the lines of spatial frequency content. The nature of the computing environment and its impact on the image processing chain design process are discussed in detail in Section 3.3.3. In addition to the topics of internal image data management and manipulation, that section addresses the implementation details for a variety of physical computing environments.
Section 3.3.4 focuses on resizing and compression, two key image processing operations not included in the initial reference path. Finally, other important factors, such as data bit depth, the use of nonlinear photometric spaces, and the issues surrounding extended dynamic range processing, are presented in Section 3.3.5. The bulk of this chapter is concerned with processing digital still images. Digital video imaging is an equally important application, and Section 3.4 discusses the significant changes that must be made to the image processing chain in response to this very different imaging environment. Section 3.5 summarizes the main image processing chain design ideas.

[Figure 3.1: Reference image processing chain: raw CFA image → dark floor subtraction → structured noise reduction → stochastic noise reduction → exposure and white balance correction → adjusted CFA image → color filter array interpolation → “camera” RGB image → stochastic color noise reduction → color correction* → tone scale and gamma correction* → edge enhancement* → “display” RGB image. * denotes a potential noise amplification step.]

3.2 A First Image Processing Path and the Basic Building Blocks

As will be seen, a central point of image processing chain design is that there are a number of plausible orderings and configurations. Indeed, each image processing chain designer will have his or her own preferred, valid chain configurations with supporting justifications. In this regard, any reference to a “standard” image processing path must be taken with a grain of salt. Still, it is a useful talking point. To this end, Figure 3.1 presents the reference image processing chain for the discussion that follows. Example images obtained at different stages of the chain are presented in Chapter 1.
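As an illustration only, the reference chain of Figure 3.1 can be written as an ordered list of stages. In the minimal Python sketch below, every stage is a hypothetical identity placeholder (a real chain would substitute actual implementations), and the boolean flag mirrors the asterisks that mark potential noise amplification steps. The helper `noise_reduction_before_amplifiers`, also hypothetical, checks by simple name matching that every noise reduction stage comes before the first amplifying stage, the ordering argued for in Section 3.2.3.1.

```python
from typing import Callable, List, Tuple

import numpy as np

# Illustrative sketch; all names here are hypothetical.
# Each entry: (stage name, stage function, potential noise amplifier?)
Stage = Tuple[str, Callable[[np.ndarray], np.ndarray], bool]


def identity(img: np.ndarray) -> np.ndarray:
    """Hypothetical placeholder standing in for a real stage implementation."""
    return img


# Stage order follows Figure 3.1; True marks the steps flagged with an asterisk.
REFERENCE_CHAIN: List[Stage] = [
    ("dark floor subtraction", identity, False),
    ("structured noise reduction", identity, False),
    ("stochastic noise reduction", identity, False),
    ("exposure and white balance correction", identity, False),
    ("color filter array interpolation", identity, False),
    ("stochastic color noise reduction", identity, False),
    ("color correction", identity, True),
    ("tone scale and gamma correction", identity, True),
    ("edge enhancement", identity, True),
]


def run_chain(image: np.ndarray, chain: List[Stage]) -> np.ndarray:
    """Apply each stage in order, feeding the output of one into the next."""
    for _name, stage, _amplifies in chain:
        image = stage(image)
    return image


def noise_reduction_before_amplifiers(chain: List[Stage]) -> bool:
    """True if every stage named '... noise reduction' precedes the first amplifier."""
    first_amp = next((i for i, (_, _, amp) in enumerate(chain) if amp), len(chain))
    return all(i < first_amp for i, (name, _, _) in enumerate(chain)
               if "noise reduction" in name)
```

Representing the chain as data makes alternative orderings, such as the variations discussed in Section 3.3, easy to express and compare.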
3.2.1 Cost and User Sensitivity Considerations

In a system as complex as an image processing chain, many engineering decisions must be made. Many of these decisions will have no objective basis for determining what is “right”; instead, they will depend on what is preferred or expected. Consider a camera used to photograph a wedding versus one used to take holiday snapshots at the beach. The user expectations (UE) concerning the resulting image quality would be very different. This point will be raised numerous times in the following discussions. The image chain designer must make technical decisions that ultimately will be judged according to the UE of the target customer base.

3.2.2 Systematic Sensor Data Correction

Figure 3.1 begins with the raw color filter array (CFA) image data produced by the sensor. This data will typically be in a nominally linear photometric space; as a result, the pixel values will be directly proportional to scene reflectance in the scene space. The capture bit depth will typically be anywhere from 8 to 16 bits, with 8- and 10-bit systems (that is, those that capture 256 and 1024 code values per pixel, respectively) being the most common. With planning (and perhaps some luck), the desired portion of the scene's dynamic range will fall within the range of valid code values without being significantly clipped.

3.2.2.1 Dark Floor Subtraction

A tacit assumption often made is that a pixel receiving no light will produce a code value of zero. Unfortunately, thermal noise and other non-photon-generated noise sources will produce nonzero pixel values. As a result, the first course of action is to subtract a dark floor from the original CFA image. This can be as simple as subtracting the same fixed value from every pixel in the image. Alternatively, a spatially dependent set of values can be subtracted.
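Both options reduce to a subtraction followed by clipping at zero. The following minimal NumPy sketch assumes the dark reference values come from a lens-cap capture or from shielded pixels, and estimates the fixed offset as their mean reduced by k standard deviations, following the rule of thumb given in this subsection (k of 1 or 2). The function names are hypothetical.

```python
import numpy as np

# Illustrative sketch; function names are hypothetical.


def estimate_dark_offset(dark_values, k=1.0):
    """Fixed dark floor: mean of dark reference values minus k standard deviations.

    Subtracting slightly less than the mean deliberately leaves some residual
    bias so that shadow values are not clipped.
    """
    d = np.asarray(dark_values, dtype=np.float64)
    return max(float(d.mean() - k * d.std()), 0.0)


def subtract_dark_floor(cfa, dark_floor):
    """Subtract a scalar offset or a same-shaped dark floor mask; clip negatives to 0."""
    return np.clip(np.asarray(cfa, dtype=np.float64) - dark_floor, 0.0, None)
```

Passing an array with the same shape as the CFA image instead of a scalar (for example, the average of several dark frames) gives the spatially dependent variant.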
While the nominal goal is to remove unwanted biases in the pixel data, some residual bias may be left to avoid data clipping and other quantization errors in the shadow regions.

Certainly, subtracting a single fixed value from the image is the quickest and easiest operation to implement. For a low-UE situation, it will be an almost automatic choice for the image chain designer. The only subtleties in this case are deciding how much of the dark floor to subtract and how to vary it as a function of camera exposure index (ISO), shutter time, and temperature. As to the first question, a quick characterization of the image noise near the photometric zero point of the image will provide a mean and standard deviation. Photon noise follows a Poisson distribution, but because other noise sources may be present, the overall distribution is not guaranteed to be Poisson. Clearly, the mean should not be clipped from the image data. A safe initial setting might be to subtract a value equal to the mean reduced by one or two standard deviations. These statistics can be determined empirically with either a “lens cap” shot or a capture of a good-quality matte black test card. Alternatively, if the sensor has shielded pixels around its perimeter, these values can be interrogated at the time of capture. This last approach has the added benefit of characterizing the dark floor under the actual conditions of the camera at the time of capture, such as temperature, ISO, and shutter time. In the absence of shielded pixels, a factory calibration of the camera will be needed to determine nominal dark floor values for at least each ISO setting the camera supports.

3.2.2.2 Structured Noise Reduction

Unfortunately, at this point we must abandon the notion that the sensor is perfect.
The assumption that the dark floor is constant across the entire image capture is a simplification that is valid only if the UE is sufficiently lax. There are many reasons why the dark floor may not be uniform across the extent of the sensor. Apart from physical flaws or nonuniformities in the sensor itself, the proximity of heat-producing components in the camera body may cause one part of the sensor to be warmer than another.

Entering the world of spatially dependent dark floor correction, the task is now to subtract a dark floor mask from the captured image. This mask is created either in the factory or from a dark field shot (i.e., a closed-shutter capture) that may be captured automatically when the camera powers up. As discussed above, this mask can be suitably scaled prior to subtraction. One liability of dark floor mask subtraction is that while the structured noise in the image is reduced, the stochastic noise is increased. The two random variables involved, the stochastic noise in the captured image and the stochastic noise in the dark floor mask, are independent, so their variances add and their standard deviations combine in root-sum-square fashion. A natural remedy for this increase in stochastic noise is to capture several dark floor masks and subtract their average from the captured image; averaging N independent masks reduces the noise variance of the mask by a factor of N. Another solution is to fit the dark floor mask with a low-frequency polynomial model and subtract the resulting polynomial from the captured image. This route has the additional advantage that, if the map is to be stored in the camera, only the coefficients of the polynomial need to be recorded rather than data for a full image.

Somewhat implied by the previous discussion is the fact that dark floor subtraction tends to address low-frequency structured noise better than high-frequency structured noise. A significant type of high-frequency structured noise is the defective pixel.
It is almost inevitable that some of the individual pixels in any given sensor will be defective. In fact, whole columns and rows of pixels might be nonfunctional. Even if a sensor is handpicked in the factory to be free of defective pixels, cosmic radiation will eventually create defective pixels in the device over time. Defective pixels fall into two broad categories. First, there are pixels that are completely and consistently nonfunctional, “stuck” at complete black or complete white regardless of what light may fall upon them. This is the simpler of the two categories to address. The second category is more problematic because these pixels are only partially defective. They may still respond to light, but with a gain factor significantly different from that of the majority of the pixels on the sensor. They may also be defective only for certain camera settings (e.g., ISO) but function normally for others. For this second class, even deciding how different a pixel must be from the main population before it is considered defective can be problematic.

Defective pixels of the first category can easily be mapped out in the factory and their locations stored in the camera firmware. Defective pixels of the second category, or those of the first category that form after the camera has left the factory, are more difficult to address. Pixels that fail to white can be detected using closed-shutter captures followed by impulse or outlier detection algorithms. Pixels that fail to black are more problematic, because in a closed-shutter capture they look just like working pixels. Ultimately, one may be forced to treat all unidentified defective pixels with stochastic noise cleaning methods [1], [2], [3]. Some defective pixels can be masked, at least partially, by the dark floor mask subtraction. However, the main method of defective pixel masking is to replace defective pixel values with the average values of known working neighboring pixels.
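The simplest version of this replacement can be sketched as follows, assuming a 2 × 2-periodic CFA such as the Bayer pattern (so that pixels two rows or two columns away carry the same color filter) and a boolean map of known defect locations, for example one stored in the camera firmware. It is a plain neighbor average of like-colored pixels, not an edge-directed method, and the function name is hypothetical.

```python
import numpy as np

# Illustrative sketch; the function name is hypothetical.


def correct_defects(cfa, defect_mask):
    """Replace flagged pixels with the mean of like-colored, non-defective neighbors.

    Assumes a 2x2-periodic CFA (e.g., Bayer): the pixels two rows or two
    columns away share the same color filter. Pixels with no usable
    neighbor are left unchanged.
    """
    cfa = np.asarray(cfa, dtype=np.float64)
    out = cfa.copy()
    rows, cols = cfa.shape
    offsets = [(-2, 0), (2, 0), (0, -2), (0, 2)]  # same-color neighbors
    for r, c in zip(*np.nonzero(defect_mask)):
        vals = []
        for dr, dc in offsets:
            rr, cc = r + dr, c + dc
            inside = 0 <= rr < rows and 0 <= cc < cols
            if inside and not defect_mask[rr, cc]:
                vals.append(cfa[rr, cc])  # read from the uncorrected input
        if vals:
            out[r, c] = np.mean(vals)
    return out
```

Neighbors that are themselves flagged are skipped, which gives some tolerance to small clusters; long defective rows or columns still call for the more involved, direction-aware methods.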
The replacement strategy can be as simple as a boxcar average of like-colored pixels over a given region or as complex as edge detection followed by selection of an appropriate directional blur kernel on a pixel-by-pixel basis [4], [5]. Isolated defective pixels are easily dealt with using the simplest of methods. Clusters and whole rows and columns of defective pixels require more involved solutions to prevent visible artifacts in the final image.

A final class of structured noise arises from variations in the thickness of the CFA color filters across the surface of the sensor. These are usually a consequence of (unwanted) interactions of the manufacturing process with the morphology of the silicon wafer. As a result, an image of a featureless neutral field may exhibit low-frequency variations in color. Because this is a stable phenomenon, it can be mapped in the factory and stored in the camera firmware. Either a full-resolution mask can be created or the coefficients of a polynomial fit can be determined. The method of correction is similar to dark floor subtraction, except that each color channel in the CFA has its own separate mask or polynomial.

3.2.3 CFA Data Correction

The next stage of the image processing chain is concerned with the reduction of stochastic noise and the correction of exposure and white balance errors made at the time of capture. In Figure 3.1, the ordering of these two operations is arbitrary and can be inverted without penalty. Unless there are highly unusual interactions in the individual image processing implementations, these two operations can be considered independent.
This is largely because stochastic noise reduction is mainly concerned with the high-frequency spatial component of the image data, while exposure and white balance correction relies on the low-frequency spatial component of the image. This notion of high-frequency/low-frequency split processing will be revisited later in the chapter.

3.2.3.1 Stochastic Noise Reduction

It is well known that if a system consists of a number of operations that are signal amplifiers, it is best to reduce noise contributions as early as possible in the chain. As will be discussed later, many of the image processing operations in Figure 3.1, notably color correction, tone scale and gamma correction, and edge enhancement, are signal amplifiers. This argues for performing noise reduction before these operations. Moving further back along the chain, the nature of the CFA interpolation operation needs to be considered. This operation may or may not be a signal amplifier, depending on the composition of the missing pixel value estimators. In addition, this operation may be linear or adaptive (nonlinear). In the latter case, the robustness of the algorithm’s decisions can be significantly influenced by noise in the CFA image data. Therefore, it seems prudent to perform noise cleaning before CFA interpolation. Finally, as already noted, there is no strong reason to prefer performing noise reduction before or after exposure and white balance correction.

Performing noise reduction before CFA interpolation presents its own set of challenges. Because there is only one color channel value at each pixel location, it becomes difficult to exploit the partial correlation between the color channels in the image. For this reason, Figure 3.1 has a separate stochastic color noise reduction block after CFA interpolation.
Consequently, noise reduction before CFA interpolation generally employs the techniques of single-channel grayscale image processing. Conceptually, the CFA image data is split into three or more color channel components by collecting pixels of like color into each component. Each component can then be treated as an individual grayscale image and noise-cleaned in any appropriate manner. After noise reduction, the color channel components can be merged back into a (now noise-cleaned) CFA image. In practice, the CFA image is usually left intact rather than being formally split and merged; instead, the pixel strides of the noise reduction operations are adjusted to avoid unwanted mixing of color channels. Typical grayscale noise reduction methods used on CFA data include low-pass filtering, sigma filtering [6], and median filtering, although potentially any noise reduction scheme appropriate to grayscale imaging could be used successfully. The effect of noise filtering on camera images is illustrated in Chapter 1, while joint denoising and demosaicking solutions are discussed in Chapter 9.

[Figure 3.2: A 2 × 2 block of pixels (G R / B G) in a Bayer mosaic image forms a single full-color (RGB) superpixel.]

Even so, color channel correlation need not be given up entirely at this point in the image processing chain. While exploiting correlation at the full sensor resolution must wait until after CFA interpolation, lower spatial frequency color correlation can be conditioned beforehand. Dividing the CFA image into a lower-resolution array of superpixels (Figure 3.2) produces a full-color image that can be treated with a larger set of noise reduction techniques. Discussion of these techniques is postponed until later in the chain.
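A minimal NumPy sketch of both ideas follows, assuming the G R / B G layout of Figure 3.2 and using a 3 × 3 median filter as the example grayscale method. Stride-2 slicing plays the role of the adjusted pixel strides: each like-colored sub-image is filtered on its own and written back in place, so colors are never mixed. The superpixel helper averages the two greens of each 2 × 2 block, which is one plausible choice rather than a prescribed one. All function names are hypothetical.

```python
import numpy as np

# Illustrative sketch; function names are hypothetical.


def median3x3(plane):
    """3x3 median filter with edge replication, using only NumPy."""
    h, w = plane.shape
    p = np.pad(plane, 1, mode="edge")
    shifted = [p[i:i + h, j:j + w] for i in range(3) for j in range(3)]
    return np.median(np.stack(shifted), axis=0)


def denoise_cfa(cfa, filt=median3x3):
    """Noise-clean each like-colored sub-image of a 2x2-periodic CFA separately.

    Stride-2 slicing selects one color phase at a time, so the grayscale
    filter never mixes pixels of different colors.
    """
    cfa = np.asarray(cfa, dtype=np.float64)
    out = np.empty_like(cfa)
    for dy in (0, 1):
        for dx in (0, 1):
            out[dy::2, dx::2] = filt(cfa[dy::2, dx::2])
    return out


def bayer_superpixels(cfa):
    """Collapse each G R / B G block into one RGB superpixel (even dimensions assumed)."""
    cfa = np.asarray(cfa, dtype=np.float64)
    g1, r = cfa[0::2, 0::2], cfa[0::2, 1::2]
    b, g2 = cfa[1::2, 0::2], cfa[1::2, 1::2]
    return np.stack([r, (g1 + g2) / 2.0, b], axis=-1)
```

Any other grayscale filter with the same signature (a 2-D array in, a same-shaped array out) can be passed as `filt`.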
3.2.3.2 Exposure and White Balance Correction

Unlike the human visual system (HVS), which constantly and automatically adjusts the apparent exposure and white point of what we see, digital cameras have no such innate functionality. Therefore, such adjustments must be made algorithmically after image capture. The goal of such algorithms is to render neutral areas in the scene as regions of equal code values in all color channels of the final image. Additionally, midlevel grays (18% scene reflectance) should map to the middle of the code value range of the final image. The processes of exposure correction and white balance correction are sometimes referred to collectively as scene balance correction. Detailed descriptions of exposure correction and white balance correction can be found in Chapters 10 to 12.

As with noise reduction, these adjustments can be grouped into two categories. The first category consists of adjustments made in response to user input. In the direct (though unlikely) case, the user specifies a particular exposure compensation (e.g., 1.33 stops) and the type of scene illuminant (e.g., daylight or tungsten). With this explicit information, the CFA image data can be modified directly in accordance with the user input. For exposure compensation, all pixel values are modified by the same scale factor; for linear data, a compensation of 1.33 stops corresponds to a factor of 2^1.33 ≈ 2.5. For white balance correction, there is a set of three different scale factors, one for each color channel. This white balance triplet is characteristic of the given scene illuminant. In the more likely indirect case, the user simply clicks on a region of the image that is known to be neutral in color and of a middle exposure level. The algorithm can then interrogate that region of the image to determine the scale factors necessary to drive it, along with the rest of the image, to the desired code values.
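For the indirect case, a minimal sketch follows, assuming a G R / B G Bayer layout, linear CFA data, and a rectangular user-selected region. The per-channel mean inside the region determines gains that map it to a chosen target code value, and an optional exposure compensation in stops multiplies all gains by 2^stops. The target value, the region format, and the function names are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch; function names and the region format are hypothetical.


def bayer_channel_map(shape):
    """Per-pixel channel index (0=R, 1=G, 2=B) for a G R / B G Bayer layout."""
    ch = np.empty(shape, dtype=int)
    ch[0::2, 0::2] = 1
    ch[0::2, 1::2] = 0
    ch[1::2, 0::2] = 2
    ch[1::2, 1::2] = 1
    return ch


def gains_from_neutral_region(cfa, region, target, stops=0.0):
    """Per-channel gains that drive a user-selected neutral region to `target`.

    region = (row0, row1, col0, col1), half-open and at least 2x2 so that all
    three colors are present. `stops` adds exposure compensation as 2**stops.
    """
    r0, r1, c0, c1 = region
    ch = bayer_channel_map(cfa.shape)[r0:r1, c0:c1]
    patch = np.asarray(cfa, dtype=np.float64)[r0:r1, c0:c1]
    means = np.array([patch[ch == k].mean() for k in range(3)])
    return (target / means) * 2.0 ** stops


def apply_channel_gains(cfa, gains):
    """Scale every CFA pixel by the gain of its own color channel."""
    return np.asarray(cfa, dtype=np.float64) * gains[bayer_channel_map(cfa.shape)]
```

Clipping the scaled values to the valid code value range is left out for brevity.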
The second category of exposure and white balance adjustment is far more difficult. User input is not available; all the algorithm has is the image data itself. This is the realm of automatic exposure and white balance correction algorithms [7], [8], [9]. This is an ill-posed and unfair problem! One is asked to find the proper adjustment for a midlevel gray region that does not exist in the image. Most approaches are based on heuristic statistical models such as the gray world hypothesis. This hypothesis contends that all images, taken together as a single set of pixels, average to approximately 18% scene reflectance gray (neutral). Unfortunately, this says nearly nothing about the statistics of any given single image. Slightly more robust statements can be made by restricting the gray world hypothesis to those parts of the image that lie along edges and feature boundaries, so that large regions of sky or grass do not dominate the scene average. Using this and other heuristic rules, a set of code values (usually a triplet) representing 18% gray for the given image is computed. At this point, the calculation proceeds as in the first category of exposure and white balance adjustment.

As an example of additional heuristics that can be applied, consider the problem of correcting an image with a warm (red) cast. The image could be an indoor scene at home under tungsten light or an outdoor sunset scene in one’s backyard. In the first case, the reddish color should be rendered as neutral (i.e., more or less fully corrected) because that is how one perceives the scene. In the second case, the reddish cast should be preserved because sunsets look red. A heuristic that can help distinguish between these two cases is the overall brightness level, which can be estimated from metadata describing the aperture setting, shutter time, and exposure index setting at the time of capture (a sunset is typically much brighter than an indoor light bulb) [10].
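The edge-restricted gray world heuristic can be sketched as follows. This is one simple variant, not a definitive algorithm: it assumes a low-resolution linear RGB image (for example, the superpixel image of Figure 3.2), uses the channel mean as a crude luminance proxy, and treats as edges the pixels whose gradient magnitude exceeds a fraction of the maximum. The threshold parameter and the function name are hypothetical.

```python
import numpy as np

# Illustrative sketch; the function name and threshold rule are hypothetical.


def edge_gray_world(rgb, rel_threshold=0.1):
    """Estimate the code-value triplet that should represent neutral gray.

    Gray world restricted to edges: only pixels whose luminance-gradient
    magnitude exceeds rel_threshold * (maximum magnitude) are averaged, so
    large flat regions such as sky or grass cannot dominate. A completely
    flat image falls back to the plain gray world average.
    """
    rgb = np.asarray(rgb, dtype=np.float64)
    lum = rgb.mean(axis=-1)  # crude luminance proxy (assumption)
    gy, gx = np.gradient(lum)
    mag = np.hypot(gx, gy)
    if mag.max() == 0.0:
        return rgb.reshape(-1, 3).mean(axis=0)
    edges = mag > rel_threshold * mag.max()
    return rgb[edges].mean(axis=0)
```

The returned triplet then plays the same role as the mean of a user-selected neutral region: dividing a target code value by it gives the three white balance gains.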
Chapter 13 describes camera image storage formats and associated metadata in detail.

Finally, note that the computations just described can be performed on spatially very low-resolution data. Low-resolution image data may even be preferred because it can improve the performance of the heuristic rules: it makes the computations relatively impervious to noise and to detailed scene content and composition. In addition, a small data set is computationally more tractable, allowing the use of more involved heuristic systems without undue execution time penalties.

3.2.4 Adjusted CFA Image and Image Data Calibration

At this point, we have conditioned the CFA image data to be reduced in noise, both structured and stochastic, and to represent an image that has been properly scene balanced. Most subsequent image processing operations in the chain tacitly assume these idealized conditions; it will be the responsibility of the later operations to address any residual departures from this state. The next task is to create a full-color image and then convert that image into a known, calibrated color space. Along the way, the aforementioned residual noise and scene balance errors will need to be addressed.

FIGURE 3.3 Minimum repeating units: (a) RGB Bayer CFA (G R / B G), (b) CMY Bayer CFA (Y M / C Y), and (c) hybrid CMYG pattern (C M / Y G).

3.2.4.1 Color Filter Array Interpolation

The basic premise underlying CFA decimation and subsequent interpolation, or demosaicking, is that full-color image data is high in redundant information. Some sense of this can perhaps be gained by noting that spatial detail as perceived by the human visual system is predominantly based on luminance information in the scene and only to a much smaller extent on chrominance information (see page 380 in Reference [11]).
Beginning with color television, most electronic imaging systems have taken advantage of this fact by recording chrominance information at a lower spatial resolution than luminance. The savings in information bandwidth more than compensate for any small loss in image fidelity. Consequently, most CFAs used for image capture subsample chrominance with respect to luminance in two ways. First, only one of three (or more) color channels is recorded at each pixel location. Second, in the color filter array minimum repeating unit (MRU) there are more filters assigned to sensing luminance information than to sensing chrominance information. The prototypical CFA is the Bayer pattern [12]. In the RGB case (Figure 3.3a), the green pixels are used to sense luminance information, and the red and blue pixels are used to sense chrominance information. In the CMY case (Figure 3.3b), the yellow pixels are used to sense luminance, and the cyan and magenta pixels are used to sense chrominance. The complexity can be further increased by adding a fourth color and reading out sums and differences of adjacent pixels to produce luminance-chrominance information. In the CMYG-based MRU (Figure 3.3c), it is common to read out two luminance (C + Y, M + G) and two chrominance (C − Y, M − G) values [13], [14]. Regardless of the CFA used, a full-color image must be produced at this point in the image processing chain. Full color in this case means that each pixel in the image has a color specification triplet. There are two general approaches to the problem of CFA interpolation. The first is to use standard linear interpolation methods. The most common approach is to combine neighboring pixel values of the same color in some straightforward manner to produce an estimate for the missing pixel value. This method can take the form of a convolution operation and implement such standard practices as pixel replication, bilinear interpolation, or bicubic interpolation.
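A minimal sketch of linear CFA interpolation as a convolution, assuming an RGGB Bayer layout: each color plane is zero-filled at pixels where that color was not sampled, and fixed 3 × 3 kernels then average the nearest same-color neighbors, which amounts to bilinear interpolation. Image borders are handled by mirror padding (`numpy.pad` with `mode="reflect"`), which preserves the mosaic's two-pixel periodicity. The function names are hypothetical.

```python
import numpy as np


def conv3x3(plane, kernel):
    """Same-size 3x3 filtering with mirror padding.

    Written as a correlation; it equals convolution for the symmetric kernels below.
    """
    p = np.pad(plane, 1, mode="reflect")
    h, w = plane.shape
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * p[dy:dy + h, dx:dx + w]
    return out


def bilinear_demosaic(cfa):
    """Bilinear CFA interpolation of an RGGB mosaic into an H x W x 3 image."""
    rows, cols = np.indices(cfa.shape)
    r_mask = (rows % 2 == 0) & (cols % 2 == 0)
    b_mask = (rows % 2 == 1) & (cols % 2 == 1)
    g_mask = ~(r_mask | b_mask)
    # Green: average of the 4 direct neighbors where green is missing.
    k_g = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]]) / 4.0
    # Red/blue: average of 2 (edge-adjacent) or 4 (diagonal) same-color neighbors.
    k_rb = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 4.0
    c = cfa.astype(np.float64)
    r = conv3x3(c * r_mask, k_rb)
    g = conv3x3(c * g_mask, k_g)
    b = conv3x3(c * b_mask, k_rb)
    return np.stack([r, g, b], axis=-1)
```

Every output pixel is a fixed weighted sum of same-color neighbors, so this is a linear shift-invariant operation on each sparse color plane.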
If, on the other hand, there is some understanding of the cross-channel correlation of the data, more than one color may be used in this process. This latter approach is most readily accomplished by first interpolating all of the luminance pixel data and then forming color differences between the luminance and chrominance pixel values (e.g., R − G and B − G). These color difference values are then interpolated, and the chrominance values are recovered by adding the luminance values back to the interpolated color difference values [15]. The second approach to CFA interpolation is to use nonlinear adaptive methods. These are generally heuristic in composition because they fall outside the realm of linear shift-invariant systems. With these systems, the segmentation of image data into luminance and chrominance channels becomes more important because the decisions made by the algorithm are generally keyed off the fine spatial detail in the image. As a result, the luminance channel is interpolated first, using some form of edge detection on the luminance data to determine the precise manner of interpolation from pixel to pixel. Sometimes these algorithmic decisions are made immediately and irrevocably [16], [17], and sometimes the decisions are revisited and revised after their consequences are perceived later in the algorithmic process [18], [19]. Once the luminance channel is fully populated, the chrominance channels are generally treated with the linear approaches previously described, although references will be found in the literature to adaptive methods used in place of those approaches [17]. Chapters 1 and 5 to 9 discuss demosaicking issues in detail. The result of the CFA interpolation process will be a “camera” full-color image. Its color space will generally not be a standard, calibrated space.
Instead, it will be defined by the spectral sensitivities of the camera image capture hardware (detector quantum efficiencies and CFA spectral responses being the biggest players). Before addressing this issue, an intermediate operation is discussed.

3.2.4.2 Stochastic Color Noise Reduction

During the stochastic noise reduction applied to CFA image data, the color channels were treated as separate and independent grayscale channels. In each channel, as seen in a flat field, stochastic noise gives the impression of “visual static” or “graininess” in what should be a smooth region of the image. Sufficiently reducing the visibility of the noise leaves one with the subtle fluctuations of a texture that looks “real” instead of artificial. Now that all the color channels are fully populated, another facet of stochastic noise emerges. A texture that might be acceptable in the context of a single-channel image becomes unacceptable when combined with similar, but different, textures in the other color channels. Instead of producing a light-dark texture, residual stochastic noise in an RGB image produces unexpected color variations. This effect is most pronounced in neutral (gray) regions because the color fluctuations include pastels of widely divergent hue angles. Such color artifacts are far more visible and objectionable than the corresponding light-dark fluctuations in a single-channel image. The color aspect of stochastic noise reduction is now addressed. The good news is that at this point the image data is in a well-known and well-behaved representation, and methods abound in the literature on how to noise-clean fully populated color images. The simplest approach may be, again, to treat each color channel as an independent grayscale image and clean these components separately. However, this may not be particularly effective and tends to miss the whole point.
It is far better to transform the image into a luminance-chrominance representation (assuming it is not already in one). At this point, the luminance and chrominance channels can be noise-cleaned in any appropriate manner. Generally, the luminance data will require a significantly different cleaning modality from that used for the chrominance data; if the same method is used, its tuning will at least be quite different. Because this is a color noise reduction operation, the luminance channel may be left completely untouched. It is still useful as a reference for adaptive chrominance noise-cleaning operations. The chrominance channels, if sufficiently devoid of spatial edge detail, can be cleaned most simply by a low-pass (blurring) convolution operation. If low-frequency color blobs are evident, then rather than using a large convolution kernel, the chrominance channels can be decomposed into a Gaussian or wavelet pyramid. The individual pyramid components can then be convolved with smaller kernels and a noise-cleaned image reconstructed from the processed components. If there is a desire to preserve the residual edge detail in the chrominance channels, then adaptive noise-cleaning methods such as sigma filters or steerable median filters can be used. When using such adaptive methods, the luminance channel can often be used in the edge detection operations, providing improved robustness over the lower-modulation edges in the chrominance channels. If there is a reason to do so, the luminance channel can also be noise-cleaned at this time using any method applicable to single-channel grayscale images.

3.2.4.3 Color Correction

It is now time to prepare the image for its final rendered form. This requires transforming the image into a standard calibrated color space. There are a number of possible destination color spaces.
Of these, the industry has standardized on sRGB [20], which is designed for video, or soft display, devices. However, because sRGB is itself a color transform from CIE 1931 XYZ space, the latter can be considered the first target of the color correction operation. CIE 1931 XYZ space (see pages 101 to 110 in Reference [11]) is a color space defined by the standardized $\bar{x}(\lambda)$, $\bar{y}(\lambda)$, and $\bar{z}(\lambda)$ color matching functions. CIE 1931 XYZ space itself is a color transform from the previously defined CIE RGB color space, although there is generally no reason to invoke that relationship explicitly in today’s digital cameras. The first part of the color correction process is to transform the image data from camera color space into CIE 1931 XYZ space. Assuming an RGB camera color space, the operation becomes a 3 × 3 matrix multiply:

$$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \begin{bmatrix} R_{camera} \\ G_{camera} \\ B_{camera} \end{bmatrix} \tag{3.1}$$

The coefficients of the transformation matrix are computed in the factory through a regression process using measured camera RGB tristimulus values of color patches with known XYZ tristimulus values.
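The factory regression can be sketched as an ordinary least-squares fit. This is an assumption about one straightforward way to do it (real calibrations may weight patches or constrain the matrix): given measured camera RGB values and reference XYZ values for a set of patches, `numpy.linalg.lstsq` solves for the coefficients a11 through a33. The sketch then pre-multiplies by the XYZ-to-sRGB matrix given in the sRGB specification [20], so that a single 3 × 3 matrix is applied to every pixel. All function names are hypothetical.

```python
import numpy as np

# XYZ-to-linear-sRGB matrix with the coefficients given in the sRGB specification.
XYZ_TO_SRGB = np.array([[3.2410, -1.5374, -0.4986],
                        [-0.9692, 1.8760, 0.0416],
                        [0.0556, -0.2040, 1.0570]])


def fit_camera_to_xyz(camera_rgb, xyz):
    """Least-squares 3x3 matrix A such that xyz[i] ≈ A @ camera_rgb[i].

    camera_rgb, xyz: (n_patches, 3) arrays of linear tristimulus values, n_patches >= 3.
    """
    # Row-vector form: camera_rgb @ A.T ≈ xyz, solved in the least-squares sense.
    a_transposed, *_ = np.linalg.lstsq(camera_rgb, xyz, rcond=None)
    return a_transposed.T


def color_correction_matrix(camera_rgb, xyz):
    """Single matrix (XYZ_TO_SRGB @ A) applied once per pixel in the chain."""
    return XYZ_TO_SRGB @ fit_camera_to_xyz(camera_rgb, xyz)


def apply_matrix(image, m):
    """Apply a 3x3 matrix to every pixel of an H x W x 3 linear image."""
    return image @ m.T
```

Folding the two matrices together offline costs nothing at run time, which is why only the combined matrix needs to live in the camera.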
Once the XYZ tristimulus values have been computed, they can be transformed to sRGB tristimulus values with a standard matrix as follows:

$$\begin{bmatrix} R_{sRGB} \\ G_{sRGB} \\ B_{sRGB} \end{bmatrix} = \begin{bmatrix} 3.2410 & -1.5374 & -0.4986 \\ -0.9692 & 1.8760 & 0.0416 \\ 0.0556 & -0.2040 & 1.0570 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} \tag{3.2}$$

Combining Equations 3.1 and 3.2 produces the color correction matrix as it would be implemented in the image processing chain:

$$\begin{bmatrix} R_{sRGB} \\ G_{sRGB} \\ B_{sRGB} \end{bmatrix} = \begin{bmatrix} 3.2410 & -1.5374 & -0.4986 \\ -0.9692 & 1.8760 & 0.0416 \\ 0.0556 & -0.2040 & 1.0570 \end{bmatrix} \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \begin{bmatrix} R_{camera} \\ G_{camera} \\ B_{camera} \end{bmatrix} = \begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{bmatrix} \begin{bmatrix} R_{camera} \\ G_{camera} \\ B_{camera} \end{bmatrix} \tag{3.3}$$

It should be noted that up to this point it has been tacitly assumed that all computations performed in the image processing chain have been done in linear space. While the use of nonlinear spaces for these computations will be discussed below, the color correction computations just presented are explicitly designed for linear space data. The case of performing color correction on nonlinear space data will also be addressed below.

3.2.5 Image Space Rendering

The remaining steps in the image processing chain are targeted at producing the best image for a given image rendering. In keeping with current industry standards, this means preparing the image for display on a video device. The transformation to sRGB tristimulus values has already begun this process. The process is completed by transforming the image into a nonlinear space suited for video display devices and applying edge enhancement (i.e., sharpening).

3.2.5.1 Tone Scale and Gamma Correction

The human visual system’s ability to adapt to a wide range of scene luminances is another essential capability that the digital camera must duplicate, if only in a primitive manner.
In this case, the overall contrast of the scene must be adjusted so that the image, as viewed on the soft display device, looks similar to the original scene, which was typically viewed under illumination a hundred times as bright, if not more. In addition, the image data must be transformed to account for the nonlinearity of the video display. As in the case of color correction, the tone scale and gamma correction operation is implemented as a single transform composed of these two components. The tone scaling operation adjusts the contrast of the image. It is usually implemented as a fixed lookup table that is applied equally to the red, green, and blue channels, and it assumes that the input data is in a linear space and has been properly exposure corrected. There are two general classes of tone scale transforms. The first class consists of fixed transforms that are installed in the factory and used on all images. There may be a single transform or a small family of fixed transforms, with each family member assigned to a different exposure compensation step, such as ±2, ±1, and 0 stops. The fixed transform curve is typically “S” shaped (Figure 3.4a), with a slope greater than unity in the middle of the input code value range and considerably less than unity at the two extremes, that is, in the shadows and the highlights [11]. The visual impact of applying such a curve is to increase the overall contrast of the image. Note that there is no reason the tone scale needs to be symmetric in its handling of the shadows and the highlights. In order to reduce the visibility of noise in dark regions of the image, a tone scale may apply more aggressive compression of shadows than highlights (Figure 3.4b).

FIGURE 3.4 Tone scale functions, plotted as output code value (CV) versus input code value: (a) idealized function, (b) function with suppressed shadow response, and (c) scene-specific function.
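A fixed “S”-shaped tone scale of the kind shown in Figure 3.4a can be sketched as a lookup table. The logistic shape, the 12-bit table size, and the `shadow_gamma` exponent used to suppress the shadow response (in the spirit of Figure 3.4b) are all assumptions chosen for illustration; a production curve would be tuned to the camera system and to customer preference.

```python
import numpy as np


def s_curve_lut(n_codes=4096, contrast=8.0, shadow_gamma=1.0):
    """Fixed S-shaped tone scale as a lookup table over integer code values.

    contrast: steepness of the logistic core (larger means a steeper midtone slope).
    shadow_gamma: values > 1 push shadows down harder than highlights.
    """
    x = np.linspace(0.0, 1.0, n_codes)                  # normalized input code values
    s = 1.0 / (1.0 + np.exp(-contrast * (x - 0.5)))     # logistic centered mid-range
    s0 = 1.0 / (1.0 + np.exp(contrast * 0.5))           # logistic value at x = 0
    s1 = 1.0 / (1.0 + np.exp(-contrast * 0.5))          # logistic value at x = 1
    y = (s - s0) / (s1 - s0)                            # rescale so 0 -> 0 and 1 -> 1
    y = y ** shadow_gamma                               # optional shadow suppression
    return np.round(y * (n_codes - 1)).astype(np.int32)


def apply_tone_scale(rgb_codes, lut):
    """Point transform applied identically to the R, G, and B code values."""
    return lut[rgb_codes]
```

With the default `contrast=8.0`, the slope is about 2 at mid-range and well below 1 near both ends, matching the qualitative description of the idealized curve.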
The tone scale is initially generated by the image processing chain designer based on the characteristics of the digital camera system and customer preference; for instance, professional photographers tend to prefer lower contrast than do consumer snapshooters. The second class consists of tone scale transforms that are generated dynamically on an image-by-image basis. Returning to the HVS, its dynamic range is significantly greater than that of any current digital camera. On a sunny day, a person can look into a blue sky and see puffy clouds, then immediately look into deep shadows and see well-defined image detail. However, a digital camera with a fixed tone scale will be forced either to saturate the sky to pure white to render details in the shadows or to clip the shadows to complete black to preserve the sky. A partial solution to this dilemma is to create a custom tone scale that renders both the shadows and the sky (highlights) at the expense of the midtones, which are visually less important in such high dynamic range scenes. Figure 3.4c shows an example of such a tone scale. There are many ways of algorithmically producing such a tone scale based on histogram analysis of the image [21], [22]. Regardless of how the tone scale is generated, it ultimately becomes a simple point transform of the image data:

$$R'_{sRGB} = T\left(R_{sRGB}\right), \qquad G'_{sRGB} = T\left(G_{sRGB}\right), \qquad B'_{sRGB} = T\left(B_{sRGB}\right) \tag{3.4}$$

The second transformation to be applied is video gamma correction. This standard transform is also defined in the sRGB specification [20] and accounts for the fundamental photometric nonlinearity of the cathode ray tube (CRT) display. It is essentially a simple power relationship, with a short linear segment near black:

$$X''_{sRGB} = \begin{cases} 12.92\,X'_{sRGB}, & X'_{sRGB} \le 0.00304 \\ 1.055\,\left(X'_{sRGB}\right)^{1/2.4} - 0.055, & X'_{sRGB} > 0.00304 \end{cases} \tag{3.5}$$

where $X'_{sRGB}$ is $R'_{sRGB}$, $G'_{sRGB}$, or $B'_{sRGB}$ normalized to [0, 1]. The output is also in the range [0, 1] and can subsequently be scaled to any conventional data range, usually [0, 255].
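Equation 3.5 translates directly into code. The sketch below keeps the 0.00304 breakpoint used in the text; the published IEC 61966-2-1 standard states it as 0.0031308, and the difference is negligible in practice. The function names and the 8-bit scaling helper are illustrative assumptions.

```python
import numpy as np


def srgb_video_gamma(x, threshold=0.00304):
    """sRGB video gamma correction (Equation 3.5) on data normalized to [0, 1]."""
    x = np.clip(np.asarray(x, dtype=np.float64), 0.0, 1.0)
    return np.where(x <= threshold,
                    12.92 * x,                                 # linear toe near black
                    1.055 * np.power(x, 1.0 / 2.4) - 0.055)    # power-law segment


def to_8bit(x):
    """Scale [0, 1] output to the conventional [0, 255] code value range."""
    return np.round(255.0 * x).astype(np.uint8)
```

Because this is a point transform, it can be tabulated once and combined with a tone scale table into a single lookup.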
Equation 3.5 is a point transform and can be concatenated with the tone scale correction to produce a single point transform that performs both operations simultaneously:

$$X''_{sRGB} = V\left(X'_{sRGB}\right) = V\left(T\left(X_{sRGB}\right)\right) = G\left(X_{sRGB}\right) \tag{3.6}$$

In this expression, $V(\cdot)$ is the video gamma correction, $T(\cdot)$ is the tone scale correction, and $G(\cdot)$ is the composite correction, usually referred to casually as the gamma correction. The systematic construction of the composite gamma correction described here is frequently shortcut in favor of simply beginning with the sRGB video correction (Equation 3.5) and then customizing this transform until the average image has the desired contrast, dynamic range rendering, and shadow noise suppression. Such an approach is generally heuristic in nature and usually targeted at generating a fixed tone scale transform to be used for all images.

3.2.5.2 Edge Enhancement

The final block in the “standard” image processing chain is edge enhancement, more casually referred to as sharpening. The essential purpose of this operation is to amplify the high-frequency spatial components of an image to make it look sharper. Because noise also has high-frequency characteristics, attention must be given to controlling noise amplification during edge enhancement. The two main approaches to edge enhancement are direct convolution and unsharp masking. Actually, these operations are two sides of the same coin, as they produce mathematically equivalent results when confined to the world of linear shift-invariant systems. The direct convolution method consists of extracting a high-frequency record from the image via convolution with a high-pass kernel.
Some scaled amount of this high-frequency record is then added back to the original image to produce the sharpened result as follows:

$$A' = A + k\,(A * h) \tag{3.7}$$

where $A$ is the original image, $h$ is the high-pass convolution kernel, $k$ is a scale factor, and $A'$ is the resulting sharpened image. In the case of unsharp masking, the high-frequency record is created by computing the difference between the image and a blurred (low-pass) version of itself:

$$A' = A + k\,(A - A * b) \tag{3.8}$$

where $A$ is the original image, $b$ is the low-pass convolution kernel, $k$ is a scale factor, and $A'$ is the resulting sharpened image. From Equations 3.7 and 3.8 it can be seen that $h$ and $b$ are related by $h = I - b$, where $I$ is the identity kernel (a unit impulse, so that $A * I = A$). In either approach, adjusting the scalar $k$ adjusts the amount of sharpening applied to the image. In order to control noise amplification during the edge enhancement process, the high-frequency record needs to be noise-cleaned in some manner before being added back to the original image. The initial impulse (no pun intended) may be to use a standard noise-cleaning operation. However, high-frequency image data is zero mean and largely devoid of low-frequency information, so low-pass filtering begins to lose its meaning. Instead of a spatial noise-cleaning operation, an amplitude noise-cleaning method is usually employed: a coring function is used to noise-clean the high-frequency record in this manner [23], [24]. This function is a point operation that, like the previously described tone scale correction, is usually implemented as a lookup table.

FIGURE 3.5 Example coring functions (a), (b), and (c), plotted as output code value (CV) versus input code value.

Modifying Equations 3.7 and 3.8, the coring operation, $C(\cdot)$, can be added:

$$A' = A + k\,C(A * h) \tag{3.9}$$

$$A' = A + k\,C(A - A * b) \tag{3.10}$$

Note that if $h = I - b$, Equations 3.9 and 3.10 are still mathematically equivalent.
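The following sketch implements Equations 3.9 and 3.10 side by side, assuming a 3 × 3 box blur as the low-pass kernel b and a simple dead-zone coring curve (a hypothetical shape: amplitudes below a threshold are set to zero, all others pass unchanged). Because the high-pass kernel is built as h = I − b, the two functions return the same image up to floating-point rounding, which illustrates the equivalence noted above.

```python
import numpy as np


def filter3x3(img, kernel):
    """Same-size 3x3 filtering with edge replication (a correlation, which equals
    convolution for the symmetric kernels used here)."""
    p = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * p[dy:dy + h, dx:dx + w]
    return out


def core(x, threshold=0.02):
    """Hypothetical dead-zone coring: zero small amplitudes, pass the rest."""
    return np.where(np.abs(x) < threshold, 0.0, x)


BLUR = np.full((3, 3), 1.0 / 9.0)      # low-pass kernel b
IDENTITY = np.zeros((3, 3))
IDENTITY[1, 1] = 1.0                   # unit impulse I
HIGH_PASS = IDENTITY - BLUR            # h = I - b


def sharpen_direct(a, k=1.0):
    """Equation 3.9: A' = A + k C(A * h)."""
    return a + k * core(filter3x3(a, HIGH_PASS))


def sharpen_unsharp_mask(a, k=1.0):
    """Equation 3.10: A' = A + k C(A - A * b)."""
    return a + k * core(a - filter3x3(a, BLUR))
```

In a flat region the high-frequency record is essentially zero, so the coring step leaves the region untouched instead of amplifying small noise fluctuations.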
The shape of the coring function (Figure 3.5) is heuristically determined based on the fundamental noise characteristics of the digital camera system. Nominally, small amplitude values in the high-frequency record are suppressed to zero to avoid amplifying noise in flat regions of the image. Midrange amplitude values in the high-frequency record are left essentially unaltered. There is no clear consensus on how to modify large amplitude values in the high-frequency record. An entire range of possibilities can be found in practice: leave them unchanged, clip them to some maximum value, or, beyond a certain input amplitude, begin to reduce the cored amplitude so that the largest amplitude values in the high-frequency record are actually set to zero. The choice of coring function shape is a tradeoff among noise amplification, overall image sharpness, and image distortions such as the loss of three-dimensionality of strong edges. Once the mechanisms of edge enhancement have been determined, attention can be turned to determining precisely which components of the image will be sharpened. As stated before, most of the fine spatial detail in the image is contained in the luminance component. Therefore, it is plausible to split the image into luminance and chrominance components, sharpen just the luminance channel, and then merge the components back into a sharpened image. If luminance-chrominance space is the final destination of the image, for instance, in preparation for Joint Photographic Experts Group (JPEG) compression, then this strategy is also practical. However, if sRGB is the final destination space, a simpler route is available: the luminance channel, for the purposes of edge enhancement, can be adequately approximated by the green channel [23], [24].
Thus, a scaled, noise-cleaned (cored) high-frequency record is produced directly from the green channel and then added equally to the original red, green, and blue color channels of the image (Figure 3.6). An additional benefit of this approach is that the green channel of a digital camera system tends to be the least noisy of the color channels.

FIGURE 3.6 RGB edge enhancement. Main path: “camera” RGB image, stochastic color noise reduction, color correction*, tone scale and gamma correction*, edge enhancement*, “display” RGB image; side branch labeled green channel, edge enhancement, and noise coring transfer function. (* denotes a potential noise amplification step.)

3.3 Variations on the First Image Processing Path

It has been suggested a number of times that there are alternate and possibly better ways of configuring the reference image processing path. Many competing pressures will generally drive the ultimate configuration for a given application. One of the leading determinants is the available computing environment. From a simplistic perspective, this can be considered to be how many pixels can be processed per unit time and what the overall allowable execution time boundaries are. Complicating this interpretation is the question of sharing compute resources with other nonimaging tasks that the device, for instance, a mobile phone, may perform. In addition, an individual microprocessor may have idiosyncrasies that cause some simpler types of operations to execute slowly while other more complex types of operations execute quickly. As a result, the reference path just presented may have to be significantly modified. Section 3.4 will discuss this in some depth. Sometimes image quality requirements, as established by the UE, require better results than can be easily achieved by simply improving the individual components of the image processing chain.
The current trend toward producing acceptable images at higher and higher exposure indices (ISO ratings) is one example. As a first approach, additional noise-cleaning operations may be added to the image processing chain, for instance, after the color correction or edge enhancement steps. As suggested in the discussion of edge enhancement above, a similar noise reduction can be achieved not by formally adding more computations but by modifying the chain so that certain image components are processed differently in order to prevent unwanted noise amplification.

3.3.1 Luminance-Chrominance Processing

In the discussion of stochastic color noise reduction, transforming a primary color space (e.g., RGB) into a luminance-chrominance space (YCC) provided an opportunity for improved image processing results [23]. Figure 3.7 illustrates the concept. The noisy RGB image is initially converted to YCC space. The luminance and chrominance components are then reduced in noise using methods appropriate to each type of data. Namely, luminance noise-cleaning must preserve edges and fine spatial detail, whereas chrominance noise-cleaning generally does not need to be burdened with such concerns. Figure 3.7 shows that luminance information may be used to assist the chrominance noise-cleaning process. After noise reduction, the noise-cleaned YCC components are recombined and converted back into RGB space.

FIGURE 3.7 Luminance-chrominance stochastic noise reduction: noisy RGB image, RGB to YCC image conversion, luminance and chrominance noise reduction, YCC to RGB image conversion, filtered RGB image.

The concept of YCC image processing can be extended over larger parts of the image processing chain. One classic example, as already discussed, is edge enhancement [24]. In Figure 3.6, a side branch has been added to the image processing chain.
After CFA interpolation, a copy of the luminance channel (which could be just the green channel) is routed around the stochastic color noise reduction and the color correction operations. Instead, it is fed directly into the tone scale and gamma correction operation and then the edge enhancement block. This results in a less noisy edge enhancement boost record, because it avoids the noise amplification inherent in the color correction step.

3.3.2 Spatial Frequency Processing

Just as different color channel components (e.g., luminance and chrominance) benefit from different image processing, so will the different spatial frequency band components in the image. This is a very powerful concept that has been exploited in a number of ways, as described in the literature. The simplest approach is to split the image data into two spatial frequency bands: low frequency and high frequency. This splitting method has already been discussed under edge enhancement. In the current discussion, the image processing chain will once again be bifurcated around the color correction operation, with only the low-frequency component being color corrected (Figure 3.8) [25].

FIGURE 3.8 Color correction of the low-frequency spatial image component. Blocks: stochastic color noise reduction, frequency decomposition into low-frequency and high-frequency images, color correction* (low-frequency image only), summation (+), and tone scale and gamma correction*. (* denotes a potential noise amplification step.)

The result of this image processing chain modification is that high-frequency noise will not be amplified by the color correction step. Because most of the color information in the image is carried by the low-frequency component of the image data, there is little, if any, visual penalty for not color correcting the high-frequency component. Noise-cleaning benefits from the same strategy (Figure 3.9).
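A minimal sketch of color correcting only the low-frequency component, assuming a box filter for the frequency split and a pixel-interleaved H × W × 3 array. The high-frequency residual bypasses the color correction matrix and is added back unchanged, so the matrix cannot amplify its noise. The box filter and the function names are illustrative assumptions, not the specific method of reference [25].

```python
import numpy as np


def box_blur(channel, radius=2):
    """Separable box low-pass filter with edge replication (filter choice is an assumption)."""
    size = 2 * radius + 1
    kernel = np.ones(size) / size
    p = np.pad(channel, radius, mode="edge")
    # Filter along rows, then columns; "valid" trims the padding back off.
    rows = np.apply_along_axis(np.convolve, 1, p, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")


def color_correct_low_frequency(rgb, ccm, radius=2):
    """Color correct only the low-frequency band; add the high band back untouched."""
    low = np.stack([box_blur(rgb[..., c], radius) for c in range(3)], axis=-1)
    high = rgb - low                  # high-frequency record (carries most of the noise)
    return low @ ccm.T + high         # the 3x3 matrix never touches `high`
```

A typical color correction matrix has large off-diagonal terms, so keeping it away from the high-frequency band is exactly what prevents noise amplification in this variant.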
FIGURE 3.9 Low-frequency / high-frequency stochastic noise reduction: noisy RGB image, frequency decomposition into low-frequency and high-frequency images, low-frequency and high-frequency noise reduction, summation (+), filtered RGB image.

A single low-frequency / high-frequency split of the image data allows different strategies to be used on each component. Typically, most of the objectionable high-frequency noise will be segregated into the high-frequency component, leaving a relatively clean low-frequency image component. Therefore, the low-frequency component can be left untouched, preserving the fidelity of its image content, and noise reduction performed only on the high-frequency component. As indicated in Figure 3.9, the low-frequency component can optionally be used to help drive the high-frequency noise reduction process. There is no reason to stop at a two-component spatial frequency split. Full wavelet or Laplacian / Gaussian pyramid decompositions can also be used to great effect [26]. In these approaches, after the initial split, the low-frequency component is split again into higher and lower spatial frequency components. This splitting continues until the resulting components have reached some useful lower size limit. Each component or partial reconstruction may then be processed in a custom way. Candidate operations for such custom processing are noise reduction, color correction, tone scale and gamma correction, and edge enhancement. The obvious liabilities of pyramid decomposition / reconstruction approaches are that the proliferation of components stresses the capabilities of the compute environment, and that the number of degrees of freedom mushrooms with each new pyramid level, sometimes making optimum and robust tuning of such systems problematic.
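A Laplacian pyramid of the kind referred to above can be sketched in a few lines. This version assumes 2 × 2 block averaging for the low-pass-and-decimate step and pixel replication for expansion, both much cruder than the filters used in practice, and it requires image dimensions divisible by 2 raised to the number of levels. Because each band stores exactly what its split discarded, reconstruction is exact whatever filters are chosen. The function names are hypothetical.

```python
import numpy as np


def downsample(img):
    """2x2 block average: crude low-pass plus decimation (dimensions must be even)."""
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] + img[0::2, 1::2] + img[1::2, 1::2])


def upsample(img):
    """Pixel-replication expansion back to twice the size."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)


def build_pyramid(img, levels):
    """Return [band_0 (finest), ..., band_{levels-1}, lowest-frequency residual]."""
    bands = []
    current = img
    for _ in range(levels):
        low = downsample(current)
        bands.append(current - upsample(low))   # detail removed by this split
        current = low                           # split the low band again
    bands.append(current)
    return bands


def reconstruct(bands):
    """Invert build_pyramid by expanding and adding the bands back, coarse to fine."""
    current = bands[-1]
    for band in reversed(bands[:-1]):
        current = upsample(current) + band
    return current
```

Custom per-band processing would go between the two calls, for instance coring only `bands[0]`, the finest band, before calling `reconstruct`.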
If there is sufficient memory available, the image can be decomposed into a pyramid representation after CFA interpolation and not reconstructed into a single image until after edge enhancement. In the interim, each component can potentially follow an individualized image processing path best suited to its nature.

3.3.3 Computing Environments

As would be expected, the computing environment strongly dictates the nature of the image processing chain. The amount of memory available for storing both image data and intermediate results, the nature of the processor’s design, and the fundamental speed of the processing unit are just some of the significant factors that must be taken into account. Rather than trying to address all of the issues, many of which are largely of a computer engineering nature, only topics directly associated with image processing chain design will be discussed.

3.3.3.1 Intermediate Data Storage

There are three data structures generally used in image processing chain implementations: full frame, line buffer, and tile buffer.

• Full-Frame Processing

Full-frame processing is conceptually the simplest way to implement an image processing chain. The entire raw image is read into memory at the beginning of the image processing chain. Each operation in the chain then operates sequentially, as an independent entity, on the image data in memory. At the end of the chain, the entire image is written to storage.

FIGURE 3.10 Original image data prior to intermediate data storage: a 3 × 3 RGB image with color planes R11 to R33, G11 to G33, and B11 to B33 (the subscripts give row and column).

Color images are generally stored in one of three ways in full-frame processing. Perhaps the most common is pixel interleaved, in which pixel value triplets are stored sequentially (see Figure 3.10 and Figure 3.11).
This method has the advantage of keeping all pixel values associated with a given support region (pixel neighborhood) in a smaller region of the full-frame buffer. The second most common format is frame interleaved, in which all pixels of a given color are stored contiguously in memory. Each color channel memory block may or may not itself be stored contiguously in memory. This is perhaps the most intuitive arrangement of the image data, especially when applying independent grayscale operations to each of the color channels. The third data format is line interleaving, in which each line of color image data is stored as separate lines of each of the individual color channels. This last approach does not see as much application as the other two formats.

FIGURE 3.11 Image data from Figure 3.10 arranged in (top) pixel interleaved, (middle) frame interleaved, and (bottom) line interleaved formats:
Pixel interleaved: R11 G11 B11 R12 G12 B12 R13 G13 B13 R21 G21 B21 R22 G22 B22 R23 G23 B23 R31 G31 B31 R32 G32 B32 R33 G33 B33
Frame interleaved: R11 R12 R13 R21 R22 R23 R31 R32 R33 G11 G12 G13 G21 G22 G23 G31 G32 G33 B11 B12 B13 B21 B22 B23 B31 B32 B33
Line interleaved: R11 R12 R13 G11 G12 G13 B11 B12 B13 R21 R22 R23 G21 G22 G23 B21 B22 B23 R31 R32 R33 G31 G32 G33 B31 B32 B33

FIGURE 3.12 (a) Producing one row of fully processed data (row 3) from a line data buffer. (b) Data buffer has been rolled from the image on the left. Row 4 may now be fully processed.

The key element that differentiates full-frame processing from the other two approaches is that the entire image is available to each of the image processing operations in the chain. This makes operations such as a full in-place wavelet pyramid decomposition possible.
Full-frame processing also restricts issues associated with support region (pixel neighborhood) boundary conditions to the physical edges of the image. As an extension of this idea, all intermediate results (e.g., an edge map) are also stored completely in memory in full-frame buffers, and the entirety of such intermediate results is available to each image processing operation.

The downside of full-frame processing is its use of large amounts of memory and the associated memory cache misses that are incurred. Even if a computing environment has sufficient RAM to hold all of the full-frame image buffers, the microprocessor has only a limited amount of cache memory that it can access rapidly [27]. If required image data is not currently in the cache, then a relatively time-consuming memory swapping process must occur to store the current cache contents in slower RAM and bring the required data into the faster cache memory. Using relatively large pixel neighborhoods can therefore quickly slow down the image processing chain because of the preponderance of memory cache misses.

• Line Buffer Processing

Line buffering was created to address these memory cache limitations. In line buffering, only as many lines of image data as are needed to produce one fully processed output line are read into memory at a time. The output line is subsequently sent to storage (Figure 3.12a). Once the output line is written, the line buffer is rolled by one line so that a new line of input data can be read into memory (Figure 3.12b). This process continues until the entire image is processed. Conceptually, the rolling process can be thought of as physically copying image data from a lower row to a higher row. In practice, a set of line data pointers is rolled instead, to avoid a large number of unnecessary data transfers. The immediate advantage of line buffering is that far less memory is consumed, as only a few lines of the image are ever resident.
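The rolling step can be illustrated with a short Python sketch. It assumes a single-channel image held as lists of rows and a 3 × 3 box blur as the only area operation, and it skips border pixels. The function names are hypothetical. Here `collections.deque(maxlen=3)` holds references to rows, so appending a new row discards the oldest reference. This corresponds to rolling line pointers rather than copying pixel data.

```python
from collections import deque


def box3_row(top, mid, bot):
    """3 x 3 box blur of the middle row; only interior columns are produced."""
    return [sum(row[c + d] for row in (top, mid, bot) for d in (-1, 0, 1)) / 9.0
            for c in range(1, len(mid) - 1)]


def line_buffered_blur(rows):
    """Yield blurred output rows while keeping at most three input rows resident.

    `rows` is any iterable of input rows (e.g., rows streamed from the sensor).
    Appending to a full deque(maxlen=3) drops the oldest row reference, which
    "rolls" the buffer by moving pointers instead of copying pixels.
    """
    lines = deque(maxlen=3)
    for row in rows:
        lines.append(row)
        if len(lines) == 3:
            yield box3_row(*lines)


image = [[(3 * r + c) % 7 for c in range(6)] for r in range(5)]
result = list(line_buffered_blur(iter(image)))
print(len(result), len(result[0]))  # 3 4
```

Because the generator consumes one input row per step, the whole image never has to be resident.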
Keeping only a few lines resident, in turn, reduces the tendency to incur memory paging (i.e., storing and retrieving blocks of memory from external storage) in systems constrained by physical memory [27]. The notable and significant downside is that the complexity of the image processing chain implementation increases substantially. Point processes, such as color correction, are unaffected by a line buffering strategy. Area processes, such as any convolution operation, can become a bookkeeping exercise.

FIGURE 3.13 (a) Example line buffer processing content: memory lines 1 to 3 hold unprocessed pixels (X), and memory lines 4 to 6 hold blurred and color-corrected pixels (BC). (b) Example tile buffer processing content: a 5 × 5 tile with unprocessed pixels (X) around its border, blurred and color-corrected pixels (BC) in its central 3 × 3 region, and a blurred, color-corrected, and sharpened pixel (BCS) at its center.

Consider Figure 3.13a. As a simplified example, the task is to produce a line of processed image data that is blurred with a 3 × 3 low-pass kernel, then color corrected, then sharpened with a 3 × 3 high-pass kernel. First, a convolution with a 3 × 3 kernel requires at least three lines of image data to be present. Therefore, three rows of unprocessed pixels (X) are read into memory lines 1 to 3. (Pixel interleaving is assumed but not explicitly shown.) This allows the blurring operation (B) to be performed and the results stored in memory line 4. The color correction operation (C), being a point operation, can now be performed in place on line 4, producing pixels marked with both B and C. At this point, the unprocessed pixel buffer (lines 1 to 3) is rolled so that line 2 is written to line 1, line 3 is written to line 2, and a new row of unprocessed pixels is written to line 3. (Again, in practice only the pointers to the memory lines would be changed.) A new set of blurring and color correction operations can then be performed and the results stored in memory line 5.
Lines 1 to 3 are rolled again and then blurred and color corrected to populate memory line 6. Now there are enough rows of blurred and color-corrected pixels to perform the sharpening operation. This produces a finished row of blurred, color-corrected, and sharpened pixels that can be written to the output storage. At this point in the cycle, a steady state has been reached. To produce the next row of finished pixels, the unprocessed pixel buffer need only be rolled one row, the intermediate blurred and color-corrected line buffer also need only be rolled one row, and the new pixel values computed. In a similar manner, a more elaborate image processing chain can be analyzed and the corresponding line buffering requirements and sequencing of operations determined. A key observation in this regard is that as the number of area processing operations increases, so does the required number of rows in the line buffer.

• Tile Buffer Processing

Some constrained computing environments will still be overtaxed by line buffering. This leads to the third alternative, tile buffering. With this method, a small two-dimensional region of pixels is read into memory, and a corresponding region of fully processed pixel values is produced. The size of the tile is determined by the size of the available cache memory, with the goal of minimizing the number of memory cache misses that occur during the processing of a given tile. Consequently, execution time can be significantly reduced. Unfortunately, algorithm complexity grows as well.

Figure 3.13b illustrates tile buffer processing. The simple image processing chain again consists of a 3 × 3 low-pass kernel, followed by color correction and then sharpening with a 3 × 3 high-pass kernel. As can be seen in the figure, the tile size has been set to 5 × 5. This allows the central 3 × 3 region of pixels to be blurred (B).
These nine pixels can then be color corrected (C). Finally, the central pixel can be sharpened (S). Therefore, a 5 × 5 tile produces a single fully processed pixel. (As with line buffering, the details of the convolution operations have been omitted here. A second tile would be required to properly perform these convolutions.) Once the output pixel has been written, the tile is rolled in a manner similar to line buffering to reduce the number of computations that must be repeated. More complex image processing chains can be analyzed and implemented with tile buffering in the same way. As with line buffering, as the number of area operations increases, so does the required size of the tile. Note in passing that hybrids of full-frame, line, and tile buffering are quite possible, depending on the nature of the computing environment.

3.3.3.2 Physical Environments

With a number of image data buffering options available, it is appropriate to discuss the circumstances under which each method is most applicable.

• Custom Hardware Implementations

For optimum computational efficiency, computing hardware can be explicitly designed to implement a given image processing chain. This usually takes the form of an application-specific integrated circuit (ASIC). Developing a new ASIC is an expensive process that requires relatively high usage volumes to make this approach financially attractive. Therefore, to keep component costs at a minimum, image processing paths that use a minimal amount of memory are highly preferred. This typically results in image processing chains designed to be implemented as “soda straw” pipelines. To this end, line and tile buffering are used almost exclusively. Additionally, the image processing chain itself will be relatively devoid of branch points that require intermediate results (e.g., from luminance-chrominance splitting) to be kept in memory while other computations are performed.
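The tile example can be checked with a short Python sketch of the 5 × 5 tile chain from Figure 3.13b. Each 3 × 3 area stage consumes one pixel of margin on every side. As a result, the tile side (and the number of input rows spanned by a line buffer) that feeds one output pixel is 1 + 2 × (sum of the support radii). Several details are assumptions rather than the text's design. The sharpening kernel is one choice among many. Color correction is replaced by a hypothetical scalar gain, because any point operation plays the same role here. Each tile is also recomputed from scratch rather than rolled.

```python
BLUR = [[1 / 9] * 3 for _ in range(3)]           # 3 x 3 low-pass (box) kernel
SHARPEN = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]  # assumed 3 x 3 high-boost kernel


def support_size(radii):
    """Tile side (or input rows spanned) needed for one output pixel."""
    return 1 + 2 * sum(radii)


def conv3_valid(img, kernel):
    """3 x 3 filtering kept only where the full neighborhood lies inside img.

    The kernels here are symmetric, so correlation and convolution coincide.
    """
    h, w = len(img), len(img[0])
    return [[sum(kernel[i][j] * img[r + i - 1][c + j - 1] for i in range(3) for j in range(3))
             for c in range(1, w - 1)]
            for r in range(1, h - 1)]


def color_correct(img, gain=1.2):
    """Stand-in point operation: a hypothetical gain instead of a 3 x 3 color matrix."""
    return [[gain * v for v in row] for row in img]


def process_tile(tile):
    """5 x 5 tile -> 3 x 3 blurred (B) -> color corrected (C) -> 1 x 1 sharpened (S)."""
    assert len(tile) == len(tile[0]) == support_size([1, 1])
    return conv3_valid(color_correct(conv3_valid(tile, BLUR)), SHARPEN)[0][0]


# Full-frame reference versus one 5 x 5 tile per output pixel.
image = [[(r * 7 + c) % 5 for c in range(7)] for r in range(7)]
full = conv3_valid(color_correct(conv3_valid(image, BLUR)), SHARPEN)
tiled = [[process_tile([row[c:c + 5] for row in image[r:r + 5]]) for c in range(3)]
         for r in range(3)]
```

Both approaches produce the same 3 × 3 result. Adding another 3 × 3 stage would raise `support_size` from 5 to 7, which is why ASIC designers try to keep both the number of area operations and their radii small.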
Because the size of the line and tile buffers is heavily influenced by the number of area operations in the chain and their respective support region radii, there is a strong incentive to minimize both. As a final observation, once constructed, the image processing chain in the ASIC cannot be changed. Therefore, the importance of careful and robust image processing chain design in an ASIC cannot be overemphasized.

• Firmware Implementations

The digital signal processor (DSP) provides a significant degree of freedom to the image processing chain engineer: the DSP is a programmable device. Because the same DSP can be used in a wide variety of products, its cost is greatly lowered compared to that of an ASIC. Changes can also be made to the image processing chain in the DSP. This makes it possible to upgrade the resident firmware to customize it for specific applications, fix bugs, or add new features. In return for these benefits, processing speed is lost with respect to the ASIC.

Many of the considerations of using an ASIC carry over into using DSPs. Because cache memory is usually at a premium, line and tile buffering are still essential. The constitution of the image processing chain, however, may differ from its ASIC equivalent. Today’s DSPs come with built-in capabilities that streamline certain essential image processing operations (e.g., convolution). As a result, choosing to perform “more” computations can actually lead to shorter execution times. Therefore, designing an optimized image processing chain for a DSP places an additional burden on the chain designer: understanding the computational idiosyncrasies of the DSP in question.

• Desktop Implementations

Most image processing chains are first prototyped in a desktop environment. This can sometimes give a false sense of the performance achievable in one of the environments just discussed.
Today’s desktop computers have extremely large amounts of cache memory compared to the DSP and (equivalently) ASIC environments. Therefore, area operations using larger support regions may run very quickly on a desktop computer while slowing to a crawl in the DSP environment. Another advantage of the desktop computer is that memory is usually copiously available. This makes full-frame processing, with many extra buffers for storing intermediate results, practical if not preferred.

If the ultimate compute environment of the image processing chain is a desktop computer, then there is generally no incentive to use line or tile buffering unless the images are large compared to the available memory. Professional and scientific digital cameras can have sensors with more than 20 megapixels. When converted to full-color images, these become 60-megabyte entities (20 million pixels × 3 color values), assuming one byte of memory per color per pixel. For such high-end applications, it is not uncommon to require two bytes of memory per color per pixel, leading to 120 megabytes per image in memory. Now consider the common scenario in which two or more such images must be worked on simultaneously. Under these specialized circumstances, line or tile buffering may begin to be attractive.

3.3.4 Resizing and Compression

3.3.4.1 Image Resizing

One of the most common image processing operations not included in the reference chain is resizing. Resizing is simply a form of interpolation that encompasses both digitally enlarging (i.e., digital zoom) and reducing an image. The standard interpolation methods (e.g., pixel replication or subsampling, bilinear interpolation, and bicubic interpolation) are again the general workhorses of this operation. In the case of image reduction, a preliminary antialiasing step (i.e., low-pass filtering or blurring) may also precede interpolation to prevent aliasing artifacts in the resized image [28].
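The effect of the antialiasing step can be seen on the worst case for reduction: a one-pixel checkerboard. This minimal Python sketch reduces by an integer factor, either by plain subsampling or after averaging each block. Block averaging is used here as a simple stand-in for a proper low-pass filter. The sketch also includes bilinear interpolation for the enlargement case. The function names are illustrative.

```python
def reduce_by(img, f, antialias=True):
    """Reduce an image by an integer factor f.

    antialias=False: plain subsampling (keep every f-th pixel).
    antialias=True: average each f x f block first (a simple box low-pass).
    """
    h, w = len(img) // f, len(img[0]) // f
    if not antialias:
        return [[img[r * f][c * f] for c in range(w)] for r in range(h)]
    return [[sum(img[r * f + i][c * f + j] for i in range(f) for j in range(f)) / (f * f)
             for c in range(w)] for r in range(h)]


def bilinear(img, y, x):
    """Bilinearly interpolate img at fractional position (y, x), clamping at the far edges."""
    h, w = len(img), len(img[0])
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    fy, fx = y - y0, x - x0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bot = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bot * fy


checker = [[(r + c) % 2 for c in range(8)] for r in range(8)]  # finest possible detail
print(reduce_by(checker, 2, antialias=False)[0])  # [0, 0, 0, 0]: aliased to flat black
print(reduce_by(checker, 2)[0])                   # [0.5, 0.5, 0.5, 0.5]: mid gray
```

Plain subsampling turns the pattern into a uniform black image, which is a false signal. With prefiltering, the result is the gray that correctly represents detail too fine for the reduced grid.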
Conceptually, the simplest way to add resizing to the reference image processing chain is to append it at the end, where it resizes the display RGB image (Figure 3.14). This is a useful position if nothing is known a priori about the resizing parameters (i.e., what the final size of the resized image will be). This would be the scenario of a user-driven postprocessing operation, such as one might find in a kiosk or desktop application. However, even in this situation, it may be possible to find a better location in the image processing chain.

FIGURE 3.14 Image resizing performed after edge enhancement (left: tone scale and gamma correction*, edge enhancement*, image resizing) and before edge enhancement (right: tone scale and gamma correction*, image resizing, edge enhancement*). Both chains run from the adjusted CFA image to the “display” RGB image; * denotes a potential noise amplification step.

The effects of the edge enhancement operation are particularly sensitive to the final image size. The amount of sharpening desired for a standard-size video display image may produce an undersharpened or oversharpened image after resizing. The easiest way to address this situation is to perform resizing before edge enhancement (Figure 3.14, right). In this way, edge enhancement operates at the resolution of the final image, which makes the result somewhat less sensitive to resizing. The correction is not ideal, and further adjustments to the edge enhancement may be required.

If the resizing operation is constrained (e.g., only enlarging) or static (fixed to a specific enlargement or reduction ratio), then further economies may be possible. If the final image will be smaller than the original image, it makes sense to move the resizing operation as early in the image processing chain as possible. This reduces the amount of data that is subsequently processed, decreasing memory usage and execution time.
When the image is to be reduced, two locations in the chain suggest themselves: immediately after CFA interpolation (on the camera RGB image) and, further back, after structured noise reduction. The former case is straightforward: the camera RGB image is a full-color, full-resolution image that can be resized with the standard methods discussed previously. The latter case is more problematic, because the task is then to resize CFA image data. The exact details of the computations performed will be dictated by the actual CFA MRU. Formal interpolation of the individual subsampled color channels can be performed. However, the typical image reduction scenario is one that requires short execution times, for instance, video-rate readout and real-time image preview. Therefore, formal interpolation operations are usually simplified, sometimes significantly. If the image quality requirements are sufficiently lax, one approach is to use superpixel-inspired pixel subsampling, with or without antialiasing preprocessing [29]. The main engineering task in these reduced-computation environments is to minimize the color aliasing artifacts created by CFA image data reduction.

If the final image will be enlarged, then it would seem that the later in the chain the resizing operation occurs, the better. The reasoning is that the number of image pixels that must be processed is inflated only at the very end, thereby minimizing the amount of computation. This would lead to performing resizing either just before or just after edge enhancement. One confounding notion in this approach is the inherent redundancy between the resizing and CFA interpolation operations: both perform a type of interpolation. With some thought, it is possible to devise computational schemes that perform CFA interpolation and resizing simultaneously [30].
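The simplest form of superpixel subsampling collapses each 2 × 2 Bayer MRU into one full-color pixel. This halves the image in each direction and replaces CFA interpolation with a trivial gather, so it also gives a rough picture of combining the two operations. The following Python sketch assumes an RGGB layout, averages the two greens, and applies no antialiasing. It is not the specific method of [29]. The helper `mosaic` exists only to build test data.

```python
def bayer_superpixels(cfa):
    """Collapse each 2 x 2 Bayer MRU into one full-color pixel (half size in each direction).

    Assumes an RGGB layout: R at (even, even), G at (even, odd) and (odd, even),
    B at (odd, odd). The two greens are averaged; no antialiasing is applied.
    """
    out = []
    for r in range(0, len(cfa) - 1, 2):
        row = []
        for c in range(0, len(cfa[0]) - 1, 2):
            red = cfa[r][c]
            green = (cfa[r][c + 1] + cfa[r + 1][c]) / 2
            blue = cfa[r + 1][c + 1]
            row.append((red, green, blue))
        out.append(row)
    return out


def mosaic(rgb, h, w):
    """Sample a full-color image rgb[r][c] = (R, G, B) through an RGGB CFA (test helper)."""
    def channel(r, c):
        return 0 if (r % 2, c % 2) == (0, 0) else 2 if (r % 2, c % 2) == (1, 1) else 1
    return [[rgb[r][c][channel(r, c)] for c in range(w)] for r in range(h)]


flat = [[(200, 100, 50)] * 4 for _ in range(4)]
print(bayer_superpixels(mosaic(flat, 4, 4)))
# [[(200, 100.0, 50), (200, 100.0, 50)], [(200, 100.0, 50), (200, 100.0, 50)]]
```

The red, green, and blue samples of each superpixel come from slightly different positions. On fine detail this offset is a source of the color aliasing that reduced-computation designs must keep in check.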
Computing a resized full-color image directly from original-resolution CFA data can yield significant savings in execution time. Algorithmic complexity is the price paid for this improvement in efficiency. Chapter 17 discusses image resizing issues in detail.

3.3.4.2 Image Compression

Compression of an image is more an exercise in changing the representation of the data than in actually operating on the data itself. This is precisely true for lossless compression. Lossy compression can be viewed as having the added benefit of providing rudimentary noise reduction, but the intent of most such compression operations is to leave the final image as visually similar to the original image as possible.

JPEG has long been established in the industry as the lossy image compression algorithm of choice [31]. (The lossless form of JPEG is largely ignored.) This algorithm operates in a luminance-chrominance color space. It uses discrete cosine transforms to perform a spatial frequency transform of the image and then quantizes the frequency components to eliminate visually redundant information, producing a smaller image representation.

For lossless compression, there are a couple of choices. Especially for Web-based applications, Lempel-Ziv-Welch (LZW) compression, as implemented in the GIF file format and some TIFF files, is a de facto standard [32]. More recently, the LZ77 variant called deflation (DEFLATE), implemented in the PNG file format, claims slightly better compression performance and, perhaps more importantly, freedom from intellectual property (i.e., patent) entanglements [33]. It should be hastily added that this list is far from comprehensive. Compression continues to be an active area of research, and data storage and retrieval, transmission, and security are only some of its significant applications.
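The lossy core of JPEG can be illustrated with a deliberately simplified Python sketch: an orthonormal 8 × 8 DCT followed by quantization. Several simplifications are assumptions of this sketch. It uses one uniform quantizer step instead of JPEG’s per-frequency quantization tables, and it omits the level shift, the luminance-chrominance conversion, and entropy coding. On a smooth block, almost all coefficients quantize to zero, and that is where the size reduction comes from.

```python
import math

N = 8


def dct_matrix(n=N):
    """Orthonormal DCT-II basis: C[u][x] = a(u) * cos(pi * (2x + 1) * u / (2n))."""
    return [[(math.sqrt(1 / n) if u == 0 else math.sqrt(2 / n))
             * math.cos(math.pi * (2 * x + 1) * u / (2 * n)) for x in range(n)]
            for u in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


C = dct_matrix()
CT = transpose(C)


def dct2(block):
    """Forward 2-D DCT of an 8 x 8 block: X = C x C^T."""
    return matmul(matmul(C, block), CT)


def idct2(coef):
    """Inverse 2-D DCT: x = C^T X C (C is orthogonal)."""
    return matmul(matmul(CT, coef), C)


def quantize(coef, step):
    """Uniform quantization: the only lossy step in this sketch."""
    return [[round(v / step) for v in row] for row in coef]


def dequantize(levels, step):
    return [[q * step for q in row] for row in levels]


block = [[r + c for c in range(N)] for r in range(N)]   # smooth ramp
levels = quantize(dct2(block), step=10)
nonzero = sum(1 for row in levels for q in row if q != 0)
print(nonzero)  # 3 of 64 coefficients survive
restored = idct2(dequantize(levels, 10))
```

Without the `quantize` step, `idct2(dct2(block))` returns the block to within rounding error. The transform alone only changes the representation, and the information loss comes entirely from quantization.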
Because the encoded image data is generally impractical to use for any image processing operation other than decompression, compression should clearly occur on the final display RGB image prior to actual storage (Figure 3.15). This, of course, emphasizes one of the primary purposes of compression cited above: to reduce the storage size of the image. With the standardization of sRGB space, it has become commercially attractive to build JPEG hardware and firmware engines that perform an otherwise complex algorithm in a minimal amount of time. As a result, any digital camera that produces a fully processed image can largely be expected to compress that image in JPEG format prior to storage.

FIGURE 3.15 Image compression of the raw CFA image (left: raw CFA image, CFA image compression/decompression, dark floor subtraction, structured noise reduction) and of the display RGB image (right: tone scale and gamma correction*, edge enhancement*, RGB image compression/decompression, “display” RGB image). * denotes a potential noise amplification step.

On a historical note, image compression was once performed much earlier in the image processing chain, when digital cameras had only primitive compute capabilities on board (see Figure 3.15, left) [25]. The role of the digital camera was once solely to capture a raw CFA image and store it in a compressed format for later image processing in a desktop environment. It was clear that an image in a luminance-chrominance space can be compressed much more effectively than an RGB image. However, CFA image data generally cannot be converted directly to, for example, YCrCb without an intermediate CFA interpolation step. Assuming a Bayer pattern CFA, the solution was to create a temporary interpolated green pixel value at each red and blue pixel location using a simple boxcar average of the four neighboring green pixel values.
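This historical preprocessing step can be sketched in Python. The sketch estimates temporary green values at the red and blue sites with a four-neighbor boxcar average. It then forms the R − G and B − G color differences that the scheme used as quarter-resolution chrominance. Two details are assumptions: an RGGB layout with even image dimensions, and mirrored indices at the borders so that every neighbor used is a green site. The lossy compression itself is not shown.

```python
def reflect(i, n):
    """Mirror an out-of-range index back into [0, n) without repeating the edge sample."""
    if i < 0:
        return -i
    if i >= n:
        return 2 * (n - 1) - i
    return i


def green_and_differences(cfa):
    """Split an RGGB Bayer image into full-resolution green plus quarter-resolution R-G and B-G.

    Green at each R and B site is a temporary boxcar average of its four green neighbors.
    Assumes even image dimensions so that mirrored neighbors are also green sites.
    """
    h, w = len(cfa), len(cfa[0])
    green = [[float(v) for v in row] for row in cfa]
    for r in range(h):
        for c in range(w):
            if r % 2 == c % 2:  # R at (even, even) or B at (odd, odd)
                green[r][c] = (cfa[reflect(r - 1, h)][c] + cfa[reflect(r + 1, h)][c]
                               + cfa[r][reflect(c - 1, w)] + cfa[r][reflect(c + 1, w)]) / 4.0
    r_minus_g = [[cfa[r][c] - green[r][c] for c in range(0, w, 2)] for r in range(0, h, 2)]
    b_minus_g = [[cfa[r][c] - green[r][c] for c in range(1, w, 2)] for r in range(1, h, 2)]
    return green, r_minus_g, b_minus_g


# A flat color (R, G, B) = (200, 100, 50) seen through an RGGB CFA.
cfa = [[200 if (r % 2, c % 2) == (0, 0) else 50 if (r % 2, c % 2) == (1, 1) else 100
        for c in range(4)] for r in range(4)]
green, rg, bg = green_and_differences(cfa)
print(rg, bg)  # [[100.0, 100.0], [100.0, 100.0]] [[-50.0, -50.0], [-50.0, -50.0]]
```

On decompression, the temporary green values at the red and blue sites would be discarded and replaced by the output of a better CFA interpolation method.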
These temporary green values in turn permitted the computation of R − G and B − G color differences at the red and blue pixel locations. The result was a luminance (green, full resolution) and chrominance (R − G, B − G, quarter resolution) representation of the CFA data. This representation could be compressed in a lossy manner, producing a relatively small image file to be stored [34]. Upon decompression and transformation back to RGB, the temporary interpolated green pixel values were discarded in favor of results generated by better CFA interpolation methods. Detailed treatment of camera image compression can be found in Chapters 14 and 15.

3.3.5 Other Factors

Up to this point, there has been a largely tacit assumption that the image data in the chain is kept in the same photometric space as originally read from the sensor. There are many situations in which this may not produce optimum results. These are discussed below.

TABLE 3.1 First values of the 8-bit linear to 8-bit video gamma transform.
Representation   Values
8-bit linear     0   1   2   3   4   5   6   7   8   9   10
8-bit video      0   13  22  28  34  38  42  46  50  53  56

TABLE 3.2 First values of the 10-bit linear to 8-bit video gamma transform.
Representation   Values
10-bit linear    0   1   2   3   4   5   6   7   8   9   10
8-bit video      0   3   6   10  13  15  18  20  22  23  25

3.3.5.1 Bit Depth

The number of code values used to span the range from full black to full white heavily influences the amount of quantization error (contouring) that will be seen in the final image. Unfortunately, this is not the only factor, as local scene content, image noise, and the Weber’s Law sensitivity of the HVS also come into play. Still, bit depth can be considered of prime importance. Working against the solution of simply carrying enough bits to eliminate quantization errors is the additional memory needed to store and manipulate the extra data resolution.
Because sRGB video space consists of 8-bit data [0-255], it is natural to want to design an entirely 8-bit image processing chain. This choice is supported by the fact that larger bit depths immediately double (at least) the memory requirement, from one byte per pixel value to two bytes per pixel value, unless data packing is used, which brings its own series of issues. Unfortunately, 8-bit linear data will not populate all 256 states after the video gamma transform. As shown in Table 3.1, the first 11 code values in 8-bit linear space map into only 11 of the first 57 code values in 8-bit video gamma space. The other 46 code values will never be populated. Because of the compressive nature of the video gamma curve, more of the larger code value states will be populated, but significant contouring should be expected in the shadow regions of the image.

Even-numbered bit depths appear to be preferred over odd-numbered bit depths, so the next data resolution level considered is usually 10 bits [0-1023]. The number of missing states after transformation to video gamma space is significantly reduced, but not eliminated (Table 3.2). Although a small amount of contouring may be visible in shadow regions, this level is usually acceptable for consumer photography. For professional photography, however, it is best that all video gamma states are populated. This leads to the next bit depth to be considered (Table 3.3), which is 12 bits [0-4095]. All of the output states are now populated in this example. However, quantization error can still occur if some of the input 12-bit states are depopulated. This can happen because of poor exposure during capture (e.g., underexposure followed by digital exposure compensation) or because of loss of data resolution during the image processing chain computations. As a result, some professional digital cameras now employ 14 bits [0-16383].
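The tables in this section can be regenerated with a short Python sketch. It assumes that “video gamma” means the standard sRGB transfer function and that values are rounded to the nearest code. The text does not state this explicitly, but under this assumption the first 11 values match Tables 3.1 through 3.4. The sketch also counts how many of the 256 output codes the full input range actually reaches.

```python
import math


def srgb_encode(lin):
    """sRGB (IEC 61966-2-1) transfer function: linear [0, 1] -> encoded [0, 1]."""
    if lin <= 0.0031308:
        return 12.92 * lin
    return 1.055 * lin ** (1 / 2.4) - 0.055


def to_video8(code, bits):
    """Linear code value at the given bit depth -> 8-bit video code (round to nearest)."""
    return math.floor(255 * srgb_encode(code / ((1 << bits) - 1)) + 0.5)


def first_values(bits, count=11):
    return [to_video8(n, bits) for n in range(count)]


def populated_states(bits):
    """How many of the 256 output codes are reachable from the full input range."""
    return len({to_video8(n, bits) for n in range(1 << bits)})


for bits in (8, 10, 12, 14):
    print(bits, first_values(bits), populated_states(bits))
```

At 12 and 14 bits every output code is reached, because the steepest part of the curve (slope 12.92 near black) advances less than one output code per input step. At 8 and 10 bits the shadow codes are skipped.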
TABLE 3.3 First values of the 12-bit linear to 8-bit video gamma transform.
Representation   Values
12-bit linear    0   1   2   3   4   5   6   7   8   9   10
8-bit video      0   1   2   2   3   4   5   6   6   7   8

TABLE 3.4 First values of the 14-bit linear to 8-bit video gamma transform.
Representation   Values
14-bit linear    0   1   2   3   4   5   6   7   8   9   10
8-bit video      0   0   0   1   1   1   1   1   2   2   2

Table 3.4 makes clear that several of the 14-bit steps could be depopulated and all of the 8-bit video gamma states would still very likely be used.

The choice of bit depth is further complicated by matters beyond those already discussed. As the number of bits increases, data overflow during the computation of intermediate numerical results becomes increasingly likely unless higher precision arithmetic is used. Because higher precision arithmetic requires more memory and takes longer to perform, this can become a significant liability. Additionally, lookup tables can become unwieldy at higher bit depths, as the tables must grow if numerical precision is to be preserved. Besides the burden of maintaining such large lookup tables, operations involving large tables can quickly produce an undesirable number of memory cache misses.

3.3.5.2 Nonlinear Photometric Spaces

One compromise solution to the bit depth question is to store the image data in a nonlinear photometric space. One such space is the 8-bit video gamma space (see above). This solution is frequently found in video imaging applications (see below). Another convenient family of spaces, discussed shortly, is based on logarithmic responses. The crux of these solutions is to constrain the numerical bit depth to eight bits by discarding the code states that are least likely to produce visually objectionable image quantization artifacts.
Given the Weber’s Law response of the HVS, this suggests a compressive function that preserves most of the states in the lower portion of the code range and discards states more frequently as the code values increase (i.e., a logarithmic-like response). In the context of the image processing chain, the data would be produced by the image sensor and A/D converter at, for example, 10-bit resolution and then immediately transformed to an 8-bit space for storage in RAM. A pure logarithm function, $a \log(1 + x)$, used for this transformation will have problems near zero because of the high slope of the curve and the resulting number of missing output states. Mimicking the solution used by the sRGB video gamma curve, a linear segment of unity slope replaces the logarithm in this region. The two functional pieces are matched in value and slope at the knot point to eliminate visible discontinuities (Figure 3.16) [35].

FIGURE 3.16 Comparison of the linear-log transform (KLUT) and the sRGB video gamma transform, plotted as 8-bit nonlinear code value (0 to 256) versus 10-bit linear code value (0 to 1024).

The general form of this transform then becomes

$$ y = \begin{cases} x, & x \le k \\ k + k \ln\left(\dfrac{x}{k}\right), & x > k \end{cases} \qquad (3.11) $$

$$ k = -\frac{y_1}{W_{-1}\left(-\dfrac{y_1}{x_1} e^{-1}\right)} \qquad (3.12) $$

where $W_{-1}(x)$ is the −1 branch of Lambert’s W function [36]. The input code value range is from zero to $x_1$, and the output code value range is from zero to $y_1$. This approach can be augmented by adding a linear segment to the upper end of the logarithm to regulate the slope of the function in that region. Once the functional form of the nonlinear transform is determined, the image data in the image processing chain can be freely transformed between nonlinear and linear spaces as needed. Lookup tables can be used to speed up these transforms.
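Equations (3.11) and (3.12) can be evaluated without a special-function library. The following Python sketch computes $W_{-1}$ by bisection, using the fact that $w e^{w}$ decreases monotonically from 0 toward $-1/e$ for $w < -1$. It then derives the knot point for a 10-bit input and an 8-bit output and builds the corresponding lookup table. The function names and the bisection approach are choices made for this illustration; they are not taken from [35] or [36].

```python
import math


def lambert_w_minus1(z, iterations=200):
    """Real branch W_{-1}(z) for -1/e < z < 0, by bisection on w * exp(w) = z with w < -1.

    On w < -1 the map w -> w * exp(w) decreases monotonically from 0 toward -1/e.
    """
    if not -1 / math.e < z < 0:
        raise ValueError("this sketch needs -1/e < z < 0")
    lo, hi = -100.0, -1.0  # lo * exp(lo) is ~0 (> z); hi * exp(hi) = -1/e (< z)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mid * math.exp(mid) > z:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def knee(x1, y1):
    """Eq. (3.12): knot point k that makes Eq. (3.11) map x1 onto y1."""
    return -y1 / lambert_w_minus1(-(y1 / x1) / math.e)


def klut(x, k):
    """Eq. (3.11): unity-slope line up to k, then k + k ln(x / k)."""
    return x if x <= k else k + k * math.log(x / k)


k = knee(1023, 255)                            # 10-bit linear in, 8-bit out
lut = [round(klut(x, k)) for x in range(1024)]
print(k, lut[-1], len(set(lut)))
```

For these ranges, k comes out at roughly 69. Neither segment has a slope greater than one, so the table uses all 256 output codes, unlike the 10-bit sRGB case in Table 3.2.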
If a lookup table is impractical because of its size, the transform function can be approximated by a rational function of the form $(x + a_0)/(b_1 x + b_0)$, which may be simple enough to be recomputed as needed. The number of transform and inverse transform cycles should be kept to a minimum, as each cycle may lose data resolution. To this end, some image processing operations, such as color correction, really need to be performed on linear data (but see below). Other operations, such as CFA interpolation, can plausibly be performed in the nonlinear space.

3.3.5.3 Extended Dynamic Range

It is well known among photographers that silver halide films provide significantly more dynamic range capture capability (5˜ stops) than do silicon sensors (2˜ stops). Closing this performance gap is presently an active area of research in both academia and industry. The image processing chain ramifications of two prominent approaches are discussed below.

FIGURE 3.17 CFA MRU: (a) with different photometric gains, where pixels marked with an asterisk (G*, R*, B*) have the higher gain, and (b) with red (R), green (G), blue (B), and panchromatic (P) pixels.

• Multiple Color Channel Sensitivities

One general approach is to expand the MRU of the CFA pattern and, for each color in the pattern, provide two or more versions with different photometric gains [37]. Figure 3.17a illustrates one approach. Pixels marked with an asterisk have a higher photometric gain than those that are not. In processing, the point will inevitably be reached where the two inherent photometric ranges need to be merged to produce an extended dynamic range image.
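A minimal per-pixel sketch of such a merge in Python follows, under several assumptions that the text does not specify. The high- and low-gain values are assumed to already be available at the same location (e.g., after CFA interpolation), the gain ratio is known, and the merge is a hard switch at a hypothetical near-saturation threshold. A practical design would likely blend the two samples near the switch point to avoid visible seams. The bit count computed here is only the minimum of one extra bit per stop of gain difference.

```python
import math


def merge_dual_gain(high, low, gain_ratio=4.0, full_scale=1023, threshold=0.9):
    """Merge co-sited high- and low-gain samples onto the high-gain scale.

    `threshold` (hypothetical) is the fraction of full scale above which the
    high-gain sample is treated as clipped or nearly clipped.
    """
    merged = []
    for h, l in zip(high, low):
        if h < threshold * full_scale:
            merged.append(float(h))        # shadows and midtones: keep the more sensitive sample
        else:
            merged.append(l * gain_ratio)  # near clipping: rescale the low-gain sample
    return merged


def merged_bits(base_bits, gain_ratio):
    """Minimum bits needed to hold the merged range without requantizing."""
    return base_bits + math.ceil(math.log2(gain_ratio))


print(merge_dual_gain([100, 500, 1023], [25, 125, 600]))  # [100.0, 500.0, 2400.0]
print(merged_bits(10, 4))                                  # 12: two stops of gain difference
```

The third pixel saturates the high-gain channel, so its value is recovered from the low-gain channel. The result exceeds the original 10-bit range, which is why bit depth reappears as a design issue.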
The merged result may be a final image that preserves the entire dynamic range, providing maximum flexibility in subsequent postprocessing operations. Alternatively, it may be an intermediate entity that is eventually reduced to an 8-bit sRGB image in a way that takes advantage of the extended dynamic range, for instance, via a custom tone scale transform. Either way, the issue of bit depth and data resolution reenters the image processing chain design. Typically, one to two extra bits of data will be needed for each stop of exposure difference between the two photometric gains if no quantizing is performed.

Where the photometric ranges are merged in the image processing chain must also be considered. One natural place is as early as possible in the image processing chain, usually during CFA interpolation [38]. Once merged, the image can proceed through a normal image processing chain, albeit with a larger dynamic range. Alternatively, the two dynamic ranges could be kept separate until later in the chain, with custom image processing paths in the manner of luminance-chrominance or high- and low-frequency splitting.

• Separate Panchromatic Channel

A different approach is to add a spectrally broad channel to the CFA MRU and use the natural boost in photometric sensitivity to create an extended dynamic range image [39], [40], [41]. Figure 3.17b illustrates one such arrangement. The pixels marked P in this figure are associated with the panchromatic (broad) spectral channel. The typical spectral response of such a panchromatic channel, compared with those of the more traditional RGB color channels, is shown in Figure 3.18.
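With a panchromatic channel, the merge adds the low-frequency component of the full-color image to the high-frequency component of the panchromatic image. A one-row Python sketch of that idea follows. It assumes that the color channel has already been upsampled to the panchromatic resolution. A truncated box filter with a hypothetical radius serves as the low-pass filter, and the high-pass component is the panchromatic row minus its low-pass version.

```python
def box_blur_1d(x, radius=2):
    """Truncated box low-pass: mean over the neighbors that exist near the ends."""
    n = len(x)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out


def pan_merge_row(color_row, pan_row, radius=2):
    """Low frequencies from the color channel plus high frequencies from the panchromatic row."""
    color_low = box_blur_1d(color_row, radius)
    pan_low = box_blur_1d(pan_row, radius)
    return [cl + (p - pl) for cl, p, pl in zip(color_low, pan_row, pan_low)]


red = [0.5] * 6                        # smooth (already upsampled) red channel
pan = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]   # sharp edge seen by the panchromatic pixels
print([round(v, 2) for v in pan_merge_row(red, pan)])
```

The merged red row now contains the edge that only the panchromatic pixels resolved, while its average level still comes from the color data. A flat panchromatic row contributes nothing, and the output is then just the low-passed color row.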
FIGURE 3.18 Typical spectral responses of red, green, blue, and panchromatic CFA filters (relative response from 0.00 to 1.00 versus wavelength from 350 to 700 nm). Also shown is a typical spectral response for an infrared (IR) cutoff filter.

The RGB color channels can be used to create a low-resolution full-color image with typical photometric sensitivity. The panchromatic channel can be used to create a high-resolution grayscale image with high photometric sensitivity. These images can be merged by taking the low-frequency component of the full-color image and adding to it the high-frequency component of the panchromatic image [42], [43]. The result is a full-color image whose extended dynamic range comes from a significantly lowered noise floor of the imaging system. As discussed previously, the image processing chain can be bifurcated to process the low-resolution full-color image and the high-resolution panchromatic image differently before the point of merger.

3.4 How Video Differs from Still Photography

The foregoing discussion has focused on still photography. When considering video photography, a number of assumptions must be significantly reassessed. Perhaps the most important issue is that real-time video image processing must be accomplished at video rates (e.g., 30 frames per second). This means the entire image processing chain must produce a fully finished video frame in around one-thirtieth of a second. This stands in stark contrast to a consumer digital still camera, which may take a second or two to produce a finished image. As a direct consequence, and to lower the computational demands, pixel resolutions of video frames are much lower than those of digital still images.
Even with dramatically lowered pixel resolutions, it becomes almost immediately clear that the image processing chain must be significantly abbreviated for the video environment.

FIGURE 3.19 Idealized video image processing chain. (Block labels: raw CFA image; exposure and white balance analysis*; exposure and white balance correction; adjusted CFA image; color filter array interpolation; “camera” RGB image; tone scale and gamma correction; RGB to YCC conversion; color correction; edge enhancement; “video” YCC image. *To be used when capturing the next frame.)

Somewhat offsetting the problem of limited execution times is that video image processing can take advantage of temporal averaging between consecutive frames. When the video sequence is viewed, each frame is displayed for only one-thirtieth of a second (continuing our example). This is close to the temporal resolution limit of the HVS (the critical flicker frequency), which is approximately 20 to 60 Hz [44]. Each frame will “blur” into the next one and provide a perceived smoothing effect on the image. As a result, a certain amount of noise reduction comes “for free”. This means that the image quality of an individual video frame does not need to be as high as that of a corresponding digital still image.

One engineering decision in video image processing chain design that seems to be made almost universally is to transform the raw data from the sensor directly into video gamma space and then retain it in that nonlinear space for all subsequent image processing operations (see Figure 3.19). This has multiple benefits. First, the data resolution is fixed at 8 bits, permitting the use of a single byte of memory for each pixel value. Second, no time is spent transforming between photometric spaces in order to produce a final sRGB video frame.
Third, quantization errors will tend to occur in the highlights of the image, where they are visually less objectionable, and will tend not to occur in the shadows, where their visibility is more pronounced. The liability of this “all-video gamma” approach is that certain image processing operations, most notably color correction, perform poorly in nonlinear spaces. Color correction is generally predicated on Grassmann’s Laws [45], which describe color mixing as a linear phenomenon. For colors with low saturation (i.e., nearly neutral colors), video gamma computations will behave similarly to linear space computations. However, as saturation increases (i.e., as colors become more “colorful”), computations in video gamma space can produce results that deviate significantly from Grassmann’s Laws. As a result, unexpected and unwanted color shifts may arise from color correction, so performing color correction in video gamma space reduces color fidelity. One way to address this problem is to tune the color correction matrix values to correct the output colors less aggressively, minimizing the visibility of such errors. Consequently, the loss of color accuracy due to reduced color correction has become part of the UE for consumer video photography.

In the idealized video image processing chain shown in Figure 3.19, no stochastic noise reduction operations are performed. Stochastic noise is assumed to be addressed either by temporal averaging or by any binning (summing) of adjacent pixel values performed by the hardware to produce the raw CFA image data of the video frame. (Temporal averaging will generally have no similar effect on structured noise.)
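The difference between stochastic and structured noise under temporal averaging can be illustrated with a small simulation. It assumes, as a simplification, that the viewer's blending of consecutive frames can be modeled as a pixel-wise mean; the flat test patch, the noise levels, and the function names are hypothetical. Stochastic noise is drawn fresh for every frame, whereas the structured (fixed-pattern) component repeats identically in every frame.

```python
import random
import statistics


def capture_frames(scene, fixed_pattern, temporal_sigma, n_frames, seed=0):
    """Simulate n_frames of a static scene. Every frame carries the same
    structured (fixed-pattern) noise plus fresh, independent stochastic noise."""
    rng = random.Random(seed)
    return [
        [s + f + rng.gauss(0.0, temporal_sigma) for s, f in zip(scene, fixed_pattern)]
        for _ in range(n_frames)
    ]


def temporal_average(frames):
    """Pixel-wise mean over consecutive frames (a crude stand-in for perceptual blending)."""
    n = len(frames)
    return [sum(values) / n for values in zip(*frames)]


def error_std(frame, reference):
    """Population standard deviation of (frame - reference)."""
    return statistics.pstdev([v - r for v, r in zip(frame, reference)])


if __name__ == "__main__":
    n_pixels, n_frames = 20_000, 16
    scene = [100.0] * n_pixels  # flat gray patch (hypothetical)
    fp_rng = random.Random(1)
    fixed_pattern = [fp_rng.gauss(0.0, 2.0) for _ in range(n_pixels)]  # structured noise
    frames = capture_frames(scene, fixed_pattern, temporal_sigma=4.0, n_frames=n_frames)
    averaged = temporal_average(frames)
    scene_plus_pattern = [s + f for s, f in zip(scene, fixed_pattern)]
    print("stochastic noise, one frame:  ", round(error_std(frames[0], scene_plus_pattern), 2))
    print("stochastic noise, 16 averaged:", round(error_std(averaged, scene_plus_pattern), 2))
    print("total noise, 16 averaged:     ", round(error_std(averaged, scene), 2))
```

With these assumed values, the stochastic component falls from about 4 to about 1 (a factor of √16), while the total error of the averaged frame stays near √(1² + 2²) ≈ 2.2, because the fixed-pattern component passes through the averaging unchanged.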
After CFA interpolation, RGB data is transformed into YCC space, although these two operations could be consolidated into a single operation that CFA-interpolates the raw data directly into YCC data if computational economies result. After color correction and edge enhancement, the video frame is complete and can be written to the output video stream, leaving the system ready to process the next frame. Exposure and white balance calculations are performed in a feedback loop between consecutive frames in the video sequence in order to prevent abrupt and undesirable changes in perceived image brightness or color.

The above describes video processing issues from the processing chain design point of view. Typical camera video processing tasks, such as video demosaicking, resolution enhancement, and video stabilization, are discussed in Chapters 18 to 20.

3.5 Conclusion

This chapter focused on the image processing chain that transforms digital camera raw sensor image data into a fully processed, full-color image. The possible orderings of individual operations and the associated implementation details that constitute the image processing chain were discussed. Despite the seemingly immense number of available degrees of freedom, the problem of image processing chain design was seen to be overconstrained. The image processing task was to balance the opposing requirements of desirable image quality and modest compute resource use. It was shown that image processing operations that are highly effective might not be viable candidates for image processing chains in constrained compute environments. In the end, the process of designing an image processing chain became one of taking relatively simple, well-known image processing operations and staging them in a manner that produces the best synergistic effects.

References

[1] F. Sudo and T. Asaida, “Image defect correcting circuit for a solid state imager,” U.S.
Patent 5 144 446, September 1992. [2] J. Heller and J. Breisch, “Electronic camera capable of detecting defective pixel,” U.S. Patent 6 683 643, January 2004. [3] J. Takayama and N. Takizawa, “Scaling algorithm for efficient color representation / recovery in video,” U.S. Patent 6 236 433, May 2001. [4] J. Hamilton, “Correcting for defects in a digital image taken by an image sensor caused by pre-existing defects in two pixels in adjacent columns of an image sensor,” U.S. Patent 6 741 754, May 2004. [5] J. Hamilton, “Correcting defects in a digital image caused by a pre-existing defect in a pixel of an image sensor,” U.S. Patent 6 900 836, May 2005. [6] J.S. Lee, “Digital image smoothing and the sigma filter,” Computer Vision, Graphics and Image Processing, vol. 24, no. 2, pp. 255–269, November 1983. [7] M. Gaboury, “Illuminant discriminator with improved boundary conditions,” U.S. Patent 5 037 198, August 1991. [8] Y. Takagi and T. Imaide, “White balance adjusting system including a color temperature variation detector for a color image pickup apparatus,” U.S. Patent 5 170 247, December 1992. [9] T. Miyano and E. Shimizu, “Automatic white balance adjusting device,” U.S. Patent 5 644 358, July 1997. [10] J. Adams, J. Hamilton, E. Gindele, and B. Pillman, “Method for automatic white balance of digital images,” U.S. Patent 6 573 932, June 2003. [11] R. Hunt, The Reproduction of Colour. Tolworth, UK: Fountain Press, 1987. [12] B.E. Bayer, “Color imaging array,” U.S. Patent 3 971 065, July 1976. [13] M. Noda and T. Imaide, “Solid-state imaging device with two-row mixing gates,” U.S. Patent 4 768 084, August 1988. [14] Y. Takizawa, “Solid-state color imaging apparatus for preventing color alias,” U.S. Patent 4 794 448, December 1988. [15] D.R. Cok, “Signal processing method and apparatus for producing interpolated chrominance values in a sampled color image signal,” U.S. Patent 4 642 678, February 1987. [16] J. Adams and J. 
Hamilton, “Adaptive color plane interpolation in single sensor color electronic camera,” U.S. Patent 5 506 619, April 1996. [17] J. Hamilton and J. Adams, “Adaptive color plane interpolation in single sensor color electronic camera,” U.S. Patent 5 629 734, May 1997. [18] P.S. Tsai, T. Acharya, and A. Ray, “Adaptive fuzzy color interpolation,” Journal of Electronic Imaging, vol. 11, no. 3, pp. 293–305, July 2002. [19] K. Hirakawa and T.W. Parks, “Adaptive homogeneity-directed demosaicing algorithm,” in Proceedings of the IEEE International Conference on Image Processing, Barcelona, Spain, September 2003, vol. III, pp. 669–672. [20] M. Stokes, M. Anderson, S. Chandrasekar, and R. Motta, “A standard default color space for the Internet - sRGB,” www.w3.org/Graphics/Color/sRGB.html, November 1996. [21] R. Goodwin and A. Gallagher, “Method and apparatus for area selective exposure adjustment,” U.S. Patent 5 818 975, October 1998. [22] A. Gallagher and E. Gindele, “Method for adjusting the tone scale of a digital image,” U.S. Patent 6 275 605, August 2001. [23] J. Adams, J. Hamilton, and J.A. Hamilton, “Removing color aliasing artifacts from color digital images,” U.S. Patent 6 804 392, October 2004. [24] R. Hibbard, K. Parulski, and L. D’Luna, “Detail processing method and apparatus providing uniform processing of horizontal and vertical detail components,” U.S. Patent 4 962 419, October 1990. [25] K. Parulski, D. Bellis, R. Hibbard, E. Giorgianni, and E. McInerney, “Method and apparatus for improving the color rendition of hardcopy images from electronic cameras,” U.S. Patent 5 189 511, February 1993. [26] J. Adams, J. Hamilton, and F. Williams, “Noise reduction in color digital images using pyramid decomposition,” U.S. Patent 7 257 271, August 2007. [27] D. Patterson and J. Hennessy, Computer Organization and Design: The Hardware/Software Interface. San Francisco, CA: Morgan Kaufmann Publishers, 1998.
[28] W. Pratt, Digital Image Processing: PIKS Scientific Inside. Hoboken, NJ: John Wiley & Sons, Inc., 2007. [29] C. Smith and J. Adams, “CFA correction for CFA images captured at partial resolution,” U.S. Patent 6 366 318, April 2002. [30] S. Yoshikawa, “Resizing images captured by an electronic still camera,” U.S. Patent 7 092 020, August 2006. [31] W. Pennebaker and J. Mitchell, JPEG Still Image Data Compression Standard. New York: Van Nostrand Reinhold, 1993. [32] T. Welch, “High speed data compression and decompression apparatus and method,” U.S. Patent 4 558 302, December 1985. [33] G. Roelofs, “Portable network graphics,” www.libpng.org/pub/png/, 2007. [34] D. Couwenhoven, B. Gandhi, and C. Smith, “Data compression rate control method and apparatus,” U.S. Patent 5 596 602, January 1997. [35] J. Adams, K. Spaulding, and K. Parulski, “Method and system for the reduction of memory capacity required for digital representation of an image,” U.S. Patent 5 708 729, January 1998. [36] E. Weisstein, “Lambert W-function,” www.mathworld.wolfram.com/LambertWFunction.html, 2005. [37] A. Gallagher and D. Nichols, “Method and apparatus to extend the effective dynamic range of an image sensing device,” U.S. Patent 6 909 461, June 2005. [38] J. Adams and J. Hamilton, “Extended dynamic range image sensor capture using an array of fast and slow pixels,” U.S. Patent Application 2005/0140804, June 2005. [39] J. Compton and J. Hamilton, “Image sensor with improved light sensitivity,” U.S. Patent Application 2007/0024931, February 2007. [40] J. Hamilton and J. Compton, “Capturing images under varying lighting conditions,” U.S. Patent Application 2007/0046807, March 2007. [41] T. Kijima, H. Nakamura, J. Compton, and J. Hamilton, “Image sensor with improved light sensitivity,” U.S. Patent Application 2007/0177236, August 2007. [42] J. Hamilton and J. Compton, “Processing color and panchromatic pixels,” U.S. Patent Application 2007/0024879, February 2007. [43] J. Adams, J. 
Hamilton, and M. O’Brien, “Interpolation of panchromatic and color pixels,” U.S. Patent Application 2007/0024934, February 2007. [44] A. Giorgi, “Effect of wavelength on the relationship between critical flicker frequency and intensity in foveal vision,” Journal of the Optical Society of America, vol. 53, no. 4, pp. 480–486, April 1963. [45] R. Hunt, Measuring Colour. London, UK: Ellis Horwood, 1995.

4 Optical Antialiasing Filters

Russ Palum

4.1 Introduction
4.1.1 Aliasing
4.1.2 Digital Photographic Systems
4.2 Nyquist Domain Graph
4.3 The Four-Spot Birefringent Antialiasing Filter
4.4 Modulation Transfer Function
4.5 Lens Modulation Transfer Function
4.6 Fourier Analysis and Convolution
4.7 Fourier Transform Pairs
4.8 Image System Response
4.8.1 System Modulation Transfer Function
4.9 Reconstruction
4.10 Construction
4.11 Testing
4.12 Conclusions
Acknowledgments
References

4.1 Introduction

Aliasing is an artifact of digital images; it occurs when an image contains detail smaller than the pixel pitch. This chapter starts with an intuitive look at how aliasing can occur in sampled images. Then, the history of antialiasing filters is outlined in Section 4.1.2. Nyquist domain graphs, which provide a way to analyze and visualize antialiasing filter requirements, are discussed in Section 4.2, and the four-spot birefringent antialiasing filter is described in Section 4.3. Section 4.4 discusses the modulation transfer function, a measure of system resolution, because the occurrence of aliasing depends on system resolution. Lens modulation transfer function is the focus of Section 4.5. Sections 4.6 and 4.7 present a brief description of convolution and Fourier analysis. Many of the details of sampled imaging systems become apparent when analyzed in the frequency domain; therefore, Section 4.8 steps through the sampling process in the spatial and frequency domains. Subsequently, reconstruction issues are discussed in Section 4.9.
Finally, antialiasing filter construction and testing are the central topics of Sections 4.10 and 4.11, and conclusions are presented in Section 4.12.

FIGURE 4.1 Aliasing effects: (a) image with aliasing, and (b) antialiased image.

4.1.1 Aliasing

A digital camera is an example of an imaging system that samples with a regular array of points. Aliasing is an artifact of sampled imaging systems. Figure 4.1a shows an example of aliasing in a digital image. The swirling lines are the result of narrow stripes that have been imaged so that they look like a few wide lines; this is an aliasing artifact. The image in Figure 4.1b has been properly prepared prior to sampling; the narrow stripes are not resolved, but they do not produce swirling wide-line artifacts. Low resolution is preferable to low-frequency artifacts that do not represent the original object. Once an image is sampled, the aliased low-frequency content is difficult to correct automatically because it is indistinguishable from actual low-frequency content. Software is available to correct aliasing artifacts, but intervention is usually required to locate the artifact. Silver halide photographic systems sample on a random array of points, so they do not produce aliasing artifacts. Interestingly (and surprisingly), for local areas, the human retina samples on a regular array of points, a close-packed hexagonal grid. The eye lens limits the image spatial frequency content to prevent aliasing [1].

Figure 4.2 is the beginning of an intuitive look at the cause of aliasing. A 1 mm long section of an imager is shown as a two-dimensional (2D) grid. The spatial image at the bottom varies sinusoidally in the horizontal direction. The top bars represent the sampled image, displayed so that each pixel occupies one square. This is a simple way to display a basic image from a sampled image.
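The sampling examples of Figures 4.2 through 4.4 can be reproduced numerically. The sketch assumes a 1 mm, 40-pixel imager (40 samples/mm, the rate used in those examples), a 100% fill factor so that each pixel reports the average of the image over its own width, and a sinusoidal image with values between 0 and 1; the function names are illustrative.

```python
import math


def sample_sinusoid(freq, n_pixels=40, pitch=1.0 / 40.0, phase=0.0):
    """Pixel values for the image 0.5 + 0.5*cos(2*pi*freq*x + phase), where pixel k
    is centered at x = k*pitch and averages the image over its full width.
    freq is in cycles/mm and pitch in mm."""
    # Averaging a cosine over a window one pitch wide scales it by sinc(freq*pitch).
    arg = math.pi * freq * pitch
    aperture = math.sin(arg) / arg if arg != 0.0 else 1.0
    return [0.5 + 0.5 * aperture * math.cos(2.0 * math.pi * freq * k * pitch + phase)
            for k in range(n_pixels)]


def modulation(values):
    """(max - min) / (max + min) of a sampled pattern."""
    hi, lo = max(values), min(values)
    return (hi - lo) / (hi + lo)


def alias_frequency(freq, sampling_rate):
    """Frequency in [0, sampling_rate / 2] whose samples match those of `freq`."""
    return abs(freq - sampling_rate * round(freq / sampling_rate))


if __name__ == "__main__":
    in_phase = sample_sinusoid(20.0)                      # peaks on pixel centers
    shifted = sample_sinusoid(20.0, phase=math.pi / 2.0)  # shifted by half a pixel
    above = sample_sinusoid(21.0)                         # above the Nyquist frequency
    print("20 cycles/mm, in phase:   modulation", round(modulation(in_phase), 3))
    print("20 cycles/mm, half-pixel: modulation", round(modulation(shifted), 3))
    print("21 cycles/mm aliases to", alias_frequency(21.0, 40.0), "cycles/mm")
    print("21 cycles/mm, pixels 0, 10, 20:", [round(above[k], 3) for k in (0, 10, 20)])
```

Under these assumptions, a 20 cycles/mm image aligned with the pixels keeps a modulation of 2/π ≈ 0.64 (the loss comes from averaging over each pixel), the same image shifted by half a pixel has no modulation at all, and a 21 cycles/mm image produces a sampled cosine indistinguishable from one at 40 − 21 = 19 cycles/mm, with its contrast collapsing near the tenth and thirtieth pixels.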
The example imager samples at a rate of 40 samples/mm; this sampling rate is called the Nyquist rate. The spatial frequency imaged on the array is one-half the Nyquist rate, 20 cycles/mm; this is called the Nyquist frequency. According to the Nyquist criterion [2], a spatial frequency greater than half the sampling rate cannot be reproduced. The Nyquist criterion is further extended by the Whittaker-Shannon theorem [3]: if a signal is band-limited to within the Nyquist frequency of the sensor, the signal can be recovered, without error, from the sampled signal. Worded differently, any band-limited function can be specified exactly by its sampled values taken at regular intervals, provided that these intervals do not exceed some critical sampling interval.

FIGURE 4.2 Nyquist frequency.

FIGURE 4.3 Shifted sine wave.

Figure 4.3 illustrates a pathological case at the edge of the Whittaker-Shannon theorem. A 20 cycles/mm image is shifted by one-half pixel. Each pixel is centered on the average value of the sine wave, and each pixel sees half of the bright portion of the sine wave and half of the dark portion. There is no pixel-to-pixel modulation because all the pixels see the same amount of light. This sampled image cannot be recovered.

FIGURE 4.4 21 cycles per millimeter sine wave.

FIGURE 4.5 Aliasing in temporal signals.

The Whittaker-Shannon theorem is violated in Figure 4.4; a 21 cycles/mm image is falling on a 40 samples/mm imager. This spatial frequency exceeds the Nyquist frequency. Notice that the sine wave image starts in phase with the pixels, so it is reproduced. It goes out of phase around the tenth pixel, so these pixels become a wide band. The image goes back in phase at the twentieth pixel, out of phase again at the thirtieth, and in phase again at the fortieth pixel.
This leads to an image with wide bands separated by narrow bands. The sampled image cannot reproduce the original image because 40 pixels can reproduce at most 20 cycles of light-to-dark transitions (one light and one dark pixel per cycle), whereas 21 cycles would require at least 42 pixels. This is an aliasing artifact. Figure 4.5 is a common representation of aliasing. The signal is usually temporal, and it is sampled with an analog-to-digital converter or some other sampling device. The dark squares are evenly spaced samples on the high-frequency sine wave. The sampling rate in this case is well below twice the signal frequency, and the result is a very low-frequency aliased representation of the actual sine wave.

FIGURE 4.6 Digital imaging system. (Block labels: object, lens, antialiasing filter, optical image, imager, image processing, display.)

4.1.2 Digital Photographic Systems

A typical digital photographic system contains a lens, an image capture device, an image display device, and some amount of image processing capability. The basic diagram in Figure 4.6 can be used for a film scanning system, microscopy, astronomy, or any other digital imaging system that starts with an optical image. Aliasing is prevented by band-limiting the optical image spatial frequencies to the Nyquist frequency and below. Fourier analysis is used to determine whether the frequency content of the image is band-limited to the Nyquist frequency. Most digital cameras use an antialiasing filter to band-limit the optical image spatial frequencies. Limiting the image spatial frequencies is equivalent to blurring the image, so these filters are sometimes called blurring filters, which really scares the marketing staff. According to Greivenkamp (Reference [2], p. 676), the first antialiasing filter appears to have been invented by Pritchard for color stripe single-tube video cameras. Prior to single-tube color cameras, aliasing was not a substantial problem.
Black-and-white cameras and three-tube color cameras have an analog horizontal signal. The vertical sampling has 100% fill, so it does not alias substantially, and there are no complications from color errors. Color stripe video cameras sample color horizontally, and early cameras used analog electronic algorithms to produce a full-resolution color image. Aliasing in a single-tube color stripe camera can produce color artifacts that do not even match the color name of the original object.

4.2 Nyquist Domain Graph

The previous examples have been one-dimensional (1D). The Nyquist frequency of 2D imaging systems depends on direction. A Nyquist domain graph is used to display the locus of points at the Nyquist frequency in two dimensions [4]. Spatial frequency on the graph corresponds to the inverse of the spacing in the spatial domain, and radial distance from the center of the graph corresponds to spatial frequency.

FIGURE 4.7 Bars at the Nyquist frequency: (a) imager, and (b) green channel Bayer pattern.

FIGURE 4.8 Nyquist domain graph with imager sampling frequency normalized to 1/p (both axes span −1 to 1). The solid line corresponds to the monochrome imager, the dashed line to the green channel, and the dotted line to the red and blue channels.

Figure 4.7a is a small section of a monochrome imager. The vertical and horizontal bars are at the vertical and horizontal Nyquist frequency. The period of the bars is twice the pixel pitch (p), so the frequency of the bars is 1/(2p). Notice that the spacing of the diagonal bars is smaller by a factor of 1/√2. The corresponding Nyquist domain graph is shown in Figure 4.8 by the solid line, with the horizontal and vertical coordinates normalized to the sampling frequency, 1/p.
Normalizing to the sampling frequency generalizes the discussion. The diagonal corners of the monochrome Nyquist domain are at (1/(2p), 1/(2p)), a radial distance of √2/(2p) ≈ 0.707/p from the origin. A monochrome imager therefore has a higher Nyquist frequency diagonally than it has on the vertical and horizontal axes. Single-imager color systems with a color filter array (CFA) pattern add further complexity to the Nyquist domain graph. A section of an imager with a Bayer CFA pattern is shown in Figure 4.9; the pattern has three color channels that are analyzed separately.

FIGURE 4.9 Bayer color filter array (rows alternate G R G R ... and B G B G ...; pixel pitch p, red and blue pitch 2p).

The green channel is a square pattern with twice as many elements as the red and blue channels. However, the square pattern is rotated 45 degrees so that it looks like a checkerboard. Because of this rotation, the relationship between the diagonal and the vertical/horizontal Nyquist frequency is reversed compared with the monochrome imager. The green pattern is shown with bars at the Nyquist frequency in Figure 4.7b. The pixels are not contiguous; this is called a sparse array. The Nyquist rate is still the inverse of the period for a sparse array. The horizontal and vertical Nyquist frequency, normalized to the monochrome pixel pitch, is a half cycle per sample. The period of the diagonal bars at the Nyquist frequency is larger, so the Nyquist frequency is lower on the diagonal. The green channel Nyquist domain graph is the inner diamond shown in Figure 4.8 by the dashed line. As shown in Figure 4.9, the red and blue channels have the same pitch, 2p, where p is the monochrome pixel pitch. Both patterns are square sparse arrays with a spatial offset between the two channels. The offset does not affect the analysis.
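The boundaries plotted in Figure 4.8 can be computed directly. For a regular sampling lattice, the Nyquist region is the set of spatial frequencies closer to the origin than to any other point of the reciprocal lattice; the sketch below hard-codes the resulting shapes (a square for the monochrome lattice and for the red/blue lattice of pitch 2p, a diamond for the green checkerboard), with frequencies normalized to 1/p as in the figure. The layout names are hypothetical labels.

```python
import math


def nyquist_radius(theta, layout):
    """Distance from the origin to the Nyquist boundary in direction theta
    (radians from the horizontal axis), in cycles per monochrome pixel pitch p."""
    c, s = abs(math.cos(theta)), abs(math.sin(theta))
    if layout == "monochrome":  # square lattice, pitch p: |fx|, |fy| <= 1/2
        return 0.5 / max(c, s)
    if layout == "green":       # checkerboard (square lattice rotated 45 deg): |fx| + |fy| <= 1/2
        return 0.5 / (c + s)
    if layout == "red_blue":    # square sparse lattice, pitch 2p: |fx|, |fy| <= 1/4
        return 0.25 / max(c, s)
    raise ValueError(f"unknown layout: {layout!r}")


if __name__ == "__main__":
    for layout in ("monochrome", "green", "red_blue"):
        horizontal = nyquist_radius(0.0, layout)
        diagonal = nyquist_radius(math.pi / 4.0, layout)
        print(f"{layout:10s} horizontal {horizontal:.3f}  diagonal {diagonal:.3f}")
```

In this normalization the monochrome boundary reaches about 0.707 on the diagonal, while the green boundary is 0.5 horizontally but only about 0.354 diagonally. The red/blue square, half the size of the monochrome one, reaches 0.25 horizontally and the same 0.354 on the diagonal as the green diamond.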
The red and blue channel Nyquist domain graphs are the same, and they are similar to the monochrome Nyquist domain graph except that the frequencies are half as high because the red and blue pitch is twice as large. The Nyquist domain graph for the red and blue channels is shown in Figure 4.8 by the dotted line.

4.3 The Four-Spot Birefringent Antialiasing Filter

The four-spot birefringent antialiasing filter is the most common antialiasing filter. The construction of this type of filter is discussed in Section 4.10; however, a brief discussion of birefringence will be useful here. Some optical crystalline materials are birefringent; calcite is a naturally occurring example of a strongly birefringent material. These materials can be cut so that a ray of light entering the crystal is split into two rays that take different paths through the crystal and emerge with a separation between them. An example is shown in Figure 4.10a and Figure 4.10b.

FIGURE 4.10 Birefringent antialiasing filter: (a) birefringence in the crystal slice (optical axis, E ray, and O ray marked), (b) the slice with the outline of a plate that will be included in an antialiasing filter, (c) four-spot blur filter.

The amount of separation is determined by the material and the thickness of the plate. When a plate of birefringent material is placed behind a lens, it will split the image into two images with a separation between them. For a small separation, the image appears to be slightly blurred. Multiple plates of a birefringent material can be combined to produce four or more images that blur the image in many directions. The four-spot birefringent filter limits the scene spatial frequency content by spreading the light from every point in the image over four points, as shown in Figure 4.10c. The thickness of the filter determines the spacing between the four spots.
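How the spot separation sets the blur can be sketched numerically. The sketch assumes ideal, equal-intensity point spots and, for the four-spot case, a square spot arrangement (a horizontal split followed by a vertical split of the same size); it uses the 40 samples/mm imager of Section 4.1.1 (p = 0.025 mm) as a worked example, and the function names are illustrative. Splitting each point into two spots a distance d apart multiplies a sinusoid of frequency f by cos(πfd), which is zero when d equals the pixel pitch and f is the Nyquist frequency 1/(2p).

```python
import math


def two_spot_mtf(freq, separation):
    """MTF of a blur that splits each image point into two equal spots `separation`
    apart: 0.5*[cos(2*pi*f*(x - d/2)) + cos(2*pi*f*(x + d/2))] = cos(pi*f*d)*cos(2*pi*f*x)."""
    return abs(math.cos(math.pi * freq * separation))


def four_spot_mtf(fx, fy, separation):
    """Assumes the four spots form a square, so the two-spot responses multiply."""
    return two_spot_mtf(fx, separation) * two_spot_mtf(fy, separation)


def blurred_modulation(freq, separation, n_points=1000, length=1.0):
    """Numerical check: blur a 0-to-1 sinusoid with two spots, then apply
    (max - min) / (max + min) to the result over `length` mm."""
    values = []
    for i in range(n_points):
        x = length * i / n_points
        left = math.cos(2.0 * math.pi * freq * (x - separation / 2.0))
        right = math.cos(2.0 * math.pi * freq * (x + separation / 2.0))
        values.append(0.5 + 0.25 * (left + right))
    hi, lo = max(values), min(values)
    return (hi - lo) / (hi + lo)


if __name__ == "__main__":
    pitch = 0.025  # mm, the 40 samples/mm imager of Section 4.1.1
    for f in (5.0, 10.0, 15.0, 20.0):
        print(f"{f:4.0f} cycles/mm: analytic {two_spot_mtf(f, pitch):.3f}, "
              f"numeric {blurred_modulation(f, pitch):.3f}")
```

With d = p = 0.025 mm, the response falls from about 0.92 at 5 cycles/mm to about 0.71 at 10 cycles/mm and to zero at 20 cycles/mm, the Nyquist frequency of that imager; the attenuation below Nyquist is the sharpness paid for suppressing aliasing.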
The separation is chosen to limit the image spatial frequency content to the Nyquist frequency and below. The separation for a monochrome imager is the pixel pitch. The details of the modulation calculation are presented in Section 4.8. The modulation at the Nyquist frequency is reduced as shown in Figure 4.11a; light from a single point is focused on two pixels. This is true for every point on the object. If the object is a sinusoid that produces an image at the Nyquist frequency, the sinusoid will be dark on one pixel and light on the next. Blurring light from each object point over two pixels produces the same light level at every pixel when the image is a sinusoid at the Nyquist frequency, so there is no modulation at the Nyquist frequency. Alternatively, looking through the filter from the pixel (Figure 4.11b), the light that falls on the pixel will appear to come from two places in the object plane. If the object is a sinusoid that will produce an image at the Nyquist frequency, the two spots will appear to come from points 180 degrees apart in phase, so these points will always add to the average value. This is the case for every pixel, so there is no modulation at the image plane for sinusoids at the Nyquist frequency.

FIGURE 4.11 (a) Antialiasing filter effect on the image plane. (b) Antialiasing filter looking back at scene.

FIGURE 4.12 Preventing color interpolation error: (a) highlight falling on one color component, (b) spreading the light over four pixels to prevent the color error.

The same four-spot filter is used for Bayer CFA imagers. It should be clear, based on the Nyquist domain discussion, that this blur filter will not prevent aliasing in a Bayer CFA imager. There are a few reasons why this apparently inadequate filter is adequate.
First, there is resistance to applying enough blur to prevent aliasing, because no one likes to pay for pixels and an expensive lens and then blur the image. Second, many interpolation algorithms take advantage of the correlation between channels, so the effective color channel arrays may not be as sparse as they appear. Some examples of this are presented in Section 4.11. Next, the antialiasing filter prevents color interpolation errors. It is not unusual for a highlight, like a catch light in an eye, to fall on one pixel. The catch light in Figure 4.12a will be rendered bright red by the interpolation algorithm because it only illuminates a red pixel. The antialiasing filter spreads the light over four pixels, as shown in Figure 4.12b, to prevent the color error. Sharp edges may have the same type of problem, depending on the interpolation algorithm. It is possible to make a birefringent antialiasing filter that has the appropriate spacing for each color, but this filter is difficult to manufacture [2].

4.4 Modulation Transfer Function

An understanding of the modulation transfer function (MTF), convolution, and a few Fourier transform pairs is required to understand the analysis of sampled imaging systems. MTF analysis is used to determine the limits of the optical image spatial frequency content at the imager. The MTF of each component can be measured or determined from theory, and the system MTF limits can then be determined by cascading the component MTFs.

The MTF is a measure of system and component spatial resolution performance [5]. It is the ratio of the signal output modulation $M_o$ to the sine wave input modulation $M_i$ as a function of spatial frequency $r$. Modulation $M$ is determined as follows:

$$M = \frac{M_{\max} - M_{\min}}{M_{\max} + M_{\min}} \qquad (4.1)$$

where $M_{\max}$ and $M_{\min}$ are the sine wave peak and valley measurements. The input modulation can be measured in luminance or any other linear metric. The output modulation also has to be measured in a linear metric, like transmission or linear code value. The MTF, the ratio of the output modulation to the input modulation, is calculated as follows:

$$\mathrm{MTF}(r) = \frac{M_o(r)}{M_i(r)} \qquad (4.2)$$

FIGURE 4.13 Input and output sine wave. Top row shows: (a) original sine wave with 100% modulation, and (b) image of sine wave with 33% modulation.

The MTF curve is the result of a number of MTF measurements taken at different spatial frequencies. To produce an MTF curve, an input sine wave as shown in Figure 4.13a is imaged to produce the output image shown in Figure 4.13b. The input and output maxima and minima are measured. The modulations and their ratio are calculated using Equations 4.1 and 4.2 to determine the MTF at one spatial frequency. As previously mentioned, the complete MTF curve is the result of repeating this procedure for a range of spatial frequencies.

Figure 4.14 is a sine wave pattern that increases in frequency linearly with distance. This is commonly called a chirp because of its similarity to the sine wave pattern of a bird chirp. This pattern can be used to make an MTF measurement over a range of frequencies with just one image. The middle graph in Figure 4.14 is a trace of the input and output chirp. Notice that the low-frequency part of the pattern is reproduced with good modulation, but the higher frequencies drop off; therefore, the MTF drops, as shown at the bottom of Figure 4.14.

FIGURE 4.14 Chirp image (with traces of the input and output chirp).

MTF is nominally expressed as a number from zero to one, or it can be converted to a percentage. When a digital image is sharpened, the MTF at some frequencies can exceed one. Sharpening can also be applied to silver halide photographic systems. Silver halide film can be formulated to chemically enhance edges, so the MTF may be larger than one at some spatial frequencies.
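Equations 4.1 and 4.2 translate directly into code. The sketch below measures the modulation of an input trace and an output trace at one spatial frequency and takes their ratio; the traces are synthetic stand-ins for the 100% and 33% modulation sine waves of Figure 4.13, and the function names are illustrative. Repeating the calculation at other frequencies, or along a chirp, yields the MTF curve.

```python
import math


def modulation(m_max, m_min):
    """Equation 4.1: M = (Mmax - Mmin) / (Mmax + Mmin). Both values must be in a
    linear metric (luminance, transmission, linear code value), not gamma-encoded."""
    return (m_max - m_min) / (m_max + m_min)


def mtf(input_trace, output_trace):
    """Equation 4.2: MTF(r) = Mo(r) / Mi(r) at the single frequency r of the traces."""
    m_in = modulation(max(input_trace), min(input_trace))
    m_out = modulation(max(output_trace), min(output_trace))
    return m_out / m_in


if __name__ == "__main__":
    # Hypothetical traces at one spatial frequency: 16 samples per cycle, 4 cycles.
    phases = [2.0 * math.pi * k / 16.0 for k in range(64)]
    input_trace = [0.5 + 0.5 * math.cos(p) for p in phases]     # 100% modulation
    output_trace = [0.5 + 0.165 * math.cos(p) for p in phases]  # blurred copy
    print(f"MTF at this frequency: {mtf(input_trace, output_trace):.2f}")  # 0.33
```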
Negative MTF values are also possible; they are the result of a phase reversal in the image. For example, a white peak in the image may be at the location of a dark peak in the object. The MTF measurement assumes the system is linear or can be made linear. A nonlinear system can be linearized by measuring its low-frequency transfer function and using it to convert the output to the corresponding input luminance, which is a linear measure. Silver halide photographic materials can be made linear with this technique even though they are inherently nonlinear.

4.5 Lens Modulation Transfer Function

A lens is usually the first component in a digital imaging system. It can reduce the system spatial frequency content, but lens effects change with f-number and focus, so the lens alone is not an effective control for aliasing. In most systems, the lens controls the spatial frequency content beyond the Nyquist frequency, while a birefringent blur filter is usually used to control the spatial frequency content near the Nyquist frequency.

Lenses are designed by adjusting glass types, surface shapes, and spacing to optimize the resolution in the image plane. This type of analysis is based on geometric optics. It is possible to design a lens that images perfectly according to geometric optics, but lens performance is further limited by diffraction. Lenses that focus all the rays to within the limits set by diffraction are called diffraction-limited lenses. The point spread function of a diffraction-limited lens is an Airy disk. Figure 4.15a is an image of an Airy disk that has been adjusted to make the outer rings visible (Reference [5], p. 160); in an accurate rendition the outer rings are very faint. Eighty-four percent of the power is in the center core, 7% is in the first ring, and 3% is in the next ring.

FIGURE 4.15 (a) Airy disk, (b) Airy disk graph.

The equation for the Airy disk is [6]:

$$ E(q, \lambda, N) = \left[ \frac{2 J_1\!\left(\frac{\pi q}{\lambda N}\right)}{\frac{\pi q}{\lambda N}} \right]^2 \tag{4.3} $$
where E is the relative illumination, q is the radius, λ is the wavelength in the same units as q, N is the lens f/#, and $J_1(\cdot)$ is the Bessel function of the first kind, of order one. The Airy disk graph is shown in Figure 4.15b. For most photographic applications, the wavelength range of interest is 400 to 700 nm, and the center of this range, 550 nm, is chosen to determine the size of the Airy disk. For a selected wavelength, the diameter of the center core from zero crossing to zero crossing is strictly a function of the lens f/#:

$$ D = 2.44\,\lambda N \tag{4.4} $$

As a rule of thumb, the diameter of the bright spot in micrometers is approximately equal to N, the f/#, because the product of the wavelength (approximately 0.5 μm) and the constant 2.44 is close to one.

Figure 4.16a shows a diffraction-limited lens MTF curve (see Reference [5], p. 377). All diffraction-limited lenses have the same MTF curve shape. The zero crossing, called the cutoff frequency, is a function of the lens f/#: it increases with increasing aperture size, which corresponds to decreasing f/#. The cutoff spatial frequency $\nu_c$ is:

$$ \nu_c = \frac{1}{\lambda N} \tag{4.5} $$

FIGURE 4.16 (a) Two-dimensional lens MTF (MTF versus relative spatial frequency, cutoff frequency marked), (b) three-dimensional lens MTF.

An equation for the diffraction-limited lens MTF as a function of spatial frequency is presented in Section 4.8. Lenses with a round aperture have a circularly symmetric diffraction-limited MTF; the MTF is the same for a given spatial frequency, regardless of direction. Figure 4.16a shows a 2D MTF, whereas its three-dimensional version is depicted in Figure 4.16b.

4.6 Fourier Analysis and Convolution

Fourier analysis is used to analyze sampled systems because the MTF of each component or system is the modulus of the Fourier transform of the point spread function for that component or system.
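The statement that the MTF is the modulus of the Fourier transform of the point spread function can be checked numerically. The sketch below assumes a 1D, uniformly sampled PSF and evaluates the Fourier integral as a sum; `mtf_from_psf` is a hypothetical helper. A uniform (box) PSF is used because Section 4.7 gives its transform analytically as a sinc with its first zero at 1/a.

```python
import cmath
import math


def mtf_from_psf(xs, psf, r):
    """Modulus of the Fourier transform of a sampled, uniformly spaced PSF.

    xs  : sample positions (same length unit as 1/r)
    psf : PSF values at those positions
    r   : spatial frequency at which to evaluate the MTF
    The sum approximates the Fourier integral; dividing by the PSF sum
    (its area, up to the common sample spacing) normalizes MTF(0) to 1.
    """
    ft = sum(p * cmath.exp(-2j * math.pi * x * r) for x, p in zip(xs, psf))
    return abs(ft) / sum(psf)


# Hypothetical box PSF of width a = 1, sampled at n midpoints.
a, n = 1.0, 2000
dx = a / n
xs = [-a / 2 + (i + 0.5) * dx for i in range(n)]
psf = [1.0] * n

for r in (0.25, 0.5, 1.0, 1.5):
    numeric = mtf_from_psf(xs, psf, r)
    analytic = abs(math.sin(math.pi * a * r) / (math.pi * a * r))
    print(f"r = {r:4}: numeric {numeric:.5f}, |sinc| {analytic:.5f}")
```

The same function accepts any sampled PSF, for example a measured one, which is why component MTFs can be derived from PSF measurements.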
In addition, Fourier techniques can make sampled system analysis easier when convolution is required. The Fourier transform of the convolution of f(x) and g(x) equals the product of the Fourier transforms of f(x) and g(x). Conversely, the Fourier transform of the product of f(x) and g(x) is the convolution of their Fourier transforms. In many cases, it is easier to determine the result of a convolution using Fourier techniques than to compute the convolution directly.

Gaskill [7] presents an excellent explanation of convolution. A simplified explanation of convolution in one dimension is offered in Figure 4.17, where a rectangle is convolved with a wider rectangle. Convolution, by definition, includes flipping the first function left to right; most sampled imaging system functions are symmetric, so neglecting the flip does not change the result. Starting at Figure 4.17a, the shaded rectangle slides along the axis in infinitesimally small steps. At each position, the two functions are multiplied, and the convolution value is the area of the product. In this case, the convolution is zero until the shaded rectangle touches the second rectangle (Figure 4.17b); it then ramps up, as shown in Figure 4.17d. The convolution is constant while the first rectangle is inside the second rectangle, and it drops back to zero as the first rectangle leaves the second rectangle, as shown in Figure 4.17e. A second case, depicted in Figure 4.18, shows the convolution of two equal rectangles. Because the rectangles coincide at only one position, the result is a triangle rather than a trapezoid.

FIGURE 4.17 Convolution, panels (a) to (e).
FIGURE 4.18 Convolution with equal rectangles.
FIGURE 4.19 Airy disk convolution.
FIGURE 4.20 (a) Rect function of width a. (b) Sinc function, with its first zero at 1/a.

The image formed by a lens is the result of
a 2D convolution between the object and the lens point spread function. The point spread function of a lens is an Airy disk. To simplify the explanation, a magnification of one and a bar target object are used. The Airy disk slides over the bar target in Figure 4.19, and the convolution is calculated at each point, resulting in the blurred bar target.

4.7 Fourier Transform Pairs

A few Fourier transform pairs are needed for sampled imaging system analysis. For a more detailed discussion of Fourier analysis, refer to References [3] and [7]. Two properties of the Fourier transform are also very useful for imaging system analysis. First, the Fourier transform of the convolution of two functions equals the product of the Fourier transforms of the individual functions. Second, apart from a reflection of the coordinate axis, which leaves symmetric functions unchanged, the Fourier transform of the Fourier transform of a function is the original function. The rect function described below is a good example: the Fourier transform of a rect is a sinc, and the Fourier transform of a sinc is a rect.

The rect function is a rectangle in the spatial domain. Figure 4.20a is a 1D rect of width a. Its Fourier transform, shown in Figure 4.20b, is a sinc:

$$ \mathcal{F}\left\{\mathrm{rect}\left(\frac{x}{a}\right)\right\} = |a|\,\mathrm{sinc}(\pi a r) = \frac{\sin(\pi a r)}{\pi r} \tag{4.6} $$

where a is the width of the rect along the x axis in the spatial domain, r is the frequency variable, $\mathcal{F}$ indicates that a Fourier transform is taken, and $\mathrm{sinc}(u) = \sin(u)/u$. In general, the width of a Fourier transform is inversely related to the width of the original function. In Figure 4.20b, the first zero of the sinc in the frequency domain is at 1/a, where a is the width of the rect function in the spatial domain.

FIGURE 4.21 (a) Two-dimensional rect, (b) two-dimensional sinc.

The 2D rect is a rectangular solid (Figure 4.21a).
Its Fourier transform,

$$ \mathcal{F}\left\{\mathrm{rect}\left(\frac{x}{a}, \frac{y}{b}\right)\right\} = |a|\,\mathrm{sinc}(\pi a r)\cdot|b|\,\mathrm{sinc}(\pi b s) = \frac{\sin(\pi a r)}{\pi r}\cdot\frac{\sin(\pi b s)}{\pi s} \tag{4.7} $$

is a 2D sinc (Figure 4.21b), where, in the spatial domain, a is the x-axis width and b is the y-axis width of the rect, and r and s are the corresponding frequency variables. The 2D sinc is the product of two 1D sinc functions; a function that factors this way is called separable.

The delta function (also known as the Dirac delta function or the impulse function) is represented graphically in Figure 4.22a. The arrow is optional; it indicates that the height is not as shown. The function has an area of one and a width of zero: it is zero everywhere except at x = 0. The function notation and definition, including a shift in two dimensions, are:

$$ \delta(x - a, y - b) = 0 \quad \text{for } (x, y) \neq (a, b), \qquad \iint_{-\infty}^{\infty} \delta(x - a, y - b)\,dx\,dy = 1 \tag{4.8} $$

The Fourier transform of the delta function located at x and y coordinates a and b is an exponential:

$$ \mathcal{F}\{\delta(x - a, y - b)\} = e^{-j 2\pi (a r + b s)} \tag{4.9} $$

FIGURE 4.22 (a) Delta function. (b) Comb function. (c) Fourier transform of comb function.
FIGURE 4.23 Bed of nails.
FIGURE 4.24 (a) Cosine with pitch p. (b) Fourier transform of cosine: two delta functions at ±1/p.

The comb is a 1D array of delta functions; a representation is shown in Figure 4.22b. Its Fourier transform is another comb whose spacing is the inverse of the original spacing (Figure 4.22c). The 2D comb function is sometimes called a bed of nails; a representation is shown in Figure 4.23. The Fourier transform of a bed of nails is another bed of nails with spacing equal to the inverse of the original spacing. The Fourier transform of the cosine in Figure 4.24a is a pair of delta functions at ±1/p from the origin, as shown in Figure 4.24b.
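The rectangle convolutions of Figures 4.17 and 4.18, and the first Fourier transform property above, can be reproduced with discrete sequences. This is a minimal sketch using a direct convolution and a naive discrete Fourier transform (DFT) as stand-ins for the continuous operations; the helper names are hypothetical.

```python
import cmath
import math


def convolve(f, g):
    """Direct (full) discrete convolution: out[k] = sum_i f[i] * g[k - i]."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] += fi * gj
    return out


def dft(seq, n):
    """n-point discrete Fourier transform of seq, zero-padded to length n."""
    padded = list(seq) + [0.0] * (n - len(seq))
    return [sum(v * cmath.exp(-2j * math.pi * k * m / n) for m, v in enumerate(padded))
            for k in range(n)]


narrow, wide = [1.0] * 3, [1.0] * 5
print(convolve(narrow, wide))  # trapezoid, as in Figure 4.17
print(convolve(wide, wide))    # triangle, as in Figure 4.18

# Convolution theorem: the transform of the convolution equals the product of
# the transforms. n must be at least len(f) + len(g) - 1 so the circular DFT
# does not wrap the result around.
n = 16
lhs = dft(convolve(narrow, wide), n)
rhs = [p * q for p, q in zip(dft(narrow, n), dft(wide, n))]
print(max(abs(p - q) for p, q in zip(lhs, rhs)))  # rounding-error level
```

The zero padding is the discrete counterpart of the infinite extent assumed in the continuous analysis; without it, the product of DFTs corresponds to a circular rather than a linear convolution.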
4.8 Image System Response

To prevent aliasing in an image, the spatial frequency content has to be limited prior to sampling. This means the optical system has to control the spatial frequency content. The analysis of the optical system is based on Figure 4.25. The top half of Figure 4.25 represents the capture process in the spatial domain; the bottom half represents the frequency domain in one dimension.

FIGURE 4.25 Capture in (top) the spatial domain and (bottom) the frequency domain, panels (a) to (e).

The picture at the top of Figure 4.25a is the original object; the frequency spectrum of the original object is shown at the bottom of the figure. The top portion of Figure 4.25b shows a representation of the lens and its point spread function, an Airy disk. The bottom portion shows the lens MTF, the modulus of the Fourier transform of the lens point spread function. The rectangle shown in Figure 4.25c represents a four-spot antialiasing filter. In one dimension, the MTF of the antialiasing filter is a cosine with its negative values reflected across the horizontal axis (the absolute value of a cosine). Its first zero is at the Nyquist frequency, 1/(2p), where p is the pixel pitch. The square shown in Figure 4.25d is a representation of the pixel active area. Its MTF, shown at the bottom (the modulus of the Fourier transform), is a sinc function. If the pixel active area is one pixel pitch wide, the MTF goes to zero at the sampling frequency. The image shown in Figure 4.25e is the result of convolving the object with the component point spread functions. The graph at the bottom of the figure is the result of cascading (multiplying together) the MTFs of all the components. This procedure describes the optical processing applied before the image is sampled.

Figure 4.26 represents sampling. It is also divided in half: the top represents the spatial domain and the bottom represents the frequency domain. Only one scan line is included in this discussion.
The graph shown in Figure 4.26a is the profile of a line of pixels; its frequency spectrum is shown at the bottom of the figure. The graph shown in Figure 4.26b is the same line of pixels after convolution with the combined point spread functions, as described by the capture process in Figure 4.25; its frequency spectrum is shown at the bottom of Figure 4.26b. The function at the top of Figure 4.26c is a comb, and at the bottom is the corresponding Fourier transform, another comb with the inverse pitch. Figure 4.26d shows how the line is sampled: the profile at the top of the figure is multiplied by the comb from Figure 4.26c to produce the sampled image at the top of Figure 4.26e. In the frequency domain, the spectrum in Figure 4.26b is convolved with the comb to produce the repeating spectrum shown at the bottom of Figure 4.26e.

FIGURE 4.26 Sampling in (top) the spatial domain and (bottom) the frequency domain, panels (a) to (e).

If the image contains spatial frequencies above the Nyquist frequency, the replicated spectra overlap the baseband, and the high spatial frequency content is aliased to low frequencies. The dark vertical bars in Figure 4.27a are at the Nyquist frequency. The Nyquist frequency is also called the fold frequency, because the overlap from the replicated spectrum is identical to the baseband spectrum mirrored across the Nyquist frequency (Figure 4.27b). The grey section of the curve in Figure 4.27a is the replicated spectrum that actually produces the aliased content.

FIGURE 4.27 (a) Replicated spectrum, (b) mirrored spectrum, with the Nyquist frequency marked.

Converting the sampled image back into an analog representation is called reconstruction or desampling.
This is the last, and usually overlooked, step in the display of a sampled image. Without reconstruction, an artifact that looks similar to aliasing results. Reconstruction is treated in more detail in Section 4.9.

4.8.1 System Modulation Transfer Function

Most of the analysis of aliasing involves capture: scene frequency content above the Nyquist frequency has to be suppressed prior to sampling in order to prevent aliasing. Figure 4.25 shows that the lens MTF, the antialiasing filter MTF, and the pixel MTF have to be cascaded to analyze the MTF prior to sampling. According to

$$ \mathrm{MTF}_{\mathrm{capture}} = \mathrm{MTF}_{\mathrm{lens}} \cdot \mathrm{MTF}_{\mathrm{AAfilter}} \cdot \mathrm{MTF}_{\mathrm{pixel}} \tag{4.10} $$

the lens MTF, antialiasing filter MTF, and pixel MTF are multiplied together at each spatial frequency to produce the capture MTF.

The lens MTF can be evaluated using

$$ \mathrm{MTF}_{\mathrm{lens}}(\phi) = \frac{2}{\pi}\left(\phi - \cos\phi\,\sin\phi\right) \tag{4.11} $$

$$ \phi(r, s) \approx \arccos\left(\lambda N \sqrt{r^2 + s^2}\right) \tag{4.12} $$

where r is the spatial frequency on the horizontal axis, s is the spatial frequency on the vertical axis, N denotes the lens f/#, λ is the wavelength expressed in the length unit used for r and s (for example, millimeters when r and s are in cycles per millimeter), and φ is an intermediate variable. Equation 4.12 applies up to the cutoff frequency of Equation 4.5; beyond it the MTF is zero. This is the MTF of a perfectly designed and manufactured lens. The measured lens MTF can be substituted for it if available, or additional terms can be added to this expression to account for field angle and defocus (Reference [5], p. 378).

Next is the MTF of a standard four-spot antialiasing filter. The first zero of the filter MTF is at the Nyquist frequency if the spot pitch is equal to the pixel pitch. The MTF of the four-spot filter is a 2D cosine that goes to zero at 1/(2 × spot pitch). This is not difficult to evaluate, but a more general solution for any number of spots uses the properties of the delta function. Complex arithmetic is required, but spreadsheet programs and math programs that handle complex variables are common and make a more general solution fairly easy.
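The general solution just mentioned needs nothing beyond built-in complex arithmetic. The sketch below, a hypothetical helper rather than the chapter's own tool, averages one complex exponential per spot (the approach formalized in Equations 4.13 and 4.14 below) and takes the modulus. It assumes equal-intensity spots, positions in millimeters, and frequencies in cycles per millimeter; the 2 μm spot pitch is illustrative.

```python
import cmath
import math


def spot_filter_mtf(spots, r, s=0.0):
    """MTF of a blur filter made of equal-intensity spots (Eqs. 4.9, 4.13, 4.14).

    spots : list of (a_k, b_k) spot positions in mm
    r, s  : horizontal and vertical spatial frequencies in cycles/mm
    Each spot is a shifted delta function whose transform is
    exp(-j 2 pi (a_k r + b_k s)); the filter transform is the average of
    these exponentials and the MTF is its modulus.
    """
    total = sum(cmath.exp(-2j * math.pi * (a * r + b * s)) for a, b in spots)
    return abs(total / len(spots))


# Four-spot square pattern centered on the origin (no phase terms), with a
# spot pitch d equal to a hypothetical 2 um (0.002 mm) pixel pitch.
d = 0.002
four_spot = [(-d / 2, -d / 2), (d / 2, -d / 2), (-d / 2, d / 2), (d / 2, d / 2)]
nyquist = 1 / (2 * d)  # 250 cycles/mm

for r in (0.0, 125.0, nyquist, 500.0):
    print(f"{r:6.1f} cycles/mm: MTF = {spot_filter_mtf(four_spot, r):.4f}")
```

Along one axis this reduces to |cos(π d r)|: it reaches zero at the Nyquist frequency but returns to one at the sampling frequency, which is why the lens and pixel MTFs are still needed beyond Nyquist. Eight-spot or seven-spot patterns only require a different `spots` list.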
As shown in Equation 4.9, the Fourier transform of a delta function is an exponential. Thus, the Fourier transform of the four-spot filter is the sum (average) of four exponentials:

$$ \mathcal{F}\{\mathrm{fourspot}(x, y)\} = \frac{1}{4}\left[ e^{-j2\pi(a_0 r + b_0 s)} + e^{-j2\pi(a_1 r + b_1 s)} + e^{-j2\pi(a_2 r + b_2 s)} + e^{-j2\pi(a_3 r + b_3 s)} \right] \tag{4.13} $$

where $(a_k, b_k)$ is the position of spot k. The MTF is the modulus of the Fourier transform:

$$ \mathrm{MTF}_{\mathrm{AAfilter}} = \sqrt{\mathrm{Re}^2 + \mathrm{Im}^2} \tag{4.14} $$

where Re and Im are the real and imaginary parts of Equation 4.13. The MTF of a filter with any number of spots uses the same technique, except that the exponentials corresponding to each of the spots are summed (averaged). If possible, the spots should be placed symmetrically around the origin to avoid phase terms.

The pixel aperture also affects the system MTF. The MTF of the pixel aperture is a 2D sinc function (Equation 4.7) with its first zero at a frequency of 1/a on the horizontal axis, where a is the width of the pixel along the x axis. Similarly, the first zero on the vertical axis is at 1/b, where b is the pixel height. With the standard form of the rect function, the area of the rect becomes the peak value of its Fourier transform. To make the MTF peak at 1.0, the point spread function can be scaled by 1/(ab) so that it has an area of one or, equivalently, the sinc function can be scaled by 1/(ab):

$$ \mathrm{MTF}_{\mathrm{pixel}} = \frac{1}{ab}\cdot\frac{\sin(\pi a r)}{\pi r}\cdot\frac{\sin(\pi b s)}{\pi s} \tag{4.15} $$

Figure 4.28 is an example of an MTF cascade. In this case, the MTF is analyzed on only one axis. The actual system spatial frequencies are used because the lens MTF is based on lens parameters, and these cannot be scaled to the sampling frequency unless a particular sampling frequency is chosen.
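The one-axis cascade can be sketched with the parameters of the example that follows: a 2 μm pixel pitch, f/4, and a four-spot filter whose spot separation equals the pixel pitch. The 550 nm wavelength comes from Section 4.5; a pixel active width equal to the pitch (100% fill factor) is an assumption, and the midpoint-rule merit function is one simple way to compute the area beyond Nyquist described in the text.

```python
import math

WAVELENGTH_MM = 550e-6         # 550 nm, center of the 400-700 nm band
F_NUMBER = 4.0
PIXEL_PITCH_MM = 0.002         # 2 um
SPOT_PITCH_MM = PIXEL_PITCH_MM  # four-spot filter, spots one pixel pitch apart
APERTURE_MM = PIXEL_PITCH_MM    # assumption: active width equals pitch
NYQUIST = 1 / (2 * PIXEL_PITCH_MM)       # 250 cycles/mm
CUTOFF = 1 / (WAVELENGTH_MM * F_NUMBER)  # Eq. 4.5, about 455 cycles/mm


def lens_mtf(r):
    """Eqs. 4.11 and 4.12 on one axis (s = 0): diffraction-limited lens."""
    x = abs(r) / CUTOFF
    if x >= 1.0:
        return 0.0  # beyond the cutoff frequency
    phi = math.acos(x)
    return 2 / math.pi * (phi - math.cos(phi) * math.sin(phi))


def filter_mtf(r):
    """Four-spot filter on one axis: Eqs. 4.13 and 4.14 reduce to |cos(pi d r)|."""
    return abs(math.cos(math.pi * SPOT_PITCH_MM * r))


def pixel_mtf(r):
    """Eq. 4.15 on one axis, normalized to 1 at zero frequency."""
    u = math.pi * APERTURE_MM * r
    return 1.0 if u == 0 else math.sin(u) / u


def capture_mtf(r):
    """Eq. 4.10: cascade the component MTFs frequency by frequency."""
    return lens_mtf(r) * filter_mtf(r) * pixel_mtf(r)


def area_beyond_nyquist(f_max=600.0, steps=3500):
    """Merit function: area under |capture MTF| from Nyquist to f_max (midpoint rule)."""
    h = (f_max - NYQUIST) / steps
    return h * sum(abs(capture_mtf(NYQUIST + (i + 0.5) * h)) for i in range(steps))


for r in (0, 125, NYQUIST, 375, 500):
    print(f"{r:5.0f} c/mm  lens {lens_mtf(r):.3f}  filter {filter_mtf(r):.3f}  "
          f"pixel {pixel_mtf(r):.3f}  system {capture_mtf(r):.3f}")
print("Airy disk diameter (Eq. 4.4), um:", 2.44 * WAVELENGTH_MM * F_NUMBER * 1000)
print("Area beyond Nyquist:", area_beyond_nyquist())
```

Changing `F_NUMBER`, the spot pitch, or the fill factor and comparing `area_beyond_nyquist()` is one way to apply the merit function; the smaller area indicates better aliasing suppression under these assumptions.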
The imager in the example has a 2 μm pixel pitch, the lens is set at f/4, and the antialiasing filter is a four-spot filter with a spot separation equal to the pixel pitch.

FIGURE 4.28 Cascade system response, MTF versus spatial frequency in cycles/mm (−600 to 600): (a) lens MTF, (b) filter MTF, (c) pixel MTF, (d) system MTF.

The lens MTF shown in Figure 4.28a is calculated using Equation 4.11. The antialiasing filter MTF shown in Figure 4.28b is calculated using Equations 4.13 and 4.14. The pixel MTF shown in Figure 4.28c is calculated using Equation 4.15. Finally, the system MTF shown in Figure 4.28d is the result of multiplying the three component MTFs, spatial frequency by spatial frequency. Notice that the blur filter MTF goes to zero at the Nyquist frequency of the sensor, 250 cycles per mm. The system MTF past the Nyquist frequency is suppressed by the pixel MTF and the lens MTF. The area under the curve beyond the Nyquist frequency can be used as a merit function to optimize system performance: the minimum area provides the best performance. If the analysis is done in two dimensions, the weighted volume beyond Nyquist can be used instead. Weighting is required because the volume is not a linear function of radius. A weighting function can also be used to give more influence to spatial frequencies that are more visible to the eye.

4.9 Reconstruction

In general, a sampled image has to be resampled for display. If the display has enough samples, it can emulate a nonsampled display. The process of converting a sampled image to a continuous image is called reconstruction or desampling [8].
Without reconstruction, an artifact, sometimes called interpolation error, may appear. The artifact is a variation in modulation that occurs at spatial frequencies below the Nyquist frequency.

FIGURE 4.29 Sampled sine wave, ten cycles and twenty-two samples.
FIGURE 4.30 Reconstruction: (a-c) spatial domain, (d-f) frequency domain.

To illustrate the problem, a ten-cycle sine wave is shown in Figure 4.29. Samples are taken at the evenly spaced dark squares, and the output image is displayed by drawing a smooth curve through the samples. Notice that the displayed amplitude is high where the peaks and valleys of the sine wave coincide with sample points, but it goes down where the peaks and valleys fall between sample points. The minimum amplitude occurs when two samples straddle a peak or valley symmetrically. This modulation envelope is caused by the replicated versions of the image spatial frequency content. Multiplying the image frequency spectrum by a rect (Figure 4.30d) as wide as the first-order spectrum eliminates the replicas (Figure 4.30f). The same result can be obtained in the spatial domain by convolving the image with a 2D sinc, as shown in one dimension in Figure 4.30a. This sinc has its first zero at a distance p from its center, where p is the capture sampling pitch; its Fourier transform is the rect of width 1/p, extending from −1/(2p) to 1/(2p). The convolution is a continuous function, so it reconstructs an analog image. The low-modulation sections of the sampled image are boosted in the convolution by contributions from the tails of the sinc function. The reconstructed sine wave shown in Figure 4.31c is an analog reproduction of the original image. In practice, the convolution is only evaluated at the points required for the displayed image. Computational cost is an issue with this technique because the sinc is infinitely wide. This can be handled, with some image quality penalty, by truncating the sinc with a suitable window function (e.g., Hamming or Hann).
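A truncated, windowed sinc reconstruction of the Figure 4.29 signal can be sketched as follows. The Hann window and the half-width of eight samples are illustrative choices, not values from the text, and positions are measured in units of the sampling pitch p, so the sinc's first zero falls at one sample.

```python
import math


def sinc(t):
    """Normalized sinc: sin(pi t) / (pi t), first zero at t = 1 sample pitch."""
    return 1.0 if t == 0 else math.sin(math.pi * t) / (math.pi * t)


def hann(t, half_width):
    """Hann window: 1 at t = 0, tapering to 0 at |t| = half_width samples."""
    if abs(t) >= half_width:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * t / half_width))


def reconstruct(samples, x, half_width=8):
    """Windowed-sinc reconstruction evaluated at position x (in sample pitches).

    Only samples within half_width of x contribute, which is what makes the
    truncated kernel cheaper than the infinitely wide ideal sinc.
    """
    lo = max(0, math.ceil(x - half_width))
    hi = min(len(samples) - 1, math.floor(x + half_width))
    return sum(samples[k] * sinc(x - k) * hann(x - k, half_width)
               for k in range(lo, hi + 1))


# Ten cycles of a sine wave sampled at twenty-two points, as in Figure 4.29.
samples = [math.sin(2 * math.pi * 10 * k / 22) for k in range(22)]

# Evaluate only where a display would need values: here, a grid four times
# finer than the capture samples.
display = [reconstruct(samples, i / 4) for i in range(4 * 21 + 1)]
print([round(v, 3) for v in display[40:48]])
```

At the sample positions, the reconstruction returns the samples themselves, because every other term of the sinc is zero there. For a 2D image, the same 1D kernel can be applied along the rows and then along the columns. Near the ends of this short 22-sample record the kernel is truncated, so only the interior is representative.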
Even with windowing, the operation is computationally expensive. Because the 2D sinc function is separable, a 1D sinc can be applied to the rows and then to the columns, which reduces the amount of arithmetic required to reconstruct a sampled image.

FIGURE 4.31 Reconstruction, panels (a) to (c), with 1/p marked in (a).

Finally, the image has to be rendered for display. After reconstruction, the modulation envelope is suppressed, but resampling for the display will create a new modulation envelope if the image is not low-pass filtered according to the display resolution. The number of display pixels required depends on how much interpolation error is acceptable. It can be determined from Figure 4.32, a graph of modulation due to interpolation error. It is assumed that the period of the sine wave is not an integer multiple of the display pixel pitch, so the phase of the samples precesses from samples at a peak or valley to samples split evenly across a peak or valley. Figure 4.33 shows the origin of this graph: at some point the sine wave will be sampled and displayed at a value of 1 or −1 (peak or valley), and at some other point the samples will be split across the peak or valley. At 1/2 cycle per display pixel, the sine wave envelope can go from ±1, when the samples fall on the peak and valley, to zero, when the samples are split across the peak or valley. At 1/4 cycle per display pixel, the sine only drops to about 70% (cos 45°) when the samples are split across a peak, so the modulation envelope is reduced to about 30%. At 1/6 cycle per display pixel, the split samples read cos 30°, about 87%, so the modulation envelope is reduced to approximately 13%.
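The envelope values above follow from a simple model: when two display samples straddle a peak symmetrically at f cycles per display pixel, each sits πf radians from the peak and reads cos(πf), so the envelope depth is 1 − cos(πf). Under this model the 1/4 cycle case gives about 29% and the 1/6 cycle case about 13%. The sketch below computes that closed form and checks it by brute force over sampling phases; both function names are hypothetical.

```python
import math


def envelope_modulation(cycles_per_pixel):
    """Depth of the interpolation-error envelope for a unit sine, f <= 1/2.

    When two display samples straddle a peak symmetrically, each sits pi*f
    radians from it and reads cos(pi*f); the envelope depth is 1 - cos(pi*f).
    """
    return 1.0 - math.cos(math.pi * cycles_per_pixel)


def worst_case_peak(cycles_per_pixel, phases=1000):
    """Brute-force check: lowest sampled peak value over all sampling phases."""
    step = 2 * math.pi * cycles_per_pixel  # phase advance per display pixel
    worst = 1.0
    for i in range(phases):
        offset = i / phases * step  # phase of the sample just past the peak
        peak = max(math.cos(offset - step * k) for k in range(-2, 3))
        worst = min(worst, peak)
    return worst


for f in (1 / 2, 1 / 4, 1 / 6):
    print(f"{f:.3f} cycle/pixel: envelope {envelope_modulation(f):.3f}, "
          f"brute force {1 - worst_case_peak(f):.3f}")
```

Inverting the closed form gives the lowest display sampling that keeps the envelope below a chosen limit, which is the use of Figure 4.32 described in the text.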
To reduce the modulation envelope to a particular value, the reconstructed image can be low-pass filtered to limit its spatial frequencies to a particular number of cycles per display pixel; alternatively, a higher resolution display can be used if reducing the displayed resolution is not an option.

FIGURE 4.32 Modulation based on interpolation error (envelope modulation, 0 to 1, versus display samples per camera sample, 1 to 10).
FIGURE 4.33 Determining envelope modulation (sine waves sampled at 1/2, 1/4, and 1/6 cycle per display pixel).

4.10 Construction

The birefringent antialiasing filter is the most common antialiasing filter, and the four-spot square pattern is the most common pattern. An example of the four-spot square pattern is shown on the right side of Figure 4.10c. The single spot shown on the left side of Figure 4.10c represents the point spread function produced by a lens from an object point. The conventional antialiasing filter design turns this into four spots separated by the pixel pitch. Creating this pattern requires three birefringent plates cemented together. The usual material is quartz, but lithium niobate and calcite have also been used. Figure 4.34a is an example of the raw birefringent material, cultured crystalline quartz; this particular piece is an optical half section. Less expensive grades used in the electronics industry can be used for small filters.

The index of refraction of this material depends on the polarization and the direction of travel through the crystal. If a spark could be set off inside the crystal, the light would radiate from the spark as shown in Figure 4.35a. Some of the rays, depending on the polarization axis, travel at the same speed in all directions. These rays are called the ordinary or O rays [9], [10]. The speed of the rest of the rays depends upon direction in the crystal; these rays are called the extraordinary or E rays. The E and O rays are orthogonally polarized. Both rays travel through the quartz at the same speed along one axis, called the optical axis, and the maximum speed difference is along the orthogonal axis. If the crystal is cut at 45 degrees to the optical axis (Figure 4.35b), an incident beam will separate into E and O rays. The rays separate by 5.8 micrometers per millimeter of material.

FIGURE 4.34 (a) Cultured quartz bar. (b) Cut crystal (optical axis Z; next slice marked).
FIGURE 4.35 The E and O rays in: (a) birefringent crystal, (b) crystal cut at 45 degrees to the optical axis.

A crystal that has been cut at 45 degrees to the optical axis is shown in Figure 4.34b; the optical axis, Z, is perpendicular to the base. A ray incident on a slice of this crystal is shown in Figure 4.10a, illustrating how the ray splits into an ordinary ray and an extraordinary ray. The two rays are separated but parallel to each other when they leave the crystal. Figure 4.10b represents the slice of Figure 4.10a with the outline of a plate that will be included in an antialiasing filter. The line with two circles represents the projection of the optical axis in the plane of the plate. In this case, the optical axis is at 45 degrees to the edge of the plate, in addition to a tilt of 45 degrees in and out of the page.

FIGURE 4.36 Four-spot filter construction: plates of 1 mm, 0.707 mm, and 0.707 mm, with spot displacements of 5.8 μm, 4.1 μm, and 4.1 μm from the image point.
FIGURE 4.37 (a) Pleat filter. (b) Phase-noise filter.

To build a square pattern filter, three plates are cut with different angles of rotation between the edge of the plate and the projection of the optical axis.
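Because the walk-off grows linearly with thickness, the plate thicknesses for a given pixel pitch, and the square they produce, can be sketched with simple vector sums. This is an idealized model under stated assumptions: every plate has the 5.8 μm per mm walk-off quoted above, spots are points, and the second and third plates walk off by pitch/√2 along perpendicular diagonals, as in the Figure 4.36 layout. Which spot pair moves in the third plate, and the sign conventions, are modeling choices consistent with the text's statement that one pair passes through as ordinary rays. Both helpers are hypothetical.

```python
import math

WALKOFF_UM_PER_MM = 5.8  # E/O separation for quartz cut at 45 deg to the optical axis


def plate_thicknesses_mm(pitch_um):
    """Plate thicknesses (mm) for a four-spot square filter of the given pitch.

    Plate 1 splits the spot by one pitch; plates 2 and 3 each walk off by
    pitch / sqrt(2) along perpendicular diagonals.
    """
    t1 = pitch_um / WALKOFF_UM_PER_MM
    return t1, t1 / math.sqrt(2), t1 / math.sqrt(2)


def four_spot_pattern(pitch_um):
    """Spot positions (um) after the three plates, under idealized assumptions."""
    d = pitch_um
    v1 = (d, 0.0)         # plate 1 walk-off, length d
    v2 = (d / 2, d / 2)   # plate 2: axis projection at 45 deg, length d/sqrt(2)
    v3 = (d / 2, -d / 2)  # plate 3: axis projection 90 deg from plate 2
    after1 = [(0.0, 0.0), v1]  # O and E rays of plate 1
    # Plate-1 outputs are polarized at 45 deg to plate 2, so each splits in two.
    o2 = after1                                       # O rays of plate 2 (unshifted)
    e2 = [(x + v2[0], y + v2[1]) for x, y in after1]  # E rays of plate 2
    # Plate 3: plate-2 O rays are E rays here and shift; plate-2 E rays pass as O rays.
    e3 = [(x + v3[0], y + v3[1]) for x, y in o2]
    return sorted(e2 + e3)


print(plate_thicknesses_mm(5.8))  # the 1 mm / 0.707 mm illustration of Figure 4.36
print(plate_thicknesses_mm(2.0))  # a 2 um pixel pitch
print(four_spot_pattern(2.0))     # corners of a 2 um square
```

For a 2 μm pixel the first plate comes out at roughly a third of a millimeter, which illustrates why the thickness is chosen from the pixel pitch rather than fixed at 1 mm.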
The plates are cemented together to produce a filter. The representation of the three plates in Figure 4.36 shows the angle between the edge of each plate and the projection of the optical axis, together with the resulting spot patterns. The line through the spots indicates their polarization. In practice, the plate thickness is chosen on the basis of the pixel pitch, but for the purpose of illustration the first plate is chosen to be one millimeter thick, and the remaining plates are chosen to produce a square pattern. Notice that the first plate produces two spots; it takes two more steps to copy these spots and make a square pattern. The second plate is rotated so that the projection of its optical axis is at 45 degrees to that of the first plate. At 45 degrees, the polarizations of the spots from the first plate have equal components as seen from the second plate, so both spots are doubled by the second plate. Notice that the third plate does not produce additional spots, because its optical axis is at 90 degrees to the optical axis of the second plate. The spots on the right do not move, because they pass through as ordinary rays, but the spots on the left are shifted [11].

There is an alternative design for this filter that uses a retarder for the middle plate. The retarder is still made of quartz, but it is cut differently. It effectively depolarizes the light that reaches the third plate, so the third plate replicates the first two spots to create a square pattern.

FIGURE 4.38 Eight-spot filter: plates of 1.00 mm, 0.71 mm, and 1.00 mm, with spot displacements of 5.8 μm, 4.1 μm, and 5.8 μm.
FIGURE 4.39 Comparison of different filters: MTF versus spatial frequency for the eight-spot filter (X and Y axes), the four-spot filter, and a square-pattern filter.

There are other antialiasing filter types that use refraction or diffraction to reduce the high-frequency content of an image. Figure 4.37a is a pleated filter, which produces a four-spot pattern [12]; this filter should be co-optimized with the lens.
Figure 4.37b is a phase-noise filter. It has spots of different sizes, with a thickness on the order of the wavelength of light. The spots change the wavefront entering the lens to increase the size of the point spread function and reduce the high-frequency content of the image [13], [14]. It is possible to make this filter wavelength dependent, so that the point spread function can be adjusted to match the Nyquist domain of each color channel.

It is possible to build eight-spot and seven-spot birefringent filters with three plates. The details of an eight-spot filter are shown in Figure 4.38. Figure 4.39 compares the MTF of a square-pattern filter, a four-spot filter, and an eight-spot filter. The filter point spread functions are chosen so that each MTF has a zero at half a cycle per sample. Notice that the four-spot filter maintains MTF below the Nyquist frequency better than the other filters, but its MTF suppression past Nyquist is not as good. The eight-spot pattern starts to fill in the four-spot pattern, so the eight-spot filter has some of the attributes of the four-spot filter and some of the attributes of a filter with a square-pattern point spread function. In the spatial domain, the eight-spot filter pattern is taller than it is wide, so its X-axis MTF is different from its Y-axis MTF. The four-spot filter works well when the lens and pixel aperture control the MTF past the Nyquist frequency. If the antialiasing filter has to control the MTF past Nyquist, one of the other patterns may work better.

4.11 Testing

Antialiasing filters can be tested with a point source, a good quality photographic objective, and a microscope. The photographic objective is used to image the point source; the image-forming beam passes through the antialiasing filter, and the resulting image is viewed with a microscope (Figure 4.40).
If the antialiasing filter is a four-spot birefringent filter, the image will be four copies of the lens point spread function in a square pattern. The lens point spread function has to be small relative to the size of the filter pattern, or the image will just be a square patch of light.

FIGURE 4.40 Viewing the antialiasing filter point spread function: point source, photographic objective, antialiasing filter, and microscope.

There are a number of issues with this technique. When cameras had six or twelve micrometer pixels, this technique was easy to set up without much attention to detail, but current consumer cameras are rapidly moving toward pixels of two micrometers or less. This pixel size is pushing the limits of optical microscopes and photographic objectives. To image two micrometer pixels, the objective has to be diffraction limited at f/2 or a lower f-number. The point source has to fill the entrance pupil of the lens; otherwise the lens point spread function will be larger, because the lens is effectively operating at a smaller aperture. Finally, the microscope objective has to be aligned carefully, and its numerical aperture (NA) has to be large enough to accept the f/2 beam (NA ≈ 1/(2N) ≈ 0.25).

If the antialiasing filter is not available to test separately, or if a system test is required, a circular chirp pattern or a zone plate can be used to analyze the aliasing potential of a camera or other digital imaging system. The spatial frequency of a circular chirp increases linearly with radius from the center of the pattern. The equation is:

$$ u = \frac{\cos(2\pi q^2) + 1}{2} \tag{4.16} $$

where u is the reflectance or transmission of the chirp and q is the radius from the center of the chirp. A chirp pattern is illustrated in Figure 4.41a. All of the following chirp images are simulated captures that are processed as if they were camera captures. Figure 4.41b is a red or blue channel chirp pattern from a Bayer CFA that was interpolated using bilinear interpolation.
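For simulations of this kind, a chirp following Equation 4.16 can be generated on a pixel grid as sketched below. The scaling of q to pixel coordinates is an assumption made for illustration: it places the Nyquist frequency of the grid at a chosen radius, so the sampling frequency falls at twice that radius. The Bayer sampling and interpolation used for Figure 4.41 are not reproduced here.

```python
import math


def chirp_image(size, nyquist_radius):
    """Circular chirp of Eq. 4.16, u = (cos(2 pi q^2) + 1) / 2, on a pixel grid.

    q is taken as c * rho, where rho is the distance in pixels from the center.
    The local frequency of the pattern is 2 c^2 rho cycles/pixel, so choosing
    c^2 = 0.25 / nyquist_radius puts the Nyquist frequency (0.5 cycle/pixel)
    at that radius and the sampling frequency at twice that radius.
    """
    c2 = 0.25 / nyquist_radius
    center = (size - 1) / 2
    return [[(math.cos(2 * math.pi * c2 * ((row - center) ** 2 + (col - center) ** 2)) + 1) / 2
             for col in range(size)]
            for row in range(size)]


def local_frequency(rho, nyquist_radius):
    """Local chirp frequency in cycles/pixel at radius rho (pixels)."""
    return 0.5 * rho / nyquist_radius


img = chirp_image(129, 32)
print(local_frequency(32, 32), local_frequency(64, 32))  # Nyquist, sampling frequency
```

Sampled on its own grid, such a pattern already shows aliased rings beyond the Nyquist radius, and a ring pattern centered at the sampling-frequency radius, where the local frequency aliases back to zero.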
Notice the aliasing at the Nyquist frequency and the pronounced aliasing at the sampling frequency. Figure 4.41c shows an interpolated green channel without an antialiasing filter. Figure 4.41d is an interpolated red or blue channel with an antialiasing filter; the improvement with the filter is clear. Figures 4.41e and 4.41f show the results of adaptive interpolation. Viewed in black and white, the adaptively interpolated red or blue channel does not look as good as the bilinear version. This is because the color errors that adaptive interpolation reduces are not visible in a monochrome image. The green channel, however, is clearly better with adaptive interpolation. To show the image quality improvement of adaptive over bilinear interpolation, the images are compared in Lab color space (Figure 4.42).

FIGURE 4.41 Experimentation using a chirp pattern (panels (a) and (c) mark the Nyquist domain and the sampling frequency): (a) original image, (b) red or blue channel interpolated from Bayer pattern without antialiasing filter, (c) green channel interpolated from Bayer pattern without antialiasing filter, (d) red or blue channel interpolated from Bayer pattern with antialiasing filter, (e) red or blue channel interpolated from Bayer pattern with antialiasing filter and adaptive interpolation, and (f) green channel interpolated from Bayer pattern with antialiasing filter and adaptive interpolation.

FIGURE 4.42 Interpolation of a chirp pattern image in Lab space: (a-c) adaptive interpolation, (d-f) bilinear interpolation; (a,d) L channel, (b,e) a channel, (c,f) b channel.

Lab is an opponent color space. Negative values of a represent green and positive values represent magenta, whereas negative values of b represent blue and positive values represent yellow. The a and b channels are both zero for gray images.
As illustrated, the a and b images are scaled to nonnegative values between 0 and 255, and the value 128 corresponds to a or b equal to zero. Areas darker or lighter than 128 indicate more color content. Notice that the images with adaptive interpolation have much less color detail, indicating a much lower level of aliased color in the adaptively interpolated image.

4.12 Conclusions

Analysis of antialiasing filter performance must include all capture system parameters, particularly the pixel aperture size, lens performance, and interpolation technique. Ideally, the antialiasing filter and the interpolation technique should be co-optimized to maximize the system MTF below the Nyquist frequency and minimize it above the Nyquist frequency. Antialiasing filters have been used in electronic and digital cameras for over forty years. Over that time, a good compromise has been reached between the reduced MTF below the Nyquist frequency and the reduced aliasing from scene content above it. Reducing the lens MTF has always been an attractive alternative to the expense of an antialiasing filter. However, the variation of lens MTF with f-number and focus makes it difficult to achieve consistent antialiasing this way. Some consumer cameras with pixel sizes of two micrometers or less omit the antialiasing filter, because the lens MTF beyond the Nyquist frequency is low enough to suppress aliasing. Pixel size is still being driven lower, although the inherent noise of small pixels may stop this trend. If pixel size drops to one micrometer and lens f-numbers remain at about f/3, camera systems certainly will not need an antialiasing filter. At this pixel pitch, even a diffraction-limited lens does not have a high enough MTF at the Nyquist frequency to produce aliasing.

Acknowledgments

The author is especially grateful to Bruce Pillman and John Hamilton, both of Eastman Kodak Company.
Bruce insightfully suggested some of the included topics and was always available to discuss content. John helped formulate the math; any errors that may have slipped past John’s watchful eye are the author’s own.

References

[1] D.R. Williams, “Aliasing in human foveal vision,” Vision Research, vol. 25, no. 2, pp. 195–205, 1985.
[2] J.E. Greivenkamp, “Color dependent optical prefilter for the suppression of aliasing artifacts,” Applied Optics, vol. 29, no. 5, pp. 676–684, 1990.
[3] J.W. Goodman, Introduction to Fourier Optics. New York: McGraw-Hill, 1968, reissued 1988.
[4] P. Dillon, D.M. Lewis, and F.G. Kaspar, “Color imaging system using a single CCD area array,” IEEE Journal of Solid-State Circuits, vol. 13, no. 1, pp. 28–33, February 1978.
[5] W.J. Smith, Modern Optical Engineering. New York: McGraw-Hill, 3rd edition, 2000.
[6] E. Hecht and A. Zajac, Optics. Reading, MA: Addison-Wesley Longman, 3rd edition, 1997.
[7] J.D. Gaskill, Linear Systems, Fourier Transforms and Optics. New York: John Wiley & Sons, 1978.
[8] R.H. Vollmerhausen and R.G. Driggers, Analysis of Sampled Imaging Systems. Bellingham, WA: SPIE Press, 2000.
[9] D. Clarke and J.F. Grainger, Polarized Light and Optical Measurement. Oxford: Pergamon Press Ltd., 1971.
[10] F.A. Jenkins and H.E. White, Fundamentals of Optics. New York: McGraw-Hill, 2001.
[11] D. Kessler, A. Nutt, and R. Palum, “Anti-aliasing low-pass blur filter for reducing artifacts in imaging apparatus,” U.S. Patent 6 937 283, August 2005.
[12] R. Palum, “Optical blur filter having a four-feature pattern,” U.S. Patent 6 326 998, December 2001.
[13] K. Sayanagi, “Phase noise filter and its application to photography and photolithography,” U.S. Patent 2 959 105, August 1960.
[14] J. Hirsh, J.F. Revelli, and A. Nutt, “Phase-noise type broad spectral bandwidth optical lowpass anti-aliasing filter,” U.S. Patent 6 040 857, March 2000.
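The closing claim of Section 4.12 can be checked with the standard MTF of an aberration-free lens with a circular pupil:

MTF(ν) = (2/π)[arccos ν − ν√(1 − ν²)], with ν = f·λ·N.

In the Python sketch below, the 550 nm wavelength is an assumed mid-visible value and the helper names are hypothetical. For a one-micrometer pitch (Nyquist 0.5 cycles/µm) at f/3, the diffraction cutoff 1/(λN) is about 0.61 cycles/µm, and the MTF at Nyquist comes out below 0.1. For a six-micrometer pitch at the same f-number, it is above 0.8.

```python
import numpy as np


def diffraction_mtf(f, f_number, wavelength_um=0.55):
    """MTF of an aberration-free lens with a circular pupil (incoherent light).

    f is in cycles per micrometer; the cutoff frequency is 1 / (wavelength * N).
    """
    nu = np.clip(np.asarray(f, dtype=float) * wavelength_um * f_number, 0.0, 1.0)
    return (2.0 / np.pi) * (np.arccos(nu) - nu * np.sqrt(1.0 - nu**2))


def mtf_at_nyquist(pixel_um, f_number, wavelength_um=0.55):
    """Diffraction-limited MTF at the Nyquist frequency 1 / (2 * pitch)."""
    nyquist = 1.0 / (2.0 * pixel_um)  # cycles per micrometer
    return float(diffraction_mtf(nyquist, f_number, wavelength_um))


if __name__ == "__main__":
    for pitch in (6.0, 2.0, 1.0):
        print(f"{pitch} um pixels at f/3: MTF at Nyquist = {mtf_at_nyquist(pitch, 3.0):.3f}")
```

This ignores pixel aperture, defocus, and lens aberrations, all of which would lower the MTF further.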
5 Spatio-Spectral Sampling and Color Filter Array Design

Keigo Hirakawa and Patrick J. Wolfe

5.1 Introduction
5.2 Spatio-Spectral Analysis of Existing Patterns
  5.2.1 Color Filter Arrays
  5.2.2 Aliased Sensor Data and Demosaicking
5.3 Spatio-Spectral Color Filter Array Design
  5.3.1 Frequency-Domain Specification of Color Filter Array Designs
  5.3.2 Analysis and Design Trade-Offs
5.4 Linear Demosaicking via Demodulation
5.5 Examples and Analysis
5.6 Conclusion
References

5.1 Introduction

Owing to the growing ubiquity of digital image acquisition and display, several factors must be considered when developing systems to meet future color image processing needs, including improved quality, increased throughput, and greater cost-effectiveness [1], [2], [3].
In consumer still-camera and video applications, color images are typically obtained via a spatial subsampling procedure implemented as a color filter array (CFA). A CFA is a physical construction whereby only a single component of the color space is measured at each pixel location [4], [5], [6], [7]. Substantial work in both industry and academia has been dedicated to postprocessing the acquired raw image data as part of the so-called image processing pipeline. This includes, in particular, the canonical demosaicking task of reconstructing a full-color image from the spatially subsampled and incomplete data acquired using a CFA [8], [9], [10], [11], [12], [13]. However, as we detail in this chapter, the inherent shortcomings of contemporary CFA designs mean that subsequent processing steps often yield diminishing returns in terms of image quality. For example, although distortion may be masked to some extent by motion blur and compression, the loss of image quality resulting from all but the most computationally expensive state-of-the-art methods is unambiguously apparent to the practiced eye. Refer to Chapters 1 and 3 for additional information on single-sensor imaging fundamentals.

As the CFA represents one of the first steps in the image acquisition pipeline, it largely determines the maximal resolution and computational efficiency achievable by subsequent processing schemes. Here we show that the spatial resolution attainable with a particular choice of CFA is quantifiable, and we propose new CFA designs to maximize it [14], [15]. In contrast to the majority of the demosaicking literature, we explicitly consider the interplay between CFA design and the properties of typical image data, and its implications for spatial reconstruction quality.
Formally, we pose the CFA design problem as simultaneously maximizing the allowable spatio-spectral support of the luminance and chrominance channels, subject to a partitioning requirement in the Fourier representation of the sensor data. This classical aliasing-free condition preserves the integrity of the color image data. It thereby guarantees exact reconstruction when demosaicking is implemented as demodulation (demultiplexing in frequency). Surprisingly, from this perspective we can show the suboptimality of CFA designs based on pure tristimulus values [15], a standard design approach long taken by industry and manifested in particular by the popular Bayer pattern [4]. Such designs are less resilient to spatial aliasing as image resolution increases. They require both stronger assumptions about the image data and more computationally demanding nonlinear demosaicking methods to avoid reconstruction artifacts.

Here our interest lies in quantifying the trade-offs between performance and complexity for different classes of CFA design. We consider the purely linear reconstruction of typical images as an indication of baseline performance, and interpret the resultant degree of aliasing as a measure of the maximally attainable spatio-spectral resolution. As an alternative to existing CFA patterns, we provide a constructive method to generate feasible CFA designs. These designs are robust to prior assumptions on color channel bandlimitedness and yield high performance while implying only low complexity for subsequent processing steps in the imaging pipeline. Because our emphasis is on the efficiency of the overall color image acquisition pipeline, we omit an explicit comparison of demosaicking strategies. However, our analysis yields a general class of linear demosaicking methods that provide state-of-the-art performance with complexity comparable to simple bilinear interpolation.
In addition, our proposed CFAs are designed for increased noise robustness. The color filters themselves are panchromatic, alleviating difficulties in low-light conditions, and the linear reconstruction methods we propose can also be expected to enable more tractable noise modelling [15].

The remainder of this chapter is organized as follows. We begin in Section 5.2 by examining the spatio-spectral properties of typical CFA designs in the Fourier domain and discussing their susceptibility to aliasing. In Section 5.3 we propose a constructive method to specify a physically realizable CFA pattern in terms of its spatio-spectral properties. The resultant CFA designs admit fast, optimal linear reconstruction schemes, which we outline in Section 5.4. In Section 5.5 we give several explicit examples of these new patterns and provide empirical evaluations on standard color image test sets. We summarize and conclude with a discussion in Section 5.6.

5.2 Spatio-Spectral Analysis of Existing Patterns

In this section, we analyze the spatio-spectral properties of the sampling induced by existing CFA patterns. In single-sensor cameras, the pixel sensor at each spatial location is equipped with a color filter. This is a physical device whose pigments absorb a portion of the visible electromagnetic spectrum while passing the rest to the photosensitive element beneath the filter. The measured value at each location is therefore an inner product, taken with respect to the corresponding color filter's spectral response. It results from a spatio-temporal integration of the incident light over each pixel's physical area and exposure time. This is similar to the acquisition process in the retina, where each cone measures the intensity of the light with respect to its spectrally shifted response [4], [5], [6], [7].
The spectral response functions of the cones can be taken to span a three-dimensional space, and cone and sensor measurements are largely proportional to the intensity of the light (i.e., linear). The observed light can therefore be uniquely represented (up to a linear transformation) by a color triple. We adopt the standard convention and identify these filters by color names such as red, green, and blue. These names may not be synonymous with perceived color, which is a function of the environmental illuminant [1]. As the goal of this chapter is the identification and optimization of relevant objective metrics, rather than subjective metrics related to perception, we make no further attempt to elaborate on issues of color science.

5.2.1 Color Filter Arrays

We begin with a Fourier analysis of the spatio-spectral properties of CFA patterns [10], [13]. This spatially global perspective is a logical starting point for a number of reasons (a spatially local perspective is provided in the next section):

- First, color filter arrays are physical constructions that are fixed prior to image acquisition, and are therefore not adapted to local image properties.
- Second, color filter arrays typically comprise a repetitive tiling of the image plane formed by the union of alternating color samples.¹ As we describe below, the global spatial periodicity of CFA sampling patterns may be understood in terms of lattices. A so-called dual or reciprocal lattice then determines the resultant spectral periodicity under the Fourier transform.
- Finally, the linear reconstruction methods we consider, in the interest of evaluating computation-quality trade-offs, preclude adaptation to the local statistics of the image under consideration.

To motivate our analysis, let us first consider the interplay between the color channels of the acquired image. Let $x(n) = [x_r(n), x_g(n), x_b(n)]^T$ denote the RGB tristimulus value of the desired color image at pixel location $n \in \mathbb{Z}^2$.
Define $c(n) = [c_r(n), c_g(n), c_b(n)]^T$ as the corresponding CFA color combination, so that the measured sensor value $y(n)$ at location $n$ can be expressed as the inner product $y(n) = c(n)^T x(n)$. For the moment, we restrict our attention to $c(n) \in \{[1,0,0]^T, [0,1,0]^T, [0,0,1]^T\}$ as a model for CFA schemes that multiplex color samples; note that $c_r + c_g + c_b = 1$. Each pixel sensor thus measures

$$y(n) = c(n)^T x(n) = \begin{bmatrix} c_r(n) & c_g(n) & c_b(n) \end{bmatrix} \begin{bmatrix} x_r(n) \\ x_g(n) \\ x_b(n) \end{bmatrix} = \begin{bmatrix} c_r(n) & 1 & c_b(n) \end{bmatrix} \begin{bmatrix} x_\alpha(n) \\ x_g(n) \\ x_\beta(n) \end{bmatrix}, \tag{5.1}$$

where $x_\alpha = x_r - x_g$ and $x_\beta = x_b - x_g$ are difference channels. As noted in References [10] and [13], this $\{x_\alpha, x_g, x_\beta\}$ representation offers an advantage over the original $\{x_r, x_g, x_b\}$ formulation. The difference channels $x_\alpha$ and $x_\beta$ serve as a proxy for chrominance components, which enjoy rapid decay in the spatial frequency domain. Meanwhile, $x_g$ can be taken to represent the image luminance component, which embodies edge and texture information. In fact, the Pearson correlation coefficient measured between the high-frequency components of the color channels $\{x_r, x_g, x_b\}$ is typically larger than 0.9 [8]. Because of this high degree of redundancy, it is often assumed that $x_\alpha$ and $x_\beta$ are lowpass relative to $\{x_r, x_g, x_b\}$; see Figure 5.1.

¹ Pseudo-random CFA patterns have also been considered in the past [7]. Despite their potential theoretical advantages, we omit them from our discussion, as the corresponding reconstruction schemes incur much greater computational expense.

FIGURE 5.1 Log-magnitude spectra of a typical color image (i.e., image flower here) illustrating the lowpass nature of the difference channels $x_\alpha$ and $x_\beta$ relative to $x_r$, $x_g$, and $x_b$. Individual spectra correspond to: (a) red channel, (b) green channel, (c) blue channel, (d) difference $x_\alpha$, and (e) difference $x_\beta$.
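Equation 5.1 can be checked numerically. The Python sketch below builds Bayer indicator masks with red on coset [0, 0]^T and blue on coset [1, 1]^T, which matches the Bayer specification given later in this section. On a random test image, it confirms that the direct inner product c(n)^T x(n) equals x_g + c_r x_α + c_b x_β. The image size, random seed, and function name are arbitrary illustrative choices.

```python
import numpy as np


def bayer_masks(height, width):
    """Indicator masks (c_r, c_g, c_b) of a Bayer CFA: red on coset [0, 0],
    blue on coset [1, 1], green elsewhere (so c_r + c_g + c_b = 1)."""
    n1, n2 = np.mgrid[0:height, 0:width]
    c_r = ((n1 % 2 == 0) & (n2 % 2 == 0)).astype(float)
    c_b = ((n1 % 2 == 1) & (n2 % 2 == 1)).astype(float)
    c_g = 1.0 - c_r - c_b
    return c_r, c_g, c_b


rng = np.random.default_rng(0)
x_r, x_g, x_b = rng.random((3, 6, 8))  # arbitrary test image
c_r, c_g, c_b = bayer_masks(6, 8)

# Left-hand form of Equation 5.1: y = c^T x.
y_direct = c_r * x_r + c_g * x_g + c_b * x_b
# Right-hand form: y = x_g + c_r * x_alpha + c_b * x_beta.
x_alpha, x_beta = x_r - x_g, x_b - x_g
y_split = x_g + c_r * x_alpha + c_b * x_beta
```

The identity holds because c_g = 1 − c_r − c_b, so the green term absorbs the two subtracted copies of x_g.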
The key observation to be gleaned from Equation 5.1 is that $y$ is the sum of the green channel $x_g$ and the subsampled difference images $c_r \cdot x_\alpha$ and $c_b \cdot x_\beta$. To understand the limitations of existing color filter array designs, it is helpful to consider the geometric and algebraic structure of the subsampling patterns $c_r$ and $c_b$ through the notion of point lattices [15]. To this end, we say that a (nonsingular) sampling matrix $M \in \mathbb{R}^{2 \times 2}$ generates a lattice $M\mathbb{Z}^2$. Certain sampling patterns $c_r$ and $c_b$ can in turn be written as two-dimensional pulse trains using lattice notation:

$$c_r(n) = \sum_{n_0 \in \{m_r + M_r\mathbb{Z}^2\}} \delta(n - n_0), \qquad c_b(n) = \sum_{n_0 \in \{m_b + M_b\mathbb{Z}^2\}} \delta(n - n_0), \tag{5.2}$$

where $M_r$ and $M_b$ are $2 \times 2$ sampling matrices, $m_r, m_b \in \mathbb{Z}^2$ are termed coset vectors, and $\delta(n)$ is the Kronecker delta function.² Lattices themselves admit the notion of a Fourier transform, as specified by a dual lattice $2\pi M^{-T}\mathbb{Z}^2$. Define $Y(\omega)$ as the Fourier transform (in angular frequency $\omega$) of the sensor data $y(n)$. It then follows from Equations 5.1 and 5.2 that $Y(\omega)$ over the region $[-\pi,\pi) \times [-\pi,\pi)$ is given by

$$Y(\omega) = X_g(\omega) + |\det(M_r)|^{-1} \sum_{\lambda_r \in \{2\pi M_r^{-T}\mathbb{Z}^2 \cap [-\pi,\pi)^2\}} e^{-j m_r^T \lambda_r} X_\alpha(\omega - \lambda_r) + |\det(M_b)|^{-1} \sum_{\lambda_b \in \{2\pi M_b^{-T}\mathbb{Z}^2 \cap [-\pi,\pi)^2\}} e^{-j m_b^T \lambda_b} X_\beta(\omega - \lambda_b). \tag{5.3}$$

² In fact, Equation 5.2 represents a special case in which the sampling patterns $c_r$ and $c_b$ are each themselves lattices. More generally, they are defined in terms of unions of lattice cosets [15]; however, this does not change the fundamentals of the present discussion.

FIGURE 5.2 (See color insert.) Examples of existing CFAs: (a) Bayer [4], (b) Yamanaka [5], (c) Lukac [7], (d) vertical stripes [7], (e) diagonal stripes [7], (f) modified Bayer [7], (g) cyan-magenta-yellow, (h) Kodak I [16], (i) Kodak II [16], (j) Kodak III [16].

The key point of Equation 5.3 is that the dual lattices specify the carrier frequencies $\{\lambda_r, \lambda_b\}$ about which spectral copies of the difference channels $x_\alpha$ and $x_\beta$ are replicated in the Fourier domain. The popular Bayer CFA [4], for instance, is specified by $M_r = M_b = 2I$, $m_r = [0,0]^T$, and $m_b = [1,1]^T$. This implies dual lattices equal to $\pi\mathbb{Z}^2$, with nonzero $\{\lambda_r, \lambda_b\}$ given by $[-\pi,0]^T$, $[0,-\pi]^T$, and $[-\pi,-\pi]^T$. Examples of several existing CFAs $c(n)$ and the corresponding spectra $Y(\omega)$ of typical sensor data are illustrated in Figure 5.2 and Figure 5.3, respectively. Aliasing occurs when, for nonzero $\lambda_r$ or $\lambda_b$, the spectral support of $X_g(\omega)$ overlaps that of $X_\alpha(\omega - \lambda_r)$ or $X_\beta(\omega - \lambda_b)$.

Despite its widespread use, the Bayer CFA induces spectral periodization about $[-\pi,0]^T$ and $[0,-\pi]^T$, which severely limits the allowable spectral bandwidth of $X_g$. In fact, all CFAs depicted in Figure 5.2 are suboptimal in at least one of two ways:

- First, as shown in Figures 5.3a to 5.3d and 5.3g to 5.3j, spectral copies of the difference channels appear along the horizontal and/or vertical axes of the Fourier representation. This leaves the baseband channel $X_g$ vulnerable to the horizontal and vertical features that frequently dominate natural images [17].
- Second, as shown in Figures 5.3d to 5.3f and 5.3h to 5.3j, maximal separation between $X_g(\omega)$ and $X_\alpha(\omega - \lambda_r)$, $X_\beta(\omega - \lambda_b)$ is precluded unless all nonzero carrier frequencies $\{\lambda_r, \lambda_b\}$ lie along the perimeter of $[-\pi,\pi) \times [-\pi,\pi)$.

FIGURE 5.3 (See color insert.) Log-magnitude spectra of a typical color image (i.e., image flower here) sampled with CFAs corresponding to Figure 5.2. Color coding is used to distinguish different components, with the $x_g(n)$ component shown in green, $x_\alpha(n) = x_r(n) - x_g(n)$ in red, and $x_\beta(n) = x_b(n) - x_g(n)$ in blue.
Individual spectra correspond to: (a) Bayer [4], (b) Yamanaka [5], (c) Lukac [7], (d) vertical stripes [7], (e) diagonal stripes [7], (f) modified Bayer [7], (g) cyan-magenta-yellow, (h) Kodak I [16], (i) Kodak II [16], (j) Kodak III [16].

In fact, these two conditions can be used to formulate a precise statement of CFA suboptimality [15]. Any CFA design of the form $c(n) \in \{[1,0,0]^T, [0,1,0]^T, [0,0,1]^T\}$ that places all spectral replicates on the perimeter of $[-\pi,\pi) \times [-\pi,\pi)$, while avoiding $[-\pi,0]^T$ and $[0,-\pi]^T$, can support only two distinct colors. In Section 5.3 we show how panchromatic designs can overcome this restriction. However, the designs that have emerged to date (including four-color CFAs) fail to satisfy the above two conditions.

5.2.2 Aliased Sensor Data and Demosaicking

Because the suboptimal CFA designs detailed above are prone to aliasing, linear reconstruction methods no longer suffice as the spectral support of $X_g(\omega)$ increases. Reconstruction then becomes an ill-posed problem: stronger assumptions about the signal are needed to recover the full-color image from aliased sensor data. The most common approach is to invoke the principle that local image features are sparse in some canonical representation. One explicit form of this principle is directionality. Image features are assumed to be oriented in one direction, so the energy of the corresponding local Fourier coefficients is concentrated accordingly. If $X_g$ is sparse in the direction parallel to an image feature orientation, then aliasing can in turn be avoided. This principle is exploited either explicitly or implicitly by many state-of-the-art demosaicking methods [9], [10], [11], [12], [13].
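The Bayer carrier frequencies stated in Section 5.2.1 can be confirmed with a discrete Fourier transform of the sampling masks. The Python sketch below uses an 8 × 8 periodic tile, and the helper name is hypothetical. In this tile, DFT bin k corresponds to angular frequency 2πk/8, so bin 4 is π, which is the same as −π. Both masks have energy only at DC and at (π, 0), (0, π), and (π, π). The two on-axis carriers are the ones that leave X_g exposed to horizontal and vertical features. For the blue mask, the coefficient signs are −1 at (π, 0) and (0, π) and +1 at (π, π). These equal e^{−j m_b^T λ} for m_b = [1, 1]^T.

```python
import numpy as np


def carrier_bins(mask, tol=1e-9):
    """DFT bins (k1, k2) at which a periodic sampling mask has nonzero energy."""
    spectrum = np.fft.fft2(mask)
    k1, k2 = np.nonzero(np.abs(spectrum) > tol)
    return sorted((int(a), int(b)) for a, b in zip(k1, k2))


N = 8  # tile size; bin k corresponds to omega = 2*pi*k/N, so k = 4 is pi (= -pi)
n1, n2 = np.mgrid[0:N, 0:N]
c_r = ((n1 % 2 == 0) & (n2 % 2 == 0)).astype(float)  # coset m_r = [0, 0]
c_b = ((n1 % 2 == 1) & (n2 % 2 == 1)).astype(float)  # coset m_b = [1, 1]

print(carrier_bins(c_r))  # [(0, 0), (0, 4), (4, 0), (4, 4)]
print(carrier_bins(c_b))  # [(0, 0), (0, 4), (4, 0), (4, 4)]
```

The common magnitude of each nonzero coefficient, N²/4 = 16, reflects the factor |det(M)|⁻¹ = 1/4 for M = 2I.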
In a similar manner, under a transformation that is local in both space and frequency, the signal energy may be assumed to be compressed into a few transform coefficients. Regularization in the transform domain then helps to recover the full-color image [8], [12]. However, demosaicking methods that exploit these assumptions are usually highly nonlinear and computationally demanding. Indeed, effective detection of image feature orientation (especially in the presence of noise) is an active area of research, and determining local image statistics requires additional computation. Moreover, subsequent interpolation steps are tightly coupled to the estimates of feature directionality. This type of nonlinearity is effectively a data-driven switching mechanism that is expensive to implement in ASIC or DSP hardware. Wavelet- and filterbank-based methods, on the other hand, often employ iterative reconstruction schemes that may not be easy to implement in portable imaging devices.

The difficulties posed by nonlinear reconstruction methods are especially evident in today's digital video camera architectures. To meet the required frame rate with limited computational complexity, for example, it is common to implement demosaicking with methods such as bilinear interpolation, which fail to yield satisfactory results. Other processing schemes may introduce pixel flickering artifacts, such as interframe oscillation or toggling of pixel colors caused by the susceptibility of edge-detection techniques to noise. Finally, nonlinear demosaicking methods are themselves subject to perturbations due to noise. Simultaneous image denoising and interpolation methods have emerged in recent years (see, for example, Reference [12]). Even so, the difficulty of characterizing noise statistics after nonlinear demosaicking often renders stand-alone image denoising methods ineffective.
In contrast, the statistics of noise that undergoes only linear processing remain highly tractable, suggesting that joint denoising and demosaicking may indeed be possible.

5.3 Spatio-Spectral Color Filter Array Design

By considering simultaneously the spectral support of the luminance and chrominance components and the spatial sampling requirements of the image acquisition process, we may conceive of a new paradigm for designing CFAs. Robustness to aliasing is achieved by ensuring that spectral replicates lie along the perimeter of the Fourier-domain region $[-\pi,\pi) \times [-\pi,\pi)$, while avoiding the values $[-\pi,0]^T$ and $[0,-\pi]^T$ on the horizontal and vertical axes. In this way, our CFA design methodology aims to preserve the integrity of the color image in the subsampled sensor data. Images acquired in this manner are easily manipulated and enjoy simple reconstruction schemes. They also admit favorable computation-quality trade-offs with the potential to ease subsequent processing in the imaging pipeline [14], [15].

5.3.1 Frequency-Domain Specification of Color Filter Array Designs

Let $0 \le c_r(n), c_g(n), c_b(n) \le 1$ denote the CFA projection values at a particular spatial location. Here $c_r(n), c_g(n), c_b(n)$ now take continuous values and hence represent a mixture of prototype channels. With the additional constraint $c_r + c_g + c_b = \gamma$, it follows in analogy to Equation 5.1 that

$$y(n) = c(n)^T x(n) = \begin{bmatrix} c_r(n) & \gamma & c_b(n) \end{bmatrix} \begin{bmatrix} x_\alpha(n) \\ x_g(n) \\ x_\beta(n) \end{bmatrix},$$

and we may determine the modulation frequencies of the difference channels $x_\alpha(n)$ and $x_\beta(n)$ through our choice of $c_r(n)$ and $c_b(n)$. Recalling Equation 5.3, we seek choices such that the Fourier transforms of the frequency-modulated difference images, $X_\alpha(\omega - \lambda_r)$ and $X_\beta(\omega - \lambda_b)$, are maximally separated from the baseband spectrum $X_g(\omega)$. In the steps outlined below, we first specify candidate carrier frequencies $\{\tau_i\}$ and corresponding weights $s_i, t_i \in \mathbb{C}$ for the color filters $c_r(n)$ and $c_b(n)$.
Recall that for constants $\nu$ and $\kappa$ we have $\mathcal{F}\{\kappa c_r + \nu\}(\omega) = \kappa\,\mathcal{F}\{c_r\}(\omega) + \nu\,\delta(\omega)$. Adding a constant therefore changes only the DC term, and scaling leaves the carriers in place. It is thus possible to manipulate the candidate color filter values until the realizability condition $0 \le c_r(n), c_g(n), c_b(n) \le 1$ is met. This notion leads to the following algorithm for the frequency-domain specification of color filter array designs (with $\bar{\cdot}$ denoting complex conjugation; Figure 5.4 illustrates the algorithmic steps):

ALGORITHM 5.1 Frequency-domain color filter array design.

1. Specify initial values $\{\tau_i, s_i, t_i\}$. Set the modulation frequencies:
$$c_r^{(0)} = \mathcal{F}^{-1}\Big\{\sum_i s_i\,\delta(\omega + \tau_i) + \bar{s}_i\,\delta(\omega - \tau_i)\Big\}, \qquad c_b^{(0)} = \mathcal{F}^{-1}\Big\{\sum_i t_i\,\delta(\omega + \tau_i) + \bar{t}_i\,\delta(\omega - \tau_i)\Big\}.$$
2. Subtract the constants $\nu_r = \min_n c_r^{(0)}(n)$ and $\nu_b = \min_n c_b^{(0)}(n)$ (non-negativity): $c_r^{(1)} = c_r^{(0)} - \nu_r$, $c_b^{(1)} = c_b^{(0)} - \nu_b$.
3. Scale by $\kappa = \big(\max_n [c_r^{(1)}(n) + c_b^{(1)}(n)]\big)^{-1}$ (convex combination): $c_r^{(2)} = \kappa c_r^{(1)}$, $c_b^{(2)} = \kappa c_b^{(1)}$.
4. Find green: $c_g^{(2)} = 1 - c_r^{(2)} - c_b^{(2)}$.
5. Scale by $\gamma = \big(\max_n \max\{c_r^{(2)}(n), c_g^{(2)}(n), c_b^{(2)}(n)\}\big)^{-1}$: $c_r = \gamma c_r^{(2)}$, $c_g = \gamma c_g^{(2)}$, $c_b = \gamma c_b^{(2)}$.

In the first step, candidate carrier frequencies are determined by taking the inverse Fourier transform of $\delta(\omega \pm \tau_i)$. The conjugate symmetry in this step guarantees a real-valued color filter array. In general, however, the resultant design is not physically realizable (points in Figure 5.4a fall outside of the first quadrant, for example). The constants $\nu_r$ and $\nu_b$ are then subtracted to ensure non-negativity of the color filters (Figure 5.4b). Scaling by $\kappa$ and computing the green component in the next two steps project the candidate values onto the unit simplex, ensuring convexity and a maximum component value of unity (Figures 5.4c and 5.4d). Finally, multiplication by $\gamma$ maximizes the quantum efficiency of the color filters (Figure 5.4e). The resultant CFA is physically realizable. Its observed spectral data $Y$ are given by the sum of baseband components and modulated versions of $X_\alpha$ and $X_\beta$:

$$Y(\omega) = \gamma X_g(\omega) - \gamma\kappa\nu_r X_\alpha(\omega) - \gamma\kappa\nu_b X_\beta(\omega) + \gamma\kappa \sum_i \Big[\{s_i X_\alpha + t_i X_\beta\}(\omega + \tau_i) + \{\bar{s}_i X_\alpha + \bar{t}_i X_\beta\}(\omega - \tau_i)\Big].$$

FIGURE 5.4 Color filter array design visualized in Cartesian coordinates $(c_r, c_b, c_g)$, with the dotted cube representing the space of physically realizable color filters ($0 \le c_r(n), c_g(n), c_b(n) \le 1$). Steps 1 to 5 in Algorithm 5.1 are shown as (a) to (e), respectively.

This approach enables the specification of CFA design parameters directly in the Fourier domain, by way of the carrier frequencies $\{\tau_i\}$ and weights $\{s_i, t_i\}$. In doing so, we ensure that nonzero carrier frequencies lie along the perimeter of $[-\pi,\pi) \times [-\pi,\pi)$ while avoiding the values $[-\pi,0]^T$ and $[0,-\pi]^T$, as desired.

5.3.2 Analysis and Design Trade-Offs

In this section, we consider some notable features of the above CFA design strategy; readers are referred to Reference [15] for a thorough analysis of design trade-offs.

We first note that CFA designs resulting from Algorithm 5.1 are panchromatic: the resultant filters comprise a mixture of red, green, and blue at each spatial location. Color filters are commonly realized by pigment layers of cyan, magenta, and yellow dyes over an array of pixel sensors (i.e., subtractive colors) [18]. Designs for which $\gamma > 1$ therefore suggest improved quantum efficiency. Furthermore, it becomes easier to control for sensor saturation, as the relative quantum efficiency at each pixel location is approximately uniform ($c_r + c_g + c_b = \gamma$).

We also note that the space of feasible initialization parameters $\{\tau_i, s_i, t_i\}$ for Algorithm 5.1 is underconstrained. This offers flexibility in optimizing the CFA design for other desirable characteristics, such as demosaicking complexity, pattern periodicity, resilience to the illuminant spectrum, and numerical stability [15].
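Algorithm 5.1 maps directly onto array operations, and the Python sketch below runs its five steps on a small periodic tile. It is a minimal sketch, not the chapter's implementation. The inverse transform in step 1 is evaluated as a sum of complex exponentials with its normalizing constant dropped; the scaling in steps 3 and 5 absorbs that constant. The carriers and weights in the usage lines are hypothetical values, not entries of Table 5.1. They were chosen only to place one carrier per difference channel on the perimeter of [−π, π) × [−π, π), away from (π, 0) and (0, π).

```python
import numpy as np


def design_cfa(taus, s, t, size):
    """Sketch of Algorithm 5.1 on a size x size periodic tile."""
    n1, n2 = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    cr0 = np.zeros((size, size))
    cb0 = np.zeros((size, size))
    # Step 1: s e^{-j tau.n} + conj(s) e^{+j tau.n} = 2 Re(s e^{-j tau.n}),
    # i.e. spectral lines at -tau and +tau (transform constant dropped).
    for (w1, w2), si, ti in zip(taus, s, t):
        carrier = np.exp(-1j * (w1 * n1 + w2 * n2))
        cr0 += 2.0 * np.real(si * carrier)
        cb0 += 2.0 * np.real(ti * carrier)
    # Step 2: subtract the minima (non-negativity).
    nu_r, nu_b = cr0.min(), cb0.min()
    cr1, cb1 = cr0 - nu_r, cb0 - nu_b
    # Step 3: scale so that c_r + c_b never exceeds 1 (convex combination).
    kappa = 1.0 / (cr1 + cb1).max()
    cr2, cb2 = kappa * cr1, kappa * cb1
    # Step 4: green fills the remainder.
    cg2 = 1.0 - cr2 - cb2
    # Step 5: scale so that the largest filter value is exactly 1.
    gamma = 1.0 / max(cr2.max(), cg2.max(), cb2.max())
    return gamma * cr2, gamma * cg2, gamma * cb2, gamma


# Hypothetical parameters: carrier (pi/2, pi) for red, (pi, pi/2) for blue.
taus = [(np.pi / 2, np.pi), (np.pi, np.pi / 2)]
cr, cg, cb, gamma = design_cfa(taus, s=[1.0, 0.0], t=[0.0, 1.0], size=4)
```

For these particular values, some pixel ends up pure green, so γ comes out as 1. The toy pattern therefore illustrates the mechanics rather than a quantum-efficiency gain. A DFT of `cr` shows energy only at DC and at the two bins corresponding to ±(π/2, π).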
Our design strategy assumes bandlimitedness of the difference images $x_\alpha$ and $x_\beta$, so its robustness hinges on how well this assumption holds in practical situations (e.g., under changes in illuminant). Even as the bandwidths of the modulated difference spectra grow, the increased distance between these channels and the baseband component reduces the risk of aliasing. This effectively increases the spatial resolution of the imaging sensor. Consequently, local interpolation methods are less sensitive to the directionality of image features, and a linear demosaicking method suffices for many applications.

As described earlier, linearizing the demosaicking step is attractive for several reasons:

- it can be coded more efficiently on DSP chips,
- it eliminates the temporal pixel-toggling problem in video sequences,
- it provides a more favorable setup for deblurring, and
- it yields more tractable noise and distortion characterizations.

5.4 Linear Demosaicking via Demodulation

In this section, we show that the processing pipeline of a typical digital camera can be exploited to greatly reduce the complexity of reconstruction methods [14]. Suppose the conjugate modulation sequences $c_\alpha(n) = c_r^{(0)}(n)^{-1}$ and $c_\beta(n) = c_b^{(0)}(n)^{-1}$ exist.³ When these sequences are orthogonal, the modulated signal can be recovered via multiplication by the conjugate carrier followed by a lowpass filter. Assuming mutual exclusivity of the supports of $X_g$, $X_\alpha$, and $X_\beta$ in the frequency domain, we expect an exact reconstruction according to

$$\hat{x}(n) = \begin{bmatrix} \hat{x}_r(n) \\ \hat{x}_g(n) \\ \hat{x}_b(n) \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} 1/(\gamma\kappa) & 0 & 0 \\ \nu_r/\gamma & 1/\gamma & \nu_b/\gamma \\ 0 & 0 & 1/(\gamma\kappa) \end{bmatrix} \begin{bmatrix} h_\alpha * \{c_\alpha y\} \\ h_g * y \\ h_\beta * \{c_\beta y\} \end{bmatrix}, \tag{5.4}$$

where $*$ denotes the discrete convolution operator, and the passbands of the lowpass filters $h_\alpha$, $h_g$, $h_\beta$ match the respective bandwidths of the signals $x_\alpha$, $x_g$, $x_\beta$.
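The demodulation principle behind Equation 5.4 can be seen in a one-dimensional toy problem. In the Python sketch below (all names hypothetical), x_g and x_α are random real signals bandlimited to |ω| < π/2, and the carrier is (−1)^n at ω = π. Ideal DFT masks stand in for the lowpass filters h_g and h_α. The scale factors γ and κ and the offset ν_r of Equation 5.4 are omitted (γ = κ = 1, ν_r = 0). The sketch therefore illustrates only the modulate-then-lowpass step, under the mutual-exclusivity assumption.

```python
import numpy as np


def ideal_lowpass(x, cutoff_bin):
    """Keep DFT bins with |k| < cutoff_bin; a stand-in for h_g or h_alpha."""
    spectrum = np.fft.fft(x)
    k = np.fft.fftfreq(len(x)) * len(x)  # signed integer bin indices
    spectrum[np.abs(k) >= cutoff_bin] = 0.0
    return np.real(np.fft.ifft(spectrum))


def random_bandlimited(n, cutoff_bin, rng):
    """Random real signal whose DFT vanishes for |k| >= cutoff_bin."""
    half = np.zeros(n // 2 + 1, dtype=complex)
    half[:cutoff_bin] = rng.normal(size=cutoff_bin) + 1j * rng.normal(size=cutoff_bin)
    return np.fft.irfft(half, n)


N, cutoff = 64, 16  # cutoff N/4 means |omega| < pi/2
rng = np.random.default_rng(1)
x_g = random_bandlimited(N, cutoff, rng)
x_alpha = random_bandlimited(N, cutoff, rng)

carrier = (-1.0) ** np.arange(N)  # modulation at omega = pi
y = x_g + carrier * x_alpha       # toy sensor signal
c_alpha = 1.0 / carrier           # inverse modulation sequence (equals carrier here)

x_g_hat = ideal_lowpass(y, cutoff)                # baseband channel
x_alpha_hat = ideal_lowpass(c_alpha * y, cutoff)  # demodulate, then lowpass
```

Because the modulated copy of x_α occupies bins 17 to 47 while x_g stays within |k| < 16, both channels are recovered to floating-point precision. When the supports overlap, the same filters would pass aliased energy instead.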
Given the mutual exclusivity of the signals $x_\alpha$, $x_g$, $x_\beta$ in the Fourier domain, we assume $c_r^0 h_\alpha + h_g + c_b^0 h_\beta = \delta$, where $\delta(n)$ is again the Kronecker delta function. Using the linearity and modulation properties of convolution, we obtain

$$
\begin{aligned}
h_g * y &= (\delta - c_r^0 h_\alpha - c_b^0 h_\beta) * y \\
&= y - \{c_r^0 h_\alpha\} * y - \{c_b^0 h_\beta\} * y \\
&= y - c_r^0 \{h_\alpha * \{c_\alpha y\}\} - c_b^0 \{h_\beta * \{c_\beta y\}\}.
\end{aligned}
$$

The demodulation in Equation 5.4 in turn takes the following simplified form:

$$
\begin{aligned}
\hat{\mathbf{x}}(n) &=
\begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix}
\begin{bmatrix} \frac{1}{\gamma\kappa} & 0 & 0 \\ \frac{\nu_r}{\gamma} & \frac{1}{\gamma} & \frac{\nu_b}{\gamma} \\ 0 & 0 & \frac{1}{\gamma\kappa} \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 \\ -c_r^0(n) & 1 & -c_b^0(n) \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} h_\alpha * \{c_\alpha y\} \\ y \\ h_\beta * \{c_\beta y\} \end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{\gamma\kappa} + \frac{\nu_r - c_r^0(n)}{\gamma} & \frac{1}{\gamma} & \frac{\nu_b - c_b^0(n)}{\gamma} \\
\frac{\nu_r - c_r^0(n)}{\gamma} & \frac{1}{\gamma} & \frac{\nu_b - c_b^0(n)}{\gamma} \\
\frac{\nu_r - c_r^0(n)}{\gamma} & \frac{1}{\gamma} & \frac{1}{\gamma\kappa} + \frac{\nu_b - c_b^0(n)}{\gamma}
\end{bmatrix}
\begin{bmatrix} h_\alpha * \{c_\alpha y\} \\ y \\ h_\beta * \{c_\beta y\} \end{bmatrix}.
\end{aligned} \tag{5.5}
$$

The first term in Equation 5.5 is a 3 × 3 matrix multiplication (a completely pixelwise operation), whereas the spatial processing is contained in its second term.

³ In this chapter, we do not discuss cases in which the modulation sequences contain zeros; however, the results presented here generalize easily to such cases via an appropriate multiplicative constant.

In the usual layout of a digital camera architecture, a color conversion module follows immediately, converting the tristimulus output of demosaicking to a standard color space representation through another 3 × 3 matrix multiplication on a per-pixel basis. The two cascaded matrix multiplications can therefore be performed together, with the combined matrix computed offline and preloaded into the camera system.

Given sufficient separation of the modulated signals in the frequency domain, crudely designed lowpass filters suffice for the reconstruction task. Suppose we choose to implement Equation 5.5 using a separable two-dimensional odd-length triangle filter, a linear-phase filter with a modest cutoff in the frequency domain.
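The two components of Equation 5.5 can be checked with a short Python/NumPy sketch. The first part verifies that the product of the three factors equals the closed-form pixelwise matrix and then combines it offline with a color conversion matrix; the second part realizes the separable triangle filter as cascaded boxcar (running-sum) filters, in the form given by Equation 5.6. The parameter values for γ, κ, ν_r, ν_b, the color matrix `C`, the assumption that the carriers take the values ±1, and all function names are hypothetical.

```python
import numpy as np

def demod_matrix(gamma, kappa, nu_r, nu_b, cr0, cb0):
    """Closed-form pixelwise 3x3 matrix of Eq. 5.5 where the carriers equal cr0, cb0."""
    a = 1.0 / (gamma * kappa)
    return np.array([
        [a + (nu_r - cr0) / gamma, 1 / gamma, (nu_b - cb0) / gamma],
        [(nu_r - cr0) / gamma,     1 / gamma, (nu_b - cb0) / gamma],
        [(nu_r - cr0) / gamma,     1 / gamma, a + (nu_b - cb0) / gamma],
    ])

def factored_demod_matrix(gamma, kappa, nu_r, nu_b, cr0, cb0):
    """The same matrix as the product of the three factors in the first line of Eq. 5.5."""
    a = 1.0 / (gamma * kappa)
    m1 = np.array([[1, 1, 0], [0, 1, 0], [0, 1, 1]], dtype=float)
    m2 = np.array([[a, 0, 0], [nu_r / gamma, 1 / gamma, nu_b / gamma], [0, 0, a]])
    m3 = np.array([[1, 0, 0], [-cr0, 1, -cb0], [0, 0, 1]], dtype=float)
    return m1 @ m2 @ m3

def boxcar(x, q):
    """Causal length-q moving sum, H(z) = (1 - z^-q) / (1 - z^-1):
    one addition and one subtraction per sample."""
    y = np.empty(len(x))
    acc = 0.0
    for n in range(len(x)):
        acc += x[n]              # integrator 1 / (1 - z^-1)
        if n >= q:
            acc -= x[n - q]      # comb (1 - z^-q)
        y[n] = acc
    return y

def triangle_2d(img, q):
    """Two boxcars per direction: a separable triangle filter of length 2q - 1."""
    out = np.asarray(img, dtype=float)
    for axis in (1, 0):          # horizontal (Z1), then vertical (Z2)
        for _ in range(2):
            out = np.apply_along_axis(boxcar, axis, out, q)
    return out

# Hypothetical parameters and a hypothetical camera-to-output color matrix C.
gamma, kappa, nu_r, nu_b = 1.5, 4.0, 0.5, 0.5
C = np.array([[1.6, -0.4, -0.2], [-0.3, 1.5, -0.2], [0.0, -0.5, 1.5]])

# With +/-1 carriers there are only four distinct (cr0, cb0) pairs, so the
# combined matrices C @ D can be computed offline and stored in a small table.
combined = {(cr0, cb0): C @ demod_matrix(gamma, kappa, nu_r, nu_b, cr0, cb0)
            for cr0 in (1, -1) for cb0 in (1, -1)}
```

Each call to `boxcar` costs one addition and one subtraction per sample, so the four boxcars of one filter amount to eight add/subtract operations, in line with the count of eight adders stated for Equation 5.6. The triangle filter is left unnormalized in this sketch (gain $q^2$ per direction).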
Four cascaded boxcar filters can be used to implement a filter of length $2q - 1$ having the following Z transform, with $Z_1$ and $Z_2$ corresponding to delay lines in the horizontal and vertical directions, respectively:

$$
H_\alpha(Z) = H_\beta(Z) = \frac{1 - Z_1^{-q}}{1 - Z_1^{-1}} \cdot \frac{1 - Z_1^{-q}}{1 - Z_1^{-1}} \cdot \frac{1 - Z_2^{-q}}{1 - Z_2^{-1}} \cdot \frac{1 - Z_2^{-q}}{1 - Z_2^{-1}}. \tag{5.6}
$$

The computational complexity of the above system is eight adders each for $h_\alpha$ and $h_\beta$. Moreover, in 4 × 4 repeating CFAs, the carriers $c_r^0$ and $c_b^0$ are often proportional to sequences of ±1s (and by extension, so are $c_\alpha$ and $c_\beta$). In this case, the multiplication by −1 before addition in Equation 5.6 simply replaces adders with subtracters, which is trivial to implement. The overall per-pixel complexity of the demodulation demosaicking in Equation 5.5 is therefore comparable to that of bilinear interpolation (16 add/subtract operations per full pixel), despite its state-of-the-art image quality.

5.5 Examples and Analysis

In this section, we provide several examples of CFA designs and analyze their performance. These designs, shown in Figure 5.5 and detailed in Table 5.1, were generated in the spirit of Algorithm 5.1 by an exhaustive search over a restricted parameter space $\{\tau_i, s_i, t_i\}$ [15]. Although some CFAs in Figure 5.5 have rectangular geometries, every pixel sensor nevertheless has an equal number of neighboring colors, a condition that helps mitigate cross-talk noise due to leakage of photons and electrons.

TABLE 5.1 Example CFA patterns specified in terms of parameter values $\{\tau_i, s_i, t_i\}$.
| Pattern | $\tau_0$ | $\tau_1$ | Red $s_0$ | Red $s_1$ | Blue $t_0$ | Blue $t_1$ |
|---|---|---|---|---|---|---|
| A | $(\pi, \pi/2)$ | $(\pi, \pi)$ | $1 + j$ | $1$ | $1 + j$ | $-1$ |
| B | $(\pi, \pi/2)$ | $(\pi, \pi)$ | $1 + j$ | $0$ | $0$ | $1$ |
| C | $(\pi, 2\pi/3)$ | $(2\pi/3, \pi)$ | $j$ | $j$ | $j$ | $-j$ |
| D | $(\pi, \pi/3)$ | $(\pi, \pi)$ | $3 + 4j$ | $1$ | $3 - 4j$ | $1$ |
| E | $(\pi, \pi/2)$ | $(\pi, \pi)$ | $1 + j$ | $1$ | $1 + j$ | $-1$ |
| F | $(\pi, \pi/2)$ | $(\pi, \pi)$ | $1 + j$ | $0$ | $0$ | $1$ |

FIGURE 5.5 (See color insert.) Proposed CFAs (top) and resultant log-magnitude spectra (bottom) of a typical color image (here, the flower image). Color coding is used as in Figure 5.3 to distinguish components $X_\alpha$, $X_g$, and $X_\beta$. Subfigures correspond to: (a) pattern A, (b) pattern B, (c) pattern C, and (d) pattern D.

[Figure 5.6: five panels, (a) to (e), each plotting values from 0 to 1 against wavelength (400 to 700 nm); panel (a) shows the blue, green, and red curves.]
FIGURE 5.6 Spectral sensitivity characteristics (a) of a typical Sony CCD sensor [19], and (b) to (e) the corresponding pattern A color filters derived from these characteristics.

The CFA designs are given in Table 5.1 as combinations of prototype red, green, and blue filters; the precise color specifications used in the subsequent demosaicking experiments were derived from a popular Sony CCD quantum efficiency function [19], shown in Figure 5.6a. The resulting spectral responses, shown in Figures 5.6b to 5.6e, may be implemented using subtractive color pigments such as cyan, magenta, and yellow.

FIGURE 5.7 (See color insert.) Bike image sensor data (top row), with nonlinear and linear reconstruction methods shown for clean (middle row) and noisy (bottom row) sensor data.
Individual images correspond to: (a) original image, (b) Bayer CFA sampling, (c) pattern A sampling, (d) nonlinear Bayer reconstruction [8], (e) linear Bayer reconstruction, (f) linear pattern A reconstruction, (g) noisy nonlinear Bayer reconstruction, (h) noisy linear Bayer reconstruction, and (i) noisy linear pattern A reconstruction.

Comparing Figure 5.3 with Figure 5.5, we see that in the latter the spectral copies of $x_\alpha$ and $x_\beta$ are placed farther from the Cartesian axes and the origin, thus achieving better separation of channels in the Fourier domain. The implications of this design improvement can be seen in the demosaicking examples of Figure 5.7. Because demosaicking performance depends on both the algorithm and the CFA, we compare state-of-the-art methods for demosaicking Bayer CFA data with the linear reconstruction methodology outlined in Section 5.4, using the well-known bike test image shown in Figure 5.7a. To this end, Figures 5.7b and 5.7c show simulated sensor data $y(n) = \mathbf{c}(n)^T \mathbf{x}(n)$ for the bike image $\mathbf{x}(n)$, acquired with $\mathbf{c}(n)$ representing the Bayer CFA and pattern A of Figure 5.5, respectively. Figures 5.7d to 5.7f show images demosaicked, respectively, from Bayer CFA data using the iterative nonlinear method of Reference [8], from Bayer CFA data using the linear demosaicking algorithm of Section 5.4, and from pattern A data using the same linear method. This last reconstruction is competitive with the nonlinear Bayer reconstruction of Figure 5.7d and exhibits significantly reduced zipper artifacts. Compared with the purely linear Bayer demosaicking shown in Figure 5.7e, the linear pattern A reconstruction shows a significant gain in fidelity for equal hardware resolution and computational cost.
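Simulated sensor data of the form $y(n) = \mathbf{c}(n)^T\mathbf{x}(n)$ can be generated with a few lines of Python/NumPy. This sketch assumes one phase of the Bayer pattern (tile G R / B G) and models signal-dependent Poisson noise by treating the scaled intensity as a photon count; the scale `photons_at_unity` and all function names are hypothetical, since the chapter does not state its noise parameters. Any other $\mathbf{c}(n)$, including a panchromatic pattern, can be passed to `sense` unchanged.

```python
import numpy as np

def bayer_cfa(height, width):
    """c(n) for the Bayer tile [[G, R], [B, G]] (one assumed phase), shape (H, W, 3)."""
    c = np.zeros((height, width, 3))
    c[0::2, 1::2, 0] = 1.0   # red
    c[0::2, 0::2, 1] = 1.0   # green on even rows
    c[1::2, 1::2, 1] = 1.0   # green on odd rows
    c[1::2, 0::2, 2] = 1.0   # blue
    return c

def sense(x, c):
    """Sensor data y(n) = c(n)^T x(n) for an RGB image x of shape (H, W, 3)."""
    return np.einsum("hwk,hwk->hw", c, x)

def add_poisson_noise(y, photons_at_unity, rng):
    """Signal-dependent noise: use y * photons_at_unity as a Poisson mean, then rescale."""
    return rng.poisson(y * photons_at_unity) / photons_at_unity

rng = np.random.default_rng(1)
x = rng.random((4, 6, 3))          # stand-in RGB image with values in [0, 1]
c = bayer_cfa(4, 6)
y = sense(x, c)                    # clean sensor data
y_noisy = add_poisson_noise(y, photons_at_unity=200.0, rng=rng)
```

For the Bayer CFA every $\mathbf{c}(n)$ is a unit vector, so $c_r + c_g + c_b = 1$ at every pixel; a panchromatic design with $\gamma > 1$ would instead yield larger per-pixel signals for the same scene.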
Finally, Figures 5.7g to 5.7i demonstrate the improved noise resilience of the pattern A design by showing the same three reconstructions applied to sensor data corrupted by simulated Poisson noise. Compared with the reconstructions from Bayer CFA data depicted in Figures 5.7g and 5.7h, the linear pattern A reconstruction of Figure 5.7i renders contributions from signal-dependent noise far less noticeable.

5.6 Conclusion

By considering the interplay between color filter arrays and typical images, we have posed the CFA design problem as one of simultaneously maximizing the spectral support of the luminance and chrominance channels, subject to their mutual exclusivity in the Fourier domain. From this perspective, current design practices were seen to be suboptimal: as image resolution increases, existing CFAs are prone to aliasing, linear reconstruction methods no longer suffice, stronger assumptions must be made about the underlying signal, and additional computational resources are needed to reconstruct the full-color image.

Key to our design paradigm was the notion that the measurement process, an inner product between the color filter array and the image data, induces a modulation in the frequency domain. We therefore chose to modulate the chrominance spectra away from the baseband luminance channel, and in doing so we proposed a constructive method to design a physically realizable CFA by specifying these modulation frequencies directly. This method generates panchromatic CFA designs that mitigate aliasing and admit favorable computation-quality trade-offs. As we have shown, our corresponding linear demosaicking method yields state-of-the-art performance with a complexity comparable to that of bilinear interpolation.

References

[1] K. Parulski and K.E. Spaulding, "Color image processing for digital cameras," in Digital Color Imaging Handbook, G. Sharma (ed.), Boca Raton, FL: CRC Press, 2002, pp. 728–757.
[2] R. Ramanath, W.E. Snyder, Y. Yoo, and M.S. Drew, "Color image processing pipeline," IEEE Signal Processing Magazine, vol. 22, no. 1, pp. 34–43, January 2005.
[3] R. Lukac and K.N. Plataniotis, "Single-sensor camera image processing," in Color Image Processing: Methods and Applications, R. Lukac and K.N. Plataniotis (eds.), Boca Raton, FL: CRC Press / Taylor & Francis, October 2006, pp. 363–392.
[4] B.E. Bayer, "Color imaging array," U.S. Patent 3 971 065, July 1976.
[5] S. Yamanaka, "Solid state color camera," U.S. Patent 4 054 906, August 1977.
[6] M. Parmar and S.J. Reeves, "A perceptually based design methodology for color filter arrays," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Canada, May 2004, vol. III, pp. 473–476.
[7] R. Lukac and K.N. Plataniotis, "Color filter arrays: Design and performance analysis," IEEE Transactions on Consumer Electronics, vol. 51, no. 4, pp. 1260–1267, November 2005.
[8] B.K. Gunturk, J. Glotzbach, Y. Altunbasak, R.W. Schafer, and R.M. Mersereau, "Demosaicking: Color filter array interpolation in single chip digital cameras," IEEE Signal Processing Magazine, vol. 22, no. 1, pp. 44–54, January 2005.
[9] K. Hirakawa and T.W. Parks, "Adaptive homogeneity-directed demosaicing algorithm," IEEE Transactions on Image Processing, vol. 14, no. 3, pp. 360–369, March 2005.
[10] D. Alleysson, S. Süsstrunk, and J. Hérault, "Linear demosaicing inspired by the human visual system," IEEE Transactions on Image Processing, vol. 14, no. 4, pp. 439–449, April 2005.
[11] R. Lukac and K.N. Plataniotis, "Universal demosaicking for imaging pipelines with an RGB color filter array," Pattern Recognition, vol. 38, no. 11, pp. 2208–2212, November 2005.
[12] K. Hirakawa and T.W. Parks, "Joint demosaicking and denoising," IEEE Transactions on Image Processing, vol. 15, no. 8, pp. 2146–2157, August 2006.
[13] E. Dubois, "Filter design for adaptive frequency-domain Bayer demosaicking," in Proceedings of the IEEE International Conference on Image Processing, Atlanta, GA, USA, October 2006, pp. 2705–2708.
[14] K. Hirakawa and P.J. Wolfe, "Second-generation CFA and demosaicking design," in Proceedings of the IS&T/SPIE 19th Annual Symposium on Electronic Imaging, San Jose, CA, USA, January 2008.
[15] K. Hirakawa and P.J. Wolfe, "Spatio-spectral color filter array design for enhanced image fidelity," in Proceedings of the IEEE International Conference on Image Processing, San Antonio, TX, USA, September 2007, vol. 2, pp. 81–84. Extended version submitted to IEEE Transactions on Image Processing, October 2007.
[16] T. Kijima, H. Nakamura, J. Compton, and J. Hamilton, "Image sensor with improved light sensitivity," U.S. Patent Application 2007 0 177 236, August 2007.
[17] D.M. Coppola, H.R. Purves, A.N. McCoy, and D. Purves, "The distribution of oriented contours in the real world," Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 7, pp. 4002–4006, March 1998.
[18] R. Ramanath and W.E. Snyder, "Adaptive demosaicking," Journal of Electronic Imaging, vol. 12, no. 4, pp. 633–642, October 2003.
[19] Sony Corporation, "Diagonal 6 mm (type 1/3) progressive scan CCD image sensor with square pixel for color cameras," http://products.sel.sony.com/semi/PDF/ICX204AK.pdf, 2004.

6 Mosaicking and Demosaicking in the Design of Multispectral Digital Cameras

Lidan Miao, Hairong Qi, and Wesley E. Snyder

6.1 Introduction
6.2 Mosaicked Filter Array Patterns and Their Design Philosophy
6.2.1 Color Filter Arrays
6.2.2 Biological Relevance
6.2.3 Design Requirements for Multispectral Filter Arrays
6.3 A Generic Filter Array Design Method
6.4 A Generic Binary Tree-based Demosaicking Method
6.4.1 Correlation Analysis of Multispectral Images
6.4.2 Band Selection
6.4.3 Pixel Selection
6.4.4 Interpolation
6.5 Experiments and Results
6.5.1 Pure Evaluation of Generated Filter Arrays
6.5.1.1 Static Coefficient
6.5.1.2 Consistency Coefficient
6.5.2 Evaluation of Mosaicked Multispectral Imaging System
6.5.2.1 Effectiveness of Binary Tree and Edge Sensing Method
6.5.2.2 Comparison with Advanced CFA Demosaicking Algorithms
6.6 Conclusions
Acknowledgment
References
6.1 Introduction

In recent years, considerable work has been conducted in multispectral imaging, which extends the capability of color cameras to capture spectral information at multiple wavelengths, including wavelengths beyond visible light. Multispectral images have been widely investigated for their applications in remote sensing [1], [2], where they are used to analyze landscapes and structures from aircraft or satellites. In particular, differences in spectral signatures among various land covers enable the detection and classification of different crops or minerals [3], [4]. Multispectral imaging has also been widely used in biological microscopy to discriminate multiple co-localized fluorescent molecules [5], [6], [7]. With common microscopy methods, the number of molecules that can be detected simultaneously is limited by both spectral and spatial overlap. These issues can be tackled using spectral information, which extends the possibilities for distinguishing multiple proteins, organelles, or functions within a single cell [8]. In biomedical imaging, one potential application of multispectral imagery is the detection of breast cancer at an early stage, when cancer cells are still very small but show aggressive growth activity that can be picked up by infrared imaging [9], [10]. Moreover, multispectral imaging is a significant technology for the acquisition, analysis, and display of accurate color information [11], [12], [13].

Many methods have been used to obtain multispectral imagery [14].
To achieve high spectral resolution, existing imaging systems commonly adopt one of the following techniques:

1) Imaging spectrometers, which use an optical dispersing element such as a grating or prism to split the light into many narrow, adjacent wavelength bands, with the energy in each band measured by a separate detector. Such imaging systems are complex and expensive, and the manufacture of multiple sensor arrays is complicated and delicate. In addition, the data size is limited by the requirements of data storage, transmission, and processing [15].
2) Filter-based spectral imaging, in which images are taken with a static camera fitted with a set of discrete filters. The filters are switched by rotating a filter wheel, yielding a discrete set of multispectral images. However, because the scene may change between exposures, the acquired images need to be registered before other processing algorithms can be applied [14].
3) Quantum well imaging arrays, a technique that still needs years of research before it matures [16].

To achieve an efficient solution for multispectral imaging, we study the application of the multispectral filter array (MSFA), a mosaic array of multiple wavelength-specific filters inspired by the color filter array (CFA) technique used in commercial digital color cameras [17]. Although considerable work has been conducted in the color domain, to the best of our knowledge, no attempt has been made to apply it to multispectral imaging. To acquire multispectral information, instead of using multiple detectors at each pixel location to obtain measurements for different spectral bands, a single photodetector per pixel location, covered by an MSFA, is adopted. In this way, a multispectral camera captures a scene such that each photodetector captures spectral information in only a single band, resulting in a mosaic-like monochrome image called the mosaicked image.
To obtain the full multispectral data, a reconstruction operation, referred to as MSFA demosaicking, is required to estimate the missing spectral components at each pixel location. We call the resulting multispectral image the reconstructed image or the demosaicked image. The system diagram of an MSFA digital camera is shown in Figure 6.1.

[Figure 6.1: actual scene → mosaicking (MSFA) → mosaicked image → demosaicking → reconstructed image, all within the MSFA digital camera.]
FIGURE 6.1 System diagram of an MSFA camera.

[Figure 6.2: spectral (R, G, B) versus spatial axes; a mosaicked row sampled as B, G, R, B, G, R, B, G, R and the corresponding demosaicked image.]
FIGURE 6.2 Illustration of the mosaicking and demosaicking process. The light red, green, and blue pixels in the right figure represent the estimated values.

Figure 6.2 illustrates the mosaicking and demosaicking process for a row of a color image. At each spatial location, only the information in a single band is measured, and for each individual spectral band, a large number of components are missing across the image plane. Compared to a full multispectral imaging system, an MSFA camera trades spatial resolution for spectral resolution. The MSFA technique offers several advantages, such as low cost, exact registration, compact physical setup, and strong robustness, which have made it very attractive to industry.

This chapter focuses on MSFA design methodologies (MSFA mosaicking) and the development of effective reconstruction algorithms (MSFA demosaicking). In Section 6.2, we discuss the underlying design principles, starting with a brief review of color filter arrays and their biological relevance. Three design criteria for MSFAs are then identified and summarized; these form the essential building blocks of this chapter. The core of this chapter is the MSFA design approach and the corresponding demosaicking algorithm, detailed in Section 6.3 and Section 6.4, respectively.
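The mosaicking step in Figures 6.1 and 6.2 amounts to keeping one band per pixel. The Python/NumPy sketch below assumes a hypothetical four-band MSFA built from a repeated 2 × 2 tile and a random stand-in image; `mosaic` and `sparse_planes` are illustrative names, not functions from the chapter. The NaN entries in the per-band planes are the missing components that MSFA demosaicking must estimate.

```python
import numpy as np

def mosaic(cube, pattern):
    """MSFA mosaicking: keep only band pattern[i, j] at each pixel (i, j).

    cube:    (H, W, K) full multispectral image
    pattern: (H, W) integer band indices in [0, K)
    """
    rows, cols = np.indices(pattern.shape)
    return cube[rows, cols, pattern]

def sparse_planes(mosaicked, pattern, n_bands):
    """Per-band planes with NaN at unmeasured components (the demosaicking input)."""
    planes = np.full(pattern.shape + (n_bands,), np.nan)
    rows, cols = np.indices(pattern.shape)
    planes[rows, cols, pattern] = mosaicked
    return planes

rng = np.random.default_rng(0)
cube = rng.random((4, 4, 4))                            # hypothetical 4-band image
pattern = np.tile(np.array([[0, 1], [2, 3]]), (2, 2))   # hypothetical 2x2 MSFA tile, repeated
mono = mosaic(cube, pattern)                            # mosaic-like monochrome image
planes = sparse_planes(mono, pattern, n_bands=4)
print(np.isnan(planes).mean())                          # 0.75 for four equally likely bands
```

With K equally likely bands, (K − 1)/K of all spectral components are missing, which is the spatial resolution the MSFA camera gives up in exchange for spectral resolution.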
In Section 6.3, we present a binary tree-based method to generate MSFAs for both rectangular and hexagonal tessellations. Given the number of spectral bands and the probability of appearance (POA) of each band, the algorithm starts from a checkerboard pattern and generates various MSFAs following a binary tree separation procedure. The demosaicking algorithm addressed in Section 6.4 follows the same tree in the reverse direction and progressively estimates the missing pixel components. The reconstruction algorithm involves three interrelated processes, namely band selection, pixel selection, and interpolation, which exploit spectral correlation to achieve better reconstructions than demosaicking each image plane individually. In Section 6.5, the performance of a multispectral mosaicking and demosaicking system is evaluated from two perspectives. We first evaluate the intrinsic properties of MSFA patterns to see how well the generated patterns satisfy the design criteria. We then assess the entire system by evaluating the reconstructed images in terms of classification accuracy and root mean square error with respect to the full multispectral data. Finally, Section 6.6 concludes this chapter.

6.2 Mosaicked Filter Array Patterns and Their Design Philosophy

In essence, a multispectral camera is simply a visual system that can sense spectral information outside the visible wavelength range, as many animal visual systems do. One important feature of human and animal visual systems is their capability for instant processing and high-resolution discrimination. Many recent efforts have been devoted to emulating human and animal visual systems to achieve cost-effective, high-resolution, real-time imaging systems. The CFA technique is one of the major achievements following this principle.
Owing to the unique advantages of the mosaic technique, the potential application of MSFAs has been studied in References [18] and [19]. Several problems must be addressed before the MSFA technique becomes a reality. First, there is a trade-off between spatial and spectral resolution, as shown in Figure 6.2. To achieve high spectral resolution, fewer samples can be acquired within each band, resulting in low spatial resolution images; on the other hand, for a given imaging field, many more detectors must be integrated on the chip to obtain higher spatial resolution, which increases camera cost and complexity.

There is yet another concern besides the resolution issue. The design of an MSFA is associated with a set of selected spectral bands, and different targets possess different sets of signature spectral bands, that is, the bands that are most effective in discriminating the target from its background. It therefore appears that a multispectral camera can only be specialized for imaging a certain type of target, which is clearly not cost-effective. Fortunately, ongoing research in adaptive imaging [20], [21], [22] has proposed potential solutions to this problem. The Defense Advanced Research Projects Agency (DARPA) has recently initiated an Adaptive Focal Plane Array (AFPA) program [23], which aims to develop a high-performance focal plane array (FPA) that is widely tunable on a pixel-by-pixel basis across the relevant wavebands in the infrared spectrum. With this technology, the array can be reconfigured in real time to meet different application requirements in an ever-changing environment.

The two fundamental procedures involved in the MSFA technique are the design of MSFAs and the demosaicking algorithm. In the color domain, considerable work has been reported on optimally reconstructing the full color image for the popular Bayer array [24].
However, studies on the intrinsic properties of the filter array and on the underlying design principles have been very limited [25], [26]. Moreover, CFA design philosophies apply only in the color domain; their generalization to MSFAs in the multispectral domain is not straightforward and needs further research. Because of the increased number of spectral bands, both the design of MSFAs and the reconstruction of multispectral data are more complicated. Therefore, developing a generic algorithm capable of generalizing the mosaicking and demosaicking processes across different multispectral applications is of great importance. In this section, we summarize our findings on the CFA technique and its biological relevance, from which we identify three design requirements for MSFAs that guide the generic MSFA generation algorithm discussed in the next section.

6.2.1 Color Filter Arrays

Most digital cameras use a rectangular array of light-sensing elements covered by wavelength-specific filters to capture spectral information in different bands. As a result, only one color component is sensed at each pixel location. The resulting mosaic-like image is then processed using spectral interpolation algorithms to estimate the missing color components. The key idea of the filter array technique is illustrated in Figure 6.2.

FIGURE 6.3 Examples of CFAs: (a) Bayer array [27], (b) diagonal Bayer array [28], (c) diagonal stripe [28], (d) Sony RGBE array [29], and (e) spatio-spectral array [26]. Figure 5.2 shows these CFAs in color.
Although some expensive cameras use a set of three light-sensing elements, or place three layers of sensors one over the other, to produce a high-quality image, a single sensor array with a color filter array is the simplest solution, and its disadvantage, namely reduced spatial resolution, will be overcome as it becomes possible to make larger arrays at higher densities.

A color filter array has two basic features: the type of filters and the spatial arrangement of the different filters. Regarding filter type, different color bases have been considered [25], [28], including tristimulus color bases (RGB, YMC), mixed primary/complementary colors (MGCY), and various four-color schemes [29]. Because of the complexity of the demosaicking process and the wide use of the RGB image format for storage, most existing color systems use RGB CFAs [25]. In terms of the spatial arrangement of color filters, the earliest and most widely used CFA is the Bayer array [27] (shown in Figure 6.3a). However, its design philosophy was not followed up and extended until several improvements were proposed in References [30], [31], and [32]. Recently, there has been growing research interest in the interplay between CFA design and the subsequent demosaicking process. Studies in this area have shown that, besides the demosaicking algorithm, the CFA has a great impact on the quality of reconstructed images [25], [26]. A spatio-spectral method is proposed in Reference [26] to improve color filter arrays for enhanced image quality. Figure 6.3 illustrates several examples of CFA patterns; for more discussion of color filter array design, see Reference [25].

6.2.2 Biological Relevance

The idea of using a mosaic filter array instead of multiple photodetectors is inspired by findings on the human visual system (HVS) as well as many other animal visual systems.
In the human retina, three types of cones, with absorbance maxima in the long-, middle-, and short-wavelength regions (L, M, and S cones), are organized into mosaics that tile the retina [33], as illustrated in Figure 6.4a. Only a single type of photoreceptor samples the image scene at any given location. Several studies [33], [34] have attempted to analyze the spatial arrangement of the L, M, and S cones. Reference [35] suggested that the three types of cones are arranged randomly in the human eye. The random arrangement gives rise to clumps containing only a single type of cone, where the eye cannot distinguish colors. In addition, random sampling causes a deterioration of image quality [36]: the authors showed that an irregular retinal mosaic causes a frequency-dependent reduction in signal amplitude and introduces random noise. They further pointed out that retinal irregularity reduces visual acuity, especially for high-frequency signals.

[Figure 6.4 species labels: perch, blue acara, tench, goldfish, rudd, stickleback, corkwing wrasse, shanny, 14-spined stickleback, sapphirine gurnard, gray gurnard, plaice, rockling, rock goby, flounder, lemon sole, dab.]
FIGURE 6.4 (See color insert.) Cone mosaic of human and fish retina (pictures taken from Reference [37]): (a) human cone mosaic, (b) freshwater fish, (c) littoral coastal fish, and (d) deep coastal fish.

The cone mosaic has also been examined in a variety of animal species. Figures 6.4b to 6.4d illustrate the cone mosaic of fish in various aquatic environments. In fresh water, where many wavelengths of light still penetrate the water, fish have more photoreceptors and a more heterogeneous arrangement across species.
As one moves to deeper water, where the available wavelengths of light are more limited, there is less variation in both the types of photopigments and the mosaic arrangements across species. Fish living in deep coastal waters, where the available spectrum is very limited, show the least variation and the fewest photopigments [37]. In addition, researchers have discovered UV-absorbing (300 to 360 nm) cones in the Japanese dace, which enable the fish to see wavelengths down to 360 nm [38]. It has also been found that the mosaic array of most vertebrates is regular. Animals that need high acuity and rely heavily on vision, such as fish [39], [40] and mice [41], [42], possess a very regular mosaic array.

6.2.3 Design Requirements for Multispectral Filter Arrays

Several critical issues in CFA design related to camera manufacturing have been summarized in References [25] and [28]: matching the sensitivity of the HVS, enabling cost-effective reconstruction algorithms, immunity to color artifacts, tolerance to sensor imperfections, and immunity to optical crosstalk among neighboring pixels. These criteria are associated with the two intrinsic features of CFAs, i.e., the selection of filter type and the spatial arrangement of different filters. Inspired by these CFA design characteristics and their biological relevance, we identify three important design requirements for MSFAs: probability of appearance, spectral consistency, and spatial uniformity.

Probability of appearance guides the selection of filters. One criterion used in CFA design (e.g., the Bayer array) is that the pattern must match the sensitivity response of the HVS. Since the HVS is more sensitive to changes in the green spectral band, most CFAs have more pixels sensitive to green than to red or blue.
In the design of MSFAs, the objective is mostly to achieve better target recognition and separation of the object from clutter. We therefore relate the POA, the ratio of the number of samples of a certain filter type to the total number of samples, to the effectiveness of the spectral band in recognizing the target: the spectral band that affects the classification result the most is assigned more pixels in the filter array. Generally, for a specific application scenario, the target(s) of interest are known a priori. An efficient multispectral camera would select a proper subset of spectral bands that maximizes class separability [43], and a number of band selection algorithms [44], [45] have been proposed. In addition to selecting an optimal subset of bands from the original set according to a class-separability criterion, some algorithms [46], [47] rank the selected spectral bands on the basis of eigenvectors and eigenvalues. Knowing the importance of each spectral band, its probability of appearance can then be derived accordingly.

Spectral consistency concerns the spatial locations of the different filters with respect to optical crosstalk, a very common phenomenon in optical imaging systems [48]. As illustrated in Figure 6.5, an incoming photon may pass through the blue filter at an oblique angle and enter the adjacent photodetector under the green filter instead of the one under the blue filter. This contaminates the adjacent pixel's charge packet and generates artifacts in the output image. Since the effect of optical crosstalk cannot be corrected by any image processing method, a suboptimal design strategy is to arrange the filter array mosaic so that the crosstalk is uniformly distributed across the entire imaging plane: a consistent contamination effect causes less damage than an inconsistent artifact, which interferes with object recognition. To meet this design requirement, the pixels of a given spectral band should always have the same pattern of neighbors, a property we refer to as spectral consistency.

FIGURE 6.5 Illustration of optical crosstalk [labeled elements: microlens; red, green, and blue filters; color filter; photodetector; pixel cell; silicon substrate]. Redrawn from Reference [48]. © 2006 IEEE

Spatial uniformity also concerns the spatial arrangement of the different filters. In a mosaicked pattern, each pixel has only one direct measurement, from a single spectral band, so its unmeasured spectral components must be estimated from its neighbors. This requires that the filter array for each spectral band sample the entire image as evenly as possible. If the pixels of a band are distributed densely in some regions and sparsely in others, serious information loss might occur. Biological studies also support the view that a uniform distribution is preferable to a random arrangement.

We consider the above three criteria the most important issues in the design of MSFAs. Although other concerns might exist, the extra constraints they introduce could result in an empty set of solutions; as in the color domain, no single CFA can satisfy all the design criteria listed previously [25].

6.3 A Generic Filter Array Design Method

Since two-dimensional signals are normally digitized and stored as rectangularly sampled arrays, in this chapter we only discuss MSFA generation using rectangular arrays. For an in-depth discussion of the hexagonal tessellation, readers are referred to Reference [18].

Suppose we have selected a set of representative spectral bands and derived their probabilities of appearance. This section reviews the generic MSFA design method [18], with a focus on the spatial arrangement of the various filters.
The design algorithm starts from a checkerboard pattern and generates different MSFAs following a binary tree separation procedure. The binary tree-driven design process guarantees that the pixel distributions of the different spectral bands are uniform and highly correlated. We will show, through case studies, that most of the CFAs currently used in industry can be derived as special cases of MSFAs generated by the generic algorithm.

We adopt the checkerboard pattern as the starting point for generating filter arrays because of several properties it possesses: first, the checkerboard pattern is symmetric horizontally, vertically, and diagonally; second, the black and white blocks are uniformly distributed across the whole board; and third, the pattern has the same sampling frequency in the horizontal and vertical directions. These properties facilitate the generation of MSFA patterns that satisfy all the design requirements.

Suppose we need to generate a K-band filter array in which each spectral band has a specific POA $r_1, \ldots, r_K$, where $r_i = 1/2^{n_i}$ with $n_i$ an integer, and $\sum_{i=1}^{K} r_i = 1$. First, we generate a binary tree with K leaves in which leaf i represents a spectral band with POA $r_i$. Following this binary tree, we treat the original checkerboard as the root and use a combination of decomposition and subsampling operations to generate various patterns, each corresponding to one node of the binary tree. Finally, all the leaf patterns are combined to form a mosaic pattern, which is the desired MSFA satisfying the three design requirements.

Figure 6.6 illustrates the creation of a five-band MSFA using a binary tree with five leaves (Figure 6.6a), generated from the specified probabilities $r = \{1/4, 1/4, 1/4, 1/8, 1/8\}$.
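The chapter does not state how the binary tree is obtained from the POAs. One possibility, assumed here rather than taken from Reference [18], is a Huffman-style merge that repeatedly makes the two least probable subtrees siblings; for dyadic POAs this places every leaf at depth $\log_2(1/r_i)$. The sketch below (the names `poa_tree` and `leaf_depths` are hypothetical) checks the POA constraints with exact fractions and returns the tree as nested 2-tuples. The tree it returns need not be the one in Figure 6.6a; we assume that any tree with the right leaf depths is acceptable.

```python
import heapq
from fractions import Fraction


def poa_tree(poas):
    """Build a binary tree (nested 2-tuples) whose leaf depths match dyadic POAs.

    poas: dict mapping a band label to its POA, each of the form 1/2**n.
    Huffman-style merging is an illustrative choice, not the method of [18].
    """
    items = {band: Fraction(p) for band, p in poas.items()}
    if sum(items.values()) != 1:
        raise ValueError("POAs must sum to 1")
    for band, p in items.items():
        d = p.denominator
        if p.numerator != 1 or d & (d - 1):   # denominator must be a power of two
            raise ValueError(f"POA of {band!r} is not of the form 1/2**n")
    # (probability, tie-breaking counter, subtree); the counter keeps the
    # subtrees themselves from ever being compared.
    heap = [(p, k, band) for k, (band, p) in enumerate(items.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)        # two least probable subtrees ...
        p2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, counter, (a, b)))  # ... become siblings
        counter += 1
    return heap[0][2]


def leaf_depths(tree, depth=0):
    """Map each leaf label to its depth in the tree (root = 0)."""
    if not isinstance(tree, tuple):
        return {tree: depth}
    left, right = tree
    return {**leaf_depths(left, depth + 1), **leaf_depths(right, depth + 1)}


r = {"A": Fraction(1, 4), "B": Fraction(1, 4), "C": Fraction(1, 4),
     "D": Fraction(1, 8), "E": Fraction(1, 8)}
print(poa_tree(r))               # (('A', 'B'), ('C', ('D', 'E')))
print(leaf_depths(poa_tree(r)))  # {'A': 2, 'B': 2, 'C': 2, 'D': 3, 'E': 3}
```

A leaf at depth n receives $1/2^n$ of the pixels, so matching depths to $\log_2(1/r_i)$ is exactly the POA constraint stated above.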
Following this tree, the various patterns are generated through decomposition and subsampling, as shown in Figure 6.6b. Decomposition is applied to the nodes at even levels of the binary tree (including level zero, i.e., the root): the pattern is treated as a checkerboard, and its black and white blocks are divided into two patterns. For example, the patterns labeled 1 and 2 are generated by decomposing the original checkerboard, and the patterns labeled 7 and 8 are obtained by decomposing pattern 3. Subsampling, applied to the nodes at odd levels, downsamples the pattern by a factor of $2^{(\ell+1)/2}$ along the horizontal and vertical directions, where $\ell$ is the level of the pattern being processed. Thus the patterns labeled 3 and 4 are obtained by subsampling pattern 1, and the patterns labeled 5 and 6 by subsampling pattern 2, in each case by a factor of 2. The checkerboard is processed in this way until the hierarchy of patterns has the same structure as the binary tree. The next step is to combine all the leaves into a mosaic pattern, as shown in Figure 6.6c, where the left panel is obtained by combining all the leaf patterns in Figure 6.6b and the right panel is its color representation.

Two of the popular CFAs illustrated in Figure 6.3 are in fact special cases generated by the generic algorithm. For example, combining patterns 2, 3, and 4 and assigning them different colors yields the Bayer array, and the Sony RGBE array is obtained by combining patterns 3, 4, 5, and 6.

One unavoidable constraint of the binary tree-based method is that each POA is limited to the form $1/2^n$. When the desired probabilities do not fit the tree, we substitute the closest such approximation for the original POAs. This approximation is necessary to satisfy the uniform distribution design requirement, which dictates that each pixel always has $2^n$ neighbors.
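The decomposition and subsampling steps can be sketched on a small periodic tile. In this sketch (all function names are hypothetical), a node's pattern is a set of (row, column) sites; "treating the pattern as a checkerboard" is implemented by ranking the occupied rows and columns and splitting on the parity of their sum, and subsampling is implemented by keeping alternate occupied rows, which splits a quincunx pattern into two rectangular lattices with twice the spacing. Both interpretations, the choice of which child is the "black" one, and the requirement that the tile size be a multiple of the pattern period are assumptions of this sketch. The tree of Figure 6.6a is written as the nested tuple (((7, 8), 4), (5, 6)).

```python
from collections import Counter
from fractions import Fraction


def split(pixels, level):
    """Split one node's pattern (a set of (row, col) sites) into two children.

    Even level -> decomposition: checkerboard on the pattern's own lattice.
    Odd level  -> subsampling: keep alternate occupied rows.
    """
    row_rank = {r: k for k, r in enumerate(sorted({i for i, _ in pixels}))}
    col_rank = {c: k for k, c in enumerate(sorted({j for _, j in pixels}))}
    if level % 2 == 0:
        first = {(i, j) for i, j in pixels if (row_rank[i] + col_rank[j]) % 2 == 0}
    else:
        first = {(i, j) for i, j in pixels if row_rank[i] % 2 == 0}
    return first, pixels - first


def build_msfa(tree, size):
    """Label every site of a size x size tile with the leaf (band) owning it."""
    mosaic = [[None] * size for _ in range(size)]

    def visit(node, pixels, level):
        if not isinstance(node, tuple):          # leaf: assign its band label
            for i, j in pixels:
                mosaic[i][j] = node
            return
        first, second = split(pixels, level)
        visit(node[0], first, level + 1)
        visit(node[1], second, level + 1)

    visit(tree, {(i, j) for i in range(size) for j in range(size)}, 0)
    return mosaic


def poa(mosaic):
    """Empirical probability of appearance of each band in the tile."""
    counts = Counter(v for row in mosaic for v in row)
    total = sum(counts.values())
    return {band: Fraction(n, total) for band, n in counts.items()}


# Five-band example; prints the 4 x 4 tile row by row:
# 7 5 8 5 / 6 4 6 4 / 8 5 7 5 / 6 4 6 4
for row in build_msfa((((7, 8), 4), (5, 6)), 4):
    print(*row)
```

With the two-level tree (("R", "B"), "G") the same code returns the familiar 2 x 2 Bayer tile [["R", "G"], ["G", "B"]], consistent with the special case noted above.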
Note that the rectangular domain is similarly constrained: each pixel has either a four-neighborhood or an eight-neighborhood.

From the above case studies, it is easy to see that the filter arrays generated by the generic method have the following characteristics: first, each spectral band is arranged symmetrically and uniformly; second, each band has the same number of neighbors of a given spectral band, although the relative positions of the different bands are not always the same; and third, the probability of appearance of each spectral band is determined by the two separation steps (decomposition and subsampling). The MSFA generation process described above can be formulated mathematically as a sampling problem for multispectral images [18].

FIGURE 6.6 Generic MSFA generation process: (a) binary tree, (b) checkerboard separation, and (c) five-band MSFA generated by combining all the leaf patterns from (b). © 2006 IEEE

6.4 A Generic Binary Tree-based Demosaicking Method

Although there has been considerable research on demosaicking algorithms [49], [50], [51], [52], [53], [54], [55], these methods are confined to the three-band Bayer array and cannot be directly extended to multispectral demosaicking. Because of the increased number of spectral bands in the multispectral domain, the resolution of the known MSFA samples in each band is progressively reduced. On the one hand, this resolution reduction inevitably introduces severe artifacts; on the other hand, the reduced spatial resolution comes with extra spectral information.
The correlation among the spectral bands can potentially provide more information than independent demosaicking of the individual image planes. In the color domain, spectral correlations have been extensively exploited in CFA demosaicking algorithms to obtain better reconstructions [51], [56], [57], [58]. In the following, we first discuss the spectral correlations that can help the demosaicking process, and then present a demosaicking algorithm with three interrelated components that facilitate the exploitation of spectral correlation.

6.4.1 Correlation Analysis of Multispectral Images

A commonly used notion of spectral correlation in CFA demosaicking is the color ratio rule [59] or the color difference rule [51], which states that within a local image region, the ratios or differences between color channels are very similar. Instead of estimating the absolute values in the two chromatic channels (i.e., red and blue), these algorithms estimate the color ratio or difference in order to derive the chrominance value. Since the human visual system is more sensitive to color artifacts than to luminance or saturation errors [60], these schemes can reconstruct full-color images with fewer visible artifacts and sharp edges. Although very promising in the color domain, these rules do not hold in the multispectral domain, as we have analyzed in Reference [19].

Another important inter-band correlation in the color domain is that all color bands possess similar edge information [61], [62], and most wavelet-based demosaicking algorithms exploit this correlation [63], [64]. In the multispectral domain, because of the wide wavelength range and the very different signatures captured by each band, the edge information of the different spectral bands is not identical. However, although different spectral bands might identify different edge locations, they should not produce spurious edges.
In other words, if the edges derived from all spectral bands are combined, the resulting image presents all the edge information of the scene. Figure 6.7 gives an example in which we sum seven edge images (with intensity 1 at edge locations and 0 everywhere else) generated from a seven-band multispectral image using the Canny edge detector; different colors denote different intensity values. In the worst case, if all the images possessed different edge locations, the summation image would have thick edges and all edge pixels would have intensity one. Figure 6.7 shows, however, that only a few pixels have intensity one and most edges are still a single pixel wide, because the edge locations are similar across the spectral bands.

FIGURE 6.7 (See color insert.) Summation of edge images of seven spectral bands. Different colors represent different intensity values (1: red, 2: green, 3: blue, 4: cyan, 5: magenta, 6: yellow, 7: white). © 2006 IEEE

The consistency of edge locations among spectral bands enables better reconstruction of high-frequency information. The essential idea is to identify a band with rich high-frequency detail and then use its edge information to help reconstruct the other image planes. For this purpose, we developed a generic demosaicking algorithm based on the same tree that generates the MSFAs. The algorithm progressively estimates the missing pixel values while exploiting the edge correlation. Three interrelated issues need to be addressed: band selection, i.e., determining the interpolation order of the spectral bands; pixel selection, i.e., determining the pixel interpolation order within each spectral plane; and interpolation, i.e., the algorithm that estimates the missing pixels within each spectral band.
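The edge-summation experiment behind Figure 6.7 can be mimicked on toy data. The sketch below substitutes a thresholded forward-difference gradient for the Canny detector used in the chapter, and the names `edge_map` and `edge_sum_histogram` are hypothetical. It sums the binary edge maps of all bands and counts edge pixels by summed intensity: coinciding edges pile up at high intensities, while disjoint edges leave many pixels at intensity one.

```python
from collections import Counter


def edge_map(img, thresh):
    """Binary edge map from |dx| + |dy| of forward differences (not Canny)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h - 1):
        for j in range(w - 1):
            gx = img[i][j + 1] - img[i][j]
            gy = img[i + 1][j] - img[i][j]
            out[i][j] = 1 if abs(gx) + abs(gy) > thresh else 0
    return out


def edge_sum_histogram(bands, thresh):
    """Sum the per-band edge maps; count edge pixels per summed intensity."""
    maps = [edge_map(b, thresh) for b in bands]
    h, w = len(bands[0]), len(bands[0][0])
    total = [[sum(m[i][j] for m in maps) for j in range(w)] for i in range(h)]
    return Counter(v for row in total for v in row if v > 0)


# Three toy bands with a vertical step at the same column but different contrast
same = [[[10 * (k + 1) if j >= 4 else 0 for j in range(8)] for _ in range(6)]
        for k in range(3)]
print(edge_sum_histogram(same, 5))      # Counter({3: 5}): edges coincide
```

Shifting the step by one column per band instead spreads the same edges over three columns, each with intensity one, which is the "thick edge" worst case described above.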
The following discussion focuses on the rectangular tessellation; the same idea can be extended to the hexagonal domain.

6.4.2 Band Selection

In the multispectral domain, there are normally more than three spectral bands to process, so the order in which bands are selected for interpolation must be determined in advance. As described in Section 6.3, different spectral bands have different POAs. Intuitively, more detailed information is preserved in the bands with higher POAs, and these bands contribute more to a reconstructed image that closely resembles the real scene. Moreover, a reconstructed image plane can assist the interpolation of other spectral bands with lower POAs, based on the spectral correlation of consistent edge locations. For this reason, we start the interpolation with the spectral band that has the highest POA.

In the binary tree, band selection can be viewed as a process of selecting leaf nodes at different tree levels. Nodes at the same level have the same POA, and the deeper the level, the smaller the POA. To select spectral bands in descending order of POA, we start from the first level of the binary tree: if there is a leaf node at this level, it is the first spectral band selected for interpolation. The process continues as the tree level goes deeper. If there is more than one leaf at a certain level, the selection order among these nodes is random. This band selection scheme facilitates the exploitation of spectral correlation: since the band that best preserves the edge information is interpolated first, the estimation of the other bands can use the edge information of this first interpolated image plane, provided that the different bands have similar edge locations.

6.4.3 Pixel Selection

In most demosaicking schemes in the color domain, the missing pixels are estimated based only on known pixel values.
However, in the multispectral domain each spectral band has more missing pixels, and using only the known MSFA samples does not give good results. Here, we present a "progressive" demosaicking method that takes into account the sparsity of the samples in MSFA patterns: some of the missing pixel values are estimated first, and these estimates, together with the known MSFA samples, are then used to estimate the remaining unknown values. It is therefore very important to determine which pixel locations are estimated first and which next.

To effectively utilize the structural features of the patterns in the binary tree, we develop a pixel selection scheme that is a binary tree traversal. Starting from one of the leaf patterns chosen by the band selection component, the algorithm first interpolates the missing band information at the pixel locations of its sibling pattern; it then goes up one level of the binary tree and finds the sibling of its parent pattern. If the parent's sibling is an internal node, the leaf patterns of the subtree under this sibling are investigated. This process continues until the root node is visited. At each step, after the selected pixel locations have been interpolated, the resulting pattern is the same as the parent pattern. Thus, the pixel selection scheme guarantees that all the intermediate patterns during demosaicking are patterns present in the binary tree.

Figure 6.8 illustrates an example of the pixel selection process, in which we aim to reconstruct spectral band 7. Starting from node 7, we first select the pixel locations of its sibling pattern 8 (Figure 6.8b). We use 7/8 to denote the interpolation of the 7 value at an 8 location.
We then go up one level to node 3 and select the pixel locations of its sibling pattern 4 (Figure 6.8c). Continuing this process one more level leads to the internal node 2, which combines the pixel locations of patterns 5 and 6 (Figure 6.8d). The directed dashed lines in Figure 6.8a indicate the traversal trace, and the resulting pattern at each step is shown in Figure 6.8b to Figure 6.8d, respectively. Note that the intermediate patterns are determined by the traversal trace on the binary tree, and the POA at each step is given by $1/2^{\ell}, 1/2^{\ell-1}, \ldots, 1/2$, where $\ell$ is the level of the starting leaf.

FIGURE 6.8 Illustration of the pixel selection process for band 7. (a) The directed dashed lines indicate the traversal trace. (b) The 7 values at pixel locations with known 8 are estimated first, from the known 7s. (c) The 7 values at pixel locations with known 4 are estimated next, from both the known 7s and the 7s estimated in (b). (d) At node 1, the pixel locations of pattern 2 are selected; they combine the pixel locations of nodes 5 and 6.

FIGURE 6.9 The basic patterns: (a) quincunx pattern, and (b) rectangular pattern.

6.4.4 Interpolation

Given a pixel location within a spectral band, chosen by the band and pixel selection schemes described above, the last issue to be investigated is how to estimate the missing pixel values from neighboring pixel information.
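The band-selection and pixel-selection orders referred to above can be sketched together on the nested-tuple tree of Figure 6.6a, (((7, 8), 4), (5, 6)). In this sketch (function names are hypothetical), band selection is a breadth-first walk, so shallower leaves, which have higher POAs, come first; ties within a level are kept left to right, whereas the chapter picks them at random. Pixel selection climbs from a band's leaf to the root and, at each step, collects the leaf patterns under the sibling it meets.

```python
from collections import deque


def leaves(node):
    """Leaf labels of a nested 2-tuple tree, left to right."""
    if not isinstance(node, tuple):
        return [node]
    return leaves(node[0]) + leaves(node[1])


def band_order(tree):
    """Band selection: leaves in breadth-first (descending-POA) order."""
    order, queue = [], deque([tree])
    while queue:
        node = queue.popleft()
        if isinstance(node, tuple):
            queue.extend(node)
        else:
            order.append(node)
    return order


def path_to(node, leaf):
    """Nodes on the path from `node` down to `leaf`, or None if absent."""
    if not isinstance(node, tuple):
        return [node] if node == leaf else None
    for child in node:
        sub = path_to(child, leaf)
        if sub is not None:
            return [node] + sub
    return None


def pixel_selection(tree, band):
    """Per step, the leaf patterns whose sites are filled for `band`."""
    path = path_to(tree, band)
    steps = []
    for parent, child in zip(reversed(path[:-1]), reversed(path[1:])):
        sibling = parent[1] if parent[0] is child else parent[0]
        steps.append(leaves(sibling))
    return steps


tree = (((7, 8), 4), (5, 6))      # the tree of Figure 6.6a
print(band_order(tree))           # [4, 5, 6, 7, 8]
print(pixel_selection(tree, 7))   # [[8], [4], [5, 6]]
```

The output for band 7 reproduces the trace of Figure 6.8: first the 8 sites, then the 4 sites, then the sites of node 2 (patterns 5 and 6).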
The key to the design of a generic demosaicking algorithm is the use of the binary tree. We observe that the set of pixels selected with the binary tree always forms one of two regular distribution patterns, the quincunx or the rectangular, as shown in Figure 6.9. Moreover, through subsampling, all the patterns in the binary tree can be transformed into these two basic patterns. Therefore, the demosaicking of MSFAs ultimately relies on the interpolation of these two basic patterns. To preserve edge details, we adopt edge-sensing interpolation (i.e., a weighted sum of neighboring pixels with weights determined from edge information), which has been used successfully in CFA demosaicking [51], [59].

Let B denote the spectral band being processed, $B_{i,j}$ the known pixel value at spatial location (i, j), and $\hat{B}_{i,j}$ the corresponding estimate. The missing components of the quincunx pattern shown in Figure 6.9a are estimated as the weighted sum of the four nearest neighbors,

$$\hat{B}_{i,j} = \frac{\sum_{s,t} W_{i+s,j+t}\, B_{i+s,j+t}}{\sum_{s,t} W_{i+s,j+t}} \qquad (6.1)$$

with $|s + t| = 1$, $\forall s,t \in \{-1, 0, 1\}$. The weight of each of the two neighboring pixels along the vertical direction is calculated by

$$W_{m,n} = \Big(1 + |B_{m+2,n} - B_{m,n}| + |B_{m-2,n} - B_{m,n}| + \tfrac{1}{2}|B_{m-1,n-1} - B_{m+1,n-1}| + \tfrac{1}{2}|B_{m-1,n+1} - B_{m+1,n+1}|\Big)^{-1} \qquad (6.2)$$

and that of each neighbor along the horizontal direction by

$$W_{m,n} = \Big(1 + |B_{m,n+2} - B_{m,n}| + |B_{m,n-2} - B_{m,n}| + \tfrac{1}{2}|B_{m+1,n-1} - B_{m+1,n+1}| + \tfrac{1}{2}|B_{m-1,n-1} - B_{m-1,n+1}|\Big)^{-1} \qquad (6.3)$$

The weight $W_{m,n}$ therefore decreases as the edge magnitude at location (m, n) increases, so the unknown pixel is interpolated along the edge direction. For the rectangular pattern shown in Figure 6.9b, we only need to estimate the set of shaded pixels, since the resulting pattern is again a quincunx distribution.
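Equations 6.1 to 6.3 can be sketched for a single missing site of a quincunx pattern. Images are lists of rows, borders are handled by wrap-around (an assumption of this sketch, not of the chapter), and the function names are hypothetical. The optional `guide` argument computes the weights on another, already reconstructed band while still averaging the values of B; this anticipates the BTES idea described next and is our reading of it, not code from the chapter.

```python
def _at(img, r, c):
    """Wrap-around access (periodic borders are an assumption of this sketch)."""
    return img[r % len(img)][c % len(img[0])]


def weight_vertical(B, m, n):
    """Eq. (6.2): weight of a vertical neighbor located at (m, n)."""
    return 1.0 / (1 + abs(_at(B, m + 2, n) - _at(B, m, n))
                  + abs(_at(B, m - 2, n) - _at(B, m, n))
                  + 0.5 * abs(_at(B, m - 1, n - 1) - _at(B, m + 1, n - 1))
                  + 0.5 * abs(_at(B, m - 1, n + 1) - _at(B, m + 1, n + 1)))


def weight_horizontal(B, m, n):
    """Eq. (6.3): weight of a horizontal neighbor located at (m, n)."""
    return 1.0 / (1 + abs(_at(B, m, n + 2) - _at(B, m, n))
                  + abs(_at(B, m, n - 2) - _at(B, m, n))
                  + 0.5 * abs(_at(B, m + 1, n - 1) - _at(B, m + 1, n + 1))
                  + 0.5 * abs(_at(B, m - 1, n - 1) - _at(B, m - 1, n + 1)))


def interpolate_quincunx(B, i, j, guide=None):
    """Eq. (6.1): edge-sensing estimate of the missing B value at (i, j).

    Weights come from `guide` if given (e.g. a reconstructed high-POA band),
    otherwise from B itself; the averaged values are always B's.
    """
    W = B if guide is None else guide
    num = den = 0.0
    for s, t in ((-1, 0), (1, 0), (0, -1), (0, 1)):   # |s + t| = 1
        if t == 0:
            w = weight_vertical(W, i + s, j)
        else:
            w = weight_horizontal(W, i, j + t)
        num += w * _at(B, i + s, j + t)
        den += w
    return num / den


# Vertical step edge: columns 0-3 are 0, columns 4-7 are 100.
step = [[0 if c < 4 else 100 for c in range(8)] for _ in range(8)]
print(round(interpolate_quincunx(step, 4, 3), 3))   # 0.247; a plain average gives 25.0
```

Only the known quincunx sites around (i, j) are read, never (i, j) itself. Near the step, the neighbor across the edge receives a small weight, so the estimate stays on its own side of the edge.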
Using the same idea, the unknown shaded pixel values are estimated by the weighted sum of the four diagonal neighbors. The weights along the left diagonal are calculated by

$$W_{m,n} = \Big(1 + |B_{m+2,n+2} - B_{m,n}| + |B_{m-2,n-2} - B_{m,n}| + \tfrac{1}{2}|B_{m,n-2} - B_{m+2,n}| + \tfrac{1}{2}|B_{m-2,n} - B_{m,n+2}|\Big)^{-1} \qquad (6.4)$$

Similarly, the right-diagonal weights are

$$W_{m,n} = \Big(1 + |B_{m+2,n-2} - B_{m,n}| + |B_{m-2,n+2} - B_{m,n}| + \tfrac{1}{2}|B_{m-2,n} - B_{m,n-2}| + \tfrac{1}{2}|B_{m,n+2} - B_{m+2,n}|\Big)^{-1} \qquad (6.5)$$

This edge-sensing approach interpolates the unknown value using pixel weights derived from edge information, so the quality of the estimated edge information directly affects the quality of the reconstructed images. In multispectral imaging, as the number of spectral bands increases, the spatial resolution of certain spectral bands decreases, and edge information derived from a low-resolution band is not reliable. As analyzed before, the edge information in different spectral bands is either similar or partly overlapping, and the spectral band with the highest POA preserves the edge information best. Therefore, the edge information in a high-resolution spectral band can be used to calculate the weights for the low-resolution bands, since the band selection scheme guarantees that the high-resolution bands are reconstructed first. We refer to the proposed method as the binary tree-based edge-sensing (BTES) method.

6.5 Experiments and Results

In the color domain, the CFA has a significant impact on the quality of reconstructed images [25]. Similarly, the characteristics of MSFAs play an important role in determining the maximum information that can be reconstructed from the mosaicked data. To fairly evaluate the two components of the MSFA technique, i.e., mosaicking and demosaicking, we conduct evaluations from two aspects.
First, we evaluate the intrinsic properties of the filter arrays by assessing how well the generated MSFAs satisfy the design criteria. Then, we evaluate the demosaicking algorithms using two performance measures: the commonly used root mean square error (RMSE) and the classification accuracy.

6.5.1 Pure Evaluation of Generated Filter Arrays

From the design requirement analysis, we know that spatial uniformity ensures an equal amount of information across the image plane, so that the missing spectral information can be estimated with the same degree of fidelity everywhere. Spectral consistency, on the other hand, counteracts the artifacts caused by optical crosstalk. To assess these intrinsic properties of the filter arrays, we design two performance metrics that measure spatial uniformity and spectral consistency, referred to as the static coefficient (SC) and the consistency coefficient (CC), respectively.

6.5.1.1 Static Coefficient

In each filter plane, the pixels with known measurements are called active pixels and all the others are dead pixels. To assess the spatial uniformity of an individual band, we consider only the active pixels, which means we are processing images with many "holes." To model the effect of spatially non-adjacent pixels on one another in a principled way, we introduce an electrostatic force model in which the active pixels are interpreted as charged particles carrying a unit charge of the same polarity. We assume that all the active pixels in a filter plane create electrostatic force fields around them. The joint force exerted on the particle of interest within a certain neighborhood is used to define the SC metric.
The size of the neighborhood should be larger than or equal to the minimum distance between any two active pixels, i.e., a neighborhood must include at least two active pixels. Figure 6.10 illustrates an example of particle interaction, where the black blocks indicate active pixels. The center pixel i is the particle of interest, surrounded by pixels 1, ..., 5. The vector $\mathbf{F}_{ki}$ represents the force exerted on pixel i by the neighboring pixel k ($k \in N_i$, where $N_i$ denotes the set of neighboring pixels of i), whose magnitude and direction are determined by

$$\|\mathbf{F}_{ki}\| = \frac{1}{(x_k - x_i)^2 + (y_k - y_i)^2}, \qquad \tan\theta = \frac{y_k - y_i}{x_k - x_i} \qquad (6.6)$$

where $\|\cdot\|$ denotes the magnitude of a vector and θ its direction. The magnitude of the force is thus inversely proportional to the square of the distance between the two pixels, and the force acts along the line connecting the two active pixels, pointing away from the pixel of interest. The total force $\mathbf{F}_i$ exerted on the center pixel i by its neighbors is

$$\mathbf{F}_i = \sum_{k \in N_i} \mathbf{F}_{ki} \qquad (6.7)$$

The SC of an MSFA can then be calculated by

$$SC = 1 - \frac{1}{K} \sum_{j=1}^{K} \frac{1}{1 + \mu_j} \qquad (6.8)$$

where $\mu_j = \frac{1}{N_j} \sum_{i=1}^{N_j} \|\mathbf{F}_i\|$ is the mean force magnitude in spectral band j, $N_j$ is the total number of active pixels in band j, and K is the number of spectral bands. For a given spectral plane, the more uniformly the active pixels are distributed, the smaller $\mu_j$ and hence the smaller the SC. Since Equation 6.8 normalizes SC to lie between 0 and 1, SC = 0 indicates a uniform distribution, whereas the maximum value, SC = 1, implies the least uniformity.

FIGURE 6.10 Illustration of electric force. © 2006 IEEE

6.5.1.2 Consistency Coefficient

In the mosaicking technique, we expect all the pixels of a given spectral band to have the same immediate neighbors, so that the contamination introduced by optical crosstalk has the same effect on all the pixels of this band. To quantify spectral consistency, we introduce the notions of a superpixel and a template superpixel. A superpixel is the combination of a center pixel and its immediate neighbors. A template superpixel is a distinct superpixel whose center pixel belongs to a specific spectral band. For example, the Bayer CFA shown in Figure 6.11a has four template superpixels, illustrated in Figure 6.11b to Figure 6.11e. The two template superpixels in Figure 6.11b and Figure 6.11c have the same G center pixel, so we refer to them as the template superpixels of the G band. Likewise, Figure 6.11d and Figure 6.11e show the template superpixels of the R and B bands, respectively.

For a spectral band j, we identify all template superpixels and label them $T_{j1}, T_{j2}, \ldots, T_{jm}$, where m denotes the number of distinct templates in this band. An optimal design for spectral consistency admits one template superpixel per spectral band. The more template superpixels a band has, the more inconsistent it is across the image plane; the worst case occurs when equal numbers of superpixels match each of the different templates. Let $n_{j1}, n_{j2}, \ldots, n_{jm}$ denote the numbers of superpixels matching the different templates, and let $n_j = n_{j1} + n_{j2} + \cdots + n_{jm}$ be the total number of superpixels in spectral band j. The probability of occurrence of template i is then $p(T_{ji}) = n_{ji}/n_j$. We define the entropy-like metric

$$H_j^{(\text{consistency})} = -\sum_{i=1}^{m} p(T_{ji}) \log p(T_{ji}) \qquad (6.9)$$

to measure spectral consistency, referred to as the consistency entropy. The larger the consistency entropy, the less consistent the pattern. However, the consistency entropy is an overall measure that does not indicate to what degree the template superpixels differ from each other.
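The static coefficient (Equations 6.6 to 6.8) and the consistency entropy (Equation 6.9) can be computed on a small periodic tile. The following is a sketch under several assumptions not fixed by the text: the tile is repeated periodically (so neighbors wrap around), the SC neighborhood is a square window of half-width `half`, superpixels are read row by row, and the logarithm in Equation 6.9 is natural. The function names are hypothetical.

```python
import math
from collections import Counter


def static_coefficient(mosaic, half=2):
    """SC (Eq. 6.8) of a periodic tile, using a (2*half+1)^2 force window."""
    h, w = len(mosaic), len(mosaic[0])
    bands = {v for row in mosaic for v in row}
    acc = 0.0
    for band in bands:
        magnitudes = []
        for i in range(h):
            for j in range(w):
                if mosaic[i][j] != band:
                    continue
                fx = fy = 0.0                    # total force F_i (Eq. 6.7)
                for di in range(-half, half + 1):
                    for dj in range(-half, half + 1):
                        if (di or dj) and mosaic[(i + di) % h][(j + dj) % w] == band:
                            d = math.hypot(di, dj)
                            fx += dj / d ** 3    # (1/d^2) * unit vector (Eq. 6.6)
                            fy += di / d ** 3
                magnitudes.append(math.hypot(fx, fy))
        mu = sum(magnitudes) / len(magnitudes)  # mean |F_i| of this band
        acc += 1.0 / (1.0 + mu)
    return 1.0 - acc / len(bands)


def template_counts(mosaic):
    """Count 3x3 superpixel codes (row by row, wrap-around) per center band."""
    h, w = len(mosaic), len(mosaic[0])
    counts = {}
    for i in range(h):
        for j in range(w):
            code = "".join(str(mosaic[(i + di) % h][(j + dj) % w])
                           for di in (-1, 0, 1) for dj in (-1, 0, 1))
            counts.setdefault(mosaic[i][j], Counter())[code] += 1
    return counts


def consistency_entropy(counter):
    """Eq. (6.9) with the natural logarithm (the base is an assumption)."""
    n = sum(counter.values())
    return -sum((c / n) * math.log(c / n) for c in counter.values())


bayer = [["R", "G", "R", "G"], ["G", "B", "G", "B"]] * 2
print(static_coefficient(bayer))        # essentially 0: the forces cancel
t = template_counts(bayer)
print(sorted(t["G"]))                   # ['GBGRGRGBG', 'GRGBGBGRG']
print(consistency_entropy(t["G"]))      # log 2, about 0.693
```

For the Bayer tile, the two G codes are exactly $T_{G1}$ and $T_{G2}$ of Figure 6.11, each occurring half the time, whereas R and B have a single template and zero entropy. A clustered arrangement (for example, two adjacent columns of one band) yields an SC well above zero, because the forces no longer cancel.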
For example, the two template superpixels of the G band in Figure 6.11b and Figure 6.11c have the same numbers of neighbors of each band (4 Gs, 2 Rs, and 2 Bs), but their relative positions differ. In other cases, superpixels might have neighbors from different spectral bands. For clarity, we refer to the former as a relative position difference (RPD) and to the latter as a spectral band difference (SBD). Intuitively, these two types of differences cause additional inconsistency beyond that caused by the number of template superpixels, and they should be taken into account in the formulation of CC. In addition, the inconsistency caused by SBD is more severe than that caused by RPD. Therefore, we introduce two penalty terms, $p^{SBD}$ and $p^{RPD}$, to account for the contamination introduced by SBD and RPD, respectively.

FIGURE 6.11 Illustration of the superpixel: (a) an 8 × 8 Bayer array, (b, c) two different superpixels of the G spectral band, (d) superpixel of the R spectral band, (e) superpixel of the B spectral band. © 2006 IEEE

Combined with the consistency entropy, the CC of a single band is defined as

$$CC_j = \frac{1}{1 + p_j^{RPD} + p_j^{SBD}} \cdot \frac{1}{1 + H_j^{(\text{consistency})}}$$

and the CC of the entire MSFA is

$$CC = 1 - \frac{1}{K} \sum_{j=1}^{K} CC_j \qquad (6.10)$$

A smaller value of CC thus implies a more consistent pattern, and CC = 0 indicates optimal spectral consistency. The next problem is how to determine the values of $p^{RPD}$ and $p^{SBD}$. It is reasonable to assume that crosstalk occurs only between adjacent pixels (here, we use eight-adjacency for the rectangular tessellation), so the window size for analyzing $p^{RPD}$ and $p^{SBD}$ is 3 × 3. We write each 3 × 3 template superpixel in lexicographic form.
For example, the template superpixels in Figure 6.11b and Figure 6.11c are $T_{G1}$ = GRGBGBGRG and $T_{G2}$ = GBGRGRGBG. With this representation, finding the difference between the template superpixels of a spectral band becomes a problem of finding the "distance" between strings (or codes). We adopt the Levenshtein distance (or edit distance) [65] for this purpose. The Levenshtein distance counts a difference not only when strings have different characters but also when one has a character and the other does not. One of the most popular applications of the Levenshtein distance is spell checking, which looks for the most common typing errors, e.g., character omissions, insertions, and substitutions. The idea is to compute the minimum number of such operations needed to convert one string (code) into another.

FIGURE 6.12 Three-band filter arrays: (a) periodic pattern, (b-d) randomly permuted versions of the periodic pattern with (b) 50 random pixels, (c) 200 random pixels, and (d) 900 random pixels. © 2006 IEEE

We calculate the Levenshtein distance between the different template superpixels. Since the size of the superpixel is always the same in our application, there is no need for insertion or deletion; only two operations, substitution and swapping, are allowed. We produce a numerical score according to the following penalty scores, which are widely used in biological applications:

• The penalty for each match is 0.
• The penalty for each swap among neighborhood pixels is 1.
• The penalty for each mismatch or substitution is 2.

We define the sum of the minimum swapping penalties as $p^{RPD}$ and the sum of the mismatch penalties as $p^{SBD}$. For example, if a spectral band has four template superpixels, there are four possible ways to convert them to the same string (or code).
Then pRPD and pSBD are calculated as the minimum total penalty needed to perform these conversions.

One experiment is conducted by choosing an MSFA generated by the generic method as the initial pattern and then randomly permuting the pixel locations of different spectral bands to produce new random patterns for comparison. The initial pattern tested is a three-band filter array of size 64 × 64 with probabilities of appearance 1/2, 1/4, and 1/4 (see Figure 6.12a). The three permuted patterns (Rand1, Rand2, and Rand3) are obtained by randomly permuting 50, 200, and 900 pixels, respectively (see Figure 6.12b to Figure 6.12d). The quantitative comparisons are shown in Table 6.1. Both the SC and CC values in this table show that the initial pattern exhibits the best spatial uniformity and spectral consistency.

TABLE 6.1 Comparison of SC and CC between a three-band MSFA and its three permuted patterns (Rand1, Rand2, Rand3). © 2006 IEEE

      MSFA    Rand1   Rand2   Rand3
SC    0       0.354   0.589   0.760
CC    0.294   0.997   1.000   1.000

FIGURE 6.13 (See color insert.) Visualization of the two real multispectral data sets and the corresponding class labels: (a) 92AV3C9, band 1; (b) FLC1, band 3; (c) class labels of 92AV3C9 (red = grass, green = tower, blue = corn, cyan = soil, yellow = hay); (d) class labels of FLC1 (red = oats, green = corn, blue = red clover, cyan = bare soil, yellow = wheat). © 2006 IEEE

6.5.2 Evaluation of Mosaicked Multispectral Imaging System

The performance of an MSFA demosaicking algorithm can be evaluated from two perspectives: reconstruction accuracy and target classification accuracy. Several metrics are commonly used in the literature to measure reconstruction accuracy, including the root mean square error (RMSE), the peak signal-to-noise ratio (PSNR), and subjective comparison.
To measure the fidelity of the demosaicked images, we adopt the RMSE metric defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{N_b N_r N_c}\sum_{k=1}^{N_b}\sum_{i=0}^{N_r-1}\sum_{j=0}^{N_c-1}\left[\hat{f}_k(i,j) - f_k(i,j)\right]^2}$$

where $\hat{f}_k$ represents the k-th spectral plane of the demosaicked image and $f_k$ that of the original one; $N_b$, $N_r$, and $N_c$ denote the numbers of spectral bands, rows, and columns of the multispectral image, respectively.

To evaluate the reconstructed images in terms of target detection or recognition performance, classification is carried out on both the full multispectral images and the demosaicked images using a simple k-nearest-neighbor (kNN) classifier [66]. Two sets of real multispectral data [67], widely used in multispectral image analysis, are used to evaluate the proposed method. Figure 6.13a and Figure 6.13b display one spectral band of each data set (Figure 6.13b is only a small segment of the original data set). The 92AV3C9 data set contains 9 spectral bands selected from a June 1992 AVIRIS data cube [68]. The Flightline C1 (FLC1) image was collected with an airborne scanner in June 1966 and contains 12 spectral bands with wavelengths ranging from 0.4 µm to 1.0 µm. These two data sets contain a significant number of vegetative species or ground cover classes and have "ground truth" available. We select five ground cover classes from each data set according to the ground truth provided in References [68] and [69]. Figure 6.13c and Figure 6.13d show the corresponding class labels, where the five colors correspond to the five classes. For each cover class, half of the pixels are used to train the classifier and the other half serve as test samples. In addition, we generate eight synthetic data sets by selecting seven spectral bands from hyperspectral images created by a simulator [70], using the band selection method discussed in Reference [71].
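The two evaluation measures described above, RMSE over all bands and kNN classification accuracy with a half/half train-test split, can be sketched in plain Python. The value k = 3 and the alternating split are assumptions, because the text specifies neither. All function names are hypothetical.

```python
import math
from collections import Counter


def rmse(reconstructed, original):
    """Root mean square error over all N_b bands, N_r rows and N_c columns.

    Both cubes are nested lists indexed [k][i][j] and must have the same shape.
    """
    total, count = 0.0, 0
    for band_hat, band in zip(reconstructed, original):
        for row_hat, row in zip(band_hat, band):
            for a, b in zip(row_hat, row):
                total += (a - b) ** 2
                count += 1
    return math.sqrt(total / count)


def split_half(pixels_by_class):
    """Alternate each class's pixels between the training and test sets (assumed split rule)."""
    train, test = [], []
    for label, pixels in pixels_by_class.items():
        for n, spectrum in enumerate(pixels):
            (train if n % 2 == 0 else test).append((spectrum, label))
    return train, test


def knn_label(train, spectrum, k=3):
    """Majority label among the k training spectra nearest (Euclidean) to `spectrum`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], spectrum))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Percentage of test pixels whose kNN label matches the ground truth."""
    hits = sum(knn_label(train, spectrum, k) == label for spectrum, label in test)
    return 100.0 * hits / len(test)


# Two toy classes of two-band spectra that are far apart.
toy = {"a": [(0, 0), (0, 1), (1, 0), (1, 1)], "b": [(10, 10), (10, 11), (11, 10), (11, 11)]}
train, test = split_half(toy)
print(accuracy(train, test))  # prints 100.0
```

In the experiments, the same classifier would be run once on the original cube and once on the demosaicked cube, and the two accuracies compared.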
Each multispectral image contains a different object. Figure 6.14 shows four of the eight synthetic targets.

FIGURE 6.14 Four examples of the eight synthetic targets. © 2006 IEEE

We treat each target as one class, which gives eight classes in total. The training set likewise consists of half of the target pixels, uniformly selected from each target, and the remaining target pixels are used as test data. To study how MSFA mosaicking and demosaicking performance generalizes to different numbers of spectral bands, we form new multispectral images by selecting different numbers of bands from each of the above data sets. For example, we create five multispectral images from the 92AV3C9 data, containing three to seven bands, respectively. The band selection is performed using the multispectral system [67] developed at Purdue University. The created multispectral images are first sampled using the derived MSFAs to generate the mosaicked images. Then we apply different demosaicking algorithms to reconstruct the full multispectral data. We design two sets of experiments to evaluate the performance of the BTES method. In the first, we investigate the effectiveness of incorporating the binary tree and the edge information in the demosaicking process. In the second, we compare the proposed BTES method with three advanced CFA demosaicking approaches published recently.

6.5.2.1 Effectiveness of Binary Tree and Edge Sensing Method

The proposed BTES approach integrates the binary tree-based scheme and edge-sensing interpolation.
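The sampling step described above, together with the difference between equal weighting and edge-sensing weighting of neighbors, can be illustrated on a toy mosaic. This is not BTES: the binary tree ordering is omitted, and the inverse-difference weights 1/(1 + |a − b|) for opposite neighbor pairs are a hypothetical choice made only for illustration. All names are hypothetical.

```python
# Offsets of the eight neighbours in a 3x3 window (eight-adjacency).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def mosaic(cube, msfa):
    """Keep at each pixel only the band the MSFA assigns to it; cube is indexed [band][row][col]."""
    return [[cube[msfa[i][j]][i][j] for j in range(len(msfa[0]))] for i in range(len(msfa))]


def neighbours(mos, msfa, band, i, j):
    """Map offset -> sampled value for the 8-neighbours of (i, j) that carry `band`."""
    found = {}
    for di, dj in OFFSETS:
        r, c = i + di, j + dj
        if 0 <= r < len(msfa) and 0 <= c < len(msfa[0]) and msfa[r][c] == band:
            found[(di, dj)] = mos[r][c]
    return found


def bilinear(mos, msfa, band, i, j):
    """Non-edge-sensing estimate: all same-band neighbours are weighted equally."""
    values = list(neighbours(mos, msfa, band, i, j).values())
    if not values:
        raise ValueError("no same-band neighbour in the 3x3 window")
    return sum(values) / len(values)


def edge_sensing(mos, msfa, band, i, j):
    """Edge-sensing estimate with hypothetical weights 1 / (1 + |a - b|) per opposite pair (a, b)."""
    nb = neighbours(mos, msfa, band, i, j)
    num = den = 0.0
    for (di, dj), a in nb.items():
        opposite = (-di, -dj)
        if (di, dj) < opposite and opposite in nb:  # visit each opposite pair once
            b = nb[opposite]
            w = 1.0 / (1.0 + abs(a - b))  # a large difference signals an edge, so a small weight
            num += w * (a + b) / 2.0
            den += w
    return num / den if den else bilinear(mos, msfa, band, i, j)


# Toy 3x3 MSFA: band 0 at the centre, band 1 on the cross, band 2 on the corners.
msfa = [[2, 1, 2], [1, 0, 1], [2, 1, 2]]
zeros = [[0.0] * 3 for _ in range(3)]
band1 = [[10.0, 10.0, 90.0]] * 3  # vertical edge between columns 1 and 2
mos = mosaic([zeros, band1, zeros], msfa)
print(bilinear(mos, msfa, 1, 1, 1), round(edge_sensing(mos, msfa, 1, 1, 1), 2))  # prints 30.0 10.49
```

The equal-weight estimate mixes values from across the edge (30.0). The edge-sensing estimate stays close to the smooth side (about 10.49), which is the behavior that distinguishes ES and BTES from BI and BTBI.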
To investigate the effectiveness of these two components, we implement three demosaicking methods that are variants of BTES: the classic bilinear interpolation (BI), which uses neither component; the binary tree-based bilinear interpolation (BTBI); and edge-sensing interpolation without the binary tree (ES). Edge-sensing demosaicking methods (i.e., ES and BTES) assign different weights to individual neighbors when estimating the missing information, whereas non-edge-sensing methods treat all neighboring pixels equally. Binary tree-based methods (i.e., BTBI and BTES) estimate the missing pixels not only from the known MSFA samples but also from MSFA samples already estimated by following the binary tree structure. The classification accuracy achieved by BTES and its three variants on the real multispectral data is summarized in Table 6.2, and the results on the synthetic data are listed in Table 6.3. Table 6.4 and Table 6.5 show the RMSE of the different demosaicking methods on both data sets. From these four tables, we make the following three observations.

TABLE 6.2 Classification accuracy (%) of the original real multispectral data and of reconstructions using different methods. © 2006 IEEE

Images            ORG     BTES    ES      BTBI    BI
92AV3C9 3-band    90.30   88.96   88.12   86.79   86.78
92AV3C9 4-band    91.80   93.65   92.30   92.81   92.14
92AV3C9 5-band    90.80   91.14   90.63   89.63   90.30
92AV3C9 6-band    91.63   91.30   89.80   90.80   90.97
92AV3C9 7-band    89.63   90.47   88.46   88.63   88.79
FLC1 3-band       76.74   78.43   77.83   77.79   77.47
FLC1 4-band       80.79   82.56   82.14   82.32   81.84
FLC1 5-band       82.55   84.39   83.88   83.75   83.53
FLC1 6-band       83.19   85.93   85.29   85.24   84.88
FLC1 7-band       83.00   85.44   84.67   84.56   84.17

TABLE 6.3 Classification accuracy (%) of the original synthetic data and of reconstructions using different methods. © 2006 IEEE

Image     ORG     BTES    ES      BTBI    BI
3-band    67.71   61.98   62.04   60.03   60.54
4-band    69.14   66.82   66.02   65.84   65.15
5-band    69.92   68.79   66.79   65.90   65.50
6-band    70.83   69.24   66.48   64.37   64.80
7-band    73.14   70.79   67.54   65.79   65.71

First, among the demosaicking algorithms evaluated, BTES, in most cases and on average, outperforms its three variants in terms of both classification accuracy and RMSE. We also observe that the binary tree-based methods (i.e., BTES and BTBI) outperform the corresponding schemes without binary tree considerations (i.e., ES and BI).

Second, classification performance cannot be improved simply by increasing the number of spectral bands. As Table 6.2 shows, the four-band image gives the highest accuracy for the 92AV3C9 data, while the six-band image is the best for the FLC1 data. There are two underlying reasons for this phenomenon. First, newly introduced spectral information is not guaranteed to increase the class separability of multispectral data. Second, the MSFA technique trades spatial resolution for spectral resolution: the extra spectral information comes at the cost of lower reconstruction performance, because each band is sampled at a lower spatial resolution. The RMSE of the reconstructed images in Table 6.4 and Table 6.5 further supports this observation. The RMSE values generally increase with the number of spectral bands; that is, the lower the spatial resolution, the worse the reconstruction performance.

Third, the classification accuracy of the demosaicked images is comparable to that of the original data. Interestingly, for the real multispectral scenes, the reconstructed images in most cases yield higher classification accuracy. This is not true for the synthetic data, however, for which the original images always produce the highest classification performance.
This phenomenon is related both to the characteristics of the selected data sets and to an intrinsic feature of the mosaicking and demosaicking process. The real multispectral images are acquired in real-world environments and are affected by sensor noise and many other environmental effects, whereas the synthetic data are generated with no interference at all. We further note that mosaicking and demosaicking, taken together, act as a smoothing filter, since each missing pixel value is interpolated as a weighted sum of its neighbors. This filtering suppresses both noise and outliers in the original images. Therefore, the demosaicked real multispectral images, which contain less noise and fewer outliers than the original data, can yield higher classification accuracy. On the other hand, because high-frequency information is lost, the demosaicked synthetic data yield lower classification performance than the original images, which contain all the information present in the demosaicked images. To validate this analysis, we add 20 dB Gaussian noise to the synthetic images and then perform the mosaicking and demosaicking process.

TABLE 6.4 RMSE of reconstructed real multispectral data using different methods. © 2006 IEEE

Images            BTES    ES      BTBI    BI
92AV3C9 3-band    8.92    9.20    9.11    9.30
92AV3C9 4-band    18.88   18.85   18.82   18.90
92AV3C9 5-band    18.06   17.87   18.32   18.39
92AV3C9 6-band    16.96   17.47   17.30   17.41
92AV3C9 7-band    19.28   19.73   19.61   19.75
FLC1 3-band       3.50    3.85    4.02    3.99
FLC1 4-band       4.27    4.48    4.60    4.66
FLC1 5-band       4.48    4.90    4.80    4.95
FLC1 6-band       4.65    5.24    5.05    5.27
FLC1 7-band       4.77    5.45    5.21    5.47

TABLE 6.5 RMSE of reconstructed synthetic data using different methods. © 2006 IEEE

Image     BTES    ES      BTBI    BI
3-band    4.10    4.37    4.39    4.43
4-band    3.78    4.05    4.03    4.07
5-band    4.36    4.89    4.77    4.87
6-band    5.40    6.28    6.16    6.30
7-band    6.63    7.80    7.70    7.88
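The noise-injection step used in this validation, and the smoothing argument behind it, can be sketched in plain Python. The sketch assumes that "20 dB" means a signal-to-noise ratio of 20 dB relative to the mean signal power. A plain 3 × 3 mean filter stands in for the smoothing effect of mosaicking plus demosaicking. It is not the BTES pipeline, and all function names are hypothetical.

```python
import math
import random


def add_noise_snr(image, snr_db, rng):
    """Add white Gaussian noise with power = mean signal power / 10**(snr_db / 10)."""
    values = [v for row in image for v in row]
    signal_power = sum(v * v for v in values) / len(values)
    sigma = math.sqrt(signal_power / 10 ** (snr_db / 10))
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in image]


def box3(image):
    """3x3 mean filter; border pixels average only the neighbours inside the image."""
    rows, cols = len(image), len(image[0])
    out = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            window = [image[r][c]
                      for r in range(max(0, i - 1), min(rows, i + 2))
                      for c in range(max(0, j - 1), min(cols, j + 2))]
            out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out


def mse(a, b):
    """Mean squared difference between two equally sized images."""
    diffs = [(x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


clean = [[100.0] * 32 for _ in range(32)]  # flat test image: there is no detail to blur
noisy = add_noise_snr(clean, 20.0, random.Random(0))
print(mse(noisy, clean), mse(box3(noisy), clean))
```

On this flat image, the smoothed result is much closer to the clean one than the noisy input is. On an image with real texture, the same averaging would also remove high-frequency detail, which matches the trade-off described for the synthetic data.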
The classification accuracy improvement is defined as

$$ipr = \frac{acc_{de} - acc_{or}}{acc_{or}} \cdot 100\%$$

where $acc_{de}$ and $acc_{or}$ denote the classification accuracy using the demosaicked image and the original data, respectively. The improvement with and without noise is summarized in Table 6.6, in which $acc_{de}^{n}$ and $acc_{or}^{n}$ denote the classification accuracy of the noisy data. We list the classification results of the noise-free signals again in Table 6.6 to facilitate comparison.

TABLE 6.6 Classification improvement between demosaicked and original synthetic images in the noisy (ipr^n) and noise-free (ipr) cases. All values in %. © 2006 IEEE

Image     acc_or  acc_de  ipr     acc_or^n  acc_de^n  ipr^n
3-band    67.71   61.98   -8.46   39.67     62.18     56.75
4-band    69.14   66.82   -3.37   51.54     66.75     29.5
5-band    69.92   68.79   -1.62   56.44     67.87     20.25
6-band    70.83   69.24   -2.26   57.00     66.58     16.82
7-band    73.14   70.79   -3.21   57.77     68.36     18.34

For the noise-free images, the classification improvements are all negative; that is, the demosaicked images produce lower classification accuracy than the original data. For the noisy data, however, the demosaicked images improve classification performance over the original noisy data by up to 56.75%. These results support our analysis of why the original data perform worse than the demosaicked images. In real-world applications, it is impossible to acquire a perfect, noise-free image. Captured images will most likely contain various types of noise. In that case, the images obtained after mosaicking and demosaicking can provide classification performance comparable to that of the original data.

6.5.2.2 Comparison with Advanced CFA Demosaicking Algorithms

The purpose of this experiment is to compare the proposed BTES algorithm with the existing rich collection of CFA demosaicking algorithms. We selected three advanced CFA demosaicking approaches recently published in the literature [51], [56], [58].
These techniques effectively exploit spectral and spatial correlations to suppress artifacts. The algorithm of Reference [51] uses edge-directed interpolation and effectively exploits the color difference correlation. The green channel is interpolated first, and the red and blue channels are then interpolated using the green band information as a correction term. A postprocessing step uses the color difference information (i.e., green-red and green-blue) to reduce color artifacts. The algorithm of Reference [58] formulates demosaicking as an iterative process of reconstructing correlated signals (i.e., the green plane and the red/blue planes) from their subsampled versions. Another reconstruction approach, Reference [56], uses wavelet analysis to decompose the original image into detail subbands. The algorithm enforces similar high-frequency information across the three color planes by updating the detail subbands of the red and blue channels so that they lie within a threshold of those of the green channel. For a fair comparison, instead of modifying the algorithms to handle multiple bands, we choose three adjacent bands (one visual band and two infrared bands) from the multispectral images and treat them as the three color planes. Because the visual band contains more detail, we use it as the green channel and the two infrared bands as the red and blue channels. The quantitative comparisons based on RMSE and classification accuracy are summarized in Table 6.7 and Table 6.8, respectively.

TABLE 6.7 RMSE comparison between BTES and three CFA demosaicking algorithms. © 2006 IEEE

Image       747    dc10   f15    mig    tank0  tank1  tank2  tank3
BTES        3.46   3.77   4.40   4.08   6.21   4.11   4.71   3.71
Alg. [56]   7.66   7.37   8.28   8.89   9.64   7.17   8.62   7.33
Alg. [51]   3.01   3.09   3.88   4.79   5.72   2.49   4.61   2.36
Alg. [58]   4.67   4.58   5.26   5.71   6.65   4.82   5.60   4.55

TABLE 6.8 Classification accuracy (%) of the original and reconstructed images using different demosaicking algorithms. © 2006 IEEE

Alg.      Original  BTES    Alg. [56]  Alg. [51]  Alg. [58]
acc (%)   67.71     61.98   53.77      57.72      50.31

The RMSE comparison shows that the algorithm of Reference [51] generally produces the best results. BTES ranks second and outperforms the algorithms of References [56] and [58] with lower RMSE. The classification results, however, show that BTES performs best, giving higher classification accuracy than the other CFA demosaicking methods. The algorithm of Reference [51] provides better classification performance than the algorithms of References [56] and [58], whose classification accuracy is much lower than that of the original data. In summary, the generic BTES approach provides the highest classification accuracy, although its RMSE is slightly worse than that of the algorithm of Reference [51].

6.6 Conclusions

The primary focus of this chapter is a robust and cost-effective solution for multispectral digital cameras. We investigate the potential of the MSFA technique, which covers a single CCD sensor with a mosaic multispectral filter array and thus produces a mosaic-like image. The missing spectral components are then reconstructed by spectral reconstruction algorithms. Two major issues are discussed in this chapter: the design of MSFAs and the development of effective interpolation algorithms. The binary tree-driven MSFA generation process guarantees that the pixel distributions of the different spectral bands are uniform and highly correlated. These spatial features facilitate the design of a generic demosaicking method based on the same tree, which considers three interrelated issues: band selection, pixel selection, and interpolation.
The development of a generic algorithm enables cost-effective multispectral imaging. The experimental results demonstrate that the mosaicking and demosaicking process effectively preserves classification accuracy for real-world data. This result further supports the MSFA technique as a feasible solution for multispectral cameras.

Acknowledgment

Figure 6.5, Figures 6.10 to 6.12, and Table 6.1 are reprinted from Reference [18]; Figure 6.6, Figure 6.7, and Tables 6.2 to 6.8 are reprinted from Reference [19], with the permission of IEEE.

References

[1] P. Colarusso, L.H. Kidder, I.W. Levin, J.C. Fraser, J.F. Arens, and E.N. Lewis, “Infrared spectroscopic imaging: From planetary to cellular systems,” Applied Spectroscopy, vol. 52, no. 3, pp. 106A–120A, March 1998.
[2] P.J. Curran, “Imaging spectrometry,” Progress in Physical Geography, vol. 18, no. 2, pp. 247–266, June 1994.
[3] G.A. Clark, S.K. Sengupta, W.D. Aimonetti, F. Roeske, and J.D. Donetti, “Multispectral image feature selection for land mine detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 38, no. 1, pp. 304–311, January 2000.
[4] J.S. Salazar, M.W. Koch, and D.A. Yocky, “A novel automatic target recognition approach for multispectral data,” in Proceedings of the SPIE Conference on Imaging Spectrometry VII, Seattle, Washington, July 2002, vol. 4816, pp. 222–241.
[5] M.E. Dickinson, G. Bearman, S. Tille, R. Lansford, and S.E. Fraser, “Multi-spectral imaging and linear unmixing add a whole new dimension to laser scanning fluorescence microscopy,” Biotechniques, vol. 31, no. 6, pp. 1274–1276, June 2001.
[6] T. Haraguchi, T. Shimi, T. Koujin, N. Hashiguchi, and Y. Hiraoka, “Spectral imaging fluorescence microscopy,” Genes to Cells, vol. 7, no. 9, pp. 881–887, September 2002.
[7] Y. Hiraoka, T. Shimi, and T. Haraguchi, “Multispectral imaging fluorescence microscopy for living cells,” Cell Structure and Function, vol. 27, no. 5, pp. 367–374, October 2002.
[8] T. Zimmermann, J. Rietdorf, and R. Pepperkok, “Spectral imaging and its applications in live cell microscopy,” Federation of European Biochemical Societies Letters, vol. 546, no. 1, pp. 87–92, July 2003.
[9] H. Qi and N.A. Diakides, “Thermal infrared imaging in early breast cancer detection: A survey of recent research,” in Proceedings of the IEEE International Conference on Engineering in Medicine and Biology Society, Cancun, Mexico, September 2003, vol. II, pp. 1109–1112.
[10] H. Szu, L. Miao, and H. Qi, “Thermodynamic free-energy minimization for unsupervised fusion of dual-color infrared breast images,” in Proceedings of the Independent Component Analyses, Wavelets, Unsupervised Smart Sensors, and Neural Networks IV at SPIE Defense and Security Symposium, Orlando, FL, USA, April 2006, vol. 6247, pp. 62470P:1–15.
[11] A. Abrardo, “Color constancy from multispectral images,” in Proceedings of the IEEE International Conference on Image Processing, Kobe, Japan, October 1999, vol. 3, pp. 570–574.
[12] H.M.G. Stokman, T. Gevers, and J.J. Koenderink, “Color measurement by imaging spectrometry,” Computer Vision and Image Understanding, vol. 79, no. 2, pp. 236–249, August 2000.
[13] M. Yamaguchi, T. Teraji, K. Ohsawa, T. Uchiyama, H. Motomura, Y. Murakami, and N. Ohyama, “Color image reproduction based on the multispectral and multiprimary imaging: Experimental evaluation,” in Proceedings of the SPIE Conference on Color Imaging: Device Independent Color, Color Hardcopy and Applications VII, San Jose, CA, USA, January 2002, vol. 4663, pp. 15–26.
[14] Y.Y. Schechner and S.K. Nayar, “Generalized mosaicing: Wide field of view multispectral imaging,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 10, pp. 1334–1348, October 2002.
[15] R.B. Smith, “Introduction to hyperspectral imaging.” Available online: http://www.microimages.com/getstart/pdf_new/hyprspec.pdf, 2006.
[16] D.A. Scribner, J. Schuler, and M.R. Kruer, “Infrared multispectral sensors: Re-considering typical design assumptions.” Naval Research Lab., Code 5636, 1998.
[17] K. Parulski and K.E. Spaulding, Digital Color Imaging Handbook, ch. Color image processing for digital cameras, G. Sharma (ed.), Boca Raton, FL: CRC Press, 2002, pp. 728–757.
[18] L. Miao and H. Qi, “The design and evaluation of a generic method for generating mosaicked multispectral filter arrays,” IEEE Transactions on Image Processing, vol. 15, no. 9, pp. 2780–2791, September 2006.
[19] L. Miao, H. Qi, R. Ramanath, and W.E. Snyder, “Binary tree-based generic demosaicking algorithm for multispectral filter arrays,” IEEE Transactions on Image Processing, vol. 15, no. 11, pp. 3550–3558, November 2006.
[20] P.I. Shnitser, I.P. Agurok, S. Sandomirsky, and A. Avakian, “Spectrally adaptive imaging camera for automatic target contrast enhancement,” in Proceedings of the SPIE Conference on Algorithms for Multispectral and Hyperspectral Imagery V, Orlando, FL, USA, April 1999, vol. 3717, pp. 185–195.
[21] D.H. Kim, K. Kolesnikov, A. Kostrzewski, G.S.A.A. Vasiliev, and M.A. Vorontsov, “Adaptive imaging system using image quality metric based on statistical analysis of speckle fields,” in Proceedings of the SPIE Conference on Hybrid Image and Signal Processing VII, Orlando, FL, USA, April 2000, vol. 4044, pp. 177–186.
[22] Y. Jiao, S.R. Bhalotra, H.L. Kung, and D.A. Miller, “Adaptive imaging spectrometer in a time-domain filtering architecture,” Optics Express, vol. 11, no. 17, pp. 1960–1965, August 2003.
[23] “Adaptive focal plane array.” Available online: http://www.darpa.mil/mto/afpa/, 2003.
[24] B.K. Gunturk, J. Glotzbach, Y. Altunbasak, and R.W. Schafer, “Demosaicking: Color filter array interpolation,” IEEE Signal Processing Magazine, vol. 22, no. 1, pp. 44–54, January 2005.
[25] R. Lukac and K.N. Plataniotis, “Color filter arrays: Design and performance analysis,” IEEE Transactions on Consumer Electronics, vol. 51, no. 4, pp. 1260–1267, November 2005.
[26] K. Hirakawa and P. Wolfe, “Spatio-spectral color filter array for enhanced image fidelity,” in Proceedings of the IEEE International Conference on Image Processing, San Antonio, TX, USA, September 2007, vol. II, pp. 81–84.
[27] B.E. Bayer, “Color imaging array,” U.S. Patent 3 971 065, July 1976.
[28] “Fillfactory: The color filter array FAQ.” Available online: http://www.fillfactory.com/htm/technology/htm/rgbfaq.htm.
[29] “Sony press release.” Available online: http://www.sony.net/SonyInfo/News/Press/200307/03029E/.
[30] T. Yamagami, T. Sasaki, and A. Suga, “Image signal processing apparatus having a color filter with offset luminance filter elements,” U.S. Patent 5 323 233, June 1994.
[31] J.F. Hamilton, J.E. Adams, and D.M. Orlicki, “Particular pattern of pixels for a color filter array which is used to derive luminance and chrominance values,” U.S. Patent 6 330 029 B1, December 2001.
[32] E.B. Gindele and A.C. Gallagher, “Sparsely sampled image sensing device with color and luminance photosites,” U.S. Patent 6 476 865 B1, November 2002.
[33] O. Packer and D.R. Williams, The Science of Color. Amsterdam, Boston: Elsevier, 2003.
[34] S. Otake, P.D. Gowdy, and C.M. Cicerone, “The spatial arrangement of L and M cones in the peripheral human retina,” Vision Research, vol. 40, no. 6, pp. 677–693, March 2000.
[35] A. Roorda, A.B. Metha, P. Lennie, and D.R. Williams, “Packing arrangement of the three cone classes in primate retina,” Vision Research, vol. 41, no. 10–11, pp. 1291–1306, May 2001.
[36] A.S. French, A.W. Snyder, and D.G. Stavenga, “Image degradation by an irregular retinal mosaic,” Biological Cybernetics, vol. 27, no. 4, pp. 229–233, December 1977.
[37] M. Siuta, “Color vision in fish.” Available online: http://instruct1.cit.cornell.edu/courses/bionb424/students2004/mas262/neuroanatomy.htm.
[38] G.S. Losey, T.W. Cronin, T.H. Goldsmith, and D. Hyde, “The UV visual world of fishes: A review,” Journal of Fish Biology, vol. 54, no. 5, pp. 921–943, May 1999.
[39] P.A. Raymond, L.K. Barthel, and G.A. Curran, “Developmental patterning of rod and cone photoreceptors in embryonic zebrafish,” Journal of Comparative Neurology, vol. 359, no. 4, pp. 537–550, September 1995.
[40] S. Tohya, A. Mochizuki, and Y. Iwasa, “Formation of cone mosaic of zebrafish retina,” Journal of Theoretical Biology, vol. 200, no. 2, pp. 231–244, September 1999.
[41] Y. Fei, “Development of the cone photoreceptor mosaic in the mouse retina revealed by fluorescent cones in transgenic mice,” Molecular Vision, vol. 9, no. 6, pp. 31–42, February 2003.
[42] M.A. Raven and B.E. Reese, “Mosaic regularity of horizontal cells in the mouse retina is independent of cone photoreceptor innervation,” Investigative Ophthalmology and Visual Science, vol. 44, no. 3, pp. 965–973, March 2003.
[43] N. Keshava, “Best bands selection for detection in hyperspectral processing,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Phoenix, AZ, USA, May 2001, vol. V, pp. 3149–3152.
[44] S.B. Serpico and L. Bruzzone, “A new search algorithm for feature selection in hyperspectral remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 39, no. 7, pp. 1360–1367, July 2001.
[45] J.C. Price, “Spectral band selection for visible-near infrared remote sensing: Spectral-spatial resolution tradeoffs,” IEEE Transactions on Geoscience and Remote Sensing, vol. 35, no. 5, pp. 1277–1285, September 1997.
[46] T.M. Tu, C.H. Chen, J.L. Wu, and C.I. Chang, “A fast two-stage classification method for high-dimensional remote sensing data,” IEEE Transactions on Geoscience and Remote Sensing, vol. 36, no. 1, pp. 182–191, January 1998.
[47] S.G. Bajwa, P. Bajcsy, P. Groves, and L.F. Tian, “Hyperspectral image data mining for band selection in agricultural applications,” Transactions of the American Society of Agricultural Engineers, vol. 47, no. 3, pp. 895–907, May/June 2004.
[48] S.W. Grotta, “Anatomy of a digital camera: Image sensors.” Available online: http://www.extremetech.com/article2/0,3973,15465,00.asp, June 2001.
[49] D. Cok, “Signal processing method and apparatus for producing interpolated chrominance values in a sampled color image signal,” U.S. Patent 4 642 678, February 1987.
[50] R. Lukac, K. Martin, and K.N. Plataniotis, “Demosaicked image postprocessing using local color ratios,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 6, pp. 914–920, June 2004.
[51] W. Lu and Y.P. Tan, “Color filter array demosaicking: New method and performance measures,” IEEE Transactions on Image Processing, vol. 12, no. 10, pp. 1194–1210, October 2003.
[52] K. Hirakawa and T.W. Parks, “Adaptive homogeneity-directed demosaicing algorithm,” IEEE Transactions on Image Processing, vol. 14, no. 3, pp. 360–369, March 2005.
[53] P. Scheunders, “An orthogonal wavelet representation of multivalued images,” IEEE Transactions on Image Processing, vol. 12, no. 6, pp. 718–725, June 2003.
[54] H.J. Trussell and R.E. Hartwig, “Mathematics for demosaicking,” IEEE Transactions on Image Processing, vol. 11, no. 4, pp. 485–492, April 2002.
[55] X. Li and M.T. Orchard, “New edge-directed interpolation,” IEEE Transactions on Image Processing, vol. 10, no. 10, pp. 1521–1527, October 2001.
[56] B.K. Gunturk, Y. Altunbasak, and R.M. Mersereau, “Color plane interpolation using alternating projections,” IEEE Transactions on Image Processing, vol. 11, no. 9, pp. 997–1013, September 2002.
[57] X. Wu and N. Zhang, “Primary-consistent soft-decision color demosaicking for digital cameras,” IEEE Transactions on Image Processing, vol. 13, no. 9, pp. 1263–1274, September 2004.
[58] X. Li, “Demosaicing by successive approximation,” IEEE Transactions on Image Processing, vol. 14, no. 3, pp. 370–379, March 2005.
[59] R. Kimmel, “Demosaicing: Image reconstruction from color CCD samples,” IEEE Transactions on Image Processing, vol. 8, no. 9, pp. 1221–1228, September 1999.
[60] S.C. Pei and I.K. Tam, “Effective color interpolation in CCD color filter arrays using signal correlation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 6, pp. 503–513, June 2003.
[61] X. Wu, W.K. Choi, and P. Bao, “Color restoration from digital camera data by pattern matching,” Proceedings of the SPIE, vol. 3018, pp. 12–17, April 1997.
[62] L. Chang and Y.P. Tan, “Effective use of spatial and spectral correlations for color filter array demosaicking,” IEEE Transactions on Consumer Electronics, vol. 50, no. 1, pp. 355–365, February 2004.
[63] J. Driesen and P. Scheunders, “Wavelet-based color filter array demosaicking,” in Proceedings of the IEEE International Conference on Image Processing, Singapore, October 2004, vol. V, pp. 3311–3314.
[64] L. Chen, K.H. Yap, and Y. He, “Color filter array demosaicking using wavelet-based subband synthesis,” in Proceedings of the IEEE International Conference on Image Processing, Genoa, Italy, September 2005, vol. II, pp. 1002–1005.
[65] V.I. Levenshtein, “Binary codes capable of correcting deletions, insertions and reversals,” Doklady Akademii Nauk SSSR, vol. 163, no. 4, pp. 845–848, 1965.
[66] R. Duda, P. Hart, and D. Stork, Pattern Classification. Wiley-Interscience, 2000.
[67] “Laboratory for applications of remote sensing.” Available online: http://www.lars.purdue.edu.
[68] D. Landgrebe, “Multispectral data analysis: A signal theory perspective.” Available online: http://dynamo.ecn.purdue.edu/ biehl/MultiSpec/Signal Theory.pdf, 1998.
[69] D. Landgrebe, “Multispectral data analysis: A moderate dimension example.” Available online: http://dynamo.ecn.purdue.edu/ biehl/MultiSpec/Moderate Dimension.pdf, 1997.
[70] R. Ramanath, A Framework for Object-characterization and Matching in Multi- and Hyperspectral Imaging Systems. Ph.D. thesis, North Carolina State University, 2003.
[71] R. Ramanath, W.E. Snyder, and H. Qi, “Mosaic multispectral focal plane array cameras,” in Proceedings of the SPIE Defense and Security Symposium, Orlando, FL, USA, April 2004, vol. 5406, pp. 701–712.

7
Color Filter Array Sampling of Color Images: Frequency-Domain Analysis and Associated Demosaicking Algorithms

Eric Dubois

7.1 Introduction
7.2 Geometric Structure of the Color-Filter Array
7.3 Formation and Representation of the CFA Image
7.3.1 Formation of the CFA Image
7.3.2 Frequency-Domain Representation of the CFA Image
7.3.3 Examples
7.3.3.1 Hexagonal Pattern
7.3.3.2 Diagonal Stripe Pattern
7.3.3.3 Four-Color Pattern
7.3.4 Summary
7.4 Demosaicking Based on the Frequency-Domain Representation
7.4.1 The Demosaicking Problem
7.4.2 Algorithms Derived from the Frequency-Domain Representation
7.5 Filter Design for CFA Signal Demultiplexing
7.6 Concluding Remarks
Acknowledgments
Appendix: Lattices and Two-Dimensional Signals on Lattices
References

7.1 Introduction

Color-filter-array (CFA) sampling of color images involves spatial-domain multiplexing of three or more color components of a color image, each on a subset of the lattice consisting of all sensor elements. In the frequency domain, the same operation can be viewed as frequency-domain multiplexing of a luma component at baseband and two or more chrominance components centered at certain spatial modulation frequencies. This view leads to some very efficient demosaicking algorithms that would not normally be evident from the spatial-domain representation. This chapter presents the frequency-domain representation for general periodic CFA structures and describes efficient demosaicking algorithms based on spatial filtering derived from this representation.
The chapter is organized as follows. Section 7.2 describes how the geometric structure of a CFA pattern can be specified using the concepts of lattices, sublattices and cosets, and represented using several matrices. Section 7.3 presents the model for the formation of the CFA image and derives the frequency-domain representation for an arbitrary periodic CFA pattern. Three specific examples are examined in detail in addition to the popular Bayer CFA pattern. Section 7.4 then addresses the demosaicking problem and presents algorithms inspired by the frequency-domain representation, namely frequency-division demultiplexing of the luma and modulated chrominance components. The least-squares approach to designing the filters used in this demosaicking structure is given in Section 7.5, along with a few examples. Some concluding remarks are given in Section 7.6.

The theory of lattices is used extensively in this chapter. A summary of the main notation and properties required is presented in the appendix; more details can be found in References [1] and [2].

7.2 Geometric Structure of the Color-Filter Array

In typical image sensors such as charge-coupled devices (CCDs), the image window $\mathcal{W}$ is partitioned into a set of sensor elements of the same shape, usually rectangular. Each of these sensor elements is assigned to one of $C$ classes, according to the characteristics of the optical filter placed over it. For example, the conventional Bayer CFA has three classes, corresponding to red (R), green (G) and blue (B) filters. The sensor elements are assumed to lie on a lattice $\Lambda$, and the shape of each sensor element is a subset of a unit cell $\mathcal{P}$ of $\Lambda$. In practice the sensor elements are slightly smaller than the unit cell to allow for wiring, but this effect is ignored in this presentation without loss of generality.
A general description of CFA sensors and some specific CFA patterns can be found in References [3] and [4], and an evaluation of several RGB CFA patterns is given in Reference [5]. Several authors have used stacked-matrix formulations to describe CFA patterns (e.g., References [6] and [7]), but in this chapter we give a presentation based on lattices.

Figure 7.1 illustrates the setup for the Bayer structure, showing the upper-left corner of the image window. The lattice $\Lambda$ is a rectangular lattice with equal horizontal and vertical sample spacing $X$, and the unit cell $\mathcal{P}$ is a square of size $X \times X$. The origin of the coordinate system is placed at the center of the upper-left sensor element, with the $y$-axis pointing downward. To simplify notation, the unit-cell dimension $X$ is taken as the unit of length, called the pixel height (px), i.e., $X = 1$ px. With this choice, the lattice $\Lambda$ is simply the integer Cartesian lattice $\mathbb{Z}^2$.

FIGURE 7.1 Upper-left portion of the Bayer CFA sampling structure, showing the constituent sampling structures $\Psi_R$, $\Psi_G$ and $\Psi_B$. The union of these three sampling structures forms the lattice $\Lambda$.

The CFA pattern is assumed to be regular and periodic, as most are. The present development does not apply to non-periodic CFA structures, such as non-periodic pseudo-random structures, but it does apply to periodic ones (see Reference [8] for an example of a periodic pseudo-random CFA). One period of the pattern is replicated on the points of a sublattice $\Gamma$ of $\Lambda$. The number of sensor elements in one period is equal to the index of $\Gamma$ in $\Lambda$, denoted $K = (\Lambda : \Gamma)$. For the example of Figure 7.1, we have $\Gamma = (2\mathbb{Z})^2$ and $(\Lambda : \Gamma) = 4$; one period is indicated by the heavy square in the upper left. Each sensor element in the basic period corresponds to one coset of $\Gamma$ in $\Lambda$. We denote the coset representatives belonging to this basic period as $\mathbf{b}_k$, $k = 1, \ldots, K$; the corresponding cosets are $\mathbf{b}_k + \Gamma$, $k = 1, \ldots, K$. The set of coset representatives can be compactly represented by the $2 \times K$ matrix $\mathbf{B} = [\mathbf{b}_1\ \mathbf{b}_2\ \ldots\ \mathbf{b}_K]$. For the Bayer lattice of Figure 7.1, we can choose $\mathbf{b}_1 = [0\ 0]^T$, $\mathbf{b}_2 = [1\ 0]^T$, $\mathbf{b}_3 = [0\ 1]^T$, $\mathbf{b}_4 = [1\ 1]^T$.

Each sensor class is associated with one or more of these cosets. Let the sampling structure for sensor class $i$ be denoted $\Psi_i \subset \Lambda$. By definition, $\Lambda = \bigcup_{i=1}^{C} \Psi_i$, where the $\Psi_i$ are disjoint subsets of $\Lambda$: $\Psi_i \cap \Psi_j = \emptyset$ for $i \neq j$. Each of the sampling structures is a union of selected cosets of $\Gamma$ in $\Lambda$. If we define $\mathcal{B}_j = \{k \mid \mathbf{b}_k \in \Psi_j\}$, then we have

$$\Psi_j = \bigcup_{k \in \mathcal{B}_j} (\mathbf{b}_k + \Gamma). \tag{7.1}$$

For the Bayer CFA, $\mathcal{B}_R = \{2\}$, $\mathcal{B}_G = \{1, 4\}$, $\mathcal{B}_B = \{3\}$. We can then write explicitly $\Psi_R = \mathbf{b}_2 + \Gamma$, $\Psi_G = (\mathbf{b}_1 + \Gamma) \cup (\mathbf{b}_4 + \Gamma) = \Gamma \cup (\mathbf{b}_4 + \Gamma)$, $\Psi_B = \mathbf{b}_3 + \Gamma$. There is no unique choice of the $\mathbf{b}_k$, but there is often a natural one, such as the one indicated above. Also, the indexing of the coset representatives is arbitrary, but we choose $\mathbf{b}_1 = \mathbf{0}$. The assignment of sensor classes to cosets can be summarized by a $K \times C$ matrix $\mathbf{J}$ defined by

$$[\mathbf{J}]_{ji} = \begin{cases} 1 & \text{if } j \in \mathcal{B}_i, \\ 0 & \text{otherwise}, \end{cases} \qquad j = 1, \ldots, K;\ i = 1, \ldots, C. \tag{7.2}$$

In summary, the geometric structure of a CFA sensor is captured by the number of sensor classes $C$, the sensor lattice $\Lambda$ represented by a sampling matrix $\mathbf{V}_\Lambda$, the CFA periodicity lattice $\Gamma$ represented by a sampling matrix $\mathbf{V}_\Gamma$, and the matrix $\mathbf{J}$ assigning sensor classes to cosets of $\Gamma$ in $\Lambda$.

7.3 Formation and Representation of the CFA Image

7.3.1 Formation of the CFA Image

Assume that $f(\mathbf{x}, \lambda)$ is the spectral light intensity (irradiance) projected at position $\mathbf{x}$ on the plane containing the image sensor by an ideal (pinhole) optical system. A single value $f_{\mathrm{CFA}}[\mathbf{x}]$ is measured at each point of $\Lambda \cap \mathcal{W}$, and is approximated by

$$f_{\mathrm{CFA}}[\mathbf{x}] = \int_{\lambda_{\min}}^{\lambda_{\max}} \left( \int_{\mathbb{R}^2} f(\mathbf{x} - \mathbf{s}, \lambda)\, h_a(\mathbf{s})\, d\mathbf{s} \right) c_i(\lambda)\, d\lambda, \qquad \mathbf{x} \in \Psi_i \cap \mathcal{W},\ i = 1, \ldots, C. \tag{7.3}$$

The spatial convolution with $h_a$ accounts for both blurring by the optical system and integration of the optically and spectrally filtered light irradiance over one sensor element. The filters placed over sensor elements of class $i$ have spectral transmission curve $c_i(\lambda)$, which also includes the effect of any global filter placed in the optical path that affects all classes. It is assumed that all the filters have negligible transmission below $\lambda_{\min}$ and above $\lambda_{\max}$, which correspond to the spectral limits of the human visual system. A typical set of filter spectral responses for RGB can be found in Reference [4]; these responses include a global infrared-stop filter. Note that the measurement of image values may also include a pointwise nonlinearity such as gamma correction [9]. We do not account for such nonlinearities in this chapter, since they do not strongly influence the demosaicking process, but a system designer must be aware of them and handle them correctly.

We define $f_i[\mathbf{x}]$ to be the component corresponding to the $i$th sensor class, defined on the entire lattice $\Lambda$:

$$f_i[\mathbf{x}] = \int_{\lambda_{\min}}^{\lambda_{\max}} \left( \int_{\mathbb{R}^2} f(\mathbf{x} - \mathbf{s}, \lambda)\, h_a(\mathbf{s})\, d\mathbf{s} \right) c_i(\lambda)\, d\lambda, \qquad \mathbf{x} \in \Lambda \cap \mathcal{W},\ i = 1, \ldots, C. \tag{7.4}$$

Of course, the signal $f_i[\mathbf{x}]$ is not measured or available off $\Psi_i$, i.e., on the points $\Lambda \setminus \Psi_i$, and it must be estimated at these points. The CFA signal can be expressed as

$$f_{\mathrm{CFA}}[\mathbf{x}] = \sum_{i=1}^{C} f_i[\mathbf{x}]\, m_i[\mathbf{x}], \tag{7.5}$$

where $m_i[\mathbf{x}]$ is the indicator function of $\Psi_i$,

$$m_i[\mathbf{x}] = \begin{cases} 1, & \mathbf{x} \in \Psi_i, \\ 0, & \mathbf{x} \in \Lambda \setminus \Psi_i. \end{cases} \tag{7.6}$$

This model for the formation of the CFA signal is illustrated in the top portion of Figure 7.2.

7.3.2 Frequency-Domain Representation of the CFA Image

Each of the $m_i[\mathbf{x}]$ is a periodic function on $\Lambda$, with periodicity given by $\Gamma$ (i.e., $m_i[\mathbf{x} + \mathbf{y}] = m_i[\mathbf{x}]$ for all $\mathbf{y} \in \Gamma$), and so can be expressed as a discrete Fourier series.
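The formation model of Equations 7.1 to 7.6 can be sketched in a few lines for the Bayer pattern. This is a minimal illustration under stated assumptions: the classes are ordered R, G, B, the coset representatives and index sets are the ones given above for Figure 7.1, and each full-resolution component $f_i$ is stored as a nested list `planes[i][y][x]`. The helper names (`coset_index`, `indicator`, `cfa_signal`) are hypothetical.

```python
# Coset representatives b_1..b_4 of Gamma = (2Z)^2 in Lambda = Z^2 (Figure 7.1)
B = [(0, 0), (1, 0), (0, 1), (1, 1)]
# Index sets B_i (1-based, as in the text): B_R = {2}, B_G = {1, 4}, B_B = {3}
CLASS_COSETS = {"R": {2}, "G": {1, 4}, "B": {3}}
CLASSES = ["R", "G", "B"]

# Eq. 7.2: [J]_{ji} = 1 if j is in B_i, else 0 (a K x C matrix)
J = [[1 if j in CLASS_COSETS[c] else 0 for c in CLASSES] for j in range(1, len(B) + 1)]


def coset_index(x, y):
    """0-based index k of the coset b_k + Gamma that contains (x, y)."""
    return B.index((x % 2, y % 2))


def indicator(i, x, y):
    """m_i[x] of Eq. 7.6: 1 if (x, y) belongs to Psi_i, 0 otherwise."""
    return J[coset_index(x, y)][i]


def cfa_signal(planes, width, height):
    """f_CFA[x] = sum_i f_i[x] m_i[x] (Eq. 7.5); planes[i][y][x] holds f_i."""
    return [[sum(planes[i][y][x] * indicator(i, x, y) for i in range(len(planes)))
             for x in range(width)]
            for y in range(height)]


# Usage: constant planes R = 10, G = 20, B = 30 on a 4 x 4 window
planes = [[[v] * 4 for _ in range(4)] for v in (10, 20, 30)]
cfa = cfa_signal(planes, 4, 4)
print(cfa[0][:2], cfa[1][:2])   # [20, 10] [30, 20]
```

Because the $\Psi_i$ partition $\Lambda$, exactly one indicator is 1 at every pixel, so the sum in Equation 7.5 simply selects the sampled class.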
Since $\Gamma$ is a sublattice of $\Lambda$, there is an inverse relationship for the reciprocal lattices, namely $\Lambda^* \subset \Gamma^*$, with $(\Gamma^* : \Lambda^*) = K$.

FIGURE 7.2 Block diagram of CFA camera (top) and ideal camera for human observer (bottom).

Let $\{\mathbf{d}_1, \ldots, \mathbf{d}_K\}$ be a set of coset representatives for the $K$ cosets of $\Lambda^*$ in $\Gamma^*$, with $\mathbf{D}$ the $2 \times K$ matrix $\mathbf{D} = [\mathbf{d}_1\ \mathbf{d}_2\ \ldots\ \mathbf{d}_K]$. Again, this choice of coset representatives is not unique, but we choose $\mathbf{d}_1 = \mathbf{0}$ and choose the others to lie in a Voronoi unit cell of $\Lambda^*$. Then, the discrete Fourier series representation of $m_i[\mathbf{x}]$ is given by [2]

$$m_i[\mathbf{x}] = \sum_{k=1}^{K} M_{ki} \exp(j2\pi \mathbf{x} \cdot \mathbf{d}_k), \tag{7.7}$$

where

$$M_{ki} = \frac{1}{K} \sum_{j=1}^{K} m_i[\mathbf{b}_j] \exp(-j2\pi \mathbf{b}_j \cdot \mathbf{d}_k). \tag{7.8}$$

We can limit the sum to the non-zero terms only:

$$M_{ki} = \frac{1}{K} \sum_{j \in \mathcal{B}_i} \exp(-j2\pi \mathbf{b}_j \cdot \mathbf{d}_k). \tag{7.9}$$

We can equivalently define the binary matrix $\mathbf{J}$ of Equation 7.2 by $[\mathbf{J}]_{ji} = m_i[\mathbf{b}_j]$; then the matrix $\mathbf{M} = [M_{ki}]$ is

$$\mathbf{M} = \frac{1}{K} \left[\exp(-j2\pi \mathbf{D}^T \mathbf{B})\right] \mathbf{J}, \tag{7.10}$$

where the exponential of the matrix is evaluated term by term, and postmultiplication by $\mathbf{J}$ is ordinary matrix multiplication.

With this representation, we can express the CFA signal as

$$\begin{aligned}
f_{\mathrm{CFA}}[\mathbf{x}] &= \sum_{i=1}^{C} f_i[\mathbf{x}] \sum_{k=1}^{K} M_{ki} \exp(j2\pi \mathbf{x} \cdot \mathbf{d}_k) \\
&= \sum_{k=1}^{K} \sum_{i=1}^{C} M_{ki} f_i[\mathbf{x}] \exp(j2\pi \mathbf{x} \cdot \mathbf{d}_k) \\
&= \sum_{k=1}^{K} q_k[\mathbf{x}] \exp(j2\pi \mathbf{x} \cdot \mathbf{d}_k) \\
&= \sum_{k=1}^{K} r_k[\mathbf{x}],
\end{aligned} \tag{7.11}$$

where we have identified the new signals

$$q_k[\mathbf{x}] = \sum_{i=1}^{C} M_{ki} f_i[\mathbf{x}], \qquad k = 1, \ldots, K, \tag{7.12}$$

or

$$\mathbf{q}[\mathbf{x}] = \mathbf{M} \mathbf{f}[\mathbf{x}], \tag{7.13}$$

which are different linear combinations of the original components. Here, $\mathbf{q}[\mathbf{x}] = [q_1[\mathbf{x}]\ \ldots\ q_K[\mathbf{x}]]^T$ and $\mathbf{f}[\mathbf{x}] = [f_1[\mathbf{x}]\ \ldots\ f_C[\mathbf{x}]]^T$. The $r_k[\mathbf{x}] = q_k[\mathbf{x}] \exp(j2\pi \mathbf{x} \cdot \mathbf{d}_k)$ are the modulated versions of these components. The matrix $\mathbf{M}$ represents a linear transformation from $\mathbb{R}^C$ to $\mathbb{R}^K$ (to $\mathbb{C}^K$ when $\mathbf{M}$ is complex), where $C \le K$.
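Equation 7.10 translates almost literally into code. The sketch below uses only Python's standard library and evaluates $\mathbf{M} = \frac{1}{K}[\exp(-j2\pi\mathbf{D}^T\mathbf{B})]\mathbf{J}$ for the Bayer values of $\mathbf{B}$, $\mathbf{D}$ and $\mathbf{J}$ (Equations 7.17 to 7.19 in the example that follows). The function name `cfa_matrix` and the choice of storing the columns of $\mathbf{B}$ and $\mathbf{D}$ as lists of pairs are assumptions made for illustration.

```python
import cmath
import math


def cfa_matrix(B, D, J):
    """Return M = (1/K) [exp(-j 2 pi D^T B)] J  (Eq. 7.10) as a K x C list of complex numbers.

    B: K coset representatives b_j of Gamma in Lambda, as (x, y) pairs
    D: K coset representatives d_k of Lambda* in Gamma*, as (u, v) pairs
    J: K x C 0/1 matrix assigning sensor classes to cosets (Eq. 7.2)
    """
    K, C = len(B), len(J[0])
    # E[k][j] = exp(-j 2 pi b_j . d_k): the matrix exponential taken term by term
    E = [[cmath.exp(-2j * math.pi * (b[0] * d[0] + b[1] * d[1])) for b in B] for d in D]
    # Post-multiplication by J is an ordinary matrix product
    return [[sum(E[k][j] * J[j][i] for j in range(K)) / K for i in range(C)]
            for k in range(K)]


# Bayer pattern, classes ordered R, G, B
B = [(0, 0), (1, 0), (0, 1), (1, 1)]
D = [(0, 0), (0.5, 0), (0, 0.5), (0.5, 0.5)]
J = [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
M = cfa_matrix(B, D, J)
for row in M:
    print([round(z.real, 3) for z in row])   # compare with Eq. 7.20; imaginary parts are rounding noise
```

The first row reproduces the baseband weights $|\mathcal{B}_i|/K$ of Equation 7.16, since $\mathbf{d}_1 = \mathbf{0}$ makes every exponential equal to 1.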
If $C = K$, this transformation is invertible using the matrix inverse. If $C < K$, the transformation can be inverted using the pseudo-inverse [10] as follows:

$$\mathbf{f}[\mathbf{x}] = (\mathbf{M}^H \mathbf{M})^{-1} \mathbf{M}^H \mathbf{q}[\mathbf{x}] = \mathbf{M}^{\dagger} \mathbf{q}[\mathbf{x}], \tag{7.14}$$

where $^H$ denotes the conjugate transpose of a matrix (the conjugate must be used if $\mathbf{M}$ is complex). This expression gives the least-squares estimate if $\mathbf{q}$ is not in the range (column space) of $\mathbf{M}$.

Taking the Fourier transform of Equation 7.11 and using the standard modulation property,

$$F_{\mathrm{CFA}}(\mathbf{u}) = \sum_{k=1}^{K} Q_k(\mathbf{u} - \mathbf{d}_k). \tag{7.15}$$

Noting that $\mathbf{d}_1 = \mathbf{0}$ and that the other $\mathbf{d}_k$ are non-zero, the CFA signal is the sum of $q_1[\mathbf{x}]$ at zero frequency (DC) and each of the other $q_k[\mathbf{x}]$ modulated at the non-zero frequency $\mathbf{d}_k$. The DC (or baseband) component $q_1[\mathbf{x}]$ has a particularly simple form, since $\mathbf{d}_1 = \mathbf{0}$. From Equation 7.9,

$$M_{1i} = \frac{1}{K} \sum_{j \in \mathcal{B}_i} \exp(0) = \frac{|\mathcal{B}_i|}{K}, \tag{7.16}$$

where $|\mathcal{B}_i|$ is the number of elements in the set $\mathcal{B}_i$. In other words, the baseband component is a weighted sum of the original components, where the positive weights are the relative sampling densities of the corresponding components. Since these weights sum to 1, if all the input components are equal, the baseband component equals their common value, which is also the value of the CFA signal.

FIGURE 7.3 Reciprocal lattices $\Lambda^*$ (◦) and $\Gamma^*$ (×) and a suitable choice of representatives for the cosets of $\Lambda^*$ in $\Gamma^*$ for the Bayer sampling structure.

This formulation has already been reported for the Bayer RGB CFA mosaic [11]; it is a simplification of an equivalent formulation previously reported in Reference [12]. In general, a CFA signal is characterized in the frequency domain by the set of modulating frequencies $\{\mathbf{d}_k, k = 1, \ldots, K\}$ and the matrix $\mathbf{M} = [M_{ki}]$ that defines the transformed components via the matrix equation $\mathbf{q}[\mathbf{x}] = \mathbf{M}\mathbf{f}[\mathbf{x}]$; all of these are determined by the geometric structure of the CFA pattern. Viewed from the frequency-domain perspective, the CFA signal is equivalent to a frequency-division multiplexing of the $q_k[\mathbf{x}]$. The implied demosaicking algorithm involves separating these components and then obtaining the desired components by a matrixing operation.

The above development can be illustrated by continuing the example of the Bayer RGB CFA, reproducing the results presented in Reference [11] but with the notation of this chapter. The coset representatives of $\Gamma$ in $\Lambda$ were given above, resulting in

$$\mathbf{B} = \begin{bmatrix} 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix}. \tag{7.17}$$

The reciprocal lattices are easily seen to be $\Lambda^* = \mathbb{Z}^2$ and $\Gamma^* = (\tfrac{1}{2}\mathbb{Z})^2$. Figure 7.3 illustrates these reciprocal lattices and a suitable choice of coset representatives for $\Lambda^*$ in $\Gamma^*$, yielding the matrix

$$\mathbf{D} = \begin{bmatrix} 0 & \tfrac12 & 0 & \tfrac12 \\ 0 & 0 & \tfrac12 & \tfrac12 \end{bmatrix}. \tag{7.18}$$

The matrix $\mathbf{J}$ defining the three input channels is

$$\mathbf{J} = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}, \tag{7.19}$$

so that application of Equation 7.10 yields

$$\mathbf{M} = \frac{1}{4} \begin{bmatrix} 1 & 2 & 1 \\ -1 & 0 & 1 \\ 1 & 0 & -1 \\ -1 & 2 & -1 \end{bmatrix}. \tag{7.20}$$

We then see that the four transformed signals $q_k$ are given explicitly by

$$q_1[\mathbf{x}] = \tfrac14 f_1[\mathbf{x}] + \tfrac12 f_2[\mathbf{x}] + \tfrac14 f_3[\mathbf{x}] \tag{7.21}$$

$$q_2[\mathbf{x}] = -\tfrac14 f_1[\mathbf{x}] + \tfrac14 f_3[\mathbf{x}] \tag{7.22}$$

$$q_3[\mathbf{x}] = \tfrac14 f_1[\mathbf{x}] - \tfrac14 f_3[\mathbf{x}] \tag{7.23}$$

$$q_4[\mathbf{x}] = -\tfrac14 f_1[\mathbf{x}] + \tfrac12 f_2[\mathbf{x}] - \tfrac14 f_3[\mathbf{x}]. \tag{7.24}$$

Note that $q_3 = -q_2$. The pseudo-inverse $\mathbf{M}^\dagger$ is given by

$$\mathbf{M}^\dagger = \begin{bmatrix} 1 & -1 & 1 & -1 \\ 1 & 0 & 0 & 1 \\ 1 & 1 & -1 & -1 \end{bmatrix}, \tag{7.25}$$

so that the inverse relationship is

$$f_1[\mathbf{x}] = q_1[\mathbf{x}] - q_2[\mathbf{x}] + q_3[\mathbf{x}] - q_4[\mathbf{x}] \tag{7.26}$$

$$f_2[\mathbf{x}] = q_1[\mathbf{x}] + q_4[\mathbf{x}] \tag{7.27}$$

$$f_3[\mathbf{x}] = q_1[\mathbf{x}] + q_2[\mathbf{x}] - q_3[\mathbf{x}] - q_4[\mathbf{x}]. \tag{7.28}$$

Imposing the constraint $q_3[\mathbf{x}] = -q_2[\mathbf{x}]$, this simplifies to

$$f_1[\mathbf{x}] = q_1[\mathbf{x}] - 2q_2[\mathbf{x}] - q_4[\mathbf{x}] \tag{7.29}$$

$$f_2[\mathbf{x}] = q_1[\mathbf{x}] + q_4[\mathbf{x}] \tag{7.30}$$

$$f_3[\mathbf{x}] = q_1[\mathbf{x}] + 2q_2[\mathbf{x}] - q_4[\mathbf{x}]. \tag{7.31}$$

These four transformed signals are modulated at the frequencies (0.0, 0.0), (0.5, 0.0), (0.0, 0.5) and (0.5, 0.5) (obtained from $\mathbf{D}$), so that

$$F_{\mathrm{CFA}}(u, v) = Q_1(u, v) + Q_2(u - 0.5, v) - Q_2(u, v - 0.5) + Q_4(u - 0.5, v - 0.5). \tag{7.32}$$

We note that there are two separate and independent copies of $Q_2(u, v)$, at (0.5, 0.0) and (0.0, 0.5) respectively. The input components $f_1$, $f_2$ and $f_3$ correspond respectively to $f_R$, $f_G$ and $f_B$ in Reference [11], and the output components $q_1$, $q_2$ and $q_4$ correspond respectively to $f_L$, $f_{C2}$ and $f_{C1}$.

FIGURE 7.4 Two-dimensional power density spectrum estimate of a CFA image with the Bayer sampling structure (axes $u$ and $v$ in c/px, power density in dB; components $Q_1$ to $Q_4$ labeled).

One period of the two-dimensional power density spectrum of a sample CFA image with the Bayer CFA structure is shown in Figure 7.4. This spectrum is obtained using the method of averaging modified periodograms [2]. The different components are easily identified in this figure; compare with Figure 7.3. This spectral diagram also serves to explain the artifacts commonly seen when Bayer CFA images are demosaicked [12]. High-frequency luma patterns intrude into the chrominance bands, resulting in false colors. High-frequency chrominance information intrudes into the luma band, resulting in false luma patterns, often with a zipper-like appearance. These effects are very similar to the luma-chrominance crosstalk familiar from NTSC and PAL composite television signals [13].

7.3.3 Examples

To illustrate these concepts, three additional examples of proposed CFA structures are presented. There is no implication that these are the best of the many proposed structures; rather, they have been selected because they illustrate different scenarios.
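As a numerical sanity check on the Bayer example, the sketch below verifies three claims made above: that the $\mathbf{M}^\dagger$ of Equation 7.25 is a left inverse of the $\mathbf{M}$ of Equation 7.20, that the spatial-domain model (Equation 7.5) and the modulated sum (Equation 7.11) give the same CFA value at every pixel, and that Equations 7.29 to 7.31 recover the input once $q_3 = -q_2$ holds. It uses only the standard library and random, made-up pixel values; the function names are illustrative. Since Equation 7.11 holds pointwise, the same random vector is reused at every pixel.

```python
import cmath
import math
import random

# Bayer quantities from Eqs. 7.17-7.20 and 7.25; classes ordered R, G, B
B = [(0, 0), (1, 0), (0, 1), (1, 1)]
D = [(0, 0), (0.5, 0), (0, 0.5), (0.5, 0.5)]
J = [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
M = [[0.25, 0.5, 0.25], [-0.25, 0.0, 0.25], [0.25, 0.0, -0.25], [-0.25, 0.5, -0.25]]
M_PINV = [[1, -1, 1, -1], [1, 0, 0, 1], [1, 1, -1, -1]]


def matmul(P, Q):
    """Plain matrix product of two nested lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def transformed(f):
    """q = M f (Eq. 7.13)."""
    return [sum(m * fi for m, fi in zip(row, f)) for row in M]


def cfa_spatial(f, x, y):
    """Eq. 7.5: only the class sampled at (x, y) contributes."""
    k = B.index((x % 2, y % 2))
    return sum(fi * jk for fi, jk in zip(f, J[k]))


def cfa_modulated(f, x, y):
    """Eq. 7.11: sum over k of q_k exp(j 2 pi x . d_k)."""
    return sum(qk * cmath.exp(2j * math.pi * (x * d[0] + y * d[1]))
               for qk, d in zip(transformed(f), D))


def rgb_from_constrained_q(q1, q2, q4):
    """Eqs. 7.29-7.31: inverse matrixing after imposing q3 = -q2."""
    return (q1 - 2 * q2 - q4, q1 + q4, q1 + 2 * q2 - q4)


random.seed(1)
f = [random.random() for _ in range(3)]          # hypothetical (R, G, B) values
print(matmul(M_PINV, M))                         # expect the 3 x 3 identity
print(max(abs(cfa_modulated(f, x, y) - cfa_spatial(f, x, y))
          for x in range(4) for y in range(4)))   # rounding error only
q = transformed(f)
print(rgb_from_constrained_q(q[0], q[1], q[3]), f)   # the two should agree
```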
These structures are: (i) a hexagonal array, as used in the Super CCD proposed and manufactured by Fujifilm, which is like a Bayer pattern rotated by 45° and has much in common with the Bayer example already presented [3]; (ii) a diagonal stripe pattern with $C = K = 3$ [8]; and (iii) a four-color pattern with $C = K = 4$.

7.3.3.1 Hexagonal Pattern

The first example concerns a sensor whose elements are placed on a hexagonal lattice. This has been referred to as the pixel interleaved array CCD (PIACCD) or as the SuperCCD [3]. The lattice and CFA structure are shown in Figure 7.5. The Voronoi unit cell is a square (rotated by 45°), but the PIACCD sensor elements are in fact octagonal-shaped subsets of the unit cell. Again, this detail does not affect our analysis; it simply contributes to the precise form of $h_a(\mathbf{x})$. This sampling structure is seen to be equivalent to a Bayer structure rotated by 45°, so there are $C = 3$ sensor classes and $K = 4$ elements in a period of the CFA. The period used here is outlined by the thick border in the top left of Figure 7.5.

FIGURE 7.5 PIACCD CFA structure showing the constituent sampling structures $\Psi_R$, $\Psi_G$ and $\Psi_B$. The union of these three sampling structures forms the lattice $\Lambda$.

We use the distance $X$ indicated in Figure 7.5 as the unit of length (1 px) in the following. Typically, the demosaicked signal on $\Lambda$ would be upsampled to a square lattice with spacing $X$, but we do not consider that step here. By inspection of Figure 7.5, we identify

$$\Lambda = \mathrm{LAT}\begin{pmatrix} 2 & 1 \\ 0 & 1 \end{pmatrix}, \qquad \Gamma = \mathrm{LAT}\begin{pmatrix} 4 & 2 \\ 0 & 2 \end{pmatrix}, \tag{7.33}$$

$$\mathbf{B} = \begin{bmatrix} 0 & 2 & 1 & 3 \\ 0 & 0 & 1 & 1 \end{bmatrix}, \qquad \mathbf{J} = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}. \tag{7.34}$$

The corresponding reciprocal lattices can be found to be

$$\Lambda^* = \mathrm{LAT}\begin{pmatrix} 1 & 0.5 \\ 0 & 0.5 \end{pmatrix}, \qquad \Gamma^* = \mathrm{LAT}\begin{pmatrix} 0.5 & 0.25 \\ 0 & 0.25 \end{pmatrix}. \tag{7.35}$$

These lattices are illustrated in Figure 7.6.
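The reciprocal lattices of Equation 7.35 can be checked mechanically. The sketch below relies on the standard fact that the reciprocal of $\mathrm{LAT}(\mathbf{V})$ is $\mathrm{LAT}((\mathbf{V}^T)^{-1})$, and on the fact that two sampling matrices $\mathbf{V}_1$ and $\mathbf{V}_2$ generate the same lattice exactly when $\mathbf{V}_1^{-1}\mathbf{V}_2$ is an integer matrix with determinant $\pm 1$. It uses exact rational arithmetic from Python's `fractions` module; the helper names are hypothetical.

```python
from fractions import Fraction as Fr


def det2(V):
    """Determinant of a 2 x 2 matrix given as [[a, b], [c, d]]."""
    return V[0][0] * V[1][1] - V[0][1] * V[1][0]


def inv2(V):
    """Exact inverse of a 2 x 2 matrix; entries become Fractions."""
    det = Fr(det2(V))
    return [[V[1][1] / det, -V[0][1] / det],
            [-V[1][0] / det, V[0][0] / det]]


def matmul2(P, Q):
    return [[sum(P[r][m] * Q[m][c] for m in range(2)) for c in range(2)] for r in range(2)]


def reciprocal(V):
    """Sampling matrix (V^T)^{-1} of the reciprocal lattice of LAT(V)."""
    return inv2([[V[0][0], V[1][0]], [V[0][1], V[1][1]]])


def is_integer_matrix(U):
    return all(Fr(u).denominator == 1 for row in U for u in row)


def same_lattice(V1, V2):
    """LAT(V1) == LAT(V2) iff V1^{-1} V2 is an integer matrix with determinant +-1."""
    U = matmul2(inv2(V1), V2)
    return is_integer_matrix(U) and abs(det2(U)) == 1


# Hexagonal (PIACCD) lattices of Eq. 7.33; columns are basis vectors, unit = 1 px
V_LAMBDA = [[2, 1], [0, 1]]
V_GAMMA = [[4, 2], [0, 2]]
# Bases quoted in Eq. 7.35
LAMBDA_STAR = [[1, Fr(1, 2)], [0, Fr(1, 2)]]
GAMMA_STAR = [[Fr(1, 2), Fr(1, 4)], [0, Fr(1, 4)]]

print(same_lattice(reciprocal(V_LAMBDA), LAMBDA_STAR))             # True
print(same_lattice(reciprocal(V_GAMMA), GAMMA_STAR))               # True
print(is_integer_matrix(matmul2(inv2(GAMMA_STAR), LAMBDA_STAR)))   # True: Lambda* is in Gamma*
print(Fr(det2(V_GAMMA), det2(V_LAMBDA)))                           # 4 = (Lambda : Gamma) = K
```

Exact fractions avoid any tolerance question when testing whether an entry is an integer.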
FIGURE 7.6 Reciprocal lattices $\Lambda^*$ (◦) and $\Gamma^*$ (×) and a suitable choice of representatives for the cosets of $\Lambda^*$ in $\Gamma^*$ for the PIACCD structure.

Then, we choose the set of coset representatives for $\Lambda^*$ in $\Gamma^*$ as indicated in the figure, yielding the matrix

$$\mathbf{D} = \begin{bmatrix} 0 & 0.5 & 0.25 & 0.25 \\ 0 & 0 & 0.25 & -0.25 \end{bmatrix}. \tag{7.36}$$

Substitution into Equation 7.10 gives the matrix

$$\mathbf{M} = \frac{1}{4} \begin{bmatrix} 1 & 2 & 1 \\ -1 & 2 & -1 \\ -1 & 0 & 1 \\ 1 & 0 & -1 \end{bmatrix}, \tag{7.37}$$

and the resulting transformed signals

$$q_1[\mathbf{x}] = \tfrac14 f_1[\mathbf{x}] + \tfrac12 f_2[\mathbf{x}] + \tfrac14 f_3[\mathbf{x}] \tag{7.38}$$

$$q_2[\mathbf{x}] = -\tfrac14 f_1[\mathbf{x}] + \tfrac12 f_2[\mathbf{x}] - \tfrac14 f_3[\mathbf{x}] \tag{7.39}$$

$$q_3[\mathbf{x}] = -\tfrac14 f_1[\mathbf{x}] + \tfrac14 f_3[\mathbf{x}] \tag{7.40}$$

$$q_4[\mathbf{x}] = \tfrac14 f_1[\mathbf{x}] - \tfrac14 f_3[\mathbf{x}], \tag{7.41}$$

where we note that $q_4 = -q_3$. In the frequency domain, we have

$$F_{\mathrm{CFA}}(u, v) = Q_1(u, v) + Q_2(u - 0.5, v) + Q_3(u - 0.25, v - 0.25) - Q_3(u - 0.25, v + 0.25). \tag{7.42}$$

7.3.3.2 Diagonal Stripe Pattern

The second example is a diagonal stripe pattern that contains $C = 3$ sensor classes (RGB) and only $K = 3$ elements in a period of the CFA structure. A portion of this CFA pattern, using the same structure as Reference [8], is shown in Figure 7.7.

FIGURE 7.7 Stripe RGB CFA structure showing the constituent sampling structures $\Psi_R$, $\Psi_G$ and $\Psi_B$. The union of these three sampling structures forms the lattice $\Lambda$.

Again, with $X$ as the unit of length, we identify

$$\Lambda = \mathrm{LAT}\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \mathbb{Z}^2, \qquad \Gamma = \mathrm{LAT}\begin{pmatrix} 3 & 1 \\ 0 & 1 \end{pmatrix}, \tag{7.43}$$

and

$$\mathbf{B} = \begin{bmatrix} 0 & 1 & 2 \\ 0 & 0 & 0 \end{bmatrix}. \tag{7.44}$$

The matrix $\mathbf{J}$ defining the CFA structure is

$$\mathbf{J} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}. \tag{7.45}$$

The corresponding reciprocal lattices

$$\Lambda^* = \mathrm{LAT}\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad \Gamma^* = \mathrm{LAT}\begin{pmatrix} 1 & \tfrac13 \\ 0 & -\tfrac13 \end{pmatrix} \tag{7.46}$$

are illustrated in Figure 7.8, along with a suitable choice of coset representatives for $\Lambda^*$ in $\Gamma^*$, giving

$$\mathbf{D} = \begin{bmatrix} 0 & \tfrac13 & -\tfrac13 \\ 0 & -\tfrac13 & \tfrac13 \end{bmatrix}. \tag{7.47}$$

FIGURE 7.8 Reciprocal lattices $\Lambda^*$ (◦) and $\Gamma^*$ (×) and a suitable choice of representatives for the cosets of $\Lambda^*$ in $\Gamma^*$ for the stripe structure.

Substitution into Equation 7.10 gives the matrix

$$\mathbf{M} = \frac{1}{3} \begin{bmatrix} 1 & 1 & 1 \\ 1 & -\tfrac12 + j\tfrac{\sqrt3}{2} & -\tfrac12 - j\tfrac{\sqrt3}{2} \\ 1 & -\tfrac12 - j\tfrac{\sqrt3}{2} & -\tfrac12 + j\tfrac{\sqrt3}{2} \end{bmatrix}. \tag{7.48}$$

The three transformed signals are thus

$$q_1[\mathbf{x}] = \tfrac13 f_1[\mathbf{x}] + \tfrac13 f_2[\mathbf{x}] + \tfrac13 f_3[\mathbf{x}] \tag{7.49}$$

$$q_2[\mathbf{x}] = \tfrac13 f_1[\mathbf{x}] + \left(-\tfrac16 + j\tfrac{1}{2\sqrt3}\right) f_2[\mathbf{x}] + \left(-\tfrac16 - j\tfrac{1}{2\sqrt3}\right) f_3[\mathbf{x}] \tag{7.50}$$

$$q_3[\mathbf{x}] = \tfrac13 f_1[\mathbf{x}] + \left(-\tfrac16 - j\tfrac{1}{2\sqrt3}\right) f_2[\mathbf{x}] + \left(-\tfrac16 + j\tfrac{1}{2\sqrt3}\right) f_3[\mathbf{x}]. \tag{7.51}$$

We see that $q_2[\mathbf{x}]$ and $q_3[\mathbf{x}]$ are complex and that $q_3[\mathbf{x}] = q_2^*[\mathbf{x}]$. The inverse transformation recovers the original components using the matrix $\mathbf{M}^{-1}$,

$$\mathbf{M}^{-1} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & -\tfrac12 - j\tfrac{\sqrt3}{2} & -\tfrac12 + j\tfrac{\sqrt3}{2} \\ 1 & -\tfrac12 + j\tfrac{\sqrt3}{2} & -\tfrac12 - j\tfrac{\sqrt3}{2} \end{bmatrix}. \tag{7.52}$$

In the frequency domain,

$$F_{\mathrm{CFA}}(u, v) = Q_1(u, v) + Q_2\!\left(u - \tfrac13, v + \tfrac13\right) + Q_3\!\left(u + \tfrac13, v - \tfrac13\right). \tag{7.53}$$

Although there is no problem with this complex formulation, we can avoid the use of complex signals by expressing the modulation as a quadrature modulation of real signals. Thus, rather than considering the two complex modulated signals (with $\mathbf{x} = (x, y)$)

$$q_2[\mathbf{x}] \exp\!\left(j2\pi\left(\tfrac{x}{3} - \tfrac{y}{3}\right)\right) + q_3[\mathbf{x}] \exp\!\left(j2\pi\left(-\tfrac{x}{3} + \tfrac{y}{3}\right)\right), \tag{7.54}$$

we can consider the equivalent real, quadrature-modulated signals

$$q'_2[\mathbf{x}] \cos\!\left(2\pi\left(\tfrac{x}{3} - \tfrac{y}{3}\right)\right) - q'_3[\mathbf{x}] \sin\!\left(2\pi\left(\tfrac{x}{3} - \tfrac{y}{3}\right)\right), \tag{7.55}$$

where

$$q'_2[\mathbf{x}] = 2\Re\{q_2[\mathbf{x}]\} = \tfrac23 f_1[\mathbf{x}] - \tfrac13 f_2[\mathbf{x}] - \tfrac13 f_3[\mathbf{x}] \tag{7.56}$$

$$q'_3[\mathbf{x}] = 2\Im\{q_2[\mathbf{x}]\} = \tfrac{1}{\sqrt3} f_2[\mathbf{x}] - \tfrac{1}{\sqrt3} f_3[\mathbf{x}], \tag{7.57}$$

and where $\Re$ and $\Im$ extract the real and imaginary parts of a complex number.

FIGURE 7.9 Two-dimensional power density spectrum estimate of a CFA image with the stripe sampling structure (axes $u$ and $v$ in c/px, power density in dB; components $Q_1$ to $Q_3$ labeled).
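Equation 7.10 can be evaluated numerically for both examples above. The standard-library sketch below (function names are ours) recomputes the matrices of Equations 7.37 and 7.48 from the quoted $\mathbf{B}$, $\mathbf{D}$ and $\mathbf{J}$, and checks that, for the stripe pattern, the complex pair of modulated terms equals the real quadrature form with $\theta = 2\pi\,\mathbf{x}\cdot\mathbf{d}_2$. Under the $\exp(+j2\pi\mathbf{x}\cdot\mathbf{d}_k)$ convention of Equation 7.11, that form is $2\Re\{q_2\}\cos\theta - 2\Im\{q_2\}\sin\theta$, i.e., with a minus sign on the sine term.

```python
import cmath
import math


def cfa_matrix(B, D, J):
    """M = (1/K) [exp(-j 2 pi D^T B)] J  (Eq. 7.10), exponential taken term by term."""
    K, C = len(B), len(J[0])
    E = [[cmath.exp(-2j * math.pi * (b[0] * d[0] + b[1] * d[1])) for b in B] for d in D]
    return [[sum(E[k][j] * J[j][i] for j in range(K)) / K for i in range(C)]
            for k in range(K)]


# Hexagonal (PIACCD) pattern: B and J from Eq. 7.34, D from Eq. 7.36; classes R, G, B
M_HEX = cfa_matrix(B=[(0, 0), (2, 0), (1, 1), (3, 1)],
                   D=[(0, 0), (0.5, 0), (0.25, 0.25), (0.25, -0.25)],
                   J=[[0, 1, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])

# Diagonal stripe pattern: B from Eq. 7.44, J from Eq. 7.45, D from Eq. 7.47
M_STRIPE = cfa_matrix(B=[(0, 0), (1, 0), (2, 0)],
                      D=[(0, 0), (1 / 3, -1 / 3), (-1 / 3, 1 / 3)],
                      J=[[1, 0, 0], [0, 0, 1], [0, 1, 0]])


def transformed(M, f):
    """q = M f (Eq. 7.13) for one pixel's vector of class values f."""
    return [sum(m * fi for m, fi in zip(row, f)) for row in M]


def stripe_chroma_two_ways(f, x, y):
    """k = 2, 3 terms of Eq. 7.11 for the stripe CFA at pixel (x, y), computed
    (a) with complex modulation and (b) in real quadrature form
    q2' cos(theta) - q3' sin(theta), where q2' = 2 Re{q2} and q3' = 2 Im{q2}."""
    q = transformed(M_STRIPE, f)
    theta = 2 * math.pi * (x - y) / 3            # theta = 2 pi x . d_2
    complex_form = q[1] * cmath.exp(1j * theta) + q[2] * cmath.exp(-1j * theta)
    quadrature_form = 2 * q[1].real * math.cos(theta) - 2 * q[1].imag * math.sin(theta)
    return complex_form, quadrature_form


for row in M_HEX:
    print([round(z.real, 4) for z in row])       # compare with Eq. 7.37
print(stripe_chroma_two_ways([0.3, 0.6, 0.9], x=1, y=0))
```

For the stripe pattern the third row of the computed matrix is the complex conjugate of the second, which is why $q_3 = q_2^*$ for any real input.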
One period of the two-dimensional power density spectrum of a sample CFA image with the stripe CFA structure is shown in Figure 7.9. The different components are easily identified in this figure; compare with Figure 7.8.

7.3.3.3 Four-Color Pattern

The final example considers a four-color pattern as illustrated in Figure 7.10. A number of these have been proposed in recent years, including CMYB [14], RBG1G2 [15], RGB + gray [16] and the RGB + Emerald pattern used in the Sony Cybershot DSC-F828 digital camera. In this case, $C = K = 4$. The lattices $\Lambda = \mathbb{Z}^2$ and $\Gamma = (2\mathbb{Z})^2$ are the same as for the Bayer structure, and the same matrix $\mathbf{B}$ can be used. The matrix $\mathbf{J}$ is the $4 \times 4$ identity matrix $\mathbf{I}_4$. The reciprocal lattices $\Lambda^* = \mathbb{Z}^2$ and $\Gamma^* = (\tfrac12\mathbb{Z})^2$ are again the same as for the Bayer structure and the same $\mathbf{D}$ can be used, so Figure 7.3 applies to this case.

FIGURE 7.10 Four-color CFA structure showing the constituent sampling structures $\Psi_1$, $\Psi_2$, $\Psi_3$ and $\Psi_4$. The union of these four sampling structures forms the lattice $\Lambda$.

Applying Equation 7.10 to the above, we obtain

$$\mathbf{M} = \frac{1}{4} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix}, \tag{7.58}$$

the transformed signals

$$q_1[\mathbf{x}] = \tfrac14 f_1[\mathbf{x}] + \tfrac14 f_2[\mathbf{x}] + \tfrac14 f_3[\mathbf{x}] + \tfrac14 f_4[\mathbf{x}] \tag{7.59}$$

$$q_2[\mathbf{x}] = \tfrac14 f_1[\mathbf{x}] - \tfrac14 f_2[\mathbf{x}] + \tfrac14 f_3[\mathbf{x}] - \tfrac14 f_4[\mathbf{x}] \tag{7.60}$$

$$q_3[\mathbf{x}] = \tfrac14 f_1[\mathbf{x}] + \tfrac14 f_2[\mathbf{x}] - \tfrac14 f_3[\mathbf{x}] - \tfrac14 f_4[\mathbf{x}] \tag{7.61}$$

$$q_4[\mathbf{x}] = \tfrac14 f_1[\mathbf{x}] - \tfrac14 f_2[\mathbf{x}] - \tfrac14 f_3[\mathbf{x}] + \tfrac14 f_4[\mathbf{x}], \tag{7.62}$$

and, in the frequency domain,

$$F_{\mathrm{CFA}}(u, v) = Q_1(u, v) + Q_2(u - 0.5, v) + Q_3(u, v - 0.5) + Q_4(u - 0.5, v - 0.5). \tag{7.63}$$

We need to know what the four signals (sensor classes) are to proceed further. Note that if $c_1(\lambda) = c_4(\lambda)$, we revert to the case of the Bayer array.
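The four-color case is a compact test of Equation 7.10 because $\mathbf{J} = \mathbf{I}_4$ and $\mathbf{M}$ becomes a scaled $4 \times 4$ matrix of $\pm 1$ entries whose square is $4\mathbf{I}_4$, so that $\mathbf{M}^{-1} = 4\mathbf{M}$. The standard-library sketch below (names are illustrative) computes $\mathbf{M}$, confirms the inverse, checks that equal inputs (a gray-scale condition) give zero chrominance rows, and checks that setting $f_4 = f_1$ gives $q_2 = (f_3 - f_2)/4$, the Bayer component $q_2$ of Equation 7.22 under the correspondence of class 2 with R and class 3 with B.

```python
import cmath
import math


def cfa_matrix(B, D, J):
    """M = (1/K) [exp(-j 2 pi D^T B)] J  (Eq. 7.10), evaluated term by term."""
    K, C = len(B), len(J[0])
    E = [[cmath.exp(-2j * math.pi * (b[0] * d[0] + b[1] * d[1])) for b in B] for d in D]
    return [[sum(E[k][j] * J[j][i] for j in range(K)) / K for i in range(C)]
            for k in range(K)]


def apply(M, f):
    """q = M f for one pixel (Eq. 7.13)."""
    return [sum(m * v for m, v in zip(row, f)) for row in M]


B = [(0, 0), (1, 0), (0, 1), (1, 1)]          # same as for the Bayer structure
D = [(0, 0), (0.5, 0), (0, 0.5), (0.5, 0.5)]
I4 = [[1 if r == c else 0 for c in range(4)] for r in range(4)]
M4 = cfa_matrix(B, D, I4)

# (4 M) M should be the identity, i.e. M^{-1} = 4 M
check = [[sum(4 * M4[r][m] * M4[m][c] for m in range(4)) for c in range(4)] for r in range(4)]

gray = apply(M4, [0.7, 0.7, 0.7, 0.7])         # equal inputs: chrominance rows vanish
bayer_like = apply(M4, [0.5, 0.8, 0.2, 0.5])   # f4 = f1: behaves like a Bayer array
print([round(abs(z), 12) for z in gray[1:]])    # chrominance magnitudes, all ~0
print(round(bayer_like[1].real, 12))           # (f3 - f2) / 4
```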
7.3.4 Summary

The analysis in this section has shown that the spatial multiplexing of pixels corresponding to different sensor classes, such as red, green and blue, can be equivalently viewed as the frequency-domain multiplexing of transformed components obtained as linear combinations of the original input components. These components generally consist of a luma component at baseband and several chrominance components modulated at certain frequencies. The baseband component is a weighted average of the input components, where the weights are the relative sampling densities of the given components. This baseband component is similar to the luminance component of human vision, which arises in a similar fashion from the retinal mosaic of cones [12]. Since this baseband component is not luminance, we use the term luma, as advocated by Poynton [9]. The chrominance components are various differences of the input components, and all are identically zero when the inputs from all sensor classes are equal (a gray-scale condition for these sensor classes). The analysis presented shows how to use lattice theory to determine these transformed components and the frequencies at which they are modulated. The interested reader can apply the analysis to the other RGB CFA patterns of Reference [5]; the author's analysis can be found at the companion webpage for this chapter [17]. The next section shows how these components can be demultiplexed using spatial filtering and subsequently transformed to yield the desired tristimulus values at all points of the sampling lattice, as required to further process and display the image.

As seen in Section 7.3.3.2, some of the transformed chrominance components may be complex. However, complex signals can be avoided by using the quadrature modulation representation. For any coset representative $\mathbf{d}_k$ of $\Lambda^*$ in $\Gamma^*$, there are two possible situations, which correspond to real and complex chrominance components, respectively.
The first situation is when $\mathbf{d}_k$ and $-\mathbf{d}_k$ belong to the same coset, i.e., $\mathbf{d}_k - (-\mathbf{d}_k) \in \Lambda^*$, or equivalently $2\mathbf{d}_k \in \Lambda^*$. This always applies to $\mathbf{d}_1 = \mathbf{0}$, but it also applies to all the $\mathbf{d}_k$ for the Bayer structure and for the structures of Sections 7.3.3.1 and 7.3.3.3. Because $\mathbf{b}_j \cdot \mathbf{r} \in \mathbb{Z}$ for every $\mathbf{b}_j \in \Lambda$ and $\mathbf{r} \in \Lambda^*$, Equation 7.10 gives the same result for any representative of a coset, while replacing $\mathbf{d}_k$ by $-\mathbf{d}_k$ conjugates every entry. Hence, since $\mathbf{d}_k$ and $-\mathbf{d}_k$ must give the same result in Equation 7.10, it follows that $M_{ki} = M_{ki}^*$, $i = 1, \ldots, C$, and so $q_k = q_k^*$. Thus all chrominance components for these CFA structures are real. In the second situation, $-\mathbf{d}_k \notin \mathbf{d}_k + \Lambda^*$, and so $-\mathbf{d}_k \in \mathbf{d}_l + \Lambda^*$ for some $l \neq k$. Thus $M_{li} = M_{ki}^*$ for $i = 1, \ldots, C$, and $q_l = q_k^*$. The corresponding two terms in Equation 7.11 can be written as

$$q_k[\mathbf{x}] \exp(j2\pi \mathbf{x} \cdot \mathbf{d}_k) + q_k^*[\mathbf{x}] \exp(-j2\pi \mathbf{x} \cdot \mathbf{d}_k) = q'_k[\mathbf{x}] \cos(2\pi \mathbf{x} \cdot \mathbf{d}_k) - q'_l[\mathbf{x}] \sin(2\pi \mathbf{x} \cdot \mathbf{d}_k), \tag{7.64}$$

where $q'_k = 2\Re\{q_k[\mathbf{x}]\}$ and $q'_l = 2\Im\{q_k[\mathbf{x}]\}$. Note that $q'_k$ and $q'_l$ are still linear combinations of the original components.

7.4 Demosaicking Based on the Frequency-Domain Representation

7.4.1 The Demosaicking Problem

The digital camera with a CFA sensor is an approximation to the ideal camera, whose objective is to accurately capture spatial color patterns to be reproduced for a human observer on a color display device. The bottom part of Figure 7.2 shows a model of an ideal camera designed to produce sampled color images for human viewing. The spatio-spectral light intensity is passed through a linear shift-invariant spatial camera aperture $h_t(\mathbf{x})$, with frequency response $H_t(\mathbf{u})$, that is adapted to the sampling lattice $\Lambda$ and possibly to the assumed viewing setup [18]. The resulting signal is passed through three spectral filters $\bar p_i(\lambda)$, $i = 1, 2, 3$, that correspond to three primaries $P_1$, $P_2$, $P_3$ of the human visual color space, and the total power is measured to produce the samples on $\Lambda$. This yields the vector signal $\tilde{\mathbf{f}}[\mathbf{x}]$, $\mathbf{x} \in \Lambda$, with three components

$$\tilde f_i[\mathbf{x}] = \int_{\lambda_{\min}}^{\lambda_{\max}} \left( \int_{\mathbb{R}^2} f(\mathbf{x} - \mathbf{s}, \lambda)\, h_t(\mathbf{s})\, d\mathbf{s} \right) \bar p_i(\lambda)\, d\lambda, \qquad i = 1, 2, 3. \tag{7.65}$$

The functions $\bar p_i(\lambda)$ are known in colorimetry as color-matching functions, and are a property of the human visual system [19]. The three components $\tilde f_i[\mathbf{x}]$, $i = 1, 2, 3$, are called tristimulus values with respect to the given primaries.

The problem at hand is to estimate $\tilde{\mathbf{f}}[\mathbf{x}]$ for $\mathbf{x} \in \Lambda$ from the observed scalar signal $f_{\mathrm{CFA}}[\mathbf{x}]$. This is an ill-posed inverse problem with at least three separate aspects: (i) Only one component is measured at each spatial location, and the other two must be estimated. This operation is often called demosaicking or color-plane interpolation. (ii) The actual aperture $h_a(\mathbf{x})$ differs from the ideal one $h_t(\mathbf{x})$ and may introduce excessive resolution loss; compensating for this is image restoration or aperture correction. (iii) The $\bar p_i(\lambda)$ cannot be expressed as a linear combination of the actual recording filters $c_i(\lambda)$, thus introducing color errors. In this case, which is the normal situation, the camera filters are said to be non-colorimetric.

In this chapter, we concentrate on the demosaicking problem. We do not consider aperture correction (which must also account for noise in the capture process), and standard solutions for the color error problem are used.

Consider for now only the color aspect of the problem. In the ideal situation, the spectral responses of the color filters for the $C$ classes span a space that contains the color-matching functions:

$$\bar p_i(\lambda) \in \mathrm{span}(\{c_j(\lambda), j = 1, \ldots, C\}), \qquad i = 1, 2, 3. \tag{7.66}$$

In this case, we can express these functions as linear combinations of the $c_j(\lambda)$,

$$\bar p_i(\lambda) = \sum_{j=1}^{C} a_{ij} c_j(\lambda), \qquad i = 1, 2, 3, \tag{7.67}$$

where the $a_{ij}$ are unique if the $c_j(\lambda)$ are linearly independent, which is a reasonable assumption in practice.
Then, the tristimulus values for a color with spectral distribution $f(\lambda)$ are given by

$$\tilde f_i = \int_{\lambda_{\min}}^{\lambda_{\max}} f(\lambda) \bar p_i(\lambda)\, d\lambda = \sum_{j=1}^{C} a_{ij} \int_{\lambda_{\min}}^{\lambda_{\max}} f(\lambda) c_j(\lambda)\, d\lambda, \tag{7.68}$$

so that the desired tristimulus values can be obtained from the values measured with the given color filters as

$$\tilde{\mathbf{f}} = \mathbf{A}\mathbf{f}, \tag{7.69}$$

where $\mathbf{A} = [a_{ij}]$ is the $3 \times C$ matrix implied by Equation 7.67 and $[\mathbf{f}]_i = \int f(\lambda) c_i(\lambda)\, d\lambda$. The primaries for the desired signal could be the CIE XYZ primaries for a device-independent representation, or some standard RGB space such as the ITU Recommendation 709 RGB primaries [9], also used in the sRGB representation.

However, in the practical situation,

$$\bar p_i(\lambda) \notin \mathrm{span}(\{c_j(\lambda), j = 1, \ldots, C\}), \qquad i = 1, 2, 3, \tag{7.70}$$

so that color errors are inevitable. In particular, two different but metamerically equivalent color spectra, with equal tristimulus values, will in general give different measured values, while visually different colors can give the same measured values. Essentially, the human and camera visual systems are different. In this case, a transformation mapping $\mathbf{f}$ to $\tilde{\mathbf{f}}$ is required that minimizes the expected color error, according to an appropriate color distance metric, over some suitable ensemble of spectral densities $f(\lambda)$ [20]. Although the best such transformation is not necessarily linear, linear transformations have been found to give excellent results. See Reference [21] for a review of standard techniques to solve the problem. A typical one is to project the $\bar p_i(\lambda)$ onto $\mathrm{span}(\{c_j(\lambda), j = 1, \ldots, C\})$ and to use the projected color-matching functions to compute the approximate tristimulus values. Thus, we will still assume that the desired tristimulus values are obtained from the measured values using Equation 7.69, and that this is the desired signal we are trying to measure. The resulting colorimetric errors are not considered further in this chapter.
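The projection approach mentioned above amounts to a linear least-squares fit of each color-matching function by the camera filter curves. The sketch below is a minimal illustration under stated assumptions: spectra are sampled on a uniform wavelength grid (so the integrals become sums with a common weight that cancels), and made-up Gaussian curves stand in for real filter responses and for the color-matching functions, so the numbers carry no colorimetric meaning. In the ideal case of Equation 7.66, where each $\bar p_i$ lies exactly in the span of the $c_j$, the fit should return the exact $a_{ij}$ of Equation 7.67. All names are hypothetical.

```python
import math


def gaussian(center, width):
    """Hypothetical bell-shaped spectral curve (illustration only, not measured data)."""
    return lambda lam: math.exp(-0.5 * ((lam - center) / width) ** 2)


def solve(G, h):
    """Solve the square system G a = h by Gauss-Jordan elimination with partial pivoting."""
    n = len(G)
    aug = [[float(v) for v in G[r]] + [float(h[r])] for r in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[r][n] / aug[r][r] for r in range(n)]


def projection_matrix(cmfs, filters, wavelengths):
    """Least-squares A (3 x C) with pbar_i(lam) ~ sum_j a_ij c_j(lam) on the grid.

    Row i solves the normal equations G a_i = h_i, where G_jl = <c_j, c_l>
    and h_ij = <c_j, pbar_i> are inner products over the sampled wavelengths.
    """
    c = [[cj(lam) for lam in wavelengths] for cj in filters]
    gram = [[sum(u * v for u, v in zip(cj, cl)) for cl in c] for cj in c]
    A = []
    for p in cmfs:
        pv = [p(lam) for lam in wavelengths]
        A.append(solve(gram, [sum(u * v for u, v in zip(cj, pv)) for cj in c]))
    return A


wavelengths = range(400, 701, 5)                                     # nm, assumed grid
filters = [gaussian(610, 30), gaussian(540, 35), gaussian(460, 30)]  # made-up c_j
A_true = [[1.0, 0.3, 0.0], [0.2, 1.0, 0.1], [0.0, 0.1, 0.9]]
# Ideal (colorimetric) case: each pbar_i is an exact combination of the c_j
cmfs = [lambda lam, row=row: sum(a * cj(lam) for a, cj in zip(row, filters))
        for row in A_true]
A = projection_matrix(cmfs, filters, wavelengths)
f_camera = [0.4, 0.5, 0.3]                                           # hypothetical [f]_i
f_tilde = [sum(a * v for a, v in zip(row, f_camera)) for row in A]   # Eq. 7.69
print([[round(a, 6) for a in row] for row in A])
```

In the non-colorimetric case of Equation 7.70, the same call returns the least-squares projection rather than an exact representation, which is the approximation described in the text.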
7.4.2 Algorithms Derived from the Frequency-Domain Representation

The premise of the demosaicking algorithms based on the frequency-domain representation is twofold. First, they extract the luma and modulated chrominance components from the CFA signal using spatial filters. Second, they transform the luma and demodulated chrominance components into estimates of the desired tristimulus values using the appropriate linear transformation.

The specific structure of the CFA signal in the frequency domain should be exploited. For example, with the Bayer structure, one component is modulated at two different frequencies, and either can be used to reconstruct the signal. Where one of them locally suffers from crosstalk, the other is often relatively free of it. By adaptively selecting which of the two candidates to use, superior results can be obtained [11], [22].

The basic algorithm is derived from Equation 7.15. The CFA signal is passed through a series of bandpass filters to extract the modulated chrominance components, which are demodulated to yield the estimated chrominance components. These are then linearly transformed to give the required tristimulus values. Let $H_k(\mathbf u)$ be a two-dimensional linear shift-invariant bandpass filter with center frequency $\mathbf d_k$. The shape of the passband depends on the expected spectral support of the corresponding chrominance signal. Then $\hat r_k = f_{CFA} * h_k$, and the resulting signal is demodulated to baseband to obtain $\hat q_k[\mathbf x] = \hat r_k[\mathbf x] \exp(-j2\pi \mathbf x \cdot \mathbf d_k)$. The baseband luma component $q_1[\mathbf x]$ at frequency $\mathbf d_1 = \mathbf 0$ does not need to be demodulated; it can be obtained by subtracting the estimated modulated chrominance components from the CFA signal:

$$\hat q_1[\mathbf x] = f_{CFA}[\mathbf x] - \sum_{k=2}^{K} \hat r_k[\mathbf x]. \tag{7.71}$$

When $C < K$, the transformed components $q_k$ are linearly dependent (e.g., $q_3[\mathbf x] = -q_2[\mathbf x]$ for the Bayer CFA).
Since the estimated transformed components obtained as above will not in general satisfy this constraint (e.g., $\hat q_3[\mathbf x] \neq -\hat q_2[\mathbf x]$), the constraint needs to be imposed, either explicitly or implicitly. This step is often the key to a successful result.

The approach is illustrated for the Bayer CFA structure of Figure 7.1. The global spectrum shows substantial overlap between the luma component at baseband and both modulated versions, $Q_2(u - 0.5, v)$ and $Q_2(u, v - 0.5)$, as can be seen from Figure 7.3. Locally, however, the luma tends to overlap with only one of these components, depending on the local image content. This is illustrated schematically in Figure 7.11, which shows the hypothesized local spectrum for two different scenarios. By identifying which of the two versions locally has less crosstalk and using that one for reconstruction, a much better result can be obtained than by using both of them equally.

FIGURE 7.11 Local spectrum scenarios schematically illustrated for the Bayer CFA pattern (positions of Q1 to Q4 in the (u, v) plane): (a) scenario with Q2 being the better estimate, and (b) scenario with Q3 being the better estimate.

FIGURE 7.12 (See color insert.) Reconstruction of the lighthouse image (a) using only Q2(u − 0.5, v) and (b) using only Q2(u, v − 0.5).

In fact, Figure 7.12 shows what happens when the lighthouse image is reconstructed using only $Q_2(u - 0.5, v)$ (Figure 7.12a) and using only $Q_2(u, v - 0.5)$ (Figure 7.12b). Every area of the image is well reconstructed in one or the other of these two images; we just need a genie to identify which one is better for each pixel. Note in particular the house at the left, where the first scenario applies, and the picket fence, where the second scenario applies.
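For a spatially constant color, the Bayer CFA signal is periodic with periodicity lattice $\Gamma = 2\mathbb Z^2$. Its components $q_k$ are then simply the discrete Fourier series coefficients of Equation 7.84. The sketch below assumes one particular Bayer phase: green at $(0,0)$ and $(1,1)$, red at $(1,0)$, blue at $(0,1)$, with $f_1, f_2, f_3$ taken as R, G, B. This layout is a hypothetical choice for illustration and may differ from Figure 7.1. The sketch shows that the coefficient at $\mathbf d_3 = (0, 0.5)$ is the negative of the one at $\mathbf d_2 = (0.5, 0)$, which is the linear dependence $q_3 = -q_2$ noted above.

```python
import numpy as np

# Coset representatives b_j of Gamma = 2Z^2 in Z^2, and d_k of Z^2 in (1/2)Z^2.
REPS = [(0, 0), (1, 0), (0, 1), (1, 1)]
FREQS = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5)]

def dfs_coefficients(values, reps=REPS, freqs=FREQS):
    """Eq. (7.84): F[k] = (1/K) sum_j f[b_j] exp(-j 2 pi b_j . d_k)."""
    K = len(reps)
    return np.array([sum(v * np.exp(-2j * np.pi * np.dot(b, d))
                         for v, b in zip(values, reps)) / K
                     for d in freqs])

def dfs_synthesis(coeffs, x, freqs=FREQS):
    """Eq. (7.83): f[x] = sum_k F[k] exp(j 2 pi x . d_k)."""
    return sum(c * np.exp(2j * np.pi * np.dot(x, d))
               for c, d in zip(coeffs, freqs))

def bayer_period(r, g, b):
    """CFA values at the representatives under the assumed Bayer phase."""
    return [g, r, b, g]          # (0,0): G, (1,0): R, (0,1): B, (1,1): G

q = dfs_coefficients(bayer_period(0.8, 0.5, 0.2))
# q[0] is the luma q1, q[1] = q2, q[2] = q3 = -q2, q[3] = q4.
print(np.round(q.real, 4))
```

With this phase, the coefficients are $q_1 = (R + 2G + B)/4$, $q_2 = (B - R)/4 = -q_3$ and $q_4 = (2G - R - B)/4$. Only three of the four components are independent.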
FIGURE 7.13 Block diagram of the adaptive demosaicking algorithm for the Bayer CFA structure. [Blocks h2, h3, h4, analyzer, combine and matrix; modulators (−1)^{n1}, (−1)^{n2}, (−1)^{n1+n2}; input f_CFA, outputs f̂1, f̂2, f̂3.]

As described in Reference [11], we can make this decision by measuring the average local energies $e_X$ and $e_Y$ in the vicinity of the circular regions in Figure 7.11 centered at frequencies $(u_m, 0)$ and $(0, v_m)$. The direction (horizontal or vertical) with the lower energy is assumed to suffer less from crosstalk and is given more weight in the reconstruction. The algorithm proposed in Reference [11], with notation adapted to that of this chapter, is summarized below (Algorithm 7.1) and illustrated in Figure 7.13. Note that here, with $\mathbf x = [n_1\ n_2]^T$, we have $\exp(\pm j2\pi \mathbf x \cdot \mathbf d_2) = (-1)^{n_1}$, $\exp(\pm j2\pi \mathbf x \cdot \mathbf d_3) = (-1)^{n_2}$, and $\exp(\pm j2\pi \mathbf x \cdot \mathbf d_4) = (-1)^{n_1+n_2}$.

ALGORITHM 7.1 Adaptive demosaicking algorithm for the Bayer CFA.
1. Filter $f_{CFA}$ with a bandpass filter $h_4$ centered at frequency $(0.5, 0.5)$ to extract $\hat r_4 = f_{CFA} * h_4$, and shift it to baseband to estimate $\hat q_4[n_1, n_2] = \hat r_4[n_1, n_2]\,(-1)^{n_1+n_2}$.
2. Filter $f_{CFA}$ with $h_2$ to get $\hat r_2 = f_{CFA} * h_2$, and demodulate to baseband: $\hat q_{2a}[n_1, n_2] = \hat r_2[n_1, n_2]\,(-1)^{n_1}$. Similarly, $\hat r_3 = f_{CFA} * h_3$ and $\hat q_{2b}[n_1, n_2] = -\hat r_3[n_1, n_2]\,(-1)^{n_2}$ (using $q_2 = -q_3$).
3. Estimate the local average energies $e_X$ and $e_Y$ using modulated Gaussian filters with standard deviations of $r_{G1}$ and $r_{G2}$ pixels along the major and minor axes, centered at frequencies $(\pm u_m, 0)$ and $(0, \pm v_m)$ c/px, respectively. The filter at $(0, \pm v_m)$ is the transpose of the filter at $(\pm u_m, 0)$. Then smooth the squared output with a $5 \times 5$ moving-average filter.
4. Obtain the final estimate of $q_2$ as $\hat q_2[n_1, n_2] = w[n_1, n_2]\,\hat q_{2a}[n_1, n_2] + (1 - w[n_1, n_2])\,\hat q_{2b}[n_1, n_2]$, using the weighting coefficient $w = e_Y / (e_X + e_Y)$.
5. Estimate the luma by $\hat q_1[n_1, n_2] = f_{CFA}[n_1, n_2] - \hat r_4[n_1, n_2] - \hat q_2[n_1, n_2]\big((-1)^{n_1} - (-1)^{n_2}\big)$.
6. Estimate the RGB components $\hat f_1, \hat f_2, \hat f_3$ from $\hat q_1$, $\hat q_2$ and $\hat q_4$ using Equations 7.29 to 7.31, and Equation 7.69 if needed.

A similar approach can be used for the PIACCD structure (Section 7.3.3.1), since in this case there are two separate copies of the component $q_3$. The other two examples, the stripe pattern (Section 7.3.3.2) and the four-color patterns (Section 7.3.3.3), have no duplicated components, and all components are required to reconstruct the image. Thus, the above approach is not directly applicable to them. The basic algorithm can be used for the stripe pattern, while other approaches can be considered for the four-color patterns.

We consider the stripe pattern of Section 7.3.3.2 in more detail to illustrate the case of complex modulation. Using complex processing, a single complex filter $h_2$ is required to extract $\hat r_2$. The algorithm is straightforward (see Algorithm 7.2 and Algorithm 7.3).

ALGORITHM 7.2 Complex demosaicking algorithm for the stripe CFA pattern.
1. Filter $f_{CFA}$ with a complex bandpass filter $h_2$ centered at frequency $(\tfrac13, -\tfrac13)$ to extract $\hat r_2 = f_{CFA} * h_2$, and shift it to baseband to estimate $\hat q_2[n_1, n_2] = \hat r_2[n_1, n_2] \exp\!\big(-j2\pi(\tfrac{n_1}{3} - \tfrac{n_2}{3})\big)$.
2. Estimate the luma component $\hat q_1 = f_{CFA} - \hat r_2 - \hat r_2^* = f_{CFA} - 2\Re\{\hat r_2\}$.
3. Estimate $\hat q_3 = \hat q_2^*$.
4. Estimate the RGB components $\hat f_1, \hat f_2, \hat f_3$ from $\hat q_1, \hat q_2, \hat q_3$ using the inverse matrix from Equation 7.52, and Equation 7.69 if needed.

ALGORITHM 7.3 Real version of the demosaicking algorithm for the stripe CFA pattern.
$$
\begin{aligned}
h_2 &= h_{2R} + j\,h_{2I}\\
c_2[n_1, n_2] &= \cos\!\big(2\pi(\tfrac{n_1}{3} - \tfrac{n_2}{3})\big), \qquad s_2[n_1, n_2] = \sin\!\big(2\pi(\tfrac{n_1}{3} - \tfrac{n_2}{3})\big)\\
\hat r_{2R} &= f_{CFA} * h_{2R}, \qquad \hat r_{2I} = f_{CFA} * h_{2I}\\
\hat q_1 &= f_{CFA} - 2\hat r_{2R}\\
\hat q_2 &= \hat r_{2R}\, c_2 + \hat r_{2I}\, s_2\\
\hat q_3 &= \hat r_{2R}\, s_2 - \hat r_{2I}\, c_2\\
\hat f_1 &= \hat q_1 + 2\hat q_2\\
\hat f_2 &= \hat q_1 - \hat q_2 - \sqrt{3}\, \hat q_3\\
\hat f_3 &= \hat q_1 - \hat q_2 + \sqrt{3}\, \hat q_3
\end{aligned}
$$

7.5 Filter Design for CFA Signal Demultiplexing

As presented in the previous section, the demosaicking methods based on the frequency-domain representation require two-dimensional bandpass filters centered at the frequencies $\mathbf d_k$, $k = 2, \ldots, K$. The design of these filters is a crucial step in implementing these demosaicking algorithms successfully. Reference [12] proposed the use of Gaussian filters. Baseband Gaussian lowpass filters are first generated using two parameters, the horizontal and vertical standard deviations. These lowpass filters are then converted to bandpass filters by modulating the unit-sample response with a complex sinusoid of the given frequency. Thus, two parameters, together with the support of the filter, need to be selected. Reference [11] presented the use of frequency-selective filters obtained with the window design method. This was essentially a proof of concept and used $21 \times 21$ filters. When used with the adaptive algorithm described in the previous section, that method gave better performance for the Bayer structure than other methods known at that time. Reference [22] showed that a least-squares design method could give comparable or better results than the window method of Reference [11], but with much lower complexity. Thus, only the least-squares methodology is presented in this chapter.

According to Equation 7.15, the observed CFA signal is the sum of $K$ constituent modulated signals. Each occupies a distinct frequency band and can be estimated by a linear filter.
Let $r_Y$ represent any one of these signals, which we estimate by filtering $f_{CFA}$ with the linear shift-invariant filter having unit-sample response $h_Y$: $\hat r_Y = f_{CFA} * h_Y$. If we assume that the difference between $\hat r_Y$ and $r_Y$ can be modelled as a stationary random field, then we could choose $h_Y$ to minimize the expected squared error:

$$h_Y = \arg\min_h E\Big[\big(r_Y[n_1, n_2] - (f_{CFA} * h)[n_1, n_2]\big)^2\Big].$$

In practice, we do not have a good random field model of the estimation error. We can instead minimize the actual error over a training set of typical color images, which turns the design into a least-squares problem. Other approaches to demosaicking using linear least-squares, Wiener filtering or similar techniques have been presented in References [6], [7], [23], and [24].

Assume that we have a training set of color images for which all the desired sensor-class components are available on the full lattice $\Lambda$. For each image in this training set, we can form both the CFA signal and the desired signal $r_Y$ at full resolution. Suppose that we partition the training set into $P$ sub-images, where the $i$th sub-image is defined on the spatial block $W^{(i)} \subset \Lambda$. Assume that the desired filter $h_Y$ is a finite impulse response (FIR) 2D filter with region of support $\mathcal S \subset \Lambda$. Then, the least-squares filter can be obtained as the solution to

$$h_Y = \arg\min_h \sum_{i=1}^{P} \sum_{(n_1,n_2)\in W^{(i)}} \Big( r_Y^{(i)}[n_1, n_2] - \sum_{(k_1,k_2)\in\mathcal S} h[k_1, k_2]\, f_{CFA}^{(i)}[n_1 - k_1, n_2 - k_2] \Big)^2. \tag{7.72}$$

This expression can easily be cast in matrix form to simplify the solution with standard matrix packages such as MATLAB®. Let $N_B = |\mathcal S|$ be the number of filter coefficients to be determined, and $N_W = |W^{(i)}|$ be the number of samples in each sub-image (the same for all $i$). We form an $N_B \times 1$ column vector $\mathbf h$ from the filter coefficients by scanning the region $\mathcal S$ in some fixed order, say column by column from left to right. Similarly, we form an $N_W \times 1$ column vector $\mathbf r_Y^{(i)}$ from $r_Y^{(i)}[n_1, n_2]$ by scanning the pixels of $W^{(i)}$ in a fixed order.
Finally, we form an $N_W \times N_B$ matrix $\mathbf Z^{(i)}$ from the elements of $f_{CFA}^{(i)}$ as follows. Each column of $\mathbf Z^{(i)}$ corresponds to an element $(k_1, k_2) \in \mathcal S$, scanned in the same order as used to form $\mathbf h$. This column is obtained by scanning the elements of $f_{CFA}^{(i)}[n_1 - k_1, n_2 - k_2]$ for $(n_1, n_2) \in W^{(i)}$ in the same order used to form $\mathbf r_Y^{(i)}$. In this way, Equation 7.72 can be written in matrix form as

$$\mathbf h_Y = \arg\min_{\mathbf h} \sum_{i=1}^{P} \big\| \mathbf Z^{(i)} \mathbf h - \mathbf r_Y^{(i)} \big\|^2. \tag{7.73}$$

This is a standard least-squares problem with solution [10]

$$\mathbf h_Y = \Big( \sum_{i=1}^{P} \mathbf Z^{(i)H} \mathbf Z^{(i)} \Big)^{-1} \sum_{i=1}^{P} \mathbf Z^{(i)H} \mathbf r_Y^{(i)}. \tag{7.74}$$

The result $\mathbf h_Y$ is then reshaped to give the desired filter $h_Y[\mathbf x]$.

FIGURE 7.14 Frequency response of $11 \times 11$ filters designed by the adaptive least-squares method: (a) $h_2$, (b) $h_3$, and (c) $h_4$. [Each panel: magnitude versus u and v (c/px), both from −0.5 to 0.5.]

For the Bayer structure, this approach can be used to determine the three filters $h_2$, $h_3$ and $h_4$ required in the adaptive frequency-domain algorithm. Although it is probably quite adequate for determining $h_4$, it does not account for the adaptive nature of the algorithm used to estimate $q_2$. The following describes a least-squares algorithm that determines $h_2$ and $h_3$ simultaneously so as to minimize the squared error in the estimation of $q_2$ with the adaptive algorithm of Section 7.4. Referring to the algorithm description, the estimate of $q_2[n_1, n_2]$ is obtained by

$$\hat q_2[n_1, n_2] = w[n_1, n_2](-1)^{n_1} \sum_{(k_1,k_2)\in\mathcal S} h_2[k_1, k_2]\, f_{CFA}[n_1 - k_1, n_2 - k_2] - \big(1 - w[n_1, n_2]\big)(-1)^{n_2} \sum_{(k_1,k_2)\in\mathcal S} h_3[k_1, k_2]\, f_{CFA}[n_1 - k_1, n_2 - k_2], \tag{7.75}$$

and we can choose $h_2$ and $h_3$ jointly to minimize the total squared error between $q_2$ and $\hat q_2$ over the training set. Again, we cast this squared error in matrix form. Let $\mathbf h_{23}$ be the $2N_B \times 1$ column vector obtained by stacking $\mathbf h_2$ on top of $\mathbf h_3$.
The column vector $\mathbf q_2^{(i)}$ is obtained by scanning the elements of $q_2[n_1, n_2]$ over $W^{(i)}$ in the same order as described above. Finally, we form an $N_W \times 2N_B$ matrix $\mathbf W^{(i)}$ as follows. The first $N_B$ columns are formed by reshaping $w^{(i)}[n_1, n_2](-1)^{n_1} f_{CFA}^{(i)}[n_1 - k_1, n_2 - k_2]$ for each $(k_1, k_2) \in \mathcal S$. The second $N_B$ columns are formed by reshaping $-(1 - w^{(i)}[n_1, n_2])(-1)^{n_2} f_{CFA}^{(i)}[n_1 - k_1, n_2 - k_2]$ in the same order. Once again, this leads to a least-squares problem of the form

$$\mathbf h_{23} = \arg\min_{\mathbf h} \sum_{i=1}^{P} \big\| \mathbf W^{(i)} \mathbf h - \mathbf q_2^{(i)} \big\|^2 \tag{7.76}$$

$$\phantom{\mathbf h_{23}} = \Big( \sum_{i=1}^{P} \mathbf W^{(i)H} \mathbf W^{(i)} \Big)^{-1} \sum_{i=1}^{P} \mathbf W^{(i)H} \mathbf q_2^{(i)}. \tag{7.77}$$

Finally, $h_2$ and $h_3$ are extracted from $\mathbf h_{23}$ and reshaped to give the optimized filters $h_2[\mathbf x]$ and $h_3[\mathbf x]$.

FIGURE 7.15 Frequency response of $11 \times 11$ filters designed by the least-squares method for the stripe pattern: (a) $h_2$, (b) $h_{2R}$, and (c) $h_{2I}$. [Each panel: magnitude versus u and v (c/px), both from −0.5 to 0.5.]

Figure 7.14 shows the frequency response of the filters obtained with these procedures: Equation 7.74 gives $h_4[\mathbf x]$, and Equation 7.77 gives $h_2[\mathbf x]$ and $h_3[\mathbf x]$. The training set consists of images 1 to 12 of the standard Kodak database commonly used to test demosaicking algorithms [25], and all filters have a support of $11 \times 11$. As reported in Reference [22], these filters give equal or better mean-squared error (MSE) than the $21 \times 21$ window-designed filters of Reference [11]. Those filters in turn gave equal or better MSE results than all the techniques compared in the review paper [25]. The results of demosaicking the twenty-four Kodak images with these $21 \times 21$ filters used in the adaptive algorithm can be seen on the website associated with Reference [11]. The results with the least-squares filters are visually very similar.
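Equations 7.73 to 7.77 translate directly into matrix code. The minimal sketch below assumes NumPy. It builds $\mathbf Z^{(i)}$ and $\mathbf W^{(i)}$ for a square support and a rectangular block $W^{(i)}$, then solves the stacked least-squares problem with `np.linalg.lstsq`. This gives the same minimizer as the normal-equation form of Equations 7.74 and 7.77 without forming $\mathbf Z^{(i)H}\mathbf Z^{(i)}$ explicitly. The helper names, the block layout and the synthetic test data are hypothetical. A real design would instead use CFA signals, targets $r_Y$ or $q_2$, and weights $w$ computed from a training set such as the Kodak images.

```python
import numpy as np

def tap_matrix(f, radius, block):
    """Z of Eq. (7.73): one row per pixel (n1, n2) of the block, one column per
    tap (k1, k2) of the square support |k1|, |k2| <= radius, holding
    f[n1 - k1, n2 - k2]. The block must stay `radius` pixels inside f."""
    (a1, b1), (a2, b2) = block
    n1, n2 = np.mgrid[a1:b1, a2:b2]
    n1, n2 = n1.ravel(), n2.ravel()
    taps = [(k1, k2) for k1 in range(-radius, radius + 1)
            for k2 in range(-radius, radius + 1)]
    return np.stack([f[n1 - k1, n2 - k2] for k1, k2 in taps], axis=1)

def block_vector(x, block):
    """Scan x over the block in the same (row-major) order as tap_matrix."""
    (a1, b1), (a2, b2) = block
    return x[a1:b1, a2:b2].ravel()

def ls_filter(cfas, targets, radius, block):
    """Eq. (7.74): least-squares FIR filter mapping f_CFA to r_Y."""
    Z = np.vstack([tap_matrix(f, radius, block) for f in cfas])
    r = np.concatenate([block_vector(t, block) for t in targets])
    h, *_ = np.linalg.lstsq(Z, r, rcond=None)
    return h.reshape(2 * radius + 1, 2 * radius + 1)

def joint_ls_filters(cfas, weights, q2_targets, radius, block):
    """Eqs. (7.75)-(7.77): jointly fit h2 and h3 for the adaptive q2 estimate."""
    (a1, b1), (a2, b2) = block
    n1, n2 = np.mgrid[a1:b1, a2:b2]
    s1, s2 = ((-1.0) ** n1).ravel(), ((-1.0) ** n2).ravel()
    rows, rhs = [], []
    for f, w, q2 in zip(cfas, weights, q2_targets):
        Z = tap_matrix(f, radius, block)
        wv = block_vector(w, block)
        rows.append(np.hstack([(wv * s1)[:, None] * Z,             # h2 half
                               (-(1.0 - wv) * s2)[:, None] * Z]))  # h3 half
        rhs.append(block_vector(q2, block))
    h23, *_ = np.linalg.lstsq(np.vstack(rows), np.concatenate(rhs), rcond=None)
    nb = (2 * radius + 1) ** 2
    shape = (2 * radius + 1, 2 * radius + 1)
    return h23[:nb].reshape(shape), h23[nb:].reshape(shape)
```

For example, `ls_filter(cfas, targets, 5, ((5, 59), (5, 59)))` would fit an $11 \times 11$ filter over the interior of $64 \times 64$ training blocks. Because the support is square here, the taps are scanned row by row rather than column by column. Any fixed order works, provided the same order is used to reshape the result.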
FIGURE 7.16 (See color insert.) Portion of the JPEG 2000 test image Bike, downsampled by four in each direction: (a) original color image, (b) image reconstructed from the Bayer CFA using bilinear interpolation, (c) image reconstructed from the Bayer CFA using the adaptive frequency demultiplexing algorithm, and (d) image reconstructed from the stripe CFA using least-squares frequency demultiplexing.

An extensive study, to be reported elsewhere, examined the effect of the various parameters and filter support regions and identified good choices for these parameters. Specifically, in step 3 of the algorithm, we recommend $u_m = v_m = 0.375$ c/px, $r_{G1} = 3.0$, $r_{G2} = 1.0$, and a maximum support of $11 \times 3$ for the Gaussian filters. Filters $h_2$, $h_3$ and $h_4$ have a maximum support of $11 \times 11$, since larger supports gave no improvement in performance.

The least-squares design of Equation 7.74 can be applied directly to complex signals without modification. This is illustrated for the stripe CFA pattern. Applying this design algorithm to the Kodak dataset yields the complex filter $h_2$, whose magnitude response is illustrated in Figure 7.15a. The constituent real and imaginary parts $h_{2R}$ and $h_{2I}$ are shown in Figure 7.15b and Figure 7.15c. This pattern and these filters give reasonable results, but they are in general inferior to the results obtained with the best adaptive algorithm used with the Bayer pattern. However, certain areas are improved, such as the problematic picket fence in the lighthouse image. This is to be expected from inspection of the spectral plots for the two CFA patterns.

FIGURE 7.17 (See color insert.) Portion of Spincalendar: (a) original color image, (b) image reconstructed from the Bayer CFA using bilinear interpolation, (c) image reconstructed from the Bayer CFA using the adaptive frequency demultiplexing algorithm, and (d) image reconstructed from the stripe CFA using least-squares frequency demultiplexing.

The frequency-domain approach can be illustrated with critical areas taken from two standard images: a portion of the JPEG 2000 test image Bike (downsampled by four in each direction), and a portion of the Spincalendar HDTV test sequence. Figure 7.16 and Figure 7.17 each show four versions of the image:
(a) the original;
(b) the reconstruction from the Bayer CFA using bilinear interpolation;
(c) the reconstruction from the Bayer CFA using the adaptive least-squares algorithm;
(d) the reconstruction from the stripe pattern using the least-squares algorithm.

The bilinear reconstruction clearly shows the artifacts due to luma-chrominance crosstalk for the Bayer pattern in both images. The test-pattern portion of the Bike image has very high horizontal frequencies, right up to the Nyquist frequency. The adaptive least-squares frequency demultiplexing algorithm gives an excellent result there, although a small amount of residual cross-color remains. For the stripe CFA, there is much less crosstalk in the horizontal-frequency components, owing to the location of the chrominance components, and an even better result is obtained for these portions of the Bike image. However, other portions with diagonal structures do not fare as well. The Spincalendar image contains strong diagonal frequencies, and there the Bayer CFA with the adaptive least-squares frequency demultiplexing algorithm gives a much better result than the stripe pattern. Overall, the Bayer pattern is better when evaluated over all 24 Kodak images. Other demosaicking results can be found at the companion webpage [17] for this chapter.
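The two frequency-domain demultiplexers compared above can be sketched compactly in NumPy. This is a minimal illustration under stated assumptions, not the implementation used for Figures 7.16 and 7.17. The assumptions are:
(a) The bandpass filters are Gaussian-modulated filters in the spirit of Reference [12], not the least-squares designs of Section 7.5.
(b) Convolution is circular (via the FFT), which avoids border handling.
(c) The Bayer phase (green where $n_1 + n_2$ is even, red at odd $n_1$ and even $n_2$) is hypothetical, chosen to agree with step 5 of Algorithm 7.1.
(d) The stripe layout (component index $(n_2 - n_1) \bmod 3$) is hypothetical, chosen to agree with the reconstruction formulas of Algorithm 7.3.
(e) Step 6 uses $\hat f_1 = \hat q_1 - \hat q_4 - 2\hat q_2$, $\hat f_2 = \hat q_1 + \hat q_4$, $\hat f_3 = \hat q_1 - \hat q_4 + 2\hat q_2$. This is what the assumed Bayer phase implies and serves as a stand-in for Equations 7.29 to 7.31.
(f) The energy filters use the recommended values above ($u_m = 0.375$, $r_{G1} = 3$, $r_{G2} = 1$, $11 \times 3$ support), with the major axis assumed to lie along $n_1$ for the filter at $(\pm u_m, 0)$.

```python
import numpy as np

def taps(s1, s2, r1, r2, d):
    """Gaussian (std s1 along n1, s2 along n2) on |k1| <= r1, |k2| <= r2,
    normalised to unit DC gain, then modulated by cos(2 pi d . k)."""
    k1, k2 = np.mgrid[-r1:r1 + 1, -r2:r2 + 1]
    g = np.exp(-k1**2 / (2 * s1**2) - k2**2 / (2 * s2**2))
    return k1, k2, g / g.sum() * np.cos(2 * np.pi * (d[0] * k1 + d[1] * k2))

def filt(f, kernel):
    """Circular convolution (f * h)[n] = sum_k h[k] f[n - k] via the FFT."""
    k1, k2, h = kernel
    h_full = np.zeros(f.shape)
    np.add.at(h_full, (k1 % f.shape[0], k2 % f.shape[1]), h)
    return np.real(np.fft.ifft2(np.fft.fft2(f) * np.fft.fft2(h_full)))

def bayer_mosaic(rgb):
    n1, n2 = np.indices(rgb.shape[:2])
    cfa = rgb[..., 1].copy()                       # G where n1 + n2 is even
    red, blue = (n1 % 2 == 1) & (n2 % 2 == 0), (n1 % 2 == 0) & (n2 % 2 == 1)
    cfa[red], cfa[blue] = rgb[..., 0][red], rgb[..., 2][blue]
    return cfa

def demosaic_bayer(cfa, sigma=1.5, um=0.375, rg1=3.0, rg2=1.0):
    """Algorithm 7.1 with Gaussian-modulated 11 x 11 filters h2, h3, h4."""
    n1, n2 = np.indices(cfa.shape)
    s1, s2 = (-1.0) ** n1, (-1.0) ** n2
    r4 = filt(cfa, taps(sigma, sigma, 5, 5, (0.5, 0.5)))            # step 1
    q4 = r4 * s1 * s2
    q2a = filt(cfa, taps(sigma, sigma, 5, 5, (0.5, 0.0))) * s1      # step 2
    q2b = -filt(cfa, taps(sigma, sigma, 5, 5, (0.0, 0.5))) * s2
    b1, b2 = np.mgrid[-2:3, -2:3]                                    # step 3
    box = (b1, b2, np.full((5, 5), 1.0 / 25))
    ex = filt(filt(cfa, taps(rg1, rg2, 5, 1, (um, 0.0))) ** 2, box)
    ey = filt(filt(cfa, taps(rg2, rg1, 1, 5, (0.0, um))) ** 2, box)
    w = np.clip(ey / (ex + ey + 1e-12), 0.0, 1.0)                    # step 4
    q2 = w * q2a + (1.0 - w) * q2b
    q1 = cfa - r4 - q2 * (s1 - s2)                                   # step 5
    return np.stack([q1 - q4 - 2 * q2, q1 + q4, q1 - q4 + 2 * q2], axis=-1)

def stripe_mosaic(rgb):
    n1, n2 = np.indices(rgb.shape[:2])
    return np.take_along_axis(rgb, ((n2 - n1) % 3)[..., None], axis=-1)[..., 0]

def demosaic_stripe(cfa, sigma=2.0, radius=8):
    """Algorithm 7.3 (real arithmetic) with a Gaussian-modulated h2."""
    k1, k2, g = taps(sigma, sigma, radius, radius, (0.0, 0.0))
    phase = 2 * np.pi * (k1 - k2) / 3                      # d2 = (1/3, -1/3)
    r2r = filt(cfa, (k1, k2, g * np.cos(phase)))           # f_CFA * h2R
    r2i = filt(cfa, (k1, k2, g * np.sin(phase)))           # f_CFA * h2I
    n1, n2 = np.indices(cfa.shape)
    c2 = np.cos(2 * np.pi * (n1 - n2) / 3)
    s2 = np.sin(2 * np.pi * (n1 - n2) / 3)
    q1 = cfa - 2 * r2r
    q2 = r2r * c2 + r2i * s2
    q3 = r2r * s2 - r2i * c2
    rt3 = np.sqrt(3.0)
    return np.stack([q1 + 2 * q2, q1 - q2 - rt3 * q3, q1 - q2 + rt3 * q3], axis=-1)
```

For example, `demosaic_bayer(bayer_mosaic(rgb))` returns an array with the same shape as `rgb`. On a spatially constant color, both sketches reproduce the input up to small filter leakage. On real images, quality depends heavily on the filter design discussed in Section 7.5.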
7.6 Concluding Remarks

This chapter has presented a mathematical framework that uses a frequency-domain approach to analyze periodic CFA patterns (which include most of the patterns that have been proposed). This framework explains the typical artifacts observed in images demosaicked with these patterns and inspires new demosaicking methods with good performance and moderate complexity. The analysis has been illustrated with the common Bayer pattern and with three other CFA patterns that bring out several aspects of the theory. Many more structures have been proposed and can be analyzed with these methods. Detailed numerical and visual results for these methods have not been presented here, as that was not the goal; many such results can be found elsewhere. The adaptive method for the Bayer CFA described in this chapter is currently highly competitive with the state of the art in both demosaicked image quality and computational complexity. However, the work on this topic is not concluded. Future directions include extending the adaptive algorithm to the case of four sensor classes and integrating the method with up-sampling and super-resolution techniques.

Acknowledgments

The author thanks Markus Beermann, Stéphane Coulombe and Brian Leung for their helpful comments and corrections on earlier drafts of this chapter. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC).

Appendix: Lattices and Two-Dimensional Signals on Lattices

Lattices have been widely used to describe sampled multidimensional signals with nonrectangular sampling structures. For the purposes of this chapter, we are concerned with discrete-space two-dimensional (2D) still images. This appendix summarizes the key concepts and notation used in this chapter. Detailed expositions and illustrations can be found in References [1] and [2]. The discussion is limited to the 2D case, since that is all we require here.
1. A lattice $\Lambda$ in two dimensions is the set of all linear combinations, with integer coefficients, of two linearly independent vectors $\mathbf v_1$ and $\mathbf v_2$ in $\mathbb R^2$:
$$\Lambda = \{ n_1 \mathbf v_1 + n_2 \mathbf v_2 \mid n_1, n_2 \in \mathbb Z \}. \tag{7.78}$$
The basis vectors $\mathbf v_1$ and $\mathbf v_2$ are expressed as $2 \times 1$ column matrices, and thus so are the elements of $\Lambda$.
2. The $2 \times 2$ matrix $\mathbf V = [\mathbf v_1 \mid \mathbf v_2]$ is referred to as a sampling matrix for $\Lambda$. We then write
$$\Lambda = \mathrm{LAT}(\mathbf V) = \{ \mathbf V \mathbf n \mid \mathbf n \in \mathbb Z^2 \}. \tag{7.79}$$
The sampling matrix for a given lattice is not unique; $\mathrm{LAT}(\mathbf V_1) = \mathrm{LAT}(\mathbf V_2)$ if and only if $\mathbf E = \mathbf V_1^{-1} \mathbf V_2$ is unimodular, i.e., an integer matrix such that $|\det \mathbf E| = 1$.
3. A unit cell of a lattice $\Lambda$ is a set $\mathcal P \subset \mathbb R^2$ such that copies of $\mathcal P$ centered on each lattice point tile all of $\mathbb R^2$ without overlap. The unit cell is not unique. The area of any unit cell is $d(\Lambda) = |\det \mathbf V|$ for any sampling matrix $\mathbf V$. The Voronoi cell is a unit cell in which no point is closer to any nonzero element of $\Lambda$ than to the origin $\mathbf 0$.
4. The set $\Lambda^* = \{ \mathbf r \mid \mathbf r \cdot \mathbf x \in \mathbb Z \text{ for all } \mathbf x \in \Lambda \}$ is a lattice known as the reciprocal lattice. If $\Lambda = \mathrm{LAT}(\mathbf V)$, then $\Lambda^* = \mathrm{LAT}(\mathbf V^{-T})$, where $\mathbf V^{-T}$ denotes $(\mathbf V^T)^{-1}$, $^T$ denotes matrix transposition, and $\mathbf r \cdot \mathbf x$ denotes the matrix product $\mathbf r^T \mathbf x$. Moreover, $d(\Lambda^*) = 1/d(\Lambda)$.
5. The set $\Gamma$ is a sublattice of $\Lambda$ if both $\Lambda$ and $\Gamma$ are lattices and every point of $\Gamma$ belongs to $\Lambda$. We write $\Gamma \subset \Lambda$. $\Gamma = \mathrm{LAT}(\mathbf V_\Gamma)$ is a sublattice of $\Lambda = \mathrm{LAT}(\mathbf V_\Lambda)$ if and only if $\mathbf M = \mathbf V_\Lambda^{-1} \mathbf V_\Gamma$ is an integer matrix.
6. If $\Gamma \subset \Lambda$, then $\Lambda^* \subset \Gamma^*$.
7. If $\Gamma \subset \Lambda$, so that $\mathbf V_\Gamma = \mathbf V_\Lambda \mathbf M$ for an integer matrix $\mathbf M$, then $d(\Gamma) = |\det \mathbf M|\, d(\Lambda)$, where $K = |\det \mathbf M|$ is an integer. $K = d(\Gamma)/d(\Lambda)$ is called the index of $\Gamma$ in $\Lambda$, denoted $(\Lambda : \Gamma)$.
8. If $\Gamma \subset \Lambda$, the set
$$\mathbf c + \Gamma = \{ \mathbf c + \mathbf x \mid \mathbf x \in \Gamma \} \tag{7.80}$$
for any $\mathbf c \in \Lambda$ is called a coset of $\Gamma$ in $\Lambda$. Two cosets are either identical or disjoint: $\mathbf c + \Gamma = \mathbf d + \Gamma$ if $\mathbf c - \mathbf d \in \Gamma$; otherwise $(\mathbf c + \Gamma) \cap (\mathbf d + \Gamma) = \emptyset$. There are $K = (\Lambda : \Gamma)$ distinct cosets of $\Gamma$ in $\Lambda$. If $\mathbf b_1, \ldots, \mathbf b_K$ are arbitrary elements of these $K$ cosets, called coset representatives, we have
$$\Lambda = \bigcup_{k=1}^{K} (\mathbf b_k + \Gamma). \tag{7.81}$$
9. Let $f[\mathbf x]$, $\mathbf x \in \Lambda$, be a scalar signal defined on a lattice $\Lambda$. We define the Fourier transform of $f[\mathbf x]$ to be
$$F(\mathbf u) = \sum_{\mathbf x \in \Lambda} f[\mathbf x] \exp(-j2\pi \mathbf u \cdot \mathbf x), \tag{7.82}$$
where $\mathbf u = [u\ v]^T$ is a two-dimensional frequency vector, expressed in cycles per unit of length. The Fourier transform is a periodic function of the continuous frequency vector, with periodicity given by the reciprocal lattice: $F(\mathbf u) = F(\mathbf u + \mathbf r)$ for all $\mathbf r \in \Lambda^*$. The Fourier transform of $f[\mathbf x] \exp(j2\pi \mathbf u_0 \cdot \mathbf x)$ is $F(\mathbf u - \mathbf u_0)$ for an arbitrary fixed frequency vector $\mathbf u_0$; this is the modulation property.
10. A signal $f[\mathbf x]$, $\mathbf x \in \Lambda$, is periodic with periodicity lattice $\Gamma \subset \Lambda$ if $f[\mathbf x + \mathbf c] = f[\mathbf x]$ for all $\mathbf c \in \Gamma$. This signal takes $K = (\Lambda : \Gamma)$ distinct values, which form one period. These are $f[\mathbf b_1], \ldots, f[\mathbf b_K]$, where $\mathbf b_1, \ldots, \mathbf b_K$ is an arbitrary set of coset representatives of $\Gamma$ in $\Lambda$. The periodic signal is constant on cosets of $\Gamma$ in $\Lambda$.
11. A periodic signal $f[\mathbf x]$, $\mathbf x \in \Lambda$, with periodicity lattice $\Gamma \subset \Lambda$ has the discrete Fourier series representation
$$f[\mathbf x] = \sum_{k=1}^{K} F[k] \exp(j2\pi \mathbf x \cdot \mathbf d_k), \quad \mathbf x \in \Lambda, \tag{7.83}$$
where
$$F[k] = \frac{1}{K} \sum_{j=1}^{K} f[\mathbf b_j] \exp(-j2\pi \mathbf b_j \cdot \mathbf d_k), \quad k = 1, \ldots, K. \tag{7.84}$$
In these expressions, $K = (\Lambda : \Gamma)$, $\mathbf b_1, \ldots, \mathbf b_K$ are coset representatives for $\Gamma$ in $\Lambda$, and $\mathbf d_1, \ldots, \mathbf d_K$ are coset representatives for $\Lambda^*$ in $\Gamma^*$.

References

[1] E. Dubois, "The sampling and reconstruction of time-varying imagery with application in video systems," Proceedings of the IEEE, vol. 73, no. 4, pp. 502–522, April 1985.
[2] D.E. Dudgeon and R.M. Mersereau, Multidimensional Digital Signal Processing. Englewood Cliffs, New Jersey: Prentice-Hall, 1984.
[3] T. Yamada, Image Sensors and Signal Processing for Digital Still Cameras, ch. CCD image sensors, J. Nakamura (ed.), Boca Raton, FL: CRC Press, 2005, pp. 95–141.
[4] K. Parulski and K.E. Spaulding, Digital Color Imaging Handbook, ch. Color image processing for digital cameras, G. Sharma (ed.), Boca Raton, FL: CRC Press, 2002, pp. 727–757.
[5] R. Lukac and K.N. Plataniotis, "Color filter arrays: Design and performance analysis," IEEE Transactions on Consumer Electronics, vol. 51, no. 4, pp. 1260–1267, November 2005.
[6] H.J. Trussell and R.E. Hartwig, "Mathematics for demosaicking," IEEE Transactions on Image Processing, vol. 11, no. 4, pp. 485–492, April 2002.
[7] D. Taubman, "Generalized Wiener reconstruction of images from colour sensor data using a scale invariant prior," in Proceedings of the IEEE International Conference on Image Processing, Vancouver, Canada, September 2000, vol. III, pp. 801–804.
[8] R. Lukac and K.N. Plataniotis, "Universal demosaicking for imaging pipelines with an RGB color filter array," Pattern Recognition, vol. 38, no. 11, pp. 2208–2212, November 2005.
[9] C.A. Poynton, Digital Video and HDTV: Algorithms and Interfaces. San Francisco, California: Morgan Kaufmann, 2003.
[10] T.K. Moon and W.C. Stirling, Mathematical Methods and Algorithms for Signal Processing. Upper Saddle River, New Jersey: Prentice Hall, 2000.
[11] E. Dubois, "Frequency-domain methods for demosaicking of Bayer-sampled color images," IEEE Signal Processing Letters, vol. 12, no. 12, pp. 847–850, December 2005.
[12] D. Alleysson, S. Süsstrunk, and J. Hérault, "Linear demosaicing inspired by the human visual system," IEEE Transactions on Image Processing, vol. 14, no. 4, pp. 439–449, April 2005.
[13] E. Dubois and W.F. Schreiber, "Improvements to NTSC by multidimensional filtering," SMPTE Journal, vol. 97, no. 6, pp. 504–511, June 1988.
[14] J.J. Bean, "Cyan-magenta-yellow-blue color filter array," U.S. Patent 6 628 331-B1, September 2003.
[15] H. Hoshuyama, "Color image sensor, color filter array, and color imaging device," U.S. Patent application 2005/0212934, September 2005.
[16] S. Saito, "Solid state image pickup device having primary color and gray color filters and processing means thereof," U.S. Patent 7 126 633, October 2006.
[17] E. Dubois, "Color-filter-array sampling of color images: Frequency-domain analysis and associated demosaicking algorithms: Additional results," http://www.site.uottawa.ca/~edubois/SingleSensorImaging/, 2007.
[18] H.A. Aly and E. Dubois, "Design of optimal camera apertures adapted to display devices over arbitrary sampling lattices," IEEE Signal Processing Letters, vol. 11, no. 4, pp. 443–445, April 2004.
[19] G. Sharma, Digital Color Imaging Handbook, ch. Color fundamentals for digital imaging, G. Sharma (ed.), Boca Raton, FL: CRC Press, 2002, pp. 1–114.
[20] G. Sharma and H.J. Trussell, "Figures of merit for color scanners," IEEE Transactions on Image Processing, vol. 6, no. 7, pp. 990–1001, July 1997.
[21] R. Bala, Digital Color Imaging Handbook, ch. Device characterization, G. Sharma (ed.), Boca Raton, FL: CRC Press, 2002, pp. 269–382.
[22] E. Dubois, "Filter design for adaptive frequency-domain Bayer demosaicking," in Proceedings of the IEEE International Conference on Image Processing, Atlanta, GA, USA, October 2006, pp. 2705–2708.
[23] H.S. Malvar, L.W. He, and R. Cutler, "High-quality linear interpolation for demosaicing of Bayer-patterned color images," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Canada, May 2004, vol. III, pp. 485–488.
[24] B. Chaix de Lavarène, D. Alleysson, and J. Hérault, "Practical implementation of LMMSE demosaicing using luminance and chrominance spaces," Computer Vision and Image Understanding, Special Issue on Color Image Processing, vol. 107, no. 1-2, pp. 3–13, July/August 2007.
[25] B.K. Gunturk, J. Glotzbach, Y. Altunbasak, R.W. Schafer, and R.M. Mersereau, "Demosaicking: Color filter array interpolation," IEEE Signal Processing Magazine, vol. 22, no. 1, pp. 44–54, January 2005.
8 Linear Minimum Mean Square Error Demosaicking

David Alleysson, Brice Chaix de Lavarène, Sabine Süsstrunk, and Jeanny Hérault

8.1 Introduction
8.1.1 Trichromacy in Human Vision
8.1.2 Digital Color Image Encoding
8.1.3 Image Acquisition through Single Chip Digital Camera
8.2 Color Filter Array Signal Representation
8.2.1 Luminance-Chrominance Representation of Color Images
8.2.2 Luminance-Chrominance in Color Filter Arrays
8.2.3 Examples of Practical CFAs
8.3 Linear Systems for Luminance-Chrominance Estimation
8.3.1 Linear Estimation Using Constant Ratio Hypothesis
8.3.2 Filter Design from the CFA Spectrum
8.3.3 Wiener Estimation
8.3.3.1 Direct RGB Estimation
8.3.3.2 Estimation through Luminance and Chrominance
8.3.3.3 Performance Comparison of Different CFAs
8.4 Nonlinear and Adaptive Methods
8.4.1 Accurate Luminance Method
8.4.2 Frequency Domain Method
8.5 Conclusion
References

8.1 Introduction

Today, digital cameras are ubiquitous. One can be purchased for the price of a night's stay in a hotel in Western countries, or it may be included for "free" in a mobile phone or a laptop computer. The existence of such a common and ultimately cheap device is due to fundamental developments in the way an image is acquired by the sensor and processed by the microcomputer embedded in the camera. These developments would not have been possible without recent breakthroughs in digital color image processing [1]. In this chapter, we focus on the sampling of images by a single camera sensor. We also focus on the subsequent digital image processing step, called demosaicking, which is needed to recover the three color responses that a color image has at each spatial position [2]. We highlight the properties of the human visual system (HVS) that have been exploited in the development of digital cameras in general and demosaicking in particular. The workings of the human visual system are still a source of inspiration today, because the HVS has capabilities that digital cameras do not yet take into account. In particular, we discuss how the random arrangement of chromatic samples in the retina can suggest ways to improve color sampling by digital cameras.

This chapter is organized as follows.
In the first section, we recall the properties of the human visual system that have been exploited in the development of digital color image processing. We discuss how the trichromacy of color vision was discovered and how it is used in digital image processing, and we describe how digital images are acquired either through three sensors or through a single sensor. In the second section, we describe a model of spatio-chromatic sampling that applies to both the human visual system and digital cameras. This model makes it possible to understand the signal content of single-chip digital cameras. In the third section, we discuss linear methods for reconstructing color information from a mosaic. In the fourth section, we extend these approaches with adaptive processes.

8.1.1 Trichromacy in Human Vision

The history of color science arguably started in the 17th century with Newton's discovery [3] that sunlight is composed of several colors, the colors of the rainbow. By extension, any light is a combination of the monochromatic lights of the visible spectrum. In practice, this discovery allows us to reproduce color sensations by modulating the intensities of different monochromatic primaries. However, it is not necessary to modulate the entire wavelength domain to reproduce a color sensation because, as discovered by Young [4] and later confirmed by Helmholtz [5], the human visual system is trichromatic. Trichromacy means that modulating only three primaries is sufficient to mimic the sensation of any light with an arbitrary continuous spectrum, provided that none of the primaries can be matched by a combination of the other two. In general, we use a red (R), a green (G), and a blue (B) primary. Their names indicate where their maximum intensity lies in the visible spectrum: blue is the color of short-wavelength radiation, green of middle wavelengths, and red of long wavelengths.
Trichromacy was established long before it was known that the human retina contains three kinds of cones sensitive to three different wavelength ranges [6], [7]. These cones are called L, M, and S for their respective sensitivity to long, middle, and short wavelengths. Two lights that produce the same L, M, and S cone responses give the same color sensation; such lights are called metamers. Thus, the space that represents color in the human visual system is three-dimensional. This property is used in digital capture systems and image displays.

Trichromacy is also the basis of colorimetry [8], the science of color measurement. Colorimetry is based on color matching experiments, which consist of comparing a color with an additive mixture of monochromatic primaries. The intensities of the primaries needed to match a color sensation indicate the color content in terms of these primaries. This method was used by the Commission Internationale de l'Éclairage (CIE) in 1931 to standardize color spaces such as CIE-RGB and CIE-XYZ.

FIGURE 8.1 Example of a color image decomposition into its red, green, and blue components: (a) color image, (b) red channel, (c) green channel, and (d) blue channel. A color version of the original image is available in Figure 8.2.

An important property of color spaces is that light mixture is a linear process: the color-space position of a light mixture can be derived by adding the coordinate vectors of the lights that make up that mixture. Also, in a color space defined by the spectral sensitivities $\varphi_R$, $\varphi_G$, and $\varphi_B$ of a camera sensor, any light is represented by a linear combination of the red (R), green (G), and blue (B) primaries. Thus, the RGB values of a digital image define the corresponding color as a linear mixture of the R, G, and B primaries.
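The linearity of light mixing can be checked numerically with a discretized version of the sensor model (Equation 8.2, later in this chapter, gives the continuous form). In this sketch, the Gaussian sensitivity curves, the 10 nm wavelength sampling, and the two test spectra are all hypothetical choices for illustration, not measured camera data.

```python
import math

# Hypothetical Gaussian sensitivities (peak, width in nm); real camera curves are measured.
SENSITIVITY = {"R": (600.0, 40.0), "G": (540.0, 40.0), "B": (450.0, 30.0)}
WAVELENGTHS = range(400, 701, 10)  # visible range sampled every 10 nm
D_LAMBDA = 10.0                    # wavelength step used as d(lambda)


def phi(channel, wavelength):
    """Spectral sensitivity phi_i(lambda) of one channel."""
    peak, width = SENSITIVITY[channel]
    return math.exp(-0.5 * ((wavelength - peak) / width) ** 2)


def response(spectrum):
    """Riemann sum for C_i = integral of S(lambda) phi_i(lambda) d(lambda)."""
    return {c: sum(spectrum(w) * phi(c, w) for w in WAVELENGTHS) * D_LAMBDA
            for c in SENSITIVITY}


def reddish(w):
    return 1.0 if w >= 580 else 0.1


def bluish(w):
    return 1.0 if w <= 480 else 0.05


def mixture(w):
    # Additive mixture of the two lights: spectra add wavelength by wavelength.
    return reddish(w) + bluish(w)


r1, r2, r12 = response(reddish), response(bluish), response(mixture)
for c in "RGB":
    # The response to the mixture equals the sum of the individual responses.
    print(c, round(r12[c], 6), round(r1[c] + r2[c], 6))
```

Because the sensor response is an integral of the spectrum against fixed sensitivities, additivity holds for any choice of curves; only the numbers, not the conclusion, depend on the assumed Gaussians.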
8.1.2 Digital Color Image Encoding

Trichromacy is exploited in the formation, rendering, and reproduction of color images. As shown in Figure 8.1, a color image is a matrix with three components, one each for R, G, and B. Rendering these three components on a video screen, which has three RGB phosphors or three color filters, reproduces a color sensation equivalent to that produced by the natural scene itself. Thus, the whole color processing chain, from the acquisition of a color image through the coding of digital values to the rendering on a display, can be designed using a three-dimensional space for all color representations.

RGB is not the only color representation used in digital video and imaging. Chromatic information in images is often reduced to obtain smaller files, because the human visual system is less sensitive to high frequencies in chrominance than in luminance. In other words, the spatial resolution of chrominance can be considerably lower than that of luminance without any visible degradation of the image. A luminance-chrominance representation is analogous to the receptive-field encoding performed by the ganglion cells of the human retina. Luminance represents spatial information in terms of light-dark changes, such as edges, while chrominance represents the hue and saturation of a color (see Section 8.2.1).

There are several ways to construct a luminance-chrominance representation of a color image. For example, we can transform the R, G, and B values into a triplet called Y, Cb, and Cr, where Y represents the luma and is a positive combination of the R, G, and B values. In general, the RGB values are already "gamma-corrected," meaning that a nonlinear encoding has been applied to the color channels; Y is therefore no longer representative of the physically measurable luminance. Cb and Cr are two opponent chromatic channels.
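As an illustration of such a transform, the sketch below uses the full-range BT.601 coefficients adopted by JPEG/JFIF (one common choice among several; the chapter does not prescribe a particular matrix) to convert gamma-corrected 8-bit RGB values to Y, Cb, and Cr. It then reduces chroma resolution by averaging each 2 × 2 block, in the style of 4:2:0 subsampling, while keeping every luma sample. The toy image is hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 (JPEG/JFIF) conversion of gamma-corrected 8-bit RGB."""
    y = 0.299 * r + 0.587 * g + 0.114 * b             # luma: positive weights summing to 1
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b  # blue-difference channel
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b  # red-difference channel
    return y, cb, cr


def subsample_2x2(plane):
    """Average each 2x2 block of a chroma plane (list of rows, even dimensions)."""
    return [
        [
            (plane[i][j] + plane[i][j + 1] + plane[i + 1][j] + plane[i + 1][j + 1]) / 4
            for j in range(0, len(plane[0]), 2)
        ]
        for i in range(0, len(plane), 2)
    ]


# A 2x4 toy image: a gray 2x2 block, then a red row over a blue row.
image = [
    [(200, 200, 200), (200, 200, 200), (255, 0, 0), (255, 0, 0)],
    [(200, 200, 200), (200, 200, 200), (0, 0, 255), (0, 0, 255)],
]
ycc = [[rgb_to_ycbcr(*px) for px in row] for row in image]
luma = [[p[0] for p in row] for row in ycc]  # kept at full resolution
cb = subsample_2x2([[p[1] for p in row] for row in ycc])
cr = subsample_2x2([[p[2] for p in row] for row in ycc])
print(len(luma) * len(luma[0]), "luma samples,", len(cb) * len(cb[0]), "Cb samples")
```

For a gray pixel both chroma channels sit at the neutral value 128, because the Cb and Cr coefficients sum to zero, which is the "opponent" property described in the text.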
The amount of image data can be reduced by retaining all Y values but subsampling Cb and Cr by a factor of two or four, without significant visual loss.

8.1.3 Image Acquisition through Single Chip Digital Camera

As in the human retina, where there is only one cone type per spatial location, most digital cameras today use a single sensor to capture color images. The sensor is covered by a mosaic of color filters that allows the acquisition of different chromatic contents of the scene. In general, each filter transmits either blue, green, or red light. Consequently, a single chromatic value is sampled at each spatial location. Reconstructing three chromatic values from this mosaic of single values requires a signal processing method called demosaicking.

There are many color filter arrays (CFAs) for digital cameras. The key issue when designing a CFA is the ability to fully reconstruct a color image with three chromatic components from the mosaic of single color values. Sampling a scene with a single color filter per spatial location forces a compromise between the representation of the spatial and of the chromatic information of the image: spatial and chromatic information are mixed together in a CFA image. Can we then design an arrangement of color filters that maximizes the ability to fully reconstruct the spatial and chromatic content of the scene? This question is still unresolved, but we discuss the properties of several CFAs in terms of their spatial and chromatic representation of light in Section 8.2.2.

The first proposed color filter array was composed of vertical stripes of red, green, and blue columns. It has also been proposed to orient the stripes diagonally. These arrangements, although easy to build, have not been used extensively because their color sampling frequencies differ in the horizontal and vertical directions.
Another CFA was proposed by Bayer [9] in 1976. It fulfills two constraints. First, its color sampling frequency is the same in the vertical and horizontal directions for all three colors. Second, it has twice as many green pixels as red or blue pixels, favoring the sampling of luminance, as stated by the inventor. The diagonal sampling frequency is thus higher for green than for red and blue. Initially, this CFA was less successful than it is today. Recall that in 1976 most electronic image capture served analog color television, which used interlaced image encoding. In interlaced video, the first field displays only the even image lines and the second field displays the odd lines. This method reduces the amount of data to be processed at one time, and it became a standard for color television. The problem with the Bayer CFA is that red values are sampled only on the even lines and blue values only on the odd lines (or vice versa). As a consequence, red and blue flickered when video frames were displayed in interlaced mode. To compensate for this, Dillon [10] proposed another CFA in which red and blue are present on every line. Nevertheless, the Bayer CFA is today certainly the most popular CFA in digital cameras, because it represents chromatic and spatial information well. We will also discuss a CFA proposed by Lukac [11], which improves on the Bayer CFA in the horizontal representation of luminance values. Several other CFAs exist that use either four colors or a hexagonal arrangement of the color filters; we do not discuss them in this chapter. Also, Foveon [12] has recently proposed a different method for acquiring a color image with a single sensor. This method exploits the fact that the penetration depth of light in silicon depends on its wavelength. It separates red, green, and blue at each pixel by reading the responses at different well depths, and it is not discussed further here.
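The Bayer layout can be made concrete with a small sketch. It assumes red at position (0, 0), the same convention the chapter uses later for Equation 8.10, and Matlab-style coordinates (x is the row index, y the column index). Each color gets a 0/1 indicator submosaic, and the single-channel CFA image keeps, at every pixel, only the channel whose filter is present there. The function names and the toy image are illustrative only.

```python
def bayer_masks(height, width):
    """Indicator submosaics m_R, m_G, m_B of a Bayer CFA with red at (0, 0).

    x is the row (vertical) index and y the column (horizontal) index.
    """
    m_r = [[1 if x % 2 == 0 and y % 2 == 0 else 0 for y in range(width)] for x in range(height)]
    m_g = [[1 if (x + y) % 2 == 1 else 0 for y in range(width)] for x in range(height)]
    m_b = [[1 if x % 2 == 1 and y % 2 == 1 else 0 for y in range(width)] for x in range(height)]
    return {"R": m_r, "G": m_g, "B": m_b}


def mosaic(image, masks):
    """Single-channel CFA image: I_m(x, y) = sum_i C_i(x, y) * m_i(x, y)."""
    channels = {"R": 0, "G": 1, "B": 2}
    return [
        [
            sum(image[x][y][k] * masks[c][x][y] for c, k in channels.items())
            for y in range(len(image[0]))
        ]
        for x in range(len(image))
    ]


masks = bayer_masks(4, 4)
for c, m in masks.items():
    # Mean of each submosaic = proportion p_i of pixels carrying that color.
    print(c, sum(map(sum, m)) / 16)
image = [[(10 * x + y, 100, 200) for y in range(4)] for x in range(4)]
print(mosaic(image, masks)[0][:2])
```

The printed proportions are 1/4, 1/2, and 1/4, which is the "twice as many green pixels" property of the Bayer pattern.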
Finally, inspired by the human visual system, whose cones are randomly arranged [13] on the surface of the retina, we study the case of a random arrangement of chromatic samples for digital cameras. The problem with a random arrangement is that the neighborhood of a sample changes from location to location, so uniform, space-invariant reconstruction methods cannot be used. However, by periodically repeating a pattern of random color filters over the sensor surface, we benefit from the nonaliasing properties of random sampling and can still reconstruct the color with a linear method.

8.2 Color Filter Array Signal Representation

In this section, we show that, for demosaicking, representing a color image in luminance and opponent chromatic channels is better than representing it in red, green, and blue. This representation separates the spatial and chromatic contents of a scene. Moreover, it remains effective even for a mosaic image with a single chromatic value per spatial position.

8.2.1 Luminance-Chrominance Representation of Color Images

Given the trichromatic properties of the human visual system discussed in the previous section, it is natural to represent a color image by three components. These components represent the energy measured by three sensors with different sensitivities, usually red, green, and blue. However, an RGB representation of a color image does not highlight the two main properties of our visual perception of scenes: the ability to account both for the intensity of the light source on an object surface and for the colors of the object. Since the tristimulus values form a vector space, a linear transformation from this space to another one with different properties is possible without drastically changing the nature of the data. A linear transformation also allows for an inverse transformation.
The transformation should decorrelate the spatial information from the chromatic content as much as possible, so that the color data can be processed without aliasing effects. The spatial component should contain no chromatic content, while the chromatic component should be free of any intensity information. Intuitively, the achromatic channel should be a positive combination of the red, green, and blue values, whereas the chromatic channels should be differences of these values with an average value of zero. Infinitely many transformations from RGB to achromatic and chromatic channels follow these rules. In fact, the achromatic information of a tristimulus value is not uniquely defined: the spectral sensitivity of the achromatic channel is determined by the respective contributions of the spectral sensitivities of the R, G, and B channels.

FIGURE 8.2 (See color insert.) Example of color image decomposition into its luminance and chrominance components. (a) Color image with three chromatic values R, G, and B at each spatial location. (b) The luminance component, a scalar value per spatial position corresponding to the mean of the RGB values. (c) The chrominance values, having three components per spatial position, corresponding to the difference of each RGB component with the luminance.

Let $I = \{C_R, C_G, C_B\}$ be a color image with the three color planes red, green, and blue. The projection of the color channels on an achromatic axis is given by the following sum, where $p_i$ (for $i \in \{R, G, B\}$, with $\sum_i p_i = 1$) is the proportion of the chromatic component $i$ in the achromatic signal $\phi$:

$$\phi = \sum_i p_i C_i \tag{8.1}$$

The content of the signal $C_i$ can be expressed from the irradiance $S(x, y, \lambda)$ of the scene through the spectral sensitivities $\varphi_i$ ($i = R, G, B$) of the filters of the acquisition process:

$$C_i(x, y) = \int_\lambda S(x, y, \lambda)\, \varphi_i(\lambda)\, d\lambda \tag{8.2}$$

From Equations 8.1 and 8.2, the spectral sensitivity of luminance is $\sum_i p_i \varphi_i$. Luminance therefore depends directly both on the spectral sensitivity functions of the chromatic channels R, G, and B and on the proportion of each of these channels in the luminance signal.

For the human visual system, the luminance signal is not well defined. The CIE recognizes the $V(\lambda)$ function [8] as the luminosity function of the standard observer, which is meant to represent the human ability to assess the sensation of luminosity. It is therefore possible to simulate this function in the luminance computation by choosing the coefficients $p_i$ so that the spectral sensitivity of the luminance matches the shape of $V(\lambda)$ as closely as possible.

Concerning chrominance, the debate is even more open. There is no real consensus on what the human visual system computes as chromatic channels. Some indications are given by the work of Hering [14] and of Jameson and Hurvich [15], who estimated opponent color responses psychophysically. Recall that our chrominance is defined as having zero mean, so that it carries no intensity information. Also, since the color space is three-dimensional, we consider the chrominance space to be three-dimensional as well, to account for the variability along each of the chromatic axes.
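A short numerical sketch of Equation 8.1, with chrominance taken as the difference between each channel and the luminance (the definition formalized next in Equation 8.3). Two weightings are assumed for illustration: the Bayer proportions (1/4, 1/2, 1/4) and the plain RGB mean used for Figure 8.2(b). The pixel value is arbitrary.

```python
def luminance(rgb, p):
    """Equation 8.1: phi = sum_i p_i C_i, with the weights p summing to 1."""
    return sum(w * c for w, c in zip(p, rgb))


def chrominance(rgb, p):
    """Opponent components psi_i = C_i - phi, one per color channel."""
    phi = luminance(rgb, p)
    return [c - phi for c in rgb]


p_bayer = (0.25, 0.5, 0.25)      # proportions of R, G, B in a Bayer CFA
p_equal = (1 / 3, 1 / 3, 1 / 3)  # plain mean of R, G, B, as in Figure 8.2(b)
pixel = (180.0, 120.0, 60.0)
for p in (p_bayer, p_equal):
    psi = chrominance(pixel, p)
    # The weighted average sum_i p_i psi_i vanishes, so the three chrominance
    # components only carry two degrees of freedom.
    print(luminance(pixel, p), psi, sum(w * s for w, s in zip(p, psi)))
```

Whatever weights are chosen, the three chrominance values are tied by one linear constraint, which is the linear dependence discussed in the next paragraph.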
However, these variables are linearly dependent, and the intrinsic dimension of chrominance is thus only two.

FIGURE 8.3 Plots of B as a function of G (left) and $\psi_B$ as a function of $\phi$ (right) for the example of Figure 8.2. B and G have a principal component along $y = x$, while $\psi_B$ and $\phi$ are more decorrelated.

If we consider the difference between each chromatic channel and the luminance function defined in Equation 8.1, we have:

$$\psi_i = C_i - \phi = C_i - \sum_j p_j C_j = (1 - p_i)\, C_i - \sum_{j \neq i} p_j C_j \tag{8.3}$$

Since $\sum_i p_i = 1$, it follows that $\sum_i p_i \psi_i = \phi - \phi \sum_i p_i = 0$, meaning that the weighted average of the chrominance components is zero at each pixel. Our definitions of luminance and chrominance therefore respect the condition we imposed. Figure 8.2 shows an example of a luminance-chrominance decomposition. Note that spatial and chromatic information are represented separately, in contrast to Figure 8.1, where spatial information is present in all three of the R, G, and B channels. Figure 8.3 plots B values as a function of G values, and $\psi_B$ as a function of $\phi$. Chrominance and luminance are clearly more decorrelated than the color planes B and G.

8.2.2 Luminance-Chrominance in Color Filter Arrays

To model the sampling of chromatic values by a mosaic, we use the formal construction of the mosaic itself. A mosaic $m$ is a spatial arrangement of chromatic samples. Assuming that the samples lie on a grid (either square or hexagonal) and that every position on the grid is filled by one filter color, we can decompose the global mosaic into the submosaics $m_R$, $m_G$, and $m_B$.
Each submosaic corresponds to the spatial arrangement of one type of color filter, with a value of one where the filter is present and zero at the other positions:

$$m(x, y) = m_R(x, y) + m_G(x, y) + m_B(x, y) \tag{8.4}$$

with $(x, y) \in \mathbb{N}^2$ the integer coordinates on the grid.¹

¹Note that in this chapter, x and y follow the Matlab convention: the origin is at the top left, x is the vertical axis, and y is the horizontal axis.

Since the mosaics are periodic, their Fourier transforms consist of Dirac impulses in the frequency domain. One period of the mosaics' Fourier transforms is given by:

$$\hat m = \delta_0, \qquad \hat m_R = r_0 \delta_0 + \sum_{n \neq 0} r_n \delta_n, \qquad \hat m_G = g_0 \delta_0 + \sum_{n \neq 0} g_n \delta_n, \qquad \hat m_B = b_0 \delta_0 + \sum_{n \neq 0} b_n \delta_n \tag{8.5}$$

where $\delta_0 = \delta(\nu_x, \nu_y)$ is the Dirac distribution over the spatial frequencies $\nu_x$ and $\nu_y$ associated with the spatial dimensions x and y, and $\delta_n = \delta(n_x - \nu_x, n_y - \nu_y)$. Thus, $\delta_n$ denotes the Dirac impulse at spatial frequency $n = (n_x, n_y)$. For the Bayer CFA pattern, for example, n ranges over the set of frequencies [16]:

$$(n_x, n_y) \in \{-1, 0, 1\}^2, \qquad (n_x, n_y) \neq (0, 0) \tag{8.6}$$

where 1 is the normalized Nyquist frequency. The terms $r_0$, $g_0$, and $b_0$ denote the mean values of each submosaic, i.e., the probability that a spatial location contains a sample of the respective color. We call them $p_R$, $p_G$, and $p_B$, respectively. Since the complete mosaic m equals one everywhere, the uniqueness of the Fourier transform lets us conclude:

$$p_R + p_G + p_B = 1, \qquad r_n + g_n + b_n = 0 \quad (\forall n \neq 0) \tag{8.7}$$

The resulting CFA image $I_m$ (a single-channel image containing the different color submosaics) is obtained by:

$$I_m(x, y) = \sum_{i \in \{R, G, B\}} C_i(x, y)\, m_i(x, y) \tag{8.8}$$

The multiplication in Equation 8.8 becomes a convolution product in the Fourier domain.
Hence, using Equation 8.5 and the fact that convolution with a Dirac impulse corresponds to a translation to the frequency of that impulse, we obtain:

$$\hat I_m(\nu) = \underbrace{\sum_i p_i \hat C_i(\nu)}_{\hat\phi(\nu)} + \sum_{n \neq 0} \underbrace{\left[ r_n \hat C_R(n - \nu) + g_n \hat C_G(n - \nu) + b_n \hat C_B(n - \nu) \right]}_{\hat\psi_n(n - \nu)} \tag{8.9}$$

Since $\phi$ is a linear combination of color signals with positive weights, it represents luminance. The term $\psi_n(n - \nu)$ is a linear combination of color signals with coefficients that sum to zero (Equation 8.7), modulated at frequency n; it represents modulated chrominance. Figure 8.4 shows five examples of CFAs and the amplitude spectra of an image sampled by those CFAs. Note that the CFA pattern determines the location of the chrominance and thus controls the amount of aliasing between the baseband luminance and the modulated chrominances. For additional information on CFAs, refer to Chapters 1 and 5.

FIGURE 8.4 CFA patterns and amplitude spectra of corresponding CFA images: (a) Bayer pattern, (b) vertical stripe pattern, (c) diagonal stripe pattern, (d) Lukac pattern [11], and (e) tiling of a 6 × 6 pseudorandom pattern. Figure 5.2 shows these CFAs in color.

8.2.3 Examples of Practical CFAs

The formalism developed in Section 8.2.2 allows us to revisit the notion of opponent chromatic channels for digital camera acquisition and for the human visual system. The fact that the mosaic construction, i.e., the arrangement of the chromatic samples of each type, defines the nature of the luminance and chromatic channels is very important: it could be used either to design appropriate cameras or to understand the nature of opponent chromatic channels in the visual system. Equation 8.9 shows that the position of chrominance in the Fourier spectrum depends on the spatial arrangement of each corresponding chromatic value in the CFA.
For the Bayer CFA, for example, the R color filter is repeated every other pixel in both the horizontal and vertical directions. This places the red chrominance at the border of the Fourier spectrum. To show precisely where the chrominances are located, it is useful to describe the arrangement of the chromatic values in the CFA mathematically. For the Bayer pattern, if we consider the pixel at position (0,0) to be red, we have:

$$\begin{aligned} m_R(x, y) &= (1 + \cos \pi x)(1 + \cos \pi y)/4 \\ m_G(x, y) &= (1 - \cos \pi (x + y))/2 \\ m_B(x, y) &= (1 - \cos \pi x)(1 - \cos \pi y)/4 \end{aligned} \tag{8.10}$$

This equation can be rewritten by separating the constant part from the modulation part given by the cosine functions:

$$\begin{aligned} m_R(x, y) &= 1/4 + \underbrace{(\cos \pi x + \cos \pi y + \cos \pi x \cos \pi y)/4}_{\tilde m_R} \\ m_G(x, y) &= 1/2 + \underbrace{(-\cos \pi (x + y))/2}_{\tilde m_G} \\ m_B(x, y) &= 1/4 + \underbrace{(-\cos \pi x - \cos \pi y + \cos \pi x \cos \pi y)/4}_{\tilde m_B} \end{aligned} \tag{8.11}$$

which shows that the decomposition into luminance and chrominance is determined by the mosaic arrangement. It follows that the luminance in the Bayer CFA is defined as:

$$\phi_{\text{Bayer}}(x, y) = \sum_i p_i C_i(x, y) = \frac{R + 2G + B}{4} \tag{8.12}$$

with $R = C_R(x, y)$, $G = C_G(x, y)$, and $B = C_B(x, y)$. The chrominance in the Bayer CFA is then defined as the difference between the CFA image and the luminance image:

$$\psi_{\text{Bayer}}(x, y) = I_m(x, y) - \phi_{\text{Bayer}}(x, y) = \sum_i \left[ m_i(x, y) - p_i \right] C_i(x, y) = \sum_i \tilde m_i(x, y)\, C_i(x, y) \tag{8.13}$$

The Fourier transform $\hat\psi_{\text{Bayer}}$ can be calculated explicitly and decomposed into two modulated chrominance components:

$$\hat\psi_{\text{Bayer}}(\nu_x, \nu_y) = \hat\psi_1(\nu) * \sum_{n \in N_1} \delta_n + \hat\psi_2(\nu) * \sum_{n \in N_2} \delta_n \tag{8.14}$$

where $*$ denotes a convolution, with the sets of frequencies $N_1 = \{(1, 1)\}$ and $N_2 = \{(0, 1), (1, 0)\}$, and with

$$\hat\psi_1(\nu) = (\hat R - 2\hat G + \hat B)/16, \qquad \hat\psi_2(\nu) = (\hat R - \hat B)/8 \tag{8.15}$$

The chrominance $\psi_{\text{Bayer}}$ actually reflects a three-dimensional space of chromatic opponent channels, which is subsampled and coded in a single lattice.
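Equation 8.10 can be checked numerically. The sketch below compares the cosine forms with the 0/1 Bayer indicator functions (red at (0, 0), x as the row index) on an 8 × 8 grid, and averages them to recover the constant parts 1/4, 1/2, and 1/4 of Equation 8.11, which are the weights of the Bayer luminance in Equation 8.12. The grid size is an arbitrary choice.

```python
import math


def bayer_color(x, y):
    """Color at (x, y) in a Bayer CFA with red at (0, 0)."""
    if x % 2 == 0 and y % 2 == 0:
        return "R"
    if x % 2 == 1 and y % 2 == 1:
        return "B"
    return "G"


def cosine_masks(x, y):
    """Equation 8.10: the Bayer submosaics written with cosine modulations."""
    cx, cy = math.cos(math.pi * x), math.cos(math.pi * y)
    return {
        "R": (1 + cx) * (1 + cy) / 4,
        "G": (1 - math.cos(math.pi * (x + y))) / 2,
        "B": (1 - cx) * (1 - cy) / 4,
    }


def phi_bayer(r, g, b):
    """Equation 8.12: luminance seen by the Bayer CFA."""
    return (r + 2 * g + b) / 4


size = 8
means = {"R": 0.0, "G": 0.0, "B": 0.0}
for x in range(size):
    for y in range(size):
        m = cosine_masks(x, y)
        for c in "RGB":
            expected = 1.0 if bayer_color(x, y) == c else 0.0
            assert abs(m[c] - expected) < 1e-12  # cosine form matches the indicator
            means[c] += m[c] / size**2

print({c: round(v, 6) for c, v in means.items()})  # constant parts p_i
```

The averages are exactly the mean values $r_0$, $g_0$, $b_0$ of Equation 8.5, so the luminance weights follow directly from the pattern.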
Let us compute the product $\psi_{\text{Bayer}}(x, y)\, m_i(x, y)$, which selects from the chrominance image the values corresponding to the chromatic value i in the Bayer CFA. Using $m_i = p_i + \tilde m_i$ and the fact that $m_i m_j = 0$ for $i \neq j$, we have:

$$\psi_{\text{Bayer}}(x, y)\, m_i(x, y) = \sum_j \tilde m_j(x, y)\, m_i(x, y)\, C_j(x, y) = \underbrace{\Big[ (1 - p_i)\, C_i(x, y) - \sum_{j \neq i} p_j C_j(x, y) \Big]}_{\psi_i} m_i(x, y) \tag{8.16}$$

Chrominance is thus composed of three different opponent chromatic channels. For the Bayer CFA, these channels are:

$$\psi_R = (3R - 2G - B)/4, \qquad \psi_G = (-R + 2G - B)/4, \qquad \psi_B = (-R - 2G + 3B)/4 \tag{8.17}$$

The three chromatic opponent channels can be recovered from the chrominance image $\psi_{\text{Bayer}}$ in two equivalent ways. First, demultiplexing (i.e., multiplying) with the function $m_i$ brings the chrominances $\psi_R$, $\psi_G$, and $\psi_B$ back to the center of the spectrum of each color plane [16]. Since $m_i$ also modulates high frequencies, this operation has to be followed by a low-pass filter. Alternatively, the modulated chrominances $\psi_1$ and $\psi_2$ can be estimated directly using two different filters and then demodulated to low frequencies [17], [18]. Refer to Chapters 5 and 7 for additional information on frequency multiplexing and demultiplexing.

The same equations can be derived for Lukac's CFA. Its mosaics can be written formally as:

$$\begin{aligned} m_R(x, y) &= 1/4 + \cos(\pi y) \cos(\tfrac{\pi}{2} x)/2 + \cos(\pi x)/4 \\ m_G(x, y) &= 1/2 - \cos(\pi y) \cos(\tfrac{\pi}{2} x)/2 - \cos(\pi y) \cos(\tfrac{\pi}{2}(x - 1))/2 \\ m_B(x, y) &= 1/4 + \cos(\pi y) \cos(\tfrac{\pi}{2}(x - 1))/2 + \cos(\pi (x - 1))/4 \end{aligned} \tag{8.18}$$

which yields the modulated chrominances $\hat\psi_{\text{Lukac}}(\nu_x, \nu_y) = \hat\psi_1(\nu) * \sum_{n \in N_1} \delta_n + \hat\psi_2(\nu) * \sum_{n \in N_2} \delta_n + \hat\psi_3(\nu) * \sum_{n \in N_3} \delta_n$, with:

$$\hat\psi_1(\nu) = (\hat R - (1 + j)\hat G + j\hat B)/8, \qquad \hat\psi_2(\nu) = (\hat R - \hat B)/8, \qquad \hat\psi_3(\nu) = (\hat R - (1 - j)\hat G - j\hat B)/8 \tag{8.19}$$

where $j = \sqrt{-1}$, $N_1 = \{(1, \tfrac{1}{2})\}$, $N_2 = \{(0, 1)\}$, and $N_3 = \{(1, -\tfrac{1}{2})\}$.
The chrominances are also located at the border of the spectrum (see Figure 8.4), but, in contrast to the Bayer CFA, there is maximum resolution in the horizontal direction. The luminance and the demodulated chrominances are actually the same as for the Bayer CFA, since the proportions $p_i$ are the same in both CFAs.

8.3 Linear Systems for Luminance-Chrominance Estimation

The missing values in a mosaic of chromatic samples can be reconstructed through a linear combination of the neighboring pixel values. In this section, we describe several methods, ranging from the simple copying of pixels and bilinear interpolation to more sophisticated methods that take into account the properties of CFA images. We also describe demosaicking as a generalized inverse problem.

According to Reference [9], demosaicking was originally not a priority in the 1970s. Cameras were essentially designed for color television, and the number of pixels in the camera was chosen to match the number of lines of the television format. Moreover, the TV color format was YCbCr 4:2:2, so, at least at acquisition, no interpolation was required. Later, the reconstruction of missing color values became necessary to display appropriate colors on the screen. The first method was simply to copy the value of a neighboring pixel. Figure 8.5a depicts this method for the Bayer CFA, and Figure 8.6a shows the result of applying it to an image. Soon after, bilinear interpolation was proposed, which consists of averaging the values of the closest neighbors of the same color. Following the Bayer CFA depicted in Figure 8.5b, the linear interpolation of the green and blue values at position (2,2) is given by:

$$G_{22} = (G_{12} + G_{21} + G_{32} + G_{23})/4, \qquad B_{22} = (B_{11} + B_{31} + B_{13} + B_{33})/4 \tag{8.20}$$

FIGURE 8.5 Illustration of (a) pixel copy and (b) bilinear interpolation.
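Bilinear interpolation on the Bayer CFA can be sketched as an average of the same-color samples found in the 3 × 3 window around each pixel. The layout below is the one implied by Equation 8.20 (1-based (row, column) indices, blue at (1,1), red at (2,2)). At the image border the sketch simply averages the fewer samples available, which is one possible convention and an assumption of this example. The scene is a hypothetical linear ramp, chosen because bilinear interpolation reproduces linear ramps exactly away from the border.

```python
def bayer_color(i, j):
    """Bayer layout of Equation 8.20 (1-based): blue at (1, 1), red at (2, 2)."""
    if i % 2 == 1 and j % 2 == 1:
        return "B"
    if i % 2 == 0 and j % 2 == 0:
        return "R"
    return "G"


def bilinear_value(cfa, i, j, color):
    """Average of the same-color samples in the 3x3 window around (i, j).

    In the interior this reproduces Equation 8.20: four horizontal/vertical
    neighbors for green at red/blue sites, four diagonal neighbors for blue at
    red sites (and vice versa), and two neighbors for red/blue at green sites.
    """
    if bayer_color(i, j) == color:
        return cfa[(i, j)]  # measured value, nothing to interpolate
    samples = [
        cfa[(a, b)]
        for a in (i - 1, i, i + 1)
        for b in (j - 1, j, j + 1)
        if (a, b) in cfa and bayer_color(a, b) == color
    ]
    return sum(samples) / len(samples)


def demosaic_bilinear(cfa, rows, cols):
    """RGB estimate at every 1-based position of a rows x cols CFA."""
    return {
        (i, j): tuple(bilinear_value(cfa, i, j, c) for c in "RGB")
        for i in range(1, rows + 1)
        for j in range(1, cols + 1)
    }


# A 6x6 scene whose channels are linear ramps, sampled through the Bayer CFA.
scene = {(i, j): (10 * i, 5 * i + 5 * j, 20 * j) for i in range(1, 7) for j in range(1, 7)}
cfa = {pos: value["RGB".index(bayer_color(*pos))] for pos, value in scene.items()}
rgb = demosaic_bilinear(cfa, 6, 6)
print(rgb[(2, 2)], scene[(2, 2)])
```

In the interior, this window average is the same operation as convolving each zero-filled channel with the filters $F_G$ and $F_{RB}$ of Equation 8.21.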
Bilinear interpolation operates on the three color channels in isolation. If we consider each channel with its existing values and zeros at the missing positions, it can easily be shown that the following convolution filters perform bilinear interpolation of the missing colors:

$$F_G = \frac{1}{4}\begin{bmatrix} 0 & 1 & 0 \\ 1 & 4 & 1 \\ 0 & 1 & 0 \end{bmatrix}, \qquad F_{RB} = \frac{1}{4}\begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix} \tag{8.21}$$

Both methods can produce significant color artifacts (see Figure 8.6b) and are thus not always suitable. The development of digital cameras for general use and the possibility of embedding a microprocessor in the camera have encouraged investigations into more sophisticated demosaicking algorithms, as discussed in the next sections.

8.3.1 Linear Estimation Using Constant Ratio Hypothesis

To improve on bilinear interpolation, Cok [19], [20] proposed interpolating hue instead of each color channel separately. This method is based on the observation that hue tends to remain constant over a local neighborhood. Hue is calculated as the ratio between red and green (R/G) and between blue and green (B/G). The method proceeds as follows. First, the missing green values are calculated by bilinear interpolation. Then, the ratios R/G and B/G are computed at the respective pixel positions. The missing hue values are then interpolated, and the respective color (red or blue) is obtained by multiplying each hue value by the corresponding green value. For example, the blue value at position (2,2) is interpolated as:

$$B_{22} = G_{22}\, \frac{1}{4}\left( \frac{B_{11}}{G_{11}} + \frac{B_{13}}{G_{13}} + \frac{B_{31}}{G_{31}} + \frac{B_{33}}{G_{33}} \right) \tag{8.22}$$

Figure 8.7a shows a result of demosaicking with this method. An interesting exploitation of the constant hue algorithm was proposed by Crane et al. [21].
They used the constant hue hypothesis to derive convolution filters that apply directly to the CFA rather than separately to each color plane. They originally designed their method for a CFA called Chromaplex; here we show the algorithm for the Bayer CFA.

FIGURE 8.6 (See color insert.) Examples of demosaicking by (a) pixel copy and (b) bilinear interpolation.

FIGURE 8.7 (See color insert.) Examples of demosaicking by (a) constant hue and (b) predetermined coefficients.

Suppose that we want to interpolate the red value at position (3,3). We denote existing color values with capital letters and missing values with lowercase letters, and write

$$r_{33} = B_{33} - \bar B + \bar R \tag{8.23}$$

where $\bar B$ and $\bar R$ are the averages of blue and red in the neighborhood. These averages can be expressed with the existing values in the neighborhood as follows:

$$r_{33} = B_{33} - (B_{33} + B_{31} + B_{13} + B_{35} + B_{53})/5 + (R_{22} + R_{24} + R_{42} + R_{44})/4 \tag{8.24}$$

Multiplying by a factor of 20, we obtain the following filter with integer coefficients:

$$\frac{1}{20}\begin{bmatrix} 0 & 0 & -4 & 0 & 0 \\ 0 & 5 & 0 & 5 & 0 \\ -4 & 0 & 16 & 0 & -4 \\ 0 & 5 & 0 & 5 & 0 \\ 0 & 0 & -4 & 0 & 0 \end{bmatrix}$$

Note that many different filters are possible, depending on the spatial extent we want to give them. We tested many of them and report only those that give the best results for the images we used.

For the Bayer CFA, we need to consider several positions. The previous example gives red values at blue pixel positions and, by symmetry, blue values at red pixel positions. Using the following filter

$$f_1 = \frac{1}{80}\begin{bmatrix} 0 & 0 & -4 & 0 & 0 \\ 0 & 5 & 0 & 5 & 0 \\ -4 & -16 & 12 & -16 & -4 \\ 0 & 30 & 64 & 30 & 0 \\ -4 & -16 & 12 & -16 & -4 \\ 0 & 5 & 0 & 5 & 0 \\ 0 & 0 & -4 & 0 & 0 \end{bmatrix}$$

results in red values at positions (2,3) and (4,5) and blue values at positions (3,2) and (3,4).
The second filter

$$f_2 = \frac{1}{80}\begin{bmatrix} 0 & 0 & -4 & 0 & -4 & 0 & 0 \\ 0 & 5 & -16 & 30 & -16 & 5 & 0 \\ -4 & 0 & 12 & 64 & 12 & 0 & -4 \\ 0 & 5 & -16 & 30 & -16 & 5 & 0 \\ 0 & 0 & -4 & 0 & -4 & 0 & 0 \end{bmatrix}$$

is suitable for obtaining red values at positions (3,2) and (3,4) and blue values at positions (2,3) and (4,3). Finally, the third filter

$$f_3 = \frac{1}{100}\begin{bmatrix} 0 & 0 & 4 & 0 & 4 & 0 & 0 \\ 0 & -5 & 0 & -10 & 0 & -5 & 0 \\ 4 & 0 & -8 & 25 & -8 & 0 & 4 \\ 0 & -10 & 25 & 60 & 25 & -10 & 0 \\ 4 & 0 & -8 & 25 & -8 & 0 & 4 \\ 0 & -5 & 0 & -10 & 0 & -5 & 0 \\ 0 & 0 & 4 & 0 & 4 & 0 & 0 \end{bmatrix}$$

can be used to recover all missing green values. Figure 8.7b shows a reconstruction example with these filters.

8.3.2 Filter Design from the CFA Spectrum

The CFA sampling model in Section 8.2.2 shows that luminance and chrominance are localized in the Fourier spectrum. Thus, a filter that selects the frequency components of the luminance independently of those of the chrominance can act as a luminance selection filter. We can even derive a linear space-invariant filter for estimating luminance [16], [22], so that the processing does not depend on the position considered in the image. The method uses a linear shift-invariant finite impulse response filter to estimate the luminance in the CFA image. Generally, a 9 × 9 filter gives sufficiently accurate results. Once the luminance is estimated, it is subtracted from the CFA image. The resulting image contains only chrominance; this is equivalent to applying the orthogonal filter for estimating the chrominance. Then, the chrominance is demultiplexed according to the red, green, and blue CFA arrangements, resulting in three images containing opponent chromatic channels. The demultiplexed chrominance is then interpolated using simple bilinear filters, such as those of Equation 8.21. The principle of demosaicking by frequency selection is shown in Figure 8.8.
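The frequency-selection pipeline (luminance selection, subtraction, demultiplexing, bilinear interpolation of the chrominance, and addition of the luminance) can be sketched as follows. As an assumption for illustration only, the luminance filter is the 3 × 3 separable binomial kernel [1 2 1] ⊗ [1 2 1]/16, not the 9 × 9 filter of [16]: its frequency response vanishes exactly at the Bayer chrominance frequencies, but it is far too small for real images, where it blurs luminance and lets chrominance near those frequencies leak through. Borders wrap around periodically, and the Bayer layout has red at (0, 0) as in Equation 8.10.

```python
# Hypothetical luminance filter (sum 16); zero response at the Bayer chroma frequencies.
LUM_FILTER = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
F_G = [[0, 1, 0], [1, 4, 1], [0, 1, 0]]   # Equation 8.21 without the 1/4 factor
F_RB = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]  # Equation 8.21 without the 1/4 factor


def convolve(img, kernel, scale):
    """3x3 filtering with periodic borders, divided by scale.

    The kernels are symmetric, so correlation and convolution coincide.
    """
    h, w = len(img), len(img[0])
    return [
        [
            sum(kernel[a + 1][b + 1] * img[(x + a) % h][(y + b) % w]
                for a in (-1, 0, 1) for b in (-1, 0, 1)) / scale
            for y in range(w)
        ]
        for x in range(h)
    ]


def bayer_color(x, y):
    """Bayer layout with red at (0, 0), as in Equation 8.10."""
    if x % 2 == 0 and y % 2 == 0:
        return "R"
    return "B" if x % 2 == 1 and y % 2 == 1 else "G"


def demosaic_frequency_selection(cfa):
    h, w = len(cfa), len(cfa[0])
    lum = convolve(cfa, LUM_FILTER, 16)  # luminance selection
    chroma = [[cfa[x][y] - lum[x][y] for y in range(w)] for x in range(h)]  # multiplexed chrominance
    out = {}
    for c in "RGB":
        # Demultiplexing: keep the chrominance only where the CFA has color c.
        sub = [[chroma[x][y] if bayer_color(x, y) == c else 0.0 for y in range(w)]
               for x in range(h)]
        psi_c = convolve(sub, F_G if c == "G" else F_RB, 4)  # bilinear interpolation
        out[c] = [[lum[x][y] + psi_c[x][y] for y in range(w)] for x in range(h)]
    return out


# A uniform-color 4x4 scene: the sketch should return it exactly.
r, g, b = 100.0, 150.0, 50.0
cfa = [[{"R": r, "G": g, "B": b}[bayer_color(x, y)] for y in range(4)] for x in range(4)]
result = demosaic_frequency_selection(cfa)
print(result["R"][1][1], result["G"][0][0], result["B"][0][1])
```

On the uniform scene, the filter returns the luminance $(R + 2G + B)/4$ of Equation 8.12 at every pixel, and the demultiplexed chrominance equals the opponent values $\psi_R$, $\psi_G$, $\psi_B$ of Equation 8.17, so the reconstruction is exact; real images are where the choice of a larger, carefully designed luminance filter matters.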
FIGURE 8.8 Synopsis of the demosaicking by frequency selection (blocks: multiplexed image, luminance selection, luminance, multiplexed chrominance, demultiplexing, subsampled chrominance, interpolation, chrominance, RGB image).

The results of this method depend on the linear space-invariant filter, which must estimate the luminance information accurately. As shown in Figure 8.4, luminance is a wide-band signal that must be estimated at frequencies close to the Nyquist frequency while avoiding the frequencies that contain chrominance. The filter design thus depends on the spatial and color characteristics of the camera. A generic filter for general-purpose use was proposed in Alleysson et al. [16]. In terms of reconstruction quality, however, a space-invariant filter is not the best solution, because the structure of the CFA favors some positions, such as the green pixels in the Bayer CFA. This is discussed in the next section.

8.3.3 Wiener Estimation

This method, which allows a space-variant filter to be estimated directly, was originally proposed by Trussell [23] and Taubman [24]. The idea is to express demosaicking as an inverse problem: given the data acquired by the sensor, is it possible to reconstruct the original scene? The problem can then be solved with a Wiener approach, which assumes that the original data can be retrieved from a linear combination of the acquired data. The linear parameters are calculated by minimizing the mean square error between the estimated data and the original data. Trussell [23] formalized Wiener demosaicking by considering the original data to be the multispectral content of the scene; he also took into account the optics of the camera. Taubman [24] proposed a practical method to solve the system by taking into account that the acquisition process is space invariant.
228 Single-Sensor Imaging: Methods and Applications for Digital Cameras

FIGURE 8.9 Illustration that a CFA image X is constructed from a matrix multiplication between Pr and the color image Y if they are represented as column-stacked superpixels. (Diagram labels: Y, of size 12 × HW/4, with rows R1–R4, G1–G4, B1–B4; Pr, of size 4 × 12; X, of size 4 × HW/4, with rows R1, G2, G3, B4.)

In this section we discuss the principles of Wiener demosaicking with a simplified model that considers only the transformation of the RGB image into a mosaic, as previously described by Chaix et al. [25]. We show how an image database can be used to constrain the solution, describe a luminance-chrominance approach to Wiener demosaicking, and finally use this method to compare the performance of different CFAs.

8.3.3.1 Direct RGB Estimation

The formalism of Wiener demosaicking relies on stacked notation, which unfolds a color image of size H × W × 3 into a column vector of size 3HW × 1, where H, W, and 3 are, respectively, the height, the width, and the number of color channels of the image. This allows the image formation model to be expressed as a multiplication of the original image by the CFA sampling matrix. In Reference [24], Taubman introduced the concept of the superpixel: a group of pixels that matches the basic pattern of the CFA. In the Bayer CFA, the basic pattern consists of four pixels arranged in a 2 × 2 square: one red, two green, and one blue. At the scale of the superpixel, the mosaic is regular, a tiling of superpixels. Under the widely used assumption that the acquisition process is invariant over the image, this regularity allows the design of space-invariant filters at the superpixel scale, i.e., of block shift-invariant filters [26] at the pixel scale.
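The column-stacked superpixel notation of Figure 8.9 can be made concrete with a short NumPy sketch. It assumes an RGGB superpixel whose pixels 1 to 4 are the top-left, top-right, bottom-left, and bottom-right positions, so that the CFA keeps R1, G2, G3, and B4 as in the figure; the function names are illustrative only. Multiplying the stacked color image by the 4 × 12 projection P_r should then reproduce the stacked CFA image.

```python
import numpy as np

def to_superpixels(img):
    """Stack an H x W x 3 image into a 12 x (HW/4) matrix, rows R1..R4, G1..G4, B1..B4."""
    h, w, _ = img.shape
    blocks = img.reshape(h // 2, 2, w // 2, 2, 3)       # axes: (by, dy, bx, dx, channel)
    return blocks.transpose(4, 1, 3, 0, 2).reshape(12, -1)

def cfa_to_superpixels(cfa):
    """Stack an H x W CFA image into a 4 x (HW/4) matrix, rows = pixels 1..4."""
    h, w = cfa.shape
    return cfa.reshape(h // 2, 2, w // 2, 2).transpose(1, 3, 0, 2).reshape(4, -1)

def mosaic_rggb(img):
    """Simulate an RGGB Bayer CFA image from a full-color image."""
    cfa = img[..., 1].copy()                    # green by default
    cfa[0::2, 0::2] = img[0::2, 0::2, 0]        # red at even rows and columns
    cfa[1::2, 1::2] = img[1::2, 1::2, 2]        # blue at odd rows and columns
    return cfa

# P_r (4 x 12): selects R1, G2, G3 and B4 from each stacked superpixel (Equation 8.25).
PR = np.zeros((4, 12))
PR[[0, 1, 2, 3], [0, 5, 6, 11]] = 1.0

rng = np.random.default_rng(0)
img = rng.random((4, 6, 3))
X = PR @ to_superpixels(img)                    # equals cfa_to_superpixels(mosaic_rggb(img))
```

Because every superpixel is processed by the same P_r, any filter designed on stacked superpixels is shift invariant at the superpixel scale, which is the block shift-invariance mentioned in the text.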
Thus, the stacked notation should be expressed at the scale of a superpixel, as shown in Figure 8.9 and the following equation:

$$X = P_r Y \qquad (8.25)$$

where $P_r$ is a projection operator representing the sampling process, which converts the four pixels of $Y$, each with three colors, into the four single-color pixels of the CFA image $X$. The goal of linear demosaicking is to find a matrix $D$ that recovers the color image $\tilde{Y}$ from the CFA image $X$:

$$\tilde{Y} = D X \qquad (8.26)$$

while minimizing the mean square error $e$ with respect to the original color image $Y$:

$$e = E\big[\|Y - \tilde{Y}\|^2\big] \qquad (8.27)$$

The classical solution to this problem is the Wiener solution:

$$D = E[Y X^T]\,\big(E[X X^T]\big)^{-1} \qquad (8.28)$$

We can compute the matrix $D$ by applying Equation 8.28 over a database of full-resolution color images. Using a database means that $Y$ is explicitly known and that the CFA image $X$ is simulated with Equation 8.25. This computation requires only the inversion of a $4n^2 \times 4n^2$ matrix, $n$ being the size of the neighborhood in superpixels. The details of the method are described by Chaix et al. in Reference [25]. A similar approach was recently used in Reference [27], which defines spatio-chromatic covariance matrices for the four elements of the superpixel.

8.3.3.2 Estimation through Luminance and Chrominance

Instead of directly estimating the color image from the CFA image, as described in the previous section, we can first estimate the luminance $\tilde{\Phi}$ from the CFA image:

$$\tilde{\Phi} = H_\Phi X \qquad (8.29)$$

where $H_\Phi$ is the luminance filter (Figure 8.10). Once the luminance is estimated, the modulated chrominance is recovered as the difference between the CFA image and the luminance, $\tilde{\Psi} = X - \tilde{\Phi}$. As in the frequency-selection method, the chrominance is then demultiplexed, by multiplying it by $P_r^T$, before being interpolated to obtain the full chrominance $\tilde{\Psi}_c$:

$$\tilde{\Psi}_c = H_\Psi P_r^T \tilde{\Psi} \qquad (8.30)$$

where $H_\Psi$ is the matrix containing the three chrominance interpolation filters.
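A minimal sketch of the direct RGB estimation of Section 8.3.3.1 (Equations 8.25 to 8.28) follows. To keep it small, it assumes a neighborhood of a single RGGB superpixel (n = 1, so $E[XX^T]$ is 4 × 4) and replaces the image database with a hypothetical synthetic training set (a gray value per pixel plus a small color offset per superpixel); sample averages stand in for the expectations. Two properties of the LMMSE solution can be checked numerically: the residual $Y - DX$ is uncorrelated with $X$, and $P_r D$ is the identity, so the sampled values are reproduced exactly in this noise-free model.

```python
import numpy as np

# P_r for one RGGB superpixel stacked as [R1..R4, G1..G4, B1..B4]: keeps R1, G2, G3, B4.
PR = np.zeros((4, 12))
PR[[0, 1, 2, 3], [0, 5, 6, 11]] = 1.0

def synthetic_superpixels(count, rng, chroma_scale=0.1):
    """Hypothetical stand-in for an image database: 12 x count superpixels made of a
    gray value per pixel plus one small color offset shared by the four pixels."""
    gray = rng.standard_normal((4, count))
    chroma = chroma_scale * rng.standard_normal((3, count))
    return np.kron(np.ones((3, 1)), gray) + np.kron(chroma, np.ones((4, 1)))

def train_wiener(Y, Pr):
    """D = E[Y X^T] (E[X X^T])^-1 (Equation 8.28), expectations replaced by sample means."""
    X = Pr @ Y                                  # simulated CFA data (Equation 8.25)
    m = Y.shape[1]                              # number of training superpixels
    R_yx = Y @ X.T / m                          # 12 x 4
    R_xx = X @ X.T / m                          # 4 x 4 here (4n^2 x 4n^2 in general)
    # R_xx is symmetric, so D = R_yx R_xx^-1 is the solution of R_xx D^T = R_yx^T.
    return np.linalg.solve(R_xx, R_yx.T).T

rng = np.random.default_rng(0)
Y_train = synthetic_superpixels(50_000, rng)
D = train_wiener(Y_train, PR)
Y_test = synthetic_superpixels(1_000, rng)
Y_hat = D @ (PR @ Y_test)                       # Equation 8.26
```

Solving the linear system rather than forming the inverse explicitly is a numerical convenience; with real images and larger neighborhoods, the same code applies once the superpixel neighborhoods are stacked into longer vectors.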
Finally, the reconstructed color image $\tilde{Y}$ is the sum of both parts:

$$\tilde{Y} = \tilde{\Phi}_c + \tilde{\Psi}_c, \qquad \tilde{\Phi}_c = [1\ 1\ 1]^T \otimes \tilde{\Phi} \qquad (8.31)$$

We thus have to train two filters over the image database:

• the luminance estimator, calculated from the CFA image $X$ (which is simulated from the database by setting the appropriate chromatic values to zero) and the luminance $\Phi$ (which is also computed from the database):

$$H_\Phi = E[\Phi X^T]\,\big(E[X X^T]\big)^{-1} \qquad (8.32)$$

• and the chrominance interpolator, calculated from the chrominance $\Psi_c$ and the subsampled chrominance $\Psi$ (both computed from the database):

$$H_\Psi = E\big[\Psi_c (P_r^T \Psi)^T\big]\,\Big(E\big[(P_r^T \Psi)(P_r^T \Psi)^T\big]\Big)^{-1} \qquad (8.33)$$

with $\Phi = P Y$, $\Psi_c = M_c Y$, and $\Psi = X - \Phi$.

FIGURE 8.10 Amplitude spectra of the luminance filter for each position (1, 2, 3, and 4) in the superpixel, panels (a)–(d). At positions 2 and 3 (G pixels), luminance can be retrieved with maximal horizontal and vertical acuity.

The great advantage of this decomposition is that the chrominance has a narrow bandwidth with respect to the Nyquist frequency of each demultiplexed plane, so only low-order filters are needed to interpolate it. The luminance estimator, however, needs a steep transition (high gradient) in its frequency response at the frequencies on the border between luminance and modulated chrominance. It therefore requires a high-order filter (typically 7 × 7 or 9 × 9), but this estimation is performed only once. This property makes the algorithm much more computationally efficient than direct RGB estimation.

8.3.3.3 Performance Comparison of Different CFAs

The Wiener approach, with filters estimated from a database, can be used to compare the performance of different CFAs. The procedure is automatic, and since the same neighborhood size is used for every CFA, the only difference is the capability of each CFA to represent spatial and chromatic information.
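Returning to Section 8.3.3.2, the two-filter training of Equations 8.29 to 8.33 can be sketched in the same toy setting as the direct estimation: one RGGB superpixel (n = 1), hypothetical synthetic training data instead of a database, and sample averages for expectations. The luminance projection P is not given in this section, so the weights 1/4, 1/2, 1/4 for R, G, and B are an assumption; $M_c = I - [1\ 1\ 1]^T \otimes P$ then follows from Equation 8.31. In this tiny setting $E[(P_r^T\Psi)(P_r^T\Psi)^T]$ is singular (eight of its twelve rows are zero), so the sketch uses a pseudo-inverse where Equation 8.33 has an inverse.

```python
import numpy as np

PR = np.zeros((4, 12))
PR[[0, 1, 2, 3], [0, 5, 6, 11]] = 1.0              # RGGB superpixel: keeps R1, G2, G3, B4
# Assumed luminance projection: Phi_k = (R_k + 2 G_k + B_k) / 4 for each pixel k.
P = np.hstack([np.eye(4) / 4, np.eye(4) / 2, np.eye(4) / 4])
ONES3 = np.ones((3, 1))
MC = np.eye(12) - np.kron(ONES3, P)                 # Psi_c = Y - [1 1 1]^T (x) Phi

def synthetic_superpixels(count, rng, chroma_scale=0.1):
    """Hypothetical training data: gray detail per pixel plus a color offset per superpixel."""
    gray = rng.standard_normal((4, count))
    chroma = chroma_scale * rng.standard_normal((3, count))
    return np.kron(ONES3, gray) + np.kron(chroma, np.ones((4, 1)))

def train_two_step(Y):
    """Train H_Phi (Equation 8.32) and H_Psi (Equation 8.33) from 12 x m superpixels."""
    m = Y.shape[1]
    X = PR @ Y
    Phi, Psi_c = P @ Y, MC @ Y
    U = PR.T @ (X - Phi)                            # demultiplexed chrominance P_r^T Psi
    H_phi = np.linalg.solve(X @ X.T / m, (Phi @ X.T / m).T).T
    # Pseudo-inverse with a generous cutoff, because U U^T is singular in this toy setting.
    H_psi = (Psi_c @ U.T / m) @ np.linalg.pinv(U @ U.T / m, rcond=1e-10)
    return H_phi, H_psi

def demosaic_two_step(X, H_phi, H_psi):
    Phi_hat = H_phi @ X                             # Equation 8.29
    Psi_c_hat = H_psi @ (PR.T @ (X - Phi_hat))      # Equation 8.30
    return np.kron(ONES3, Phi_hat) + Psi_c_hat      # Equation 8.31

rng = np.random.default_rng(0)
Y_train = synthetic_superpixels(50_000, rng)
H_phi, H_psi = train_two_step(Y_train)
Y_test = synthetic_superpixels(1_000, rng)
Y_hat = demosaic_two_step(PR @ Y_test, H_phi, H_psi)
```

The split mirrors the argument in the text: H_phi does the hard spectral separation once, while H_psi only has to interpolate smooth, narrow-band chrominance.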
The advantage of using a linear reconstruction method is that it does not favor any particular pairing of a CFA with a particular nonlinear method. We used the leave-one-out method, so that the tested image was never included in the training set. We chose the Kodak image database, which is widely used in the demosaicking community, and performed the comparison on the CFAs shown in Figure 8.4. The basic pattern of the pseudorandom CFA is 6 × 6 pixels, with the same proportions of R, G, and B pixels. This pattern was generated randomly and then manually readjusted to avoid any cluster of same-color filters. The results of the demosaicking process are given in Table 8.1 in terms of objective quality (PSNR values), and in Figures 8.11, 8.12, and 8.13 in terms of visual quality, for the CZP image, the lighthouse image, and the Bahamas image.

TABLE 8.1 Average PSNR values (dB) and standard deviations between the original and reconstructed images, over the 24 Kodak images, for each CFA configuration of Figure 8.4.

CFA                          R               G               B               Average
Bayer                        38.53 (±2.67)   41.22 (±2.47)   37.25 (±2.59)   39.00 (±2.58)
Vertical stripes             34.50 (±2.81)   34.61 (±2.82)   34.50 (±2.69)   34.54 (±2.77)
Horizontal stripes           32.95 (±2.46)   33.09 (±2.48)   33.16 (±2.26)   33.07 (±2.40)
Diagonal stripes             38.17 (±2.48)   38.84 (±2.42)   38.20 (±2.59)   38.40 (±2.50)
Lukac                        38.69 (±2.45)   41.24 (±2.37)   38.25 (±2.51)   39.39 (±2.44)
Lukac (90° rotated)          38.50 (±2.42)   40.96 (±2.34)   38.07 (±2.53)   39.18 (±2.43)
Pseudorandom                 38.90 (±2.50)   40.12 (±2.41)   39.44 (±2.67)   39.49 (±2.53)
Pseudorandom (90° rotated)   38.87 (±2.49)   40.16 (±2.40)   39.51 (±2.64)   39.51 (±2.51)

FIGURE 8.11 (See color insert.) Results obtained using the ‘CZP’ image: (a) original image, (b–f) demosaicked images corresponding to (b) vertical stripe pattern, (c) diagonal stripe pattern, (d) Bayer pattern, (e) Lukac pattern, and (f) pseudorandom pattern.
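The objective comparison above relies on PSNR and on leave-one-out training. A minimal sketch of both follows. The 8-bit peak value of 255, the use of the population standard deviation, and the caller-supplied train, mosaic, and demosaic functions are assumptions, since the chapter does not specify these details; for per-channel figures as in Table 8.1, psnr would be applied to one color plane at a time.

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB; peak = 255 assumes 8-bit images."""
    err = np.asarray(reference, dtype=float) - np.asarray(estimate, dtype=float)
    mse = np.mean(err ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))

def leave_one_out_psnr(images, train, demosaic, mosaic, peak=255.0):
    """Mean and population standard deviation of PSNR over a list of images, where each
    image is reconstructed by a model trained on all the *other* images.

    train(list_of_images) -> model, mosaic(image) -> CFA image and
    demosaic(model, cfa) -> image are caller-supplied (hypothetical) callables.
    """
    scores = []
    for k, img in enumerate(images):
        model = train(images[:k] + images[k + 1:])   # the test image is excluded
        scores.append(psnr(img, demosaic(model, mosaic(img)), peak))
    return float(np.mean(scores)), float(np.std(scores))
```

With 24 Kodak images this trains 24 models, one per held-out image, which is affordable for the linear Wiener filters because training only requires accumulating correlation matrices and solving a small linear system.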
Using PSNR as the quality criterion, we see that the CFAs do not all give the same performance. As expected, the Bayer CFA gives better results than the diagonal, horizontal, and vertical stripe CFAs. However, Lukac's CFA and the pseudorandom pattern give better results than the widely used Bayer CFA.

Two major properties determine the quality of a CFA: the frequencies of the chrominance and their orientations. Chrominance should be located as far as possible from the center of the frequency spectrum, to reduce the spectral overlap between luminance and modulated chrominance, while the vertical and horizontal directions are the most sensitive orientations for natural images [17]. As discussed earlier in this chapter, the Fourier spectrum shows the frequencies of chrominance, and thus the potential amount of aliasing, for the Bayer, Lukac, and stripe CFAs, which are all periodic patterns. This does not hold for the pseudorandom pattern: its Fourier spectrum shows information at every multiple of 1/n (n × n being the size of the basic pattern), and these frequencies are present at a global scale but not at a local scale. To know precisely which frequencies actually alias, one should instead look at the demosaicked CZP images. This achromatic image is made of an isotropic sine function whose frequency is linearly modulated with spatial position. Consequently, we can visualize the local frequencies and identify those that cause spectral overlap, since they appear as false colors.

FIGURE 8.12 Results obtained using the ‘lighthouse’ image: (a) original image, (b–f) demosaicked images corresponding to (b) vertical stripe, (c) diagonal stripe, (d) Bayer, (e) Lukac, and (f) pseudorandom patterns.
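For a periodic CFA, the chrominance carrier positions discussed above can be read off the discrete Fourier transform of one color's indicator over the basic pattern, and a zone plate in the spirit of the CZP image takes only a few lines to generate. In the sketch below, the RGGB Bayer tile, the 1 × 3 vertical-stripe tile, and the particular zone-plate formula are illustrative assumptions. For the Bayer red tile, energy appears at 0 and 1/2 cycle/pixel in each direction (half the sampling frequency), while for vertical stripes of period 3 it appears at ±1/3 cycle/pixel along the horizontal frequency axis.

```python
import numpy as np

def carrier_frequencies(tile, tol=1e-9):
    """Frequencies (cycles/pixel) where one color's indicator over a CFA period has energy.

    tile: 2-D 0/1 array marking one color in the basic pattern. Returns a set of
    (fy, fx) pairs rounded to 4 decimals; -1/2 and +1/2 cycle/pixel are the same alias.
    """
    spectrum = np.abs(np.fft.fft2(np.asarray(tile, dtype=float)))
    fy = np.fft.fftfreq(spectrum.shape[0])
    fx = np.fft.fftfreq(spectrum.shape[1])
    rows, cols = np.nonzero(spectrum > tol * spectrum.max())
    return {(round(float(fy[r]), 4), round(float(fx[c]), 4)) for r, c in zip(rows, cols)}

def zone_plate(n):
    """Achromatic circular zone plate 0.5 + 0.5 cos(pi r^2 / n), r from the image center.

    The local radial frequency is r / n cycle/pixel: it grows linearly with r and
    reaches the Nyquist frequency (0.5 cycle/pixel) at r = n / 2.
    """
    c = np.arange(n) - n // 2
    r2 = (c[:, None] ** 2 + c[None, :] ** 2).astype(float)
    return 0.5 + 0.5 * np.cos(np.pi * r2 / n)

bayer_red = [[1, 0], [0, 0]]          # RGGB basic pattern, red indicator
bayer_green = [[0, 1], [1, 0]]        # RGGB basic pattern, green indicator
vertical_stripes_red = [[1, 0, 0]]    # columns R, G, B repeating

print(carrier_frequencies(bayer_red))
print(carrier_frequencies(bayer_green))
print(carrier_frequencies(vertical_stripes_red))
```

Comparing the Bayer outputs shows that green is modulated only along the diagonal, whereas red (and, symmetrically, blue) also carries energy on the horizontal and vertical frequency axes, which is the orientation issue raised in the text. Mosaicking zone_plate(n) with a given CFA and demosaicking it makes the aliased frequencies visible as false colors.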
Considering the frequency positions of the chrominance, the stripe CFAs are the worst, since they modulate chrominance at one third of the sampling frequency, whereas the Bayer and Lukac CFAs modulate it at half the sampling frequency (i.e., at the Nyquist frequency). The latter are thus optimal with respect to the frequency criterion. Note, however, that Lukac's CFA leaves the 0° orientation (or 90°, depending on the orientation of the pattern) entirely free of chrominance, whereas the Bayer CFA modulates chrominance in both directions. In Reference [17], the author shows that with the Bayer CFA, false colors arise mainly from these directions. Lukac's CFA is thus superior with respect to this second criterion, and it gains 1 dB in PSNR for the B color plane. The pseudorandom pattern preserves both the 0° and 90° directions. Not all of its chrominance carriers are at half the sampling frequency, but those that are not carry less energy. Hence, the pseudorandom CFA gives visually pleasing images and satisfies both the frequency and the orientation criteria.

FIGURE 8.13 Results obtained using the ‘bahamas’ image: (a) original image, (b–f) demosaicked images corresponding to (b) vertical stripe, (c) diagonal stripe, (d) Bayer, (e) Lukac, and (f) pseudorandom patterns.

Zipper noise, an artifact dual to false colors, also appears with these linear methods; it is caused by chrominance being interpreted as luminance. Changing the CFA pattern does not suppress this artifact but changes its shape (Figure 8.13). Zipper noise could therefore be addressed with an adaptive method that exploits intra-plane correlation. Note that the stripe CFAs are extremely sensitive to zipper noise in regions of highly saturated colors.

8.4 Nonlinear and Adaptive Methods

Many demosaicking methods are nonlinear, as they rely on a nonlinear, adaptive, and/or iterative process. It is impossible to describe all of these methods in this chapter; the reader is referred to Reference [28].
The first nonlinear method proposed was based on pattern analysis [29]. It first identifies features such as contours, corners, and bands in the neighborhood of each pixel, and then adapts the interpolation to these features to avoid interpolating across a contour, a corner, or a band. Inspired by this, many later methods use an estimate of the image gradient to detect contours; demosaicking is then performed along the contours rather than across them [30], [31], [32], [33]. We do not review all of these methods here, but rather focus on two nonlinear methods that use the spectral representation of the CFA image [17], [34] and give good objective results (Table 8.2).

TABLE 8.2 PSNR values (dB) for several methods on the Bayer CFA.

Channel   LMMSE           Ref. [17]       Ref. [34]
R         38.53 (±2.67)   38.81 (±2.50)   38.78 (±2.59)
G         41.22 (±2.47)   42.82 (±2.50)   42.12 (±2.79)
B         37.25 (±2.59)   38.62 (±2.69)   38.68 (±2.62)

8.4.1 Accurate Luminance Method

The zipper noise discussed in the previous section arises from crosstalk of chrominance into luminance, i.e., when high frequencies of chrominance are interpreted as luminance. In Reference [34], chrominance is interpolated with weights that take edges into account. This allows a sharp chrominance signal to be retrieved along edges: the chrominance recovers frequencies beyond the spectral boundary between luminance and chrominance, and zipper noise is reduced. For instance, at an R pixel with coordinates (x, y):

$$\psi_R(x, y) = \sum_{(i,j) \in \mathcal{D}} w_{ij}\, \tilde{\psi}_R(x - i, y - j) \qquad (8.34)$$

where $\mathcal{D} = \{(-1, 0), (0, 1), (1, 0), (0, -1)\}$ is the set of the four closest neighbors, and $\tilde{\psi}_R$ is a rough estimate of $\psi_R$ based on a linear interpolation of the R components and a frequency selection of luminance. The weights $w_{ij}$ depend on the gradients of the luminance and of the color channel R.
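Equation 8.34 can be illustrated with a small sketch. The chapter states only that the weights depend on the gradients of the luminance and of the R channel, so the inverse-difference weights used here, $w_{ij} \propto 1/(\varepsilon + |L(x, y) - L(x - i, y - j)|)$ with a luminance estimate L, are a hypothetical stand-in and not the weights of [34]. Arrays are indexed [x, y] to mirror the equation.

```python
import numpy as np

# The four closest neighbours D of Equation 8.34, as (i, j) offsets.
NEIGHBOURS = [(-1, 0), (0, 1), (1, 0), (0, -1)]

def weighted_chroma_at(psi_rough, lum, x, y, eps=1e-3):
    """psi_R(x, y) = sum over (i, j) in D of w_ij * psi_rough(x - i, y - j).

    psi_rough: rough chrominance estimate; lum: luminance estimate; both indexed [x, y].
    Hypothetical weights: w_ij proportional to 1 / (eps + |lum(x, y) - lum(x - i, y - j)|),
    normalised to sum to 1, so neighbours across a luminance edge count little.
    """
    diffs = np.array([abs(lum[x, y] - lum[x - i, y - j]) for i, j in NEIGHBOURS])
    w = 1.0 / (eps + diffs)
    w /= w.sum()
    values = np.array([psi_rough[x - i, y - j] for i, j in NEIGHBOURS])
    return float(w @ values), w

lum = np.zeros((5, 5))
lum[:, 3:] = 1.0                               # luminance edge between y = 2 and y = 3
psi = np.where(lum > 0, 5.0, 1.0)              # different chrominance on each side
value, weights = weighted_chroma_at(psi, lum, 2, 2)
```

At (2, 2) the neighbor across the edge receives almost no weight, so the interpolated chrominance stays close to the value on its own side of the edge instead of averaging across it, which is the mechanism the text credits with reducing zipper noise.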
The obtained $\psi_R$ values are used to update the chrominance estimate at G pixels, and another iteration of gradient-based estimation is then performed. Note that this method closely resembles that of Reference [35], in which the green color plane represents luminance. Here, estimating luminance at G pixels is judicious because $\psi_G$ vanishes at G pixels, so luminance is least aliased there. This method correctly estimates luminance and chrominance within the luminance band. It suppresses zipper noise but sometimes fails in the chrominance band when luminance spills over into it. It is noteworthy that the method of Lian et al. [34] has a complexity close to that of Reference [16] while giving better visual quality and higher PSNR values.

8.4.2 Frequency Domain Method

This adaptive method suppresses false colors and artifacts. False colors are due to crosstalk of luminance into chrominance: the chrominance is falsely estimated because it contains some achromatic components. To obtain a better estimate of chrominance, Dubois [17] exploits the redundancy of the chrominance components in the Fourier plane. An analogy can be made with the spread-spectrum method in signal transmission, which improves robustness by using more frequencies than needed. In a color mosaic, the chrominance components are modulated at several different frequencies. If one chrominance carrier overlaps with a luminance frequency, the same chrominance component at a different spectral location may not. This allows a chrominance value to be estimated without residual luminance, and thus can suppress false colors. However, zipper noise is not reduced, since chrominance is not interpolated in an edge-aware manner.
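Dubois' method blends two chrominance estimates, one demodulated from the horizontal carrier and one from the vertical carrier, with weights driven by the local energies $E_x$ and $E_y$ (Equation 8.35, with $w_x = E_x/(E_x + E_y)$ and $w_y = 1 - w_x$). The blending step itself reduces to a few lines. How $E_x$, $E_y$, $\psi_{2h}$, and $\psi_{2v}$ are computed is not reproduced here, and the small ε that keeps the weights defined in perfectly flat regions is an added assumption.

```python
import numpy as np

def combine_chroma(psi2_h, psi2_v, e_x, e_y, eps=1e-12):
    """Blend two chrominance estimates as in Equation 8.35.

    w_x = E_x / (E_x + E_y) and w_y = 1 - w_x; eps (an added assumption) makes
    w_x = 0.5 when both energies are zero. Works element-wise on arrays or scalars.
    """
    e_x = np.asarray(e_x, dtype=float)
    e_y = np.asarray(e_y, dtype=float)
    w_x = (e_x + eps) / (e_x + e_y + 2.0 * eps)
    w_y = 1.0 - w_x
    return w_x * np.asarray(psi2_h, dtype=float) + w_y * np.asarray(psi2_v, dtype=float)
```

With equal energies the two estimates are simply averaged; when one energy dominates, the blend moves toward the corresponding estimate, so the per-pixel decision is soft rather than a hard switch between directions.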
The choice of chrominance is driven by the amounts of energy $E_x$ and $E_y$ at the intersection between luminance and chrominance in the horizontal and vertical directions:

$$\psi_2(x, y) = w_x\, \psi_{2h}(x, y) + w_y\, \psi_{2v}(x, y) \qquad (8.35)$$

with $w_x = E_x / (E_x + E_y)$ and $w_y = 1 - w_x$. This method correctly estimates chrominance within the chrominance band, eliminating the residual luminance information. However, it fails to reduce zipper noise.

8.5 Conclusion

In this chapter we reviewed some linear methods for reconstructing full-color images from a mosaic with a single chromatic sample per spatial location. Using a model of chromatic sampling, we showed that the representation in luminance and chrominance components also holds for CFA images. Subsampling does not affect the luminance component but rather the chrominance, which is modulated to high frequencies. The LMMSE method described in this chapter allows objective comparisons of the different CFA arrangements published in the literature. It appears that the widely used Bayer CFA may not be the optimal one: both the CFA proposed by Lukac and the pseudorandom pattern proposed here give better results in terms of minimizing false colors. For the Bayer CFA, however, sophisticated methods need to be applied in order to reduce these artifacts. The best trade-off for the system comprising the CFA and demosaicking modules will perhaps combine the best CFA, one that intrinsically reduces false colors at data acquisition, with a cost-effective demosaicking method that controls zipper noise. Such solutions are proposed in References [35] and [36], which both present general demosaicking algorithms.

References

[1] K. Parulski and K.E. Spaulding, Digital Color Image Handbook, ch. Color image processing for digital cameras, G. Sharma (ed.), CRC Press, Boca Raton, FL, 2002, pp. 727–757.
[2] R. Lukac and K.N. Plataniotis, Color Image Processing: Methods and Applications, ch. Single-sensor camera image processing, R. Lukac and K.N. Plataniotis (eds.), Boca Raton, FL: CRC Press / Taylor & Francis, 2006, pp. 363–392.
[3] I. Newton, Optics: Or, a Treatise of the Reflections, Refractions, Inflections and Colours of Light. London, UK, 4th ed., 1730.
[4] T. Young, "On the theory of light and colours," Philosophical Transactions, vol. 92, pp. 12–48, July 1802.
[5] H. von Helmholtz, Handbuch der Physiologischen Optik. Leipzig: Leopold Voss, 1862.
[6] J. Nathans, D. Thomas, and D.S. Hogness, "Molecular genetics of human color vision: The genes encoding blue, green, and red pigments," Science, vol. 232, no. 4747, pp. 193–202, April 1986.
[7] D.A. Baylor, B.J. Nunn, and J.L. Schnapf, "Spectral sensitivity of the cones of the monkey Macaca fascicularis," Journal of Physiology, vol. 390, no. 1, pp. 145–160, September 1987.
[8] G. Wyszecki and W. Stiles, Color Science: Concepts and Methods, Quantitative Data and Formulae. New York, NY: John Wiley & Sons, 1982.
[9] B.E. Bayer, "Color imaging array," U.S. Patent 3 971 065, July 1976.
[10] P. Dillon, "Color imaging array," U.S. Patent 4 047 203, September 1977.
[11] R. Lukac and K.N. Plataniotis, "Color filter arrays: Design and performance analysis," IEEE Transactions on Consumer Electronics, vol. 51, no. 4, pp. 1260–1267, November 2005.
[12] R. Lyon and P. Hubel, "Eyeing the camera: Into the next century," in Proceedings of the Color Imaging Conference, Scottsdale, AZ, USA, November 2002, pp. 349–355.
[13] A. Roorda and D.R. Williams, "The arrangement of the three cone classes in the living human eye," Nature, vol. 397, no. 11, pp. 520–522, February 1999.
[14] E. Hering, Zur Lehre vom Lichtsinn. Vienna, Austria: 1878.
[15] L.M. Hurvich and D. Jameson, "An opponent-process theory of color vision," Psychological Review, vol. 64, no. 6, pp. 384–404, November 1957.
[16] D. Alleysson, S. Süsstrunk, and J. Hérault, "Linear color demosaicing inspired by the human visual system," IEEE Transactions on Image Processing, vol. 14, no. 4, pp. 439–449, April 2005.
[17] E. Dubois, "Frequency-domain methods for demosaicking of Bayer-sampled color images," IEEE Signal Processing Letters, vol. 12, no. 12, pp. 847–850, December 2005.
[18] E. Dubois, "Filter design for adaptive frequency-domain Bayer demosaicking," in Proceedings of the IEEE International Conference on Image Processing, Atlanta, GA, USA, October 2006, pp. 2705–2708.
[19] D.R. Cok, "Signal processing method and apparatus for sampled image signals," U.S. Patent 4 630 307, December 1986.
[20] D.R. Cok, "Signal processing method and apparatus for producing interpolated chrominance values in a sampled color image signal," U.S. Patent 4 642 678, February 1987.
[21] H.D. Crane, J.D. Peter, and E. Martinez-Uriegas, "Method and apparatus for decoding spatiochromatically multiplexed color image using predetermined coefficients," U.S. Patent 5 901 242, May 1999.
[22] D. Alleysson, S. Süsstrunk, and J. Hérault, "Color demosaicing by estimating luminance and opponent chromatic signals in the Fourier domain," in Proceedings of the 10th IS&T/SID Color Imaging Conference, Scottsdale, AZ, USA, November 2002, pp. 331–336.
[23] H.J. Trussell and R.E. Hartwig, "Mathematics for demosaicking," IEEE Transactions on Image Processing, vol. 11, no. 4, pp. 485–492, April 2002.
[24] D. Taubman, "Generalized Wiener reconstruction of images from colour sensor data using a scale invariant prior," in Proceedings of the IEEE International Conference on Image Processing, Vancouver, BC, Canada, September 2000, vol. III, pp. 801–804.
[25] B. Chaix de Lavarène, D. Alleysson, and J. Hérault, "Practical implementation of LMMSE demosaicing using luminance and chrominance spaces," Computer Vision and Image Understanding: Special Issue on Color Image Processing, vol. 107, no. 1-2, pp. 3–13, July/August 2007.
[26] Y. Hel-Or, "The impulse responses of block shift-invariant systems and their use for demosaicing algorithms," in Proceedings of the IEEE International Conference on Image Processing, Genoa, Italy, September 2005, vol. 2, pp. 1006–1009.
[27] J. Portilla, D. Otaduy, and C. Dorronsoro, "Low-complexity linear demosaicing using joint spatial-chromatic image statistics," in Proceedings of the IEEE International Conference on Image Processing, Genoa, Italy, September 2005, vol. I, pp. 61–64.
[28] B.K. Gunturk, J. Glotzbach, Y. Altunbasak, R.W. Schafer, and R.M. Mersereau, "Demosaicking: Color filter array interpolation in single-chip digital cameras," IEEE Signal Processing Magazine, vol. 22, no. 1, pp. 44–54, January 2005.
[29] D.R. Cok, "Signal processing method and apparatus for sampled image signals," U.S. Patent 4 630 307, December 1986.
[30] R.H. Hibbard, "Apparatus and method for adaptively interpolating a full color image utilizing luminance gradients," U.S. Patent 5 382 976, January 1995.
[31] C.A. Laroche and M.A. Prescott, "Apparatus and method for adaptively interpolating a full color image utilizing chrominance gradients," U.S. Patent 5 373 322, December 1994.
[32] J.F. Hamilton and J.E. Adams, "Adaptive color plane interpolation in single sensor color electronic camera," U.S. Patent 5 629 734, May 1997.
[33] R. Kimmel, "Demosaicing: Image reconstruction from color samples," IEEE Transactions on Image Processing, vol. 8, no. 9, pp. 1221–1228, September 1999.
[34] N. Lian, L. Chang, and Y.P. Tan, "Improved color filter array demosaicking by accurate luminance estimation," in Proceedings of the IEEE International Conference on Image Processing, Genoa, Italy, September 2005, vol. 1, pp. 41–44.
[35] R. Lukac and K.N. Plataniotis, "Universal demosaicking for imaging pipelines with an RGB color filter array," Pattern Recognition, vol. 38, no. 11, pp. 2208–2212, November 2005.
[36] B. Chaix de Lavarène, D. Alleysson, and J. Hérault, "Efficient demosaicing through recursive filtering," in Proceedings of the IEEE International Conference on Image Processing, San Antonio, TX, USA, September 2007, vol. II, pp. 189–192.

9 Color Filter Array Image Analysis for Joint Demosaicking and Denoising

Keigo Hirakawa

9.1 Introduction
9.1.1 A Comment About Model Assumptions
9.1.2 Terminologies and Notational Conventions
9.2 Noise Model
9.3 Spectral Analysis of CFA Image
9.4 Wavelet Analysis of CFA Image
9.5 Constrained Filtering
9.6 Missing Data
9.7 Filterbank Coefficient Estimation
9.8 Conclusion
Acknowledgments
References

9.1 Introduction

Noise is among the worst artifacts affecting the perceptual quality of the output of a digital camera (see Chapter 1). While cost-effective and popular, single-sensor camera architectures are not well suited to noise suppression. In this scheme, data are typically obtained via a spatial subsampling procedure implemented as a color filter array (CFA), a physical construction whereby each pixel location measures the intensity of the light corresponding to only a single color [1], [2], [3], [4], [5]. Aside from undersampling, observations made under noisy conditions typically deteriorate the estimates of the full-color image in the reconstruction process, commonly referred to in the literature as demosaicking or CFA interpolation [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16]. A typical CFA scheme uses the canonical color triple (red, green, and blue), and the most prevalent arrangement, called the Bayer pattern, is shown in Figure 9.1b.

FIGURE 9.1 (See color insert.) Zoomed portion of the Clown image: (a) original color image, (b) color version of ideal CFA image, (c) color version of noisy CFA image, (d) demosaicking the ideal CFA image, (e) demosaicking the noisy CFA image, (f) demosaicking the noisy CFA image followed by denoising, (g) denoising the noisy CFA image followed by demosaicking, and (h) joint denoising and demosaicking of the noisy CFA image.

As image resolution continues to increase with the prevalence of multimedia, the importance of interpolation is de-emphasized, while computational efficiency, noise, and color fidelity play an increasingly prominent role in the decisions of a digital camera architect. For instance, interpolation artifacts become less noticeable as the pixel size shrinks relative to image features, while the
decreased size of the pixel sensors on complementary metal oxide semiconductor (CMOS) and charge coupled device (CCD) sensors makes the pixels more susceptible to noise. Photon-limited effects are also evident in low-light photography, ranging from specialty cameras for precision measurement to indoor consumer photography. Sensor data, which can be interpreted as subsampled or incomplete image data, undergo a series of image processing procedures to produce a digital photograph (see Chapters 1 and 3 for details). However, these same steps may amplify the noise introduced during image acquisition. In particular, the demosaicking step is a major source of conflict between the image processing pipeline and image sensor noise characterization, because interpolation methods give high priority to preserving the sharpness of edges and textures. In the presence of noise, noise patterns may form false edge structures; the distortions at the output are therefore typically correlated with the signal in a complicated manner that makes noise modelling mathematically intractable. It is thus natural to seek a rigorous tradeoff between demosaicking and image denoising.

For better illustration, Figure 9.1a shows a typical color image. Suppose we simulate the noisy sensor observation by subsampling this image according to a CFA pattern (Figure 9.1b) and corrupting it with noise (Figure 9.1c). While state-of-the-art demosaicking methods such as those in [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16] do an impressive job of estimating the full-color image from ideal sensor data (Figure 9.1d), the interpolation may also amplify the noise in the sensor measurements, as demonstrated in Figure 9.1e.
Applying state-of-the-art denoising methods to Figure 9.1e yields unsatisfactory results (Figure 9.1f), as does denoising the CFA image before demosaicking (Figure 9.1g), suggesting the lack of a coherent strategy for addressing interpolation and noise issues jointly. For comparison, the output of a joint demosaicking and denoising method [17] is shown in Figure 9.1h, clearly demonstrating the advantages of the joint approach.

In this chapter, the problem of estimating the complete, noise-free image signal of interest from a set of incomplete observations of pixel components corrupted by noise is approached from the point of view of Bayesian statistics, that is, by modelling the various quantities involved in terms of priors and likelihoods. The three design regimes considered here can be thought of as simultaneous interpolation and image denoising, though the chapter has a wider scope in the sense that explicitly modelling the image signal, the missing data, and the noise process yields insight into the interplay between the noise and the signal of interest. The chapter is not intended to provide detailed step-by-step instructions for estimating a complete noise-free image; rather, we present a theoretical basis for generalizing image signal models to the noisy subsampled case, and propose major building blocks for manipulating such data. The author feels that leading the discussion in this manner is most effective, as it allows flexibility in the choice of models.

There are a number of advantages to the proposed estimation schemes over the obvious alternative, the serial concatenation of independently designed interpolation and image denoising algorithms. For example, the image signal model assumptions underlying the interpolation procedure may differ from those of the image denoising.
This discrepancy is not only contradictory, and thus inefficient, but also triggers mathematically intractable interactions between mismatched models. Specifically, interpolating distorted image data may impose unintended correlation structures or bias on the noise and the image signal. Furthermore, a typical image denoising algorithm assumes a statistical model for natural images, not for the output of interpolation. Although grayscale and color image denoising techniques have been proposed [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], removing noise after demosaicking is impractical. Likewise, although many demosaicking algorithms developed in recent years yield impressive results in the absence of sensor noise, their performance is less than desirable in the presence of noise.

In this chapter, we investigate the problem of estimating a complete color image from the noisy undersampled signal using spectral and wavelet analysis of the noisy sensor data. In Section 9.2, we characterize the noise of CMOS and CCD sensors and evaluate it with respect to human visual system sensitivities and current image denoising techniques. Section 9.3 identifies the structure in the loss of information due to sampling and noise by examining the sensor data in the Fourier domain, and motivates a unified approach to interpolation and denoising. To exploit the local aliasing structures, Section 9.4 refines the spectral analysis of sensor data using time-frequency analysis. Conditioned on the image signal model, we propose three frameworks for estimating the complete noise-free image by manipulating the noisy subsampled data. In Section 9.5, we discuss the design of a spatially adaptive linear filter whose stop-band annihilates color artifacts and whose pass-band suppresses noise.
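The claim in this passage that interpolating noisy data imposes correlation structure on the noise can be checked with a small experiment. The sketch assumes an RGGB Bayer pattern and a flat gray scene, so that only i.i.d. Gaussian noise remains on the green lattice; the missing green values are filled with the mean of their four neighbors (bilinear interpolation with periodic borders), and the correlation between horizontally adjacent pixels is then measured. A short calculation predicts a correlation of 0.4 after interpolation, compared with 0 for the raw noise.

```python
import numpy as np

def interpolate_green(cfa, green_mask):
    """Bilinear green interpolation with periodic borders: keep the G samples and
    replace every other pixel by the mean of its four (green) neighbours."""
    g = np.where(green_mask, cfa, 0.0)
    neighbour_sum = sum(np.roll(g, shift, axis=ax) for shift in (1, -1) for ax in (0, 1))
    return np.where(green_mask, g, neighbour_sum / 4.0)

def horizontal_correlation(img):
    """Pearson correlation between horizontally adjacent pixels."""
    return float(np.corrcoef(img[:, :-1].ravel(), img[:, 1:].ravel())[0, 1])

rng = np.random.default_rng(0)
h = w = 256
yy, xx = np.mgrid[0:h, 0:w]
green = (yy + xx) % 2 == 1                      # G sites of an RGGB pattern
noise = rng.normal(0.0, 1.0, size=(h, w))       # i.i.d. sensor noise on a flat scene

raw_corr = horizontal_correlation(noise)
interp_corr = horizontal_correlation(interpolate_green(noise, green))
```

Every horizontal pair contains one measured green sample n and one interpolated value that includes n/4, so the covariance is σ²/4 while the average variance is 5σ²/8, giving 0.4. A denoiser that assumes white noise is therefore mismatched to the demosaicked output, which is the point the text makes.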
Section 9.6 demonstrates the modelling of noisy subsampled color images in the wavelet domain using a statistical missing-data formulation. As outlined in Section 9.7, it is also possible to estimate the wavelet coefficients of the desired image directly from the wavelet coefficients of the sensor data. That section also presents example output images obtained using the techniques described in this chapter. Finally, concluding remarks are given in Section 9.8.

9.1.1 A Comment About Model Assumptions

Wavelet-based statistical models for image signals play a dominant part in the image denoising literature. In this paradigm, the wavelet coefficients of image signals exhibit heavy-tailed distributions, motivating the use of the Laplacian distribution, Student's t-distribution, and Gaussian mixtures, to name a few. These heavy-tailed priors can be written as a continuous mixture of Gaussians with the general form $x \mid q \sim \mathcal{N}(\mu_x, \sigma_x^2/q)$. Here $\mu_x$ and $\sigma_x^2$ are the mean and variance parameters of a random variable $x$, and $q > 0$ is an augmented random variable whose distribution is specific to the choice of heavy tail. Thus, $x$ is conditionally normal: conditioned on $q$, its posterior distribution can largely be manipulated with second-order statistics. Alternatives to wavelet-based models include image patches [32], principal components [33], and anisotropic diffusion equations [28]. Many of them make use of the sum of (sometimes spatially-adaptive) outer products of vectorized pixel neighborhoods, which is the deterministic counterpart of the pixel-domain second-order statistics. As stated previously, the intention of this chapter is to provide tools for analyzing and manipulating subsampled data in a way that is relevant to the CFA image.
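The scale-mixture form $x \mid q \sim \mathcal{N}(\mu_x, \sigma_x^2/q)$ can be illustrated with a minimal NumPy sketch. For illustration, we assume that $q$ is Gamma-distributed with shape and rate both equal to $\nu/2$. This choice is ours, not the chapter's, and it makes $x$ Student's t with $\nu$ degrees of freedom. The function name and parameter values are hypothetical. Sampling $q$ first and then $x$ given $q$ shows both the heavier tails and the identity $E[x] = E\big[E[x \mid q]\big] = \mu_x$.

```python
import numpy as np

def sample_gaussian_scale_mixture(mu, sigma, nu, size, seed=0):
    """Draw x | q ~ N(mu, sigma^2 / q) with q ~ Gamma(shape=nu/2, rate=nu/2).

    With this (assumed) mixing density, x is Student's t with nu degrees
    of freedom, location mu and scale sigma.
    """
    rng = np.random.default_rng(seed)
    # NumPy parameterizes the Gamma law by shape and scale = 1 / rate.
    q = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=size)
    # Conditionally normal: a Gaussian draw whose variance is sigma^2 / q.
    x = mu + sigma * rng.standard_normal(size) / np.sqrt(q)
    return x, q

mu, sigma = 1.0, 1.0
x_t, q = sample_gaussian_scale_mixture(mu, sigma, nu=5.0, size=200_000)
x_gauss = mu + sigma * np.random.default_rng(1).standard_normal(200_000)

# Tail mass beyond four scale units: far larger for the scale mixture.
tail_t = np.mean(np.abs(x_t - mu) > 4 * sigma)
tail_gauss = np.mean(np.abs(x_gauss - mu) > 4 * sigma)
print(f"sample mean of mixture: {x_t.mean():.3f}")
print(f"P(|x - mu| > 4 sigma): mixture {tail_t:.4f}, Gaussian {tail_gauss:.5f}")
```

Conditioned on each drawn $q$, every computation is Gaussian. This is why estimators derived for the normal case can be lifted to heavy-tailed priors by averaging over $q$.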
Rather than reinvent signal models for subsampled image data, we choose to work with statistical or deterministic models for complete image data. In doing so, we inherit a rich literature on image modelling that has been shown to work well for image denoising, interpolation, segmentation, compression, and restoration. Furthermore, the discussion that follows is intentionally decoupled from any particular choice of image signal model. Instead, conditioned on the complete image model, the discussion focuses primarily on the changes needed to manipulate the CFA image directly. Specifically, the theoretical frameworks for analyzing subsampled data below are developed in terms of the second-order statistics of complete image data. One can take the expectation over the conditionals in the posterior: $E[x] = E\big[E[x \mid q]\big]$ in the example above, where $x \mid q$ in the inner expectation is normal. In this way, an estimator derived for the multivariate normal generalizes to heavy-tailed distributions, as in the case of Bayesian estimators. Alternatively, replacing the second-order statistics with the sum of outer products yields the deterministic extension of CFA image processing. In any case, the technical frameworks presented below are nonrestrictive and compatible with a wide range of assumed models. This allows flexibility in selecting the model best suited to the computational and image quality requirements of the application.

9.1.2 Terminologies and Notational Conventions

Because several technical terms used in this chapter sound similar but have different meanings, we would like to clarify their definitions. The term color filter refers to a physical device placed over photosensitive elements called pixel sensors. It produces a color coding by blocking electromagnetic radiation at specified wavelengths.
This is not to be confused with a filter in the signal processing sense, that is, convolution filtering realized by taking a linear combination of nearby pixel or sensor values. Likewise, given a two-dimensional signal, terms such as frequency and spectrum are to be interpreted in the context of two-dimensional Fourier transforms, not in the sense of colorimetry. In this chapter, all image signals are assumed to be discrete (i.e., post-sampling). For notational simplicity, plain characters (e.g., $x$) represent a scalar, whereas bold-face characters (e.g., $\mathbf{x}$) represent a vector or a matrix. An arrow over a character denotes vectorization; that is, $\vec{x}$ is a rearrangement of $x(\cdot)$ into vector form. Other conventions are summarized below for bookkeeping; their formal definitions are made explicit in the sequel:

$n \in \mathbb{Z}^2$: pixel/sample location index
$\mathbf{x} : \mathbb{Z}^2 \to \mathbb{R}^3$: signal of interest, the ideal (noise-free) color image; $\mathbf{x} = [x_1, x_2, x_3]^T$ are the RGB triples
$\boldsymbol{\varepsilon} : \mathbb{Z}^2 \to \mathbb{R}^3$: noise for $\mathbf{x}$
$\mathbf{c} : \mathbb{Z}^2 \to \{0, 1\}^3$: color filter coding indicator
$\ell : \mathbb{Z}^2 \to \mathbb{R}$: monochromatic or approximate luminance image, $\ell = \frac{1}{4}x_1 + \frac{1}{2}x_2 + \frac{1}{4}x_3$
$\alpha : \mathbb{Z}^2 \to \mathbb{R}$: color difference or approximate chrominance image, $\alpha = x_1 - x_2$
$\beta : \mathbb{Z}^2 \to \mathbb{R}$: color difference or approximate chrominance image, $\beta = x_3 - x_2$
$y : \mathbb{Z}^2 \to \mathbb{R}$: ideal (noise-free) sensor data or CFA image, $y(n) = \mathbf{c}^T(n)\mathbf{x}(n)$
$\varepsilon : \mathbb{Z}^2 \to \mathbb{R}$: noise for $y$
$z : \mathbb{Z}^2 \to \mathbb{R}$: noisy sensor data, $z = y + \varepsilon$
$\mathbf{g} : \mathbb{Z}^2 \times \mathbb{Z}^2 \to \mathbb{R}^3$: spatially-adaptive filter coefficients
$h_0, h_1, f_0, f_1 : \mathbb{Z} \to \mathbb{R}$: one-dimensional impulse responses of the convolution filters used in the filterbank

In the above, the elements of the vector $\mathbf{x}(n) = [x_1(n), x_2(n), x_3(n)]^T$ are interpreted as the red, green, and blue pixel component values, respectively. The results established in this chapter are nonetheless equally applicable to other color coding schemes. The luminance-chrominance representation of a color image, $[\ell(n), \alpha(n), \beta(n)]$, is an invertible linear transformation of $\mathbf{x}(n)$.
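Because the luminance-chrominance representation is an invertible linear map of $\mathbf{x}(n)$, it can be written as a $3 \times 3$ matrix acting on each RGB triple. The NumPy sketch below uses hypothetical helper names. It encodes $\ell = \frac14 x_1 + \frac12 x_2 + \frac14 x_3$, $\alpha = x_1 - x_2$, and $\beta = x_3 - x_2$. It then inverts the map in closed form via $x_2 = \ell - \alpha/4 - \beta/4$, $x_1 = x_2 + \alpha$, and $x_3 = x_2 + \beta$.

```python
import numpy as np

# Rows map [x1, x2, x3] to [luminance, alpha, beta].
M = np.array([[0.25, 0.5, 0.25],
              [1.0, -1.0, 0.0],
              [0.0, -1.0, 1.0]])

def to_lum_chrom(x):
    """x has shape (..., 3) holding RGB triples; returns [l, alpha, beta]."""
    return x @ M.T

def from_lum_chrom(v):
    """Closed-form inverse: x2 = l - alpha/4 - beta/4, x1 = x2 + alpha, x3 = x2 + beta."""
    lum, alpha, beta = v[..., 0], v[..., 1], v[..., 2]
    x2 = lum - alpha / 4.0 - beta / 4.0
    return np.stack([x2 + alpha, x2, x2 + beta], axis=-1)

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(4, 5, 3))   # a toy 4x5 RGB image
coeffs = to_lum_chrom(image)
print("det(M) =", np.linalg.det(M))              # nonzero: the map is invertible
print("round trip exact:", np.allclose(from_lum_chrom(coeffs), image))
```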
The symbols $x : \mathbb{Z} \to \mathbb{R}$ and $\varepsilon : \mathbb{Z} \to \mathbb{R}$ are also occasionally used for a generic (nondescriptive) one-dimensional signal and noise, respectively. The scalar functions $x(n)$ and $\varepsilon(n)$ are used interchangeably with $\mathbf{x}(n)$ and $\boldsymbol{\varepsilon}(n)$, respectively, to generalize results to the multivariate case. In addition, given a two-dimensional function $x : \mathbb{Z}^2 \to \mathbb{R}$, its Fourier transform is denoted by $\hat{x}(\omega)$, where $\omega = [\omega_0, \omega_1]^T \in \mathbb{R}^2$ is the modulo-$2\pi$ frequency index. Similarly, let $i \in \{0, 1, \ldots, I\}^2$ be the subband index for the separable two-dimensional filterbank decomposition with $(I+1)^2$ subbands, where a smaller index value corresponds to a lower-frequency channel. Then $w_i^x(n)$ is the filterbank (or wavelet packet) coefficient of the signal $x(n)$ at the $i$-th subband and the $n$-th spatial location.

9.2 Noise Model

In order to design an effective image denoising system, it is important to characterize the noise in an image sensor. The CMOS photodiode active pixel sensor typically uses a photodiode and three transistors, all of which are major sources of noise [34]. CCD sensors rely on the electron-hole pair that is generated when a photon strikes silicon [35]. A detailed investigation of the noise sources is beyond the scope of this chapter. Studies suggest, however, that $z : \mathbb{Z}^2 \to \mathbb{R}$, the number of photons encountered during an integration period (the duration between resets), is a Poisson process $P_y$:

$$p\big(z(n) \mid y(n)\big) = \frac{e^{-y(n)}\, y(n)^{z(n)}}{z(n)!},$$

where $n \in \mathbb{Z}^2$ is the pixel location index and $y(n)$ is the expected photon count per integration period at location $n$, which is linear with respect to the intensity of the light. Note that $E\big[z(n) \mid y(n)\big] = y(n)$ and $E\big[\big(z(n) - E[z(n) \mid y(n)]\big)^2 \mid y(n)\big] = y(n)$. That is, the mean and the variance both equal $y(n)$. Then, as the integration period increases, $p\big(z(n) \mid y(n)\big)$ converges weakly to $\mathcal{N}\big(y(n), y(n)\big)$, or

$$z(n) \approx y(n) + \sqrt{y(n)}\,\varepsilon(n), \tag{9.1}$$

where $\varepsilon \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$ is independent of $y$.
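The moment identities and the Gaussian approximation in Equation 9.1 are easy to check by simulation. The NumPy sketch below draws Poisson counts $z(n)$ for a few expected counts $y(n)$; the counts and the sample size are arbitrary choices. It confirms that the mean and the variance both track $y(n)$. It also measures how often $z$ falls within one standard deviation $\sqrt{y}$ of $y$. For a Gaussian this fraction is about 0.683, and the Poisson fraction approaches that value as $y$ grows.

```python
import numpy as np

def poisson_summary(y, num_samples=200_000, seed=0):
    """Simulate photon counts z ~ Poisson(y) and summarize them."""
    rng = np.random.default_rng(seed)
    z = rng.poisson(lam=y, size=num_samples)
    mean, var = z.mean(), z.var()
    # Fraction of counts within one shot-noise standard deviation sqrt(y).
    within_1sd = np.mean(np.abs(z - y) <= np.sqrt(y))
    return mean, var, within_1sd

results = {y: poisson_summary(y) for y in (5.0, 50.0, 500.0)}
for y, (mean, var, frac) in results.items():
    print(f"y={y:6.1f}  mean={mean:8.2f}  var={var:8.2f}  "
          f"SNR={mean / np.sqrt(var):6.2f}  P(|z-y|<=sqrt(y))={frac:.3f}")
```

The printed signal-to-noise ratio grows like $\sqrt{y}$, which is the behavior discussed next.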
The Gaussian approximation in Equation 9.1 is justifiable via a straightforward application of the central limit theorem to the binomial distribution. The noise term $\sqrt{y(n)}\,\varepsilon(n)$ is commonly referred to as shot noise. In practice, the photodiode charge (i.e., the photodetector readout signal) is assumed proportional to $z(n)$. Thus we interpret $y(n)$ and $z(n)$ as the ideal and noisy sensor data at pixel location $n$, respectively. For a typical consumer-grade digital camera, the approximation in Equation 9.1 is reasonable. The significance of Equation 9.1 is that the signal-to-noise ratio, $y(n)/\sqrt{y(n)} = \sqrt{y(n)}$, improves for large values of $y(n)$ (e.g., outdoor photography). For small values of $y(n)$ (e.g., indoor photography), the noise is severe. To make matters worse, the human visual response to the light $y(n)$ is often modeled as $\sqrt[3]{y(n)}$, suggesting a heightened sensitivity to deviations in the dark regions of the image. To see this, note that the perceived noise magnitude is proportional to

$$\sqrt[3]{z(n)} - \sqrt[3]{y(n)} = \sqrt[3]{y(n) + \sqrt{y(n)}\,\varepsilon(n)} - \sqrt[3]{y(n)}.$$

For a fixed value of $\varepsilon(n)$, the magnitude of this expression decreases as $y(n)$ increases. For $\varepsilon(n) > 0$, this holds once $y(n)$ exceeds roughly $0.1\,\varepsilon^2(n)$, which covers practical photon counts.

There have been some hardware solutions to the sensor noise problem. For example, the cyan-magenta-yellow (CMY) CFA pattern performs better in a noisy environment, as its quantum efficiency is more favorable than that of RGB. That is, a CMY-based CFA allows more photons to penetrate through to the photosensitive element, because the pigments used in it are considerably thinner than those of an RGB-based CFA. The disadvantage is that the photo-sensitivity wavelengths of cyan, magenta, and yellow overlap considerably. Therefore, the color space conversion from CMY to RGB is an unstable operation. Today, CMY-based CFAs are more commonly used in video cameras, since the frame rate restricts the length of the integration period.
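The effect of the cube-root visual response on perceived shot noise can be tabulated directly. This plain-Python sketch evaluates $\sqrt[3]{y + \sqrt{y}\,\varepsilon} - \sqrt[3]{y}$ for a fixed deviation $\varepsilon = \pm 1$ over a range of expected photon counts. The values of $y$ are arbitrary, and the cube-root response model is the one quoted above.

```python
import math

def real_cbrt(v):
    """Real-valued cube root, defined for negative inputs as well."""
    return math.copysign(abs(v) ** (1.0 / 3.0), v)

def perceived_noise(y, eps):
    """Cube-root response to z = y + sqrt(y) * eps, minus the response to y."""
    z = y + math.sqrt(y) * eps
    return real_cbrt(z) - real_cbrt(y)

photon_counts = [1.0, 10.0, 100.0, 1000.0]
for eps in (1.0, -1.0):
    values = [perceived_noise(y, eps) for y in photon_counts]
    print(f"eps={eps:+.0f}: " + ", ".join(f"{v:+.3f}" for v in values))
```

In both rows, the magnitude shrinks steadily from dark ($y = 1$) to bright ($y = 1000$) pixels.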
Other circuit-based noise-reduction techniques include correlated double sampling. In this scheme, each pixel sensor is sampled twice: first to measure the reset/amplifier noise alone, and second to measure the photon count and the reset/amplifier noise combined. The difference of the two is presumed free of reset/amplifier noise, although the shot noise of Equation 9.1 remains. In reality, efforts to address the signal-dependent noise in Equation 9.1 lag behind those of image interpolation and image denoising for additive white Gaussian noise (AWGN). A standard technique for working with signal-dependent noise is to apply an invertible nonlinear operator $\gamma(\cdot)$ to $z$ such that signal and noise are (approximately) decoupled: $\gamma(z) \mid \gamma(y) \sim \mathcal{N}\big(\gamma(y), \sigma^2\big)$ for some constant $\sigma^2$. Homomorphic filtering is one such operator, designed with a monotonically increasing nonlinear pointwise function $\gamma : \mathbb{R} \to \mathbb{R}$ [36], [37]. The Haar-Fisz transform $\gamma : \mathbb{Z}^2 \times \mathbb{R} \to \mathbb{Z}^2 \times \mathbb{R}$ is a multiscale method that asymptotically decorrelates signal and noise [38], [39]. In either case, a signal estimation technique (assuming AWGN) is used to estimate $\gamma(y)$ given $\gamma(z)$, and the inverse transform $\gamma^{-1}(\cdot)$ yields an estimate of $y$. The advantage of this approach is the modularity of the design of $\gamma(\cdot)$ and the estimator. The disadvantage is that the signal model assumed for $y$ may not hold for $\gamma(y)$. Moreover, the optimality of the estimator in the new domain (e.g., the minimum mean squared error estimator) does not translate into optimality in the range space of $y$, especially when $\gamma(\cdot)$ deviates significantly from linearity. An alternative to decorrelation is to approximate the noise standard deviation, $\sqrt{y(n)}$. The AWGN model is effectively a zeroth-order Taylor expansion of the Poisson process. The affine noise model used in References [32] and [40] is a first-order Taylor expansion of Equation 9.1.
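As a concrete instance of a pointwise variance-stabilizing $\gamma(\cdot)$, the sketch below uses the Anscombe transform $\gamma(z) = 2\sqrt{z + 3/8}$. This is a standard transform for Poisson data that we substitute here for illustration only; it is not one of the specific methods cited above. After the transform, the noise variance is approximately 1 regardless of $y$, whereas the raw variance grows with $y$. The sketch also shows a small bias in the range space of $y$ when the plain algebraic inverse is applied.

```python
import numpy as np

def anscombe(z):
    """Pointwise variance-stabilizing transform for Poisson counts."""
    return 2.0 * np.sqrt(np.asarray(z, dtype=float) + 3.0 / 8.0)

def inverse_anscombe(a):
    """Plain algebraic inverse of the Anscombe transform."""
    return (np.asarray(a, dtype=float) / 2.0) ** 2 - 3.0 / 8.0

rng = np.random.default_rng(0)
stats = {}
for y in (20.0, 200.0, 2000.0):
    z = rng.poisson(lam=y, size=200_000)
    a = anscombe(z)
    # Mapping the gamma-domain mean back with the plain inverse underestimates
    # y (by roughly 1/4): optimality in the new domain does not carry over.
    bias = y - inverse_anscombe(a.mean())
    stats[y] = (z.var(), a.var(), bias)
    print(f"y={y:7.1f}  var(z)={stats[y][0]:9.2f}  "
          f"var(gamma(z))={stats[y][1]:.3f}  bias={bias:+.3f}")
```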
In practice, the AWGN and affine approximations yield acceptable performance because CMOS sensors operate over a relatively limited dynamic range. This validates the Taylor approximation when the expansion is centered about the midpoint of the operating range. The human visual system can also tolerate a greater degree of error in the brighter regions of the image. This allows a more accurate noise characterization for small values of $y$, at the cost of a poorer characterization for larger values of $y$. Alternatively, empirical methods that address signal-dependent noise take a two-step approach [21]. First, a crude estimate of the noise variance at each pixel location $n$ is found. Second, conditioned on this noise variance estimate, the signal is assumed to be corrupted by signal-independent noise. A piecewise AWGN model achieves a similar approximation. Methods that work with the posterior distribution of the coefficients of interest, such as Markov chain Monte Carlo and importance sampling, either converge slowly or require a large number of observations [41]. Emerging frameworks in Bayesian analysis for Poisson noise yield an asymptotic representation of the Poisson process in the wavelet domain, but manipulating data in this class of representations is extremely complicated [42]. For all of these reasons, the estimation of the mean $y$ given the Poisson process $z$ is not a well-understood problem, and existing methods use variations of AWGN models to address Poisson noise. Hence, while acknowledging its inadequacies, we restrict our attention to the AWGN problem,

$$z(n) = y(n) + \varepsilon(n), \tag{9.2}$$

where $\varepsilon \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_\varepsilon^2)$.

9.3 Spectral Analysis of CFA Image

In this section, we take a closer look at the sampling scheme and the structure of aliasing induced by the Bayer color filter array illustrated in Figure 9.1b [11], [17].
The estimation of missing pixel components from observed pixel components is generally an ill-posed problem. By assuming that image signals are highly structured, however, we effectively assume that the signal of interest lives in a lower-dimensional subspace that can be represented by the subspace spanned by the color filter array. Thus, although the loss of data at the hardware interface is inevitable, the loss of information due to sampling may be limited. We will show that Fourier analysis and aliasing serve as a measure of the loss of information, and that they motivate joint modelling and manipulation of subsampled data and noise. This joint treatment will subsequently be fine-tuned using the locally adaptive schemes of Sections 9.5 to 9.7. In a color image, such as the one shown in Figure 9.1a, the image pixel $\mathbf{x}(n) = [x_1(n), x_2(n), x_3(n)]^T$ at position $n \in \mathbb{Z}^2$ is a vector, typically expressed in terms of RGB coordinates. Figure 9.2a is a grayscale version of Figure 9.1a. Figure 9.2b to Figure 9.2d depict the red, green, and blue channels of the original color image, respectively. Visual inspection of these channels reveals that they may contain redundant information with respect to edges and textures. This reflects the fact that changes in color at object boundaries are secondary to changes in intensity. The correlation of color content at high frequencies implies, and it is well accepted among color image scientists, that the difference images (e.g., red-green, blue-green) exhibit rapid spectral decay relative to monochromatic image signals (e.g., gray, red, green). The difference images are therefore slowly varying over the spatial domain; see Figure 9.2e and Figure 9.2f. Such heuristic intuitions are further backed by human physiology: the contrast sensitivity function of the luminance channel in human vision is typically modelled with a much wider pass-band than those of the chrominance channels.
An alternative modelling strategy based on color ratios has also been studied [43], [44], [45], [46]. Assuming that objects have piecewise constant color, the ratios between color components within an object are constant, even though the pixel intensities may vary over space. In practice, however, numerical stability of the ratios is difficult to achieve, and the spatial variation of the intensity levels is not captured explicitly by this model. For these reasons, while acknowledging the merits of the color-ratio modelling strategy, the discussion in this chapter is confined to difference-image modelling. Let $\mathbf{c}(n) = [c_1(n), c_2(n), c_3(n)]^T \in \{[1,0,0]^T, [0,1,0]^T, [0,0,1]^T\}$ be a CFA coding such that the noise-free sensor data can be written as an inner product, $y(n) = \mathbf{c}^T(n)\mathbf{x}(n)$. This is a convex combination, so $c_2(n) = 1 - c_1(n) - c_3(n)$, and we may decompose $y(n)$ in the following manner:

$$\begin{aligned} y(n) &= c_1(n)x_1(n) + c_2(n)x_2(n) + c_3(n)x_3(n) \\ &= c_1(n)x_1(n) + \big(1 - c_1(n) - c_3(n)\big)x_2(n) + c_3(n)x_3(n) \\ &= c_1(n)\big(x_1(n) - x_2(n)\big) + c_3(n)\big(x_3(n) - x_2(n)\big) + x_2(n) \\ &= c_1(n)\alpha(n) + c_3(n)\beta(n) + x_2(n), \end{aligned} \tag{9.3}$$

where the difference images $\alpha(n) = x_1(n) - x_2(n)$ and $\beta(n) = x_3(n) - x_2(n)$ are crude approximations of the chrominance channels. In other words, the convex combination above can be thought of as the sum of $x_2(n)$ and the subsampled difference images $c_1(n)\alpha(n)$ and $c_3(n)\beta(n)$. These three components are shown pictorially in Figure 9.2c, Figure 9.2g, and Figure 9.2h, and their sum equals the sensor data in Figure 9.1b.

FIGURE 9.2 Zoomed portion of the Clown image: (a) grayscale version of the original color image, (b) decomposed red channel, (c) decomposed green channel, (d) decomposed blue channel, (e) difference image $x_1 - x_2$, (f) difference image $x_3 - x_2$, (g) subsampled version of $x_1 - x_2$, and (h) subsampled version of $x_3 - x_2$.
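Equation 9.3 can be checked numerically on any test image. The NumPy sketch below builds Bayer coding masks with $\mathbf{c}(0,0) = [1,0,0]^T$: red at (even, even) sites, blue at (odd, odd) sites, and green elsewhere. We assume this matches the layout in the text. The sketch forms $y(n) = \mathbf{c}^T(n)\mathbf{x}(n)$ and confirms that it equals $c_1\alpha + c_3\beta + x_2$. The random image is only a stand-in for real data, and the helper name is hypothetical.

```python
import numpy as np

def bayer_masks(rows, cols):
    """Indicator images c1, c2, c3 for a Bayer CFA with red at (0, 0)."""
    n0, n1 = np.indices((rows, cols))
    c1 = ((n0 % 2 == 0) & (n1 % 2 == 0)).astype(float)   # red
    c3 = ((n0 % 2 == 1) & (n1 % 2 == 1)).astype(float)   # blue
    c2 = 1.0 - c1 - c3                                    # green (quincunx)
    return c1, c2, c3

rng = np.random.default_rng(0)
x1, x2, x3 = rng.uniform(0.0, 1.0, size=(3, 8, 8))       # toy RGB planes
c1, c2, c3 = bayer_masks(8, 8)

y = c1 * x1 + c2 * x2 + c3 * x3          # y(n) = c^T(n) x(n)
alpha, beta = x1 - x2, x3 - x2           # difference images
y_decomposed = c1 * alpha + c3 * beta + x2

print("green fraction:", c2.mean())      # half of the sites are green
print("Equation 9.3 holds:", np.allclose(y, y_decomposed))
```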
The Bayer sampling pattern induces a composition of dyadic decimation and interpolation operators. It follows that $\hat{y}(\omega)$, the Fourier transform of the sensor data $y(n)$, is the sum of $\hat{x}_2(\omega)$ and spectral copies of $\hat{\alpha}(\omega)$ and $\hat{\beta}(\omega)$:

$$\begin{aligned} \hat{y}(\omega) &= \hat{x}_2(\omega) + \frac{1}{4}\Big[(\hat{\alpha} + \hat{\beta})(\omega) + (\hat{\alpha} - \hat{\beta})(\omega - [\pi, 0]^T) + (\hat{\alpha} - \hat{\beta})(\omega - [0, \pi]^T) + (\hat{\alpha} + \hat{\beta})(\omega - [\pi, \pi]^T)\Big] \\ &= \hat{\ell}(\omega) + \frac{1}{4}\Big[(\hat{\alpha} - \hat{\beta})(\omega - [\pi, 0]^T) + (\hat{\alpha} - \hat{\beta})(\omega - [0, \pi]^T) + (\hat{\alpha} + \hat{\beta})(\omega - [\pi, \pi]^T)\Big], \end{aligned} \tag{9.4}$$

where, without loss of generality, the origin is fixed as $\mathbf{c}(0,0) = [1,0,0]^T$, and

$$\hat{\ell}(\omega) = \hat{x}_2(\omega) + \frac{1}{4}\hat{\alpha}(\omega) + \frac{1}{4}\hat{\beta}(\omega) = \frac{1}{4}\hat{x}_1(\omega) + \frac{1}{2}\hat{x}_2(\omega) + \frac{1}{4}\hat{x}_3(\omega) \tag{9.5}$$

is a crude approximation of the luminance channel.

FIGURE 9.3 Log-magnitude two-dimensional spectra of: (a) $\hat{\ell}$, (b) $\hat{\alpha}$, (c) $\hat{\beta}$, and (d) $\hat{y}$. The spectra were obtained using the Clown image. The figure is color-coded to show the contribution of each channel in (d): green for $\hat{\ell}$, red for $\hat{\alpha}$, blue for $\hat{\beta}$.

The representation of the sensor data in Equation 9.4, in terms of the luminance and the difference images $\alpha$ and $\beta$, is convenient because $\alpha$ and $\beta$ are typically sparse in the Fourier domain. To see this, consider Figure 9.3, which shows the log-magnitude spectra of a typical color image. The high-frequency components, a well-accepted indicator of edges, object boundaries, and textures, are easily found in Figure 9.3a. In contrast, the spectra in Figure 9.3b and Figure 9.3c reveal that $\alpha$ and $\beta$ are low-pass. This supports our earlier claim about the slowly varying nature of the signals in Figure 9.2e and Figure 9.2f. It is typically easier to estimate a lower-bandwidth signal from its sparsely subsampled version (see Figure 9.2g and Figure 9.2h), since such a signal is less subject to aliasing.
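Equation 9.4 can likewise be verified with the discrete Fourier transform. On an $N \times N$ grid with even $N$, multiplying by $(-1)^{n_0}$ shifts the spectrum by $\pi$ along $\omega_0$. This corresponds to rolling the DFT array by $N/2$ along that axis. The NumPy sketch below compares the DFT of the sensor data with $\hat{\ell}$ plus the three modulated copies. It uses a random image and the same Bayer layout as Equation 9.3, with $\mathbf{c}(0,0) = [1,0,0]^T$. The helper names are hypothetical.

```python
import numpy as np

def bayer_masks(rows, cols):
    n0, n1 = np.indices((rows, cols))
    c1 = ((n0 % 2 == 0) & (n1 % 2 == 0)).astype(float)   # red at (even, even)
    c3 = ((n0 % 2 == 1) & (n1 % 2 == 1)).astype(float)   # blue at (odd, odd)
    return c1, 1.0 - c1 - c3, c3

def shift_by_pi(F, k0, k1):
    """Return F(omega - [k0*pi, k1*pi]) on the DFT grid (k0, k1 in {0, 1})."""
    N0, N1 = F.shape
    return np.roll(F, (k0 * N0 // 2, k1 * N1 // 2), axis=(0, 1))

N = 64
rng = np.random.default_rng(0)
x1, x2, x3 = rng.uniform(0.0, 1.0, size=(3, N, N))
c1, c2, c3 = bayer_masks(N, N)
y = c1 * x1 + c2 * x2 + c3 * x3

alpha, beta = x1 - x2, x3 - x2
lum = 0.25 * x1 + 0.5 * x2 + 0.25 * x3                    # Equation 9.5
A, B, L = np.fft.fft2(alpha), np.fft.fft2(beta), np.fft.fft2(lum)

Y_predicted = L + 0.25 * (shift_by_pi(A - B, 1, 0)
                          + shift_by_pi(A - B, 0, 1)
                          + shift_by_pi(A + B, 1, 1))
print("Equation 9.4 holds:", np.allclose(np.fft.fft2(y), Y_predicted))
```

On a natural image rather than random data, $A$ and $B$ would be concentrated near the origin. The copies rolled to $[\pi, \pi]^T$ would then stay clear of $L$, while the copies at $[\pi, 0]^T$ and $[0, \pi]^T$ would overlap it.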
The key observation from Equation 9.4, therefore, is that we expect a Fourier-domain representation of sensor data similar to that illustrated in Figure 9.3d. The spectral copies of $\hat{\alpha} - \hat{\beta}$ centered around $[\pi, 0]^T$ and $[0, \pi]^T$ overlap with the baseband $\hat{\ell}$, while $\hat{\alpha} + \hat{\beta}$ centered around $[\pi, \pi]^T$ remains aliasing-free. Note that there is no straightforward global strategy for recovering an unaliased $\hat{\ell}$, because both spectral copies centered around $[\pi, 0]^T$ and $[0, \pi]^T$ are aliased with the baseband $\hat{\ell}$. Dubois et al., however, emphasized that the local image features of the baseband $\hat{\ell}$ exhibit a strong directional bias. Therefore, either $(\hat{\alpha} - \hat{\beta})(\omega - [\pi, 0]^T)$ or $(\hat{\alpha} - \hat{\beta})(\omega - [0, \pi]^T)$ is locally recoverable from the sensor data [47]. This observation motivates locally adaptive nonlinear processing; in fact, most existing demosaicking methods can be re-examined from this perspective. Specifically, Figure 9.4 illustrates the presumed local aliasing pattern. Locally horizontal images suffer from aliasing between $\hat{\ell}$ and $(\hat{\alpha} - \hat{\beta})(\omega - [\pi, 0]^T)$, while we expect $(\hat{\alpha} - \hat{\beta})(\omega - [0, \pi]^T)$ to remain relatively intact. Conversely, locally vertical images suffer from aliasing between $\hat{\ell}$ and $(\hat{\alpha} - \hat{\beta})(\omega - [0, \pi]^T)$, while $(\hat{\alpha} - \hat{\beta})(\omega - [\pi, 0]^T)$ is clean. As a side note, locally diagonal image features are often ignored in demosaicking algorithm design. Such features do not interfere with $(\hat{\alpha} - \hat{\beta})(\omega - [\pi, 0]^T)$ or $(\hat{\alpha} - \hat{\beta})(\omega - [0, \pi]^T)$, making the reconstruction of diagonal features a trivial task.

FIGURE 9.4 Presumed aliasing structure in local spectra, conditioned on the local image features of the surrounding region: (a) $\hat{y}$ given horizontal features, and (b) $\hat{y}$ given vertical features. Compare with Figure 9.3d.

Finally, let $z(n)$ be the noisy sensor data,

$$z(n) = y(n) + \varepsilon(n) = c_1(n)\alpha(n) + c_3(n)\beta(n) + x_2(n) + \varepsilon(n), \tag{9.6}$$

where $\varepsilon \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_\varepsilon^2)$. Recall that the Fourier transform is a unitary transformation, so spatially white noise in the space domain remains uncorrelated in the frequency representation. It follows that the Fourier transform of the noisy observation is

$$\hat{z}(\omega) = \hat{\ell}(\omega) + \frac{1}{4}\Big[(\hat{\alpha} - \hat{\beta})(\omega - [\pi, 0]^T) + (\hat{\alpha} - \hat{\beta})(\omega - [0, \pi]^T) + (\hat{\alpha} + \hat{\beta})(\omega - [\pi, \pi]^T)\Big] + \hat{\varepsilon}(\omega).$$

In other words, the sensor data is the baseband luminance image $\hat{\ell}$ distorted by the noise $\hat{\varepsilon}$ and by aliasing due to the spectral copies of $\hat{\alpha}$ and $\hat{\beta}$, where $\hat{\varepsilon}$, $\hat{\alpha}$, and $\hat{\beta}$ are conditionally normal. A unified strategy for demosaicking and denoising, therefore, is to design an estimator that suppresses noise and attenuates aliased components simultaneously. Section 9.5 shows how this can be accomplished with a spatially-adaptive linear filter whose stop-band contains the spectral copies of the difference images and whose pass-band suppresses noise.

9.4 Wavelet Analysis of CFA Image

In the previous section, we established the inadequacy of a global approach to CFA image processing. In this section, we develop a time-frequency analysis framework to exploit the local aliasing structures [17]. Specifically, image signals are highly nonstationary (inhomogeneous), and thus an orthogonal filterbank (or wavelet packet) expansion of sparsely sampled signals should prove useful. For simplicity, consider first a one-dimensional signal $x : \mathbb{Z} \to \mathbb{R}$. A one-level filterbank structure defined by the filters $\{h_0, h_1, f_0, f_1\}$ is shown in Figure 9.5. It is a linear transformation composed of convolution filters and decimators. The channel containing the low-frequency components is often called the approximation (denoted $w_0^x(n)$). The other channel, containing the high-frequency components, is referred to as the detail (denoted $w_1^x(n)$). The decomposition can be nested recursively to gain more precision in frequency.

FIGURE 9.5 One-level filterbank structure.
The approximation and detail coefficients of a one-level decomposition can be analyzed in the Fourier domain as

$$\hat{w}_i^x(\omega) = \frac{1}{2}\Big[\hat{h}_i\big(\tfrac{\omega}{2}\big)\hat{x}\big(\tfrac{\omega}{2}\big) + \hat{h}_i\big(\tfrac{\omega}{2} - \pi\big)\hat{x}\big(\tfrac{\omega}{2} - \pi\big)\Big],$$

where $i \in \{0, 1\}$. With a careful choice of filters $\{h_0, h_1, f_0, f_1\}$, the original signal $x(n)$ can be recovered exactly from the filterbank coefficients $w_0^x(n)$ and $w_1^x(n)$. To see this, consider the reconstruction stage of the one-level filterbank in Figure 9.5. The reconstructed signal $x_{rec}(n)$ has the following form in the frequency domain:

$$\begin{aligned} \hat{x}_{rec}(\omega) &= \hat{f}_0(\omega)\hat{w}_0^x(2\omega) + \hat{f}_1(\omega)\hat{w}_1^x(2\omega) \\ &= \frac{1}{2}\Big[\hat{f}_0(\omega)\hat{h}_0(\omega) + \hat{f}_1(\omega)\hat{h}_1(\omega)\Big]\hat{x}(\omega) + \frac{1}{2}\Big[\hat{f}_0(\omega)\hat{h}_0(\omega - \pi) + \hat{f}_1(\omega)\hat{h}_1(\omega - \pi)\Big]\hat{x}(\omega - \pi). \end{aligned}$$

In other words, the output is a linear combination of filtered versions of the signal $\hat{x}(\omega)$ and of the frequency-modulated signal $\hat{x}(\omega - \pi)$. The structure in Figure 9.5 is called a perfect reconstruction filterbank if

$$\hat{f}_0(\omega)\hat{h}_0(\omega) + \hat{f}_1(\omega)\hat{h}_1(\omega) = 2, \qquad \hat{f}_0(\omega)\hat{h}_0(\omega - \pi) + \hat{f}_1(\omega)\hat{h}_1(\omega - \pi) = 0.$$

That is, the combined filter applied to $\hat{x}(\omega)$ is a constant, whereas the combined filter applied to the aliased version is zero. A large body of literature exists on designing sets of filters $\{h_0, h_1, f_0, f_1\}$ that form a perfect reconstruction filterbank [48]. For example, wavelet packets belong to a class of filterbanks arising from factorizing filters that satisfy the Nyquist condition (Smith-Barnwell [48]). In this case, the following relations hold by construction:

$$\hat{h}_1(\omega) = -e^{-j\omega m}\hat{h}_0(-\omega - \pi), \qquad \hat{f}_0(\omega) = \hat{h}_1(\omega - \pi), \qquad \hat{f}_1(\omega) = -\hat{h}_0(\omega - \pi). \tag{9.7}$$

In other words, $h_1$ is a time-shifted, time-reversed, and frequency-modulated version of $h_0$, and $f_0$ and $f_1$ are time-reversed versions of $h_0$ and $h_1$, respectively. The derivation of these filters is beyond the scope of this chapter; interested readers are referred to Reference [48] for details.
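For the Haar filters, both perfect reconstruction conditions can be checked numerically on a grid of frequencies, and the corresponding two-channel analysis/synthesis pair reconstructs a signal exactly. The NumPy sketch below fixes its own index conventions for illustration. It takes $h_0(-1) = h_0(0) = 1/\sqrt{2}$ and $h_1(0) = -h_1(-1) = 1/\sqrt{2}$, with time-reversed synthesis filters $f_0(n) = h_0(-n)$ and $f_1(n) = h_1(-n)$. The analysis step follows the convention $w_i(n) = \sum_k h_i(k)\,x(2n - k)$.

```python
import numpy as np

s = 1.0 / np.sqrt(2.0)
# Filters stored as {time index n: coefficient}.
h0 = {-1: s, 0: s}
h1 = {-1: -s, 0: s}
f0 = {-n: v for n, v in h0.items()}      # time-reversed analysis filters
f1 = {-n: v for n, v in h1.items()}

def dtft(h, w):
    """Discrete-time Fourier transform sum_n h(n) exp(-j w n) on an array of w."""
    return sum(v * np.exp(-1j * w * n) for n, v in h.items())

w = np.linspace(-np.pi, np.pi, 101)
distortion = dtft(f0, w) * dtft(h0, w) + dtft(f1, w) * dtft(h1, w)
aliasing = dtft(f0, w) * dtft(h0, w - np.pi) + dtft(f1, w) * dtft(h1, w - np.pi)

def analysis(x):
    """w_i(n) = sum_k h_i(k) x(2n - k); for these filters it pairs x(2n), x(2n+1)."""
    even, odd = x[0::2], x[1::2]
    return s * (even + odd), s * (even - odd)

def synthesis(w0, w1):
    """Upsample by 2 and filter with f0, f1; returns 2 * len(w0) samples."""
    x = np.empty(2 * len(w0))
    x[0::2] = s * (w0 + w1)
    x[1::2] = s * (w0 - w1)
    return x

x = np.random.default_rng(0).standard_normal(16)
print("distortion term == 2:", np.allclose(distortion, 2.0))
print("aliasing term == 0:  ", np.allclose(aliasing, 0.0))
print("perfect reconstruction:", np.allclose(synthesis(*analysis(x)), x))
```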
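Define the modulated signal and the subsampled signal of $x(n)$, respectively, as

$$x_m(n) = (-1)^n x(n), \qquad x_s(n) = \frac{1}{2}\big[x(n) + x_m(n)\big] = \begin{cases} x(n) & \text{for even } n, \\ 0 & \text{for odd } n. \end{cases}$$

To derive an explicit filterbank representation of $x_s(n)$, we are interested in characterizing the relationship between the filterbank coefficients of $x(n)$ and $x_m(n)$. Let $w_0^{x_m}(n)$ and $w_1^{x_m}(n)$ be the approximation and detail coefficients of the one-level filterbank decomposition of $(-1)^n x(n)$. Then, substituting Equation 9.7, we obtain

$$\begin{aligned} \hat{w}_0^{x_m}(\omega) &= \frac{1}{2}\Big[\hat{h}_0\big(\tfrac{\omega}{2}\big)\hat{x}\big(\tfrac{\omega}{2} - \pi\big) + \hat{h}_0\big(\tfrac{\omega}{2} - \pi\big)\hat{x}\big(\tfrac{\omega}{2}\big)\Big] \\ &= \frac{1}{2}\Big[e^{-jm\frac{\omega}{2}}\hat{h}_1\big(-\tfrac{\omega}{2} - \pi\big)\hat{x}\big(\tfrac{\omega}{2} - \pi\big) + e^{-jm(\frac{\omega}{2} - \pi)}\hat{h}_1\big(-\tfrac{\omega}{2}\big)\hat{x}\big(\tfrac{\omega}{2}\big)\Big] \\ &= \frac{e^{-jm\frac{\omega}{2}}}{2}\Big[\hat{h}_1^*\big(\tfrac{\omega}{2} - \pi\big)\hat{x}\big(\tfrac{\omega}{2} - \pi\big) - \hat{h}_1^*\big(\tfrac{\omega}{2}\big)\hat{x}\big(\tfrac{\omega}{2}\big)\Big], \\ \hat{w}_1^{x_m}(\omega) &= \frac{1}{2}\Big[\hat{h}_1\big(\tfrac{\omega}{2}\big)\hat{x}\big(\tfrac{\omega}{2} - \pi\big) + \hat{h}_1\big(\tfrac{\omega}{2} - \pi\big)\hat{x}\big(\tfrac{\omega}{2}\big)\Big] \\ &= \frac{1}{2}\Big[-e^{-jm\frac{\omega}{2}}\hat{h}_0\big(-\tfrac{\omega}{2} - \pi\big)\hat{x}\big(\tfrac{\omega}{2} - \pi\big) - e^{-jm(\frac{\omega}{2} - \pi)}\hat{h}_0\big(-\tfrac{\omega}{2}\big)\hat{x}\big(\tfrac{\omega}{2}\big)\Big] \\ &= \frac{e^{-jm\frac{\omega}{2}}}{2}\Big[-\hat{h}_0^*\big(\tfrac{\omega}{2} - \pi\big)\hat{x}\big(\tfrac{\omega}{2} - \pi\big) + \hat{h}_0^*\big(\tfrac{\omega}{2}\big)\hat{x}\big(\tfrac{\omega}{2}\big)\Big], \end{aligned}$$

where $m$ is an odd integer and $^*$ denotes complex conjugation. A subtle but important detail of the equations above concerns the time-reversed filters. Suppose the approximation and detail coefficients of $x(n)$ were computed using $h_0(-n-m)$ and $h_1(-n-m)$ instead of $h_0(n)$ and $h_1(n)$. These coefficients would then behave exactly like the detail ($w_1^{x_m}(n)$) and approximation ($w_0^{x_m}(n)$) coefficients of $(-1)^n x(n)$, respectively; note the reversed ordering of detail and approximation. It is straightforward to verify that if $\{h_0(n), h_1(n)\}$ form a perfect reconstruction filterbank, then $\{h_0(-n-m), h_1(-n-m)\}$ constitute a legitimate perfect reconstruction filterbank as well. We will refer to the latter as the time-reversed filterbank. The reversal of coefficients is illustrated in Figure 9.6, where the systems in Figure 9.6a and Figure 9.6b are equivalent.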
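Anticipating the Haar restriction with $m = 1$ adopted next, the swap of approximation and detail under modulation can be confirmed directly, together with the one-level case of Equation 9.8. The NumPy sketch below assumes the orthonormal Haar convention $w_0(n) = (x(2n) + x(2n+1))/\sqrt{2}$ and $w_1(n) = (x(2n) - x(2n+1))/\sqrt{2}$. The helper name is hypothetical.

```python
import numpy as np

def haar_analysis(x):
    """One-level orthonormal Haar split of an even-length 1-D signal."""
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)

rng = np.random.default_rng(0)
x = rng.standard_normal(32)
n = np.arange(len(x))
x_m = (-1.0) ** n * x                    # modulated signal
x_s = 0.5 * (x + x_m)                    # keeps even samples, zeros odd ones

w0, w1 = haar_analysis(x)
w0_m, w1_m = haar_analysis(x_m)
w0_s, w1_s = haar_analysis(x_s)

# Modulation swaps the roles of approximation and detail ...
print("w0^{x_m} == w1^x:", np.allclose(w0_m, w1))
print("w1^{x_m} == w0^x:", np.allclose(w1_m, w0))
# ... so subsampling folds the two channels onto each other (Equation 9.8, I = 1).
print("w0^{x_s} == w1^{x_s} == (w0 + w1)/2:",
      np.allclose(w0_s, 0.5 * (w0 + w1)) and np.allclose(w1_s, w0_s))
```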
Restricting our attention to the Haar decomposition for the rest of the discussion and fixing $m = 1$, we have $h_0(n) = h_0(-n-1)$ and $h_1(n) = -h_1(-n-1)$. By construction, the approximation coefficient of $(-1)^n x(n)$ is then exactly equal to the detail coefficient of $x(n)$, and vice versa; i.e., $w_0^{x_m}(n) = w_1^x(n)$ and $w_1^{x_m}(n) = w_0^x(n)$. It follows that the multilevel filterbank decomposition of $(-1)^n x(n)$ is equivalent to the time-reversed filterbank decomposition of $x(n)$, but with the low-to-high frequency ordering of the coefficients reversed. This reversed-order filterbank can be used to derive the filterbank representation of $x_s(n)$.

FIGURE 9.6 Two equivalent filterbanks for $x_m(n) = (-1)^n x(n)$: (a) filterbank transform of $x_m$, (b) reversed-order filterbank transform of $x$. Here, $*$ indicates time-reversed filter coefficients.

Specifically, let $w_0^{x_s}(n)$ and $w_1^{x_s}(n)$ be the approximation and detail coefficients of the one-level filterbank decomposition of $x_s(n)$. Then

$$\begin{aligned} w_0^{x_s}(n) &= w_0^{\frac{1}{2}(x + x_m)}(n) = \frac{1}{2}\big[w_0^x(n) + w_0^{x_m}(n)\big] = \frac{1}{2}\big[w_0^x(n) + w_1^x(n)\big], \\ w_1^{x_s}(n) &= w_1^{\frac{1}{2}(x + x_m)}(n) = \frac{1}{2}\big[w_1^x(n) + w_1^{x_m}(n)\big] = \frac{1}{2}\big[w_1^x(n) + w_0^x(n)\big] = w_0^{x_s}(n). \end{aligned}$$

Now, update the definition of $w_i^x$ to mean the $i$-th subband of an $(I+1)$-subband filterbank decomposition. Then, by recursion, we have the general form

$$w_i^{x_s}(n) = \frac{1}{2}\big[w_i^x(n) + w_{I-i}^x(n)\big]. \tag{9.8}$$

Also see Figure 9.7. Equation 9.8 should not come as a surprise, as it is analogous to Fourier-domain aliasing, where the high-frequency component is folded onto the low. A similar analysis of $x_s$ can be performed for non-Haar decompositions, but it is omitted here for simplicity.

FIGURE 9.7 Two equivalent filterbanks for $x_s = \frac{1}{2}(x + x_m)$, up to a multiplicative constant 2: (a) filterbank transform of $x_s$, and (b) ordinary and reversed-order filterbank transforms of $x$. Here, we assume the Haar decomposition.

Extending to two-dimensional signals, let us derive the decomposition of the CFA image in the separable wavelet packet domain. Let $w_i^\ell(n)$, $w_i^\alpha(n)$, and $w_i^\beta(n)$ be the filterbank coefficients of $\ell(n)$, $\alpha(n)$, and $\beta(n)$, respectively. Here $i = [i_0, i_1]^T \in \{0, 1, \ldots, I\}^2$ indexes the horizontal and the vertical filterbank channels, respectively. As before, assume without loss of generality that $\mathbf{c}(0,0) = [1,0,0]^T$. In order to apply the filterbank analysis to the sensor data, we rewrite $y(n)$ in the following manner:

$$\begin{aligned} y(n) &= x_2(n) + c_1(n)\alpha(n) + c_3(n)\beta(n) \\ &= x_2(n) + \frac{1 + (-1)^{n_0} + (-1)^{n_1} + (-1)^{n_0+n_1}}{4}\,\alpha(n) + \frac{1 + (-1)^{n_0+1} + (-1)^{n_1+1} + (-1)^{n_0+n_1}}{4}\,\beta(n). \end{aligned}$$

Its corresponding filterbank representation is

$$\begin{aligned} w_i^y(n) &= w_i^{x_2}(n) + \frac{1}{4}\Big[w_i^\alpha(n) + w_{(i_0, I-i_1)}^\alpha(n) + w_{(I-i_0, i_1)}^\alpha(n) + w_{(I-i_0, I-i_1)}^\alpha(n)\Big] \\ &\quad + \frac{1}{4}\Big[w_i^\beta(n) - w_{(i_0, I-i_1)}^\beta(n) - w_{(I-i_0, i_1)}^\beta(n) + w_{(I-i_0, I-i_1)}^\beta(n)\Big] \\ &= w_i^\ell(n) + \frac{1}{4}\Big[w_{(i_0, I-i_1)}^\alpha(n) + w_{(I-i_0, i_1)}^\alpha(n) + w_{(I-i_0, I-i_1)}^\alpha(n)\Big] \\ &\quad + \frac{1}{4}\Big[-w_{(i_0, I-i_1)}^\beta(n) - w_{(I-i_0, i_1)}^\beta(n) + w_{(I-i_0, I-i_1)}^\beta(n)\Big], \end{aligned}$$

where the minus signs in some $w^\beta$ terms occur due to translation in space, and $w^\ell(n)$ denotes the filterbank coefficients of the luminance signal in Equation 9.5. As argued in the previous section, the difference images are globally bandlimited. Hence $w_i^\alpha(n) \approx 0$ and $w_i^\beta(n) \approx 0$ for all $i_0 > \hat{I}$ or $i_1 > \hat{I}$, for some $\hat{I}$. The above then simplifies to a form that reveals the energy compaction structure within the CFA image:

$$w_i^y(n) \approx \begin{cases} w_i^\ell(n) + \big[w_{(I-i_0, i_1)}^\alpha(n) - w_{(I-i_0, i_1)}^\beta(n)\big]/4 & \text{if } I - i_0 < \hat{I},\ i_1 < \hat{I}, \\ w_i^\ell(n) + \big[w_{(i_0, I-i_1)}^\alpha(n) - w_{(i_0, I-i_1)}^\beta(n)\big]/4 & \text{if } i_0 < \hat{I},\ I - i_1 < \hat{I}, \\ w_i^\ell(n) + \big[w_{(I-i_0, I-i_1)}^\alpha(n) + w_{(I-i_0, I-i_1)}^\beta(n)\big]/4 & \text{if } I - i_0 < \hat{I},\ I - i_1 < \hat{I}, \\ w_i^\ell(n) & \text{otherwise.} \end{cases} \tag{9.9}$$

Recall Equation 9.2, and recall that filterbank transforms with appropriate choices of filters constitute a unitary transform.
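For a one-level separable Haar decomposition ($I = 1$), the exact filterbank representation of the sensor data can be verified numerically. This is the identity for $w_i^y$ before the bandlimitedness approximation of Equation 9.9. The NumPy sketch below is a minimal check under stated assumptions. Array axis 0 is $n_0$, and the subband index $i_0$ refers to that axis. The Haar split uses the orthonormal convention $w_0 = (x(2n) + x(2n+1))/\sqrt{2}$ and $w_1 = (x(2n) - x(2n+1))/\sqrt{2}$. The Bayer layout has $\mathbf{c}(0,0) = [1,0,0]^T$. The helper names are hypothetical.

```python
import numpy as np

def haar_1d(x, axis):
    """One-level orthonormal Haar split along one axis: (approximation, detail)."""
    even = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)
    odd = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)

def haar_2d(x):
    """Return {(i0, i1): subband} for a one-level separable Haar transform."""
    bands = {}
    for i0, a in enumerate(haar_1d(x, axis=0)):
        for i1, b in enumerate(haar_1d(a, axis=1)):
            bands[(i0, i1)] = b
    return bands

N, I = 16, 1
rng = np.random.default_rng(0)
x1, x2, x3 = rng.uniform(0.0, 1.0, size=(3, N, N))
n0, n1 = np.indices((N, N))
c1 = ((n0 % 2 == 0) & (n1 % 2 == 0)).astype(float)   # red
c3 = ((n0 % 2 == 1) & (n1 % 2 == 1)).astype(float)   # blue
alpha, beta = x1 - x2, x3 - x2
lum = x2 + 0.25 * alpha + 0.25 * beta
y = x2 + c1 * alpha + c3 * beta

wy, wl, wa, wb = haar_2d(y), haar_2d(lum), haar_2d(alpha), haar_2d(beta)

def predicted(i0, i1):
    """Right-hand side of the exact filterbank representation of y."""
    return (wl[(i0, i1)]
            + 0.25 * (wa[(i0, I - i1)] + wa[(I - i0, i1)] + wa[(I - i0, I - i1)])
            + 0.25 * (-wb[(i0, I - i1)] - wb[(I - i0, i1)] + wb[(I - i0, I - i1)]))

for i in wy:
    print(i, np.allclose(wy[i], predicted(*i)))
```

With natural images, most of the $w^\alpha$ and $w^\beta$ subbands would be near zero, which leaves the case structure of Equation 9.9.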
Thus, since the transform is unitary, $w_i^z(n) = w_i^y(n) + w_i^\varepsilon(n)$, giving

$$w_i^z(n) \approx \begin{cases} w_i^\ell(n) + \big[w_{(I-i_0, i_1)}^\alpha(n) - w_{(I-i_0, i_1)}^\beta(n)\big]/4 + w_i^\varepsilon(n) & \text{if } I - i_0 < \hat{I},\ i_1 < \hat{I}, \\ w_i^\ell(n) + \big[w_{(i_0, I-i_1)}^\alpha(n) - w_{(i_0, I-i_1)}^\beta(n)\big]/4 + w_i^\varepsilon(n) & \text{if } i_0 < \hat{I},\ I - i_1 < \hat{I}, \\ w_i^\ell(n) + \big[w_{(I-i_0, I-i_1)}^\alpha(n) + w_{(I-i_0, I-i_1)}^\beta(n)\big]/4 + w_i^\varepsilon(n) & \text{if } I - i_0 < \hat{I},\ I - i_1 < \hat{I}, \\ w_i^\ell(n) + w_i^\varepsilon(n) & \text{otherwise,} \end{cases} \tag{9.10}$$

where $w_i^\varepsilon \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_\varepsilon^2)$ is the filterbank transform of $\varepsilon(n)$. In other words, the filterbank transform of the noisy sensor data, $w^z$, is the baseband luminance coefficient $w^\ell$ distorted by the noise $w^\varepsilon$ and by aliasing due to the reversed-order filterbank coefficients $w^\alpha$ and $w^\beta$. Here $w^\ell$, $w^\alpha$, and $w^\beta$ are (conditionally) normal. A unified strategy for demosaicking and denoising, therefore, is to design an estimator that recovers $w^\ell$, $w^\alpha$, and $w^\beta$ from the mixture of $w^\ell$, $w^\alpha$, $w^\beta$, and $w^\varepsilon$. We will see how this can be accomplished in Section 9.7. Lastly, we remind readers that Equation 9.10 can be generalized to any filterbank that satisfies Equation 9.7 by using time-reversed filter coefficients for $h_0$ and $h_1$. However, Haar wavelets are used exclusively in this chapter to simplify the notation.

9.5 Constrained Filtering

In this section, we motivate an approach to joint demosaicking and denoising using well-understood DSP machinery [40]. Recall Equations 9.3 and 9.4. We are interested in estimating $\mathbf{x}(n)$ given $z(\cdot)$. Consider a location $n$ where $z(n)$ is an observation of, for example, a red pixel $x_1(n)$. Even then, $z(n)$ does not suffice as an estimate of $x_1(n)$, because it is contaminated by noise; this differs from pure demosaicking problems. We begin by highlighting monochromatic image denoising methods that operate by taking a linear combination of neighboring pixels. These methods include bilateral filters [28], principal components [31], and total least squares based methods [32], where the linear weights adapt to the local image features.
Transform-based shrinkage and thresholding methods can also be reinterpreted as spatially varying linear estimators, because there exists a linear combination of neighboring pixels that is equivalent to the shrinkage of transform coefficients. In the Bayesian estimation framework, the estimator is (conditionally) linear for (mixtures of) normally distributed transform coefficients. In any case, the locally adaptive linear estimator, $x^{est}$, takes the general form

$$x^{est}(n) = \sum_{m \in \eta(n)} g(n, m)\, z(n - m),$$

where $z(n)$ is the noisy version of $x(n)$, $g(n, m)$ are the spatially adaptive linear weights, and the summation is over $\eta(n)$, a local neighborhood of pixels centered around $n$. Typically, we choose $g(n, m)$ to solve the least-squares minimization problem (though not necessarily [32])

$$\min_{g} E\Big[\big\|x(n) - x^{est}(n)\big\|^2\Big]. \tag{9.11}$$

In this section, we show how an estimator of the above form can be modified so that the linear weights simultaneously interpolate and denoise CFA data [40]. Let $x^{est}(n)$ be an estimate of the ideal color image $x(n)$ obtained by taking a linear combination of the noisy sensor data $z(n)$; that is,

$$x^{est}(n) = \sum_{m \in \eta(n)} g(n, m)\, z(n - m), \tag{9.12}$$

where $g(n, m) \in \mathbb{R}^3$ is a spatially adaptive linear weight. Let $g_k(n, m)$ denote the linear weight for estimating $x_k$. In the following discussion, we focus on the estimation of $x_2(n)$ via the design of $g_2(n, \cdot)$, because Equation 9.3 already takes $x_2(n)$ as its baseband. The results are generalized to the estimation of $x_1(n)$ and $x_3(n)$ at the end of this section. Substituting Equation 9.6 into Equation 9.12,

$$
\begin{aligned}
x_2^{est}(n) &= \sum_{m \in \eta(n)} g_2(n, m)\, z(n - m) \\
&= \sum_{m \in \eta(n)} g_2(n, m)\big[x_2(n - m) + \varepsilon(n - m)\big] + g_2(n, m)\big[c_1(n - m)\alpha(n - m) + c_3(n - m)\beta(n - m)\big].
\end{aligned}
\tag{9.13}
$$

The first term, $\sum_m g_2(n, m)[x_2(n - m) + \varepsilon(n - m)]$, represents ordinary monochromatic image denoising; that is, $g_2(\cdot, \cdot)$ operates on the noisy version of $x_2(\cdot)$.
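To see the two terms of Equation 9.13 separately, the sketch below applies a fixed 3 × 3 box filter at a red site of a simulated Bayer mosaic whose difference images are locally constant. The box filter is a deliberately naive choice of $g_2$, not one of the chapter's adaptive designs. The helper names and test values are hypothetical.

```python
import random


def c1(n0, n1):  # red sites, red at (0, 0) as in the text
    return 1.0 if n0 % 2 == 0 and n1 % 2 == 0 else 0.0


def c3(n0, n1):  # blue sites
    return 1.0 if n0 % 2 == 1 and n1 % 2 == 1 else 0.0


def apply_at(g2, img, n):
    """sum_m g2(n, m) img(n - m) for weights given as {(m0, m1): weight}."""
    return sum(w * img[n[0] - m0][n[1] - m1] for (m0, m1), w in g2.items())


def split_eq_9_13(g2, x2, alpha, beta, eps, n):
    """Return (x2_est, denoising term, aliasing term) of Equation 9.13 at pixel n."""
    rows, cols = range(len(x2)), range(len(x2[0]))
    noisy_x2 = [[x2[a][b] + eps[a][b] for b in cols] for a in rows]
    aliased = [[c1(a, b) * alpha[a][b] + c3(a, b) * beta[a][b] for b in cols] for a in rows]
    z = [[noisy_x2[a][b] + aliased[a][b] for b in cols] for a in rows]  # CFA data
    return apply_at(g2, z, n), apply_at(g2, noisy_x2, n), apply_at(g2, aliased, n)


random.seed(1)
N = 6
x2 = [[random.uniform(0.2, 0.8) for _ in range(N)] for _ in range(N)]
eps = [[random.gauss(0.0, 0.05) for _ in range(N)] for _ in range(N)]
alpha = [[0.3] * N for _ in range(N)]   # locally constant difference images
beta = [[-0.2] * N for _ in range(N)]
box = {(m0, m1): 1.0 / 9.0 for m0 in (-1, 0, 1) for m1 in (-1, 0, 1)}  # plain smoother

est, denoise, alias = split_eq_9_13(box, x2, alpha, beta, eps, (2, 2))  # red site
print(f"x2_est = {est:.4f} = {denoise:.4f} (denoising) + {alias:.4f} (aliasing)")
```

With constant $\alpha$ and $\beta$, the aliasing term equals $\alpha\sum_m g_2 c_1 + \beta\sum_m g_2 c_3 = (0.3 - 4 \cdot 0.2)/9$ for this box filter. An ordinary smoother therefore leaks the difference images into the green estimate.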
The extra term involving $\alpha(\cdot)$ and $\beta(\cdot)$ motivates further restricting $g_2(\cdot, \cdot)$ so that this term is attenuated. To accomplish this, recall that $c_1(n)\alpha(n)$ and $c_3(n)\beta(n)$ occupy frequency regions around $\omega \in \{(0, 0), (0, \pi), (\pi, 0), (\pi, \pi)\}$. Let us consider a class of linear filters with stopbands near $\{(0, 0), (0, \pi), (\pi, 0), (\pi, \pi)\}$ (i.e., band-pass). In particular, if the filter coefficients corresponding to the red and the blue samples in the CFA each sum to zero, then for all $n$,

$$\sum_{m \in \eta(n)} g_2(n, m)\, c_1(n - m) = 0, \qquad \sum_{m \in \eta(n)} g_2(n, m)\, c_3(n - m) = 0, \tag{9.14}$$

and with a finite spatial support of $g_2(n, \cdot)$, we can safely assume that the frequency components in the vicinity of $\{(0, 0), (0, \pi), (\pi, 0), (\pi, \pi)\}$ are attenuated as well (because $\hat g_2$ is a linear combination of cosines in the Fourier domain). If the restriction in Equation 9.14 holds, then the estimator in Equation 9.13 reduces to a monochromatic image denoising problem; that is, $x_2^{est} \approx \sum_m g_2(n, m)[x_2(n - m) + \varepsilon(n - m)]$. Therefore, the underlying strategy for deriving a joint demosaicking and denoising operator is to solve a constrained linear estimation problem. In other words, instead of Equation 9.11, solve

$$
J = \min_{g_2} E\Big[\Big(x_2(n) - \sum_{m \in \eta(n)} g_2(n, m)\big[x_2(n - m) + \varepsilon(n - m)\big]\Big)^2\Big]
\quad \text{subject to} \quad \sum_m g_2(n, m)\, c_1(n - m) = \sum_m g_2(n, m)\, c_3(n - m) = 0.
\tag{9.15}
$$

Conveniently, this optimization problem allows us to proceed as though we were designing a monochromatic image denoising method. However, the constraints on the filter coefficients ensure that $J$ remains a good approximation to the residual of the actual problem, $|x_2(n) - x_2^{est}(n)|^2$. Note that Equation 9.14 does not imply $\sum_m g_2(n, m)\, c_2(n - m) = 0$. Instead, the contributions from $x_1$ and $x_3$ to the estimate of $x_2$ are limited to the frequency components in the band-pass region, whereas the contributions from $x_2$ are unrestricted.
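The constraints in Equation 9.14 are easy to check for a concrete set of weights. The sketch below uses hypothetical weights for estimating $x_2$ at a red site, not a filter taken from this chapter: an average of the four green neighbors plus a correction on the red samples whose taps sum to zero, with no blue taps. It verifies Equation 9.14 and then confirms that smooth (here linear) difference images leave no aliasing term, while a linear $x_2$ passes through unchanged.

```python
def c1(n0, n1):  # red sites, red at (0, 0)
    return 1.0 if n0 % 2 == 0 and n1 % 2 == 0 else 0.0


def c3(n0, n1):  # blue sites
    return 1.0 if n0 % 2 == 1 and n1 % 2 == 1 else 0.0


# Hypothetical weights g2(n, m) for a red site n: the four green taps sum to 1,
# the red taps (+1/2 at the centre, -1/8 at distance 2) sum to 0, no blue taps.
G2_RED_SITE = {(1, 0): 0.25, (-1, 0): 0.25, (0, 1): 0.25, (0, -1): 0.25,
               (0, 0): 0.5, (2, 0): -0.125, (-2, 0): -0.125,
               (0, 2): -0.125, (0, -2): -0.125}


def weighted_sum(g2, f, n):
    """sum_m g2(n, m) f(n - m) for a signal given as a function f(n0, n1)."""
    return sum(w * f(n[0] - m0, n[1] - m1) for (m0, m1), w in g2.items())


def constraint_sums(g2, n):
    """Left-hand sides of Equation 9.14 at pixel n."""
    return weighted_sum(g2, c1, n), weighted_sum(g2, c3, n)


def alpha(a, b):  # smooth (linear) difference image x1 - x2
    return 0.3 + 0.02 * a - 0.01 * b


def beta(a, b):  # smooth (linear) difference image x3 - x2
    return -0.2 + 0.01 * a + 0.03 * b


def x2(a, b):  # smooth green channel
    return 0.5 + 0.03 * a + 0.01 * b


def cfa_aliasing(a, b):  # c1 * alpha + c3 * beta, the second term of Equation 9.13
    return c1(a, b) * alpha(a, b) + c3(a, b) * beta(a, b)


n = (4, 6)  # a red site
aliasing = weighted_sum(G2_RED_SITE, cfa_aliasing, n)
green_out = weighted_sum(G2_RED_SITE, x2, n)
print(constraint_sums(G2_RED_SITE, n), aliasing, green_out, x2(*n))
```

The green taps sum to one, so $\sum_m g_2(n, m)\, c_2(n - m) = 1$ rather than 0, consistent with the remark that the contributions from $x_2$ are unrestricted.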
In many cases, existing image denoising techniques extend naturally to solving the demosaicking and denoising problems simultaneously. Let $\mathbf{x}$ (and similarly $\boldsymbol{\varepsilon}$, $\mathbf{z}$, $\mathbf{g}$) be a rearrangement of $\{x(n - m) \mid m \in \eta(n)\}$ into vector form. Then the least-squares solution to Equation 9.11 often involves an inner product of the form $x^{est}(n) = \mathbf{g}_{LS}^T(\mathbf{x} + \boldsymbol{\varepsilon})$, where

$$\mathbf{g}_{LS} = E\big[(\mathbf{x} + \boldsymbol{\varepsilon})(\mathbf{x} + \boldsymbol{\varepsilon})^T\big]^{-1} E\big[(\mathbf{x} + \boldsymbol{\varepsilon})\, x(n)\big]. \tag{9.16}$$

This inner product occurs often in Bayesian estimators when the prior on the data is (conditionally) normal (e.g., Laplace, Student's t, Gaussian mixture). If this prior on $\mathbf{x}$ is defined in a linear transform domain (such as on the wavelet coefficients), then the equivalent second-order statistics in the pixel domain are simply a linear transformation of the statistics in the transform domain.

Let $M = |\eta(n)|$ be the size of the neighborhood $\eta(n)$. The band-pass constraint in Equation 9.14 may be imposed by requiring that $\mathbf{g} \in \mathbb{R}^M$ lie in the lower-dimensional subspace $\{\mathbf{v} \in \mathbb{R}^M \mid \mathbf{v}^T\mathbf{c}_1 = \mathbf{v}^T\mathbf{c}_3 = 0\}$, that is, $\mathbf{g} = G\mathbf{s}$, where $G \in \mathbb{R}^{M \times (M-2)}$ is a matrix with orthonormal columns spanning this subspace. Then the constrained LS problem in Equation 9.15 can be rewritten as

$$J = \min_{\mathbf{g} = G\mathbf{s}} E\Big[\big\|x_2(n) - \mathbf{g}^T(\mathbf{x}_2 + \boldsymbol{\varepsilon})\big\|^2\Big] = \min_{\mathbf{s}} E\Big[\big\|x_2(n) - \mathbf{s}^T G^T(\mathbf{x}_2 + \boldsymbol{\varepsilon})\big\|^2\Big]. \tag{9.17}$$

It is easy to verify that the solution to the above has the form $x_2^{est}(n) = \mathbf{g}_{CLS}^T\mathbf{z}$, where

$$\mathbf{g}_{CLS} = G\Big(G^T E\big[(\mathbf{x}_2 + \boldsymbol{\varepsilon})(\mathbf{x}_2 + \boldsymbol{\varepsilon})^T\big] G\Big)^{-1} G^T E\big[(\mathbf{x}_2 + \boldsymbol{\varepsilon})\, x_2(n)\big]. \tag{9.18}$$

Note that Equations 9.17 and 9.18 are minor alterations of Equations 9.11 and 9.16, respectively, that use the same second-order statistics; thus it is a straightforward exercise to turn existing monochromatic image denoising methods into a joint demosaicking and denoising scheme.
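Equation 9.18 can be evaluated directly once second-order statistics are available. The stdlib-only sketch below assumes a 3 × 3 neighborhood centered on a red site, a separable exponential correlation $\rho^{|\Delta_0| + |\Delta_1|}$ for $x_2$ with $\rho = 0.9$, and white noise of variance 0.1. These statistics are illustrative assumptions, not the chapter's models. $G$ is built by Gram-Schmidt against the constraint vectors $\mathbf{c}_1$ and $\mathbf{c}_3$, and all function names are hypothetical.

```python
import math

OFFSETS = [(m0, m1) for m0 in (-1, 0, 1) for m1 in (-1, 0, 1)]  # 3x3 eta(n)


def transpose(A):
    return [list(col) for col in zip(*A)]


def mat_mul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def mat_vec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [[float(a) for a in row] + [float(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def null_space_basis(constraints, size, tol=1e-10):
    """Columns of G: orthonormal basis of {v : v . c = 0 for every c in constraints}."""
    ortho, free = [], []
    unit = [[1.0 if k == j else 0.0 for k in range(size)] for j in range(size)]
    for idx, v in enumerate([list(map(float, c)) for c in constraints] + unit):
        for q in ortho:  # Gram-Schmidt against every vector accepted so far
            d = sum(a * b for a, b in zip(v, q))
            v = [a - d * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > tol:
            ortho.append([a / norm for a in v])
            if idx >= len(constraints):
                free.append(ortho[-1])
    return transpose(free)  # size x (size - number of independent constraints)


def cls_weights(R, r, constraints):
    """Equation 9.18: g = G (G^T R G)^{-1} G^T r."""
    G = null_space_basis(constraints, len(r))
    Gt = transpose(G)
    s = solve(mat_mul(mat_mul(Gt, R), G), mat_vec(Gt, r))
    return mat_vec(G, s)


def mse(g, R, r, var_x):
    """E[(x2(n) - g^T (x2 + eps))^2] = var_x - 2 g^T r + g^T R g."""
    return var_x - 2 * sum(a * b for a, b in zip(g, r)) + sum(
        a * b for a, b in zip(g, mat_vec(R, g)))


# Assumed statistics: corr(x2(p), x2(q)) = rho^(|dp0| + |dp1|), white noise.
rho, noise_var = 0.9, 0.1
Rx = [[rho ** (abs(p[0] - q[0]) + abs(p[1] - q[1])) for q in OFFSETS] for p in OFFSETS]
R = [[Rx[i][j] + (noise_var if i == j else 0.0) for j in range(9)] for i in range(9)]
r = [Rx[i][OFFSETS.index((0, 0))] for i in range(9)]

# Red at the centre of eta(n): red sample at m = (0, 0), blue samples at the corners.
c1 = [1.0 if m == (0, 0) else 0.0 for m in OFFSETS]
c3 = [1.0 if m[0] != 0 and m[1] != 0 else 0.0 for m in OFFSETS]

g_ls = solve(R, r)                   # unconstrained weights, Equation 9.16
g_cls = cls_weights(R, r, [c1, c3])  # constrained weights, Equation 9.18
print(mse(g_ls, R, r, 1.0), mse(g_cls, R, r, 1.0))
```

The constrained error is never smaller than the unconstrained one evaluated on the same $(\mathbf{x}_2 + \boldsymbol{\varepsilon})$ statistics. Only the constrained weights, however, keep the $\alpha$ and $\beta$ terms of Equation 9.13 out of the estimate when they are applied to CFA data $\mathbf{z}$.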
To design spatially adaptive filters similar to Equation 9.18 for estimating $x_1$ and $x_3$, note that Equation 9.3 can alternatively be written as

$$
\begin{aligned}
y(n) &= c_2(n)[x_2(n) - x_1(n)] + c_3(n)[x_3(n) - x_1(n)] + x_1(n) \\
&= c_1(n)[x_1(n) - x_3(n)] + c_2(n)[x_2(n) - x_3(n)] + x_3(n).
\end{aligned}
$$

It follows that the appropriate constraints on the filter coefficients $g_1(\cdot, \cdot)$ and $g_3(\cdot, \cdot)$ are

$$
\begin{aligned}
\sum_{m \in \eta(n)} g_1(n, m)\, c_2(n - m) &= 0, & \sum_{m \in \eta(n)} g_1(n, m)\, c_3(n - m) &= 0, \\
\sum_{m \in \eta(n)} g_3(n, m)\, c_1(n - m) &= 0, & \sum_{m \in \eta(n)} g_3(n, m)\, c_2(n - m) &= 0.
\end{aligned}
$$

9.6 Missing Data

The statistical modeling of image signals in a linear transform domain is primarily motivated by the correlation structures that exist within the transform coefficients of image signals. These models, which require a complete observation of the image data, do not generalize easily to the digital camera context, because the observation of color image data is incomplete at the sensor interface. That is, processing with missing or incomplete pixels is difficult because a linear transformation takes a linear combination of the pixel values, and thus all of the noisy transform coefficients are unobserved. Yet it is still convenient or desirable to apply sophisticated statistical modeling techniques even when none of the transform coefficients are observable.

This section explicitly addresses the issue of combining the treatment of missing data with wavelet-based modeling [49]. Bayesian hierarchical modeling is used to capture the second-order statistics in the transform domain. We assume a general model form and couple the EM algorithm framework with the Bayesian models to estimate the hyper- and nuisance parameters via the marginal likelihood; that is, we adopt the empirical partial Bayes approach. Within this framework, problems with missing pixels or pixel components, and hence unobservable wavelet coefficients, are handled simultaneously with image denoising.
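The mechanics of EM with missing pixels can be seen in a deliberately reduced scalar model. This model is an assumption made for illustration, not the chapter's wavelet-domain model: i.i.d. zero-mean Gaussian pixels with unknown variance `s2`, observed with Gaussian noise of known variance, with roughly 60% of the samples missing. The noise variance is held fixed because, in this scalar model, `s2` and the noise variance cannot be separated from the observed data, whereas the chapter's model estimates both. The names are hypothetical.

```python
import random


def em_signal_variance(z, noise_var, s2=1.0, iters=200):
    """EM for s2 in x_k ~ N(0, s2), z_k = x_k + eps_k, eps_k ~ N(0, noise_var).

    Entries of z equal to None are missing (unobserved) pixels.
    """
    K = len(z)
    for _ in range(iters):
        gain = s2 / (s2 + noise_var)                  # posterior-mean (Wiener) factor
        post_var = s2 * noise_var / (s2 + noise_var)  # posterior variance of x_k
        # E-step: sum_k E[x_k^2 | z, s2]; a missing pixel contributes its prior moment s2.
        stat = sum(s2 if zk is None else (gain * zk) ** 2 + post_var for zk in z)
        # M-step: maximizer of the expected complete-data log-likelihood.
        s2 = stat / K
    return s2


random.seed(3)
true_s2, noise_var = 2.0, 0.5
z = []
for _ in range(10000):
    x = random.gauss(0.0, true_s2 ** 0.5)
    eps = random.gauss(0.0, noise_var ** 0.5)
    z.append(x + eps if random.random() < 0.4 else None)  # about 60% missing

observed = [v for v in z if v is not None]
mle = sum(v * v for v in observed) / len(observed) - noise_var  # closed form here
estimate = em_signal_variance(z, noise_var)
print(round(estimate, 4), round(mle, 4))
```

In this toy model the missing samples carry no information, so EM converges to the closed-form observed-data maximum-likelihood estimate. In the wavelet-domain model of this section, by contrast, each coefficient mixes observed and missing pixels, which is what makes the E-step nontrivial.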
In order to extend the complete-image modeling strategy to incomplete data, let $w_x^i(n) = [w_{x_1}^i(n), w_{x_2}^i(n), w_{x_3}^i(n)]^T$ denote the wavelet coefficients corresponding to $x_1$, $x_2$, $x_3$ at the $i$-th level, and assume that $w_x^i(n) \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \Sigma_i)$. Then the distribution of the neighboring pixel values is also jointly normal, because a linear transformation of a multivariate normal vector is also normal. Because the wavelet transform is unitary, $w_{x+\varepsilon}^i(n) \mid w_x^i(n) \overset{\text{i.i.d.}}{\sim} \mathcal{N}(w_x^i(n), \sigma^2 I)$. To summarize, $\theta = \{\Sigma_i, \sigma^2\}$ are the hyper- and nuisance parameters, respectively.

If $\theta$ is known, the regression of the missing pixels on the known clean pixel-component measurements $y(n)$, $E[x(n) \mid y(n), \theta]$, serves as a demosaicking method based on the LS estimator and has a straightforward implementation. Conditioned on the incomplete and noisy measurements of pixel components $z(n)$, $E[x(n) \mid z(n), \theta]$ is an interpolated and denoised image signal, where

$$x^{est}(n) = E[x(n) \mid z(n), \theta] = E\Big[E[x(n) \mid y(n), z(n), \theta] \,\Big|\, z(n), \theta\Big].$$

The nested expectation has an intuitive interpretation: the inner expectation, $E[x(n) \mid y(n), z(n), \theta] = E[x(n) \mid y(n), \theta]$, is an interpolator, and the outer expectation is the denoiser. Conversely, the same formula can equivalently be written as

$$x^{est}(n) = E[x(n) \mid z(n), \theta] = E\Big[E[x(n) \mid (x + \varepsilon)(n), z(n), \theta] \,\Big|\, z(n), \theta\Big],$$

where the inner expectation, $E[x(n) \mid (x + \varepsilon)(n), z(n), \theta] = E[x(n) \mid (x + \varepsilon)(n), \theta]$, is a denoiser, and the outer expectation is the interpolator. Conditioned on $\theta$, therefore, the design of a simultaneous demosaicking and denoising method is straightforward. The posterior mean estimate, $x^{est}$, is sensitive to the choice of the parameters $\theta$; and given only a subset of the noisy pixel components $z(n)$, we are left with estimating $\theta$ from the data even though the wavelet coefficients are not observable.
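Under a Gaussian model with known $\theta$, the two orderings of the nested expectation can be checked to give the same posterior mean. The sketch below is a toy stand-in for the chapter's model, not its implementation: a six-sample signal with an assumed exponential covariance `S` (playing the role of the prior implied by $\Sigma_i$), white noise of known variance, and four observed components. All names and numbers are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [[float(a) for a in row] + [float(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def interpolate(C, obs, v):
    """E[u | u[obs] = v] for zero-mean Gaussian u with covariance C (noise-free regression)."""
    a = solve([[C[i][j] for j in obs] for i in obs], v)
    return [sum(C[k][j] * aj for j, aj in zip(obs, a)) for k in range(len(C))]


def denoise_observed(S, obs, z, noise_var):
    """E[y | z] for the clean observed components y = x[obs], given z = y + noise."""
    a = solve([[S[i][j] + (noise_var if i == j else 0.0) for j in obs] for i in obs], z)
    return [sum(S[i][j] * aj for j, aj in zip(obs, a)) for i in obs]


def posterior_mean(S, obs, z, noise_var):
    """Direct form E[x | z] = S H^T (H S H^T + noise_var I)^{-1} z."""
    a = solve([[S[i][j] + (noise_var if i == j else 0.0) for j in obs] for i in obs], z)
    return [sum(S[k][j] * aj for j, aj in zip(obs, a)) for k in range(len(S))]


d, noise_var = 6, 0.2
S = [[0.8 ** abs(i - j) for j in range(d)] for i in range(d)]  # assumed prior covariance of x
obs = [0, 2, 3, 5]                                              # observed pixel components
z = [1.0, 0.4, 0.2, -0.5]                                       # their noisy values
C = [[S[i][j] + (noise_var if i == j else 0.0) for j in range(d)] for i in range(d)]  # cov of x + eps

direct = posterior_mean(S, obs, z, noise_var)
# Inner expectation = interpolator E[x | y], outer expectation = denoiser E[y | z].
via_interpolator_inner = interpolate(S, obs, denoise_observed(S, obs, z, noise_var))
# Inner expectation = denoiser S C^{-1}, outer expectation = interpolator of x + eps.
c_inv_u = solve(C, interpolate(C, obs, z))
via_denoiser_inner = [sum(S[k][j] * c_inv_u[j] for j in range(d)) for k in range(d)]
print([round(v, 4) for v in direct])
```

Because every conditional mean here is linear in its conditioning data, moving the outer expectation inside the inner estimator is exact, so either ordering yields the same estimate. The sketch takes $\theta$ as known; the text still has to estimate it.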
In particular, we solve for the $\theta$ that maximizes the marginal log-likelihood $\log p(z \mid \theta)$, and estimate $x$ as its posterior mean conditioned on $\hat\theta$, the maximum likelihood estimate obtained above. The direct maximization of $\log p(z \mid \theta)$ is very difficult because of the missing pixel values. The EM algorithm circumvents this problem by iteratively maximizing the much simpler augmented-data log-likelihood, $\log p(x, \varepsilon \mid \theta)$, where $\{x, \varepsilon\}$ are the augmented data. Given the $t$-th iterate of the hyper- and nuisance parameter estimates, $\theta^{[t]} = \{\Sigma_i^{[t]}, \sigma^{2[t]}\}$, the $(t+1)$-st iteration of the EM algorithm first computes

$$Q(\theta; \theta^{[t]}) = E\big[\log p(x, \varepsilon \mid \theta) \,\big|\, z, \theta^{[t]}\big].$$

A celebrated result on the EM algorithm [50] states that

$$\log p(z \mid \theta) - \log p(z \mid \theta^{[t]}) \ge Q(\theta; \theta^{[t]}) - Q(\theta^{[t]}; \theta^{[t]}),$$

where $\log p(z \mid \theta)$ is the log-likelihood of $\theta$ based on the actual observed data $z(n)$. Thus, the choice of $\theta$ that maximizes $Q(\theta; \theta^{[t]})$, that is, the next iterate $\theta^{[t+1]}$, increases our objective function: $\log p(z \mid \theta^{[t+1]}) \ge \log p(z \mid \theta^{[t]})$. Consequently, maximizing $Q(\theta; \theta^{[t]})$ serves the same purpose as maximizing $\log p(z \mid \theta)$, but with augmented-data sufficient statistics. Given the $t$-th iterate $\theta^{[t]}$, $Q(\theta; \theta^{[t]})$ has the closed form

$$Q(\theta; \theta^{[t]}) = E\big[\log p(x, \varepsilon \mid \theta) \,\big|\, z, \theta^{[t]}\big] = E\Big[\sum_{i,n} \log p\big(w_x^i(n) \mid \Sigma_i\big) + \log p\big(w_\varepsilon^i(n) \mid \sigma^2\big) \,\Big|\, z, \theta^{[t]}\Big]. \tag{9.19}$$

It is then easy to verify that the maximizer of $Q(\theta; \theta^{[t]})$ is the weighted least-squares estimate [50]:

$$\Sigma_i^{[t+1]} = \frac{1}{N_i} \sum_n E\big[w_x^i(n)\, w_x^{iT}(n) \,\big|\, z, \theta^{[t]}\big], \qquad \sigma^{2[t+1]} = \frac{1}{3\sum_i N_i} \sum_{i,n} E\big[w_\varepsilon^{iT}(n)\, w_\varepsilon^i(n) \,\big|\, z, \theta^{[t]}\big], \tag{9.20}$$

where $N_i$ is the number of wavelet samples in the $i$-th subband. In each iteration, the computation of the sufficient statistics in Equation 9.19 is often called the expectation step or E-step, whereas carrying out Equation 9.20 to find $\theta^{[t+1]}$ is referred to as the maximization step or M-step. Carrying out the math to find $E[w_x^i(n)\, w_x^{iT}(n) \mid z, \theta^{[t]}]$ and $E[w_\varepsilon^{iT}(n)\, w_\varepsilon^i(n) \mid z, \theta^{[t]}]$ in the E-step is rather cumbersome, and the derivation is omitted in this chapter. Interested readers are encouraged to refer to References [49] and [50] for more details.

As was the case in the previous section, it is worth noting that wavelet coefficients are often modeled with heavy-tailed distributions (e.g., Laplace, Student's t, Gaussian mixture). Many such heavy-tailed distributions, including the Laplace and Student's t, can be rewritten as a scale mixture of Gaussian random variables, and thus are conditionally Gaussian. The EM algorithm developed above therefore generalizes to heavy-tailed distributions via integration over the mixture variable in the posterior sense.

9.7 Filterbank Coefficient Estimation

The computational efficiency and elegance of shrinkage or thresholding estimators, together with theoretical properties that accommodate spatial inhomogeneities, have contributed to the immense popularity of wavelet-based methods for image denoising. However, typical denoising techniques assume a complete grayscale or color image observation, and hence must be applied after demosaicking. In the previous section, we showed that it is possible to model noisy color images in the wavelet domain directly by taking advantage of the statistical framework of missing data. However, the computational burden of doing so is severe, and the energy compaction arguments put forth in Section 9.4 suggest an alternative approach: working with the wavelet coefficients of the noisy subsampled data directly. In this section, we propose the changes to a complete-image wavelet coefficient model that make it amenable to the direct manipulation of $w_y^i(n)$ [17].

Given that the difference images are sufficiently low-pass, the simplification in Equation 9.9 reveals a surprising degree of similarity between $w_y^i(n)$ and $w_\ell^i(n)$. Specifically, $w_y^i(n) \approx w_\ell^i(n)$ for the majority of subbands; the exceptions are the subbands that are normally considered high-frequency, which now contain a strong mixture of the low-frequency (or scaling) coefficients of the difference images, $\alpha$ and $\beta$. Operating under the premise that the filterbank transform decomposes image signals such that subbands are approximately uncorrelated with each other, the posterior mean estimate of $w_\ell^i(n)$ takes the form

$$\{w_\ell^i\}^{est}(n) = E\big[w_\ell^i(n) \,\big|\, w_z^i\big] \approx E\big[w_\ell^i(n) \,\big|\, w_{\ell+\varepsilon}^i\big]$$

for all subbands that meet the approximation $w_y^i(n) \approx w_\ell^i(n)$. Since the function $f: \mathbb{R} \to \mathbb{R}$, $f(w_{\ell+\varepsilon}^i) = E[w_\ell^i(n) \mid w_{\ell+\varepsilon}^i]$, is a well-studied problem in the wavelet shrinkage literature, we can leverage existing image denoising methods in the CFA image context. In the simple special case where $w_\ell^i(n) \sim \mathcal{N}(0, \sigma_{\ell,i}^2)$, the $L_2$ estimator is

$$f(w_z^i) = \frac{\sigma_{\ell,i}^2}{\sigma_{\ell,i}^2 + \sigma_\varepsilon^2}\, w_z^i(n).$$

However, in the subbands that contain a mixture of $w_\ell^i$, $w_\alpha^i$, $w_\beta^i$, and $w_\varepsilon^i$, we must proceed with caution. Let $w_\alpha^i(n) \sim \mathcal{N}(0, \sigma_{\alpha,i}^2)$ and $w_\beta^i(n) \sim \mathcal{N}(0, \sigma_{\beta,i}^2)$. Consider the case $i_0 > I - \hat I$ and $i_1 < \hat I$, and define $j = (i_0, I - i_1)$ and $k = (I - i_0, I - i_1)$. Then $w_z^i(n)$, $w_z^j(n)$, and $w_z^k(n)$ are highly correlated because of the common components in their mixtures, $w_\alpha^{i'}$ and $w_\beta^{i'}$, where $i' = (I - i_0, i_1)$. Writing $\mathbf{w}_z(n) = [w_z^i(n), w_z^j(n), w_z^k(n)]^T$, the $L_2$ estimates of $w_\ell^i(n)$, $w_\ell^j(n)$, and $w_\ell^k(n)$ are

$$
\begin{bmatrix} \{w_\ell^i\}^{est}(n) \\ \{w_\ell^j\}^{est}(n) \\ \{w_\ell^k\}^{est}(n) \end{bmatrix}
= E\left[\begin{bmatrix} w_\ell^i(n) \\ w_\ell^j(n) \\ w_\ell^k(n) \end{bmatrix} \mathbf{w}_z^T(n)\right] E\big[\mathbf{w}_z(n)\mathbf{w}_z^T(n)\big]^{-1} \mathbf{w}_z(n)
= \begin{bmatrix} \sigma_{\ell,i}^2 & 0 & 0 \\ 0 & \sigma_{\ell,j}^2 & 0 \\ 0 & 0 & \sigma_{\ell,k}^2 \end{bmatrix} C^{-1}\, \mathbf{w}_z(n),
$$

where

$$
C = E\big[\mathbf{w}_z(n)\mathbf{w}_z^T(n)\big] =
\begin{bmatrix}
\sigma_{\ell,i}^2 + \sigma_\varepsilon^2 + \frac{\sigma_{\alpha,i'}^2 + \sigma_{\beta,i'}^2}{16} & \frac{\sigma_{\alpha,i'}^2 - \sigma_{\beta,i'}^2}{16} & \frac{\sigma_{\alpha,i'}^2 + \sigma_{\beta,i'}^2}{16} \\
\frac{\sigma_{\alpha,i'}^2 - \sigma_{\beta,i'}^2}{16} & \sigma_{\ell,j}^2 + \sigma_\varepsilon^2 + \frac{\sigma_{\alpha,i'}^2 + \sigma_{\beta,i'}^2}{16} & \frac{\sigma_{\alpha,i'}^2 - \sigma_{\beta,i'}^2}{16} \\
\frac{\sigma_{\alpha,i'}^2 + \sigma_{\beta,i'}^2}{16} & \frac{\sigma_{\alpha,i'}^2 - \sigma_{\beta,i'}^2}{16} & \sigma_{\ell,k}^2 + \sigma_\varepsilon^2 + \frac{\sigma_{\alpha,i'}^2 + \sigma_{\beta,i'}^2}{16}
\end{bmatrix}.
$$

Similarly,

$$
\begin{bmatrix} \{w_\alpha^{i'}\}^{est}(n) \\ \{w_\beta^{i'}\}^{est}(n) \end{bmatrix}
= E\left[\begin{bmatrix} w_\alpha^{i'}(n) \\ w_\beta^{i'}(n) \end{bmatrix} \mathbf{w}_z^T(n)\right] C^{-1}\, \mathbf{w}_z(n)
= \frac{1}{4}\begin{bmatrix} \sigma_{\alpha,i'}^2 & \sigma_{\alpha,i'}^2 & \sigma_{\alpha,i'}^2 \\ -\sigma_{\beta,i'}^2 & \sigma_{\beta,i'}^2 & -\sigma_{\beta,i'}^2 \end{bmatrix} C^{-1}\, \mathbf{w}_z(n).
$$

Once $\{w_\ell^i\}^{est}$, $\{w_\alpha^i\}^{est}$, and $\{w_\beta^i\}^{est}$ are computed for all $i$ and $n$ as above, $x^{est}(n)$ is calculated by taking the inverse filterbank transform of $\{w_\ell^i\}^{est}$, $\{w_\alpha^i\}^{est}$, and $\{w_\beta^i\}^{est}$ to find the estimates of $\ell(n)$, $\alpha(n)$, and $\beta(n)$, which in turn are used to solve for $x^{est}$.

In practice, the actual implementation of this method should include cycle-spinning, a standard technique in the filterbank and wavelet literature whereby a linear space-variant system is transformed into a linear space-invariant system by averaging over all possible spatial shifts. As in the previous sections, we note that the estimator extends naturally to multivariate normal or heavy-tailed distributions.

9.8 Conclusion

Given the inadequacies and model inconsistencies of treating the image denoising and demosaicking problems independently, we focused on the analysis of, and techniques for processing, subsampled data (see Figure 9.8 and Figure 9.9). In particular, the Fourier and filterbank (wavelet-packet) analyses reveal a systematic aliasing structure in CFA images, where the observed data consist of a mixture of the baseband luminance signal, spectrally shifted difference images, and noise. The same analysis motivates a unified strategy for the demosaicking and denoising estimation problems, which interprets the sensor data as a luminance image distorted by noise with some degree of structure.
Conditioned on the complete-observation image model of the digital camera designer's choosing, we proposed three design regimes for estimating the complete noise-free image signal of interest given a set of incomplete observations of pixel components that are corrupted by noise. First, well-understood DSP machinery was employed to design a spatially adaptive linear filter whose stopband contains the spectral copies of the difference images and whose passband suppresses noise. Second, coupling the EM algorithm framework with Bayesian models to estimate the hyper- and nuisance parameters via the marginal likelihood, and in turn adopting the empirical partial Bayes approach for estimating the ideal color image data, allowed us to apply heavy-tailed priors to the unobservable wavelet coefficients. Third, by exploiting the reversed-order filterbank structure, we simplified the regression of the luminance and difference-image filterbank coefficients on the CFA image filterbank coefficients. The above estimation techniques were derived using second-order statistics for complete observation models.

FIGURE 9.8 Reconstruction of the Peppers image given simulated noisy sensor data: (a) noise-free original color image, (b) color version of the simulated noisy sensor data, (c) estimated with the demosaicking method in Reference [7] and the denoising method in [23], (d) estimated with the approach in Section 9.5, (e) estimated with the approach in Section 9.6, and (f) estimated with the approach in Section 9.7.

FIGURE 9.9 Reconstruction of the Lena image given simulated noisy sensor data: (a) noise-free original color image, (b) color version of the simulated noisy sensor data, (c) estimated with the demosaicking method in Reference [7] and the denoising method in [23], (d) estimated with the approach in Section 9.5, (e) estimated with the approach in Section 9.6, and (f) estimated with the approach in Section 9.7.

Acknowledgments

The author would like to thank his wonderful collaborators, Dr. Thomas W. Parks in the Department of Electrical and Computer Science at Cornell University, Dr. Xiao-Li Meng in the Department of Statistics at Harvard University, and Dr. Patrick J. Wolfe in the School of Engineering and Applied Sciences at Harvard University, whose invaluable contributions to the works in References [17], [40], [42], and [49] are reflected in this chapter. His gratitude also extends to Dr. Bahadir Gunturk at Louisiana State University and Dr. Javier Portilla at Universidad de Granada for making their simulation code available, and to Daniel Rudoy, Ayan Chakrabarti, and Prabahan Basu at Harvard University for their constructive criticism.

References

[1] B.E. Bayer, “Color imaging array,” U.S. Patent 3 971 065, July 1976.
[2] R. Lukac and K.N. Plataniotis, “Color filter arrays: Design and performance analysis,” IEEE Transactions on Consumer Electronics, vol. 51, no. 4, pp. 1260–1267, November 2005.
[3] K. Parulski and K.E. Spaulding, “Color image processing for digital cameras,” in Digital Color Image Handbook, G. Sharma (ed.), Boca Raton, FL: CRC Press, 2002, pp. 727–757.
[4] S. Yamanaka, “Solid state camera,” U.S. Patent 4 054 906, November 1977.
[5] M. Parmar and S.J. Reeves, “A perceptually based design methodology for color filter arrays,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Canada, May 2004, vol. 3, pp. 473–476.
[6] K. Hirakawa and T.W. Parks, “Adaptive homogeneity-directed demosaicing algorithm,” IEEE Transactions on Image Processing, vol. 14, no. 3, pp. 360–369, March 2005.
[7] B.K. Gunturk, Y. Altunbasak, and R.M.
Mersereau, “Color plane interpolation using alternating projections,” IEEE Transactions on Image Processing, vol. 11, no. 9, pp. 997–1013, September 2002.
[8] X. Wu and N. Zhang, “Primary-consistent soft-decision color demosaicking for digital cameras,” IEEE Transactions on Image Processing, vol. 13, no. 9, pp. 1263–1274, September 2004.
[9] B.K. Gunturk, J. Glotzbach, Y. Altunbasak, R.W. Schafer, and R.M. Mersereau, “Demosaicking: Color filter array interpolation in single chip digital cameras,” IEEE Signal Processing Magazine, Special Issue on Color Image Processing, vol. 22, no. 1, pp. 44–54, January 2005.
[10] R. Lukac and K.N. Plataniotis, “Universal demosaicking for imaging pipelines with an RGB color filter array,” Pattern Recognition, vol. 38, no. 11, pp. 2208–2212, November 2005.
[11] D. Alleysson, S. Süsstrunk, and J. Herault, “Linear demosaicing inspired by the human visual system,” IEEE Transactions on Image Processing, vol. 14, no. 4, pp. 439–449, April 2005.
[12] L. Chang and Y.P. Tan, “Effective use of spatial and spectral correlations for color filter array demosaicking,” IEEE Transactions on Consumer Electronics, vol. 50, no. 2, pp. 355–365, May 2004.
[13] R. Kakarala and Z. Baharav, “Adaptive demosaicing with the principal vector method,” IEEE Transactions on Consumer Electronics, vol. 48, no. 4, pp. 932–937, November 2002.
[14] W. Lu and Y.P. Tan, “Color filter array demosaicking: New method and performance measures,” IEEE Transactions on Image Processing, vol. 12, no. 10, pp. 1194–1210, October 2003.
[15] D.D. Muresan and T.W. Parks, “Demosaicing using optimal recovery,” IEEE Transactions on Image Processing, vol. 14, no. 2, pp. 267–278, February 2005.
[16] R. Ramanath and W.E. Snyder, “Adaptive demosaicking,” Journal of Electronic Imaging, vol. 12, no. 4, pp. 633–642, October 2003.
[17] K. Hirakawa, X.L. Meng, and P.J.
Wolfe, “A framework for wavelet-based analysis and processing of color filter array images with applications to denoising and demosaicking,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, HI, USA, April 2007, vol. 1, pp. 597–600.
[18] M.S. Crouse, R.D. Nowak, and R.G. Baraniuk, “Bayesian tree-structured image modeling using wavelet-domain hidden Markov models,” IEEE Transactions on Image Processing, vol. 46, no. 7, pp. 1056–1068, July 1998.
[19] D.L. Donoho and I.M. Johnstone, “Ideal spatial adaptation via wavelet shrinkage,” Biometrika, vol. 81, no. 3, pp. 425–455, September 1994.
[20] M. Jansen and A. Bultheel, “Empirical Bayes approach to improve wavelet thresholding for image noise reduction,” Journal of the American Statistical Association, vol. 96, no. 454, pp. 629–639, June 2001.
[21] I.M. Johnstone and B.W. Silverman, “Wavelet threshold estimators for data with correlated noise,” Journal of the Royal Statistical Society, Series B, vol. 59, no. 2, pp. 319–351, 1997.
[22] J. Portilla, V. Strela, M.J. Wainwright, and E.P. Simoncelli, “Image denoising using scale mixture of Gaussians in the wavelet domain,” Tech. Rep. TR2002-831, Comput. Sci. Dept., Courant Inst. Math. Sci., New York Univ., 2002.
[23] J. Portilla, V. Strela, M.J. Wainwright, and E.P. Simoncelli, “Image denoising using scale mixtures of Gaussians in the wavelet domain,” IEEE Transactions on Image Processing, vol. 12, no. 11, pp. 1338–1351, November 2003.
[24] A. Pizurica, W. Philips, I. Lemahieu, and M. Acheroy, “A joint inter- and intrascale statistical model for Bayesian wavelet based image denoising,” IEEE Transactions on Image Processing, vol. 11, no. 5, pp. 545–557, May 2002.
[25] L. Sendur and I.W. Selesnick, “Bivariate shrinkage functions for wavelet-based denoising exploiting interscale dependency,” IEEE Transactions on Signal Processing, vol. 50, no. 11, pp. 2744–2756, November 2002.
[26] L. Sendur and I.W.
Selesnick, “Bivariate shrinkage with local variance estimation,” IEEE Signal Processing Letters, vol. 9, no. 12, pp. 438–441, December 2002.
[27] J.L. Starck, D.L. Donoho, and E. Candès, “Very high quality image restoration,” in Proceedings of the SPIE Conference on Wavelet and Applications in Signal and Image Processing, San Diego, CA, USA, July 2001, vol. 4478, pp. 9–19.
[28] C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images,” in Proceedings of the IEEE International Conference on Computer Vision, Bombay, India, January 1998, pp. 839–846.
[29] G. Hua and M.T. Orchard, “A new interpretation of translation invariant denoising,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Montreal, Canada, May 2004, vol. 3, pp. 189–192.
[30] X. Li and M.T. Orchard, “Spatially adaptive image denoising under overcomplete expansion,” in Proceedings of the IEEE International Conference on Image Processing, Vancouver, BC, Canada, September 2000, vol. 3, pp. 300–303.
[31] D.D. Muresan and T.W. Parks, “Adaptive principal components and image denoising,” in Proceedings of the IEEE International Conference on Image Processing, Barcelona, Spain, September 2003, vol. 1, pp. 101–104.
[32] K. Hirakawa and T.W. Parks, “Image denoising for signal-dependent noise,” IEEE Transactions on Image Processing, vol. 15, no. 9, pp. 2730–2742, September 2006.
[33] D.D. Muresan and T.W. Parks, “Adaptively quadratic (AQua) image interpolation,” IEEE Transactions on Image Processing, vol. 13, no. 5, pp. 690–698, May 2004.
[34] H. Tian, B. Fowler, and A. El Gamal, “Analysis of temporal noise in CMOS photodiode active pixel sensor,” IEEE Journal of Solid-State Circuits, vol. 36, no. 1, pp. 92–101, January 2001.
[35] G.E. Healey and R.
Kondepudy, “Radiometric CCD camera calibration and noise estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 3, pp. 267–276, March 1994.
[36] R. Ding and A.N. Venetsanopoulos, “Generalized homomorphic and adaptive order statistic filters for the removal of impulsive and signal-dependent noise,” IEEE Transactions on Circuits and Systems, vol. 34, no. 8, pp. 948–955, August 1987.
[37] A.V. Oppenheim, R.W. Schafer, and J.R. Buck, Discrete-Time Signal Processing, 2nd Edition. Upper Saddle River, NJ: Prentice-Hall, 1999.
[38] P. Fryzlewicz and G.P. Nason, “A Haar-Fisz algorithm for Poisson intensity estimation,” Journal of Computational and Graphical Statistics, vol. 13, no. 3, pp. 621–638, September 2004.
[39] P. Fryzlewicz and G.P. Nason, “Smoothing the wavelet periodogram using the Haar-Fisz transform,” Scientific Charge, 2001.
[40] K. Hirakawa and T.W. Parks, “Joint demosaicing and denoising,” IEEE Transactions on Image Processing, vol. 15, no. 8, pp. 2146–2157, August 2006.
[41] M. Raphan and E.P. Simoncelli, “Learning to be Bayesian without supervision,” in Advances in Neural Information Processing Systems, vol. 19. Cambridge, MA: MIT Press, 2007.
[42] K. Hirakawa, “Signal-dependent noise characterization in wavelets domain,” in Proceedings of the SPIE Conference on Optics & Photonics, San Diego, CA, USA, August 2007.
[43] D.R. Cok, “Signal processing method and apparatus for producing interpolated chrominance values in a sampled color image signal,” U.S. Patent 4 642 678, February 1987.
[44] R. Kimmel, “Demosaicing: Image reconstruction from color CCD samples,” IEEE Transactions on Image Processing, vol. 8, no. 9, pp. 1221–1228, September 1999.
[45] R. Lukac, K.N. Plataniotis, D. Hatzinakos, and M. Aleksic, “A novel cost effective demosaicing approach,” IEEE Transactions on Consumer Electronics, vol. 50, no. 1, pp. 256–261, February 2004.
[46] R. Lukac and K.N.
Plataniotis, “Normalized color-ratio modeling for CFA interpolation,” IEEE Transactions on Consumer Electronics, vol. 50, no. 2, pp. 737–745, May 2004.
[47] E. Dubois, “Filter design for adaptive frequency-domain Bayer demosaicking,” in Proceedings of the IEEE International Conference on Image Processing, Atlanta, GA, USA, October 2006, pp. 2705–2708.
[48] M.J.T. Smith and T.P. Barnwell, “A procedure for designing exact reconstruction filter banks for tree structured subband coders,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, San Diego, CA, USA, March 1984, pp. 27:1.1–1.4.
[49] K. Hirakawa and X.L. Meng, “An empirical Bayes EM-wavelet unification for simultaneous denoising, interpolation, and/or demosaicing,” in Proceedings of the IEEE International Conference on Image Processing, Atlanta, GA, USA, October 2006, pp. 1453–1456.
[50] G.J. McLachlan and T. Krishnan, The EM Algorithm and Extensions. New York: John Wiley & Sons, 1997.

10 Automatic White Balancing in Digital Photography

Edmund Y. Lam and George S. K. Fung

10.1 Introduction
10.2 Human Visual System and Color Theory
10.2.1 Illumination
10.2.2 Object
10.2.3 Color Stimulus
10.2.4 Human Visual System
10.2.5 Color Matching
10.3 Challenges in Automatic White Balancing
10.4 Automatic White Balancing Algorithms
10.4.1 Gray World
10.4.2 White Patch
10.4.3 Iterative White Balancing
10.4.4 Illuminant Voting
10.4.5 Color by Correlation
10.4.6 Other Methods
10.5 Implementations and Quality Evaluations
10.6 Conclusion
Acknowledgments
References

10.1 Introduction

Color constancy is one of the most amazing features of the human visual system. When we look at objects under different illuminations, their colors stay relatively constant. This helps humans identify objects conveniently.
While the precise physiological mechanism is not fully understood, it has been postulated that the eyes capture the different wavelengths of light reflected by an object, and the brain attempts to “discount” the contribution of the illumination, so that the perceived color matches the object reflectance more closely and therefore remains mostly constant under different illuminations [1]. Similar behavior is highly desirable in digital still and video cameras. It is achieved via white balancing, an image processing step in the digital camera imaging pipeline (described in detail in Chapters 1 and 3) that adjusts the coloration of images captured under different illuminations [2], [3].

White balancing is needed because the ambient light has a significant effect on the color stimulus. If the color temperature of a light source is low, the captured object will appear reddish; an example is the domestic tungsten lamp, whose color temperature is around 3000 kelvins (K). Conversely, under a light source with a high color temperature, the object will appear bluish; typical daylight, with a color temperature above 6000 K, falls into this category [4].

Various manual and automatic methods exist for white balancing. For manual white balancing, the camera manufacturer often provides predefined settings for typical lighting conditions such as sunlight, cloudy, fluorescent, or incandescent. The user only needs to make a selection, and the camera computes the adjustment automatically. Higher-end cameras, such as prosumer (professional-consumer) and single-lens reflex (SLR) digital cameras, even allow users to define their own white balance reference. Most amateur users, however, prefer automatic white balancing.
The camera then needs to dynamically detect the color temperature of the ambient light and compensate for its effects, or to determine from the image content the color correction required by the illumination. The automatic white balancing (AWB) algorithm employed in the camera imaging pipeline is thus critical to the color appearance of digital pictures. This chapter is devoted to a study of such algorithms commonly used in digital photography.

We organize this chapter as follows. In Section 10.2, we first briefly review the human visual system and the theory of color, which are necessary background for our discussion of AWB strategies in cameras. Along the way, we introduce some terminology commonly used in discussing color, and then look into the physical principles of color formation on an electronic sensor. The challenges that exist in digital photography are described in Section 10.3. Then, in Section 10.4, we describe a few representative AWB algorithms. Our goal is not to be encyclopedic, which would be nearly impossible given the wide array of existing methods and the proprietary nature of some of these schemes, but to illustrate the main principles behind the major approaches. In Section 10.5, experimental results of some of these representative algorithms are presented to evaluate and compare the efficacy of the various techniques. Some concluding remarks are given in Section 10.6.

10.2 Human Visual System and Color Theory

It is instructive to begin the discussion of color with the physics of light and the physiology of the human visual system. The primary reason is that humans are typically the end users and the judges of the images produced by our camera systems. A secondary reason is that the eye is itself a complex and beautifully made organ that acts as an image capturing device for the brain. In many cases, we model our camera system design on the natural design of our eyes.
Visible light occupies a small section of the electromagnetic spectrum, which we call the visible spectrum. In the seventeenth century, Sir Isaac Newton (1642–1727) was the first to demonstrate that when a beam of white light enters a prism, refraction splits the exiting beam into shades of different colors. Further experimentation and measurement show that different colors correspond to electromagnetic waves of different wavelengths, usually denoted by the symbol λ and measured in nanometers (nm). The visible spectrum spans roughly from λ = 400 nm to λ = 700 nm. Our eyes interpret different colors based on the wavelength of the electromagnetic wave: at λ ≈ 400 nm we have the sensation of blue, at λ ≈ 550 nm the sensation of green, and at λ ≈ 700 nm the sensation of red. The region λ < 400 nm is called ultraviolet (UV), while the region λ > 700 nm is called infrared (IR). UV and IR are gaining importance in imaging, particularly in medical imaging and remote sensing, respectively, but digital camera systems focus on the visible spectrum. Hence, hereafter we restrict our attention to the range of approximately 400 nm < λ < 700 nm.

10.2.1 Illumination

Imaging begins with a source of light, called the illumination. Virtually all illuminations consist of light at multiple wavelengths. (In fact, we have to go to extraordinary lengths to create lasers with a single wavelength or a very narrow bandwidth. Imaging under such conditions is rare in digital photography and is mostly done for scientific research, so we do not consider it here.) Each illumination is then described by a curve showing the strength of the electromagnetic radiation at different values of λ.
If we normalize the curves of various illuminations, we obtain a very useful description of each one: its spectral power distribution as a function of wavelength. We can then compare the representative spectral power distributions of common light sources. Note that we usually describe only the general characteristics of an illumination; an actual measurement of sunlight, for example, would depend on the location, altitude, and atmospheric and weather conditions at the time of measurement.

As an example, Figure 10.1¹ shows the spectral power distributions of several common illumination sources. Figure 10.1a is the curve for typical sunlight, which is continuous (although not uniform) over the visible spectrum. Tungsten light, shown in Figure 10.1b, is also rather smooth. Later, we will see why sunlight tends to be perceived as bright yellowish-white, while tungsten light usually gives a sensation of yellowish hue. In contrast, the spectral power distribution of a fluorescent lamp consists of sharp spikes, as shown in Figure 10.1c. Finally, the spectral power distribution of a light-emitting diode (LED) is also smooth, but is often limited to a narrow range of the visible spectrum, as depicted in Figure 10.1d. LEDs whose spectral power is concentrated in the longer-wavelength region are more common, and as a result we mostly have red LEDs.

¹Data for some of the spectral power distributions and spectral reflectances can be obtained online from the Munsell Color Science Laboratory, Chester F. Carlson Center for Imaging Science, Rochester Institute of Technology.

FIGURE 10.1 Spectral power distribution of various common types of illumination: (a) sunlight, (b) tungsten light, (c) fluorescent light, and (d) light-emitting diode (LED). Each panel plots the relative spectral power distribution (0 to 1) against wavelength λ (400 to 700 nm).

In this chapter, we denote the spectral power distribution of an illumination by I(λ). This is an important quantity, which we will relate to the other color-formation factors explained next. In general, there are numerous possible curves I(λ). It is also possible to express I(λ) as a linear combination of known basis functions I_j(λ):

$$
I(\lambda) = \sum_{j=1}^{m} \alpha_j I_j(\lambda), \tag{10.1}
$$

where, for example, three basis functions (m = 3) are sufficient to represent standard daylights [5]. This property can be used in the design of AWB algorithms.

10.2.2 Object

Illumination is one of the three main factors contributing to the sensation of color in our brains. The second major factor is the object. When electromagnetic radiation from the illumination reaches an object, it is partially absorbed and partially reflected, or transmitted in the case of transparent objects. The proportion reflected or transmitted varies with wavelength from one object to another, but it is an inherent property of the object, independent of the illumination. We can therefore characterize and compare objects by their spectral reflectance or spectral transmittance as a function of wavelength.

FIGURE 10.2 (See color insert.) GretagMacbeth color rendition chart.

To illustrate, we plot the spectral reflectance of several typical object colors. These colors are patches taken from the GretagMacbeth Color Checker (Figure 10.2), which is often used to test the performance of digital cameras. The spectral reflectance plots are shown in Figure 10.3.
For each plot, the y-axis denotes the fraction of light reflected from the object. Figures 10.3a to 10.3d show the spectral reflectance of red, light blue, yellow, and gray patches, respectively. As expected, the red patch absorbs most of the greenish-blue wavelengths and reflects most of the longer-wavelength light that gives the sensation of red. Note, however, that some residual shorter-wavelength light is also reflected, only in much smaller amounts. We observe similar behavior for the light blue and yellow patches. For the gray patch, the spectral reflectance is roughly constant across wavelengths, so the resulting gray sensation is neutral in color.

We denote the spectral reflectance of an object by R(λ). Similarly, we define the spectral transmittance of an object as T(λ). There have also been attempts to decompose the spectral reflectance into a sum of known basis functions R_j(λ), such that

$$
R(\lambda) = \sum_{j=1}^{m} \beta_j R_j(\lambda). \tag{10.2}
$$

It has been shown that three basis functions (m = 3) can accurately represent the reflectance functions of 433 Munsell chips [6], and that seven basis functions (m = 7) are sufficient for a large number of natural objects [7].

10.2.3 Color Stimulus

Illumination and object reflectance (or transmittance) together determine the color stimulus. The spectral power distribution of the illumination governs how much energy is incident on the object at every wavelength. For a reflective object, the spectral reflectance dictates what fraction of that radiation is reflected and arrives at the eye or the sensor, again at every wavelength. Similarly, for a transmissive object, the spectral transmittance determines the fraction of the radiation transmitted through the object.
Therefore, the spectral power distribution of the light leaving a reflective object is the product of the spectral power distribution of the illumination and the spectral reflectance of the object. This product is called the color stimulus.

FIGURE 10.3 Spectral reflectance of various color patches: (a) red patch, (b) light blue patch, (c) yellow patch, and (d) gray patch. Each panel plots spectral reflectance (0 to 1) against wavelength λ (400 to 700 nm).

FIGURE 10.4 Anatomy of a human eye, with the lens, pupil, retina (photoreceptors), optic nerve, and blind spot labeled.

Mathematically, we denote the color stimulus by S(λ). It is related to the illumination I(λ) and the object reflectance R(λ) by [8]

$$
S(\lambda) = I(\lambda)\, R(\lambda). \tag{10.3}
$$

Using the basis decompositions of Equations 10.1 and 10.2, with n now denoting the number of reflectance basis functions, this becomes

$$
S(\lambda) = \left[\sum_{j=1}^{m} \alpha_j I_j(\lambda)\right] \left[\sum_{k=1}^{n} \beta_k R_k(\lambda)\right] \tag{10.4}
$$

$$
S(\lambda) = \sum_{j=1}^{m} \sum_{k=1}^{n} \alpha_j \beta_k\, I_j(\lambda)\, R_k(\lambda). \tag{10.5}
$$

Although these equations appear deceptively simple, they underscore an important fact in color science: the color stimulus depends on both the illumination and the object. Therefore, given any object, we can in theory manipulate the illumination to produce any desired color stimulus! We will see shortly that the color stimulus in turn contributes to our perception of color. One noteworthy corollary is that color is not an inherent feature of an object. A red object, for example, can be perceived as blue under a cleverly designed illumination. In other words, the illumination is as important as the object reflectance. This fact is very important for camera system design.
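As a quick numerical check of Equations 10.3 to 10.5, the Python sketch below builds an illuminant and a reflectance from small basis sets, confirms that the product of the two sums equals the double sum, and shows that changing only the illuminant weights changes the stimulus of the same object. Everything here is an assumption made for illustration: the 10 nm wavelength grid, the linear and Gaussian basis functions, and the weights `alpha`, `beta`, and `alpha_blue` are not taken from [5] to [8] or from any measured data.

```python
import numpy as np

# Assumed wavelength grid: 400-700 nm sampled every 10 nm (31 samples).
wl = np.arange(400, 701, 10, dtype=float)

def gaussian(center, width):
    """Hypothetical smooth basis function peaking at `center` nm."""
    return np.exp(-0.5 * ((wl - center) / width) ** 2)

# Hypothetical bases: m = 3 for the illuminant (Eq. 10.1), n = 3 for reflectance (Eq. 10.2).
I_basis = np.stack([np.ones_like(wl), (wl - 550) / 150, gaussian(600, 60)])
R_basis = np.stack([gaussian(450, 40), gaussian(550, 40), gaussian(650, 40)])

alpha = np.array([1.0, 0.3, 0.2])  # made-up illuminant weights, tilted toward long wavelengths
beta = np.array([0.1, 0.2, 0.7])   # made-up reflectance weights (a reddish object)

I = alpha @ I_basis                # I(lambda), Eq. 10.1
R = beta @ R_basis                 # R(lambda), Eq. 10.2
S = I * R                          # color stimulus, Eq. 10.3

# Eq. 10.5: double sum of alpha_j * beta_k * I_j(lambda) * R_k(lambda)
S_double_sum = np.einsum("j,k,jl,kl->l", alpha, beta, I_basis, R_basis)
print(np.allclose(S, S_double_sum))  # True

# Same object, different illuminant with more short-wavelength power.
alpha_blue = np.array([1.0, -0.3, 0.0])
S_blue = (alpha_blue @ I_basis) * R
print(np.allclose(S, S_blue))        # False: the stimulus changed
```

The double sum is just the product of two weighted sums expanded term by term, so the two computations agree to rounding error; the second comparison illustrates, in miniature, why a camera cannot treat the stimulus as a property of the object alone.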
When we photograph the same object first under sunlight and then under fluorescent light, for example, the color stimuli differ significantly because of the difference in illumination. The camera must interpret the color stimuli differently in the two cases; otherwise, the colors in the two photographs would look very different. This explains why AWB is both critical and challenging. Before discussing AWB algorithms, however, we need to explore how our eyes interpret color stimuli. This brings us to the third major factor contributing to the sensation of color: the physiology of our eyes.

10.2.4 Human Visual System

The color stimulus is a function of wavelength, so faithfully reproducing it would require storing a great deal of information. A key fact from color science is that color reproduction does not need to reproduce the entire spectral distribution. In fact, we often need only three values to specify a color, despite the obvious loss of information and the possibility of ambiguity. The reason lies in the human visual system.

A basic anatomy of the human eye is shown in Figure 10.4. Visible light enters through the pupil and is refracted by the lens to form an image on the retina. The optic nerve transmits the image formed on the retina to the brain for interpretation. The retina senses this image through tiny sensors called photoreceptors. A person with normal vision has two types of photoreceptors, rods and cones, which function very differently. The rods are responsible for scotopic, or dim-light, vision. Each eye contains somewhere between 75 and 150 million rods, distributed all over the retina. Their chief role is to give us an overall picture of the field of view, rather than to provide color vision. When we enter a rather dark room, we may still be able to see objects, even though they appear colorless.
This is because under such illumination, only the rods are able to form the image.

FIGURE 10.5 Spectral sensitivities of: (a) the three types of cones in a human eye (S-, M-, and L-cones), and (b) a typical digital camera (blue, green, and red filters). Each panel plots spectral sensitivity against wavelength λ (350 to 750 nm).

The cones, on the other hand, are highly sensitive to color. There are far fewer cones in our eyes; an average person has only about six to seven million. They are also concentrated in a small region of the retina called the fovea, rather than distributed all over it. They help us resolve fine details in images and are responsible for photopic, or bright-light, vision. More importantly, there are three types of cones:

• L-cones, which have peak sensitivity in the long-wavelength section of the visible spectrum,
• M-cones, which have peak sensitivity in the middle-wavelength section of the visible spectrum, and
• S-cones, which have peak sensitivity in the short-wavelength section of the visible spectrum.

Together, these three types of cones give us color vision. People in whom one or more cone types are defective are said to have what we collectively call color deficiency or color blindness. Approximately one in twelve men has this condition to some degree, and it is more common in men than in women.

Figure 10.5a shows the spectral sensitivities of the three types of cones of the human eye. In subsequent discussions, we use l(λ), m(λ), and s(λ) to denote the spectral sensitivities of the L-, M-, and S-cones, respectively. For ease of comparison, the curves have been normalized to equal area.
It is interesting to observe that the curves neither cover disjoint sections of the visible spectrum nor cover it entirely. In fact, the responses of the L-cones and M-cones overlap significantly, and all three curves show a low response to stimuli below around 400 nm and above around 650 nm. As we will see in the next section, camera designs mimic the responses of the human eye. The camera sensor uses three types of filters, typically red, green, and blue. The spectral sensitivities of a typical camera are shown in Figure 10.5b. We can observe that the peaks of these filters roughly correspond to the peaks of the L-, M-, and S-cones of our eyes.

FIGURE 10.6 The aim of photography. Observing an object: (a) directly through a human eye, and (b) indirectly through a photograph.

When an object with stimulus S(λ) = I(λ)R(λ) is observed, each of the three cone types responds to the stimulus by summing up its reaction over all wavelengths. Therefore, the three cone types produce three values, in accordance with the equations

$$
\begin{aligned}
X &= \int_{400}^{700} l(\lambda)\, I(\lambda)\, R(\lambda)\, d\lambda,\\
Y &= \int_{400}^{700} m(\lambda)\, I(\lambda)\, R(\lambda)\, d\lambda,\\
Z &= \int_{400}^{700} s(\lambda)\, I(\lambda)\, R(\lambda)\, d\lambda.
\end{aligned}
\tag{10.6}
$$

The triplet (X, Y, Z) is called the trichromatic response. Despite this simple three-number description, it is estimated that humans can distinguish about 10 million color sensations! An important consequence of the trichromatic response is that a digital camera requires only three numbers at each pixel to capture the color information; we do not need to record the color stimulus at all wavelengths! This also gives rise to a useful phenomenon: even if the spectral sensitivities are known, for any given triplet (X, Y, Z) there could be infinitely many color stimuli satisfying Equation 10.6. Two stimuli that produce the same trichromatic response are called metamers. The existence of metamers is key to color photography.
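Equation 10.6 and the idea of metamers can be illustrated numerically. The sketch below assumes a 10 nm grid and uses Gaussian curves as stand-ins for l(λ), m(λ), and s(λ) (real cone sensitivities are tabulated measurements, not Gaussians); it approximates the integrals by Riemann sums. It then builds a metamer by adding to a stimulus a spectrum taken from the null space of the 3 × N sensitivity matrix, which none of the three cone types can detect.

```python
import numpy as np

wl = np.arange(400, 701, 10, dtype=float)  # assumed grid: N = 31 samples
dl = 10.0                                   # sample spacing, the d-lambda of Eq. 10.6

def gaussian(center, width):
    return np.exp(-0.5 * ((wl - center) / width) ** 2)

# Hypothetical Gaussian stand-ins for l, m, s (rows of a 3 x N matrix).
sens = np.stack([gaussian(565, 45), gaussian(540, 40), gaussian(445, 25)])

def tristimulus(stimulus):
    """Riemann-sum approximation of Eq. 10.6, returning (X, Y, Z)."""
    return sens @ stimulus * dl

# A smooth, strictly positive stimulus S1 = I * R (flat light, reddish object).
S1 = 1.0 * (0.2 + 0.6 * gaussian(620, 50))

# Rows 3..N-1 of Vt span the null space of `sens`, since `sens` has rank 3.
_, _, Vt = np.linalg.svd(sens)
v = Vt[3]                               # one direction invisible to all three cones
t = 0.5 * S1.min() / np.abs(v).max()    # small enough to keep S2 positive
S2 = S1 + t * v                         # a metamer of S1

print(np.allclose(tristimulus(S1), tristimulus(S2)))  # True: same (X, Y, Z)
print(np.allclose(S1, S2))                            # False: different spectra
```

With N = 31 samples and only three responses, the null space has 28 dimensions, which is one way to see why a given triplet (X, Y, Z) corresponds to infinitely many stimuli. A perturbation that is too large could make the spectrum negative and physically unrealizable, which is why the sketch scales it by `t`.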
10.2.5 Color Matching

The goal of photography is somewhat different from the way our eyes perceive the color of an object. Consider the two scenarios depicted in Figure 10.6. In Figure 10.6a, our eyes observe an object of a certain color. In Figure 10.6b, a camera captures the object and in turn produces an image on a screen or in a hardcopy, which our eyes then observe. Because the spectral sensitivities of the camera differ from those of our eyes, and the ink in the hardcopy differs in reflectance from the object, the print does not have exactly the same color as the original object. The goal of color photography is to make the image appear as similar to the object as possible. As such, our aim is color matching.

Consider digital images shown on a screen, such as a cathode-ray tube (CRT), a liquid crystal display (LCD), or an organic light-emitting diode (OLED) display. In these cases, each pixel consists of three color patches called primaries, which are usually red, green, and blue. Color is formed from a linear combination of the intensities of these primaries. Each primary is associated with a certain color stimulus, which we denote by P_r(λ), P_g(λ), and P_b(λ) for the three colors.

Because the responses in Equation 10.6 are linear in the stimulus, we only need to perform color matching for monochromatic light sources: any real stimulus, produced by any real illumination reflected from or transmitted through any object, can be decomposed into a linear combination of such single-wavelength light sources, and its match is the same combination of the individual matches. Mathematically, we assume that our light source has the color stimulus

$$
S(\lambda) = \delta(\lambda - \lambda_0), \tag{10.7}
$$

which has unit strength at wavelength λ_0 and zero elsewhere. In matching colors, we assign scalar weights l_0, m_0, and s_0 to the three primaries P_r(λ), P_g(λ), and P_b(λ), respectively. Note further that color matching must be performed with respect to an observer.
It is common to define a standard observer with a particular set of spectral sensitivities l(λ), m(λ), and s(λ). Equipped with all these parameters, we can now calculate the tristimulus values obtained by directly observing the original stimulus with unit strength at wavelength λ_0:

$$
\begin{aligned}
X &= \int_{400}^{700} l(\lambda)\, S(\lambda)\, d\lambda = l(\lambda_0),\\
Y &= \int_{400}^{700} m(\lambda)\, S(\lambda)\, d\lambda = m(\lambda_0),\\
Z &= \int_{400}^{700} s(\lambda)\, S(\lambda)\, d\lambda = s(\lambda_0).
\end{aligned}
\tag{10.8}
$$

The tristimulus values obtained by indirect observation, through the three primaries of the display device, are

$$
\begin{aligned}
\hat{X} &= \int_{400}^{700} l(\lambda)\, \bigl[l_0 P_r(\lambda) + m_0 P_g(\lambda) + s_0 P_b(\lambda)\bigr]\, d\lambda,\\
\hat{Y} &= \int_{400}^{700} m(\lambda)\, \bigl[l_0 P_r(\lambda) + m_0 P_g(\lambda) + s_0 P_b(\lambda)\bigr]\, d\lambda,\\
\hat{Z} &= \int_{400}^{700} s(\lambda)\, \bigl[l_0 P_r(\lambda) + m_0 P_g(\lambda) + s_0 P_b(\lambda)\bigr]\, d\lambda.
\end{aligned}
\tag{10.9}
$$

To match the color, we only require that the tristimulus values match, i.e., X = X̂, Y = Ŷ, and Z = Ẑ. Equating Equations 10.8 and 10.9, we have

$$
\underbrace{\begin{bmatrix}
\int P_r(\lambda)\, l(\lambda)\, d\lambda & \int P_g(\lambda)\, l(\lambda)\, d\lambda & \int P_b(\lambda)\, l(\lambda)\, d\lambda\\
\int P_r(\lambda)\, m(\lambda)\, d\lambda & \int P_g(\lambda)\, m(\lambda)\, d\lambda & \int P_b(\lambda)\, m(\lambda)\, d\lambda\\
\int P_r(\lambda)\, s(\lambda)\, d\lambda & \int P_g(\lambda)\, s(\lambda)\, d\lambda & \int P_b(\lambda)\, s(\lambda)\, d\lambda
\end{bmatrix}}_{\mathbf{P}}
\underbrace{\begin{bmatrix} l_0\\ m_0\\ s_0 \end{bmatrix}}_{\mathbf{v}_c}
=
\underbrace{\begin{bmatrix} l(\lambda_0)\\ m(\lambda_0)\\ s(\lambda_0) \end{bmatrix}}_{\mathbf{v}},
\tag{10.10}
$$

where all integrals are taken from 400 nm to 700 nm. This equation is fundamental to color science. Several remarks can be made about this matrix equation:

1. The vector v_c is the only unknown in the above equation. With three equations and three unknowns, the solution is unique provided that the matrix P is nonsingular.
2. The process can be repeated for different values of λ_0. If we view l_0 not as a scalar but as the value of a curve x̄(λ) at λ = λ_0, repeating the process at different wavelengths generates the entire curve x̄(λ). Similarly, we generate ȳ(λ) from m_0 and z̄(λ) from s_0. These curves are called color-matching functions.
3. The color-matching functions depend on the primaries (P_r(λ), P_g(λ), and P_b(λ)) and on the observer (l(λ), m(λ), and s(λ)). Changing either, or both, of these quantities results in new color-matching functions.
4. The vector v is fixed for the same observer.
In this case, the color-matching functions obtained with different primaries are simply linear combinations of one another. For two sets of primaries with matrices P_1 and P_2, we have P_1 v_c1 = v and P_2 v_c2 = v, and thus

$$
\mathbf{v}_{c1} = \mathbf{P}_1^{-1} \mathbf{P}_2\, \mathbf{v}_{c2}. \tag{10.11}
$$

5. One particularly useful set of color-matching functions, defined by the Commission Internationale de l'Éclairage (CIE) and called the CIE Standard Colorimetric Observer color-matching functions, is shown in Figure 10.7. They are used to calculate the CIE tristimulus values X, Y, and Z, which quantify the trichromatic characteristics of color stimuli [8].

FIGURE 10.7 The CIE standard colorimetric observer color-matching functions x̄(λ), ȳ(λ), and z̄(λ), plotted against wavelength λ (350 to 750 nm).

10.3 Challenges in Automatic White Balancing

In the previous section, we discussed how the color stimulus depends equally on the illumination and the object reflectance. Mathematically, we would thus expect the color stimulus of the same object to differ under different lighting conditions. However, our experience suggests the contrary: the same object appears to have the same color even under different illuminations. This is known as color constancy, and it is known that the human visual system corrects for the prevailing scene illumination [9], [10]. For digital cameras, however, this is a challenging engineering problem.

Digital cameras nowadays typically use a single image sensor, with a mosaic of color filters placed over the photodetectors, one filter per photodetector (for details, refer to Chapters 1 and 5). These filters can be fabricated as a photoresist layer mixed with red, green, or blue dyes [11], with spectral sensitivities such as those shown in Figure 10.5b. When an object with stimulus S(λ) = I(λ)R(λ) is observed, each filter responds to the stimulus by summing up its reaction over all wavelengths.
Therefore, the three filters produce three values, in accordance with the equations

$$
\begin{aligned}
R_{\text{sensor}} &= \int_{400}^{700} r(\lambda)\, I(\lambda)\, R(\lambda)\, d\lambda,\\
G_{\text{sensor}} &= \int_{400}^{700} g(\lambda)\, I(\lambda)\, R(\lambda)\, d\lambda,\\
B_{\text{sensor}} &= \int_{400}^{700} b(\lambda)\, I(\lambda)\, R(\lambda)\, d\lambda,
\end{aligned}
\tag{10.12}
$$

where r(λ), g(λ), and b(λ) are the spectral sensitivities of the sensor under the red, green, and blue filters, respectively. This equation is essentially identical to Equation 10.6, except that we are now concerned with the spectral sensitivities of the camera sensor rather than the cone responses of our eyes.

We can now state our AWB goal as follows: we seek to minimize the effect of I(λ) and ensure that R_sensor, G_sensor, and B_sensor correlate with the object reflectance R(λ) only [12]. The solution, however, involves dealing with an underdetermined set of equations. We can count the numbers of unknowns and equations as follows. For simplicity, assume for the moment that our sensors do not rely on demosaicking to recover the full RGB image. Hence, for an image of size n × n, we have n² pixels and therefore 3n² captured values. From these known values, we want to estimate parameters for the n² pixels together with the illuminant. Assume we discretize Equation 10.12 so that

$$
\begin{aligned}
R_{\text{sensor}} &= \sum_{j=1}^{m} r(\lambda_j)\, I(\lambda_j)\, R(\lambda_j)\, \Delta\lambda,\\
G_{\text{sensor}} &= \sum_{j=1}^{m} g(\lambda_j)\, I(\lambda_j)\, R(\lambda_j)\, \Delta\lambda,\\
B_{\text{sensor}} &= \sum_{j=1}^{m} b(\lambda_j)\, I(\lambda_j)\, R(\lambda_j)\, \Delta\lambda.
\end{aligned}
\tag{10.13}
$$

We are then using m sample points to represent each integral. Thus, for each pixel we want to derive R(λ_j) for m values, giving a total of mn² unknowns. In addition, we have m unknowns for the illuminant. Comparing the 3n² known values with the mn² + m = m(n² + 1) unknowns, and noting that m(n² + 1) > 3n² for any m ≥ 3, it is clear that we do not have sufficient equations [12].

FIGURE 10.8 The Lambertian reflection model: light from a source is reflected evenly from a Lambertian surface toward all viewpoints.
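The discretized model of Equation 10.13 and the counting argument above can be written out directly. In this minimal sketch, the function names `sensor_rgb` and `count_unknowns` are hypothetical, and the flat spectra in the usage lines are toy inputs rather than measured sensitivities or reflectances.

```python
import numpy as np

def sensor_rgb(r, g, b, illum, refl, dl):
    """Eq. 10.13: Riemann-sum sensor response of one pixel.

    All spectral arguments are arrays sampled at the same m wavelengths;
    dl is the sample spacing (the delta-lambda of Eq. 10.13).
    """
    stimulus = illum * refl                  # S = I * R at each sample
    return np.array([np.sum(r * stimulus),
                     np.sum(g * stimulus),
                     np.sum(b * stimulus)]) * dl

def count_unknowns(n, m):
    """Known values vs. unknowns for an n x n full-color image with m spectral samples."""
    known = 3 * n * n              # three captured values per pixel
    unknown = m * (n * n + 1)      # m reflectance samples per pixel plus m illuminant samples
    return known, unknown

wl = np.arange(400, 701, 10, dtype=float)    # m = 31 samples
flat = np.ones_like(wl)
print(sensor_rgb(flat, flat, flat, flat, 0.5 * flat, dl=10.0))  # [155. 155. 155.]
print(count_unknowns(n=1000, m=31))                             # (3000000, 31000031)
```

Even with as few as m = 3 spectral samples, `count_unknowns(n, 3)` gives 3n² knowns against 3n² + 3 unknowns, so some prior assumption about the scene is always needed.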
Moreover, Equations 10.12 and 10.13 correspond to a simplified two-dimensional world in which all objects are flat, matte, Lambertian surfaces that are uniformly illuminated [12]. A Lambertian, or diffuse, surface reflects the light energy reaching it evenly in all directions [13], as shown in Figure 10.8. Thus, a planar patch appears to have uniform brightness from every viewpoint from which it is visible. This occurs when the surface is rough relative to the wavelength of the light; otherwise, considerations such as flare complicate the problem substantially further.

To deal with the underconstrained nature of this problem, we often make additional assumptions about the world, and many of the AWB techniques described in the following section rely on particular assumptions. For instance, the gray world method, as the name implies, assumes that the average reflectance of the scene is gray. The white patch method assumes that there are always some white pixels in the image. Different assumptions thus lead to different implementations, and the efficacy of the various AWB algorithms can be judged by how well actual scenes satisfy their prior assumptions.

10.4 Automatic White Balancing Algorithms

From the above discussion, it can be seen that AWB techniques ideally require information about the camera being used, and may also be based on assumptions about the statistical properties of the expected illumination and spectral reflectance [14]. In practice, many AWB algorithms follow a two-stage process:

1. Illuminant estimation: This may be done explicitly, often by choosing from a known set of possible illuminants, or implicitly, through assumptions about the effect of such illuminants.
2. Image color correction: This generates a new image as if it had been taken under a standard illuminant.
The correction is often achieved by independently adjusting the gain of each of the three color signals, which is known as the von Kries hypothesis [15]. Commonly, only the red and blue intensities are adjusted, since AWB is concerned with the ratios among the three color signals.

Below we discuss several representative algorithms. We present them independently, but note that combination techniques also exist (e.g., Reference [14]), in which multiple algorithms are run simultaneously and a consensus decision is then made to select the best result.

10.4.1 Gray World

The first method relies on the gray world assumption, which states that the average reflectance of a scene is achromatic. In other words, the means of the red (R_sensor), green (G_sensor), and blue (B_sensor) channels in a given scene should be roughly equal. This method has its roots in film photography, where the average of the negative is biased towards the dark regions of the scene, which tend to be neutral [4].

Algorithmically, as stated above, we can apply gain factors to two of the channels so that their means equal the mean of the reference channel, which is often taken to be green. We denote a full-color image of size n × n as RGB_sensor(x, y), where x and y are the pixel indices. The individual red, green, and blue color components are then R_sensor(x, y), G_sensor(x, y), and B_sensor(x, y), respectively. We compute

$$
\begin{aligned}
R_{\text{avg}} &= \frac{1}{n^2} \sum_{x=1}^{n} \sum_{y=1}^{n} R_{\text{sensor}}(x, y),\\
G_{\text{avg}} &= \frac{1}{n^2} \sum_{x=1}^{n} \sum_{y=1}^{n} G_{\text{sensor}}(x, y),\\
B_{\text{avg}} &= \frac{1}{n^2} \sum_{x=1}^{n} \sum_{y=1}^{n} B_{\text{sensor}}(x, y).
\end{aligned}
\tag{10.14}
$$

If the three values are identical, the image already satisfies the gray world assumption and no further adjustment is necessary. In general, however, they differ. We then compute the gains α̂ and β̂ for the red and blue channels:

$$
\hat{\alpha} = \frac{G_{\text{avg}}}{R_{\text{avg}}} \quad \text{and} \quad \hat{\beta} = \frac{G_{\text{avg}}}{B_{\text{avg}}}. \tag{10.15}
$$

The corrected image is formed from R̂_sensor(x, y), Ĝ_sensor(x, y), and B̂_sensor(x, y), where

$$
\begin{aligned}
\hat{R}_{\text{sensor}}(x, y) &= \hat{\alpha}\, R_{\text{sensor}}(x, y),\\
\hat{G}_{\text{sensor}}(x, y) &= G_{\text{sensor}}(x, y),\\
\hat{B}_{\text{sensor}}(x, y) &= \hat{\beta}\, B_{\text{sensor}}(x, y).
\end{aligned}
\tag{10.16}
$$

This image often has a sufficient intensity range in all channels. If the highest intensity of the three channels is significantly below the maximum allowable value, we can scale all three channels by the same factor, so that the channel averages remain equal as the gray world assumption requires.

The gray world method is quite effective in practice, except in situations where a certain color dominates, such as the blue of the sky, or where an object of one particular color occupies most of the view. A number of extensions to this method can deal with such situations. One example is given in Reference [16], which defines a region in the (R_avg − G_avg) versus (B_avg − G_avg) plane. If the computed {R_avg, G_avg, B_avg} falls within this region, the scene is considered good enough and the AWB adjustment of Equation 10.16 is not performed.

10.4.2 White Patch

The second method is based on the Retinex theory² of visual color constancy, which argues that perceived white is associated with the maximum cone signals [18]. This is also known as the white world assumption [19]. The rationale is that the brightest point in an image is often a reflection from a glossy surface, which tends to reflect the actual color of the light source [20]. The white balancing scheme then attempts to equalize the maximum values of the three channels to produce a white patch. To keep a few bright pixels from disturbing the calculation, one can operate on clusters of pixels or lowpass filter the image [4].

²Retinex, a word formed from retina and cortex, was coined to suggest that both the eye and the brain are involved in visual color constancy [17].

To implement this, we compute

$$
\begin{aligned}
R_{\max} &= \max_{x, y} R_{\text{sensor}}(x, y),\\
G_{\max} &= \max_{x, y} G_{\text{sensor}}(x, y),\\
B_{\max} &= \max_{x, y} B_{\text{sensor}}(x, y).
\end{aligned}
\tag{10.17}
$$

If G_max is too small, we can first scale up the green intensities; otherwise, we keep the green channel unchanged. We define the gains α̃ and β̃ for the red and blue channels as

$$
\tilde{\alpha} = \frac{G_{\max}}{R_{\max}} \quad \text{and} \quad \tilde{\beta} = \frac{G_{\max}}{B_{\max}}. \tag{10.18}
$$

The corrected image is formed from R̃_sensor(x, y), G̃_sensor(x, y), and B̃_sensor(x, y), where

$$
\begin{aligned}
\tilde{R}_{\text{sensor}}(x, y) &= \tilde{\alpha}\, R_{\text{sensor}}(x, y),\\
\tilde{G}_{\text{sensor}}(x, y) &= G_{\text{sensor}}(x, y),\\
\tilde{B}_{\text{sensor}}(x, y) &= \tilde{\beta}\, B_{\text{sensor}}(x, y).
\end{aligned}
\tag{10.19}
$$

The gray world and white patch methods have their respective strengths, and it is conceivable that satisfying the conditions of both methods would yield even better images. But we first need to make the following remarks:

• For most images, the two methods produce different results. In other words, the corrected image can rarely satisfy both the gray world assumption and the Retinex theory.
• Equations 10.16 and 10.19 are both linear adjustments of the pixel intensities, and both mappings have a fixed point: pixels with zero intensity are unaffected. Each mapping therefore has only one free parameter per channel, which can generally meet only one of the two requirements (a prescribed mean or a prescribed maximum).

Evidently, it is rarely possible to satisfy both the gray world assumption and the Retinex theory with a linear technique. Instead, Reference [21] describes a simple adjustment with a quadratic mapping of intensities. Let the adjusted red channel be

$$
\breve{R}_{\text{sensor}}(x, y) = \mu R_{\text{sensor}}^2(x, y) + \nu R_{\text{sensor}}(x, y), \tag{10.20}
$$

where µ and ν are parameters to be found. The adjustment to the blue channel is computed analogously. To satisfy the gray world assumption, we require that

$$
\sum_{x=1}^{n} \sum_{y=1}^{n} \breve{R}_{\text{sensor}}(x, y) = n^2 G_{\text{avg}}, \tag{10.21}
$$

and therefore

$$
\mu \sum_{x=1}^{n} \sum_{y=1}^{n} R_{\text{sensor}}^2(x, y) + \nu \sum_{x=1}^{n} \sum_{y=1}^{n} R_{\text{sensor}}(x, y) = n^2 G_{\text{avg}}. \tag{10.22}
$$

Simultaneously, to satisfy the Retinex assumption and produce a white patch, we need

$$
\max_{x, y} \breve{R}_{\text{sensor}}(x, y) = G_{\max}. \tag{10.23}
$$

If we assume that R_sensor(x, y) takes on integer values between 0 and 255 and that µ and ν are positive, the mapping in Equation 10.20 is increasing, so its maximum occurs at the largest R_sensor(x, y), and therefore

$$
\mu \max_{x, y} R_{\text{sensor}}^2(x, y) + \nu \max_{x, y} R_{\text{sensor}}(x, y) = G_{\max}. \tag{10.24}
$$

Equations 10.22 and 10.24 together form two equations in two unknowns, which can be written in matrix form as

$$
\begin{bmatrix}
\sum_{x=1}^{n} \sum_{y=1}^{n} R_{\text{sensor}}^2(x, y) & \sum_{x=1}^{n} \sum_{y=1}^{n} R_{\text{sensor}}(x, y)\\
\max_{x, y} R_{\text{sensor}}^2(x, y) & \max_{x, y} R_{\text{sensor}}(x, y)
\end{bmatrix}
\begin{bmatrix} \mu\\ \nu \end{bmatrix}
=
\begin{bmatrix} n^2 G_{\text{avg}}\\ G_{\max} \end{bmatrix}.
\tag{10.25}
$$

This system can be solved analytically for µ and ν using Cramer’s rule.

10.4.3 Iterative White Balancing

The gray world and white patch methods described above are global techniques, in that all pixels are involved in the computation. A drawback is that both may be susceptible to statistical anomalies. The gray world method gives incorrect results if the scene is heavily biased towards a certain color cast; for example, an outdoor scene of the ocean and sky is typically rich in blue. For the white patch method, if a few pixels in the image have very large red, green, or blue values, they end up dominating the calculation.

In contrast, some algorithms preselect a subset of pixels fulfilling certain a priori criteria and derive the necessary color correction from these pixels, although the adjustment is subsequently applied to all pixels. For instance, we can perform iterative white balancing by extracting certain white points, as follows. We first convert the RGB values to YUV, a color space commonly used in video signals such as the PAL format, using the following formula:
    Y 0.299 0.587 0.114 Rsensor U  =  −0.147 −0.289 0.436   Gsensor  .
V 0.615 −0.515 −0.100 Bsensor (10.26) An ideal white point is when Rsensor = Gsensor = Bsensor = 255, which when put to the equation above makes Y = 255 and U = V = 0. Relaxing this condition a bit, we extract the pixels as white points if they satisfy the condition [22] Y >ξ |U| < ρ, |V | < τ (10.27) or an alternative criterion defined as [23]: Y − |U| − |V | > ζ , (10.28) where ξ , ρ, τ, and ζ are some pre-defined constants. While such a local method can avoid the scene being dominated by statistical anomalies, there are also situations that this would fail such as when there is no white object in the scene. Another refinement is to look at gray points, which form a superset of the white points and therefore are more abundant in a typical scene. Reference [24] proposes selecting these points by the formula |U | + |V Y | < η, (10.29) where η is a positive threshold value much less than 1. The rationale is that if the light source is biased, say, to have a stronger red component, we can represent the captured red component R˜ as R˜ = (1 + κR)R, (10.30) where R is the true red component if captured in a canonical light source, and κR denotes the percentage increase. This gives rise to a set of Y , U, and V where     Y 0.299 0.587 0.114 (1 + κR)R U  =  −0.147 −0.289 0.436   G  . V 0.615 −0.515 −0.100 B (10.31) 284 Single-Sensor Imaging: Methods and Applications for Digital Cameras Note that for a canonical illumination, we have R = G = B for a gray point, and therefore     Y 0.299 0.587 0.114 (1 + κR)R U  =  −0.147 −0.289 0.436   R  (10.32) V 0.615 −0.515 −0.100 R   1 + 0.299κR =  −0.147κR  R. (10.33) 0.615κR Putting the above in Equation 10.29, we get |U| + |V | Y = 0.147κR + 0.615κR 1 + 0.299κR = 0.762κR 1 + 0.299κR < ν. (10.34) (10.35) This value is close to zero if κR is small. Similar results can be derived if the color cast is in green or blue. 
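Because Equations 10.14 to 10.16 amount to two per-channel gains, the gray world method of Section 10.4.1 fits in a few lines. The sketch below is a minimal illustration, not a reference implementation; it assumes a floating-point image stored as an (H, W, 3) NumPy array in RGB order (the method does not depend on the image being square), and the function name `gray_world` is our own.

```python
import numpy as np

def gray_world(img: np.ndarray) -> np.ndarray:
    """Gray world AWB (Eqs. 10.14-10.16) for an (H, W, 3) float RGB image."""
    # Channel means over all pixels (Eq. 10.14)
    r_avg, g_avg, b_avg = img.reshape(-1, 3).mean(axis=0)
    # Gains that move the red and blue means onto the green mean (Eq. 10.15)
    alpha_hat = g_avg / r_avg
    beta_hat = g_avg / b_avg
    # Green is the reference channel and stays unchanged (Eq. 10.16)
    return img * np.array([alpha_hat, 1.0, beta_hat])

if __name__ == "__main__":
    img = np.array([[[40.0, 80.0, 160.0], [60.0, 100.0, 140.0]]])
    out = gray_world(img)
    print(out.reshape(-1, 3).mean(axis=0))  # each channel mean now equals G_avg = 90
```

The sketch omits clipping to the valid intensity range (for example `np.clip(out, 0, 255)` for 8-bit data), which a practical pipeline would need.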
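The white patch method of Section 10.4.2 differs from the gray world sketch only in using channel maxima (Equations 10.17 to 10.19). The following sketch, under the same array assumptions, optionally lowpass filters the image before taking the maxima, which is one of the two remedies mentioned there for isolated bright pixels. The 3 × 3 box filter and the function names are our own illustrative choices, and the optional pre-scaling of a too-small $G_{\max}$ is not included.

```python
import numpy as np

def box_blur3(img: np.ndarray) -> np.ndarray:
    """3x3 mean filter with edge replication; a simple stand-in for the lowpass step."""
    h, w = img.shape[:2]
    padded = np.pad(img, ((1, 1), (1, 1), (0, 0)), mode="edge")
    # Average the nine shifted copies of the image
    return sum(padded[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def white_patch(img: np.ndarray, smooth: bool = True) -> np.ndarray:
    """White patch AWB (Eqs. 10.17-10.19) for an (H, W, 3) float RGB image."""
    # Maxima are taken on the smoothed image, gains are applied to the original
    ref = box_blur3(img) if smooth else img
    r_max, g_max, b_max = ref.reshape(-1, 3).max(axis=0)   # Eq. 10.17
    alpha_t, beta_t = g_max / r_max, g_max / b_max         # Eq. 10.18
    return img * np.array([alpha_t, 1.0, beta_t])          # Eq. 10.19
```

With a single very bright red pixel in an otherwise neutral image, the unsmoothed version lets that pixel set $R_{\max}$ and darkens the whole red channel, whereas the smoothed version dilutes its influence.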
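Equation 10.25 is a 2 × 2 linear system, so Cramer's rule gives $\mu$ and $\nu$ in closed form. A minimal sketch for one channel (the blue channel is handled identically); the function names are hypothetical, and the code relies on the nonnegativity assumption behind Equation 10.24, under which $\max_{x,y} R_{\mathrm{sensor}}^2 = (\max_{x,y} R_{\mathrm{sensor}})^2$.

```python
import numpy as np

def quadratic_coefficients(channel, green):
    """Solve Eq. 10.25 for (mu, nu) with Cramer's rule."""
    x = np.asarray(channel, dtype=float)
    g = np.asarray(green, dtype=float)
    a11, a12 = np.sum(x ** 2), np.sum(x)    # row 1: gray world condition (Eq. 10.22)
    a21, a22 = np.max(x) ** 2, np.max(x)    # row 2: white patch condition (Eq. 10.24)
    b1, b2 = np.sum(g), np.max(g)           # n^2 * G_avg is simply the sum of G
    det = a11 * a22 - a12 * a21
    if np.isclose(det, 0.0):
        raise ValueError("the system of Eq. 10.25 is singular")
    mu = (b1 * a22 - a12 * b2) / det
    nu = (a11 * b2 - a21 * b1) / det
    return float(mu), float(nu)

def quadratic_map(channel, mu, nu):
    """Apply the quadratic intensity mapping of Eq. 10.20."""
    x = np.asarray(channel, dtype=float)
    return mu * x ** 2 + nu * x

if __name__ == "__main__":
    R = np.array([[10.0, 20.0], [30.0, 100.0]])
    G = np.array([[20.0, 40.0], [60.0, 250.0]])
    mu, nu = quadratic_coefficients(R, G)
    Rq = quadratic_map(R, mu, nu)
    print(mu, nu, Rq.sum(), Rq.max())  # Rq.sum() matches G.sum(), Rq.max() matches G.max()
```

If the solved $\mu$ or $\nu$ turns out negative, the assumption behind Equation 10.24 no longer holds and the mapped maximum need not equal $G_{\max}$; the sketch does not handle that case.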
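The gray-point test of Equation 10.29 and the iterative gain update built around Equation 10.36 can be sketched together. This is an illustration under stated assumptions, not the procedure of Reference [24]: the threshold `eta`, the fixed multiplicative step `step`, the stopping tolerance `tol`, and the choice to re-select gray points at every iteration are all hypothetical, since the adjustment used in [24] was determined empirically. The sign convention follows Equation 10.26, where a positive $U$ roughly indicates excess blue and a positive $V$ excess red.

```python
import numpy as np

# RGB -> YUV conversion matrix of Eq. 10.26
RGB2YUV = np.array([[0.299, 0.587, 0.114],
                    [-0.147, -0.289, 0.436],
                    [0.615, -0.515, -0.100]])

def gray_point_uv(img: np.ndarray, eta: float = 0.1):
    """U and V of the pixels passing the gray-point test (|U| + |V|) / Y < eta (Eq. 10.29)."""
    y, u, v = (img.reshape(-1, 3) @ RGB2YUV.T).T
    mask = (y > 0) & (np.abs(u) + np.abs(v) < eta * y)  # multiplied out to avoid dividing by Y
    return u[mask], v[mask]

def iterative_awb(img, eta=0.1, step=0.01, tol=0.5, max_iter=200):
    """Iterative gray-point white balancing; step and tol are placeholder values."""
    gains = np.ones(3)
    for _ in range(max_iter):
        u, v = gray_point_uv(img * gains, eta)  # re-select gray points on the current image
        if u.size == 0:
            break  # no gray points: keep the current gains
        u_avg, v_avg = u.mean(), v.mean()
        if max(abs(u_avg), abs(v_avg)) < tol:  # phi_j of Eq. 10.36 is small enough
            break
        if abs(u_avg) >= abs(v_avg):
            gains[2] *= 1.0 - step * np.sign(u_avg)  # U > 0: too much blue, lower the blue gain
        else:
            gains[0] *= 1.0 - step * np.sign(v_avg)  # V > 0: too much red, lower the red gain
    return img * gains, gains
```

In a scene with two bluish gray pixels and one saturated red pixel, the red pixel fails the gray-point test and is ignored, and only the blue gain is lowered until the average $U$ of the gray pixels falls below the tolerance.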
After selecting these gray points, we compute their average $U$ and $V$ values, denoted $\hat{U}_{\mathrm{avg}}$ and $\hat{V}_{\mathrm{avg}}$. An iterative procedure then drives both to zero. At the $j$th iteration, we compute

$$
\phi_j = \max\left(|\hat{U}_{\mathrm{avg}}|, |\hat{V}_{\mathrm{avg}}|\right). \tag{10.36}
$$

If this equals $|\hat{U}_{\mathrm{avg}}|$, implying that the dominant bias is in the blue channel, we adjust the gain of the blue channel. Otherwise, the dominant bias is in the red channel, and the gain of the red channel is adjusted. The amount of adjustment used in Reference [24] is empirical and was determined by trial and error. Each adjustment changes $\hat{U}_{\mathrm{avg}}$ and $\hat{V}_{\mathrm{avg}}$ for the next iteration, and Equation 10.36 is evaluated again until satisfactory results are obtained.

10.4.4 Illuminant Voting

The three methods discussed above all make assumptions about the effects of illumination and adjust the pixel intensities directly. In principle, such methods attempt to adjust the intensity values of an image so that they appear "normal," but there is no guarantee that the resulting image is indeed possible under any illuminant! In other words, we may have created an image that no lighting condition could physically produce for the particular objects. On the other hand, various techniques aim to recover the illuminant explicitly from the observed image. One such example is the illuminant voting technique [25]. Once the illuminant has been identified, correcting to any alternative lighting condition ensures that the resulting image is realizable.

The illuminant voting method is based on the idea of the Hough transform, a well-known technique in image processing, especially in pattern detection, which can be illustrated as follows. Suppose we would like to detect a straight line in an image. In the x–y plane, this line can be represented as

$$
\rho = x\cos\theta + y\sin\theta, \tag{10.37}
$$

where $\rho$ is the distance of the line from the origin and $\theta$ is the angle between the x-axis and the normal to the line. The Hough transform of this line is then the point $(\rho, \theta)$ in a new parameter space.
We can think of the Hough transform as mapping a line to a point, but we can also think of it as mapping a point to a curve: a point in the original x–y plane corresponds to the set of all $(\rho_j, \theta_j)$ that satisfy Equation 10.37, where $j$ is the index of the element. In theory there are infinitely many such $(\rho_j, \theta_j)$, but in practice the parameters are quantized, so there is only a finite number of elements. We can thus imagine that each point casts one vote for each of its elements. When we have multiple points, the $(\rho_j, \theta_j)$ with the most votes indicates the strongest presence of a line. In implementations, we commonly pick $\theta_j$ first, solve for $\rho_j$, and cast the vote before moving on to a new value of $\theta_j$. Details of the Hough transform can be found in many image processing textbooks (e.g., Reference [26]).

In a similar manner, we rely on the observed data to vote for the most likely illuminant. This requires modeling the illuminant and the reflectance as low-order linear combinations, as described in Equations 10.1 and 10.2. Substituting them into Equation 10.12, we have

$$
\begin{bmatrix} R_{\mathrm{sensor}} \\ G_{\mathrm{sensor}} \\ B_{\mathrm{sensor}} \end{bmatrix}
=
\begin{bmatrix}
\int_{400}^{700} r(\lambda) \sum_{j=1}^{m}\sum_{k=1}^{n} \alpha_j \beta_k I_j(\lambda) R_k(\lambda)\, d\lambda \\
\int_{400}^{700} g(\lambda) \sum_{j=1}^{m}\sum_{k=1}^{n} \alpha_j \beta_k I_j(\lambda) R_k(\lambda)\, d\lambda \\
\int_{400}^{700} b(\lambda) \sum_{j=1}^{m}\sum_{k=1}^{n} \alpha_j \beta_k I_j(\lambda) R_k(\lambda)\, d\lambda
\end{bmatrix}, \tag{10.38}
$$

which can be written compactly as

$$
\begin{bmatrix} R_{\mathrm{sensor}} \\ G_{\mathrm{sensor}} \\ B_{\mathrm{sensor}} \end{bmatrix}
= \sum_{k=1}^{n} \beta_k M_k \boldsymbol{\alpha}, \tag{10.39}
$$

where the $j$th column of $M_k$, denoted $(M_k)_j$, is

$$
(M_k)_j =
\begin{bmatrix}
\int_{400}^{700} r(\lambda) I_j(\lambda) R_k(\lambda)\, d\lambda \\
\int_{400}^{700} g(\lambda) I_j(\lambda) R_k(\lambda)\, d\lambda \\
\int_{400}^{700} b(\lambda) I_j(\lambda) R_k(\lambda)\, d\lambda
\end{bmatrix} \tag{10.40}
$$

and $\boldsymbol{\alpha} = [\alpha_1, \alpha_2, \ldots, \alpha_m]^T$. It is thus clear that the equation above is linear in $\boldsymbol{\alpha}$. In fact, it is bilinear in $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$, where $\boldsymbol{\beta} = [\beta_1, \beta_2, \ldots, \beta_n]^T$, because the equation can also be written with the roles of illumination and spectral reflectance interchanged [27].
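The voting idea behind the Hough transform can be made concrete with a small line detector for Equation 10.37. In this sketch the quantization (1° steps in $\theta$, integer bins in $\rho$) and the function name are our own choices; following the text, each point takes every $\theta_j$ in turn, solves for $\rho_j$, and casts one vote.

```python
import numpy as np

def hough_lines(points, rho_max, n_theta=180):
    """Vote in a quantized (theta, rho) space for lines rho = x cos(theta) + y sin(theta)."""
    thetas = np.deg2rad(np.arange(n_theta))   # theta_j = 0, 1, ..., n_theta - 1 degrees
    rhos = np.arange(-rho_max, rho_max + 1)   # integer rho bins
    acc = np.zeros((n_theta, rhos.size), dtype=int)
    rows = np.arange(n_theta)
    for x, y in points:
        # For every theta_j, solve Eq. 10.37 for rho_j and round it to the nearest bin
        rho = np.rint(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        if np.any(np.abs(rho) > rho_max):
            raise ValueError("rho_max is smaller than a point's distance from the origin")
        acc[rows, rho + rho_max] += 1          # one vote per (theta_j, rho_j)
    t, r = np.unravel_index(np.argmax(acc), acc.shape)
    return np.rad2deg(thetas[t]), int(rhos[r]), acc

if __name__ == "__main__":
    pts = [(5, y) for y in range(10)]          # ten points on the vertical line x = 5
    theta, rho, acc = hough_lines(pts, rho_max=15)
    print(theta, rho, acc.max())
```

For these ten points the accumulator peaks at $\theta = 0°$, $\rho = 5$ with ten votes. A few neighboring angles receive the same count because the integer $\rho$ bins are coarse; `np.argmax` returns the first maximum.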
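The three voting steps listed next (choose reflectance parameters, solve for the illuminant, cast a quantized vote) can be prototyped on synthetic data. The sketch below is a toy version under strong assumptions, not the implementation of Reference [25]: we take $m = n = 3$ so that $\sum_k \beta_k M_k$ is square, replace the integrals of Equation 10.40 with random positive matrices, draw candidate reflectance coefficients from a small hypothetical grid, and choose the quantization step and condition-number threshold arbitrarily.

```python
import itertools
import numpy as np

def illuminant_voting(pixels, M, beta_grid, step=0.1, max_cond=1e4):
    """Toy bilinear voting.

    pixels:    observed sensor vectors, each of shape (3,)
    M:         array of shape (n, 3, m) holding the matrices M_k (m = 3 assumed)
    beta_grid: candidate reflectance coefficient vectors, each of shape (n,)
    """
    votes = {}
    for c in pixels:
        for beta in beta_grid:                        # step 1: pick reflectance parameters
            A = np.tensordot(beta, M, axes=1)         # sum_k beta_k M_k, shape (3, m)
            if np.linalg.cond(A) > max_cond:          # keep only well-conditioned systems
                continue
            alpha = np.linalg.solve(A, c)             # step 2: solve Eq. 10.39 for alpha
            key = tuple(np.rint(alpha / step).astype(int))  # step 3: quantize, then vote
            votes[key] = votes.get(key, 0) + 1
    best = max(votes, key=votes.get)
    return np.array(best) * step, votes[best]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.uniform(0.1, 1.0, size=(3, 3, 3))         # hypothetical stand-in for Eq. 10.40
    grid = [np.array(b) for b in itertools.product([0.2, 0.5, 0.9], repeat=3)]
    alpha_true = np.array([1.0, 0.5, 0.2])
    surfaces = [b for b in grid if len(set(b.tolist())) > 1][:12]  # skip uniform betas
    pixels = [np.tensordot(b, M, axes=1) @ alpha_true for b in surfaces]
    print(illuminant_voting(pixels, M, grid))
```

Each synthetic surface whose system is well-conditioned votes for the true $\boldsymbol{\alpha}$ when its own $\boldsymbol{\beta}$ is tried, while votes from wrong guesses scatter across many bins. Uniform coefficient vectors are left out of the synthetic scene because scaled copies of one reflectance, such as (0.2, 0.2, 0.2) and (0.5, 0.5, 0.5), would vote for correspondingly scaled illuminants, a consequence of the scale ambiguity of bilinear models.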
Given this bilinearity, we can use the observed pixel data to vote for the illuminant and reflectance parameters in a way similar to the Hough transform. The procedure consists of the following steps:

1. Selection of the reflectance parameters: We pick a set of $\beta_k$, which determines the object reflectance given the basis functions $R_k(\lambda)$ in Equation 10.2. Since the scene may contain many possible object spectral reflectances, we go through this three-step procedure many times, each time with a different set of $\beta_k$.
2. Determination of illumination parameters: We solve for the $\alpha_j$ using Equation 10.39. Note that this is an inverse problem. Provided that $\sum_{k=1}^{n} \beta_k M_k$ is not singular, a solution can be found. However, if the matrix is ill-conditioned [28], [29], the solution can be very sensitive to noise. To deal with this problem, we typically retain only the cases where the system matrix is well-conditioned.
3. Casting of votes: A vote is cast for the $\boldsymbol{\alpha}$ obtained. As in most implementations of the Hough transform, $\boldsymbol{\alpha}$ is quantized so that similar values are grouped together; otherwise, there would be too many singleton votes.

After the procedure has been repeated for different object reflectances, the $\boldsymbol{\alpha}$ with the most votes is deemed to represent the illuminant.

10.4.5 Color by Correlation

The fundamental premise of the color by correlation method is that, although there are numerous possible spectral power distributions, such as those in Figure 10.1 (sunlight, for instance, has different spectral power distributions at different hours of the day and on different days), there is only a small selection of substantially different illuminants (e.g., sunlight, fluorescent light, and tungsten light).
Some of these correspond to the modes, or illumination conditions, offered in semi-automatic white balancing, where the user selects a particular mode and the camera performs white balancing accordingly. As in the previous method, AWB is achieved through illuminant identification; the difference is that the current method seeks not a single estimate of the illumination function $I(\lambda)$, but a set of possible illuminants together with their likelihoods. Thus, it not only determines the most likely illuminant but also computes the likelihood of every other illuminant, so that the error margin of the alternative choices is also known.

A prerequisite for this method is that we know the range and distribution of image colors that the camera can record under a set of possible lights [12]. We can then correlate the observed image with these distributions and identify the closest one as the most likely illuminant. More precisely, assume that there are $k$ possible illuminants altogether. Instead of working with the three sensor responses, we deal only with chromaticity, computing the chromaticities $(c_1, c_2)$ as

$$
c_1 = \frac{R_{\mathrm{sensor}}}{G_{\mathrm{sensor}}}, \qquad c_2 = \frac{B_{\mathrm{sensor}}}{G_{\mathrm{sensor}}}. \tag{10.41}
$$

In practice, Reference [12] advocates

$$
c_1 = \left(\frac{R_{\mathrm{sensor}}}{G_{\mathrm{sensor}}}\right)^{1/3}, \qquad c_2 = \left(\frac{B_{\mathrm{sensor}}}{G_{\mathrm{sensor}}}\right)^{1/3}, \tag{10.42}
$$

which leads to more uniformly distributed chromaticities. We partition the space of all chromaticities into $N \times N$ bins. The task is now to determine the possible $(c_1, c_2)$ under each illuminant. There are a few possibilities:

1. The empirical way is to take the camera and capture a wide range of objects with various surface reflectances under each illuminant, obtaining the gamut of colors that the camera records under each lighting condition. This approach, however, can be rather cumbersome when there are many possible illuminants, some of which may not be easy to obtain at will (e.g., bright sunlight when the experimenters happen to be experiencing rainy days!).
2. We can generate the chromaticities using Equations 10.12 and 10.42. This requires knowledge of the spectral response characteristics of the camera (such as those in Figure 10.5b), the spectral power distribution of each illuminant, and the surface reflectances of a range of objects. We can then take the convex hull of these chromaticities to form the gamut, setting all entries inside the gamut to 1 and those outside to 0.
3. We can further refine the scheme above by using the probability of each chromaticity value as the entries. This is computed empirically from the relative frequencies of occurrence, estimated from the number of surfaces falling in each bin of the discretized chromaticity space.

We record this information in an $N^2 \times k$ correlation matrix $\mathbf{C}$, where each column (denoted $(\mathbf{C})_j$ for the $j$th column) corresponds to a possible illuminant, and each row entry is the likelihood of observing the corresponding chromaticity under that illumination.

For a given image, we then correlate this information with that present in the image. We transform the pixel intensities to chromaticity values using the same formula as above, such as Equation 10.42. We then form a vector $\boldsymbol{\omega}$ of length $N^2$, whose $i$th element $\omega_i$ is one if the corresponding chromaticity value is present in the image and zero otherwise. The most appropriate illuminant $\hat{j}$ is then given by

$$
\hat{j} = \arg\max_{j} \left\langle \boldsymbol{\omega}, (\mathbf{C})_j \right\rangle, \tag{10.43}
$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product. Another way to view this is to compute

$$
\boldsymbol{\chi} = \boldsymbol{\omega}^T \mathbf{C}, \tag{10.44}
$$

a row vector of length $k$ in which each value indicates the likelihood of the corresponding illuminant.
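The correlation step of Equations 10.42 to 10.44 reduces to binning chromaticities and a vector-matrix product. In this minimal sketch the bin layout (uniform bins over $[0, 2)$ for both cube-root chromaticities), the function names, and the two "illuminants" in the example are our own assumptions; marking only the observed bins with 1 is a crude stand-in for the convex-hull gamut of option 2, and option 3 would simply store probabilities in $\mathbf{C}$ instead.

```python
import numpy as np

def chromaticity_bins(rgb, n_bins, c_max=2.0):
    """Flat indices into an N x N grid of the (c1, c2) chromaticities of Eq. 10.42."""
    rgb = np.asarray(rgb, dtype=float).reshape(-1, 3)
    g = np.maximum(rgb[:, 1], 1e-6)                  # guard against G = 0
    c1 = np.cbrt(rgb[:, 0] / g)
    c2 = np.cbrt(rgb[:, 2] / g)
    # Uniform bins on [0, c_max); out-of-range values go to the edge bins
    i = np.clip((c1 / c_max * n_bins).astype(int), 0, n_bins - 1)
    j = np.clip((c2 / c_max * n_bins).astype(int), 0, n_bins - 1)
    return i * n_bins + j

def color_by_correlation(rgb, C, n_bins):
    """Return (j_hat, chi) following Eqs. 10.43-10.44; C has shape (N*N, k)."""
    omega = np.zeros(n_bins * n_bins)
    omega[chromaticity_bins(rgb, n_bins)] = 1.0      # presence of each chromaticity, not counts
    chi = omega @ C                                  # one score per illuminant
    return int(np.argmax(chi)), chi

if __name__ == "__main__":
    surfaces = np.array([[100, 100, 100], [80, 120, 90], [120, 100, 110]], dtype=float)
    illums = np.array([[2, 1, 1], [1, 1, 2]], dtype=float)  # hypothetical reddish, bluish lights
    N = 8
    C = np.zeros((N * N, len(illums)))
    for k, gain in enumerate(illums):
        C[chromaticity_bins(surfaces * gain, N), k] = 1.0
    print(color_by_correlation(surfaces * illums[0], C, N)[0])  # expected: 0
```

Because `chi` holds a score for every illuminant, the same product that selects $\hat{j}$ also shows how close the runner-up candidates are.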
Thus, in a single operation we can find not only the most likely illuminant but also the error margin of the others. In summary, the color by correlation method entails the following three-step process:

1. Preprocessing step: Information about the interaction between image colors and illuminants is coded. This is considered the prior information about the illuminants.
2. Correlation step: This prior information is correlated with the information present in a particular image. In other words, the colors in an image determine the likelihood of each possible illuminant.
3. Recovery step: These likelihoods are used to recover an estimate of the scene illuminant.

10.4.6 Other Methods

The above discussion of AWB techniques is by no means exhaustive. Other promising techniques include the gamut mapping algorithm using the coefficient rule (CRULE) [30], color in perspective [31], Bayesian formulation [10], neural networks [32], adaptive gains [33], [34], and combined strategies [14]. We refer readers to the original papers for further descriptions of these methods.

FIGURE 10.9 (See color insert.) AWB methods for the Macbeth color chart: (a) original image, (b) gray world, (c) white patch, (d) iterative white balancing, (e) illuminant voting, and (f) color by correlation.

10.5 Implementations and Quality Evaluations

In this section, we consider the performance of the above methods on a few test images. Our aim is not to compare the various methods extensively, as such comparisons can be found in the respective original papers and in others written specifically for that purpose [35], [36], but to give readers a general idea of the performance of these techniques. It is also known to be difficult to evaluate image quality objectively.
With synthetic data, we can generate an ideal image under the desired illumination, and a test image captured under another illumination but corrected with one of the AWB algorithms. If the former has intensities $\{R_{\mathrm{ideal}}, G_{\mathrm{ideal}}, B_{\mathrm{ideal}}\}$ in the three channels and the latter has intensities $\{R_{\mathrm{test}}, G_{\mathrm{test}}, B_{\mathrm{test}}\}$, we can compute the mean square error (MSE) between these images using the formula

$$
\mathrm{MSE} = \sum_{x=1}^{n}\sum_{y=1}^{n} \left[ (R_{\mathrm{ideal}} - R_{\mathrm{test}})^2 + (G_{\mathrm{ideal}} - G_{\mathrm{test}})^2 + (B_{\mathrm{ideal}} - B_{\mathrm{test}})^2 \right], \tag{10.45}
$$

where each of the quantities above has the argument $(x, y)$. A large MSE means that the ideal and test images are dissimilar, suggesting that the AWB algorithm may not be working well. Unfortunately, MSE is commonly not a good metric, for two reasons [37]. First, with real data we may not have an ideal image to compare with the test image. Second, even if we do, MSE does not correspond to human perception of images. For instance, if we scale the intensities of the test image or shift it spatially by a small amount, the effect may be negligible perceptually, yet the MSE can increase significantly. There are other ways to compare color differences objectively, such as S-CIELAB [38] and CIEDE2000 [39]. Below, however, we mainly evaluate three sets of images subjectively to give readers a flavor of the AWB algorithms.

The first set is shown in Figure 10.9. In Figure 10.9a, the original image has a reddish-orange hue. The results of the AWB algorithms are shown in Figures 10.9b to 10.9f. In this particular case, the results are all quite satisfactory and resemble one another. This can be attributed to the nature of the color chart. Because many colors are present and there is no bias towards, for example, red, green, or blue, the assumption that the average intensity should be gray holds well.
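The scaling argument made about Equation 10.45 is easy to check numerically. The sketch below evaluates Equation 10.45 exactly as written, as a sum over pixels and channels; dividing by $3n^2$ would give a per-sample mean, a common normalization that is our addition rather than part of the equation. The function name is hypothetical.

```python
import numpy as np

def awb_squared_error(ideal, test):
    """Eq. 10.45: squared differences summed over all pixels and the R, G, B channels."""
    diff = np.asarray(ideal, dtype=float) - np.asarray(test, dtype=float)
    return float(np.sum(diff ** 2))

if __name__ == "__main__":
    ideal = np.full((4, 4, 3), 100.0)
    print(awb_squared_error(ideal, ideal))          # 0.0
    print(awb_squared_error(ideal, 1.05 * ideal))   # 1200.0: 48 samples, each off by 5
```

A uniform 5% brightening already yields a sizable error, even though, as noted above, such a change may be perceptually negligible.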
Similarly, the color chart has a white patch that should correspond to the maximum intensity of the red, green, and blue channels, and after the correction the white patch is clearly evident.

Next, we consider the performance on another set of images, shown in Figure 10.10. As in the previous example, Figure 10.10a is the original image; in this case, a bluish-green cast is visible. Note that the actual image should be binary, containing black and white only. After correction, the gray world technique in Figure 10.10b and the iterative white balancing scheme in Figure 10.10d both yield images that contain shades of gray only. However, the white background is rendered somewhat grayish in both cases. This can be attributed to the fact that both algorithms aim at reducing the chromatic components of the image but have no mechanism that favors white over gray. On the other hand, the white patch technique in Figure 10.10c and the color by correlation method in Figure 10.10f are better at forcing the background to appear white. Note that in the white patch algorithm, we filtered out isolated bright pixels, as mentioned in Section 10.4.2. Finally, the illuminant voting algorithm in Figure 10.10e seems to have overcompensated for the bluish cast, and the corrected image now contains mild shades of yellow.

FIGURE 10.10 (See color insert.) AWB methods for the resolution chart: (a) original image, (b) gray world, (c) white patch, (d) iterative white balancing, (e) illuminant voting, and (f) color by correlation.

The third set of experimental results is given in Figure 10.11. This is an image of a bookshelf taken under fluorescent light but with an incorrect white balance setting in the camera.
We observe that the white patch method in Figure 10.11c seems to perform best in this case, whereas the gray world method in Figure 10.11b and the iterative white balancing scheme in Figure 10.11d again appear to produce a slightly grayish image. The illuminant voting and color by correlation methods, in Figures 10.11e and 10.11f, respectively, both perform insufficient white balancing. This result seems to agree with the comment made in Reference [36] that algorithms exploiting chromaticity statistics seem to perform worse than expected.

FIGURE 10.11 (See color insert.) AWB methods for the bookshelf image: (a) original image, (b) gray world, (c) white patch, (d) iterative white balancing, (e) illuminant voting, and (f) color by correlation.

10.6 Conclusion

In this chapter, we considered the issue of automatic white balancing (AWB) in digital photography. We discussed the nature of the problem and various algorithms that can be implemented to achieve AWB, including gray world, white patch, iterative white balancing, illuminant voting, and color by correlation. Nevertheless, we should note that color constancy, the root problem of AWB, is still recognized as a difficult problem: it has not been solved satisfactorily, and how humans and some other animals possess this fascinating ability is not yet well understood. Even state-of-the-art algorithms are not nearly as good as our own color constancy [10], and in some cases they are not sufficient for machine vision tasks such as object recognition [40]. Evidently, there is much room for research, both in understanding color constancy in the human visual system and in adapting it, or using other methods, to achieve AWB in digital camera design.
Acknowledgments This work is supported in part by Grant 10204548 from the University Research Committee of the University of Hong Kong and by the Research Grants Council of the Hong Kong Special Administrative Region, China under Project HKU 7143/05E. Dong Liang of the University of Hong Kong put much effort in the implementation of the various AWB algorithms. Experience gained when the principal author was involved with the Stanford programmable digital camera project also helped shape the content of this chapter. References [1] E. Land, “Recent advances in retinex theory and some implications for cortical computations: Color vision and the natural images,” Proceedings of the National Academy of Science, vol. 80, no. 16, pp. 5163–5169, August 1983. [2] E.Y. Lam, “Image restoration in digital photography,” IEEE Transactions on Consumer Electronics, vol. 49, no. 2, pp. 269–274, May 2003. [3] R. Lukac, “Single-sensor imaging in consumer digital cameras: A survey of recent advances and future directions,” Journal of Real-Time Image Processing, vol. 1, no. 1, pp. 45–52, October 2003. [4] R. Hunt, The Reproduction of Colour. Chichester, West Sussex, UK: John Wiley & Sons, 2004. [5] D.B. Judd, D.L. MacAdam, and G. Wyszecki, “Spectral distribution of typical daylight as a function of correlated color temperature,” Journal of the Optical Society of America, vol. 54, no. 8, pp. 1031–1040, 1964. Automatic White Balancing in Digital Photography 293 [6] J. Cohen, “Dependency of the spectral reflectance curves of the Munsell color chips,” Psychoneurological Science, vol. 1, no. 12, pp. 369–370, 1964. [7] M.J. Vrhel, R. Gershon, and L.S. Iwan, “Measurement and analysis of object reflectance spectra,” Color Research and Application, vol. 19, no. 1, pp. 4–9, February 1994. [8] E. Giorgianni and T. Madden, Digital Color Management. Reading, MA: Addison Wesley, 1998. [9] L. Arend and A. Reeves, “Simultaneous color constancy,” Journal of the Optical Society of America A, vol. 3, no. 
10, pp. 1743–1751, October 1986. [10] D.H. Brainard, W.A. Brunt, and J.M. Speigle, “Color constancy in the nearly natural image. I. Asymmetric matches,” Journal of the Optical Society of America A, vol. 14, no. 9, pp. 2091– 2110, September 1997. [11] D. Qian, J. Toker, and S. Bencuya, “An automatic light spectrum compensation method for CCD white balance measurement,” IEEE Transactions on Consumer Electronics, vol. 43, no. 2, pp. 216–220, May 1997. [12] G.D. Finlayson, S.D. Hordley, and P.M. Hubel, “Color by correlation: A simple, unifying framework for color constancy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 11, pp. 1209–1221, November 2001. [13] L.G. Shapiro and G.C. Stockman, Computer Vision. Upper Saddle River, NJ: Prentice Hall, 2001. [14] S. Bianco, F. Gasparini, and R. Schettini, “Combining strategies for white balance,” in Proceedings of the SPIE, Digital Photography III, San Jose, CA, USA, January 2007, vol. 6502, id. 65020D. [15] M. Fairchild, Color Appearance Models. Chichester, West Sussex, UK: John Wiley & Sons, 2005. [16] Y. Kim, H.S. Lee, and A.W. Morales, “A video camera system with enhanced zoom tracking and auto white balance,” IEEE Transactions on Consumer Electronics, vol. 48, no. 3, pp. 428– 434, August 2002. [17] E. Land, “The Retinex,” American Scientist, vol. 52, no. 2, pp. 247–264, 1964. [18] E. Land and J. McCann, “Lightness and Retinex theory,” Journal of the Optical Society of America, vol. 61, no. 1, pp. 1–11, 1971. [19] N. Kehtarnavaz, H. Oh, and Y. Yoo, “Development and real-time implementation of auto white balancing scoring algorithm,” Real-Time Imaging, vol. 8, no. 5, pp. 379–386, October 2002. [20] C. Weng, H. Chen, and C. Fuh, “A novel automatic white balance method for digital still cameras,” in Proceedings of the IEEE International Symposium on Circuits and Systems, Kobe, Japan, May 2005, vol. 4, pp. 3801–3804. [21] E.Y. 
Lam, “Combining gray world and Retinex theory for automatic white balance in digital photography,” in Proceedings of the International Symposium on Consumer Electronics, Macau, China, June 2005, pp. 134–139. [22] N. Nakano, R. Nishimura, H. Sai, A. Nishizawa, and H. Komstsu, “Digital still camera system for megapixel CCD,” IEEE Transactions on Consumer Electronics, vol. 44, no. 3, pp. 460– 466, August 1998. [23] R.Z. Zhou, J. He, and Z.L. Hong, “Adaptive algorithm of auto white balance for digital camera,” Journal of Computer-Aided Design and Computer Graphics, vol. 17, no. 3, pp. 529–533, March 2005. [24] J. Huo, Y. Chang, J. Wang, and X. Wei, “Robust automatic white balance algorithm using gray color points in images,” IEEE Transactions on Consumer Electronics, vol. 52, no. 2, pp. 541–546, May 2006. 294 Single-Sensor Imaging: Methods and Applications for Digital Cameras [25] G. Sapiro, “Color and illuminant voting,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 11, pp. 1210–1215, November 1999. [26] R. Gonzalez and R. Woods, Digital Image Processing. Upper Saddle River, New Jersey: Prentice Hall, 2002. [27] G. Sapiro, “Bilinear voting,” in Proceedings of the Sixth International Conference on Computer Vision, Bombay, India, November 1998, pp. 178–183. [28] E.Y. Lam and J.W. Goodman, “Iterative statistical approach to blind image deconvolution,” Journal of the Optical Society of America A, vol. 17, no. 7, pp. 1177–1184, July 2000. [29] G. Golub and C. Van Loan, Matrix Computations. Baltimore, Maryland: Johns Hopkins University Press, 1996. [30] D.A. Forsyth, “A novel algorithm for color constancy,” International Journal of Computer Vision, vol. 5, no. 1, pp. 5–36, August 1990. [31] G.D. Finlayson, “Color in perspective,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 10, pp. 1034–1038, October 1996. [32] B.V. Funt, V. Cardei, and K. 
Barnard, “Learning color constancy,” in Proceedings of the Fourth Color Imaging Conference, Scottsdale, AZ, USA, November 1996, pp. 58–60. [33] R. Lukac, “Refined automatic white balancing,” Electronic Letters, vol. 43, no. 8, pp. 445– 446, April 2007. [34] R. Lukac, “New framework for automatic white balancing of digital camera images,” Signal Processing, vol. 88, no. 3, pp. 582–593, March 2008. [35] K. Barnard, V. Cardei, and B. Funt, “A comparison of color constancy algorithms - Part I: Methodology and experiments with synthesized data,” IEEE Transactions on Image Processing, vol. 11, no. 9, pp. 972–983, September 2002. [36] K. Barnard, B. Funt, L. Martin, and A. Coath, “A comparison of color constancy algorithms Part II: Experiments with image data,” IEEE Transactions on Image Processing, vol. 11, no. 9, pp. 985–996, September 2002. [37] E.Y. Lam, “Robust minimization of lighting variation for real-time defect detection,” RealTime Imaging, vol. 10, no. 6, pp. 365–370, December 2004. [38] X. Zhang and B. Wandell, “Color image fidelity metrics evaluated using image distortion maps,” Signal Processing, vol. 70, no. 3, pp. 201–214, November 1998. [39] G. Johnson and M. Fairchild, “A top down description of S-CIELAB and CIEDE2000,” Color Research and Application, vol. 28, no. 6, pp. 425–435, December 2003. [40] B. Funt, K. Barnard, and L. Martin, “Is machine colour constancy good enough?,” in Proceedings of the Fifth European Conference on Computer Vision, Freiburg, Germany, June 1998, pp. 455–459. 11 Enhancement of Digital Photographs Using Color Transfer Techniques Franc¸ois Pitie´, Anil Kokaram, and Rozenn Dahyot 11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295 11.1.1 How Does Color Grading Work? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296 11.1.2 How to Deal with Content Variations . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . 298 11.2 Color Distribution Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299 11.3 Linear Color Distribution Transfer Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301 11.3.1 Independent Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302 11.3.2 Cholesky Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302 11.3.3 Principal Axes Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302 11.3.4 Linear Monge-Kantorovitch Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303 11.4 Nonlinear Color Distribution Transfer Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . 304 11.4.1 Independent Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304 11.4.2 Composition Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304 11.4.3 The Discrete Kantorovitch Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305 11.4.4 Transfer via the Radon Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306 11.5 What Color Space to Choose? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309 11.6 Reducing Grain Noise Artifacts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312 11.6.1 Reducing the Stretching by Adjusting the Distributions . . . . . . . . . . . . . . 312 11.6.2 Reducing the Artifacts by Adjusting the Gradient Fields . . . . . . . . . . . . . 313 11.7 Application Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317 11.8 Parting Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
Acknowledgments
References

11.1 Introduction

Color is an essential aspect of a picture, conveying many emotions and symbolic meanings to the viewer. Adjusting the color grade of pictures is therefore an important step in professional photography. This process is part of the larger activity of grading, in which the color and grain aspects of the photographic material are digitally manipulated. The term color grading will be used specifically to refer to the matching of color. Color grading is a delicate task, since the slightest color variation can alter the mood of a picture. For instance, changing the white point of a picture can make it look warmer or more metallic. Refer to Chapter 10 for details on white balancing. Increasing the contrast can give the picture a sharper aspect.

FIGURE 11.1 (See color insert.) Color transfer example: (a) original image, (b) image with the target palette, and (c) output image. A color mapping is applied to the original picture to match the palette of an example provided by the user.

Color grading is an even more critical step in movie postproduction. As with photographic stills, it gives the movie a unique artistic signature. For instance, the movie Amélie (2002) has a look of its own: bright colors, deep black levels and saturated colors all contribute to creating an artificial, expressionistic world. The main problem of grading a movie is to adjust the color consistently across all the shots, even though the movie has been edited with heterogeneous video material.
Shots taken at different times under natural light can have a substantially different feel due to even slight changes in lighting. Currently in the industry, color balancing (as it is called) is achieved by experienced artists who use very expensive editing hardware and software to match the color between frames manually by tuning parameters. Typical tuning operations comprise adjusting the exposure, brightness and contrast, calibrating the white point, or finding a color mapping curve for the luminance levels and the three color channels. For instance, to balance the contrast of the red color, the digital samples in the red channel of one frame may be multiplied by some gain factor, and the output image viewed and compared to the color of another (target) frame. The gain is then adjusted if the match in color is not quite right. The amount of adjustment, and whether it is an increase or a decrease, depends crucially on the experience of the artist. This is a delicate task, since a change in lighting conditions induces a very complex change of illumination. It would therefore be beneficial to automate this task in some way.

The scope of color grading goes beyond photography and postproduction. In digital restoration [1], the goal is to recover the original colors of paintings that have been faded by smoke or dust. The process can also be used for color image equalization in scientific data visualization [2], [3]. Also, the specialized activities of high dynamic range tone mapping [4], [5], as well as grayscale-to-color [6] and color-to-grayscale conversion [7], [8], could be considered special instances of color grading.

11.1.1 How Does Color Grading Work?

The color grading workflow typically begins with grading the tone of the entire picture and then proceeds to local color correction, where specific areas of the image are isolated for dedicated color grading.
When dealing with image sequences, a tracking operation is then necessary to keep local color corrections accurate across the whole shot. This chapter presents techniques that aim to facilitate the choice of color mappings. These techniques belong to the class of example-based color transfer methods. The idea, first formulated by Reinhard et al. [9], has attracted much interest recently [10], [11], [12], [13]. Figure 11.1 illustrates this with an example. The original picture (Figure 11.1a) is transformed so that its colors match the palette of the target image (Figure 11.1b), regardless of the content of the pictures. Consider the two pictures as two sets of three-dimensional (3D) color pixels. One way of treating the recoloring problem is to find a one-to-one color mapping that is applied to every pixel in the original image. For example, in Figure 11.1, every blue pixel is recolored green. The new picture is identical in every aspect to the original picture, except that it now exhibits the same color statistics, or palette, as the target picture.

Other color grading workflows could also be considered. Another way of treating the problem would be to recolor the picture in compliance with the actual 3D illumination structure of the scene. This idea has been explored by Shen and Xin [14] with the aim of producing realistic recoloring under different illuminants. The method, derived from a dichromatic reflection model of the 3D objects, is however limited in practice by the difficulty of retrieving the necessary underlying 3D information from real photographic material. Another approach to color grading is to combine the color transfer mapping with texture information. In particular, it is worth mentioning the colorization of grayscale images by Welsh et al.
[6], which draws on the recent success of nonparametric texture synthesis and transfer by Efros [15], [16] and Hertzmann [17]. The proposal is to replace the color of each pixel in the original image with the color of the best-matching pixel in the example image, where the best match is found by looking for the closest luminance image patch in the example image. The idea is that the color mapping will be more accurate if it is guided by the content information given by the neighboring texture. Bae et al. [18] have proposed decomposing the picture into a textured and a textureless layer, and then applying the color transfer separately to each layer. This chapter will, however, not consider texture information directly. Instead, it will focus solely on the problem of finding a color mapping that transfers the color statistics of an example picture to the original picture. Note that Welsh's and Bae's techniques still rely on this idea of transferring statistics, except that in their case the transfer is conditioned on the texture information.

The notion of transfer of statistics encompasses an entire range of possibilities, from simply matching the mean and variances [9] or the covariance [10], [19], to transferring the exact distribution of the color samples [11], [13], [20]. Thus, depending on how closely the graded picture should match the color distribution of the example image, different techniques can be used. As will be explained, finding a color mapping is closely related to the mass transportation problem, which has a well-established mathematical background [21], [22], [23]. This chapter therefore aims to conduct a comprehensive review of existing color statistics transfer techniques from this new perspective. The review, presented in Section 11.2, references existing work but also presents new techniques that could be used advantageously.
FIGURE 11.2 Example of grain induced by color transfer: (left) original image, and (right) image after mapping.

11.1.2 How to Deal with Content Variations

One important aspect of the color transfer problem is the change of content between the two pictures. Consider a pair of landscape images where the sky covers a larger area in one picture than in the other. When transferring the color from one picture to the other, the excess of sky color may be used in parts of the ground scenery of the other picture. This happens because all color transfer algorithms are sensitive to variations in the areas of the image occupied by the same color. They risk overstretching the color mappings and thus producing implausible, very grainy renderings, as visible in Figure 11.2. To deal with this issue, a simple solution [9], [10], [14] is to manually select swatches in both pictures and thus associate color clusters corresponding to the same content. This is tantamount to performing manual image segmentation, and is simply impractical for a large variety of images, and certainly for image sequences. The solution commonly adopted [11], [13] in color transfer techniques is simply to reduce the accuracy of both distributions by smoothing the color histograms. This simple technique avoids artifacts at the expense of an accurate transfer. Methods that use spatial information to constrain the color mapping [6], [7], [24] can be successful but are usually computationally demanding. Another solution is to restrict the variability of the color mapping. For example, Chang [25] proposed classifying the pixels of both images into a restricted set of basic color categories derived from psychophysiological studies (red, blue, pink, etc.). The color transfer then ensures, for instance, that bluish pixels remain bluish. This gives a more natural transformation.
The disadvantage is that it limits the range of possible color transfers. The solution adopted here is to treat the grainy artifacts after the color grading. The noise could be attenuated by employing various color filtering techniques [26], [27], but these may not be the best solution for the intended application. Instead, a dedicated postprocessing step is proposed, which aims to protect the original picture by preserving its original gradient field while retaining the color transfer. This balance is achieved using a variational approach inspired by Poisson editing techniques [28]. Protecting the gradient of the original picture protects the flat areas in particular and, more generally, results in the exact reproduction of the film grain/noise of the original image.

This chapter is organized as follows. A review of techniques for transferring color statistics is presented in Section 11.2, accompanied by a table of comparative results. The techniques suitable for finding linear mappings are presented in Section 11.3, and those for nonlinear mappings in Section 11.4. Section 11.5 deals with the impact of the color space. The problem of dealing with content variations and the steps of the regraining stage are then explained in Section 11.6. Section 11.7 presents some examples from applications where color transfer is applied, and Section 11.8 concludes the chapter.

11.2 Color Distribution Transfer

The notion of color statistics can be understood by considering that an image can be represented as a set of color samples. When working in the RGB color space, the image is represented by the set of RGB color samples $\{(R(i), G(i), B(i))\}_{1 \le i \le M}$, where $M$ is the number of pixels. In a probabilistic sense, these color samples are realizations of a 3D color random variable, which will be denoted $u$ for the original image and $v$ for the target palette image.
The color palettes of the original and target pictures then correspond to the distributions of $u$ and $v$. To simplify the presentation of the problem, it is assumed here that both distributions have absolutely continuous probability density functions (pdfs) $f$ and $g$. In practice, pdfs can only be approximated numerically. The simplest form of approximation is the color histogram. Since only a finite number of color samples are available, color histograms are usually very sparse and rough. The pdf estimates can be improved by smoothing the histograms or, even better, by using standard kernel density approaches. The reader is referred to Silverman [29] for more details on density estimation. One general idea of density estimation is that the amount of smoothing controls the accuracy of the pdfs. At the extreme, the pdfs can be approximated by simple multivariate Gaussian (MVG) distributions by estimating only the means and the covariance matrices of $u$ and $v$.

Now that the notion of a color statistic has been introduced, the color transfer problem can be defined as follows: find a continuously differentiable ($C^1$) mapping $u \mapsto t(u)$ such that the new color distribution of $t(u)$ matches the target distribution $g$. This problem, illustrated in Figure 11.3, is also known as the mass preserving transport problem in the mathematical literature [21], [22], [23]. To characterize the mapping, it is worth noticing that the mapping is, in essence, a change of variables. Thus the transfer equation can be written as:

$$f(u)\,du = g(v)\,dv \;\Rightarrow\; f(u) = g(t(u))\,\lvert \det J_t(u) \rvert \tag{11.1}$$

where $J_t(u)$ is the Jacobian of $t$ taken at $u$. Using the cumulative distribution functions $F$ and $G$ of $f$ and $g$, the condition for the mapping to realize the transfer is:

$$\forall u \in \mathbb{R}^N,\quad F(u) = G(t(u)) \tag{11.2}$$

It is essential to realize here that the color mapping corresponds to a warping of the cumulative distribution function and not of the pdf.
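The density estimates discussed above (raw histogram, smoothed histogram, MVG approximation) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the chapter's implementation: the function names, the 32-bin default and the separable Gaussian kernel (with `sigma` measured in bins) are assumptions made here, and color values are assumed to be normalized to [0, 1].

```python
import numpy as np


def color_samples(image):
    """Flatten an H x W x 3 image into the M x 3 set of color samples."""
    return np.asarray(image, dtype=float).reshape(-1, 3)


def smoothed_histogram(samples, bins=32, sigma=1.0, value_range=(0.0, 1.0)):
    """3D color histogram smoothed by a Gaussian kernel (sigma in bins).

    Assumes the kernel length 2 * radius + 1 does not exceed `bins`.
    """
    hist, _edges = np.histogramdd(samples, bins=bins, range=[value_range] * 3)
    radius = max(1, int(np.ceil(3 * sigma)))
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    # The 3D Gaussian kernel is separable: convolve along each axis in turn.
    for axis in range(3):
        hist = np.apply_along_axis(np.convolve, axis, hist, kernel, mode="same")
    return hist / hist.sum()  # renormalize: some mass leaks past the borders


def mvg_parameters(samples):
    """Coarsest approximation of the pdf: mean vector and covariance matrix."""
    return samples.mean(axis=0), np.cov(samples, rowvar=False)
```

Increasing `sigma` trades pdf accuracy for robustness to sparse histograms; `mvg_parameters` is the extreme case where only the means and covariances are kept. For 8-bit data, `value_range=(0, 256)` would be the natural choice.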
In grayscale images [30], where $N = 1$, it is possible to invert the cumulative distribution function $G$, and the mapping has this simple form:

$$\forall u \in \mathbb{R},\quad t(u) = G^{-1}(F(u)) \tag{11.3}$$

where $G^{-1}(\alpha) = \inf\{u \mid G(u) \ge \alpha\}$. The mapping can then easily be computed using discrete look-up tables.

FIGURE 11.3 Distribution transfer concept: how can a mapping $u \mapsto t(u)$ be found that transforms the initial pdf $f$ of the random variable $u$ (left) into the target pdf $g$ for $t(u)$ (right)?

For higher dimensions, that is, for color images, the cumulative distribution function cannot be inverted, since multiple solutions are possible. Thus there are multiple ways of finding a valid mapping. In fact, numerous methods have been proposed and successfully applied to color transfer; these are presented in Sections 11.3 and 11.4. The problem encountered with most of them is that the geometry of the resulting mapping might not be as intended: an exact transfer could, for example, map black pixels to white and white pixels to black. The resulting picture would have the expected color proportions, but locally the colors would have been swapped. To avoid this effect, a good solution is to further constrain the transfer problem and ask the mapping to also minimize its displacement cost:

$$I[t] = \int_u \lVert t(u) - u \rVert^2 f(u)\,du \tag{11.4}$$

Finding this minimal mapping is known as Monge's optimal transportation problem. Monge's problem has attracted major interest in mathematics in recent years [21], [22], [23], as it has been found to be relevant to many scientific fields, such as fluid mechanics.
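Equation 11.3 translates directly into a look-up table for integer grayscale data. The sketch below is illustrative: `pdf_transfer_1d` is a hypothetical name, and the input is assumed to hold integer levels in [0, `levels`). `np.searchsorted` on the cumulative histogram $G$ returns, for each value $F(u)$, the smallest level $v$ with $G(v) \ge F(u)$, which is exactly the generalized inverse $G^{-1}$ defined above.

```python
import numpy as np


def pdf_transfer_1d(source, target, levels=256):
    """Grayscale pdf transfer t(u) = G^{-1}(F(u)) implemented as a look-up table.

    source, target: integer arrays with values in [0, levels).
    Returns the remapped source and the look-up table itself.
    """
    F = np.cumsum(np.bincount(source.ravel(), minlength=levels)) / source.size
    G = np.cumsum(np.bincount(target.ravel(), minlength=levels)) / target.size
    # G^{-1}(alpha) = inf{v | G(v) >= alpha}: first level where G reaches F(u).
    lut = np.searchsorted(G, F, side="left")
    lut = np.clip(lut, 0, levels - 1)  # guards against round-off in G[-1]
    return lut[source], lut
```

Because $F$ and $G$ are nondecreasing, the resulting table is monotone, so dark levels stay dark. Applying the same table to each channel separately gives the separable transfer of Section 11.4.1, with the limitations discussed there.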
Another formulation of this problem is Kantorovitch's optimal transportation problem, which relaxes the one-to-one mapping constraint by allowing one color to be mapped into multiple colors:

$$I[\pi] = \iint_{u,v} \pi(u, v)\,\lVert v - u \rVert^2 \,du\,dv \tag{11.5}$$

The associations are described by $\pi(\cdot,\cdot)$, the joint pdf of $u$ and $v$, whose marginals are $f$ and $g$. The Kantorovitch solution is related to the Monge problem, since for continuous distributions it turns out that the Kantorovitch solution coincides with the Monge solution; that is, for continuous pdfs the best association is a one-to-one mapping: $\pi(u, v) = f(u)\,\delta(v - t(u))$. In the rest of the chapter, the problem will thus be referred to as the Monge-Kantorovitch (MK) problem.

This chapter is not intended to be a course on the Monge-Kantorovitch problem, and the reader eager to develop a better mathematical insight is invited to consult the more specialized mathematical literature [21], [22], [23]. To understand the interest of the MK solution in color grading, three aspects of the MK solution will, however, be reported here. The first important result is that the MK solution always exists for continuous pdfs and is unique. This means that there is no room left for ambiguity. The second property is that the MK solution is consistent under orthogonal changes of basis (it does not favor one component over another). The last result of interest here is that the MK solution is the gradient of a convex function¹:

$$t = \nabla\phi, \quad \text{where } \phi : \mathbb{R}^N \to \mathbb{R} \text{ is convex} \tag{11.6}$$

This might seem quite obscure at first sight, but this property is the equivalent of monotonicity for one-dimensional (1D) functions on $\mathbb{R}$. It is thus quite intuitive for the color transfer problem. For instance, the brightest areas of a picture will still remain the brightest areas after mapping.
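A small numerical experiment illustrates why the displacement cost of Equation 11.4 rules out the black/white swap described earlier. For two equal-size sets of 1D samples, the monotone pairing (k-th smallest with k-th smallest) and the reversed pairing produce exactly the same output histogram, but the monotone one has the smaller mean squared displacement; by the rearrangement inequality it is the minimal pairing. The helper names and the equal-size assumption are illustrative choices made here.

```python
import numpy as np


def displacement_cost(u, mapped):
    """Empirical version of Equation 11.4: mean squared displacement."""
    return float(np.mean((mapped - u) ** 2))


def monotone_assignment(u, v):
    """Pair the k-th smallest u with the k-th smallest v (1D MK mapping).

    Assumes len(u) == len(v).
    """
    mapped = np.empty_like(v, dtype=float)
    mapped[np.argsort(u)] = np.sort(v)
    return mapped


def antitone_assignment(u, v):
    """Pair the k-th smallest u with the k-th largest v: the output histogram
    is the same, but dark and bright values are swapped."""
    mapped = np.empty_like(v, dtype=float)
    mapped[np.argsort(u)] = np.sort(v)[::-1]
    return mapped
```

Both assignments reproduce the target distribution exactly, so the distribution constraint alone cannot tell them apart; only the displacement cost favors the order-preserving one.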
¹ A convex function [22] $\phi : \mathbb{R}^N \to \mathbb{R}$ is such that $\forall u_1, u_2 \in \mathbb{R}^N,\ \alpha \in [0, 1],\ \phi(\alpha u_1 + (1 - \alpha) u_2) \le \alpha \phi(u_1) + (1 - \alpha)\phi(u_2)$.

The following presents techniques that have been explored in the color grading literature, starting with methods that consider only a linear transformation of the color samples and then presenting solutions that can match any color distribution. Most of these techniques do not solve the Monge-Kantorovitch problem. In particular, the existing literature in the domain does not propose the MK solution for the linear case; it is introduced here and turns out to outperform the other linear methods. The different techniques are compared with each other so that the reader can make an independent judgment.

11.3 Linear Color Distribution Transfer Techniques

The linear case considers the problem of finding linear mappings of the form $t(u) = Tu + t_0$, where $T$ is an $N \times N$ matrix. It is not necessarily possible to find a linear mapping in the general case, but this can always be achieved when both the original distribution $f$ and the target distribution $g$ are multivariate Gaussian (MVG) distributions $\mathcal{N}(\mu_u, \Sigma_u)$ and $\mathcal{N}(\mu_v, \Sigma_v)$:

$$f(u) = \frac{1}{(2\pi)^{N/2} \lvert\Sigma_u\rvert^{1/2}} \exp\!\left(-\tfrac{1}{2}(u - \mu_u)^T \Sigma_u^{-1} (u - \mu_u)\right), \qquad g(v) = \frac{1}{(2\pi)^{N/2} \lvert\Sigma_v\rvert^{1/2}} \exp\!\left(-\tfrac{1}{2}(v - \mu_v)^T \Sigma_v^{-1} (v - \mu_v)\right) \tag{11.7}$$

with $\Sigma_u$ and $\Sigma_v$ the covariance matrices of $u$ and $v$. Note that when the distributions are not MVG, an MVG approximation can always be obtained by estimating the means and the covariance matrices of the distributions. To satisfy the pdf transfer condition $g(t(u)) \propto f(u)$, it must hold that $(t(u) - \mu_v)^T \Sigma_v^{-1} (t(u) - \mu_v) = (u - \mu_u)^T \Sigma_u^{-1} (u - \mu_u)$. Thus $t$ must satisfy the following condition:

$$t(u) = T(u - \mu_u) + \mu_v \quad \text{with} \quad T^T \Sigma_v^{-1} T = \Sigma_u^{-1} \tag{11.8}$$

It turns out that there are numerous solutions for the matrix $T$, and thus multiple ways of transferring the color statistics.
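A short derivation makes the constraint on $T$ in Equation 11.8 easier to check in practice. Assuming $T$ is invertible, inverting both sides shows that the condition simply states that the covariance of the mapped samples equals $\Sigma_v$, while the offset ensures that their mean equals $\mu_v$:

```latex
\begin{aligned}
T^T \Sigma_v^{-1} T = \Sigma_u^{-1}
  &\;\Longleftrightarrow\; T^{-1} \Sigma_v T^{-T} = \Sigma_u
  \;\Longleftrightarrow\; \Sigma_v = T \Sigma_u T^T, \\
\operatorname{Cov}[t(u)] &= \operatorname{Cov}[T(u - \mu_u)] = T \Sigma_u T^T,
\qquad
\mathbb{E}[t(u)] = T(\mathbb{E}[u] - \mu_u) + \mu_v = \mu_v .
\end{aligned}
```

Any invertible $T$ with $T \Sigma_u T^T = \Sigma_v$ is therefore a valid linear transfer; the methods of this section differ only in which of these solutions they pick.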
11.3.1 Independent Transfer

The first method, used by Reinhard et al. [31] in their original paper on color transfer, is simply to match the means and variances of each component independently. This amounts to assuming that both distributions are separable, so that the covariance matrices are diagonal: $\Sigma_u = \mathrm{diag}(\mathrm{var}(u_1), \ldots, \mathrm{var}(u_N))$ and $\Sigma_v = \mathrm{diag}(\mathrm{var}(v_1), \ldots, \mathrm{var}(v_N))$. The mapping then becomes:

$$T = \begin{pmatrix} \sqrt{\mathrm{var}(v_1)/\mathrm{var}(u_1)} & & 0 \\ & \ddots & \\ 0 & & \sqrt{\mathrm{var}(v_N)/\mathrm{var}(u_N)} \end{pmatrix} \tag{11.9}$$

The independence assumption is simplistic, since it rarely holds for real images, as the poor quality of the transfer in Figure 11.4c shows. The solution proposed by Reinhard is to work in the decorrelated color space $l\alpha\beta$ of Ruderman [32]. This helps to some extent but cannot guarantee full decorrelation between components.

11.3.2 Cholesky Decomposition

Since both matrices $\Sigma_u$ and $\Sigma_v$ are symmetric and (for nondegenerate data) positive definite, a solution is to use the Cholesky decompositions $\Sigma_u = L_u L_u^T$ and $\Sigma_v = L_v L_v^T$, where $L_u$ and $L_v$ are lower triangular matrices with strictly positive diagonal elements. This decomposition yields the following solution:

$$T = L_v L_u^{-1} \tag{11.10}$$

Note that this solution depends on the ordering of the color components. Nevertheless, Figure 11.4d shows some improvement over the previous method. This method is quite successful in practice but is not as reliable as the MK solution.

11.3.3 Principal Axes Transfer

Another popular solution [10], [19], [33], [34], [35] is to find the mapping that realigns the principal axes of $\Sigma_v$ with those of $\Sigma_u$. This can be done using the square root operator for symmetric positive semidefinite matrices. The square root matrices $\Sigma_u^{1/2}$ and $\Sigma_v^{1/2}$ are obtained through the spectral decompositions of $\Sigma_u$ and $\Sigma_v$:

$$\Sigma_u = P_u D_u P_u^T \quad \text{and} \quad \Sigma_u^{1/2} = P_u D_u^{1/2} P_u^T \tag{11.11}$$
$$\Sigma_v = P_v D_v P_v^T \quad \text{and} \quad \Sigma_v^{1/2} = P_v D_v^{1/2} P_v^T \tag{11.12}$$

where $P_u$ and $P_v$ are orthogonal matrices, and $D_u$ and $D_v$ are the diagonal matrices containing the (nonnegative) eigenvalues of $\Sigma_u$ and $\Sigma_v$. Note that the square roots $\Sigma_u^{1/2}$ and $\Sigma_v^{1/2}$ are uniquely defined. These decompositions lead to the following mapping:

$$T = \Sigma_v^{1/2} \Sigma_u^{-1/2} \tag{11.13}$$

FIGURE 11.4 (See color insert.) Results for linear techniques: (a) original image, (b) target palette, (c) separable linear transfer, (d) Cholesky based transfer, (e) principal axes transfer, and (f) linear Monge-Kantorovitch transfer. All transfers are done in the RGB color space.

The results displayed in Figure 11.4e show an improvement over the mapping based on the Cholesky decomposition. Note in particular that the violet color of the grass visible in Figure 11.4d is not present here. This might be due to the fact that the mapping does not depend on the ordering of the color components.

11.3.4 Linear Monge-Kantorovitch Solution

A better approach is to use the Monge-Kantorovitch solution for MVG distributions. The MK solution is geometrically more intuitive than the Cholesky or the principal axes solution. One could be concerned that the MK solution might not be linear, but it is in fact linear and admits a simple closed form. The detailed proof of how to find the MK mapping can be found in Reference [36]. The mainstay of the reasoning is that, since the MK solution is the gradient of a convex function, the matrix $T$ has to be symmetric positive definite, which leads to this unique solution for $T$:

$$T = \Sigma_u^{-1/2} \left(\Sigma_u^{1/2} \Sigma_v \Sigma_u^{1/2}\right)^{1/2} \Sigma_u^{-1/2} \tag{11.14}$$

The corresponding results in Figure 11.4f are convincing. They are slightly better than those of the principal axes method. For instance, a pink trace on the grass in front of the house is visible in Figure 11.4e but not in Figure 11.4f.
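The four linear mappings of Equations 11.9, 11.10, 11.13 and 11.14 can be compared side by side with NumPy. This is a sketch operating on M × N arrays of color samples (for example, pixels reshaped to M × 3); the function names and method keywords are illustrative choices made here, and the covariance matrices are assumed to be positive definite. Each method returns a $T$ for which the mapped samples have covariance $T \Sigma_u T^T$; this equals $\Sigma_v$ exactly for the Cholesky, principal axes and MK methods, but only on the diagonal for the independent transfer.

```python
import numpy as np


def sqrtm_spd(S):
    """Symmetric square root via the spectral decomposition S = P D P^T."""
    d, P = np.linalg.eigh(S)
    return (P * np.sqrt(np.clip(d, 0.0, None))) @ P.T


def linear_transfer(u, v, method="mk"):
    """Map samples u (M x N) so that their mean and covariance follow v.

    Implements t(u) = T (u - mu_u) + mu_v with T from Equations 11.9-11.14.
    """
    mu_u, mu_v = u.mean(axis=0), v.mean(axis=0)
    S_u, S_v = np.cov(u, rowvar=False), np.cov(v, rowvar=False)
    if method == "independent":        # Equation 11.9: per-channel std ratio
        T = np.diag(np.sqrt(np.diag(S_v) / np.diag(S_u)))
    elif method == "cholesky":         # Equation 11.10
        T = np.linalg.cholesky(S_v) @ np.linalg.inv(np.linalg.cholesky(S_u))
    elif method == "principal_axes":   # Equation 11.13
        T = sqrtm_spd(S_v) @ np.linalg.inv(sqrtm_spd(S_u))
    elif method == "mk":               # Equation 11.14 (linear Monge-Kantorovitch)
        Su_half = sqrtm_spd(S_u)
        Su_inv_half = np.linalg.inv(Su_half)
        T = Su_inv_half @ sqrtm_spd(Su_half @ S_v @ Su_half) @ Su_inv_half
    else:
        raise ValueError(f"unknown method: {method}")
    # Rows are samples, so t(u) = T (u - mu_u) + mu_v becomes (u - mu_u) T^T + mu_v.
    return (u - mu_u) @ T.T + mu_v, T
```

For example, `graded, T = linear_transfer(u, v, method="mk")` grades the samples `u` towards `v`; only the MK variant is guaranteed to return a symmetric positive definite `T`, which is the property the chapter uses to single it out.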
The MK solution is thus interesting, as it provides an intuitive and probably better mapping at a computational complexity similar to that of the popular methods.

11.4 Nonlinear Color Distribution Transfer Techniques

11.4.1 Independent Transfer

Unfortunately, extending the solutions for MVG to arbitrary distributions is more complicated. The approach currently used in professional color grading tools is to extend the pdf transfer to higher dimensions by performing the 1D nonlinear pdf matching separately for each channel [37]. As in the linear case, this is only exact if both distributions are separable, i.e., if the joint distribution is the product of its marginals:

$$f(u_1, u_2, \ldots, u_N) = f(u_1) f(u_2) \cdots f(u_N) \tag{11.15}$$

This is, however, not realistic in most situations, as the result displayed in Figure 11.5c reflects.

11.4.2 Composition Transfer

This method, which has been applied to the color transfer problem by Neumann [11], builds on the fact that the correlated variables $u_1, \ldots, u_N$ can be recombined into the following independent variables:

$$u_1 \sim f(u_1), \quad u_2 \sim f(u_2 \mid u_1), \quad \ldots, \quad u_N \sim f(u_N \mid u_1, \ldots, u_{N-1}) \tag{11.16}$$

The independence becomes apparent from the following conditional decomposition:

$$f(u_1, u_2, \ldots, u_N) = f(u_1)\, f(u_2 \mid u_1) \cdots f(u_N \mid u_1, \ldots, u_{N-1}) \tag{11.17}$$

Since these conditional variables are independent, it is possible to use 1D pdf transfers separately for each of them. The final mapping is then composed as follows:

$$t(u_1, \ldots, u_N) = \left(t_1(u_1),\ t_2(u_2 \mid u_1),\ \ldots,\ t_N(u_N \mid u_1, \ldots, u_{N-1})\right) \tag{11.18}$$

and each 1D mapping $t_1, \ldots, t_N$ is found using the corresponding pdf transfer:

$$\begin{aligned} t_1(u_1) &: f(u_1) \Rightarrow g(v_1) \\ t_2(u_2 \mid u_1) &: f(u_2 \mid u_1) \Rightarrow g(v_2 \mid v_1) \\ &\;\;\vdots \\ t_N(u_N \mid u_1, \ldots, u_{N-1}) &: f(u_N \mid u_1, \ldots, u_{N-1}) \Rightarrow g(v_N \mid v_1, \ldots, v_{N-1}) \end{aligned} \tag{11.19}$$

FIGURE 11.5 (See color insert.)
Results for different scenarios: (a) original image, (b) target palette, (c) separable transfer, (d) composition transfer, (e) discrete Kantorovitch, and (f) IDT.

This technique suffers from two main problems. Firstly, the mapping itself depends heavily on the order in which the variables are conditioned on each other. For instance, matching $f(u_1)$ and then $f(u_2 \mid u_1)$ results in a different mapping than matching $f(u_2)$ and then $f(u_1 \mid u_2)$. Secondly, even for large pictures, the estimation of conditional marginals like $f(u_N \mid u_1, \ldots, u_{N-1})$ is based on only a very small number of color samples, since only a few pixels will have exactly the same color. In practice, for RGB color grading, this means that even if the transfer of the first (red) component is accurate, the transfer of the last (blue) component is poor. This is reflected in Figure 11.5d, where the blue gain has been overestimated. The situation can be improved by using a proper density estimation scheme [29], which mainly implies smoothing the original 3D histogram. Note that smoothing the pdfs is a delicate operation that requires some computational time for large color histograms.

11.4.3 The Discrete Kantorovitch Solution

As in the linear case, the Monge-Kantorovitch solution seems more appropriate here. Numerical solutions exist for N-dimensional variables, but they involve heavy computational loads, as they require an iterative partial differential equation solver in the N-dimensional distribution [21]. To reduce the computational complexity, however, it is possible to quantize the pdf into a smaller number of colors. Since the pdfs are then discrete, it might not be possible to find a one-to-one mapping that transfers one pdf exactly onto the other. Instead, the one-to-one mapping assumption is relaxed, and one histogram bin is allowed to be mapped onto multiple bins.
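The relaxed, bin-to-bin transport problem (formalized in Equation 11.20) can be illustrated on small histograms. The sketch below is not the Simplex-based solver used for Figure 11.5e. It builds a feasible plan $\pi$ with the northwest corner rule, which is optimal only in special cases such as 1D histograms whose bins are sorted in increasing order (where it reproduces the monotone mapping); for 3D color histograms it is merely a starting point that a linear programming solver (for instance, `scipy.optimize.linprog`) would refine. It then assigns each pixel a target bin at random in proportion to $\pi$, in the spirit of the randomized dithering described in the text. All function names are hypothetical.

```python
import numpy as np


def northwest_corner(f, g, tol=1e-12):
    """Feasible transport plan pi with row sums f and column sums g."""
    f = np.array(f, dtype=float)  # copies: the residual masses are consumed
    g = np.array(g, dtype=float)
    pi = np.zeros((len(f), len(g)))
    i = j = 0
    while i < len(f) and j < len(g):
        mass = min(f[i], g[j])    # move as much mass as possible from bin i to bin j
        pi[i, j] += mass
        f[i] -= mass
        g[j] -= mass
        if f[i] <= tol:           # source bin i is exhausted: go down one row
            i += 1
        else:                     # otherwise target bin j is full: go right
            j += 1
    return pi


def transport_cost(pi, src_colors, dst_colors):
    """Total cost sum_ij pi_ij |u_i - v_j|^2 (objective of Equation 11.20)."""
    diff = src_colors[:, None, :] - dst_colors[None, :, :]
    return float(np.sum(pi * np.sum(diff ** 2, axis=-1)))


def dither_assign(pixel_bins, pi, rng):
    """For each pixel in source bin i, draw a target bin j with probability
    pi[i, j] / sum_j pi[i, j] (randomized dithering)."""
    row_mass = pi.sum(axis=1, keepdims=True)
    probs = np.divide(pi, row_mass, out=np.zeros_like(pi), where=row_mass > 0)
    return np.array([rng.choice(pi.shape[1], p=probs[i]) for i in pixel_bins])
```

As the text notes, finer histograms make each row of $\pi$ more concentrated, so the random assignment degenerates towards a one-to-one mapping.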
The problem is then to find the flow $\pi$ (joint distribution) that minimizes the overall transportation cost:

$$\hat{\pi} = \arg\min_{\pi} \sum_{i,j} \pi(u_i, v_j)\,\lvert u_i - v_j \rvert^2 \tag{11.20}$$

subject to $\forall j,\ \sum_i \pi(u_i, v_j) = g(v_j)$ and $\forall i,\ \sum_j \pi(u_i, v_j) = f(u_i)$. Note that this formulation of the transportation problem was introduced to the computer vision community by Rubner under the name of Earth Mover's Distance [38]. The discretized problem can be solved numerically by linear programming using the Simplex algorithm [39]. Specialized methods for the transportation problem also exist, such as the northwest corner method and Vogel's approximation method for constructing initial feasible solutions [40]. Since a color can be mapped into multiple colors, the issue now is deciding which color to assign to a particular pixel. Morovic proposes to decide for each pixel on a random basis. This process is, in essence, similar to randomized dithering. One remarkable result of the MK theory is that, in the continuous case, the Kantorovitch solution is actually Monge's one-to-one mapping. This simply means that increasing the number of bins will reduce the dithering and give a better approximation of Monge's color mapping. The problem is that the Simplex algorithm is quite slow and becomes intractable when dealing with large histograms. The result in Figure 11.5e has been obtained for a color histogram of 300 color bins. Despite these limitations, the method still produces visibly better results than the previously presented methods.

11.4.4 Transfer via the Radon Transform

Recently, Pitié et al. [41] have proposed another solution to the distribution transfer problem, based on the iterative use of 1D transfers along various directions in the N-dimensional space. The idea is to break the problem down into a succession of 1D distribution transfer problems. Consider the use of the N-dimensional Radon Transform.
It is widely acknowledged that, via the Radon Transform, any N-dimensional function can be uniquely described as a series of projections onto 1D axes [42] (see Figure 11.6). In this case, the function considered is an N-dimensional pdf, hence the Radon Transform projections result in a series of 1D marginal pdfs, from which the corresponding 1D pdf transfers along these axes can be derived. Intuitively then, operations on the N-dimensional pdf should be possible by applying the 1D pdf transfer along these axes. Suppose that, after some sequence of such manipulations, all 1D marginals match the corresponding marginals of the target distribution. It then follows, by the nature of the Radon Transform, that the transformed $f$, corresponding to the transformed 1D marginals, now matches $g$.

The operation applied to the projected marginal distributions is thus similar to the one used in one dimension. Denote a particular axis by its direction vector $e \in \mathbb{R}^N$. The projections of both pdfs $f$ and $g$ onto the axis $e$ result in two 1D marginal pdfs $f_e$ and $g_e$. Using the 1D pdf transfer mapping of Equation 11.3 yields a 1D mapping $t_e$ along this axis:

$$\forall u \in \mathbb{R},\quad t_e(u) = G_e^{-1}(F_e(u)) \tag{11.21}$$

For an N-dimensional sample $u = [u_1, \ldots, u_N]^T$, the projection of the sample onto the axis is given by the scalar product $e^T u = \sum_i e_i u_i$, and the corresponding displacement along the axis is

$$u \to u + \delta \quad \text{with} \quad \delta = \left(t_e(e^T u) - e^T u\right) e \tag{11.22}$$

FIGURE 11.6 N-dimensional Radon transform of a distribution's pdf. The pdf is projected onto every possible axis. Each projection corresponds to a 1D marginal of the distribution. The ensemble of all projections uniquely describes the distribution.

ALGORITHM 11.1 Iterative distribution transfer using the Radon transform.
1. Initialization: given the source data set $u$, set the iteration counter and the displacement to zero: $k \leftarrow 0$, $\delta^{(0)} \leftarrow 0$.
2. Repeat the following:
   • Take a rotation matrix $R = [e_1, \ldots, e_N]$.
   • For every rotated axis $i$, get the projections $f_i^{(k)}$ of $\{u + \delta^{(k)}\}$ and $g_i$ of $\{v\}$.
   • For every rotated axis $i$, find the 1D transformation $t_i$ that matches the marginal $f_i^{(k)}$ to $g_i$.
   • Update the displacement: $\delta^{(k+1)} = \delta^{(k)} + R \begin{pmatrix} t_1(e_1^T(u + \delta^{(k)})) - e_1^T(u + \delta^{(k)}) \\ \vdots \\ t_N(e_N^T(u + \delta^{(k)})) - e_N^T(u + \delta^{(k)}) \end{pmatrix}$.
   • Update $k \leftarrow k + 1$.
   until convergence on all marginals for every possible rotation.
3. The final one-to-one mapping $t$ is given by $\forall j,\ u_j \to t(u_j) = u_j + \delta_j^{(\infty)}$.

FIGURE 11.7 Illustration of the data manipulation based on the 1D pdf transfer along one axis: the 1D pdfs of the initial 1D marginal and of the target 1D marginal give the mapping; after mapping, the new 1D marginal of the initial 2D pdf matches the target marginal pdf. After transformation, the projection $f_e$ of the new distribution $f$ is identical to $g_e$.

The manipulation is explained in Figure 11.7. Considering that the operation can be done independently on orthogonal axes, the proposed manipulation consists of choosing an orthogonal basis $R = (e_1, \ldots, e_N)$ and then applying the following mapping:

$$u \to u + R \begin{pmatrix} t_1(e_1^T u) - e_1^T u \\ \vdots \\ t_N(e_N^T u) - e_N^T u \end{pmatrix} \tag{11.23}$$

where $t_i$ is the 1D pdf transfer mapping for the axis $e_i$. The idea is that iterating this manipulation over different axes will result in a sequence of distributions $f^{(k)}$ that hopefully converges to the target distribution $g$. The overall algorithm (Algorithm 11.1) will be referred to as iterative distribution transfer (IDT). A theoretical and numerical study of the method is developed in more depth in Reference [41]. The experimental study strongly suggests that convergence occurs for any distribution. The study also considers the problem of finding a sequence of axes that maximizes the convergence speed of the algorithm.
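Algorithm 11.1 can be sketched compactly with NumPy on M × N arrays of samples. This sketch is illustrative rather than the reference implementation of [41]: each 1D transfer of Equation 11.21 is computed from empirical quantiles with `np.interp`, and random orthogonal bases drawn from a QR decomposition stand in for the optimized rotations of Table 11.1. The function names, `n_iter` and `seed` are assumptions made here; each iteration applies Equation 11.23 to all samples at once.

```python
import numpy as np


def match_1d(x, y):
    """1D pdf transfer t(x) = G^{-1}(F(x)) estimated from samples x and y."""
    ranks = np.empty(len(x))
    ranks[np.argsort(x)] = (np.arange(len(x)) + 0.5) / len(x)   # F(x) in (0, 1)
    levels = (np.arange(len(y)) + 0.5) / len(y)
    return np.interp(ranks, levels, np.sort(y))                   # G^{-1}(F(x))


def idt(u, v, n_iter=20, seed=0):
    """Iterative distribution transfer of samples u (M x N) towards v (K x N)."""
    rng = np.random.default_rng(seed)
    n = u.shape[1]
    x = np.array(u, dtype=float)              # current samples u + delta^(k)
    for _ in range(n_iter):
        # Random orthogonal basis R = [e_1, ..., e_N] (its columns) via QR.
        R, _upper = np.linalg.qr(rng.normal(size=(n, n)))
        px, pv = x @ R, v @ R                 # column i holds e_i^T x and e_i^T v
        moved = np.column_stack([match_1d(px[:, i], pv[:, i]) for i in range(n)])
        x = x + (moved - px) @ R.T            # Equation 11.23, for every sample
    return x
```

Because every step is a one-to-one displacement of the samples, no dithering is needed; in an image setting, the pixel colors would be reshaped to M × 3, passed through `idt`, and reshaped back.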
The sequence of axes is designed to minimize the correlation between the directions. These directions for N = 3 are listed in Table 11.1.

As in the composition transfer case, the manipulations are based on 1D marginals. The difference is that the marginals are not based on conditional probabilities, so the operation is independent of the channel ordering. Also, the estimation of the marginals does not suffer from data sparseness, and the N-dimensional pdf does not need to be smoothed. In contrast with the discrete Kantorovitch method, the method works with a one-to-one mapping throughout, so no dithering postprocess is necessary. Figure 11.5f shows that the results are quite similar to the discrete Kantorovitch solution.

TABLE 11.1
Optimized rotations for N = 3.

No. 1    x    1            0            0
         y    0            1            0
         z    0            0            1
No. 2    x    0.333333     0.666667     0.666667
         y    0.666667     0.333333    −0.666667
         z   −0.666667     0.666667    −0.333333
No. 3    x    0.577350     0.211297     0.788682
         y   −0.577350     0.788668     0.211352
         z    0.577350     0.577370    −0.577330
No. 4    x    0.577350     0.408273     0.707092
         y   −0.577350    −0.408224     0.707121
         z    0.577350    −0.816497     0
No. 5    x    0.332572     0.910758     0.244778
         y   −0.910887     0.242977     0.333536
         z   −0.244295     0.333890    −0.910405
No. 6    x    0.243799     0.910726     0.333376
         y    0.910699    −0.333174     0.244177
         z   −0.333450    −0.244075     0.910625
No. 7    x   −0.109199     0.810241     0.575834
         y    0.645399     0.498377    −0.578862
         z    0.756000    −0.308432     0.577351
No. 8    x    0.759262     0.649435    −0.041906
         y    0.143443    −0.104197     0.984158
         z    0.634780    −0.753245    −0.172269
No. 9    x    0.862298     0.503331    −0.055679
         y   −0.490221     0.802113    −0.341026
         z   −0.126988     0.321361     0.938404
No. 10   x    0.982488     0.149181     0.111631
         y    0.186103    −0.756525    −0.626926
         z   −0.009074     0.636722    −0.771040
No. 11   x    0.687077    −0.577557    −0.440855
         y    0.592440     0.796586    −0.120272
         z   −0.420643     0.178544    −0.889484
No. 12   x    0.463791     0.822404     0.329470
         y    0.030607    −0.386537     0.921766
         z   −0.885416     0.417422     0.204444

11.5 What Color Space to Choose?
If a method can exactly transfer the complete color statistics, then it will work regardless of the chosen color space. In that respect, the color space does not matter for transferring the color feel of an image. The color space does, however, influence the geometrical form of the color mapping. In the MK formulation, for instance, the cost of transportation is related to the color difference, which depends on the chosen color space. Thus more perceptually uniform color spaces such as YUV, or better still CIELAB and CIELUV, are preferable to basic RGB for obtaining coherent color mappings.

Figure 11.8 shows comparative results for RGB, YUV, CIE XYZ, CIELAB, and CIELUV for the linear MK solution, and Figure 11.9 shows them for the proposed IDT method. Differences may be difficult to see on printed images, but they are very significant when displayed on a large screen, especially in an image sequence that is played back. Overall, the CIELAB color space offers better renderings in both the linear and the nonlinear case. This is likely because CIELAB is designed so that Euclidean distances between colors approximate perceived color differences.

FIGURE 11.8 (See color insert.) Results of the linear Monge-Kantorovitch transfer for different color spaces: (a) target color palette, (b) RGB, (c) YUV, (d) XYZ, (e) CIELAB, and (f) CIELUV.

FIGURE 11.9 (See color insert.) Results of the IDT transfer for different color spaces: (a) target color palette, (b) RGB, (c) YUV, (d) XYZ, (e) CIELAB, and (f) CIELUV.
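The sketch below shows how the choice of color space changes the transport cost. It converts sRGB values to CIELAB with the standard sRGB/D65 formulas and measures color differences as the Euclidean distance ΔE*ab. The chapter does not state which RGB primaries or white point its experiments used, so sRGB with a D65 white is an assumption here. The function names are illustrative.

```python
import math

# Linear sRGB (D65) to CIE XYZ matrix; inputs are linearized RGB in [0, 1].
_M = ((0.4124564, 0.3575761, 0.1804375),
      (0.2126729, 0.7151522, 0.0721750),
      (0.0193339, 0.1191920, 0.9503041))
_WHITE = (0.95047, 1.0, 1.08883)  # D65 reference white (Xn, Yn, Zn)


def _linearize(c):
    """Undo the sRGB transfer curve for one channel value in [0, 1]."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def _f(t):
    """CIELAB companding function, with its linear segment near zero."""
    d = 6.0 / 29.0
    return t ** (1.0 / 3.0) if t > d ** 3 else t / (3.0 * d ** 2) + 4.0 / 29.0


def srgb_to_lab(rgb):
    """Convert an sRGB triple with components in [0, 1] to (L*, a*, b*)."""
    lin = [_linearize(c) for c in rgb]
    xyz = [sum(m * c for m, c in zip(row, lin)) for row in _M]
    fx, fy, fz = (_f(v / w) for v, w in zip(xyz, _WHITE))
    return (116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz))


def delta_e(c1, c2):
    """Euclidean color difference (Delta E*ab when both inputs are Lab)."""
    return math.dist(c1, c2)
```

For example, the RGB pairs (0, 0, 0.1)/(0, 0, 0.2) and (0, 0.8, 0)/(0, 0.9, 0) are equally far apart in RGB, yet `delta_e` applied to their CIELAB values gives clearly different distances. An MK cost measured in RGB would treat the two pairs as equivalent.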
FIGURE 11.10 Result of grain reduction: (a, b) two consecutive archive frames suffering from extreme brightness variation known as flicker [45], (c) mapped original frame, (d) mapped original frame after grain artifact reduction, and (e) employed mapping transformation. Since the corresponding mapping transformation is overstretched, the mapped original frame has an increased level of noise. The proposed grain artifact reducer is able to reproduce the noise level of the original picture. The top of the original picture is saturated and cannot be retrieved, but the algorithm succeeds in preserving the soft gradient.

FIGURE 11.11 Grain artifact reduction in a color picture: (a) original image, (b) after IDT color transfer, and (c) after regraining.

11.6 Reducing Grain Noise Artifacts
Figure 11.10 and Figure 11.11 show that mapping the colors of a picture can produce grain artifacts. When the content differs, or when the dynamic ranges of the two pictures are very different, the resulting mapping function can be stretched in some parts (see Figure 11.10e), which increases the noise level (see Figure 11.10c). A simple linear transformation t of the original picture, t(u) = au + b, illustrates this. The overall variance of the resulting picture becomes var{t(u)} = a² var{u}, so a greater stretching (|a| > 1) amplifies the noise. The problem is that it is impossible to say a priori whether a large difference between the distributions is due to a drastic color mapping, which is always possible, or to a content variation. The difficulty of differentiating these cases is the Achilles' heel of example-based color transfer techniques.
Since any distribution can be mapped onto any other distribution, it is impossible to impose a prior on the distributions. This issue remains an open problem, but a few ad hoc solutions can help. Solutions proposed in the literature [11], [13] apply a preprocessing smoothing operation to the pdfs. As described hereafter, the motivation for these solutions is to reduce the over-stretching of the mapping. This chapter also proposes a postprocessing method that protects the original picture gradient and thus reduces the amount of artifacts.

11.6.1 Reducing the Stretching by Adjusting the Distributions
One origin of the artifacts is that the target pdf is usually only a rough estimate of what the true palette should be. This means that using very accurate pdf approximations might lead to erroneous transfers. In that sense, simple MVG approximations are always a safer solution. Another advantage of linear mappings is that their stretching is constant over the color distribution, $|\det \nabla t(u)| = |\Sigma_v|^{1/2} / |\Sigma_u|^{1/2}$. The color deformation is thus also constant over the whole image, and the artifacts are uniformly distributed over the picture. Nonlinear mappings must, however, be used when dealing with complex illuminations. To avoid unnecessary finesse in the approximations, what is needed is a mechanism that controls the level of approximation. Smoothing the color histograms by a variable amount provides this control. The smoothing can be done with a kernel filter K of bandwidth h. For instance, one can use the Epanechnikov kernel, $K_h(u) = \frac{3}{4}\left(1 - \|u/h\|^2\right)$ for $\|u/h\| < 1$ and zero otherwise. The bandwidth parameter h then controls the smoothness of the pdf: decreasing h increases the level of detail of the pdf.
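The sketch below applies the histogram smoothing just described to a 1D marginal histogram of the kind IDT uses. The histogram is convolved with the Epanechnikov kernel K_h. A second function anticipates the peak-flattening operation that the next paragraph formalizes as Equation 11.24: it adds a small ε and raises the result to the power 1/p. The function names, the bin-unit bandwidth, and the final renormalization to unit mass are assumptions of this sketch, not details given in the chapter.

```python
def epanechnikov(u, h):
    """Epanechnikov kernel K_h(u) = 3/4 (1 - (u/h)^2) for |u/h| < 1, else 0."""
    r = u / h
    return 0.75 * (1.0 - r * r) if abs(r) < 1.0 else 0.0


def smooth_histogram(hist, h):
    """Convolve a 1D histogram (list of bin masses) with K_h; h is in bins."""
    n = len(hist)
    return [sum(epanechnikov(i - j, h) * hist[j] for j in range(n))
            for i in range(n)]


def flatten_peaks(hist, h, p=2.0, eps=1e-6):
    """(K_h * f + eps)^(1/p), renormalized to unit mass (sketch of Eq. 11.24)."""
    if p <= 1.0:
        raise ValueError("p must be greater than 1")
    smoothed = smooth_histogram(hist, h)
    out = [(v + eps) ** (1.0 / p) for v in smoothed]
    total = sum(out)
    return [v / total for v in out]
```

With p = 2, for example, two peaks whose masses are in the ratio 4:1 end up in the ratio 2:1. This limits how strongly a change in color proportions (such as a larger sky area) drives the mapping. Larger h spreads each bin over more neighbors and so removes finer detail.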
Smoothing the pdf reduces distortions that are due to fine disparities, but it does not specifically address the problem of a change in color proportions. Consider the example, discussed previously, in which the sky covers a larger area in one picture than in the other. The peak in the pdf corresponding to the blue color is at the same location in both pdfs, but the mass of this peak differs. Smoothing the pdf solves this problem only partially. A solution used by Neumann [11] and Pitié [13] is to reduce the relative size of each peak in the pdfs, which limits the variation in color proportions. Combining smoothing with this dominant-color correction yields the following ad hoc smoothing operation on the pdf:

$$\tilde{f}(u) = \big((K_h * f)(u) + \varepsilon\big)^{1/p} \qquad (11.24)$$

where the exponent p > 1 controls the relative size of the pdf peaks and ε avoids problems when f(u) = 0. When working with 1D marginals, as the IDT algorithm does, the smoothing can be applied to the 1D marginals rather than to the original N-dimensional pdf. Bear in mind, however, that this smoothing operation should not be used if the target distribution is actually the one desired, since the resulting mapping would then be erroneous. The best solution is to leave most of the correction to the regrain postprocessing explained in the following section.

11.6.2 Reducing the Artifacts by Adjusting the Gradient Fields
One way to reduce the grain is to run a postprocessing algorithm that forces the noise level to remain the same. The idea presented by Pitié et al. [41] is to adjust the gradient field of the resulting picture so that it matches the gradient field of the original picture. If the gradient fields of both pictures are similar, the noise level will be the same. Matching the gradient of a picture has been addressed in various computer vision applications, such as high dynamic range compression [4]; the value of this idea has been thoroughly demonstrated by Pérez et al.
in Reference [28]. A variational approach makes it possible to manipulate the image gradient efficiently. The problem here is slightly different, since recoloring also implies changing the contrast levels. Thus, the new gradient field should only loosely match the original gradient field.

Denote by I(x, y) the 3D original color picture. To simplify the discussion, coordinates are omitted in the expressions: I, J, ψ, φ, etc. actually refer to I(x, y), J(x, y), ψ(x, y), and φ(x, y). Let t : I → t(I) be the color transformation. The problem is to find a modified image J of the mapped picture t(I) that minimizes, over the whole picture range Ω:

$$\min_J \iint_\Omega \phi \cdot \|\nabla J - \nabla I\|^2 + \psi \cdot \|J - t(I)\|^2 \, dx\, dy \qquad (11.25)$$

with the Neumann boundary condition $\nabla J|_{\partial\Omega} = \nabla I|_{\partial\Omega}$, so that the gradient of J matches the gradient of I at the picture border ∂Ω. The term ‖∇J − ∇I‖² forces the image gradient to be preserved. The term ‖J − t(I)‖² ensures that the colors remain close to the target picture and thus protects the contrast changes. Without ‖J − t(I)‖², the solution of Equation 11.25 would simply be the original picture I (up to an additive constant). The weight fields φ(x, y) and ψ(x, y) set the relative importance of the two terms. Many choices are possible for φ and ψ, and the following study could easily be adapted to the specifications of the problem. The weight field φ(x, y) has been chosen here to emphasize that flat areas must remain flat, while the gradient may change at object borders:

$$\phi = \frac{30}{1 + 10\,\|\nabla I\|} \qquad (11.26)$$

The weight field ψ(x, y) accounts for the possible stretching of the transformation t, since the grain becomes more visible where ∇t is large:

$$\psi = \begin{cases} 2/\big(1 + \|(\nabla t)(I)\|\big) & \text{if } \|\nabla I\| > 5 \\ \|\nabla I\|/5 & \text{if } \|\nabla I\| \le 5 \end{cases} \qquad (11.27)$$

where (∇t)(I) is the gradient of t at the color I and thus measures the color stretching. The case ‖∇I‖ ≤ 5 is necessary to enforce that flat areas remain flat.
While the gradient of t is easy to estimate for grayscale pictures, it can be more difficult to obtain for color mappings. The field can then be changed into:

$$\psi = \begin{cases} 1 & \text{if } \|\nabla I\| > 5 \\ \|\nabla I\|/5 & \text{if } \|\nabla I\| \le 5 \end{cases} \qquad (11.28)$$

The minimization problem in Equation 11.25 can be solved using the variational principle, which states that the minimizer must satisfy the Euler-Lagrange equation:

$$\frac{\partial F}{\partial J} - \frac{d}{dx}\frac{\partial F}{\partial J_x} - \frac{d}{dy}\frac{\partial F}{\partial J_y} = 0 \qquad (11.29)$$

where

$$F(J, \nabla J) = \phi \cdot \|\nabla J - \nabla I\|^2 + \psi \cdot \|J - t(I)\|^2 \qquad (11.30)$$

from which the following can be derived:

$$\psi \cdot J - \operatorname{div}(\phi \cdot \nabla J) = \psi \cdot t(I) - \operatorname{div}(\phi \cdot \nabla I) \qquad (11.31)$$

This is an elliptic partial differential equation. The expression div(φ · ∇I) at pixel x = (x, y) can be approximated using standard finite differences [43] by:

$$\operatorname{div}(\phi \cdot \nabla I)(\mathbf{x}) \approx \sum_{\mathbf{x}_n \in N_{\mathbf{x}}} \frac{\phi_{\mathbf{x}_n} + \phi_{\mathbf{x}}}{2}\,\big(I_{\mathbf{x}_n} - I_{\mathbf{x}}\big) \qquad (11.32)$$

where N_x denotes the four neighboring pixels of x. Using this in Equation 11.31 yields the following linear system:

$$a_6(x,y) = a_1(x,y)J(x,y-1) + a_2(x,y)J(x,y+1) + a_3(x,y)J(x-1,y) + a_4(x,y)J(x+1,y) + a_5(x,y)J(x,y) \qquad (11.33)$$

with

$$a_1(x,y) = -\frac{\phi(x,y-1) + \phi(x,y)}{2} \qquad (11.34)$$

$$a_2(x,y) = -\frac{\phi(x,y+1) + \phi(x,y)}{2} \qquad (11.35)$$

$$a_3(x,y) = -\frac{\phi(x-1,y) + \phi(x,y)}{2} \qquad (11.36)$$

$$a_4(x,y) = -\frac{\phi(x+1,y) + \phi(x,y)}{2} \qquad (11.37)$$

$$a_5(x,y) = \frac{1}{2}\big(4\phi(x,y) + \phi(x,y-1) + \phi(x,y+1) + \phi(x-1,y) + \phi(x+1,y)\big) + \psi(x,y) \qquad (11.38)$$

$$a_6(x,y) = \psi(x,y)\,t\big(I(x,y)\big) - \frac{1}{2}\Big[\big(\phi(x,y)+\phi(x,y-1)\big)\big(I(x,y-1)-I(x,y)\big) + \big(\phi(x,y)+\phi(x,y+1)\big)\big(I(x,y+1)-I(x,y)\big) + \big(\phi(x,y)+\phi(x-1,y)\big)\big(I(x-1,y)-I(x,y)\big) + \big(\phi(x,y)+\phi(x+1,y)\big)\big(I(x+1,y)-I(x,y)\big)\Big] \qquad (11.39)$$

The system can be solved by standard iterative methods such as SOR or Gauss-Seidel with a multigrid approach. Implementations of these numerical solvers are widely available; see, for instance, Reference [44]. The main step of these methods is to solve iteratively for J(x, y).
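The sketch below solves the regraining system for a single color channel with in-place Gauss-Seidel sweeps, in the spirit of the solvers mentioned above. It takes Equation 11.26 as φ = 30/(1 + 10‖∇I‖) and uses the simplified ψ of Equation 11.28. Following Equation 11.25, where φ weights the gradient term and ψ the data term, each of the four neighbor links gets the weight (φ_x + φ_n)/2, and the diagonal coefficient is ψ plus the sum of the link weights. Neighbors outside the image are dropped, which is one simple way to impose the Neumann condition ∇J = ∇I at the border. Images are plain nested lists on a 0 to 255 scale, as the thresholds of 5 suggest. The function names, the central-difference gradient, the fixed number of sweeps, and the absence of multigrid acceleration are all assumptions of this sketch.

```python
import math

NEIGHBOURS = ((-1, 0), (1, 0), (0, -1), (0, 1))


def weights(I):
    """Return (phi, psi) for a 2D image I (Eq. 11.26 as assumed, and Eq. 11.28)."""
    h, w = len(I), len(I[0])
    phi = [[0.0] * w for _ in range(h)]
    psi = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Central differences, one-sided at the image border.
            x0, x1 = max(x - 1, 0), min(x + 1, w - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, h - 1)
            gx = (I[y][x1] - I[y][x0]) / max(x1 - x0, 1)
            gy = (I[y1][x] - I[y0][x]) / max(y1 - y0, 1)
            g = math.hypot(gx, gy)
            phi[y][x] = 30.0 / (1.0 + 10.0 * g)      # gradient-term weight
            psi[y][x] = 1.0 if g > 5.0 else g / 5.0  # data-term weight
    return phi, psi


def regrain(I, T, sweeps=200):
    """Gauss-Seidel solution of the regraining system for one channel.

    I is the original channel and T the mapped channel t(I); returns J.
    """
    h, w = len(I), len(I[0])
    phi, psi = weights(I)
    J = [row[:] for row in T]  # start from the color-mapped picture
    for _ in range(sweeps):
        for y in range(h):
            for x in range(w):
                diag = psi[y][x]            # diagonal coefficient a5
                rhs = psi[y][x] * T[y][x]   # data part of a6
                for dy, dx in NEIGHBOURS:
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        link = 0.5 * (phi[y][x] + phi[ny][nx])
                        diag += link
                        # -a_n * J_n plus the gradient-of-I part of a6
                        rhs += link * (J[ny][nx] - (I[ny][nx] - I[y][x]))
                J[y][x] = rhs / diag  # in-place update, hence Gauss-Seidel
    return J


def energy(J, I, T):
    """Discrete counterpart of the functional in Eq. 11.25 (each link counted once)."""
    h, w = len(I), len(I[0])
    phi, psi = weights(I)
    e = 0.0
    for y in range(h):
        for x in range(w):
            e += psi[y][x] * (J[y][x] - T[y][x]) ** 2
            for ny, nx in ((y + 1, x), (y, x + 1)):
                if ny < h and nx < w:
                    link = 0.5 * (phi[y][x] + phi[ny][nx])
                    d = (J[ny][nx] - J[y][x]) - (I[ny][nx] - I[y][x])
                    e += link * d * d
    return e
```

Each Gauss-Seidel update exactly minimizes `energy` with respect to one pixel, so the energy never increases from its value at J = t(I). For a color picture, the function would be called once per channel. Plain Gauss-Seidel converges slowly on large images, which is why the text recommends SOR or a multigrid scheme.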
Note that J(x, y) and a_i(x, y) are three-dimensional, but each color component can be treated independently. For instance, the iteration for the red component field is of the form

$$J_R^{(k+1)}(x,y) = \frac{1}{a_5^R(x,y)}\Big(a_6^R(x,y) - a_1^R(x,y)J_R^{(k)}(x,y-1) - a_2^R(x,y)J_R^{(k)}(x,y+1) - a_3^R(x,y)J_R^{(k)}(x-1,y) - a_4^R(x,y)J_R^{(k)}(x+1,y)\Big) \qquad (11.40)$$

where J_R^(k)(x, y) is the red component at the kth iteration. Using a commercial implementation, the overall method takes less than a second per image at PAL resolution (720 × 576) on a 2 GHz machine.

Figure 11.10d and Figure 11.11c show the efficiency of the method. In Figure 11.10d, the top of the original frame is clamped, which results in the loss of grain texture. The regraining method is well suited to this kind of situation, where the mapping is actually correct and the artifacts come only from the discrete nature of the image. The case of Figure 11.11 is more difficult, since the target color pdf can only be a crude approximation of what is desired. However, the regraining tool is quite successful at recovering the original gradient information in the resulting image (Figure 11.11c).

FIGURE 11.12 (See color insert.) Examples of color grading for matching lighting conditions: (a-c) the color properties of the sunset are used to synthesize the evening scene depicted at sunset, (d-f) the color grading corrects the change of lighting conditions induced by clouds, and (c, f) the color transfer achieved by employing IDT followed by the regraining process.

FIGURE 11.13 (See color insert.) Example of color grading for image and video restoration used to recreate different atmospheres: (a) original frame, (b) 1970s atmosphere, (c) pub atmosphere, (d) linear MK result using 1970s atmosphere, and (e) IDT followed by regraining using pub atmosphere.
11.7 Application Results
The color transfer techniques are tested here on some color grading applications. Given the advantages of the MK solution in the linear case and of the IDT method in the nonlinear case, only these methods are used in the following examples. IDT is always used in conjunction with the regraining process. The algorithms work in the CIELAB color space.

Figure 11.12 illustrates matching lighting conditions in two typical situations. In the first case, the color properties of the sunset (Figure 11.12b) are used to synthesize the evening scene (Figure 11.12a) as it would appear at sunset (Figure 11.12c). This kind of grading is frequent when shooting a movie at sunset, since the light changes quickly. To cope with the nonlinearity of the contrast change, the color transfer is performed with IDT. In the second case (Figure 11.12d to Figure 11.12f), the color grading corrects another classical change of lighting conditions, caused by passing clouds. Even with the grain artifact reducer, an unavoidable limitation of color grading is the clipping of the color data: saturated areas cannot be retrieved (for instance, the sky in the golf image cannot be recovered). A general rule is to match pictures from higher to lower dynamic range.

Figure 11.13 displays examples of movie restoration using color grading. The idea is the same: enhance the colors and match a desired atmosphere. Experimentation showed that the linear MK method can be sufficient to properly enhance a faded color palette. IDT can, however, recreate a wider variety of grades than the linear MK method. For instance, IDT can produce both images shown in Figure 11.13d and Figure 11.13e, whereas the linear MK solution fails to reproduce the strong contrasts of Figure 11.13e.
The smoothing process (see Section 11.6.1) was applied before running IDT to obtain Figure 11.13e.

Figure 11.14 and Figure 11.15a to Figure 11.15d present two examples of color grading for photography. Because photographs are not played back as a sequence, consistency is not as critical as it is in movie postproduction. Although the IDT method provides the color feel closest to the target picture, the linear MK transformation can sometimes satisfy the artist aesthetically and serve as a starting point for further editing. With the IDT algorithm, true color equalization is possible by choosing the uniform distribution as the target. As illustrated in Figure 11.15e, the equalization process puts the whole color spectrum into the image. Even though the mapping is extremely stretched, the regraining process still preserves the smoothness of the picture.

FIGURE 11.14 (See color insert.) Color grading results: (a) original image, (b) target palette, (c) linear MK result, and (d) IDT result.

FIGURE 11.15 (See color insert.) Color grading results: (a) original image, (b) target palette, (c) linear MK result, (d) IDT followed by the regraining process resulting in better color contrast, and (e) original image recolored with the whole color spectrum using the IDT algorithm followed by the regraining process.

11.8 Parting Remarks
This chapter has presented a review of color transfer techniques based on one-to-one color mappings. It has also established that these methods share a common mathematical background, namely mass transportation theory, and that Monge-Kantorovitch mappings are a good way of dealing with the problem. Monge-Kantorovitch mappings have intuitive geometrical properties and turn out to be robust when used in real applications.
Thus, finding mappings that transfer color statistics is now a well-understood problem. The algorithms proposed here, that is, the linear MK solution and the nonlinear IDT method, have already been implemented in industrial applications and are currently used by artists. Grain artifacts resulting from the mapping correction have also been addressed, and the practical solution found can be used in tandem with IDT color transfer.

It is important to realize that one-to-one color transfer techniques are not the universal answer to color grading. The methods are limited by how well a one-to-one mapping can model a change of color grade. These methods also assume that the target image contains the exact color distribution, whereas in practice target images are only an approximation of the desired color palette. Dealing with content variations is still an open problem, even though some pre- and postprocessing can improve robustness. The problems and techniques discussed in this chapter can be viewed as a set of tools that one can use confidently, provided that their conditions of use are well understood.

Acknowledgments
The authors would like to acknowledge the helpful discussions with the people at The Foundry and GreenParrotPictures, as well as Dr. Naomi Harte for her invaluable help.

References
[1] M. Pappas and I. Pitas, “Digital color restoration of old paintings,” IEEE Transactions on Image Processing, vol. 9, no. 2, pp. 291–294, February 2000.
[2] L. Lucchese and S.K. Mitra, “A new method for color image equalization,” in Proceedings of the IEEE International Conference on Image Processing, Thessaloniki, Greece, October 2001, vol. 1, pp. 133–136.
[3] E. Pichon, M. Niethammer, and G. Sapiro, “Color histogram equalization through mesh deformation,” in Proceedings of the IEEE International Conference on Image Processing, Barcelona, Spain, September 2003, vol. 3, pp. 117–120.
[4] R. Fattal, D. Lischinski, and M. Werman, “Gradient domain high dynamic range compression,” in Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques, San Antonio, TX, USA, July 2002, pp. 249–256.
[5] P. Debevec, E. Reinhard, G. Ward, and S. Pattanaik, “High dynamic range imaging,” in ACM SIGGRAPH 2004 Course Notes, 2004, p. 14.
[6] T. Welsh, M. Ashikhmin, and K. Mueller, “Transferring color to greyscale images,” ACM Transactions on Graphics, vol. 21, no. 3, pp. 277–280, July 2002.
[7] Y. Ji, H.B. Liu, X.K. Wang, and Y.Y. Tang, “Color transfer to greyscale images using texture spectrum,” in Proceedings of the Third International Conference on Machine Learning and Cybernetics, Shanghai, China, August 2004, vol. 7, pp. 4057–4061.
[8] A.A. Gooch, S.C. Olsen, J. Tumblin, and B. Gooch, “Color2Gray: Salience-preserving color removal,” ACM Transactions on Graphics, vol. 24, no. 3, pp. 634–639, July 2005.
[9] E. Reinhard, M. Ashikhmin, B. Gooch, and P. Shirley, “Color transfer between images,” IEEE Computer Graphics and Applications, vol. 21, no. 5, pp. 34–41, September/October 2001.
[10] A. Abadpour and S. Kasaei, “A fast and efficient fuzzy color transfer method,” in Proceedings of the IEEE Symposium on Signal Processing and Information Technology, Athens, Greece, December 2005, pp. 491–494.
[11] L. Neumann and A. Neumann, “Color style transfer techniques using hue, lightness and saturation histogram matching,” in Proceedings of the Workshop on Computational Aesthetics in Graphics, Visualization and Imaging, Girona, Spain, May 2005, pp. 111–122.
[12] C.M. Wang and Y.H. Huang, “A novel color transfer algorithm for image sequences,” Journal of Information Science and Engineering, vol. 20, no. 6, pp. 1039–1056, November 2004.
[13] F. Pitié, A. Kokaram, and R. Dahyot, “N-dimensional probability density function transfer and its application to colour transfer,” in Proceedings of the International Conference on Computer Vision, Beijing, China, October 2005, vol. 2, pp. 1434–1439.
[14] H.L. Shen and J.H. Xin, “Transferring color between three-dimensional objects,” Applied Optics, vol. 44, no. 10, pp. 1969–1976, April 2005.
[15] A. Efros and T.K. Leung, “Texture synthesis by non-parametric sampling,” in Proceedings of the IEEE International Conference on Computer Vision, Corfu, Greece, September 1999, pp. 1033–1038.
[16] A. Efros and W. Freeman, “Image quilting for texture synthesis and transfer,” in Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, Los Angeles, CA, USA, August 2001, pp. 341–346.
[17] A. Hertzmann, C. Jacobs, N. Oliver, B. Curless, and D. Salesin, “Image analogies,” in Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, Los Angeles, CA, USA, August 2001, pp. 327–340.
[18] S. Bae, S. Paris, and F. Durand, “Two-scale tone management for photographic look,” ACM Transactions on Graphics, vol. 25, no. 3, pp. 637–645, July 2006.
[19] H. Kotera, “A scene-referred color transfer for pleasant imaging on display,” in Proceedings of the IEEE International Conference on Image Processing, Genoa, Italy, September 2005, vol. 2, pp. 5–8.
[20] J. Morovic and P.L. Sun, “Accurate 3D image colour histogram transformation,” Pattern Recognition Letters, vol. 24, no. 11, pp. 1725–1735, July 2003.
[21] L.C. Evans, “Partial differential equations and Monge-Kantorovich mass transfer,” Current Developments in Mathematics, pp. 65–126, 1998.
[22] W. Gangbo and R. McCann, “The geometry of optimal transport,” Acta Mathematica, vol. 177, pp. 113–161, 1996.
[23] C. Villani, Topics in Optimal Transportation. Providence, RI: American Mathematical Society, 2003.
[24] J. Jia, J. Sun, C.K. Tang, and H.Y. Shum, “Bayesian correction of image intensity with spatial consideration,” in Proceedings of the 8th European Conference on Computer Vision, Prague, Czech Republic, May 2004, pp. 342–354.
[25] Y. Chang, S. Saito, K. Uchikawa, and M. Nakajima, “Example-based color stylization of images,” ACM Transactions on Applied Perception, vol. 2, no. 3, pp. 322–345, July 2005.
[26] R. Lukac, B. Smolka, K. Martin, K.N. Plataniotis, and A.N. Venetsanopoulos, “Vector filtering for color imaging,” IEEE Signal Processing Magazine, vol. 22, no. 1, pp. 74–86, January 2005.
[27] R. Lukac and K.N. Plataniotis, “A taxonomy of color image filtering and enhancement solutions,” in Advances in Imaging and Electron Physics, P.W. Hawkes (ed.), San Diego, CA: Elsevier/Academic Press, vol. 140, pp. 187–264, June 2006.
[28] P. Pérez, M. Gangnet, and A. Blake, “Poisson image editing,” ACM Transactions on Graphics, vol. 22, no. 3, pp. 313–318, July 2003.
[29] B.W. Silverman, Density Estimation for Statistics and Data Analysis. London, UK: Chapman & Hall, 1986.
[30] R.C. Gonzalez and R.E. Woods, Digital Image Processing. Boston, MA, USA: Addison Wesley, 1992.
[31] E. Reinhard, M. Stark, P. Shirley, and J. Ferwerda, “Photographic tone reproduction for digital images,” ACM Transactions on Graphics, vol. 21, no. 3, pp. 267–276, July 2002.
[32] D.L. Ruderman, T.W. Cronin, and C.C. Chiao, “Statistics of cone responses to natural images: Implications for visual coding,” Journal of the Optical Society of America A, vol. 15, no. 8, pp. 2036–2045, August 1998.
[33] F. Pitié, “Statistical Signal Processing Techniques for Visual Post-Production,” Ph.D. thesis, University of Dublin, Trinity College, April 2006.
[34] A. Abadpour and S. Kasaei, “An efficient PCA-based color transfer method,” Journal of Visual Communication and Image Representation, vol. 18, no. 1, pp. 15–34, February 2007.
[35] H.J. Trussell and M.J. Vrhel, “A fast and efficient fuzzy color transfer method,” in Proceedings of the SPIE, vol. 1452, pp. 2–9, June 1991.
[36] I. Olkin and F. Pukelsheim, “The distance between two random vectors with given dispersion matrices,” Linear Algebra and its Applications, vol. 48, pp. 257–263, 1982.
[37] M. Grundland and N.A. Dodgson, “Color histogram specification by histogram warping,” in Proceedings of the SPIE Conference on Color Imaging X: Processing, Hardcopy, and Applications, San Jose, CA, USA, January 2005, vol. 5667, pp. 610–624.
[38] Y. Rubner, C. Tomasi, and L.J. Guibas, “The earth mover’s distance as a metric for image retrieval,” International Journal of Computer Vision, vol. 40, no. 2, pp. 99–121, October 2000.
[39] F.L. Hitchcock, “The distribution of a product from several sources to numerous localities,” Journal of Mathematics and Physics, vol. 20, pp. 224–230, 1941.
[40] N. Wu and R. Coppins, Linear Programming and Extensions. New York: McGraw-Hill, 1981.
[41] F. Pitié, A. Kokaram, and R. Dahyot, “Automated colour grading using colour distribution transfer,” Computer Vision and Image Understanding, vol. 107, no. 1–2, pp. 123–137, July/August 2007.
[42] S. Helgason, The Radon Transform, 2nd edition. Basel, Switzerland: Birkhäuser, 1999.
[43] J. Weickert, B. ter Haar Romeny, and M. Viergever, “Efficient and reliable schemes for nonlinear diffusion filtering,” IEEE Transactions on Image Processing, vol. 7, no. 3, pp. 398–410, March 1998.
[44] W. Press, S. Teukolsky, W. Vetterling, and B. Flannery, Numerical Recipes in C: The Art of Scientific Computing. New York: Cambridge University Press, 1992.
[45] F. Pitié, R. Dahyot, F. Kelly, and A.C. Kokaram, “A new robust technique for stabilizing brightness fluctuations,” in Proceedings of the 2nd Workshop on Statistical Methods in Video Processing, Prague, Czech Republic, May 2004, pp. 153–164.
12
Exposure Correction for Imaging Devices: An Overview
Sebastiano Battiato, Giuseppe Messina, and Alfio Castorina

12.1 Introduction
12.2 Exposure Metering Techniques
   12.2.1 Classical Approaches
      12.2.1.1 Spot Metering
      12.2.1.2 Partial Area Metering
      12.2.1.3 Center-Weighted Average Metering
      12.2.1.4 Average Metering
   12.2.2 Advanced Approaches
      12.2.2.1 Matrix or Multi-Zone Metering
12.3 Exposure Correction Content Dependent
   12.3.1 Feature Extraction: Contrast and Focus
   12.3.2 Feature Extraction: Skin Detection
   12.3.3 Exposure Correction
   12.3.4 Exposure Correction Results
12.4 Bracketing and Advanced Applications
   12.4.1 The Sensor Versus the World
   12.4.2 Camera Response Function
   12.4.3 High Dynamic Range Image Construction
   12.4.4 The Scene Versus the Display Medium
      12.4.4.1 Histogram Adjustment
      12.4.4.2 Chiu’s Local Operator
      12.4.4.3 Bilateral Filtering
      12.4.4.4 Photographic Tone Reproduction
      12.4.4.5 Gradient Compression
12.5 Conclusion
References

12.1 Introduction
One of the main problems affecting image quality, and leading to unpleasant pictures, is improper exposure to light. Despite the sophisticated features incorporated in today’s cameras (e.g., automatic gain control algorithms), failures can still occur. Digital consumer devices use ad hoc strategies and heuristics to derive exposure setting parameters. Typically, such techniques are completely blind to the specific content of the scene. Some techniques are completely automatic, such as those based on average/automatic exposure metering or the more complex matrix/intelligent exposure metering.
Others give the photographer some control over the selection of the exposure, thus leaving room for personal taste or enabling the user to satisfy particular needs. In spite of the great variety of methods for regulating the exposure, and the complexity of some of them, it is not rare for images to be acquired with a nonoptimal or incorrect exposure. This is particularly true for handset devices (e.g., mobile phones), where several factors contribute to badly exposed pictures: poor optics, absence of flash, complex scene lighting conditions, and so forth.

There is no exact definition of what a correct exposure should be. As a generalization, the best exposure can be defined as the one that reproduces the most important regions (according to contextual or perceptual criteria) with a gray level, or brightness, roughly in the middle of the available range. In any case, if the dynamic range of the scene is significantly high, there is no way to capture all of its details in a single shot.

This chapter gives an overview of the technical details involved in:
• exposure settings of imaging devices before the acquisition phase (i.e., the pre-capture phase) [1];
• content-dependent enhancement strategies applied as postprocessing [2];
• advanced solutions in which multiple pictures of the same scene, acquired with different exposure times, make it possible to reproduce the radiance map of the real world [3].

Namely, Section 12.2 discusses in detail traditional and advanced approaches related to the pre-capture phase (i.e., the sensor is read continuously and its output is analyzed in order to determine a set of parameters strictly related to the quality of the final picture [1]). The role of the exposure setting will be analyzed by considering some case studies where, by making some assumptions about the dynamic range of the real scene, it is possible to derive effective strategies.
Section 12.3 describes the work presented in Reference [2], where an effective enhancement is obtained by postprocessing driven by the analysis of some content-dependent features of the picture. Section 12.4 surveys advanced approaches devoted to improving acquisition capabilities through multipicture acquisition (i.e., bracketing). In particular, this section focuses on popular techniques able to effectively reproduce the salient part of a real scene after a reliable high dynamic range (HDR) image has been computed [3]. Finally, conclusions are offered in Section 12.5.

12.2 Exposure Metering Techniques

Metering techniques built into cameras have improved with the introduction of computers; however, limitations still remain. For example, taking a picture of a snow scene, or trying to photograph a black locomotive, without overriding the exposure calculated by the camera’s meter is very difficult. The most important aspect of the exposure duration is to guarantee that the acquired image falls in a good region of the sensor’s sensitivity range. In many devices, the selected exposure value is the main processing step for adjusting the overall image intensity that the consumer will see. Many older digital cameras used a separate metering system to set the exposure duration, rather than using data acquired from the sensor chip. Integrating the exposure-metering function into the main sensor (through-the-lens, or TTL, metering) may reduce system cost. The imaging community uses a measure called the exposure value (EV) to specify the relationship between the f-number¹, F, and the exposure duration, T:

$$EV = \log_2 \frac{F^2}{T} = 2\log_2 F - \log_2 T \tag{12.1}$$

The exposure value (Equation 12.1) becomes smaller as the exposure duration increases, and it becomes larger as the f-number grows. Most auto-exposure algorithms work in the following manner:

1. Take a picture with a predetermined exposure value, $EV_{pre}$.
2. Convert the RGB values to brightness, B.
3. Derive a single value $B_{pre}$ from the brightness picture (e.g., a center-weighted mean, a median, or a more complicated weighted method, as in matrix metering).
4. Based on a linearity assumption and Equation 12.1, compute the optimum exposure value $EV_{opt}$, that is, the one that permits a correct exposure. A picture taken at $EV_{opt}$ should give a number close to a predefined ideal value $B_{opt}$; thus:

$$EV_{opt} = EV_{pre} + \log_2 B_{pre} - \log_2 B_{opt} \tag{12.2}$$

The ideal value $B_{opt}$ for each algorithm is typically selected empirically. Different algorithms mainly differ in how the single number $B_{pre}$ is derived from the picture.

¹ Aperture values, or f-numbers, measure the size of the opening through which light passes to reach the rear of the lens, relative to the focal length: the f-number is the focal length divided by the diameter of the aperture. The smaller the f-number, the more light passes through the lens.

12.2.1 Classical Approaches

The metering system in a digital camera measures the amount of light in the scene and calculates the best-fit exposure value according to the metering mode, as explained below. Automatic exposure is a standard feature in all digital cameras. After selecting the metering mode, the user captures the picture by pointing the camera and pressing the shutter release. The metering method defines which information from the scene is used to calculate the exposure value and how that value is determined. Cameras generally allow the user to select between spot, center-weighted average, or multi-zone metering modes.

FIGURE 12.1 Metering examples: (a) spot area, (b) partial area, (c) center-weighted area, (d) spot example, (e) partial area example, and (f) center-weighted area example.

12.2.1.1 Spot Metering

Spot metering allows the user to meter the subject in the center of the frame (or, on some cameras, at the selected autofocus (AF) point). Only a small area of the whole frame (between 1% and 5% of the viewfinder area) is metered, while the rest of the frame is ignored. In this case, $B_{pre}$ in Equation 12.2 is the mean of the center area (Figure 12.1a). This will typically be the effective center of the scene, but some cameras allow the user to select a different, off-center spot, or to recompose by moving the camera after metering. A few models support a multi-spot mode, which allows multiple spot-meter readings of a scene to be taken and averaged. Some cameras also support metering of highlight and shadow areas. Spot metering is very accurate and is not influenced by other areas in the frame. It is commonly used to shoot very high-contrast scenes. For example (see Figure 12.1d), if the subject’s back is lit by the rising sun and the face is much darker than the bright halo around the subject’s back and hairline (the subject is “backlit”), spot metering allows the photographer to measure the light bouncing off the subject’s face and expose properly for it, instead of for the much brighter light around the hairline. The area around the back and hairline will then be overexposed. Spot metering is the method on which the Zone System depends.²

² The Zone System is a photographic technique for determining optimal film exposure and development, formulated by Ansel Adams and Fred Archer in 1941. The Zone System provides photographers with a systematic method of precisely defining the relationship between the way they visualize the photographic subject and the final results. Although it originated with black-and-white sheet film, the Zone System is also applicable to roll film (black and white, color, negative, and reversal) and to digital photography.

12.2.1.2 Partial Area Metering

This mode meters a larger area than spot metering (around 10–15% of the entire frame) and is generally used when very bright or very dark areas at the edges of the frame would otherwise influence the metering unduly. As with spot metering, some cameras can take readings from variable points (in general, the autofocus points), while others use a fixed point in the center of the viewfinder. Figure 12.1e shows an example of partial metering on a backlit scene; this method equalizes the global exposure considerably better.

12.2.1.3 Center-Weighted Average Metering

This is probably the most common metering method, implemented in nearly every digital camera; it is also the default for digital cameras that do not offer metering-mode selection. In this system, as depicted in Figure 12.1c, the meter concentrates 60 to 80 percent of its sensitivity on the central part of the viewfinder. The balance is then “feathered” out towards the edges. Some cameras allow the user to adjust the weight of the central portion relative to the peripheral one. One advantage of this method is that it is less influenced by small areas of greatly varying brightness at the edges of the viewfinder; since many subjects are in the central part of the frame, consistent results can be obtained. Unfortunately, if a backlight is present in the scene, the central part appears darker than the rest of the scene (Figure 12.1f), and an unpleasant underexposed foreground is produced.

12.2.1.4 Average Metering

In this mode, the camera uses the light information coming from the entire scene and averages it for the final exposure setting, giving no weight to any particular portion of the metered area. This technique has been replaced by center-weighted metering; it is thus obsolete and present only in older cameras.

12.2.2 Advanced Approaches

12.2.2.1 Matrix or Multi-Zone Metering

This mode is also called matrix, evaluative, honeycomb, segment, or ESP (electro-selective pattern) metering on some cameras. It was first introduced with the Nikon FA, where it was called automatic multi-pattern metering. On a number of cameras, this is the default metering setting. The camera measures the light intensity at several points of the scene and then combines the results to find the settings for the best exposure. How the readings are combined varies from camera to camera. The actual number of zones used varies widely, from several to over a thousand. However, performance should not be judged on the number of zones or on the layout alone. As shown in Figure 12.2, the layout can change drastically from one manufacturer to another; even within the same company, different multi-zone layouts can be used for several reasons (e.g., the dimension of the final pixel matrix). Many manufacturers are not open about the exact calculations used to determine the exposure. A number of factors are taken into consideration, including the AF point, the distance to the subject, in-focus and out-of-focus areas, the colors/hues of the scene, and backlighting. Multi-zone metering tends to bias its exposure towards the autofocus point being used (while also taking other areas of the frame into account), thus ensuring that the point of interest is properly exposed. It is also designed to avoid the need for exposure compensation in most situations. A database of many thousands of exposures is prestored in the camera, and the processor can use a selective pattern to determine what is being photographed.

FIGURE 12.2 Examples of different kinds of multi-zone metering modes used by several camera manufacturers: (a) Canon 21 zones, (b) Canon 16 zones, (c) Canon 35 zones, (d) Nikon 10 segments, (e) Nikon 7 segments, (f) Nikon 6 segments, (g) Sigma 9 zones, (h) Samsung 16 zones, (i) Olympus ESP, (j) Konica Minolta 40-zone honeycomb, (k) Konica Minolta 14-zone honeycomb.
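The auto-exposure procedure of Equations 12.1 and 12.2 can be sketched in Python with NumPy, with the classical metering modes expressed as different weightings used to compute $B_{pre}$. This is a minimal illustration under stated assumptions, not any camera's firmware: the Rec. 601 brightness weights, the Gaussian shape of the center-weighted profile, the 3% spot window, the target $B_{opt}$ = 118, and the backlit toy scene are all hypothetical choices. Multi-zone metering is not modeled, since, as noted above, manufacturers do not disclose their calculations.

```python
import numpy as np

def exposure_value(f_number, t):
    """Equation 12.1: EV = log2(F^2 / T) = 2 log2(F) - log2(T)."""
    return 2.0 * np.log2(f_number) - np.log2(t)

def rgb_to_brightness(rgb):
    """Step 2: RGB -> brightness B (Rec. 601 luma weights are an assumption)."""
    return np.asarray(rgb, dtype=float) @ np.array([0.299, 0.587, 0.114])

def _offsets(h, w):
    """Pixel offsets from the frame centre, normalised by the frame size."""
    yy, xx = np.mgrid[0:h, 0:w]
    return (yy - (h - 1) / 2) / h, (xx - (w - 1) / 2) / w

def average_weights(h, w):
    """Average metering: every pixel gets the same weight."""
    return np.ones((h, w))

def center_weighted_weights(h, w, sigma=0.25):
    """Hypothetical centre-weighted profile: a Gaussian feathered towards the edges."""
    dy, dx = _offsets(h, w)
    return np.exp(-(dy**2 + dx**2) / (2 * sigma**2))

def spot_weights(h, w, area=0.03):
    """Hypothetical spot mask: a central window covering `area` of the frame (1-5%)."""
    dy, dx = _offsets(h, w)
    half = np.sqrt(area) / 2
    return ((np.abs(dy) <= half) & (np.abs(dx) <= half)).astype(float)

def metered_value(B, weights):
    """Step 3: B_pre as a weighted mean of the brightness picture."""
    return float(np.sum(B * weights) / np.sum(weights))

def optimal_ev(ev_pre, b_pre, b_opt):
    """Step 4, Equation 12.2: EV_opt = EV_pre + log2(B_pre) - log2(B_opt)."""
    return ev_pre + np.log2(b_pre) - np.log2(b_opt)

# Backlit toy scene: bright background (230) around a dark central subject (40).
h, w = 120, 160
B = np.full((h, w), 230.0)
B[40:80, 55:105] = 40.0
ev_pre = exposure_value(4.0, 1 / 125)   # f/4 at 1/125 s
b_opt = 118.0                           # hypothetical empirical target
for name, wts in [("average", average_weights(h, w)),
                  ("center-weighted", center_weighted_weights(h, w)),
                  ("spot", spot_weights(h, w))]:
    b_pre = metered_value(B, wts)
    print(f"{name:15s}  B_pre = {b_pre:6.1f}  EV_opt = {optimal_ev(ev_pre, b_pre, b_opt):.2f}")
```

On this backlit scene, average metering is dominated by the bright background and raises $EV_{opt}$ (less light, so the subject ends up underexposed), while spot metering reads only the subject and lowers $EV_{opt}$, in line with the behavior described for Figures 12.1d and 12.1f.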
Some cameras allow the user to link (or unlink) autofocus and metering, making it possible to lock the exposure once AF confirmation is achieved. This is called auto exposure lock (AEL). With manual focus, on many compact cameras, the AF point is not used as part of the exposure calculation. In such instances, the metering commonly defaults to a central point in the viewfinder, using a pattern based on that area. Some users have problems with high-contrast wide-angle shots, due to the large area, which can vary greatly in brightness; it is important to understand that, even in this situation, the focus point can be critical to the overall exposure.

FIGURE 12.3 Bayer data subsampling generation: each 2 × 2 Bayer cell (R, G1, G2, B) is reduced to a single R G′ B triplet, where G′ is the average of G1 and G2.

12.3 Exposure Correction Content Dependent

As explained in Section 12.1, the best exposure can be defined as the one that reproduces the most important regions (according to contextual or perceptual criteria) with a gray level, or brightness, roughly in the middle of the available range. After the acquisition phase, typical postprocessing techniques try to achieve an effective enhancement via global approaches, such as histogram specification, histogram equalization, and gamma correction, which improve the global contrast appearance [4] by stretching the global distribution of the intensity. Because such global approaches ignore which regions of the picture actually matter, more adaptive criteria are needed to overcome this drawback. The exposure correction technique [2] described in this section is designed essentially for mobile sensor applications. Such sensors, present in the newest mobile devices, are particularly affected by backlighting when the user employs the device for video phoning. The detection of skin characteristics in captured images allows the selection and proper enhancement and/or tracking of regions of interest (e.g., faces).
If no skin is present in the scene, the algorithm automatically switches to other features (such as contrast and focus) to track visually relevant regions. This implementation differs from the algorithm described in Reference [5] in two ways: the whole processing can also be performed directly on Bayer pattern images [6] (for detailed information on the Bayer pattern and single-sensor imaging fundamentals, refer to Chapter 1), and simpler statistical measures are used to identify information-carrying regions. The algorithm is defined as follows:

1. Luminance extraction. If the algorithm is applied to Bayer data, a subsampled (quarter-size) approximation of the input (Figure 12.3) is used in place of the three full color planes.
2. Feature extraction. Using a suitable feature extraction technique, the algorithm assigns a value to each region. This operation makes it possible to find visually relevant regions (for contrast and focus, the regions are block based, whereas for skin detection, the regions are individual pixels).
3. Tone correction. Once the visually important pixels are identified (e.g., the pixels belonging to skin features), a global tone correction technique is applied, using as its main parameter the mean gray level of the relevant regions.

12.3.1 Feature Extraction: Contrast and Focus

To identify the regions of the image that contain more information, the luminance plane is subdivided into N blocks of equal dimensions (in our experiments, we employed N = 64 for VGA images). For each block, statistical measures of contrast and focus are computed, under the assumption that well-focused or high-contrast blocks are more relevant than the others. Contrast refers to the range of tones present in the image; a high contrast leads to a higher number of perceptually significant regions inside a block.
Focus characterizes the sharpness, or edgeness, of the block and is useful in identifying regions where high-frequency components (i.e., details) are present.

If these measures were simply computed on highly underexposed images, the better-exposed regions would always show higher contrast and edgeness than the darker ones. In order to perform a visual analysis that reveals the most important features regardless of lighting conditions, a new visibility image is constructed by pushing the mean gray level of the input green Bayer plane (or of the Y channel for color images) to 128. The push operation uses the same function that is used to adjust the exposure level, described in Section 12.3.3.

The contrast measure is computed by building a histogram for each block and then calculating its deviation from the mean value. A high deviation value denotes good contrast, and vice versa. In order to remove irrelevant peaks, the histogram is slightly smoothed by replacing each entry with its mean over a neighborhood of radius 2. Thus, each original histogram entry $I[i]$ is replaced with the smoothed value $\tilde{I}[i]$:

$$\tilde{I}[i] = \frac{I[i-2] + I[i-1] + I[i] + I[i+1] + I[i+2]}{5} \tag{12.3}$$

The histogram deviation D is computed as:

$$D = \frac{\sum_{i=0}^{255} |i - M| \, \tilde{I}[i]}{\sum_{i=0}^{255} \tilde{I}[i]} \tag{12.4}$$

where M is the mean value:

$$M = \frac{\sum_{i=0}^{255} i \, \tilde{I}[i]}{\sum_{i=0}^{255} \tilde{I}[i]} \tag{12.5}$$

The focus measure is computed by convolving each block with a simple 3 × 3 Laplacian filter. In order to discard irrelevant high-frequency pixels (mostly noise), the output of the convolution at each pixel location is thresholded. The mean focus value of each block is computed as:

$$F = \frac{1}{N} \sum_{i=1}^{N} \mathrm{thresh}_{\tau}\big(\mathrm{lapl}(i)\big) \tag{12.6}$$

where N is here the number of pixels in the block (not the number of blocks used above), $\mathrm{lapl}(i)$ is the output of the convolution of the block with a 3 × 3 Laplacian kernel at pixel i, and the operator $\mathrm{thresh}_{\tau}(\cdot)$ discards values lower than a fixed threshold τ.
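Steps 1 and 2 of the algorithm, for the contrast/focus branch, can be sketched with NumPy as follows. This is an assumption-laden illustration, not the implementation of Reference [2]: the values of A, C, and τ are hypothetical, the 4-neighbour Laplacian kernel, the zero padding at the histogram ends, and thresholding the magnitude of the Laplacian are our choices, and the visibility image reuses the exposure-correction curve of Equations 12.9 to 12.11 (Section 12.3.3), as the text indicates, driven here by the global mean.

```python
import numpy as np

# Hypothetical parameters: A and C shape the curve of Eq. 12.9; TAU is tau in Eq. 12.6.
A, C, TAU = 0.85, 1.0, 8.0

def f(q):
    """Simulated camera response (Eq. 12.9): light quantity q in stops -> gray level."""
    return 255.0 / (1.0 + np.exp(-A * q)) ** C

def f_inv(y):
    """Inverse of Eq. 12.9; y is clipped away from 0 and 255 so q stays finite."""
    y = np.clip(np.asarray(y, dtype=float), 0.5, 254.5)
    return -np.log((255.0 / y) ** (1.0 / C) - 1.0) / A

def correction_lut(avg, target=128.0):
    """Eqs. 12.10-12.11 as a 256-entry LUT: Y' = f(f_inv(Y) + delta)."""
    delta = f_inv(target) - f_inv(avg)
    out = f(f_inv(np.arange(256)) + delta)
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

def bayer_green_quarter(raw):
    """G' of Figure 12.3: mean of G1 and G2 in each 2x2 cell (R G1 / G2 B layout)."""
    g1 = raw[0::2, 1::2].astype(float)
    g2 = raw[1::2, 0::2].astype(float)
    return np.rint((g1 + g2) / 2).astype(np.uint8)

def contrast_D(block):
    """Eqs. 12.3-12.5: mean absolute deviation of the radius-2 smoothed histogram."""
    hist = np.bincount(block.ravel(), minlength=256).astype(float)
    smooth = np.convolve(hist, np.ones(5) / 5.0, mode="same")  # zero-padded ends
    i = np.arange(256)
    M = np.sum(i * smooth) / np.sum(smooth)
    return float(np.sum(np.abs(i - M) * smooth) / np.sum(smooth))

def focus_F(block, tau=TAU):
    """Eq. 12.6 with a 4-neighbour 3x3 Laplacian over interior pixels.
    Thresholding the magnitude |lapl| rather than the signed value is an assumption."""
    b = block.astype(float)
    lap = b[:-2, 1:-1] + b[2:, 1:-1] + b[1:-1, :-2] + b[1:-1, 2:] - 4.0 * b[1:-1, 1:-1]
    mag = np.abs(lap)
    return float(np.sum(np.where(mag >= tau, mag, 0.0)) / mag.size)

def block_features(Y, grid=8):
    """Visibility image (mean pushed to 128), then (D, F) for each of grid x grid blocks."""
    vis = correction_lut(Y.mean())[Y]
    bh, bw = Y.shape[0] // grid, Y.shape[1] // grid
    feats = np.zeros((grid, grid, 2))
    for r in range(grid):
        for c in range(grid):
            blk = vis[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
            feats[r, c] = contrast_D(blk), focus_F(blk)
    return feats

# Usage on a synthetic, underexposed VGA luminance plane (N = 64 blocks as in the text).
rng = np.random.default_rng(0)
Y = np.clip(rng.normal(40, 10, (480, 640)), 0, 255).astype(np.uint8)
feats = block_features(Y)
print(feats.shape)   # (8, 8, 2)
```

Implementing Equation 12.11 as a lookup table means the nonlinear curve is evaluated only 256 times, regardless of image size, which matches the LUT implementation mentioned for the correction step.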
Once the values F and D are computed for all blocks, relevant regions are classified using a linear combination of both values. The feature extraction pipeline is illustrated in Figure 12.4.

FIGURE 12.4 Feature extraction pipeline for focus and contrast with N = 25: (a) input image, (b) luminance blocks, (c) relevance measures m1, …, m25, and (d) relevant blocks. The visual relevance of each luminance block of the input image is based on relevance measures, from which a list of relevant blocks is obtained.

12.3.2 Feature Extraction: Skin Detection

As explained before, a visibility image is built by forcing the mean gray level of the luminance channel to about 128. Most existing methods for skin color detection threshold some measure of the likelihood of skin colors for each pixel and treat the pixels independently. Human skin colors form a special category of colors, distinct from the colors of most other natural objects. It has been found that human skin colors are clustered in various color spaces [7], [8]. The skin color variations between people are mostly due to intensity differences; these variations can therefore be reduced by using chrominance components only. Yang et al. [9] have demonstrated that the distribution of human skin colors can be represented by a two-dimensional (2D) Gaussian function on the chrominance plane. The center of this distribution is determined by the mean vector µ and its shape by the covariance matrix Σ; both can be estimated from an appropriate training data set. The conditional probability p(x|s) of a block belonging to the skin color class, given its chrominance vector x, is then represented by:

$$p(x|s) = \frac{1}{2\pi} \, |\Sigma|^{-1/2} \exp\left(-\frac{[d(x)]^2}{2}\right) \tag{12.7}$$

where d(x) is the Mahalanobis distance between x and µ:

$$[d(x)]^2 = (x - \mu)^{T} \, \Sigma^{-1} (x - \mu) \tag{12.8}$$

The value d(x) determines the probability that a given block belongs to the skin color class: the larger the distance d(x), the lower the probability. The skin class has been derived experimentally using a large data set of images acquired under different conditions and at different resolutions, using a CMOS-VGA sensor on an STV6500-E01 Evaluation Kit equipped with a 502 VGA sensor [10].

FIGURE 12.5 (See color insert.) Skin detection examples on RGB images: (a,d) original images compressed in JPEG format, (b,e) output of the simplest threshold method, and (c,f) output of the probabilistic threshold.

FIGURE 12.6 Skin detection examples on a Bayer pattern image: (a) original image in Bayer data; (b) skin recognized with the probabilistic approach; and (c) threshold skin values on the two-dimensional r–g histogram (skin locus), plotted over r and g in [0, 1].

Given the large number of color spaces, distance measures, and 2D distributions available, many skin detection algorithms can be used. Since the skin color algorithm is independent of the exposure correction, two alternative techniques aimed at recognizing skin regions are introduced (as shown in Figure 12.5):

1. Using the input YCbCr image and p(x|s) from Equation 12.7, each pixel is classified as a skin or nonskin pixel. A new image with normalized grayscale values is then derived, in which skin areas are properly highlighted (Figure 12.5c). The higher the gray value, the higher the probability that the pixel is correctly identified as skin.
2. Processing an input RGB image, a 2D chrominance distribution histogram is computed via r = R/(R + G + B) and g = G/(R + G + B). Chrominance values representing skin are clustered in a specific area of the normalized r–g plane, called the skin locus (Figure 12.6c), as defined in Reference [11]. Pixels whose chrominance values belong to the skin locus are selected to correct the exposure.

For Bayer data, the skin detection algorithm works on the RGB image created by subsampling the original picture, as described in Figure 12.3.

FIGURE 12.7 (See color insert.) Exposure correction results by real-time processing and postprocessing: (a) Bayer image, (b) skin detected in the Bayer image in real time, (c) RGB color-interpolated image from Bayer data, (d) skin detected in RGB data using postprocessing, and (e) exposure-corrected image obtained from the RGB image.

12.3.3 Exposure Correction

Once the visually relevant regions are identified, the exposure correction (Figure 12.7) is carried out using the mean gray value of those regions as the reference point. A simulated camera response curve is used for this purpose. This function can be expressed with a simple parametric closed-form representation:

$$f(q) = \frac{255}{\left(1 + e^{-Aq}\right)^{C}} \tag{12.9}$$

where q represents the light quantity, and the final pixel value is obtained by means of the parameters A and C, which control the shape of the curve. The q term is expressed in base-2 logarithmic units (usually referred to as stops). These parameters can be estimated for the specific image acquisition device or chosen experimentally, as further specified in Section 12.4. The offset from the ideal exposure is computed from the f curve and the average gray level avg of the visually relevant regions as:

$$\Delta = f^{-1}(Trg) - f^{-1}(avg) \tag{12.10}$$

where Trg is the desired target gray level. The value of Trg should be around 128, but it can be changed slightly, especially when dealing with Bayer pattern data, where some postprocessing is often applied. The luminance value Y(x, y) of a pixel (x, y) is modified as follows:

$$Y'(x, y) = f\left(f^{-1}\big(Y(x, y)\big) + \Delta\right) \tag{12.11}$$

In practice, this step is often implemented as a lookup table (LUT) transform. This concludes the correction process: all pixels are now corrected.

FIGURE 12.8 (See color insert.) Exposure correction results by postprocessing: (a,d) original color input images, (b,e) visually significant blocks detected by contrast and focus, and (c,f) exposure-corrected images obtained from RGB images.

12.3.4 Exposure Correction Results

The described technique has been tested on a large database of images acquired at different resolutions, with different acquisition devices, in both Bayer and RGB formats. In the Bayer case, the algorithm was inserted into a real-time framework, using a CMOS-VGA sensor on an STV6500-E01 Evaluation Kit equipped with a 502 VGA sensor [10]. Examples of skin detection using real-time processing are reported in Figure 12.7. In the RGB case, the algorithm can be implemented as a postprocessing step. Examples of exposure correction driven by skin and by contrast/focus features are shown in Figure 12.8 and Figure 12.9.

FIGURE 12.9 (See color insert.) Exposure correction results: (a–c) original images, (d–f) corrected images. The images in (a) and (b) were acquired by a Nokia 7650 VGA sensor and compressed in JPEG format, whereas the image in (c) was acquired with an Olympus E-10 camera equipped with a 4.1-megapixel CCD sensor.

The results show how the feature analysis capability of the proposed algorithm permits a contrast enhancement that takes into account some strong peculiarities of the input image. Further details and experiments can be found in Reference [2].

12.4 Bracketing and Advanced Applications

Even when some kind of postprocessing is available to recover or enhance a badly exposed image, there are situations where this strategy is not possible or leads to poor results.
The problem comes from the fact that badly captured data can be enhanced, but where no data exists at all, there is nothing to enhance. Today, almost all digital cameras still suffer from limited dynamic range and inadequate data representation, which make critical lighting situations (and the real world has plenty of them) difficult to handle. This occurs despite the great advances of digital photography, which has made high resolutions available even in mass-market products. Multiple-exposure capture therefore stands as a useful alternative for overcoming current technological limits. Although the idea of combining multiple exposures has only recently received great attention, the methodology itself is very old. In the early 1960s, well before the advent of digital image processing, Charles Wyckoff [12] was able to capture high dynamic range images by using photographic emulsion layers with different sensitivities to light. The information coming from each layer was printed on paper using different colors, thus obtaining a pseudo-color depiction of the image.

TABLE 12.1 Typical world luminance levels.
Scene           Luminance
Starlight       10⁻³ cd/m²
Moonlight       10⁻¹ cd/m²
Indoor light    10² cd/m²
Sunlight        10⁵ cd/m²

12.4.1 The Sensor Versus the World

Dynamic range refers to the ratio between the highest and the lowest sensed level of light. For example, a scene where the quantity of light ranges from 1000 cd/m² to 0.01 cd/m² has a dynamic range of 1000/0.01 = 100,000. The simultaneous presence of such extremes in real-world scenes poses great challenges to image-capturing devices, whose available dynamic range is usually unable to cope with that of the outside world. High dynamic range scenes are not uncommon: imagine a room with a sunlit window, or environments containing both opaque and specular objects.
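Dynamic range is often quoted either in orders of magnitude (base-10 logarithm, the unit of Equation 12.12) or in stops (base-2 logarithm, the unit used for q in Equation 12.9). A small Python helper, whose name is illustrative, converts the example above and the span of Table 12.1 into both units.

```python
import math

def dynamic_range(l_max, l_min):
    """Ratio of highest to lowest light level, also as orders of magnitude and stops."""
    ratio = l_max / l_min
    return ratio, math.log10(ratio), math.log2(ratio)

ratio, orders, stops = dynamic_range(1000.0, 0.01)        # example from the text
print(f"{ratio:,.0f}:1 = {orders:.1f} orders of magnitude = {stops:.1f} stops")
# -> 100,000:1 = 5.0 orders of magnitude = 16.6 stops

ratio, orders, stops = dynamic_range(1e5, 1e-3)           # sunlight vs. starlight, Table 12.1
print(f"{ratio:,.0f}:1 = {orders:.1f} orders of magnitude = {stops:.1f} stops")
```

Each order of magnitude corresponds to log2(10), roughly 3.32 stops, so the starlight-to-sunlight span of Table 12.1 (eight orders of magnitude) is well over 26 stops.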
Table 12.1 shows typical luminance values for different scenes, spanning a very wide range from starlight to sunlight. On the other hand, the dynamic range (DR) of a digital still camera (DSC) is defined as the ratio between the maximum charge that the sensor can collect (full well capacity, FWC) and the minimum charge that is just above the sensor noise (noise floor, NF). This quantity is usually expressed in logarithmic units:

$$DR = \log_{10} \frac{FWC}{NF} \tag{12.12}$$

This dynamic range, which is seldom of the same order of magnitude as that of real-world scenes, is further affected by errors coming from the analog-to-digital conversion (ADC) of the sensed light values. Once the light values are captured, they are quantized to produce digital codes, which for common eight-bit data fall in the [0, 255] range. This means that a sampled, coarse representation of the continuously varying light values is produced.

FIGURE 12.10 Due to limited camera dynamic range, only a portion of the scene, depending on exposure settings, can be captured and digitized. (The figure maps the world luminance of the scene, spanning 10⁻⁶ to 10⁶, through the exposure window onto the 8-bit data of the sensor.)

FIGURE 12.11 Information loss for: (a) high exposure, and (b) low exposure. Each panel plots the image histogram against scene light values. For simplicity, only eight quantization levels, shown with dotted lines, are considered.

Limited dynamic range and quantization thus lead to loss of information and to inadequate data representation. This process is shown schematically in Figure 12.10, where the dynamic range of a scene is converted into the digital data of a DSC. Note that only part of the original range is captured; the remaining part is lost. The portion of the dynamic range where the loss occurs depends on the exposure settings employed.
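To make the information loss of Figures 12.10 and 12.11 concrete, the following sketch evaluates Equation 12.12 for a hypothetical sensor (full well of 20,000 electrons, noise floor of 10 electrons) and pushes the luminance span of Table 12.1 through a toy 8-bit sensor at two exposure times. The sensor model (linear response, hard noise floor, hard clipping, exposure in arbitrary units with the full well at 1.0) is an assumption made for illustration only; exposure is formed as light times integration time, which Section 12.4.2 formalizes as Equation 12.13.

```python
import numpy as np

def sensor_dr(fwc, nf):
    """Equation 12.12: DR = log10(FWC / NF); the stop equivalent log2 is returned too."""
    return np.log10(fwc / nf), np.log2(fwc / nf)

def capture_8bit(exposure, fwc=1.0, nf=1e-3):
    """Toy linear sensor (an assumption, not the chapter's model): exposures below
    the noise floor read as 0, exposures above the full well clip, then 8-bit codes."""
    X = np.where(exposure < nf, 0.0, np.minimum(exposure, fwc))
    return np.rint(255.0 * X / fwc).astype(np.uint8)

dr10, dr2 = sensor_dr(20000.0, 10.0)     # hypothetical sensor: FWC 20,000 e-, NF 10 e-
print(f"DR = {dr10:.2f} orders of magnitude ({dr2:.1f} stops)")
# -> DR = 3.30 orders of magnitude (11.0 stops)

scene = np.logspace(-3, 5, 1000)         # luminance span of Table 12.1 (cd/m^2)
for label, t in [("high exposure", 1e-1), ("low exposure", 1e-5)]:
    Z = capture_8bit(scene * t)          # exposure = light x integration time
    print(f"{label}: {np.mean(Z == 255):.0%} clipped, {np.mean(Z == 0):.0%} lost in the dark, "
          f"{np.unique(Z).size} distinct codes")
```

With these settings the long exposure clips about half of the samples, while the short one saturates almost nothing but maps most of the scene to code 0, mirroring the two cases of Figures 12.11a and 12.11b.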
Low exposure settings prevent information loss due to saturation and thus capture highlight values, but lower values are easily overwhelmed by sensor noise. Conversely, high exposure settings give a good representation of low light values, but the brighter portion of the scene is saturated. Once again, a graphical representation clarifies the different scenarios. Figure 12.11a shows a high-exposure capture. Only the portion of the scene under the green area is sensed, with a very fine quantization; the rest of the scene is lost due to saturation, which occurs at the luminance level corresponding to the end of the green area. Figure 12.11b shows a low-exposure capture. This time saturation, which occurs at the light level corresponding to the end of the red area, is less severe thanks to the low exposure settings, and the complete scene is captured (the red area). Unfortunately, because the sampling intervals are very widely spaced, the quality of the captured data is degraded by quantization noise and errors. In summary, data captured with different exposure settings cover a wider range and reveal more detail than would be possible with a single shot. The process is usually carried out in three steps: i) camera response function estimation, ii) high dynamic range construction, and iii) tone mapping to the display or print medium.

12.4.2 Camera Response Function

In order to properly compose a high dynamic range image from the information in multiple low dynamic range (LDR) images, the camera response function must be known. This function describes how the camera converts changes in exposure into digital measurements. The camera exposure X, which is the quantity of light accumulated by the sensor in a given time, can be defined as:

$$X = I \, t \tag{12.13}$$

where I is the irradiance and t is the integration time.
FIGURE 12.12 The full pipeline from scene to final digital image: scene, lens, shutter, sensor, ADC, processing, image.

The main problem in assembling the high dynamic range from multiple exposures lies in recovering the function that synthesizes the full process. When a pixel value Z is produced, it is known to come from some scene radiance I sensed for a given time t and mapped into the digital domain through some function f. Although most CCD and CMOS sensors are designed to produce electric charges that are strictly proportional to the incoming amount of light (up to the near-saturation point, where values are likely to fluctuate), the final mapping is seldom linear. Nonlinearities can come from the ADC stage, sensor noise, gamma mapping, and specific processing introduced by the manufacturer. In fact, DSCs often have a built-in nonlinear mapping that mimics a film-like response, which usually produces more appealing images when viewed on low dynamic range displays. The full pipeline, from the scene to the final pixel values, is shown in Figure 12.12, where prominent nonlinearities can be introduced in the final, generally unknown, processing stage.

The most obvious way to estimate the camera response function is to photograph a chart of different, uniformly lit patches, such as the Macbeth Chart [13], and establish the relationship between the known light values and the recorded digital pixel codes. However, this process requires expensive equipment and a controlled environment, which is why several chartless techniques have been investigated. One of the most flexible algorithms, described in Reference [14], requires only an estimate of the exposure ratios between the input images. Such ratios are readily available from the exposure times recorded by almost all cameras.
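A bracketed sequence can be simulated to experiment with the relations formalized as Equations 12.13 to 12.15: exposure is irradiance times time, the camera maps it through an unknown nonlinear f, and the ratios come from the recorded exposure times. In this sketch, the gamma-shaped `film_like_response` is a hypothetical stand-in for f and the irradiances are arbitrary; only the exposure times, taken from those listed for Figure 12.13, come from the chapter.

```python
import numpy as np

def film_like_response(X):
    """Hypothetical nonlinear response f (NOT the chapter's): a 1/2.2 gamma on
    exposure normalised so that 1.0 is saturation."""
    return np.clip(X, 0.0, 1.0) ** (1.0 / 2.2)

def simulate_bracket(I, times, bits=8):
    """Figure 12.12 in miniature: X = I t (Eq. 12.13), Z = f(X) (Eq. 12.15), then
    quantisation to `bits`; returns an (N, P) array normalised to [0, 1]."""
    levels = 2**bits - 1
    return np.stack([np.rint(film_like_response(I * t) * levels) / levels for t in times])

def exposure_ratios(times):
    """Eq. 12.14: R_{j,j+1} = t_j / t_{j+1} for consecutive exposures."""
    t = np.asarray(times, dtype=float)
    return t[:-1] / t[1:]

times = [1/1600, 1/800, 1/400, 1/200, 1/100, 1/50, 1/25, 1/13, 1/8, 1/4]  # Figure 12.13
R = exposure_ratios(times)
print(np.round(R, 3))            # one-stop steps give 0.5; 1/25 -> 1/13 -> 1/8 do not
I = np.logspace(-1, 3, 6)        # hypothetical irradiances, arbitrary units
Z = simulate_bracket(I, times)   # shape (10, 6): one row per exposure
print(Z.shape)
```

The ratios show that the nominal shutter steps of Figure 12.13 are not all exactly one stop apart (1/25 s to 1/13 s gives 0.52, and 1/13 s to 1/8 s about 0.615), so using the recorded times is preferable to assuming a factor of 2 between consecutive shots.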
Given $N$ digitized LDR pictures representing the same scene and acquired with exposure times $t_j$, for $j = 1, \dots, N$, the exposure ratios $R_{j,j+1}$ can be expressed as:

$$R_{j,j+1} = \frac{t_j}{t_{j+1}} \qquad (12.14)$$

The following equation relates the $i$th pixel of the $j$th image, $Z_{i,j}$, to the underlying unknown radiance value $I_i$:

$$Z_{i,j} = f(I_i\, t_j) \qquad (12.15)$$

which is the aforementioned camera response function. The principle of high dynamic range composition is to estimate, for each pixel, the radiance value behind it, in order to obtain a better and more faithful description of the scene that originated the images. This means that we are interested in finding the inverse of Equation 12.15; a mapping from pixel value to radiance value is needed:

$$g(Z_{i,j}) = f^{-1}(Z_{i,j}) = I_i\, t_j \qquad (12.16)$$

The nature of the function $g(\cdot)$ in the above equation is unknown; the only assumption is that it must be monotonically increasing. That is why a polynomial function of order $K$ is used:

$$I_e = g(Z) = \sum_{k=0}^{K} c_k Z^k \qquad (12.17)$$

FIGURE 12.13 A sequence of ten images, captured at ISO 50, f/6.3, and exposures ranging from 1/1600 to 1/4 seconds: (a) 1/1600 s, (b) 1/800 s, (c) 1/400 s, (d) 1/200 s, (e) 1/100 s, (f) 1/50 s, (g) 1/25 s, (h) 1/13 s, (i) 1/8 s, and (j) 1/4 s.

The problem thus becomes the estimation of the order $K$ and the coefficients $c_k$ appearing in Equation 12.17. If the ratios between successive image pairs $(j, j+1)$ are known, the following relation holds:

$$\frac{I_i\, t_j}{I_i\, t_{j+1}} = \frac{g(Z_{i,j})}{g(Z_{i,j+1})} = R_{j,j+1} \qquad (12.18)$$

Using Equation 12.18, the parameters are estimated by minimizing the following objective function:

$$O = \sum_{j=1}^{N-1} \sum_{i=1}^{P} \left[ \sum_{k=0}^{K} c_k Z_{i,j}^{k} - R_{j,j+1} \sum_{k=0}^{K} c_k Z_{i,j+1}^{k} \right]^2 \qquad (12.19)$$

where $N$ is the number of images and $P$ the number of pixels. The system can be easily solved using the least squares method. The condition $g(1) = 1$ is enforced to fix the scale of the solution, and different orders $K$ are tested.
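A minimal sketch of the least-squares fit of Equations 12.17 to 12.19 follows. It assumes pixel values already normalized to [0, 1] and already paired across consecutive images; the constraint $g(1) = 1$ is imposed by eliminating the highest coefficient (since $g(1) = \sum_k c_k$, this means $c_K = 1 - \sum_{k<K} c_k$). The function name and arguments are hypothetical, and this is a simplified illustration of the idea rather than the exact procedure of Reference [14].

```python
import numpy as np

def fit_response_polynomial(pairs, ratios, order):
    """Least-squares estimate of c_0..c_K in g(Z) = sum_k c_k Z^k (Eq. 12.17).

    pairs  : list of (z_j, z_next) arrays holding the values of the same pixels
             in images j and j+1, normalized to [0, 1]
    ratios : list of exposure ratios R_{j,j+1} = t_j / t_{j+1} (Eq. 12.14)
    Returns the coefficients and the residual of the objective O (Eq. 12.19).
    """
    powers = np.arange(order + 1)
    blocks, targets = [], []
    for (z_j, z_next), r in zip(pairs, ratios):
        # Row i holds Z_{i,j}^k - R * Z_{i,j+1}^k for k = 0..K (bracket in Eq. 12.19).
        a = z_j[:, None] ** powers - r * z_next[:, None] ** powers
        # Enforce g(1) = sum_k c_k = 1 by substituting c_K = 1 - sum_{k<K} c_k.
        blocks.append(a[:, :-1] - a[:, -1:])
        targets.append(-a[:, -1])
    A = np.vstack(blocks)
    b = np.concatenate(targets)
    head, *_ = np.linalg.lstsq(A, b, rcond=None)
    coeffs = np.append(head, 1.0 - head.sum())
    residual = float(np.sum((A @ head - b) ** 2))
    return coeffs, residual

# Synthetic check with an assumed camera response f(X) = sqrt(X), so g(Z) = Z^2.
rng = np.random.default_rng(0)
radiance = rng.uniform(0.05, 1.0, 200)
times = [1.0, 0.5, 0.25]
frames = [np.sqrt(radiance * t) for t in times]
pairs = list(zip(frames[:-1], frames[1:]))
ratios = [times[j] / times[j + 1] for j in range(len(times) - 1)]
coeffs, err = fit_response_polynomial(pairs, ratios, order=2)  # close to [0, 0, 1]
```

Calling the function for several values of `order` and comparing the returned residuals is one way to carry out the order selection described in the text.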
Among the tested orders, the value of $K$ that yields the smallest value of the objective function is retained. To limit the number of equations to be considered, not all pixels of the images need to be used; some kind of selection is advised, respecting the following rules: i) pixels should be well distributed spatially, ii) pixels should sample the input range, and iii) pixels should be picked from low variance (homogeneous) areas.

FIGURE 12.14 Camera response functions (pixel value versus exposure for the red, green, and blue channels) derived from the images depicted in Figure 12.13: (a) response curve on a linear scale, and (b) response curve on a logarithmic scale.

A different way to feed the linear system in Equation 12.19 is to replace pixel value correspondences with comparagram pairs. Comparagrams are described in detail in Reference [15] and provide an easy way to represent how the pixel values of one image map to those of the same scene captured with a different exposure. This mapping is usually called the brightness transfer function (BTF). It is worth noting that if direct access to raw data is available, and the data are known to be linear, the response curve estimation step can be skipped, since the function is simply a straight line normalized in the range [0, 1]. Figure 12.13 shows 10 images captured at different exposure settings, from 1/1600 s to 1/4 s, while Figure 12.14 shows the recovered response curve on both linear (left) and logarithmic (right) scales.
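One possible way to apply the three pixel-selection rules listed above is sketched below. The grid step, the 3 × 3 variance threshold, and the per-bin quota are illustrative assumptions rather than values from the text, and the input is assumed to be a single channel normalized to [0, 1]; `select_sample_pixels` is a hypothetical name.

```python
import numpy as np

def select_sample_pixels(img, step=8, max_var=1e-3, n_bins=16, per_bin=4):
    """Pick (row, col) sample positions following rules i) to iii).

    i)   candidates lie on a regular grid, so they are spread over the image;
    ii)  at most `per_bin` samples per intensity bin, so no single range dominates;
    iii) candidates whose 3x3 neighborhood variance exceeds `max_var` are skipped.
    """
    h, w = img.shape
    counts = np.zeros(n_bins, dtype=int)
    picked = []
    for y in range(1, h - 1, step):
        for x in range(1, w - 1, step):
            patch = img[y - 1:y + 2, x - 1:x + 2]
            if patch.var() > max_var:
                continue                      # rule iii: not homogeneous
            b = min(int(img[y, x] * n_bins), n_bins - 1)
            if counts[b] >= per_bin:
                continue                      # rule ii: this bin already has enough samples
            counts[b] += 1
            picked.append((y, x))
    return picked
```

The positions would be chosen once and then read from every exposure, so that each selected pixel contributes one term per image pair to Equation 12.19.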
12.4.3 High Dynamic Range Image Construction

Once the response function (estimated or known a priori) is at hand, the high dynamic range image can be assembled. This image is usually referred to as a radiance map and is composed of floating point values having greater range and tonal resolution than usual LDR data. The principle is that each pixel in each image provides a more or less accurate estimate of the scene radiance at that specific position. For example, very low pixel values coming from low exposure images are usually noisy, and thus not reliable, but the same pixels are likely to be well exposed in images acquired with higher exposure settings. Given $N$ images with exposure times $t_j$, $j = 1, \dots, N$, and considering Equation 12.16, the sequence $\{g(Z_{i,1})/t_1, g(Z_{i,2})/t_2, \dots, g(Z_{i,N})/t_N\}$ of estimates for the pixel in position $i$ is obtained. These estimates should be combined by means of a weighted average that takes the reliability of each pixel into account. The weighting should completely discard pixels that appear saturated and assign very low weight to pixels whose value is below some noise floor, since they cannot provide a decent estimate.

One possible weighting function is a hat- or Gaussian-shaped function centered around mid-gray pixel values, which are far from both noise and saturation. As a rule of thumb, for each pixel there should be at least one image providing a useful value (i.e., one that is neither saturated nor excessively noisy). Given the weighting function $w(Z)$, the radiance estimate for a given position $i$ is calculated as follows:

$$I_i = \frac{\sum_{j=1}^{N} w(Z_{i,j})\, g(Z_{i,j}) / t_j}{\sum_{j=1}^{N} w(Z_{i,j})} \qquad (12.20)$$

12.4.4 The Scene Versus the Display Medium

Once the high dynamic range image has been assembled, what is usually required is a final rendering on the display medium, such as a CRT display or a printer.
The human eye is capable of seeing a huge range of luminance intensities, thanks to its ability to adapt to different values. Unfortunately, this is not how most image rendering systems work; hence they are usually unable to deal with the full dynamic range contained in images that approximate real-world scenes. Most CRT displays have a useful dynamic range of roughly 1:100. High dynamic range reproduction devices are expected to become available in the near future, but for the moment they are far from the mass consumer market.

Simply stated, tone mapping is the problem of converting an image containing a large range of numbers, usually expressed in floating point precision, into a meaningful number of discrete gray levels (usually in the range 0, ..., 255) that can be used by any imaging device. We can therefore formulate the topic as the following quantization problem:

$$Q(\mathrm{val}) = \left\lfloor (N-1)\,F(\mathrm{val}) + 0.5 \right\rfloor, \quad F : [L_{w\min}, L_{w\max}] \rightarrow [0, 1] \qquad (12.21)$$

where $[L_{w\min}, L_{w\max}]$ is the input range, $N$ is the number of allowed quantization levels, and $F$ is the tone mapping function. A simple linear scaling usually leads to the loss of a large amount of information in the reproduced image. Figure 12.15a shows the result obtained by linearly scaling a high dynamic range image constructed from the sequence in Figure 12.13 using the techniques described above. As can be seen, only a portion of the scene is visible, so better alternatives for $F$ are needed. Two different categories of tone mapping exist:

• tone reproduction curve (TRC), where the same function is applied to all pixels; and
• tone reproduction operator (TRO), where the function acts differently depending on the value of a specific pixel and its neighbors.
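The construction step of Equation 12.20 and the quantization of Equation 12.21 can be sketched together as follows. The sketch assumes pixel values normalized to [0, 1], an already-inverted response `g`, and a triangular (hat) weight whose noise-floor and saturation thresholds (0.02 and 0.98) are illustrative choices. `linear_F` is the simple linear scaling discussed above, while `log_F` (plain logarithmic scaling) is added only as an example of a TRC and is not one of the methods described in this chapter. All names are hypothetical.

```python
import numpy as np

def hat_weight(z, floor=0.02, ceiling=0.98):
    """Triangular weight peaking at mid-gray; zero at or beyond the noise floor and saturation."""
    mid = 0.5 * (floor + ceiling)
    w = np.where(z <= mid, z - floor, ceiling - z)
    return np.clip(w, 0.0, None)

def merge_hdr(frames, times, g):
    """Weighted average of the per-exposure radiance estimates g(Z)/t (Eq. 12.20)."""
    num = np.zeros_like(frames[0], dtype=float)
    den = np.zeros_like(frames[0], dtype=float)
    for z, t in zip(frames, times):
        w = hat_weight(z)
        num += w * g(z) / t
        den += w
    # Pixels with no usable exposure (den == 0) are left at 0 in this sketch.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def linear_F(v):
    return (v - v.min()) / (v.max() - v.min())

def log_F(v):
    lv = np.log(v)
    return (lv - lv.min()) / (lv.max() - lv.min())

def quantize(v, F, n_levels=256):
    """Q(val) = floor((N - 1) F(val) + 0.5), Eq. 12.21."""
    return np.floor((n_levels - 1) * F(v) + 0.5).astype(int)

# Synthetic check with an assumed linear camera, g(Z) = Z.
radiance = np.array([0.05, 0.1, 0.5, 2.0, 8.0])
times = [1.0, 0.25, 0.0625]
frames = [np.clip(radiance * t, 0.0, 1.0) for t in times]
hdr = merge_hdr(frames, times, g=lambda z: z)   # recovers the radiance values
q_lin = quantize(hdr, linear_F)                 # dark values crowd into the lowest codes
q_log = quantize(hdr, log_F)
```

Saturated samples (clipped to 1.0) receive zero weight, so the brightest radiances are recovered from the shortest exposures only.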
In what follows, several techniques (histogram adjustment from the TRC category, and Chiu's local operator, bilateral filtering, photographic tone reproduction, and gradient compression from the TRO category) will be briefly described and applied to the input HDR image assembled from the sequence in Figure 12.13. The recorded input range was 0.00011 to 32.

FIGURE 12.15 (See color insert.) (a) HDR image built from the sequence of images shown in Figure 12.13 using linear scaling in the [0, 1] range and quantization to 8 bits, (b) image obtained using histogram adjustment mapping, (c) image mapped using Chiu's algorithm with some halo artifacts highlighted, (d) image mapped using bilateral filtering, (e) photographic tone reproduction mapping, and (f) gradient compression mapping.

12.4.4.1 Histogram Adjustment

The algorithm described in Reference [16] is based on ideas coming from image enhancement techniques, specifically histogram equalization. While histogram equalization is usually employed to expand the contrast of images, in this case it is adapted to map the high dynamic range of the input image into that of the display medium while preserving the sensation of contrast. The process starts by computing a downsampled version of the image in which each pixel corresponds to roughly 1 degree of visual angle. Luminance values of this so-called fovea image are then converted to the brightness domain, which can be approximated by computing logarithmic values. For the logarithm-valued image, a histogram is built in which values between the minimum and maximum bounds $L_{w\min}$ and $L_{w\max}$ (of the input radiance map) are equally distributed on the logarithmic scale. Using around 100 histogram bins, each having a size of

$$\Delta b = \frac{\log(L_{w\max}) - \log(L_{w\min})}{100}$$

usually provides sufficient resolution.
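The histogram construction just described, followed by a cumulative-distribution mapping, can be sketched as plain histogram equalization in the log domain. This is a simplified illustration under stated assumptions: it skips the fovea downsampling and the ceiling on bin counts that Reference [16] uses to avoid exaggerating contrast, and it maps directly to $F \in [0, 1]$ for use in Equation 12.21. The function name is hypothetical.

```python
import numpy as np

def log_histogram_mapping(luminance, n_bins=100):
    """Map luminance to [0, 1] through the CDF of its log-domain histogram."""
    log_l = np.log(luminance)                 # brightness approximated by log luminance
    lo, hi = log_l.min(), log_l.max()         # log(Lw_min), log(Lw_max)
    # Equal-width bins on the log scale: bin size = (hi - lo) / n_bins.
    counts, edges = np.histogram(log_l, bins=n_bins, range=(lo, hi))
    cdf = np.cumsum(counts) / counts.sum()    # cumulative distribution normalized by T
    # Interpolate the CDF between the right edges of the bins.
    return np.interp(log_l, edges[1:], cdf)
```

Because the CDF is non-decreasing, the mapping preserves the ordering of luminance values while spreading densely populated ranges over more output levels.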
The cumulative distribution function of this histogram, normalized by the total number of pixels $T$, is defined as:

$$P(b) = \sum_{b_i < b} \frac{f(b_i)}{T}, \quad \text{with} \quad T = \sum_{b_i} f(b_i)$$

... is constructed, and for each level the gradient field is computed. The attenuation function is then computed at each level and propagated to the upper levels in a bottom-up fashion. The attenuation function at the top (full-resolution) level is the one that is effectively used in Equation 12.38. The attenuation function at each level $s$ is computed as follows:

$$\Psi_s(x,y) = \frac{\alpha}{\|\nabla l_s(x,y)\|} \left( \frac{\|\nabla l_s(x,y)\|}{\alpha} \right)^{\beta} \qquad (12.40)$$

The $\alpha$ parameter in Equation 12.40 determines which gradient magnitudes are left untouched, while the $\beta$ exponent ($0 < \beta < 1$) attenuates magnitudes greater than $\alpha$ and slightly magnifies those smaller than $\alpha$. Suggested values are $\beta = 0.9$ and $\alpha$ equal to 0.1 times the average gradient magnitude. Since the attenuation function is computed for each resolution level $s$, the propagation to full resolution is done by up-sampling the accumulated attenuation function from each coarser level to the next finer one and multiplying it by the attenuation function of that level, to obtain the full-resolution attenuation function $\Phi(x, y)$ that is effectively used (the authors claim that, by applying the attenuation function only at full resolution, halo artifacts are mostly invisible). This can be expressed by the following equations:

$$\Phi_d(x,y) = \Psi_d(x,y) \qquad (12.41)$$
$$\Phi_k(x,y) = L(\Phi_{k+1})(x,y) \cdot \Psi_k(x,y) \qquad (12.42)$$
$$\Phi(x,y) = \Phi_0(x,y) \qquad (12.43)$$

where $d$ is the smallest resolution level and $L$ is the bilinear up-sampling operator. Figure 12.15f shows the result of applying the gradient compression operator to our sample HDR image. The operator looks computationally more complicated than the others that have been described, but, as can be seen, the mapped image looks far more impressive in terms of highlight and low-light visibility than the previous renderings.

12.5 Conclusion

The problem of choosing proper exposure settings for image acquisition is, of course, closely related to the dynamic range of the real scene.
In many cases, good results can be achieved by implementing ad hoc metering strategies. Alternatively, it is possible to apply tone correction methods that enhance the overall contrast of the most salient regions of the picture. When the limited dynamic range of the imaging sensor does not allow the dynamic range of the real world to be recovered, a good final rendering can be realized only by using “bracketing”, that is, by acquiring several pictures of the same scene with different exposure times. In this chapter, we have presented a review of automatic digital exposure correction methods and reported the specific peculiarities of each solution.

For completeness, we note that Raskar et al. [21] have recently proposed a novel strategy that “flutters” the camera's shutter open and closed during the chosen exposure time according to a binary pseudorandom sequence. In this way, high-frequency spatial details can be recovered, especially when movements with constant speed are present. In particular, the so-called coded exposure makes the deconvolution problem well-posed, so that a robust deconvolution can be achieved. We think that Raskar's technique could also be used in multi-picture acquisition to limit the overall number of images needed to reconstruct a reliable HDR map.

References

[1] M. Mancuso and S. Battiato, “An introduction to the digital still camera technology,” ST Journal of System Research, vol. 2, no. 2, pp. 1–9, January 2001.
[2] S. Battiato, A. Bosco, A. Castorina, and G. Messina, “Automatic image enhancement by content dependent exposure correction,” EURASIP Journal on Applied Signal Processing, vol. 2004, no. 12, pp. 1849–1860, September 2004.
[3] S. Battiato, A. Castorina, and M. Mancuso, “High dynamic range imaging for digital still camera: An overview,” Journal of Electronic Imaging, vol. 12, no. 3, pp. 459–469, July 2003.
[4] R. Gonzalez and R. Woods, Digital Image Processing. Addison-Wesley Longman Publishing, 2nd edition, Boston, MA, USA, 1992.
[5] S. Bhukhanwala and T. Ramabadran, “Automated global enhancement of digitized photographs,” IEEE Transactions on Consumer Electronics, vol. 40, no. 1, pp. 1–10, February 1994.
[6] B.E. Bayer, “Color imaging array,” U.S. Patent 3 971 065, July 1976.
[7] S. Phung, A. Bouzerdoum, and D. Chai, “A novel skin color model in YCbCr color space and its application to human face detection,” in Proceedings of the IEEE International Conference on Image Processing, Rochester, NY, USA, September 2002, vol. 1, pp. 289–292.
[8] B. Zarit, B. Super, and F. Quek, “Comparison of five colour models in skin pixel classification,” in Proceedings of the International Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, Corfu, Greece, September 1999, pp. 58–63.
[9] J. Yang, W. Lu, and A. Waibel, “Skin-colour modeling and adaptation,” Technical Report CMU-CS-97-146, School of Computer Science, Carnegie Mellon University, 1997.
[10] STMicroelectronics, “Colour sensor evaluation kit vv6501,” Edinburgh, available online: www.edb.st.com/products/image/sensors/501/6501evk.htm.
[11] M. Soriano, B. Martinkauppi, and M.L.S. Huovinen, “Skin color modeling under varying illumination conditions using the skin locus for selecting training pixels,” in Proceedings of the International Workshop on Real-time Image Sequence Analysis, Oulu, Finland, August 2000, pp. 43–49.
[12] C. Wyckoff, “An experimental extended response film,” SPIE Newsletter, pp. 16–20, June/July 1962.
[13] Y.C. Chang and J.F. Reid, “RGB calibration for analysis in machine vision,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 5, no. 10, pp. 1414–1422, October 1996.
[14] T. Mitsunaga and S. Nayar, “Radiometric self calibration,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, June 1997, vol. 2, pp. 374–380.
[15] S. Mann, “Comparametric equations with practical applications in quantigraphic image processing,” IEEE Transactions on Image Processing, vol. 9, no. 8, pp. 1389–1406, August 2000.
[16] G.W. Larson, H. Rushmeier, and C. Piatko, “A visibility matching tone reproduction operator for high dynamic range scenes,” IEEE Transactions on Visualization and Computer Graphics, vol. 3, no. 4, pp. 291–306, October–December 1997.
[17] B. Chiu, M. Herf, P. Shirley, S. Swamy, C. Wang, and K. Zimmerman, “Spatially nonuniform scaling functions for high contrast images,” in Proceedings of the Graphics Interface Conference, Toronto, ON, Canada, May 1993, pp. 245–253.
[18] F. Durand and J. Dorsey, “Fast bilateral filtering for the display of high dynamic range images,” in Proceedings of the International Conference on Computer Graphics and Interactive Techniques, San Antonio, TX, USA, July 2002, pp. 257–266.
[19] E. Reinhard, M. Stark, P. Shirley, and J. Ferwerda, “Photographic tone reproduction for digital images,” in Proceedings of the International Conference on Computer Graphics and Interactive Techniques, San Antonio, TX, USA, July 2002, pp. 267–276.
[20] R. Fattal, D. Lischinski, and M. Werman, “Gradient domain high dynamic range compression,” in Proceedings of the International Conference on Computer Graphics and Interactive Techniques, San Antonio, TX, USA, July 2002, pp. 249–256.
[21] R. Raskar, A. Agrawal, and J. Tumblin, “Coded exposure photography: Motion deblurring using fluttered shutter,” ACM Transactions on Graphics, vol. 25, no. 3, pp. 795–804, July 2006.

13 Digital Camera Image Storage Formats

Kenneth A. Parulski and Robert Reisch

13.1 Introduction
13.2 Image Formats, Memory Formats, and Metadata
13.3 History of Image Formats for Digital Cameras
13.4 Exif-JPEG Image Format Structure
13.5 Exif-JPEG Digital Camera Metadata
13.6 Raw Image Formats
13.7 Directory and Control Formats
13.8 Advanced Image Formats
13.9 Conclusion
References

13.1 Introduction

This chapter describes the image formats used by most current digital cameras. These include the Exif (Exchangeable image file) format [1], [2], used to store processed, JPEG (Joint Photographic Experts Group) compressed images in most consumer cameras, and the TIFF (Tagged Image File Format) compatible raw formats [3] used by many DSLR (Digital Single Lens Reflex) cameras. These current formats have evolved over time, adopting some features first introduced in other, less successful, image formats. Thus, a historical perspective can be helpful in understanding the image formats used in today's single-sensor color cameras. The chapter begins with Section 13.2, which provides a high-level description of image formats, memory card formats, and image metadata.
Following this introduction, a timeline showing the evolution of digital camera image formats is presented in Section 13.3. Section 13.4 describes the structure of the Exif-JPEG image format, and Section 13.5 focuses on the metadata included in Exif-JPEG image files. This is followed by Section 13.6, which describes raw image file formats, using the TIFF/EP (Tagged Image File Format for Electronic Photography) format as an example. Next, the DCF (Design rule for Camera File system) directory structure and the DPOF (Digital Print Order Format) control format are described in Section 13.7. Finally, a number of recent image formats are reviewed in Section 13.8. The chapter concludes with Section 13.9.

TABLE 13.1 Memory card formats.

Name           Date   Size (mm)            Type / Contacts   URL
PC Card        1990   85.6 × 54.0 × 3.3    Socket / 68       www.pcmcia.org
Compact Flash  1994   36.0 × 43.0 × 3.3    Socket / 50       www.compactflash.org
Smart Media    1996   45.0 × 37.0 × 0.76   Surface / 22      www.ssfdc.or.jp
MMC            1997   32.0 × 24.0 × 1.4    Surface / 7       www.mmca.org
Memory Stick   1998   21.5 × 50 × 2.8      Surface / 10      www.memorystick.com
SD             1999   32.0 × 24.0 × 1.4    Surface / 9       www.sdcard.org
xD             2002   20.0 × 25.0 × 1.7    Surface / 18      www.xd-picture.com

13.2 Image Formats, Memory Formats, and Metadata

The conventional wisdom, when consumer digital cameras were first developed, was that all cameras would need to use the same standard memory card format in order to be successful. This thinking was based on the fact that film photography became a mass-market success as a result of film format standards. The interoperability provided by 35 mm film standards allowed many different companies throughout the world to provide compatible cameras, films, and photofinishing services. These standards ensured that images captured on 35 mm film could be developed and printed anywhere in the world.
In other words, standard physical properties and photofinishing processes for the film media were necessary to enable interoperability. However, consumer digital cameras have used many different memory card formats, as shown in Table 13.1. Over the years, these card formats evolved from credit-card-sized PC Cards to smaller CompactFlash and Smart Media cards, and eventually to Memory Stick, MMC (MultiMedia Card), SD (Secure Digital), and xD cards. Miniature versions of some of the card formats listed in the table, such as miniSD cards, are used in mobile phones that incorporate digital cameras. All of these cards, except Smart Media cards, are still used today, and it is likely that new card formats will be introduced in the coming years.

Nevertheless, consumer digital photography became a mass-market success. This happened because consumer digital cameras adopted the same standard image file format. So the image file and directory formats, not the memory card format, are most analogous to the film format in conventional photography. Today's consumer digital cameras store images using the industry standard Exif compressed image format, which uses the JPEG image compression standard [4]. This enables the images from digital cameras to be used by many other devices, such as home computers, appliance printers, retail kiosks, and online printing services. The image files can be used by many different software applications, and posted on web pages or emailed so that they can be accessed anywhere in the world. Many different companies provide these digital photography products, software, and services, which benefits both consumers and manufacturers. But the compatibility provided by the Exif-JPEG format has some drawbacks, too.
The format uses baseline JPEG compression, which is limited to storing 24-bit color images, using 8 bits per component for the luminance (Y) component and the two color components, red minus luminance (Cr) and blue minus luminance (Cb). Furthermore, baseline JPEG uses a lossy compression algorithm that can create image artifacts [5].

FIGURE 13.1 Images and metadata.

Camera Model           Kodak DX4330 Camera              Kodak V705 Dual Lens Camera
Image Size             1726 × 1150 pixels               3072 × 2304 pixels
Date / Time Original   July 10, 2004 10:59:43 am        April 19, 2007 5:38:33 pm
Lens Focal Length      63 mm (35mm equiv.)              94 mm (35mm equiv.)
Flash                  Flash did not fire, auto mode    Flash fired, always flash mode
Exposure Time          1/700 second                     1/100 second
F-number               7.0                              4.3
Metering Mode          Average                          Pattern

Professional photography applications often require higher quality. Therefore, most professional cameras, and many advanced amateur cameras, include a raw image format setting. The raw file stores digital data that directly relate to the single-sensor color information captured by the camera, normally using a single sample of 12 or 16 bits per pixel. The color demosaicking and other processing are performed after the image is transferred to a host computer. This allows more sophisticated image processing to be used, and enables the user to adjust various image-processing settings. Because the raw image file stores the data directly from the sensor, the characteristics of the raw image data, such as the color it encodes and the type of noise it includes, are specific to the type of digital camera that created the file. In order to produce the finished image, the host device must be able to perform the image processing required for the specific camera. In many cases, the companies that make digital cameras or raw image processing software keep the details of their image processing algorithms proprietary, for competitive reasons.
Therefore, while the format of the raw data can be standardized, the images provided by standard raw files will vary because of differences in the image-processing algorithms used to transform the raw data into standard color image data.

In addition to the digital values used to encode the image, image files store so-called metadata [6]. Metadata is any type of information that relates to the image. Figure 13.1 shows two different captured images and some of the metadata provided by the camera when the images were captured. The metadata specifies the model of the camera that captured the image and the size of the image. The metadata also includes the date and time the image was captured. This is very useful in retrieving specific images captured during a particular period of time, or on a specific date, such as a birthday or holiday. The metadata also includes camera settings, such as the lens focal length and whether or not the camera flash fired. This metadata can help to automatically locate specific types of images, such as close-up scenes or indoor images. The camera settings also normally include the exposure time and f-number of the lens. A long exposure time means that the image is more likely to include motion blur, and a low f-number means that the image may have a relatively narrow depth of field. Other metadata provide camera mode settings, such as the metering mode. All of this metadata is stored using standardized tags within the Exif-JPEG file. Each metadata tag includes a field that identifies the particular type of metadata stored by the tag, along with fields that provide the metadata values and define how these values are encoded. Some digital cameras provide other types of metadata, such as the ambient light level, lens focus distance, and the position of the main subject in the image.
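The tag layout described above follows the TIFF structure: each entry of an IFD (image file directory) is 12 bytes holding a tag identifier, a field type, a value count, and either the value itself (when it fits in 4 bytes) or an offset to it. The sketch below reads the first IFD of a TIFF-structured byte string, such as the one carried inside an Exif-JPEG file. It handles only a few common field types (BYTE, ASCII, SHORT, LONG, RATIONAL), the function name is hypothetical, and it is not a complete Exif parser (for example, it does not follow the pointer to the Exif sub-IFD).

```python
import struct

# Byte sizes of the TIFF field types handled by this sketch.
TYPE_SIZES = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8}   # BYTE, ASCII, SHORT, LONG, RATIONAL

def read_first_ifd(tiff):
    """Return {tag: value} for the first IFD of a TIFF-structured byte string."""
    order = {b"II": "<", b"MM": ">"}[tiff[:2]]       # little- or big-endian
    magic, ifd_offset = struct.unpack(order + "HI", tiff[2:8])
    if magic != 42:
        raise ValueError("not a TIFF header")
    (n_entries,) = struct.unpack(order + "H", tiff[ifd_offset:ifd_offset + 2])
    tags = {}
    for k in range(n_entries):
        pos = ifd_offset + 2 + 12 * k                 # each entry is 12 bytes
        tag, ftype, count = struct.unpack(order + "HHI", tiff[pos:pos + 8])
        if ftype not in TYPE_SIZES:
            continue                                  # skip types this sketch ignores
        size = TYPE_SIZES[ftype] * count
        if size <= 4:                                 # small values are stored inline
            raw = tiff[pos + 8:pos + 8 + size]
        else:                                         # larger values are stored at an offset
            (offset,) = struct.unpack(order + "I", tiff[pos + 8:pos + 12])
            raw = tiff[offset:offset + size]
        if ftype == 2:
            tags[tag] = raw.rstrip(b"\x00").decode("ascii")
        elif ftype == 5:
            nums = struct.unpack(order + "%dI" % (2 * count), raw)
            tags[tag] = list(zip(nums[0::2], nums[1::2]))   # (numerator, denominator)
        elif ftype in (3, 4):
            fmt = "H" if ftype == 3 else "I"
            tags[tag] = struct.unpack(order + "%d%s" % (count, fmt), raw)
        else:
            tags[tag] = raw
    return tags
```

For example, tag 0x010F (Make) is stored as ASCII text and tag 0x0112 (Orientation) as a single SHORT.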
The metadata can also indicate the copyright owner, and other types of tags or labels selected by the photographer. In some cameras and camera phones, the metadata includes GPS (Global Positioning System) coordinates indicating the location of the camera when the picture was taken. In addition, proprietary metadata can be included in the image file. To remain useful, the metadata must be properly managed when an image is edited.

13.3 History of Image Formats for Digital Cameras

The image formats used in today's digital cameras are built on the TIFF image format and the JPEG compression standard. Figure 13.2 is a timeline showing when various image formats were developed and revised.

The JPEG standard, ISO/IEC 10918-1, specifies a baseline compression algorithm using the discrete cosine transform (DCT). The JPEG standard was developed in the late 1980s and published as an approved international standard in 1994. It defines a minimal image file format for the exchange of compressed data, but without color space specifications. Instead, it defines only the metadata needed to decompress the JPEG compressed image data, such as the quantization and Huffman tables. While the JPEG standard was being approved, the JPEG File Interchange Format (JFIF) [7] was published by Eric Hamilton of C-Cube Microsystems. JFIF specifies a standard image orientation and color encoding. It uses an APP0 application marker at the beginning of the JPEG file to store a thumbnail image and metadata indicating the pixel aspect ratio and density. Several years later, the JPEG group developed a complete Still Picture Interchange File Format (SPIFF), which was published in 1997 as part of ISO/IEC 10918-3, JPEG extensions [8]. But this was too late, and SPIFF never achieved significant adoption.

TIFF was developed by Aldus and Microsoft in the 1980s for storing digital images from scanners and computer graphics. It remains one of the most popular and flexible raster file formats.
TIFF is now controlled by Adobe, which published the current version 6.0 in 1992 [9]. TIFF defines a very flexible, tag-based file structure for storing image data and metadata. TIFF files can store many different types of image data. Standard TIFF tags are used to indicate, for example, the number of samples per line and the number of bits per sample. TIFF defines some common metadata, such as the make and model of the device that created the file, and the date and time the file was last modified. TIFF also allows users to register proprietary tags to store new types of metadata. However, baseline TIFF 6.0 implementations are only required to read uncompressed image data with 8 bits per channel and can ignore most of the metadata in the file. In 1994, Adobe published a draft TIFF technical note describing how JPEG compressed data should be embedded in TIFF files, which was later supported by its software applications [10]. However, baseline TIFF readers are not required to support this embedded JPEG data, so even today very few software applications provide this capability.

FIGURE 13.2 Image format timeline, 1990 to 2008 (DCF, Exif-TIFF and Exif-JPEG versions, TIFF 6.0, TIFF/EP (ISO 12234-2), DNG, HD Photo, CIFF, FlashPix, SISRIF, JFIF, and the JPEG standards ISO/IEC 10918 and ISO 15444).

The first digital cameras used proprietary image formats. In 1992, the ISO technical committee on photography standards, known as technical committee 42 (TC 42), created working group 18 (WG 18) in order to develop standards for electronic still picture imaging. In 1993, the Electronic Still Camera Working Group of the Japan Electronic Industry Development Association (JEIDA) adopted a standard image data format, which could store both uncompressed and baseline JPEG compressed image data on PC Cards.
The development of this standard, later called SISRIF (Still Image, Sound and Related Information Format) [11], was led by Motokazu Ohkawa of Toshiba. SISRIF used tuples to store metadata, including the date and time, image size, and audio annotation. SISRIF was proposed to WG18 to become the ISO standard for storing digital camera images. In 1994, two alternative image formats were also proposed to ISO/TC42/WG18. One alternative was the TIFF/EP format [12], led by George Lathrop of Eastman Kodak Company. Another alternative was the Exif format [13], led by Makio Watanabe of Fuji Photo Film.

TIFF/EP supported three different types of image data: uncompressed RGB (using baseline TIFF 6.0), JPEG compressed data, and raw data. The advantage of TIFF/EP was that all three types of camera image data could be stored using a common TIFF wrapper. The disadvantage was that few existing software applications could read the JPEG compressed data embedded in the TIFF/EP file. The Exif image format solved this problem by using two different types of files for compressed and uncompressed image data. A TIFF file and TIFF tags were used for uncompressed data. Compressed data was stored in a JPEG file, with the metadata stored as TIFF tags within a TIFF structure embedded in an APP1 application segment at the beginning of the JPEG file. This enabled existing JPEG software applications to read the Exif-JPEG image data while ignoring the newly defined Exif metadata in the APP1 application segment. It was this backward compatibility with existing JPEG software that made Exif successful.

Version 1 of Exif was approved by JEIDA in 1996. ISO/TC42/WG18 worked to standardize, to the extent possible, the metadata tags used in TIFF/EP and Exif. This ISO group developed the ISO 12234-1 standard for storing digital camera images on memory cards [14], which was published in 2001.
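The Exif-JPEG arrangement described above, with TIFF-structured metadata in an APP1 segment at the start of the JPEG file, can be located with a short marker scan. The sketch below walks the JPEG marker segments that precede the compressed data (each segment starts with 0xFF and a code byte, followed by a 2-byte big-endian length that counts itself but not the marker) and returns the payload of the first APP1 segment that begins with the identifier "Exif\0\0". The function name is hypothetical, and the sketch assumes a well-formed file; it does not handle fill bytes or standalone markers other than SOI.

```python
import struct

def find_exif_tiff(jpeg):
    """Return the TIFF block inside the first Exif APP1 segment, or None."""
    if jpeg[:2] != b"\xff\xd8":                  # SOI marker
        raise ValueError("not a JPEG stream")
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        code = jpeg[pos + 1]
        if code == 0xDA:                         # SOS: compressed data follows
            break
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])
        payload = jpeg[pos + 4:pos + 2 + length]
        if code == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            return payload[6:]                   # TIFF header starts here
        pos += 2 + length                        # skip marker bytes plus segment
    return None
```

A reader that does not understand APP1 can skip it in exactly the same way, using only the length field, which is what lets plain JPEG software ignore the Exif metadata.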
The standard allowed either Exif or TIFF/EP to be used as the image format for new digital cameras. Several other image formats were developed during the mid-1990s. The FlashPix format [15], developed by Kodak, Microsoft, Hewlett-Packard, and Live Picture, was made public in 1996. FlashPix used a hierarchical image representation, with 64 × 64 pixel image tiles, in a structured storage file. Each tile could be uncompressed or use baseline JPEG compression. FlashPix adopted most of the metadata definitions from TIFF/EP, but stored the metadata in property sets rather than TIFF tags. FlashPix files had a standard-sized thumbnail, and optionally included metadata extensions such as audio annotations and proprietary metadata. FlashPix supported two color spaces: the wide-gamut KODAK PHOTOYCC Color Interchange Space [16], first used in Kodak's Photo CD system, and the NIF RGB color space. NIF RGB matched the typical indoor viewing conditions of CRT monitors, and led to the sRGB color space later standardized by the IEC [17]. In 1997, Exif was updated to version 2.0 to enable easy conversion between Exif files and FlashPix files. A new TIFF tag was defined to indicate whether or not the file contained sRGB color data. Some additional camera metadata tags, originally developed for TIFF/EP, were also added. An extension mechanism was defined to allow FlashPix extension streams to be stored in APP2 application segments within Exif files. The FlashPix extensions can include a standard audio annotation extension as well as proprietary extensions. While FlashPix never achieved significant adoption, these features remain part of the current Exif-JPEG format. The Camera Image File Format (CIFF) [18] was introduced by Canon in 1997 and was used by a number of camera makers for several years. CIFF stored images using JFIF files, but kept the metadata for all the images together in a so-called hierarchical heap.
This approach enabled rapid searching of the metadata. CIFF provided specifications for naming image files and for arranging the image files in a standard directory structure. It also specified how audio annotations were stored in separate WAV (Waveform audio) format files associated with particular images. In order to harmonize the competing Exif and CIFF formats, JEIDA approved version 1.0 of the DCF standard [19] in 1998. DCF uses modified versions of the file- and directory-naming conventions from CIFF, while adopting version 2.1 of the Exif format. Exif version 2.1 defined an Interoperability tag, which indicates whether the Exif file meets the requirements for DCF basic files. These requirements include using sRGB color image data, as well as requiring that the Exif-JPEG file include a 160 × 120 pixel compressed thumbnail. The JEIDA DCF specification successfully unified how digital cameras stored JPEG-compressed images. Surveys conducted by the author's company confirmed that for many years, all popular digital cameras sold worldwide have produced JPEG files that conform to the requirements for DCF basic files. In 2001, Epson introduced Print Image Matching (PIM) [20], which supported the sYCC color space [21], whose gamut is wider than that of sRGB. PIM also included proprietary metadata intended to improve printing. In response, Exif was updated to version 2.2 in 2002 in order to support sYCC and to provide new, standardized metadata useful for printing. Examples of this new metadata include the camera's white balance, contrast, saturation, and sharpness settings. In 2004, DCF was updated by the Japan Electronics and Information Technology Industries Association (JEITA) to version 2.0 [22], and Exif was updated to version 2.2.1, in order to support an optional extended-gamut color space known as Adobe RGB.
13.4 Exif-JPEG Image Format Structure

Essentially all current consumer digital cameras store fully processed, JPEG compressed images that meet the requirements for DCF basic files, using the JPG extension. DCF defines how the files are named and organized in directories, as described later in this chapter. It also sets restrictions and metadata requirements for the Exif-JPEG files. Figure 13.3 shows the structure of an Exif-JPEG image file. Exif-JPEG files are JPEG files that contain metadata in one or more Application Marker Segments (APPn) at the beginning of the JPEG file, as shown on the left side of the figure. Following the APPn segments, the file holds the quantization table (DQT), Huffman table (DHT), start of frame (SOF), and the compressed main image data. The main image data is compressed using baseline JPEG image compression, which performs a DCT on 8 × 8 pixel blocks of the Y (luminance) and color differential Cb and Cr signals. For DCF basic files, these color differential signals must be created from red, green, and blue signals provided in the sRGB color space. An optional color space, which corresponds to the Adobe RGB color specification, can be used in DCF optional files, as defined in DCF version 2.0. Normally, so-called 4:2:0 color differential subsampling is used for the main image. This means that there are twice as many rows and twice as many columns of Y samples as of Cb or Cr samples. However, DCF basic files can also use so-called 4:2:2 subsampling. This means that the Y, Cb, and Cr signals have the same number of rows, but there are twice as many columns of Y samples as of Cb or Cr samples. The DCT coefficient values of the Y, Cb, and Cr signals are then quantized, entropy coded, and stored as the main image data in the Exif-JPEG file.
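The color differential conversion and the two subsampling patterns described above can be illustrated with a short Python sketch. It is a minimal illustration, not a camera's actual processing pipeline: it assumes the JFIF-style full-range conversion with the ITU-R BT.601 coefficients (0.299, 0.587, 0.114), it assumes simple block averaging for chroma downsampling (real cameras may filter differently), and the function names are hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit sRGB pixel to full-range Y, Cb, Cr.

    Assumption: JFIF-style BT.601 coefficients with a +128 offset on the chroma signals.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128.0
    cr = 0.5 * r - 0.418688 * g - 0.081312 * b + 128.0
    return y, cb, cr


def chroma_plane_size(width, height, scheme):
    """Return (columns, rows) of each Cb or Cr plane for a given luminance size."""
    if scheme == "4:2:0":  # half the columns and half the rows of Y
        return (width + 1) // 2, (height + 1) // 2
    if scheme == "4:2:2":  # half the columns, same number of rows as Y
        return (width + 1) // 2, height
    raise ValueError(f"unsupported subsampling scheme: {scheme}")


def subsample(plane, scheme):
    """Downsample a chroma plane (a list of rows) by averaging neighboring samples.

    Assumes even dimensions: averages 2 x 2 blocks for 4:2:0 and 1 x 2 blocks for 4:2:2.
    """
    if scheme not in ("4:2:0", "4:2:2"):
        raise ValueError(f"unsupported subsampling scheme: {scheme}")
    row_step = 2 if scheme == "4:2:0" else 1
    out = []
    for i in range(0, len(plane), row_step):
        row = []
        for j in range(0, len(plane[0]), 2):
            block = [plane[i + di][j + dj] for di in range(row_step) for dj in range(2)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


print(chroma_plane_size(640, 480, "4:2:0"))  # (320, 240)
print(chroma_plane_size(640, 480, "4:2:2"))  # (320, 480)
```

The plane sizes use ceiling division, following the JPEG rule that a component's dimensions are rounded up when the image size is not a multiple of the sampling factor.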
FIGURE 13.3 Exif/JPEG image file structure. Left: the image file (SOI 0xFFD8, APP1 segment with Exif metadata, optional APP2 segments with FlashPix extensions, DQT 0xFFDB, DHT 0xFFC4, SOF, compressed 4:2:0 or 4:2:2 main image data, EOI 0xFFD9). Right: the APP1 segment (APP1 marker 0xFFE1, segment length, identifier code, TIFF byte order, TIFF identifier 0x002A, offset to the 0th IFD, the 0th IFD with TIFF 6.0 metadata tags and the Exif and GPS SubIFD pointers, the Exif SubIFD, the optional GPS SubIFD, and the 1st IFD with thumbnail metadata tags followed by the embedded 160 × 120 pixel 4:2:2 JPEG thumbnail) and the optional APP2 segments (content list or stream data, e.g., embedded audio and screen nail data).

TABLE 13.2 Structure of Exif APP1 segment header.

| Field | Bytes | Field definition | Hexadecimal value |
|---|---|---|---|
| APP1 | 2 | APP1 marker. | FF, E1 |
| Length | 2 | Total APP1 field byte count, including the 2-byte count value, but excluding the 2-byte APP1 marker itself. | |
| Identifier | 6 | This zero-terminated and padded string (Exif) uniquely identifies this APP1 segment. | 45, 78, 69, 66, 00, 00 |
| TIFF Byte Order | 2 | Byte Order value is either II (0x49, 0x49) (little endian) or MM (0x4D, 0x4D) (big endian). | 49, 49 |
| TIFF ID | 2 | TIFF Identifier (decimal 42). | 2A, 00 |
| Offset | 4 | Offset to the 0th Image File Directory (IFD). | 08, 00, 00, 00 if the IFD immediately follows, with II byte order |

The format of the APPn segments, which precede the main image data, is specified in Annex B of the JPEG standard, ISO/IEC 10918-1.
Application segments can have different numbers (e.g., APP1 and APP2) that correspond to the different application marker values defined in the JPEG standard. The JPEG standard does not limit the number or sequence of application segments that can be included in a JPEG file. However, the Exif specification places some restrictions on application segments. Exif requires that the JPEG file contain a specific APP1 segment immediately after the Start of Image (SOI) marker at the beginning of the file. This APP1 segment must have an identifier code, or label, with a value of Exif. The remainder of the APP1 segment is a TIFF file containing metadata and thumbnail image data. Following this APP1 segment, the file can optionally include a number of APP2 segments labeled FPXR (FlashPix-Ready). Storing metadata in these APP1 and APP2 segments yields image files with excellent backward compatibility, because the JPEG standard requires that readers ignore unknown application segments. Exif-JPEG files contain a TIFF file stored in an APP1 application segment immediately following the two-byte JPEG SOI marker. Table 13.2 shows the structure of the beginning portion of this APP1 segment. The first two bytes (FFE1 in hexadecimal notation, written as 0xFFE1) define the segment as an APP1 segment. The next two bytes are the length of the segment. Because this length is a 16-bit value, each application segment is limited to 65,535 bytes (not counting the 2-byte marker), just under 64 Kbytes. The next six bytes provide the ASCII characters of the identifier code field, which must be Exif followed by two zeros for termination and padding. The TIFF Byte Order field follows the Exif Identifier field. This Byte Order field is the first two bytes of the TIFF file that is embedded in the APP1 segment. It is followed by the decimal value 42, sometimes known as the TIFF magic number. Finally, the last entry in the table is the address offset to the Image File Directory (IFD0) structure that contains TIFF metadata tags.
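The header layout in Table 13.2 can be checked with a short Python sketch. It is a minimal illustration under stated assumptions: it follows the Exif rule that the Exif APP1 segment immediately follows SOI (a more tolerant reader might scan past other segments), it reads the segment length as big-endian, as JPEG marker segments require, and the function names are hypothetical.

```python
import struct


def exif_tiff_block(jpeg: bytes) -> bytes:
    """Return the TIFF data embedded in the Exif APP1 segment of a JPEG file."""
    if jpeg[0:2] != b"\xFF\xD8":
        raise ValueError("missing SOI marker (0xFFD8)")
    if jpeg[2:4] != b"\xFF\xE1":
        raise ValueError("Exif requires an APP1 marker (0xFFE1) right after SOI")
    # The 2-byte length counts itself but not the marker; JPEG lengths are big-endian.
    (length,) = struct.unpack(">H", jpeg[4:6])
    payload = jpeg[6:4 + length]
    if payload[:6] != b"Exif\x00\x00":
        raise ValueError("APP1 identifier is not 'Exif' plus two zero bytes")
    return payload[6:]  # starts at the TIFF Byte Order field


def parse_tiff_header(tiff: bytes):
    """Decode the 8-byte TIFF header: byte order, magic number 42, IFD0 offset."""
    orders = {b"II": "<", b"MM": ">"}  # little endian / big endian
    if tiff[:2] not in orders:
        raise ValueError("unknown TIFF byte order")
    endian = orders[tiff[:2]]
    magic, ifd0_offset = struct.unpack(endian + "HI", tiff[2:8])
    if magic != 42:
        raise ValueError("TIFF magic number is not 42")
    return endian, ifd0_offset


# A minimal synthetic file: SOI, an Exif APP1 holding only a TIFF header, EOI.
tiff_header = b"II" + struct.pack("<HI", 42, 8)
app1_payload = b"Exif\x00\x00" + tiff_header
jpeg = (b"\xFF\xD8" + b"\xFF\xE1" + struct.pack(">H", 2 + len(app1_payload))
        + app1_payload + b"\xFF\xD9")
print(parse_tiff_header(exif_tiff_block(jpeg)))  # ('<', 8)
```

The slice `jpeg[6:4 + length]` mirrors the Length field's definition in Table 13.2: the count includes its own two bytes but excludes the APP1 marker.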
The last three entries in Table 13.2 (TIFF Byte Order, TIFF ID, and Offset) therefore form the 8-byte header required for TIFF files.

Each IFD contains a group of metadata tags. The first two bytes of an IFD are a count field that indicates the number of tags present in the IFD. Each tag has a tag number. Tag numbers of 0x8000 (decimal 32768) and above are called private tags. Private tags are issued by Adobe, the registration authority for TIFF tags, to a particular company or group. The TIFF specification requires that all of the tags in an IFD be arranged in ascending order of tag number. Each TIFF tag is a 12-byte record with the following format:

1. Bytes 0-1: The Tag ID field, which is a number that identifies the particular tag.
2. Bytes 2-3: The Data Type field, which is one of the following values for the tags used in Exif-JPEG files:
   • 1 = BYTE, an 8-bit unsigned integer.
   • 2 = ASCII, a string of 8-bit bytes containing 7-bit ASCII character codes terminated with a NULL code.
   • 3 = SHORT, a 16-bit (2-byte) unsigned integer.
   • 4 = LONG, a 32-bit (4-byte) unsigned integer.
   • 5 = RATIONAL, two LONGs. The first LONG is the numerator and the second LONG is the denominator.
   • 7 = UNDEFINED, an 8-bit byte that can take any value depending on the field definition.
   • 9 = SLONG, a 32-bit (4-byte) signed integer.
   • 10 = SRATIONAL, two SLONGs. The first SLONG is the numerator and the second SLONG is the denominator.
3. Bytes 4-7: The Count field, which indicates the number of Data Type values (not the number of bytes) stored by the tag.
4. Bytes 8-11: The Value or Offset field, which holds the tag value(s) if they fit in these 4 bytes, or otherwise the offset to the data. The offset value is relative to the Byte Order field in the TIFF header.

The last 4 bytes of the IFD structure are either the offset to the next IFD, or four zero bytes indicating that this is the last IFD. In order to meet the requirements for DCF basic files, the APP1 segment in an Exif file contains two required IFDs.
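The 12-byte tag record and the rule for choosing between an inline value and an offset can be sketched as a small IFD reader in Python. This is an illustrative sketch, not a complete Exif parser: the function names are hypothetical, only the data types listed above are sized, signed types are returned as raw bytes, and offsets are interpreted relative to the start of the TIFF header, as the text describes.

```python
import struct

# Bytes per value for each TIFF/Exif data type code.
TYPE_SIZES = {
    1: 1,   # BYTE
    2: 1,   # ASCII
    3: 2,   # SHORT
    4: 4,   # LONG
    5: 8,   # RATIONAL (two LONGs)
    7: 1,   # UNDEFINED
    9: 4,   # SLONG
    10: 8,  # SRATIONAL (two SLONGs)
}


def read_ifd(tiff: bytes, offset: int, endian: str):
    """Parse one IFD.

    Returns ({tag_id: (data_type, count, raw_value_bytes)}, next_ifd_offset).
    """
    (num_tags,) = struct.unpack_from(endian + "H", tiff, offset)
    entries = {}
    for i in range(num_tags):
        pos = offset + 2 + 12 * i  # each tag is a 12-byte record
        tag, data_type, count = struct.unpack_from(endian + "HHI", tiff, pos)
        size = TYPE_SIZES[data_type] * count  # Count is in values, not bytes
        if size <= 4:
            raw = tiff[pos + 8:pos + 8 + size]  # value stored in the field itself
        else:
            (value_offset,) = struct.unpack_from(endian + "I", tiff, pos + 8)
            raw = tiff[value_offset:value_offset + size]  # offset from the Byte Order field
        entries[tag] = (data_type, count, raw)
    (next_ifd,) = struct.unpack_from(endian + "I", tiff, offset + 2 + 12 * num_tags)
    return entries, next_ifd  # next_ifd == 0 marks the last IFD


def decode(data_type, count, raw, endian):
    """Convert raw value bytes to Python values for the common unsigned types."""
    if data_type == 2:  # ASCII: drop the NULL terminator
        return raw.split(b"\x00", 1)[0].decode("ascii")
    if data_type == 3:
        return struct.unpack(f"{endian}{count}H", raw)
    if data_type == 4:
        return struct.unpack(f"{endian}{count}I", raw)
    if data_type == 5:  # list of (numerator, denominator) pairs
        nums = struct.unpack(f"{endian}{2 * count}I", raw)
        return [(nums[k], nums[k + 1]) for k in range(0, len(nums), 2)]
    return raw  # BYTE, UNDEFINED, and signed types are left as bytes in this sketch


# Hypothetical IFD0: Make "ACME" (ASCII, 5 bytes, stored at offset 38) and
# Orientation 6 (SHORT, stored inline and left-justified in the 4-byte field).
tiff = b"II" + struct.pack("<HI", 42, 8)
tiff += struct.pack("<H", 2)
tiff += struct.pack("<HHII", 0x010F, 2, 5, 38)
tiff += struct.pack("<HHIHH", 0x0112, 3, 1, 6, 0)
tiff += struct.pack("<I", 0)
tiff += b"ACME\x00"
entries, next_ifd = read_ifd(tiff, 8, "<")
print(decode(*entries[0x010F], "<"), decode(*entries[0x0112], "<"), next_ifd)  # ACME (6,) 0
```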
IFD0 provides metadata relating to the main JPEG compressed image, but no image data. IFD1 contains JPEG compressed data for the 160 × 120 pixel thumbnail. IFD0 includes some TIFF metadata tags that are defined in the TIFF 6.0 specification, such as Make and Model. These tags must be used as specified in the TIFF 6.0 document. IFD0 also includes the Exif IFD Pointer tag, 0x8769. This tag uses a LONG Data Type with a count of one, where the value is the offset to the Exif IFD. This private tag number was registered for use in the Exif specification. Therefore, all of the tags included in the Exif IFD are private tags, which must be used as specified in the Exif standard. The Exif IFD stores tags that describe the conditions under which the image stored in the file was captured. An Interoperability IFD Pointer tag, with a tag ID of 0xA005, is located within the Exif IFD. As shown in Table 13.3, the Interoperability IFD contains tags indicating whether the Exif image has followed the DCF interoperability rules, and thus whether the Exif file is a DCF basic file or a DCF optional file. IFD0 optionally includes the GPS Info IFD Pointer, with a tag ID of 0x8825. The GPS IFD stores tags that provide Global Positioning System (GPS) information indicating the location of the camera when the image was captured. It is used by a number of digital cameras and camera phones that include GPS receivers, either integrated in the camera or attached to it.

TABLE 13.3 Interoperability IFD metadata tags.

| Tag field name | Number | T | U | Description |
|---|---|---|---|---|
| InteroperabilityIndex | 0x0001 (1) | M | N | A value of R98 indicates a DCF basic file that has used sRGB color encoding. A value of R03 indicates a DCF optional file that has used Adobe RGB color encoding. These values are NULL terminated. |
| InteroperabilityVersion | 0x0002 (2) | M | N | This tag records the version of the InteroperabilityIndex value. The value is the 4-byte ASCII 0100, meaning Version 1.00. |
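Following the pointer chain described above, from IFD0 to the Exif IFD and then to the Interoperability IFD, can be sketched in Python. This minimal sketch assumes a well-formed TIFF block (such as the bytes that follow the Exif identifier in APP1), reads only the pointer tags and the InteroperabilityIndex tag it needs, and uses hypothetical function names; it is not a validating parser.

```python
import struct

EXIF_IFD_POINTER = 0x8769
INTEROP_IFD_POINTER = 0xA005
INTEROPERABILITY_INDEX = 0x0001


def ifd_entries(tiff, offset, endian):
    """Yield (tag, data_type, count, 4-byte value field) for each entry in an IFD."""
    (num_tags,) = struct.unpack_from(endian + "H", tiff, offset)
    for i in range(num_tags):
        pos = offset + 2 + 12 * i
        tag, data_type, count = struct.unpack_from(endian + "HHI", tiff, pos)
        yield tag, data_type, count, tiff[pos + 8:pos + 12]


def find_pointer(tiff, offset, endian, wanted_tag):
    """Return the LONG offset stored by a pointer tag, or None if it is absent."""
    for tag, _, _, field in ifd_entries(tiff, offset, endian):
        if tag == wanted_tag:
            return struct.unpack(endian + "I", field)[0]
    return None


def interoperability_index(tiff):
    """Follow IFD0 -> Exif IFD -> Interoperability IFD; return e.g. 'R98' or 'R03'."""
    endian = {b"II": "<", b"MM": ">"}[tiff[:2]]
    (ifd0,) = struct.unpack_from(endian + "I", tiff, 4)
    exif_ifd = find_pointer(tiff, ifd0, endian, EXIF_IFD_POINTER)
    if exif_ifd is None:
        return None
    interop_ifd = find_pointer(tiff, exif_ifd, endian, INTEROP_IFD_POINTER)
    if interop_ifd is None:
        return None
    for tag, _, count, field in ifd_entries(tiff, interop_ifd, endian):
        if tag == INTEROPERABILITY_INDEX:
            # "R98" or "R03" plus its NULL fits in 4 bytes, so the value is stored inline.
            return field[:count].split(b"\x00", 1)[0].decode("ascii")
    return None


def dcf_file_class(index):
    return {"R98": "DCF basic file (sRGB)",
            "R03": "DCF optional file (Adobe RGB)"}.get(index, "unknown")


# Synthetic big-endian TIFF block: IFD0 at 8, Exif IFD at 26, Interoperability IFD at 44.
tiff = b"MM" + struct.pack(">HI", 42, 8)
tiff += struct.pack(">HHHII", 1, EXIF_IFD_POINTER, 4, 1, 26) + struct.pack(">I", 0)
tiff += struct.pack(">HHHII", 1, INTEROP_IFD_POINTER, 4, 1, 44) + struct.pack(">I", 0)
tiff += struct.pack(">HHHI", 1, INTEROPERABILITY_INDEX, 2, 4) + b"R98\x00" + struct.pack(">I", 0)
print(dcf_file_class(interoperability_index(tiff)))  # DCF basic file (sRGB)
```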
Note that the InteroperabilityVersion value in Table 13.3 is not terminated by NULL, because the tag Type is UNDEFINED.

The Exif specifications allow a thumbnail image in an Exif-JPEG file to be stored either as uncompressed TIFF image data or as a compressed JPEG image stream. However, for DCF basic files, compressed thumbnails are required. This has made the use of uncompressed thumbnails obsolete. The JPEG compressed thumbnail must be 160 × 120 pixels, and must use 4:2:2 color differential sampling. Because the size and orientation of the thumbnail are fixed, the left and right edges of the thumbnail image must be padded when a main image having a portrait aspect ratio is stored. Black pixels are recommended, but not required, for padding. The top and bottom of the thumbnail image are padded when a landscape image having an aspect ratio wider than 4:3 is used for the main image. The compressed thumbnail image data is stored as a JPEG stream within IFD1 of the TIFF file contained in APP1 of the Exif-JPEG file, as shown at the bottom right of Figure 13.3. The stored thumbnail data is itself a JPEG file, beginning with a Start of Image (SOI) marker and ending with an End of Image (EOI) marker, but it is not allowed to contain Application Markers or Restart Markers. The tags listed in Table 13.4 are also required to define the thumbnail in IFD1.

TABLE 13.4 Thumbnail IFD tags.

| Tag field name | Number | Description |
|---|---|---|
| Compression | 0x0103 (259) | A value of 6 must be used for DCF basic files, which the Exif specification defines as the value for JPEG compressed thumbnail data. |
| XResolution | 0x011A (282) | The number of pixels in the image width direction per ResolutionUnit to use when reproducing the image. |
| YResolution | 0x011B (283) | The number of pixels in the image height direction per ResolutionUnit to use when reproducing the image. |
| ResolutionUnit | 0x0128 (296) | Used to define the scale of the XResolution and YResolution values. Typically, inches are used. |
| JPEGInterchangeFormat | 0x0201 (513) | This tag provides the offset to the location of the thumbnail image data. |
| JPEGInterchangeFormatLength | 0x0202 (514) | This tag provides the size of the JPEG compressed bit stream. |

13.5 Exif-JPEG Digital Camera Metadata

The metadata used in Exif-JPEG files is defined in the Exif specifications. However, this information is not always well understood by English readers, for several reasons. For example, the Exif specifications assume that the reader is familiar with both the JPEG standard (ISO/IEC 10918) and the TIFF 6.0 image file format. Furthermore, because the English versions of Exif are translations of the original Japanese-language documents, some of the descriptions are brief and ambiguous. Table 13.5 and Table 13.6 provide a more detailed explanation of some of the key metadata, provided by digital cameras, that is stored in Exif-JPEG image files. Table 13.5 lists metadata found in IFD0, and Table 13.6 lists metadata found in the Exif IFD. The tags are listed in numerical order. The third column of these tables uses the following key to indicate the type (T) of metadata: M indicates a Mandatory tag that must be recorded in DCF basic files, R indicates a Recommended tag that should be recorded, if possible, and O indicates an Optional tag that may be recorded. The Exif specifications provide guidelines describing which metadata tags need to be updated when an Exif file is edited by a software application. The fourth column of Table 13.5 and Table 13.6 indicates which metadata may need to be updated (U) when the file is edited.

The Exif specification does not explicitly describe how to add new, vendor-defined metadata to Exif files. However, because it uses TIFF tags to store metadata, it is possible to register new TIFF tags and to include these tags within IFD0 of the TIFF file stored in the APP1 segment of the Exif image file.
The Private Fields and Values section of the TIFF 6.0 specification describes the process for registering private TIFF tags. The remainder of the fields that make up the 12-byte TIFF tag can be set as required by the registrant of the tag. The newly registered TIFF tag can then be placed in the 0th IFD of an Exif-JPEG image. However, these tags should not conflict with, or duplicate, tags that are already defined in the Exif specifications. For example, the use of an ICC profile in an Exif file is not allowed, because the information in the ICC profile would either conflict with, or duplicate, the information contained in some of the color-related tags defined in the Exif specification. Instead of using TIFF tags, metadata can be stored using other types of encodings, such as XML (Extensible Markup Language). The DIG35 specification created XML metadata definitions for digital photography. These definitions were later included in the JPEG 2000 image format, which is discussed later in this chapter. In 2001, Adobe defined XMP (Extensible Metadata Platform), an XML-based metadata model that can be used with various defined sets of metadata items. XMP defines schemas for recording how an image has been captured, edited, and assembled into a final image. XMP metadata can be added to TIFF images using the XMP tag (700), and to JPEG images using an APP1 segment with the identifier code http://ns.adobe.com/xap/1.0/. However, this XMP APP1 segment should not be used in an Exif-JPEG file, since the Exif specification allows only APP1 segments that use the Exif identifier shown in Table 13.2.

TABLE 13.5 TIFF metadata tags in 0th IFD.

| Tag field name | Number | T | U | Description |
|---|---|---|---|---|
| ImageDescription | 0x010E (270) | R | N | A character string with the title of the image, using a one-byte (ASCII) character code. If Unicode is required for the title, the UserComment tag in the Exif IFD is typically used. |
| Make | 0x010F (271) | M | N | A character string with the name of the manufacturer of the digital camera. |
| Model | 0x0110 (272) | M | N | A character string with the model name or number of the digital camera. |
| Orientation | 0x0112 (274) | R | Y | The orientation of the stored main image data, indicating whether the reader should rotate the image before display. If the value is 1, the image is in the normal orientation and should not be rotated when displayed. If the value is 6, the image should be rotated 90 degrees clockwise; if the value is 8, the image should be rotated 90 degrees counterclockwise. Note that if the rotated image is saved, the Orientation tag value should be set to 1. |
| XResolution | 0x011A (282) | M | N | The number of pixels in the image width (X) direction per ResolutionUnit, to use when reproducing the image. |
| YResolution | 0x011B (283) | M | N | The number of pixels in the image height (Y) direction per ResolutionUnit, to use when reproducing the image. |
| ResolutionUnit | 0x0128 (296) | M | N | The unit for measuring both XResolution and YResolution, typically set to inches. |
| Software | 0x0131 (305) | O | Y | The name and version number of the firmware used by the digital camera, written as an ASCII string. |
| DateTime | 0x0132 (306) | R | Y | The date and time the file was last modified. The format is YYYY:MM:DD HH:MM:SS, with time shown in 24-hour format and the date and time separated by one blank character. |
| Artist | 0x013B (315) | R | N | The name of the camera owner or the photographer, written as an ASCII string. |
| YCbCrPositioning | 0x0213 (531) | M | N | The positioning of chrominance components in relation to the luminance component. A value of 2, co-sited, is normally used for Y:Cb:Cr = 4:2:2. A value of 1, centered, is normally used for Y:Cb:Cr = 4:2:0. |
| Copyright | 0x8298 (33432) | O | N | A copyright notice for the photographer or editor, written as an ASCII string. |
| ExifIFDPointer | 0x8769 (34665) | M | N | The value of this tag is the offset from the start of the TIFF header to the position where the Exif IFD is stored. |
| GPSInfoIFDPointer | 0x8825 (34853) | O | N | This tag is only recorded when global position information for the image is recorded. The value is the offset to the position where the GPS Info IFD is stored. |

TABLE 13.6 Camera capture metadata tags in Exif IFD.

| Tag field name | Number | T | U | Description |
|---|---|---|---|---|
| ExposureTime | 0x829A (33434) | R | N | The time, in seconds, that the image was exposed on the image sensor. |
| FNumber | 0x829D (33437) | O | N | The lens f-number used when the image was captured, which is the ratio of the lens focal length to the diameter of its aperture. |
| ExposureProgram | 0x8822 (34850) | O | N | The mode used by the camera to set the exposure time and f-number. Allowed values include 1 for manual mode (user selects all settings), 2 for the normal automatic exposure mode, 3 for aperture priority mode (user sets the f-number), 4 for shutter priority mode (user sets the exposure time), 5 for creative mode (biased toward depth of field), 6 for action mode (biased toward fast shutter speed), 7 for portrait mode (background out of focus), and 8 for landscape mode (background in focus). |
| SpectralSensitivity | 0x8824 (34852) | O | N | An ASCII string that gives the spectral sensitivity of each color channel, using an ASTM-defined format. |
| ISOSpeedRatings | 0x8827 (34855) | O | N | Indicates the ISO Speed and ISO Latitude of the camera or input device as specified in ISO 12232. The first value is the ISO saturation speed rating, and the last two optional values are the minimum and maximum ISO Speed Latitude values. |
| OECF | 0x8828 (34856) | O | N | A table of values that describe the Opto-Electronic Conversion Function of the camera, measured as specified in ISO 14524. |
| ExifVersion | 0x9000 (36864) | M | N | The version of the Exif standard used for the image file. For example, an Exif version 2.21 image file has a value of 0221 stored as 4-byte ASCII. Because the type is UNDEFINED, there is no NULL for termination. |
| DateTimeOriginal | 0x9003 (36867) | M | N | The date and time the picture was taken by the camera. The format is YYYY:MM:DD HH:MM:SS, with time shown in 24-hour format. |
| DateTimeDigitized | 0x9004 (36868) | M | N | The date and time when the image was stored as digital data, which is normally identical to DateTimeOriginal. |
| ComponentsConfiguration | 0x9101 (37121) | M | N | The order of the compressed data components, which has a value of 1230 (the bytes 1, 2, 3, 0) for Y, Cb, Cr data. |
| CompressedBitsPerPixel | 0x9102 (37122) | O | Y | The compression setting used when the image was compressed by the camera, expressed as a bits-per-pixel value. |
| ShutterSpeedValue | 0x9201 (37377) | O | N | The APEX (Additive System of Photographic Exposure) time value of the shutter speed: $t = 1/2^{T_v}$, where $t$ is the exposure time in seconds and $T_v$ is the ShutterSpeedValue. |
| ApertureValue | 0x9202 (37378) | O | N | The APEX value $A_v$ of the lens aperture used when the image was captured, related to the f-number $N$ by $N = 2^{A_v/2}$. For example, an APEX value of 6.0 indicates f/8.0. |
| BrightnessValue | 0x9203 (37379) | O | N | The luminance of the scene, as measured by the camera, expressed in APEX units as $B_v = \log_2\left(L/(0.3K)\right)$, where $L$ is the luminance in candelas/m² and $K$ is the reflected-light metering constant, 11.4 candelas/m². For example, a $B_v$ of 5 corresponds to about 109 candelas/m². |
| ExposureBiasValue | 0x9204 (37380) | O | N | The amount of intentional overexposure or underexposure, as an APEX value. For example, 2.0 means 2 photographic stops of overexposure, which has brightened the image, and -0.5 indicates one-half stop of underexposure, which has darkened the image. |
| MaxApertureValue | 0x9205 (37381) | O | N | The smallest f-number to which the camera lens can be set, expressed as an APEX value. |
| SubjectDistance | 0x9206 (37382) | O | N | The distance between the camera and the main subject of the image, given in meters. |
| MeteringMode | 0x9207 (37383) | O | N | The type of exposure metering used when capturing the image. Allowed values include 1 for average, 2 for center-weighted average, 3 for spot, 4 for multi-spot, and 5 for pattern. |
| LightSource | 0x9208 (37384) | O | N | The type or color temperature of the light source that illuminated the scene. Allowed values include 0 for unknown, 1 for daylight, 2 for fluorescent, 3 for incandescent, and 4 for flash. |
| Flash | 0x9209 (37385) | R | N | Whether the camera flash was used when the image was captured. Bit 0 indicates whether the flash fired. Bits 1 and 2 indicate whether the flash return light was detected. Bits 3 and 4 indicate the flash mode. Bit 5 indicates whether or not the camera included a flash, and bit 6 indicates whether the red-eye flash mode was used. |
| FocalLength | 0x920A (37386) | O | N | The actual focal length of the lens, in mm. Because the image sensor in a digital camera is normally much smaller than a 35 mm film frame, this value is normally much less than the 35 mm equivalent focal length. |
| SubjectArea | 0x9214 (37396) | O | Y | The location and area of the main subject in the captured image. |
| MakerNote | 0x927C (37500) | O | N | Camera makers use this tag to store a variety of proprietary information. |
| UserComment | 0x9286 (37510) | O | Y | Keywords or comments concerning the image. It can use either one-byte ASCII character codes, or two-byte JIS or Unicode character codes. |
| SubsecTimeOriginal | 0x9291 (37521) | O | N | Used with the DateTimeOriginal value to resolve the sub-second time at which the picture was captured, which is useful when a burst of images is captured. |
| FlashPixVersion | 0xA000 (40960) | M | N | This mandatory tag has a value of 0100 recorded as 4-byte ASCII, to indicate FlashPix format Version 1.0. Because the type is UNDEFINED, there is no NULL for termination. |
| ColorSpace | 0xA001 (40961) | M | N | For DCF basic files, a value of 1 is used to indicate sRGB. For DCF optional files, the uncalibrated value (0xFFFF) is used. The InteroperabilityIndex tag is then checked to see whether it is set to R03, which means that the optional (e.g., Adobe RGB) color space is used. In this case, the file names also need to begin with the underscore character. |
| PixelXDimension | 0xA002 (40962) | M | Y | The image width of the main image, excluding any padding pixels that may be present in the JPEG image stream. |
| PixelYDimension | 0xA003 (40963) | M | Y | The image height of the main image. As vertical data padding is unnecessary, this is the same value as that stored in the SOF (Start of Frame) JPEG marker. |
| RelatedSoundFile | 0xA004 (40964) | O | N | The name of an audio file that is related to the image data. The audio file must have the same DCF name as the Exif file, and use the WAV extension. |
| InteroperabilityIFDPointer | 0xA005 (40965) | M | N | The offset from the start of the TIFF header Byte Order field to the position where the Interoperability IFD is stored. |
| SpatialFrequencyResponse | 0xA20C (41484) | O | N | A table that provides the spatial frequency response (SFR) of the digital camera in the horizontal, vertical, and diagonal directions, as measured using ISO 12233. |
| ExposureIndex | 0xA215 (41493) | O | N | The exposure index used by the digital camera when the particular image was captured, as defined in ISO 12232. |
| SensingMethod | 0xA217 (41495) | O | N | The method for sensing the color image. Allowed values include 2 for cameras that use one-chip color area image sensors, 4 for three-chip color cameras, 5 for color-sequential cameras, and 7 for tri-linear color sensors. |
| FileSource | 0xA300 (41728) | O | N | The type of device that created the file. Allowed values include 1 for film scanners, 2 for print scanners, and 3 for digital cameras. |
| CFAPattern | 0xA302 (41730) | O | N | The geometric pattern of the color filter array used by the digital camera. |
| CustomRendered | 0xA401 (41985) | O | Y | Whether the camera or computer has processed the image to produce a custom-rendered image, for example by intentionally coloring or overexposing the image. The value is 0 if normal processing should be used when printing or displaying the image, and 1 if further processing, such as auto-correction, should be disabled or minimized when printing or displaying the image. |
| ExposureMode | 0xA402 (41986) | R | N | The camera's exposure mode setting. Allowed values include 0 for automatic exposure modes, 1 for manual exposure modes, and 2 for automatic exposure bracketing modes, which capture an exposure series of the same scene. |
| WhiteBalance | 0xA403 (41987) | R | N | The camera's white balance mode. Allowed values are 0 for automatic white balance and 1 for manual white balance. |
| DigitalZoomRatio | 0xA404 (41988) | O | N | The digital zoom ratio used to capture the image. For example, 2/1 indicates 2× digital zoom. If the numerator of the recorded value is 0, digital zoom was not used. |
| FocalLengthIn35mmFilm | 0xA405 (41989) | O | N | The focal length of the lens, in mm, that gives an equivalent field of view on a 35 mm film camera. A value of 0 means the focal length is unknown. |
| SceneCaptureType | 0xA406 (41990) | R | N | The type of scene that was shot, normally determined by a camera mode setting. Allowed values include 0 for standard scenes, 1 for landscapes, 2 for portraits, and 3 for night scenes. |
| GainControl | 0xA407 (41991) | O | N | The gain setting of the camera, which increases the camera's exposure index. Allowed values are 0 for the normal gain setting, 1 for low gain up, 2 for high gain up, 3 for low gain down, and 4 for high gain down. |
| Contrast | 0xA408 (41992) | O | N | The camera's contrast setting. Allowed values are 0 for the normal setting, 1 for low contrast (e.g., soft) settings, and 2 for high contrast (e.g., hard) settings. |
| Saturation | 0xA409 (41993) | O | N | The camera's color saturation setting. Allowed values are 0 for the normal setting, 1 for low color saturation settings, and 2 for high color saturation settings. |
| Sharpness | 0xA40A (41994) | O | N | The camera's sharpness setting. Allowed values are 0 for the normal setting, 1 for low sharpness (e.g., soft) settings, and 2 for high sharpness (e.g., hard) settings. |
| DeviceSettingDescription | 0xA40B (41995) | O | N | A table that can be used to store camera settings that are not included in other tags. |
| SubjectDistanceRange | 0xA40C (41996) | O | N | The approximate distance from the camera to the subject. Allowed values are 0 for unknown, 1 for the macro setting, 2 for a relatively close subject, and 3 for a distant subject. |
| ImageUniqueID | 0xA420 (42016) | O | N | A globally unique identifier (GUID) for the image, recorded as an ASCII string of 32 hexadecimal digits that encode the 128-bit GUID. |

There are two options for storing audio annotations along with Exif-JPEG image files. The first option uses a separate WAV file, and the second option embeds the WAV file within APP2 segments in the Exif image file.
The second approach ensures that the audio file is transferred along with the image file, but it increases the size of the image file when a very long audio recording is made. The first option relies on associated image and audio file names along with metadata, and its requirements are described in detail in the Exif specification. Exif defines the RelatedSoundFile tag (0xA004), which stores the name of the associated WAV file. The Exif specification also defines details of the WAV RIFF file format, which stores the associated image file name within the WAV file. The DCF specification provides naming conventions for associating these image and audio files. Both files have the same DCF file number, but different file extensions: JPG for image files and WAV for audio annotation files. This makes the two files part of the same image object, and the DCF specification requires that the two files be copied or deleted as a pair.

The second option embeds the audio WAV file within APP2 segments, as a FlashPix-Ready data stream. Because there is no limit to the number of APP2 segments in an Exif file, there is no limit to the length of the audio WAV file data. However, the APP2 segments must use the FPXR identifier code shown in Figure 13.3, since this is the only type of APP2 segment allowed in an Exif-JPEG file. The Exif specification defines the format of APP2 segments for storing FlashPix-Ready data streams in its "Interoperability Structure of APP2 in Compressed Data" section. Audio is stored as defined in the FlashPix audio extension specification [23]. There are two kinds of APP2 segments used for recording audio data as a FlashPix extension. The first is a Contents List Segment, which records the list of data streams stored in subsequent APP2 segments. This contents list creates an important implied data element called the Index to Contents List.
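As a rough illustration of how a reader might locate these segments, the Python sketch below walks the marker segments of a JPEG stream up to the start-of-scan marker and reports every APP2 segment, flagging whether its payload begins with the FPXR identifier. It relies only on the generic JPEG segment layout (0xFF, a marker code, then a big-endian 16-bit length that counts its own two bytes). The function name `list_app2_segments` is hypothetical, and the sketch deliberately does not decode the Contents List or Stream Data segments, whose internal fields are defined in the Exif specification.

```python
import struct


def list_app2_segments(jpeg: bytes):
    """Return (offset, payload_length, is_fpxr) for every APP2 segment that
    appears before the start-of-scan (SOS) marker of a JPEG stream."""
    if jpeg[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream: missing SOI marker")
    found = []
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = jpeg[pos + 1]
        if marker == 0xFF:              # fill byte before a marker: skip it
            pos += 1
            continue
        if marker in (0xD9, 0xDA):      # EOI, or SOS (entropy-coded data follows)
            break
        # The 16-bit big-endian segment length includes its own two bytes.
        (length,) = struct.unpack_from(">H", jpeg, pos + 2)
        payload = jpeg[pos + 4 : pos + 2 + length]
        if marker == 0xE2:              # APP2
            found.append((pos, len(payload), payload.startswith(b"FPXR")))
        pos += 2 + length
    return found
```

Under the Exif-JPEG rule described above, any entry reported with `is_fpxr` set to False (for example, an ICC profile stored in APP2) points to a segment that a conforming Exif-JPEG file should not contain.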
The first stream in the contents list is assigned a zero value in the Index to Contents List, and subsequent streams are assigned ascending values. The APP2 segments that store the data streams, referred to as Stream Data Segments, use the Index to Contents List value to identify the data stream associated with the data stored in each APP2 segment.

The APP2 segments can also be used to store privately defined image data. For example, the thumbnail image size required for DCF basic files, 160 × 120 pixels, is too small to provide a sharp image on digital cameras that have a relatively large LCD display. Storing a larger screen nail image in the Exif-JPEG file provides a higher-resolution review image on the camera. Kodak has used the FlashPix-Ready APP2 segments defined in Exif to store JPEG-compressed screen nail image data in some digital cameras to provide sharper review images.

13.6 Raw Image Formats

A raw image file stores the data provided by the single-chip color sensor in a digital camera prior to most image processing. The raw image data is processed when the file is read, typically using a special software application on a host computer. This enables higher quality for several reasons. First, the image data is not degraded by lossy baseline JPEG compression with 8 bits per component. Instead, the image data is stored without lossy compression, typically using 12 or more bits per sample. Second, since image processing is normally performed using a powerful processor on a host computer, the image processing algorithms can be more complex. Furthermore, the user can typically monitor the results of the processing and adjust the image processing parameters for specific images. Digital cameras that create raw files normally come with proprietary software that performs this image processing. For details on image processing in digital cameras, refer to Chapters 1 and 3.

The image data in the raw file can be stored either uncompressed or using lossless compression.
However, to interpret the color data properly, the meaning of the digital color values stored in the file must be identified in the file and understood by the software application that processes the image. At a minimum, this requires that the color filter pattern and the color responses of the color channels be specified. In many cases, other parameters are also needed to enable the software application to perform the appropriate processing. These parameters can include user-adjustable camera settings, such as white balance and sharpness settings. They can also include camera characterization information, such as sensor noise data and lens correction data.

TIFF/EP, which is based on the TIFF 6.0 specification, was the first standard image format supporting raw image data. Almost all digital cameras that support raw files use formats based on TIFF 6.0, and many use formats compatible with TIFF/EP. TIFF/EP uses the tags shown in Table 13.7 to define uncompressed raw color filter array (CFA) image data. Chapters 1 and 5 discuss CFA patterns in detail.

TABLE 13.7 TIFF/EP raw image format tags.

| Tag Field Name | Number (Hex) | Number (Dec) | Description |
|---|---|---|---|
| ImageWidth | 0x0100 | 256 | Stores the width of the CFA image data. |
| ImageLength | 0x0101 | 257 | Stores the length (number of rows) of the CFA image data. |
| BitsPerSample | 0x0102 | 258 | Stores the number of bits per CFA sample, which is often 12. |
| Compression | 0x0103 | 259 | Has a value of 1 if the CFA data is uncompressed or 7 if lossless JPEG compression is used. Other values can be used to signify proprietary compression. |
| PhotometricInterpretation | 0x0106 | 262 | A value of 32803 indicates that the color space is color filter array data. |
| StripOffsets | 0x0111 | 273 | Stores the offset to the image data. |
| SamplesPerPixel | 0x0115 | 277 | Has a value of 1, as each pixel has only one sample. |
| ICC Color Profile [24] | 0x8773 | 34675 | Used to define the RGB reference primaries, white point, and opto-electronic conversion function. |
| CFARepeatPatternDim | 0x828D | 33421 | Used to encode the repeating dimensions of the color filter array pattern. |
| CFAPattern | 0x828E | 33422 | Used to encode the color filter array pattern. |

To improve some workflows, certain raw formats include a processed JPEG file inside the TIFF file. Figure 13.4 shows the structure of a raw file produced by the KODAK EASYSHARE P880 Zoom Digital Camera. This Kodak raw format uses the KDC file extension and is based on TIFF/EP. The TIFF image file includes the JPEG Interchange Format tag, which points to a full-resolution Exif-JPEG image that has been fully processed by the camera. This Exif-JPEG file, shown in the middle of Figure 13.4, can be extracted and used by applications and equipment that are not capable of performing raw image processing. The file also includes an Exif IFD Pointer to the camera settings tags normally stored in Exif-JPEG files. The largest portion of the file is the raw CFA sensor image data, which stores 12-bit samples from the Bayer pattern color sensor used in the KODAK EASYSHARE P880 Zoom Digital Camera. The file also includes a number of image reconstruction parameters, such as the white balance, sharpening, and noise cleaning settings. When the image is processed on the host computer, the stored image reconstruction parameters are used as the default image processing settings. The user can adjust these settings to modify the processing used for particular images.

Digital cameras from other companies create raw files having slightly different arrangements. For example, the Nikon D70 Digital Camera produces Nikon Electronic Format raw files, which use the NEF extension and are also based on TIFF/EP. These files include an uncompressed 160 × 120 thumbnail image in IFD0. SubIFD tags point to the raw data for the main image and to a compressed rendered main image.
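The tag layout in Table 13.7 can be explored with a small IFD reader. The Python sketch below is a simplified illustration, not a complete TIFF/EP parser: it reads the TIFF header (byte order, the 0x002A identifier, and the offset to the 0th IFD), decodes the 12-byte entries of one IFD, and spells out the CFA pattern from the CFARepeatPatternDim and CFAPattern tags. It decodes only BYTE, SHORT, LONG, and UNDEFINED values, the helper names are hypothetical, and it assumes the TIFF/EP color codes 0 = red, 1 = green, 2 = blue. In layouts such as the one in Figure 13.4, the raw data is described by a SubIFD, so a real reader would follow the SubIFD pointer (0x014A) and call `read_ifd` again with that offset.

```python
import struct

# TIFF field types decoded by this sketch: type code -> (struct format, size).
FIELD_TYPES = {1: ("B", 1), 3: ("H", 2), 4: ("I", 4), 7: ("B", 1)}  # BYTE, SHORT, LONG, UNDEFINED
# TIFF/EP CFAPattern color codes.
CFA_COLORS = {0: "R", 1: "G", 2: "B", 3: "C", 4: "M", 5: "Y", 6: "W"}


def read_ifd(data: bytes, ifd_offset=None):
    """Parse one IFD of a TIFF file; return ({tag: values}, next_ifd_offset).
    With ifd_offset=None, the 0th IFD named in the TIFF header is read."""
    byte_order = data[:2]
    if byte_order not in (b"II", b"MM"):
        raise ValueError("unknown byte order")
    order = "<" if byte_order == b"II" else ">"
    magic, first_ifd = struct.unpack_from(order + "HI", data, 2)
    if magic != 42:  # the TIFF identifier 0x002A
        raise ValueError("not a TIFF file")
    offset = first_ifd if ifd_offset is None else ifd_offset
    (count,) = struct.unpack_from(order + "H", data, offset)
    tags = {}
    for i in range(count):
        # Each 12-byte entry: tag, field type, value count, value or offset.
        tag, ftype, n, field = struct.unpack_from(order + "HHI4s", data, offset + 2 + 12 * i)
        if ftype not in FIELD_TYPES:
            continue  # ASCII, RATIONAL, etc. are skipped in this sketch
        fmt, size = FIELD_TYPES[ftype]
        if n * size <= 4:
            raw = field  # small values are stored inline in the entry
        else:
            # Larger values live elsewhere; TIFF offsets count from byte 0
            # of the file, i.e. from the byte order field.
            (ptr,) = struct.unpack(order + "I", field)
            raw = data[ptr : ptr + n * size]
        tags[tag] = struct.unpack_from(order + f"{n}{fmt}", raw)
    (next_ifd,) = struct.unpack_from(order + "I", data, offset + 2 + 12 * count)
    return tags, next_ifd


def cfa_pattern(tags):
    """Spell out the repeating CFA pattern from CFARepeatPatternDim/CFAPattern."""
    rows, cols = tags[0x828D]
    codes = tags[0x828E][: rows * cols]
    return "".join(CFA_COLORS.get(c, "?") for c in codes)
```

For a Bayer sensor like the one described above, `cfa_pattern` would typically return a four-letter string such as "RGGB", read row by row.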
Canon digital cameras have used two different types of raw files. The Canon EOS 10D camera produces raw files defined as part of CIFF, using the CRW extension. CIFF uses tags having a 10-byte structure, compared to the 12-byte tags used by TIFF. Another difference is that the offsets to the data in CIFF tags are relative to the start of the data block for each directory in a CIFF file, whereas in a TIFF file the offset to the data is relative to the first byte of the image file, which is the byte order field. Canon uses a raw format based on TIFF 6.0, with the CR2 extension, in its high-end digital SLR cameras such as the Canon EOS-1Ds Mark II. Various-sized JPEG-compressed processed images are stored in the 0th, 1st, and 2nd IFDs, and the 3rd IFD contains the compressed raw image data.

One problem with raw formats is that the processing used to convert the raw files to finished color images is often proprietary. Therefore, if the image processing software becomes unavailable in the future, the image file may be unusable, or may be rendered with unknown quality if alternative image processing is used. While the TIFF/EP standard defines how raw image files can be stored, support for TIFF/EP by compliant readers is optional. In 2004, Adobe defined a raw format called DNG [25]. DNG is compatible with TIFF/EP, but defines additional tags used by some raw format cameras and sets restrictions on the values of some TIFF tags. The ISO/TC42/WG18 group responsible for TIFF/EP began work on a revision of the TIFF/EP standard in 2006. In 2007, Adobe submitted its DNG specification for consideration by WG18. Therefore, the revised version of TIFF/EP may include a raw profile that adopts tags and restrictions from the DNG specification.

13.7 Directory and Control Formats

The DCF specification defines how image files are named and arranged into folders on a memory card.
Figure 13.5 is an example of a DCF-compliant directory structure. DCF requires that the memory card be formatted using the DOS FAT file system. Images are stored in the DCIM (Digital Camera Images) directory, directly under the root directory of the removable media. The DCIM directory includes one or more DCF directories. The names of these DCF directories are eight characters long. The first three characters provide the directory number, which must be between 100 and 999. The final five characters are free characters chosen by the camera vendor. Only the capital letters A through Z, the numbers 0 through 9, and the underscore character may be used. The five free characters are often used to indicate the camera model number. In Figure 13.5, the 100KV705 and 102KV705 directories store images from a KODAK EASYSHARE V705 Dual Lens Digital Camera, and the 101KP880 directory stores images from a KODAK EASYSHARE P880 Camera. When a memory card is swapped between different model cameras, the camera normally creates a new DCF directory with the next directory number.

[Figure: TIFF header (byte order, TIFF identifier 0x002A, offset to 0th IFD); 0th IFD with TIFF 6.0 and TIFF/EP defined metadata tags, JPEG interchange format (0x0201), SubIFD pointer (0x014A), Exif IFD pointer (0x8769), image processing IFD pointer, and last IFD identifier (0000); fully processed embedded JPEG image (SOI, APP1 segment with Exif metadata, APP2 segment with screen nail, DQT, DHT, SOF, JPEG-compressed 4:2:0 main image data, EOI); child sub-IFD (image width, image length, bits per sample, compression, photometric interpretation, strip offsets, samples per pixel, CFA repeat pattern dim, CFA pattern) and raw CFA image data; Exif IFD with Exif-defined metadata tags; processing sub-IFD with image reconstruction metadata tags.]
FIGURE 13.4 Example TIFF/EP raw image file structure.

    root directory
        DCIM
            100KV705
                100_0001.JPG
                100_0002.JPG
                100_0005.JPG
            101KP880
                101_0001.JPG
                101_0002.KDC
            102KV705
                102_0021.JPG
                102_0024.JPG
                102_0026.JPG
        MISC
            AUTPRINT.MRK
            AUTXFER.MRK
FIGURE 13.5 DCF file naming example.

Within each DCF directory, DCF basic image files are stored using an eight-character name followed by the JPG file extension. The first four characters are free characters chosen by the camera vendor. The last four characters are the file number, which ranges from 0001 to 9999. It is common to use the DCF directory number as part of the free characters, to eliminate file name conflicts if images from multiple DCF directories are later transferred to the same directory, but this is not required.

It is possible to store raw image files within a DCF directory using a file extension other than JPG. For example, in Figure 13.5, the 101KP880 directory includes both a DCF basic Exif-JPEG file named 101_0001.JPG and a raw file named 101_0002.KDC. To allow the subject of the raw file to be viewed on a camera that does not support raw image processing, some cameras also store a DCF thumbnail file. This thumbnail file has the same DCF file number, but uses the THM extension. Audio annotations can be stored in a separate file using the WAV extension.
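The DCF naming rules are mechanical enough to check in code. The following Python sketch validates DCF directory names, extracts DCF file numbers, and groups the files of one directory into DCF objects by file number, so that a raw KDC file and a matching THM or WAV file with the same number land in the same group. The regular expressions encode only the rules stated in this section plus two assumptions flagged in the comments (the character set for file-name free characters and three-character extensions); the function names are hypothetical, and this is not a full DCF conformance check.

```python
import re
from collections import defaultdict

# Directory name: 3-digit directory number plus 5 free characters (A-Z, 0-9, _).
DIR_NAME = re.compile(r"([0-9]{3})([A-Z0-9_]{5})")
# File name: 4 free characters plus a 4-digit file number, then the extension.
# Assumptions: file-name free characters use the same character set as
# directory names, and extensions are three characters long.
FILE_NAME = re.compile(r"([A-Z0-9_]{4})([0-9]{4})\.([A-Z0-9]{3})")


def is_dcf_directory(name):
    """True if name is a valid DCF directory name (number 100-999)."""
    m = DIR_NAME.fullmatch(name)
    return m is not None and 100 <= int(m.group(1)) <= 999


def dcf_file_number(name):
    """Return the DCF file number (1-9999) of a name such as 101_0002.KDC, else None."""
    m = FILE_NAME.fullmatch(name)
    if m is None or int(m.group(2)) == 0:
        return None
    return int(m.group(2))


def group_dcf_objects(names):
    """Group the files of one DCF directory into DCF objects by file number."""
    objects = defaultdict(list)
    for name in names:
        number = dcf_file_number(name)
        if number is not None:
            objects[number].append(name)
    return {number: sorted(files) for number, files in sorted(objects.items())}
```

A file manager could use such groups to delete, move, or copy all members of a DCF object together.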
Files that have the same DCF file number, but a different extension, are considered part of the same DCF object, and need to be deleted, moved, or copied together.

Many consumer digital cameras allow images to be selected for printing as they are reviewed on the camera. The image selections are recorded in a text file that conforms to the Digital Print Order Format (DPOF) specification [26]. The camera or memory card can then be connected directly to a printer, which reads the DPOF file and prints the selected images. The DPOF print order file must be named AUTPRINT.MRK and must be located in a folder named MISC under the media root directory, as shown in Figure 13.5.

The DPOF version 1.0 specification was developed by Eastman Kodak Company, Canon Inc., Fuji Photo Film Co., Ltd., and Matsushita Electric Industrial Co., Ltd. in 1998. DPOF was updated to version 1.10 in 2000. Version 1.10 includes optional parameters to enable several images to be printed on one page and to specify the size of the print. Part 2 of DPOF version 1.10 defines an Auto Transfer (AUTXFER.MRK) file, which specifies images to be emailed, as well as Auto Play (AUTPLAYn.MRK) files, which define the order of still images, video images, and audio to be used in slide shows. Figure 13.6 is an example of a DPOF print order file.

    [HDR]
    GEN REV = 01.10
    GEN CRT = "KODAK V705 DUAL LENS DIGITAL CAMERA"
    GEN DTM = 2007:04:19:19:47:24

    [JOB]
    PRT PID = 001
    PRT TYP = STD
    PRT QTY = 001
    IMG FMT = EXIF2 -J
    IMG SRC = "../DCIM/100KV705/100_0002.JPG"

    [JOB]
    PRT PID = 002
    PRT TYP = STD
    PRT QTY = 002
    IMG FMT = EXIF2 -J
    IMG SRC = "../DCIM/101KP880/101_0001.JPG"

    [JOB]
    PRT PID = 003
    PRT TYP = STD
    PRT QTY = 001
    IMG FMT = EXIF2 -J
    IMG SRC = "../DCIM/102KV705/102_0026.JPG"
FIGURE 13.6 DPOF Auto Print file example.
The HDR header section indicates that the file conforms to DPOF version 1.10 and was created by the KODAK EASYSHARE V705 Dual Lens Digital Camera on April 19, 2007 at 7:47 PM. Optional parameters can be used to provide the name, address, and phone number of the camera owner. Vendor-unique parameters can also be included in the file to provide other types of information.

The DPOF print order file shown in Figure 13.6 contains three print jobs, identified by sequential ID numbers. The first print job is for one standard-sized print of the Exif-JPEG version 2 image file with the pathname ../DCIM/100KV705/100_0002.JPG. In other words, one print should be made from the file named 100_0002.JPG in the DCF directory named 100KV705 shown in Figure 13.5. The second print job is for two standard-sized prints of the file named 101_0001.JPG in the DCF directory named 101KP880, and the third print job is for one print of the file named 102_0026.JPG.

DPOF print files can also contain optional parameters that specify other print sizes and index prints, or indicate how the image is to be cropped or overlaid with text. But because DPOF-compliant printers are not required to support these optional parameters, few consumer digital cameras currently support them.

Figure 13.7 is an example of a DPOF auto transfer file.

    [HDR]
    GEN REV = 01.10
    GEN CRT = "KODAK V705 DUAL LENS DIGITAL CAMERA"
    GEN DTM = 2007:04:19:19:48:34

    [JOB]
    PMT PID = 001
    DST EML = "ken@kodak.com"
    IMG FMT = EXIF2 -J
    IMG SRC = "../DCIM/100KV705/100_0005.JPG"

    [JOB]
    PMT PID = 002
    DST EML = "ken@kodak.com"
    IMG FMT = EXIF2 -J
    IMG SRC = "../DCIM/102KV705/102_0021.JPG"

    [JOB]
    PMT PID = 003
    DST EML = "rob@kodak.com"
    IMG FMT = UNDEF
    IMG SRC = "../DCIM/101KP880/101_0002.KDC"
FIGURE 13.7 DPOF Auto Transfer file example.
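Because DPOF files are plain text made of bracketed section names and KEY = VALUE lines, as in Figures 13.6 and 13.7, a reader can be sketched in a few lines. The Python example below is a minimal parser built only from the syntax visible in those figures; it is not a full implementation of the DPOF specification, and the function names `parse_dpof`, `creation_time`, and `print_quantities` are hypothetical.

```python
from datetime import datetime


def parse_dpof(text):
    """Split a DPOF file into a header dict and a list of job dicts.
    Each "KEY = VALUE" line becomes one entry; surrounding quotes are removed."""
    header, jobs, current = {}, [], None
    for line in text.splitlines():
        line = line.strip()
        if line == "[HDR]":
            current = header
        elif line == "[JOB]":
            current = {}
            jobs.append(current)
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip().strip('"')
    return header, jobs


def creation_time(header):
    """GEN DTM is written as YYYY:MM:DD:HH:MM:SS."""
    return datetime.strptime(header["GEN DTM"], "%Y:%m:%d:%H:%M:%S")


def print_quantities(jobs):
    """Total number of prints requested per image path (print jobs only)."""
    totals = {}
    for job in jobs:
        if "PRT QTY" in job and "IMG SRC" in job:
            path = job["IMG SRC"]
            totals[path] = totals.get(path, 0) + int(job["PRT QTY"])
    return totals
```

Applied to an auto transfer file like the one in Figure 13.7, the same parser returns jobs with DST EML entries instead of PRT QTY entries, which `print_quantities` simply skips.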
The HDR header section of this auto transfer file indicates that it conforms to DPOF version 1.10 and was created by the KODAK EASYSHARE V705 Dual Lens Digital Camera on April 19, 2007 at 7:48 PM. The file contains three transfer jobs, identified by sequential ID numbers. The first job is to transfer the Exif-JPEG version 2 image file with the pathname ../DCIM/100KV705/100_0005.JPG to the email address ken@kodak.com. In other words, the image file named 100_0005.JPG in the DCF directory named 100KV705 shown in Figure 13.5 should be used. The second job requests that the file named 102_0021.JPG be sent to the same email address. The third job requests that the raw file named 101_0002.KDC be sent to the email address rob@kodak.com.

13.8 Advanced Image Formats

The image file formats used by future digital cameras will continue to advance. This may happen through small, compatible enhancements to the current Exif and TIFF formats, or by replacing these formats with new, incompatible formats. One example of a possible replacement is the JPEG 2000 image format [27], which was developed in the late 1990s and first standardized in 2000. JPEG 2000 uses the discrete wavelet transform rather than the discrete cosine transform. A JPEG 2000 file provides a multi-resolution representation of an image, using bit-plane encoding [28]. This allows progressive decoding and supports both lossy and lossless compression. Metadata is stored using an XML schema, rather than TIFF tags. Refer to Chapters 14 and 15 for detailed discussions of image compression.

The JPEG 2000 compressed bit stream is organized in a flexible binary container using a rich syntax. The JPEG 2000 code stream can be fragmented and ordered into different boxes, and each box can be edited separately. The syntax provides improved metadata support compared with the application segment approach used by Exif-JPEG.
The JPEG 2000 file format also supports many color-encoding standards, including both sRGB and extended-gamut RGB color encodings such as ROMM RGB [29], RIMM RGB [30], and scRGB [31]. These color spaces are supported by storing a restricted ICC profile in the file, which essentially provides instructions that allow the color image data to be easily converted into sRGB if needed. In 2004, a profile for using JPEG 2000 in digital still cameras was approved as an American national standard [32]. This standard provides a way for cameras to write JPEG 2000 files with a full set of digital camera metadata that can be correctly read and interpreted by other devices. It also defines file naming and directory rules that are consistent with DCF. While JPEG 2000 has been used in various government and medical applications, it is not yet supported by consumer digital cameras.

In 2006, Microsoft introduced a new image file format called HD Photo [33], previously known as Windows Media Photo. HD Photo uses the same TIFF header described earlier, except that little-endian byte order is required and a different magic number is used. HD Photo stores images and metadata using TIFF-like tags and IFDs. The file can include Exif metadata using the standard Exif IFD, as well as XMP metadata, which uses XML. The HD Photo file extension can be either HDP or WDP. Images are compressed using a proprietary, reversible, lapped biorthogonal transform. HD Photo uses a PixelFormat tag containing a globally unique identifier (GUID) to indicate which of over 50 different pixel representations is used for the image data. These range from monochrome to 8-channel color data, using 8, 16, or 32 bits per channel. However, baseline decoders are required to support only a few of these options, providing RGB or monochrome data at 8 or 16 bits per channel. The RGB color encoding is defined either implicitly as sRGB or scRGB, depending on the GUID, or by using an embedded ICC profile.
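The differences between these containers show up in the first few bytes of a file. The Python sketch below guesses the container type from those bytes and walks the top-level boxes of a JPEG 2000 (JP2) file, where each box starts with a 4-byte big-endian length and a 4-byte type, a length of 1 signals a 64-bit extended length, and a length of 0 means the box runs to the end of the enclosing data. The JP2 signature box and the TIFF header values are standard; the HD Photo check assumes that such files begin with the bytes 0x49 0x49 0xBC (little-endian byte order followed by the HD Photo identifier), which should be verified against the HD Photo specification [33]. The function names are hypothetical.

```python
import struct

JP2_SIGNATURE = b"\x00\x00\x00\x0cjP  \r\n\x87\n"  # JPEG 2000 signature box


def sniff_container(data: bytes) -> str:
    """Guess the container format from the first bytes of a file."""
    if data.startswith(JP2_SIGNATURE):
        return "jp2"
    if data[:4] in (b"II*\x00", b"MM\x00*"):  # TIFF: byte order + 0x002A
        return "tiff"
    if data[:3] == b"II\xbc":                  # assumed HD Photo identifier
        return "hdphoto"
    if data[:2] == b"\xff\xd8":                # JPEG SOI marker
        return "jpeg"
    return "unknown"


def iter_jp2_boxes(data: bytes, start=0, end=None):
    """Yield (box_type, payload_offset, payload_length) for each box in
    data[start:end]; call it again on a payload to walk a superbox."""
    end = len(data) if end is None else end
    pos = start
    while pos + 8 <= end:
        length, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if length == 1:        # 64-bit extended length follows the type field
            (length,) = struct.unpack_from(">Q", data, pos + 8)
            header = 16
        elif length == 0:      # box extends to the end of the enclosing data
            length = end - pos
        if length < header or pos + length > end:
            raise ValueError(f"malformed box at offset {pos}")
        yield box_type.decode("latin-1"), pos + header, length - header
        pos += length
```

Because each box carries its own length, a tool can skip or rewrite one box (for example, a metadata box) without decoding the others, which is the editing flexibility described above.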
In 2007, HD Photo was submitted to the JPEG group (formally known as ISO/IEC JTC 1/SC 29/WG 1) for standardization. It is likely that this format, to be known as JPEG XR, will be approved and published as ISO/IEC 29199-2 in 2009.

13.9 Conclusion

Digital cameras store images using standard image file formats. This enables readers, including computers and photo printers, to make use of the image data. Most consumer cameras store fully processed images using the Exif compressed image format. These Exif-JPEG files are named and organized into folders according to the DCF requirements, and can be read by baseline JPEG readers. Metadata created by the camera is stored using TIFF tags in an application segment near the beginning of the file. These tags store required metadata, such as the make and model of the camera, the capture date and time, and a 160 × 120 pixel thumbnail image. The tags can also store recommended and optional metadata, such as lens settings, camera parameters, and GPS coordinates. DPOF control files can be used to indicate which images will be automatically printed or emailed.

Most SLR and prosumer digital cameras feature a raw mode that stores the sensor image data prior to demosaicking. This provides higher image quality, but requires that the reader perform demosaicking and other camera image processing. The TIFF/EP format is used to store raw image data in some cameras. TIFF/EP defines metadata, such as the color filter pattern and an ICC profile, that describe the raw data values. Some digital cameras also use the Exif-uncompressed or TIFF/EP formats to store processed, uncompressed images that can be read by baseline TIFF readers.

The use of metadata in image files is playing an increasingly important role in the management and organization of consumer image collections. The size of these collections grows every year.
A typical family now has many thousands of images stored on the hard drive of its home computer. While almost all current digital cameras create files that fully comply with the Exif and DCF specifications, some popular imaging applications produce edited files that do not. Compliance problems often arise when metadata is added or updated. To create compliant software, developers must fully understand the requirements contained in several different documents, some of which were translated from Japanese. Because certain prohibitions are not fully described, some developers have made design choices that cause interoperability problems. This chapter has attempted to clarify some of the key requirements, in order to improve compliance.

The Exif-JPEG image format has been widely used in digital cameras for more than a decade. Because it was designed to be flexible and extensible, new capabilities have been added while maintaining backward compatibility. New image formats, including JPEG 2000 and HD Photo, provide higher image quality and more capabilities, but they are not backward compatible with the existing Exif-JPEG format. This makes it difficult for any new format to gain acceptance. It is possible that one of these new formats will replace Exif-JPEG as the primary image format for consumer digital cameras, but it is more likely that Exif-JPEG will continue to be enhanced and used to store images in digital cameras for many years to come.

References

[1] Technical Standardization Committee on AV & IT Storage Systems and Equipment, Exchangeable image file format for digital still cameras: Exif Version 2.2, Japan Electronics and Information Technology Industries Association, JEITA CP-3451-1, April 2002. Available online: http://www.jeita.or.jp.
[2] Amendment 1 exchangeable image file format for digital image still cameras: Exif Version 2.21 (Amendment to Version 2.2), Japan Electronics and Information Technology Industries Association, JEITA CP-3451, 2002. Available online: http://www.jeita.or.jp.
[3] Electronic still picture imaging - Removable memory - Part 2: TIFF/EP image data format, ISO 12234-2, 2001. Available online: http://www.iso.ch.
[4] Information technology - Digital compression and coding of continuous-tone still images: Requirements and guidelines, ISO/IEC 10918-1, 1994. Available online: http://www.iso.ch.
[5] W.B. Pennebaker and J.L. Mitchell, JPEG Still Image Data Compression Standard. New York: Van Nostrand Reinhold, 1993.
[6] J. Milch and K. Parulski, "Using metadata to simplify digital photography," in Proceedings of the IS&T Conference on PIC, Savannah, GA, USA, April 1999, pp. 26–30.
[7] C-Cube Microsystems, JPEG File Interchange Format, Version 1.02, September 1992. Available online: http://www.jpeg.org/jpeg/index.html.
[8] Information technology - Digital compression and coding of continuous-tone still images: Extensions, ISO/IEC 10918-3, 1997. Available online: http://www.iso.ch.
[9] Adobe Systems Incorporated, TIFF (Tag Image File Format), Revision 6.0, June 1992. Available online: http://partners.adobe.com/public/developer/en/tiff/TIFF6.pdf.
[10] Adobe Photoshop, TIFF technical notes - JPEG compression, March 2002. Available online: http://partners.adobe.com/public/developer/en/tiff/TIFFphotoshop.pdf.
[11] PCMCIA, JEIDA specific extensions, Section 3, SISRIF (Still Image, Sound and Related Information Format), PC Card Standard, no. 12, February 1995.
[12] K. Parulski and G. Lathrop, "TIFF/EP, A flexible image format for electronic still cameras," in Proceedings of the IS&T 48th Annual Conference, Washington, D.C., USA, May 1995, pp. 425–428.
[13] M. Watanabe, F. Funazaki, Y. Shigyo, and S. Nishi, "An image data file format for digital still camera," in Proceedings of the International Symposium on Electronic Photography, Washington, D.C., USA, May 1995, pp. 421–424.
[14] Electronic still picture imaging - Removable memory - Part 1: Basic removable-memory module, ISO 12234-1, 2001.
[15] Flashpix format specification, Version 1.0, Eastman Kodak Company, September 1996. Available online: http://www.i3a.org.
[16] PhotoYCC color encoding and compression schemes, Eastman Kodak Company, KODAK Publication PCD-045, April 1994.
[17] Multimedia systems and equipment - Colour measurement and management - Part 2-1: Colour management - Default RGB colour space - sRGB, IEC 61966-2-1, 1999. Available online: http://www.iec.ch.
[18] CIFF specification on image data file, Version 1.0, Revision 4, Canon Inc., December 1997.
[19] Design rule for camera file system, Version 1.0, Japan Electronic Industry Development Association (JEIDA), December 1998.
[20] Print image matching white paper, Epson America, Inc., 2001. Available online: http://www.printimagematching.com/what_is_pim.php.
[21] IEC 61966-2-1 Amendment 1, January 2003. Available online: http://www.iec.ch.
[22] Design rule for camera file system: DCF Version 2.0, JEITA CP-3461, September 2003. Available online: http://www.jeita.or.jp.
[23] Flashpix audio extension specification, International Imaging Industry Association (I3A). Available online: http://www.i3a.org/i_Flashpix.html.
[24] Image technology colour management - Architecture, profile format, and data structure, ICC.1:2004-10 (Profile version 4.2.0.0). Available online: http://www.color.org.
[25] Adobe Digital Negative (DNG) specification, Version 1.1.0.0, Adobe Systems Inc., February 2005. Available online: http://www.adobe.com/products/dng/.
[26] Summary of DPOF. Available online: http://panasonic.jp/dc/dpof_110/white_e.htm.
[27] Information technology - JPEG 2000 image coding system; Information technology - JPEG 2000 image coding system: Extensions; Information technology - JPEG 2000 image coding system: Conformance, ITU-T Rec. T.800 (ISO/IEC 15444-1:2002); ITU-T Rec. T.801 (ISO/IEC 15444-2:2002); ITU-T Rec. T.803 (ISO/IEC 15444-4:2002). Available online: http://www.iso.ch.
[28] P.N. Topiwala, Wavelet Image and Video Compression. Dordrecht, the Netherlands: Kluwer Academic Publishers, 1998.
[29] Photography and graphic technology - Extended colour encodings for digital image storage, manipulation and interchange - Part 2: Reference output medium metric RGB colour image encoding (ROMM RGB), ISO/TS 22028-2, 2006. Available online: http://www.iso.ch.
[30] Photography and graphic technology - Extended colour encodings for digital image storage, manipulation and interchange - Part 3: Reference input medium metric RGB colour image encoding (RIMM RGB), ISO/TS 22028-3, 2006. Available online: http://www.iso.ch.
[31] Multimedia systems and equipment - Colour measurement and management - Part 2-2: Colour management - Extended RGB colour space - scRGB, IEC 61966-2-2, 2003. Available online: http://www.iec.ch.
[32] Photography - Digital still cameras - JPEG 2000 DSC profile, ANSI/I3A IT10.2000-2004, 2004. Available online: http://www.ansi.org.
[33] HD Photo, photographic still image file format, Version 1.0, Microsoft Corporation, November 2006. Available online: http://www.microsoft.com/whdc/xps/wmphoto.mspx.

14 Modelling of Image Processing Pipelines in Single-Sensor Digital Cameras

Nai-Xiang Lian, Vitali Zagorodnov, and Yap-Peng Tan

14.1 Introduction
14.2 Elements of Digital Still Camera Image Processing Pipeline
    14.2.1 Demosaicking
    14.2.2 Color Adjustments
    14.2.3 Compression
14.3 Alternative Image Processing Pipelines
14.4 Performance Modelling
    14.4.1 Modelling Pipeline Elements
    14.4.2 Modelling Interactions
    14.4.3 Modelling via Taylor Series Expansion
14.5 Modelling of Individual Digital Still Camera Processing Elements
    14.5.1 Modelling of Demosaicking Errors
    14.5.2 Modelling of Compression Errors
        14.5.2.1 Full-Color Image Compression
        14.5.2.2 CFA Image Compression
        14.5.2.3 Evaluation of the Compression Error Models
14.6 Modelling of Interactions
    14.6.1 Color Adjustments and Demosaicking / Compression
    14.6.2 Demosaicking and Compression
14.7 Performance Evaluation of Digital Still Camera Processing Pipelines
14.8 Summary
References

14.1 Introduction

The main concerns in the design of digital still cameras (DSCs) are cost, size, image quality, and operational/power efficiency [1]. To reduce cost and size, most DSCs capture images using an electronic sensor overlaid with a color filter array (CFA), which samples one of three color primaries at each pixel location. The two missing color components are then estimated to restore full-color information [2], [3], [4], [5], [6]. This process is referred to as demosaicking [7], [8], [9], [10], [11]. For additional information on the fundamentals of single-sensor color imaging, refer to Chapter 1.

[Figure: in the DSC, CFA data → demosaicking → color adjustment (white balance, color and gamma corrections) → compression; store/transmit; on the end device, decompression → output image.]
FIGURE 14.1 Block diagram of conventional processing chain.

Figure 14.1 shows the conventional DSC image processing pipeline. The CFA data from the sensor is first demosaicked to estimate the full-color image. Then, the color values of the resulting image are modified by adjusting the white balance and performing color and gamma correction [12], [13], to match the colors of the original scene when displayed on a computer monitor. White balancing removes the color tint of an image to make white objects appear white. Color correction transforms the CFA sensor color space to a suitable display color space (e.g., sRGB [14]). Gamma correction adjusts the image intensity to compensate for the nonlinearity of cathode-ray-tube (CRT) displays. Some DSC models may also include noise reduction or image sharpening, but these processes are optional and can be deferred to a later stage if necessary.
Finally, the image is compressed for storage or transmission [12], [13]. Detailed discussions of typical camera image processing pipelines can be found in Chapters 1 and 3.

The current development of DSC processing is typically guided by empirical performance evaluations [15], [16], [17], [18], [19], [20]. However, the results of empirical evaluations are usually limited to a training set and are often inconclusive. For example, experiments may show one method performing better on some types of images but worse on others, without explaining why this happens.

The main goal of this chapter is to show the advantages of mathematical models that link the performance of DSC processing to image content and algorithm settings. We also show how such models can be developed in practice, using the examples of demosaicking, compression, and color and gamma correction, as well as their interactions. Specifically, we review the elements of DSC image processing pipelines in Section 14.2 and alternative DSC processing chains in Section 14.3. The motivation for performance modelling is introduced in Section 14.4. We then show how to model the individual DSC processing elements and their interactions in Sections 14.5 and 14.6, respectively. Finally, an application example is presented in Section 14.7, and the chapter is summarized in Section 14.8.

14.2 Elements of Digital Still Camera Image Processing Pipeline

14.2.1 Demosaicking

Bilinear interpolation (or bilinear demosaicking) estimates a missing color value as a linear function of the two or four nearest samples of the same color type [21]. Bicubic interpolation can moderately improve performance, but at a substantial computational cost. However, the performance of both interpolation techniques can be rather poor, especially in regions rich in image details and edges.
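A minimal sketch of bilinear demosaicking follows. It assumes an RGGB Bayer layout (red at even rows and even columns); the layout and all function names are illustrative assumptions, not taken from the chapter. Each sparse color plane is filtered with a 3 × 3 kernel, so that a missing green value becomes the average of its four direct green neighbors, and a missing red or blue value becomes the average of its two horizontal, two vertical, or four diagonal same-color neighbors.

```python
import numpy as np

def bayer_masks(h, w):
    """Sampling masks for an assumed RGGB Bayer layout (R at even row, even column)."""
    masks = np.zeros((h, w, 3), dtype=bool)
    masks[0::2, 0::2, 0] = True  # R
    masks[0::2, 1::2, 1] = True  # G1
    masks[1::2, 0::2, 1] = True  # G2
    masks[1::2, 1::2, 2] = True  # B
    return masks

def mosaic(rgb):
    """Simulate the CFA sensor: keep a single color sample at every pixel."""
    return np.sum(rgb * bayer_masks(*rgb.shape[:2]), axis=2)

def filter3x3(img, k):
    """3 x 3 filtering with mirror padding, which keeps the Bayer phase at the borders."""
    h, w = img.shape
    p = np.pad(img, 1, mode="reflect")
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += k[dy, dx] * p[dy:dy + h, dx:dx + w]
    return out

# Green: average of the 4 direct neighbors. Red/blue: average of 2 horizontal,
# 2 vertical, or 4 diagonal neighbors, depending on the pixel's position.
K_G = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]]) / 4.0
K_RB = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 4.0

def bilinear_demosaic(cfa):
    """Estimate the two missing color values at each pixel from same-color samples."""
    masks = bayer_masks(*cfa.shape)
    out = np.empty(cfa.shape + (3,))
    for c, k in enumerate((K_RB, K_G, K_RB)):
        sparse = np.where(masks[..., c], cfa, 0.0)  # zeros where color c was not sampled
        out[..., c] = filter3x3(sparse, k)
    return out
```

Because the kernel weights falling on available same-color samples always sum to one, a flat color patch is reproduced exactly, whereas edges are averaged across, which is the weakness noted above.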
Using anisotropic interpolation [5], [11], [22], [23] and exploiting spectral correlations [5], [7], [9], [10], [11] can yield further performance improvements. Various ideas have been proposed, such as interpolation along the edge direction (the Hamilton method) [5] and substituting the detail wavelet coefficients of the green values into the missing red/blue values using an alternating projection (AP) method [9]. A more commonly used technique for incorporating spectral correlations is to work with color ratios [7] or to transform the image into a suitable color-difference space, as in the so-called effective color interpolation (ECI) [10] and improved effective color interpolation (IECI) [11] methods. Exploiting spectral correlation means that samples from different color channels contribute to the estimate of a missing color value. This requires higher computational complexity but yields better performance than bilinear or bicubic methods, which use only samples of the same color for prediction. Chapters 6 to 9 discuss demosaicking in more detail.

14.2.2 Color Adjustments

The appearance of scenes captured by DSCs depends on the color temperature of the light source. The goal of the white-balancing step is to adjust the colors of captured images so that they have a more natural appearance. In general, white balance can be achieved by multiplying the color channels by appropriate gain factors that depend on the color temperature of the light source.

Color correction aims to relate the sensor output values to the colorimetry of the original scene [12]. Generally, the transformation depends on the digital camera, reflecting differences in the sensors' spectral sensitivities as well as any nonlinear function used to encode the sensor output [12].
A typical example is the color correction matrix of the Canon EOS 300D:

$$
\mathbf{T} = \begin{bmatrix} 1.5915 & -0.6456 & 0.0541 \\ -0.0838 & 1.4794 & -0.3956 \\ 0.0697 & -0.4739 & 1.4042 \end{bmatrix} \tag{14.1}
$$

Gamma correction is essentially a power function of the form $a x^b + c$, where $b < 1$ and $x$ is a color value in the range $[0, 1]$. However, such a function has an infinite slope at zero, which may amplify noise in dark regions. Hence, the function is usually modified with a linear portion near zero. A typical expression for gamma correction is [24]:

$$
\gamma(x) = \begin{cases} 4.5\,x, & x \le 0.018 \\ 1.099\,x^{0.45} - 0.099, & x > 0.018 \end{cases} \tag{14.2}
$$

FIGURE 14.2 Block diagram of alternative processing chain. In the DSC: CFA data → compression → store/transmit. In the end device: decompression → demosaicking → color adjustment (white balance, color and gamma corrections) → output image.

14.2.3 Compression

Some cameras store captured images in the Tagged Image File Format for Electronic Photography (TIFF/EP) [25], which uses either no compression or lossless compression. Since demosaicking produces full-color images, standard compression techniques can be used for DSC image storage. JPEG compression, which is based on the discrete cosine transform [26], is used in the Exchangeable Image File Format (EXIF) [27], a standard for most current DSCs [12]. Recently, wavelet-based compression (now formalized in the JPEG2000 compression standard [28]) has been shown to yield better performance than JPEG [29], because it requires fewer coefficients to represent image discontinuities. Because of this advantage over JPEG, JPEG2000 compression is likely to be incorporated into DSCs in the near future [12]. Detailed discussions of camera image storage formats can be found in Chapter 13.
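The color adjustments of Section 14.2.2 can be chained as in the following sketch, which applies white-balance gains, the Canon EOS 300D matrix of Equation 14.1, and the gamma curve of Equation 14.2. The white-balance gains are a hypothetical input (the chapter only states that they depend on the color temperature of the light source), and clipping to $[0, 1]$ before gamma correction is an assumption of this sketch.

```python
import numpy as np

# Canon EOS 300D color correction matrix, Equation 14.1.
T_CANON_300D = np.array([[1.5915, -0.6456, 0.0541],
                         [-0.0838, 1.4794, -0.3956],
                         [0.0697, -0.4739, 1.4042]])

def rec709_gamma(x):
    """Equation 14.2; np.maximum keeps the power branch away from x <= 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x <= 0.018, 4.5 * x, 1.099 * np.maximum(x, 0.018) ** 0.45 - 0.099)

def color_adjust(rgb, wb_gains, T=T_CANON_300D):
    """White balance -> color correction -> gamma for an H x W x 3 array in [0, 1].

    wb_gains is a hypothetical per-channel gain triple; estimating it from the
    illuminant is outside this sketch.
    """
    balanced = rgb * np.asarray(wb_gains, dtype=float)  # per-channel gains
    corrected = balanced @ T.T                           # T applied to every pixel's RGB vector
    return rec709_gamma(np.clip(corrected, 0.0, 1.0))   # clip to [0, 1], then gamma
```

Each row of $\mathbf{T}$ sums to one, so a white-balanced neutral gray passes through color correction unchanged and only the gamma curve alters it. The two branches of Equation 14.2 also meet almost exactly at $x = 0.018$ (both give about 0.081).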
14.3 Alternative Image Processing Pipelines

Demosaicking does not increase the information content (entropy) of the original image, so the additional pixel values it produces are mainly redundant [16], yet they consume substantial camera resources. To avoid this drawback, alternative processing pipelines often reverse the order of the demosaicking and compression steps [15], [16], [17], [18], [19], [20]. Refer to Chapters 1 and 3 for detailed discussions of various processing configurations.

Figure 14.2 depicts the case where the output of the CFA sensor is compressed directly, before it is converted into a full-color image. This moves demosaicking and the color adjustments (white balance, color correction, and gamma correction) from the camera to the end device. Because the CFA image contains only one third as many color components as the full-color image, this alternative processing chain requires fewer computational resources, less storage capacity, and/or less transmission bandwidth. Off-loading demosaicking from the camera to an end device, such as a personal computer (PC), personal digital assistant (PDA), or printer, reduces the in-camera processing time and hence the power consumption. This simplifies the hardware architecture and can reduce the cost and power consumption of DSCs, which is especially beneficial for cameras built into mobile phones and PDAs [30]. Furthermore, recent work [15], [16], [17], [18], [19], [20] has indicated that the alternative processing chain can achieve better image quality than the conventional one at low compression ratios. It also allows different demosaicking methods to be applied to suit the needs of the application, for example, simpler demosaicking methods for a PDA image viewer and more advanced, more complex methods for high-resolution PC displays [31].
All elements of the conventional processing chain can be reused in the alternative chain except compression, which has to be revised to work directly on CFA samples rather than on a full-color image. Several techniques have been proposed in the literature, such as applying standard grayscale compression techniques after discarding the pixel color labels [16], or using the Mallat packet wavelet transform and JPEG2000 compression to achieve better performance in the presence of artificial discontinuities [17], [32]. It has also been suggested to compress the CFA color components separately [15], [16], [33], [34], which effectively discards the spectral correlations.

Color transformations are used in existing image compression standards to decorrelate the color components [26], [28]. A similar transformation can be used to remove artificial discontinuities in CFA images. For example, the popular Bayer CFA [35] consists of two green, one blue, and one red sample arranged in a 2 × 2 square block. Xie et al. [36] therefore downsampled the green plane to the size of the red and blue planes to simplify the use of a color transformation on CFA mosaic images. Lee and Ortega [18], Parrein et al. [19], and Koh et al. [20] converted the four color values in each 2 × 2 Bayer unit into two luminance values (Y1 and Y2) and two chrominance values (U and V). The resulting chrominance values reside on a rectangular lattice one quarter the size of the CFA, while the luminance values populate a quincunx lattice half the size of the CFA. Because the chrominance components (U and V) lie on a standard rectangular grid, they can be compressed using any standard grayscale approach.
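The 2 × 2 conversion can be sketched as follows, again assuming an RGGB layout. The chapter does not give the 4 × 4 matrix used in [18], [19], [20], so the matrix `M` below is a hypothetical, invertible YUV-style choice (luminance weights 0.299/0.587/0.114 paired with each green sample; chrominance as blue or red minus the mean green). It serves only to show the data layout: U and V each hold one quarter of the CFA samples, and Y1 and Y2 together hold the remaining half on the quincunx lattice of the green sites.

```python
import numpy as np

# Hypothetical luminance/chrominance transform acting on [R, G1, G2, B] of one
# RGGB unit; it is NOT the matrix used in the cited papers.
M = np.array([
    [0.299, 0.587, 0.0, 0.114],  # Y1: luminance built around G1
    [0.299, 0.0, 0.587, 0.114],  # Y2: luminance built around G2
    [0.0, -0.5, -0.5, 1.0],      # U : B - mean(G1, G2)
    [1.0, -0.5, -0.5, 0.0],      # V : R - mean(G1, G2)
])
M_INV = np.linalg.inv(M)

def cfa_to_y1y2uv(cfa):
    """Split an RGGB mosaic (even height and width) into four quarter-size planes."""
    units = np.stack([cfa[0::2, 0::2],   # R
                      cfa[0::2, 1::2],   # G1
                      cfa[1::2, 0::2],   # G2
                      cfa[1::2, 1::2]],  # B
                     axis=-1)            # shape (H/2, W/2, 4)
    y1, y2, u, v = np.moveaxis(units @ M.T, -1, 0)
    return y1, y2, u, v

def y1y2uv_to_cfa(y1, y2, u, v):
    """Invert the transform and re-interleave the RGGB mosaic."""
    units = np.stack([y1, y2, u, v], axis=-1) @ M_INV.T
    h, w = y1.shape
    cfa = np.empty((2 * h, 2 * w))
    cfa[0::2, 0::2] = units[..., 0]
    cfa[0::2, 1::2] = units[..., 1]
    cfa[1::2, 0::2] = units[..., 2]
    cfa[1::2, 1::2] = units[..., 3]
    return cfa
```

For a neutral gray mosaic, U and V vanish and Y1 = Y2 equals the gray level, which is the decorrelation effect the transformation is meant to provide.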
For the quincunx lattice of the luminance plane, it has been suggested to either split it into two rectangular lattices Y1 and Y2, convolve it with a low-pass filter followed by column-wise downsampling [20], or rotate it by 45°, converting it into a rectangular lattice with diamond-shaped support [18], [19]. For detailed information on CFA structure conversions and lossless compression of single-sensor mosaic images, refer to Chapter 15.

14.4 Performance Modelling

14.4.1 Modelling Pipeline Elements

First, we must define what we mean by performance modelling. A performance model is a mathematical expression that links an algorithm's performance to a parametric description of the data content and/or the algorithm's settings. For example, the data content parameters may include spatial (i.e., interpixel) or spectral (i.e., interchannel) correlations and the noise variance; possible algorithm settings include the bitrate and block size (for compression) and the interpolation window size (for demosaicking). A mathematical model provides a better understanding of the algorithm as well as important cues for its application and further development.

Current research on the analysis of DSC elements tends to rely on empirical performance evaluation. This, however, often produces inconclusive results, with some algorithms performing better on some images and worse on others. For example, Table 14.1 shows the CPSNR (color peak signal-to-noise ratio) performance of three demosaicking methods for the 24 test images shown in Figure 14.3. The demosaicking methods under consideration are the frequency-domain method (FDM) [37], the variance of color difference (VOCD) method [38], and the adaptive filtering demosaicking (AFD) method [39]. Here CPSNR is defined in terms of the average mean square error (MSE) of the reconstructed color components, for RGB values normalized to [0, 1]:

$$
\mathrm{CPSNR} = -10\log_{10}\Big(\frac{1}{3}\sum_{c\in\{R,G,B\}}\mathrm{MSE}_c\Big) \tag{14.3}
$$

FIGURE 14.3 Test images referred to as Image 1 to Image 24, enumerated from left to right and top to bottom.

TABLE 14.1 CPSNR performance (in dB) of three demosaicking methods. The best CPSNR of each row is shown in bold.

| Image | FDM | VOCD | AFD |
|---|---|---|---|
| 1 | 38.08 | **38.53** | 37.60 |
| 2 | 38.81 | 40.00 | **40.30** |
| 3 | 41.68 | 42.54 | **42.77** |
| 4 | 40.23 | 40.31 | **40.40** |
| 5 | 37.91 | 38.02 | **38.12** |
| 6 | 39.97 | **40.01** | 38.13 |
| 7 | 41.89 | 41.74 | **42.74** |
| 8 | 35.32 | **36.43** | 35.38 |
| 9 | 42.26 | **43.18** | 42.99 |
| 10 | 42.06 | 42.31 | **42.65** |
| 11 | 39.85 | **39.95** | 39.49 |
| 12 | 42.92 | **43.03** | 42.83 |
| 13 | **35.18** | 34.97 | 33.90 |
| 14 | 36.11 | 37.26 | **37.39** |
| 15 | 39.19 | 39.40 | **39.54** |
| 16 | **43.75** | 43.73 | 41.33 |
| 17 | **41.61** | **41.61** | 41.60 |
| 18 | **36.74** | 36.65 | 36.68 |
| 19 | 40.10 | **40.83** | 40.14 |
| 20 | 39.82 | **40.38** | 40.29 |
| 21 | 38.81 | **39.26** | 38.71 |
| 22 | 37.99 | 38.01 | **38.55** |
| 23 | 40.30 | 40.41 | **41.09** |
| 24 | **35.13** | 34.89 | 34.66 |

Note that FDM performs best on images 13, 16, 18, and 24; VOCD excels on images 1, 6, 8, 9, 11, 12, 17, and 19 to 21 (on image 17 it ties with FDM at the reported precision); and AFD is best on the remaining ten images. Without a mathematical performance model, it is difficult to predict beforehand which method will perform better on a given image.

TABLE 14.2 Performance comparison (SNR, in dB) of demosaicking on image data after and before gamma and color corrections (averaged over the test images).

| Method | After | Before |
|---|---|---|
| Bilinear | 13.39 | 12.46 |
| Hamilton [5] | 21.18 | 18.22 |
| AP [9] | 23.34 | 19.41 |
| IECI [11] | 23.91 | 21.25 |

TABLE 14.3 Performance comparison (SNR, in dB) of compression on image data after and before gamma and color corrections (averaged over the test images).

| Compression | Order | 5:1 | 10:1 | 15:1 | 30:1 |
|---|---|---|---|---|---|
| JPEG | after | 27.13 | 22.58 | 19.86 | 15.78 |
| JPEG | before | 21.30 | 18.41 | 16.65 | 13.75 |
| JPEG2000 | after | 31.47 | 26.23 | 23.22 | 18.79 |
| JPEG2000 | before | 24.57 | 21.04 | 18.99 | 15.82 |

Another example of the use of performance models comes from our recent work on error inhomogeneity in wavelet-based compression [40].
In this work, we extend the well-known Katto-Yasuda compression error model [41] to design a balanced wavelet, which sacrifices some compression performance in exchange for reduced error inhomogeneity. The same compression model has also been used in many recent publications [42], [43], [44], [45], [46], [47] on wavelet evaluation and design for compression.

14.4.2 Modelling Interactions

Interactions between the various elements of the DSC pipeline are rarely mentioned in the literature. Their existence and importance can be demonstrated with the following simple experiments. Table 14.2 and Table 14.3 show a surprising result: applying demosaicking and compression after gamma and color corrections¹ can improve SNR performance by one to four dB for demosaicking and by up to seven dB for compression at high bitrates, compared with the opposite order. Here SNR is defined as the ratio between the signal variance $\sigma_I^2$ and the average MSE of the reconstructed image:

$$
\mathrm{SNR} = 10\log_{10}\frac{\sigma_I^2}{\frac{1}{3}\sum_{c\in\{R,G,B\}}\mathrm{MSE}_c} \tag{14.4}
$$

If there were no interaction between these elements, we would expect exactly the same performance regardless of the order.

¹ Here we adopt the color correction matrix from Equation 14.1, used in the Canon EOS 300D camera, and the gamma correction defined in Equation 14.2, according to Rec. 709 [24]. Signal-to-noise ratios (SNRs) are computed on the gamma- and color-corrected data.

FIGURE 14.4 (See color insert.) Cropped region of the Window image: (a) after bilinear demosaicking; (b) after bilinear demosaicking and JPEG (10:1) compression.

TABLE 14.4 CPSNR performance (in dB) of the conventional processing chain with bilinear demosaicking and JPEG/JPEG2000 compression.
| Image | After demosaicking | After demosaicking and JPEG (10:1) | After demosaicking and JPEG (30:1) | After demosaicking and JPEG2000 (10:1) | After demosaicking and JPEG2000 (30:1) |
|---|---|---|---|---|---|
| window | 32.37 | 33.86 | 32.40 | 32.08 | 31.71 |
| lighthouse | 28.01 | 30.04 | 29.17 | 27.87 | 27.65 |
| statue | 31.91 | 33.66 | 32.41 | 31.67 | 31.27 |
| sails | 31.69 | 33.59 | 32.97 | 31.58 | 31.38 |

That compression and demosaicking also interact with each other follows from the fact that applying JPEG compression after bilinear demosaicking reduces the overall error, as shown in Table 14.4 and illustrated in Figure 14.4. This would not happen if the compression and demosaicking errors were independent, since lossy compression on its own always decreases image quality. Moreover, this interaction may depend on the demosaicking and compression methods used. For example, the CPSNR improvement after compression does not occur if JPEG2000 compression is used; see Table 14.4.

The importance of modelling interactions also follows from a survey of recent publications comparing the conventional and alternative chains [18], [20], [19]. One goal of these studies was to support the viability of alternative processing chains by identifying conditions under which their performance is comparable to or better than that of the conventional processing chain. It appears that the main factors affecting the relative performance of the two chains are the compression ratio and the choice of compression algorithm. For example, Koh et al. [20] used JPEG compression in their alternative processing chain and found that it can outperform the conventional one at low (10:1) and high (80:1) compression ratios. Lee and Ortega [18] observed the superiority of alternative processing with JPEG compression at low compression ratios. Parrein et al. [19] used JPEG2000 and reached the same conclusion for compression ratios of 15:1.
All of these conclusions are based on empirical studies, which makes their results relevant only to existing compression and demosaicking algorithms. Without a mathematical performance model, it is difficult to explain why the alternative processing chain outperforms the conventional one at low compression ratios, or to predict the break-point compression ratio (where the relative performance changes sign) for a given pair of compression and demosaicking algorithms. Furthermore, it is unclear which chain will be superior once more advanced demosaicking and compression algorithms become available.

14.4.3 Modelling via Taylor Series Expansion

Demosaicking and compression methods usually exploit spatial and spectral image correlations. In general, weaker correlations yield larger reconstruction and compression errors, while correlations close to 1 yield smaller errors. For perfect spatial and spectral correlation, $\rho = 1$, the error should be zero. This suggests the following performance model, in which the error variance $e(\rho)$ is treated as a function of correlation and expanded as a Taylor series around $\rho = 1$:

$$
e(\rho) = \sum_{k=1}^{\infty}\frac{e^{(k)}(1)}{k!}(\rho-1)^k\,\sigma_I^2 \tag{14.5}
$$

where $\sigma_I^2$ is the signal variance of the color values $x$. In reality, an image exhibits different types of correlation (e.g., spatial correlation in the horizontal and diagonal directions, and spectral correlation), and hence the error variance should be treated as a function of several variables. The following expression shows an example model that preserves only the linear terms and includes spatial correlations $\rho_{i,j}$ in various directions and spectral correlations $\nu_{S_k,S_l}$:

$$
e(\rho,\nu) = \Big[\sum_{i,j} e^{(1)}_{i,j}(\rho_{i,j}-1) + \sum_{S_k\neq S_l} e^{(1)}_{S_k,S_l}(\nu_{S_k,S_l}-1)\Big]\sigma_I^2 \tag{14.6}
$$

Here $S_k$ and $S_l$, both from $\{R, G, B\}$, designate the color components, and $e^{(1)}_{i,j}$ and $e^{(1)}_{S_k,S_l}$ denote the first-order derivatives of the error with respect to the corresponding correlations at correlation 1. To complete the model, these unknown coefficients are estimated empirically.

Taylor series expansion can also be used to model interactions.
Let $D$ and $F$ be two consecutive DSC processing elements. Let $\Delta x$ be the error of processing element $D$, so that $D(x) = x + \Delta x$, and let $\zeta^2 = E(\Delta x^2)$ be its variance. If processing element $F$ does not produce errors, the total error is $F[D(x)] - F(x) = F(x + \Delta x) - F(x)$. The variance of the total error is a monotonic function of $\zeta^2$, $f(\zeta^2) = E\{[F(x + \Delta x) - F(x)]^2\}$, such that $f(0) = 0$. Assuming that $\zeta^2$ is small, we can expand $f(\zeta^2)$ as a Taylor series, preserving only the linear term:

$$
f(\zeta^2) = E\{[F(x+\Delta x)-F(x)]^2\} \approx c^2\zeta^2 \tag{14.7}
$$

where, in general, the parameter $c^2 = E\{[\partial F(x)/\partial x]^2\}$ can depend on $F$ (and, for a nonlinear $F$, on the distribution of $x$). If processing element $F$ produces an error of its own with variance $\xi^2$, the total error becomes

$$
F[D(x)] - x = F[D(x)] - F(x) + F(x) - x \tag{14.8}
$$

The total error variance can be shown to be [48]

$$
E\{F[D(x)] - x\}^2 = c^2\zeta^2 + \xi^2 + 2c\rho_c\zeta\xi \tag{14.9}
$$

where $\rho_c$ is the correlation between the two processing errors. In the following sections, we provide several practical examples of modelling the DSC processing elements (Section 14.5) and their interactions (Section 14.6).

14.5 Modelling of Individual Digital Still Camera Processing Elements

In DSC processing, the color adjustments (white balance, color and gamma corrections) transform images from one space to another and do not produce errors of their own. Hence, in this section we concentrate on modelling demosaicking and compression errors.

14.5.1 Modelling of Demosaicking Errors

Demosaicking methods use the pixels in a local neighborhood for interpolation. Hence, we restrict Equation 14.6 to the correlations corresponding to $|i| + |j| \le 1$. Furthermore, to reduce the number of parameters, we assume $e_{S_k,S_l}$ to be the same for any $S_k \neq S_l$. Since non-coinciding pixels in different color components are used for demosaicking, we add a new term, $\nu_{S_k,S_l}\rho_0$, corresponding to the spectral correlations of non-coinciding pixels.
The resulting model then has the form

$$
\zeta^2 = \Big\{\chi_0 + \chi_1\rho_0 + \chi_2\sum_{S_k\neq S_l}\nu_{S_k,S_l} + \chi_3\sum_{S_k\neq S_l}\nu_{S_k,S_l}\rho_0\Big\}\sigma_I^2 = \boldsymbol{\chi}^{\mathsf T}\mathbf{P}\,\sigma_I^2 \tag{14.10}
$$

where the corresponding sums of $e_{i,j}$ and $e_{S_k,S_l}$ are now abbreviated as the coefficients $\chi_i$, and $\mathbf{P} = [1, \rho_0, \sum\nu_{S_k,S_l}, \rho_0\sum\nu_{S_k,S_l}]^{\mathsf T}$ is the vector of image correlations. To estimate the unknown parameters for each demosaicking method, we fit the model of Equation 14.10 to the actual demosaicking errors of 15 test images shown in Figure 14.3. Table 14.5 shows the modelling misfit, defined via the ratio between the predicted and the actual demosaicking errors as $|1 - \zeta^2_{\text{predicted}}/\zeta^2_{\text{actual}}|$. To check the validity of our model, we also applied it to four test images that were excluded from the training set; see the bottom four rows of Table 14.5. As can be seen, the misfit is less than 10% in most cases, which indicates that the linear model is sufficiently accurate.

TABLE 14.5 Modelling misfit of demosaicking error variance.
Methods: bilinear, Freeman, Hamilton, ECI, AP, ECI3.
Prediction misfit for training images: 0.5%, 4.5%, 2.4%, 0.4%, 0.0%, 2.7%.
Modelling misfit for test images 16, 17, 18, and 19 (values in the order extracted from the original layout): 9.6%, 2.1%, 9.4%, 13.0%, 14.1%, 0.1%, 0.2%, 1.2%, 2.6%, 17.5%, 12.8%, 19.9%, 14.3%, 0.0%, 0.0%, 0.7%, 0.2%, 1.2%, 8.6%, 5.1%, 1.2%, 1.9%, 15.3%, 9.5%.

14.5.2 Modelling of Compression Errors

Unlike the demosaicking model, the model for compression error needs to link performance with both the image content (correlations) and the compression bitrate. The method based on Taylor series expansion discussed above (Equation 14.6) is effective in capturing the former but not the latter.
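The fitting step of Section 14.5.1 is an ordinary linear least-squares problem, because Equation 14.10 is linear in $\chi_0, \dots, \chi_3$. The sketch below assumes that the per-image statistics ($\rho_0$, the sum of spectral correlations, $\sigma_I^2$, and the measured demosaicking error variance $\zeta^2$) have already been computed; the function names are hypothetical, and the data in the usage check are synthetic, not the chapter's measurements.

```python
import numpy as np

def feature_vector(rho0, nu_sum, sigma2):
    """P * sigma_I^2 from Equation 14.10: [1, rho0, sum(nu), rho0 * sum(nu)] * sigma_I^2."""
    return np.array([1.0, rho0, nu_sum, rho0 * nu_sum]) * sigma2

def fit_chi(rho0, nu_sum, sigma2, zeta2):
    """Least-squares estimate of chi_0..chi_3 from per-image statistics."""
    X = np.stack([feature_vector(r, n, s) for r, n, s in zip(rho0, nu_sum, sigma2)])
    chi, *_ = np.linalg.lstsq(X, np.asarray(zeta2, dtype=float), rcond=None)
    return chi

def misfit(chi, rho0, nu_sum, sigma2, zeta2):
    """Per-image |1 - predicted / actual|, the measure reported in Table 14.5."""
    pred = np.array([feature_vector(r, n, s) @ chi
                     for r, n, s in zip(rho0, nu_sum, sigma2)])
    return np.abs(1.0 - pred / np.asarray(zeta2, dtype=float))
```

In practice, the coefficients would be fitted on the training images and `misfit` evaluated on held-out images, as in the bottom rows of Table 14.5.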
Most existing compression models are based on the work of Katto and Yasuda [41], who modelled the compression error variance $\xi_I^2$ for a K-subband decomposition² with optimal bit allocation as

$$
\xi_I^2 = \varepsilon^2\sigma_I^2\,2^{-2R_I}\prod_{k=0}^{K-1}(A_kB_k)^{\alpha_k} \tag{14.11}
$$

where $R_I$ is the total compression bitrate, $\alpha_k$ is the relative proportion of wavelet coefficients belonging to the $k$th subband, and $\varepsilon$ is a constant that depends on the signal. The term $A_k$ represents the relative proportion of signal energy in the $k$th subband, that is, $\sigma_k^2 = A_k\sigma_I^2$. The term $B_k$ governs the contribution of the $k$th subband quantization error variance $\sigma_{q_k}^2$ to the total quantization error, that is, $\xi_I^2 = \sum_{k=0}^{K-1}B_k\sigma_{q_k}^2$. For one-dimensional (1D) signals modelled as Markov chains, $A_k$ and $B_k$ are related to the analysis filters $h_k$, the synthesis filters $g_k$, and the spatial correlation [42], [43] as follows:

$$
A_k = \sum_i\sum_j h_k(i)\,h_k(j)\,\rho^{|i-j|} \tag{14.12}
$$

$$
B_k = \sum_i g_k^2(i) \tag{14.13}
$$

Since the filters are known by construction, the resulting model depends only on the bitrate $R_I$, the spatial correlations, and the signal variance $\sigma_I^2$. The model in Equation 14.11 can be extended to images modelled as two-dimensional (2D) Markov random processes [42], [43], [44], [45], [46], [47].

² The theory can be straightforwardly extended to JPEG compression by grouping DCT coefficients and treating the groups as subbands. For example, in JPEG compression all zero-frequency coefficients are assigned the same bitrate and can be considered as one subband.

It is important to note that the product of the terms $A_kB_k$ in Equation 14.11 is closely related to the coding gain [42], [43], defined as

$$
G_I = \frac{1}{\prod_{k=0}^{K-1}(A_kB_k)^{\alpha_k}} \tag{14.14}
$$

The derivation of Equation 14.11 is based on modelling the $k$th subband quantization error as $\sigma_{q_k}^2 = \varepsilon^2\sigma_k^2\,2^{-2R_k}$ [49], where $R_k$ is the subband bitrate. This quantization model is valid only at high bitrates $R_k$ [50].
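As an illustration of Equations 14.11 to 14.14, the sketch below evaluates $A_k$, $B_k$, and the coding gain for a single-level, two-band decomposition of a first-order Markov signal. The orthonormal Haar filters are an assumed example, not a choice made in the chapter. For them, $A_0 = 1 + \rho$, $A_1 = 1 - \rho$, $B_0 = B_1 = 1$, and $\alpha_0 = \alpha_1 = 1/2$, so $G_I = 1/\sqrt{1-\rho^2}$.

```python
import math

def subband_energy(h, rho):
    """A_k of Equation 14.12 for analysis filter h and Markov correlation rho."""
    return sum(h[i] * h[j] * rho ** abs(i - j)
               for i in range(len(h)) for j in range(len(h)))

def synthesis_weight(g):
    """B_k of Equation 14.13: energy of the synthesis filter."""
    return sum(c * c for c in g)

def coding_gain(analysis, synthesis, alphas, rho):
    """G_I of Equation 14.14."""
    prod = 1.0
    for h, g, a in zip(analysis, synthesis, alphas):
        prod *= (subband_energy(h, rho) * synthesis_weight(g)) ** a
    return 1.0 / prod

def katto_yasuda_error(sigma2, rate, gain, eps2=1.0):
    """Equation 14.11 rewritten with the coding gain: eps^2 sigma^2 2^(-2R) / G_I."""
    return eps2 * sigma2 * 2.0 ** (-2.0 * rate) / gain

# Assumed example filters: orthonormal Haar (low-pass, high-pass).
s = 1.0 / math.sqrt(2.0)
HAAR_ANALYSIS = [[s, s], [s, -s]]
HAAR_SYNTHESIS = [[s, s], [s, -s]]
```

For example, `coding_gain(HAAR_ANALYSIS, HAAR_SYNTHESIS, [0.5, 0.5], 0.95)` evaluates to about 3.20, i.e., $1/\sqrt{1-0.95^2}$; for an uncorrelated signal ($\rho = 0$) the gain drops to 1. In this high-bitrate model, each additional bit reduces the error variance by a factor of four.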
We can generalize the model in Equation 14.11 to lower bitrates by replacing the exponent $-2R_I$ with $-2\beta(R_I)R_I$, where $\beta$ is a function of the bitrate that equals 1 at high bitrates [48]:

$$
\xi_I^2 = \varepsilon^2\sigma_I^2\,2^{-2\beta(R_I)R_I}\,G_I^{-1} \tag{14.15}
$$

Based on the generalized error model of Equation 14.15, it is possible to derive compression error models for full-color and CFA images.

14.5.2.1 Full-Color Image Compression

The components of color images (R, G, and B) can be compressed independently by treating each component as a grayscale image. Let $C \in \{R, G, B\}$ be an arbitrary color component. Using the error model of Equation 14.15, the compression error of $C$ is given by $\xi_C^2 = \varepsilon^2\sigma_C^2\,2^{-2\beta(R_C)R_C}\,G_I^{-1}$. Here $\sigma_C^2$ and $R_C$ are the pixel variance of component $C$ and the bitrate assigned to it, respectively, and $\sum_C R_C = R$. Hence, the average compression error over all color components is

$$
\xi_s^2 = \frac{1}{3}\varepsilon^2 G_I^{-1}\sum_{C}\sigma_C^2\,2^{-2\beta(R_C)R_C} \tag{14.16}
$$

A color image is typically compressed in an appropriate color space, such as YUV or YCbCr [26], [28]. We apply the error model of Equation 14.15 by treating the color transformation $T_s$ as a 3-subband decomposition [48]. The analysis and synthesis filters $h_k$ and $g_k$ are then the rows of $T_s$ and $T_s^{-1}$, respectively, and the spatial correlations in Equation 14.15 are replaced by spectral correlations. Furthermore, the parameters $A_C$ and $B_C$ for $C \in \{Y, U, V\}$ can be derived from Equations 14.12 and 14.13. As detailed in Reference [48], the overall compression error becomes

$$
\xi_s^2 = \varepsilon^2\sigma_I^2\,2^{-2\beta(R/3)R/3}\prod_{C\in\{Y,U,V\}}(A_CB_C)^{1/3}\prod_{k=0}^{K-1}(A_kB_k)^{\alpha_k} = \varepsilon^2\sigma_I^2\,2^{-2\beta(R/3)R/3}\,G_s^{-1} \tag{14.17}
$$

where $G_s$ is the new coding gain.

14.5.2.2 CFA Image Compression

Without a color transformation, the model for CFA compression is similar to that for grayscale image compression given in Equation 14.15. We obtain [48]:

$$
\xi_o^2 = \varepsilon^2\sigma_I^2\,2^{-2\beta(R)R}\prod_{k=0}^{K-1}(A_{ok}B_k)^{\alpha_k} = \varepsilon^2\sigma_I^2\,2^{-2\beta(R)R}\,G_o^{-1} \tag{14.18}
$$

The only difference is the new expression for the coding gain $G_o$, which is computed from the new spatial correlations (those of the CFA mosaic).
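Dividing Equation 14.17 by Equation 14.18 cancels $\varepsilon^2\sigma_I^2$ and leaves only the bitrate term and the coding-gain ratio; this ratio is written out as Equation 14.20 in Section 14.5.2.3. The sketch below evaluates it and solves for the break-point bitrate $R_b$ at which the two errors are equal, under the assumption used there that $\beta_r(R) = \beta(R) - \beta(R/3)/3$ is a constant (2/3 when $\beta = 1$). The coding-gain ratio $G_s/G_o$ is a hypothetical input, and the comparison covers compression error only, not the full processing chains.

```python
import math

def beta_r(beta_full, beta_third):
    """beta_r(R) = beta(R) - beta(R/3) / 3, given beta(R) and beta(R/3)."""
    return beta_full - beta_third / 3.0

def error_ratio(rate, gain_ratio, br=2.0 / 3.0):
    """xi_s^2 / xi_o^2 = 2^(2 beta_r R) * G_o / G_s, where gain_ratio = G_s / G_o."""
    return 2.0 ** (2.0 * br * rate) / gain_ratio

def break_point_bitrate(gain_ratio, br=2.0 / 3.0):
    """Bitrate R_b (bpp) at which xi_s^2 = xi_o^2, treating beta_r as a constant."""
    return math.log2(gain_ratio) / (2.0 * br)

def lower_error_path(rate, gain_ratio, br=2.0 / 3.0):
    """Which compression path has the smaller modelled error at this bitrate."""
    if error_ratio(rate, gain_ratio, br) > 1.0:
        return "CFA (alternative)"
    return "full-color (conventional)"
```

For instance, a hypothetical ratio $G_s/G_o = 2^{1.5}$ gives $R_b \approx 1.125$ bpp. Below $R_b$ the full-color path has the smaller compression error; above it, the larger coding gain $G_s$ can no longer offset the threefold data volume.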
Performance can be improved by using a luminance/color-difference transformation. The red (R), blue (B), and two green (G1 and G2) samples in a 2 × 2 Bayer unit are converted into Y1, Y2, U, and V samples using a linear transformation defined by a 4 × 4 matrix. With this transformation, the compression error variance becomes [48]:

$$
\xi_t^2 = \varepsilon^2\sigma_I^2\,2^{-2\beta(R)R}\prod_{C\in\{Y_1,Y_2,U,V\}}\Big[(A_CB_C)^{1/4}\prod_{k=0}^{K-1}(A_{Ck}B_k)^{\alpha_k}\Big] = \varepsilon^2\sigma_I^2\,2^{-2\beta(R)R}\,G_t^{-1} \tag{14.19}
$$

where $G_t$ is the corresponding coding gain. The correlated components Y1 and Y2 can be compressed more efficiently as one image Y [18], [19], [20]. It has also been suggested to rotate Y by 45°, converting the quincunx lattice into a rectangular lattice with diamond-shaped support [18], [19]. Equation 14.19 remains applicable if the parameters $A_{Ck}$ are recalculated from the spatial correlations in the horizontal direction of the quincunx lattice.

14.5.2.3 Evaluation of the Compression Error Models

It should be noted that compression error models can accurately predict the relative performance of two compression algorithms on the same data, but not the absolute value of the error [48]. Here we demonstrate this with an example comparison between full-color and CFA image compression performance. From Equations 14.17 and 14.18, we obtain

$$
\frac{\xi_s^2}{\xi_o^2} = 2^{2\beta_r(R)R}\,\frac{G_o}{G_s} \tag{14.20}
$$

where $\beta_r(R) = \beta(R) - \beta(R/3)/3$, while $G_s$ and $G_o$ are the respective coding gains. The ratio $\xi_s^2/\xi_o^2$ depends on the ratio of the coding gains $G_s/G_o$ and on the bitrate via $2^{2\beta_r(R)R}$. Full-color images have higher spatial and spectral correlations than the corresponding CFA images, leading to larger coding gains, $G_s > G_o$. However, full-color images also have three times as many pixel values as the CFA images; this is captured by the term $2^{2\beta_r(R)R}$. At high bitrates, $\beta(R) = 1$, and then $\beta_r(R) = 1 - 1/3 = 2/3$.
Hence, the term $2^{2\beta_r(R)R}$, which is much greater than 1 at high bitrates, dominates there, leading to

$$
\frac{\xi_s^2}{\xi_o^2} = 2^{2\beta_r(R)R}\,\frac{G_o}{G_s} > 1
$$

and bringing about the superiority of the alternative processing chain at low compression ratios.

Experimental results are generally consistent with this analysis. In particular, the results show that the log-ratio is approximately linear for bitrates $R > 1.5$ bpp (bits per pixel), and that the break-point bitrate, at which $\xi_s^2/\xi_o^2 = 1$, generally lies in the linear portion of the curves. Hence, we can predict the break-point bitrate by treating $\beta_r(R)$ in Equation 14.20 as a constant, which can be estimated by fitting Equation 14.20 to actual data at high bitrates. Our predicted break-point for JPEG2000 is $R_b = 2.2$ bpp, which practically coincides with the actual value. For JPEG, the predicted value is $R_b = 3.6$ bpp, which is slightly smaller than the actual $R_b = 3.8$ bpp.

14.6 Modelling of Interactions

14.6.1 Color Adjustments and Demosaicking / Compression

Since color and gamma corrections do not produce errors of their own, we model their interaction with demosaicking/compression based on the result in Equation 14.7. If demosaicking precedes color and gamma correction, the total error variance is

$$
\Upsilon^2 = c_f^2\zeta_0^2 \tag{14.21}
$$

where $c_f^2$ depends on the derivative of the color adjustment function. When color and gamma corrections precede demosaicking, $\Upsilon^2 = \zeta_1^2$, where $\zeta_0^2 \approx \zeta_1^2$, since the total strength of the spatial and spectral correlations is not substantially affected by color and gamma corrections. Since color correction is a linear operation, its derivative does not depend on the image content. Assuming that the errors of the preceding operation have equal variance and are independent in the three color channels, we have

$$
c_f^2 = \frac{1}{3}\mathrm{Tr}(\mathbf{T}\mathbf{T}^{\mathsf T}) \tag{14.22}
$$

where Tr denotes the trace of a matrix. For the Canon EOS 300D camera, $c_f^2 = 2.5$. Other cameras have $c_f^2$ of around 2.
This means that color correction can enlarge the errors of the preceding operations, such as demosaicking and compression, which is consistent with our earlier observations in Table 14.2 and Table 14.3. The derivative of the gamma correction in Equation 14.2 is

$$
\frac{d\gamma}{dx}(x) = \begin{cases} 4.5, & x \le 0.018 \\ 0.4945\,x^{-0.55}, & x > 0.018 \end{cases} \tag{14.23}
$$

The value of $c_f^2$ can then be obtained from the distribution $p(x)$ of the image color values:

$$
c_f^2 = \int_0^1 \Big[\frac{d\gamma}{dx}(x)\Big]^2 p(x)\,dx \tag{14.24}
$$

We plot the derivative $d\gamma/dx$ as a function of image intensity in Figure 14.5. The curve intersects the line $d\gamma/dx = 1$ at $x = 0.278$, indicating that gamma correction enlarges the error for $x < 0.278$ (darker regions) and reduces it for $x > 0.278$ (brighter regions). Hence, darker images will have larger $c_f^2$ and brighter images smaller $c_f^2$. For example, $c_f^2 = 1.32$ and $c_f^2 = 11.13$ for test images 12 and 17, whose color value distributions are shown in Figure 14.6a and Figure 14.6b, respectively. Note that $c_f^2 > 1$ even for the brighter image, because of the much larger multiplication factors in darker regions. While it is theoretically possible to have $c_f^2 < 1$ for very bright images, in the overwhelming majority of cases $c_f^2 > 1$, and hence demosaicking and compression achieve better performance on gamma-corrected images. This agrees with our earlier observations in Table 14.2 and Table 14.3.

FIGURE 14.5 Relationship between intensity $x$ and derivative value $d\gamma/dx$ for gamma correction (horizontal axis: $x$ from 0 to 1; vertical axis: $d\gamma/dx$; darker and brighter regions marked).

FIGURE 14.6 The distribution of color values for (a) test image 12 and (b) test image 17 (horizontal axis: $x$; vertical axis: $p(x)$; darker and brighter regions marked).
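The two amplification factors of Section 14.6.1 can be computed as in the following sketch: Equation 14.22 for the Canon EOS 300D matrix, and Equation 14.24 with $p(x)$ replaced by the empirical distribution of a set of pixel values (an assumption of this sketch). The last function is a Monte Carlo check of Equation 14.7 for $F = \gamma$: for a small error with standard deviation `zeta`, the measured variance ratio should be close to $c_f^2$. All function names are hypothetical.

```python
import numpy as np

T_CANON_300D = np.array([[1.5915, -0.6456, 0.0541],
                         [-0.0838, 1.4794, -0.3956],
                         [0.0697, -0.4739, 1.4042]])

def cf2_color(T):
    """Equation 14.22: Tr(T T^T) / 3 for independent, equal-variance channel errors."""
    return float(np.trace(T @ T.T)) / 3.0

def gamma(x):
    """Equation 14.2."""
    x = np.asarray(x, dtype=float)
    return np.where(x <= 0.018, 4.5 * x, 1.099 * np.maximum(x, 0.018) ** 0.45 - 0.099)

def gamma_slope(x):
    """Equation 14.23; 1.099 * 0.45 = 0.49455 (rounded to 0.4945 in the text)."""
    x = np.asarray(x, dtype=float)
    return np.where(x <= 0.018, 4.5, 1.099 * 0.45 * np.maximum(x, 0.018) ** -0.55)

def cf2_gamma(values):
    """Equation 14.24 with p(x) taken as the empirical distribution of `values`."""
    return float(np.mean(gamma_slope(values) ** 2))

def unit_slope_point():
    """Intensity at which d(gamma)/dx = 1 on the power-law branch."""
    return (1.099 * 0.45) ** (1.0 / 0.55)

def empirical_gain(values, zeta=1e-3, seed=0):
    """Monte Carlo estimate of E{[gamma(x + dx) - gamma(x)]^2} / zeta^2 (Equation 14.7)."""
    rng = np.random.default_rng(seed)
    dx = rng.normal(0.0, zeta, size=np.shape(values))
    return float(np.mean((gamma(values + dx) - gamma(values)) ** 2)) / zeta ** 2
```

With the Canon matrix, `cf2_color` returns about 2.50 and `unit_slope_point` about 0.278, matching the values quoted above. Because the squared slope reaches $4.5^2 = 20.25$ near black, even a small fraction of dark pixels can push $c_f^2$ above 1 for an otherwise bright image.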
14.6.2 Demosaicking and Compression

The influence of the interaction between demosaicking and compression on overall performance can be modelled using the Taylor series expansion approach of Equation 14.9. If demosaicking precedes compression, the error is [48]:

$$\Omega^2 = \xi_s^2 + a_c^2\zeta^2 + 2\rho_a a_c \xi_s \zeta \approx \xi_s^2 + a_c^2\zeta^2 \tag{14.25}$$

The cross term can be discarded because the correlation $\rho_a$ between demosaicking and compression errors is small. Empirical tests show that $a_c^2 \approx 0.5$ for JPEG compression and $a_c^2 \approx 1$ for JPEG2000. Figure 14.7 shows that the model in Equation 14.25 provides an excellent fit to the experimental results. The values of the coefficient $a_c^2$ can be justified as follows. Demosaicking errors mostly occur near edges and hence contain a lot of high-frequency information. JPEG compression is poor at preserving these high frequencies and hence reduces the error variance of the preceding demosaicking step ($a_c^2 < 1$). On the other hand, JPEG2000 better preserves high-frequency content, leading to $a_c^2 \approx 1$. However, when the compression ratio is very large, both techniques suppress high frequencies, and hence $a_c^2 \to 0$.

[FIGURE 14.7 The error model from Equation 14.25 for demosaicking followed by (a) JPEG and (b) JPEG2000 compression: predicted and actual $\Omega^2$ versus $\xi_s^2$. Used demosaicking methods: (left) Hamilton, and (right) AP.]

The model in Equation 14.25 also explains the phenomenon in Table 14.4, namely that JPEG compression following bilinear demosaicking reduced the total error. For this combination, $a_c^2 \approx 0.5$ and $\zeta^2$ is large (due to the poor performance of bilinear demosaicking).
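The coefficient $a_c^2$ in Equation 14.25 is reported as an empirical value, but the fitting procedure is not spelled out here. One simple possibility, assumed for this sketch, is an ordinary least-squares fit of the approximate form $\Omega^2 - \xi_s^2 \approx a_c^2\zeta^2$ through the origin over a set of measured images. The function name `fit_ac2` and the synthetic data (generated with $a_c^2 = 0.5$) are hypothetical.

```python
import numpy as np


def fit_ac2(omega2, xi_s2, zeta2):
    """Least-squares a_c^2 in Omega^2 ≈ xi_s^2 + a_c^2 * zeta^2 (Eq. 14.25, cross term dropped)."""
    omega2, xi_s2, zeta2 = (np.asarray(v, dtype=float) for v in (omega2, xi_s2, zeta2))
    residual = omega2 - xi_s2  # part of the total error attributed to demosaicking
    # One-parameter regression through the origin: minimizes sum((residual - a * zeta2) ** 2).
    return float(np.dot(zeta2, residual) / np.dot(zeta2, zeta2))


# Synthetic "measurements" for 24 images, generated with a_c^2 = 0.5 plus small noise.
rng = np.random.default_rng(2)
zeta2 = rng.uniform(0.005, 0.03, size=24)   # demosaicking error variances
xi_s2 = rng.uniform(0.001, 0.02, size=24)   # compression error variances
omega2 = xi_s2 + 0.5 * zeta2 + rng.normal(0.0, 1e-4, size=24)
print(f"estimated a_c^2 = {fit_ac2(omega2, xi_s2, zeta2):.3f}")
```

A fit through the origin matches the structure of the model, which has no constant term once the cross term is dropped.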
For JPEG following bilinear demosaicking, it is therefore possible that

$$\Omega^2 \approx \xi_s^2 + 0.5\,\zeta^2 < \zeta^2 \tag{14.26}$$

that is, whenever $\xi_s^2 < 0.5\,\zeta^2$, which holds at least when the compression error is small (low compression ratios). The same phenomenon does not appear with JPEG2000, even at low compression ratios (Table 14.4), since $a_c^2 \approx 1$ gives $\Omega^2 \approx \xi_s^2 + \zeta^2 \ge \zeta^2$ [48]. If compression precedes demosaicking, then from Equation 14.9 we have [48]:

$$\Psi^2 = \zeta^2 + b_d^2\xi_o^2 + 2\rho_b b_d \zeta \xi_o \approx \zeta^2 + \xi_o^2 \tag{14.27}$$

The cross term can be discarded because the correlation $\rho_b$ is small. It was also found that $b_d^2 \approx 1$ for all tested combinations of demosaicking and compression. Figure 14.8 shows the excellent fit of the model in Equation 14.27 to the experimental results.

[FIGURE 14.8 The error model from Equation 14.27 for (a) JPEG and (b) JPEG2000 compression followed by demosaicking: predicted and actual $\Psi^2$ versus $\xi_o^2$. Used demosaicking methods: (left) Hamilton, and (right) AP.]

14.7 Performance Evaluation of Digital Still Camera Processing Pipelines

In this section we show how the earlier results on modelling of DSC processing elements (Section 14.5) and their interactions (Section 14.6) can be applied to the practical problem of comparing the conventional and alternative processing pipelines. Such a comparison is needed because the alternative chain promises to offload some processing operations, such as demosaicking and color adjustments, from the camera to the end device, thus simplifying the DSC and making it more cost- and power-efficient. However, this is justifiable only if the alternative processing does not degrade image quality.
For simplicity, we assume that gamma correction is applied directly to the CFA data prior to demosaicking and compression, and that color correction can be neglected for both the conventional and alternative chains. Hence, for the comparison we can consider the simplified processing pipelines shown in Figure 14.9.

[FIGURE 14.9 Simplified block diagrams of DSC image processing chains: (a) conventional: color-adjusted CFA data → demosaicking → compression → store/transmit → decompression → output image; (b) alternative: color-adjusted CFA data → compression → store/transmit → decompression → demosaicking → output image.]

Based on the models in Equations 14.25 and 14.27, the difference between the total error variances of the two chains is

$$\Omega^2 - \Psi^2 = (\xi_s^2 - \xi_o^2) + (a_c^2 - 1)\,\zeta^2 \tag{14.28}$$

At high compression ratios, the compression error is larger than the demosaicking error, and the relative performance depends mainly on $\xi_s^2 - \xi_o^2$. At low compression ratios, $(\xi_s^2 - \xi_o^2) < \zeta^2$, yet both terms must be retained, because $a_c^2$ can be close to one [48], in which case the demosaicking term cancels. The results of a qualitative analysis of Equation 14.28 are summarized in Table 14.6. Here we split demosaicking approaches into simple (e.g., bilinear; large error) and advanced (the majority of recently proposed methods; small error). As the table shows, the alternative chain can outperform the conventional one at low compression ratios, except when simple demosaicking is combined with JPEG compression. The qualitative predictions in Table 14.6 agree very well with our experimental results shown in Figure 14.10 and Figure 14.11.
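Equation 14.28 turns the comparison into a sign test: the alternative chain wins whenever $\Omega^2 - \Psi^2 > 0$. The sketch below encodes this test. The $a_c^2$ values are the ones quoted in Section 14.6.2, while the error variances for the two bitrate regimes and the two classes of demosaicking are hypothetical numbers chosen only to illustrate the qualitative regimes, not measurements; the function names are ours.

```python
def error_difference(xi_s2, xi_o2, zeta2, a_c2):
    """Omega^2 - Psi^2 from Eq. 14.28 (cross terms dropped, b_d^2 = 1)."""
    return (xi_s2 - xi_o2) + (a_c2 - 1.0) * zeta2


def preferred_chain(xi_s2, xi_o2, zeta2, a_c2):
    """'A' if the alternative chain has the smaller total error (Psi^2 < Omega^2), else 'C'."""
    return "A" if error_difference(xi_s2, xi_o2, zeta2, a_c2) > 0 else "C"


# Hypothetical compression-error variances (xi_s^2, xi_o^2) for two bitrate regimes.
regimes = {"high bitrate": (0.006, 0.002), "low bitrate": (0.02, 0.04)}
codecs = {"JPEG": 0.5, "JPEG2000": 1.0}             # a_c^2 values from Section 14.6.2
demosaicking = {"simple": 0.03, "advanced": 0.005}  # hypothetical zeta^2 values

for regime, (xi_s2, xi_o2) in regimes.items():
    for codec, a_c2 in codecs.items():
        winners = {name: preferred_chain(xi_s2, xi_o2, zeta2, a_c2)
                   for name, zeta2 in demosaicking.items()}
        print(f"{regime:12s} {codec:8s} {winners}")
```

With these values the loop reports C (simple) and A (advanced) for JPEG and A for both JPEG2000 cases at high bitrates, and C for every combination at low bitrates, the same pattern as Table 14.6.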
In Figures 14.10 and 14.11 we compare the signal-to-noise ratio (SNR) of the conventional and alternative processing chains, where the alternative chain applies simple grayscale compression to the CFA images. As expected, the alternative processing chain outperforms the conventional one at low compression ratios (high bitrates) for JPEG2000 compression. The same is true for JPEG compression combined with the AP demosaicking method. However, for the combination of JPEG compression and bilinear demosaicking, the conventional chain performs better at all tested compression ratios, as shown in Figure 14.10 and Figure 14.11.

TABLE 14.6 Superiority of processing chains (C: conventional, A: alternative).

Bitrate   Compression   Simple demosaicking   Advanced demosaicking
High      JPEG          C                     A
High      JPEG2000      A                     A
Low       JPEG          C                     C
Low       JPEG2000      C                     C

[FIGURE 14.10 Performance comparison between conventional and alternative processing chains using JPEG compression: (a) actual and (b) predicted performance; (left) bilinear and (right) AP demosaicking. Axes: SNR versus bitrate (0 to 5 bpp); curves: exact $\Omega^2$ and $\Psi_o^2$ in (a), predicted $\Omega^2$, $\Psi_o^2$ and $\Psi_t^2$ in (b).]

The relative performance model in Equation 14.28 can also predict the break-point compression ratio $t_b$ (or break-point bitrate $R_b = 24/t_b$) at which the two processing chains have equal performance [48]. As shown in Figure 14.10, the actual break-point bitrate for the JPEG-AP combination is $R_b \approx 4.5$ bpp, while the predicted value is $R_b \approx 4.0$ bpp. For JPEG2000, both experimental and predicted results yield $R_b \approx 2.5$ bpp (Figure 14.11). The prediction can be extended to more advanced (color-transformation-based) CFA image compression methods by using Equation 14.19.
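Whether the curves come from measurements or from the model in Equation 14.28, locating the break-point amounts to finding where the two SNR curves cross. A minimal sketch, assuming SNR values sampled on a common bitrate grid and linear interpolation between samples, is shown below together with the conversion $t_b = 24/R_b$ for 24-bit full-color output. The SNR curves are synthetic straight lines, and the function names are hypothetical.

```python
import numpy as np


def break_point_bitrate(bitrate, snr_conventional, snr_alternative):
    """Bitrate at which two sampled SNR curves cross, by linear interpolation.

    Returns None if the curves do not cross within the sampled range.
    """
    r = np.asarray(bitrate, dtype=float)
    d = np.asarray(snr_alternative, dtype=float) - np.asarray(snr_conventional, dtype=float)
    # First index i where the sign of d differs between samples i and i + 1.
    idx = np.nonzero(np.sign(d[:-1]) != np.sign(d[1:]))[0]
    if idx.size == 0:
        return None
    i = idx[0]
    # Zero of the straight line through (r[i], d[i]) and (r[i + 1], d[i + 1]).
    return float(r[i] - d[i] * (r[i + 1] - r[i]) / (d[i + 1] - d[i]))


def compression_ratio(bitrate_bpp, uncompressed_bpp=24.0):
    """t = 24 / R, i.e. compression ratio relative to 24-bit full-color output."""
    return uncompressed_bpp / bitrate_bpp


# Synthetic SNR curves (dB) that cross at 2.4 bpp; illustration only.
bitrate = np.arange(1.0, 6.0)  # 1, 2, 3, 4, 5 bpp
snr_conv = 9.0 + 2.0 * bitrate
snr_alt = 6.6 + 3.0 * bitrate
rb = break_point_bitrate(bitrate, snr_conv, snr_alt)
print(f"R_b = {rb:.2f} bpp, t_b = {compression_ratio(rb):.1f}:1")  # R_b = 2.40 bpp, t_b = 10.0:1
```

For instance, a typical DSC compression ratio of 10:1 corresponds to $R = 24/10 = 2.4$ bpp.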
The curves obtained with color-transformation-based CFA compression, designated $\Psi_t^2$ in Figure 14.10 and Figure 14.11, show that the break-points then occur at lower bitrates. This happens because the color transformation removes the discontinuities that the interlaced color components cause in CFA images, leading to a larger coding gain $G_t > G_o$. Since the image size remains unchanged, the larger coding gain yields a smaller compression error $\xi_t^2 < \xi_o^2$, moving the break-points to lower bitrates [48]. Figure 14.12 shows an example obtained using a CFA image from an actual DSC, the Canon EOS300D. Apart from some blurring due to the simple (bilinear) demosaicking method, the images reconstructed by the two processing chains have similar quality at a typical DSC compression ratio of 10:1.
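The effect of a larger coding gain on the break-point can be made concrete with the high-bitrate treatment of Equation 14.20, assuming it has the form $\xi_s^2/\xi_o^2 = 2^{2\beta_r R}\,G_o/G_s$ with $\beta_r$ held constant. Then $\log_2(\xi_s^2/\xi_o^2) = 2\beta_r R + \log_2(G_o/G_s)$ is a straight line whose zero is $R_b = \log_2(G_s/G_o)/(2\beta_r)$. The sketch below fits that line with `numpy.polyfit` and then multiplies the alternative chain's coding gain by a factor $G_t/G_o$, assuming $\beta_r$ is unchanged. It considers compression errors only, and the parameter values ($\beta_r = 0.3$, $G_o/G_s = 0.25$, $G_t/G_o = 2$) and function names are hypothetical.

```python
import numpy as np


def fit_break_point(bitrate, xi_s2, xi_o2):
    """Fit log2(xi_s^2 / xi_o^2) = 2*beta*R + log2(G_o / G_s) with beta constant.

    Returns (beta, G_o/G_s, R_b), where R_b is the bitrate at which the ratio equals 1.
    """
    R = np.asarray(bitrate, dtype=float)
    log_ratio = np.log2(np.asarray(xi_s2, dtype=float) / np.asarray(xi_o2, dtype=float))
    slope, intercept = np.polyfit(R, log_ratio, 1)
    beta = slope / 2.0
    gain_ratio = 2.0 ** intercept   # estimate of G_o / G_s
    R_b = -intercept / slope        # zero crossing of the fitted line
    return beta, gain_ratio, R_b


def shifted_break_point(beta, gain_ratio, gain_boost):
    """Break-point after the alternative chain's coding gain grows by gain_boost (e.g. G_t / G_o)."""
    return -np.log2(gain_ratio * gain_boost) / (2.0 * beta)


# Synthetic high-bitrate data with beta = 0.3 and G_o / G_s = 0.25 (hypothetical values).
R = np.linspace(1.5, 5.0, 8)
xi_o2 = 0.05 * 2.0 ** (-2.0 * R)
xi_s2 = xi_o2 * 2.0 ** (2.0 * 0.3 * R) * 0.25

beta, gain_ratio, R_b = fit_break_point(R, xi_s2, xi_o2)
R_b_t = shifted_break_point(beta, gain_ratio, gain_boost=2.0)  # hypothetical G_t / G_o = 2
print(f"beta = {beta:.2f}, R_b = {R_b:.2f} bpp, R_b with G_t = 2 G_o: {R_b_t:.2f} bpp")
```

With these values the fitted break-point is about 3.33 bpp, and doubling the coding gain moves it down to about 1.67 bpp, a shift of $\log_2(G_t/G_o)/(2\beta_r)$.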