
Video Codec Design: Developing Image and Video Compression Systems




Tags: Video codec


Document summary

Video compression coding is the enabling technology behind a new wave of communication applications. From streaming internet video to broadcast digital television and digital cinema, the video codec is a key building block for a host of new multimedia applications and services. Video Codec Design sets out to de-mystify the subject of video coding and present a practical, design-based approach to this emerging field.

Document preview

Video Codec Design: Developing Image and Video Compression Systems
Iain E. G. Richardson
The Robert Gordon University, Aberdeen, UK

John Wiley & Sons, Ltd
Copyright © 2002 John Wiley & Sons Ltd, Baffins Lane, Chichester, West Sussex PO19 1UD, England
ISBNs: 0-471-48553-5 (Hardback); 0-470-84783-2 (Electronic)

To Freya and Hugh

All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1P 0LP, UK, without the permission in writing of the publisher.

Contents

1 Introduction
1.1 Image and Video Compression
1.2 Video CODEC Design
1.3 Structure of this Book

2 Digital Video
2.1 Introduction
2.2 Concepts, Capture and Display
2.2.1 The Video Image
2.2.2 Digital Video
2.2.3 Video Capture
2.2.4 Sampling
2.2.5 Display
2.3 Colour Spaces
2.3.1 RGB
2.3.2 YCrCb
2.4 The Human Visual System
2.5 Video Quality
2.5.1 Subjective Quality Measurement
2.5.2 Objective Quality Measurement
2.6 Standards for Representing Digital Video
2.7 Applications
2.7.1 Platforms
2.8 Summary
References

3 Image and Video Compression Fundamentals
3.1 Introduction
3.1.1 Do We Need Compression?
3.2 Image and Video Compression
3.2.1 DPCM (Differential Pulse Code Modulation)
3.2.2 Transform Coding
3.2.3 Motion-compensated Prediction
3.2.4 Model-based Coding
3.3 Image CODEC
3.3.1 Transform Coding
3.3.2 Quantisation
3.3.3 Entropy Coding
3.3.4 Decoding
3.4 Video CODEC
3.4.1 Frame Differencing
3.4.2 Motion-compensated Prediction
3.4.3 Transform, Quantisation and Entropy Encoding
3.4.4 Decoding
3.5 Summary

4 Video Coding Standards: JPEG and MPEG
4.1 Introduction
4.2 The International Standards Bodies
4.2.1 The Expert Groups
4.2.2 The Standardisation Process
4.2.3 Understanding and Using the Standards
4.3 JPEG (Joint Photographic Experts Group)
4.3.1 JPEG
4.3.2 Motion JPEG
4.3.3 JPEG-2000
4.4 MPEG (Moving Picture Experts Group)
4.4.1 MPEG-1
4.4.2 MPEG-2
4.4.3 MPEG-4
4.5 Summary
References

5 Video Coding Standards: H.261, H.263 and H.26L
5.1 Introduction
5.2 H.261
5.3 H.263
5.3.1 Features
5.4 The H.263 Optional Modes/H.263+
5.4.1 H.263 Profiles
5.5 H.26L
5.6 Performance of the Video Coding Standards
5.7 Summary
References

6 Motion Estimation and Compensation
6.1 Introduction
6.2 Motion Estimation and Compensation
6.2.1 Requirements for Motion Estimation and Compensation
6.2.2 Block Matching
6.2.3 Minimising Difference Energy
6.3 Full Search Motion Estimation
6.4 Fast Search
6.4.1 Three-Step Search (TSS)
6.4.2 Logarithmic Search
6.4.3 Cross Search
6.4.4 One-at-a-Time Search
6.4.5 Nearest Neighbours Search
6.4.6 Hierarchical Search
6.5 Comparison of Motion Estimation Algorithms
6.6 Sub-Pixel Motion Estimation
6.7 Choice of Reference Frames
6.7.1 Forward Prediction
6.7.2 Backwards Prediction
6.7.3 Bidirectional Prediction
6.7.4 Multiple Reference Frames
6.8 Enhancements to the Motion Model
6.8.1 Vectors That Can Point Outside the Reference Picture
6.8.2 Variable Block Sizes
6.8.3 Overlapped Block Motion Compensation (OBMC)
6.8.4 Complex Motion Models
6.9 Implementation
6.9.1 Software Implementations
6.9.2 Hardware Implementations
6.10 Summary
References

7 Transform Coding
7.1 Introduction
7.2 Discrete Cosine Transform
7.3 Discrete Wavelet Transform
7.4 Fast Algorithms for the DCT
7.4.1 Separable Transforms
7.4.2 Flowgraph Algorithms
7.4.3 Distributed Algorithms
7.4.4 Other DCT Algorithms
7.5 Implementing the DCT
7.5.1 Software DCT
7.5.2 Hardware DCT
7.6 Quantisation
7.6.1 Types of Quantiser
7.6.2 Quantiser Design
7.6.3 Quantiser Implementation
7.6.4 Vector Quantisation
7.7 Summary
References

8 Entropy Coding
8.1 Introduction
8.2 Data Symbols
8.2.1 Run-Level Coding
8.2.2 Other Symbols
8.3 Huffman Coding
8.3.1 'True' Huffman Coding
8.3.2 Modified Huffman Coding
8.3.3 Table Design
8.3.4 Entropy Coding Example
8.3.5 Variable Length Encoder Design
8.3.6 Variable Length Decoder Design
8.3.7 Dealing with Errors
8.4 Arithmetic Coding
8.4.1 Implementation Issues
8.5 Summary
References

9 Pre- and Post-processing
9.1 Introduction
9.2 Pre-filtering
9.2.1 Camera Noise
9.2.2 Camera Movement
9.3 Post-filtering
9.3.1 Image Distortion
9.3.2 De-blocking Filters
9.3.3 De-ringing Filters
9.3.4 Error Concealment Filters
9.4 Summary
References

10 Rate, Distortion and Complexity
10.1 Introduction
10.2 Bit Rate and Distortion
10.2.1 The Importance of Rate Control
10.2.2 Rate-Distortion Performance
10.2.3 The Rate-Distortion Problem
10.2.4 Practical Rate Control Methods
10.3 Computational Complexity
10.3.1 Computational Complexity and Video Quality
10.3.2 Variable Complexity Algorithms
10.3.3 Complexity-Rate Control
10.4 Summary
References

11 Transmission of Coded Video
11.1 Introduction
11.2 Quality of Service Requirements and Constraints
11.2.1 QoS Requirements for Coded Video
11.2.2 Practical QoS Performance
11.2.3 Effect of QoS Constraints on Coded Video
11.3 Design for Optimum QoS
11.3.1 Bit Rate
11.3.2 Error Resilience
11.3.3 Delay
11.4 Transmission Scenarios
11.4.1 Digital Television Broadcasting: MPEG-2 Systems/Transport
11.4.2 Packet Video: H.323 Multimedia Conferencing
11.5 Summary
References

12 Platforms
12.1 Introduction
12.2 General-purpose Processors
12.2.1 Capabilities
12.2.2 Multimedia Support
12.3 Digital Signal Processors
12.4 Embedded Processors
12.5 Media Processors
12.6 Video Signal Processors
12.7 Custom Hardware
12.8 Co-processors
12.9 Summary
References

13 Video CODEC Design
13.1 Introduction
13.2 Video CODEC Interface
13.2.1 Video In/Out
13.2.2 Coded Data In/Out
13.2.3 Control Parameters
13.2.4 Status Parameters
13.3 Design of a Software CODEC
13.3.1 Design Goals
13.3.2 Specification and Partitioning
13.3.3 Designing the Functional Blocks
13.3.4 Improving Performance
13.3.5 Testing
13.4 Design of a Hardware CODEC
13.4.1 Design Goals
13.4.2 Specification and Partitioning
13.4.3 Designing the Functional Blocks
13.4.4 Testing
13.5 Summary
References

14 Future Developments
14.1 Introduction
14.2 Standards Evolution
14.3 Video Coding Research
14.4 Platform Trends
14.5 Application Trends
14.6 Video CODEC Design
References

Bibliography
Glossary
Index
1 Introduction

1.1 IMAGE AND VIDEO COMPRESSION

The subject of this book is the compression ('coding') of digital images and video. Within the last 5-10 years, image and video coding have gone from being relatively esoteric research subjects with few 'real' applications to become key technologies for a wide range of mass-market applications, from personal computers to television.

Like many other recent technological developments, the emergence of video and image coding in the mass market is due to convergence of a number of areas. Cheap and powerful processors, fast network access, the ubiquitous Internet and a large-scale research and standardisation effort have all contributed to the development of image and video coding technologies. Coding has enabled a host of new 'multimedia' applications including digital television, digital versatile disk (DVD) movies, streaming Internet video, home digital photography and video conferencing.

Compression coding bridges a crucial gap in each of these applications: the gap between the user's demands (high-quality still and moving images, delivered quickly at a reasonable cost) and the limited capabilities of transmission networks and storage devices. For example, a 'television-quality' digital video signal requires 216 Mbits of storage or transmission capacity for one second of video. Transmission of this type of signal in real time is beyond the capabilities of most present-day communications networks. A 2-hour movie (uncompressed) requires over 194 Gbytes of storage, equivalent to 42 DVDs or 304 CD-ROMs.
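The figures quoted above can be checked with a little arithmetic. The sketch below is a minimal illustration, not code from the book, and it assumes ITU-R BT.601 'television-quality' sampling (luminance at 13.5 MHz plus two chrominance components at 6.75 MHz each, 8 bits per sample), a 4.7 Gbyte DVD and a 640 Mbyte CD-ROM; those capacity figures are my own assumptions.

/* Rough check of the uncompressed 'television-quality' figures quoted above.
 * Assumptions (mine, not from the text): ITU-R BT.601 sampling at 13.5 MHz
 * (luma) plus 2 x 6.75 MHz (chroma), 8 bits/sample, 4.7 GB DVD, 640 MB CD. */
#include <stdio.h>

int main(void)
{
    double luma_rate   = 13.5e6;              /* luminance samples per second */
    double chroma_rate = 2.0 * 6.75e6;        /* two chrominance components   */
    double bits_per_s  = (luma_rate + chroma_rate) * 8.0;   /* 8 bits/sample  */

    double movie_seconds = 2.0 * 3600.0;                     /* 2-hour movie  */
    double movie_bytes   = bits_per_s / 8.0 * movie_seconds;

    printf("One second of video : %.0f Mbits\n", bits_per_s / 1e6);   /* ~216          */
    printf("Two-hour movie      : %.1f Gbytes\n", movie_bytes / 1e9); /* ~194.4        */
    printf("Equivalent DVDs     : %.0f\n", movie_bytes / 4.7e9);      /* ~41.4 (42 discs) */
    printf("Equivalent CD-ROMs  : %.0f\n", movie_bytes / 640e6);      /* ~304          */
    return 0;
}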
In order for digital video to become a plausible alternative to its analogue predecessors (analogue television or VHS videotape), it has been necessary to develop methods of reducing or compressing this prohibitively high bit-rate signal. The drive to solve this problem has taken several decades and massive efforts in research, development and standardisation (and work continues to improve existing methods and develop new coding paradigms). However, efficient compression methods are now a firmly established component of the new digital media technologies such as digital television and DVD-video.

A welcome side effect of these developments is that video and image compression has enabled many novel visual communication applications that would not have previously been possible. Some areas have taken off more quickly than others (for example, the long-predicted boom in video conferencing has yet to appear), but there is no doubt that visual compression is here to stay. Every new PC has a number of designed-in features specifically to support and accelerate video compression algorithms. Most developed nations have a timetable for stopping the transmission of analogue television, after which all television receivers will need compression technology to decode and display TV images. VHS videotapes are finally being replaced by DVDs which can be played back on DVD players or on PCs. The heart of all of these applications is the video compressor and decompressor, or enCOder/DECoder: the video CODEC.

1.2 VIDEO CODEC DESIGN

Video CODEC technology has in the past been something of a 'black art' known only to a small community of academics and technical experts, partly because of the lack of approachable, practical literature on the subject. One view of image and video coding is as a mathematical process. The video coding field poses a number of interesting mathematical problems and this means that much of the literature on the subject is, of necessity, highly mathematical. Such a treatment is important for developing the fundamental concepts of compression but can be bewildering for an engineer or developer who wants to put compression into practice. The increasing prevalence of digital video applications has led to the publication of more approachable texts on the subject: unfortunately, some of these offer at best a superficial treatment of the issues, which can be equally unhelpful.

This book aims to fill a gap in the market between theoretical and over-simplified texts on video coding. It is written primarily from a design and implementation perspective. Much work has been done over the last two decades in developing a portfolio of practical techniques and approaches to video compression coding, as well as a large body of theoretical research. A grasp of these design techniques, trade-offs and performance issues is important to anyone who needs to design, specify or interface to video CODECs. This book emphasises these practical considerations rather than rigorous mathematical theory and concentrates on the current generation of video coding systems, embodied by the MPEG-2, MPEG-4 and H.263 standards. By presenting the practicalities of video CODEC design in an approachable way it is hoped that this book will help to demystify this important technology.

1.3 STRUCTURE OF THIS BOOK

The book is organised in three main sections (Figure 1.1). We deal first with the fundamental concepts of digital video, image and video compression and the main international standards for video coding (Chapters 2-5). The second section (Chapters 6-9) covers the key components of video CODECs in some detail. Finally, Chapters 10-14 discuss system design issues and present some design case studies.

Chapter 2, 'Digital Video', explains the concepts of video capture, representation and display; discusses the way in which we perceive visual information; compares methods for measuring and evaluating visual 'quality'; and lists some applications of digital video.

Chapter 3, 'Image and Video Compression Fundamentals', examines the requirements for video and image compression and describes the components of a 'generic' image CODEC and video CODEC. (Note: this chapter deliberately avoids discussing technical or standard-specific details of image and video compression.)

Chapter 4, 'JPEG and MPEG', describes the operation of the international standards bodies and introduces the ISO image and video compression standards: JPEG, Motion JPEG and JPEG-2000 for images and MPEG-1, MPEG-2 and MPEG-4 for moving video.

Figure 1.1 Structure of the book (Section 1: Fundamental Concepts, Chapters 2-5; Section 2: Component Design, Chapters 6-9; Section 3: System Design, Chapters 10-14)
Chapter 5, 'H.261, H.263 and H.26L', explains the concepts of the ITU-T video coding standards H.261 and H.263 and the emerging H.26L. The chapter ends with a comparison of the performance of the main image and video coding standards.

Chapter 6, 'Motion Estimation and Compensation', deals with the 'front end' of a video CODEC. The requirements and goals of motion-compensated prediction are explained and the chapter discusses a number of practical approaches to motion estimation in software or hardware designs.

Chapter 7, 'Transform Coding', concentrates mainly on the popular discrete cosine transform. The theory behind the DCT is introduced and practical algorithms for calculating the forward and inverse DCT are described. The discrete wavelet transform (an increasingly popular alternative to the DCT) and the process of quantisation (closely linked to transform coding) are discussed.

Chapter 8, 'Entropy Coding', explains the statistical compression process that forms the final step in a video encoder; shows how Huffman code tables are designed and used; introduces arithmetic coding; and describes practical entropy encoder and decoder designs.

Chapter 9, 'Pre- and Post-processing', addresses the important issue of input and output processing; shows how pre-filtering can improve compression performance; and examines a number of post-filtering techniques, from simple de-blocking filters to computationally complex, high-performance algorithms.

Chapter 10, 'Rate, Distortion and Complexity', discusses the relationships between compressed bit rate, visual distortion and computational complexity in a 'lossy' video CODEC; describes rate control algorithms for different transmission environments; and introduces the emerging techniques of variable-complexity coding that allow the designer to trade computational complexity against visual quality.

Chapter 11, 'Transmission of Coded Video', addresses the influence of the transmission environment on video CODEC design; discusses the quality of service required by a video CODEC and provided by typical transport scenarios; and examines ways in which quality of service can be 'matched' between the CODEC and the network to maximise visual quality.

Chapter 12, 'Platforms', describes a number of alternative platforms for implementing practical video CODECs, ranging from general-purpose PC processors to custom-designed hardware platforms.

Chapter 13, 'Video CODEC Design', brings together a number of the themes discussed in previous chapters and discusses how they influence the design of video CODECs; examines the interfaces between a video CODEC and other system components; and presents two design studies, a software CODEC and a hardware CODEC.

Chapter 14, 'Future Developments', summarises some of the recent work in research and development that will influence the next generation of video CODECs.

Each chapter includes references to papers and websites that are relevant to the topic. The bibliography lists a number of books that may be useful for further reading and a companion web site to the book may be found at: http://www.vcodex.com/videocodecdesign/

2 Digital Video

2.1 INTRODUCTION

Digital video is now an integral part of many aspects of business, education and entertainment, from digital TV to web-based video news.
Before examining methods for compressing and transporting digital video, it is necessary to establish the concepts and terminology relating to video in the digital domain. Digital video is visual information represented in a discrete form, suitable for digital electronic storage and/or transmission. In this chapter we describe and define the concept of digital video: essentially a sampled two-dimensional (2-D) version of a continuous three-dimensional (3-D) scene. Dealing with colour video requires us to choose a colour space (a system for representing colour) and we discuss two widely used colour spaces, RGB and YCrCb. The goal of a video coding system is to support video communications with an 'acceptable' visual quality: this depends on the viewer's perception of visual information, which in turn is governed by the behaviour of the human visual system. Measuring and quantifying visual quality is a difficult problem and we describe some alternative approaches, from time-consuming subjective tests to automatic objective tests (with varying degrees of accuracy).

2.2 CONCEPTS, CAPTURE AND DISPLAY

2.2.1 The Video Image

A video image is a projection of a 3-D scene onto a 2-D plane (Figure 2.1). A 3-D scene consisting of a number of objects each with depth, texture and illumination is projected onto a plane to form a 2-D representation of the scene. The 2-D representation contains varying texture and illumination but no depth information. A still image is a 'snapshot' of the 2-D representation at a particular instant in time whereas a video sequence represents the scene over a period of time.

Figure 2.1 Projection of 3-D scene onto a video image

2.2.2 Digital Video

A 'real' visual scene is continuous both spatially and temporally. In order to represent and process a visual scene digitally it is necessary to sample the real scene spatially (typically on a rectangular grid in the video image plane) and temporally (typically as a series of 'still' images or frames sampled at regular intervals in time) as shown in Figure 2.2. Digital video is the representation of a spatio-temporally sampled video scene in digital form. Each spatio-temporal sample (described as a picture element or pixel) is represented digitally as one or more numbers that describe the brightness (luminance) and colour of the sample.

Figure 2.2 Spatial and temporal sampling

A digital video system is shown in Figure 2.3. At the input to the system, a 'real' visual scene is captured, typically with a camera, and converted to a sampled digital representation. This digital video signal may then be handled in the digital domain in a number of ways, including processing, storage and transmission. At the output of the system, the digital video signal is displayed to a viewer by reproducing the 2-D video image (or video sequence) on a 2-D display.

Figure 2.3 Digital video system: capture, processing and display

2.2.3 Video Capture

Video is captured using a camera or a system of cameras. Most current digital video systems use 2-D video, captured with a single camera. The camera focuses a 2-D projection of the video scene onto a sensor, such as an array of charge coupled devices (CCD array). In the case of colour image capture, each colour component (see Section 2.3) is filtered and projected onto a separate CCD array.
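As a purely illustrative picture of what 'one or more numbers per sample' means in practice, the sketch below shows one way a captured frame might be held in memory, with a separate 2-D array of 8-bit samples for each colour component. The structure and names are my own illustration, not taken from the book.

/* Illustrative only: one possible in-memory layout for a sampled frame,
 * with one plane of 8-bit samples per colour component. Error handling
 * is kept to a minimum for clarity. */
#include <stdlib.h>
#include <stdint.h>

typedef struct {
    int      width;      /* samples per line                    */
    int      height;     /* lines per frame                     */
    uint8_t *plane[3];   /* e.g. R, G, B (or Y, Cr, Cb) planes  */
} Frame;

Frame *frame_alloc(int width, int height)
{
    Frame *f = malloc(sizeof(*f));
    if (!f) return NULL;
    f->width  = width;
    f->height = height;
    for (int c = 0; c < 3; c++)
        f->plane[c] = calloc((size_t)width * height, 1);  /* one byte per sample */
    return f;
}

/* Sample (x, y) of component c: each spatio-temporal sample is just a number. */
uint8_t frame_sample(const Frame *f, int c, int x, int y)
{
    return f->plane[c][(size_t)y * f->width + x];
}

A real CODEC would typically also carry chrominance planes at reduced resolution (see Section 2.3.2), timestamps and line strides, but the basic idea is the same.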
Figure 2.4 shows a two-camera system that captures two 2-D projections of the scene, taken from different viewing angles. This provides a stereoscopic representation of the scene: the two images, when viewed in the left and right eye of the viewer, give an appearance of 'depth' to the scene. There is an increasing interest in the use of 3-D digital video, where the video signal is represented and processed in three dimensions. This requires the capture system to provide depth information as well as brightness and colour, and this may be obtained in a number of ways. Stereoscopic images can be processed to extract approximate depth information and form a 3-D representation of the scene: other methods of obtaining depth information include processing of multiple images from a single camera (where either the camera or the objects in the scene are moving) and the use of laser 'striping' to obtain depth maps. In this book we will concentrate on 2-D video systems.

Figure 2.4 Stereoscopic camera system

Generating a digital representation of a video scene can be considered in two stages: acquisition (converting a projection of the scene into an electrical signal, for example via a CCD array) and digitisation (sampling the projection spatially and temporally and converting each sample to a number or set of numbers). Digitisation may be carried out using a separate device or board (e.g. a video capture card in a PC): increasingly, the digitisation process is becoming integrated with cameras so that the output of a camera is a signal in sampled digital form.

2.2.4 Sampling

A digital image may be generated by sampling an analogue video signal (i.e. a varying electrical signal that represents a video image) at regular intervals. The result is a sampled version of the image: the sampled image is only defined at a series of regularly spaced sampling points. The most common format for a sampled image is a rectangle (often with width larger than height) with the sampling points positioned on a square grid (Figure 2.5). The visual quality of the image is influenced by the number of sampling points. More sampling points (a higher sampling resolution) give a 'finer' representation of the image: however, more sampling points require higher storage capacity. Table 2.1 lists some commonly used image resolutions and gives an approximately equivalent analogue video quality: VHS video, broadcast TV and high-definition TV.

Figure 2.5 Spatial sampling (square grid)

A moving video image is formed by sampling the video signal temporally, taking a rectangular 'snapshot' of the signal at periodic time intervals. Playing back the series of frames produces the illusion of motion. A higher temporal sampling rate (frame rate) gives a 'smoother' appearance to motion in the video scene but requires more samples to be captured and stored (see Table 2.2).
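A quick way to see how spatial and temporal sampling interact is to multiply the two rates out. The sketch below is my own illustration (the resolutions are those of Table 2.1, and 8 bits per sample is assumed): it prints the number of sampling points per frame and the raw single-component data rate at a few of the frame rates listed in Table 2.2.

/* Illustrative only: sampling points per frame (resolutions of Table 2.1) and
 * raw luminance data rates at typical frame rates, assuming 8 bits per sample. */
#include <stdio.h>

int main(void)
{
    int widths[]  = { 352, 704, 1440 };
    int heights[] = { 288, 576, 1152 };
    int rates[]   = { 10, 25, 50 };        /* frames per second */

    for (int i = 0; i < 3; i++) {
        long points = (long)widths[i] * heights[i];
        printf("%4d x %4d : %8ld sampling points per frame\n",
               widths[i], heights[i], points);
        for (int j = 0; j < 3; j++)
            printf("    at %2d frames/s : %7.1f Mbit/s (luminance only)\n",
                   rates[j], points * 8.0 * rates[j] / 1e6);
    }
    return 0;
}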
Frame rates below 10 frames per second are sometimes Table 2.1 Typical video image resolutions Image resolution Number of sampling points 352 x 288 704 x 576 1440 x 1152 101 376 405 504 1313 280 Analogue video ‘equivalent’ VHS video Broadcast television High-definition television Table 2.2 Video frame rates rate Video frame Below 10 frames per second 10-20 frames per second 20-30 frames per second SO-60 frames per second Appearance ‘Jerky’, unnatural appearance to movement Slow movements appear OK; rapid movement is clearly ‘jerky’ Movement is reasonably smooth Movement is very smooth CONCEPTS, CAPTURE AND DISPLAY 9 r 4, U Complete frame Upper field Figure 2.6 Interlacedfields Lower field used for very low bit-rate video communications (because the amount of data is relatively small): however, motion is clearly jerky andunnatural at this rate. Between 10 and 20 frames per second is more typical for low bit-rate videocommunications; 25 or 30 frames per second isstandard for television pictures (together with the use of interlacing, see below); 50 or 60 frames per second is appropriate forhigh-quality video (at the expenseof a very high data rate). The visual appearance of a temporally sampled video sequence canbe improved by using interlaced video, commonly used for broadcast-quality television signals. For example, the European PAL video standard operates at a temporal frame rate of 25 Hz (i.e. 25 complete frames of video per second). However, in order to improve the visual appearance without increasing the data rate, the video sequence is composed offields at a rate of 50 Hz (50 fields per second). Each field contains half of the lines that make up a complete frame (Figure2.6): theodd-and even-numbered lines fromtheframeon the left are placed in two separate fields, each containing half the information of a complete frame. These fields are captured and displayed at M o t h of a second intervals and the result is an update rate of 50 Hz, with the data rate of a signal at 25 Hz. Video that is captured and displayed in this way is known as interlaced video and generally has a more pleasing visual appearance than video transmitted as completeframes (non-interlaced or progressive video). Interlaced video can, however, produce unpleasant visual artefacts when displaying certain textures or types of motion. 2.2.5 Display Displaying a 2-D video signal involves recreating each frame of video on a 2-D display device. The most common type of display is the cathoderay tube (CRT) in which the image 10 DIGITAL VIDEO Phosphor coating Figure 2.7CRT display is formedby scanning a modulated beamof electrons across a phosphorescent screen (Figure 2.7). CRT technology is mature and reasonably cheap to produce. However, a CRT suffers from the requirement to provide a sufficiently long path for the electron beam (making the devicebulky)and the weight of the vacuum tube.Liquidcrystaldisplays (LCDs) are becoming a popular alternative to the CRT for computer applications but are not as bright; other alternatives such as flat-panel plasma displays are beginning to emerge but are noytet available at competitive prices. 2.3 COLOUR SPACES A monochrome (‘grey scale’) video image may be represented using just one number per spatio-temporal sample. This number indicates the brightness or luminanceof each sample position:conventionally, a largernumberindicatesabrightersample. 
If asample i s representedusing n bits, then avalue of 0 mayrepresentblackandavalue of (2” - I ) may represent white, with other values in between describing shades of grey. Luminance is commonlyrepresented with 8 bits per samplefor‘general-purpose’videoapplications. Higherluminance‘depths’ (e.g. 12 bits or more per sample)aresometimes used for specialist applications (such as digitising of X-ray slides). Representing colour requires multiple numbers per sample. There are several alternative systems for representing colour, eachof which is known as a colour space. We will concen- trate here on twoof the most common colour spaces for digital imageand video representa- tion: RGB (redgreenblue) and YCrCb (luminancehed chrominancehlue chrominance). COLOUR SPACES 11 2.3.1 RGB In the redgreedblue colour space, each pixel is representbeyd three numbers indicating the relative proportions of red, green and blue. These are the three additive primary colours of light: any colour may be reproduced by combining varying proportions of red, green and blue light. Because the three components have roughly equal importantocethe final colour, RGB systems usually represent each component with the same precision (and hence the same number of bits). Using 8 bits per component is quite common: 3 x 8 = 24 bits are required to represent each pixel. Figur2e.8 shows an image (originally colour, but displayed here in monochrome!) and the brightness ‘maposf’each of its three colour components. The girl’s cap is a bright pink colour: this appears bright in the red component and slightly less bright in the blue component. 1 t J I (b) Figure 2.8 (a) Image, (b) R, ( c ) G, (d) B components 12 DIGITAL VIDEO ~~ I Figure 2.8 (Continued) 2.3.2 Y CrCb RGB is not necessarily the most efficient representatioofncolour. The human visual system (HVS, see Section 2.4) is less sensitive to colour than to luminance (brightness): however, the RGB colour space does not providaen easy way to take advantageof this since the three colours are equally important and the luminance is present in all three colour components. It is possible to represenat colour image more efficientblyy separating the luminance from the colour information. A popular colour space of this type is Y: Cr :Cb. Y is the luminance component, i.e. a monochrome version of the colour image. Y is a weighted average of R, G and B: COLOUR SPACES 13 where k areweightingfactors.Thecolourinformationcan be representedas colour difference or chrominance components, where each chrominance component is the differ- ence between R, G or B and the luminance Y: Cr=R-Y Cb=B-Y Cg=G-Y Thecompletedescriptionisgiven by Y (theluminancecomponent)andthreecolour differences Cr, Cb and Cg that represent the ‘variation’ betwetheencolour intensity and the ‘background’ luminance of the image. So far, this representation has little obvious merit: we now have four components rather + + than three. However, it turns otuhtat the value of Cr Cb Cg isa constant. This meansthat only two of the three chrominance componentsneed to be transmitted: the third component can alwaysbe found from the othetrwo. In the Y : Cr :Cb space,only the luminance (Y) and red andbluechrominance (Cr, Cb)aretransmitted.Figure 2.9 shows the effect of this operation on thecolourimage.Thetwochrominancecomponents only have significant values wherethereis a significant ‘presence’ or ‘absence’ of theappropriatecolour(for example, the pink hat appears as an area of relative brightness in the red chrominance). 
The equations for converting an RGB image into the Y: Cr :Cb colour space and vice versa are given in Equations 2.1 and 2.2. Note that G can be extracted from the Y: Cr :Cb representation by subtracting Cr and Cb from Y. + + Y = 0.299 R 0.587 G 0.1 14B Cb = 0.564 (B- Yj Cr = 0.713 (R - Yj + R = Y 1.402Cr G = Y - 0.344Cb - 0.714Cr + B = Y 1.772Cb The key advantage of Y: Cr :Cb over RGB is that theCrandCbcomponents may be represented with a lower resolution than Y because the HVS is less sensitive to colour than luminanceT. hisreducestheamount of datarequiredtorepresenthechrominance components without having an obviouseffecton visual quality: to thecasualobserver, there is no apparent difference betweenan RGB image and aY : Cr :Cb imagewith reduced chrominance resolution. Figure 2.10 shows three popular ‘patterns’ for sub-sampling Canr d Cb. 4 : 4 : 4 means that the three components (Y: Cr :Cb) have the same resolution and hence a sample of each component existsat every pixel position. (The numbers indicate the relative sampling raotef each componentin the horizontal direction, i.e. forevery 4 luminance samples there a4reCr and 4Cb samples.) 4 : 4 : 4samplingpreservesthefull fidelity of thechrominance components. In 4 : 2 : 2sampling,thechrominancecomponents have thesamevertical resolution but half the horizontal resolution (the numbers indicate thatevfeory 4 luminance 14 DIGITAL VIDEO c l (c) Figure2.9 (a)Luminance,(b) Cr, (c) Cb components samples in the horizontal direction thearree 2 Cr and 2 Cb samples) and the locationosf the samples are shown in the figure4. :2 :2 video is used for high-quality colour reproduction. 4 :2 :0 means thatCr and Cb each havehalf the horizontaland vertical resolutionof Y,as shown. The term ‘4 :2 :0’ is rather confusing: the numbers do not actually have a sensible interpretationandappeartohavebeenchosenhistoricallyasa‘code’toidentifythis COLOUR SPACES 0 0 15 00 0 0 0000 0 0 0 0 0 0 0000 4:4:4 4:2:0 0 Ysample Cr sample Cb sample Figure 2.10 Chrominancesubsamplingpatterns particular sampling pattern. 4 :2 :0 samplingispopularin‘massmarket’digitalvideo applications such as video conferencing, digital televisionand DVD storage. Because each colour difference component contains a quarteorf the samples of the Y component, 4:2 :0 video requires exactlyhalf as many samples as 4:4 :4 (or R :G :B) video. Example Image resolution: 720 x 576 pixels Y resolution: 720 x 576 samples, each represented with 8 bits 4 :4 :4 Cr, Cb resolution: 720 x 576 samples, each 8 bits Total number of bits: 720 x 576 x 8 x 3 =9 953 280 bits 4 :2 :0 Cr, Cb resolution: 360 x 288 samples, each 8 bits + Total number of bits: (720 x 576 x 8) (360 x 288 x 8 x 2) =4 976640 bits The 4 :2 :0 version requireshalf as many bits as the 4:4:4 version To further confuse things, :42 :0 sampling is sometimes described as ‘12 bits per pixTehl’e. reason for this can be illustrated by examining a group of 4 pixels (Figure 2.11). The lefthand diagram shows 4:4 :4 sampling: a totaolf 12 samples are required, 4 eaocfhY, Cr and Cb, requiring a total of 12 x 8 =96 bits, i.e. an average of 96/4 =24 bits per pixel. The right-hand diagram shows :42 :0 sampling: 6 samples are required,Y4and one eachof Cr, Cb, requiring a totalof 6 x 8 =48 bits, i.e. an average of 48/4 = 12 bits per pixel. 
0 00 Figure 2.11 4 pixels: 24 and 12 bpp iric 16 ir DIGITAL VIDEO 9 retina fovea --+ 1 2 optic nerve Figure 2.12 HVS components brain 2.4 THE HUMAN VISUAL SYSTEM A critical design goal for a digital video system is that the visual images produced by the system should be ‘pleasing’ to theviewer. In order toachieve this goal it isnecessary to take into account the response of the human visual system (HVS). The HVS is the ‘system’ by which a humanobserver views, interpretsand responds to visual stimuli. The main components of the HVS are shown in Figure 2.12: Eye: The image isfocused by the lens onto the photodetecting area of the eye, theretina. Focusing and object tracking are achieved by the eye muscles and the iris controls the aperture of the lens and hence the amount of light entering the eye. Retina: The retina consists of an array of cones (photoreceptors sensitive to colour at high light levels) androds (photoreceptors sensitive to luminance at low light levels). The more sensitive cones are concentrated in a central region (the fovea) which means that high-resolution colour vision is only achieved over a small area at the centre of the field of view. Optic nerve: This carries electrical signals from the retina to the brain. Brain: The human brain processes and interprets visual information, based partly on the received information (theimagedetected by theretina)and partly on prior learned responses (such as known object shapes). The operation of the HVS is a large and complexarea of study. Some of theimportant features of the HVS thathaveimplicationsfor digital videosystem design are listed in Table 2.3. 2.5 VIDEO QUALITY In order to specify, evaluate and compare video communication systems it is necessary to determine the quality of the video images displayed to the viewer. Measuring visual quality is a difficult and often impreciseart because there are so many factors that caninfluence the results. Visual quality is inherentlysubjective and is therefore influenced by many subjective factorsthatcanmakeit difficult toobtain a completelyaccuratemeasure of quality. VIDEO QUALITY 17 Table 2.3 Features of the HVS Feature The HVS is more sensitive to luminance detail than to colour detail The HVS is more sensitive to high contrast (i.e. large differences in luminance) than low contrast The HVS is more sensitive to low spatial frequencies (i.e. changes in luminance that occur over a large area) than high spatial frequencies (rapid changes that occur in a small area) The HVS is more sensitive to image features that persist for a long duration The illusion of ‘smooth’ motion can be achieved by presenting a series of images at a rate of 20-30 Hz or more HVS responses vary from individual to individual Implication for digital video systems Colour (or chrominance) resolution may be reduced without significantly affecting image quality Large changes in luminance (e.g. 
Measuring visual quality using objective criteria gives accurate, repeatable results, but as yet there are no objective measurement systems that completely reproduce the subjective experience of a human observer watching a video display.

2.5.1 Subjective Quality Measurement

Several test procedures for subjective quality evaluation are defined in ITU-R Recommendation BT.500-10.¹ One of the most popular of these quality measures is the double stimulus continuous quality scale (DSCQS) method. An assessor is presented with a pair of images or short video sequences A and B, one after the other, and is asked to give A and B a 'score' by marking on a continuous line with five intervals (labelled Excellent, Good, Fair, Poor and Bad). Figure 2.13 shows an example of the rating form on which the assessor grades each sequence. In a typical test session, the assessor is shown a series of sequence pairs and is asked to grade each pair. Within each pair of sequences, one is an unimpaired 'reference' sequence and the other is the same sequence, modified by a system or process under test. A typical example from the evaluation of video coding systems is shown in Figure 2.14: the original sequence is compared with the same sequence, encoded and decoded using a video CODEC. The order of the two sequences, original and 'impaired', is randomised during the test session so that the assessor does not know which is the original and which is the impaired sequence. This helps prevent the assessor from prejudging the impaired sequence compared with the reference sequence. At the end of the session, the scores are converted to a normalised range and the result is a score (sometimes described as a 'mean opinion score') that indicates the relative quality of the impaired and reference sequences.

Figure 2.13 DSCQS rating form

The DSCQS test is generally accepted as a realistic measure of subjective visual quality. However, it suffers from practical problems. The results can vary significantly, depending on the assessor and also on the video sequence under test. This variation can be compensated for by repeating the test with several sequences and several assessors. An 'expert' assessor (e.g. one who is familiar with the nature of video compression distortions or 'artefacts') may give a biased score and it is preferable to use 'non-expert' assessors. In practice this means that a large pool of assessors is required, because a non-expert assessor will quickly learn to recognise characteristic artefacts in the video sequences. These factors make it expensive and time-consuming to carry out the DSCQS tests thoroughly. A second problem is that this test is only really suitable for short sequences of video. It has been shown² that the 'recency effect' means that the viewer's opinion is heavily biased towards the last few seconds of a video sequence: the quality of this last section will strongly influence the viewer's rating for the whole of a longer sequence.
Subjective tests are also influenced by the viewing conditions: a test carried out in a comfortable, relaxed environment will earn a higher rating than the same test carried out in a less comfortable setting.

Figure 2.14 DSCQS testing system (source video sequence, video encoder, video decoder, display)

2.5.2 Objective Quality Measurement

Because of the problems of subjective measurement, developers of digital video systems rely heavily on objective measures of visual quality. Objective measures have not yet replaced subjective testing: however, they are considerably easier to apply and are particularly useful during development and for comparison purposes. Probably the most widely used objective measure is peak signal to noise ratio (PSNR), calculated using Equation 2.3. PSNR is measured on a logarithmic scale and is based on the mean squared error (MSE) between an original and an impaired image or video frame, relative to (2^n - 1)^2 (the square of the highest possible signal value in the image, where n is the number of bits per image sample):

PSNR_dB = 10 log10 [ (2^n - 1)^2 / MSE ]               (2.3)

PSNR can be calculated very easily and is therefore a very popular quality measure. It is widely used as a method of comparing the 'quality' of compressed and decompressed video images. Figure 2.15 shows some examples: the first image (a) is the original and (b), (c) and (d) are compressed and decompressed versions of the original image. The progressively poorer image quality is reflected by a corresponding drop in PSNR.

Figure 2.15 PSNR examples: (a) original; (b) 33.2 dB; (c) 31.8 dB; (d) 26.5 dB

The PSNR measure suffers from a number of limitations, however. PSNR requires an 'unimpaired' original image for comparison: this may not be available in every case and it may not be easy to verify that an 'original' image has perfect fidelity. A more important limitation is that PSNR does not correlate well with subjective video quality measures such as ITU-R 500. For a given image or image sequence, high PSNR indicates relatively high quality and low PSNR indicates relatively low quality. However, a particular value of PSNR does not necessarily equate to an 'absolute' subjective quality. For example, Figure 2.16 shows two impaired versions of the original image from Figure 2.15. Image (a) (with a blurred background) has a PSNR of 32.7 dB, whereas image (b) (with a blurred foreground) has a higher PSNR of 37.5 dB. Most viewers would rate image (b) as significantly poorer than image (a): however, the PSNR measure simply counts the mean squared pixel errors and by this method image (b) is ranked as 'better' than image (a). This example shows that PSNR ratings do not necessarily correlate with 'true' subjective quality.

Because of these problems, there has been a lot of work in recent years to try to develop a more sophisticated objective test that closely approaches subjective test results. Many different approaches have been proposed,³⁻⁵ but none of these has emerged as a clear alternative to subjective tests. With improvements in objective quality measurement, however, some interesting applications become possible, such as proposals for 'constant-quality' video coding⁶ (see Chapter 10, 'Rate Control'). ITU-R BT.500-10 (and more recently, P.910) describe standard methods for subjective quality evaluation: however, as yet there is no standardised, accurate system for objective ('automatic') quality measurement that is suitable for digitally coded video. In recognition of this, the ITU-T Video Quality Experts Group (VQEG) is developing a standard for objective video quality evaluation.⁷ The first step in this process was to test and compare potential models for objective evaluation. In March 2000, VQEG reported on the first round of tests, in which 10 competing systems were tested under identical conditions. Unfortunately, none of the 10 proposals was considered suitable for standardisation. The problem of accurate objective quality measurement is therefore likely to remain for some time to come.

The PSNR measure is widely used as an approximate objective measure for visual quality and so we will use this measure for quality comparison in this book. However, it is worth remembering the limitations of PSNR when comparing different systems and techniques.
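For implementers, the PSNR calculation in Equation 2.3 is straightforward to code. The following Python sketch (the function name and the 8-bit default are illustrative assumptions, not part of any standard) computes PSNR between an original and an impaired frame held as lists of rows of integer samples.

    import math

    def psnr_db(original, impaired, bits=8):
        """PSNR (Equation 2.3): 10*log10((2^n - 1)^2 / MSE) between two frames."""
        count = 0
        sq_sum = 0.0
        for row_o, row_i in zip(original, impaired):
            for o, i in zip(row_o, row_i):
                sq_sum += (o - i) ** 2
                count += 1
        mse = sq_sum / count
        if mse == 0:
            return float('inf')          # identical frames
        peak = (2 ** bits - 1) ** 2
        return 10.0 * math.log10(peak / mse)

For example, psnr_db(frame_a, frame_b) returns a value in decibels; a higher value indicates a closer numerical match to the original, subject to the limitations discussed above.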
2.6 STANDARDS FOR REPRESENTING DIGITAL VIDEO

A widely used format for digitally coding video signals for television production is ITU-R Recommendation BT.601-5⁸ (the term 'coding' in this context means conversion to digital format and does not imply compression). The luminance component of the video signal is sampled at 13.5 MHz and the chrominance at 6.75 MHz to produce a 4:2:2 Y:Cr:Cb component signal. The parameters of the sampled digital signal depend on the video frame rate (either 30 or 25 Hz) and are shown in Table 2.4. It can be seen that the higher 30 Hz frame rate is compensated for by a lower spatial resolution so that the total bit rate is the same in each case (216 Mbps). The actual area shown on the display, the active area, is smaller than the total because it excludes horizontal and vertical blanking intervals that exist 'outside' the edges of the frame. Each sample has a possible range of 0-255: however, levels of 0 and 255 are reserved for synchronisation. The active luminance signal is restricted to a range of 16 (black) to 235 (white).

Table 2.4 ITU-R BT.601-5 parameters

                                      30 Hz frame rate   25 Hz frame rate
Fields per second                     60                 50
Lines per complete frame              525                625
Luminance samples per line            858                864
Chrominance samples per line          429                432
Bits per sample                       8                  8
Total bit rate                        216 Mbps           216 Mbps
Active lines per frame                480                576
Active samples per line (Y)           720                720
Active samples per line (Cr, Cb)      360                360

For video coding applications, video is often converted to one of a number of 'intermediate formats' prior to compression and transmission. A set of popular frame resolutions is based around the common intermediate format, CIF, in which each frame has a resolution of 352 x 288 pixels. The resolutions of these formats are listed in Table 2.5 and their relative dimensions are illustrated in Figure 2.17.

Table 2.5 Intermediate formats

Format               Luminance resolution (horiz. x vert.)
Sub-QCIF             128 x 96
Quarter CIF (QCIF)   176 x 144
CIF                  352 x 288
4CIF                 704 x 576

Figure 2.17 Intermediate formats (illustration): 4CIF 704 x 576, CIF 352 x 288, QCIF 176 x 144

2.7 APPLICATIONS

The last decade has seen a rapid increase in applications for digital video technology and new, innovative applications continue to emerge. A small selection is listed here:

• Home video: Video camera recorders for professional and home use are increasingly moving away from analogue tape to digital media (including digital storage on tape and on solid-state media). Affordable DVD video recorders will soon be available for the home.

• Video storage: A variety of digital formats are now used for storing video on disk, tape and compact disk or DVD for business and home use, both in compressed and uncompressed form.
• Video conferencing: One of the earliest applications for video compression, video conferencing facilitates meetings between participants in two or more separate locations.

• Video telephony: Often used interchangeably with video conferencing, this usually means a face-to-face discussion between two parties via a video 'link'.

• Remote learning: There is an increasing interest in the provision of computer-based learning to supplement or replace traditional 'face-to-face' teaching and learning. Digital video is seen as an important component of this in the form of stored video material and video conferencing.

• Remote medicine: Medical support provided at a distance, or 'telemedicine', is another potential growth area where digital video and images may be used together with other monitoring techniques to provide medical advice at a distance.

• Television: Digital television is now widely available and many countries have a timetable for 'switching off' the existing analogue television service. Digital TV is one of the most important mass-market applications for video coding and compression.

• Video production: Fully digital video storage, editing and production have been widely used in television studios for many years. The requirement for high image fidelity often means that the popular 'lossy' compression methods described in this book are not an option.

• Games and entertainment: The potential for 'real' video imagery in the computer gaming market is just beginning to be realised with the convergence of 3-D graphics and 'natural' video.

2.7.1 Platforms

Developers are targeting an increasing range of platforms to run the ever-expanding list of digital video applications. Dedicated platforms are designed to support a specific video application and no other. Examples include digital video cameras, dedicated video conferencing systems, digital TV set-top boxes and DVD players. In the early days, the high processing demands of digital video meant that dedicated platforms were the only practical design solution. Dedicated platforms will continue to be important for low-cost, mass-market systems but are increasingly being replaced by more flexible solutions.

The PC has emerged as a key platform for digital video. A continual increase in PC processing capabilities (aided by hardware enhancements for media applications such as the Intel MMX instructions) means that it is now possible to support a wide range of video applications from video editing to real-time video conferencing.

Embedded platforms are an important new market for digital video techniques. For example, the personal communications market is now huge, driven mainly by users of mobile telephones. Video services for mobile devices (running on low-cost embedded processors) are seen as a major potential growth area. This type of platform poses many challenges for application developers due to the limited processing power, the relatively poor wireless communications channel and the requirement to keep equipment and usage costs to a minimum.

2.8 SUMMARY

Sampling of an analogue video signal, both spatially and temporally, produces a digital video signal. Representing a colour scene requires at least three separate 'components': popular colour 'spaces' include red/green/blue and Y/Cr/Cb (which has the advantage that the chrominance may be subsampled to reduce the information rate without significant loss of quality). The human observer's response to visual information affects the way we perceive video quality and this is notoriously difficult to quantify accurately.
Subjective tests (involving 'real' observers) are time-consuming and expensive to run; objective tests range from the simplistic (but widely used) PSNR measure to complex models of the human visual system. The digital video applications listed above have been made possible by the development of compression or coding technology. In the next chapter we introduce the basic concepts of video and image compression.

REFERENCES

1. Recommendation ITU-R BT.500-10, 'Methodology for the subjective assessment of the quality of television pictures', ITU-T, 2000.
2. R. Aldridge, J. Davidoff, M. Ghanbari, D. Hands and D. Pearson, 'Subjective assessment of time-varying coding distortions', Proc. PCS96, Melbourne, March 1996.
3. C. J. van den Branden Lambrecht and O. Verscheure, 'Perceptual quality measure using a spatio-temporal model of the Human Visual System', Digital Video Compression Algorithms and Technologies, Proc. SPIE, Vol. 2668, San Jose, 1996.
4. H. Wu, Z. Yu, S. Winkler and T. Chen, 'Impairment metrics for MC/DPCM/DCT encoded digital video', Proc. PCS01, Seoul, April 2001.
5. K. T. Tan and M. Ghanbari, 'A multi-metric objective picture quality measurement model for MPEG video', IEEE Trans. CSVT, 10(7), October 2000.
6. A. Basso, I. Dalgic, F. Tobagi and C. J. van den Branden Lambrecht, 'A feedback control scheme for low latency constant quality MPEG-2 video encoding', Digital Compression Technologies and Systems for Video Communications, Proc. SPIE, Vol. 2952, Berlin, 1996.
7. http://www.vqeg.org/ [Video Quality Experts Group].
8. Recommendation ITU-R BT.601-5, 'Studio encoding parameters of digital television for standard 4:3 and wide-screen 16:9 aspect ratios', ITU-T, 1995.

3 Image and Video Compression Fundamentals

3.1 INTRODUCTION

Representing video material in a digital form requires a large number of bits. The volume of data generated by digitising a video signal is too large for most storage and transmission systems (despite the continual increase in storage capacity and transmission 'bandwidth'). This means that compression is essential for most digital video applications.

The ITU-R 601 standard (described in Chapter 2) describes a digital format for video that is roughly equivalent to analogue television, in terms of spatial resolution and frame rate. One channel of ITU-R 601 television, broadcast in uncompressed digital form, requires a transmission bit rate of 216 Mbps. At this bit rate, a 4.7 Gbyte DVD could store just 87 seconds of uncompressed video. Table 3.1 shows the uncompressed bit rates of several popular video formats. From this table it can be seen that even QCIF at 15 frames per second (i.e. relatively low-quality video, suitable for video telephony) requires 4.6 Mbps for transmission or storage. Table 3.2 lists typical capacities of popular storage media and transmission networks.

There is a clear gap between the high bit rates demanded by uncompressed video and the available capacity of current networks and storage media. The purpose of video compression (video coding) is to fill this gap. A video compression system aims to reduce the amount of data required to store or transmit video whilst maintaining an 'acceptable' level of video quality. Most of the practical systems and standards for video compression are 'lossy', i.e. the volume of data is reduced (compressed) at the expense of a loss of visual quality. The quality loss depends on many factors, but in general, higher compression results in a greater loss of quality.
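The uncompressed bit rates quoted above follow directly from the sampling parameters. The short Python sketch below (an illustrative calculation, not part of any standard; it assumes 8 bits per sample and 4:2:0 sampling for the intermediate formats) reproduces the ITU-R 601 figure from its sampling rates and the CIF and QCIF figures from their frame sizes.

    def bt601_bitrate(bits=8):
        """ITU-R 601: luminance sampled at 13.5 MHz, each chrominance at 6.75 MHz."""
        return (13.5e6 + 2 * 6.75e6) * bits            # 216 Mbps

    def format_bitrate(width, height, fps, bits=8):
        """Uncompressed bit rate for a 4:2:0 format (1.5 samples per pixel)."""
        samples_per_frame = width * height * 1.5
        return samples_per_frame * bits * fps

    print(bt601_bitrate() / 1e6)                   # 216.0 Mbps
    print(format_bitrate(352, 288, 30) / 1e6)      # about 36.5 Mbps (CIF at 30 frames/s)
    print(format_bitrate(176, 144, 15) / 1e6)      # about 4.6 Mbps (QCIF at 15 frames/s)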
3.1.1 Do We Need Compression?

The following statement (or something similar) has been made many times over the 20-year history of image and video compression: 'Video compression will become redundant very soon, once transmission and storage capacities have increased to a sufficient level to cope with uncompressed video.' It is true that both storage and transmission capacities continue to increase. However, an efficient and well-designed video compression system gives very significant performance advantages for visual communications at both low and high transmission bandwidths. At low bandwidths, compression enables applications that would not otherwise be possible, such as basic-quality video telephony over a standard telephone connection. At high bandwidths, compression can support a much higher visual quality. For example, a 4.7 Gbyte DVD can store approximately 2 hours of uncompressed QCIF video (at 15 frames per second) or 2 hours of compressed ITU-R 601 video (at 30 frames per second). Most users would prefer to see 'television-quality' video with smooth motion rather than 'postage-stamp' video with jerky motion.

Table 3.1 Uncompressed bit rates

Format      Luminance resolution   Chrominance resolution   Frames per second   Bits per second
ITU-R 601   858 x 525              429 x 525                30                  216 Mbps
CIF         352 x 288              176 x 144                30                  36.5 Mbps
QCIF        176 x 144              88 x 72                  15                  4.6 Mbps

Table 3.2 Typical transmission / storage capacities

Media / network          Capacity
Ethernet LAN (10 Mbps)   Max. 10 Mbps
ADSL                     Typical 1-2 Mbps downstream
ISDN-2                   128 kbps
V.90 modem               56 kbps downstream / 33 kbps upstream
DVD-5                    4.7 Gbytes
CD-ROM                   approx. 650 Mbytes

Video compression and video CODECs will therefore remain a vital part of the emerging multimedia industry for the foreseeable future, allowing designers to make the most efficient use of available transmission or storage capacity. In this chapter we introduce the basic components of an image or video compression system. We begin by defining the concept of an image or video encoder (compressor) and decoder (decompressor). We then describe the main functional blocks of an image encoder/decoder (CODEC) and a video CODEC.

3.2 IMAGE AND VIDEO COMPRESSION

Information-carrying signals may be compressed, i.e. converted to a representation or form that requires fewer bits than the original (uncompressed) signal. A device or program that compresses a signal is an encoder and a device or program that decompresses a signal is a decoder. An enCOder/DECoder pair is a CODEC.

Figure 3.1 shows a typical example of a CODEC as part of a communication system. The original (uncompressed) information is encoded (compressed): this is source coding. The source coded signal is then encoded further to add error protection (channel coding) prior to transmission over a channel. At the receiver, a channel decoder detects and/or corrects transmission errors and a source decoder decompresses the signal. The decompressed signal may be identical to the original signal (lossless compression) or it may be distorted or degraded in some way (lossy compression).

Figure 3.1 Source coder, channel coder, channel

General-purpose compression CODECs are available that are designed to encode and compress data containing statistical redundancy. An information-carrying signal usually contains redundancy, which means that it may (in theory) be represented in a more compact way.
For example, characters within a text file occur with varying frequencies: in English, the letters E, T and A occur more often than the letters Q, Z and X. This makes it possible to compress a text file by representing frequently occurring characters with short codes and infrequently occurring characters with longer codes (this principle is used in Huffman coding, described in Chapter 8). Compression is achieved by reducing the statistical redundancy in the text file. This type of general-purpose CODEC is known as an entropy CODEC.

Photographic images and sequences of video frames are not amenable to compression using general-purpose CODECs. Their contents (pixel values) tend to be highly correlated, i.e. neighbouring pixels have similar values, whereas an entropy encoder performs best with data values that have a certain degree of independence (decorrelated data). Figure 3.2 illustrates the poor performance of a general-purpose entropy encoder with image data. The original image (a) is compressed and decompressed using a ZIP program to produce image (b). This is identical to the original (lossless compression), but the compressed file is only 92% of the size of the original, i.e. there is very little compression. Image (c) is obtained by compressing and decompressing the original using the JPEG compression method. The compressed version is less than a quarter of the size of the original (over 4 x compression) and the decompressed image looks almost identical to the original. (It is in fact slightly 'degraded' due to the lossy compression process.)

Figure 3.2 (a) Original image; (b) ZIP encoded and decoded; (c) JPEG encoded and decoded

In this example, the JPEG method achieved good compression performance by applying a source model to the image before compression. The source model attempts to exploit the properties of video or image data and to represent it in a form that can readily be compressed by an entropy encoder. Figure 3.3 shows the basic design of an image or video CODEC consisting of a source model and an entropy encoder/decoder.

Figure 3.3 Image or video CODEC (source model followed by entropy encoder; entropy decoder followed by source model)

Images and video signals have a number of properties that may be exploited by source models. Neighbouring samples (pixels) within an image or a video frame tend to be highly correlated and so there is significant spatial redundancy. Neighbouring regions within successive video frames also tend to be highly correlated (temporal redundancy). As well as these statistical properties (statistical redundancy), a source model may take advantage of subjective redundancy, exploiting the sensitivity of the human visual system to various characteristics of images and video. For example, the HVS is much more sensitive to low frequencies than to high ones and so it is possible to compress an image by eliminating certain high-frequency components. Image (c) in Figure 3.2 was compressed by discarding certain subjectively redundant components of the information: the decoded image is not identical to the original but the information loss is not obvious to the human viewer. Examples of image and video source models include the following.

3.2.1 DPCM (Differential Pulse Code Modulation)

Each sample or pixel is predicted from one or more previously transmitted samples. The simplest prediction is formed from the previous pixel (pixel A in Figure 3.4). A more accurate prediction can be obtained using a weighted combination of neighbouring pixels (for example, A, B and C in Figure 3.4). The actual pixel value X is subtracted from the prediction and the difference (the prediction error) is transmitted to the receiver. The prediction error will typically be small due to spatial correlation, and compression can be achieved by representing common, small prediction errors with short binary codes and larger, less common errors with longer codes. Further compression may be achieved by quantising the prediction error and reducing its precision: this is lossy compression as it becomes impossible to exactly reproduce the original values at the decoder. DPCM may be applied spatially (using adjacent pixels in the same frame) and/or temporally (using adjacent pixels in a previous frame to form the prediction) and gives modest compression with low complexity.

Figure 3.4 DPCM: pixel X is predicted from previously transmitted neighbours A (left), B (above) and C (above-left)
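The DPCM idea can be expressed in a few lines of code. The sketch below is illustrative only: the predictor (A + B - C) and the quantiser step size are assumptions chosen for simplicity, not values taken from this book or any standard. It predicts each pixel from its left, upper and upper-left neighbours, quantises the prediction error and reconstructs each pixel exactly as a decoder would.

    def dpcm_encode_decode(image, step=4):
        """Simple spatial DPCM with a neighbour-based predictor and uniform quantiser.

        'image' is a list of rows of integer samples. Returns the quantised
        prediction errors and the reconstructed image.
        """
        height, width = len(image), len(image[0])
        recon = [[0] * width for _ in range(height)]
        errors = []
        for y in range(height):
            for x in range(width):
                # Neighbours are taken from the *reconstructed* image,
                # exactly as the decoder will see them.
                a = recon[y][x - 1] if x > 0 else 128                 # left (A)
                b = recon[y - 1][x] if y > 0 else 128                 # above (B)
                c = recon[y - 1][x - 1] if x > 0 and y > 0 else 128   # above-left (C)
                prediction = a + b - c            # one common predictor choice
                error = image[y][x] - prediction
                q = round(error / step)           # lossy quantisation of the error
                errors.append(q)
                recon[y][x] = prediction + q * step   # decoder-style reconstruction
        return errors, recon

Because the predictor works from reconstructed (not original) neighbours, the encoder and a matching decoder stay in step, a point that recurs later in the discussion of prediction 'drift'.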
3.2.2 Transform Coding

The image samples are transformed into another domain (or representation) and are represented by transform coefficients. In the 'spatial domain' (i.e. the original form of the image), samples are highly spatially correlated. The aim of transform coding is to reduce this correlation, ideally leaving a small number of visually significant transform coefficients (important to the appearance of the original image) and a large number of insignificant coefficients (that may be discarded without significantly affecting the visual quality of the image). The transform process itself does not achieve compression: it is usually followed by a lossy quantisation process in which the insignificant coefficients are removed, leaving behind a small number of significant coefficients. Transform coding (Figure 3.5) forms the basis of most of the popular image and video compression systems and is described in more detail in this chapter and in Chapter 7.

Figure 3.5 Transform coding

3.2.3 Motion-compensated Prediction

Using a similar principle to DPCM, the encoder forms a model of the current frame based on the samples of a previously transmitted frame. The encoder attempts to 'compensate' for motion in a video sequence by translating (moving) or warping the samples of the previously transmitted 'reference' frame. The resulting motion-compensated predicted frame (the model of the current frame) is subtracted from the current frame to produce a residual 'error' frame (Figure 3.6). Further coding usually follows motion-compensated prediction, e.g. transform coding of the residual frame.

Figure 3.6 Motion-compensated prediction

3.2.4 Model-based Coding

The encoder attempts to create a semantic model of the video scene, for example by analysing and interpreting the content of the scene. An example is a 'talking head' model: the encoder analyses a scene containing a person's head and shoulders (a typical video conferencing scene) and models the head as a 3-D object. The decoder maintains its own 3-D model of the head. Instead of transmitting information that describes the entire image, the encoder sends only the animation parameters required to 'move' the model, together with an error signal that compensates for the difference between the modelled scene and the actual video scene (Figure 3.7).
Model-based coding has the potential for far greater compression than the other source models described above: however, the computational complexity required to analyse and synthesise 3-D models of a video scene in real time is very high.

Figure 3.7 Model-based coding (original scene analysed into a 3-D model; animation parameters transmitted; scene reconstructed from the decoder's 3-D model)

3.3 IMAGE CODEC

An image CODEC encodes and decodes single images or individual frames from a video sequence (Figure 3.8) and may consist of a transform coding stage followed by quantisation and entropy coding.

Figure 3.8 Image CODEC (encoder: transform, quantise, reorder, entropy encode; decoder: entropy decode, reorder, rescale, inverse transform)

3.3.1 Transform Coding

The transform coding stage converts (transforms) the image from the spatial domain into another domain in order to make it more amenable to compression. The transform may be applied to discrete blocks of the image (block transform) or to the entire image.

Block transforms

The spatial image samples are processed in discrete blocks, typically 8 x 8 or 16 x 16 samples. Each block is transformed using a 2-D transform to produce a block of transform coefficients. The performance of a block-based transform for image compression depends on how well it can decorrelate the information in each block. The Karhunen-Loeve transform (KLT) has the 'best' performance of any block-based image transform. The coefficients produced by the KLT are decorrelated and the energy in the block is packed into a minimal number of coefficients. The KLT is, however, very computationally inefficient, and it is impractical because the functions required to carry out the transform ('basis functions') must be calculated in advance and transmitted to the decoder for every image. The discrete cosine transform (DCT) performs nearly as well as the KLT and is much more computationally efficient.

Figure 3.9 shows a 16 x 16 block of image samples (a) and the corresponding block of coefficients produced by the DCT (b). In the original block, the energy is distributed across the 256 samples and the samples are clearly closely interrelated (correlated). In the coefficient block, the energy is concentrated into a few significant coefficients (at the top left). The coefficients are decorrelated: this means that the smaller-valued coefficients may be discarded (for example by quantisation) without significantly affecting the quality of the reconstructed image block at the decoder.

Figure 3.9 (a) 16 x 16 block of pixels; (b) DCT coefficients

The 16 x 16 array of coefficients shown in Figure 3.9 represents spatial frequencies in the original block. At the top left of the array are the low-frequency components, representing the gradual changes of brightness (luminance) in the original block. At the bottom right of the array are high-frequency components and these represent rapid changes in brightness. These frequency components are analogous to the components produced by Fourier analysis of a time-varying signal (and in fact the DCT is closely related to the discrete Fourier transform) except that here the components are 2-D. The example shown in Figure 3.9 is typical for a photographic image: most of the coefficients produced by the DCT are insignificant and can be discarded. This makes the DCT a powerful tool for image and video compression.
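For readers who want to experiment, the 2-D DCT of an N x N block can be computed directly from its definition. The Python sketch below is a deliberately straightforward, unoptimised implementation (practical CODECs use fast factorised algorithms instead); it produces a coefficient block of the kind referred to above for an 8 x 8 or 16 x 16 block of samples.

    import math

    def dct_2d(block):
        """Orthonormal 2-D DCT of an N x N block (list of N rows of N samples)."""
        n = len(block)
        def alpha(k):
            return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs = [[0.0] * n for _ in range(n)]
        for u in range(n):
            for v in range(n):
                total = 0.0
                for x in range(n):
                    for y in range(n):
                        total += (block[x][y]
                                  * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                                  * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
                coeffs[u][v] = alpha(u) * alpha(v) * total
        return coeffs

    # coeffs[0][0] (the 'DC' coefficient) carries the block's average brightness
    # (scaled); coefficients further from the top left carry higher spatial frequencies.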
Image transforms

The DCT is usually applied to small, discrete blocks of an image, for reasons of practicality. In contrast, an image transform may be applied to a complete video image (or to a large 'tile' within the image). The most popular transform of this type is the discrete wavelet transform. A 2-D wavelet transform is applied to the original image in order to decompose it into a series of filtered 'sub-band' images (Figure 3.10). Image (a) is processed in a series of stages to produce the 'wavelet decomposition' image (b). This is made up of a series of components, each containing a subset of the spatial frequencies in the image. At the top left is a low-pass filtered version of the original and, moving to the bottom right, each component contains progressively higher-frequency information that adds the 'detail' of the image. It is clear that the higher-frequency components are relatively 'sparse', i.e. many of the values (or 'coefficients') in these components are zero or insignificant. The wavelet transform is thus an efficient way of decorrelating or concentrating the important information into a few significant coefficients.

Figure 3.10 Wavelet decomposition of image

The wavelet transform is particularly effective for still image compression and has been adopted as part of the JPEG-2000 standard and for still image 'texture' coding in the MPEG-4 standard. Wavelet-based compression is discussed further in Chapter 7.

Another image transform that has received much attention is the so-called fractal transform. A fractal transform coder attempts to represent an image as a set of scaled and translated arbitrary 'basis patterns'. Fractal-based coding has not, however, shown sufficiently good performance to be included in any of the international standards for video and image coding and so we will not discuss it in detail.
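A single level of a wavelet-style decomposition can be illustrated with the simplest possible filter pair. The sketch below uses the Haar transform, chosen here purely for brevity (JPEG-2000 and MPEG-4 use longer filters), to split an image into one low-pass approximation and three detail sub-bands of the kind visible in Figure 3.10.

    def haar_decompose(image):
        """One level of a 2-D Haar decomposition.

        'image' is a list of rows with even width and height. Returns the
        low-pass approximation and three detail sub-bands, each half-size.
        """
        def split_rows(img):
            low, high = [], []
            for row in img:
                l = [(row[2 * i] + row[2 * i + 1]) / 2 for i in range(len(row) // 2)]
                h = [(row[2 * i] - row[2 * i + 1]) / 2 for i in range(len(row) // 2)]
                low.append(l)
                high.append(h)
            return low, high

        def transpose(img):
            return [list(col) for col in zip(*img)]

        low, high = split_rows(image)                      # filter along the rows
        ll, lh = (transpose(x) for x in split_rows(transpose(low)))    # then columns
        hl, hh = (transpose(x) for x in split_rows(transpose(high)))
        return ll, lh, hl, hh

In a typical image the three detail sub-bands (lh, hl, hh) are sparse, which is exactly the property the quantisation and entropy coding stages described next are designed to exploit.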
3.3.2 Quantisation

The block and image transforms described above do not themselves achieve any compression. Instead, they represent the image in a different domain in which the image data is separated into components of varying 'importance' to the appearance of the image. The purpose of quantisation is to remove the components of the transformed data that are unimportant to the visual appearance of the image and to retain the visually important components. Once removed, the less important components cannot be replaced and so quantisation is a lossy process.

Example

1. The DCT coefficients shown earlier in Figure 3.9 are quantised by dividing each coefficient by an integer. The resulting array of quantised coefficients is shown in Figure 3.11(a): the large-value coefficients map to non-zero integers and the small-value coefficients map to zero.

2. Rescaling the quantised array (multiplying each coefficient by the same integer) gives Figure 3.11(b). The magnitudes of the larger coefficients are similar to the original coefficients; however, the smaller coefficients (set to zero during quantisation) cannot be recreated and remain at zero.

3. Applying an inverse DCT to the rescaled array gives the block of image samples shown in Figure 3.12: this looks superficially similar to the original image block but some of the information has been lost through quantisation.

Figure 3.11 (a) Quantised DCT coefficients; (b) rescaled
Figure 3.12 Reconstructed block of image samples

It is possible to vary the 'coarseness' of the quantisation process (using a quantiser 'scale factor' or 'step size'). 'Coarse' quantisation will tend to discard most of the coefficients, leaving only the most significant, whereas 'fine' quantisation will tend to leave more coefficients in the quantised block. Coarse quantisation usually gives higher compression at the expense of a greater loss in image quality. The quantiser scale factor or step size is often the main parameter used to control image quality and compression in an image or video CODEC. Figure 3.13 shows a small original image (left) and the effect of compression and decompression with fine quantisation (middle) and coarse quantisation (right).

Figure 3.13 (a) Original image; (b) fine quantisation; (c) coarse quantisation

3.3.3 Entropy Coding

A typical image block will contain a few significant non-zero coefficients and a large number of zero coefficients after block transform coding and quantisation. The remaining non-zero data can be efficiently compressed using a statistical compression method ('entropy coding'):

1. Reorder the quantised coefficients. The non-zero quantised coefficients of a typical image block tend to be clustered around the 'top-left corner', i.e. around the low frequencies (e.g. Figure 3.9). These non-zero values can be grouped together in sequence by reordering the 64 coefficients, for example in a zigzag scanning order (Figure 3.14). Scanning through in a zigzag sequence from the top-left (lowest frequency) to the bottom-right (highest frequency) coefficient groups together the significant low-frequency coefficients.

Figure 3.14 Zigzag reordering of quantised coefficients

2. Run-level coding. The reordered coefficient array is usually 'sparse', consisting of a group of non-zero coefficients followed by zeros (with occasional non-zero higher-frequency coefficients). This type of array may be compactly represented as a series of (run, level) pairs, as shown in the example in Table 3.3. The first number in the (run, level) pair represents the number of preceding zeros and the second number represents a non-zero value (level). For example, (5, 12) represents five zeros followed by 12.

Table 3.3 Run-level coding example
Reordered coefficient data:   24, 3, -9, 0, -2, 0, 0, 0, 0, 0, 12, 0, 0, 0, 2, ...
(Run, level) pairs:           (0, 24), (0, 3), (0, -9), (1, -2), (5, 12), (3, 2), ...

3. Entropy coding. A statistical coding algorithm is applied to the (run, level) data. The purpose of the entropy coding algorithm is to represent frequently occurring (run, level) pairs with a short code and infrequently occurring (run, level) pairs with a longer code. In this way, the run-level data may be compressed into a small number of bits.

Huffman coding and arithmetic coding are widely used for entropy coding of image and video data. Huffman coding replaces each 'symbol' (e.g. a [run, level] pair) with a codeword containing a variable number of bits. The codewords are allocated based on the statistical distribution of the symbols. Short codewords are allocated to common symbols and longer codewords are allocated to infrequent symbols. Each codeword is chosen to be 'uniquely decodeable', so that a decoder can extract the series of variable-length codewords without ambiguity. Huffman coding is well suited to practical implementation and is widely used in practice.

Arithmetic coding maps a series of symbols to a fractional number (see Chapter 8) that is then converted into a binary number and transmitted. Arithmetic coding has the potential for higher compression than Huffman coding. Each symbol may be represented with a fractional number of bits (rather than just an integral number of bits) and this means that the bits allocated per symbol may be more accurately matched to the statistical distribution of the coded data.
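The reorder and run-level steps above are easy to prototype. The following Python sketch (an illustrative helper, not taken from any standard; it assumes an 8 x 8 block, and the exact diagonal direction convention varies between standards) generates a zigzag scan order, applies it to a block of quantised coefficients and converts the scanned sequence into (run, level) pairs.

    def zigzag_order(n=8):
        """Return the (row, col) visiting order for an n x n zigzag scan."""
        positions = [(r, c) for r in range(n) for c in range(n)]
        # Sort by anti-diagonal; alternate the traversal direction of each diagonal.
        return sorted(positions,
                      key=lambda rc: (rc[0] + rc[1],
                                      -rc[1] if (rc[0] + rc[1]) % 2 else rc[1]))

    def run_level_encode(block):
        """Zigzag-scan a block of quantised coefficients into (run, level) pairs."""
        scanned = [block[r][c] for r, c in zigzag_order(len(block))]
        pairs, run = [], 0
        for value in scanned:
            if value == 0:
                run += 1
            else:
                pairs.append((run, value))
                run = 0
        return pairs   # trailing zeros are typically signalled by an end-of-block code

Applied to the reordered sequence in Table 3.3, the run-level step produces (0, 24), (0, 3), (0, -9), (1, -2), (5, 12), (3, 2) and so on, ready for entropy coding.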
3.3.4 Decoding

The output of the entropy encoder is a sequence of binary codes representing the original image in compressed form. In order to recreate the image it is necessary to decode this sequence, and the decoding process (shown in Figure 3.8) is almost the reverse of the encoding process. An entropy decoder extracts run-level symbols from the bit sequence. These are converted to a sequence of coefficients that are reordered into a block of quantised coefficients. The decoding operations up to this point are the inverse of the equivalent encoding operations. Each coefficient is multiplied by the integer scale factor ('rescaled'). This is often described as 'inverse quantisation', but in fact the loss of precision due to quantisation cannot be reversed and so the rescaled coefficients are not identical to the original transform coefficients. The rescaled coefficients are transformed with an inverse transform to reconstruct a decoded image. Because of the data loss during quantisation, this image will not be identical to the original image: the amount of difference depends partly on the 'coarseness' of quantisation.

3.4 VIDEO CODEC

A video signal consists of a sequence of individual frames. Each frame may be compressed individually using an image CODEC as described above: this is described as intra-frame coding, where each frame is 'intra' coded without any reference to other frames. However, better compression performance may be achieved by exploiting the temporal redundancy in a video sequence (the similarities between successive video frames). This may be achieved by adding a 'front end' to the image CODEC, with two main functions:

1. Prediction: create a prediction of the current frame based on one or more previously transmitted frames.

2. Compensation: subtract the prediction from the current frame to produce a 'residual frame'.

The residual frame is then processed using an 'image CODEC'. The key to this approach is the prediction function: if the prediction is accurate, the residual frame will contain little data and will hence be compressed to a very small size by the image CODEC. In order to decode the frame, the decoder must 'reverse' the compensation process, adding the prediction to the decoded residual frame (reconstruction) (Figure 3.15). This is inter-frame coding: frames are coded based on some relationship with other video frames, i.e. coding exploits the interdependencies of video frames.

Figure 3.15 Video CODEC with prediction

3.4.1 Frame Differencing

The simplest predictor is just the previous transmitted frame. Figure 3.16 shows the residual frame produced by subtracting the previous frame from the current frame in a video sequence. Mid-grey areas of the residual frame contain zero data: light and dark areas indicate positive and negative residual data respectively. It is clear that much of the residual data is zero: hence, compression efficiency can be improved by compressing the residual frame rather than the current frame, as illustrated in the sketch below.
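As a simple illustration of frame differencing (the function names and the mid-grey display offset are illustrative assumptions), the sketch below forms a residual frame by subtracting the previous frame from the current frame, and reconstructs the current frame by adding the residual back to the prediction, exactly as a decoder must.

    def frame_difference(current, previous):
        """Residual frame: current minus previous, sample by sample."""
        return [[c - p for c, p in zip(row_c, row_p)]
                for row_c, row_p in zip(current, previous)]

    def reconstruct(previous, residual):
        """Decoder-side reconstruction: prediction plus decoded residual."""
        return [[p + r for p, r in zip(row_p, row_r)]
                for row_p, row_r in zip(previous, residual)]

    # To display a residual as in Figure 3.16, offset it so that zero maps to
    # mid-grey, e.g. display_value = residual_value + 128 for 8-bit samples.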
Table 3.4 Prediction 'drift'

Encoder input      Encoder prediction   Encoder output                 Decoder prediction   Decoder output
Original frame 1   Zero                 Compressed frame 1             Zero                 Decoded frame 1
Original frame 2   Original frame 1     Compressed residual frame 2    Decoded frame 1      Decoded frame 2
Original frame 3   Original frame 2     Compressed residual frame 3    Decoded frame 2      Decoded frame 3
...                ...                  ...                            ...                  ...

The decoder faces a potential problem that can be illustrated as follows. Table 3.4 shows the sequence of operations required to encode and decode a series of video frames using frame differencing. For the first frame the encoder and decoder use no prediction. The problem starts with frame 2: the encoder uses the original frame 1 as a prediction and encodes the resulting residual. However, the decoder only has the decoded frame 1 available to form the prediction. Because the coding process is lossy, there is a difference between the decoded and original frame 1 which leads to a small error in the prediction of frame 2 at the decoder. This error will build up with each successive frame and the encoder and decoder predictors will rapidly 'drift' apart, leading to a significant drop in decoded quality.

The solution to this problem is for the encoder to use a decoded frame to form the prediction. Hence the encoder in the above example decodes (or reconstructs) frame 1 to form a prediction for frame 2. The encoder and decoder use the same prediction and drift should be reduced or removed. Figure 3.17 shows the complete encoder, which now includes a decoding 'loop' in order to reconstruct its prediction reference. The reconstructed (or 'reference') frame is stored in the encoder and in the decoder to form the prediction for the next coded frame.

Figure 3.17 Encoder with decoding loop

3.4.2 Motion-compensated Prediction

Frame differencing gives better compression performance than intra-frame coding when successive frames are very similar, but does not perform well when there is a significant change between the previous and current frames. Such changes are usually due to movement in the video scene and a significantly better prediction can be achieved by estimating this movement and compensating for it. Figure 3.18 shows a video CODEC that uses motion-compensated prediction. Two new steps are required in the encoder:

1. Motion estimation: a region of the current frame (often a rectangular block of luminance samples) is compared with neighbouring regions of the previous reconstructed frame. The motion estimator attempts to find the 'best match', i.e. the neighbouring block in the reference frame that gives the smallest residual block.

2. Motion compensation: the 'matching' region or block from the reference frame (identified by the motion estimator) is subtracted from the current region or block.

Figure 3.18 Video CODEC with motion estimation and compensation

The decoder carries out the same motion compensation operation to reconstruct the current frame. This means that the encoder has to transmit the location of the 'best' matching blocks to the decoder (typically in the form of a set of motion vectors). Figure 3.19 shows a residual frame produced by subtracting a motion-compensated version of the previous frame from the current frame (shown in Figure 3.16). The residual frame clearly contains less data than the residual in Figure 3.16. This improvement in compression does not come without a price: motion estimation can be very computationally intensive. The design of a motion estimation algorithm can have a dramatic effect on the compression performance and computational complexity of a video CODEC.
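A basic block-matching motion estimator can be sketched as follows. This is an illustrative exhaustive 'full search' using the sum of absolute differences (SAD) as the matching criterion; the block size, search range and the choice of SAD are assumptions for the sketch rather than requirements of any standard.

    def sad(current, reference, cx, cy, rx, ry, n):
        """Sum of absolute differences between the n x n block of the current frame
        at (cx, cy) and the candidate block of the reference frame at (rx, ry)."""
        total = 0
        for j in range(n):
            for i in range(n):
                total += abs(current[cy + j][cx + i] - reference[ry + j][rx + i])
        return total

    def full_search(current, reference, cx, cy, n=16, search=7):
        """Find the motion vector (dx, dy) within +/-search samples that minimises
        the SAD for the block at (cx, cy). Returns (dx, dy, best_sad)."""
        height, width = len(reference), len(reference[0])
        best = (0, 0, sad(current, reference, cx, cy, cx, cy, n))
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                rx, ry = cx + dx, cy + dy
                if 0 <= rx and 0 <= ry and rx + n <= width and ry + n <= height:
                    cost = sad(current, reference, cx, cy, rx, ry, n)
                    if cost < best[2]:
                        best = (dx, dy, cost)
        return best

The matching block at (cx + dx, cy + dy) in the reference frame is then subtracted from the current block (motion compensation) and the vector (dx, dy) is transmitted to the decoder. The nested search loops make clear why motion estimation dominates encoder complexity and why faster, sub-optimal search strategies are widely used in practice.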
3.4.3 Transform, Quantisation and Entropy Encoding

A block or image transform is applied to the residual frame and the coefficients are quantised and reordered. Run-level pairs are entropy coded as before (although the statistical distribution and hence the coding tables are generally different for inter-coded data). If motion-compensated prediction is used, motion vector information must be sent in addition to the run-level data. The motion vectors are typically entropy encoded in a similar way to run-level pairs, i.e. commonly occurring motion vectors are coded with shorter codes and uncommon vectors are coded with longer codes.

3.4.4 Decoding

A motion-compensated decoder (Figure 3.18) is usually simpler than the corresponding encoder. The decoder does not need a motion estimation function (since the motion information is transmitted in the coded bit stream) and it contains only a decoding path (compared with the encoding and decoding paths in the encoder).

3.5 SUMMARY

Efficient coding of images and video sequences involves creating a model of the source data that converts it into a form that can be compressed. Most image and video CODECs developed over the last two decades have been based around a common set of 'building blocks'. For motion video compression, the first step is to create a motion-compensated prediction of the frame to be compressed, based on one or more previously transmitted frames. The difference between this model and the actual input frame is then coded using an image CODEC. The data is transformed into another domain (e.g. the DCT or wavelet domain), quantised, reordered and compressed using an entropy encoder. A decoder must reverse these steps to reconstruct the frame: however, quantisation cannot be reversed and so the decoded frame is an imperfect copy of the original.

An encoder and decoder must clearly use a compatible set of algorithms in order to successfully exchange compressed image or video data. Of prime importance is the syntax or structure of the compressed data. In the past 15 years there has been a significant worldwide effort to develop standards for video and image compression. These standards generally describe a syntax (and a decoding process) to support video or image communications for a wide range of applications. Chapters 4 and 5 provide an overview of the main standards bodies and the JPEG, MPEG and H.26x video and image coding standards.

4 Video Coding Standards: JPEG and MPEG

4.1 INTRODUCTION

The majority of video CODECs in use today conform to one of the international standards for video coding. Two standards bodies, the International Standards Organisation (ISO) and the International Telecommunications Union (ITU), have developed a series of standards that have shaped the development of the visual communications industry. The ISO JPEG and MPEG-2 standards have perhaps had the biggest impact: JPEG has become one of the most widely used formats for still image storage and MPEG-2 forms the heart of digital television and DVD-video systems.
The ITU's H.261 standard was originally developed for video conferencing over the ISDN, but H.261 and H.263 (its successor) are now widely used for real-time video communications over a range of networks including the Internet. This chapter begins by describing the process by which these standards are proposed, developed and published. We describe the popular ISO coding standards, JPEG and JPEG-2000 for still images, and MPEG-1, MPEG-2 and MPEG-4 for moving video. In Chapter 5 we introduce the ITU-T H.261, H.263 and H.26L standards.

4.2 THE INTERNATIONAL STANDARDS BODIES

It was recognised in the 1980s that video coding and transmission could become a commercially important application area. The development of video coding technology since then has been bound up with a series of international standards for image and video coding. Each of these standards supports a particular application of video coding (or a set of applications), such as video conferencing and digital television. The aim of an image or video coding standard is to support a particular class of application and to encourage interoperability between equipment and systems from different manufacturers. Each standard describes a syntax or method of representation for compressed images or video. The developers of each standard have attempted to incorporate the best developments in video coding technology (in terms of coding efficiency and ease of practical implementation). Each of the international standards takes a similar approach to meeting these goals. A video coding standard describes syntax for representing compressed video data and the procedure for decoding this data, as well as (possibly) a 'reference' decoder and methods of proving conformance with the standard.

In order to provide the maximum flexibility and scope for innovation, the standards do not define a video or image encoder: this is left to the designer's discretion. However, in practice the syntax elements and reference decoder limit the scope for alternative designs that still meet the requirements of the standard.

4.2.1 The Expert Groups

The most important developments in video coding standards have been due to two international standards bodies: the ITU (formerly the CCITT)¹ and the ISO.² The ITU has concentrated on standards to support real-time, two-way video communications. The group responsible for developing these standards is known as VCEG (Video Coding Experts Group) and has issued:

• H.261 (1990): Video telephony over constant bit-rate channels, primarily aimed at ISDN channels of p x 64 kbps.

• H.263 (1995): Video telephony over circuit- and packet-switched networks, supporting a range of channels from low bit rates (20-30 kbps) to high bit rates (several Mbps).

• H.263+ (1998), H.263++ (2001): Extensions to H.263 to support a wider range of transmission scenarios and improved compression performance.

• H.26L (under development): Video communications over channels ranging from very low (under 20 kbps) to high bit rates.

The H.26x series of standards will be described in Chapter 5. In parallel with the ITU's activities, the ISO has issued standards to support storage and distribution applications. The two relevant groups are JPEG (Joint Photographic Experts Group) and MPEG (Moving Picture Experts Group) and they have been responsible for:

• JPEG (1992)³: Compression of still images for storage purposes.

• MPEG-1 (1993)⁴: Compression of video and audio for storage and real-time playback on CD-ROM (at a bit rate of 1.4 Mbps).
• MPEG-2 (1995)⁵: Compression and transmission of video and audio programmes for storage and broadcast applications (at typical bit rates of 3-5 Mbps and above).

• MPEG-4 (1998)⁶: Video and audio compression and transport for multimedia terminals (supporting a wide range of bit rates from around 20-30 kbps to high bit rates).

• JPEG-2000 (2000)⁷: Compression of still images (featuring better compression performance than the original JPEG standard).

Since releasing Version 1 of MPEG-4, the MPEG committee has concentrated on 'framework' standards that are not primarily concerned with video coding:

• MPEG-7⁸: Multimedia Content Description Interface. This is a standard for describing multimedia content data, with the aim of providing a standardised system for content-based indexing and retrieval of multimedia information. MPEG-7 is concerned with access to multimedia data rather than the mechanisms for coding and compression. MPEG-7 is scheduled to become an international standard in late 2001.

• MPEG-21⁹: Multimedia Framework. The MPEG-21 initiative looks beyond coding and indexing to the complete multimedia content 'delivery chain', from creation through production and delivery to 'consumption' (e.g. viewing the content). MPEG-21 will define key elements of this delivery framework, including content description and identification, content handling, intellectual property management, terminal and network interoperation and content representation. The motivation behind MPEG-21 is to encourage integration and interoperation between the diverse technologies that are required to create, deliver and decode multimedia data. Work on the proposed standard started in June 2000.

Figure 4.1 shows the relationship between the standards bodies, the expert groups and the video coding standards. The expert groups have addressed different application areas (still images, video conferencing, entertainment and multimedia), but in practice there are many overlaps between the applications of the standards. For example, a version of JPEG, Motion JPEG, is widely used for video conferencing and video surveillance; MPEG-1 and MPEG-2 have been used for video conferencing applications; and the core algorithms of MPEG-4 and H.263 are identical. In recognition of these natural overlaps, the expert groups have cooperated at several stages and the result of this cooperation has led to outcomes such as the ratification of MPEG-2 (Video) as ITU standard H.262 and the incorporation of 'baseline' H.263 into MPEG-4 (Video). There is also interworking between the VCEG and MPEG committees and other related bodies such as the Internet Engineering Task Force (IETF), industry groups (such as the Digital Audio Visual Interoperability Council, DAVIC) and other groups within ITU and ISO.

Figure 4.1 International standards bodies

4.2.2 The Standardisation Process

The development of an international standard for image or video coding is typically an involved process:

1. The scope and aims of the standard are defined. For example, the emerging H.26L standard is designed with real-time video communications applications in mind and aims to improve performance over the preceding H.263 standard.

2. Potential technologies for meeting these aims are evaluated, typically by competitive testing. The test scenario and criteria are defined and interested parties are encouraged to participate and demonstrate the performance of their proposed solutions.
The 'best' technology is chosen based on criteria such as coding performance and implementation complexity.

3. The chosen technology is implemented as a test model. This is usually a software implementation that is made available to members of the expert group for experimentation, together with a test model document that describes its operation.

4. The test model is developed further: improvements and features are proposed and demonstrated by members of the expert group and the best of these developments are integrated into the test model.

5. At a certain point (depending on the timescales of the standardisation effort and on whether the aims of the standard have been sufficiently met by the test model), the model is 'frozen' and the test model document forms the basis of a draft standard.

6. The draft standard is reviewed and after approval becomes a published international standard.

Officially, the standard is not available in the public domain until the final stage of approval and publication. However, because of the fast-moving nature of the video communications industry, draft documents and test models can be very useful for developers and manufacturers. Many of the ITU VCEG documents and models are available via public FTP.¹⁰ Most of the MPEG working documents are restricted to members of MPEG itself, but a number of overview documents are available at the MPEG website.¹¹ Information and links about JPEG and MPEG are available.¹²,¹³ Keeping in touch with the latest developments and gaining access to draft standards are powerful reasons for companies and organisations to become involved with the MPEG, JPEG and VCEG committees.

4.2.3 Understanding and Using the Standards

Published ITU and ISO standards may be purchased from the relevant standards bodies.¹,² For developers of standards-compliant video coding systems, the published standard is an essential point of reference as it defines the syntax and capabilities that a video CODEC must conform to in order to successfully interwork with other systems. However, the standards themselves are not an ideal introduction to the concepts and techniques of video coding: the aim of the standard is to define the syntax as explicitly and unambiguously as possible and this does not make for easy reading. Furthermore, the standards do not necessarily indicate practical constraints that a designer must take into account. Practical issues and good design techniques are deliberately left to the discretion of manufacturers in order to encourage innovation and competition, and so other sources are a much better guide to practical design issues. This book aims to collect together information and guidelines for designers and integrators; other texts that may be useful for developers are listed in the bibliography. The test models produced by the expert groups are designed to facilitate experimentation and comparison of alternative techniques, and the test model (a software model with an accompanying document) can provide a valuable insight into the implementation of the standard. Further documents such as implementation guides (e.g. H.263 Appendix III¹⁴) are produced by the expert groups to assist with the interpretation of the standards for practical applications.

In recent years the standards bodies have recognised the need to direct developers towards certain subsets of the tools and options available within the standard.
For example, H.263 now has a total of 19 optional modes and it is unlikely that any particular application would need to implement all of these modes. This has led to the concept of profiles and levels. A 'profile' describes a subset of functionalities that may be suitable for a particular application and a 'level' describes a subset of operating resolutions (such as frame resolution and frame rates) for certain applications.

4.3 JPEG (JOINT PHOTOGRAPHIC EXPERTS GROUP)

4.3.1 JPEG

International standard ISO 10918 [3] is popularly known by the acronym of the group that developed it, the Joint Photographic Experts Group. Released in 1992, it provides a method and syntax for compressing continuous-tone still images (such as photographs). Its main application is storage and transmission of still images in compressed form, and it is widely used in digital imaging, digital cameras, embedding images in web pages, and many more applications. Whilst aimed at still image compression, JPEG has found some popularity as a simple and effective method of compressing moving images (in the form of Motion JPEG). The JPEG standard defines a syntax and decoding process for a baseline CODEC and this includes a set of features that are designed to suit a wide range of applications. Further optional modes are defined that extend the capabilities of the baseline CODEC.

The baseline CODEC

A baseline JPEG CODEC is shown in block diagram form in Figure 4.2. Image data is processed one 8 x 8 block at a time. Colour components or planes (e.g. R, G, B or Y, Cr, Cb) may be processed separately (one complete component at a time) or in interleaved order (e.g. a block from each of the three colour components in succession). Each block is coded using the following steps.

Figure 4.2 JPEG baseline CODEC block diagram

Level shift  Input data is shifted so that it is distributed about zero: e.g. an 8-bit input sample in the range 0:255 is shifted to the range -128:127 by subtracting 128.

Forward DCT  An 8 x 8 block transform, described in Chapter 7.

Quantiser  Each of the 64 DCT coefficients Cij is quantised by integer division:

    Cqij = round(Cij / Qij)

Qij is a quantisation parameter and Cqij is the quantised coefficient. A larger value of Qij gives higher compression (because more coefficients are set to zero after quantisation) at the expense of increased distortion in the decoded image. The 64 parameters Qij (one for each coefficient position ij) are stored in a quantisation 'map'. The map is not specified by the standard but can be perceptually weighted so that lower-frequency coefficients (DC and low-frequency AC coefficients) are quantised less than higher-frequency coefficients. Figure 4.3 gives an example of a quantisation map: the weighting means that the visually important lower frequencies (to the top left of the map) are preserved and the less important higher frequencies (to the bottom right) are more highly compressed.

Figure 4.3 JPEG quantisation map (low frequencies at the top left, high frequencies at the bottom right)

Zigzag reordering  The 8 x 8 block of quantised coefficients is rearranged in a zigzag order so that the low frequencies are grouped together at the start of the rearranged array.
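The quantisation and zigzag reordering steps are straightforward to prototype. The following Python sketch is purely illustrative (the input block and the quantisation map are invented values, not taken from the standard): it quantises an 8 x 8 block of DCT coefficients as Cqij = round(Cij / Qij) and reads the result out in zigzag order.

import numpy as np

def zigzag_indices(n=8):
    """Generate (row, col) pairs in zigzag order by walking the anti-diagonals
    of an n x n block, alternating direction so that low-frequency positions
    (top left) come first."""
    order = []
    for s in range(2 * n - 1):
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        order.extend(diag if s % 2 else diag[::-1])
    return order

def quantise_and_scan(dct_block, qmap):
    """Quantise an 8 x 8 block of DCT coefficients (Cq = round(C / Q)) and
    return the quantised coefficients as a 64-element list in zigzag order."""
    quantised = np.rint(dct_block / qmap).astype(int)
    return [int(quantised[r, c]) for r, c in zigzag_indices()]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    dct_block = rng.normal(0, 60, size=(8, 8))                # stand-in for real DCT output
    qmap = 16 + 8 * np.add.outer(np.arange(8), np.arange(8))  # coarser at high frequencies
    print(quantise_and_scan(dct_block, qmap))

With the example map above, most of the high-frequency positions quantise to zero, which is exactly what makes the subsequent run-length and entropy coding effective.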
DC differential prediction  Because there is often a high correlation between the DC coefficients of neighbouring image blocks, a prediction of the DC coefficient is formed from the DC coefficient of the preceding block:

    DCpred = DCcurrent - DCprevious

The differential value DCpred is coded and transmitted, rather than the actual coefficient DCcurrent.

Entropy encoding  The differential DC coefficients and the AC coefficients are encoded as follows. The number of bits required to represent the DC coefficient, SSSS, is encoded using a variable-length code. For example, SSSS=0 indicates that the DC coefficient is zero; SSSS=1 indicates that the DC coefficient is +/-1 (i.e. it can be represented with 1 bit); SSSS=2 indicates that the coefficient is +3, +2, -2 or -3 (which can be represented with 2 bits). The actual value of the coefficient, an SSSS-bit number, is appended to the variable-length code (except when SSSS=0). Each AC coefficient is coded as a variable-length code RRRRSSSS, where RRRR indicates the number of preceding zero coefficients and SSSS indicates the number of bits required to represent the coefficient (SSSS=0 is not required). The actual value is appended to the variable-length code as described above.

Example  A run of six zeros followed by the value +5 would be coded as:

    [RRRR=6] [SSSS=3] [Value=+5]

Marker insertion  Marker codes are inserted into the entropy-coded data sequence. Examples of markers include the frame header (describing the parameters of the frame such as width, height and number of colour components), scan headers (see below) and restart interval markers (enabling a decoder to resynchronise with the coded sequence if an error occurs).

The result of the encoding process is a compressed sequence of bits, representing the image data, that may be transmitted or stored. In order to view the image, it must be decoded by reversing the above steps, starting with marker detection and entropy decoding and ending with an inverse DCT. Because quantisation is not a reversible process (as discussed in Chapter 3), the decoded image is not identical to the original image.

Lossless JPEG

JPEG also defines a lossless encoding/decoding algorithm that uses DPCM (described in Chapter 3). Each pixel is predicted from up to three neighbouring pixels and the predicted value is entropy coded and transmitted. Lossless JPEG guarantees image fidelity at the expense of relatively poor compression performance.

Optional modes

Progressive encoding involves encoding the image in a series of progressive 'scans'. The first scan may be decoded to provide a 'coarse' representation of the image; decoding each subsequent scan progressively improves the quality of the image until the final quality is reached. This can be useful when, for example, a compressed image takes a long time to transmit: the decoder can quickly recreate an approximate image which is then further refined in a series of passes. Two versions of progressive encoding are supported: spectral selection, where each scan consists of a subset of the DCT coefficients of every block (e.g. (a) DC only; (b) low-frequency AC; (c) high-frequency AC coefficients), and successive approximation, where the first scan contains the N most significant bits of each coefficient and later scans contain the less significant bits. Figure 4.4 shows an image encoded and decoded using progressive spectral selection. The first image contains the DC coefficients of each block, the second image contains the DC and two lowest AC coefficients and the third contains all 64 coefficients in each block.
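Returning to the baseline entropy coding stage described above, the sketch below illustrates the symbol structure in simplified form: the zigzag-ordered coefficients of one block are mapped to a differential DC (SSSS, value) pair and AC (RRRR, SSSS, value) triples. The Huffman tables, the escape code for runs longer than 15 and the sign/magnitude bit packing are deliberately omitted, so this is an illustration of the symbol structure rather than a bit-exact JPEG encoder.

def size_category(value):
    """SSSS: the number of bits needed to represent the magnitude of 'value'
    (0 for zero, 1 for +/-1, 2 for magnitudes 2..3, 3 for 4..7, and so on)."""
    return 0 if value == 0 else abs(value).bit_length()

def block_symbols(zigzag_coeffs, prev_dc):
    """Map one zigzag-ordered block (64 quantised coefficients) to a list of
    symbols: a differential DC (SSSS, value) pair followed by AC
    (RRRR, SSSS, value) triples, where RRRR counts the zero coefficients
    preceding each non-zero AC coefficient. Returns the symbols and the DC
    value to use as the predictor for the next block."""
    dc_diff = zigzag_coeffs[0] - prev_dc
    symbols = [("DC", size_category(dc_diff), dc_diff)]
    run = 0
    for coeff in zigzag_coeffs[1:]:
        if coeff == 0:
            run += 1
        else:
            symbols.append(("AC", run, size_category(coeff), coeff))
            run = 0
    symbols.append(("EOB",))  # end-of-block symbol for the trailing run of zeros
    return symbols, zigzag_coeffs[0]

if __name__ == "__main__":
    # Example from the text: a run of six zeros followed by +5 -> (RRRR=6, SSSS=3, +5)
    coeffs = [52, -3] + [0] * 6 + [5] + [0] * 55
    symbols, dc = block_symbols(coeffs, prev_dc=48)
    print(symbols)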
(a) + Figure 4.4 Progressive encoding example (spectralselection):(a) DC only; (b) DC two AC; (c) all coefficients P E G (JOINT PHOTOGRAPHIC EXPERTS GROUP) 55 Figure 4.4 (Contined) Hierarchical encoding compresses an image asa series of components at different spatial resolutions. For example, the first component may be a subsampled image at a low spatial resolution, followed by furthercomponentsat successively higher resolutions. Each successive componentisencoded differentially from previous components,i.e. only the differences are encoded. A decoder maychoose to decode only a subset of the fullresolution image; alternatively, the successive components may be used to progressively refine the resolution in a similar way to progressive encoding. 56 VIDEO CODING STANDARDS: JPEG AND MPEG The two progressive encoding modes and the hierarchical encoding mode can be thought of as scalable coding modes. Scalable coding will be discussed further in the section on MPEG-2. 4.3.2 Motion JPEG A ‘Motion JPEG’ or MJPEG CODEC codes a video sequence as a series of JPEG images, each corresponding to one frame of video (i.e. a series of intra-coded frames). Originally, the JPEG standard was not intended to beusedin this way: however, MJPEG has becomepopular and is used in a number of video communications and storage applications. No attempt is made to exploit the inherent temporal redundancy in a moving video sequence and so compression performance is poor compared with inter-frame CODECs (see Chapter 5 , ‘Performance Comparison’). However, MJPEG has a number of practical advantages: 0 Low complexity: algorithmic complexity, and requirements for hardware, processing and storage are very low compared with even a basic inter-frame CODEC (e.g. H.261). 0 Error tolerance: intra-frame codinglimits the effect of an error to a single decoded frame and so is inherently resilient to transmission errors. Until recent developments in error resilience (see Chapter 1l), MJPEG outperformed inter-frame CODECs innoisy environments. 0 Market awareness: JPEG is perhaps the most widely known and used of the compression standards and so potential users are already familiar with the technology of Motion JPEG. Because of its poor compression performance, MJPEG is only suitable for high-bandwidth communications (e.g. over dedicated networks). Perversely, this means that users generally have a goodexperience of MJPEG because installations do not tend to suffer from the bandwidth and delay problems encountered by inter-frame CODECs used over ‘best effort’ networks (such as the Internet) or low bit-rate channels. An MJPEGcoding integrated circuit(IC), the Zoran ZR36060, is described in Chapter 12. 4.3J.P3 EG-2000 The original JPEG standard has gained widespread acceptance and is now ubiquitous throughout computing applications: it is the main format for photographic images on the world wide web and it is widelyused for image storage. However, the block-based DCT algorithm has a number of disadvantages, perhaps the most important of which is the ‘blockiness’ of highly compressed JPEG images (see Chapter 9). Since its release, many alternative coding schemes have been shown to outperform baseline JPEG. The need for better performance at high compression ratios led to the development of the JPEG-2000 The features that JPEG-2000 aims to support are as follows: 0 Good compression performance, particularly at high compression ratios. 
P(HJOOJITPNOETGREAXGPRHEORICUTPS) 57 0 Efficient compression of continuous-tone, bi-level and compound images (e.g. photographic images with overlaid text: the original JPEG does not handle this type of image well). Lossless and lossy compression (within the same compression framework). 0 Progressive transmission (JPEG-2000 supports SNR scalability, a similar concept to JPEG’s successive approximation mode, and spatial scalability, similar to JPEG’s hierarchical mode). Region-of-interest (ROI) coding. This feature allows an encoder to specify an arbitrary region within the image that should be treated differently during encoding: e.g. by encoding the region with a higher quality or by allowing independent decoding of the ROI. 0 Error resilience tools including data partitioning (see the description of MPEG-2 below), error detection and concealment (see Chapter 11 for more details). Open architecture. The JPEG-2000 standard provides an open ‘framework’ which should make it relatively easy to add further coding features either as part of the standard or as a proprietary ‘add-on’ to the standard. The architecture of a JPEG-2000 encoder is shown in Figure 4.5. This is superficially similar to the JPEG architecture but one important difference is that the same architecture may be used for lossy or lossless coding. The basic coding unit of JPEG-2000 is a ‘tile’. This is normally a 2” x 2” region of the image, and the image is ‘covered’ by non-overlapping identically sized tiles. Each tile is encoded as follows: Transform: A wavelet transform is carried out on each tile to decompose it into a series of sub-bands (see Sections 3.3.1 and 7.3). The transform may be reversible (for lossless coding applications) or irreversible (suitable for lossy coding applications). Quantisation: The coefficients of the wavelet transform are quantised (as described in Chapter 3) according to the ‘importance’ of each sub-band to the final image appearance. There is an option to leave the coefficients unquantised (lossless coding). Entropy coding: JPEG-2000 uses a form of arithmetic coding to encode the quantised coefficients prior to storage or transmission. Arithmetic coding can provide better compression efficiency than variable-length coding and is described in Chapter 8. The result is a compression standard that can give significantly better image compression performance than JPEG. For the same image quality, JPEG-2000 can usually compress images by at least twice as much as JPEG.At high compression ratios, the quality of images Imagedata -1 H H trwanasvfoerlmet Quantiser Arithmetic 1- encoder I l l I Figure 4.5 Architecture of JPEG-2000encoder 58 VIDEO COSDTAINNMGDAPJAENPGRDEDGS: degrades gracefully, with the decoded image showing a gradual ‘blurring’ effect rather than themoreobviousblockingeffectassociated with the DCT. Theseperformancegains areachievedattheexpense of increasedcomplexity and storagerequirementsduring encoding and decoding. One effect of this is that images take longer to store and display using JPEG-2000 (though this shouldbe less of an issue as processors continuteo get faster). 4.4 MPEG(MOVINGPICTUREEXPERTSGROUP) 4.4M.1PEG-1 The first standardproduced by theMovingPictureExpertsGroup,popularly known as MPEG-1, was designed to provide videoand audio compression for storagaend playback on CD-ROMs. A CD-ROM played at ‘single speed’ has a transfer rate of 1.4Mbps. MPEG-1 aims to compress videoand audio to abit rate of 1.4 Mbpswith a quality that is comparable to VHS videotape. 
The target market was the ‘video CD’, a standard CD containing up to 70 minutes of stored video and audio. The video CD was never a commercial success: the quality improvement over VHS tape was not sufficient to tempt consumers to replace their video cassette recorders and the maximum lengotfh70 minutes createdan irritating break in a feature-lengthmovie. However, MPEG-1isimportantfor two reasons: it has gained widespread use in other video storage and transmission applications (including CD-ROM storageaspart of interactiveapplications and videoplayback over the Internet), andits functionality is used and extended in the popular MPEG-2 standard. The MPEG-1 standard consistsof three parts. Part 116 deals with system issues (including the multiplexingof coded videoand audio), PartZ4 deals with compressed videoand Part 317 with compressed audio. Part 2 (videow) as developed with aim of supporting efficient coding of video for CD playback applications and achieving video quality comparable to,or better than, VHS videotape at CD bit rates (around 1.2Mbps for video). There was a requirement tominimisedecodingcomplexitysince most consumerapplications were envisagedto involve decoding and playbackonly, not encoding. HenceMPEG- 1 decodingis considerably simpler than encoding (unlike JPEG, where the encoder and decoder have similar levels of complexity). MPEG-I features The input videosignal to an MPEG-1 video encoder is 4:2 :0 Y :Cr :Cb format (see Chapter 2) with a typical spatial resolution of 352 x 288 or 352 x 240 pixels. Each frame of video is processed in units of a macroblock, corresponding to a 16 x 16 pixel area in the displayed frame. This area is made up of 16 x 16 luminance samples, 8 x 8 Cr samples and 8 x 8 Cb samples (because Cr andCb have half the horizontaland vertical resolutionof the luminance component). A macroblock consists of six 8 x 8 blocks: four luminance (Y) blocks, one Cr block and one Cb block (Figure 4.6). Eachframe of video is encoded to produceacoded picture. Therearethree main types: I-pictures, P-pictures and B-pictures. (The standard specifies a fourth picture type, D-pictures, but these are seldom used in practical applications.) MPEG (MOVING PICTURE EXPERTS GROUP) 59 16 8 Figure 4.6 Structure of amacroblock l-pictures are intra-coded without any motion-compensated prediction (in a similar way to a baseline JPEG image).An I-picture is used as a reference for further predicted pictures (P- and B-pictures, described below). P-pictures are inter-coded using motion-compensated prediction from areference picture (the P-picture or I-picture preceding the current P-picture). Hence a P-picture is predicted using forward prediction and a P-picture may itself be used asareferenceforfurther predicted pictures (P- and B-pictures). B-pictures areinter-coded using motion-compensatedpredictionfrom two reference pictures, the P- and/or I-pictures before and after the current B-picture.Two motion vectors are generated for each macroblock in a B-picture (Figure 4.7): one pointing to a matching area in the previous reference picture (aforward vector) and one pointing to a matching area B-picture Current macroblock l 0 Backward reference area vector Forward reference area Figure 4.7 Prediction of B-picture macroblock using forward and backward vectors 60 VIDEO CODING STANDARDS:P E G AND MPEG / Bo I1 / Figure 4.8 MPEG-1 group of pictures (IBBPBBPBB): display order in the futurereferencepicture(a backward vector). 
A motion-compensated prediction macroblock can be formed in three ways: forward prediction using the forward vector, backward prediction using the backward vector, or bidirectional prediction (where the prediction reference is formed by averaging the forward and backward prediction references). Typically, an encoder chooses the prediction mode (forward, backward or bidirectional) that gives the lowest energy in the difference macroblock. B-pictures are not themselves used as prediction references for any further predicted frames.

Figure 4.8 shows a typical series of I-, B- and P-pictures. In order to encode a B-picture, two neighbouring I- or P-pictures ('anchor' pictures or 'key' pictures) must be processed and stored in the prediction memory, introducing a delay of several frames into the encoding procedure. Before frame B2 in Figure 4.8 can be encoded, its two 'anchor' frames I1 and P4 must be processed and stored, i.e. frames 1-4 must be processed before frames 2 and 3 can be coded. In this example there is a delay of at least three frames during encoding (frames 2, 3 and 4 must be stored before B2 can be coded) and this delay will be larger if more B-pictures are used.

In order to limit the delay at the decoder, encoded pictures are reordered before transmission, such that all the anchor pictures required to decode a B-picture are placed before the B-picture. Figure 4.9 shows the same series of frames, reordered prior to transmission: P4 is now placed before B2 and B3. Decoding proceeds as shown in Table 4.1: P4 is decoded immediately after I1 and is stored by the decoder. B2 and B3 can now be decoded and displayed (because their prediction references, I1 and P4, are both available), after which P4 is displayed. There is at most one frame of delay between decoding and display and the decoder only needs to store two decoded frames. This is one example of 'asymmetry' between encoder and decoder: the delay and storage in the decoder are significantly lower than in the encoder.

Figure 4.9 MPEG-1 group of pictures: transmission order

Table 4.1 MPEG-1 decoding and display order
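The display-to-transmission reordering of Figures 4.8 and 4.9 amounts to holding each run of B-pictures back until the anchor picture that follows them has been sent. A minimal sketch is shown below; the picture labels are plain strings (in a real encoder whole coded pictures would be reordered, and trailing B-pictures would wait for the first anchor picture of the next group).

def display_to_transmission_order(display_order):
    """Reorder a list of picture labels ('I', 'P' or 'B' prefixed) from display
    order to transmission order: each run of B-pictures is transmitted after
    the following anchor (I- or P-) picture, which the decoder needs first."""
    transmitted = []
    pending_b = []
    for pic in display_order:
        if pic.startswith("B"):
            pending_b.append(pic)      # hold back until the next anchor has been sent
        else:
            transmitted.append(pic)    # anchor picture (I or P)
            transmitted.extend(pending_b)
            pending_b = []
    return transmitted + pending_b     # trailing Bs: simplification, see note above

if __name__ == "__main__":
    gop = ["I1", "B2", "B3", "P4", "B5", "B6", "P7", "B8", "B9"]
    print(display_to_transmission_order(gop))
    # ['I1', 'P4', 'B2', 'B3', 'P7', 'B5', 'B6', 'B8', 'B9']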
In an I- or P-picture, quantised transform coefficients are rescaled andtransformedwith the inverse DCT to produce a stored referenceframefor further predictedP-orB-pictures.In the decoder, the coded datais entropy decoded, rescaled, inversetransformedandmotion compensated.The mostcomplex part of the CODEC is often the motion estimatorbecause bidirectional motion estimationiscomputationally intensive. Motion estimation is onlyrequired in the encoder and this is another example of asymmetry between the encoder and decoder. MPEG-I syntax The syntax of an MPEG- 1 coded video sequence forms a hierarchy ashown in Figure 4.10. The levels or layers of the hierarchy are as follows. Sequence layer This maycorrespond to acompleteencodedvideo programme. The sequence starts with a sequence header that describes certain key information about the coded sequence including picture resolutionand frame rate. The sequence consists of a series of groups ofpictures (GOPs), the next layer of the hierarchy. 62 VIDEO CODING STANDARDS: JPEG AND MPEG I Sequence l I Group of Pictures I . ... . . ... I Picture .. . . Slice 1 ... Figure4.10 MPEG- 1 synatxhierarchy GOP layer A GOP is one I-picture followed by a series of P- and B-pictures (e.g. Figure 4.8). In Figure 4.8, the GOP contains nine pictures (one I, two P and six B) but many other GOP structures are possible, for example: (a) All GOPs contain just one I-picture, i.e. no motion compensated prediction is used: this is similar to Motion JPEG. (b) GOPs contain only I- and P-pictures, i.e. no bidirectional prediction is used: compression efficiency is relatively poor but complexityis low (since B-picturesare more complex to generate). (c) LargeGOPs:theproportion of I-pictures in thecodedstreamis lowand hence compression efficiency is high. However, there arefew synchronisation points which may not be ideal for random access and for error resilience. (d) Small GOPs: there is a high proportion of I-pictures and so compression efficiency is low, however there are frequent opportunities for resynchronisation. An encoder need not keep a consistent GOP structure within a sequence. It may be useful to vary the structure occasionally, for example by starting a new GOP when a scene change or cut occurs in the video sequence. MF’EG (MOVING PICTURE EXPERTS GROUP) 63 Figure 4.11 Example of MPEG-1 slices Picturelayer A picture defines asinglecodedframe.Thepictureheaderdescribesthe type of coded picture (I, P, B) and a temporal reference that defines when the picture should be displayed in relation to the otherpictures in the sequence. Slicelayer A picture is made upof anumber of slices,each of whichcontains an integranl umber of macroblocks. In MPEG-l thereisnorestriction on the sizeor arrangement of slices in a picture, exceptthat slices should cover the picture in rasterorder. Figure 4.11 shows one possible arrangement: each shaded region in this figure is a single slice. A slice starts with a slice header that defines its position. Each slice may be decoded independently of other slices within the picture and this helps the decoder to recover from transmission errors: if an error occurswithin a slice, the decodecran always restart decoding from the next slice header. Macroblock layer A slice is made upofan integralnumber of macroblocks,each of which consists of six blocks (Figure 4.6). The macroblock header describes the type of macroblock, motion vector(s) and defines which 8 x 8 blocks actuallycontaincoded transform data. 
The picture type (I, P or B) defines the ‘default’ prediction mode for each macroblock, but individual macroblocks within P- orB-pictures may beintra-coded if required (i.e. coded without any motion-compensated prediction). This can be useful if no good matchcan be found within the search area itnhe reference frames since mitay be more efficient to code the macroblock without any prediction. Block layer A block contains variable-length code(s) that represent the quantised transform coefficients in an 8 x 8 block. Each DC coefficient (DCT coefficient [0,01) is coded differentially from the DC coefficient of the previous coded block, to exploit the fact that neighbouring blocks tend to have very similar DC (average) values. AC coefficients (all other coefficients) are codedasa(run,level) pair, where ‘run’ indicates the number of preceding zero coefficients and ‘level’ the value of a non-zero coefficient. 64 VIDEO CODING STANDARDS: JPEG AND MPEG 4.4M.2PEG-2 The next important entertainment application for coded video (after CD-ROM storage) was digital television. In order to provide an improved alternative to analogue television, several key features were required of the video coding algorithm. It had to efficiently support larger frame sizes (typically 720 x S76 or 720 x 480 pixels for ITU-R 601 resolution) and coding of interlaced video. MPEG-1 was primarily designed to support progressive video, where eachframeisscanned as a single unit in raster order.At television-quality resolutions, interlaced video(where a frameismade up of two interlaced ‘fields’ as described in Chapter 2) gives a smoother video image. Because the two fields are captured at separate time intervals (typically 1/50or 1/60of a second apart), better performance may be achieved by coding the fields separately. MPEG-2 consists of three main sections: Video (described below), Audio” (based on MPEG-1audiocoding) and Systems” (defining,in more detail than MPEG-lSystems, multiplexing and transmission of the coded audio/visual stream). MPEG-2 Video is (almost) a superset of MPEG-I Video, i.e. most MPEG-I video sequences should be decodeable by an MPEG-2 decoder. The main enhancements added by the MPEG-2 standard are as follows: EfJicient coding of television-qualiry video The mostimportant application of MPEG-2is broadcast digital television. The ‘core’ functions of MPEG-2 (described as ‘main profile/main level’) are optimised for efficient coding of television resolutions at a bit rate of around 3-S Mbps. Support for coding of interlaced video MPEG-2 video hasseveral features that support flexible coding of interlaced video. The two fields that make up a complete interlaced frame can be encoded as separate pictures (field pictures), each of which is coded as an I-, P- or B-picture. P- and B- field pictures may be predicted from a field in another frame or from the other field in the current frame. Alternatively, the two fields may be handled as a single picture (aframe picture) with the luminance samples in each macroblock of a frame picture arranged in one of two ways. Frame DCT coding is similar to the MPEG-1 structure, where each of the four luminance blocks containsalternatelinesfrom bothfields. With $eld DCT coding, the top two luminance blocks contain only samples from the top field, and the bottom two luminance blocks containsamplesfrom the bottom field. Figure 4.12 illustrates the two coding structures. 
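The two arrangements of Figure 4.12 can be expressed directly in code. The sketch below assumes (for illustration) that the 16 x 16 luminance region of a macroblock is supplied as a NumPy array with the two fields interleaved line by line, top field on the even lines; it forms the four 8 x 8 luminance blocks for each mode.

import numpy as np

def frame_dct_blocks(luma16):
    """Frame DCT ordering: the four 8x8 luminance blocks are simply the four
    quadrants of the 16x16 region, so each block contains alternate lines
    from both fields."""
    return [luma16[0:8, 0:8], luma16[0:8, 8:16],
            luma16[8:16, 0:8], luma16[8:16, 8:16]]

def field_dct_blocks(luma16):
    """Field DCT ordering: the top two blocks take only top-field (even) lines
    and the bottom two blocks take only bottom-field (odd) lines."""
    top_field = luma16[0::2, :]      # 8 lines from the top field
    bottom_field = luma16[1::2, :]   # 8 lines from the bottom field
    return [top_field[:, 0:8], top_field[:, 8:16],
            bottom_field[:, 0:8], bottom_field[:, 8:16]]

if __name__ == "__main__":
    luma16 = np.arange(256).reshape(16, 16)
    print(frame_dct_blocks(luma16)[0][:2])   # first block: lines 0 and 1 (both fields)
    print(field_dct_blocks(luma16)[0][:2])   # first block: lines 0 and 2 (top field only)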
In a field picture, the upper and lower 16 x 8 sample regions of a macroblock may be motion-compensated independently: hence each of the two regions has its own vector (or two vectors in the case of a B-picture). This adds an overhead to the macroblock because of the extra vector(s) that must be transmitted. However, this 16 x 8 motion compensation mode can improve performance because a field picture has half the vertical resolution of a frame picture, so there are more likely to be significant differences in motion between the top and bottom halves of each macroblock.

Figure 4.12 (a) Frame and (b) field DCT coding

In dual-prime motion compensation mode, the current field (within a field or frame picture) is predicted from the two fields of the reference frame using a single vector together with a transmitted correction factor. The correction factor modifies the motion vector to compensate for the small displacement between the two fields in the reference frame.

Scalability

The progressive modes of JPEG described earlier are forms of scalable coding. A scalable coded bit stream consists of a number of layers: a base layer and one or more enhancement layers. The base layer can be decoded to provide a recognisable video sequence with limited visual quality, and a higher-quality sequence may be produced by decoding the base layer plus enhancement layer(s), with each extra enhancement layer improving the quality of the decoded sequence. MPEG-2 video supports four scalable modes.

Spatial scalability  This is analogous to hierarchical encoding in the JPEG standard. The base layer is coded at a low spatial resolution and each enhancement layer, when added to the base layer, gives a progressively higher spatial resolution.

Temporal scalability  The base layer is encoded at a low temporal resolution (frame rate) and the enhancement layer(s) are coded to provide higher frame rate(s) (Figure 4.13). One application of this mode is stereoscopic video coding: the base layer provides a monoscopic 'view' and an enhancement layer provides a stereoscopic offset 'view'. By combining the two layers, a full stereoscopic image may be decoded.

Figure 4.13 Temporal scalability

SNR scalability  In a similar way to the successive approximation mode of JPEG, the base layer is encoded at a 'coarse' visual quality (with high compression). Each enhancement layer, when added to the base layer, improves the video quality.

Data partitioning  The coded sequence is partitioned into two layers. The base layer contains the most 'critical' components of the coded sequence, such as header information, motion vectors and (optionally) low-frequency transform coefficients. The enhancement layer contains all remaining coded data (usually less critical to successful decoding).

These scalable modes may be used in a number of ways. A decoder may decode the current programme at standard ITU-R 601 resolution (720 x 576 pixels, 25 or 30 frames per second) by decoding just the base layer, whereas a 'high definition' decoder may decode one or more enhancement layers to increase the temporal and/or spatial resolution. The multiple layers can support simultaneous decoding by 'basic' and 'advanced' decoders. Transmission of the base and enhancement layers is usually more efficient than encoding and sending separate bit streams at the lower and higher resolutions.
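As a toy illustration of the layered idea (this is not the MPEG-2 spatial scalability algorithm, which works with motion compensation and upsampled prediction inside the coding loop), a base layer can be formed by downsampling the input picture and an enhancement layer by coding the difference between the original and the upsampled base layer:

import numpy as np

def make_layers(picture):
    """Toy two-layer decomposition: the base layer is a 2:1 downsampled picture
    (2x2 block averages) and the enhancement layer is the difference between
    the original and the upsampled (pixel-repeated) base layer."""
    h, w = picture.shape
    base = picture.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    upsampled = np.repeat(np.repeat(base, 2, axis=0), 2, axis=1)
    enhancement = picture - upsampled
    return base, enhancement

def reconstruct(base, enhancement=None):
    """Decode the base layer alone (low resolution, pixel-repeated here) or
    base plus enhancement (full resolution)."""
    upsampled = np.repeat(np.repeat(base, 2, axis=0), 2, axis=1)
    return upsampled if enhancement is None else upsampled + enhancement

if __name__ == "__main__":
    picture = np.random.default_rng(0).integers(0, 256, size=(16, 16)).astype(float)
    base, enh = make_layers(picture)
    assert np.allclose(reconstruct(base, enh), picture)  # both layers give the full picture
    print(base.shape, enh.shape)

A decoder with limited resources (or a poor channel) would use only the base layer; decoding the enhancement layer as well restores the full resolution.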
The base layer is the most ‘important’ to provide a visually acceptable decoded picture. Transmission errors in the base layer can have a catastrophic effect on picture quality, whereas errors in enhancementlayer (S) are likely to have a relativelyminor impact on quality. By protecting the base layer (for example using a separate transmission channel with a low error rate or by adding error correction coding), high visual quality can be maintained even when transmission errors occur (see Chapter 11). Profiles and levels Most applications require only a limited subset of the wide range of functions supported by MPEG-2. In order to encourage interoperability for certain ‘key’ applications (such as digital TV), the standard includes a set of recommended projiles and levels that each define a certain subset of the MPEG-2 functionalities. Each profiledefines a set of capabilities and the important ones are as follows: 0 Simple: 4 :2 : 0 sampling, only I- and P-pictures are allowed. Complexity is kept low at the expense of poor compression performance. 0 Main: This includes all of the core MPEG-2 capabilities including B-pictures and support for interlaced video. 4 :2 : 0 sampling is used. 0 4 ; 2 : 2: As the name suggests, 4 : 2 : 2 subsampling is used, i.et.he Cr and Cb components have fullvertical resolution and half horizontal resolution. Each macroblock contains eight blocks: four luminance, two Cr and two Cb. MPEG (MOVING PEIGCXRTPOUERURTEPS) 67 0 SNR: As ‘main’ profile, except that an enhancement layer is added to provide higher visual quality. 0 Spatial: As ‘SNR’ profile, except that spatial scalability may also be used to provide higher-quality enhancement layers. 0 High: As ‘Spatial’ profile, with the addition of support for 4 :2 :2 sampling. Each level defines spatial and temporal resolutions: 0 Low: Up to 352 x 288 frame resolution and up to 30 frames per second. 0 Main: Up to 720 X 576 frame resolution and up to 30 frames per second. 0 High-1440: Up to 1440 x 1152 frame resolution and up to 60 frames per second. 0 High: Up to 1920 x 1 152 frame resolution and up to 60 frames per second. The MPEG-2 standard defines certain recommended combinations ofprofiles and levels. Main projilellow level (using only frame encoding) is essentially MPEG-l. Main projilel main level is suitable for broadcast digital television and this is the most widely used profile/ level combination. Main projile lhigh level is suitable for high-definition television (HDTV). (Originally, the MPEG working group intended to release a further standard, MPEG-3, to support coding for HDTV applications. However, once it became clear thatthe MPEG-2 syntax could deal with this application adequately, work on this standard was dropped and so there is no MPEG-3 standard.) In addition to the main features described above, there are some further changes from the MPEG-1 standard. Slices in an MPEG-2 picture are constrained such that theymaynot overlap from onerow of macroblocks to the next (unlike MPEG-1 where a slice may occupy multiple rows of macroblocks). D-pictures in MPEG-1 were felt to be of limited benefit and are not supported in MPEG-2. 4.4.3 MPEG-4 The MPEG-I and MPEG-2 standards deal with complete video frames, each coded as a single unit. The MPEG-4standard6 was developed with the aim of extending the capabilities of the earlier standards in a number of ways. Support for low bit-rate applications MPEG-1 and MPEG-2 are reasonably efficient for coded bit rates above around 1Mbps. 
However, many emerging applications (particularly Internet-based applications) require a much lower transmission bit rate and MPEG-1 and 2 do not support efficient compression at low bit rates (tens of kbps or less). Supportforobject-basedcoding Perhaps the most fundamental shift in the MPEG-4 standard has been towards object-based or content-based coding, where a video scene can be handled as a set of foreground and background objects rather than justas a series of rectangular frames. This type of coding opens up a wide range of possibilities, such as independent coding of different objects in a scene, reuse of scene components, compositing 68 VCIODDEOING STANDARDS:JPEG AND h4PEG (where objects from a number of sources are combined intoa scene) and a high degree of interactivity. The basic concept used in MPEG-4 Visual is that of the video object (VO). A video scene (VS) (a sequence of video frames) is maduep of a number of VOs. For example, the VS shown in Figure 4.14 consists of a background V 0 and two foreground VOs. MPEG4 provides tools that enable each V 0 to be coded independently, opening up a range of new possibilities. The equivalent of a ‘frame’ in V 0 terms, i.e. a ‘snapshot’ of a V 0 at a single instant in time, is a video object plane (VOP). The entire scene may be coded as a single, rectangular VOP and this is equivalent to a picture in MF’EG-1 and MPEG-2 terms. Toolkit-basedcoding MPEG-lhas a very limiteddegree of flexibility; MPEG-2introduced the concept of a ‘toolkit’ of profiles and levels that could be combined in different ways for various applications. MPEG-4 extendsthis towards a highly flexible set of coding tools that enable a range of applications aswell as a standardised framework that allows new tools to be added to the ‘toolkit’. The MPEG-4 standard is organised so that new coding tools and functionalities may be added incrementally asnew versions of the standard are developed, andso the list of tools continues togrow. However, the main tools forcoding of video images can be summarised as follows. MPEG-4 Visual: very low bit-rate video core The video coding algorithmthsat form the‘very low bit-rate video (VLBV)core’ of MPEG4 Visual are almost identical tothe baseline H.263 videocoding standard (Chapter 5 ) . If the short header mode is selected, frame coding is completely identical to baseline H.263. A video sequence is coded asa series of rectangular frames(i.e. a single VOP occupying the whole frame). Input format Video dataisexpectedto be pre-processed and converted tooneof the picture sizes listedin Table 4.2, at a frame rateof up to 30 framesper second and in 4 :2 :0 Y: Cr :Cb format (i.e. the chrominance components have half the horizontal and vertical resolution of the luminance component). Picture types Each frame is coded as an I- or P-frame. An I-frame contains only intracoded macroblocks, whereas a P-frame can contain either intra- ionrter-coded macroblocks. MPEG (MOVING PICTURE EXPERTS GROUP) 69 (luminances)ize Table 4.2 MPEG4VLBV/H.263 picturesizes Picture Format SubQCIF QCIF CIF 4CIF 16CIF 128 x 96 176 x 144 352 x 288 704 x 576 1408 x 1152 Motionestimationandcompensation Thisiscarried out on 16 x 16 macroblocks or (optionally) on 8 x 8 macroblocks. Motion vectors can have half-pixel resolution. Transformcoding The motion-compensated residual iscoded with DCT, quantisation, zigzag scanning and run-level coding. 
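The half-pixel motion vectors mentioned under 'Motion estimation and compensation' above imply that the reference picture must be interpolated at half-sample positions. A minimal sketch of bilinear half-pel interpolation is given below; the rounding rule is a simplification and the exact conventions differ between standards, so this is illustrative only.

import numpy as np

def half_pel_block(ref, x, y, size=16):
    """Extract a size x size block from reference picture 'ref' at position
    (x, y) given in half-pixel units, using bilinear interpolation (averaging
    two or four neighbouring integer-position samples) at half-pel positions.
    Rounding is a simplification; exact rules vary between standards."""
    ix, iy = x // 2, y // 2          # integer part of the position
    fx, fy = x % 2, y % 2            # half-pel flags (0 or 1)
    a = ref[iy:iy + size,           ix:ix + size].astype(int)
    b = ref[iy:iy + size,           ix + fx:ix + fx + size].astype(int)
    c = ref[iy + fy:iy + fy + size, ix:ix + size].astype(int)
    d = ref[iy + fy:iy + fy + size, ix + fx:ix + fx + size].astype(int)
    return (a + b + c + d + 2) // 4  # averages 1, 2 or 4 distinct samples

if __name__ == "__main__":
    ref = (np.arange(32 * 32) % 255).astype(np.uint8).reshape(32, 32)
    block = half_pel_block(ref, x=9, y=6, size=8)   # x falls on a half-pel position
    print(block.shape, block[0, :4])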
Variable-length coding The run-level coded transform coefficients, together with header information andmotion vectors, arecoded using variable-length codes. Each non-zero transform coefficient is coded as a combination of run, level, last (where ‘last’ is a flag to indicate whether this is the last non-zero coefficient in the block) (see Chapter 8). Syntax The syntax of an MPEG-4 (VLBV) coded bit stream is illustrated in Figure 4.15 Picture layer The highest layer of the syntax contains a complete coded picture. The picture header indicates the pictureresolution, the type of coded picture (inter orintra) and includes a temporal reference fieldT. his indicates the correct display timfoer the decoder (relative to other coded pictures) and can help to ensure that a picture is not displayed too early or too late. Picture Cr I Picture 1 I ... Group of Blocks ... 1 1 ... Macroblock ... Figure 4.15 MPEG-4/H.263 layeredsyntax 70 VIDEO CODING STANDARDS: JPEG AND MPEG GOB 0 (22macroblocks) GOB 1 GOB 2 ... ... GOB 0 (11 macroblocks) ... GOB 17 GOB 6 GOB 7 GOB 8 (a) CIF (b) QCIF Figure 4.16 GOBs: (a) CIF and (b) QCIF pictures Group of blocks layer A group of blocks (GOB) consists of one complete row of macroblocks in SQCF, QCIF and CIF pictures (two rowsin a 4CIF picture and four rows in a 16CIF picture). GOBs are similar to slices in MPEG-1 and MPEG-2 in that, if an optional GOB header is inserted in the bit stream, the decoder can resynchronise to the start of the next GOB if an error occurs. However, the size and layout of each GOB are fixed by the standard (unlike slices). The arrangement of GOBs in a QCIF and CIF picture is shown in Figure 4.16. Macroblock layer A macroblock consists of four luminance blocks and two chrominance blocks. The macroblock header includes information about the type of macroblock, ‘coded block pattern’ (indicating which of the six blocks actually contain transform coefficients) and coded horizontal and vertical motion vectors (for inter-coded macroblocks). Blocklayer A block consists of run-level coded coefficients corresponding to an 8 x 8 block of samples. The core CODEC (based on H.263) was designed for efficient coding at low bit rates. The use of 8 x 8 block motion compensation and the design of the variable-length coding tables make the VLBV MPEG-4 CODEC more efficient than MPEG-I or MPEG-2 (see Chapter 5 for a comparison of coding efficiency). Other visual coding tools The features that make MPEG-4 (Visual) unique among the coding standards are the range of further coding tools available to the designer. Shape coding Shape coding is required to specify the boundarieosf each non-rectangular VOP in a scene. Shape information may be binary (i.e. identifying the pixels that are internal to the VOP, described as ‘opaque’, or external tothe VOP, described as ‘transparent’) or grey scale (where each pixel position within a VOP is allocated an 8-bit ‘grey scale’ number that iden- tifies the transparency of the pixel). Grey scale information is more complex and requires more bits to code: however, it introduces the possibility of overlapping, semi-transparent VOPs (similar to the concept of ‘alphaplanes’ in computer graphics). Binary information is simpler to code becauseeachpixelhasonlytwopossiblestates,opaque or transparent. 
Figure 4.17 MF’EG (MOVING PICTURE EXPERTS GROUP) 71 Figure4.17 (a) Opaqueand (b) semi-transparent VOPs illustratestheconcept of opaque and semi-transparent VOPs: in image (a), VOP2 (foreground) isopaque and completely obscures VOPl(background), whereas in image(b) VOP2 is partly transparent. Binary shape informationis coded in 16 x 16 blocks (binary alpha blocks, BABs). There are three possibilities for each block 1. All pixels aretransparent,i.e.theblockis information is coded. ‘outside’ the VOP. No shape (or texture) 2. All pixels are opaque, i.e. the block is fully ‘inside’ the VOP. No shape information is coded: the pixel values of the block (‘texture’) are coded as describeidn the next section. 72 VCIODSDTEAIONMNGDAPEJANPGRDEDGS: 3. Some pixels are opaqueand some are transparent, i.e. the block crosses baoundary of the VOP. The binary shapevalues of each pixel (1 or 0) are codedusing a formof DPCM and the texture information of the opaque pixels is coded as described below. Grey scale shape information produces values in the range 0 (transparent) to 255 (opaque) that are compressed using block-based DCT and motion compensation. Motion compensation Similar options exist to the I-, P- and B-pictures in MPEG-1 and MPEG-2: 1. I-VOP: VOP is encoded without any motion compensation. 2. P-VOP: VOP is predicted using motion-compensated prediction from apast I- or P-VOP. 3. B-VOP: VOP is predicted using motion-compensated prediction from apast and a future I- or P-picture (with forward, backward or bidirectional prediction). Figure 4.18 showsmode (3), prediction of a B-VOP fromapreviousI-VOP and future P-VOP. For macroblocks (or 8 x 8 blocks) that are fully contained within the current and reference VOPs, block-based motion compensation is used in a similar way to MPEG- 1 and MPEG-2. The motion compensation process ims odified for blocksor macroblocks along the boundary of the VOP.In the reference VOP, pixels in the 16 x 16 (or 8 x 8) search area are padded based on the pixels alongthe edge of the VOP. The macroblock (or block)in the current VOP is matched with this search areausing block matching: however, the difference value (meanabsoluteerrororsum of absoluteerrors)is only computedforthose pixel positions that lie within the VOP. Texture coding Pixels (or motion-compensated residual values) within a VOP are coded as ‘texture’. The basic tools are similarto MPEG-1 and MPEG-2: transform using the DCT, quantisation of the DCT coefficients followed by reordering and variable-length coding. To further improve compression efficiency, quantised DCT coefficients may be predicted from previously transmitted blocks (similar to the differential prediction of DC coefficients used in JPEG, MPEG-1 and MPEG-2). P B-VOP Figure 4.18 B-VOP motion-compensatedprediction MPEG (MOVING PEGIXCRPTOEUURRTPE)S 73 A macroblock that covers a boundary of the VOP will contain both opaque andtransparent pixels. In order to apply a regular 8 x 8 DCT, it is necessary to use ‘padding’ to fill up the transparent pixel positions. In an inter-codedVOP, where the textureinformation is motioncompensated residual data, the transparent positions are simply filled with zeros. In an intracoded VOP, where the texture is ‘original’ pixel data, the transparent positions are filled by extrapolating the pixel values along the boundary of the VOP. Erroresilience MPEG-4incorporates a number of mechanisms that can provide improvedperformance in the presence of transmission errors (such as bit errorsor lost packets). The main tools are: 1. 
Synchronisation markers: similar to MPEG-1 and MPEG-2 slice start codes, except that these may optionally be positioned so that each resynchronisation interval contains an approximately equal number of encoded bits (rather than a constant number of macroblocks). This means that errors are likely to be evenly distributed among the resynchronisation intervals. Each resynchronisation interval may be transmitted in a separate video packet. 2. Data partitioning: similar to the data partitioning mode of MPEG-2. 3. Header extension: redundant copies of header information are inserted at intervals in the bit stream so that if an important header(e.g. a picture header) is lost due to anerror, the redundant header may be used to partially recover the coded scene. 4. Reversible VLCs: these variable lengthcodes limit the propagation (‘spread’) ofan errored region in a decoded frame or VOP and are described further in Chapter 8. Scalability MPEG-4 supports spatial and temporal scalability. Spatial scalability applies to rectangular VOPs in a similar way to MPEG-2: the base layer gives a low spatial resolution and an enhancement layer may be decoded together with the base layer to give a higher resolution. Temporal scalability is extended beyondthe MPEG-2 approach in that it may be applied to individual VOPs. For example, a background VOP may beencoded without scalability, whilst a foreground VOP may beencoded with several layers of temporal scalability. This introduces the possibility of decoding a foreground object at a higher frame rate and more static, background objects at a lower frame rate. Sprite coding A ‘sprite’ is a VOP that is present for the entire duration of a video sequence (VS). A sprite may be encoded and transmitted once at the start of the sequence, giving a potentially large benefitin compression performance. A goodexampleis a background sprite: the background image to a scene is encoded as a sprite at the start of the VS. For the remainder of the VS, only the foreground VOPs need to be coded and transmitted since the decoder can ‘render’ the background from the original sprite. If there is camera movement (e.g. panning), then a sprite that is larger than the visible scene is required (Figure 4.19). In order to compensate for more complex camera movemen(ets.g. zoom or rotation), it may be necessary for the decoder to ‘warp’ the sprite. A sprite is encoded as an I-VOP as described earlier. Static texture An alternative set of tools to the DCT may be used to code ‘static’ texture, i.e. texture data that doesnot change rapidly. The main application for this is to codetexture 74 VIDEO CODING STANDARDS:P E G AND MPEG Figure 4.19 Example of background sprite and foreground VOPs that is mappedonto a 2-Dor3-Dsurface(described below). Staticimage texture is coded efficiently using a wavelet transform. The transform coefficients are quantised and coded with a zero-tree algorithmfollowed by arithmetic coding.Wavelet coding is described further in Chapter 7 and arithmetic coding in Chapter 8. Mesh and 3-D modelcoding MPEG-4 supports more advanced object-based coding techniques including: 0 2-D mesh coding, where an object is codeda amsesh of triangular patches in a 2-D plane. Static texture(coded as described above) can be mapped onto themesh. A moving object can be represented by deforming the mesh and warping the texture as the mesh moves. 0 3-D mesh coding, where an object is described as a mesh in 3-D space. 
This is more complex than a 2-D mesh representation but gives a higher degree of flexibility in terms of representing objectswithin a scene. 0 Face and body model coding, where a human face or body is rendered at the decoder accordingto a faceor body model. Themodeliscontrolled(moved) by changing ‘animation parameters’. In this way a ‘head-and-shoulders’ video scene may be coded by sending only the animation parametreerqsuired to ‘move’ the model at the decoder. Static texture is mapped onto the model surface. Thesethreetoolsoffer the potentialforfundamentalimprovements in video coding performance and flexibility: however, their application is currently limited because of the high processing resourcesrequired to analyse and render even a very simple scene. MPEG-4 visual profiles and levels In common with MPEG-2, a number of recommended ‘profiles’ (sets of MPEG-4 tools) and ‘levels’ (constraints onbit stream parameterssuch as frame size and ratea)re defined in the (MPEMIOGCXVPRTPEIUOENGRUGTEPS) 75 MPEG-4 standard. Each profile is defined in terms of one or more ‘object types’, where an object type is a subset of the MPEG-4 tools. Table 4.3 lists the main MPEG-4 object types that make up the profiles. The ‘Simple’ object type contains tools for coding of basic I- and P-rectangular VOPs (complete frames) together with error resilience tools and the‘short header’ option (for compatibility with H.263). The ‘Core’ type adds B-VOPs and basic shape coding (using a binary shape mask only). The main profile adds grey scale shape coding and sprite coding. MPEG-4 (Visual) is gaining popularity in a number of application areas such as Internet- based video. However, to date the majority of applications use only the simple object type and there has been limited take-up ofthe content-based features ofthe standard. This is partly because of technical complexities (for example, it is difficult to accurately segment a video scene into foreground and background objects, e.g. Figure 4.14, using an automatic algorithm) and partly because useful applications for content-based video coding and manipulation have yet to emerge. At the time of writing, the great majority of video coding applications continue to work with complete rectangular frames. However, researchers continue to improve algorithms for segmenting and manipulating video The content-based tools have a number of interesting possibilities: for example, they make it Table 4.3 MPEG-4videoobjecttypes Video object types VitsSouoiamlClsMpsolcareaeinlable Basic Still SimplAe nimatedanimatedscalablSeimple 2-D mtexsthteuxrfteaucre Basic (I-VOP, P-VOP, J JJJ coefficient prediction, 16 x 16 and 8 x 8 motion vectors) Error resilience J JJJ Short header J JJ B-VOP JJJ P-VOP with overlapped block matching Alternative quantisation JJ P-VOP based temporal JJ scalability Binary shape JJ Grey shape J Interlaced video coding J Sprite J Rectangular temporal J scalability Rectangular spatial J scalability Scalable still texture 2-D mesh Facial animation parameters J J J J J J J J J J J J J J 76 VIDEO COSDTAINNGDAJANPRDEDGS: MPEG possible to develop‘hybrid’ applications with a mixture of ‘real’ video objects(possibly from a number of different sources) and computer-generatedgraphics. So-called synthetic natural hybrid coding has the potential to enable a new generation of video applications. 4.5 SUMMARY The I S 0 has issued a number of image and videocodingstandards that have heavily influenced the developmentof the technology and market for video codingapplications. 
The original JPEG still image compression standard is now a ubiquitous method for storing and transmitting still images and has gained sompeopularity as a simple androbust algorithm for videocompression. The improved subjective andobjectiveperformance of its successor, JPEG-2000, may lead to the gradual replacement of the original JPEG algorithm. The first MPEG standard, MPEG-l, was never a market success in its target application (video CDs) but is widely used for PC and internet video applications and formed the basis for the MPEG-2 standard. MPEG-2 has enableda worldwide shift towards digital television andis probably the most successful of thevideocodingstandards in terms of market penetration. The MPEG-4 standard offers a plethora of video coding toolswhich may intime enable many new applications: however, at the present time the most popular element of MPEG-4 (Visual) is the ‘core’ low bit rateCODEC that is based on the ITU-TH.263 standard. In the next chapter we will examine the H . 2 6 ~series of coding standards, H.261, H.263 and the emerging H.26L. REFERENCES 1. http://www.itu.int/ [International Telecommunication Union]. 2. http://www.iso.ch/[InternationalStandardsOrganisation]. 3. ISO/IEC 10918-1 /ITU-T Recommendation T.81, ‘Digital compression and coding of continuous- tone still images’, 1992 [JPEG]. 4. ISO/IEC 11172-2, ‘Information technology-coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s-part 2: Video’, 1993 [MPEGl Video]. 5 . ISOlIEC 138 18-2, ‘Information technology: generic codoinfgmoving pictures and associated audio information: Video’, 1995 [MPEG2 Video]. 6. ISO/IEC14996-2,‘Informationtechnology-coding of audio-visualobjects-part 2: Visual’,1998 [MPEG-4 Visual]. 7. ISO/IEC FCD 15444-1, ‘JPEG2000 Final Committee Draft v1 .O’, March 2000. 8. ISO/IEC JTCl/SC29/WG1l N403 1, ‘Overview of the MPEG-7 Standard’, Singapore, March 2010. 9. ISO/IEC JTCl/SC29/WG11 N4318, “PEG-21 Overview’, Sydney, July 2001. 10. http://standards.pictel.com/ftp/video-site[/VCEGworkingdocuments]. I 1. http://www.cselt.it/mpeg/ [MPEG committee official site]. 12. http://www.jpeg.org/[JPEGresources]. 13. http://www.mpeg.org/ [MPEG resources]. 14. ITU-T Q6/SG16 Draft Document, ‘Appendix I11 for ITU-T Rec H.263’, Porto Seguro, May 2001. 15. A. N. Skodras, C. A. Christopoulosand T. Ebrahimi,‘JPEG2000: The upcoming still image compression standard’, Proc. 11th Portuguese Conference on Pattern Recognition, Porto, 2000. 16. ISO/IEC 11 172-1, ‘Information technology-coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s-part 1: Systems’, 1993 [MPEGI Systems]. REFERENCES 77 17.ISO/IEC 11172-2,Informationtechnology-coding of movingpicturesandassociatedaudio for digital storage mediat at up to about lSMbit/s-part 3: Audio’, 1993 [MPEGl Audio]. 18. ISO/IEC 138 18-3, ‘Information technology: generic codoifnmg oving pictures and associated audio information: Audio’, 1995 [MPEG2 Audio]. 19. ISO/IEC 138 18-1, ‘Information technology: generic codoifnmg oving pictures and associated audio information Systems’, 1995 [MPEG2 Systems]. 20. P. Salembier andF. MarquCs, ‘Region-based representationsof image andvideo: segmentation tools for multimedia services’, IEEE Trans. CSVT 9(8), December 1999. 21. L. Garrido, A. Oliveras and P. Salembier, ‘Motion analysis of image sequences using connected operators’, Proc. VCIP97, San Jose, February 1997, SPIE 3024. 22. K. Illgner and F. 
Muller, ‘Image segmentation using motion estimation’, in Erne-varying Image Processing and Image Recognition, Elsevier Science, 1997. 23. R. Castagno and T. Ebrahimi,‘VideoSegmentationbasedonmultiplefeaturesforinteractive multimedia applications’, IEEE Trans. CSVT 8(5), September, 1998. 24. E. Steinbach, P. Eisert and B. Girod, ‘Motion-based analysis and segmentationof image sequences using 3-D scene models’, Signal Processing, 66(2), April 1998. 25. M. Chang, M. Teklap and M. Ibrahim Sezan, ‘Simultaneous motion estimation and segmentation’, IEEE Trans. Im. Proc., 6(9), 1997. 5 Video Codec Design Iain E. G. Richardson Copyright q 2002 John Wiley & Sons, Ltd ISBNs: 0-471-48553-5 (Hardback); 0-470-84783-2 (Electronic) Video Coding Standards: H.261, H.263 and H.26L 5.1 INTRODUCTION The I S 0 MPEG video coding standards are aimed at storage and distribution of video for entertainment and have tried to meet the needs of providers and consumers in the ‘media industries’. The ITU has (historically) been more concerned about the telecommunications industry, anditsvideocodingstandards (H.261,H.263, H.26L)haveconsequentlybeen targeted at real-time, point-to-point or multi-point communications. The first ITU-T video coding standard to havae significant impact,H.26 I , was developed during the late 1980s/early 1990s with a particular application and transmission channel in mind. The application was video conferencing (two-way communicationsvia a video ‘link’) and the channelwas N-ISDN. ISDN providesa constant bit rate o f p X 64 kbps, wherep is an integer in the range 1-30: it was felt at the time that ISDN would be the medium of choice forvideocommunicationsbecause of its guaranteedbandwidth andlow delay. Modem channels over the analogue POTSPSTN (at speeds of less than 9600 bps at the time) were considered to be too slow for visual communications and packet-based transmissionwas not considered to be reliable enough. H.261 was quite successful and continues to be used in many legacy video conferencing applications.Improvements in processorperformance,videocodingtechniques and the emergence of analogue Modems and Internet Protocol (IP) networks as viable channels led tothedevelopment of its successor,H.263, inthe mid-1990s. By making a number of improvementstoH.261,H.263provided significantly bettercompressionperformance as well as greater flexibility. The original H.263 standard (Version 1) had four optional modes which could be switched on to improve performance (at the expenosef greater complexity). Thesemodeswereconsideredtobeuseful andVersion 2(‘H.263+’)added12further optional modes. The latest (and probably the last) version (v3) will contain a total of 19 modes, each offering improved coding performance, error resilience and/or flexibility. Version 3 of H.263 has becomea rather unwieldy standard becauseof the large numberof options andtheneed to continue to support the basic (‘baseline’) CODEC functions. The latest initiative of the ITU-T experts group VCEG is the H.26L standard (where ‘L‘ stands for ‘long term’). This isa new standard that makes use of some of the best features of H.263 andaimstoimprovecompressionperformance by around 50% atlowerbitrates. Early indications are that H.26L will outperform H.263+ (but possibly not by 50%). 80 VIDCOEODSITNAGNDARHDH.2AS.62:N16,D3 H.26L 5.2 H.261’ Typical operating bit rates for H.261 applications are between 64 and 384 kbps. 
At the time of development, packet-based transmission over the Internet was not expected to be a significant requirement, and the limited video compression performance achievable at the time was not considered to be sufficient to support bit rates below 64 kbps.

A typical H.261 CODEC is very similar to the 'generic' motion-compensated DCT-based CODEC described in Chapter 3. Video data is processed in 4:2:0 Y:Cr:Cb format. The basic unit is the 'macroblock', containing four luminance blocks and two chrominance blocks (each 8 x 8 samples) (see Figure 4.6). At the input to the encoder, 16 x 16 macroblocks may be (optionally) motion compensated using integer motion vectors. The motion-compensated residual data is coded with an 8 x 8 DCT followed by quantisation and zigzag reordering. The reordered transform coefficients are run-level coded and compressed with an entropy encoder (see Chapter 8).

Motion compensation performance is improved by use of an optional loop filter, a 2-D spatial filter that operates on each 8 x 8 block in a macroblock prior to motion compensation (if the filter is switched on). The filter has the effect of 'smoothing' the reference picture, which can help to provide a better prediction reference. Chapter 9 discusses loop filters in more detail (see for example Figures 9.11 and 9.12).

In addition, a forward error correcting code is defined in the standard that should be inserted into the transmitted bit stream. In practice, this code is often omitted from practical implementations of H.261: the error rate of an ISDN channel is low enough that error correction is not normally required, and the code specified in the standard is not suitable for other channels (such as a noisy wireless channel or packet-based transmission).

Each macroblock may be coded in 'intra' mode (no motion-compensated prediction) or 'inter' mode (with motion-compensated prediction). Only two frame sizes are supported, CIF (352 x 288 pixels) and QCIF (176 x 144 pixels).

H.261 was developed at a time when hardware and software processing performance was limited and therefore has the advantage of low complexity. However, its disadvantages include poor compression performance (with poor video quality at bit rates of under about 100 kbps) and lack of flexibility. It has been superseded by H.263, which has higher compression efficiency and greater flexibility, but it is still widely used in installed video conferencing systems.

5.3 H.263 [2]

In developing the H.263 standard, VCEG aimed to improve upon H.261 in a number of areas. By taking advantage of developments in video coding algorithms and improvements in processing performance, it provides better compression. H.263 also provides greater flexibility than H.261: for example, a wider range of frame sizes is supported (listed in Table 4.2). The first version of H.263 introduced four optional modes, each described in an annex to the standard, and further optional modes were introduced in Version 2 of the standard ('H.263+').

The target application of H.263 is low-bit-rate, low-delay two-way video communications. H.263 can support video communications at bit rates below 20 kbps (at a very limited visual quality) and is now widely used both in 'established' applications such as video telephony and video conferencing and in an increasing number of new applications (such as Internet-based video).

5.3.1 Features

The baseline H.263 CODEC is functionally identical to the MPEG-4 'short header' CODEC described in Section 4.4.3. Input frames in 4:2:0 format are motion compensated (with half-pixel resolution motion vectors), transformed with an 8 x 8 DCT, quantised, reordered and entropy coded.
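As an illustration of the 'reorder and run-level code' stages mentioned above, the following fragment is a minimal, self-contained sketch in C. It is not taken from the H.261 or H.263 specifications: the scan table is the familiar zigzag order and the function names are illustrative; the variable-length coding stage that would follow (Chapter 8) is omitted.

/* Illustrative sketch: zigzag reordering of an 8 x 8 block of quantised
 * coefficients followed by run-level coding.  Not the normative tables. */
#include <stdio.h>

static const int zigzag[64] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};

/* Emit (run, level) pairs for one block of quantised coefficients. */
static void run_level_code(const int coeff[64])
{
    int run = 0;
    for (int i = 0; i < 64; i++) {
        int level = coeff[zigzag[i]];
        if (level == 0) {
            run++;                        /* count zeros before the next non-zero */
        } else {
            printf("(run=%d, level=%d)\n", run, level);
            run = 0;
        }
    }
    /* A real CODEC would now signal 'end of block' and entropy code the pairs. */
}

int main(void)
{
    int coeff[64] = {0};                  /* mostly zero after quantisation */
    coeff[0] = 12; coeff[1] = -3; coeff[8] = 5; coeff[16] = 1;
    run_level_code(coeff);
    return 0;
}

Because most quantised coefficients are zero, the zigzag scan groups the non-zero values near the start of the list and the run-level representation becomes very compact, which is what makes the subsequent entropy coding effective.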
The main factors that contribute to the improved coding performance over H.261 are the use of half-pixel motion vectors (providing better motion compensation) and redesigned variable-length code (VLC) tables (described further in Chapter 8). Features such as I- and P-pictures, more frame sizes and optional coding modes give the designer greater flexibility to deal with different application requirements and transmission scenarios.

5.4 THE H.263 OPTIONAL MODES / H.263+

The original H.263 standard (Version 1) included four optional coding modes (Annexes D, E, F and G). Version 2 of the standard added 12 further modes (Annexes I to T) and a new release is scheduled with yet more coding modes (Annexes U, V and W). CODECs that implement some of the optional modes are sometimes described as 'H.263+' or 'H.263++' CODECs, depending on which modes are implemented. Each mode adds to or modifies the functionality of H.263, usually at the expense of increased complexity. An H.263-compliant CODEC must support the 'baseline' syntax described above: the use of optional modes may be negotiated between an encoder and a decoder prior to starting a video communications session. The optional modes have a number of potential benefits: some of the modes improve compression performance, others improve error resilience or provide tools that are useful for particular transmission environments such as packet-based transmission.

Annex D, Unrestricted motion vectors The optional mode described in Annex D of H.263 allows motion vectors to point outside the boundaries of the picture. This can provide a coding performance gain, particularly if objects are moving into or out of the picture. The pixels at the edges of the picture are extrapolated to form a 'border' outside the picture that vectors may point to (Figure 5.1). In addition, the motion vector range is extended so that longer vectors are allowed. Finally, Annex D contains an optional alternative set of VLCs for encoding motion vector data. These VLCs are reversible, making it easier to recover from transmission errors (see Chapter 11).

Figure 5.1 Unrestricted motion vectors

Annex E, Syntax-based arithmetic coding Arithmetic coding is used instead of variable-length coding. Each of the VLCs defined in the standard is replaced with a probability value that is used by an arithmetic coder (see Chapter 8).

Annex F, Advanced prediction The efficiency of motion estimation and compensation is improved by allowing the use of four vectors per macroblock (a separate motion vector for each 8 x 8 luminance block, Figure 5.2). Overlapped block motion compensation (described in Chapter 6) is used to improve motion compensation and reduce 'blockiness' in the decoded image. Annex F requires the CODEC to support unrestricted motion vectors (Annex D).

Figure 5.2 One or four motion vectors per macroblock

Annex G, PB-frames A PB-frame is a pair of frames coded as a combined unit. The first frame is coded as a 'B-picture' and the second as a P-picture. The P-picture is forward predicted from the previous I- or P-picture and the B-picture is bidirectionally predicted from the previous and current I- or P-pictures. Unlike MPEG-1 (where a B-picture is coded as a separate unit), each macroblock of the PB-frame contains data from both the P-picture and the B-picture (Figure 5.3).
PB-frames can give an improvement in compression efficiency.

Figure 5.3 Macroblock in a PB-frame (P macroblock data and B macroblock data)

Annex I, Advanced intra-coding This mode exploits the correlation between DCT coefficients in neighbouring intra-coded blocks in an image. The DC coefficient and the first row or column of AC coefficients may be predicted from the coefficients of neighbouring blocks (Figure 5.4). The zigzag scan, quantisation procedure and variable-length code tables are modified and the result is an improvement in compression efficiency for intra-coded macroblocks.

Figure 5.4 Prediction of intra-coefficients (from above or from the left), H.263 Annex I

Annex J, Deblocking filter The edges of each 8 x 8 block are 'smoothed' using a spatial filter (described in Chapter 9). This reduces 'blockiness' in the decoded picture and also improves motion compensation performance. When the deblocking filter is switched on, four motion vectors per macroblock and unrestricted motion vectors (the key features of Annexes F and D) are also enabled.

Annex K, Slice structured mode This mode provides support for resynchronisation intervals that are similar to MPEG-1 'slices'. A slice is a series of coded macroblocks starting with a slice header. Slices may contain macroblocks in raster order, or in any rectangular region of the picture (Figure 5.5). Slices may optionally be sent in an arbitrary order. Each slice may be decoded independently of any other slice in the picture, and so slices can be useful for error resilience (see Chapter 11) since an error in one slice will not affect the decoding of any other slice.

Figure 5.5 H.263 Annex K slice options: (a) raster order; (b) arbitrary rectangular slices

Annex L, Supplemental enhancement information This annex contains a number of supplementary codes that may be sent by an encoder to a decoder. These codes indicate display-related information about the video sequence, such as picture freeze and timing information.

Annex M, Improved PB-frames As the name suggests, this is an improved version of the original PB-frames mode (Annex G). Annex M adds the options of forward or backward prediction for the B-frame part of each macroblock (as well as the bidirectional prediction defined in Annex G), resulting in improved compression efficiency.

Annex N, Reference picture selection This mode enables an encoder to choose from a number of previously coded pictures for predicting the current picture. The use of this mode to limit error propagation in a noisy transmission environment is discussed in Chapter 11. At the start of each GOB or slice, the encoder may choose the preferred reference picture for prediction of macroblocks in that GOB or slice.

Annex O, Scalability Temporal, spatial and SNR scalability are supported by this optional mode. In a similar way to the MPEG-2 optional scalability modes, spatial scalability increases frame resolution, SNR scalability increases picture quality and temporal scalability increases frame rate. In each case, a 'base layer' provides basic performance and the increased performance is obtained by decoding the base layer together with an 'enhancement layer'. Temporal scalability is particularly useful because it supports B-pictures: these are similar to the 'true' B-pictures in the MPEG standards (where a B-picture is a separate coded unit) and are more flexible than the combined PB-frames described in Annexes G and M.
Annex P, Reference picture resampling The prediction reference frame used by the encoder and decoder may be resampled prior to motion compensation. This has several possible applications. For example, an encoder can change the frame resolution 'on the fly' whilst continuing to use motion-compensated prediction. The prediction reference frame is resampled to match the new resolution and the current frame can then be predicted from the resampled reference. This mode may also be used to support warping, i.e. the reference picture is warped (deformed) prior to prediction, perhaps to compensate for nonlinear camera movements such as zoom or rotation.

Annex Q, Reduced resolution update An encoder may choose to update selected macroblocks at a lower resolution than the normal spatial resolution of the frame. This may be useful, for example, to enable a CODEC to refresh moving parts of a frame at a low resolution using a small number of coded bits whilst keeping the static parts of the frame at the original higher resolution.

Annex R, Independent segment decoding This annex extends the concept of the independently decodeable slices (Annex K) or GOBs. Segments of the picture (where a segment is one slice or an integral number of GOBs) may be decoded completely independently of any other segment. In the slice structured mode (Annex K), motion vectors can point to areas of the reference picture that are outside the current slice; with independent segment decoding, motion vectors and other predictions can only reference areas within the current segment in the reference picture (Figure 5.6). A segment can therefore be decoded (over a series of frames) independently of the rest of the frame.

Figure 5.6 Independent segments

Annex S, Alternative inter-VLC The encoder may use an alternative variable-length code table for transform coefficients in inter-coded blocks. The alternative VLCs (actually the same VLCs used for intra-coded blocks in Annex I) can provide better coding efficiency when there are a large number of high-valued quantised DCT coefficients (e.g. if the coded bit rate is high and/or there is a lot of variation in the video scene).

Annex T, Modified quantisation This mode introduces some changes to the way the quantiser and rescaling operations are carried out. Annex T allows the encoder to change the quantiser scale factor in a more flexible way during encoding, making it possible to control the encoder output bit rate more accurately.

Annex U, Enhanced reference picture selection Annex U modifies the reference picture selection mode of Annex N to provide improved error resilience and coding efficiency. There are a number of changes, including a mechanism to reduce the memory requirements for storing previously coded pictures and the ability to select a reference picture for motion compensation on a macroblock-by-macroblock basis. This means that the 'best' match for each macroblock may be selected from any of a number of stored previous pictures (also known as long-term memory prediction).

Annex V, Data partitioned slice Modified from Annex K, this mode improves the resilience of slice structured data to transmission errors. Within each slice, the macroblock data is rearranged so that all of the macroblock headers are transmitted first, followed by all of the motion vectors and finally by all of the transform coefficient data.
An error occurring in header or motion vector data usually has a more serious effect on the decoded picture than an error in transform coefficient data: by rearranging the data in this way, an error occurring part-way through a slice should only affect the less sensitive transform coefficient data.

Annex W, Additional supplemental enhancement information Two extra enhancement information items are defined (in addition to those defined in Annex L). The 'fixed-point IDCT' function indicates that an approximate inverse DCT (IDCT) may be used rather than the 'exact' definition of the IDCT given in the standard: this can be useful for low-complexity fixed-point implementations of the standard. The 'picture message' function allows the insertion of a user-definable message into the coded bit stream.

5.4.1 H.263 Profiles

It is very unlikely that all 19 optional modes will be required for any one application. Instead, certain combinations of modes may be useful for particular transmission scenarios. In common with MPEG-2 and MPEG-4, H.263 defines a set of recommended profiles (where a profile is a subset of the optional tools) and levels (where a level sets a maximum value on certain coding parameters such as frame resolution, frame rate and bit rate). Profiles and levels are defined in the final annex of H.263, Annex X. There are a total of nine profiles, as follows.

Profile 0, Baseline This is simply the baseline H.263 functionality, without any optional modes.

Profile 1, Coding efficiency (Version 2) This profile provides efficient coding using only tools available in Versions 1 and 2 of the standard (i.e. up to Annex T). The selected optional modes are Annex I (Advanced Intra-coding), Annex J (De-blocking Filter), Annex L (Supplemental Information: only the full picture freeze function is supported) and Annex T (Modified Quantisation). Annexes I, J and T provide improved coding efficiency compared with the baseline mode. Annex J incorporates the 'best' features of the first version of the standard, four motion vectors per macroblock and unrestricted motion vectors.

Profile 2, Coding efficiency (Version 1) Only tools available in Version 1 of the standard are used in this profile, and in fact only Annex F (Advanced Prediction) is included. The other three annexes (D, E, G) from the original standard are not (with hindsight) considered to offer sufficient coding gains to warrant their use.

Profiles 3 and 4, Interactive and streaming wireless These profiles incorporate efficient coding tools (Annexes I, J and T) together with the slice structured mode (Annex K) and, in the case of Profile 4, the data partitioned slice mode (Annex V). These slice modes can support increased error resilience, which is important for 'noisy' wireless transmission environments.

Profiles 5, 6 and 7, Conversational These three profiles support low-delay, high-compression 'conversational' applications (such as video telephony). Profile 5 includes tools that provide efficient coding; Profile 6 adds the slice structured mode (Annex K) for Internet conferencing; Profile 7 adds support for interlaced camera sources (part of Annex W).

Profile 8, High latency For applications that can tolerate a higher latency (delay), such as streaming video, Profile 8 adds further efficient coding tools such as B-pictures (Annex O) and reference picture resampling (Annex P). B-pictures increase coding efficiency at the expense of a greater delay.
The remaining tools within the 19 annexes are not included in any profile, either because they are considered to be too complex for anything other than special-purpose applications, or because more efficient tools have superseded them.

5.5 H.26L [3]

The 19 optional modes of H.263 improved coding efficiency and transmission capabilities: however, development of the H.263 standard is constrained by the requirement to continue to support the original 'baseline' syntax. The latest standardisation effort by the Video Coding Experts Group is to develop a new coding syntax that offers significant benefits over the older H.261 and H.263 standards. This new standard is currently described as 'H.26L', where the 'L' stands for 'long term' and refers to the fact that this standard was planned as a long-term solution beyond the 'near-term' additions to H.263 (Versions 2 and 3).

The aim of H.26L is to provide a 'next generation' solution for video coding applications, offering significantly improved coding efficiency whilst reducing the 'clutter' of the many optional modes in H.263. The new standard also aims to take account of the changing nature of video coding applications. Early applications of H.261 used dedicated CODEC hardware over the low-delay, low-error-rate ISDN. The recent trend is towards software-only or mixed software/hardware CODECs (where computational resources are limited, but greater flexibility is possible than with a dedicated hardware CODEC) and more challenging transmission scenarios (such as wireless links with high error rates and packet-based transmission over the Internet). H.26L is currently at the test model development stage and may continue to evolve before standardisation. The main features can be summarised as follows.

Figure 5.7 H.26L blocks in a macroblock (sixteen 4 x 4 luminance blocks, eight 4 x 4 chrominance blocks and two 2 x 2 blocks of chrominance DC coefficients)

Processing units The basic unit is the macroblock, as with the previous standards. However, the subunit is now a 4 x 4 block (rather than an 8 x 8 block). A macroblock contains 26 blocks in total (Figure 5.7): 16 blocks for the luminance (each 4 x 4), four 4 x 4 blocks each for the chrominance components and two 2 x 2 'sub-blocks' which hold the DC coefficients of each of the eight chrominance blocks. It is more efficient to code these DC coefficients together because they are likely to be highly correlated.

Intra-prediction Before coding a 4 x 4 block within an intra-macroblock, each pixel in the block is predicted from previously coded pixels. This prediction reduces the amount of data coded in low-detail areas of the picture.

Prediction reference for inter-coding In a similar way to Annexes N and U of H.263, the reference frame for predicting the current inter-coded macroblock may be selected from a range of previously coded frames. This can improve coding efficiency and error resilience at the expense of increased complexity and storage.

Sub-pixel motion vectors H.26L supports motion vectors with 1/4-pixel and (optionally) 1/8-pixel accuracy; 1/4-pixel vectors can give an appreciable improvement in coding efficiency over 1/2-pixel vectors (e.g. H.263, MPEG-4) and 1/8-pixel vectors can give a small further improvement (at the expense of increased complexity).

Motion vector options H.26L offers seven different options for allocating motion vectors within a macroblock, ranging from one vector per macroblock (Mode 1 in Figure 5.8) to an individual vector for each of the 16 luminance blocks (Mode 7 in Figure 5.8).
This makes it possible to model the motion of irregular-shaped objects with reasonable accuracy. More motion vectors require extra bits to encode and transmit, and so the encoder must balance the choice of motion vectors against coding efficiency.

Figure 5.8 H.26L motion vector modes (Modes 1 to 7)

De-blocking filter The de-blocking filter defined in Annex J of H.263 significantly improves motion compensation efficiency because it improves the 'smoothness' of the reference frame used for motion compensation. H.26L includes an integral de-blocking filter that operates across the edges of the 4 x 4 blocks within each macroblock.

4 x 4 block transform After motion compensation, the residual data within each block is transformed using a 4 x 4 block transform. This is based on a 4 x 4 DCT but is an integer transform (rather than the floating-point 'true' DCT). An integer transform avoids problems caused by mismatches between different implementations of the DCT and is well suited to implementation in fixed-point arithmetic units (such as low-power embedded processors, Chapter 13).

Universal variable-length code The VLC tables in H.263 are replaced with a single 'universal' VLC. A transmitted code is created by building up a regular VLC from the 'universal' codeword. These codes have two advantages: they can be implemented efficiently in software without the need for storage of large tables, and they are reversible, making it easier to recover from transmission errors (see Chapters 8 and 11 for further discussion of VLCs and error resilience).

Context-based adaptive binary arithmetic coding This alternative entropy encoder uses arithmetic coding (described in Chapter 8) to give higher compression efficiency than variable-length coding. In addition, the encoder can adapt to local image statistics, i.e. it can generate and use accurate probability statistics rather than using predefined probability tables.

B-pictures These are recognised to be a very useful coding tool, particularly for applications that are not very sensitive to transmission delays. H.26L supports B-pictures in a similar way to MPEG-1 and MPEG-2, i.e. there is no restriction on the number of B-pictures that may be transmitted between pairs of I- and/or P-pictures.

At the time of writing it remains to be seen whether H.26L will supersede the popular H.261 and H.263 standards. Early indications are that it offers a reasonably impressive performance gain over H.263 (see the next section): whether these gains are sufficient to merit a 'switch' to the new standard is not yet clear.

5.6 PERFORMANCE OF THE VIDEO CODING STANDARDS

Each of the image and video coding standards described in Chapters 4 and 5 was designed for a different purpose and includes different features. This makes it difficult to compare them directly. Figure 5.9 compares the PSNR performance of each of the video coding standards for one particular test video sequence, 'Foreman', encoded at QCIF resolution and a frame rate of 10 frames per second. The results shown in the figure should be interpreted with caution, since different performance will be measured depending on the video sequence, frame rate and so on. However, the trend in performance is clear. MJPEG performs poorly (i.e. it requires a relatively high data rate to support a given picture 'quality') because it does not use any inter-frame compression. H.261 achieves a substantial gain over MJPEG, due to the use of integer-pixel motion compensation.
MPEG-2 (with half-pixel motion compensation) is next, followed by H.263/MPEG-4 (which achieve a further gain by using four motion vectors per macroblock). The emerging H.26L test model achieves the best performance of all. (Note that MPEG-1 achieves the same performance as MPEG-2 in this test because the video sequence is not interlaced.)

Figure 5.9 Coding performance comparison ('Foreman', QCIF, 10 frames/sec): PSNR (dB) against bit rate (kbps) for MJPEG, H.261, MPEG-1/MPEG-2, H.263/MPEG-4 and H.26L

This comparison is not the complete picture because it does not take into account the special features of particular standards (for example, the content-based tools of MPEG-4 or the interlaced video tools of MPEG-2). Table 5.1 compares the standards in terms of coding performance and features. At the present time, MPEG-2, H.263 and MPEG-4 are each viable alternatives for designers of video communication systems. MPEG-2 is a relatively mature technology for the mass-market digital television applications; H.263 offers good coding performance and options to support a range of transmission scenarios; MPEG-4 provides a large toolkit with the potential for new and innovative content-based applications. The emerging H.26L standard promises to outperform the H.263 and MPEG-4 standards in terms of video compression efficiency [4] but is not yet finalised.

Table 5.1 Comparison of the video coding standards

Standard   Target application    Coding performance   Features
MJPEG      Image coding          1 (worst)            Scalable and lossless coding modes
H.261      Video conferencing    2                    Integer-pixel motion compensation
MPEG-1     Video-CD              3 (equal)            I, P, B-pictures, half-pixel compensation
MPEG-2     Digital TV            3 (equal)            As above; field coding, scalable coding
H.263      Video conferencing    4 (equal)            Optimised for low bit rates; many optional modes
MPEG-4     Multimedia coding     4 (equal)            Many options including content-based tools
H.26L      Video conferencing    5 (best)             Full feature set not yet defined

5.7 SUMMARY

The ITU-T Video Coding Experts Group developed the H.261 standard for video conferencing applications, which offered reasonable compression performance with relatively low complexity. This was superseded by the popular H.263 standard, offering better performance through features such as half-pixel motion compensation and improved variable-length coding. Two further versions of H.263 have been released, each offering additional optional coding modes to support better compression efficiency and greater flexibility. The latest version (Version 3) includes 19 optional modes, but is constrained by the requirement to support the original, 'baseline' H.263 CODEC. The H.26L standard, under development at the time of writing, incorporates a number of new coding tools such as a 4 x 4 block transform and flexible motion vector options, and promises to outperform earlier standards.

Comparing the performance of the various coding standards is difficult because a direct 'rate-distortion' comparison does not take into account other factors such as features, flexibility and market penetration. It seems clear that the H.263, MPEG-2 and MPEG-4 standards each have their advantages for designers of video communication systems. Each of these standards makes use of common coding technologies: motion estimation and compensation, block transformation and entropy coding. In the next section of this book we will examine these core technologies in detail.

REFERENCES

1. ITU-T Recommendation H.261, 'Video CODEC for audiovisual services at p x 64 kbit/s', 1993.
2. ITU-T Recommendation H.263, 'Video coding for low bit rate communication', Version 2, 1998.
3. ITU-T Q6/SG16 VCEG-L45, 'H.26L Test Model Long Term Number 6 (TML-6) draft 0', March 2001.
4. ITU-T Q6/SG16 VCEG-M08, 'Objective coding performance of [H.26L] TML 5.9 and H.263+', March 2001.

6 Motion Estimation and Compensation

6.1 INTRODUCTION

In the video coding standards described in Chapters 4 and 5, blocks of image samples or residual data are compressed using a block-based transform (such as the DCT) followed by quantisation and entropy encoding. There is limited scope for improved compression performance in the later stages of encoding (DCT, quantisation and entropy coding), since the operation of the DCT and the codebook for entropy coding are specified by the relevant video coding standard. However, there is scope for significant performance improvement in the design of the first stage of a video CODEC (motion estimation and compensation). Efficient motion estimation reduces the energy in the motion-compensated residual frame and can dramatically improve compression performance. Motion estimation can be very computationally intensive and so this compression performance may be at the expense of high computational complexity. This chapter describes the motion estimation and compensation process in detail and discusses implementation alternatives and trade-offs.

The motion estimation and compensation functions have many implications for CODEC performance. Key performance issues include:

- Coding performance (how efficient is the algorithm at minimising the residual frame?)
- Complexity (does the algorithm make effective use of computation resources, and how easy is it to implement in software or hardware?)
- Storage and/or delay (does the algorithm introduce extra delay and/or require storage of multiple frames?)
- 'Side' information (how much extra information, e.g. motion vectors, needs to be transmitted to the decoder?)
- Error resilience (how does the decoder perform when errors occur during transmission?)

These issues are interrelated and are potentially contradictory (e.g. better coding performance may lead to increased complexity and delay and poor error resilience), and different solutions are appropriate for different platforms and applications. The design and implementation of motion estimation, compensation and reconstruction can be critical to the performance of a video coding application.

6.2 MOTION ESTIMATION AND COMPENSATION

6.2.1 Requirements for Motion Estimation and Compensation

Motion estimation creates a model of the current frame based on available data in one or more previously encoded frames ('reference frames'). These reference frames may be 'past' frames (i.e. earlier than the current frame in temporal order) or 'future' frames (i.e. later in temporal order). The design goals for a motion estimation algorithm are to model the current frame as accurately as possible (since this gives better compression performance) whilst maintaining acceptable computational complexity.
In Figure 6.1, the motion estimation module creates a model by modifying one or more reference frames to match the current frame as closely as possible (according to a matching criterion). The current frame is motion compensated by subtracting the model from the frame to produce a motion-compensated residual frame. This is coded and transmitted, along with the information required for the decoder to recreate the model (typically a set of motion vectors). At the same time, the encoded residual is decoded and added to the model to reconstruct a decoded copy of the current frame (which may not be identical to the original frame because of coding losses). This reconstructed frame is stored to be used as a reference frame for further predictions.

Figure 6.1 Motion estimation and compensation block diagram

The residual frame (or displaced frame difference, DFD) is encoded and transmitted, together with any 'side information' (such as motion vectors) needed to recreate the model at the decoder. The 'best' compression performance is achieved when the size of the coded DFD and coded side information is minimised. The size of the coded DFD is related to the energy remaining in the DFD after motion compensation. Figure 6.2 shows a previous, current and residual frame (DFD) without motion compensation: there is clearly a significant amount of energy present around the boundaries of moving objects (the girl and the bicycle in this case). It should be possible to reduce this energy (and improve compression performance) using motion estimation and compensation.

Figure 6.2 (a) Previous frame; (b) current frame; (c) DFD (no motion compensation)

6.2.2 Block Matching

In the popular video coding standards (H.261, H.263, MPEG-1, MPEG-2 and MPEG-4), motion estimation and compensation are carried out on 8 x 8 or 16 x 16 blocks in the current frame. Motion estimation of complete blocks is known as block matching.

For each block of luminance samples (say 16 x 16) in the current frame, the motion estimation algorithm searches a neighbouring area of the reference frame for a 'matching' 16 x 16 area. The best match is the one that minimises the energy of the difference between the current 16 x 16 block and the matching 16 x 16 area. The area in which the search is carried out may be centred around the position of the current 16 x 16 block, because (a) there is likely to be a good match in the immediate area of the current block due to the high similarity (correlation) between subsequent frames and (b) it would be computationally intensive to search the whole of the reference frame.

Figure 6.3 illustrates the block matching process. The current 'block' (in this case, 3 x 3 pixels) is shown on the left and this block is compared with the same position in the reference frame (shown by the thick line in the centre) and the immediate neighbouring positions (+/-1 pixel in each direction).

Figure 6.3 Current 3 x 3 block and 5 x 5 reference area

The mean squared error (MSE) between the current block and the same position in the reference frame (position (0, 0)) is given by

{(1-4)^2 + (3-2)^2 + (2-3)^2 + (6-4)^2 + (4-2)^2 + (3-2)^2 + (5-4)^2 + (4-3)^2 + (3-3)^2} / 9 = 22/9 = 2.44

The complete set of MSE values for each search position is listed in Table 6.1 and shown graphically in Figure 6.4.
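To make the search procedure concrete, the following fragment is a minimal sketch in C of how an encoder might evaluate the MSE criterion at each of the nine candidate offsets and keep the best match. It is an illustration rather than an implementation from any standard: the sample values and function names are arbitrary assumptions of the example.

/* Minimal block-matching sketch: exhaustively evaluate the MSE of a 3x3
 * current block against every candidate 3x3 region of a 5x5 reference area
 * (search range +/-1 pixel).  Sample values are illustrative only. */
#include <stdio.h>

#define B 3            /* block size */
#define R 5            /* reference area size (B + two 1-pixel borders) */

/* MSE between the current block and the reference region at offset (dx, dy),
 * where dx and dy are in the range -1..+1 relative to the centre position. */
static double mse_at(const int cur[B][B], const int ref[R][R], int dx, int dy)
{
    int sum = 0;
    for (int y = 0; y < B; y++)
        for (int x = 0; x < B; x++) {
            int diff = cur[y][x] - ref[y + 1 + dy][x + 1 + dx];
            sum += diff * diff;
        }
    return (double)sum / (B * B);
}

int main(void)
{
    const int cur[B][B] = { {1, 3, 2}, {6, 4, 3}, {5, 4, 3} };
    const int ref[R][R] = { {2, 3, 3, 2, 1},
                            {3, 4, 2, 3, 2},
                            {4, 4, 2, 2, 3},
                            {3, 4, 3, 3, 2},
                            {2, 3, 3, 4, 3} };
    int best_dx = 0, best_dy = 0;
    double best_mse = mse_at(cur, ref, 0, 0);

    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
            double m = mse_at(cur, ref, dx, dy);
            printf("offset (%+d,%+d): MSE = %.2f\n", dx, dy, m);
            if (m < best_mse) { best_mse = m; best_dx = dx; best_dy = dy; }
        }
    printf("best match at (%+d,%+d), MSE = %.2f\n", best_dx, best_dy, best_mse);
    return 0;
}

In a practical CODEC the same structure is applied to 16 x 16 luminance blocks over a much larger search window, and the squared-error measure is often replaced by a simpler sum of absolute differences, as discussed later in this chapter.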
Of the nine candidate positions, ( - 1, l ) gives the smallest MSE and hence the ‘best’ match. In this example, the best ‘model’ for the current block (i.e. the best prediction) is the 3 x 3 region in position ( - l , 1). A video encoder carries out this process for each block in the current frame 1. Calculate the energyof the difference betweenthe current block and a set of neighbouring regions in the reference frame. 2. Select the region that gives the lowest error (the ‘matching region’). 3. Subtract the matching region from the current block to produce a difference block. 4. Encode and transmit the difference block. 5. Encode and transmit a ‘motion vector’ that indicates theposition of the matching region, relative to the current block position (in the above example, the motionvector is ( - I , 1). Steps 1 and 2 above correspond to motion estimation and step 3 to motion compensation. The video decoder reconstructs the block as follows: 1. Decode the difference block and motion vector. 2. Add the difference block to the matching region in the reference frame (i.e. the region ‘pointed’ to by the motion vector). Table6.1 MSEvalues for block matchingexample Position(x,y) (-1-,1) MSE 4.26.78 (0, - 1 ) 2.89 (1, - 1 ) (-1,O) 2.434.22 (0,O) (1, 0) (-1, 1) (0, 1 ) (1, 1) 3.33 0.22 5.323.56 ESMTOIMTAIOTNION AND COMPENSATION 97 (b) Figure 6.4 MSE map:(a)surfaceplot; (b) pseudocolourplot 6.2.3 MinimisingDifferenceEnergy The name‘motionestimation’ismisleadingbecausetheprocessdoes not necessarily identify ‘true’ motion, instead it attempts to find a matching region in the reference frame MOTION ESTIMATION AND COMPENSATION l, I 2t 3 4- 5- 6- e-- 7 8 “0 2 4 6 8 10 12 Figure 6.5 16 x 16 block motion vectors thatminimisestheenergy of thedifferenceblock.Wherethereisclearly identifiable linearmotion,such as largemovingobjectsorglobalmotion(camerapanning,etc.), motion vectors produced in this way should roughly correspond to the movement of blocks between the reference and the current frames. However, where the motion is less obvious (e.g.smallmovingobjectsthat do notcorrespondtocompleteblocks,irregularmotion, etc.), the ‘motion vector’ may not indicate genuine motion but rather the positionof a good match. Figure6.5showsthemotionvectorsproduced by motionestimationforeach of the 16 x 16 blocks (‘macroblocks’)of the frame in Figure 6.2.Most of the vectors do correspond to motion: the girl and bicycle are movintgo the left and so the vectors point to therighr (i.e. to the region the objects have movedfrorn). There is an anomalous vectionrthe middle (it is larger than the rest and points diagonally upwards). This vector does not correspond to ‘true’ motion, it simply indicates that the best match can be found in this position. There are many possible variations on the basic block matching process, some of which will be described later in this chapter. Alternative measures of DFD energy maybeused (to reduce the computation required to calculate MSE). Varying block sizes, or irregular- shaped regions, can be more efficient at matching ‘true’ motion than fixed 16 x 16 blocks. A better match may be found by searching within two or more reference frames (rather than justone).Theorder of searchingneighbouringregionscanhave a significant effecton matching efficiency and computational complexity. Objects do not necessarily move byan integral number of pixels between successive frames andso a better match may be obtained by searching sub-pixel positions in the reference frame. 
The block matching process itself only works well for large, regular objects with linear motion: irregular oabnjdecntosn-linear motion(such as rotation or deformation) may bemodelledmoreaccurately with other motion estimation methods such as object-based or mesh-based estimation. EFSTUMILMLOATSTIEOIAONRNCH 99 Comparison criteria Mean squared error provides a measure of the energy remaining in the difference block. MSE for a N x N-sample block can be calculated as follows: where C, is a sampleof the current block,RV is a sampleof the reference area andCOOR,oo are the top-left samples in the current and reference areas respectively. Mean absolute error (MAE) provides a reasonably good approximationf residual energy and is easier to calculate than MSE, since it requires a magnitude calculation instead of a square calculation for each pair of samples: . N-l N-l Thecomparison maybe simplifiedfurther by neglectingtheterm l/N2 andsimply calculating the sum of absolute errors (SAE) or sum of absolute differences (SAD): SAE gives a reasonable approximation to block energy and so Equation 6.3 is a commonly used matching criterion for block-based motion estimation. 6.3 FULL SEARCH MOTION ESTIMATION In order to find the best matching region in the reference frame, in theory it is necessary to carry outacomparison of thecurrentblock with everypossibleregion ofthe reference frame. This is usually impractical becauseof the large number of comparisons required. In practice,agoodmatchforthecurrentblockcan usuallybe found in theimmediate neighbourhood of the block position in the reference frame (if a match exists). Hence in practical implementations, the search for a matching regionis limited to a ‘search window’, typically centred on the current block position. The optimum size of search window depends on several factors: the resolution of each frame(alarger window isappropriateforahigherresolution), the type of scene(high- motionscenes benefit fromalargersearch windowthan low-motionscenes)and the 100 start EMSTOITMIOANTION AND COMPENSATION end (start in centre) (a) Raster order (b) Spiral order Figure 6.6 Fullsearch:(a) rasterand (b) spiralorder availableprocessingresources(sincealargersearch window requiresmorecomparison operations and hence more processing). Full search motion estimation calculates the comparison criterion (such asSAE) at each possible location in the searchwindow. Full search is computationally intensive, particularly for large search windows. The locations may be processed in a ‘raster’ order (Figure 6.6, left-hand diagram) or in a spiral order starting from the centre (0, 0) position (Figure 6.6, right-handdiagram). The spiralsearchorderhascertaincomputationaladvantages when early termination algorithms areused (see Section6.9.1) because the best match (and hence the smallest SAE) is most likely to occur near the centre of the search region. Figure 6.7 shows an example of the SAE results for a full search. The figure shows the currentblockandthereferencearea (+/-l5 pixelsaroundthecurrent16 X 16block position) together with a plotof the SAE values found at each search location. There are a + total of 3 1 x 31 S A E values (corresponding to integer steps fro- m15 to 15 in thex and y directions). The smallesSt AE value can be found at locatio(nx =6, y = 1) and is marked on the SAE plot. This is the global minimumof the SAE function in the search region and the full search algorithm will selectthis position as the ‘best’ match. 
Note that there are other, local minima of the S A E function (the dark ‘patches’ on the SAE plot): the importance of these local minima will become clear in the next section. Theeffect of motionestimation and compensation is illustratedinFigure 6.8. After motion estimation (using full search block matchianngd) compensation, the reference frame (shown in Figure 6.2) is ‘reorganised’ to provide a closer match to the current frame. The motion-compensated DFD shown in Figure6.8 contains less energy than the uncompensated DFD in Figure 6.2 and will therefore produce a smaller coded frame. FULL SEARCH MOTION ESTIMATION 101 (c) Figure 6.7 SAD map: (a) current block; (b) search area; (c) map with minima 102 MOTION ESTIMATION AND COMPENSATION Figure 6.8 Residual frameafter full search motion estimation and compensation 6.4 FAST SEARCH The computational complexityof a full search algorithm is often prohibitive, particularly for software CODECs that must operate in‘real time’. Many alternative ‘fast search’ algorithms have been developedformotionestimation. A fastsearchalgorithmaimstoreducethe number of comparison operations comparedwith full search,i.e. a fast search algorithmwill ‘sample’ just a few of the points in the SAE map whilst attempting to find the minimum SAE. The critical questionwihsether the fast algorithm can locate th‘terue’ minimum rather than a ‘local’ minimum. Whereas the full search algorithm is guaranteed to find the global minimum SAE, a search algorithm that samples only someof the possible locations in the search region may get ‘trapped’ in a local minimum. The result is that the difference block found by the fast search algorithm contains more energy than the block fobuynfdull search and hencethenumber of codedbitsgenerated by thevideoencoder will belarger. Because of this, fast search algorithms usually give poorer compression performance than full search. 6.4.1 Three-StepSearch (TSS)’ This algorithm is most widely known in its three-step form, the ‘three-step search’ (TSS), but it can be carried out with other numbers of steps (i.e. N-step search). For a search window of + / - ( 2 N - 1) pixels, the TSS algorithm is as follows: 1. Search location (0, 0). 2. Set S = 2N-’ (the step size). 3. Search eight locations +/-S pixels around location (0, 0). 4. From the nine locations searched so far, pick the location with the smallest make this the new search origin. SAE and FAST SEARCH 103 Figure 6.9 Three-stepsearch(TSS) 5. Set S = S/2. 6. Repeat stages 3-5 until S = 1. Figure 6.9 illustrates the procedure for a sewarinchdow of +/-7 (i.e. N =3). The first ‘step’ involves searching location (0, 0) and eight locations +/-4 pixels around the origin. The second ‘step’ searches+/-2 pixels around the best match from thfeirst step (highlighted in bold) and the third step searches +/-l pixels around the best match from the second step (again highlighted).The best match from this third step is chosen as the resuoltf the search algorithm. With a search window of +/-7, three repetitions (steps) are required tofind the bestmatch. A total of (9+8+8) =25 searchcomparisonsarerequiredfortheTSS, comparedwith (15 x 15)=225 comparisonsfortheequivalentfullsearch. In general, (8N+ 1) comparisons are required for a search area of +/-(2N - 1) pixels. 6.4.2 LogarithmicSearch’ The logarithmic search algorithm can be summarised as follows: 1. Search location (0, 0). 2. Search four locations in the horizontal and vertical directions, S pixels away from the + origin (where S is the initial step size). 
The five locations make a ‘ ’ shape. 3. Set the new origin to the best match (of the five locations tested).If the best match is at the centre of the ‘+’, S = S/2,otherwise S is unchanged. 4. If S = 1 then go to stage 5 , otherwise go to stage 2. 5. Search eight locations immediately surrounding the best match. The search result is the best match of the search origin and these eight neighbouring locations. 104 EMSTOITMIOANTION AND COMPENSATION Figure 6.10 Logarithmic search Figure 6.10 shows an example of the search pattern with S = 2 initially. Again, the best match at each iteration is highlighted in bold (note that the bold 3 is the best match at iteration 3 and at iteration4). In this example20 search comparisons are required: however, the number of comparisons varies depending on numberof repetitions of stages 2, 3 and 4 above. Note that the algorithmwill not search a candidate positionif it is outside the search window (+/-7 in this example). 6.4.3 Cross-Search3 This algorithm is similar to the three-step search except tfhivaet points are compared at each step (forming an X) instead of nine. 1. Search location (0, 0). 2. Search four locations at +/-S, forming an ‘X’ shape (where S = 2N-’ as for the TSS). 3. Set the new origin to be the best match of the five locations tested. 4. If S > 1 then S = S/2 and go to stage 2; otherwise go to stage 5. 5. If the best matchis at thetop left or bottom rightof the ‘X’, evaluate four more points in + an ‘X’ at a distance of +/-l; otherwise (best match is at the top right or bottom left) evaluate four more points in a ‘ ’ at a distance of +/-l. + Figure 6.11 showstwoexamples of the cross-search algorithm: in the first example, the final points are in the shape of a ‘X’ and in the second, they are in the shape of a ‘ ’ (thebestmatch at eachiterationishighlighted).Thenumber of SADcomparisonsis + (4N 5) for a search area of +/-(2N - 1) pixels (i.e. 17 comparisons for a +/-7 pixel window). FAST SEARCH 105 Figure 6.11 Cross-search 6.4.4One-at-a-TimeSearch This simple algorithm essentially involves following the SAD ‘gradient’ in the horizontal direction until a minimum is found, then following the gradient in the vertical directionto find a vertical minimum: 1. Set the horizontal origin to (0,0). 2. Search the origin and the two immediate horizontal neighbours. 3. If the origin has the smallest SAD(of the three neighbouring horizontal points), then go to stage 5, otherwise. . .. 4. Setthe new origintothehorizontalpoint with thesmallest SAD and searchthe neighbouring point that has not yet been searched. Go to stage 3. 5 . Repeat stages 2-4 in the vertical direction. The one-at-a-time search is illustrated in Figure 6.12. The positions marked1 are searched and the left-hand position gives the best match. Position 2 is searched and gives the best match. The horizontal search continues with position3s, 4 and 5 until position 4 is found to havealower SAD thanposition 5 (i.e. ahorizontalminimumhas been detected).The vertical search starts with positions 6: the best match is at the top and the vertical search continues with 7, 8, 9 until a minimum is detected at position 8. In this example only nine searches are carried out: however, there is clearly potential to be trainppaedlocal minimum. 6.4.5NearestNeighboursSearch4 This algorithmwasproposedfor H.263 and MPEG-4 (shortheader)CODECs. In these CODECs, each motion vector is predicted from neighbouring (already coded) motion vectors prior to encoding (see Figure 8.3). 
This makes it preferable to choose a vector closeto this 106 EMSTOIMTIAOTNION AND COMPENSATION Figure 6.12 One-at-a-timesearch ‘median predictor’ position, for two reasons. First, neighbouring macroblocks often have similar motion vector(sso that thereis a good chance that the median predictor will be close to the ‘true’ best match). Second, a vector near the median will have a small displacement and therefore a small VLC. The algorithm proceeds as follows: 1. Search the (0, 0) location. 2. Set the search origin to the predicted vector location and search this position. ‘+ 3. Search the four neighbouring positionsto the origin in a ’ shape. 4. If the search origin (or location0, 0 for the first iteration) gives the best matchth,is is the chosen search result; otherwise, set tnhew origin to the positioonf the best matchand go to stage 3. + The algorithm stopswhen the best match is at the centoref the ‘ ’ shape (or the edgeof the search window has been reached).An example of a search sequence ishown in Figure 6.13. Figure 6.13 Nearest neighbours search FAST SEARCH 107 The median predicted vector is ( - 3, 3) and this is shown with an arrow. The (0, 0) point (marked 0) andthe first ‘layer’ of positions(marked 1) are searched: the best matchis highlighted. The layer2 positions are searched,followed by layer 3. The best match for layer + 3 is in the centre of the ‘ ’ shape and so the search is terminated. This algorithm will perform well if the motion vectors are reasonably homogeneous, i.e. there are not too many sudden changes in the motion vector field. The algorithm described in4 includes two further features. First, if the median predictor is unlikely to be accurate (because too many neighbouring macroblocks are intra-coded and therefore have nomotion vectors), an alternative algorithm such as the TSS is used. Second, a cost function is proposed to estimate whether the computational complexity of carrying out the next set of searches is worthwhile. (This will be discussed further in Chapter 10.) 6.4.6 HierarchicaSl earch The hierarchical search algorithm (and its variants) searches a coarsely subsampled version of the image first, followed by successively higher-resolution versions until the full image resolution is reached: 1. Level 0 consists of the current andreferencferames at their full resolutions. Subsample level 0 by a factor of 2 in the horizontal and vertical directions to produce level 1. 2. Repeat, subsampling level 1 to produce level 2, and so on until the required number of levels are available (typically, three or four levels are sufficient). 3. Searchthe highest level to find the best match: this isthe initial ‘coarse’ motion vector. 4. Search the next lower level around the position of the ‘coarse’ motion vector and find the best match. 5 . Repeat stage 4 until the best match is found at level 0. The search method used at the highest level may be full search or a ‘fast’ algorithm such as TSS. Typically, at eachlower level only +/-l pixels are searchedaroundthecoarse vector. Figure 6.14 illustrates themethod with three levels (2, 1 and 0) and a window of +/-3 positions at the highest level. A full search iscarriedoutat the top level: however, the complexity is relatively low because we are only comparing a 4 x 4 pixel area at each level 2 search location. The best match (the number ‘2’) is used as the centre of the level 1 search, where eight surrounding locations are searched. The best match (number ‘1 ’) is used as the centre of the final level 0 search. 
The equivalent search window is +/-l5 pixels (i.e. the algorithm can find a match anywhere within +/-l5 pixels of the origin at level 0). In total, 49 searches are carried out at level 2 (each comparing 4 x 4 pixel regions), 8 searches at level 1 (each comparing 8 x 8 pixel regions) and 8 searches at level 0 (comparing 16 x 16 pixel regions). 108 MOTION ESTIMATION AND COMPENSATION COMPARISON OF MOETSITOINMAATLIOGNORITHMS 109 6.5 COMPARISON OF MOTION ESTIMATION ALGORITHMS The widerange of algorithmsavailableforblock-basedmotionestimationcanmake it difficult to choose between them. There are a numbeorf criteria that may help in the choice: 1. Matching performance: how effective is the algorithm at minimising the residual block? 2. Rate-distortionperformance: howwell does the completeCODECperformatvarious compressed bit rates? 3. Complexity: how many operations are required to complete the matching process? 4. Scalability: does the algorithm perform equally well for large and small search windows? 5. Implementation: is the algorithm suitable for software or hardware implementation for the chosen platform or architecture? Criteria 1 and 2 appear tobe identical. If the algorithm is effective at minimising the energy in themotion-compensatedresidualblock,thenitought to providegoodcompression efficiency (good image quality at laow compressed bit rate). However, there are other factors that complicate things: for example, every motion vector that is calculated by the motion estimation algorithm must be encoded and transmitted as part of the compressed bit stream. As will be discussedin Chapter 8, larger motion vectorsare usually coded with more bits and so analgorithm that efficiently minimises the residualframe but produceslargemotion vectors maybe less efficient thananalgorithmthatis ‘biased’ towardsproducingsmall motion vectors. Example In the following example, block-based motion estimation and compensation were carried out on five frames of the‘bicycle’sequence (shownin Figure 6.2). Table 6.2 comparesthe performance of full search motion estimation with a rangoef search window sizes. The table lists the total SAE of the five difference frames without motion compensation (i.e. simply subtracting the previous from the current frame) and with motion compensation (i.e. block- based motion compensation on 16 x 16 blocks). The final column lists the total number of comparison operations (where one operation is the comparison of two luminance samples, IC, - R,\), As thesearch window increases,motioncompensation efficiency improves (shown by a smaller SAE): however, the number of operations increases exponentially with the window size.Thissequencecontainsrelativelylowmovement and so most of the Table 6.2 Full searchmotionestimation, five frames:varyingwindowsize Total Search compensated)window +/-l +/-3 +/-7 +/-l5 SAE 783 1 326 ... 23.4 . . . ... 581 99.1 Total SAE (compensated) 1 278 610 1 173 060 898 897 163 Number of comparison operations 1.0 x lo6 5.2 x lo6 x 10‘ x 10‘ 110 EMSTOITMIOANTION AND COMPENSATION Table 6.3 Motion estimation algorithmcomparison, five frames:searchwindow = +/-l5 (u.noAc(pcoleogmrmoaprtpeitoenahnrsmasisatoetnedd)) Total SAE Total S A E Number of 99.s1earc1h63Full 897 783 326 searcThhree-step 1 ... 3.6 753 914 x lo6 x lo6 performancegainfrommotionestimation is achieved with asearch window of +/-7 samples. Increasing thewindow to +/-l5 gives only a modest improvement inSAE at the expense of a fourfold increase in computation. 
Table 6.3 compares the performance of full search and three-step search with a search window of +/-l5 pixels. Full search produces a lower S A E and hence a smaller residual frame than TSS. However, the slight increase in SAE produced by the TSS algorithm is offset by a substantial reduction in the numberof comparison operations. Figure 6.15 shows how afastsearchalgorithmsuchastheTSS may fail tofind the bestpossiblematch.Thethree-stepsearchalgorithmstarts by consideringthepositions 15 10 5 0 -5 -10 -’!l5 -10 -5 0 5 10 15 Figure 6.15 SAE map showing three-stepsearch‘trapped’in local minimum SUB-PIXEL MOTION ESTIMATION 111 +/-8 pixels around the originT. he best match at thefirst step is found a(t- 8 , O ) and this is marked with a circle on the figureT. he next step examines positions withi+n/-4 pixels of this point and the bestof these is found at (- 12, -4). Step 3 also chooses the poin(t- 12, + -4) and the final step selects(- 13, -3) as the best match (shown wit‘h a’). This pointis a local minimumbut not the global minimum. Hencethe residualblockaftermotion compensationwillcontainmoreenergythanthebestmatchfound by thefullsearch algorithm (point 6, 1 marked with an ‘X’). Of the other search algorithms mentioned above, logarithmic search, cross-search and one-at-a-time search providelow computational complexity at the expenosferelatively poor matching performance. Hierarchical search can give a good compromise between perfor- mance and complexity and is well suited to hardware implementations. Nearest-neighbours search, with its in-built ‘bias’ towards the median-predicted motion vector, is reported to perform almostas well as full search, with a very much reduced complexity. The high perfor- mance is achieved becausethe ‘bias’ tends to producevery small (and hencevery efficiently coded) motion vectors and this efficiency offsets the slight drop in SAE performance. 6.6 SUB-PMEL MOTIONESTIMATION So far, we have assumed that the best match can be found at a region offset from the current block by an integer number of pixels. In fact, for many blocks a better match (and hence a smaller DFD) can be obtained by searching a region interpolated to sub-pixel accuracy. The search algorithm is extended as follows: 1. Interpolate between the samples of the search area in the reference frame to form a higher-resolution interpolated region. 2. Searchfull-pixel and sub-pixellocationsintheinterpolatedregionand match. find thebest 3. Subtract the sampleosf the matching region (whether full- or sub-pixel) from the samples of the current block to form the difference block. Half-pixel interpolation is illustrated in Figur6e.16. The original integer pixel positions‘a’ are shown in black. Samples b and c (greayr)e formed by linear interpolation between pairs @ oinrtiegginear:l samples 43 @@B@ b,c,d: interpolatedsamples Arrows indicate directionof interpolation Figure 6.16 Half-pixelinterpolation 112 EMSTOITMIOATNION AND COMPENSATION of integer pixels, and samples d (white) are interpolated between four integer pixels (as indicated by the arrows). Motion compensationwith half-pixel accuracy is supportedby the H.263 standard, and higher levels of interpolation ($ pixel or more) are proposed for the emerging H.26L standardI.ncreasingthe‘depth’ of interpolationgivesbetterblock matching performance at the expenseof increased computational complexity. Searchingonasub-pixelgridobviouslyrequiresmorecomputationthantheinteger searches described earlier. 
In order to limit the increase in complexity, it is common practice to find the best matching integer position and then to carry out a search at half-pixel locations immediately around this position. Despite the increased complexity, sub-pixel motion estimation and compensation can significantly outperform integer motion estimation/compensation. This is because a moving object will not necessarily move by an integral number of pixels between successive video frames. Searching sub-pixel locations as well as integer locations is likely to find a good match in a larger number of cases.

Interpolating the reference area shown in Figure 6.7 to half-pixel accuracy and comparing the current block with each half-pixel position gives the SAE map shown in Figure 6.17. The best match (i.e. the lowest SAE) is found at position (6, 0.5). The block found at this position in the interpolated reference frame gives a better match than position (6, 1) and hence better motion compensation performance.

Figure 6.17 SAE map (half-pixel interpolation)

6.7 CHOICE OF REFERENCE FRAMES

The most 'obvious' choice of reference frame is the previous coded frame, since this should be reasonably similar to the current frame and is available in the encoder and decoder. However, there can be advantages in choosing from one or more other reference frames, either before or after the current frame in temporal order.

6.7.1 Forward Prediction

Forward prediction involves using an 'older' encoded frame (i.e. a preceding frame in temporal order) as prediction reference for the current frame. Forward prediction performs poorly in certain cases, for example:

1. when there is a significant time difference between the reference frame and the current frame (which may mean that the image has changed significantly);
2. when a scene change or 'cut' occurs;
3. when a moving object uncovers a previously hidden area of the image (e.g. a door opens): the hidden area does not exist in the reference frame and so cannot be efficiently predicted.

6.7.2 Backwards Prediction

The prediction efficiency for cases (2) and (3) above can be improved by using a 'future' frame (i.e. a later frame in temporal order) as prediction reference. A frame immediately after a scene cut, or an uncovered object, can be better predicted from a future frame. Backwards prediction requires the encoder to buffer coded frames and encode them out of temporal order, so that the future reference frame is encoded before the current frame.

6.7.3 Bidirectional Prediction

In some cases, bidirectional prediction may outperform forward or backward prediction: here, the prediction reference is formed by 'merging' forward and backward references. Forward, backward and bidirectional predictions are all available for encoding an MPEG-1 or MPEG-2 B-picture. Typically, the encoder carries out two motion estimation searches for each macroblock (16 x 16 luminance samples), one based on the previous reference picture (an I- or P-picture) and one based on the future reference picture. The encoder finds the motion vector that gives the best match (i.e. the minimum SAE) based on (a) the previous reference frame and (b) the future reference frame. A third SAE value (c) is calculated by subtracting the average of the two matching areas (previous and future) from the current macroblock. The encoder chooses the 'mode' of the current macroblock based on the smallest of these three SAE values: (a) forward prediction, (b) backwards prediction, or (c) bidirectional prediction. A sketch of this three-way decision is given below.
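The following sketch illustrates the three-way mode decision described above. The names and the assumption that the best forward and backward matches have already been extracted as contiguous 16 x 16 blocks are choices made for this example, not part of the MPEG standards.

#include <stdlib.h>

/* cur is the current 16x16 macroblock; fwd_pred and bwd_pred are the 16x16
   best-match regions already found by the forward and backward searches.
   All are stored as contiguous 256-byte arrays. */
enum mb_mode { MODE_FORWARD, MODE_BACKWARD, MODE_BIDIRECTIONAL };

static int sae_256(const unsigned char *a, const unsigned char *b)
{
    int sae = 0;
    for (int n = 0; n < 256; n++)
        sae += abs((int)a[n] - (int)b[n]);
    return sae;
}

enum mb_mode choose_b_mode(const unsigned char *cur,
                           const unsigned char *fwd_pred,
                           const unsigned char *bwd_pred)
{
    unsigned char bi_pred[256];

    /* (c) bidirectional prediction: average of the forward and backward matches */
    for (int n = 0; n < 256; n++)
        bi_pred[n] = (unsigned char)(((int)fwd_pred[n] + (int)bwd_pred[n] + 1) / 2);

    int sae_fwd = sae_256(cur, fwd_pred);   /* (a) forward      */
    int sae_bwd = sae_256(cur, bwd_pred);   /* (b) backward     */
    int sae_bi  = sae_256(cur, bi_pred);    /* (c) bidirectional */

    /* choose the mode giving the smallest SAE */
    if (sae_bi <= sae_fwd && sae_bi <= sae_bwd)
        return MODE_BIDIRECTIONAL;
    return (sae_fwd <= sae_bwd) ? MODE_FORWARD : MODE_BACKWARD;
}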
In this way, the encoder can find the optimum prediction reference for each macroblock and this improves compression efficiency for B-pictures.

6.7.4 Multiple Reference Frames

MPEG-1 and MPEG-2 B-pictures are encoded using two reference frames. This approach may be extended further by allowing the encoder to choose a reference frame from a larger number of previously encoded frames. Choosing between multiple possible reference frames can be a useful tool in improving error resilience (as discussed in Chapter 11). This method is supported by the H.263 standard (Annexes N and U, see Chapter 5) and has been analysed in detail elsewhere.5

Encoder and decoder complexity and storage requirements increase as more prediction reference frames are utilised. 'Simple' forward prediction from the previous encoded frame gives the lowest complexity (but also the poorest compression efficiency), whilst the other methods discussed above add complexity (and potentially encoding delay) but give improved compression efficiency. Figure 6.18 illustrates the prediction options discussed above, showing forward and backwards prediction from past and future frames.

Figure 6.18 Reference frame prediction options (forward and backward prediction)

6.8 ENHANCEMENTS TO THE MOTION MODEL

Bidirectional prediction and multiple reference frames (described above) can increase compression efficiency because they improve the motion model, allowing a wider range of prediction options for each coded macroblock than a simple forward prediction from the previous encoded frame. Sub-pixel interpolation of the reference frame also improves the motion model by catering for the case when motion does not map neatly onto integer-pixel locations. There are a number of other ways in which the motion model may be enhanced, some of which are listed here.

6.8.1 Vectors That can Point Outside the Reference Picture

If movement occurs near the edges of the picture, the best match for an edge block may actually be offset slightly outside the boundaries of the reference picture. Figure 6.19 shows an example: the ball that has appeared in the current frame is partly visible in the reference frame and part of the best matching block will be found slightly above the boundary of the frame. The match may be improved by extrapolating the pixel values at the edge of the reference picture. Annex D of H.263 supports this type of prediction by simple linear extrapolation of the edge pixels into the area around the frame boundaries (shown in Figure 6.19). Block matching efficiency, and hence compression efficiency, is slightly improved for video sequences containing motion near the edges of the picture.

Figure 6.19 Example of best match found outside the reference picture

6.8.2 Variable Block Sizes

Using a block size of 16 x 16 for motion estimation and compensation gives a rather 'crude' model of image structure and motion. The advantages of a large block size are simplicity and the limited number of vectors that must be encoded and transmitted. However, in areas of complex spatial structure and motion, better performance can be achieved with smaller block sizes. H.263 Annex F enables an encoder to switch between a block size of 16 x 16 (one motion vector per macroblock) and 8 x 8 (four vectors per macroblock): the small block size is used when it gives better coding performance than the large block size (a sketch of this decision is shown after this paragraph).
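The sketch below illustrates a simple version of the 16 x 16 versus four-8 x 8 decision. It is illustrative only: it compares SAE alone, whereas a practical encoder would also take account of the extra bits required to code four motion vectors. The names and vector layout are assumptions made for this example.

#include <stdlib.h>

static int sae_block(const unsigned char *cur, const unsigned char *ref,
                     int width, int x, int y, int dx, int dy, int size)
{
    int sae = 0;
    for (int j = 0; j < size; j++)
        for (int i = 0; i < size; i++)
            sae += abs((int)cur[(y + j) * width + x + i] -
                       (int)ref[(y + j + dy) * width + (x + i + dx)]);
    return sae;
}

/* Returns 1 if four 8x8 vectors give a lower total SAE than one 16x16 vector.
   mv16 holds the single 16x16 vector; mv8 holds four vectors, one per 8x8
   sub-block in raster order. Each vector is stored as {dx, dy}. */
int use_four_vectors(const unsigned char *cur, const unsigned char *ref,
                     int width, int x, int y,
                     const int mv16[2], const int mv8[4][2])
{
    int sae16 = sae_block(cur, ref, width, x, y, mv16[0], mv16[1], 16);

    int sae8 = 0;
    for (int b = 0; b < 4; b++) {
        int bx = x + (b % 2) * 8;
        int by = y + (b / 2) * 8;
        sae8 += sae_block(cur, ref, width, bx, by, mv8[b][0], mv8[b][1], 8);
    }
    return sae8 < sae16;
}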
Motion compensation performance is noticeably improved at the expense of an increase in complexity: carrying out four searches per macroblock (albeit on a smaller block size with only 64 calculations per SAE comparison) requires more operations.

The emerging H.26L standard takes this approach further and supports multiple possible block sizes for motion compensation within a macroblock. Motion compensation may be carried out for sub-blocks with horizontal or vertical dimensions of any combination of 4, 8 or 16 samples. The extreme cases are 4 x 4 sub-blocks (resulting in 16 vectors per macroblock) and 16 x 16 blocks (one vector per macroblock), with many possibilities in between (4 x 8, 8 x 8, 4 x 16 blocks, etc.). This flexibility gives a further increase in compression performance at the expense of higher complexity.

6.8.3 Overlapped Block Motion Compensation (OBMC)

When OBMC is used, each sample of the reference block used for motion compensation is formed by combining three predictions:

1. a sample predicted using the motion vector of the current block (R0);
2. a sample predicted using the motion vector of the adjacent block in the vertical direction (i.e. the nearest neighbour block above or below) (R1);
3. a sample predicted using the motion vector of the adjacent block in the horizontal direction (i.e. the nearest neighbour block left or right) (R2).

The final sample is a weighted average of the three values. R0 is given the most weight (because it uses the current block's motion vector). R1 and R2 are given more weight when the current sample is near the edge of the block, less weight when it is in the centre of the block. The result of OBMC is to 'smooth' the prediction across block boundaries in the reference frame. OBMC is supported by Annex F of H.263 and gives a slight increase in motion compensation performance (at the expense of a significant increase in complexity). A similar 'smoothing' effect can be obtained by applying a filter to the block edges in the reference frame, and later versions of H.263 (H.263+ and H.263++) recommend using a block filter instead of OBMC because it gives similar performance with lower computational complexity. OBMC and filtering performance have been discussed elsewhere,6 and filters are examined in more detail in Chapter 9.

6.8.4 Complex Motion Models

The motion estimation and compensation schemes discussed so far have assumed a simple translational motion model, i.e. they work best when all movement in a scene occurs in a plane perpendicular to the viewer. Of course, there are many other types of movement such as rotation, movements towards or away from the viewer (zooming) and deformation of objects (such as a human body). Better motion compensation performance may be achieved by matching the current frame to a more complex motion model.

In the MPEG-4 standard, a video object plane may be predicted from the pixels that exist only within a reference VOP. This is a form of region-based motion compensation, where compensation is carried out on arbitrarily shaped regions rather than fixed rectangular blocks. This has the capability to provide a more accurate motion model for 'natural' video scenes (where moving objects rarely have 'neat' rectangular boundaries).

Picture warping involves applying a global warping transformation to the entire reference picture, for example to compensate for global movements such as camera zoom or camera rotation (a minimal warping sketch is given below).
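The text does not define a particular warping model; as a rough illustration, the sketch below applies a six-parameter affine transform to the reference picture with nearest-neighbour sampling and edge clipping. The parameter names and function name are assumptions made for this example.

/* For each sample of the warped prediction, the corresponding position in the
   reference picture is x' = a0 + a1*x + a2*y and y' = b0 + b1*x + b2*y. */
void warp_reference(const unsigned char *ref, unsigned char *warped,
                    int width, int height,
                    double a0, double a1, double a2,
                    double b0, double b1, double b2)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int sx = (int)(a0 + a1 * x + a2 * y + 0.5);
            int sy = (int)(b0 + b1 * x + b2 * y + 0.5);
            /* clip to the picture boundary (simple edge extrapolation) */
            if (sx < 0) sx = 0;
            if (sx > width - 1) sx = width - 1;
            if (sy < 0) sy = 0;
            if (sy > height - 1) sy = height - 1;
            warped[y * width + x] = ref[sy * width + sx];
        }
    }
}

For a small camera zoom about the picture centre, a1 and b2 would be set to the reciprocal of the zoom factor (with a2 = b1 = 0) and a0, b0 chosen so that the centre of the picture maps to itself.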
Mesh-based motion compensation overlays the reference picture with a 2-D mesh of triangles. The motion-compensated reference is formed by moving the corners of each triangle and deforming the reference picture pixels accordingly (Figure 6.20 shows the general approach, before and after warping). A deformable mesh can model a wide range of movements, including object rotation, zooming and limited object deformations. A smaller mesh will give a more accurate motion model (but higher complexity).

Figure 6.20 Triangular mesh before and after deformation

Still more accurate modelling may be achieved using object-based coding, where the encoder attempts to maintain a 3-D model of the video scene. Changes between frames are modelled by moving and deforming the components of the 3-D scene.

Picture warping is significantly more complex than 'standard' block matching. Mesh-based and object-based coding are successively more complex and are not suitable for real-time applications with current processing technology. However, they offer significant potential for future video coding systems when more processing power becomes available. These and other motion models are active areas for research.

6.9 IMPLEMENTATION

6.9.1 Software Implementations

Unless dedicated hardware assistance is available (e.g. a motion estimation co-processor), the key issue in a software implementation of motion estimation is the trade-off between computational complexity (the total number of processor cycles required) and compression performance. Other important considerations include:

- The efficiency of the mapping to the target processor. For example, an algorithm that fully utilises the instruction pipeline of the processor is preferable to an algorithm that introduces data dependencies and 'stalls' into the pipeline.

- Data storage and delay requirements. For example, there may be advantages to carrying out motion estimation for the entire frame before further encoding takes place: however, this requires more storage and can introduce more delay than an implementation where each macroblock is estimated, compensated, encoded and transmitted before moving on to the next macroblock.

Even with the use of fast search algorithms, motion estimation is often the most computationally intensive operation in a software video CODEC and so it is important to find ways to speed up the process. Possible approaches to optimising the code include:

1. Loop unrolling. Figure 6.21 lists pseudocode for two possible versions of the SAE calculation (Equation 6.3) for a 16 x 16 block. Version (a) is a direct, compact implementation of the equation. However, each of the 16 x 16 = 256 calculations is accompanied by incrementing and checking the inner loop counter i.
Version (b) 'unrolls' the inner loop and repeats the calculation 16 times. More lines of code are required but, on most platforms, version (b) will run faster (note that some compilers automatically unroll repetitive loops, but better performance can often be achieved by explicitly unrolling loops).

(a) Direct implementation:

// Current position: i,j   Offset in reference frame: ioffset, joffset
totalSAE = 0;
for j = 0 to 15 {                 // Row counter
    for i = 0 to 15 {             // Column counter
        totalSAE = totalSAE + abs(C[i,j] - R[i+ioffset, j+joffset]);
    }
}

(b) Unrolled inner loop:

// Current position: i,j   Offset in reference frame: ioffset, joffset
totalSAE = 0;
for j = 0 to 15 {                 // Row counter
    totalSAE = totalSAE + abs(C[0,j]  - R[0+ioffset,  j+joffset]);
    totalSAE = totalSAE + abs(C[1,j]  - R[1+ioffset,  j+joffset]);
    totalSAE = totalSAE + abs(C[2,j]  - R[2+ioffset,  j+joffset]);
    totalSAE = totalSAE + abs(C[3,j]  - R[3+ioffset,  j+joffset]);
    totalSAE = totalSAE + abs(C[4,j]  - R[4+ioffset,  j+joffset]);
    totalSAE = totalSAE + abs(C[5,j]  - R[5+ioffset,  j+joffset]);
    totalSAE = totalSAE + abs(C[6,j]  - R[6+ioffset,  j+joffset]);
    totalSAE = totalSAE + abs(C[7,j]  - R[7+ioffset,  j+joffset]);
    totalSAE = totalSAE + abs(C[8,j]  - R[8+ioffset,  j+joffset]);
    totalSAE = totalSAE + abs(C[9,j]  - R[9+ioffset,  j+joffset]);
    totalSAE = totalSAE + abs(C[10,j] - R[10+ioffset, j+joffset]);
    totalSAE = totalSAE + abs(C[11,j] - R[11+ioffset, j+joffset]);
    totalSAE = totalSAE + abs(C[12,j] - R[12+ioffset, j+joffset]);
    totalSAE = totalSAE + abs(C[13,j] - R[13+ioffset, j+joffset]);
    totalSAE = totalSAE + abs(C[14,j] - R[14+ioffset, j+joffset]);
    totalSAE = totalSAE + abs(C[15,j] - R[15+ioffset, j+joffset]);
}

Figure 6.21 Pseudocode for two versions of SAE calculation

2. 'Hand-coding' of critical operations. The SAE calculation for a block (Equation 6.3) is carried out many times during motion estimation and is therefore a candidate for coding in assembly language.

3. Reuse of calculated values. Consider the final stage of the TSS algorithm shown in Figure 6.9: a total of nine SAE matches are compared, each 1 pixel apart. This means that most of the operations of each SAE match are identical for each search location. It may therefore be possible to reduce the number of operations by reusing some of the calculated values |Cij - Rij| between successive SAE calculations. (However, this may not be possible if multiple-sample calculations are used, see below.)

4. Calculate multiple sample comparisons in a single operation. Matching is typically carried out on 8-bit luminance samples from the current and reference frames. A single match operation |Cij - Rij| takes as its input two 8-bit values and produces an 8-bit output value. With a large word width (e.g. 32 or 64 bits) it may be possible to carry out several matching operations at once by packing several input samples into a word. Figure 6.22 shows the general idea: here, four luminance samples are packed into each of two input words and the results of |Cij - Rij| for each sample are available as the 4 bytes of an output word. Care is required with this approach: first, there is an overhead associated with packing and unpacking bytes into and out of words, and second, there may be the possibility of overflow during the comparison (since the result of Cij - Rij is actually a 9-bit signed number prior to the magnitude operator | |).

These and further optimisations may be applied to significantly increase the speed of the search calculation.
In general, more optimisation leads to more lines of code that may be difficult to maintain and may only perform well on a particular processor platform. However, increased motion estimation performance can outweigh these disadvantages.

Figure 6.22 Multiple SAE comparisons in parallel (four 8-bit samples from the current and reference blocks are packed into two input words; the four results |current - reference| are returned as the 4 bytes of an output word)

Reduced complexity matching criteria

The fast search algorithms described in Section 6.4 reduce the complexity of motion estimation by attempting to subsample the number of points in the SAE map that require to be tested. At each comparison point, Equation 6.3 must be evaluated and this requires N x N calculations (where N is the block size). However, it is possible to reduce the number of calculations for each matching point in several ways.

Early termination

In many cases, the outcome of the SAE calculation will be an SAE that is larger than the previous minimum SAE. If we know that the current matching position will not produce the smallest SAE, we do not need to finish the calculation. If this is the case, the value totalSAE in Figure 6.21 will exceed the previous minimum SAE at some point before the end of the calculation. A simple way of reducing complexity is to check for this, e.g.:

if (totalSAE > minSAE) break from the loop.

This check itself takes processing time and so it is not efficient to test after every single sample comparison: instead, a good approach is to include the above check after each inner loop (i.e. each row of 16 comparisons).

Row and column projections

A projection of each row and column in the current and reference blocks is formed. The projection is formed by adding all the luminance values in the current row or column: for a 16 x 16 block, there are 16 row projections and 16 column projections. Figure 6.23 shows the projections for one macroblock. An approximation to SAE is calculated as follows:

SAE_{approx} = \sum_{i=0}^{N-1} |Ccol_i - Rcol_i| + \sum_{j=0}^{N-1} |Crow_j - Rrow_j|

Figure 6.23 Row and column projections

Figure 6.25 SAE interpolation to estimate half-pixel results (SAE value plotted against pixel position)

6.9.2 Hardware Implementations

The design of a motion estimation unit in hardware is subject to a number of (potentially conflicting) aims:

1. Maximise compression performance. A full search algorithm usually achieves the best block matching performance.
2. Minimise cycle count (and hence maximise throughput).
3. Minimise gate count.
4. Minimise data flow to/from the motion estimator.

Example: Full search block matching unit

A 'direct' implementation of the full search algorithm involves evaluating Equation 6.3 (SAE calculation) at each position in the search region. There are several ways in which the implementation can be speeded up (typically at the expense of matching efficiency and/or size of the design), including parallelisation (calculating multiple results in parallel), pipelining and the use of fast search algorithms.

Parallelisation of full search

The full search algorithm is highly regular and repetitive and there are no interdependencies between the search results (i.e. the order of searches does not affect the final result). It is therefore a good candidate for parallelisation and a number of alternative approaches are available. Two popular approaches are as follows:

1. Calculate search results in parallel.
Figure 6.26 shows the general idea: M processors are used, each of which calculates a single SAE result. The smallest SAE of the M results is chosen as the best match (for that particular set of calculations). The number of cycles is reduced (and the gate count of the design is increased) by a factor of approximately M.

Figure 6.26 Parallel calculation of search results (search window memory and current block memory feed processors 1 to M; a comparator selects the best match)

2. Calculate partial SAE results for each pixel position in parallel. For example, the SAE calculation for a 16 x 16 block may be speeded up by using 16 processors, each of which calculates the SAE component for one column of pixels in the current block. Again, this approach has the potential to speed up the calculation by approximately M times (if M parallel processors are used).

Fast search

It may not be feasible or practical to carry out a complete full search because of gate count or clock speed limitations. Fast search algorithms can perform almost as well as full search with many fewer comparison operations and so these are attractive for hardware as well as software implementations.

In a dedicated hardware design it may be necessary to carry out each motion estimation search in a fixed number of cycles (in order to ensure that all the processing units within the design are fully utilised during encoding). In this case algorithms such as logarithmic search and nearest-neighbours search are not ideal because the total number of comparisons varies from block to block. Algorithms such as the three-step search and hierarchical search are more useful because the number of operations is constant for every block. Parallel computation may be employed to speed up the algorithm further, for example:

1. Each SAE calculation may be speeded up by using parallel processing units (each calculating the SAE for one or more columns of pixels).

2. The comparisons at one 'step' or level of the algorithm may be computed in parallel (for example, one 'step' of the three-step search or one level of the hierarchical search).

3. Successive steps of the algorithm may be pipelined to increase throughput. Table 6.4 shows an example for the three-step search. The first nine comparisons (step 1) are calculated for block 1. The next eight comparisons (step 2) for block 1 are calculated by another processing unit (or set of units), whilst step 1 is calculated for block 2, and so on. Note that the steps or levels cannot be calculated in parallel: the search locations examined in step 2 depend on the result of step 1 and so cannot be calculated until the outcome of step 1 is known.

Table 6.4 Pipelined operation: three-step search

Step 1      Step 2      Step 3
Block 1     -           -
Block 2     Block 1     -
Block 3     Block 2     Block 1
Block 4     Block 3     Block 2

Option 3 above (pipelining of successive steps) is useful for sub-pixel motion estimation. Sub-pixel estimation is usually carried out on the sub-pixel positions around the best integer pixel match and this estimation step may also be pipelined. Figure 6.27 shows an example for a three-step search (+/-7 pixels) followed by a half-pixel estimation step. Note that memory bandwidth may be an important issue with this type of design. Each step requires access to the current block and reference area and this can lead to an unacceptably high level of memory accesses.
One option is to copy the current and reference areas to separate local memories for each processing stage, but this requires more local memory. Descriptions of hardware implementations of motion estimation algorithms can be found elsewhere.10-12

Figure 6.27 Pipelined motion estimation: three integer steps followed by a half-pixel step

6.10 SUMMARY

Motion estimation is used in an inter-frame video encoder to create a 'model' that matches the current frame as closely as possible, based on one or more previously transmitted frames ('reference frames'). This model is subtracted from the current frame (motion compensation) to produce a motion-compensated residual frame. The decoder recreates the model (based on information sent by the encoder) and adds the residual frame to reconstruct a copy of the original frame. The goal of motion estimation design is to minimise the amount of coded information (residual frame and model information), whilst keeping the computational complexity of motion estimation and compensation to an acceptable limit. Many reduced-complexity motion estimation methods exist ('fast search' algorithms), and these allow the designer to 'trade' increased computational efficiency against reduced compression performance.

After motion estimation and compensation, the next problem faced by a video CODEC is to efficiently compress the residual frame. The most popular method is transform coding and this is discussed in the next chapter.

REFERENCES

1. T. Koga, K. Iinuma et al., 'Motion compensated interframe coding for video conference', Proc. NTC, November 1981.
2. J. R. Jain and A. K. Jain, 'Displacement measurement and its application in interframe image coding', IEEE Trans. Communications, 29, December 1981.
3. M. Ghanbari, 'The cross-search algorithm for motion estimation', IEEE Trans. Communications, 38, July 1990.
4. M. Gallant, G. Côté and F. Kossentini, 'An efficient computation-constrained block-based motion estimation algorithm for low bit rate video coding', IEEE Trans. Image Processing, 8(12), December 1999.
5. T. Wiegand, X. Zhang and B. Girod, 'Long-term memory motion compensated prediction', IEEE Trans. CSVT, September 1998.
6. B. Tao and M. Orchard, 'Removal of motion uncertainty and quantization noise in motion compensation', IEEE Trans. CSVT, 11(1), January 2001.
7. Y. Wang, Y. Wang and H. Kuroda, 'A globally adaptive pixel-decimation algorithm for block motion estimation', IEEE Trans. CSVT, 10(6), September 2000.
8. X. Li and C. Gonzales, 'A locally quadratic model of the motion estimation error criterion function and its application to subpixel interpolation', IEEE Trans. CSVT, 3, February 1993.
9. Y. Senda, 'Approximate criteria for the MPEG-2 motion estimation', IEEE Trans. CSVT, 10(3), April 2000.
10. P. Pirsch, N. Demassieux and W. Gehrke, 'VLSI architectures for video compression - a survey', Proceedings of the IEEE, 83(2), February 1995.
11. C. J. Kuo, C. H. Yeh and S. F. Odeh, 'Polynomial search algorithm for motion estimation', IEEE Trans. CSVT, 10(5), August 2000.
12. G. Fujita, T. Onoye and I. Shirakawa, 'A VLSI architecture for motion estimation core dedicated to H.263 video coding', IEICE Trans. Electronics, E81-C(5), May 1998.

Transform Coding

7.1 INTRODUCTION

Transform coding is at the heart of the majority of video coding systems and standards.
Spatialimagedata(imagesamples or motion-compensated residual samples)aretransformed into a different representation, the transform domain. There are good reasons for transforming image data in this way. Spatial image data is inherently ‘difficult’ to compress: neighbouring samples are highly correlated (interrelated) and the energy tends to be evenly distributed acrossan image, makingit difficult to discard dataor reduce the precisionof data without adversely affecting image quality. With a suitable choice of transform, the data is ‘easier’ to compress in the transform domain. There are several desirable properties of a transformforcompression. It shouldcompacttheenergy inthe image(concentratethe energy into a small number of significant values); it should decorrelate the data (so that discarding ‘insignificant’ datahas a minimaleffectonimagequality); and it should be suitable for practical implementation in software and hardware. The two most widely used image compression transforms are the discrete cosine transform (DCT)and the discrete wavelet transform (DWT). The DCiTs usually applied to small, regular blocks of image samples (e.g. 8 x 8 squares) and the DWT is usually applied to larger image sections (‘tiles’)or to complete images.Many alternatives have been proposed, for example 3-D transforms (dealingwith spatial and temporal correlation), variable blocksize transforms, fractal transforms, Gabor analysiTs.he DCThas proved particularly durable and is at the core of most of the current generation of image and video coding standards, includingJPEG, H.261, H.263,H.263+, MPEG-l,MPEG-2 andMPEG-4.The DWT is gaining popularity becauseit can outperformthe DCT for still image codingand so it is used in the newJPEG image codingstandard (JPEG-2000) and for still ‘texture’ coding in MPEG-4. This chapter concentrates on theDCT. The theory and properties of the transforms are described first, followed by an introduction to practical algorithms and architectures for the DCT. Closely linked with the DCT is the process of quantisation and the chapter ends with a discussion of quantisation theory and practice. 7.2 DISCRETE COSINE TRANSFORM Ahmed, Natarajan and Rao originally proposed the DCTin 1974.’ Sincethen, it has become the most popular transform for image and video coding. There are two main reasons for its popularity: first, it is effective at transforming image data into a fotrhmat is easy to compress and second, it can be efficiently implemented in software and hardware. 128 TRANSFORM CODING Samples DCT Coefficients 01234567 2-D FDCT Figure 7.1 l-D and 2-D discretecosinetransform The forward DCT (FDCT) transforms a set of image samples (the ‘spatial domain’) intoa set of transform coefficients (the ‘transform domain’). The transform is reversible: the inverseDCT(IDCT)transforms a set of coefficients into a set of imagesamples. The forward and inverse transforms are commonly used in l-D or 2-D forms for image and video compression. The l-D version transforms a l-D array of samplesinto an a 1-D array of coefficients, whereas the 2-D version transforms a 2-D array (block) of samples into a block of Coefficients. Figure 7.1 shows the two forms of the DCT. The DCT hastwo useful properties for image and video compressione,nergy compaction (concentratingtheimageenergyinto a small number of coefficients) and decorrelution (minimising the interdependencies between coefficients). Figure 7.2 illustrates the energy compaction property of the DCT. 
Image (a) is an 80 x 80 pixel image and image (b) plots the coefficients of the 2-D DCT. The energy in the transformed coefficients is concentrated about the top-left corner of the array of coefficients (compaction). The top-left coefficients correspond to low frequencies: there is a 'peak' in energy in this area and the coefficient values rapidly decrease to the bottom right of the array (the higher-frequency coefficients). The DCT coefficients are decorrelated, which means that many of the coefficients with small values can be discarded without significantly affecting image quality. A compact array of decorrelated coefficients can be compressed much more efficiently than an array of highly correlated image pixels.

The decorrelation and compaction performance of the DCT increases with block size. However, computational complexity also increases (exponentially) with block size. A block size of 8 x 8 is commonly used in image and video coding applications. This size gives a good compromise between compression efficiency and computational efficiency (particularly as there are a number of efficient algorithms for a DCT of size 2^m x 2^m, where m is an integer). The forward DCT for an 8 x 8 block of image samples is given by Equation 7.1:

F_{x,y} = \frac{1}{4} C(x) C(y) \sum_{i=0}^{7} \sum_{j=0}^{7} f_{i,j} \cos\left(\frac{(2j+1)y\pi}{16}\right) \cos\left(\frac{(2i+1)x\pi}{16}\right)    (7.1)

f_{i,j} are the 64 samples (i, j) of the input sample block, F_{x,y} are the 64 DCT coefficients (x, y) and C(x), C(y) are constants: C(x) = 1/\sqrt{2} when x = 0 and C(x) = 1 otherwise (and similarly for C(y)). The labelling of samples (f_{i,j}) and coefficients (F_{x,y}) is illustrated in Figure 7.1.

The DCT represents each block of image samples as a weighted sum of 2-D cosine functions ('basis functions'). The functions are plotted as surfaces in Figure 7.3 and as 8 x 8 pixel 'basis patterns' in Figure 7.4. The top-left pattern has the lowest 'frequency' and is just a uniform block. Moving to the right, the patterns contain an increasing number of 'cycles' between dark and light in the horizontal direction: these represent increasing horizontal spatial frequency. Moving down, the patterns contain increasing vertical spatial frequency. Moving diagonally to the right and down, the patterns contain both horizontal and vertical frequencies. The block of samples may be reconstructed by adding together the 64 basis patterns, each multiplied by a weight (the corresponding DCT coefficient F_{x,y}).

The inverse DCT reconstructs a block of image samples from an array of DCT coefficients. The IDCT takes as its input a block of 8 x 8 DCT coefficients F_{x,y} and reconstructs a block of 8 x 8 image samples f_{i,j} (Equation 7.2). C(x) and C(y) are the same constants as for the FDCT.

f_{i,j} = \frac{1}{4} \sum_{x=0}^{7} \sum_{y=0}^{7} C(x) C(y) F_{x,y} \cos\left(\frac{(2j+1)y\pi}{16}\right) \cos\left(\frac{(2i+1)x\pi}{16}\right)    (7.2)

Figure 7.3 DCT basis functions (plotted as surfaces)

Figure 7.4 DCT basis patterns

Example: DCT of an 8 x 8 block of samples

Figure 7.5 shows an 8 x 8 block of samples (b) taken from image (a). The block is transformed with a 2-D DCT to produce the coefficients shown in image (c). The six most significant coefficients are (0, 0), (1, 0), (1, 1), (2, 0), (3, 0) and (4, 0), and these are highlighted on the table of coefficients (Table 7.1).

Figure 7.5 (a) Original image; (b) 8 x 8 block of samples; (c) 2-D DCT coefficients

Table 7.1 DCT coefficients of the block in Figure 7.5 (the six most significant coefficients are highlighted)

A reasonable approximation to the original image block can be reconstructed from just these six coefficients, as shown in Figure 7.6.
First, coefficient (0, 0) ismultiplied by a weight of 967.5andtransformedwith the inverse DCT. This coefficient representsthe average‘shade’ of theblock (in thiscase,mid-grey)andisoftendescribed as the ‘DC coefficient’ (the DC coefficient is usuallythemost significant in anyblock).Figure 7.6 shows the reconstructed block formedby the DC coefficient only (the top-right blockin the figure). Next, coefficient (1, 0) ismultiplied by a weighot f - 163.4 (equivalento subtracting itsbasispattern). The weighted basispattern is shown in thesecond rowof Figure 7.6 (on the left) and the sum of the first two patterns is shown on theright. As each of the further four basis patterns is added to the reconstruction, more detail is added to the reconstructed block. The final result (shown on the bottom right of Figure 7.6 and produced using just 6 outof the 64 coefficients) is a good approximation of the original. This example illustrates the two key properties of the DCT: the significant coefficients are clustered around the DC coefficient (compaction)and theblockmaybe reconstructed using onlyasmall number of coefficients (decorrelation). 7.3 DISCRETE WAVELET TRANSFORM The DCT described above operates ona block of samples (usually 16 x 16, 8 x 8 or smaller) and decomposesthesesamplesinto a set of spatialfrequencycomponents. A wavelet transform also decomposes spatial image data according to frequency and wavelet-based compressionhasbeenshowntooutperform DCT-based compressionfor still images. Because of this, the new version of the JPEG image compression standard, JPEG-2000, is wavelet-based rather than DCT-based. One of the key differencesbetween the application of waveletanddiscretecosine transforms is that a wavelet transform is typically applied to a complete image or a large 134 (0.0) 967.5 TRANSFORM CODING Reconstructed (1.0) * -163.4 (1.11 * -71.3 -(3,O)* 81.8 I (4,O) * 38.9 g 9 ’X F: S Figure7.6 Reconstruction of image block from six basis patterns rectangular region (‘tile’o)f the image, in contraswt ith the small block size chosen for DCT implementations. The DCT becomes increasingly complex to calculate for larger block sizes, whilst the benefits of larger blocks rapidly diminish above 8 X 8 or 16 x 16 samples, whereas a wavelet transform may be more efficiently applied to large blocks or complete images and produces better compression performance for larger blocks. A single-stage wavelet transformation consistsof a filtering operation that decomposesan image into four frequency bands as shown in Fig7u.r7e. Image (a) is thoeriginal; image (b) is the resulotf a single-stage wavelet transform. The top-lceoftmer of the transformed image (‘LC) is the original image, low-passfiltered and subsampled in the horizontal and vertical TRANSWFOARVMELDEISTCRETE 135 (b) Figure 7.7 (a)Original image; (b) single-stage wavelet decomposition 136 TRANSFORM CODING dimensions. The top-right comer (‘W)consists of residual vertical frequencies (i.e. the vertical component of the difference between the subsampled ‘LL‘ image and the original ’ image). The bottom-left comer ‘LH’ contains residual horizontal frequencies (for example, theaccordionkeys are veryvisiblehere),whilstthebottom-rightcomer ‘HH’ contains residual diagonal frequencies. This decomposition process may be repeated for the‘LL‘ component to produce another set of four components: a new ‘LL‘ component that is a further subsampled versionof the original image, plus three more residual frequency components. 
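The filters used for such a decomposition are not specified in the text above; as a rough illustration, the following sketch uses the simple Haar (2-tap averaging/differencing) filter pair to split an image into the four subbands of Figure 7.7, with the low-pass subband in the top-left quadrant and the three residual (high-frequency) subbands in the remaining quadrants. The names, the integer arithmetic and the rounding are all choices made for this example (the integer division used here is not exactly invertible).

#include <stdlib.h>

/* Single-stage 2-D decomposition. img is width x height (both even);
   out is the same size and holds the four quadrants described above.
   Values are stored as int to allow negative residuals. */
void haar_decompose(const int *img, int *out, int width, int height)
{
    int hw = width / 2, hh = height / 2;
    int *tmp = malloc(sizeof(int) * width * height);

    /* horizontal pass: averages to the left half, differences to the right half */
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < hw; x++) {
            int a = img[y * width + 2 * x];
            int b = img[y * width + 2 * x + 1];
            tmp[y * width + x]      = (a + b) / 2;
            tmp[y * width + hw + x] = (a - b) / 2;
        }
    }
    /* vertical pass on the result: averages to the top half, differences below */
    for (int x = 0; x < width; x++) {
        for (int y = 0; y < hh; y++) {
            int a = tmp[(2 * y) * width + x];
            int b = tmp[(2 * y + 1) * width + x];
            out[y * width + x]        = (a + b) / 2;
            out[(hh + y) * width + x] = (a - b) / 2;
        }
    }
    free(tmp);
}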
Repeating the decomposi- tion three times gives the wavelet representation shown in Figure 7.8. The smailnl tihmeage top left is the low-pass filtered original and the remaining squares contain progressively higher-frequencyresidualcomponents. This process may berepeatedfurther if desired (until, in the limit, the top-left component contains only 1 pixel which is equivalent to the ‘DC’ or average valueof the entire image). Each sample in Figure 7.8 represents a wavelet transform coeflccient. The wavelet decomposition has some important properties. First, the number of wavelet ‘coefficients’ (the spatial values that maukpe Figure 7.8) is the same as the numobfeprixels in the original imageand so the transformis not inherently adding or removing information. Second, manyof the coefficientsof the high-frequency components(‘HH’, ‘HL‘ and ‘LH’ at eachstage)arezeroorinsignificant.Thisreflectsthefactthatmuch of theimportant information in an image is low-frequency. Our response to an image is based upon a low- frequency ‘overview’ of the image with important detail added by higher frequencies in a fewsignificantareas of theimage.Thisimpliesthat it shouldbepossibletoefficiently Figure 7.8 Three-stage wavelet decomposition DISCRETE WAVELET TRANSFORM 137 layer 1 Layer 2 Figure 7.9 Relationshibpetwee‘nparent’ and ‘child’ regions compress the wavelet representation shown in Figure 7.8 if we can discard the insignificant higher-frequency coefficients whilst preserving the significant ones. Third, the decomposition is not restricted by block boundaries (unlikethe DCT) and hence may be a more flexible way of decorrelating the image data (i.e. concentrating thseignificant components intoa few coefficients) than the block-based DCT. The method of representing significant coefficients whilst discarding insignificant coefficients is criticaltotheuse of wavelets in imagecompression. The embeddedzerotree approach and, more recently, set partitioning into hierarchical trees (SPIHT) are considered by some researchers tobe the most effectiveway of .doingt h k 2 The wavelet decomposition can be thought of as a ‘tree’, where the ‘root’ is the top-left LL component and its ‘branches’ arethe successivelyhigher-frequencyLH, HL and HH components at eachlayer.Each coefficient in a low-frequency component hasa number of corresponding ‘child’coefficients in higher-frequency components. This concept is illustrated in Figure 7.9, where a single coefficient at layer 1 maps to four ‘child’ coefficients in each component at layer 2. Zerotree coding workson the principle that if a parent coefficient is visually insignificant then its ‘children’are unlikely to be significant. Working from the topleft,each coefficient and itschildrenareencoded as a ‘tree’. As soon as thetreereaches a coefficient that is insignificant, that coefficientand all its children are coded as a ‘zero tree’. The decoder will reconstruct the significant coefficients and set all coefficients in a ‘zero tree’ to zero. Thisapproachprovides a flexible andpowerful method of imagecompression. The decision as to whether a coefficient is ‘significant’ or ‘insignificant’ is madeby comparing it with a threshold. Setting a high threshold means that most of the coefficients are discarded and the image is highly compressed; settinaglow threshold means that mostcoefficients are retained,giving low compression and highimage fidelity. Thisprocess is equivalent to quantisation of the wavelet Coefficients. 
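A highly simplified sketch of the significance test behind zerotree coding is given below. It assumes the pyramid layout of Figure 7.8, with a coefficient's four 'children' at double its coordinates in the next higher-frequency layer; the special parent-child mapping used for the lowest band in real EZW/SPIHT coders is ignored here for simplicity, and the function name is an assumption.

#include <stdlib.h>

/* coeff is an n x n array of wavelet coefficients in pyramid layout.
   Returns 1 if the coefficient at (x, y) and all of its descendants are
   below the threshold (i.e. the subtree can be coded as a 'zero tree'). */
int subtree_is_insignificant(const int *coeff, int n, int x, int y, int threshold)
{
    if (x >= n || y >= n)
        return 1;                          /* beyond the finest layer */
    if (abs(coeff[y * n + x]) >= threshold)
        return 0;                          /* this coefficient is significant */
    /* recurse into the four children at the next layer */
    return subtree_is_insignificant(coeff, n, 2 * x,     2 * y,     threshold) &&
           subtree_is_insignificant(coeff, n, 2 * x + 1, 2 * y,     threshold) &&
           subtree_is_insignificant(coeff, n, 2 * x,     2 * y + 1, threshold) &&
           subtree_is_insignificant(coeff, n, 2 * x + 1, 2 * y + 1, threshold);
}

An encoder could call this test at each coefficient: if the result is 1, the whole subtree can be signalled with a single 'zero tree' symbol; otherwise the coefficient is coded and its children are examined in turn.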
Wavelet-based compression performs well for still images (particularly in comparison with DCT-based compression) and can be implemented reasonably efficiently. Under high compression, wavelet-compressed imagesdo not exhibitthe blocking effects characteristicof the DCT. Instead, degradation is more ‘graceful’and leads to a gradual blurring of the image as higher-frequency coefficients are discarded. Figure 7.10 compares the resulotfs compression of the original image (on the left) with a DCT-based algorithm (middle image, JPEG compression) and a wavelet-based algorithm (right-hand image, JPEG-2000 compression). In 138 rRANSFORM CODING l !l Figure 7.10 (a) Original; (b) compressed and decompressed (DCT); (c) compressed and decompressed (wavelet) each case, the compression ratio i1s 6x. Thedecompressed P E G image is clearly distorted and ‘blocky’, whereas the decompressedPEG-2000 image is much closer to the original. Because of its good performance in compressing images, the DWT is used in the new PEG-2000 still image compression standard and is incorporated as a still image compression tool in MPEG-4 (see Chapter 4). However, wavelet techniques have not yet gained widespread support for motion video compression because there isanneoatsy way to extend wavelet compression in the temporal domain. Block-based transforms such as thwe oDrkCT well withblock-basedmotionestimationandcompensation,whereasefficient,computationallytractablemotion-compensationmethodssuitableforwavelet-basedcompression have not yet been demonstrated. Hence, the DCT is still the most popular transform for video coding applications. 7.4 FASTALGORITHMSFOR THE DCT According to Equation7.1, each of the 64 coefficients in the FDCT is a weighted functioofn all 64 image samples. This means that 64 calculations, each involving a multiplicationand an accumulation (‘multiply-accumulate’) must be carried out for each DCT coefficient. A total of 64 x 64 =4096 multiply-accumulate operations are required for a full8 x 8 FDCT. Fortunately, the DCT lends itself to significant simplification. 7.4.1 SeparableTransforms Many practical implementations of the FDCT and IDCT use the separable property of the transforms to simplify calculation.The 2-D FDCT can be calculatedby repeatedly applying the l-D DCT. The l-D FDCT is given by Equation 7.3: DCT THAELGFOFOARRISTTHMS 8 x 8 samples (i, j ) l-D FDCT on rows l-D FDCToncolumns 139 8x 8 coefficients(x, y) 3 c 3 Figure 7.11 2-D DCT via two l-D transforms Equation 7.1 may be rearranged as follows (Equation 7.4): where F, is the l-D DCT described by Equation 7.3. In other words, the 2-D DCT can be represented as: F x . y = 1-DDCTy-direction(1-D DCTx-direct~on) The 2-D DCT of an 8 x 8 block can be calculated in two passes: a l-D DCT of each row, followed by a l-D DCTofeachcolumn (or vice versa). Thisproperty is known as separability and is shown graphically in Figure 7.11. The 2-D IDCT canbe calculated using two l-D IDCTs in a similar way. The equation for the l-D IDCT is given by Equation 7.5: This separable approach has two advantageFs.irst, the number of operations is reduced: each ‘pass’ requires 8 x 64 multiply-accumulate operations, i.e. the total number of operations is 2 x 8 x 64 = 1024.Second,the l-D DCTcanbereadilymanipulated to streamlinethe number of operations further or to otherwise simplify calculation. Practical implementations of the FDCT and IDCT fall into two main categories: 1. 
Minimal computation: the DCT operations (1 -D or 2-D) are reorganised to exploit the inherentsymmetry of cosineoperations in order to minimisethenumber of multiplications and/or additions. This approach is appropriate for software implementations. 2. Maximal regularity: the l-D or2-D calculations are organised to regularise the data flow and processing order. This approach is suitable for dedicated hardware implementations. In general, l-D implementations(usingtheseparablepropertydescribedabove) are less complexthan2-Dimplementations,but it is possible to achievehigherperformance by manipulating the 2-D equations directly. 140 CODING TRANSFORM 7.4.2 FlowgraphAlgorithms The computational requirements of the l-D DCT may be reduced significantly by exploiting the symmetry of the cosine weighting factors. We show how the complexity of the DCT can be reduced in the following example. Example: Calculating l - D DCT coeficient F2 From Equation 7.3: The following properties of the cosine function can be used to simplify Equation 7.7: and cos(?) = - c o s ( F ) = -cosr$) =cos($) These relationships are shown graphicallyin Figure 7.12 wherecos(7r/8), cos(7n/8), etc. are plotted as circles ( 0 ) and cos(37r/8), cos(57r/8), etc. are plotted as stars (*). Using the above relationships, Equation 7.7 can be rewritten as follows: 1 F2 = 2- [.cos(;) + f l cos(?) - h c o s ( 3 - f 3 c o s ( 3 - f.cOs(;) (g) (g) -f5 cos + f6 cos + f7 cos(;)] hence: FAST ALGORITHMS FOR THE DCT l ‘, t 0.8 i \ 0o..64 141 1 l -0.8 -1 0 - 1 2 3 4 5 Figure 7.12 Symmetries of cosinefunction 1 I_ 6 The calculation for F2 has been reduced from eight multiplications andeight additions (Equation 7.7) to two multiplications and eight additions/subtractions (Equation 7.9). Applying a similar process to F6 gives The additions and subtractions are clearly the same as in Equation 7.9. We can therefore combine the calculations of F2 and Fh as follows: bl (i) (g)] Step 2 : calculate 2F2 = cos +D2cos and In total, the two steps require X additions or subtractions and 4 multiplications, compared with 16 additions and 16 multiplications for the full calculation. The combined calculations of F2 and Fh can be graphically represented by ajowgraph as shown in Figure 7.13. In this figure, a circle represents addition and a square represents multiplication by a scaling factor. For clarity, thecosinescalingfactorsare represented as ‘cX’, meaning ‘multiply by cos (Xn/16)’. Hence, cos(r/X) is represented by c2 and cos(3r/X) is represented by c6. This approach canbe extended to simplify the calculation of F. and F4, producing the top half of the flowgraph shown in Figure 7.14. Applying basic cosine symmetries doesnot give 142 fo fl f2 f3 f4 f5 CODING TRANSFORM 2F2 2F6 Multiply by -1 Multiplybycos(Xpiil6) 0 Add f7 Figure 7.13 Partial FDCT flowgraph (F2 and F6 outputs) such a useful result for the odd-numbered FDCT coefficients ( I , 3 , 5 , 7). However, further manipulation of the matrix of cosine weighting factors can simplify the calculation of these coefficients. Figure 7.14 shows a widely used example of a ‘fast’ DCT algorithm,’ requiring only 26additionskubtractions and 20 multiplications (in fact,thiscanbe reduced to 16 multiplications by combining some of the multiplications by c4).This is considerably simpler than the 64 multiplies and 64 additionsof the ‘direct’ l-D DCT. Each multiplication is by a constant scaling factor and these scaling factors may be pre-calculated to speed up computation. In Figure 7.14, eight samples fo . . 
.f, are input at the left and eight DCT coefficients 2Fo . . .2F7 are output at the right. A I-D IDCT may be carried out by simply reversing the direction of the graph, i.e. the coefficients 2Fo. . .2F7 are now inputs and the samplesfo . . .,f, are outputs. By manipulating the transformoperations in different ways, many other flowgraph algorithmscan be obtained. Each algorithm has characteristics that may make it suitable for a particular application: for example, minimal numbeorf multiplications (for processing platformswheremultiplications are particularly expensive)m, inimal total number of operations (where multiplications are not computationally expensive), highly regular data flow, and so on. Table 7.2 summarises the features of some popular l-D ‘fast’ algorithms. Arai’s algorithmrequiresonly five multiplications, making ivt ery efficient for most processing platforms; however, this algorithm results in incorrectly scaled coefficients and this must be compensated for by scaling the quantisation algorithm (see Section 7.6). FAST ALGORITHMS FOR THE DCT 143 fo 2Fo fl 2F4 f2 2F2 f3 2F6 f4 2F1 f5 2F5 f6 2F3 f7 2F7 Figure 7.14 CompleteFDCTflowgraph(fromChen,FralickandSmith) Thesefast l-D algorithmsexploitsymmetries of the l-D DCTand there arefurther variations onthe fast l-D DCT .738Inordertocalculate a complete2-D DCT, the l-D transform is applied independently to the rows and then to the columns of the block of data. Further reductions in the number of operations may be achieved by taking advantage of furthersymmetries in the'direct'form of the2-DDCT(Equation 7.1). In general, it is possible to obtainbetter performance with a direct 2-Dalgorithmg3" than with separable l-D algorithms. However, this improved performance comes at the cost of a significant increase inalgorithmic complexity. In many practical applications, the relative simplicity (and smaller software code size) of the l-D transforms is preferable to the highercomplexity Table 7.2 Comparison of l-D DCT algorithms (S-point DCT) Source ~~ Multiplications Additions 'Direct' 64 64 Chen3 16 26 ~ee~ 12 29 ~oeffler~ 11 29 Arai6 5 28 144 CODING TRANSFORM of direct 2-D algorithms. For examplei,t is more straightforward to develoaphighly efficient hand-optimised l-D function than to carry out the same software optimisationswith a larger 2-D function. 7.4.3 Distributed Algorithms The ‘flowgraph’ algorithms described above are widelyused in software implementations of the DCT buthavetwodisadvantagesfordedicatedhardwaredesigns.They tend to be irregular (i.e. different operations take place at each stageof the flowgraph) and they require large multipliers (which take up large areas of siliconin an IC). Itis useful to develop alternative algorithms that have a more regular structure and/or do not require large parallel multipliers. Equation 7.3 (l-D FDCT) may be written as a ‘sum of products’: c7 F, = Ci,,fi i=O where Ci.x= - 2 + (2i 1 ) X T (7.1 1) Thel-DFDCT isthesum of eightproducts,whereeachproducttermisformed by multiplying an input sampleby a constant weighting factorCi,*.The first stage of Chen’s fast algorithm shown in Figure 7.14 is a series of additions and subtractions and these can be used to simplify Equation 7.1 1. 
First, calculate four sums (U)and four differences ( v ) from the input data: Equation 7.11 can be decomposed into two smaller calculations: (7.12) (7.13) (7.14) i=O In this form, the calculations are suitable for implementation using a technique known as distributedarithmetic (first proposedin1974forimplementingdigitalfilters”).Each multiplicationiscarriedout a bit at a time,using a look-uptableand an accumulator (rather than a parallel multiplier). A B-bit twos complement binary number n can be represented as: (7.15) ALGOFRAISTTHMS FOR THE DCT 145 H’ is the most significant bit (MSB) of n (the sign bit) and n j are the remaining ( B - 1) bits of n. Assuming that each input U,is a B-bit binary number in twos complement form,Equation 7.13 becomes (7.16) Rearranging gives (7.17) or D,!&) is a function of the bits at position j in each of the four input values: these bits are ub, U < , U/ and ud. This means that there are only 24 = 16 possible outcomesof D, and these 16 ( c ) outcomes may be pre-calculated and stored in a look-up table. The FDCT describe by Equation 7.18 can be carried outby a series of table look-ups ( D J , additions and shifts ( 2 - 3 . In this form, no multiplication is required and thismaps efficiently to hardware (see Section 7.5.2).Asimilarapproachistaken to calculate the four odd-numbered FDCT coefficients F 1 ,F3, F5 and F7 and the distributed form may also be applied to the l-D IDCT. 7.4.4 Other DCT Algorithms The popularity of the DCT has led to the development of further algorithms. For example, a l-D DCT in the form of a finite difference equation has been presented.” Using this form, the DCT coefficients may be calculated recursively using an infinite impulse response (IIR) filter. This has several advantages: the algorithm is very regular and there are a number of well-established methods for implementing IIR filters in hardware and software. Recently, approximate forms of the DCT have been proposed. Each of the DCT-based image and video coding standards specifies a minimum accuracy for the inverse DCT and in order to meet this specification it is necessary to use multiplications and fractional-precision numbers. However, if accuracy and/orcompletecompatibility with thestandardsare of lesser importance,itis possible to calculate the DCT and IDCT using one of several approximations. For example, an approximate algorithm has been proposed13 that requires only additions and shifts (i.e. there are no multiplications). This type of approximation may be suitable for low-complexity software or hardware implementations where computational power is very limited. However, thedisadvantageis that image quality will be reduced compared with an accurate DCT implementation.An interesting trend is shown in the H.26L 146 TRANSFORM CODING draft standard (see Chapte5r ) , where an integer DCT approximation idsefined as part of the standard to facilitate low-complexity implementationswhilst retaining compliance with the standard. 7.5 IMPLEMENTING THE DCT 7.5.1 Software DCT The choiceof ‘fast’ algorithm for a software video CODEC depenodnsa numberof factors. Different processing platforms (see Chapter 1h2a)ve different strengths and weaknessesand these may influence the choice of DCT/IDCT algorithm. Factors include: 0 Computational ‘cost’ of multiplication. Some processors take many cycles to carry out a multiplication, others are reasonably faAstl.ternative flowgraph-based algorithmsallow the designer to ‘trade’ the number of multiplications against the total numberof operations. 
- Fixed vs. floating-point arithmetic capabilities. Poor floating-point performance may be compensated for by scaling the DCT multiplication factors to integer values.

- Register availability. If the processor has a small number of internal registers then temporary variables should be kept to a minimum and reused where possible.

- Availability of dedicated operations. 'Custom' operations such as digital signal processor (DSP) multiply-accumulate operations and the Intel MMX instructions may be used to improve the performance of some DCT algorithms (see Chapter 12).

Because of the proliferation of 'fast' algorithms, it is usually possible to choose a 'shortlist' of two or three alternative algorithms (typically flowgraph-based algorithms for software designs) and to compare the performance of each algorithm on the target processor before making the final choice.

Example

Figure 7.15 lists pseudocode for Chen's algorithm (shown in Figure 7.14); only the top-half calculations are given, for clarity. The multiplication factors cX are pre-calculated constants. In this example, floating-point arithmetic is used: alternatively, the multipliers cX may be scaled up to integers and the entire DCT may be carried out using integer arithmetic (in which case the final results must be scaled back down to compensate). The cosine multiplication factors never change and so these may be pre-calculated (in this case as floating-point numbers). A 1-D DCT is applied to each row in turn, then to each column. Note the use of a reasonably large number of temporary variables.

constant c4 = 0.707107
constant c2 = 0.923880
constant c6 = 0.382683
// (similarly for c1, c3, c5 and c7)

for (every row) {
    i0 = f0 + f7              // First stage
    i1 = f1 + f6
    i2 = f2 + f5
    i3 = f3 + f4
    i4 = f3 - f4
    i5 = f2 - f5
    i6 = f1 - f6
    i7 = f0 - f7

    j0 = i0 + i3              // Second stage
    j1 = i1 + i2
    j2 = i1 - i2
    j3 = i0 - i3
    // (similarly for j4..j7)

    k0 = (j0 + j1) * c4       // Third stage
    k1 = (j0 - j1) * c4
    k2 = (j2 * c6) + (j3 * c2)
    k3 = (j3 * c6) - (j2 * c2)
    // (similarly for k4..k7)

    F0 = k0 >> 1
    F4 = k1 >> 1
    F2 = k2 >> 1
    F6 = k3 >> 1
    // (F1..F7 require another stage of multiplications and additions)
}   // end of row calculations

for (every column) {
    // repeat above steps on the columns
}

Figure 7.15 Pseudocode for Chen's algorithm

Further performance optimisation may be achieved by exploiting the flexibility of a software implementation. For example, variable-complexity algorithms (VCAs) may be applied to reduce the number of operations required to calculate the DCT and IDCT (see Chapter 10 for some examples).

7.5.2 Hardware DCT

Dedicated hardware implementations of the FDCT/IDCT (suitable for ASIC or FPGA designs, for example) typically make use of separable 1-D transforms to calculate the 2-D transform. The two sets of row/column transforms shown in Figure 7.11 may be carried out using a single 1-D transform unit by transposing the 8 x 8 array between the two 1-D transforms, i.e.

Input data -> 1-D transform on rows -> Transpose array -> 1-D transform on columns -> Output data

An 8 x 8 RAM ('transposition RAM') may be used to carry out the transposition. Figure 7.16 shows an architecture for the 2-D DCT that uses a 1-D transform 'core' together with a transposition RAM.

Figure 7.16 2-D DCT architecture (1-D DCT core with transposition RAM and select logic)

The following stages are required to calculate a complete 2-D FDCT (or IDCT):

1. Load input data in row order; calculate the 1-D DCT of each row; write into the transposition RAM in row order.

2.
Read from the RAM in column order; calculate the 1-D DCT of each column; write back into the RAM in column order.

3. Read the output data from the RAM in row order.

There are a number of options for implementing the 1-D FDCT or IDCT 'core'. Flowgraph algorithms are not ideal for hardware designs: the data flow is not completely regular and it is not usually possible to reuse arithmetic units (such as adders and multipliers) efficiently. Two popular and widely used designs are the parallel multiplier and distributed arithmetic approaches.

Parallel multiplier

This is a more or less direct implementation of Equations 7.13 and 7.14 (four-point sums of products). After an initial stage that calculates the four sums (u) and differences (v) (see Equation 7.12), each sum-of-products result is calculated. There are 16 possible factors C_(x,i) for each of the two 4-point DCTs, and these factors may be pre-calculated to simplify the design. High performance may be achieved by carrying out the four multiplications for each result in parallel; however, this requires four large parallel multipliers and may be expensive in terms of logic gates.

Distributed arithmetic

The basic calculation of the distributed arithmetic algorithm is given by Equation 7.18. This calculation maps to the hardware circuit shown in Figure 7.17, known as a ROM-Accumulator circuit. Calculating each coefficient F_x takes a total of B clock cycles and proceeds as shown in Table 7.3. The accumulator is reset to zero at the start. During each clock cycle, one bit from each input u (or each input v if F_x is an odd-numbered coefficient) selects a pre-calculated value D_x from the ROM. The ROM output D_x is added to the previous accumulator contents, right-shifted by 1 bit (equivalent to multiplication by 2^-1). The final output D_x(u^0) is subtracted (this is the sign bit position). After B clock cycles, the accumulator contains the final result F_x.

Figure 7.17 Distributed arithmetic ROM-Accumulator

Table 7.3 Distributed arithmetic: calculation of one coefficient

Bit position   ROM input                                        Accumulator contents
B - 1          u^(B-1) (i.e. bit (B - 1) of u0, u1, u2 and u3)   D_x(u^(B-1))
B - 2          u^(B-2) (i.e. bit (B - 2) of u0, u1, u2 and u3)   D_x(u^(B-2)) + [D_x(u^(B-1)) >> 1]
...            ...                                              ...
1              u^1 (i.e. bit 1 of u0, u1, u2 and u3)            D_x(u^1) + (previous contents >> 1)
0              u^0 (i.e. sign bit of u0, u1, u2 and u3)         -D_x(u^0) + (previous contents >> 1)

The ROM-Accumulator is a reasonably compact circuit and it is possible to use eight of these in parallel to calculate all eight coefficients F_x of a 1-D FDCT or IDCT in B cycles. The distributed arithmetic design offers good performance with a modest gate count. Multiplier-based, filter-based and distributed arithmetic DCT designs have all been reported,¹⁴⁻¹⁸ and a hardware architecture based on a 'direct' 2-D DCT implementation has also been presented.¹⁹

7.6 QUANTISATION

In a transform-based video CODEC, the transform stage is usually followed by a quantisation stage. The transforms described in this chapter (DCT and wavelet) are reversible, i.e. applying the transform followed by its inverse to image data results in the original image data. This means that the transform process does not remove any information; it simply represents the information in a different form. The quantisation stage removes less 'important' information (i.e.
information that does not have a significant influence on the appearance of the reconstructed image), making it possible to compress the remaining data.

In the main image and video coding standards described in Chapters 4 and 5, the quantisation process is split into two parts: an operation in the encoder that converts transform coefficients into levels (usually described simply as quantisation) and an operation in the decoder that converts levels into reconstructed transform coefficients (usually described as rescaling or 'inverse quantisation'). The key to this process is that, whilst the original transform coefficients may take on a large number of possible values (like an analogue, 'continuous' signal), the levels and hence the reconstructed coefficients are restricted to a discrete set of values. Figure 7.18 illustrates the quantisation process: transform coefficients on a continuous scale are quantised to a limited number of possible levels, and the levels are rescaled to produce reconstructed coefficients with approximately the same magnitude as the original coefficients but a limited number of possible values.

Figure 7.18 Quantisation and rescaling

Quantisation has two benefits for the compression process:

• If the quantisation process is correctly designed, visually significant coefficients (i.e. those that have a significant effect on the quality of the decoded image) are retained, albeit with lower precision, whilst insignificant coefficients are discarded. This typically results in a 'sparse' set of quantised levels: for example, most of the 64 coefficients in an 8 x 8 DCT are set to zero by the quantiser.

• A sparse matrix containing levels with a limited number of discrete values (the result of quantisation) can be efficiently compressed.

There is, of course, a detrimental effect on image quality because the reconstructed coefficients are not identical to the original set of coefficients and hence the decoded image will not be identical to the original. The amount of compression and the loss of image quality depend on the number of levels produced by the quantiser. A large number of levels means that coefficient precision is only slightly reduced and compression is low; a small number of levels means a significant reduction in coefficient precision (and image quality) but correspondingly high compression.

Example

The DCT coefficients from Table 7.1 are quantised and rescaled with (a) a 'fine' quantiser (with the levels spaced at multiples of 4) and (b) a 'coarse' quantiser (with the levels spaced at multiples of 16). The results are shown in Table 7.4. The finely quantised coefficients (a) retain most of the precision of the originals and 21 non-zero coefficients remain after quantisation. The coarsely quantised coefficients (b) have lost much of their precision and only seven coefficients are left after quantisation (the six coefficients illustrated in Figure 7.6 plus [7, 0]). The finely quantised block will produce a better approximation of the original image block after applying the IDCT; however, the coarsely quantised block will compress to a smaller number of bits.
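The behaviour of these two quantisers can be illustrated with a short C sketch. This is an illustration only, not code from any standard: a uniform quantiser and rescaler with a selectable step size, where step = 4 corresponds to the 'fine' quantiser and step = 16 to the 'coarse' one.

/* Illustrative uniform quantiser and rescaler (not taken from any standard).
   coef: original transform coefficients; rec: reconstructed coefficients. */
void quantise_and_rescale(const int *coef, int *rec, int count, int step)
{
    for (int i = 0; i < count; i++) {
        int level = coef[i] / step;   /* forward quantisation: coefficient -> level */
        rec[i] = level * step;        /* rescaling: level -> reconstructed coefficient */
    }
}

Because integer division truncates towards zero, coefficients smaller than the step size are set to zero, whilst larger coefficients are retained with reduced precision, mirroring the behaviour summarised in Table 7.4.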
Table 7.4 Quantised and rescaled coefficients: (a) fine quantisation, (b) coarse quantisation

7.6.1 Types of Quantiser

The complete quantisation process (forward quantisation followed by rescaling) can be thought of as a mapping between a set of input values and a (smaller) set of output values. The type of mapping has a significant effect on compression and visual quality. Quantisers can be categorised as linear or nonlinear.

Linear

The set of input values maps to a set of evenly distributed output values, and an example is illustrated in Figure 7.19. Plotting the mapping in this way produces a characteristic 'staircase'. A linear quantiser is appropriate when it is required to retain the maximum precision across the entire range of possible input values.

Figure 7.19 Linear quantiser

Nonlinear

The set of output values is not linearly distributed; this means that input values are treated differently depending on their magnitude. A commonly used example is a quantiser with a 'dead zone' about zero, as shown in Figure 7.20. A disproportionately wide range of low-valued inputs is mapped to a zero output. This has the effect of 'favouring' larger values at the expense of smaller ones, i.e. small input values tend to be quantised to zero whilst larger values are retained. This type of nonlinear quantiser may be used, for example, to quantise 'residual' image data in an inter-coded frame. The residual DCT coefficients (after motion compensation and forward DCT) are distributed about zero: a typical coefficient matrix will contain a large number of near-zero values (positive and negative) and a small number of higher values, and a nonlinear quantiser will remove the near-zero values whilst retaining the high values.

Figure 7.20 Nonlinear quantiser with dead zone

Figure 7.21 shows the effect of applying two different nonlinear quantisers to a sine input. The figure shows the input together with the quantised and rescaled output; note the 'dead zone' about zero. The left-hand graph shows a quantiser with 11 levels and the right-hand graph shows a 'coarser' quantiser with only 5 levels.

7.6.2 Quantiser Design

The design of the quantisation process has an important effect on compression performance and image quality. The fundamental concepts of quantiser design were presented elsewhere.²⁰ In order to support compatibility between encoders and decoders, the image and video coding standards specify the levels produced by the encoder and the set of reconstructed coefficients. However, they do not specify the forward quantisation process, i.e. the mapping between the input coefficients and the set of levels. This gives the encoder designer flexibility to design the forward quantiser to give optimum performance for different applications. For example, the MPEG-4 rescaling process for inter-coded blocks is as follows:

|REC| = QUANT x (2 x |LEVEL| + 1)        (if QUANT is odd and LEVEL ≠ 0)
|REC| = QUANT x (2 x |LEVEL| + 1) - 1    (if QUANT is even and LEVEL ≠ 0)       (7.19)
REC = 0                                  (if LEVEL = 0)

LEVEL is the decoded level prior to rescaling and REC is the rescaled coefficient. The sign of REC is the same as the sign of LEVEL. QUANT is a quantisation 'scale factor' in the range 1-31. Table 7.5 gives some examples of reconstructed coefficients for a few of the possible combinations of LEVEL and QUANT.
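Equation 7.19 translates directly into a few lines of C. The following sketch is for illustration only; the function and variable names are ours, not part of the standard.

#include <stdlib.h>

/* Sketch of MPEG-4 inter-block rescaling ('inverse quantisation'),
   following Equation 7.19. quant is the scale factor in the range 1..31. */
int rescale_inter(int level, int quant)
{
    if (level == 0)
        return 0;

    int rec = quant * (2 * abs(level) + 1);   /* QUANT x (2|LEVEL| + 1) */
    if ((quant & 1) == 0)                     /* QUANT even: subtract 1 */
        rec -= 1;

    return (level < 0) ? -rec : rec;          /* sign of REC follows sign of LEVEL */
}

For example, rescale_inter(1, 4) returns 11 and rescale_inter(-2, 3) returns -15, consistent with the entries of Table 7.5.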
Figure 7.21 Nonlinear quantisation of sine wave: (a) low quantisation; (b) high quantisation

The QUANT parameter controls the step size of the reconstruction process: outside the 'dead zone' (about zero), the reconstructed values are spaced at intervals of (QUANT x 2). A larger value of QUANT means more widely spaced reconstruction levels, and this in turn gives higher compression (and poorer decoded image quality).

Table 7.5 MPEG-4 reconstructed coefficients

                              Levels
QUANT   ...   -2    -1    0     1     2     3     4   ...
1       ...   -5    -3    0     3     5     7     9   ...
2       ...   -9    -5    0     5     9    13    17   ...
3       ...  -15    -9    0     9    15    21    27   ...
4       ...  -19   -11    0    11    19    27    35   ...

The other 'half' of the process, the forward quantiser, is not defined by the standard. The design of the forward quantiser determines the range of coefficients (COEF) that map to each of the levels. There are many possibilities here: for example, one option is to design the quantiser so that each of the reconstructed values lies at the centre of the range of input values that map to it. Figure 7.22 shows an example for QUANT = 4. After quantisation and rescaling, original coefficients in the range (-7 < COEF < 7) map to 0 (the 'dead zone'); coefficients in the range (7 to 15) map to the reconstructed value 11; coefficients in the range (15 to 23) map to 19; and so on.

Figure 7.22 Quantiser (1): REC values in centre of range

Figure 7.23 shows a typical distribution of coefficients in MPEG-4 inter-coded (P-) pictures: most coefficients are concentrated about zero and comparatively few take a high value. A better quantiser might therefore 'bias' the reconstructed coefficients towards zero; this means that, on average, the reconstructed values will be closer to the original values (for original coefficient values that are concentrated about zero). An example of a 'biased' forward quantiser design is given in Appendix III of the H.263++ standard:

|LEVEL| = (|COEF| - QUANT/2) / (2 x QUANT)   (7.20)

Figure 7.23 Typical distribution of INTER coefficients

The positions of the original and reconstructed coefficients for QUANT = 4 are shown in Figure 7.24. Each reconstructed coefficient value (REC) is near the lower end of the range of corresponding input coefficients. Inputs in the range (-10 < COEF < 10) map to 0; inputs in the range (10 to 18) map to the reconstructed value 11; and so on.

Figure 7.24 Quantiser (2): REC values biased towards lower end of range

These operations are reasonably straightforward in software or hardware; however, in some architectures division may be 'expensive', for example requiring many processor cycles in a software CODEC or a large multiplier circuit in a hardware architecture. Rescaling may lend itself to implementation using a look-up table. For example, the MPEG-4 rescaling operation for 'inter' coefficients (Equation 7.19) has a limited number of possible outcomes: there are 31 QUANT values and the value of |LEVEL| may be in the range 0-128, so there are (31 x 129 =) 3999 possible reconstructed values. Instead of directly calculating Equation 7.19, the 3999 outcomes may be pre-calculated and stored in a look-up table which the decoder indexes according to QUANT and LEVEL. This is less practical for the forward quantiser: the magnitude of COEF may be in the range 0-2048, so (2049 x 31 =) 63 519 possible LEVEL outcomes would need to be stored in the look-up table.

7.6.4 Vector Quantisation

In the examples discussed above, each sample (e.g. a transform coefficient) was quantised independently of all other samples (scalar quantisation).
In contrast, quantising a groupof samples as a‘unit’ (vector quantisation) can offer more scope for efficient compression.21 In its basic form, vector quantisation is applied in the spatial doma(iin.e. it does not involve a transform operation). The heart of a vector quantisation(VQ) CODEC is a codebook. This contains a predetermined setof vectors, where each vector is a blockof samples or pixel.A VQ CODEC operates as follows: 158 block TRANSFORM CODING Encode Decode Find best Transmit Codebook Vector 1 Vector 2 Vector N Codebook Vector 1 Vector 2 Vector N + output block Figure 7.25 Vector quantisation CODEC 1. Partition the source image into blocks (e.g. 4 x 4 or 8 x 8 samples). 2 . For each block, choose a vector from the codebook that matches the block as closely as possible. 3. Transmit an index that identifies the chosen vector. 4. The decoder extracts the appropriate vector anduses this to represent the original image block. Figure 7.25 illustrates the operation of a basic VQ CODEC. Compression is achieved by ensuring that the index takes fewer bits to transmthitan the original imageblock itself. VQ is inherently lossy because, for most image blocks, the chosen vector will not be an exact match and hence the reconstructed image block will not be identical to the original. The larger the number of vectors (predetermined image blocks)in the codebook, the higher the likelihood of obtaining a good match. However, a large codebook introduces two difficulties: first, the problem of storing the codebookandsecond,theproblem of searchingthe codebook to find the optimum match. The encoder searches the codebook and attempts to minimise the distortion between the original image block x and the chosen vector x, according to some distortion metric (for example, mean squared error: I(x- x ( l 2). The search complexity increases with the number of vectors in the codebookN , and muchof the research into VQ techniquheass concentrated on methods of minimising the complexityof the search whilst achieving good compression. Many modifications to the basic VQtechniquedescribedabovehavebeenproposed, including the following. Tree search VQ In order to simplify the search procedurein a VQ encoder, the codebook is partitioned into a hierarchy. At each level of the hierarchy, the input image block is compared with just two QUANTISATION 159 Input block 1, Level 0 Level 1 Level 2 ... ... Figure 7.26 Tree-structuredcodebook possible vectors and the best match is chosen.At the next level down, two further choices are offered (based on the choice at the previous level), and so on. Figure 7.26 shows the basic technique: the inputblock is first compared with two ‘root’ vectors A and B (level 0). If A is chosen, the next comparison is with vectors C and D; if B is chosen, the next level chooses between E and F; and so on. In total, 2logzN comparisons are required for a codebook of N vectors. This reduction in complexity is offset against a potential loss of image quality compared with a ‘full’ search of the codebook, since the algorithm is not guaranteed to find the best match out of all possible vectors. Variable block size In its basic form, with blocks of a constant size, a VQ encoder must transmit an index for every block of the image.Most images have areasof high and low spatial detail and itcan be advantageous touse a variable block size for partitioning the image, as shown in Figure 7.27. Prior to quantisation, the image is partitioned into non-overlapping blocks of varying sizes. 
Small blocks are chosen for areas of the image containing important detail; large blocks are used where there is less detail in theimage. Each block is matched with a vector from the codebook and the advantage of this method is that a higher density of vectors (and hence better image reproduction) is achieved for detailed areas of the image,whilst a lower density (and hence fewer transmitted bits) is chosen for less detailed areas. Disadvantages include the extra complexity of the initial partitioning stage and the requirement ttoransmit a ‘map’ of the partitioning structure. Practical considerations Vector quantisation is highly asymmetrical in terms of computational complexity. Encoding involves an intensive search operation for every image block, whilst decoding involves a simple table look-up. VQ (in its basic form) is therefore unsuitable for many two-way video 160 TRANSFORM CODING Figure 7.27 Image partitioned into varying block sizes communication applications but attractive for applications whelroew decoder complexity is required. At present, VQ has not found iwtsay into anyof the ‘mainstream’ video and image coding standards. However, it continues to be an active area for research and increasingly sophisticated techniques (such as fast codebook search algorithms and VQ combined with other image coding methods) may lead to increased uptake in the future. 7.7 SUMMARY The most popular methodof compressing images (or motion-compensated residual frames) is by applying a transform followedby quantisation. The purpose of an image transform is to decorrelatetheoriginalimagedata and to‘compact’theenergy of theimage.After decorrelationandcompaction,most ofthe imageenergyisconcentratedintoasmall number of coefficients which are ‘clustered’ together. The DCT is usuallyappliedto 8 x 8 blocks of imageorresidualdata. The basic2-D transformisrelativelycomplex to implement,butthecomputationcan be significantly reduced first bysplittingitintotwo l-D transforms and second by exploitingsymmetry properties to simplify each 1-D transform. ‘Flowgraph’-type fast algorithms are suitable for software implementations anda range of algorithms enable the designer to tailor the choice of FDCTtotheprocessingplatform.Themoreregularparallel-multiplierordistributed arithmetic algorithms are better suited to dedicated hardware designs. The design of the quantiser can have an important contribution to image quality inan image or video CODEC. After quantisation, the remaininsgignificant transform coefficients are entropy encoded together with side information (such as headers and motion vectors) toform a compressedrepresentation of theoriginalimage or videosequence. The next chapter will examine the theory and practice of designing efficient entropy encoders and decoders. REFERENCES 161 REFERENCES 1. N. Ahmed,T. Natrajan andK. R. Rao, ‘Discrete cosine transform’I,EEE Trans. Computers, January 1974. 2. W. A. Pearlman, ‘Trendsof tree-based, set-partitioning compression techniques instill and moving image systems’, Proc. PCSOl, Seoul, April 2001. 3. W-H. Chen, C. H. Smith and S. C. Fralick, ‘A fast computational algorithm for the discrete cosine transform’, IEEE Trans. Communications, COM-25, No. 9, September 1977. 4. B. G. Lee, ‘A new algorithm to compute the discrete cosine transform’, IEEE Trans. ASSP, 32(6), December 1984. 5. C. Loeffler, A. Ligtenberg and G. Moschytz, ‘Practical fast l-D DCT algorithms with 11 multi- plications’, Proc. ICASSP-89, 1989. 6. Y. Arai, T. Agui and M. 
Nakajima, ‘A fast DCT-SQ scheme for images’, Trans. ofthe IEICE E, 71(1l), November 1988. 7. M.Vetterli and H. Nussbaumer,‘SimpleFFTandDCTalgorithmswithreducednumber of operations’, Signal Processing, 6(4), August 1984. 8. E. Feig and S. Winograd, ‘Fast algorithms for the discrete cosine transform’, IEEE Trans. Signal Processing, 40(9), September 1992. 9. F. A. Kamangar and K. A. Rao, ‘Fast algorithms for the 2-D discrete cosine transformIE’E, E Trans. Computers, 31(9), September 1982. 10. M. Vetterli, ‘Fast 2-D discrete cosine transform’, Proc. IEEE ICASSP, 1985. 11.A.Peled and B. Liu, ‘A new hardwarerealization of digital filters’, IEEE Trans. ASSP, 22(6), December 1974. 12. J. R. Spanier, G. Keane, J. Hunter and R. Woods, ‘Low power implementationof a discrete cosine transform IP Core’, Proc. DATE-2000, Paris, March 2000. 13. T. D. Tran, ‘The BinDCT fast multiplierless approximation of the DCT’, IEEE Signal Processing Letters, 7, June 2000. 14. M. Sanchez, J. Lopez, 0.Plata, M. Trenas andE. Zapata, ‘An efficient architecture for the in-place fast cosine transform’, .I. VLSI Sig. Proc., 21(2), June 1999. 15. G. Aggarwal and D. Gajski, Exploring DCT Implementations, UC Irvine Tech Report TR-98-10, March1998. 16. G. A. Jullien, ‘VLSI digital signal processing: some arithmetic issuesP’r,oc. SPIE, 2846, Advanced Signal Processing Algorithms, Architectures and Implementations, October 1996. 17. M.T. Sun, T. C. ChenandA.Gottlieb,‘VLSIimplementation of a16 * 16discretecosine transform’, IEEE Trans. Circuits and Systems, 36(4), April 1989. 18 T-S. Chang, C-S. Kung andC-W. Jen, ‘A simple processor core design for DCTLDCTI’E, EE Trans. on CSVr, 10(3), April 2000. 19. P. Lee and G. Liu,‘An efficient algorithm for the 2D discrete cosine transformS’,ignal Processing, 55(2), Dec. 1996, pp. 221-239. 20. Y. Shoham and A. Gersho, ‘Efficient bit allocation for an arbitrary set of quantisers’, IEEE Trans. ACSSP, 32(3), June 1984. 21. N. Nasrabadi and R.King,‘Imagecodingusingvectorquantisation:areview’, IEEE Trans. Communications, 36(8), August 1988. Video Codec Design Iain E. G. Richardson Copyright q 2002 John Wiley & Sons, Ltd ISBNs: 0-471-48553-5 (Hardback); 0-470-84783-2 (Electronic) Entropy Coding 8.1 INTRODUCTION A video encoder contains two main functions: a source model that attempts to represent a video scene in a compact form that is easy to compress (usually an approximation of the original video information) and anentropy encoder that compresses the outputof the model prior to storage and transmission. The source model is matched to the characteristics of the input data (images or video frames), whereas the entropy coder may use ‘general-purpose’ statistical compressiontechniques that arenotnecessarilyuniqueintheirapplication to image and video coding. As with the functions described earlier (motion estimation and compensation, transform coding,quantisation),thedesign ofan entropyCODECisaffectedbyanumber of constraints including: 1. Compression eficiency: the aim is to represent the source model output usingas few bits as possible. 2. Computational eficiency: thedesignshouldbesuitableforimplementationonthe chosen hardware or software platform. 3. Error robustness: if transmission errors are likely, the entropy CODEC should support recoveryfromerrorsandshould (if possible)limiterrorpropagation at decoder (this constraint may conflict with (1) above). In a typical transform-based video CODEC, the data to be encoded by the entropy CODEC fallsintothreemaincategories:transformcoefficients (e.g. 
quantisedDCT coefficients), motion vectors and ‘side’ information (headers, synchronisation markers, etc.). The method of coding side information depends on the standard. Motion vectors can often be represented compactly in a differential form dueto the high correlation between vectors for neighbouring blocks or macroblocks.Transformcoefficientscanberepresented efficiently with‘run- level’ coding, exploiting the sparse nature of the DCT coefficient array. An entropy encoder maps input symbols (for example, run-level coded coefficients) to a compresseddatastream. It achievescompressionbyexploitingredundancy in theset of input symbols, representing frequently occurring symbols with a small number of bits and infrequently occumng symbols with a larger number of bits. The two most popular entropy encodingmethodsusedinvideocodingstandardsareHuffmancodingandarithmetic coding.Huffmancoding (or ‘modified’Huffmancoding)representseachinputsymbol by avariable-lengthcodewordcontaining an integralnumber of bits. It is relatively 164 ENTROPY CODING straightforwardtoimplement,butcannotachieveoptimalcompressionbecause of the restriction that each codeword must contain an integral number of bits. Arithmetic coding maps aninputsymbolintoafractionalnumber of bits,enablinggreatercompression efficiency at the expense of higher complexity (depending on the implementation). 8.2 DATA SYMBOLS 8.2.1 Run-LeveCl oding The output of the quantiser stage in a DCT-based video encoder is a block of quantised transform coefficients. The arrayof coefficients is likely to be sparsei:f the image block has beenefficientlydecorrelated by the DCT, most of thequantisedcoefficients in a typical block are zero. Figure 8.1 shows a typical block of quantised coefficients from an MPEG-4 ‘intra’blockT. hestructure of thequantisedblock is fairlytypical. A few non-zero coefficients remain after quantisation, mainly clustered around DCT coefficient (0,O): this is the ‘DC’ coefficient and is usually the most important coefficient to the appearancoef the reconstructed image block. The block of coefficients shown in Figure 8.1 may be efficiently compressed as follows: 1. Reordering. The non-zero values are clustered around the top left of the 2-D array and this stage groups these non-zero values together. 2. Run-level coding. This stage attempts to find a more efficient representation for the large number of zeros (48 in this case). 3. Entropy coding. The entropy encoder attempts to reduce the redunodfatnhcey data symbols. Reordering The optimum method of reordering the quantised data depends on the distribution of the non-zero coefficients. If the original image (or motion-compensated residual) data is evenly DC Figure 8.1 Block of quantisecdoefficients (intra-coding) DATA SYMBOLS 165 distributedin thehorizontal andverticaldirections (i.e. thereis not apredominance of ‘strong’ image features in either direction), then the significant coefficienwtsill also tend to be evenly distributed about the top left of the array (Figure 8.2(a)). In this case, a zigzag reordering pattern such as Figure 8.2 (c) should group together the non-zero coefficients Typical coefficientmap:frame coding - 8000 6000. 4000. - 2000 O0 L i 2P o Typical coefficientmap:field coding j 2000 O0 b 2 2 88 (b) Figure 8.2 Typicadlatadistributionasndreorderingpatterns(:ae)vendistribution(;bf)ield distribution;(c) zigzag; (d) modified zigzag 166 ENTROPY CODING efficiently. However, in some cases an alternative pattern performs better. 
For example, a field of interlaced video tends to vary more rapidly in the vertical thanin the horizontal direction (because it has been vertically subsampled). In this case the non-zero coefficients are likely to be ‘skewed’ as shown in Figure 8.2(b): they are clustered more to the leofft the array (corresponding to basis functions with a strong vertical variation, see for example Figure 7.4). A modified reordering pattern such as Figure 8.2(d) should perform better at grouping the coefficients together. Run-level coding The output of the reordering process is a linear array of quantised coefficients. Non-zero coefficients are mainly grouped together near the start of the array and the remaining values inthe array arezero.Longsequences of identicalvalues(zerosinthiscase)can be represented as a (run, level) code, where (run) indicates the number of zeros preceding a non-zero value and (level) indicates the sign and magnitude of the non-zero coefficient. The following example illustrates the reordering and run-level coding process. Example Theblock of coefficientsinFigure 8.1 isreorderedwiththezigzag Figure 8.2 and the reordered array is run-level coded. scan shown in Reordered array: [102, -33, 21, -3, -2, -3, -4, - 3 , 0 , 2 , 1,0, 1,0, -2, - 1, -1,0, 0,0, -2, 0,0, 0, 0,0,0,0,0,0,0,0,1,0 ...l Run-level coded: (0, 102) (0, -33) (0, 21) (0, -3) (0, -2) (0, -3) (0, -4) (0, -3) (1, 2) (0, 1 ) (1, 1 ) (1, -2) (0, - 1) (0, -1) (4, -2) (11, 1) DATA SYMBOLS 167 Two special cases need to be considered. Coefficient (0, 0) (the ‘DC’ coefficient) is impor- tant to the appearance of the reconstructed image block and has no preceding zeros. In an intra-codedblock (i.e. codedwithoutmotioncompensation),theDCcoefficientisrarely zero and so istreateddifferentlyfromothercoefficients. In an H.263 CODEC, intra-DC coefficientsareencodedwitha fixed, relativelylowquantisersetting(topreserveimage quality) and without (run, level) coding. Baseline JPEG takes advantagoef the property that neighbouringimageblockstendtohavesimilarmeanvalues(andhencesimilar DC coefficientvalues)andeachDCcoefficientisencodeddifferentiallyfromtheprevious DC coefficient. The second special case is the final run of zeros in a block. Coefficient (7, 7) is usually zero and so we need a special case to deal with the final run of zeros that has no terminating non-zero value. In H.261 and baseline JPEG, a special code symbol, ‘end of block’ or EOB, is inserted after the last (run, level) pair. This approach is known as ‘two-dimensional’ run- level coding since each code represenjtus st two values (run and level).The method doesnot perform well underhighcompression:inthiscase,manyblockscontainonlya DC coefficient and so the EOB codes make up a significant proportion of the coded bit stream. H.263 and MPEG-4 avoid this problemby encoding a flag along with each (run, level)pair. This ‘last’ flag signifies the final (run, level) pair in the block and indicates to the decoder thattherest of theblockshouldbe filled withzeros.Eachcode now representsthree values (run, level, last) and so this method is known as ‘three-dimensional’ run-level-last coding. 8.2.2 Other Symbols In addition to run-level coded coefficient data, a number of other values need to be coded and transmitted by the video encoder. These include the following. Motion vectors The vectordisplacementbetweenthecurrentandreferenceareas (e.g. macroblocks)is encoded along with each dataunit. 
Motion vectors for neighbouring data units are oftenvery similar, and this property may be used to reduce the amount of information required to be encoded.In anH.261 CODEC,forexample,themotionvectorforeachmacroblock is predicted from the preceding macroblock. The difference between the current and previous vectorisencodedandtransmitted(instead of transmittingthevectoritself). A more sophisticatedpredictionisformedduringMPEG-4/H.263coding:thevectorforeach macroblock(orblock if theoptionaladvancedpredictionmodeisenabled)ispredicted from up to threepreviouslytransmittedmotionvectors.Thishelps to further reduce the transmittedinformation. Thesetwomethods of predictingthecurrentmotionvector are shown in Figure 8.3. Example Motion vector of current macroblock: Predicted motion vector from previous macroblocks: Differential motion vector: x = +3.5, x = f3.0, dx = +0.5, y = +2.0 y = 0.0 dy = -2.0 168 ENTROPY CODING Current macroblock Current macroblock H.261: predict MV from previous macroblock vector MV1 H.263/MPEG4: predict MV from three previous macroblock vectors MV1, MV2 and MV3 Figure 8.3 Motion vector prediction (H.261, H.263) Quantisation parameter In order to maintain a target bit rate, it is common for a videoencodertomodifythe quantisation parameter (scale factor or step size) during encoding. The change must be signalledtothedecoder. It isnotusuallydesirabletosuddenlychangethequantisation parameter by a large amount during encodingof a video frame andso the parameter may be encoded differentially from the previous quantisation parameter. Flags to indicate presence of coded units It iscommonforcertaincomponents of amacroblocknottobepresent.Forexample, efficient motion compensation and/or high compression leads to many blocks containing only zero coefficients after quantisation. Similarly, macroblocks in an area that contains no motionor homogeneous motionwill tend tohavezeromotionvectors(afterdifferential prediction as described above). In some cases, a macroblockmay contain no coefficient data and a zero motion vector, i.e. nothing needs to be transmitted. Rather than encoding and sending zero values, it can be more efficient to encode flag(s) that indicate the presence or absence of these data units. Example Coded blockpattern(CBP)indicatestheblockscontainingnon-zerocoefficients inter-coded macroblock. in an INumber of cnooenf-fziceireonts bloinckeach I YO Yl Y2 Y3 2 1 I 0 6 0 I 9 7 I 1 Cr I 1 Cb I 3 II I I I CBP llO100 0 0 I 011111 HUFFMAN CODING 169 Synchronisation markers A video decoder may require to resynchronise in the evenot f an error or interruption to the stream of coded data. Synchronisation markers in the bit stream provide a means of doing this. Typically, the differential predictions mentioned above (DC coefficient, motion vectors and quantisation parameter) are reset after a synchronisation marskoetrh, at the data after the marker may be decoded independentlyof previous (perhaps errored) data. Synchronisation is supported by restart markers in JPEG, group of block (GOB) headers in baseline H.263 and MPEG-4 (at fixed intervals within the coded picture) and slice start codes in the MPEG-1, MPEG-2 and annexes to H.263 and MPEG-4 (at user definable intervals). Higher-level headers Information that applies to a complete frame or picture is encoded in a header (picture header). Higher-level information about a sequence of framesmayalsobeencoded(for example, sequence and group of pictures headers in MPEG-1 and MPEG-2). 
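Forming a coded block pattern of this kind is straightforward. The following C sketch is illustrative only: it sets one bit per 8 x 8 block if the block contains any non-zero quantised coefficients. The bit ordering chosen here is arbitrary; each standard defines its own CBP syntax.

/* Illustrative coded block pattern for a macroblock of six 8 x 8 blocks
   (four luminance, two chrominance). A bit is set if the corresponding
   block contains at least one non-zero quantised coefficient. */
int coded_block_pattern(const int blocks[6][64])
{
    int cbp = 0;
    for (int b = 0; b < 6; b++) {
        int nonzero = 0;
        for (int i = 0; i < 64; i++) {
            if (blocks[b][i] != 0) { nonzero = 1; break; }
        }
        cbp = (cbp << 1) | nonzero;   /* append one flag bit per block */
    }
    return cbp;   /* 0 means no coefficient data need be sent for these blocks */
}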
8.3 HUFFMAN CODING A Huffman entropy encoder maps each input symbol into a variable length codeword and thistype of coderwas first proposedin1952.’ Theconstraintsonthe variablelength codewordarethat it must (a)containanintegralnumber of bitsand(b) be uniquely decodeable (i.e. the decoder must be able to identify each codeword without ambiguity). 8.3.1 ‘True’HuffmanCoding In order to achieve the maximum compression of aset of datasymbolsusingHuffman encoding, it is necessary to calculate the probabilityof occurrence of each symbol. A set of variablelengthcodewordsisthenconstructed for thisdataset. This processwillbe illustrated by the following example. Example: H u f i a n coding, ‘Carphone’ motion vectors A video sequence, ‘Carphone’, was encoded with MPEG-4 (short header mode). Table 8.1 liststheprobabilities of themostcommonlyoccurringmotionvectorsintheencoded Table 8.1 Probability of occurrence of motion vectors in ‘Carphone’ sequence ProbabilitVy ector P log2(l/P) - 1.5 0.014 6.16 -1 0.024 5.38 - 0.5 0.117 3.10 0 0.646 0.63 0.5 0.101 3.31 1 0.027 5.21 1 .S 0.0 16 5.97 170 1 ENTROPY CODING Probability distribution of motion vectors 0.9 0.8 0.7 0.6 .-.2aL-_.0.5 a2 0.4 0.3 0.2 0.1 OL -3 -2 -1 0 1 2 3 MVX or MVY Figure 8.4 Distribution of motionvectorvalues sequence and their information content, 10g2(1/P). To achieve optimum compression, each value should be represented with exactly 10g2(llP) bits. The vector probabilities are shown graphicalliyn Figure 8.4 (the solid line).‘0’is the most common value and the probability drops sharply for larger motion vectors. (Note that there are a small numberof vectors larger than+/- 1.5 and so the probabilitiesin the table donot sum to l.) 1. Generatingthe HufSman codetree To generate a Huffman code table for this set of data, the following iterative procedure is carried out (we will ignore any vector values that do not appear in Table 8.1): 1. Order the list of data in increasing order of probability. 2. Combinethetwolowest-probabilitydataitemsintoa‘node’ probability of the data items to this node. and assignthejoint 3. Reordertheremainingdataitemsandnode(s) repeat step 2. in increasingorder of probabilityand HLTFFMAN CODING 171 U Figure 8.5 Generating the Huffman codetree:‘Carphone’motionvectors The procedure isrepeated until there is a single ‘root’ node that contains all other nodes and data items listed ‘beneath’ it. This procedure is illustrated in Figure 8.5. 0 Original list: The data items are shown as square boxes. Vectors ( - 1S ) and (1S ) have the lowest probability and these are the first candidates for merging to form node ‘A’. 0 Stage 1: The newly created node ‘A’ (shown as a circle) has a probability of 0.03 (from the combined probabilities of ( - 1.5) and (1.5)) and the two lowest-probability items are vectors ( - l ) and (1). These will be merged to form node ‘B’. 0 Stage 2: A and B are the next candidates for merging (to form ‘C’). 0 Stage 3: Node C and vector (0.5) are merged to form ‘D’. 0 Stage 4: (-0.5) and D are merged to form ‘E’. 0 Stage 5: There are two ‘top-level’ items remaining: node E andthehighest-probability vector (0). These are merged to form ‘F’. 0 Final tree:The data itemshave all been incorporated into a binary ‘tree’ containing seven data values and six nodes. Each data item is a ‘leaf’ of the tree. 2. Encoding Each ‘leaf’ of the binary tree is mapped to a VLC. To find this code, the tree is ‘traversed’ from the root node (F in this case) to the leaf (data item). 
For every branch, a 0 or 1 is appended to the code:0 for an upper branch,1 for a lower branch (shown in thefinal tree of Figure 8.5). Thisgivesthefollowingset of codes(Table 8.2). Encodingisachievedby transmittingtheappropriate codeforeachdataitem.Note thatoncethetree has been generated, the codes may be stored in a look-up table. 172 ENTROPY CODING Table 8.2 Huffman codes: ‘Carphone’motionvectors Bits CodeVector (actual) Bits (ideal) 0 1 1 0.63 - 0.5 00 2 3.1 0.5 01 1 3 3.3 1 - 1.5 5 01000 6.16 1 .S 01001 5 5.97 -1 01010 5 5.38 1 0101 l 5 5.21 Note the following points: 1.Highprobability dataitemsareassignedshortcodes (e.g. 1bitforthemostcommon vector ‘0’). However,thevectors ( - 1.5,1.5, - 1 , 1)areeach assigned5-bitcodes (despite the fact that - 1 and - 1 have higher probabilities than 1.5 and 1.5). The lengths of the Huffman codes (each an integral number of bits) do not match the ‘ideal’ lengths given by log,( l/P). 2. No code contains any other code as a prefixi,.e. reading from the left-hand bit, each code is uniquely decodable. For example, the series of vectors (1, 0, 0.5) would be transmitted as follows: 3. Decoding In order to decode the data, the decoder must havloecal copy of the Huffman code tree (or look-up table). This may be achieved by transmitting the look-up table itselfo, r sending the list of data and probabilities, prior to sendingthe coded data. Each uniquely decodable code may then be read and converted back to the original data. Following the example above: 01011 is decoded as (1) 1 is decoded as (0) 01 1 is decoded as (0.5) Example: Hu@nan coding, ‘Claire’ motion vectors Repeatingtheprocessdescribedaboveforthevideosequence‘Claire’givesadifferent result. Thissequencecontains less motionthan‘Carphone’and so thevectorshavea different distribution (shown in Figure8.4, dotted line). A much higher proportionof vectors are zero (Table 8.3). The corresponding Huffman tree is given in Figure 8.6. Note that the ‘shape’ of the tree has changed (because of the distribution of probabilities) and this gives a different set of HUFFMAN CODING 173 Table 8.3 Probabilityofoccurrenceofmotionvectors in ‘Claire’ sequence 9.66 0.001 - 1.5 8.38 - 1 0.003 - 0.5 5.80 0.0 18 0.07 0 0.953 0.021 0.5 0.003 1 1.5 0.001 9.66 Figure8.6 Huffmantreefor‘Claire’motionvectors Huffman codes (shown in Table 8.4). There are still six nodes in the tree, one less than the number of data items (seven): this is always the case with Huffman coding. If the probability distributions are accurate, Huffman coding provideasrelatively compact representation of the original data. In these examples, the frequently occurring (0) vector is represented very efficiently as a single bit. However, to achieve optimum compression, a Table 8.4 Huffmancodes:‘Claire’motionvectors eal) Bits (actual) Bits CodeVector 0 1 1 0.07 0.5 00 2 5.57 - 0.55.8 01 1 3 0100 1 4 8.38 -1 8.38 0101 1 5 - 1.59.66 010100 6 0101011.S9.66 6 174 ENTROPY CODING separate code table is required for each of the two sequences ‘Carphone’ and ‘Claire’. The loss of potential compression efficiency due to the requirement for integral length codes is very obvious for vector ‘0’ in the ‘Claire’ sequence: the optimum nuomfbbeirts (information content) is 0.07 but the best that can be achieved with Huffman coding is 1 bit. 8.3.2 Modified Huffman Coding The Huffman coding process described above has two disadvantages for a practical video CODEC. First, the decoder must use the same codeword set as the encoder. 
This means that the encoder needs to transmit the information contained in the probability table before the decoder can decode the bit stream, an extra overhead that reduces compression efficiency. Second, calculating the probability table for a large video sequence (prior to generating the Huffman tree) is a significant computational overhead and cannot be done until after the video data is encoded. For these reasons, the image and video coding standards define soefts codewords based on the probability distributionsof a large rangeof video material. Because the tables are ‘generic’, compression efficiency is lower than that obtained by pre-analysing the data to be encoded, especially if thesequencestatisticsdiffersignificantlyfromthe ‘generic’ probability distributions. The advantage of not requiring to calculate and transmit individual probability tables usually outweighs this disadvantage. (Note: Annex C of the original JPEG standard supports individually calculated Huffman tables, but most practical implementations use the ‘typical’ Huffman tables provided in Annex K of the standard.) 8.3.3Table Design The following two examples of VLC table design are taken from the H.263 and MPEG-4 standards. These tables are required for H.263 ‘baseline’ coding and MPEG-4 ‘short video header’ coding. H.263/MPEG-4 transform coeficients (TCOEF) H.263andMPEG-4use‘3-dimensional’coding of quantisedcoefficients,whereeach codewordrepresentsacombination of (run,level,last) asdescribedinSection8.2.1. A total of 102 specific combinationsof (run, level, last) haveVLCs assigned to them. Table 8.5 shows 26 of these codes. A further 76 VLCs are defined, each up to 13 bits long. Note that the last bit of each codewordisthesign bit ‘S’, indicatingthesign of thedecodedcoefficient(O=positive, 1 =negative). Any (run, level, last) combination that is not listed in the table is codedusing an escape sequence, a special ESCAPE code (000001 1 ) followed by a 13-bit fixed length code describing the values of run, level and last. The codes shown in Table 8.5 are represented in ‘tree’ form in Figure 8.7. A codeword containingarun of morethaneightzerosisnotvalid, so anycodewordstartingwith 000000000. . . indicates an error in the bit stream (or possibly a start code, which begins with a long sequence of zeros, occurring at an unexpected position in the sequence). All othersequences of bitscan bedecodedasvalidcodes.Notethatthesmallestcodesare HUFFMAN CODING 175 Table 8.5 H.263MPEG4transformcoefficient(TCOEF) VLCs (partial, all codes 9 bits) Last CRoduen Level 0 0 1 0 1 1 0 2 1 0 0 2 1 0 1 0 3 1 0 4 l 0 5 1 0 0 3 0 l 2 0 6 1 0 7 1 0 8 1 0 9 1 1 1 1 1 2 1 1 3 1 1 4 1 0 0 4 0 10 1 0 11 1 0 12 1 1 5 1 1 6 1 1 7 1 1 8 1 ESCAPE ... 10s 110s 1110s 1111s 0111s 01 101s 01 loos 0101 1s 010101s 010100s 01001 1s 0l0010s 010001s 0 10000s 001111s 001110s 001101s 001 loos 00101 11s 0010110s 0010101s 00lOlO0s 0010011s 0010010s 0010001s 00 10000s 000001 1s ... allocated to short runs and small levels (e.g. code ‘10’ represents a run of 0 and a level of +/- l), since these occur most frequently. H.263/MPEG-4 motion vector difference (MVD) The H.263MPEG-4 differentially coded motion vectors (MVD) described in Section 8.2.2 are each encoded as a pairof VLCs, one for the x-component and one for the y-component. Part of the table of VLCs is shown in Table 8.6 and in ‘tree’ form in Figure 8.8. A further 49 codes (8-13 bits long) are not shown here. Note that the shortest codes represent small motion vector differences (e.g. 
MVD =0 is represented by a single bit code ‘l’). H.26L universal VLC (UVLC) The emerging H.26L standard takes a step away from individually calculated Huffman tables by using a ‘universal’ setof VLCs for any coded element. Each codeword is generated from 176 ENTROPY CODING 000000000X (error) Start ~ 0 1D B 1 0 1 000001 1 (escape) ...19 codes D 1 0010000 (1,8,1) 0010001 (1,7,1) T 0010010 (1,6,1) 0010011 (1,5,1) 0010100 (O,lZ,l) 0010101 ( 0 , l l . l ) m 0010110(0,10,1) 00101 11 (0,0.4) 001100 (1,4,1) 001101 (1,3,1) 001110(1.2,1) 001111 ( l , l , l ) - ti-010000(0,9,1) 010001 (0,8,1) 010010 (0,7,1) 010011 (0,6,1) 010100 (0,1.2) 010101 (0,0,3) 0101 1 (0,5,1) if-01100(0,4,1) 01 101 (0,3.1) 0111 (l,O,l) 10 (O,O,1) ‘ . 1 : : : 9 110(O,l,l) Figure 8.7 H.263/MPEG-4TCOEFVLCs Code HUFFMAN CODING 177 Table 8.6 H.263/MPEG-4motionvector difference (MVD) VLCs MVD 0 f0.5 - 0.5 +l -1 + 1.5 - 1.5 +2 -2 + 2.5 - 2.5 +3 -3 + 3.5 - 3.5 ... 1 010 01 1 0010 001 1 00010 0001 1 00001 10 000011 1 00001010 00001011 0000 1000 0000 100 1 00000 1 10 000001 11 ... the following systematic list: ... where xk is a single bit. Hence there is one l-bit codeword; two 3-bit codewords; four 5-bit codewords; eight 7-bit codewords; andso on. Table 8.7 shows the first 12 codes and these are represented in tree form in Figure 8.9. The highly regular structure of the set of codewords can be seen in this figure. Any data element to be coded(transform coefficients, motion vectors, block patterns, etc.) is assigned a code from the list of UVLCs. The codes are not optimised for a specific data element (since the same set of codes is used for all elements): however, the uniform, regular structure considerably simplifies encoder and decoder design sincethe same methods can be used to encode or decode any data element. 8.3.4 EntropyCodingExample This examplefollows the process of encoding and decoding a block of quantised coefficients in an MPEG-4 inter-coded picture. Only six non-zero coefficients remain in the block: this 178 Table 8.7 H.26L universal VLCs Index x2 0 1 2 3 4 5 6 7 8 9 10 11 ... ... ENTROPY CODING X1 X0 NIA 0 1 0 1 0 1 0 1 0 1 0 ... ... Codeword 1 00 1 01 1 0000 1 0001 1 01001 0101 1 OOoooO1 0000011 0001001 000101 1 0 10000 1 ... Start ~ i0 1 - i ...39 codes ...10codes A T ooooo11o (3.5) om0011 1 (-3.5) aoooO1ooO (3) oooo1001 (-3) b 1 T oooO1010 (2.5) ooO01011 (-2.5) A ooo0110 (+2) m 1 1 1 (-2) ooO10 (+l .5) o o O 1 1 (-1.5) 010 (+0.5) 011 (-0.5) Figure 8.8 H.263iMPEG-4 MVDVLCs HUFFMAN CODING 179 ...etc 0000001 (7) ...etc 0000011 (8) ...etc 0001001 (9) ... etc 001 (1) ooo1011 (10) c00011(4) ...etc 01oooo1 (11) ...etc 1 01001 (5) O l o o o l l (12) ...etc 0101001 (13) ...etc 011 (2) 0101011 (14) 01011 (6) - 1 (0) Figure 8.9 H.26L universal VLCs 180 ENTROPY CODING wouldbecharacteristic of eitherahighlycompressedblockorablockthathasbeen efficiently predicted by motion estimation. Quantised DCT coefficients (empty cells are ‘0’): Zigzag reordered coefficients: 4-10 2-30 0 0 0 0-10 0 0 1 0 o... TCOEF variable length codes: (from Table 8.5: note that the last bit is the sign) 00101110; 101; 0101000; 0101011; 010111; 0011010 Transmitted bit sequence: 001011101010101000010l01l01011I00l10l0 Decoding of this sequence proceeds as follows. The decoder ‘steps’ through the TCOEF tree (shown in Figure 8.7) until it reaches the ‘leaf’ 00101 11. The next bit (0) is decoded as the sign and the (last, run, level) group (0, 0, 4) is obtained. The steps taken by the decoder for this first coefficient are highlightedin Figure 8.10. 
The process is repeated with the ‘leaf’10 followed by sign (1) and so on until a ‘last’ coefficient is decoded. The decoder can now fill the coefficient array and reverse the zigzag scan to restore the array of 8 x 8 quantised coefficients. 8.3.5 VariableLengthEncoderDesign Sofiware design A general approach to variable-length encoding in software is as follows: HUFFMAN CODING 181 0 Start 001OOOO (1,8,1) 0010001 (1,7,1) 0010010(1,6,1) 0010011 (1,5,1) 0010100 (0,12,1) 0010101 (O,ll,l) 0010110 (O,lO,l) 0010111 (0,0,4) Figure8.10 Decoding of codeword OOlOllls f o r eachdatasymbol findthecorrespondingVLCvalueandlength( i n b i t s ) packthisVLCintoanoutputregisterR if the contents of RexceedLbytes writeL (least significant) bytestotheoutputstream s h i f t R by L bytes Example Using the entropy encoding example above, L = 1 byte, R is empty at start of encoding: 182 ENTROPY CODING Thefollowingpackedbytesarewritten totheoutputstream:00101110,01000101, 10101 101,00101111. At the endof the above sequence, the output registRerstill contains 6 bits (001101).If encoding stops here, it willbe necessary to ‘flush’ the contents of R to the output stream. The MVD codes listed inTable 8.6 can be stored in a simple look-up table. Only64 valid MVD values exist and the contents of the look-up table are as follows: [ index 1 [vlc] [ length ] where [index] is a number in the range 0 . ..63 that is derived directly from MVD, [vlc] is the variable length code ‘padded’with zeros and representedwith a fixed number of bits (e.g. 16 or 32 bits) and [length] indicates the number of bits present in the variable length code. Converting (last, run, level) into the TCOEF VLCs listed in Table 8.5 is slightly more problematic. The 102 predetermined combinations of (last, run, level) have individuaVl LCs assignedtothem(thesearethemostcommonlyoccurringcombinations) and anyother combination must be converted to an Escape sequence. The problem is that there are many more possible combinations of (last, run, level) than there are individual VLCs. ‘Run’ may take any value between 0 and 62; ‘Level’ any value between 1 and 128; and ‘Last’ is 0 or 1. This gives 16 002 possible combinations of (last, run, level). Three possible approaches to finding the VLC are as follows: 1. Large look-up table indexed by (last, run, level). The size of this table may be reduced somewhatbecauseonlylevelsin therange 1-12 and runs in therange 0-40 have individual VLCs. The look-up procedure is as follows: i f ( \ l e v e l ] < 1 3 a n d r u n3<9 ) lookuptablebasedon (last, run, level) returnindividualVLCor calculateEscape sequence else calculate Escape sequence The look-up table has ( 2 x 4 0 1~2) =960 entries; 102 of these contain individual VLCs and the remaining 858 contain a flag indicating that an Escape sequence is required. 2. Partitioned look-up tables indexedby (last, run, level). Basedon the valuesof last, runand level, choose a smaller look-up table (e.g. a table that only applies when last =O). This requires one or more comparisons before choosingtatbhle but allows the large table tboe split intoa number of smaller tables with fewer entries overall. The proceduarsefoilslows: i f ( l a s t , r u n , l e v e l ) E { s e t A} l o o k up t a b l e A returnVLCor calculateEscape sequence else i f (last, run, level) E {set B} l o o k up t a b l e B returnVLCor calculateEscapesequence .... 
else calculate Escape sequence CODING HUFFMAN 183 For example, earlier versions of the H.263 ‘test model’ software used this approach to reduce the number of entries in the partitioned look-up tables to 200 (i.e. 102 valid VLCs and 98 ‘empty’ entries). 3. Conditional expression for every valid combination of (last, run, level). For example: switch (last, run, level) case {A} : v l c =vA, length = 1 A c a s e { B } : v l c = v B , l e n g t h= lB .. . ( 100mor e c a s e s ) ... default : calculateEscapesequence Comparing the three methods, method 1 lends itself to compact code, is easy to modify (by changing the look-up table contents) and is likely btoe computationally efficient; however, it requires a large look-up table, most of which is redundant. Method 3, at the other extreme, requires the most code and is the most difficult to change (since each valid combination is ‘hand-coded’) but requires the least data storage. On some platforms it may be the slowest method. Method 2 offers a compromise between the other two methods. Hardware design Ahardwarearchitecture for variablelengthencodingperformssimilartasks to those described above and an example is shown in Figure 8.11 (based on a design proposed by Lei and Sun’). A ‘look-up’ unit finds the length and valueof the appropriate VLC and passes these to a ‘pack’ unit. The pack unit collects together a fixed number of bits (e.g. 8, 16 or 32 bits) and shifts these out to a stream buffer. Within the ‘pack’ unit, a counter records the number of bits in the output register. When this counter overflows, a data word is output (as in the example above) and the remaining upper bits in the output register are shifted down. The design of the look-up unit is critical to the size, efficiency and adaptability of the design. Options range from a ROM or RAM-based look-up table containing all valid codes plus‘dummy’entriesindicatingthatan Escape sequenceisrequired,toa ‘Hard-wired’ approach (similar to the ‘switch’ statement described above) in which each valid combina- tion ismapped to theappropriateVLC and length fields. This approachissometimes described as a‘programmablelogic array’ (PLA)look-up table. Another example of a hardware VLE is presented e l ~ e w h e r e . ~ h table VLC select calculate VLC Figure 8.11 Hardware VLE byte or word stream 184 ENTROPY CODING 8.3.6 VariableLengthDecoderDesign Software design The operation of a decoder for VLCs can be summarised as follows: scanthroughbits inaninput buffer i f a v a l i d V L C ( l e n g t h L )i s d e t e c t e d removeLbitsfrombuffer returncorrespondingdataunit ifinvalidVLCisdetected returnanerror flag Perhaps the most straightforward way of finding a valid VLC is to step through the relevant Huffman code tree. For example, a H.263 / MPEG-4 TCOEF code may bedecoded by stepping through the tree shown in Figure 8.7, starting from the left: i f ( f i r s t b i t= 1) i f ( s e c o n d b i t = 1) i f ( t h i r d b i t = 1) if (fourthbit=l) r e t u r n (0,0,2) else r e t u r n (0,2,1) else r e t u r n (O,l,l) else r e t u r n (0,0,1) else _ _ _ decode a l l VLCs s t a r t i n g w i t h 0 This approach requires a large nested if. . .else statement (or equivalent) that can deal with 104cases(102uniqueTCOEFVLCs,oneescapecode,plusanerrorcondition).This method leads to a large code size,may be slow to execute and is difficult to modify (because the Huffman tree is ‘hand-coded’ into the software); however, no extra look-up tables are required. An alternative is touse one or more look-up tables. 
The maximum lengthof TCOEF VLC (excluding the sign bit and escape sequences) is 13 bits. We can construct a look-up table whose index is a 13-bit number (the 13 Isbs of the input stream). Each entry of the table contains either a (last, run, level) triplet or a flag indicating Escape or Error; 213= 8192 entries are required, most of which will be duplicates of other entries. For example, every code beginning with ‘10. . .’ (starting with the Isb) decodes to the triplet (0, 0, 1). An initial test of the range of the 13-bit number maybe used to select one of a number of smaller look-up tables. For example, the H.263 reference model decoder described earlier breaks the table into three smaller tables containing around 300 entries (about 20o0f which are duplicate entries). HUFFMAN CODING 185 input1d 4 : : 1 input one or data unit Shift Ezrrbits Find VcoLde rbeigtsistrtear m Figure 8.12 Hardware VLD Thechoice of algorithm may depend on the capabilities of thesoftwareplatform. If memory is plentiful and array access relatively fast, a large look-up table may be the best approach for speed and flexibility. If memory is limited and/or array access is slow, better performance may be achieved with an ‘if. ..else’ approach or a partitioned look-up table. Whichever approach is chosen, VL decoding requires a significant amount of bit-level processing and for many processors this makes it a computationally expensive function. An interestingdevelopment in recentyearshasbeen the emergence of dedicatedhardware assistance for softwareVLdecoding. The PhilipsTriMedia and EquatodHitachiMAP platforms, for example, contain dedicated variable length decoder (VLD) co-processors that automatically decode VL data in an input buffer, relieving the main processor of the burden of variable length decoding. Hardware design Hardware designs for variable length decoding fall into two categories: (a) those thadt ecode n bits from the input stream every m cycles (e.g. decoding 1 or 2bits per cycle) and (b) those that decode n complete VL codewords every m cycles (e.g. decoding 1 codeword in one or two cycles). The basic architecture of a decoder is shown in Figure 8.12 (the dotted line ‘code length L is only required for category (b) decoders). Category(a), n bits per m cycles Thistype of decoderfollowsthrough the Huffman decoding tree. The simplestdesignprocesses one level of the tree every cycle: this is analogous to the large ‘if. . .else’ statement described above. The shift register shown in Figure 8.12 shifts 1 bit per cycle to the ‘Find VL code’ unit. This unit steps through the tree (basedon the value of eachinput bit) until a valid code (a‘leaf’)isfoundand can be implemented with a finite state machine (FSM) architecture.For example, Table 8.8 lists part of the FSM for theTCOEF tree shown in Figure 8.7. Each state corresponds to a node of the Huffmantreeand the nodes in thetablearelabelled(withcircles) in Figure 8.13 for convenience. There are 102 nodes (and hence 102 states in the FSM) and 103 output values. To decode l 1 10, for example, the decoder traces the following sequence: State 0 + State 2 + State 5 + State 6 + output (0, 2, 1) Hence the decoder processes1 bit per cycle (assuming that a state transition occursper clock cycle). 
Table 8.8 Part of the state table for TCOEF decoding (for each state, the table lists the next state or output value for input bits 0 and 1)

Figure 8.13 Part of TCOEF tree showing state labels

This type of decoder has the disadvantage that the processing rate depends on the (variable) rate of the coded stream. It is often more useful to be capable of processing one or more complete VLCs per clock cycle (for example, to guarantee a certain codeword throughput), and this leads to the second category of decoder design.

Category (b), n codewords per m cycles

This is analogous to the 'large look-up table' approach in a software decoder. K bits (stored in the input shift register) are examined per cycle, where K is the largest possible VLC size (13, excluding the sign bit, in the example of H.263/MPEG-4 TCOEF). The 'Find VL code' unit in Figure 8.12 checks all combinations of K bits and finds a matching valid code or Escape code, or flags an error. The length of the matching code (L bits) is fed back and the shift register shifts the input data by L bits (i.e. L bits are removed from the input buffer). Hence a complete L-bit codeword can be processed in one cycle. The shift register can be implemented using a barrel shifter (a shift-register circuit that shifts its contents by L places in one cycle). The 'Find VL code' unit may be implemented using logic (a PLA). The logic array should minimise effectively since most of the possible input combinations are 'don't cares': in the TCOEF example, all 13-bit input words '10XXXXXXXXXXX' map to the output (0, 0, 1). It is also possible to implement this unit as a ROM or RAM look-up table with 2¹³ entries. A decoder that decodes one codeword per cycle is described by Lei and Sun,² and Chang and Messerschmitt⁴ examine the principles of concurrent VLC decoding. Further examples of VL decoders can be found elsewhere.⁵,⁶

8.3.7 Dealing with Errors

An error during transmission may cause the decoder to lose synchronisation with the sequence of VLCs and this in turn can cause incorrect decoding of subsequent VLCs. These decoding errors may continue to occur (propagate) until a resynchronisation point occurs in the bit stream. The synchronisation markers described in Section 8.2.2 limit the propagation of errors at the decoder. Increasing the frequency of synchronisation markers in the bit stream can reduce the effect of an error on the decoded image; however, markers are 'redundant' overhead and so this also reduces compression efficiency. Transmission errors and their effect on coded video are discussed further in Chapter 11.

Error-resilient alternatives to modified Huffman codes have been proposed. For example, MPEG-4 (video) includes an option to use reversible variable length codes (RVLCs), a class of codewords that may be successfully decoded in either a forward or a backward direction from a resynchronisation point. When an error occurs, it is usually detectable by the decoder (since a serious decoder error is likely to violate the encoding syntax). The decoder can decode the current section of data in both directions, forward from the previous synchronisation point and backward from the next synchronisation point. Figure 8.14 shows an example. Region (a) is decoded and then an error is identified. The decoder 'skips' to the next resynchronisation point and decodes backwards from there to recover region (b). Without RVLCs, all of region (b) would be lost.

Figure 8.14 Decoding with RVLCs when an error is detected
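Returning to the software look-up table approach of Section 8.3.6, the 'large look-up table' decoder (the software counterpart of the category (b) hardware design) might be sketched in C as follows. The table-building step, entry layout and the peek_bits/skip_bits helpers are illustrative assumptions; a real H.263/MPEG-4 decoder would populate the table from the standard TCOEF code list.

#include <stdint.h>

#define INDEX_BITS 13                    /* longest TCOEF VLC (sign/escape excluded) */
#define TABLE_SIZE (1 << INDEX_BITS)     /* 8192 entries */

typedef struct {
    int8_t last, run, level;
    int8_t length;                       /* bits in the VLC; 0 = Escape, -1 = error */
} DecEntry;

static DecEntry dec_table[TABLE_SIZE];

/* Every index whose leading bits match 'code' holds the same entry, so a
   codeword of 'length' bits occupies 2^(13-length) duplicate slots. */
static void fill_entry(uint32_t code, int length, int last, int run, int level)
{
    uint32_t first = code << (INDEX_BITS - length);
    uint32_t slots = 1u << (INDEX_BITS - length);
    for (uint32_t i = 0; i < slots; i++) {
        dec_table[first + i].last   = (int8_t)last;
        dec_table[first + i].run    = (int8_t)run;
        dec_table[first + i].level  = (int8_t)level;
        dec_table[first + i].length = (int8_t)length;
    }
}

/* peek_bits/skip_bits are assumed to be provided by the bitstream reader. */
extern uint32_t peek_bits(int n);
extern void skip_bits(int n);

/* Decode one coefficient: look up the next 13 bits, then consume only the
   bits that belong to the matched codeword. */
static int decode_coeff(int *last, int *run, int *level)
{
    DecEntry e = dec_table[peek_bits(INDEX_BITS)];
    if (e.length <= 0)
        return e.length;                 /* 0: Escape sequence, -1: invalid code */
    skip_bits(e.length);
    *last = e.last;  *run = e.run;  *level = e.level;
    return 1;
}

For example, fill_entry(0x2, 2, 0, 0, 1) would make every 13-bit index beginning '10' return the triplet (0, 0, 1), which is exactly the duplication of entries described above.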
An interesting recent development is the use of 'soft-decision' decoding of VLCs, utilising information available from the communications receiver about the probability of error in each codeword to improve decoding performance in the presence of channel noise.⁷⁻⁹

8.4 ARITHMETIC CODING

Entropy coding schemes based on codewords that are an integral number of bits long (such as Huffman coding or UVLCs) cannot achieve optimal compression of every set of data. This is because the theoretical optimum number of bits to represent a data symbol is usually a fraction (rather than an integer). This optimum number of bits is the 'information content' log2(1/P), where P is the probability of occurrence of each data symbol. In Table 8.1, for example, the motion vector '0.5' should be represented with 3.31 bits for maximum compression. Huffman coding produces a 5-bit codeword for this motion vector and so the compressed bit stream is likely to be larger than the theoretical maximally compressed bit stream. Arithmetic coding provides a practical alternative to Huffman coding and can more closely approach the theoretical maximum compression.¹⁰ An arithmetic encoder converts a sequence of data symbols into a single fractional number. The longer the sequence of symbols, the greater the precision required to represent the fractional number.

Example

Table 8.9 lists five motion vector values (-2, -1, 0, 1, 2). The probability of occurrence of each vector is listed in the second column. Each vector is assigned a subrange within the range 0.0-1.0, depending on its probability of occurrence. In this example, (-2) has a probability of 0.1 and is given the subrange 0-0.1 (i.e. the first 10% of the total range 0-1.0). (-1) has a probability of 0.2 and is given the next 20% of the total range, i.e. the subrange 0.1-0.3. After assigning a subrange to each vector, the total range 0-1.0 has been 'divided' amongst the data symbols (the vectors) according to their probabilities. The subranges are illustrated in Figure 8.15.

Table 8.9 Subranges

Vector   Probability P   log2(1/P)   Subrange
  -2         0.1           3.32       0-0.1
  -1         0.2           2.32       0.1-0.3
   0         0.4           1.32       0.3-0.7
   1         0.2           2.32       0.7-0.9
   2         0.1           3.32       0.9-1.0

The encoding procedure is presented below, alongside a worked example for the sequence of vectors (0, -1, 0, 2).

Encoding procedure

1. Set the initial range: 0 → 1.0
2. For the first data symbol (0), find the corresponding subrange (low to high): 0.3 → 0.7
3. Set the new range (1) to this subrange: 0.3 → 0.7
4. For the next data symbol (-1), find the subrange L to H: 0.1 → 0.3 (this is the subrange within the interval 0-1)
5. Set the new range (2) to this subrange within the previous range: 0.34 → 0.42 (0.34 is 10% of the range; 0.42 is 30% of the range)
6. Find the next subrange (for symbol 0): 0.3 → 0.7
7. Set the new range (3) within the previous range: 0.364 → 0.396 (0.364 is 30% of the range; 0.396 is 70% of the range)
8. Find the next subrange (for symbol 2): 0.9 → 1.0
9. Set the new range (4) within the previous range: 0.3928 → 0.396 (0.3928 is 90% of the range; 0.396 is 100% of the range)

Each time a symbol is encoded, the range (L to H) becomes progressively smaller. At the end of the encoding process (four steps in this example), we are left with a final range (L to H). The entire sequence of data symbols can be fully represented by transmitting a fractional number that lies within this final range.
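A minimal floating-point sketch of this procedure, using the motion-vector subranges of Table 8.9, might look like the following. It is purely illustrative: a practical encoder would use fixed-point arithmetic and incremental output, as discussed in Section 8.4.1, and the function names here are hypothetical.

#include <stdio.h>

/* Cumulative subrange boundaries from Table 8.9 for vectors -2..+2:
   symbol index i occupies [cum[i], cum[i+1]). */
static const double cum[6] = { 0.0, 0.1, 0.3, 0.7, 0.9, 1.0 };

/* Encode a sequence of vectors (each in -2..+2) into a single fraction. */
static double arith_encode(const int *vectors, int n)
{
    double low = 0.0, high = 1.0;
    for (int i = 0; i < n; i++) {
        int s = vectors[i] + 2;               /* map vector to symbol index 0..4 */
        double range = high - low;
        high = low + range * cum[s + 1];      /* new range is the selected       */
        low  = low + range * cum[s];          /* subrange within the old range    */
    }
    return (low + high) / 2.0;                /* any value within the final range */
}

int main(void)
{
    int seq[4] = { 0, -1, 0, 2 };             /* the worked example (0, -1, 0, 2) */
    printf("%f\n", arith_encode(seq, 4));     /* prints a value near 0.394 */
    return 0;
}

Running this on the sequence (0, -1, 0, 2) reproduces the final range 0.3928-0.396 derived above and returns a number within it.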
In the example above, we could send any number in the range 0.3928-0.396: for example, 0.394. Figure 8.16 shows how the initial range (0-1) is progressively partitioned into smaller ranges as each data symbol is processed. After encoding the first symbol (vector 0), the new range is (0.3, 0.7). The next symbol (vector -1) selects the subrange (0.34, 0.42), which becomes the new range, and so on. The final symbol (vector +2) selects the subrange (0.3928, 0.396) and the number 0.394 (falling within this range) is transmitted. 0.394 can be represented as a fixed-point fractional number using 9 bits, i.e. our data sequence (0, -1, 0, 2) is compressed to a 9-bit quantity.

Decoding procedure

The sequence of subranges (and hence the sequence of data symbols) can be decoded from this number as follows.

1. Set the initial range: 0 → 1
2. Find the subrange in which the received number falls; this indicates the first data symbol: 0.3 → 0.7, decoded symbol (0)
3. Set the new range (1) to this subrange: 0.3 → 0.7
4. Find the subrange of the new range in which the received number falls; this indicates the second data symbol: 0.34 → 0.42, decoded symbol (-1)
5. Set the new range (2) to this subrange within the previous range: 0.34 → 0.42
6. Find the subrange in which the received number falls and decode the third data symbol: 0.364 → 0.396, decoded symbol (0)
7. Set the new range (3) to this subrange within the previous range: 0.364 → 0.396
8. Find the subrange in which the received number falls and decode the fourth data symbol: 0.3928 → 0.396, decoded symbol (2)

The principal advantage of arithmetic coding is that the transmitted number (0.394 in this case, which can be represented as a fixed-point number with sufficient accuracy using 9 bits) is not constrained to an integral number of bits for each transmitted data symbol. To achieve optimal compression, the sequence of data symbols should be represented with:

log2(1/P(0)) + log2(1/P(-1)) + log2(1/P(0)) + log2(1/P(2)) = 1.32 + 2.32 + 1.32 + 3.32 = 8.28 bits

In this example, arithmetic coding achieves 9 bits, which is close to optimum. A scheme using an integral number of bits for each data symbol (such as Huffman coding) would not come so close to the optimum number of bits and, in general, arithmetic coding can outperform Huffman coding.

Figure 8.16 Arithmetic coding example

8.4.1 Implementation Issues

A number of practical issues need to be taken into account when implementing arithmetic coding in software or hardware.

Probability distributions

As with Huffman coding, it is not always practical to calculate symbol probabilities prior to coding. In several video coding standards (e.g. H.263, MPEG-4, H.26L), arithmetic coding is provided as an optional alternative to Huffman coding and pre-calculated subranges are defined by the standard (based on 'typical' probability distributions). This has the advantage of avoiding the need to calculate and transmit probability distributions, but the disadvantage that compression will be suboptimal for a video sequence that does not exactly follow the standard probability distributions.

Termination

In our example, we stopped decoding after four steps. However, there is nothing contained in the transmitted number (0.394) to indicate the number of symbols that must be decoded: it could equally be decoded as three symbols or five. The decoder must determine when to stop decoding by some other means. In the arithmetic coding option specified in H.263, for example, the decoder can determine the number of symbols to decode according to the syntax of the coded data.
Decoding of transform coefficients in a block halts when an end-of-block code is detected. Fixed-length codes (such as picture start code) are included in the bit stream and these will 'force' the decoder to stop decoding (for example, if a transmission error has occurred).

Fixed-point arithmetic

Floating-point binary arithmetic is generally less efficient than fixed-point arithmetic and some processors do not support floating-point arithmetic at all. An efficient implementation with fixed-point arithmetic can be achieved by specifying the subranges as fixed-precision binary numbers. For example, in H.263, each subrange is specified as an unsigned 14-bit integer (i.e. a total range of 0-16383). The subranges for the differential quantisation parameter DQUANT are listed as an example:

H.263 DQUANT value   Subrange
  2                  0-4094
  1                  4095-8191
 -1                  8192-12286
 -2                  12287-16383

Incremental encoding

As more data symbols are encoded, the precision of the fractional number required to represent the sequence increases. It is possible for the number to exceed the precision of the processor after a relatively small number of data symbols, and a practical arithmetic encoder must take steps to ensure that this does not occur. This can be achieved by incrementally encoding bits of the fractional number as they are identified by the encoder. In our example above, after the third symbol is encoded the range is 0.364-0.396. We know that the final fractional number will begin with '0.3...' and so we can send the most significant part (e.g. 0.3, or its binary equivalent) without prejudicing the remaining calculations. At the same time, the limits of the range are left-shifted to extend the range. In this way, the encoder incrementally sends the most significant bits of the fractional number whilst continually readjusting the boundaries of the range to avoid arithmetic overflow.

Patent issues

A number of patents have been filed that cover aspects of arithmetic encoding (such as IBM's 'Q-coder' arithmetic coding algorithm¹¹). It is not entirely clear whether the arithmetic coding algorithms specified in the image and video coding standards are covered by patents. Some developers of commercial video coding systems have avoided the use of arithmetic coding because of concerns about potential patent infringements, despite its potential compression advantages.

8.5 SUMMARY

An entropy coder maps a sequence of data elements to a compressed bit stream, removing statistical redundancy in the process. In a block transform-based video CODEC, the main data elements are transform coefficients (run-level coded to efficiently represent sequences of zero coefficients), motion vectors (which may be differentially coded) and header information. Optimum compression requires the probability distributions of the data to be analysed prior to coding; for practical reasons, video CODECs use standard pre-calculated look-up tables for entropy coding. The two most popular entropy coding methods for video CODECs are 'modified' Huffman coding (in which each element is mapped to a separate VLC) and arithmetic coding (in which a series of elements is coded to form a fractional number). Huffman encoding may be carried out using a series of table look-up operations; a Huffman decoder identifies each VLC and this is possible because the codes are designed such that no code forms the prefix of any other. Arithmetic coding is carried out by generating and encoding a fractional number to represent a series of data elements.
This concludes the discussion of the main internal functions of a video CODEC (motion estimation and compensation, transform coding and entropy coding). The performance of a CODEC in a practical video communication system can often be dramatically improved by filtering the source video (‘pre-filtering’) and/or the decoded video frames (‘post-filtering’). REFERENCES 1. D. A. Huffman, ‘A method for the construction of minimum-redundancy codes’,Proceedings ofthe Institute of Electrical and Radio Engineers, 40(9), September 1952. 2. S. M. Lei and M-T. Sun, ‘An entropy coding system for digital HDTV applications‘, IEEE Trans. CSW, 1(1), March 1991. 3. Hao-ChiehChang,Liang-GeeChen,Yung-ChiChangandSheng-ChiehHuang, ‘A VLSIarchi- tecture design of VLC encoder for high data rate videohmage coding’, 1999 IEEE International Symposium on Circuits and Systems (ISCAS ’99). 4. S. F. ChangandD.Messerschmitt,‘Designinghigh-throughputVLCdecoder,PartI-concurrent VLSI architectures’, IEEE Trans. CSVT, 2(2),June 1992. 5 . J. Jeon, S. Park and H. Park, ‘A fast variable-length decoder using plane separation’, IEEE Trans. CSVT, 10(5),August 2000. 6. B-J. Shieh, Y-S. Lee and C-Y. Lee‘,A high throughput memory-based VLC decoder with codeword boundary prediction’, IEEE Trans. CSW, lo@),December 2000. 7. A. KopanskyandM.Bystrom,‘SequentialdecodingofMPEG-4codedbitstreamsforerror resilience’, Proc. Con$ on Information Sciences and Systems, Baltimore, 1999. 8. J. Wen and J. Villasensor, ‘Utilizing soft information in decoding of variable length codes’, Proc. IEEE Data Compression Conference, Utah, 1999. 9. S. KaiserandM.Bystrom,‘Softdecodingofvariable-lengthcodes’, Proc. IEEE International Communications Conference, NewOrleans, 2000. IO. I. Witten, R. Neal and J. Cleary, ‘Arithmetic codingfor data compression’, Communications ofthe ACM, 30(6), June1987. 11. J. Mitchell and W. Pennebaker, ‘Optimal hardware and software arithmetic coding procedures for the Q-coder’, IBM Journal of Research und Development, 32(6), November 1988. Video Codec Design Iain E. G. Richardson Copyright q 2002 John Wiley & Sons, Ltd ISBNs: 0-471-48553-5 (Hardback); 0-470-84783-2 (Electronic) Pre- and Post-processing 9.1 INTRODUCTION The visual quality at the output of a video coding system depends on theperformance of the ‘core’ coding algorithms described in Chap6te8rs, but can also be significantly influenbycepdreand post-processing (dealt with in this chapter) and bit-rate control (covered in Chapter IO). In a practical video application, the source image is often far from perfect. Imperfections such as camera noise (introduced during image capture), poor lighting and camera shake may all affect the original images. Figure 9.1 shows a typical example of an image captured by a low-cost ‘webcam’. Imperfections such as cameranoise can produce variation and highfrequency activity in otherwise static parts of the video scene. These variations are likely to produce an increased numberof transform coefficientsand can significantly increase the amount of transmitted data (hence reducing compression performance). The aim of pre$filtering is to reduce these input impairments andimprove compression efficiencywhilst retaining the important features of the original image. Quantisation leads to discontinuities and distortions in the transform coefficients that in turn produce artefacts (distortion patterns) in the decoded video images. In general, higher compression ratios require ‘coarser’ quantisation and introduce more obvious distortion. 
These artefacts are closely related to the block-based structure of transform coding and it is possible to detect andcompensateusing post-filtering. A significant improvemenitn subjective quality can be achieved by using filters designed to remove coding artefacts, in particular blocking and ringing. The aim of post-filtering is to reduce coding artefacts whilst retaining visually important image features. Figure 9.2 shows the locations of pre- and post-filters in a video CODEC. In this chapter we investigate the causes of input variations and decoder artefacts and we examine a number of methods for improving subjective quality and compression efficiency through the use of filters. 9.2 PRE-FILTERING DCT-based compression algorithms can perform well for smooth, noise-free regions of images. A regionwithflat texture or a gradual variationin texture (like the face area of the image in Figure 9.3) produces a very small number of significant DCT coefficients and hence is compressed efficiently. However, to generate a ‘clean’ video image like Figure 9.3 requires good lighting, an expensive camera and a high-quality video capture system. For most applications, these requirements are impractical. A typical desktop video-conferencing scenario might involve a low-cost camera 011 top of the user’s monitor, poor lighting and a 196 Pm- AND POST-PROCESSING Figure 9.1 Typical image from a low-cost ‘webcam’ ‘busy’ background, and all of these factors can be detrimental to the quality of the final image. A typical source image for this type of application is shown in Figure 9.1. Further difficulties can becaused for motion video compression: for examplea, hand-held camera or a motorised surveillance camera are susceptible to camera shake which can significantly reduce the efficiency of motion estimation and compensation. 9.2.1 Camera Noise Low-level noise with a uniform distribution is added to Figure 9.3to produce Figure 9.4.The image now contains high-frequency variation which is not obvious but which will affect compression performance. After applyinga DCT, this variation produces a number of highfrequency ‘AC’ coefficients, some of which remain after quantisation. This means that more bits will remain aftercompressingFigure9.4 than aftercompressingthe ‘clean’ image (Figure 9.3). After JPEG compression (with the same quantiserscale),Figure9.3compresses to 3211 bytes and Figure 9.4 compresses to 4063 bytes. Thenoise added to Figure 9.4 has decreased compressionefficiency by over 25% in this example. Thisis typical of the effect produced by camera noise (i.e. noise introduced by the camera and/or analogue to Transmit Camera Figure9.2 Re- and post-filters in a video CODEC PE-FILTERING 197 I Figure 9.3 Noise-freesourceimage digital conversion). All cameraskapture systems introduce noiseb,ut it ismore of a problem for lower-cost cameras (such as ‘webcams’). Re-filtering the imagedata before encodingcan improve compression efficiency. The aim of a pre-filter is to increase compression performance without adversely affecting image quality, and a simple filter example is illustrated by Figure 9.5. The ‘noisy’ image (Figure 9.4) is filtered with a Gaussian 2-D spatial filter to produce Figure 9.5. This simplelow-pass filter successfully reduces thenoise. After P E G compression, Figure9.5 requires 3559 bytes (i.e. the compression efficiency is only 10% worse than the noise-free image). 
However, this compression gain is atthe expense of a loss of image quality: thefilter has ‘blurred’ some of the sharp lines in the image because it doesnot discriminate between high-frequency noise and ‘genuine’ high-frequency components of the image. With a more sophisticated pre-filter it is possible to minimise the noise whilst retaining important image features. I Figure 9.4 Image with noise 198 P m -AND POST-PROCESSING Figure 9.5 ‘Noisy’ image after Gaussian filtering 9.2.2 CameraMovement Unwanted camera movements (camera ‘shake’ or‘jitter’)areanothercause of poor compression efficiency. Block-based motion estimation performs best when the camera is fixed in one position or when it undergoes smooth linear movement (pan or tilt).In the case of a hand-held camera, or a motorised padtilt operation(e.g.as a surveillancecamera ‘sweeps’ over a scene), the image tends to experience random ‘jitter’ between successive frames. If the motion search algorithmdoes not detect this jitter correctly, the resultis a large residual frame after motion compensation. This in turn leads to a larger number of bits in the coded bit stream and hence poorer compression efficiency. Example Two versions of a short 10-frame video sequence (the first frame is shown in Figure 9.6) areencoded with MPEG-4 (simple profile, with half-pixelmotionestimation and POST-FILTERING 199 compensation). Version l (the original sequence) has a fixed camera position. Version 2 is identical except tha2tof the 10frames are shifted horizontalolyr vertically by utpo 2 pixels (to simulate camera shake). The sequences are coded with H.263, using a fixed quantiser step size (10) in each case. For Version l (the original), the encoded sequence is 18 703 bits. For Version 2 (with ‘shaking’ of two frames), the encoded sequence is 29 080 bits: the compression efficiency drops by over 50% due to a small displacement within 2 of the 10 frames. Thisexample shows that camera shake canbe very detrimental to video compression performance (despite the fact that the encoder attempts to compensate for the motion). The compression efficiency may be increased (and the subjective appearance of the video sequence improved) with automatic camera stabilisation. Mechanical stabilisation is used in some hand-held cameras, but this adds weight and bulk to the system. ‘Electronic’ image stabilisation can be achieved without extra hardware (at the expense of extra processing). For example, onemethod’ attempts to stabilise the video frames prior to encoding. In this approach, a matching algorithm is used to detect global motion (i.e. common motion of all background areas, usually due to camera movement). The matching algorithm examines areas near the boundary of each image (not the centre of the image-since the centre usually contains foreground objects). If global motion is detected, the image is shifted to compensate for small, short-term movements due to camera shake. 9.3 POST-FILTERING 9.3.1Image Distortion Lossy image or video compression algorithms (e.g.JPEG, MPEG and H . 2 6 ~ )introduce distortion into video information. Higher compression ratios produce more distortion in the decoded video frames. The nature and appearance of the distortion depend on the type of compression algorithm. In a DCT-based CODEC, coding distortion is due to quantisation of DCTcoefficients. 
This hastwomain effects ontheDCTcoefficients: those with smaller magnitudes (particularly higher-frequency AC coefficients) are set to zero, and the remaining coefficients (including the low-frequency andDCcoefficients) lose precision due to quantisation. These effects lead to characteristic types of distortion in the decoded image. Figure 9.7 shows the result of encoding Figure 9.3 with baseline JPEG, at a compression ratio of 1 8 . 5 ~(i.e. the compressed image is 18.5 times smaller than the original). Figure 9.8 highlights three types of distortion in a close-up of this image, typical of any image or video sequence compressed using a DCT-based algorithm. Blocking Often, the most obvious distortion or artefact is the appearance of regular square blocks superimposed on the image. These blocking artefacts are a characteristic of block-based transform CODECs, and their edges are aligned with the 8 x 8 regions processed via the DCT. There are two causes of blocking artefacts: over-quantisation of the DC coefficient and suppression or over-quantisation oflow frequency AC coefficients. TheDC coefficient 200 PRE- AND POST-PROCESSING I Figure 9.7 Image compressed 18x (JPEG) corresponds to the average (mean) value of each 8 x 8 block. In areas of smooth shading (such as the face areian Figure 9.7),over-quantisation of the DC coefficient means that there is a largechangeinlevel between neighbouring blocks. When two blocks with similar shades are quantized to different levels, thereconstructed blocks can have a larger ‘jump’ in level and hence a visible change of shade. This is most obvious at the block boundary, appearing as a tiling effect on smooth areas of the image. A second cause of blocking is over-quantisation or elimination of significant AC coefficients. Where there should be a smooth transition between blocks,a ‘coarse’ reconstruction of low-frequency basis patterns (see Chapter 7) leads to discontinuitiesbetween block edges. Figure 9.9 illustrates thestwe o blocking effects in one dimension. Image sample amplitudes fora flat region are shown on the left and for a smoothly varying region on the right. Ringing High quantisation can have a low-pass filtering effect, since higher-frequency AC coeffi- cients tend to be removed during quantisation. Where there are strong edges in the original image, this low-pass effect can cause ‘ringing’ or ‘ripples’ near the edges. This is analogous to the effectof applying a low-pass filter to a signal with a sharp change in amplitude: low- frequency ringingcomponentsappearnearthechange position. This effect appears in Figure 9.8 as ripples near the edge of the hat. Basis pattern breakthrough Coarsequantisation ofAC coefficients can eliminate many of the original coefficients, leaving a few ‘strong’ AC coefficients in a block. After the inverse DCT, the basis pattern corresponding to a strong AC coefficient can appearin the reconstructed imageblock (‘basis pattern breakthrough’). An exampleis highlighted inFigure9.8: the block in question appears to be overlaid with one of the DCT basis patterns. Basis pattern breakthrough also POST-FILTERING 201 Figure 9.8 Close-up showing typicalartefacts contributes to the blocking effect (in this case, there is a sharp change between the strong basis pattern and the neighbouring blocks). These three distortion effectsdegrade the appearance of decoded images or videoframes. Blocking is particularly obvious because the large 8 x 8 patterns are clearly visiblien highly compressed frames. 
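Blocking can be quantified crudely by comparing pixel differences that straddle an 8 x 8 boundary with differences just inside the block. The following sketch is not taken from any standard or from the filters discussed below; it simply illustrates the idea, assuming an 8-bit luminance array in raster order.

#include <stdlib.h>

/* Crude 'blockiness' measure: the mean absolute luminance step across each
   vertical 8x8 block boundary divided by the mean step one pixel inside the
   block. A ratio well above 1 suggests visible blocking. Illustrative only. */
static double blockiness(const unsigned char *lum, int width, int height)
{
    double at_edge = 0.0, inside = 0.0;
    for (int y = 0; y < height; y++) {
        for (int x = 8; x + 1 < width; x += 8) {
            at_edge += abs(lum[y * width + x] - lum[y * width + x - 1]);
            inside  += abs(lum[y * width + x + 1] - lum[y * width + x]);
        }
    }
    return (inside > 0.0) ? at_edge / inside : 0.0;
}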
The artefacts can also affect the performance of motion-compensated video coding. A video encoder that uses motion-compensated prediction forms a reconstructed (decoded) version of the current frame aas prediction reference for furtheerncoded frames: this ensures that the encoderand decoder use identical reference frameasnd prevents 'drift' at the decoder. However, if a high quantiser scale is used, the reference frame at the encoder will contain distortion artefacts that were not present in the original frame. When 202 Original levels: PRE- AND POST-PROCESSING (a) DC coefficqiueannt tisation (b) AC coefficqiueannt tisation Amplitude Amplitude I Block A Block B Block A Block B Amplitude Amplitude L Reconstructed levels: I Block A Block B Block A Block B Figure 9.9 Blocking effects (shown in one dimension) the reference frame (containing distortion) is subtracted from the next input frame (without distortion), these artefacts willtend to increase the energy in the motion-compensated residual frame, leading to a reduction in compression efficiency. This effect can produce a significant residual component even when there isno change between successive frames. Figure 9.10 illustrates this effect. The distorted reference frame (a) is subtracted from the current frame (b). There is nochange in the image content but the difference frame (c) clearly contains residual energy (the ‘speckled’ effect). This residual energy will be encoded and transmitted, even though there is no real change in the image. It is possible to design post-filters to reduce the effect of these predictable artefacts. The goal is to reduce the ‘strength’ of a particular type of artefact without adversely affecting the important features oftheimage (such as edges). Filters can be classified according to the type of artefact they are addressing (usually blocking or ringing), their computational complexity and whether they are applied inside or outside the coding ‘loop’. A filter applied after decoding (outside the loop) can be made independent of the CODEC: however, good performance can be achieved by making use of parameters from the video decoder. A filter applied to the reconstructed frame within the encoder (inside the loop) has the advantage of improving compression efficiency (as described above) but must also be applied within the decoder. The use of in-loop filters is limited to non-standard CODECs except in the case of loop filtersdefinedin the coding standards. Post-filters can be categorised as follows, depending on their position in the coding chain. ( a )In-loop filters The filter is applied to the reconstructed frame bothin the encoder andinthedecoder. Applying the filter within the encoder loop can improvethe quality of the reconstructed (c) frame difference (c) 204 Current frame P M -AND POST-PROCESSING Subtract - - Motion estimation Image encoder Encoded + frame F vectors In-loop filter ]e 4 - Previous frame@) image decoder Encoded frame Image decoder Add 4 Decoded frame l/----- filter frame(s) Figure 9.11 In-loop filter: (a) encoder; @) decoder POST-FILTERING 205 Decoding parameters decforadmere Video Post-filter Filtered Figure 9.12 Decoder-dependent filter reference frame,which in turn improves the accuracy of motion-compensated prediction for the next encoded frame sincethe quality of the prediction reference isimproved. Figure 9.11 shows the position of the filter within the encoder and decoder, immediately prior to motion estimation or reconstruction. 
Placing the filter within the coding loop has two advantages: first, the decoderreferenceframeisidenticalto the encoderreferenceframe (avoiding prediction ‘drift’ between encoder and decoder) and second, the quality of the decoded frame is improved. The disadvantage of this approach is that the encoder and decoder must use an identical filter and this limits interoperabilitybetween CODECs (unless standardised filters are used, such as H.263 Annex J). (b)Decoder-dependent filters In the second category, the filter is applied after the decoder and makes use of decoded parameters. A good example of a useful decoder parameter is the quantiser step sizet:his can be used to predict the expected level of distortion in the current block, e.mg.ore distortion is likely to occur when the quantiser step size is high than when it is low. This enables the decoder to adjust the ‘strength’ of the filter according to the expected distortion.A ‘strong’ filter may be applied when the quantiser step size is high, reducing the relevant type of distortion. A ‘weak’ filter is applied when the step size is low, preserving detail in blocks with lower distortion. Good performance can be achieved with this type of filter; however, the filter must be incorporated in the decoder or closely linked to decoding parameters. The location of the decoder-dependent filter is shown in Figure 9.12. (c) Decoder-independent filters In order to minimise dependence on the decoder, the filter may be applied after decoding without any ‘knowledge’ of decoder parameters, as illustratedin Figure 9.13. This approach gives the maximum flexibility (for example, the decoder and the filter may be treated as separate ‘black boxes’ by the system designer).However, filter performance is generallynot decfordaemr e Video Post-filter Figure 9.13 Decoder-independent filter Filtered 206 Pm-AND POST-PROCESSING Horizontal block boundary Filtered pixels Vertical block boundary Figure 9.14 H.263+ Annex J deblocking filter as good as decoder-dependenftilters, since the filter has no information about the coding of each block. 9.3.2 De-blockingFilters Blocking artefactsare usually the most obvious and therefore the most important to minimise through filtering. In-loop jilters It is possible to implement a non-standard in-loop de-blocking filter, however, the use of such a filter is limited to proprietary systems. Annex J of the H.263+ standard defines an optional de-blocking filter that operateswithin the encoder and decoder ‘loops’. Al-D filter is applied across blockboundaries as shown in Figure 9.14. Four pixel positions at a timeare smoothed across the blockboundaries, first across the horizontal boundariesand then across the vertical boundaries. The ‘strength’ of the filter (i.e. the amount of smoothing applied to the pixels) is chosen depending on the quantiser value (as described above). The filter is effectively disabled if there is a strong discontinuity between the values of A and B or between the values of C and D: this helps to prevent filtering of ‘genuine’ strong horizontal or vertical edges in the original picture. In-ldoeo-pblocking filters have been compared2and the authors conclude that the best performance is given by POCS algorithms (described briefly below). Decoder-dependent and decoder-independent jilters If the filter isimplementedonly in thedecoder (not in theencoder), the designer has complete flexibility and a wide range of filtering methods have been proposed. 
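As an illustration of boundary smoothing of the kind just described, a much-simplified 1-D filter might be sketched as follows. The pixels A, B, C, D straddle the block boundary as in Figure 9.14 and 'strength' would be derived from the quantiser step size; the exact H.263 Annex J arithmetic is different, so this is only a sketch of the general approach.

#include <stdlib.h>

/* Illustrative 1-D de-blocking across one block boundary: A and B lie on one
   side, C and D on the other. Simplified rule, not the Annex J filter. */
static void smooth_boundary(unsigned char *A, unsigned char *B,
                            unsigned char *C, unsigned char *D, int strength)
{
    int step = *C - *B;                       /* jump across the boundary */

    /* Do not filter genuine image edges: if A-B or C-D already contains a
       strong discontinuity, or the jump itself is large, leave it alone. */
    if (abs(*B - *A) > strength || abs(*D - *C) > strength ||
        abs(step) > 2 * strength)
        return;

    *B = (unsigned char)(*B + step / 4);      /* move the two pixels nearest   */
    *C = (unsigned char)(*C - step / 4);      /* the boundary towards each other */
}

Increasing 'strength' (e.g. in proportion to the quantiser step size) allows larger discontinuities to be smoothed when heavier distortion is expected, which is the adaptation described above.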
Annex F of MPEG-4 describes an optional de-blocking filter that operates across each horizontal and vertical block boundary as above. The 'smoothness' of the image in the region of the boundary is estimated based on the values of 10 pixels (A to J in Figure 9.15). If the image is not deemed to be 'smooth', a default 1-D filter is applied to the two pixels on either side of the boundary (E and F). If the image is 'smooth' in this region, then a more sophisticated filter is required to reduce blocking whilst preserving the smooth texture: in this case, 8 pixels (B to I) are filtered. The filter parameters depend on the quantiser step size. By adapting the filter in this way, a more powerful (but more computationally complex) filter is applied where it is needed in smooth regions whilst a less complex filter is applied elsewhere.

Figure 9.15 MPEG-4 deblocking filter

Many alternative approaches can be found in the literature. These range from highly complex algorithms such as Projection Onto Convex Sets (POCS), in which many candidate images are examined to find a close approximation to the decoded image that does not contain blocking artefacts, to algorithms such as the MPEG-4 Annex F filter that are significantly less complex. The best image quality is usually achieved at the expense of computation: for example, POCS algorithms are iterative and may be at least 20 times more complex than the decoding algorithm itself. Decoder-dependent algorithms can often outperform decoder-independent algorithms because the extra information about the coding parameters makes it easier to distinguish 'true' image features from blocking distortions.

9.3.3 De-ringing Filters

After blocking, ringing is often the next most obvious type of coding artefact. De-ringing filters receive somewhat less attention than de-blocking filters. MPEG-4 Annex F describes an optional post-decoder de-ringing filter. In this algorithm, a threshold thr is set for each reconstructed block based on the mean pixel value in the block. The pixel values within the block are compared with the threshold, and 3 x 3 regions of pixels that are all either above or below the threshold are filtered using a 2-D spatial filter. This has the effect of smoothing homogeneous regions of pixels on either side of strong image edges whilst preserving the edges themselves: it is these regions that are likely to be affected by ringing. Figure 9.16 shows an example of regions of pixels that may be filtered in this way in a block containing a strong edge. In this example, pixels adjacent to the edge will be ignored by the filter (hence preserving the edge detail). Pixels in relatively 'flat' regions on either side of the edge (which are likely to contain ringing) will be filtered.

Figure 9.16 Application of MPEG-4 de-ringing filter

9.3.4 Error Concealment Filters

A final category of decoder filter is that of error concealment filters. When a decoder detects that a transmission error has occurred, it is possible to estimate the area of the frame that is likely to be corrupted by the error. Once the area is known, a spatial or temporal filter may be applied to attempt to conceal the error. Basic error concealment filters operate by interpolating from neighbouring error-free regions (spatially and/or temporally) to 'cover' the damaged area.
More advanced methods (such as POCS filtering, mentioned above) attempt tomaintainimagefeaturesacross the damaged region.Errorconcealmentis discussed fuaher in Chapter 11. 9.4 SUMMARY Re- and post-filtering can be valuable tools fora video CODEC designer. The goal of a prefilter is to ‘clean up’ the source image and compensate for imperfections such as camera noise and camera shakewhilst retaining visually important image features.A well-designed pre-filter can significantly improve compression efficiency by reducing the number of bits spent on coding noise. Post-filters are designed to compensate for characteristic artefacts introduced by block-based transform coding such as blocking and ringing effects. A postfilter can greatly improve subjective visual quality, reducing obviousdistortions whilst retaining important featuresin the image. There are threemain classes of this type of filter: loop filters (designed to improvemotion compensation performance aswell as image quality REFERENCES 209 and present in both encoder and decoder), decoder-dependent post-filters (which musaekeof decoded parameters to improve filtering performance) and decoder-independent post-filters (which are independent of the coding algorithm but generally suffer from poorer perfor- mance than the other types). As with many other aspects of video CODEC design, there is usually a trade-off between filter complexity and performance (in termosf bit rate andimage quality).Therelationshipbetweencomputationalcomplexity,codedbitrateandimage quality is discussed in the next chapter. REFERENCES 1. R. Kutka, ‘Detection of image background movement as compensation for camera shaking with mobile platforms’, Proc. Picture Coding Symposium PCSOl, Seoul, April 2001. 2. M. Yuen and H. R. Wu,‘Performance comparisonof loop filtering in generic MC/DPCM/DCT video coding’, Proc. SPIE Digital Video Compression, San Jose, 1996. 3. Y. Yang, N. Galatsanos and A. Katsaggelos, ‘Projection-based spatially adaptive reconstruction of block transform compressed images’, IEEE Trans. Image Processing, 4, July 1995. 4. Y. Yang and N. Galatsanos, ‘Removal of compression artifacts using projections onto convex sets and line modeling’, IEEE Trans. Image Proc-essing, 6, October 1997. 5. B. Jeon, J. Jeong and J. M. Jo, ‘Blocking artifacts reduction in image coding based on minimum block boundary discontinuity’, Proc. SPIE VCIP9.5,Taipei, 1995. 6. A. Nostratina, ‘Embedded post-processing for enhancement of compressed images’, Proc. Data Compression Conference DCC-99, Utah, 1999. 7. J. Chou, M. Crouse and K. Ramchandran, ‘A simple algorithm for removing blocking artifacts in block transform coded images’, IEEE Signal Processing Letters, 5, February 1998. 8. S. Hong, Y. Chanand W. Siu, ‘A practicalreal-timepost-processingtechniqueforblockeffect elimination’, Proc. IEEE ICIP96, Lausanne, September 1996. 9. S. Marsi, R. Castagno and G. Rampon‘Ai, simple algorithm for the reductioonf blocking artifacts in images and its implementation’, IEEE Trans. on Consumer Electronics, 44(3), August 1998. 10. T. Meier, K. Nganand G. Crebbin, ‘Reductionof coding artifactsat low bit rates’, Proc. SPIE Visual Communications and Image Processing, San Jose, January 1998. 11. Z. Xiong, M. Orchard, Y.-Q. Zhang, ‘A deblocking algorithm for JPEG compressed images using overcomplete wavelet representations’, IEEE Trans. CSVT, 7,April 1997. l0 Video Codec Design Iain E. G. 
Richardson Copyright q 2002 John Wiley & Sons, Ltd ISBNs: 0-471-48553-5 (Hardback); 0-470-84783-2 (Electronic) Rate, Distortion and Complexity 10.1 INTRODUCTION The choice of video coding algorithm and encoding parameters affect the coded bit rate and the quality of the decoded video sequence (as well as the computational complexity of the videoCODEC). The preciserelationship between codingparameters, bit rate and visual quality varies depending on the characteristics of the video sequence (e.g. ‘noisy’ input vs. ‘clean’ input; high detail vs. low detail; complex motion vs. simple motion). At the same time, practicallimitsdetermined by the processor and the transmission environment put constraints on the bit rate and image quality that may be achieved. It is important to control thevideoencoding process in order to maximisecompressionperformance (i.e. high compression and/or good image quality) whilst remaining within the practical constraints of transmission and processing. Rate-distortion optimisation attempts to maximise image quality subject to transmission bit rate constraints. Thebest optimisation performance comesat the expense of impractically high computation. Practical algorithmsfor the control of bit rate can be judged accordingto how closely they approach optimum performance. Many alternative rate control algorithms exist; sophisticated algorithms can achieve excellent rate-distortion performance, usually at a costof increased computational complexity. The careful selection and implementation of a rate control algorithm can make a big difference to video CODEC performance. Recent trends in software-only CODECs and video coding inpower-limited environments (e.g. mobile computing) mean that computational complexity is an important factor in video CODECperformance.In many application scenarios,videoquality is constrained by availablecomputationalresources as well as or instead of available bit rate. Recent developments in variable-complexity algorithms (VCAs) forvideocodingenablethe developer to manage computational complexity and trade processing resources for image quality. This leads to situations in which rate, complexity and distortion are interdependent. New algorithms are required to jointly control bit rate and computational complexity whilst minimising distortion. In this chapter we examine the factorsthat influence rate-distortion performance in a video CODECand discuss how these factors can be exploitedto efficiently control coded bit rate. We describe a number of popular algorithms for rate control. We discuss the relationship between computation, rate and distortion and show how new VCAs are beginning to influence the design of video CODECs. 212 COMDPLIASENTXODRITRAYTTIEO,N 10.2 BIT RATE AND DISTORTION 10.2.1 The Importance of Rate Control A practical video CODEC operateswithin an environment that places certain constraints on its operation. Oneof the most important constraints isthe rate at which the video encoder is ‘allowed’ to produce encoded data. A source of video data usually supplies video data at a constant bit rate (a constant number of bits per second) and a video encoder processes this high, constant-rate source to producea compressed stream of bits at a reduced bit rate. The amount of compression (and hencethe compressed bit rate) depends on a number of factors. These may include: 1. The encoding algorithm (intra-frame or inter-frame, forward or bidirectional prediction, integer or sub-pixel motion compensation, DCT or wavelet, etc.). 2. 
The type of video material (material containing lots of spatial detail and/or rapid movement generally produces more bits than material containing little detail and/or motion).

3. Encoding parameters and decisions (quantiser step size, picture or macroblock mode selection, motion vector search area, the number of intra-pictures, etc.).

Some examples of bit rate 'profiles' are given below. Figure 10.1 plots the number of bits in each frame for a video sequence encoded using Motion JPEG. Each frame is coded independently ('intra-coded') and the bit rate for each frame does not change significantly. Small variations in bit rate are due to changes in the spatial content of the frames in the 10-frame sequence.

Figure 10.1 Bit-rate profile: Motion JPEG

Figure 10.2 shows the bit rate variation for the same sequence coded with H.263. The first frame is an intra-frame and following frames are P-pictures. The compression efficiency for a P-picture is approximately 10 times higher than for an I-picture in this example and there is a small variation between P-pictures due to changes in detail and in movement.

Figure 10.2 Bit-rate profile: H.263 (baseline)

Coding the same sequence using MPEG-2 gives the bit rate profile shown in Figure 10.3. In this example, the initial I-picture is followed by the following sequence of picture types: B-B-P-B-B-P-B-B-I. There is clearly a large variation between the three picture types, with B-pictures giving the best compression performance. There is also a smaller variation between coded pictures of the same type (I, P or B) due to changes in detail and motion as before.

Figure 10.3 Bit-rate profile: MPEG-2

These examples show that the choice of algorithm and the content of the video sequence affect the bit rate (and also the visual quality) of the coded sequence. At the same time, the operating environment places important constraints on bit rate. These may include:

1. The mean bit rate that may be transmitted or stored.
2. The maximum bit rate that may be transmitted or stored.
3. The maximum variation in bit rate.
4. The requirement to avoid underflow or overflow of storage buffers within the system.
5. A requirement to minimise latency (delay).

Examples:

DVD-video The mean bit rate is determined by the duration of the video material. For example, if a 3-hour movie is to be stored on a single 4.7 Gbyte DVD, then the mean bit rate (for the whole movie) must not exceed around 3.5 Mbps. The maximum bit rate is determined by the maximum transfer rate from the DVD and the throughput of the video decoder. Bit-rate variation (subject to these constraints) and latency are not such important issues.

Videoconferencing over ISDN The ISDN channel operates at a constant bit rate (e.g. 128 kbps). The encoded bit rate must match this channel rate exactly, i.e. no variation is allowed. The output of the video encoder is constant bit rate (CBR) coded video.

Videoconferencing over a packet-switched network The situation here is more complicated. The available mean and maximum bit rate may vary, depending on the network routeing and on the volume of other traffic. In some situations, latency and bit rate may be linked, i.e. a higher data rate may cause increased congestion and delay in the network. The video encoder can generate CBR or variable bit rate (VBR) coded video, but the mean and peak data rate may depend on the capacity of the network connection.
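As a rough check of the DVD figure quoted above, the whole-movie mean rate follows directly from the disc capacity and the playing time (ignoring any capacity reserved for audio and other data):

4.7 Gbytes ≈ 4.7 x 8 x 10^9 = 3.76 x 10^10 bits
3 hours = 10 800 seconds
3.76 x 10^10 bits / 10 800 s ≈ 3.5 x 10^6 bits per second, i.e. about 3.5 Mbps.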
Each of these applicationexampleshas different requirements in terms of the rate of encodedvideo data. Rate control, the process of matchingtheencoderoutputtorate constraints, isa necessary component of the majority of practical video coding applications. Theratecontrol‘problem’is defined below in Section 10.2.3. Therearemany different approaches to solving thipsroblem andin a given situation, the choiceof rate control method can significantly influence videoqualityatthe decoder. Poor rate control may cause a number of problems such as low visual quality, fluctuations in visual quality and dropped frames leading to ‘jerky’ video. In the next section we will examine the relationship between coding parameters, bit rate and visual quality. DISTORTIAONNDBIT RATE 215 10.2.2 Rate-DistortionPerformance A lossless compression encoder produces a reduction in data rate with no loss of fidelity of the original data. A lossy encoder, on the other hand, reduces data rateat the expense of a loss of quality.As discussed previously,significantlyhigher compression of image avndideo data can be achieved using lossy methods than with lossless methodsT. he output of a lossy video CODEC is a sequence of images that are of a lower quality than the original images. The rate-distortion petformance of a video CODEC provides a measure of the image quality produced at a range of coded bit rates. For a given compressed bit rate, measure the distortion of the decoded sequence (relative to the original sequence). Repeat this for a range of compressed bit rates to obtain the rate-distortion curve such as the example shown in Figure 10.4.Each point on this graphis generated by encoding a video sequence using an MPEG-4encoder with a differentquantiserstepsize Q. Smallervalues of Q produce a higher encodedbit rate and lower distortion; largervalues of Q produce lowerbit rates at the expense of higher distortion. In this figure, ‘image distortion’ is measured by peak signal to noise ratio (PSNR), describedin Chapter 2. PSNR is a logarithmic measure, anad high value of PSNR indicates low distortion.Thevideosequenceis a relatively static,‘head-andshoulders’ sequence (‘Claire’). The shape of the rate-distortion curve is very typical: better imagequality(asmeasured by PSNR)occursathigher bit rates,and the qualitydrops sharply once the bit rate is below a certain threshold. The rate-distortion performance of a video CODEC may be affected bymany factors, including the following. Video material Under identical encoding conditions, the rate-distortion performancemay vary considerably depending on the video material that is encoded. Figure 10.5 compares the rate-distortion ‘Cialre encoded using WEG-4 (slmple profile) (30frames per second) 42 l ~~ ~.. ~ .... ~~ ... 28 I o IO 20 30 40 50 M) 70 80 90 Figure 10.4 Rate-distortion curve Rate (kbps) example 216 RATE, DISTORTION AND COMPLEXITY ‘Clalre’ and ‘Foreman’ encoded using WEG-4 (simple prohle) U 301 :p , 28 .___ .,-. I - 0 50 100 150 200 250 300 350 400 450 Figure 10.5 Rate-distortion Rate (kbps) comparison of two sequences performance of two sequences, ‘Claire’ and ‘Foreman’, under identical encoding conditions (MPEG-4, fixed quantiser step sizevarying from 4 to 24). The ‘Foreman’ sequence contains a lot of movement and detail and is therefore more ‘difficult’ to compress than ‘Claire’. At the same value of quantiser, ‘Foreman’ tends to have a much higher encoded bit rate and a higher distortion (lower PSNR) than ‘Claire’. 
The shape of the rate-distortion curve is similar but the rate and distortion values are very different.

Encoding parameters

In a DCT-based CODEC, a number of encoding parameters (in addition to quantiser step size) affect the encoded bit rate. An efficient motion estimation algorithm produces a small residual frame after motion compensation and hence a low coded bit rate; intra-coded macroblocks usually require more bits than inter-coded macroblocks; sub-pixel motion compensation produces a lower bit rate than integer-pixel compensation; and so on. Less obvious effects include, for example, the intervals at which the quantiser step size is varied during encoding. Each time the quantiser step size changes, the new value (or the change) must be signalled to the decoder and this takes more bits (and hence increases the coded bit rate).

Encoding algorithms

Figures 10.1-10.3 illustrate how the coded bit rate changes depending on the compression algorithm. In each of these figures, the decoded image quality is roughly the same but there is a big difference in compressed bit rate.

Rate control algorithms

A rate control algorithm chooses encoding parameters (such as those listed above) in order to try and achieve a 'target' bit rate. For a given bit rate, the choice of rate control algorithm can have a significant effect on rate-distortion performance, as discussed later in this chapter.

So far we have discussed only spatial distortion (the variation in quality of individual frames in the decoded video sequence). It is also important to consider temporal distortion, i.e. the situation where complete frames are 'dropped' from the original sequence in order to achieve acceptable performance. The curves shown in Figure 10.5 were generated for video sequences encoded at 30 frames per second. It would be possible to obtain lower spatial distortion by reducing the frame rate to 15 frames per second (dropping every second frame), at the expense of an increase in temporal distortion (because the frame rate has been reduced). The effect of this type of temporal distortion is apparent as 'jerky' video. This is usually just noticeable around 15-20 frames per second and very noticeable below 10 frames per second.

10.2.3 The Rate-Distortion Problem

The trade-off between coded bit rate and image distortion is an example of the general rate-distortion problem in communications engineering. In a lossy communication system, the challenge is to achieve a target data rate with minimal distortion of the transmitted signal (in this case, an image or sequence of images). This problem may be described as follows: minimise distortion (D) whilst maintaining a bit rate R that does not exceed a maximum bit rate Rmax, or

min{D} s.t. R ≤ Rmax     (10.1)

(where s.t. means 'subject to'). The conditions of Equation 10.1 can be met by selecting the optimum encoding parameters to give the 'best' image quality (i.e. the lowest distortion) without exceeding the target bit rate. This process can be viewed as follows:

1. Encode a video sequence with a particular set of encoding parameters (quantiser step size, macroblock mode selection, etc.) and measure the coded bit rate and decoded image quality (or distortion). This gives a particular combination of rate (R) and distortion (D), an R-D operating point.
2. Repeat the encoding process with a different set of encoding parameters to obtain another R-D operating point.
3. Repeat for further combinations of encoding parameters. (Note that the set of possible combinations of parameters is very large.)
Figure 10.6 shows a typical set of operating points plotted on a graph. Each point represents the mean bit rate and distortion achieved for a particular set of encoding parameters. (Note that distortion [D] increases as rate [R] decreases.) Figure 10.6 indicates that there are 'bad' and 'good' rate-distortion points. In this example, the operating points that give the best rate-distortion performance (i.e. the lowest distortion for a given rate R) lie close to the dotted curve. Rate-distortion theory tells us that this curve is convex (a convex hull). For a given target rate Rmax, the minimum distortion D occurs at a point on this convex curve. The aim of rate-distortion optimisation is to find a set of coding parameters that achieves an operating point as close as possible to this optimum curve.¹

Figure 10.6 R-D operating points

One way to find the position of the hull and hence achieve this optimal performance is by using Lagrangian optimisation. Equation 10.1 is difficult to minimise directly and a popular method is to express it in a slightly different way as follows:

min{J = D + λR}     (10.2)

J is a new function that contains D and R (as before) as well as a Lagrange multiplier, λ. J is the equation of a straight line D + λR, where λ gives the slope of the line. There is a solution to Equation 10.2 for every possible multiplier λ, and each solution is a straight line that makes a tangent to the convex hull described earlier. The procedure may be summarised as follows:

1. Encode the sequence many times, each time with a different set of coding parameters.
2. Measure the coded bit rate (R) and distortion (D) of each coded sequence. These measurements are the 'operating points' (R, D).
3. For each value of λ, find the operating point (R, D) that gives the smallest value J, where J = D + λR. This gives one point on the convex hull.
4. Repeat step (3) for a range of λ to find the 'shape' of the convex hull.

This procedure is illustrated in Figure 10.7. The (R, D) operating points are plotted as before. Three values of λ are shown: λ1, λ2 and λ3. In each case, the solution to J = D + λR is a straight line with slope λ. The operating point (R, D) that gives the smallest J is shown in black, and these points occur on the lower boundary (the convex hull) of all the operating points.

Figure 10.7 Finding the best (R, D) points using Lagrangian optimisation

The Lagrangian method will find the set (or sets) of encoding parameters that give the best performance and these parameters may then be applied to the video encoder to achieve optimum rate-distortion performance. However, this is usually a prohibitively complex process. Encoding decisions (such as quantiser step size, macroblock type, etc.) may change for every macroblock in the coded sequence and so there are an extremely large number of combinations of encoding parameters.

Example

Macroblock 0 in a picture is encoded using MPEG-4 (simple profile) with a quantiser step size Q0 in the range 2-31. The choice of Q1 for macroblock 1 is constrained to Q0 ± 2. There are 30 possible values of Q0; (almost) 30 x 5 = 150 possible combinations of Q0 and Q1; (almost) 30 x 5 x 5 = 750 combinations of Q0, Q1 and Q2; and so on. The computation required to evaluate all possible choices of encoding decision becomes prohibitive even for a short video sequence.
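It is the generation of the operating points, not the selection in step 3, that carries this prohibitive cost: once the points have been measured, choosing the best one for a given λ is trivial. A minimal sketch of the selection step, using hypothetical data structures rather than code from any reference encoder, might be:

#include <float.h>

typedef struct {
    double rate;        /* measured bit rate R for one set of parameters */
    double distortion;  /* measured distortion D (e.g. mean squared error) */
} OperatingPoint;

/* For a given Lagrange multiplier, return the index of the operating point
   that minimises J = D + lambda * R. Sweeping lambda over a range of values
   traces out the convex hull of the measured (R, D) points. */
static int best_point(const OperatingPoint *pts, int n, double lambda)
{
    int best = 0;
    double best_j = DBL_MAX;
    for (int i = 0; i < n; i++) {
        double j = pts[i].distortion + lambda * pts[i].rate;
        if (j < best_j) {
            best_j = j;
            best = i;
        }
    }
    return best;
}

Each value of lambda picks out one point on the convex hull, with larger values of lambda favouring lower-rate (and hence higher-distortion) operating points.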
Furthermore, no two video sequences produce the same rate-distortion performance for the same encoding parameters, and so this process needs to be carried out each time a sequence is to be encoded. There have been a number of attempts to simplify the Lagrangian optimisation method in order to make it more practically useful [2-4]. For example, certain assumptions may be made about good and bad choices of encoding parameters in order to limit the exponential growth of complexity described above. The computational complexity of some of these methods is still much higher than the computation required for the encoding process itself; however, this complexity may be justified in some applications, such as (for example) encoding a feature film to obtain optimum rate-distortion performance for storage on a DVD.

An alternative approach is to estimate the optimum operating points using a model of the rate-distortion characteristics [5]. Lagrange-based optimisation is first carried out on some representative video sequences in order to find the 'true' optimal parameters for these sequences. The authors propose a simple model of the relationship between encoding mode selection and λ, and the encoding mode decisions required to achieve minimal distortion for a given rate constraint Rmax can be estimated from this model. The authors report a clear performance gain over previous methods with minimal computational complexity. Another attempt has been made [6] to define an optimum partition between the coded bits representing motion vector information and the coded bits representing the displaced frame difference (DFD) in an inter-frame CODEC.

10.2.4 Practical Rate Control Methods

Bit-rate control in a real-time video CODEC requires a relatively low-complexity algorithm. The choice of rate control algorithm can have a significant effect on video quality and many alternative algorithms have been developed. The choice of rate control algorithm is not straightforward because a number of factors are involved, including:

- the computational complexity of the algorithm
- whether the rate control 'model' is appropriate to the type of video material to be encoded (e.g. 'static' video-conferencing scenes or fast-action movies)
- the constraints of the transmission channel (e.g. low-delay real-time communications or offline storage).

A selection of algorithms is summarised here.

Output buffer feedback

One of the simplest rate control mechanisms is shown in Figure 10.8. A frame of video i is encoded to produce bi bits. Because of the variation in content of a video sequence, bi is likely to vary from frame to frame, i.e. the encoder output bit rate is variable, Rv. In Figure 10.8 we assume that the channel rate is constant, Rc (this is the case for many practical channels). In order to match the variable rate Rv to the constant channel rate Rc, the encoded bits are placed in a buffer, filled at rate Rv and emptied at rate Rc.

Figure 10.8 Buffer feedback rate control

Figure 10.9 shows how the buffer contents vary during encoding of a typical video sequence. As each frame is encoded, the buffer fills at a variable rate, and after encoding of each frame a fixed number of bits bc are removed from the buffer. With no constraint on the variable rate Rv, it is possible for the buffer contents to rise to a point at which the buffer overflows (Bmax in the figure). The black line shows the unconstrained case: the buffer overflows in frames 5 and 6. To avoid this happening, a feedback constraint is required, where the buffer occupancy B is 'fed back' to control the quantiser step size Q. As B increases, Q also increases, which has the effect of increasing compression and reducing the number of bits per frame bi. The grey line in Figure 10.9 shows that with feedback, the buffer contents are never allowed to rise above about 50% of Bmax.

Figure 10.9 Buffer contents: constrained and unconstrained
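A minimal sketch of this feedback loop is shown below. The buffer size, the linear mapping from buffer occupancy to quantiser step size and the encode_frame() call are all illustrative assumptions; practical algorithms, including those summarised in this section, use more sophisticated models.

    #define BUFFER_MAX  250000L   /* Bmax: buffer capacity in bits (illustrative value) */
    #define Q_MIN       2         /* quantiser step size range (H.263/MPEG-4 style)     */
    #define Q_MAX       31

    /* Hypothetical encoder call: encodes one frame with quantiser step size q
       and returns the number of coded bits produced. */
    extern long encode_frame(const unsigned char *frame, int q);

    static long buffer_bits = 0;  /* current buffer occupancy B */

    /* Encode one frame, update the buffer model and return the quantiser step
       size to use for the next frame. bits_per_frame_channel is the number of
       bits removed by the constant-rate channel in one frame period. */
    int code_frame_with_feedback(const unsigned char *frame, int q,
                                 long bits_per_frame_channel)
    {
        buffer_bits += encode_frame(frame, q);   /* buffer fills at the variable rate Rv */
        buffer_bits -= bits_per_frame_channel;   /* and empties at the channel rate Rc   */
        if (buffer_bits < 0)
            buffer_bits = 0;                     /* buffer cannot empty below zero       */

        /* Feedback: as occupancy B rises towards Bmax, increase Q so that the
           next frame is compressed more heavily and produces fewer bits. */
        int next_q = Q_MIN + (int)(((Q_MAX - Q_MIN) * buffer_bits) / BUFFER_MAX);
        if (next_q > Q_MAX)
            next_q = Q_MAX;
        return next_q;
    }

As in Figure 10.9, a rising buffer occupancy B forces Q upwards, increasing compression and reducing the number of bits per frame bi.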
This method is simple and straightforward but has several disadvantages. A sudden increase in activity in the video scene may cause B to increase too rapidly to be effectively controlled by the quantiser Q, so that the buffer overflows; in this case the only course of action is to skip frames, resulting in a variable frame rate. As Figure 10.9 shows, B increases towards the end of each encoded frame and this means that Q also tends to increase towards the end of the frame. This can lead to an effect whereby the top of each frame is encoded with a relatively high quality whereas the foot of the frame is highly quantised and has an obvious drop in quality, as shown in Figure 10.10. The basic buffer-feedback method tends to produce decoded video with obvious quality variations.

MPEG-2 Test Model 5 [7]

Version 5 of the MPEG-2 video Test Model (a reference design for MPEG-2 encoding and decoding) describes a rate control algorithm for CBR encoding that takes account of the different properties of the three coded picture types (I, P and B-pictures). The algorithm consists of three steps: bit allocation, rate control and modulation.

Figure 10.13 Available computational resources

These scenarios illustrate the need for a more flexible approach to computation in a video CODEC. In this type of scenario, computation can no longer be considered to be a 'constant'. CODEC performance is now a function of three variables: computational complexity, coded bit rate and video quality. Optimising the complexity, rate and distortion performance of a video CODEC requires flexible control of computational complexity, and this has led to the development of variable complexity algorithms for video coding.

10.3.2 Variable Complexity Algorithms

A variable complexity algorithm (VCA) carries out a particular task with a controllable degree of computational overhead. As discussed above, computation is often related to image quality and/or compression efficiency: in general, better image quality and/or higher compression require a higher computational overhead.

Input-independent VCAs

In this class of algorithms, the computational complexity of the algorithm is independent of the input data. Examples of input-independent VCAs include:

Frame skipping: encoding a frame takes a certain amount of processing resources and 'skipping' frames (i.e. not coding certain frames in the input sequence) is a crude but effective way of reducing processor utilisation.
The relationship between frame rate and utilisation is not necessarily linear in an inter-frame CODEC: when the frame rate is low (because of frame skipping), there is likely to be a larger difference between successive frames and hence more data to code in the residual frame. Frame skipping may lead to a variable frame rate as the available resources change and this can be very distracting to the viewer. Frame skipping is widely used in software video CODECs.

Motion estimation (ME) search window: increasing or decreasing the ME search window changes the computational overhead of motion estimation. The relationship between search window size and computational complexity depends on the search algorithm. Table 10.1 compares the overhead of different search window sizes for the popular n-step search algorithm. With no search, only the (0, 0) position is matched; with a search window of +/- 1, a total of nine positions are matched; and so on.

Table 10.1 Computational overhead for n-step search (integer search)

    Search window    Number of comparison steps    Computation (normalised)
    0                 1                            0.03
    +/- 1             9                            0.27
    +/- 3            17                            0.51
    +/- 7            25                            0.76
    +/- 15           33                            1.0

Pruned DCT: a forward DCT (FDCT) processes a block of samples (typically 8 x 8) and produces a block of coefficients. In a typical image block, many of the coefficients are zero after quantisation and only a few non-zero coefficients remain to be coded and transmitted. These non-zero coefficients tend to occupy the lower-frequency positions in the block. A 'pruned' DCT algorithm only calculates a subset of the 8 x 8 DCT coefficients (usually the lower frequencies), reducing the computational overhead of the DCT [10, 11]. Examples of possible subsets are shown in Figure 10.14: the 'full' 8 x 8 DCT may be reduced to a 4 x 4 or 2 x 2 DCT, producing only low-frequency coefficients. However, applying a pruned DCT to all blocks means that the small (but significant) number of high-frequency coefficients are lost, and this can have a very visible impact on image quality.

Figure 10.14 Pruned DCT (2 x 2, 4 x 4 and 8 x 8 coefficient subsets)

Input-dependent algorithms

An input-dependent VCA controls computational complexity depending on the characteristics of the video sequence or coded data. Examples include the following.

Zero testing in the IDCT: in a DCT-based CODEC operating at medium or low bit rates, many blocks contain no AC coefficients after quantisation (i.e. only the DC coefficient remains, or no coefficients remain). This may be exploited to reduce the complexity of the IDCT (which must be calculated in both the encoder and the decoder in an inter-frame CODEC). Each row or column of eight coefficients is tested for zeros. If the seven highest coefficients are all zero, then the row or column will contain a uniform value (the DC coefficient) after the IDCT. In this case, the IDCT may be skipped and all samples set to the DC value:

    if (F1 == 0 && F2 == 0 && F3 == 0 && F4 == 0 &&
        F5 == 0 && F6 == 0 && F7 == 0) {
        /* only the DC coefficient is non-zero: the output row or column
           is 'flat', so set every sample to the DC value */
        f0 = f1 = f2 = f3 = f4 = f5 = f6 = f7 = F0;
    } else {
        /* ... calculate the IDCT of the row or column ... */
    }

There is a small overhead associated with testing for zero; however, the computational saving can be very significant and there is no loss of quality. Further input-dependent complexity reductions can be applied to the IDCT [12].

FDCT complexity reduction: many blocks contain few non-zero coefficients after quantisation (particularly in inter-coded macroblocks). It is possible to predict the occurrence of some of these blocks before the FDCT is carried out, so that the FDCT and quantisation steps may be skipped, saving computation.
The sum of absolute differences (SAD or SAE) calculated during motion estimation can act as a useful predictor for these blocks. SAD is proportional to the energy remaining in the block after motion compensation. If SAD is low, the energy in the residual block is low and it is likely that the block will contain little or no data after FDCT and quantisation. Figure 10.15 plots the probability that a block contains no coefficients after FDCT and quantisation, against SAD. This implies that it should be possible to skip the FDCT and quantisation steps for blocks with an SAD of less than a threshold value T:

    if (SAD < T) {
        set block contents to zero
    } else {
        calculate the FDCT and quantise
    }

If we set T = 200 then any block with SAD < 200 will not be coded. According to the figure, this 'prediction' of zero coefficients will be correct 90% of the time. Occasionally (10% of the time in this case) the prediction will fail, i.e. a block will be skipped that should have been encoded. The reduction in complexity due to skipping FDCT and quantisation for some blocks is therefore offset by an increase in distortion due to incorrectly skipped blocks.

Figure 10.15 Probability of zero block vs. SAD

Input-dependent motion estimation: a motion estimation algorithm with variable computational complexity has been described [15]. This is based on the nearest neighbours search (NNS) algorithm (described in Chapter 6), where motion search positions are examined in a series of 'layers' until a minimum is detected. The NNS algorithm is extended to a VCA by adding a computational constraint on the number of layers that are examined at each iteration of the algorithm. As with the SAD prediction discussed above, this algorithm reduces computational complexity at the expense of increased coding distortion. Other computationally scalable motion estimation algorithms are described elsewhere [16, 17].

10.3.3 Complexity-Rate Control

The VCAs described above are useful for controlling the computational complexity of video encoding and decoding. Some VCAs (such as zero testing in the IDCT) have no effect on image quality; however, the more flexible and powerful VCAs (such as zero DCT prediction) do have an effect on quality. These VCAs may also change the coded bit rate: for example, if a high proportion of DCT operations are 'skipped', fewer coded bits will be produced and the rate will tend to drop. Conversely, the 'target' bit rate can affect computational complexity if VCAs are used. For example, a lower bit rate and higher quantiser scale will tend to produce fewer DCT coefficients and a higher proportion of zero blocks, reducing computational complexity.

Figure 10.16 Complexity-rate-distortion surface

It is therefore not necessarily correct to treat complexity control and rate control as separate issues. An interesting recent development is the emergence of complexity-distortion theory [18]. Traditionally, video CODECs have been judged by their rate-distortion performance as described in Section 10.2.2. With the introduction of VCAs, it becomes necessary to examine performance in three axes: complexity, rate and distortion. The 'operating point' of a video CODEC is no longer restricted to a rate-distortion curve but instead lies on a rate-distortion-complexity surface, like the example shown in Figure 10.16.
Each point on this surface represents a possible set of encoding parameters, leading to a particular set of values for coded bit rate, distortion and computational complexity. Controlling rate involves moving the operating point along this surface in the rate-distortion plane; controlling complexity involves moving the operating point in the complexity-distortion plane. Because of the interrelationship between computational complexity and bit rate, it may be appropriate to control complexity and rate at the same time. This new area of complexity-rate control is at a very early stage and some preliminary results can be found elsewhere [14].

10.4 SUMMARY

Many practical video CODECs have to operate in a rate-constrained environment. The problem of achieving the best possible rate-distortion performance is difficult to solve and optimum performance can only be obtained at the expense of prohibitively high computational cost. Practical rate control algorithms aim to achieve good, consistent video quality within the constraints of rate, delay and complexity. Recent developments in variable complexity coding algorithms enable a further trade-off between computational complexity and distortion and are likely to become important for CODECs with limited computational resources and/or power consumption.

Bit rate is one of a number of constraints that are imposed by the transmission or storage environment. Video CODECs are designed for use in communication systems and these constraints must be taken into account. In the next chapter we examine the key 'quality of service' parameters required by a video CODEC and provided by transmission channels.

REFERENCES

1. A. Ortega and K. Ramchandran, 'Rate-distortion methods for image and video compression', IEEE Signal Processing Magazine, November 1998.
2. L-J. Lin and A. Ortega, 'Bit-rate control using piecewise approximated rate-distortion characteristics', IEEE Trans. CSVT, 8, August 1998.
3. Y. Yang, 'Rate control for video coding and transmission', Ph.D. thesis, Cornell University, 2000.
4. M. Gallant and F. Kossentini, 'Efficient scalable DCT-based video coding at low bit rates', Proc. ICIP99, Japan, October 1999.
5. G. Sullivan and T. Wiegand, 'Rate-distortion optimization for video compression', IEEE Signal Processing Magazine, November 1998.
6. G. M. Schuster and A. Katsaggelos, 'A theory for the optimal bit allocation between displacement vector field and displaced frame difference', IEEE J. Selected Areas in Communications, 15(9), December 1997.
7. ISO/IEC JTC1/SC29/WG11 Document 93/457, 'MPEG-2 Video Test Model 5', Sydney, April 1993.
8. J. Ribas-Corbera and S. Lei, 'Rate control for low-delay video communications [H.263 TM8 rate control]', ITU-T Q6/SG16 Document Q15-A-20, June 1997.
9. J. Ronda, M. Eckert, F. Jaureguizar and N. Garcia, 'Rate control and bit allocation for MPEG-4', IEEE Trans. CSVT, 9(8), December 1999.
10. C. Christopoulos, J. Bormans, J. Cornelis and A. N. Skodras, 'The vector-radix fast cosine transform: pruning and complexity analysis', Signal Processing, 43, 1995.
11. A. Hossen and U. Heute, 'Fast approximate DCT: basic idea, error analysis, applications', Proc. ICASSP97, Munich, April 1997.
12. K. Lengwehasatit and A. Ortega, 'DCT computation based on variable complexity fast approximations', Proc. ICIP98, Chicago, October 1998.
13. M-T. Sun and I-M. Pao, 'Statistical computation of discrete cosine transform in video encoders', J. Visual Communication and Image Representation, June 1998.
14. I. E. G. Richardson and Y. Zhao, 'Video CODEC complexity management', Proc. PCS01, Seoul, April 2001.
15. M. Gallant, G. Côté and F. Kossentini, 'An efficient computation-constrained block-based motion estimation algorithm for low bit rate video coding', IEEE Trans. Image Processing, 8(12), December 1999.
16. K. Lengwehasatit, A. Ortega, A. Basso and A. Reibman, 'A novel computationally scalable algorithm for motion estimation', Proc. VCIP98, San Jose, January 1998.
17. V. G. Moshnyaga, 'A new computationally adaptive formulation of block-matching motion estimation', IEEE Trans. CSVT, 11(1), January 2001.
18. V. K. Goyal and M. Vetterli, 'Computation-distortion characteristics of block transform coding', Proc. ICASSP97, Munich, April 1997.

11 Transmission of Coded Video

11.1 INTRODUCTION

A video communication system transmits coded video data across a channel or network, and the transmission environment has a number of implications for the encoding and decoding of video. The capabilities and constraints of the channel or network vary considerably, for example from low bit rate, error-prone transmission over a mobile network to high bit rate, reliable transmission over a cable television network. Transmission constraints should be taken into account when designing or specifying video CODECs; the aim is not simply to achieve the best possible compression but to develop a video coding system that is well matched to the transmission environment.

This problem of 'matching' the application to the network is often described as a 'quality of service' (QoS) problem. There are two sides to the problem: the QoS required by the application (which relates to visual quality perceived by the user) and the QoS offered by the transmission channel or network (which depends on the capabilities of the network). In this chapter we examine QoS from these two points of view and discuss design approaches that help to match the offered and required QoS. We describe two examples of transmission scenarios and discuss how these scenarios influence video CODEC design.

11.2 QUALITY OF SERVICE REQUIREMENTS AND CONSTRAINTS

11.2.1 QoS Requirements for Coded Video

Successful transmission of coded video places a number of demands on the transmission channel or network. The main requirements ('QoS requirements') for real-time video transmission are discussed below.

Data rate

A video encoder produces coded video at a variable or constant rate (as discussed in Chapter 10). The key parameters for transmission are the mean bit rate and the variation of the bit rate. The mean rate (or the constant rate for CBR video) depends on the characteristics of the source video (frame size, number of bits per sample, frame rate, amount of motion, etc.) and on the compression algorithm. Practical video coding algorithms incorporate a degree of compression control (e.g. quantiser step size and mode selection) that allows some control of the mean rate after encoding. However, for a given source (with a particular frame size and frame rate) there is an upper and lower limit on the achievable mean compressed bit rate. For example, 'broadcast TV quality' video (approximately 704 x 576 pixels per frame, 25 or 30 frames per second) encoded using MPEG-2 requires a mean encoded bit rate of around 2-5 Mbps for acceptable visual quality. In order to successfully transmit video at 'broadcast' quality, the network or channel must support at least this bit rate.
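As a rough worked example (assuming 4:2:0 sampling, i.e. an average of 12 bits per pixel), an uncompressed 704 x 576 frame carries about 4.9 Mbits, so a 25 frames-per-second source generates roughly 120 Mbps before coding. Compressing this to 2-5 Mbps therefore implies compression ratios of the order of 25-60 times.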
Chapter 10 explained how the variation in coded bit rate depends on the video scene content and on the type of rate control algorithm used. Without rate control, a video CODEC tends to generate more encoded data when the scene contains a lot of spatial detail and movement, and less data when the scene is relatively static. Different encoding modes (such as I-, P- or B-pictures in MPEG video) produce varying amounts of coded data. An output buffer together with a rate control algorithm may be used to 'map' this variable rate to either a constant bit rate (no bit rate variation) or a variable bit rate with constraints on the maximum amount of variation.

Distortion

Most of the practical algorithms for encoding of real-time video are lossy, i.e. some distortion is introduced by encoding and the decoded video sequence is not identical to the original video sequence. The amount of distortion that is acceptable depends on the application.

Example 1

A movie is displayed on a large, high-quality screen at HDTV resolution. Capture and editing of the video material is of a very high quality and the viewing conditions are good. In this example, there is likely to be a low 'threshold' for distortion introduced by the video CODEC, since any distortion will tend to be highlighted by the quality of the material and the viewing conditions.

Example 2

A small video 'window' is displayed on a PC as part of a desktop video-conferencing application. The scene being displayed is poorly lit; the camera is cheap and placed at an inconvenient angle; the video is displayed at a low resolution alongside a number of other application windows. In this example, we might expect a relatively high threshold for distortion. Because of the many other factors limiting the quality of the visual image, distortion introduced by the video CODEC may not be obvious until it reaches significant levels.

Ideally, distortion due to coding should be negligible, i.e. the decoded video should be indistinguishable from the original (uncoded) video. More practical requirements for distortion may be summarised as follows:

1. Distortion should be 'acceptable' for the application. As discussed above, the definition of 'acceptable' varies depending on the transmission and viewing scenario: distortion due to coding should preferably not be the dominant limiting factor for video quality.

2. Distortion should be near constant from a subjective viewpoint. The viewer will quickly become 'used' to a particular level of video quality. For example, analogue VHS video is relatively low quality but this has not limited the popularity of the medium, because viewers accept a predictable level of distortion. However, sudden changes in quality (for example, 'blocking' effects due to rapid motion or distortion due to transmission errors) are obvious to the viewer and can have a very negative effect on perceived quality.

Delay

By its nature, real-time video is sensitive to delay. The QoS requirements in terms of delay depend on whether video is transmitted one way (e.g. broadcast video, streaming video, playback from a storage device) or two ways (e.g. video conferencing). Simplex (one-way) video transmission requires frames of video to be presented to the viewer at the correct points in time.
Usually, this means a constant frame rate; in the case where a frame is not available at the decoder (for example, due to frame skipping at the encoder), the other frames should be delayed as appropriate so that the original temporal relationships between frames are preserved. Figure 11.1 shows an example: frame 3 from the original sequence is skipped by the encoder (because of rate constraints) and the frames arrive at the decoder in the order 1, 2, 4, 5, 6. The decoder must 'hold' frame 2 for two frame periods so that the later frames (4, 5, 6) are not displayed too early with respect to frames 1 and 2. In effect, the CODEC maintains a constant delay between capture and display of frames. Any accompanying media that is 'linked' to the video frames must remain synchronised: the most common example is accompanying audio, where a loss of synchronisation of more than about 0.1 s can be obvious to the viewer.

Figure 11.1 Preserving temporal relationship between frames

Duplex (two-way) video transmission has the above requirements (constant delay in each direction, synchronisation between related media) plus the requirement that the total delay from capture to display must be kept low. A 'rule of thumb' for video conferencing is to keep the total delay less than 0.4 s. If the delay is longer than this, normal conversation becomes difficult and artificial. Interactive applications, in which the viewer's actions affect the encoded video material, also have a requirement of low delay. An example is remote 'VCR' controls (play, stop, fast forward, etc.) for a streaming video source. A long delay between the user action (e.g. pressing the fast forward button) and the corresponding effect on the video source may make the application feel 'unresponsive'. Figure 11.2 illustrates these three application scenarios.

Figure 11.2 Delay scenarios: (a) one-way transmission; (b) two-way transmission (constant, low delay in each direction); (c) one-way transmission with interaction (low feedback delay)

11.2.2 Practical QoS Performance

The previous section discussed the QoS requirements for coded video transmission; the other side of the equation is the QoS that can be provided by practical transmission environments.

Data rate

Circuit-switched networks such as the PSTN/POTS provide a constant bit rate connection. Examples include 33.6 kbps for a two-way modem connection over an analogue PSTN line; a 56 kbps 'downstream' connection from an Internet service provider (ISP) over an analogue PSTN line; and 128 kbps over basic rate ISDN. Packet-switched networks such as Internet Protocol (IP) and Asynchronous Transfer Mode (ATM) networks provide a variable rate packet transmission service. This implies that these networks may be better suited to carrying coded video (with its inherently variable bit rate). However, the mean and peak packet transmission rates depend on the capacity of the network and may vary depending on the amount of other traffic in the network. The data rate of a digital subscriber line connection (e.g. Asymmetric Digital Subscriber Line, ADSL) can vary depending on the quality of the line from the subscriber to the local PSTN exchange (the 'local loop'). The end-to-end rate achieved over this type of connection may depend on the 'core' network (typically IP) rather than the local ADSL connection speed.
Dedicated transmission services such as satellite broadcast, terrestrial broadcast and cable TV provide a constant bit rate connection that is matched to the QoS requirements of encoded television channels.

Errors

The circuit-switched PSTN and dedicated broadcast channels have a low rate of bit errors (randomly distributed, independent errors, case (a) in Figure 11.3).

Figure 11.3 (a) Bit errors; (b) lost packets; (c) burst errors

Packet-switched networks such as IP usually have a low bit error rate but can suffer from packet loss during periods of network congestion (loss of the data 'payload' of a complete network packet, case (b) in Figure 11.3). Packet loss is often 'bursty', i.e. a high rate of packet loss may be experienced during a particular period, followed by a much lower rate of loss. Wireless networks (such as wireless LANs and personal communications networks) may experience high bit error rates due to poor propagation conditions. Fading of the transmitted signal can lead to bursts of bit errors in this type of network (case (c) in Figure 11.3, a sequence of bits containing a significant number of bit errors). Figure 11.4 shows the path loss (i.e. the variation in received signal power) between a base station and receiver in a mobile network, plotted as a function of distance. A mobile receiver can experience rapid fluctuations in signal strength (and hence in error rates) due to fading effects (such as the variation with distance shown in the figure).

Figure 11.4 Path loss variation with distance

Delay

Circuit-switched networks and dedicated broadcast channels provide a near-constant, predictable delay. Delay through a point-to-point wireless connection is usually predictable. The delay through a packet-switched network may be highly variable, depending on the route taken by the packet and the amount of other traffic. The delay through a network node, for example, increases if the traffic arrival rate is higher than the processing rate of the node. Figure 11.5 shows how two packets may experience very different delays as they traverse a packet-switched network. In this example, a packet following route A passes through four routers and experiences long queuing delays, whereas a packet following route B passes through two routers with very little queuing time. (Some improvement may be gained by adopting virtual circuit switching, where successive packets from the same source follow identical routes.) Finally, automatic repeat request (ARQ)-based error control can lead to very variable delays whilst waiting for packet retransmission, and so ARQ is not generally appropriate for real-time video transmission (except in certain special cases for error handling, described later).

Figure 11.5 Varying delays within a packet-switched network

11.2.3 Effect of QoS Constraints on Coded Video

The practical QoS constraints described above can have a significant effect on the quality and performance of video applications.

Data rate

Most transmission scenarios require some form of rate control to adapt the inherently variable rate produced by a video encoder to a fixed or constrained bit rate supported by a network or channel. A rate control mechanism generally consists of an output buffer and a feedback control algorithm; practical rate control algorithms are described in Chapter 10.
Errors

A bit error in a compressed video sequence can cause a 'cascade' of effects that may lead to severe picture degradation. The following example illustrates the potential effects of a single bit error:

1. A single bit error occurs within variable-length coded transform coefficient data.

2. The coefficient corresponding to the affected VLC is incorrectly decoded. Depending on the magnitude of the coefficient, this may or may not have a visible effect on the decoded image. The incorrectly decoded coefficient may cause the current 8 x 8 block to appear distorted. If the current block is a luminance block, this will affect 8 x 8 pixels in the displayed image; if the current block contains chrominance data, this will affect 16 x 16 pixels (assuming 4:2:0 sampling of the chrominance).

3. Subsequent VLCs may be incorrectly decoded because the error changes a valid VLC into another valid (but incorrect) VLC. In the worst case, the decoder may be unable to regain synchronisation with the correct sequence of syntax elements. The decoder can always recover at the next resynchronisation marker (such as a slice start code [MPEG-2] or GOB header [MPEG-4/H.263]). However, a whole section of the current frame may be corrupted before resynchronisation occurs. This effect is known as spatial error propagation, where a single error can cause a large spatial area of the frame to be distorted. Figure 11.6 shows an example: a single bit error affects a macroblock in the second-last row of this picture (coded using MPEG-4). Subsequent macroblocks are incorrectly decoded and the errored region propagates until the end of the row of macroblocks (where a GOB header enables the decoder to resynchronise).

4. If the current frame is used as a prediction reference (e.g. an I- or P-picture in MPEG or H.263), subsequent decoded frames are predicted from the distorted region. Thus an error-free decoded frame may be distorted due to an error in a previous frame (in decoding order): the error-free frame is decoded to produce a residual or difference frame which is then added to a distorted reference frame to produce a new distorted frame. This effect is temporal error propagation and is illustrated in Figure 11.7. The two frames shown there were predicted from the errored frame of Figure 11.6: no further errors have occurred, but the distorted area continues to appear in further predicted frames. Because the macroblocks of the frames in Figure 11.7 are predicted using motion compensation, the errored region changes shape. The corrupted area may actually increase in subsequent predicted frames, as illustrated in Figure 11.8: in this example, motion vectors for macroblocks in frame 2 point 'towards' an errored area in frame 1 and so the error spreads out in frame 2. Over a long sequence of predicted frames, an errored region will tend to spread out and also to fade as it is 'added to' by successive correctly decoded residual frames.

Figure 11.7 Example of temporal error propagation
Figure 11.8 Increase in errored area during temporal propagation: (a) error in frame 1; (b) error spreads into neighbouring macroblocks in frame 2 due to motion compensation

In practice, packet losses are more likely to occur than bit errors in many situations. For example, a network transport protocol may discard packets containing bit errors. When a packet is lost, an entire section of coded data is discarded.
A large section of at least one frame will be lost and this area may be increased due to spatial and temporal propagation.

Delay

Any delay within the video encoder and decoder must not cause the total delay to exceed the limits imposed by the application (e.g. a total delay of 0.4 s for video conferencing). Figure 11.9 shows the main sources of delay within a video coding application: each of the components shown (from the capture buffer through to the display buffer) introduces a delay. (Note that any multiplexing/packetising delay is assumed to be included in the encoder output buffer and decoder input buffer.)

Figure 11.9 Sources of delay in a video CODEC application

The delay requirements place a constraint on certain aspects of the CODEC, including bidirectional prediction and output buffering. B-pictures (supported by the MPEG and H.263 standards) are predicted from two reference frames, one past and one future. The use of B-pictures introduces an extra delay of one frame in the decoder and at least one frame in the encoder (depending on the number of B-pictures between successive reference pictures), and so the improved compression efficiency of B-pictures needs to be balanced against delay constraints. A large encoder output buffer makes it easier to 'smooth' variations in encoded bit rate without rapid changes in quantisation (and hence quality); however, delay through the CODEC increases as the buffer size increases and so delay requirements limit the size of buffer that may be used.

11.3 DESIGN FOR OPTIMUM QoS

The problem of providing acceptable quality video over a channel with QoS constraints can be addressed by considering these constraints in the design and control of a video CODEC. There are a number of mechanisms within the video coding standards that may be exploited to maintain acceptable visual quality.

11.3.1 Bit Rate

Many different approaches to video rate control have been proposed and video quality may be maximised by careful choice of a rate control algorithm to match the type of video material and the channel characteristics. Chapter 10 discusses rate control in detail: the aims of a rate control algorithm (often conflicting) are to maximise quality within the bit rate and delay constraints of the channel and the application. Tools that may be used to achieve these aims include quantisation control, encoding mode selection and (if necessary) frame skipping. Further flexibility in the control of encoder bit rate is provided by some of the optional modes of H.263+ and MPEG-4, for example:

- Reference picture resampling (H.263+ Annex P) enables an encoder to change frame resolution 'on the fly'. With this optional mode, a picture encoded at the new resolution may be predicted from a resampled reference picture at the old resolution. Changing the spatial resolution can significantly change the encoded bit rate without interrupting the flow of frames.

- Object-based coding (MPEG-4) enables individual 'objects' within a video scene (for example, foreground and background) to be encoded largely independently. This can support flexible rate control by, for example, reducing the quality and frame update rate of less important background objects whilst maintaining high quality and update rate for visually significant foreground objects.

11.3.2 Error Resilience

Performance in the presence of errors can be improved at a number of stages in the CODEC [1-3], including the following.
Encoder

Resynchronisation methods may be used to limit error propagation. These include restart markers (e.g. slice start code, GOB header) to limit spatial error propagation, intra-coded pictures (e.g. MPEG I-pictures) to limit temporal error propagation, and intra-coded macroblocks to 'force' an error-free update of a region of the picture [4]. Splitting an encoded frame into sections that may be decoded independently limits the potential for error propagation, and H.263+ Annex R (independent segment decoding) and the video packet mode of MPEG-4 support this. A further enhancement of error resilience is provided by the optional reversible variable length codes (RVLCs) supported by MPEG-4 and described in Chapter 8. Layered or scalable coding (such as the four scalable modes of MPEG-2) can improve performance in the presence of errors. The 'base' layer in a scalable coding algorithm is usually more sensitive to errors than the enhancement layer(s), and some improvement in error resilience has been demonstrated using unequal error protection, i.e. applying increased error protection to the base layer [3].

Channel

Suitable techniques include the use of error control coding [5, 6] and 'intelligent' mapping of coded data into packets. The error control code specified in H.261 and H.263 (a BCH code) cannot correct many errors. More robust coding may be more appropriate (for example, concatenated Reed-Solomon and convolutional coding for MPEG-2 terrestrial or satellite transmission, see Section 11.4.1). Increased protection from errors can be provided at the expense of higher error correction overhead in transmitted packets. Careful mapping of encoded data into network packets can minimise the impact of a lost packet. For example, placing an independently decodeable unit (such as an independent segment or video packet) into each transmitted packet means that a lost packet will affect the smallest possible area of a decoded frame (i.e. the error will not propagate spatially beyond the data contained within the packet).

Decoder

A transmission error may cause a 'violation' of the coded data syntax that is expected at the decoder. This violation indicates the approximate location of the corresponding errored region in the decoded frame. Once this is known, the decoder may implement error concealment to minimise the visual impact of the error. The extent of the errored region can be estimated once the position of the error is known, as the error will usually propagate spatially up to the next resynchronisation point (e.g. GOB header or slice start code). The decoder attempts to conceal the errored region by making use of spatially and temporally adjacent error-free regions. A number of error concealment algorithms exist and these usually take advantage of the fact that a human viewer is more sensitive to low-frequency components in the decoded image. An error concealment algorithm attempts to restore the low-frequency information and (in some cases) selected high-frequency information.

Spatial error concealment repairs the damaged region by interpolation from neighbouring error-free pixels [7]. Errors typically affect a 'stripe' of macroblocks across a picture (see for example Figure 11.6) and so the best method of interpolation is to use pixels immediately above and below the damaged area, as shown in Figure 11.10.

Figure 11.10 Spatial error concealment
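A minimal sketch of this vertical interpolation is given below, assuming 8-bit luminance samples stored in raster order and a damaged stripe whose first and last rows are known; the function and frame layout are illustrative.

    /* Conceal a damaged horizontal stripe by linear interpolation between the
       error-free rows immediately above and below it. Rows (top-1) and
       (bottom+1) must lie inside the frame. */
    void conceal_stripe(unsigned char *frame, int width, int top, int bottom)
    {
        int span = bottom - top + 2;      /* distance from row top-1 to bottom+1 */
        for (int y = top; y <= bottom; y++) {
            int w = y - (top - 1);        /* weight towards the row below the stripe */
            for (int x = 0; x < width; x++) {
                int above = frame[(top - 1) * width + x];
                int below = frame[(bottom + 1) * width + x];
                frame[y * width + x] =
                    (unsigned char)((above * (span - w) + below * w) / span);
            }
        }
    }

Because the repair is a smooth interpolation, it restores only the low-frequency content of the damaged area, which is consistent with the observation above that viewers are most sensitive to low-frequency components.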
A spatial filter may be used to 'smooth' the boundaries of the repaired area. More advanced concealment algorithms attempt to maintain significant features such as edges across the damaged region. This usually requires a computationally complex algorithm, for example using projection onto convex sets (see Section 9.3, 'Post-filtering').

Temporal error concealment copies data from temporally adjacent error-free frames to hide the damaged area [7, 8]. A simple approach is to copy the same region from the previous frame (often available in the frame store memory at the decoder). A problem occurs when there is a change between the frames due to motion: the copied area appears to be 'offset' and this can be visually disturbing. This effect can be reduced by compensating for motion, and this is straightforward if motion vectors are available for the damaged macroblocks. However, in many cases the motion vectors may be damaged themselves and must be reconstructed, for example by interpolating from the motion vectors of undamaged macroblocks. Good results may be obtained by re-estimating the motion vectors in the decoder, but this adds significantly to the computational complexity.

Figure 11.11 shows how the error-resilient techniques described above may be applied to the encoder, channel and decoder.

Figure 11.11 Application of error-resilient techniques

Combined approach

Recently, some promising methods for error handling involving cooperation between several stages of the transmission 'chain' have been proposed [9-11]. In a real-time video communication system it is not usually possible to retransmit damaged or lost packets due to delay constraints; however, it is possible for the decoder to signal the location of a lost packet to the encoder. The encoder can then determine the area of the frame that is likely to be affected by the error (a larger area than the original errored region, due to motion-compensated prediction from the errored region) and encode macroblocks in this area using intra-coding. This will have the effect of 'cleaning up' the errored region once the feedback message is received. Alternatively, the technique of reference picture selection enables the encoder (and decoder) to choose an older, error-free frame for prediction of the next inter-frame once the position of the error is known. This requires both encoder and decoder to store multiple reference frames. The reference picture selection modes of H.263+ (Annex N and Annex U) may be used for this purpose.

These two methods of incorporating feedback are illustrated in Figures 11.12 and 11.13. In Figure 11.12, an error occurs during transmission of frame 1. The decoder signals the estimated location of the error to the encoder; meanwhile, the error propagates to frames 2 and 3 and spreads out due to motion compensation. The encoder estimates the likely spread of the damaged area and intra-codes an appropriate region of frame 4. The intra-coded macroblocks halt the temporal error propagation and 'clean up' decoded frames 4 onwards. In Figure 11.13, an error occurs in frame 1 and the error is signalled back to the encoder by the decoder. On receiving the notification, the encoder selects a known 'good' reference frame (frame 0 in this case) to predict the next frame (frame 4). Frame 4 is inter-coded by motion-compensated prediction from frame 0 at the encoder.
The decoder also selects frame 0 for reconstructing frame 4 and the result is an error-free frame 4. 11.3.3 Delay The components shown in Figure 11.9 can each add to the delay (latency) through the video communication system: Encoder intr acode thisatea Decoder initial eterrmorpporroaplagaetrioronr‘cleaunpe’d Figure 11.12 Error trackingviafeedback 248 TRANSMISSION OF CODED VIDEO Encoder predictfrom older referenceframe forward prediction Decoder Figure11.13 predict from older reference frame Reference pictureselection 0 Capture buffer: this should only add delay if the encoder ‘stalls’, i.e. it takes too long to encode incoming frames. This may occur in a software video encoder when insufficient processing capacity is available. Encoder: I- and P-frames do not introduce a significant delay: however, B-picture coding requires a multiple frame delay (as discussed in Chapter 4) and so the use of B-pictures should be limited in a delay-sensitive application. 0 Output bufleer: the output buffer adds a delay that depends on its maximum size (in bits). For example, if the channel bit rate is 64 kbps, a buffer of 32 kbits adds a delayof 0.5 S. Keeping the buffer small minimises delay, but makes it difficult to maintain consistent visual quality (as discussed in Chapter 10). NetworWchannel: if a resource reservation mechanism (such as those provided by RSVP [resource reservation protocol] in the Internet, see Section 11.4.2) is available, it may be possible to reserve a path with a guaranteed maximum delay. However, many practical networks cannot guarantee a particular delay through the network. The best alternative may be to use a ‘low overhead’ transport protocol such as the user datagram protocol (UDP), perhaps in conjunction with a streaming protocol such as the real time protocol (RTP) (see Section 11.4.2). 0 Inputbuffer: thedecoderinputbuffersizeshouldbesettomatchtheencoderoutput buffer. If the encoder and decoder are processing video at the same rate (i.e. the same SCENARIOS TRANSMISSION 249 number of frames per second), the decoder input buffer doensot add any additionaldelay. (It can be shown12that thesum of the encoder buffer contents and decoder buffer contents is a constant if network delay is constant). 0 Decoder: the use of B-pictures adds at most one frame’s delay at the decoder and so this is not such a critical issue as at the encoder. 0 Display buffer: as with the capture buffer, the display buffer should not add a significant delay unless a queue of decoded frames is allowed to build up due to variable decoding speed. In this case, the decoder should pause until the correct time for decoding a frame. 11.4 TRANSMISSION SCENARIOS The design constraints and performance goalsfor a video CODEC are very dependent on the communicationsenvironment for which it is intended.Transmissionscenariosforvideo communicationsapplicationsrangefromhigh bit rate,highintegritytransmission(e.g. television broadcasting) to low bit rate, unreliable environments (e.g. packet-based transmis- sionoverthe Intemetl3). Anumber of ‘framework’protocols have beendeveloped to support video and audio transmission over different environments and some examples are listed in Table 11.1. In this section we choose two popular transmission environments (digital television and LANAntemet) and describethe protocols used for video transmission and their impact otnhe design of video CODECs. 
11.4.1 DigitalTelevisionBroadcasting: MPEG-2 Systems/Transport The MPEG-2 family of standards was developed with the aim of supporting‘televisionquality’ digital transmission of video and audio programmes. The video element is coded using MPEG-2 (Video) (described in Chapter 4) and the audio element is typically coded with MPEG-2 (Audio) Layer 3 (‘MP3’). These elements are combined and transmitted via the MPEG-2 ‘Systems’ framework. MPEG-2 transmission is currently used in a number of environments including terrestrial radio transmission, direct broadcasting via satellite (DBS) and cable TV (CATV). MPEG-2 is also the chosen standard for video storage and playback on digital versatile disk (DVD). These transmission environments have a number of differences but they typically have some common characteristics: a fixed, ‘guaranteed’ bit rate, a predictable transmission delay and (usually) predictable levels of noise (and hence errors). Table 11.1 Transmissiodstorageenvironmentsandprotocols Environment PSTNASDN LANAP Digital television broadcasting H.320,I4 H.32415 H.323I6 MPEG-2 Systems” Constant bit rate, low delay networks Variable packet rate, variable delay, unreliable transmission Constant bit rate, error rates depend on transmission medium 250 TRANSMISSIONVOIDFECOODED MPEG-2 (Systems) describes two methods of multiplexing audio, video and associated information, the program stream and the transport stream. In each case, streams of coded audio, video, data and system information are first packetised to form packetised elementary stream packets (PES packets). The program stream This is the basic multiplexing method, designed for storage or distribution in a (relatively) error-freeenvironment. A programstreamcarriesasingle program (e.ga.television programme) and consists of astream of PESpacketsconsisting of thevideo,audioand ancillary information needed to reconstruct the program. PES packets may be of variable length and these are grouped together in pucks, each of which starts with a pack header. Accurate timing control is essential for high-quality presentation of video and audio and this is achieved by a system of time references and time stamps. A decoder maintains a local system time clock (STC). Each pack header contains a system clock reference (SCR) field that is used to reset the decodeSr TC prior to decodingof the pack. PES packets contain time stamps and the decoder uses these to determine when the data in each packet should be decodedandpresented. In this way, accuratesynchronisationbetweenthevariousdata streams is achieved. The transport stream The transport stream (TS) is designed for transmission environments that are prone to errors (such as terrestrial or satellite broadcast). The basic element of the TS is the PES packet. However, variable-length PES packets are further packetised to formfixed length TS packets (each is 188 bytes) making it easier to add error protection and identify and recover from transmissionerrors. AsingleTS may carryone or moreprogramsmultiplexedtogether. Figure 11.14 illustrates the way in which information is multiplexed into programs and then into TS packets. Program .............................................................................. System, other data ........................................... PES packets from other L programs TS packets Figure 11.14 Transport stream multiplexing Modulate and transmit Demodulate TRANSMISSION SCENARIOS 251 Convolutional, RS decode .................p..r.o. gr.a.m...d..e..c...o....d...e..r.... 
Video decoder video Audio decoder audio System info. i Clock Figure 11.15 Transportstream demultiplexing Two levels of error correcting coding provide protection from transmission errors. First, 16 parity bytes are added to each 188 byte TS packet to form a 204-byte Reed-Solomon codeword and thestream of codewordsarefurtherprotectedwithaconvolutionalcode (usually a 718 code, i.e. the encoder produces 8 output bits for every 7 input bits). The total errorcodingoverhead is approximately25%.The‘outer’convolutionalcodecancorrect random bit errors and the ‘inner’Reed-Solomon code can correcbturst errors up to 64 bitins length. Figure 1l . 15 illustrates the processof demultiplexing and decoding an MPEG-2 TS. After correcting transmission errors, the streamof TS packets are demultiplexed and PES packets correspondingtoaparticularprogramarebufferedanddecoded.ThedecoderSTCis periodically updated when a SCR field is received and the STC provides a timing reference for the video and audio decoders. Implications for video CODEC design The characteristics of a typical MPEG-2 program are as follows: ITU-R 601 resolution video, 25 or 30 frames per second 0 Stereoaudio Video coded to approximately 3-S Mbps Audio coded to approximately 300kbps 0 Total programme bit rate approximately 6Mbps An MPEG-2 video encodertdecoder design aims to provide high-quality video within these transmissionparameters.Thechannelcoding(Reed-Solomon and convolutionalECC)is designed to correct most transmission errors and error-resilient video coding is generally limited to simple error recovery (and perhaps concealment) at the decoder to handle the occasionaluncorrectederror.TheSTC and the use of timestamps in eachPESpacket provide accurate synchronisation. 252 TRANSMISSION OF CODED VIDEO 11.4.2 Packet Video: H.323 MultimediaConferencing H.323 is an ITU-T ‘umbrella’ standard, describing a framework for multimedia commu- nicationsoverlocal areanetworks (LANs) andIP-basednetworksthatdo not support guaranteed QoS. Since its release in 1996, H.323 has gained popularity in Internet-based video and audio applications. H.323defines basicaudioandvideocodingfunctionalities so thatH.323-compliant devices and systems should be able to inter-operwatiteh at least a minimumset of communi- cation capabilities. H.323 provides independence from a particular network or platform (for example, by supporting translationbetweenprotocolframeworksfordifferentnetwork environments). It can assist with call set-up and managemewnitthin a controlled ‘zone’ and it can support multi-point conferencing (three or more participants) and multi-cast (trans- mission from one source to many receivers). H.323 components Terminal This is the basicentityinanH.323-compliantsystem. An H.323 terminal consists of a setof protocols and functions and its architecture is shoiwnnFigure 11.16. The mandatory requirements for an H.323 terminal (highlighted in thefigure) are audio coding (usingtheG.711, G.723 or G.729 audiocodingstandards)andthreecontrolprotocols: H.245 (channel control), 4.93 1 (call set-up and signalling) anrdegistration/admission/status ( M S ) (used to communicate with a gatekeeper, see below). Optional capabilities include video coding (H.261, H.263), data communications(using T.120) and the realtime protocol (RPf) or packet transmission over JP networks. All H.323 terminals supportpoint-to-point conferencing (i.e. one-to-one communications), support for multi-point conferencing (three or more participants) is optional. 
contrSoyl stem Video Data I/O Audio I10 CODERCAS Interface A V Audio Control H245 A A Data interface T. 120 A :ElYEp2’6: $ G.711 l G.723l G.729 RTP ___ H323 .._._..._._._._...-__.-_-.-_~..-.-_-._.-_..-_._..-..------------ terminal V V V LAN Interface Figure 11.16 H.323 terminal architecture Required components 0 TRANSMISSION SCENARIOS U 253 n 0 Figure 11.17 H.323 multi-point conferences Gateway An H.323 gateway provides an interface to other conferencing protocols such as H.320(ISDN-basedconferencing),H.324(PSTN-basedconferencing)andalsoanalogue telephone handsets via the PSTN. Gatekeeper ThisisanoptionalH.323entitythatmanagescommunications within a ‘zone’ (a defined set of H.323components within thenetwork).Gatekeeperfunctions includecontrolling the set-upandrouteing of conferencing ‘calls’ withinitszone. The gatekeeper can manage bandwidth usage within the zone by tracking the number of active calls and the bandwidth usage of each and barring new calls once the network has reached saturation. Multi-point control unit (MCU) Thisentityfacilitatesmulti-pointconferencing within H.323. The twomaintypes of multi-pointconferenceare centralised and decentralised (shown in Figure 11.17). In a centralised conference, all calls in the conference are routed through theMCU:henceeachterminalonlyhas to deal with point-to-point (‘unicast’) communications. This places a heavy processing burden on the MCU but is guaranteed to work withallH.323terminals. A decentralisedconferencerequires H.323 terminals that support ‘multi-cast’ communications: each terminal multi-casts itsdata to all other terminals in the conference and the MCU’s role is to set up the conference and provide control and status information to each participant. Video coding in the H.323 environmem If anH.323terminalsupports video communication,itmust be capable of using H.261 coding at QCIF resolution (see Chapter 5). Optionally,it may support H.263 coding and otherresolutions (e.g. CIF, 4CIF). The capabilities of each terminal in aconference are signalled via the H.245 protocol: in a typical session, the terminals will choose the ‘lowest common denominator’ of video support. This could be H.261 (the minimum support), H.263 (baseline) or H.263 with optional modes. H.323isbecomingpopularforcommunicationsovertheInternet. The Internet is inherently unreliable and this influences the choice of video coding tools and transmission protocols. The basic transport protocol is the unreliable datagram protocol (UDP):packets are transmitted without acknowledgement and are not guaranteed to reach their destination. This keeps delay to a minimum but packets may arrive out of order, late or not at all. 254 TRANSMISSION OF CODED VIDEO 151 161 1 3 1 Receivpeadckets pacRkee-otsrdered [T1 2 1 131 Figure 11.18 Packetsequencing using RTP 161 RTP may be used ‘on top’ of UDP for transmission of coded video and audio. RTP adds time stamps and sequence numbers to UDP packets, enabling a decoder to identify lost, delayed or out-of-sequence packets. If possible, a receiver will reorder the packets prior to decoding; if a packet does not arrive in time, its position is signalled to the decoder so that error recovery can be carried out. Figure 11.18 illustrates the way in which RTP reorders packets and signals the presence of lost packets. Packet 4 from the original sequence is lost duringtransmissionandtheremainingpacketsarereceivedout of orderS. equence numbering and time stamps enable the packets to be reordered and indicate the absence of packet 4. 
The real-time control protocol (RTCP) may be used to monitor and control an RTP session. RTCP sends quality control messages to each participant in the session containing useful QoS information such as the packet loss rate. The resource reservation protocol (RSVP) enables terminals to request a 'guaranteed' transmission bandwidth for the duration of the communication session. This improves the available QoS for real-time video and audio communications but requires support from every switch or router in the section of network traversed by the session.

Implications for video CODEC design

Video coding for two-way conferencing in an H.323 environment should support low delay and low bit-rate coding. Coding tools such as B-pictures that add to encoding delay should probably be avoided. Depending on the packet loss rate (which may be signalled by the RTCP protocol), an encoder may choose to implement error-resilient features such as increased intra-coding and resynchronisation markers (to limit spatial and temporal error propagation) and the use of slice-structured coding (e.g. Annexes K and V of H.263) to map coded video to equal-sized packets. A video decoder can use the information contained within an RTP packet header to determine the exact presentation time of each decoded packet and to implement error handling and error concealment when a lost packet is detected.

11.5 SUMMARY

Successful video communications relies upon matching the QoS required by an application with the QoS provided by the transmission network. In this chapter we discussed key QoS parameters from the point of view of the video CODEC and the network. Removing subjective and statistical redundancy through the video compression process has the disadvantage that the compressed data becomes sensitive to transmission impairments such as delays and errors. An effective solution to the QoS problem is to deal with it both in the video CODEC (for example by introducing error-resilient features and matching the rate control algorithm to the channel) and in the network (for example by adopting protocols such as RTP). We described two popular transmission scenarios, digital television broadcast and IP video conferencing, and their influence on video CODEC design. The result of taking the transmission environment into account is a distinctly different CODEC in each case. Video CODEC design is also heavily influenced by the implementation platform and in the next chapter we discuss alternative platforms and their implications for the designer.

REFERENCES

1. Y. Wang, S. Wenger, J. Wen and A. Katsaggelos, 'Review of error resilient coding techniques for real-time video communications', IEEE Signal Processing Magazine, July 2000.
2. B. Girod and N. Farber, 'Error-resilient standard-compliant video coding', from Recovery Techniques for Image and Video Compression, Kluwer Academic Publishers, 1998.
3. I. E. G. Richardson, 'Video coding for reliable communications', Ph.D. thesis, Robert Gordon University, 1999.
4. J. Y. Liao and J. Villasenor, 'Adaptive intra block update for robust transmission of H.263', IEEE Trans. CSVT, February 2000.
5. W. Kumwilaisak, J. Kim and C.-C. Jay Kuo, 'Video transmission over wireless fading channels with adaptive FEC', Proc. PCS01, Seoul, April 2001.
6. M. Bystrom and J. Modestino, 'Recent advances in joint source-channel coding of video', Proc. URSI Symposium on Signals, Systems and Electronics, Pisa, Italy, 1998.
7. S. Tsekeridou and I. Pitas, 'MPEG-2 error concealment based on block-matching principles', IEEE Trans. Circuits and Systems for Video Technology, June 2000.
8. J. Zhang, J. F. Arnold and M. Frater, 'A cell-loss concealment technique for MPEG-2 coded video', IEEE Trans. CSVT, June 2000.
9. B. Girod and N. Farber, 'Feedback-based error control for mobile video transmission', Proc. IEEE (special issue on video for mobile multimedia), 1999.
10. P.-C. Chang and T.-H. Lee, 'Precise and fast error tracking for error-resilient transmission of H.263 video', IEEE Trans. Circuits and Systems for Video Technology, June 2000.
11. N. Farber, B. Girod and J. Villasenor, 'Extensions of the ITU-T recommendation H.324 for error-resilient video transmission', IEEE Communications Magazine, June 1998.
12. Y. Yang, 'Rate control for video coding and transmission', Ph.D. thesis, Cornell University, 2000.
13. G. J. Conklin et al., 'Video coding for streaming media delivery on the internet', IEEE Trans. CSVT, 11(3), March 2001.
14. ITU-T Recommendation H.320, 'Line transmission of non-telephone signals', 1992.
15. ITU-T Recommendation H.324, 'Terminal for low bitrate multimedia communication', 1995.
16. ITU-T Recommendation H.323, 'Packet based multimedia communications systems', 1997.
17. ISO/IEC 13818-1, 'Information technology: generic coding of moving pictures and associated audio information: Systems', 1995.

Platforms

12.1 INTRODUCTION

In the early days of video coding technology, systems tended to fall into two categories: dedicated hardware designs for real-time video coding (e.g. H.261 videophones) or software designs for 'off-line' (not real-time) image or video coding (e.g. JPEG compression/decompression software). The continued increases in processor performance, memory density and storage capacity have led to a blurring of these distinctions, and video coding applications are now implemented on a wide range of processing platforms. General-purpose platforms such as personal computer (PC) processors can achieve respectable real-time coding performance and benefit from economies of scale (i.e. widespread availability, good development tools, competitive cost). There is still a need for dedicated hardware architectures in certain niche applications, such as high-end video encoding or very low power systems. The 'middle ground' between the general-purpose platform and the dedicated hardware design (for applications that require more processing power than a general-purpose processor can provide but where a dedicated design is not feasible) was, until recently, occupied by programmable 'video processors'. So-called 'media processors', providing support for wider functionalities such as audio and communications processing, are beginning to occupy this market. There is currently a convergence of processing platforms, with media extensions and features being added to traditionally distinct processor families (embedded, DSP, general-purpose), so that the choice of platform for video CODEC designs is wider than ever before. In this chapter we attempt to categorise the main platform alternatives and to compare their advantages and disadvantages for the designer of a video coding system. Of course, some of the information in this chapter will be out of date by the time this book is published, due to the rapid pace of development in processing platforms.
12.2 GENERAL-PURPOSE PROCESSORS

A desktop PC contains a processor that can be described as 'general-purpose'. The processor is designed to provide acceptable performance for a wide range of applications such as office, games and communications applications. Manufacturers need to balance the user's demand for higher performance against the need to keep costs down for a mass-market product. At the same time, the large economies of scale in the PC market make it possible for the major manufacturers to rapidly develop and release higher-performance versions of the processors. Table 12.1 lists some of the main players in the market and their recent processor offerings (as of August 2001).

Table 12.1 Popular PC processors
Intel     Pentium 4    Clock speed up to 2 GHz; highly pipelined; 128-bit single instruction multiple data (SIMD)
Motorola  PowerPC G4   Clock speed up to about 1 GHz; 128-bit vector processing
AMD       Athlon       Clock speed up to 1.4 GHz; multiple integer and floating-point execution units; SIMD processing

12.2.1 Capabilities

PC processors can be loosely characterised as follows:

- good performance at running 'general' applications;
- not optimised for any particular class of application (though the recent trend is to add features such as SIMD capabilities to support multimedia applications);
- high power consumption (though lower-power versions of the above processors are available for mobile devices);
- support for word lengths of 32 bits or more, with fixed- and floating-point arithmetic;
- support for SIMD instructions (for example carrying out the same operation on 4 x 32-bit words).

The popular PC operating systems (Windows and Mac O/S) support multi-tasking applications and offer good support for external hardware (via plug-in cards or interfaces such as USB).

12.2.2 Multimedia Support

Recent trends towards multimedia applications have led to increasing support for real-time media. There are several 'frameworks' that may be used within the Windows O/S, for example, to assist in the rapid development and deployment of real-time applications. The DirectX and Windows Media frameworks provide standardised interfaces and tools to support efficient capture, processing, streaming and display of video and audio.

The increasing usage of multimedia has driven processor manufacturers to add architectural and instruction support for typical multimedia processing operations. The three processor families listed in Table 12.1 (Pentium, PowerPC, Athlon) each support a version of 'single instruction, multiple data' (SIMD) processing. Intel's MMX and SIMD extensions [1, 2] provide a number of instructions aimed at media processing. A SIMD instruction operates on multiple data elements simultaneously (e.g. multiple 16-bit words within a 64-bit or 128-bit register). This facilitates computationally intensive, repetitive operations such as motion estimation (e.g. calculating the sum of absolute differences, SAD) and DCT (e.g. multiply-accumulate operations). Figure 12.1 shows how the Intel instruction psadbw may be used to calculate the SAD for eight pairs of input samples (Ai, Bi) in parallel, leading to a potentially large computational saving.

Figure 12.1 SAD calculation using SIMD instruction

Table 12.2 summarises the main advantages and disadvantages of PC platforms for video coding applications.
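As a point of reference for what an instruction such as psadbw replaces, the sketch below shows a plain C SAD calculation over a 16 x 16 block: a SIMD implementation collapses eight of the per-sample subtract, absolute-value and accumulate steps into a single operation. The function name and block size here are illustrative only.

```c
#include <stdlib.h>

/* Sum of absolute differences between two 16x16 blocks, computed
   one sample at a time. A SIMD instruction such as psadbw can
   process eight of these sample pairs in a single operation. */
int block_sad_16x16(const unsigned char *a, const unsigned char *b,
                    int stride)
{
    int sad = 0;
    for (int row = 0; row < 16; row++) {
        for (int col = 0; col < 16; col++)
            sad += abs(a[col] - b[col]);
        a += stride;   /* move to the next row of each block */
        b += stride;
    }
    return sad;
}
```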
The large user base and comprehensive development support make it an attractive platform for applications such as desktop video conferencing (Figure 12.2), in which a video CODEC is combined with a number of other components such as an audio CODEC, chat and document sharing to provide a flexible, low-cost video communication system.

Table 12.2 Advantages and disadvantages of PC platform
Advantages:
- High market penetration, very large potential user base
- Availability of efficient compilers and powerful development tools
- Multimedia extension functions to improve video processing performance
- Efficient multi-tasking with other applications
- Availability of multimedia application development frameworks
Disadvantages:
- Computationally intensive video coding functions must be carried out in software
- Medium to high power consumption
- Use of 'special' instructions such as SIMD limits the portability of the video coding application
- Processor resources not always available (can be problematic for real-time video)

Figure 12.2 PC-based video conferencing

12.3 DIGITAL SIGNAL PROCESSORS

Digital signal processors (DSPs) are designed to efficiently handle applications that are based around computationally intensive signal processing algorithms. Typical applications include audio processing (e.g. filtering and compression), telecommunications functions (such as modem processing, filtering and echo cancellation) and signal conditioning (transformation, noise reduction, etc.). Mass-market applications for DSPs include PC modems, wireless and hand-held communications processing, speech coding and image processing. These applications typically require good signal processing performance in a power-, cost- and/or space-limited environment. DSPs can be characterised as follows:

- high performance for a selected range of signal processing operations;
- low/medium power consumption;
- low/medium cost;
- fixed- or floating-point arithmetic capability;
- limited on- and off-chip code and data storage (depending on the available address space);
- 16- or 32-bit word length.

Table 12.3 lists a few popular DSPs and compares their features: this is only a small selection of the wide range of DSPs on the market. As well as these discrete ICs, a number of manufacturers provide DSP cores (hardware architectures designed to be integrated with other modules on a single IC).

Table 12.3 Popular DSPs
Texas Instruments  C5000 series              Low power, 16-bit, up to 500 MIPS (million instructions per second), optimised for portable devices and communications
                   C6000 series              Medium power, 16- or 32-bit, 1000-4000 MIPS, fixed- or floating-point arithmetic, optimised for broadband communications and image processing
Analog Devices     ADSP-218x and 219x series Low power, 16-bit, over 300 MIPS
                   SHARC                     32-bit, fixed- and floating-point arithmetic, SIMD instructions, 600 MOPS (million operations per second)
Motorola           DSP563xx                  24-bit, fixed-point, up to 200 MIPS, PCI bus interface
                   DSP568xx                  16-bit, fixed-point, combines DSP and microcontroller features

A key feature of DSPs is the ability to efficiently carry out repetitive processing algorithms such as filtering and transformation.
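The kind of repetitive kernel that a DSP's multiply-accumulate (MAC) hardware is built around can be sketched as follows. An inner loop of this shape (the function and variable names are illustrative) typically maps to a single-cycle MAC on the devices listed in Table 12.3, and the same pattern appears in FIR filtering, DCT and motion estimation loops.

```c
/* Inner kernel of an FIR filter or correlation: a repeated
   multiply-accumulate, which most DSPs execute at one tap per cycle. */
long multiply_accumulate(const short *x, const short *h, int n)
{
    long acc = 0;                       /* wide accumulator, as on a DSP */
    for (int i = 0; i < n; i++)
        acc += (long)x[i] * h[i];
    return acc;
}
```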
This means that they are well suited to many of the computationally intensive functions required of a typical DCT-based video CODEC, such as motion estimation, DCT and quantisation, and some promising performance results have been reported [3, 4]. Because a DSP is specifically designed for this type of application, this performance usually comes without the penalty of high power consumption. Support for related video processing functions (such as video capture, transmission and rendering) is likely to be limited. The choice of application development tools is not as wide as for the PC platform and high-level language support is often limited to the C language. Table 12.4 summarises the advantages and disadvantages of the DSP platform for video coding applications.

In a typical DSP development scenario, code is developed on a host PC in C, cross-compiled and downloaded to a development board for testing. The development board consists of a DSP IC together with peripherals such as memory, A/D converters and other interfaces. To summarise, a DSP platform can provide good performance with low power consumption but operating system and development support is often limited. DSPs may be a suitable platform for low-power, special-purpose applications (e.g. a hand-held videophone).

Table 12.4 Advantages and disadvantages of DSP platform
Advantages:
- Low power consumption
- Relatively high computational performance
- Low price
- Built-in telecommunications support (e.g. modem functions, A/D conversion)
Disadvantages:
- Less well suited to 'higher-level' complex aspects of processing
- Limited development support
- Limited operating system support
- Limited support for external devices (e.g. frame capture and display)

12.4 EMBEDDED PROCESSORS

The term 'embedded processor' usually refers to a processor or controller that is 'embedded' into a larger system, in order to provide programmable control and perhaps processing capabilities alongside more specialist, dedicated devices. Embedded processors are widely used in communications (mobile devices, network devices, etc.) and control applications (e.g. automotive control). Typical characteristics are:

- low power consumption;
- low cost;
- limited processing and addressing capabilities;
- limited word lengths;
- fixed-point arithmetic.

Until recently, an embedded processor would not have been considered suitable for video coding applications because of severely limited processing capabilities. However, in common with other types of processor, the processing 'power' of new generations of embedded processor continues to increase. Table 12.5 summarises the features of some popular embedded processors.

The popular ARM and MIPS processors are licensed as cores for integration into third-party systems. ARM is actively targeting low-power video coding applications, demonstrating 15 frames per second H.263 encoding and decoding (QCIF resolution) on an ARM9 [5] and developing co-processor hardware to further improve video coding performance.

Table 12.6 summarises the advantages and disadvantages of embedded processors for video coding applications. Embedded processors are of interest because of their large market penetration (for example, in the high-volume mobile telephone market). Running low-complexity video coding functions in software on an embedded processor (perhaps with limited dedicated hardware assistance) may be a cost-effective way of bringing video applications to mobile and wireless platforms. For example, the hand-held videophone is seen as a key application for the emerging '3G' high bit-rate mobile networks. Video coding on low-power embedded or DSP processors may be a key enabling technology for this type of device.
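Many of the DSP and embedded devices discussed above offer only fixed-point arithmetic, so fractional quantities such as filter taps or scaling factors are usually held in an integer format. The fragment below is a minimal, purely illustrative sketch of multiplication in the common Q15 format; it is not taken from any particular DSP or processor library.

```c
#include <stdio.h>

/* Q15 format: a 16-bit integer represents value / 32768.
   Multiplying two Q15 numbers gives a Q30 result, which is
   shifted back down to Q15. */
typedef short q15_t;

static q15_t q15_mul(q15_t a, q15_t b)
{
    long product = (long)a * (long)b;   /* Q30 intermediate */
    return (q15_t)(product >> 15);      /* back to Q15 */
}

int main(void)
{
    q15_t half = 16384;                 /* 0.5 in Q15 */
    q15_t three_quarters = 24576;       /* 0.75 in Q15 */
    q15_t result = q15_mul(half, three_quarters);
    printf("0.5 * 0.75 = %f\n", result / 32768.0);  /* prints 0.375 */
    return 0;
}
```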
Table 12.5 Embedded processor features
MIPS       4K series         Low power, 32-bit, up to approx. 400 MIPS, multiply-accumulate support (4KM)
ARM        ARM9 series       Low power, 16-bit, up to 220 MIPS
ARM/Intel  StrongARM series  Low power, 32-bit, up to 270 MIPS

Table 12.6 Advantages and disadvantages of embedded processors
Advantages:
- Low power consumption
- Low price
- High market penetration
- Good development tool support
- Increasing performance
Disadvantages:
- Limited performance
- Limited word lengths, arithmetic, address spaces
- (As yet) few features to support video processing

12.5 MEDIA PROCESSORS

DSPs have certain advantages over general-purpose processors for video coding applications; so-called 'media processors' go a step further by providing dedicated hardware functions that support video and audio compression and processing. The general concept of a media processor is a 'core' processor together with a number of dedicated co-processors that carry out application-specific functions. The architecture of the Philips TriMedia platform is shown in Figure 12.3.

The core of the TriMedia architecture is a very long instruction word (VLIW) processor. A VLIW processor can carry out operations on multiple data words (typically four 32-bit words in the case of the TriMedia) at the same time. This is a similar concept to the SIMD instructions described earlier (see for example Figure 12.1) and is useful for video and audio coding applications. Computationally intensive functions in a video CODEC such as motion estimation and DCT may be efficiently carried out using VLIW instructions.

Figure 12.3 TriMedia block diagram

Table 12.7 Advantages and disadvantages of media processors
Advantages:
- Good performance for video coding
- Application-specific features (e.g. co-processors)
- High-level language support
- Medium power consumption and cost
Disadvantages:
- Application-specific features may not support future coding standards
- Good performance requires extensive code optimisation
- Limited development tool support
- Limited market penetration

The co-processors in the TriMedia architecture are designed to reduce the computational burden on the 'core' by carrying out intensive operations in hardware. Available co-processor units, shown in Figure 12.3, include video and audio interfaces, memory and external bus interfaces, an image co-processor and a variable-length decoder (VLD). The image co-processor is useful for pre- and post-processing operations such as scaling and filtering, and the VLD can decode an MPEG-2 stream in hardware (but does not currently support other coding standards). With careful software design and optimisation, a video coding application running on the TriMedia can offer good performance at a modest clock speed whilst retaining some of the benefits of a general-purpose processor (including the ability to program the core processor in C or C++ software) [6].

The MAP processor developed by Equator and Hitachi is another media processor that has generated interest recently. The heart of the processor is a VLIW core, surrounded by peripheral units that deal with video I/O, communications, video filtering and variable-length coding.
According to the manufacturer, the MAP-CA can achieve impressive performance for video coding applications, for example encoding MPEG-2 Main Profile/Main Level video at 30 frames per second using 63% of the available processing resources [7]. This is higher than the reported performance of similar applications on the TriMedia.

Media processors have yet to capture a significant part of the market, and it is not yet clear whether the 'halfway house' between dedicated hardware and general-purpose software platforms will be a market winner. Table 12.7 summarises their main advantages and disadvantages for video coding.

12.6 VIDEO SIGNAL PROCESSORS

Video signal processors are positioned between media processors (which aim to process multiple media efficiently) and dedicated hardware CODECs (designed to deal with one video coding standard or a limited range of standards). A video signal processor contains dedicated units for carrying out common video coding functions (such as motion estimation, DCT/IDCT and VLE/VLD) but allows a certain degree of programmability, enabling a common platform to support a number of standards and to be at least partly 'future proof' (i.e. capable of supporting future extensions and new standards). An example is the VCPex offered by 8x8 Inc.: this is aimed at video coding applications (but also has audio coding support). The VCPex architecture (Figure 12.4) consists of two 32-bit data buses, labelled SRAM and DRAM. The SRAM bus is connected to the main controller (a RISC processor), a static RAM memory interface and other external interfaces. This 'side' of the VCPex deals with lower bit-rate data such as compressed video, graphics and also coded audio. The DRAM bus is connected to a dedicated video processor (the VP6), a dynamic RAM interface and video input and output ports. The DRAM 'side' deals with high bit-rate, uncompressed video and with most of the computationally intensive video coding operations. Variable-length encoding and decoding are handled by dedicated VLE and VLD modules. This partitioned architecture enables the VCPex to achieve good video coding and decoding performance with relatively low power consumption. Computationally intensive video coding functions (and pre- and post-processing) are handled by dedicated modules, but at the same time the MSC and VP6 processors may be reprogrammed to support a range of coding standards.

Figure 12.4 VCPex architecture

Table 12.8 Advantages and disadvantages of video signal processors
Advantages:
- Good performance for video coding
- Application-specific features
Disadvantages:
- Limited programmability
- Application-specific features may not support future coding standards (but generally more flexible than dedicated hardware)
- Reprogramming likely to require high effort
- Limited development tool support
- Cost tends to be relatively high for mass-market applications
- Dependent on a single manufacturer

Table 12.8 summarises the advantages and disadvantages of this type of processor. Video signal processors do not appear to be a strong force in the video communications market, perhaps because they can be outperformed by a dedicated hardware design whilst they do not offer the same flexibility as a media processor or general-purpose processor.

12.7 CUSTOM HARDWARE

General-purpose processors and (to a lesser extent) media and video signal processors sacrifice a certain amount of performance in order to retain flexibility and programmability.
A dedicated hardware design, optimised for a specific coding standard, is likely to offer the highest performance (in terms of video processing capacity and power consumption) at the expense of inflexibility.

The Zoran ZR36060 is a dedicated JPEG CODEC on a single chip, capable of encoding or decoding ITU-R 601 video at 25 or 30 frames per second using Motion JPEG (see Chapter 4). A block diagram of the IC is shown in Figure 12.5. During encoding, video is captured by a dedicated video interface and stored in a 'strip buffer' that stores eight lines of samples prior to block processing. The JPEG core carries out JPEG encoding and the coded bit stream is passed to a first in first out (FIFO) buffer prior to output via the CODE interface. Decoding follows the reverse procedure. Control and status interfacing with a host processor is provided via the HOST interface. The chip is designed specifically for JPEG coding: however, some programmability of encoding and decoding parameters and quantisation tables is supported via the host interface.

Figure 12.5 ZR36060 block diagram

Toshiba's TC35273 is a single-chip solution for MPEG-4 video and audio coding (Figure 12.6). Separate functional modules (on the left of the figure) handle MPEG-4 video coding and decoding (simple profile), audio coding and network communications, and each of these modules consists of a RISC controller and dedicated processing hardware. Video capture, display and filtering are handled by co-processing modules. The IC is aimed at low-power, low bit-rate video applications and can handle QCIF video coding and decoding at 15 frames per second with a power consumption of 240 mW. A reduced-functionality version of this chip, the TC35274, handles only MPEG-4 video decoding.

Figure 12.6 Toshiba TC35273 block diagram

Table 12.9 summarises the advantages and disadvantages of dedicated hardware designs. This type of CODEC is becoming widespread for mass-market applications such as digital television receivers and DVD players. One potential disadvantage is the reliance on a single manufacturer in a specialist market; this is perhaps less likely to be a problem with general-purpose processors and media processors as they are targeted at a wider market.

Table 12.9 Advantages and disadvantages of dedicated hardware CODECs
Advantages:
- High performance for video coding
- Optimised for target video coding standard
- Cost-effective for mass-market applications
Disadvantages:
- No support for other coding standards
- Limited control options
- Dependent on a single manufacturer

12.8 CO-PROCESSORS

A co-processor is a separate unit that is designed to work with a host processor (such as a general-purpose PC processor). The co-processor (or 'accelerator') carries out certain computationally intensive functions in hardware, removing some of the burden from the host.

Figure 12.7 DirectX VA architecture

PC video display card manufacturers have begun to add support for common video coding functions to the display card hardware, and a recent attempt to standardise the interface to this type of co-processor has led to the DirectX VA standard [8]. This aims to provide a standard API between a video decoding and display 'accelerator' and a host PC processor. The general architecture is shown in Figure 12.7.
Complex, standard-specific functions such as variable-length decoding and header parsing are carried out by the host, whilst computationally intensive functions (which are relatively regular and common to most standards) such as IDCT and motion compensation are 'offloaded' to the accelerator. The basic operation of this type of system is as follows:

1. The host decodes the bit stream and extracts rescaled block coefficients, motion vectors and header information.

2. This information is passed to the accelerator (using a standard API) via a set of data buffers.

3. The accelerator carries out IDCT and motion compensation and writes the reconstructed frame to a display buffer.

4. The display buffer is displayed on the PC screen and is also used as a prediction for further reconstructed frames.

Table 12.10 lists the advantages and disadvantages of this type of system. The flexibility of software programmability together with dedicated hardware support for key functions makes it an attractive option for PC-based video applications. Developers should benefit from the large PC market, which will tend to ensure competitive pricing and performance for the technology.

Table 12.10 Advantages and disadvantages of co-processor architecture
Advantages:
- Flexible support for computationally intensive decoding functions
- Supports all current DCT-based standards
- 'Front end' of decoder remains in software
- Large market for video display cards should lead to a number of alternative suppliers for this technology
Disadvantages:
- Dependent on specific platform and API
- Some intensive functions remain with host (e.g. VLD)
- Currently supports video decoding only

12.9 SUMMARY

Table 12.11 attempts to compare the merits of the processing platforms discussed in this chapter. It should be emphasised that the rankings in this table are not exact and there will be exceptions in a number of cases (for example, a high-performance DSP that consumes more power than a media processor). However, the general trend is probably correct: the best coding performance per milliwatt of consumed power should be achievable with a dedicated hardware design, but on the other hand PC and embedded platforms are likely to offer the maximum flexibility and the best development support due to their widespread usage.

Table 12.11 Comparison of platforms (approximate; each column ranked from best to worst)
Video coding performance: Dedicated hardware, Video signal processor, Media processor, PC processor, Digital signal processor, Embedded processor
Power consumption: Dedicated hardware, Embedded processor, Digital signal processor, Video signal processor, Media processor, PC processor
Flexibility: PC processor, Embedded processor, Digital signal processor, Media processor, Video signal processor, Dedicated hardware
Development support: PC processor, Embedded processor, Digital signal processor, Media processor, Video signal processor, Dedicated hardware

The recent trend is for a convergence between so-called 'dedicated' media processors and general-purpose processors, for example demonstrated by the development of SIMD/VLIW-type functions for all the major PC processors. This trend is likely to continue as multimedia applications and services become increasingly important. At the same time, the latest generation of video coding standards (MPEG-4, H.263+ and H.26L) require relatively complex processing (e.g. to support object-based coding and coding mode decisions), as well as repetitive signal processing functions such as block-based motion estimation and transform coding.
These higher-level complexities are easier to handle in software than in dedicatedhardware,and it may bethatdedicatedhardwareCODECs will becomeless important(exceptforspecialist,‘high-end’functions such as studioencoding)and that general-purpose processors will take caroef mass-market video coding applications (perhaps with media processors or co-processors to handle low-level signal processing). In the next chapter we examine thmeain issues that are facedby the designerof a software or hardware video CODEC, including issues common to both (such as interface require- ments) and the separate design goals for a software or hardware CODEC. REFERENCES 1. M. Mittal, A. Peleg and U. Weiser, ‘MMX technology architecture overview’, Intel Technology Journal, 3rd Quarter, 1997. 2. J. Abel et al., ‘Applications tuning for streaming SIMD extensions’, Intel Technology Journal, 2nd Quarter, 1999. 3. H. Miyazawa, H.263Encoder: TMS32OC6000 Implementation, Texas Instruments Application Report SPRA721, December 2000. 4. K. Leung, N. Yung and P. Cheung, ‘Parallelization methodology for video coding-an implementation on the TMS320C80’, IEEE Trans CSW, lO(X), December 2000. 5 . I. Thornton, MPEG-4 over Wireless Networks, ARM Inc. White Paper, 2000. 6. I. E. G. Richardson, K. Kipperman et al., ‘Video coding using digital signal processors’, Proc. DSP World Conference, Orlando, November 1999. 7. C. Basoglu et al., The MAP-CA VLIW-basedMediaProcessor, Equator Technologies Inc. White Paper, January 2000. http://www.equator.com 8. G. Sullivan and C. Fogg, ‘Microsoft Direct XVA: Video Acceleration APUDDI’, Windows Platform Design Note, Microsoft, January 2000. Video Codec Design Iain E. G. Richardson Copyright q 2002 John Wiley & Sons, Ltd ISBNs: 0-471-48553-5 (Hardback); 0-470-84783-2 (Electronic) Video CODEC Design 13.1 INTRODUCTION In this chapter we bring together some of the concepts discussed earlier and examine the issues faced by designers of video CODECs and systems that interface to video CODECs. Key issuesincludeinterfacing(theformat of theinput and outputdata,controllingthe operation of the CODEC), performance (frame rate, compression, quality), resource usage (computational resources, chip area)and design time. This last issue is important becausoef the fast pace of change in the market for multimedia communication systems. A short time-tomarket is critical for video coding applications and we discuss methods of streamlining the design flow. We present design strategiesfor two types of videoCODEC,asoftware implementation (suitable for a general-purpose processor) and a hardware implementation (for FPGA or ASIC). 13.2 VIDEOCODECINTERFACE Figure 13.1 shows the main interfaces to a video encoder and video decoder: Encoder input: frames of uncompressed video (from a frame grabberor other source); control parameters. Encoder output: compressed bit stream (adapted for the transmission network, see Chapter 11); status parameters. Decoder input: compressed bit stream; control parameters. Decoderoutput:frames parameters. of uncompressedvideo(sendtoadisplayunit);status A video CODEC is typically controlled by a ‘host’ application or processor that deals with higher-level application and protocol issues. 13.2.1 Video In/Out There are many options available forthe format of uncompressed video intothe encoder or out of the decoder and we list some examples here. (The four-character codes listed for options (a)and (b) are ‘FOURCC’ descriptors originallydefined as part of the AV1 video file format.) 
Figure 13.1 Video encoder (a) and decoder (b) interfaces

(a) YUY2 (4:2:2). The structure of this format is shown in Figure 13.2. A sample of Y (luminance) data is followed by a sample of Cb (blue colour difference), a second sample of Y, a sample of Cr (red colour difference), and so on. The result is that the chrominance components have the same vertical resolution as the luminance component but half the horizontal resolution (i.e. 4:2:2 sampling as described in Chapter 2). In the example in the figure, the luminance resolution is 176 x 144 and the chrominance resolution is 88 x 144.

Figure 13.2 YUY2 (4:2:2)

(b) YV12 (4:2:0) (Figure 13.3). The luminance samples for the current frame are stored in sequence, followed by the Cr samples and then the Cb samples. The Cr and Cb samples have half the horizontal and vertical resolution of the Y samples. Each colour pixel in the original image maps to an average of 12 bits (effectively one Y sample, one quarter of a Cr sample and one quarter of a Cb sample), hence the name 'YV12'. Figure 13.4 shows an example of a frame stored in this format, with the luminance array first followed by the half-width and half-height Cr and Cb arrays.

Figure 13.3 YV12 (4:2:0)

Figure 13.4 Example of YV12 data

(c) Separate buffers for each component (Y, Cr, Cb). The CODEC is passed a pointer to the start of each buffer prior to encoding or decoding a frame.

As well as reading the source frames (encoder) and writing the decoded frames (decoder), both encoder and decoder need to store one or more reconstructed reference frames for motion-compensated prediction. These frame stores may be part of the CODEC (e.g. internally allocated arrays in a software CODEC) or separate from the CODEC (e.g. external RAM in a hardware CODEC).

Memory bandwidth may be a particular issue for large frame sizes and high frame rates. For example, in order to encode or decode video at 'television' resolution (ITU-R 601, approximately 576 x 704 pixels per frame, 25 or 30 frames per second), the encoder or decoder video interface must be capable of transferring 216 Mbps. The data transfer rate may be higher if the encoder or decoder stores reconstructed frames in memory external to the CODEC. If forward prediction is used, the encoder must transfer data corresponding to three complete frames for each encoded frame, as shown in Figure 13.5: reading a new input frame, reading a stored frame for motion estimation and compensation, and writing a reconstructed frame. This means that the memory bandwidth at the encoder input is at least 3 x 216 = 648 Mbps for ITU-R 601 video. If two or more prediction references are used for motion estimation/compensation (for example, during MPEG-2 B-picture encoding), the memory bandwidth is higher still.

Figure 13.5 Memory access at encoder input
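The frame-store sizes and transfer rates quoted above follow directly from the frame dimensions. The short sketch below reproduces the reasoning for a 4:2:0 (YV12) frame store and for the ITU-R 601 example; the figures are illustrative of the calculation rather than of any particular CODEC.

```c
#include <stdio.h>

int main(void)
{
    /* 4:2:0 frame store: full-resolution Y plus quarter-resolution Cr, Cb */
    int width = 176, height = 144;              /* QCIF example */
    int y_bytes = width * height;
    int c_bytes = (width / 2) * (height / 2);   /* each chroma plane */
    int frame_bytes = y_bytes + 2 * c_bytes;    /* 12 bits per pixel */
    printf("QCIF 4:2:0 frame store: %d bytes\n", frame_bytes);

    /* ITU-R 601: luminance sampled at 13.5 MHz, with Cb and Cr together
       contributing another 13.5 Mbytes/s (4:2:2), giving 27 Mbytes/s,
       i.e. the 216 Mbps quoted in the text */
    double bits_per_sec = 13.5e6 * 2 * 8;
    printf("ITU-R 601 transfer rate: %.0f Mbps\n", bits_per_sec / 1e6);

    /* An encoder using forward prediction reads the input frame, reads a
       reference frame and writes a reconstructed frame: about 3x this rate */
    printf("Encoder memory bandwidth: about %.0f Mbps\n",
           3 * bits_per_sec / 1e6);
    return 0;
}
```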
13.2.2 Coded Data In/Out

Coded video data is a continuous sequence of bits describing the syntax elements of coded video, such as headers, transform coefficients and motion vectors. If modified Huffman coding is used, the bit sequence consists of a series of variable-length codes (VLCs) packed together; if arithmetic coding is used, the bits describe a series of fractional numbers, each representing a series of data elements (see Chapter 8). The sequence of bits must be mapped to a suitable data unit for transmission/transport, for example:

1. Bits: If the transmission channel is capable of dealing with an arbitrary number of bits, no special mapping is required. This may be the case for a dedicated serial channel but is unlikely to be appropriate for most network transmission scenarios.

2. Bytes or words: The bit sequence is mapped to an integral number of bytes (8 bits) or words (16 bits, 32 bits, 64 bits, etc.). This is appropriate for many storage or transmission scenarios where data is stored in multiples of a byte. The end of the sequence may need to be padded in order to make up an integral number of bytes.

3. Complete coded unit: Partition the coded stream along boundaries that make up coded units within the video syntax. Examples of these coded units include slices (sections of a coded picture in MPEG-1, MPEG-2, MPEG-4 or H.263+), GOBs (groups of blocks, sections of a coded picture in H.261 or H.263) and complete coded pictures. The integrity of the coded unit is preserved during transmission, for example by placing each coded unit in a network packet.

Figure 13.6 shows the locations of GOBs in a frame coded using H.263/MPEG-4. The coded units (GOBs in this case) correspond to regular areas of the original frame: however, when encoded, each GOB generates a different number of coded bits (due to variations in content within the frame). The result is that the GOBs generate the variable-size coded units shown in Figure 13.6. An alternative is to use irregular-sized slices (e.g. using the slice structured mode in H.263+ or the video packet mode in MPEG-4). Figure 13.7 shows slice boundaries that cover irregular numbers of macroblocks in the original frame and are chosen such that, when coded, each slice contains a similar number of coded bits.

Figure 13.6 GOB locations in a frame and variable-size coded units

Figure 13.7 Slice boundaries in a picture and constant-size coded units

13.2.3 Control Parameters

Some of the more important control parameters are listed here (CODEC application programming interfaces [APIs] might not provide access to all of these parameters).

Encoder

Frame rate May be specified as a number of frames per second or as a proportion of frames to skip during encoding (e.g. skip every second frame). If the encoder is operating in a rate- or computation-constrained environment (see Chapter 10), then this will be a target frame rate (rather than an absolute rate) that may or may not be achievable.

Frame size For example, a 'standard' frame size (QCIF, CIF, ITU-R 601, etc.) or a nonstandard size.

Target bit rate Required for encoders operating in a rate-controlled environment.
Quantiser step size If rate control is not used, a fixed quantiser step size may be specified: this will give near-constant video quality.

Mode control For example, 'inter' or 'intra' coding mode.

Optional mode selection MPEG-2, MPEG-4 and H.263 include a number of optional coding modes (for improved coding efficiency, improved error resilience, etc.). Most CODECs will only support a subset of these modes, and the choice of optional modes to use (if any) must be signalled or negotiated between the encoder and the decoder.

Start/stop encoding Start or stop the encoding of a series of video frames.

Decoder

Most of the parameters listed above are signalled to the decoder within the coded bit stream itself. For example, quantiser step size is signalled in frame/picture headers and (optionally) macroblock headers; frame rate is signalled by means of a timing reference in each picture header; mode selection is signalled in the picture header; and so on. Decoder control may be limited to 'start/stop'.

13.2.4 Status Parameters

There are many aspects of CODEC operation that may be useful as status parameters returned to the host application. These may include:

- actual frame rate (may differ from the target frame rate in rate- or computation-constrained environments);
- number of coded bits in each frame;
- macroblock mode statistics (e.g. number of intra/inter-macroblocks);
- quantiser step size for each macroblock (this may be useful for post-decoder filtering, see Chapter 9);
- distribution of coded bits (e.g. proportion of bits allocated to coefficients, motion vector data, header data);
- error indication (returned by the decoder when a transmission error has been detected, possibly with the estimated location of the error in the decoded frame).
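One way of picturing how such control and status parameters might be grouped at a CODEC's programming interface is sketched below. The structure layout and field names are purely illustrative and are not taken from any actual CODEC API.

```c
/* Illustrative encoder control and status structures: the fields mirror
   the parameters discussed in Sections 13.2.3 and 13.2.4. */
typedef struct {
    float target_frame_rate;   /* frames per second (may not be achieved) */
    int   frame_width;         /* e.g. 176 x 144 for QCIF */
    int   frame_height;
    int   target_bit_rate;     /* bits per second, if rate control is used */
    int   fixed_quantiser;     /* step size when rate control is disabled */
    int   intra_mode;          /* 1 = intra only, 0 = inter coding allowed */
    unsigned optional_modes;   /* bit flags for negotiated optional modes */
} EncoderControl;

typedef struct {
    float actual_frame_rate;   /* may differ from the target */
    int   bits_last_frame;     /* number of coded bits in the last frame */
    int   intra_macroblocks;   /* macroblock mode statistics */
    int   inter_macroblocks;
    int   error_detected;      /* set by a decoder on a transmission error */
} CodecStatus;
```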
Providea flexible API, perhapswithinastandardframework Chapter 12). such as DirectX (see 8. Ensure that code is robust (i.e. it functions correctly faonry video sequence, all allowable codingparameters and undertransmissionerrorconditions),maintainable and easily upgradeable (for example to add support for future coding modes and standards). DESIGN OF A SOlTWARE CODEC 279 Frame rate t I F Framesize Figure 13.8 Trade-off of frame size and frame rate in a software CODEC 9. Provide platform independencewhere possible. ‘Portable’ softwarethat may be executed on a number of platforms can have advantages for development, future migratiotno other platforms and marketability. However, achieving maximum performance may require some degree of platform-specific optimisation (such as the use of SIMDNLIW instructions). The first fourdesigngoalslistedabove may be mutually exclusive.Each of the goals (maximising frame rate, frame size, peak bit rate and video quality) requires an increased allocation of processing resources. A software video CODEC is usually constrained by the availableprocessingresources andor theavailabletransmission bit rate. In atypical scenario,the number of macroblocks of video that aCODEC can processis roughly constant (determined by either the available bit rate or the available processing resources). This means that increased framerate can only be achieved at the expenseof a smaller frame size and vice versa. Thegraph in Figure 13.8 illustrates thistrade-off between frame sizeand frame rate in a computation-constrained scenario. Itmay, however, be possible to ‘shift’ the line to the right (i.e.increaseframerate without reducingframesizeorviceversa) by making better use of the available computational resources. 13.3.2 Specification and Partitioning Based on the requirements of the syntax (for example, MPEG-2, MPEG-4 or H.263), an initial partitionof the functions required to encodaned decode a frameof video can be made. Figure 13.9 shows a simplified flow diagramforablocklmacroblock-basedinter-frame encoder (e.g. MPEG-1,MPEG-2, H.263 or MPEG-4) and Figure 13.10 shows the equivalent decoder flow diagram. The order of some of the operations is fixed by the syntax of the coding standards. It is necessary to carry out DCT and quantisation of each block within a macroblock before generating theVLCs for the macroblock header: this is because the header typically contains a ‘coded block pattern’ field that indicates which of the six blocks actually contain coded transform coefficients. Thereis greater flexibility in decidingthe order of some of the other 280 VIDEO CODEC DESIGN 7 Starl (frame) Pre-process Set rate control parameters Code plcture header 14 Motion estimate and Compensate macroblock DCT, Quantlse block Rescale. IDCT block Repeat for 6 blocks Code macroblock header + motion Repeat for all macroblocks Reorder, Run Length Encode Varlable Lecgth Encode block Repeat for 6 blocks ReconstrUCt macroblock Figure 13.9 Flow diagram: software encoder DESIGN OF A SOFTWARE CODEC 281 vStart (frame) Decode picture header macroblock header Decode ~ ~ Variable Length Decode Run Length Decode, Reorder Repeat for all macroblocks Repeat for 6 blocks Rescale, IDCT Reconstnct macroblock 1 Post-process f rarne 7i Figure13.10 Flow diagram: software decoder - + + 282 Frames VIDEO CODEC DESIGN Motionestimate and DCT _.+ Quantise ~ ‘lgzag’ RLE Frames Reconstrict l j j1 +! t IDCT ! .! i ~ i I 1I ~ *-J l t-! 
*.., + t Rescale RLD, Reorder VLE - VLD Figure 13.11 Encoderand decoder interoperating points operations. An encoder may choose to canyout motion estimation and compensation for the entire frame before carrying out the block-level operations (DCT, quantise, etc.), instead of codingthe blocks immediately after motion compensating the macroblock. Similarly, an encoder or decoder may choose to reconstruct each motion-compensated macroblock either immediately after decoding the residual blocks or after the entire residual frame has been decoded. The following principles can help to decide the structure of the software program: 1. Minimise interdependencies between coding functions inorder to keep the software modular. 2. Minimise data copying between functions (since each copy adds computation). 3. Minimise function-calling overheads. This may involve combining functions, leading to less modular code. 4. Minimise latency. Coding and transmitting each macroblock immediately after motion estimation and compensationcan reduce latency. The coded data maybe transmitted immediately, rather than waiting untilthe entire frame has been motion-compensated before coding and transmitting the residual data. 13.3.3 Designing the Functional Blocks A good approach is to start with the simplest possible implementation of each algorithm (for example, the basic form of the DCTshown in Equation 7.1) in order to develop a functional CODEC as quickly as possible. The first ‘pass’ of the design will result in a working, but very inefficient, CODEC and the performance can then be improved by replacing the basic algorithms with ‘fast’ algorithms. The first version of the design may be used as a ‘benchmark’ to ensure that later, faster versions still meet the requirements of the coding standard. Designing the encoder and decoder in tandem and taking advantage of ‘natural’ points at which the twodesignscan interwork may further easethe design process. Figure 13.11 shows some examples of interworking points. For example, the residual frame produced after encoder motion compensation may be ‘fed’ to the decoder motion reconstruction function and the decoder output frame should match the encoder input frame. DESIGN OF A SOFTWARE CODEC 283 13.3.4 Improving Performance Once abasic working CODEC has been developed, the aim is tiomprove the performancein order to meetthe design goalsdiscussed above. This mayinvolve someor all of the following steps: 1. Carry out software profiling to measure the performance of individual functions. This is normally carriedoutautomatically by the compiler inserting timing codeintothe software and measuringtheamount of time spentwithin eachfunction.This process identifies ‘critical’ functions, i.e. those that take the most execution time. 2. Replacecriticalfunctions with ‘fast’ algorithms. Typically,functionssuch as motion estimation, DCT and Variable-length coding are computationally critical. The choice of ‘fast’ algorithm depends on the platform and to some extent the design structure of the CODEC. It is often good practice to compareseveral alternative algorithms and to choose the best. 3. Unroll loops. See Section 6.8.1 for an example of how a motion estimation function may be redesigned to reduce the overhead due to incrementing a loop counter. 4. Reducedatainterdependencies. Many processors have the ability to execute multiple operations in parallel (e.g. using SIMDNLIW instructions); however, this isonly possible if the operations are working on independent data. 5. 
Consider combining functions to reduce function calling overheads and data copies. For example, a decoder carries out inverse zigzag ordering of a block followed by inverse quantisation. Each operation involves a movement of data from one array into another, together with the overhead of calling and returning from a function. By combining the two functions, data movement and function calling overhead is reduced. 6. For computationally critical operations (such as motion estimation),consider using platform-specific optimisations such asinlineassemblercode,compilerdirectives or platform-specific library functions (such as Intel’s image processing library). Applying some or all of these techniques can dramatically improve performance. However, these approaches canlead to increaseddesign time, increased compiledcode size (for example, due to unrolled loops) and complex software code that is difficult to maintain or modify. Example An H.263 CODEC was developed for the TriMedia TMlOOO platform.’ After the ‘first pass’ of the software design process (i.e. without detailed optimisation), the CODEC ran at the unacceptably low rate of 2 CIF frames per second. After reorganising the software (combining functionsandremoving interdependencies between data), executionspeed was increased to 6 CIF frames per second. Applyingplatform-specificoptimisation of critical functions (using the TriMedia VLIW instructions) gave a further increase to 15 CIF frames per second (an acceptable rate for video-conferencing applications). 284 DESIGN CODEC VIDEO 13.3.5 Testing In addition to the normal requirements for software testing, the following areas should be checked for a video CODEC design: 0 Interworking between encoder and decoder (if both are being developed). 0 Performance with arange of videomaterial(including ’live’ video if possible),since some‘bugs’ may only show up undercertainconditions(forexample, an incorrectly decoded VLC may only occur occasionally). 0 Interworking with third-party encoder(s) and decoder(s). Recent video coding standards have software‘testmodels’availablethataredevelopedalongsidethestandard and provide a useful reference for interoperability tests. 0 Decoder performance under error conditions,such as random bit errors and packet losses. To aid in debugging, it can be useful to provide a ‘trace’ mode in which each of the main coding functions records its data to a log file. Without this type of mode, it can bevery difficult to identify the causeof a software error (sayb)y examining the stream of coded bits. A real-timetestframework which enables ‘live’ videofromacamera to be coded and decoded in real time using the CODEC under development can be very useful for testing purposes, as can be bit-stream analysis tools (such as ‘MPEGTool’) that provide statistics about a coded video sequence. Someexamples of efficientsoftwarevideoCODECimplementations have been disOpportunities have been examined for parallelising video coding algorithms for multiple-processor platform^,^-^ and a method has been described for splitting a CODEC implementationbetweendedicatedhardware and software.8 In thenextsection wewill discuss approaches to designing dedicated VLSI video CODECs. 13.4 DESIGN OF A HARDWARE CODEC The design process for a dedicated hardware implementation is somewhat different, though many of the design goals are similar to those for a software CODEC. 13.4.1 Design Goals Design goals for a hardware CODEC may include: 1. Maximise frame rate. 2. Maximise frame size. 3. 
Maximise peak coded bit rate. 4. Maximise video quality for a given coded bit rate. 5. Minimise latency. 6. Minimise gate countldesign ‘area’, on-chip memory and/or power consumption. DESIGN OF A HACRODDWEACRE 285 7. Minimiseoff-chipdatatransfers(‘memorybandwidth’) performance ‘bottleneck’ for a hardware design. as thesecanoftenactasa 8. Provide aflexible interface to the host system (very often a processorrunning higher-level application software). In a hardware design, trade-offs occur between the first four goals (maximise frame rate/ frame size/peakbit rate/quality) and numbers (6) and (7) above (minimise gate count/power consumption and memory bandwidth). As discussed in Chapters 6-8, thereare many alternative architectures for the key coding functions such as motion estimation, DCT and variable-length coding, but higher performance often requires an increased gate count. An important constraint is the cycle budget for each coded macroblock. This can be calculated based on the target frame rate and frame size and the clock speed of the chosen platform. Example Target framesize:QCIF(99macroblocksperframe, Target framerate3: 0framespersecond Csploecekd: 20 MHz H.263MPEG-4 coding) Macroblocks per second: Clock cycles per macroblock: 99 X 30 = 2970 20 x 106/2970 = 6374 This means that all macroblock operations must be completed within 6374 clock cycles. If the various operations(motionestimation,compensation, DCT, etc.)arecarried out serially then the sum total for all operationsmust not exceed this figure;if the operations are pipelined (see below) then any one operation must not take more than 6374 cycles. 13.4.2 Specification and Partitioning The same sequenceof operations listedin Figures 13.9 and 13.10need to be carried out by a hardware CODEC. Figure 13.12 shows an example of a decoder that uses a ‘common bus’ Motion Motion estimator compensator A 4 .. FDCT/IDCT .etc. A V V V tt l . . ‘ : . 1$ interface mController Figure 13.12 Common bus architecture 286 VIDEO CODEC DESIGN Motion E! Controller RAM - 4estimator l Motion I compensator __+ FDCT 4 Quantise 4 Reorder l RLE ---+ ...etc. Figure 13.13 Pipelined architecture architecture. This type of architecture may be flexible and adaptable but the performance may be constrained by data transfer over the bus and scheduling of the individual processing units. A fully pipelined architecture sucahs the example in Figur1e3.13has the potential to give high performance due to pipelined execution by the separate functional units. However, this type of architecture may require significant redesign in order to support a different coding standard or a new optional coding mode. A further consideration for a hardware design is the partitioning between the dedicated hardware and the ‘host’ processor. A ‘co-processor’ architecture such as that described in the DirectX VA framework (see Chapter 13) implies close interworking between the host and the hardware on a macroblock-by-macroblock basis. An alternative approach is to move more operations into hardware, for example by allowing the hardware to process a complete frame of video independently of the host. 13.4.3 Designing the Functional Blocks The choice of design for each functional block depends on the design goals (e.g. low area and/or power consumption vs. high performance) and to a certain extent on the choice of architecture. A ‘common bus’-type architecture may lend itself to the reuse of certain ‘expensive’ processing elements. 
13.4.2 Specification and Partitioning

The same sequence of operations listed in Figures 13.9 and 13.10 needs to be carried out by a hardware CODEC. Figure 13.12 shows an example of a decoder that uses a 'common bus' architecture.

Figure 13.12 Common bus architecture (block diagram: motion estimator, motion compensator, FDCT/IDCT and other processing units sharing a common bus, with a controller and host interface)

This type of architecture may be flexible and adaptable but the performance may be constrained by data transfer over the bus and scheduling of the individual processing units. A fully pipelined architecture such as the example in Figure 13.13 has the potential to give high performance due to pipelined execution by the separate functional units. However, this type of architecture may require significant redesign in order to support a different coding standard or a new optional coding mode.

Figure 13.13 Pipelined architecture (block diagram: motion estimator, motion compensator, FDCT, quantise, reorder, RLE and further stages arranged as a pipeline, with a controller and RAM)

A further consideration for a hardware design is the partitioning between the dedicated hardware and the 'host' processor. A 'co-processor' architecture such as that described in the DirectX VA framework (see Chapter 13) implies close interworking between the host and the hardware on a macroblock-by-macroblock basis. An alternative approach is to move more operations into hardware, for example by allowing the hardware to process a complete frame of video independently of the host.

13.4.3 Designing the Functional Blocks

The choice of design for each functional block depends on the design goals (e.g. low area and/or power consumption vs. high performance) and to a certain extent on the choice of architecture. A 'common bus'-type architecture may lend itself to the reuse of certain 'expensive' processing elements. Basic operations such as multiplication may be reused by several functional blocks (e.g. DCT and quantise). With the 'pipelined' type of architecture, individual modules do not usually share processing elements and the aim is to implement each function as efficiently as possible, for example using slower, more compact distributed designs such as the distributed arithmetic architecture described in Chapter 7. In general, regular, modular designs are preferable both for ease of design and for efficient implementation on the target platform. For example, a motion estimation algorithm that maps to a regular hardware design (e.g. hierarchical search) may be preferable to less regular algorithms such as nearest-neighbours search (see Chapter 6).

13.4.4 Testing

Testing and verification of a hardware CODEC can be a complicated process, particularly since it may be difficult to test with 'real' video inputs until a hardware prototype is available. It may be useful to develop a software model that matches the hardware design, to assist in generating test vectors and checking the results (a minimal sketch of this approach is given below). A real-time test bench, where a hardware design is implemented on a reprogrammable FPGA in conjunction with a host system and video capture/display capabilities, can support testing with a range of real video sequences. VLSI video CODEC design approaches and examples have been reviewed [9, 10] and two specific design case studies presented [11, 12].
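One simple way to apply the 'matching software model' idea is to have the reference model write its expected output (for example, reconstructed macroblocks or coded bits) to a file and to compare the hardware or RTL simulation output against it byte by byte. The sketch below assumes this file-based approach; the function name, file names and output format are illustrative assumptions rather than part of any particular design flow.

    #include <stdio.h>
    #include <stdlib.h>

    /* Compare a 'device under test' output file against the output of a
     * bit-exact software reference model. Only the first few mismatches
     * are printed; files are compared up to the length of the shorter one. */
    int compare_vectors(const char *ref_name, const char *dut_name)
    {
        FILE *ref = fopen(ref_name, "rb");
        FILE *dut = fopen(dut_name, "rb");
        long  pos = 0, mismatches = 0;
        int   a, b;

        if (!ref || !dut) { perror("fopen"); exit(1); }

        while ((a = fgetc(ref)) != EOF && (b = fgetc(dut)) != EOF) {
            if (a != b && mismatches++ < 10)
                printf("Mismatch at byte %ld: ref=%02x dut=%02x\n", pos, a, b);
            pos++;
        }
        printf("%ld bytes compared, %ld mismatches\n", pos, mismatches);

        fclose(ref);
        fclose(dut);
        return mismatches == 0;   /* non-zero if the outputs match exactly */
    }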
13.5 SUMMARY

The design of a video CODEC depends on the target platform, the transmission environment and the user requirements. However, there are some common goals and good design practices that may be useful for a range of designs. Interfacing to a video CODEC is an important issue, because of the need to efficiently handle a high bandwidth of video data in real time and because flexible control of the CODEC can make a significant difference to performance. There are many options for partitioning the design into functional blocks and the choice of partition will affect the performance and modularity of the system. A large number of alternative algorithms and designs exist for each of the main functions in a video CODEC. A good design approach is to use simple algorithms where possible and to replace these with more complex, optimised algorithms in performance-critical areas of the design. Comprehensive testing with a range of video material and operating parameters is essential to ensure that all modes of CODEC operation are working correctly.

REFERENCES

1. I. Richardson, K. Kipperman and G. Smith, 'Video coding using digital signal processors', DSP World Fall Conference, Orlando, 1999.
2. J. McVeigh et al., 'A software-based real-time MPEG-2 video encoder', IEEE Trans. CSVT, 10(7), October 2000.
3. S. Akramullah, I. Ahmad and M. Liou, 'Optimization of H.263 video encoding using a single processor computer', IEEE Trans. CSVT, 11(8), August 2001.
4. B. Erol, F. Kossentini and H. Alnuweiri, 'Efficient coding and mapping algorithms for software-only real-time video coding at low bit rates', IEEE Trans. CSVT, 10(6), September 2000.
5. N. Yung and K. Leung, 'Spatial and temporal data parallelization of the H.261 video coding algorithm', IEEE Trans. CSVT, 11(1), January 2001.
6. K. Leung, N. Yung and P. Cheung, 'Parallelization methodology for video coding - an implementation on the TMS320C80', IEEE Trans. CSVT, 10(8), December 2000.
7. A. Hamosfakidis, Y. Paker and J. Cosmas, 'A study of concurrency in MPEG-4 video encoder', Proceedings of IEEE Multimedia Systems '98, Austin, Texas, July 1998.
8. S. D. Kim, S. K. Jang, J. Lee, J. B. Ra, J. S. Kim, U. Joung, G. Y. Choi and J. D. Kim, 'Efficient hardware-software co-implementation of H.263 video CODEC', Proc. IEEE Workshop on Multimedia Signal Processing, pp. 305-310, Redondo Beach, Calif., 7-9 December 1998.
9. P. Pirsch, N. Demassieux and W. Gehrke, 'VLSI architectures for video compression - a survey', Proceedings of the IEEE, 83(2), February 1995.
10. P. Pirsch and H.-J. Stolberg, 'VLSI implementations of image and video multimedia processing systems', IEEE Transactions on Circuits and Systems for Video Technology, 8(7), November 1998, pp. 878-891.
11. P. Pirsch and H.-J. Stolberg, 'VLSI implementations of image and video multimedia processing systems', IEEE Transactions on Circuits and Systems for Video Technology, 8(7), November 1998, pp. 878-891.
12. A. Y. Wu, K. J. R. Liu, A. Raghupathy and S. C. Liu, System Architecture of a Massively Parallel Programmable Video Co-Processor, Technical Report ISR TR 95-34, University of Maryland, 1995.

14 Future Developments

14.1 INTRODUCTION

This book has concentrated on the design of video CODECs that are compatible with current standards (in particular, MPEG-2, MPEG-4 and H.263) and on the current 'state of the art' in video coding technology [1]. Video coding is a fast-moving subject and current research in the field moves beyond the bounds of the international standards; at the same time, improvements in processing technology will soon make it possible to implement techniques that were previously considered too complex. This final chapter reviews trends in video coding standards, research and platforms.

14.2 STANDARDS EVOLUTION

The ISO MPEG organisation is at present concentrating on two main areas: updates to existing standards and a new standard, MPEG-21. MPEG-4 is a large and complex standard with many functions and tools that go well beyond the basic H.263-like functionality of the popular 'simple profile' CODEC. It was originally designed with continual evolution in mind: as new techniques and applications become mature, extra tools and profiles continue to be added to the MPEG-4 set of standards. Recent work, for example, has included new profiles that support some of the emerging Internet-based applications for MPEG-4. Some of the more advanced elements of MPEG-4 (such as sprite coding and model-based coding) are not yet widely used in practice, partly for reasons of complexity. As these elements become more popular (perhaps due to increased processor capabilities), it may be that their description in the standard will need to be modified and updated.

MPEG-21 [2] builds on the coding tools of MPEG-4 and the content description tools of the MPEG-7 standard to provide a 'framework' for multimedia communication. The MPEG committee has moved beyond the details of coding and description to an ambitious effort to standardise aspects of the complete multimedia 'delivery chain', from creation to 'consumption' (viewing or interacting with the data). This process may include the standardisation of new coding and compression tools.

The Video Coding Experts Group of the ITU continues to develop the H.26x series of standards. The recently added Annexes V, W and X of H.263 are expected to be the last major revisions to this standard.
The main ongoing effort is to finalise the first version of H.26L: the core tools of the standard (described in Chapter 5) are reasonably well defined, but there is further work required to convert these into a published international standard. The technical aspects of H.26L were scheduled to be finalised during 2001. However, there is now an initiative between MPEG and VCEG to jointly develop a new coding standard based on H.26L [3].

14.3 VIDEO CODING RESEARCH

Video coding technology remains a very active area for researchers. Research in this field falls into two main categories: 'applied' research into the practical implementation of established video coding techniques and 'speculative' research into new and emerging coding algorithms. As a guide to the subjects that are currently popular in the research community, it is interesting to examine the papers presented at the 2001 Picture Coding Symposium (a specialist forum for image and video coding research). The total of 110 papers included:

- 22 papers on the implementation and optimisation of the popular block DCT-based video coding standards;
- 11 papers on transmission issues;
- 7 papers on quality measurement and quality metrics;
- 22 papers on content-based and object-based coding (including MPEG-4 object-based coding);
- 5 papers on wavelet-based coding of video sequences;
- 5 papers on coding of 3D/multi-view video.

(Note that some papers were difficult to categorise.) This cross-section of topics implies that much of the current research effort focuses on practical implementation issues for the popular block-based coding standards. The object-based functions of the MPEG-4 standard attract a lot of research interest and the feeling is that there are still a number of practical problems to solve (such as reliable, automatic segmentation of video scenes into video object planes) before these tools become widely adopted by the multimedia industry. A surprisingly small number of papers were presented on 'blue sky' research into novel coding methods. It is important to research and develop the next generation of video coding algorithms; at the same time, there is clearly a lot of scope for improving and optimising the current generation of coding technology.

14.4 PLATFORM TRENDS

Chapter 12 summarised the key features of a range of platforms for video CODEC implementation. There is some evidence of convergence between some of these platforms; for example, PC processor manufacturers continue to add instructions and features that were formerly encountered in special-purpose video or media processors. However, it is likely that there will continue to be distinct classes of platform for video coding, possibly along the following lines:

1. PC processors with media processing functions and increasing use of hardware co-processing (e.g. in video display cards).
2. More 'streamlined' processors (e.g. embedded processors with internal or external multimedia support, or media processors) for embedded multimedia applications.
3. Dedicated hardware CODECs (with limited programmability) for efficient implementation of 'mass-market' applications such as digital TV decoding.

There is still a place in the market for dedicated hardware designs but at the same time there is a trend towards flexible, embedded designs for new applications such as mobile multimedia.
The increasing use of 'system on a chip' (SoC) techniques, with which a complex IC design can be rapidly put together from Intellectual Property building blocks, should make it possible to quickly reconfigure and redesign a 'dedicated' hardware CODEC. This will be necessary if dedicated designs are to continue to compete with the flexibility of embedded or general-purpose processors.

14.5 APPLICATION TRENDS

Predicting future directions for multimedia applications is notoriously difficult. Few of the 'interactive' applications that were proposed in the early 1990s, for example, have gained a significant market presence. The largest markets for video coding at present are probably digital television broadcasting and DVD-video (both utilising MPEG-2 coding). Internet video is gaining popularity, but is hampered by the limited Internet connections experienced by most users. There are some signs that MPEG-4 coding for video compression, storage and playback may experience a boom in popularity similar to MPEG Layer 3 Audio ('MP3' audio). However, much work needs to be done on the management and protection of intellectual property rights before this can take place.

Videoconferencing via the Internet (typically using the H.323 protocol family) is becoming more widely used and may gain further acceptance with increases in processor and connection performance. It has yet to approach the popularity of communication via voice, e-mail and text messaging. There are two application areas that are currently of interest to developers and communications providers, at opposite ends of the bandwidth spectrum:

1. Very low power, very low bandwidth video for hand-held mobile devices (one of the hoped-for 'killer applications' for the costly third-generation mobile networks). The challenge here is to provide usable, low-cost video services that could match the popularity of mobile telephony.

2. High bandwidth, high quality video coding for applications such as:

(a) 'Immersive' video conferencing, for example displaying conference participants on a video 'wall' as if they were sitting across a table from each other. The eventual goal is a video conference meeting that is almost indistinguishable from a face-to-face meeting.

(b) High definition television (HDTV, approximately twice the resolution of ITU-R 601 'standard' digital television). Coding methods (part of MPEG-2) have been standardised for several years but this technology has not yet taken hold in the marketplace.

(c) Digital cinema offers an alternative to the reels of projector film that are still used for distribution and display of cinema films. There is currently an effort by the MPEG committee (among others) to develop standard(s) to support cinema-quality coding of video and audio. MPEG's requirements document for digital cinema [4] specifies 'visually lossless' compression (i.e. no loss should be discernible by a human observer in a movie theatre) of frames containing up to 16 million pixels at frame rates of up to 150 Hz. In comparison, an ITU-R 601 frame contains around 0.5 million pixels. Coding and decoding at cinema fidelity are likely to be extremely demanding and will pose some difficult challenges for CODEC developers (the rough calculation sketched below gives a feel for the raw data rates involved).
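A rough calculation gives a feel for the uncompressed data rates behind these figures. The sketch below uses the pixel counts quoted above together with two simplifying assumptions that are not taken from any standard: 3 bytes per pixel and a 25 Hz frame rate for ITU-R 601 material.

    #include <stdio.h>

    /* Rough, illustrative comparison of uncompressed data rates:
     * digital cinema (16 million pixels at 150 Hz) versus ITU-R 601
     * (about 0.5 million pixels at an assumed 25 Hz), both at an
     * assumed 3 bytes per pixel. */
    int main(void)
    {
        double cinema_bps = 16e6 * 150 * 3;   /* bytes per second */
        double itu601_bps = 0.5e6 * 25 * 3;   /* bytes per second */

        printf("Digital cinema (uncompressed): %.1f GByte/s\n", cinema_bps / 1e9);
        printf("ITU-R 601 (uncompressed):      %.1f MByte/s\n", itu601_bps / 1e6);
        printf("Ratio: roughly %.0f to 1\n", cinema_bps / itu601_bps);
        return 0;
    }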
An interesting by-product of the 'mainstream' video coding applications and standards is the growing list of new and innovative applications for digital video. Some examples include the use of 'live' video in computer games; video 'chat' on a large scale with multiple participants; video surveillance in increasingly hostile environments (such as in an oil well or inside the body of a patient); 3-D video conferencing; video conferencing for groups with special requirements (for example deaf users); and many others.

Early experiences have taught designers of digital video applications that an application will only be successful if users find it to be a usable, useful improvement over existing technology. In many cases the design of the user interface is as important as, or more important than, the efficiency of a video coding algorithm. Usability is a vital but often overlooked requirement for any new video-based application.

14.6 VIDEO CODEC DESIGN

The aim of this book has been to introduce readers to the concepts, standards, design techniques and practical considerations behind the design of video coding and communication systems. A question that is often raised is whether the huge worldwide effort in video coding research and development will continue to be necessary, since transmission bandwidths may perhaps reach the point at which compression becomes unnecessary.

Video and multimedia applications have only begun to make a significant impact on businesses and consumers since the late 1990s. Despite continued improvements in resources such as processing power, storage and bandwidth, these resources continue to be stretched by increasing demands for high-quality, realistic multimedia communications with more functionality. There is still a large gap between the expectations of the user and the capabilities of present-day video applications and this gap shows no sign of diminishing. As digital video increases its share of the market, consumer demands for higher-quality, richer multimedia services will continue to increase. Bridging the gap (providing better quality and functionality within the limits of bandwidth and processing power) requires, among other things, continued improvements in video CODEC design.

In the past, market researchers have overestimated the rate of take-up of multimedia applications such as digital TV and video conferencing, and it remains to be seen whether there is a real demand for some of the newer video services such as mobile video. Some interesting trends (for example, the continued popularity of MJPEG video CODECs because of their design simplicity and inherent error resilience) imply that the video communications market is likely to continue to be driven more by user needs than by impressive research developments. This in turn implies that only some of the recent developments in video coding (such as object-based coding, content-based tools, media processors and so on) will survive. However, video coding will remain a core element of the growing multimedia communications market. Platforms, algorithms and techniques for video coding will continue to change and evolve. It is hoped that this book will help to make the subject of video CODEC design accessible to a wider audience of designers, developers, integrators and users.

REFERENCES

1. T. Ebrahimi and M. Kunt, 'Visual data compression for multimedia applications: an overview', Proceedings of the IEEE, 86(6), June 1998.
2. ISO/IEC JTC1/SC29/WG11 N4318, 'MPEG-21 overview', Sydney, July 2001.
3. ITU-T Q6/SG16 VCEG-L45, 'H.26L Test Model Long-term number 6 (TML-6) draft 0', March 2001.
4. ISO/IEC JTC1/SC29/WG11 N4331, 'Digital cinema requirements', Sydney, July 2001.
Bibliography

1. Bhaskaran, V. and K. Konstantinides, Image and Video Compression Standards: Algorithms and Architectures, Kluwer, 1997.
2. Ghanbari, M., Video Coding: An Introduction to Standard Codecs, IEE Press, 1999.
3. Girod, B., G. Greiner and H. Niemann (eds), Principles of 3D Image Analysis and Synthesis, Kluwer, 2000.
4. Haskell, B., A. Puri and A. Netravali, Digital Video: An Introduction to MPEG-2, Chapman & Hall, 1996.
5. Netravali, A. and B. Haskell, Digital Pictures: Representation, Compression and Standards, Plenum Press, 1995.
6. Parhi, K. K. and T. Nishitani (eds), Digital Signal Processing for Multimedia Systems, Marcel Dekker, 1999.
7. Pennebaker, W. B. and J. L. Mitchell, JPEG: Still Image Data Compression Standard, Van Nostrand Reinhold, 1993.
8. Pennebaker, W. B., J. L. Mitchell, C. Fogg and D. LeGall, MPEG Digital Video Compression Standard, Chapman & Hall, 1997.
9. Puri, A. and T. Chen (eds), Multimedia Systems, Standards and Networks, Marcel Dekker, 2000.
10. Rao, K. R. and J. J. Hwang, Techniques and Standards for Image, Video and Audio Coding, Prentice Hall, 1997.
11. Rao, K. R. and P. Yip, Discrete Cosine Transform, Academic Press, 1990.
12. Riley, M. J. and I. G. Richardson, Digital Video Communications, Artech House, February 1997.

Glossary

4:2:0 (sampling): sampling method in which the chrominance components have half the horizontal and vertical resolution of the luminance component
4:2:2 (sampling): sampling method in which the chrominance components have half the horizontal resolution of the luminance component
4:4:4 (sampling): sampling method in which the chrominance components have the same resolution as the luminance component
API: application programming interface
arithmetic coding: coding method to reduce redundancy
artefact: visual distortion in an image
BAB: binary alpha block, indicates the boundaries of a region (MPEG-4 Visual)
baseline (CODEC): a codec implementing a basic set of features from a standard
block matching: motion estimation carried out on rectangular picture areas
blocking: square or rectangular distortion areas in an image
B-picture: coded picture predicted using bidirectional motion compensation
channel coding: error control coding
chrominance: colour difference component
CIF: common intermediate format, a colour image format
CODEC: COder/DECoder pair
colour space: method of representing colour images
DCT: discrete cosine transform
DFD: displaced frame difference (residual image after motion compensation)
DPCM: differential pulse code modulation
DSCQS: double stimulus continuous quality scale, a scale and method for subjective quality measurement
DVD: digital versatile disk
DWT: discrete wavelet transform
entropy coding: coding method to reduce redundancy
error concealment: post-processing of a decoded image to remove or reduce visible error effects
field: odd- or even-numbered lines from a video image
flowgraph: pictorial representation of a transform algorithm (or the algorithm itself)
full search: a motion estimation algorithm
GOB: group of blocks, a rectangular region of a coded picture
GOP: group of pictures, a set of coded video images
H.261: standard for video coding
H.263: standard for video coding
H.26L: 'long-term' standard for video coding
HDTV: high definition television
Huffman coding: coding method to reduce redundancy
HVS: human visual system, the system by which humans perceive and interpret visual images
inter-frame (coding): coding of video frames using temporal prediction or compensation
interlaced (video): video data represented as a series of fields
intra-frame (coding): coding of video frames without temporal prediction
ISO: International Standards Organisation
ITU: International Telecommunication Union
ITU-R 601: a colour video image format
JPEG: Joint Photographic Experts Group, a committee of ISO; also an image coding standard
JPEG-2000: an image coding standard
KLT: Karhunen-Loeve transform
latency: delay through a communication system
loop filter: spatial filter placed within encoding or decoding feedback loop
MCU: multi-point control unit, controls a multi-party conference
media processor: processor with features specific to multimedia coding and processing
memory bandwidth: data transfer rate to/from RAM
MJPEG: system of coding a video sequence using JPEG intra-frame compression
motion compensation: prediction of a video frame with modelling of motion
motion estimation: estimation of relative motion between two or more video frames
motion vector: vector indicating a displaced block or region to be used for motion compensation
MPEG: Motion Picture Experts Group, a committee of ISO
MPEG-1: a video coding standard
MPEG-2: a video coding standard
MPEG-4: a video coding standard
objective quality: visual quality measured by algorithm(s)
OBMC: overlapped block motion compensation
profile: a set of functional capabilities (of a video CODEC)
progressive (video): video data represented as a series of complete frames
pruning (transform): reducing the number of calculated transform coefficients
PSNR: peak signal to noise ratio, an objective quality measure
QCIF: quarter common intermediate format
QoS: quality of service
quantise: reduce the precision of a scalar or vector quantity
rate control: control of bit rate of encoded video signal
rate-distortion: measure of CODEC performance (distortion at a range of coded bit rates)
RGB: red/green/blue colour space
ringing (artefacts): 'ripple'-like artefacts around sharp edges in a decoded image
RTP: real-time protocol, a transport protocol for real-time data
RVLC: reversible variable length code
scalable coding: coding a signal into a number of layers
short header (MPEG-4): a coding mode that is functionally identical to H.263 ('baseline')
SIMD: single instruction multiple data
slice: a region of a coded picture
statistical redundancy: redundancy due to the statistical distribution of data
subjective quality: visual quality as perceived by human observer(s)
subjective redundancy: redundancy due to components of the data that are subjectively insignificant
sub-pixel (motion compensation): motion-compensated prediction from a reference area that may be formed by interpolating between integer-valued pixel positions
test model: a software model and document that describe a reference implementation of a video coding standard
TSS: three-step search, a motion estimation algorithm
VCA: variable complexity algorithm
VCEG: Video Coding Experts Group, a committee of ITU
video packet (MPEG-4): coded unit suitable for packetisation
video processor: processor with features specific to video coding and processing
VLC: variable length code
VLD: variable length decoder
VLE: variable length encoder
VLIW: very long instruction word
VLSI: very large scale integrated circuit
VO (MPEG-4): video object
VOP (MPEG-4): video object plane
VQEG: Video Quality Experts Group
YCrCb: luminance/red chrominance/blue chrominance colour space
