Document summary

Digital Media Processing in ANSI C, which includes a large number of demonstrations.

Document preview

Digital Media Processing
DSP Algorithms Using C

Hazarathaiah Malepati

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Newnes is an imprint of Elsevier
30 Corporate Drive, Suite 400, Burlington, MA 01803, USA
The Boulevard, Langford Lane, Kidlington, Oxford, OX5 1GB, UK

Copyright © 2010 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
Malepati, Hazarathaiah.
Digital media processing : DSP algorithms using C / by Hazarathaiah Malepati.
p. cm.
Includes bibliographical references and index.
ISBN 978-1-85617-678-1 (alk. paper)
1. Multimedia systems. 2. Embedded computer systems--Programming. 3. Signal processing--Digital techniques. 4. C (Computer program language). I. Title.
QA76.575.M3152 2919
006.7–dc22
2009050460

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

For information on all Newnes publications visit our website at www.elsevierdirect.com

Printed in the United States
10 11 12 13 14 10 9 8 7 6 5 4 3 2 1

This book is dedicated to my late father Mastanaiah Malepati, whose vision and hard work shaped my career a lot.

Contents

Preface

Chapter 1 Introduction
1.1 Digital Media Processing
1.2 Media-Processing Algorithms
1.3 Embedded Systems and Applications
1.4 Algorithm Implementation on DSP Architectures

Part 1 Data Processing

Chapter 2 Data Security
2.1 Cryptography Basics
2.2 Triple Data Encryption Algorithm
2.3 Advanced Encryption Standard
2.4 Keyed-Hash Message Authentication Code
2.5 Elliptic-Curve Digital Signature Algorithm

Chapter 3 Introduction to Data Error Correction
3.1 Definitions
3.2 Error Detection Algorithms
3.3 Block Codes
3.4 Hamming (72, 64) Coder
3.5 BCH Codes
3.6 RS Codes
3.7 Convolutional Codes
3.8 Trellis Coded Modulation
3.9 Viterbi Algorithm
3.10 Turbo Codes
3.11 LDPC Codes

Chapter 4 Implementation of Error Correction Algorithms
4.1 BCH Codes
4.2 Reed-Solomon Error-Correction Codes
4.3 RS Erasure Codes
4.4 Viterbi Decoder
4.5 Turbo Codes
4.6 LDPC Codes

Chapter 5 Lossless Data Compression
5.1 Entropy Coding
5.2 Variable Length Decoding
5.3 H.264 VLC-Based Entropy Coding
5.4 MQ-Decoder
5.5 Context-Based Adaptive Binary Arithmetic Coding

Part 2 Digital Signal and Image Processing

Chapter 6 Signals and Systems
6.1 Introduction to Signals
6.2 Time-Frequency Representation of Continuous-Time Signals
6.3 Sampling of Continuous-Time Signals
6.4 Time-Frequency Representation of Discrete-Time Signals
6.5 Linear Time-Invariant Systems
6.6 Generalized Fourier Transforms

Chapter 7 Transforms and Filters
7.1 Fast Fourier Transform
7.2 Discrete Cosine Transform
7.3 Filter Basics
7.4 Finite Impulse-Response Filters
7.5 Infinite Impulse-Response Filters

Chapter 8 Advanced Signal Processing
8.1 Adaptive Signal Processing
8.2 Multirate Signal Processing
8.3 Wavelet Signal Processing
8.4 Simulation and Implementation Techniques

Chapter 9 Digital Communications
9.1 Introduction
9.2 Single- and Multicarrier Communication Systems
9.3 Channel Estimation
9.4 Channel Equalization
9.5 Synchronization
9.6 Simulation Techniques

Chapter 10 Image Processing Tools
10.1 Color Conversion
10.2 Color Enhancement
10.3 Brightness and Contrast Adjustment
10.4 Edge Enhancement/Sharpening of Edges
10.5 Image Filtering
10.6 Edge Detection
10.7 Image Scaling
10.8 Erosion and Dilation
10.9 Objects Corner Detection
10.10 Hough Transform
10.11 Simulation of Image Processing Tools

Chapter 11 Advanced Image Processing Algorithms
11.1 Image Rotation
11.2 Digital Image Stabilization
11.3 Image Objects Detection
11.4 2D Image Filters
11.5 Fisheye Distortion Correction
11.6 Image Compression

Part 3 Digital Speech and Audio Processing

Chapter 12 Speech and Audio Processing
12.1 Sound Waves and Signals
12.2 Digital Representation of Audio Signals
12.3 Signal Processing with Embedded Processor
12.4 Speech Compression
12.5 VoIP and Jitter Buffer

Chapter 13 Audio Coding
13.1 Psychoacoustics and Perceptual Coding
13.2 Audio Signals Coding
13.3 MPEG-4 AAC Codec
13.4 Popular Audio Codecs
13.5 Audio Post-Processing

Part 4 Digital Video Processing

Chapter 14 Video Coding Technology
14.1 Introduction
14.2 Video Coding Basics
14.3 MPEG-2 Decoder
14.4 H.264 Decoder
14.5 Scalable Video Coding

Chapter 15 Video Post-Processing
15.1 Video Quality Measurement
15.2 Video Scaling
15.3 Video Processing
15.4 Video Transcoding

Index

On the Website

Part 5 Embedded Systems

Chapter 16 Embedded Systems
16.1 Introduction
16.2 Embedded System Components
16.3 Embedded Video Processing and System Issues
16.4 Software–Hardware Partitioning
16.5 Embedded Processors and Application Requirements

Chapter 17 Embedded Processing Applications
17.1 Automotive Applications
17.2 Video Surveillance Systems
17.3 Portable Entertainment Systems
17.4 Digital Communications
17.5 Digital Camera Image Pipe
17.6 Homeland Security and Health Care

Appendix A Reference Embedded Processor
A.1 Blackfin Architecture Overview
A.2 Overview of Blackfin Instruction Set
A.3 Blackfin DMA
A.4 Cycles Estimation with Blackfin

Appendix B Mathematical Computations on Fixed-Point Processors
B.1 Numeric Data Fixed-Point Computing
B.2 Galois Fields

Appendix C Look-up Tables

References

Exercises

Preface

The title of this book could well have been Digital Media Processing Algorithms: Efficient Implementation Techniques in C, as it is not only about digital media processing algorithms, but also contains many implementation techniques for most algorithms. Its main purpose is to fill the gap between the theory and techniques taught at universities and those required by the software industry for the digital processing of data, signal, speech, audio, images, and video on an embedded processor. The book serves as a bridge for the transition from the technical institute to the embedded software development industry. Many powerful algorithms in current cutting-edge technologies are analyzed, and simulation and implementation techniques are presented.

Digital media processing demands efficient programming in order to optimize functionality.
Data, signal, image, audio, and video processing, some or all of which are present in all electronic devices today, are complex programming environments. Optimized algorithms (step-by-step directions) are difficult to create, but they can make all the difference when developing a new application. This book discusses the most recent algorithms available to help you get the most from your programming, while keeping in mind the memory and real-time constraints of the architecture with which you are working. General implementation concepts can be integrated into many architectures that you find yourself working with on a specific project.

My interest in writing a book on digital media processing algorithms derives from reading literature in the field and working on those algorithms. This book cannot replace the literature on the background theory related to the algorithms; in fact, what is written here is largely incomplete without it. Although I do not rigorously discuss the theory and derivation of equations and theorems, a brief introduction and basic mathematics are provided for most of the algorithms presented.

Typically, developers of embedded software modules want to know the basic functionality of an algorithm and simulation techniques, in addition to whether any techniques are available to efficiently implement a particular algorithm. Most developers are proficient with equations and algorithms as a result of university training; however, the efficient implementation of such algorithms requires industry experience. But employers, of course, expect developers to immediately begin work. Often they provide training for writing quality software, but not for writing efficient software. Software engineers learn how to do this over time, for example while working on a few efficiently implemented modules or by observing a senior engineer’s implementation methods.
Many such techniques to efficiently simulate and implement digital media processing algorithms are described in this book.

Today many algorithms are available on the Internet, and the software for a number of them is available in the public domain. But the information available on the web is theory oriented, and we may obtain only pieces of the software here and there and not the complete solution. Sometimes, we can obtain the complete software for a particular algorithm that works well, but it may be inefficient for use in a particular project. Consequently, users often end up purchasing more efficient software from a third-party source. This book provides the information needed to develop efficient software for many algorithms from scratch.

The book is aimed at graduate and postgraduate students in various engineering subdisciplines and junior-level software industry employees developing embedded systems software. Only college-level knowledge of mathematics is required to understand the equations and calculations. Knowledge of ANSI C is a prerequisite for this book. Knowledge of microcontroller, microprocessor, or digital signal processing (DSP) architectures is an added advantage that will help you pick up the implementation techniques a bit faster.

Unlike other DSP algorithm books that concentrate mainly on basic operations, such as the Fourier transforms and digital filters, this book covers many algorithms commonly used in media processing. For most of them, this book provides full details of flow, implementation complexity, and efficient implementation techniques using ANSI C. In addition, simulation results are provided for selected algorithms. This book uses the Analog Devices, Inc. (ADI) Blackfin processor (BF5xx series) as the reference embedded processor, and it discusses implementation complexity of all algorithms covered with respect to this amazing general-purpose DSP processor.
The Pcode notation (meaning pseudocode or program code) is used to flag simulation code.

The availability of test vectors is very important for testing the functionality of any algorithm. Test vectors, look-up tables, and simulation results for most of the standalone algorithms described in this book are available on the companion website at www.elsevier.direct/companions. In addition, a final part, Embedded Systems, can be found there along with Appendices A and B, References, and Exercises.

Disclaimer

An algorithm can be implemented on an embedded processor in more than one way. Performance metrics vary according to implementation method. Sometimes there may be a flaw in a particular implementation of a given algorithm, even though we get the best performance with it. It may not be possible to test rigorously for all possible flaws in a given time frame. The program code provided in this book is tested for only a few cases, and it provides selected ways of implementing algorithms and corresponding simulation code. The code may contain bugs. In particular, cryptographic systems are very vulnerable to changes in algorithm flow and implementation as well as software and hardware bugs. Neither the author nor the publisher is responsible for system failures due to the use of any of the techniques or program codes presented in this book. In addition, a few techniques provided may be patented by either ADI or another company; check with the patent office before attempting to incorporate any of the implementation methods discussed when developing your own software.

Acknowledgments

I am very thankful to Analog Devices, Inc. (ADI) and its employees for giving me the opportunity to write this book. ADI is a great place to work and to achieve career goals. In particular, I am very much indebted to Yosi Stein and Rick Gentile, without whom I might not have succeeded in completing this book. The theme for the book originated while working with Yosi at ADI.
My dream of writing it came true with the constant support and encouragement I received from Rick Gentile. I am proud to say Rick and Yosi are the heart and soul of this book. It is with great pleasure that I thank Boris Liberol for reading every page and providing material on loopﬁlter and motion compensation for the video coding chapter; Chalil Mohammed for providing sections for the audio coding chapter; and Gabby Yi for providing material on motion estimation. David Katz and Rick Gentile generously gave me permission to take a few sections from their book, Embedded Media Processing. I thank Rick Gentile, Pushparaj Domenic, Gabby Yi, and Bijesh Poyil for reviewing selected sections, and external reviewers Seth Benton and Kenton Williston for reviewing some portions of the material and for giving valuable suggestions for improving the book. I thank Goulin Pan, An Wei, and Boris Learner for spending their precious time with me to clarify a few digital media processing concepts. I am especially grateful to S.V. Narasimhan, V.U. Reddy, and K.V.S. Hari for their guidance. It is with them that I ﬁrst began my journey into digital media processing. I thank N. Sridhara, P. Rama Prabhu, Pushparaj Domenic, Yosi Stein, Joshua Kablotsky, Gordon Sterling, and Rick Gentile for giving me a chance to work with them as part of their team. I offer my heartfelt thanks to Analog Dialogue editor Scott Wayne for forwarding this material to Newnes– Elsevier, and to acquisitions editor Rachel Roumeliotis at Newnes for accepting and preparing the contract for this book. I am very thankful to this book’s project manager Marilyn E. Rash, copyeditor Barbara A. Kohl, and proofreader Samantha Molineaux-Graham for enhancing the material here by far from my original writing. 
Last, but not least, I thank my family for their support and encouragement during this intense period of brainstorming: my mother Mastanamma for her love and sacrifices and the effort she made in shaping my career; and my sister Madhavi, brother-in-law Venkateswarulu, father-in-law Guruvaiah, and mother-in-law Swarajyam, who have been very supportive and have taken care of family responsibilities while I was engaged in this endeavor. Above all, I would like to thank my wife Sunitha Rani for her love, patience, and constant support throughout this project, and my beautiful daughter Akshara Mahalakshmi, who stayed with her grandparents while I was writing this book. I missed her a lot and hope she will forgive me for not being with her during this time.

CHAPTER 1
Introduction

1.1 Digital Media Processing

Digital media processing as it is currently understood and further developed in this book is described in the following subsections.

1.1.1 Digital Media Defined

In this book, media comprises data, text, signal, voice, audio, image, or video information, and digital media is the digital representation of analog media information. In our daily lives, we typically use many types of media for various purposes, including the following:

• telephoning (voice)
• listening to music (audio)
• watching TV (audio/video)
• camera use (image/video)
• e-mailing (text/images)
• online shopping (text/data/images)
• money transfer (text/data)
• navigating websites (text/image)
• conferencing (voice/video)
• body scanning with ultrasound and/or magnetic resonance imaging (MRI) (signal/image)
• driving vehicles using GPS (signal/audio/video), and so on

Applications that use media are continually increasing.

1.1.2 Why Digital Media Processing Is Required

In all of the previously mentioned applications, media is sent or received. As a sender or receiver, we typically use the media (talking, listening, watching, mailing, etc.) without experiencing difficulties in perceiving (with our eyes, ears, etc.)
or delivering (talking, mailing, texting, etc.) the media. In reality, the media that we send or receive passes through many physical channels and each one adds noise (due to interference, interruptions, switching, lightning, topographic obstacles, etc.) to the original media. In addition, users may want to protect the media (from others), enhance it (improve the original), compress it (for storing/transmitting with less bandwidth), or even work with it (for analysis, detection, extraction, classification, etc.). Digital media processing using appropriate algorithms is therefore required at both the transmitting and receiving ends to prevent and/or eliminate noise and to achieve the application-specific objectives mentioned here.

1.1.3 How Digital Media Is Processed

A software-based digital media processing system comprises three entities: an algorithm (that which processes), a software language (to implement the processing), and embedded hardware (to execute the processing). Examples of embedded hardware are digital signal processors (DSPs), field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs). In this book, the Analog Devices, Inc. Blackfin DSP is the reference embedded processor (see Appendix A on the companion website) for executing algorithms. The algorithms are implemented in the C language. Algorithm examples are discussed in the next section.

© 2010 Elsevier Inc. All rights reserved. DOI: 10.1016/B978-1-85617-678-1.00001-6

1.2 Media-Processing Algorithms

In this book, digital media processing algorithms are divided into four categories: data, signal and image, speech and audio, and video. Each category of algorithms is discussed in great detail in various chapters of this book.

1.2.1 Data Processing

Digital systems handle media signals (e.g., data, voice, audio, image, video, text, graphics, and communication signals) by representing them with 1s and 0s, known as binary digits (bits).
There are many advantages to digital representation of signals. For example, providing integrity and authenticity to the signal using data security algorithms becomes possible once the signal is digitized. It is also possible to protect data from random and burst errors using data error correction algorithms. In some cases, it is even possible to compress the digital media data using source-coding techniques to minimize the required data transmission or storage bandwidth. Part 1 of this book covers the most popular algorithms used for data security, error correction, and compression. For all algorithms, a brief introduction, complete details of algorithm ﬂow, C simulation for core algorithm functions, efﬁcient techniques to implement data processing algorithms on the embedded processor, and algorithm computational cost (in terms of clock cycles and memory) for implementing on the reference embedded processor ADI-BF53x (2005) are provided. Chapter 2 is focused on the most widely used data security algorithms in practice. The algorithms covered include triple data encryption algorithms (TDEA), advanced encryption standard (AES), keyed-hash message authentication code (HMAC), and elliptic curve digital signature algorithm (ECDSA). In addition, cryptography basics and pseudorandom-number generation methods are brieﬂy discussed. Chapter 3 discusses various data-error detection and correction algorithms. Error detection based on checksum and cyclic redundancy check (CRC) computation is discussed. Both block codes and convolutional codes for error correction and corresponding decoding methods are discussed in detail. The algorithms covered include CRC32, Hamming (N, K ), BCH (N, K ), Reed-Solomon (RS) (N, K ) error correction codes, RS (N, K ) erasures correction codes, trellis coded modulation (TCM), turbo codes, low-density parity check (LDPC) codes, Viterbi decoding, maximum a posteriori (MAP) decoding, and sum-product (SP) decoding algorithms. 
Chapter 4 discusses efficient simulation and implementation techniques for all error correction algorithms discussed in Chapter 3. Widely used data entropy coding methods are discussed in Chapter 5. Variable length codes and arithmetic coding approaches for entropy coding are discussed. The algorithms covered include the MPEG2 VLD, H.264 UVLC and CAVLC, JPEG2000 MQ-coder, and H.264 CABAC.

1.2.2 Digital Signal and Image Processing

We process raw signals using signal processing algorithms to get the desired signal output. Signal processing algorithms have many applications: telecommunications, medical, aerospace, radar, sonar, and weather forecasting, to name the most common. Part 2 of this book is dedicated to signals and systems, time-frequency transformation algorithms, filtering algorithms, multirate signal-processing techniques, adaptive signal processing algorithms, and digital communication algorithms. The later chapters of Part 2 are devoted to image processing tools and advanced image processing algorithms.

In Chapter 6, background theory of digital signal processing algorithms is discussed. We will cover signal representation, types of signals, sampling theorem, signal time-frequency representation (using Fourier series, Fourier transform, Laplace transform, z-transform, and discrete cosine transform [DCT]), linear time invariant (LTI) systems, and convolution operation. Signal time-frequency representation and signal filtering are discussed thoroughly in most digital signal processing textbooks, including this one. In Chapter 7, we discuss implementation aspects of the fast Fourier transform (FFT), DCT, finite-impulse response (FIR) filters, and infinite impulse response (IIR) filters. C simulation is provided for all algorithms. Comparative algorithm costs (in terms of clock cycles and memory) for implementation on the reference embedded processor are discussed.
Chapter 8 discusses adaptive signal processing algorithms (minimum mean square error [MMSE] criterion, least mean square [LMS], recursive least squares [RLS], linear prediction [LP], Levinson-Durbin algorithm and lattice ﬁlters), multirate signal processing building blocks (e.g., decimation, interpolation, polyphase ﬁlter implementation of decimation and interpolation, and ﬁlter banks), and wavelet signal processing (multiresolution analysis and discrete wavelet transform). The C fixed-point implementation of the LMS algorithm is also presented. Chapter 9 discusses the digital communication environment (channel capacity, noise measurement, modulation techniques), single-carrier communication, multicarrier communication system building blocks (discrete multitone [DMT] and orthogonal frequency division multiplexing [OFDM] transceivers), channel estimation algorithms (for both wireline and wireless), channel equalizers (minimum mean square [MMS] equalizer, decision-feedback [DF] equalizer, Viterbi equalizer, and turbo equalizer) and synchronization algorithms (frequency offset estimation, symbol timing recovery, and frame synchronization). As most digital communication algorithms involve basic signal-processing tasks (e.g., DFT, ﬁltering), no exclusive C simulation is provided for these algorithms. However, a few techniques to efﬁciently implement commonly used basic mathematic operations such as division and square root on fixed-point processors are discussed, and C-simulation code is provided for those basic operations. Image processing plays an important role in medical imaging, digital photography, computer graphics, multimedia communications, automotive, and video surveillance, to name the most common applications. 
Image processing tools are basically algorithms used to process the image to achieve aims speciﬁc to the application, such as improving image quality, creating special effects, compressing images for storage or fast transmission, and correcting abnormalities in the captured image (sometimes the capturing device itself introduces artifacts in the image due to hardware limitations or lens distortion). Image processing tools are also used in classifying images, detecting objects in the image, and extracting useful information from captured images. Chapter 10 is focused on discussing and simulating widely used image processing tools such as color conversion, color enhancement, brightness and contrast correction, edge enhancement, noise reduction, edge detection, image scaling, image object corners detection, dilation and erosion morphological operators, and the Hough transform. Advanced image processing algorithms such as image rotation, image stabilization, object detection (e.g., the human face, vehicle license plates), 2D image filtering, fisheye correction, and image compression techniques (DCT-based JPEG and wavelet-based JPEG2000), are discussed in Chapter 11. The C-simulation code and algorithm costs (in terms of processor clock cycles and memory) are also provided for image rotation and 2D image filtering algorithms. 1.2.3 Speech and Audio Processing Speech and audio coding are very important topics in the field of multimedia storage and communication systems. Example audio- and speech-coding applications are telecommunications, digital audio broadcasting (DAB), portable media players, military applications, cinema, home entertainment systems, and distance learning. Human speech processing has many other applications, such as voice detection and speech recognition. Part 3 is dedicated to discussion of algorithms related to speech processing, speech coding, audio coding, and audio post-processing, among others. 
In Chapter 12, we discuss sound and audio signals, and explore how audio data is presented to the processor from a variety of audio converters. Next, the formats in which audio data is stored and processed are described. Selected software building blocks for embedded audio systems are also discussed. Because efficient data movement is essential for overall system optimization, data buffering as it applies to speech and audio algorithms is examined. There are many speech coding algorithms in the literature and this chapter briefly discusses a few methods. Various speech compression standards are also briefly addressed. Finally, the Voice over Internet Protocol (VoIP) and the purpose of the jitter buffer in VoIP communication systems are discussed.

Audio coding methods are discussed in Chapter 13. While audio requires less processing power in general than video processing, it should be considered equally important. Recent applications such as wireless, Internet, and multimedia communication systems have created a demand for high-quality digital audio delivery at low bit rates. The technologies behind various audio coding techniques are discussed, followed by examination of MPEG-4 AAC codec modules and encoder and decoder architectures. Various commercially available audio codecs and their implementation costs (in terms of cycles and memory) are presented. Finally, we discuss a few audio post-processing techniques for enhancing the listening experience.

1.2.4 Video Processing

Advances in video coding technology and standardization, along with rapid development and improvements of network infrastructures, storage capacity, and computing power, are enabling an increasing number of video applications. Digitized video has played an important role in many consumer electronics applications, including DVD, portable media players, HDTV, video telephony, video conferencing, Internet video streaming, and distance learning, among others.
As we move to high-definition video, the computing bandwidth required to process video increases manyfold, and more than 80% of total available embedded processor computing power is allocated for video processing. Chapter 14 describes video signals, and various redundancies present in video frames are explored. Video coding building blocks (e.g., motion estimation/compensation, block transform, quantization, and variable-length coding) are brieﬂy discussed, followed by a survey of various video coding standards and comparisons with respect to coding efficiency and costs. Computationally complex (high-cost) coding blocks are identified. Efficient ways of implementing video coders are discussed, followed by an examination of the two most widely adopted video coding standards—the MPEG-2 and H.264 decoder modules. Details of H.264-specific decoding modules (e.g., H.264 transform, intraprediction, loop filtering) are provided. Also discussed are a few techniques to efficiently implement the H.264 macroblock layer. A scalable video coding (based on the H.264 scalable extension standard) and its applications are discussed. Video processing, as stated before, when compared to other media processing, is very costly in terms of computation, memory, and data movement bandwidths. Video coding and system issues because of limited MIPS, memory, and system bus bandwidth are presented in Section 16.5 on the companion website, along with the use of proper frameworks to minimize power consumption in low-power video applications. Video data is often processed after decompression and before sending it to the display for enhancement or rendering it suitable for playing on the screen. This part of the procedure is called “video post-processing.” Chapter 15 is focused on video post-processing modules such as video scaling, video filtering, video enhancement, alpha blending, gamma correction, and video transcoding. 
1.3 Embedded Systems and Applications Embedded systems enable numerous digital devices used in daily life, and thus, are literally everywhere. Embedded computing systems have grown tremendously in recent years not only in popularity, but also in computational complexity. In all the applications listed in Table 1.1, digital embedded systems process some form of digital data. Digital media processing algorithms play an important role in all embedded system applications. This book is focused on digital media and communication processing algorithms—that is, applications involving processing and communication of large data blocks (whether image, video, audio, speech, text blocks, or some combination of these), which often need real-time data processing. For an application, we choose a particular embedded processor along with a peripheral set only after studying its capabilities to run the algorithms of a particular application. The last part of this book discusses embedded systems, media processing, and their applications. Embedded systems have several common characteristics that distinguish such systems from general-purpose computing systems. Unlike desktops, the embedded systems handle huge amount of data per second with very limited resources (e.g., arithmetic logic units [ALUs], memory, peripherals). In most cases, embedded systems handle very few tasks and usually these tasks must be performed in real time. In Chapter 16 (see companion website), we discuss the important components of an embedded system (e.g., processor core, memory, and peripherals). 
Various types of memory and peripheral components are briefly discussed. The necessity of software–hardware partitioning of embedded systems to handle complex applications is discussed, as well as possible ways to efficiently partition such a system. Finally, we discuss future embedded processor requirements to handle very complex embedded applications.

Table 1.1: Digital media processing applications
Digital Home: AV receivers; DVD/Blu-Ray players; TV/desktop audio/video; Sound bar; Digital picture frame; Video telephony; IP TV, IP phone, IP camera; Door phone; Smoke detector; Network video recorder; CD clock radio
Telecommunications: ADSL/VDSL; Cable modems; Wire/wireless smart phones; IP phone; Femto base stations; Software defined radio; WLAN, WiFi, WiMAX; Mobile TV; Radar/sonar; Power line communication; Video conferencing
Consumer Electronics: Digital camera; Portable media players; Portable DVD players; Digital video recorder; Personal GPS navigation; Mobile TV; Bluetooth; HD/ANC headphones; Video game players; Digital music instruments; FM/satellite radio
Automotives: Advanced driver assistance; Automotive infotainment; Digital audio/satellite radio; Vision control; Bluetooth hands-free phone; Electronic stability control; Safety/airbag control; Crash detection
Industrial: Power meter; Motor control; Active noise cancellation; Barcode scanner; Flow meter; Oscilloscope
Medical: Ultrasound; CT, MRI, PET; Digital x-ray; Pulse oximetry; Digital stethoscope; Blood-pressure monitor; Lab diagnostic equipment; Heart rate monitor
Security: Surveillance IP networks; Fingerprint biometrics; Video doorbell; Video analytic server

Chapter 17 (see companion website) briefly discusses various applications. Different embedded applications use different algorithms. The processing power and memory requirements vary from one application to another. We briefly talk about various modules present in a few embedded application sectors.
The applications covered in this chapter include automotive, video surveillance, portable entertainment systems, digital communications, digital camera, and immigration and healthcare sectors.

1.4 Algorithm Implementation on DSP Architectures

In Section 1.2, various algorithms that are playing a critical role in diverse applications were mentioned. Although dozens of semiconductor companies are designing embedded processors with a range of architectural features to support different kinds of applications, no single architecture is efficient for processing all types of digital media processing algorithms. This is because processors designed with many pipeline stages (to execute in parallel multiple operations of numeric-intensive algorithms) do not efficiently handle algorithms that contain full-control operations. The architectures developed for executing the control code are not efficient at computing numeric-intensive algorithms. The architectural feature set of the reference embedded processor (see Appendix A on the companion website) is in between, and is good at handling both control and numeric-intensive algorithms. In the following subsections, DSP architecture and its performance in executing various algorithms are briefly discussed. We also briefly describe a few algorithm implementation techniques.

1.4.1 DSP Architecture

A simplified block diagram of embedded DSP architecture is shown in Figure 1.1. The main architectural blocks of an embedded processor are the processor core (with register sets, ALU, data address generator [DAG], sequencer, etc.), memory (for holding instructions and data, for stack space, etc.), peripherals (e.g., serial peripheral interface [SPI], parallel peripheral interface [PPI], serial ports [SPORT], general-purpose timers, universal asynchronous receiver transmitter [UART], watchdog timer, and general-purpose I/O) and a few others (e.g., JTAG emulator, event controller, direct memory access [DMA] controller).
Embedded processor peripherals and memory architectures are discussed in some detail in Chapter 16. The peripheral features are important when we talk about the overall application. In this book, we assume that the architecture comes with all necessary peripherals to enable a particular application. Also, we assume that the program code and data required for algorithm processing reside in the fastest (level 1, or L1) memory, which can be accessed at the speed of the processor core. If we cannot fit data and program in L1 memory, then we store the extra data or program in L2/L3 memory and use DMA to get the data or program from L2/L3 memory without interrupting the processor core. From an algorithm-implementation point of view, the important things are processor core architecture, availability of L1 memory, and internal bus bandwidth. Even more important than getting data into (or sending it out from) the processor is the structure of the memory subsystem that handles the data during processing. It is essential that the processor core access data in memory at rates fast enough to meet application demands. L1 memory is often split between instruction and data segments for efficient utilization of memory bus bandwidth. Most DSP architectures support this Harvard-like architecture (in which data and instruction memories are accessed simultaneously, as shown in Figure 1.1) in combination with a hierarchical memory structure that views memory as a single, unified 4-gigabyte address space using 32-bit addresses. All resources, including internal memory, external memory, and I/O control registers, occupy separate sections of this common address space. The register file contains different register types (e.g., data registers, accumulators, address registers) to hold the information temporarily for ALU processing or for memory load/store purposes. The processor’s computational units perform numeric processing for DSP algorithms and general control algorithms.
Data moving in and out of the computational units go through the data register file. The processor’s assembly language provides access to the data register file. The syntax lets programs move data to and from these registers and specify a computation’s data format at the same time. The DAGs generate addresses for data moving to and from memory. By generating addresses, the DAGs let programs refer to addresses indirectly using a DAG register instead of an absolute address. The program sequencer controls the instruction execution flow, including instruction alignment and decoding. The program sequencer determines the next instruction address by examining both the current instruction being executed and the current state of the processor. Generally, the processor executes instructions from memory in sequential order by incrementing the look-ahead address. However, when encountering one of the following structures, the processor will execute an instruction that is not at the next sequential address: jumps, conditional branches, function calls, interrupts, loops, and so on.

Figure 1.1: Simplified diagram of DSP architecture (blocks shown: data memory, instruction memory, peripherals, and a DSP core comprising registers, a DAG unit, an ALU unit, and a sequencer).

In the next subsection, we consider three algorithms with different processing flow requirements and discuss to what extent the benchmarks provided by processor manufacturers are useful in deciding which processor (from dozens of processors available today in the market) is suitable for a particular application.

1.4.2 Algorithm Complexity and DSP Performance

In this section, we consider three simple algorithms (dot product, RC4 stream cipher, and the H.264 CABAC encode-symbol-normalization process) and discuss embedded processor performance (with a particular architectural feature set) in executing those three algorithms.

Dot Product

Dot product involves accumulation of sample-by-sample multiplication of elements from two sample arrays.
The dot product, z, of two N-length sample arrays x[] and y[], can be computed as

$$ z = \sum_{n=0}^{N-1} x[n]\, y[n] \tag{1.1} $$

A simple “for” loop C code that implements the dot product described by Equation (1.1) is shown in Pcode 1.1. What is the cost (in terms of cycles and memory) of this dot-product algorithm for implementation on the embedded processor, given its processor core architecture? Clearly, we require two buffers of 2N bytes each (assuming the elements are of the 16-bit word type) to hold the two input arrays in memory. In terms of computations, it involves N multiplications and N additions. If the embedded processor consumes one cycle for multiplication and one cycle for addition, then we require a total of 2N cycles (assuming a single ALU) to execute the corresponding dot-product code given in Pcode 1.1. What about the cycle cost of loading the data from memory to the data registers? Typically, many processors come with separate data load/store units; hence, we assume that the data loads happen in parallel with the compute operations and therefore they are free.

    z = 0;
    for (i = 0; i < N; i++)
        z += x[i] * y[i];

Pcode 1.1: Pseudo code for dot product.

Many embedded processors come with multiply–accumulate (MAC) units, and in this case we require only N cycles, as the dot product contains a total of N MAC operations. For this case, the two memory loads must happen in a single cycle. Now, you may wonder whether this cycle count can be achieved with the C code ported to the processor assembly using the compiler or with the optimized assembly-level code written manually. Here, when we say that the cycle count is N for executing the dot product, it means that one MAC operation is mapped to a single processor instruction, which consumes exactly one cycle; only then can we describe the cycle count as N cycles for N MAC operations. Is this the final cycle count for computing the dot product?
Not exactly: in the dot-product case, it also depends on the number of MAC units that the processor comes with. For example, if the processor consists of four MAC units, then we require only N/4 cycles to complete the dot product. How is this possible? It is possible because we can execute four MAC operations in parallel on a four-MAC processor, as the dot product has no flow dependencies. However, we will have a problem with the data load unless we load 128 bits (four 16-bit words from array x[] and another four 16-bit words from array y[]) of data to eight 16-bit registers in a single cycle. For efficient compilation to run on a four-MAC processor, we unroll the dot-product loop in Pcode 1.1 by four times and reduce the loop count by a factor of 4 as shown in Pcode 1.2. Given that the dot product is a simple algorithm, most compilers can efficiently map the C code to the assembly language so that the difference between cycle estimation and actual cycles measured is negligible.

    z1 = 0; z2 = 0; z3 = 0; z4 = 0;
    for (i = 0; i < N; i += 4) {            // N/4 iterations; N assumed to be a multiple of 4
        z1 += x[i]     * y[i];              // MAC unit 1
        z2 += x[i + 1] * y[i + 1];          // MAC unit 2
        z3 += x[i + 2] * y[i + 2];          // MAC unit 3
        z4 += x[i + 3] * y[i + 3];          // MAC unit 4
    }
    z = z1 + z2 + z3 + z4;

Pcode 1.2: Pseudo code for dot product with loop unrolling four times.

Digital media processing algorithms are not just “dot products.” Next, we consider another simple algorithm, the RC4 stream cipher.

RC4 Stream Cipher

The RC4 algorithm (see Section 2.1.6, RC4 Algorithm, for more details) is used as a stream cipher in low-security applications and as a pseudorandom number generator in many standard cipher applications. RC4 is used in many commercial software packages, such as Lotus Notes and Oracle Secure SQL, and in network protocols, such as SSL, IPsec, WEP, and WPA. An RC4 simulation code is given in Pcode 1.3.
    j = 0;
    for (i = 0; i < N; i++) {       // N: data length in bytes
        k = i & 0xff;               // i mod 256
        r0 = SBox[k];
        r1 = j + r0;
        j = r1 & 0xff;              // (j + r0) mod 256
        r1 = SBox[j];               // look-up table access with arbitrary offset
        SBox[j] = r0;
        SBox[k] = r1;               // swap look-up table elements
        r1 = r1 + r0;
        r1 = r1 & 0xff;             // (r1 + r0) mod 256
        r1 = SBox[r1];              // look-up table access with arbitrary offset
        in[i] = in[i] ^ r1;         // encrypt input message bytes
    }

Pcode 1.3: Simulation code for RC4 stream cipher.

In the iterative procedure of computing RC4 encrypted data using Pcode 1.3, the computation of a new j value requires updated (swapped) S-box values. Thus, computing many j values and swapping them at the same time is not possible due to the dependency of j on updated S-box values. The RC4 algorithm is sequential in nature, although no jumps are present. Even if multiple compute units are available with the processor, we cannot use them in this case for parallel implementation of the algorithm. See Section 2.1.6, RC4 Algorithm, for cycle costs and memory requirements to implement RC4 on the reference embedded processor. Unlike the dot product, algorithms such as RC4 may not execute efficiently, in terms of cycles, on deep-pipeline processors. RC4 can be computed efficiently on microcontrollers with a two-stage pipeline in fewer cycles, compared to DSPs with 10 or more pipeline stages.

In the case of algorithms with frequently occurring conditional branches (e.g., the H.264 CABAC encode symbol normalization process described in Section 5.5), the performance of deep-pipeline DSPs worsens. As shown in Pcode 1.4, the normalization process has many conditional jumps in a “while loop.” This process is costly in terms of cycles, as it performs normalization 1 bit at a time with many jumps. Avoiding jumps is the only solution to reduce cycle cost (see Section 5.5 for details).
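As a rough illustration of what avoiding jumps can look like, the sketch below computes, for the Range variable alone, how many times the while loop of Pcode 1.4 would iterate, without looping. It is a minimal sketch under stated assumptions and is not the method of Section 5.5, which is not reproduced here: the function names are hypothetical, Range is assumed to be at least 1, and the Low and outstanding-bit handling of Pcode 1.4 is deliberately left out. Whether a compiler turns the summed comparisons into jump-free code depends on the compiler and the target processor.

```c
#include <assert.h>

/* Reference: iteration count of the bit-at-a-time loop, Range only.
   (Hypothetical helper; Low/Obits handling of Pcode 1.4 is omitted.) */
static unsigned cabac_shifts_loop(unsigned range)
{
    unsigned n = 0;
    assert(range >= 1);             /* assumption: range 0 would never terminate */
    while (range < 256) {
        range <<= 1;
        n++;
    }
    return n;
}

/* Same count without a loop: each comparison yields 0 or 1, and the
   number of thresholds above `range` equals the number of left shifts
   needed to bring it to 256 or more. Assumes range >= 1. */
static unsigned cabac_shifts_sum(unsigned range)
{
    assert(range >= 1);
    return (range < 256) + (range < 128) + (range < 64) + (range < 32)
         + (range < 16)  + (range < 8)   + (range < 4)  + (range < 2);
}
```

With the shift count known up front, Range could be renormalized with a single shift (Range <<= n). The bits shifted out of Low would still have to be written to the bit FIFO, so this sketch removes only part of the per-bit control flow.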
In summary, DSPs are good at handling FFTs, filters, and matrix operations, and are less effective at handling both control code and sequential algorithms. Simple pipeline processors (e.g., ARM) are good at handling control and sequential algorithms, and less effective at handling signal processing tasks such as transforms, filtering operations, and so on. In brief, the dot-product benchmark provided by the DSP manufacturer may not provide much useful information because the application at hand rarely contains dot-product kinds of operations. To efficiently run any algorithm on a particular digital signal processor, we need to dedicate some time to understanding the underlying mathematical structure of the algorithm and then tune it to write efficient code for that processor. A few techniques to map algorithms to DSPs are discussed in the next section.

    while(pBAC->Range < 256) {    // Low, Range, Outstanding bits (or Obits) are CABAC params
        if(pBAC->Low >= 512) {
            pBAC->Low -= 512;
            write_bits(1,1);
            if(pBAC->Obits > 0) {
                write_bits(0,pBAC->Obits);    // bit-fifo write
                pBAC->Obits = 0;
            }
        }
        else if(pBAC->Low < 256) {
            write_bits(0,1);
            if(pBAC->Obits > 0) {
                write_bits(1,pBAC->Obits);    // bit-fifo write
                pBAC->Obits = 0;
            }
        }
        else {
            pBAC->Obits++;
            pBAC->Low -= 256;
        }
        pBAC->Range = pBAC->Range << 1;
        pBAC->Low = pBAC->Low << 1;
    }

Pcode 1.4: Simulation code for H.264 CABAC encode symbol normalization.

1.4.3 Algorithm Implementation Techniques

Digital data is efficiently processed with an embedded processor by optimizing the corresponding program at both the algorithm flow level and the instruction level. The algorithms are optimized for throughput, memory usage, I/O bandwidth, and power dissipation. In this subsection, we discuss algorithm-level optimization using various techniques for increasing throughput. In most cases, there is a trade-off between throughput and memory.
Algorithm code is optimized at the instruction level to eliminate pipeline stalls due to data dependencies, to minimize the overhead of control code such as jumps and software loop overheads, and to efficiently handle data movement within the system. Instruction-level optimization techniques vary by processor. Compilers also perform some degree of instruction-level optimization. Typically we see a 10 to 20% gain with instruction-level optimization (measured by a decrease in core clock cycles). When optimizing the code at the instruction level, complete knowledge of the algorithm structure may not be necessary.

On the other hand, program-flow optimization at the algorithm level requires knowledge of the algorithm’s mathematical structure and properties. Compilers cannot achieve algorithm-level program optimization. Minimizing the number of computations and balancing the CPU and load/store bandwidth are possible with algorithm-level optimization. We can achieve algorithm-level optimization using multiple approaches. A few of these methods considered in this section include changing the algorithm flow, using look-up tables, using algorithm-flow statistics, using symmetry and periodicity, reusing already-computed data, and approximating mathematical functions. The amount of cycle savings depends on the particular algorithm and its flow. For the algorithms discussed in this book, the amount of cycle savings achieved with algorithm-level optimization ranges from 20 to 80%.

Is Optimizing All the Program Code Worthwhile?

Before we proceed, we ask whether optimizing all the program code is worthwhile. The answer is that it depends on processor capabilities and application demands. Usually, we start optimizing the most critical modules in C, and if the MIPS budget is not met, we continue to optimize other critical modules. If we are still not within the MIPS budget, then we start writing assembly language and optimizing it.
For example, consider a video decoder (see Chapter 14 for details); it has many layers and modules (see Figure 14.15). In the slice layer, we decode the slice headers, and this is performed once per slice. We may spend a few hundred cycles decoding the slice headers. Thus, the corresponding code can be in C. Similarly, the next layer is the macroblock layer, which may consume a few thousand cycles since we access it for every macroblock to decode macroblock layer headers. The macroblock layer code can be written in C or in assembly language, and we may optimize the code a little depending on performance requirements. The most critical modules in a video codec are motion compensation, DCT transformation, intraframe prediction, de-block filtering, quantization, zig-zag scanning, and entropy coding. All of these modules work at the pixel level, and therefore consume millions of cycles each second. Thus, optimization of these critical modules of a video decoder is important to “play” the video in real time. Apart from these modules, we may be required to run other critical video post-processing modules (e.g., scaling, filtering, blending, YUV to RGB conversion). Therefore, complicated applications such as video require a lot of optimization at many levels.

Optimization by Changing the Algorithm Flow

A change of algorithm flow sometimes leads to a lower number of computations and may balance the CPU and load/store bandwidth. With a change of algorithm flow, even if the algorithm structure changes, we still output the same data from the program. Consider the data encryption standard (DES) algorithm (see Section 2.2) as an example module for optimization. Without algorithm-level optimization, 4288 cycles are required for implementation on the reference embedded processor, whereas only 896 cycles are required for DES using the algorithm-level optimization techniques discussed in Section 2.2.
In implementing algorithms such as the AES (see Section 2.3), RS decoder (see Section 4.2), Viterbi decoder (see Section 4.4), turbo decoder (see Section 4.5), LDPC decoder (see Section 4.6), and CABAC encoder/decoder (see Section 5.5), program optimization at the algorithm level can save many cycles. Optimization Based on Algorithm-Flow Statistics In an algorithm with multiple data paths, all data paths may not occur with equal probability. A few data paths can occur very frequently and a few data paths may occur rarely. If we write the instruction-level optimized code to cover all the data path logic, we may spend too many cycles in parsing all the algorithm paths. Instead, handling the frequently occurring data paths separately and optimizing to the maximum extent saves many cycles. See Section 5.5.4, Normalization Process, for the H.264 CABAC encode symbol normalization process optimization using algorithm-ﬂow statistics. Optimization Using Symmetry and Periodicity The number of arithmetic operations in a mathematical transformation algorithm can be reduced if any symmetry or periodicity is present in the transformation matrix coefficients. For example, the symmetry and periodicity properties of the DFT twiddle matrix are used to speed up computation upon implementing FFT algorithms. In Section 7.2, we consider an 8 × 8 DCT computation as an example, and optimize implementation using its symmetry and periodicity. Optimization by Approximation Simple approximations to underlying mathematical functions (without substantially compromising performance) sometimes lead to great reductions in computations and cycle counts. In Section 10.6.3, we consider an example of a pixel gradient of magnitude G and quantized angle φ as part of the computations in the Canny edge-detector algorithm. 
We start with the x-gradient G_x and y-gradient G_y of pixels, and compute the pixel gradient magnitude and angle, which involve the nonlinear square root and inverse tangent functions, as follows:

\[ G = \sqrt{G_x^2 + G_y^2}, \qquad \phi = \tan^{-1}\left(\frac{G_y}{G_x}\right) \qquad (1.2) \]

Usually, we perform the preceding operations on a fixed-point DSP using some kind of approximation.

Optimization by Reuse of Already-Computed Data

In many cases, delay buffers are used to store history or reuse already-computed data. This is especially true in video or image-processing applications, where a huge amount of data is present for processing. If we can reuse some portion of the computed data to process neighboring pixels, we save many computations.

In Section 15.3.1, we consider the example of a 3 × 3 median image filter, which is commonly used because it preserves edge information better than 3 × 3 averaging filters. With the 3 × 3 median filter, the center pixel of each 3 × 3 block of pixels is replaced with the median of the n = 9 pixels in that block. Computing a median by sorting is very costly, as it involves 3n^2 operations. The 3 × 3 median filter can be efficiently computed using the techniques discussed in Section 15.3.1.

Optimization via Precomputing and Using Look-up Tables

The computation of nonlinear functions (e.g., square root, inverse, trigonometric, and exponential) using fixed-point processors can be very costly. Consequently, we precompute the outputs of these functions in advance and store the results in memory to minimize cycle costs. The implementation of bit-processing modules on deep-pipeline DSPs is costly from the MIPS point of view. In such cases, we minimize computations by converting the bit processing to word processing: we precompute the module output for all bit patterns of a fixed length and store the results in a look-up table.
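The bit-to-word conversion described above can be sketched with a CRC. This is a minimal illustration that assumes an 8-bit CRC with generator polynomial x^8 + x^2 + x + 1 (0x07) and a zero initial value; those parameters and the function names are chosen for the example and are not taken from the text. The bit-serial version needs eight shift-and-test steps per input byte, while the table version precomputes the result for all 256 byte patterns and then needs one look-up per byte, at the cost of 256 bytes of memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRC8_POLY 0x07u   /* assumed generator: x^8 + x^2 + x + 1 */

/* Bit-serial reference: one conditional XOR per input bit. */
uint8_t crc8_bitwise(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;
    for (size_t n = 0; n < len; n++) {
        crc ^= data[n];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x80u) ? (uint8_t)((crc << 1) ^ CRC8_POLY)
                                : (uint8_t)(crc << 1);
    }
    return crc;
}

/* Precompute the CRC of every possible byte value once. */
void crc8_build_table(uint8_t table[256])
{
    for (int v = 0; v < 256; v++) {
        uint8_t b = (uint8_t)v;
        table[v] = crc8_bitwise(&b, 1);
    }
}

/* Byte-at-a-time version: one table look-up replaces eight bit steps. */
uint8_t crc8_table(const uint8_t table[256], const uint8_t *data, size_t len)
{
    assert(table != NULL);
    uint8_t crc = 0;
    for (size_t n = 0; n < len; n++)
        crc = table[crc ^ data[n]];
    return crc;
}
```

Both functions return the same value for any input; the table version trades memory for the removal of the inner bit loop and its data-dependent branch, which is the trade-off between throughput and memory mentioned earlier.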
A few bit-processing examples that can be implemented efficiently via this optimization technique include DES, CRC error detection (see Section 3.2.3), BCH encoding (see Section 4.1), and turbo encoding (see Section 4.5.1). In addition, the computation of modular arithmetic for arbitrary moduli is a costly operation. By precomputing and using look-up tables, we can minimize the cycles required to perform modular arithmetic operations.

1.4.4 C-Level Program Optimization

According to a recent study, most programmers develop embedded algorithms in C rather than in assembly language. There are a number of reasons to use C rather than assembly language: C is much easier to develop and maintain, and it is comparatively portable. However, there is often a poor match between C and the features of embedded processors in vectorization and fractional processing. These hardware features are essential to efficient processing, but they are not natively supported in ANSI C. For this reason, inline assembly code is often used within C programs.

Digital media processing algorithms have specialized characteristics, and compilers usually cannot generate efficient code for them without some level of programmer intervention. Many embedded processors have specialized hardware or instructions to speed up common data-processing algorithms (e.g., FFT butterflies, video processing operations, and Galois field arithmetic). These include, for example, single-cycle MACs with specialized addressing modes, single-cycle quad-byte average and clip operations, and addition and multiplication with Galois field elements. In such cases, C compiler-specific intrinsics are very useful for making better use of the processor’s specialized hardware.

Probably the most useful tool for code optimization is a good profiler. In many implementations, 20% of the code accounts for 80% of the processing time. Focusing on these critical sections yields the highest marginal returns.
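To make the mismatch with fractional processing concrete, the sketch below emulates a fractional multiply and a saturating accumulate in portable C. It assumes the 1.15 fractional format, in which a 16-bit integer v stands for v/32768; the names q15_t, q15_mul, acc_add_sat, and q15_dot are illustrative and are not a vendor API. A fractional MAC unit would typically do this work in one instruction, whereas plain C needs a widening multiply, a shift, and explicit range checks.

```c
#include <assert.h>
#include <stdint.h>

typedef int16_t q15_t;   /* assumed format: value = v / 32768, range [-1, 1) */

/* Fractional multiply: widen to 32 bits, rescale by 2^15, then saturate.
   Right-shifting a negative value is implementation-defined in C; this
   sketch assumes the usual arithmetic (sign-preserving) shift. */
q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t p = ((int32_t)a * (int32_t)b) >> 15;
    if (p > INT16_MAX)
        p = INT16_MAX;   /* only (-1) * (-1) can exceed the range */
    return (q15_t)p;
}

/* Saturating 32-bit add, computed in 64 bits so the sum cannot overflow. */
int32_t acc_add_sat(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) s = INT32_MAX;
    if (s < INT32_MIN) s = INT32_MIN;
    return (int32_t)s;
}

/* Fractional dot product built from the two helpers above. */
int32_t q15_dot(const q15_t *a, const q15_t *b, int n)
{
    assert(n >= 0);
    int32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum = acc_add_sat(sum, q15_mul(a[i], b[i]));
    return sum;
}
```

For example, 0.5 is stored as 16384, and q15_mul(16384, 16384) returns 8192, which represents 0.25. The one overflow case, (-1) × (-1), is clipped to the largest representable value, 32767.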
It turns out that loops are prime candidates for optimization in most digital media processing algorithms because intensive numeric processing usually occurs inside those loops.

Compiler-Level Optimization

The global approach to code optimization is to enable compiler optimization options that optimize for either speed or memory conservation. Many other compiler options may exist to support different functionalities. There is a vast difference in performance between code compiled with and without optimization. In some cases, optimized code can run 2 to 5 times faster.

Intrinsics and Inline Assembly

Intrinsics are compiler-specific instructions that are embedded within C code and are translated by the compiler into a predefined sequence of assembly-code instructions. Intrinsics give the programmer a way to access specialized processor features without actually having to write assembly code. Many embedded processor compilers support intrinsics. An example of intrinsic usage is shown in Figure 1.2, a case of fractional dot-product computation. With the code shown in Figure 1.2(a), the compiled code is executed as a multiply, followed by a shift, followed by an accumulation. However, the processor MAC with fractional arithmetic support performs all these tasks in a single cycle. Thus, with the code shown in Figure 1.2(b) using intrinsics for both fractional multiplication and addition, the compiler can translate the code of fractional multiply and accumulate into a single MAC instruction supporting the fractional arithmetic mode.

(a)
    sum = 0;
    for(i = 0; i < 100; i++) {
        sum += ((a[i]*b[i]) >> 15);
    }

(b)
    sum = 0;
    for (i = 0; i < 100; i++) {
        sum = add_fr32(sum, mult_fr32(a[i],b[i]));
    }

Figure 1.2: Fractional dot product: (a) without intrinsics and (b) with intrinsics.

Many compilers also support the use of inline assembly code, using the asm() construct within a C program.
This feature causes the compiler to insert the specified assembly code into the compiler’s assembly code output. Inline assembly is a good way to access specialized processor features, and it may execute faster than calling assembly code in a separate function. Using inline assembly, various costs are avoided, such as program flow latencies, function entry and exit instructions, and parameter passing overhead.

Profile-Guided Optimization

Profile-guided optimization (PGO) is an excellent way to tune the compiler’s optimization strategy for a program’s typical runtime behavior. Many program characteristics that cannot be known statically at compile time can be provided through PGO. The compiler can use this knowledge to bring about benefits, such as accurate branch prediction, improved loop transformations, and reduced code size. The technique is most relevant where the behavior of the application over different data sets is expected to be similar. PGO should always be performed as the last optimization step. If the application source code is changed after gathering profile data, this profile data becomes invalid. The compiler does not use profile data when it can detect that it is inaccurate. However, it is possible to change source code in a way that is not detectable by the compiler (e.g., by changing constants). The programmer should ensure that the profile data used for optimization remains accurate.

Available C or DSP runtime libraries can also be used for efficient implementation of algorithms. See ADIVDSP (2006) for more detail on various C-level compiler optimization techniques available for the reference embedded processor.

System-Level Optimization

System optimization starts with proper memory layout. In the best case, all code and data would fit inside the processor’s L1 memory. Unfortunately, this is not always possible, especially when large C-based applications are implemented within a networked application.
The real dilemma is that processors are optimized to move data independently of the core via DMA, but microcontroller unit (MCU) programs typically run using a cache model instead. While core fetches are an inescapable reality, using DMA or cache for large transfers is mandatory to preserve performance. Because internal memory is typically constructed in subbanks, simultaneous access by the DMA controller and the core can be accomplished in a single cycle by placing data in separate banks. For example, the core can be operating on data in one subbank while the DMA is filling a new buffer in a second subbank. Under certain conditions, simultaneous access to the same subbank is also possible. See Katz and Gentile (2006) for more details on system-level optimization for reference embedded processor applications.

Part 1 Data Processing

CHAPTER 2 Data Security

© 2010 Elsevier Inc. All rights reserved. DOI: 10.1016/B978-1-85617-678-1.00002-8

Data exchange and data storage are common processes that we use every day. Data is usually categorized as unclassified or classified. Unclassified data can be accessed by anyone without restrictions, whereas classified data must not be accessible to unintended third parties (i.e., anyone other than the sender and receiver). Examples of classified data are nations’ homeland security- and military-related data, innovative research data connected to defense and corporate interests, and financial transactions.

2.1 Cryptography Basics

Cryptography techniques are used to protect classified data from unintended observers or eavesdroppers (also called adversaries, attackers, interceptors, interlopers, intruders, opponents, or simply the enemy).

2.1.1 Cryptography Terminology

The following is a list of some important cryptography terms:

Plaintext: Message with understandable substance (content).
Encryption: Process of disguising a message in such a way as to hide its substance.
Cipher text: Encrypted message.
Decryption: Process of turning cipher text back into plain text.
Cipher: Mathematical function (algorithm) used for encryption.
Inverse cipher: Mathematical function used for decryption.
Key: Large m-bit number used in the encryption or decryption process. The range of possible values of the key is called the key space.
Cryptosystem: Algorithm along with all possible plain texts, cipher texts, and keys.
Cryptography: Art and science of keeping messages secure, practiced by cryptographers.
Cryptanalysis: Art and science of breaking cipher text, practiced by cryptanalysts.
Cryptology: Branch of mathematics encompassing both cryptography and cryptanalysis, practiced by cryptologists.

2.1.2 Cryptography System

Using cryptographic techniques, we make information unintelligible to people who do not have a need to know or who should not know. The basic cryptographic module consists of a secret key and a mathematical algorithm, as shown in Figure 2.1. The cryptographic process of converting plain text to an unintelligible form (termed cipher text) is called encryption. The inverse process of converting cipher text to plain text is called decryption.

[Figure 2.1, panel (a): Plain Text and Key enter the Encryption Algorithm, which outputs Cipher Text. Panel (b): Cipher Text and Key enter the Decryption Algorithm, which outputs Plain Text.]

Figure 2.1: Cryptographic modules (a) encryption and (b) decryption.

We can understand the importance of a cryptography system by considering an example of data exchanged between a military commander and his superior as follows:

SIR, WE ARE MOVING TOWARDS ENEMY

We make this classified information unintelligible (to unintended recipients) by encrypting the message before sending it over a communication channel, and we decrypt the message at the receiving side to read the information. To encrypt, we pass the original message (plain text) to the encryption algorithm to generate a cipher text (an unintelligible message). An encryption algorithm is a mathematical algorithm along with a secret key.
Usually, any cryptographic secret key is a large random number (e.g., a 128-bit number). To work with the mathematical algorithm, we first use a codeword table (e.g., the 8-bit ASCII table) to generate the numeric equivalent of the plain text. In the previous classified message, we have a total of 32 characters (one comma, five spaces, and 26 letters, with a few repeats). The equivalent numeric 8-bit ASCII values for the previous message characters are space, 00100000 (0x20); comma, 00101100 (0x2c); A, 01000001 (0x41); D, 01000100 (0x44); E, 01000101 (0x45); G, 01000111 (0x47); I, 01001001 (0x49); M, 01001101 (0x4d); N, 01001110 (0x4e); O, 01001111 (0x4f); R, 01010010 (0x52); S, 01010011 (0x53); T, 01010100 (0x54); V, 01010110 (0x56); W, 01010111 (0x57); and Y, 01011001 (0x59). The binary equivalent form of the previous message follows:

01010011 01001001 01010010 00101100 00100000 01010111 01000101 00100000
01000001 01010010 01000101 00100000 01001101 01001111 01010110 01001001
01001110 01000111 00100000 01010100 01001111 01010111 01000001 01010010
01000100 01010011 00100000 01000101 01001110 01000101 01001101 01011001

If we represent the previous binary equivalent data in hexadecimal notation, then the plain text becomes

53 49 52 2c 20 57 45 20 41 52 45 20 4d 4f 56 49
4e 47 20 54 4f 57 41 52 44 53 20 45 4e 45 4d 59

Let us select the following cryptographic key (say, a 128-bit random number) in hexadecimal notation:

89 fc 23 d5 71 1a 86 22 c1 42 76 dd b3 94 7e a9

With a mathematical algorithm along with the previous secret key, we obtain the following cipher data for the previous plain data:

da b5 71 f9 50 4d c3 02 80 10 33 fd fe db 28 e0
c7 bb 03 81 3e 4d c7 70 85 11 56 98 fd d1 3e f0

Now, if we map the previous hexadecimal cipher data back to cipher text using the ASCII table, we get

Ú μ q ù P M Ã STX € DLE 3 ý þ Û ( à Ç ETX · > M Ç p . . . DC1 V ˜ ý Ñ > ð

This cipher text is unintelligible (we cannot tell its substance).
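The text does not name the algorithm behind this example. As an observation rather than a statement of the author's method, a byte-wise XOR of the plain data with the 16-byte key, repeated for the second 16 bytes, reproduces 30 of the 32 cipher bytes listed above; at the 5th and 31st bytes XOR gives 0x51 and 0x33 where the listing shows 0x50 and 0x3e. The sketch below uses that toy XOR scheme, with illustrative function names, only to show the round trip of encrypting and decrypting with a shared key. A repeating-key XOR is not a secure cipher.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* "SIR, WE ARE MOVING TOWARDS ENEMY" as ASCII byte values (from the text). */
const uint8_t plain_data[32] = {
    0x53,0x49,0x52,0x2c,0x20,0x57,0x45,0x20,0x41,0x52,0x45,0x20,0x4d,0x4f,0x56,0x49,
    0x4e,0x47,0x20,0x54,0x4f,0x57,0x41,0x52,0x44,0x53,0x20,0x45,0x4e,0x45,0x4d,0x59
};

/* The 128-bit key from the text. */
const uint8_t key_data[16] = {
    0x89,0xfc,0x23,0xd5,0x71,0x1a,0x86,0x22,0xc1,0x42,0x76,0xdd,0xb3,0x94,0x7e,0xa9
};

/* Cipher data exactly as printed in the text. */
const uint8_t printed_cipher[32] = {
    0xda,0xb5,0x71,0xf9,0x50,0x4d,0xc3,0x02,0x80,0x10,0x33,0xfd,0xfe,0xdb,0x28,0xe0,
    0xc7,0xbb,0x03,0x81,0x3e,0x4d,0xc7,0x70,0x85,0x11,0x56,0x98,0xfd,0xd1,0x3e,0xf0
};

/* Toy cipher (an assumption of this sketch): XOR each byte with the
   repeating key. Applying it twice with the same key restores the input. */
void xor_with_key(const uint8_t *in, uint8_t *out, size_t len,
                  const uint8_t *key, size_t keylen)
{
    assert(keylen > 0);
    for (size_t n = 0; n < len; n++)
        out[n] = in[n] ^ key[n % keylen];
}

/* Counts the positions at which two buffers hold the same byte. */
size_t count_matches(const uint8_t *a, const uint8_t *b, size_t len)
{
    size_t same = 0;
    for (size_t n = 0; n < len; n++)
        same += (a[n] == b[n]);
    return same;
}
```

Calling xor_with_key on plain_data gives the cipher bytes, and calling it again on the result with the same key returns the original 32 message bytes, which mirrors the sender and receiver sharing one secret key.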
The sender transmits this cipher text to the receiver, and the recipient decrypts the received cipher text with the same cryptographic key, obtaining the plain text SIR, WE ARE MOVING TOWARDS ENEMY. This example shows the importance of cryptography systems, as it is very difficult for an adversary to obtain message content in the process of communication.

2.1.3 Cryptographic Practices

Cryptographic techniques allow us to transmit or to store classified data in a secure manner. In the cryptographic process, a cryptographic (or mathematical) algorithm can be in the public domain, but the cryptographic (or secret) key should not be disclosed to the public. Now, the question is, how good is this cryptographic system (i.e., a secret key, an algorithm, plain text, and cipher text)? Will it protect our data from eavesdroppers, or is it possible for eavesdroppers to get the content of the message without the cryptographic key? Well, that depends on the properties of the mathematical algorithm and the length and randomness of the key chosen. Here, the cryptographic key should be random enough that eavesdroppers have no clue about the key pattern. Eavesdroppers usually know the algorithm that is used in the cryptographic process, but with a well-designed algorithm this knowledge will not help them. In other words, the only way for eavesdroppers to get the content of the cipher text is by decrypting the cipher text with each possible key pattern. The possible number of key patterns with a 128-bit number is 2^128. We call this set of 2^128 possibilities the key space for a 128-bit key. Breaking the cipher text with this approach (i.e., by attempting all possible keys) is called a brute force attack on a cryptographic system.

Brute force attacks are very costly. For example, to break cipher text generated with a 128-bit key, the amount of computational power needed is estimated as follows.
Assume that decrypting the cipher text with one key pattern takes about N operations. If a computer performs 1 million (approximately 2^20) such operations per second, that is about 2^36 (24 × 60 × 60 × 2^20) operations per day, or about 2^45 (365 × 24 × 60 × 60 × 2^20) operations per year. With 1 million computers (i.e., about 2^65 operations per year), trying all 2^128 keys would take about N × 2^63, or roughly N × 10^19, years. To put this in context, the universe is believed to be only about 1.4 × 10^10 years old! Even if the cipher text could be decrypted with one (N = 1) operation, breaking the cipher text using the brute force method is impossible with available technology.

Is brute force the only way to break the cipher text? That’s a good question. The answer is no. Many types of attacks are used to break cipher text. We will start by discussing one such attack, called the known plain-text attack, and will examine other attacks later. In the known plain-text attack, the eavesdropper knows the content of some portion of the plain text and tries to break the cipher text by deducing the key pattern. If the eavesdropper succeeds in this process, then the cipher text can be read with a simple decryption, and there is no need to keep 1 million computers busy for some 10^19 years.

Is it possible for the eavesdropper to get the content of plain text and break the present cipher text? Well, it depends on how the particular organization handles classified information and manages the secret keys. For example, if the secret key of the cryptographic algorithm is not changed for a long time, and previous plain text messages are obtained by bribing the secretary, then the eavesdropper can succeed. Most of the time, eavesdroppers will not succeed because the secret key patterns are changed periodically. If the cipher text is generated with a new key, whatever plain text and cipher text the eavesdropper had are no longer useful.
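The known plain-text attack, and the reason that changing the key defeats it, can be illustrated on a deliberately weak cipher. The sketch assumes a toy scheme, not one described in the text, in which each byte is XORed with a repeating 16-byte key; all names are illustrative. If the eavesdropper holds one 16-byte block of plain text together with its cipher text, XORing the two yields the key directly, and any later message under the same key can be read without a brute-force search. If the later message uses a different key, the recovered key is useless.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KEYLEN 16
#define MAXMSG 64

/* Toy cipher (assumed for illustration only): repeating-key XOR. */
void toy_xor(const uint8_t *in, uint8_t *out, size_t len,
             const uint8_t key[KEYLEN])
{
    for (size_t n = 0; n < len; n++)
        out[n] = in[n] ^ key[n % KEYLEN];
}

/* Known plain-text attack: one known block reveals the whole key. */
void recover_key(const uint8_t *known_plain, const uint8_t *known_cipher,
                 uint8_t key_out[KEYLEN])
{
    for (size_t n = 0; n < KEYLEN; n++)
        key_out[n] = known_plain[n] ^ known_cipher[n];
}

/* Scenario: old_msg was sent under old_key and later leaked; new_msg is
   sent under new_key. Returns 1 if the eavesdropper can read new_msg
   using only the key deduced from the old plain text/cipher text pair. */
int attack_succeeds(const uint8_t old_key[KEYLEN],
                    const uint8_t new_key[KEYLEN],
                    const char *old_msg, const char *new_msg)
{
    uint8_t old_ct[KEYLEN], guess[KEYLEN];
    uint8_t new_ct[MAXMSG], out[MAXMSG];
    size_t new_len = strlen(new_msg);

    assert(strlen(old_msg) >= KEYLEN && new_len <= MAXMSG);

    /* What the eavesdropper observes on the channel. */
    toy_xor((const uint8_t *)old_msg, old_ct, KEYLEN, old_key);
    toy_xor((const uint8_t *)new_msg, new_ct, new_len, new_key);

    /* Leaked plain text plus its cipher text give the old key. */
    recover_key((const uint8_t *)old_msg, old_ct, guess);
    toy_xor(new_ct, out, new_len, guess);
    return memcmp(out, new_msg, new_len) == 0;
}
```

When old_key and new_key are the same, attack_succeeds returns 1; when the key has been changed between the two messages, it returns 0. Well-designed ciphers are built so that a known plain text/cipher text pair does not expose the key this easily, but the value of periodic key changes is the same.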
Thus, secret key management plays an important role in cryptographic applications. Later we present an overview of the key management process. Detailed discussion of the secret key management process is beyond the scope of this book. Security with Encryption Algorithms As discussed previously, an algorithm is considered computationally secure if it cannot be broken with available resources, either current or future. We measure the complexity of an attack in different ways: Data complexity: The amount of data needed as input to perform an attack Processing complexity: The time required to perform an attack Storage requirements: The amount of memory needed to perform an attack The security of a cryptosystem (plain texts, cryptographic algorithm, secret key, and cipher texts) is a function of two things: the strength of the algorithm and the length of the key. If the strength of an algorithm is perfect, then there is no better way to break the cryptosystem other than trying every possible key in a brute-force method. Good cryptosystems are designed to be infeasible to break with the computing power that is expected to evolve for many years in the future. If we hide the functionality of the encryption algorithm and the security of an algorithm is based on keeping the way that algorithm works a secret, it is a restricted algorithm, and is inadequate by today’s standards. A large or changing group of users cannot use them, because every time a user leaves the group, everyone else must switch to a different algorithm. If a user accidentally reveals the secret, everyone must change his or her algorithm. If we do not have a good cryptographer in the group, then we do not know whether we have a secure algorithm. Despite these major drawbacks, restricted algorithms are enormously popular for low-security applications, where users either do not realize or do not care about the security problems inherent in their system. 
In a standardized algorithm, all of the security rests in the key and none in the details of the algorithm. Products using these algorithms can be mass produced. It does not matter if an eavesdropper knows our algorithm; if she/he does not know our particular key, she/he cannot read our messages. Cryptosystems that look perfect are often extremely weak. Strong cryptosystems, with a couple of minor changes, can become weak. So it is best to trust algorithms that professional cryptologists have scrutinized for years without cracking them.

Attacks

The whole point of cryptography is to keep the plain text (or the key, or both) secret from eavesdroppers. Eavesdroppers are assumed to have complete access to the communications between the sender and receiver. Cryptanalysis is the science of recovering the plain text of a message without access to the key. An attempted cryptanalysis is called an attack. There are four general types of cryptanalytic attacks. Of course, each of them assumes that the cryptanalyst has complete knowledge of the encryption algorithm used. Let P_i, C_i, and E_K denote plain text, cipher text, and the encryption algorithm with key K. The four cryptanalytic attacks are described in the following.

1. Ciphertext-only attack:
   Given: C_1 = E_K(P_1), C_2 = E_K(P_2), ..., C_i = E_K(P_i)
   Deduce: Either P_1, P_2, ..., P_i; K; or an algorithm to infer P_{i+1} from C_{i+1} = E_K(P_{i+1})
2. Known plain-text attack:
   Given: P_1, C_1 = E_K(P_1), P_2, C_2 = E_K(P_2), ..., P_i, C_i = E_K(P_i)
   Deduce: Either K, or an algorithm to infer P_{i+1} from C_{i+1} = E_K(P_{i+1})
3. Chosen plain-text attack: This is more powerful than a known plain-text attack because the cryptanalyst can choose specific plain text blocks to encrypt that might yield more information about the key.
4. Adaptive chosen plain-text attack: This is a special case of a chosen plain-text attack. Not only can the cryptanalyst choose the plain text that is encrypted, but he can also modify his choice based on the results of previous encryptions.

Other types of cryptanalytic attacks include chosen cipher text, chosen key, rubber hose cryptanalysis, and purchase key. Algorithms differ in their degree of security, depending on how hard they are to break. Categories of breaking an algorithm follow:

Total break: Finding a key
Global deduction: Finding an alternate algorithm that results in plain text without knowledge of the key
Instance deduction: Finding the plain text of an intercepted cipher text
Information deduction: Gaining knowledge about the key or plain text

Key Management

Key management deals with key generation, distribution, storage, renewal and updating, and destruction. In the real world, key management is the hardest part of cryptography. Cryptanalysts often attack cryptosystems through the loopholes of key management. Why should we bother going through all the trouble of trying to break the cryptographic algorithm if we can recover the key because of some sloppy key management procedures? Why should we spend $500 million building a cryptanalysis machine if we can spend $500 bribing a clerk?

The security of an algorithm rests in the key. If we are using a cryptographically weak process (reduced key spaces or poor key choices) to generate keys, then our whole system is weak. The eavesdropper need not analyze our encryption algorithm; he/she can analyze our key generation algorithm. Therefore, we should generate the key bits from either a reliably random source or a cryptographically secure pseudorandom-bit generator. In Section 2.1.6, we discuss more about pseudorandom number generation for cryptographic applications.

We use encrypted keys in transferring keys from one point to another. The key-encryption keys (the keys used to encrypt other keys) have to be distributed manually. No data encryption key should be used for an indefinite period.
The longer a key is used, the greater the chance that it will be compromised. It is generally easier to do cryptanalysis with more cipher text encrypted with the same key. Given that, the keys must be replaced regularly and old keys must be destroyed securely. The key-encryption keys do not have to be replaced as frequently, because they are used only occasionally for key exchange. However, if a key-encryption key is compromised, the potential loss is extreme, as the security of the data encryption keys rests on the key-encryption key.

2.1.4 Cryptographic Applications

Cryptographic algorithms are used mainly for three purposes: (1) to keep classified data confidential, (2) to maintain data integrity, and (3) to ensure data authenticity.

Data confidentiality: Eavesdroppers try to acquire knowledge of classified data in data communications or data storage systems by tapping the classified data without authorization. By processing the classified data using cryptographic algorithms, we transmit or store the data in a secure manner.

Data integrity: Sometimes we may need to keep the data unchanged. The data may be altered by adding, deleting, or substituting other data. Data transmission or memory retrieval devices may introduce errors by adding noise. Sometimes unauthorized persons may change the content of data before it reaches the intended party.

Data authentication: Data authentication identifies the source of the data. By generating an authentication code using a secret key, we can establish data authenticity after verification. Most of the time the data need not be confidential, but to have confidence in the data, the data should have a trusted source and should not be modified by unauthorized people.

2.1.5 Cryptographic Algorithms

Cryptographic algorithms are broadly divided into three categories: (1) symmetric key algorithms, (2) public-key algorithms, and (3) hash function-based algorithms.
In symmetric key algorithms, we use the same secret key for both the encryption and decryption process. In public-key algorithms, we use one key for the encryption process (generation) and a different key for the decryption process (verification). In hash functions, we do not use a secret key to process the data. With these three kinds of algorithms, we achieve data confidentiality, data integrity, and data authentication.

Symmetric Key Algorithms

Examples of symmetric key algorithms are the advanced encryption standard (AES) and the triple data encryption algorithm (TDEA); they are used in most cryptographic applications for data encryption. Sections 2.2 and 2.3 present more details about the TDEA and AES algorithms, simulations, and efficient implementation techniques.

Public Key Algorithms

An example of a public-key algorithm is RSA (Rivest, Shamir, and Adleman). Public-key algorithms are used for data authentication. For data authentication, we transmit digital signatures computed using a public key-based digital signature algorithm (DSA). In Section 2.5, the elliptic-curve digital signature algorithm (ECDSA), that is, elliptic curve-based DSA, is discussed and simulated.

Hash Functions

Examples of hash-based algorithms are the SHA functions. Popular and standardized SHA functions include SHA-1, SHA-256, SHA-384, and SHA-512. Hash functions are used in achieving data integrity by computing a condensed message (or message digest) for the data. Hash-based algorithms are also used to generate pseudorandom numbers. In public-key algorithms and in computing message authentication codes, we use SHA functions to condense the messages. In Section 2.4, the keyed hash message-authentication code (HMAC) algorithm is discussed in detail and simulated. In Section 2.5, we use the hash function to generate condensed messages for ECDSA.

2.1.6 Cryptography and Random Numbers

We use random numbers in cryptography for many purposes.
For example, all cryptographic keys are random numbers. We also use random numbers as default initial constants or as a seed for some cryptography algorithms. Cryptographic algorithms take random numbers as input (as a key or as state) and output random data (as cipher text, as authentication code, or as condensed message). In other words, cryptographic algorithms can also be used for generating random numbers. Typically, we use symmetric key algorithms or hash functions to generate random numbers for public key algorithms.

As discussed previously, given an encryption algorithm that is mathematically proven and has good properties (it randomizes data without having weak instances for particular key patterns), the strength and security of the cryptographic system depends entirely on its key management process, as discussed in Section 2.1.3. In particular, the use of good (i.e., random or unpredictable) key patterns for a cryptographic algorithm is very important to the strength of the overall cryptosystem.

Now, the question is how to generate random numbers. In practice, we have two kinds of random numbers. One kind is truly random; we cannot reproduce them with any deterministic method. The other kind is not truly random, but the numbers look random and can be reproduced with deterministic methods. We cannot generate true random numbers with software algorithms. Instead, we use a physical phenomenon (e.g., radioactive decay, noise generated by electronic parts, or instantaneous temperature measurements) along with hardware to produce true random numbers; the subject of true random number generation is beyond the scope of this book.

Pseudorandom Numbers Generation

A pseudorandom number generator (PRNG) uses a deterministic algorithm to produce the random numbers, and these numbers are not truly random, as we can reproduce them again and again.
There is a vast amount of literature on the subject of pseudorandom number generation, and many algorithms have been developed for PRNG by the research community in the last few decades. There are many test procedures in the literature to verify randomness of numbers generated by PRNG. Once again, the subject of PRNG theory and test procedures is beyond the scope of this book. In this section, we discuss PRNGs based on the linear feedback shift register (LFSR) and the RC4 algorithm. We also discuss simulation and implementation techniques for these two algorithms in the next two subsections. Note that these two algorithms may not be practically useful in cryptography and may not pass all PRNG test procedures for reasons discussed later.

Linear Feedback Shift Register

The LFSR contains a small amount of memory to hold its state at any point of time. LFSR is basically used for scrambling (or randomizing) data bits to uniformly distribute energy in the whole bitstream. We can generate a pseudorandom binary sequence (PRBS) by using LFSR. The PRBS sequence is also used for bit interleaving with error correction algorithms such as RS codes and turbo codes. Figure 2.2 shows a signal flow diagram of the LFSR for the following PRBS generator polynomial:

$p(x) = x^{15} + x^{14} + 1$

The randomizer is initialized at the very beginning with a seed value of 100101010000000. As this LFSR contains 15 bits of memory, its output bit pattern does not repeat within a cycle of $2^{15} - 1$ bits. In other words, the LFSR shown in Figure 2.2 generates a pseudorandom binary sequence (PRBS) of length less than $2^{15}$. Then this PRBS sequence is used for randomizing the input data, interleaving the bit patterns, and generating random numbers. The straightforward simulation code for the LFSR shown in Figure 2.2 is given in Pcode 2.1 and a much more efficient simulation code is given in Pcode 2.2.
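As a quick cross-check on the period claim, the recurrence can also be stepped with the 15-bit state packed into one integer and the number of steps counted until the seed reappears. The following is a minimal sketch, not the book's Pcode: the function names `lfsr15_step` and `lfsr15_period` and the bit packing (stage 1 in bit 14, stage 15 in bit 0) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Advance the 15-stage LFSR for p(x) = x^15 + x^14 + 1 by one bit.
 * Assumed packing: stage 1 is bit 14 of *state, stage 15 is bit 0.
 * Returns the PRBS bit (stage 14 XOR stage 15), which is also fed
 * back into stage 1. */
static unsigned lfsr15_step(uint16_t *state)
{
    unsigned fb = ((*state >> 1) ^ *state) & 1u;      /* stage 14 ^ stage 15 */
    *state = (uint16_t)((*state >> 1) | (fb << 14));  /* shift, insert feedback */
    return fb;
}

/* Count the steps needed for the state to return to a nonzero seed. */
static unsigned long lfsr15_period(uint16_t seed)
{
    uint16_t s = seed;
    unsigned long n = 0;

    assert(seed != 0 && seed < 0x8000u);  /* all-zero state never changes */
    do {
        lfsr15_step(&s);
        n++;
    } while (s != seed);
    return n;
}
```

Under this packing the seed 100101010000000 is 0x4A80. Calling `lfsr15_period(0x4A80)` walks the full state cycle, so it should report $2^{15} - 1$ when the generator polynomial is maximal-length, as the text states.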
[Figure 2.2: Signal flow diagram of data randomizer. A 15-stage shift register (stages 1 to 15) is loaded with the seed 100101010000000; the PRBS bit is XORed with Data In to produce Data Out.]

    // seed 100101010000000
    s[0] = 1; s[1] = 0; s[2] = 0; s[3] = 1;
    s[4] = 0; s[5] = 1; s[6] = 0; s[7] = 1;
    s[8] = 0; s[9] = 0; s[10] = 0;
    s[11] = 0; s[12] = 0; s[13] = 0; s[14] = 0;
    for(i = 0; i < N; i++){                 // N: number of data bits
        tmp = s[14] ^ s[13];
        s[14] = s[13]; s[13] = s[12]; s[12] = s[11]; s[11] = s[10];
        s[10] = s[9]; s[9] = s[8]; s[8] = s[7]; s[7] = s[6];
        s[6] = s[5]; s[5] = s[4]; s[4] = s[3]; s[3] = s[2];
        s[2] = s[1]; s[1] = s[0]; s[0] = tmp;
        data_out[i] = data_in[i] ^ tmp;
    }

Pcode 2.1: Simulation code for LFSR shown in Figure 2.2.

    A = 0x95000000;     // initial state (15 MSBs); A, B, C are 32-bit unsigned
    B = 0xfffe0000;     // MASK for the 15 state bits
    for(i = 0; i < N; i++){                 // N: number of data bits
        C = A >> 1;
        A = A ^ C;
        A = A & B;
        A = A << 14;    // only the feedback bit survives, in the MSB
        if (A) data_out[i] = data_in[i] ^ 1;
        else   data_out[i] = data_in[i];
        A = C | A;
    }

Pcode 2.2: Efficient simulation code for LFSR shown in Figure 2.2.

LFSR and Pseudorandom Number Generation for Cryptography Applications

The LFSR system shown in Figure 2.2 generates random bits without repeating the bit pattern until the loop runs up to $2^{15} - 1$ times. In the interval [1, $2^{15}$], the generated bits are random. Now, the question is whether the random numbers generated by this LFSR system satisfy the requirement of cryptography standards. The answer is no. In cryptographic practice, the algorithm will be in the public domain, and attacking this type of system is very easy for the cryptanalyst even without knowledge of the initial seed, as the length of the seed is only 15 bits. The seed pattern of the LFSR shown in Figure 2.2 can easily be derived from its output sequence with present-day technology by using the brute force method. As per present cryptographic standards, we require a minimum of 160-bit-width polynomial seeds for LFSR.
To avoid attacks based on analytical methods, the SHA function (discussed in Section 2.4) is applied on the LFSR output, and the pseudorandom numbers generated by the LFSR-SHA system may be acceptable for cryptographic applications. For example, an LFSR with an output cycle period of $2^{160} - 1$ using a primitive polynomial of degree 160 follows:

\[
\begin{aligned}
p(x) ={}& x^{160} + x^{159} + x^{158} + x^{157} + x^{155} + x^{153} + x^{151} + x^{150} + x^{149} + x^{148} + x^{147} + x^{146} \\
&+ x^{142} + x^{141} + x^{137} + x^{134} + x^{133} + x^{132} + x^{130} + x^{128} + x^{126} + x^{125} + x^{121} + x^{120} \\
&+ x^{118} + x^{117} + x^{116} + x^{114} + x^{112} + x^{111} + x^{109} + x^{108} + x^{106} + x^{104} + x^{102} + x^{95} \\
&+ x^{94} + x^{90} + x^{89} + x^{88} + x^{86} + x^{85} + x^{84} + x^{83} + x^{82} + x^{81} + x^{80} + x^{78} + x^{76} \\
&+ x^{68} + x^{66} + x^{64} + x^{61} + x^{60} + x^{59} + x^{57} + x^{52} + x^{50} + x^{46} + x^{45} + x^{41} + x^{40} \\
&+ x^{39} + x^{38} + x^{37} + x^{36} + x^{35} + x^{31} + x^{29} + x^{27} + x^{26} + x^{25} + x^{23} + x^{20} + x^{18} \\
&+ x^{16} + x^{11} + x^{10} + x^{8} + x^{7} + x^{6} + x^{5} + x^{3} + x + 1
\end{aligned}
\]

The binary coefficients of the previous primitive polynomial p(x) are also represented in hexadecimal vector form as

    P = [0xf57e313a, 0xb1badaa0, 0x63bfa80a, 0x9d0a31fc, 0x574a86f5, 0x80000000],

where the coefficient of highest degree corresponds to the non-zero MSB (most significant bit) of the left-most word. Reproducing the PRBS of this LFSR system by the brute force method, without knowing the 160-bit initial seed, is not an easy task. As mentioned earlier, to avoid attacks on the LFSR (or derivation of its seed) based on analytical methods, SHA functions are applied on the output of the LFSR. In the next subsection, we discuss pseudorandom number generation based on the RC4 algorithm.

RC4 Algorithm

In this section, we discuss pseudorandom number generation using the RC4 stream cipher algorithm. The RC4 algorithm involves computation of S-Box (which consists of 256-byte elements, initially assigned with 0 to 255) values using the given key information.
RC4 uses a variable length key from 1 to 256 bytes to initialize a 256-byte S-Box table. The S-Box computation is done by iteratively swapping the locations of S-Box elements as given in Pcode 2.3. The S-Box table is used for the subsequent generation of pseudorandom bytes, which are then XORed with the input plain text to produce a cipher text. In other words, once we have a computed S-Box, the input data is encrypted (or randomized) one byte at a time by XORing it with an S-Box element, which is accessed through an offset obtained by manipulating the indices as given in Pcode 2.4. The S-Box elements are also continuously swapped during the encryption of every input byte, and each element in the S-Box table is swapped at least once in this process. The encryption (or randomization) process continues in this way until the input data bytes are exhausted.

The RC4 algorithm is a nonstandardized and yet powerful stream cipher. One of the reasons for not standardizing the RC4 algorithm is its simple mathematical structure. However, the RC4 algorithm is used as a stream cipher in low-security-risk applications and as a pseudorandom number generator in many standard cipher applications. RC4 is used in many commercial software packages such as Lotus Notes and Oracle Secure SQL, and in network protocols such as SSL, IPsec, WEP, and WPA.

RC4 and Pseudorandom Numbers Generation

In this section, we discuss the pseudorandomness of data patterns generated by the RC4 algorithm. As the RC4 state (S-Box) consists of 256 bytes, it is computationally difficult for adversaries to break the RC4-generated random pattern by using the brute force method. However, the RC4 algorithm is vulnerable to analytic attacks on the S-Box table; some weak keys exist for RC4 and some theoretical attacks have been performed on RC4 (Mister and Tavares, 1998).
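The key setup and byte generation described above can be sketched as a small self-contained C routine. This is an illustrative sketch under stated assumptions, not the book's Pcode: the names `rc4_ctx`, `rc4_init`, `rc4_byte`, and `rc4_crypt` are hypothetical, the key is indexed cyclically (`key[i % keylen]`) so that keys shorter than 256 bytes can be used, and the index `i` is incremented before each output byte, which is the conventional RC4 ordering and may differ from the indexing used in Pcode 2.4. The `drop` parameter, also an assumption, discards initial output bytes before any data is processed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the RC4 state: the S-Box and two indices. */
typedef struct {
    uint8_t s[256];
    uint8_t i, j;
} rc4_ctx;

/* Key setup: fill the S-Box with 0..255, then swap under key control. */
static void rc4_init(rc4_ctx *c, const uint8_t *key, size_t keylen)
{
    unsigned i, j = 0;

    assert(keylen >= 1 && keylen <= 256);
    for (i = 0; i < 256; i++)
        c->s[i] = (uint8_t)i;
    for (i = 0; i < 256; i++) {
        uint8_t t = c->s[i];
        j = (j + t + key[i % keylen]) & 0xff;   /* mod 256 */
        c->s[i] = c->s[j];
        c->s[j] = t;
    }
    c->i = c->j = 0;
}

/* Produce one pseudorandom byte and update (swap) the S-Box. */
static uint8_t rc4_byte(rc4_ctx *c)
{
    uint8_t a, b;

    c->i = (uint8_t)(c->i + 1);          /* uint8_t arithmetic wraps mod 256 */
    a = c->s[c->i];
    c->j = (uint8_t)(c->j + a);
    b = c->s[c->j];
    c->s[c->i] = b;
    c->s[c->j] = a;
    return c->s[(uint8_t)(a + b)];
}

/* XOR n bytes of buf with the keystream after discarding `drop` bytes. */
static void rc4_crypt(rc4_ctx *c, uint8_t *buf, size_t n, size_t drop)
{
    size_t k;

    for (k = 0; k < drop; k++)
        (void)rc4_byte(c);
    for (k = 0; k < n; k++)
        buf[k] ^= rc4_byte(c);
}
```

Because encryption is a plain XOR with the keystream, running `rc4_init` and `rc4_crypt` again with the same key and the same `drop` value on the cipher text restores the original bytes.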
    for(i = 0;i < 256;i++) S_Box[i] = i;    // initialize S_Box
    j = 0;
    for(i = 0;i < 256;i++){                 // update S_Box using 256 bytes key
        r0 = S_Box[i]; r1 = key[i];
        r1 = r0 + r1; r1 = r1 + j;
        j = r1 & 0xff;                      // mod 256
        r1 = S_Box[j];                      // look-up table access with arbitrary offset
        S_Box[j] = r0; S_Box[i] = r1;
    }

Pcode 2.3: Simulation code for RC4 S-Box computation.

    j = 0;
    for (i = 0;i < N;i++){                  // N: data length in bytes
        k = i & 0xff;                       // i mod 256, can be loaded with circular buffer addressing
        r0 = S_Box[k]; r1 = j + r0;
        j = r1 & 0xff;                      // mod 256
        r1 = S_Box[j];                      // look-up table access with arbitrary offset
        S_Box[j] = r0; S_Box[k] = r1;
        r1 = r1 + r0;
        r1 = r1 & 0xff;                     // mod 256
        r1 = S_Box[r1];                     // look-up table access with arbitrary offset
        in[i] = in[i] ^ r1;
    }

Pcode 2.4: Simulation code for RC4 Cipher.

We can strengthen RC4 security by following a few rules:

1. Drop the first few hundred bytes of output of RC4 to avoid weak key attacks and other key schedule-related attacks.
2. Do not repeat the secret key when generating the S-Box of RC4.
3. Do not use RC4 for generating (or encrypting) lengthy data patterns.

For more information on RC4 weaknesses, see Mister and Tavares (1998); Mantin and Shamir (2001); and Fluhrer, et al. (2001). Block ciphers such as DES and AES, discussed in Sections 2.2 and 2.3, and the hash functions discussed in Section 2.4 are also used for generating pseudorandom numbers. In the following, we discuss the complexity and simulation of RC4 as well as an efficient software implementation method for the RC4 encryption process.

RC4 Simulation and Complexity

In the iterative procedure of computing the RC4 S-Box or encryption (or randomization) processes given in Pcodes 2.3 and 2.4, the computation of a new j value requires updated (swapped) S-Box values. So, computing many j values and swapping them all at one time is not allowed, due to the dependency of j on updated S-Box values.
Every time we access the S-Box element from memory on the reference embedded processor using an arbitrary offset to the S-Box table, we consume extra clock cycles (due to pipeline stalls). This implementation is very inefﬁcient as we cannot interleave the program to avoid the pipeline stalls. We have one such look-up table access in Pcode 2.3 and two in Pcode 2.4. Next we estimate the complexity of the RC4 algorithm given in Pcodes 2.3 and 2.4 in terms of processor cycles. See Appendix A, Section A.4, on this book’s companion website for more information on cycles estimation on the reference embedded processor. With the present program ﬂow, we consume 11 cycles per iteration (assuming three pipeline stalls) in S-Box computation using Pcode 2.3, and we consume 17 cycles per iteration in data byte encryption using Pcode 2.4. With this, we consume 2816 (= 11 ∗ 256) cycles for S-Box computation and 17 ∗ N cycles for encryption of N data bytes. For N = 128, we consume 2176 (= 17 ∗ 128) cycles for encryption process. RC4 Implementation and Optimization The memory access stalls in RC4 can be avoided if we can compute a minimum of two j values (if not more) at a time and interleave the program code. After careful observation, the computation of two j values at a time is possible except for one case, when j = i + 1. By conditionally computing the new index value j , we can have two j values and can do two swaps at a time and thereby avoid extra stalls. Computing two random bytes and encrypting two data bytes at a time also achieve similar elimination of the stalls in the data encryption algorithm. This efﬁcient implementation code is given in Pcode 2.5. Here, we have a scope to interleave the program code and to eliminate the memory access stalls. 
With this approach we can reduce on average two clock cycles per iteration from the simulation code of Pcode 2.3. Now, we consume about 2304 (= 18 ∗ 128) cycles (instead of 2816) in the S-Box computation. A similar approach can be used to eliminate the pipeline stalls in the encryption process due to look-up table accesses with arbitrary offsets.

    j = k = 0; m = 1;
    for(i = 0;i < 256;i += 2){
        r0 = S_Box[i]; r1 = key[i];
        r1 = r1 + r0; r1 = r1 + k;
        r5 = key[m]; r3 = S_Box[m];         // S_Box[m] as it stands before the first swap
        j = r1 & 0xff;                      // mod 256
        r2 = S_Box[j];                      // memory access with arbitrary offset
        if (j == m) r4 = r0;                // first swap moves S_Box[i] into S_Box[m]
        else r4 = r3;
        r1 = r4 + r5; r1 = r1 + j;
        k = r1 & 0xff;                      // mod 256
        S_Box[i] = r2; S_Box[j] = r0;
        r4 = S_Box[k];                      // memory access with arbitrary offset
        r3 = S_Box[m];
        S_Box[k] = r3; S_Box[m] = r4;
        m = m + 2;
    }

Pcode 2.5: Efficient implementation of RC4 S-Box computation.

2.2 Triple Data Encryption Algorithm

The triple data encryption algorithm is based on the data encryption standard, adopted worldwide by most public and private organizations for data communications and data storage. The TDEA algorithm can process data blocks of 64 bits using three different keys, each of 56-bit length. In this section, we discuss the flow description of the TDEA-algorithm modules, namely, DES key expansion, DES cipher, and DES inverse cipher. We simulate the TDEA algorithm modules and get the simulation results for given input data and key. We also discuss the computational complexity of the DES algorithm and efficient techniques to implement the DES cipher and DES inverse cipher.

2.2.1 Introduction to TDEA

As shown in Figure 2.3, the TDEA algorithm consists of three cascaded DES units. Each DES unit uses a separate key to process the data. In the case of the TDEA cipher, we cascade the DES cipher followed by the DES inverse cipher followed by another DES cipher. The TDEA inverse cipher consists of inverse TDEA DES units.
In other words, in the case of a TDEA inverse cipher, we cascade the DES inverse cipher followed by the DES cipher followed by another DES inverse cipher. The same set of three keys shall be used for the TDEA cipher and TDEA inverse cipher. Hence we call the TDEA a symmetric cipher.

A few applications of the TDEA include data communications, data storage, Internet, military applications, classified data management, online banking, and memory protection. Like TDEA, the more recently developed AES (advanced encryption standard) is used in all of these applications. We discuss the AES algorithm in Section 2.3.

The strength of an encryption algorithm depends on its mathematical properties and supported key lengths. The DES is a very old standard with a small key space, and analysts have thoroughly understood and attacked the DES cipher text. The TDEA is based on DES with a large key space. AES is the latest standard, with a very large key space, no known attacks, and no known weak key patterns as of this writing.

2.2.2 TDEA Algorithm

The TDEA algorithm uses the DES algorithm as a basic unit as shown in Figure 2.3. TDEA uses a total of three DES units in cascade fashion with a different 56-bit keyword for each DES unit. Effectively, the TDEA algorithm key space is 168 (= 56 ∗ 3) bits. If we know how DES works, then TDEA is performed by simply cascading three such DES units. From here on, we concentrate on the DES algorithm. The flow diagram of the DES algorithm is shown in Figure 2.4. Input to the DES algorithm is a plain text of 64 bits and a key of 56 bits. (The key starts as a 64-bit encoded key. It is 56 bits after removing the check bits from the 64-bit encoded key.) The input 56 bits of key are then expanded using the DES key scheduler.

DES Key Scheduler

The DES key scheduler consists of three steps as shown in Figure 2.4. In the first step, we obtain the permuted 56-bit key data by applying the permutation choice-1 (PC-1).
The second step is basically a loop run 16 times that produces that many 56-bit data words. Before starting the loop, we treat the 56-bit key as two independent 28-bit words. In the loop, we rotate the two 28-bit words left by 1 or 2 bits in each iteration. The input to the next iteration of the loop is its previous iteration output. In the third step of the key scheduler, we take a 56-bit word (i.e., the result after combining the two left-shifted 28-bit words) output from each iteration of the loop and generate a 48-bit word (or eight 6-bit words) by using permutation choice-2 (PC-2). In this way, the key scheduler expands the input 56-bit key to a total of 128 (= 16 ∗ 8) 6-bit keywords for performing the DES algorithm (the same key scheduler is used for both cipher and inverse cipher).

[Figure 2.3: Block diagram of the TDEA algorithm. TDEA cipher: Plain Text -> E-DES-K1 -> D-DES-K2 -> E-DES-K3 -> Cipher Text. TDEA inverse cipher: Cipher Text -> D-DES-K3 -> E-DES-K2 -> D-DES-K1 -> Plain Text.]

[Figure 2.4: Flow diagram of DES algorithm. Key scheduler: Input Key -> PC-1 -> Left Shifts -> PC-2. Butterfly: Plain Text -> Initial Permutation (IP) -> L0, R0; for n = 1 to 16, $L_n = R_{n-1}$ and $R_n = L_{n-1} \oplus f(R_{n-1}, K_n)$; then L16, R16 -> Final Permutation (FP) -> Cipher Text.]

DES Cipher

As shown in Figure 2.4, the DES algorithm also consists of three steps: initial permutation, butterfly loop, and final permutation. In the first step, we apply the initial permutation on the input plain text before entering the butterfly loop. In the second step, the permuted plain text (split into two 32-bit words) passes through a 16-iteration butterfly loop to output the pre-encrypted data using expanded key data. We use eight 6-bit keywords in each iteration of the butterfly loop. As a third step, we apply the inverse of the initial permutation on the butterfly-loop output (i.e., on pre-encrypted data) to get the cipher text. The main module in a DES-algorithm butterfly loop is a nonlinear function f(.). The flow diagram of function f(.) is shown in Figure 2.5.
The nonlinear function f(.) in the DES butterfly loop again consists of three steps. In the first step, we expand (E) the 32-bit data to 48-bit data and then we XOR the expanded 48-bit data with 48 bits of key data (we use eight 6-bit words or 48 bits of key from the key scheduler output in a single iteration of the butterfly loop). Then in the second step, the XORed 48-bit data is split into eight 6-bit words and passed through 4 × 16 dimension S-Boxes (6-bit words are used as addresses to the S-Box tables with the first and last bit to specify the row of a table and middle 4 bits to specify the column number, see Section 2.2.3, DES Function Simulation) to get eight 4-bit words (S-Box consists of 4-bit words). Next we merge the eight 4-bit words to a single 32-bit word and then as a third step we apply permutation (P) on the merged 32-bit data to get the nonlinear function f(.) output.

[Figure 2.5: Flow diagram of nonlinear function f(.) in DES algorithm. The 32-bit $R_{n-1}$ is expanded by E to 48 bits and XORed with the 48-bit $K_n$; the result is split into eight 6-bit words that address S1 to S8; the eight 4-bit outputs are merged into 32 bits and permuted by P to give $f(R_{n-1}, K_n)$.]

DES Inverse Cipher

The flow of the DES inverse cipher is the same as that of the DES cipher. The only difference between the DES cipher and DES inverse cipher is that the former accesses the keywords from the start of the keyword buffer to the end of the buffer with its loop iterations (i.e., the first eight 6-bit keywords from 0 to 7 used for first iteration, the next eight 6-bit keywords from 8 to 15 used for second iteration, etc.), whereas the inverse cipher accesses the keywords from the end of buffer (i.e., the last eight 6-bit keywords from 120 to 127 used for first iteration, the next eight 6-bit keywords from 112 to 119 used for second iteration, etc.).

2.2.3 Simulation of DES Algorithm

In the DES algorithm, the permutation or expansion operations are carried out using the mapping tables (specified in the DES standard, Federal Information Processing Standard [FIPS], 1999).
For example, the permutation operation in the butterfly function is carried out by using the mapping table given in Table 2.1. By using this bit position mapping table, we get the 1st bit in the permuted word from the 16th bit of the input word, the 2nd bit in the permuted word from the 7th bit of the input word, and so on. Finally, the last bit of the permuted word comes from the 25th bit of the input word. In the simulation of all permutation operations, we use the equivalent shift values (precomputed and stored in a memory) instead of standard table values to reduce the cycle cost. For example, we use the derived shift values in Table 2.2 instead of actual bit numbers in Table 2.1 for simulating the DES butterfly permutation operation.

Table 2.1: Bit numbers for permutation

    16  7 20 21 29 12 28 17
     1 15 23 26  5 18 31 10
     2  8 24 14 32 27  3  9
    19 13 30  6 22 11  4 25

Table 2.2: Shift values for permutation

    16 25 12 11  3 20  4 15
    31 17  9  6 27 14  1 22
    30 24  8 18  0  5 29 23
    13 19  2 26 10 21 28  7

In Table 2.2, the shift values are obtained by subtracting bit position numbers from 32. If we perform the permutation of bits with logical AND, SHIFT, and OR operations as given in Pcode 2.16, then the use of the derived shift values will consume fewer cycles with C code when compared to the use of bit numbers and bit extraction.

DES Key Scheduler Simulation

For simulation purposes, we split the DES key scheduler into four parts: (1) permutation choice-1, (2) permutation choice-2, (3) left shifts, and (4) main key scheduler function. In the left shifts operation, we rotate two 28-bit words independently to the left by 1 bit or 2 bits. As we repeat this left shifts operation many times, we define two macros to simplify the code: DES_KEY_SCH_MACRO_ONE( ) for the 1-bit left shift, and DES_KEY_SCH_MACRO_TWO( ) for the 2-bit left shift. The simulation code for these two macros is given in Pcode 2.6.
We call the functions permutation choice-1, permutation choice-2, and the two left-shift macros from the main key scheduler function. The simulation code for the key scheduler function is given in Pcode 2.7.

    // rotate left by one bit (each 28-bit word sits in bits 31..4)
    #define DES_KEY_SCH_MACRO_ONE( ) \
        r3 = r1 >> 27; r1 = r1 << 1; \
        r1 = r1 | (r3 & 0x10); \
        r3 = r2 >> 27; r2 = r2 << 1; \
        r2 = r2 | (r3 & 0x10);

    // rotate left by two bits
    #define DES_KEY_SCH_MACRO_TWO( ) \
        r3 = r1 >> 26; r1 = r1 << 2; \
        r1 = r1 | (r3 & 0x30); \
        r3 = r2 >> 26; r2 = r2 << 2; \
        r2 = r2 | (r3 & 0x30);

Pcode 2.6: Simulation code for DES key scheduler macros.

    // void DESKeySch( )
    PermCh1(pc1);                                   // call permutation choice 1
    r1 = pc1[0]; r2 = pc1[1];
    DES_KEY_SCH_MACRO_ONE( ) PermCh2(r1,r2);        // call permute choice 2 (--> K1)
    DES_KEY_SCH_MACRO_ONE( ) PermCh2(r1,r2);        // --> K2
    for(i = 0;i < 6;i++){
        DES_KEY_SCH_MACRO_TWO( ) PermCh2(r1,r2);    // --> K3 to K8
    }
    DES_KEY_SCH_MACRO_ONE( ) PermCh2(r1,r2);        // --> K9
    for(i = 0;i < 6;i++){
        DES_KEY_SCH_MACRO_TWO( ) PermCh2(r1,r2);    // --> K10 to K15
    }
    DES_KEY_SCH_MACRO_ONE( ) PermCh2(r1,r2);        // --> K16

Pcode 2.7: Simulation code for DES key scheduler function.

Permutation Choice-1

The FIPS PUB 46-3 standard specifies a look-up table to perform the permutation choice-1 (PC-1) operation. According to the PC-1 table (shown in Table 2.3), we map the 64-bit encoded key data to 56-bit permuted data as follows. The 1st bit of the permuted key is obtained from the 57th bit in the input key, the 2nd bit of the permuted key is obtained from the 49th bit in the input key, and so on, until the 56th bit of the permuted key is obtained from the 4th bit of the input key.

Table 2.3: DES key scheduler permutation choice-1 table values

    57 49 41 33 25 17  9
     1 58 50 42 34 26 18
    10  2 59 51 43 35 27
    19 11  3 60 52 44 36
    63 55 47 39 31 23 15
     7 62 54 46 38 30 22
    14  6 61 53 45 37 29
    21 13  5 28 20 12  4

We do not use Table 2.3 directly in the simulation of the PC-1 function; however, we generate the same outputs as the table values specify.
We simulate the PC-1 function using logical AND, SHIFT, and OR operations instead of using a look-up table (since we consume fewer cycles per bit on the reference embedded processor with the analytic method given in Pcode 2.8 when compared to bit-mapping using look-up values). We demultiplex the 64-bit key data into seven 8-bit words in a nested loop. In this process, the check bits 8, 16, 24, . . ., 64 present in the 64-bit key are removed by left shifting 2 bits (instead of 1 bit) at the end of each iteration of the inner loop. After the loop, we obtain two 28-bit permuted words from the seven 8-bit words by rearranging the demultiplexed bits. See Section 2.2.3, DES Simulation Results, for PC-1 simulation output results.

    // void PermCh1(unsigned long *x3)
    // the shifts below assume 32-bit unsigned words
    r1 = r2 = r3 = r4 = r5 = r6 = r7 = 0;
    for(j = 0;j < 2;j++){
        tmp1 = des_key[j];
        for(i = 0;i < 4;i++){
            tmp2 = tmp1 & 0x80000000; r1 = r1 >> 1; tmp1 = tmp1 << 1; r1 = r1 | tmp2;
            tmp2 = tmp1 & 0x80000000; r2 = r2 >> 1; tmp1 = tmp1 << 1; r2 = r2 | tmp2;
            tmp2 = tmp1 & 0x80000000; r3 = r3 >> 1; tmp1 = tmp1 << 1; r3 = r3 | tmp2;
            tmp2 = tmp1 & 0x80000000; r4 = r4 >> 1; tmp1 = tmp1 << 1; r4 = r4 | tmp2;
            tmp2 = tmp1 & 0x80000000; r5 = r5 >> 1; tmp1 = tmp1 << 1; r5 = r5 | tmp2;
            tmp2 = tmp1 & 0x80000000; r6 = r6 >> 1; tmp1 = tmp1 << 1; r6 = r6 | tmp2;
            tmp2 = tmp1 & 0x80000000; r7 = r7 >> 1; tmp1 = tmp1 << 2; r7 = r7 | tmp2;   // remove check bit
        }
    }
    tmp1 = r1; r2 = r2 >> 8; tmp1 = tmp1 | r2;
    r3 = r3 >> 16; tmp1 = tmp1 | r3;
    r1 = r4 >> 28; r1 = r1 << 4;
    pc1[0] = tmp1 | r1;                     // store permuted first 28-bits
    tmp2 = r7; r6 = r6 >> 8; tmp2 = tmp2 | r6;
    r5 = r5 >> 16; tmp2 = tmp2 | r5;
    r1 = r4 << 4; r1 = r1 >> 28; r1 = r1 << 4;
    pc1[1] = tmp2 | r1;                     // store permuted second 28-bits

Pcode 2.8: Simulation code for DES key scheduler PC-1.
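To cross-check an analytic PC-1 routine against the table definition, a slow table-driven version is handy. The following is a minimal sketch under stated assumptions: the names `PC1_TABLE` and `pc1_by_table` are hypothetical, the 64-bit encoded key is passed as one `uint64_t` with bit 1 as the most significant bit, and the two 28-bit halves are returned right-aligned (Pcode 2.8 instead keeps each half in the upper 28 bits of a 32-bit word).

```c
#include <assert.h>
#include <stdint.h>

/* PC-1 bit numbers as listed in Table 2.3 (bit 1 = MSB of the 64-bit key). */
static const uint8_t PC1_TABLE[56] = {
    57, 49, 41, 33, 25, 17,  9,
     1, 58, 50, 42, 34, 26, 18,
    10,  2, 59, 51, 43, 35, 27,
    19, 11,  3, 60, 52, 44, 36,
    63, 55, 47, 39, 31, 23, 15,
     7, 62, 54, 46, 38, 30, 22,
    14,  6, 61, 53, 45, 37, 29,
    21, 13,  5, 28, 20, 12,  4
};

/* Table-driven PC-1: picks the 56 listed bits one at a time.
 * Outputs the first and second 28-bit halves, right-aligned. */
static void pc1_by_table(uint64_t key64, uint32_t *c0, uint32_t *d0)
{
    uint64_t out = 0;
    int i;

    assert(c0 != 0 && d0 != 0);
    for (i = 0; i < 56; i++) {
        /* bit number p (1-based from the MSB) sits at shift 64 - p */
        uint64_t bit = (key64 >> (64 - PC1_TABLE[i])) & 1u;
        out = (out << 1) | bit;
    }
    *c0 = (uint32_t)(out >> 28) & 0x0FFFFFFFu;
    *d0 = (uint32_t)out & 0x0FFFFFFFu;
}
```

To compare with the words stored by Pcode 2.8, shift each returned half left by 4 bits, so that `pc1[0]` corresponds to `c0 << 4` and `pc1[1]` to `d0 << 4`.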
Permutation Choice-2

In permutation choice-2 (PC-2), we use the following look-up values (which are derived from the FIPS PUB 46-3 standard PC-2 table) to perform shift operations:

    pc2[48] = {
        18,15,21, 8,31,27,29, 4,17,26,11,22,
         9,13,20,28, 6,24,16,25, 5,12,19,30,
        19, 8,29,23,13, 5,30,20, 9,15,27,12,
        16,11,21, 4,26, 7,14,18,10,24,31,28};

The PC-2 function takes two 28-bit left-shifted inputs and outputs two 24-bit permuted outputs. To perform this process, we get a shift value from the pc2[ ] look-up table, and obtain the permuted bit by shifting a 28-bit input word right by that shift value and extracting the first bit by ANDing with 0x01. The output of the PC-2 function (24-bit permuted word) is stored to the ks_key[ ] buffer. In Pcode 2.7, the PC-2 function is called a total of 16 times, and in each call it produces two 24-bit keywords. We use these expanded keys in both DES cipher and inverse cipher functions. The simulation code for the PC-2 function is given in Pcode 2.9. See Section 2.2.3, DES Simulation Results, for PC-2 simulation output results.

DES Function Simulation

In the DES function, we form a DES state using the given 64-bit input data and we update the DES state with DESInitP( ), followed by a 16-iteration butterfly function, and then followed by the DESFinalP( ) function. The butterfly loop itself consists of functions ExpandF( ), S_Box( ), and PermL( ). Figures 2.4 and 2.5 show the flow of the DES function. Both the cipher and inverse cipher use the same DES function except that the sequence in which the expanded keys are accessed differs. The simulation code for the DES cipher and DES inverse cipher is given in Pcodes 2.10 and 2.11, respectively.

DES Initial Permutation

The simulation technique used for the DES initial permutation (IP) is the same as the technique used for simulating the PC-1 function. In IP we permute all 64 input bits and output 64 permuted bits (unlike in PC-1, where we eliminate the redundant bits from input).
The simulation code for IP is given in Pcode 2.12. See Section 2.2.3, DES Simulation Results, for IP simulation output results.

DES Final Permutation

We perform the DES final permutation (FP) as per the look-up table values of IP^-1 given in FIPS PUB 46-3. The function FP takes 64 bits as the input and outputs 64 permuted bits. We can also compute FP using the analytic method. Although we used the analytic method in the simulation code to perform FP, we get the same permuted bits as in the look-up table method. We use logical AND, SHIFT, and OR operations to perform FP with the analytic method. The simulation code for DES FP is given in Pcode 2.13.

    // void PermCh2(unsigned long x1, unsigned long x2)
    k = 0;
    for(j = 0;j < 4;j++){
        tmp3 = 0;
        for(i = 0;i < 6;i++){
            tmp1 = pc2[k++]; tmp2 = x1 >> tmp1;
            tmp3 = tmp3 << 1;
            tmp2 = tmp2 & 0x01; tmp3 = tmp3 | tmp2;
        }
        ks_key[n++] = tmp3;                 // store permuted first 24-bit word
    }
    for(j = 0;j < 4;j++){
        tmp3 = 0;
        for(i = 0;i < 6;i++){
            tmp1 = pc2[k++]; tmp2 = x2 >> tmp1;
            tmp3 = tmp3 << 1;
            tmp2 = tmp2 & 0x01; tmp3 = tmp3 | tmp2;
        }
        ks_key[n++] = tmp3;                 // store permuted second 24-bit word
    }

Pcode 2.9: Simulation code for DES key scheduler PC-2.
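One key-scheduler step (rotate both 28-bit halves, then apply PC-2 through the shift values) can be tried in isolation with the sketch below. It is an illustration under stated assumptions rather than the book's code: the names `PC2_SHIFT`, `rotl28_hi`, and `perm_ch2_words` are hypothetical, each 28-bit half is held in bits 31..4 of a `uint32_t` (the layout the shift values imply), and the eight 6-bit keywords are written to a caller-supplied array instead of the `ks_key[ ]` buffer.

```c
#include <assert.h>
#include <stdint.h>

/* PC-2 shift values as listed in the text for pc2[48]. */
static const uint8_t PC2_SHIFT[48] = {
    18, 15, 21,  8, 31, 27, 29,  4, 17, 26, 11, 22,
     9, 13, 20, 28,  6, 24, 16, 25,  5, 12, 19, 30,
    19,  8, 29, 23, 13,  5, 30, 20,  9, 15, 27, 12,
    16, 11, 21,  4, 26,  7, 14, 18, 10, 24, 31, 28
};

/* Rotate a 28-bit value left by n bits; the value lives in bits 31..4. */
static uint32_t rotl28_hi(uint32_t w, unsigned n)
{
    uint32_t v = w >> 4;                          /* right-align the 28 bits */

    assert(n >= 1 && n <= 27);
    v = ((v << n) | (v >> (28u - n))) & 0x0FFFFFFFu;
    return v << 4;                                /* back to bits 31..4 */
}

/* PC-2: eight 6-bit keywords, four from each rotated half. */
static void perm_ch2_words(uint32_t c, uint32_t d, uint8_t out[8])
{
    unsigned k = 0, j, i;

    for (j = 0; j < 8; j++) {
        uint32_t half = (j < 4) ? c : d;
        uint8_t acc = 0;
        for (i = 0; i < 6; i++)
            acc = (uint8_t)((acc << 1) | ((half >> PC2_SHIFT[k++]) & 1u));
        out[j] = acc;
    }
}
```

For the first round key, both halves produced by PC-1 are rotated by one bit with `rotl28_hi(x, 1)` and passed to `perm_ch2_words`; later rounds repeat the step with a rotation of one or two bits, following the order shown in Pcode 2.7.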
    // void DESCipher( )
    des_state[0] = p_data[0]; des_state[1] = p_data[1];     // DES state
    DESInitP( );                                            // initial permutation
    j = 0;
    for(i = 0;i < 16;i++){                                  // butterfly loop
        Ln = des_state[0]; Rn = des_state[1];
        des_state[0] = Rn;                                  // L[n] = R[n-1]
        // R[n] = L[n-1] XOR f(R[n-1],K[n]), f(R[n-1],K[n]) = P(S(E(R[n-1]) XOR K[n]))
        ExpandF(Rn,t);                                      // E(R[n-1])
        t[0] = ks_key[j++]^t[0]; t[1] = ks_key[j++]^t[1];   // E(R[n-1]) XOR K[n]
        t[2] = ks_key[j++]^t[2]; t[3] = ks_key[j++]^t[3];
        t[4] = ks_key[j++]^t[4]; t[5] = ks_key[j++]^t[5];
        t[6] = ks_key[j++]^t[6]; t[7] = ks_key[j++]^t[7];
        S_Box(t);                                           // S(E(R[n-1]) XOR K[n])
        tmp = PermL(t);                                     // P(S(E(R[n-1]) XOR K[n]))
        des_state[1] = Ln^tmp;                              // L[n-1] XOR f(R[n-1],K[n])
    }
    Ln = des_state[0]; Rn = des_state[1];
    des_state[0] = Rn; des_state[1] = Ln;
    DESFinalP( );                                           // final permutation

Pcode 2.10: Simulation code for DES cipher.

    // void DESInvCipher( )
    des_state[0] = c_data[0]; des_state[1] = c_data[1];
    DESInitP( );                                            // initial permutation
    j = 120;                                                // key words accessing index initialization
    for(i = 0;i < 16;i++){
        Ln = des_state[0]; Rn = des_state[1];
        des_state[0] = Rn;                                  // L[n] = R[n-1]
        // R[n] = L[n-1] (+) f(R[n-1],K[n]), f(R[n-1],K[n]) = P(S(E(R[n-1]) (+) K[n]))
        ExpandF(Rn,t);                                      // E(R[n-1])
        t[0] = ks_key[j++]^t[0]; t[1] = ks_key[j++]^t[1];   // E(R[n-1]) (+) K[n]
        t[2] = ks_key[j++]^t[2]; t[3] = ks_key[j++]^t[3];
        t[4] = ks_key[j++]^t[4]; t[5] = ks_key[j++]^t[5];
        t[6] = ks_key[j++]^t[6]; t[7] = ks_key[j++]^t[7];
        S_Box(t);                                           // S(E(R[n-1]) (+) K[n])
        tmp = PermL(t);                                     // P(S(E(R[n-1]) (+) K[n]))
        des_state[1] = Ln^tmp;                              // L[n-1] (+) f(R[n-1],K[n])
        j-= 16;                                             // step back to the previous eight keywords
    }
    Ln = des_state[0]; Rn = des_state[1];
    des_state[0] = Rn; des_state[1] = Ln;
    DESFinalP( );                                           // final permutation

Pcode 2.11: Simulation code for DES inverse cipher.

The output of FP gives the cipher text in the case of the DES cipher and gives plain text in the case of the DES inverse cipher.
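The reason the inverse cipher can reuse the same butterfly loop with the keywords taken in reverse order is a property of the loop structure itself, independent of what f(.) computes. The sketch below illustrates this with a toy round function; `toy_f` and `feistel16` are hypothetical names, `toy_f` is an arbitrary mixing function chosen for illustration and is NOT the DES f(.), and the initial and final permutations are omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Toy round function: an assumption for illustration, not the DES f(.). */
static uint32_t toy_f(uint32_t r, uint32_t k)
{
    uint32_t x = r ^ k;

    x = (x << 5) | (x >> 27);          /* rotate left by 5 */
    return x * 0x9E3779B1u + k;        /* arbitrary mixing, wraps mod 2^32 */
}

/* 16-iteration butterfly loop followed by the final half swap.
 * dir = +1 uses keys[0..15] in order, dir = -1 uses keys[15..0]. */
static void feistel16(uint32_t state[2], const uint32_t keys[16], int dir)
{
    int n, idx;
    uint32_t t;

    assert(dir == 1 || dir == -1);
    idx = (dir > 0) ? 0 : 15;
    for (n = 0; n < 16; n++, idx += dir) {
        uint32_t ln = state[0], rn = state[1];
        state[0] = rn;                          /* L[n] = R[n-1] */
        state[1] = ln ^ toy_f(rn, keys[idx]);   /* R[n] = L[n-1] XOR f(R[n-1],K[n]) */
    }
    t = state[0]; state[0] = state[1]; state[1] = t;
}
```

Calling `feistel16(state, keys, +1)` and then `feistel16(state, keys, -1)` on the same two-word state returns the original words, even though `toy_f` itself is not invertible.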
Expand Function The Expand function (E-function) is part of the butterﬂy function f (.), which is iterated 16 times in the main DES function. The E-function expands the 32 bit input data to 48 bits by repeating few bits two times. We perform E-function as per the E-BIT SELECTION TABLE given in the FIPS PUB 46-3 standard. In the simulation code given in Pcode 2.14, we used an analytic method to simulate the E-function. S-Box Mixing In S-Box mixing, we output a 4-bit word from 6-bit input data by using a 2-dimensional S-Box mixing look-up table. As shown in Figure 2.5, we obtain a total of eight 4-bit words (32 bits) from eight 6-bit words (48 bits), by using eight S-Box mixing look-up tables. In the simulation code given in Pcode 2.15, we Data Security 31 // void DESInitP( ) r1 = r2 = r3 = r4 = r5 = r6 = r7 = r8 = 0; for(j = 0;j < 2;j++){ tmp1 = des_state[j]; for(i = 0;i < 4;i++){ tmp2 = tmp1 & 0x80000000; r1 = r1 >> 1; tmp1 = tmp1 << 1; r1 = r1 | tmp2; tmp2 = tmp1 & 0x80000000; r2 = r2 >> 1; tmp1 = tmp1 << 1; r2 = r2 | tmp2; tmp2 = tmp1 & 0x80000000; r3 = r3 >> 1; tmp1 = tmp1 << 1; r3 = r3 | tmp2; tmp2 = tmp1 & 0x80000000; r4 = r4 >> 1; tmp1 = tmp1 << 1; r4 = r4 | tmp2; tmp2 = tmp1 & 0x80000000; r5 = r5 >> 1; tmp1 = tmp1 << 1; r5 = r5 | tmp2; tmp2 = tmp1 & 0x80000000; r6 = r6 >> 1; tmp1 = tmp1 << 1; r6 = r6 | tmp2; tmp2 = tmp1 & 0x80000000; r7 = r7 >> 1; tmp1 = tmp1 << 1; r7 = r7 | tmp2; tmp2 = tmp1 & 0x80000000; r8 = r8 >> 1; tmp1 = tmp1 << 1; r8 = r8 | tmp2; } } tmp1 = r2; r4 = r4 >> 8; tmp1 = tmp1 | r4; r6 = r6 >> 16; tmp1 = tmp1 | r6; r8 = r8 >> 24; des_state[0] = tmp1 | r8; tmp2 = r1; r3 = r3 >> 8; tmp2 = tmp2 | r3; r5 = r5 >> 16; tmp2 = tmp2 | r5; r7 = r7 >> 24; des_state[1] = tmp2 | r7; // store permuted first 32-bits // store permuted second 32-bits Pcode 2.12: Simulation code for initial permutation of DES function. 
// void DESFinalP( ) r1 = 25; r2 = 24; r3 = 25; r4 = 24; tmp1 = des_state[0]; tmp2 = des_state[1]; tmp3 = 0; tmp4 = 0; for(i = 0;i < 4;i++){ for(j = 0;j < 4;j++){ r5 = tmp1 & 0x80000000; r6 = tmp2 & 0x80000000; r5 = r5 >> r1; r6 = r6 >> r2; tmp4 = tmp4 | r5; tmp1 = tmp1 << 1; tmp4 = tmp4 | r6; tmp2 = tmp2 << 1; r1-= 8; r2-= 8; } for(j = 0;j < 4;j++){ r5 = tmp1 & 0x80000000; r6 = tmp2 & 0x80000000; r5 = r5 >> r3; r6 = r6 >> r4; tmp3 = tmp3 | r5; tmp1 = tmp1 << 1; tmp3 = tmp3 | r6; tmp2 = tmp2 << 1; r3-= 8; r4-= 8; } r1+= 34; r2+= 34; r3+= 34; r4+= 34; } des_state[0] = tmp3; des_state[1] = tmp4; Pcode 2.13: Simulation code for ﬁnal permutation of DES function. perform S-Box mixing by combining eight look-up tables into a single big look-up table sb[ ] and accessing the corresponding 4-bit words with appropriate offsets. The look-up table sb[ ] values can be found on this book’s companion website. 32 Chapter 2 // void ExpandF(unsigned long x, unsigned char *y) r1 = x << 3; r2 = r1 >> 26; y[1] = r2; r1 = x << 11; r2 = r1 >> 26; y[3] = r2; r1 = x << 19; r2 = r1 >> 26; y[5] = r2; r1 = x << 27; r2 = r1 >> 26; r1 = x >> 31; r2 = r2 | r1; y[7] = r2; r3 = x << 7; r4 = r3 >> 26; y[2] = r4; r3 = x << 15; r4 = r3 >> 26; y[4] = r4; r3 = x << 23; r4 = r3 >> 26; y[6] = r4; r3 = x << 31; r4 = r3 >> 26; r3 = x >> 27; r4 = r4 | r3; y[0] = r4; Pcode 2.14: Simulation code for Expand function of DES butterﬂy function f (.). // void S_Box(unsigned char *y) for(i = 0;i < 8;i++){ r1 = y[i]; r2 = r1 & 1; r3 = r1 >> 5; r1 = r1 >> 1; r1 = r1 & 0x0f; r3 = r3 << 5; r2 = r2 << 4; r3 = r3 | r1; r3 = r3 | r2; r3 = i*64+r3; r2 = sb[r3]; y[i] = r2; } Pcode 2.15: Simulation code for S-Box mixing in DES butterﬂy function f (.). 
// unsigned long PermL(unsigned char *y) tmp = 0; r2 = 0; for(i = 0;i < 8;i++){ tmp = tmp << 4; tmp = tmp | y[i]; // pack 4-bit words to 32-bit word } for(i = 0;i < 32;i++){ r2 = r2 << 1; r1 = tmp >> PermtL[i]; r1 = r1 & 1; r2 = r2 | r1; } return r2; Pcode 2.16: Simulation code for permutation in DES butterﬂy function f (.). Permutation The permutation function of the DES butterﬂy function f (.) takes 32 bits of data as input and outputs 32 bits as permuted data. The simulation code for the permutation operation of the butterﬂy function is given in Pcode 2.16. We use the following shift values look-up table PermtL[ ] (the same as Table 2.2, which is derived from Table 2.1) to perform the permutation operation. PermtL[32] = { 16,25,12,11,3,20, 4,15,31,17,9, 6,27,14, 1,22, 30,24, 8,18,0, 5,29,23,13,19,2,26,10,21,28, 7}; DES Simulation Results Input: p_data[ ], 64-bit plain text and des_key[ ], 64-bit encoded key p_data[2] = {0x01122334, 0x45566778}; des_key[2] = {0x0f1e2d3c, 0x4b5a6978}; Data Security 33 Key Scheduler PC-1 output: xx[ ], 56-bit or two 28-bit words (after removing check bits) xx[2] = {0x00f0cca0, 0x330fffa0}; // left aligned Left shifts output: yy[ ], two 28-bit words (after rotating 1 bit left) yy[2] = {0x01e19940, 0x661fff40}; // left aligned PC-2 output: zz[ ], eight 6-bit words zz[8] = {0x1c, 0x03, 0x03, 0x24, 0x3a, 0x3d, 0x32, 0x38}; Key scheduler output: ks_key[ ], 128 6-bit words ks_key[128] = { 0x1C,0x03,0x03,0x24,0x3A,0x3D,0x32,0x38,0x00,0x09,0x31,0x34,0x22,0x3F,0x3B,0x3A, 0x31,0x06,0x21,0x12,0x2F,0x1D,0x3C,0x31,0x09,0x2E,0x1C,0x20,0x26,0x34,0x39,0x36, 0x32,0x21,0x14,0x03,0x37,0x1E,0x2E,0x14,0x1A,0x18,0x09,0x19,0x2C,0x16,0x1B,0x1D, 0x01,0x1D,0x02,0x0A,0x3E,0x3B,0x0A,0x07,0x0C,0x20,0x27,0x12,0x2D,0x26,0x1E,0x2F, 0x04,0x25,0x28,0x11,0x0D,0x37,0x36,0x07,0x03,0x13,0x25,0x04,0x1B,0x22,0x07,0x37, 0x00,0x26,0x13,0x0D,0x39,0x3E,0x27,0x0F,0x16,0x14,0x14,0x20,0x19,0x29,0x1F,0x1B, 
0x30,0x08,0x26,0x29,0x37,0x39,0x15,0x2F,0x24,0x1A,0x08,0x07,0x13,0x2D,0x3F,0x28, 0x08,0x11,0x3A,0x02,0x16,0x0F,0x35,0x3D,0x18,0x03,0x08,0x08,0x3B,0x0D,0x31,0x3D}; DES Cipher DES state: des_state[ ], 64-bit data copied from p_data[ ] des_state[2] = {0x01122334, 0x45566778}; Initial permutation output: des_state[ ], 64-bit permuted data des_state[2] = {0xf0aa7855, 0x00cc8066}; DES Butterﬂy output: des_state[ ], 64-bit intermediate data after each iteration des_state[2] = {0x00cc8066, 0xc9ed3c55}; des_state[2] = {0xc9ed3c55, 0x8e6d9383}; des_state[2] = {0x8e6d9383, 0xd42d8678}; des_state[2] = {0xd42d8678, 0x67202012}; des_state[2] = {0x67202012, 0xa319a3bc}; des_state[2] = {0xa319a3bc, 0x80dd257e}; des_state[2] = {0x80dd257e, 0x31ead8ed}; des_state[2] = {0x31ead8ed, 0x38f0ff66}; des_state[2] = {0x38f0ff66, 0xd10d67a6}; des_state[2] = {0xd10d67a6, 0xcf0a862c}; des_state[2] = {0xcf0a862c, 0x7dd727c4}; des_state[2] = {0x7dd727c4, 0xafceae47}; des_state[2] = {0xafceae47, 0xb9bdad67}; des_state[2] = {0xb9bdad67, 0xcced41af}; des_state[2] = {0xcced41af, 0x70cb25bd}; des_state[2] = {0x70cb25bd, 0x1491f770}; // after first iteration // after second iteration // after third iteration // after fourth iteration // after fifth iteration // after sixth iteration // after seventh iteration // after eighth iteration // after ninth iteration // after tenth iteration // after eleventh iteration // after twelfth iteration // after thirteenth iteration // after fourteenth iteration // after fifteenth iteration // after sixteenth iteration Pre-encrypted DES output: des_state[ ], 64-bit intermediate data des_state[2] = {0x1491f770, 0x70cb25bd}; Final permutation output: des_state[ ], 64-bit output data des_state[2] = {0x3e244e22, 0xd78fa536}; Output: c_data[ ], 64-bit cipher text c_data[2] = {0x3e244e22, 0xd78fa536}; 2.2.4 Computational Complexity of DES Algorithm Most of the operations involved in the DES key scheduler, DES cipher and DES inverse cipher are bit operations rather than 
byte or word operations. Because we process all the data in terms of bits, running DES on the reference embedded processor consumes more cycles than a byte- or word-oriented algorithm would. For more details on clock cycle requirements for particular operations, see Appendix A, Section A.4, on this book’s companion website.

Complexity of DES Key Scheduler

Permutation Choice-1

In PC-1, we use a permutation table as shown in Table 2.3 to get permuted key data from the input key data. We do not use Table 2.3 directly in the simulation of PC-1; however, using Pcode 2.8, we generate the same outputs as the table values. We now estimate the clock cycle requirement of the PC-1 operation. With the approach used to simulate PC-1, we have a nested loop in the program. The inner loop runs four times and the outer loop runs two times. In these loops, we basically demultiplex the 64-bit input key into seven 8-bit words. From Pcode 2.8, to demultiplex 1 bit we perform four operations, which take four cycles. We therefore consume 224 (= 56 ∗ 4) cycles for the 56 input key bits. Rearranging the demultiplexed bits to get the final two 28-bit words takes 18 cycles. We consume another 18 cycles in initialization, loading the input key, and removing the check bits. In total, the PC-1 operation takes about 260 cycles.

Permutation Choice-2

The next big module in the DES key scheduler is permutation choice 2 (PC-2). The DES standard, FIPS PUB 46-3, specifies another table for PC-2. In the simulation of PC-2, we use values derived from the standard table to extract the permuted bits on the reference embedded processor. The look-up values for PC-2 are generated by subtracting the standard table values from 32. The PC-2 simulation given in Pcode 2.9 has two nested loops. One pass of the inner loop body consumes five cycles. The inner loop runs six times, and the outer loop four times. Therefore, we consume a total of 128 (= (6 ∗ 5 + 2) ∗ 4) cycles in a single nested loop.
A total 256(= 2 ∗ 128) cycles for two nested loops are consumed in performing the PC-2 operation. We perform the PC-2 operation 16 times in the DES key scheduler and we consume a total 4096(= 16 ∗ 256) cycles. Apart from permutations, the DES key scheduler performs left shifts of two 28-bit words and these operations consume about 128(= 16 ∗ 8) cycles. With this, to run the DES key scheduler on the reference embedded processor, we spend about 4484(= 260 + 128 + 4096) cycles. Complexity of DES Cipher Now we discuss the complexity of the DES cipher module. In the DES cipher, we perform an initial permutation (IP), a butterﬂy loop with f (.) function and a ﬁnal permutation (FP). The complex part of the DES cipher is its butterﬂy loop. The nonlinear butterﬂy function f (.) consists of three subfunctions, expand, S-Box mixing and permutation. Initial Permutation The operations IP (Pcode 2.12) for the cipher and PC-1 (Pcode 2.8) for the key scheduler are almost similar and clock cycle consumption of IP is the same as that of PC-1. Therefore, about 260 clock cycles are required to run IP on the reference embedded processor. Final Permutation The FP operation consists of a nested loop with two inner loops as given in Pcode 2.13. Each inner loop consumes 40 cycles. We consume a total of 346(= [2 ∗ 40 + 4] ∗ 4 + 10) cycles in performing the FP operation with a reference embedded processor. Expand Function As given in Pcode 2.14, the simulation code of the expansion subfunction does not have any dependencies and each operation consumes a single cycle. Therefore, the expansion subfunction consumes a total of 28 cycles. S-Box Mixing We use Pcode 2.15 for S-Box mixing to obtain 4-bit words from 6-bit words. We consume 12 cycles for obtaining a single 4-bit word and we consume about 96 cycles for obtaining eight 4-bit words. Permutation The third subfunction of the butterﬂy function f (.) is a permutation operation. 
This operation is costly in terms of cycles, as it involves a 32-bit permutation. In the butterfly permutation, we first pack eight 4-bit words into a 32-bit word, which takes 16 cycles. Then we use a look-up table to permute the 32-bit word. We consume five cycles to get 1 permuted bit, and therefore 160 cycles to permute 32 bits. In total, the permutation operation takes 176 cycles.

With this, the total number of clock cycles required for the butterfly function is 300 (= 28 + 96 + 176). Apart from the butterfly function, we add the key to the expanded input data and store the temporary data via swapping. These operations take 29 cycles. Therefore, a single iteration of the DES cipher consumes 329 (= 300 + 29) cycles, and 16 iterations consume 5264 (= 329 ∗ 16) cycles. The total number of cycles required to get 64-bit cipher text from 64-bit plain text using the DES cipher is then 5870 (= 260 + 346 + 5264). As the DES cipher and inverse cipher have the same flow, the DES inverse cipher consumes about the same number of cycles.

This cycle estimate holds only if we interleave the program code in the implementation of the DES algorithm, since a look-up table access whose result is used immediately consumes more than one cycle. Without interleaving, the cycle count for the DES algorithm is much higher than the estimates given here. As the DES key scheduler does not work in real time, we do not discuss its optimization further. In the next section, we discuss optimization techniques for the DES cipher modules.

2.2.5 Efficient Implementation of DES Cipher

As discussed in the previous section, the DES cipher module consists of three steps. The first and last steps are permutations and the middle step is the butterfly loop.
The costliest step is the butterfly loop, which consumes about 5264 clock cycles to encrypt or decrypt the data using an expanded key. In this section, we discuss an efficient way of implementing the butterfly-loop function. As discussed, the function f(.) in the butterfly loop consists of three steps: expansion, S-Box mixing and permutation. As shown in Figure 2.5, after expanding 32-bit data (R[n − 1]) to 48-bit data and XORing with keywords (K[n]), we have eight 6-bit words. In the S-Box mixing, we get eight 4-bit output words from eight 6-bit input words. Then, we merge the eight 4-bit words into a single 32-bit word before permutation. The permutation operation maps bits one-to-one from the input 32-bit word to the output 32-bit permuted word. This one-to-one mapping of bits gives us the scope for optimizing the butterfly loop. On careful observation, the last two steps, S-Box mixing and permutation, can be combined as follows:

$$y = S_i\text{-Box}[x], \qquad z = P(y) = M_i[x]$$

Here M_i[x] contains the permuted values of the S_i-Box elements. We illustrate these equations with an example. Assume i = 1 and x = 48. Then y is obtained from the first S-Box and is equal to 15: the value 48 is 110000 in binary, the first and last bits (10) give row index 2, and the middle 4 bits (1000) give column index 8, both counted from zero. The value of z is obtained by permuting the bits of the value 15 (= 1111 in binary). Because i = 1, these bits occupy the first 4 bits (from the left) of the merged 32-bit word. According to the permutation table given in Table 2.1, the 1st bit goes to the 9th position, the 2nd bit goes to the 17th position, the 3rd bit goes to the 23rd position and the 4th bit goes to the 31st position in the permuted word.
Therefore the permuted value z is equal to 0x00808202 (0000 0000 1000 0000 1000 0010 0000 0010). If the value of x is the same (equal to 48) but the S-Box number is different (i.e., i is other than one), then the value of z will be different from 0x00808202 as the position of bits of y in the merged word occupy a position different from the ﬁrst four positions. We store the look-up table Mi [x ] elements such that the elements are accessed linearly (that means, if x = 48, then the corresponding element present in the look-up table Mi [x ] is at the location with offset equal to 48). Therefore, in this case, unpacking and packing of bits (to simulate as speciﬁed in the standard, like ﬁrst and last bit represents row index and middle 4 bits represents column index) is not needed to implement it. The elements in Mi [x ] are comprised of 32-bit words. If we want to permute all eight S-Box values in advance, then we need 2 kB (= 512 ∗ 4) of on-chip memory. The 512 elements of Mi [x ] can be found on the companion website. With this, the butterﬂy-loop ﬂow can be viewed as eight independent parallel ﬂows as shown in Figure 2.6. The simulation code for efﬁcient implementation of the DES butterﬂy loop is given in Pcode 2.17. In the ﬁrst step, we get expanded 6-bit words from an input 32-bit word for all eight paths. In the second step we XOR eight expanded 6-bit words with eight 6-bit keywords. In the third step, we get permuted S-Box elements for all eight paths by using eight XORed 6-bit values as offsets to the look-up table Mi [x ]. Finally, we OR all eight paths’ 32-bit words (which are orthogonal to each other with respect to bit-positions ﬁlled by placing four permuted bits) to get one 32-bit word as the butterﬂy function f (.) output. 
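The worked example (i = 1, x = 48, z = 0x00808202) can be reproduced with a few lines of C. The sketch below builds a single entry of the combined table from the shift values PermtL[ ] listed in Section 2.2.3 and from the first S-Box of FIPS PUB 46-3. The function names `perm_p`, `m_entry`, and `m1_lookup` are illustrative; the book's actual M[ ] table is distributed on the companion website and may be generated differently.

```c
#include <assert.h>
#include <stdint.h>

/* Shift values PermtL[ ] from the text (32 minus the P-table entries). */
static const uint8_t PermtL[32] = {
    16, 25, 12, 11,  3, 20,  4, 15, 31, 17,  9,  6, 27, 14,  1, 22,
    30, 24,  8, 18,  0,  5, 29, 23, 13, 19,  2, 26, 10, 21, 28,  7};

/* First DES S-Box (S1) from FIPS PUB 46-3: rows 0..3, columns 0..15. */
static const uint8_t S1[4][16] = {
    {14,  4, 13,  1,  2, 15, 11,  8,  3, 10,  6, 12,  5,  9,  0,  7},
    { 0, 15,  7,  4, 14,  2, 13,  1, 10,  6, 12, 11,  9,  5,  3,  8},
    { 4,  1, 14,  8, 13,  6,  2, 11, 15, 12,  9,  7,  3, 10,  5,  0},
    {15, 12,  8,  2,  4,  9,  1,  7,  5, 11,  3, 14, 10,  0,  6, 13}};

/* Permute a 32-bit word the way the second loop of Pcode 2.16 does. */
static uint32_t perm_p(uint32_t w)
{
    uint32_t out = 0;
    for (int i = 0; i < 32; i++)
        out = (out << 1) | ((w >> PermtL[i]) & 1u);
    return out;
}

/* Combined-table value for S-Box number i (1..8) and 4-bit S-Box output y:
   place y in the i-th nibble from the left of the merged word, then permute. */
uint32_t m_entry(int i, uint32_t y)
{
    return perm_p(y << (32 - 4 * i));
}

/* M1[x] with linear addressing: x is the raw 6-bit value, so the row/column
   split happens once at table-build time instead of in the cipher loop. */
uint32_t m1_lookup(uint32_t x)
{
    uint32_t row = ((x >> 4) & 2u) | (x & 1u);   /* first and last bits */
    uint32_t col = (x >> 1) & 0x0fu;             /* middle four bits    */
    return m_entry(1, S1[row][col]);
}

/* Optional self-check against the worked example in the text. */
void m_table_selfcheck(void)
{
    assert(m1_lookup(48) == 0x00808202u);
}
```

Filling all 512 entries would repeat `m_entry` over the eight S-Boxes and 64 input values each. Because the permutation is one-to-one, entries that come from different S-Boxes never share a set bit, which is why the eight partial results can simply be ORed together.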
In this approach, we have more scope to interleave the program code. Also, it is easy to distribute the workload to multiple ALUs of a deeply pipelined embedded processor.

[Figure 2.6: Efficient implementation of DES butterfly flow. The 32-bit input R[n−1] feeds eight expansion (E) blocks; each 6-bit output is XORed with one 6-bit keyword K[n,1] ... K[n,8] and used as an offset into one of the tables M1 ... M8; the eight 32-bit table outputs are ORed to give f(R[n−1], K[n]).]

    j = 0;
    for(i = 0;i < 16;i++){
        r2 = des_state[1];
        r1 = r2 << 31; tmp1 = r2 >> 1; r1 = tmp1 | r1;    // rotate right by 1: path 1
        tmp1 = r1 >> 26;
        tmp1 = tmp1^ks_key[j++];
        tmp2 = M[tmp1];
        for(k=1;k < 7;k++) {
            r1 = r1 << 4;
            tmp1 = r1 >> 26;            // get 6-bit words
            tmp1 = tmp1^ks_key[j++];    // XOR with key
            tmp1 = tmp1 + 64*k;         // get offset
            tmp1 = M[tmp1];
            tmp2 = tmp2 | tmp1;
        }
        r1 = r2 >> 31; tmp1 = r2 << 1; r1 = r1 | tmp1;    // rotate left by 1: path 8
        tmp1 = r1 & 0x3f;
        tmp1 = tmp1^ks_key[j++];
        tmp1 = tmp1 + 64*7;
        tmp1 = M[tmp1];
        tmp2 = tmp2 | tmp1;
        r1 = des_state[0];
        r1 = r1^tmp2;
        des_state[0] = r2;
        des_state[1] = r1;
    }

Pcode 2.17: Simulation code for efficient implementation of DES loop.

Now, we discuss the clock cycle consumption of the DES cipher with the suggested implementation of the DES butterfly loop. As seen in Pcode 2.17, in paths 2 to 7 we consume six cycles each for all three operations (expansion, S-Box mixing and permutation), whereas in paths 1 and 8 we consume eight cycles each. Once we get f(R[n − 1], K[n]), we update the left and right outputs of the butterfly as L[n] = R[n − 1] and R[n] = L[n − 1] ⊕ f(R[n − 1], K[n]). These operations consume about four cycles. We consume a total of 56 (= 6 ∗ 6 + 2 ∗ 8 + 4) cycles for one iteration of the butterfly loop. So, the cycles required for the butterfly loop total 896 (= 56 ∗ 16), whereas the original approach consumes 5264 cycles as discussed in the previous section. As the suggested approach is easily extendable to multiple ALUs, the cycle consumption of the DES butterfly loop on a four-ALU embedded processor is about 225 cycles.
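The cycle arithmetic of Sections 2.2.4 and 2.2.5 is easy to mistype, so a short tally can serve as a cross-check. The sketch below only re-adds the per-step estimates quoted in the text; the constant and function names are illustrative. The last figure, which combines the table-based loop with the IP and FP estimates, is derived here and is not stated in the text.

```c
#include <assert.h>

/* Per-step cycle estimates quoted in Sections 2.2.4 and 2.2.5
   for the reference embedded processor. */
enum {
    CYC_IP = 260,            /* initial permutation                 */
    CYC_FP = 346,            /* final permutation                   */
    CYC_EXPAND = 28,         /* E-function                          */
    CYC_SBOX = 96,           /* S-Box mixing, eight words           */
    CYC_PERM = 176,          /* pack (16) + permute 32 bits (160)   */
    CYC_KEYADD_SWAP = 29,    /* key XOR and L/R swap                */
    DES_ROUNDS = 16
};

/* Butterfly loop with separate expansion, S-Box and permutation steps. */
int des_loop_cycles_original(void)
{
    return (CYC_EXPAND + CYC_SBOX + CYC_PERM + CYC_KEYADD_SWAP) * DES_ROUNDS;
}

/* Butterfly loop with the combined table: six middle paths at 6 cycles,
   two end paths at 8 cycles, 4 cycles for the L/R update. */
int des_loop_cycles_table(void)
{
    return (6 * 6 + 2 * 8 + 4) * DES_ROUNDS;
}

/* Whole cipher: IP + butterfly loop + FP. */
int des_cipher_cycles(int loop_cycles)
{
    return CYC_IP + loop_cycles + CYC_FP;
}

/* Optional self-check against the totals quoted in the text. */
void des_cycles_selfcheck(void)
{
    assert(des_loop_cycles_original() == 5264);
    assert(des_loop_cycles_table() == 896);
    assert(des_cipher_cycles(des_loop_cycles_original()) == 5870);
}
```

Under these estimates, `des_cipher_cycles(des_loop_cycles_table())` evaluates to 1502, which suggests that IP and FP would become a noticeable share of the total once the loop is optimized.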
The same look-up table values and suggested butterfly-loop implementation can also be used for the DES inverse cipher.

2.3 Advanced Encryption Standard

The Advanced Encryption Standard (AES), published as FIPS 197 (Federal Information Processing Standard, 2001), is a data security standard adopted worldwide by most public and private sectors for secure data communications and data storage. The AES is used in a large variety of applications, from mobile consumer products to high-end servers.

2.3.1 Introduction to AES Algorithm

The AES algorithm is a symmetric key algorithm, standardized by the National Institute of Standards and Technology (NIST) in 2001. The AES standard (Federal Information Processing Standard, 2001) specifies the Rijndael algorithm, which processes data blocks of 128 bits using keys of 128-, 192-, or 256-bit length (we call the AES with a particular key length AES-128, AES-192, or AES-256). The AES encipher (cipher) converts data (plain text) to an unintelligible form (cipher text) using the cipher key, and the AES decipher (inverse cipher) converts the cipher text back to plain text using the same cipher key. Because the same key is used for both encryption and decryption, AES is a symmetric key algorithm. AES encryption and decryption are based on four different transformations applied repeatedly in a certain sequence on the input data, and the flows of encryption and decryption are not the same. The AES standard also specifies a key expansion module to supply keys for the multiple iterations of the AES algorithm. The number of iterations (and hence the complexity) of the AES algorithm, including key expansion, encryption and decryption, varies with the input key length. In this chapter, we discuss the flow of the AES algorithm and the simulation of AES-128 key expansion, the AES cipher, and the AES inverse cipher modules.
In addition, we discuss the computational complexity of AES and efficient techniques to implement the AES cipher and inverse cipher on the reference embedded processor.

Applications of AES include data communications, data storage, Internet, military applications, classified data management, and memory protection. Like the AES, the TDEA (triple data encryption algorithm) is used in all of these applications (see Section 2.2). The strength of an encryption algorithm depends on its mathematical properties and supported key lengths. DES is an old standard with a small key space, and analysts have thoroughly studied and attacked DES cipher text, whereas AES was developed more recently and has a very large key space. As of this writing, no attacks and no weak keys are known for AES.

2.3.2 AES Algorithm Description

The flow diagram of an AES encryption engine is shown in Figure 2.7. The main transformations in the AES Rijndael cipher are (1) AddRoundKey (AR), (2) SubBytes (SB), (3) ShiftRows (SR), and (4) MixColumns (MC). All these transformations work on a matrix called the state, which is formed from the input data. The AES state is updated in multiple iterations using these transformations. The key expansion (KE) module expands the given key to supply the keys for all iterations of the AES cipher engine. The number of times the state is iterated in the loop of the AES algorithm depends on the key length (Nk) chosen. For example, if we choose a key length of 128 bits (i.e., Nk = 4 32-bit words), then we iterate the data (Nr − 1) times, where Nr = Nb + Nk + 2 and Nb = 4 (so Nr = 10 and the loop runs nine times). In the AES algorithm, the parameter Nb (= 4) is the number of columns (32-bit words) of the state. The AR transformation is applied before starting the loop. The transformations within the AES loop are SB, SR, MC, and AR. In addition, the transformations SB, SR, and AR are applied after the loop, before outputting the cipher text. We define each of these transformations in the following.
For pictorial illustrations of SB, SR, MC, and AR, please refer to Federal Information Processing Standards (2001).

[Figure 2.7: AES encryption engine. The plain text (PT) passes through FS and an initial AR with key K0, then through the loop SB, SR, MC, AR with key Kj while j < Nr − 1, then through a final SB, SR, AR with key KNr, and finally through GO to give the cipher text (CT). The KE block expands KEY into (Nr + 1) round keys.]

The input to the AES algorithm is 128 bits (16 bytes) of plain text and a key of one of three lengths: 128 bits (16 bytes), 192 bits (24 bytes) or 256 bits (32 bytes). An overview of the major steps in the AES algorithm follows.

Form State (FS): At the start of the cipher, the input bytes in0, in1, . . . , in15 are copied into the state matrix as S_{r,c} = in_{r+4c} for 0 ≤ r < 4, 0 ≤ c < 4. After FormState( ), the elements S_{r,c} of the AES state are

$$\begin{bmatrix} S_{00} & S_{01} & S_{02} & S_{03} \\ S_{10} & S_{11} & S_{12} & S_{13} \\ S_{20} & S_{21} & S_{22} & S_{23} \\ S_{30} & S_{31} & S_{32} & S_{33} \end{bmatrix} = \begin{bmatrix} in_0 & in_4 & in_8 & in_{12} \\ in_1 & in_5 & in_9 & in_{13} \\ in_2 & in_6 & in_{10} & in_{14} \\ in_3 & in_7 & in_{11} & in_{15} \end{bmatrix}$$

Get Output (GO): Reverse operation of FS.

Key Expansion (KE): The key expansion module generates a total of Nb(Nr + 1) keywords, as the AES algorithm requires that many keywords to encrypt the data. As shown in Figure 2.7, we expand the given key with the key expansion module before processing data with the AES algorithm. More details of AES key expansion are given in Section 2.3.3, AES-128 Key Expansion Simulation, and in Section 2.3.4, Complexity of AESKeyExp( ).

Add Round Key (AR) Transformation: In the add round key transformation, 4Nb 8-bit keywords are added to the state by a simple bit-wise XOR operation. For round i, 0 ≤ i ≤ Nr,

$$S_{r,c} = S_{r,c} \oplus K_{16i+4c+r}, \qquad 0 \le r < 4,\ 0 \le c < Nb,$$

where K is the expanded key viewed as a byte sequence. The key bytes are therefore laid out column by column, in the same way as the input bytes in FS; this indexing agrees with the simulation results in Section 2.3.3.

Substitution Bytes (SB) Transformation: The substitution bytes transformation is a nonlinear byte substitution that operates independently on each byte of the state using the substitution table. In SB, we simply replace each state element with the S-Box element found by using the state element as an offset into the S-Box table.
The AES algorithm substitution tables for the AES cipher (S-Box) and AES inverse cipher (inverse S-Box) are available on the companion website.

Shift Rows (SR) Transformation: In the shift rows transformation, the bytes in the last three rows of the state are cyclically shifted (to the left in the case of encryption and to the right in the case of decryption) by different offsets. The first row, r = 0, is not shifted. The shift offset depends on the row number r as follows:

shift(r = 0) = 0, shift(r = 1) = 1, shift(r = 2) = 2 and shift(r = 3) = 3

Mix Columns (MC) Transformation: The mix columns transformation operates on the state column by column, treating each column as a four-element vector: S'_i = A · S_i, where

$$S_i = \begin{bmatrix} s_{i,0} \\ s_{i,1} \\ s_{i,2} \\ s_{i,3} \end{bmatrix}, \quad \text{for encryption } A = \begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix}, \quad \text{and for decryption } A = \begin{bmatrix} 0e & 0b & 0d & 09 \\ 09 & 0e & 0b & 0d \\ 0d & 09 & 0e & 0b \\ 0b & 0d & 09 & 0e \end{bmatrix}$$

(matrix entries in hexadecimal). For details on the computation process of the MC transformation, see Section 2.3.4, Complexity of MixColumns( ).

2.3.3 AES-128 Simulation

With the AES-128 algorithm, we use 128-bit keys. We initialize the parameters for the AES-128 algorithm as Nk = 4 (number of 32-bit input keywords), Nb = 4 (number of state columns) and Nr = 10 (= Nb + Nk + 2), where Nr sets the number of iterations in the AES loop. In the following sections, we simulate the AES-128 key expansion module and the AES-128 cipher and inverse-cipher transformations.

AES-128 Key Expansion Simulation

The AES-128 algorithm uses a total of 176 (= 4 · Nb · (Nr + 1)) bytes of key in the encryption or decryption process. We expand the given 16 bytes (128 bits) of input key to 176 bytes for the AES algorithm. The simulation code of the key expansion module AESKeyExp( ) for the AES-128 algorithm is given in Pcode 2.18. We discuss the AESKeyExp( ) module in more detail in Section 2.3.4, Complexity of AESKeyExp( ).
For a given 128-bit (16-byte) input key, the expanded key of 44 words (176 bytes) generated with the AESKeyExp( ) module follows.

AES-128 Key Expansion Module Input

    key[4] = { 0x47f11a71, 0x1d29c589, 0x6fb7620e, 0xaa18be1b};

    // temp, w and Rc are assumed to be 32-bit unsigned words
    i = 0;                                  // key expansion array index
    while (i < pAes->Nk){
        exp_key[i] = key[i];                // the first Nk words of key expansion are same as input key
        i++;
    }
    k = pAes->Nb * (pAes->Nr + 1);          // loop count
    j = 0; temp = exp_key[i-1];             // at this point, i = Nk
    while (i < k){                          // this while loop code generates 4 key words in one iteration
        Rc = temp << 8;                     // substitute bytes + shift rows transformations
        Rc = Rc >> 24; w = S_Box[Rc];
        Rc = temp << 16; w = w << 8;
        Rc = Rc >> 24; w = w | S_Box[Rc];
        Rc = temp & 0xff; w = w << 8;
        w = w | S_Box[Rc];
        Rc = temp >> 24; w = w << 8;
        w = w | S_Box[Rc];
        Rc = Rcon[j++]; Rc = Rc << 24;
        temp = w ^ Rc;
        w = exp_key[i-pAes->Nk]; temp = temp ^ w; exp_key[i++] = temp;
        w = exp_key[i-pAes->Nk]; temp = temp ^ w; exp_key[i++] = temp;
        w = exp_key[i-pAes->Nk]; temp = temp ^ w; exp_key[i++] = temp;
        w = exp_key[i-pAes->Nk]; temp = temp ^ w; exp_key[i++] = temp;
    }

Pcode 2.18: Simulation code for AESKeyExp( ) module.
40 Chapter 2 AES-128 Key Expansion Module Output exp_key[44] = { 0x47f11a71, 0x1d29c589, 0x6fb7620e, 0xaa18be1b, 0xeb5fb5dd, 0xf6767054, 0x99c1125a, 0x33d9ac41, 0xdcce361e, 0x2ab8464a, 0xb3795410, 0x80a0f851, 0x388fe7d3, 0x1237a199, 0xa14ef589, 0x21ee0dd8, 0x1858862e, 0x0a6f27b7, 0xab21d23e, 0x8acfdfe6, 0x82c60850, 0x88a92fe7, 0x2388fdd9, 0xa947223f, 0x02557d83, 0x8afc5264, 0xa974afbd, 0x00338d82, 0x81086ee0, 0x0bf43c84, 0xa2809339, 0xa2b31ebb, 0x6c7a84da, 0x678eb85e, 0xc50e2b67, 0x67bd35dc, 0x0dec025f, 0x6a62ba01, 0xaf6c9166, 0xc8d1a4ba, 0x05a5f6b7, 0x6fc74cb6, 0xc0abddd0, 0x087a796a}; AES Cipher Simulation As discussed in Section 2.3.2, AES Cipher consists of four transformations, and we use the following function names for each transformation: SubBytes( ) for substitute bytes transformation, ShiftRows( ) for shift rows transformation, AddRoundKey( ) for add round key transformation and MixColumns( ) for mix column transformation. AddRoundKey( ) In add round key transformation, we add 16 key bytes to 16 bytes of AES state. The addition operation is modulo 2 addition and we simulate this operation by XORing key bytes with state bytes as given in Pcode 2.19. As the expanded key exp_key[ ] from AESKeyExp( ) module is in terms of 32-bit words, we unpack exp_key[ ] words into bytes and add to state. SubBytes( ) In simulation of SubBytes( ) transformation, we replace each AES state byte with S-Box element as given in Pcode 2.20. ShiftRows( ) transformation rotates AES state rows to the left by a particular number of bytes depending on the row number. The simulation code for the ShiftRows( ) transformation is given in Pcode 2.21. As the state elements are represented with bytes, we simulate the shift rows transformation in terms of load and store bytes rather with a logical cyclic shift of 32-bit words. MixColumns( ) In the MixColumns( ) transformation, we multiply each column of state with the matrix A for encryption process as speciﬁed in Section 2.3.2. 
In this process, we multiply each state byte with 0x02 by performing a Galois field multiplication in GF(2^8). More details on the MixColumns( ) transformation are given in Section 2.3.4, Complexity of MixColumns( ). The simulation code for MixColumns( ) is given in Pcode 2.22.

    for(j = 0;j < 4;j++){
        tmp1 = exp_key[k++];
        tmp2 = tmp1 >> 24;                  state[0][j] = t[0][j]^tmp2;
        tmp2 = (tmp1 & 0x00ff0000) >> 16;   state[1][j] = t[1][j]^tmp2;
        tmp2 = (tmp1 & 0x0000ff00) >> 8;    state[2][j] = t[2][j]^tmp2;
        tmp2 = tmp1 & 0xff;                 state[3][j] = t[3][j]^tmp2;
    }

Pcode 2.19: Simulation code for AddRoundKey( ) transformation.

    for(j = 0;j < 4;j++)
        for(i = 0;i < 4;i++)
            state[j][i] = S_Box[state[j][i]];

Pcode 2.20: Simulation code for SubBytes( ) transformation.

    for(j = 1;j < 4;j++)
        for(i = 0;i < j;i++){               // rotate row j left by one byte, j times
            tmp1 = state[j][0]; tmp2 = state[j][1];
            state[j][0] = tmp2; tmp2 = state[j][2];
            state[j][1] = tmp2; tmp2 = state[j][3];
            state[j][2] = tmp2; state[j][3] = tmp1;
        }

Pcode 2.21: Simulation code for ShiftRows( ) transformation.

AESCipher( )

The simulation code for the AES cipher algorithm is given in Pcode 2.23. The AES cipher uses all the transformations discussed previously along with the FormState( ) and GetOutput( ) operations.

    for(j = 0;j < 4;j++)
        for(i = 0;i < 4;i++){               // Premultiplication of State bytes with 0x02
            tmp1 = state[j][i];
            tmp2 = tmp1 >> 7; tmp1 = tmp1 << 1;
            if (tmp2) tmp1 = tmp1^0x1b;
            s[j][i] = tmp1;
        }
    for(i = 0;i < 4;i++){
        t[0][i] = s[0][i]^(s[1][i]^state[1][i])^state[2][i]^state[3][i];
        t[1][i] = state[0][i]^s[1][i]^(s[2][i]^state[2][i])^state[3][i];
        t[2][i] = state[0][i]^state[1][i]^s[2][i]^(s[3][i]^state[3][i]);
        t[3][i] = (s[0][i]^state[0][i])^state[1][i]^state[2][i]^s[3][i];
    }

Pcode 2.22: Simulation code for MixColumns( ) transformation.
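The FormState( ), AddRoundKey( ) and MixColumns( ) steps can be cross-checked against the AES-128 encryption simulation results reported in this section. The sketch below is a compact restatement with illustrative names (`xtime`, `mix_column`, `form_state`, `add_round_key`), not the book's listing. It assumes that each expanded key word supplies one state column with its most significant byte in row 0, which is how Pcode 2.19 unpacks `exp_key[ ]`.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply by 0x02 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11b). */
static uint8_t xtime(uint8_t b)
{
    return (uint8_t)((b << 1) ^ ((b & 0x80) ? 0x1b : 0x00));
}

/* Encryption MixColumns on one state column s[0..3], in place.
   Multiplication by 0x03 is written as xtime(a) ^ a. */
void mix_column(uint8_t s[4])
{
    uint8_t a0 = s[0], a1 = s[1], a2 = s[2], a3 = s[3];
    s[0] = (uint8_t)(xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3);
    s[1] = (uint8_t)(a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3);
    s[2] = (uint8_t)(a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3));
    s[3] = (uint8_t)((xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3));
}

/* FormState: S[r][c] = in[r + 4c]. */
void form_state(const uint8_t in[16], uint8_t st[4][4])
{
    for (int c = 0; c < 4; c++)
        for (int r = 0; r < 4; r++)
            st[r][c] = in[r + 4 * c];
}

/* AddRoundKey: column c is XORed with key word w[c]; row 0 takes the
   most significant byte of the word. */
void add_round_key(uint8_t st[4][4], const uint32_t w[4])
{
    for (int c = 0; c < 4; c++)
        for (int r = 0; r < 4; r++)
            st[r][c] ^= (uint8_t)(w[c] >> (24 - 8 * r));
}

/* Optional self-check: first column of the r = 1 trace, taken after
   shift rows and compared with the value after mix columns. */
void aes_steps_selfcheck(void)
{
    uint8_t col[4] = {0x61, 0xb4, 0xed, 0x60};
    mix_column(col);
    assert(col[0] == 0x88 && col[1] == 0x5e && col[2] == 0xb4 && col[3] == 0x3a);
}
```

Storing `xtime` results once per byte, as Pcode 2.22 does with the array `s[ ][ ]`, avoids recomputing the 0x02 products that each column uses twice.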
    k = 0;                                  // offset to access expanded key
    FormState( );
    AddRoundKey( );
    for (r = 1; r < 10; r++){
        SubBytes( );
        ShiftRows( );
        MixColumns( );
        AddRoundKey( );
    }
    SubBytes( );
    ShiftRows( );
    AddRoundKey( );
    GetOutput( );

Pcode 2.23: Simulation code for AESCipher( ).

AES-128 Encryption Simulation Results

Key: {0x47, 0xf1, 0x1a, 0x71, 0x1d, 0x29, 0xc5, 0x89, 0x6f, 0xb7, 0x62, 0x0e, 0xaa, 0x18, 0xbe, 0x1b}
Plain text: {0x9f, 0x5d, 0xbd, 0x6e, 0x43, 0xef, 0xc4, 0xa6, 0x39, 0xa8, 0x31, 0xa4, 0xd3, 0x37, 0xf2, 0x8b}
After FormState( ): {{0x9f, 0x43, 0x39, 0xd3}, {0x5d, 0xef, 0xa8, 0x37}, {0xbd, 0xc4, 0x31, 0xf2}, {0x6e, 0xa6, 0xa4, 0x8b}}
After AddRoundKey( ): {{0xd8, 0x5e, 0x56, 0x79}, {0xac, 0xc6, 0x1f, 0x2f}, {0xa7, 0x01, 0x53, 0x4c}, {0x1f, 0x2f, 0xaa, 0x90}}
//Loop Start
r=1 (input): {{0xd8, 0x5e, 0x56, 0x79}, {0xac, 0xc6, 0x1f, 0x2f}, {0xa7, 0x01, 0x53, 0x4c}, {0x1f, 0x2f, 0xaa, 0x90}}
r=1 (after substitute bytes): {{0x61, 0x58, 0xb1, 0xb6}, {0x91, 0xb4, 0xc0, 0x15}, {0x5c, 0x7c, 0xed, 0x29}, {0xc0, 0x15, 0xac, 0x60}}
r=1 (after shift rows): {{0x61, 0x58, 0xb1, 0xb6}, {0xb4, 0xc0, 0x15, 0x91}, {0xed, 0x29, 0x5c, 0x7c}, {0x60, 0xc0, 0x15, 0xac}}
r=1 (after Mix Columns): {{0x88, 0x02, 0x0f, 0x0f}, {0x5e, 0x78, 0x6a, 0xa7}, {0xb4, 0x91, 0x23, 0x30}, {0x3a, 0x9a, 0xab, 0x6f}}
r=1 (after add round key): {{0x63, 0xf4, 0x96, 0x3c}, {0x01, 0x0e, 0xab, 0x7e}, {0x01, 0xe1, 0x31, 0x9c}, {0xe7, 0xce, 0xf1, 0x2e}}
r=2 (input): {{0x63, 0xf4, 0x96, 0x3c}, {0x01, 0x0e, 0xab, 0x7e}, {0x01, 0xe1, 0x31, 0x9c}, {0xe7, 0xce, 0xf1, 0x2e}}
r=3 (input): {{0x21, 0xa3, 0x71, 0x90}, {0x1b, 0x2e, 0x1b, 0x01}, {0xa0, 0x9b, 0x49, 0x7c}, {0x06, 0x1f, 0x39, 0xaa}}
r=4 (input): {{0x1d, 0x93, 0x58, 0x0d}, {0xf1, 0x27, 0xee, 0xe5}, {0xb2, 0x95, 0xaa, 0xdc}, {0x86, 0xe6, 0x70, 0xe7}}
r=5 (input): {{0x3c, 0x13, 0xb6, 0xbc}, {0x04, 0x36, 0x35, 0x6e}, {0x0a, 0x08, 0x86, 0x0e}, {0x8a, 0xee, 0x69, 0xad}}
r=6 (input): {{0x91, 0x06, 0x4a, 0xa7}, {0x7e, 0x7b, 0x62, 0x74}, {0xca, 0x0b, 0x9a, 0xc5}, {0x06, 0xa1, 0xa3,
0xbb}} 42 Chapter 2 r=7 (input): {{0x2a, 0x78, 0xf5, 0x97}, {0xaf, 0x42, 0x33, 0xe5}, {0x93, 0x71, 0x55, 0x6a}, {0x4d, 0x07, 0x5e, 0xaa}} r=8 (input): {{0x74, 0xd7, 0x1c, 0xd9}, {0x06, 0x30, 0x75, 0x6f}, {0xab, 0x79, 0x5b, 0x5a}, {0x47, 0x47, 0x9c, 0x52}} r=9 (input): {{0x66, 0xd9, 0xc7, 0xd4}, {0xab, 0xd8, 0xdf, 0x49}, {0x60, 0xb7, 0x20, 0x61}, {0x4a, 0x34, 0x49, 0xfd}} //Loop end //After Loop {{0x2b, 0x80, 0xbd, 0x6c}, {0x8b, 0x8c, 0xaf, 0x86}, {0xd9, 0xb5, 0xff, 0x8a}, {0x74, 0x98, 0xec, 0xdf}} After SubBytes( ): {{0xf1, 0xcd, 0x7a, 0x50}, {0x3d, 0x64, 0x79, 0x44}, {0x35, 0xd5, 0x16, 0x7e}, {0x92, 0x46, 0xce, 0x9e}} After ShiftRows( ): {{0xf1, 0xcd, 0x7a, 0x50}, {0x64, 0x79, 0x44, 0x3d}, {0x16, 0x7e, 0x35, 0xd5}, {0x9e, 0x92, 0x46, 0xce}} After AddRoundKey( ): {{0xf4, 0xa2, 0xba, 0x58}, {0xc1, 0xbe, 0xef, 0x47}, {0xe0, 0x32, 0xe8, 0xac}, {0x29, 0x24, 0x96, 0xa4}} Cipher text after GetOutput( ): {0xf4, 0xc1, 0xe0, 0x29, 0xa2, 0xbe, 0x32, 0x24, 0xba, 0xef, 0xe8, 0x96, 0x58, 0x47, 0xac, 0xa4} AES Inverse Cipher Simulation The AES inverse cipher consists of four transformations that are inverse operations of the AES cipher transformation and we use the following function names for each transformation: InvSubBytes( ) for inverse substitute byte transformation, InvShiftRows( ) for inverse shift rows transformation, InvAddRoundKey( ) for inverse add round key transformation and InvMixColumns( ) for inverse mix columns transformation. The functionality of inverse substitute bytes and inverse add round key transformations are the same as cipher substitute bytes and add round key transformations except that the look-up table values and order of accessing keyword values are different in the two cases. 
Although the same expanded key is used for both the cipher and the inverse cipher, the cipher accesses the keywords from the beginning of the expanded array by increasing the array index, whereas the inverse cipher accesses them from the end of the array by decreasing the array index. The shift rows and mix columns transformations of the inverse cipher, in contrast, are true inverses of their cipher counterparts.

InvAddRoundKey( )
Same as AddRoundKey( ), but the keywords are accessed from the end of the key expansion array.

InvSubBytes( )
This is the same as SubBytes( ), but it uses Inv_S_Box[ ] instead of S_Box[ ].

InvShiftRows( )
In the InvShiftRows( ) transformation, we rotate the state rows to the right by a number of bytes that depends on the row number. The simulation code for InvShiftRows( ) is given in Pcode 2.24. As the state elements are represented with bytes throughout our simulation, we simulate the inverse shift rows transformation with byte loads and stores rather than with logical cyclic shifts of 32-bit words.

InvMixColumns( )
The InvMixColumns( ) transformation is the costliest transformation in the AES algorithm. It involves multiplication of the state bytes with combinations of the elements 0x09, 0x0b, 0x0d, and 0x0e in the Galois field GF(2^8). The simulation code for InvMixColumns( ) is given in Pcode 2.25.

InvAESCipher( )
The simulation code for the AES inverse cipher algorithm is given in Pcode 2.26. The AES inverse cipher uses all the transformations discussed previously along with the FormState( ) and GetOutput( ) operations.

    for(j = 1;j < 4;j++)
        for(i = 0;i < j;i++){                        // rotate row j right by one byte, j times
            tmp1 = state[j][3];
            tmp2 = state[j][2]; state[j][3] = tmp2;
            tmp2 = state[j][1]; state[j][2] = tmp2;
            tmp2 = state[j][0]; state[j][1] = tmp2;
            state[j][0] = tmp1;
        }

Pcode 2.24: Simulation code for InvShiftRows( ) transformation.
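The inverse relationship between the two shift rows transformations can be checked in isolation. The following is a minimal sketch in standard C, not the book's Pcode: the function names shift_rows( ), inv_shift_rows( ), and shift_rows_selfcheck( ), and the state[row][column] layout of uint8_t bytes, are assumptions made for illustration. The self-check reuses the r=1 "after substitute bytes" state from the encryption simulation results.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Forward ShiftRows: rotate row j of state[row][col] left by j bytes. */
void shift_rows(uint8_t state[4][4])
{
    for (int j = 1; j < 4; j++) {
        for (int i = 0; i < j; i++) {       /* j single-byte left rotations */
            uint8_t tmp = state[j][0];
            state[j][0] = state[j][1];
            state[j][1] = state[j][2];
            state[j][2] = state[j][3];
            state[j][3] = tmp;
        }
    }
}

/* Inverse ShiftRows: rotate row j right by j bytes, one byte at a time. */
void inv_shift_rows(uint8_t state[4][4])
{
    for (int j = 1; j < 4; j++) {
        for (int i = 0; i < j; i++) {       /* j single-byte right rotations */
            uint8_t tmp = state[j][3];
            state[j][3] = state[j][2];
            state[j][2] = state[j][1];
            state[j][1] = state[j][0];
            state[j][0] = tmp;
        }
    }
}

/* Round trip: applying both transformations must restore the input state. */
void shift_rows_selfcheck(void)
{
    const uint8_t before[4][4] = {
        {0x61, 0x58, 0xb1, 0xb6}, {0x91, 0xb4, 0xc0, 0x15},
        {0x5c, 0x7c, 0xed, 0x29}, {0xc0, 0x15, 0xac, 0x60}};
    uint8_t s[4][4];

    memcpy(s, before, sizeof s);
    shift_rows(s);
    inv_shift_rows(s);
    assert(memcmp(s, before, sizeof s) == 0);
}
```

Because each inner iteration moves only single bytes, the same loop shape works on processors without a 32-bit rotate instruction, which is the design choice the byte-wise simulation makes.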
    for(j = 0;j < 4;j++)
        for(i = 0;i < 4;i++){                        // multiply with 0x02
            tmp1 = t[j][i]; tmp2 = tmp1 >> 7; tmp1 = tmp1 << 1;
            if (tmp2) tmp1 = tmp1^0x1b;
            s[j][i] = tmp1;
        }
    for(j = 0;j < 4;j++)
        for(i = 0;i < 4;i++){                        // multiply with 0x04
            tmp1 = s[j][i]; tmp2 = tmp1 >> 7; tmp1 = tmp1 << 1;
            if (tmp2) tmp1 = tmp1^0x1b;
            ss[j][i] = tmp1;
        }
    for(j = 0;j < 4;j++)
        for(i = 0;i < 4;i++){                        // multiply with 0x08
            tmp1 = ss[j][i]; tmp2 = tmp1 >> 7; tmp1 = tmp1 << 1;
            if (tmp2) tmp1 = tmp1^0x1b;
            sss[j][i] = tmp1;
        }
    for(i = 0;i < 4;i++){
        state[0][i] = (sss[0][i]^ss[0][i]^s[0][i])^(sss[1][i]^s[1][i]^t[1][i])^
                      (sss[2][i]^ss[2][i]^t[2][i])^(sss[3][i]^t[3][i]);
        state[1][i] = (sss[0][i]^t[0][i])^(sss[1][i]^ss[1][i]^s[1][i])^
                      (sss[2][i]^s[2][i]^t[2][i])^(sss[3][i]^ss[3][i]^t[3][i]);
        state[2][i] = (sss[0][i]^ss[0][i]^t[0][i])^(sss[1][i]^t[1][i])^
                      (sss[2][i]^ss[2][i]^s[2][i])^(sss[3][i]^s[3][i]^t[3][i]);
        state[3][i] = (sss[0][i]^s[0][i]^t[0][i])^(sss[1][i]^ss[1][i]^t[1][i])^
                      (sss[2][i]^t[2][i])^(sss[3][i]^ss[3][i]^s[3][i]);
    }

Pcode 2.25: Simulation code for InvMixColumns( ) transformation.

    k = 40;                     // offset to access expanded key
    FormState( );
    AddRoundKey( );
    k -= 8;
    for (r = 1; r < 10; r++){
        InvShiftRows( );
        InvSubBytes( );
        InvAddRoundKey( );
        InvMixColumns( );
        k -= 8;
    }
    InvShiftRows( );
    InvSubBytes( );
    InvAddRoundKey( );
    GetOutput( );

Pcode 2.26: Simulation code for InvAESCipher( ).

AES Inverse-Cipher Simulation Results

As the AES inverse cipher performs the operations of the AES cipher in reverse order, the simulation results presented in Section 2.3.3, AES-128 Encryption Simulation Results, are obtained in reverse order with the AES inverse cipher. Therefore, the same intermediate outputs given in that section can be used to debug the AES inverse cipher.

2.3.4 Computational Complexity of AES

In this section, we analyze the complexity of implementing the AES algorithm on the reference embedded processor.
We discuss the complexity of each transformation in terms of cycles (see Appendix A, Section A.4, on the companion website for the cycles consumed by a particular operation on the reference embedded processor) and data memory usage. Although the transformations AddRoundKey (AR) and ShiftRows (SR) can be computed with fewer cycles by treating the AES state data as 32-bit words (simply XORing word by word for the AR transformation and shifting each word cyclically by a particular offset for the SR transformation), the other two transformations, SubBytes (SB) and MixColumns (MC), work with bytes only; hence we work with bytes in all the transformations.

Complexity of SubBytes( )
In the SB transform, each byte of state is replaced with a look-up table value, using the state byte value as the offset into the look-up table. The SB transform therefore involves only look-up table accesses. On the reference embedded processor a look-up table access takes multiple cycles per byte load; however, as we do not use the output immediately, we can load each byte in two cycles (one cycle for computing the absolute address and one cycle for the memory load) with program code interleaving. For updating all 16 bytes of state with the SB transform, we consume a total of 32 cycles.

Complexity of ShiftRows( )
In the SR transform, every row of state except the zeroth row is rotated left cyclically. The amount of rotation is one byte for the first row, two bytes for the second row, and three bytes for the third row. This is achieved (without cyclic shifts) by loading the first byte into a temporary variable, and then loading the byte at the next location and storing it to the current byte location, as given in the shift rows simulation code of Pcode 2.21. As in SB, we do not use the loaded value immediately, so we can shift a row left by 1 byte in eight cycles.
For the second row, we have to shift 2 bytes left by applying the previously described procedure twice, which consumes 16 cycles. Finally, for the third row, we have to shift 3 bytes left, which is equivalent to shifting the row right by 1 byte. This takes another eight cycles. The SR transform consumes a total of 32 cycles.

Complexity of MixColumns( )
The MC transformation is the costliest operation in the AES algorithm, as it involves costly Galois field element multiplications. We first discuss the MC transformation steps and then estimate its cycle consumption. We start with the process of multiplying two Galois field elements. To multiply the two Galois field elements {0x07} and {0xab}, we use the approach described in the AES standard FIPS 197. The field element {0x07} can be written as {0x04 ⊕ 0x02 ⊕ 0x01}. Then the Galois field multiplication can be expanded as {0x07} · {0xab} = {{0x04} · {0xab} ⊕ {0x02} · {0xab} ⊕ {0xab}}. To multiply any field element {0xmm} with {0x02}: if the MSB of {0xmm} is zero, one left shift of {0xmm} gives the value {0xmm} · {0x02}. If the MSB of {0xmm} is not zero, then one left shift of {0xmm} followed by XORing the result with {0x1b} is needed so that the multiplication result belongs to the Galois field GF(2^8). This process is equivalent to taking the remainder of the multiplication result modulo the irreducible polynomial specified in the standard. To multiply {0x04} · {0xmm}, we repeat the previous procedure twice. In the AES cipher MC transform, we multiply the matrix A with a vector to get the transformed output. The Galois field elements present in the matrix are {0x01}, {0x02} and {0x03}. Multiplying any Galois field element by {0x01} results in the same element. Multiplying any Galois field element by {0x02} is done as described previously.
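The multiplication by {0x02} and the shift-and-XOR expansion described above can be sketched in a few lines of standard C. This is an illustrative sketch rather than the book's simulation code: the names gf_xtime( ), gf_mul( ), and gf_mul_selfcheck( ) are assumptions, and the check values {0x57} · {0x02}, {0x57} · {0x13}, and {0x57} · {0x83} are the worked examples from FIPS 197.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply a GF(2^8) element by {0x02}: one left shift, followed by an XOR
 * with 0x1b when the MSB of the input was set (reduction step). */
uint8_t gf_xtime(uint8_t x)
{
    uint8_t shifted = (uint8_t)(x << 1);
    return (x & 0x80) ? (uint8_t)(shifted ^ 0x1b) : shifted;
}

/* General GF(2^8) product: for every set bit k of a, XOR in b * {0x02}^k. */
uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t result = 0;
    while (a) {
        if (a & 1)
            result ^= b;        /* add the current multiple of b */
        b = gf_xtime(b);        /* next multiple: b * {0x02}     */
        a >>= 1;
    }
    return result;
}

/* Checks against the worked examples in FIPS 197. */
void gf_mul_selfcheck(void)
{
    assert(gf_xtime(0x57) == 0xae);
    assert(gf_mul(0x57, 0x13) == 0xfe);
    assert(gf_mul(0x57, 0x83) == 0xc1);
}
```

For the example in the text, gf_mul(0x07, 0xab) combines {0x04} · {0xab} = 0x9a, {0x02} · {0xab} = 0x4d, and 0xab itself, giving 0x9a ⊕ 0x4d ⊕ 0xab = 0x7c.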
Multiplying any Galois field element {0xmm} by {0x03} is done by first multiplying the element with {0x02} and then XORing the result with the original value as ({0x02} · {0xmm}) ⊕ {0xmm}. With this knowledge, we can simulate the MC transform as follows. First, we multiply all the state elements by {0x02}. Multiplying one state element by {0x02} takes approximately 10 cycles: one load (requires four cycles, including the stall, as we immediately use the loaded value in the next operation), one shift, one condition check, one XOR, one conditional move and one store. So, we spend a total of 160 cycles to multiply all the state elements with {0x02} and store the results in a temporary buffer. The multiplication of state elements with {0x03} is done by XORing the {0x02} multiplication output with the original state elements. Once we have the multiplication of state elements by {0x02}, computing the MC transform involves only XOR operations, as given in Pcode 2.22. To compute one MC transform output element, we perform a total of four XOR operations (four cycles) and five load operations (five cycles assuming the program can be interleaved; otherwise it takes 20 cycles). For all elements of state we consume 144 cycles (= 16 × 9). To store all the state elements back to state we consume another 16 cycles. With this, the total number of cycles consumed in applying the MC transform on state is 320 cycles (= 160 + 144 + 16).

Complexity of AddRoundKey( )
In the AR transform, we first load the keyword from memory (four cycles) and unpack the word into bytes (six cycles). We load the four state bytes row-wise (a minimum of eight cycles is needed after interleaving the program code), XOR them with the key bytes (four cycles) and store them back to state (four cycles). So, in applying the AR transform to one row of state, we spend approximately 26 cycles, and for the complete AR transform on four rows of state we spend 104 cycles.
Overall Complexity of AES Cipher
The cycles consumed by all the transforms in a single iteration of the AES cipher loop sum to 488 cycles (= 32 + 32 + 320 + 104). For a key length of 128 bits, the AES cipher loop iterates nine times. Thus, the approximate number of cycles for encrypting one block of 128 bits of data using the AES cipher is about 5000 cycles (= 488 × 9 + cycles consumed by all transforms before and after the loop).

Inverse AES Cipher Computational Complexity
In the case of the AES inverse cipher, all transforms except the inverse MC transformation take the same number of cycles as the corresponding cipher transformations. In the inverse MC transform, the matrix elements are {0x0d}, {0x0e}, {0x0b}, and {0x09}, and to multiply the state elements with these matrix elements, we need to store the multiplication results for the {0x08} and {0x04} elements in temporary buffers in addition to the {0x02} multiplication result (as we may expand {0x0d} as {0x08 ⊕ 0x04 ⊕ 0x01} to multiply a state element with {0x0d}). Generating the multiplication outputs for the {0x08} and {0x04} elements with each state element takes an extra 320 cycles per loop iteration. Also, the inverse MC multiplications take an extra 48 cycles per iteration (as the multipliers in this case are large and need more XOR and load operations, as given in Pcode 2.25). With this, the AES inverse cipher loop consumes approximately 856 cycles (= 488 + 320 + 48). So, the approximate total number of cycles for decrypting one block of 128 bits of data using the AES inverse cipher on the reference embedded processor is 8000 cycles (= 856 × 9 + cycles consumed by the transforms before and after the loop).

Complexity of AESKeyExp( )
Now, we discuss the complexity of the AES key expansion module in expanding a 128-bit key. As given in Pcode 2.18, the first four keywords of the expanded key are copied from the input key, and it takes eight cycles to load and store these keywords.
The loop of the AES key expansion module is partially unrolled so that each iteration of the "while" loop generates four keywords and avoids conditional jumps. With this, the loop count for 128-bit key expansion becomes 10 (= Nb · Nr/4 = 4 × 10/4). To generate the first keyword in each iteration of the while loop from the previous keywords, we perform the substitute word and rotate word left transformations and then XOR the result with Rcon. These operations consume 36 cycles (six cycles for unpacking the previous keyword, 16 cycles for loading four S-Box values, four cycles for loading the Rcon constant, four cycles for packing the bytes and four cycles for XORing with Rcon and for other operations). The substitute word, rotate word, and XOR with Rcon operations need not be performed when generating the last three keywords in any iteration of the "while" loop. Then, before storing each word as a keyword, we XOR the current word with the already generated keyword. XORing the current four words with the previously generated four keywords and storing the XORed outputs consumes about 24 cycles (16 cycles for loading the previous four keywords, four cycles for XORing and four cycles for storing). With this, the total number of cycles consumed for generating four keywords in a single iteration of the key expansion loop is 60 (= 36 + 24). For generating all keywords with the key expansion module, we consume approximately 608 (= 60 × 10 + 8) cycles.

AES Algorithm Memory Requirements
In this section, we analyze the amount of data memory used by the AES algorithm. In key expansion, we use 176 bytes for storing the expanded key and 10 bytes for storing the Rcon constants. Both key expansion and the AES cipher use the S-Box values, and we need 256 bytes of data memory for storing them. The AES inverse cipher uses the inverse S-Box, which needs another 256 bytes of data memory.
We use almost 100 bytes of data memory for input, output, state and for temporary buffers to store Galois field multiplication results. With this, the total amount of data memory used by the AES algorithm is about 0.75 kB.

2.3.5 Efficient Implementation of AES

In the previous section, we discussed the complexity of the AES algorithm in terms of reference embedded processor clock cycles. The key expansion module consumes approximately 600 cycles and, unlike the encryption of the data, need not run in real time. Moreover, the key expansion module need not be called for every data block. Therefore, we do not discuss optimization techniques for the key expansion module in this section. Next, the transforms used in the AES algorithm before and after the main loop occur once per block of data. The costly part of the AES algorithm is the main loop that runs Nb + Nk + 1 times. In this section, we discuss ways to optimize the transformations in the AES main loop. The main loop of the AES algorithm contains the SB, SR, MC, and AR transformations. Each of these transformations takes its input data from the previous transformation's output. On a deeply pipelined processor such as the reference embedded processor, implementing this sequential flow of the AES algorithm as it is takes a lot of cycles, as discussed in the previous section. Only if we restructure the algorithm to reduce the dependencies in its flow can we utilize the full bandwidth and resources of an embedded processor (with multiple arithmetic and logic units), so that the algorithm consumes fewer cycles. Therefore, in this section, we concentrate on restructuring the AES algorithm for parallel flow to utilize the full bandwidth of the processor. Now, we discuss how to make the AES algorithm suitable for running on deeply pipelined, multiple-ALU embedded processors.
If we can somehow obtain the 16 output elements of state at the end of a loop iteration from the 16 bytes of state at the beginning of the loop without any dependency among the outputs (i.e., having 16 parallel independent flows for a full iteration of the loop), then we can efficiently program such a flow on a deeply pipelined embedded processor. The present flow of the AES algorithm is shown in Figure 2.8 with its dependencies. If any transformation has cross-inputs or cross-outputs, then there is a dependency between the transformations, as we must wait for all the inputs to become available before starting the next transformation. From Figure 2.8, we can clearly see the dependency between SB and SR, SR and MC, and MC and AR. There is no dependency between AR and SB, as the inputs or outputs of these transforms are not crossed.

Efficient Implementation of AES Algorithm
The transformations SB and SR are commutative (Federal Information Processing Standards, 2001), meaning the outputs of SR(SB(state)) and SB(SR(state)) are the same. Of all the transformations, the MC transformation is the most costly. We can reduce the cycles for this transformation at the cost of memory. In Daemen and Rijmen (2000), an alternative approach is suggested for a fast implementation of AES using 4 kB of data memory. In this approach, instead of computing the intermediate Galois field multiplication values at runtime for performing MC, we precompute the multiplication values for all 256 S-Box elements with all rotated combinations of the MC matrix first-row elements and store them in data memory. In Gladman (2003), with three extra rotate operations, the memory required for a fast implementation of AES is reduced to 1 kB. Here, we precompute the S-Box elements' multiplied values for one row of elements in the MC matrix and store them in memory using 1 kB of data memory.
With the precomputed multiplication values, we spend the cycles in the MC transformation on loading the multiplication values, on rotations, and on XORing them with the input of MC. An efficient flow for the AES-algorithm loop transformations with precomputed look-up tables is possible with the following formula:

T = AR(MC(SB(SR(S)))), where S: input state, T: output state

Figure 2.9 shows the efficient implementation of the previous equation. Let M be the mix column matrix, S the input vector, and S' the output of the mix columns transformation.

[Figure 2.8: Flow of AES cipher algorithm transformations. The 16 state bytes S00 to S33 pass through 16 byte-wise SB operations, four row-wise SR operations, four column-wise MC operations, and 16 byte-wise AR operations to give the outputs T00 to T33.]

S' = M · S

$$
\begin{bmatrix} s'_0 \\ s'_1 \\ s'_2 \\ s'_3 \end{bmatrix}
=
\begin{bmatrix}
m_0 & m_1 & m_2 & m_3 \\
m_3 & m_0 & m_1 & m_2 \\
m_2 & m_3 & m_0 & m_1 \\
m_1 & m_2 & m_3 & m_0
\end{bmatrix}
\cdot
\begin{bmatrix} s_0 \\ s_1 \\ s_2 \\ s_3 \end{bmatrix}
=
\begin{bmatrix} m_0 \\ m_3 \\ m_2 \\ m_1 \end{bmatrix} \cdot s_0
\oplus
\begin{bmatrix} m_1 \\ m_0 \\ m_3 \\ m_2 \end{bmatrix} \cdot s_1
\oplus
\begin{bmatrix} m_2 \\ m_1 \\ m_0 \\ m_3 \end{bmatrix} \cdot s_2
\oplus
\begin{bmatrix} m_3 \\ m_2 \\ m_1 \\ m_0 \end{bmatrix} \cdot s_3
$$

We precompute $L_i$ for 0 ≤ i ≤ 3 (the Galois field multiplication, ·, of $s_i$ with the first column of M) as follows, and store it in memory.

$$L_i = \{m_0\} \cdot s_i \;|\; \{m_3\} \cdot s_i \;|\; \{m_2\} \cdot s_i \;|\; \{m_1\} \cdot s_i$$

Now, to compute the mix column transformation for one column of state, we load $L_i$ for 0 ≤ i ≤ 3 from memory corresponding to $s_i$. Next, we get $L'_i$ from $L_i$ by rotating $L_i$ to the right by i bytes.
Then, we obtain S' by XORing all the $L'_i$ as follows:

$$S' = L'_0 \oplus L'_1 \oplus L'_2 \oplus L'_3$$

where

$$
\begin{aligned}
L'_0 &= \{m_0\} \cdot s_0 \;|\; \{m_3\} \cdot s_0 \;|\; \{m_2\} \cdot s_0 \;|\; \{m_1\} \cdot s_0 \\
L'_1 &= \{m_1\} \cdot s_1 \;|\; \{m_0\} \cdot s_1 \;|\; \{m_3\} \cdot s_1 \;|\; \{m_2\} \cdot s_1 \\
L'_2 &= \{m_2\} \cdot s_2 \;|\; \{m_1\} \cdot s_2 \;|\; \{m_0\} \cdot s_2 \;|\; \{m_3\} \cdot s_2 \\
L'_3 &= \{m_3\} \cdot s_3 \;|\; \{m_2\} \cdot s_3 \;|\; \{m_1\} \cdot s_3 \;|\; \{m_0\} \cdot s_3
\end{aligned}
$$

[Figure 2.9: Efficient implementation of AES cipher. Each output column is formed from four state bytes taken along a diagonal (S00 S11 S22 S33; S01 S12 S23 S30; S02 S13 S20 S31; S03 S10 S21 S32), their table values L0 to L3 in rotated order, and the round key bytes K0 to K15, giving the outputs T00 to T33.]

Finally, we get the output for one iteration of the AES loop by XORing the mix columns output with the round key, T = S' ⊕ K (here, to reduce the number of XORs for AR, we transpose the AES round keywords in the key expansion module). Therefore, to compute one column of the state matrix, we require four extracts (to get the individual state elements after the SR transformation), four loads (SB transformation), three rotations and four XORs (MC and AR). The simulation code for an efficient AES cipher is given in Pcode 2.27. In Figure 2.9, we can see that the outputs T00 to T33 do not depend on any intermediate results. All 16 outputs can be computed independently if we have sufficient processor compute and data bandwidth. On the deeply pipelined embedded processor, by interleaving the program code, we can avoid all the stalls associated with the memory (or look-up table) accesses. In this way, using the approach for AES implementation in Gladman (2003), we can compute the AES transformation operations in one cycle per operation with program interleaving. In MC, we work on columns, so it is convenient to hold one column of elements in one register. For this, we transpose the state matrix before entering the loop.
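The load, rotate, and XOR formulation of the mix columns transformation can be checked in isolation with the following minimal sketch. It is not the book's sbmc[ ] table: as a simplifying assumption, the table here is indexed by the already substituted byte and holds only the MC multiples, with m0 = {0x02}, m1 = {0x03}, m2 = m3 = {0x01}. The names mc_table, build_mc_table( ), rotl32( ), and mix_column_lut( ) are hypothetical. Row 0 of the column is packed into the least significant byte, so the rotation "to the right by i bytes" in the row-ordered notation becomes a left rotation of the 32-bit word, as in Pcode 2.27.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t mc_table[256];  /* L(x) = {02}x | {01}x | {01}x | {03}x, row 0 in the LSB */

/* Multiply a GF(2^8) element by {0x02}. */
static uint8_t xtime(uint8_t x)
{
    uint8_t shifted = (uint8_t)(x << 1);
    return (x & 0x80) ? (uint8_t)(shifted ^ 0x1b) : shifted;
}

/* Precompute the product of every byte value with the first column of M. */
void build_mc_table(void)
{
    for (int x = 0; x < 256; x++) {
        uint8_t b  = (uint8_t)x;
        uint8_t b2 = xtime(b);              /* {0x02} * x */
        uint8_t b3 = (uint8_t)(b2 ^ b);     /* {0x03} * x */
        mc_table[x] = (uint32_t)b2 | ((uint32_t)b << 8) |
                      ((uint32_t)b << 16) | ((uint32_t)b3 << 24);
    }
}

/* Rotate a 32-bit word left by n bits using two shifts and one OR. */
uint32_t rotl32(uint32_t v, unsigned n)
{
    assert(n > 0 && n < 32);
    return (v << n) | (v >> (32 - n));
}

/* MixColumns for one column s[0..3]: four loads, three rotations, three XORs. */
uint32_t mix_column_lut(const uint8_t s[4])
{
    return mc_table[s[0]] ^
           rotl32(mc_table[s[1]], 8) ^
           rotl32(mc_table[s[2]], 16) ^
           rotl32(mc_table[s[3]], 24);
}
```

After build_mc_table( ) has run, the column {0xdb, 0x13, 0x53, 0x45} maps to the packed word 0xbca14d8e, i.e., the output bytes {0x8e, 0x4d, 0xa1, 0xbc}. Folding the S-Box into the same table, as the text describes, would only change how the table is filled, not the lookup code.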
We again transpose back to the AES state matrix after the loop to work with the last three transformations outside the loop.

    for(r = 1; r <= pAes->Nr; r++){
        // output column 0
        r0 = r4 & 0xff; r1 = (r5 >> 8)&0xff; r2 = (r6>>16) & 0xff; r3 = r7>>24;   // SR
        r0 = sbmc[r0]; r1 = sbmc[r1]; r2 = sbmc[r2]; r3 = sbmc[r3];               // SB
        tmp1 = r1 >> 24; r1 = r1 << 8; r1 = r1 | tmp1;      // rotate r1 by one byte
        tmp1 = r2 >> 16; r2 = r2 << 16; r2 = r2 | tmp1;     // rotate r2 by two bytes
        tmp1 = r3 >> 8; r3 = r3 << 24; r3 = r3 | tmp1;      // rotate r3 by three bytes
        r0 = r0 ^ r1; r0 = r0 ^ r2; r0 = r0 ^ r3;           // MC
        r1 = enc_key_exp[k++]; r2 = r0 ^ r1; temp[0] = r2;  // AR
        // output column 1
        r3 = r4 >> 24; r0 = r5 & 0xff; r1 = (r6 >> 8)&0xff; r2 = (r7 >> 16) & 0xff;
        r0 = sbmc[r0]; r1 = sbmc[r1]; r2 = sbmc[r2]; r3 = sbmc[r3];
        tmp1 = r1 >> 24; r1 = r1 << 8; r1 = r1 | tmp1;
        tmp1 = r2 >> 16; r2 = r2 << 16; r2 = r2 | tmp1;
        tmp1 = r3 >> 8; r3 = r3 << 24; r3 = r3 | tmp1;
        r0 = r0 ^ r1; r0 = r0 ^ r2; r0 = r0 ^ r3;
        r1 = enc_key_exp[k++]; r0 = r0 ^ r1; temp[1] = r0;
        // output column 2
        r2 = (r4 >> 16)&0xff; r3 = r5 >> 24; r0 = r6 & 0xff; r1 = (r7 >> 8)&0xff;
        r0 = sbmc[r0]; r1 = sbmc[r1]; r2 = sbmc[r2]; r3 = sbmc[r3];
        tmp1 = r1 >> 24; r1 = r1 << 8; r1 = r1 | tmp1;
        tmp1 = r2 >> 16; r2 = r2 << 16; r2 = r2 | tmp1;
        tmp1 = r3 >> 8; r3 = r3 << 24; r3 = r3 | tmp1;
        r0 = r0 ^ r1; r0 = r0 ^ r2; r0 = r0 ^ r3;
        r1 = enc_key_exp[k++]; r0 = r0 ^ r1; temp[2] = r0;
        // output column 3
        r1 = (r4 >> 8)&0xff; r2 = (r5 >> 16)&0xff; r3 = r6 >> 24; r0 = r7 & 0xff;
        r0 = sbmc[r0]; r1 = sbmc[r1]; r2 = sbmc[r2]; r3 = sbmc[r3];
        tmp1 = r1 >> 24; r1 = r1 << 8; r1 = r1 | tmp1;
        tmp1 = r2 >> 16; r2 = r2 << 16; r2 = r2 | tmp1;
        tmp1 = r3 >> 8; r3 = r3 << 24; r3 = r3 | tmp1;
        r0 = r0 ^ r1; r0 = r0 ^ r2; r0 = r0 ^ r3;
        r1 = enc_key_exp[k++]; r0 = r0 ^ r1; temp[3] = r0;
        // state for the next iteration
        r4 = temp[0]; r5 = temp[1]; r6 = temp[2]; r7 = temp[3];
    }

Pcode 2.27: Efficient implementation of AES Cipher loop.

Complexity of Optimized AES Algorithm
At this juncture, we estimate the cycles (see Appendix A, Section A.4, on the companion website) for computing the output T per iteration from Pcode 2.27 as follows. We have 16 state element extracts (16 cycles), 16 XOR operations (16 cycles), 16 look-up table accesses (32 cycles for both address generation and memory load), and 12 rotations (36 = 3 × 12 cycles, as we compute each rotate operation with two shifts and one OR, because there is no rotate instruction on the reference processor). Therefore, the total number of cycles per iteration is 100. The total number of cycles consumed for encrypting one block of 128 bits of data with a 128-bit key using the efficient implementation of the AES cipher given in Pcode 2.27 is 1050 cycles (= 100 × 9 + cycles consumed by the transformations before and after the loop). In addition, with the inverse cipher (using the equivalent inverse cipher in Federal Information Processing Standards, 2001), we consume the same number of cycles for decrypting 128 bits of cipher text. We can use the same Pcode 2.27 for the AES inverse-cipher (i.e., the equivalent inverse cipher) loop as well by simply changing the SR code (as ISR and SR are inversely related) and properly accessing the expanded key data (as the inverse cipher uses keys from the end of the expanded key buffer). We use the sbmc[ ] and isbmc[ ] look-up tables in the cipher and inverse cipher, respectively. Look-up values for sbmc[ ] and isbmc[ ] can be found on this book's companion website. With the described AES implementation method, we can compute all 16 output elements of state in parallel in a single iteration of the loop. If the embedded processor has more than one compute unit (ALU), then the number of cycles required for processing a block declines. On a deeply pipelined embedded processor (with architectural features similar to those of the reference embedded processor) with four compute units, the suggested method can be implemented within 300 (= 1050/4 + overhead) cycles.
The extra overhead may result from uneven compute and data bandwidth in the processor (meaning that the compute slots may be adequate, but the load/store slots for executing an algorithm are insufficient). With the previous efficient AES implementation, we require 1.25 kB (1 kB for sbmc[ ] and 0.25 kB for S-Box[ ]) of L1 data memory for the encryption look-up tables, and we require another 1.25 kB of memory for the decryption look-up tables. Now, depending on the processor used in a particular application (whether it supports 32- or 8-bit registers and whether sufficient on-chip L1 memory is available), we choose either Pcode 2.23 or Pcode 2.27 to implement the AES cipher.

2.4 Keyed-Hash Message Authentication Code

The purpose of the HMAC is the preservation of data authenticity and data integrity. Data authentication is intended to guard against the undetected alteration of data (presumed unaltered from sender to receiver) by a third party. The HMAC uses a cryptographic key in conjunction with a secure hash algorithm (SHA) to generate a message authentication code (MAC). In this section, we discuss the HMAC using the SHA functions and we simulate the HMAC using the SHA-256 function. We also discuss the computational complexity of the HMAC using the SHA-256 algorithm.

2.4.1 HMAC Algorithm

The HMAC plays an important role in digital communications and data storage applications in maintaining data integrity. With the HMAC, we generate a MAC using a secret key that is shared between two parties, namely the sender and the receiver. The HMAC uses this secret key for both generation and verification of the MAC. The sender sends the message along with the MAC, and the receiver receives the message and its MAC (A). Then the receiver computes a new MAC (B) for the received message. If the transmitted message is unaltered, then A and B are the same; otherwise they differ. In this way, the HMAC provides data integrity.
The HMAC uses one of the four SHA functions (SHA-1, SHA-256, SHA-384, and SHA-512) for computing the MAC.

SHA Functions
SHA functions are one-way hash functions used to generate a condensed data representation (called a message digest) for a long data message (the data length for SHA-1 and SHA-256 is < 2^64 bits and for SHA-384 and SHA-512, < 2^128 bits). Here, one-way function means that the input message cannot be reproduced from the condensed data. With the mathematical structures of the existing SHA functions (SHA-1, SHA-256, etc.), it is almost impossible to generate the same message digest value from two different data messages. Moreover, a small change in the data generates an entirely different message digest. Also, it is not computationally feasible to generate an original message from its message digest. These properties enable us to maintain the integrity of the data in which we are interested. SHA functions are used in digital signature algorithms and HMAC algorithms. The performance (strength) of the HMAC depends on the strength of the hash function and the key.

2.4.2 HMAC Description

The general block diagram for the HMAC algorithm is shown in Figure 2.10. The HMAC algorithm inputs are the message (which needs authentication) and a key (which is needed in generating the message authentication code). The outputs are the original message and its authentication code. The HMAC algorithm has three layers. In the first layer, the HMAC parser prepares the data for the SHA parser, and in the second layer, the SHA parser prepares the data for the SHA function. The core hash algorithm sits in the third layer. In the following sections, we discuss the functionalities of all layers in detail.

[Figure 2.10: Block diagram of keyed-hash message authentication code algorithm. Inputs: KEY and Message. Blocks: HMAC Parser, SHA Parser, SHA Function. Outputs: Message and Message Authentication Code.]

[Figure 2.11: Flow diagram of HMAC parser. From KEY and M: determine K0 from KEY; X = K0 ⊕ OPAD; Y = K0 ⊕ IPAD; Z = H(Y || M); A = H(X || Z); MAC(M) = A (t leftmost significant bytes).]

HMAC Parser
The HMAC parser consists of several steps and uses the message M and the key KEY to generate the authentication code. The flow diagram of the HMAC parser is shown in Figure 2.11. The first step of the HMAC parser is determining K0, which is B bytes of data (where B is the length of the SHA-function input block) derived from the given input KEY. The data K0 is derived as follows. If the length of the input KEY is K, then

    K0 = KEY,             if B = K
    K0 = H(KEY),          if B < K (here H is the SHA function)
    K0 = KEY || zeros,    if B > K

In the second step, we compute X and Y by XORing the derived K0 with the OPAD and IPAD data, respectively (where IPAD is the value 0x36 repeated B times, and OPAD is the value 0x5c repeated B times). We compute Z in the third step by passing Y appended with the input message M to the SHA function through the SHA parser. In the fourth step, A is computed by passing X appended with Z to the SHA function through the SHA parser. Finally, in the fifth step, we get the MAC of the input message by extracting the t leftmost significant bytes of A.

SHA Parser
In the SHA parser, we prepare B-byte data blocks for the SHA function. The SHA parser consists of three steps: (1) message padding, (2) dividing the padded message into blocks of B bytes, and (3) initialization of the SHA function state H. We append a bit "1" and a Q-bit value (in the case of SHA-1 or SHA-256, Q = 64, and in the case of SHA-384 or SHA-512, Q = 128) representing L (where L is the length of the input message in bits) to the message data. Bit 1 is appended immediately after the message, whereas the Q-bit value is appended at the end of the block.
To keep the message length a multiple of 8 × B bits (or B bytes), we append zeros between bit 1 and the Q-bit value. The zeros (if needed) and the Q-bit value are appended to the message data in step 1 as message padding. In step 2, we divide the message into N data blocks of 8 × B bits, and pass them to the SHA function one block per iteration for N iterations. The SHA function updates its state H(i) in every iteration. We initialize the SHA state to H(0) in step 3 of the SHA parser before calling the SHA function.

SHA Function
The SHA function is the core module of the HMAC algorithm. The inputs to the SHA function are the message data block M(i) of length 8 × B bits (or B/4 32-bit words) and the SHA state H(i) (for i = 1, 2, . . . , N). All four functions (SHA-1, SHA-256, SHA-384, and SHA-512) are quite similar, with simple variations (e.g., different input sizes, initial states, and constant values). The flow of SHA-1 is a little different from the other three. In the next section, we discuss the most popular function, SHA-256, in detail.

2.4.3 SHA-256 Function

For the SHA-256 function, the length of the input block B is 64 bytes, or 16 32-bit words, and the length of the state H is eight 32-bit words. The SHA function is called N times to compute the hash value, or message digest, of the entire message (divided into N blocks), with one data block per iteration as input. In the SHA-256 function, we perform three steps:

1. Prepare the data block scheduling.
2. Initialize eight working variables with the initial state H(0) values.
3. Update the eight working variables with an iterative process.

At the end of the SHA function, we update the SHA state H(i) by adding the eight working variables to H(i−1) in the corresponding positions. Full details of the SHA-256 function follow.
In step 1 of the SHA-256 function (i.e., preparing the data block schedule), we expand the input data block of 16 32-bit words to 64 32-bit words as follows:

$$W_t = \begin{cases} M_t^{(i)}, & 0 \le t \le 15 \\ \sigma_1^{\{256\}}(W_{t-2}) + W_{t-7} + \sigma_0^{\{256\}}(W_{t-15}) + W_{t-16}, & 16 \le t \le 63 \end{cases}$$

where

$$\sigma_0^{\{256\}}(x) = \mathrm{ROTR}^{7}(x) \oplus \mathrm{ROTR}^{18}(x) \oplus \mathrm{SHR}^{3}(x)$$
$$\sigma_1^{\{256\}}(x) = \mathrm{ROTR}^{17}(x) \oplus \mathrm{ROTR}^{19}(x) \oplus \mathrm{SHR}^{10}(x)$$
$$\mathrm{ROTR}^{n}(y) = (y \gg n) \mid (y \ll (32 - n)), \qquad \mathrm{SHR}^{n}(y) = y \gg n$$

In step 2 of the SHA-256 function, we assign the previous iteration's SHA state H values to the eight working variables (a, b, c, d, e, f, g, and h) as shown here:

$$a = H_0^{(i-1)}, \quad b = H_1^{(i-1)}, \quad c = H_2^{(i-1)}, \quad d = H_3^{(i-1)}, \quad e = H_4^{(i-1)}, \quad f = H_5^{(i-1)}, \quad g = H_6^{(i-1)}, \quad h = H_7^{(i-1)}$$

In step 3, we update the eight working variables of SHA-256 through the following iterative process, repeated 64 times (for j = 0, 1, ..., 63, matching the index range of $W_t$):

$$\begin{aligned} T_1 &= h + \Sigma_1^{\{256\}}(e) + \mathrm{Ch}(e, f, g) + K_j^{\{256\}} + W_j \\ T_2 &= \Sigma_0^{\{256\}}(a) + \mathrm{Maj}(a, b, c) \\ h &= g, \quad g = f, \quad f = e, \quad e = d + T_1 \\ d &= c, \quad c = b, \quad b = a, \quad a = T_1 + T_2 \end{aligned}$$

where

$$\Sigma_0^{\{256\}}(x) = \mathrm{ROTR}^{2}(x) \oplus \mathrm{ROTR}^{13}(x) \oplus \mathrm{ROTR}^{22}(x)$$
$$\Sigma_1^{\{256\}}(x) = \mathrm{ROTR}^{6}(x) \oplus \mathrm{ROTR}^{11}(x) \oplus \mathrm{ROTR}^{25}(x)$$
$$\mathrm{Ch}(x, y, z) = (x \wedge y) \oplus (\tilde{x} \wedge z)$$
$$\mathrm{Maj}(x, y, z) = (x \wedge y) \oplus (x \wedge z) \oplus (y \wedge z)$$

and $K_j^{\{256\}}$ is taken from the following array K[ ] of 64 constant values:

    K[64] = {
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7,
    0xc67178f2};

After completing the three steps of the SHA-256 function, we update the SHA state as follows:

$$H_0^{(i)} = a + H_0^{(i-1)}, \quad H_1^{(i)} = b + H_1^{(i-1)}, \quad H_2^{(i)} = c + H_2^{(i-1)}, \quad H_3^{(i)} = d + H_3^{(i-1)}$$
$$H_4^{(i)} = e + H_4^{(i-1)}, \quad H_5^{(i)} = f + H_5^{(i-1)}, \quad H_6^{(i)} = g + H_6^{(i-1)}, \quad H_7^{(i)} = h + H_7^{(i-1)}$$

Then, we repeat the previous process N times to cover all message blocks $M^{(i)}$. The digest for the entire message is the SHA state after the last iteration:

$$H_0^{(N)} \,\|\, H_1^{(N)} \,\|\, H_2^{(N)} \,\|\, H_3^{(N)} \,\|\, H_4^{(N)} \,\|\, H_5^{(N)} \,\|\, H_6^{(N)} \,\|\, H_7^{(N)}$$

2.4.4 HMAC and SHA-256 Simulation

In this section, we simulate the HMAC with the SHA-256 function. The initial values for the SHA-256 function state H and the defined values for IPAD and OPAD follow:

    H[8] = {        // initial values for SHA-256 state
        0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
        0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19};
    ipad[16] = {    // IPAD for HMAC with SHA-256
        0x36363636, 0x36363636, 0x36363636, 0x36363636, 0x36363636, 0x36363636, 0x36363636, 0x36363636,
        0x36363636, 0x36363636, 0x36363636, 0x36363636, 0x36363636, 0x36363636, 0x36363636, 0x36363636};
    opad[16] = {    // OPAD for HMAC with SHA-256
        0x5c5c5c5c, 0x5c5c5c5c, 0x5c5c5c5c, 0x5c5c5c5c, 0x5c5c5c5c, 0x5c5c5c5c, 0x5c5c5c5c, 0x5c5c5c5c,
        0x5c5c5c5c, 0x5c5c5c5c, 0x5c5c5c5c, 0x5c5c5c5c, 0x5c5c5c5c, 0x5c5c5c5c, 0x5c5c5c5c, 0x5c5c5c5c};

HMAC Parser

The simulation code for the HMAC parser is given in Pcode 2.28. We define constants and declare variables such that the HMAC parser supports the SHA-256 parser and SHA-256 function. Although the SHA function is computationally intensive, it is straightforward and uses only simple operations. The logically complex part of the HMAC algorithm lies in the HMAC parser and SHA parser. Next, we discuss simulating the computation of K0 from the given input KEY. With the computation of K0, we make the input KEY suitable for use with the HMAC + SHA algorithm. Depending on the length of the input KEY (K in bytes), we have three conditions to check in preparing K0.
If K and B (the input block size of the SHA-256 function) are equal, then K0 = KEY. If K < B, then K0 is equal to KEY with (B − K) zero bytes appended on the LSB (least significant bit) side. Appending (B − K) "0x00" bytes to KEY cannot be simulated with a single instruction. We have two choices to simulate this: (1) first zeroing K0 and then copying K bytes from KEY to K0; or (2) first moving K bytes of KEY to K0 and then zeroing the remaining (B − K) bytes. The case K > B is a bit more complex. We first shorten the KEY length to 32 bytes by applying SHA-256 to KEY and then append 32 zero bytes on the LSB side to get K0. Once we have K0, the rest of the HMAC parser is straightforward, with operations for XORing, data appending, and computing hash values. Here, we have to take care to place the data properly in the buffers at the input and output of the SHA function.

    // Note: this code works on 32-bit words; K (key length) and L (message length) are in bits.
    // prepare K0 of length B (= 64 bytes = 512 bits) from the given key of length K bits
    if (K > 512) {
        sha256(key, tmp, K);                          // shorten key to 256 bits
        for (i = 0; i < 8; i++) mac_key[i] = tmp[i];
        for (i = 8; i < 16; i++) mac_key[i] = 0;      // append 256 '0' bits
    }
    else if (K < 512) {
        j = K >> 5;                                   // number of complete 32-bit key words
        for (i = 0; i < j; i++) mac_key[i] = key[i];
        i = K - (j << 5);                             // key bits left in the last, partial word
        if (i > 0) {                                  // guard: shifting a 32-bit value by 32 is undefined in C
            k = 0xFFFFFFFF << (32 - i);               // mask keeping the i most significant bits
            mac_key[j] = key[j] & k;
            j++;
        }
        for (i = j; i < 16; i++) mac_key[i] = 0;      // append the remaining '0x00' bytes
    }
    else {
        for (i = 0; i < 16; i++) mac_key[i] = key[i];
    }
    for (i = 0; i < 16; i++)                          // K0 XOR ipad, placed in in[ ] as prefix
        in[i] = mac_key[i] ^ ipad[i];
    // apply hash and output to tmp[ ] from the 16th word to the 23rd word: H((K0 ^ ipad) || text)
    sha256(in, &tmp[16], L + 512);
    for (i = 0; i < 16; i++)                          // K0 XOR opad, placed in tmp[ ] as prefix
        tmp[i] = mac_key[i] ^ opad[i];
    sha256(tmp, op, 768);                             // H((K0 XOR opad) || H((K0 XOR ipad) || text))

Pcode 2.28: The simulation code for HMAC parser.
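The zero-padding case of the K0 preparation can also be sketched on bytes instead of 32-bit words. This is only an illustration under stated assumptions: the names `HMAC_B`, `hmac_prepare_k0`, and `hmac_xor_pad` are hypothetical, the key is assumed to be a whole number of bytes, and keys longer than B bytes are assumed to have been shortened with SHA-256 by the caller beforehand.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HMAC_B 64   /* SHA-256 input block length in bytes */

/* Hypothetical helper: derive K0 from a key of at most B bytes.
   Returns 0 on success, -1 if the key must be hashed first. */
int hmac_prepare_k0(const uint8_t *key, size_t key_len, uint8_t k0[HMAC_B])
{
    if (key_len > HMAC_B)
        return -1;
    memset(k0, 0, HMAC_B);      /* choice (1): zero K0 first ...        */
    memcpy(k0, key, key_len);   /* ... then copy the K bytes of the key */
    return 0;
}

/* XOR K0 with a repeated pad byte (0x36 for IPAD, 0x5c for OPAD). */
void hmac_xor_pad(const uint8_t k0[HMAC_B], uint8_t pad, uint8_t out[HMAC_B])
{
    size_t i;
    for (i = 0; i < HMAC_B; i++)
        out[i] = k0[i] ^ pad;
}
```

Working on bytes avoids the partial-word masking needed in the word-oriented code, at the cost of more loop iterations on a 32-bit processor.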
At the very beginning, the input text is placed in the buffer in[ ] from the 16th word location onward, and we make sure that the first 16 word positions are empty so that the XORed K0 and IPAD can be placed directly as the prefix in in[ ] before calling the SHA function (with this, appending the input message to K0 becomes easy). The SHA function output is likewise placed after the first 16 word positions of the buffer tmp[ ] so that the XORed K0 and OPAD can be placed directly as a prefix in tmp[ ]. The last SHA function call uses tmp[ ] as its input, and its output (op[ ]) is the MAC (message authentication code). Optionally, we output only the left-most t bytes of op[ ] as the MAC.

SHA Parser

The SHA-256 function works on blocks of 512 bits of data at a time. The SHA parser prepares those 512-bit blocks for the SHA-256 function. The SHA parser gets the message data along with its length (L) as input. The value of L need not be equal to 512; it can be less than or greater than 512. We insert bit "1" and a 64-bit L value into the message data. If the resulting size is not a multiple of 512 bits, then the SHA parser pads "0" bits between the inserted bit 1 and the 64-bit value L. Then we divide the message data into N 512-bit data blocks $M^{(i)}$ and compute the hash value for each data block.

In the SHA parser, we first initialize the SHA state to the predefined initial values $H^{(0)}$. Then, if L ≥ 512, we compute the hash value with the SHA function for each complete 512-bit message block and add it to the SHA state until the length of the remaining message falls below 512 bits. If the remaining message is 448 bits or longer, then we need two more iterations of hash computation; otherwise we compute the hash value once.
In both cases, we insert a bit "1" at the end of the message and the 64-bit length value at the end of the last data block, with padded zeros in between (if needed) to complete the 512-bit blocks, and compute their hash values. After each iteration of hash computation, the computed hash values are added to the previous SHA state by the SHA function. The SHA parser outputs the SHA state (the final result of all iterations) as the message digest. The simulation code for the SHA parser is given in Pcode 2.29.

SHA-256 Function

The SHA-256 function is a simple algorithm built from logical shift and XOR operations. In this SHA function, all additions are performed modulo $2^{32}$. The SHA-256 function consists of three steps: (1) preparation of a 64-word message from the input 16-word (512-bit) message, (2) initialization of the eight SHA-256 working variables, and (3) the iterative message digest process. The SHA-256 function gets the previous SHA state and 16 words of message from the SHA parser as input. In the expanded 64-word message, the first 16 words are the same as the 16 input words.
    // assign initial values of H
    sha_state[0] = H[0]; sha_state[1] = H[1]; sha_state[2] = H[2]; sha_state[3] = H[3];
    sha_state[4] = H[4]; sha_state[5] = H[5]; sha_state[6] = H[6]; sha_state[7] = H[7];
    // hash all complete 512-bit blocks
    n = L >> 5;                                   // complete 32-bit words in the message
    m = n >> 4;                                   // complete 16-word (512-bit) blocks
    k = 0;
    while (m--) {
        for (j = 0; j < 16; j++) w[j] = in[k++];
        sha256fn(sha_state, w);
    }
    j = n - k;                                    // complete words left over
    if (j >= 14) {                                // 448 bits or more remain: two more iterations
        i = L - (n << 5);                         // message bits in the last, partial word
        tmp1 = 0x80000000; tmp1 = tmp1 >> i;      // bit '1' positioned right after the message
        tmp2 = in[n]; tmp2 = tmp2 | tmp1;         // assumes the unused bits of in[n] are zero
        w[15] = 0;
        for (i = 0; i < j; i++) w[i] = in[k++];
        w[i] = tmp2;
        sha256fn(sha_state, w);
        for (i = 0; i < 15; i++) w[i] = 0;        // padding zeros
        w[15] = L;                                // message length in bits
        sha256fn(sha_state, w);
    }
    else {                                        // fewer than 448 bits remain: one more iteration
        i = L - (n << 5);
        tmp1 = 0x80000000; tmp1 = tmp1 >> i;
        tmp2 = in[n]; tmp2 = tmp2 | tmp1;
        for (i = 0; i < 15; i++) w[i] = 0;        // padding zeros
        for (i = 0; i < j; i++) w[i] = in[k++];
        w[i] = tmp2;
        w[15] = L;                                // message length in bits
        sha256fn(sha_state, w);
    }
    out[0] = sha_state[0]; out[1] = sha_state[1]; out[2] = sha_state[2]; out[3] = sha_state[3];
    out[4] = sha_state[4]; out[5] = sha_state[5]; out[6] = sha_state[6]; out[7] = sha_state[7];

Pcode 2.29: The simulation code for SHA parser.

To avoid copying the 16-word input to another buffer in the process of expansion, we pass the input directly into the expand buffer W[ ] by declaring the expand buffer as a global variable. Now, we compute words 16 to 63 of the expanded message (a total of 48 words) by using the equations given in step 1 of the SHA function (see Section 2.4.3). In step 2 of the SHA-256 function, we initialize all eight working variables with the SHA state values. The iterative process of SHA-256 in step 3 updates these eight working variables in each iteration (see step 3 of the SHA function in Section 2.4.3). Here, we compute two temporary values. The first temporary value is computed from some of the working variables, the expanded message, and the predefined constants; the second one is computed from the working variables only.
Then, we obtain the eight working variables for the next iteration from the working variables of the present iteration and the two temporary values. After completion of the iterative process, the updated eight working variables are added to the SHA state. The simulation code for the SHA-256 function is given in Pcode 2.30.

    for (i = 16; i < 64; i++) {               // prepare 64-word length message
        tmp1 = W[i-7];      tmp2 = W[i-16];
        r0 = W[i-2];        r1 = W[i-15];
        r2 = r0 >> 17;      r3 = r1 >> 7;
        r4 = r0 << 15;      r5 = r1 << 25;
        r6 = r2 | r4;       r7 = r3 | r5;     // ROTR17(W[i-2]), ROTR7(W[i-15])
        r4 = r0 >> 19;      r5 = r1 >> 18;
        r2 = r0 << 13;      r3 = r1 << 14;
        r2 = r2 | r4;       r3 = r3 | r5;     // ROTR19(W[i-2]), ROTR18(W[i-15])
        r6 = r6 ^ r2;       r7 = r7 ^ r3;
        r2 = r0 >> 10;      r3 = r1 >> 3;     // SHR10(W[i-2]), SHR3(W[i-15])
        r6 = r6 ^ r2;       r7 = r7 ^ r3;     // sigma1(W[i-2]), sigma0(W[i-15])
        r6 = r6 + tmp1;     r7 = r7 + tmp2;
        W[i] = r6 + r7;
    }
    r0 = state[0]; r1 = state[1];             // initialize a, b
    r2 = state[2]; r3 = state[3];             // initialize c, d
    r4 = state[4]; r5 = state[5];             // initialize e, f
    r6 = state[6]; r7 = state[7];             // initialize g, h
    for (i = 0; i < 64; i++) {                // start message digest loop
        tmp3 = r4 >> 6;         tmp4 = r0 >> 2;
        tmp5 = r4 << 26;        tmp6 = r0 << 30;
        tmp1 = tmp3 | tmp5;     tmp2 = tmp4 | tmp6;     // ROTR6(e), ROTR2(a)
        tmp3 = r4 >> 11;        tmp4 = r0 >> 13;
        tmp5 = r4 << 21;        tmp6 = r0 << 19;
        tmp3 = tmp3 | tmp5;     tmp4 = tmp4 | tmp6;     // ROTR11(e), ROTR13(a)
        tmp1 = tmp1 ^ tmp3;     tmp2 = tmp2 ^ tmp4;
        tmp3 = r4 >> 25;        tmp4 = r0 >> 22;
        tmp5 = r4 << 7;         tmp6 = r0 << 10;
        tmp3 = tmp3 | tmp5;     tmp4 = tmp4 | tmp6;     // ROTR25(e), ROTR22(a)
        tmp1 = tmp1 ^ tmp3;     tmp2 = tmp2 ^ tmp4;     // Sigma1(e), Sigma0(a)
        tmp3 = r4 & r5;         tmp4 = r0 & r1;
        tmp5 = ~r4 & r6;        tmp6 = r0 & r2;
        tmp3 = tmp3 ^ tmp5;     tmp4 = tmp4 ^ tmp6;     // Ch(e, f, g)
        tmp6 = r1 & r2;
        tmp4 = tmp4 ^ tmp6;                             // Maj(a, b, c)
        tmp1 = tmp1 + tmp3;     tmp2 = tmp2 + tmp4;     // tmp2 = T2
        tmp1 = tmp1 + r7;
        tmp1 = tmp1 + K[i];
        tmp1 = tmp1 + W[i];                             // tmp1 = T1
        r7 = r6; r6 = r5; r5 = r4; r4 = r3 + tmp1;
        r3 = r2; r2 = r1; r1 = r0; r0 = tmp1 + tmp2;
    }
    state[0] = state[0] + r0; state[1] = state[1] + r1;
    state[2] = state[2] + r2; state[3] = state[3] + r3;
    state[4] = state[4] + r4; state[5] = state[5] + r5;
    state[6] = state[6] + r6; state[7] = state[7] + r7;

Pcode 2.30: The simulation code for SHA-256 function.
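For readers who want to check the listing against a known answer, the following is a compact, self-contained sketch of the same three steps, written with a rotate macro instead of the unrolled shift/OR pairs. It is not the book's code: the names `sha256_block`, `K256`, `SHA256_H0`, and `ROTR` are chosen here for illustration, and `uint32_t` is assumed so that all additions wrap modulo 2^32.

```c
#include <stdint.h>

static const uint32_t K256[64] = {
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2};

/* Initial SHA-256 state H(0). */
static const uint32_t SHA256_H0[8] = {
    0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
    0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19};

#define ROTR(x, n) (((x) >> (n)) | ((x) << (32 - (n))))

/* Process one 512-bit block (16 words) and update state[0..7] in place. */
void sha256_block(uint32_t state[8], const uint32_t block[16])
{
    uint32_t W[64], a, b, c, d, e, f, g, h, T1, T2;
    int t;

    /* Step 1: expand 16 input words to the 64-word schedule. */
    for (t = 0; t < 16; t++) W[t] = block[t];
    for (t = 16; t < 64; t++) {
        uint32_t s0 = ROTR(W[t-15], 7) ^ ROTR(W[t-15], 18) ^ (W[t-15] >> 3);
        uint32_t s1 = ROTR(W[t-2], 17) ^ ROTR(W[t-2], 19) ^ (W[t-2] >> 10);
        W[t] = s1 + W[t-7] + s0 + W[t-16];
    }

    /* Step 2: load the working variables from the previous state. */
    a = state[0]; b = state[1]; c = state[2]; d = state[3];
    e = state[4]; f = state[5]; g = state[6]; h = state[7];

    /* Step 3: 64 rounds of the message digest loop. */
    for (t = 0; t < 64; t++) {
        T1 = h + (ROTR(e, 6) ^ ROTR(e, 11) ^ ROTR(e, 25))
               + ((e & f) ^ (~e & g)) + K256[t] + W[t];
        T2 = (ROTR(a, 2) ^ ROTR(a, 13) ^ ROTR(a, 22))
               + ((a & b) ^ (a & c) ^ (b & c));
        h = g; g = f; f = e; e = d + T1;
        d = c; c = b; b = a; a = T1 + T2;
    }

    /* Add the working variables to the previous state. */
    state[0] += a; state[1] += b; state[2] += c; state[3] += d;
    state[4] += e; state[5] += f; state[6] += g; state[7] += h;
}
```

A quick check is the standard three-byte test message "abc": after padding it is the single block with first word 0x61626380, fourteen zero words, and last word 0x00000018 (L = 24). Starting from `SHA256_H0`, one call to `sha256_block` should leave the well-known digest beginning with 0xba7816bf in the state.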
Simulation Results

The simulation results of HMAC using the SHA-256 algorithm follow. The inputs for HMAC are 320 bits of message and 264 bits of key. As the length of the key (K) is less than the SHA function input block length (B), we append 248 zero bits (i.e., B − K bytes) to the input KEY to form the 512-bit K0. Intermediate values for the main operations are presented along with their output data lengths in bits.

Input Message (M): 320 bits
    0x00112233, 0x44556677, 0x8899aabb, 0xccddeeff, 0x0f1e2d3c, 0x4b5a6978, 0x8796a5b4, 0xc3d2e1f0,
    0x01234567, 0x89abcdef

Input Key: 264 bits // ignore all bits of last word except 8 msbs
    0x4a09e669, 0xdb67ae81, 0xec6ef374, 0x554ff539, 0x310e527c, 0x7b056882, 0x7f83d9a1, 0x1be0cd18,
    0x20000000

K0: 512 bits
    0x4a09e669, 0xdb67ae81, 0xec6ef374, 0x554ff539, 0x310e527c, 0x7b056882, 0x7f83d9a1, 0x1be0cd18,
    0x20000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000

K0 XOR IPAD: 512 bits
    0x7c3fd05f, 0xed5198b7, 0xda58c542, 0x6379c30f, 0x0738644a, 0x4d335eb4, 0x49b5ef97, 0x2dd6fb2e,
    0x16363636, 0x36363636, 0x36363636, 0x36363636, 0x36363636, 0x36363636, 0x36363636, 0x36363636

(K0 XOR IPAD)||M: 832 bits
    0x7c3fd05f, 0xed5198b7, 0xda58c542, 0x6379c30f, 0x0738644a, 0x4d335eb4, 0x49b5ef97, 0x2dd6fb2e,
    0x16363636, 0x36363636, 0x36363636, 0x36363636, 0x36363636, 0x36363636, 0x36363636, 0x36363636,
    0x00112233, 0x44556677, 0x8899aabb, 0xccddeeff, 0x0f1e2d3c, 0x4b5a6978, 0x8796a5b4, 0xc3d2e1f0,
    0x01234567, 0x89abcdef

H((K0 XOR IPAD)||M): 256 bits
    0x4e938d08, 0x322f37e8, 0x8df9483f, 0x1c68c2e1, 0xfe1411e0, 0x85e8b0d0, 0xbc196189, 0x006378d6

K0 XOR OPAD: 512 bits
    0x1655ba35, 0x873bf2dd, 0xb032af28, 0x0913a965, 0x6d520e20, 0x275934de, 0x23df85fd, 0x47bc9144,
    0x7c5c5c5c, 0x5c5c5c5c, 0x5c5c5c5c, 0x5c5c5c5c, 0x5c5c5c5c, 0x5c5c5c5c, 0x5c5c5c5c, 0x5c5c5c5c

(K0 XOR OPAD)||H((K0 XOR IPAD)||M): 768 bits
    0x1655ba35, 0x873bf2dd, 0xb032af28, 0x0913a965, 0x6d520e20, 0x275934de, 0x23df85fd, 0x47bc9144,
    0x7c5c5c5c,
    0x5c5c5c5c, 0x5c5c5c5c, 0x5c5c5c5c, 0x5c5c5c5c, 0x5c5c5c5c, 0x5c5c5c5c, 0x5c5c5c5c,
    0x4e938d08, 0x322f37e8, 0x8df9483f, 0x1c68c2e1, 0xfe1411e0, 0x85e8b0d0, 0xbc196189, 0x006378d6

H((K0 XOR OPAD)||H((K0 XOR IPAD)||M)): 256 bits
    0xbaa04656, 0x9880510e, 0x94b6c6c7, 0x58737860, 0xc3ccf3d6, 0xc6100ed5, 0x7566260d, 0x8f8b2f33

Message Authentication Code (MAC): 88 bits (taking t = 11 left-most bytes)
    0xbaa04656, 0x9880510e, 0x94b6c600

2.4.5 Computational Complexity of HMAC

The SHA function is the computationally dominant core module of the HMAC algorithm. First we analyze the complexity of the SHA function in terms of cycles (see Appendix A, Section A.4, on the companion website for more details on the cycle consumption of particular operations on the reference embedded processor). The common operations in the SHA function are ROTR, XOR, ADD mod $2^{32}$, SHIFT, and OR. The ROTR operation is achieved with two SHIFTs and one OR. In the first step of the SHA-256 function, we iterate the loop 48 times. In a single iteration of the loop, we have five load-store operations and 20 arithmetic and logical operations, for a total of 25 operations, so a single iteration consumes 25 cycles. Therefore, we consume about 1200 (= 25 × 48) cycles for 48 iterations. We consume eight cycles in assigning the eight working variables. In the iterative message digest process, we run the loop 64 times. In a single iteration of the message digest process, we have 41 arithmetic and logical operations and two load operations. Therefore, a single iteration costs 43 cycles. We consume a total of 2752 (= 43 × 64) cycles for the message digest iterative process; at the end we spend another eight cycles to update the SHA state. With this, the SHA-256 function consumes 3968 (= 1200 + 2752 + 16) cycles. In the SHA parser, we spend 16 cycles for initializing the state and for copying the state to the output buffer at the end.
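As a cross-check of the SHA-256 function estimate derived above, the per-iteration counts can be tallied in a few lines. The function names are hypothetical, and the one-cycle-per-operation model is simply the assumption the text itself uses for the reference embedded processor.

```c
/* Cycle estimate for one SHA-256 function call, using the counts quoted
   in the text (one cycle per operation is assumed, as in the text). */
unsigned sha256_fn_cycles(void)
{
    unsigned expand = 48u * (5u + 20u);   /* message expansion: 48 iterations x 25 ops */
    unsigned init   = 8u;                 /* assign eight working variables            */
    unsigned digest = 64u * (41u + 2u);   /* digest loop: 64 iterations x 43 ops       */
    unsigned update = 8u;                 /* add working variables to the SHA state    */
    return expand + init + digest + update;
}

/* Cycles spent inside the SHA-256 function for a given number of calls. */
unsigned sha256_total_cycles(unsigned calls)
{
    return calls * sha256_fn_cycles();
}
```

With four calls, which is what an HMAC over a 320-bit message and a 264-bit key requires, the tally gives 4 × 3968 = 15,872 cycles.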
We spend 50 to 65 cycles for message padding (which includes inserting bit "1", inserting the 64-bit value L, padding zeros if needed, and dividing the padded message into blocks) and for calling the SHA-256 function. Here, we consume 50 cycles when there is only one call of the SHA-256 function. If the length of the message is larger than 448 bits, then we call the SHA-256 function multiple times. In that case, for each extra call, we consume about 28 cycles (for copying 16 words to the working buffer and for the function call). In the HMAC parser, we consume 16 to 24 cycles to prepare K0 (apart from the SHA-256 function call cycles). We consume 32 cycles for XORing KEY with IPAD and OPAD. Another 20 cycles are consumed for two SHA-parser function calls. The overall cycle consumption of the HMAC algorithm depends on the message length and key length. Here, we analyze HMAC complexity for a message length of 320 bits and a key length of 264 bits. The distribution of clock cycles is as follows:

    HMAC parser:
        K0 preparation:              24 cycles
        IPAD and OPAD:               32 cycles
        Two SHA-parser calls:        24 cycles
    SHA parser (L = 832, 768):
        Two times 2 SHA-256 calls:   200 (= 2 × (16 + 28 + 50 + overhead)) cycles
    SHA-256 function:
        Called four times:           15,872 cycles
    Total:                           16,152 cycles

From this cycle count information, it is clear that the SHA-256 function consumes more than 98% of the cycles, while the HMAC parser and SHA parser together consume less than 2% of the total cycles.

2.5 Elliptic-Curve Digital Signature Algorithm

Public key cryptography allows us to have data authentication. Since the invention of public-key cryptography in 1976 by Whitfield Diffie and Martin Hellman, various public-key cryptographic systems have been proposed. Security in all of these systems relies on the difficulty of solving an underlying mathematical problem.
In public key cryptographic algorithms, the key used for encryption is different from the key used for decryption (unlike symmetric key algorithms, where the same secret key is used for both), and hence public key algorithms are also called asymmetric key algorithms.

2.5.1 Digital Signature Algorithm

The digital signature algorithm (DSA), based on public key cryptography techniques, is used in conjunction with the hash function SHA to provide data authentication and data integrity. See Section 2.4.3 for more details on how to compute the hash value (or message digest) of a given message using the SHA function. In this section, more emphasis is given to DSA based on elliptic curve public-key cryptographic systems. We discuss ECDSA (elliptic-curve DSA) algorithms and their simulation techniques, and present a few simulation results at the end.

DSA Algorithm Analogy

Conceptually, today's electronic mail (e-mail) system works on the philosophy of public key cryptography. In the e-mail system, the user has two identifications for sending or receiving e-mail: (1) an e-mail id and (2) a password. The e-mail id is in the public domain, and the password stays with the user (it is not disclosed to the public). If the user wants to send an e-mail, then that user has to log in to the e-mail system by using his/her password. Once logged in, he or she can send a mail using the e-mail id of the person at the other end. If the user wants to receive an e-mail from the person at the other end, then that person follows the same procedure to send it. In other words, the sender uses his/her password to send a message, and the receiver identifies the e-mail with the help of the sender's e-mail id. In the same way, with DSA using the public key cryptographic system, we have a key pair, namely, a public key and a private key.
If we want authenticity and integrity for the message we communicate, then we use a DSA scheme to provide them. Using the DSA scheme, the sender generates a digital signature using his/her private key and sends the message along with the signature to the recipient. After receiving the message, the recipient verifies the signature using the sender's public key to rule out any third-party involvement in this data communication. In other words, if the received signature is valid, then we assume that the message has not been altered. Later we briefly discuss three popular DSA approaches to protect data/messages. The DSA algorithm is intended for use in electronic mail, electronic funds transfer, electronic data interchange, software distribution, data storage, and other applications that require data integrity assurance and data origin authentication. Similar to DSA, the HMAC (keyed-hash message authentication code) algorithm also provides data/message authentication and integrity. The only difference is that HMAC uses the same key for generation and verification of the authentication code using SHA, whereas the DSA algorithm uses the public key cryptographic system in conjunction with the SHA function to provide data authenticity and integrity.

2.5.2 DSA Description

The building blocks of the digital signature algorithm (DSA) are shown in Figure 2.12. The basic digital signature scheme consists of three blocks: (1) the key-pair generation block, (2) the message digest generation block, and (3) the signature generation/verification block. The key-pair generator generates two keys, which we call the private key and the public key. The private key is a secret key and should not be shared or disclosed. The public key is in the public domain and anyone can access it.
The message digest block computes a unique condensed value (called the message digest) corresponding to the message that is to be communicated, using an SHA hash function. If party A wants to send a message to party B, and if party B wants assurance of message authenticity and integrity, then party A must generate a digital signature for the message using his/her private key and send the message to B along with the signature. Party B checks the validity of the message after receiving it by verifying the signature using the sender's public key. As shown in Figure 2.13, at the source (transmitter side), the sender generates a signature using his/her private key and the message digest value. At the destination (receiver side), the receiver checks the validity of the message by verifying the received signature using the sender's public key and the message digest value. Note that the receiver also computes the message digest for the received message, and that the message digests computed at the transmitter and at the receiver are the same if the message is unaltered.

[Figure 2.12: Digital signature algorithm building blocks. Party A and Party B each have key-pair generation (from SEED-A and SEED-B), message digest computation, and signature generation/verification blocks; the pair (message, signature) is exchanged between them.]

[Figure 2.13: DSA algorithm-flow diagram. Transmitter: message to be transmitted, SHA, message digest value, signature generation with the private key, digital signature. Receiver: received message, SHA, signature verification with the public key, signature valid / not valid.]

The digital signature algorithm uses a mathematical system for its key-pair generation and digital signature generation/verification processes. Any DSA mathematical system consists of a parameter set (field elements, order of a field, etc.) and an operation set (modular arithmetic computations and other operations that depend on the particular parameter set chosen for DSA).
As of today, DSA supports three popular types of parameter sets: (1) the RSA parameter set, (2) the discrete logarithm-based parameter set, and (3) the elliptic curve-based parameter set. Both sender and receiver must use the same parameter set to communicate with each other. In the following subsections, we discuss and compare the three DSA approaches in terms of security (for a given key size) and mathematical complexity.

RSA Public Key Cryptography Based DSA

The RSA digital signature algorithm, based on the integer factorization problem, is an FIPS-approved or NIST-recommended cryptographic algorithm for generating and validating digital signatures. The strength of the RSA algorithm depends on the computational difficulty of factoring large numbers. The steps in the RSA algorithm follow.

1. Generate two large prime numbers p and q.
2. Let $n = pq$, and let $m = (p - 1)(q - 1)$.
3. Choose a small number e, coprime to m.
4. Find d such that $de \bmod m = 1$. Then, publish (e, n) as the public key and keep (d, n) as the secret/private key. See Appendix B, Section B.2, on the companion website for more details on modulo arithmetic.
5. If T and C denote plain text and cipher text, then the encrypted text is $C = T^e \bmod n$ and the decrypted text is $T = C^d \bmod n$.

Discrete Logarithm-Based DSA

The digital signature algorithm based on the discrete logarithm problem (DLDSA) is an FIPS-approved or NIST-recommended cryptographic algorithm for generating and validating digital signatures. The strength of DLDSA depends on the computational difficulty of finding discrete logarithms of large numbers. The key-pair generation, signature generation, and signature verification steps of the DLDSA algorithm follow.

DLDSA Algorithm Key-Pair Generation

1. Choose two large prime numbers p and q such that q divides p − 1.
2. Choose g, an element of order q in GF(p); see Appendix B, Section B.2, on the companion website for more details on Galois fields.
3. Select a random integer x in the range [1, q − 1] and compute $y = g^x \bmod p$.
4. Here, x is the private key (do not disclose it) and y is the public key (disclose it).

Signature Generation Using DLDSA Algorithm

1. Select a random integer k in the interval [1, q − 1].
2. Compute $r = (g^k \bmod p) \bmod q$.
3. Compute $s = k^{-1}(e + xr) \bmod q$, where e = SHA(M) is the message digest value.
4. The signature for message M is (r, s).

Signature Verification Using DLDSA Algorithm

1. Compute e = SHA(M), the message digest value for the received message M.
2. Compute $u_1 = e s^{-1} \bmod q$ and $u_2 = r s^{-1} \bmod q$, where (r, s) is the received signature for M.
3. Compute $v = (g^{u_1} y^{u_2} \bmod p) \bmod q$.
4. If v = r, then the signature is valid; accept the message.

Elliptic Curve-Based DSA

The elliptic curve DSA (ECDSA) algorithm, based on the elliptic curve discrete logarithm problem, is an FIPS-approved or NIST-recommended cryptographic algorithm for generating and validating digital signatures. The strength of ECDSA depends on the computational difficulty of finding the logarithm of an elliptic curve point. The structure and flow of ECDSA are similar to the DLDSA algorithm discussed in Section 2.5.2, Discrete Logarithm-Based DSA. In the later sections, full details of ECDSA along with the necessary algorithms and simulation techniques are discussed. In the next subsection, the three approaches of DSA are compared with respect to security level for given key sizes.

Comparison of Three DSA Approaches

Now, we compare the three DSA approaches, RSA, DLDSA, and ECDSA, with respect to the key sizes a particular approach needs for a required security level. The key sizes of the three approaches for a given security level are given in Table 2.4. If we avoid the weak instances of the three approaches and use a general-purpose algorithm to solve the underlying problem of each, then RSA and DLDSA are solved in subexponential time (a problem solvable in subexponential time is still considered hard), whereas ECDSA can be solved only in exponential time.
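The DLDSA key-pair generation, signing, and verification steps listed earlier in this section can be exercised with toy parameters. The sketch below is illustrative only: the names `dsa_params`, `dsa_sign`, `dsa_verify`, `mod_pow`, and `mod_inv_prime` are hypothetical, the numbers are far too small to be secure, and the digest e is passed in as a plain integer rather than computed with SHA. The modular inverse is taken with Fermat's little theorem, which is a choice made here and is valid because q is prime.

```c
#include <stdint.h>

/* (base^exp) mod m by square-and-multiply; toy sizes only, so that
   intermediate products fit in 64 bits. */
static uint64_t mod_pow(uint64_t base, uint64_t exp, uint64_t m)
{
    uint64_t result = 1 % m;
    base %= m;
    while (exp > 0) {
        if (exp & 1) result = (result * base) % m;
        base = (base * base) % m;
        exp >>= 1;
    }
    return result;
}

/* Inverse modulo a prime q via Fermat's little theorem: a^(q-2) mod q. */
static uint64_t mod_inv_prime(uint64_t a, uint64_t q)
{
    return mod_pow(a, q - 2, q);
}

typedef struct { uint64_t p, q, g; } dsa_params;   /* toy domain parameters */

/* Signature generation: r = (g^k mod p) mod q, s = k^-1 (e + x r) mod q. */
void dsa_sign(const dsa_params *dp, uint64_t x, uint64_t k, uint64_t e,
              uint64_t *r, uint64_t *s)
{
    *r = mod_pow(dp->g, k, dp->p) % dp->q;
    *s = (mod_inv_prime(k, dp->q) * ((e + x * (*r)) % dp->q)) % dp->q;
}

/* Signature verification: returns 1 when v = (g^u1 y^u2 mod p) mod q equals r. */
int dsa_verify(const dsa_params *dp, uint64_t y, uint64_t e,
               uint64_t r, uint64_t s)
{
    uint64_t w, u1, u2, v;
    if (r == 0 || r >= dp->q || s == 0 || s >= dp->q) return 0;
    w  = mod_inv_prime(s, dp->q);
    u1 = ((e % dp->q) * w) % dp->q;
    u2 = (r * w) % dp->q;
    v  = ((mod_pow(dp->g, u1, dp->p) * mod_pow(y, u2, dp->p)) % dp->p) % dp->q;
    return v == r;
}
```

As a usage example, take p = 23, q = 11 (q divides p − 1 = 22), and g = 2, which has order 11 in GF(23). With private key x = 7 the public key is y = 2^7 mod 23 = 13. Signing the digest value e = 5 with k = 3 gives r = 8 and s = 2; on the verifier's side u1 = 8, u2 = 4, and v = 8, so the signature is accepted.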
In simple terms, the difference between subexponential and exponential solving time means that the elliptic curve discrete logarithm problem is currently considered harder than either the integer factorization problem or the discrete logarithm problem. Table 2.5 compares the time required to break ECDSA with the time required to break RSA or DLDSA for various key sizes using the best-known general algorithm. The values are given in MIPS years. A MIPS year represents a computing time of 1 year on a machine capable of performing one million instructions per second.

2.5.3 Elliptic Curves Overview

The mathematical system (with its parameter set and operation set) used in DSA forms an algebraic group. A group consists of a set of elements with predefined operations on those elements. In this section, we discuss algebraic groups formed by elliptic curves. For elliptic curve groups, the operation set is defined geometrically. Before going to elliptic curve groups defined over finite fields, we look at elliptic curves over the real numbers.

Elliptic Curves

An elliptic curve over the real numbers is defined as the set of points $\{(X_i, Y_i)\}$ satisfying an elliptic curve equation E(x, y) of the form $y^2 = x^3 + ax + b$, where a and b are real numbers. Different values of the parameters a and b give different elliptic curves. A geometrical view of one such elliptic curve is shown in Figure 2.14, where $P:(X_p, Y_p)$, $Q:(X_q, Y_q)$, and $R:(X_r, Y_r)$ are three points on the elliptic curve E(x, y). If $4a^3 + 27b^2$ is not 0, then the elliptic curve $y^2 = x^3 + ax + b$ forms an additive group, meaning that the points on the elliptic curve satisfy the closure property (i.e., the point obtained by adding two points on the elliptic curve also lies on the elliptic curve), the identity property (there is an identity element with respect to addition), and the inverse property (every point has an inverse element with respect to addition). See the following subsections for the rules of addition of elliptic curve points.
An elliptic curve group over the real numbers consists of the points on the corresponding elliptic curve, together with a special point O called the point at infinity. In elliptic curve operations, O is treated as the identity element, and the elliptic curve additive group satisfies the identity property P + O = O + P = P. The reflection of a point R on the elliptic curve with respect to the x-axis is treated as −R, and its coordinates are $(X_r, -Y_r)$. If P = −Q, then P + Q = O, the point at infinity, and hence the group satisfies the inverse property over addition.

Table 2.4: Comparison of three approaches with respect to key sizes for a given security level

                        RSA     DLDSA   ECDSA
    Private key size    2048    160     160
    Public key size     1088    1024    161

Table 2.5: Comparison of security levels of RSA, DLDSA, and ECDSA for given key sizes

    MIPS years      RSA key size    DLDSA key size    ECDSA key size
    4.5 × 10^5      512             512               128
    3 × 10^12       1024            1024              172
    3 × 10^21       2048            2048              234
    2 × 10^33       4096            4096              314

[Figure 2.14: Elliptic curve E(x, y), showing the points P, Q, R, and 2R.]

Addition of Two Points on Elliptic Curve

The sum of two points P and Q (where P ≠ Q) on an elliptic curve is defined as the reflection, with respect to the x-axis, of the point $-R:(X_r, -Y_r)$ at which the line passing through P and Q intersects the elliptic curve. The geometric interpretation of the addition of points P and Q on the elliptic curve E(x, y) is shown in Figure 2.14. Algebraically, the coordinates $(X_r, Y_r)$ of the point R resulting from adding points P and Q are obtained as

$$X_r = s^2 - X_p - X_q, \qquad Y_r = -Y_p + s(X_p - X_r), \qquad s = \frac{Y_p - Y_q}{X_p - X_q},$$

where s is the slope of the line passing through P and Q.

Point Double on Elliptic Curve

When P and Q represent the same point on the elliptic curve, we define another operation, called point double, instead of point addition. The point double is defined as the reflection of the point $-R:(X_r, -Y_r)$ at which the tangent line through $P:(X_p, Y_p)$ intersects the elliptic curve.
The coordinates $(X_r, Y_r)$ of the point R = 2P are obtained as

$$X_r = s^2 - 2X_p, \qquad Y_r = -Y_p + s(X_p - X_r), \qquad s = \frac{3X_p^2 + a}{2Y_p},$$

where s is the slope of the tangent at point P.

Scalar Point Multiplication

Multiplication of a point P of an elliptic curve by a constant k is termed scalar point multiplication. Scalar multiplication of point P by k results in another point S on the elliptic curve. If k = 5, then S = 5P, and the point S is obtained from P with point double and point add operations: 2P after the first doubling, 4P after the second doubling, and 5P after adding P to 4P. As seen in subsequent sections, scalar point multiplication is the computationally intensive part of the ECDSA algorithm.

Elliptic Curves over Finite Fields GF(q)

Elliptic curves over the real numbers are of no practical use in cryptographic applications. Moreover, it is computationally not feasible to work with real-number elliptic curves. Therefore, hereafter we consider elliptic curves defined over finite fields. A group over a finite field contains a finite number of elements, and the output of the group operation after modulo reduction (either with a prime P or with an irreducible polynomial, depending on the finite field) is an element that also belongs to the same finite field. The order of the finite field is the number of elements in that finite field. Next, we discuss elliptic curves defined over prime Galois fields GF(P) and binary Galois fields GF($2^m$).

Elliptic Curves over Prime Field GF(P)

An elliptic curve E(P) over GF(P) defined by the parameters a and b is the set of solutions $\{(X_i, Y_i)\}$, for $X_i, Y_i \in \mathrm{GF}(P)$, to the equation $y^2 = x^3 + ax + b$, together with the point O at infinity. The number of points in E(P) is denoted by #E(P). If $4a^3 + 27b^2 \pmod{P}$ is not zero, then E(P) forms an additive group satisfying the closure, identity, and inverse properties.
In the prime field GF(P), the equations for the elliptic curve point operations are the same as those defined over the real numbers in the previous section, except that the result of each operation is reduced modulo the prime number P to make sure that it belongs to the prime field GF(P).

■ Example 2.1: Elliptic Curve over GF(23)

The points that satisfy the elliptic curve $y^2 = x^3 + x + 1$ defined over GF(23) with a = b = 1 follow:

(0, 1) (0, 22) (1, 7) (1, 16) (3, 10) (3, 13) (4, 0) (5, 4) (5, 19) (6, 4) (6, 19) (7, 11) (7, 12) (9, 7) (9, 16) (11, 3) (11, 20) (12, 4) (12, 19) (13, 7) (13, 16) (17, 3) (17, 20) (18, 3) (18, 20) (19, 5) (19, 18).

The curve E(23) has 28 points, including the point at infinity O (we can assign O = (0, 0) in this example, as (0, 0) is not on the curve). If P = (5, 4) and Q = (7, 11), then using the point addition rule and the point double rule, the points R = P + Q and W = 2P are computed as follows (see Appendix B, Section B.2.2, on the companion website for more details on computing in GF(P)).

$P = (X_p, Y_p) = (5, 4)$, $Q = (X_q, Y_q) = (7, 11)$

$R = (X_r, Y_r) = P + Q$:

$s = (Y_p - Y_q)/(X_p - X_q) = (4 - 11)/(5 - 7) = -7/-2 = 7/2 = (7 + 23)/2 = 30/2 = 15$
$X_r = s^2 - X_p - X_q = 225 - 5 - 7 \pmod{23} = 213 \pmod{23} = 6$
$Y_r = s(X_p - X_r) - Y_p = 15(5 - 6) - 4 \pmod{23} = -19 \pmod{23} = -19 + 23 = 4$

$W = (X_w, Y_w) = 2P$:

$s = (3X_p^2 + a)/(2Y_p) = (75 + 1)/8 = 76 \pmod{23}/8 = 7/8 = (7 + 23 \times 7)/8 = 168/8 = 21$
$X_w = s^2 - 2X_p = 441 - 2 \times 5 = 431 \pmod{23} = 17$
$Y_w = s(X_p - X_w) - Y_p = 21(5 - 17) - 4 = -256 \pmod{23} = -3 \pmod{23} = -3 + 23 = 20$

Note that the resulting points R = (6, 4) and W = (17, 20), after addition and doubling of the given points P and Q, also lie on the same elliptic curve.
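The arithmetic of Example 2.1 is small enough to replay in a few lines of C. The sketch below is illustrative only: the names `Point`, `mod_p`, `inv_p`, `ec_add`, `ec_order`, and `check_example_2_1` are assumptions of this sketch rather than routines from the book, and the inverse is found by exhaustive search, which is reasonable only for a toy field such as GF(23).

```c
#include <assert.h>

/* Toy parameters from Example 2.1: y^2 = x^3 + a*x + b over GF(23). */
enum { PRIME = 23, CURVE_A = 1, CURVE_B = 1 };

typedef struct { int x, y, inf; } Point;   /* inf != 0 marks the point O */

/* Reduce v into [0, PRIME-1]; the C % operator may return negative values. */
static int mod_p(int v) { v %= PRIME; return v < 0 ? v + PRIME : v; }

/* Inverse by exhaustive search; adequate only for a tiny field. */
static int inv_p(int v) {
    for (int i = 1; i < PRIME; i++)
        if (mod_p(v * i) == 1) return i;
    return 0;                               /* v = 0 has no inverse */
}

/* Point addition and point double with the slope formulas of the text. */
static Point ec_add(Point p, Point q) {
    Point o = {0, 0, 1}, r = {0, 0, 0};
    int s;
    if (p.inf) return q;                    /* O + Q = Q */
    if (q.inf) return p;                    /* P + O = P */
    if (p.x == q.x && mod_p(p.y + q.y) == 0) return o;   /* P = -Q */
    if (p.x == q.x)                         /* P = Q: tangent slope */
        s = mod_p((3 * p.x * p.x + CURVE_A) * inv_p(2 * p.y));
    else                                    /* P != Q: chord slope */
        s = mod_p((p.y - q.y) * inv_p(p.x - q.x));
    r.x = mod_p(s * s - p.x - q.x);
    r.y = mod_p(s * (p.x - r.x) - p.y);
    return r;
}

/* Number of points on the curve: affine solutions plus one for O. */
static int ec_order(void) {
    int n = 1;
    for (int x = 0; x < PRIME; x++)
        for (int y = 0; y < PRIME; y++)
            if (mod_p(y * y - (x * x * x + CURVE_A * x + CURVE_B)) == 0) n++;
    return n;
}

/* Replays Example 2.1: R = P + Q = (6, 4) and W = 2P = (17, 20). */
void check_example_2_1(void) {
    Point p = {5, 4, 0}, q = {7, 11, 0};
    Point r = ec_add(p, q), w = ec_add(p, p);
    assert(r.x == 6 && r.y == 4);
    assert(w.x == 17 && w.y == 20);
    assert(ec_order() == 28);
}
```

Calling `check_example_2_1()` from a small `main` exercises both rules; the same `ec_add` also covers the cases P + O and P + (−P) that the group properties require.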
■

Elliptic Curves over Binary Field GF($2^m$)

An elliptic curve E($2^m$) over GF($2^m$) defined by the parameters $a, b \in GF(2^m)$, $b \neq 0$, is the set of solutions $\{(X_i, Y_i) : X_i, Y_i \in GF(2^m)\}$ to the equation $y^2 + xy = x^3 + ax^2 + b$, together with a point O at infinity. The number of points in E($2^m$) is denoted by #E($2^m$). The additive inverse of a point R:$(X_r, Y_r)$ of E($2^m$) is defined as −R:$(X_r, X_r + Y_r)$. With this, the points of the elliptic curve E($2^m$) form an additive group satisfying the closure, identity, and inverse properties. The operations of the elliptic curve over the GF($2^m$) field are defined in the following.

Addition Rule

Let P:$(X_p, Y_p) \in E(2^m)$ and Q:$(X_q, Y_q) \in E(2^m)$ be two points such that $X_p \neq X_q$. Then the coordinates $(X_r, Y_r)$ of R, the result of adding the two points P and Q, are given by

$X_r = s^2 + s + X_p + X_q + a$, $Y_r = s(X_p + X_r) + Y_p + X_r$, where $s = (Y_p + Y_q)/(X_p + X_q)$.

Doubling Rule

Let $(X_p, Y_p) \in E(2^m)$ be a point with $X_p \neq 0$. The coordinates $(X_r, Y_r)$ of R, the result of doubling P, are given by

$X_r = s^2 + s + a$, $Y_r = X_p^2 + (s + 1)X_r$, where $s = X_p + Y_p/X_p$.

■ Example 2.2: Elliptic Curve over GF($2^4$)

With the irreducible polynomial $f(x) = x^4 + x + 1$ and primitive element α, the generated elements of GF($2^4$) follow (see Appendix B, Section B.2.3, on the companion website for more details on computing in GF($2^m$)).

$\alpha^0 = (0001)$, $\alpha^1 = (0010)$, $\alpha^2 = (0100)$, $\alpha^3 = (1000)$, $\alpha^4 = (0011)$, $\alpha^5 = (0110)$, $\alpha^6 = (1100)$, $\alpha^7 = (1011)$, $\alpha^8 = (0101)$, $\alpha^9 = (1010)$, $\alpha^{10} = (0111)$, $\alpha^{11} = (1110)$, $\alpha^{12} = (1111)$, $\alpha^{13} = (1101)$, $\alpha^{14} = (1001)$, $\alpha^{15} = \alpha^0 = (0001)$

Consider an elliptic curve E($2^4$) over GF($2^4$), with defining equation $y^2 + xy = x^3 + \alpha^4 x^2 + 1$, that is, $a = \alpha^4$ and $b = 1$.
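The element table and the two rules above can be exercised directly on this small curve. The following C sketch is a hedged illustration: it stores each field element as the 4-bit vector listed above, and all names (`gf16_mul`, `gf16_inv`, `gf16_alpha_pow`, `on_curve`, `count_affine_points`, `ec_add`, `check_example_2_2`) are assumptions of this sketch, not functions from the book. The inverse is found by search, which only makes sense for a 16-element field.

```c
#include <assert.h>

/* f(x) = x^4 + x + 1 -> 10011b; a = alpha^4 = 0011b; b = 1. */
enum { FIELD_POLY = 0x13, CURVE_A = 0x3, CURVE_B = 0x1 };

typedef struct { unsigned x, y; int inf; } Point;   /* inf != 0 marks O */

/* Shift-and-add multiplication in GF(2^4), reducing by f(x) on the fly. */
static unsigned gf16_mul(unsigned a, unsigned b) {
    unsigned r = 0;
    while (b) {
        if (b & 1) r ^= a;
        b >>= 1;
        a <<= 1;
        if (a & 0x10) a ^= FIELD_POLY;
    }
    return r;
}

/* Inverse by exhaustive search over the 15 nonzero elements. */
static unsigned gf16_inv(unsigned a) {
    for (unsigned i = 1; i < 16; i++)
        if (gf16_mul(a, i) == 1) return i;
    return 0;
}

/* alpha^e, with alpha = 0010b. */
static unsigned gf16_alpha_pow(unsigned e) {
    unsigned r = 1;
    while (e--) r = gf16_mul(r, 2);
    return r;
}

/* Tests y^2 + xy = x^3 + a*x^2 + b. */
static int on_curve(unsigned x, unsigned y) {
    unsigned x2 = gf16_mul(x, x);
    unsigned lhs = gf16_mul(y, y) ^ gf16_mul(x, y);
    unsigned rhs = gf16_mul(x2, x) ^ gf16_mul(CURVE_A, x2) ^ CURVE_B;
    return lhs == rhs;
}

static int count_affine_points(void) {
    int n = 0;
    for (unsigned x = 0; x < 16; x++)
        for (unsigned y = 0; y < 16; y++)
            n += on_curve(x, y);
    return n;
}

static Point ec_add(Point p, Point q) {
    Point o = {0, 0, 1}, r = {0, 0, 0};
    unsigned s;
    if (p.inf) return q;
    if (q.inf) return p;
    if (p.x == q.x && (p.y ^ q.y) == p.x) return o;  /* Q = -P = (Xp, Xp + Yp) */
    if (p.x == q.x) {                                /* doubling rule, Xp != 0 */
        s = p.x ^ gf16_mul(p.y, gf16_inv(p.x));
        r.x = gf16_mul(s, s) ^ s ^ CURVE_A;
        r.y = gf16_mul(p.x, p.x) ^ gf16_mul(s ^ 1, r.x);
    } else {                                         /* addition rule */
        s = gf16_mul(p.y ^ q.y, gf16_inv(p.x ^ q.x));
        r.x = gf16_mul(s, s) ^ s ^ p.x ^ q.x ^ CURVE_A;
        r.y = gf16_mul(s, p.x ^ r.x) ^ p.y ^ r.x;
    }
    return r;
}

/* P = (alpha^5, alpha^3), Q = (alpha^6, alpha^8): both results stay on the curve. */
void check_example_2_2(void) {
    Point p = {gf16_alpha_pow(5), gf16_alpha_pow(3), 0};
    Point q = {gf16_alpha_pow(6), gf16_alpha_pow(8), 0};
    Point r = ec_add(p, q), w = ec_add(p, p);
    assert(count_affine_points() == 15);             /* 16 with O */
    assert(r.x == gf16_alpha_pow(12) && r.y == gf16_alpha_pow(12));
    assert(on_curve(r.x, r.y) && on_curve(w.x, w.y));
}
```

Addition in GF($2^4$) is a plain XOR of the 4-bit vectors, which is why the sketch needs no separate add routine.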
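The solution set of the elliptic curve E($2^4$) defined over GF($2^4$) is given by:

$\{(0, \alpha^0), (\alpha^0, \alpha^6), (\alpha^0, \alpha^{13}), (\alpha^3, \alpha^8), (\alpha^3, \alpha^{13}), (\alpha^5, \alpha^3), (\alpha^5, \alpha^{11}), (\alpha^6, \alpha^8), (\alpha^6, \alpha^{14}), (\alpha^9, \alpha^{10}), (\alpha^9, \alpha^{13}), (\alpha^{10}, \alpha^1), (\alpha^{10}, \alpha^8), (\alpha^{12}, \alpha^0), (\alpha^{12}, \alpha^{12})\}$

The solution set has 16 elements, including the point at infinity O (we can assign O = (0, 0) in this example, as (0, 0) is not on the curve). If $P = (\alpha^5, \alpha^3)$ and $Q = (\alpha^6, \alpha^8)$, then, using the point addition rule and the point double rule, the points R = P + Q and W = 2P are computed as follows.

R = P + Q:

$s = (Y_p + Y_q)/(X_p + X_q) = (\alpha^3 + \alpha^8)/(\alpha^5 + \alpha^6) = \alpha^{13}/\alpha^9 = \alpha^4$
$X_r = s^2 + s + X_p + X_q + a = \alpha^8 + \alpha^4 + \alpha^5 + \alpha^6 + \alpha^4 = (\alpha^2 + 1) + (\alpha^2 + \alpha) + (\alpha^3 + \alpha^2) = \alpha^3 + \alpha^2 + \alpha + 1 = \alpha^{12}$
$Y_r = s(X_p + X_r) + Y_p + X_r = \alpha^4(\alpha^5 + \alpha^{12}) + \alpha^3 + \alpha^{12} = \alpha^4\alpha^{14} + \alpha^{10} = \alpha^3 + (\alpha^2 + \alpha + 1) = \alpha^{12}$

W = 2P:

$s = X_p + Y_p/X_p = \alpha^5 + \alpha^3/\alpha^5 = \alpha^2 + \alpha + \alpha^{18}/\alpha^5 = \alpha^2 + \alpha + \alpha^{13} = \alpha^2 + \alpha + \alpha^3 + \alpha^2 + 1 = \alpha^7$
$X_w = s^2 + s + a = \alpha^{14} + \alpha^7 + \alpha^4 = (\alpha^3 + 1) + (\alpha^3 + \alpha + 1) + (\alpha + 1) = \alpha^0$
$Y_w = X_p^2 + (s + 1)X_w = \alpha^{10} + (\alpha^7 + 1)\alpha^0 = (\alpha^2 + \alpha + 1) + (\alpha^3 + \alpha) = \alpha^3 + \alpha^2 + 1 = \alpha^{13}$

Here $\alpha^7 + 1 = (\alpha^3 + \alpha + 1) + 1 = \alpha^3 + \alpha$, so the doubling rule gives $Y_w = \alpha^{13}$; omitting the "+ 1" would give $\alpha^6$, which is the y-coordinate of −W:$(\alpha^0, \alpha^0 + \alpha^{13})$. Note that the resulting points R:$(\alpha^{12}, \alpha^{12})$ and W:$(\alpha^0, \alpha^{13})$, after addition and doubling of the given points P:$(\alpha^5, \alpha^3)$ and Q:$(\alpha^6, \alpha^8)$, also lie on the same elliptic curve.

■

2.5.4 ECDSA

In this section, we discuss the application of elliptic curves defined over finite fields GF(q). Similar to the discrete logarithm problem (DLP), the elliptic curve discrete logarithm problem (ECDLP) is stated as follows: find the integer a, given $Q \in E(q)$ and $W = aQ$, where q is a prime P or $2^m$. As described in Section 2.5.2, Comparison of Three DSA Approaches, solving the ECDLP needs exponential computational time. For this reason, digital signature algorithms (DSA) over elliptic curve groups are recommended for many applications. Before going into the use of elliptic curves in DSA, we explore some of the standard parameters (also called domain parameters) necessary to work with the ECDLP.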
These domain parameters follow:

• Elliptic curve coefficients: a, b
• Elliptic curve base point: G
• Order of elliptic curve base point G: n (a subset of n elements of E(q) is given by rG, 1 ≤ r ≤ n − 1, together with O)
• Cofactor: h (equal to N/n, where N is the order of the elliptic curve, #E(q))

First we set up the parameter set by selecting the coefficients a and b of the elliptic curve defined over GF(q). Then we select a base point G whose order is n. With this, we can generate a subset of elliptic curve group elements as {O, G, 2G, 3G, . . . , (n − 1)G}. Here, the choice of the base point G is not a security consideration as long as it has a large prime order as required by the standards. However, sender and receiver must use the same set of elliptic curve domain parameters. One example set of domain parameters follows:

    a = 00 17858FEB 7A989751 69E171F7 7B4087DE 098AC8A9 11DF7B01
    b = 00 FDFB49BF E6C3A89F ACADAA7A 1E5BBC7C C1C2E5D8 31478814
    G = (01 F481BC5F 0FF84A74 AD6CDF6F DEF4BF61 79625372 D8C0C5E1,
         00 25E399F2 903712CC F3EA9E3A 1AD17FB0 B3201B6A F7CE1B05)
    n = 01 00000000 00000000 00000000 C7F34A77 8F443ACC 920EBA49
    h = 2

The previous domain parameters are used with the elliptic curve E: $y^2 + xy = x^3 + ax^2 + b$ over GF($2^{193}$).

Key-Pair Generation

In key-pair generation, we first choose a statistically unique random number k in the interval [1, n − 1]. Usually k is generated using a pseudorandom number generator (the block ciphers discussed in Sections 2.2 and 2.3 can be used for pseudorandom number generation), and we assume that k is available for our key-pair generation process. Once we have k, we compute a point W on E($2^m$) as W = kG. In other words, we compute the point W by multiplying the elliptic curve base point G by a large random number k. The size of the random number k can be up to m bits. The flow diagram of key-pair generation is shown in Figure 2.15.
As shown in Figure 2.15, the ECDSA key-pair generation process outputs (k, W), where k is the private key and W is the public key. The private key k should not be shared with the public; only the sender uses k, for generating the signature of a message. Anyone can have access to the public key W, and the recipient verifies the digital signature using W. Techniques to implement the key generation process on an embedded processor are presented in Section 2.5.5.

Figure 2.15: Flow diagram of ECDSA key-pair generation process. (Start → Generate random number, k → W = kG → Output (k, W) → End)

Signature Generation

The ECDSA signature generation process consists of three steps: (1) pseudorandom number generation, (2) message digest computation, and (3) signature generation. The flow diagram of the ECDSA signature generation process is shown in Figure 2.16.

Figure 2.16: Flow diagram of ECDSA signature generation process. (Start → Get elliptic curve domain parameters set {a, b, G, n} → Generate random number, d → Compute message digest, e → Compute dG = (Xg, Yg) → Compute r = Xg mod n → Compute s = d^{−1}(e + k · r) mod n → Output (r, s) → End)

In the signature-generation process, after generating the pseudorandom number d, we compute P:$(X_p, Y_p) = dG$ by using the elliptic curve point scalar multiplication algorithm. We compute the message digest value e using the SHA function (see Section 2.4.3 for more details on message digest generation algorithms). Then we generate $r = X_p \bmod n$ and $s = t \bmod n$, where $t = d^{-1}(e + k \cdot r)$. In scalar point multiplication, the operations involved are over either the prime field GF(P) or the binary field GF($2^m$), whereas the operations involved in generating r and s are over the prime field GF(n), where n is the order of the elliptic curve base point G. In signature generation, we have one inverse, one multiplication, and one addition over GF(n). In the later sections, we discuss the algorithms for computing inverse and multiplication over GF(n) in detail.
After generating the signature, the sender sends the message M along with the signature (r, s) to the receiver.

Signature Verification

Signature verification is done at the receiving end by the receiver. We get the message M along with the signature (r, s) from the sender, and we verify the signature by using the sender's public key W. The signature verification process also requires the message digest, and we compute it by using the same SHA function that the sender used for computing the message digest. The flow diagram of the signature-verification process is shown in Figure 2.17. In the signature-verification process, we have one inverse and two multiplications over GF(n), and two scalar point multiplications and one addition of elliptic curve points over GF(q). In Section 2.5.5, signature verification algorithms and their simulation techniques are presented. Next, an example of ECDSA over GF(23) is presented.

Figure 2.17: Flow diagram of signature-verification process. (Start → Get elliptic curve domain parameters set {a, b, G, n} and received signature (r, s) → Compute message digest, e → Compute t = s^{−1} (mod n) → Compute u1 = e · t (mod n), u2 = r · t (mod n) → Compute (X1, Y1) = u1G + u2W → Compute v = X1 (mod n) → v = r? Yes: SV = 1; No: SV = 0 → End)

■ Example 2.3: ECDSA over Prime Field GF(23)

In this example, the ECDSA algorithm flow for key-pair generation, signature generation, and signature verification is presented. We start the ECDSA by first selecting the domain parameters.

Domain Parameters (E: $y^2 = x^3 + ax + b$)

Elliptic curve coefficients: a = 1, b = 1
Elliptic curve base point: G = (13, 7)
Order of elliptic curve base point G: n = 7 (since 7G = O, the point at infinity)
Cofactor: h = 4 (h = N/n [see Section 2.5.4]; the curve has a total of N = 28 points)

Key-Pair Generation

1. Select a random number d in the range [1, n − 1] = [1, 6], say d = 4.
2. Compute the point W = dG = 4G = 2(2G) (i.e., doubling G two times is required [see Example 2.1 for the equation of the doubling operation]) = 4(13, 7) = 2(2(13, 7)) = 2(5, 4) = (17, 20).
3. Here d = 4 is the private key and the point W = (17, 20) is the public key.

Signature Generation

1. Select a random number k in the range [1, n − 1] = [1, 6], say k = 3.
2. Compute the point $(X_1, Y_1)$ = kG = 3G = G + 2G (i.e., one point doubling and one point addition are required) = 3(13, 7) = (13, 7) + 2(13, 7) = (13, 7) + (5, 4) = (17, 3); $r = X_1 \pmod n = 17 \pmod 7 = 3$.
3. Let us assume for now that the message digest value e of the given message M is equal to 5. Then $s = k^{-1}(e + dr) \pmod n = 3^{-1}(5 + 4 \cdot 3) \pmod 7 = 1$.
4. The signature is (r, s) = (3, 1).
5. We send the message M along with its signature (3, 1) to the recipient.

Signature Verification

At the recipient, we have the message M along with its signature (r, s) = (3, 1). We compute the message digest value e for message M again for signature verification, and e = 5 (the same as what we computed, or assumed, in signature generation).

$t = s^{-1} \pmod n = 1^{-1} \pmod 7 = 1$
$u_1 = e \cdot t \pmod n = 5 \cdot 1 \pmod 7 = 5$
$u_2 = r \cdot t \pmod n = 3 \cdot 1 \pmod 7 = 3$
$(X_1, Y_1) = u_1 G + u_2 W = 5(13, 7) + 3(17, 20)$
  = [(13, 7) + 2(2(13, 7))] + [(17, 20) + 2(17, 20)]
  = [(13, 7) + (17, 20)] + [(17, 20) + (13, 7)]
  = (5, 19) + (5, 19)
  = 2(5, 19) (if points P and Q are the same, then P + Q is obtained as 2P)
  = (17, 3)
$v = X_1 \bmod n = 17 \bmod 7 = 3$

Since v = r, the signature is valid and we accept the message, because it was not altered during the transmission.

■

2.5.5 Simulation of ECDSA over Binary Field GF($2^m$)

In Example 2.3 of Section 2.5.4, ECDSA over Prime Field GF(23), the order of the elliptic curve group used is 7, which we represent with 3 bits as 111. In practice, the order of the elliptic curve generated over the binary field GF($2^m$) can be up to m bits. At present, most applications adopting ECDSA over GF($2^m$) use m greater than or equal to 163 bits.
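Before scaling up to 163-bit operands, the toy signature of Example 2.3 can be replayed with ordinary `int` arithmetic. The sketch below is only an illustration under stated assumptions: the helper names (`mod_m`, `inv_m`, `ec_add`, `ec_mul`, `ecdsa_sign`, `ecdsa_verify`, `check_example_2_3`) are hypothetical, the modular inverse is found by search, and the message digest e is passed in as a number instead of being computed with SHA. It follows the notation of Example 2.3 (d private key, k per-signature random number).

```c
#include <assert.h>

/* Toy domain parameters of Example 2.3: y^2 = x^3 + x + 1 over GF(23),
   base point G = (13, 7) of order n = 7. Far too small for real use. */
enum { PRIME = 23, CURVE_A = 1, ORDER_N = 7 };

typedef struct { int x, y, inf; } Point;    /* inf != 0 marks the point O */

static int mod_m(int v, int m) { v %= m; return v < 0 ? v + m : v; }

/* Inverse by exhaustive search; adequate only for tiny moduli. */
static int inv_m(int v, int m) {
    for (int i = 1; i < m; i++)
        if (mod_m(v * i, m) == 1) return i;
    return 0;
}

static Point ec_add(Point p, Point q) {
    Point o = {0, 0, 1}, r = {0, 0, 0};
    int s;
    if (p.inf) return q;
    if (q.inf) return p;
    if (p.x == q.x && mod_m(p.y + q.y, PRIME) == 0) return o;   /* P = -Q */
    if (p.x == q.x)                                             /* double */
        s = mod_m((3 * p.x * p.x + CURVE_A) * inv_m(2 * p.y, PRIME), PRIME);
    else                                                        /* add */
        s = mod_m((p.y - q.y) * inv_m(p.x - q.x, PRIME), PRIME);
    r.x = mod_m(s * s - p.x - q.x, PRIME);
    r.y = mod_m(s * (p.x - r.x) - p.y, PRIME);
    return r;
}

/* Scalar multiplication kP by left-to-right double-and-add, 0 <= k < 256. */
static Point ec_mul(int k, Point p) {
    Point s = {0, 0, 1};
    for (int bit = 7; bit >= 0; bit--) {
        s = ec_add(s, s);
        if ((k >> bit) & 1) s = ec_add(s, p);
    }
    return s;
}

/* Returns 0 when r or s comes out as zero, in which case a new k is needed. */
static int ecdsa_sign(int e, int d, int k, Point g, int *r, int *s) {
    Point p = ec_mul(k, g);
    *r = mod_m(p.x, ORDER_N);
    *s = mod_m(inv_m(k, ORDER_N) * (e + d * *r), ORDER_N);
    return !p.inf && *r != 0 && *s != 0;
}

/* Returns 1 (SV = 1) when v = r, otherwise 0. */
static int ecdsa_verify(int e, int r, int s, Point g, Point w) {
    int t  = inv_m(s, ORDER_N);
    int u1 = mod_m(e * t, ORDER_N), u2 = mod_m(r * t, ORDER_N);
    Point x = ec_add(ec_mul(u1, g), ec_mul(u2, w));
    return !x.inf && mod_m(x.x, ORDER_N) == r;
}

/* Replays Example 2.3: d = 4, k = 3, e = 5 gives W = (17, 20), (r, s) = (3, 1). */
void check_example_2_3(void) {
    Point g = {13, 7, 0};
    int r = 0, s = 0;
    Point w = ec_mul(4, g);
    int ok = ecdsa_sign(5, 4, 3, g, &r, &s);
    assert(w.x == 17 && w.y == 20);
    assert(ok && r == 3 && s == 1);
    assert(ecdsa_verify(5, r, s, g, w));
}
```

With a group of order 7 there are very few distinct x-coordinates, so unrelated digests can verify by coincidence; the sketch is meant for following the data flow, not for judging security.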
All operations of ECDSA over GF($2^m$) involve handling of m-bit integers. This means that the elliptic curve coefficients, the points, and the order of the elliptic curve are all m-bit numbers. The question here is how, with integers of 163 or more bits, to compute the elliptic curve operations (e.g., point addition over GF(q), point doubling over GF(q), or scalar point multiplication over GF(q), where q is also of the order of m bits) and the modular arithmetic operations (e.g., multiplication modulo n, inverse modulo n, and square modulo n, where n = $2^m$ or n = prime P) used in the ECDSA key-pair generation, signature generation, and signature verification processes. We do not perform these operations by hand; the computer does them for us, but we have to program it to do so. This section deals with the methods used to program the computer to perform those operations with such big numbers. In ECDSA over GF($2^m$), we use modular arithmetic over both the prime field and the binary field.

Binary Field Arithmetic

In ECDSA, we use binary field GF($2^m$) arithmetic in the elliptic-curve point operations. The following binary field arithmetic functions are used in the implementation of ECDSA over GF($2^m$): gfb_add( ) for addition, gfb_mod( ) for modulo reduction, gfb_sqr( ) for squaring, gfb_mul( ) for multiplication, and gfb_inv( ) for inverse. If $f(x) = x^m + r(x)$ is an irreducible binary (primitive) polynomial of degree m, and if the elements of GF($2^m$) are generated using the primitive polynomial f(x), then the elements of GF($2^m$) are binary polynomials of degree at most m − 1, and we reduce the output of every arithmetic operation on GF($2^m$) elements modulo f(x) to make sure the result belongs to GF($2^m$).
In GF($2^m$), a field element is an m-bit number that can be represented in polynomial form as $a(x) = a_{m-1}x^{m-1} + \cdots + a_2x^2 + a_1x + a_0$ or in vector form as $A = [a_{m-1}a_{m-2} \ldots a_2a_1a_0]$. When implementing arithmetic operations on an embedded processor, we work with 4-bit, 8-bit, 16-bit, or 32-bit words. We do not perform m-bit arithmetic operations bit by bit, as that is the most time-consuming approach. Because we handle GF($2^m$) field elements most of the time as 32-bit words, we represent them as $X = (x[n-1], \ldots, x[2], x[1], x[0])$, where $n = \lceil m/32 \rceil$ and the right-most bit of x[0] is the LSB of the m-bit field element. The left-most t = (32n − m) bits of x[n − 1] are not used and are set to zero. For example, if m = 163, then we have n = 6 words (x[5], x[4], x[3], x[2], x[1], x[0]) in a field element of GF($2^{163}$), with the left-most t = 29 bits of x[5] set to zero. Next, we discuss the simulation techniques for arithmetic operations over the binary field GF($2^{163}$); the same simulation techniques can be used to implement arithmetic operations over other binary fields.

gfb_add( ): Addition of Two Field Elements X[ ] and Y[ ] of GF($2^{163}$)

Among all the binary arithmetic operations, gfb_add( ) is the simplest, and Z[ ], the result of adding two elements X[ ] and Y[ ], is computed by XORing the field elements X[ ] and Y[ ], as seen in Pcode 2.31.

gfb_sqr( ): Squaring of Field Element X[ ] of GF($2^{163}$)

We take a simple example to understand the process of squaring binary field elements. If $b(x) = x^2 + x + 1$, then $b^2(x) = b(x) \cdot b(x) = (x^2 + x + 1)(x^2 + x + 1) = x^4 + x^3 + x^2 + x^3 + x^2 + x + x^2 + x + 1 = x^4 + x^2 + 1$. If we represent b(x) in vector form as B = [111], then $B^2$ = [10101]. So, if we square a binary field element, all the odd-exponent terms become zero and only even-exponent terms remain. In vector form, the bits of the element appear at the even positions of the squared element, with zeros inserted between them.
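The zero-interleaving pattern can be made concrete with a small table that spreads the bits of a byte to even positions. The C sketch below is an assumption-laden illustration: `sqr_tbl`, `build_sqr_tbl`, `sqr_word`, and `check_squaring` are hypothetical names, and the table is generated at run time rather than taken from any published listing. It squares one 32-bit word of a binary polynomial into two 32-bit words, without the modulo reduction by f(x).

```c
#include <assert.h>
#include <stdint.h>

/* 256 entries of 16 bits each: entry v holds the square of the 8-bit
   binary polynomial v, i.e., bit i of v moved to bit 2i. */
static uint16_t sqr_tbl[256];

void build_sqr_tbl(void) {
    for (unsigned v = 0; v < 256; v++) {
        uint16_t s = 0;
        for (unsigned i = 0; i < 8; i++)
            if (v & (1u << i)) s |= (uint16_t)(1u << (2 * i));
        sqr_tbl[v] = s;
    }
}

/* Squares the polynomial held in x; the 64-bit result is returned as
   a low word and a high word. No reduction modulo f(x) is performed. */
void sqr_word(uint32_t x, uint32_t *lo, uint32_t *hi) {
    *lo = sqr_tbl[x & 0xff]         | ((uint32_t)sqr_tbl[(x >> 8) & 0xff] << 16);
    *hi = sqr_tbl[(x >> 16) & 0xff] | ((uint32_t)sqr_tbl[x >> 24] << 16);
}

/* B = [111] squares to [10101], as in the worked example. */
void check_squaring(void) {
    uint32_t lo, hi;
    build_sqr_tbl();
    sqr_word(0x7, &lo, &hi);
    assert(lo == 0x15 && hi == 0);
}
```

Each table entry is 16 bits wide, so the table occupies 512 bytes; squaring a six-word element of GF($2^{163}$) applies `sqr_word` to each word in turn.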
There are two ways to compute the square of larger field elements. In the first approach, we insert the zero bits using shift right, AND, shift left, and OR. Each bit takes four cycles on the reference embedded processor (see Appendix A on the companion website), so 163 bits take about 4 × 163 = 652 cycles. In the second approach, we achieve this squaring in 150 cycles by using a 512-byte look-up table, gfb_sqr_tbl[ ]. This look-up table consists of the squared values of 8-bit elements. The gfb_sqr_tbl[ ] look-up table values can be found on the companion website. First, we unpack the 32-bit words into 8-bit bytes, and then we use the look-up table to get the 16-bit squared equivalent of each 8-bit value. Next, we OR the 16-bit look-up value with the neighboring look-up value shifted left by 16 bits. The simulation code for the look-up table-based squaring of a binary-field element is given in Pcode 2.32.

    for(i = 0;i < 6;i++)
        Z[i] = X[i]^Y[i];

Pcode 2.31: Simulation code for addition of two field elements in GF($2^{163}$).

    j = 0;
    for(i = 0;i < 3;i++) {                           // two input words per iteration
        r0 = x[2*i];            r1 = x[2*i+1];
        r2 = r0 & 0xff;         r3 = r1 & 0xff;      // byte 0
        r4 = gfb_sqr_tbl[r2];   r5 = gfb_sqr_tbl[r3];
        r2 = r0 >> 8;           r3 = r1 >> 8;        // byte 1
        r2 = r2 & 0xff;         r3 = r3 & 0xff;
        r2 = gfb_sqr_tbl[r2];   r3 = gfb_sqr_tbl[r3];
        r2 = r2 << 16;          r3 = r3 << 16;
        r4 = r4 | r2;           r5 = r5 | r3;
        y[0+j] = r4;            y[2+j] = r5;         // low output words
        r2 = r0 >> 16;          r3 = r1 >> 16;       // byte 2
        r2 = r2 & 0xff;         r3 = r3 & 0xff;
        r4 = gfb_sqr_tbl[r2];   r5 = gfb_sqr_tbl[r3];
        r2 = r0 >> 24;          r3 = r1 >> 24;       // byte 3
        r2 = gfb_sqr_tbl[r2];   r3 = gfb_sqr_tbl[r3];
        r2 = r2 << 16;          r3 = r3 << 16;
        r4 = r4 | r2;           r5 = r5 | r3;
        y[1+j] = r4;            y[3+j] = r5;         // high output words
        j+= 4;
    }

Pcode 2.32: Simulation code for squaring binary field element in GF($2^{163}$).

gfb_mod( ): Modulo Reduction with f(x)

In binary field arithmetic, if we square or multiply polynomials of degree m − 1, the degree of the output polynomial is 2m − 2.
If the degree of the output polynomial y(x) of an arithmetic operation is m or more, that is, not less than the degree of the primitive polynomial f(x), then we compute y(x) modulo f(x) to make sure the degree of y(x) is less than m. In the binary field, it is true that $x^i \equiv x^{i-m} r(x) \pmod{f(x)}$ for i ≥ m. If m = 163, then a polynomial y(x) of degree 2m − 2 can be represented with a vector of eleven 32-bit words as Y = (y[10], y[9], . . . , y[2], y[1], y[0]). If f(x) is a trinomial or a pentanomial with middle terms close to each other, then the reduction of y(x) modulo f(x) can be efficiently performed one 32-bit word at a time. For example, if $f(x) = x^{163} + x^7 + x^6 + x^3 + 1$, then we can compute the modulo reduction for y[9] (bits 288 to 319 of Y) as follows:

$x^{288} \equiv x^{132} + x^{131} + x^{128} + x^{125} \pmod{f(x)}$
$x^{289} \equiv x^{133} + x^{132} + x^{129} + x^{126} \pmod{f(x)}$
...
$x^{318} \equiv x^{162} + x^{161} + x^{158} + x^{155} \pmod{f(x)}$
$x^{319} \equiv x^{163} + x^{162} + x^{159} + x^{156} \pmod{f(x)}$

By observing the previous congruences, the reduction of y[9] can be performed by adding y[9] four times to Y, with the zeroth LSB of y[9] added to bits 132, 131, 128, and 125 of Y, the first LSB of y[9] added to bits 133, 132, 129, and 126 of Y, and so on. Finally, the MSB of y[9] is added to bits 163, 162, 159, and 156 of Y. In this way, we eliminate y[10], y[9], y[8], y[7], y[6], and y[5] (except its three LSBs) of Y. The simulation code for modulo reduction over the binary field GF($2^{163}$) is given in Pcode 2.33.

gfb_mul( ): Multiplication of Two Field Elements of GF($2^{163}$)

In GF($2^{163}$), the multiplication of two binary field elements is efficiently carried out by using a precompute window method. To better understand this way of multiplying two field elements A and B of GF($2^{163}$) by precomputing, we first work with a simple example. A = [11010] and B = [10011] are vector representations of two polynomials of degree m = 4.
If we precompute the products of vector B with all first-degree polynomial combinations P = ([11], [10], [01], [00]), we have B' = B · P = [b3', b2', b1', b0'] = ([110101], [100110], [010011], [000000]). Now C = A · B is obtained by dividing A into three 2-bit blocks [a2 a1 a0] = [01 10 10] (here a zero is appended on the MSB side of the last block to make it a 2-bit block) and using the precomputed B' as

    C = [000000000],              c2 = a2 · B = [01] · B = b1' = [010011]
    C = C + c2 = [000000000] ⊕ [010011] = [000010011]
    C = C << 2 = [001001100],     c1 = a1 · B = [10] · B = b2' = [100110]
    C = C + c1 = [001001100] ⊕ [100110] = [001101010]
    C = C << 2 = [110101000],     c0 = a0 · B = [10] · B = b2' = [100110]
    C = C + c0 = [110101000] ⊕ [100110] = [110001110]

The previous window method (with window size w = 2) involves two left shifts, three loads, and three additions. If we increase the window size w to 3, then we will have one left shift, two loads, and two additions. From this, we can say that the number of left shifts and the number of additions required in multiplying two field elements reduce as the window size w increases. In this analysis, we did not include the overhead of precomputing, and this overhead also increases with the window size w.

    j = 0;
    for(i = 10;i > 5;i--){                 // eliminate y[10] down to y[6]
        r0 = y[i];
        r1 = r0 << 29;   r2 = r0 << 4;
        y[i-6] = y[i-6]^r1;
        r1 = r0<<3;      r1 = r1 ^ r2;
        r2 = r0 >> 3;    r1 = r1 ^ r2;
        r2 = r0 >> 28;   r1 = r1 ^ r0;
        r3 = r0 >> 29;
        y[i-5] = y[i-5]^r1;
        r1 = r2^r3;
        y[i-4] = y[i-4]^r1;
    }
    r4 = 0xfffffff8; r5 = 0x00000007;
    r0 = y[5] & r4;                        // upper 29 bits of y[5]
    r2 = r0 << 4;    r3 = r0 << 3;
    r1 = r2 ^ r3;    r2 = r0 >> 3;
    r1 = r1 ^ r2;    r2 = r0 >> 28;
    r1 = r1 ^ r0;    r3 = r0 >> 29;
    y[0] = y[0] ^ r1;
    r1 = r2 ^ r3;
    y[1] = y[1] ^ r1;
    y[5] = y[5] & r5;                      // keep the three LSBs of y[5]
    z[0] = y[0]; z[1] = y[1]; z[2] = y[2];
    z[3] = y[3]; z[4] = y[4]; z[5] = y[5];

Pcode 2.33: Simulation code for modulo reduction over binary field GF($2^{163}$).
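The 2-bit window example can be replayed in C for short operands. This is a minimal sketch under stated assumptions: `gf2_mul_window2` and `check_window_example` are hypothetical names, the operands are limited to 16 bits so that the product fits in one 32-bit word, and no reduction modulo f(x) is performed.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplies the binary polynomials a and b (at most 16 bits each)
   using a 2-bit window over a, most significant block first. */
uint32_t gf2_mul_window2(uint16_t a, uint16_t b) {
    uint32_t pre[4];            /* pre[u] = b * u for the 2-bit polynomials u */
    uint32_t c = 0;
    pre[0] = 0;
    pre[1] = b;
    pre[2] = (uint32_t)b << 1;
    pre[3] = pre[2] ^ pre[1];
    for (int blk = 7; blk >= 0; blk--) {
        c <<= 2;                              /* C = C * x^2 */
        c ^= pre[(a >> (2 * blk)) & 3];       /* C = C + a_blk * B */
    }
    return c;
}

/* A = [11010], B = [10011] gives C = [110001110]. */
void check_window_example(void) {
    assert(gf2_mul_window2(0x1A, 0x13) == 0x18E);
}
```

Shifting before each accumulation is harmless while C is still zero, and it removes the need to treat the last block specially; a table-driven version for 163-bit operands stores each precomputed product as several words instead of one.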
In GF($2^{163}$), the multiplication of two binary field elements A and B is efficiently carried out by using precomputed products of one of the field elements with all third-degree polynomial combinations (that is, w = 4). We use field element B in the precomputed multiplication with third-degree polynomials, and the bits of element A for loading the precomputed values. For this, we divide A into 4-bit blocks as (MSB) 4|4|4| . . . |4|4|4 (LSB) and start the multiplication process from the MSB 4-bit block. Here the field elements are 163 bits in length and we work in terms of 32-bit words. There are six 32-bit words in one field element, with some appended MSB zero bits in the last 32-bit word. Multiplication of two field elements is carried out using two nested loops. The outer loop runs eight times to cover the eight 4-bit blocks of a 32-bit word of A, and the inner loop runs six times to cover all 32-bit words of A. The output C of the multiplication contains a total of eleven 32-bit words. Before the start of multiplication we initialize C with zeros.
In the inner loop, for each of the six 32-bit words of A, we get the product of a 4-bit block (of A, one 4-bit block per outer loop iteration, starting from the MSB side) with B and add it to C. In each outer loop iteration, we multiply the output after the current iteration by $x^4$ (i.e., shift C left by 4 bits). The simulation code for the window-based multiplication process is given in Pcode 2.34.

    for(i = 0;i < 12;i++) Tmp[i] = 0;                    // C = 0
    for(j = 7;j >= 0;j--){
        k = j<<2; r1 = 0;
        for(i = 0;i < 6;i++){
            r0 = a[i]; r0 = r0 >> k; r2 = r0 & 0xf; r2 = r2*6;
            Tmp[r1+0] = Tmp[r1+0]^Bu[r2++];              // modulo 2 additions
            Tmp[r1+1] = Tmp[r1+1]^Bu[r2++];
            Tmp[r1+2] = Tmp[r1+2]^Bu[r2++];
            Tmp[r1+3] = Tmp[r1+3]^Bu[r2++];
            Tmp[r1+4] = Tmp[r1+4]^Bu[r2++];
            Tmp[r1+5] = Tmp[r1+5]^Bu[r2++];
            r1+=1;
        }
        if (j != 0){
            r0 = Tmp[0]; r1 = Tmp[1];                    // left shift by w bits, or C = C.x^w
            r6 = r0 >> 28; r7 = r1 >> 28; r0 = r0 << 4; r1 = r1 << 4;
            r1 = r1 | r6; Tmp[0] = r0; r0 = Tmp[2]; Tmp[1] = r1;
            r6 = r0 >> 28; r0 = r0 << 4; r0 = r0 | r7; r1 = Tmp[3];
            r7 = r1 >> 28; r1 = r1 << 4; r1 = r1 | r6;
            Tmp[2] = r0; r0 = Tmp[4]; Tmp[3] = r1;
            r6 = r0 >> 28; r0 = r0 << 4; r0 = r0 | r7; r1 = Tmp[5];
            r7 = r1 >> 28; r1 = r1 << 4; r1 = r1 | r6;
            Tmp[4] = r0; r0 = Tmp[6]; Tmp[5] = r1;
            r6 = r0 >> 28; r0 = r0 << 4; r0 = r0 | r7; r1 = Tmp[7];
            r7 = r1 >> 28; r1 = r1 << 4; r1 = r1 | r6;
            Tmp[6] = r0; r0 = Tmp[8]; Tmp[7] = r1;
            r6 = r0 >> 28; r0 = r0 << 4; r0 = r0 | r7; r1 = Tmp[9];
            r7 = r1 >> 28; r1 = r1 << 4; r1 = r1 | r6;
            Tmp[8] = r0; r0 = Tmp[10]; Tmp[9] = r1;
            r6 = r0 >> 28; r0 = r0 << 4; r0 = r0 | r7; r1 = Tmp[11];
            r1 = r1 << 4; Tmp[10] = r0; r1 = r1 | r6; Tmp[11] = r1;
        }
    }

Pcode 2.34: Simulation code for window based multiplication of GF($2^{163}$) field elements.

gfb_inv( ): Inverse of Field Element of GF($2^{163}$) Modulo f(x)

In the binary field GF($2^m$), one way of computing the inverse of a field element is by exponentiation of the field element to the power $2^m - 2$ (i.e., $1/\alpha = \alpha^{2^m - 2}$).
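The identity $1/\alpha = \alpha^{2^m - 2}$ can be checked exhaustively on a small field. The sketch below assumes the toy field GF($2^4$) with $f(x) = x^4 + x + 1$ from Example 2.2; `gf16_mul`, `gf16_inv_exp`, and `check_inverse` are hypothetical names. It uses the plain decomposition $2^m - 2 = 2 + 4 + \cdots + 2^{m-1}$, which costs m − 1 squarings and m − 1 multiplications, and is not meant as an optimized exponentiation schedule.

```c
#include <assert.h>

enum { M = 4, FIELD_POLY = 0x13 };   /* GF(2^4), f(x) = x^4 + x + 1 */

/* Shift-and-add multiplication in GF(2^4) with reduction by f(x). */
unsigned gf16_mul(unsigned a, unsigned b) {
    unsigned r = 0;
    while (b) {
        if (b & 1) r ^= a;
        b >>= 1;
        a <<= 1;
        if (a & (1u << M)) a ^= FIELD_POLY;
    }
    return r;
}

/* a^(2^M - 2) = a^2 * a^4 * ... * a^(2^(M-1)); returns 0 for a = 0. */
unsigned gf16_inv_exp(unsigned a) {
    unsigned sq = a, r = 1;
    for (int i = 1; i < M; i++) {
        sq = gf16_mul(sq, sq);       /* sq = a^(2^i) */
        r  = gf16_mul(r, sq);        /* accumulate the product */
    }
    return r;
}

/* Every nonzero element times its computed inverse must give 1. */
void check_inverse(void) {
    for (unsigned a = 1; a < (1u << M); a++)
        assert(gf16_mul(a, gf16_inv_exp(a)) == 1);
}
```

For m = 163 the same decomposition would need 162 multiplications, which is why a schedule that needs far fewer multiplications is attractive in practice.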
The following method is used to compute the inverse by the exponentiation process. Let $m - 1 = b_r b_{r-1} \ldots b_1 b_0$ be the binary representation of m − 1, where the most significant bit $b_r$ of m − 1 is 1.

    Set β = α and k = 1
    For i = r − 1 : 0
        γ = β
        For j = 1 : k
            γ = γ^2
        End
        β = βγ and k = 2k
        If b_i = 1, then set β = β^2 α and k = k + 1
    End
    Output β^2

The simulation code for computing Y, the inverse of the field element X = (x[5], x[4], x[3], x[2], x[1], x[0]) in GF($2^{163}$), is given in Pcode 2.35.

Prime Field Arithmetic

In ECDSA over GF($2^m$), we also use prime field GF(P) arithmetic (where P is a large prime number represented with 163 bits) in the signature generation and signature verification processes. The following prime field arithmetic functions are used in the implementation of the ECDSA key-pair generation, signature generation, and verification operations: gfp_add( ) for addition, gfp_mod( ) for modulo reduction, gfp_mul( ) for multiplication, and gfp_inv( ) for inverse. The prime field arithmetic operations are similar to normal integer arithmetic, and the only extra operation present in prime field arithmetic is the computation of the output of each arithmetic operation modulo P. If the GF(P) field element A is 163 bits in size, then it is represented with six 32-bit words as $A = (a_5, a_4, a_3, a_2, a_1, a_0)$. In other words, a field element of GF(P) can be represented with a 5th-degree polynomial whose coefficients are 32-bit words. As the register precision of most embedded processors is limited to 32 bits, and the multiplication or addition of two 32-bit numbers results in more than 32 bits, we perform field element arithmetic operations by representing the field elements with either 16-bit words or 8-bit bytes. In this section, we simulate the prime field arithmetic by assuming P is a 163-bit number and representing such GF(P) field elements as polynomials with six 32-bit coefficients, eleven 16-bit coefficients, or twenty-one 8-bit (byte) coefficients.
gfp_add( ): Addition of GF(P) Field Elements

The addition of two field elements of GF(P) is carried out by converting the 32-bit coefficients of the field elements to 16-bit coefficients, as given in Pcode 2.36. Here, we perform the addition of two 11th-degree polynomials with 16-bit coefficients, and the result is also an 11th-degree polynomial with 16-bit coefficients. After addition, the result is converted back to a 5th-degree polynomial by merging the 16-bit coefficients into 32-bit coefficients.

    r0 = 162; j = 7; k = 1;
    for (i = 0;i < 6;i++){
        T1[i] = x[i]; T3[i] = x[i];
    }
    for (j = 6;j >= 0;j--){
        T2[0] = T1[0]; T2[1] = T1[1]; T2[2] = T1[2];
        T2[3] = T1[3]; T2[4] = T1[4]; T2[5] = T1[5];
        for(i = 0;i < k;i++)
            gfb_sqr(T2,T2);         // T2^2 -> T2
        k = k << 1;
        gfb_mul(T1,T2,T1);          // T1 x T2 -> T1
        r1 = r0 >> j; r1 = r1 & 1;
        if (r1 == 1){
            gfb_sqr(T1,T1);         // T1^2 -> T1
            gfb_mul(T1,T3,T1);      // T1 * x -> T1
            k = k+1;
        }
    }
    gfb_sqr(T1,T2);                 // T1^2 -> T2
    y[0] = T2[0]; y[1] = T2[1]; y[2] = T2[2];
    y[3] = T2[3]; y[4] = T2[4]; y[5] = T2[5];

Pcode 2.35: Simulation code for computing inverse of field element in GF($2^{163}$).

    r4 = 0;                                      // carry between words
    for(i = 0;i < 6;i++){
        r0 = x[i]; r1 = y[i];
        r2 = r0 & 0xffff; r3 = r1 & 0xffff;      // low 16-bit halves
        r3 = r2 + r3 + r4;
        r0 = r0 >> 16; r1 = r1 >> 16;            // high 16-bit halves
        r2 = r3 & 0xffff; r3 = r3 >> 16;
        r0 = r0 + r1 + r3;
        r1 = r0 & 0xffff; r4 = r0 >> 16;
        r1 = r1 << 16; r2 = r2 | r1;
        z[i] = r2;
    }

Pcode 2.36: Simulation code for addition of two field elements in GF(P).
    j = 0; k = 0;
    for(i = 0;i < 6;i++){   // convert from 32-bit word coefficients to 8-bit byte coefficients
        r0 = x[i]; r1 = r0 & 0xff; r2 = r0 >> 8; a[j++] = r1;
        r1 = r2 & 0xff; a[j++] = r1; r2 = r0 >> 16; r1 = r2 & 0xff;
        r2 = r0 >> 24; a[j++] = r1; a[j++] = r2;
        r0 = y[i]; r1 = r0 & 0xff; r2 = r0 >> 8; b[k++] = r1;
        r1 = r2 & 0xff; b[k++] = r1; r2 = r0 >> 16; r1 = r2 & 0xff;
        r2 = r0 >> 24; b[k++] = r1; b[k++] = r2;
    }
    r0 = 0;
    for(i = 0;i < 24;i++){  // compute c0 to c23
        k = i;
        for(j = 0;j <= i;j++)
            r0 = r0 + a[k--]*b[j];
        r1 = r0 & 0xff; c[i] = r1; r0 = r0 >> 8;
    }
    for(i = 24;i < 47;i++){ // compute c24 to c46
        k = 23;
        for(j = i-23;j < 24;j++)
            r0 = r0 + a[k--]*b[j];
        r1 = r0 & 0xff; c[i] = r1; r0 = r0 >> 8;
    }
    c[47] = r0;             // final carry; zero when the operands fit in 21 bytes
    for(i = 0;i < 12;i++){  // convert 8-bit byte coefficients to 32-bit word coefficients
        j = i<<2;
        r0 = c[j]; r1 = c[j+1]; r1 = r1 << 8; r2 = c[j+2];
        r0 = r0 | r1; r2 = r2 << 16; r0 = r0 | r2;
        r1 = c[j+3]; r1 = r1 << 24; r0 = r0 | r1;
        z[i] = r0;
    }

Pcode 2.37: Simulation code for multiplication of two prime field GF(P) elements.

gfp_mul( ): Multiplication of Two Prime Field Elements

The multiplication of two field elements of GF(P) is carried out by converting their default 32-bit polynomial coefficients to either 16-bit or 8-bit coefficients (as the multiplication of two 32-bit coefficients needs up to 64 bits of precision to hold the product, and 32-bit embedded processor registers support only 32 bits of precision). In the simulation, we represent the 163-bit field elements of GF(P) as polynomials with 8-bit coefficients, as given in Pcode 2.37, to perform the multiplication operation, and we have 21 such 8-bit coefficients in a 163-bit GF(P) field element polynomial. The simulation code supports the following 24-coefficient (or 23rd-degree) polynomial multiplication, and it can also be used to perform 21-coefficient polynomial multiplication.
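The byte-coefficient multiplication can be tried on short operands, where the result is easy to compare against a native 64-bit product. The sketch below is an illustration under assumptions: `NBYTES`, `mul_bytes`, and `check_mul_bytes` are hypothetical names, operands are four little-endian bytes instead of 24, and the column sums are accumulated in one 32-bit variable together with the carry from the previous column.

```c
#include <assert.h>
#include <stdint.h>

enum { NBYTES = 4 };   /* toy operand length; the simulation code uses 24 */

/* c[0 .. 2*NBYTES-1] = a * b, all arrays little-endian byte strings. */
void mul_bytes(const uint8_t a[NBYTES], const uint8_t b[NBYTES],
               uint8_t c[2 * NBYTES]) {
    uint32_t acc = 0;                       /* column sum plus incoming carry */
    for (int i = 0; i < 2 * NBYTES - 1; i++) {
        int lo = i < NBYTES ? 0 : i - NBYTES + 1;
        int hi = i < NBYTES ? i : NBYTES - 1;
        for (int j = lo; j <= hi; j++)      /* c_i = sum of a[i-j] * b[j] */
            acc += (uint32_t)a[i - j] * b[j];
        c[i] = (uint8_t)(acc & 0xff);       /* keep one byte */
        acc >>= 8;                          /* carry into the next column */
    }
    c[2 * NBYTES - 1] = (uint8_t)acc;       /* final carry is the top byte */
}

/* Compares the byte-wise product with a native 64-bit multiplication. */
void check_mul_bytes(void) {
    const uint32_t x = 0xFFFFFFFFu, y = 0x12345678u;
    uint8_t a[NBYTES], b[NBYTES], c[2 * NBYTES];
    uint64_t got = 0;
    for (int i = 0; i < NBYTES; i++) {
        a[i] = (uint8_t)(x >> (8 * i));
        b[i] = (uint8_t)(y >> (8 * i));
    }
    mul_bytes(a, b, c);
    for (int i = 2 * NBYTES - 1; i >= 0; i--)
        got = (got << 8) | c[i];
    assert(got == (uint64_t)x * y);
}
```

With 24 byte coefficients a column sum is at most 24 × 255 × 255 plus a carry, which still fits comfortably in a 32-bit accumulator, so the same structure scales to the operand size used in the text.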
Let the two polynomials $A(x) = a_{23}x^{23} + a_{22}x^{22} + \cdots + a_2x^2 + a_1x + a_0$ and $B(x) = b_{23}x^{23} + b_{22}x^{22} + \cdots + b_2x^2 + b_1x + b_0$ represent two GF(P) field elements. If $C(x) = A(x) \cdot B(x)$, then the coefficients of C(x) are given by

$c_0 = a_0b_0$
$c_1 = a_0b_1 + a_1b_0$
$c_2 = a_0b_2 + a_1b_1 + a_2b_0$
...
$c_{23} = a_0b_{23} + a_1b_{22} + \cdots + a_{21}b_2 + a_{22}b_1 + a_{23}b_0$
$c_{24} = a_1b_{23} + a_2b_{22} + \cdots + a_{22}b_2 + a_{23}b_1$
...
$c_{45} = a_{22}b_{23} + a_{23}b_{22}$
$c_{46} = a_{23}b_{23}$

C(x) is a 46th-degree polynomial with 47 coefficients. We perform modulo reduction on C(x) to make sure that the result of the multiplication A(x) · B(x) belongs to GF(P). That means we reduce C(x) to a 21-coefficient polynomial by performing modulo reduction.

gfp_mod( ): Modulo Reduction of Polynomials over Prime Field GF(P)

There are multiple modulo reduction algorithms in the literature, and in this section we discuss a straightforward, simple method (also called classical modular reduction). In GF(P), we perform modulo reduction for a polynomial C(x) with a degree greater than or equal to 20 (since P is 163 bits in size and is represented as a 20th-degree polynomial p(x) with 21 byte coefficients) to keep the result of the arithmetic in GF(P). The remainder of the division of C(x) by p(x) is taken as the modulo reduction of C(x). In this classical reduction method, we work on bytes (i.e., the radix is $b = 2^8$). If C(x) is a polynomial of degree m − 1, which is greater than or equal to 20, then we perform the modulo reduction of C(x) as follows (we normalize both C(x) and p(x) such that $c_{m-1} \geq b/2$ to speed up the modular operation). We reduce C(x) byte by byte, iteratively from the MSB side, by first computing a coarse estimate of the quotient, followed by a fine estimate of the quotient. To find a coarse estimate of the quotient (q) of the division of C (dividend) by P (divisor), we divide the two leftmost bytes of C by the one leftmost byte of P.
Then we check whether the estimated quotient is correct or needs to be adjusted (fine estimation) by subtracting the product of q and the two leftmost bytes of P from the three leftmost bytes of C. If the result is not negative, we subtract q · P from C; otherwise we reduce the quotient by one and repeat the fine estimation process. This modulo reduction process eliminates one leftmost byte of C(x) at a time, and it continues until the degree of C(x) falls below 20. If the dividend C(x) contains M digits and the divisor p(x) contains L digits, then modulo reduction of C with respect to P is done in M + L − 1 steps. Simulation code for modular reduction over the prime field is given in Pcode 2.38.

gfp_inv( ): Inverse of Element in Prime Field GF(P)

If B is an element of the prime field GF(P), then C, the inverse of field element B, is computed by direct exponentiation as $C = B^{-1} = B^{P-2}$. Finding the inverse of an element in prime fields with straightforward exponentiation is costly because it involves squarings and multiplications of field elements with modulo reduction. Although there are many algorithms for finding the inverse of a prime field element, we choose the exponentiation method because it can be efficiently implemented on an embedded processor with the Montgomery multiplication operation, MonMul( ). Montgomery multiplication of a and b is defined as $\mathrm{MonMul}(a, b) = a \cdot b \cdot r^{-1} \pmod{P}$, where a and b are less than P and GCD(r, P) = 1.
Even though the algorithm works for any r that is relatively prime to P, it is more useful when r is taken to be a power of 2. In this case, the Montgomery algorithm performs divisions by a power of 2, which is an intrinsically fast operation on an embedded processor. Let P be a k-bit prime integer in the range $2^{k-1} \le P < 2^k$ and $r = 2^k$ (here GCD(P, r) = 1, as P is a prime number and less than r). To describe the Montgomery multiplication algorithm, we first define the P-residue of a (where a < P) as $a' = a \cdot r \pmod{P}$. Then $c' = \mathrm{MonMul}(a', b') = a' \cdot b' \cdot r^{-1} \pmod{P}$, which is the P-residue of $c = a \cdot b$ since $a \cdot r \cdot b \cdot r \cdot r^{-1} \pmod{P} = a \cdot b \cdot r \pmod{P} = c \cdot r \pmod{P}$. To describe the Montgomery modulo reduction we need another element P' such that $r \cdot r^{-1} - P \cdot P' = 1$, where $r^{-1}$ and P' are precomputed using the extended Euclidean algorithm. Let Q = P − 2; then $C = B^Q$ is computed as

    Q = [q_n q_(n-1) . . . q_2 q_1 q_0], binary value of Q with q_n = 1
    B' = B · r (mod P)
    C' = B'
    for i = n − 1:0
        C' = MonMul(C', C')
        if (q_i = 1), then C' = MonMul(C', B')
    end
    output MonMul(C', 1) as C

The product MonMul(a', b') with Montgomery modulo reduction is computed as

    g = a' · b'
    h = (g + (g · P' mod r) · P)/r
    if h ≥ P, then output h − P, else output h.

The simulation code for the Montgomery multiplication algorithm is given in Pcode 2.39.

    p = k;  // k = m-n, where m and n are the degrees of A (= C(x)) and B (= P(x))
    for(i = 0;i < p;i++){               // eliminates one MSB byte of A per iteration
        r1 = a[m]; r2 = a[m-1];
        r1 = r1 << 8; r1 = r1 | r2;     // two leftmost bytes of A (needed in both branches below)
        if(a[m] == b[n]) r0 = 255;      // q
        else r0 = r1/b[n];              // q: coarse estimate of quotient
        r1 = r1 << 8; r2 = a[m-2]; r1 = r1 | r2;
        r2 = b[n]; r2 = r2 << 8; r3 = b[n-1]; r2 = r2 | r3;
        while ((r0*r2) > r1) r0 = r0 - 1;   // q: fine estimate of quotient
        r2 = 0;
        for(j = 0;j <= n;j++) {         // c[n] = q.b[n]
            r1 = b[j]; r3 = r1 * r0; r1 = r3 + r2;
            r3 = r1 & 0xff; r2 = r1 >> 8; c[j] = r3;
        }
        c[n + 1] = r2; tmp2 = 0;
        for(j = 0;j <= (n+1);j++){      // x[m:m-n] - c[n+1] -> x[m:m-n]
            tmp1 = a[k+j-1] - c[j] + tmp2;
            a[k+j-1] = tmp1; tmp2 = tmp1>>8;
        }
        if (tmp2 != 0){                 // if x[m:m-n-1] < 0, then add b[n]
            tmp2 = 0;
            for(j = 0;j <= n;j++){      // x[m:m-n-1] + b[n]
                tmp1 = a[k+j-1] + b[j] + tmp2;
                a[k+j-1] = tmp1; tmp2 = tmp1>>8;
            }
        }
        m = m-1; k = k-1;
    }

Pcode 2.38: Simulation code for classical modulo reduction over prime field GF(P).

    gfp_mul(x,y,muly);                  // a'.b' -> t1, here both a',b' are in Montgomery domain (i.e., P-residues)
    T1[0] = muly[0]; T1[1] = muly[1];   // t1 mod r -> t'
    T1[2] = muly[2]; T1[3] = muly[3];
    T1[4] = muly[4]; T1[5] = muly[5] & 7;
    gfp_mul(T1,n_dash,mulx);            // t'.n' -> tmp
    T2[0] = mulx[0]; T2[1] = mulx[1];   // tmp mod r -> t2
    T2[2] = mulx[2]; T2[3] = mulx[3];
    T2[4] = mulx[4]; T2[5] = mulx[5] & 7;
    gfp_mul(T2,modulous,mulx);          // t2.n -> t2
    r0 = 0;
    for (i = 0;i < 12;i++){             // t1 + t2 -> t1
        r1 = muly[i]; r2 = mulx[i];
        r3 = r1 & 0xffff; r4 = r2 & 0xffff;
        r3 = r3 + r4 + r0; r0 = r3 >> 16; r5 = r3 & 0xffff;
        r3 = r1 >> 16; r4 = r2 >> 16;
        r3 = r3 + r4 + r0; r0 = r3 >> 16; r4 = r3 & 0xffff;
        r4 = r4 << 16; r5 = r4 + r5; muly[i] = r5;
    }
    for (i = 5;i < 11;i++){             // t1/r -> u
        r0 = muly[i]; r1 = muly[i+1];
        r2 = r0 >> 3; r3 = r1 << 29; r2 = r2 | r3;
        z[i-5] = r2;
    }
    j = 1;
    for (i = 5;i >= 0;i--) {            // check whether u >= n or not
        if (z[i] == modulous[i]) continue;
        else if (z[i] < modulous[i]){j = 0; break; }
        else break;
    }
    if (j){                             // if u >= n, then output u - n
        r6 = 0;
        for(i = 0;i < 6;i++){
            r0 = z[i]; r1 = modulous[i];
            tmp0 = r0 & 0xffff; tmp1 = r1 & 0xffff;
            r7 = tmp0 - tmp1 + r6; r6 = r7 >> 31; r2 = r7 & 0xffff;
            r0 = r0 >> 16; r1 = r1 >> 16; tmp0 = r0; tmp1 = r1;
            r7 = tmp0 - tmp1 + r6; r6 = r7 >> 31; r3 = r7 & 0xffff;
            r3 = r3 << 16; r2 = r3 | r2; z[i] = r2;
        }
    }

Pcode 2.39: Simulation code for Montgomery multiplication.
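The Montgomery product and the exponentiation-based inverse described above can be tried on single-word operands. The following is a minimal sketch, not the book's multiword routine: the function names are hypothetical, r is fixed at $2^{32}$ rather than $2^k$ (any r coprime to P works, as noted above), P is assumed to be an odd prime below $2^{31}$ so that all intermediates fit in 64 bits, and P' is obtained by Newton iteration instead of the extended Euclidean algorithm.

```c
#include <assert.h>
#include <stdint.h>

/* P' with P * P' = -1 (mod 2^32), found by Newton iteration (assumption:
   used here in place of the extended Euclidean algorithm). */
uint32_t mont_pdash(uint32_t p)
{
    uint32_t inv = p;                    /* p*p = 1 (mod 8) for odd p: 3 correct bits */
    assert((p & 1u) && p >= 3 && p < 0x80000000u);
    for (int i = 0; i < 5; i++)
        inv *= 2u - p * inv;             /* each step doubles the number of correct bits */
    return 0u - inv;                     /* negate modulo 2^32 */
}

/* P-residue a' = a * r (mod P), with r = 2^32. */
uint32_t to_mont(uint32_t a, uint32_t p)
{
    return (uint32_t)((((uint64_t)a) << 32) % p);
}

/* MonMul(a', b') = a' * b' * r^-1 (mod P). */
uint32_t mon_mul(uint32_t a, uint32_t b, uint32_t p, uint32_t pdash)
{
    uint64_t g = (uint64_t)a * b;               /* g = a' * b' */
    uint32_t m = (uint32_t)g * pdash;           /* (g * P') mod r */
    uint64_t h = (g + (uint64_t)m * p) >> 32;   /* exact division by r */
    return (uint32_t)(h >= p ? h - p : h);
}

/* C = B^(P-2) (mod P) by square-and-multiply in the Montgomery domain. */
uint32_t mont_inverse(uint32_t b, uint32_t p)
{
    uint32_t pdash = mont_pdash(p);
    uint32_t q = p - 2;                  /* exponent Q = P - 2 */
    uint32_t bm = to_mont(b % p, p);     /* B' */
    uint32_t cm = bm;                    /* C' = B' accounts for the top bit q_n */
    int n = 31;
    while (((q >> n) & 1u) == 0)
        n--;                             /* n = index of the top set bit of Q */
    for (int i = n - 1; i >= 0; i--) {
        cm = mon_mul(cm, cm, p, pdash);
        if ((q >> i) & 1u)
            cm = mon_mul(cm, bm, p, pdash);
    }
    return mon_mul(cm, 1u, p, pdash);    /* leave the Montgomery domain */
}
```

For example, with P = 97 the product `(b * mont_inverse(b, 97)) % 97` should be 1 for every b from 1 to 96.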
Representation of Elliptic Curve Points

In ECDSA signature generation and verification processes (see Section 2.5.4, Signature Generation, and Section 2.5.4, Signature Verification), we need to compute a modulo inverse, which is very costly in terms of computations (one inverse is almost 30 to 50 times more costly than a multiplication in terms of computational complexity). To avoid these modulo inverse operations, we convert the affine coordinates (X, Y) of elliptic curve points to projective coordinates $(X^*, Y^*, Z^*)$, so that the denominator part of the operations is carried by $Z^*$. At the end, we convert back from projective coordinates $(X^*, Y^*, Z^*)$ to affine coordinates (X, Y). There is more than one kind of projective coordinate; here are two popular projective coordinate representations.

Standard projective coordinates:
    Affine to projective conversion: $(X^*, Y^*, Z^*) = (X, Y, 1)$
    Projective to affine conversion: $(X, Y) = \left(\dfrac{X^*}{Z^{*2}}, \dfrac{Y^*}{Z^{*3}}\right)$

Modified Jacobian coordinates:
    Affine to projective conversion: $(X^*, Y^*, Z^*) = (X, Y, 1)$
    Projective to affine conversion: $(X, Y) = \left(\dfrac{X^*}{Z^*}, \dfrac{Y^*}{Z^{*2}}\right)$

As the conversion from projective to affine coordinates also involves a modulo inverse computation, it is not a good idea to use projective coordinates for a single simple operation such as a point double or a points addition. However, using projective coordinates for scalar point multiplication speeds up the process considerably, as it involves many point double and addition operations.

Simulation of Elliptic Curve Operations in GF($2^m$)

In the simulation, we use projective coordinates for all three elliptic curve operations, namely, points addition, point double, and scalar point multiplication. For point double and two-points addition, methods using both standard projective coordinates and modified Jacobian coordinates are discussed.
Addition of E($2^m$) Curve Points

Given two elliptic curve points P: $(X_p, Y_p, Z_p)$ and Q: $(X_q, Y_q, Z_q)$ in projective coordinates, the projective coordinates of point R: $(X_r, Y_r, Z_r)$, which is the result of the addition of the two points P and Q (i.e., R = P + Q), are obtained with EccPointsAdd( ) as follows.

With standard projective coordinates:

$X_r = a\,[Z_p(X_pZ_q^2 + X_qZ_p^2)Z_q]^2 + [Y_pZ_q^3 + Y_qZ_p^3 + Z_p(X_pZ_q^2 + X_qZ_p^2)Z_q]\,[Y_pZ_q^3 + Y_qZ_p^3] + [X_pZ_q^2 + X_qZ_p^2]^3$

$Y_r = [Y_pZ_q^3 + Y_qZ_p^3 + Z_p(X_pZ_q^2 + X_qZ_p^2)Z_q]\,X_r + [(Y_pZ_q^3 + Y_qZ_p^3)X_q + Z_p(X_pZ_q^2 + X_qZ_p^2)Y_q]\,[Z_p(X_pZ_q^2 + X_qZ_p^2)]^2$

$Z_r = Z_p(X_pZ_q^2 + X_qZ_p^2)Z_q$

With modified Jacobian coordinates:

$Z_r = [Z_p(X_qZ_p + X_p)]^2$

$X_r = Z_p(X_qZ_p + X_p)(Y_qZ_p^2 + Y_p) + (Y_qZ_p^2 + Y_p)^2 + [(X_qZ_p + X_p)Z_p + aZ_p^2](X_qZ_p + X_p)^2$

$Y_r = Z_p(X_qZ_p + X_p)(Y_qZ_p^2 + Y_p)\,(Z_rX_q + X_r) + Z_r(Z_rY_q + X_r)$

The simulation code for two elliptic curve points addition with standard projective coordinates is given in Pcode 2.40, and with modified Jacobian coordinates in Pcode 2.41.
Doubling of E($2^m$) Curve Points

Given an elliptic curve point P: $(X_p, Y_p, Z_p)$ in projective coordinates, the projective coordinates of point R: $(X_r, Y_r, Z_r)$, the result of doubling the point P (i.e., R = 2P), are obtained with EccPointDouble( ) as follows.

With standard projective coordinates:

$Z_r = X_pZ_p^2$

$X_r = (X_p + c \cdot Z_p^2)^4$, where $c = b^{2^{m-2}}$

$Y_r = X_p^4Z_r + (Z_r + X_p^2 + Y_pZ_p)X_r$

    r0 = r2 = 0;
    for(i = 0;i < 6;i++){   // Px -> T1, Py -> T2, Pz -> T3, Qx -> T4, Qy -> T5, Qz -> T6, a -> T9
        T1[i] = x[i]; T2[i] = x[6 + i]; T3[i] = x[12 + i];
        T4[i] = y[i]; T5[i] = y[6 + i]; T6[i] = y[12 + i];
        T9[i] = a[i]; r2 += y[12 + i]; r0 += a[i];
    }
    if (r2 != 1){           // Qz != 1
        gfb_sqr(T6,T7);     // T6^2 -> T7
        gfb_mul(T1,T7,T1);  // T1xT7 -> T1
        gfb_mul(T6,T7,T7);  // T6xT7 -> T7
        gfb_mul(T2,T7,T2);  // T2xT7 -> T2
    }
    gfb_sqr(T3,T7);         // T3^2 -> T7
    gfb_mul(T4,T7,T8);      // T4xT7 -> T8
    gfb_add(T1,T8,T1);      // T1+T8 -> T1
    gfb_mul(T3,T7,T7);      // T3xT7 -> T7
    gfb_mul(T5,T7,T8);      // T5xT7 -> T8
    gfb_add(T2,T8,T2);      // T2+T8 -> T2
    gfb_mul(T2,T4,T4);      // T2xT4 -> T4
    gfb_mul(T1,T3,T3);      // T1xT3 -> T3
    gfb_mul(T3,T5,T5);      // T3xT5 -> T5
    gfb_add(T4,T5,T4);      // T4+T5 -> T4
    gfb_sqr(T3,T5);         // T3^2 -> T5
    gfb_mul(T4,T5,T7);      // T4xT5 -> T7
    if (r2 != 1)
        gfb_mul(T3,T6,T3);  // T3xT6 -> T3
    gfb_add(T2,T3,T4);      // T2+T3 -> T4
    gfb_mul(T2,T4,T2);      // T2xT4 -> T2
    gfb_sqr(T1,T5);         // T1^2 -> T5
    gfb_mul(T1,T5,T1);      // T1xT5 -> T1
    if (r0 != 0){           // a != 0
        gfb_sqr(T3,T8);     // T3^2 -> T8
        gfb_mul(T8,T9,T9);  // T8xT9 -> T9
        gfb_add(T1,T9,T1);  // T1+T9 -> T1
    }
    gfb_add(T1,T2,T1);      // T1+T2 -> T1
    gfb_mul(T1,T4,T4);      // T1xT4 -> T4
    gfb_add(T4,T7,T2);      // T4+T7 -> T2
    for(i = 0;i < 6;i++){   // T1 -> Rx, T2 -> Ry, T3 -> Rz
        z[i] = T1[i]; z[6 + i] = T2[i]; z[12 + i] = T3[i];
    }

Pcode 2.40: Simulation code for points addition over GF($2^{163}$) using standard projective coordinates.
With modified Jacobian coordinates:

$Z_r = X_p^2Z_p^2$

$X_r = X_p^4 + bZ_p^4$

$Y_r = X_r(Y_p^2 + bZ_p^4 + a \cdot Z_r) + bZ_p^4Z_r$

For point double using standard projective coordinates, we have to precompute the value $c = b^{2^{m-2}}$. In the case of modified Jacobian coordinates, this precomputation of c is not needed. The simulation codes for point double using standard projective coordinates and modified Jacobian coordinates are given in Pcodes 2.42 and 2.43, respectively.

    r0 = 0;
    for(i = 0;i < 6;i++){   // Xq -> T1, Yq -> T2, Zq -> T3, Xp -> T4, Yp -> T5, a -> T9
        T1[i] = x[i]; T2[i] = x[6 + i]; T3[i] = x[12 + i];
        T4[i] = y[i]; T5[i] = y[6 + i];
        T9[i] = pE->coeff_a[i]; r0 = r0 + pE->coeff_a[i];
    }
    gfb_sqr(T3,T6);         // T3^2 -> T6
    gfb_mul(T5,T6,T7);      // T5xT6 -> T7
    gfb_mul(T4,T3,T8);      // T4xT3 -> T8
    gfb_add(T7,T2,T7);      // T7+T2 -> T7
    gfb_add(T8,T1,T8);      // T8+T1 -> T8
    gfb_mul(T8,T3,T1);      // T8xT3 -> T1
    if(r0==0) {
        T9[0] = T1[0]; T9[1] = T1[1]; T9[2] = T1[2];
        T9[3] = T1[3]; T9[4] = T1[4]; T9[5] = T1[5];
    }
    else if (r0==1){
        T9[0] = T6[0]; T9[1] = T6[1]; T9[2] = T6[2];
        T9[3] = T6[3]; T9[4] = T6[4]; T9[5] = T6[5];
    }
    else gfb_mul(T9,T6,T9); // axT6 -> T9
    gfb_add(T9,T1,T9);      // T9+T1 -> T9
    gfb_sqr(T8,T8);         // T8^2 -> T8
    gfb_mul(T8,T9,T8);      // T8xT9 -> T8
    gfb_mul(T1,T7,T2);      // T1xT7 -> T2
    gfb_sqr(T1,T1);         // T1^2 -> T1
    gfb_sqr(T7,T7);         // T7^2 -> T7
    gfb_add(T7,T8,T7);      // T7+T8 -> T7
    gfb_add(T7,T2,T3);      // T7+T2 -> T3
    gfb_mul(T1,T4,T6);      // T1xT4 -> T6
    gfb_add(T6,T3,T6);      // T6+T3 -> T6
    gfb_mul(T1,T5,T8);      // T1xT5 -> T8
    gfb_add(T8,T3,T8);      // T8+T3 -> T8
    gfb_mul(T2,T6,T2);      // T2xT6 -> T2
    gfb_mul(T8,T1,T8);      // T8xT1 -> T8
    gfb_add(T2,T8,T2);      // T2+T8 -> T2
    for(i = 0;i < 6;i++){
        z[i] = T3[i]; z[6 + i] = T2[i]; z[12 + i] = T1[i];
    }

Pcode 2.41: Simulation code for two points addition over GF($2^{163}$) using modified Jacobian coordinates.

    r1 = 0;
    for(i = 0;i < 6;i++){   // Px -> T1, Py -> T2, Pz -> T3, b^(2^(m-2)) -> T4
        r1 = r1 + x[i + 6];
        T1[i] = x[i]; T2[i] = x[6+i]; T3[i] = x[12+i];
        T4[i] = c_bsqrm_2[i];
    }
    gfb_mul(T2,T3,T2);      // T2xT3 -> T2
    gfb_sqr(T3,T3);         // T3^2 -> T3
    gfb_mul(T3,T4,T4);      // T3xT4 -> T4
    gfb_mul(T1,T3,T3);      // T1xT3 -> T3
    gfb_add(T2,T3,T2);      // T2+T3 -> T2
    gfb_add(T1,T4,T4);      // T1+T4 -> T4
    gfb_sqr(T4,T4);         // T4^2 -> T4
    gfb_sqr(T4,T4);         // T4^2 -> T4
    gfb_sqr(T1,T1);         // T1^2 -> T1
    gfb_add(T1,T2,T2);      // T1+T2 -> T2
    gfb_mul(T2,T4,T2);      // T2xT4 -> T2
    gfb_sqr(T1,T1);         // T1^2 -> T1
    gfb_mul(T1,T3,T1);      // T1xT3 -> T1
    gfb_add(T1,T2,T2);      // T1+T2 -> T2
    for(i = 0;i < 6;i++){   // T4 -> Qx, T2 -> Qy, T3 -> Qz
        x[i] = T4[i];       // Xr
        x[i + 6] = T2[i];   // Yr
        x[i + 12] = T3[i];  // Zr
    }

Pcode 2.42: Simulation code for point double in GF($2^{163}$) using standard projective coordinates.

    for(i = 0;i < 6;i++){   // Xq -> T1, Yq -> T2, Zq -> T3, a -> T6, b -> T7
        T1[i] = x[i]; T2[i] = x[6+i]; T3[i] = x[12+i];
        T6[i] = a[i]; T7[i] = b[i];
    }
    gfb_sqr(T1,T1);         // T1^2 -> T1
    gfb_sqr(T3,T3);         // T3^2 -> T3
    gfb_mul(T1,T3,T4);      // T1xT3 -> T4
    gfb_sqr(T1,T1);         // T1^2 -> T1
    gfb_sqr(T3,T3);         // T3^2 -> T3
    gfb_mul(T7,T3,T3);      // bxT3 -> T3
    gfb_add(T1,T3,T1);      // T1+T3 -> T1
    gfb_sqr(T2,T2);         // T2^2 -> T2
    gfb_add(T2,T3,T2);      // T2+T3 -> T2
    gfb_mul(T3,T4,T3);      // T3xT4 -> T3
    gfb_mul(T6,T4,T5);      // axT4 -> T5
    gfb_add(T2,T5,T2);      // T2+T5 -> T2
    gfb_mul(T2,T1,T2);      // T2xT1 -> T2
    gfb_add(T2,T3,T2);      // T2+T3 -> T2
    for(i = 0;i < 6;i++){
        x[i] = T1[i];       // Xr
        x[6+i] = T2[i];     // Yr
        x[12+i] = T4[i];    // Zr
    }

Pcode 2.43: Simulation code for point double in GF($2^{163}$) using modified Jacobian coordinates.

Scalar Point Multiplication

In this section, we discuss two methods for computing the multiplication of an elliptic curve point P with a constant value k. A scalar multiplication kP of a point P on an elliptic curve is computed with the doubling and add operations defined over elliptic curves. As the doubling and add operations of elliptic curve points are costly in terms of computations, here we discuss an efficient way of computing scalar point multiplication with fewer point add and double operations. For a better understanding of this scalar point multiplication algorithm, two methods of building a 12-bit integer number k (say, 1796d = 704h = 011100000100b) with very few operations are described in the following.

COMB METHOD

$011100000100_b = 1 \cdot 2^{10} + 1 \cdot 2^9 + 1 \cdot 2^8 + 0 \cdot 2^7 + 0 \cdot 2^6 + 0 \cdot 2^5 + 0 \cdot 2^4 + 0 \cdot 2^3 + 1 \cdot 2^2 + 0 \cdot 2^1 + 0 \cdot 2^0$

$= (1 \cdot 2^1 + 1 \cdot 2^0) \cdot 2^9 + (1 \cdot 2^2 + 0 \cdot 2^1 + 0 \cdot 2^0) \cdot 2^6 + (0 \cdot 2^2 + 0 \cdot 2^1 + 0 \cdot 2^0) \cdot 2^3 + (1 \cdot 2^2 + 0 \cdot 2^1 + 0 \cdot 2^0) \cdot 2^0$

$= (0 \cdot 2^2 + 1 \cdot 2^1 + 1 \cdot 2^0) \cdot 2^9 + (1 \cdot 2^2 + 0 \cdot 2^1 + 0 \cdot 2^0) \cdot 2^6 + (0 \cdot 2^2 + 0 \cdot 2^1 + 0 \cdot 2^0) \cdot 2^3 + (1 \cdot 2^2 + 0 \cdot 2^1 + 0 \cdot 2^0) \cdot 2^0$

$= (0 \cdot 2^9 + 1 \cdot 2^6 + 0 \cdot 2^3 + 1 \cdot 2^0) \cdot 2^2 + (1 \cdot 2^9 + 0 \cdot 2^6 + 0 \cdot 2^3 + 0 \cdot 2^0) \cdot 2^1 + (1 \cdot 2^9 + 0 \cdot 2^6 + 0 \cdot 2^3 + 0 \cdot 2^0) \cdot 2^0$

$= (b_{23} \cdot 2^9 + b_{22} \cdot 2^6 + b_{21} \cdot 2^3 + b_{20} \cdot 2^0) \cdot 2^2 + (b_{13} \cdot 2^9 + b_{12} \cdot 2^6 + b_{11} \cdot 2^3 + b_{10} \cdot 2^0) \cdot 2^1 + (b_{03} \cdot 2^9 + b_{02} \cdot 2^6 + b_{01} \cdot 2^3 + b_{00} \cdot 2^0) \cdot 2^0$

$= 2\,(2 \cdot [b_{23}b_{22}b_{21}b_{20}] + [b_{13}b_{12}b_{11}b_{10}]) + [b_{03}b_{02}b_{01}b_{00}]$

$= 2 \cdot (2 \cdot B_2 + B_1) + B_0$

where $B_j = [b_{j3}b_{j2}b_{j1}b_{j0}] = b_{j3} \cdot 2^9 + b_{j2} \cdot 2^6 + b_{j1} \cdot 2^3 + b_{j0} \cdot 2^0$ and

$\begin{bmatrix} b_{00} & b_{01} & b_{02} & b_{03} \\ b_{10} & b_{11} & b_{12} & b_{13} \\ b_{20} & b_{21} & b_{22} & b_{23} \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{bmatrix}$

In this example, $B_j$ can have 16 possible combinations: $[0000] = (0 \cdot 2^9 + 0 \cdot 2^6 + 0 \cdot 2^3 + 0 \cdot 2^0)$, $[0001] = (0 \cdot 2^9 + 0 \cdot 2^6 + 0 \cdot 2^3 + 1 \cdot 2^0)$, . . . , $[1111] = (1 \cdot 2^9 + 1 \cdot 2^6 + 1 \cdot 2^3 + 1 \cdot 2^0)$. These 16 values are 0, 1, 8, 9, 64, 65, 72, 73, 512, 513, 520, 521, 576, 577, 584, and 585. For our example of 1796, the values of $B_0$, $B_1$, and $B_2$ are 512 = [1000], 512 = [1000], and 65 = [0101]. The number 1796 is obtained from the expression $2 \cdot (2 \cdot B_2 + B_1) + B_0$ as $2 \cdot (2 \cdot 65 + 512) + 512$. This involves two multiplications and two additions, whereas the straightforward method involves 11 multiplications and 10 additions (it also requires 11 precomputed values of powers of 2). With the precomputation of 16 values, any 12-bit integer value can be computed easily using the previous method with two additions and two multiplications.
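The comb decomposition of the 12-bit example can be reproduced with plain integers. This is a minimal sketch under the same layout as the text (four blocks of three bits); the names `comb_offset` and `comb_rebuild` are hypothetical, and the integer doubling and addition stand in for the point double and point add operations.

```c
#include <assert.h>
#include <stdint.h>

enum { COMB_ROWS = 4, COMB_WIDTH = 3 };    /* four blocks of three bits: 12-bit k */

/* Offset B_i: bit i of every 3-bit block, with block j weighted by 2^(3j). */
uint32_t comb_offset(uint32_t k, int i)
{
    uint32_t B = 0;
    assert(k < (1u << 12) && i >= 0 && i < COMB_WIDTH);
    for (int j = 0; j < COMB_ROWS; j++)
        B |= ((k >> (COMB_WIDTH * j + i)) & 1u) << (COMB_WIDTH * j);
    return B;
}

/* Rebuild k as 2*(2*B2 + B1) + B0: two doublings and two additions. */
uint32_t comb_rebuild(uint32_t k)
{
    uint32_t q = comb_offset(k, COMB_WIDTH - 1);   /* start from B2 */
    for (int i = COMB_WIDTH - 2; i >= 0; i--)
        q = 2 * q + comb_offset(k, i);             /* double, then add B_i */
    return q;
}
```

For k = 1796, `comb_offset` gives 512, 512, and 65 for i = 0, 1, and 2, matching the values of B0, B1, and B2 in the text. In the elliptic curve setting, the 16 possible offset values would index a table of precomputed multiples of the base point.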
Preparing the offsets $B_i$ for the comb method is illustrated in the following:

                B2   B1   B0
    b_i0         1    0    0
    b_i1         0    0    0
    b_i2         1    0    0
    b_i3         0    1    1

    k = 011 100 000 100   (blocks b_i3, b_i2, b_i1, b_i0 from left to right)
    B_i = [b_i3 b_i2 b_i1 b_i0]

In the comb method, we basically divide the given binary string into small fixed-length blocks (if the leftmost block does not have sufficient bits, then we add enough zero bits at the MSB side to form a block). We arrange the blocks one below another and read from bottom to top column-wise to get each offset $B_i$. From this, $B_0$ = [1000], $B_1$ = [1000], and $B_2$ = [0101]. The pseudocode for scalar point multiplication kG using the comb method follows:

    Q = B_(n-1) G;                  // assuming n offsets B_i
    for i = n − 2:0
        Q = EccPointDouble(Q);
        Q = EccPointsAdd(Q, B_i G);
    end
    Output Q;

ADD-SUBTRACT WITH LOW ONE POPULATION

In this method, we compute a number h from a given number k by multiplying it by 3. Then we compute g by XORing h and k. Let $g = (g_n g_{n-1} g_{n-2} \ldots g_1 g_0)$ and $k = (k_n k_{n-1} k_{n-2} \ldots k_1 k_0)$, where $g_n = 1$. Now, we build the number k from 1 by using the binary strings $(g_n g_{n-1} g_{n-2} \ldots g_1 g_0)$ and $(k_n k_{n-1} k_{n-2} \ldots k_1 k_0)$ in an iterative fashion, as seen in the following:

    a = b = 1;
    for(i = n − 1; i > 0; i--){
        a = 2a;
        if (g_i == 0) continue;
        else{
            if(k_i == 0) a = a + b;
            else a = a − b;
        }
    }

Next we use the previous method to compute a 12-bit number k from 1. Let k = 1796d = 0011100000100b, h = 3 ∗ k = 3 ∗ 1796 = 5388d = 1010100001100b, and g = k ⊕ h = 1001000001000b = $(g_n g_{n-1} g_{n-2} \ldots g_1 g_0)$, where n = 12. Then, from the previous iterative method, the updated value at the end of each iteration becomes a = 2, 4, 7, 14, 28, 56, 112, 224, 449, 898, 1796. This method requires n − 1 multiplications and few additions (in our example, two additions or subtractions took place, at the non-zero values of $g_i$). The number of additions depends on the one's population in the binary string of g.
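The add-subtract iteration above can be run directly on integers. The following is a minimal sketch with a hypothetical function name, `add_sub_build`; it rebuilds k from 1 using only doublings and the occasional addition or subtraction of b = 1, which in the elliptic curve setting would correspond to point double, point add, and point subtract.

```c
#include <assert.h>
#include <stdint.h>

/* Rebuild k (k >= 1) from 1 with the add-subtract recoding h = 3k, g = h XOR k. */
uint32_t add_sub_build(uint32_t k)
{
    uint64_t h, g;
    int64_t a = 1;
    const int64_t b = 1;
    int n = 0;

    assert(k >= 1);
    h = 3 * (uint64_t)k;
    g = h ^ (uint64_t)k;
    while ((g >> (n + 1)) != 0)
        n++;                                  /* n = index of the top set bit of g */

    for (int i = n - 1; i > 0; i--) {
        a = 2 * a;                            /* doubling step */
        if (((g >> i) & 1) == 0)
            continue;                         /* g_i = 0: nothing to add or subtract */
        if ((((uint64_t)k >> i) & 1) == 0)
            a = a + b;                        /* g_i = 1, k_i = 0: add */
        else
            a = a - b;                        /* g_i = 1, k_i = 1: subtract */
    }
    return (uint32_t)a;
}
```

For k = 1796 the intermediate values of `a` follow the sequence 2, 4, 7, 14, 28, 56, 112, 224, 449, 898, 1796 given in the text, with one subtraction (at i = 9) and one addition (at i = 3).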
For computing scalar point multiplication, in the previous method we replace the multiplication by 2 with PointDouble, and addition and subtraction with FullAdd and FullSub operations. If we have sufficient memory, then use of the comb method greatly reduces the computations of scalar multiplication of the elliptic curve point. As discussed in Section 2.5.2, Elliptic Curve-Based DSA, in ECDSA we have four scalar point multiplications: one in key-pair generation, one in signature generation, and two in signature verification. Out of the four scalar point multiplications, we use the base point G in three of them. In some cases, the domain parameters of elliptic curves may not be changed, and precomputation of a few base-point scalar multiplications can speed up the signature generation or verification process. The simulation code for scalar point multiplication using the comb method (for computing kG), EccPointMulComb( ), and the add-subtract method (for computing kP), EccPointMulAddSub( ), are given in Pcodes 2.44 and 2.45, respectively. In scalar multiplication using the add-subtract method, we have two new functions, namely, EccFullAddModJac( ) and EccFullSubModJac( ). The EccFullSubModJac( ) function first computes the reflection point of the second input, and then calls the function EccFullAddModJac( ). The EccFullAddModJac( ) function is computed as follows.

    EccAddPointsModJac(X, Y, Z);              // X + Y -> Z
    If (Z = 0) EccDoublePointModJac(X, Z);    // 2X -> Z
    Output Z;

    for(j = m-1;j >= 0;j--){
        EccPointDouble(Q);                    // 2Q -> Q
        r0 = 0;
        for(i = 5;i >= 0;i--){                // get offset Bi
            r1 = rk[i]; r1 = r1 >> j; r0 = r0 << 1;
            r1 = r1 & 1; r0 = r0 | r1;
        }
        if(r0 != 0){
            r0 += -1; r0 = r0 * 18;           // get precomputed value
            for(i = 0;i < 18;i++)
                P[i] = Gu[i+r0];              // Gu[ ] contains precomputed values
            EccPointsAdd(pEC,P,Q,Q);          // Q+[xxxxxx].G -> Q
        }
    }
    // convert projective to affine coordinates, (Xq,Yq,Zq) -> (Qx,Qy)
    T1[0] = Q[12]; T1[1] = Q[13]; T1[2] = Q[14];
    T1[3] = Q[15]; T1[4] = Q[16]; T1[5] = Q[17];
    gfb_sqr(T1,T1);                           // Qz^2 -> T1
    T9[0] = T1[0]; T9[1] = T1[1]; T9[2] = T1[2];  // T1 -> T9
    T9[3] = T1[3]; T9[4] = T1[4]; T9[5] = T1[5];
    gfb_inv(T1,T2);                           // 1/T1 -> T2
    T1[0] = Q[0]; T1[1] = Q[1]; T1[2] = Q[2];
    T1[3] = Q[3]; T1[4] = Q[4]; T1[5] = Q[5];
    gfb_mul(T1,T2,T1);                        // T1xT2 -> T1
    y[0] = T1[0]; y[1] = T1[1]; y[2] = T1[2]; // T1 -> Qx
    y[3] = T1[3]; y[4] = T1[4]; y[5] = T1[5];
    T1[0] = Q[12]; T1[1] = Q[13]; T1[2] = Q[14];
    T1[3] = Q[15]; T1[4] = Q[16]; T1[5] = Q[17];
    gfb_mul(T1,T9,T1);                        // T1xT9 -> T1
    gfb_inv(T1,T2);                           // 1/T1 -> T2
    T1[0] = Q[6]; T1[1] = Q[7]; T1[2] = Q[8];
    T1[3] = Q[9]; T1[4] = Q[10]; T1[5] = Q[11];
    gfb_mul(T1,T2,T1);                        // T1xT2 -> T1
    y[6] = T1[0]; y[7] = T1[1]; y[8] = T1[2]; // T1 -> Qy
    y[9] = T1[3]; y[10] = T1[4]; y[11] = T1[5];

Pcode 2.44: Simulation code for scalar point multiplication in GF($2^{163}$) using comb method.

    r0 = 0; r3 = 3;
    for(i = 0;i < 6;i++) {                    // 3*k -> h
        r1 = k[i]; r2 = r1 & 0xffff; r1 = r1 >> 16;
        r4 = r2 * r3; r5 = r1 * r3; r4 = r4 + r0;
        r1 = r4 & 0xffff; r0 = r4 >> 16;
        r5 = r5 + r0; r2 = r5 & 0xffff; r0 = r5 >> 16;
        r2 = r2 << 16; r2 = r2 | r1; h[i] = r2;
    }
    for(i = m - 1;i >= 1;i--){
        EccPointDoubleModJac(pEC, Q);         // 2Q -> Q
        r0 = i>>5; r2 = i & 0x1f;
        r1 = k[r0]; r3 = h[r0];
        r1 = r1 >> r2; r3 = r3 >> r2;
        r1 = r1 & 1; r3 = r3 & 1;
        if ((r1 == 0) && (r3 == 0)) continue;
        if ((r1 == 1) && (r3 == 1)) continue;
        if ((r1 == 0) && (r3 == 1)) EccFullAddModJac(pEC, Q,P,Q);
        else EccFullSubModJac(pEC, Q,P,Q);
    }
    // convert from Modified Jacobian to affine coordinates
    gfb_inv(&Q[12],T2);                       // 1/Zq -> T2
    gfb_mul(Q,T2,y);                          // XqxT2 -> x
    gfb_sqr(&Q[12],T1);                       // Zq^2 -> T1
    gfb_inv(T1,T2);                           // 1/T1 -> T2
    gfb_mul(&Q[6],T2,&y[6]);                  // YqxT2 -> y

Pcode 2.45: Simulation code for scalar point multiplication in GF($2^{163}$) using add-subtract method.

Simulation of ECDSA over GF($2^{163}$)

The ECDSA parameter set over a binary field consists of the following parameters: coefficients a and b, field base-point G, field size s, and field order n.
The ECDSA uses three functions: EccKeyPairGen( ) for key-pair generation, EccSigGen( ) for signature generation, and EccSigVer( ) for signature verification. The signature generation routine uses a private key to compute the signature, whereas the signature verification process uses a public key to verify the signature. See Section 2.5.4 for more details of ECDSA functionality. Both the signature generation and signature verification processes assume that the message digest value is available.

EccKeyPairGen( ): Key-Pair Generation Process

The key-pair generation process generates two keys, namely, a private key and a public key, by multiplying the base point G with a pseudorandom number k. The simulation code for EccKeyPairGen( ) is given in Pcode 2.46.

EccSigGen( ): Signature Generation Process

The signature generation routine uses a private key (k) and a message digest value to compute the signature of a message, as shown in Figure 2.15. The simulation code for EccSigGen( ) is given in Pcode 2.47.

EccSigVer( ): Signature Verification Process

The signature verification routine uses the public key (W) and a message digest value to verify the signature of a message, as shown in Figure 2.17. The simulation code for EccSigVer( ) is given in Pcode 2.48.

    // choose (generate) private key 'k' in the interval [1, n-1], where n = 2^m, m = 163
    // we assume the private key K[ ] (a random number) is available for this simulation
    pECC->pvkey_m[0] = K[0]; pECC->pvkey_m[1] = K[1]; pECC->pvkey_m[2] = K[2];
    pECC->pvkey_m[3] = K[3]; pECC->pvkey_m[4] = K[4]; pECC->pvkey_m[5] = K[5];
    // compute public key 'W' as W = k.G
    EccPointMulComb(pECC, K, pbkey);
    for(i = 0;i < 12;i++)
        pECC->pbkey_m[i] = pbkey[i];

Pcode 2.46: Simulation code for key-pair generation process.
    // S = (x,y) = t.G, where 't' is a random number in the interval [1,n-1]
    EccPointMulComb(pECC, t, randm, X);
    // r = Sx mod n, it is an m-bit integer modulo reduction
    gfp_mod(P,T9,T4);                 // r = Sx mod n -> T4
    // check if (sum(z[1:6])==0), then again start signature generation
    // h = k.r, m-bit integer multiplication, where k is a private key of signature generator
    T1[0] = pECC->pvkey_m[0]; T1[1] = pECC->pvkey_m[1]; T1[2] = pECC->pvkey_m[2];
    T1[3] = pECC->pvkey_m[3]; T1[4] = pECC->pvkey_m[4]; T1[5] = pECC->pvkey_m[5];
    // k.r, two m-bit numbers integer multiplication
    gfp_mul(T1,T4,X);                 // T1 * T4 -> X
    gfp_mod(X,T9,T3);                 // k.r mod n -> T3
    // generate message digest 'e = msgd[ ]' using SHA-1 function
    T1[0] = msgd[0]; T1[1] = msgd[1]; T1[2] = msgd[2];
    T1[3] = msgd[3]; T1[4] = msgd[4]; T1[5] = 0;
    // e + (k.r mod n)
    // two m-bit numbers integer addition
    gfp_add(T1,T3,T5);                // T1+T3 -> T5
    // 1/t, inverse for m-bit integer number
    gfp_inv(temp_randm,T3);           // inv(t) -> T3
    gfp_mul(T5,T3,X);                 // inv(t)*(e+k.r) -> X
    gfp_mod(X,T9,T1);                 // s = inv(t)*(e+k.r) mod n -> T1
    // (r,s) -> output as signature

Pcode 2.47: Simulation code for EccSigGen( ) process in GF($2^{163}$).

2.5.6 Simulation Results of ECDSA over GF($2^{163}$)

In this section, the simulation results for two recommended elliptic curves in GF($2^{163}$) are presented. For each elliptic curve, the domain parameters, EccKeyPairGen( ) output, EccSigGen( ) output, and EccSigVer( ) output are presented. For both signature generation and signature verification, we use a temporary message digest value, msgd[ ].

Simulation Results for Koblitz Elliptic Curve over GF($2^{163}$)

Domain Parameters
    a = 0x01;   // coefficient 'a'
    b = 0x01;   // coefficient 'b'
    G: (Xg, Yg) = (02 fe13c053 7bbc11ac aa07d793 de4e6d5e 5c94eee8,
                   02 89070fb0 5d38ff58 321f2e80 0536d538 ccdaa3d9);   // base point 'G'
    N = 04 00000000 00000000 00020108 a2e0cc0d 99f8a5ef;   // order of curve 'n'
    h = 02;   // cofactor 'h'

Key-Pair Generation
Input: A "seed" value for random number generator.
    // assume message digest of message as e = msgd[ ], compute inverse of 's' of signature (r,s)
    for(i = 0;i < 6;i++){
        T1[i] = pECC->outputs[6+i];   // s
        T5[i] = pECC->outputs[i];     // r
        T9[i] = pECC->order_g[i];     // n
    }
    gfp_inv(T1,T4);                   // inv(s) -> T4
    gfp_mul(msgd,T4,X);               // e.inv(s) -> X
    gfp_mod(X,T9,X);                  // e.inv(s) (mod n) -> X
    gfp_mul(T5,T4,Y);                 // r.inv(s) -> Y
    gfp_mod(Y,T9,Y);                  // r.inv(s) (mod n) -> Y
    EccPointMulComb(pECC,X,X);        // X.G -> X
    EccPointMulAddSub(pECC, Y, Y);    // Y.W -> Y
    for(i = 0;i < 6;i++){
        P[i] = X[i]; Q[i] = Y[i];
        P[6+i] = X[6+i]; Q[6+i] = Y[6+i];
        P[12+i] = Zg[i]; Q[12+i] = Zg[i];   // Zg[ ] = 1
    }
    EccFullAddModJacc(pECC, P,Q,Q);   // X.G + Y.G -> (x,y)
    // convert from modified Jacobian to affine coordinates
    gfb_inv(&Q[12],T3);               // 1/Zq -> T3
    gfb_mul(Q,T3,X);                  // XqxT3 -> X
    gfb_sqr(&Q[12],T2);               // Zq^2 -> T2
    gfb_inv(T2,T3);                   // 1/T2 -> T3
    gfb_mul(&Q[6],T3,Y);              // YqxT3 -> Y
    for(i = 0;i < 6;i++){
        T9[i] = pECC->order_g[i]; X[6+i] = 0;
    }
    gfp_mod(X,T9,T2);                 // x (mod n) -> T2
    pECC->valid_s = 1;
    for(i = 0;i<6;i++){
        if(T2[i] != pECC->outputs[i]){
            pECC->valid_s = 0; break;
        }
    }

Pcode 2.48: Simulation code for EccSigVer( ) process in GF($2^{163}$).
Output:
    k: 03 a41434aa 99c2ef40 c8495b2e d9739cb2 155a1e0d;   // private key
    W: (Xw, Yw) = (03 7d529fa3 7e42195f 10111127 ffb2bb38 644806bc,
                   04 47026eee 8b34157f 3eb51be5 185d2be0 249ed776);   // public key

Signature Generation
Input:
    e: 00 a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d;   // message digest value
    d: 00 a40b301c c315c257 d51d4422 34f5aff8 189d2b6c;   // random number ∈ [1, n−1]
    k: 03 a41434aa 99c2ef40 c8495b2e d9739cb2 155a1e0d;   // private key
Output:
    (r, s): (01 52f95ca1 5da1997a 8c449e00 cd2aa2ac cb988d7f,
             00 994d2c41 aa30e529 52aea846 2370471b 2b0a34ac);   // signature: (r, s)

Signature Verification
Input:
    (r, s): (01 52f95ca1 5da1997a 8c449e00 cd2aa2ac cb988d7f,
             00 994d2c41 aa30e529 52aea846 2370471b 2b0a34ac);   // signature: (r, s)
    e: 00 a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d;   // message digest value
    W: (Xw, Yw) = (03 7d529fa3 7e42195f 10111127 ffb2bb38 644806bc,
                   04 47026eee 8b34157f 3eb51be5 185d2be0 249ed776);   // public key
Output:
    Valid_s: 1   // 1/0: valid/not valid

Simulation Results for Random Elliptic Curve over GF($2^{163}$)

Domain Parameters
    a = 07 b6882caa efa84f95 54ff8428 bd88e246 d2782ae2;   // coefficient 'a'
    b = 07 13612dcd dcb40aab 946bda29 ca91f73a f958afd9;   // coefficient 'b'
    G: (Xg, Yg) = (03 69979697 ab438977 89566789 567f787a 7876a654,
                   00 435edb42 efafb298 9d51fefc e3c80988 f41ff883);   // base point 'G'
    N = 03 ffffffff ffffffff ffff48aa b689c29c a710279b;   // order of curve 'n'
    h = 02;   // cofactor 'h'

Key-Pair Generation
Input: A "seed" value for random number generator.
Output:
    k: 03 a41434aa 99c2ef40 c8495b2e d9739cb2 155a1e0d;   // private key
    W: (Xw, Yw) = (05 7f8f4671 cfa2badf 53c57cb5 4e5c48a9 45ff2114,
                   07 4da202c5 0a98ec3b badf742d 4c9dcf17 f52dc591);   // public key

Signature Generation
Input:
    e: 00 a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d;   // message digest value
    d: 00 a40b301c c315c257 d51d4422 34f5aff8 189d2b6c;   // random number ∈ [1, n−1]
    k: 03 a41434aa 99c2ef40 c8495b2e d9739cb2 155a1e0d;   // private key
Output:
    (r, s): (01 40ca54a6 4474606e c63f5dc8 affc2e14 a8acf423,
             00 b653b62f d233247b c3441e64 b57449f2 cc5f1677);   // signature: (r, s)

Signature Verification
Input:
    (r, s): (01 40ca54a6 4474606e c63f5dc8 affc2e14 a8acf423,
             00 b653b62f d233247b c3441e64 b57449f2 cc5f1677);   // signature: (r, s)
    e: 00 a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d;   // message digest value
    W: (Xw, Yw) = (05 7f8f4671 cfa2badf 53c57cb5 4e5c48a9 45ff2114,
                   07 4da202c5 0a98ec3b badf742d 4c9dcf17 f52dc591);   // public key
Output:
    Valid_s: 1   // 1/0: valid/not valid

CHAPTER 3
Introduction to Data Error Correction

Error-correcting codes, an important part of modern digital communications systems, are used to detect and correct errors introduced during transmission. In many communications applications, as shown in Figure 3.1, a substantial portion of the baseband signal processing is dedicated to keeping a very low bit error rate (BER), usually less than $10^{-10}$. As system designers, we can trade coding gain for lower transmit power or higher data throughput, and there is an ongoing effort to incorporate increasingly more powerful channel coding techniques into communications systems. In this chapter, we discuss various channel coding techniques and simulate the most popularly used CRC32 error detection algorithm. The error correction algorithm simulation techniques will be discussed in the next chapter.

3.1 Definitions

Communications channels introduce noise into useful signals during transmission.
There are many noise sources that generate noise and add it to the signal. See Section 9.1.2 for more information on noise generation and measurement in communications systems. In his famous paper published in 1948, "A Mathematical Theory of Communication," Shannon wrote that reliable communication through a channel is possible only if the rate of data transmission is below the channel capacity. From this paper, it can be understood that the presence of noise and the non-zero response of the channel are the two parameters that determine the channel capacity. It also says that the channel capacity can be achieved through complex channel coding and modulation schemes. See Section 9.1 for more information on channel capacity and modulation schemes. In this chapter, we discuss the channel coding techniques through which we can perform forward error correction (FEC), that is, the correction of channel errors at the receiver side without requesting retransmission of data. In Figure 3.1, the shaded portion represents the baseband processing related to channel coding. Depending on the communications system, we may use an error detection scheme (which may request retransmission of data from the transmitter), an error correction scheme, or both. In two-way communications systems such as telecom or computer data systems, we can use error detection with ARQ (automatic repeat request) schemes to improve communication quality. In Section 3.2, we discuss various error detection methods and simulate the popularly used CRC32 error detection algorithm.

[Figure 3.1: Channel coding in digital communications. Transmit path: Source Data, Source Coding, Channel Coding, Digital Modulation, Transmitter Back End. Noisy Channel. Receive path: Receiver Front End, Channel Decoding, Data Decompression, Received Data.]

© 2010 Elsevier Inc. All rights reserved. DOI: 10.1016/B978-1-85617-678-1.00003-X
Communications systems such as broadcast systems are examples of one-way communications systems, where we do not have a backward channel to request retransmission of error data frames. In such cases, we apply error-correction techniques in the forward direction of the data itself to reduce the number of error frames. An overview of various error-correction schemes based on block coding and convolutional coding methods is provided in Sections 3.3 through 3.11.

3.2 Error Detection Algorithms

Data error detection algorithms with ARQ schemes play an important role in two-way communications systems to minimize the number of error data frames at the user end. Examples of two-way communications systems are twisted-pair telephone lines, computer data communication networks, and some satellite communications systems. Error detection is performed by using redundant data added to the original data at the transmitter. We obtain the redundant data using either odd or even parity bit computation or cyclic-redundancy-check (CRC) bit computation. At the receiver, we again compute the check bits from the received data bits and compare them against the received redundant information. If the transmitted redundant information is the same as the check bits computed at the receiver, then we assume that the received data is error free; otherwise, we treat the current data frame as an error frame and request the transmitter to retransmit it. In this section, we discuss how error detection works with parity or CRC bits, and simulate the widely used CRC32 algorithm.

3.2.1 Error Detection with Parity Bits

Assume that the transmitter and receiver communicate using data frames of n bits. At the receiver, we would like to know whether the current received frame contains any error bits. One way to know if errors are present in the received frame is to add a parity bit at the transmitter to each transmitted frame.
To do this, we use only n − 1 data bits and 1 parity bit to make an n-bit frame. At the receiver, we compute the parity bit again from the n − 1 data bits and check whether the computed parity bit matches the received parity bit. If they match, we assume that there are no errors in the received frame; otherwise, an error is detected. Even or odd parity (i.e., making the number of ones in the n-bit frame even or odd) can be used in computing parity bits. Here we consider even parity. In Example 3.1, we use a data frame length of 8 bits and explain how single-bit errors are detected in the received 8-bit frames with a parity bit. Typically, we use 8 bits for transferring ASCII character data among computer memory, CPU, and peripherals. In these 8 bits, we use 7 bits to transmit actual data and 1 bit for parity.

■ Example 3.1
Assume that the following 7 bits of data, "1011001," are to be transmitted. With even parity, we make the number of 1s present in the 8-bit data even by adding a 0 bit as a parity bit. After adding the parity bit, the 8-bit frame becomes "10110010" (where the last bit is the parity bit), and we transmit this 8-bit frame to the receiver through a noisy channel. Assume we received the 8-bit data frame as "10110110," with 1 bit in error (the sixth bit). If we compute the parity bit with the 7 data bits of the received frame, then we get 1 for even parity, whereas the parity bit of the received frame (i.e., the 8th bit) is 0. As the parity bits do not match, the received frame contains errors and we request the transmitter to transmit this frame again. However, we have a problem if the number of errors that occurred is even. For example, if the received frame contains two bit errors, "10100110," then the computed parity bit and the received parity bit are the same. So, if an even number of errors occurs in the received frame, we fail to detect the error.
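The parity computation of Example 3.1 can be sketched in a few lines of C. This is a minimal illustration, not code from the book: the function names `even_parity`, `parity_encode7`, and `parity_check8` are hypothetical, and the frame layout (7 data bits followed by the parity bit in the least significant position) is assumed from the example.

```c
#include <stdint.h>

/* Even-parity bit for the low `nbits` bits of `data`: returns 1 when the
   number of 1s is odd, so that data plus parity hold an even number of 1s. */
unsigned even_parity(uint32_t data, unsigned nbits)
{
    unsigned p = 0;
    for (unsigned i = 0; i < nbits; i++)
        p ^= (unsigned)((data >> i) & 1u);
    return p;
}

/* Build an 8-bit frame: 7 data bits followed by the parity bit (assumed
   layout, parity in the least significant bit). */
uint8_t parity_encode7(uint8_t data7)
{
    data7 &= 0x7Fu;
    return (uint8_t)((unsigned)(data7 << 1) | even_parity(data7, 7));
}

/* Returns 1 if the frame passes the even-parity check, 0 otherwise. */
int parity_check8(uint8_t frame)
{
    return even_parity((uint32_t)(frame >> 1), 7) == (unsigned)(frame & 1u);
}
```

With the data bits 1011001 (0x59), `parity_encode7` gives 0xB2 (10110010). The single-error frame 0xB6 (10110110) fails `parity_check8`, while the double-error frame 0xA6 (10100110) passes it, which is the undetected case described in the example.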
In some applications, we use long data frame lengths on the order of hundreds of bits, and the probability that an even number of errors occurs is also high. Hence, the error frame detection failure rate is also high with this even-parity bit error detection scheme. ■

With large data frames, to improve the error detection rate, we use more than 1 bit for parity data and compute these parity bits using data blocks (of length k bits) instead of individual data bits. In this way, we reduce the data length per parity bit by a factor of k. Also, we overcome the problem of burst error detection, as the parity computation module sees the burst errors as distributed across many parity bits. This is explained in Example 3.2.

■ Example 3.2
Assume that the following 7 data bytes, "B7, CA, 49, 62, AE, DD, 78," are set for transmission. Along with these 7 data bytes, assume that the allowed parity data is 1 byte. We compute a parity byte as 0x5D from the 7 data bytes and then transmit an 8-byte or 64-bit frame "B7, CA, 49, 62, AE, DD, 78, 5D." The first 7 bytes are data and the last byte is the parity byte. Here, we computed even parity across the data bytes, one bit position at a time; equivalently, the parity byte is the check-sum obtained by XORing all 7 data bytes. At the receiver, we again compute the check-sum of the 7 data bytes. If the check-sum computed at the receiver matches the parity byte, then we assume that the received 64-bit frame is error free. If the check-sum does not match the parity byte, then the received 64-bit frame contains errors and we may request retransmission of the entire frame.

1011 0111 (0xB7)
1100 1010 (0xCA)
0100 1001 (0x49)
0110 0010 (0x62)
1010 1110 (0xAE)
1101 1101 (0xDD)
0111 1000 (0x78) --> data bytes
0101 1101 (0x5D) --> parity byte

With this scheme, a burst of consecutive bit errors of any length, except multiples of 16 bits, can be detected.
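The parity-byte computation above amounts to a running XOR over the data bytes. The following sketch shows it under that reading; the names `xor_checksum` and `xor_frame_ok` are illustrative and do not come from the book.

```c
#include <stddef.h>
#include <stdint.h>

/* XOR all bytes together: bit j of the result is the even parity of
   bit j taken across every byte of the block. */
uint8_t xor_checksum(const uint8_t *buf, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum ^= buf[i];
    return sum;
}

/* A frame (data bytes followed by the parity byte) is accepted when the
   XOR over the whole frame is zero, i.e., the recomputed parity byte
   equals the received one. */
int xor_frame_ok(const uint8_t *frame, size_t len)
{
    return xor_checksum(frame, len) == 0;
}
```

For the 7 data bytes B7, CA, 49, 62, AE, DD, 78, `xor_checksum` returns 0x5D. If two adjacent bytes are inverted completely (for instance AE, DD received as 51, 22, a 16-bit run of errors), the check-sum is unchanged and the frame is accepted, which illustrates the exception for multiples of 16 bits.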
For example, assume that the received 8-byte data frame is "B7, CA, 49, 62, D1, 22, 78, 5D" with 15 continuous bit errors. If we compute the check-sum again at the receiver as

1011 0111 (0xB7)
1100 1010 (0xCA)
0100 1001 (0x49)
0110 0010 (0x62)
1101 0001 (0xD1)
0010 0010 (0x22)
0111 1000 (0x78) --> data bytes
1101 1101 (0xDD) --> parity byte

we get the check-sum 0xDD, which is different from the received parity byte 0x5D. However, if we have a burst of 16 bits, then we may fail to detect that frame as an error frame. ■

To reduce the overall failure rate or to improve the error detection rate significantly, error detection schemes based on CRC bits are widely used. In the next section, we discuss how to compute CRC bits from the given message data for use in error detection schemes.

3.2.2 Error Detection with CRC Bits

Error detection schemes based on a cyclic redundancy check are commonly used in many applications such as digital communications and computer data storage systems for detecting errors in the presence of noise. Similar to the parity schemes discussed in the previous section, we compute the CRC bits from the original (or payload) data at the transmitter and append the CRC as overhead to the original data before transmitting through a noisy channel. At the receiver, we again compute the CRC for the payload data and compare it with the received CRC bits to verify the integrity of the message.

In computing the CRC, we use a division operation instead of addition. While addition is clearly not strong enough to form an effective check-sum, it turns out that division gives better redundant data as long as the divisor is wide enough and satisfies certain criteria to be discussed later. The CRC bits are given by the remainder of the division operation. The CRC algorithms operate on blocks of data instead of on individual data bits. Usually, a CRC is performed through binary polynomial division with modulo-2 arithmetic.
The elements of modulo-2 arithmetic are from the Galois field GF(2). In short, GF(2) is a field consisting of the elements 0 and 1, with the + and * operations defined as logical XOR and logical AND operations for modulo-2 arithmetic. The modulo-2 addition or XOR (⊕) and modulo-2 multiplication or AND (∩) tables are as follows.

⊕ | 0 1
0 | 0 1
1 | 1 0

∩ | 0 1
0 | 0 0
1 | 0 1

Polynomial arithmetic with modulo 2 allows an efficient implementation of a form of division that is fast, easy to implement, and sufficient for the purpose of CRC computation. In the CRC computation, the choice of the divisor (from now onwards, we call it the "generator polynomial") plays an important role in obtaining a CRC with good characteristics. A well-chosen CRC generator polynomial ensures an evenly distributed mapping of message data to CRC values. A well-constructed CRC value over data blocks of limited size will detect any contiguous burst of errors shorter than the CRC data, any odd number of errors throughout the message, 2-bit errors anywhere in the message, and certain other possible errors anywhere in the message.

Next, we discuss the computation of CRC bits given the message data bits and generator polynomial. We represent all the inputs and outputs of the CRC module, such as the message data, the CRC generator, and the CRC value itself, in terms of bits and eventually in terms of polynomials in the computation of CRC bits. For example, a binary vector b = [10010101] is represented in polynomial form as follows:

$$b(x) = 1 \cdot x^7 + 0 \cdot x^6 + 0 \cdot x^5 + 1 \cdot x^4 + 0 \cdot x^3 + 1 \cdot x^2 + 0 \cdot x + 1 = x^7 + x^4 + x^2 + 1$$

For purposes of clarity, in Example 3.3 the division operation is performed separately using binary digits and the corresponding polynomials. Here, the division is performed in the same way as long division performed manually on paper.

■ Example 3.3
Let b = [10010101] be the dividend and g = [101] be the divisor of the division operation.
In polynomial notation, their equivalents are $b(x) = x^7 + x^4 + x^2 + 1$ and $g(x) = x^2 + 1$. The remainder of the division is obtained in vector form as c = [11] or in polynomial form as $c(x) = x + 1$.

Binary division of 10010101 by 101 (quotient 101110, remainder 11):
100 ⊕ 101 = 001; bring down 1 to get 011 (quotient bit 1)
011 ⊕ 000 = 011; bring down 0 to get 110 (quotient bit 0)
110 ⊕ 101 = 011; bring down 1 to get 111 (quotient bit 1)
111 ⊕ 101 = 010; bring down 0 to get 100 (quotient bit 1)
100 ⊕ 101 = 001; bring down 1 to get 011 (quotient bit 1)
011 ⊕ 000 = 011; no bits left, remainder 11 (quotient bit 0)

Polynomial division of $x^7 + x^4 + x^2 + 1$ by $x^2 + 1$ (quotient $x^5 + x^3 + x^2 + x$, remainder $x + 1$):
subtract $x^5 \cdot g(x) = x^7 + x^5$, leaving $x^5 + x^4 + x^2 + 1$
subtract $x^3 \cdot g(x) = x^5 + x^3$, leaving $x^4 + x^3 + x^2 + 1$
subtract $x^2 \cdot g(x) = x^4 + x^2$, leaving $x^3 + 1$
subtract $x \cdot g(x) = x^3 + x$, leaving $x + 1$

If b = [10010101] is the message vector and g = [101] is a generator vector, then the remainder c = [11] corresponds to the CRC value. We append the CRC bits to the message data as m = b|c and transmit to the receiver. ■

Note that even parity is a particular case of CRC: when the generator vector is g = [11] or $g(x) = x + 1$, the CRC output is the same as the even parity output, as computed in Example 3.4. For the same 8-bit message vector considered in Example 3.3, if we compute even parity, we get the parity bit as "0" since the number of ones present in the message vector is even. If we perform the division for the same message vector using the generator vector g = [11], then the CRC, that is, the remainder of the division, is obtained as "0." Therefore, even parity and CRC output the same redundant bit when the CRC uses the generator polynomial $g(x) = x + 1$.

■ Example 3.4
In this example, we compute the CRC with the generator polynomial $g(x) = x + 1$, and show that the CRC and even parity output the same redundant bit. First, we compute the even parity for the given 8-bit vector 10010101 as 0, since the number of 1s present in the given 8-bit vector is even. We add the 9th bit as "0" to make sure that the 9-bit data vector with parity added consists of an even number of 1s.
Next, we compute the CRC for the original 8-bit vector using 11 as the divisor as follows:

Binary division of 10010101 by 11 (quotient 1110011, remainder 0):
10 ⊕ 11 = 01; bring down 0 to get 10 (quotient bit 1)
10 ⊕ 11 = 01; bring down 1 to get 11 (quotient bit 1)
11 ⊕ 11 = 00; bring down 0 to get 00 (quotient bit 1)
00 ⊕ 00 = 00; bring down 1 to get 01 (quotient bit 0)
01 ⊕ 00 = 01; bring down 0 to get 10 (quotient bit 0)
10 ⊕ 11 = 01; bring down 1 to get 11 (quotient bit 1)
11 ⊕ 11 = 00; no bits left, remainder 0 (quotient bit 1)

Polynomial division of $x^7 + x^4 + x^2 + 1$ by $x + 1$ (quotient $x^6 + x^5 + x^4 + x + 1$, remainder 0):
subtract $x^6 \cdot g(x) = x^7 + x^6$, leaving $x^6 + x^4 + x^2 + 1$
subtract $x^5 \cdot g(x) = x^6 + x^5$, leaving $x^5 + x^4 + x^2 + 1$
subtract $x^4 \cdot g(x) = x^5 + x^4$, leaving $x^2 + 1$
subtract $x \cdot g(x) = x^2 + x$, leaving $x + 1$
subtract $1 \cdot g(x) = x + 1$, leaving 0

With the message vector 10010101 and generator vector 11, the CRC, which is the remainder of the division, is obtained as "0" as expected. From this, we can say that CRC and even parity compute the same bit when the generator polynomial used for the CRC is $g(x) = x + 1$. ■

We now explore how 1-bit errors and 2-bit errors are detected using a CRC. For this, we use Example 3.3 as the transmitter-side CRC generation. Here, we transmit a total of 10 bits (8 bits of message and 2 bits of CRC) as 10010101|11. We consider two cases as in Example 3.5. The first case deals with the received message vector a that contains one error, and the second case deals with the received message vector b that contains two errors. With the CRC, we are able to detect both the 1-bit and the 2-bit errors, as the CRC computed in both cases at the receiver is different from the received CRC.

■ Example 3.5
Assume that the transmitted message along with CRC bits is 10010101|11 and a noisy channel introduces a 1-bit error in the received sequence, say a = 10110101|11. If we then compute the CRC again for the received data as follows, we obtain CRC bits 01, which differ from the received CRC bits 11; hence, a single-bit error is detected.

Binary division of a = 10110101 by 101: quotient 100100, remainder 01.
Polynomial division of $x^7 + x^5 + x^4 + x^2 + 1$ by $x^2 + 1$ (quotient $x^5 + x^2$, remainder 1):
subtract $x^5 \cdot g(x) = x^7 + x^5$, leaving $x^4 + x^2 + 1$
subtract $x^2 \cdot g(x) = x^4 + x^2$, leaving 1

If the noisy channel introduces 2-bit errors in the received sequence, say b = 10001101|11, and we then compute the CRC again for the received data as follows, we obtain CRC 00, which is different from the received CRC 11; thus, a double-bit error is detected.
Binary division of b = 10001101 by 101: quotient 101001, remainder 00.
Polynomial division of $x^7 + x^3 + x^2 + 1$ by $x^2 + 1$ (quotient $x^5 + x^3 + 1$, remainder 0):
subtract $x^5 \cdot g(x) = x^7 + x^5$, leaving $x^5 + x^3 + x^2 + 1$
subtract $x^3 \cdot g(x) = x^5 + x^3$, leaving $x^2 + 1$
subtract $1 \cdot g(x) = x^2 + 1$, leaving 0
■

However, with short generator polynomials like $g(x) = x^2 + 1$, we may fail to detect double-bit errors in some cases. For example, suppose we receive the message d = 10111101|11, which corresponds to the transmitted message m = 10010101|11. The received message differs from the transmitted message in two bit places (the third and fifth bits). If we compute the CRC, the computed CRC will be the same as the received CRC (i.e., 11), so we fail to detect that the received message d contains a 2-bit error. In practice, we use 16-, 24-, or 32-bit generator vectors for generating CRC bits to significantly improve the error detection rate. As the message vector lengths are also very long, the overhead added by the CRC (i.e., 32 bits) is negligible.

Next, we introduce a few notations to simplify and efficiently compute the CRC bits using the LFSR (linear feedback shift register). Let $m(x)$, $g(x)$, and $c(x)$ represent a message polynomial of degree k − 1, a generator polynomial of degree n, and a CRC polynomial of degree at most n − 1, respectively. Then

$$m(x) = m_{k-1}x^{k-1} + m_{k-2}x^{k-2} + \cdots + m_1 x + m_0$$
$$g(x) = x^n + g_{n-1}x^{n-1} + \cdots + g_1 x + g_0$$
$$c(x) = c_{n-1}x^{n-1} + c_{n-2}x^{n-2} + \cdots + c_1 x + c_0$$

If $c(x)$ is the remainder when we divide $m(x)$ by $g(x)$, then $c(x) = m(x) \bmod g(x)$, where "mod" represents a modulo operation that outputs the remainder of the division operation. Since $c_{n-1}$ need not be non-zero, we cannot say that the degree of the CRC polynomial is exactly n − 1. However, to generate n CRC bits, we have to use the nth-degree generator polynomial $g(x)$.

[Figure 3.2: CRC computation using LFSR. The k message bits $m_{k-1} m_{k-2} \ldots m_1 m_0$ followed by n zero bits are shifted through the register stages $c_0, c_1, c_2, \ldots, c_{n-1}$ with taps $g_0, g_1, g_2, \ldots, g_{n-1}$. After k + n clock cycles: CRC = $c_{n-1} c_{n-2} \ldots c_2 c_1 c_0$.]

Then $m(x) = g(x) \cdot q(x) + c(x)$, where $q(x)$ is the quotient of the division.
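The relation between message, generator, quotient, and remainder can be checked numerically with a small modulo-2 division routine. The sketch below is only an illustration under simple assumptions: polynomials are packed into 32-bit words with bit i holding the coefficient of x^i, and the helper names `gf2_degree`, `gf2_divmod`, and `gf2_mul` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Degree of a GF(2) polynomial stored as a bit vector; returns -1 for
   the zero polynomial. */
int gf2_degree(uint32_t p)
{
    int d = -1;
    while (p) { d++; p >>= 1; }
    return d;
}

/* Modulo-2 long division: returns the remainder of m divided by g and,
   if q is not NULL, stores the quotient. g must be non-zero. */
uint32_t gf2_divmod(uint32_t m, uint32_t g, uint32_t *q)
{
    int dg = gf2_degree(g);
    uint32_t quot = 0;
    for (int d = gf2_degree(m); d >= dg; d = gf2_degree(m)) {
        m ^= g << (d - dg);        /* subtract (XOR) the aligned divisor */
        quot |= 1u << (d - dg);    /* record the quotient term x^(d-dg) */
    }
    if (q != NULL)
        *q = quot;
    return m;
}

/* Carry-less (modulo-2) multiplication, used to check m = g*q + c. */
uint32_t gf2_mul(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (; b; b >>= 1, a <<= 1)
        if (b & 1u)
            r ^= a;
    return r;
}
```

Dividing 0x95 (10010101) by 0x5 (101) yields the quotient 0x2E (101110) and the remainder 0x3 (11), and `gf2_mul(0x5, 0x2E) ^ 0x3` gives back 0x95. The received vectors of Example 3.5 and the undetected case d = 10111101 produce the remainders 01, 00, and 11, respectively.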
As the CRC bits (of length n) are appended at the end of the message, we shift the message vector left by n bits (that is, multiply the message polynomial by $x^n$) and append the n CRC bits. The CRC that is transmitted is therefore the remainder of the shifted message, $c(x) = (m(x) \cdot x^n) \bmod g(x)$, so that $m(x) \cdot x^n = g(x) \cdot Q(x) + c(x)$, where $Q(x)$ is the quotient of this division. In general, this remainder differs from $m(x) \bmod g(x)$; in Example 3.3 the two coincide because $x^2 \bmod (x^2 + 1) = 1$. Because addition and subtraction are the same operation in modulo-2 arithmetic, we can also write $m(x) \cdot x^n + c(x) = g(x) \cdot Q(x)$. At the receiver, if we compute the CRC for the entire k + n bits, that is, by dividing $m(x) \cdot x^n + c(x)$ by $g(x)$, and we get zero as the remainder, then we assume no errors are present in the received message, as the message is a multiple of the generator polynomial $g(x)$.

The LFSR is commonly used to compute the remainder of the division of a message polynomial by a generator polynomial. As shown in Figure 3.2, the LFSR consists of n shift registers and uses the generator polynomial coefficients as its taps. Here, the size of the LFSR is the same as the number of CRC bits. We shift the message left by n bits and pass it through the LFSR 1 bit per clock cycle. After passing all k + n bits through the LFSR, the state of the LFSR gives the CRC bits as shown in Figure 3.2. Then we append the CRC to the message as $m_{k-1} \ldots m_1 m_0 c_{n-1} \ldots c_1 c_0$ and transmit it through a noisy channel to the receiver. The LFSR-based CRC computation is explained in Example 3.6.

■ Example 3.6
We consider the message polynomial used in Example 3.3 to compute the CRC using the LFSR. The state of the shift registers after each clock cycle is tabulated as follows. After k + n cycles (i.e., 8 + 2 = 10), the CRC is given by the shift register values as shown in Figure 3.3.

[Figure 3.3: Illustration of LFSR-based CRC computation. The input 1001010100 is shifted through the two-stage register c0, c1. After 10 clock cycles: CRC = c1c0 = 11.]

LFSR State Table
Cycle  c0  c1
1      1   0
2      0   1
3      1   0
4      1   1
5      1   1
6      0   1
7      1   0
8      1   1
9      1   1
10     1   1
■

The CRC computation using the LFSR can be efficiently implemented in hardware.
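The state table of Example 3.6 can be reproduced with a short bit-serial simulation. The function below is a sketch for illustration only; the name `lfsr_divide`, its parameter list, and the optional `trace` array are assumptions rather than part of the book's code.

```c
#include <stdint.h>

/* Bit-serial LFSR division for a generator of degree n (1 <= n <= 31).
   `gen` holds the generator bits including the x^n term (0x5 = [101]).
   `bits` is the input stream, bits[0] shifted in first, each element 0
   or 1. If `trace` is non-null, the register contents after every clock
   cycle are stored in trace[0..nbits-1], with c0 in bit 0 and c1 in
   bit 1. Returns the final register state. */
uint32_t lfsr_divide(uint32_t gen, unsigned n,
                     const uint8_t *bits, unsigned nbits, uint32_t *trace)
{
    uint32_t state = 0;
    uint32_t top = 1u << (n - 1);
    for (unsigned i = 0; i < nbits; i++) {
        uint32_t feedback = state & top;        /* bit about to be shifted out */
        state = (state << 1) | (uint32_t)(bits[i] & 1u);
        if (feedback)
            state ^= gen;                       /* also clears the x^n bit */
        if (trace)
            trace[i] = state;
    }
    return state & ((1u << n) - 1u);
}
```

Feeding the ten bits 1001010100 with `gen = 0x5` and `n = 2` returns 0x3, that is, c1c0 = 11, and the traced states match the ten rows of the LFSR state table.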
An equivalent software implementation of the LFSR-based CRC computation is given in Pcode 3.1. The input message vector is stored in a buffer, and the code loads one 32-bit word at a time from the buffer to pass the data through the LFSR. We pass n zero bits at the end of the message data to get the final CRC bits from the shift registers. As Pcode 3.1 computes the CRC bits by processing the message data bit by bit, it is not an efficient implementation, especially when the input is very long and the LFSR length is on the order of 32 bits.

// r0..r4 are assumed to be 32-bit unsigned registers
k = pCRC->message_length;          // length of input message in bits
n = pCRC->crc_length;              // length of CRC bits
r2 = 0;                            // initialize LFSR
r4 = pCRC->gen_poly;               // generator vector, [101]
r0 = *pCRC->in_data++;             // first 32-bit word of the message
mask = pCRC->extract_crc;
m = k >> 5;                        // number of complete 32-bit words
tb = 1 << (n-1);                   // mask of the LFSR bit shifted out next
if (m != 0) {                      // if k >= 32 bits
    for(j = 0;j < m;j++) {
        for(i = 0;i < 32;i++) {
            r1 = r0 >> 31;         // next message bit, MSB first
            r3 = r2 & tb;
            r2 = r2 << 1; r2 = r2 | r1;
            if (r3) r2 = r2 ^ r4;
            r0 = r0 << 1;
        }
        r0 = *pCRC->in_data++;
    }
}
m = k-32*m;
if (m != 0){                       // if k%32 is not zero
    for(i = 0;i < m;i++) {
        r1 = r0 >> 31; r3 = r2 & tb;
        r2 = r2 << 1; r2 = r2 | r1;
        if (r3) r2 = r2 ^ r4;
        r0 = r0 << 1;
    }
}
if (pCRC->enc_flag){               // to compute CRC at transmitter side
    r0 = 0;
    for(i = 0;i < n;i++) {         // passing n zero bits
        r1 = r0 >> 31; r3 = r2 & tb;
        r2 = r2 << 1; r2 = r2 | r1;
        if (r3) r2 = r2 ^ r4;
        r0 = r0 << 1;
    }
}
else {                             // to verify CRC at receiver side
    for(i = 0;i < n;i++) {         // passing the n received CRC bits
        r1 = r0 >> 31; r3 = r2 & tb;
        r2 = r2 << 1; r2 = r2 | r1;
        if (r3) r2 = r2 ^ r4;
        r0 = r0 << 1;
        m = m + 1;
        if (m==32) {               // current word exhausted, load the next one
            r0 = *pCRC->in_data++;
            m = 0;
        }
    }
}
r2 = r2 & mask;
pCRC->crc_bits = r2;

Pcode 3.1: CRC implementation using bit-by-bit method.

In this approach, we consume up to 8 cycles per message bit (see Appendix A, Section A.4, on the companion website for more details on cycle requirements to execute particular operations on the reference embedded processor).
In the next section, we discuss an efficient block-based software implementation of CRC32 using look-up tables.

3.2.3 CRC32

As discussed earlier, to get CRC bits with good characteristics, we need more CRC bits; hence, longer generator polynomials are used to get wider CRC data. In the industry, the following four CRC generator polynomials are popularly used:

CRC-12: g = [1100000001111] or $g(x) = x^{12} + x^{11} + x^3 + x^2 + x + 1$
CRC-16: g = [11000000000000101] or $g(x) = x^{16} + x^{15} + x^2 + 1$
CRC-CCITT: g = [10001000000100001] or $g(x) = x^{16} + x^{12} + x^5 + 1$
CRC-32: Used in Ethernet, g = [100000100110000010001110110110111] or $g(x) = x^{32} + x^{26} + x^{23} + x^{22} + x^{16} + x^{12} + x^{11} + x^{10} + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1$

In this section, we concentrate on an efficient implementation of CRC32 using look-up tables. To understand the look-up-table-based CRC computation, we consider the CRC computation for small-size data b = [101100110110] with the generator polynomial $g(x) = x^4 + x + 1$ or g = [10011], as given in Example 3.7. We compute the intermediate CRC value for 4 bits at a time instead of 1 bit.

■ Example 3.7
In this example, we compute CRC bits on a block basis instead of bit by bit. As shown in the following, we consider the first 4 bits, "1011," of the message vector, compute the intermediate CRC, add this CRC to the next 4 bits of the message, shift the message left by 4 bits, and continue this process. The length of the intermediate CRC depends on the degree of the generator polynomial. Here, the degree of the generator polynomial is 4; hence, the intermediate CRC output contains 4 bits (see Figure 3.4).
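The block-wise procedure just described can be sketched for the 4-bit generator g = [10011]. This is an illustrative sketch, not the book's code: the names `crc4_of_nibble` and `crc4_blockwise` are hypothetical, and the sketch assumes that the value left after the last block is taken as the CRC of the message followed by four zero bits.

```c
#include <stdint.h>

#define GEN4 0x13u   /* g = [10011], i.e., x^4 + x + 1 */

/* Intermediate CRC of one 4-bit block: remainder of (v * x^4) divided by
   g(x), computed bit by bit as in a manual long division. */
uint8_t crc4_of_nibble(uint8_t v)
{
    uint32_t r = (uint32_t)(v & 0xFu) << 4;      /* append four zero bits */
    for (int bit = 7; bit >= 4; bit--)
        if (r & (1u << bit))
            r ^= GEN4 << (bit - 4);
    return (uint8_t)(r & 0xFu);
}

/* Block-based CRC over `count` nibbles (most significant nibble first).
   The running CRC is XORed into the next 4 message bits and the result
   indexes a 16-entry look-up table. */
uint8_t crc4_blockwise(const uint8_t *nibbles, unsigned count)
{
    uint8_t lut[16];
    for (unsigned v = 0; v < 16; v++)
        lut[v] = crc4_of_nibble((uint8_t)v);

    uint8_t crc = 0;
    for (unsigned i = 0; i < count; i++)
        crc = lut[(crc ^ nibbles[i]) & 0xFu];
    return crc;
}
```

For the message blocks 1011, 0011, 0110, the table is indexed with 1011, 1101, and 0010, and the intermediate CRC values are 1110, 0100, and 0110. Under the stated assumption, `crc4_blockwise` returns 0110 for this message.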
[Figure 3.4: Illustration of look-up-table-based CRC computation. The figure shows the bit-by-bit divisions that give the intermediate CRC values: 1011 gives 1110, 1101 gives 0100, and 0010 gives 0110.] ■

From Example 3.7, it is clear that the intermediate CRC values can be obtained from a precomputed look-up table that contains the intermediate CRC values for all possible 4-bit combinations of input values. The intermediate CRC is accessed from the look-up table using the input 4- (= p) bit number as an offset into the look-up table. The look-up-table-based CRC computation scheme works as shown in Figure 3.5.

[Figure 3.5: Block-based CRC implementation using a look-up table. p message bits index a look-up table of $2^p$ entries, which outputs the n-length intermediate CRC bits.]

We generate the values for the look-up table to implement an Ethernet CRC32 scheme. We represent the CRC32 generator polynomial in short by using the hexadecimal notation as G = 0x04c11db7. The program in Pcode 3.2 follows the same approach used in Example 3.7 to generate the look-up table entries, but computes 32-bit intermediate CRC values (as the degree of CRC32 is 32) using 8-bit length bitstream combinations (since 8 bits are easily accessed from buffers when compared to 4 bits). As we can represent 256 possible levels with 8 bits, the loop runs 256 times and generates intermediate 32-bit CRC values for all 256 combinations. The CRC32_LUT[ ] look-up table on the companion website contains the intermediate 32-bit CRC values for an Ethernet CRC generator polynomial with an input of all 8-bit combinations.

r2 = pCRC->gen_poly;
for(i = 0;i < 256;i++){
    r0 = (i<<24);
    for(j = 0;j < 8;j++){
        r1 = r0 >> 31;
        r0 = r0 << 1;
        if (r1) r0 = r0 ^ r2;
    }
    CRC_LUT[i] = r0;
}

Pcode 3.2: Block-based CRC look-up table generation.
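For readers who want to run the table generation on a desktop compiler, the loop of Pcode 3.2 can be written as a self-contained function. The sketch below assumes 32-bit unsigned arithmetic; the names `CRC32_POLY` and `crc32_make_table` are illustrative and are not taken from the companion code.

```c
#include <stdint.h>

#define CRC32_POLY 0x04C11DB7u   /* Ethernet generator without the x^32 term */

/* Fill table[i] with the remainder of (i * x^32) divided by g(x). */
void crc32_make_table(uint32_t table[256])
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t r = i << 24;                  /* byte in the top 8 bits */
        for (int j = 0; j < 8; j++) {
            uint32_t msb = r & 0x80000000u;    /* bit shifted out this step */
            r <<= 1;
            if (msb)
                r ^= CRC32_POLY;
        }
        table[i] = r;
    }
}
```

Because the division is linear over GF(2), the entry for a XOR of two indexes equals the XOR of the two entries; for example, `table[0x1C]` equals `table[0x10] ^ table[0x08] ^ table[0x04]`, which is 0x791D4014. This property gives a quick sanity check on a generated table.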
Once the look-up table of intermediate CRC values for all possible combinations of 8-bit data is generated, computing the CRC of the message data is very simple. As given in Pcode 3.3, we extract 8 bits from the message, get the 32-bit intermediate CRC value from the look-up table CRC32_LUT[ ], XOR this value with the next 32 bits of message data, and continue the same process until the end of the message. The look-up-table-based CRC32 computation requires 1 kB of data memory to store a 256-element look-up table and consumes about 4 cycles per byte or 0.5 cycles per bit, whereas the bit-by-bit CRC computation given in Pcode 3.1 consumes about 8 cycles per bit. Example 3.8 describes the application of a 32-bit CRC with a small test vector.

r0 = 0;
for(i = 0;i < pCRC->message_length_bytes;i++){
    r1 = (r0 >> 24) & 0xff;               // extract 8-bit or byte of data
    r0 = (r0 << 8) | *pCRC->data_in++;    // append next byte
    r0 = r0 ^ CRC_LUT[r1];                // XOR with look-up table output
}
if (pCRC->enc_flag){                      // to generate CRC
    for(i = 0;i < 4;i++){                 // pass four zero bytes
        r1 = (r0 >> 24) & 0xff;
        r0 = (r0 << 8);
        r0 = r0 ^ CRC_LUT[r1];
    }
}
else {                                    // to verify CRC
    for(i = 0;i < 4;i++){                 // pass the four received CRC bytes
        r1 = (r0 >> 24) & 0xff;
        r0 = (r0 << 8) | *pCRC->data_in++;
        r0 = r0 ^ CRC_LUT[r1];
    }
}
pCRC->crc_bits = r0;

Pcode 3.3: Look-up table based CRC32 implementation.

■ Example 3.8
Let G = 0x04c11db7 be the CRC32 generator polynomial represented in hexadecimal notation. Assume that the 2 bytes "0x1c, 0x11" are intended for transmission. The CRC32 is computed using Pcode 3.3 and its 32-bit CRC value is given by "0x97ed3f2f." We append the CRC32 to the data bytes and transmit "0x1c, 0x11, 0x97, 0xed, 0x3f, 0x2f." At the receiver, we compute the CRC32 again and detect error frames, if any. As we compute the CRC32 for the entire frame including the transmitted CRC32, we get "0x00000000" as the CRC32 value at the receiver if no errors are present in the received data frame.
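The numbers quoted in this example can be reproduced with a compact, self-contained version of the table-driven computation. The sketch below follows the register update of Pcode 3.3 (no initial value, no bit reflection, no final XOR); the function name `crc32_shift` and the `append_zeros` flag, which plays the role of the encoder flag, are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Shift-register CRC32 in the style of Pcode 3.3. With append_zeros != 0,
   four zero bytes are pushed after the data, so the result is the CRC to
   transmit. With append_zeros == 0 the result is the plain remainder of
   the bytes given, which is zero for an error-free frame that ends with
   its CRC. */
uint32_t crc32_shift(const uint8_t *data, size_t len, int append_zeros)
{
    uint32_t lut[256];
    for (uint32_t i = 0; i < 256; i++) {        /* same table as Pcode 3.2 */
        uint32_t r = i << 24;
        for (int j = 0; j < 8; j++)
            r = (r & 0x80000000u) ? ((r << 1) ^ 0x04C11DB7u) : (r << 1);
        lut[i] = r;
    }

    uint32_t reg = 0;
    size_t total = len + (append_zeros ? 4u : 0u);
    for (size_t i = 0; i < total; i++) {
        uint8_t next = (i < len) ? data[i] : 0;
        uint32_t out = reg >> 24;                /* byte leaving the register */
        reg = ((reg << 8) | next) ^ lut[out];
    }
    return reg;
}
```

Calling `crc32_shift` on the two bytes 0x1c, 0x11 with `append_zeros` set returns 0x97ED3F2F, and calling it on the six-byte frame 0x1c, 0x11, 0x97, 0xed, 0x3f, 0x2f with `append_zeros` cleared returns zero.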
We verify the CRC32 (using the same code given in Pcode 3.3 with pCRC->enc_flag set to zero) in the following, assuming three cases: the received data contains zero errors, one error, and two errors.

Case 1: Zero errors
Received data: 0x1c, 0x11, 0x97, 0xed, 0x3f, 0x2f
Computed CRC32 at the receiver: 0x00000000
Result: No errors are present in the received data frame

Case 2: One-bit error
Received data: 0x1d, 0x11, 0x97, 0xed, 0x3f, 0x2f
Computed CRC32 at the receiver: 0xd219c1dc
Result: Errors are present in the received data frame

Case 3: Two-bit errors
Received data: 0x1c, 0x14, 0x97, 0xed, 0x3f, 0x2f
Computed CRC32 at the receiver: 0x17c56b6b
Result: Errors are present in the received data frame
■

In essence, with CRC32, we can ensure the following:
• 100% detection of single-bit errors
• 100% detection of all double-bit errors (except those errors that are separated by $2^{32} - 1$ bits)
• 100% detection of any errors spanning up to 32 bits

With one-way communications systems (e.g., broadcast systems), error detection schemes alone are not used, as one-way communications systems cannot request retransmission. In the next section, we discuss error correction algorithms, which not only detect errors but also correct them.

3.3 Block Codes

In the previous section we discussed how parity check or cyclic redundancy check bits are used to detect errors in a received data block. With the error detection methods, we request retransmission of data frames after detecting errors in the received data. In this section, we introduce a few concepts with which we not only detect errors but also correct them. This is called forward error correction (FEC). With FEC, we need not request retransmission of data frames, as those errors are corrected at the receiver with error correction algorithms.
All errors can be corrected if the number of errors that occurred in the received data is less than or equal to the capability of the particular FEC algorithm used. Before discussing the theory behind block codes, we consider two examples to get a feel for error correction (see Examples 3.9 and 3.10). Then we introduce linear block codes and discuss encoding and decoding techniques for simple codes. In the later sections, we discuss various types of powerful linear block codes and convolutional codes.

■ Example 3.9
Assume that we want to transmit 4 bits, "1,0,1,1," and we would like to receive them exactly without any errors. In the presence of noise, error-free reception is not guaranteed. If we receive the bits as "1,0,0,1," we cannot say anything from those bits; we don't know whether they are error free or not, since we don't have any extra information about them. If we append a few bits to "1,0,1,1" in a specific manner, then it is possible to know whether errors are present or not, and we can correct those errors. For example, we repeat each bit three times, as "111, 000, 111, 111," and transmit these 3-bit blocks through a noisy channel. That means, for each message bit, we are transmitting 3 bits. At the receiver, we receive those 3-bit blocks as "111, 000, 101, 111," with 1 bit in error in the third block (its middle bit). We apply a simple decoding procedure: if more zeros than ones are present in a block, we decode it as bit "0," and if more ones are present, we decode it as bit "1." With this decoding procedure we get the decoded bits "1,0,1,1." Although there is 1 bit in error in the received sequence, we are able to recover the transmitted data bits without errors by repeating each bit three times. We call it a repetition code. With a 3-bit repetition code, we can only correct one error per 3-bit block.
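The repetition code and its majority-vote decoder can be sketched as follows. The sketch stores one bit per array element for clarity; the function names `rep3_encode` and `rep3_decode` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode: repeat every message bit three times. `out` must hold 3*n bits,
   one bit per array element. */
void rep3_encode(const uint8_t *msg, size_t n, uint8_t *out)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t bit = (uint8_t)(msg[i] & 1u);
        out[3 * i] = out[3 * i + 1] = out[3 * i + 2] = bit;
    }
}

/* Decode by majority vote over each 3-bit block. `out` must hold n bits. */
void rep3_decode(const uint8_t *rx, size_t n, uint8_t *out)
{
    for (size_t i = 0; i < n; i++) {
        unsigned ones = (unsigned)rx[3 * i] + rx[3 * i + 1] + rx[3 * i + 2];
        out[i] = (ones >= 2) ? 1 : 0;
    }
}
```

Encoding 1,0,1,1 gives 111, 000, 111, 111, and decoding the received blocks 111, 000, 101, 111 returns 1,0,1,1. Two errors in the same block would flip the majority and be decoded incorrectly.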
The disadvantage of this code is that the bandwidth (i.e., the number of bits transmitted per unit time) is increased by a factor of three. In a communications system, we want to keep the data bandwidth as low as possible. Also, this code treats each bit as an individual block, and the channel may not introduce errors in each 3-bit block. Usually, block codes are used in a digital communications system as an outer coder, where we will have a bit-error rate (BER) of $10^{-2}$ or less. The BER is computed as the ratio of the total number of error bits to the total number of received bits. So, a BER of $10^{-2}$ or less means that there will be a 1-bit error in 100 or more received bits. ■

■ Example 3.10
We introduce another approach, where we treat a chunk of bits as one block (we call it a "message block") and add redundant bits per message block instead of to each individual bit. For example, we consider the same bits for transmission as before, but as one block, "1011" (i.e., input message block length k = 4). Let B = b0b1b2b3 = 1011. We add 3 bits, "p0p1p2" (we call them "parity bits"), to this block B, and form a new message block (we call it a message codeword) by arranging parity bits and data bits as "p0p1b0p2b1b2b3" (i.e., output message codeword length n = 7). The parity bits p0, p1, and p2 are calculated from a matrix arrangement of data and parity bits as shown in the following:

     00   01   10   11
0    -    p0   p1   b0
1    p2   b1   b2   b3

In this matrix, each bit can be identified with a row index and a column index. For example, p0: (0, 01), p1: (0, 10), b0: (0, 11), and so on. We ignore the "," and form one binary string to get the index for message and parity bits as p0: 001 (1), p1: 010 (2), b0: 011 (3), and so on. If we observe the preceding matrix carefully, the parity bits are placed such that their corresponding index is a power of 2 (i.e., p0: 001, p1: 010, p2: 100).
The parity bit p0 (note that the 0th bit of its index is 1) is calculated by XORing the data bits at indexes where the 0th bit of the index is 1 (i.e., bits b0, b1, and b3). Similarly, the parity bit p1 (note that the 1st bit of its index is 1) is calculated by XORing the data bits at indexes where the 1st bit of the index is 1 (i.e., bits b0, b2, and b3). Finally, the parity bit p2 (note that the 2nd bit of its index is 1) is calculated by XORing the data bits at indexes where the 2nd bit of the index is 1 (i.e., bits b1, b2, and b3). With this, the parity bits p0, p1, and p2 are computed as follows:

p0 = b0 ⊕ b1 ⊕ b3
p1 = b0 ⊕ b2 ⊕ b3
p2 = b1 ⊕ b2 ⊕ b3

From the preceding equations, the parity bits are computed using the message block bits b0b1b2b3 = 1011 as p0 = 0, p1 = 1, and p2 = 0. We transmit the message codeword p0p1b0p2b1b2b3 = 0110011 through a noisy channel to the receiver. Assume the codeword bits p0p1b0p2b1b2b3 = 0110111 are received at the receiver with 1 bit in error (the fifth bit, b1). Next, we discuss a method to correct the bit that is received with an error.

From the received data block, we separate the data bits and parity bits as b0b1b2b3 = 1111 and p0p1p2 = 010. We compute the metrics, called syndromes, S0S1S2 (which give an indication of an error if one is present) from the received data bits b0b1b2b3 as S0 = (q0 ⊕ p0), S1 = (q1 ⊕ p1), and S2 = (q2 ⊕ p2). Here, we compute q0q1q2 from the received data bits in the same way as the parity bits are computed, just by replacing the p's with q's in the preceding parity equations. With this, q0 = 1, q1 = 1, q2 = 1, and S0S1S2 = 101. The index (S2, S1S0) = (1, 01) gives the bit position where the error occurred. That means, from the preceding table, the index (1, 01) says b1 is in error. As the codeword contains only binary digits (or bits), if we toggle bit b1 in the received sequence, then we get the corrected sequence 0110011, which is the same as the transmitted sequence. If S0S1S2 = 000, then no errors are present in the received data block.
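The encoding and syndrome decoding steps of this example can be sketched as follows. The array layout (positions 1 to 7 holding p0 p1 b0 p2 b1 b2 b3, with element 0 unused) is chosen here so that the syndrome value can be used directly as an index; it and the function names `hamming74_encode` and `hamming74_correct` are illustrative assumptions.

```c
#include <stdint.h>

/* Codeword layout follows the example: positions 1..7 hold
   p0 p1 b0 p2 b1 b2 b3 (element 0 of the array is unused). */
void hamming74_encode(const uint8_t b[4], uint8_t cw[8])
{
    cw[0] = 0;
    cw[3] = b[0]; cw[5] = b[1]; cw[6] = b[2]; cw[7] = b[3];
    cw[1] = (uint8_t)(b[0] ^ b[1] ^ b[3]);  /* p0: indexes with bit 0 set */
    cw[2] = (uint8_t)(b[0] ^ b[2] ^ b[3]);  /* p1: indexes with bit 1 set */
    cw[4] = (uint8_t)(b[1] ^ b[2] ^ b[3]);  /* p2: indexes with bit 2 set */
}

/* Computes the syndrome bits S0, S1, S2, toggles the bit at index
   (S2 S1 S0) if that index is non-zero, and returns the index
   (0 means no error was indicated). */
unsigned hamming74_correct(uint8_t cw[8])
{
    unsigned s0 = (unsigned)(cw[1] ^ cw[3] ^ cw[5] ^ cw[7]);
    unsigned s1 = (unsigned)(cw[2] ^ cw[3] ^ cw[6] ^ cw[7]);
    unsigned s2 = (unsigned)(cw[4] ^ cw[5] ^ cw[6] ^ cw[7]);
    unsigned pos = (s2 << 2) | (s1 << 1) | s0;
    if (pos)
        cw[pos] ^= 1u;
    return pos;
}
```

For B = 1011 the encoder produces 0110011. After toggling position 5 (bit b1) to model the channel error, `hamming74_correct` returns 5 and restores the transmitted codeword. Note that this decoder assumes at most one bit error per codeword.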
■

In the approach discussed in Example 3.10, we added three extra bits to the original 4-bit message block at the transmitter to correct a single-bit error in the received block. To compare the two methods discussed above, we define a term called code rate (Rc) as the ratio of the message block length (k) to the message codeword length (n). In Example 3.9, k = 1, n = 3, and Rc = k/n = 1/3. In Example 3.10, k = 4, n = 7, and Rc = k/n = 4/7. A higher code rate means that less transmission bandwidth is needed. Hence, the second method requires less bandwidth to correct 1 bit per transmitted message block. Therefore, from here onwards, we concentrate on the second method and build the framework for block codes on it.

We rearrange the codeword p0 p1 b0 p2 b1 b2 b3 as B|P = b0b1b2b3|p0 p1 p2 to compute the block codes in a systematic way by using the matrix representation. Here, we basically compute the parity bits and append them to the message block to form a codeword. If the input message block length is k and the output codeword length is n, then we refer to such a code as an (n, k) code. If the original message block is present unchanged in the output codeword (as in B|P), then we call such a code an (n, k) systematic code. A code that is not systematic is called a nonsystematic code. Given a message block B, we compute the codeword C = B|P using the generator matrix G as:

\[
C = B \cdot G \tag{3.1}
\]

where $G = [I_k \mid P]$. One example of a generator matrix follows:

\[
G = [I_k \mid P] =
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 1 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 1 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 1 & 1 & 1
\end{bmatrix} \tag{3.2}
\]

In this matrix multiplication, we perform additions using the modulo-2 or XOR operation. In decoding this (n, k) systematic codeword, we use another matrix called the parity check matrix $H = [P^T \mid I_{n-k}]$.
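Because all additions are modulo 2, the product C = B·G of Equation (3.1) reduces to XORing the rows of G selected by the 1 bits of the message. The sketch below illustrates this for the generator matrix of Equation (3.2); the name `encode_with_g`, the table `G_ROWS`, and the bit packing (codeword bits b0 b1 b2 b3 p0 p1 p2 stored MSB-first in a 7-bit value) are assumptions made for the illustration.

```c
#include <stdint.h>

/* Rows of G from Equation (3.2), each packed MSB-first as
   b0 b1 b2 b3 p0 p1 p2: 1000110, 0100101, 0010011, 0001111. */
static const uint8_t G_ROWS[4] = { 0x46, 0x25, 0x13, 0x0F };

/* C = B.G over GF(2). msg holds b0 b1 b2 b3 in bits 3..0. */
uint8_t encode_with_g(uint8_t msg)
{
    uint8_t cw = 0;
    for (int i = 0; i < 4; i++) {
        if (msg & (0x8 >> i))   /* message bit b_i is 1 */
            cw ^= G_ROWS[i];    /* modulo-2 addition of row i */
    }
    return cw;
}
```

For the message 1011 this returns 1011010, that is, the systematic codeword b0b1b2b3|p0 p1 p2 = 1011|010, which carries the same parity bits as in Example 3.10.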
The H matrix corresponding to G given in Equation (3.2) follows:

\[
H = [P^T \mid I_{n-k}] =
\begin{bmatrix}
1 & 1 & 0 & 1 & 1 & 0 & 0 \\
1 & 0 & 1 & 1 & 0 & 1 & 0 \\
0 & 1 & 1 & 1 & 0 & 0 & 1
\end{bmatrix} \tag{3.3}
\]

This satisfies $G \cdot H^T = 0$, resulting in a $k \times (n-k)$ matrix with all zero elements, and $C \cdot H^T = 0$, resulting in an $(n-k)$-element row vector.

3.3.1 Linear Block Codes
A block code with input message length k and output codeword length n is referred to as an (n, k) code. With the (n, k) code, we append (n − k)-length data as parity to the k-length input message to form an n-length codeword. At the receiver, we use this (n − k)-length parity data to correct the data errors present in the received sequence. A subclass of block codes, known as linear block codes, is commonly used, as they have efficient decoding methods. If k is the input-message-block length, then we can compute a set of $2^k$ n-length codewords {C} using the generator matrix G. A block code is called a linear block code if the addition of two codewords from {C} results in another codeword that belongs to the same block code set {C}.

The performance of such a linear block code depends on the minimum Hamming distance (dmin) between the codewords of set {C}. As we are working with binary digits {0, 1}, the Hamming distance between two binary codewords B and C is defined as the number of positions in which the codewords differ in their bit values. If the weight of a codeword is defined as the number of ones present in the codeword, then the Hamming distance between two codewords B and C is also computed as the weight of the codeword D, where D is obtained by adding the two codewords B and C. In the case of linear block codes, as the addition of two codewords results in another codeword, the weight of a particular codeword represents the Hamming distance between some other two codewords. Therefore, the minimum Hamming distance of the code {C} is given by the minimum weight of the non-zero codewords of set {C}. Since $C \cdot H^T = 0$, the column vectors of H are linearly dependent if C is a non-zero codeword.
If C is a codeword with minimum weight dmin, then there are dmin columns of H that are linearly dependent. Equivalently, because no non-zero codeword has a weight below dmin, every set of dmin − 1 columns of H is linearly independent. Since H has only n − k rows, as in Equation (3.3), at most n − k of its columns can be linearly independent, so n − k ≥ dmin − 1. Therefore, dmin is upper-bounded, as in

\[
d_{min} \le n - k + 1 \tag{3.4}
\]

In the case of linear block codes with minimum distance dmin, we can correct at most $\lfloor (d_{min} - 1)/2 \rfloor$ errors. For example, the code of Example 3.10 has dmin = 3, so $\lfloor (d_{min} - 1)/2 \rfloor = 1$. That means we can correct at most one error with the (7, 4) code discussed in Example 3.10.

Decoding with Linear Block Codes
At the receiver, the matrix H is used to compute the syndrome vector S as

\[
S = R \cdot H^T \tag{3.5}
\]

where R is the n-length received noisy codeword corresponding to the transmitted codeword C.

■ Example 3.11
In Example 3.10, we used parity equations to compute the parity bits. We can also use Equation (3.1) to compute the parity bits and to obtain a systematic codeword C = [b0b1b2b3|p0 p1 p2] = [1011|010]. If the error occurred at bit b1 in the received sequence, then the received noisy vector R is given by [b0b1b2b3|p0 p1 p2] = [1111|010]. Then, using Equation (3.5), the syndrome vector S = [S0 S1 S2] is computed as follows:

\[
S = R \cdot H^T = [1\ 1\ 1\ 1\ 0\ 1\ 0]
\begin{bmatrix}
1 & 1 & 0 \\
1 & 0 & 1 \\
0 & 1 & 1 \\
1 & 1 & 1 \\
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix}
= [1\ 0\ 1]
\]

If b0 is in error instead of b1, then R = [0011|010] and we get S = R · H^T as [110]. The error location is given by (S2, S1S0) = (0, 11), which corresponds to b0. What will happen if more than one error occurs? Let us assume that the bits p0 and b2 are received in error; then the received codeword is R = [1001|110]. From Equation (3.5), S = [111] or (1, 11), which is the location of b3. So, we have not corrected the two errors, but we know there are errors in the received sequence, as the syndrome vector is non-zero.
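The three syndrome computations of this example can be reproduced with a few lines of C. The sketch below is illustrative only: the names `syndrome74`, `correct74`, and `H_ROWS`, and the packing of a 7-bit vector as b0 b1 b2 b3 p0 p1 p2 (MSB first), are assumptions. Correction here is done by matching the syndrome against the columns of H, which is one common way to locate a single error and not necessarily the method the text goes on to use.

```c
#include <stdint.h>

/* Rows of H from Equation (3.3), packed MSB-first as
   b0 b1 b2 b3 p0 p1 p2: 1101100, 1011010, 0111001. */
static const uint8_t H_ROWS[3] = { 0x6C, 0x5A, 0x39 };

/* Parity (XOR of all bits) of a 7-bit value. */
static int parity7(uint8_t v)
{
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return v & 1;
}

/* S = R.H^T, returned as the 3-bit value S0 S1 S2 (S0 is the MSB). */
uint8_t syndrome74(uint8_t r)
{
    uint8_t s = 0;
    for (int i = 0; i < 3; i++)
        s = (uint8_t)((s << 1) | parity7(r & H_ROWS[i]));
    return s;
}

/* Flips the one bit whose column of H equals the syndrome.
   Returns r unchanged when the syndrome is zero. */
uint8_t correct74(uint8_t r)
{
    uint8_t s = syndrome74(r);
    for (int j = 0; s != 0 && j < 7; j++) {
        uint8_t bit = (uint8_t)(0x40 >> j);
        if (syndrome74(bit) == s)   /* syndrome of a unit vector = column j */
            return r ^ bit;
    }
    return r;
}
```

With R = 1111010 the sketch yields S = 101 and restores 1011010. With the double-error vector R = 1001110 it yields S = 111 and flips b3, producing a valid but wrong codeword, which is the miscorrection described above.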
So, with the (7, 4) code, we can only detect (and not correct) the errors if two errors are present in the received sequence. ■

3.3.2 Popular Linear Block Codes
Depending upon their error correction capabilities and algebraic properties, there are many types of linear block codes in use today. The most widely used are Hamming codes, BCH codes, RS codes, and LDPC codes. From Hamming codes to LDPC codes, the error correction capabilities of linear block codes increase, and at the same time the decoding complexity and memory requirements to implement these codes also increase by multiple factors.

Hamming Codes
The code that we worked with in Example 3.11 is a (7, 4) Hamming code. There are both binary and nonbinary Hamming codes. Here, we consider only binary Hamming codes. The general form of an (n, k) Hamming code is as follows:

\[
(n, k) = (2^m - 1,\ 2^m - 1 - m) \tag{3.6}
\]

where m is a positive integer. The (7, 4) code is an example for m = 3. The binary (n, k) Hamming code can be extended to (n + 1, k) to increase dmin by 1, or can be shortened to (n − l, k − l) by removing l rows from its generator matrix G to yield a code that has the same error correction capabilities as the (n, k) code. In Section 3.4, we discuss and simulate the popularly used (72, 64) Hamming code, which is an extended and shortened form of the (127, 120) Hamming code given by Equation (3.6) with m = 7.

BCH Codes
BCH (Bose-Chaudhuri-Hocquenghem) codes are subsets of linear block codes and comprise a large class of cyclic codes that include both binary and nonbinary codes. An overview of BCH codes with examples is presented in Section 3.5.

RS Codes
RS (Reed-Solomon) codes are nonbinary cyclic linear block codes, and these codes are used for FEC in many technologies for their excellent error-correction performance. An overview of RS codes with examples is presented in Section 3.6.

LDPC Codes
LDPC (low-density parity check) codes are among the most promising capacity-approaching codes, although they were largely forgotten for four decades.
Recently, these codes have been rediscovered, and many standards are adopting them in present technologies. An overview of LDPC codes is given in Section 3.11.

3.4 Hamming (72, 64) Coder
There are many error-correcting codes (ECC) in the literature for correcting bit errors in the received data. In this section, we discuss the Hamming (72, 64) coder, which is popularly used to correct all single-bit errors and to detect all double-bit errors that could occur during data transmission or during storage and retrieval of data from memory. In this section, we restrict ourselves to the memory error-correction application.

How much improvement a 1-bit error correction gives in the BER (bit-error rate) performance curves depends on the raw BER (RBER) without error correction and on the codeword length used with the single-bit error-correction coder. Figure 3.6 shows the resulting BER after correction (i.e., the uncorrectable BER, known as UBER) with zero- to six-bit error correction for the given raw BER values. Here, the codeword length used to generate the BER curves is 2048 bits. For example, with a given RBER of $10^{-7}$, we can achieve a UBER of $10^{-11}$ with single-bit error correction. For a given RBER, a shorter codeword provides better error-correcting capability, that is, a lower UBER, as shown in Figure 3.7. Given the RBER, P, the codeword length, N, and the number of error bits, n, we get the UBER with a 1-bit error correction using the following equation (Mielke, 2008):

\[
\mathrm{UBER} = \frac{\sum_{n=2}^{N} \binom{N}{n} P^{n} (1-P)^{N-n}}{N}
= \frac{1 - \binom{N}{0}(1-P)^{N} - \binom{N}{1} P (1-P)^{N-1}}{N} \tag{3.7}
\]

[Figure 3.6: RBER versus UBER with various error correction capability coders. Curves for n = 0 to n = 6 correctable bits; axes: RBER and UBER.]

[Figure 3.7: Codeword length versus UBER for various RBER. Curves for RBER = $10^{-3}$, $10^{-4}$, and $10^{-5}$; axes: codeword length (0 to 2000 bits) and UBER.]
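Equation (3.7) is easy to evaluate numerically. The following sketch sums the binomial terms for n ≥ 2 directly, which avoids the loss of precision that the "1 minus two terms" form suffers when P is very small. The function name `uber_single_bit` is an assumption, and the routine expects 0 ≤ P < 1.

```c
/* UBER with 1-bit error correction, Equation (3.7):
   (sum over n = 2..N of C(N,n) P^n (1-P)^(N-n)) / N.
   p is the raw bit-error rate (0 <= p < 1), n_bits the codeword length N. */
double uber_single_bit(double p, int n_bits)
{
    double term = 1.0;   /* running value of C(N,n) P^n (1-P)^(N-n) */
    double tail = 0.0;   /* probability of two or more bit errors */

    for (int i = 0; i < n_bits; i++)
        term *= (1.0 - p);                  /* n = 0 term: (1-P)^N */

    for (int n = 1; n <= n_bits; n++) {
        /* ratio of consecutive binomial terms */
        term *= ((double)(n_bits - n + 1) / n) * (p / (1.0 - p));
        if (n >= 2)
            tail += term;
    }
    return tail / n_bits;
}
```

For P = 10^-7 and N = 2048 the sketch gives a value of about 10^-11, in line with the example quoted above for single-bit error correction.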
[Figure 3.8: Application of ECC in memory error correction. Memory Section 1: programs, look-ups, etc. (constant); the encoder is used once, at the time of storing, and the decoder is used for each memory load. Memory Section 2: parameters, user inputs (changes slowly); the encoder is used whenever new data is stored, and the decoder is used for each memory load. Memory Section 3: video/audio data, navigation data (changes in real time); ECC may not be required. Axis label: ECC complexity.]

3.4.1 Memory Error Correction with Hamming Codes
In automotive applications, the automotive safety integrity level (ASIL) that a processor can support (for example, through memory with error correction capabilities) is one of the important issues in choosing embedded processors. Software-based Hamming codes can be used to improve the reliability of the most important sections of memory, thus improving the ASIL metric. Memory is used to store information of various types. Some types of information require strong protection against errors and others do not. For example, application software code, data structures, parameters, and look-up tables are very sensitive, and any alteration of their content may end in catastrophic errors. On the other hand, information such as data samples and image pixels is not as sensitive and may not require error protection.

A typical automotive application can be broken into different sections of memory consisting of different types of information: (1) constant data such as software code and look-up tables, (2) slowly varying data such as application parameters, and (3) continuously varying data such as audio/video data and navigation data (as shown in Figure 3.8). A software ROM-based error-correction approach that uses a Hamming code to correct the single-bit errors in the first two sections of memory can then be implemented using a very small percentage of processor resources. In the first case, as the data is constant, the extra error-correction information is constant and can be generated once.
Every time information is retrieved from this memory section, an ECC decoder is applied to correct the single-bit errors. In the second case, we call the decoder for each memory load, and the encoder is called to update the error-correction information only when new data is ready for storing to memory. In these two cases, a software-based single-bit error correction can be implemented using a very small percentage of processor resources.

A schematic diagram of the software-based memory error correction is shown in Figure 3.9. With the Hamming (n, k) coder, we divide the data into blocks that are k bits long, compute the n − k parity bits, and store the n-bit block to the memory area. The parity overhead percentage with respect to the data length can be computed as (n − k) × 100/k. When the data is retrieved, the decoder uses the n − k parity bits to detect and correct errors that corrupted the data while it was residing in memory. We verify the computed parity against the retrieved parity data. If both sets of parity bits match, then no error bits are present in the received data; otherwise, there are error bits in the received data.

[Figure 3.9: Schematic diagram of software-based memory error correction. Blocks on the reference embedded processor: CPU, ROM, encoder (input at point A, output at point B), decoder, and the memory section with important information. Data for storing to memory passes through the encoder; data retrieved from memory passes through the decoder.]

Figure 3.10: Matrix arrangement of Hamming (72, 64) encoder input–output bits.

             000   001   010   011   100   101   110   111
    0000      −    p1    p2    d0    p3    d1    d2    d3
    0001     p4    d4    d5    d6    d7    d8    d9    d10
    0010     p5    d11   d12   d13   d14   d15   d16   d17
    0011     d18   d19   d20   d21   d22   d23   d24   d25
    0100     p6    d26   d27   d28   d29   d30   d31   d32
    0101     d33   d34   d35   d36   d37   d38   d39   d40
    0110     d41   d42   d43   d44   d45   d46   d47   d48
    0111     d49   d50   d51   d52   d53   d54   d55   d56
    1000     p7    d57   d58   d59   d60   d61   d62   d63
The Hamming decoder can detect and correct all single-bit errors and detect all double-bit errors. Because the error-correction software is permanently stored in the ROM and uses the core resources whenever memory is accessed, we prefer an ECC solution that uses a very small amount of memory and very few processor cycles. In the next subsections, we discuss the widely used Hamming (72, 64) coder, along with the implementation techniques and the computational complexity for encoding and decoding of the Hamming (72, 64) code on the reference embedded processor.

3.4.2 Hamming (72, 64) Encoder
The Hamming (72, 64) encoder generates 8 bits of parity from 64 input data bits. To understand how the Hamming (72, 64) encoder generates the parity bits, we arrange the data bits (input) and parity bits (output) in a matrix and give a binary index to each row and column, as shown in Figure 3.10. We have eight columns and nine rows. Each bit in the matrix can be uniquely addressed with the row and column index bits. For example, the address of bit d20 is 0011 010. Each parity bit p1 to p7 is placed at a special address that is a power of 2 (i.e., p1: 0000 001, p2: 0000 010, p3: 0000 100, p4: 0001 000, p5: 0010 000, p6: 0100 000, p7: 1000 000). The parity bit p1 is generated by XORing the data bits whose address has a “1” at the position “k” in the address field xxxx xxk. Similarly, the parity bit p2 is generated by XORing the data bits whose address has a “1” at the position “k” in the address field xxxx xkx, and so on. The equations for generating parity bits p1 to p7 follow.
p1 = d4 ⊕ d11 ⊕ d19 ⊕ d26 ⊕ d34 ⊕ d42 ⊕ d50 ⊕ d57 ⊕ d0 ⊕ d6 ⊕ d13 ⊕ d21 ⊕ d28 ⊕ d36 ⊕ d44 ⊕ d52 ⊕ d59 ⊕ d1 ⊕ d8 ⊕ d15 ⊕ d23 ⊕ d30 ⊕ d38 ⊕ d46 ⊕ d54 ⊕ d61 ⊕ d3 ⊕ d10 ⊕ d17 ⊕ d25 ⊕ d32 ⊕ d40 ⊕ d48 ⊕ d56 ⊕ d63 (3.8)

p2 = d5 ⊕ d12 ⊕ d20 ⊕ d27 ⊕ d35 ⊕ d43 ⊕ d51 ⊕ d58 ⊕ d0 ⊕ d6 ⊕ d13 ⊕ d21 ⊕ d28 ⊕ d36 ⊕ d44 ⊕ d52 ⊕ d59 ⊕ d2 ⊕ d9 ⊕ d16 ⊕ d24 ⊕ d31 ⊕ d39 ⊕ d47 ⊕ d55 ⊕ d62 ⊕ d3 ⊕ d10 ⊕ d17 ⊕ d25 ⊕ d32 ⊕ d40 ⊕ d48 ⊕ d56 ⊕ d63 (3.9)

p3 = d7 ⊕ d14 ⊕ d22 ⊕ d29 ⊕ d37 ⊕ d45 ⊕ d53 ⊕ d60 ⊕ d1 ⊕ d8 ⊕ d15 ⊕ d23 ⊕ d30 ⊕ d38 ⊕ d46 ⊕ d54 ⊕ d61 ⊕ d2 ⊕ d9 ⊕ d16 ⊕ d24 ⊕ d31 ⊕ d39 ⊕ d47 ⊕ d55 ⊕ d62 ⊕ d3 ⊕ d10 ⊕ d17 ⊕ d25 ⊕ d32 ⊕ d40 ⊕ d48 ⊕ d56 ⊕ d63 (3.10)

p4 = d4 ⊕ d5 ⊕ d6 ⊕ d7 ⊕ d8 ⊕ d9 ⊕ d10 ⊕ d18 ⊕ d19 ⊕ d20 ⊕ d21 ⊕ d22 ⊕ d23 ⊕ d24 ⊕ d25 ⊕ d33 ⊕ d34 ⊕ d35 ⊕ d36 ⊕ d37 ⊕ d38 ⊕ d39 ⊕ d40 ⊕ d49 ⊕ d50 ⊕ d51 ⊕ d52 ⊕ d53 ⊕ d54 ⊕ d55 ⊕ d56 (3.11)

p5 = d11 ⊕ d12 ⊕ d13 ⊕ d14 ⊕ d15 ⊕ d16 ⊕ d17 ⊕ d18 ⊕ d19 ⊕ d20 ⊕ d21 ⊕ d22 ⊕ d23 ⊕ d24 ⊕ d25 ⊕ d41 ⊕ d42 ⊕ d43 ⊕ d44 ⊕ d45 ⊕ d46 ⊕ d47 ⊕ d48 ⊕ d49 ⊕ d50 ⊕ d51 ⊕ d52 ⊕ d53 ⊕ d54 ⊕ d55 ⊕ d56 (3.12)

p6 = d26 ⊕ d27 ⊕ d28 ⊕ d29 ⊕ d30 ⊕ d31 ⊕ d32 ⊕ d33 ⊕ d34 ⊕ d35 ⊕ d36 ⊕ d37 ⊕ d38 ⊕ d39 ⊕ d40 ⊕ d41 ⊕ d42 ⊕ d43 ⊕ d44 ⊕ d45 ⊕ d46 ⊕ d47 ⊕ d48 ⊕ d49 ⊕ d50 ⊕ d51 ⊕ d52 ⊕ d53 ⊕ d54 ⊕ d55 ⊕ d56 (3.13)

p7 = d57 ⊕ d58 ⊕ d59 ⊕ d60 ⊕ d61 ⊕ d62 ⊕ d63 (3.14)

The parity bit p8 is used to detect double-bit errors and is generated by XORing all 64 data bits as follows:

p8 = d0 ⊕ d1 ⊕ d2 ⊕ d3 ⊕ · · · ⊕ d61 ⊕ d62 ⊕ d63 (3.15)

The generated parity bits are concatenated to the original 64 data bits to form 72-bit encoded data.
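Equations (3.8) to (3.15) all follow from the address rule of Figure 3.10, so they can be generated rather than typed out. The sketch below is one way to do this; it is an illustration, not the book's implementation. It assumes that d0 is the MSB of the first 32-bit word, that data bits fill the addresses 3 to 71 that are not powers of 2 in increasing order, and that the result carries p1 in bit 0 and p8 in bit 7. The names `hamming7264_parity` and `data_bit` are hypothetical.

```c
#include <stdint.h>

/* Data bit d_n: bit (31 - n) of w0 for n < 32, bit (63 - n) of w1 otherwise,
   i.e., d0 is the MSB of w0 (assumed packing). */
static int data_bit(uint32_t w0, uint32_t w1, int n)
{
    uint32_t w = (n < 32) ? w0 : w1;
    return (int)((w >> (31 - (n & 31))) & 1u);
}

/* Returns p1..p8 packed into bits 0..7 (p1 in bit 0, p8 in bit 7). */
uint8_t hamming7264_parity(uint32_t w0, uint32_t w1)
{
    uint8_t parity = 0;
    int n = 0;                               /* index of the next data bit */

    for (int addr = 1; addr <= 71; addr++) {
        if ((addr & (addr - 1)) == 0)        /* power of 2: parity position */
            continue;
        if (data_bit(w0, w1, n)) {
            /* A set data bit toggles every parity bit whose address bit
               is 1 in addr (p1..p7) and the overall parity p8. */
            parity ^= (uint8_t)addr;
            parity ^= 0x80;
        }
        n++;
    }
    return parity;
}
```

XORing the 7-bit addresses of all data bits that are 1 yields p1 to p7 in one step, because parity bit pj covers exactly the addresses with bit j − 1 set. A table-driven version trades this loop for precomputed constants.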
In Figure 3.9, with the Hamming (72, 64) coder, 64 bits of data enter the encoder block at point A and 72 bits of encoded data come out at point B. This encoded 72-bit data frame is then stored to the memory. This process is continued for all data blocks.

3.4.3 Hamming (72, 64) Decoder
The Hamming decoder consists of two steps: (1) syndrome computation and (2) error correction. In the syndrome computation step, the Hamming (72, 64) decoder computes eight syndromes using the 72 bits retrieved from memory. The syndromes s1 to s8 are computed by XORing the encoder parity bits p1 to p8 (which are retrieved from memory and may differ in value from the stored bits due to bit errors) with the decoder parity bits c1 to c8 (which we compute at the decoder) as follows:

    s1 = c1 ⊕ p1, s2 = c2 ⊕ p2, s3 = c3 ⊕ p3, s4 = c4 ⊕ p4
    s5 = c5 ⊕ p5, s6 = c6 ⊕ p6, s7 = c7 ⊕ p7, s8 = c8 ⊕ p8

In the syndrome computation, to generate the decoder parity bits, we use the same encoder parity bit generator equations, (3.8) to (3.15). For example, if the 64 data bits of the 72 bits retrieved from memory are named b0 to b63, corresponding to the encoder data bits d0 to d63, then the decoder parity bit c1 is generated using the p1 parity bit generator equation as follows:

c1 = b4 ⊕ b11 ⊕ b19 ⊕ b26 ⊕ b34 ⊕ b42 ⊕ b50 ⊕ b57 ⊕ b0 ⊕ b6 ⊕ b13 ⊕ b21 ⊕ b28 ⊕ b36 ⊕ b44 ⊕ b52 ⊕ b59 ⊕ b1 ⊕ b8 ⊕ b15 ⊕ b23 ⊕ b30 ⊕ b38 ⊕ b46 ⊕ b54 ⊕ b61 ⊕ b3 ⊕ b10 ⊕ b17 ⊕ b25 ⊕ b32 ⊕ b40 ⊕ b48 ⊕ b56 ⊕ b63.

[Figure 3.11: Hamming decoder flow chart diagram. Start; compute 8 syndromes; if all syndromes are zero, output “0” errors; otherwise, if s8 = 0, output double-bit errors detected; otherwise, assuming single-bit errors, correct the error and output; end.]

With respect to the 72 bits retrieved from memory, there are four possible cases of bit errors: (1) no occurrence of bit errors, (2) occurrence of a 1-bit error, (3) occurrence of 2-bit errors, and (4) occurrence of more than 2-bit errors.
If all eight computed syndrome values are zero, then there is no bit error in the retrieved 72 bits. Non-zero syndrome values indicate the presence of errors. If a single-bit error is present in the data bits, it is both detected and corrected in the error-correction step using the information in the eight syndromes. If any of the syndromes from s1 to s7 are non-zero and s8 is zero, then this indicates the presence of two error bits, which cannot be corrected. So, if two bits are in error, the Hamming (72, 64) decoder only detects the errors and cannot correct them. Any other result in the syndrome values indicates the presence of more than two error bits in the retrieved 72 bits of data, and these cannot be corrected. The flow chart diagram for the Hamming decoder is shown in Figure 3.11.

3.4.4 Hamming (72, 64) Simulation
There are two ways to simulate the Hamming (72, 64) coder. In the first method, we store the bit indices of each parity equation, extract the corresponding bits from the 72-bit bitstream using the bit indices, and then XOR the individual bits to get the parity bit. Although this method is simple to simulate, it is expensive in terms of cycles and memory (as the look-up tables have to be stored in ROM permanently for this software-based memory correction application). In the second approach, we compute the parity bits using precomputed masks, assuming that the input 64 bits are present in two 32-bit registers r0 and r1:

    r0: bits 0 1 2 3 . . . 30 31        r1: bits 32 33 34 . . . 62 63

The masks for each parity bit are generated using the parity equations given in Equations (3.8) through (3.15).
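The mask construction can also be automated. The sketch below derives one pair of 32-bit masks per parity bit from the address arrangement of Figure 3.10 instead of from the written-out equations. It is an illustration under stated assumptions: bit 0 (d0) maps to the MSB of the first word, the table is laid out as two consecutive words per parity bit, and the name `hamming7264_build_masks` is hypothetical.

```c
#include <stdint.h>

/* Fills masks[2*j] (covering d0..d31) and masks[2*j + 1] (covering d32..d63)
   for parity bits p1..p8 (j = 0..7). d0 maps to the MSB of the first word. */
void hamming7264_build_masks(uint32_t masks[16])
{
    int n = 0;                                  /* index of the next data bit */

    for (int i = 0; i < 16; i++)
        masks[i] = 0;

    for (int addr = 1; addr <= 71; addr++) {
        if ((addr & (addr - 1)) == 0)           /* power of 2: parity position */
            continue;

        uint32_t bit = 0x80000000u >> (n & 31); /* position inside the word */
        int word = n >> 5;                      /* 0 for d0..d31, 1 otherwise */

        for (int j = 0; j < 7; j++) {
            if (addr & (1 << j))                /* p(j+1) covers this address */
                masks[2 * j + word] |= bit;
        }
        masks[14 + word] |= bit;                /* p8 covers every data bit */
        n++;
    }
}
```

Generating the table once (offline or at start-up) and then storing it as constants keeps the run-time cost identical to a hand-written table while removing the risk of transcription errors in the hexadecimal values.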
For example, to generate the mask for computing parity bit p1, we place a “1” in a position if that bit participates in the computation of parity bit p1; otherwise, we place a “0” in that position. Since the reference embedded processor is a 32-bit machine, we can only hold 32 bits in a register. So, we split the 64 bits into two 32-bit groups and convert them to hexadecimal numbers as follows:

    b0 to b31:   1101 1010 1011 0101 0101 0101 0110 1010   = 0xdab5556a
    b32 to b63:  1010 1010 1010 1010 1010 1010 1101 0101   = 0xaaaaaad5

In the same way, we can generate the masks for computing the other parity bits. The mask values for all parity bits follow.

    Mask0p1 = 0xdab5556a, Mask1p1 = 0xaaaaaad5
    Mask0p2 = 0xb66cccd9, Mask1p2 = 0x999999b3
    Mask0p3 = 0x71e3c3c7, Mask1p3 = 0x8787878f
    Mask0p4 = 0x0fe03fc0, Mask1p4 = 0x7f807f80
    Mask0p5 = 0x001fffc0, Mask1p5 = 0x007fff80
    Mask0p6 = 0x0000003f, Mask1p6 = 0xffffff80
    Mask0p7 = 0x00000000, Mask1p7 = 0x0000007f
    Mask0p8 = 0xffffffff, Mask1p8 = 0xffffffff

The simulation code for computing the 8 parity bits of the Hamming (72, 64) code is given in Pcode 3.4. The preceding precomputed parity bit masks are stored in the look-up table hm_masks[ ]. For each parity bit, we load the corresponding masks from the look-up table into r2 and r3 and AND the masks with the actual data words present in r0 and r1. The ANDed result is stored back into r2 and r3. Then we XOR r2 and r3 and place the result in r2. If the number of ones present in r2 is odd, then the parity bit pn is set to 1; otherwise (that is, if an even number of ones is present in r2), pn = 0.
As counting the number of ones present in a 32-bit word requires many operations in C simulation, we obtain the parity with a few shift and XOR operations, as shown in Pcode 3.4. On the reference embedded processor, we can compute each parity bit in three cycles using a special instruction set.

    /* r0, r1: the 64 data bits; r2, r3, r7: unsigned 32-bit work registers */
    r7 = 0;
    for (i = 0; i < 8; i++) {
        r2 = hm_masks[2*i];             /* mask for bits 0..31  */
        r3 = hm_masks[2*i+1];           /* mask for bits 32..63 */
        r2 = r0 & r2;
        r3 = r1 & r3;
        r2 = r2 ^ r3;
        r3 = r2 >> 16;  r2 = r2 ^ r3;   /* fold the 32 bits down to bit 0 */
        r3 = r2 >> 8;   r2 = r2 ^ r3;
        r3 = r2 >> 4;   r2 = r2 ^ r3;
        r3 = r2 >> 2;   r2 = r2 ^ r3;
        r3 = r2 >> 1;   r2 = r2 ^ r3;
        r2 = r2 & 1;                    /* XOR of all selected bits */
        r2 = r2 << i;
        r7 = r7 | r2;                   /* parity bit i+1 goes to bit i of r7 */
    }

Pcode 3.4: Simulation code to generate parity bits of Hamming (72, 64) code.

Single-Bit Error Correction and Double-Bit Error Detection
Once we compute the 8 parity bits at the decoder using the data bits retrieved from memory, we compute the syndromes by XORing the encoder and decoder parity bits. The syndromes indicate whether bit errors are present. The syndromes also provide the bit location if a single-bit error occurred, and we flip that bit to correct the data. We output a flag value depending on whether or not bit errors occurred in the retrieved data. For example, we output the decoded data status information by returning the value “0” if no errors occurred or one error occurred and was corrected, “1” if two errors occurred and were detected, and “2” if multiple errors occurred in the retrieved data. The simulation code for correcting the error bit using the Hamming (72, 64) coder is given in Pcode 3.5.
    /* data[0], data[1]: retrieved data words; data[2]: retrieved parity in its top byte.
       r7: recomputed parity, assumed here to be aligned to the top byte as well. */
    r6 = data[2];
    r6 = r6 ^ r7;                       /* syndromes s8..s1 in bits 31..24 */
    r6 = r6 >> 24;
    r4 = r6 & 0x80;                     /* s8 */
    r6 = r6 & 0x7f;                     /* s7..s1: error address */
    j = 0;                              // assume no errors
    if (r6 != 0) {
        if ((r4 == 0x80) && (r6 != 0)) {    // correct single bit errors
            if (r6 < 72) {
                r5 = hm_error_table[r6];
                if (r5 < 32) {
                    r5 = 31 - r5;
                    r4 = 1u << r5;
                    data[0] = data[0] ^ r4;
                }
                else if (r5 < 64) {
                    r5 = r5 - 32;
                    r5 = 31 - r5;
                    r4 = 1u << r5;
                    data[1] = data[1] ^ r4;
                }
            }
            else j = 2;                 // multiple errors
        }
        else j = 1;                     // double bit error detected
    }

Pcode 3.5: Simulation code for correcting single bit error with Hamming (72, 64) coder.

Computational Complexity
Assuming each operation in Pcode 3.4 consumes one cycle on the reference embedded processor (see Appendix A, Section A.4, on the companion website for more details on cycles estimation), Pcode 3.4 takes approximately 150 cycles. It takes another 20 to 30 cycles to correct an error using Pcode 3.5. With this, the Hamming encoder consumes about 2.5 cycles per bit and the decoder consumes about 3 cycles per bit on the reference embedded processor. Using a special instruction set, we can perform Hamming (72, 64) encoding in 0.5 cycles per bit and decoding in 0.75 cycles per bit. We use a total of 136 bytes of data memory for the look-up tables.

Simulation Results
Assume the 64 bits that will be stored in memory are r0 = 0x8f7f6f5f and r1 = 0x4f3f2f1f (bit 0 is the MSB of r0). We compute the 8 parity bits from the 64 bits as 0xf4000000 (the MSB is p8) and append them to the data bits to make a 72-bit codeword before storing to memory. Assume the retrieved 72 bits of data are r0 = 0x8e7f6f5f, r1 = 0x4f3f2f1f, and r2 = 0xf4000000, with a 1-bit error in the first 32-bit word. The parity bits c1 to c8 are computed from the retrieved data as 0x78000000 (the MSB is c8). Then the eight syndromes are computed as 0x8c000000 (the MSB is s8). We use the look-up table hm_error_table[ ] to get the error location from the syndromes. Once we know the error location, we correct the single-bit error (if one occurred) using Pcode 3.5. The values of hm_error_table[ ] follow.
    hm_error_table[72] = {
        64,64,64, 0,64, 1, 2, 3,64, 4, 5, 6, 7, 8, 9,10,
        64,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,
        64,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,
        41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,
        64,57,58,59,60,61,62,63};

3.5 BCH Codes
The framework of BCH codes supports a large class of powerful random error-correcting cyclic binary and nonbinary linear block codes. With BCH (N, K) codes, we compute mT (= N − K) parity bits from the input block of K bits using the generator polynomial $G(x)$, and we correct up to T bit errors in the received block of N bits. At the transmitter side, the BCH (N, K) encoder computes and appends mT parity bits to the block of K data bits, and at the receiver side the BCH (N, K) decoder corrects up to T errors by using the mT bits of parity information. We work with Galois field $GF(2^m)$ elements for decoding of BCH (N, K) codes. See Appendix B, Section B.2, on the companion website for more details on Galois field arithmetic operations.

3.5.1 BCH Encoder
We represent the data either in polynomial form (like $A(x), B(x), C(x), \ldots$) or in vector form (like A, B, C, . . .). A polynomial over a field $GF(q)$ is a mathematical expression of the form

\[
F(x) = f_{n-1} x^{n-1} + f_{n-2} x^{n-2} + \cdots + f_1 x + f_0
\]

where the symbol x is an indeterminate, the coefficients $f_{n-1}, f_{n-2}, \ldots, f_0$ are elements of $GF(q)$, and the indices and exponents are integers. The BCH (N, K) encoder computes mT (= N − K) bits of parity data from K bits of input data by using a generator polynomial $G(x) = g_0 + g_1 x + g_2 x^2 + \cdots + g_{N-K-1} x^{N-K-1} + x^{N-K}$, where $g_i \in GF(2)$.
For BCH (N, K) codes, the generator polynomial $G(x)$ is obtained by multiplying the T minimal polynomials $\phi_{2i-1}(x)$ of the field elements $\alpha^{2i-1}$ for $1 \le i \le T$ as follows:

\[
G(x) = \phi_1(x)\,\phi_3(x) \cdots \phi_{2T-1}(x) \tag{3.16}
\]

As every even power of $\alpha$ has the same minimal polynomial as some preceding odd power of $\alpha$, $G(x)$ is obtained by computing the least common multiple (LCM) of the minimal polynomials $\phi_i(x)$ for $1 \le i \le 2T$; hence, $G(x)$ has $\alpha, \alpha^2, \alpha^3, \ldots, \alpha^{2T}$ as its roots. In other words, $G(\alpha^i) = 0$ for $1 \le i \le 2T$. See Appendix B, Section B.2, on the companion website for more details on Galois field arithmetic operations (see also Example 3.12).

Suppose that the input message block of K bits to be encoded is $D = [d_0 d_1 d_2 \cdots d_{K-1}]$ and the corresponding message polynomial is $D(x) = d_0 + d_1 x + d_2 x^2 + \cdots + d_{K-1} x^{K-1}$. Let $B = [b_0 b_1 b_2 \cdots b_{N-K-1}]$ denote the computed parity data of length N − K (= mT), with polynomial representation $B(x) = b_0 + b_1 x + b_2 x^2 + \cdots + b_{N-K-1} x^{N-K-1}$. This parity polynomial $B(x)$ is given by the remainder when we divide $D(x) \cdot x^{N-K}$ by the generator polynomial $G(x)$:

\[
B(x) = D(x) \cdot x^{N-K} \bmod G(x) \tag{3.17}
\]

After computing the parity polynomial $B(x)$, the encoded code polynomial $C(x)$ is constructed as

\[
\begin{aligned}
C(x) &= D(x) \cdot x^{N-K} + B(x) \\
     &= b_0 + b_1 x + b_2 x^2 + \cdots + b_{N-K-1} x^{N-K-1} + d_0 x^{N-K} + d_1 x^{N-K+1} + \cdots + d_{K-1} x^{N-1} \\
     &= c_0 + c_1 x + c_2 x^2 + \cdots + c_{N-1} x^{N-1}
\end{aligned} \tag{3.18}
\]

Basically, we append mT bits of parity data to the input block of K bits and form a systematic codeword of length N (= K + mT) bits. The encoded polynomial in vector form is represented as $C = [c_0 c_1 c_2 \cdots c_{N-1}]$. Equations (3.17) and (3.18) can be realized with an LFSR signal flow diagram as shown in Figure 3.12. To compute the coefficients of the parity polynomial $B(x)$, we input the coefficients of the data polynomial $D(x)$ to the LFSR, with the coefficient $d_{K-1}$ as the first input.
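The LFSR realization of Equations (3.17) and (3.18) can be sketched in a few lines of C. The sketch is illustrative and makes several assumptions: polynomials are bit-packed with bit i holding the coefficient of $x^i$, N − K is less than 32, and the function names `bch_parity` and `bch_encode` are hypothetical.

```c
#include <stdint.h>

/* B(x) = D(x).x^(N-K) mod G(x) over GF(2), computed with an LFSR.
   data: message polynomial D(x), bit i = coefficient of x^i, k = K bits.
   gen:  generator polynomial G(x) including its leading x^(N-K) term.
   nk:   N - K (assumed to be less than 32). */
uint32_t bch_parity(uint32_t data, int k, uint32_t gen, int nk)
{
    uint32_t mask = (1u << nk) - 1u;
    uint32_t reg = 0;                           /* delay units, all zero */

    for (int i = k - 1; i >= 0; i--) {          /* d_{K-1} enters first */
        uint32_t fb = ((data >> i) & 1u) ^ ((reg >> (nk - 1)) & 1u);
        reg = (reg << 1) & mask;                /* shift the delay line */
        if (fb)
            reg ^= gen & mask;                  /* add taps g_0..g_{N-K-1} */
    }
    return reg;                                 /* coefficients b_0..b_{N-K-1} */
}

/* Systematic codeword C(x) = D(x).x^(N-K) + B(x). */
uint32_t bch_encode(uint32_t data, int k, uint32_t gen, int nk)
{
    return (data << nk) | bch_parity(data, k, gen, nk);
}
```

With the BCH(7, 4) generator $G(x) = x^3 + x + 1$ (gen = 0xB, nk = 3) and $D(x) = x^3 + x^2 + x$ (data = 0xE), the register passes through the states of Example 3.12 and the sketch returns the parity $x^2$ and the codeword 1110100.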
The values present in the delay units (Z) after passing all K coefficients of $D(x)$ give the coefficients of the parity polynomial $B(x)$.

[Figure 3.12: Realization of BCH(N, K) encoder. LFSR with delay units Z, feedback taps $g_0, g_1, g_2, \ldots, g_{N-K-1}$, feedback value fbv, input $D(x)$, and outputs $B(x)$ and $C(x)$.]

■ Example 3.12
Let us consider the Galois field $GF(2^3)$ with m = 3 from Appendix B, Section B.2, on the companion website. With this, we can work with a codeword length of $N = 2^3 - 1 = 7$ bits. We choose the message length K = 4 bits. Then mT = 7 − 4 = 3. In this case, we can correct a 1-bit error (since T = 1) with the BCH(7, 4) code. The generator polynomial for the BCH(7, 4) code is $G(x) = x^3 + x + 1$ (Shu Lin, 1983). Let the 4-bit message data vector be D = [1110], or in polynomial notation $D(x) = x^3 + x^2 + x$. Using the generator polynomial $G(x)$ and the message polynomial $D(x)$, we compute the parity (or remainder) using Equation (3.17) or using the shift register realization shown earlier. Here, we compute the parity using shift registers. We initialize the shift registers with zero and then we pass the message bits through the shift registers one after another. We obtain the parity as B = [100] after passing the four message bits through the shift registers (Figure 3.13). So, the BCH codeword of 7 bits length is given by C = [1110100], or in polynomial notation $C(x) = x^6 + x^5 + x^4 + x^2$.

[Figure 3.13: LFSR-based parity bits generation for BCH(7,4) encoding. Input bits: 1, 1, 1, 0; register states: 000, 110, 101, 010, 001; output codeword: 1110100.]
■

3.5.2 BCH Decoder
At the receiver, we use a BCH (N, K) decoder to detect and correct the bit errors. A BCH decoder consists of the following three steps to decode the received data block R.
• Computation of syndromes
• Computation of error-locator polynomial
• Computation of error positions
The received data vector R, or its polynomial $R(x) = r_0 + r_1 x + r_2 x^2 + \cdots + r_{N-1} x^{N-1}$, consists of the transmitted code polynomial $C(x)$ along with an added error polynomial $E(x)$:
\[
\begin{aligned}
R(x) &= C(x) + E(x) \\
     &= D(x) \cdot x^{N-K} + B(x) + E(x) \\
     &= Q(x) \cdot G(x) + E(x)
\end{aligned} \tag{3.19}
\]

where $Q(x)$ is the quotient obtained when $D(x) \cdot x^{N-K}$ is divided by $G(x)$; by Equation (3.17), every code polynomial $C(x)$ is a multiple of $G(x)$. In a BCH decoder (unlike in a BCH encoder), we have to perform Galois field arithmetic operations.

Syndromes Computation
To detect the presence of errors and find the error pattern, we compute 2T syndromes using the received data polynomial as follows:

\[
\begin{aligned}
R(\alpha^i) &= Q(\alpha^i) \cdot G(\alpha^i) + E(\alpha^i), \quad \text{where } 1 \le i \le 2T \\
            &= 0 + E(\alpha^i) = E(\alpha^i) \\
            &= S_i
\end{aligned} \tag{3.20}
\]

From the preceding syndromes computation, if no errors are present in the received data vector, all computed syndrome values ($S_i$) are zero. If one or more syndromes are non-zero, then we assume that errors are present in the received data vector. The syndromes $S_i = R(\alpha^i)$ are computed with the LFSR signal flow diagram shown in Figure 3.14 (see also Example 3.13).

[Figure 3.14: Signal flow diagram of syndrome computation. Labels: input $r_j$, multiplier $\alpha^i$, delay units Z, internal signals $a_j$, $b_j$, $c_j$, output $s_i$.]

■ Example 3.13
We transmit the codeword C = [1110100] computed in Example 3.12 through a noisy channel. Let the received vector be R = [1110110], which differs from the transmitted codeword by 1 bit (the sixth bit, the coefficient of x). So, the error vector is E = [0000010], or $E(x) = x$ (but we do not know the errors in advance). We can find the error vector with the BCH decoder if the number of errors that occurred is less than or equal to T. In our example, T = 1 (i.e., the decoder can correct a 1-bit error), so we can correct the one error present in the received data vector. The first step in BCH decoding is the computation of syndromes. To correct T errors, we have to compute 2T syndromes $S_i = R(\alpha^i)$, where i = 1, 2. $R(\alpha^i)$ is obtained by substituting $x = \alpha^i$ in $R(x)$. We use Galois field $GF(2^3)$ arithmetic (see Appendix B, Section B.2, on the companion website) in computing these two syndromes.
\[ R(x) = x^6 + x^5 + x^4 + x^2 + x \]
\[ R(\alpha) = \alpha^6 + \alpha^5 + \alpha^4 + \alpha^2 + \alpha = (\alpha^2 + 1) + (\alpha^2 + \alpha + 1) + (\alpha^2 + \alpha) + \alpha^2 + \alpha = \alpha \]
\[ R(\alpha^2) = \alpha^{12} + \alpha^{10} + \alpha^8 + \alpha^4 + \alpha^2 = \alpha^7\alpha^5 + \alpha^7\alpha^3 + \alpha^7\alpha + \alpha^4 + \alpha^2 = \alpha^5 + \alpha^3 + \alpha + \alpha^4 + \alpha^2 = \alpha^2 \quad (\because \alpha^7 = 1) \]
\[ S_1 = \alpha, \quad S_2 = \alpha^2 \]
■

Error-Locator Polynomial Computation
The error-locator polynomial computation is the second step in decoding of BCH codes. We use the Berlekamp-Massey recursive algorithm to compute the error-locator polynomial. The flow chart of the Berlekamp-Massey recursion is shown in Figure 3.15. If the number of errors present in the received data vector is L (which is less than or equal to T), then this algorithm computes the L-degree error-locator polynomial in 2T iterations. First, we initialize the error-locator polynomial $\Lambda(x) = 1$ as a minimum-degree polynomial with degree L = 0. Then we use the syndromes to build the error-locator polynomial by computing the discrepancy $\Delta$. If the value of $\Delta$ is not zero, then we update the minimum-degree polynomial with the discrepancy; otherwise, we continue the loop. If the number of errors in the received polynomial R(x) is T or less, then $\Lambda(x)$ produces the true error pattern. At the end of 2T iterations of the Berlekamp-Massey recursion, we have the Lth-degree error-locator polynomial (with $\Lambda_0 = 1$) as follows:

\[ \Lambda(x) = \Lambda_0 + \Lambda_1 x + \Lambda_2 x^2 + \cdots + \Lambda_L x^L = (1 + X_1 x)(1 + X_2 x) \cdots (1 + X_L x) \tag{3.21} \]

Once we have an error-locator polynomial of degree L, we can find the L error positions by computing the roots of the error-locator polynomial (see Examples 3.14 and 3.15).

Figure 3.15: Flow chart diagram of Berlekamp-Massey algorithm. (Start with $\Lambda^{(0)}(x) = 1$, $B^{(0)}(x) = 1$, L = 0, k = 1. Compute $\Delta_k = \sum_{i=0}^{L} \Lambda_i^{(k-1)} S_{k-i}$. If $\Delta_k \ne 0$, set $\Lambda^{(k)}(x) = \Lambda^{(k-1)}(x) - \Delta_k B^{(k-1)}(x) \cdot x$; if in addition $2L \le k - 1$, set $L = k - L$ and $B^{(k)}(x) = \Lambda^{(k-1)}(x)/\Delta_k$. Otherwise set $B^{(k)}(x) = B^{(k-1)}(x) \cdot x$.)
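The syndrome evaluation of Example 3.13 and the Berlekamp-Massey recursion of Figure 3.15 can be sketched in C as follows. This is a minimal illustration, not the book's implementation: it assumes GF($2^3$) built on $x^3 + x + 1$ with each field element stored as a 3-bit integer (so $\alpha = 2$, $\alpha^2 = 4$, $\alpha^3 = 3$, and so on), and the names `gf_mul`, `gf_inv`, `syndromes`, `berlekamp_massey`, and `MAX_DEG` are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_DEG 8   /* hypothetical bound on 2T for this sketch */

/* GF(2^3) built on x^3 + x + 1; an element is a 3-bit integer. */
static const int gf_exp[7] = {1, 2, 4, 3, 6, 7, 5};      /* alpha^0..alpha^6 */
static const int gf_log[8] = {-1, 0, 1, 3, 2, 6, 4, 5};  /* log of each element */

static int gf_mul(int a, int b)
{
    if (a == 0 || b == 0) return 0;
    return gf_exp[(gf_log[a] + gf_log[b]) % 7];
}

static int gf_inv(int a)
{
    assert(a != 0);                      /* zero has no inverse */
    return gf_exp[(7 - gf_log[a]) % 7];
}

/* S[i] = R(alpha^i) for i = 1..two_t; r[n] is the coefficient of x^n. */
static void syndromes(const int *r, int len, int two_t, int *S)
{
    int i, n;
    for (i = 1; i <= two_t; i++) {
        S[i] = 0;
        for (n = 0; n < len; n++)
            S[i] ^= gf_mul(r[n], gf_exp[(i * n) % 7]);
    }
}

/* Berlekamp-Massey recursion. S[1..two_t] are the syndromes; lambda must
 * hold MAX_DEG + 1 entries. Returns the degree L of the locator polynomial. */
static int berlekamp_massey(const int *S, int two_t, int *lambda)
{
    int B[MAX_DEG + 1] = {1};
    int old[MAX_DEG + 1];
    int L = 0, k, j;

    assert(two_t <= MAX_DEG);
    memset(lambda, 0, (MAX_DEG + 1) * sizeof lambda[0]);
    lambda[0] = 1;

    for (k = 1; k <= two_t; k++) {
        int delta = 0;
        for (j = 0; j <= L; j++)                 /* discrepancy Delta_k */
            delta ^= gf_mul(lambda[j], S[k - j]);

        memcpy(old, lambda, sizeof old);         /* keep Lambda^(k-1) */
        if (delta != 0)
            for (j = 1; j <= MAX_DEG; j++)       /* Lambda -= Delta * x * B */
                lambda[j] ^= gf_mul(delta, B[j - 1]);

        if (delta != 0 && 2 * L <= k - 1) {
            int inv = gf_inv(delta);
            for (j = 0; j <= MAX_DEG; j++)       /* B = Lambda^(k-1) / Delta */
                B[j] = gf_mul(inv, old[j]);
            L = k - L;
        } else {
            for (j = MAX_DEG; j >= 1; j--)       /* B = x * B */
                B[j] = B[j - 1];
            B[0] = 0;
        }
    }
    return L;
}
```

For the received vector R = [1110110], the coefficient array is `{0, 1, 1, 0, 1, 1, 1}` (index n holds the coefficient of $x^n$), and the sketch yields $S_1 = \alpha$ and $S_2 = \alpha^2$. Because addition and subtraction coincide in GF($2^m$), the update uses XOR in place of the minus sign.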
■ Example 3.14
Using the two syndromes computed in Example 3.13, we build the error-locator polynomial using the Berlekamp-Massey recursion shown in Figure 3.15.

Initialization: $\Lambda^{(0)}(x) = 1$, $B^{(0)}(x) = 1$, L = 0, k = 1

First iteration (k = 1): $\Delta_1 = S_1 = \alpha$. Since $\Delta_1 \ne 0$, $\Lambda^{(1)}(x) = 1 + \alpha x$. (Note: In the Galois field "+" is the same as "−".) Since $2L \le k - 1$, $L = k - L = 1$ and $B^{(1)}(x) = 1/\alpha = \alpha^6$.

Second iteration (k = 2): $\Delta_2 = \Lambda_0^{(1)} S_2 + \Lambda_1^{(1)} S_1 = \alpha^2 + \alpha^2 = 0$. Since $\Delta_2 = 0$, $\Lambda^{(2)}(x) = \Lambda^{(1)}(x)$ and $B^{(2)}(x) = x B^{(1)}(x) = \alpha^6 x$.

Since k = 2T (last iteration reached), we stop the Berlekamp-Massey algorithm. The error-locator polynomial is $\Lambda(x) = \Lambda^{(2)}(x) = 1 + \alpha x$.
■

Error Positions Computation
If the number of errors L present in the received data vector is less than or equal to T (i.e., L ≤ T), then the error-locator polynomial can be factored into L first-degree polynomials as in Equation (3.21), and the roots of the error-locator polynomial are $X_1^{-1}, X_2^{-1}, \ldots, X_L^{-1}$. The error positions are given by the inverses of the roots of the error-locator polynomial. So the L error positions are $X_1, X_2, \ldots, X_L$. Because binary BCH codes work on data bits, once we find the error positions in the received data bits, we correct the data simply by flipping the bit values at those positions.

■ Example 3.15
We continue Example 3.14 and find the error locations by finding the roots of the error-locator polynomial. Since the computed error-locator polynomial has degree 1, its root is computed as

\[ \Lambda(X_1^{-1}) = 1 + \alpha X_1^{-1} = 0 \;\Rightarrow\; \alpha X_1^{-1} = 1 \;\Rightarrow\; X_1^{-1} = 1/\alpha = \alpha^6 \]

The error position is given by the inverse of the root of the error-locator polynomial. Therefore, $X_1 = 1/\alpha^6 = \alpha^1$.
■

Error Correction
As we are working with binary BCH codes, we correct only the bit errors present in the received data (in the next section we discuss how to correct m-bit words with RS codes).
The correction of bit errors is achieved by flipping the bit value at the error position (see Example 3.16). If the degree of the error-locator polynomial (L) and the number of error positions found (P) are not equal, then the BCH decoder cannot correct the errors because the number of errors that occurred exceeds the decoder's error correction capability. Therefore, we skip error bit correction when L ≠ P.

■ Example 3.16
We computed the error position in Example 3.15 as $X_1 = \alpha^1$. The exponent of the error position gives the location of the error in the received data vector. In our case, the exponent of the error position is 1, so the error is present at position 1 in the received vector R = [1110110]. The indexing starts from the LSB side as shown in the following.

Bit value:    1 1 1 0 1 1 0
Bit position: 6 5 4 3 2 1 0

Thus, the corrected data vector is [1110100], which is the same as the transmitted data vector.
■

In Section 4.1, we will further study the BCH codes. Also, we discuss the simulation of BCH codes and the efficient techniques to implement BCH codes.

3.6 RS Codes
Reed-Solomon (RS) codes are block-based linear nonbinary error-correcting codes with a wide range of applications. The RS(N, K) coder works on a block of data: it takes a K-element block as input and outputs an N-element block by adding N − K elements of redundant data, which are used to perform error correction at the receiver side. By adding redundant data before transmission, RS codes can detect and correct errors within blocks of the data frame. For any positive integer $T \le 2^m - 1$, there exists a T-symbol error correcting RS code with the following parameters.

\[ N = 2^m - 1 \]
\[ K = N - 2T = 2^m - 1 - 2T \]
\[ d_{\min} = 2T + 1 = N - K + 1 \]

The RS(N, K) coder works with Galois field elements of m bits width, and the data elements of the RS(N, K) coder belong to the GF($2^m$) Galois field. The RS(N, K) encoder adds 2T = N − K elements of redundancy at the transmitter side, and the RS(N, K) decoder uses that redundancy to correct up to T = (N − K)/2 errors at the receiver side.
As an RS code consists of m-bit elements, these codes are well suited to correct burst bit errors. A few applications where RS codes are predominantly used include high-speed modems such as ADSL and xDSL, storage devices (e.g., compact disc [CD], DVD, hard disk), mobile and satellite communications, and digital television and DVB. Like RS codes, BCH codes (see Sections 3.5 and 4.1) are used in some of the previous applications for FEC. Both BCH codes and RS codes are linear block codes. The BCH codes are binary, whereas RS codes are nonbinary. The error correction capability of BCH codes is inferior when compared to RS codes. In other words, we achieve larger coding gain with RS codes than with BCH codes for given data rates and channel conditions. In burst error cases, RS codes perform better than BCH codes.

In this section, we discuss the RS(N, K) coder to correct T data elements at the receiver side. The block diagram of the RS(N, K) coder is shown in Figure 3.16. The RS(N, K) encoder takes a K-element block D as input and outputs an N-element block M. The RS(N, K) decoder takes the received N-element block R (with errors) as input and outputs a K-element block D′.

Figure 3.16: Block diagram of RS(N, K) coder. (Transmitter: K-element input block D, RS(N, K) encoder, N-element block M; channel; receiver: N-element block R, RS(N, K) decoder, K-element decoded output D′.)

Figure 3.17: Operational blocks of RS(N, K) encoder. (D(x) is multiplied by $x^{N-K}$, divided by G(x) in a polynomial divider, and the result is combined in a polynomial adder to form M(x).)

3.6.1 RS(N, K) Encoder
Using the RS(N, K) encoder, we compute the (N − K)-length parity polynomial B(x) from the K-length input message D(x) by using the generator polynomial G(x). The encoded message M(x) is obtained as

\[ M(x) = D(x) \cdot x^{N-K} + B(x) \tag{3.22} \]

The following generator polynomial is used in the RS(N, K) encoder to compute the parity data:

\[ G(x) = (x + \alpha^0)(x + \alpha^1)(x + \alpha^2) \cdots (x + \alpha^{2T-1}) = g_0 + g_1 x + g_2 x^2 + \cdots + g_{2T-1} x^{2T-1} + x^{2T} \tag{3.23} \]

where 2T = N − K.
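The construction of G(x) from its roots, and the systematic encoding of Equation (3.22), can be sketched in C as follows. This is a minimal sketch under stated assumptions rather than the book's code: it uses GF($2^3$) built on $x^3 + x + 1$ with 3-bit integer elements, takes the parity B(x) to be the remainder of $D(x) \cdot x^{N-K}$ divided by G(x), and lets the caller choose the exponent of the first root through the hypothetical parameter `first`. The function names `rs_generator` and `rs_parity` are also hypothetical.

```c
#include <assert.h>

/* GF(2^3) built on x^3 + x + 1; an element is a 3-bit integer. */
static const int gf_exp[7] = {1, 2, 4, 3, 6, 7, 5};      /* alpha^0..alpha^6 */
static const int gf_log[8] = {-1, 0, 1, 3, 2, 6, 4, 5};  /* log of each element */

static int gf_mul(int a, int b)
{
    if (a == 0 || b == 0) return 0;
    return gf_exp[(gf_log[a] + gf_log[b]) % 7];
}

/* g[0..two_t] receives the coefficients of
 * (x + alpha^first)(x + alpha^(first+1)) ... (x + alpha^(first+two_t-1)),
 * where g[j] multiplies x^j. */
static void rs_generator(int first, int two_t, int *g)
{
    int i, j;
    assert(first >= 0 && two_t >= 1);
    g[0] = 1;
    for (i = 0; i < two_t; i++) {
        int root = gf_exp[(first + i) % 7];
        g[i + 1] = g[i];                          /* new leading coefficient */
        for (j = i; j >= 1; j--)                  /* multiply by (x + root) */
            g[j] = g[j - 1] ^ gf_mul(g[j], root);
        g[0] = gf_mul(g[0], root);
    }
}

/* b[0..two_t-1] receives D(x) * x^two_t mod G(x), computed with a
 * division shift register. d[j] multiplies x^j; g is monic of degree two_t. */
static void rs_parity(const int *d, int k, const int *g, int two_t, int *b)
{
    int i, j;
    for (j = 0; j < two_t; j++)
        b[j] = 0;
    for (i = k - 1; i >= 0; i--) {                /* highest-order symbol first */
        int fb = d[i] ^ b[two_t - 1];             /* feedback value */
        for (j = two_t - 1; j >= 1; j--)
            b[j] = b[j - 1] ^ gf_mul(fb, g[j]);
        b[0] = gf_mul(fb, g[0]);
    }
}
```

With `first = 1` and `two_t = 4` the sketch reproduces the generator polynomial worked out in Example 3.17, and with the message $\alpha^3 x^2 + \alpha x + \alpha^4$ stored as `d = {6, 2, 3}` it returns the parity symbols of that example. Note that the stored integers are bit patterns of field elements, not exponents of $\alpha$.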
The polynomial G(x) is computed by multiplying 2T first-degree polynomials $(x + \alpha^i)$, where $0 \le i < 2T$. The parity polynomial B(x) is computed as

\[ B(x) = D(x) \cdot x^{N-K} \bmod G(x) \tag{3.24} \]

The equivalent schematic block diagram of Equations (3.22) and (3.24) is shown in Figure 3.17. In this section, we work with a few examples to better understand RS codes (see Examples 3.17, 3.18, and 3.19).

■ Example 3.17
Let us consider data elements with 3-bit width. We work with the RS(7, 3) coder and use Galois field GF($2^3$) arithmetic (see Appendix B, Section B.2, on the companion website for more details on GF) to encode and decode the data elements. With this, the three parameter values of the RS coder are N = 7, K = 3, 2T = N − K = 4. Let the K-length message vector be D = [3 1 4], where each entry is an exponent of $\alpha$. In terms of polynomial notation,

\[ D(x) = \alpha^3 x^2 + \alpha^1 x + \alpha^4 \]

For T = 2, the generator polynomial G(x) is given by

\[ G(x) = (x + \alpha)(x + \alpha^2)(x + \alpha^3)(x + \alpha^4) = x^4 + \alpha^3 x^3 + x^2 + \alpha x + \alpha^3 \]

(Note that the roots used here are $\alpha^1$ through $\alpha^{2T}$, which match the syndrome evaluation points of Section 3.6.3, rather than $\alpha^0$ through $\alpha^{2T-1}$ as written in Equation (3.23).) The parity polynomial B(x) is computed using Equation (3.24) as

\[ B(x) = x^{N-K} D(x) \bmod G(x) = x^4(\alpha^3 x^2 + \alpha x + \alpha^4) \bmod (x^4 + \alpha^3 x^3 + x^2 + \alpha x + \alpha^3) = \alpha^3 x^3 + \alpha^5 x^2 + \alpha^5 x + \alpha \]

Then the codeword polynomial M(x) is obtained from Equation (3.22) as

\[ M(x) = \alpha^3 x^6 + \alpha x^5 + \alpha^4 x^4 + \alpha^3 x^3 + \alpha^5 x^2 + \alpha^5 x + \alpha \]
■

3.6.2 RS(N, K) Decoder
The RS(N, K) decoder takes data blocks of N elements as input and outputs a K-element data block as shown in Figure 3.16. If errors are present in the received data and their number is less than or equal to (N − K)/2, then the RS decoder corrects the errors and outputs a corrected data block. Let $R(x) = r_{N-1} x^{N-1} + r_{N-2} x^{N-2} + \cdots + r_1 x + r_0$ be the received polynomial with noise; then R(x) = M(x) + E(x), where E(x) is the error polynomial. If R(x) has v errors at the locations $x^{i_1}, x^{i_2}, \ldots, x^{i_v}$, then E(x) is represented with the corresponding error magnitudes as follows:

\[ E(x) = e_{i_1} x^{i_1} + e_{i_2} x^{i_2} + \cdots + e_{i_v} x^{i_v} \tag{3.25} \]

The error correction with the RS decoder is achieved in four steps, and the schematic block diagram of the RS decoder is shown in Figure 3.18.

Figure 3.18: Schematic block diagram of RS decoder. (Blocks: syndromes computation, error-locator polynomial computation, error roots finding, error magnitudes computation, and a delay path for R(x); the output is $\hat{M}(x)$.)

3.6.3 Syndrome Computation
In RS decoding, the first step of the decoder is syndrome computation. Syndromes, which indicate the presence of errors, are computed using the received data polynomial R(x). The syndromes are the values of the received polynomial evaluated at $x = \alpha^j$ for $1 \le j \le 2T$:

\[ S_j = R(\alpha^j) = M(\alpha^j) + E(\alpha^j) \]
\[ \because M(\alpha^j) = D(\alpha^j) G(\alpha^j) = 0 \;\Rightarrow\; S_j = R(\alpha^j) = E(\alpha^j) \]

If all the syndromes are zero, then there are no errors in the received data. We compute a total of 2T syndromes in the syndrome computation step. The i-th syndrome is computed as follows:

\[ S_i = R(\alpha^i) = \sum_{n=0}^{N-1} r_n (\alpha^i)^n \tag{3.26} \]

where the addition is modulo-2 and is performed using ⊕ instead of +.

■ Example 3.18
R(x) is the noisy received polynomial corresponding to the transmitted codeword polynomial M(x). The received polynomial with errors in two positions follows:

\[ R(x) = \alpha^3 x^6 + \alpha x^5 + \alpha^6 x^4 + \alpha^3 x^3 + \alpha^3 x^2 + \alpha^5 x + \alpha \]

From Equation (3.26), the 4 (= 2T) syndromes are computed as

\[ S_1 = \alpha^5, \quad S_2 = \alpha^3, \quad S_3 = 0, \quad S_4 = \alpha^2 \]
■

3.6.4 Error-Locator Polynomial Computation
Let $X_i$ for i = 1, 2, ..., v be the error locations and $\Lambda(x)$ be the error-locator polynomial. Then

\[ \Lambda(x) = (1 - X_1 x)(1 - X_2 x) \cdots (1 - X_v x) = \prod_{i=1}^{v} (1 - X_i x) = 1 + \Lambda_1 x + \Lambda_2 x^2 + \cdots + \Lambda_v x^v \]

The coefficients $\Lambda_1, \Lambda_2, \ldots, \Lambda_v$ of $\Lambda(x)$ are computed using the Berlekamp-Massey recursion (see Figure 3.15) given in the following.
Initial conditions: $\Lambda^{(0)}(x) = 1$, $B^{(0)}(x) = 1$, $L_0 = 0$

i-th iteration:

\[ \Delta_i = \sum_{j=0}^{L_{i-1}} \Lambda_j^{(i-1)} S_{i-j} \]
\[ \delta_i = \begin{cases} 1 & \text{if } \Delta_i \ne 0 \text{ and } 2L_{i-1} \le i - 1 \\ 0 & \text{otherwise} \end{cases} \]
\[ \begin{bmatrix} \Lambda^{(i)}(x) \\ B^{(i)}(x) \end{bmatrix} = \begin{bmatrix} 1 & -\Delta_i x \\ \Delta_i^{-1} \delta_i & (1 - \delta_i) x \end{bmatrix} \begin{bmatrix} \Lambda^{(i-1)}(x) \\ B^{(i-1)}(x) \end{bmatrix} \]
\[ L_i = \delta_i (i - L_{i-1}) + (1 - \delta_i) L_{i-1} \]

We iterate the Berlekamp-Massey algorithm 2T times to get an error-locator polynomial $\Lambda(x)$ of degree v that is less than or equal to T. If v ≤ T, then the roots of the error-locator polynomial $\Lambda(x)$ give the valid error positions in the received data vector.

■ Example 3.19
We compute the error-locator polynomial by using the syndromes of the received polynomial computed in Example 3.18.

$\Lambda^{(0)}(x) = 1$, $B^{(0)}(x) = 1$, $L_0 = 0$

For i = 1,

\[ \Delta_1 = \sum_{j} \Lambda_j^{(0)} S_{1-j} = S_1 = \alpha^5 \]
\[ \delta_1 = 1, \quad \because \Delta_1 \ne 0 \text{ and } 2L_0 \le 0 \]
\[ \Lambda^{(1)}(x) = 1 + \alpha^5 x, \quad B^{(1)}(x) = \alpha^2, \quad L_1 = 1 \]

We continue in the same way up to i = 2T (in our case 2T = 4). The final error-locator polynomial is given by

\[ \Lambda(x) = \Lambda^{(4)}(x) = 1 + \alpha x + \alpha^6 x^2 = (1 + \alpha^2 x)(1 + \alpha^4 x) \]
■

3.6.5 Roots of Error-Locator Polynomial
We compute the roots of the error-locator polynomial (ELP) $\Lambda(x)$ with a brute force method (also called Chien's search) by checking all the field elements to see which of them satisfy $\Lambda(x) = 0$. Equation (3.27) gives the error roots as $X_i^{-1} = \alpha^k$, where $1 \le i \le v$, whenever $P_k$ becomes zero.

\[ P_k = \Lambda(\alpha^k) = \sum_{j=0}^{v} \Lambda_j (\alpha^k)^j \tag{3.27} \]

where $0 \le k < N$. With the ELP from Example 3.19, the error roots are found as $X_1^{-1} = \alpha^5$ and $X_2^{-1} = \alpha^3$. The error positions are then given by the inverses of the error roots. Thus, $X_1 = \alpha^2$ and $X_2 = \alpha^4$.

3.6.6 Error Magnitude Polynomial Computation
The error magnitude polynomial $\Omega(x) = 1 + \omega_1 x + \omega_2 x^2 + \cdots + \omega_{2T} x^{2T}$ is defined as

\[ \Omega(x) = \Lambda(x)[1 + S(x)] \bmod x^{2T+1} \tag{3.28} \]

where $S(x) = \sum_{j=1}^{2T} S_j x^j$ and $\Lambda(x) = \sum_{i=0}^{v} \Lambda_i x^i$ with $\Lambda_0 = 1$ are the syndrome polynomial and error-locator polynomial, respectively.
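The root search of Equation (3.27) and the product of Equation (3.28) can be sketched in C as follows. This is an illustrative sketch only: it assumes GF($2^3$) built on $x^3 + x + 1$ with 3-bit integer elements, reports each error position as the exponent i in $X = \alpha^i$, and uses the hypothetical names `poly_eval_alpha`, `chien_search`, and `error_magnitude_poly`.

```c
#include <assert.h>

/* GF(2^3) built on x^3 + x + 1; an element is a 3-bit integer. */
static const int gf_exp[7] = {1, 2, 4, 3, 6, 7, 5};      /* alpha^0..alpha^6 */
static const int gf_log[8] = {-1, 0, 1, 3, 2, 6, 4, 5};  /* log of each element */

static int gf_mul(int a, int b)
{
    if (a == 0 || b == 0) return 0;
    return gf_exp[(gf_log[a] + gf_log[b]) % 7];
}

/* Evaluate p(x) = sum of p[j] x^j at x = alpha^k. */
static int poly_eval_alpha(const int *p, int deg, int k)
{
    int j, acc = 0;
    for (j = 0; j <= deg; j++)
        acc ^= gf_mul(p[j], gf_exp[(j * k) % 7]);
    return acc;
}

/* Try every alpha^k, 0 <= k < 7. For each root alpha^k the error position
 * is the exponent of its inverse. Returns the number of roots found. */
static int chien_search(const int *lambda, int deg, int *pos)
{
    int k, count = 0;
    assert(deg >= 0 && lambda[0] == 1);       /* Lambda_0 = 1 */
    for (k = 0; k < 7; k++)
        if (poly_eval_alpha(lambda, deg, k) == 0)
            pos[count++] = (7 - k) % 7;
    return count;
}

/* omega[0..two_t] = Lambda(x) * (1 + S(x)) mod x^(two_t + 1),
 * with syndromes stored in S[1..two_t]. */
static void error_magnitude_poly(const int *lambda, int deg,
                                 const int *S, int two_t, int *omega)
{
    int i, j;
    for (i = 0; i <= two_t; i++) {
        omega[i] = (i <= deg) ? lambda[i] : 0;      /* Lambda(x) * 1 */
        for (j = 0; j <= deg && j < i; j++)         /* Lambda_j * S_(i-j) */
            omega[i] ^= gf_mul(lambda[j], S[i - j]);
    }
}
```

A decoder can compare the count returned by `chien_search` with the degree of the locator polynomial; a mismatch indicates more errors than the code can correct, as noted for BCH codes in Section 3.5.2.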
From Equations (3.25) and (3.26),

\[ S(x) = \sum_{j} \sum_{i=1}^{v} Y_i X_i^j x^j = \sum_{i=1}^{v} Y_i \left[ \sum_{j} (X_i x)^j \right] \]

where $Y_k = e_{i_k}$ and $X_k = \alpha^{i_k}$ are the error magnitudes and error locations, respectively. Assuming $|X_i x| < 1$ and using the infinite geometric series summation result, S(x) can be approximated as

\[ S(x) = \sum_{i=1}^{v} Y_i \frac{X_i x}{1 - X_i x} \tag{3.29} \]

From Equations (3.28) and (3.29),

\[ \Omega(x) = \Lambda(x) \left[ 1 + \sum_{i=1}^{v} Y_i \frac{X_i x}{1 - X_i x} \right] \bmod x^{2T+1} \]
\[ = \left[ \Lambda(x) + \sum_{i=1}^{v} Y_i X_i x \frac{\Lambda(x)}{1 - X_i x} \right] \bmod x^{2T+1} \]
\[ = \Lambda(x) + \sum_{i=1}^{v} Y_i X_i x \prod_{j \ne i} (1 - X_j x) \tag{3.30} \]

3.6.7 Error Magnitude Computation
To compute the error magnitudes from the error magnitude polynomial, we use the Forney algorithm. From Equation (3.30), and because $\Lambda(X_k^{-1}) = 0$,

\[ \Omega(X_k^{-1}) = \Lambda(X_k^{-1}) + \sum_{i=1}^{v} Y_i X_i X_k^{-1} \prod_{j \ne i} (1 - X_j X_k^{-1}) = Y_k \prod_{j \ne k} (1 - X_j X_k^{-1}) \tag{3.31} \]

The error-locator polynomial with its factors follows:

\[ \Lambda(x) = \prod_{i=1}^{v} (1 - X_i x) \tag{3.32} \]

Differentiating Equation (3.32) with respect to x on both sides, we have

\[ \Lambda'(x) = \frac{\partial}{\partial x} \prod_{i=1}^{v} (1 - X_i x) = -\sum_{j=1}^{v} X_j \prod_{i \ne j} (1 - X_i x) \]
\[ \Lambda'(X_k^{-1}) = -\sum_{j=1}^{v} X_j \prod_{i \ne j} (1 - X_i X_k^{-1}) = -X_k \prod_{i \ne k} (1 - X_i X_k^{-1}) \tag{3.33} \]

From Equations (3.31) and (3.33), the error magnitudes are obtained as

\[ Y_k = e_{i_k} = -X_k \frac{\Omega(X_k^{-1})}{\Lambda'(X_k^{-1})} \tag{3.34} \]

3.6.8 Error Correction
Once we know the error locations and error magnitudes, we can compute the error polynomial E(x) from Equation (3.25). (See Example 3.20.) The corrected data polynomial $\hat{D}(x)$ is read from the corrected codeword polynomial $\hat{M}(x)$, which is obtained from the received polynomial R(x) as

\[ \hat{M}(x) = R(x) + E(x) \tag{3.35} \]

■ Example 3.20
From Example 3.19, the error positions $X_1$ and $X_2$ are obtained as $X_1 = \alpha^2$ and $X_2 = \alpha^4$.
From Equation (3.33), the quantities $\Lambda'(X_1^{-1})$ and $\Lambda'(X_2^{-1})$ are computed as

\[ \Lambda'(x) = -X_1(1 - X_2 x) - X_2(1 - X_1 x) \]
\[ \Lambda'(X_1^{-1}) = -X_1(1 - X_2 X_1^{-1}) = \alpha^2(1 + \alpha^4 \alpha^5) = \alpha \]
\[ \Lambda'(X_2^{-1}) = -X_2(1 - X_1 X_2^{-1}) = \alpha^4(1 + \alpha^2 \alpha^3) = \alpha \]

From Equation (3.28), the error magnitude polynomial $\Omega(x)$ is obtained from $\Lambda(x)$ and S(x) as

\[ \Omega(x) = 1 + \alpha^6 x + \alpha^3 x^2 \]

Then $\Omega(X_1^{-1}) = \alpha$ and $\Omega(X_2^{-1}) = 1$. Using Equation (3.34),

\[ Y_1 = -X_1 \frac{\Omega(X_1^{-1})}{\Lambda'(X_1^{-1})} = \alpha^2 \frac{\alpha}{\alpha} = \alpha^2, \quad Y_2 = \alpha^3 \]

The error polynomial is computed from Equation (3.25) as

\[ E(x) = \alpha^2 x^2 + \alpha^3 x^4 \]

The corrected data polynomial from Equation (3.35) is obtained as

\[ \hat{M}(x) = \alpha^3 x^6 + \alpha x^5 + \alpha^4 x^4 + \alpha^3 x^3 + \alpha^5 x^2 + \alpha^5 x + \alpha \]
\[ \hat{D}(x) = \alpha^3 x^2 + \alpha x + \alpha^4 \quad \text{or} \quad \hat{D} = [3\ 1\ 4] \]
■

In Section 4.2, we discuss the simulation techniques for the RS coder. We will consider the RS(204, 188) coder for simulation purposes. Also, we discuss efficient implementation techniques for the RS decoder to minimize the cycle cost on the reference embedded processor.

3.7 Convolutional Codes
The difference between block codes and convolutional codes is that the former work on a block-by-block basis without any data dependency between the blocks, whereas in the latter case the output of the encoder depends not only on the current input block but also on the previous K − 1 input blocks, where K is the constraint length of the encoder. A convolutional code is generated by passing the bitstream through a linear finite-state shift register as shown in Figure 3.19. All flip-flop registers are updated for every encoded input data block (so the encoder state changes with the encoding of each input block). The functionality of the convolutional encoder is similar to the convolution operation (i.e., linear filtering); hence, these codes are called convolutional codes. If we input k bits to the encoder and it outputs n coded bits, then we call it a rate k/n encoder.
Usually, convolutional codes perform better than cyclic block codes (e.g., RS codes) for the following reasons: convolutional decoders utilize the dependency among coded bits and are also capable of accepting soft information as input in decoding the bits. In the following subsections, various representations of convolutional codes are presented, the generation of both systematic and nonsystematic codes is discussed, and the decoding of convolutional codes using hard decisions (with the Hamming distance criterion) is discussed. The optimal decoding of convolutional codes (with soft data and the Euclidean distance criterion) using the Viterbi algorithm is discussed in Section 3.9.

3.7.1 Convolutional Encoder Representation
As we discussed in Section 3.3, block codes can be represented with a generator matrix. Here we cannot use a generator matrix to represent convolutional codes, as these codes are semi-infinite. However, it is possible to give a generator function for each output bit of a convolutional encoder. In this section, we discuss different ways of representing a convolutional encoder, along with the generator function representation of the individual output bits, using the rate 1/2 encoder shown in Figure 3.19.

Figure 3.19: Flip-flop register representation of convolutional encoder. (Input $a_1$; flip-flop registers $S_0$ and $S_1$; outputs $z_1$ and $z_2$.)

Flip-Flop Register Representation
In the flip-flop representation of the convolutional encoder, we define the input and output connections through flip-flop registers. Using the encoder shown in Figure 3.19, with one input bit $a_1$, we get two output bits $z_1$ and $z_2$ (hence, it is a rate 1/2 coder). The flip-flop registers are updated for every input block (here, 1 bit). As per the input-output connections shown in Figure 3.19, the state value $S_1$ of the flip-flop register is updated with $S_0$, and the state value $S_0$ of the flip-flop register is updated with the input bit $a_1$.
The constraint length K of this coder is 3, as the output bits depend not only on the current input bit but also on the previous two input bits (which are present in the flip-flop registers $S_0$ and $S_1$). The following equations give the relationship between the output bits $z_1$ and $z_2$ and the input bit $a_1$.

\[ z_1 = a_1 \oplus S_0 \oplus S_1 \tag{3.36} \]
\[ z_2 = a_1 \oplus S_1 \tag{3.37} \]

Generator/Transfer Function Representation
In the transfer function representation, we describe the input to output connections by assigning bit "1" if a connection to the output is present; otherwise, we assign bit "0" to say that there is no connection to the output. The number of bits in a generator function depends on the maximum total number of connections to any output bit. For example, in Figure 3.19, there are three connections to output bit $z_1$ and two connections to output bit $z_2$. Therefore, in the generator function representation of the convolutional coder shown in Figure 3.19, we use three bits for both outputs' generator functions. From Equations (3.36) and (3.37), the generator functions $g_1$ and $g_2$ for the two output bits $z_1$ and $z_2$ are

\[ g_1 = [111], \quad g_2 = [101] \]

We can also represent the generator functions in polynomial form as

\[ G_1(D) = 1 + D + D^2 \tag{3.38a} \]
\[ G_2(D) = 1 + D^2 \tag{3.38b} \]

State Machine Representation
From Figure 3.19, we can see that the output bits of the convolutional coder depend on both the input bits and the state values. Using the state machine representation of the convolutional coder, we can show the updated states along with the output bits for given input bits. The corresponding state machine representation of the convolutional encoder of Figure 3.19 is shown in Figure 3.20. From Figure 3.20, we can see how the states are updated and what output bits are generated with the corresponding input bits.
For example, if we input bit "0" when the encoder state is "01," then the output state becomes "10" and the output bits are "01." Similarly, if we input bit "1," then the output state is "11" and the corresponding output bits are "10." The state machine is a compact representation of a convolutional encoder when compared to other representations. With this state machine, we can see all possible states and output bit values for a given input value.

Tree Diagram Representation
In the tree diagram representation, we represent the states as nodes of a tree and the outputs as branches of a tree. We start the encoder at the zero state and build the tree for each possible input block bit pattern. The number of branches emerging from any node depends on the number of bits (k) in one input block. For example, in Figure 3.19, each input block contains only 1 bit; hence, there are two branches ($2^k$) from each node of the tree diagram. The corresponding tree diagram for the convolutional encoder shown in Figure 3.19 is presented in Figure 3.21. The upward branches from a node are due to input bit "0," whereas the downward branches from a node are due to input bit "1." Starting with the zero state, the tree diagram shows all possible output states and output bits for all possible input block bit patterns.

Trellis Diagram Representation
The trellis diagram is a time-indexed version of a state diagram. With the trellis diagram representation, we can see all possible output states and output bits for a given input bit with respect to the time scale. In practice, we start the trellis from the zero state and force the zero state at the end of the input bitstream with trellis terminating bits (usually, we use 0 bits to terminate the trellis). A trellis diagram is popularly used in decoding of convolutional codes. Figure 3.22 shows the trellis diagram corresponding to the rate 1/2 encoder shown in Figure 3.19.
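The encoder of Figure 3.19 can be sketched in C directly from Equations (3.36) and (3.37). This is a minimal sketch, not the book's code: it assumes the state is packed as the 2-bit integer $S_1 S_0$ and the output as the 2-bit integer $z_2 z_1$ (the same order as the $a_1/z_2 z_1$ branch labels), and the function names are hypothetical.

```c
#include <assert.h>

/* Encode one input bit. *state holds S1 in bit 1 and S0 in bit 0.
 * Returns the 2-bit codeword z2z1 and updates the state. */
static int conv_encode_bit(int *state, int a)
{
    int s0 = *state & 1;
    int s1 = (*state >> 1) & 1;
    int z1 = a ^ s0 ^ s1;        /* Equation (3.36) */
    int z2 = a ^ s1;             /* Equation (3.37) */

    assert(a == 0 || a == 1);
    *state = (s0 << 1) | a;      /* S1 <- S0, S0 <- a1 */
    return (z2 << 1) | z1;
}

/* Encode n bits starting from the zero state; codewords[i] gets the
 * 2-bit codeword for bits[i]. Returns the final encoder state. */
static int conv_encode(const int *bits, int n, int *codewords)
{
    int i, state = 0;
    for (i = 0; i < n; i++)
        codewords[i] = conv_encode_bit(&state, bits[i]);
    return state;
}
```

Starting from state "01", an input of 0 gives codeword "01" and next state "10", and an input of 1 gives codeword "10" and next state "11", which agrees with the state machine example above. Appending K − 1 = 2 zero bits to the input returns the encoder to the zero state.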
Figure 3.20: State machine representation of convolutional coder. (States are $S_1 S_0$; branch labels are $a_1/z_2 z_1$.)

Figure 3.21: Tree diagram representation of convolutional coder. (Nodes are states $S_1 S_0$; branch labels are $a_1/z_2 z_1$.)

Figure 3.22: Trellis diagram representation of convolutional codes. (States $S_1 S_0$ = 00, 01, 10, 11 at time indices i = 0, 1, 2, ..., n − 1; branch labels are $a_1/z_2 z_1$.)

Systematic and Nonsystematic Convolutional Codes
As discussed in Section 3.3, the error-correction codes are classified into two types: (1) systematic codes, and (2) nonsystematic codes (NSC). In the case of systematic codes, the original input data block is present as it is, along with parity data, at the output of the encoder, whereas with nonsystematic codes, we do not have a separate input data block in the output data after encoding. The convolutional code generated with the encoder shown in Figure 3.19 is a nonsystematic code, as no input data bits are directly present at the output. A class of systematic codes called recursive systematic codes (RSC) is popularly used with turbo coding (see Section 3.10), where we output the input data block along with the parity data block as shown in Figure 3.44.

3.7.2 Decoding Criterion for Convolutional Codes
Usually, in digital communications systems, the convolutional decoding happens after baseband demodulation as shown in Figure 3.23. In baseband binary phase shift keying (BPSK) demodulation, we have two options to obtain the demodulated data. In the first option, we quantize the data based on the sign of the demodulated output and get bit "0" if the sign is positive and bit "1" if the sign is negative. In this case we use 1 bit to represent the data, and these decisions are called hard decisions. In the second option, we quantize the demodulator output with more than one level.
In other words, we represent the demodulated data with multiple levels using more than 1 bit (e.g., with the eight levels −4, −3, −2, −1, 0, 1, 2, 3 using 3 bits), and these decisions are called soft decisions.

Hard Decision versus Soft Decision
At the convolutional decoder output, we see a considerable performance difference between hard-decision and soft-decision inputs to the decoder. The reason is illustrated in Figure 3.24. Consider the demodulator input sample highlighted with a dashed circle. This sample corresponds to bit "1," so it should point downward with some negative amplitude like the other "1" bit input samples to the demodulator. However, because of the larger noise at that sample, the noisy sample became a positive sample with value 0.0944. With the hard-decision demodulator, we output bit "0" as if the sample corresponded to a transmitted "0" bit, although it actually corresponds to a transmitted bit "1." With soft decisions, the demodulator outputs the sample with a small positive allowed quantization level. Now, assume a decoder based on the probability of having the sample close to some constant positive and negative thresholds. In other words, a bit is more likely to be decoded as "0" (and less likely to be decoded as "1") if the soft decision has a more positive value. Similarly, a bit is more likely to be decoded as "1" (and less likely to be decoded as "0") if the soft decision has a more negative value. From a probabilistic point of view, with demodulator hard-decision outputs, the highlighted sample (corresponding to bit "1") has the same probability as the "0"-bit samples, whereas with soft decisions, the probability of the "0"-bit samples is considerably higher than that of the highlighted sample. If these kinds of samples occur frequently in a sequence, then decoders based on the maximum likelihood criterion may make more wrong decisions with hard-decision inputs than with soft-decision inputs.
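The two demodulator output options can be sketched in C as follows. This is only an illustration under stated assumptions: the soft quantizer uses a floor operation with a hypothetical scale factor `levels_per_unit` (4 levels per unit amplitude maps the range [−1, 1) onto the eight levels −4 to 3), and the text does not specify the actual quantizer used for Figure 3.24.

```c
#include <assert.h>

/* Hard decision: bit 0 for a non-negative sample, bit 1 for a negative one. */
static int hard_decision(double y)
{
    return (y < 0.0) ? 1 : 0;
}

/* Soft decision: 3-bit level in [-4, 3].
 * levels_per_unit is a hypothetical scale factor, not taken from the text. */
static int soft_decision(double y, double levels_per_unit)
{
    double v;
    int q;

    assert(levels_per_unit > 0.0);
    v = y * levels_per_unit;
    if (v >= 3.0) return 3;          /* saturate at the top level */
    if (v < -4.0) return -4;         /* saturate at the bottom level */
    q = (int)v;                      /* truncates toward zero */
    if (v < (double)q) q--;          /* convert truncation to floor */
    return q;
}
```

With a scale of 4, the highlighted sample 0.0944 maps to hard bit 0 and to soft level 0, the smallest non-negative level, so a soft-decision decoder can treat it as a weak "0" rather than a confident one.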
Figure 3.23: Block diagram of baseband digital communications system. (Transmitter: input bits $a_k$, rate 1/2 convolutional coder, BPSK modulator, transmitter back end; channel h(t) with AWGN noise u(t); receiver: receiver front end, BPSK demodulator, convolutional decoder, output bits $b_k$.)

Figure 3.24: Illustration of hard decisions versus soft decisions. (Panels: demodulator input, hard decisions by demodulator, and soft decisions by demodulator with levels +3 to −4.)

Hamming Distance versus Euclidean Distance
As we discussed in Section 3.3.1, the Hamming distance between two codewords is given by the number of positions in which the bits of those two codewords are different. For example, the Hamming distance between the two codewords 01011101 and 01001011 is 3, as they differ in three bit positions. We may prefer to use the Hamming distance in convolutional decoding if the input to the decoder is hard decisions. In the next subsection, we discuss the decoding of convolutional codes with hard-decision inputs and the Hamming distance criterion.

The Euclidean distance is defined as the distance between two vectors or the absolute difference between two scalars. For example, consider two vectors $\overrightarrow{OA}$ and $\overrightarrow{OB}$ with A = (2.54, −1.98) and B = (1.44, −2.32). The Euclidean distance between these two vectors is computed as $\sqrt{(2.54 - 1.44)^2 + (-1.98 + 2.32)^2} = \sqrt{1.3256} \approx 1.151$. The Euclidean distance between two scalars P = −1.23 and Q = 2.45 is |−1.23 − 2.45| = 3.68. The Euclidean distance is popularly used in convolutional decoding with soft-decision inputs as well as hard-decision inputs.
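Both distance measures can be sketched in C as follows. This is a minimal sketch with hypothetical function names; it returns the squared Euclidean distance, on the assumption that a decoder only needs to compare distances, and the square root does not change which candidate is closest.

```c
#include <assert.h>

/* Number of bit positions in which two codewords differ. */
static int hamming_distance(unsigned a, unsigned b)
{
    unsigned diff = a ^ b;
    int count = 0;
    while (diff != 0) {
        count += (int)(diff & 1u);
        diff >>= 1;
    }
    return count;
}

/* Squared Euclidean distance between two n-element vectors.
 * For n = 1 this is the square of the absolute difference of two scalars. */
static double squared_euclidean(const double *a, const double *b, int n)
{
    double sum = 0.0;
    int i;

    assert(n > 0);
    for (i = 0; i < n; i++) {
        double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}
```

For the codewords 01011101 and 01001011 the first function returns 3, and for A = (2.54, −1.98) and B = (1.44, −2.32) the second returns approximately 1.3256, whose square root is the Euclidean distance.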
From a hardware point of view, the Hamming distance can be computed with less complex circuitry and the computation is fast, whereas the computation of the Euclidean distance involves floating-point operations, so the corresponding hardware is costly and the computation is slower. However, with soft-decision inputs and the Euclidean distance criterion, we see a considerable performance gain at the convolutional decoder output. In Section 3.9, we discuss the optimal decoding of convolutional codes with the Viterbi algorithm using the Euclidean distance as the criterion.

3.7.3 Convolutional Decoding with Hard Decisions
As we discussed in Section 3.7.1, the convolutional encoder is basically a finite-state machine. The optimum decoding criterion for convolutional codes is maximum likelihood sequence estimation (MLSE). In decoding with the maximum likelihood (ML) criterion, we select the most probable symbols as the decoded symbols by minimizing the overall symbol errors. This is achieved by processing all the trellis stages corresponding to the encoded symbols. We process the trellis stage by stage, removing the less probable paths of the trellis in each stage and retaining the most probable path at each node of a trellis stage. For this, we define two metrics, namely branch metrics and state metrics. The branch metrics are obtained by computing the distance between the received symbol and the branch symbol values. The state metrics are obtained by selecting the minimum error value obtained after adding the branch metrics to the state metrics of the previous stage states from which these branches diverge. In this way we obtain the most probable symbol path in the trellis at every stage of trellis processing. The path that includes the most probable paths of all trellis stages is called the global most probable path.
Then the bits corresponding to the trellis global most probable path are output as the decoded bit values. An example of decoding with the ML criterion is shown in Figure 3.25.

Figure 3.25: Convolutional decoding by trellis processing with Hamming distance. (Trellis stages i = 0 to i = 11 over states $S_1 S_0$, annotated with branch metrics and state metrics; the received codeword is shown below each stage.)

The two major issues with decoding of convolutional codes using the ML criterion are computational complexity and memory usage. The computational complexity of the ML decoder increases exponentially with the constraint length K (as the number of trellis states is equal to $2^{K-1}$). As we actually start decoding bits only after processing all the trellis stages, with large received frames and large constraint lengths we need a lot of data memory to store the history of the most probable trellis branches and the metrics of all states. As shown in Figure 3.25, we need to store all state metrics, as we do not know in advance which state metrics contribute toward the most probable paths. We store the branch connection information for each stage to trace the global most probable path.

As an example, we use the rate 1/2 convolutional coder shown in Figure 3.19 to illustrate the decoding of convolutional codes using the ML criterion. Assume we want to transmit the 10 information bits 0110001011; with two trellis termination bits appended, the encoder input is 011000101100 (the last 2 bits are used for trellis termination and are extra bits apart from our 10 bits of information). We start the encoder at state zero (i.e., $S_1 S_0 = 00$).
The corresponding encoded codewords for each bit are obtained (updated trellis states are not shown here) as 00, 11, 10, 10, 11, 00, 11, 01, 00, 10, 10, 11. As we used terminating bits, the trellis state at the end of this encoding becomes zero. With the digital communications system shown in Figure 3.23, assume we obtain the hard decisions at the receiver after the BPSK demodulator as 00, 11, 11, 10, 11, 00, 10, 01, 00, 11, 10, 11, with 3 bits in error (due to noise) when compared to the transmitted bit sequence; the errors are in the third, seventh, and tenth codewords. We decode the demodulator hard-decision outputs with the ML decoder by processing the trellis as shown in Figure 3.25. Here, we use the Hamming distance for computing the distance between the received codewords and the encoder trellis codewords. We know the encoder started from the zero state and was forced to the zero state at the end of encoding by using two terminating 0 bits, and these bits are not part of the information that is intended for communication. At the receiver, we have a total of 12 codewords including the two trellis termination codewords. Therefore, to decode the 10 transmitted bits, we have to process 12 codewords (or trellis stages) in total.

Convolutional Decoding by Trellis Processing
We use the transmitter encoder trellis shown in Figure 3.25 to decode convolutional codes with the ML decoder. We follow this trellis flow and compute the path (or branch) metrics and state metrics using the received codewords. The received codewords along with the codeword index are shown at the bottom of each trellis stage. At stage i = 0, we received the 2-bit codeword "00." As the encoder started from the zero state, we have only two possible paths at stage i = 0. We compute the Hamming distance between the received codeword and the trellis path codewords of the first stage. For convenience, we use the stabilized encoder trellis stage with the output bits for the allowed trellis paths as shown in Figure 3.26.
We initialize the state metric to zero at the start of the encoder trellis, as shown in Figure 3.25. For now, we ignore the meaning of the solid, dashed, and dotted branch lines. At stage i = 0, the computed Hamming distance between the received codeword and the trellis path connecting 0<>0 states (here m<>n denotes a branch that connects previous stage state m to current stage state n) is 0, as both codewords have the same bits (i.e., 00). Similarly, the Hamming distance between the received codeword and the trellis path connecting 0<>1 states is 2, as the two codewords differ in both bit positions (the received codeword is 00 and the trellis branch 0<>1 is 11, as shown in Figure 3.26). We add the branch metrics to the previous state metrics (on the left side of the current stage) and place the accumulated state metrics at the current states (on the right side of the current stage). At stage i = 0, we have two trellis paths; we select the most probable path as the one that connects to the state with the minimum state metric (i.e., the path connecting 0<>0, shown by a solid line). The accumulated state metrics at stage i = 0 are 0 and 2.

We move on to processing the trellis stage i = 1. At stage i = 1, we have four trellis branches diverging from the states at stage i = 0 and merging into the states at stage i = 1. We compute the Hamming distances from those four branches to the received codeword (i.e., 11) at stage i = 1. The values of the four branch metrics are shown at the corresponding branches. Then we add the branch metrics to the previous stage state metrics and place them at the current stage states. Here also (at stage i = 1), only a single branch merges into each current state, and the most probable path for this stage is given by the trellis branch that merges into the state with the minimum accumulated state metric (i.e., branch 0<>1 at stage i = 1).
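The metric bookkeeping for the first two stages can be sketched as follows. This is a minimal illustration, assuming the same inferred tap connections as the branch labels of Figure 3.26 (first output bit = input XOR S1, second = input XOR S0 XOR S1, state stored as 2*S1 + S0); the names hamming2, branch_output, acs_stage, and UNREACHED are hypothetical.

```c
#define UNREACHED (-1)   /* marks a state that no path has reached yet */

/* Hamming distance between two 2-bit codewords (values 0..3). */
int hamming2(unsigned a, unsigned b)
{
    unsigned x = (a ^ b) & 3u;
    return (int)((x & 1u) + (x >> 1));
}

/* Codeword on the branch leaving 'state' for input 'bit'.
 * ASSUMPTION: tap connections inferred from the branch labels in the text. */
unsigned branch_output(unsigned state, unsigned bit)
{
    unsigned s0 = state & 1u, s1 = (state >> 1) & 1u;
    return ((bit ^ s1) << 1) | (bit ^ s0 ^ s1);
}

/* Process one trellis stage: add each branch metric to the metric of its
 * source state and keep, for every destination state, the smallest sum. */
void acs_stage(const int sm_prev[4], unsigned rx, int sm_next[4])
{
    for (int s = 0; s < 4; s++)
        sm_next[s] = UNREACHED;

    for (unsigned s = 0; s < 4; s++) {
        if (sm_prev[s] == UNREACHED)
            continue;                              /* no path ends here yet */
        for (unsigned b = 0; b < 2; b++) {
            unsigned next = ((s << 1) | b) & 3u;   /* S1 <- S0, S0 <- b */
            int cand = sm_prev[s] + hamming2(rx, branch_output(s, b));
            if (sm_next[next] == UNREACHED || cand < sm_next[next])
                sm_next[next] = cand;
        }
    }
}
```

Starting from the metrics {0, UNREACHED, UNREACHED, UNREACHED}, the received codeword 00 gives the state metrics 0 and 2 for states 0 and 1, and the next received codeword 11 gives 2, 0, 3, 3 for states 0 to 3, so the smallest metric after stage i = 1 sits at state 1, reached by branch 0<>1.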
At stage i = 2, the trellis stabilizes and all allowed branches diverge from previous stage states and merge at current stage states. We obtain the branch metrics by computing the Hamming distance between the received codeword (i.e., 11, whose second bit is in error) and all trellis stage branch codewords. Then we add the branch metrics to the previous state metrics. From this stage onwards, more than one branch merges into the same state. If more than one branch merges into the current state, then we choose as the probable path the one that gives the minimum accumulated state metric. For example, consider the two branches merging into state "0" (i.e., 0<>0 and 2<>0). With the branch 0<>0, we have an accumulated metric of 4, whereas with branch 2<>0, we have an accumulated metric of 3. Therefore, we choose the branch 2<>0 as the probable path to state "0." In the same way, we compute the probable paths to all states. The most probable path for the current stage is then given by the branch that connects to the previous stage's most probable path and converges to the current state with the minimum accumulated state metric. We continue in the same manner and compute the most probable paths for all stages of the trellis.

Figure 3.26: Stabilized trellis stage branches with corresponding output bits. (Branch outputs by destination state: into state 0, 00 from state 0 and 11 from state 2; into state 1, 11 from state 0 and 00 from state 2; into state 2, 01 from state 1 and 10 from state 3; into state 3, 10 from state 1 and 01 from state 3.)

Now we can explain the meaning of the solid, dashed, and dotted lines in the trellis. The dotted line branches are the least probable paths, as their metrics, after accumulation with the previous state metrics, end up with relatively large values. The dashed lines represent the probable paths to each current state from previous states; they connect to the current state with a smaller accumulated state metric (when compared to the accumulated state metric of the least probable path).
The solid lines represent the most probable paths to a current state from a previous state with the minimum state metric (when compared to other state metrics).

Next, we discuss a few specific cases that arise in trellis processing. At stage i = 2, we have two state metrics with the same accumulated metric value, and those two paths diverge from the same previous state. In this case, as we don't know in advance which path is going to survive, we treat both paths as most probable paths. Because of this, the two paths, 1<>2 and 1<>3, are represented by solid lines. Next, when two branches from different previous states merge into a state with the same accumulated metric value, we arbitrarily choose one of them as the probable path. For example, at stage i = 4, the two paths 1<>2 and 3<>2 have the same accumulated state metric. We arbitrarily choose one of the two as the probable path and the other as the less probable path. In this case, we have chosen path 1<>2 as the probable path and path 3<>2 as the less probable path. As shown in Figure 3.25, the accumulated state metrics grow with errors, and we may have more than one most probable path.

After processing all the trellis stages, we end up with one path that connects the most probable paths of all stages, and we consider it the global most probable path. Tracing back the global most probable path and taking the corresponding branch input bits gives the decoded bit sequence. Since we forced the encoder to the zero state at the end of the bitstream with terminating 0 bits, the global most probable path starts and ends at the zero state. We know the input bit value for each trellis path that updates the trellis states. Figure 3.27 shows the trellis paths with the corresponding input bits. By following the global most probable path, we can retrieve the bits of each stage's most probable path (which is part of the global most probable path). These bits give an estimate of the transmitted bits.
From Figures 3.25 and 3.27, we retrieve the global most probable path bits as illustrated in Table 3.1. The retrieved bitstream is 011000101100, where the last 2 bits are trellis termination bits and we ignore them. The remaining 10 bits, 0110001011, are the bits decoded by the ML decoder as the estimate of the transmitted information bits. Although there were 3 bit errors at the input of the decoder, the convolutional decoder corrected them.

Figure 3.27: Stabilized trellis stage branches with corresponding input bits.

Table 3.1: Global most probable path and corresponding input bits

Stage (i)   Most Probable Global Path   Decoded Bit
0           0<>0                        0
1           0<>1                        1
2           1<>3                        1
3           3<>2                        0
4           2<>0                        0
5           0<>0                        0
6           0<>1                        1
7           1<>2                        0
8           2<>1                        1
9           1<>3                        1
10          3<>2                        0
11          2<>0                        0

As we discussed, the computational cost of convolutional decoding depends on the constraint length of the encoder (as the number of trellis states increases exponentially with the constraint length). For example, decoding convolutional codes that are encoded using a convolutional coder with a constraint length of 4 requires processing an 8-state trellis, as shown in Figure 3.28. The memory usage depends on the input data frame length (as an ML decoder works on one frame at a time) and on the constraint length. We have to store all stages and all state metrics, as well as all the most probable path connections m<>n, to trace the global most probable path and decode the bits.

Figure 3.28: (a) Rate 2/3 convolutional coder and (b) Corresponding steady-state trellis. (Inputs b1 and b0, delay units S2, S1, and S0, outputs c2, c1, and c0.) The branch labels of panel (b) are:

S2S1S0   b1b0/c2c1c0
000      00/000, 00/100, 00/010, 00/110
001      10/010, 10/110, 10/000, 10/100
010      01/100, 01/000, 01/110, 01/010
011      11/110, 11/010, 11/100, 11/000
100      00/001, 00/101, 00/011, 00/111
101      10/011, 10/111, 10/001, 10/101
110      01/101, 01/001, 01/111, 01/011
111      11/111, 11/011, 11/101, 11/001
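The complete decoding procedure of this example (metric accumulation over all stages, storage of the branch connections, and trace-back from the zero state) can be sketched in C as follows. This is an illustrative sketch rather than the book's implementation: the tap connections are assumed (inferred from the branch labels of Figure 3.26 and the paths of Table 3.1), ties are resolved in favor of the lower-numbered previous state, and the names viterbi_decode_hard, NSTATES, MAX_STAGES, and INF are hypothetical.

```c
#define NSTATES    4
#define MAX_STAGES 64
#define INF        1000000   /* metric of a state that is not yet reachable */

/* Hamming distance between two 2-bit codewords (values 0..3). */
int hamming2(unsigned a, unsigned b)
{
    unsigned x = (a ^ b) & 3u;
    return (int)((x & 1u) + (x >> 1));
}

/* Codeword on the branch leaving 'state' (= 2*S1 + S0) for input 'bit'.
 * ASSUMPTION: taps inferred from the text, not taken from Figure 3.19. */
unsigned branch_output(unsigned state, unsigned bit)
{
    unsigned s0 = state & 1u, s1 = (state >> 1) & 1u;
    return ((bit ^ s1) << 1) | (bit ^ s0 ^ s1);
}

/* Hard-decision ML decoding of a terminated frame of n received codewords.
 * bits[] receives the n decoded input bits (including termination bits).
 * Returns the metric of the global most probable path, or -1 on bad input. */
int viterbi_decode_hard(const unsigned char *rx, int n, unsigned char *bits)
{
    int sm[NSTATES], sm_new[NSTATES];
    unsigned char prev[MAX_STAGES][NSTATES];   /* surviving branch m<>n per stage */

    if (n <= 0 || n > MAX_STAGES)
        return -1;

    for (int s = 0; s < NSTATES; s++)
        sm[s] = INF;
    sm[0] = 0;                                 /* encoder starts at state zero */

    for (int i = 0; i < n; i++) {
        for (int s = 0; s < NSTATES; s++) {
            sm_new[s] = INF;
            prev[i][s] = 0;
        }
        for (unsigned s = 0; s < NSTATES; s++) {
            if (sm[s] >= INF)
                continue;
            for (unsigned b = 0; b < 2; b++) {
                unsigned next = ((s << 1) | b) & 3u;
                int cand = sm[s] + hamming2(rx[i], branch_output(s, b));
                if (cand < sm_new[next]) {     /* keep the smaller metric */
                    sm_new[next] = cand;
                    prev[i][next] = (unsigned char)s;
                }
            }
        }
        for (int s = 0; s < NSTATES; s++)
            sm[s] = sm_new[s];
    }

    /* Trace back from state zero (the trellis was terminated with 0 bits). */
    unsigned state = 0;
    for (int i = n - 1; i >= 0; i--) {
        bits[i] = (unsigned char)(state & 1u); /* input bit is the new S0 */
        state = prev[i][state];
    }
    return sm[0];
}
```

With the received codewords 00, 11, 11, 10, 11, 00, 10, 01, 00, 11, 10, 11 (values 0, 3, 3, 2, 3, 0, 2, 1, 0, 3, 2, 3), the function returns a path metric of 3 and the bits 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, matching Table 3.1. The prev array is what makes the memory grow with the frame length: one entry per state per stage must be kept until the trace-back.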
In Section 3.9, we discuss optimal decoding of convolutional codes with the Viterbi algorithm, and we also address the memory savings obtained by implementing the decoder with the window method.

3.8 Trellis Coded Modulation

Trellis coded modulation (TCM) is a combined coding and modulation technique used for digital transmission over band-limited channels. With TCM, we can achieve significant coding gains over conventional uncoded multilevel modulation without trading bandwidth. In this section, we discuss the coded modulation system, its performance gain over an uncoded system, and its performance gain over a system where channel coding and modulation are performed separately. We discuss the Viterbi decoder, a decoding technique for TCM symbols, in the next section.

Figure 3.29: PSK modulation. (a) 4-point constellation (points A and B, minimum distance √2). (b) 8-point constellation (points C and D).

Figure 3.30: Uncoded baseband communications system with M-PSK modulation. (Signal chain: bn -> M-PSK modulator -> Sm -> transmitter back end -> x(t) -> band-limited channel h(t) -> y(t) -> receiver front end -> Rm -> M-PSK demodulator -> cn; circled numbers 1 to 6 mark reference points along the chain.)

We consider bandwidth-constrained channels (e.g., twisted-pair copper telephone lines) to study TCM systems and to see the performance gain of TCM over other coded and uncoded systems. For such band-limited channels, the digital communications system is designed to use bandwidth-efficient multilevel/multiphase modulation schemes, such as PAM, PSK, or QAM. See Section 9.1.3 for more details on baseband modulation schemes (e.g., PSK, QAM). Here, we consider PSK modulation schemes in our performance analysis of TCM systems. For convenience, the 4-PSK and 8-PSK constellations from Section 9.1.3 are redrawn here as shown in Figure 3.29.

Uncoded System

We consider a simple baseband uncoded communications system with a PSK modulation scheme as shown in Figure 3.30.
The inputs to the M-PSK modulator are equiprobable binary digits bn and the outputs are PSK symbols Sm chosen from an M-point PSK constellation array. We assume that the DAC (digital to analog conversion) operation, along with low-pass filtering (to filter out-of-band frequency content), is performed in the transmitter back-end module. The output of the transmitter back-end is a continuous-time, continuous-amplitude signal x(t) that is suitable for transmission over the channel h(t). The receiver front-end includes filters (to combat channel impairments such as noise and ISI), symbol synchronization circuitry (to get accurate sampling time and phase), ADC (analog to digital conversion), and a symbol detector (to get multilevel PSK symbols), among other things. The output Rm of the receiver front-end is the PSK symbol. These PSK symbols are fed to the M-PSK demodulator to get back the transmitted binary digits, cn (which may be different from bn due to channel impairments). This communications system is an uncoded system since no channel coding is present in the signal chain.

In Figure 3.30, the data rates before the modulator (represented with 1 in a circle) and after the modulator (represented with 2 in a circle) need not be the same. The modulator inputs are bits bn, and its outputs are PSK symbols Sm. Depending on the constellation used, we map m (= log2 M) bits to one PSK symbol. If the bit rate at the modulator input is P, then the symbol rate at the output of the modulator is Q = P/m. As we discussed in the previous sections and will discuss again later, channel coding at the transmitter side adds redundancy to the input bitstream, and that increases the bit rate P at the input of the modulator. However, we can keep the symbol rate at the output of the modulator the same by increasing m using multilevel/multiphase modulation. This important feature of multilevel/multiphase modulators is very useful in designing a communications system for a band-limited channel.
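The relation Q = P/m can be illustrated with a small C sketch. The function names and the bit rates used in the usage note are hypothetical and serve only to show how a larger constellation absorbs the extra bits added by coding.

```c
/* Bits per symbol, m = log2(M), for a power-of-two constellation size M.
 * Returns -1 if M is not a power of two or is smaller than 2. */
int bits_per_symbol(unsigned M)
{
    int m = 0;
    if (M < 2 || (M & (M - 1u)) != 0)
        return -1;
    while (M > 1) {
        M >>= 1;
        m++;
    }
    return m;
}

/* Symbol rate Q = P/m at the modulator output for bit rate P at its input.
 * Returns -1.0 for an invalid constellation size. */
double symbol_rate(double bit_rate, unsigned M)
{
    int m = bits_per_symbol(M);
    return (m > 0) ? bit_rate / m : -1.0;
}
```

For instance, an assumed 2 Mbit/s stream sent with 4-PSK gives symbol_rate(2.0e6, 4) = 1 Msymbol/s. If a rate 2/3 coder raises the modulator input to 3 Mbit/s, moving to 8-PSK gives symbol_rate(3.0e6, 8), which is again 1 Msymbol/s, so the occupied bandwidth is unchanged.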
The disadvantage of this type of system is that, at a constant symbol rate, the number of bits per symbol increases when the bit rate increases, and therefore we have to increase the energy levels of the transmitted symbols to reduce the effect of channel noise on the detection of symbols at the demodulator. This type of communications system design is suitable for wireline communication, where we do not have much bandwidth but can use more energy to transmit data. We use this kind of system design with a small value of m for satellite communication too, where bandwidth is plentiful but only limited power is available for transmission. Typically, we use 256- or 512-point constellation symbols for wireline communications, whereas we use symbols from 4- or 8-point constellations in the case of satellite communications.

Figure 3.31: Performance curves of uncoded 4-PSK and 8-PSK systems. (BER versus Eb/N0 in dB.)

The BER (bit error rate) performance curves for this uncoded communication system are shown in Figure 3.31. From the M-PSK performance curves, we can clearly see that, for a given modulator output bandwidth, the required Eb/N0 (ratio of energy per bit to noise power spectral density) increases as the bit rate (or M, the number of constellation points) increases. At BER = $10^{-6}$, we need to spend 3.5 dB more energy per bit with 8-PSK symbols when compared to 4-PSK symbols. As no coding is involved in this system, the parameters SNR and Eb/N0 are related by the following formula (see Section 9.1.2 for more details):

$$E_b/N_0 = (E_s/N_0)/m = \mathrm{SNR}/m \tag{3.39}$$

or

$$E_b/N_0\ \text{(in dB)} = \mathrm{SNR}\ \text{(in dB)} - 10\log_{10}(m) \tag{3.40}$$

Coded System

With channel coding methods, it is possible to trade the bandwidth of the communications system against the transmission power. Here, we discuss the application of channel coding to improve data rates in bandwidth-constrained channels. When coding is applied to such channels, a performance gain is desired without expanding the signal bandwidth.
As an example, we consider the system shown in Figure 3.30 using 4-PSK constellation points for modulation. This uncoded 4-PSK modulation achieves 2 bits/sec/Hz (throughput per unit of channel bandwidth) at an error probability of, say, $10^{-6}$. For this error rate, the signal to noise ratio (SNR) per bit (i.e., Eb/N0) is 10.5 dB (from Figure 3.31). If we want to reduce the SNR per bit using channel coding without expanding the bandwidth, then we have to use symbols from a bigger constellation to accommodate the redundant bits (introduced by channel coding) in the given bandwidth. Using a rate 2/3 coder, we go from 4-PSK (2 bits per symbol), with a minimum distance of $\sqrt{2}$ between the points as shown in Figure 3.29(a), to 8-PSK (3 bits per symbol), with a minimum distance of $\sqrt{2-\sqrt{2}}$ between the points as shown in Figure 3.29(b), to keep the bandwidth constant. With appropriate mapping of the encoded bits to the signal points, the rate 2/3 coder in conjunction with 8-phase PSK yields the same data throughput as the uncoded 4-phase PSK. An increase in the number of signal points from 4 to 8 requires approximately an additional G dB in signal power (3.5 dB in this particular case; see Figure 3.31) to maintain the same error rate, since the minimum distance CD < AB as shown in Figure 3.29. Therefore, if coding is used to reduce the SNR per bit, then the rate 2/3 coder must overcome this G dB penalty and yield further gain. If the coding and modulation are performed separately, then the use of very powerful codes (e.g., convolutional codes with a large constraint length) is required to offset the loss and provide a significant coding gain.
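The size of the penalty read from Figure 3.31 can be cross-checked against the geometry of Figure 3.29. The following worked estimate is a rough high-SNR sketch added for illustration, under the assumption that the error rate is governed only by the minimum distance (the number of nearest neighbors and the bit mapping are ignored); it compares the two uncoded systems, and the symbols $d_{AB}$ and $d_{CD}$ simply denote the two minimum distances quoted above.

```latex
% Symbol-energy increase needed for 8-PSK to match the 4-PSK minimum distance
10\log_{10}\frac{d_{AB}^{2}}{d_{CD}^{2}}
  = 10\log_{10}\frac{2}{2-\sqrt{2}}
  \approx 10\log_{10}(3.414)
  \approx 5.33\ \text{dB in } E_s/N_0

% Per-bit view, using Eq. (3.40) with m = 3 for 8-PSK and m = 2 for 4-PSK
5.33\ \text{dB} - \left(10\log_{10}3 - 10\log_{10}2\right)
  \approx 5.33\ \text{dB} - 1.76\ \text{dB}
  \approx 3.57\ \text{dB in } E_b/N_0
```

Under these assumptions, the estimate of about 3.6 dB per bit is in line with the roughly 3.5 dB gap between the uncoded 4-PSK and 8-PSK curves.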
Coded Modulation System

On the other hand, if we combine the encoding process with the modulation to increase the minimum Euclidean distance between pairs of coded signals, then the loss from the expansion of the signal set is easily overcome and a significant coding gain is achieved with relatively simple codes. TCM is one such coding scheme that generates modulated codewords. The performance of a TCM system with a rate 2/3 convolutional coder of constraint length 3 using an 8-PSK modulation system (that achieves 2 bits/sec/Hz) is shown in Figure 3.32. We use the corresponding Viterbi decoder (see Section 3.9 for more details) to decode this system. From the performance curves, we can clearly see the performance gain of the TCM system over an uncoded system. At a BER of $10^{-6}$, TCM gives a coding gain of 3 dB with respect to the uncoded 4-PSK system. In the next section, we discuss TCM codeword generation. TCM applications include voiceband modems, DSL modems, cable modems, and satellite communications, among others.

Figure 3.32: TCM system performance. (BER versus Eb/N0 in dB for uncoded 8-PSK, uncoded 4-PSK, and TCM with R = 2/3, S = 4, 8-PSK.)

3.8.1 TCM Encoder

In this section, we discuss the generation of TCM encoded symbols. The TCM encoder consists of two operators, a convolutional coder and a modulator, as shown in Figure 3.33. In TCM, we map the coded bits to modulated signal points in a particular way without increasing the data transmission bandwidth.

Figure 3.33: Schematic block diagram of a TCM system. (bn -> convolutional coder -> mapper and modulator -> Sm.)

Convolutional Coder

For convolutional coding, we use a simple rate 1/2 convolutional encoder as shown in Figure 3.34. The dashed lines (in parallel to the solid lines in the trellis diagram) in Figure 3.34 correspond to the uncoded bit paths (or branches). The encoder consists of two delay units (or shift registers); hence, its constraint length is K = 3. Although the encoder takes 1 input bit and outputs 2 bits, passing one additional uncoded bit makes the effective code rate 2/3. For every two input bits, we get three output bits (which we represent with eight levels, 0 to 7). For example, when the encoder is at state (or node) zero (i.e., S0S1 = 00), if the input bits b1b0 are 00, 01, 10, and 11, then we obtain the corresponding output bits c2c1c0 as 000 (0), 001 (1), 110 (6), and 111 (7). As said earlier, the uncoded bits produce parallel paths in the trellis. If we have m uncoded bits, then we will have $2^{m}$ parallel paths in the trellis. In our case, we have 1 uncoded bit (i.e., m = 1), and we have 2 parallel paths diverging from and converging to all states in the trellis. Next, we discuss the mapping of coded bits to modulated signal points.

Figure 3.34: A rate 2/3 convolutional coder and its trellis diagram. (Inputs b0 and b1, delay units S0 and S1, outputs c0, c1, and c2; trellis states S0S1 = 00, 10, 01, 11.)

Mapper and Modulator

The key to this integrated modulation and coding approach is to devise an effective method for mapping the coded bits into signal points such that the minimum Euclidean distance is maximized. For this, we partition the constellation points into subpartitions more than once and make sure that the distance between points in the subpartitions increases with each partitioning. The degree to which the signal constellation set is partitioned depends on the characteristics of the code. The constellation set partitioning is shown in Figure 3.35. In Figure 3.35, if d0 is the minimum distance between points at level A, d1 is the minimum distance between points at level B, and d2 is the minimum distance between points at level C, then d0 < d1 < d2.

Figure 3.35: 8-PSK symbol constellation and set partitioning. (Partition levels A, B, and C with minimum distances d0, d1, and d2.)

From Figure 3.34 and Figure 3.35, the assignment of signal points for each coded output is made according to the following Ungerboeck set partitioning rules (Ungerboeck, 1987).

1. Parallel transitions are assigned to signal points separated by the maximum Euclidean distance.
2. The transitions originating from a particular state and merging into any state are assigned to signal points separated by at least the next-largest distance.
3. The signal points should occur with equal frequency.

To satisfy these rules, the coded bits are used to choose a subset of points and the uncoded bits are used to choose the points within a subset. With a rate 2/3 coder as shown in Figure 3.34, we use 2 coded bits to choose one of four subsets at level C in Figure 3.35, and the uncoded bit is used to choose one of two points from the selected subset.

Figure 3.36: General structure of a TCM encoder. (Inputs a1, ..., an are uncoded and select the point from the subset; inputs a(n+1), ..., a(n+k) feed the S-state, rate k/(k + m) TCM encoder, whose outputs select the subset; the outputs z1, ..., z(n+k+m) determine the signal point.) Legend:

n: number of uncoded bits
k: number of bits to coder
m: redundancy added
S: number of states
P: number of parallel paths (= $2^{n}$)
K: constraint length (= log2(S) + 1)

Figure 3.37: Trellis diagram for general TCM rate 2/3 encoder with K = 4.

Output Values from Each State                                                                      State
(0, 1, ..., P-1), (2P, 2P+1, ..., 3P-1), (4P, 4P+1, ..., 5P-1), (6P, 6P+1, ..., 7P-1)              000
(5P, 5P+1, ..., 6P-1), (7P, 7P+1, ..., 8P-1), (P, P+1, ..., 2P-1), (3P, 3P+1, ..., 4P-1)           001
(2P, 2P+1, ..., 3P-1), (0, 1, ..., P-1), (6P, 6P+1, ..., 7P-1), (4P, 4P+1, ..., 5P-1)              100
(7P, 7P+1, ..., 8P-1), (5P, 5P+1, ..., 6P-1), (3P, 3P+1, ..., 4P-1), (P, P+1, ..., 2P-1)           101
(6P, 6P+1, ..., 7P-1), (4P, 4P+1, ..., 5P-1), (2P, 2P+1, ..., 3P-1), (0, 1, ..., P-1)              010
(3P, 3P+1, ..., 4P-1), (P, P+1, ..., 2P-1), (7P, 7P+1, ..., 8P-1), (5P, 5P+1, ..., 6P-1)           011
(4P, 4P+1, ..., 5P-1), (6P, 6P+1, ..., 7P-1), (0, 1, ..., P-1), (2P, 2P+1, ..., 3P-1)              110
(P, P+1, ..., 2P-1), (3P, 3P+1, ..., 4P-1), (5P, 5P+1, ..., 6P-1), (7P, 7P+1, ..., 8P-1)           111

The general block diagram of a TCM encoder is shown in Figure 3.36.
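The growth of the minimum distance under set partitioning (d0 < d1 < d2) can be checked with a small C sketch. It is only an illustration under a labeled assumption: the eight PSK points are indexed 0 to 7 in order around the circle (point p at angle 2πp/8), which need not match the c2c1c0 labels of Figure 3.35, and each partition level splits the points by their low index bits. The function name is hypothetical.

```c
/* Set partitioning of 8-PSK by index bits (illustrative sketch).
 * ASSUMPTION: points are indexed 0..7 in angular order; this natural indexing
 * is used for illustration only and may differ from the labels in Figure 3.35.
 *
 * level 0: one subset of 8 points (level A)
 * level 1: two subsets of 4 points, selected by the lowest index bit (level B)
 * level 2: four subsets of 2 points, selected by the two lowest bits (level C)
 *
 * Returns the smallest circular index separation between two distinct points
 * of the subset whose low 'level' bits equal 'residue'. */
int min_step_in_subset(int level, int residue)
{
    int mask = (1 << level) - 1;
    int best = 8;

    for (int p = 0; p < 8; p++) {
        if ((p & mask) != residue)
            continue;
        for (int q = p + 1; q < 8; q++) {
            if ((q & mask) != residue)
                continue;
            int step = q - p;
            if (step > 4)
                step = 8 - step;       /* measure the short way around */
            if (step < best)
                best = step;
        }
    }
    return best;
}
```

The function returns 1 at level 0, 2 at level 1, and 4 at level 2. On a unit-radius circle, a separation of s index steps corresponds to a chord of length 2 sin(πs/8), which gives d0 = √(2 − √2) ≈ 0.765, d1 = √2 ≈ 1.414, and d2 = 2, so each partitioning step increases the minimum distance within a subset.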
This general encoder consists of an S-state nonsystematic convolutional rate k/(k + m) encoder with constraint length K, and it outputs n + k + m bits by taking n + k bits as input. Out of the n + k input bits, the first n bits are uncoded and the remaining k bits are coded to output k + m coded bits. We map the n + k + m output bits of the encoder to signal constellation points with the help of the mapper. For a particular realization of the encoder with k = 2, m = 1, S = 8, and K = 4, the steady-state trellis diagram is shown in Figure 3.37. The n uncoded bits result in P = $2^{n}$ parallel transitions in each branch of the trellis. The output values from each state of the trellis are also shown in Figure 3.37.

3.8.2 Coding Gains with TCM

To observe the coding gain ($C_{gain}$) with TCM, we consider the TCM encoder shown in Figure 3.34, and for comparison we consider the uncoded system shown in Figure 3.30, with which we transmit 2 bits per symbol by using a 4-PSK modulation scheme. The symbol constellations considered in this section are scaled so that the average symbol energy is unity. From the signal constellation of Figure 3.29(a), if the radius of the circle is unity, then the coordinates of points A and B are given by A: $(1/\sqrt{2},\ 1/\sqrt{2})$ and B: $(-1/\sqrt{2},\ 1/\sqrt{2})$. The minimum Euclidean distance of the uncoded constellation is $d_{\min}^{uc} = \sqrt{2}$.

Now, consider the TCM scheme shown in Figure 3.36 with k = m = n = 1 and S = 4 (i.e., constraint length K = 3). The output bits of the encoder are mapped to symbols in different subsets of the constellation according to the Ungerboeck set partitioning rules. Using set partitioning as shown in Figure 3.35, we assign a set of points at level C (which have the largest minimum distance) to the trellis parallel paths (which diverge from a particular state at the current stage and converge to the same state at the next stage, for example the paths of the trellis in Figure 3.34 with output values 0, 1 or 6, 7).
We assign a set of points at level B (which have the next largest minimum distance) to the trellis paths that diverge from the same state at the current stage and converge to different states at the next stage (e.g., the paths of the trellis in Figure 3.34 with output values 0, 6 or 0, 7, or 1, 6 or 1, 7). With this mapping procedure, we can satisfy Ungerboeck's set partitioning rules. The asymptotic coding gain for this TCM is given by

$$C_{gain} = 10\log_{10}\!\left(\frac{(d_{free}^{c})^{2}}{(d_{\min}^{uc})^{2}}\right)$$

where $d_{free}^{c}$ is the free Euclidean distance of the trellis, which is defined as the minimum distance between those transition paths that diverge from a state at the current stage and converge to the same state at later stages. This is illustrated in Figure 3.38 with two paths A and B. The two paths A and B diverge from the state zero at one stage and converge to the same zero state again at some other stage. The Euclidean distance with path A is $d_2$ (as we assigned parallel paths to points that are separated by the maximum possible distance), whereas the squared Euclidean distance with path B is $d_1^{2} + d_0^{2} + d_1^{2} = d_0^{2} + 2d_1^{2} = d_0^{2} + d_2^{2}$. Numerically, for the unit-radius 8-PSK constellation, $d_0^{2} = 2 - \sqrt{2} \approx 0.59$, $d_1^{2} = 2$, and $d_2^{2} = 4$, so path B has a squared distance of about 4.59, which exceeds the squared distance of 4 for path A. Hence, in this case the minimum Euclidean distance separation between paths that diverge from any state and converge to the same state is $d_2$. For the TCM scheme considered, the value of $d_{free}^{c}$ is equal to 2. Therefore, the coding gain is obtained as

$$C_{gain} = 10\log_{10}\!\left(\frac{(d_{free}^{c})^{2}}{(d_{\min}^{uc})^{2}}\right) = 10\log_{10}(4/2) = 3.01\ \text{dB}$$

With TCM, we can achieve coding gains of about 2 dB to 6 dB depending on the type of coder used (i.e., the number of states, the amount of redundancy added, and the dimensionality of the constellations considered). Examples of 8-state and 16-state rate 2/3 convolutional encoders with corresponding steady-state trellis diagrams are shown in Figure 3.39 and Figure 3.40, respectively.

Figure 3.38: Trellis free Euclidean distance illustration. (Paths A and B through the states 00, 10, 01, 11.)

Figure 3.39: 8-state rate 2/3 convolutional coder.
Steady-state trellis labels of Figure 3.39, panel (b), with panel (a) showing the coder (inputs b1 and b0, delay units S2, S1, and S0, outputs c2, c1, and c0):

State (S2S1S0)   b1b0 (c2c1c0)
0                0(0), 1(4), 2(2), 3(6)
1                0(1), 1(5), 2(3), 3(7)
2                0(4), 1(0), 2(6), 3(2)
3                0(5), 1(1), 2(7), 3(3)
4                0(2), 1(6), 2(0), 3(4)
5                0(3), 1(7), 2(1), 3(5)
6                0(6), 1(2), 2(4), 3(0)
7                0(7), 1(3), 2(5), 3(1)

Figure 3.40: 16-state rate 2/3 convolutional coder. (Panel (a): coder with inputs b1 and b0, delay units S3, S2, S1, and S0, outputs c2, c1, and c0. Panel (b): steady-state trellis labels, listed below.)

State (S3S2S1S0)   b1b0 (c2c1c0)
0                  0(0), 1(4), 2(2), 3(6)
1                  0(1), 1(5), 2(3), 3(7)
2                  0(4), 1(0), 2(6), 3(2)
3                  0(5), 1(1), 2(7), 3(3)
4                  0(2), 1(6), 2(0), 3(4)
5                  0(3), 1(7), 2(1), 3(5)
6                  0(6), 1(2), 2(4), 3(0)
7                  0(7), 1(3), 2(5), 3(1)
8                  0(4), 1(0), 2(6), 3(2)
9                  0(5), 1(1), 2(7), 3(3)
10                 0(0), 1(4), 2(2), 3(6)
11                 0(1), 1(5), 2(3), 3(7)
12                 0(6), 1(2), 2(4), 3(0)
13                 0(7), 1(3), 2(5), 3(1)
14                 0(2), 1(6), 2(0), 3(4)
15                 0(3), 1(7), 2(1), 3(5)

3.8.3 TCM for DMT systems

One application of TCM is an ADSL modem, which is based on a DMT (Discrete Multi Tone) system. See Section 9.2.3 for more details on the DMT transceiver. Consider a DMT system with N subchannels. Let the number of information bits per symbol in the i-th subchannel be bi. TCM for DMT can be implemented in two ways: coding separately in each subchannel, and coding across the subchannels. In the former case, we perform coding and decoding separately and need (N/2) + 1 encoders and an equal number of decoders. This means a large amount of hardware when N is large. Also, if the DMT symbol interval is T, then the decoding delay in this case is approximately 5KT to 8KT, where K = max(Ki), with Ki denoting the constraint length of the i-th subchannel encoder. Usually, TCM coders used in DMT applications work across the subchannels with a code rate of bmin/(bmin + 1), where bmin is the minimum number of bits carried by a subchannel in a DMT block. The remaining bi − bmin bits, where bi is the number of bits carried by the i-th subchannel, are uncoded. The across-subchannels TCM encoder for a DMT system is shown in Figure 3.41.
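The bit bookkeeping implied by the rate bmin/(bmin + 1) across-subchannels scheme can be sketched in C. This is an illustration of the counting only, not of the coder itself; the structure and function names are hypothetical, and the bit loading used in the usage note is made up.

```c
/* Bit budget of one DMT block for TCM applied across the subchannels.
 * b[i] is the number of information bits carried by subchannel i. */
typedef struct {
    int bmin;       /* minimum of b[i] over the block                  */
    int coded_in;   /* bits fed to the rate bmin/(bmin+1) coder        */
    int uncoded;    /* bits passed to the output uncoded               */
    int out_bits;   /* bits delivered to the mapper for the whole block */
} dmt_tcm_budget;

/* Fills *out and returns 0, or returns -1 if the block is empty. */
int dmt_tcm_bit_budget(const int *b, int n, dmt_tcm_budget *out)
{
    if (n <= 0)
        return -1;

    int bmin = b[0];
    for (int i = 1; i < n; i++)
        if (b[i] < bmin)
            bmin = b[i];

    out->bmin = bmin;
    out->coded_in = 0;
    out->uncoded = 0;
    out->out_bits = 0;

    for (int i = 0; i < n; i++) {
        out->coded_in += bmin;            /* bmin bits go through the coder  */
        out->uncoded  += b[i] - bmin;     /* the remaining bits stay uncoded */
        out->out_bits += b[i] + 1;        /* (bmin + 1) coded + uncoded bits */
    }
    return 0;
}
```

For an assumed loading of b = {2, 4, 3, 6} bits on four subchannels, bmin is 2, so 8 bits pass through the coder, 7 bits are uncoded, and 19 bits reach the mapper, one redundant bit per subchannel.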
First, bmin bits of the b0 input bits, corresponding to the 0th subchannel, are coded into bmin + 1 bits and the remaining b0 − bmin bits are passed to the output uncoded. Next, the encoder works on the input bits corresponding to the first subchannel, while the input state of the encoder is the output state determined by the input bits of the previous subchannel, that is, the 0th subchannel. Thus, in general, the state of the encoder attained after the input bits of the i-th subchannel are encoded becomes the initial state of the encoder for the (i + 1)th subchannel. The decoding is performed accordingly. Note that the channel SNR differs from subchannel to subchannel because of the nonflat response of the channel. This may also be due to different noise variances in the subchannels. In such a situation, we have to take the noise variance into consideration at each stage of the trellis while implementing the Viterbi algorithm.

Figure 3.41: Across subchannels TCM encoder for a DMT system. (Per subchannel, n = bi − bmin bits are uncoded and k = bmin bits enter the rate bmin/(bmin + 1) encoder.)

3.9 Viterbi Algorithm

The Viterbi algorithm is an optimum decoding algorithm for convolutional codes (see Section 3.7 for more details on convolutional codes), and it has often served as a standard technique in digital communications systems for maximum likelihood sequence estimation (MLSE). The application area of the Viterbi algorithm is not limited to convolutional decoding in communications, where the algorithm was originally developed. It is used for channel equalization (Viterbi equalizer) in modern communications systems. It also covers diverse applications such as pattern recognition, data storage, and target tracking. In this section, we discuss the Viterbi algorithm and the decoding of TCM symbols (see Section 3.8 for more details on TCM). The simulation and implementation techniques for the Viterbi algorithm are discussed in Chapter 4.
The Viterbi algorithm is commonly expressed in terms of a trellis diagram (which is a time-indexed version of a state diagram). In convolutional coding, a Viterbi decoder at the receiver follows the trellis used by the transmitter and attempts to estimate the transmitted sequence as the path through the trellis whose distance is closest to the received noisy sequence. In other words, the Viterbi algorithm finds the sequence at a minimum Euclidean distance from the received signal using the transmitter trellis. The sequence computed by the Viterbi algorithm is the global most likely sequence. To compute the global most likely sequence, the Viterbi algorithm first recursively computes the survivor path entering each state. After computing the survivor paths for all states, we select the survivor path with the minimum path metric as the most likely path. In this manner, we compute the global most likely path for all symbols of a received sequence. We take this global most likely path and trace back to get the bits of the survivor branches. This decoded bit sequence is an estimate of the transmitted bit sequence.

3.9.1 Maximum Likelihood Sequence Estimation

Assume that an N-length symbol sequence $X = \{x_0, x_1, \ldots, x_{N-1}\}$ is transmitted, where $x_j$ is a symbol from a signal constellation that consists of a finite number of points S with unit average energy. The corresponding N-length received sequence is $Y = \{y_0, y_1, \ldots, y_{N-1}\}$. With an AWGN (additive white Gaussian noise) channel, $y_j = x_j + u_j$, where $u_j$ is a noise sample that is a zero mean white Gaussian random variable. Let $X_i$ denote an N-length symbol sequence corresponding to the i-th path of the trellis diagram as shown in Figure 3.42 (which corresponds to the TCM encoder shown in Figure 3.34).
Then the maximum likelihood (ML) sequence estimate $X_d$ (representing the global most likely sequence) of X is given by

$$X_d = \arg\max_i \{\, p(Y \mid X_i) \,\} \tag{3.41}$$

where $p(Y \mid X_i)$ denotes the conditional density function of Y given $X_i$. Since $y_j = x_j + u_j$, $X_d$ can be expressed as

$$X_d = \arg\max_i \{\, p(u = Y - X_i) \,\} \tag{3.42}$$

where u is an N-length vector and is a multivariate Gaussian with mutually uncorrelated components that have zero mean and variance $\sigma^2 = E(|u_j|^2)$.

Figure 3.42: Trellis (of encoder shown in Figure 3.34) with N stages. (States 00, 10, 01, 11; stages j = 0 to j = N − 1; each branch consists of two parallel branches.)

The density p(u) is a Gaussian probability density function (pdf), so that

$$X_d = \arg\max_i \left\{ \prod_{j} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{|y_j - x_j^i|^2}{2\sigma^2} \right) \right\} \tag{3.43}$$

After observing the Gaussian pdf given in Equation (3.43), the expression for $X_d$ can be simplified by keeping only the factors that affect the maximization criterion as

$$X_d = \arg\max_i \left\{ \exp\!\left( -\frac{1}{2\sigma^2} \sum_{j} |y_j - x_j^i|^2 \right) \right\} \tag{3.44}$$

or, since maximizing the exponential is equivalent to minimizing the sum in its argument,

$$X_d = \arg\min_i \left\{ \sum_{j=0}^{N-1} |y_j - x_j^i|^2 \right\} \tag{3.45}$$

3.9.2 Viterbi Algorithm

Using the Viterbi algorithm, we obtain the global most likely sequence $X_d$ as derived in Equation (3.45). In Figure 3.42, each path consists of N stages. Let the branch metric (BM) at the j-th stage for the i-th path be defined as $BM_{j,i} = |y_j - x_j^i|^2$, where $y_j$ and $x_j^i$ denote the received signal and the transmitted symbol on the i-th path corresponding to the j-th stage of the trellis, respectively. Then the state metric for the i-th path can be defined as $SM_i = \sum_j |y_j - x_j^i|^2$. The estimate $X_d$ of the transmitted symbol sequence is given by the path with the minimum state metric. The following steps describe the computations involved in obtaining the global most likely sequence using the Viterbi algorithm.

1. At stage j = 0, set SM to zero for all states.
2. At a node in a stage of j > 0, compute BM for all branches entering the node.
3. Add the BM to the present SM for the path ending at the source node of the branch to get a candidate SM for the path ending at the destination node of the branch. After the candidate SM has been obtained for all branches entering the node, compare them and select only the one with the minimum value. Let the corresponding branch survive and delete all other branches to that node from the current trellis stage. This process is shown in Figure 3.43.
4. Return to step 2 to deal with the next node. If all nodes in the present stage have been processed, go to step 5.
5. If j < N, increment j and return to step 2, else go to step 6.
6. Take the path with the minimum SM (as the global most likely path) and follow the survivor branches backward through the trellis up to the beginning of the trellis. Now collect the bits corresponding to the survivor branch at all stages of the trellis to form the estimate of the transmitted information bits.

Figure 3.43: Processing of trellis stages in Viterbi decoding. (States 00, 10, 01, 11; stages j = 0 to j = N − 1; the figure marks the branch metrics BM, the state metrics SM, the survivor paths, and the global most likely path.)

Figure 3.43 corresponds to the encoder shown in Figure 3.34. Each branch contains two parallel transitions before processing. For the most likely path sequence, the parallel transitions are resolved by selecting the signal points closest to the received sequence. The performance of the Viterbi decoder depends on the free distance $d_{free}^{c}$ of the trellis. The free distance, $d_{free}^{c} = \min(d_{parallel}^{c},\ d_{non\text{-}parallel}^{c})$, is the minimum distance between paths that diverge from a particular node at the present stage and converge to the same node at some later stage. Note that the processing of the trellis yields the solution of Equation (3.45).

3.10 Turbo Codes

Turbo codes have attracted great interest from the research community as well as the industry since their introduction in 1993 because of their remarkable performance.
Turbo codes operate near the ultimate capacity limit of a communication channel (i.e., the Shannon channel-capacity limit), with an SNR gap of 0.7 dB or less. Turbo codes were first proposed in Berrou et al. (1993). They are constructed using concatenated constituent convolutional coders. In the turbo coding scheme, we generate two or more component codes on different interleaved versions of the same information sequence. On the decoder side, we use the SOVA (soft-output Viterbi algorithm) or MAP (maximum a posteriori) algorithm to decode the decisions in an iterative manner. The decoding algorithm uses the received data symbols, the parity symbols (which correspond to parity bits computed from actual and interleaved versions of the data bits), and the soft output information of the other decoder to produce more reliable decisions. In this section, we discuss turbo code generation and the MAP decoding algorithm. We discuss the simulation and implementation techniques for turbo codes in Section 4.5.

Turbo codes gave rebirth to concatenated coding and iterative decoding schemes. In turbo decoding with two component codes, we pass soft decisions from the output of one decoder to the input of a second decoder and iterate this process several times to produce more reliable decisions. The concept of decision making by iterative decoding allows one to explore applications of turbo coding beyond coding theory. One such application is channel equalization. Turbo equalizers overcome the limitations of zero-forcing and decision-feedback equalizers. The turbo decoding algorithms (e.g., MAP) use both forward and reverse state metrics and also support iterative decoding, using interleaved a priori information generated from the soft output of the other decoder to produce more reliable decisions. With turbo codes, we approach the Shannon channel capacity and can achieve an SNR gap below 0.7 dB.
Turbo codes perform very well at low SNRs; however, these codes suffer from an error floor at high SNRs. Use of a good random interleaver in turbo coding improves the performance of turbo codes to a great extent. Sometimes an RS coder is used as the outer coder along with the turbo coder to overcome the error floor of turbo codes at high SNRs.

Sample Turbo Code Applications

• Mobile radio
• DVB-RCS
• Deep space exploration
• W-CDMA, UMTS (3GPP), CDMA2000 (3GPP2)
• Satellite communication
• DSL

Figure 3.44: Recursive systematic convolutional encoder. (Input bits $c_n$, trellis-termination switch TT, three delay elements D with states $z_1$, $z_2$, $z_3$, a feedback path, systematic output bits $d_{n,0}$, and parity output bits $d_{n,1}$.)

Figure 3.45: Trellis data flow for RSC encoder (of Figure 3.44). (Current and next states $z_1 z_2 z_3$ run from 000 ($S_0$) to 111 ($S_7$); branches are labeled input/output, $c_n / d_{n,0}d_{n,1}$; dotted line: data 0; solid line: data 1.)

3.10.1 Turbo RSC Encoder

Turbo codes are produced by parallel concatenation of constituent convolutional coders. The encoder can be visualized either as an FIR (finite impulse response) system that produces nonsystematic convolutional (NSC) codes or as an IIR (infinite impulse response) system that produces recursive systematic convolutional (RSC) codes. At any given SNR (signal-to-noise ratio), for high code rates, RSC codes give better error performance than NSC codes. In this section, we discuss the RSC encoder. In the RSC encoder shown in Figure 3.44, we continually feed back the intermediate outputs to the encoder's input. The corresponding trellis is shown in Figure 3.45. At any time, we input 1 bit ($c_n$ = 0 or 1) and output 2 bits ($d_{n,0}d_{n,1}$ = 00, 01, 10, or 11). The code rate (the ratio of the number of input bits to the number of output bits) for this encoder is 1/2.
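The state update of such an encoder can be sketched in a few lines of C. The tap connections of Figure 3.44 are not recoverable from the text, so this sketch assumes a feedback polynomial 1 + D^2 + D^3 and a parity polynomial 1 + D + D^3 (the constituent code used in 3GPP turbo coding). The names `rsc_out_t` and `rsc_step` are hypothetical. With these assumed taps, state "001" with input 0 gives parity "0" and next state "100", and state "000" with input 1 gives output "11".

```c
#include <assert.h>

/* Result of one encoder step. */
typedef struct {
    unsigned systematic;  /* d_{n,0}: copy of the input bit c_n            */
    unsigned parity;      /* d_{n,1}: parity bit                           */
    unsigned next_state;  /* (z1 z2 z3) after the shift, z1 in bit 2       */
} rsc_out_t;

/* One step of an 8-state RSC encoder.
 * ASSUMPTION: feedback taps on z2 and z3 (1 + D^2 + D^3) and parity taps
 * on the feedback sum, z1 and z3 (1 + D + D^3); not taken from Figure 3.44. */
rsc_out_t rsc_step(unsigned state, unsigned c)
{
    assert(state < 8u && c < 2u);

    unsigned z1 = (state >> 2) & 1u;
    unsigned z2 = (state >> 1) & 1u;
    unsigned z3 = state & 1u;

    unsigned a = c ^ z2 ^ z3;      /* value entering the first delay element */

    rsc_out_t out;
    out.systematic = c;
    out.parity     = a ^ z1 ^ z3;
    out.next_state = (a << 2) | (z1 << 1) | z2;   /* shift register update */
    return out;
}
```

Calling `rsc_step` once per input bit, and feeding `next_state` back in as `state`, walks one path through the trellis.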
With each input bit, the state of the encoder is updated. The allowed combinations of input state (current state) and output state (next state) for the RSC encoder of Figure 3.44 are given by the trellis shown in Figure 3.45. For example, if the encoder is at state "001" and we input a 0 bit, then the output bit (parity bit) and output state of the RSC encoder are "0" and "100," respectively. Due to feedback, the encoder shown in Figure 3.44 produces an infinite bit sequence, so we enable the dotted line (TT) at the end of the input bit sequence $c_n$ to terminate the trellis by forcing the encoder state to zero.

In turbo coding, we concatenate two such RSC encoders in parallel, with an interleaved bit sequence as input to the second encoder, as shown in Figure 3.46. From the second encoder, we take only the parity information bits ($d_{n,2}$). Therefore, the effective code rate is 1/3 for the turbo encoder shown in Figure 3.46. We transmit the triplet ($d_{n,0}d_{n,1}d_{n,2}$) for each input $c_n$ after multiplexing the output bits of the two RSC encoders. The code rate of the encoder may be increased by puncturing the two parity bitstreams. For example, producing 1 parity bit from 2 parity bits by puncturing increases the code rate from 1/3 to 1/2. (See Section 4.5 for efficient implementation techniques of the turbo encoder.)

Figure 3.46: Turbo encoder. (Input bits $c_n$ feed RSC Encoder 1, which outputs $d_{n,0}$ and $d_{n,1}$, and, through the interleaver I, RSC Encoder 2, which outputs $d_{n,2}$.)

3.10.2 Turbo Decoder

The triplet ($d_{n,0}d_{n,1}d_{n,2}$) obtained from the turbo encoder is passed through a mapper (i.e., a baseband modulator) before transmission through the channel. With BPSK modulation, we map "0" to "+1" and "1" to "−1." Here, we use the AWGN channel model to represent the impairments of a real communication channel, because the AWGN model approximates the accumulated effect of noise components from many sources. Figure 3.47 shows the BPSK modulator along with the AWGN channel. The noise sequences $u_i(n)$ are from an i.i.d. (independent and identically distributed) random process with zero mean and variance $\sigma^2$.

Figure 3.47: Modulator and channel model for transmission. (Each stream $d_{n,k}$, k = 0, 1, 2, passes through a BPSK mapper, "0" → +1 and "1" → −1, to give $x_{n,k}$; AWGN $u_k(n)$ is added to give $y_{n,k}$.)

At the receiver side, we receive a noisy sequence $\ldots, y_{n-1,0}, y_{n-1,1}, y_{n-1,2}, y_{n,0}, y_{n,1}, y_{n,2}, y_{n+1,0}, y_{n+1,1}, y_{n+1,2}, \ldots$ and pass the received noisy symbols to the turbo decoder to recover reliable estimates of the transmitted data symbols, as shown in Figure 3.48. Here, we assume that the data symbols are properly synchronized (i.e., the boundaries of the triplets in the received sequence that correspond to transmitted triplets are correctly identified). After data symbol synchronization, we identify the received triplets as $\ldots (y_{n-1,0}, y_{n-1,1}, y_{n-1,2}), (y_{n,0}, y_{n,1}, y_{n,2}), (y_{n+1,0}, y_{n+1,1}, y_{n+1,2}) \ldots$. Then we pass the intrinsic information (systematic bits [$y_{i,0}$] and first encoder parity bits [$y_{i,1}$] of the received sequence) to the first decoder along with the extrinsic information Ext.2 (soft information) from the second decoder. For the first iteration, we use zeros for Ext.2, which assumes equiprobable intrinsic information symbols. After completing decoding with the first decoder, we start the second decoder with intrinsic information (interleaved systematic bits, I[$y_{i,0}$], and second encoder parity bits, $y_{i,2}$) and extrinsic information Ext.1 (soft information) from the first decoder as input. This process is repeated many times until we get reliable decisions from the second decoder output. At the end of the iterative decoding, we deinterleave the output of the second decoder (LLRs) to get the transmitted symbol sequence. Then we obtain hard bits by using the sign information of the output symbols.
At the heart of turbo decoding, we use a MAP decoder to get the likelihood ratio of the received symbols. In the next section we discuss turbo decoding using the MAP algorithm.

Figure 3.48: Turbo decoder. (I: interleaver; $I^{-1}$: deinterleaver. MAP Decoder 1 takes $y_{n,0}$, $y_{n,1}$ and the deinterleaved Ext.2; MAP Decoder 2 takes the interleaved $y_{n,0}$, $y_{n,2}$ and the interleaved Ext.1; the deinterleaved LLRs give the estimates $\hat{x}_{n,0}, \hat{x}_{n+1,0}, \hat{x}_{n+2,0}, \ldots$)

3.10.3 MAP Decoding

In turbo decoding, we use the maximum a posteriori (MAP) algorithm to determine the most likely information bit that has been transmitted. In the MAP algorithm, we first obtain a posteriori probabilities (APPs) for each transmitted data bit; then, to decode a data bit, we assign to it the decision value that corresponds to the maximum a posteriori probability. The MAP algorithm using APPs minimizes the bit error probability (BER) by calculating the likelihood ratio (LR) for every transmitted bit $d_{n,0}$ (= $c_n$) as follows:

$$\delta_n = LR(c_n) = \frac{P(c_n = 1 \mid Y_1^N)}{P(c_n = 0 \mid Y_1^N)} \tag{3.46}$$

where $Y_1^N$ is the received corrupted data symbol sequence from time n = 1 through some time N. If $\delta_n > 1$, then the decoded bit is $c_n = 1$; if $\delta_n < 1$, then the decoded bit is $c_n = 0$. For RSC (recursive systematic coder) codes with the AWGN channel model, the APP of a transmitted coded bit $c_n$ is equal to the sum of the joint probabilities over all encoder states:

$$P(c_n = i \mid Y_1^N) = \sum_{m} \lambda_n^{i,m}, \quad i = 0, 1 \tag{3.47}$$

where $\lambda_n^{i,m} = P(c_n = i, S_n = m \mid Y_1^N)$ and $S_n$ is the encoder state at time n.
Therefore,

$$\delta_n = LR(c_n) = \frac{\sum_m \lambda_n^{1,m}}{\sum_m \lambda_n^{0,m}} \tag{3.48}$$

For 1 < n < N, the sequence $Y_1^N$ can be represented as $Y_1^N = \{Y_1^{n-1}, Y_n, Y_{n+1}^N\}$ and therefore

$$\lambda_n^{i,m} = P(c_n = i, S_n = m \mid \{Y_1^{n-1}, Y_n, Y_{n+1}^N\}) \tag{3.49}$$

Using Bayes' theorem, Equation (3.49) can be simplified and factored into three metrics as follows:

$$\lambda_n^{i,m} = \frac{\alpha_n^m \, \gamma_n^{i,m} \, \beta_{n+1}^{f(i,m)}}{P(Y_1^N)} \tag{3.50}$$

where

$\alpha_n^m \cong P(Y_1^{n-1} \mid S_n = m)$, a forward state metric at time n and state m
$\gamma_n^{i,m} \cong P(c_n = i, S_n = m, Y_n)$, a branch metric at time n and state m
$\beta_{n+1}^{f(i,m)} = P(Y_{n+1}^N \mid S_{n+1} = f(i,m))$, a reverse state metric at time n + 1 and state f(i, m), where f(i, m) is the next state for a given input bit i and state m.

Then the MAP algorithm is translated to

$$\delta_n = LR(c_n) = \frac{\sum_m \alpha_n^m \, \gamma_n^{1,m} \, \beta_{n+1}^{f(1,m)}}{\sum_m \alpha_n^m \, \gamma_n^{0,m} \, \beta_{n+1}^{f(0,m)}} \tag{3.51}$$

We take the natural logarithm of both sides of the preceding equation to avoid the multiplications present in computing likelihood ratios. The resultant Log-MAP algorithm is given by

$$\bar{\delta}_n = LLR(c_n) = \ln \frac{\sum_m \alpha_n^m \gamma_n^{1,m} \beta_{n+1}^{f(1,m)}}{\sum_m \alpha_n^m \gamma_n^{0,m} \beta_{n+1}^{f(0,m)}} = \ln \sum_m \alpha_n^m \gamma_n^{1,m} \beta_{n+1}^{f(1,m)} - \ln \sum_m \alpha_n^m \gamma_n^{0,m} \beta_{n+1}^{f(0,m)} \tag{3.52}$$

If $\ln(ab) = \ln(a) + \ln(b) = \bar{a} + \bar{b}$, then $ab = e^{\bar{a} + \bar{b}}$. Using this transformation,

$$\bar{\delta}_n = \ln \sum_m e^{\bar{\alpha}_n^m + \bar{\gamma}_n^{1,m} + \bar{\beta}_{n+1}^{f(1,m)}} - \ln \sum_m e^{\bar{\alpha}_n^m + \bar{\gamma}_n^{0,m} + \bar{\beta}_{n+1}^{f(0,m)}} \tag{3.53}$$

where $\bar{\alpha}_n^m$, $\bar{\beta}_{n+1}^{f(1,m)}$, $\bar{\gamma}_n^{1,m}$, and $\bar{\delta}_n$ are the logarithms of $\alpha_n^m$, $\beta_{n+1}^{f(1,m)}$, $\gamma_n^{1,m}$, and $\delta_n$, respectively.

Forward Metric Computation

The forward state metrics $\bar{\alpha}_n^m$ are recursively computed (or updated by accumulation) with the trellis representation of the encoder states (at each time instance n) from time n = 0, assuming initial values $\bar{\alpha}_0^0 = 0$ and $\bar{\alpha}_0^k = -\infty$, where $1 \le k \le 2^M - 1$ and M is the number of memory units present in one RSC encoder.
The forward state metrics $\bar{\alpha}_n^m$ at time n are computed from the forward state metrics $\bar{\alpha}_{n-1}^{b(j,m)}$ at time n − 1 according to

$$e^{\bar{\alpha}_n^m} = e^{\bar{\alpha}_{n-1}^{b(0,m)} + \bar{\gamma}_{n-1}^{0,b(0,m)}} + e^{\bar{\alpha}_{n-1}^{b(1,m)} + \bar{\gamma}_{n-1}^{1,b(1,m)}} \tag{3.54}$$

where b(j, m) is the previous state (at time n − 1) connecting to the present state m (at time n) for j = 0 and 1.

Reverse Metric Computation

The reverse state metrics $\bar{\beta}_n^m$ are recursively computed (or updated by accumulation) from n = N + 1, assuming initial values $\bar{\beta}_{N+1}^0 = 0$ and $\bar{\beta}_{N+1}^k = -\infty$, where $1 \le k \le 2^M - 1$. The reverse state metrics $\bar{\beta}_n^m$ at time n are computed from the reverse state metrics $\bar{\beta}_{n+1}^{f(j,m)}$ at time n + 1 using the encoder state trellis as

$$e^{\bar{\beta}_n^m} = e^{\bar{\beta}_{n+1}^{f(0,m)} + \bar{\gamma}_n^{0,m}} + e^{\bar{\beta}_{n+1}^{f(1,m)} + \bar{\gamma}_n^{1,m}} \tag{3.55}$$

Branch Metric Computation

The branch metric $\bar{\gamma}_n^{i,m}$ is computed from the definition of $\gamma_n^{i,m}$ as follows:

$$\gamma_n^{i,m} = P(c_n = i, S_n = m, Y_n) = P(Y_n \mid c_n = i, S_n = m)\, P(S_n = m \mid c_n = i)\, P(c_n = i) = P(Y_n \mid c_n = i, S_n = m)\, P^a(i)$$

where

$$P^a(i) = P(S_n = m \mid c_n = i)\, P(c_n = i) = \frac{1}{2^M} P(c_n = i)$$

We provide intrinsic information (both systematic symbols and parity symbols) to the first decoder as $\{y_{n,0}, y_{n,1}\}$ and to the second decoder as $\{I[y_{n,0}], y_{n,2}\}$. We derive the branch metric for the first decoder; the same approach can be used to obtain the branch metric for the second decoder. For the first decoder, $Y_n = \{y_{n,0}, y_{n,1}\}$. Assuming an AWGN channel with noise of zero mean and variance $\sigma^2$, and replacing the joint probability with the pdf (probability density function), the metric $\gamma_n^{i,m}$ is computed as

$$\gamma_n^{i,m} = P^a(i)\, \exp\left( -\frac{(y_{n,0} - x_{n,0}^i)^2 + (y_{n,1} - x_{n,1}^i)^2}{2\sigma^2} \right) \tag{3.56}$$

Although the right-hand side of Equation (3.56) appears to be independent of the state m, it is not: the parity symbols $x_{n,1}^i$ are state dependent.
Extrinsic Information Computation

From the log-MAP Equation (3.52),

$$\bar{\delta}_n = LLR(c_n) = \ln \frac{\sum_m \alpha_n^m \gamma_n^{1,m} \beta_{n+1}^{f(1,m)}}{\sum_m \alpha_n^m \gamma_n^{0,m} \beta_{n+1}^{f(0,m)}}$$

$$= \ln \frac{\sum_m \alpha_n^m \, P^a(1)\, e^{-\frac{(y_{n,0} - x_{n,0}^1)^2 + (y_{n,1} - x_{n,1}^1)^2}{2\sigma^2}} \, \beta_{n+1}^{f(1,m)}}{\sum_m \alpha_n^m \, P^a(0)\, e^{-\frac{(y_{n,0} - x_{n,0}^0)^2 + (y_{n,1} - x_{n,1}^0)^2}{2\sigma^2}} \, \beta_{n+1}^{f(0,m)}}$$

$$= \ln \frac{P^a(1)\, e^{-\frac{(y_{n,0} - x_{n,0}^1)^2}{2\sigma^2}} \sum_m \alpha_n^m \, e^{-\frac{(y_{n,1} - x_{n,1}^1)^2}{2\sigma^2}} \, \beta_{n+1}^{f(1,m)}}{P^a(0)\, e^{-\frac{(y_{n,0} - x_{n,0}^0)^2}{2\sigma^2}} \sum_m \alpha_n^m \, e^{-\frac{(y_{n,1} - x_{n,1}^0)^2}{2\sigma^2}} \, \beta_{n+1}^{f(0,m)}}$$

$$= \ln \frac{P^a(1)}{P^a(0)} + \ln e^{\frac{4 y_{n,0}}{2\sigma^2}} + \ln \frac{\sum_m \alpha_n^m \, e^{-\frac{(y_{n,1} - x_{n,1}^1)^2}{2\sigma^2}} \, \beta_{n+1}^{f(1,m)}}{\sum_m \alpha_n^m \, e^{-\frac{(y_{n,1} - x_{n,1}^0)^2}{2\sigma^2}} \, \beta_{n+1}^{f(0,m)}}$$

$$LLR(c_n) = L_1^e + \frac{2 y_{n,0}}{\sigma^2} + L_2^e$$

where $L_1^e = \ln\left(P^a(1)/P^a(0)\right)$ is the input a priori (log) probability ratio and $L_2^e$ is the output extrinsic information (or a priori information for the second decoder, used to minimize the probability of decoding error within an iterative decoding framework). The middle term has the sign shown when $x_{n,0}^1 = +1$ and $x_{n,0}^0 = -1$; with the mapping of Section 3.10.2, where bit 1 is sent as −1, its sign is reversed. The extrinsic information is computed from the likelihood ratio as

$$L_2^e = LLR(c_n) - L_1^e - \frac{2 y_{n,0}}{\sigma^2} \tag{3.57}$$

See Section 4.5 for simulation and implementation techniques of turbo codes.

3.10.4 Interleaver

In general, the purpose of the interleaver is to spread burst errors (which occur due to lightning or switching interference) across the entire received data sequence. First, consider the importance of the interleaver for the performance of block codes (e.g., RS codes). RS codes, discussed in Section 3.6, can correct up to T errors in a block of N data elements. If we assume that the received k-th block has zero errors and the (k + 1)th block has L (> T) errors, then the RS decoder does nothing in the k-th block and cannot correct the errors of the (k + 1)th block, as it has more than T errors. In Figure 3.49, we consider two RS codewords with N = 15, K = 7, and T = 4 to illustrate the purpose of the interleaver.
We can see in Figure 3.49 how we overcome this problem by interleaving the codewords at the transmitter and by spreading the errors at the receiver (by deinterleaving). In both schemes, X (without interleaver) and Y (with interleaver), we assume that the errors occur at the same positions in a data frame after transmission. The interleaver shown in Figure 3.49 is a simple two-element-depth matrix interleaver. In practice, the data is handled in data frames with hundreds of elements. We have to interleave all elements of a data frame at one time, which is why the matrix dimension is usually very large. In a simple interleaver, we fill a matrix of size P × Q row-wise and read it column-wise. In the deinterleaver, we fill the matrix column-wise and read the elements row-wise, as shown in Figure 3.50. If we do not have sufficient elements to fill the P × Q matrix, we fill the rest of the matrix with zeros, as shown in Figure 3.50. In general, after filling the matrix row-wise and before reading it column-wise, we randomize the row and column elements of the matrix to get random data elements.

Figure 3.49: Interleaver purpose illustration. (Scheme X, without interleaver: the first codeword needs no correction, while the second has 5 (> T) errors, which RS cannot correct. Scheme Y, with interleaver: after deinterleaving, the same channel errors are spread as 3 (< T) errors in the first codeword and 2 (< T) errors in the second, and RS can correct both.)

Figure 3.50: Interleaver and de-interleaver. (The interleaver writes $a_0, a_1, a_2, \ldots, a_{(P-1)Q}, \ldots$ row-wise into a P × Q matrix, padding with zeros, and reads it column-wise as $a_0, a_Q, a_{2Q}, \ldots$; the deinterleaver reverses the operation.)

The concept of interleaving reflects the Shannon view of random and very long complex codes that can approach channel capacity. Shannon (1948) showed that as the length of a code approaches infinity, random codes achieve channel capacity. Although we work (i.e., encode or decode) on a small block of data elements at a time within a data frame, because of interleaving, the dimension of the codes increases to the size of the data frame. By permuting the elements in the rows and columns of the matrix, we obtain random codes. The interleaver requires a large amount of data memory and introduces delay in the communications system.

In Figure 3.49, we saw the effect of the interleaver on the performance of block codes (e.g., RS codes). We now see how the interleaver improves the performance of convolutional codes (e.g., turbo codes). With simple convolutional decoding (e.g., using trellis codes with Viterbi decoding), we know that the decoding converges after 6K stages (where K is the constraint length of the coder). If a burst of errors (of the order of 6K in length) occurs in a particular coded data block, then the decoder never converges in that region when decoding the coded sequence. With turbo codes using an iterative MAP decoder, by interleaving and deinterleaving the data, the decoding process converges after a few iterations (as we spread the burst errors across the entire data frame). We output more reliable decisions as the number of iterations increases. In practice, we iterate between 6 and 18 times. See Section 4.5 for simulation of the turbo RSC encoder and the MAP decoder.

3.11 LDPC Codes

Low-density parity check (LDPC) codes, introduced by Gallager (1963), are linear block codes defined by sparse parity check matrices. These efficient error control codes have attracted a lot of attention due to (1) their remarkable bit error rate (BER) versus signal-to-noise ratio (SNR) performance, and (2) the availability of elegant decoding schemes.
LDPC codes with larger frame lengths can perform within 0.0045 dB of the Shannon limit. Like turbo codes, LDPC codes are decoded iteratively. The following table summarizes the coding differences between turbo codes and LDPC codes.

Turbo Codes | LDPC Codes
Generated with convolutional codes | Generated with block codes
Use trellis representation for decoding | Use graphical representation for decoding
Use MAP algorithm for decoding | Use sum-product algorithm for decoding
On average require 8 iterations | On average require 30 iterations
Decoding complexity per iteration is high | Decoding complexity per iteration is low

3.11.1 Graphical Representation of Parity Check Matrix

As we discussed in Section 3.3, linear block codes are defined by a parity-check matrix H. An M × N parity-check matrix H defines a linear block code of length N, where each codeword $C = [c_0 c_1 c_2 \ldots c_{N-1}]$ satisfies M parity check equations. The parity-check matrix can also be represented using a Tanner graph (a bipartite graph with two types of nodes). One set of nodes, called parity (or check) nodes, represents the parity check constraints, and the other set of nodes, called bit (or variable) nodes, represents the codeword bits, as shown in Figure 3.51. The edge connections between bit nodes and parity nodes are defined by the H matrix elements $h_{ji} \in \{0, 1\}$. If $h_{ji} = 1$, then bit node $b_i$ (corresponding to column i of H) is connected to parity node $p_j$ (corresponding to row j of H). Thus, each edge in the Tanner graph represents an entry of H that is equal to 1. In a bipartite graph, nodes of the same type cannot be connected (i.e., a bit node cannot be connected to another bit node). All bit nodes connected to a particular parity node must sum (modulo-2) to zero. A cycle of length L in a Tanner graph is a path of L distinct edges that closes on itself. One such cycle of length L = 4 is shown with dark edges in Figure 3.51. The shortest possible cycle in a Tanner graph has length 4.
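A length-4 cycle exists exactly when two parity nodes are both connected to the same two bit nodes, that is, when two rows of H have 1s in two common columns. A minimal sketch of a check for this follows; the function name `count_4cycles` and the row-major array layout are assumptions made for illustration.

```c
#include <assert.h>

/* Count the length-4 cycles in the Tanner graph of an m x n binary matrix H,
 * stored row-major with entries 0 or 1. Two parity nodes (rows) that share
 * k bit nodes (columns) contribute k * (k - 1) / 2 cycles. */
int count_4cycles(const unsigned char *H, int m, int n)
{
    assert(H != 0 && m > 0 && n > 0);

    int cycles = 0;
    for (int j1 = 0; j1 < m; j1++) {
        for (int j2 = j1 + 1; j2 < m; j2++) {
            int shared = 0;                       /* columns with a 1 in both rows */
            for (int i = 0; i < n; i++)
                shared += H[j1 * n + i] & H[j2 * n + i];
            cycles += shared * (shared - 1) / 2;
        }
    }
    return cycles;
}
```

A return value of zero means that no two parity checks share more than one bit node, so the graph has no cycle of length 4.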
The presence of short cycles in a Tanner graph limits the decoding performance of graph codes such as the LDPC code. We avoid short cycles (especially of length L = 4) when designing parity check matrices for Tanner graph codes. To work with an example graph code, we consider the parity check matrix $H_{4 \times 7}$ given in Equation (3.58). The corresponding Tanner graph is shown in Figure 3.52. The number of parity nodes in a Tanner graph is equal to the number of rows of the parity check matrix, and the number of bit nodes is equal to the number of its columns.

$$H_{4 \times 7} = \begin{bmatrix} 1 & 0 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 & 1 & 0 & 1 \end{bmatrix} \tag{3.58}$$

Figure 3.51: Graphical representation of parity-check matrix H. (Parity nodes $p_0, \ldots, p_{M-1}$, each enforcing a zero parity constraint; bit nodes $b_0, \ldots, b_{N-1}$ carrying the codeword bits $c_0, \ldots, c_{N-1}$.)

Figure 3.52: Tanner graph of Equation (3.58). (Parity nodes $p_0$ to $p_3$; bit nodes $b_0$ to $b_6$.)

Figure 3.53: Graphical representation of codeword [1010111] with $H_{4 \times 7}$. (All four parity nodes evaluate to 0.)

Figure 3.54: Parity checks for received codeword [1110111]. (The parity nodes evaluate to 0, 0, 1, 1.)

Decoding Hard-Decision Channel Output: Bit-Flip Algorithm

Before discussing the graph codes used in practice, we work through a simple hard-decision decoding example to understand graph codes a bit more. With the parity check matrix H given in Equation (3.58), the encoded codeword C for the input message vector M = [101] is computed as $C = [c_0 c_1 c_2 \ldots c_6] = [1010111]$. Usually, to get an encoded codeword, we multiply the message vector M by a generator matrix G, which is obtained from the parity check matrix H. We can verify that the parity check constraint of H is satisfied (i.e., the modulo-2 sum of all bit nodes connected to any particular parity node is zero) for the codeword [1010111], as shown in Figure 3.53.
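This verification can be written as a short syndrome computation. The sketch below hard-codes the matrix of Equation (3.58); the names `M_ROWS`, `N_COLS`, and `syndrome` are hypothetical.

```c
#include <assert.h>

enum { M_ROWS = 4, N_COLS = 7 };

/* Parity-check matrix of Equation (3.58). */
static const unsigned char H[M_ROWS][N_COLS] = {
    {1, 0, 1, 0, 0, 1, 1},
    {0, 0, 1, 1, 1, 0, 0},
    {1, 1, 0, 0, 0, 1, 0},
    {0, 1, 0, 1, 1, 0, 1}
};

/* Compute s = H c^T (mod 2) for a word c of 0/1 values.
 * Returns the number of unsatisfied parity checks (0 for a valid codeword). */
int syndrome(const unsigned char c[N_COLS], unsigned char s[M_ROWS])
{
    assert(c != 0 && s != 0);

    int unsatisfied = 0;
    for (int j = 0; j < M_ROWS; j++) {
        unsigned char sum = 0;
        for (int i = 0; i < N_COLS; i++)
            sum ^= (unsigned char)(H[j][i] & c[i]);   /* modulo-2 accumulation */
        s[j] = sum;
        unsatisfied += sum;
    }
    return unsatisfied;
}
```

For the codeword [1010111] the function returns 0; for a word with a single error in $c_1$, [1110111], it returns 2 with s = [0 0 1 1].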
Assume that the received data (hard-decision channel output) after transmission through a noisy channel is $D = [d_0 d_1 d_2 \ldots d_6] = [1110111]$. There is a 1-bit error in the received data (bit $d_1$). As a result, some parity nodes do not satisfy the parity constraint, as shown in Figure 3.54. We use the bit-flip algorithm to correct the error bit in the received codeword by passing message bits between bit nodes and parity nodes. With the bit-flip algorithm, we flip the bit values passed from a parity node ($p_j$) to the bit nodes ($b_i$) whenever the parity constraint (i.e., the modulo-2 sum of the inputs is equal to zero) at that parity node is not satisfied. To better understand the bit-flip algorithm for decoding a codeword using Figure 3.54, we tabulate the values of the message bits passed from bit nodes ($b_i$) to parity nodes ($p_j$) and from parity nodes ($p_j$) to bit nodes ($b_i$) in Tables 3.2 and 3.3. In Table 3.2, we pass the received hard-decision bits from the bit nodes to the parity nodes. The computed parity for the received bits is given in the right column of Table 3.2. If all received bits are error free, then the computed parity is zero. In our case, we get some nonzero parity bits because the received bits contain one error. In the last two rows, the computed parity bits are not zero, which means that the error bit is present in those two rows. As we do not know in reality which bit is in error, we tentatively pass flipped bits from the parity nodes at which the parity constraint is not satisfied to the bit nodes, assuming that the message bit from the current bit node is in error and the bits from the other bit nodes are error free.
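One pass of this procedure, together with the majority vote over the received bit and the returned messages (the rule used to build Table 3.3), can be sketched as follows for the matrix of Equation (3.58). The function name `bit_flip_pass` is hypothetical, and keeping the received bit on a tied vote is an assumption; no tie can occur with this particular H, because every bit node connects to exactly two parity nodes.

```c
#include <assert.h>

enum { M_ROWS = 4, N_COLS = 7 };

/* Parity-check matrix of Equation (3.58). */
static const unsigned char H[M_ROWS][N_COLS] = {
    {1, 0, 1, 0, 0, 1, 1},
    {0, 0, 1, 1, 1, 0, 0},
    {1, 1, 0, 0, 0, 1, 0},
    {0, 1, 0, 1, 1, 0, 1}
};

/* One bit-flip pass over hard-decision bits d (0/1 values); result in out. */
void bit_flip_pass(const unsigned char d[N_COLS], unsigned char out[N_COLS])
{
    unsigned char parity[M_ROWS];
    assert(d != 0 && out != 0);

    /* Bit nodes -> parity nodes: modulo-2 sum of the connected received bits. */
    for (int j = 0; j < M_ROWS; j++) {
        parity[j] = 0;
        for (int i = 0; i < N_COLS; i++)
            parity[j] ^= (unsigned char)(H[j][i] & d[i]);
    }

    /* Parity nodes -> bit nodes, then majority vote at each bit node. */
    for (int i = 0; i < N_COLS; i++) {
        int ones  = d[i];   /* the received bit takes part in the vote */
        int votes = 1;
        for (int j = 0; j < M_ROWS; j++) {
            if (!H[j][i])
                continue;
            /* A satisfied check returns d[i]; an unsatisfied one returns it flipped. */
            ones += d[i] ^ parity[j];
            votes++;
        }
        if (2 * ones > votes)
            out[i] = 1;
        else if (2 * ones < votes)
            out[i] = 0;
        else
            out[i] = d[i];  /* tie: keep the received bit (assumption) */
    }
}
```

For the received word [1110111], one pass returns [1010111], the transmitted codeword.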
Table 3.2: Message bits passing from bit nodes to parity nodes

Parity node | Message passing from bit nodes to parity nodes | Parity constraint
p0 | b0 → p0: 1, b2 → p0: 1, b5 → p0: 1, b6 → p0: 1 | 0
p1 | b2 → p1: 1, b3 → p1: 0, b4 → p1: 1 | 0
p2 | b0 → p2: 1, b1 → p2: 1, b5 → p2: 1 | 1
p3 | b1 → p3: 1, b3 → p3: 0, b4 → p3: 1, b6 → p3: 1 | 1

Table 3.3: Message bits passing from parity nodes to bit nodes (flipped bits are marked with *)

Bit node | Message passing from parity nodes to bit nodes | Received bit | Decoded bit
b0 | p0 → b0: 1, p2 → b0: 0* | 1 | 1
b1 | p2 → b1: 0*, p3 → b1: 0* | 1 | 0 (corrected)
b2 | p0 → b2: 1, p1 → b2: 1 | 1 | 1
b3 | p1 → b3: 0, p3 → b3: 1* | 0 | 0
b4 | p1 → b4: 1, p3 → b4: 0* | 1 | 1
b5 | p0 → b5: 1, p2 → b5: 0* | 1 | 1
b6 | p0 → b6: 1, p3 → b6: 0* | 1 | 1

In Table 3.3, the bits that are flipped when passing from parity nodes to bit nodes are marked with an asterisk. We then decode the bit at each bit node with the majority vote criterion, using the bits from the parity nodes and the received bit for that bit node. The decoded bits are shown in the right-most column of Table 3.3, and the error bit is corrected by the bit-flip algorithm. With hard decisions as the input to the graph code, as shown in Figure 3.54, it is difficult to correct more than one error per codeword with the given parity check matrix $H_{4 \times 7}$. In later sections, we introduce the sum-product algorithm (which can take soft decisions as input) to decode graph codes, and with it we can correct more than a 1-bit error per codeword using the same $H_{4 \times 7}$ parity check matrix.

3.11.2 LDPC Encoder

Since LDPC codes are block codes defined by a parity check matrix $H_{M \times N}$, like any other block codes, we can compute the LDPC encoder codeword vector for a given message vector by simply multiplying the message vector by the generator matrix G, which is derived from the parity check matrix H. However, to achieve bit-error rate performance close to the channel capacity with LDPC codes, we require a codeword size in the order of thousands of bits.
Matrix multiplication for a codeword of that size has huge memory and computational requirements. Also, generating parity check matrices of that order is not a simple task. For this reason, and due to the lack of linear time decoding algorithms in earlier times, LDPC codes were forgotten for decades. With the recent developments in semiconductor technology, deterministic ways of computing large parity check matrices, and the introduction of polynomial time decoding algorithms for LDPC codes (e.g., the sum-product algorithm), LDPC codes were rediscovered in the late 1990s. Since then LDPC codes have gained momentum, and they have recently been adopted in a few standards such as WiMax (IEEE 802.16e) and DVB. In the WiMax standard, the parity check matrices are represented compactly, with only a few elements to store. Since the parity check matrix H of an LDPC code is a sparse (or low-density) binary matrix, it can be represented with small zero matrices and permutations of an identity matrix, as in 802.16e. The WiMax standard describes how to uncompress the compact base parity check matrix to get the actual parity matrix, and also describes how to encode lengthy codewords using small message blocks without computing the generator matrix G from the parity check matrix H. Refer to the 802.16e standard for more details on implementing the LDPC encoder for practical applications.

The number of 1s participating in the generation of any parity bit of the LDPC code is very small due to the low density of 1s in the parity check matrix H. Let $w_r$ be the weight of the j-th row; then the number of 1s participating in the generation of the j-th parity bit is $w_r$ (here the weight of a binary vector is defined as the number of 1s present in it). Similarly, the i-th column weight $w_c$ gives the number of parity constraints that depend on the i-th message bit. LDPC codes are of two types: regular and irregular.
If the row weights $w_r$ and the column weights $w_c$ are uniform (or almost uniform), then we call the code a regular LDPC code; otherwise, we call it an irregular LDPC code. Usually, irregular LDPC codes perform better than regular LDPC codes. In this chapter, we concentrate on regular LDPC codes. To generate a regular LDPC code, a small (≥ 3) column weight $w_c$ is selected first, and values for N (the block length) and M (the redundancy length) are selected. Then an M × N matrix H is generated, which has weight $w_c$ in each column and weight $w_r$ in each row. To get a uniform row weight $w_r$, we have to satisfy $w_c N = w_r M$. One more important characteristic of a regular LDPC code is that the minimum distance of the code increases linearly with N provided that $w_c \ge 3$.

3.11.3 LDPC Decoder

In this section, we discuss the sum-product algorithm, a practically usable soft-decision decoding algorithm for LDPC codes. Like the bit-flip algorithm discussed in Section 3.11.1, the sum-product algorithm uses the concept of message passing, or belief propagation, between bit nodes and parity nodes in an iterative manner. The advantage of the sum-product algorithm is that it can accept soft values: we do not pass hard-decision message bits between nodes; instead, we pass message reliability information between bit nodes and parity nodes. As we iterate the sum-product algorithm on the Tanner graph, the reliability of the soft information (called the a posteriori probability) improves with the iteration count.

Suppose that an encoded LDPC codeword $C = [c_0 c_1 c_2 \ldots c_{N-1}]$ is modulated using binary phase shift keying (BPSK) modulation, and let $X = [x_0 x_1 x_2 \ldots x_{N-1}]$ be the resultant message symbol vector after BPSK modulation. With BPSK, we map codeword bit "0" to symbol "1" and codeword bit "1" to symbol "−1." This BPSK symbol vector X is then transmitted through an AWGN channel, and $Y = [y_0 y_1 y_2 \ldots y_{N-1}]$ is the corresponding vector of received symbols. Assume that the codewords ($Y_j$) and symbols ($y_i$) are properly synchronized before being passed to the LDPC decoder.

3.11.4 Sum-Product Algorithm

At the receiver, we use the sum-product algorithm to decode the LDPC codeword. The ultimate goal of the sum-product algorithm is to find the LLR of the encoded bit $c_i$, which is defined as

$$LLR_i = \log \frac{P(c_i = 1 \mid Y_0^{N-1})}{P(c_i = 0 \mid Y_0^{N-1})}, \quad Y_0^{N-1} = [y_0, y_1, y_2, \ldots, y_{N-1}] \tag{3.59}$$

Then we make the hard decision to get the decoded bit $\hat{c}_i$ as follows:

$$\hat{c}_i = \begin{cases} 1 & \text{if } LLR_i < 0 \\ 0 & \text{otherwise} \end{cases} \tag{3.60}$$

(Rule (3.60), like the channel values $\lambda_i = 2y_i/\sigma^2$ with bit 0 mapped to +1, treats a negative LLR as evidence for bit 1; this corresponds to the ratio in Equation (3.59) taken with $P(c_i = 0 \mid \cdot)$ in the numerator.)

Here, we do not provide complete derivations for the sum-product algorithm; instead we use the final metric computation equations to work with the algorithm. For full derivations of the LDPC decoding equations and sum-product algorithms, Gallager (1963) and MacKay (1999) are recommended. We use the following notation for the sum-product algorithm:

$Q_{ij}$ = extrinsic information to be passed from bit node i to parity node j
$R_{ji}$ = extrinsic information to be passed from parity node j to bit node i
$U_i = \{j \text{ such that } h_{ji} = 1\}$, the set of row locations of the 1s in the i-th column
$U_{i \backslash a} = U_i - \{a\}$, the set of row locations of the 1s in the i-th column excluding the a-th row
$V_j = \{i \text{ such that } h_{ji} = 1\}$, the set of column locations of the 1s in the j-th row
$V_{j \backslash b} = V_j - \{b\}$, the set of column locations of the 1s in the j-th row excluding the b-th column
$\alpha_{ij} = \text{sign}(Q_{ij})$, the sign of $Q_{ij}$
$\beta_{ij} = |Q_{ij}|$, the magnitude of $Q_{ij}$
$\phi(x) = -\log[\tanh(x/2)] = \log \dfrac{e^x + 1}{e^x - 1} = \phi^{-1}(x)$
$\lambda_i = 2y_i/\sigma^2$, the channel a posteriori probabilities (APPs)

Figure 3.55: (a) Connections to the third parity node from four bit nodes ($b_1$, $b_3$, $b_4$, $b_6$ send $Q_{13}$, $Q_{33}$, $Q_{43}$, $Q_{63}$ to $p_3$). (b) Connections to the 5th bit node from two parity nodes ($p_0$ and $p_2$ send $R_{05}$ and $R_{25}$ to $b_5$).

Figure 3.55 and Example 3.21 illustrate the previous notation.
For this, we use the parity check matrix in Equation (3.58), and we consider the fifth bit node connecting to two parity nodes and the third parity node connecting to four bit nodes, as shown in Figure 3.55.

■ Example 3.21
The set of row locations with 1s in the 5th column of H is U5 = {0, 2} and the set of columns with 1s in the 3rd row of H is V3 = {1, 3, 4, 6}, as highlighted in Figure 3.56.

    H = \begin{bmatrix}
    1 & 0 & 1 & 0 & 0 & 1 & 1 \\
    0 & 0 & 1 & 1 & 1 & 0 & 0 \\
    1 & 1 & 0 & 0 & 0 & 1 & 0 \\
    0 & 1 & 0 & 1 & 1 & 0 & 1
    \end{bmatrix} \qquad (\text{columns } i = 0, \ldots, 6;\ \text{rows } j = 0, \ldots, 3)

Figure 3.56: Parity check matrix H illustrating the node connections of Figure 3.55.
■

Then U5\0 = U5 − {0} = {2}, U5\2 = U5 − {2} = {0}, V3\1 = V3 − {1} = {3, 4, 6}, V3\3 = V3 − {3} = {1, 4, 6}, and so on.

We initialize Qij = λi at the start of the sum-product algorithm. After some manipulations, Equation (3.59) is computed using the Tanner graph as

    \mathrm{LLR}_i = \lambda_i + \sum_{j \in U_i} R_{ji}    (3.61)

The extrinsic information Rji is computed as

    R_{ji} = \Bigl( \prod_{i' \in V_j \setminus i} \alpha_{i'j} \Bigr) \, \phi\Bigl( \sum_{i' \in V_j \setminus i} \phi(\beta_{i'j}) \Bigr)    (3.62)

The sum-product algorithm is iterated many times to converge the LLRi values to the true APPs, and a single iteration involves computation of Qij at the bit nodes, Rji at the parity nodes, and LLRi at the end of each iteration.

Figure 3.57: Data flow diagram of multi-iteration LDPC decoder. (The bit nodes are initialized with the received data; in each of the S decoder iterations, Qij passes through the edge interleaver to the parity nodes and Rji returns through the edge deinterleaver to the bit nodes; the decision maker applies Equation (3.60) to the final LLRi values to produce the decoded bits.)

The extrinsic information Qij to be passed from bit nodes to parity nodes is updated in subsequent iterations as follows:

    Q_{ij} = \mathrm{LLR}_i - R_{ji}    (3.63)

An S-iteration LDPC decoding with the sum-product algorithm using an unrolled Tanner graph is illustrated in Figure 3.57. All bit nodes are initialized with received data symbols yi.
The channel APPs λi are computed from yi. At the start of iterative decoding, the Qij values are initialized with λi for all j wherever hji = 1. Then we compute the Rji values using Equation (3.62) and the LLRi values using Equation (3.61). These steps account for the first iteration. In the subsequent iterations, we use Equations (3.63), (3.62), and (3.61) to compute Qij, Rji, and LLRi. In any iteration, we compute Qij at the bit nodes and pass it to the connected parity nodes; then we compute Rji at the parity nodes and pass it to the connected bit nodes; next we update LLRi at every bit node, and this completes a single iteration of the sum-product algorithm. If the Tanner graph contains no cycles, then the LLRi values converge to the true APPs as the number of iterations tends to infinity. However, in practice we halt the sum-product algorithm if any one of the following conditions is satisfied:

Halt if ĉH^T = 0 (this requires computation of ĉ at the end of each iteration), or
Halt if the maximum number of iterations (S) is reached

Unlike the turbo coder, the LDPC coder does not have an external interleaver for randomization of messages. However, the edge connections from bit nodes to parity nodes act as interleaving of the extrinsic information (i.e., the Qij values) passed from bit nodes to parity nodes, and the edge connections from parity nodes to bit nodes act as deinterleaving of the extrinsic information (i.e., the Rji values) passed from parity nodes to bit nodes.

3.11.5 Min-Sum Algorithm

The sum-product algorithm is computationally expensive as it involves processing the nonlinear function φ(.). For this reason, we use the min-sum algorithm (which is an approximation of the sum-product algorithm) in practical LDPC decoders. As the nonlinear function φ(.)
is a hyperbolic self-inverse function, we can approximate the sum-product algorithm as follows:

    \phi\Bigl( \sum_{i'} \phi(\beta_{i'j}) \Bigr) \approx \phi\Bigl( \phi\bigl( \min_{i'} \beta_{i'j} \bigr) \Bigr) = \min_{i'} \beta_{i'j}    (3.64)

Using the approximation in Equation (3.64), we can approximate the computationally expensive metric Rji as

    R_{ji} = \Bigl( \prod_{i' \in V_j \setminus i} \alpha_{i'j} \Bigr) \min_{i' \in V_j \setminus i} \beta_{i'j}    (3.65)

To avoid a biased estimate of Rji in Equation (3.65), we multiply Equation (3.65) by a constant k, where k < 1:

    R_{ji} = k \Bigl( \prod_{i' \in V_j \setminus i} \alpha_{i'j} \Bigr) \min_{i' \in V_j \setminus i} \beta_{i'j}    (3.66)

The computation of Rji using Equation (3.66) involves only the computation of a minimum of magnitudes and the XOR of sign information. This greatly reduces the complexity of the sum-product algorithm. The performance loss due to the approximation is about 0.2 dB, which is acceptable for practical applications. The C simulation of the min-sum algorithm is presented in Section 4.6.

3.11.6 Simulation Results

We use the same parity check matrix H given in Equation (3.58) to work with the min-sum algorithm for decoding the LDPC codeword. We consider the same codeword used with the bit-flip algorithm, that is, C = [1010111]. The BPSK modulated symbols of codeword C are X = [−1, 1, −1, 1, −1, −1, −1]. We pass the BPSK modulated symbols through an AWGN channel with noise variance σ² = 1. At the receiver we decode the message bits with the min-sum algorithm using received noisy symbols for four test cases whose hard decisions contain 1-, 2-, 3-, and 4-bit errors. If we look at the soft values of the hard-decision bits that are in error, those soft values are near zero with a flipped sign in all four cases. In terms of probability, their probability is around 0.5, indicating that they have almost equal chances of being 0 or 1. The value for constant k in Equation (3.66) is chosen as 0.8 (for better performance results k is chosen between 0.8 and 0.9). In all four cases, we present the first few iterations and the last iteration outputs.
We stop the decoding if the hard decisions contain no errors at the end of any particular iteration or if the maximum iteration count of S = 10 is reached.

Case 1: One-bit error in hard decisions of channel output. Let the received noisy vector be Y = [−0.85, −0.05, −0.91, +0.88, −0.79, −0.90, −0.81] and the corresponding hard-decision vector D = [1110111]. The bit in error is the second bit (index 1).

Initialization and First Iteration

Channel APPs: λi = [−1.70, −0.10, −1.82, 1.76, −1.58, −1.80, −1.62]

Extrinsic information passed from bit nodes to parity nodes:

    Q_{ij} = \begin{bmatrix}
    -1.70 & 0 & -1.82 & 0 & 0 & -1.80 & -1.62 \\
    0 & 0 & -1.82 & 1.76 & -1.58 & 0 & 0 \\
    -1.70 & -0.10 & 0 & 0 & 0 & -1.80 & 0 \\
    0 & -0.10 & 0 & 1.76 & -1.58 & 0 & -1.62
    \end{bmatrix}

Extrinsic information passed from parity nodes to bit nodes:

    R_{ji} = \begin{bmatrix}
    -1.30 & 0 & -1.30 & 0 & 0 & -1.30 & -1.36 \\
    0 & 0 & -1.26 & 1.26 & -1.41 & 0 & 0 \\
    0.08 & 1.36 & 0 & 0 & 0 & 0.08 & 0 \\
    0 & 1.26 & 0 & -0.08 & 0.08 & 0 & 0.08
    \end{bmatrix}

Updated LLRs for transmitted message bits: LLRi = [−2.91, 2.52, −4.38, 2.94, −2.908, −3.02, −2.9]

Hard-decision output: Ĉ = [1010111]

At the end of the first iteration we get the right outputs after making hard decisions, and we stop decoding for Case 1 with the min-sum algorithm.

Case 2: Two-bit errors in hard decisions of channel output. Let the received noisy vector be Y = [−0.85, −0.05, −0.91, +0.88, −0.79, 0.10, −0.81] and the corresponding hard-decision vector D = [1110101]. The bits in error are the second and sixth bits (indices 1 and 5).
Initialization and First Iteration

Channel APPs: λi = [−1.70, −0.10, −1.82, 1.76, −1.58, 0.20, −1.62]

Extrinsic information passed from bit nodes to parity nodes:

    Q_{ij} = \begin{bmatrix}
    -1.70 & 0 & -1.82 & 0 & 0 & 0.20 & -1.62 \\
    0 & 0 & -1.82 & 1.76 & -1.58 & 0 & 0 \\
    -1.70 & -0.10 & 0 & 0 & 0 & 0.20 & 0 \\
    0 & -0.10 & 0 & 1.76 & -1.58 & 0 & -1.62
    \end{bmatrix}

Extrinsic information passed from parity nodes to bit nodes:

    R_{ji} = \begin{bmatrix}
    0.16 & 0 & 0.16 & 0 & 0 & -1.29 & 0.16 \\
    0 & 0 & -1.26 & 1.26 & -1.41 & 0 & 0 \\
    -0.08 & -0.16 & 0 & 0 & 0 & 0.08 & 0 \\
    0 & 1.26 & 0 & -0.08 & 0.08 & 0 & 0.08
    \end{bmatrix}

Updated LLRs for transmitted message bits: LLRi = [−1.62, 1.00, −2.92, 2.94, −2.91, −1.02, −1.38]

Hard-decision output: Ĉ = [1010111]

At the end of the first iteration we get the right outputs after making hard decisions, and we stop decoding for Case 2 with the min-sum algorithm.

Case 3: Three-bit errors in hard decisions of channel output. Let the received noisy vector be Y = [−0.85, −0.05, −0.91, −0.08, −0.79, 0.10, −0.81] and the corresponding hard-decision vector D = [1111101]. The bits in error are the second, fourth, and sixth bits (indices 1, 3, and 5).

Initialization and First Iteration

Channel APPs: λi = [−1.70, −0.10, −1.82, −0.16, −1.58, 0.20, −1.62]

Extrinsic information passed from bit nodes to parity nodes:

    Q_{ij} = \begin{bmatrix}
    -1.70 & 0 & -1.82 & 0 & 0 & 0.20 & -1.62 \\
    0 & 0 & -1.82 & -0.16 & -1.58 & 0 & 0 \\
    -1.70 & -0.10 & 0 & 0 & 0 & 0.20 & 0 \\
    0 & -0.10 & 0 & -0.16 & -1.58 & 0 & -1.62
    \end{bmatrix}

Extrinsic information passed from parity nodes to bit nodes:

    R_{ji} = \begin{bmatrix}
    0.16 & 0 & 0.16 & 0 & 0 & -1.29 & 0.16 \\
    0 & 0 & 0.128 & 1.26 & 0.127 & 0 & 0 \\
    -0.08 & -0.16 & 0 & 0 & 0 & 0.08 & 0 \\
    0 & -0.128 & 0 & -0.08 & -0.08 & 0 & -0.08
    \end{bmatrix}

Updated LLRs for transmitted message bits: LLRi = [−1.62, −0.387, −1.53, 1.024, −1.53, −1.016, −1.54]

Hard-decision output: Ĉ = [1110111]

At the end of the first iteration, we have a 1-bit error in the outputs after making hard decisions, and we continue decoding with the min-sum algorithm.
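The iteration shown for these cases can be reproduced with a short C sketch. It is not the book's Section 4.6 simulation: the function name minsum_decode, the dense storage of Qij and Rji, and the use of the ĉH^T = 0 test as the halting rule are assumptions made for illustration. The matrix H is the one from Figure 3.56, and the updates follow Equations (3.66), (3.61), (3.60), and (3.63).

```c
#include <assert.h>

#define M 4   /* parity nodes (rows of H) */
#define N 7   /* bit nodes (columns of H) */

/* Parity check matrix of Figure 3.56 */
static const int H[M][N] = {
    {1, 0, 1, 0, 0, 1, 1},
    {0, 0, 1, 1, 1, 0, 0},
    {1, 1, 0, 0, 0, 1, 0},
    {0, 1, 0, 1, 1, 0, 1}
};

/* Returns 1 when c satisfies every parity check (c * H^T = 0). */
static int parity_ok(const int c[N])
{
    for (int j = 0; j < M; j++) {
        int s = 0;
        for (int i = 0; i < N; i++) s ^= H[j][i] & c[i];
        if (s) return 0;
    }
    return 1;
}

/* Min-sum decoding of the received symbols y (hypothetical helper).
   Writes the hard decisions to c_hat and returns the number of
   iterations used, or max_iter if the parity checks never hold. */
int minsum_decode(const double y[N], double sigma2, double k,
                  int max_iter, int c_hat[N])
{
    double lam[N], llr[N], Q[M][N], R[M][N];

    assert(max_iter > 0 && sigma2 > 0.0);
    for (int i = 0; i < N; i++) {          /* lambda_i = 2*y_i/sigma^2, Q_ij = lambda_i */
        lam[i] = 2.0 * y[i] / sigma2;
        for (int j = 0; j < M; j++) Q[j][i] = H[j][i] ? lam[i] : 0.0;
    }
    for (int it = 1; it <= max_iter; it++) {
        /* Parity-node update, Equation (3.66) */
        for (int j = 0; j < M; j++)
            for (int i = 0; i < N; i++) {
                double sign = 1.0, best = 1e30;
                R[j][i] = 0.0;
                if (!H[j][i]) continue;
                for (int t = 0; t < N; t++) {   /* all bit nodes in V_j except i */
                    double mag;
                    if (!H[j][t] || t == i) continue;
                    mag = Q[j][t] < 0.0 ? -Q[j][t] : Q[j][t];
                    if (Q[j][t] < 0.0) sign = -sign;
                    if (mag < best) best = mag;
                }
                R[j][i] = k * sign * best;
            }
        /* Bit-node update, Equation (3.61), and hard decision, Equation (3.60) */
        for (int i = 0; i < N; i++) {
            llr[i] = lam[i];
            for (int j = 0; j < M; j++) llr[i] += R[j][i];
            c_hat[i] = (llr[i] < 0.0) ? 1 : 0;
        }
        if (parity_ok(c_hat)) return it;
        /* Messages for the next iteration, Equation (3.63) */
        for (int j = 0; j < M; j++)
            for (int i = 0; i < N; i++)
                Q[j][i] = H[j][i] ? llr[i] - R[j][i] : 0.0;
    }
    return max_iter;
}
```

Called with the received vector of Case 1 or Case 2, σ² = 1, and k = 0.8, the routine is expected to stop after the first iteration; with the Case 3 vector it needs a second iteration. Because it halts on ĉH^T = 0 rather than on a comparison with the transmitted codeword, it can also stop at a valid codeword that differs from the transmitted one.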
Second Iteration

Extrinsic information passed from bit nodes to parity nodes:

    Q_{ij} = \begin{bmatrix}
    -1.78 & 0 & -1.69 & 0 & 0 & 0.28 & -1.70 \\
    0 & 0 & -1.66 & -0.24 & -1.66 & 0 & 0 \\
    -1.54 & -0.227 & 0 & 0 & 0 & -1.095 & 0 \\
    0 & -0.26 & 0 & 1.104 & -1.452 & 0 & -1.459
    \end{bmatrix}

Extrinsic information passed from parity nodes to bit nodes:

    R_{ji} = \begin{bmatrix}
    0.224 & 0 & 0.224 & 0 & 0 & -1.354 & 0.224 \\
    0 & 0 & 0.192 & 1.328 & 0.192 & 0 & 0 \\
    0.182 & 0.876 & 0 & 0 & 0 & 0.182 & 0 \\
    0 & 0.883 & 0 & -0.207 & 0.207 & 0 & 0.207
    \end{bmatrix}

Updated LLRs for transmitted message bits: LLRi = [−1.29, 1.659, −1.404, 0.96, −1.18, −0.97, −1.188]

Hard-decision output: Ĉ = [1010111]

We get the right outputs at the end of the second iteration after making hard decisions, and we stop decoding for Case 3 with the min-sum algorithm.

Case 4: Four-bit errors in hard decisions of channel output. Let the received noisy vector be Y = [0.14, −0.05, −0.91, −0.08, −0.79, 0.10, −0.81] and the corresponding hard-decision vector D = [0111101]. The bits in error are the first, second, fourth, and sixth bits (indices 0, 1, 3, and 5).

Initialization and First Iteration

Channel APPs: λi = [0.28, −0.10, −1.82, −0.16, −1.58, 0.20, −1.62]

Extrinsic information passed from bit nodes to parity nodes:

    Q_{ij} = \begin{bmatrix}
    0.28 & 0 & -1.82 & 0 & 0 & 0.20 & -1.62 \\
    0 & 0 & -1.82 & -0.16 & -1.58 & 0 & 0 \\
    0.28 & -0.10 & 0 & 0 & 0 & 0.20 & 0 \\
    0 & -0.10 & 0 & -0.16 & -1.58 & 0 & -1.62
    \end{bmatrix}

Extrinsic information passed from parity nodes to bit nodes:

    R_{ji} = \begin{bmatrix}
    0.16 & 0 & -0.16 & 0 & 0 & 0.224 & -0.16 \\
    0 & 0 & 0.128 & 1.264 & 0.128 & 0 & 0 \\
    -0.08 & 0.16 & 0 & 0 & 0 & -0.08 & 0 \\
    0 & -0.128 & 0 & -0.08 & -0.08 & 0 & -0.08
    \end{bmatrix}

Updated LLRs for transmitted message bits: LLRi = [0.36, −0.068, −1.852, 1.024, −1.53, 0.344, −1.86]

Hard-decision output: Ĉ = [0110101]

At the end of the first iteration, we have 3-bit errors in the outputs after making hard decisions, and we continue decoding with the min-sum algorithm. Here, we skip a few iterations and give the outputs for the fifth iteration.
Fifth Iteration

Extrinsic information passed from bit nodes to parity nodes:

    Q_{ij} = \begin{bmatrix}
    0.502 & 0 & -1.845 & 0 & 0 & 0.435 & -1.77 \\
    0 & 0 & -2.22 & 0.009 & -1.73 & 0 & 0 \\
    0.68 & 0.81 & 0 & 0 & 0 & 0.65 & 0 \\
    0 & 0.122 & 0 & 1.257 & -1.605 & 0 & -2.02
    \end{bmatrix}

Extrinsic information passed from parity nodes to bit nodes:

    R_{ji} = \begin{bmatrix}
    0.348 & 0 & -0.348 & 0 & 0 & 0.402 & -0.348 \\
    0 & 0 & 0.008 & 1.384 & 0.008 & 0 & 0 \\
    0.524 & 0.524 & 0 & 0 & 0 & 0.545 & 0 \\
    0 & 1.006 & 0 & 0.098 & -0.098 & 0 & -0.098
    \end{bmatrix}

Updated LLRs for transmitted message bits: LLRi = [1.15, 1.429, −2.16, 1.32, −1.67, 1.146, −2.06]

Hard-decision output: Ĉ = [0010101]

At the end of the fifth iteration, we have 2-bit errors in the outputs after making hard decisions, and we continue decoding with the min-sum algorithm. Again, we skip a few more iterations and give the outputs for the 11th iteration.

Eleventh Iteration

Extrinsic information passed from bit nodes to parity nodes:

    Q_{ij} = \begin{bmatrix}
    1.033 & 0 & -2.04 & 0 & 0 & 0.983 & -2.16 \\
    0 & 0 & -2.62 & 0.376 & -2.116 & 0 & 0 \\
    1.08 & 1.077 & 0 & 0 & 0 & 1.04 & 0 \\
    0 & 0.65 & 0 & 1.454 & -1.802 & 0 & -2.42
    \end{bmatrix}

Extrinsic information passed from parity nodes to bit nodes:

    R_{ji} = \begin{bmatrix}
    0.786 & 0 & -0.786 & 0 & 0 & 0.827 & -0.786 \\
    0 & 0 & -0.301 & 1.693 & -0.301 & 0 & 0 \\
    0.832 & 0.832 & 0 & 0 & 0 & 0.862 & 0 \\
    0 & 1.163 & 0 & 0.523 & -0.523 & 0 & -0.523
    \end{bmatrix}

Updated LLRs for transmitted message bits: LLRi = [1.898, 1.895, −2.907, 2.056, −2.404, 1.88, −2.929]

Hard-decision output: Ĉ = [0010101]

At the end of the 11th iteration, we still have 2-bit errors in the outputs after making hard decisions, and we have passed the maximum iteration count of 10. So, we stop decoding with the min-sum algorithm, although not all errors are corrected.

Figure 3.58: LDPC BER versus Eb/N0 curves for codeword length of 576 bits. (BER on a logarithmic axis against Eb/N0 from 0.5 to 3, with one curve each for 10, 30, and 50 iterations.)

Usually the decoder fails to correct errors if the number of errors that occurred is greater than the error correction capability of the decoder, irrespective of the number of iterations.
The error correction capability of the LDPC coder depends on the length of the codeword and the characteristics of the parity check matrix. With good parity-check matrices, the decoder gives better performance with longer codewords. In practice, the length of the LDPC codeword used is on the order of thousands of bits. Figure 3.58 shows the BER performance versus Eb/N0 curves for a codeword length of 576 bits and different iteration counts. For a given codeword length, the BER performance improves with the number of iterations. The encoder used is a rate 1/2 coder defined by the parity check matrix H288×576, which is obtained from the WiMax standard base matrices. The LDPC decoder uses the min-sum algorithm. The value for constant k is chosen as 0.8.

CHAPTER 4
Implementation of Error Correction Algorithms

In Chapter 3, we briefly discussed various error correction algorithms and their related theory with a few examples. In this chapter, we discuss efficient implementation techniques for widely used error correction algorithms. Section 4.1 covers Bose-Chaudhuri-Hocquenghem (BCH) code simulation and implementation techniques. The BCH codes are popularly used for correcting the bit errors in the header information included in data frame communications. A subset of BCH codes called Reed-Solomon (RS) codes is discussed in Section 4.2. The RS coder is widely used in cutting-edge communications systems as an outer coder. In Section 4.3, we discuss RS erasure codes, which are commonly used for further error correction in forward error correction (FEC) systems. Section 4.4 covers simulation of the Viterbi algorithm used for decoding convolutional codes. The Viterbi algorithm is a popular decoding algorithm used in many applications (apart from digital communications). Next, we discuss turbo codes in Section 4.5. The most promising at present, turbo codes operate at near channel capacity with an SNR gap of about 0.7 dB.
Finally, in Section 4.6, we discuss the oldest and newest codes, namely low-density parity-check (LDPC) codes. The LDPC codes were discovered in the 1960s, mostly forgotten for almost four decades, and then reinvented in 1999. Like turbo codes, LDPC codes also operate at near channel capacity. In this chapter, we simulate most of the algorithms that are popularly used in the industry.

4.1 BCH Codes

The BCH code framework supports a large class of powerful, random-error-correcting cyclic binary and nonbinary linear block codes. With the BCH(N, K) codes, we compute mT (= N − K) parity bits from an input block of K bits using the generator polynomial G(x), and we correct up to T bit errors in the received block of N bits. At the transmitter side, the BCH(N, K) encoder computes and appends mT parity bits to the block of K data bits, and at the receiver side the BCH(N, K) decoder corrects up to T errors by using the mT bits of parity information. We work with the Galois field GF(2^m) elements for decoding the BCH(N, K) codes. See Section 3.5 for more details on the theory and examples of the BCH(N, K) codes.

In this section, we discuss the simulation and implementation details of the BCH(N, K) binary codes. We also discuss optimization techniques to efficiently implement the BCH(N, K) coder on embedded processors. We consider the BCH(67, 53) coder as an example to discuss the implementation complexity and to derive efficient implementation techniques. The BCH(67, 53) coder is used in the DVB-H standard for correcting bit errors in the received TPS data. The BCH(67, 53) codes are a shortened form of the BCH(127, 113) systematic codes, which are decoded using the Galois field GF(2^7). The field elements of GF(2^7) are generated using the primitive polynomial P(x) = 1 + x^3 + x^7. As mT = N − K = 67 − 53 = 14 = 7 × 2, this BCH(67, 53) coder is capable of correcting up to 2 (= T) random bit errors using 14 (= mT) redundancy (or parity) bits.
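As a concrete illustration of how the GF(2^7) elements follow from P(x) = 1 + x^3 + x^7, the sketch below builds logarithm and anti-logarithm tables by repeated multiplication by α. The names gf_init, gf_mul, gf_log, and gf_alog are hypothetical stand-ins chosen for this example; they are not the companion website's tables, and the bit layout (bit k of an element holds the coefficient of α^k) is an assumption.

```c
#include <assert.h>

#define GF_ORDER 127   /* 2^7 - 1 nonzero elements in GF(2^7) */

static unsigned char gf_alog[GF_ORDER];      /* gf_alog[e] = alpha^e               */
static unsigned char gf_log[GF_ORDER + 1];   /* gf_log[alpha^e] = e; index 0 unused */

/* Generate the field elements with alpha^7 = 1 + alpha^3,
   which follows from P(alpha) = 1 + alpha^3 + alpha^7 = 0. */
void gf_init(void)
{
    unsigned int v = 1;                      /* alpha^0 */
    for (int e = 0; e < GF_ORDER; e++) {
        gf_alog[e] = (unsigned char)v;
        gf_log[v] = (unsigned char)e;
        v <<= 1;                             /* multiply by alpha */
        if (v & 0x80) v ^= 0x89;             /* 0x89 = x^7 + x^3 + 1: reduce modulo P(x) */
    }
}

/* Multiply two field elements by adding their logarithms modulo 127. */
unsigned int gf_mul(unsigned int a, unsigned int b)
{
    assert(a <= GF_ORDER && b <= GF_ORDER);
    if (a == 0 || b == 0) return 0;          /* zero has no logarithm */
    return gf_alog[(gf_log[a] + gf_log[b]) % GF_ORDER];
}
```

After gf_init(), gf_alog[7] holds 0x09 (that is, 1 + α^3), and addition of field elements is a plain XOR of their bit patterns. The explicit zero check in gf_mul is one way to handle the missing logarithm of zero; a table-based implementation may handle it differently.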
4.1.1 BCH Encoder

The BCH(N, K) encoder computes mT (= N − K) bits of parity data from K bits of input data by using a generator polynomial

    G(x) = g_0 + g_1 x + g_2 x^2 + \cdots + g_{N-K-1} x^{N-K-1} + x^{N-K}

For the BCH(N, K) codes, the generator polynomial G(x) is obtained by multiplying the T minimal polynomials φ2i−1(x) of the field elements α^(2i−1) for 1 ≤ i ≤ T as follows:

    G(x) = \phi_1(x)\, \phi_3(x) \cdots \phi_{2T-1}(x)    (4.1)

© 2010 Elsevier Inc. All rights reserved. DOI: 10.1016/B978-1-85617-678-1.00004-1

Figure 4.1: Realization of BCH(N, K) encoder. (LFSR with delay units Z, feedback taps g0, g1, g2, . . . , gN−K−1, feedback value fbv, input D(x), parity output B(x), and codeword output C(x).)

As every even power of α has the same minimal polynomial as some preceding odd power of α, G(x) is obtained by computing the LCM (least common multiple) of the minimal polynomials φi(x) for 1 ≤ i ≤ 2T, and hence G(x) has α, α^2, α^3, . . . , α^(2T) as its roots. In other words, G(α^i) = 0 for 1 ≤ i ≤ 2T.

Suppose that the input message block of K bits to be encoded is D = d0 d1 d2 · · · dK−1 and the corresponding message polynomial is D(x) = d0 + d1 x + d2 x^2 + · · · + dK−1 x^(K−1). Let B = b0 b1 b2 · · · bN−K−1 denote the computed parity data of length N − K (= mT), with polynomial representation B(x) = b0 + b1 x + b2 x^2 + · · · + bN−K−1 x^(N−K−1). This parity polynomial B(x) is the remainder when we divide D(x) · x^(N−K) by the generator polynomial G(x):

    B(x) = D(x) \cdot x^{N-K} \bmod G(x)    (4.2)

After computing the parity polynomial B(x), the encoded code polynomial C(x) is constructed as

    C(x) = D(x) \cdot x^{N-K} + B(x)
         = b_0 + b_1 x + \cdots + b_{N-K-1} x^{N-K-1} + d_0 x^{N-K} + d_1 x^{N-K+1} + \cdots + d_{K-1} x^{N-1}
         = c_0 + c_1 x + c_2 x^2 + \cdots + c_{N-1} x^{N-1}    (4.3)

Basically, we append mT bits of parity data to the input block of K bits and form a systematic codeword of N (= K + mT) bits. The encoded polynomial in vector form is represented as C = c0 c1 c2 . . . cN−1.
Equations (4.2) and (4.3) can be realized with the linear feedback shift register (LFSR) signal flow diagram shown in Figure 4.1. To compute the coefficients of the parity polynomial B(x), we input the coefficients of the data polynomial D(x) to the LFSR with the dK−1 coefficient as the first input. The values present in the delay units (Z) after passing all K coefficients of the data polynomial D(x) represent the coefficients of the parity polynomial B(x).

Next, we discuss the simulation of the BCH(N, K) encoder. We use the LFSR signal flow diagram shown in Figure 4.1 for the simulation. We initialize all delay units with zero values. We start with the data polynomial coefficient dK−1 and compute the feedback value (fbv). If the value of fbv is not zero, then we update all delay units using fbv along with the generator polynomial coefficients and the present values of the delay units. Otherwise, if the value of fbv is zero, then we update all delay units with the present values of the delay units. The simulation code for the BCH(N, K) encoder is given in Pcode 4.1.

4.1.2 BCH Decoder

At the receiver, we use the BCH(N, K) decoder to detect and correct the bit errors. The BCH decoder consists of the following steps to decode the received data block R.

1. Computation of syndromes
2. Computation of error locator polynomial
3. Computation of error positions

    for (i = 0; i < N - K; i++)
        delay_unit[i] = 0;                      // clear the LFSR
    for (i = K - 1; i >= 0; i--) {              // d[K-1] enters first
        fbv = data_in[i] ^ delay_unit[N - K - 1];
        if (fbv != 0) {
            for (j = N - K - 1; j > 0; j--)
                if (bch_gp[j] != 0)
                    delay_unit[j] = delay_unit[j - 1] ^ fbv;
                else
                    delay_unit[j] = delay_unit[j - 1];
            delay_unit[0] = bch_gp[0] & fbv;
        }
        else {                                  // plain shift when fbv is zero
            for (j = N - K - 1; j > 0; j--)
                delay_unit[j] = delay_unit[j - 1];
            delay_unit[0] = 0;
        }
        data_out[N - 1 - i] = data_in[K - 1 - i];   // systematic part
    }
    for (i = N - K - 1; i >= 0; i--)
        data_out[i] = delay_unit[i];            // parity part

Pcode 4.1: The simulation code for BCH(N, K) encoder.
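The fragment in Pcode 4.1 can be wrapped into a complete routine, sketched here for the BCH(67, 53) code with the generator polynomial given later in Equation (4.9). The function names bch_encode and bch_is_codeword, the one-bit-per-array-element storage, and the branch-free tap update are assumptions made for illustration; the second function is a hypothetical check that divides the codeword by G(x) and confirms a zero remainder.

```c
#include <assert.h>

#define BCH_N 67
#define BCH_K 53
#define BCH_P (BCH_N - BCH_K)   /* 14 parity bits */

/* g0..g14 of G(x) = 1 + x + x^2 + x^4 + x^5 + x^6 + x^8 + x^9 + x^14 */
static const unsigned char bch_gp[BCH_P + 1] =
    {1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1};

/* Systematic encoding: data_out[0..13] holds the parity bits b0..b13,
   data_out[14..66] holds the data bits d0..d52 (one bit per element). */
void bch_encode(const unsigned char data_in[BCH_K], unsigned char data_out[BCH_N])
{
    unsigned char delay_unit[BCH_P] = {0};

    for (int i = BCH_K - 1; i >= 0; i--) {       /* d[K-1] enters the LFSR first */
        unsigned char fbv;
        assert(data_in[i] <= 1);
        fbv = data_in[i] ^ delay_unit[BCH_P - 1];
        for (int j = BCH_P - 1; j > 0; j--)      /* shift, adding fbv at the taps of G(x) */
            delay_unit[j] = delay_unit[j - 1] ^ (bch_gp[j] & fbv);
        delay_unit[0] = bch_gp[0] & fbv;
        data_out[BCH_P + i] = data_in[i];
    }
    for (int i = 0; i < BCH_P; i++)
        data_out[i] = delay_unit[i];
}

/* Returns 1 when c(x) mod G(x) is zero, that is, when c is a codeword. */
int bch_is_codeword(const unsigned char c[BCH_N])
{
    unsigned char r[BCH_N];

    for (int i = 0; i < BCH_N; i++) r[i] = c[i] & 1;
    for (int i = BCH_N - 1; i >= BCH_P; i--)     /* polynomial long division over GF(2) */
        if (r[i])
            for (int j = 0; j <= BCH_P; j++)
                r[i - BCH_P + j] ^= bch_gp[j];
    for (int i = 0; i < BCH_P; i++)
        if (r[i]) return 0;
    return 1;
}
```

Masking fbv with the tap coefficient gives the same register contents as the two branches of Pcode 4.1, because a zero fbv leaves a plain shift. For a message with only d0 set, the parity bits equal g0 through g13, since x^14 mod G(x) = G(x) − x^14.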
The received data vector R or its polynomial R(x) = r0 + r1 x + r2 x^2 + · · · + rN−1 x^(N−1) consists of the transmitted code polynomial C(x) along with the added error polynomial E(x):

    R(x) = C(x) + E(x) = D(x) \cdot x^{N-K} + B(x) + E(x) = M(x) \cdot G(x) + E(x)    (4.4)

Here M(x) is the quotient obtained when D(x) · x^(N−K) is divided by G(x); because B(x) is the remainder of that division (Equation (4.2)), C(x) is a multiple of G(x).

In the BCH decoder (unlike in the BCH encoder), we have to perform Galois field arithmetic operations. See Appendix B, Section B.2, on the companion website for more details on the Galois field arithmetic operations and their computational complexity analysis. In the simulation of the BCH(67, 53) decoder, we use the Galois field GF(2^7) element look-up tables from the companion website; we use Galois_Log[ ] for performing the logarithm and Galois_aLog[ ] for performing the anti-logarithm.

Syndrome Computation

To determine the presence of errors and the error pattern, we compute 2T syndromes for the received data polynomial as follows:

    S_i = R(\alpha^i) = M(\alpha^i) \cdot G(\alpha^i) + E(\alpha^i) = 0 + E(\alpha^i) = E(\alpha^i), \qquad 1 \le i \le 2T    (4.5)

From this syndrome computation, if no errors are present in the received data vector, all computed syndrome values (Si) are zero. If one or more syndromes are non-zero, then we assume that errors are present in the received data vector. The syndromes Si = R(α^i) are computed with the LFSR signal flow diagram shown in Figure 4.2.

We simulate the signal flow diagram shown in Figure 4.2 for computing syndromes. The simulation code for computing syndromes is given in Pcode 4.2. The Galois field element value aj, the j-th power of the Galois field element α^i, is computed by taking the Galois anti-logarithm of i · j mod N, where the modulus is 2^m − 1 (the length of the unshortened code, 127 for GF(2^7)). As the received vector consists of binary coefficient values (rj), we do not really perform the Galois field multiplication rj aj in computing cj = bj + rj aj; instead, we conditionally add aj to bj.

Figure 4.2: Signal flow diagram of syndrome computation.
    for (i = 0; i < 2*T; i++) {
        Syndromes[i] = 0;
        for (j = n - 1; j >= 0; j--)
            if (r[j] != 0) {                    // add a power of alpha only for non-zero bits
                a = (i + 1)*(n-1-j) % N;        // exponent of alpha, reduced modulo N
                Syndromes[i] = Syndromes[i] ^ Galois_aLog[a];
            }
    }

Pcode 4.2: Simulation code for syndrome computation.

Error Locator Polynomial Computation

Computing an error locator polynomial is the second step in decoding the BCH codes. We use the Berlekamp-Massey recursive algorithm to compute the error locator polynomial. If the number of errors present in the received data vector is L (which is less than or equal to T), then this algorithm computes the L-th degree error locator polynomial in 2T iterations. As discussed in Section 3.5, first we initialize the error locator polynomial Λ(x) = 1 as the minimum-degree polynomial with degree L = 0. Then, we use the syndrome information to build the error locator polynomial by computing the discrepancy delta. If the value of delta is not zero, then we update the minimum-degree polynomial with the discrepancy; otherwise, we continue the loop. If the number of errors in the received polynomial R(x) is T or less, then Λ(x) produces the true error pattern. At the end of 2T iterations of the Berlekamp-Massey recursion, we have an L-th degree error locator polynomial (with Λ0 = 1) as follows:

    \Lambda(x) = \Lambda_0 + \Lambda_1 x + \Lambda_2 x^2 + \cdots + \Lambda_L x^L = (1 + X_1 x)(1 + X_2 x) \cdots (1 + X_L x)    (4.6)

The simulation code for computing the error locator polynomial is given in Pcode 4.3. Once we have the error locator polynomial of degree L, we can find the L error positions by computing the roots of the error locator polynomial.

Error Position Computation

If the number of errors L present in the received data vector is less than or equal to T (i.e., L ≤ T), then the error locator polynomial can be factored into L first-degree polynomials as in Equation (4.6), and the roots of the error locator polynomial are X1^(−1), X2^(−1), . . . , XL^(−1). As described in Section 3.5, the error positions are given by the inverses of the roots of the error locator polynomial.
So the L error positions are X1, X2, . . . , XL. The simulation code for finding the error positions is given in Pcode 4.4. Because binary BCH codes work on data bits, once we find the error positions in the received data bits, correction is achieved by simply flipping the bit values in those positions.

Error Correction

When working with binary BCH codes, we correct only bit errors present in the received data. The correction of bit errors is achieved by flipping the bit value at the error position. If the degree of the error locator polynomial (L) computed using Pcode 4.3 and the number of error positions (k) computed using Pcode 4.4 are not the same, then the BCH decoder cannot correct the errors, as the number of errors that occurred is greater than the decoder's error-correction capability. Therefore, we skip error bit correction when L ≠ k. The simulation code for correcting bit errors with the BCH decoder is given in Pcode 4.5.

    L = 0;                                      // initialization
    Elp[0] = 1; Tx[0] = 1;
    for (i = 1; i < 2*T; i++) {
        Elp[i] = 0; Tx[i] = 0;
    }
    r0 = Syndromes[0];                          // starting delta
    for (k = 0; k < 2*T; k++) {
        for (i = 0; i < T+1; i++)
            Conn_poly[i] = Elp[i];              // Conn_poly = Elp
        if (r0 != 0) {                          // Elp = Conn_poly - Delta*Tx
            r2 = Galois_Log[r0];                // log(delta)
            for (i = 0; i < T+1; i++) {
                r1 = Conn_poly[i]; r3 = Tx[i];
                r3 = Galois_Log[r3];            // log(Tx[i])
                r3 = r2 + r3;
                r3 = Galois_aLog[r3];
                r1 = r3 ^ r1;                   // Conn_poly[i] ^ Delta*Tx[i]
                Elp[i] = r1;
            }
            if (2*L < (k+1)) {
                L = k+1 - L;
                for (i = 0; i < T+1; i++) {     // Tx = Conn_poly/Delta
                    r1 = Conn_poly[i];
                    r1 = Galois_Log[r1];
                    m = r1 - r2;
                    if (m < 0) m += 127;
                    r1 = Galois_aLog[m];
                    Tx[i] = r1;
                }
            }
        }
        for (i = T+1; i > 0; i--)               // Tx = [0 Tx]
            Tx[i] = Tx[i-1];
        Tx[0] = 0;
        r0 = Syndromes[k+1];
        if (L > 0) {
            for (i = 0; i < L; i++) {           // compute delta by convolution of Syndrome poly and Elp
                r1 = log_Syndromes[k-i+1];
                r2 = Elp[i+1];
                r2 = Galois_Log[r2];
                r1 = r1 + r2;
                r2 = Galois_aLog[r1];
                r0 = r0 ^ r2;
            }
        }
    }

Pcode 4.3: Simulation code for error-locator polynomial computation.

    k = 0;
    for (i = 127; i >= 1; i--) {
        r0 = Elp[0];
        for (j = 1; j < L+1; j++) {
            r1 = i*j;                           // reduce i*j modulo 127 by folding
            r2 = r1 >> 7; r1 = r1 & 0x7f;
            r3 = log_Elp[j];
            r1 = r1 + r2;
            r2 = r1 >> 7; r1 = r1 & 0x7f;
            r1 = r1 + r2;
            r1 = r1 + r3;
            r2 = Galois_aLog[r1];
            r0 = r0 ^ r2;
        }
        if (r0 == 0) {                          // root found: record the error position
            Error_positions[k] = 127-i;
            k++;
        }
    }

Pcode 4.4: Simulation code for finding error positions.

    p = 1;
    for (i = 0; i < L; i++) {
        m = Error_positions[i];
        k = n-1-m;
        data[k] = data[k] ^ p;                  // flip the bit in error
    }

Pcode 4.5: Simulation code for bit errors correction.

4.1.3 BCH Codes: Computational Complexity

In this section, we discuss the computational complexity of the BCH encoder and decoder, and we estimate cycles from the simulations presented in Sections 4.1.1 and 4.1.2. See Appendix A, Section A.4, on the companion website for more details on the cycle requirements to execute specific operations on the reference embedded processor.

BCH Encoder

In the BCH encoder simulation (as given in Pcode 4.1), we initialize N − K delay units with zeros at the beginning, and we move the parity data from the N − K delay units to the data buffer at the end. For this, we consume about 2 ∗ (N − K) cycles. We use all bits of the message block to compute the parity for that message block. We compute the parity with K input data bits using the BCH(N, K) encoder. For K input data bits, we have to compute the feedback value, which consumes about K cycles. Depending on the feedback value, fbv, we have two paths to proceed. If fbv is zero, then we update the N − K delay unit values with the current delay unit values by consuming N − K cycles. If fbv is not zero, then depending on the generator polynomial coefficients, we conditionally update the N − K delay unit values.
To update one delay unit, we consume 3 cycles (1 cycle for checking the generator polynomial coefficient, 1 cycle for computing the value to update the delay unit, and 1 cycle for the conditional update of the delay unit), and we consume 3 ∗ (N − K) cycles to update N − K delay units. Assuming equal probability for fbv to be zero or one, on average we consume 2 ∗ (N − K) + X cycles to update the delay units for 1 bit of the message block. Here, X cycles are overhead cycles consumed for the conditional check and conditional jump depending on fbv. Therefore, we consume [2 ∗ (N − K) + X] ∗ K cycles to compute parity for a K-bit input message block. With this, we consume about 2 ∗ (N − K) + [2 ∗ (N − K) + X] ∗ K cycles to execute the BCH(N, K) encoder on the reference embedded processor.

As an example, we estimate the computational complexity of the BCH(67, 53) encoder. We consume a total of 28 (= 14 + 14) cycles to initialize the parity bits (before the main loop) and to move the computed parity bits (after the main loop) to the output buffer. We assume the jump is taken (9 cycles) when the feedback value is zero. If the feedback value is not zero, then a single iteration of the main loop consumes 42 cycles. If the feedback value is zero, then a single iteration consumes 24 cycles (including the conditional check and conditional jump). Assuming equal probability for the feedback value to be one or zero, a single iteration of the main loop requires an average of 33 cycles. The main loop runs 53 times for the BCH(67, 53) encoder. With this, implementation of the BCH(67, 53) encoder using the method given in Pcode 4.1 takes about 1777 (= 28 + 53 ∗ 33) cycles.

BCH Decoder

Syndrome Computation

Based on Pcode 4.2, in syndrome computation, we have to compute the Galois field element powers, and that involves a costly modulo operation.
In addition, look-up table access requires addition of an arbitrary offset to the base address, and we have stalls when loading values from the Galois_aLog[ ] table due to the arbitrary offsets. We estimate the cycles for syndrome computation by assuming interleaving of the program to avoid stalls and circular buffer usage to mimic the modulo operation (see Appendix A.4 on the companion website). We consume 1 cycle to get a power of the Galois field element from the Galois_aLog[ ] look-up table using circular buffer registers. We conditionally update the accumulation value for the syndrome by checking the received bit (whether zero or not) and consume 4 cycles. We consume a total of 5 cycles to update a syndrome for one received bit. We do not jump on checking the received bit, as it takes about 10 cycles for a conditional check and conditional jump. Next, to compute one syndrome, we use all N received bits and consume about 5 ∗ N cycles. With this, to compute 2 ∗ T syndromes, we consume about 2 ∗ T ∗ 5 ∗ N cycles. For computing the syndromes of the BCH(67, 53) decoder, we require about 1380 (= 2 ∗ 2 ∗ 5 ∗ 67 + overhead) cycles.

Error Locator Polynomial Computation

Based on Pcode 4.3, in the i-th iteration, we use Li − 1 Galois field additions and multiplications in convolving the syndromes with Λ(x) to compute the discrepancy Δi. We use the Galois logarithm and anti-logarithm look-up tables for the Galois field multiplication. As we know all syndromes in advance, we get the logarithm values for all syndromes before entering the loop of the Λ(x) computation. We have to get the logarithm values for the Λ(x) coefficients in every iteration, as they change from iteration to iteration. With this, we can compute the i-th iteration discrepancy Δi in 6 ∗ (Li − 1) cycles.
Depending on the current iteration discrepancy Δi, we update Λ(x) (if Δi ≠ 0) as

    \Lambda^{(i)}(x) = \Lambda^{(i-1)}(x) - x \cdot \Delta_i \cdot T^{(i-1)}(x)    (4.7)

where T^(i−1)(x) is computed in the previous iteration as

    T^{(i-1)}(x) = \begin{cases} \Lambda^{(i-2)}(x) / \Delta_{i-1} & \text{if } \Delta_{i-1} \neq 0,\ 2L_{i-1} \le i - 1 \\ x \cdot T^{(i-2)}(x) & \text{otherwise} \end{cases}    (4.8)

If Δi ≠ 0, we spend a total of 7 ∗ (T + 1) cycles for computing Λ^(i)(x) and another 7 ∗ (T + 1) cycles for computing T^(i)(x) if 2Li ≤ i. We spend an overhead of another 20 cycles for moving the data to and from buffers and for conditional checks. With this, we consume about 2T ∗ [6 ∗ (Li − 1) + 14 ∗ (T + 1) + 20] cycles for computing the error locator polynomial using the simulation code given in Pcode 4.3. Assuming Li = 2 (and T = 2), we consume about 272 (= 4 ∗ 68) cycles to compute the error locator polynomial.

Error Position Computation

The inverses of the error locator polynomial roots, {Xi, 1 ≤ i ≤ L}, give the error positions in the received data vector (if errors are present and the number of errors is less than or equal to T). We find the roots of the error locator polynomial Λ(x) by substituting every possible error position (Chien's search) in Λ(x) and checking whether the particular error position satisfies Λ(x) = 0. In finding the roots of the error locator polynomial, we need to find the powers of the Galois field elements, and we compute the powers here with the analytic method (instead of using circular buffer registers as in the syndrome computation). Here, we consume 7 cycles (which can be reduced to one cycle on an embedded processor with circular buffer registers) to find the power of the Galois field element. To find whether a particular data element is in error (that is, whether that element position satisfies Λ(x) = 0), we spend 11 ∗ L cycles. We search all the data element positions to find the roots of Λ(x). Therefore, to find the roots of the error locator polynomial Λ(x) with the analytic method (without using the circular registers of an embedded processor), we consume about N ∗ (11 ∗ L + 4) cycles.
Assuming L = T = 2, we consume about 1742 (= 67 ∗ (11 ∗ 2 + 4)) cycles to find the roots of the error locator polynomial of the BCH(67, 53) decoder.

4.1.4 BCH Coder Optimization

In this section, we discuss efficient implementation of the BCH(N, K) coder for particular values of N and K. As an example, we consider the BCH(67, 53) coder.

BCH(67, 53) Encoder

It is clear from Pcode 4.1 that the conditional update of delay units in the loop is costly and very inefficient. Because we know the generator polynomial coefficients of the BCH(67, 53) encoder in advance, we can avoid the conditional flow of the encoder by coding for this particular configuration. The LFSR flow diagram of the BCH(67, 53) encoder is shown in Figure 4.3. The BCH(67, 53) encoder computes 14 (= mT) parity bits from 53 (= K) input bits by using the following generator polynomial:

$$G(x) = 1 + x + x^2 + x^4 + x^5 + x^6 + x^8 + x^9 + x^{14} \qquad (4.9)$$

As m = 7 and T = 2 for the BCH(67, 53) codes, the generator polynomial G(x) of the BCH(67, 53) coder is obtained from Equation (4.1) as $G(x) = \phi_1(x)\phi_3(x)$, where $\phi_1(x) = 1 + x^3 + x^7$ and $\phi_3(x) = 1 + x + x^2 + x^3 + x^7$. For more details on minimal polynomials for working with other BCH(N, K) encoders with different values of m and T, see Shu Lin (1983).

[Figure 4.3: Realization of BCH(67, 53) encoder.]

In Figure 4.3, because G(x) has binary coefficients, we avoid multiplying the feedback values by the coefficients $g_i$. The feedback connections to the delay units are shown only for the non-zero coefficients of the generator polynomial given in Equation (4.9). The simulation code for the BCH(67, 53) encoder is given in Pcode 4.6. As the BCH codes contain only binary elements and we know the positions of the non-zero generator polynomial coefficients in advance, we further simplify the simulation code by working with packed delay unit elements instead of an array of individual delay unit elements.
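The factorization $G(x) = \phi_1(x)\phi_3(x)$ behind Equation (4.9) can be checked with a few lines of C. The following is a minimal sketch, not code from the book: the names `gf2_poly_mul`, `PHI1`, `PHI3`, and `bch_67_53_generator` are hypothetical, and each polynomial is assumed to be packed into an integer with bit i holding the coefficient of $x^i$.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed packing: bit i of the word is the coefficient of x^i. */
#define PHI1 0x89u   /* phi1(x) = 1 + x^3 + x^7             */
#define PHI3 0x8Fu   /* phi3(x) = 1 + x + x^2 + x^3 + x^7   */

/* Multiply two binary polynomials over GF(2) (carry-less multiply). */
static uint32_t gf2_poly_mul(uint32_t p, uint32_t q)
{
    uint32_t acc = 0;
    /* Both degrees below 16 guarantee the product fits in 32 bits. */
    assert((p >> 16) == 0 && (q >> 16) == 0);
    while (q != 0) {
        if (q & 1u)
            acc ^= p;    /* GF(2) addition of the shifted copy of p */
        p <<= 1;
        q >>= 1;
    }
    return acc;
}

/* Generator polynomial of the BCH(67, 53) code as a packed word. */
static uint32_t bch_67_53_generator(void)
{
    return gf2_poly_mul(PHI1, PHI3);
}
```

The product has bits 0, 1, 2, 4, 5, 6, 8, 9, and 14 set, which are exactly the non-zero terms of Equation (4.9); the cross terms $x^3$, $x^7$, and $x^{10}$ each appear twice and cancel.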
The simulation code for the efficient BCH(67, 53) encoder method is given in Pcode 4.7.

    for (i = 0; i < 14; i++)
        delay_unit[i] = 0;
    for (i = 52; i >= 0; i--) {
        fbv = data_in[i] ^ delay_unit[13];
        delay_unit[13] = delay_unit[12];
        delay_unit[12] = delay_unit[11];
        delay_unit[11] = delay_unit[10];
        delay_unit[10] = delay_unit[9];
        delay_unit[9] = delay_unit[8] ^ fbv;
        delay_unit[8] = delay_unit[7] ^ fbv;
        delay_unit[7] = delay_unit[6];
        delay_unit[6] = delay_unit[5] ^ fbv;
        delay_unit[5] = delay_unit[4] ^ fbv;
        delay_unit[4] = delay_unit[3] ^ fbv;
        delay_unit[3] = delay_unit[2];
        delay_unit[2] = delay_unit[1] ^ fbv;
        delay_unit[1] = delay_unit[0] ^ fbv;
        delay_unit[0] = fbv;
        data_out[i+14] = data_in[i];
    }
    for (i = 13; i >= 0; i--)
        data_out[i] = delay_unit[i];

Pcode 4.6: The simulation code for BCH(67, 53) encoder.

Next, we estimate the cycle consumption of the simulation code given in Pcode 4.7 for the BCH(67, 53) encoder. We consume about 7 cycles outside the main loop, about 6 cycles in a single iteration of the loop, and about 318 (= 53 × 6) cycles for 53 iterations. With this efficient method, we consume a total of about 325 cycles (instead of 1777 cycles using the general method) for the implementation of the BCH(67, 53) encoder.

BCH(67, 53) Decoder

Syndrome Computation

The syndrome computation block is one of the costliest blocks in the BCH(N, K) decoder. Instead of computing the powers of the Galois field elements on the fly, we use precomputed powers and avoid performing modulo operations in computing them. The simulation code for efficient syndrome computation is given in Pcode 4.8. The BchSynTbl[ ] look-up table values for computing the syndromes of the BCH(67, 53) decoder can be found on the companion website.
    delay_units = 0;      // 14 MSB bits represent bit values in delay units
    gpc = 0x0ddc0000;     // generator polynomial coefficient positions
    for (i = 52; i >= 0; i--) {
        fbv = delay_units >> 31;
        temp = 0;
        if (fbv != data[i]) temp = gpc;   // data[] consists of unpacked bits of data_in[]
        delay_units = delay_units << 1;
        delay_units = delay_units ^ temp;
    }
    data_out[0] = data_in[0];             // first 32 bits of data
    temp = delay_units >> 21;             // 11 MSB bits of parity
    data_out[1] = data_in[1] | temp;      // remaining 21 data bits | 11 parity bits
    temp = delay_units << 11;             // 3 LSB bits of parity
    data_out[2] = temp;                   // remaining 3 bits of parity, total data_out is 67 bits

Pcode 4.7: Efficient implementation of BCH(67, 53) encoder.

    for (i = 0; i < 4; i++) {
        Syndromes[i] = 0;
        for (j = n-1; j >= 0; j--) {
            temp = BchSynTbl[67*i+n-1-j];
            temp = temp ^ Syndromes[i];
            if (data[j] != 0) Syndromes[i] = temp;
        }
    }

Pcode 4.8: Efficient implementation of syndrome computation.

In Pcode 4.8, we consume 4 cycles to conditionally update the accumulation of the powers of the Galois field elements. With N = 67 and K = 53 for the BCH decoder, we consume about 1080 (= 4 ∗ 67 ∗ 4 + overhead) cycles to compute 2T syndromes. In this method, we do not use any circular buffer registers to access the look-up table BchSynTbl[ ].

Error Correction

As we know that the BCH(67, 53) decoder can correct up to two errors, we take a few shortcuts in computing the error locator polynomial. Most of the time, errors may not be present in the received data. We can detect the absence of errors by checking the values of the syndromes. If all syndromes are zero, then no errors are present in the received data and we stop the BCH decoder from further decoding. We handle the one-error and two-error cases separately. The correction of single errors does not require computing an error locator polynomial or finding its roots. This avoids 50% of the computations of the BCH decoder.
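The bit-array encoder of Pcode 4.6 and the packed-register encoder of Pcode 4.7 should leave the same 14 parity bits in their delay units. The sketch below is an illustrative cross-check rather than the book's code: the function names, the tap table, the LCG used to make test messages, and the convention of returning the parity with delay unit i in bit i are all assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

#define BCH_K 53   /* message bits */
#define BCH_P 14   /* parity bits  */

/* Bit-array LFSR in the style of Pcode 4.6; data_in[] holds one bit per entry. */
static uint32_t bch_parity_array(const uint8_t data_in[BCH_K])
{
    /* g0..g13 of Equation (4.9): taps at 0,1,2,4,5,6,8,9 */
    static const uint8_t tap[BCH_P] = {1,1,1,0,1,1,1,0,1,1,0,0,0,0};
    uint8_t d[BCH_P] = {0};
    uint32_t parity = 0;
    for (int i = BCH_K - 1; i >= 0; i--) {
        assert(data_in[i] <= 1);
        uint8_t fbv = (uint8_t)(data_in[i] ^ d[BCH_P - 1]);
        for (int j = BCH_P - 1; j > 0; j--)
            d[j] = (uint8_t)(d[j - 1] ^ (tap[j] ? fbv : 0));
        d[0] = fbv;
    }
    for (int j = 0; j < BCH_P; j++)
        parity |= (uint32_t)d[j] << j;
    return parity;
}

/* Packed LFSR in the style of Pcode 4.7; delay unit 13 lives in bit 31. */
static uint32_t bch_parity_packed(const uint8_t data[BCH_K])
{
    const uint32_t gpc = 0x0ddc0000u;   /* tap positions, delay unit 0 at bit 18 */
    uint32_t delay_units = 0;
    for (int i = BCH_K - 1; i >= 0; i--) {
        uint32_t fbv = delay_units >> 31;
        uint32_t temp = (fbv != data[i]) ? gpc : 0u;
        delay_units = (delay_units << 1) ^ temp;
    }
    return delay_units >> 18;           /* move delay unit 0 down to bit 0 */
}

/* Compare both encoders on a pseudo-random message (LCG is for illustration only). */
static int bch_encoders_agree(uint32_t seed)
{
    uint8_t bits[BCH_K];
    for (int i = 0; i < BCH_K; i++) {
        seed = seed * 1664525u + 1013904223u;
        bits[i] = (uint8_t)((seed >> 24) & 1u);
    }
    return bch_parity_array(bits) == bch_parity_packed(bits);
}
```

A message whose only set bit is the last one shifted in leaves the taps themselves in the register, so both functions return 0x377 in that case, which is the low 14 bits of G(x).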
Single-Bit Error Correction

After computing the 2T syndromes $\{S_1, S_2, S_3, S_4\}$ with the BCH(67, 53) decoder, if $S_1 \neq 0$ and $S_3 = S_1^3$, then a single error is present in the received data vector. The error position is given by $S_1^{-1}$. The simulation code for correcting single-bit errors after computing syndromes is given in Pcode 4.9.

    r0 = Syndromes[0];
    r0 = Galois_Log[r0];
    m = n-1 - r0; p = 1;          // n = 67
    rec_msg[m] = rec_msg[m] ^ p;

Pcode 4.9: Simulation code for correcting single bit errors.

Double-Bit Error Correction

After computing syndromes, if $S_1 \neq 0$ and $S_3 \neq S_1^3$, then two bit errors are present in the received data vector. If two bit errors are present, then the maximum degree of the error locator polynomial is two, and the coefficients $\Lambda_1$ and $\Lambda_2$ of the error locator polynomial $\Lambda(x) = 1 + \Lambda_1 x + \Lambda_2 x^2$ are given by

$$\Lambda_1 = S_1, \qquad \Lambda_2 = \frac{S_3 + S_1^3}{S_1} \qquad (4.10)$$

Once we know the error locator polynomial coefficients, we find the error positions by using Chien's search algorithm. As discussed in Section 4.1.2, Error Correction, and Section 4.1.3, BCH Codes: Computational Complexity, the computation of roots for $\Lambda(x)$ using Chien's search is a costly process, and we consume 1742 cycles to compute the roots of the second-degree error locator polynomial. Instead, we rearrange the second-degree polynomial as follows to reduce the number of computations in Chien's search algorithm. If z is a root of $\Lambda(x)$, then

$$\Lambda(z) = \Lambda_2 z^2 + \Lambda_1 z + 1 = 0 \;\Rightarrow\; z^2 + \frac{\Lambda_1}{\Lambda_2} z + \frac{1}{\Lambda_2} = 0 \;\Rightarrow\; z\left(z + \frac{\Lambda_1}{\Lambda_2}\right) + \frac{1}{\Lambda_2} = 0 \;\Rightarrow\; z(z + a) + b = 0 \qquad (4.11)$$

where $a = \Lambda_1/\Lambda_2$ and $b = 1/\Lambda_2$.

We precompute a and b from $\Lambda_1$ and $\Lambda_2$ before starting Chien's search algorithm. With this rearrangement, we compute the Galois field element substitution value with two additions and one multiplication. The simulation code for the efficient error locator polynomial computation, error locator polynomial root finding, and error correction is given in Pcode 4.10.
Next, we estimate the cycles for the error locator polynomial and error position computation. We consume approximately 30 cycles to compute the error locator polynomial $\Lambda(x) = x(x + a) + b$ (here most look-up table access operations consume 4 cycles, as we do not have much scope to interleave the program code). In error position computation, we have scope to interleave the program by computing more than one substitution value per iteration of the loop. Therefore, we consume 8 cycles (6 cycles for computing the substitution value and 2 cycles for checking and continuing the loop) to know whether the bit at the i-th position is in error. We consume a total of 536 (= 67 ∗ 8) cycles for finding the error positions.

    // Compute error locator polynomial: Lambda(x) = x(x+a) + b
    r0 = Syndromes[0]; r1 = Syndromes[2];
    r0 = Galois_Log[r0];
    k = r0 * 3; m = r0 * 2;
    if (k >= 127) k -= 127;
    if (m >= 127) m -= 127;
    if (k >= 127) k -= 127;
    r2 = Galois_aLog[k];
    r2 = r2 ^ r1;
    r2 = Galois_Log[r2];
    k = m - r2; m = r0 - r2;
    if (k < 0) k += 127;
    if (m < 0) m += 127;
    r1 = Galois_aLog[k]; r2 = Galois_aLog[m];   // a, b
    for (i = 127; i >= 60; i--) {               // roots finding
        r0 = Galois_aLog[i];                    // z
        r0 = r0 ^ r1;                           // z + a
        r0 = Galois_Log[r0];
        r0 = r0 + i;
        r0 = Galois_aLog[r0];                   // z(z + a)
        r0 = r0 ^ r2;                           // z(z + a) + b
        if (r0 == 0)
            data[i-127+n-1] ^= 1;               // bit error correction (flip the bit), n = 67
    }

Pcode 4.10: Simulation code for efficient BCH(67, 53) decoder error correction.

With the techniques suggested above, we consume about 1646 (= 1080 + 30 + 536) cycles to correct two bit errors using the BCH(67, 53) decoder. Without this algorithm-level optimization, we consume (as estimated in Section 4.1.3, BCH Codes: Computational Complexity) about 3394 (= 1380 + 272 + 1742) cycles to correct two bit errors using the BCH(67, 53) decoder. The cycle saving with the optimized BCH(67, 53) decoder is about 51%.
The cycle cost may vary if we implement the BCH decoder on a particular embedded processor by taking advantage of its architectural and instruction set features.

BCH Decoder: Further Optimization for T = 2

In the case of T = 2, we know that the BCH decoder can correct up to two errors. Based on Equation (4.10), to compute the error locator polynomial $\Lambda(x)$, we use only the two syndromes $S_1$ and $S_3$ out of the 2T (= 4) computed syndromes. The value of $S_1$ alone dictates whether errors are present (if $S_1 \neq 0$) or not (if $S_1 = 0$). In addition, if errors are present, then the number of errors (one or two) is also decided by using the relation between $S_1$ and $S_3$. Because syndrome computation is very costly in terms of cycles, we do not have to compute $S_2$ and $S_4$ to correct up to two errors (when T = 2) with the binary BCH decoder, because they can be calculated from $S_1$ and $S_3$ (for binary codes, $S_2 = S_1^2$ and $S_4 = S_2^2$). This saves 50% of the syndrome computation cycles.

The other costly routine is Chien's search method, used to find error positions when two or more errors are present in the received data. As we can correct up to two errors for T = 2, the resultant second-degree error locator polynomial consists of two parameters, as given in Equation (4.11). We can find the two roots of the second-degree polynomial using a precomputed look-up table if we have a second-degree polynomial with only one parameter. In this case, we do not have to use Chien's search and therefore we save a lot of cycles. In the look-up table method, discussed by Okano and Imai (1987), we precompute the two roots (if they exist) of the second-degree single-parameter polynomial for all possible values of the parameter. We convert the second-degree polynomial with two parameters (a and b) given in Equation (4.11) to a second-degree polynomial with one parameter by substituting $z = a \cdot y$ as follows:

$$\Lambda(z) = z(z + a) + b = 0,\quad \Lambda(ay) = ay(ay + a) + b = 0 \;\Rightarrow\; a^2 y^2 + a^2 y + b = 0 \;\Rightarrow\; y^2 + y + c = 0 \qquad (4.12)$$

where $c = b/a^2$.

If m = 7, then $c \in GF(2^7)$ and we have 128 possible values for c. Next, we precompute all existing roots of the polynomial in Equation (4.12) for all 128 possible values of c. The roots $y_1$ and $y_2$ of the polynomial in Equation (4.12) are obtained by using c as the index (or offset) into the precomputed look-up table elp_roots[ ], which holds the roots of the single-parameter second-degree polynomials over $GF(2^7)$ and can be found on the companion website. If roots do not exist for a particular value of c, then the table is filled with zeros at that offset. The actual roots $z_1$ and $z_2$ of Equation (4.11) are obtained by back substitution as $z_1 = a \cdot y_1$ and $z_2 = a \cdot y_2$. The simulation code for efficient implementation of the BCH(67, 53) decoder without Chien's search method is given in Pcode 4.11. If we get both roots $z_1$ and $z_2$ as zeros, then there are no roots for Equation (4.11); this indicates that more than two errors occurred, and we have to exit from the decoder without correcting any bit errors.

An example of the previously described method of finding roots of the second-degree polynomial follows. Let $a, b \in GF(2^7)$, $a = \alpha^{122}$, and $b = \alpha^{77}$. The roots of the polynomial given in Equation (4.11), computed using the Chien search method given in Pcode 4.10, are $z_1 = \alpha^0$ and $z_2 = \alpha^{77}$. Next, $c = b/a^2 = \alpha^{87}$. Given that the look-up table values start from c = 0 (whose logarithm is not defined), if we access the look-up table with the logarithm of c, then we should use the offset 88 (= 87 + 1) to get the correct roots. The roots of Equation (4.12) are obtained from the look-up table elp_roots[ ] as $y_1 = \alpha^5$ and $y_2 = \alpha^{82}$. Then, the roots of Equation (4.11) are computed as follows:

$$z_1 = a \cdot y_1 = \alpha^{122} \cdot \alpha^{5} = \alpha^{127} = \alpha^{0}$$
$$z_2 = a \cdot y_2 = \alpha^{122} \cdot \alpha^{82} = \alpha^{204} = \alpha^{77}$$

Next, we estimate the cycle cost of this efficient method.
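The worked example above can be reproduced numerically. The sketch below is an assumption-laden illustration, not the book's code: it builds $GF(2^7)$ from $\phi_1(x) = 1 + x^3 + x^7$ (taken here as the field polynomial because it is the minimal polynomial of $\alpha$), and it finds roots by exhaustive substitution instead of the elp_roots[ ] table, whose contents are only available on the companion website. All identifiers (`gf7_init`, `gf7_quadratic_roots`, `gf7_c_log`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define GF7_ORDER 127     /* number of non-zero elements of GF(2^7)      */
#define GF7_POLY  0x89u   /* x^7 + x^3 + 1 (assumed field polynomial)    */

static uint8_t gf7_alog[2 * GF7_ORDER];  /* doubled: sums of two logs need no modulo */
static uint8_t gf7_log[GF7_ORDER + 1];
static int gf7_ready = 0;

static void gf7_init(void)
{
    unsigned v = 1;
    for (int i = 0; i < GF7_ORDER; i++) {
        gf7_alog[i] = gf7_alog[i + GF7_ORDER] = (uint8_t)v;
        gf7_log[v] = (uint8_t)i;
        v <<= 1;                       /* multiply by alpha        */
        if (v & 0x80u) v ^= GF7_POLY;  /* reduce modulo the field polynomial */
    }
    gf7_ready = 1;
}

/* Roots of z(z + a) + b = 0 by exhaustive substitution.
   Inputs and outputs are exponents of alpha; returns the number of roots. */
static int gf7_quadratic_roots(int a_log, int b_log, int root_log[2])
{
    int count = 0;
    if (!gf7_ready) gf7_init();
    assert(a_log >= 0 && a_log < GF7_ORDER && b_log >= 0 && b_log < GF7_ORDER);
    uint8_t a = gf7_alog[a_log], b = gf7_alog[b_log];
    for (int i = 0; i < GF7_ORDER; i++) {
        uint8_t z = gf7_alog[i];
        uint8_t s = (uint8_t)(z ^ a);                       /* z + a    */
        uint8_t prod = s ? gf7_alog[gf7_log[s] + i] : 0;    /* z(z + a) */
        if ((uint8_t)(prod ^ b) == 0 && count < 2)
            root_log[count++] = i;
    }
    return count;
}

/* Exponent of c = b / a^2, reduced to the range [0, 126]. */
static int gf7_c_log(int a_log, int b_log)
{
    int m = (b_log - 2 * a_log) % GF7_ORDER;
    return m < 0 ? m + GF7_ORDER : m;
}
```

Calling `gf7_quadratic_roots(122, 77, r)` yields the exponents 0 and 77, and solving the normalized equation $y^2 + y + c = 0$ with `gf7_quadratic_roots(0, gf7_c_log(122, 77), r)` yields 5 and 82, matching the example. A table such as elp_roots[ ] could be generated offline by running the second call for every value of c.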
As we need only two syndromes, we consume 536 (= 2 ∗ 67 ∗ 4) cycles to compute the syndromes $S_1$ and $S_3$ using Pcode 4.8. Then we consume approximately 30 to 70 cycles to find the error positions and to correct one-bit and two-bit errors with the simulation code given in Pcode 4.11. With this, the BCH(67, 53) decoder can be implemented on the reference embedded processor within 600 cycles to correct up to two bit errors in the received 67 data bits. The suggested techniques for the BCH decoding simulation are also valid for other values of N and K as long as T = 2.

    r4 = Syndromes[0]; r0 = Galois_Log[r4];
    r1 = Syndromes[2];
    k = r0 * 3; m = r0 * 2;
    if (k >= 127) k -= 127;
    if (m >= 127) m -= 127;
    if (k >= 127) k -= 127;
    r2 = Galois_aLog[k];
    if (r4 != 0) {
        if (r2 == r1)
            data[n-1-r0] ^= 1;                        // single error correction
        else {
            r2 = r2 ^ r1;
            r2 = Galois_Log[r2];
            k = m - r2; m = r0 - r2;
            if (k < 0) k += 127;
            if (m < 0) m += 127;                      // a, b
            m = m - 2*k;
            if (m < 0) m += 127;
            if (m < 0) m += 127;
            r0 = elp_roots[2*(m+1)];
            r1 = elp_roots[2*(m+1)+1];
            if ((r0 != 0) && (r1 != 0)) {             // double error correction
                r0 = r0 + k; r1 = r1 + k;
                if (r0 > 127) r0 -= 127;
                if (r1 > 127) r1 -= 127;
                data[r0-127+n-1] ^= 1;
                data[r1-127+n-1] ^= 1;
            }
        }
    }

Pcode 4.11: Simulation code for efficient BCH decoding (for T = 2).

4.2 Reed-Solomon Error-Correction Codes

RS codes are widely used in digital communications and in digital storage and retrieval systems for forward error correction (FEC). See Section 3.6 for more information on the theory and an example of RS codes. In this section, we discuss the simulation of RS(N, K) block codes. In particular, we discuss the simulation techniques for the RS(204, 188) coder, which is used in the DVB-H standard for FEC. We also discuss the computational complexity of the RS coder in terms of the cycles and memory needed to implement it on the reference embedded processor.
4.2.1 RS(N, K) Encoder

Using the RS(N, K) encoder, we compute the parity data B(x) of length N − K from the input message D(x) of length K by using the generator polynomial G(x). The encoded message M(x) is obtained as

$$M(x) = D(x) \cdot x^{N-K} + B(x) \qquad (4.13)$$

The following generator polynomial is used in the RS(N, K) encoder (see Section 3.6) to compute the parity data:

$$G(x) = (x + \alpha^0)(x + \alpha^1)(x + \alpha^2) \cdots (x + \alpha^{2T-1}) = g_0 + g_1 x + g_2 x^2 + \cdots + g_{2T-1} x^{2T-1} + x^{2T} \qquad (4.14)$$

where 2T = N − K. Here, the polynomial G(x) is computed by multiplying 2T first-degree polynomials $(x + \alpha^i)$, where $0 \le i < 2T$. The parity data B(x) is computed as

$$B(x) = D(x) \cdot x^{N-K} \bmod G(x) \qquad (4.15)$$

Equations (4.13) and (4.15) can be realized with a feedback system as shown in Figure 4.4.

[Figure 4.4: Realization of RS(N, K) encoder.]

[Figure 4.5: Syndrome computation signal flow diagram.]

4.2.2 RS(N, K) Decoder

As discussed, the RS(N, K) decoder takes data blocks of N elements as input and outputs K elements as a decoded data block. If errors are present in the received data and their number is less than or equal to (N − K)/2, then the RS decoder corrects the errors and outputs a corrected data block. The error correction with the RS decoder is achieved with the following four steps:

1. Syndrome computation
2. Error locator polynomial computation
3. Error locator polynomial roots computation
4. Error magnitude polynomial computation

Syndrome Computation

In the RS decoder, the first step of decoding is syndrome computation. Syndromes, which indicate the presence of errors, are computed using the received data polynomial R(x). If all the syndromes are zero, then there are no errors in the received data. We compute 2T syndromes in the syndrome computation step.
The i-th syndrome is computed (see Section 3.6) as follows:

$$S_i = R(\alpha^i) = \sum_{k=0}^{N-1} r_k (\alpha^i)^k \qquad (4.16)$$

where the addition is modulo-2 addition, performed using ⊕ instead of +. Equation (4.16) can be realized with a feedback system as shown in Figure 4.5.

Computation of Error Locator Polynomial

The error locator polynomial is computed using the Berlekamp-Massey (BM) recursive algorithm, as shown in Figure 4.6. We iterate the BM algorithm 2T times to get an error locator polynomial $\Lambda(x)$ of degree v, which is less than or equal to T. If $v \le T$, then the roots of the error locator polynomial $\Lambda(x)$ give the correct error positions in the received data vector. The error locator polynomial $\Lambda(x)$ of degree v is represented as

$$\Lambda(x) = 1 + \Lambda_1 x + \Lambda_2 x^2 + \cdots + \Lambda_{v-1} x^{v-1} \qquad (4.17)$$

Computation of Error Locator Polynomial Roots

We compute the roots of the error locator polynomial (ELP), $\Lambda(x)$, with a brute-force method (also called Chien's search) by checking all of the field elements to see whether any of them satisfies Equation (4.17). The following equation (with $\Lambda_0 = 1$) gives the error location as i whenever $P_i$ becomes zero:

$$P_i = \Lambda(\alpha^i) = \sum_{j=0}^{v} \Lambda_j (\alpha^i)^j \qquad (4.18)$$

where $0 \le i < N$. Equation (4.18) can be realized with the signal flow diagram shown in Figure 4.7.

[Figure 4.6: Flow chart diagram of Berlekamp-Massey algorithm.]

[Figure 4.7: Chien's brute-force search method for finding error locations.]
Computation of Error Magnitude Polynomial

The error magnitude polynomial $\Omega(x) = 1 + \omega_1 x + \omega_2 x^2 + \cdots + \omega_{2T} x^{2T}$ is computed as

$$\Omega(x) = \Lambda(x)\,[1 + S(x)] \bmod x^{2T+1}, \quad \text{where } S(x) = \sum_{j=1}^{2T} S_j x^j \text{ and } \Lambda(x) = \sum_{i=0}^{v} \Lambda_i x^i \text{ with } \Lambda_0 = 1 \qquad (4.19)$$

Error Correction

If $\Lambda'(x)$ represents the derivative of the error locator polynomial $\Lambda(x)$ and $i_k$ represents the error positions, then the error magnitudes $Y_j = e_{i_k}$, where $i_k \in [0, N)$, are computed using the error position information $X_j = (\alpha^j)^{i_k}$, $0 \le j < v$, $i_k \in [0, N)$, as follows:

$$Y_j = -\frac{X_j\,\Omega(X_j^{-1})}{\Lambda'(X_j^{-1})} \qquad (4.20)$$

Once we know the error positions $i_k$ and the error magnitudes $e_{i_k}$, we can obtain the error polynomial as

$$E(x) = e_{i_1} x^{i_1} + e_{i_2} x^{i_2} + \cdots + e_{i_v} x^{i_v} \qquad (4.21)$$

The corrected data polynomial $\hat{M}(x)$ is obtained from the received data vector R(x) as follows:

$$\hat{M}(x) = R(x) + E(x) \qquad (4.22)$$

4.2.3 RS(204, 188) Coder

The RS(204, 188) coder, used in the DVB-H standard (see Section 17.4), is derived from the RS(255, 239) coder, whose field elements belong to $GF(2^8)$. The Galois field elements for the RS(255, 239) coder are generated using the primitive polynomial $p(x) = x^8 + x^4 + x^3 + x^2 + 1$. The generator polynomial used to compute the parity data is obtained as $G(x) = (x + \alpha^0)(x + \alpha^1) \cdots (x + \alpha^{15})$. The RS(204, 188) coder (a shortened version of RS(255, 239)) uses the Galois field $GF(2^8)$. See Appendix B, Section B.2, on the companion website for more information on the Galois field $GF(2^n)$ arithmetic operations and the respective simulation techniques.

RS(204, 188) Coder Data Representation

The polynomial and the corresponding vector representations of the RS(204, 188) coder inputs, outputs, parity data, generator polynomial, and error polynomial follow.

Generator polynomial (17 coefficients):
$G(x) = x^{16} + g_{15} x^{15} + \cdots + g_2 x^2 + g_1 x + g_0$
$G = [1, g_{15}, \ldots, g_2, g_1, g_0]$

Encoder input (188-coefficient polynomial):
$D(x) = d_{187} x^{187} + d_{186} x^{186} + \cdots + d_2 x^2 + d_1 x + d_0$
$D = [d_{187}, d_{186}, \ldots, d_2, d_1, d_0]$

Parity data (16-coefficient polynomial):
$B(x) = b_{15} x^{15} + b_{14} x^{14} + \cdots + b_2 x^2 + b_1 x + b_0$
$B = [b_{15}, b_{14}, \ldots, b_2, b_1, b_0]$

Encoder output (204 coefficients):
$M(x) = m_{203} x^{203} + m_{202} x^{202} + \cdots + m_2 x^2 + m_1 x + m_0$
$M = [m_{203}, m_{202}, \ldots, m_2, m_1, m_0]$
$M = D \,|\, B$

Error polynomial (maximum T coefficients) with v errors:
$E(x) = e_{i_1} x^{i_1} + e_{i_2} x^{i_2} + \cdots + e_{i_v} x^{i_v}$
$E = [0, 0, \ldots, e_{i_1}, 0, 0, 0, \ldots, e_{i_2}, 0, 0, 0, \ldots, 0, 0, 0, \ldots, 0, e_{i_v}, 0, 0, 0, \ldots, 0]$

Decoder input (204-coefficient polynomial):
$R(x) = r_{203} x^{203} + r_{202} x^{202} + \cdots + r_2 x^2 + r_1 x + r_0$
$R = [r_{203}, r_{202}, \ldots, r_2, r_1, r_0]$
$R = M + E$

Decoder output (188 coefficients):
$D'(x) = d'_{187} x^{187} + d'_{186} x^{186} + \cdots + d'_2 x^2 + d'_1 x + d'_0$
$D' = [d'_{187}, d'_{186}, \ldots, d'_2, d'_1, d'_0]$

In RS decoding, if $v \le T$, then $D' = D$. In other words, if the number of errors v present in the received data vector R is less than or equal to T, then we can correct the v errors using the RS decoder, and the decoded output $D'$ and the actual transmitted data D will be the same; otherwise, $D'$ and D will be different.

RS(204, 188) Coder Generator Polynomial

The coefficients of the RS(204, 188) coder generator polynomial $G(x) = (x + \alpha^0)(x + \alpha^1) \cdots (x + \alpha^{15})$ are obtained with the simulation code given in Pcode 4.12. We compute G(x) from first-degree polynomials iteratively in 2T iterations. As we do not compute generator polynomials at runtime, we will not discuss the computational complexity and optimization of this step. The simulation results of Pcode 4.12 (i.e., the coefficients of the polynomial G(x) of the RS(204, 188) coder) are provided in Section 4.2.6. In later sections, we discuss the simulation of the RS(204, 188) encoder and RS(204, 188) decoder modules.
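The iterative construction of G(x) described above can also be sketched as a self-contained routine that builds its own $GF(2^8)$ tables from $p(x) = x^8 + x^4 + x^3 + x^2 + 1$ and then checks that $\alpha^0, \ldots, \alpha^{15}$ are roots. This is an illustrative variant, not the book's listing: the names `gf8_init`, `gf8_mul`, `rs_generator`, and `rs_generator_eval`, and the doubled anti-logarithm table, are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

#define GF8_POLY 0x11Du   /* p(x) = x^8 + x^4 + x^3 + x^2 + 1 */
#define RS_2T    16       /* N - K for RS(204, 188)            */

static uint8_t gf8_alog[512];   /* doubled so log sums can index directly */
static uint8_t gf8_log[256];

static void gf8_init(void)
{
    unsigned v = 1;
    for (int i = 0; i < 255; i++) {
        gf8_alog[i] = gf8_alog[i + 255] = (uint8_t)v;
        gf8_log[v] = (uint8_t)i;
        v <<= 1;
        if (v & 0x100u) v ^= GF8_POLY;
    }
}

static uint8_t gf8_mul(uint8_t x, uint8_t y)
{
    if (x == 0 || y == 0) return 0;            /* log of zero is undefined */
    return gf8_alog[gf8_log[x] + gf8_log[y]];
}

/* gx[j] receives the coefficient of x^j of G(x) = prod_{i=0}^{15} (x + alpha^i). */
static void rs_generator(uint8_t gx[RS_2T + 1])
{
    gf8_init();
    gx[0] = 1;                                  /* start from G(x) = 1 */
    for (int i = 0; i < RS_2T; i++) {           /* multiply by (x + alpha^i) */
        uint8_t root = gf8_alog[i];
        gx[i + 1] = gx[i];                      /* new leading coefficient */
        for (int j = i; j > 0; j--)
            gx[j] = (uint8_t)(gx[j - 1] ^ gf8_mul(gx[j], root));
        gx[0] = gf8_mul(gx[0], root);
    }
    assert(gx[RS_2T] == 1);                     /* G(x) is monic */
}

/* Evaluate G at x with Horner's rule. */
static uint8_t rs_generator_eval(const uint8_t gx[RS_2T + 1], uint8_t x)
{
    uint8_t acc = 0;
    for (int j = RS_2T; j >= 0; j--)
        acc = (uint8_t)(gf8_mul(acc, x) ^ gx[j]);
    return acc;
}

/* Returns 1 if alpha^0..alpha^15 are roots of G and alpha^16 is not. */
static int rs_generator_has_expected_roots(void)
{
    uint8_t gx[RS_2T + 1];
    rs_generator(gx);
    for (int i = 0; i < RS_2T; i++)
        if (rs_generator_eval(gx, gf8_alog[i]) != 0) return 0;
    return rs_generator_eval(gx, gf8_alog[RS_2T]) != 0;
}
```

The constant term is the product of all sixteen roots, $\alpha^{0+1+\cdots+15} = \alpha^{120}$, which gives a quick sanity check on the result without needing the full coefficient list.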
    Gx[0] = 1; Gx[1] = 1;                     // [1 1] = (x+alpha^0), initialization
    for (i = 2; i <= 2*T; i++) {              // multiplying with (x+alpha^i)
        Gx[i] = 1;                            // coefficient x^i = 1
        for (j = i-1; j > 0; j--)             // coefficients from x^(i-1) to x^1
            if (Gx[j] != 0) {
                r0 = Gx[j-1]; r1 = Gx[j];
                r1 = Galois_Log[r1];
                r1 = r1 + i-1;
                r2 = r1 >> 8; r1 = r1 & 0xff;
                r1 = r2 + r1;                 // mod 255
                r2 = Galois_aLog[r1];
                r0 = r0 ^ r2;
                Gx[j] = r0;
            }
            else Gx[j] = Gx[j-1];
        r1 = Gx[0];                           // coefficient x^0
        r1 = Galois_Log[r1];
        r1 = r1 + i-1;
        r2 = r1 >> 8; r1 = r1 & 0xff;
        r1 = r2 + r1;                         // mod 255
        r0 = Galois_aLog[r1];
        Gx[0] = r0;
    }

Pcode 4.12: Simulation code for computing a generator polynomial.

4.2.4 RS(204,188) Encoder Simulation

We simulate the RS(204, 188) encoder using the signal flow diagram shown in Figure 4.4. We generate 16 (= 2T = N − K = 204 − 188) parity data elements with the RS(204, 188) encoder from an input message of 188 data elements. The simulation code for computing the parity data vector B from the input message vector D is given in Pcode 4.13. We obtain the encoded message M by appending the parity bytes to the data bytes as M = data bytes | parity bytes. The computation of the parity data vector involves multiplication and addition of Galois field elements. We obtain the parity data vector from the shift registers of the feedback loop by passing all data elements of the message vector, one at a time, through the feedback loop.

The complexity of the RS(204, 188) encoder is estimated as follows (see Appendix A.4 on the companion website for cycle estimation on the reference embedded processor). To update the feedback loop with one message data element, we spend 6 + 2 ∗ T ∗ 9 cycles by interleaving the program code. Thus, we consume K ∗ (2 ∗ T ∗ 9 + 6) cycles for updating the feedback loop with K input message elements.

4.2.5 RS(204,188) Decoder Simulation

With the RS(204,188) decoder, we process a data block of 204 elements at a time.
    for (i = K-1; i >= 0; i--) {
        r0 = Dx[K-1-i]; r1 = Bx[2*T-1];
        r0 = r0 ^ r1;                          // addition of Galois field elements
        r7 = Galois_Log[r0];                   // feedback
        if (r7 != log0) {
            for (j = 2*T-1; j > 0; j--)
                if (log_Gx[j] != log0) {
                    r1 = log_Gx[j]; r0 = Bx[j-1];
                    r1 = r1 + r7;              // multiplication of Galois field elements
                    r2 = r1 >> 8; r1 = r1 & 0xff;
                    r1 = r1 + r2;              // modulo 255
                    r2 = Galois_aLog[r1];
                    r2 = r2 ^ r0;
                    Bx[j] = r2;
                }
                else Bx[j] = Bx[j-1];
            r1 = log_Gx[0]; r1 = r1 + r7;
            r2 = r1 >> 8; r1 = r1 & 0xff;
            r1 = r1 + r2;                      // modulo 255
            r2 = Galois_aLog[r1];
            Bx[0] = r2;
        }
        else {
            for (j = 2*T-1; j > 0; j--)
                Bx[j] = Bx[j-1];
            Bx[0] = 0;
        }
    }
    for (i = 0; i < 2*T; i++)                  // multiply input msg with x^(N-K) and add parity data
        Dx[K+i] = Bx[2*T-1-i];

Pcode 4.13: Simulation code for RS(204, 188) encoder.

In the receiver, before reaching the RS decoder, the data has been processed by other physical layer modules such as demodulation, equalization, and so on (see Section 17.4). We assume that the proper data block (i.e., a block corresponding to the encoder output block) with 204 elements is available to the RS decoder as an input after data symbol synchronization. Due to channel impairments, the received data vector R may not be the same as the transmitted data vector M (see Figure 3.16). Some of the byte elements in the received vector R may be in error, and we can correct all the erroneous data bytes using the RS decoder if the number of errors is less than or equal to T, where T = (N − K)/2 = 8. As discussed in Section 4.2.2, the RS decoder consists of four steps as follows:

1. Syndrome computation
2. Error locator polynomial computation
3. Finding roots for error locator polynomial
4. Error magnitude polynomial computation

Simulation of these four steps follows.

Syndrome Computation

Computation of one syndrome (see Figure 4.5) involves computation of N (= 204) Galois field element powers ($(\alpha^i)^k$), N multiplications ($r_k \alpha^{i \cdot k}$), and N − 1 additions (⊕).
We compute the multiplication of two Galois field elements using the logarithm and anti-logarithm look-up tables of the Galois field elements (see Appendix B, Section B.2.4, on the companion website). The multiplication of x and y, $z = x \cdot y$, using the logarithm and anti-logarithm look-up tables involves four steps:

1. Get a, the logarithm of x, using Galois_Log[ ]
2. Get b, the logarithm of y, using Galois_Log[ ]
3. Compute c = a + b
4. Get z, the anti-logarithm of c, using Galois_aLog[ ]

Given the Galois field element $\beta = \alpha^i$, implementation of the Galois field element power ($\gamma = \beta^k$) also involves four steps:

1. Get i, the exponent of α (the logarithm value of β)
2. Compute i ∗ k
3. Compute j = i ∗ k modulo 255
4. Get γ, the anti-logarithm of j

These steps consume approximately 7 cycles on the reference embedded processor. Instead, we use a look-up table with precomputed Galois field element powers. We perform the Galois field addition using the XOR operator. The simulation code for syndrome computation is given in Pcode 4.14. The 0th syndrome ($S_0 = R(\alpha^0) = R(1)$) is computed by adding all received message polynomial coefficients. We handle the computation of the 0th syndrome separately, as it involves only XOR operations. The syndromes $S_1$ to $S_{15}$ are computed in a loop using a look-up table of the Galois field element powers, sGalois_elem_pow[ ]. For each syndrome, we need 204 Galois field element powers, and hence the sGalois_elem_pow[ ] look-up table consists of 3060 (= 204 ∗ 15) elements. In the inner loop, we perform the syndrome computation for two data elements at a time with 12 instructions per iteration. If we interleave the program code, then the inner loop consumes 12 cycles per iteration. Therefore, the syndrome computation block consumes (12 ∗ (2T − 1) ∗ N/2 + N) cycles. With this, for T = 8, we require about 18,600 (= 12 ∗ 102 ∗ 15 + 204 + overhead) cycles to implement the syndrome computation module on the reference embedded processor.
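The four-step table multiplication and the shift-and-mask "modulo 255" folding used in the listings can be sketched as follows. This is a hedged illustration: the table names, `fold255`, and the bitwise reference multiplier are hypothetical, the tables are generated locally from $p(x) = x^8 + x^4 + x^3 + x^2 + 1$ instead of being taken from the companion website, and the anti-logarithm table is assumed to carry one extra entry so that a folded exponent of 255 maps to $\alpha^0$.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t lg_tab[256];    /* element -> exponent (entry 0 unused)      */
static uint8_t alg_tab[256];   /* exponent -> element, alg_tab[255] = alpha^0 */

static void gf_tables_init(void)
{
    unsigned v = 1;
    for (int i = 0; i < 255; i++) {
        alg_tab[i] = (uint8_t)v;
        lg_tab[v] = (uint8_t)i;
        v <<= 1;
        if (v & 0x100u) v ^= 0x11Du;
    }
    alg_tab[255] = alg_tab[0];   /* the folded sum can come out as 255 */
}

/* Fold e in [0, 508] into [0, 255], preserving e modulo 255 (since 256 = 1 mod 255). */
static unsigned fold255(unsigned e)
{
    assert(e <= 508u);
    return (e >> 8) + (e & 0xffu);
}

/* Steps 1-4: two table reads, one addition, one anti-logarithm read. */
static uint8_t gf_mul_log(uint8_t x, uint8_t y)
{
    if (x == 0 || y == 0) return 0;   /* zero has no logarithm */
    return alg_tab[fold255((unsigned)lg_tab[x] + lg_tab[y])];
}

/* Reference multiplier: shift-and-add with reduction, no tables. */
static uint8_t gf_mul_ref(uint8_t x, uint8_t y)
{
    unsigned acc = 0, a = x;
    for (int bit = 0; bit < 8; bit++) {
        if (y & (1u << bit)) acc ^= a;
        a <<= 1;
        if (a & 0x100u) a ^= 0x11Du;
    }
    return (uint8_t)acc;
}

/* Returns 1 if both multipliers agree on all 65536 operand pairs. */
static int gf_mul_methods_agree(void)
{
    gf_tables_init();
    for (unsigned x = 0; x < 256; x++)
        for (unsigned y = 0; y < 256; y++)
            if (gf_mul_log((uint8_t)x, (uint8_t)y) != gf_mul_ref((uint8_t)x, (uint8_t)y))
                return 0;
    return 1;
}
```

The folding avoids a division: an exponent sum of 300, for example, becomes 1 + 44 = 45, the same value as 300 mod 255. The explicit zero test is needed because the logarithm of zero is undefined; the book's listings handle this with a `log0` sentinel in the encoder.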
    r0 = 0;
    for (i = 0; i < N; i++)
        r0 = r0 ^ rec_msg[i];
    Syndromes[0] = r0;
    for (j = 1; j < 2*T; j++) {
        r0 = 0; r7 = j*N;
        for (i = 0; i < N; i += 2) {
            r1 = rec_msg[N-i-1]; r2 = rec_msg[N-i-2];
            r3 = Galois_Log[r1]; r4 = Galois_Log[r2];
            r5 = sGalois_elem_pow[r7+i]; r6 = sGalois_elem_pow[r7+i+1];
            r3 = r3 + r5; r4 = r4 + r6;
            r3 = Galois_aLog[r3]; r4 = Galois_aLog[r4];
            r4 = r4 ^ r3; r0 = r0 ^ r4;
        }
        Syndromes[j] = r0;
    }

Pcode 4.14: Simulation code for syndrome computation.

If all computed syndromes are zero, then no errors are present in the received data and we skip the next steps of RS decoding. The received data is not in error most of the time. Even if the data is in error, only a few elements (one or two with high probability) will be in error. With the RS(204, 188) decoder, we can correct up to 8 (= T) erroneous data elements. If errors are present in the received data block, then not all syndromes are zero, and the degree of the error locator polynomial indicates the number of errors present in the data. Next, we discuss the simulation of the error locator polynomial generation process.

Error Locator Polynomial Computation

We use the Berlekamp-Massey recursion to compute the error locator polynomial $\Lambda(x)$. With the Berlekamp-Massey recursion, the error locator polynomial $\Lambda(x)$ is generated from 2T syndromes in 2T iterations. Before entering the loop, we initialize L, the degree of the $\Lambda(x)$ polynomial, to zero (i.e., $\Lambda(x) = 1$, assuming zero errors). We compute the discrepancy at the beginning of every iteration, and if the discrepancy is not zero, then we update $\Lambda(x)$ with the discrepancy. For the first iteration, the discrepancy $\Delta$ is the 0th syndrome. For convenient simulation, we get the first discrepancy before entering the loop, and we always compute the discrepancy for the next iteration at the end of the current iteration.
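The all-zero syndrome test for an error-free block, described above, can be exercised end to end with a small self-check. The sketch below is not the book's method: it evaluates each syndrome with Horner's rule instead of the precomputed sGalois_elem_pow[ ] table, it rebuilds the tables and generator polynomial locally, and every identifier (`rs_tables`, `rs_mul`, `rs_encode`, `rs_syndromes`, `rs_syndrome_self_check`) as well as the test message pattern is an assumption made for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RS_N 204
#define RS_K 188
#define RS_P (RS_N - RS_K)   /* 2T = 16 parity symbols */

static uint8_t rs_alog[512], rs_log[256];

static void rs_tables(void)
{
    unsigned v = 1;
    for (int i = 0; i < 255; i++) {
        rs_alog[i] = rs_alog[i + 255] = (uint8_t)v;
        rs_log[v] = (uint8_t)i;
        v <<= 1;
        if (v & 0x100u) v ^= 0x11Du;   /* x^8 + x^4 + x^3 + x^2 + 1 */
    }
}

static uint8_t rs_mul(uint8_t a, uint8_t b)
{
    return (a && b) ? rs_alog[rs_log[a] + rs_log[b]] : 0;
}

/* Systematic encoding: cw[0..K-1] = message (highest degree first), then parity. */
static void rs_encode(const uint8_t msg[RS_K], uint8_t cw[RS_N])
{
    uint8_t g[RS_P + 1] = {1}, b[RS_P] = {0};
    for (int i = 0; i < RS_P; i++) {            /* G(x) *= (x + alpha^i) */
        g[i + 1] = g[i];
        for (int j = i; j > 0; j--)
            g[j] = (uint8_t)(g[j - 1] ^ rs_mul(g[j], rs_alog[i]));
        g[0] = rs_mul(g[0], rs_alog[i]);
    }
    assert(g[RS_P] == 1);
    for (int i = 0; i < RS_K; i++) {            /* LFSR division by G(x) */
        uint8_t fb = (uint8_t)(msg[i] ^ b[RS_P - 1]);
        for (int j = RS_P - 1; j > 0; j--)
            b[j] = (uint8_t)(b[j - 1] ^ rs_mul(fb, g[j]));
        b[0] = rs_mul(fb, g[0]);
    }
    memcpy(cw, msg, RS_K);
    for (int i = 0; i < RS_P; i++)
        cw[RS_K + i] = b[RS_P - 1 - i];
}

/* S_j = R(alpha^j), j = 0..15, by Horner's rule; cw[0] is the x^(N-1) coefficient. */
static void rs_syndromes(const uint8_t cw[RS_N], uint8_t s[RS_P])
{
    for (int j = 0; j < RS_P; j++) {
        uint8_t acc = 0;
        for (int i = 0; i < RS_N; i++)
            acc = (uint8_t)(rs_mul(acc, rs_alog[j]) ^ cw[i]);
        s[j] = acc;
    }
}

static int rs_all_zero(const uint8_t s[RS_P])
{
    for (int j = 0; j < RS_P; j++)
        if (s[j] != 0) return 0;
    return 1;
}

/* Encode a test message, expect zero syndromes, corrupt one byte, expect non-zero. */
static int rs_syndrome_self_check(void)
{
    uint8_t msg[RS_K], cw[RS_N], s[RS_P];
    rs_tables();
    for (int i = 0; i < RS_K; i++)
        msg[i] = (uint8_t)(i * 7 + 3);          /* arbitrary test pattern */
    rs_encode(msg, cw);
    rs_syndromes(cw, s);
    if (!rs_all_zero(s)) return 0;
    cw[17] ^= 0x5A;                             /* inject a single byte error */
    rs_syndromes(cw, s);
    return !rs_all_zero(s);
}
```

Horner's rule trades the 3060-entry power table for one extra multiplication per element, which is convenient for a reference model even though the table-driven loop of Pcode 4.14 is what the cycle estimates refer to.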
The discrepancy is computed by convolving the syndromes with the current error locator polynomial, $\Lambda(x) = 1 + \Lambda_1 x + \Lambda_2 x^2 + \cdots + \Lambda_{L-1} x^{L-1}$ of degree L − 1. For the i-th iteration, the discrepancy $\Delta_i$ is computed as follows:

$$\Delta_i = \sum_{j=0}^{L_i-1} \Lambda_j S_{i-j} = \Lambda_0 S_i \oplus \Lambda_1 S_{i-1} \oplus \cdots \oplus \Lambda_{L_i-1} S_{i-(L_i-1)} = S_i \oplus \Lambda_1 S_{i-1} \oplus \cdots \oplus \Lambda_{L_i-1} S_{i-(L_i-1)}$$

In the i-th iteration, we use $L_i - 1$ Galois field additions and multiplications in convolving the syndromes with the current $\Lambda(x)$ to compute the discrepancy $\Delta_i$. We use the Galois logarithm and anti-logarithm look-up tables for the multiplication of Galois field elements. Because we know all syndromes in advance, we get the logarithm values of the syndromes before entering the loop of the $\Lambda(x)$ computation. We have to compute the logarithm values of the $\Lambda(x)$ coefficients in every iteration, as they change from iteration to iteration. With this, we can compute the i-th iteration discrepancy $\Delta_i$ in 6 ∗ ($L_i$ − 1) cycles. Depending on the current iteration discrepancy $\Delta_i$ computed at the end of the previous iteration, we update $\Lambda(x)$ (if $\Delta_i \neq 0$) as

$$\Lambda_i(x) = \Lambda_{i-1}(x) - x \cdot \Delta_i \cdot T_{i-1}(x)$$

where $T_{i-1}(x)$ is computed in the previous iteration as

$$T_{i-1}(x) = \begin{cases} \Lambda_{i-2}(x)/\Delta_{i-1} & \text{if } \Delta_{i-1} \neq 0 \text{ and } 2L_{i-1} \le i-1 \\ x \cdot T_{i-2}(x) & \text{otherwise} \end{cases}$$

If $\Delta_i \neq 0$, we spend a total of (T + 1) ∗ 7 cycles for computing $\Lambda_i(x)$ and another (T + 1) ∗ 7 cycles for computing $T_i(x)$ if $2L_{i-1} \le i - 1$. We spend an overhead of another 20 cycles for moving the data to and from buffers and for conditional checks. Thus, for T = 8 we consume about 16 ∗ [6 ∗ ($L_i$ − 1) + 14 ∗ 9 + 20] cycles for computing the error locator polynomial. The simulation code for computing the error locator polynomial $\Lambda(x)$ using the Berlekamp-Massey recursion routine is given in Pcode 4.15. Once we compute the error locator polynomial $\Lambda(x)$, its degree gives us an idea of the number of errors present in the received data.
However, we cannot determine the number of errors from the degree of $\Lambda(x)$ alone, as the degree gives wrong information when the number of errors in the received data vector is more than T. By finding the roots of the error locator polynomial $\Lambda(x)$, we get exact information about the number of errors present (if there are at most T errors) and about the error positions in the received data vector.

Roots Computation for Error Locator Polynomial

The roots {$X_i$, 0 ≤ i < L} of the error locator polynomial give the error positions in the received data vector (if any errors are present and there are at most T of them). We find the roots of the error locator polynomial $\Lambda(x)$ by substituting every possible error position into $\Lambda(x)$ (Chien's search) and checking whether that position is a root of $\Lambda(x)$. To find the roots, we need the powers of the Galois field elements, and here we compute the powers with an analytic method (instead of using look-up tables as in the syndrome computation). We consume 7 cycles to find the power of a Galois field element (this can be achieved in one cycle on an embedded processor with circular buffer registers; see Appendix A, Section A.4, on the companion website). To find whether a particular data element is in error (that is, whether its position is a root of $\Lambda(x)$), we spend (7 + 4) ∗ L cycles. We search all data element positions to find the roots of $\Lambda(x)$. Therefore, to find the roots of the error locator polynomial $\Lambda(x)$ with an analytic method (without using the modular arithmetic registers of an embedded processor), we consume about 204 ∗ 11 ∗ L cycles. The simulation code of Chien's search algorithm for finding the roots of an error locator polynomial is given in Pcode 4.16.

Computation of Error Magnitude Polynomial

We need to know the error magnitudes (since the data elements are nonbinary) to correct the errors present in the received data.
The error magnitudes are computed with the help of an error magnitude polynomial.

    L = 0;
    r0 = Syndromes[0];                          // starting delta
    for(k = 0; k < 2*T; k++) {
        for(i = 0; i < T+1; i++)
            Conn_poly[i] = Elp[i];              // Conn_poly = Elp
        if (r0 != 0) {                          // Elp = Conn_poly - Delta*Tx
            r2 = Galois_Log[r0];                // log(delta)
            for(i = 0; i < T+1; i++) {
                r1 = Conn_poly[i];
                r3 = Tx[i];
                r3 = Galois_Log[r3];            // log(Tx[i])
                r3 = r2 + r3;
                r3 = Galois_aLog[r3];
                r1 = r3 ^ r1;                   // Conn_poly[i] ^ Delta*Tx[i]
                Elp[i] = r1;
            }
            if (2*L < (k + 1)) {
                L = k + 1 - L;
                for(i = 0; i < T+1; i++) {      // Tx = Conn_poly/Delta
                    r1 = Conn_poly[i];
                    r1 = Galois_Log[r1];
                    m = r1 - r2;
                    if (m < 0) m += 255;
                    r1 = Galois_aLog[m];
                    Tx[i] = r1;
                }
            }
        }
        for(i = T+1; i > 0; i--)                // Tx = [0 Tx], increment degree by 1
            Tx[i] = Tx[i-1];
        Tx[0] = 0;
        r0 = Syndromes[k+1];                    // on the last pass (k = 2T-1) this reads index 2T
        if (L > 0) {
            for(i = 0; i < L; i++) {            // compute delta by convolution of Syndromes and Elp
                r1 = log_Syndromes[k-i+1];
                r2 = Elp[i+1];
                r2 = Galois_Log[r2];
                r1 = r1 + r2;
                r2 = Galois_aLog[r1];
                r0 = r0 ^ r2;
            }
        }
    }
Pcode 4.15: Simulation code for Berlekamp-Massey recursion.

We compute the error magnitude polynomial $\Omega(x)$, as described in Section 3.6, using the following equation:

$$\Omega(x) = \Lambda(x)\,[1 + S(x)] \bmod x^{2T+1}, \quad \text{where } S(x) = \sum_{j=1}^{2T} S_j x^j \text{ and } \Lambda(x) = \sum_{i=0}^{v} \Lambda_i x^i \text{ with } \Lambda_0 = 1$$

Computation of the error magnitude polynomial $\Omega(x)$ involves multiplication of two polynomials, $\Lambda(x)$ and $S(x)$. If $\Omega(x) = 1 + \omega_1 x + \omega_2 x^2 + \cdots$, then the coefficients of $\Omega(x)$ are obtained as follows:

$$\omega_1 = \Lambda_1 + S_1$$
$$\omega_2 = \Lambda_2 + \Lambda_1 S_1 + S_2$$
$$\vdots$$
As we know both the error locator polynomial and the syndromes in advance, we precompute the logarithm values of the syndromes and of the error locator polynomial coefficients to efficiently perform the Galois field multiplications in computing $\Omega(x)$. As we can only correct T data element errors, we compute $\Omega(x)$ up to degree T. The simulation code for computing the error magnitude polynomial is given in Pcode 4.17. For T = 8, we consume about 4 ∗ T + (5 + 10 + 15 + · · · + 40) = 212 cycles to compute the error magnitude polynomial.

    k = 0;
    for(i = 203; i >= 0; i--) {
        r0 = Elp[0];
        for(j = 1; j < L+1; j++) {
            r1 = i*j;                       // power of Galois field element
            r2 = r1 >> 8;
            r1 = r1 & 0xff;                 // take modulo 255 by folding the high byte
            r3 = log_Elp[j];
            r1 = r1 + r2;
            r2 = r1 >> 8;
            r1 = r1 & 0xff;
            r1 = r1 + r2;
            r1 = r1 + r3;                   // adding logs multiplies the field elements
            r2 = Galois_aLog[r1];
            r0 = r0 ^ r2;
        }
        if (r0 == 0) {
            Error_position[k] = 255-i;
            k++;
        }
    }
Pcode 4.16: Simulation code for finding roots of error locator polynomial.

    Emp[0] = 0;                             // logarithm of the first error magnitude polynomial coefficient
    for(j = 1; j <= T; j++) {
        r0 = 0;
        for(i = 0; i <= j; i++) {
            r1 = log_Elp[i];
            r2 = log_Syndromes[j-i];
            r1 = r1 + r2;
            r2 = Galois_aLog[r1];
            r0 = r0 ^ r2;
        }
        r0 = Galois_Log[r0];
        Emp[j] = r0;                        // logarithm of the j-th error magnitude polynomial coefficient
    }
Pcode 4.17: Simulation code for computing error magnitude polynomial.

Data Error Correction

To correct data errors, we have to know both the error positions and the error magnitudes. We know the error positions from the roots of the error locator polynomial. We find the error magnitudes with the help of the error magnitude polynomial, the differentiated error locator polynomial, and the error roots {$X_i$, 0 ≤ i < L} by using the following equation:

$$e_i = -\frac{X_i\, \Omega(X_i^{-1})}{\Lambda'(X_i^{-1})}$$

where $\Lambda'(x)$ is the derivative of the error locator polynomial $\Lambda(x)$, which is obtained by simply zeroing alternate coefficients of $\Lambda(x)$. Therefore, $\Lambda'(x) = \Lambda_1 + \Lambda_3 x^2 + \cdots$.
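The error magnitude polynomial and the Forney evaluation described above can be sketched in plain C as follows. This is a hedged illustration, not the author's routine: the function names, the shift-and-add `gf_mul`, and the primitive polynomial 0x11d are assumptions. The sketch also assumes that the first syndrome is $R(\alpha^0)$, as in Pcode 4.14; under that assumption the quotient is scaled by $X_i^2$, which is the factor Pcode 4.18 applies through `2*r2`, whereas the equation above is written with a single factor $X_i$. Signs vanish because the field has characteristic 2.

```c
#include <stdint.h>

/* Multiply in GF(2^8) by shift-and-add; 0x1d is the low byte of the
 * assumed primitive polynomial 0x11d. */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
        b >>= 1;
    }
    return p;
}

/* Inverse of a nonzero element: a^254 = a^-1 in GF(2^8). */
static uint8_t gf_inv(uint8_t a) {
    uint8_t r = 1;
    for (int i = 0; i < 254; i++) r = gf_mul(r, a);
    return r;
}

/* Horner evaluation; c[0] is the constant term. */
static uint8_t poly_eval(const uint8_t *c, int deg, uint8_t x) {
    uint8_t y = 0;
    for (int i = deg; i >= 0; i--) y = gf_mul(y, x) ^ c[i];
    return y;
}

/* omega[0..two_t] = Lambda(x) * (1 + S(x)) mod x^(two_t+1),
 * where S(x) = sum over j = 1..two_t of syn[j-1] * x^j. */
static void error_magnitude_poly(const uint8_t *lambda, int v,
                                 const uint8_t *syn, int two_t,
                                 uint8_t *omega) {
    for (int k = 0; k <= two_t; k++) {
        uint8_t acc = 0;
        for (int i = 0; i <= v && i <= k; i++) {
            /* coefficient of x^(k-i) in 1 + S(x) */
            uint8_t s = (k == i) ? 1 : syn[k - i - 1];
            acc ^= gf_mul(lambda[i], s);
        }
        omega[k] = acc;
    }
}

/* Error magnitude at locator X, assuming the first syndrome is R(alpha^0):
 * e = X^2 * Omega(1/X) / Lambda'(1/X). */
static uint8_t forney_magnitude(const uint8_t *lambda, int v,
                                const uint8_t *omega, int two_t, uint8_t X) {
    uint8_t xinv = gf_inv(X);
    uint8_t num = poly_eval(omega, two_t, xinv);
    uint8_t den = 0;
    /* Formal derivative in characteristic 2: only odd-index coefficients
     * survive, each moved down one degree. */
    for (int i = v; i >= 1; i--)
        den = gf_mul(den, xinv) ^ ((i & 1) ? lambda[i] : 0);
    return gf_mul(gf_mul(gf_mul(X, X), num), gf_inv(den));
}
```

For two errors with locators $X_1$, $X_2$ and $\Lambda(x) = 1 + (X_1 + X_2)x + X_1 X_2 x^2$, calling `forney_magnitude` at each locator returns the corresponding error value, which is then XORed into the received symbol.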
We compute the error magnitudes ($Y_i$) for all error positions ($X_i$) by substituting the inverse of the error position ($X_i^{-1}$) in $\Omega(x)$ and $\Lambda'(x)$ and then computing the Galois field arithmetic expression $X_i\,\Omega(X_i^{-1})/\Lambda'(X_i^{-1})$. The simulation code for computing error magnitudes and for correcting data errors is given in Pcode 4.18. By interleaving the program code, we consume about 144 (= T ∗ 18) cycles to compute $\Omega(X_i^{-1})$ and $\Lambda'(X_i^{-1})$, 7 more cycles to perform the division and multiplication for computing the error magnitude ($e_i$), and about 3 cycles for getting the erroneous data element and correcting it with the error magnitude. Therefore, we consume a total of 154 cycles for correcting one data element. If we have data with L errors, then we spend L ∗ 154 cycles to correct all the erroneous data elements.

    for(j = 0; j <= T; j += 2) {            // logarithm of derivative of error locator polynomial
        log_Derv_Elp[j] = log_Conn_poly[j+1];
        log_Derv_Elp[j + 1] = log0;         // log0 = a value not in the Galois field GF(2^8)
    }
    for(i = 0; i < L; i++) {                // find error magnitudes using Forney algorithm, and correct data errors
        r0 = 0; r5 = 0;
        r2 = Error_position[i];
        for(j = 0; j <= T; j++) {
            r1 = Omega_gf[j];
            r6 = log_Derv_Elp[j];
            r3 = r2 * j;
            r4 = r3 >> 8;  r3 = r3 & 0xff;  r3 = r4 + r3;
            r4 = r3 >> 8;  r3 = r3 & 0xff;
            k = r4 + r3;
            k = 255 - k;
            if (k < 0) k += 255;
            r1 = r1 + k;
            r6 = r6 + k;
            r1 = Galois_aLog[r1];
            r6 = Galois_aLog[r6];
            r0 = r0 ^ r1;
            r5 = r5 ^ r6;
        }
        r0 = Galois_Log[r0];
        r5 = Galois_Log[r5];
        k = r0 + 2*r2 - r5;                 // log(Omega) + 2*log(X) - log(Lambda')
        if (k < 0) k += 255;
        r0 = Galois_aLog[k];
        m = N-1 - r2;
        rec_msg[m] = rec_msg[m] ^ r0;
    }
Pcode 4.18: Simulation code for computing error magnitudes and data correction.

4.2.6 RS(204, 188) Simulation Results

In this section, we present the simulation results of the RS(204, 188) coder. We get input data of 188 bytes and compute parity data of 16 bytes from the input data using the RS(204, 188) encoder.
To frame 204 elements of encoded data, we left shift the input data vector by 16 bytes and append parity data on the right side. Then we add eight random errors (as RS(204, 188) can correct up to eight errors) to the encoded data, and we input to the RS(204, 188) decoder. The simulation results for the RS(204, 188) coder with encoder input, decoder output and intermediate results follow. Generator polynomial coefﬁcients vector: G = [0x3b, 0x24, 0x32, 0x62, 0xe5, 0x29, 0x41, 0xa3, 0x8, 0x1e, 0xd1, 0x44, 0xbd, 0x68, 0xd, 0x3b, 0x1] RS(204, 188) encoder simulation results: Input data vector D = [0x70, 0x18, 0x00, 0x36, 0xc9, 0xd1, 0x25, 0xa2, 0x95, 0x34, 0xb4, 0xff, 0xd2, 0xc4, 0x63, 0x01, 0x6d, 0x53, 0xc9, 0x6f, 0xb5, 0xf3, 0xb5, 0x23, 0x52, 0xc9, 0x49, 0xcc, 0x36, 0x62, 0xee, 0xfb, 0xc0, 0x9e, 0x0e, 0x56, 0x3d, 0x88, 0xad, 0x38, 0xa9, 0x1e, 0xda, 0x2a, 0x9d, 0xa2, 0xc4, 0x8b, 0x68, 0x36, 0xa0, 0xd4, 0xc3, 0xc3, 0xb3, 0xd1, 0x30, 0x32, 0x36, 0xc4, 0xe9, 0x3b, 0x58, 0xb2, 0x04, 0x8e, 0x9b, 0x73, 0x07, 0xfd, 0x0a, 0x0c, 0x1d, 0x4f, 0xb5, 0x1f, 0x83, 0x18, 0xb1, 0x46, 0x76, 0xa4, 0x09, 0xe5, 0xf7, 0x31, 0x27, 0x37, 0x8e, 0xe3, 0x51, 0x73, 0x73, 0x96, 0xb6, 0xb6, 0x41, 0x1d, 0x1b, 0x1d, 0x59, 0xba, 0x61, 0xb4, 0x5b, 0x03, 0x2a, 0xdd, 0x8e, 0x08, 0x2a, 0x2b, 0x18, 0xc1, 0x3e, 0xc3, 0x89, 0xf2, 0xfd, 0x0b, 0xfb, 0x51, 0x74, 0xb7, 0xee, 0x8c, 0x1e, 0x86, 0x90, 0x30, 0x4f, 0xf5, 0xf0, 0x37, 0xce, 0x44, 0xcf, 0x69, 0x9f, 0x8c, 0x83, 0x05, 0x6d, 0x05, 0x06, 0x79, 0x86, 0xf4, 0xc6, 0x29, 0x9f, 0xbf, 0x27, 0x95, 0xee, 0x78, 0xc8, 0x9f, 0x0b, 0x14, 0x78, 0x6d, 0xfd, 0x8b, 0xf1, 0x1b, 0x2a, 0x5e, 0xaf, 0xfa, 0x0d, 0x17, 0x14, 0xad, 0xea, 0x12, 0x97, 0x3a, 0xf9, 0x66, 0x83, 0x82, 0x97, 0x5e, 0x1c, 0x9b, 0x87, 0x81] Parity vector generated by RS(204, 188) encoder B = [0xd4, 0x02, 0x65, 0xb2, 0x97, 0x1b, 0xa2, 0x06, 0x3b, 0xbf, 0xd5, 0xe7, 0x5c, 0xa4, 0x3b, 0x99] Implementation of Error Correction Algorithms 177 Encoder output (parity bytes are bolded): D | B M = [0x70, 0x18, 0x00, 
0x36, 0xc9, 0xd1, 0x25, 0xa2, 0x95, 0x34, 0xb4, 0xff, 0xd2, 0xc4, 0x63, 0x01, 0x6d, 0x53, 0xc9, 0x6f, 0xb5, 0xf3, 0xb5, 0x23, 0x52, 0xc9, 0x49, 0xcc, 0x36, 0x62, 0xee, 0xfb, 0xc0, 0x9e, 0x0e, 0x56, 0x3d, 0x88, 0xad, 0x38, 0xa9, 0x1e, 0xda, 0x2a, 0x9d, 0xa2, 0xc4, 0x8b, 0x68, 0x36, 0xa0, 0xd4, 0xc3, 0xc3, 0xb3, 0xd1, 0x30, 0x32, 0x36, 0xc4, 0xe9, 0x3b, 0x58, 0xb2, 0x04, 0x8e, 0x9b, 0x73, 0x07, 0xfd, 0x0a, 0x0c, 0x1d, 0x4f, 0xb5, 0x1f, 0x83, 0x18, 0xb1, 0x46, 0x76, 0xa4, 0x09, 0xe5, 0xf7, 0x31, 0x27, 0x37, 0x8e, 0xe3, 0x51, 0x73, 0x73, 0x96, 0xb6, 0xb6, 0x41, 0x1d, 0x1b, 0x1d, 0x59, 0xba, 0x61, 0xb4, 0x5b, 0x03, 0x2a, 0xdd, 0x8e, 0x08, 0x2a, 0x2b, 0x18, 0xc1, 0x3e, 0xc3, 0x89, 0xf2, 0xfd, 0x0b, 0xfb, 0x51, 0x74, 0xb7, 0xee, 0x8c, 0x1e, 0x86, 0x90, 0x30, 0x4f, 0xf5, 0xf0, 0x37, 0xce, 0x44, 0xcf, 0x69, 0x9f, 0x8c, 0x83, 0x05, 0x6d, 0x05, 0x06, 0x79, 0x86, 0xf4, 0xc6, 0x29, 0x9f, 0xbf, 0x27, 0x95, 0xee, 0x78, 0xc8, 0x9f, 0x0b, 0x14, 0x78, 0x6d, 0xfd, 0x8b, 0xf1, 0x1b, 0x2a, 0x5e, 0xaf, 0xfa, 0x0d, 0x17, 0x14, 0xad, 0xea, 0x12, 0x97, 0x3a, 0xf9, 0x66, 0x83, 0x82, 0x97, 0x5e, 0x1c, 0x9b, 0x87, 0x81, 0xd4, 0x02, 0x65, 0xb2, 0x97, 0x1b, 0xa2, 0x06, 0x3b, 0xbf, 0xd5, 0xe7, 0x5c, 0xa4, 0x3b, 0x99] RS(204, 188) decoder results: Received data (error data elements are underlined) R = [0x70, 0x18, 0x00, 0x36, 0xc9, 0xd1, 0x25, 0xa2, 0x95, 0x34, 0xb4, 0xff, 0xd2, 0xc4, 0x63, 0x01, 0x6d, 0x53, 0xc9, 0x6f, 0xb5, 0xf3, 0xb5, 0x23, 0x52, 0xc9, 0x49, 0xcc, 0x36, 0x62, 0xee, 0xfb, 0xc0, 0x9e, 0x0e, 0x56, 0x3d, 0x88, 0xad, 0x38, 0xa9, 0x1e, 0xda, 0x2a, 0x67, 0x38, 0xc4, 0x8b, 0x68, 0x36, 0xa0, 0xd4, 0xc3, 0xc3, 0xb3, 0xd1, 0x30, 0x32, 0x36, 0x10, 0xe9, 0x3b, 0x58, 0xb2, 0x04, 0x8e, 0x9b, 0xa5, 0x07, 0xfd, 0x0a, 0x0c, 0x1d, 0x4f, 0xb5, 0x1f, 0x83, 0x18, 0xb1, 0x46, 0x76, 0xa4, 0x09, 0xe5, 0xf7, 0x71, 0x27, 0x37, 0x8e, 0xe3, 0x51, 0x73, 0x73, 0x96, 0xb6, 0xb6, 0x41, 0x1d, 0x1b, 0x1d, 0x59, 0xba, 0x61, 0xb4, 0x5b, 0x03, 0x2a, 0x48, 0x8e, 0x08, 0x2a, 0x2b, 0x18, 0xc1, 0x3e, 0xc3, 0x89, 0xf2, 
0xfd, 0x0b, 0xfb, 0x51, 0x74, 0xb7, 0xee, 0x8c, 0x1e, 0x5c, 0x90, 0x30, 0x4f, 0xf5, 0xf0, 0x37, 0xce, 0x44, 0xcf, 0x69, 0x9f, 0x8c, 0x83, 0x05, 0x6d, 0xb1, 0x06, 0x79, 0x86, 0xf4, 0xc6, 0x29, 0x9f, 0xbf, 0x27, 0x95, 0xee, 0x78, 0xc8, 0x9f, 0x0b, 0x14, 0x78, 0x6d, 0xfd, 0x8b, 0xf1, 0x1b, 0x2a, 0x5e, 0xaf, 0xfa, 0x0d, 0x17, 0x14, 0xad, 0xea, 0x12, 0x97, 0x3a, 0xf9, 0x66, 0x83, 0x82, 0x97, 0x5e, 0x1c, 0x9b, 0x87, 0x81, 0xd4, 0x02, 0x65, 0xb2, 0x97, 0x1b, 0xa2, 0x06, 0x3b, 0xbf, 0xd5, 0xe7, 0x5c, 0xa4, 0x3b, 0x99] Syndrome vector S = [0xd9, 0x9c, 0xfd, 0x0, 0x84, 0x16, 0x96, 0x3e, 0x60, 0x3a, 0x18, 0xd3, 0xfb, 0xcf, 0x90, 0xf0] Error locator polynomial vector = [0x1, 0x9a, 0x3f, 0xe1, 0xc1, 0x34, 0x13, 0x7b, 0x62] Error position vector X = [0x3c, 0x4c, 0x60, 0x76, 0x88, 0x90, 0x9e, 0x9f] Error magnitude polynomial vector = [0x1, 0x43, 0x13, 0xad, 0x86, 0xfd, 0xad, 0x88, 0xaa] Error magnitudes e = [0xb4, 0xda, 0x95, 0x40, 0xd6, 0xd4, 0x9a, 0xfa] Decoder output (corrected data elements are italicized and underlined) D = [0x70, 0x18, 0x0, 0x36, 0xc9, 0xd1, 0x25, 0xa2, 0x95, 0x34, 0xb4, 0xff, 0xd2, 0xc4, 0x63, 0x1, 0x6d, 0x53, 0xc9, 0x6f, 0xb5, 0xf3, 0xb5, 0x23, 0x52, 0xc9, 0x49, 0xcc, 0x36, 0x62, 0xee, 0xfb, 0xc0, 0x9e, 0xe, 0x56, 0x3d, 0x88, 0xad, 0x38, 0xa9, 0x1e, 0xda, 0x2a, 0x9d, 0xa2, 0xc4, 0x8b, 0x68, 0x36, 0xa0, 0xd4, 0xc3, 0xc3, 0xb3, 0xd1, 0x30, 0x32, 0x36, 0xc4, 0xe9, 0x3b, 0x58, 0xb2, 0x4, 0x8e, 0x9b, 0x73, 0x7, 0xfd, 0xa, 0xc, 0x1d, 0x4f, 0xb5, 0x1f, 0x83, 0x18, 0xb1, 0x46, 0x76, 0xa4, 0x9, 0xe5, 0xf7, 0x31, 0x27, 0x37, 0x8e, 0xe3, 0x51, 0x73, 0x73, 0x96, 0xb6, 0xb6, 0x41, 0x1d, 0x1b, 0x1d, 0x59, 0xba, 0x61, 0xb4, 0x5b, 0x3, 0x2a, 0xdd, 0x8e, 0x8, 0x2a, 0x2b, 0x18, 0xc1, 0x3e, 0xc3, 0x89, 0xf2, 0xfd, 0xb, 0xfb, 0x51, 0x74, 0xb7, 0xee, 0x8c, 0x1e, 0x86, 0x90, 0x30, 0x4f, 0xf5, 0xf0, 0x37, 0xce, 0x44, 0xcf, 0x69, 0x9f, 0x8c, 0x83, 0x5, 0x6d, 0x5, 0x6, 0x79, 0x86, 0xf4, 0xc6, 0x29, 0x9f, 0xbf, 0x27, 0x95, 0xee, 0x78, 0xc8, 0x9f, 0xb, 0x14, 0x78, 0x6d, 0xfd, 0x8b, 
0xf1, 0x1b, 0x2a, 0x5e, 0xaf, 0xfa, 0xd, 0x17, 0x14, 0xad, 0xea, 0x12, 0x97, 0x3a, 0xf9, 0x66, 0x83, 0x82, 0x97, 0x5e, 0x1c, 0x9b, 0x87, 0x81, 0xd4, 0x2, 0x65, 0xb2, 0x97, 0x1b, 0xa2, 0x6, 0x3b, 0xbf, 0xd5, 0xe7, 0x5c, 0xa4, 0x3b, 0x99]

4.2.7 RS(N, K) Coder Computational Complexity

The cycle consumption estimates of the RS encoder and decoder discussed in Sections 4.2.4 and 4.2.5 are meaningful only for the particular approach followed in the simulation of the RS encoder and decoder. In this implementation, we assumed that sufficient on-chip memory is available to store the look-up table values (1.5 kB for Galois_aLog[ ], 0.5 kB for Galois_Log[ ] and for temporary working buffers, and 3.2 kB for the precomputed look-up tables in syndrome computation). Whatever approach we use in the implementation of RS codes, the overall cycle cost of RS coding depends on its error-correction capability (i.e., T). If we want to correct more errors with RS coding by adding more redundancy to the original data, then the computational cost of RS coding also increases. Next, we discuss the computational complexity of the RS coder for two different values of T with the same implementation techniques used in this chapter to perform RS encoding and decoding.

The expressions used for the cycle estimates are valid only under the assumption of one cycle per operation (including data loads) after interleaving the program code to eliminate pipeline stalls of the reference embedded processor. If we do not interleave the program code, then the cycle consumption increases substantially, as this approach to implementing RS codes involves many data load/store memory accesses. In addition, we did not include the overhead of variable initialization, jumps and other pipeline stalls in obtaining the cycle consumption expressions.
RS Encoder Computational Complexity

Based on Section 4.2.4, the cycle estimate for the RS(N, K) encoder in terms of T follows:

    Encoder cycles = K ∗ (2 ∗ T ∗ 9 + 6)

For the RS(204, 188) coder with T = 8 error-correction capability, we consume about 28,200 (= 188 ∗ (2 ∗ 8 ∗ 9 + 6)) cycles to compute 16 parity elements using the RS(204, 188) encoder. For T = 16, we consume about 55,272 (= 188 ∗ (2 ∗ 16 ∗ 9 + 6)) cycles to compute 32 parity elements using the RS(220, 188) encoder.

RS Decoder Computational Complexity

To correct up to T data errors using the RS(N, K) decoder, the total cycles we consume in all four steps of decoding as seen in Section 4.2.5 follows:

    Decoder cycles = [12 ∗ (2 ∗ T − 1) ∗ N/2 + N] + 2 ∗ T ∗ [6 ∗ (L_i − 1) + 14 ∗ (T + 1) + 20] + N ∗ 11 ∗ L + [W + L ∗ (T ∗ 18 + 10)]

where L_i is the length/degree of the error locator polynomial in the i-th iteration of the Berlekamp-Massey recursive algorithm, L is the actual number of data errors that occurred in the received data, and W is the number of cycles consumed by the error magnitude polynomial computation. The decoder cycles expression depends on many parameters, and we assume some values for the L_i and L parameters in obtaining the cycle counts. For T = 8 (or 16), we obtain the RS decoder cycles by assuming that the actual number of errors is L = 6 (or 12) and that the average iteration count used for computing the discrepancy (L_i − 1) is 4 (or 8). The value of W is 212 (or 744). The computational complexity of the individual steps of the RS(204, 188) (or RS(220, 188)) decoder in terms of cycles follows:

• Syndrome computation: 18,600 (or 41,030)
• Error locator polynomial computation: 2720 (or 9792)
• Roots finding: 13,464 (or 29,040)
• Error magnitudes and correction: 1136 (or 4320)

The total cycle consumption for RS(204, 188) (or RS(220, 188)) is obtained by summing the individual step cycles and is equal to 35,920 (or 84,182). Of the four steps comprising the RS decoder, the syndrome computation and roots-finding steps consume 80 to 90% of the total cycles.
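The four terms of the decoder cycle expression can be evaluated directly, as in the short sketch below. The struct and function names are illustrative assumptions, and N/2 is taken as an integer division. Evaluated literally, the syndrome term gives 18,564 cycles for RS(204, 188) and 41,140 for RS(220, 188), slightly different from the 18,600 and 41,030 quoted in the list above, so the totals from this sketch (35,884 and 84,292) differ a little from the quoted 35,920 and 84,182; the other three terms match exactly.

```c
/* Per-step cycle estimates for the RS(N, K) decoder. */
typedef struct {
    long syndromes;   /* syndrome computation */
    long locator;     /* error locator polynomial (Berlekamp-Massey) */
    long roots;       /* roots finding (Chien's search) */
    long magnitudes;  /* error magnitudes and correction */
    long total;
} rs_cycles;

/* n: codeword length N, t: correction capability T,
 * li_minus_1: assumed average (L_i - 1), l: assumed number of errors L,
 * w: cycles for the error magnitude polynomial (W). */
static rs_cycles rs_decoder_cycles(int n, int t, int li_minus_1, int l, int w) {
    rs_cycles c;
    c.syndromes  = 12L * (2 * t - 1) * (n / 2) + n;
    c.locator    = 2L * t * (6 * li_minus_1 + 14 * (t + 1) + 20);
    c.roots      = (long)n * 11 * l;
    c.magnitudes = w + (long)l * (t * 18 + 10);
    c.total      = c.syndromes + c.locator + c.roots + c.magnitudes;
    return c;
}
```

For example, `rs_decoder_cycles(204, 8, 4, 6, 212)` reproduces the RS(204, 188) figures and `rs_decoder_cycles(220, 16, 8, 12, 744)` the RS(220, 188) figures, up to the difference in the syndrome term noted above.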
As we see from the previous estimated figures, the cycle consumption of the RS(N, K) decoder increases with T (i.e., it increases with the error-correction capability of the RS decoder).

4.2.8 RS Decoder: Efficient Implementation

As discussed in Section 4.2.7, the RS decoder is too costly in terms of cycle consumption. For example, if we are working with a 1 Mbps bit rate application, and we want to correct the errors present in the received data using the RS(204, 188) decoder, then we consume about 191 (= 35920 ∗ 5319) processor MIPS to handle 5319 (= 1000000/188) output data blocks per second. In general, embedded processors have a budget of 500 to 1000 MIPS. If the RS decoder alone consumes 20 to 40% of the MIPS, then running all the other (physical layer) modules on a single embedded processor will not be possible. Typically, we consider the average MIPS as the criterion to determine the MIPS budget for a particular application. We do not find errors in the received data all the time, and even when errors are present, most of the time there are only one or two of them. If no errors are present in the current block of the received data frame, then the RS decoder cycle cost for that particular block can be reduced to 50% of the total RS decoder cycles. For this, we check the syndrome values after the syndrome computation and stop further decoding if all syndromes are zero. If not all syndromes are zero and the error locator polynomial computation yields a polynomial of degree one, then we have one error in the received data block. If one error is present in the received data block, then computing the error roots and the error magnitude polynomial can be avoided. In this case, the error correction of the received data block is performed using the syndrome values and the first-degree error locator polynomial coefficient. The simulation code for correcting single errors in the received data is given in Pcode 4.19.
    r2 = log_Conn_poly[1];
    r0 = Syndromes[0];
    m = N-1 - r2;
    rec_msg[m] = rec_msg[m] ^ r0;
Pcode 4.19: Simulation code for correcting single data errors.

If two errors are present in the received data block (i.e., v = L = 2), then we use the efficient method (for finding two roots of the error locator polynomial) as described in Section 4.1.4, BCH Decoder: Further Optimization for T = 2. After finding the two roots of the error locator polynomial, we use direct error correction for correcting two data elements (in this case we can avoid the error magnitude polynomial computation and also avoid the Forney algorithm for error magnitude computation). The simulation code for correcting double errors with the RS(204, 188) decoder is given in Pcode 4.20. With this, although the full RS decoder needs 191 MIPS, we consume on average 100 MIPS for RS(204, 188) decoding at a 1-Mbps data rate on the reference embedded processor.

    r4 = Error_position[0];         r5 = Error_position[1];     // i1, i2
    r2 = Galois_aLog[r4];           r3 = Galois_aLog[r5];       // X1, X2
    r0 = r4 + log_Syndromes[0];     r1 = Syndromes[1];
    r0 = Galois_aLog[r0];           m = N-1-r4;
    r0 = r0 ^ r1;                   r2 = r2 ^ r3;               // S1+S0·X1, X1+X2
    r0 = Galois_Log[r0];            r2 = Galois_Log[r2];
    r2 = 255-r2;                    r4 = rec_msg[m];            // 1/(X1+X2)
    r0 = r0 + r2;                   r2 = Syndromes[0];
    r0 = Galois_aLog[r0];           k = N-1-r5;                 // (S1+S0·X1)/(X1+X2)
    r2 = r2 ^ r0;                   r5 = rec_msg[k];            // S0 + (S1+S0·X1)/(X1+X2)
    r4 = r4 ^ r2;                   r5 = r5 ^ r0;
    rec_msg[m] = r4;                rec_msg[k] = r5;
Pcode 4.20: Simulation code for correcting double data errors.

4.3 RS Erasure Codes

In Section 3.6, we discussed the RS(N, K) coder that corrects T = (N − K)/2 errors. With the RS(N, K) coder, we compute 2T parity symbols at the transmitter side from K message symbols to form a codeword of N symbols. At the receiver, using these 2T parity symbols, we correct up to T errors in the received codeword of length N.
In this section, we discuss a different kind of RS(N, K) coder, called the RS erasure coder, that can correct up to 2T errors given the error locations present in the received data. We discuss the encoder structure and decoding procedure for RS erasure codes, as well as simulation and optimization techniques for efficient implementation of RS erasure codes.

4.3.1 Erasure Codes

RS codes with known error locations are called erasure codes. In Section 4.2, we computed the error locator polynomial from the syndromes, and then computed its roots to find the error locations. At these error locations, we could correct the received codeword symbols using error magnitudes, which are computed from the syndrome polynomial and the error locator polynomial. We could build the error locator polynomial using the Berlekamp-Massey algorithm only up to degree T, and not beyond, due to incomplete information from the convolution of the syndrome polynomial and the connection polynomial. We can get only T error locations from the roots of an error locator polynomial of degree T. Actually, the RS decoder has the capability to correct up to 2T errors if we know the 2T error locations in advance. How can we know the error locations at the receiver in advance? In some receivers, an upper layer error check on the received data tells us the error locations. For example, in DVB-H receivers, the design of the MPE-FEC module (see Section 17.4) provides the erasure information and allows us to correct up to 2T errors using the RS decoder.

[Figure 4.8: Illustration of erasure information. Labels: L bytes; row of bytes; blocks 0, 1, 2, ..., N−2, N−1; CRC byte.]

We discuss a simple system that generates erasure information for us. Assume that the data is divided into N blocks (or packets) of length L bytes each and arranged as shown in Figure 4.8. Out of L bytes, L − 1 bytes are payload (or message) and 1 byte (the last one) is CRC data (see Section 3.2 for more details on CRC computation).
Next, we transmit all N × L bytes as one frame to the receiver. At the receiver, assume that the received frame contains a few error bytes. If we arrange the received frame as in Figure 4.8 and compute the CRC for each payload block (with L − 1 bytes), then we know whether any particular block was received with error bytes. We tag those data blocks whose computed CRCs do not match their received CRC and treat them as error blocks. Next, if we obtain the codeword from a row of data bytes as shown in Figure 4.8, then we know the error locations of that codeword in advance from the tag information.

4.3.2 RS Erasure Encoder

Given the data frame, we discuss how to compute the parity symbols to work with erasure codes. For this, we consider a DVB-H MPE-FEC module that supports erasure decoding. The MPE-FEC frame is arranged as a matrix with 255 columns and a flexible number of rows. As we discussed, the RS coder is a block code that takes K data symbols as input and outputs an N-symbol codeword by adding 2T computed parity symbols to the K data symbols. Here, N = 255 = $2^m - 1$ and hence m = 8 (i.e., symbol = byte, represented with a field element that belongs to the Galois field GF($2^8$)). Next, we choose K depending on how much error-correction capability we are targeting. In the case of the DVB-H MPE-FEC module, K is chosen as 191. The column length (i.e., L) is variable, and we set its value from the length of payload data that we want to pack in a single frame. If Q is the length of the data in bytes and Q < L ∗ 191, then we pad L ∗ 191 − Q zero bytes to the data frame before computing parity, as shown in Figure 4.9. The payload data section may not occupy a full column of the matrix, in which case we continue with the next data section immediately after the current data section. The data of S sections are stored to the matrix in columns one after another, and zeros are padded at the end to make the payload data length L ∗ 191. Next, we work row-wise to compute the RS parity bytes.
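The column-wise filling with zero padding, followed by row-wise access for the RS encoder, can be sketched as below. This is a minimal illustration under assumptions: the names `mpe_fill` and `mpe_row`, the flat column-major buffer, and the macro `MPE_DATA_COLS` are hypothetical, and section boundaries and CRC handling are left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MPE_DATA_COLS 191   /* K data columns per row */

/* The data table is stored column-major: the byte at (row r, column c)
 * lives at frame[c * rows + r]. Writing the payload contiguously therefore
 * fills the columns one after another. Returns -1 if the payload of q bytes
 * does not fit into rows * 191 bytes. */
static int mpe_fill(uint8_t *frame, int rows, const uint8_t *data, size_t q) {
    size_t cap = (size_t)rows * MPE_DATA_COLS;
    if (q > cap) return -1;
    memcpy(frame, data, q);
    memset(frame + q, 0, cap - q);      /* zero padding at the end */
    return 0;
}

/* Gather the 191 data bytes of row r, i.e. the input of one RS encoding. */
static void mpe_row(const uint8_t *frame, int rows, int r, uint8_t *out) {
    for (int c = 0; c < MPE_DATA_COLS; c++)
        out[c] = frame[(size_t)c * rows + r];
}
```

With `rows = 4` and a 10-byte payload, row 1 starts with payload bytes 1, 5 and 9 and is zero from the fourth column on, which shows how one short payload spreads over the first columns of every row.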
We compute 64 (= 2T) parity bytes from 191 (= K) data bytes using the RS encoder given in Section 3.6.1. The generator polynomial G(x) used in computing the 64 parity bytes follows:

$$G(x) = (x + \alpha^0)(x + \alpha^1)(x + \alpha^2) \cdots (x + \alpha^{63}) \qquad (4.23)$$

[Figure 4.9: Structure of RS erasure encoder. K = 191 data sections (data sections #1 to #S followed by zero padding), 2T = N − K = 64 RS sections (parity #1 to #64), column length L; one row forms a codeword.]

[Figure 4.10: One MPE-FEC frame with CRC appended to data and RS sections (Data Section 1, Data Section 2, ..., Data Section S, RS Section 1, RS Section 2, ..., RS Section 64).]

We append the 64 parity symbols to the 191 data symbols to form a systematic RS codeword of length 255 (= N). Let M(x) be the encoder output message (or codeword polynomial), represented with a polynomial of degree N − 1 as

$$M(x) = m_{254}x^{254} + m_{253}x^{253} + m_{252}x^{252} + \cdots + m_2 x^2 + m_1 x + m_0 \qquad (4.24)$$

where the $m_i$ are field elements belonging to GF($2^8$). In Figure 4.9, one row of the matrix can be represented with the vector M = [$m_{254}$, $m_{253}$, $m_{252}$, ..., $m_2$, $m_1$, $m_0$], which consists of N = 255 bytes. With the computation of the parity data, the matrix is completely filled with data bytes and parity bytes and contains a total of N ∗ L bytes, as shown in Figure 4.9. Next, we compute the CRC for each data section (except for the zero-padded portion) and for each RS parity data column, append it to the corresponding section, and then transmit everything to the receiver as a single frame, as shown in Figure 4.10.

At the receiver, we again compute the CRC for each section and compare it with the received CRC of that section, and we classify a section as unreliable (an error section) if its CRCs do not match. After classifying each section as reliable or unreliable, we arrange the sections again in matrix form as shown in Figure 4.9. We then know which columns contain erroneous data. Let R = [$r_{254}$, $r_{253}$, $r_{252}$, ..., $r_2$, $r_1$, $r_0$] be the vector representing one row of the matrix; we know the error locations in that vector from the tag information of the sections' CRC checks. Note that the CRC tag information may mark the current byte as an error byte even though this byte is not actually in error (the tag only conveys that there is erroneous data somewhere in the column to which the current byte belongs). We treat the current byte as an error byte if the CRC tag information says so. In the next section, we discuss the RS decoder that corrects up to 2T errors given the erasure (or error location) information.

4.3.3 RS Erasure Decoder

As discussed in Section 3.6, given the received codeword R = [$r_{N-1}$, $r_{N-2}$, ..., $r_2$, $r_1$, $r_0$] of length N symbols, RS decoding consists of four steps: (1) syndrome computation, (2) error locator polynomial computation, (3) error root computation to find error locations, and (4) error magnitude computation to correct the errors. Of these four steps, we do not need Steps 2 and 3 for erasure correction, as the erasure information gives the error locations. We can use Pcode 4.14 to compute the syndromes of the received vector R. Given the syndromes and error locations, we have two approaches to correct errors. One approach was discussed in Section 4.2.2, Computation of Error Magnitude Polynomial and Data Error Correction, in which we compute the error magnitude polynomial and then the error magnitudes from the differentiated error locator polynomial and the error magnitude polynomial. As we are not computing the error locator polynomial for erasure decoding, we follow a different approach (using the Bjorck-Pereyra algorithm) discussed in Hong and Vetterli (1995), which does not need the computation of an error magnitude polynomial or an error locator polynomial. We recursively build the error magnitudes using the syndromes and error location values.
Let $S_i = \alpha^{a_i}$, 0 ≤ i ≤ 2T − 1, and $E_j = \alpha^{b_j}$, 0 ≤ j ≤ L − 1, where $\alpha^{a_i}, \alpha^{b_j} \in$ GF($2^8$) represent the syndrome values and error location values. Given L (≤ 2T), the number of error locations, a recursive algorithm to get the error magnitudes is executed as follows:

    for i = 0 : L − 1
        for j = L : −1 : i + 1
            S_j = S_j − E_i ∗ S_{j−1}
        end
    end
    for i = L − 2 : −1 : 0
        for j = i + 1 : L − 1
            S_j = S_j / (E_j − E_{j−i})
            S_{j−1} = S_{j−1} − S_j
        end
    end
    for i = 0 : L − 1
        e_i = S_i / E_i
    end

The values $e_i$, 0 ≤ i ≤ L − 1, give the error magnitudes at the error locations $E_i$. Using the error magnitudes $e_i$, we correct the errors present in the received codeword at the locations $E_i$. The simulation code for the Bjorck-Pereyra algorithm to compute error magnitudes is given in Pcode 4.21.

Next, we discuss the computational complexity of the RS erasure decoding presented in this section when executed on the reference embedded processor. As discussed in Section 4.2.7, RS Decoder Computational Complexity, we consume approximately 97,000 cycles to compute syndromes using Pcode 4.14 when N = 255 and 2T = 64. To compute error magnitudes using Pcode 4.21, we consume approximately (10 + 15) ∗ L ∗ (L + 1)/2 cycles (i.e., 52,000 cycles when L = 64). Thus, we require about 150,000 cycles (or approximately 100 cycles per bit, or 100 MIPS at a 1-Mbps bit rate) to perform the erasure decoding algorithm. This figure increases when we want to correct errors in addition to erasures. Once errors (for which we do not know the locations) are present in the received sequence, we have to compute the error locator polynomial and the error roots. As discussed in the previous subsection, roots finding using Chien's search is as complex as finding syndromes. We will discuss a few optimization techniques in Section 4.3.5 with which the cycle consumption for decoding errors and erasures declines significantly.
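The erasure-only recursion can be sketched in C as shown below. This is an illustration under assumptions, not the book's listing: the table-based helpers and the primitive polynomial 0x11d are hypothetical, and the sketch follows the index convention of Pcode 4.21 (divisor $E_j - E_{j-i-1}$, syndromes starting at $S_0 = R(\alpha^0)$), in which the solved values are the error magnitudes themselves, so no final division by $E_i$ is applied. In GF($2^8$), subtraction is XOR.

```c
#include <stdint.h>

static uint8_t gf_exp[512];  /* antilog table, doubled to avoid a modulo */
static uint8_t gf_log[256];  /* log table; gf_log[0] is never used */

/* Build the tables for the assumed primitive polynomial 0x11d. */
static void gf_init(void) {
    unsigned x = 1;
    for (int i = 0; i < 255; i++) {
        gf_exp[i] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x <<= 1;
        if (x & 0x100) x ^= 0x11d;
    }
    for (int i = 255; i < 512; i++) gf_exp[i] = gf_exp[i - 255];
}

static uint8_t gf_mul(uint8_t a, uint8_t b) {
    if (a == 0 || b == 0) return 0;
    return gf_exp[gf_log[a] + gf_log[b]];
}

/* Division a / b for b != 0. */
static uint8_t gf_div(uint8_t a, uint8_t b) {
    if (a == 0) return 0;
    return gf_exp[gf_log[a] + 255 - gf_log[b]];
}

/* Solve sum_i e_i * x[i]^j = s[j], j = 0..L-1, in place by the
 * Bjorck-Pereyra recursion. x[] holds L distinct erasure locators;
 * on return s[i] holds the error magnitude e_i. */
static void bp_erasure_magnitudes(const uint8_t *x, uint8_t *s, int L) {
    for (int i = 0; i < L - 1; i++)
        for (int j = L - 1; j > i; j--)
            s[j] ^= gf_mul(x[i], s[j - 1]);
    for (int i = L - 2; i >= 0; i--)
        for (int j = i + 1; j < L; j++) {
            s[j] = gf_div(s[j], x[j] ^ x[j - i - 1]);
            s[j - 1] ^= s[j];
        }
}
```

After the call, each `s[i]` is XORed into the received symbol at the position of locator `x[i]`, which corresponds to the last loop of Pcode 4.21.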
4.3.4 Decoding Errors and Erasures

We may come across a received sequence that contains errors in addition to the erasure information, and in such cases we have to perform RS decoding to correct both errors and erasures. To locate the errors, we have to know the ELP. To build the ELP, we use the syndromes, which also contain information about the erasures. In other words, the computed ELP also has information about the erasures. But with the syndromes alone we can build an ELP only up to degree T, which is not useful, as we may already know more than T error locations from the erasures. Since we know the error locations of the erasures in advance, we can compute the erasure locator polynomial Er(x). Then we build the ELP on top of it using the Berlekamp-Massey algorithm. For this, we modify the Berlekamp-Massey algorithm slightly to accommodate erasures in the ELP.

Computation of Erasure Locator Polynomial

Given the L error locations $a_0, a_1, \ldots, a_{L-1}$, we compute the erasure locator polynomial as

$$Er(x) = (x + \alpha^{a_0})(x + \alpha^{a_1}) \cdots (x + \alpha^{a_{L-1}}) = x^L + u_{L-1}x^{L-1} + u_{L-2}x^{L-2} + \cdots + u_2 x^2 + u_1 x + u_0,$$

where $u_i \in$ GF($2^8$). The coefficients $u_i$ of Er(x) are obtained using Pcode 4.22. The polynomial Er(x) is computed iteratively by multiplying the factors one after another.
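The factor-by-factor construction of Er(x) can be sketched as follows. This is a simplified illustration, not a rewrite of Pcode 4.22: it keeps the coefficients as field elements rather than logarithms, and the function names and the primitive polynomial 0x11d are assumptions.

```c
#include <stdint.h>

/* Multiply in GF(2^8) by shift-and-add; 0x1d is the low byte of the
 * assumed primitive polynomial 0x11d. */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
        b >>= 1;
    }
    return p;
}

/* alpha^e for e >= 0, with alpha = 2. */
static uint8_t gf_pow_alpha(int e) {
    uint8_t r = 1;
    while (e-- > 0) r = gf_mul(r, 2);
    return r;
}

/* Horner evaluation; c[0] is the constant term. */
static uint8_t poly_eval(const uint8_t *c, int deg, uint8_t x) {
    uint8_t y = 0;
    for (int i = deg; i >= 0; i--) y = gf_mul(y, x) ^ c[i];
    return y;
}

/* u[0..n] receives the coefficients of prod_k (x + alpha^loc[k]),
 * lowest degree first; the polynomial is monic, so u[n] = 1. */
static void erasure_locator(const int *loc, int n, uint8_t *u) {
    u[0] = 1;
    for (int k = 0; k < n; k++) {
        uint8_t root = gf_pow_alpha(loc[k]);
        u[k + 1] = u[k];                        /* new leading coefficient */
        for (int j = k; j > 0; j--)             /* new[j] = old[j-1] + root*old[j] */
            u[j] = u[j - 1] ^ gf_mul(u[j], root);
        u[0] = gf_mul(u[0], root);
    }
}
```

For the two locations 0 and 1 this yields $(x + 1)(x + 2) = x^2 + 3x + 2$, and evaluating the result at each $\alpha^{a_k}$ with `poly_eval` returns zero, which is a convenient self-check.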
    // correct errors
    for(i = 0; i < L; i++) {
        r0 = Error_position[i];
        sxr[i] = Galois_aLog[r0];
    }
    for(i = 0; i < L-1; i++) {
        for(j = L-1; j > i; j--) {
            r0 = Error_position[i];
            r1 = log_Syndromes[j];
            r0 = r0 + r1;
            r0 = Galois_aLog[r0];
            r1 = Syndromes[j];
            r0 = r0 ^ r1;
            r1 = Galois_Log[r0];
            Syndromes[j] = r0;
            log_Syndromes[j+1] = r1;
        }
    }
    for(i = L-2; i >= 0; i--) {
        for(j = i + 1; j < L; j++) {
            r0 = sxr[j];
            r1 = sxr[j-i-1];
            r0 = r0 ^ r1;
            r0 = Galois_Log[r0];
            r1 = log_Syndromes[j+1];
            m = r1 - r0;
            if (m < 255) m += 255;          // as printed; the other listings test m < 0
            r0 = Galois_aLog[m];
            r1 = Syndromes[j-1];
            Syndromes[j] = r0;
            log_Syndromes[j+1] = m;
            r0 = r0 ^ r1;
            r1 = Galois_Log[r0];
            Syndromes[j-1] = r0;
            log_Syndromes[j] = r1;
        }
    }
    for(i = 0; i < L; i++) {
        m = Error_position[i];
        r0 = rec_msg[N-m-1];
        r1 = Syndromes[i];
        r0 = r0 ^ r1;
        rec_msg[N-m-1] = r0;
    }
Pcode 4.21: Error magnitudes computation and errors correction.

    erasure_poly[0] = erasure_loc[0];
    for(i = 1; i < p; i++) {
        r0 = erasure_poly[0];
        r3 = erasure_loc[i];
        r0 = Galois_aLog[r0];
        r1 = Galois_aLog[r3];
        r0 = r0 ^ r1;
        r6 = Galois_Log[r0];
        r2 = erasure_poly[i-1];
        r4 = 0;
        for(j = i; j > 1; j--) {
            r0 = r2;
            r2 = r2 + r3;
            r1 = Galois_aLog[r2];
            r2 = erasure_poly[j-2];
            r1 = r1 ^ r4;
            r5 = Galois_Log[r1];
            erasure_poly[j] = r5;
            r4 = Galois_aLog[r0];
        }
        r2 = r2 + r3;
        r1 = Galois_aLog[r2];
        r1 = r1 ^ r4;
        r5 = Galois_Log[r1];
        erasure_poly[1] = r5;
        erasure_poly[0] = r6;
    }
Pcode 4.22: Erasure polynomial computation.

Modified Berlekamp-Massey Algorithm

With the modified Berlekamp-Massey algorithm, we compute the ELP, $\Lambda(x)$, using Er(x) as the initial polynomial. The connection polynomial T(x) is also initialized with Er(x). The value of L gives the degree of the initial ELP. The simulation code for the modified Berlekamp-Massey algorithm is given in Pcode 4.23. With this ELP, we can locate up to (2T − L)/2 additional errors.
For example, if the parity length is 2T = 64 and the number of erasures is L = 60, then we can obtain two additional error locations by building the ELP from Er(x).

    for(k = q; k < 2*T; k++){
        for(i = 0;i < k;i++)
            Conn_poly1[i] = Conn_poly0[i];      // Conn_poly_temp = Conn_poly
        // compute discrepancy delta by convolution of Syndrome poly and Conn_poly
        r0 = Syndromes[k];
        for(i = 0;i < L;i++){
            r1 = log_Syndromes[k-i]; r2 = Conn_poly0[i];
            r2 = Galois_Log[r2]; r1 = r1 + r2;
            r2 = Galois_aLog[r1]; r0 = r0 ^ r2;
        }
        if (r0 != 0) {                          // Conn_poly = Conn_poly_temp - Delta*Tx
            r2 = Galois_Log[r0];                // log (delta)
            for(i = 0;i < 2*T + 1;i++) {
                r1 = Conn_poly1[i]; r3 = Tx[i+1];
                r3 = r2 + r3; r3 = Galois_aLog[r3];
                r1 = r3 ^ r1;                   // Conn_poly_temp[i]^Delta*Tx[i]
                Conn_poly0[i] = r1;
            }
            if (2*L < (q + k + 1)){
                L = q + k + 1 - L; m = 255 - r2; Tx[0] = m;
                for(i = 0;i < 2*T;i++){         // Tx = Conn_poly_temp/Delta
                    r1 = Conn_poly1[i]; r1 = Galois_Log[r1];
                    m = r1 - r2;
                    if (m < 0) m+= 255;
                    Tx[i+1] = m;
                }
            }
        }
        for(i = 2*T + 1;i > 0;i--)              // Tx = [0 Tx]
            Tx[i] = Tx[i-1];
        Tx[0] = log0;
    }

Pcode 4.23: Berlekamp-Massey recursion to compute ELP with erasures.

Errors and Erasures Correction

Given the ELP that contains both the errors and erasures information, we can use the simulation codes from Pcode 4.16 to 4.18 to perform Chien's search (which finds the error locator polynomial roots), to find the error magnitude polynomial, and to perform the Forney algorithm (which gives the error magnitudes).

Computational Complexity with Errors Correction

To correct both errors and erasures, we have to compute the erasure locator polynomial, the error locator polynomial, the error roots, and the error magnitude polynomial. These extra computations are not required if only erasures are present in the received data. We estimate the complexity of these modules as follows. If we have L erasures and (2T − L)/2 errors, then to compute the erasure locator polynomial we consume L ∗ 14 + 8 ∗ L ∗ (L + 1)/2 cycles.
For example, if L = 60, then we consume 15,480 cycles to compute the erasure locator polynomial. Based on Section 4.2.7, RS Decoder Computational Complexity, the cycle count for computing the error locator polynomial from the erasure locator polynomial, assuming Li = 61, is 3368; we consume 173,910 cycles for finding the error roots, approximately 2000 cycles for computing the error magnitude polynomial, and 36,332 cycles for finding the error magnitudes. In total, we consume about 328,090 cycles (97,000 to find the syndromes + 15,480 to compute the erasure locator polynomial + 3368 to compute the remaining error locator polynomial + 173,910 to find the roots + 2000 to find the error magnitude polynomial + 36,332 to find the error magnitudes) to run the RS erasure decoder on the reference embedded processor for correcting 60 erasures and 2 errors.

4.3.5 Erasure Decoder Optimization

As we saw in the previous section, the two most cycle-consuming modules in RS decoding are syndrome computation and error locator polynomial roots finding. The reason is that the complexity of these two modules depends both on the block length N and on the error-correction capability T of the RS code. By contrast, the complexity of determining the error locator polynomial and the error values is a function of T only. Usually the value of N is much larger than T. The two modules are similar in that both entail the evaluation of polynomials at particular elements of the extension field. The technique most often employed to carry out these two modules is Horner's method of polynomial evaluation. The high cost of syndrome computation and error roots finding can be traced to the iterative nature of this computation procedure.
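To make the cost argument concrete, Horner's method for the syndromes can be sketched as follows. This is a minimal illustration, not the book's implementation: the primitive polynomial 0x11D, the direct (non-log-domain) arithmetic, and the names gf_init, gf_mul and rs_syndromes are assumptions. The received vector is taken highest-degree coefficient first, and every syndrome repeats the same pass over all symbols, which is where the dependence on both the block length and the number of syndromes comes from.

```c
#include <assert.h>
#include <stdint.h>

#define GF_ORDER 255   /* number of nonzero elements of GF(2^8) */
#define GF_POLY  0x11D /* ASSUMED primitive polynomial */

static uint8_t gf_exp[2 * GF_ORDER]; /* antilog table, doubled to avoid a modulo */
static uint8_t gf_log[256];          /* log table; gf_log[0] is never used */

/* Build log/antilog tables for the primitive element alpha = 2. */
static void gf_init(void)
{
    unsigned x = 1;
    for (int i = 0; i < GF_ORDER; i++) {
        gf_exp[i] = (uint8_t)x;
        gf_exp[i + GF_ORDER] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x <<= 1;
        if (x & 0x100) x ^= GF_POLY;
    }
}

/* Multiply two field elements through the log tables. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    if (a == 0 || b == 0) return 0;
    return gf_exp[gf_log[a] + gf_log[b]];
}

/* Compute synd[i] = R(alpha^i) for i = 0..num-1 by Horner's rule.
 * rec[0] is the highest-degree coefficient, rec[len-1] the constant term.
 * Cost: num passes over len symbols, one multiply and one XOR per symbol. */
static void rs_syndromes(const uint8_t *rec, int len, int num, uint8_t *synd)
{
    assert(num >= 0 && num <= GF_ORDER);
    for (int i = 0; i < num; i++) {
        uint8_t x = gf_exp[i]; /* evaluation point alpha^i */
        uint8_t acc = 0;
        for (int j = 0; j < len; j++)
            acc = gf_mul(acc, x) ^ rec[j]; /* one Horner step */
        synd[i] = acc;
    }
}
```

For an RS(255, 191) codeword one would call rs_syndromes(rec, 255, 64, synd) after gf_init(); all 64 outputs are zero exactly when the received vector has the 64 consecutive roots that the code's generator polynomial imposes.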
The set of computations carried out for finding a particular syndrome or error root is repeated in finding the other syndromes or error roots too. Moreover, for the erasure code RS(255,191) with T = 32, we evaluate 64 syndromes and up to 63 (i.e., 62 erasures + 1 error) error roots. This is like computing the DFT (discrete Fourier transform; see Section 7.1) without using the FFT. In DFT computation, we evaluate one frequency component at a time, whereas with the FFT method, using the periodicity and symmetry properties of the DFT twiddle factors, we evaluate all frequency components together. Here, in RS decoding, if all the syndromes or all the error roots can be computed together, then a reduction in complexity is perhaps possible. In the following subsections, we examine the syndromes and error roots computation from the spectral point of view.

FFT-Based Computation

Consider the syndrome computation:

$$S_i = R(\alpha^i) = \sum_{j=0}^{N-1} r_j (\alpha^i)^j, \quad i = 0, 1, 2, \ldots, 2T-1 \qquad (4.25)$$

Since α is an element of order N, these equations may be interpreted as a DFT, with the syndromes representing 2T contiguous components of the spectrum of the received polynomial R(x). Thus, an alternative way to compute the syndromes is to perform an FFT on R(x) and discard the unwanted spectral components. Similarly, roots finding can be performed in the spectral domain. Given the error locator polynomial Λ(x), the roots of Λ(x) are those powers of α that satisfy

$$\Lambda(\alpha^i) = \sum_{j=0}^{L-1} \Lambda_j (\alpha^i)^j = 0 \qquad (4.26)$$

or

$$\sigma_i \cong \Lambda(\alpha^i) = \sum_{j=0}^{N-1} \Lambda_j (\alpha^i)^j, \quad \text{with } \Lambda_L = \Lambda_{L+1} = \Lambda_{L+2} = \cdots = \Lambda_{N-2} = \Lambda_{N-1} = 0 \qquad (4.27)$$

In Equation (4.27), the error locations are given by the indices i for which σ_i = 0. If the spectrum is zero at index i, then Λ(x) has a root at α^i. This provides another way to compute the roots of the error locator polynomial. If we observe carefully, the syndrome computation uses only a few outputs of the FFT and the error roots finding uses only a few inputs of the FFT.
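The spectral view of roots finding can be illustrated by evaluating the locator polynomial at every power of alpha and collecting the indices where the result is zero. The sketch below is a direct, unpruned evaluation (the DFT without the FFT, in the text's terms) and is only meant to show what Equation (4.27) computes. The primitive polynomial 0x11D and the names gf_init, gf_mul and spectral_nulls are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define GF_ORDER 255   /* N = 255 for GF(2^8) */
#define GF_POLY  0x11D /* ASSUMED primitive polynomial */

static uint8_t gf_exp[2 * GF_ORDER]; /* antilog table, doubled to avoid a modulo */
static uint8_t gf_log[256];          /* log table; gf_log[0] is never used */

/* Build log/antilog tables for the primitive element alpha = 2. */
static void gf_init(void)
{
    unsigned x = 1;
    for (int i = 0; i < GF_ORDER; i++) {
        gf_exp[i] = (uint8_t)x;
        gf_exp[i + GF_ORDER] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x <<= 1;
        if (x & 0x100) x ^= GF_POLY;
    }
}

/* Multiply two field elements through the log tables. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    if (a == 0 || b == 0) return 0;
    return gf_exp[gf_log[a] + gf_log[b]];
}

/* Evaluate sigma_i = Lambda(alpha^i) for i = 0..N-1 and store every index i
 * with sigma_i == 0 in nulls[] (which must hold at least ncoef-1 entries).
 * lambda[j] multiplies x^j; coefficients beyond ncoef are implicitly zero.
 * Returns the number of spectral nulls found. */
static int spectral_nulls(const uint8_t *lambda, int ncoef, int *nulls)
{
    int count = 0;
    assert(ncoef > 0);
    for (int i = 0; i < GF_ORDER; i++) {
        uint8_t x = gf_exp[i];
        uint8_t acc = 0;
        for (int j = ncoef - 1; j >= 0; j--)
            acc = gf_mul(acc, x) ^ lambda[j]; /* Horner step */
        if (acc == 0)
            nulls[count++] = i;
    }
    return count;
}
```

As a small check, the polynomial (x + alpha^3)(x + alpha^7) = x^2 + 136x + 116 over this field has nulls at indices 3 and 7 only. The nulls come out in increasing index order here because the evaluation is sequential; an FFT-based version generally produces them in a permuted order.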
To reduce the cost of the FFT method, we use FFT output pruning for syndrome computation and FFT input pruning for error roots finding. See Sections 7.2.5 and 7.2.6 for more detail on input and output pruning. However, we may not benefit much from the FFT method when T is small. As the RS(255, 191) erasure decoder requires the computation of a sizable number of syndromes and error roots, it benefits more from the FFT method.

FFT-Based Implementation

Since 255 = 15 × 17 and 15 and 17 are relatively prime, we can map the one-dimensional FFT into a twiddle-factor-free two-dimensional FFT via the Good-Thomas mapping. To compute the two-dimensional transform, we compute the row/column DFT transform followed by the column/row DFT transform. In the case of syndrome computation, as we want to perform output pruning with the FFT, the way to compute the transform is to perform the 17 15-point column transforms followed by point-wise evaluation of the 64 points along the rows. Similarly, in the case of error roots finding, we first perform the 17-point row transforms by straightforward multiplications, and then perform the 17 15-point column transforms. In addition, 15 = 3 × 5, and 3 and 5 are relatively prime; we further simplify the 15-point transform by mapping the one-dimensional 15-point FFT into a twiddle-factor-free two-dimensional FFT via the Good-Thomas mapping as described in the following.
The Good-Thomas FFT is given by

$$X[n_1, n_2] = \sum_{k_1=0}^{N_1-1} \sum_{k_2=0}^{N_2-1} W_{N_1}^{n_1 k_1}\, x[k_1, k_2]\, W_{N_2}^{n_2 k_2} \qquad (4.28)$$

If N = N_1 N_2, where N_1 and N_2 are relatively prime, then at the input of the FFT, the two-dimensional vertical and horizontal indices k_1 and k_2 and the one-dimensional index k are related by

$$k = (N_2 k_1 + N_1 k_2) \bmod N, \quad \text{where } k_i = (k) \bmod N_i \qquad (4.29)$$

At the output side of the FFT, we use m_i, which is defined as m_i = N/N_i and satisfies $m_i m_i^{-1} \equiv 1 \bmod N_i$, to get the relationship between the one-dimensional and two-dimensional frequency indices:

$$n = \left(m_1\left((m_1^{-1} n_1) \bmod N_1\right) + m_2\left((m_2^{-1} n_2) \bmod N_2\right)\right) \bmod N, \quad \text{where } n_i = (n) \bmod N_i \qquad (4.30)$$

■ Example 4.1

For N = 15, N_1 = 3, and N_2 = 5, it follows that:

    m_1 = N/N_1 = 5;  m_1^{-1} = 2   ∵ (2∗5) mod 3 = 1
    m_2 = N/N_2 = 3;  m_2^{-1} = 2   ∵ (2∗3) mod 5 = 1

Table 4.1 contains the time-domain (i.e., input to FFT) and frequency-domain (i.e., output of FFT) one-dimensional and two-dimensional index relationships using Equations (4.29) and (4.30). Using Equation (4.28), we can compute the two-dimensional FFT using N_1 DFTs of length N_2 and N_2 DFTs of length N_1 without an intermediate twiddle factor correction. In the implementation, we use look-up tables for getting the indices for input pruning and output pruning, for obtaining the one-dimensional and two-dimensional FFT mapping indices, and for storing the twiddle factors. The simulation code for computing the 15-point FFT using the two-dimensional 3x5 FFT is given in Pcode 4.24. ■

Look-up tables tp_w[ ] and fp_w[ ] on this book's companion website were used to compute the 3-point and 5-point FFTs. The look-up tables for input and output indexing to compute the 17 3x5 FFTs are also on the website.
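The index relationships of Equations (4.29) and (4.30) are easy to generate programmatically, which is one way such look-up tables could be filled. The sketch below is an illustration only; the function names mod_inverse, gt_input_index and gt_output_index are hypothetical and are not the book's table names. For N1 = 3 and N2 = 5 it reproduces the entries listed in Table 4.1.

```c
#include <assert.h>

/* Modular inverse of a modulo m by exhaustive search (m is tiny here).
 * Returns 0 when no inverse exists. */
static int mod_inverse(int a, int m)
{
    for (int x = 1; x < m; x++)
        if ((a * x) % m == 1)
            return x;
    return 0;
}

/* Input-side map: k = (N2*k1 + N1*k2) mod N, as in Equation (4.29). */
static int gt_input_index(int k1, int k2, int len1, int len2)
{
    assert(k1 >= 0 && k1 < len1 && k2 >= 0 && k2 < len2);
    return (len2 * k1 + len1 * k2) % (len1 * len2);
}

/* Output-side map of Equation (4.30), with m_i = N / N_i and
 * m_i_inv the inverse of m_i modulo N_i. */
static int gt_output_index(int n1, int n2, int len1, int len2)
{
    int n = len1 * len2;
    int m1 = n / len1;
    int m2 = n / len2;
    int m1_inv = mod_inverse(m1 % len1, len1);
    int m2_inv = mod_inverse(m2 % len2, len2);
    assert(n1 >= 0 && n1 < len1 && n2 >= 0 && n2 < len2);
    assert(m1_inv != 0 && m2_inv != 0); /* lengths must be relatively prime */
    return (m1 * ((m1_inv * n1) % len1) + m2 * ((m2_inv * n2) % len2)) % n;
}
```

With len1 = 3 and len2 = 5, gt_input_index(2, 2, 3, 5) returns 1 and gt_output_index(1, 1, 3, 5) returns 1, matching the x[1] and X[1] entries of the table. Each map visits every index from 0 to 14 exactly once, which is what makes the two-dimensional transform free of twiddle-factor corrections.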
Table 4.1: FFT input and output side of Good-Thomas mapping

    Input                              Output
    k1\k2   0      1      2            n1\n2   0      1      2
    0       x[0]   x[5]   x[10]        0       X[0]   X[10]  X[5]
    1       x[3]   x[8]   x[13]        1       X[6]   X[1]   X[11]
    2       x[6]   x[11]  x[1]         2       X[12]  X[7]   X[2]
    3       x[9]   x[14]  x[4]         3       X[3]   X[13]  X[8]
    4       x[12]  x[2]   x[7]         4       X[9]   X[4]   X[14]

    for(p = 0;p < 17;p++){
        r7 = p*m; r6 = 0;
        for(k = 0;k < 3;k++){                   // compute three horizontal 5-point FFTs
            for(i = 0;i < 5;i++){
                r5 = 0;
                for(j = 0;j < 5;j++){
                    r4 = input_index_fp[r7+k*5+j]; r2 = 5*i+j;
                    r3 = rec_msg[N-r4-1]; r2 = fp_w[r2];
                    r3 = Galois_Log[r3]; r0 = r2 + r3;
                    r0 = Galois_aLog[r0]; r5 = r5^r0;
                }
                sxc[r7+r6] = r5; r6 = r6 + 1;
            }
        }
        r6 = 0;
        for(k = 0;k < 5;k++){                   // compute five vertical 3-point FFTs
            for(i = 0;i < 3;i++){
                r5 = 0;
                for(j = 0;j < 3;j++){
                    r4 = sxc[r7+k+j*5]; r2 = 3*i+j;
                    r4 = Galois_Log[r4]; r2 = tp_w[r2];
                    r0 = r2 + r4;
                    r0 = Galois_aLog[r0]; r5 = r5 ^ r0;
                }
                r2 = output_index_fp[r6]; sxr[r7+r2] = r5; r6 = r6 + 1;
            }
        }
    }

Pcode 4.24: Simulation code for 15-point two dimensional FFT computations.

    j = 0;
    for(k = 0;k < 2*T;k++){
        r3 = sxn1[k]; r4 = sxn2[k]; r5 = 0;
        for(i = 0;i < n;i++){
            r1 = sxr[m*i+r4]; r2 = sp_w[r3*n+i];
            r1 = Galois_Log[r1]; r0 = r1 + r2;
            r0 = Galois_aLog[r0]; r5 = r5 ^ r0;
        }
        Syndromes[j] = r5; j = j + 1;
    }

Pcode 4.25: Syndromes computation with output pruning.

Output Pruning and Syndrome Computation

With Pcode 4.24, we perform the 17 15-point column FFT transforms. Next, using the column FFT outputs, we perform the 17-point row FFTs only for the required spectral outputs. The simulation code to perform the 17-point FFT with output pruning is given in Pcode 4.25. We use the look-up tables sxn1[ ] and sxn2[ ] to choose the corresponding input points. We use the look-up table sp_w[ ] for storing the 17-point FFT twiddle factors. (See the companion website for all three of the look-up tables.)
Input Pruning and Error Location Computation

In the FFT-based error locations computation, we first perform the 17-point row DFT with input pruning, followed by the 17 15-point column transforms. We use the look-up tables sxn1[ ] and sxn2[ ] for input pruning, and the look-up tables ern0[ ] and ern1[ ] on the website for computing the 17-point FFT at the points of interest. Once we have the 17-point FFT row transform output, we use Pcode 4.24 to compute the 15-point FFT with 3x5 two-dimensional FFTs. Next, we search for spectral nulls in the FFT output; the indices of those spectral nulls give the error locations.

    m = 15; sxe[0] = 1;
    for(i = 1;i < L + 1;i++){
        r2 = sxn1[i]; r3 = sxn2[i];
        r4 = Conn_poly0[i-1]; sxe[r2*m+r3] = r4;
    }
    // seventeen point DFT with input pruning
    for(k = 0;k < 6;k++){
        for(i = 0;i < n;i++){
            r5 = 0;
            for(j = 0;j < 5;j++){
                r4 = ern0[k*5+j]; r2 = r4*n+i;
                r3 = sxe[r4*m+k]; r2 = sp_w[r2];
                r3 = Galois_Log[r3]; r0 = r2 + r3;
                r0 = Galois_aLog[r0]; r5 = r5^r0;
            }
            sxc[i*m+k] = r5;
        }
    }
    for(k = 6;k < m;k++){
        for(i = 0;i < n;i++){
            r5 = 0;
            for(j = 0;j < 4;j++){
                r4 = ern1[(k-6)*4+j]; r2 = r4*n+i;
                r3 = sxe[r4*m+k]; r2 = sp_w[r2];
                r3 = Galois_Log[r3]; r0 = r2 + r3;
                r0 = Galois_aLog[r0]; r5 = r5^r0;
            }
            sxc[i*m+k] = r5;
        }
    }

Pcode 4.26: 17-point DFT with input pruning to compute error locations.

FFT-Based Polynomial Evaluation and Computational Complexity

To compute the 15-point DFT, we use the two-dimensional 3x5 FFT as given in Pcode 4.24. To compute three 5-point FFTs we consume approximately 630 cycles, and to compute five 3-point FFTs we consume another 450 cycles. With this, to compute the 15-point FFT we need 1080 cycles, and we require 18,360 (= 1080 ∗ 17) cycles to compute the 17 15-point FFTs on the reference embedded processor. To compute one syndrome using the 17-point FFT with output pruning (Pcode 4.25), we consume about 110 cycles. In this way, we consume 7040 (= 110 ∗ 64) cycles to compute 64 syndromes.
Thus, the total for syndrome computation is about 25,400 cycles (= 18,360 + 7040), which is far less than Horner's method of syndrome evaluation, which consumes about 97,000 cycles. With the FFT method, error locations finding consumes about the same number of cycles as syndrome computation. Based on this, the total RS erasure and errors decoding with the FFT implementation consumes about 107,980 cycles (= 2 ∗ 25,400 + 15,480 + 3368 + 2000 + 36,332), which is about 1/3 of the cycles of the non-FFT-based implementation, which consumes about 328,090 cycles.

4.3.6 RS Erasure Decoding Simulation Results

As part of the simulations, we consider a codeword M from the matrix in Figure 4.9; this codeword is one complete row of that matrix. Let its 255 values be as follows:

M = 0x1C, 0x11, 0xC2, 0x4F, 0xD1, 0x27, 0x6E, 0x3D, 0xC6, 0x01, 0x8D, 0x3F, 0x66, 0x5A, 0x40, 0x1A, 0x68, 0x80, 0x07, 0x4B, 0xF0, 0x0A, 0x4A, 0x63, 0x57, 0x82, 0xE6, 0x03, 0x3A, 0xAA, 0xBD, 0xCF, 0x7A, 0xC3, 0x72, 0xBE, 0x53, 0xF1, 0x52, 0xC4, 0x9A, 0x22, 0xDF, 0x6B, 0xA9, 0xAF, 0x06, 0xA1, 0x4C, 0x20, 0xC2, 0x2F, 0x53, 0x91, 0x76, 0x39, 0x29, 0x19, 0x7B, 0x6C, 0x95, 0xEF, 0x70, 0xB4, 0xE7, 0x7A, 0xF7, 0x68, 0xD6, 0xD0, 0xC5, 0x82, 0xA6, 0xD7, 0x7E, 0xEC, 0x49, 0x79, 0xBB, 0x09, 0x70, 0x19, 0xB6, 0x6E, 0xC1, 0xD1, 0xF5, 0x04, 0x78, 0x00, 0xB3, 0xAE, 0x04, 0x24, 0x65, 0xC6, 0x34, 0xBF, 0x57, 0x2F, 0x8D, 0xF1, 0x7D, 0x3D, 0xC1, 0x40, 0x6E, 0x75, 0x04, 0xDE, 0xBF, 0x69, 0x88, 0xCD, 0x42, 0x98, 0xAB, 0xAC, 0xD3, 0x7E, 0x98, 0x63, 0x78, 0x22, 0x77, 0x4F, 0x36, 0x7D, 0x19, 0x71, 0xAD, 0xAC, 0x70, 0x1C, 0x00, 0x29, 0x81, 0xC9, 0x8C, 0x57, 0x62, 0x01, 0xB8, 0xA7, 0xB3, 0x32, 0xBE, 0x57, 0x2C, 0x69, 0xC1, 0xB1, 0x07, 0xFD, 0xDC, 0xCA, 0xDA, 0xC3, 0x3B, 0xE1, 0x13, 0x21, 0x55, 0x51, 0x67, 0x38, 0x65, 0x7F, 0xDE, 0xF6, 0x5E, 0x09, 0xDC, 0xD5, 0xE4, 0x32, 0x35, 0xD0, 0x66, 0x5C, 0xF2, 0x1A, 0xFD, 0x62, 0x4B, 0x5B, 0x0E, 0x05, 0xE5, 0x43, 0x1E, 0x1B, 
0xE5, 0x7B, 0x2B, 0x47, 0x15, 0x62, 0xA1, 0x57, 0x07, 0xC6, 0x34, 0x0A, 0xC9, 0x16, 0x96, 0xDC, 0x95, 0xE5, 0xEE, 0x69, 0x66, 0xC6, 0xA4, 0x2A, 0x1F, 0x93, 0x4F, 0xE7, 0xA1, 0x89, 0x9B, 0xB8, 0x7B, 0x01, 0xBA, 0x2E, 0x6A, 0x88, 0x83, 0xD9, 0x77, 0x87, 0xBE, 0x9C, 0x92, 0xD3, 0x13, 0x09, 0x9B, 0x92, 0x15, 0xD7, 0x98, 0x61, 0xBA, 0x03, 0xEE, 0xF7, 0xC3, 0xEA, 0xF9, 0xDD, 0x1D Out of 255 bytes, the ﬁrst 191 bytes belong to payload (or data section) and the next 64 bytes belongs to parity (or RS section). The parity bytes are bolded, italicized hexadecimal numbers. Next, at the receiver, this codeword is received (after arranging the total frame into matrix form, checking parity check, tagging the error columns and extracting that particular row from matrix) as R. The codeword R contains a total of 60 erasures (meaning that the locations are known for these 60 errors) and two errors (for these two errors we don’t have error location information). The total incorrect bytes present in the R is 62 and this is also the RS(255, 191) erasure coder maximum correction capability (since erasure coder correction capability = L + (64 − L )/2 = 60 + (64 − 60)/2 = 62). All 60 erasure bytes are highlighted with underscores and the two-error bytes are highlighted with underscored bold numbers. 
R= 0x1c, 0x11, 0xc2, 0x4f, 0xd1, 0x27, 0x6e, 0x3d, 0xc6, 0x01, 0x8d, 0x3f, 0x66, 0x5a, 0x40, 0x1a, 0x68, 0x80, 0x00, 0xba, 0xf0, 0x0a, 0x00, 0x63, 0x57, 0x82, 0x00, 0x03, 0x3a, 0xaa, 0x00, 0xcf, 0x7a, 0xc3, 0x00, 0xbe, 0x53, 0xf1, 0x00, 0xc4, 0x9a, 0x22, 0x00, 0x6b, 0xa9, 0xaf, 0x00, 0xa1, 0x4c, 0x20, 0x00, 0x2f, 0x53, 0x91, 0x00, 0x39, 0x29, 0x19, 0x00, 0x6c, 0x95, 0xef, 0x00, 0xb4, 0xe7, 0x7a, 0x00, 0x68, 0xd6, 0xd0, 0x00, 0x82, 0xa6, 0xd7, 0x00, 0xec, 0x49, 0x79, 0x00, 0x09, 0x70, 0x19, 0x00, 0x6e, 0xc1, 0xd1, 0x00, 0x04, 0x78, 0x00, 0x00, 0xae, 0x04, 0x24, 0x00, 0xc6, 0x34, 0xbf, 0x00, 0x2f, 0x8d, 0xf1, 0x00, 0x3d, 0xc1, 0x40, 0x00, 0x75, 0x04, 0xde, 0x00, 0x69, 0x88, 0xcd, 0x00, 0x98, 0xab, 0xac, 0x00, 0x7e, 0x98, 0x63, 0x00, 0x22, 0x77, 0x4f, 0x00, 0x7d, 0x19, 0x71, 0x00, 0xac, 0x70, 0x1c, 0x00, 0x29, 0x81, 0xc9, 0x00, 0x57, 0x62, 0x01, 0x00, 0xa7, 0xb3, 0x32, 0x00, 0x57, 0x2c, 0x69, 0x00, 0xb1, 0x07, 0xfd, 0x00, 0xca, 0xda, 0xc3, 0x00, 0xe1, 0x13, 0x21, 0x00, 0x51, 0x67, 0x38, 0x00, 0x7f, 0xde, 0xf6, 0x00, 0xad, 0xdc, 0xd5, 0x00, 0x32, 0x35, 0xd0, 0x00, 0x5c, 0xf2, 0x1a, 0x00, 0x62, 0x4b, 0x5b, 0x00, 0x05, 0xe5, 0x43, 0x00, 0x1b, 0xe5, 0x7b, 0x00, 0x47, 0x15, 0x62, 0x00, 0x57, 0x07, 0xc6, 0x00, 0x0a, 0xc9, 0x16, 0x00, 0xdc, 0x95, 0xe5, 0x00, 0x69, 0x66, 0xc6, 0x00, 0x2a, 0x1f, 0x93, 0x00, 0xe7, 0xa1, 0x89, 0x00, 0xb8, 0x7b, 0x01, 0x00, 0x2e, 0x6a, 0x88, 0x00, 0xd9, 0x77, 0x87, 0x00, 0x9c, 0x92, 0xd3, 0x00, 0x09, 0x9b, 0x92, 0x00, 0xd7, 0x98, 0x61, 0x00, 0x03, 0xee, 0xf7, 0x00, 0xea, 0xf9, 0xdd, 0x00, The erasure locations Er follow: Note that the location indexing starts from the end of vector R since its corresponding polynomial R(x ) = r254x 254 + r253x 253 + · · · + r1 x + r0 in vector form is represented as R = [r254, r253, . . . , r1, r0]. 
Er = 0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64, 68, 72, 76, 80, 84, 88, 92, 96, 100, 104, 108, 112, 116, 120, 124, 128, 132, 136, 140, 144, 148, 152, 156, 160, 164, 168, 172, 176, 180, 184, 188, 192, 196, 200, 204, 208, 212, 216, 220, 224, 228, 232, 236 The coefﬁcients {Er60, Er59, . . . , Er2, Er1, Er0} of 60th degree erasure locator polynomial Er[x] (corresponding to erasure vector Er) computed using Pcode 4.22 follow: 0x01, 0xf6, 0x5, 0xd, 0x46, 0x25, 0x50, 0xa6, 0xc9, 0x42, 0xc9, 0xce, 0x96, 0xf0, 0xa4, 0xce, 0x24, 0x81, 0xf9, 0x47, 0x8f, 0x4d, 0x9, 0x1d, 0xa0, 0xc, 0x3d, 0xa7, 0xce, 0xa4, 0x31, 0x2e, 0xca, 0x52, 0x49, 0xdc, 0xc8, 0x2e, 0xbc, 0x7a, 0xe1, 0xee, 0x58, 0x3b, 0xf9, 0xe7, 0x11, 0xe8, 0xb6, 0x20, 0x92, 0x6c, 0x7c, 0x3, 0xd0, 0xbc, 0x5f, 0x22, 0x18, 0xb8, 0x64 The syndrome values Si for received codeword R computed with the FFT-based method using Pcode 4.24 and 4.25 follow: S= 0x10, 0x45, 0xa8, 0x9a, 0x7c, 0xb0, 0x5d, 0x2f, 0xc8, 0xd8, 0xad, 0xc9, 0xc6, 0x19, 0x70, 0x36, 0xbb, 0xfb, 0x6e, 0x1c, 0x00, 0x25, 0xda, 0x9d, 0x00, 0xe9, 0x5d, 0x39, 0x98, 0xb7, 0x28, 0xff, 0xad, 0x84, 0x74, 0xe4, 0xf5, 0x35, 0xdc, 0x8a, 0x7c, 0x87, 0x18, 0xcf, 0x7d, 0xc6, 0x9, 0xd3, 0xe0, 0x8f, 0xe8, 0x13, 0x1d, 0x4, 0xd2, 0x9c, 0xf6, 0x53, 0x70, 0xa9, 0x4d, 0x46, 0x89, 0x64 As the errors are present along with erasures in the received codeword R, we have to compute the effective error-erasure locator polynomial. 
The coefficients Ei of the error-erasure locator polynomial computed with the modified Berlekamp-Massey algorithm using Pcode 4.23 follow:

Ei = 0x01, 0xa6, 0x34, 0xcb, 0xee, 0x1f, 0x41, 0xf0, 0x64, 0x58, 0x61, 0x82, 0xb6, 0xfa, 0xb4, 0x4, 0xcc, 0xff, 0xe1, 0x3f, 0x71, 0x5a, 0x78, 0xa6, 0xbe, 0xd, 0x1d, 0x74, 0xfb, 0xb2, 0x20, 0xbc, 0xa5, 0x87, 0x76, 0x3d, 0x7f, 0x2e, 0x39, 0x50, 0x23, 0xf0, 0x52, 0x39, 0x84, 0xa4, 0x52, 0x79, 0x6b, 0x80, 0xa4, 0x53, 0x33, 0x8f, 0x8b, 0xfd, 0xdf, 0x3f, 0xa6, 0xad, 0xa9, 0x52, 0x8

Once we know the error-erasure locator polynomial, we compute the 62-element error position vector (this is not necessary when only erasures are present) with the FFT-based method using Pcode 4.26. The error positions vector Ep follows:

Ep = 0, 204, 136, 68, 172, 104, 36, 208, 140, 72, 4, 176, 108, 40, 144, 76, 8, 212, 180, 112, 44, 148, 80, 12, 216, 116, 48, 235, 184, 84, 16, 220, 152, 120, 52, 188, 88, 20, 224, 156, 56, 192, 124, 24, 228, 160, 92, 60, 196, 128, 28, 232, 164, 96, 200, 132, 64, 236, 168, 100, 83, 32

Note that the error positions output by the FFT method are not in order. Using the syndromes and error positions, we compute the error magnitudes Emi using Pcode 4.21. The error magnitude vector Em follows. The error magnitudes are also not in order; however, they do correspond to the error positions.

Em = 0x1d, 0xc2, 0xd3, 0xe, 0xb6, 0xc1, 0x4f, 0x6, 0x42, 0xfd, 0xc3, 0xbb, 0xbe, 0xa4, 0xbf, 0x66, 0xba, 0xdf, 0x7e, 0xb8, 0xee, 0x6e, 0xe4, 0x15, 0x52, 0x8c, 0x96, 0xf1, 0xc5, 0x5e, 0x13, 0x72, 0x7d, 0x0, 0x34, 0xf7, 0x65, 0xbe, 0xbd, 0x57, 0xa1, 0x70, 0xad, 0x83, 0xe6, 0x65, 0x55, 0x2b, 0x7b, 0x36, 0xba, 0x4a, 0xb3, 0x3b, 0x76, 0x78, 0x1e, 0x7, 0xf5, 0xdc, 0xa4, 0x9b

The RS(255, 191) erasure-decoder corrected output follows. All corrected bytes are highlighted with bold hexadecimal numbers.
M= 0x1C, 0x11, 0xC2, 0x4F, 0xD1, 0x27, 0x6E, 0x3D, 0xC6, 0x01, 0x8D, 0x3F, 0x66, 0x5A, 0x40, 0x1A, 0x68, 0x80, 0x07, 0x4B, 0xF0, 0x0A, 0x4A, 0x63, 0x57, 0x82, 0xE6, 0x03, 0x3A, 0xAA, 0xBD, 0xCF, 0x7A, 0xC3, 0x72, 0xBE, 0x53, 0xF1, 0x52, 0xC4, 0x9A, 0x22, 0xDF, 0x6B, 0xA9, 0xAF, 0x06, 0xA1, 0x4C, 0x20, 0xC2, 0x2F, 0x53, 0x91, 0x76, 0x39, 0x29, 0x19, 0x7B, 0x6C, 0x95, 0xEF, 0x70, 0xB4, 0xE7, 0x7A, 0xF7, 0x68, 0xD6, 0xD0, 0xC5, 0x82, 0xA6, 0xD7, 0x7E, 0xEC, 0x49, 0x79, 0xBB, 0x09, 0x70, 0x19, 0xB6, 0x6E, 0xC1, 0xD1, 0xF5, 0x04, 0x78, 0x00, 0xB3, 0xAE, 0x04, 0x24, 0x65, 0xC6, 0x34, 0xBF, 0x57, 0x2F, 0x8D, 0xF1, 0x7D, 0x3D, 0xC1, 0x40, 0x6E, 0x75, 0x04, 0xDE, 0xBF, 0x69, 0x88, 0xCD, 0x42, 0x98, 0xAB, 0xAC, 0xD3, 0x7E, 0x98, 0x63, 0x78, 0x22, 0x77, 0x4F, 0x36, 0x7D, 0x19, 0x71, 0xAD, 0xAC, 0x70, 0x1C, 0x00, 0x29, 0x81, 0xC9, 0x8C, 0x57, 0x62, 0x01, 0xB8, 0xA7, 0xB3, 0x32, 0xBE, 0x57, 0x2C, 0x69, 0xC1, 0xB1, 0x07, 0xFD, 0xDC, 0xCA, 0xDA, 0xC3, 0x3B, 0xE1, 0x13, 0x21, 0x55, 0x51, 0x67, 0x38, 0x65, 0x7F, 0xDE, 0xF6, 0x5E, 0x09, 0xDC, 0xD5, 0xE4, 0x32, 0x35, 0xD0, 0x66, 0x5C, 0xF2, 0x1A, 0xFD, 0x62, 0x4B, 0x5B, 0x0E, 0x05, 0xE5, 0x43, 0x1E, 0x1B, 0xE5, 0x7B, 0x2B, 0x47, 0x15, 0x62, 0xA1, 0x57, 0x07, 0xC6, 0x34, 0x0A, 0xC9, 0x16, 0x96, 0xDC, 0x95, 0xE5, 0xEE, 0x69, 0x66, 0xC6, 0xA4, 0x2A, 0x1F, 0x93, 0x4F, 0xE7, 0xA1, 0x89, 0x9B, 0xB8, 0x7B, 0x01, 0xBA, 0x2E, 0x6A, 0x88, 0x83, 0xD9, 0x77, 0x87, 0xBE, 0x9C, 0x92, 0xD3, 0x13, 0x09, 0x9B, 0x92, 0x15, 0xD7, 0x98, 0x61, 0xBA, 0x03, 0xEE, 0xF7, 0xC3, 0xEA, 0xF9, 0xDD, 0x1D 4.4 Viterbi Decoder In this section, we discuss the simulation and implementation techniques for decoding convolutional codes by using the Viterbi algorithm. In particular, we implement the Viterbi decoder that decodes trellis-coded modulation data. Refer to Sections 3.7 through 3.9 for more details on convolutional codes, TCM, and the Viterbi algorithm. As we discuss later, the Viterbi algorithm is costly both in terms of computations and memory usage. 
We discuss the window-based method to avoid huge memory requirements in the implementation of the Viterbi decoder. At the end, we provide simulation results for the 1/2-rate, four-state convolutional coder with 8-PSK modulation and for the corresponding Viterbi decoder.

4.4.1 TCM Convolutional Encoder

In this section, we simulate the TCM encoder. In particular, we simulate the TCM coder shown in Figure 3.34 by using the set partitioning shown in Figure 3.35. This coder takes 1 bit as input and outputs 2 bits (hence, rate R = 1/2). However, the overall rate of the code is 2/3, as we pass 1 additional bit uncoded. At each time instance, we get 3 bits (2 bits from the convolutional coder and one uncoded bit) and we use 8-PSK to modulate them. The look-up table, psk_8_tbl_tcm[ ], is used to map 3 bits to 8-PSK constellation points. We take care of the Ungerboeck set-partitioning of the constellation points at the time of filling psk_8_tbl_tcm[ ] as follows:

    psk_8_tbl_tcm[8][2] = {{1,0}, {-1,0}, {1/sqrt(2),1/sqrt(2)}, {-1/sqrt(2),-1/sqrt(2)},
                           {-1/sqrt(2), 1/sqrt(2)}, {1/sqrt(2),-1/sqrt(2)}, {0,1}, {0,-1}}

    Subset 0: {(1,0), (-1,0)}
    Subset 1: {(1/sqrt(2),1/sqrt(2)), (-1/sqrt(2),-1/sqrt(2))}
    Subset 2: {(-1/sqrt(2),1/sqrt(2)), (1/sqrt(2),-1/sqrt(2))}
    Subset 3: {(0,1), (0,-1)}

    S0 = S1 = 0;
    for(m = 0;m < N;m++){
        x = (float) rand() / RAND_MAX; b0 = (int) (x+0.5);  // b0: 0/1
        x = (float) rand() / RAND_MAX; b1 = (int) (x+0.5);  // b1: 0/1
        in_buf[2*m] = b0; in_buf[2*m+1] = b1;
        c0 = b0; c1 = S0^S1^b1; c2 = S1^b1;                 // inputs b0, b1 and outputs c0,c1, c2
        S1 = S0; S0 = b1;                                   // update encoder states
        j = 4*c2 + 2*c1 + c0;                               // compute offset for 8-PSK look-up table
        tx_seq[2*m] = psk_8_tbl_tcm[j][0];                  // store transmitted sequence
        tx_seq[2*m+1] = psk_8_tbl_tcm[j][1];
    }

Pcode 4.27: Simulation code for TCM encoder.

[Figure 4.11: 8-PSK constellation point numbering with TCM set partitioning.]
The simulation code for the TCM coder is given in Pcode 4.27. The encoder output "c2c1c0" forms offsets ranging from 0 to 7. Bits "c2c1" decide which subset to choose and bit "c0" decides which point to choose from that subset. For example, if c2c1c0 = 110, then c2c1 = 11 and c0 = 0, so we choose Subset 3 and its 0-th point (i.e., (0,1), the point numbered 6 in Figure 4.11). The constellation points are numbered as shown in Figure 4.11.

4.4.2 Viterbi Decoder Simulation

The Viterbi decoder, as we discussed in Section 3.9.2, basically involves the computation of Equation (3.45) or the processing of the trellis as shown in Figure 3.43. The input to the Viterbi decoder is the received sequence rx_seq[ ], which is a version of the transmitted sequence tx_seq[ ] (generated by the encoder given in Pcode 4.27) corrupted by AWGN; see Section 9.1.2 for more details on noise generation and measurement. We encode the bits with the TCM encoder, which usually starts at the zero state and is forced to the zero state at the end of the encoding. Hence, we know the starting and ending states of the TCM encoder. Therefore, the corresponding trellis diagram also starts and ends at the zero state, as shown in Figure 3.42. We simulate the Viterbi decoder by following the six steps given in Section 3.9.2. The corresponding simulation code of the Viterbi decoder is given in Pcode 4.28 through Pcode 4.30.

Computational Complexity and Memory Requirements

Using the simulation code in Pcodes 4.28 through 4.30, we decode the whole frame of length N samples. In other words, the corresponding trellis consists of N stages.
    r0 = rx_seq[0]; r1 = rx_seq[1];                       // received sequence
    r2 = psk_8_tbl_tcm[0]; r3 = psk_8_tbl_tcm[1];
    r4 = r0 - r2; r5 = r1 - r3;                           // stage: 0
    r2 = psk_8_tbl_tcm[2]; r3 = psk_8_tbl_tcm[3];
    r6 = r0 - r2; r7 = r1 - r3;
    r4 = r4*r4 + r5*r5; r6 = r6*r6 + r7*r7;
    if (r6 > r4) {r2 = r4; r3 = 0;}
    else {r2 = r6; r3 = 1;}
    vm[1][0] = r2; vn[1][0][0] = 0; vn[1][0][1] = r3;
    r2 = psk_8_tbl_tcm[12]; r3 = psk_8_tbl_tcm[13];
    r4 = r0 - r2; r5 = r1 - r3;
    r2 = psk_8_tbl_tcm[14]; r3 = psk_8_tbl_tcm[15];
    r6 = r0 - r2; r7 = r1 - r3;
    r4 = r4*r4 + r5*r5; r6 = r6*r6 + r7*r7;
    if (r6 > r4) {r2 = r4; r3 = 0;}
    else {r2 = r6; r3 = 1;}
    vm[1][1] = r2; vn[1][1][0] = 0; vn[1][1][1] = r3;     // store survivor branches and state metric
    r0 = rx_seq[2]; r1 = rx_seq[3];
    for(i = 0;i < 4;i++){                                 // stage: 1
        a = vt_st_out0[2*i]; b = vt_st_out0[2*i+1];
        r2 = psk_8_tbl_tcm[2*a]; r3 = psk_8_tbl_tcm[2*a+1];
        r4 = r0 - r2; r5 = r1 - r3;
        r2 = psk_8_tbl_tcm[2*b]; r3 = psk_8_tbl_tcm[2*b+1];
        r6 = r0 - r2; r7 = r1 - r3;
        r4 = r4*r4 + r5*r5; r6 = r6*r6 + r7*r7;
        a = vt_st_in0[2*i]; b = vt_st_in0[2*i+1];
        r5 = vm[1][a]; r7 = vm[1][b];
        r4 = r4 + r5; r6 = r6 + r7;
        if (r6 > r4) {r2 = r4; r3 = 0;}
        else {r2 = r6; r3 = 1; a = b;}
        vm[2][i] = r2; vn[2][i][0] = a; vn[2][i][1] = r3; // store survivor branches and metrics
    }

Pcode 4.28: Viterbi decoder initial two stages processing.

We obtain the survivor paths by computing the state metrics (SM) of all states (i.e., S = 2^(K−1) states, where K is the constraint length of the encoder) for all N stages. If we have n uncoded bits at each stage, then each path of the trellis consists of 2^n parallel branches. The state metrics are computed using the current stage branch metrics and the previous stage state metrics. Thus, the number of computations performed at each stage of decoding increases exponentially with n and K.
We determine the global most likely sequence by taking the survivor branch (i.e., the branch with minimum state metric) at the zero state of the (N − 1)th stage and tracing back to the beginning of the trellis. To perform this, we have to store all state metrics and the survivor branch information. If one trellis stage contains S states, and if we use 4 bytes per state to store one SM and 1 byte per state to store the survivor branch information (i.e., the index of the previous stage state which connects to the current stage state through the survivor branch), then we need (4 + 2n) ∗ S ∗ N bytes of on-chip memory to store the processed trellis data. For example, if the frame length N is 2000 samples and we use a 4-state encoder with 1 uncoded bit, then we require 48 kB (= (4 + 2) ∗ 4 ∗ 2000) of data memory to store the trellis data alone. However, we can reduce this memory requirement by using window-based trellis processing (which is suboptimal when compared to the original Viterbi algorithm). Computer simulations have shown that a decision taken at the current stage for a bit that lies L stages back in time (where L is greater than or equal to 6K) yields a correctly decoded bit with very high probability. This convergence property of the trellis allows us to implement the Viterbi decoder with less memory.

In Pcode 4.28 and 4.29, we use the look-up tables vt_st_in0[ ] and vt_st_in1[ ] to access the trellis branches connected to the appropriate states, and we use the look-up tables vt_st_out0[ ] and vt_st_out1[ ] to access the corresponding branches' outputs.
These look-up table values follow:

    vt_st_in0[8] = {0,0,0,0,1,1,1,1}
    vt_st_in1[16] = {0,0,2,2,0,0,2,2,1,1,3,3,1,1,3,3}
    vt_st_out0[8] = {0,1,6,7,2,3,4,5}
    vt_st_out1[16] = {0,1,6,7,6,7,0,1,2,3,4,5,4,5,2,3}

    j = 2;
    while(j < N){
        r0 = rx_seq[2*j]; r1 = rx_seq[2*j+1];
        for(i = 0;i < 4;i++){
            a = vt_st_out1[4*i]; b = vt_st_out1[4*i+1];
            r2 = psk_8_tbl_tcm[2*a]; r3 = psk_8_tbl_tcm[2*a+1];
            r4 = r0 - r2; r5 = r1 - r3;
            r2 = psk_8_tbl_tcm[2*b]; r3 = psk_8_tbl_tcm[2*b+1];
            r6 = r0 - r2; r7 = r1 - r3;
            r4 = r4*r4 + r5*r5; r6 = r6*r6 + r7*r7;
            a = vt_st_in1[4*i]; b = vt_st_in1[4*i+1];
            r5 = vm[j][a]; r7 = vm[j][b];
            r4 = r4 + r5; r6 = r6 + r7;                         // add
            if (r6 > r4) {r2 = r4; r3 = 0;}                     // compare and select
            else {r2 = r6; r3 = 1; a = b;}
            vm[j+1][i] = r2; vn[j+1][i][0] = a; vn[j+1][i][1] = r3; // store temporarily
            a = vt_st_out1[4*i+2]; b = vt_st_out1[4*i+3];
            r2 = psk_8_tbl_tcm[2*a]; r3 = psk_8_tbl_tcm[2*a+1];
            r4 = r0 - r2; r5 = r1 - r3;
            r2 = psk_8_tbl_tcm[2*b]; r3 = psk_8_tbl_tcm[2*b+1];
            r6 = r0 - r2; r7 = r1 - r3;
            r4 = r4*r4 + r5*r5; r6 = r6*r6 + r7*r7;
            a = vt_st_in1[4*i+2]; b = vt_st_in1[4*i+3];
            r5 = vm[j][a]; r7 = vm[j][b];
            r4 = r4 + r5; r6 = r6 + r7; r5 = vm[j+1][i];        // add
            if (r6 > r4) {r2 = r4; r3 = 0;}                     // compare and select
            else {r2 = r6; r3 = 1; a = b;}
            if (r5 > r2){                                       // state metrics and survivor branches
                vm[j+1][i] = r2; vn[j+1][i][0] = a; vn[j+1][i][1] = r3;
            }
        }
        j++;    // advance to the next stage (added so that the loop terminates)
    }

Pcode 4.29: Viterbi decoder total frame trellis processing.

    // trace back and decode bits (baseband demodulation done automatically with Viterbi)
    k = 0;
    for(i = N;i > 0;i--){
        b = (i-1) << 1;
        j = vn[i][k][0];                    // get previous stage state index
        a = vn[i][k][1];                    // get branch (out of two parallel branches)
        dec_bits[b] = vb[j][k][a][1];       // get first decoded bit
        dec_bits[b+1] = vb[j][k][a][0];     // get second decoded bit
        k = j;
    }

Pcode 4.30: Simulation code for decoding bits by trace back.
We use the following look-up table, vb[ ][ ][ ][ ], to obtain the input bits associated with the survivor branches that belong to the global most likely sequence during the trace back.

vb[4][4][2][2] = {
{{{0,0},{0,1}},{{1,0},{1,1}},{{0,0},{0,0}},{{0,0},{0,0}}},
{{{0,0},{0,0}},{{0,0},{0,0}},{{0,0},{0,1}},{{1,0},{1,1}}},
{{{0,0},{0,1}},{{1,0},{1,1}},{{0,0},{0,0}},{{0,0},{0,0}}},
{{{0,0},{0,0}},{{0,0},{0,0}},{{0,0},{0,1}},{{1,0},{1,1}}}}

4.4.3 Viterbi Decoder Implementation

The simulation codes presented in the previous section for decoding TCM codes with the Viterbi algorithm not only consume a huge amount of memory but also use floating-point computations. In this section, we discuss a fixed-point Viterbi decoder that decodes faster, and we use a window-based method to reduce the overall memory requirement in TCM decoding. We perform fixed-point arithmetic for the Viterbi decoder by converting the input data to 8.8 fixed-point Q-format (which is achieved by multiplying the fractions by 2^8) and by using the 8.8 Q-format look-up table psk_8_fix_tcm[ ] for 8-PSK constellation points as follows:

psk_8_fix_tcm[16] = {255,0,-255,0,181,181,-181,-181,-181,181,181,-181,0,255,0,-255}
rx_fix_seq[i] = 256*rx_seq[i], 0 ≤ i ≤ N − 1

The algorithm for window-based Viterbi decoding follows:

1. At stage j = 0, set SM to zero for all states.
2. At a node in a stage of j > 0, compute BM for all branches entering the node.
3. Add the BM to the present SM for the path ending at the source node of the branch, to get a candidate SM for the path ending at the destination node of it. After the candidate SM has been obtained for all branches entering the node, compare them and select only that with the minimum value. Let this corresponding branch survive and delete all the other branches to that node from the trellis. This process is shown in Figure 4.12.
4. Return to step 2 for dealing with the next node. If all nodes in the present stage have been processed, go to step 5.
5. If j < L (where L is the window length, with L > 6K), increment j and return to step 2, else go to step 6.
6. Take the path with minimum SM (as the global most likely path) and follow the survivor branches backward through the trellis up to the beginning of the window considered. Now collect the bits that correspond to the survivor branch of the global most likely path at the start of the window to form the estimate of the original information bit sequence.
7. If j < n − 1, move the window one stage forward and go to step 2.

To process the first two stages of the window-based Viterbi decoder in a fixed-point format, we can use the same code presented in Pcode 4.28 by replacing rx_seq[ ] with rx_fix_seq[ ] and psk_8_tbl_tcm[ ] with psk_8_fix_tcm[ ]. In window-based Viterbi decoding, we process the trellis up to L samples (one window length) and decode a bit by tracing back. Then we move the window by one sample, compute the state metrics for the new sample that entered the window, and decode the next bit by performing the traceback again. In this process, we perform a traceback for every decoded bit, which is too costly. Instead, we perform window-based Viterbi decoding in a different way, in which we perform the traceback once per L samples. For this, we process the trellis for the first two windows before starting the trace back. In other words, we process the next window trellis in advance. At the end of the trellis processing of the second window, we perform the traceback and decode all bits of the first window at once. The simulation code for this window-based Viterbi decoder is given in Pcodes 4.31 and 4.32. With the program in Pcode 4.31, we only process the trellis for the first window without any trace back. In Pcode 4.32, we always perform trellis processing of the next window and decode all the bits of the previous window by performing the traceback.
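The conversion to 8.8 Q-format and the fixed-point branch metric can be sketched as follows. The function names `to_q8_8` and `branch_metric_q8_8` are hypothetical. The text does not state a rounding rule; adding 0.5 before truncation is an assumption here, chosen because it reproduces the first quantized samples listed in Section 4.4.4 (0.157108262 becomes 40 and −0.876191974 becomes −223).

```c
#include <assert.h>

/* Convert a real-valued sample to 8.8 fixed point (scale by 2^8 = 256).
   Rounding rule (add 0.5, then truncate toward zero) is an assumption. */
int to_q8_8(double x)
{
    return (int)(256.0 * x + 0.5);
}

/* Squared Euclidean distance between a received sample (ri, rq) and a
   constellation point (ci, cq), all in 8.8 format. The product of two 8.8
   values carries 16 fractional bits, so the sum is shifted right by 8 to
   return to 8.8, as done in Pcode 4.31. The sum is non-negative, so the
   right shift is well defined. */
int branch_metric_q8_8(int ri, int rq, int ci, int cq)
{
    int di = ri - ci;
    int dq = rq - cq;
    assert(di > -32768 && di < 32768 && dq > -32768 && dq < 32768);
    return (di * di + dq * dq) >> 8;
}
```

For example, the received pair (40, −223) measured against the table point (0, −255) gives (40·40 + 32·32) >> 8 = 2624 >> 8 = 10.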
Figure 4.12: Processing of trellis stages in window-based Viterbi decoding. (States 00, 10, 01, 11; stages j = 0 to j = n − 1; survivor paths and the global most likely path within a window of length L.)

// stages: 2 to 23
m = 2;
for(j = m;j < m + 22;j++){
    r0 = rx_fix_seq[2*j]; r1 = rx_fix_seq[2*j+1];
    for(i = 0;i < 4;i++){
        // first pair of branches entering state i
        a = vt_st_out1[4*i]; b = vt_st_out1[4*i+1];
        r2 = psk_8_fix_tcm[2*a]; r3 = psk_8_fix_tcm[2*a+1];
        r4 = r0 - r2; r5 = r1 - r3;
        r2 = psk_8_fix_tcm[2*b]; r3 = psk_8_fix_tcm[2*b+1];
        r6 = r0 - r2; r7 = r1 - r3;
        r4 = (r4*r4 + r5*r5)>>8; r6 = (r6*r6 + r7*r7)>>8;
        a = vt_st_in1[4*i]; b = vt_st_in1[4*i+1];
        r5 = vm[j][a]; r7 = vm[j][b];
        r4 = r4 + r5; r6 = r6 + r7;
        if (r6 > r4) {r2 = r4; r3 = 0;}
        else {r2 = r6; r3 = 1; a = b;}
        vm[j+1][i] = r2; vn[j+1][i][0] = a; vn[j+1][i][1] = r3;

        // second pair of branches entering state i
        a = vt_st_out1[4*i+2]; b = vt_st_out1[4*i+3];
        r2 = psk_8_fix_tcm[2*a]; r3 = psk_8_fix_tcm[2*a+1];
        r4 = r0 - r2; r5 = r1 - r3;
        r2 = psk_8_fix_tcm[2*b]; r3 = psk_8_fix_tcm[2*b+1];
        r6 = r0 - r2; r7 = r1 - r3;
        r4 = (r4*r4 + r5*r5)>>8; r6 = (r6*r6 + r7*r7)>>8;
        a = vt_st_in1[4*i+2]; b = vt_st_in1[4*i+3];
        r5 = vm[j][a]; r7 = vm[j][b];
        r4 = r4 + r5; r6 = r6 + r7; r5 = vm[j+1][i];
        if (r6 > r4) {r2 = r4; r3 = 0;}
        else {r2 = r6; r3 = 1; a = b;}
        if (r5 > r2){
            vm[j+1][i] = r2; vn[j+1][i][0] = a; vn[j+1][i][1] = r3;
        }
    }
}
m += 22;

Pcode 4.31: Simulation code for first window trellis processing.

4.4.4 Simulation Results

This section presents the simulation results for a four-state, 8-PSK, 1/2-rate convolutional coder (the effective rate is 2/3 because 1 bit is uncoded) as shown in Figure 3.34.
We consider 128 random bits for transmission as follows: Input Input bits (bn): 128 bits 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0 Convolutional Encoding We encode the bits bn using a rate 1/2 convolutional encoder as shown in Figure 3.34. With a rate 1/2 coder, we output 2 bits for every 1 input bit. We pass 1 more bit as uncoded (so, the effective code rate is 2/3). Hence, we have three output bits for every two input bits. At the start, the encoder state “S1S0” is initialized to zero. The encoded bits (192 output bits correspond to 128 input bits) follow: Encoded bits (ck): 192 bits 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1 196 Chapter 4 while(m < 1001){ for(j = m;j < m + 24;j++){ // compute metrics for next 6K stages p = j&0x3f; q = (j+1)&0x3f; r0 = rx_fix_seq[2*j]; r1 = rx_fix_seq[2*j+1]; for(i = 0;i < 4;i++){ a = vt_st_out1[4*i]; b = vt_st_out1[4*i+1]; r2 = psk_8_fix_tcm[2*a]; r3 = psk_8_fix_tcm[2*a+1]; r4 = r0 - r2; r5 = r1 - r3; r2 = psk_8_fix_tcm[2*b]; r3 = psk_8_fix_tcm[2*b+1]; r6 = r0 - r2; r7 = r1 - r3; r4 = (r4*r4 + r5*r5)>>8; r6 = (r6*r6 + r7*r7)>>8; a = vt_st_in1[4*i]; b = vt_st_in1[4*i+1]; r5 
= vm[p][a]; r7 = vm[p][b]; r4 = r4 + r5; r6 = r6 + r7; // add if (r6 > r4) {r2 = r4; r3 = 0;} // compare and select else {r2 = r6; r3 = 1; a = b;} vm[q][i] = r2; vn[q][i][0] = a; vn[q][i][1] = r3; a = vt_st_out1[4*i+2]; b = vt_st_out1[4*i+3]; r2 = psk_8_fix_tcm[2*a]; r3 = psk_8_fix_tcm[2*a+1]; r4 = r0 - r2; r5 = r1 - r3; r2 = psk_8_fix_tcm[2*b]; r3 = psk_8_fix_tcm[2*b+1]; r6 = r0 - r2; r7 = r1 - r3; r4 = (r4*r4 + r5*r5)>>8; r6 = (r6*r6 + r7*r7)>>8; a = vt_st_in1[4*i+2]; b = vt_st_in1[4*i+3]; r5 = vm[p][a]; r7 = vm[p][b]; r4 = r4 + r5; r6 = r6 + r7; r5 = vm[q][i]; if (r6 > r4) {r2 = r4; r3 = 0;} else {r2 = r6; r3 = 1; a = b;} if (r5 > r2){vm[q][i] = r2; vn[q][i][0] = a; vn[q][i][1] = r3;} } } k = 0; a = vm[q][0]; // trace back and get decoded bits if (a > vm[q][1]) {k = 1; a = vm[q][1];} if (a > vm[q][2]) {k = 2; a = vm[q][2];} if (a > vm[q][3]) {k = 3; a = vm[q][3];} for(i = m + 24-1; i > m; i--){ p = i&0x3f; k = vn[p][k][0]; } for(i = m;i > m-24;i--){ b = (i-1)<<1; p = i&0x3f; j = vn[p][k][0]; a = vn[p][k][1]; dec_bits[b] = vb[j][k][a][1]; dec_bits[b+1] = vb[j][k][a][0]; k = j; } m+= 24; } Pcode 4.32: Subsequent window trellis processing and decoding by trace back. PSK Modulation At the time of encoding, the output of the encoder is mapped to 8-PSK symbols by using each 3-bit encoder output as an offset to the 8-PSK look-up table psk_8_ﬁx_tcm[ ] (which is constructed based on Ungerboeck’s set-partitioning rules and makes sure that the distance between trellis parallel transitions is maximum). 
This 8-PSK modulated data follows: 8-PSK normalized constellation points to transmit (Sm): 64 constellation points 0,-1, 0.707106769, 0.707106769, 0,-1,-1, 0,-1, 0, 0, 1,-0.707106769,-0.707106769, 0,-1, 0, 1,-0.707106769,-0.707106769, 0, 1, 0, 1,-0.707106769, 0.707106769, 0.707106769, 0.707106769, 0.707106769,-0.707106769, 0,-1, -1, 0, 1, 0, 1, 0, -1, 0, 0, 1, -0.707106769, -0.707106769, 1, 0,-0.707106769,-0.707106769, 1, 0,-0.707106769,-0.707106769, 1, 0,-0.707106769,-0.707106769, 1, 0, -0.707106769,-0.707106769,-1, 0, -0.707106769,-0.707106769, 0,-1, 0, 1,-0.707106769, 0.707106769, -0.707106769,-0.707106769, 0.707106769, 0.707106769,-0.707106769, 0.707106769, 0,-1, Implementation of Error Correction Algorithms 197 0,-1,-0.707106769, 0.707106769,-0.707106769,-0.707106769, 0.707106769, 0.707106769, -0.707106769,-0.707106769,-0.707106769, 0.707106769, 0,-1,-1, 0, 0, 1, 0.707106769,-0.707106769, 0.707106769, 0.707106769,-0.707106769, -0.707106769, 0.707106769,-0.707106769, 0,-1, 0, 1,-0.707106769,-0.707106769, 0, 1, 0, 1,-0.707106769,-0.707106769,-1, 0,-0.707106769, 0.707106769, 0.707106769, 0.707106769, -0.707106769, -0.707106769,-0.707106769,-0.707106769,-0.707106769, 0.707106769 Passing through AWGN Channel We transmit the PSK points Sm (after converting them to analog signals) over a noisy channel. For the simulation purpose we add AWGN noise to constellation points. 
At the receiver, we get noisy PSK constellation points (at the output of the receiver front end) as follows: Received noisy PSK constellation points (rm): 64 points 0.157108262, -0.876191974, 0.777749598, 0.572184622, -0.0493549667, -0.779661775, -1.04820085, 0.129765883,-0.754512846, 0.0238341205, 0.113822095, 1.08242452, -0.830362678,-0.480324298,-0.00643539662,-1.09403169, 0.139023885, 0.993276477, -0.764336884,-0.95889169, 0.0967011526, 1.0580529, 0.0155533217, 0.864503026, -0.576663733, 0.72110486, 0.439940155, 0.867248893, 0.187527895,-0.900946856, 0.269263357,-1.00332797,-0.965083599, 0.127240837, 0.969807982, 0.1253566, 1.0114671, 0.0736974776,-1.10441804, 0.208328649, 0.025401894, 1.40040243, -0.696813524,-0.711368084, 1.2361176, 0.201338947,-0.751767278,-0.69719702, 0.685953498,-0.0994339064,-0.590031147,-0.872876346, 1.08380294, 0.272845447, -0.854398847,-0.510077894, 0.770515382,-0.0814026967,-0.841046274,-0.553521454, -0.767543256,-0.11693459,-0.900310397,-0.909833312, 0.364312947,-1.22945499, -0.17605862, 0.983639538,-0.774587214, 0.775650978,-0.858343899,-1.18007827, 0.615384161, 0.283505648,-0.393607974, 0.756205738, 0.0724883378,-0.78575933, 0.185971975, -1.3460393,-1.04868209, 0.653228343,-0.779842257,-0.512806058, 0.724324882, 0.492230177,-0.558885276,-0.866437376,-0.738605142, 0.714647472, 0.401384085,-0.90096128,-1.04960263, 0.140945986, 0.12148124, 1.07132232, 0.84473902,-0.770106435, 0.83484894, 0.708259106,-0.634552479,-0.87108928, 0.501765013,-0.930042982,-0.214983284,-1.13528705, 0.107488595, 0.880786121, -0.718424797,-0.988860965, 0.431062669, 1.19838154, 0.0879992619, 1.43463016, -0.625213265,-0.663498342,-0.936734319, 0.0526892953,-0.609834671, 0.482433826, 1.04271317, 0.66147238,-0.550259411,-0.517056823,-0.646719575, -0.730190217, -0.631949961, 0.655985713, Preparing Soft Decisions To work with ﬁxed-point code, we convert (by quantizing) the received noisy PSK points to soft-decisions (multilevel) using 8.8 Q-format (i.e., 256 levels) 
as follows: Quantized received soft data (Rm): 64 points (in 8.8 format) 40, -223, 199, 146, -12, -199, -267, 33, -192, 6, 29, 277, -212, -122, -1, -279, 36, 254, -195, -244, 25, 271, 4, 221, -147, 185, 113, 222, 48, -230, 69, -256, -246, 33, 248, 32, 259, 19,-282, 53, 7, 359, -177, -181, 316, 52, -191, -177, 176, -24, -150, -222, 277, 70, -218, -130, 197, -20, -214, -141, -195, -29, -229, -232, 93, -314, -44, 252, -197, 199, -219, -301, 158, 73, -100, 194, 19, -200, 48, -344, -267, 167, -199, -130, 185, 126, -142, -221, -188, 183, 103, -230, -268, 36, 31, 274, 216, -196, 214, 181, -161, -222, 128, -237, -54, -290, 28, 225, -183, -252, 110, 307, 23, 367, -159, -169, -239, 13, -155, 124, 267, 169, -140, -131, -165, -186, -161, 168 Viterbi Decoding Next, we are ready with the data to feed the Viterbi decoder. The Viterbi decoder copies the transmitter side encoder trellis and processes it. The trellis starts from a zero state (as we assumed at the start of the encoder on the transmitter side) with zero-state metrics. Then we follow the Viterbi algorithm presented in Section 4.4.3 for each received data point. For purposes of clarity, we tabulated the processed trellis that follows on the next page. The ﬁrst column in the table gives the data points index, the second column gives the state metrics, the third column gives the traceback information, and ﬁnally, the fourth column gives the decoded bits (an estimate of transmitted bits) obtained from the global most likely sequence. 
198 Chapter 4 Stages 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 0 438 640 193 238 307 810 975 473 946 1250 628 1100 1393 1144 1112 1065 1107 1146 1169 1249 1856 2115 1883 2124 1898 2190 2128 2295 2278 2460 2418 2676 2337 2800 3057 3045 3215 3014 2875 3319 3498 3684 3755 3668 3863 3456 3505 State Metrics 0 72 746 567 673 679 358 1092 981 510 1152 1088 666 1317 1362 1217 1182 1320 1500 1559 1653 1360 2107 1477 2152 1594 2081 1758 2455 1924 2465 2086 2670 2661 2384 3053 3117 3108 3202 3088 3012 3659 3632 3637 3826 3853 3710 3866 0 0 125 628 838 871 927 448 972 1199 587 1245 1305 998 1039 995 1278 1461 1536 1740 1894 2005 1364 2086 1491 2007 1666 2288 1846 2283 1997 2503 2185 2765 2957 2780 2938 2853 2801 3182 3480 3446 3441 3492 3603 3328 3817 4014 0 0 417 604 801 859 985 692 970 1188 949 1195 1313 704 813 1091 1404 1395 1600 1778 1788 2019 1718 1982 1845 2055 1966 2173 2106 2323 2279 2445 2544 2891 2869 2418 2576 2707 3001 3220 3384 3112 3181 3240 3319 3690 3980 3942 Traceback Information ---0<>0 0<>0 2<>1 0<>1 0<>1 0<>0 0<>1 2<>1 0<>0 0<>1 2<>0 0<>0 0<>1 2<>0 2<>1 2<>1 0<>1 0<>0 0<>0 0<>1 0<>0 0<>1 2<>0 0<>1 2<>1 2<>1 2<>0 0<>1 2<>1 0<>1 2<>1 0<>1 2<>1 0<>1 0<>1 2<>1 0<>0 2<>0 2<>1 2<>1 0<>1 0<>1 2<>0 2<>1 2<>0 2<>1 0<>1 ---0<>1 0<>0 2<>1 2<>1 0<>0 0<>0 2<>1 2<>1 0<>0 0<>1 2<>0 0<>0 0<>0 2<>0 0<>1 0<>1 2<>1 2<>0 2<>0 0<>0 0<>0 0<>1 2<>0 0<>1 2<>0 0<>1 2<>0 2<>1 2<>0 2<>1 2<>1 0<>1 2<>0 0<>0 0<>0 2<>1 2<>0 2<>1 0<>1 0<>1 2<>1 2<>1 2<>0 2<>1 2<>1 2<>0 2<>1 ---0<>0 1<>0 3<>1 3<>0 1<>1 1<>0 1<>1 3<>1 1<>0 1<>1 3<>0 1<>0 1<>0 3<>0 3<>1 3<>1 1<>1 1<>0 1<>0 1<>1 1<>0 1<>1 3<>1 1<>1 3<>1 1<>1 1<>0 1<>1 3<>1 1<>1 3<>0 1<>1 3<>1 1<>0 1<>1 3<>1 3<>1 3<>0 3<>1 1<>1 1<>1 3<>0 3<>1 3<>1 3<>0 3<>1 1<>1 ---0<>0 1<>1 3<>1 1<>0 1<>0 1<>0 1<>0 3<>1 3<>0 1<>1 3<>0 1<>0 1<>0 3<>0 3<>1 1<>1 1<>0 1<>1 1<>1 1<>0 1<>0 1<>1 3<>0 1<>0 3<>0 1<>1 3<>0 1<>0 3<>0 1<>0 3<>1 3<>1 1<>1 1<>0 1<>0 3<>1 3<>0 3<>0 
3<>1 1<>1 1<>0 3<>1 3<>0 3<>1 3<>1 1<>1 1<>0 Decoded Bits 1, 1 0, 0 1, 0 1, 0 1, 0 0, 1 1, 0 1, 0 0, 1 1, 0 0, 0 0, 1 0, 1 0, 1 1, 0 1, 0 1, 0 0, 0 0, 0 1, 0 0, 1 1, 0 0, 1 1, 0 0, 1 1, 0 0, 1 1, 0 0, 1 1, 0 1, 1 1, 0 1, 0 0, 1 0, 1 1, 1 0, 1 0, 0 1, 0 1, 1 0, 1 1, 1 0, 1 1, 1 0, 0 1, 0 1, 0 0, 1 Implementation of Error Correction Algorithms 199 Stages 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 4003 4238 4255 4194 4167 3897 4349 4639 4190 4789 5054 4840 5064 5072 5171 5156 State Metrics 3555 4109 4278 3967 4189 4000 4316 4021 4340 3808 4256 4408 3955 4453 4535 4028 4480 4732 4325 4824 5034 4359 4388 4903 5126 4719 4900 4907 5153 4922 5198 5011 4171 3605 3638 3699 4064 4300 4497 4390 4587 4870 4677 4929 4471 4569 4660 4681 Traceback Information 0<>0 0<>0 1<>0 1<>0 0<>0 0<>1 1<>1 1<>1 2<>0 2<>0 3<>1 3<>0 2<>1 2<>1 3<>1 3<>1 2<>1 0<>1 3<>1 3<>1 2<>1 0<>1 3<>1 3<>1 0<>0 0<>0 1<>0 3<>0 2<>1 2<>0 0<>1 2<>0 1<>1 1<>0 1<>1 3<>0 0<>0 0<>0 1<>0 1<>0 0<>1 2<>0 0<>1 2<>1 1<>1 3<>0 1<>1 3<>1 0<>1 0<>0 1<>1 1<>0 2<>0 2<>0 3<>1 3<>0 2<>1 2<>1 3<>0 3<>1 2<>1 2<>1 3<>1 3<>1 Decoded Bits 1, 1 0, 1 1, 1 1, 0 1, 0 0, 1 1, 0 0, 0 0, 1 1, 0 1, 1 0, 1 0, 1 1, 1 1, 1 0, 0 4.4.5 TCM-Viterbi Performance The previous table provides only 128 decoded bits. With 128 bits, we cannot say whether the decoder works correctly or not. To see the performance of the TCM-Viterbi coder, we test the decoder with millions of bits (that means we have to encode and transmit that many bits). For example, to test the bit-error-rate (BER) of 10−8, we have to process at least 109 bits. In the following, BER versus Eb/N0 data for the TCM-Viterbi coder shown in Figure 3.34 are provided. The corresponding BER plot for this data is shown in Figure 3.32. 
BER versus Eb/N0 Data:

EbNo = 9.000000   error_count = 3          BER = 1.500000e-08
EbNo = 8.000000   error_count = 76         BER = 3.800000e-07
EbNo = 7.000000   error_count = 2104       BER = 1.052000e-05
EbNo = 6.000000   error_count = 32888      BER = 1.644400e-04
EbNo = 5.000000   error_count = 345240     BER = 1.726200e-03
EbNo = 4.000000   error_count = 2254097    BER = 1.127049e-02
EbNo = 3.000000   error_count = 8798461    BER = 4.399231e-02
EbNo = 2.000000   error_count = 21351731   BER = 1.067587e-01

4.5 Turbo Codes

In this section, we simulate a turbo encoder and decoder. There is more than one encoder configuration for generating turbo codes; in the simulations we use the RSC encoder configuration from the 3GPP standard (3rd Generation Partnership Project, 2007). We use the maximum a posteriori (MAP) decoding algorithm to decode turbo codes, and for this we use the equations derived in Section 3.10 to compute the corresponding metrics in the simulation of the MAP algorithm. We estimate the computational complexity of the turbo decoder in terms of the number of computations and the amount of memory needed to decode turbo codes by using the MAP algorithm.

4.5.1 RSC Encoder Simulation

To achieve better error-correction performance, we use an RSC encoder to generate turbo codes. The 3GPP standard specifies parallel concatenation of two RSC encoders for turbo code generation, as shown in Figure 4.13. Each of these RSC encoders consists of 3 (= M) delay units, and hence the constraint length of each RSC encoder is 4 (i.e., M + 1). The two RSC encoders are separated by an interleaver, and the relations between the interleaver input and output sequence indices are specified in the 3GPP standard.
The first encoder works on the direct input bit sequence and outputs a systematic output bit sequence and a parity bit sequence, whereas the second encoder works on the interleaved bit sequence and outputs another parity bit sequence. In other words, we generate three output bit sequences from one input bit sequence and hence the encoder shown in Figure 4.13 is a rate 1/3 coder.

Figure 4.13: Parallel concatenation of two RSC encoders with interleaver. (Input bits c_n; systematic output bits d_{n,0}; parity output bits d_{n,1} from encoder 1 and d_{n,2} from encoder 2; each encoder has three delay units D with states z1, z2, z3 and a feedback path; I is the interleaver and TT is the trellis termination switch.)

At the beginning we initialize the states of the two RSC encoders to zero, and then the state of each encoder is updated based on its input sequence bit and feedback bit. Typically, tracking of the RSC encoder state for each input bit (and feedback bit) is carried out with the help of a trellis diagram as shown in Figure 4.14(a). The trellis has three phases: (1) the initialization phase, (2) the steady-state phase, and (3) the termination phase. After the start of encoding, we need three stages to get into steady state. Similarly, the termination phase also involves three stages. For trellis termination (TT), we take bits from the feedback loop instead of from the input bits by switching, as shown in Figure 4.13. A zoomed version of the steady-state trellis in Figure 4.14(a) is shown in Figure 4.14(b). We rearrange the output states of the trellis to get a simple flow of the steady-state trellis, and we use this rearranged flow throughout the simulation as it has certain advantages in the implementation of the MAP algorithm on the reference embedded processor.
In Figure 4.14(b), a solid line represents the RSC encoder state update from the current state to the next state when the input bit is “1,” and similarly the dotted line represents the state update when the input bit is “0.” The state diagram shown in Figure 4.14(b) corresponds to the ﬁrst RSC encoder as we output two bits (one systematic bit and one parity bit) from one input bit. The trellis for the second RSC encoder is also the same as the ﬁrst RSC encoder; the only difference is that the number of output bits in this case is one (i.e., a second parity bit). The simulation code of the 3GPP RSC encoder and the BPSK modulator is given in Pcode 4.33. This simulation code assumes an interleaved input bit sequence is available to the second RSC encoder; the study and simulation of the 3GPP interleaver is not in the scope of this book. Typically, the input data bits are accessed in terms of 8-bit bytes from memory as the minimum size of the data that the processor can access from memory is a byte (or an 8-bit quantity). Once we get a byte of data, then to encode bit-by-bit, we have to unpack the bits which takes about 1.125 cycle per bit (or a total of 9 cycles with the ﬁrst bit unpack requiring 2 cycles and rest of the bits requiring 1 cycle per bit) on the reference embedded processor. Then, an additional 7 cycles are required to code this bit as coding involves only sequential operations. After coding, we have to pack the coded bits and store them in memory for other processing and transmission. Packing of data takes the same number of cycles as unpacking 1.125 cycles per bit. Thus, a total of 10 cycles are required for encoding 1 bit of data and outputting one parity bit (including overhead). 
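The unpacking and packing steps described above can be sketched in portable C. The names `unpack_byte` and `pack_byte` are hypothetical, and the most-significant-bit-first order is an assumption, since the text does not state the bit order. The cycle counts quoted in the text refer to the reference embedded processor and are not reproduced by this plain C version.

```c
#include <assert.h>
#include <stdint.h>

/* Split one byte into 8 separate bits, most significant bit first
   (bit order is an assumption). Each output element holds 0 or 1. */
void unpack_byte(uint8_t byte, uint8_t bits[8])
{
    for (int k = 0; k < 8; k++)
        bits[k] = (uint8_t)((byte >> (7 - k)) & 1u);
}

/* Inverse operation: collect 8 bits (each 0 or 1) back into one byte. */
uint8_t pack_byte(const uint8_t bits[8])
{
    uint8_t byte = 0;
    for (int k = 0; k < 8; k++) {
        assert(bits[k] <= 1u);
        byte = (uint8_t)((byte << 1) | bits[k]);
    }
    return byte;
}
```

With the figures given in the text, the per-bit cost adds up as 1.125 (unpack) + 7 (encode) + 1.125 (pack) = 9.25 cycles, which the text rounds to about 10 cycles per bit once overhead is included.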
Figure 4.14: (a) Trellis diagram flow for RSC encoder (stages n = 0 to N − 1: three initialization stages, the steady state, and three termination stages). (b) Steady-state trellis data flow of RSC encoder (dotted line: data 0; solid line: data 1). The branch labels c_n/d_{n,0}d_{n,1} of panel (b), as produced by the encoder of Pcode 4.33, are:

Current state (z1 z2 z3)   Input 0          Input 1
000 S0                     0/00 -> S0       1/11 -> S4
001 S1                     0/00 -> S4       1/11 -> S0
010 S2                     0/01 -> S5       1/10 -> S1
011 S3                     0/01 -> S1       1/10 -> S5
100 S4                     0/01 -> S2       1/10 -> S6
101 S5                     0/01 -> S6       1/10 -> S2
110 S6                     0/00 -> S7       1/11 -> S3
111 S7                     0/00 -> S3       1/11 -> S7

S1[0] = 0; S1[1] = 0; S1[2] = 0;
S2[0] = 0; S2[1] = 0; S2[2] = 0;
for(i = 0;i < N;i++) {
    // first RSC encoder
    feedback = S1[1] ^ S1[2];
    tmp1 = c[i];                 // c[] contains input bit sequence
    tmp1 = feedback ^ tmp1;
    *x++ = 1 - 2*c[i];           // x[] contains output symbols
    tmp2 = tmp1 ^ S1[2]; S1[2] = S1[1];
    tmp2 = tmp2 ^ S1[0]; S1[1] = S1[0];
    S1[0] = tmp1;
    *x++ = 1-2*tmp2;             // modulate parity bit one and store
    // second RSC encoder
    feedback = S2[1] ^ S2[2];
    tmp1 = c_in[i];              // c_in[] contains interleaved input bits
    tmp1 = feedback ^ tmp1;
    tmp2 = tmp1 ^ S2[2]; S2[2] = S2[1];
    tmp2 = tmp2 ^ S2[0]; S2[1] = S2[0];
    S2[0] = tmp1;
    *x++ = 1-2*tmp2;             // modulate parity bit two and store
}

Pcode 4.33: Simulation code for 3GPP RSC encoder and BPSK modulator.

Turbo Encoder Complexity

Since we use a look-up table for interleaver addresses instead of computing them on the fly, we only spend cycles on look-up table accesses (which may come for free with compute operations). Interleaving one data bit takes about three cycles (one cycle for loading the offset, two cycles for computing the absolute address). Since turbo encoding involves two RSC encoders and one interleave operation, in total we consume 25 cycles (including overhead) for encoding one data bit.
In other words, for applications with a 14.4 Mbps bit rate (e.g., a femtocell base station), we require about 360 MIPS, which is about 60% of the total available 600 MIPS of the reference embedded processor.

Efficient Implementation of Turbo Encoder

As discussed, a turbo encoder is a costly module at higher bit rates if we do not implement it properly. Next, we discuss techniques for efficient implementation of the turbo encoder. We split the turbo encoder into two parts. In the first part, we deal with the encoding of bits and in the second part we handle the interleaving of the data bits.

Encoding Using Look-up Table

Turbo encoding with two RSC encoders consumes about 20 cycles per input bit, as we discussed. Here, we describe a different approach using a look-up table that consumes only 2.5 cycles (for both encoders) per input bit. For this, we need 256 bytes of extra memory for storing look-up table data. Given the present state of the RSC encoder, it is possible to encode more than 1 bit at a time using this look-up table. By precomputing the look-up table for all possible combinations of L input bits and for all 2^3 = 8 combinations of the three state bits, we can encode L bits at a time. In this encoding, we use a look-up table that has 2^(L+3) entries. As the value of L increases, the size of the look-up table also increases. With L = 4 (i.e., encoding 4 bits at a time), we have 2^7 or 128 entries in the look-up table as shown in Figure 4.15(a). Each entry contains 4 encoded bits and 3 bits of updated state information. In other words, a byte (or 8 bits) is sufficient to represent each entry of the look-up table.
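A minimal sketch of the one-byte-per-entry table for L = 4 follows. It assumes the RSC update of Pcode 4.33, numbers states as z1 z2 z3 with z1 = S[0] as the most significant bit, processes the nibble most significant bit first, and stores the four parity bits in the low nibble with the next state in bits 4 to 6. These packing choices and the names `rsc_step`, `rsc_build_lut`, `rsc_encode_nibble`, and `rsc_lut` are illustrative assumptions, not the book's layout.

```c
#include <assert.h>
#include <stdint.h>

uint8_t rsc_lut[128];  /* index = (state << 4) | input nibble */

/* One bit through the RSC encoder of Pcode 4.33.
   state = z1 z2 z3 = S[0] S[1] S[2], z1 in bit 2. Returns the parity bit. */
unsigned rsc_step(unsigned *state, unsigned bit)
{
    unsigned s0 = (*state >> 2) & 1u;
    unsigned s1 = (*state >> 1) & 1u;
    unsigned s2 = *state & 1u;
    unsigned t = bit ^ s1 ^ s2;        /* input XOR feedback (S[1] ^ S[2]) */
    unsigned parity = t ^ s2 ^ s0;     /* parity output */
    *state = (t << 2) | (s0 << 1) | s1; /* shift register update */
    return parity;
}

/* Precompute 4 parity bits and the next state for every (state, nibble). */
void rsc_build_lut(void)
{
    for (unsigned idx = 0; idx < 128u; idx++) {
        unsigned state = idx >> 4;
        unsigned parity = 0;
        for (int k = 3; k >= 0; k--)   /* nibble bits, MSB first */
            parity = (parity << 1) | rsc_step(&state, (idx >> k) & 1u);
        rsc_lut[idx] = (uint8_t)((state << 4) | parity);
    }
}

/* Encode 4 input bits with one table access; updates *state in place and
   returns the 4 parity bits (the systematic bits are the nibble itself). */
unsigned rsc_encode_nibble(unsigned *state, unsigned nibble)
{
    assert(*state < 8u && nibble < 16u);
    unsigned entry = rsc_lut[(*state << 4) | nibble];
    *state = entry >> 4;
    return entry & 0xFu;
}
```

Starting from state S0, the nibble 1000 follows the branches S0, S4, S2, S5, S6 of the steady-state trellis and produces the parity nibble 1111.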
Looking closely at the 8-bit look-up table design, computing the 7-bit offset into the 128-entry look-up table from 4 input bits (say in register r0) and 3 current-state bits (say in register r1) takes four operations: (1) extract (1 cycle) 4 data bits (say to register r2) from the input byte (or from r0); (2) extract (1 cycle) the current state (say to register r3) from the look-up table output (or from r1) of the previous encoding; (3) shift (1 cycle) the 3 state bits left by 4 (r3 = r3 << 4); and (4) OR (1 cycle) the extracted 4 input data bits with the state bits (i.e., r4 = r2|r3). We can avoid the extract and shift operations for the state bits by properly designing the look-up table. If we use 2 bytes for each look-up table entry and place the state bits in the shifted position as shown in Figure 4.15(b), we avoid two of the four offset calculation cycles (a 50% saving). Next, after computing the encoded bits, we have to pack them. As we are encoding 4 bits at a time and simultaneously outputting an encoded 4-bit nibble, packing nibbles into bytes is easy. We pack 2 nibbles into a byte in 2 cycles (with one left shift and one OR or ADD operation). For packing two encoder outputs, we spend 4 cycles on the reference embedded processor. By using the multiply-accumulate (MAC) unit, we can do this packing in 2 cycles for two encoders since we have two MAC units on the reference embedded processor. It is clear from this that the turbo encoding of 1 byte consumes 20 cycles or 2.5 cycles per bit.

Figure 4.15: (a) Look-up table–based turbo encoding (a 7-bit offset selects one of 128 entries holding the encoded bits and the updated state). (b) Look-up table design for efficient turbo encoding (one two-byte entry: the next state of the RSC encoder in one byte and the output encoded bits in the other).

In the previous discussion, we encoded 4 bits at a time for two encoders. But, in reality the second encoder doesn't get the data directly from the input bitstream bytes.
We have to interleave the input bitstream before passing it to the second encoder. In the previous section, we assumed that the interleaving bits are available for the second encoder. The stored interleaved bits are accessed directly from the buffer for encoding by storing the interleaved bits in an addressable boundary (i.e., a minimum of a byte has to be used for storing 1 bit). Here, since we are encoding in terms of nibbles using the look-up table approach, we have to pack the interleaved bits back to bytes before storing them to the interleaver buffer. Therefore, to feed the bits to the second encoder in the right order, we have to perform the following three steps: unpack, interleave and pack. As we represent the data in terms of bytes, packing and unpacking involves demultiplexing and multiplexing of bytes into bits and bits into bytes, respectively. Packing of bits to bytes needs all interleaved bits, so we have to ﬁrst perform interleaving completely. We perform unpacking and interleaving together to avoid the stalls. The two operations, unpacking and interleaving, consumes about 3 cycles per bit. Then we pack the bits back to bytes and this packing operation consumes one cycle per bit on the reference embedded processor. Based on the previous discussion, the cycles consumed per bit for unpacking, interleaving and packing of interleaved data are 4. In encoding of data, we spend 2.5 cycles per bit. With this, the turbo encoder total cycle cost is 6.5 cycles per bit. Assuming an overhead of 1 cycle per bit, we consume about 7.5 cycles per bit for performing turbo encoding. With this efﬁcient implementation, we use 108 MIPS of the reference embedded processor or approximately 18% of processor MIPS at a bit rate of 14.4 Mbps. In comparison, we used 60% of processor MIPS with simple implementation of turbo encoding discussed previously. 
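The processor-load figures quoted above follow from a one-line calculation, sketched here under the assumption (implicit in the text) that one instruction executes per cycle, so that cycles per bit times megabits per second gives MIPS. The function names are hypothetical.

```c
#include <assert.h>

/* MIPS needed for a given cost in cycles per bit at a given bit rate in
   Mbit/s, assuming one instruction per cycle. */
double required_mips(double cycles_per_bit, double bit_rate_mbps)
{
    assert(cycles_per_bit >= 0.0 && bit_rate_mbps >= 0.0);
    return cycles_per_bit * bit_rate_mbps;
}

/* Share of the processor budget, in percent. */
double processor_load_percent(double mips, double available_mips)
{
    assert(available_mips > 0.0);
    return 100.0 * mips / available_mips;
}
```

With the values from the text, 25 cycles per bit at 14.4 Mbps gives 360 MIPS (60% of 600 MIPS), while 4 + 2.5 + 1 = 7.5 cycles per bit gives 108 MIPS (18%).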
With the look-up table method described in this section for turbo encoding, we need 256 bytes of data memory to store precomputed encoding information. With this efficient method, we need less data memory (by a factor of 8) for storing the interleaved data as we pack the bits to bytes. Both methods require the same data memory for storing interleaver addresses as it is costly to compute interleaver addresses on the fly.

Modulation and Transmission of Bits

The output of the turbo encoder is passed through a mapper to obtain a modulated encoded bit sequence $x_{n,0}, x_{n,1}, x_{n,2}, x_{n+1,0}, x_{n+1,1}, x_{n+1,2}, \ldots$ before transmitting through a channel as shown in Figure 3.47. With BPSK modulation, we map "0" to "+1" and "1" to "−1." Here, we use the AWGN channel model to represent the real communication channel because the AWGN model approximates the effect of accumulation of noise components from many sources. Noise samples $u_i(n)$, drawn from an i.i.d. (independent and identically distributed) random process with zero mean and variance $\sigma^2$, are added to $x_{n,i}$ to obtain $y_{n,i}$. At the receiver side, we receive the noisy sequence $y_{n,0}, y_{n,1}, y_{n,2}, y_{n+1,0}, y_{n+1,1}, y_{n+1,2}, \ldots$ and pass the received noisy symbols to the turbo decoder to get reliable transmitted data symbols as shown in Figure 3.48. Here, we assume proper synchronization of data symbols (i.e., the boundaries of triplets in the received sequence corresponding to transmitted triplets should be identified). After data symbol synchronization, we identify received triplets as $(y_{n,0}, y_{n,1}, y_{n,2})$, $(y_{n+1,0}, y_{n+1,1}, y_{n+1,2})$, and so on. Next, we pass intrinsic information (the systematic bits $y_{n,0}$ and the first encoder parity bits $y_{n,1}$ of the received sequence) to the first decoder along with extrinsic information, Ext.2 (soft information), from the second decoder. For the first iteration, we use zeros for Ext.2 by assuming equiprobability for intrinsic information symbols.
After completing decoding with the first decoder, we start the second decoder with intrinsic information (the interleaved systematic bits, $I[y_{n,0}]$, and the second encoder parity bits, $y_{n,2}$) and extrinsic information, Ext.1 (soft information), from the first decoder as input. This process is repeated many times until we get reliable decisions from the second decoder output. At the end of the iterative decoding, we deinterleave the output of the second decoder (LLRs) to get back the transmitted symbol sequence. Then, we obtain hard bits by using the sign information of the output symbols. In the next section, we discuss the computation of metrics to simulate the turbo decoder.

4.5.2 MAP Decoder Metrics Computation

Turbo codes can be decoded with more than one type of algorithm (e.g., SOVA, MAP). In this section, we discuss a few techniques to simulate the MAP algorithm presented in Section 3.10.3 for decoding turbo codes. In the MAP algorithm, we need to compute alphas (forward-state metric using Equation 3.54), betas (reverse-state metric using Equation 3.55), gammas (branch metric using Equation 3.56), LLRs (using alphas, betas and gammas) and extrinsic information (using Equation 3.57). In computing alphas, betas, and LLRs, we have to compute an equation of the following form:

$$e^{z} = e^{x} + e^{y} \tag{4.31}$$

Equation (4.31) can be rewritten using a correction factor as follows:

$$e^{z} = e^{\max(x,y)}\left(1 + e^{-|x-y|}\right) \tag{4.32}$$

Taking the natural logarithm on both sides of Equation (4.32) results in

$$z = \max(x,y) + \ln\left(1 + e^{-|x-y|}\right) = \max{}^{*}(x,y) \tag{4.33}$$

The operator in Equation (4.33) is called a log-MAP operator.
Sometimes we approximate the expression $\ln(1 + e^{-|x-y|})$ using a constant, and then we call it a constant-log-MAP operator:

$$\ln\left(1 + e^{-|x-y|}\right) \approx \begin{cases} 0 & \text{if } |x-y| > 1.2 \\ 0.5 & \text{if } |x-y| \le 1.2 \end{cases}$$

If we completely ignore the value of $\ln(1 + e^{-|x-y|})$ in Equation (4.33), then we call that particular operator a max-log-MAP:

$$z = \max(x,y) \tag{4.34}$$

If the absolute difference between x and y is greater than 3, then the difference between the evaluated values of Equations (4.33) and (4.34) is negligible. For example, if x = 4 and y = 7, the computed values of z from Equations (4.33) and (4.34) are approximately 7.05 and 7, respectively. Depending on embedded processor capabilities, we use one of the previous operators in computing the state metrics alpha and beta and the value of LLRs. We use the flow of RSC encoder steady-state trellis data (after BPSK modulation) to compute alphas, betas, gammas, and LLRs. The rearranged steady-state trellis is shown in Figure 4.16. After mapping (with the BPSK modulator), the output binary digits $d_{n,a} \in \{0, 1\}$ in Figure 4.13 are changed to $x_{n,a} \in \{+1, -1\}$. The forward and backward state metrics computation flow (with Equations (3.54) and (3.55)) is realized in Figure 4.17(a) and (b). In computing state metrics (i.e., $\bar{\alpha}_n^{m}$ and $\bar{\beta}_n^{m}$), we use $\bar{\gamma}_n^{i,m}$.

Figure 4.16: Rearranged steady-state trellis data flow diagram (after modulation). (Each branch from a current state to a next state is labeled with its output pair $(x_{n,0}^{i}, x_{n,1}^{i})$, with values +1 or −1.)

Figure 4.17: State metrics computation realization. (a) Forward-state metric computation. (b) Reverse-state metric computation.

We compute gammas using Equation (3.56).
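For m = 0, based on Figures 4.16 and 4.17(b), $(x_{n,0}^{0}, x_{n,1}^{0}) = (+1, +1)$ and $(x_{n,0}^{1}, x_{n,1}^{1}) = (-1, -1)$. Thus,

$$\gamma_n^{0,0} = P^{(a)}(0)\, e^{-\frac{(y_{n,0}-x_{n,0}^{0})^{2} + (y_{n,1}-x_{n,1}^{0})^{2}}{2\sigma^{2}}} = P^{(a)}(0)\, e^{-\frac{y_{n,0}^{2}+y_{n,1}^{2}+2}{2\sigma^{2}}}\, e^{\frac{y_{n,0}+y_{n,1}}{\sigma^{2}}} = \sqrt{P^{(a)}(0)P^{(a)}(1)}\,\sqrt{\frac{P^{(a)}(0)}{P^{(a)}(1)}}\; e^{-\frac{y_{n,0}^{2}+y_{n,1}^{2}+2}{2\sigma^{2}}}\, e^{\frac{y_{n,0}+y_{n,1}}{\sigma^{2}}}$$

$$\bar{\gamma}_n^{0,0} = \ln\sqrt{P^{(a)}(0)P^{(a)}(1)} + \ln\sqrt{\frac{P^{(a)}(0)}{P^{(a)}(1)}} - \frac{y_{n,0}^{2}+y_{n,1}^{2}+2}{2\sigma^{2}} + \frac{y_{n,0}+y_{n,1}}{\sigma^{2}} = \left[\ln\sqrt{P^{(a)}(0)P^{(a)}(1)} - \frac{y_{n,0}^{2}+y_{n,1}^{2}+2}{2\sigma^{2}}\right] - \ln\sqrt{\frac{P^{(a)}(1)}{P^{(a)}(0)}} + \frac{y_{n,0}+y_{n,1}}{\sigma^{2}} \tag{4.35}$$

$$\bar{\gamma}_n^{0,0} = C_0 + \bar{\gamma}_n^{h} \tag{4.36}$$

where

$$C_0 = \ln\sqrt{P^{(a)}(0)P^{(a)}(1)} - \frac{y_{n,0}^{2}+y_{n,1}^{2}+2}{2\sigma^{2}}$$

contains terms that are common to every branch of the stage (they contribute the same positive factor $e^{C_0}$ to each branch), and

$$\bar{\gamma}_n^{h} = -\ln\sqrt{\frac{P^{(a)}(1)}{P^{(a)}(0)}} + \frac{y_{n,0}+y_{n,1}}{\sigma^{2}}$$

contains terms which affect the maximum a posteriori probability. In a similar manner, we can compute $\bar{\gamma}_n^{1,0}$ as

$$\bar{\gamma}_n^{1,0} = C_0 - \bar{\gamma}_n^{h} \tag{4.37}$$

Then, after ignoring constant terms,

$$e^{\bar{\beta}_n^{0}} = e^{\bar{\beta}_{n+1}^{f(0,0)}+\bar{\gamma}_n^{0,0}} + e^{\bar{\beta}_{n+1}^{f(1,0)}+\bar{\gamma}_n^{1,0}} = e^{\bar{\beta}_{n+1}^{0}+\bar{\gamma}_n^{h}} + e^{\bar{\beta}_{n+1}^{4}-\bar{\gamma}_n^{h}}$$

Using Equations (4.31) to (4.33),

$$\bar{\beta}_n^{0} = \max{}^{*}\left(\bar{\beta}_{n+1}^{0} + \bar{\gamma}_n^{h},\; \bar{\beta}_{n+1}^{4} - \bar{\gamma}_n^{h}\right) \tag{4.38}$$

Similarly, for m = 2, from Figures 4.16 and 4.17(b), $(x_{n,0}^{0}, x_{n,1}^{0}) = (+1, -1)$ and $(x_{n,0}^{1}, x_{n,1}^{1}) = (-1, +1)$. Then,

$$e^{\bar{\beta}_n^{2}} = e^{\bar{\beta}_{n+1}^{f(0,2)}+\bar{\gamma}_n^{0,2}} + e^{\bar{\beta}_{n+1}^{f(1,2)}+\bar{\gamma}_n^{1,2}}$$

where, repeating the steps of Equation (4.35) with these symbol values, $\bar{\gamma}_n^{0,2} = C_2 + \bar{\gamma}_n^{g}$, $\bar{\gamma}_n^{1,2} = C_2 - \bar{\gamma}_n^{g}$, $C_2 = C_0$,

$$\bar{\gamma}_n^{g} = -\ln\sqrt{\frac{P^{(a)}(1)}{P^{(a)}(0)}} + \frac{y_{n,0}-y_{n,1}}{\sigma^{2}}$$

and then, with $f(0,2) = 5$ and $f(1,2) = 1$ and after ignoring constant terms,

$$e^{\bar{\beta}_n^{2}} = e^{\bar{\beta}_{n+1}^{1}-\bar{\gamma}_n^{g}} + e^{\bar{\beta}_{n+1}^{5}+\bar{\gamma}_n^{g}}$$

Using Equations (4.31) to (4.33),

$$\bar{\beta}_n^{2} = \max{}^{*}\left(\bar{\beta}_{n+1}^{1} - \bar{\gamma}_n^{g},\; \bar{\beta}_{n+1}^{5} + \bar{\gamma}_n^{g}\right) \tag{4.39}$$

Figure 4.18: Branch metric (gamma) computation. [Inputs $y_{n,0}$, $y_{n,1}$, and the a priori term $\frac{1}{2}\ln\left(P^{(a)}(1)/P^{(a)}(0)\right)$; outputs $\bar{\gamma}_n^{h}$ and $\bar{\gamma}_n^{g}$.]

Figure 4.19: State metric butterflies computation. [(a) Alpha butterflies from stage n to stage n+1. (b) Beta butterflies from stage n+1 to stage n.]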
In the same fashion we can derive $\bar{\gamma}_n^{i,m}$ for other values of m (or branches). With BPSK modulation, we have only two branch metrics $\{\bar{\gamma}_n^{h}, \bar{\gamma}_n^{g}\}$ per stage; the realization of the branch metrics (gammas) computation is shown in Figure 4.18. For a particular stage (at time n), state metrics alphas and betas are computed using the same branch metric gammas as shown in Figure 4.19. We calculate the LLR using Equation (3.52) as follows:

$$\mathrm{LLR} = \ln\sum_{m} \alpha_n^{m}\,\gamma_n^{1,m}\,\beta_{n+1}^{f(1,m)} - \ln\sum_{m} \alpha_n^{m}\,\gamma_n^{0,m}\,\beta_{n+1}^{f(0,m)} = \ln\sum_{m} e^{\bar{\alpha}_n^{m}+\bar{\gamma}_n^{1,m}+\bar{\beta}_{n+1}^{f(1,m)}} - \ln\sum_{m} e^{\bar{\alpha}_n^{m}+\bar{\gamma}_n^{0,m}+\bar{\beta}_{n+1}^{f(0,m)}} \tag{4.40}$$

Equation (4.40) for M = 3 (i.e., for $0 \le m \le 2^{M} - 1$) can be interpreted using the data flow in Figures 4.18 and 4.19. The first term in Equation (4.40) explains the connection from alpha to beta through gamma for bit "1" as shown in Figure 4.20(a), and the second term in Equation (4.40) explains the connection from alpha to beta through gamma for bit "0" as shown in Figure 4.20(b). We obtain a posteriori probabilities (APPs) in Equation (3.46) from Figure 4.20(a) and (b) for bit "1" and bit "0" at time n given the received sequence $Y_N$ as follows:

$$\ln\left(\Pr(c_n = 1 \mid Y_N)\right) = \ln\left[ e^{-\bar{\gamma}_n^{h}}\left( e^{\bar{\alpha}_n^{0}+\bar{\beta}_{n+1}^{4}} + e^{\bar{\alpha}_n^{1}+\bar{\beta}_{n+1}^{0}} + e^{\bar{\alpha}_n^{6}+\bar{\beta}_{n+1}^{3}} + e^{\bar{\alpha}_n^{7}+\bar{\beta}_{n+1}^{7}} \right) + e^{-\bar{\gamma}_n^{g}}\left( e^{\bar{\alpha}_n^{2}+\bar{\beta}_{n+1}^{1}} + e^{\bar{\alpha}_n^{3}+\bar{\beta}_{n+1}^{5}} + e^{\bar{\alpha}_n^{4}+\bar{\beta}_{n+1}^{6}} + e^{\bar{\alpha}_n^{5}+\bar{\beta}_{n+1}^{2}} \right) \right] \tag{4.41}$$

$$\ln\left(\Pr(c_n = 0 \mid Y_N)\right) = \ln\left[ e^{\bar{\gamma}_n^{h}}\left( e^{\bar{\alpha}_n^{0}+\bar{\beta}_{n+1}^{0}} + e^{\bar{\alpha}_n^{1}+\bar{\beta}_{n+1}^{4}} + e^{\bar{\alpha}_n^{6}+\bar{\beta}_{n+1}^{7}} + e^{\bar{\alpha}_n^{7}+\bar{\beta}_{n+1}^{3}} \right) + e^{\bar{\gamma}_n^{g}}\left( e^{\bar{\alpha}_n^{2}+\bar{\beta}_{n+1}^{5}} + e^{\bar{\alpha}_n^{3}+\bar{\beta}_{n+1}^{1}} + e^{\bar{\alpha}_n^{4}+\bar{\beta}_{n+1}^{2}} + e^{\bar{\alpha}_n^{5}+\bar{\beta}_{n+1}^{6}} \right) \right] \tag{4.42}$$

Based on Equations (3.46), (4.41), and (4.42), the LLR for the n-th trellis stage is computed as

$$\mathrm{LLR}(c_n) = \ln\left(\Pr(c_n = 1 \mid Y_N)\right) - \ln\left(\Pr(c_n = 0 \mid Y_N)\right) \tag{4.43}$$

4.5.3 MAP Decoder Computational Complexity

The LLR value in Equation (4.43) is computed from APPs which are obtained using Equations (4.41) and (4.42). In computing APPs, we use the n-th stage trellis all-states alphas (forward-state metrics), betas (reverse-state metrics), and gammas (branch metrics).
At the n-th stage, gammas are computed using the received information and extrinsic information of the n-th stage, alphas are computed using (n − 1)-th stage alphas and gammas, and betas are computed using (n + 1)-th stage betas and gammas. In other words, to compute the LLR value at the n-th stage, we use information from alphas computed from the previous n stages and betas computed from the future (N − n) stages of the trellis, as shown in Figure 4.21.

Figure 4.20: (a) Bit "1" MAP connections. (b) Bit "0" MAP connections.

Figure 4.21: Illustration of LLR computation at the n-th stage. [Horizontal axis: trellis stages, from the 1st stage through the n-th stage to the N-th stage; alphas run forward, betas run backward; LLR(n-th gamma, n-th alpha, (n + 1)-th beta).]

In Figure 4.21, to compute the LLR at the n-th stage we need the alpha and gamma of the n-th stage and the beta of the (n + 1)-th stage. To compute alpha, we need previous alpha values, and to compute beta, we need future beta values. To compute alphas, betas, and LLRs at a particular stage, we need the gammas of that particular stage. In other words, we have to keep the alphas, betas, and gammas of all stages alive in the buffer for calculating LLRs.

The turbo decoder shown in Figure 3.48 works on a sequence or frame of length N symbols at a time. The value of N ranges from a few tens of symbols to many thousands of symbols. For example, the range of N specified by the 3G standard is 40 to 5044. If we are using turbo codes in a particular application, we have to support all the data lengths used by that particular application or standard. The number of states ($2^{M}$) present in the trellis stage depends on the number of delay units present in the encoder.
If M = 3 delay units are present in an RSC encoder (as in UMTS 3G), then we have eight states in a trellis stage. Here, we consider N = 5044 and M = 3 for estimation of turbo coder computational complexity in terms of the number of operations and memory requirements. Decoding Complexity and Number of Operations In decoding of turbo codes using the MAP algorithm, we use two MAP decoders per iteration and repeat for many iterations. In the maximum a posteriori algorithm, we compute all metrics in the logarithm domain to avoid multiplications and to avoid frequent normalization of alpha and beta as they grow with errors. In the logarithm domain, we predominantly use additions and subtractions and the log-MAP operator in computing the LLR metrics. Table 4.2 shows the number of operations required (per trellis stage per decoder per iteration) to compute gamma, alpha, beta, LLRs, and extrinsic information. Memory Requirements For M = 3, we have 2 gammas, 8 alphas and 8 betas in every stage of the trellis. We use 16 bits (or 2 bytes) to represent each value (to avoid saturation before normalization). If N = 5044, then we need approximately 20 kB (= 5044 x 2 x 2) for gammas, 80 kB (= 5044 x 2 x 8) for alphas, 80 kB for betas, and another 40 kB for storing LLRs, extrinsic information of both decoders and intrinsic information for both decoders. The MAP algorithm involves computation of betas from the last stage of the data frame to the ﬁrst stage of the data frame and alphas from the ﬁrst stage of the data frame to the last stage of the data frame. We can avoid storing one of either alpha or beta. For example, we store computed gamma and alpha for full frame, then we compute LLRs (in a backward direction) using betas computed on the ﬂy without storing them. The other way is storing computed gamma and beta for full frame, then compute LLRs (in a forward direction) using alphas computed on the ﬂy. 
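The memory figures quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes 16-bit metrics as stated in the text; the helper names and the `other_bytes` parameter (standing in for the roughly 40 kB of LLR, extrinsic, and intrinsic buffers) are hypothetical.

```c
#include <assert.h>

#define BYTES_PER_METRIC 2ul   /* 16-bit metrics, as stated in the text */

/* Bytes needed to keep `per_stage` metrics for each of `n` trellis stages. */
static unsigned long metric_bytes(unsigned long n, unsigned long per_stage)
{
    assert(n > 0 && per_stage > 0);
    return n * BYTES_PER_METRIC * per_stage;
}

/* Total bytes when gammas and only one of alpha/beta are stored for the
 * full frame. `m` is the number of delay units (2^m states); `other_bytes`
 * is an assumed allowance for LLR, extrinsic and intrinsic buffers. */
static unsigned long total_bytes_one_state_metric(unsigned long n,
                                                  unsigned int m,
                                                  unsigned long other_bytes)
{
    unsigned long states = 1ul << m;
    return metric_bytes(n, 2) + metric_bytes(n, states) + other_bytes;
}
```

For N = 5044 and M = 3 this gives 20176 bytes of gammas and 80704 bytes for one set of state metrics, which together with about 40 kB of other buffers lands near 140 kB.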
In either case, we need approximately 140 kB of data memory on an embedded processor (see Appendix A on the companion website for more information on reference embedded processor resources) to implement the turbo decoder for N = 5044 and M = 3. Consequently, turbo codes demand a lot of memory in their decoding.

Table 4.2: Number of operations involved in computing MAP algorithm metrics per trellis stage per decoder per iteration

Metric                  Additions, Subtractions, Others   Log-MAP Operations
Alpha                   16                                8
Beta                    16                                8
Gamma                   8                                 0
LLR                     21                                14
Extrinsic information   6                                 0

4.5.4 Window-Based Turbo Decoder Implementation

The huge memory requirement of turbo decoding can be reduced by using a window-based method. This method involves dividing the entire data frame into smaller data blocks and performing the decoding on smaller windows. In this window-based method, we need to store gamma and one of alpha or beta. In the implementation, we always compute alphas for the full window and store them, whereas we compute betas on the fly and use them. Because we compute LLRs using the gammas, alphas, and betas of data within the present window, if we divide the total frame into Q blocks, the memory required with the window method is 1/Q of that of the straightforward implementation. The turbo decoding using the window-based method is shown in Figure 4.22.

Figure 4.22: Turbo decoder window-based implementation. [Received data frame divided into Window 1, Window 2, Window 3, ..., Window Q − 1, Window Q, with overlap stages after each window; alphas, betas, and LLRs computed per window.]

By dividing the whole data frame into smaller windows, betas can be computed from the last stage of each data window instead of the last stage of the data frame. In the case of the full data frame, we use the trellis termination sequence for betas to converge before the start of computing of LLRs from the last stage.
But with the window-based method, we do not know the state of the transmitted trellis at the last stage of the window and hence we require a few overlap stages for betas to converge. Therefore, the window-based method requires extra (of overlap length) beta computation in its implementation. Since we do not decode any information bits during these overlap stages, this adds to the overall cost of MAP decoding. The disadvantage with the window-based method is the computational overhead, estimated as follows. Typically, the length of overlap stages needed is about 6K stages, where K is the constraint length of the RSC encoder. If we need “B” cycles to compute betas per stage, then we spend 6 ∗ K ∗ B extra cycles for computing betas in the window-based method. This overhead depends on the number of windows that are used in decoding the full frame. If the number of windows is less, then the overhead is also less (but requires more memory) and if the number of windows is more, then the overhead is also more (but requires less memory). Turbo decoding demands huge computations as well as huge memory requirements. Implementation of turbo decoding on deep pipelined embedded processors needs a lot of optimization at both algorithm level and instruction level. The optimization at algorithm level includes rearranging the algorithm ﬂow to suit the processor architecture and taking a few shortcuts (if possible) to avoid some of the computations. The optimization at the instruction level includes gathering many operations and feeding them to all compute units of the processor, balancing bandwidth of ALU and DAG units, avoiding pipeline stalls, and so on. Typically, we interleave the program code to avoid pipeline stalls in running the program on deep pipelined embedded processors. To interleave the program code, we have to gather as many operations that are independent from one another (i.e., their input does not depend on the output of other operations) as possible. 
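Returning to the overlap overhead estimated above (6K extra beta stages, or 6 * K * B cycles, per window), the following sketch puts numbers on it. It assumes, as a simplification, that every window pays the full overlap; the function names and the cycle count B passed in are hypothetical.

```c
#include <assert.h>

/* Overlap length in stages: about 6K for constraint length K. */
static unsigned long overlap_stages(unsigned int k)
{
    assert(k > 0);
    return 6ul * k;
}

/* Extra beta cycles per frame, assuming each of `num_windows` windows
 * needs a full overlap and a beta stage costs `b` cycles. */
static unsigned long overlap_cycles(unsigned int k, unsigned long b,
                                    unsigned long num_windows)
{
    return overlap_stages(k) * b * num_windows;
}

/* Overlap stages as a percentage of the useful stages in the frame. */
static double overlap_overhead_percent(unsigned int k,
                                       unsigned long frame_len,
                                       unsigned long num_windows)
{
    assert(frame_len > 0 && num_windows > 0);
    return 100.0 * (double)(overlap_stages(k) * num_windows)
                 / (double)frame_len;
}
```

For instance, with K = 4, a frame of 5088 stages, and 8 windows, the overlap adds 192 beta stages, a little under 4% of the frame; doubling the number of windows doubles that overhead while halving the window memory.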
In MAP decoding, the three major operations (which consume almost 90% of cycles) are the computation of alphas, betas, and LLRs. Here, we have two options to implement the MAP decoder: (1) computing all stages of alphas at once, all stages of betas at once, and all stages of LLRs at once; and (2) simultaneously computing the alpha, beta, and LLR for one stage. Next, we discuss the advantages and disadvantages of these two approaches. An advantage of the first method is that the coding becomes simple, but there are many disadvantages. Based on Figures 4.19 through 4.21, disadvantages of using the first method include (1) more memory is needed to store all metrics, (2) computation of alphas and betas of the next trellis stage requires current trellis stage outputs (whose trellis states are not the same as the inputs, and accessing them in the right order delays the next stage metric computation), and (3) reduced scope to interleave the program code.

Figure 4.23: Efficient implementation of window-based MAP decoder. [Sequence: Window 1 alpha; Window 1 beta and LLR with Window 2 alpha; Window 2 beta and LLR with Window 3 alpha; ...; Window Q beta and LLR.]

A disadvantage of the second method is that simultaneous computation of all three terms (alpha, beta, and LLR) is not possible for the same stage (as shown in Figure 4.21). Advantages of the second method are (1) it is sufficient to store only one of alpha or beta, (2) no delay in data access, and (3) good scope to interleave the program code to avoid pipeline stalls. As the MAP decoder implementation with the second method has many advantages, we concentrate on the realization of this approach for simulation of the MAP decoder. We can overcome the disadvantage of the second method in the window-based implementation as shown in Figure 4.23.
Here, we compute only the alpha of the first window before entering the loop (as given later in Pcode 4.44); then, in every iteration, we compute beta and LLR for the current window and alpha for the next window, and the process continues. In this approach, because we compute alpha, beta, and LLR together in each iteration, we have sufficient time to arrange the data for the compute units and also have good scope to interleave the program code.

4.5.5 Turbo Decoder Simulation

In this section, we simulate the window-based turbo decoder shown in Figure 4.23. We choose the length of the input data frame to the turbo decoder as 5088. Based on this input data frame size, we define other parameter sizes in the simulation of the window-based turbo decoder as given in Pcode 4.34. We divide the input data frame into eight small data windows. We use 24 (= 6 * K = 6 * 4) overlap stages for each window. We iterate five times to get reliable decisions using the MAP decoder. The other parameters such as window length, maximum window length, and number of stages are defined based on the parameters chosen previously. In this turbo decoder simulation, we use the max-log-MAP operator (i.e., taking a simple maximum of two inputs). In addition, we use a Ping-Pong buffer concept to reduce the L1 data memory size. We use approximately 45 kB of data memory to store intermediate data, as given in Pcode 4.35. We store the whole received input data sequence (one systematic sequence, one interleaved systematic sequence, and two parity sequences) in L3 memory. We precalculate the offsets that realize the interleaver and deinterleaver functionality and store them in L3 memory.

#define DATA_SIZE           5088
#define NUM_ITERATIONS      5
#define OVERLAP_LENGTH      24
#define NUM_WINDOWS         8
// Derived sizes are parenthesized so that they expand safely inside
// larger expressions such as 2*MAX_WINDOW_LENGTH+2
#define WINDOW_LENGTH       (DATA_SIZE/NUM_WINDOWS)
#define MAX_WINDOW_LENGTH   (WINDOW_LENGTH + OVERLAP_LENGTH)
#define NUM_STAGES          (DATA_SIZE + OVERLAP_LENGTH)

Pcode 4.34: Window-based turbo decoder implementation parameters.
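The derived sizes can be checked with a small sketch. Note that the derived macros are written with surrounding parentheses here; that is a defensive assumption of this sketch so that an expression such as `2 * MAX_WINDOW_LENGTH + 2` (the gamma buffer size used in this simulation) evaluates as intended. The helper `gamma_buffer_entries` is hypothetical.

```c
#include <assert.h>

#define DATA_SIZE          5088
#define NUM_ITERATIONS     5
#define OVERLAP_LENGTH     24
#define NUM_WINDOWS        8
/* Parenthesized derived sizes (an assumption of this sketch). */
#define WINDOW_LENGTH      (DATA_SIZE / NUM_WINDOWS)
#define MAX_WINDOW_LENGTH  (WINDOW_LENGTH + OVERLAP_LENGTH)
#define NUM_STAGES         (DATA_SIZE + OVERLAP_LENGTH)

/* Entries in one gamma buffer: two branch metrics per stage of a window
 * plus its overlap, and two spare entries. */
static int gamma_buffer_entries(void)
{
    assert(DATA_SIZE % NUM_WINDOWS == 0);   /* frame splits evenly */
    return 2 * MAX_WINDOW_LENGTH + 2;
}
```

Without the parentheses, `2 * MAX_WINDOW_LENGTH + 2` would expand to `2 * 5088 / 8 + 24 + 2`, which is 1298 rather than the intended 1322, so the overlap stages would be counted only once.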
Data Handling and Transfer between L3 and L1

In the turbo decoder, we handle a huge amount of data in MAP decoding by storing the input data in slow L3 memory. We store three inputs, the interleaver input, and the interleaver matrix in L3. As we compute beta and LLR for the current window and alpha for the next window, we need branch metrics for both windows. To reduce the data transfers, we bring the input data (intrinsic information + interleaver matrix for interleaving the output) for one window and store the information needed for computing gamma for the next window temporarily in L1 memory. As data transfer (using DMA) introduces some latency, we always bring the data for the next window in advance to hide the data transfer latency. We use the Ping-Pong buffer concept in this data transfer process.

// Memory bank-1: approx. 22.5 kB for N = 5088
signed char   Extrinsic1[NUM_STAGES];         // extrinsic info from MAP-1, 5 kB
signed char   Extrinsic2[NUM_STAGES];         // extrinsic info from MAP-2, 5 kB
signed char   inputX1[MAX_WINDOW_LENGTH];     // X in 5.3 format, 0.7 kB
signed char   inputX2[MAX_WINDOW_LENGTH];     // 0.7 kB
signed short  Alpha0[WINDOW_LENGTH*8+8];      // Alpha0, 10 kB
signed short  interM1[WINDOW_LENGTH];         // interleaver look-up, 1.3 kB
unsigned long Turbo_Struct[16];

// Memory bank-2: approx. 22.5 kB for N = 5088
signed short  Alpha2[WINDOW_LENGTH*8+8];      // Alpha2, 10 kB
signed short  interM2[WINDOW_LENGTH];         // 1.3 kB
signed short  Gamma0[2*MAX_WINDOW_LENGTH+2];  // Gamma0, 2.7 kB
signed short  Gamma1[2*MAX_WINDOW_LENGTH+2];  // Gamma1, 2.7 kB
signed short  LLR01[WINDOW_LENGTH];           // 1.3 kB
signed short  inputX3[MAX_WINDOW_LENGTH+4];   // 1.4 kB
signed short  inputX4[MAX_WINDOW_LENGTH+4];   // 1.4 kB
signed char   inputY1[MAX_WINDOW_LENGTH];     // Y in 5.3 format, 0.7 kB
signed char   inputY2[MAX_WINDOW_LENGTH];     // 0.7 kB
signed short  Beta[8];

Pcode 4.35: Data buffers used in turbo decoder simulation.
MAP Decoder Metrics Simulation As the MAP decoder involves the computation of many metrics, we deﬁne macros for each metric simulation. We simulate the MAP decoder from bottom to top, meaning that we simulate the turbo decoder in the following order: (1) simulate individual metrics, (2) simulate one window, (3) simulate one MAP decoder, (4) simulate a single iteration, and (5) repeat this simulated code for many iterations. We use the data structure given in Pcode 4.36 to handle all data and addresses. Branch Metric: Gamma Based on Figure 4.18, the computation of the branch metric gamma requires intrinsic information (systematic input and parity input) and extrinsic information (a priori information). In addition, we multiply the intrinsic information with the channel noise variance (which is estimated at the receiver). A macro deﬁnition for gamma computation is shown in Pcode 4.37. In the current window, we always compute gamma for the next window and we store systematic input data temporally in L1 memory for future use (to compute extrinsic information of next window). We compute two gammas per trellis stage as we require two gammas per trellis stage to compute state metrics alphas and betas and to compute APPs for LLRs. 
typedef struct TurboDec_tag {
    signed char  *xx;       // holds systematic input array address
    signed char  *yy;       // holds parity input array address
    signed char  *Ext1;     // holds first decoder extrinsic information array
    signed char  *Ext2;     // holds second decoder extrinsic information array
    signed short *AlphaC;   // holds current window Alpha metrics array
    signed short *AlphaN;   // holds next window Alpha metrics array address
    signed short *GammaC;   // holds current window Gamma metrics array
    signed short *GammaN;   // holds next window Gamma metrics array
    signed short *mm;       // holds interleave offsets array
    signed short *xC;       // holds current window systematic input
    signed short *xN;       // holds next window systematic input
    signed long  Sigma;     // assign estimated channel noise variance
} TurboDec_t;

Pcode 4.36: Data structure to handle turbo decoder parameters.

#define COMP_GAMMA_N()\
    j = i<<1;\
    r2 = pT->Ext2[i+n]; r1 = pT->yy[i]; r0 = pT->xx[i];\
    r4 = r2 + (r0 + r1) * pT->Sigma;\
    r5 = r2 + (r0 - r1) * pT->Sigma;\
    pT->xN[i] = r0;\
    *(pT->GammaN+j) = r4; *(pT->GammaN+j+1) = r5;

Pcode 4.37: Macro for gamma computation.
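The gamma macro above relies on variables declared in the enclosing function (i, j, n, r0 to r5, pT). As a self-contained illustration of the same arithmetic, the following function-style sketch computes both branch metrics for a run of stages. The function name `comp_gamma` and its parameter list are hypothetical; the caller is assumed to pass the extrinsic pointer already offset to the window start, and no saturation check is made when narrowing to 16 bits.

```c
#include <assert.h>
#include <stddef.h>

/* For each stage i: gamma[2i]   = ext + (x + y) * sigma   (gamma-h)
 *                   gamma[2i+1] = ext + (x - y) * sigma   (gamma-g)
 * and the systematic input is copied to x_copy for later use in the
 * extrinsic computation. Results are narrowed to short without any
 * saturation check (an accepted limitation of this sketch). */
static void comp_gamma(const signed char *x, const signed char *y,
                       const signed char *ext, long sigma, size_t count,
                       short *gamma, short *x_copy)
{
    size_t i;

    assert(x != NULL && y != NULL && ext != NULL);
    assert(gamma != NULL && x_copy != NULL);

    for (i = 0; i < count; i++) {
        long r0 = x[i];
        long r1 = y[i];
        long r2 = ext[i];

        gamma[2 * i]     = (short)(r2 + (r0 + r1) * sigma);
        gamma[2 * i + 1] = (short)(r2 + (r0 - r1) * sigma);
        x_copy[i]        = (short)r0;
    }
}
```

Storing the two metrics of a stage side by side matches the stride of two used when the state metric macros step through the gamma buffer.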
State Metric: Alpha

We simulate alpha computation based on the data flow shown in Figure 4.19(a). To compute forward-state metrics (alpha, indexed with k), we use the previous stage state metrics (alpha, indexed with m) and current stage branch metrics (gamma, indexed with n). The simulation code for the alpha computation macro definition is given in Pcode 4.38.

#define ALPHA()\
    r0 = pT->AlphaN[m+0]; r1 = pT->AlphaN[m+1]; r2 = pT->GammaN[n+0];\
    tmp0 = r0 + r2; tmp1 = r1 - r2; tmp2 = r0 - r2; tmp3 = r1 + r2;\
    tmp0 = max(tmp0, tmp1); tmp2 = max(tmp2, tmp3);\
    pT->AlphaN[k+0] = tmp0; pT->AlphaN[k+4] = tmp2;\
    r0 = pT->AlphaN[m+2]; r1 = pT->AlphaN[m+3]; r2 = pT->GammaN[n+1];\
    tmp0 = r0 - r2; tmp1 = r1 + r2; tmp2 = r0 + r2; tmp3 = r1 - r2;\
    tmp0 = max(tmp0, tmp1); tmp2 = max(tmp2, tmp3);\
    pT->AlphaN[k+1] = tmp0; pT->AlphaN[k+5] = tmp2;\
    r0 = pT->AlphaN[m+4]; r1 = pT->AlphaN[m+5]; r2 = pT->GammaN[n+1];\
    tmp0 = r0 + r2; tmp1 = r1 - r2; tmp2 = r0 - r2; tmp3 = r1 + r2;\
    tmp0 = max(tmp0, tmp1); tmp2 = max(tmp2, tmp3);\
    pT->AlphaN[k+2] = tmp0; pT->AlphaN[k+6] = tmp2;\
    r0 = pT->AlphaN[m+6]; r1 = pT->AlphaN[m+7]; r2 = pT->GammaN[n+0];\
    tmp0 = r0 - r2; tmp1 = r1 + r2; tmp2 = r0 + r2; tmp3 = r1 - r2;\
    tmp0 = max(tmp0, tmp1); tmp2 = max(tmp2, tmp3);\
    pT->AlphaN[k+3] = tmp0; pT->AlphaN[k+7] = tmp2;

Pcode 4.38: Macro for alpha computation.

State Metric: Beta

We simulate beta computation based on the data flow shown in Figure 4.19(b). We compute reverse-state metrics (beta) from the last stage to the first stage. To compute the current trellis stage beta metrics, we use the next stage beta metrics and the next stage branch metrics (gamma, indexed with n). The simulation code for the beta computation macro definition is given in Pcode 4.39.
#define BETA()\
    r0 = Beta[0]; r1 = Beta[4]; r2 = Beta[1]; r3 = Beta[5];\
    r4 = Beta[2]; r5 = Beta[6]; r6 = Beta[3]; r7 = Beta[7];\
    tmp2 = pT->GammaC[n]; tmp3 = pT->GammaC[n+1];\
    tmp0 = r0 + tmp2; tmp1 = r1 - tmp2; r0 = r0 - tmp2; r1 = r1 + tmp2;\
    tmp0 = max(tmp0, tmp1); r0 = max(r0, r1);\
    Beta[0] = tmp0; Beta[1] = r0;\
    tmp0 = r2 - tmp3; tmp1 = r3 + tmp3; r0 = r2 + tmp3; r1 = r3 - tmp3;\
    tmp0 = max(tmp0, tmp1); r0 = max(r0, r1);\
    Beta[2] = tmp0; Beta[3] = r0;\
    tmp0 = r4 + tmp3; tmp1 = r5 - tmp3; r0 = r4 - tmp3; r1 = r5 + tmp3;\
    tmp0 = max(tmp0, tmp1); r0 = max(r0, r1);\
    Beta[4] = tmp0; Beta[5] = r0;\
    tmp0 = r6 - tmp2; tmp1 = r7 + tmp2; r0 = r6 + tmp2; r1 = r7 - tmp2;\
    tmp0 = max(tmp0, tmp1); r0 = max(r0, r1);\
    Beta[6] = tmp0; Beta[7] = r0;

Pcode 4.39: Macro for beta computation.

LLRs Computation

We compute LLRs based on the bit "0" and bit "1" MAP connections shown in Figure 4.20(a) and (b). We use current stage alphas and gammas and next stage betas for computing LLRs. The macro definition for LLR computation is given in Pcode 4.40. In Pcode 4.38, we used the array names AlphaN[ ] and GammaN[ ] to represent the metrics of the next window. In Pcode 4.40, we use the array names AlphaC[ ] and GammaC[ ] to represent the metrics of the current window.
Assuming limited data registers on an embedded processor, we use the array Turbo_Stack[ ] as a stack to store intermediate results in the computation of LLRs. After computing the LLR of the current stage, we store it in the array LLR01[ ].

#define LLRS()\
    r0 = pT->AlphaC[m+0]; r2 = Beta[0]; r1 = pT->AlphaC[m+1]; r3 = Beta[4];\
    tmp0 = r0 + r2; tmp1 = r1 + r3; tmp2 = r0 + r3; tmp3 = r1 + r2;\
    tmp0 = max(tmp0, tmp1); tmp2 = max(tmp2, tmp3);\
    Turbo_Stack[0] = tmp0; Turbo_Stack[1] = tmp2;\
    r0 = pT->AlphaC[m+6]; r2 = Beta[7]; r1 = pT->AlphaC[m+7]; r3 = Beta[3];\
    tmp0 = r0 + r2; tmp1 = r1 + r3; tmp2 = r0 + r3; tmp3 = r1 + r2;\
    tmp0 = max(tmp0, tmp1); tmp2 = max(tmp2, tmp3);\
    Turbo_Stack[2] = tmp0; Turbo_Stack[3] = tmp2;\
    r0 = pT->AlphaC[m+2]; r2 = Beta[5]; r1 = pT->AlphaC[m+3]; r3 = Beta[1];\
    tmp0 = r0 + r2; tmp1 = r1 + r3; tmp2 = r0 + r3; tmp3 = r1 + r2;\
    tmp0 = max(tmp0, tmp1); tmp2 = max(tmp2, tmp3);\
    Turbo_Stack[4] = tmp0; Turbo_Stack[5] = tmp2;\
    r0 = pT->AlphaC[m+4]; r2 = Beta[2]; r1 = pT->AlphaC[m+5]; r3 = Beta[6];\
    tmp0 = r0 + r2; tmp1 = r1 + r3; tmp2 = r0 + r3; tmp3 = r1 + r2;\
    tmp0 = max(tmp0, tmp1); tmp2 = max(tmp2, tmp3);\
    Turbo_Stack[6] = tmp0; Turbo_Stack[7] = tmp2;\
    r0 = Turbo_Stack[0]; r1 = Turbo_Stack[1]; r2 = Turbo_Stack[2]; r3 = Turbo_Stack[3];\
    tmp0 = max(r0, r2); tmp1 = max(r1, r3);\
    r0 = Turbo_Stack[4]; r1 = Turbo_Stack[5]; r2 = Turbo_Stack[6]; r3 = Turbo_Stack[7];\
    tmp2 = max(r0, r2); tmp3 = max(r1, r3);\
    r0 = pT->GammaC[n]; r1 = pT->GammaC[n+1];\
    r2 = tmp0 + r0; r3 = tmp1 - r0; r0 = tmp2 + r1; r1 = tmp3 - r1;\
    tmp0 = max(r0, r2); tmp1 = max(r1, r3);\
    r0 = tmp1 - tmp0; LLR01[p--] = r0;

Pcode 4.40: Macro definition for LLR computation.

State Metrics Initialization

As discussed in Section 3.10.3, we initialize state metrics alpha and beta before we start computing the initial alphas (i.e., for the first stage) and betas (i.e., the last stage).
We initialize the first state with zero and all other states with a large negative value (one that can be represented within the precision allowed for state metrics); usually we assign a negative value equal to half of the extreme negative value (to avoid saturation due to the initial fluctuations). The macro definition for state metrics initialization is given in Pcode 4.41.

#define ALPHA_INIT()\
    tmp0 = -4096*4;\
    pT->AlphaC[0] = 0; pT->AlphaC[1] = tmp0;\
    pT->AlphaC[2] = tmp0; pT->AlphaC[3] = tmp0;\
    pT->AlphaC[4] = tmp0; pT->AlphaC[5] = tmp0;\
    pT->AlphaC[6] = tmp0; pT->AlphaC[7] = tmp0;

#define BETA_INIT()\
    tmp0 = -4096*4;\
    Beta[0] = 0; Beta[1] = tmp0; Beta[2] = tmp0; Beta[3] = tmp0;\
    Beta[4] = tmp0; Beta[5] = tmp0; Beta[6] = tmp0; Beta[7] = tmp0;

Pcode 4.41: Macro for initialization of state metrics.

State Metrics Normalization

The values of the state metrics (alpha and beta) grow with errors. If we do not control the range of the state metrics, then we see a saturation of alpha and beta values after some stages of computation. To avoid saturation, we normalize the state metrics once every L stages. The interval L depends on the number of bits or precision (i.e., 8, 16, 24, or 32 bits) used to represent state metrics. In the simulations, we use 16-bit precision to represent state metrics alpha and beta, and we use L = 64. Normalization of alphas is done by subtracting either the maximum of all state metric values or the first state metric value from all state metrics of the current stage alphas. We perform normalization of betas in the same way. The macro definition for alpha and beta normalization is given in Pcode 4.42.
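Of the two normalization choices mentioned above, subtracting the maximum state metric can be sketched as follows. The function name `normalize_by_max` is hypothetical, and the clamp at `SHRT_MIN` is an assumption added here so that states still holding the large negative initial value cannot wrap around in 16 bits.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define NUM_STATES 8   /* 2^M states for M = 3 */

/* Subtract the largest of the eight state metrics from all of them, so
 * that the best state becomes 0 and all others are non-positive.
 * Differences are formed in long arithmetic and clamped at SHRT_MIN
 * (the clamp is an assumption of this sketch). */
static void normalize_by_max(short *metric)
{
    long peak;
    int s;

    assert(metric != NULL);

    peak = metric[0];
    for (s = 1; s < NUM_STATES; s++) {
        if (metric[s] > peak) {
            peak = metric[s];
        }
    }
    for (s = 0; s < NUM_STATES; s++) {
        long v = (long)metric[s] - peak;
        metric[s] = (short)((v < SHRT_MIN) ? SHRT_MIN : v);
    }
}
```

Subtracting the first state metric instead avoids the search for the maximum, at the price of leaving metrics that may still be positive after normalization.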
#define ALPHA_NORM()\
    tmp0 = pT->AlphaN[m];\
    pT->AlphaN[m+0] = pT->AlphaN[m+0] - tmp0;\
    pT->AlphaN[m+1] = pT->AlphaN[m+1] - tmp0;\
    pT->AlphaN[m+2] = pT->AlphaN[m+2] - tmp0;\
    pT->AlphaN[m+3] = pT->AlphaN[m+3] - tmp0;\
    pT->AlphaN[m+4] = pT->AlphaN[m+4] - tmp0;\
    pT->AlphaN[m+5] = pT->AlphaN[m+5] - tmp0;\
    pT->AlphaN[m+6] = pT->AlphaN[m+6] - tmp0;\
    pT->AlphaN[m+7] = pT->AlphaN[m+7] - tmp0;

#define BETA_NORM()\
    tmp0 = Beta[0];\
    Beta[0] = Beta[0] - tmp0; Beta[1] = Beta[1] - tmp0;\
    Beta[2] = Beta[2] - tmp0; Beta[3] = Beta[3] - tmp0;\
    Beta[4] = Beta[4] - tmp0; Beta[5] = Beta[5] - tmp0;\
    Beta[6] = Beta[6] - tmp0; Beta[7] = Beta[7] - tmp0;

Pcode 4.42: Macro for normalization of Alpha and Beta.

Extrinsic Information Computation and Interleaving

Once we compute LLRs, the next step in MAP decoding is the computation of the present decoder's extrinsic information from the present decoder's systematic input and LLRs and from the other decoder's extrinsic information. Then, we clip the computed extrinsic information between thresholds to keep it within the same precision used to represent the received input data. We interleave the extrinsic information before storing it (as we pass it to the other decoder in a future iteration) so that it is aligned with the other decoder's inputs. The macro definition for extrinsic information computation and interleaving is given in Pcode 4.43. We interleave the data using the precalculated interleaving offsets because it is costly to compute these interleave offsets on the fly.

#define COMP_EXT()\
    r3 = 127; r2 = -127;\
    r0 = pT->Ext2[j]; r1 = pT->xC[j];\
    r0 = (r0 + r1)*2; r1 = LLR01[j];\
    r0 = (r0 - r1)/2; n = pT->mm[j];\
    r0 = min(r0, r3); r0 = max(r0, r2);\
    pT->Ext1[n] = r0;

Pcode 4.43: Macro for extrinsic information computation and interleaving.

To reduce L1 memory usage, we do not store the betas of trellis stages in an array; instead, we use betas immediately after their computation in obtaining LLRs.
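The extrinsic computation of Pcode 4.43 combines three steps: forming the extrinsic value, clipping it to the 8-bit input range, and writing it to its interleaved position. A function-style sketch of the same arithmetic is shown below; the names `clip_s8` and `comp_ext` and the parameter list are hypothetical, and the offsets are assumed to be valid indices into the output array.

```c
#include <assert.h>
#include <stddef.h>

/* Clip to the symmetric range [-127, 127] used for the extrinsic data. */
static signed char clip_s8(long v)
{
    if (v > 127) {
        return 127;
    }
    if (v < -127) {
        return -127;
    }
    return (signed char)v;
}

/* For each stage j: form ((ext_in + x) * 2 - llr) / 2, clip it, and store
 * it at the interleaved position offset[j]. The offsets are assumed to be
 * precalculated and to lie inside ext_out. */
static void comp_ext(const signed char *ext_in, const short *x,
                     const short *llr, const short *offset,
                     size_t count, signed char *ext_out)
{
    size_t j;

    assert(ext_in != NULL && x != NULL && llr != NULL);
    assert(offset != NULL && ext_out != NULL);

    for (j = 0; j < count; j++) {
        long r0 = ((long)ext_in[j] + x[j]) * 2;
        r0 = (r0 - llr[j]) / 2;
        ext_out[offset[j]] = clip_s8(r0);
    }
}
```

Writing through the offset table means the next decoder can read its a priori information sequentially, with no further reordering.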
As we split the entire data frame into small windows, we simulate alphas, betas and LLRs based on Figures 4.22 and 4.23. We bring one window of received data at a time to L1 memory from L3 memory. We consider current window last stage alphas as the initial alpha values for the next window. But, in the case of betas, we do not have future window betas, and we have to compute them for every window. How many betas we need to compute to converge (or to get the initial valid beta values) for the current window depend on the constraint length (K ) of the encoder. For betas to converge we have to compute betas for 6K stages of the future window, and therefore we have to bring that much extra data from L3 to L1 as overlap data, as shown in Figure 4.22. We compute LLRs for one window at a time and for this we should have alphas, betas and gammas of that window to compute LLRs. To efﬁciently implement the turbo decoder by interleaving the program code, we compute alphas for the ﬁrst window outside the loop and we always compute betas, LLRs and extrinsic information for the current window and alphas for the next window in the loop. In addition, we compute gammas for the next window before entering the loop as the alphas computation for the next window needs those gammas. To compute LLRs of a current window in a loop, we ﬁrst compute betas for overlap data (that belongs to next window) to get converged betas, then we start computing LLRs from the last stage of current window by computing beta on the ﬂy without storing in L1 data memory (as given in Pcode 4.39). 
In the window-based decoding, to reduce pipeline stalls and to utilize the system's full bandwidth (i.e., ALU operations and load-store operations), we compute current window LLRs and betas, and then next window alphas, as given in Pcode 4.44. We normalize alphas and betas once every L stages. To avoid stage counting, conditional checks, and jumps in performing normalization of alpha and beta after L stages, we use a hardware loop setup and compute L stages in a loop, and then we perform normalization. For the last M stages (where M is less than L), we compute alphas and betas in a separate loop at the end. Once we have LLRs, we compute the extrinsic information of the current decoder by using the current decoder LLRs, the systematic input, and the other decoder's extrinsic information.

// CompBetaLLRsAlpha(pTD)
m = WINDOW_LENGTH*8;
ALPHA_NORM_TX()                 // normalize Alpha for next window
n = WINDOW_LENGTH; N = MAX_WINDOW_LENGTH;
for(i = 0;i < N; i++){
    COMP_GAMMA_N()              // Compute Gamma for Next window
}
BETA_INIT()                     // Initialize Beta
n = (MAX_WINDOW_LENGTH<<1)-2; M = OVERLAP_LENGTH;
for(i = 0;i < M;i++){           // Compute overlap Beta
    BETA()
    n = n - 2;
}
k = 8; m = 0; p = 0;
Turbo_Struct[9] = m; Turbo_Struct[10] = p;      // push to stack
N = WINDOW_LENGTH >> 6; L = 63;
m = (WINDOW_LENGTH<<3)-8; p = WINDOW_LENGTH-1;
Turbo_Struct[11] = m; Turbo_Struct[12] = n;     // push to stack
for(j = 0;j < N;j++){   // Compute current window LLR's, current Beta and next window Alpha
    for(i = 0;i < L;i++){
        m = Turbo_Struct[11]; n = Turbo_Struct[12];
        LLRS()          // Compute LLR's for current stage
        BETA()          // compute Beta for current window stage
        m = m - 8; n = n - 2;
        Turbo_Struct[11] = m; Turbo_Struct[12] = n;
        m = Turbo_Struct[9]; n = Turbo_Struct[10];
        ALPHA()         // compute Alpha for next window stages
        m+=8; k+=8; n+=2;
        Turbo_Struct[9] = m; Turbo_Struct[10] = n;
    }
    ALPHA_NORM()        // normalize Alpha
    BETA_NORM()         // normalize Beta
    L = 64;             // next Alpha and Beta normalization occur after 64 iterations
}
M = WINDOW_LENGTH - ((WINDOW_LENGTH)>>6)*64+1;
for(i = 0;i < M;i++){   // last sub window without normalization
    m = Turbo_Struct[11]; n = Turbo_Struct[12];
    LLRS()              // Compute LLR's for current stage
    BETA()              // compute Beta for current window stage
    m = m - 8; n = n - 2;
    Turbo_Struct[11] = m; Turbo_Struct[12] = n;
    m = Turbo_Struct[9]; n = Turbo_Struct[10];
    ALPHA()             // compute Alpha for next window stages
    m+=8; k+=8; n+=2;
    Turbo_Struct[9] = m; Turbo_Struct[10] = n;
}
N = WINDOW_LENGTH; r3 = 127; r2 = -127;
for(j = 0;j < N;j++) {
    COMP_EXT()          // compute extrinsic information for current window
}

Pcode 4.44: Simulation code for turbo decoding in a given window.

As shown in Figure 4.23, we compute alphas for the first window before entering the loop, we compute betas and LLRs for the current window and alphas for the next window inside the loop, and we compute LLRs and betas for the last window after the loop. These three functions for the two MAP decoders are handled with the macros MAP_ONE_A, MAP_ONE_B, MAP_ONE_C, MAP_TWO_A, MAP_TWO_B, and MAP_TWO_C. The simulation code for MAP decoder 1 is given in Pcode 4.45 and the simulation code for MAP decoder 2 is given in Pcode 4.46. We use different input and output buffers for MAP decoders 1 and 2. The main function that calls all six macros for MAP decoders 1 and 2 is given in Pcode 4.47.
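The normalization schedule of Pcode 4.44 (blocks of L stages followed by a shorter tail loop) can be checked with a small counting sketch. It mirrors only the loop bounds of that listing, with a first block of 63 stages and later blocks of 64; the function name `stages_processed` is hypothetical and the window is assumed to hold at least 64 stages.

```c
#include <assert.h>
#include <stddef.h>

/* Count the stages visited by the block loop and the tail loop, using the
 * same bounds as the window loop: N = window_length >> 6 blocks, the first
 * of 63 stages and the rest of 64, then a tail of
 * window_length - N*64 + 1 stages. One normalization follows each block. */
static unsigned int stages_processed(unsigned int window_length,
                                     unsigned int *num_normalizations)
{
    unsigned int blocks = window_length >> 6;
    unsigned int len = 63;
    unsigned int stages = 0;
    unsigned int norms = 0;
    unsigned int tail;
    unsigned int j;

    assert(window_length >= 64);   /* assumption of this sketch */

    for (j = 0; j < blocks; j++) {
        stages += len;   /* inner loop runs len times */
        norms++;         /* alpha and beta normalized after the block */
        len = 64;
    }
    tail = window_length - blocks * 64 + 1;
    stages += tail;

    if (num_normalizations != NULL) {
        *num_normalizations = norms;
    }
    return stages;
}
```

For a window of 636 stages this yields nine blocks (63 + 8 * 64 = 575 stages) and a tail of 61 stages, so every stage of the window is visited exactly once and the "+1" in the tail length compensates for the shorter first block.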
    #define MAP_ONE_A()\
        Get_X(inputX1,0);\
        Get_Y(inputY1,0);\
        pTD->xx = inputX1; pTD->yy = inputY1;\
        pTD->Ext2 = &Extrinsic2[0]; pTD->GammaC = Gamma0; pTD->AlphaC = Alpha0;\
        CompGamma(pTD);\
        CompAlpha(pTD);

    #define MAP_ONE_B()\
        Get_M(interM1,j);\
        Get_X(inputX2,j+1);\
        Get_Y(inputY2,j+1);\
        pTD->xx = inputX2; pTD->yy = inputY2; pTD->xC = inputX3; pTD->xN = inputX4;\
        pTD->Ext1 = Extrinsic1; pTD->Ext2 = &Extrinsic2[j*WINDOW_LENGTH];\
        pTD->mm = interM1; pTD->AlphaC = Alpha0; pTD->AlphaN = Alpha2;\
        pTD->GammaC = Gamma0; pTD->GammaN = Gamma1;\
        CompBetaLLRsAlpha(pTD);\
        Get_M(interM2,j+1);\
        Get_X(inputX1,j+2);\
        Get_Y(inputY1,j+2);\
        pTD->xx = inputX1; pTD->yy = inputY1; pTD->xC = inputX4; pTD->xN = inputX3;\
        pTD->Ext1 = Extrinsic1; pTD->Ext2 = &Extrinsic2[(j+1)*WINDOW_LENGTH];\
        pTD->mm = interM2; pTD->AlphaC = Alpha2; pTD->AlphaN = Alpha0;\
        pTD->GammaC = Gamma1; pTD->GammaN = Gamma0;\
        CompBetaLLRsAlpha(pTD);

    #define MAP_ONE_C()\
        Get_M(interM1,6);\
        Get_X(inputX2,7);\
        Get_Y(inputY2,7);\
        pTD->xx = inputX2; pTD->yy = inputY2; pTD->xC = inputX3; pTD->xN = inputX4;\
        pTD->Ext1 = Extrinsic1; pTD->Ext2 = &Extrinsic2[6*WINDOW_LENGTH];\
        pTD->mm = interM1; pTD->AlphaC = Alpha0; pTD->AlphaN = Alpha2;\
        pTD->GammaC = Gamma0; pTD->GammaN = Gamma1;\
        CompBetaLLRsAlpha(pTD);\
        Get_M(interM2,7);\
        pTD->Ext1 = Extrinsic1; pTD->Ext2 = &Extrinsic2[7*WINDOW_LENGTH];\
        pTD->mm = interM2; pTD->AlphaC = Alpha2; pTD->AlphaN = Alpha0;\
        pTD->xC = inputX4; pTD->GammaC = Gamma1; pTD->GammaN = Gamma0;\
        CompBetaLLRs(pTD);

Pcode 4.45: Simulation code for window-based MAP decoder-1.

4.6 LDPC Codes

In Section 3.11, we discussed the generation of LDPC codes and their decoding algorithms. Before reading this section, refer back to Section 3.11 for an introduction to LDPC codes. In this section, we simulate the min-sum algorithm to decode LDPC codes. We also discuss an efficient way of implementing an LDPC decoder with larger parity check matrices.
As discussed, LDPC codes are defined by the low-density parity check matrix H. At the transmitter side, we generate the LDPC codeword by multiplying the message vector with the corresponding generator matrix G derived from H (see IEEE, "802.16E Standard," 2005, for other efficient encoding methods to compute the LDPC codeword). Then we modulate the codeword bits using the BPSK modulator and transmit them over a noisy channel. At the receiver, we receive the corresponding noisy symbols (here we assume that the symbols and frames are properly in sync). We convert the floating-point values of the noisy symbols to fixed-point symbols, using the 5.3 format in the simulation. For example, if we receive the noisy symbol −0.81, then its fixed-point value is obtained by multiplying it by $2^3$: −0.81 × 8 = −6.48, so the 5.3 fixed-point equivalent of −0.81 is −6 (after truncation).

4.6.1 Decoding of LDPC Codes on Tanner Graph

The parity check matrix H of LDPC codes can be represented using a Tanner graph, which is a bipartite graph with two types of nodes: bit nodes and parity nodes. We decode the LDPC code symbols by iteratively processing the Tanner graph using the sum-product algorithm.
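The floating-point to 5.3 fixed-point conversion described for the received symbols can be sketched as below. The function name `to_q5_3` is hypothetical, and the saturation to the signed 8-bit range is an added assumption (the text only specifies scaling by $2^3$ and truncation).

```c
/* Convert a real-valued received symbol to 5.3 fixed point:
 * scale by 2^3 and truncate toward zero (a C cast to int truncates).
 * Clamping to the signed 8-bit range is an assumption, not from the text. */
int to_q5_3(double x)
{
    int v = (int)(x * 8.0);
    if (v > 127)  v = 127;
    if (v < -128) v = -128;
    return v;
}
```

For the symbol used in the text, `to_q5_3(-0.81)` scales −0.81 to −6.48 and truncates it to −6.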
    #define MAP_TWO_A()\
        Get_iX(inputX1,0);\
        Get_Z(inputY1,0);\
        pTD->xx = inputX1; pTD->yy = inputY1;\
        pTD->Ext2 = &Extrinsic1[0]; pTD->GammaC = Gamma0; pTD->AlphaC = Alpha0;\
        CompGamma(pTD);\
        CompAlpha(pTD);

    #define MAP_TWO_B()\
        Get_iM(interM1,j);\
        Get_iX(inputX2,j+1);\
        Get_Z(inputY2,j+1);\
        pTD->xx = inputX2; pTD->yy = inputY2; pTD->xC = inputX3; pTD->xN = inputX4;\
        pTD->Ext1 = Extrinsic2; pTD->Ext2 = &Extrinsic1[j*WINDOW_LENGTH];\
        pTD->mm = interM1; pTD->AlphaC = Alpha0; pTD->AlphaN = Alpha2;\
        pTD->GammaC = Gamma0; pTD->GammaN = Gamma1;\
        CompBetaLLRsAlpha(pTD);\
        Put_LLR(LLR01,j);\
        Get_iM(interM2,j+1);\
        Get_iX(inputX1,j+2);\
        Get_Z(inputY1,j+2);\
        pTD->xx = inputX1; pTD->yy = inputY1; pTD->xC = inputX4; pTD->xN = inputX3;\
        pTD->Ext1 = Extrinsic2; pTD->Ext2 = &Extrinsic1[(j+1)*WINDOW_LENGTH];\
        pTD->mm = interM2; pTD->AlphaC = Alpha2; pTD->AlphaN = Alpha0;\
        pTD->GammaC = Gamma1; pTD->GammaN = Gamma0;\
        CompBetaLLRsAlpha(pTD);\
        Put_LLR(LLR01,j+1);

    #define MAP_TWO_C()\
        Get_iM(interM1,6);\
        Get_iX(inputX2,7);\
        Get_Z(inputY2,7);\
        pTD->xx = inputX2; pTD->yy = inputY2; pTD->xC = inputX3; pTD->xN = inputX4;\
        pTD->Ext1 = Extrinsic2; pTD->Ext2 = &Extrinsic1[6*WINDOW_LENGTH];\
        pTD->mm = interM1; pTD->AlphaC = Alpha0; pTD->AlphaN = Alpha2;\
        pTD->GammaC = Gamma0; pTD->GammaN = Gamma1;\
        CompBetaLLRsAlpha(pTD);\
        Put_LLR(LLR01, 6);\
        Get_iM(interM2,7);\
        pTD->Ext1 = Extrinsic2; pTD->Ext2 = &Extrinsic1[7*WINDOW_LENGTH];\
        pTD->mm = interM2; pTD->AlphaC = Alpha2; pTD->AlphaN = Alpha0;\
        pTD->xC = inputX4; pTD->GammaC = Gamma1; pTD->GammaN = Gamma0;\
        CompBetaLLRs(pTD);\
        Put_LLR(LLR01, 7);

Pcode 4.46: Simulation code for window-based MAP decoder-2.

In the simulations, we use the less complex min-sum algorithm to decode LDPC codes on the Tanner graph.
We pass the extrinsic information computed at one type of node to the other type of node, back and forth, through the Tanner graph edges; this mechanism of passing information is known as message passing or belief propagation. The edge connections, which are defined by the parity check matrix elements, act as interleavers when passing extrinsic information through them.

Processing at Bit Nodes

We compute $\mathrm{LLR}_i$ at the i-th bit node using the extrinsic information $R_{ji}$ passed from the parity nodes that are connected to the i-th bit node and using the channel APP $\lambda_i$ at the i-th bit node. Then, we compute the extrinsic information $Q_{ij}$ at the i-th bit node using $\mathrm{LLR}_i$ of the i-th bit node and the extrinsic information $R_{ji}$ passed to the i-th bit node from all connected parity nodes except the j-th parity node. At the beginning, we initialize the $Q_{ij}$ with $\lambda_i$. The computed $Q_{ij}$ is passed from the i-th bit node to all parity nodes that are connected to the i-th bit node.

Processing at Parity Nodes

We compute the extrinsic information $R_{ji}$ (to pass to the i-th bit node) at the j-th parity node using the extrinsic information $Q_{ij}$ passed to the j-th parity node from all connected bit nodes except the i-th bit node.
The magnitude of $R_{ji}$ is obtained as the minimum of the absolute values of the participating $Q_{ij}$, and the sign of $R_{ji}$ is obtained by multiplying the signs of the participating $Q_{ij}$. Here, the participating $Q_{ij}$ are those that are involved in the computation of $R_{ji}$.

    //GenerateInterleaverMatrix(DATA_SIZE);
    //InterleaveInputX();
    pTD->Sigma = 1;     // one_by_sigma_square: 1
    for(i = 0;i < NUM_ITERATIONS;i++){
        // ----- first MAP decoder --------
        // pre-compute Gamma and Alpha for first window of first MAP decoder
        MAP_ONE_A()
        for(j = 0;j < NUM_WINDOWS-2;j+=2){
            // compute LLRS, Beta for current window and Alpha for next window
            MAP_ONE_B()
        }
        // compute LLRS and Beta and compute Extrinsic info for second MAP
        MAP_ONE_C()
        // ------ second MAP decoder -------
        // pre-compute Gamma and Alpha for first window of second MAP decoder
        MAP_TWO_A()
        for(j = 0;j < NUM_WINDOWS-2;j+=2){
            // compute LLRS, Beta for current window and Alpha for next window
            MAP_TWO_B()
        }
        // compute LLRS and Beta and compute Extrinsic info for first MAP
        MAP_TWO_C()
    }

Pcode 4.47: Simulation code for window-based turbo decoder.

4.6.2 Min-Sum Algorithm

The min-sum algorithm discussed in Section 3.11 is summarized in the following.

Initialization:

\[ \lambda_i = \frac{2 y_i}{\sigma^2} \tag{4.44} \]

First iteration:

\[ Q_{ij} = \lambda_i \]

\[ R_{ji} = k \left( \prod_{i' \in V_{j \setminus i}} \alpha_{i'j} \right) \min_{i' \in V_{j \setminus i}} \beta_{i'j} \tag{4.45} \]

where $\alpha_{ij} = \mathrm{sign}(Q_{ij})$, $\beta_{ij} = |Q_{ij}|$, $V_{j \setminus i}$ is the set of column locations of the 1s in the j-th row, excluding the i-th column, of the parity check matrix H, and k is a constant less than 1.

\[ \mathrm{LLR}_i = \lambda_i + \sum_{j \in U_i} R_{ji} \tag{4.46} \]

where $U_i$ is the set of row locations of the 1s in the i-th column of the parity check matrix H.

Second iteration onwards:

\[ Q_{ij} = \mathrm{LLR}_i - R_{ji} \tag{4.47} \]

\[ R_{ji} = k \left( \prod_{i' \in V_{j \setminus i}} \alpha_{i'j} \right) \min_{i' \in V_{j \setminus i}} \beta_{i'j} \tag{4.48} \]

\[ \mathrm{LLR}_i = \lambda_i + \sum_{j \in U_i} R_{ji} \tag{4.49} \]

Repeat the computations using Equations (4.47) through (4.49) for the remaining iterations. Here, we do not check whether the decoder can halt at the end of each iteration.
We run the decoder for all L iterations. After L iterations, we make hard decisions using the soft values $\mathrm{LLR}_i$. The values of $\mathrm{LLR}_i$ grow fast once they start converging, and we have to normalize $\mathrm{LLR}_i$ to avoid saturation of the metric values.

Hard decision making:

\[ \hat{c}_i = \begin{cases} 1 & \text{if } \mathrm{LLR}_i < 0 \\ 0 & \text{otherwise} \end{cases} \tag{4.50} \]

4.6.3 LDPC Decoder Simulation

In this section, we simulate the min-sum algorithm described in the previous section for decoding LDPC codes. We assume the noise variance $\sigma^2 = 1$ throughout the simulations, and the received noisy floating-point symbols are converted to the 5.3 fixed-point format. The simulation code for the initialization of the min-sum algorithm is given in Pcode 4.48. We use Equation (4.44) for initialization. Since the noise variance is assumed to be 1, we simply multiply the received sequence $y_i$ by 2 to get $\lambda_i$. Then, we initialize $Q_{ij}$ with $\lambda_i$ wherever $h_{ji} = 1$. The matrix $Q_{ij}$ contains zeros in places where $h_{ji} = 0$.

    for(i = 0;i < ldpc->n;i++){
        Lambda[i] = 2*y[i];
    }
    for(j = 0;j < ldpc->m;j++){
        for(i = 0;i < ldpc->n;i++){
            if (H[j][i] == 1) Qij[j][i] = Lambda[i];
            else Qij[j][i] = 0;
        }
    }

Pcode 4.48: Simulation code for initialization of min-sum algorithm.

The extrinsic information $R_{ji}$ passed from parity nodes to bit nodes is computed using Equation (4.48), which has the same form as Equation (4.45). The simulation code for computing $R_{ji}$ is given in Pcode 4.49. If $h_{ji}$ is equal to 1 (i.e., an edge connection is present from the i-th bit node to the j-th parity node), then the magnitude of $R_{ji}$ is equal to the minimum of the magnitudes of all the j-th row $Q_{ij}$ elements, excluding the i-th column element. If the minus sign is represented with bit 1 and the plus sign with bit 0, then the sign of $R_{ji}$ is computed as the XOR of the sign bits of all the j-th row $Q_{ij}$ elements, excluding the sign bit of the i-th column element. Then, we multiply $R_{ji}$ by 0.8 (or 6 in 5.3 format) to get unbiased extrinsic information.
Once the extrinsic information $R_{ji}$ is available at the bit nodes, we can compute the $\mathrm{LLR}_i$ of the transmitted bits using $\lambda_i$ and $R_{ji}$. The simulation code for computing the $\mathrm{LLR}_i$ using Equation (4.49) is given in Pcode 4.50. At the end of all iterations, we make hard decisions from the $\mathrm{LLR}_i$ using Equation (4.50). The simulation code for making hard decisions from the soft $\mathrm{LLR}_i$ values is given in Pcode 4.51. From the second iteration onwards, we compute $Q_{ij}$ using $\mathrm{LLR}_i$ and $R_{ji}$ as in Equation (4.47). The simulation code for computing $Q_{ij}$ is given in Pcode 4.52.

4.6.4 Complexity of Min-Sum Algorithm

We estimate the complexity of the min-sum algorithm in terms of the number of compute operations (or clock cycles) and in terms of memory requirements. As discussed, a single iteration of the min-sum algorithm involves the computation of $Q_{ij}$ from $\mathrm{LLR}_i$ and $R_{ji}$, the computation of $R_{ji}$ from $Q_{ij}$, and the computation of $\mathrm{LLR}_i$ from $\lambda_i$ and $R_{ji}$. The total computation involved in the min-sum algorithm is the combined cost of these three metrics times the number of iterations the decoder runs before it stops decoding.

    for(j = 0;j < ldpc->m;j++)
        for(i = 0;i < ldpc->n;i++){
            if (H[j][i] == 1){
                sign = 0; mag = 32768;
                for(k = 0;k < ldpc->n;k++){
                    if(i!=k){
                        if (H[j][k] == 1){
                            x = Qij[j][k];
                            b = x < 0 ? 1: 0;
                            a = abs(x);
                            sign = sign ^ b;
                            if (mag > a) mag = a;
                        }
                    }
                }
                mag = (mag * 6) >> 3;
                Rji[j][i] = (sign==1) ? -mag : mag;
            }
        }

Pcode 4.49: Simulation code for computing Rji.

    for(i = 0;i < ldpc->n;i++){
        mag = 0;
        for(j = 0;j < ldpc->m;j++){
            if (H[j][i] == 1) mag = mag + Rji[j][i];
        }
        LLRi[i] = Lambda[i] + mag;
    }

Pcode 4.50: Simulation code to compute LLRi.

    for(i = 0;i < ldpc->m;i++){
        if (LLRi[i] < 0) ch[i] = 1;
        else ch[i] = 0;
    }

Pcode 4.51: Simulation code for making hard decisions from LLRi values.

    for(j = 0;j < ldpc->m;j++)
        for(i = 0;i < ldpc->n;i++)
            if (H[j][i] == 1) Qij[j][i] = LLRi[i] - Rji[j][i];

Pcode 4.52: Simulation code for computing Qij.
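To see how Pcodes 4.48 through 4.52 fit together, the following self-contained sketch runs the min-sum iterations on a toy parity check matrix. The 2 × 3 matrix, the names `min_sum_decode`, `M_ROWS`, and `N_COLS`, and the sample input are all hypothetical choices made for illustration; the sketch assumes $\sigma^2 = 1$, inputs already in 5.3 format, at least two 1s in every row of H, and it omits the LLR normalization mentioned in the text.

```c
#include <stdlib.h>   /* abs */

#define M_ROWS 2      /* parity nodes */
#define N_COLS 3      /* bit nodes    */

/* Toy parity check matrix (hypothetical): it forces c0 = c1 = c2. */
static const int H[M_ROWS][N_COLS] = {
    {1, 1, 0},
    {0, 1, 1}
};

/* Min-sum decoding of one block. y holds received symbols in 5.3 format;
 * llr and bits receive the final soft values and hard decisions. */
void min_sum_decode(const int y[N_COLS], int iterations,
                    int llr[N_COLS], int bits[N_COLS])
{
    int lambda[N_COLS], Q[M_ROWS][N_COLS], R[M_ROWS][N_COLS];
    int i, j, k, it;

    /* Initialization (as in Pcode 4.48): sigma^2 = 1, so lambda = 2*y. */
    for (i = 0; i < N_COLS; i++) {
        lambda[i] = 2 * y[i];
        llr[i] = lambda[i];
    }
    for (j = 0; j < M_ROWS; j++)
        for (i = 0; i < N_COLS; i++) {
            Q[j][i] = H[j][i] ? lambda[i] : 0;
            R[j][i] = 0;
        }

    for (it = 0; it < iterations; it++) {
        /* Parity nodes (as in Pcode 4.49): minimum magnitude,
         * XOR of sign bits, then scaling by 6/8. */
        for (j = 0; j < M_ROWS; j++)
            for (i = 0; i < N_COLS; i++) {
                int sign = 0, mag = 32767;
                if (!H[j][i]) continue;
                for (k = 0; k < N_COLS; k++) {
                    if (k == i || !H[j][k]) continue;
                    sign ^= (Q[j][k] < 0);
                    if (abs(Q[j][k]) < mag) mag = abs(Q[j][k]);
                }
                mag = (mag * 6) >> 3;
                R[j][i] = sign ? -mag : mag;
            }
        /* Bit nodes (as in Pcode 4.50): LLR = lambda + sum of incoming R. */
        for (i = 0; i < N_COLS; i++) {
            llr[i] = lambda[i];
            for (j = 0; j < M_ROWS; j++)
                if (H[j][i]) llr[i] += R[j][i];
        }
        /* Messages for the next iteration (as in Pcode 4.52). */
        for (j = 0; j < M_ROWS; j++)
            for (i = 0; i < N_COLS; i++)
                if (H[j][i]) Q[j][i] = llr[i] - R[j][i];
    }
    /* Hard decisions: bit 1 when the LLR is negative. */
    for (i = 0; i < N_COLS; i++)
        bits[i] = (llr[i] < 0) ? 1 : 0;
}
```

With the input `{8, -2, 7}` (an all-zero codeword whose middle symbol was pushed slightly negative by noise), the middle LLR becomes positive after the first iteration, so all three hard decisions are 0.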
$Q_{ij}$ Computational Complexity

The computation of $Q_{ij}$ involves one conditional arithmetic operation, as shown in Pcode 4.52. A conditional arithmetic operation consumes 3 cycles on the reference embedded processor (see Appendix A.4 on the companion website for more details on cycle estimates on the reference embedded processor). As the loop of the $Q_{ij}$ computation runs $M \cdot N$ times, we require $3 \cdot M \cdot N$ cycles to compute $Q_{ij}$.

$R_{ji}$ Computational Complexity

The costliest module in the min-sum algorithm is the $R_{ji}$ computation. Based on Pcode 4.49, the innermost loop of the $R_{ji}$ computation consumes 7 cycles per loop iteration. As the computation of the magnitude is performed conditionally, we consume two more cycles to assign the computed value conditionally. This means that, whether the condition is true (for computation) or not (for jump), we spend 9 cycles, and so to run the innermost loop N times we require $9 \cdot N$ cycles. However, the innermost loop itself runs conditionally, depending on the presence of 1s in the parity check matrix. If $h_{ji} = 0$, then we spend about 10 cycles (for conditional jump + overhead); otherwise, we spend $9 \cdot N$ cycles. Therefore, the total cycle cost of the $R_{ji}$ computation is estimated as $10 \cdot (M \cdot N - S) + 9 \cdot N \cdot S + 7 \cdot S$ cycles, where S is the total number of 1s present in the parity check matrix and the $7 \cdot S$ term is the overhead to initialize parameters in the loop and to compute the final $R_{ji}$.

$\mathrm{LLR}_i$ Computational Complexity

Based on Pcode 4.50, the $\mathrm{LLR}_i$ computation involves one conditional arithmetic operation and is computed $M \cdot N$ times. We have one more addition operation outside the inner loop, for which we consume N cycles as it runs N times. Thus, we spend a total of $(2 \cdot M \cdot N + N)$ cycles to compute $\mathrm{LLR}_i$.
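The three cycle estimates above can be added up mechanically. The sketch below simply encodes the formulas from the text; the function names are hypothetical, and the per-operation cycle costs are the ones quoted for the reference embedded processor.

```c
/* Per-iteration cycle estimates for the straightforward min-sum decoder.
 * m, n: rows and columns of H; s: number of 1s in H. */
long long q_cycles(long long m, long long n)
{
    return 3 * m * n;
}

long long r_cycles(long long m, long long n, long long s)
{
    return 10 * (m * n - s) + 9 * n * s + 7 * s;
}

long long llr_cycles(long long m, long long n)
{
    return 2 * m * n + n;
}

/* Total cycles for a given number of decoder iterations. */
long long total_cycles(long long m, long long n, long long s, int iterations)
{
    return iterations * (q_cycles(m, n) + r_cycles(m, n, s) + llr_cycles(m, n));
}
```

For example, `total_cycles(288, 576, 2000, 10)` evaluates to 128,508,960 cycles, dominated by the $9 \cdot N \cdot S$ term of the $R_{ji}$ computation; this is of the same order as the approximate figure quoted in this section.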
Next, if M = 288, N = 576, S = 2000, and L = 10 (the number of iterations of the Tanner graph), then we require approximately 120 million cycles for decoding 288 bits, or 0.42 million cycles per bit. At this complexity, we cannot decode even a 2-kbps sequence on the 600-MIPS reference embedded processor, because it requires 840 MIPS. With the efficient implementation techniques discussed in the next section, we can reduce the computational cycles substantially.

Memory Requirements

The buffers used for holding the H, $Q_{ij}$, and $R_{ji}$ values are two-dimensional arrays, each of size M × N. If we use bytes to represent the parity check matrix H elements and 16-bit words for $Q_{ij}$ and $R_{ji}$, then we require $5 \cdot M \cdot N$ bytes of memory to store H, $Q_{ij}$, and $R_{ji}$. If M = 288 and N = 576, then we need about 830 kB of memory to hold the data values (see Appendix A.1 on the companion website for memory availability on the reference embedded processor). All other buffers require less than 5 kB of data memory. In the next section, we discuss techniques to reduce the memory requirements of the LDPC decoder.

4.6.5 Efficient LDPC Decoder Implementation

In the previous section, we saw that the computational and memory requirements of an inefficient implementation of the LDPC decoder exceed the budget of the reference embedded processor. In this section, we discuss techniques to implement the min-sum algorithm with less memory and fewer computations. The low-density population of 1s in the parity check matrix not only gives the coding performance but also lowers the computational and memory requirements of LDPC decoding. The heavy computation and memory requirements in the LDPC decoder are due to running the decoding algorithm on a two-dimensional array of size M × N. In reality, we need to process the data for only the S non-zero elements of the parity check matrix H, where S << M × N.
If we can track the locations of the 1s in the parity check matrix H during decoding, then it is possible to avoid the heavy two-dimensional processing and memory usage. To track the locations of the 1s during Tanner graph decoding, we use four look-up tables: V2C[ ][ ], vc[ ][ ], C2V[ ][ ], and cv[ ][ ]. The look-up table V2C[j][ ] contains the positions of the bit nodes that are connected to the j-th parity node, and the look-up table vc[j][ ] consists of the number of parity nodes to which the current bit node (which is connected to the j-th parity node) is connected before the j-th parity node. This is illustrated in Figure 4.24. Based on the figure, before the j-th parity node, the i-th bit node is connected to two parity nodes; so V2C[j][ ] = [i − e, i, i + g, 3] and vc[j][ ] = [1, 2, 1]. The last entry in the V2C[j][ ] look-up table represents the number of bit nodes that connect to the j-th parity node. Similarly, the look-up table C2V[i][ ] contains the positions of the parity nodes connected to the i-th bit node, and the look-up table cv[i][ ] consists of the number of bit nodes to which the current parity node (which is connected to the i-th bit node) is connected before the i-th bit node. Based on Figure 4.24, C2V[i][ ] = [j − a, j − b, j, j + d, 4] and cv[i][ ] = [0, 1, 1, 0]. The last entry in the C2V[i][ ] look-up table represents the number of parity nodes connected to the i-th bit node.

[Figure 4.24: Illustration to fill entries of look-up tables. The figure shows the i-th bit node among bit nodes i − e, i − f, i, i + g, i + h, i + k, and the j-th parity node among parity nodes j − a, j − b, j − c, j, j + d.]

With this tracking information, we do not need to hold the two-dimensional parity check matrix H. If we consider the parity check matrix H of size M × N with row weight $w_r$ and column weight $w_c$, then the buffers $Q_{ij}$ and $R_{ji}$ are required to store only $M \cdot w_r$ and $N \cdot w_c$ values.
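One way to fill the four look-up tables from a parity check matrix, following the definitions above, is sketched below. The 3 × 6 matrix and the names `build_tables`, `NUM_PARITY`, `NUM_BITS`, `WR`, and `WC` are hypothetical and chosen only to keep the example small; the text does not show how its tables are generated.

```c
#define NUM_PARITY 3   /* parity nodes (rows of H)  */
#define NUM_BITS   6   /* bit nodes (columns of H)  */
#define WR 3           /* maximum row weight        */
#define WC 2           /* maximum column weight     */

/* Small hypothetical parity check matrix, for illustration only. */
static const unsigned char H[NUM_PARITY][NUM_BITS] = {
    {1, 1, 0, 1, 0, 0},
    {0, 1, 1, 0, 1, 0},
    {1, 0, 1, 0, 0, 1}
};

int V2C[NUM_PARITY][WR + 1];  /* bit nodes on parity node j; last entry = count     */
int vc[NUM_PARITY][WR];       /* parity nodes each of those bit nodes saw before j  */
int C2V[NUM_BITS][WC + 1];    /* parity nodes on bit node i; last entry = count     */
int cv[NUM_BITS][WC];         /* bit nodes each of those parity nodes saw before i  */

void build_tables(void)
{
    int ones_in_col[NUM_BITS] = {0};
    int ones_in_row[NUM_PARITY] = {0};
    int i, j, t;

    for (j = 0; j < NUM_PARITY; j++) {        /* scan rows top to bottom */
        t = 0;
        for (i = 0; i < NUM_BITS; i++) {
            if (!H[j][i]) continue;
            V2C[j][t] = i;
            vc[j][t] = ones_in_col[i]++;      /* 1s in column i above row j */
            t++;
        }
        V2C[j][WR] = t;
    }
    for (i = 0; i < NUM_BITS; i++) {          /* scan columns left to right */
        t = 0;
        for (j = 0; j < NUM_PARITY; j++) {
            if (!H[j][i]) continue;
            C2V[i][t] = j;
            cv[i][t] = ones_in_row[j]++;      /* 1s in row j left of column i */
            t++;
        }
        C2V[i][WC] = t;
    }
}
```

After `build_tables()`, row 1 of the toy matrix gives V2C[1][ ] = [1, 2, 4, 3] and vc[1][ ] = [1, 0, 0]: bit node 1 already appeared in parity node 0, while bit nodes 2 and 4 appear for the first time.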
If M = 288, N = 576, $w_r = 7$, and $w_c = 6$, then we need a memory of size $7 \cdot (N \cdot w_c + M \cdot w_r) = 38304$ bytes to store $Q_{ij}$ (in 16-bit words), $R_{ji}$ (in 16-bit words), V2C[j][ ] (in 16-bit words), vc[j][ ] (in bytes), C2V[i][ ] (in 16-bit words), and cv[i][ ] (in bytes). Assuming 4 kB of memory is used for other buffers, we require 42 kB of data memory (which is reasonable) to implement the LDPC min-sum decoder. The simulation code for the $Q_{ij}$, $R_{ji}$, and $\mathrm{LLR}_i$ computations using this memory-efficient method is given in Pcodes 4.53, 4.54, and 4.55, respectively. As we process metrics only for the non-zero elements of the parity check matrix, the number of computations is also greatly reduced. In Pcode 4.53, we spend 3 cycles per iteration of the innermost loop and consume about $3 \cdot w_r \cdot M$ cycles to compute the $Q_{ij}$.

    for(j = 0;j < ldpc->m;j++){
        n = V2C[j][7];
        for(i = 0;i < n;i++){
            a = vc[j][i];
            b = V2C[j][i];
            Qij[b][a] = LLRi[b] - Rji[j][i];
        }
    }

Pcode 4.53: Simulation code for efficient computation of Qij.

    for(j = 0;j < ldpc->m;j++){
        n = V2C[j][7];
        for(i = 0;i < n;i++){
            sign = 0; mag = 32768;
            for(k = 0;k < n;k++){
                if (i!= k){
                    m = V2C[j][k]; a = vc[j][k];
                    x = Qij[m][a];
                    a = x < 0 ? 1 : 0;
                    b = abs(x);
                    if (mag > b) mag = b;   // finding minimum
                    sign = sign ^ a;        // computing product of signs
                }
            }
            mag = (6*mag) >> 3;             // k = 0.8 or 6 in 5.3 format
            Rji[j][i] = (sign == 1) ? -mag : mag;
        }
    }

Pcode 4.54: Simulation code for efficient computation of Rji.

    for(i = 0;i < ldpc->n;i++){
        n = C2V[i][6];
        mag = 0;
        for(j = 0;j < n;j++){
            a = cv[i][j];
            b = C2V[i][j];
            mag = mag + Rji[b][a];
        }
        LLRi[i] = Lambda[i] + mag;
    }

Pcode 4.55: Simulation code to efficiently compute LLRi.

In computing $R_{ji}$ using Pcode 4.54, we spend 9 cycles in the innermost loop, and the loop runs conditionally $w_r$ times to compute one $R_{ji}$. Outside the innermost loop, we spend 6 cycles to initialize and to compute the final $R_{ji}$ value.
Thus, to compute all $R_{ji}$ values, we consume $M(w_r(9 \cdot w_r + 6) + 1)$ cycles. In Pcode 4.55, we spend 4 cycles in the innermost loop, and the loop runs $w_c$ times. We consume a total of $N(4 \cdot w_c + 3)$ cycles for $\mathrm{LLR}_i$. With this, for M = 288, N = 576, $w_r = 7$, $w_c = 6$, and L = 10, we consume about 1,630,080 cycles for 288 bits, or 5660 cycles per bit, in decoding with the min-sum algorithm. In other words, decoding a 100-kbps bitstream requires only about 566 MIPS on the reference processor.

CHAPTER 5
Lossless Data Compression

Data compression (or source coding) enables the communication system to transfer more information by removing the redundancy present in the data (e.g., voice, audio, video). In other words, with data compression algorithms, it is possible to represent the given data with fewer information bits. Data compression algorithms are widely used in data storage and data communication applications. We use data compression algorithms to compress multimedia data at the transmitter side and the corresponding decompression algorithms at the receiver side to get back the transmitted data (which may not be exactly the same as the source-generated data). In Figure 5.1, the highlighted region corresponds to the data compression (performed at the transmitter side) and decompression (performed at the receiver side) modules. In the communication system transceiver, the data compression block is placed at the beginning of the transmitter modules and the corresponding decompression block is placed at the end of the receiver modules, so that the rest of the communication system works on compressed data to reduce the amount of data processing. The communication system bandwidth is limited due to switching equipment, channel non-zero response, and other channel impairments. The system's overall bandwidth determines the allowed bit rates for communication.
However, with source coding techniques, it is possible to trade processing power for communication system bandwidth. For example, in the case of multimedia (e.g., voice, audio, video, text) communications, data compression significantly reduces bit rates and thus the cost of media communications. It enables the broadcast of multimedia content in real time by reducing the data rate. The data compression and decompression blocks shown in Figure 5.1 contain many modules such as parsing (to parse headers and payload data), transforms (to remove redundancy in the data), motion estimation/compensation (to remove temporal redundancy in video frames), quantization (to eliminate insignificant data coefficients), entropy coding (to compress data parameters), and so on. In this chapter, we concentrate only on the entropy coding (or lossless data compression, with which we can get back the original data parameters after decoding) modules that are used in video data compression. The other modules of the data compression and decompression blocks will be discussed in this volume's audio/video coding chapters.

[Figure 5.1: Digital communication system with data compression and decompression. Blocks shown: Source Data, Data Compression, Channel Coding, Digital Modulation, Transmitter Back End, Noisy Channel, Receiver Front End, Channel Decoding, Data Decompression, Received Data, with Entropy Coding / Decoding highlighted.]

© 2010 Elsevier Inc. All rights reserved. DOI: 10.1016/B978-1-85617-678-1.00005-3

5.1 Entropy Coding

In the data compression block, the entropy coding module is present at the end, and we perform the corresponding entropy decoding in the receiver at the beginning of the data decompression block, which is highlighted with dark squares in Figure 5.1. Entropy coding algorithms output the bitstream by compactly representing various data parameters using their source information.
As previously stated, entropy coding is a lossless process and is independent of the type of information (e.g., audio, video, text) that is being compressed. It is concerned solely with how the information is represented. In Example 5.1, we work with a simple entropy coding algorithm to see how an entropy coding system compactly represents the data information.

■ Example 5.1

We consider a source with symbol set S = {A, B, C, D, E, F, G, H}. Let us assume a probability set $P = \{p_a, p_b, p_c, p_d, p_e, p_f, p_g, p_h\}$ that governs the occurrence of symbols from the source S for their transmission. Now, assume that the data string generated for transmission from S following the symbol probability distribution P is

M = BAAACAAAAABBDAAAAEAAAAFAAGCAAAAB.

We have a total of 32 symbols for transmission. Next, we assume two types of data coding schemes, Type I and Type II, for compactly representing the symbols:

Type I coding: A:000, B:001, C:010, D:011, E:100, F:101, G:110, H:111
Type II coding: A:1, B:01, C:001, D:0001, E:00001, F:000001, G:0000001, H:00000001

With the Type I coding scheme, we need 96 bits to represent the data, or an average of 3 bits/symbol to transmit the message.

With Type I coding:
001 000 000 000 010 000 000 000 000 000 001 001 011 000 000 000 000 100 000 000 000 000 101 000 000 110 010 000 000 000 000 001 (total bits: 96)

If we code the same message data by using the Type II scheme, we require only 58 bits (22 A's at 1 bit, 4 B's at 2 bits, 2 C's at 3 bits, and one each of D, E, F, and G at 4, 5, 6, and 7 bits), or an average of about 1.81 bits/symbol to transmit the message.

With Type II coding:
01 1 1 1 001 1 1 1 1 1 01 01 0001 1 1 1 1 00001 1 1 1 1 000001 1 1 0000001 001 1 1 1 1 01 (total bits: 58)

Here, with Type II coding, the average number of bits/symbol is less than with Type I coding. This is because the statistical nature of source S is modeled more accurately with the Type II coding scheme than with Type I. We will discuss this further in the next section.
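The bit totals for the two coding schemes in Example 5.1 can be checked by summing the code length of every symbol in the message. The sketch below does exactly that; the names `coded_bits`, `type1_len`, and `type2_len` are hypothetical.

```c
#include <string.h>

/* Code lengths (in bits) for symbols A..H under the two schemes of Example 5.1. */
static const int type1_len[8] = {3, 3, 3, 3, 3, 3, 3, 3};
static const int type2_len[8] = {1, 2, 3, 4, 5, 6, 7, 8};

/* Total number of bits needed to code msg (letters A..H only). */
int coded_bits(const char *msg, const int len_table[8])
{
    int total = 0;
    size_t i, n = strlen(msg);

    for (i = 0; i < n; i++)
        total += len_table[msg[i] - 'A'];   /* add this symbol's code length */
    return total;
}
```

Calling `coded_bits` on the message string M with each table gives the two totals; dividing them by the 32 symbols gives the average number of bits per symbol for each scheme.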
■

In the literature, two types of entropy coding methods are widely used: Huffman (or variable length) coding and arithmetic coding. In previous generations of audio (e.g., MP3, WMV) and video codecs (e.g., MPEG-2, H.263, WMV), variable length codes (VLCs) were widely used. Recent audio and video codecs (e.g., AAC, H.264) use arithmetic coding for lossless compression. With arithmetic coding, we achieve about 10 to 15% more compression when compared to VLC.

5.1.1 Huffman Coding

Suppose that the symbol set S has N symbols and these symbols occur in the input string with respective probabilities $P_i$, i = 1, 2, 3, ..., N, so that $\sum_i P_i = 1$. The symbol occurrence is statistically independent. Then, based on the fundamentals of information theory, the optimal number of bits to be assigned to each symbol of the input string (which gets the character symbol at random from the symbol set S) is $Q_i = \log_2(1/P_i)$, where $P_i$ is the probability of the i-th symbol. In other words, we require on average at least $H = -\sum_i P_i \log_2(P_i)$ bits per symbol to communicate the symbols from set S. Here H gives the average number of bits per symbol and is called the entropy rate of the symbol source S with the corresponding symbol probabilities $P_i$. The entropy rate of a source is a number that depends only on the statistical nature of the source. For example, if the probability of a symbol is 1/256, such as would be found in a random byte stream, the number of bits per symbol required is $\log_2(256)$, or 8. As the probability goes up to 1/2, the optimum number of bits used to code the symbol goes down to 1. The idea behind Huffman coding is simply to use shorter bit patterns for frequently occurring character symbols. Huffman coding assigns a code to each symbol, with the codes being as short as 1 bit or considerably longer than the input symbols, strictly depending on their probabilities.
In Example 5.1, the Type II coding scheme is an example of Huffman coding. With Huffman coding, we cannot use $\log_2(1/P_i)$ for arbitrary values of $P_i$, as it outputs a noninteger number of bits. For this, we approximate the probabilities $P_i$ by integer powers of ½ so that all the resulting $Q_i$ are integers.

Now we discuss the assignment of bits to each symbol of the set S in a constructive way. Let us consider a different symbol set S = {U, V, X, Y, Z} with the corresponding probability distribution set P = {0.4, 0.28, 0.22, 0.07, 0.03}. Note that the sum of the probabilities of all N (= 5) symbols is 1. Next, we build a binary tree with N stages. We start from the bottom of the tree by considering the two least probable symbols. In this case, the two least probable symbols are Z and Y, with probabilities 0.03 and 0.07. We always assign the bit "0" branch to the low-probability child node and the bit "1" branch to the high-probability child node from the parent node. The probability of the parent node is the sum of the probabilities of its child nodes. Next, we move one stage up and consider the next most probable symbol (i.e., X), and assign it branch "1" if its probability is more than the probability of the parent node of the two lower-stage child nodes; otherwise, it is assigned branch "0." We continue like this until all characters of the symbol set are touched once. Finally, we make sure that the final parent (or root) node probability is 1.00. Then, we collect the bits of the branches connecting the root node to the leaf nodes that represent the characters of the symbol set S. In our case, from Figure 5.2, we assign the bits to characters as given in Table 5.1.

With Huffman coding, it is not possible to code a symbol with a probability greater than 0.5 using a fraction of a bit. Thus, a minimum of 1 bit is required to represent a symbol with Huffman coding. Moreover, adaptive Huffman coding algorithms are relatively time and memory consuming.
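The tree construction described above can be sketched in C as follows. This is a minimal illustration rather than a production encoder: it uses the general Huffman rule of repeatedly merging the two lightest nodes (which coincides with the stage-by-stage description for this symbol set), it takes the probabilities as integer weights scaled by 100 (an assumption made to avoid floating-point comparisons), and the names `huffman_codes`, `NSYM`, and `MAXLEN` are hypothetical.

```c
#define NSYM 5
#define MAXLEN 8   /* room for up to NSYM - 1 code bits plus the terminator */

/* Build Huffman codes for NSYM symbols. At each merge the lighter child
 * gets bit '0' and the heavier child gets bit '1', as in the text. */
void huffman_codes(const int weight[NSYM], char codes[NSYM][MAXLEN])
{
    int w[2 * NSYM - 1], parent[2 * NSYM - 1], used[2 * NSYM - 1];
    char bit[2 * NSYM - 1];
    int count = NSYM, i, s;

    for (i = 0; i < 2 * NSYM - 1; i++) {
        w[i] = (i < NSYM) ? weight[i] : 0;
        parent[i] = -1;
        used[i] = 0;
        bit[i] = '0';
    }
    while (count < 2 * NSYM - 1) {
        int lo = -1, hi = -1;               /* two lightest unmerged nodes */
        for (i = 0; i < count; i++) {
            if (used[i]) continue;
            if (lo < 0 || w[i] < w[lo]) { hi = lo; lo = i; }
            else if (hi < 0 || w[i] < w[hi]) { hi = i; }
        }
        used[lo] = used[hi] = 1;
        parent[lo] = parent[hi] = count;    /* new parent node */
        bit[lo] = '0';
        bit[hi] = '1';
        w[count] = w[lo] + w[hi];
        count++;
    }
    for (s = 0; s < NSYM; s++) {            /* walk leaf to root, then reverse */
        char tmp[MAXLEN];
        int len = 0, node = s, k;
        while (parent[node] >= 0) {
            tmp[len++] = bit[node];
            node = parent[node];
        }
        for (k = 0; k < len; k++)
            codes[s][k] = tmp[len - 1 - k];
        codes[s][len] = '\0';
    }
}
```

With the weights {40, 28, 22, 7, 3} for U, V, X, Y, Z, the function returns the codes 0, 10, 111, 1101, and 1100, which agree with the assignment in Table 5.1.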
We will discuss different variable length decoding algorithms used with the MPEG-2 and H.264 standards in Sections 5.2 and 5.3.

[Figure 5.2: Building Huffman codes using a binary tree. Root (1.00): branch 0 to U (0.40), branch 1 to node (0.60); node (0.60): branch 0 to V (0.28), branch 1 to node (0.32); node (0.32): branch 0 to node (0.10), branch 1 to X (0.22); node (0.10): branch 0 to Z (0.03), branch 1 to Y (0.07).]

Table 5.1: Assignment of bits to symbols per probability

    Symbol   U   V    X     Y      Z
    Bits     0   10   111   1101   1100

[Figure 5.3: Illustration of arithmetic coding. The range [0, 1) is divided into A: 0.00 to 0.50, B: 0.50 to 0.75, and C: 0.75 to 1.00, and is narrowed symbol by symbol for the message BACA: 0.50 to 0.75, then 0.50 to 0.625, then 0.59375 to 0.625, then 0.59375 to 0.609375.]

5.1.2 Arithmetic Coding

As discussed, if we have a symbol set with nonuniform probabilities $P_i$, then data compression is possible and the number of bits we assign to data symbols equals $Q_i = \log_2(1/P_i)$. With Huffman coding, we assign the length of bits to symbols after rounding the actual number of bits $Q_i$ to the nearest integers. In other words, Huffman coding achieves the Shannon limit only if the symbol probabilities are all integer powers of ½. Thus, we require a minimum of 1 bit to represent a symbol even if its probability is more than half. This limits the performance of Huffman codes. In contrast, using arithmetic coding, it is possible to code a symbol with a probability of more than 0.5 using a fraction of a bit. This allows us to code the data very close to the ideal entropy of the source. Because of this, with arithmetic coding we can get better compression (about 10 to 15%) when compared to Huffman coding. However, arithmetic coding is more complex than Huffman coding. In arithmetic coding, an input message of any length is represented as a real number R in the range [0, 1). Unlike Huffman coding, which assigns a separate codeword to each character, arithmetic coding yields a single codeword for each encoded string of characters. The concept of arithmetic coding is explained in Example 5.2.
Although arithmetic coding is a complex coding method when compared to Huffman coding, the process of encoding and decoding a stream of symbols using arithmetic coding is not very complicated. The first step in arithmetic coding is to divide the numeric range [0, 1) into N intervals, where N is the number of symbols in the character set. The size of each interval is proportional to the probability of the corresponding symbol. In Example 5.2, the probability distribution of the symbol set {A, B, C} is {0.5, 0.25, 0.25}. We divide the range [0, 1) into 3 (= N) segments and mark the interval 0.0 to 0.5 for A, 0.50 to 0.75 for B, and 0.75 to 1.00 for C. The message to be compressed is, say, BACA. The first symbol to code is B, and thus we zoom into the interval B of the range [0, 1) and subdivide it again into three segments, with the lengths of the subsegments proportional to the probabilities of the characters in the symbol set, in the same way as before. The next symbol to code is A, so we zoom into the subsegment A and divide it into three segments, and we continue this process for the rest of the symbols as shown in Figure 5.3. As we code more and more symbols of a long string, the length of the working interval becomes shorter and shorter. Finally, the arithmetic coded value of the string is given by the bottom value of the final subinterval. In Example 5.2, the complete arithmetic encoding and decoding of the symbol string is presented along with the encoding and decoding algorithms.
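The interval-narrowing procedure just described, and its inverse, can be sketched in C for the three-symbol set {A, B, C}. This is a floating-point illustration only, with hypothetical function names `arith_encode` and `arith_decode`; it assumes short messages, since a practical coder works on integer ranges with renormalization to avoid running out of precision.

```c
/* Symbol intervals for {A, B, C} with probabilities 1/2, 1/4, 1/4. */
static const double low_range[3]  = {0.00, 0.50, 0.75};
static const double high_range[3] = {0.50, 0.75, 1.00};

/* Encode n symbols of msg (letters A..C) and return the lower end
 * of the final subinterval. */
double arith_encode(const char *msg, int n)
{
    double value = 0.0, high = 1.0;
    int i;

    for (i = 0; i < n; i++) {
        int s = msg[i] - 'A';
        double r = high - value;            /* width of the current interval */
        high  = value + r * high_range[s];
        value = value + r * low_range[s];
    }
    return value;
}

/* Decode n symbols from value into out (out must hold n + 1 chars). */
void arith_decode(double value, int n, char *out)
{
    int i;

    for (i = 0; i < n; i++) {
        int s = 0;
        while (s < 2 && value >= high_range[s])
            s++;                            /* find the interval holding value */
        out[i] = (char)('A' + s);
        /* Remove the symbol's contribution and rescale to [0, 1). */
        value = (value - low_range[s]) / (high_range[s] - low_range[s]);
    }
    out[n] = '\0';
}
```

For the message BACA, `arith_encode("BACA", 4)` returns 0.59375, and `arith_decode(0.59375, 4, out)` recovers the string BACA; all intermediate values are exact in binary floating point because the interval boundaries are powers of ½.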
■ Example 5.2

Consider the symbol set {A, B, C} with probabilities 1/2, 1/4, and 1/4; the corresponding symbol intervals follow:

Symbol   Probability   Interval
A        1/2           0.00 to 0.50
B        1/4           0.50 to 0.75
C        1/4           0.75 to 1.00

Symbol_Range(Symbol) = 0.00 to 0.50 for A, 0.50 to 0.75 for B, 0.75 to 1.00 for C
High_Range(Symbol) = 0.50 for A, 0.75 for B, 1.00 for C
Low_Range(Symbol) = 0.00 for A, 0.50 for B, 0.75 for C

Message to transmit: BACA

Encoding Algorithm

    value = 0.0; high = 1.0; i = 4;
    while (i--) {
        r = high - value;
        high = value + r*high_range(symbol);
        value = value + r*low_range(symbol);
    }
    output value as encoded_value;

Encoding Symbols

    B: r = 1,       high = 0.75,     value = 0.5
    A: r = 0.25,    high = 0.625,    value = 0.5
    C: r = 0.125,   high = 0.625,    value = 0.59375
    A: r = 0.03125, high = 0.609375, value = 0.59375

Decoding Algorithm

    value = encoded_value; i = 4;
    while (i--) {
        symbol = symbol_range(value);
        r = high_range(symbol) - low_range(symbol);
        value = value - low_range(symbol);
        value = value / r;
    }

Decoded Message

    symbol = B, r = 0.25, value = 0.375
    symbol = A, r = 0.50, value = 0.75
    symbol = C, r = 0.25, value = 0
    symbol = A
■

To avoid precision problems in arithmetic coding, the range of the arithmetic coder is frequently normalized. With binary symbols, arithmetic coding can be implemented very efficiently, and this type of coding is popularly known as binary arithmetic coding.

Binary Arithmetic Coding

Binary arithmetic coding (BAC) is the most efficient way of implementing general arithmetic coding. It is applied to data sequences with only two symbols (0 and 1), which makes the arithmetic coding operations easier to implement in both hardware and software. Typically, any decision can be coded with multiple binary decisions, as we can represent any number with a binary sequence of 0s and 1s. With binary arithmetic coding, we handle the following steps.

Binary Symbols Probability

The BAC works on binary symbols 0 and 1.
However, it is more convenient to use variable symbol names for binary symbols, instead of the constants 0 and 1, to work with probabilities. In the BAC literature, the variable symbol names MPS (most probable symbol) and LPS (least probable symbol) are used for the binary decision 0 or 1. With BAC, knowing one symbol probability (p) is sufficient, as the other symbol probability is obtained as 1 − p. Typically, as shown in Figure 5.4, we estimate the LPS probability and compute the MPS probability by subtracting the LPS probability from 1.

In reality, the probabilities of symbols are not fixed, as they were in Example 5.2. As we code different types of symbols for different types of parameters, the probabilities of symbols keep changing. With BAC, we track only the LPS probabilities. Either by observing the previous symbols' probabilities or based on the context of symbols, we obtain the approximate probability information for the LPS. In addition, we fix the MPS decision bit for the context model and link it to the LPS probability. We obtain the LPS bit as 1 − MPS. We always get the LPS probability and the MPS decision bit from the context model.

When we compress different types of data (e.g., headers, motion vectors, residual coefficients), we have that many contexts. For each context, we assign an initial LPS probability and MPS decision bit. We update this context information (i.e., LPS probability and MPS bit value) after coding each binary decision. In this way, the binary arithmetic coder adapts easily without much computation.

Interval Subdivision

To handle the precision problem with the arithmetic coder interval size, we use a soft range for the interval range [0, 1) by multiplying by a large integer, say, 2^15, and the interval becomes [0, 2^15 − 1]. The corresponding interval size is R = 2^15. If Qe (= p) is the probability of the LPS, then the LPS subinterval R_LPS is obtained by multiplying the interval range R by the LPS probability.
Therefore,

    R_LPS = Qe × R
    R_MPS = R − R_LPS = (1 − p) × R

Symbol Coding

To perform BAC on a binary symbol, we get the corresponding context {Qe, MPS bit} and then obtain R_LPS and R_MPS by subdividing the interval using Qe. As the MPS occurs frequently, we reduce the number of computations for MPS decision coding, when compared to LPS decision coding, by arranging the MPS and LPS subintervals as shown in Figure 5.4. We assign the lower part of the interval to the MPS decision, and by doing that the coding of the MPS decision becomes easy. If the binary decision is to code an MPS bit, then Code_Value remains the same. In this case, we just update the context and assign R_MPS to R. But if the binary decision is to code an LPS bit, then the value is updated by adding R_MPS to Code_Value. We update the contexts accordingly and assign R_LPS to the next working interval R. In some cases, R_MPS becomes less than R_LPS for a given Qe; in that case we swap the R_MPS and R_LPS intervals or toggle the MPS bit in that context. The symbol coding with BAC follows:

    MPS coding:   R = R_MPS
    LPS coding:   Code_Value = Code_Value + R_MPS
                  R = R_LPS

Interval Normalization

As we code more and more decisions, the range of the interval becomes smaller and smaller. Correspondingly, the number of bits required to represent Code_Value increases. In that case, we normalize the interval by shifting both R and Code_Value left. During the normalization, we collect the bits shifted out of Code_Value, as those bits represent the compressed decisions.

Figure 5.4: Binary arithmetic coder. (The interval R, starting at Code_Value, is split into a lower MPS subinterval R_MPS with proportion 1 − p and an upper LPS subinterval R_LPS with proportion p.)

Figure 5.5: Adaptive binary arithmetic coder. (Blocks: Binary Decisions, Context, Probability Estimation, Update Probabilities, Arithmetic Coding Engine, Compressed Bitstream.)

Adaptive Arithmetic Coding

The adaptive binary arithmetic coding (ABAC) is one in which probabilities are adapted continuously with the coding of binary decisions.
The schematic diagram of ABAC is shown in Figure 5.5. The two most popular adaptive binary arithmetic coders are the M-coder and MQ-coder. The JPEG 2000 standard uses the MQ-coder for compressing the binary decisions, and the H.264 standard uses a variant of the M-coder for arithmetic coding of binary decisions. We will discuss binary arithmetic decoding algorithms used with the JPEG 2000 and H.264 in Sections 5.4 and 5.5. 5.2 Variable Length Decoding As we discussed in Section 5.1.1, Huffman codes or VLCs are used to perform lossless data compression (or entropy coding) by assigning fewer bits to more frequently occurring data symbols and assigning more bits to rarely occurring data symbols. We use a variable length decoder (VLD) at the other side to decode the bitstream. But the question is whether we have any such application where these kinds of data symbols occur in reality. The answer is yes. There are many applications with these kinds of data symbols. In this chapter, we consider the video data in which we ﬁnd this kind of unevenly probable symbols. In video coding (see Chapter 14 for more details on video coding technology), after applying the DCT to residual coefﬁcients, we obtain the transform domain coefﬁcients. We use VLC to encode the value and position of zigzag scanned quantized DCT coefﬁcients at the encoder side using a predeﬁned VLC codeword table. We use the corresponding VLD at the decoder side to decode the value and position of quantized DCT coefﬁcients. The MPEG-2 standard (MPEG-2: ISO/IEC, 1995) uses VLD to decode the received video bitstream. The standard speciﬁes many codeword tables to encode/decode various types of slice parameters, macroblock parameters, and residual coefﬁcients to/from the bitstream. In this section, an overview of the MPEG-2 residual VLD’s most complex tasks is presented and its simulation and implementation techniques are discussed. 
A few applications of the MPEG-2 codec are digital video broadcasting, digital subscriber lines, personal media players, HDTV, video surveillance, digital media storage (DVD), multimedia communications, and so on. Like the MPEG-2 standard, the MPEG-4 standard uses a VLD, although a different one, and the H.264 standard uses the CAVLC (context-based adaptive variable length coder) for lossless data compression. The MPEG-2 VLD is simpler than those of the MPEG-4 and H.264 standards. Although the concept of VLD is more or less the same in all standards, the way the encoding and decoding procedures are used for encoding or decoding the parameters varies greatly from standard to standard. The performance of the MPEG-2 VLD is reasonable when compared to the variable length coding performance of the MPEG-4 and H.264 standards.

5.2.1 MPEG-2 VLD

The MPEG-2 entropy coder uses variable length codes (VLCs) for lossless compression and decompression of video frame parameters. We use the VLD to decode the MPEG-2 bitstream. The MPEG-2 standard specifies many codeword tables to decode the various types of data that are encoded using VLC. With VLD, we decode the bitstream and get the encoded parameter information back by matching the received bit pattern with the appropriate MPEG-2 VLD codeword tables. Although the bitstream contains many parameters and header information along with residual coefficients, we focus on decoding 8×8 block residual coefficients, since 80 to 90% of the bitstream contains residual coefficient information. We use some video coding terminology in the following discussion; consider consulting Chapter 14 for more detail about video coding technology before proceeding.

Decoding of MPEG-2 Residual Coefficients

If the encoded video format is 4:2:0, then we have four 8×8 luma blocks and two 8×8 chroma blocks per macroblock.
In decoding the residual coefﬁcients of an 8×8 block, we decode DC (direct current or zero frequency) and AC (alternating current or high frequency) residual coefﬁcients for both luma and chroma components. To decode these coefﬁcients, the MPEG-2 standard speciﬁes altogether ﬁve codeword tables. With the MPEG-2 VLD, the only difference between luma and chroma coefﬁcients decoding is in decoding of the DC coefﬁcient in the case of intra macroblock. Otherwise the same code can be used to decode either luma or chroma. In decoding the DC coefﬁcients we use separate prediction values and codeword tables for luma blocks and chroma blocks. Although an 8×8 subblock contains 64 coefﬁcients, most of them will be zeros. We decode all non-zero coefﬁcients of an 8×8 subblock along with their locations using the MPEG-2 VLD. The ﬂow diagram for decoding an 8×8 subblock with the MPEG-2 VLD is shown in Figure 5.6. The basic parameters used in the MPEG-2 residual decoding are macroblock_type and intra_vlc_ format. Depending on the macroblock_type and intra_vlc_ format values, we select codeword tables for decoding residual coefﬁcients. If the macroblock_type is intra, then we decode the residual DC coefﬁcient (or ﬁrst coefﬁcient) using DC coefﬁcients codeword tables and AC coefﬁcients using another AC coefﬁcient codeword table. If the macroblock_type is inter, then both DC and AC coefﬁcients are decoded using the same codeword table. After decoding the ﬁrst coefﬁcient, we decode the rest of the AC coefﬁcients in the loop which run up to 63 times. In each iteration we get the codeword from the VLD table and check whether all the coefﬁcients are decoded (i.e., the end-of-block (EOB) is reached) or any coefﬁcients have to be decoded further. If EOB is reached, we then quit the loop, otherwise we continue decoding of the coefﬁcient. 
For each AC coefficient, we compute two values: the signed value and the run m (the number of zeros present from the previous non-zero coefficient to the present non-zero coefficient; note that this run value is meaningful only with respect to zigzag scanned positions; see MPEG-2: ISO/IEC, 1995). If m > 0, we first insert that many zeros in the coefficient buffer and then place the signed coefficient value. As shown in Figure 5.6, decoding the AC coefficients of a block includes many condition checks and conditional jumps. Decoding of one AC coefficient using VLD involves the following four steps:

1. Accessing 16 bits of bitstream
2. Accessing the appropriate look-up table to get "run" and "value"
3. Decoding sign bit
4. Updating the bit position and word offset

In the MPEG-2 VLD, to decode a residual coefficient, we have to analyze bit patterns with lengths from 2 bits up to 24 bits in the bitstream. Depending on the incoming bitstream bits, whichever bit pattern completely matches a codeword given in the table with the minimum number of bits, the corresponding row values (or VLD_SYMBOLS) of the table are chosen as the decoded values. In addition, the number of bits (NUM_BITS) used for encoding the VLD_SYMBOLS is not itself encoded; we determine NUM_BITS after decoding that particular codeword. Now, the question is how to search the codeword tables to find the right match. The brute-force solution is to match bit patterns of all sizes against all codewords of a table, which is very costly in terms of cycles. The other solution for this bitstream decoding problem is bit-pattern matching by using look-up tables. If we use one look-up table for decoding all sizes of bit patterns, then the size of the look-up table becomes 2^24 × 4 bytes = 64 MB. This is not a practical amount of on-chip data memory in many embedded processors.
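The run/value placement loop described in this subsection can be sketched independently of the codeword tables. The sketch below assumes the (run, value) pairs have already been decoded from the bitstream; the RunLevel type and the fill_block() function are hypothetical names introduced for illustration and are not part of the book's simulation code.

```c
#include <stddef.h>

/* One decoded AC symbol (hypothetical type for this sketch):
   a run of zeros followed by a signed value, or an end-of-block marker. */
typedef struct {
    int run;     /* zeros preceding the coefficient (zigzag order) */
    int value;   /* signed coefficient value                       */
    int eob;     /* nonzero when this symbol is the EOB codeword   */
} RunLevel;

/* Fill the 64-entry coefficient array D from the first coefficient and
   a list of decoded symbols.  Returns the number of positions written
   before the EOB (or before the list ended), or -1 if a run would
   place a coefficient beyond position 63. */
int fill_block(int D[64], int first, const RunLevel *cw, size_t ncw)
{
    int n = 0;

    D[n++] = first;                        /* DC or 0th coefficient */
    for (size_t k = 0; k < ncw; k++) {
        if (cw[k].eob)
            break;                         /* end of block reached  */
        if (n + cw[k].run >= 64)
            return -1;                     /* corrupt run length    */
        for (int m = cw[k].run; m > 0; m--)
            D[n++] = 0;                    /* insert m zeros        */
        D[n++] = cw[k].value;              /* then the signed value */
    }

    int written = n;
    while (n < 64)
        D[n++] = 0;                        /* zero the remainder    */
    return written;
}
```

For example, a first coefficient of 124 followed by the symbols (run 3, value 1), (run 0, value 2), (run 1, value −1), and EOB produces 124, 0, 0, 0, 1, 2, 0, −1 followed by zeros. The bounds check is an addition of this sketch; it guards against a damaged bitstream driving the index past the end of the array.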
VLD Decoding with Look-up Tables

The alternate solution for MPEG-2 VLD decoding is to use a combination of look-up tables and analytic methods. In this approach we use many small look-up tables and apply some logic to match the bit patterns of all sizes. Apart from the escape codes (which are 24 bits long; by analyzing the first 6 bits of a bit pattern we can tell whether it is an escape code or not), all other codes have a maximum length of 17 bits including the sign bit. If we take out the sign bit, then we have to analyze 16 bits. The advantage with these codes is that they are not random; they are systematically designed to uniquely represent all the possible positions and coefficient values of 8×8 blocks depending on their probability of occurrence.

Figure 5.6: Flow diagram for decoding coefficients of MPEG-2 8×8 blocks with VLD. (Flow: declare an array D[64] and set n = 0; decode the first coefficient, which is the DC coefficient for an intra block and the 0th coefficient otherwise, and set n = n + 1; then repeatedly decode one codeword CW, stopping when CW is the EOB; otherwise get the run m from CW, write D[n] = 0 and increment n for each of the m zeros, get the signed value x from CW, write D[n] = x, and increment n, continuing while n < 64.)

Now we discuss a simple method for the decoding of one DC coefficient and one AC coefficient with an example. Let us assume that the current macroblock is intra, intra_vlc_format is zero, and the received bitstream is 1101101001110000. At the time of encoding, in the case of the DC coefficient we encode both the DC difference and the number of bits (DC_SIZE) used to encode the DC difference. In the case of the AC coefficient we encode both run (to represent how many zeros are present from the previous non-zero coefficient) and value (with sign information).
Therefore, ﬁrst we decode the DC difference value (as we encode only the DC difference after subtracting actual DC value from the predicted value) and then we decode AC coefﬁcients. Decoding the DC Coefﬁcient Codeword In decoding the DC difference, ﬁrst we decode the DC_SIZE and then we read DC_SIZE bits from the bitstream to get the signed DC difference value. For decoding the ﬁrst part, DC_SIZE (to know the size of DC difference), if we scan through the luma DC codeword table for matching a minimum length codeword with the input bitstream, then we match the minimum length 110 codeword with input bitstream and this corresponds to value 4 (from the dct_dc_size_luminance look-up table) which is the size of the DC difference in terms of bits. Now we read next 4 bits (1101) from the bitstream as the DC difference value. As seen here, the decimal equivalent of the DC difference value is equal to 13. This DC difference of 13 is again manipulated to get the actual signed DC difference value before adding it to the prediction value to get the ﬁnal DC coefﬁcient. For decoding DC, we used a total of 7 bits so now we advance the bit position by 7 bits. The remaining bit pattern after decoding DC is 001110000. In decoding the AC coefﬁcient we decode both the signed coefﬁcient and the number of zeros present in between the previous coefﬁcient and current decoding coefﬁcient. As we assumed the current macroblock was intra and intra_vlc_ format was zero, then we select the corresponding codeword table to scan for the matching bitstream. The minimum length codeword we match with the bitstream from the codeword table is 001110 and that corresponds to a signed value of 1 and a run (the number of zeros between current and previous coefﬁcients) of 3. Next, we discuss the methodology to decode DC coefﬁcients using small look-up tables. For this, we consider the design of a look-up table for decoding the DC_SIZE that is used to get a DC difference value. 
According to the MPEG-2 standard, to decode the DC_SIZE, we have to analyze a maximum of 9 bits. If we use a look-up table for this, the table must contain two parameters: DC_SIZE and the number of bits in the codeword (NUM_BITS) by which to advance the bit position. We use 2 bytes to represent these two parameters in the look-up table design. If we want to decode using a single look-up table without any extra logic, then we need 1024 (2 × 2^9) bytes. This problem can also be solved by a different approach that uses a look-up table (VldTbA[ ], provided with the simulation results) that contains only 96 bytes, but requires a few operations to fully decode DC_SIZE. With this 96-byte look-up table, the parameters DC_SIZE and NUM_BITS are decoded as follows. First, we analyze 4 bits from the bitstream, and if the decimal equivalent of the 4 bits is less than 15, then we are sure (from the MPEG-2 codeword table dct_dc_size_luminance) that the DC_SIZE can be obtained with a look-up table of 32 (2 × 2^4) bytes. If the decimal equivalent is equal to 15, then we analyze 9 bits of bitstream to decode DC_SIZE. As seen in the codeword table, the 4 MSBs of such a codeword are all equal to 1, and if we mask these 4 bits then the effective address space is 5 bits. Therefore, the look-up table size for analyzing 9 bits is 64 (2 × 2^5) bytes.

Decoding the AC Coefficient Codeword

Similarly, we analyze the procedure for decoding AC coefficients in the MPEG-2 VLD. We choose one codeword table out of two AC coefficient codeword tables, depending on macroblock_type and intra_vlc_format, to decode the AC coefficient. We always extract a 16-bit string (excluding escape bits and sign bit) from the bitstream to decode any coefficient. For most of the VLD codewords of the same length, the length of prefix zeros is also constant. We first obtain the number of prefix zeros present in these 16 bits. For each prefix length, we choose a corresponding look-up table containing run and value.
Given the length of prefix zeros, we remove the prefix zeros from the 16-bit string and use the value of the remaining bits (the nonprefix bits) as an offset into the look-up table. If the number of nonprefix bits differs among codewords with a given prefix length, then we take care of this in the look-up table design, and the bit position is updated according to the value of NUM_BITS. Thus, all the look-up tables designed to decode AC coefficients contain NUM_BITS information along with run and value for each codeword. All 10 look-up tables, from VldTb0[ ] to VldTb9[ ], used to decode AC coefficients are provided in the following section with the simulation results.

Next, we discuss the methodology to decode the AC coefficient using small look-up tables. We assume an AC coefficient whose codeword contains six prefix zero bits. As said, we first extract a 16-bit string from the bitstream. Once shorter prefixes have been ruled out (the value is below 1024), we check whether the 16-bit string value is greater than or equal to 512 to determine whether the number of prefix zeros present in the 16-bit string is exactly 6 or more than 6. Once we know that the number of prefix zeros is 6, then we know from the MPEG-2 AC-coefficient VLD tables that we have only 8 codewords with 6 prefix zeros. We get the corresponding values (value, run, NUM_BITS) as decoded output with an appropriate offset derived from the 16-bit data as [(offset>>6)-8]. Here the offset is shifted right by 6 bits to discard the 6 LSBs, as the length of a codeword with 6 prefix zeros is only 10 bits excluding the sign bit. Out of these 10 bits, 6 are prefix zeros. In addition, the 4th bit from the right is 1 in all codewords with 6 prefix zeros, and we subtract 8 from 1xxx to clear this bit. Then only a 3-bit string remains in the offset, which represents eight unique entries in the look-up table VldTb4[ ].

5.2.2 MPEG-2 VLD Simulation

As seen in the previous discussion, the bitstream is accessed from the bitstream buffer for decoding each coefficient. We call this a bit FIFO operation.
We have two types of bit FIFO accesses for the bitstream buffer. In one case, we only extract a certain number of bits from the buffer without updating the bit position immediately. For example, we extract 16 bits at the beginning of decoding any AC coefficient, but we are not sure whether we are going to use all 16 bits. We come to know how many bits were used to decode a coefficient only after obtaining NUM_BITS from the look-up table. Then we update the bit position with NUM_BITS. In the other case, we know in advance how many bits we want to use to decode the value (as in decoding the DC difference value using DC_SIZE bits). In this case, we update the bit position (and the word pointer if the pointer update condition is satisfied) in the bit FIFO function itself. We use two different functions, Read_Bits( ) and Next_Bits( ), to access the bit FIFO with and without the bit position update. The simulation code for Next_Bits( ) and Read_Bits( ) is given in Pcodes 5.1 and 5.2. To extract K bits from the buffer, we read a continuous 32-bit string from the buffer and extract K bits from this string. In function Read_Bits( ), we decrement the bit position by the number of bits read from the buffer. Then we check whether the bit position has dropped to zero or below, and if it has, we increment the word (32 bits wide) pointer by 1 and add 32 to the bit position. With the Next_Bits( ) function, we update the bit position outside the function after obtaining NUM_BITS from the codeword read using fixed-length bits.

    int Next_Bits(Mpeg2Vld *pVld, int n)
    {
        unsigned int x, y, z;

        if (pVld->bit_pos >= n) {
            // all n bits are available in the current word
            x = Dat[pVld->word_offset];
            z = x << (32 - pVld->bit_pos);
            z = z >> (32 - n);
        } else {
            // the n bits straddle the current and the next word
            x = Dat[pVld->word_offset];
            z = x << (32 - pVld->bit_pos);
            z = z >> (32 - n);
            y = Dat[pVld->word_offset + 1];
            y = y >> (32 - n + pVld->bit_pos);
            z = z | y;
        }
        return z;
    }

Pcode 5.1: Simulation code for Next_Bits( ).
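The peek-then-advance pattern can be exercised with a small self-contained sketch. It is modeled on Pcode 5.1 but is not the book's code: the BitFifo type and the peek_bits() and skip_bits() names are hypothetical, and it assumes that bit_pos holds the number of unread bits left in the current 32-bit word (so a word with no bits consumed has bit_pos = 32).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit FIFO state for this sketch. */
typedef struct {
    const uint32_t *dat;   /* bitstream as 32-bit words, MSB first    */
    int word_offset;       /* index of the current word               */
    int bit_pos;           /* unread bits left in that word (1..32)   */
} BitFifo;

/* Return the next n bits (1..32) without consuming them. */
uint32_t peek_bits(const BitFifo *f, int n)
{
    assert(n >= 1 && n <= 32);
    assert(f->bit_pos >= 1 && f->bit_pos <= 32);

    /* Left-align the unread bits, then keep the top n of them. */
    uint32_t z = f->dat[f->word_offset] << (32 - f->bit_pos);
    z >>= (32 - n);

    /* If the current word runs out, take the rest from the next word. */
    if (f->bit_pos < n)
        z |= f->dat[f->word_offset + 1] >> (32 - n + f->bit_pos);
    return z;
}

/* Consume n bits (1..32), moving to the next word when needed. */
void skip_bits(BitFifo *f, int n)
{
    assert(n >= 1 && n <= 32);
    f->bit_pos -= n;
    if (f->bit_pos <= 0) {
        f->bit_pos += 32;
        f->word_offset++;
    }
}
```

With the words 0xace43d68 and 0x58d2968f and a fresh state (word_offset = 0, bit_pos = 32), peek_bits(&f, 4) returns 0xA. After skip_bits(&f, 23), nine bits remain in the first word, and peek_bits(&f, 16) returns 0xB42C, combining those nine bits with the top seven bits of the second word. Unlike Pcode 5.1, this sketch touches the following word only when the request actually straddles a word boundary, which avoids reading past the end of the buffer on the last word.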
    int Read_Bits(Mpeg2Vld *pVld, int n)
    {
        unsigned int x, y, z;

        if (pVld->bit_pos >= n) {
            x = Dat[pVld->word_offset];
            z = x << (32 - pVld->bit_pos);
            z = z >> (32 - n);
        } else {
            x = Dat[pVld->word_offset];
            z = x << (32 - pVld->bit_pos);
            z = z >> (32 - n);
            y = Dat[pVld->word_offset + 1];
            y = y >> (32 - n + pVld->bit_pos);
            z = z | y;
        }
        // consume the n bits that were just read
        pVld->bit_pos = pVld->bit_pos - n;
        if (pVld->bit_pos <= 0) {
            pVld->bit_pos += 32;
            pVld->word_offset++;
        }
        return z;
    }

Pcode 5.2: Simulation code for Read_Bits( ).

DC Coefficient Decoding Simulation

The DC coefficients are present only in the intra frame (I-frame) macroblocks. Both luma and chroma component macroblocks contain the DC coefficients. Each 8×8 subblock of a macroblock contains one DC coefficient. A total of six 8×8 subblocks (four luma and two for the two chroma components) are present in one macroblock, and hence we will have six DC coefficients per macroblock. However, luma and chroma subblocks use different VLD codeword tables for decoding the DC coefficient. The simulation code for decoding one luma DC coefficient is given in Pcode 5.3. In this code, we first extract a 4-bit string from the bitstream using the Next_Bits( ) function (as discussed earlier, the bit position is not updated within the Next_Bits( ) function) and then check whether these 4 bits are sufficient for the current DC coefficient size. If they are, then we decode DC_SIZE; otherwise, we read 9 bits to decode the DC_SIZE. In either case, we update the bit position after obtaining NUM_BITS along with the DC_SIZE value, as shown in Pcode 5.3. Once we know the DC_SIZE, we extract the DC_SIZE-bit value as DC_DIFFERENCE from the bitstream buffer using the Read_Bits( ) function. (Note: The Read_Bits( ) function updates the bit position by itself, and we do not update the bit position outside the function after reading the bits.)
The actual DC coefficient is obtained after adding the prediction value to the signed DC_DIFFERENCE, which is computed from DC_SIZE and DC_DIFFERENCE.

    if (pVld->intra_mb == 1) {                // decode 1st coefficient
        offset = Next_Bits(pVld, 4);
        if (offset < 15)
            cw = VldTbA[offset];              // DC_SIZE codeword
        else {
            offset = Next_Bits(pVld, 9);
            offset = offset - 0x1e0;          // clear the four leading 1 bits
            cw = VldTbA[16 + offset];         // DC_SIZE codeword
        }
        len = cw & 0xff;                      // NUM_BITS
        pVld->bit_pos = pVld->bit_pos - len;
        if (pVld->bit_pos <= 0) {
            pVld->bit_pos += 32;
            pVld->word_offset++;
        }
        size = cw >> 8;                       // DC_SIZE
        if (size == 0)
            diff = 0;
        else {
            diff = Read_Bits(pVld, size);     // DC_DIFFERENCE
            if ((diff & (1 << (size - 1))) == 0)
                diff -= (1 << size) - 1;      // MSB of 0 marks a negative difference
        }
        [...]

[The remainder of Pcode 5.3 and the opening lines of Pcode 5.4 are missing from this copy.]

        [...]
        if (offset >= 512) {
            if (pVld->vlc_format == 1) {
                if (offset >= 1024) cw = VldTb0[(offset>>8)-4];
                else cw = VldTb1[(offset>>6)-8];
            } else {
                if (offset >= 16384) cw = VldTb2[(offset>>12)-4];
                else if (offset >= 1024) cw = VldTb3[(offset>>8)-4];
                else cw = VldTb4[(offset>>6)-8];
            }
        }
        else if (offset >= 256) cw = VldTb5[(offset>>4)-16];
        else if (offset >= 128) cw = VldTb6[(offset>>3)-16];
        else if (offset >= 64) cw = VldTb7[(offset>>2)-16];
        else if (offset >= 32) cw = VldTb8[(offset>>1)-16];
        else if (offset >= 16) cw = VldTb9[offset-16];
        // continue with Pcode 5.6
    }

Pcode 5.4: Simulation code for decoding intra macroblock VLD codewords.

    for (i = 0; ; i++) {
        offset = Next_Bits(pVld, 16);
        if (offset >= 16384) {
            if (i == 0) cw = VldTbB[(offset>>12)-4];
            else cw = VldTb2[(offset>>12)-4];
        }
        else if (offset >= 1024) cw = VldTb3[(offset>>8)-4];
        else if (offset >= 512) cw = VldTb4[(offset>>6)-8];
        else if (offset >= 256) cw = VldTb5[(offset>>4)-16];
        else if (offset >= 128) cw = VldTb6[(offset>>3)-16];
        else if (offset >= 64) cw = VldTb7[(offset>>2)-16];
        else if (offset >= 32) cw = VldTb8[(offset>>1)-16];
        else if (offset >= 16) cw = VldTb9[offset-16];
        // continue with Pcode 5.6
    }

Pcode 5.5: Simulation code for decoding inter macroblock VLD codewords.
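The chain of comparisons in Pcode 5.5 effectively classifies the 16-bit value by its number of prefix zeros and then derives a table index by a shift and a subtraction. The following sketch expresses the same thresholds, shifts, and biases as data, for the case i > 0 of Pcode 5.5. The PrefixClass type, the kClasses array, and select_table() are hypothetical names introduced here; the sketch returns only the table number and the entry index, since the table contents themselves are not reproduced in this section.

```c
#include <stddef.h>

/* One row per prefix class (values taken from the branches of
   Pcode 5.5): a 16-bit peek value >= min selects table number
   `table`, entry (peek >> shift) - bias. */
typedef struct {
    unsigned min;
    int      table;
    int      shift;
    unsigned bias;
} PrefixClass;

/* Rows must stay in descending order of `min`. */
static const PrefixClass kClasses[] = {
    { 16384, 2, 12,  4 },
    {  1024, 3,  8,  4 },
    {   512, 4,  6,  8 },
    {   256, 5,  4, 16 },
    {   128, 6,  3, 16 },
    {    64, 7,  2, 16 },
    {    32, 8,  1, 16 },
    {    16, 9,  0, 16 },
};

/* Return the number of the VldTbN table to use (2..9) and store the
   entry index in *index, or return -1 when no class matches
   (peek < 16). */
int select_table(unsigned peek, unsigned *index)
{
    for (size_t k = 0; k < sizeof kClasses / sizeof kClasses[0]; k++) {
        if (peek >= kClasses[k].min) {
            *index = (peek >> kClasses[k].shift) - kClasses[k].bias;
            return kClasses[k].table;
        }
    }
    return -1;
}
```

For instance, a peek value of 576 (binary 0000 0010 0100 0000, six prefix zeros) selects table 4 with index (576 >> 6) − 8 = 1. This form does not by itself remove the conditional branches; it only makes the threshold, shift, and bias of each prefix class explicit in one place, and it makes the unmatched case (a value below 16) visible to the caller.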
AC Coefficients Decoding Simulation

Depending on the macroblock_type and intra_vlc_format flag value, we choose one VLD codeword table for decoding AC coefficients. The simulation of decoding residual AC coefficients is divided into two parts. In the first part, we obtain the VLD codeword, and in the second part we compute the run, value, and NUM_BITS from the codeword. The first part, obtaining the codeword, differs slightly between intra and inter macroblocks, as shown in Pcodes 5.4 and 5.5, but the basic idea of removing the prefix zeros is the same in both cases. In both cases, we extract 16 bits from the bitstream buffer using the Next_Bits( ) function. We obtain the offset for the look-up table using the extracted 16 bits after removing the prefix zeros from the codeword. The entry containing run, value, and NUM_BITS is accessed from the designed look-up tables using the offset. We extract the "run," "value," and NUM_BITS from the look-up output, and update the bit position using NUM_BITS. We inspect the "run" information for EOB, ESC, or regular coefficients and proceed accordingly. If the "run" information indicates a regular coefficient, then we first fill the "run" number of zeros into the coefficient buffer, followed by a signed coefficient. We compute the sign by accessing 1 bit from the bitstream buffer.

    len = cw >> 16;                           // NUM_BITS
    pVld->bit_pos = pVld->bit_pos - len;
    if (pVld->bit_pos <= 0) {                 // <= 0, as in Pcode 5.2
        pVld->bit_pos += 32;
        pVld->word_offset++;
    }
    run = cw & 0xff;
    if (run == 64) {                          // EOB
        while (i < 64) {
            Sym[i] = 0;
            i++;
        }
        return;
    }
    if (run == 65) {                          // escape
        run = Read_Bits(pVld, 6);
        val = Read_Bits(pVld, 12);
        if ((sign = (val >= 2048)))
            val = 4096 - val;
    } else {
        val = (cw & 0xff00) >> 8;
        sign = Read_Bits(pVld, 1);
    }
    if (sign) val = -val;
    while (run > 0) {                         // insert "run" zeros
        Sym[i] = 0;
        i++;
        run = run - 1;
    }
    Sym[i] = val;

Pcode 5.6: Simulation code for decoding and storing the coefficients from codeword obtained using either Pcode 5.4 or Pcode 5.5.

MPEG-2 VLD Simulation Results

Using the VLD codeword tables given in the MPEG-2 standard to decode the residual DCT coefficients, we design the look-up tables for decoding the coefficients. Here, we only design the look-up tables for decoding the coefficients of the luma component. The look-up table VldTbA[ ] is used to decode the DC coefficient of intra subblocks, whereas the look-up table VldTbB[ ] is used to decode the 0th coefficient of inter subblocks. Look-up tables VldTb0[ ] through VldTb9[ ] are used to decode the remaining 63 coefficients of the 8×8 subblock of either intra or inter macroblocks. All these look-up tables are derived from the MPEG-2 VLD codeword tables and can be found on this book's companion website.

Simulation Results

We provide the simulation results for decoding the residual coefficients of one intra-luma macroblock assuming the intra_vlc_format flag is 1. We use the following MPEG-2 encoded bitstream for the four 8×8 subblocks of the luma intra macroblock.

    Dat[6] = {0xace43d68, 0x58d2968f, 0x79626883, 0xd16a0360, 0x54205adb, 0x50000000};

We initialize the bitstream buffer parameters: word offset with 0 and bit position with 31. The updated word offset, bit position, and decoded residual coefficients after decoding each 8×8 subblock follow.
Decoded output of first luma block:

    pVld->word_offset = 0
    pVld->bit_pos = 9
    Sym[64] = 124, 0, 0, 0, 1, 2, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0;

Decoded output of second luma block:

    pVld->word_offset = 2
    pVld->bit_pos = 31
    Sym[64] = 129, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, -2, 3, 0, 0,
              -1, 2, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0;

Decoded output of third luma block:

    pVld->word_offset = 3
    pVld->bit_pos = 23
    Sym[64] = 151, -5, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, -1, 2, 0,
              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0;

Decoded output of fourth luma block:

    pVld->word_offset = 5
    pVld->bit_pos = 29
    Sym[64] = 146, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
              -1, -7, 0, 0, 2, 0, 0, 0, 0, 0, 0, -1, 0, 0, -1, 0,
              1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0;

Computational Complexity of MPEG-2 VLD

The decoding of residual coefficients using the MPEG-2 VLD codeword tables is costly in terms of cycles. In decoding one coefficient, we read the bitstream two or three times and access different look-up tables, and the decoding flow contains many conditional jumps. As the DC coefficient is present only in intra macroblocks and there is only one such coefficient per block, we are not going to discuss the complexity and optimization techniques for the DC coefficient. The number of AC coefficients present in a macroblock depends on the frame type and bit rate. As we saw in the simulation results, we may find five to six AC coefficients on average in medium bit rate applications. To decode residual AC coefficients, we use the simulation codes in Pcodes 5.4 and 5.6 or Pcodes 5.5 and 5.6.
Now, we analyze the maximum number of cycles consumed by the AC coefficients decoding loop. This loop decodes up to 63 AC coefficients in intra macroblocks and up to 64 coefficients in inter macroblocks. We decode two parameters, the signed value and the run, for each AC coefficient. As we discussed, the VLD simulation uses look-up tables with a few ALU operations. Decoding of one AC coefficient using the VLD involves the following four steps:

1. Access 16-bit string of bitstream using Next_Bits( ) function (10 cycles)
2. Access appropriate look-up table to get "run" and "value" (4 cycles)
3. Update the bit position and word offset (4 cycles)
4. Decode sign bit using Read_Bits( ) function (13 cycles)

The "for" loop code given in Pcode 5.4 or Pcode 5.5 to decode AC coefficients consists of many conditional jumps. If we assume that one conditional jump takes about 9 cycles on the reference embedded processor (see Appendix A, Section A.4, on this book's companion website for more details of cycle estimation on the reference embedded processor), then depending on the input data bit pattern we may take about 40 to 80 cycles to decode one AC coefficient. We also spend a variable number of cycles filling zeros into the coefficient array Sym[ ] when the decoded "run" is not zero.

In decoding residual AC coefficient information as we discussed in Section 5.2.1, we can have prefix zeros up to 10 bits in a codeword, and for that reason we scan for the number of prefix zeros by conditionally jumping after checking for a particular prefix length. Once we start decoding the VLD_SYMBOLS for an AC coefficient (i.e., run and value) by analyzing the 16 bits of bit pattern, we have to know how many bits (NUM_BITS) are actually used in decoding the VLD_SYMBOLS to advance the bit position. This information (NUM_BITS) also has to be coded for each codeword in the look-up table. With this, for each AC coefficient, our look-up table contains three entities: (1) Run, (2) Value, and (3) NUM_BITS.
If we represent each entity with 1 byte, then we need 3 bytes for each codeword of the VLD tables. As we use 1-byte, 2-byte, or 4-byte words for easy memory access, we need 4 bytes for each codeword. If there are "n" remaining bits after removing the prefix zero bits and excluding the sign bit, then to cover all codewords with that particular number of prefix zeros, we need a look-up table size of $4 \times 2^{n}$ bytes. Once we know the look-up table output, we know the value of NUM_BITS and we advance the bit position by NUM_BITS. We check the "run" information to find out whether this particular bit pattern represents an escape code (ESC) or end of block (EOB), as "run" contains abnormal values for these two cases. If the current bit pattern represents the escape code, we jump to decode the escape information (run and signed value). If the bit pattern represents EOB, then we fill the remaining coefficient values with zeros and exit the loop. Otherwise, the coefficient array pointer is advanced by "run" positions, filling zeros along the way. Then the sign information is decoded from the bitstream using Read_Bits( ): the coefficient value is negative if the next bit of the bitstream is 1 and positive if it is 0. The bit position is advanced by 1 bit with the Read_Bits( ) function. The decoded signed value is stored in the coefficient array at the position given by the array pointer, and the array pointer is increased by 1 to store the next coefficient value.

5.2.3 MPEG-2 VLD Optimization Techniques

As we discussed in the previous section, decoding an AC coefficient with the MPEG-2 VLD is a costly process in terms of cycles. If we have 10 AC coefficients in a particular 8×8 subblock, then we may need on average 600 cycles to decode all the subblock coefficients.
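The cycle figures quoted in the previous section can be tallied with a few lines of C. The sketch below adds the four step costs (10 + 4 + 4 + 13 = 31 cycles) and a 9-cycle charge per conditional jump. The number of jumps taken per coefficient is an assumption made here for illustration (the text does not state it); one to five jumps gives 40 to 76 cycles, which brackets the quoted range of about 40 to 80 cycles, and a mid-range figure of about 60 cycles per coefficient gives the 600 cycles quoted for 10 coefficients. Zero filling for non-zero runs is not modeled.

```c
/* Cycle figures quoted in the text for the reference embedded processor. */
enum {
    CYC_NEXT_BITS = 10,  /* step 1: Next_Bits() on a 16-bit string        */
    CYC_LUT       = 4,   /* step 2: look-up table access                  */
    CYC_UPDATE    = 4,   /* step 3: bit position and word offset update   */
    CYC_SIGN      = 13,  /* step 4: Read_Bits() for the sign bit          */
    CYC_JUMP      = 9    /* one conditional jump                          */
};

/* Estimated cycles for one AC coefficient. The jump count is an ASSUMED
   input; the text gives only the per-jump cost and the overall range. */
static int cycles_per_coeff(int jumps)
{
    return CYC_NEXT_BITS + CYC_LUT + CYC_UPDATE + CYC_SIGN + jumps * CYC_JUMP;
}
```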
In this section we discuss an efficient procedure (see Stein and Malepati, 2008) to decode AC coefficients. This approach reduces the cycle cost by approximately 80%, but the reduction is achieved at the cost of more memory. The technique follows from observing the statistics of MPEG-2 VLD test vectors. The following are the statistics of the MPEG-2 VLD for two test vectors.

1. About 90% of VLD symbols are coded with less than or equal to 10 bits.
2. The percentage of bits on average used for each VLD symbol is shown in Figure 5.7.
3. About 25 to 45% of the time any two successive VLD symbols are represented with less than or equal to 10 bits.
4. About 2 to 5% of the time any three successive VLD symbols are represented with less than or equal to 10 bits.

As seen in the previous statistics, 90% of the time we decode coefficients with less than or equal to 10 bits including sign information. Interestingly, out of that 90%, 30% of coefficients use only 3 bits, 30 to 40% of the time two consecutive coefficients consume less than 10 bits, and 5% of the time three consecutive coefficients consume about 10 bits. This analysis prompts us to design look-up tables for decoding multiple coefficients with only one access of the bitstream buffer and look-up table.

If we consider the 10-bit offset, then we need a look-up table with 1024 (= $2^{10}$) entries. For each AC coefficient, we need three elements of information (NUM_BITS, value, and run). We do not compute sign information separately for each coefficient; instead, we embed the sign information into "value" at the time of designing the look-up table. In this simulation, we pack the information of up to three AC coefficients in one look-up table entry as shown in Figure 5.8.
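The packed entry can be illustrated with a small pack/unpack pair. The field widths used below are inferred from the shifts in Pcode 5.7 (a 2-bit coefficient count, then a 4-bit run and a 5-bit signed value per coefficient, or a 4-bit value when three coefficients are packed, then a 4-bit NUM_BITS, all MSB first). They are an assumption about the companion-website tables, and the names pack_entry and unpack_entry are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: builds one 32-bit entry laid out as
   [count:2][run:4][value:vb] x count [NUM_BITS:4], MSB first,
   where vb = 4 when count == 3 and vb = 5 otherwise (assumed layout). */
static uint32_t pack_entry(int count, const int run[], const int value[],
                           int num_bits)
{
    int vb = (count == 3) ? 4 : 5;
    uint32_t vmask = (1u << vb) - 1u;
    uint32_t e = (uint32_t)count << 30;
    int pos = 30;                               /* lowest bit used so far */
    for (int j = 0; j < count; j++) {
        pos -= 4;
        e |= ((uint32_t)run[j] & 0xFu) << pos;
        pos -= vb;
        e |= ((uint32_t)value[j] & vmask) << pos;  /* two's complement field */
    }
    pos -= 4;
    e |= ((uint32_t)num_bits & 0xFu) << pos;
    return e;
}

/* Reads the fields back in the same order as the loop in Pcode 5.7.
   Returns the coefficient count stored in the entry. */
static int unpack_entry(uint32_t e, int run[], int value[], int *num_bits)
{
    int count = (int)(e >> 30);
    int vb = (count == 3) ? 4 : 5;
    int temp = 2;                               /* bits consumed from the MSB side */
    for (int j = 0; j < count; j++) {
        run[j] = (int)((e << temp) >> 28);
        temp += 4;
        uint32_t raw = (e << temp) >> (32 - vb);
        /* portable sign extension of a vb-bit two's complement value */
        value[j] = (raw & (1u << (vb - 1))) ? (int)raw - (1 << vb) : (int)raw;
        temp += vb;
    }
    *num_bits = (int)((e << temp) >> 28);
    return count;
}
```

With three coefficients packed, the entry uses 2 + 3 × 8 + 4 = 30 bits, which is why the value field has to shrink to 4 bits in that case.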
Each entry is 32 bits (4 bytes) wide and holds the number of coefficients packed, the run and signed value of each of those coefficients, and the total number of bits (NUM_BITS) consumed by all the coefficients packed in that entry. The look-up table size becomes 4096 (= $4 \times 2^{10}$) bytes for one AC-coefficient codeword table. In the MPEG-2 VLD, depending on the intra_vlc_format value, we have two codeword tables to decode residual AC coefficients. Thus, we need a total of 8 kB of data memory for two look-up tables. In this implementation, all the codewords with more than 10 bits are treated as escape codes. In the case of escape codes, our look-up entry contains the pointer for the next small look-up table to decode the VLD symbol which consumes less than or equal to 17 bits including the sign bit. If the VLD symbol consumes more than 17 bits, then a separate code performs the decoding of that particular VLD symbol.

[Figure 5.7: Histogram of MPEG-2 VLD symbol length versus percentage of their occurrence. Horizontal axis: number of bits (3 to ≥11), with the 10-bit offset marked; vertical axis: percent of occurrence (0 to 35).]

[Figure 5.8: Look-up table design for efficient implementation of MPEG-2 VLD. Fields of one entry: No. of coefficients | Value 1, Run 1 | Value 2, Run 2 | Value 3, Run 3 | NUM_BITS.]

Although we decided on an offset length of 10 bits, we can also simultaneously access the data-register-width number of bits (usually 32) from the bitstream to the data register to reduce the number of bitstream buffer accesses and thus decode multiple coefficients with a single access. With this method of implementation, the following bit offset analysis and coefficient decoding are possible.
(10 bit, 10 bit, 10 bit) -> 3 to 9 coefficients (occur with high probability)
(10 bit, 10 bit, ESC) -> 2 to 6 coefficients (occur with high probability)
(10 bit, 17 bit or ESC) -> 2 to 4 coefficients (occur with medium probability)
(17 bit or ESC, 10 bit) -> 1 to 4 coefficients (occur with medium probability)
(ESC) -> 1 coefficient (occur with low probability)

Based on the previous analysis, we can decode up to nine symbols with one bitstream access. The number of bitstream accesses, look-up table accesses, and conditional jumps per 8×8 subblock AC coefficients decoding is greatly reduced in this approach. The simulation code for the implementation of this efficient decoding is given in Pcodes 5.7 through 5.10. To reduce the cycles further, instead of filling the AC coefficient array with zeros conditionally for every coefficient and at the end of the block in "while" loops, we fill all 64 coefficients unconditionally with zeros at the start and write only the coefficient values at the appropriate positions in the decoding loop by incrementing the coefficient array pointer using the "run" information.

Computational Complexity with Efficient Implementation of VLD

With the efficient implementation, we access the bitstream buffer once for multiple symbols. In addition, as we extract 32 bits at a time from the bitstream buffer, the access is simple and consumes fewer cycles (around 5, when compared to the less-than-32 bits case, which takes around 10 cycles as discussed). Updating the bit position twice for each coefficient is not required. Instead, the bit position is updated once for all symbols present in the 10-bit length pattern. With this efficient implementation, jumps occur only in cases of EOB (once per 8×8 block) and ESC (which occurs rarely). On the reference embedded processor, an average of less than 100 cycles is used to decode 10 coefficients using this implementation, compared to 600 cycles using the standard implementation provided in Pcodes 5.4 through 5.6.
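The zero-fill strategy described above (clear the whole block once, then write only the non-zero values at positions advanced by "run") can be sketched as follows. The function place_coeffs and its argument layout are hypothetical; in Pcodes 5.7 through 5.10 the same idea is interleaved with the table look-ups rather than applied to pre-decoded arrays.

```c
#include <string.h>

/* Hypothetical helper: clears a 64-entry block once, then scatters n decoded
   (run, value) pairs starting at index `start`. Returns the index following
   the last coefficient written, or -1 if a run would leave the block. */
static int place_coeffs(int sym[64], int start, const int run[],
                        const int value[], int n)
{
    int i = start;
    memset(sym, 0, 64 * sizeof(int));   /* one unconditional fill */
    for (int k = 0; k < n; k++) {
        i += run[k];                    /* skip `run` zeros, already in place */
        if (i >= 64)
            return -1;
        sym[i++] = value[k];
    }
    return i;
}
```

For the first luma block of the simulation results, the AC pairs (run, value) = (3, 1), (0, 2), (1, -1) with start = 1 reproduce the non-zero entries at indices 4, 5, and 7; the caller stores the DC value at index 0 afterward.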
On the flip side, we require about 8 kB of additional data memory to use this implementation.

Look-up Tables for Efficient Implementation of VLD

The simulation code given in Pcodes 5.7 through 5.10 can be used to decode the residual coefficients of 8×8 subblocks of both intra and inter macroblocks, with appropriate look-up table selection based on intra_vlc_format and macroblock_type. The look-up tables used with the efficient implementation of the MPEG-2 VLD can be found on the companion website. The average number of coefficients present in a subblock varies with the bit rate for the given frame resolution and frame rate. For example, the average number of coefficients present in a subblock will be less for the 6-Mbps bit rate than for the 10-Mbps bit rate for bitstreams with the same full D1 (720×480) resolution at 30 fps (frames per second).

    for(i = 1; ; i++) {
        cw = Next_Bits(pVld, 32);           // extract 32 bits from bitstream
        code = cw >> 22;                    // obtain 10 bit offset
        bitstr = Tb[code];                  // get one look-up table entity
        count = bitstr >> 30;
        if (count != 0) {
            val_inc = 5;
            if (count == 3) val_inc = 4;
            temp = 2;
            for(j = 0;j < count;j++) {
                run = ((bitstr << temp)>>28); temp+= 4;
                value = ((int)(bitstr << temp) >> (32-val_inc)); temp+= val_inc;
                i+= run; Sym[i] = value; i++;
            }
            i--;
            val = (bitstr << temp) >> 28;
            pVld->bit_pos = pVld->bit_pos - val;
            if (value == 0) break;
            cw = cw << val;
            code = cw >> 22;                // obtain second 10-bits offset
            bitstr = Tb[code];
            count = bitstr >> 30;
            // continue with Pcode 5.8

Pcode 5.7: Simulation code for efficient implementation of MPEG-2 VLD.

5.3 H.264 VLC-Based Entropy Coding

In Section 5.2, we discussed the VLCs used with the MPEG-2 standard. MPEG-2 uses static VLC tables to code different types of video parameters and data.
The VLC scheme used for MPEG-2 entropy coding is nonadaptive since we do not use any context information in coding the symbols (except the intra_vlc_format flag to choose between two codeword tables). In this section, we discuss more advanced VLC coding schemes that are used in the H.264 standard. The H.264 standard uses two types of VLC schemes to compress the bitstream: (1) universal VLCs (UVLCs), and (2) context-adaptive VLCs (CAVLCs). We use VLC schemes in H.264 when the entropy_coding_mode flag is set to zero. The UVLC scheme is used to code different parameters (e.g., slice layer and macroblock layer headers, motion vectors, and coded block pattern), and the CAVLC scheme is used to code the residual coefficients. The following subsections present an overview of the UVLC and CAVLC schemes and their simulation techniques.

5.3.1 Overview of the H.264 VLC Schemes

With the H.264 coder (for more details, see Section 14.4), we code the following types of data elements: (1) sequence parameters, (2) picture parameters, (3) slice layer parameters, (4) macroblock layer parameters, and (5) residual coefficients. All data elements except residual coefficients are coded using either fixed-length codes or exponential Golomb codes. These VLC schemes are also known as UVLCs. The residual coefficients are coded using CAVLCs.

Fixed-Length Codes

We code the equiprobable data elements using a fixed-length code (FLC) of n bits since the coding of equiprobable elements does not offer any data compression.
With FLC, we do not analyze the n-bit pattern. In most cases we obtain the coded information directly from the n bits of the bitstream, and in some cases we use a look-up table that is accessed using the n bits as an offset. In other words, we read a fixed number of bits from the bit FIFO (the bits are unsigned, and we denote the bit-reading function as u(n)), and the Code_Num (or data parameter information) is given either by the n-bit block or by the output of the look-up table that is accessed using the n-bit block as an offset.

            // continuation from Pcode 5.7
            if (count != 0) {
                val_inc = 5;
                if (count == 3) val_inc = 4;
                temp = 2;
                for(j = 0;j < count;j++) {
                    run = ((bitstr << temp)>>28); temp+= 4;
                    value = ((int)(bitstr << temp) >> (32-val_inc)); temp+= val_inc;
                    i+= run; Sym[i] = value; i++;
                }
                i--;
                val = (bitstr << temp) >> 28;
                pVld->bit_pos = pVld->bit_pos - val;
                if (value == 0) break;
                cw = cw << val;
                code = cw >> 22;            // obtain third 10 bits
                bitstr = Tb[code];
                count = bitstr >> 30;
                if (count != 0) {
                    val_inc = 5;
                    if (count == 3) val_inc = 4;
                    temp = 2;
                    for(j = 0;j < count;j++) {
                        run = ((bitstr << temp)>>28); temp+= 4;
                        value = ((int)(bitstr << temp) >> (32-val_inc)); temp+= val_inc;
                        i+= run; Sym[i] = value; i++;
                    }
                    i--;
                    val = (bitstr << temp) >> 28;
                    pVld->bit_pos = pVld->bit_pos - val;
                    if (value == 0) break;
                }
            }
            // continue with Pcode 5.9

Pcode 5.8: Simulation code for efficient implementation of MPEG-2 VLD.

Exponential Golomb Codes

In the H.264 standard, the exponential Golomb codes (or exp-Golomb codes) are used to code a variety of data parameters. With exp-Golomb codes, a single infinite-length codeword table is used to code different kinds of parameters. Instead of designing a different VLC table for each data parameter, the mapping to the codeword table is adapted according to the data statistics for coding a particular data parameter. The codewords of such a code progress in logical order. One such codeword table with the general form [m-zeros|1|m bits] is given in Table 5.2. Here, the length of the codeword is 2m + 1.
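A codeword of the form [m-zeros|1|m bits] can be generated directly from Code_Num: the binary form of Code_Num + 1 is [1|m bits], so m is one less than its bit length. The sketch below writes the codeword as a string of '0' and '1' characters for easy comparison with Table 5.2. The function name expgolomb_encode and the string output are illustrative choices, not part of the book's simulation code.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes the exp-Golomb codeword [m zeros | 1 | m bits] for code_num into
   out as a '0'/'1' string. Returns the codeword length 2m+1, or -1 if the
   value is too large for this sketch or out (capacity cap, including the
   terminator) is too small. */
static int expgolomb_encode(uint32_t code_num, char *out, size_t cap)
{
    if (code_num >= 0x7FFFFFFFu)
        return -1;                      /* keep code_num + 1 below 2^31 */
    uint32_t x = code_num + 1u;         /* binary form of x is [1 | m bits] */
    int m = 0;
    while ((x >> (m + 1)) != 0)         /* m = floor(log2(code_num + 1)) */
        m++;
    int len = 2 * m + 1;
    if ((size_t)len + 1 > cap)
        return -1;
    for (int i = 0; i < m; i++)
        out[i] = '0';                   /* m prefix zeros */
    for (int i = 0; i <= m; i++)        /* then the m+1 bits of x, MSB first */
        out[m + i] = ((x >> (m - i)) & 1u) ? '1' : '0';
    out[len] = '\0';
    return len;
}
```

For Code_Num = 9 the function produces the 7-bit codeword 0001010.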
We construct each exp-Golomb codeword at the encoder with the formula $m = \lfloor \log_2(\text{Code\_Num} + 1) \rfloor$. We frame m zero bits with the suffix bit "1" as [m-zeros|1]. Then, we obtain the m-bit information field from another formula, $\text{m bits} = \text{Code\_Num} + 1 - 2^{m}$. With this, the final codeword is obtained as [m-zeros|1|m bits]. This exp-Golomb code has regular decoding properties. To decode a given codeword, we first compute the prefix zeros count (i.e., m). Once we know the value of m, we consider the following m + 1 bits after the prefix zeros and compute the Code_Num as [1|m bits] − 1. For example, given the bitstream 000101000101, the number of prefix zeros is 3. The 4 (i.e., 3 + 1) bits following the prefix zero bits are [1010] and the Code_Num = [1010] − 1 = 9.

                // continuation from Pcode 5.8
                else {
                    if (bitstr == 0) continue;
                    else {
                        cw = cw << 10;
                        pVld->bit_pos = pVld->bit_pos - 10;
                        temp = bitstr >> 16;
                        code = cw >> (32-temp);
                        offset = bitstr & 0xffff;
                        offset = offset + (code<<1);
                        bitstr = Tba[offset];
                        offset = bitstr >> 8;
                        value = (int) (bitstr << 24) >> 24;
                        run = offset & 0x1f;
                        code = offset >> 5;
                        cw = cw << code;
                        pVld->bit_pos = pVld->bit_pos - code;
                        i+= run; Sym[i] = value;
                    }
                }
            }
            else {
                if (bitstr == 0) {              // escape: 6-bit run, 12-bit value
                    cw = cw << 6;
                    run = cw >> 26; i+= run;
                    cw = cw << 6;
                    value = cw >> 20;
                    pVld->bit_pos = pVld->bit_pos - 24;
                    if((sign = (value>=2048))) value = 4096 - value;
                    if (sign) value = - value;
                    Sym[i] = value;
                }
                // continue with Pcode 5.10

Pcode 5.9: Simulation code for efficient implementation of MPEG-2 VLD.

The data parameter "v" is mapped to Code_Num before encoding and we name the exp-Golomb code accordingly. The four exp-Golomb codes used in the H.264 standard are (1) ue(v), unsigned exp-Golomb code, (2) se(v), signed exp-Golomb code, (3) me(v), mapped exp-Golomb code, and (4) te(v), truncated exp-Golomb code. The parameter v is mapped to Code_Num for these schemes as follows.
\[ \text{ue}(v):\quad \text{Code\_Num} = v \tag{5.1} \]

\[ \text{se}(v):\quad \text{Code\_Num} = \begin{cases} 0 & \text{if } v = 0 \\ 2|v| & \text{if } v < 0 \\ 2v - 1 & \text{if } v > 0 \end{cases} \tag{5.2} \]

\[ \text{me}(v):\quad \text{Code\_Num} = \text{LUT}[v] \tag{5.3} \]

where LUT is a predefined mapping look-up table.

\[ \text{te}(v):\quad \text{Code\_Num} = \begin{cases} \text{ue}(v) & \text{if } v > 1 \\ !u(1) & \text{if } v = 1 \end{cases} \tag{5.4} \]

                // continuation from Pcode 5.9
                else {
                    cw = cw << 10;
                    pVld->bit_pos = pVld->bit_pos - 10;
                    temp = bitstr >> 16;
                    code = cw >> (32-temp);
                    offset = bitstr & 0xffff;
                    offset = offset + code;
                    bitstr = Tba[offset];
                    offset = bitstr >> 8;
                    value = (int) (bitstr << 24) >> 24;
                    run = offset & 0x1f;
                    code = offset >> 5;
                    cw = cw << code;
                    pVld->bit_pos = pVld->bit_pos - code;
                    i+= run; Sym[i] = value;
                    code = cw >> 22;
                    bitstr = Tb[code];
                    count = bitstr >> 30;
                    if (count != 0) {
                        val_inc = 5;
                        if (count == 3) val_inc = 4;
                        temp = 2;
                        for(j = 0;j < count;j++) {
                            run = ((bitstr << temp)>>28); temp+= 4;
                            value = ((int)(bitstr << temp) >> (32-val_inc)); temp+= val_inc;
                            i+= run; Sym[i] = value; i++;
                        }
                        i--;
                        val = (bitstr << temp) >> 28;
                        pVld->bit_pos = pVld->bit_pos - val;
                        if (value == 0) break;
                    }
                    else continue;
                }
            }
            if (pVld->bit_pos <= 0) {
                pVld->bit_pos+= 32; pVld->word_offset++;
            }
        }
        if (pVld->bit_pos <= 0) {
            pVld->bit_pos+= 32; pVld->word_offset++;
        }

Pcode 5.10: Simulation code for efficient implementation of MPEG-2 VLD.

Table 5.2: Exp-Golomb codeword table

    Code_Num    Codeword
    0           1
    1           010
    2           011
    3           00100
    4           00101
    5           00110
    6           00111
    7           0001000
    8           0001001
    9           0001010
    10          0001011
    ...         ...

■ Example 5.3

Consider the bitstream 011001010001111001100110100. We decode it with the UVLC schemes u(n), ue(v), se(v), me(v), and te(v), using one scheme for one parameter. Let n = 3 for u(n), and let the range of v be greater than 1 for te(v). Set the bit position to zero (bit_pos = 0); the bits are read from the MSB side. We compute the data parameters as follows. In the case of se(v) decoding, if Code_Num is even then v = −(Code_Num)/2, else v = (Code_Num + 1)/2, and if Code_Num = 0 then v = 0.

u(n): v = Code_Num = u(3) = 011 = 3.
    Total bits used: 3
    Updated bit_pos: 3
    Remaining bitstream = 001010001111001100110100

ue(v): Prefix zeros m = 2
    Next m+1 bits: 101
    v = Code_Num = [101]−1 = 4
    Total bits used: 5
    Updated bit_pos: 8
    Remaining bitstream = 0001111001100110100

se(v): Prefix zeros m = 3
    Next m+1 bits: 1111
    Code_Num = [1111]−1 = 14
    v = −Code_Num/2 = −7
    Total bits used: 7
    Updated bit_pos: 15
    Remaining bitstream = 001100110100

me(v): Prefix zeros m = 2
    Next m+1 bits: 110
    Code_Num = [110]−1 = 5
    v = LUT[Code_Num]
    Total bits used: 5
    Updated bit_pos: 20
    Remaining bitstream = 0110100

te(v): Use ue(v) as v > 1 is assumed
    Prefix zeros m = 1
    Next m+1 bits: 11
    Code_Num = [11]−1 = 2
    v = Code_Num
    Total bits used: 3
    Updated bit_pos: 23
    Remaining bitstream: 0100
■

Context-Adaptive Variable Length Codes

In the H.264 standard, the CAVLC is used to code the residual zigzag-ordered 16 luma DC coefficients, 4 chroma DC coefficients, and the AC coefficients of 4×4 (luma or chroma) subblocks. With the CAVLC scheme, VLC tables for various syntax elements are changed depending on already coded syntax elements. Since the VLC tables are designed to match the corresponding statistics, the entropy coding performance is improved in comparison to schemes using a single VLC table, such as in the MPEG-2 standard.

The CAVLC tables are designed to take advantage of several characteristics of quantized residual data symbols. Typically, after the transform, the magnitude of residual data symbols diminishes as we go from the low frequency end to the high frequency end. With quantization, most of the insignificant symbols are truncated to zero. At medium bit rates, the residual data symbols contain many zeros after quantization. After the zigzag scan, the scanned array contains the DC and significant AC coefficients at the beginning of the array, and most of the zeros fall after the significant symbols.
An example of a zigzag scanned residual AC coefficients array r[ ] for a 4×4 subblock follows:

    r = [7, −3, 0, 1, −1, 0, 0, 1, −1, 0, 0, 0, 0, 0, 0, 0]    (5.5)

As we discussed, the zigzag scanned and quantized residual symbol array contains a majority of zeros. In addition, the following cases are true for most quantized, zigzag scanned residual symbol arrays:

• The non-zero coefficients decay as we move from the start of the array toward its end.
• The majority of trailing non-zero coefficients are 1s.
• The number of non-zero coefficients of neighboring blocks are correlated.

Hence, the CAVLC is designed to compactly represent residual data that reflects the previous cases. A few important characteristics of the CAVLC follow:

1. Adapts the tables to code the total coefficients of the block depending on its neighbor blocks' total coefficients
2. Uses trailing 1s to take care of non-zero trailing coefficients
3. Adapts various VLC tables to code non-zero coefficients from large coefficients to small coefficients
4. Adapts various VLC tables to code the total number of zeros present in-between all the coefficients
5. Adapts various VLC tables to code the number of zeros between two coefficients
6. Uses run-level coding to compactly represent the string of 0s

The H.264 CAVLC includes the following decoding steps:

• Coeff_Token (total coefficients and trailing 1s)
• Sign information for trailing 1s
• Signed-level information for remaining non-zero coefficients
• Total zeros present between all non-zero coefficients
• Run-before (the number of zeros present between two consecutive non-zero coefficients)

Coeff_Token

In the CAVLC, the total coefficients and trailing 1s are combined and treated as Coeff_Token. The coefficient total is the count of all non-zero coefficients present in a block. For example, in array r[ ], we have a total of six non-zero coefficients. In many cases, most of the trailing coefficients are 1s.
However, we code at most three coefficients of magnitude 1 as trailing 1s with the CAVLC. We treat only the last three such coefficients as trailing 1s even if we have more than three. For example, although we have four 1s in array r[ ], we treat the last three as trailing 1s and the remaining 1 as a normal coefficient. Thus, with respect to array r[ ], the total coefficients are six and the trailing 1s are three. There are many codeword tables specified in the H.264 standard to decode Coeff_Token. Depending on the context, we choose a particular look-up table to decode Coeff_Token from the bitstream. The context "nC" is determined based on the total coefficients present in the corresponding up and left blocks. Using context "nC", we choose one codeword table and decode the Coeff_Token by searching for the bitstream bit pattern that matches a minimum length codeword from the codeword table. The Coeff_Token of the matching codeword gives the total coefficients and trailing 1s.

Sign of Trailing 1s

Once we decode the Coeff_Token, we know whether there are any trailing 1s present in the block. If trailing 1s are present, then we know their magnitudes are 1 and we need only their signs. Thus, we obtain the signs from the bitstream for all trailing 1s. If we read bit "1" from the bitstream, then the sign of the trailing 1 is minus, and if we read bit "0", then the sign of the trailing 1 is plus. We do not use any context information in decoding the signs for trailing 1s.

Levels

The signed levels of the remaining non-zero coefficients are decoded in reverse order, starting with the highest frequency coefficient and working back toward the DC coefficient. There are seven VLC tables, VLC0 to VLC6, to choose from based on context, which depends on the previously decoded level's magnitude.
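Returning to the Coeff_Token fields described above, the total coefficients and trailing 1s of a zigzag scanned block can be computed on the encoder side with one backward scan. The sketch below is illustrative only; the function name coeff_token_fields is hypothetical, and the array r[ ] of Equation (5.5) is used as the test input.

```c
/* Computes the two Coeff_Token fields for a zigzag scanned 4x4 block:
   the count of non-zero coefficients and the number of trailing
   coefficients of magnitude 1 (capped at three). */
static void coeff_token_fields(const int r[16], int *total_coeffs,
                               int *trailing_ones)
{
    int tc = 0, t1 = 0, counting = 1;
    for (int i = 15; i >= 0; i--) {     /* scan from the high-frequency end */
        if (r[i] == 0)
            continue;
        tc++;
        if (counting && t1 < 3 && (r[i] == 1 || r[i] == -1))
            t1++;
        else
            counting = 0;   /* a larger level, or the cap of three, ends the run */
    }
    *total_coeffs = tc;
    *trailing_ones = t1;
}
```

For r[ ] the scan yields six total coefficients and three trailing 1s; the fourth coefficient of magnitude 1 (at index 3) is left to be coded as a normal level.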
The table VLC0 is biased toward lower magnitudes, table VLC1 is biased toward slightly higher magnitudes, and so on; finally, table VLC6 is biased toward larger magnitude levels. If the current decoded coefficient is greater than the predefined threshold, then we move from table VLCm to table VLCn, where m < n. An analytical method to decode the signed level (apart from trailing 1s) follows.

Initially, SuffixLength = 0. If the total coefficients are greater than 10 and the trailing 1s are less than three, then SuffixLength is set to 1.

1. Decode LevelPrefix (using the bitstream and the LevelPrefix VLC table)
2. Determine the LevelSuffixSize (from LevelPrefix and SuffixLength)
3. Decode LevelSuffix with LevelSuffixSize bits (from the bitstream)
4. LevelCode = (min(15, LevelPrefix) << SuffixLength) + LevelSuffix
5. Adjust LevelCode as follows:
   (a) If (LevelPrefix ≥ 15) and (SuffixLength = 0), then LevelCode = LevelCode + 15
   (b) If (LevelPrefix ≥ 16), then LevelCode = LevelCode + (1<<(LevelPrefix − 3)) − 4096
   (c) If the trailing 1s are less than 3 and the first non-zero coefficient is being decoded, then LevelCode = LevelCode + 2
6. Compute the level from LevelCode as given:
   (a) Level = (LevelCode + 2) >> 1 if LevelCode is even
   (b) Level = (−LevelCode − 1) >> 1 if LevelCode is odd
7. Increment the SuffixLength if the current level magnitude is greater than the predefined threshold, and repeat steps 1 to 7 to decode the remaining levels

Total Zeros

The value total_zeros gives the total number of zeros present between the start of the zigzag scanned array and the last non-zero coefficient (which can be a trailing 1). For example, in array r[ ], the value total_zeros is 3. Although we compute run_before (which gives the total number of zeros present between two consecutive non-zero coefficients) for placing each coefficient in the decoded array, there are two advantages in computing the total_zeros.
The first advantage is that the decoding of run_before can be adapted with the zeros-left information (which is obtained after subtracting the run_before of the previously decoded coefficient from total_zeros). The second advantage is that there is no need to compute the run_before for the lowest frequency non-zero coefficient, because zeros-left indicates how many zeros are present from the start of the array to that coefficient position. The VLC table for decoding total_zeros is adapted based on the total number of non-zero coefficients present in the block.

Run Before

The number of zeros preceding each non-zero coefficient is termed run_before and is decoded in reverse order (i.e., from the highest frequency non-zero coefficient toward the lowest frequency non-zero coefficient). For example, in array r[ ], the run_before between −1 and 1 is 2, and the run_before between −3 and 1 is 1. We do not compute the run_before in the following two cases: (1) when the remaining total_zeros is zero (which indicates that no zeros are present between the coefficients) and (2) when we reach the lowest frequency non-zero coefficient (for this coefficient, zeros-left gives the count of preceding zeros). The VLC tables for decoding run_before are adapted using the zeros-left information (which is obtained after subtracting the run_before of the previously computed coefficient from its zeros-left). At the start, we assign the total_zeros to zeros-left. Once we decode all the levels and run lengths (of zeros), we store each level accordingly using the run lengths.

Next, we discuss the simulation details for decoding each step of the CAVLC. The H.264 standard specifies many codeword tables and functions to decode the residual coefficients, as discussed previously. We consider the design of look-up tables for a few codeword tables in the simulation; the rest of the look-up tables can be designed using similar approaches.
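The relationship between total_zeros, run_before, and zeros-left described above can be checked on array r[ ] of Equation (5.5) with a short encoder-side sketch. The function name zeros_and_runs is hypothetical. It lists run_before for every non-zero coefficient in reverse order, including the lowest frequency one, which a decoder would derive from zeros-left rather than read from the bitstream.

```c
/* For a zigzag scanned 4x4 block r[], computes total_zeros and the number
   of zeros immediately preceding each non-zero coefficient, listed from the
   highest-frequency non-zero coefficient down to the lowest.
   Returns the number of non-zero coefficients. */
static int zeros_and_runs(const int r[16], int *total_zeros, int run_before[16])
{
    int last = 15;
    while (last >= 0 && r[last] == 0)
        last--;                         /* index of the last non-zero coefficient */

    int n = 0, zeros = 0, run = 0;
    for (int i = last; i >= 0; i--) {
        if (r[i] != 0) {
            if (n > 0)
                run_before[n - 1] = run;    /* zeros preceding the previous one */
            n++;
            run = 0;
        } else {
            run++;
            zeros++;
        }
    }
    if (n > 0)
        run_before[n - 1] = run;        /* implied by zeros-left in the bitstream */
    *total_zeros = zeros;
    return n;
}
```

For r[ ] this gives total_zeros = 3 and run_before = 0, 2, 0, 1, 0, 0 in reverse order, which matches the two run_before values quoted in the text; the runs sum to total_zeros.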
The primary operation involved in decoding all of the steps is reading a bit pattern from the bitstream buffer. Treating VLC codes with codeword lengths of more than 16 bits as escape codes (which occur very rarely), the bit FIFO is designed as follows. We use a structure to hold the parameters and data to work with the bit FIFO. The structure contains current_word (the current word in the bit FIFO, which is MSB aligned), bit_pos (the current bit position), and word_count (a pointer or index to the bitstream buffer), as seen in the following:

    typedef struct {
        unsigned int current_word;      // current word of the bit FIFO, MSB aligned
        int bit_pos;                    // current bit position
        int word_count;                 // index into the bitstream buffer
    } CAVLC_t;
    CAVLC_t *pVLC;

At any time, to read n bits (less than or equal to 16) from the bit FIFO, we perform the following steps:

1. Extract n bits from the MSB side of the current_word.
2. Shift left the current word by n bits.
3. Increment the bit_pos by n bits.
4. If the bit FIFO contains less than 16 bits, read the next 16 bits from the buffer.

5.3.2 Simulation of the H.264 VLC Schemes

We use the 16-bit FIFO definition described previously in CAVLC simulations most of the time. For escape codes, we use the 32-bit FIFO discussed in Section 5.2. In this section, we design the look-up tables to efficiently simulate some of the CAVLC functions.

Decoding UVLC Codes

UVLC codes include both FLC and exp-Golomb codes. The FLC is a simple code that reads a fixed number of bits from the bit FIFO. The simulation code for reading a fixed number of bits is given in Pcode 5.11. In computing signed or unsigned exp-Golomb codes, we first compute the Code_Num value. Treating the codes with more than 16 bits as escape codes, we compute Code_Num by scanning for leading zeros in a 16-bit pattern. If the number of leading zeros is m, we extract the next (m + 1) bits from the bitstream, and the pattern looks like [1|m bits]. The Code_Num is given by [1|m bits] − 1. Once we compute Code_Num, the decoded unsigned or signed exp-Golomb code value "v" is obtained from Equations (5.1) and (5.2).
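The decoding rule just described can be exercised on the bitstream of Example 5.3 with a small self-contained sketch. To keep it independent of the bit FIFO, the bits are held in a string of '0' and '1' characters; the reader functions read_bits, read_ue, and read_se are hypothetical stand-ins for u(n), ue(v), and se(v) and are not the book's Pcodes. Malformed input (for example, a run of zeros with no terminating 1) is not handled.

```c
/* Hypothetical string-based bit reader, for illustration only:
   bits is a '0'/'1' string and *pos is the current bit position. */
static unsigned read_bits(const char *bits, int *pos, int n)
{
    unsigned v = 0;
    while (n-- > 0 && bits[*pos] != '\0')
        v = (v << 1) | (unsigned)(bits[(*pos)++] - '0');
    return v;
}

/* ue(v): count m prefix zeros, read [1 | m bits], subtract 1. */
static unsigned read_ue(const char *bits, int *pos)
{
    int m = 0;
    while (bits[*pos] == '0') {
        m++;
        (*pos)++;
    }
    return read_bits(bits, pos, m + 1) - 1u;
}

/* se(v): inverse of Equation (5.2).
   Even Code_Num maps to -Code_Num/2, odd Code_Num to (Code_Num + 1)/2. */
static int read_se(const char *bits, int *pos)
{
    unsigned code_num = read_ue(bits, pos);
    return (code_num & 1u) ? (int)((code_num + 1u) / 2u)
                           : -(int)(code_num / 2u);
}
```

Reading u(3), ue(v), se(v), and then two more ue(v)-style codewords from 011001010001111001100110100 returns 3, 4, −7, 5, and 2, with the bit position advancing to 3, 8, 15, 20, and 23 as in Example 5.3.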
The simulation code for the decoding of unsigned and signed exp-Golomb codes is given in Pcodes 5.12 and 5.13.

    w = (pVLC->current_word)>>(32-n);                   // read n-bits from MSB side
    pVLC->bit_pos = pVLC->bit_pos + n;                  // increment bit position
    pVLC->current_word = pVLC->current_word << n;       // shift left bit FIFO by n-bits
    if (pVLC->bit_pos > 16) {                           // fewer than 16 bits left: refill
        pVLC->bit_pos = pVLC->bit_pos - 16;
        a = bit_stream[pVLC->word_count++];
        a = a << pVLC->bit_pos;
        pVLC->current_word = pVLC->current_word | a;
    }
    return (w);

Pcode 5.11: Simulation code to read n-bits from bitstream buffer.

    w = pVLC->current_word >> 16;                       // consider 16-bits for scanning
    k = 0;
    while ((w & 0x8000) == 0) {
        w = w << 1; k++;                                // obtain prefix zeros
    };
    pVLC->current_word = pVLC->current_word << k;       // skip the k prefix zeros
    w = pVLC->current_word >> (32-k-1);                 // [1|k bits]
    pVLC->current_word = pVLC->current_word << (k+1);   // drop the k+1 bits just read
    pVLC->bit_pos = pVLC->bit_pos + 2*k+1;
    if (pVLC->bit_pos > 16){
        pVLC->bit_pos = pVLC->bit_pos - 16;
        a = bit_stream[pVLC->word_count++];
        a = a << pVLC->bit_pos;
        pVLC->current_word = pVLC->current_word | a;
    }
    return (w-1);

Pcode 5.12: Simulation code for unsigned exp-Golomb code ue(v).

Decoding CAVLC Codes

As most CAVLC functions require context information, we first determine the context and choose the corresponding VLC table to decode the residual coefficients from the bitstream. We use the following functions in decoding residual coefficients.

Coeff_Token (Nonpredictable Bit-Pattern Lengths)

The Coeff_Token represents the total coefficients and trailing 1s present in the zigzag scanned array. We analyze a maximum of 16 bits in decoding the Coeff_Token. Depending on context "nC" and the bit pattern, we read n bits
(here n ranges from 1 to 16) from the bitstream and correlate them with the codewords of the chosen VLC table. We select the minimum length codeword that matches the bitstream, and the associated Coeff_Token is chosen as the decoded total coefficients and trailing 1s. Although the codewords consist of prefix zeros followed by information bits, these codewords are nonprogressive and we do not have any constructive formula to get the number of bits present in a codeword. Thus, we search over bit patterns of all lengths (from 1 to 16 bits) and choose the minimum length codeword that matches the bitstream. However, this kind of search consumes many cycles on embedded processors as it involves many operations. Instead, we design a look-up table that gives the Coeff_Token and the actual number of bits used for the codeword, and thereby we spend a minimum number of cycles in decoding the Coeff_Token.

    w = pVLC->current_word >> 16;                       // consider 16-bits for scanning
    k = 0;
    while ((w & 0x8000) == 0) {
        w = w << 1; k++;                                // obtain prefix zeros
    };
    pVLC->current_word = pVLC->current_word << k;       // skip the k prefix zeros
    w = pVLC->current_word >> (32-k-1);                 // [1|k bits]
    pVLC->current_word = pVLC->current_word << (k+1);   // drop the k+1 bits just read
    pVLC->bit_pos = pVLC->bit_pos + 2*k+1;
    if (pVLC->bit_pos > 16){
        pVLC->bit_pos = pVLC->bit_pos - 16;
        a = bit_stream[pVLC->word_count++];
        a = a << pVLC->bit_pos;
        pVLC->current_word = pVLC->current_word | a;
    }
    if ((w&1) == 1) a = -(w-1)/2;                       // even Code_Num: negative value
    else a = w/2;                                       // odd Code_Num: positive value
    return (a);

Pcode 5.13: Simulation code for signed exp-Golomb code se(v).

For this, we choose one Coeff_Token VLC table, the one for nC less than 2, and obtain the look-up table values as described in the following. The maximum number of prefix zeros present in a codeword of the Coeff_Token VLC table for nC less than 2 is 14. The Coeff_Token codeword looks like [p-zero bits|1|q bits], where 0 ≤ p ≤ 14 and 0 ≤ q ≤ 3. As seen here, we design a look-up table that contains the total coefficients, the trailing 1s, and the actual number of bits used, p + q + 1.
Note that the value of p + q + 1 never exceeds 16, so p + q never exceeds 15 and can be represented with 4 bits. The maximum number of total coefficients is 16 and we use 8 bits to represent it. The maximum number of trailing 1s is 3 and we use 4 bits of the look-up table entry to represent it. A total of 16 bits (or 2 bytes) are used for each entry of the look-up table to hold the total coefficients, trailing 1s and p + q. The general form of a look-up table entry is [4 bits (p + q, i.e., actual bits used minus 1) | 4 bits (trailing 1s) | 8 bits (total coefficients)].

For example, the codeword for Coeff_Token(1,3), which represents three total coefficients and one trailing 1, is 00000110. We have p = 5 prefix zeros and q = 2 bits, giving 8 bits in total for this codeword. With p + q = 7 stored in the top 4 bits, as Pcode 5.14 expects when it computes k = (b >> 12) + 1, the corresponding look-up table entry is 0x7103.

We design the look-up table for the extreme values of p and q so that each entry can be accessed with a unique address. With this, we require 240 (= 15 ∗ 8 ∗ 2) bytes of data memory to store one VLC table of the Coeff_Token for 0 ≤ nC < 2. The look-up table contains 15 segments (to take care of all possible p values) and each segment contains 8 entries (to take care of all possible 3-bit patterns following the first 1 bit). For example, in codeword 00000110 we have only q = 2 information bits, so we append one dummy bit in the design of the look-up table to make sure each segment contains exactly 8 entries. With this, the offset for a particular look-up table entry is given by p ∗ 8 + v, where v is the value of the 3 bits that follow the first 1 bit. The look-up table values of VLC codewords for 0 ≤ nC < 2 are available on the companion website. The simulation code to obtain the Coeff_Token using the look-up table tcto_nc_less_than_2[ ] is given in Pcode 5.14.
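To make the entry layout and the offset rule concrete, here is a minimal C sketch. It assumes the layout [p + q | trailing 1s | total coefficients] described above; the helper names pack_entry, entry_bits_used, entry_t_ones, entry_t_coeffs and lut_offset are hypothetical and are not taken from the book or its companion files.

```c
#include <assert.h>
#include <stdint.h>

/* Pack one entry as [4 bits: p+q | 4 bits: trailing 1s | 8 bits: total coefficients].
   (Assumed layout, following the description in the text.) */
static uint16_t pack_entry(unsigned p_plus_q, unsigned t_ones, unsigned t_coeffs) {
    assert(p_plus_q <= 15 && t_ones <= 3 && t_coeffs <= 16);
    return (uint16_t)((p_plus_q << 12) | (t_ones << 8) | t_coeffs);
}

/* Unpack the three fields; bits used is the stored p+q plus the '1' marker bit. */
static unsigned entry_bits_used(uint16_t e) { return (unsigned)(e >> 12) + 1u; }
static unsigned entry_t_ones(uint16_t e)    { return (unsigned)(e >> 8) & 0xFu; }
static unsigned entry_t_coeffs(uint16_t e)  { return (unsigned)e & 0xFFu; }

/* Offset into the 15 x 8 table from the next 16 bits of the stream (MSB first):
   p = number of lead zeros, then the 3 bits after the first '1'. */
static unsigned lut_offset(uint16_t window) {
    unsigned w = window, p = 0;
    while (p < 14 && (w & 0x8000u) == 0) { w <<= 1; p++; }   /* count lead zeros */
    unsigned v = ((w << 1) & 0xFFFFu) >> 13;                  /* skip '1', take 3 bits */
    return p * 8 + v;
}
```

For the codeword 00000110, the entry packs to 0x7103 and unpacks to 8 bits used, 1 trailing 1 and 3 coefficients. The two windows 0x0600 and 0x0680 differ only in the dummy bit after the codeword and map to the adjacent offsets 44 and 45, which is why both table slots must hold the same entry.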
    w = (pVLC->current_word) >> 16;                  // read 16 bits to w
    p = 0;
    while ((w & 0x8000) == 0) { w = w << 1; p++; }   // scan for lead zeros
    if (nc < 2) {
        q = w << 1;                                  // skip first '1' bit
        q = (q >> 13) & 0x7;                         // keep the next 3 bits (mask needed if w is wider than 16 bits)
        offset = p*8 + q;
        b = tcto_nc_less_than_2[offset];
        k = b >> 12;
        k = k + 1;                                   // p+q+1
        pVLC->bit_pos = pVLC->bit_pos + k;
        pVLC->current_word = pVLC->current_word << k;
        if (pVLC->bit_pos > 16) {                    // bit FIFO
            pVLC->bit_pos = pVLC->bit_pos - 16;
            w = pVLC->buffer_pointer[pVLC->word_count++];
            w = w << pVLC->bit_pos;
            pVLC->current_word = pVLC->current_word | w;
        }
        *t_ones = (b & 0xfff) >> 8;
        *t_coeffs = b & 0xff;
    }

Pcode 5.14: Simulation code for decoding Coeff_Token for nC < 2.

Level Prefix

The format for codewords of LevelPrefix is [(n − 1) zeros | 1] and contains a total of n bits. We treat the codes with n > 16 as escape codes. We scan 16 bits from the bitstream and find the number of prefix zeros. The LevelPrefix is the same as the number of prefix zeros present in the codeword. Depending on the bitstream pattern, we read n bits (where 1 ≤ n ≤ 16 for nonescape codes) and output the corresponding LevelPrefix value. The simulation code for obtaining the LevelPrefix is given in Pcode 5.15.

    w = (pVLC->current_word) >> 16;
    k = 0;
    while ((w & 0x8000) == 0) { w = w << 1; k++; }   // count prefix zeros
    pVLC->bit_pos = pVLC->bit_pos + (k+1);
    pVLC->current_word = pVLC->current_word << (k+1);
    if (pVLC->bit_pos > 16) {                        // bit FIFO
        pVLC->bit_pos = pVLC->bit_pos - 16;
        w = pVLC->buffer_pointer[pVLC->word_count++];
        w = w << pVLC->bit_pos;
        pVLC->current_word = pVLC->current_word | w;
    }
    *len = k;

Pcode 5.15: Simulation code to compute LevelPrefix.

Total Zeros

Like Coeff_Token codewords, the codewords of total_zeros have unpredictable VLC codeword lengths. The general form of a total_zeros codeword is [p-zeros |1/0|q bits], where 0 ≤ p ≤ 8 and 0 ≤ q ≤ 2. The VLC codeword tables of total_zeros are adapted depending on the context, which is the non-zero coefficient count "tc" in a block.
If fewer coefficients are present in a block, then the total number of zeros present between coefficients is also lower. There is no need to compute total_zeros if all the coefficients are present (i.e., the number of non-zero coefficients equals the maximum number of coefficients in a block). If the total coefficients (obtained from the Coeff_Token) are fewer than the maximum coefficients of a block, we select the corresponding codeword table and decode total_zeros using the bit pattern from the bitstream. We use a maximum of 9 bits for decoding total_zeros. Depending on the bitstream pattern and the context (total coefficients), we read n bits (n = 1 to 9) from the bitstream and output the corresponding total_zeros value.

We design a look-up table to perform the total_zeros computation as follows. The general form of a look-up entry w, for decoding 4×4 luma block total_zeros only, is

w = [4 bits (actual number of bits used, maximum value 9) | 4 bits (total zeros present, maximum value 15)]

The look-up table contains a total of 15 segments (one per context tc = 1 to 15) and each segment contains 36 entries. A particular entry of a 36-entry segment is accessed using p and q, where p is the number of lead zeros and q is the value of the information bits of the codeword. The offset to access the look-up table entry is

offset = (tc − 1) ∗ 36 + p ∗ 4 + q

The look-up table total_zeros_luma[ ] values for a 4×4 luma block can be found on the website. The simulation code to compute total_zeros for a 4×4 luma block is given in Pcode 5.16.
    w = (pVLC->current_word) >> 23;                  // read 9 bits to w
    p = 0;
    while ((w & 0x0100) == 0) { w = w << 1; p++; }   // scan for lead zeros
    q = w << 1;                                      // skip first '1' bit
    q = (q >> 7) & 0x3;                              // keep the next 2 bits
    offset = (t_coeffs-1)*36 + p*4 + q;              // t_coeffs: non-zero coefficients of a 4x4 luma block
    b = total_zeros_luma[offset];
    k = b >> 4;                                      // actual bits used
    pVLC->bit_pos = pVLC->bit_pos + k;
    pVLC->current_word = pVLC->current_word << k;
    if (pVLC->bit_pos > 16) {
        pVLC->bit_pos = pVLC->bit_pos - 16;
        w = pVLC->buffer_pointer[pVLC->word_count++];
        w = w << pVLC->bit_pos;
        pVLC->current_word = pVLC->current_word | w;
    }
    t_zeros = b & 0xf;
    return (t_zeros);

Pcode 5.16: Simulation code to compute total_zeros for a 4×4 luma block.

Run Before

We read a maximum of 11 bits in decoding run_before. Depending on the bitstream pattern and the context (zeros left), we read n bits (n = 1 to 11) from the bitstream and output the corresponding run_before value. We use a look-up table to decode run_before, designed as follows. The look-up table entry w looks like

w = [4 bits (actual bits used, maximum value 3 without escape codes) | 4 bits (run_before)]

If zeros-left is greater than 6 and the number of lead zeros is greater than 2, we treat the code as an escape code. With this, scanning 3 bits of information from the bitstream is sufficient to decode run_before for nonescape codes. The look-up table contains a total of 7 segments (corresponding to 7 contexts) and each segment contains 8 entries. The look-up table entry for escape codes is zero, as highlighted with a bold number in the file on the companion website. The offset for the look-up table is calculated as follows, with zeros_left clamped to 7 when it exceeds 6:

offset = (zeros_left − 1) ∗ 8 + value (of 3 bits read from the bitstream)

The simulation code to decode run_before is given in Pcode 5.17; see the website for the look-up table runbefore[ ] values for decoding run_before with nonescape codes. The individual functions of the CAVLC involved in decoding residual coefficients have now been discussed.
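The escape handling for run_before can be sketched without the companion look-up table. The following C fragment covers only the zeros-left > 6 context and assumes the standard H.264 code assignment for that context: the 3-bit codes 111 down to 001 map to run_before 0 to 6, and longer codes consist of z ≥ 3 lead zeros followed by a 1 and map to run_before z + 4. The function name run_before_zl_gt6 is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Decode run_before for the zeros-left > 6 context.
   window11 holds the next 11 bits of the stream, MSB first (bit 10 is read first).
   Returns run_before and stores the codeword length in *bits_used. */
static int run_before_zl_gt6(uint32_t window11, int *bits_used) {
    assert(window11 != 0 && window11 < (1u << 11));   /* a valid code contains a '1' */
    uint32_t top3 = window11 >> 8;                    /* first 3 bits */
    if (top3 != 0) {                                  /* nonescape: 111 -> 0 ... 001 -> 6 */
        *bits_used = 3;
        return 7 - (int)top3;
    }
    int zeros = 3;                                    /* escape: keep counting lead zeros */
    while ((window11 & (0x400u >> zeros)) == 0) zeros++;
    *bits_used = zeros + 1;                           /* lead zeros plus the closing '1' */
    return zeros + 4;                                 /* 0001 -> 7, 00001 -> 8, ... */
}
```

In a table-driven decoder the nonescape branch would be replaced by the 8-entry segment of the look-up table, and only the escape branch would need the lead-zero scan.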
The simulation code for the overall parsing process in decoding a block of residual coefficients is given in Pcodes 5.18 and 5.19.

    j = zeros_left;
    w = (pVLC->current_word) >> 29;                  // read 3 bits
    if (j > 6) j = 7;                                // clamp the context
    offset = (j-1)*8 + w;
    a = runbefore[offset];
    k = a >> 4;                                      // actual bits used
    rb = a & 0xf;                                    // run_before
    if (j == 7) {
        if (a == 0) {                                // escape code: three zeros, more zeros, then a '1'
            w = (pVLC->current_word) >> 21;          // scan the next 11 bits
            k = 1;
            while ((w & 0x400) == 0) { w = w << 1; k++; }  // k = lead zeros + 1 = codeword length
            rb = k + 3;                              // run_before = lead zeros + 4 (7 for code 0001)
        }
    }
    pVLC->bit_pos = pVLC->bit_pos + k;
    pVLC->current_word = pVLC->current_word << k;
    if (pVLC->bit_pos > 16) {
        pVLC->bit_pos = pVLC->bit_pos - 16;
        w = pVLC->buffer_pointer[pVLC->word_count++];
        w = w << pVLC->bit_pos;
        pVLC->current_word = pVLC->current_word | w;
    }
    return (rb);

Pcode 5.17: Simulation code to decode run_before.

    // decode total coefficients and trailing ones present in a 4x4 subblock
    decode_tcoeffs_tones(pVLC, nc, &tcoeffs, &tones);
    if (tcoeffs != 0) {
        // decode sign information for trailing 1s
        k = 0; max_coeffs = 16;
        // initialize the local coefficient buffer to zero
        for (i = 0; i < max_coeffs; i++) coeff_buf[i] = 0;   // (assumed) loop body lost in the source listing
        for (i = 0; i < tones; i++) {                        // (assumed) one sign bit per trailing 1, 1 = negative
            if (read_bits(pVLC, 1)) buf[i] = -1; else buf[i] = 1;
        }
        k = tones;                                           // (assumed) levels follow the trailing 1s
        if (tcoeffs > tones) {                               // (assumed) opening of the level-decoding block
            suffix_length = 0;                               // (assumed)
            if ((tcoeffs > 10) && (tones < 3)) suffix_length = 1;
            for (i = k; i < tcoeffs; i++) {
                level_prefix(pVLC, &prefix_length);          // decode level prefix
                level_suffix_size = suffix_length;           // determine level suffix size
                if ((prefix_length == 14) && (suffix_length == 0)) level_suffix_size = 4;
                if (prefix_length >= 15) level_suffix_size = prefix_length - 3;
                if (level_suffix_size == 0) level_suffix = 0;    // decode level suffix
                else level_suffix = read_bits(pVLC, level_suffix_size);
                tmp1 = (prefix_length < 15) ? prefix_length : 15;
                tmp1 = tmp1 << suffix_length;
                level_code = tmp1 + level_suffix;            // determine level code
                if ((prefix_length >= 15) && (suffix_length == 0)) level_code += 15;
                if (prefix_length >= 16) {
                    tmp2 = (1<<(prefix_length-3))-4096;
                    level_code = level_code + tmp2;
                }
                if ((i==tones) && (tones < 3)) level_code = level_code + 2;
                if ((level_code & 1) == 0) buf[i] = (level_code+2)>>1;
                else buf[i] = (-level_code-1)>>1;
                if (suffix_length == 0) suffix_length = 1;
                if (abs(buf[i]) > sufvlc[suffix_length]) suffix_length += 1;
            }
        }
        // Continued in Pcode 5.19

Pcode 5.18: Parsing process for decoding a block of residual coefficients.
        if (max_coeffs > tcoeffs) {                  // decode total zeros
            t_zeros = total_zeros(pVLC, tcoeffs);
            if (t_zeros != 0) {
                k = tcoeffs + t_zeros - 1;           // position of the highest-frequency coefficient
                for (i = 0; i < tcoeffs-1; i++) {
                    if (t_zeros > 0) {               // decode run before
                        coeff_buf[k] = buf[i];       // store the levels
                        rb = run_before(pVLC, t_zeros);
                        k = k - (rb + 1);
                        t_zeros = t_zeros - rb;
                    }
                    else {
                        coeff_buf[k] = buf[i];
                        k = k - 1;
                    }
                }
                coeff_buf[t_zeros] = buf[i];         // last coefficient sits after the remaining zeros
            }
            else {
                for (i = 0; i < tcoeffs; i++) coeff_buf[i] = buf[tcoeffs-1-i];
            }
        }
        else {
            for (i = 0; i < tcoeffs; i++) coeff_buf[i] = buf[tcoeffs-1-i];
        }
    }

Pcode 5.19: Parsing process for decoding a block of residual coefficients.

5.3.3 H.264 CAVLC Simulation Results

In this section, we present the simulation results for the H.264 CAVLC used to decode residual coefficients. We consider the decoding of a few luma 4×4 block residual coefficients with the following received bitstream.

    bit_stream_buffer[] = {
        0x74f0, 0x696a, 0x07f9, 0x8bd9, 0xe234, 0x4af6, 0x462c, 0xd89f,
        0x3736, 0x0924, 0x1f01, 0x233c, 0xf458, 0x1bc1, 0x064a, 0xf879};

Next, we present the intermediate results (including Coeff_Token, trailing 1 sign, signed levels, total_zeros and run_before) for the decoding process of multiple 4×4 luma block residual coefficients. The updated bit FIFO parameters {pVLC->current_word, pVLC->bit_pos, pVLC->word_count} are shown whenever the FIFO is accessed to read bits.
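As a hedged illustration of how these FIFO triples evolve, the following minimal C sketch models the 32-bit bit FIFO with 16-bit refills in the style of Pcode 5.11. The type name BitFifo and the functions fifo_init and fifo_read are hypothetical names chosen for this sketch. It assumes the stream is stored as 16-bit words and performs no bounds checking on the buffer.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t current_word;    /* 32-bit FIFO; the next bit to read is the MSB */
    int bit_pos;              /* bits consumed since the last 16-bit refill */
    int word_count;           /* index of the next 16-bit word to load */
    const uint16_t *buffer;   /* bitstream stored as 16-bit words */
} BitFifo;

/* Preload the FIFO with the first two 16-bit words of the stream. */
static void fifo_init(BitFifo *f, const uint16_t *buffer) {
    f->buffer = buffer;
    f->current_word = ((uint32_t)buffer[0] << 16) | buffer[1];
    f->bit_pos = 0;
    f->word_count = 2;
}

/* Read n bits (1 <= n <= 16) from the MSB side and refill when more than
   16 bits have been consumed, as in Pcode 5.11. */
static uint32_t fifo_read(BitFifo *f, int n) {
    assert(n >= 1 && n <= 16);
    uint32_t w = f->current_word >> (32 - n);
    f->current_word <<= n;
    f->bit_pos += n;
    if (f->bit_pos > 16) {
        f->bit_pos -= 16;
        f->current_word |= (uint32_t)f->buffer[f->word_count++] << f->bit_pos;
    }
    return w;
}
```

Initialized with the bitstream above, reading 2, 1 and 1 bits leaves the FIFO at {0xd3c1a5a8, 2, 2}, {0xa7834b50, 3, 2} and {0x4f0696a0, 4, 2}. The first refill happens when an 8-bit read moves bit_pos from 12 to 20, which wraps to 4 and merges 0x07f9 << 4 into the word, giving {0x96a07f90, 4, 3}.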
Initialization
    FIFO: {0x74f0696a, 0, 2}

First luma 4x4 subblock
--> Total coefficients and trailing 1s: Coeff_Token (t_coeffs, t_1s)
    Context: nC = 0    Coeff_Token: (1, 1)    Bits used: 2    FIFO: {0xd3c1a5a8, 2, 2}
--> Trailing 1s sign information
    sign: -ve    Bits used: 1    FIFO: {0xa7834b50, 3, 2}
--> No levels to decode
--> Total zeros information
    Context: 1 (t_coeffs)    total_zeros: 0    Bits used: 1    FIFO: {0x4f0696a0, 4, 2}
--> No run before to decode
--> Output: [-1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Second luma 4x4 subblock
--> Total coefficients and trailing 1s:
    Context: nC = 1    Coeff_Token: (1, 1)    Bits used: 2    FIFO: {0x3c1a5a80, 6, 2}
--> Trailing 1s sign information
    sign: +ve    Bits used: 1    FIFO: {0x7834b500, 7, 2}
--> No levels to decode
--> Total zeros information
    Context: 1 (t_coeffs)    total_zeros: 1    Bits used: 3    FIFO: {0xc1a5a800, 10, 2}
--> No run before to decode
--> Output: [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Third luma 4x4 subblock
--> Total coefficients and trailing 1s:
    Context: nC = 0    Coeff_Token: (0, 0)    Bits used: 1    FIFO: {0x834b5000, 11, 2}
--> No trailing 1s sign information to decode
--> No levels to decode
--> No total zeros information to decode
--> Output: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Fourth luma 4x4 subblock
--> Total coefficients and trailing 1s:
    Context: nC = 1    Coeff_Token: (0, 0)    Bits used: 1    FIFO: {0x0696a000, 12, 2}
--> No trailing 1s sign information to decode
--> No levels to decode
--> No total zeros information to decode
--> Output: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Fifth luma 4x4 subblock
--> Total coefficients and trailing 1s:
    Context: nC = 0    Coeff_Token: (3, 1)    Bits used: 8    FIFO: {0x96a07f90, 4, 3}
--> Trailing 1s sign information
    sign: -ve    Bits used: 1    FIFO: {0x2d40ff20, 5, 3}
--> Levels to decode: 2
    First level
    - suffix_length = 0
    - Level prefix    prefix_length: 2    Bits used: 3    FIFO: {0x6a07f900, 8, 3}
    - level_suffix_size = 0    level_suffix: 0
    - level_code = 4
    - coeff = 3
    Second level
    - suffix_length = 1
    - Level prefix    prefix_length: 1    Bits used: 2    FIFO: {0xa81fe400, 10, 3}
    - level_suffix_size = 1    level_suffix: 1    Bits used: 1    FIFO: {0x503fc800, 11, 3}
    - level_code = 3
    - coeff = -2
--> Total zeros information
    Context: 3 (t_coeffs)    total_zeros: 0    Bits used: 4    FIFO: {0x03fc8000, 15, 3}
--> No run before to decode
--> Output: [-2, 3, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Sixth luma 4x4 subblock
--> Total coefficients and trailing 1s:
    Context: nC = 0    Coeff_Token: (3, 0)    Bits used: 9    FIFO: {0xf98bd900, 8, 4}
--> No trailing 1s sign information to decode
--> Levels to decode: 3
    First level
    - suffix_length = 0
    - Level prefix    prefix_length: 0    Bits used: 1    FIFO: {0xf317b200, 9, 4}
    - level_suffix_size = 0    level_suffix: 0
    - level_code = 2
    - coeff = 2
    Second level
    - suffix_length = 1
    - Level prefix    prefix_length: 0    Bits used: 1    FIFO: {0xe62f6400, 10, 4}
    - level_suffix_size = 1    level_suffix: 1    Bits used: 1    FIFO: {0xcc5ec800, 11, 4}
    - level_code = 1
    - coeff = -1
    Third level
    - suffix_length = 1
    - Level prefix    prefix_length: 0    Bits used: 1    FIFO: {0x98bd9000, 12, 4}
    - level_suffix_size = 1    level_suffix: 1    Bits used: 1    FIFO: {0x317b2000, 13, 4}
    - level_code = 1
    - coeff = -1
--> Total zeros information
    Context: 3 (t_coeffs)    total_zeros: 5    Bits used: 4    FIFO: {0x17b3c468, 1, 5}
--> Run before
    Context: 5 (zeros-left)    run_before: 5    Bits used: 3    FIFO: {0xbd9e2340, 4, 5}
--> Output: [-1, -1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0]

5.3.4 H.264 CAVLC Optimization Techniques

In this section, we discuss the computational complexity of the H.264 VLC and the optimization techniques for the parsing process of residual decoding. We estimate the computational complexity of the H.264 VLC in terms of clock cycles and memory used.

H.264 VLC Computational Complexity

As we discussed in Section 5.3.3, the simulation of the H.264 VLC involves many bit FIFO accesses and conditional jumps.
With bit FIFO accesses, we have two cases: (1) updating only the FIFO parameters and (2) reading bits from the bitstream buffer in addition to updating the FIFO parameters. We check whether the number of bits present in the FIFO is less than 16, and conditionally jump to read bits from the bitstream buffer into the FIFO if it is. If we are not reading bits from the bitstream buffer, then we consume only 4 cycles (2 cycles for the FIFO update and 2 cycles for the conditional check and the jump decision) to update the bit FIFO on the reference embedded processor, since the conditional jump is not taken. See Appendix A, Section A.4, on the companion website for more details on cycle estimation on the reference embedded processor. If fewer than 16 bits are present in the FIFO, then we jump to read bits from the bitstream buffer and jump back to continue the decoding. In this case we consume about 20 cycles. On average, we may read bits from the bitstream buffer once in four FIFO accesses. Hence, we consume on average about (3 ∗ 4 + 20)/4 = 8 cycles to access the bit FIFO instead of 13 cycles as in the MPEG-2 32-bit FIFO discussed in Section 5.2.

UVLC Computational Complexity

The three UVLC functions u(n), ue(v), and se(v) access the bit FIFO, and this is the major cycle-consuming portion of the code. As seen in Pcode 5.11, the function u(n) consists only of bit extraction and bit FIFO update functionality, and its average consumption is about 9 cycles. The other two functions, the unsigned exp-Golomb code ue(v) and the signed exp-Golomb code se(v), include a lead-zero computation, which can be achieved in 2 cycles on the reference embedded processor. In addition, we make a small adjustment to the value read from the FIFO to get the final Code_Num. On average, we consume about 12 and 15 cycles on the reference embedded processor to perform the exp-Golomb code functions ue(v) and se(v), respectively.
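The lead-zero computation and the final adjustment can be illustrated in portable C. This is a minimal sketch that decodes from a 32-bit window rather than from the book's bit FIFO; the helper names lead_zeros32, ue_from_window and se_from_code_num are hypothetical, and the loop in lead_zeros32 stands in for whatever single-instruction lead-zero count the target processor offers.

```c
#include <assert.h>
#include <stdint.h>

/* Count lead zeros of a 32-bit word (portable stand-in for a hardware instruction). */
static int lead_zeros32(uint32_t x) {
    int n = 0;
    while (n < 32 && (x & 0x80000000u) == 0) { x <<= 1; n++; }
    return n;
}

/* Decode ue(v) from a window whose MSB is the next stream bit.
   The codeword is k zeros, a '1', then k info bits: 2k+1 bits in total. */
static uint32_t ue_from_window(uint32_t window, int *bits_used) {
    int k = lead_zeros32(window);
    assert(k <= 15);                          /* codeword must fit in the window */
    *bits_used = 2 * k + 1;
    return (window >> (31 - 2 * k)) - 1;      /* top 2k+1 bits, minus 1 = Code_Num */
}

/* Map Code_Num to the signed value: 0, 1, -1, 2, -2, ... */
static int32_t se_from_code_num(uint32_t code_num) {
    int32_t magnitude = (int32_t)((code_num + 1) >> 1);
    return (code_num & 1) ? magnitude : -magnitude;
}
```

For example, the bit pattern 00111 at the top of the window gives k = 2, five bits used and Code_Num 6, which se(v) maps to -3.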
CAVLC Computational Complexity

The CAVLC cycle estimation for decoding residual coefficients is a difficult task since it involves many contexts, functions, and jumps. We first estimate the cycle cost and memory consumption of individual CAVLC functions and then estimate the overall complexity.

Total Coefficients and Trailing 1s

In computing the Coeff_Token, we have 6 VLC tables to choose from depending on context and luma or chroma blocks. For this, we require about 1.2 kB of data memory to store all look-up tables of Coeff_Token VLC codewords. In Coeff_Token computation, we have the following steps:

1. Choose the codeword table depending on context and luma or chroma blocks (2 cycles for choosing the VLC table with an offset)
2. Scan bits and obtain lead zeros (3 cycles)
3. Offset computation and look-up table accesses (4 cycles)
4. Extract total coefficients, trailing 1s and actual bits used information (3 cycles)
5. Bit FIFO access (8 cycles)

With this, we may consume about 20 cycles to compute the Coeff_Token on the reference embedded processor.

Trailing 1s Sign Computation

Computing the sign of the trailing 1s involves only a bit FIFO access and a decision on the sign depending on whether bit "0" or "1" is read from the FIFO. We consume about 10 cycles to get the sign information for a single trailing 1.

Level Prefix

Computation of the level prefix (i.e., prefix_length) involves the following two steps: scanning bits and obtaining lead zeros (3 cycles), and bit FIFO access (8 cycles). With this, we consume about 11 cycles to compute the level prefix for decoding 1 level.

Level Suffix

If level_suffix_size is not zero, we access the bit FIFO to get the level_suffix value; otherwise the level_suffix value is set to zero. As a conditional check and jump are involved whenever we do not access the bit FIFO, we consume about 10 cycles either way to compute level_suffix.
Total Zeros

The total_zeros computation involves multiple VLC tables to choose from depending on context (here the context is total coefficients). We require about 0.6 kB of data memory to store the look-up table to compute the total_zeros present between all non-zero coefficients of a block. In total_zeros computation, we have the following steps:

1. Choose the codeword table depending on context and luma or chroma blocks (2 cycles for choosing the VLC table with an offset)
2. Scan bits and obtain lead zeros (3 cycles)
3. Offset computation and look-up table accesses (4 cycles)
4. Extract total_zeros and actual bits used information (2 cycles)
5. Bit FIFO access (8 cycles)

With this, we may consume about 19 cycles to compute total_zeros on the reference embedded processor.

Run Before

We use 56 bytes of memory to store look-up tables in the run_before computation. We have the following steps in the run_before computation:

1. Adjust context (2 cycles)
2. Scan bits, offset computation, and look-up table access (5 cycles)
3. Escape code handling (2 cycles)
4. Extract run_before and the actual number of bits used (2 cycles)
5. Bit FIFO access (8 cycles)

With this, we consume about 19 cycles in computing run_before.

Parsing Residual Decoding Process

The parsing of residual decoding is a complex process, as given in Pcodes 5.18 and 5.19. In some cases we may obtain the Coeff_Token for the residual block as (0, 0), in which case we do not perform the rest of the functions, as no coefficients are present in that residual block, and we consume about 30 cycles. In some cases we may have only trailing 1s and so we do not perform level decoding. If we have trailing 1s, we consume another 10 cycles per trailing 1 sign computation. In other cases, we may have more non-zero coefficients to decode. As given in Pcode 5.18, we have the following steps in decoding one non-zero coefficient:

1. Determine suffix_length (4 cycles)
2. Determine level_suffix_size (8 cycles)
3. Compute prefix_length (11 cycles)
4. Compute level_suffix (10 cycles)
5. Compute level_code (17 cycles)
6. Determine signed coefficient from level_code (3 cycles)
7. Update suffix_length (3 cycles)

Apart from this, we perform the total_zeros computation and the run_before computation to store coefficients as given in Pcode 5.19. If the total coefficient count is equal to the maximum coefficients, we do not perform the total_zeros and run_before operations and we skip (10 cycles) these two operations. Otherwise, we consume 20 cycles for the total_zeros computation and 25 cycles per coefficient to perform run_before and to store that coefficient (following zig-zag/field scan rules). If zeros-left is zero, then we do not perform run_before and we skip (10 cycles) the run_before function in this particular case.

As seen in the previous cycle estimate, we consume about 56 cycles to decode one coefficient and 25 cycles to store that coefficient using run_before. With this, if we have three coefficients (a trailing 1 and two coefficients) in a 4×4 residual block, we may consume about 217 cycles: 20 cycles for Coeff_Token, 10 cycles for the trailing 1 sign, 112 cycles for decoding two coefficients, and 75 cycles for storing three coefficients (20 cycles for total_zeros and 55 cycles for run_before and other operations). This is about 13.5 cycles/pixel, as we have a total of 16 pixels in a 4×4 block. Although the CAVLC decoding of a coefficient is costly in terms of cycles, the average cycles per pixel will be small because the number of non-zero coefficients per block is small. We see fewer than three or four coefficients per 4×4 residual block most of the time with the D1 frame size at the 1-Mbps bit rate. Therefore, we consume about 10 cycles/pixel on average to decode the residual coefficients of D1 video frames at 1 Mbps using the CAVLC.
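The arithmetic behind the 217-cycle figure can be written down as a small calculator. This is only a sketch of the estimate quoted above, not a model of a real processor: the per-step constants are the ones stated in the text, the function names block_cycles and cycles_per_pixel are hypothetical, and the cost of run_before plus storing is passed in as a parameter because the text gives it only for the three-coefficient example (55 cycles).

```c
#include <assert.h>

/* Per-step cycle estimates quoted in the text for the reference embedded processor. */
enum {
    CYC_COEFF_TOKEN = 20,
    CYC_T1_SIGN     = 10,   /* per trailing 1 */
    CYC_LEVEL       = 56,   /* 4 + 8 + 11 + 10 + 17 + 3 + 3 per non-trailing coefficient */
    CYC_TOTAL_ZEROS = 20
};

/* Estimated cycles for one 4x4 block that is not full (total_zeros is decoded).
   run_store_cycles: run_before and storing cost for the whole block (assumed input). */
static int block_cycles(int t_coeffs, int t_ones, int run_store_cycles) {
    int levels = t_coeffs - t_ones;   /* coefficients that need full level decoding */
    return CYC_COEFF_TOKEN + CYC_T1_SIGN * t_ones + CYC_LEVEL * levels
         + CYC_TOTAL_ZEROS + run_store_cycles;
}

/* A 4x4 block covers 16 pixels. */
static double cycles_per_pixel(int cycles) { return cycles / 16.0; }
```

With three coefficients, one trailing 1 and 55 cycles for run_before and storing, block_cycles returns 217, or 13.5625 cycles per pixel, which the text rounds to 13.5.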
Optimization of the H.264 Parsing Process for CAVLC

In this section, we discuss some optimization techniques to reduce the cycle cost of the residual decoding process using the CAVLC. Unlike the MPEG-2 VLC, where we do not have any contexts and can decode multiple symbols in a single FIFO access, H.264 CAVLC decoding involves many contexts and it is very difficult to decode more than one coefficient at a time. However, we can optimize the CAVLC flow by avoiding conditional flow wherever possible and by reducing the bit FIFO accesses whenever no context is needed to choose a particular VLC table from multiple tables. In decoding signed level information especially, we have many conditional checks, as we handle all possible rarely occurring data paths with one flow. If we separate the loop into two parts by treating prefix_length > 13 as an escape code, then we can avoid many conditional checks and conditional moves. This optimized data flow is given in Pcode 5.20.

In the case of computing the sign of trailing 1s, we access the bit FIFO three times if we have three trailing 1s, as given in Pcode 5.18. Instead, we can read 3 bits from the FIFO into a register in one access and then extract the individual bits from the register in the loop, as there is no context information in decoding the trailing 1s sign information. In this way we save 50% of the cycles in trailing 1s sign computation. In other words, we consume less than 15 cycles to get the sign information even if we have two or more trailing 1s.

In addition, in computing the signed level using Pcode 5.20, we do not use any external context information in decoding prefix_length or suffix_level other than the updated suffix_length (t) for decoding suffix_level. Using six look-up tables (T1 to T6), we can minimize the cycle cost of signed level computation. The six look-up tables are designed based on the following rules.
When prefix_length < 15,

    level = ((prefix_length << (t-1)) + 1 + suffix_level) * sign

where suffix_level is an unsigned value of (t − 1) bits, the sign bit follows the (t − 1) suffix bits (for t = 1 the sign bit comes directly after the "1" bit), and t is equal to the "n" in "Tn." In this case, the codeword looks like [prefix zeros][1][suffix bits][sign].

When prefix_length = 15,

    level = ((15 << (t-1)) + 1 + 11_suffix_bits) * sign
    codeword = [0000 0000 0000 000][1][11 suffix bits][sign]

    for (i = k; i < tcoeffs; i++) {
        level_prefix(pVLC, &prefix_length);              // decode level prefix
        if (prefix_length < 14) {                        // common case
            if (suffix_length == 0) level_suffix = 0;    // decode level suffix
            else level_suffix = read_bits(pVLC, suffix_length);
            tmp1 = prefix_length << suffix_length;
            level_code = tmp1 + level_suffix;            // determine level code
            if ((i==tones) && (tones < 3)) level_code = level_code + 2;
            if ((level_code & 1) == 0) buf[i] = (level_code+2)>>1;
            else buf[i] = (-level_code-1)>>1;
            if (suffix_length == 0) suffix_length = 1;
            if (abs(buf[i]) > sufvlc[suffix_length]) suffix_length += 1;
        }
        else {                                           // escape
            level_suffix_size = suffix_length;           // determine level suffix size
            if ((prefix_length == 14) && (suffix_length == 0)) level_suffix_size = 4;
            if (prefix_length >= 15) level_suffix_size = prefix_length - 3;
            if (level_suffix_size == 0) level_suffix = 0;    // decode level suffix
            else level_suffix = read_bits(pVLC, level_suffix_size);
            tmp1 = (prefix_length < 15) ? prefix_length : 15;
            tmp1 = tmp1 << suffix_length;
            level_code = tmp1 + level_suffix;            // determine level code
            if ((prefix_length >= 15) && (suffix_length == 0)) level_code += 15;
            if (prefix_length >= 16) {
                tmp2 = (1<<(prefix_length-3))-4096;
                level_code = level_code + tmp2;
            }
            if ((i==tones) && (tones < 3)) level_code = level_code + 2;
            if ((level_code & 1) == 0) buf[i] = (level_code+2)>>1;
            else buf[i] = (-level_code-1)>>1;
            if (suffix_length == 0) suffix_length = 1;
            if (abs(buf[i]) > sufvlc[suffix_length]) suffix_length += 1;
        }
    }

Pcode 5.20: Optimization of signed level decoding process.
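The closed-form rule for prefix_length < 15 can be checked against the level_code path of Pcode 5.20. The sketch below is a hedged illustration: the function names level_from_code and level_from_rule are hypothetical, the rule is applied only for t ≥ 1, a sign bit of 1 is taken to mean a negative level, and the "+2" adjustment for the first level after fewer than three trailing 1s is deliberately left out of both paths. Division by 2 is used for the negative case instead of a right shift so that the result does not depend on how the compiler shifts negative values.

```c
#include <assert.h>

/* Level via level_code, as in the loop body of Pcode 5.20 (without the +2 adjustment).
   level_suffix holds the t bits read after the prefix: (t-1) suffix bits, then the sign bit. */
static int level_from_code(int prefix_length, int t, int level_suffix) {
    int level_code = (prefix_length << t) + level_suffix;
    if ((level_code & 1) == 0) return (level_code + 2) / 2;
    return (-level_code - 1) / 2;       /* exact: -level_code - 1 is even here */
}

/* Level via the closed-form rule for prefix_length < 15 and t >= 1. */
static int level_from_rule(int prefix_length, int t, int suffix_level, int sign_bit) {
    int magnitude = (prefix_length << (t - 1)) + 1 + suffix_level;
    return sign_bit ? -magnitude : magnitude;
}
```

Both functions agree for every prefix_length from 0 to 14, every t from 1 to 6 and every suffix value, which is what allows the level to be read directly from a table indexed by the prefix and suffix bits.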
The tables are updated (i.e., local context adaptation) as follows. Initially, t is set to zero, except when (total_coeffs > 10) and (t_ones < 3), in which case t is set to 1. Afterwards, t is updated after each decoded level: if (abs(level) > C[t]), then t = t + 1, where level is the decoded non-zero coefficient and C[] = {0,3,6,12,24,48,32768}. When t = 0, the level is decoded as follows:

1. When (prefix_length < 14),
       level = [(prefix_length + 2)>>1] * (-1)^prefix_length
2. When (prefix_length = 14),
       level = [(prefix_length + 2)>>1 + 3 suffix bits] * sign
       codeword = [prefix zeros][1][3 suffix bits][sign]
3. When (prefix_length = 15),
       level = [(prefix_length + 1) + 11 suffix bits] * sign
       codeword = [prefix zeros][1][11 suffix bits][sign]

With this optimization technique, we consume about 6 cycles/pixel on average to decode the residual coefficients of D1 video frames at 1 Mbps using the CAVLC.

5.4 MQ-Decoder

The JPEG 2000 standard (ISO and ITU JPEG2000, 2000) uses the MQ-coder for entropy coding to compress and decompress the data stream. In this section, we discuss the overview, simulation and implementation of the MQ-decoder. All the notations used are similar to the JPEG 2000 standard notations.

5.4.1 MQ Coder Overview

The MQ-coder is a context-based binary arithmetic coder. The basic parameters of the MQ-coder are the interval range (A), the code value (C), the context parameters (Icx, MPScx) and the bit counter (CT). In the MQ-coder, unlike the general binary arithmetic coder, we do not have multiplications or divisions to perform interval subdivision. The interval subdivision into the least probable symbol (LPS) subinterval and the most probable symbol (MPS) subinterval is achieved using a look-up table with the given probability state Icx, which is obtained from the context model. The value of range A is always kept in the interval [0.75, 1.5).
Because the value of A is of the order of unity, this allows a simple approximation of the following interval subdivision calculations for a given probability value "Qe":

    MPS subinterval = A − (A ∗ Qe) ≈ A − Qe
    LPS subinterval = A ∗ Qe ≈ Qe

The subinterval value for the LPS is obtained from the look-up table. Whenever the value of A falls below 0.75 (or the equivalent fixed-point value of 0x8000), both A and C are renormalized to keep the value of A around unity for the next subinterval division approximation.

A few applications of JPEG 2000 include digital photography, optical drives, digital cinema (motion JPEG), the Internet, and so on. Similar to the MQ-coder in the JPEG 2000 standard, the H.264/AVC standard uses a variant of the M-coder known as the context-based adaptive binary arithmetic coder (CABAC). The H.264 arithmetic coder is simpler than the MQ-coder. The MQ-coder performs well when compared to VLCs, with bit savings of about 10% more, whereas the H.264 arithmetic coder improves on the MQ-coder by 15 to 20% in throughput and by 2 to 5% in bit savings.

In this section, assuming the availability of an MQ-coder-encoded bitstream, we discuss bitstream decoding using the MQ-decoder. As shown in Figure 5.9, the MQ-decoder consists of many ALU operations, look-up table accesses and conditional jumps. The flow of the MQ-decoder is somewhat similar to the CABAC flow, which we will discuss in Section 5.5. As in the CABAC, we can divide the MQ-decoder into three parts:

• Interval subdivision
• Parameter updating
• Normalization

Each part contains many steps, as shown in Figure 5.9 with the numbers in the circles. In steps 1 and 2, we perform the interval subdivision. In interval subdivision, we get the LPS subinterval range from the look-up table using the offset obtained from the context model. Then we obtain the MPS subinterval by subtracting the LPS subinterval from A.
Depending on the code value C, the LPS subinterval QeIcx and the MPS subinterval A, we continue with either an LPS decoding path or an MPS decoding path to update the parameters. We use steps 3 to 11 to update the parameters. In parameter updating, we update the code value (in the MSB halfword of C) and the context parameters, and we compute the decision D. We perform the renormalization process with steps 12 to 14 (not all at a time). With the renormalization process, we make sure that the value of A falls into the range [0.75, 1.5). During renormalization, we append the bits from the bitstream to the code value C (from the LSB side). Like the CABAC, the renormalization of the MQ-coder is also a multi-iterative process. The decoded binary decision is given by the value D. We discuss the simulation details of the MQ-decoder in the following sections.

5.4.2 JPEG 2000 MQ-Decoder Simulation

The basic input and output parameters required for simulation of the JPEG 2000 arithmetic decoder are range (A), value (C), contexts (Icx, MPScx), bit counter (CT), compressed data (Dat) and output decision (D). The following structure is used in the simulation of the MQ-decoder.

    typedef struct jad_tag {
        int A;
        int C;
        int CT;
        int Icx;
        int MPScx;
        unsigned char *BP;
        int D;
    } JpegArtDec_t;
    JpegArtDec_t JBA, *pJBA;

Figure 5.9: Flow chart diagram of JPEG 2000 MQ-decoder.
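The three parts of the decoder (interval subdivision, parameter updating and renormalization) can be put together in one compact sketch. The following C code is a minimal single-context illustration and not the book's implementation. The names MqDec, mq_bytein, mq_init, mq_renorm and mq_decode are hypothetical, the tables are the Qe, NMPS, NLPS and switch tables of the JPEG 2000 MQ-coder as listed in this section, and the caller chooses the starting byte, the initial state index and the initial MPS.

```c
#include <assert.h>
#include <stdint.h>

static const uint16_t Qe[47] = {
    0x5601, 0x3401, 0x1801, 0x0ac1, 0x0521, 0x0221, 0x5601, 0x5401, 0x4801, 0x3801,
    0x3001, 0x2401, 0x1c01, 0x1601, 0x5601, 0x5401, 0x5101, 0x4801, 0x3801, 0x3401,
    0x3001, 0x2801, 0x2401, 0x2201, 0x1c01, 0x1801, 0x1601, 0x1401, 0x1201, 0x1101,
    0x0ac1, 0x09c1, 0x08a1, 0x0521, 0x0441, 0x02a1, 0x0221, 0x0141, 0x0111, 0x0085,
    0x0049, 0x0025, 0x0015, 0x0009, 0x0005, 0x0001, 0x5601};
static const uint8_t NMPS[47] = {
    1, 2, 3, 4, 5, 38, 7, 8, 9, 10, 11, 12, 13, 29, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
    25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 45, 46};
static const uint8_t NLPS[47] = {
    1, 6, 9, 12, 29, 33, 6, 14, 14, 14, 17, 18, 20, 21, 14, 14, 15, 16, 17, 18, 19, 19, 20, 21,
    22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 46};
static const uint8_t SW[47] = {1,0,0,0,0,0,1,0,0,0,0,0,0,0,1};   /* remaining entries are 0 */

typedef struct {
    uint32_t A, C;          /* interval range and code value */
    int CT, Icx, MPScx;     /* bit counter and the single context used in this sketch */
    const uint8_t *BP;      /* current byte of the compressed data */
} MqDec;

/* Append the next compressed byte to C, honoring the 0xff marker/stuffing rule. */
static void mq_bytein(MqDec *d) {
    if (*d->BP == 0xff) {
        if (d->BP[1] > 0x8f) { d->C += 0xff00; d->CT = 8; }
        else { d->BP++; d->C += (uint32_t)*d->BP << 9; d->CT = 7; }
    } else { d->BP++; d->C += (uint32_t)*d->BP << 8; d->CT = 8; }
}

static void mq_init(MqDec *d, const uint8_t *start, int icx, int mps) {
    d->BP = start; d->Icx = icx; d->MPScx = mps;
    d->C = (uint32_t)*d->BP << 16;
    mq_bytein(d);
    d->C <<= 7; d->CT -= 7; d->A = 0x8000;
}

/* Shift A and C left until A is back at or above 0x8000. */
static void mq_renorm(MqDec *d) {
    do {
        if (d->CT == 0) mq_bytein(d);
        d->A <<= 1; d->C <<= 1; d->CT--;
    } while ((d->A & 0x8000) == 0);
}

static int mq_decode(MqDec *d) {
    uint32_t qe = Qe[d->Icx];
    int lps, D;
    d->A -= qe;                                /* MPS subinterval */
    if ((d->C >> 16) < qe) {                   /* lower subinterval */
        lps = !(d->A < qe);                    /* conditional exchange */
        d->A = qe;
    } else {
        d->C -= qe << 16;
        if (d->A & 0x8000) return d->MPScx;    /* MPS, no renormalization needed */
        lps = (d->A < qe);                     /* conditional exchange */
    }
    if (lps) {
        D = 1 - d->MPScx;
        if (SW[d->Icx]) d->MPScx = 1 - d->MPScx;
        d->Icx = NLPS[d->Icx];
    } else {
        D = d->MPScx;
        d->Icx = NMPS[d->Icx];
    }
    mq_renorm(d);
    return D;
}
```

With the data bytes {0x00, 0x00, 0xa4, 0xca, 0x2f, 0xff, 0x00, 0x00}, a start pointer at the second byte (an assumption chosen because it reproduces the initial state A = 0x8000, C = 0x00520000, CT = 1 reported in the simulation results of this section), Icx = 3 and MPScx = 0, the first decision decodes as 1 and leaves A = 0xAC10, C = 0x05265000, CT = 5 and Icx = 12.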
The values of A, C, and CT are initialized according to the JPEG 2000 standard, and the initialization code is given in Pcode 5.21.

    pJBA = &JBA;
    pJBA->C = (*(pJBA->BPST)) << 16;
    if (*(pJBA->BPST) == 0xff) {
        if (*(pJBA->BPST+1) > 0x8f) {
            pJBA->C = pJBA->C + 0xff00;
            pJBA->CT = 8;
        }
        else {
            pJBA->BPST++;
            pJBA->C = pJBA->C + (*(pJBA->BPST) << 9);
            pJBA->CT = 7;
        }
    }
    else {
        pJBA->BPST++;
        pJBA->C = pJBA->C + (*(pJBA->BPST) << 8);
        pJBA->CT = 8;
    }
    pJBA->C = (pJBA->C) << 7;
    pJBA->CT = pJBA->CT-7;
    pJBA->A = 0x8000;

Pcode 5.21: Initialization of MQ-decoder.

The simulation code for interval subdivision and parameter updating is given in Pcode 5.22. To divide the interval range into the LPS subinterval and the MPS subinterval, we first obtain the LPS subinterval QeIcx from the look-up table Qe[ ]. We obtain the MPS subinterval by subtracting the LPS subinterval from the total interval A. We then update the parameters depending on the code value C and the LPS subinterval QeIcx. Given the current context (Icx, MPScx), if the MPS subinterval A is less than the LPS subinterval QeIcx and the switch flag S[Icx] is set for context index Icx, then we update the MPS value MPScx of the current index by inverting it (i.e., 0 to 1 or 1 to 0). Next, we update the context index using the LPS or MPS index tables, depending on whether we are decoding an LPS or an MPS, as shown in Figure 5.9.

The simulation code for renormalization of the interval range A of the MQ-decoder is given in Pcode 5.23. In the renormalization process we consume bits from the input bitstream. We shift both the interval register A and the code register C left 1 bit at a time (1 bit per iteration). With each shift, we consume 1 bit from the bit FIFO (present in the LSB halfword of C) and the bit count CT is reduced by 1. Whenever CT becomes zero, we append to the FIFO a new data byte obtained from the bitstream buffer. The renormalization process may involve multiple iterations depending on the interval value A.
Whenever the interval range A reaches 0x8000 (or 0.75 in decimal notation) or more, we stop the renormalization iterations and output the decoded decision value D.

    QeIcx = Qe[pJBA->Icx];
    pJBA->A = pJBA->A - QeIcx;
    Ch = pJBA->C;
    Ch = Ch >> 16;
    if (Ch < QeIcx) {
        if (pJBA->A < QeIcx) {
            pJBA->A = QeIcx;
            pJBA->D = pJBA->MPScx;
            pJBA->Icx = nmps[pJBA->Icx];
        }
        else {
            pJBA->A = QeIcx;
            pJBA->D = 1 - pJBA->MPScx;
            if (S[pJBA->Icx] == 1) pJBA->MPScx = 1 - pJBA->MPScx;
            pJBA->Icx = nlps[pJBA->Icx];
        }
        // continue with renormalization process (use Pcode 5.23)
    }
    else {
        Ch = Ch - QeIcx;
        pJBA->C = pJBA->C & 0xffff;
        pJBA->C = pJBA->C | (Ch << 16);
        if ((pJBA->A & 0x8000) == 0) {
            if (pJBA->A < QeIcx) {
                pJBA->D = 1 - pJBA->MPScx;
                if (S[pJBA->Icx] == 1) pJBA->MPScx = 1 - pJBA->MPScx;
                pJBA->Icx = nlps[pJBA->Icx];
            }
            else {
                pJBA->D = pJBA->MPScx;
                pJBA->Icx = nmps[pJBA->Icx];
            }
            // continue with renormalization process (use Pcode 5.23)
        }
        else pJBA->D = pJBA->MPScx;
    }

Pcode 5.22: Simulation code for interval subdivision and parameter updating.

    do {
        if (pJBA->CT == 0) {
            if (*(pJBA->BPST) == 0xff) {
                if (*(pJBA->BPST+1) > 0x8f) {
                    pJBA->C = pJBA->C + 0xff00;
                    pJBA->CT = 8;
                }
                else {
                    pJBA->BPST++;
                    tmp = *(pJBA->BPST);
                    pJBA->C = pJBA->C + (tmp << 9);
                    pJBA->CT = 7;
                }
            }
            else {
                pJBA->BPST++;
                tmp = *(pJBA->BPST);
                pJBA->C = pJBA->C + (tmp << 8);
                pJBA->CT = 8;
            }
        }
        pJBA->A = pJBA->A << 1;
        pJBA->C = pJBA->C << 1;
        pJBA->CT = pJBA->CT - 1;
    } while ((pJBA->A & 0x8000) == 0);   // repeat until bit 15 of A is set

Pcode 5.23: Simulation code for renormalization of MQ-decoder.

MQ-Decoder Simulation Results

Here we present simulation results for the JPEG 2000 MQ-decoder. For a given JPEG 2000 arithmetic encoded bitstream, we list the initialized parameter values, the output decision values and the decoder parameters after decoding 1, 5, 10 and 20 output decisions. The encoded bitstream is present in buffer dat[ ], and the decoded binary decision output D is stored in the buffer sym[ ]. The following look-up tables are used in the MQ-decoder.

LPS probabilities or subintervals:

    Qe[47] = {
        0x5601, 0x3401, 0x1801, 0x0ac1, 0x0521, 0x0221, 0x5601, 0x5401,
        0x4801, 0x3801, 0x3001, 0x2401, 0x1c01, 0x1601, 0x5601, 0x5401,
        0x5101, 0x4801, 0x3801, 0x3401, 0x3001, 0x2801, 0x2401, 0x2201,
        0x1c01, 0x1801, 0x1601, 0x1401, 0x1201, 0x1101, 0x0ac1, 0x09c1,
        0x08a1, 0x0521, 0x0441, 0x02a1, 0x0221, 0x0141, 0x0111, 0x0085,
        0x0049, 0x0025, 0x0015, 0x0009, 0x0005, 0x0001, 0x5601};

Next symbol probability estimation given the present symbol as MPS:

    nmps[47] = {
        1, 2, 3, 4, 5, 38, 7, 8, 9, 10, 11, 12, 13, 29, 15, 16,
        17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32,
        33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 45, 46};

Next symbol probability estimation given the present symbol as LPS:

    nlps[47] = {
        1, 6, 9, 12, 29, 33, 6, 14, 14, 14, 17, 18, 20, 21, 14, 14,
        15, 16, 17, 18, 19, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
        30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 46};

Switch flag to toggle the MPS of a context:

    S[47] = {
        1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};

JPEG 2000 encoded bitstream data:

    dat[] = {0x00, 0x00, 0xa4, 0xca, 0x2f, 0xff, 0x00, 0x00}

After JPEG 2000 arithmetic decoder initialization:

    pJBA->A = 0x00008000
    pJBA->C = 0x00520000
    pJBA->CT = 1
    pJBA->Icx = 3
    pJBA->MPScx = 0
    sym[] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}

After decoding 1 decision by arithmetic decoder:

    pJBA->A = 0x0000AC10
    pJBA->C = 0x05265000
    pJBA->CT = 5
    pJBA->Icx = C
    pJBA->MPScx = 0
    sym[] = {1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}

After decoding 5 decisions by arithmetic decoder:

    pJBA->A = 0x0000B004
    pJBA->C = 0x79905E00
    pJBA->CT = 7
    pJBA->Icx = 14
    pJBA->MPScx = 0
    sym[] = {1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}

After decoding 10 decisions by arithmetic decoder:

    pJBA->A = 0x0000C006
    pJBA->C = 0x51B9C000
    pJBA->CT = 2
    pJBA->Icx = 15
    pJBA->MPScx = 0
    sym[] = {1,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0}

After decoding 20 decisions by arithmetic decoder:

    pJBA->A = 0x0000A802
    pJBA->C = 0x76F00000
    pJBA->CT = 4
    pJBA->Icx = E
    pJBA->MPScx = 0
    sym[] = {1,1,1,0,0,0,0,0,1,0,0,0,1,1,0,1,1,1,1,1}

MQ-Decoder Computational Complexity

As seen in Figure 5.9, the flow of the JPEG 2000 arithmetic decoder is somewhat complex. We analyze decoder complexity by considering the following possible cases, each defined by the flow chart steps on its path.

Case 1: In this case, the decoder steps in the path <Start, 1, 2, 3, End> are considered. This is the shortest possible path. It always decodes the MPS as the output decision and does not require renormalization.

Case 2: In this case, six paths through the flow chart are considered. These paths include both LPS and MPS decision decoding and the renormalization process. However, we do not read bits from the bitstream during renormalization, as these paths correspond to the case where CT is greater than zero. In general (about 80% of the time), bits from the bitstream are not read during renormalization.

Case 3: In this case, all the paths are the same as in Case 2 except for the presence of step (12) in every path, which reads bits from the bitstream buffer during renormalization when CT becomes zero. However, in this case neither the current byte nor the next byte of the bitstream pointed to by the buffer pointer is equal to 0xff. With this assumption, we can efficiently implement the renormalization process, as we will discuss later.

Case 4: In this case, all the considered paths of the decoder are the same as in Case 3 and the context is also the same. The only difference is that either the current byte or the next byte of the bitstream buffer pointed to by the buffer pointer is equal to 0xff.

As the preceding four cases show, the decoder complexity increases from Case 1 to Case 4.
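The first entry of the simulation results listed earlier in this section can be checked with a small stand-alone sketch. It follows Pcodes 5.21 to 5.23 (with renormalization repeating until bit 15 of A is set) and uses the tables as printed. Two things are assumptions rather than statements from the text: the names `MqDec`, `mq_init`, `mq_bytein`, `mq_renorm`, `mq_decode` and `mq_demo_dat` are hypothetical, and the bitstream pointer is assumed to start at `dat[1]`, which is the starting position that reproduces the printed value C = 0x00520000.

```c
#include <assert.h>
#include <stdint.h>

static const uint16_t Qe[47] = {
    0x5601, 0x3401, 0x1801, 0x0ac1, 0x0521, 0x0221, 0x5601, 0x5401,
    0x4801, 0x3801, 0x3001, 0x2401, 0x1c01, 0x1601, 0x5601, 0x5401,
    0x5101, 0x4801, 0x3801, 0x3401, 0x3001, 0x2801, 0x2401, 0x2201,
    0x1c01, 0x1801, 0x1601, 0x1401, 0x1201, 0x1101, 0x0ac1, 0x09c1,
    0x08a1, 0x0521, 0x0441, 0x02a1, 0x0221, 0x0141, 0x0111, 0x0085,
    0x0049, 0x0025, 0x0015, 0x0009, 0x0005, 0x0001, 0x5601};
static const uint8_t nmps[47] = {
    1, 2, 3, 4, 5, 38, 7, 8, 9, 10, 11, 12, 13, 29, 15, 16,
    17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32,
    33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 45, 46};
static const uint8_t nlps[47] = {
    1, 6, 9, 12, 29, 33, 6, 14, 14, 14, 17, 18, 20, 21, 14, 14,
    15, 16, 17, 18, 19, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
    30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 46};
static const uint8_t S[47] = {
    1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};

/* Bitstream from the simulation results. */
static const uint8_t mq_demo_dat[8] = {0x00, 0x00, 0xa4, 0xca, 0x2f, 0xff, 0x00, 0x00};

typedef struct {
    uint32_t A, C;          /* interval and code registers */
    int CT, Icx, MPScx, D;  /* bit counter, context state, decision */
    const uint8_t *BPST;    /* current bitstream byte */
} MqDec;

/* Append the next byte to the bit FIFO, handling 0xff bit stuffing. */
static void mq_bytein(MqDec *d) {
    if (*d->BPST == 0xff) {
        if (*(d->BPST + 1) > 0x8f) { d->C += 0xff00; d->CT = 8; }
        else { d->BPST++; d->C += (uint32_t)*d->BPST << 9; d->CT = 7; }
    } else {
        d->BPST++; d->C += (uint32_t)*d->BPST << 8; d->CT = 8;
    }
}

/* Equivalent to Pcode 5.21; icx and mps set the single context used here. */
static void mq_init(MqDec *d, const uint8_t *start, int icx, int mps) {
    assert(icx >= 0 && icx < 47);
    d->BPST = start;
    d->C = (uint32_t)*d->BPST << 16;
    mq_bytein(d);
    d->C <<= 7;
    d->CT -= 7;
    d->A = 0x8000;
    d->Icx = icx; d->MPScx = mps; d->D = 0;
}

/* Pcode 5.23: shift until bit 15 of A is set. */
static void mq_renorm(MqDec *d) {
    do {
        if (d->CT == 0) mq_bytein(d);
        d->A <<= 1; d->C <<= 1; d->CT--;
    } while ((d->A & 0x8000) == 0);
}

static void mq_code_mps(MqDec *d) { d->D = d->MPScx; d->Icx = nmps[d->Icx]; }
static void mq_code_lps(MqDec *d) {
    d->D = 1 - d->MPScx;
    if (S[d->Icx]) d->MPScx = 1 - d->MPScx;
    d->Icx = nlps[d->Icx];
}

/* Pcode 5.22 followed by renormalization where required. */
static int mq_decode(MqDec *d) {
    uint32_t qe = Qe[d->Icx];
    uint32_t ch = d->C >> 16;
    d->A -= qe;
    if (ch < qe) {
        if (d->A < qe) mq_code_mps(d); else mq_code_lps(d);
        d->A = qe;
        mq_renorm(d);
    } else {
        d->C -= qe << 16;   /* same as Ch = Ch - QeIcx on the MSB halfword */
        if ((d->A & 0x8000) == 0) {
            if (d->A < qe) mq_code_lps(d); else mq_code_mps(d);
            mq_renorm(d);
        } else {
            d->D = d->MPScx;
        }
    }
    return d->D;
}
```

Calling `mq_init(&d, mq_demo_dat + 1, 3, 0)` and then `mq_decode(&d)` once should leave the registers at the values printed under "After decoding 1 decision"; later decisions were not checked for this sketch.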
With this analysis, in the following section we optimize the JPEG 2000 arithmetic decoder flow for the number of cycles while keeping the memory usage the same for all cases.

5.4.3 Efficient Simulation of JPEG 2000 MQ-Decoder

In the optimization of the decoder, we first optimize each individual case described in the previous section and then combine all the cases into a single flow with a few conditional jumps. Here, the conditional jump is taken such that the average-to-peak cycles of decoding are reduced.

Optimization of Case 1

This path (Start, (1), (2), (3), and End) of the decoder is the shortest path and leaves little scope for optimization. However, combining the two conditions into one condition, as shown in Pcode 5.24, results in a single conditional jump.

    QeIcx = Jpeg_Art[pJBA->Icx + tmp];
    Ch = pJBA->C;
    Ch = Ch >> 16;
    pJBA->A = pJBA->A - QeIcx;
    if ((Ch < QeIcx) || ((pJBA->A & 0x8000) == 0)) {
        // Case 2, Case 3, or Case 4
    }
    else {
        // Case 1: Ch >= QeIcx and bit 15 of A is still set
        Ch = Ch - QeIcx;
        D = pJBA->MPScx;
        pJBA->C = pJBA->C & 0xffff;
        pJBA->C = pJBA->C | (Ch << 16);
    }

Pcode 5.24: Efficient implementation of Case 1 of the MQ-decoder.

Optimization of Case 2

The decoder flows in this case are much more complex than in Case 1. The process common to all the flows of Case 2 is the renormalization operation, which is a conditional multi-iteration process and is very costly in terms of cycles. For example, if the interval A is 0x0ac1, then four iterations are needed for normalization, and if no bits have to be read into the value C (as assumed in Case 2), it requires about 68 (= 4 * (5 + 2 * 6)) cycles (see Section A.4 on the companion website for more details on clock cycle estimation on the reference embedded processor). Many of the cycles spent in renormalization can be avoided if we first compute the normalization loop count "CNT" by counting the leading zeros in A, then shift A and C by CNT and subtract CNT from CT.
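The loop-count idea above can be sketched in portable C as follows. This is only an illustration under stated assumptions: `clz16`, `MqRegs` and `mq_renorm_once` are hypothetical names, the leading-zero count is done with a short binary search because standard C has no such instruction, and the routine covers only the Case 2 situation in which CT stays large enough that no bitstream byte has to be fetched.

```c
#include <assert.h>
#include <stdint.h>

/* Count leading zeros of a nonzero 16-bit value with a binary search. */
static int clz16(uint16_t a) {
    int n = 0;
    assert(a != 0);
    if ((a & 0xff00) == 0) { n += 8; a = (uint16_t)(a << 8); }
    if ((a & 0xf000) == 0) { n += 4; a = (uint16_t)(a << 4); }
    if ((a & 0xc000) == 0) { n += 2; a = (uint16_t)(a << 2); }
    if ((a & 0x8000) == 0) { n += 1; }
    return n;
}

typedef struct {
    uint32_t A;   /* interval register, below 0x10000 */
    uint32_t C;   /* code register */
    int CT;       /* bits left in the bit FIFO */
} MqRegs;

/* Renormalize in one pass: CNT shifts of A and C, CT reduced by CNT.
   Returns CNT. Assumes CT >= CNT, i.e. no byte needs to be read. */
static int mq_renorm_once(MqRegs *r) {
    int cnt;
    assert(r->A != 0 && r->A < 0x10000);
    cnt = clz16((uint16_t)r->A);
    assert(r->CT >= cnt);
    r->A <<= cnt;
    r->C <<= cnt;
    r->CT -= cnt;
    return cnt;
}
```

For A = 0x0ac1 the count is 4, matching the four iterations mentioned in the text. On a processor with a leading-zero or sign-bit count instruction, `clz16` would normally be replaced by that single instruction.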
Next, a complex task common to all paths in Case 2 is obtaining the new values for Icx, MPScx and D, and the new values for A and C before normalization. The new values for these parameters can be computed efficiently using a look-up table and conditional moves. In this way we can avoid most of the jumps. As shown in Pcode 5.25, the new values of A and C are obtained by conditional computation. Instead of accessing different look-up tables to compute the new values of Icx and MPScx, all look-up tables are combined into one new look-up table. Depending on the conditions, an offset is chosen to select the appropriate look-up values. In this way, the output decision is also obtained from the look-up table. The look-up table's 16-bit codeword contains D (4 bits), MPScx (4 bits) and Icx (8 bits). The values of the look-up table Jpeg_Art[ ] for obtaining all the previously specified parameters can be found on this book's companion website.

Optimization of Case 3

The optimization techniques used in Case 2 are all applicable to Case 3 too. The extra computation we perform in Case 3 is reading data bits from the bitstream buffer into the value C during renormalization.
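The 16-bit codeword layout described above for Jpeg_Art[ ] can be illustrated with a pack/unpack pair. The actual table values are on the companion website and are not reproduced here; the field positions below are inferred from the way Pcode 5.25 extracts the fields (`tmp & 0xff`, `(tmp >> 8) & 1`, `tmp >> 12`), and the names `ArtEntry`, `art_pack` and `art_unpack` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int D;      /* decoded decision, 0 or 1 (bits 15..12) */
    int MPScx;  /* updated MPS value, 0 or 1 (bits 11..8) */
    int Icx;    /* next context index, 0..46 (bits 7..0)  */
} ArtEntry;

/* Build one codeword in the assumed D | MPScx | Icx layout. */
static uint16_t art_pack(ArtEntry e) {
    assert(e.D == 0 || e.D == 1);
    assert(e.MPScx == 0 || e.MPScx == 1);
    assert(e.Icx >= 0 && e.Icx < 47);
    return (uint16_t)((e.D << 12) | (e.MPScx << 8) | e.Icx);
}

/* Split a codeword the same way Pcode 5.25 does. */
static ArtEntry art_unpack(uint16_t tmp) {
    ArtEntry e;
    e.Icx = tmp & 0xff;
    e.MPScx = (tmp >> 8) & 1;
    e.D = tmp >> 12;
    return e;
}
```

As an example, an LPS decode at Icx = 3 with MPScx = 0 gives D = 1, leaves MPScx at 0 (S[3] is 0) and moves to Icx = nlps[3] = 12, so under this layout the corresponding entry would be 0x100c. One table fetch then replaces the separate accesses to nmps[ ], nlps[ ] and S[ ].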
    QeIcx = Jpeg_Art[pJBA->Icx];
    r1 = 3; r2 = 1; r3 = 1; r4 = 3; r5 = 2; r6 = 4;
    Ch = pJBA->C;
    Ch = Ch >> 16;
    pJBA->A = pJBA->A - QeIcx;
    if (pJBA->MPScx == 1) { r1 = r6; r2 = r5; r3 = r5; r4 = r6; }
    if (pJBA->A < QeIcx) { r1 = r2; r3 = r4; }
    if ((Ch < QeIcx) || ((pJBA->A & 0x8000) == 0)) {
        if (Ch >= QeIcx) {
            Ch = Ch - QeIcx;
            r1 = r3;
            pJBA->C = pJBA->C & 0xffff;
            pJBA->C = pJBA->C | (Ch << 16);
        }
        else pJBA->A = QeIcx;
        tmp = Jpeg_Art[r1*47 + pJBA->Icx];
        pJBA->Icx = tmp & 0xff;
        pJBA->MPScx = (tmp >> 8) & 1;
        pJBA->D = tmp >> 12;
        r1 = 0;
        while ((pJBA->A & 0x8000) == 0) {
            pJBA->A = pJBA->A << 1;
            r1++;
        }
        pJBA->C = pJBA->C << r1;                // Case 2
        pJBA->CT = pJBA->CT - r1;
        if (pJBA->CT <= 0) {
            if ((*(pJBA->BPST) != 0xff) && (*(pJBA->BPST+1) != 0xff)) {
                // Case 3
            }
            else {
                // Case 4
            }
        }
    }
    else {
        // Case 1
    }

Pcode 5.25: Efficient simulation of Case 2 of the MQ-decoder.

If neither the current byte nor the next byte is 0xff, then we can read bits efficiently by moving 16 bits at a time into C when CT becomes less than or equal to zero, and then adding 16 to CT. In this way, we read the buffer only once for every 16 bits of renormalization. If either the current byte or the next byte is 0xff, then we continue with the Case 4 optimization techniques. The efficient simulation code for reading bits into C in Case 3 is given in Pcode 5.26.

Optimization of Case 4

In Case 4, we use all the previously suggested techniques of Cases 1 through 3. In this case, we handle normalization in two parts to avoid the bit-by-bit normalization given in the JPEG 2000 standard. In the first part, we read up to 8 bits into C when the number of normalization bits is less than or equal to 8. The second part handles instances in which the number of normalization bits is more than 8, reading up to 15 bits into C as shown in Pcode 5.27. Although Case 4 looks a little complex, it occurs rarely compared to the other cases.
Computational Complexity with Optimized MQ-Decoder

We estimate the computational complexity of the MQ-decoder in terms of the memory and clock cycles consumed in executing the optimized MQ-decoder. We use 0.25 kB of extra data memory (see look-up table Jpeg_Art[ ]) with the optimized MQ-decoder.

    if (pJBA->CT <= 0) {
        if ((*(pJBA->BPST) != 0xff) && (*(pJBA->BPST+1) != 0xff)) {
            pJBA->BPST++;
            r1 = *(pJBA->BPST);
            r1 = r1 << 8;
            pJBA->BPST++;
            r2 = *(pJBA->BPST);
            r1 = r1 | r2;
            pJBA->C = pJBA->C | (r1 << (-pJBA->CT));
            pJBA->CT += 16;
        }
        else {
            // Case 4
        }
    }

Pcode 5.26: Efficient implementation of bit FIFO for Case 3 of the MQ-decoder.

    if (pJBA->CT <= 0) {
        if ((*(pJBA->BPST) != 0xff) && (*(pJBA->BPST+1) != 0xff)) {
            // Case 3
        }
        else {
            if (pJBA->CT >= -8) {
                if (*(pJBA->BPST) != 0xff) {
                    pJBA->BPST++;
                    pJBA->C = pJBA->C | (*(pJBA->BPST) << (8 - pJBA->CT));
                    pJBA->CT += 8;
                }
                else {
                    if (*(pJBA->BPST+1) > 0x8f) {
                        pJBA->C = pJBA->C + (0xff00 << (8 - pJBA->CT));
                        pJBA->CT += 8;
                    }
                    else {
                        pJBA->BPST++;
                        pJBA->C = pJBA->C | (*(pJBA->BPST) << (7 - pJBA->CT));
                        pJBA->CT += 7;
                    }
                }
            }
            else {
                if (*(pJBA->BPST) != 0xff) {
                    pJBA->BPST++;
                    pJBA->C = pJBA->C | (*(pJBA->BPST) << 16);
                    pJBA->CT += 8;
                }
                else {
                    if (*(pJBA->BPST+1) > 0x8f) {
                        pJBA->C = pJBA->C + (0xff00 << 16);
                        pJBA->CT += 8;
                    }
                    else {
                        pJBA->BPST++;
                        pJBA->C = pJBA->C | (*(pJBA->BPST) << 15);
                        pJBA->CT += 7;
                    }
                }
                if (*(pJBA->BPST) != 0xff) {
                    pJBA->BPST++;
                    pJBA->C = pJBA->C | (*(pJBA->BPST) << (8 - pJBA->CT));
                    pJBA->CT += 8;
                }
                else {
                    if (*(pJBA->BPST+1) > 0x8f) {
                        pJBA->C = pJBA->C + (0xff00 << (8 - pJBA->CT));
                        pJBA->CT += 8;
                    }
                    else {
                        pJBA->BPST++;
                        pJBA->C = pJBA->C | (*(pJBA->BPST) << (7 - pJBA->CT));
                        pJBA->CT += 7;
                    }
                }
            }
        }
    }

Pcode 5.27: Efficient implementation of Case 4 of the MQ-decoder.

Since Cases 1 and 3 of the MQ-decoder do not occur frequently and Case 4 occurs very rarely, we take the average cycle cost of the MQ-decoder to be the cycles required for Case 2 (since it occurs more frequently).
As seen in Pcode 5.25, the approximate cycle cost to run Case 2 of the MQ-decoder on the reference embedded processor is about 45 cycles. We consume a minimum of 50 cycles and a maximum of around 150 cycles for Case 2 of the MQ-decoder without applying any optimization techniques. Thus, with optimization techniques, we can clearly reduce the average-to-peak cycles count by 100.

5.5 Context-Based Adaptive Binary Arithmetic Coding

The H.264 standard (ITU-T H.264, 2005) uses a variant of the M-coder for entropy coding to compress and decompress the datastream. This entropy coding is known as context-based adaptive binary arithmetic coding, or CABAC. See Section 5.1 for more details on the binary arithmetic coder (BAC). The H.264 standard's main profile defines three CABAC core routines for compressing/decompressing the bitstream: encode/decode binary symbol, encode/decode equiprobable binary symbol, and encode/decode terminate symbol. Of these three core routines, the encode and decode symbol routines are the more complex ones. In this section, we present an overview of the H.264 arithmetic coder encode and decode symbol routines, and we estimate their computational complexity. Although the H.264 reference software (see http://iphome.hhi.de/suehring/tml/) is available in the public domain, it is written very inefficiently and cannot be used as is for real-time applications. Thus, we discuss here efficient implementation techniques for the H.264 CABAC encode and decode symbol routines.

A few applications of the H.264 standard include digital video broadcasting, digital subscriber lines, personal media players, HDTV, video surveillance, digital media storage, and multimedia communications. Similar to the CABAC in the H.264/AVC standard, the JPEG 2000 standard (see Section 5.4) uses the MQ-coder for bitstream compression. The H.264 arithmetic coder is simpler than the MQ-coder.
The MQ-coder (JPEG 2000) achieves bit savings of about 10% compared to VLCs, whereas the H.264 arithmetic coder improves on the MQ-coder by 15 to 20% in throughput and by 2 to 5% in bit savings.

5.5.1 H.264 CABAC Overview

The basic parameters used for the CABAC encode symbol function are Range (interval), Value or Low (code value), {State, MPS} (context parameters), and Obits (outstanding bits). In the H.264 CABAC, unlike in the binary arithmetic coder, we do not need multiplications or divisions to perform interval subdivision. The interval subdivision is achieved using a look-up table indexed by the given Range and State (a quantized probability value, obtained from the context model). The Symbol (also called a binary decision or bin, obtained after binarization of the syntax elements defined by the H.264 standard) is coded as the MPS (most probable symbol) or the LPS (least probable symbol), depending on the Symbol and the present MPS value. The parameters Range, Value, State and MPS are updated after coding each Symbol. To keep the precision of Range within limits, normalization of Range and Value is performed whenever the value of Range becomes less than 256 (see Figure 5.11). We discuss the H.264 CABAC encode symbol function further in Section 5.5.2, Encode Symbol.

The basic parameters used for the CABAC decode symbol function are Range, Value, {State, MPS}, and the compressed/encoded bitstream. We divide the current interval Range, with the given State (or quantized probability value), into MPS and LPS intervals. We get the LPS interval (rLPS) from the look-up table RangeLPS[ ] and we compute the MPS interval by subtracting rLPS from the current Range. Depending on the MPS interval and Value, we decode the symbol as MPS or LPS. We update Range, Value and {State, MPS} after decoding every symbol.
To keep the precision of Range within limits, renormalization of Range and Value is performed whenever the value of Range becomes less than 256, and Value is filled from the bitstream during the renormalization process (see Figure 5.11).

5.5.2 CABAC Symbol Coding

In video coding, we have various types of parameters (e.g., slice layer parameters, macroblock layer parameters, prediction modes, motion vectors, residual coefficients) to encode (compress data) or decode (decompress data) using an entropy coder. The H.264 standard uses a special name for all these parameters: syntax elements. The H.264 standard defines various types of syntax elements along with the contexts {State, MPS} for coding different types of parameters. Over 460 contexts for different types of syntax elements are defined in the H.264 standard.

As the CABAC entropy coder of H.264 works with binary data, we convert the syntax elements (nonbinary valued data) to binary Symbols (bins) using a binarization process (which is defined in the H.264 standard for each type of syntax element) before encoding them. In the same way, we apply the corresponding debinarization process to the decoded Symbols to rebuild the syntax element value for a particular parameter. A context is a probability model for one or more bins of the binarized syntax element. This probability model may be chosen from a set of available models depending on the statistics of recently coded syntax elements.
As an example, the syntax element value, bins, and associated context parameters {State, MPS} (which are not as per the H.264 standard) for CABAC coding of the residual coefficient value 6 follow:

    Syntax element (residual coefficient): 6
    After binarization (Symbols or bins):  1 1 1 1 1 0
    Contexts (for each bin):               {21, 1}, {23, 0}, {24, 0}, {27, 1}, {28, 0}, {29, 1}

For the encoding or decoding of each image slice (a video frame may contain multiple slices), we initialize the Range, Value, and context {State, MPS} parameters of the CABAC. The associated context parameters of syntax elements are updated when coding those syntax elements.

The H.264 CABAC encode and decode symbol process is shown in Figure 5.10. At the transmitter side, we perform the CABAC encoder operations (e.g., binarization, symbol coding, context model update) and generate the compressed bitstream, which we transmit through a noisy channel after processing by a signal chain (including modules such as channel coding, modulation, and filtering). At the receiver, we obtain the bitstream at the end of the receiver signal chain (which includes filters, demodulation, data error correction, etc.). This bitstream corresponds to the encoded bits. Signal chain blocks in the transmitter and receiver are not shown in Figure 5.10.

In the H.264 CABAC, the symbol coding engine consists of three steps: (1) interval subdivision, (2) CABAC parameters update, and (3) normalization process. In the interval subdivision, we divide the current interval Range into LPS and MPS intervals. With the CABAC symbol coding routine, we code the Symbol as either LPS or MPS and update the CABAC parameters correspondingly. After updating the CABAC parameters, we check the value of Range, and if it is below 256, we normalize Range so that it is at least 256. In doing so, we also normalize Value, which produces (in the encoder) or consumes (in the decoder) bitstream bits during the normalization process.
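The binarization example given earlier in this section (coefficient 6 giving bins 1 1 1 1 1 0) can be reproduced with a unary code applied to the value minus 1. That mapping is an assumption chosen to match the printed bins; the text itself notes that the example is not as per the H.264 standard, and the function names `unary_binarize` and `unary_debinarize` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Write `value` ones followed by one terminating zero into bins[].
   Returns the number of bins written; cap is the capacity of bins[]. */
static size_t unary_binarize(unsigned value, unsigned char *bins, size_t cap) {
    size_t n = 0;
    unsigned i;
    assert(cap >= (size_t)value + 1);
    for (i = 0; i < value; i++) bins[n++] = 1;
    bins[n++] = 0;
    return n;
}

/* Count the leading ones of a unary bin string of length n. */
static unsigned unary_debinarize(const unsigned char *bins, size_t n) {
    unsigned v = 0;
    size_t i = 0;
    while (i < n && bins[i] == 1) { v++; i++; }
    return v;
}
```

With this sketch, `unary_binarize(6 - 1, bins, 8)` writes six bins, and each bin would then be passed to the symbol coding engine together with its own context {State, MPS}.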
Encode Symbol

The flow chart diagram of the H.264 CABAC encode symbol routine is shown in Figure 5.11. Inputs to the encode symbol function are Range, Value, {State, MPS} and Obits, and outputs are the updated CABAC parameters and the bitstream. According to Range and State, we get rLPS (an LPS interval range) using the look-up table RangeLPS[ ]. We compute the MPS interval Range by subtracting rLPS from the current interval Range. Then, depending on the current Symbol and MPS, we code the Symbol as either MPS or LPS and update the parameters correspondingly. After updating the CABAC parameters, we check the value of Range, and if it is less than 256, we perform the encoder normalization process. The H.264 CABAC encoder normalization process is a multi-iteration process, as shown in Figure 5.11. In every iteration, we double the value of Range and compare it with 256. If Range is greater than or equal to 256, then we quit the normalization loop. During the normalization process, we also normalize Value and output bits, depending on Value (to avoid overflow), in each iteration.

[Figure 5.10: H.264 CABAC symbol coding process.]

[Figure 5.11: Flow chart diagram of the H.264 encode binary symbol.]
Decode Symbol

The flow chart diagram of the H.264 CABAC decode symbol routine is shown in Figure 5.12. Inputs to the decode symbol function are Range, Value, {State, MPS} and the bitstream, and outputs are the updated CABAC parameters and the decoded Symbol. Based on Range and State, we get rLPS (an LPS interval range) using the look-up table RangeLPS[ ]. We compute the MPS interval Range by subtracting rLPS from the current interval Range. Then, depending on Value and Range, we decode either the LPS or the MPS and update the corresponding parameters. Then, we normalize Range and Value in multiple iterations if the value of Range is less than 256. During the normalization process, we update Value with bits from the input bitstream.

[Figure 5.12: Flow chart diagram of CABAC decode symbol routine.]

CABAC Symbol Coding Simulation

We simulate CABAC symbol encoding (or decoding) using the flow chart diagrams shown in Figure 5.11 (or Figure 5.12). We use three look-up tables (defined by the H.264 standard) in CABAC symbol coding for interval subdivision and parameters update; the values for the three look-up tables RangeLPS[ ], StateLPS[ ], and StateMPS[ ] can be found on the companion website. The simulation code for CABAC encode symbol is given in Pcode 5.28 and the simulation code for write_bits( ) (or Output( ) in Figure 5.11) is given in Pcode 5.29. The simulation code for the CABAC decode symbol is given in Pcode 5.30. We use the read_bits( ) function (bit_stream( ) in Figure 5.12) in the CABAC decode symbol routine to read bits from the bitstream buffer.
We use the following parameters structure in CABAC symbol coding:

    typedef struct H264BacPars_tag {
        int Range;
        int Low;         // code value (referred to as Value in the decoder listings)
        int State;
        int MPS;
        int Obits;
        int Symbol;
        int byteoffset;
        int bitpos;
    } H264BacPars_t;
    H264BacPars_t BAC, *pBAC;

    pBAC = &BAC;
    tmp = (pBAC->Range >> 6) & 3;
    rLPS = RangeLPS[4*pBAC->State + tmp];
    pBAC->Range = pBAC->Range - rLPS;
    if (pBAC->Symbol == pBAC->MPS)
        pBAC->State = StateMPS[pBAC->State];
    else {
        pBAC->Low = pBAC->Low + pBAC->Range;
        pBAC->Range = rLPS;
        if (pBAC->State == 0) pBAC->MPS = 1 - pBAC->MPS;
        pBAC->State = StateLPS[pBAC->State];
    }
    while (pBAC->Range < 256) {
        if (pBAC->Low >= 512) {
            pBAC->Low -= 512;
            write_bits(1, 1);
            if (pBAC->Obits > 0) {
                write_bits(0, pBAC->Obits);
                pBAC->Obits = 0;
            }
        }
        else if (pBAC->Low < 256) {
            write_bits(0, 1);
            if (pBAC->Obits > 0) {
                write_bits(1, pBAC->Obits);
                pBAC->Obits = 0;
            }
        }
        else {
            pBAC->Obits++;
            pBAC->Low -= 256;
        }
        pBAC->Range = pBAC->Range << 1;
        pBAC->Low = pBAC->Low << 1;
    }

Pcode 5.28: Simulation code for CABAC encode symbol.

    // write_bits(b, n): append bit b to the buffer n times.
    // The loop bound and the bit insertion line are restored from the
    // write procedure listed in Section 5.5.3; b and n are assumed names
    // for the two write_bits( ) arguments.
    tmp = dat[pBAC->byteoffset];
    for (i = 0; i < n; i++) {
        tmp = (tmp << 1) | b;
        pBAC->bitpos = pBAC->bitpos - 1;
        if (pBAC->bitpos == 0) {
            dat[pBAC->byteoffset] = tmp;
            pBAC->byteoffset++;
            pBAC->bitpos = 8;
        }
    }
    dat[pBAC->byteoffset] = tmp;

Pcode 5.29: Simulation code for write_bits( ) function.

5.5.3 CABAC Symbol Coding Complexity

As seen in Figures 5.11 and 5.12, CABAC symbol coding consists of many sequential and conditional operations (unlike other video coding block processing modules such as the DCT transform and motion compensation, which have no conditional flow of operations). In some cases, the input of the present operation depends on the output of the previous operation, which leaves little scope for interleaving the program code.

The first two parts of the CABAC symbol encoder and decoder, interval subdivision and parameters update, have a similar flow in terms of computations. In the interval subdivision (see Pcode 5.28 or Pcode 5.30), we have to perform the following operations in dividing Range.
    tmp1 = Range >> 6;
    tmp2 = 4*State;
    tmp1 = tmp1 & 3;
    index = tmp1 + tmp2;
    rLPS = RangeLPS[index];    // LPS interval
    Range = Range - rLPS;      // MPS interval

Dividing Range into MPS and LPS intervals takes around 9 to 10 cycles on the reference embedded processor, because the rLPS value is used immediately after it is fetched from the look-up table to compute Range, which stalls the processor for 3 to 4 cycles. The next step is coding the Symbol as MPS or LPS. This process involves one conditional jump to choose between the LPS path and the MPS path, an update of Range, an update of Value, a conditional update of MPS and one memory access to update State. These operations consume around 10 to 15 cycles. Based on this analysis, the first two parts of the CABAC symbol coding routines take around 25 cycles.

    tmp = (pBAC->Range >> 6) & 3;
    rLPS = RangeLPS[4*pBAC->State + tmp];
    pBAC->Range = pBAC->Range - rLPS;
    pBAC->Symbol = pBAC->MPS;
    if (pBAC->Value < pBAC->Range)
        pBAC->State = StateMPS[pBAC->State];    // MPS decode
    else {
        pBAC->Value = pBAC->Value - pBAC->Range;
        pBAC->Range = rLPS;
        pBAC->Symbol = 1 - pBAC->MPS;
        if (pBAC->State == 0) pBAC->MPS = 1 - pBAC->MPS;
        pBAC->State = StateLPS[pBAC->State];    // LPS decode
    }
    while (pBAC->Range < 256) {
        pBAC->Range = pBAC->Range << 1;
        pBAC->Value = (pBAC->Value << 1) | (read_bits(1));
    }
    // Output is pBAC->Symbol

Pcode 5.30: Simulation code for CABAC decode symbol.

CABAC Encode Symbol Normalization

In the H.264 encode symbol routine given in Pcode 5.28, the normalization process has many conditional jumps in a "while" loop. This process is costly in terms of cycles, as it performs normalization 1 bit at a time with many jumps. In addition, writing encoded bits to memory using the write_bits( ) function (or Output( ); see Figure 5.11) along with the normalization of Value is a very complex operation. We have to perform the following operations every time we write 1 bit to the memory buffer.

1. Read the unfilled word from the buffer (tmp = dat[wordoffset])
2. Shift the word left by 1 bit (tmp = tmp << 1)
3. OR the present bit "b" with the shifted word (tmp = tmp | b)
4. Store the ORed word to memory (dat[wordoffset] = tmp)
5. Reduce bitpos by 1 (bitpos = bitpos - 1)
6. Check whether bitpos is equal to zero (bitpos == 0)
7. If bitpos is zero, then increment wordoffset by 1 and reset bitpos to 32 (wordoffset = wordoffset + 1; bitpos = 32)

The procedure for writing bits to memory just described is not part of the H.264 standard, but it is needed to write the bits to the buffer. Typically, data is stored to memory in bytes (8 bits), halfwords (16 bits) or words (32 bits) for easy addressing. When we want to store the encoded bits to memory, the bits are first packed into bytes or words and then stored. The procedure described previously packs the bits into 32-bit words and then stores them to the data buffer. We choose the 32-bit word instead of the 8-bit byte because storing words costs fewer cycles than storing bytes, with fewer memory accesses (once for every 32 bits instead of every 8 bits). To pack the bits into 32-bit words, we use the bit counter (bitpos) to track how many bits are still needed to fill a 32-bit word. Every time we add a bit to the word, we reduce the bit count by 1. When the bit count is zero, the word is full with 32 bits; that word is stored to the buffer and the bit counter is reset to 32.

Implementing this procedure of packing bits into a word before storing to memory on the reference embedded processor takes a minimum of 10 cycles. If we want to do two bits of normalization (i.e., the loop count is two) with the outstanding bits (Obits) equal to zero, it takes around 30 to 40 cycles (including jumps and other operations) depending on Value.
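The seven-step procedure for writing one bit into a 32-bit word buffer can be sketched as follows. This is a minimal illustration, not code from the text: `BitWriter`, `bw_init` and `bw_put_bit` are hypothetical names, and the buffer is assumed to be zero-initialized by the caller and large enough for all the bits written.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t *dat;    /* zero-initialized output buffer (assumption) */
    int wordoffset;   /* index of the word currently being filled */
    int bitpos;       /* bits still needed to fill the current word */
} BitWriter;

static void bw_init(BitWriter *w, uint32_t *buf) {
    w->dat = buf;
    w->wordoffset = 0;
    w->bitpos = 32;
}

/* Append one bit following steps 1 to 7 of the procedure. */
static void bw_put_bit(BitWriter *w, unsigned b) {
    uint32_t tmp;
    assert(b <= 1);
    tmp = w->dat[w->wordoffset];     /* 1. read the unfilled word   */
    tmp = tmp << 1;                  /* 2. shift left by 1 bit      */
    tmp = tmp | b;                   /* 3. OR in the present bit    */
    w->dat[w->wordoffset] = tmp;     /* 4. store the word           */
    w->bitpos = w->bitpos - 1;       /* 5. reduce bitpos by 1       */
    if (w->bitpos == 0) {            /* 6. is the word full?        */
        w->wordoffset++;             /* 7. move to the next word    */
        w->bitpos = 32;
    }
}
```

In this sketch a partially filled word holds its bits right-aligned until the 32nd bit arrives, so a final flush step (not shown) would be needed to left-align the last word. The memory read and write on every bit are the main reason the text puts the cost at a minimum of 10 cycles per bit.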
In addition, storing Obits to memory during the normalization process can sometimes become a lengthy task, as the upper limit on the Obits count according to the standard is given by the number of encoding decisions present in a slice. This shows the complexity of the normalization process and the necessity of optimizing it.

CABAC Decode Symbol Normalization

The decode symbol normalization is also a multi-iteration process. In each iteration, we shift Range and Value left by 1 bit and fill the LSB of Value with 1 bit read from the bitstream buffer. The complexity of reading bits from the memory buffer is the same as that of writing bits to the memory buffer. Therefore, a single iteration of the decode symbol normalization process consumes about 13 cycles (10 cycles for the memory read and three cycles for the left shifts and appending the bit to Value).

5.5.4 Efficient CABAC Symbol Coding

As seen in Section 5.5.3, CABAC symbol coding consumes a minimum of 45 cycles for encoding and 35 cycles for decoding one symbol. The compression ratio achieved with the H.264 CABAC coding engine is about 1.1; that is, the ratio of the number of input symbols to the number of output bits in the CABAC coder is approximately 1.1. If we work with a 1 Mbps bit rate, then the H.264 CABAC symbol coding routine is called approximately 1 million times per second, and the encode (or decode) symbol routine alone consumes about 45 (or 35) MIPS of the embedded processor. In this section, we discuss efficient simulation of the CABAC symbol coding routines.

Interval Subdivision and Parameters Update

On the reference embedded processor, conditional jumps are very costly. Instead of jumping conditionally, we can update parameters by moving the values conditionally.
To reduce the number of conditional moves and memory accesses, we pack State and MPS together and access them through one look-up table (which holds the State for both the LPS and the MPS path and the effective MPS value). Because of this, the new State look-up table is four times bigger than the original State (LPS or MPS) look-up tables. The look-up table design also includes the conditional update of MPS based on the current value of State. The offset calculation for look-up table access is based on the MPS value and on the condition that decides whether the MPS or the LPS path is used to code the Symbol. Thus, the new encode symbol State look-up table consists of a total of four parts. In each part, the codeword consists of the next State information (LSB byte) and the effective MPS value (MSB byte). The efficient simulation code for the first two parts of the encode decision routine is given in Pcode 5.31, and the new derived look-up table with (MPS | State) for the efficient CABAC encode symbol is available on the companion website.

    tmp = pBAC->Range >> 6;
    tmp = tmp & 3;
    offset = pBAC->State << 2;
    offset = offset + tmp;
    rLPS = RangeLPS[offset];
    flag = (pBAC->MPS == pBAC->Symbol);
    offset = flag << 7;
    tmp = pBAC->MPS << 6;
    offset = offset + tmp;
    pBAC->Range = pBAC->Range - rLPS;
    if (!flag) pBAC->Low = pBAC->Low + pBAC->Range;
    if (!flag) pBAC->Range = rLPS;
    tmp = StateTbl[pBAC->State + offset];
    pBAC->State = tmp & 0xff;
    pBAC->MPS = tmp >> 8;

Pcode 5.31: Simulation code for CABAC encode symbol (without normalization).

We use a structure pointer pBAC = &BAC, where BAC = {Range, Value, State, MPS, Obits, Symbol, wordoffset, bitpos}, to handle the CABAC code parameters. For the CABAC decode symbol routine, the aforementioned new State look-up table can be used.
As the decoder derives the output Symbol from the MPS value of the present context, we can also embed this information in the new State look-up table, either by using the MSB of the existing look-up table elements for Symbol or by adding 1 more byte to each element of the look-up table to represent Symbol. The simulation code for the CABAC decode symbol routine (without normalization) is given in Pcode 5.32. As seen in Pcodes 5.31 and 5.32, the flow of the CABAC decode symbol routine differs from that of the encode symbol routine and consumes five more cycles.

    tmp = pBAC->Range >> 6;
    tmp = tmp & 3;
    offset = pBAC->State << 2;
    offset = offset + tmp;
    rLPS = RangeLPS[offset];
    pBAC->Range = pBAC->Range - rLPS;            // 3 to 4 stalls
    flag = (pBAC->Value >= pBAC->Range);
    offset = flag << 7;
    tmp = pBAC->MPS << 6;
    offset = offset + tmp;
    if (!flag) pBAC->Value = pBAC->Value + pBAC->Range;
    if (!flag) pBAC->Range = rLPS;
    tmp = StateTbl[pBAC->State + offset];
    pBAC->State = tmp & 0xff;
    pBAC->Symbol = tmp >> 15;                    // decoded Symbol in the MSB
    tmp = tmp & 0x7fff;
    pBAC->MPS = tmp >> 8;

Pcode 5.32: Simulation code for CABAC decode symbol (without normalization).

With this simulation, we consume (without the normalization process) approximately 14 cycles for the CABAC encode symbol routine and 20 cycles for the CABAC decode symbol routine on the reference embedded processor.

Normalization Process

In H.264 CABAC, we perform the normalization process to keep the value of Range greater than or equal to 256. In principle, the “while” loop in the normalization can be avoided if we precompute the number of times the loop is going to repeat. Mathematically, the while loop count is equal to the value of log2[256/Range], rounded up to the next integer. In other words, if we have an instruction that gives the number of leading zeros with respect to a halfword or word boundary, then we can get the value of log2[256/Range] directly. In the simulation code, we precompute the loop count using a “while” loop.
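A minimal sketch of the loop-count precomputation is shown below, assuming a non-zero Range of at most 9 bits. Both function names are made up for this illustration. The second function finds the position of the most significant set bit with a portable loop; on a processor with a leading-zeros instruction, that loop would collapse to a single instruction plus a correction.

```c
#include <stdint.h>

/* Reference: count how many times the normalization loop would iterate.
 * range must be non-zero (1..511 for a 9-bit Range). */
unsigned norm_count_loop(uint32_t range)
{
    unsigned n = 0;
    while (range < 256) {
        range <<= 1;
        n++;
    }
    return n;
}

/* Same count derived from the index of the most significant set bit,
 * i.e., ceil(log2(256 / range)) for range < 256 and 0 otherwise. */
unsigned norm_count_msb(uint32_t range)
{
    unsigned msb = 0;                    /* index of the highest set bit */
    while (range >> (msb + 1))
        msb++;
    return (msb >= 8) ? 0 : 8 - msb;     /* shifts needed to reach bit 8 */
}
```

For example, Range = 0x60 has its highest set bit at position 6, so two left shifts are needed to bring Range to at least 256.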
This loop count gives us the number of bits by which to normalize Range in the encode (or decode) symbol routine and, depending on Value, the number of bits to write to (or read from) the buffer (along with the outstanding bits in the case of the encode symbol routine). With this, we can do the whole normalization process in a single pass.

Encode Symbol Normalization Process

Implementing the bit-by-bit “while” loop normalization of the encode symbol routine as it is on the reference embedded processor is not acceptable because of its heavy conditional flow. As described previously, we precompute the loop count for normalization to avoid the iterative process. However, writing a variable number of outstanding bits to memory becomes complex in this case. The problem of storing outstanding bits exists even if we use the bit-by-bit “while” loop implementation. With the precomputed loop count, the logic for writing the bits to the buffer becomes more complicated because the loop count (or number of normalization bits), Value, and Obits (the number of outstanding bits) can take arbitrary values. Assuming that sufficient on-chip memory is available, a look-up table based approach eliminates most of the logic needed to implement the precomputed loop count encode symbol normalization process. The question is how much on-chip memory this look-up table based approach requires. Assuming that the minimum value Range can take according to the H.264 standard is 2 (which means that a left shift of Range by at most 7 bits is required), the maximum loop count is 7 (represented with 3 bits). This means that we need to analyze 3 bits of loop count information, 8 bits of Value (as explained later, the MSB of Value may flip depending on the value of Value after the normalization process), and a variable number of outstanding bits.
Assuming the number of outstanding bits to be n, the memory size required for the look-up table is 2^(3 + 8 + log2(n)) ∗ 4 ∗ (n/8) bytes. According to the H.264 standard, the upper limit on n is as high as 4,147,200 for a full D1-size video slice (which has 720 × 480 × 1.5 × 8 bits), so implementing such a look-up table is impractical. However, this methodology is the basis for the efficient normalization approach described in the following. The loop count and the outstanding bits count are the two important parameters in the look-up table based method. If we run the reference encoder with a few test vectors to gather statistics for these two parameters, we obtain the histograms shown in Figure 5.13.

Figure 5.13: Histograms. (a) Outstanding bits. (b) Normalization bits. [Plots not reproduced: number of occurrences versus (a) number of outstanding bits, 0 to 9, and (b) number of normalization bits, 0 to 7.]

As seen in the histograms, although the maximum number of outstanding bits allowed by the standard is much higher, the statistics show that more than 7 outstanding bits occur in only 2% of cases. Similarly, more than 3 normalization bits occur in only 3% of cases. Thus, if we consider 7 outstanding bits and 3 normalization bits for look-up table generation, almost 97% of the time we use the look-up table for the normalization process, and the remaining 3% of the time we jump out and run the costly bit-by-bit normalization process. The memory size required to implement the look-up table based approach with these parameters is 2^(4+3+2) ∗ 4 ∗ (7/8) ≈ 2 kB. This makes a good trade-off between cycles and memory.
The look-up table codeword contains the following information: (1) the updated Obits (4 bits), (2) the actual bits that go to the buffer (0 to 10 bits), (3) the length of the bits that go to the buffer (4 bits), and (4) a flag (1 bit) for the correction of Value’s MSB after normalization. Each codeword contains 19 bits of information, and these bits may be packed so that they are easily accessed from memory. The next example presents the functionality of the suggested method.

■ Example 5.4

Offset: 9 bits = loop count (2 bits) | Obits (3 bits) | Value (4 MSBs)
Codeword: n (4 bits) | bits (10 bits) | Obits (4 bits) | Flag (1 bit)
Look-up table size (or memory requirement): 512 entries (512 ∗ 4 = 2 kB)

Loop count = 3, Obits = 6, Value (4 MSBs) = 1101
Offset = 0x1ed (hex) = 11 110 1101 (bin)
Look-up table codeword: 1000 0000 1000 0001 0000 0001 0000 0000

    Iteration   n   Bits       Obits   Flag
    1           7   1000000    0       1
    2           8   10000001   0       0
    3           8   10000001   1       0    → store to look-up

■

The purpose of Value’s MSB correction is most easily explained with an example. Let us consider a 10-bit Value of 1011xxxxxx, whose 4 MSBs are 1011. If we normalize bit-by-bit as per the standard (as shown in Figure 5.11), after 2 bits of normalization we end up with Value as 01xxxxxx00. In the first iteration, as Value is greater than 512, we subtract 512 from Value, and it becomes 011xxxxxx0 after one left shift before the next iteration. In the second iteration, as Value is greater than 256, we subtract 256 from Value, and it becomes 01xxxxxx00 after one left shift at the end. If we instead use the normalization process with the precomputed “while” loop count, we left shift Value by two, and Value (kept to 10 bits) becomes 11xxxxxx00. If we compare the output values of the two methods, they are not the same: they differ in the MSB. To make them agree, we have to correct the MSB of Value in the suggested method of normalization. This is the purpose of the Value MSB correction.
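The MSB mismatch in the example above can be reproduced with a small C sketch. The two functions are illustrative assumptions (bit output and outstanding-bit handling are deliberately left out); they only track how a 10-bit Value evolves under bit-by-bit normalization and under a single combined shift.

```c
#include <stdint.h>

/* Bit-by-bit update of a 10-bit Value for `count` normalization iterations.
 * Only the arithmetic on Value is modeled; no bits are written out. */
uint32_t value_bit_by_bit(uint32_t value, unsigned count)
{
    while (count--) {
        if (value >= 512)
            value -= 512;
        else if (value >= 256)
            value -= 256;
        value <<= 1;
    }
    return value;
}

/* Single combined shift, keeping 10 bits. */
uint32_t value_single_shift(uint32_t value, unsigned count)
{
    return (value << count) & 0x3ff;
}
```

For Value = 1011000000 (0x2c0) and two iterations, the bit-by-bit path gives 0100000000 (0x100) and the single shift gives 1100000000 (0x300). The lower 9 bits agree, and only bit 9 differs, which is why the table-based code masks Value with 0x1ff and then restores bit 9 from the stored flag.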
The efficient simulation code for the look-up table based normalization approach is given in Pcode 5.33. The cycle savings with the suggested method are explained in the following example. Let us assume that the updated Range (before normalization) is 0x0060, Value is 0x0140, and the accumulated outstanding bits count (Obits) is 0x0006. Then the normalization loop count is equal to 2 (because two left shifts of Range are needed to make Range greater than or equal to 0x0100).

First, we estimate the cycle count for bit-by-bit normalization. In the first iteration of the while loop, Value is between 0x0100 and 0x0200. As this iteration involves two conditional jumps and four arithmetic operations, it takes around 20 cycles. The accumulated outstanding bits count becomes 0x0007 in the first iteration. In the second iteration, Value is less than 0x0100, so we need to write bit “0” to memory. In addition, we have to write the 7 outstanding bits in this iteration. The second iteration therefore contains a total of 8 bits of storing (80 cycles), two conditional jumps (16 cycles), and four ALU operations (4 cycles), which adds up to 100 cycles. The estimated total number of cycles consumed by bit-by-bit normalization for this example is about 120 cycles.

Now we estimate the cycle count for the suggested method. In this case, the loop count is computed in advance, and it takes 2 cycles (1 cycle for the leading zeros and 1 cycle for the correction) on the reference embedded processor.
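The bit-by-bit walk-through above can be traced with the following C sketch. It is a simplified model written for this example: the structure and function names are assumptions, emitted bits are collected in a single word instead of a memory buffer (so at most 32 bits can be collected), and the special handling of the very first output bit in the H.264 procedure is omitted.

```c
#include <stdint.h>

typedef struct {
    uint32_t range, low, obits;
    uint32_t out;       /* emitted bits, most recent bit in the LSB */
    unsigned nout;      /* number of emitted bits                   */
} RenormTrace;

/* Emit bit b, then flush all outstanding bits as the complement of b. */
static void put_bit(RenormTrace *s, unsigned b)
{
    s->out = (s->out << 1) | b;
    s->nout++;
    while (s->obits > 0) {
        s->out = (s->out << 1) | (b ^ 1u);
        s->nout++;
        s->obits--;
    }
}

/* Bit-by-bit encoder normalization until Range >= 256. */
void renorm_bit_by_bit(RenormTrace *s)
{
    while (s->range < 256) {
        if (s->low < 256) {
            put_bit(s, 0);
        } else if (s->low >= 512) {
            s->low -= 512;
            put_bit(s, 1);
        } else {
            s->low -= 256;
            s->obits++;             /* decision deferred: one more outstanding bit */
        }
        s->range <<= 1;
        s->low   <<= 1;
    }
}
```

Starting from Range = 0x0060, Value = 0x0140, and Obits = 6, the first iteration only increments Obits to 7, and the second iteration emits a 0 followed by seven 1s, for a total of 8 stored bits, in line with the count used in the cycle estimate.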
As the loop count and outstanding bits are within limits (which takes 3 cycles to confirm), we compute the offset to access the look-up table. With this, we consume 10 cycles (6 for the offset and 4 for loading) to get the look-up table value. Unpacking the parameters to store in memory then takes around 10 cycles, and packing the bits into the word for storing takes another 10 cycles. This adds up to a total of 35 cycles to perform the normalization for the previous example.

    tmp = 0;
    while (pBAC->Range < 256) {                 // precompute loop count
        pBAC->Range = pBAC->Range << 1;
        tmp++;
    }
    if ((tmp <= 3) && (pBAC->Obits <= 7)) {     // single flow normalization process
        x1 = pBAC->Low >> 6;
        x2 = pBAC->Obits << 4;
        x1 = x1 + x2;
        x3 = tmp << 7;
        x1 = x1 + x3;
        pBAC->Low = pBAC->Low << tmp;
        // x1: offset for look-up table, x1[8:7] -> nbits, x1[6:4] -> obits, x1[3:0] -> MSBs of Value
        c = NormTbl[x1];
        // c[31:28] -> length of bits, c[27:16] -> actual bits, c[15:8] -> obits, c[0] -> flag
        pBAC->Low = pBAC->Low & 0x1ff;
        flag = c & 1;
        pBAC->Low = pBAC->Low | (flag << 9);    // MSB correction
        tmp = c << 16;
        x2 = tmp >> 24;
        x3 = c >> 28;
        x1 = c & 0x0fff0000;
        pBAC->Obits = x2;
        x1 = x1 >> 16;
        if (x3) {                               // write bits to memory
            pBAC->bitpos = pBAC->bitpos - x3;
            x2 = 32;
            tmp = datx[pBAC->wordoffset];
            c = x1 << pBAC->bitpos;
            if (pBAC->bitpos < 0) c = x1 >> (-pBAC->bitpos);
            tmp = c | tmp;
            c = x2 + pBAC->bitpos;
            datx[pBAC->wordoffset] = tmp;
            x1 = x1 << c;
            datx[pBAC->wordoffset + 1] = x1;
            if (pBAC->bitpos <= 0) pBAC->wordoffset++;
            if (pBAC->bitpos <= 0) pBAC->bitpos += 32;
        }
    }
    else {
        while (tmp > 0) {
            // do bit-by-bit normalization as described in Pcode 5.28
            tmp = tmp - 1;
            if ((tmp <= 3) && (pBAC->Obits <= 7)) break;
        }
        // continue previously described normalization process when tmp and obits are within limits
    }

Pcode 5.33: Simulation code for look-up table based encode symbol normalization process.
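The offset and codeword layout used by the table-based normalization can be checked against the numbers of Example 5.4 with the sketch below. The type and function names are hypothetical; the field positions follow the shifts and masks in Pcode 5.33, and the single codeword value is the one quoted in the example (the full NormTbl[ ] is on the companion website and is not reproduced here).

```c
#include <stdint.h>

typedef struct {
    unsigned nbits;     /* length of the bits to write, c[31:28] */
    unsigned bits;      /* actual bits to write,        c[27:16] */
    unsigned obits;     /* updated outstanding bits,    c[15:8]  */
    unsigned flag;      /* Value MSB correction flag,   c[0]     */
} NormCodeword;

/* 9-bit table offset: loop count (2 bits) | Obits (3 bits) | 4 MSBs of Value. */
unsigned norm_tbl_offset(unsigned loop_count, unsigned obits, unsigned value_msbs)
{
    return (loop_count << 7) | (obits << 4) | (value_msbs & 0xf);
}

/* Split a 32-bit codeword into its fields. */
NormCodeword norm_unpack(uint32_t c)
{
    NormCodeword w;
    w.nbits = c >> 28;
    w.bits  = (c >> 16) & 0xfff;
    w.obits = (c >> 8) & 0xff;
    w.flag  = c & 1;
    return w;
}
```

With loop count 3, Obits 6, and Value MSBs 1101, the offset evaluates to 0x1ed, and the codeword 0x80810100 unpacks to 8 bits to write (10000001), an updated Obits of 1, and a flag of 0.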
We may not benefit from the suggested method if the loop count is 1 and the accumulated outstanding bits count is zero for the normalization process. The values of the normalization look-up table NormTbl[ ], which are used in the suggested method, can be found on the companion website. As we skip the normalization process when Range is greater than or equal to 256 (i.e., loop count = 0), the first 512 bytes of the look-up table are not used in the normalization process. To reduce the memory usage of the suggested method for efficient simulation of the H.264 binary arithmetic coder encode symbol routine, we can use these 512 bytes of memory to store the StateTbl[ ] look-up values. With this change, the total memory usage of the encode symbol routine, including the RangeLPS[ ] look-up table, is equal to 2.25 kB.

Decode Symbol Normalization Process

With the precomputed “while” loop count, the decode symbol normalization process becomes very simple, as the normalization of Range and Value and the number of bits to be read from memory depend only on the loop count. The simulation code for decode symbol normalization is given in Pcode 5.34.

    r0 = 0;
    while (pBAC->Range < 256) {                 // precompute loop count
        pBAC->Range = pBAC->Range << 1;
        r0++;
    }
    if (r0) {                                   // read bits from memory
        // The next three statements are reconstructed from context
        // (the extracted listing is garbled at this point).
        pBAC->Value = pBAC->Value << r0;
        r1 = 32;
        r2 = bit_stream[pBAC->wordoffset];
        r3 = bit_stream[pBAC->wordoffset + 1];
        r4 = r1 - pBAC->bitpos;
        r5 = r1 - r0;
        r2 = r2 << pBAC->bitpos;
        r3 = r3 >> r4;
        r2 = r2 + r3;
        pBAC->bitpos = pBAC->bitpos - r0;
        r2 = r2 >> r5;
        if (pBAC->bitpos <= 0) pBAC->wordoffset++;
        if (pBAC->bitpos <= 0) pBAC->bitpos += 32;
        pBAC->Value = pBAC->Value | r2;
    }

Pcode 5.34: Simulation code for decode symbol normalization process.

Further Optimization of Decode Symbol Normalization Process

On an embedded processor with limited MIPS, the decoder software modules have to be optimized to the maximum extent to run in real time. The cycle cost (from Pcode 5.34) of reading bits (bit FIFO) from the memory buffer bit_stream[ ] is about 14 cycles.
The cycle cost of reading the bitstream can be reduced to 5 cycles by reading bits from the buffer in 16-bit blocks instead of an arbitrary number of bits. By shifting Value 22 bits to the left and working with the upper halfword for Value (MSB aligned) manipulation and the lower halfword for bit FIFO functionality, we can reduce the cycle cost of reading bits from the buffer bit_stream[ ]. For this, we have to place the Range and rLPS values in the upper halfwords by shifting them 22 bits. At the time of the initialization of Value, we initialize Value with 32 bits instead of 9 bits and set the bit position to 16 instead of 23. Now, to access the bit FIFO and update the bit_stream[ ] buffer parameters (bitpos and wordoffset), we spend about 7 cycles in the simulation code given in Pcode 5.35.

    r0 = 0;
    while (pBAC->Range < 256) {                 // precompute loop count
        pBAC->Range = pBAC->Range << 1;
        r0++;
    }
    pBAC->Value = pBAC->Value << r0;
    pBAC->bitpos = pBAC->bitpos - r0;
    if (pBAC->bitpos <= 0) {                    // refill with the next 16-bit block
        pBAC->bitpos += 16;
        r1 = bit_stream[pBAC->wordoffset++];
        r1 = r1 << pBAC->bitpos;
        pBAC->Value = pBAC->Value | r1;
    }

Pcode 5.35: Efficient simulation of decode symbol normalization.

5.5.5 Simulation Results

We take a few Symbols (or bins, which are obtained after binarization of syntax elements) to encode and decode using CABAC. In addition, we assume the corresponding context values {State, MPS} for coding the Symbols as follows:

    Ctx[20][2] = {{24,1}, {18,1}, {14,1}, {21,0}, {12,0}, {4,1}, {1,0}, {0,1}, {18,1}, {10,0},
                  {5,0}, {17,1}, {11,0}, {2,1}, {16,0}, {20,0}, {7,1}, {8,0}, {3,1}, {9,1}};
Encode Symbol

Input:
    Symbols[20] = {1,1,1,0,0,0,0,0,1,0,1,1,0,0,0,1,1,1,1,1}; //bins
Initialization:
    pBAC->Range = 0x1fe; pBAC->Value = 0; pBAC->Obits = 0; pBAC->bitpos = 32; pBAC->wordoffset = 0;
Intermediate outputs after encoding 1 symbol:
    pBAC->Range = 0x01b9; pBAC->Value = 0x0000; pBAC->Obits = 0; pBAC->bitpos = 32; pBAC->wordoffset = 0
Intermediate outputs after encoding 5 symbols:
    pBAC->Range = 0x0146; pBAC->Value = 0x0000; pBAC->Obits = 0; pBAC->bitpos = 31; pBAC->wordoffset = 0
Intermediate outputs after encoding 10 symbols:
    pBAC->Range = 0x0115; pBAC->Value = 0x0060; pBAC->Obits = 3; pBAC->bitpos = 30; pBAC->wordoffset = 0
Intermediate outputs after encoding 20 symbols:
    pBAC->Range = 0x01a8; pBAC->Value = 0x0178; pBAC->Obits = 0; pBAC->bitpos = 17; pBAC->wordoffset = 0
Bitstream at end of 20 symbols encoding (includes a few dummy encoded bits):
    bit_stream[] = {0x001e78f1, 0x00000000, 0x00000000, 0x00000000,… }

Decode Symbol

Input: Encoded bitstream[].
    bit_stream[] = {0x001e78f1, 0x00000000, 0x00000000, 0x00000000,… }
Initialization:
    pBAC->Range = 0x01fe; pBAC->Value = 0x0079; pBAC->wordoffset = 0; pBAC->bitpos = 13;
Intermediate outputs after decoding 1 decision:
    pBAC->Range = 0x01b9; pBAC->Value = 0x0079; pBAC->wordoffset = 0; pBAC->bitpos = 13;
Intermediate outputs after decoding 5 decisions:
    pBAC->Range = 0x0146; pBAC->Value = 0x00f3; pBAC->wordoffset = 0; pBAC->bitpos = 12;
Intermediate outputs after decoding 10 decisions:
    pBAC->Range = 0x0115; pBAC->Value = 0x00dc; pBAC->wordoffset = 0; pBAC->bitpos = 8;
Intermediate outputs after decoding 20 decisions:
    pBAC->Range = 0x01a8; pBAC->Value = 0x006a; pBAC->wordoffset = 1; pBAC->bitpos = 30;
Decoded symbols:
    bins[20] = {1,1,1,0,0,0,0,0,1,0,1,1,0,0,0,1,1,1,1,1}; // Symbols

Part 2 Digital Signal and Image Processing

CHAPTER 6 Signals and Systems

Raw signals are processed using signal-processing algorithms (e.g., DFT, DCT, FIR filters, IIR filters, correlation, LMS and RLS adaptive filters, and so on, to be discussed in subsequent chapters) to get the desired signal output. Signal processing algorithms have many applications, including telecommunications, medical, aerospace, radar, sonar, and weather forecasting. Advances in semiconductor technology make real-time processing of signals possible for many applications. This chapter addresses the fundamentals of signals and signal processing.

6.1 Introduction to Signals

A signal is a measure of a physical phenomenon, such as temperature, pressure, electric voltage, or radioactive decay, with respect to time or space. If we measure temperature during a day from 6 AM to 6 PM and plot the values, the plot may resemble Figure 6.1. Typically, the x-axis (or horizontal axis) is used to represent the independent variables (e.g., time, space), and the y-axis (or vertical axis) is used to represent the measured quantity (e.g., weight, amplitude) of dependent variables (e.g., temperature, voltage). Figure 6.1 shows how the temperature measured from 6 AM to 6 PM on a particular day varies with time. We call such a measured quantity with respect to time a signal. The temperature signal cannot be calculated using a well-defined mathematical equation (because it depends on many factors such as weather, season, Earth orientation, etc.). Such signals are random. On the other hand, signals whose behavior can be exactly predicted by mathematical equations are called deterministic. Examples of deterministic signals are sine waves, square waves, and staircase signals.

6.1.1 Deterministic Signals

Deterministic signals can be expressed precisely with mathematical formulas. In this subsection, we discuss various basic signals that appear in the subsequent chapters on signal processing.

Sinusoidal Functions

Sinusoidal signals play an important role in signal processing applications. Here we discuss various representations of a sinusoidal signal.
The simple representation of a sinusoidal signal (abbreviated as sin) is y(t) = sin(t). This means that the value of the sinusoid at time t is y(t); the plot of y(t) = sin(t) is shown in Figure 6.2.

Figure 6.1: Signal representation of temperature (in °C) from 6 AM to 6 PM. [Plot not reproduced: temperature (°C), 10 to 50, versus time (hrs), 6 AM to 6 PM.]

© 2010 Elsevier Inc. All rights reserved. DOI: 10.1016/B978-1-85617-678-1.00006-5

Figure 6.2: Sine waveform y(t) = sin(t). [Plot not reproduced: y(t), −1 to 1, versus t, with one period T marked.]

As seen in Figure 6.2, the sine wave is periodic with period T, meaning that it repeats itself at regular intervals of time T. However, with the representation y(t) = sin(t), the period of the sine wave is not visible in the equation. In addition, we cannot say how many times the sine wave repeats in one unit of the time interval. Therefore, we represent the sine wave in a more transparent way:

$$y(t) = \sin(2\pi f t) \tag{6.1}$$

Using the representation given in Equation (6.1), we obtain more information about the sine wave. For example, if we plot y(t) = sin(2πft) for f = 1, 2, and 3, as shown in Figure 6.3(a), (b), and (c), we can clearly see that the parameter f controls the number of cycles present in one unit of the time interval. With f = 1, we have one cycle of the sine wave in one unit of the time interval. With f = 2, we have two cycles of the sine wave in one unit of the time interval, and so on. If T is the period of the sine wave, then f = 1/T is the frequency of the sine wave. If time T is measured in seconds, then the quantity f gives the cycles per second. One cycle per second is equivalent to 1 hertz (Hz). With the representation in Equation (6.1), we can easily determine the number of cycles present in 1 second; that is, we know the frequency of the sine wave.
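The relation f = 1/T can be verified directly from Equation (6.1). The short derivation below is a worked check, not taken from the book: shifting the time argument by 1/f adds exactly 2π to the phase, so the waveform repeats.

```latex
\begin{aligned}
y\!\left(t + \tfrac{1}{f}\right)
  &= \sin\!\left(2\pi f\left(t + \tfrac{1}{f}\right)\right) \\
  &= \sin(2\pi f t + 2\pi) \\
  &= \sin(2\pi f t) = y(t)
  \quad\Longrightarrow\quad T = \tfrac{1}{f}.
\end{aligned}
```

Applied to y(t) = sin(t), which corresponds to 2πf = 1, the same argument gives T = 2π ≈ 6.28 time units and f = 1/(2π) ≈ 0.159 cycles per time unit, which is the information that the plain form sin(t) does not show explicitly.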
The sine wave notation given in Equation (6.1) is commonly used in signal processing algorithms. A more general form of the sinusoidal function can be expressed as follows:

$$x(t) = A \sin(2\pi f t + \phi) = A \sin(\omega t + \phi) \tag{6.2a}$$

where A is the peak amplitude, φ is the initial phase (or phase offset), and ω = 2πf is the angular frequency (measured in radians/second). The quantity ωt + φ gives the instantaneous phase of the sinusoid in radians. Figure 6.4(a) and (b) show how the amplitude and phase offset modify the pure sinusoid function. When the phase value is φ = π/2, we have a special case, and the resulting waveform is called a cosinusoid (or cos) function, as shown in Figure 6.4(c) with the dotted line. This means that sin(ωt) lags cos(ωt) by π/2 radians (or 90°), or cos(ωt) leads sin(ωt) by 90°.

Cosine and sine are often represented in complex number notation to perform signal processing tasks more efficiently. In particular, the multiplication and division operations on sinusoids become very easy with the complex number representation. From the phasor (i.e., a rotating vector) diagram shown in Figure 6.4(d), the rectangular coordinates (a, b) are obtained from the polar coordinates (A, ωt) as a = A cos ωt and b = A sin ωt. Using the famous Euler formula,

$$a + jb = A(\cos \omega t + j \sin \omega t) = A e^{j\omega t} \tag{6.2b}$$

Based on Equation (6.2b), $a = A\cos\omega t = \mathrm{Re}(Ae^{j\omega t})$ and $b = A\sin\omega t = \mathrm{Im}(Ae^{j\omega t})$. If $P = a + jb = Ae^{j\omega t}|_{t=t_n}$, then the amplitude is $A = |P| = \sqrt{a^2 + b^2}$, and the instantaneous phase is $\omega t_n = \angle P = \tan^{-1}(b/a)$. Note that A is the distance of the point P from the origin. For this reason, A is also called the magnitude. The conjugate of P is written $P^*$ and is defined as $P^* = a - jb = Ae^{-j\omega t}$.
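The conjugate and magnitude relations can be checked numerically in rectangular coordinates with a few lines of C. The `Cplx` type and the function names are assumptions made for this sketch; it uses the squared magnitude, A² = P·P∗, so that no square root is needed.

```c
/* Minimal complex number in rectangular form, P = re + j*im. */
typedef struct { double re, im; } Cplx;

/* Conjugate: P* = a - jb. */
Cplx cplx_conj(Cplx p)
{
    Cplx r = { p.re, -p.im };
    return r;
}

/* Product in rectangular form: (a + jb)(c + jd) = (ac - bd) + j(ad + bc). */
Cplx cplx_mul(Cplx p, Cplx q)
{
    Cplx r = { p.re * q.re - p.im * q.im,
               p.re * q.im + p.im * q.re };
    return r;
}

/* Squared magnitude A^2 = a^2 + b^2, computed as P times P*. */
double cplx_mag2(Cplx p)
{
    return cplx_mul(p, cplx_conj(p)).re;
}
```

For P = 3 + j4, the product P·P∗ is purely real and equal to 25, so A = 5, which agrees with A = √(a² + b²).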
With this, the multiplication of two complex numbers (indirectly, sinusoid values) $P_1 = A_1 e^{j\omega_1 t}$ and $P_2 = A_2 e^{j\omega_2 t}$ can be computed easily as $P_1 P_2 = A_1 A_2 e^{j(\omega_1 + \omega_2)t}$, and the division of the two complex numbers $P_1$ and $P_2$ is computed as $P_1/P_2 = P_1 P_2^* / (P_2 P_2^*) = (A_1/A_2)\, e^{j(\omega_1 - \omega_2)t}$. As additions and subtractions are easy to perform using rectangular coordinates, we frequently switch between polar and rectangular coordinates in computations involving sinusoids.

Figure 6.3: Plots of the sine wave y(t) = sin(2πft): (a) f = 1 Hz, (b) f = 2 Hz, and (c) f = 3 Hz. [Plots not reproduced: y(t), −1 to 1, versus t (sec), 0 to 1.]

Selected Important Deterministic Signals

Dirac Delta Function or Impulse Function

The Dirac delta function is an interesting and ideal function that is used for many theoretical purposes. The Dirac delta function δ(t) is defined as follows:

$$\delta(t) = \begin{cases} \infty & \text{if } t = 0 \\ 0 & \text{otherwise} \end{cases} \tag{6.3}$$

The signal diagram of the Dirac delta function is shown in Figure 6.5(a). An important property of the Dirac delta function is that the area under it is 1; that is, it integrates to 1. This can be visualized as shown in Figure 6.5(b). The area of the rectangle shown is 1 (since area = width × height = a × 1/a = 1). Now, what happens if a approaches zero? In the limiting case, we have the Dirac delta with unit area. If we multiply any function f(t) by the Dirac delta function δ(t) and integrate, we get f(0), the value of the function f(t) at t = 0. Similarly, if we multiply f(t) by δ(t − T), a version of the Dirac delta shifted by time T, and integrate, we get f(T), the value of the function f(t) at t = T.

Figure 6.4: Sinusoid functions and phasor representation. (a) f = 1, φ = 0. (b) f = 1, A = 1. (c) Relation between sine and cosine. (d) Phasor representation. [Plots not reproduced: (a) curves for A = 1 and A = 3; (b) curves for phase 0 and phase π/3; (c) sin and cos curves; (d) phasor P of length A at angle ωt, with real part a and imaginary part b, rotating at angular speed ω rad/sec.]

Figure 6.5: (a) Dirac delta function. (b) Rectangle function with width a and height 1/a.

Constant Function

A constant function c(t), also referred to as the DC value, is defined as follows:

$$c(t) = C, \quad -\infty < t < \infty$$

Figure 6.6 shows a signal diagram of the constant function.

Rectangular Pulse

A rectangular pulse r(t) with width T and constant height C is defined as

$$r(t) = \begin{cases} 0 & \text{if } t < -T/2 \\ C & \text{if } -T/2 \le t \le T/2 \\ 0 & \text{if } t > T/2 \end{cases} \tag{6.4}$$

and a schematic diagram of the rectangular pulse is shown in Figure 6.7.

Unit Step Function

A unit step function u(t) is defined as

$$u(t) = \begin{cases} 0 & \text{if } t < 0 \\ 1 & \text{if } t \ge 0 \end{cases} \tag{6.5}$$

and a signal diagram of the unit step function is shown in Figure 6.8.

Signum Function

A signum function, sgn(t), is defined as follows:

$$\mathrm{sgn}(t) = \begin{cases} -1 & \text{if } t < 0 \\ 0 & \text{if } t = 0 \\ 1 & \text{if } t > 0 \end{cases} \tag{6.6}$$

Figure 6.9 shows a signal diagram of the sgn(t) function. Any real value x can be expressed as the product of its absolute value and its signum function as x = |x| sgn(x).

Figure 6.6: Signal diagram of constant function.

Figure 6.7: Signal diagram of rectangular pulse function.

Figure 6.8: Signal diagram of unit step function.

Figure 6.9: Signal diagram of signum function.
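The piecewise definitions of the rectangular pulse, unit step, and signum functions translate almost line for line into C. The function names below are chosen for this sketch only; `abs_val` is a small local helper so that the example needs no math library.

```c
/* Rectangular pulse of width T and height C, centered at t = 0 (Eq. 6.4). */
double rect_pulse(double t, double T, double C)
{
    return (t < -T / 2 || t > T / 2) ? 0.0 : C;
}

/* Unit step: 0 for t < 0, 1 for t >= 0 (Eq. 6.5). */
double unit_step(double t)
{
    return (t < 0.0) ? 0.0 : 1.0;
}

/* Signum: -1, 0, or +1 according to the sign of t (Eq. 6.6). */
double signum(double t)
{
    return (double)((t > 0.0) - (t < 0.0));
}

/* Absolute value without the math library. */
double abs_val(double x)
{
    return (x < 0.0) ? -x : x;
}
```

The identity x = |x| sgn(x) can then be checked for any sample value, for instance x = −2.5 gives 2.5 × (−1) = −2.5.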
Sinc Function

The sinc function is widely used in signal processing and communication systems. It has two versions, the un-normalized sinc function and the normalized sinc function, given in Equations (6.7) and (6.8), respectively. The un-normalized sinc function is defined as

$$\mathrm{sinc}(t) = \frac{\sin(t)}{t}, \quad -\infty < t < \infty \tag{6.7}$$

As shown in Figure 6.10, the zero-crossings of the un-normalized sinc function are at non-zero multiples of π (≈ 3.14). The normalized sinc function is defined as

$$\mathrm{sinc}(t) = \frac{\sin(\pi t)}{\pi t}, \quad -\infty < t < \infty \tag{6.8}$$

As shown in Figure 6.11, the zero-crossings of the normalized sinc function occur at non-zero integer values. The normalized sinc function has important properties that make it ideal in relation to interpolation (since sinc(0) = 1 and sinc(k) = 0 for non-zero integers k) and band-limited functions (if $x_k(t) = \mathrm{sinc}(t - k)$, then the $x_k(t)$ form an orthonormal basis for band-limited functions in the $L^2(\mathbb{R})$ function space). The sinc function is also related to the Dirac delta function as follows:

$$\lim_{a \to 0} \frac{1}{a}\, \mathrm{sinc}(t/a) = \delta(t) \tag{6.9}$$

Figure 6.10: Un-normalized sinc(t) function. [Plot not reproduced: sinc(t) versus t, −20 to 20.]

Figure 6.11: Plot of normalized sinc function. [Plot not reproduced: sinc(t) versus t, −8 to 8.]

6.1.2 Random Signals

Unlike deterministic signals, random signals are not so easy to handle. Random signals cannot be generated by simple, well-defined mathematical equations, and their future values cannot be predicted. Rather, we must use probability and statistics to analyze their behavior. The plot of one such random signal y(t) is shown in Figure 6.12. As the random signal pattern varies from time to time, processing individual random signals does not make sense; instead, we process ensembles (groups of random signals). At this juncture, you might ask: Why do I need to study random signals? Why do I need to process them? The answer is simple.
In the real world (in nature), deterministic signals are always accompanied by random noise (see Section 9.1.2 for details on noise generation and measurement in a communication system environment). To analyze deterministic signals, we first have to minimize the effect of noise, and for this we have to process (measure, classify, and eliminate) the random signals (or noise). To process random signals, we use statistical measures such as the mean, variance, standard deviation, and so on. Before defining these statistical measures, we introduce the concepts of random variable and random process. Examples of random variables and random processes are provided to give an overview of random signals. For definitions and fundamentals of random variables and processes, see Papoulis (1984) and Leon-Garcia (1994).

Random Variables

Typically, we process numerical data with digital computers. However, the output of an experiment (or the events in a sample space) need not be a number. For example, if we conduct an experiment of tossing a coin, then the outcome of that experiment is a head or a tail. The sample space (or set of all outcomes) of this coin experiment is S = {head, tail}. For some experiments, the elements of the sample space cannot be measured directly (e.g., {head, tail} in the coin experiment). If we map these events (or subsets of the sample space) to a measurable space through a mapping function X, then we call the mapping function X a random variable. If we choose the real numbers as the measurable space, then the random variable X maps the sample space to real numbers. For example, the head and tail events of a coin-tossing experiment may be mapped to +0.5 and −0.5 through the random variable X.

But, you may ask, why do we need a random variable? If we toss a coin, how long will it take to get the first head? How can we answer such a question? The head may appear in the 1st, 2nd, 5th, or 10th toss, and so on. Clearly, we cannot answer such a question with a single number.
However, using the random variable concept, we can answer the preceding coin-tossing question in probabilistic terms.

Figure 6.12: Plot of random signal. [Plot not reproduced: y(t), 0 to 16, versus t, 0 to 100.]

Let X be a random variable mapping the event of the first head in a coin-tossing experiment conducted N times. Let $\zeta_i$ be the outcome of a coin-toss experiment with probabilities

$$\Pr(\zeta_1 = \text{Head}) = p \quad \text{and} \quad \Pr(\zeta_2 = \text{Tail}) = q = 1 - p, \quad \text{with} \quad \sum_{j=1}^{2} \Pr(\zeta_j) = 1$$

For an unbiased coin, p = q = 0.5. Assume that we get the first head in the m-th experiment. Then $\Pr(X = \text{first head}) = q^{m-1} p$, because the first m − 1 tosses must all be tails and the m-th toss must be a head. Similarly, in a dice-throwing experiment, we have six outcomes, with the dice facing up with 1, 2, 3, 4, 5, or 6. Now, assume that we want to know the probability that the outcome of a dice-throwing experiment does not exceed 4. We can answer this question in the same way as for the coin-tossing experiment, using the concept of a random variable, with Pr(X less than or equal to 4) = 4/6 = 2/3, assuming equal probabilities for all six outcomes.

Continuous and Discrete Random Variables

The outcome of an experiment need not always be discrete, as it is in the coin-tossing experiment. If we consider the lifetime of an electric bulb, then it can be any time until the bulb burns out. If the random variable represents the lifetime of an electric bulb, then that random variable takes continuous rather than discrete values. If a random variable can take only a finite number of distinct values, then it must be discrete. On the other hand, a continuous random variable is one that takes values from a continuum, and so has infinitely many possible values. Typically, discrete random variables apply to countable events, and continuous random variables apply to measurable events. Examples of discrete random variables include the number of heads showing up when we toss a coin 10 times, the number of defective bulbs in an electronics store, the number of passengers in a train, and so on.
Examples of continuous random variables include the output voltage level of an electric circuit, the temperature at a given time, the travel time between cities, and so on.

Probability Mass, Probability Density, and Cumulative Distribution Functions

In general, the random variable X is associated with a probability. Typically, the probability mass function (pmf) is used with discrete random variables, whereas the probability density function (pdf) is used with continuous random variables. The probability distribution of a discrete random variable is a list of the probabilities associated with each possible outcome represented by the random variable. If X is a discrete random variable with associated probability mass function $f_X(x)$, and if $x_i$ represents the i-th value in the range of the random variable X, then $f_X(x_i) = \Pr(X = x_i)$. For example, if we conduct an experiment of tossing a coin 10 times and map the outcome of heads to a measurable space with the random variable X, then the random variable takes 11 discrete values $x_i$, where 0 ≤ i ≤ 10 and the index i represents the count of heads in each experiment. The probability of $x_k$ (i.e., of getting k heads) is given by

$$\Pr(X = x_k) = \binom{10}{k} p^k (1 - p)^{10-k}$$

where p is the probability of a head. For an unbiased coin (i.e., the probability of getting a head and the probability of getting a tail are both equal to 0.5), the pmf for the random variable X is shown in Figure 6.13. The probability of getting zero heads is $f_X(x_0) = \Pr(X = x_0) = 1/2^{10}$, the probability of getting one head is $f_X(x_1) = \Pr(X = x_1) = 10/2^{10}$, and so on. The pmf always satisfies the following two conditions:

$$0 \le f_X(x_i) \le 1, \qquad \sum_i f_X(x_i) = 1$$

Given a random variable X, the probability function $F_X(x) = \Pr(X \le x)$, where x is any real number in the interval (−∞, ∞), is called its cumulative distribution function (cdf).
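The coin-tossing pmf and its cdf can be evaluated with a short C sketch. The function names are illustrative assumptions; the binomial coefficient is built up as a running product in double precision, which is exact for the small n used here, and powers are computed by repeated multiplication to avoid the math library.

```c
/* Binomial coefficient C(n, k) as a double; exact for small n.
 * After step i the running value equals C(n - k + i, i), an integer. */
double binom_coeff(unsigned n, unsigned k)
{
    double c = 1.0;
    for (unsigned i = 1; i <= k; i++)
        c = c * (double)(n - k + i) / (double)i;
    return c;
}

/* pmf: Pr(X = k) = C(n, k) p^k (1 - p)^(n - k), for 0 <= k <= n. */
double binom_pmf(unsigned n, unsigned k, double p)
{
    double v = binom_coeff(n, k);
    for (unsigned i = 0; i < k; i++)
        v *= p;
    for (unsigned i = 0; i < n - k; i++)
        v *= 1.0 - p;
    return v;
}

/* cdf: Pr(X <= k), the running sum of the pmf. */
double binom_cdf(unsigned n, unsigned k, double p)
{
    double s = 0.0;
    for (unsigned i = 0; i <= k; i++)
        s += binom_pmf(n, i, p);
    return s;
}
```

For n = 10 and p = 0.5, `binom_pmf(10, 0, 0.5)` gives 1/1024 and `binom_pmf(10, 1, 0.5)` gives 10/1024, matching the values quoted above, and `binom_cdf(10, 10, 0.5)` sums the 11 pmf values to 1.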
The function $F_X(x)$ is a nondecreasing function and satisfies the following conditions:

\[ 0 \le F_X(x) \le 1 \]
\[ F_X(-\infty) = 0 \quad \text{and} \quad F_X(+\infty) = 1 \]
\[ \Pr(x_1 < X \le x_2) = F_X(x_2) - F_X(x_1) \]

Figure 6.13: Probability mass function for getting heads in tossing a coin 10 times.

Figure 6.14: Cumulative distribution function for getting heads in tossing a coin 10 times.

Figure 6.15: Continuous random variable. (a) pdf. (b) cdf.

The cdf of the coin-experiment random variable, whose pmf appears in Figure 6.13, is shown in Figure 6.14. In the case of discrete random variables, the associated cdf always has jumps in the distribution curve, as shown in Figure 6.14. Like discrete random variables, continuous random variables are associated with a probability function, the probability density function (pdf). As shown in Figure 6.15(a) and (b), both the pdf and cdf of a continuous random variable are smooth, unlike the probability functions of a discrete random variable. In some practical problems, we may also encounter a random variable of mixed type. The cdf of such a random variable is a smooth, nondecreasing function in certain parts of the real line, and contains jumps at certain discrete values of x.

The pdf, $p_X(x)$, of a continuous random variable can be obtained by differentiating the cdf $F_X(x)$. Thus, we have

\[ p_X(x) = \frac{d}{dx} F_X(x), \quad -\infty < x < \infty \tag{6.10} \]
\[ F_X(x) = \int_{-\infty}^{x} p_X(y)\,dy, \quad -\infty < x < \infty \tag{6.11} \]

The two most widely used probability distributions in signal processing are the uniform and the Gaussian. The uniform distribution is used to generate random numbers in a given interval. The pdf $p_X(x)$ and the cdf $F_X(x)$ of a uniform random variable are shown in Figure 6.16(a) and (b). The Gaussian distribution is widely used in digital media processing applications for noise modeling; the pdf and cdf of a Gaussian distributed random variable are shown in Figure 6.17(a) and (b), respectively.
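The relationship between pdf and cdf in Equation (6.11) can be tried out on the uniform distribution. The sketch below assumes a uniform random variable on an interval [u, v] with density 1/(v − u); the function names and the midpoint-rule integration are illustrative choices, not taken from the text.

```c
#include <assert.h>

/* pdf of a uniform random variable on [u, v]: 1/(v - u) inside, 0 outside. */
double uniform_pdf(double x, double u, double v)
{
    assert(v > u);
    return (x >= u && x <= v) ? 1.0 / (v - u) : 0.0;
}

/* Closed-form cdf of the same random variable. */
double uniform_cdf(double x, double u, double v)
{
    assert(v > u);
    if (x <= u) return 0.0;
    if (x >= v) return 1.0;
    return (x - u) / (v - u);
}

/* Numerical version of Eq. (6.11): integrate the pdf from u up to x with
   the midpoint rule (the pdf is zero below u, so the integral starts at u). */
double uniform_cdf_numeric(double x, double u, double v, int steps)
{
    assert(steps > 0);
    if (x <= u) return 0.0;
    double h = (x - u) / steps;
    double sum = 0.0;
    for (int i = 0; i < steps; i++)
        sum += uniform_pdf(u + (i + 0.5) * h, u, v) * h;
    return sum;
}
```

For x inside [u, v], the numerically integrated pdf agrees with the closed-form cdf to within rounding error, which is the statement of Equation (6.11) for this distribution.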
Figure 6.16: Uniform distribution on the interval [u, v]: (a) pdf, with height 1/(v − u), and (b) cdf.

Figure 6.17: Gaussian distribution: (a) pdf and (b) cdf.

Multiple Random Variables and the Joint Probability Density Function

In practice, we encounter random phenomena resulting from multiple sources instead of just a single source. We measure the events or random variables of combined experiments using joint probabilities. Let us consider two random variables $X_1$ and $X_2$, each of which may be continuous, discrete, or mixed. The joint cdf for the two random variables is defined as

\[ F_{X_1,X_2}(x_1, x_2) = \Pr(X_1 \le x_1, X_2 \le x_2) = \int_{-\infty}^{x_1} \int_{-\infty}^{x_2} p_{X_1,X_2}(y_1, y_2)\,dy_1\,dy_2 \tag{6.12} \]

or, equivalently,

\[ p_{X_1,X_2}(x_1, x_2) = \frac{\partial^2}{\partial x_1 \partial x_2} F_{X_1,X_2}(x_1, x_2) \tag{6.13} \]

When the joint pdf $p_{X_1,X_2}(x_1, x_2)$ is integrated over one of the variables, we obtain the density function of the other variable as follows:

\[ \int_{-\infty}^{\infty} p_{X_1,X_2}(x_1, x_2)\,dx_1 = p_{X_2}(x_2), \qquad \int_{-\infty}^{\infty} p_{X_1,X_2}(x_1, x_2)\,dx_2 = p_{X_1}(x_1) \tag{6.14} \]

The pdfs $p_{X_2}(x_2)$ and $p_{X_1}(x_1)$ obtained from the joint pdf by integrating over the other random variable are called marginal pdfs.

Conditional Probability Density Functions

In some cases, we may have an idea about one random phenomenon in a combined experiment (e.g., a priori knowledge of symbol sets that are transmitted to the receiver). If one random variable $X_1$ is given, then we obtain the conditional pdf of another random variable $X_2$ as follows:

\[ p_{X_2|X_1}(x_2|x_1) = \frac{p_{X_1,X_2}(x_1, x_2)}{p_{X_1}(x_1)} \tag{6.15} \]

Here, $p_{X_2|X_1}(x_2|x_1)$ is called the probability density of $X_2$ given $X_1$. We can also express the joint pdf $p_{X_1,X_2}(x_1, x_2)$ in terms of the conditional pdfs as follows:

\[ p_{X_1,X_2}(x_1, x_2) = p_{X_2|X_1}(x_2|x_1)\,p_{X_1}(x_1) = p_{X_1|X_2}(x_1|x_2)\,p_{X_2}(x_2) \tag{6.16} \]

Bayes Theorem

Given Equation (6.16), we can write $p_{X_1|X_2}(x_1|x_2)$ as

\[ p_{X_1|X_2}(x_1|x_2) = \frac{p_{X_2|X_1}(x_2|x_1)\,p_{X_1}(x_1)}{p_{X_2}(x_2)} \tag{6.17} \]

The Bayes theorem is a simple mathematical formula used for calculating conditional probabilities.
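A discrete counterpart of Equation (6.17) can be sketched in C. Here $X_1$ is assumed to take a small number of values with known probabilities, and the denominator is obtained by summing over those values, the discrete analogue of the marginalization in Equation (6.14). The example arrays and function names are hypothetical and chosen only for illustration.

```c
#include <assert.h>

/* Hypothetical example data: probabilities of three values of X1, and the
   probability of one particular observed value of X2 given each of them. */
static const double example_prior[3]      = { 0.5, 0.3, 0.2 };
static const double example_likelihood[3] = { 0.1, 0.6, 0.3 };

/* Probability of the observation: sum over j of Pr(x2 | x1j) Pr(x1j). */
double evidence(const double *prior, const double *likelihood, int m)
{
    double sum = 0.0;
    for (int j = 0; j < m; j++)
        sum += likelihood[j] * prior[j];
    return sum;
}

/* Conditional probability of the i-th value of X1 given the observation,
   following the structure of Eq. (6.17). */
double posterior(const double *prior, const double *likelihood, int m, int i)
{
    assert(i >= 0 && i < m);
    double ev = evidence(prior, likelihood, m);
    assert(ev > 0.0);
    return likelihood[i] * prior[i] / ev;
}
```

With the example data, the second value of $X_1$ starts with probability 0.3 but has the largest likelihood, so its conditional probability rises after the observation; the conditional probabilities over all three values still sum to 1.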
Equation (6.17) represents the simplest form of the Bayes theorem. The theorem allows new information to be used to update the conditional probability of a random variable in a combined experiment. For example, we can consider a digital communication system as a combined experiment in which the transmitted messages are mutually exclusive events represented by random variable $X_1$. Let us say that M messages are transmitted in a given time interval and $\Pr(x_{1i})$ represents the a priori probability of the i-th message. Assume that $X_2$ is a random variable of another event of the combined experiment, and that $X_2$ represents the received noisy message that contains one of the M transmitted messages. Given $X_2$, the a posteriori probability of $X_{1i}$ conditioned on having observed the received signal $X_2$ (i.e., $\Pr(x_{1i}|x_2)$) is obtained by using the generalized Bayes theorem as follows:

\[ \Pr(x_{1i}|x_2) = \frac{\Pr(x_2|x_{1i})\,\Pr(x_{1i})}{\Pr(x_2)} = \frac{\Pr(x_2|x_{1i})\,\Pr(x_{1i})}{\sum_{j=1}^{M} \Pr(x_2|x_{1j})\,\Pr(x_{1j})} \tag{6.18} \]

Thus, if we assume that event $X_2$ arises with probability $\Pr(x_2|x_{1i})$ from each of the underlying messages $X_{1i}$, i = 1, 2, . . . , M, we can use our observation of the occurrence of $X_2$ to update our a priori assessment of the probability of occurrence of each message, $\Pr(x_{1i})$, to an improved a posteriori estimate, $\Pr(x_{1i}|x_2)$.

Statistical Independence

What if the two random variables are not at all related (i.e., the occurrence of random variable $X_1$ has nothing to do with the occurrence of random variable $X_2$)? In this case, what happens to the conditional probability? If the occurrence of random variable $X_1$ does not depend on the occurrence of random variable $X_2$, then the conditional pdfs are $p_{X_1|X_2}(x_1|x_2) = p_{X_1}(x_1)$ and $p_{X_2|X_1}(x_2|x_1) = p_{X_2}(x_2)$. Based on Equation (6.16), $p_{X_1,X_2}(x_1, x_2) = p_{X_1}(x_1)\,p_{X_2}(x_2)$. Thus, if two random variables $X_1$ and $X_2$ are statistically independent, then their joint pdf $p_{X_1,X_2}(x_1, x_2)$ is given by the product of their individual pdfs $p_{X_1}(x_1)$ and $p_{X_2}(x_2)$.
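The product rule for independent random variables can be tested on a small discrete joint distribution. The sketch below computes the marginals by summing a joint pmf table (the discrete analogue of Equation (6.14)) and checks whether every entry equals the product of its marginals. The table sizes, the two example tables, and the function names are hypothetical.

```c
#include <assert.h>

#define N1 2   /* number of values of X1 (illustrative) */
#define N2 3   /* number of values of X2 (illustrative) */

/* Hypothetical joint pmf built as the product of {0.4, 0.6} and {0.5, 0.3, 0.2}. */
static const double indep_joint[N1][N2] = {
    { 0.20, 0.12, 0.08 },
    { 0.30, 0.18, 0.12 }
};

/* Hypothetical joint pmf that does not factor into its marginals. */
static const double dep_joint[N1][N2] = {
    { 0.30, 0.05, 0.05 },
    { 0.10, 0.25, 0.25 }
};

/* Marginal probability Pr(X1 = i): sum the joint pmf over X2. */
double marginal_x1(const double joint[N1][N2], int i)
{
    double sum = 0.0;
    for (int j = 0; j < N2; j++)
        sum += joint[i][j];
    return sum;
}

/* Marginal probability Pr(X2 = j): sum the joint pmf over X1. */
double marginal_x2(const double joint[N1][N2], int j)
{
    double sum = 0.0;
    for (int i = 0; i < N1; i++)
        sum += joint[i][j];
    return sum;
}

/* Returns 1 if every joint entry equals the product of its marginals
   within tolerance tol, and 0 otherwise. */
int is_independent(const double joint[N1][N2], double tol)
{
    for (int i = 0; i < N1; i++) {
        for (int j = 0; j < N2; j++) {
            double d = joint[i][j] - marginal_x1(joint, i) * marginal_x2(joint, j);
            if (d > tol || d < -tol)
                return 0;
        }
    }
    return 1;
}
```

The first table passes the check and the second fails it, even though both tables share the same marginal distribution for $X_1$; the marginals alone therefore do not determine whether two random variables are independent.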
Similarly, for two statistically independent random variables $X_1$ and $X_2$, the joint cumulative distribution is $F_{X_1,X_2}(x_1, x_2) = F_{X_1}(x_1)\,F_{X_2}(x_2)$. The notion of statistical independence can be easily extended to multiple random variables. If the N random variables $X_1, X_2, \ldots, X_N$ are statistically independent, then their joint pdf is a product of their individual pdfs as follows:

\[ p_{X_1,X_2,\ldots,X_N}(x_1, x_2, \ldots, x_N) = p_{X_1}(x_1)\,p_{X_2}(x_2) \cdots p_{X_N}(x_N) \]

or, equivalently,

\[ F_{X_1,X_2,\ldots,X_N}(x_1, x_2, \ldots, x_N) = F_{X_1}(x_1)\,F_{X_2}(x_2) \cdots F_{X_N}(x_N) \]

Statistical Measures of Random Variables

Statistical measures play an important role in the overall characterization of an experiment and in the characterization of random variables defined on the sample space of an experiment. Popular statistical measures for single random variables are the mean, variance, and standard deviation; for multiple random variables, they are correlation and covariance. These statistical measures are defined next.

The mean or expected value of a single continuous random variable X is defined as

\[ E(X) = \mu_x = \int_{-\infty}^{\infty} x\,p_X(x)\,dx \tag{6.19} \]

where E(.) denotes expectation used for statistical averaging. The expectation is also known as the first moment of a random variable X. In general, the n-th moment is defined as

\[ E(X^n) = \int_{-\infty}^{\infty} x^n\,p_X(x)\,dx \]

If $\mu_x$ is the expected value of random variable X, then the n-th central moment is defined as

\[ E[(X - \mu_x)^n] = \int_{-\infty}^{\infty} (x - \mu_x)^n\,p_X(x)\,dx \tag{6.20} \]

When n = 2, the second central moment is called the variance of a random variable, and is denoted by $\sigma_x^2$. Thus, the variance of random variable X is given by

\[ \mathrm{Var}(X) = \sigma_x^2 = \int_{-\infty}^{\infty} (x - \mu_x)^2\,p_X(x)\,dx \tag{6.21} \]

Equation (6.21) can also be expressed in terms of the first and second moments by expanding it as follows:

\[ \sigma_x^2 = E(X^2) - \mu_x^2 \tag{6.22} \]

The variance of a random variable X measures the spread of the distribution around its mean value. The larger the variance of a random variable, the broader its probability distribution.
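Equation (6.22) can be checked numerically on a discrete random variable, where the integrals become sums over the pmf. The following C sketch uses the 10-toss unbiased-coin pmf of Figure 6.13 as data; the array and function names are illustrative, not from the text.

```c
#include <assert.h>

/* Number of heads in 10 tosses of an unbiased coin and its pmf C(10,k)/2^10. */
static const double heads_values[11] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
static const double heads_pmf[11] = {
    1 / 1024.0,   10 / 1024.0,  45 / 1024.0, 120 / 1024.0,
    210 / 1024.0, 252 / 1024.0, 210 / 1024.0, 120 / 1024.0,
    45 / 1024.0,  10 / 1024.0,  1 / 1024.0
};

/* n-th moment of a discrete random variable: sum over k of x_k^order * p_k. */
double pmf_moment(const double *x, const double *p, int n, int order)
{
    assert(n > 0 && order >= 0);
    double sum = 0.0;
    for (int k = 0; k < n; k++) {
        double term = p[k];
        for (int j = 0; j < order; j++)
            term *= x[k];
        sum += term;
    }
    return sum;
}

/* Variance as the second central moment, the discrete form of Eq. (6.21). */
double pmf_variance_central(const double *x, const double *p, int n)
{
    double mu = pmf_moment(x, p, n, 1);
    double sum = 0.0;
    for (int k = 0; k < n; k++)
        sum += (x[k] - mu) * (x[k] - mu) * p[k];
    return sum;
}

/* Variance from the first and second moments, Eq. (6.22). */
double pmf_variance_moments(const double *x, const double *p, int n)
{
    double mu = pmf_moment(x, p, n, 1);
    return pmf_moment(x, p, n, 2) - mu * mu;
}
```

Both variance routines should return the same value for any valid pmf; the moment form of Equation (6.22) needs only running sums of $x$ and $x^2$, which is convenient when the data arrive one sample at a time.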
The standard deviation $\sigma_x$ is given by the square root of the variance.

■ Example 6.1

We can compute the statistical measures for the discrete random variable with the pmf shown in Figure 6.13. Here, the random variable is the number of heads that show up when we toss a coin 10 times. With an unbiased coin, the probability of k heads in n tosses is given by

\[ \Pr(X = k \text{ heads}) = \binom{n}{k} \Big/ 2^n \]

Table 6.1 shows the probability distribution values for n = 10. The mean of the distribution follows:

\[ \mu_x = \sum_{k=-\infty}^{\infty} x_k \Pr(X = x_k) = \sum_{k=0}^{10} x_k \Pr(X = x_k) \]
\[ = \frac{0 \times 1 + 1 \times 10 + 2 \times 45 + 3 \times 120 + 4 \times 210 + 5 \times 252 + 6 \times 210 + 7 \times 120 + 8 \times 45 + 9 \times 10 + 10 \times 1}{2^{10}} = 5 \]

The distribution variance is obtained as follows:

\[ \sigma_x^2 = \sum_{k=-\infty}^{\infty} (x_k - \mu_x)^2 \Pr(X = x_k) = \sum_{k=0}^{10} (x_k - \mu_x)^2 \Pr(X = x_k) \]
\[ = \frac{25 + 160 + 405 + 480 + 210 + 0 + 210 + 480 + 405 + 160 + 25}{2^{10}} = 2.5 \]

The standard deviation, then, is $\sigma_x = \sqrt{2.5} = 1.581$. ■

Table 6.1: Probability distribution for number of heads in a coin-tossing experiment

x_k = Number of Heads    Pr(X = x_k)
0                        1/2^10
1                        10/2^10
2                        45/2^10
3                        120/2^10
4                        210/2^10
5                        252/2^10
6                        210/2^10
7                        120/2^10
8                        45/2^10
9                        10/2^10
10                       1/2^10

Central Limit Theorem

The central limit theorem states that whenever random samples z are taken from any distribution with finite mean μ and finite variance $\sigma^2$, the sample mean $\hat{z}$ of n such samples is approximately normal (Gaussian) distributed with mean μ and variance $\sigma^2/n$ when n is large. For example, the noise present in the desired signal at the receiver of a digital communication system results from the accumulation of noise components from many sources, and the underlying distribution of this accumulated noise is close to Gaussian. This is one of the reasons for using the normal or Gaussian distribution to model the noise source most of the time. Typically, we model the normal distribution by averaging statistically independent and identically distributed (i.i.d.) random variables with finite mean μ and finite variance $\sigma^2$. For example, by adding 12 samples from a uniform distribution defined over the interval [0, 1] and repeating the process many times, we create approximately normally distributed samples with μ = 6 and $\sigma^2 = 1$, because each uniform sample has mean 1/2 and variance 1/12. The pdf of the Gaussian distributed random variable with mean $\mu_x$ and variance $\sigma^2$ follows:

\[ p_X(x) = N(\mu_x, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu_x)^2 / 2\sigma^2}, \quad -\infty < x < \infty, \ \sigma > 0 \tag{6.23} \]

Random Process

In the previous discussion, we defined random variables as functions that map the sample space to a measurable real number space. In the same way, when we map the sample space to a measurable signal space instead of a number space, we call such a mapping function X(t) a random process. In the coin experiment, with the random process X(t), we may map, for example, a head to a square wave and a tail to a triangle wave. We can view a random process as a collection of random variables or as a collection of sample functions. At a particular time instance $t_i$, the random process X(t) represents the random variable $X(t_i)$. If we consider a process $s(t) = X \sin(2\pi f t)$, and if X is a random variable, then for every possible value x of X, there is a function of time called the sample function $s_x(t)$. The collection of all such sample functions forms a random process. We call such a collection of functions an ensemble. Although the independent variable t is continuous, the underlying random process need not be continuous. If the associated random variable is discrete, then the corresponding random process is also discrete; if the random variable is continuous, then the corresponding random process is also continuous.

Distribution Functions for Random Processes

Random processes are easily studied by viewing them as a collection of random variables. Here, we consider the random process X(t) at time $t_i$, $X(t_i)$, where $X(t_i)$ represents a random variable.
The cumulative distribution function for random variable $X(t_i)$ is given by $F_{X(t_i)}(x_i) = \Pr[X(t_i) \le x_i]$. This relation can be generalized to the n-th-order case as follows:

\[ F_{X(t_1),X(t_2),\ldots,X(t_n)}(x_1, x_2, \ldots, x_n) = \Pr[X(t_1) \le x_1, X(t_2) \le x_2, \ldots, X(t_n) \le x_n] \tag{6.24} \]

\[ p_{X(t_1),X(t_2),\ldots,X(t_n)}(x_1, x_2, \ldots, x_n) = \frac{\partial^n F_{X(t_1),X(t_2),\ldots,X(t_n)}(x_1, x_2, \ldots, x_n)}{\partial x_1\,\partial x_2 \cdots \partial x_n} \tag{6.25} \]

where $x_1, x_2, \ldots, x_n$ are the values of the n random variables considered at the n time instances $t_1, t_2, \ldots, t_n$. In general, a complete statistical description of a random process requires knowledge of the distribution functions of all orders. The random processes X(t) and Y(t) are said to be statistically independent if and only if

\[ p_{X(t_1),\ldots,X(t_n),Y(t_1),\ldots,Y(t_n)}(x_1, \ldots, x_n, y_1, \ldots, y_n) = p_{X(t_1),\ldots,X(t_n)}(x_1, \ldots, x_n)\; p_{Y(t_1),\ldots,Y(t_n)}(y_1, \ldots, y_n) \]

Stationarity of Random Processes

A random process X(t) is said to be stationary if its statistical properties do not change with time. More precisely, a process X(t) is said to be stationary in the strict sense if

\[ p_{X(t_1),X(t_2),\ldots,X(t_n)}(x_1, x_2, \ldots, x_n) = p_{X(t_1+\varepsilon),X(t_2+\varepsilon),\ldots,X(t_n+\varepsilon)}(x_1, x_2, \ldots, x_n) \tag{6.26} \]

for all orders n and all time shifts ε. That is, the statistics of all orders of a stationary random process are invariant to any translation of the time axis. On the other hand, when the joint pdfs vary with time shifts, the random process is nonstationary. Another kind of random process, in which the statistics are not invariant to arbitrary time shifts but vary periodically with period T, is called cyclo-stationary. For cyclo-stationary random processes, the following formula applies:

\[ p_{X(t_1),X(t_2),\ldots,X(t_n)}(x_1, x_2, \ldots, x_n) = p_{X(t_1+T),X(t_2+T),\ldots,X(t_n+T)}(x_1, x_2, \ldots, x_n) \tag{6.27} \]

where T is the period of the n-th-order statistics. In practice, we work with two kinds of stationary processes: wide-sense stationary and ergodic.
A random process X(t) is said to be wide-sense stationary if the following conditions are satisfied: its first-order statistic (the mean) is constant, and its second-order statistics depend only on the time difference instead of absolute time. A random process is said to be ergodic if its statistical averages and time averages of all orders are interchangeable.

Statistical Averages for Random Processes

Statistical averages for random processes are defined in ways similar to how we defined statistical averages for random variables. We define next the commonly used first-order statistic, the mean, and the second-order statistic, the autocorrelation, for the random process X(t). The expected value or mean μ(t) of a general random process X(t) is defined as follows:

\[ \mu(t_i) = E[X(t_i)] = \int_{-\infty}^{\infty} x_i\,p_{X(t_i)}(x_i)\,dx_i \tag{6.28} \]

In general, the value of the mean depends on the time instance $t_i$ if the pdf of $X(t_i)$ depends on the time instance $t_i$. For a stationary process, the pdf is independent of time; consequently, the mean is also independent of time.

Next, we consider two random variables $X(t_1)$ and $X(t_2)$ at time instances $t_1$ and $t_2$. The autocorrelation between $X(t_1)$ and $X(t_2)$ is measured by the joint moment, given by the following equation:

\[ R_{xx}(t_1, t_2) = E[X(t_1)X(t_2)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x_1 x_2\,p_{X(t_1),X(t_2)}(x_1, x_2)\,dx_1\,dx_2 \tag{6.29} \]

When the random process X(t) is stationary, the joint pdf $p_{X(t_1),X(t_2)}(x_1, x_2)$ is identical to the joint pdf $p_{X(t_1+\varepsilon),X(t_2+\varepsilon)}(x_1, x_2)$ for any arbitrary ε. This implies that the autocorrelation function of X(t) does not depend on the specific time instances $t_1$ and $t_2$; instead, it depends on the time difference $\tau = t_1 - t_2$. Thus, for a stationary random process, the second-order statistic is $R_{xx}(t_1, t_2) = R_{xx}(t_1 - t_2) = R_{xx}(\tau)$.
As previously defined, if the mean μ of the random process X(t) is independent of time, and the autocorrelation $R_{xx}(\tau)$ depends only on the time difference τ, then X(t) is called a wide-sense stationary (WSS) process.

Time Averages for Random Processes

The statistical averages using an ensemble of sample functions assume an infinite-sized ensemble of signals. However, in practical applications, we only get finite ensemble sizes and finite-length signals rather than an infinite ensemble of signals. Thus, for practical handling of real-world signals, we define the time average mean $\mu_t$ of random processes as follows:

\[ \mu_t = E[X(t)] = \lim_{T \to \infty} \frac{1}{2T} \int_{-T}^{T} X(t)\,dt \tag{6.30} \]

Similarly, the time autocorrelation function is defined as follows:

\[ R_{xx}(\tau) = E[X(t)X(t+\tau)] = \lim_{T \to \infty} \frac{1}{2T} \int_{-T}^{T} X(t)\,X(t+\tau)\,dt \tag{6.31} \]

The time autocovariance function for random process X(t) is defined as follows:

\[ \gamma_{xx}(\tau) = E[(X(t) - \mu_t)(X(t+\tau) - \mu_t)] = \lim_{T \to \infty} \frac{1}{2T} \int_{-T}^{T} (X(t) - \mu_t)(X(t+\tau) - \mu_t)\,dt \tag{6.32} \]

The time cross-correlation function for two random processes X(t) and Y(t) is defined as follows:

\[ R_{xy}(\tau) = E[X(t)Y(t+\tau)] = \lim_{T \to \infty} \frac{1}{2T} \int_{-T}^{T} X(t)\,Y(t+\tau)\,dt \tag{6.33} \]

In practice, it is commonly assumed that a given signal is a sample function of an ergodic random process so that the averages can be computed from a single sample function. The Fourier transform (see next section) of the autocorrelation function of a WSS random process gives the power spectral density (PSD) of the random process.

6.2 Time-Frequency Representation of Continuous-Time Signals

In Section 6.1 we introduced the concept of the signal and discussed various types of signals. All the signals discussed in the previous section are represented in the time domain. That is, the signal variations are represented with respect to time.
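Returning briefly to the time averages of Equations (6.30) and (6.31): for a finite-length sampled signal, the limits and integrals are usually replaced by finite sums. The C sketch below is one such discrete approximation, under the assumption that the signal is available as an array of equally spaced samples; the normalization by the number of overlapping samples, the example array, and the function names are choices made here, not taken from the text.

```c
#include <assert.h>

/* Hypothetical test signal: an alternating +1/-1 sequence. */
static const double example_x[8] = { 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0 };

/* Finite-length counterpart of Eq. (6.30): average of the n samples. */
double time_mean(const double *x, int n)
{
    assert(n > 0);
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i];
    return sum / n;
}

/* Finite-length counterpart of Eq. (6.31): average of x[i] * x[i + lag]
   over the n - lag sample pairs that overlap. */
double time_autocorr(const double *x, int n, int lag)
{
    assert(lag >= 0 && lag < n);
    double sum = 0.0;
    for (int i = 0; i + lag < n; i++)
        sum += x[i] * x[i + lag];
    return sum / (n - lag);
}
```

For the alternating test signal, the time mean is zero and the autocorrelation alternates between +1 at even lags and −1 at odd lags, reflecting the period-2 structure of the sequence.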
Although the time-domain representation of a signal clearly shows how a physical phenomenon varies with time, signal processing often requires more information than the variation of signal amplitude with respect to time. Using the time-domain signal information alone, we often cannot process the signal to get the desired signal. Sometimes, by transforming the data from one domain to another, we may find more relevant information in the transformed-domain data than in the original-domain data. In addition, by eliminating the undesired components in one domain, we may get the desired data in another. One way to process the raw signal to get a desired signal is to decompose an arbitrary signal into known base components and choose a subset of base components to form the desired signal. If we choose well-known sinusoidal components as base components to decompose the signal, then the decomposition gives the whole range of frequencies in the original signal. We obtain the desired signal by using a subset of frequencies. This emphasizes the frequency-domain representation of the signal. In addition, if the given signal contains only a few frequencies, we can represent that signal more compactly in the frequency domain than in the time domain.

6.2.1 Sinusoids and Frequency-Domain Representation

As discussed in Section 6.1, the sine wave representation $y(t) = \sin(2\pi f t)$ provides the frequency value of a sine wave. We can now easily represent any sine wave of the form $y(t) = A \sin(2\pi f t + \phi)$, where A is the amplitude and φ is the phase, in the frequency domain, with the x-axis representing the frequency value and the y-axis representing both magnitude and phase at a particular frequency. We use two separate plots to show the magnitude and phase. The time- and frequency-domain representations of the signal $y(t) = 5 \sin(2\pi 3t + \pi/4) = 5 \cos(2\pi 3t - \pi/4)$ are shown in Figures 6.18 and 6.19, respectively.

In Figure 6.18, the dotted curve represents the zero-phase sine wave, and the solid curve represents the sine wave with a phase difference of π/4 with respect to the zero-phase sine wave. The frequency-domain equivalent of Figure 6.18 is shown in Figure 6.19. Figure 6.19(a) indicates that a sinusoid of magnitude 5.0 is present at frequency index 3, and Figure 6.19(b) gives the phase of that sinusoid with respect to the zero-phase cosine wave at frequency index 3; because $\sin(\theta) = \cos(\theta - \pi/2)$, a sine wave with phase π/4 corresponds to a cosine wave with phase −π/4. Thus, both figures represent the same sinusoid information in different domains. Figure 6.18 shows only a finite-length sine wave due to limited space, but the wave should be of infinite length to exactly match the equivalent frequency-domain information in Figure 6.19.

Consider another waveform s(t), as shown in Figure 6.20. At first glance, the waveform in the figure seems random, but it is not. It repeats itself with interval T, which is equal to 1. Actually, s(t) is the sum of three sinusoids as follows:

\[ s(t) = 5 \sin(2\pi t + \pi/4) + 7 \sin(2\pi 4t + \pi/8) + 4 \sin(2\pi 9t + \pi/12) \tag{6.34} \]

The equivalent frequency-domain representation of the waveform in Equation (6.34) is shown in Figure 6.21. It consists of three frequencies at f = 1, f = 4, and f = 9 with amplitudes 5, 7, and 4, respectively, and with phases −π/4, −3π/8, and −5π/12 relative to zero-phase cosine waves (that is, each sine phase reduced by π/2). As discussed previously, the time- and frequency-domain plots provide complementary information about the same signal. The question at this juncture is how to transform the signal from one domain to another. We use the well-known Fourier methods for this purpose.

Figure 6.18: Time-domain plot for y(t) = 5 sin(2π3t + π/4).

Figure 6.19: Frequency-domain representation for y(t) = 5 sin(2π3t + π/4). (a) Magnitude. (b) Phase.

Figure 6.20: Plot of a multifrequency waveform.

Figure 6.21: Frequency-domain representation of waveform s(t) of Equation (6.34). (a) Magnitude. (b) Phase.

Depending on the type of signal (e.g., periodic, nonperiodic, continuous, discrete), we use the following four types of Fourier methods to transform the signal data:

• Fourier series (defined for periodic continuous-time signals)
• Fourier transform (defined for nonperiodic continuous-time signals)
• Discrete time Fourier transform (defined for nonperiodic discrete signals)
• Discrete Fourier transform (defined for periodic discrete signals)

In this section, we discuss the first two Fourier methods, which are defined for continuous-time signals. The last two will be discussed in Section 6.4.

6.2.2 Fourier Series

Any periodic waveform s(t) can be represented as the sum of an infinite number of sinusoidal and cosinusoidal terms together with a constant term as follows:

\[ s(t) = c + \sum_{n=1}^{\infty} a_n \cos(2\pi f n t) + \sum_{n=1}^{\infty} b_n \sin(2\pi f n t) \tag{6.35} \]

where c is a constant (also called the DC value), f = 1/T, and T is the signal period. Given s(t), the coefficients $a_n$ and $b_n$ (also called AC values) and the constant term c are obtained as follows:

\[ c = \frac{1}{T} \int_{-T/2}^{T/2} s(t)\,dt \tag{6.36} \]
\[ a_n = \frac{2}{T} \int_{-T/2}^{T/2} s(t) \cos(2\pi f n t)\,dt \tag{6.37} \]
\[ b_n = \frac{2}{T} \int_{-T/2}^{T/2} s(t) \sin(2\pi f n t)\,dt \tag{6.38} \]

■ Example 6.2

Assume the periodic square wave shown in Figure 6.22(a), where the period is 1. One period of the square wave is defined as follows:

\[ s(t) = \begin{cases} 1 & 0 \le t < 0.5 \\ -1 & 0.5 \le t < 1 \end{cases} \]

Using Equation (6.36), we obtain the value of c as zero, since the integral of s(t) over one period is zero.
By substituting s(t) in Equations (6.37) and (6.38) and evaluating the integrals over the interval [−T/2, T/2], we obtain the coefficients $a_n$ as all zeros, and the coefficients $b_n$ as follows:

\[ b_n = \begin{cases} 0 & \text{if } n \text{ is even} \\ \dfrac{4}{\pi n} & \text{if } n \text{ is odd} \end{cases} \]

Then, based on Equation (6.35), the decomposition of the square wave in sinusoidal terms is given by

\[ s(t) = \frac{4}{\pi} \sum_{\substack{n = 1 \\ n \text{ odd}}}^{\infty} \frac{\sin(2\pi f n t)}{n} \]

The frequency-domain representation of square wave s(t) is shown in Figure 6.23. ■

Figure 6.22: Fourier series representation of a periodic square wave. (a) Square wave. (b) through (h) Approximations using odd harmonics up to n = 1, 3, 5, 7, 9, 11, and 13.

Figure 6.23: Frequency-domain representation of a square wave. (a) Magnitude |S(nf)|: 4/π, 4/3π, 4/5π, and 4/7π at n = 1, 3, 5, and 7. (b) Phase ∠S(nf): −π/2 at odd frequencies.

As seen in Example 6.2, the square wave in Figure 6.22(a) can be represented by combining an infinite number of sinusoids with odd frequencies only. We illustrate this in Figure 6.22 using the first few sinusoids with odd frequencies. Figure 6.22(b) through (h) represent approximations of the square wave s(t) using the sum of m sinusoids with frequencies n = 2m − 1 for 1 ≤ m ≤ 7. As shown in Figure 6.22(h), we come close to the ideal square wave by using the sum of the first seven odd-frequency sinusoids. However, we have ripples in the Fourier-series representation of the square wave. The presence of ripples in the Fourier-series computed waveform is called the Gibbs phenomenon. By adding more and more sinusoids we can only reduce the ripple width; we cannot attenuate the peaks of the ripple. The ripples in the Fourier-series approximation arise from the discontinuous nature of the square wave.
If the periodic wave has discontinuities (e.g., a square or sawtooth waveform), then the Fourier series does not reproduce the waveform exactly at the discontinuities. Because of this, the Fourier theory generalization was withheld from publication for decades. Nevertheless, we accept the Fourier-series representation for all periodic signals in the root mean square (RMS) error sense.

The Fourier-series representation in Equation (6.35) may be written more compactly by using exponential notation, which has the advantage of simpler mathematical manipulation. Equation (6.35) can be rearranged as follows:

\[ s(t) = \sum_{n=-\infty}^{\infty} d_n\,e^{j2\pi f n t} \tag{6.39} \]

where

\[ d_n = \frac{1}{T} \int_{-T/2}^{T/2} s(t)\,e^{-j2\pi f n t}\,dt \tag{6.40} \]

The values of $d_n$ in Equation (6.40) are complex, containing both real and imaginary parts. As the summation in Equation (6.39) includes negative values of n, we evaluate the integral for both negative and positive frequencies, and the values of $d_n$ are halved numerically to represent an equal sharing of the magnitudes between corresponding negative and positive frequencies. Using Equations (6.39) and (6.40), the relationship between $d_n$ and c, $a_n$, and $b_n$ is obtained as follows:

\[ d_0 = c, \qquad d_n = \frac{a_n - j b_n}{2}, \qquad |d_n| = \frac{1}{2}\sqrt{a_n^2 + b_n^2}, \qquad \phi_n = -\tan^{-1}(b_n / a_n), \qquad n \ge 1 \]

Thus, each frequency component of the waveform is characterized by its magnitude and its phase angle $\angle S(nf) = \phi_n$. In a single-sided plot that shows only positive frequencies, the magnitude is $|S(nf)| = 2|d_n| = \sqrt{a_n^2 + b_n^2}$, which combines the equal contributions of the positive- and negative-frequency terms. Based on Equation (6.40), it is clear that the Fourier series output is discrete, which means that periodic time-domain signals contain frequencies only at discrete values.

6.2.3 Fourier Transform

What if the given signal is nonperiodic, as shown in Figure 6.5 through Figure 6.12? How do we compute the frequency-domain information of such nonperiodic signals? The Fourier series approach is defined for periodic signals and cannot be applied to nonperiodic signals.
In Equation (6.40), as the period T increases to infinity, the frequency spacing f (= 1/T) becomes an infinitesimal increment $\delta f$, the discrete frequencies nf merge into a continuous frequency variable f, and the quantity S(nf) becomes S(f) as follows:

\[ S(f) = \delta f \int_{-\infty}^{\infty} s(t)\,e^{-j2\pi f t}\,dt \tag{6.41} \]

After normalization of Equation (6.41) with $\delta f$, we have

\[ \frac{S(f)}{\delta f} = F(f) = \int_{-\infty}^{\infty} s(t)\,e^{-j2\pi f t}\,dt \tag{6.42} \]

We can compute s(t) from Equation (6.42) by performing the inverse operation as follows:

\[ s(t) = \int_{-\infty}^{\infty} F(f)\,e^{j2\pi f t}\,df \tag{6.43} \]

If we replace 2πf with ω in Equations (6.42) and (6.43), then we have the Fourier transform pair as follows:

\[ F(\omega/2\pi) = \int_{-\infty}^{\infty} s(t)\,e^{-j\omega t}\,dt \tag{6.44} \]
\[ s(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} F(\omega/2\pi)\,e^{j\omega t}\,d\omega \tag{6.45} \]

In practice, we omit the constant term 2π in the argument of F(ω/2π), and write F(ω) by assuming that the function F(.) is defined for normalized frequencies. Equations (6.44) and (6.45) are called the Fourier transform pair. The time-frequency representations of two nonperiodic signals, the Dirac delta function and the rectangular pulse, are shown in Figure 6.24(a) and (b). If an arbitrary nonperiodic signal s(t) contains frequencies up to $f_{\max}$, then |S(f)|, the magnitude of the Fourier transform of such a signal, resembles Figure 6.24(c).

6.3 Sampling of Continuous-Time Signals

In Section 6.1, we introduced the concept of signals and discussed various types of signals. All the signals presented in that section are continuous in time. However, we cannot process continuous-time signals with digital computers. Signal-processing computers handle only signals that are discrete in both time and amplitude. The quantization of the signal amplitude into a finite number of discrete levels is a lossy process, and we cannot recover this loss of information. On the other hand, sampling the signal with respect to time to get discrete-time samples can be a lossless process, and the original signal can be recovered if we sample the signal appropriately.
In this section, we concentrate on appropriate sampling of continuous-time signals to get discrete-time signals, and then on reconstructing the original signal from the discrete samples. An example of a discrete-time signal x[n] is shown in Figure 6.25, along with the actual analog signal x(t). Given sampling period T, the samples are obtained from x(t) as $x[n] = x(t)|_{t = nT}$. With sampling, the samples x[n] are equal to the value of the corresponding analog signal x(t) at the sampling time instances. Signal values in between the samples are undefined.

Figure 6.24: Time-frequency representation of nonperiodic signals. (a) Dirac delta function. (b) Rectangular pulse. (c) Arbitrary signal.

Figure 6.25: Plot of discrete-time signal.

Consider a 6-Hz sinusoidal signal, shown in Figure 6.26(a) with a solid curve. Since the sinusoid frequency is 6 Hz, it contains 6 cycles in each 1-second interval. The sinusoid shown in Figure 6.26(a) is plotted for 2 seconds and contains 12 cycles. We also show another sinusoid at a 2-Hz frequency (i.e., it contains only 4 cycles in a 2-second interval) in Figure 6.26(a) with a dotted curve. When sampling a continuous-time signal, we collect the samples at regular time intervals. For example, we sampled the 6-Hz signal at four samples per second in Figure 6.26(b) and at eight samples per second in Figure 6.26(c). Now the question is, how many samples are needed to recover the original 6-Hz continuous-time sinusoidal signal? Is it possible to recover the original 6-Hz sinusoidal signal using points sampled at four points per second or eight points per second? As seen in Figure 6.26(b) and (c), those four or eight points represent not only the 6-Hz sinusoid but also the 2-Hz sinusoid.
Thus, there is ambiguity in deciding which sinusoidal curve those sampled points actually represent; because of this ambiguity, we cannot recover the 6-Hz sinusoid signal using the four or eight points. What if we choose a different set of sampling instances, as shown in Figure 6.27? Is it possible then to recover the 6-Hz sinusoid with those sample points? In Figure 6.27(b), we sampled the signal at six regular time intervals per second, and those six points represent the 3-Hz sinusoid as well as the 6-Hz sinusoid. The same ambiguity arises even with nine points, as shown in Figure 6.27(c). Consequently, you may think that sampling with 100 points (instead of 8 or 9 points) makes recovering the original 6-Hz sinusoid possible. Yes, we can recover that sinusoid if we sample the 6-Hz signal with 100 samples per second, but processing and storing those 100 samples is far more costly than processing and storing 8 or 9 samples. Therefore, we are interested in the minimum number of samples required to represent the continuous-time signal such that we can recover the original signal without ambiguity.

In the following, we discuss the famous sampling theorem, which specifies the rate at which an analog signal should be sampled to ensure that all the relevant information contained in the signal is captured or retained via sampling.

Figure 6.26: (a) Two sinusoids with frequencies 6 Hz (solid line) and 2 Hz (dashed line). (b) Sampling at four samples per second. (c) Sampling at eight samples per second.
Depending on whether the signal is low pass (as shown in Figure 6.28(a), which contains most of its energy at the lower frequencies) or bandpass (which contains most of its energy away from the lower frequencies), we follow a slightly different procedure in applying the sampling theorem.

6.3.1 Nyquist Criterion: Sampling of Low-Pass Signals

According to the Nyquist criterion, if the highest-frequency component in a signal is $f_{max}$ (Hz), then the signal should be sampled at a rate of at least $2 f_{max}$ samples per second to describe the signal completely. That is, the sampling frequency or rate $F_s$ must satisfy

$$F_s \ge 2 f_{max} \tag{6.46}$$

The minimum rate, $2 f_{max}$, is called the Nyquist rate. Returning to the sinusoid example, the maximum frequency of the sinusoid in Figures 6.26 and 6.27 is 6 Hz. This means that if we sample the 6-Hz sinusoid at 12 samples or more per second (i.e., at a rate greater than or equal to $2 \times 6$), then there will be no ambiguity in reconstructing the 6-Hz sinusoid after sampling. (In practice the rate should be strictly greater than 12, because samples taken at exactly 12 per second can all fall on the zero crossings of the sinusoid.) Sampling at less than the rate specified by the sampling theorem leads to aliasing of image frequencies into the desired frequency band; hence, the original signal cannot be recovered.

[Figure 6.27: (a) Two sinusoids with frequencies 6 Hz (solid line) and 3 Hz (dotted line). (b) Sampling at six samples per second. (c) Sampling at nine samples per second.]

The concept of image frequencies is explained next. For this, we consider a continuous-time signal as shown in Figure 6.28 along with its frequency-domain information (also called the frequency spectrum).
If we sample this continuous-time signal into discrete samples, then the frequency spectrum of the discrete samples contains many replicas of the original spectrum, as shown in Figure 6.28(b). These replicas are called image frequencies. The intuitive reasoning for the formation of images with signal sampling is as follows. When we applied the Fourier series to a periodic signal, we got a discrete frequency spectrum. This means that when we have discrete content in one domain, we see periodic content in the other domain. In the same way, sampling a continuous-time signal into discrete samples causes the frequency spectrum to repeat itself. To obtain the original signal, we filter out these replicas. If we sample a continuous-time signal at less than the Nyquist rate, as shown in Figure 6.28(c), then the repeating spectral images overlap in the frequency domain. In this case, we cannot obtain the original continuous signal because the filter cannot completely remove the spectral images.

In Figures 6.26 and 6.27, we saw that 2-Hz and 3-Hz sinusoids appear when the 6-Hz sinusoid is sampled at less than 12 samples per second. This is because undersampling the 6-Hz signal causes overlapping in the frequency domain and forms an alias of the higher-frequency signal at lower frequencies. Thus, these 2-Hz and 3-Hz sinusoids are aliases of the actual 6-Hz signal when we undersample it.

Even if we know the maximum frequency present in the desired signal, sampling at a rate greater than twice that frequency may not guarantee perfect reconstruction of the original signal because of noise. Noise usually occupies a wider frequency band than the desired signal. In this case, even if we follow the sampling theorem, we may still see frequency aliasing in the sampled signal due to the wideband noise. Thus, we filter the noise using an antialiasing filter before sampling the desired signal, as shown in Figure 6.29.

[Figure 6.28: Sampling continuous-time signals. (a) Continuous-time signal and its frequency spectrum. (b) Properly sampled discrete signal (sampling rate > $2f_{max}$) and corresponding image frequency spectrums. (c) Undersampling (sampling rate < $2f_{max}$) and corresponding aliased frequency spectrum images.]

[Figure 6.29: Sampling with an antialiasing filter.]

6.3.2 Reconstruction of Signal from Discrete Samples

According to the sampling theorem, we can reconstruct continuous-time band-limited signals from samples obtained by sampling at a rate of at least twice the maximum frequency present in the signal. If $T$ is the sampling period associated with the sequence $x[n]$, then we construct the continuous-time signal $x(t)$ from its samples $x[n]$ as follows:

$$x(t) = \sum_{n=-\infty}^{\infty} x[n]\,\frac{\sin[\pi(t - nT)/T]}{\pi(t - nT)/T} \tag{6.47}$$

[Figure 6.30: Reconstruction of continuous-time signal from discrete samples.]

[Figure 6.31: Discrete-time processing of continuous-time signals (antialias filter, sampler, digital signal processor, sinc interpolator).]

In Equation (6.47), we basically use delayed versions of the normalized sinc function to reconstruct the continuous-time signal. The normalized sinc function,

$$\frac{\sin(\pi t/T)}{\pi t/T}$$

is a continuous signal that, at the sampling instants, is nonzero only at $t = 0$ (where it equals 1) and is zero at all other instants $t = nT$, where $n \in (-\infty, \infty) - \{0\}$.
Similarly, if we delay the sinc function by $m$ samples in time,

$$\frac{\sin[\pi(t - mT)/T]}{\pi(t - mT)/T}$$

then, at the sampling instants, it is nonzero only at $t = mT$ and zero at all other instants $t = nT$, where $n \in (-\infty, \infty) - \{m\}$. This is illustrated in Figure 6.30 using four samples $x[n] = [1, 2, 3, 2]$ at sampling instances $[-T, 0, T, 2T]$. As seen in Figure 6.30, the sinc function interpolates between the samples of $x[n]$ to construct a continuous-time signal $x_c(t)$. In fact, if there is no aliasing, then the sinc interpolation reconstructs a continuous-time signal that exactly represents $x(t)$.

Given a continuous-time signal, we can obtain discrete-time samples by sampling it at a rate that satisfies Equation (6.46). We can then perform signal processing on the discrete samples using specialized tools, and get back the processed continuous-time signal by using the reconstruction formula in Equation (6.47). The basic signal processing system for real-world, continuous-time signals using a digital signal processor is shown in Figure 6.31.

6.3.3 Sampling of Bandpass Signals

Bandpass signals frequently occur in communication systems, where signals are modulated to occupy particular frequency bands for transmission, as shown in Figure 6.32. In such cases, the bandwidth $B$ of the signal is often very small compared to the maximum frequency $f_H$ present, and sampling according to the Nyquist criterion will be costly (to process, store, or reconstruct). The bandpass sampling theorem is used in such situations.

[Figure 6.32: Frequency-domain representation of a bandpass signal.]

The signal is sampled at a rate $F_s$ that satisfies the following condition:

$$\frac{2 f_H}{n} \le F_s \le \frac{2 f_L}{n - 1} \tag{6.48}$$

where $n = \lfloor f_H / B \rfloor$, that is, $f_H/B$ rounded down to an integer (smaller positive integers $n$ also give valid, but higher, rate ranges).

The bandpass sampling theorem allows us to sample narrowband signals at a much reduced rate while still permitting reconstruction of the signal without aliasing problems. In the special case where the edge frequencies $f_L$ and $f_H$ are integer multiples of the signal bandwidth $B$, the signal can be sampled at the theoretical minimum rate of $2B$ without aliasing. For example, if $B = 10$ kHz, $f_L = 80$ kHz, and $f_H = 90$ kHz, then $n = 9$, both bounds in Equation (6.48) equal 20 kHz, and sampling at the rate $2B = 20$ kHz allows us to reconstruct the original continuous-time bandpass signal.

6.4 Time-Frequency Representation of Discrete-Time Signals

In Section 6.2, we used Fourier series and Fourier transforms for time-frequency representation of continuous-time signals. In this section, their counterparts for discrete-time signals are discussed. Like continuous-time signals, discrete-time signals are of two types: periodic and nonperiodic. A discrete-time signal $x[n]$ is said to be periodic if $x[n] = x[n + N]$ for some positive integer $N$. The smallest such integer $N$ is the period of $x[n]$. Note that the continuous-time sinusoid $\sin(\omega t)$ is periodic regardless of the value of $\omega$. This is not the case with the discrete-time sinusoid $\sin(\Omega n)$. To make $\sin[\Omega(n + N)] = \sin(\Omega n)$, the following has to be satisfied for some integer $m$:

$$\Omega N = 2\pi m \quad \text{or} \quad \frac{\Omega}{2\pi} = \frac{m}{N}$$

In brief, a discrete-time sinusoid $\sin(\Omega n)$ is periodic only if $\Omega / 2\pi$ is a rational number.

6.4.1 Discrete-Time Fourier Transform

We obtain the frequency-domain information for discrete-time nonperiodic signals using the discrete-time Fourier transform (DTFT) as follows:

$$X(\Omega) = \sum_{n=-\infty}^{\infty} x[n]\,e^{-j\Omega n} \tag{6.49}$$

The DTFT output $X(\Omega)$ is periodic, as shown in Figure 6.33.
Since $e^{-j2\pi n} = 1$ for every integer $n$, the frequency information $X(\Omega)$ from Equation (6.49) is periodic with period $2\pi$, as derived here:

$$X(\Omega + 2\pi) = \sum_{n=-\infty}^{\infty} x[n]\,e^{-j(\Omega + 2\pi)n} = \sum_{n=-\infty}^{\infty} x[n]\,e^{-j\Omega n}e^{-j2\pi n} = \sum_{n=-\infty}^{\infty} x[n]\,e^{-j\Omega n} = X(\Omega) \tag{6.50}$$

In the following, the periodicity of this Fourier transform of the discrete-time signal is explained. The Fourier transform of an impulse train $\delta_T(t) = \sum_{n=-\infty}^{\infty}\delta(t - nT)$ is again a periodic impulse sequence with a different period, as shown in Figure 6.34. Because discrete-time signals are obtained by multiplying the continuous-time signal with the impulse train, the Fourier transform of the discrete-time signal is periodic. In other words, if we have discrete information in one domain (time/frequency), then we will have periodic information in the other domain (frequency/time).

[Figure 6.33: Discrete-time signals and periodic frequency-domain information.]

[Figure 6.34: (a) Impulse train in time domain. (b) Equivalent frequency-domain information.]

Since the Fourier transform of discrete-time signals is periodic, the inverse transform is performed over one period, $2\pi$, of the frequency spectrum:

$$x[n] = \frac{1}{2\pi}\int_{2\pi} X(\Omega)\,e^{j\Omega n}\,d\Omega \tag{6.51}$$

Equations (6.49) and (6.51) form a discrete-time Fourier transform pair.

6.4.2 Discrete Fourier Transform

Digital systems process and output only discrete signals; thus, the DTFT is not directly suitable for digital signal processing (DSP) because its output is a continuous function of frequency. For this reason, we derive the discrete Fourier transform (DFT) equations from the DTFT to work with DSPs. Sampling the DTFT output frequency information in Equation (6.49) at regular frequency intervals gives

$$X[k\Omega_0] = \sum_{n=-\infty}^{\infty} x[n]\,e^{-j\Omega_0 kn}, \qquad \Omega_0 = \frac{2\pi}{T} \tag{6.52}$$

In Equation (6.52), we usually omit $\Omega_0$ from the index and simply write $X[k\Omega_0]$ as $X[k]$.
By sampling the frequency-domain information, we force periodicity in the time domain. If we have $N$ samples in one period $T$, then $\Omega_0 = 2\pi/N$. With this, Equation (6.52) can be rewritten as follows:

$$X[k] = \sum_{n=0}^{N-1} x[n]\,e^{-j2\pi kn/N} \tag{6.53}$$

[Figure 6.35: Graphic illustration of discrete Fourier transform.]

Given Equations (6.51) and (6.53), the inverse DFT follows:

$$x[n] = \frac{1}{N}\sum_{k=0}^{N-1} X[k]\,e^{j2\pi kn/N} \tag{6.54}$$

Equations (6.53) and (6.54) represent the DFT pair. Next, we verify the periodicity of the time-domain sequence $x[n]$ as follows:

$$x[n + N] = \frac{1}{N}\sum_{k=0}^{N-1} X[k]\,e^{j2\pi k(n+N)/N} = \frac{1}{N}\sum_{k=0}^{N-1} X[k]\,e^{j2\pi kn/N}e^{j2\pi k} = \frac{1}{N}\sum_{k=0}^{N-1} X[k]\,e^{j2\pi kn/N} = x[n]$$

Therefore, the DFT assumes a built-in periodicity in the time-domain information. A graphic illustration of the DFT is shown in Figure 6.35.

6.4.3 Discrete Cosine Transform

The discrete cosine transform (DCT) is commonly used to compress signal data. This is particularly important for the storage and transmission of image frames, because images have much spatial redundancy and the DCT is good at removing the correlation in the data. Many types of DCT can be found in the literature; the following is the DCT pair most commonly applied in the image processing field:

$$X[k] = \beta[k]\sum_{n=0}^{N-1} x[n]\cos\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right)k\right], \quad k = 0, 1, 2, \ldots, N-1 \tag{6.55}$$

$$x[n] = \sum_{k=0}^{N-1} \beta[k]\,X[k]\cos\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right)k\right], \quad n = 0, 1, 2, \ldots, N-1 \tag{6.56}$$

where $\beta[0] = \sqrt{1/N}$ and $\beta[m] = \sqrt{2/N}$ for $m \ne 0$. Implementation techniques for both the DFT and the DCT are discussed in the next chapter.

6.5 Linear Time-Invariant Systems

A system is any process that produces an output signal in response to an input signal. With specialized signal processing techniques, it is possible to express most systems in terms of mathematical models. This allows us to apply signal processing tools for system analysis and thereby improve system performance.
In particular, we are interested in linear-time-invariant (LTI) systems, since many tools are available to analyze these systems. One important characteristic of an LTI system is that its output response to a sinusoidal input is also a sinusoid with some gain in amplitude and delay in phase. However, the frequency of the output sinusoid is the same as that of the input sinusoid (i.e., we get a sinusoid of the same frequency with a different gain and phase). All systems with input–output relationships described by linear differential equations are linear-time-invariant systems when the coefficients of those differential equations are constant. Next, we discuss the properties of stable, causal, linear-time-invariant systems.

Linearity. A system $H\{\cdot\}$ is said to be linear if it follows the principles of superposition and homogeneity. If $y_1[n]$ and $y_2[n]$ are the output signals of system $H\{\cdot\}$ when $x_1[n]$ and $x_2[n]$ are the respective input signals (i.e., $y_1[n] = H\{x_1[n]\}$ and $y_2[n] = H\{x_2[n]\}$), then the system $H\{\cdot\}$ is linear if and only if

$$H\{a x_1[n] + b x_2[n]\} = a H\{x_1[n]\} + b H\{x_2[n]\} = a y_1[n] + b y_2[n] \tag{6.55}$$

where $a$ and $b$ are arbitrary constants.

Time Invariance. Systems with parameters that do not change with respect to time are called time-invariant systems. For a time-invariant system $H\{\cdot\}$, if the input $x[n]$ is delayed by $N$ samples, then the corresponding output $y[n]$ is also delayed by $N$ samples. That is, if $y[n] = H\{x[n]\}$, then $H\{x[n - N]\} = y[n - N]$.

Causality. A system is causal if, for every index $N$, the output sample $y[N]$ depends only on the input samples $x[n]$ with $n \le N$. That is, the output of a causal system depends on the current input and/or previous inputs, but not on future inputs.

Stability. A system $H\{\cdot\}$ is stable in the BIBO (bounded-input, bounded-output) sense if and only if every bounded input sequence produces a bounded output sequence.
In discrete time, the condition for BIBO stability is that the impulse response of the LTI system be absolutely summable, that is,

$$\sum_{n} |h[n]| < \infty$$

Continuous- and Discrete-Time Systems. Systems in which inputs and outputs are continuous-time signals are called continuous-time systems, usually denoted by $h(t)$. Similarly, systems whose inputs and outputs are discrete-time signals are called discrete-time systems, usually denoted by $h[n]$. Using the sampling theorem, we can obtain the corresponding discrete-time system from the given continuous-time system.

6.5.1 Impulse Response of LTI Systems

A major reason for interest in LTI systems is that these systems can be completely described by an impulse response. Why is an impulse response so important? We process the signals using systems and the processing of signals involves classification of signals and elimination of unwanted signals (or choosing the signal with a subset of frequency components). To process the signal with a system, we have to know the system (i.e., what frequency components the system attenuates and what frequencies it allows without attenuation). That is, we have to know the system frequencies and also the strength or amplitude and phase of each frequency. By passing a sinusoidal signal with a particular frequency through an LTI system, we can determine whether the system passes that particular frequency or attenuates it. Thus, by passing all the sinusoids within the required band of frequencies through an LTI system, we can describe the system in terms of frequencies in the band of interest. However, this is a huge task and it is not the most effective way to identify system behavior. Instead, if we input a unit impulse (whose frequency response is constant, meaning that it contains all frequencies, as shown in Figure 6.24(a)) to an LTI system, then its output response to a unit impulse provides the complete LTI system description.
The response of a system to a unit impulse input is called an impulse response. The frequency content of a system's impulse response contains all system frequencies. Thus, the impulse response of LTI systems plays an important role in signal-processing applications. The graphic illustration of continuous- and discrete-time system impulse responses is shown in Figure 6.36. In working with DSPs, we use only discrete-time systems.

[Figure 6.36: Impulse response examples of (a) continuous-time system and (b) discrete-time system.]

6.5.2 Convolution

Once we know the impulse response of a system, we can compute the response of that system to an arbitrary input signal by using the convolution operation. If $x(t)$ is the input signal and $h(t)$ is the impulse response of a given continuous-time system, then its output signal $y(t)$ is computed as follows:

$$y(t) = \int_{-\infty}^{\infty} x(\tau)\,h(t - \tau)\,d\tau \tag{6.56}$$

In short, the continuous-time convolution operation in Equation (6.56) is written as

$$y(t) = x(t) * h(t) \tag{6.57}$$

where $*$ represents the continuous-time convolution operation. For discrete-time systems, Equation (6.56) becomes

$$y[n] = \sum_{k=-\infty}^{\infty} x[k]\,h[n - k] \tag{6.58}$$

In brief, the discrete-time convolution operation in Equation (6.58) is written as

$$y[n] = x[n] \otimes h[n] \tag{6.59}$$

where $\otimes$ represents the discrete-time convolution operation. Based on Equations (6.58) and (6.59), the response of an LTI system $H\{\cdot\}$ to an arbitrary input signal, written as a weighted sum of shifted impulses $x[n] = \sum_k x[k]\,\delta[n - k]$, can be expressed in terms of the system's responses to those shifted impulses:

$$y[n] = H\{x[n]\} = \sum_{k=-\infty}^{\infty} x[k]\,H\{\delta[n - k]\} \tag{6.60}$$

The pictorial interpretation of Equation (6.60) is shown in Figure 6.37.
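For finite-length sequences, the convolution sum in Equation (6.58) reduces to a pair of nested loops. The following is a minimal sketch, assuming that both sequences start at index 0 and are zero outside their stored range; the names `convolution_sample` and `convolve_direct` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One output sample y[n] of the convolution sum in Equation (6.58).
 * Assumption: x[k] = 0 outside 0..x_len-1 and h[k] = 0 outside 0..h_len-1. */
double convolution_sample(const double *x, size_t x_len,
                          const double *h, size_t h_len, size_t n)
{
    double acc = 0.0;

    assert(x != NULL && h != NULL && x_len > 0 && h_len > 0);
    for (size_t k = 0; k < x_len && k <= n; k++) {
        /* h[n - k] contributes only when n - k lies inside h */
        if (n - k < h_len)
            acc += x[k] * h[n - k];
    }
    return acc;
}

/* Full convolution output; y must have room for x_len + h_len - 1 samples. */
void convolve_direct(const double *x, size_t x_len,
                     const double *h, size_t h_len, double *y)
{
    assert(y != NULL);
    for (size_t n = 0; n < x_len + h_len - 1; n++)
        y[n] = convolution_sample(x, x_len, h, h_len, n);
}
```

Passing a unit impulse (a single sample of value 1) as `x` returns `h` itself, which is the discrete-time statement that the impulse response describes the system.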
A simple way to compute the convolution sum in Equation (6.58) is to first obtain the mirror image of the impulse response $h[n]$ (i.e., the mirror image of $h[k]$ is $h[-k]$), and then correlate the input samples $x[k]$ with the shifted mirror-image samples $h[n - k]$ (multiplying sample by sample and summing), where $-\infty < k < \infty$. This is illustrated in Example 6.3. If the length of the input sequence $x[n]$ is $M$ and the length of the impulse response $h[n]$ is $L$, then the length of the convolution output $y[n]$ is $N = M + L - 1$. The convolution operation is commutative, meaning that

$$y[n] = x[n] \otimes h[n] = h[n] \otimes x[n] \tag{6.61}$$

6.5.3 DFT Based Convolution Computation

One important property of the convolution operation is that convolution in the time domain becomes multiplication in the frequency domain:

$$x[n] \otimes h[n] \Leftrightarrow X[k] \cdot H[k] \tag{6.62}$$

where $X[k] = \mathrm{DFT}\{x[n]\}$ and $H[k] = \mathrm{DFT}\{h[n]\}$; that is, $Y[k] = X[k] \cdot H[k]$ or $H[k] = Y[k]/X[k]$. Here, $H[k]$ is called the system transfer function. Based on Equation (6.62),

$$x[n] \otimes h[n] = \sum_{k=-\infty}^{\infty} x[k]\,h[n - k]$$

$$\mathrm{DFT}\{x[n] \otimes h[n]\} = \mathrm{DFT}\left\{\sum_{k=-\infty}^{\infty} x[k]\,h[n - k]\right\} \tag{6.63}$$

[Figure 6.37: Discrete-time system response computation by convolution operation.]

If the number of samples in the input signal $x[n]$ is $M$, the number of samples in the impulse response $h[n]$ is $L$, and $N = M + L - 1$, then, after some manipulations, Equation (6.63) can be written as follows:

$$\mathrm{DFT}\{x[n] \otimes h[n]\} = \mathrm{DFT}\{x[n]\} \cdot \mathrm{DFT}\{h[n]\} = X[k] \cdot H[k], \quad 0 \le n \le N-1, \; 0 \le k \le N-1$$

Now, by applying the IDFT to both sides of the previous equation, we have

$$x[n] (\circ) h[n] = \mathrm{IDFT}\{X[k] \cdot H[k]\}, \quad 0 \le n \le N-1, \; 0 \le k \le N-1 \tag{6.64}$$

where $(\circ)$ is the circular convolution operator.
The circular convolution output $y[n]$ of the two sequences $x[n]$ and $h[n]$ is defined as follows:

$$y[n] = x[n] (\circ) h[n] = \sum_{k=0}^{N-1} x[k]\,h[(n - k) \bmod N] \tag{6.65}$$

Based on Equation (6.65), we can see that DFT-based convolution assumes periodicity in the input sequences. Consequently, to obtain the correct (linear) convolution output, we must use a DFT of length at least $N = M + L - 1$ when computing the convolution by the DFT method. For large values of $M$ and $L$, computing the convolution sum directly using Equation (6.58) is computationally expensive. Because we can compute the DFT faster using FFT algorithms (discussed in the next chapter), performing convolution on DSPs using Equation (6.65) can result in huge savings in computation. Examples of computing the convolution sum using Equations (6.58) and (6.65) are provided in Examples 6.3 and 6.4, respectively. Note that the end results in these examples are the same.

■ Example 6.3

Assume that the input signal is $x[n] = [1, 2, -1, -3, -1, 1, 2, 4, 2, -1]$ and the impulse response is $h[n] = [1, 3, 2]$. The convolution sum is computed as follows:

$$y[n] = \sum_{k=-\infty}^{\infty} x[k]\,h[n - k]$$

Given that $M = 10$ and $L = 3$, we will have a total of $N = M + L - 1 = 10 + 3 - 1 = 12$ samples in the convolution output.
The output samples $y[n]$ are obtained by sliding the mirrored impulse response (2, 3, 1, that is, $h[2], h[1], h[0]$) along the input one position per output sample and summing the products; input samples outside $0 \le k \le 9$ are taken as zero:

x[k]: 1, 2, −1, −3, −1, 1, 2, 4, 2, −1
h[−k]: 2, 3, 1

y[0] = x[0]·h[0] = 1 × 1 = 1
y[1] = x[0]·h[1] + x[1]·h[0] = 1 × 3 + 2 × 1 = 5
y[2] = x[0]·h[2] + x[1]·h[1] + x[2]·h[0] = 1 × 2 + 2 × 3 − 1 × 1 = 7
y[3] = x[1]·h[2] + x[2]·h[1] + x[3]·h[0] = 2 × 2 − 1 × 3 − 3 × 1 = −2
y[4] = x[2]·h[2] + x[3]·h[1] + x[4]·h[0] = −1 × 2 − 3 × 3 − 1 × 1 = −12
y[5] = x[3]·h[2] + x[4]·h[1] + x[5]·h[0] = −3 × 2 − 1 × 3 + 1 × 1 = −8
...
y[10] = x[8]·h[2] + x[9]·h[1] = 2 × 2 − 1 × 3 = 1
y[11] = x[9]·h[2] = −1 × 2 = −2

y[n] = [1, 5, 7, −2, −12, −8, 3, 12, 18, 13, 1, −2]
■

■ Example 6.4

Using the same input signal $x[n]$ and impulse response $h[n]$ as in Example 6.3, we compute the convolution sum using the DFT and IDFT pair as follows:

x[n]: 1, 2, −1, −3, −1, 1, 2, 4, 2, −1
h[n]: 1, 3, 2

M = 10, L = 3, N = M + L − 1 = 12

Hence, we use the 12-point DFT and 12-point IDFT in computing the convolution sum.
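Before going through the DFT values, the role of the length $N = M + L - 1$ can be checked directly from the circular convolution sum in Equation (6.65), without any transform. The sketch below assumes that the caller has already extended both sequences with zeros to the same length $N$; the function name `circular_convolution_sample` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One output sample of the N-point circular convolution in Equation (6.65).
 * Assumption: x and h each hold N samples, zero-padded by the caller. */
double circular_convolution_sample(const double *x, const double *h,
                                   size_t N, size_t n)
{
    double acc = 0.0;

    assert(x != NULL && h != NULL && N > 0 && n < N);
    for (size_t k = 0; k < N; k++)
        acc += x[k] * h[(n + N - k) % N];   /* (n - k) mod N, kept non-negative */
    return acc;
}
```

With the sequences of this example extended to $N = 12$ samples, the function returns 1 for $n = 0$ and −2 for $n = 11$, matching the linear convolution. If only $N = 10$ points are used instead, the tail of the result wraps around and the sample at $n = 0$ becomes 2 rather than 1.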
Before applying the 12-point DFT, we make the lengths of the arrays $x[n]$ and $h[n]$ equal to 12 by padding two zeros to $x[n]$ and nine zeros to $h[n]$ as follows:

x[n]: 1, 2, −1, −3, −1, 1, 2, 4, 2, −1, 0, 0
h[n]: 1, 3, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0

$$X[k] = \mathrm{DFT}\{x[n]\} = \sum_{n=0}^{N-1} x[n]\,e^{-j2\pi kn/N}$$

X[k], k = 0, ..., 11:
6.0000 + 0.0000i, −4.5981 + 5.9641i, 10.5000 − 6.0622i, 1.0000 − 1.0000i,
−4.5000 − 2.5981i, 0.5981 − 0.9641i, 0.0000 + 0.0000i, 0.5981 + 0.9641i,
−4.5000 + 2.5981i, 1.0000 + 1.0000i, 10.5000 + 6.0622i, −4.5981 − 5.9641i

$$H[k] = \mathrm{DFT}\{h[n]\} = \sum_{n=0}^{N-1} h[n]\,e^{-j2\pi kn/N}$$

H[k], k = 0, ..., 11:
6.0000 + 0.0000i, 4.5981 − 3.2321i, 1.5000 − 4.3301i, −1.0000 − 3.0000i,
−1.5000 − 0.8660i, −0.5981 + 0.2321i, 0.0000 + 0.0000i, −0.5981 − 0.2321i,
−1.5000 + 0.8660i, −1.0000 + 3.0000i, 1.5000 + 4.3301i, 4.5981 + 3.2321i

Y[k] = X[k] · H[k], k = 0, ..., 11:
36.0000 + 0.0000i, −1.8660 + 42.2846i, −10.5000 − 54.5596i, −4.0000 − 2.0000i,
4.5000 + 7.7942i, −0.1340 + 0.7154i, 0.0000 + 0.0000i, −0.1340 − 0.7154i,
4.5000 − 7.7942i, −4.0000 + 2.0000i, −10.5000 + 54.5596i, −1.8660 − 42.2846i

$$y[n] = \mathrm{IDFT}\{Y[k]\} = \frac{1}{N}\sum_{k=0}^{N-1} Y[k]\,e^{j2\pi kn/N}$$

y[n], n = 0, ..., 11:
1.0000, 5.0000, 7.0000, −2.0000, −12.0000, −8.0000, 3.0000, 12.0000, 18.0000, 13.0000, 1.0000, −2.0000
■

6.6 Generalized Fourier Transforms

With Fourier methods, we are able to decompose an arbitrary signal in terms of sinusoids. Depending on the range of frequency components (or sinusoids) present in an LTI system impulse response, we get the corresponding output signal from the LTI system for an arbitrary input signal. However, as discussed previously, the Fourier transform of an arbitrary impulse response may contain infinitely many frequency components, and handling or representing such LTI systems would be too difficult. Moreover, most physical LTI systems (e.g., electric circuits, usually characterized by differential equations) have a particular kind of impulse response that decays with time, as shown in Figure 6.38. This kind of impulse response contains both sinusoids and exponentials of infinite length.
To handle signals of this type and to represent systems with this kind of impulse response more compactly, as well as to better understand system behavior, we use generalized Fourier transforms known as the Laplace transform and the z-transform. Another motivation for introducing this generalization is that the Fourier transform does not converge for all sequences, and a generalization of the Fourier transform that encompasses a broader class of signals is useful. The Laplace transform is defined for continuous-time signals, whereas the z-transform is defined for discrete-time signals.

[Figure 6.38: The impulse response shape of most real-world LTI systems.]

6.6.1 Laplace Transform

For a signal $x(t)$, the Laplace transform $X(s)$ is defined by

$$X(s) = \int_{-\infty}^{\infty} x(t)\,e^{-st}\,dt \tag{6.66}$$

where $e^{-st}$ is called a complex exponential, which represents both sinusoid and exponential characteristics. With the Laplace transform, we are no longer in the frequency domain. We move to a two-dimensional s-plane, where $s = \sigma + j\omega$ is formed from two parameters: $\sigma$ (an exponential decay constant), which is represented on the x-axis, and $\omega$ (the sinusoid frequency), which is represented on the y-axis. The inverse Laplace transform produces the time-domain signal $x(t)$ from the s-plane response as follows:

$$x(t) = \frac{1}{j2\pi}\int_{c - j\infty}^{c + j\infty} X(s)\,e^{st}\,ds \tag{6.67}$$

The Laplace transform in Equation (6.66) may not converge for all values of $s$; the region of $s$ for which the integral converges is called the region of convergence. Based on Equation (6.66),

$$X(s) = \int_{-\infty}^{\infty} x(t)\,e^{-st}\,dt = \int_{-\infty}^{\infty} x(t)\,e^{-(\sigma + j\omega)t}\,dt = \int_{-\infty}^{\infty} x(t)\,e^{-\sigma t}e^{-j\omega t}\,dt$$

If $\sigma = 0$ (i.e., by evaluating the Laplace transform along the y-axis), then the preceding equation leads to our Fourier transform equation as follows:

$$X(j\omega) = \int_{-\infty}^{\infty} x(t)\,e^{-j\omega t}\,dt$$

Therefore, from a mathematical perspective, the Fourier transform is a particular case of the Laplace transform.
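As a worked illustration of the region of convergence and of the $\sigma = 0$ special case, consider a decaying exponential that starts at $t = 0$. This signal, $x(t) = e^{-at}$ for $t \ge 0$ and zero otherwise, with a real constant $a > 0$, is chosen here for illustration and is not taken from the text above.

```latex
\begin{aligned}
X(s) &= \int_{0}^{\infty} e^{-at}\,e^{-st}\,dt
      = \int_{0}^{\infty} e^{-(s+a)t}\,dt
      = \frac{1}{s+a},
      \qquad \text{valid only for } \sigma = \operatorname{Re}(s) > -a, \\[4pt]
X(j\omega) &= \left.\frac{1}{s+a}\right|_{s=j\omega}
      = \frac{1}{a + j\omega},
      \qquad \text{since } \sigma = 0 > -a \text{ lies inside the region of convergence.}
\end{aligned}
```

For $a \le 0$ the line $\sigma = 0$ falls outside the region of convergence, so the Laplace transform still exists for $\sigma > -a$ while the Fourier integral of this signal does not converge.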
The Fourier transform analyzes signals in terms of sinusoids, whereas the Laplace transform analyzes signals in terms of sinusoids and exponentials. In addition, the time-domain convolution of two signals maps to the multiplication of their respective Laplace transforms in the s-domain.

Transfer Function

If $X(s)$ is the Laplace transform of the system input signal $x(t)$ and $Y(s)$ is the Laplace transform of the system output signal $y(t)$, then the system transfer function $H(s)$ is obtained as follows:

$$H(s) = \frac{Y(s)}{X(s)} \tag{6.68}$$

Poles and Zeros of LTI Systems

For an LTI system governed by differential equations, the system transfer function can be expressed as follows:

$$H(s) = \frac{b_n s^n + b_{n-1}s^{n-1} + b_{n-2}s^{n-2} + \cdots + b_1 s + b_0}{s^n + a_{n-1}s^{n-1} + a_{n-2}s^{n-2} + \cdots + a_1 s + a_0} \tag{6.69}$$

If we factor both the numerator and the denominator of Equation (6.69), then

$$H(s) = \frac{b_n (s - z_0)(s - z_1)\cdots(s - z_{n-1})}{(s - p_0)(s - p_1)\cdots(s - p_{n-1})} \tag{6.70}$$

where the $p_i$ (the roots of the denominator) are called the poles and the $z_i$ (the roots of the numerator) are called the zeros of the LTI system. The locations of the poles and zeros in the s-plane uniquely represent an LTI system, and these few parameters (i.e., poles and zeros) can completely describe the system characteristics. The magnitude of $H(s)$ becomes very large near the pole locations and very small near the zero locations. A graphic view of poles and zeros is shown in Figure 6.39 by taking a cross-section of the s-plane. Typically, the number of zeros will be equal to, or less than, the number of poles.

[Figure 6.39: Illustration of poles and zeros in system frequency response.]

Factoring polynomials of order greater than two is difficult; thus, we use a cascade of second-order stages (which can be represented with second-order polynomials) to construct larger systems.
For example, a 10th-order system can be obtained by cascading five second-order systems. The impulse response of a single second-order stage, such as the system in Equation (6.71), is easily obtained by rewriting it as the partial-fraction sum in Equation (6.72):

$$H(s) = \frac{k(s + z_0)}{(s + p_0)(s + p_1)} \tag{6.71}$$

$$H(s) = \frac{k_0}{s + p_0} + \frac{k_1}{s + p_1} \tag{6.72}$$

By taking the inverse Laplace transform of $H(s)$ in Equation (6.72), we obtain the impulse response as follows:

$$h(t) = \left(k_0 e^{-p_0 t} + k_1 e^{-p_1 t}\right)u(t) \tag{6.73}$$

where $u(t)$ is the unit step function defined in Equation (6.5). Because the locations of poles and zeros provide a complete description of the system frequency response (the frequency response is equal to the values of $H(s)$ along the imaginary axis), the Laplace transform is popularly used to design continuous-time LTI systems directly in the s-plane. In the next chapter, we will discuss the role of s-plane poles and zeros in the design of Butterworth, Chebyshev, and elliptic filters for given passband and stopband specifications.

6.6.2 z-Transform

The z-transform plays the same role in the analysis of discrete-time signals and LTI systems as the Laplace transform does in the analysis of continuous-time signals and LTI systems. In other words, the Laplace transform is a generalization of the Fourier transform, whereas the z-transform is a generalization of the discrete-time Fourier transform. In addition, convolution in the time domain results in multiplication in the z-transform domain. The z-transform of a discrete-time signal $x[n]$ is defined as the power series

$$X(z) = \sum_{n=-\infty}^{\infty} x[n]\,z^{-n} \tag{6.74}$$

where $z$ is a complex variable. By substituting $z = re^{j\omega}$ in Equation (6.74), we have

$$X(re^{j\omega}) = \sum_{n=-\infty}^{\infty} x[n]\,(re^{j\omega})^{-n} = \sum_{n=-\infty}^{\infty} \left(x[n]\,r^{-n}\right)e^{-j\omega n} \tag{6.75}$$

Thus, Equation (6.75) can be interpreted as the discrete-time Fourier transform of the product of the original signal $x[n]$ and the exponential sequence $r^{-n}$.
With the inverse z-transform, we obtain the discrete-time signal $x[n]$ from $X(z)$ using the contour integral as follows:

$$x[n] = \frac{1}{j2\pi}\oint_{C} X(z)\,z^{n-1}\,dz \tag{6.76}$$

where $C$ is any contour that lies in the region of convergence (ROC) of the z-transform and encircles the origin. Similar to how the Laplace transform deals with the differential equations of LTI systems, the z-transform deals with the difference equations defining the behavior of an LTI system. However, the mathematics of the s-plane uses rectangular coordinates, whereas the z-transform uses polar coordinates. In addition, there are correspondences (though not one-to-one) from the s-plane to the z-plane as follows:

1. The y-axis in the s-plane is mapped to the unit circle in the z-plane.
2. The left half of the s-plane is mapped to the interior of the unit circle.
3. The right half of the s-plane is mapped to the exterior of the unit circle.
4. The symmetry about the x-axis is preserved from the s-plane to the z-plane.

Because the z-transform handles sampled data, the z-plane can uniquely represent frequencies only up to half the sampling rate; frequencies above that range wrap around the circles of the z-plane. The z-transform is commonly used in the design of digital filters (see Chapter 7). Typically, we design recursive digital filters by starting with analog filters, and then we obtain the desired digital filter after a series of mathematical conversions. So, we basically map the pole-zero locations from the s-plane to the z-plane in deriving recursive digital filters from analog filters. Poles and zeros that lie on vertical lines in the s-plane lie, after mapping to the z-plane, on circles concentric with the origin.

CHAPTER 7

Transforms and Filters

In Chapter 6, the concepts of convolution and time-frequency representation of signals were introduced. In this chapter, we discuss how these concepts are implemented in digital systems using digital filters and fast transforms.
Transforms and filters are among the most powerful tools in the digital signal processing (DSP) field. Indeed, it is the development of fast versions of these computationally demanding algorithms, combined with advances in semiconductor technology, that allows us to perform most media processing tasks in real time. Fast Fourier transform (FFT) algorithms are used to compute the discrete Fourier transform (DFT) with fewer computations. Digital filters can achieve performance close to the desired system specifications. They also virtually eliminate the filter errors (due to aging, temperature, etc.) that usually degrade the performance of analog filters. One disadvantage of digital filters is that they are slower because of the "block" nature of the processing, and they cannot handle very high frequencies when compared to analog filters. In this chapter, we will discuss the simulation and implementation techniques of discrete transforms and digital filters.

Various transforms were introduced in Chapter 6, including the Fourier series, the discrete Fourier transform, the discrete cosine transform (DCT), the Laplace transform, and the z-transform. These transforms uniquely map time-domain signals to their frequency-domain representations. The inverses of these transforms likewise map a signal's frequency-domain representation back to the time domain. Which transform we use depends on the signal's nature (periodic, nonperiodic, exponential sinusoid, etc.) and signal type (continuous-time, discrete-time). Of all transforms discussed in Chapter 6, only the DFT and DCT can be implemented using digital systems (the others assume analog signals). The DFT is used in a wide range of signal-processing applications (e.g., telecommunications, medical, geophysics). The DCT is more heavily used in image and video compression applications (e.g., JPEG, MPEG-2, MPEG-4). The DFT and DCT are by far the most commonly used transforms in media processing.
Therefore, the discussion in this chapter is restricted to fast versions of these algorithms and to their fixed-point simulation and efficient implementation techniques.

A filter is a system that allows some frequency components of a signal to pass through while attenuating other components. Consider two extremes. One extreme is an amplifier, which allows all frequencies to pass through unattenuated. The other extreme is an oscillator, which outputs only a single frequency. Filters lie somewhere in between. For example, linear time-invariant (LTI) systems, as discussed in Section 6.5, are filters. In fact, all filters discussed in this chapter are assumed to be LTI systems. LTI filters are completely described by their impulse response, and the output of an LTI filter is obtained by convolving the filter input with its impulse response. Applications of digital filters include telecommunications, medical signal processing, and audio/image/video processing. The two main filter types are finite-impulse-response (FIR) and infinite-impulse-response (IIR) filters. In Sections 7.3 through 7.5 we briefly discuss FIR and IIR filters, examine their specifications, and explore digital filter design, simulation, and techniques for efficient implementation.

© 2010 Elsevier Inc. All rights reserved. DOI: 10.1016/B978-1-85617-678-1.00007-7

7.1 Fast Fourier Transform

In Section 6.4.2, we briefly discussed the DFT. Here, we discuss the complexity of the DFT, and then derive a faster, less complex algorithm for computing it, the FFT. If the sequence (or discrete-time signal) x[n] consists of N samples, the DFT also produces a sequence of N samples, X[k], spaced equally in the frequency domain:

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j2\pi nk/N}, \qquad k = 0, 1, 2, \ldots, N-1 \tag{7.1}$$

where $e^{-j2\pi nk/N} = \cos(2\pi nk/N) - j\sin(2\pi nk/N)$. The DFT can be viewed as a correlation of the input signal with a set of sinusoids.
Each sinusoid evaluates the frequency content of the input signal at the sinusoid's oscillation frequency. Equation (7.1) can also be expressed in terms of matrix multiplication:

$$X_{N\times 1} = W_{N\times N} \cdot x_{N\times 1} \tag{7.2}$$

where $x_{N\times 1} = [x_0, x_1, x_2, \ldots, x_{N-1}]^T$, $X_{N\times 1} = [X_0, X_1, X_2, \ldots, X_{N-1}]^T$, and the matrix

$$W_{N\times N} = \begin{bmatrix}
1 & 1 & 1 & \cdots & 1 \\
1 & e^{-j2\pi\cdot 1\cdot 1/N} & e^{-j2\pi\cdot 1\cdot 2/N} & \cdots & e^{-j2\pi\cdot 1\cdot (N-1)/N} \\
1 & e^{-j2\pi\cdot 2\cdot 1/N} & e^{-j2\pi\cdot 2\cdot 2/N} & \cdots & e^{-j2\pi\cdot 2\cdot (N-1)/N} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & e^{-j2\pi\cdot (N-1)\cdot 1/N} & e^{-j2\pi\cdot (N-1)\cdot 2/N} & \cdots & e^{-j2\pi\cdot (N-1)\cdot (N-1)/N}
\end{bmatrix} \tag{7.3}$$

can be constructed from the N components $W_N^k = e^{-j2\pi k/N}$, k = 0, ..., N − 1, which we refer to as "twiddle factors."

DFT Computational Complexity

As seen in Equation (7.2), the matrix and vector multiplication in the DFT requires N² operations, and each operation involves one complex multiplication and one complex addition. One complex multiplication requires four real multiplications and two real additions. One complex addition requires two real additions. Thus, one operation in the DFT computation involves four real additions and four real multiplications. We can now calculate the complexity of an N-point DFT, in terms of real operations, as 4N² real multiplications and 4N² real additions.

To illustrate how this maps to real hardware, consider the reference embedded processor. On the reference processor, multiplication and addition both consume 1 cycle (see Appendix A, Section A.4, on the companion website for more details on the cycle estimation). The processor also has two MAC (multiply and accumulate) units, which together can perform two additions and two multiplications per cycle. Using these MAC units, an N = 4096-point DFT will consume approximately 33.5 million (= 2 × 4096 × 4096) cycles.

7.1.1 Fast Fourier Transforms

The FFT works by exploiting symmetry in the matrix W in Equation (7.3). Before going into the concepts involved in the FFT, let's examine the symmetry of W for N = 6, N = 7, and N = 8.
$$W_{6\times 6} = \begin{bmatrix}
1.0000 & 1.0000 & 1.0000 & 1.0000 & 1.0000 & 1.0000 \\
1.0000 & 0.5000-0.8660i & -0.5000-0.8660i & -1.0000 & -0.5000+0.8660i & 0.5000+0.8660i \\
1.0000 & -0.5000-0.8660i & -0.5000+0.8660i & 1.0000 & -0.5000-0.8660i & -0.5000+0.8660i \\
1.0000 & -1.0000 & 1.0000 & -1.0000 & 1.0000 & -1.0000 \\
1.0000 & -0.5000+0.8660i & -0.5000-0.8660i & 1.0000 & -0.5000+0.8660i & -0.5000-0.8660i \\
1.0000 & 0.5000+0.8660i & -0.5000+0.8660i & -1.0000 & -0.5000-0.8660i & 0.5000-0.8660i
\end{bmatrix}$$

$$W_{7\times 7} = \begin{bmatrix}
1.0000 & 1.0000 & 1.0000 & 1.0000 & 1.0000 & 1.0000 & 1.0000 \\
1.0000 & 0.6235-0.7818i & -0.2225-0.9749i & -0.9010-0.4339i & -0.9010+0.4339i & -0.2225+0.9749i & 0.6235+0.7818i \\
1.0000 & -0.2225-0.9749i & -0.9010+0.4339i & 0.6235+0.7818i & 0.6235-0.7818i & -0.9010-0.4339i & -0.2225+0.9749i \\
1.0000 & -0.9010-0.4339i & 0.6235+0.7818i & -0.2225-0.9749i & -0.2225+0.9749i & 0.6235-0.7818i & -0.9010+0.4339i \\
1.0000 & -0.9010+0.4339i & 0.6235-0.7818i & -0.2225+0.9749i & -0.2225-0.9749i & 0.6235+0.7818i & -0.9010-0.4339i \\
1.0000 & -0.2225+0.9749i & -0.9010-0.4339i & 0.6235-0.7818i & 0.6235+0.7818i & -0.9010+0.4339i & -0.2225-0.9749i \\
1.0000 & 0.6235+0.7818i & -0.2225+0.9749i & -0.9010+0.4339i & -0.9010-0.4339i & -0.2225-0.9749i & 0.6235-0.7818i
\end{bmatrix}$$

$$W_{8\times 8} = \begin{bmatrix}
1.0000 & 1.0000 & 1.0000 & 1.0000 & 1.0000 & 1.0000 & 1.0000 & 1.0000 \\
1.0000 & 0.7071-0.7071i & -1.0000i & -0.7071-0.7071i & -1.0000 & -0.7071+0.7071i & 1.0000i & 0.7071+0.7071i \\
1.0000 & -1.0000i & -1.0000 & 1.0000i & 1.0000 & -1.0000i & -1.0000 & 1.0000i \\
1.0000 & -0.7071-0.7071i & 1.0000i & 0.7071-0.7071i & -1.0000 & 0.7071+0.7071i & -1.0000i & -0.7071+0.7071i \\
1.0000 & -1.0000 & 1.0000 & -1.0000 & 1.0000 & -1.0000 & 1.0000 & -1.0000 \\
1.0000 & -0.7071+0.7071i & -1.0000i & 0.7071+0.7071i & -1.0000 & 0.7071-0.7071i & 1.0000i & -0.7071-0.7071i \\
1.0000 & 1.0000i & -1.0000 & -1.0000i & 1.0000 & 1.0000i & -1.0000 & -1.0000i \\
1.0000 & 0.7071+0.7071i & 1.0000i & -0.7071+0.7071i & -1.0000 & -0.7071-0.7071i & -1.0000i & 0.7071-0.7071i
\end{bmatrix}$$

Observing the matrix elements of W6×6, W7×7, and W8×8, we find symmetry (except for sign) in both the horizontal and vertical directions. Similarly, in both W6×6 and W8×8, we also find periodicity (except for sign) in both the horizontal and vertical directions. In matrix W6×6, the elements repeat (except for sign) two times in any column or row (i.e., period = N/2 = 6/2 = 3). In matrix W8×8, the elements repeat (except for sign) four times in any column or row (i.e., period = N/4 = 8/4 = 2). The N/2 period in the matrix elements (or twiddle factors) is present in all DFT twiddle-factor matrices when N is even. Similarly, the N/4 period is present in all DFT twiddle-factor matrices where N is a power of 2 (i.e., for N equal to 4, 8, 16, 32, 64, ...).

DFT Matrix Factorization

Why are the symmetry and periodicity of twiddle factors so important? They allow us to implement the DFT very efficiently. When we have repeated elements in a matrix, we can use divide-and-conquer methods to perform the matrix and vector multiplication (as seen in Equation (7.2)) with fewer multiplications. Consider, for illustration, a DFT matrix with N = 8.
The 8-point DFT twiddle-factor matrix in terms of $W_8$ ($= e^{-j2\pi/8}$, the primitive eighth root of unity) is expressed as follows:

$$W_{8\times 8} = \begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & W_8^{1} & W_8^{2} & W_8^{3} & W_8^{4} & W_8^{5} & W_8^{6} & W_8^{7} \\
1 & W_8^{2} & W_8^{4} & W_8^{6} & W_8^{8} & W_8^{10} & W_8^{12} & W_8^{14} \\
1 & W_8^{3} & W_8^{6} & W_8^{9} & W_8^{12} & W_8^{15} & W_8^{18} & W_8^{21} \\
1 & W_8^{4} & W_8^{8} & W_8^{12} & W_8^{16} & W_8^{20} & W_8^{24} & W_8^{28} \\
1 & W_8^{5} & W_8^{10} & W_8^{15} & W_8^{20} & W_8^{25} & W_8^{30} & W_8^{35} \\
1 & W_8^{6} & W_8^{12} & W_8^{18} & W_8^{24} & W_8^{30} & W_8^{36} & W_8^{42} \\
1 & W_8^{7} & W_8^{14} & W_8^{21} & W_8^{28} & W_8^{35} & W_8^{42} & W_8^{49}
\end{bmatrix} \tag{7.4}$$

To begin, we divide the matrix W8×8 into two parts, placing all even columns first followed by all odd columns. This is achieved by multiplying W8×8 with the permutation matrix A8×8, defined as follows:

$$A_{8\times 8} = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}$$

The rearranged matrix is $W'_{8\times 8} = W_{8\times 8} A_{8\times 8}$, and its elements are given in the following:

$$W'_{8\times 8} = \begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & W_8^{2} & W_8^{4} & W_8^{6} & W_8^{1} & W_8^{3} & W_8^{5} & W_8^{7} \\
1 & W_8^{4} & W_8^{8} & W_8^{12} & W_8^{2} & W_8^{6} & W_8^{10} & W_8^{14} \\
1 & W_8^{6} & W_8^{12} & W_8^{18} & W_8^{3} & W_8^{9} & W_8^{15} & W_8^{21} \\
1 & W_8^{8} & W_8^{16} & W_8^{24} & W_8^{4} & W_8^{12} & W_8^{20} & W_8^{28} \\
1 & W_8^{10} & W_8^{20} & W_8^{30} & W_8^{5} & W_8^{15} & W_8^{25} & W_8^{35} \\
1 & W_8^{12} & W_8^{24} & W_8^{36} & W_8^{6} & W_8^{18} & W_8^{30} & W_8^{42} \\
1 & W_8^{14} & W_8^{28} & W_8^{42} & W_8^{7} & W_8^{21} & W_8^{35} & W_8^{49}
\end{bmatrix}
= \begin{bmatrix} P_{4\times 4} & Q_{4\times 4} \\ R_{4\times 4} & S_{4\times 4} \end{bmatrix} \tag{7.5}$$

After careful observation of the matrices P4×4 and R4×4, we can see that

$$\begin{aligned}
&W_8^{2} = e^{-j2\pi 2/8} = e^{-j2\pi/4} = W_4^{1},\quad W_8^{4} = W_4^{2},\quad W_8^{6} = W_4^{3},\quad W_8^{12} = W_4^{6},\quad W_8^{18} = W_4^{9},\\
&W_8^{8} = e^{-j2\pi 8/8} = e^{-j2\pi} = 1 = W_4^{4},\quad W_8^{16} = W_4^{8} = e^{-j2\pi 8/4} = e^{-j2\pi 2} = 1,\quad W_8^{24} = 1,\\
&W_8^{10} = e^{-j2\pi 10/8} = e^{-j2\pi (2+8)/8} = e^{-j2\pi 2/8}\, e^{-j2\pi 8/8} = e^{-j2\pi 1/4} = W_4^{1},\\
&W_8^{20} = W_4^{2},\quad W_8^{30} = W_4^{3},\quad W_8^{12} = W_4^{2},\quad W_8^{24} = 1 = W_4^{4},\\
&W_8^{36} = W_8^{4} = W_4^{2}\cdot 1 = W_4^{6},\quad W_8^{14} = W_4^{3},\quad W_8^{28} = W_4^{6},\quad W_8^{42} = W_4^{9}.
\end{aligned}$$

Thus, P4×4 and R4×4 both represent the 4-point DFT twiddle-factor matrix:

$$P_{4\times 4} = R_{4\times 4} = W_{4\times 4} = \begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & W_4^{1} & W_4^{2} & W_4^{3} \\
1 & W_4^{2} & W_4^{4} & W_4^{6} \\
1 & W_4^{3} & W_4^{6} & W_4^{9}
\end{bmatrix}$$

Similarly, after examination of the matrices Q4×4 and S4×4, we can rewrite them as follows:

$$Q_{4\times 4} = \begin{bmatrix}
1 & 1 & 1 & 1 \\
W_8^{1} & W_8^{3} & W_8^{5} & W_8^{7} \\
W_8^{2} & W_8^{6} & W_8^{10} & W_8^{14} \\
W_8^{3} & W_8^{9} & W_8^{15} & W_8^{21}
\end{bmatrix}
= \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & W_8^{1} & 0 & 0 \\
0 & 0 & W_8^{2} & 0 \\
0 & 0 & 0 & W_8^{3}
\end{bmatrix}
\begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & W_8^{2} & W_8^{4} & W_8^{6} \\
1 & W_8^{4} & W_8^{8} & W_8^{12} \\
1 & W_8^{6} & W_8^{12} & W_8^{18}
\end{bmatrix}
= D_8 W_{4\times 4}$$

where $D_8 = \mathrm{diag}(1, W_8^{1}, W_8^{2}, W_8^{3})$, and

$$S_{4\times 4} = \begin{bmatrix}
W_8^{4} & W_8^{12} & W_8^{20} & W_8^{28} \\
W_8^{5} & W_8^{15} & W_8^{25} & W_8^{35} \\
W_8^{6} & W_8^{18} & W_8^{30} & W_8^{42} \\
W_8^{7} & W_8^{21} & W_8^{35} & W_8^{49}
\end{bmatrix}
= W_8^{4} \begin{bmatrix}
1 & 1 & 1 & 1 \\
W_8^{1} & W_8^{3} & W_8^{5} & W_8^{7} \\
W_8^{2} & W_8^{6} & W_8^{10} & W_8^{14} \\
W_8^{3} & W_8^{9} & W_8^{15} & W_8^{21}
\end{bmatrix}
= -Q_{4\times 4} = -D_8 W_{4\times 4}$$

since $W_8^{4} = -1$. Thus, Equation (7.5) can be rewritten as follows:

$$W'_{8\times 8} = \begin{bmatrix} W_{4\times 4} & D_8 W_{4\times 4} \\ W_{4\times 4} & -D_8 W_{4\times 4} \end{bmatrix} \tag{7.6}$$

Radix-2 FFT Algorithms

Once again, using the matrix A8×8, we can rearrange the input data x8×1. Let $x'_{8\times 1} = A_{8\times 8}^{T}\, x_{8\times 1}$ and let $X'_{8\times 1} = W'_{8\times 8}\, x'_{8\times 1}$. Since $A_{8\times 8} A_{8\times 8}^{T} = I_{8\times 8}$, based on Equations (7.4) and (7.6), we have

$$X_{8\times 1} = W_{8\times 8}\, x_{8\times 1} = W_{8\times 8} A_{8\times 8} A_{8\times 8}^{T}\, x_{8\times 1} = W'_{8\times 8}\, x'_{8\times 1} = X'_{8\times 1}.$$

With this, by rearranging the input data x8×1, we can compute the 8-point DFT output X8×1 using two 4-point DFTs (using Equation (7.6)). The corresponding signal flow diagram is shown in Figure 7.1. The factorization does not end here, because we can further factorize W4×4 into W2×2 in the same manner. That is, each 4-point DFT can be computed using two 2-point DFTs. The block diagram for computing an 8-point DFT using four 2-point DFTs is shown in Figure 7.2. The corresponding signal flow diagram for computing an 8-point DFT using 2-point DFTs is shown in Figure 7.3.

As shown in Figure 7.2, we compute the 8-point DFT using four 2-point DFTs in three stages. Only in the first stage do we compute the 2-point DFTs; in the second and third stages we combine the outputs of the previous stages with simple twiddle-factor multiplications (not shown in the block diagram) to get two 4-point DFT outputs and then one 8-point DFT output.
As seen in Figure 7.3, we compute the 8-point DFT in terms of 2-point DFTs in three stages (here the 2-point DFT is the smallest butterfly, enclosed in a dashed curve in the figure). In general, if N is a power of 2, then we compute the N-point DFT in log2 N stages. The number of complex multiplications in this approach is N log2 N, and half of these are multiplications by −1. Thus, we only require (N/2) log2 N complex multiplications and N log2 N complex additions to compute an N-point DFT. For N = 4096, this is only about 0.09 million cycles on the reference embedded processor. Compared to the DFT (which requires about 33.5 million cycles), this is 372 times faster!

Figure 7.1: Signal flow diagram of 8-point DFT computation using two 4-point DFTs.

Figure 7.2: Block diagram to compute 8-point DFT using four 2-point DFTs.

Figure 7.3: Signal flow diagram of decimation-in-time radix-2, 8-point DFT algorithm.

The multistage algorithm used to compute the 8-point DFT efficiently, as shown in Figure 7.3, is referred to as a decimation-in-time (DIT) radix-2 algorithm. We also have an equivalent decimation-in-frequency (DIF) radix-2 algorithm for the N-point DFT. The signal flow diagram of the DIF 8-point DFT is shown in Figure 7.4, and is exactly opposite to the flow of the DIT algorithm.
The complexity of both algorithms is exactly the same. An FFT is any algorithm that computes the DFT faster than the direct computation. Since the DFT is computed faster with radix-2 algorithms than with direct computation, we call these algorithms radix-2 FFT algorithms.

Bit Reversal

The FFT expects its input in bit-reversed order in the case of the DIT radix-2 algorithm, or produces its output in bit-reversed order in the case of the DIF radix-2 algorithm. Therefore, we discuss how the data samples are arranged at the input of the DIT radix-2 algorithm and how the appropriate output is extracted in the case of the DIF radix-2 algorithm. In the case of the DIT radix-2 algorithm, the inputs are rearranged (log2 N − 1) times in the following manner:

{x0, x1, x2, x3, x4, x5, x6, x7}
→ {[x0, x2, x4, x6], [x1, x3, x5, x7]}   (first decimation)
→ {[(x0, x4), (x2, x6)], [(x1, x5), (x3, x7)]}   (second decimation)
→ {x0, x4, x2, x6, x1, x5, x3, x7}

Figure 7.4: Signal flow diagram for decimation-in-frequency radix-2, 8-point DFT algorithm.

Table 7.1: Bit-reversal index for samples of decimation-in-time radix-2 FFT

Before Arrangement    After Arrangement
0 (000)               0 (000)
1 (001)               4 (100)
2 (010)               2 (010)
3 (011)               6 (110)
4 (100)               1 (001)
5 (101)               5 (101)
6 (110)               3 (011)
7 (111)               7 (111)

The DIF radix-2 algorithm outputs the frequency components in a particular (permuted) order, and the actual DFT output is obtained by undoing this permutation. The actual and permuted sample indices for the 8-point DFT are provided in Table 7.1, and the corresponding binary numbers are shown in brackets. From these binary numbers, it is clear that the permuted index is obtained by reversing the bits of the actual index.
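The index mapping of Table 7.1 can be computed directly by reversing the bits of each index. The helper below is a minimal sketch; the name `bit_reverse_index` and its interface are assumptions for illustration and differ from the in-place routine given later in Pcode 7.1.

```c
#include <assert.h>

/* Reverse the lowest `bits` bits of index i (hypothetical helper).
 * For an N-point FFT with N = 2^bits, this maps a linear index
 * to its bit-reversed index as in Table 7.1. */
unsigned bit_reverse_index(unsigned i, unsigned bits)
{
    unsigned r = 0;
    assert(bits <= 8 * sizeof(unsigned));
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (i & 1u);   /* move the lowest bit of i into r */
        i >>= 1;
    }
    return r;
}
```

For N = 8 the call `bit_reverse_index(i, 3)` reproduces the two columns of Table 7.1. Because the mapping is its own inverse, the same function serves both for arranging the DIT input and for undoing the DIF output permutation.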
For example, in the case of the DIT radix-2 algorithm, the input sample x3 at index 3 (binary 011) is moved to index 6 (binary 110), and the sample x4 at index 4 (binary 100) is moved to index 1 (binary 001) after rearrangement.

Radix-4 FFT Algorithm

When the DFT length N is a power of 4, we can further reduce the number of complex operations using a radix-4 FFT algorithm. With a radix-4 FFT, we first divide the data into four data streams and form a 4 × N/4 matrix. We compute four N/4-point DFTs, multiply with twiddle factors, and transpose the matrix (although this is a complex matrix, we transpose it without conjugation). Then, we obtain the N-point DFT by computing 4-point DFTs on the transposed (N/4) × 4 matrix. This process is repeated for the next stage by dividing each N/4-point DFT into four N/16 streams, and it is continued until the length of the DFT reaches 4, as illustrated in Figure 7.5.

As an example, consider a 16-point DFT computed using a radix-4 FFT (i.e., N = 16). Let x = {x[n]}, 0 ≤ n ≤ N − 1, be the input vector with 16 discrete-time samples. We compute the 16-point DFT output X = {X[k]}, 0 ≤ k ≤ N − 1, for input x using a radix-4 FFT algorithm as shown in Figure 7.5. We first divide x[n] into four sequences and form a 4 × N/4 two-dimensional array or matrix as x(u, v) = x[4v + u], where 0 ≤ v ≤ N/4 − 1 and 0 ≤ u ≤ 3.

Figure 7.5: An illustration of radix-4 FFT computation. (Stage S = 1 computes N/4 4-point DFTs, stage S = 2 makes N/16 16-point DFTs, and so on until stage S = log4 N − 1 makes four N/4-point DFTs and stage S = log4 N makes one N-point DFT. The step T between stages multiplies with twiddle factors, transposes without complex conjugation, and computes 4-point DFTs on the transpose.)
The samples of x(u, v) in two-dimensional space for N = 16 follow:

$$\begin{bmatrix}
x[0] & x[4] & x[8] & x[12] \\
x[1] & x[5] & x[9] & x[13] \\
x[2] & x[6] & x[10] & x[14] \\
x[3] & x[7] & x[11] & x[15]
\end{bmatrix}$$

If X(u, q) represents a row-wise N/4-point DFT of x(u, v), then

$$X(u, q) = \sum_{p=0}^{(N/4)-1} x(u, p)\, W_{N/4}^{pq} \tag{7.7}$$

For N = 16, Equation (7.7) becomes a 4-point DFT computation on the rows of x(u, v), and it can be expressed in matrix form as shown in Equation (7.8):

$$\begin{bmatrix} X(u,0) \\ X(u,1) \\ X(u,2) \\ X(u,3) \end{bmatrix}
= \begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & -j & -1 & j \\
1 & -1 & 1 & -1 \\
1 & j & -1 & -j
\end{bmatrix}
\begin{bmatrix} x(u,0) \\ x(u,1) \\ x(u,2) \\ x(u,3) \end{bmatrix} \tag{7.8}$$

The butterfly of a 4-point DFT equivalent to Equation (7.8) is shown in Figure 7.6. The 4-point DFT butterfly is also the basic butterfly in the radix-4 FFT algorithm.

Next, we multiply the N/4-point DFT output X(u, q) with the twiddle factors $W_N^{uq}$ and obtain X′(u, q) as $X'(u, q) = W_N^{uq} X(u, q)$, where 0 ≤ u ≤ 3 and 0 ≤ q ≤ N/4 − 1:

$$\begin{bmatrix}
X'(0,0) & X'(0,1) & X'(0,2) & X'(0,3) \\
X'(1,0) & X'(1,1) & X'(1,2) & X'(1,3) \\
X'(2,0) & X'(2,1) & X'(2,2) & X'(2,3) \\
X'(3,0) & X'(3,1) & X'(3,2) & X'(3,3)
\end{bmatrix}
= \begin{bmatrix}
W_{16}^{0} X(0,0) & W_{16}^{0} X(0,1) & W_{16}^{0} X(0,2) & W_{16}^{0} X(0,3) \\
W_{16}^{0} X(1,0) & W_{16}^{1} X(1,1) & W_{16}^{2} X(1,2) & W_{16}^{3} X(1,3) \\
W_{16}^{0} X(2,0) & W_{16}^{2} X(2,1) & W_{16}^{4} X(2,2) & W_{16}^{6} X(2,3) \\
W_{16}^{0} X(3,0) & W_{16}^{3} X(3,1) & W_{16}^{6} X(3,2) & W_{16}^{9} X(3,3)
\end{bmatrix}$$

Now, we compute N/4 4-point DFTs on the N/4 columns of the 4 × N/4 matrix with elements X′(u, q), and output the result as a 4 × N/4 matrix with elements X″(p, q) as given in Equation (7.9). Then, the N-point DFT of x in digit-reversed order is obtained by converting the two-dimensional indices to one-dimensional indices as X[(N/4)p + q] = X″(p, q).

Figure 7.6: Radix-4 basic butterfly signal flow diagram.
$$X''(p, q) = \sum_{m=0}^{3} X'(m, q)\, W_4^{mp}, \qquad 0 \le p \le 3 \tag{7.9}$$

$$\begin{bmatrix}
X''(0,0) & X''(0,1) & X''(0,2) & X''(0,3) \\
X''(1,0) & X''(1,1) & X''(1,2) & X''(1,3) \\
X''(2,0) & X''(2,1) & X''(2,2) & X''(2,3) \\
X''(3,0) & X''(3,1) & X''(3,2) & X''(3,3)
\end{bmatrix}
\Rightarrow
\begin{bmatrix}
X(0) & X(4) & X(8) & X(12) \\
X(1) & X(5) & X(9) & X(13) \\
X(2) & X(6) & X(10) & X(14) \\
X(3) & X(7) & X(11) & X(15)
\end{bmatrix}$$

The one-dimensional output vector X = {X[0], X[4], X[8], X[12], X[1], X[5], X[9], X[13], X[2], X[6], X[10], X[14], X[3], X[7], X[11], X[15]} is in digit-reversed order. If we apply the N/4 4-point DFTs of Equation (7.9) to the N/4 rows of the transposed matrix X′(q, u) (just a transpose, not a complex conjugate transpose), then we get the DFT output X with indices in the correct order. The 16-point radix-4 decimation-in-time algorithm with the input in normal order and the output in digit-reversed order is shown in Figure 7.7.

The complexity of a radix-4 FFT algorithm in terms of number of operations is N log2 N complex additions and (3N/8) log2 N complex multiplications. That is, the number of complex additions of the radix-4 FFT is the same as for a radix-2 FFT, but the number of complex multiplications in the radix-4 FFT is less than the number in the radix-2 FFT. Thus, if N is a power of 4, then the use of the radix-4 FFT has computational advantages. However, having a DFT whose length N is a power of 4 is not always possible. In that case we can combine both radix-4 and radix-2 FFTs. Usually, the last stage of an FFT can be either a radix-4 stage or a radix-2 stage depending on the length N. If N = 2^n and n > 3, we use only a radix-4 FFT for all stages when n is even. Otherwise, we use radix-4 for all stages except for the last stage, where we use a radix-2 FFT algorithm instead.

7.1.2 Radix FFT Fixed-Point Simulation

In this section, we discuss techniques to efficiently implement FFT algorithms on the reference embedded processor.
There are three steps in the FFT implementation: (1) data arrangement, (2) butterfly computations, and (3) combining intermediate results. In the data arrangement step, we take linearly indexed samples and output them with indices in bit-reversed order. The simulation code to perform bit reversing is given in Pcode 7.1. This code reads the complex samples with linear indices from buffer x[ ], and outputs the complex samples with the bit-reversed indices into the same input buffer x[ ]. If we have sufficient on-chip data memory, for an FFT with length N, we can compute the bit-reversed indices offline and store them in a look-up table instead of computing them in real time. Then, using the look-up table of bit-reversed indices, we access the samples in bit-reversed order from a buffer that holds them in linear order.

Figure 7.7: Sixteen-point DFT computation using decimation-in-time radix-4 FFT.

Now, with the N = 2^n (n > 3) bit-reversed complex samples in x[ ], we can simulate the combination of radix-4 and radix-2 complex FFT algorithms using 1.15 fixed-point computations (see Appendix B.1 on the companion website for more details on the fixed-point representation of real numbers). In the DFT computation, we use the twiddle-factor matrix. We could calculate this matrix on the fly, but this would be costly because the twiddle-factor computation involves floating-point computations of the nonlinear function $e^{-j2\pi nk/N}$. Instead, we precompute the twiddle factors and store the values in data memory in 1.15 format for various lengths N.
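A 1.15 complex multiplication by a twiddle factor can be sketched as follows. The helper name `cmul_q15` is hypothetical, and the value of the rounding constant `RC` is an assumption (half of one output LSB, 1 << 14); the text uses RC in the Pcode listings without defining it. The sketch also assumes that right-shifting a negative 32-bit value is an arithmetic shift, which is the common behavior but is implementation-defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed rounding constant: half an LSB of the 1.15 result. */
#define RC (1 << 14)

/* Multiply (ar + j*ai) by (br + j*bi), all in 1.15 format (hypothetical helper).
 * Products are formed in 32 bits (2.30 format) and rounded back to 1.15.
 * No saturation is applied, so the operands are assumed to have magnitudes
 * that keep the result inside the 1.15 range. */
void cmul_q15(int16_t ar, int16_t ai, int16_t br, int16_t bi,
              int16_t *pr, int16_t *pi)
{
    int32_t re, im;
    assert(pr != 0 && pi != 0);
    re = (int32_t)ar * br - (int32_t)ai * bi;   /* real part, 2.30 format      */
    im = (int32_t)ar * bi + (int32_t)ai * br;   /* imaginary part, 2.30 format */
    *pr = (int16_t)((re + RC) >> 15);           /* round and return to 1.15    */
    *pi = (int16_t)((im + RC) >> 15);
}
```

For example, multiplying 0.5 (16384) by 0.5 yields 8192, which is 0.25 in 1.15 format. Adding RC before the shift rounds to nearest instead of truncating, which reduces the bias that accumulates over the FFT stages.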
    void bit_reverse(short *x, int n)   /* x[]: n complex samples stored as {re, im} pairs */
    {
        int i, j, k, m;
        short tmp;

        m = n << 1;                     /* total number of shorts in x[] */
        j = 1;
        for (i = 1; i < m; i += 2) {
            if (j > i) {                /* swap the complex samples at i and j */
                tmp = x[j-1]; x[j-1] = x[i-1]; x[i-1] = tmp;
                tmp = x[j];   x[j]   = x[i];   x[i]   = tmp;
            }
            k = n;                      /* advance j to the next bit-reversed position */
            while (k >= 2 && j > k) {
                j -= k;
                k >>= 1;
            }
            j += k;
        }
    }

Pcode 7.1: Simulation code to compute bit-reversed indexing.

The complex FFT simulation is divided into three parts. In the first part, we compute only 4-point complex DFTs. Using the routine given in Pcode 7.2, we compute a radix-4 FFT first stage using only additions and subtractions, without any multiplications. In the second part, we compute multiple radix-4 middle stages. In these middle stages, we multiply the previous stage output with the twiddle-factor values before applying 4-point DFTs for the current stage. As the values in the first row of the twiddle-factor matrix are all 1s in all stages, we handle the first row separately without any twiddle-factor multiplications. All other rows are multiplied with twiddle factors first, and then the 4-point DFTs are computed. The simulation code for the second part of the FFT computation is given in Pcode 7.3. The while( ) loop in Pcode 7.3 runs ⌊log4 N⌋ − 1 times, where ⌊a⌋ represents the integer part of the real number a. The first for( ) loop computes 4-point DFTs for the first row of the data matrix. The second for( ) loop computes 4-point DFTs for the other rows of the data matrix after multiplying with the twiddle factors. We perform the twiddle-factor multiplication using 1.15 fixed-point computations.

In the third part of the FFT computation, depending on the DFT length N, we use either radix-2 or radix-4 butterflies to compute the last stage of the FFT. If N is a power of 4, we call the radix-4 algorithm given in Pcode 7.3. If N is only a power of 2, then we use the radix-2 algorithm as given in Pcode 7.4. In the radix-2 algorithm computation, we reuse the twiddle-factor values of the radix-4 algorithm by accessing the appropriate twiddle-factor values (except for sign).
The sign information is compensated for within the addition/subtraction operations. As each stage of the FFT introduces a gain to the output, we scale the intermediate outputs to avoid overflow.

    // void rad4_fft(short *x, short *tw, int n)
    m = n >> 2;                     // first part: first stage
    r = 2; s = 4;                   // values are fixed for N = 512-point FFT computation
    t = 6; p = 8;
    k = -p;
    for(i = 0; i < m; i++){         // 512 -> 128x4 (i.e., compute 128 4-point DFTs)
        k = k + p;
        a = x[k] + x[k+r];          b = x[k+1] + x[k+r+1];
        c = x[k] - x[k+r];          d = x[k+1] - x[k+r+1];
        e = x[k+s] + x[k+t];        f = x[k+s+1] + x[k+t+1];
        x[k] = (a + e) >> 1;        x[k+1] = (b + f) >> 1;
        a = (a - e) >> 1;           b = (b - f) >> 1;
        e = x[k+s] - x[k+t];        f = x[k+s+1] - x[k+t+1];
        x[k+s] = a;                 x[k+s+1] = b;
        x[k+r] = (c + f) >> 1;      x[k+r+1] = (d - e) >> 1;
        x[k+t] = (c - f) >> 1;      x[k+t+1] = (d + e) >> 1;
    }

Pcode 7.2: Simulation code for first stage of radix-4 complex FFT algorithm.

In the radix-4 FFT stages given in Pcodes 7.2 and 7.3, we scaled down the output of the 4-point DFTs by a factor of 2 by right shifting 1 bit within the addition/subtraction operations. We can perform this scaling of intermediate outputs for free on the reference embedded processor (see Appendix A on the companion website) by shifting the addition/subtraction result right by 1 bit using an optional mode.

7.1.3 Larger DFT Simulation

In many applications, the DFT length N is on the order of thousands of samples. For example, a DFT of length N = 2048, 4096, or 8192 is used in the DVB-H mobile TV application for performing OFDM (orthogonal frequency-division multiplexing, used in many wireless standards). In such cases, the DFT computation uses large data buffers stored in memory, and the scattered access pattern of the data in these buffers causes frequent closing and opening of DRAM pages, resulting in memory stalls. Thus, computation of longer-length DFTs requires special data arrangements to avoid memory stalls.
If we divide the larger DFT into smaller DFTs, then this memory stall problem can be resolved. For this, we borrow the idea of the radix-4 FFT algorithm, which always divides the N-point DFT into four N/4-point DFTs. In the same way, we can efficiently compute the larger DFT using the matrix FFT.

With the matrix FFT, we divide a long one-dimensional data array x[n], where 0 ≤ n ≤ N − 1, into many shorter-length blocks y(p, q) = x[qP + p], where 0 ≤ p ≤ P − 1, 0 ≤ q ≤ Q − 1, and N = PQ, arranging them in a two-dimensional matrix. We then compute Q-point DFTs on the P rows to get Y(r, s). Next, we multiply Y(r, s) with the twiddle factors and then compute P-point DFTs on the Q columns to get Z(u, v). Now, the DFT of the one-dimensional long array x[n] is obtained as X[k] = Z(u, v) with k = u(N/P) + v, where 0 ≤ k ≤ N − 1, 0 ≤ u ≤ P − 1, and 0 ≤ v ≤ Q − 1.

    // Second part: middle stages (continuation from Pcode 7.2)
    m = n >> 4; q = 3; p = p << 2; k = -p;
    u = n >> 1; v = n >> 2; u = u + v; u = u >> 1; l = u;
    r = r << 2; s = s << 2; t = t << 2;
    while(m > 1){                   // 128 -> 32x4, 32 -> 8x4, 8 -> 2x4 (for N = 512 case)
        for(i = 0; i < m; i++){     // 1x32 4-point DFTs, 1x8 4-point DFTs, 1x2 4-point DFTs (for N = 512 case)
            k = k + p;
            a = x[k] + x[k+r];          b = x[k+1] + x[k+r+1];
            c = x[k] - x[k+r];          d = x[k+1] - x[k+r+1];
            e = x[k+s] + x[k+t];        f = x[k+s+1] + x[k+t+1];
            x[k] = (a + e) >> 1;        x[k+1] = (b + f) >> 1;
            a = (a - e) >> 1;           b = (b - f) >> 1;
            e = x[k+s] - x[k+t];        f = x[k+s+1] - x[k+t+1];
            x[k+s] = a;                 x[k+s+1] = b;
            x[k+r] = (c + f) >> 1;      x[k+r+1] = (d - e) >> 1;
            x[k+t] = (c - f) >> 1;      x[k+t+1] = (d + e) >> 1;
        }   // first row computed without multiplications as all twiddle factor values are 1s
        for(i = 0; i < q; i++){     // 3, 15, 63 (for N = 512 case)
            k = k - m*p + 2;
            for(j = 0; j < m; j++){ // 3x32 4-point DFTs, 15x8 4-point DFTs, 63x2 4-point DFTs
                k = k + p;
                g = x[k+r]*tw[u];       h = x[k+r+1]*tw[u+1];   g = (g - h + RC) >> 15;
                a = x[k] + g;           c = x[k] - g;
                g = x[k+r]*tw[u+1];     h = x[k+r+1]*tw[u];     g = (g + h + RC) >> 15;
                b = x[k+1] + g;         d = x[k+1] - g;
                g = x[k+s]*tw[u+2];     h = x[k+s+1]*tw[u+3];   g = (g - h + RC) >> 15;
                v = x[k+s]*tw[u+3];     h = x[k+s+1]*tw[u+2];   h = (v + h + RC) >> 15;
                e = x[k+t]*tw[u+4];     v = x[k+t+1]*tw[u+5];   e = (e - v + RC) >> 15;
                f = x[k+t]*tw[u+5];     v = x[k+t+1]*tw[u+4];   f = (f + v + RC) >> 15;
                v = g + e;              w = h + f;
                x[k] = (a + v) >> 1;    x[k+1] = (b + w) >> 1;
                a = (a - v) >> 1;       b = (b - w) >> 1;
                e = g - e;              f = h - f;
                x[k+s] = a;             x[k+s+1] = b;
                x[k+r] = (c + f) >> 1;  x[k+r+1] = (d - e) >> 1;
                x[k+t] = (c - f) >> 1;  x[k+t+1] = (d + e) >> 1;
            }   // 4-point DFT computed after multiplying with twiddle factors
            u = u + l;
        }
        l = l >> 2; u = l; m = m >> 2; q = q << 2; q = q + 3;
        p = p << 2; r = r << 2; s = s << 2; t = t << 2; k = -p;
    }

Pcode 7.3: Simulation code for middle stages of radix-4 complex FFT algorithm.

For example, consider the computation of a DFT for data x[n] of length N = 8192. We divide N = 8192 into two integers P = 64 and Q = 128, and arrange the data x[n] in matrix form with 64 rows, each of length 128. We first compute 64 128-point DFTs row-wise and then multiply the resulting matrix with twiddle factors. We then compute 128 64-point DFTs column-wise. In this way, we avoid memory stalls due to page misses.

When P and Q are relatively prime numbers (with N = PQ) and the twiddle factors are from a Galois field, multiplication of the intermediate matrix DFT output with twiddle factors is not required in computing the N-point DFT. For example, in the Reed-Solomon erasures correction of Section 4.3, the 255-point DFT is computed with 15 17-point row DFTs followed by 17 15-point column DFTs. In this case, as 15 and 17 are relatively prime, we do not require the multiplication of the intermediate matrix DFT output with twiddle factors.
    q = q + 1; q = q >> 1; u = 2; k = 0;
    for(i = 0;i < q;i++){
        g = x[k+r]*tw[u]; h = x[k+r+1]*tw[u+1];
        g = (g - h + RC) >> 15; a = x[k] - g; x[k] = x[k] + g;
        g = x[k+r]*tw[u+1]; h = x[k+r+1]*tw[u];
        h = (g + h + RC) >> 15; b = x[k+1] - h; x[k+1] = x[k+1] + h;
        x[k+r] = a; x[k+r+1] = b;
        u+= 6; k+= 2;
    }
    for(i = 0;i < q;i++){
        g = x[k+r]*tw[u]; h = x[k+r+1]*tw[u+1];
        g = (-g - h + RC) >> 15; a = x[k] - g; x[k] = x[k] + g;
        g = x[k+r]*tw[u+1]; h = x[k+r+1]*tw[u];
        h = (g - h + RC) >> 15; b = x[k+1] - h; x[k+1] = x[k+1] + h;
        x[k+r] = a; x[k+r+1] = b;
        u-= 6; k+= 2;
    }

Pcode 7.4: Simulation code to compute the last stage of FFT (with radix-2 algorithm).

7.1.4 FFT Simulation Results

In this section, we provide the simulation results for a 16-point DFT and a 32-point DFT. We compute the 16-point DFT using two radix-4 stages and the 32-point DFT with two radix-4 stages and one radix-2 stage. We use only 3N/4 twiddle factors tw[ ] in the FFT computation, since the first row of twiddle factors is all 1s (when we arrange the twiddle factors in matrix form). Given the DFT length N, the twiddle factors are computed for k = 0, 1, ..., N/4 − 1 using the following equations:

$$\mathrm{tw}[3k] = W_N^{2k} = e^{-j2\pi \cdot 2k/N}$$
$$\mathrm{tw}[3k+1] = W_N^{k} = e^{-j2\pi k/N}$$
$$\mathrm{tw}[3k+2] = W_N^{3k} = e^{-j2\pi \cdot 3k/N}$$

We use two additional twiddle factors {0, −j}, {0, −j} in computing the last stage with the radix-2 FFT. For fixed-point computation, we represent the twiddle factors in 1.15 format.
16-point DFT

Input: 16 complex samples
{11,9},{1,7},{16,5},{9,14},{13,11},{10,13},{14,10},{3,8},
{8,3},{7,12},{4,6},{6,1},{15,15},{12,2},{2,4},{5,16}

Bit-reversed index input:
{11,9},{8,3},{13,11},{15,15},{16,5},{4,6},{14,10},{2,4},
{1,7},{7,12},{10,13},{12,2},{9,14},{6,1},{3,8},{5,16}

Twiddle factors: 12 + 2 complex samples
{32767,0},{32767,0},{32767,0},{23170,-23170},
{30273,-12539},{12539,-30273},{0,-32767},{23170,-23170},
{-23170,-23170},{-23170,-23170},{12539,-30273},{-30273,12539},
{0,-32768},{0,-32768}

FFT first stage output: Radix-4 stage
{47,38},{-1,8},{-9,-14},{7,4},{36,25},{18,-13},{4,-3},{6,11},
{30,34},{5,-3},{-14,4},{-17,-7},{23,39},{-5,15},{7,-9},{11,11}

FFT second stage and final output: Radix-4 stage
{136,136},{18,-9},{-30,-4},{-16,-1},{6,6},{-20,39},{6,-14},{22,15},
{30,-10},{-12,-19},{6,-32},{38,-15},{16,20},{10,21},{-18,-6},{-16,17}

32-point DFT

Twiddle factors: 24 + 2 complex samples
{32767,0},{32767,0},{32767,0},{30273,-12539},
{32138,-6392},{27245,-18204},{23170,-23170},{30273,-12539},
{12539,-30273},{12539,-30273},{27245,-18204},{-6392,-32138},
{0,-32767},{23170,-23170},{-23170,-23170},{-12539,-30273},
{18204,-27245},{-32138,-6392},{-23170,-23170},{12539,-30273},
{-30273,12539},{-30273,-12539},{6392,-32138},{-18204,27245},
{0,-32768},{0,-32768}

Input: 32 complex samples
{15,26},{10,4},{7,13},{9,18},{20,23},{2,7},{22,9},{6,25},
{27,16},{32,17},{23,15},{4,32},{26,27},{12,10},{8,8},{3,24},
{5,11},{19,31},{31,2},{28,5},{18,30},{16,12},{17,14},{14,29},
{21,28},{25,6},{24,22},{30,21},{29,1},{1,3},{11,20},{13,19}

Bit-reverse of input:
{15,26},{5,11},{27,16},{21,28},{20,23},{18,30},{26,27},{29,1},
{7,13},{31,2},{23,15},{24,22},{22,9},{17,14},{8,8},{11,20},
{10,4},{19,31},{32,17},{25,6},{2,7},{16,12},{12,10},{1,3},
{9,18},{28,5},{4,32},{30,21},{6,25},{14,29},{3,24},{13,19}

FFT first stage output: Radix-4 stage
{68,81},{-2,9},{-28,-7},{22,21},{93,81},{28,-4},{-17,25},{-24,-10},
{85,52},{-31,12},{-9,-22},{-17,10},{58,51},{-7,-2},{20,-5},{17,-8},
{86,58},{2,-34},{-28,12},{-20,-20},{31,32},{-7,-16},{5,6},{-21,6},
{71,76},{-8,39},{3,-30},{-30,-13},{36,97},{-3,6},{4,11},{-13,-14}

FFT second stage output: Radix-4 stage
{304,265},{-14,15},{-43,-10},{22,79},{-24,-27},{-2,51},{-51,-20},{18,-19},
{18,59},{44,-43},{37,30},{42,11},{-26,27},{-36,13},{-55,-28},{6,13},
{224,263},{-2,4},{-36,-27},{-7,22},{34,-9},{52,-32},{-46,41},{-24,9},
{10,-83},{-26,-84},{-8,41},{5,-40},{76,61},{-16,-24},{-22,-7},{-54,-71}

FFT third stage and final output: Radix-2 stage
{528,528},{-15,19},{-87,-21},{28,101},{-6,-57},{0,-10},{-31,38},{22,6},
{-65,49},{-33,-1},{78,22},{6,29},{-37,-70},{-36,42},{-37,-13},{45,93},
{80,2},{-13,11},{1,1},{16,57},{-42,3},{-4,112},{-71,-78},{14,-44},
{101,69},{121,-85},{-4,38},{78,-7},{-15,124},{-36,-16},{-73,-43},{-33,-67}

7.2 Discrete Cosine Transform

The two-dimensional (2D) discrete cosine transform (DCT) is widely used in various image and video coding applications. For instance, the 2D DCT is used in JPEG for still-image compression, in the H.261/2/3 standards for video teleconferencing applications, in MPEG-2 for DVD, MPEG-4 for HDTV, and so on. The purpose of the DCT in image and video coding standards is to reduce spatial redundancy in images or video frames, thereby allowing us to encode them using fewer bits. We could use the DFT (see Section 7.1) for image compression. However, we prefer the DCT for the following reasons:

• Image pixels are highly correlated, and the redundant (i.e., correlated) components are nicely decorrelated with a type-II DCT.
• The DCT eliminates boundary discontinuities. This is important because boundary discontinuities introduce noticeable block edge artifacts.
• The DCT has higher energy compaction. In other words, the DCT packs more energy into a smaller number of frequency components. This translates into fewer bits needed to represent the image block.
• The DCT requires only real computations. When operating on real data, as is the case with pixel data, an N-point DCT has a frequency resolution similar to a 2N-point DFT.

In this section, we first examine the DCT algorithm, deriving the popular type-II DCT and its matrix factorization. We then give a fixed-point implementation recipe for the DCT on the reference embedded processor, and discuss DCT input/output pruning. We also discuss the computational complexity and accuracy of fixed-point simulations with respect to floating-point simulations.

DCT Algorithm

The DCT obtains the frequency content of a signal/image in a similar manner to the discrete Fourier transform. There are eight variants of the DCT, of which four are commonly used. Extending the DCT to two dimensions (2D) is straightforward. We achieve the 2D DCT by performing a 1D DCT in the horizontal direction followed by another 1D DCT in the vertical direction. The DCT works on a block of data, and its proper implementation on an embedded processor reduces the overall cycle cost of image and video coding.

7.2.1 Discrete Cosine Transform

Of all discrete cosine transform variants, the type-II DCT (called DCT in this section) is the most commonly used for image/video compression. Since the 2D DCT is simply achieved using 1D DCTs (applied to the rows followed by the columns of 2D blocks, or vice versa), here we will concentrate only on the 1D DCT (or just DCT) computations. The DCT equation (see Section 6.4.3) is as follows:

$$X[k] = \sum_{n=0}^{N-1} x[n] \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right)k\right], \quad k = 0, 1, 2, \ldots, N-1 \tag{7.10}$$

To eliminate the scaling factor in the data after the inverse transform, we multiply the DCT Equation (7.10) by a k-dependent constant β_k:

$$X[k] = \beta_k \sum_{n=0}^{N-1} x[n] \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right)k\right], \quad k = 0, 1, 2, \ldots, N-1 \tag{7.11}$$

where $\beta_0 = \sqrt{1/N}$ for $k = 0$ and $\beta_k = \sqrt{2/N}$ for $1 \le k \le N-1$.

    ... >> 15; tmp4 = (r7 * s2) >> 15;
    r4 = (r4 * s2) >> 15; r7 = (r7 * c2) >> 15;
    r7 = r7 - r4; r4 = tmp3 + tmp4;
    tmp3 = (r5 * c1) >> 15; tmp4 = (r6 * s1) >> 15;
    r5 = (r5 * s1) >> 15; r6 = (r6 * c1) >> 15;
    r6 = r6 - r5; r5 = tmp3 + tmp4;
    // 3rd stage of DCT signal flow diagram
    r0 = tmp1 + tmp2; r1 = tmp1 - tmp2;
    tmp1 = (r2 * c3) >> 14; tmp2 = (r2 * s3) >> 14;    // multiply by 2
    tmp3 = (r3 * c3) >> 14; tmp4 = (r3 * s3) >> 14;    // multiply by 2
    r2 = tmp1 + tmp4; r3 = tmp3 - tmp2;
    tmp1 = r4 + r6; tmp2 = r5 + r7;
    r6 = r4 - r6; r5 = r7 - r5;
    r4 = tmp2 - tmp1; r7 = tmp2 + tmp1;
    r5 = (r5 * p) >> 14; r6 = (r6 * p) >> 14;          // multiply by 2
    // last stage
    out[0] = (r0 * q) >> 15; out[1] = (r7 * q) >> 15;
    out[2] = (r2 * q) >> 15; out[3] = (r5 * q) >> 15;
    out[4] = (r1 * q) >> 15; out[5] = (r6 * q) >> 15;
    out[6] = (r3 * q) >> 15; out[7] = (r4 * q) >> 15;

Pcode 7.6: Fixed point simulation code for an 8-point DCT.

    // 1st Stage of IDCT signal flow diagram
    r0 = in[0] + in[4]; r1 = in[0] - in[4];
    r2 = (in[2] * c3) >> 14; r3 = (in[6] * c3) >> 14;    // multiply by 2
    r4 = (in[2] * s3) >> 14; r5 = (in[6] * s3) >> 14;    // multiply by 2
    r2 = r2 - r5; r3 = r3 + r4;
    tmp1 = in[1] - in[7]; tmp2 = in[1] + in[7];
    tmp3 = (in[3] * p) >> 14; tmp4 = (in[5] * p) >> 14;  // multiply by 2
    r4 = tmp1 + tmp4; r6 = tmp1 - tmp4;
    r5 = tmp2 - tmp3; r7 = tmp2 + tmp3;
    // 2nd Stage of IDCT signal flow diagram
    tmp1 = r0; tmp2 = r1;
    r0 = tmp1 + r3; r3 = tmp1 - r3;
    r1 = tmp2 + r2; r2 = tmp2 - r2;
    tmp1 = (r5 * c1) >> 15; tmp2 = (r5 * s1) >> 15;
    tmp3 = (r6 * c1) >> 15; tmp4 = (r6 * s1) >> 15;
    r5 = tmp1 - tmp4; r6 = tmp3 + tmp2;
    tmp1 = (r4 * c2) >> 15; tmp2 = (r4 * s2) >> 15;
    tmp3 = (r7 * c2) >> 15; tmp4 = (r7 * s2) >> 15;
    r4 = tmp1 - tmp4; r7 = tmp3 + tmp2;
    // 3rd Stage of IDCT signal flow diagram
    tmp1 = r0 + r7; r7 = r0 - r7;
    tmp2 = r1 + r6; r6 = r1 - r6;
    tmp3 = r2 + r5; r5 = r2 - r5;
    tmp4 = r3 + r4; r4 = r3 - r4;
    // last stage
    out[0] = (tmp1 * q) >> 15; out[7] = (r7 * q) >> 15;
    out[1] = (tmp2 * q) >> 15; out[6] = (r6 * q) >> 15;
    out[2] = (tmp3 * q) >> 15; out[5] = (r5 * q) >> 15;
    out[3] = (tmp4 * q) >> 15; out[4] = (r4 * q) >> 15;

Pcode 7.7: Fixed point simulation code for an 8-point IDCT.

... whereas the fixed-point code uses data types that are only 16 bits in length (i.e., the "short" data type in C). This difference in the output results can be reduced by increasing the precision of the fractional part of the decimal value. In fixed-point simulations, if we assign more bits to the fractional part to get more accurate results, then there is a possibility of totally unacceptable results due to saturation or overflow. The saturation of output with fixed-point simulation is due to overflow of the integer part in arithmetic operations on data that is represented with too few bits assigned to its integer part. The number of bits required for the fractional part and the integer part depends on the range of values present in the input as well as the gain introduced by a particular algorithm.

Table 7.2: DCT simulation results

| DCT Input | DCT Floating-Point Simulation Output | DCT Fixed-Point Simulation Output (with Input 16.0) | DCT Fixed-Point Simulation Output (with Input 12.4) |
|---|---|---|---|
| x[0] = 75 | X[0] = 202.5861 | X[0] = 202.5861 | X[0] = 202.5861 |
| x[1] = 68 | X[1] = −5.9478 | X[1] = −6.3640 | X[1] = −6.0104 |
| x[2] = 69 | X[2] = 8.1236 | X[2] = 7.7782 | X[2] = 8.0433 |
| x[3] = 65 | X[3] = 3.9048 | X[3] = 4.2426 | X[3] = 3.8891 |
| x[4] = 69 | X[4] = −0.3536 | X[4] = −0.3536 | X[4] = −0.3536 |
| x[5] = 75 | X[5] = 0.6289 | X[5] = −0.7071 | X[5] = 0.5303 |
| x[6] = 75 | X[6] = 3.9061 | X[6] = 3.8891 | X[6] = 3.8891 |
| x[7] = 77 | X[7] = 1.2166 | X[7] = 1.4142 | X[7] = 1.1490 |

Table 7.3: IDCT simulation results

| IDCT Input | IDCT Floating-Point Simulation Output | IDCT Fixed-Point Simulation Output (with Input 16.0) | IDCT Fixed-Point Simulation Output (with Input 12.4) |
|---|---|---|---|
| X[0] = 202.5861 | x[0] = 75.0000 | x[0] = 73.8927 | x[0] = 74.8870 |
| X[1] = −5.9478 | x[1] = 68.0000 | x[1] = 68.5894 | x[1] = 67.9927 |
| X[2] = 8.1236 | x[2] = 69.0000 | x[2] = 68.9429 | x[2] = 69.0534 |
| X[3] = 3.9048 | x[3] = 64.9999 | x[3] = 65.4074 | x[3] = 65.0538 |
| X[4] = −0.3536 | x[4] = 68.9999 | x[4] = 69.6500 | x[4] = 69.0313 |
| X[5] = 0.6289 | x[5] = 74.9999 | x[5] = 73.1856 | x[5] = 74.9754 |
| X[6] = 3.9061 | x[6] = 75.0000 | x[6] = 74.9533 | x[6] = 74.9754 |
| X[7] = 1.2166 | x[7] = 76.9999 | x[7] = 76.7211 | x[7] = 76.9641 |

We measure the accuracy of the results as the mean square error (MSE) between the fixed-point output and the floating-point output. The MSE is computed as follows:

$$\mathrm{MSE} = \frac{1}{N} \sum_{n} \left(Y_1[n] - Y_2[n]\right)^2$$

If we replace Y1[ ] with the floating-point simulation output of the DCT and the IDCT (second columns in Tables 7.2 and 7.3) and Y2[ ] with the fixed-point simulation output of the DCT and IDCT (third columns in Tables 7.2 and 7.3), then the MSE of the fixed-point simulation for the DCT and IDCT is MSE_DCT = 0.2789 and MSE_IDCT = 0.6921, respectively.

If we want to get even more accurate results, we can increase the precision of the inputs to both the DCT and the IDCT via scaling. The DCT and IDCT flow diagrams shown in Figures 7.8 and 7.9 introduce a gain of 2√2. To obtain more accurate results, we have to consider this gain in scaling up the inputs to the DCT and IDCT. If we increase the precision of the fractional part from 0 to 4 bits (i.e., convert the input data format from 16.0 to 12.4), then the MSE of the fixed-point simulation for the DCT and IDCT, computed using the fourth-column values of Tables 7.2 and 7.3, is MSE_DCT = 0.0032 and MSE_IDCT = 0.0028. Thus, the accuracy (measured with the MSE; smaller is better) of the fixed-point simulation results in Tables 7.2 and 7.3 is higher for the fourth-column outputs (with 12.4 input format) than for the third-column outputs (with 16.0 input format).
Fixed-Point Simulation Cycle Cost

The fixed-point simulation code given in Pcodes 7.6 and 7.7 is very efficient, and on a fixed-point embedded processor it runs many times faster than the floating-point simulation code given in Pcode 7.1. See Appendix A, Section A.4, on the companion website for cycle estimation on the reference embedded processor. As the data is handled as 16-bit data, multiplication of two fixed-point numbers (including the right shift for scaling) can be achieved in 1 cycle on the reference embedded processor (this is the case with most fixed-point embedded processors). If we assume that all arithmetic operations consume 1 cycle each on fixed-point embedded processors, then the fixed-point simulation code given in Pcodes 7.6 and 7.7 for the DCT and IDCT takes approximately 50 cycles on a single-ALU fixed-point embedded processor. The cycle consumption of the DCT and IDCT drops to 25 cycles on two ALU e