机器学习.汤姆·米切尔].McGrawHill,.Tom.Mitchell.-.Machine.Learning.pdf


Machine Learning
Tom M. Mitchell

Product Details
• Hardcover: 432 pages; Dimensions (in inches): 0.75 x 10.00 x 6.50
• Publisher: McGraw-Hill Science/Engineering/Math (March 1, 1997)
• ISBN: 0070428077
• Average Customer Review: Based on 16 reviews.
• Amazon.com Sales Rank: 42,816
• Popular in: Redmond, WA (#17), Ithaca, NY (#9)

Editorial Reviews

From Book News, Inc.: An introductory text on primary approaches to machine learning and the study of computer algorithms that improve automatically through experience. Introduces basic concepts from statistics, artificial intelligence, information theory, and other disciplines as the need arises, with balanced coverage of theory and practice, and presents major algorithms with illustrations of their use. Includes chapter exercises. Online data sets and implementations of several algorithms are available on a Web site. No prior background in artificial intelligence or statistics is assumed. For advanced undergraduates and graduate students in computer science, engineering, statistics, and social sciences, as well as software professionals. Book News, Inc.®, Portland, OR

Book Info: Presents the key algorithms and theory that form the core of machine learning. Discusses such theoretical issues as "How does learning performance vary with the number of training examples presented?" and "Which learning algorithms are most appropriate for various types of learning tasks?" DLC: Computer algorithms.

Book Description: This book covers the field of machine learning, which is the study of algorithms that allow computer programs to automatically improve through experience. The book is intended to support upper level undergraduate and introductory level graduate courses in machine learning.

PREFACE

The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.
In recent years many successful machine learning applications have been developed, ranging from data-mining programs that learn to detect fraudulent credit card transactions, to information-filtering systems that learn users' reading preferences, to autonomous vehicles that learn to drive on public highways. At the same time, there have been important advances in the theory and algorithms that form the foundations of this field. The goal of this textbook is to present the key algorithms and theory that form the core of machine learning. Machine learning draws on concepts and results from many fields, including statistics, artificial intelligence, philosophy, information theory, biology, cognitive science, computational complexity, and control theory. My belief is that the best way to learn about machine learning is to view it from all of these perspectives and to understand the problem settings, algorithms, and assumptions that underlie each. In the past, this has been difficult due to the absence of a broad-based single source introduction to the field. The primary goal of this book is to provide such an introduction. Because of the interdisciplinary nature of the material, this book makes few assumptions about the background of the reader. Instead, it introduces basic concepts from statistics, artificial intelligence, information theory, and other disciplines as the need arises, focusing on just those concepts most relevant to machine learning. The book is intended for both undergraduate and graduate students in fields such as computer science, engineering, statistics, and the social sciences, and as a reference for software professionals and practitioners. Two principles that guided the writing of the book were that it should be accessible to undergraduate students and that it should contain the material I would want my own Ph.D. students to learn before beginning their doctoral research in machine learning. 
A third principle that guided the writing of this book was that it should present a balance of theory and practice. Machine learning theory attempts to answer questions such as "How does learning performance vary with the number of training examples presented?" and "Which learning algorithms are most appropriate for various types of learning tasks?" This book includes discussions of these and other theoretical issues, drawing on theoretical constructs from statistics, computational complexity, and Bayesian analysis. The practice of machine learning is covered by presenting the major algorithms in the field, along with illustrative traces of their operation.

Online data sets and implementations of several algorithms are available via the World Wide Web at http://www.cs.cmu.edu/~tom/mlbook.html. These include neural network code and data for face recognition, decision tree learning code and data for financial loan analysis, and Bayes classifier code and data for analyzing text documents. I am grateful to a number of colleagues who have helped to create these online resources, including Jason Rennie, Paul Hsiung, Jeff Shufelt, Matt Glickman, Scott Davies, Joseph O'Sullivan, Ken Lang, Andrew McCallum, and Thorsten Joachims.

ACKNOWLEDGMENTS

In writing this book, I have been fortunate to be assisted by technical experts in many of the subdisciplines that make up the field of machine learning. This book could not have been written without their help. I am deeply indebted to the following scientists who took the time to review chapter drafts and, in many cases, to tutor me and help organize chapters in their individual areas of expertise.
Avrim Blum, Jaime Carbonell, William Cohen, Greg Cooper, Mark Craven, Ken DeJong, Jerry DeJong, Tom Dietterich, Susan Epstein, Oren Etzioni, Scott Fahlman, Stephanie Forrest, David Haussler, Haym Hirsh, Rob Holte, Leslie Pack Kaelbling, Dennis Kibler, Moshe Koppel, John Koza, Miroslav Kubat, John Lafferty, Ramon Lopez de Mantaras, Sridhar Mahadevan, Stan Matwin, Andrew McCallum, Raymond Mooney, Andrew Moore, Katharina Morik, Steve Muggleton, Michael Pazzani, David Poole, Armand Prieditis, Jim Reggia, Stuart Russell, Lorenza Saitta, Claude Sammut, Jeff Schneider, Jude Shavlik, Devika Subramanian, Michael Swain, Gheorgh Tecuci, Sebastian Thrun, Peter Turney, Paul Utgoff, Manuela Veloso, Alex Waibel, Stefan Wrobel, and Yiming Yang.

I am also grateful to the many instructors and students at various universities who have field tested various drafts of this book and who have contributed their suggestions. Although there is no space to thank the hundreds of students, instructors, and others who tested earlier drafts of this book, I would like to thank the following for particularly helpful comments and discussions: Shumeet Baluja, Andrew Banas, Andy Barto, Jim Blackson, Justin Boyan, Rich Caruana, Philip Chan, Jonathan Cheyer, Lonnie Chrisman, Dayne Freitag, Geoff Gordon, Warren Greiff, Alexander Harm, Tom Ioerger, Thorsten Joachim, Atsushi Kawamura, Martina Klose, Sven Koenig, Jay Modi, Andrew Ng, Joseph O'Sullivan, Patrawadee Prasangsit, Doina Precup, Bob Price, Choon Quek, Sean Slattery, Belinda Thom, Astro Teller, Will Tracz.

I would like to thank Joan Mitchell for creating the index for the book. I also would like to thank Jean Harpley for help in editing many of the figures. Jane Loftus from ETP Harrison improved the presentation significantly through her copyediting of the manuscript and generally helped usher the manuscript through the intricacies of final production.
Eric Munson, my editor at McGraw Hill, provided encouragement and expertise in all phases of this project.

As always, the greatest debt one owes is to one's colleagues, friends, and family. In my case, this debt is especially large. I can hardly imagine a more intellectually stimulating environment and supportive set of friends than those I have at Carnegie Mellon. Among the many here who helped, I would especially like to thank Sebastian Thrun, who throughout this project was a constant source of encouragement, technical expertise, and support of all kinds. My parents, as always, encouraged and asked "Is it done yet?" at just the right times. Finally, I must thank my family: Meghan, Shannon, and Joan. They are responsible for this book in more ways than even they know. This book is dedicated to them.

Tom M. Mitchell

CHAPTER 1

INTRODUCTION

Ever since computers were invented, we have wondered whether they might be made to learn. If we could understand how to program them to learn (to improve automatically with experience), the impact would be dramatic. Imagine computers learning from medical records which treatments are most effective for new diseases, houses learning from experience to optimize energy costs based on the particular usage patterns of their occupants, or personal software assistants learning the evolving interests of their users in order to highlight especially relevant stories from the online morning newspaper. A successful understanding of how to make computers learn would open up many new uses of computers and new levels of competence and customization. And a detailed understanding of information-processing algorithms for machine learning might lead to a better understanding of human learning abilities (and disabilities) as well.

We do not yet know how to make computers learn nearly as well as people learn.
However, algorithms have been invented that are effective for certain types of learning tasks, and a theoretical understanding of learning is beginning to emerge. Many practical computer programs have been developed to exhibit useful types of learning, and significant commercial applications have begun to appear. For problems such as speech recognition, algorithms based on machine learning outperform all other approaches that have been attempted to date. In the field known as data mining, machine learning algorithms are being used routinely to discover valuable knowledge from large commercial databases containing equipment maintenance records, loan applications, financial transactions, medical records, and the like. As our understanding of computers continues to mature, it seems inevitable that machine learning will play an increasingly central role in computer science and computer technology.

A few specific achievements provide a glimpse of the state of the art: programs have been developed that successfully learn to recognize spoken words (Waibel 1989; Lee 1989), predict recovery rates of pneumonia patients (Cooper et al. 1997), detect fraudulent use of credit cards, drive autonomous vehicles on public highways (Pomerleau 1989), and play games such as backgammon at levels approaching the performance of human world champions (Tesauro 1992, 1995). Theoretical results have been developed that characterize the fundamental relationship among the number of training examples observed, the number of hypotheses under consideration, and the expected error in learned hypotheses. We are beginning to obtain initial models of human and animal learning and to understand their relationship to learning algorithms developed for computers (e.g., Laird et al. 1986; Anderson 1991; Qin et al. 1992; Chi and Bassock 1989; Ahn and Brewer 1993).
In applications, algorithms, theory, and studies of biological systems, the rate of progress has increased significantly over the past decade. Several recent applications of machine learning are summarized in Table 1.1. Langley and Simon (1995) and Rumelhart et al. (1994) survey additional applications of machine learning.

• Learning to recognize spoken words. All of the most successful speech recognition systems employ machine learning in some form. For example, the SPHINX system (e.g., Lee 1989) learns speaker-specific strategies for recognizing the primitive sounds (phonemes) and words from the observed speech signal. Neural network learning methods (e.g., Waibel et al. 1989) and methods for learning hidden Markov models (e.g., Lee 1989) are effective for automatically customizing to individual speakers, vocabularies, microphone characteristics, background noise, etc. Similar techniques have potential applications in many signal-interpretation problems.
• Learning to drive an autonomous vehicle. Machine learning methods have been used to train computer-controlled vehicles to steer correctly when driving on a variety of road types. For example, the ALVINN system (Pomerleau 1989) has used its learned strategies to drive unassisted at 70 miles per hour for 90 miles on public highways among other cars. Similar techniques have possible applications in many sensor-based control problems.
• Learning to classify new astronomical structures. Machine learning methods have been applied to a variety of large databases to learn general regularities implicit in the data. For example, decision tree learning algorithms have been used by NASA to learn how to classify celestial objects from the second Palomar Observatory Sky Survey (Fayyad et al. 1995). This system is now used to automatically classify all objects in the Sky Survey, which consists of three terabytes of image data.
• Learning to play world-class backgammon. The most successful computer programs for playing games such as backgammon are based on machine learning algorithms. For example, the world's top computer program for backgammon, TD-GAMMON (Tesauro 1992, 1995), learned its strategy by playing over one million practice games against itself. It now plays at a level competitive with the human world champion. Similar techniques have applications in many practical problems where very large search spaces must be examined efficiently.

TABLE 1.1 Some successful applications of machine learning.

This book presents the field of machine learning, describing a variety of learning paradigms, algorithms, theoretical results, and applications. Machine learning is inherently a multidisciplinary field. It draws on results from artificial intelligence, probability and statistics, computational complexity theory, control theory, information theory, philosophy, psychology, neurobiology, and other fields. Table 1.2 summarizes key ideas from each of these fields that impact the field of machine learning. While the material in this book is based on results from many diverse fields, the reader need not be an expert in any of them. Key ideas are presented from these fields using a nonspecialist's vocabulary, with unfamiliar terms and concepts introduced as the need arises.

• Artificial intelligence: Learning symbolic representations of concepts. Machine learning as a search problem. Learning as an approach to improving problem solving. Using prior knowledge together with training data to guide learning.
• Bayesian methods: Bayes' theorem as the basis for calculating probabilities of hypotheses. The naive Bayes classifier. Algorithms for estimating values of unobserved variables.
• Computational complexity theory: Theoretical bounds on the inherent complexity of different learning tasks, measured in terms of the computational effort, number of training examples, number of mistakes, etc. required in order to learn.
• Control theory: Procedures that learn to control processes in order to optimize predefined objectives and that learn to predict the next state of the process they are controlling.
• Information theory: Measures of entropy and information content. Minimum description length approaches to learning. Optimal codes and their relationship to optimal training sequences for encoding a hypothesis.
• Philosophy: Occam's razor, suggesting that the simplest hypothesis is the best. Analysis of the justification for generalizing beyond observed data.
• Psychology and neurobiology: The power law of practice, which states that over a very broad range of learning problems, people's response time improves with practice according to a power law. Neurobiological studies motivating artificial neural network models of learning.
• Statistics: Characterization of errors (e.g., bias and variance) that occur when estimating the accuracy of a hypothesis based on a limited sample of data. Confidence intervals, statistical tests.

TABLE 1.2 Some disciplines and examples of their influence on machine learning.

1.1 WELL-POSED LEARNING PROBLEMS

Let us begin our study of machine learning by considering a few learning tasks. For the purposes of this book we will define learning broadly, to include any computer program that improves its performance at some task through experience. Put more precisely,

Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

For example, a computer program that learns to play checkers might improve its performance as measured by its ability to win at the class of tasks involving playing checkers games, through experience obtained by playing games against itself. In general, to have a well-defined learning problem, we must identify these three features: the class of tasks, the measure of performance to be improved, and the source of experience.

A checkers learning problem:
• Task T: playing checkers
• Performance measure P: percent of games won against opponents
• Training experience E: playing practice games against itself

We can specify many learning problems in this fashion, such as learning to recognize handwritten words, or learning to drive a robotic automobile autonomously.

A handwriting recognition learning problem:
• Task T: recognizing and classifying handwritten words within images
• Performance measure P: percent of words correctly classified
• Training experience E: a database of handwritten words with given classifications

A robot driving learning problem:
• Task T: driving on public four-lane highways using vision sensors
• Performance measure P: average distance traveled before an error (as judged by human overseer)
• Training experience E: a sequence of images and steering commands recorded while observing a human driver

Our definition of learning is broad enough to include most tasks that we would conventionally call "learning" tasks, as we use the word in everyday language. It is also broad enough to encompass computer programs that improve from experience in quite straightforward ways. For example, a database system that allows users to update data entries would fit our definition of a learning system: it improves its performance at answering database queries, based on the experience gained from database updates. Rather than worry about whether this type of activity falls under the usual informal conversational meaning of the word "learning," we will simply adopt our technical definition of the class of programs that improve through experience. Within this class we will find many types of problems that require more or less sophisticated solutions. Our concern here is not to analyze the meaning of the English word "learning" as it is used in everyday language.
Instead, our goal is to define precisely a class of problems that encompasses interesting forms of learning, to explore algorithms that solve such problems, and to understand the fundamental structure of learning problems and processes.

1.2 DESIGNING A LEARNING SYSTEM

In order to illustrate some of the basic design issues and approaches to machine learning, let us consider designing a program to learn to play checkers, with the goal of entering it in the world checkers tournament. We adopt the obvious performance measure: the percent of games it wins in this world tournament.

1.2.1 Choosing the Training Experience

The first design choice we face is to choose the type of training experience from which our system will learn. The type of training experience available can have a significant impact on success or failure of the learner. One key attribute is whether the training experience provides direct or indirect feedback regarding the choices made by the performance system. For example, in learning to play checkers, the system might learn from direct training examples consisting of individual checkers board states and the correct move for each. Alternatively, it might have available only indirect information consisting of the move sequences and final outcomes of various games played. In this latter case, information about the correctness of specific moves early in the game must be inferred indirectly from the fact that the game was eventually won or lost. Here the learner faces an additional problem of credit assignment, or determining the degree to which each move in the sequence deserves credit or blame for the final outcome. Credit assignment can be a particularly difficult problem because the game can be lost even when early moves are optimal, if these are followed later by poor moves. Hence, learning from direct training feedback is typically easier than learning from indirect feedback.
A second important attribute of the training experience is the degree to which the learner controls the sequence of training examples. For example, the learner might rely on the teacher to select informative board states and to provide the correct move for each. Alternatively, the learner might itself propose board states that it finds particularly confusing and ask the teacher for the correct move. Or the learner may have complete control over both the board states and (indirect) training classifications, as it does when it learns by playing against itself with no teacher present. Notice in this last case the learner may choose between experimenting with novel board states that it has not yet considered, or honing its skill by playing minor variations of lines of play it currently finds most promising. Subsequent chapters consider a number of settings for learning, including settings in which training experience is provided by a random process outside the learner's control, settings in which the learner may pose various types of queries to an expert teacher, and settings in which the learner collects training examples by autonomously exploring its environment. A third important attribute of the training experience is how well it represents the distribution of examples over which the final system performance P must be measured. In general, learning is most reliable when the training examples follow a distribution similar to that of future test examples. In our checkers learning scenario, the performance metric P is the percent of games the system wins in the world tournament. If its training experience E consists only of games played against itself, there is an obvious danger that this training experience might not be fully representative of the distribution of situations over which it will later be tested. For example, the learner might never encounter certain crucial board states that are very likely to be played by the human checkers champion. 
In practice, it is often necessary to learn from a distribution of examples that is somewhat different from those on which the final system will be evaluated (e.g., the world checkers champion might not be interested in teaching the program!). Such situations are problematic because mastery of one distribution of examples will not necessarily lead to strong performance over some other distribution. We shall see that most current theory of machine learning rests on the crucial assumption that the distribution of training examples is identical to the distribution of test examples. Despite our need to make this assumption in order to obtain theoretical results, it is important to keep in mind that this assumption must often be violated in practice.

To proceed with our design, let us decide that our system will train by playing games against itself. This has the advantage that no external trainer need be present, and it therefore allows the system to generate as much training data as time permits. We now have a fully specified learning task.

A checkers learning problem:
• Task T: playing checkers
• Performance measure P: percent of games won in the world tournament
• Training experience E: games played against itself

In order to complete the design of the learning system, we must now choose
1. the exact type of knowledge to be learned
2. a representation for this target knowledge
3. a learning mechanism

1.2.2 Choosing the Target Function

The next design choice is to determine exactly what type of knowledge will be learned and how this will be used by the performance program. Let us begin with a checkers-playing program that can generate the legal moves from any board state. The program needs only to learn how to choose the best move from among these legal moves.
This learning task is representative of a large class of tasks for which the legal moves that define some large search space are known a priori, but for which the best search strategy is not known. Many optimization problems fall into this class, such as the problems of scheduling and controlling manufacturing processes where the available manufacturing steps are well understood, but the best strategy for sequencing them is not.

Given this setting where we must learn to choose among the legal moves, the most obvious choice for the type of information to be learned is a program, or function, that chooses the best move for any given board state. Let us call this function ChooseMove and use the notation $\mathit{ChooseMove} : B \to M$ to indicate that this function accepts as input any board from the set of legal board states B and produces as output some move from the set of legal moves M. Throughout our discussion of machine learning we will find it useful to reduce the problem of improving performance P at task T to the problem of learning some particular target function such as ChooseMove. The choice of the target function will therefore be a key design choice.

Although ChooseMove is an obvious choice for the target function in our example, this function will turn out to be very difficult to learn given the kind of indirect training experience available to our system. An alternative target function, and one that will turn out to be easier to learn in this setting, is an evaluation function that assigns a numerical score to any given board state. Let us call this target function V and again use the notation $V : B \to \mathbb{R}$ to denote that V maps any legal board state from the set B to some real value (we use $\mathbb{R}$ to denote the set of real numbers). We intend for this target function V to assign higher scores to better board states. If the system can successfully learn such a target function V, then it can easily use it to select the best move from any current board position.
This can be accomplished by generating the successor board state produced by every legal move, then using V to choose the best successor state and therefore the best legal move.

What exactly should be the value of the target function V for any given board state? Of course any evaluation function that assigns higher scores to better board states will do. Nevertheless, we will find it useful to define one particular target function V among the many that produce optimal play. As we shall see, this will make it easier to design a training algorithm. Let us therefore define the target value V(b) for an arbitrary board state b in B, as follows:
1. if b is a final board state that is won, then V(b) = 100
2. if b is a final board state that is lost, then V(b) = -100
3. if b is a final board state that is drawn, then V(b) = 0
4. if b is not a final state in the game, then V(b) = V(b'), where b' is the best final board state that can be achieved starting from b and playing optimally until the end of the game (assuming the opponent plays optimally, as well).

While this recursive definition specifies a value of V(b) for every board state b, this definition is not usable by our checkers player because it is not efficiently computable. Except for the trivial cases (cases 1-3) in which the game has already ended, determining the value of V(b) for a particular board state requires (case 4) searching ahead for the optimal line of play, all the way to the end of the game! Because this definition is not efficiently computable by our checkers playing program, we say that it is a nonoperational definition. The goal of learning in this case is to discover an operational description of V; that is, a description that can be used by the checkers-playing program to evaluate states and select moves within realistic time bounds.

Thus, we have reduced the learning task in this case to the problem of discovering an operational description of the ideal target function V.
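The move-selection step described above (generate the successor of every legal move, score each successor with an evaluation function, keep the best) can be sketched in a few lines. This is only an illustration under stated assumptions: `choose_move`, `legal_moves`, `apply_move`, and `evaluate` are hypothetical names that do not appear in the text, and the toy "game" on integers is a stand-in used to keep the example self-contained, not a checkers implementation.

```python
from typing import Callable, Iterable, List, TypeVar

Board = TypeVar("Board")
Move = TypeVar("Move")


def choose_move(
    board: Board,
    legal_moves: Callable[[Board], Iterable[Move]],
    apply_move: Callable[[Board, Move], Board],
    evaluate: Callable[[Board], float],
) -> Move:
    """Pick the legal move whose successor board gets the highest score.

    `evaluate` plays the role of the evaluation function: higher is better.
    Raises ValueError if there are no legal moves (max() of an empty iterable).
    """
    return max(legal_moves(board), key=lambda move: evaluate(apply_move(board, move)))


# Toy stand-in for a game (an assumption for illustration only): a "board" is
# an integer, a move adds a step to it, and boards closer to 10 score higher.
def toy_legal_moves(board: int) -> List[int]:
    return [1, 2, 5]


def toy_apply_move(board: int, move: int) -> int:
    return board + move


def toy_evaluate(board: int) -> float:
    return -abs(board - 10)


# From board 8 the successors are 9, 10, and 13; the move reaching 10 is chosen.
best = choose_move(8, toy_legal_moves, toy_apply_move, toy_evaluate)
```

The point of the design is that the learner never has to represent a move-choosing function directly: any scoring function over board states induces one through this one-step lookahead.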
It may be very difficult in general to learn such an operational form of V perfectly. In fact, we often expect learning algorithms to acquire only some approximation to the target function, and for this reason the process of learning the target function is often called function approximation. In the current discussion we will use the symbol $\hat{V}$ to refer to the function that is actually learned by our program, to distinguish it from the ideal target function V.

1.2.3 Choosing a Representation for the Target Function

Now that we have specified the ideal target function V, we must choose a representation that the learning program will use to describe the function $\hat{V}$ that it will learn. As with earlier design choices, we again have many options. We could, for example, allow the program to represent $\hat{V}$ using a large table with a distinct entry specifying the value for each distinct board state. Or we could allow it to represent $\hat{V}$ using a collection of rules that match against features of the board state, or a quadratic polynomial function of predefined board features, or an artificial neural network. In general, this choice of representation involves a crucial tradeoff. On one hand, we wish to pick a very expressive representation to allow representing as close an approximation as possible to the ideal target function V. On the other hand, the more expressive the representation, the more training data the program will require in order to choose among the alternative hypotheses it can represent.
To keep the discussion brief, let us choose a simple representation: for any given board state, the function $\hat{V}$ will be calculated as a linear combination of the following board features:
• $x_1$: the number of black pieces on the board
• $x_2$: the number of red pieces on the board
• $x_3$: the number of black kings on the board
• $x_4$: the number of red kings on the board
• $x_5$: the number of black pieces threatened by red (i.e., which can be captured on red's next turn)
• $x_6$: the number of red pieces threatened by black

Thus, our learning program will represent $\hat{V}(b)$ as a linear function of the form

$\hat{V}(b) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + w_5 x_5 + w_6 x_6$

where $w_0$ through $w_6$ are numerical coefficients, or weights, to be chosen by the learning algorithm. Learned values for the weights $w_1$ through $w_6$ will determine the relative importance of the various board features in determining the value of the board, whereas the weight $w_0$ will provide an additive constant to the board value.

To summarize our design choices thus far, we have elaborated the original formulation of the learning problem by choosing a type of training experience, a target function to be learned, and a representation for this target function. Our elaborated learning task is now

Partial design of a checkers learning program:
• Task T: playing checkers
• Performance measure P: percent of games won in the world tournament
• Training experience E: games played against itself
• Target function: $V : \mathit{Board} \to \mathbb{R}$
• Target function representation: $\hat{V}(b) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + w_5 x_5 + w_6 x_6$

The first three items above correspond to the specification of the learning task, whereas the final two items constitute design choices for the implementation of the learning program. Notice the net effect of this set of design choices is to reduce the problem of learning a checkers strategy to the problem of learning values for the coefficients $w_0$ through $w_6$ in the target function representation.
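As a concrete illustration of the linear representation chosen in Section 1.2.3, the sketch below evaluates a weighted sum of the six board features plus a constant term. The names `BoardFeatures` and `v_hat` are hypothetical, and the weight values are arbitrary numbers chosen for the example; in the design above the weights are what the learning algorithm must determine.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class BoardFeatures:
    """The six board features from the text (field names are illustrative)."""
    black_pieces: int       # x1
    red_pieces: int         # x2
    black_kings: int        # x3
    red_kings: int          # x4
    black_threatened: int   # x5: black pieces threatened by red
    red_threatened: int     # x6: red pieces threatened by black

    def as_vector(self) -> Tuple[int, ...]:
        return (
            self.black_pieces,
            self.red_pieces,
            self.black_kings,
            self.red_kings,
            self.black_threatened,
            self.red_threatened,
        )


def v_hat(weights: Sequence[float], features: BoardFeatures) -> float:
    """Compute w0 + w1*x1 + ... + w6*x6 for one board's features."""
    if len(weights) != 7:
        raise ValueError("expected 7 weights: w0 (constant) through w6")
    w0, *feature_weights = weights
    return w0 + sum(w * x for w, x in zip(feature_weights, features.as_vector()))


# Arbitrary example weights (an assumption, not learned values).
example_weights = [0.5, 1.0, -1.0, 2.0, -2.0, -0.5, 0.5]
example_board = BoardFeatures(3, 0, 1, 0, 0, 0)
score = v_hat(example_weights, example_board)  # 0.5 + 1.0*3 + 2.0*1
```

With this representation a hypothesis is just a list of seven numbers, which is what makes the later weight-tuning step tractable.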
1.2.4 Choosing a Function Approximation Algorithm

In order to learn the target function $\hat{V}$ we require a set of training examples, each describing a specific board state b and the training value $V_{train}(b)$ for b. In other words, each training example is an ordered pair of the form $\langle b, V_{train}(b) \rangle$. For instance, the following training example describes a board state b in which black has won the game (note $x_2 = 0$ indicates that red has no remaining pieces) and for which the target function value $V_{train}(b)$ is therefore +100.

$$\langle \langle x_1 = 3, x_2 = 0, x_3 = 1, x_4 = 0, x_5 = 0, x_6 = 0 \rangle, +100 \rangle$$

Below we describe a procedure that first derives such training examples from the indirect training experience available to the learner, then adjusts the weights $w_i$ to best fit these training examples.

1.2.4.1 ESTIMATING TRAINING VALUES

Recall that according to our formulation of the learning problem, the only training information available to our learner is whether the game was eventually won or lost. On the other hand, we require training examples that assign specific scores to specific board states. While it is easy to assign a value to board states that correspond to the end of the game, it is less obvious how to assign training values to the more numerous intermediate board states that occur before the game's end. Of course the fact that the game was eventually won or lost does not necessarily indicate that every board state along the game path was necessarily good or bad. For example, even if the program loses the game, it may still be the case that board states occurring early in the game should be rated very highly and that the cause of the loss was a subsequent poor move.

Despite the ambiguity inherent in estimating training values for intermediate board states, one simple approach has been found to be surprisingly successful. This approach is to assign the training value of $V_{train}(b)$ for any intermediate board state b to be $\hat{V}(Successor(b))$, where $\hat{V}$
is the learner's current approximation to V and where Successor(b) denotes the next board state following b for which it is again the program's turn to move (i.e., the board state following the program's move and the opponent's response). This rule for estimating training values can be summarized as

Rule for estimating training values.

$$V_{train}(b) \leftarrow \hat{V}(Successor(b)) \qquad (1.1)$$

While it may seem strange to use the current version of $\hat{V}$ to estimate training values that will be used to refine this very same function, notice that we are using estimates of the value of the Successor(b) to estimate the value of board state b. Intuitively, we can see this will make sense if $\hat{V}$ tends to be more accurate for board states closer to game's end. In fact, under certain conditions (discussed in Chapter 13) the approach of iteratively estimating training values based on estimates of successor state values can be proven to converge toward perfect estimates of $V_{train}$.

1.2.4.2 ADJUSTING THE WEIGHTS

All that remains is to specify the learning algorithm for choosing the weights $w_i$ to best fit the set of training examples $\{\langle b, V_{train}(b) \rangle\}$. As a first step we must define what we mean by the best fit to the training data. One common approach is to define the best hypothesis, or set of weights, as that which minimizes the squared error E between the training values and the values predicted by the hypothesis $\hat{V}$.

$$E \equiv \sum_{\langle b, V_{train}(b) \rangle \in \text{training examples}} \left( V_{train}(b) - \hat{V}(b) \right)^2$$

Thus, we seek the weights, or equivalently the $\hat{V}$, that minimize E for the observed training examples. Chapter 6 discusses settings in which minimizing the sum of squared errors is equivalent to finding the most probable hypothesis given the observed training data.

Several algorithms are known for finding weights of a linear function that minimize E defined in this way. In our case, we require an algorithm that will incrementally refine the weights as new training examples become available and that will be robust to errors in these estimated training values.
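The two ideas above, estimating training values from successor states and measuring fit by the squared error E, can be sketched as follows. This is only an illustration under stated assumptions: a game is represented as a list of feature vectors for the successive boards at which it is the program's turn, the last entry is the final board, and the final board's value (for example +100 for a win) is passed in by the caller. The function names are hypothetical.

```python
def v_hat(weights, features):
    """Linear evaluation w0 + w1*x1 + ... + w6*x6 (the learned function)."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))


def training_values(trace, final_value, weights):
    """Pair each board in the trace with an estimated training value.

    Intermediate boards get the current estimate for their successor;
    the final board gets the known game outcome `final_value`.
    """
    examples = []
    for i, board in enumerate(trace):
        if i == len(trace) - 1:
            target = final_value
        else:
            target = v_hat(weights, trace[i + 1])
        examples.append((board, target))
    return examples


def squared_error(examples, weights):
    """Sum of squared differences between training values and predictions."""
    return sum((target - v_hat(weights, board)) ** 2 for board, target in examples)


# Hypothetical weights and a three-board game that black wins.
weights = [0.0, 1.0, -1.0, 0.0, 0.0, 0.0, 0.0]
trace = [[3, 3, 0, 0, 0, 0], [3, 1, 0, 0, 0, 0], [3, 0, 1, 0, 0, 0]]
examples = training_values(trace, 100.0, weights)
print([target for _, target in examples])
print(squared_error(examples, weights))
```

Note that the training values depend on the current weights, so they would be recomputed as the weights change.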
One such algorithm is called the least mean squares, or LMS, training rule. For each observed training example it adjusts the weights a small amount in the direction that reduces the error on this training example. As discussed in Chapter 4, this algorithm can be viewed as performing a stochastic gradient-descent search through the space of possible hypotheses (weight values) to minimize the squared error E. The LMS algorithm is defined as follows:

LMS weight update rule.

For each training example $\langle b, V_{train}(b) \rangle$
• Use the current weights to calculate $\hat{V}(b)$
• For each weight $w_i$, update it as

$$w_i \leftarrow w_i + \eta \, (V_{train}(b) - \hat{V}(b)) \, x_i$$

Here $\eta$ is a small constant (e.g., 0.1) that moderates the size of the weight update. To get an intuitive understanding for why this weight update rule works, notice that when the error $(V_{train}(b) - \hat{V}(b))$ is zero, no weights are changed. When $(V_{train}(b) - \hat{V}(b))$ is positive (i.e., when $\hat{V}(b)$ is too low), then each weight is increased in proportion to the value of its corresponding feature. This will raise the value of $\hat{V}(b)$, reducing the error. Notice that if the value of some feature $x_i$ is zero, then its weight is not altered regardless of the error, so that the only weights updated are those whose features actually occur on the training example board. Surprisingly, in certain settings this simple weight-tuning method can be proven to converge to the least squared error approximation to the $V_{train}$ values (as discussed in Chapter 4).

1.2.5 The Final Design

The final design of our checkers learning system can be naturally described by four distinct program modules that represent the central components in many learning systems. These four modules, summarized in Figure 1.1, are as follows:

• The Performance System is the module that must solve the given performance task, in this case playing checkers, by using the learned target function(s). It takes an instance of a new problem (new game) as input and produces a trace of its solution (game history) as output.
In our case, the strategy used by the Performance System to select its next move at each step is determined by the learned $\hat{V}$ evaluation function. Therefore, we expect its performance to improve as this evaluation function becomes increasingly accurate.
• The Critic takes as input the history or trace of the game and produces as output a set of training examples of the target function. As shown in the diagram, each training example in this case corresponds to some game state in the trace, along with an estimate $V_{train}$ of the target function value for this example. In our example, the Critic corresponds to the training rule given by Equation (1.1).
• The Generalizer takes as input the training examples and produces an output hypothesis that is its estimate of the target function. It generalizes from the specific training examples, hypothesizing a general function that covers these examples and other cases beyond the training examples. In our example, the Generalizer corresponds to the LMS algorithm, and the output hypothesis is the function $\hat{V}$ described by the learned weights $w_0, \ldots, w_6$.
• The Experiment Generator takes as input the current hypothesis (currently learned function) and outputs a new problem (i.e., initial board state) for the Performance System to explore. Its role is to pick new practice problems that will maximize the learning rate of the overall system. In our example, the Experiment Generator follows a very simple strategy: It always proposes the same initial game board to begin a new game. More sophisticated strategies could involve creating board positions designed to explore particular regions of the state space.

[FIGURE 1.1 Final design of the checkers learning program. The Experiment Generator passes a new problem (initial game board) to the Performance System, which passes a solution trace (game history) to the Critic, which passes training examples to the Generalizer, which passes a hypothesis ($\hat{V}$) back to the Experiment Generator.]
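As a rough illustration of the Generalizer in this design, the LMS weight update rule from Section 1.2.4.2 might be sketched as below. The function names are hypothetical, and one detail is an assumption rather than something the text states: the additive constant w0 is updated as if it had a constant feature x0 = 1.

```python
def v_hat(weights, features):
    """Linear evaluation w0 + w1*x1 + ... + w6*x6 (the learned function)."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))


def lms_update(weights, features, v_train, eta=0.1):
    """Return new weights after one LMS step on one training example.

    Each weight moves by eta * (v_train - v_hat) * x_i.
    Assumption: w0 is paired with a constant feature x0 = 1.
    """
    error = v_train - v_hat(weights, features)
    xs = [1.0] + list(features)
    return [w + eta * error * x for w, x in zip(weights, xs)]


# Start from all-zero weights and apply one update for a won position.
weights = [0.0] * 7
features = [3, 0, 1, 0, 0, 0]
new_weights = lms_update(weights, features, v_train=100.0, eta=0.01)

# The prediction moves toward the training value, and weights whose
# features are zero on this board are left unchanged.
print(v_hat(weights, features), "->", v_hat(new_weights, features))
print(new_weights)
```

The function returns a new list rather than mutating its input, which makes it easy to compare the weights before and after a single step.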
Together, the design choices we made for our checkers program produce specific instantiations for the performance system, critic, generalizer, and experiment generator. Many machine learning systems can be usefully characterized in terms of these four generic modules.

The sequence of design choices made for the checkers program is summarized in Figure 1.2. These design choices have constrained the learning task in a number of ways. We have restricted the type of knowledge that can be acquired to a single linear evaluation function. Furthermore, we have constrained this evaluation function to depend on only the six specific board features provided. If the true target function V can indeed be represented by a linear combination of these particular features, then our program has a good chance to learn it. If not, then the best we can hope for is that it will learn a good approximation, since a program can certainly never learn anything that it cannot at least represent.

[FIGURE 1.2 Summary of choices in designing the checkers learning program: determine type of training experience; determine target function; determine representation of learned function (e.g., linear function of six features, artificial neural network); determine learning algorithm.]

Let us suppose that a good approximation to the true V function can, in fact, be represented in this form. The question then arises as to whether this learning technique is guaranteed to find one. Chapter 13 provides a theoretical analysis showing that under rather restrictive assumptions, variations on this approach do indeed converge to the desired evaluation function for certain types of search problems. Fortunately, practical experience indicates that this approach to learning evaluation functions is often successful, even outside the range of situations for which such guarantees can be proven.

Would the program we have designed be able to learn well enough to beat the human checkers world champion? Probably not.
In part, this is because the linear function representation for $\hat{V}$ is too simple a representation to capture well the nuances of the game. However, given a more sophisticated representation for the target function, this general approach can, in fact, be quite successful. For example, Tesauro (1992, 1995) reports a similar design for a program that learns to play the game of backgammon, by learning a very similar evaluation function over states of the game. His program represents the learned evaluation function using an artificial neural network that considers the complete description of the board state rather than a subset of board features. After training on over one million self-generated training games, his program was able to play very competitively with top-ranked human backgammon players.

Of course we could have designed many alternative algorithms for this checkers learning task. One might, for example, simply store the given training examples, then try to find the "closest" stored situation to match any new situation (nearest neighbor algorithm, Chapter 8). Or we might generate a large number of candidate checkers programs and allow them to play against each other, keeping only the most successful programs and further elaborating or mutating these in a kind of simulated evolution (genetic algorithms, Chapter 9). Humans seem to follow yet a different approach to learning strategies, in which they analyze, or explain to themselves, the reasons underlying specific successes and failures encountered during play (explanation-based learning, Chapter 11). Our design is simply one of many, presented here to ground our discussion of the decisions that must go into designing a learning method for a specific class of tasks.
1.3 PERSPECTIVES AND ISSUES IN MACHINE LEARNING

One useful perspective on machine learning is that it involves searching a very large space of possible hypotheses to determine one that best fits the observed data and any prior knowledge held by the learner. For example, consider the space of hypotheses that could in principle be output by the above checkers learner. This hypothesis space consists of all evaluation functions that can be represented by some choice of values for the weights $w_0$ through $w_6$. The learner's task is thus to search through this vast space to locate the hypothesis that is most consistent with the available training examples. The LMS algorithm for fitting weights achieves this goal by iteratively tuning the weights, adding a correction to each weight each time the hypothesized evaluation function predicts a value that differs from the training value. This algorithm works well when the hypothesis representation considered by the learner defines a continuously parameterized space of potential hypotheses.

Many of the chapters in this book present algorithms that search a hypothesis space defined by some underlying representation (e.g., linear functions, logical descriptions, decision trees, artificial neural networks). These different hypothesis representations are appropriate for learning different kinds of target functions. For each of these hypothesis representations, the corresponding learning algorithm takes advantage of a different underlying structure to organize the search through the hypothesis space.

Throughout this book we will return to this perspective of learning as a search problem in order to characterize learning methods by their search strategies and by the underlying structure of the search spaces they explore.
We will also find this viewpoint useful in formally analyzing the relationship between the size of the hypothesis space to be searched, the number of training examples available, and the confidence we can have that a hypothesis consistent with the training data will correctly generalize to unseen examples.

1.3.1 Issues in Machine Learning

Our checkers example raises a number of generic questions about machine learning. The field of machine learning, and much of this book, is concerned with answering questions such as the following:

• What algorithms exist for learning general target functions from specific training examples? In what settings will particular algorithms converge to the desired function, given sufficient training data? Which algorithms perform best for which types of problems and representations?
• How much training data is sufficient? What general bounds can be found to relate the confidence in learned hypotheses to the amount of training experience and the character of the learner's hypothesis space?
• When and how can prior knowledge held by the learner guide the process of generalizing from examples? Can prior knowledge be helpful even when it is only approximately correct?
• What is the best strategy for choosing a useful next training experience, and how does the choice of this strategy alter the complexity of the learning problem?
• What is the best way to reduce the learning task to one or more function approximation problems? Put another way, what specific functions should the system attempt to learn? Can this process itself be automated?
• How can the learner automatically alter its representation to improve its ability to represent and learn the target function?
1.4 HOW TO READ THIS BOOK

This book contains an introduction to the primary algorithms and approaches to machine learning, theoretical results on the feasibility of various learning tasks and the capabilities of specific algorithms, and examples of practical applications of machine learning to real-world problems. Where possible, the chapters have been written to be readable in any sequence. However, some interdependence is unavoidable. If this is being used as a class text, I recommend first covering Chapter 1 and Chapter 2. Following these two chapters, the remaining chapters can be read in nearly any sequence. A one-semester course in machine learning might cover the first seven chapters, followed by whichever additional chapters are of greatest interest to the class. Below is a brief survey of the chapters.

• Chapter 2 covers concept learning based on symbolic or logical representations. It also discusses the general-to-specific ordering over hypotheses, and the need for inductive bias in learning.
• Chapter 3 covers decision tree learning and the problem of overfitting the training data. It also examines Occam's razor, a principle recommending the shortest hypothesis among those consistent with the data.
• Chapter 4 covers learning of artificial neural networks, especially the well-studied BACKPROPAGATION algorithm, and the general approach of gradient descent. This includes a detailed example of neural network learning for face recognition, including data and algorithms available over the World Wide Web.
• Chapter 5 presents basic concepts from statistics and estimation theory, focusing on evaluating the accuracy of hypotheses using limited samples of data. This includes the calculation of confidence intervals for estimating hypothesis accuracy and methods for comparing the accuracy of learning methods.
• Chapter 6 covers the Bayesian perspective on machine learning, including both the use of Bayesian analysis to characterize non-Bayesian learning algorithms and specific Bayesian algorithms that explicitly manipulate probabilities. This includes a detailed example applying a naive Bayes classifier to the task of classifying text documents, including data and software available over the World Wide Web.
• Chapter 7 covers computational learning theory, including the Probably Approximately Correct (PAC) learning model and the Mistake-Bound learning model. This includes a discussion of the WEIGHTED-MAJORITY algorithm for combining multiple learning methods.
• Chapter 8 describes instance-based learning methods, including nearest neighbor learning, locally weighted regression, and case-based reasoning.
• Chapter 9 discusses learning algorithms modeled after biological evolution, including genetic algorithms and genetic programming.
• Chapter 10 covers algorithms for learning sets of rules, including Inductive Logic Programming approaches to learning first-order Horn clauses.
• Chapter 11 covers explanation-based learning, a learning method that uses prior knowledge to explain observed training examples, then generalizes based on these explanations.
• Chapter 12 discusses approaches to combining approximate prior knowledge with available training data in order to improve the accuracy of learned hypotheses. Both symbolic and neural network algorithms are considered.
• Chapter 13 discusses reinforcement learning, an approach to control learning that accommodates indirect or delayed feedback as training information. The checkers learning algorithm described earlier in Chapter 1 is a simple example of reinforcement learning.

The end of each chapter contains a summary of the main concepts covered, suggestions for further reading, and exercises.
Additional updates to chapters, as well as data sets and implementations of algorithms, are available on the World Wide Web at http://www.cs.cmu.edu/~tom/mlbook.html.

1.5 SUMMARY AND FURTHER READING

Machine learning addresses the question of how to build computer programs that improve their performance at some task through experience. Major points of this chapter include:

• Machine learning algorithms have proven to be of great practical value in a variety of application domains. They are especially useful in (a) data mining problems where large databases may contain valuable implicit regularities that can be discovered automatically (e.g., to analyze outcomes of medical treatments from patient databases or to learn general rules for credit worthiness from financial databases); (b) poorly understood domains where humans might not have the knowledge needed to develop effective algorithms (e.g., human face recognition from images); and (c) domains where the program must dynamically adapt to changing conditions (e.g., controlling manufacturing processes under changing supply stocks or adapting to the changing reading interests of individuals).
• Machine learning draws on ideas from a diverse set of disciplines, including artificial intelligence, probability and statistics, computational complexity, information theory, psychology and neurobiology, control theory, and philosophy.
• A well-defined learning problem requires a well-specified task, performance metric, and source of training experience.
• Designing a machine learning approach involves a number of design choices, including choosing the type of training experience, the target function to be learned, a representation for this target function, and an algorithm for learning the target function from training examples.
• Learning involves search: searching through a space of possible hypotheses to find the hypothesis that best fits the available training examples and other prior constraints or knowledge. Much of this book is organized around different learning methods that search different hypothesis spaces (e.g., spaces containing numerical functions, neural networks, decision trees, symbolic rules) and around theoretical results that characterize conditions under which these search methods converge toward an optimal hypothesis.

There are a number of good sources for reading about the latest research results in machine learning. Relevant journals include Machine Learning, Neural Computation, Neural Networks, Journal of the American Statistical Association, and the IEEE Transactions on Pattern Analysis and Machine Intelligence. There are also numerous annual conferences that cover different aspects of machine learning, including the International Conference on Machine Learning, Neural Information Processing Systems, Conference on Computational Learning Theory, International Conference on Genetic Algorithms, International Conference on Knowledge Discovery and Data Mining, European Conference on Machine Learning, and others.

EXERCISES

1.1. Give three computer applications for which machine learning approaches seem appropriate and three for which they seem inappropriate. Pick applications that are not already mentioned in this chapter, and include a one-sentence justification for each.

1.2. Pick some learning task not mentioned in this chapter. Describe it informally in a paragraph in English. Now describe it by stating as precisely as possible the task, performance measure, and training experience. Finally, propose a target function to be learned and a target representation. Discuss the main tradeoffs you considered in formulating this learning task.

1.3.
Prove that the LMS weight update rule described in this chapter performs a gradient descent to minimize the squared error. In particular, define the squared error E as in the text. Now calculate the derivative of E with respect to the weight $w_i$, assuming that $\hat{V}(b)$ is a linear function as defined in the text. Gradient descent is achieved by updating each weight in proportion to $-\frac{\partial E}{\partial w_i}$. Therefore, you must show that the LMS training rule alters weights in this proportion for each training example it encounters.

1.4. Consider alternative strategies for the Experiment Generator module of Figure 1.1. In particular, consider strategies in which the Experiment Generator suggests new board positions by
• Generating random legal board positions
• Generating a position by picking a board state from the previous game, then applying one of the moves that was not executed
• A strategy of your own design
Discuss tradeoffs among these strategies. Which do you feel would work best if the number of training examples was held constant, given the performance measure of winning the most games at the world championships?

1.5. Implement an algorithm similar to that discussed for the checkers problem, but use the simpler game of tic-tac-toe. Represent the learned function $\hat{V}$ as a linear combination of board features of your choice. To train your program, play it repeatedly against a second copy of the program that uses a fixed evaluation function you create by hand. Plot the percent of games won by your system, versus the number of training games played.

REFERENCES

Ahn, W., & Brewer, W. F. (1993). Psychological studies of explanation-based learning. In G. DeJong (Ed.), Investigating explanation-based learning. Boston: Kluwer Academic Publishers.
Anderson, J. R. (1991). The place of cognitive architecture in rational analysis. In K. VanLehn (Ed.), Architectures for intelligence (pp. 1-24). Hillsdale, NJ: Erlbaum.
Chi, M. T. H., & Bassock, M. (1989). Learning from examples via self-explanations. In L.
Resnick (Ed.), Knowing, learning, and instruction: Essays in honor of Robert Glaser. Hillsdale, NJ: L. Erlbaum Associates.
Cooper, G., et al. (1997). An evaluation of machine-learning methods for predicting pneumonia mortality. Artificial Intelligence in Medicine, (to appear).
Fayyad, U. M., & Uthurusamy, R. (Eds.) (1995). Proceedings of the First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.
Fayyad, U. M., Smyth, P., Weir, N., & Djorgovski, S. (1995). Automated analysis and exploration of image databases: Results, progress, and challenges. Journal of Intelligent Information Systems, 4, 1-19.
Laird, J., Rosenbloom, P., & Newell, A. (1986). SOAR: The anatomy of a general learning mechanism. Machine Learning, 1(1), 11-46.
Langley, P., & Simon, H. (1995). Applications of machine learning and rule induction. Communications of the ACM, 38(11), 55-64.
Lee, K. (1989). Automatic speech recognition: The development of the Sphinx system. Boston: Kluwer Academic Publishers.
Pomerleau, D. A. (1989). ALVINN: An autonomous land vehicle in a neural network. (Technical Report CMU-CS-89-107). Pittsburgh, PA: Carnegie Mellon University.
Qin, Y., Mitchell, T., & Simon, H. (1992). Using EBG to simulate human learning from examples and learning by doing. Proceedings of the Florida AI Research Symposium (pp. 235-239).
Rudnicky, A. I., Hauptmann, A. G., & Lee, K.-F. (1994). Survey of current speech technology in artificial intelligence. Communications of the ACM, 37(3), 52-57.
Rumelhart, D., Widrow, B., & Lehr, M. (1994). The basic ideas in neural networks. Communications of the ACM, 37(3), 87-92.
Tesauro, G. (1992). Practical issues in temporal difference learning. Machine Learning, 8, 257.
Tesauro, G. (1995). Temporal difference learning and TD-gammon. Communications of the ACM, 38(3), 58-68.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. (1989). Phoneme recognition using time-delay neural networks.
IEEE Transactions on Acoustics, Speech and Signal Processing, 37(3), 328-339.

CHAPTER 2
CONCEPT LEARNING AND THE GENERAL-TO-SPECIFIC ORDERING

The problem of inducing general functions from specific training examples is central to learning. This chapter considers concept learning: acquiring the definition of a general category given a sample of positive and negative training examples of the category. Concept learning can be formulated as a problem of searching through a predefined space of potential hypotheses for the hypothesis that best fits the training examples. In many cases this search can be efficiently organized by taking advantage of a naturally occurring structure over the hypothesis space: a general-to-specific ordering of hypotheses. This chapter presents several learning algorithms and considers situations under which they converge to the correct hypothesis. We also examine the nature of inductive learning and the justification by which any program may successfully generalize beyond the observed training data.

2.1 INTRODUCTION

Much of learning involves acquiring general concepts from specific training examples. People, for example, continually learn general concepts or categories such as "bird," "car," "situations in which I should study more in order to pass the exam," etc. Each such concept can be viewed as describing some subset of objects or events defined over a larger set (e.g., the subset of animals that constitute birds). Alternatively, each concept can be thought of as a boolean-valued function defined over this larger set (e.g., a function defined over all animals, whose value is true for birds and false for other animals).

In this chapter we consider the problem of automatically inferring the general definition of some concept, given examples labeled as members or nonmembers of the concept.
This task is commonly referred to as concept learning, or approximating a boolean-valued function from examples.

Concept learning. Inferring a boolean-valued function from training examples of its input and output.

2.2 A CONCEPT LEARNING TASK

To ground our discussion of concept learning, consider the example task of learning the target concept "days on which my friend Aldo enjoys his favorite water sport." Table 2.1 describes a set of example days, each represented by a set of attributes. The attribute EnjoySport indicates whether or not Aldo enjoys his favorite water sport on this day. The task is to learn to predict the value of EnjoySport for an arbitrary day, based on the values of its other attributes.

What hypothesis representation shall we provide to the learner in this case? Let us begin by considering a simple representation in which each hypothesis consists of a conjunction of constraints on the instance attributes. In particular, let each hypothesis be a vector of six constraints, specifying the values of the six attributes Sky, AirTemp, Humidity, Wind, Water, and Forecast. For each attribute, the hypothesis will either

• indicate by a "?" that any value is acceptable for this attribute,
• specify a single required value (e.g., Warm) for the attribute, or
• indicate by a "$\emptyset$" that no value is acceptable.

If some instance x satisfies all the constraints of hypothesis h, then h classifies x as a positive example (h(x) = 1). To illustrate, the hypothesis that Aldo enjoys his favorite sport only on cold days with high humidity (independent of the values of the other attributes) is represented by the expression

$\langle ?, Cold, High, ?, ?, ? \rangle$

Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny  Warm     Normal    Strong  Warm   Same      Yes
2        Sunny  Warm     High      Strong  Warm   Same      Yes
3        Rainy  Cold     High      Strong  Warm   Change    No
4        Sunny  Warm     High      Strong  Cool   Change    Yes

TABLE 2.1 Positive and negative training examples for the target concept EnjoySport.
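The hypothesis representation described above can be illustrated with a small sketch. The encoding is an assumption made for illustration: a hypothesis is a tuple of six constraints, "?" accepts any value, and the string held in `EMPTY` stands in for the "no value is acceptable" constraint. The function name `classify` is hypothetical.

```python
ANY = "?"            # any value is acceptable
EMPTY = "<empty>"    # no value is acceptable (stand-in for the empty-set symbol)


def classify(h, x):
    """Return 1 if instance x satisfies every constraint of hypothesis h, else 0.

    An EMPTY constraint never equals an attribute value, so any hypothesis
    containing it classifies every instance as negative.
    """
    return int(all(c == ANY or c == v for c, v in zip(h, x)))


# The four days of Table 2.1 (attributes only, without the EnjoySport label).
days = [
    ("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),
    ("Sunny", "Warm", "High", "Strong", "Warm", "Same"),
    ("Rainy", "Cold", "High", "Strong", "Warm", "Change"),
    ("Sunny", "Warm", "High", "Strong", "Cool", "Change"),
]

# "Only on cold days with high humidity", as in the text.
h = (ANY, "Cold", "High", ANY, ANY, ANY)
print([classify(h, x) for x in days])
```

This particular hypothesis labels only the third day as positive, which disagrees with the EnjoySport values in Table 2.1 on every row; it illustrates the representation, not a good fit to the data.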
The most general hypothesis, that every day is a positive example, is represented by

$\langle ?, ?, ?, ?, ?, ? \rangle$

and the most specific possible hypothesis, that no day is a positive example, is represented by

$\langle \emptyset, \emptyset, \emptyset, \emptyset, \emptyset, \emptyset \rangle$

To summarize, the EnjoySport concept learning task requires learning the set of days for which EnjoySport = yes, describing this set by a conjunction of constraints over the instance attributes. In general, any concept learning task can be described by the set of instances over which the target function is defined, the target function, the set of candidate hypotheses considered by the learner, and the set of available training examples. The definition of the EnjoySport concept learning task in this general form is given in Table 2.2.

2.2.1 Notation

Throughout this book, we employ the following terminology when discussing concept learning problems. The set of items over which the concept is defined is called the set of instances, which we denote by X. In the current example, X is the set of all possible days, each represented by the attributes Sky, AirTemp, Humidity, Wind, Water, and Forecast. The concept or function to be learned is called the target concept, which we denote by c. In general, c can be any boolean-valued function defined over the instances X; that is, $c : X \to \{0, 1\}$. In the current example, the target concept corresponds to the value of the attribute EnjoySport (i.e., c(x) = 1 if EnjoySport = Yes, and c(x) = 0 if EnjoySport = No).

• Given:
  • Instances X: Possible days, each described by the attributes
    • Sky (with possible values Sunny, Cloudy, and Rainy),
    • AirTemp (with values Warm and Cold),
    • Humidity (with values Normal and High),
    • Wind (with values Strong and Weak),
    • Water (with values Warm and Cool), and
    • Forecast (with values Same and Change).
  • Hypotheses H: Each hypothesis is described by a conjunction of constraints on the attributes Sky, AirTemp, Humidity, Wind, Water, and Forecast.
The constraints may be "?" (any value is acceptable), " 0 (no value is acceptable), or a specific value. 0 Target concept c: EnjoySport : X + ( 0 , l ) 0 Training examples D: Positive and negative examples of the target function (see Table 2.1). 0 Determine: 0 A hypothesis h in H such that h ( x ) = c(x) for all x in X. TABLE 2.2 The EnjoySport concept learning task. When learning the target concept, the learner is presented a set of training examples, each consisting of an instance x from X, along with its target concept value c ( x ) (e.g., the training examples in Table 2.1). Instances for which c ( x ) = 1 are calledpositive examples, or members of the target concept. Instances for which C ( X )= 0 are called negative examples, or nonmembers of the target concept. We will often write the ordered pair ( x ,c ( x ) ) to describe the training example consisting of the instance x and its target concept value c ( x ) . We use the symbol D to denote the set of available training examples. Given a set of training examples of the target concept c , the problem faced by the learner is to hypothesize, or estimate, c . We use the symbol H to denote the set of all possible hypotheses that the learner may consider regarding the identity of the target concept. Usually H is determined by the human designer's choice of hypothesis representation. In general, each hypothesis h in H represents a boolean-valued function defined over X; that is, h : X --+ {O, 1). The goal of the learner is to find a hypothesis h such that h ( x ) = c ( x ) for a" x in X. 2.2.2 The Inductive Learning Hypothesis Notice that although the learning task is to determine a hypothesis h identical to the target concept c over the entire set of instances X, the only information available about c is its value over the training examples. Therefore, inductive learning algorithms can at best guarantee that the output hypothesis fits the target concept over the training data. 
Lacking any further information, our assumption is that the best hypothesis regarding unseen instances is the hypothesis that best fits the observed training data. This is the fundamental assumption of inductive learning, and we will have much more to say about it throughout this book. We state it here informally and will revisit and analyze this assumption more formally and more quantitatively in Chapters 5, 6, and 7.

The inductive learning hypothesis. Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples.

2.3 CONCEPT LEARNING AS SEARCH

Concept learning can be viewed as the task of searching through a large space of hypotheses implicitly defined by the hypothesis representation. The goal of this search is to find the hypothesis that best fits the training examples. It is important to note that by selecting a hypothesis representation, the designer of the learning algorithm implicitly defines the space of all hypotheses that the program can ever represent and therefore can ever learn.

Consider, for example, the instances X and hypotheses H in the EnjoySport learning task. Given that the attribute Sky has three possible values, and that AirTemp, Humidity, Wind, Water, and Forecast each have two possible values, the instance space X contains exactly $3 \cdot 2 \cdot 2 \cdot 2 \cdot 2 \cdot 2 = 96$ distinct instances. A similar calculation, in which each attribute constraint may be any of the attribute's values, "?", or "∅", shows that there are $5 \cdot 4 \cdot 4 \cdot 4 \cdot 4 \cdot 4 = 5120$ syntactically distinct hypotheses within H. Notice, however, that every hypothesis containing one or more "∅" symbols represents the empty set of instances; that is, it classifies every instance as negative. Therefore, the number of semantically distinct hypotheses is only $1 + (4 \cdot 3 \cdot 3 \cdot 3 \cdot 3 \cdot 3) = 973$. Our EnjoySport example is a very simple learning task, with a relatively small, finite hypothesis space.
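The three counts above can be checked by brute-force enumeration. The sketch below is illustrative only; the names DOMAINS, ANY, EMPTY, and extension are assumptions introduced here, and two hypotheses are treated as semantically identical when they classify exactly the same set of instances as positive.

```python
from itertools import product

# Attribute values as listed in Table 2.2
DOMAINS = [
    ("Sunny", "Cloudy", "Rainy"),  # Sky
    ("Warm", "Cold"),              # AirTemp
    ("Normal", "High"),            # Humidity
    ("Strong", "Weak"),            # Wind
    ("Warm", "Cool"),              # Water
    ("Same", "Change"),            # Forecast
]
ANY, EMPTY = "?", None  # hypothetical encodings of "?" and the empty constraint

# Every instance is one choice of value per attribute.
instances = list(product(*DOMAINS))

# Every syntactic hypothesis is one constraint per attribute:
# a specific value, "?", or the empty constraint.
syntactic = list(product(*[values + (ANY, EMPTY) for values in DOMAINS]))

def extension(h):
    """The set of instances that hypothesis h classifies as positive."""
    return frozenset(
        x for x in instances
        if all(c is not EMPTY and (c == ANY or c == v) for c, v in zip(h, x))
    )

# Hypotheses with the same extension are semantically identical.
semantic = {extension(h) for h in syntactic}

print(len(instances))  # 96
print(len(syntactic))  # 5120
print(len(semantic))   # 973
```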
Most practical learning tasks involve much larger, sometimes infinite, hypothesis spaces.

If we view learning as a search problem, then it is natural that our study of learning algorithms will examine the different strategies for searching the hypothesis space. We will be particularly interested in algorithms capable of efficiently searching very large or infinite hypothesis spaces, to find the hypotheses that best fit the training data.

2.3.1 General-to-Specific Ordering of Hypotheses

Many algorithms for concept learning organize the search through the hypothesis space by relying on a very useful structure that exists for any concept learning problem: a general-to-specific ordering of hypotheses. By taking advantage of this naturally occurring structure over the hypothesis space, we can design learning algorithms that exhaustively search even infinite hypothesis spaces without explicitly enumerating every hypothesis. To illustrate the general-to-specific ordering, consider the two hypotheses

h1 = (Sunny, ?, ?, Strong, ?, ?)
h2 = (Sunny, ?, ?, ?, ?, ?)

Now consider the sets of instances that are classified positive by h1 and by h2. Because h2 imposes fewer constraints on the instance, it classifies more instances as positive. In fact, any instance classified positive by h1 will also be classified positive by h2. Therefore, we say that h2 is more general than h1.

This intuitive "more general than" relationship between hypotheses can be defined more precisely as follows. First, for any instance x in X and hypothesis h in H, we say that x satisfies h if and only if h(x) = 1. We now define the more_general_than_or_equal_to relation in terms of the sets of instances that satisfy the two hypotheses: Given hypotheses hj and hk, hj is more_general_than_or_equal_to hk if and only if any instance that satisfies hk also satisfies hj.

Definition: Let hj and hk be boolean-valued functions defined over X.
Then hj is more_general_than_or_equal_to hk (written $h_j \geq_g h_k$) if and only if

$$(\forall x \in X)\,[(h_k(x) = 1) \rightarrow (h_j(x) = 1)]$$

We will also find it useful to consider cases where one hypothesis is strictly more general than the other. Therefore, we will say that hj is (strictly) more_general_than hk (written $h_j >_g h_k$) if and only if $(h_j \geq_g h_k) \wedge (h_k \not\geq_g h_j)$. Finally, we will sometimes find the inverse useful and will say that hj is more_specific_than hk when hk is more_general_than hj.

FIGURE 2.1 Instances, hypotheses, and the more_general_than relation. The box on the left represents the set X of all instances, the box on the right the set H of all hypotheses. Each hypothesis corresponds to some subset of X: the subset of instances that it classifies positive. The arrows connecting hypotheses represent the more_general_than relation, with the arrow pointing toward the less general hypothesis. Note the subset of instances characterized by h2 subsumes the subset characterized by h1, hence h2 is more_general_than h1.

To illustrate these definitions, consider the three hypotheses h1, h2, and h3 from our EnjoySport example, shown in Figure 2.1. How are these three hypotheses related by the $\geq_g$ relation? As noted earlier, hypothesis h2 is more general than h1 because every instance that satisfies h1 also satisfies h2. Similarly, h2 is more general than h3. Note that neither h1 nor h3 is more general than the other; although the instances satisfied by these two hypotheses intersect, neither set subsumes the other. Notice also that the $\geq_g$ and $>_g$ relations are defined independent of the target concept. They depend only on which instances satisfy the two hypotheses and not on the classification of those instances according to the target concept.
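The definitions above translate directly into a check over the instance space. The following sketch is an illustration, not part of the original text: the function names are hypothetical, h1 and h2 are the hypotheses given earlier, and h3 is an assumed third hypothesis chosen here only to show a pair that is not ordered, since the contents of Figure 2.1 are not reproduced.

```python
from itertools import product

DOMAINS = [
    ("Sunny", "Cloudy", "Rainy"),  # Sky
    ("Warm", "Cold"),              # AirTemp
    ("Normal", "High"),            # Humidity
    ("Strong", "Weak"),            # Wind
    ("Warm", "Cool"),              # Water
    ("Same", "Change"),            # Forecast
]
ANY = "?"
X = list(product(*DOMAINS))  # the full instance space

def satisfies(x, h):
    """x satisfies h if and only if h(x) = 1."""
    return all(c == ANY or c == v for c, v in zip(h, x))

def more_general_or_equal(hj, hk):
    """hj >=_g hk: every instance that satisfies hk also satisfies hj."""
    return all(satisfies(x, hj) for x in X if satisfies(x, hk))

def strictly_more_general(hj, hk):
    """hj >_g hk: hj >=_g hk holds but hk >=_g hj does not."""
    return more_general_or_equal(hj, hk) and not more_general_or_equal(hk, hj)

h1 = ("Sunny", ANY, ANY, "Strong", ANY, ANY)
h2 = ("Sunny", ANY, ANY, ANY, ANY, ANY)
h3 = ("Sunny", ANY, ANY, ANY, "Cool", ANY)  # assumed example, not taken from the text

print(strictly_more_general(h2, h1))  # True
print(strictly_more_general(h2, h3))  # True
# Neither h1 nor h3 is more general than the other:
print(more_general_or_equal(h1, h3), more_general_or_equal(h3, h1))  # False False
```

Checking the relation by enumerating X follows the definition literally; for this conjunctive representation a purely syntactic comparison of constraints would give the same answer more cheaply.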
Formally, the $\geq_g$ relation defines a partial order over the hypothesis space H (the relation is reflexive, antisymmetric, and transitive). Informally, when we say the structure is a partial (as opposed to total) order, we mean there may be pairs of hypotheses such as h1 and h3, such that $h_1 \not\geq_g h_3$ and $h_3 \not\geq_g h_1$.

The $\geq_g$ relation is important because it provides a useful structure over the hypothesis space H for any concept learning problem. The following sections present concept learning algorithms that take advantage of this partial order to efficiently organize the search for hypotheses that fit the training data.

2.4 FIND-S: FINDING A MAXIMALLY SPECIFIC HYPOTHESIS

How can we use the more_general_than partial ordering to organize the search for a hypothesis consistent with the observed training examples? One way is to begin with the most specific possible hypothesis in H, then generalize this hypothesis each time it fails to cover an observed positive training example. (We say that a hypothesis "covers" a positive example if it correctly classifies the example as positive.) To be more precise about how the partial ordering is used, consider the FIND-S algorithm defined in Table 2.3.

1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
   • For each attribute constraint a_i in h
       If the constraint a_i is satisfied by x
       Then do nothing
       Else replace a_i in h by the next more general constraint that is satisfied by x
3. Output hypothesis h

TABLE 2.3 FIND-S algorithm.

To illustrate this algorithm, assume the learner is given the sequence of training examples from Table 2.1 for the EnjoySport task. The first step of FIND-S is to initialize h to the most specific hypothesis in H

h ← (∅, ∅, ∅, ∅, ∅, ∅)

Upon observing the first training example from Table 2.1, which happens to be a positive example, it becomes clear that our hypothesis is too specific.
In particular, none of the "∅" constraints in h are satisfied by this example, so each is replaced by the next more general constraint that fits the example; namely, the attribute values for this training example.

h ← (Sunny, Warm, Normal, Strong, Warm, Same)

This h is still very specific; it asserts that all instances are negative except for the single positive training example we have observed. Next, the second training example (also positive in this case) forces the algorithm to further generalize h, this time substituting a "?" in place of any attribute value in h that is not satisfied by the new example. The refined hypothesis in this case is

h ← (Sunny, Warm, ?, Strong, Warm, Same)

Upon encountering the third training example (in this case a negative example) the algorithm makes no change to h. In fact, the FIND-S algorithm simply ignores every negative example! While this may at first seem strange, notice that in the current case our hypothesis h is already consistent with the new negative example (i.e., h correctly classifies this example as negative), and hence no revision is needed. In the general case, as long as we assume that the hypothesis space H contains a hypothesis that describes the true target concept c and that the training data contains no errors, then the current hypothesis h can never require a revision in response to a negative example. To see why, recall that the current hypothesis h is the most specific hypothesis in H consistent with the observed positive examples. Because the target concept c is also assumed to be in H and to be consistent with the positive training examples, c must be more_general_than_or_equal_to h. But the target concept c will never cover a negative example, thus neither will h (by the definition of more_general_than). Therefore, no revision to h will be required in response to any negative example.
To complete our trace of FIND-S, the fourth (positive) example leads to a further generalization of h

h ← (Sunny, Warm, ?, Strong, ?, ?)

The FIND-S algorithm illustrates one way in which the more_general_than partial ordering can be used to organize the search for an acceptable hypothesis. The search moves from hypothesis to hypothesis, searching from the most specific to progressively more general hypotheses along one chain of the partial ordering. Figure 2.2 illustrates this search in terms of the instance and hypothesis spaces. At each step, the hypothesis is generalized only as far as necessary to cover the new positive example. Therefore, at each stage the hypothesis is the most specific hypothesis consistent with the training examples observed up to this point (hence the name FIND-S). The literature on concept learning is populated by many different algorithms that utilize this same more_general_than partial ordering to organize the search in one fashion or another. A number of such algorithms are discussed in this chapter, and several others are presented in Chapter 10.

FIGURE 2.2 The hypothesis space search performed by FIND-S. The search begins (h0) with the most specific hypothesis in H, then considers increasingly general hypotheses (h1 through h4) as mandated by the training examples. In the instance space diagram, positive training examples are denoted by "+," negative by "-," and instances that have not been presented as training examples are denoted by a solid circle.

The key property of the FIND-S algorithm is that for hypothesis spaces described by conjunctions of attribute constraints (such as H for the EnjoySport task), FIND-S is guaranteed to output the most specific hypothesis within H that is consistent with the positive training examples.
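The trace above can be reproduced with a short Python sketch of FIND-S for this conjunctive representation. It is a minimal illustration under stated assumptions: the names find_s, ANY, EMPTY, and TABLE_2_1 are hypothetical, None stands for the "∅" constraint, and the "next more general constraint" is taken to be the observed value when the constraint is "∅" and "?" otherwise.

```python
ANY, EMPTY = "?", None  # hypothetical encodings of "?" and the empty constraint

def find_s(examples, n_attributes):
    """Return the most specific conjunctive hypothesis covering all positive examples."""
    h = [EMPTY] * n_attributes  # step 1: most specific hypothesis
    for x, label in examples:
        if label != "Yes":
            continue  # FIND-S ignores negative examples
        for i, value in enumerate(x):
            if h[i] is EMPTY:
                h[i] = value  # generalize the empty constraint to the observed value
            elif h[i] != ANY and h[i] != value:
                h[i] = ANY    # generalize a conflicting value to "?"
    return tuple(h)

# Training examples from Table 2.1 (Sky, AirTemp, Humidity, Wind, Water, Forecast)
TABLE_2_1 = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), "Yes"),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), "Yes"),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), "Yes"),
]

print(find_s(TABLE_2_1, 6))
# ('Sunny', 'Warm', '?', 'Strong', '?', '?')
```

Running the function on prefixes of TABLE_2_1 yields the intermediate hypotheses of the trace, for example find_s(TABLE_2_1[:2], 6) for the hypothesis after the second example.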
Its final hypothesis will also be consistent with the negative examples provided the correct target concept is contained in H, and provided the training examples are correct. However, there are several questions still left unanswered by this learning algorithm, such as:

• Has the learner converged to the correct target concept? Although FIND-S will find a hypothesis consistent with the training data, it has no way to determine whether it has found the only hypothesis in H consistent with the data (i.e., the correct target concept), or whether there are many other consistent hypotheses as well. We would prefer a learning algorithm that could determine whether it had converged and, if not, at least characterize its uncertainty regarding the true identity of the target concept.
• Why prefer the most specific hypothesis? In case there are multiple hypotheses consistent with the training examples, FIND-S will find the most specific. It is unclear whether we should prefer this hypothesis over, say, the most general, or some other hypothesis of intermediate generality.
• Are the training examples consistent? In most practical learning problems there is some chance that the training examples will contain at least some errors or noise. Such inconsistent sets of training examples can severely mislead FIND-S, given the fact that it ignores negative examples. We would prefer an algorithm that could at least detect when the training data is inconsistent and, preferably, accommodate such errors.
• What if there are several maximally specific consistent hypotheses? In the hypothesis language H for the EnjoySport task, there is always a unique, most specific hypothesis consistent with any set of positive examples. However, for other hypothesis spaces (discussed later) there can be several maximally specific hypotheses consistent with the data.
In this case, FIND-S must be extended to allow it to backtrack on its choices of how to generalize the hypothesis, to accommodate the possibility that the target concept lies along a different branch of the partial ordering than the branch it has selected. Furthermore, we can define hypothesis spaces for which there is no maximally specific consistent hypothesis, although this is more of a theoretical issue than a practical one (see Exercise 2.7).

2.5 VERSION SPACES AND THE CANDIDATE-ELIMINATION ALGORITHM

This section describes a second approach to concept learning, the CANDIDATE-ELIMINATION algorithm, that addresses several of the limitations of FIND-S. Notice that although FIND-S outputs a hypothesis from H that is consistent with the training examples, this is just one of many hypotheses from H that might fit the training data equally well. The key idea in the CANDIDATE-ELIMINATION algorithm is to output a description of the set of all hypotheses consistent with the training examples. Surprisingly, the CANDIDATE-ELIMINATION algorithm computes the description of this set without explicitly enumerating all of its members. This is accomplished by again using the more_general_than partial ordering, this time to maintain a compact representation of the set of consistent hypotheses and to incrementally refine this representation as each new training example is encountered.

The CANDIDATE-ELIMINATION algorithm has been applied to problems such as learning regularities in chemical mass spectroscopy (Mitchell 1979) and learning control rules for heuristic search (Mitchell et al. 1983). Nevertheless, practical applications of the CANDIDATE-ELIMINATION and FIND-S algorithms are limited by the fact that they both perform poorly when given noisy training data. More importantly for our purposes here, the CANDIDATE-ELIMINATION algorithm provides a useful conceptual framework for introducing several fundamental issues in machine learning.
In the remainder of this chapter we present the algorithm and discuss these issues. Beginning with the next chapter, we will examine learning algorithms that are used more frequently with noisy training data.

2.5.1 Representation

The CANDIDATE-ELIMINATION algorithm finds all describable hypotheses that are consistent with the observed training examples. In order to define this algorithm precisely, we begin with a few basic definitions. First, let us say that a hypothesis is consistent with the training examples if it correctly classifies these examples.

Definition: A hypothesis h is consistent with a set of training examples D if and only if h(x) = c(x) for each example (x, c(x)) in D.

Notice the key difference between this definition of consistent and our earlier definition of satisfies. An example x is said to satisfy hypothesis h when h(x) = 1, regardless of whether x is a positive or negative example of the target concept. However, whether such an example is consistent with h depends on the target concept, and in particular, whether h(x) = c(x).

The CANDIDATE-ELIMINATION algorithm represents the set of all hypotheses consistent with the observed training examples. This subset of all hypotheses is called the version space with respect to the hypothesis space H and the training examples D, because it contains all plausible versions of the target concept.

Definition: The version space, denoted $VS_{H,D}$, with respect to hypothesis space H and training examples D, is the subset of hypotheses from H consistent with the training examples in D.

$$VS_{H,D} \equiv \{h \in H \mid Consistent(h, D)\}$$

2.5.2 The LIST-THEN-ELIMINATE Algorithm

One obvious way to represent the version space is simply to list all of its members. This leads to a simple learning algorithm, which we might call the LIST-THEN-ELIMINATE algorithm, defined in Table 2.4.
The LIST-THEN-ELIMINATE algorithm first initializes the version space to contain all hypotheses in H, then eliminates any hypothesis found inconsistent with any training example. The version space of candidate hypotheses thus shrinks as more examples are observed, until ideally just one hypothesis remains that is consistent with all the observed examples. This, presumably, is the desired target concept. If insufficient data is available to narrow the version space to a single hypothesis, then the algorithm can output the entire set of hypotheses consistent with the observed data.

In principle, the LIST-THEN-ELIMINATE algorithm can be applied whenever the hypothesis space H is finite. It has many advantages, including the fact that it is guaranteed to output all hypotheses consistent with the training data. Unfortunately, it requires exhaustively enumerating all hypotheses in H, an unrealistic requirement for all but the most trivial hypothesis spaces.

The LIST-THEN-ELIMINATE Algorithm
1. VersionSpace ← a list containing every hypothesis in H
2. For each training example, (x, c(x))
   remove from VersionSpace any hypothesis h for which h(x) ≠ c(x)
3. Output the list of hypotheses in VersionSpace

TABLE 2.4 The LIST-THEN-ELIMINATE algorithm.

2.5.3 A More Compact Representation for Version Spaces

The CANDIDATE-ELIMINATION algorithm works on the same principle as the above LIST-THEN-ELIMINATE algorithm. However, it employs a much more compact representation of the version space. In particular, the version space is represented by its most general and least general members. These members form general and specific boundary sets that delimit the version space within the partially ordered hypothesis space.

FIGURE 2.3 A version space with its general and specific boundary sets. The version space includes all six hypotheses shown here, but can be represented more simply by S and G. Arrows indicate instances of the more_general_than relation.
This is the version space for the EnjoySport concept learning problem and training examples described in Table 2.1.

To illustrate this representation for version spaces, consider again the EnjoySport concept learning problem described in Table 2.2. Recall that given the four training examples from Table 2.1, FIND-S outputs the hypothesis

h = (Sunny, Warm, ?, Strong, ?, ?)

In fact, this is just one of six different hypotheses from H that are consistent with these training examples. All six hypotheses are shown in Figure 2.3. They constitute the version space relative to this set of data and this hypothesis representation. The arrows among these six hypotheses in Figure 2.3 indicate instances of the more_general_than relation. The CANDIDATE-ELIMINATION algorithm represents the version space by storing only its most general members (labeled G in Figure 2.3) and its most specific (labeled S in the figure). Given only these two sets S and G, it is possible to enumerate all members of the version space as needed by generating the hypotheses that lie between these two sets in the general-to-specific partial ordering over hypotheses.

It is intuitively plausible that we can represent the version space in terms of its most specific and most general members. Below we define the boundary sets G and S precisely and prove that these sets do in fact represent the version space.

Definition: The general boundary G, with respect to hypothesis space H and training data D, is the set of maximally general members of H consistent with D.

$$G \equiv \{g \in H \mid Consistent(g, D) \wedge (\neg\exists g' \in H)[(g' >_g g) \wedge Consistent(g', D)]\}$$

Definition: The specific boundary S, with respect to hypothesis space H and training data D, is the set of minimally general (i.e., maximally specific) members of H consistent with D.

$$S \equiv \{s \in H \mid Consistent(s, D) \wedge (\neg\exists s' \in H)[(s >_g s') \wedge Consistent(s', D)]\}$$

As long as the sets G and S are well defined (see Exercise 2.7), they completely specify the version space.
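For a hypothesis space as small as the one in the EnjoySport task, the version space and its boundary sets can be computed by brute force, which gives a concrete check on the definitions above. The sketch below is illustrative only: all names are assumptions introduced here, a single tuple of None values stands in for every hypothesis containing "∅" (they are semantically identical), and the more_general_than test is carried out by enumerating the instance space.

```python
from itertools import product

DOMAINS = [
    ("Sunny", "Cloudy", "Rainy"),  # Sky
    ("Warm", "Cold"),              # AirTemp
    ("Normal", "High"),            # Humidity
    ("Strong", "Weak"),            # Wind
    ("Warm", "Cool"),              # Water
    ("Same", "Change"),            # Forecast
]
ANY = "?"
EMPTY_H = (None,) * 6  # one representative for all hypotheses containing the empty constraint

X = list(product(*DOMAINS))
H = list(product(*[values + (ANY,) for values in DOMAINS])) + [EMPTY_H]

# Training examples D from Table 2.1, written as pairs (x, c(x))
D = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), 1),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), 1),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), 0),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), 1),
]

def classify(h, x):
    """h(x): 1 if x satisfies every constraint of h, else 0."""
    return int(all(c is not None and (c == ANY or c == v) for c, v in zip(h, x)))

def list_then_eliminate(hypotheses, examples):
    """Keep only the hypotheses consistent with every training example."""
    version_space = list(hypotheses)
    for x, c_x in examples:
        version_space = [h for h in version_space if classify(h, x) == c_x]
    return version_space

def strictly_more_general(hj, hk):
    """hj >_g hk, checked by enumerating the instance space X."""
    j_covers_k = all(classify(hj, x) for x in X if classify(hk, x))
    k_covers_j = all(classify(hk, x) for x in X if classify(hj, x))
    return j_covers_k and not k_covers_j

VS = list_then_eliminate(H, D)
# Boundary sets: maximally general and maximally specific consistent hypotheses
G = [g for g in VS if not any(strictly_more_general(other, g) for other in VS)]
S = [s for s in VS if not any(strictly_more_general(s, other) for other in VS)]

print(len(VS))  # 6
print("S:", S)
print("G:", G)
```

Restricting the search for a more general (or more specific) consistent hypothesis to VS is equivalent to searching all of H, because any consistent member of H is by definition in the version space.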
In particular, we can show that the version space is precisely the set of hypotheses contained in G, plus those contained in S, plus those that lie between G and S in the partially ordered hypothesis space. This is stated precisely in Theorem 2.1.

Theorem 2.1. Version space representation theorem. Let X be an arbitrary set of instances and let H be a set of boolean-valued hypotheses defined over X. Let $c : X \rightarrow \{0, 1\}$ be an arbitrary target concept defined over X, and let D be an arbitrary set of training examples {(x, c(x))}. For all X, H, c, and D such that S and G are well defined,

$$VS_{H,D} = \{h \in H \mid (\exists s \in S)(\exists g \in G)(g \geq_g h \geq_g s)\}$$

Proof. To prove the theorem it suffices to show that (1) every h satisfying the right-hand side of the above expression is in $VS_{H,D}$, and (2) every member of $VS_{H,D}$ satisfies the right-hand side of the expression. To show (1) let g be an arbitrary member of G, s be an arbitrary member of S, and h be an arbitrary member of H, such that $g \geq_g h \geq_g s$. Then by the definition of S, s must be satisfied by all positive examples in D. Because $h \geq_g s$, h must also be satisfied by all positive examples in D. Similarly, by the definition of G, g cannot be satisfied by any negative example in D, and because $g \geq_g h$, h cannot be satisfied by any negative example in D. Because h is satisfied by all positive examples in D and by no negative examples in D, h is consistent with D, and therefore h is a member of $VS_{H,D}$. This proves step (1). The argument for (2) is a bit more complex. It can be proven by assuming some h in $VS_{H,D}$ that does not satisfy the right-hand side of the expression, then showing that this leads to an inconsistency. (See Exercise 2.6.) □

2.5.4 CANDIDATE-ELIMINATION Learning Algorithm

The CANDIDATE-ELIMINATION algorithm computes the version space containing all hypotheses from H that are consistent with an observed sequence of training examples.
It begins by initializing the version space to the set of all hypotheses in H; that is, by initializing the G boundary set to contain the most general hypothesis in H

G0 ← {(?, ?, ?, ?, ?, ?)}

and initializing the S boundary set to contain the most specific (least general) hypothesis

S0 ← {(∅, ∅, ∅, ∅, ∅, ∅)}

These two boundary sets delimit the entire hypothesis space, because every other hypothesis in H is both more general than S0 and more specific than G0. As each training example is considered, the S and G boundary sets are generalized and specialized, respectively, to eliminate from the version space any hypotheses found inconsistent with the new training example. After all examples have been processed, the computed version space contains all the hypotheses consistent with these examples and only these hypotheses. This algorithm is summarized in Table 2.5.

Initialize G to the set of maximally general hypotheses in H
Initialize S to the set of maximally specific hypotheses in H
For each training example d, do
  • If d is a positive example
      • Remove from G any hypothesis inconsistent with d
      • For each hypothesis s in S that is not consistent with d
          • Remove s from S
          • Add to S all minimal generalizations h of s such that
              • h is consistent with d, and some member of G is more general than h
          • Remove from S any hypothesis that is more general than another hypothesis in S
  • If d is a negative example
      • Remove from S any hypothesis inconsistent with d
      • For each hypothesis g in G that is not consistent with d
          • Remove g from G
          • Add to G all minimal specializations h of g such that
              • h is consistent with d, and some member of S is more specific than h
          • Remove from G any hypothesis that is less general than another hypothesis in G

TABLE 2.5 CANDIDATE-ELIMINATION algorithm using version spaces. Notice the duality in how positive and negative examples influence S and G.
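The procedure of Table 2.5 can be sketched for the conjunctive EnjoySport representation as follows. This is a minimal illustration, not a general implementation: the function names are hypothetical, None stands for "∅", the more_general_than_or_equal_to test is a syntactic shortcut that is valid only for this representation, and each hypothesis is assumed to have a single minimal generalization per positive example.

```python
ANY, EMPTY = "?", None  # hypothetical encodings of "?" and the empty constraint

DOMAINS = [
    ("Sunny", "Cloudy", "Rainy"),  # Sky
    ("Warm", "Cold"),              # AirTemp
    ("Normal", "High"),            # Humidity
    ("Strong", "Weak"),            # Wind
    ("Warm", "Cool"),              # Water
    ("Same", "Change"),            # Forecast
]

def covers(h, x):
    """True if instance x satisfies hypothesis h."""
    return all(c is not EMPTY and (c == ANY or c == v) for c, v in zip(h, x))

def more_general_or_equal(hj, hk):
    """Syntactic test of hj >=_g hk for conjunctive hypotheses."""
    if EMPTY in hk:
        return True  # hk covers nothing, so anything is at least as general
    return all(a == ANY or a == b for a, b in zip(hj, hk))

def min_generalization(s, x):
    """The minimal generalization of s that covers x."""
    return tuple(v if c is EMPTY else (c if c == v else ANY) for c, v in zip(s, x))

def min_specializations(g, x):
    """All minimal specializations of g that exclude x."""
    return [g[:i] + (v,) + g[i + 1:]
            for i, c in enumerate(g) if c == ANY
            for v in DOMAINS[i] if v != x[i]]

def candidate_elimination(examples):
    n = len(DOMAINS)
    S, G = [(EMPTY,) * n], [(ANY,) * n]
    for x, positive in examples:
        if positive:
            G = [g for g in G if covers(g, x)]
            new_S = []
            for s in S:
                h = s if covers(s, x) else min_generalization(s, x)
                if any(more_general_or_equal(g, h) for g in G):
                    new_S.append(h)
            new_S = list(dict.fromkeys(new_S))  # drop duplicates, keep order
            S = [s for s in new_S
                 if not any(s != t and more_general_or_equal(s, t) for t in new_S)]
        else:
            S = [s for s in S if not covers(s, x)]
            new_G = []
            for g in G:
                candidates = min_specializations(g, x) if covers(g, x) else [g]
                new_G += [h for h in candidates
                          if any(more_general_or_equal(h, s) for s in S)]
            new_G = list(dict.fromkeys(new_G))
            G = [g for g in new_G
                 if not any(g != t and more_general_or_equal(t, g) for t in new_G)]
    return S, G

# Training examples from Table 2.1; True marks EnjoySport = Yes
TABLE_2_1 = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]

S, G = candidate_elimination(TABLE_2_1)
print("S:", S)
print("G:", G)
```

Calling candidate_elimination on a prefix of TABLE_2_1 returns the boundary sets after that many examples, which is a convenient way to follow a step-by-step trace.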
Notice that the algorithm is specified in terms of operations such as computing minimal generalizations and specializations of given hypotheses, and identifying nonminimal and nonmaximal hypotheses. The detailed implementation of these operations will depend, of course, on the specific representations for instances and hypotheses. However, the algorithm itself can be applied to any concept learning task and hypothesis space for which these operations are well-defined. In the following example trace of this algorithm, we see how such operations can be implemented for the representations used in the EnjoySport example problem.

2.5.5 An Illustrative Example

Figure 2.4 traces the CANDIDATE-ELIMINATION algorithm applied to the first two training examples from Table 2.1. As described above, the boundary sets are first initialized to G0 and S0, the most general and most specific hypotheses in H, respectively.

When the first training example is presented (a positive example in this case), the CANDIDATE-ELIMINATION algorithm checks the S boundary and finds that it is overly specific: it fails to cover the positive example. The boundary is therefore revised by moving it to the least more general hypothesis that covers this new example. This revised boundary is shown as S1 in Figure 2.4. No update of the G boundary is needed in response to this training example because G0 correctly covers this example. When the second training example (also positive) is observed, it has a similar effect of generalizing S further to S2, leaving G again unchanged (i.e., G2 = G1 = G0). Notice the processing of these first two positive examples is very similar to the processing performed by the FIND-S algorithm.

FIGURE 2.4 CANDIDATE-ELIMINATION Trace 1. S0 and G0 are the initial boundary sets corresponding to the most specific and most general hypotheses. Training examples 1 and 2 force the S boundary to become more general, as in the FIND-S algorithm. They have no effect on the G boundary.

As illustrated by these first two steps, positive training examples may force the S boundary of the version space to become increasingly general. Negative training examples play the complementary role of forcing the G boundary to become increasingly specific. Consider the third training example, shown in Figure 2.5. This negative example reveals that the G boundary of the version space is overly general; that is, the hypothesis in G incorrectly predicts that this new example is a positive example. The hypothesis in the G boundary must therefore be specialized until it correctly classifies this new negative example. As shown in Figure 2.5, there are several alternative minimally more specific hypotheses. All of these become members of the new G3 boundary set.

FIGURE 2.5 CANDIDATE-ELIMINATION Trace 2. Training example 3 is a negative example that forces the G2 boundary to be specialized to G3. Note several alternative maximally general hypotheses are included in G3.

Given that there are six attributes that could be specified to specialize G2, why are there only three new hypotheses in G3? For example, the hypothesis h = (?, ?, Normal, ?, ?, ?) is a minimal specialization of G2 that correctly labels the new example as a negative example, but it is not included in G3. The reason this hypothesis is excluded is that it is inconsistent with the previously encountered positive examples. The algorithm determines this simply by noting that h is not more general than the current specific boundary, S2. In fact, the S boundary of the version space forms a summary of the previously encountered positive examples that can be used to determine whether any given hypothesis is consistent with these examples.
Any hypothesis more general than S will, by definition, cover any example that S covers and thus will cover any past positive example. In a dual fashion, the G boundary summarizes the information from previously encountered negative examples. Any hypothesis more specific than G is assured to be consistent with past negative examples. This is true because any such hypothesis, by definition, cannot cover examples that G does not cover.

The fourth training example, as shown in Figure 2.6, further generalizes the S boundary of the version space. It also results in removing one member of the G boundary, because this member fails to cover the new positive example. This last action results from the first step under the condition "If d is a positive example" in the algorithm shown in Table 2.5. To understand the rationale for this step, it is useful to consider why the offending hypothesis must be removed from G. Notice it cannot be specialized, because specializing it would not make it cover the new example. It also cannot be generalized, because by the definition of G, any more general hypothesis will cover at least one negative training example. Therefore, the hypothesis must be dropped from the G boundary, thereby removing an entire branch of the partial ordering from the version space of hypotheses remaining under consideration.

FIGURE 2.6 CANDIDATE-ELIMINATION Trace 3. The positive training example generalizes the S boundary, from S3 to S4. One member of G3 must also be deleted, because it is no longer more general than the S4 boundary.

After processing these four examples, the boundary sets S4 and G4 delimit the version space of all hypotheses consistent with the set of incrementally observed training examples. The entire version space, including those hypotheses bounded by S4 and G4, is shown in Figure 2.7.
This learned version space is independent of the sequence in which the training examples are presented (because in the end it contains all hypotheses consistent with the set of examples). As further training data is encountered, the S and G boundaries will move monotonically closer to each other, delimiting a smaller and smaller version space of candidate hypotheses.

FIGURE 2.7
The final version space for the EnjoySport concept learning problem and training examples described earlier.

2.6 REMARKS ON VERSION SPACES AND CANDIDATE-ELIMINATION

2.6.1 Will the CANDIDATE-ELIMINATION Algorithm Converge to the Correct Hypothesis?

The version space learned by the CANDIDATE-ELIMINATION algorithm will converge toward the hypothesis that correctly describes the target concept, provided (1) there are no errors in the training examples, and (2) there is some hypothesis in H that correctly describes the target concept. In fact, as new training examples are observed, the version space can be monitored to determine the remaining ambiguity regarding the true target concept and to determine when sufficient training examples have been observed to unambiguously identify the target concept. The target concept is exactly learned when the S and G boundary sets converge to a single, identical hypothesis.

What will happen if the training data contains errors? Suppose, for example, that the second training example above is incorrectly presented as a negative example instead of a positive example. Unfortunately, in this case the algorithm is certain to remove the correct target concept from the version space! Because it removes every hypothesis that is inconsistent with any training example, it will eliminate the true target concept from the version space as soon as this false negative example is encountered.
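The effect of a single mislabeled example can be checked by brute force. The sketch below enumerates every conjunctive hypothesis and keeps the consistent ones, rather than maintaining S and G. It assumes the four EnjoySport examples of Table 2.1, and the target concept `TARGET` is a hypothetical choice made only for illustration (the text does not name the true concept).

```python
from itertools import product

# Attribute domains assumed from Table 2.1 and Section 2.2.
DOMAINS = [("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
           ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change")]


def covers(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(hv == "?" or hv == xv for hv, xv in zip(h, x))


def all_hypotheses():
    """One all-"0" hypothesis (covers nothing) plus every value-or-"?" tuple."""
    yield ("0",) * len(DOMAINS)
    yield from product(*[d + ("?",) for d in DOMAINS])


def version_space(examples):
    """All hypotheses that agree with the label of every example."""
    return [h for h in all_hypotheses()
            if all(covers(h, x) == label for x, label in examples)]


CLEAN = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
# Same data, but the second example is wrongly labeled negative.
NOISY = [CLEAN[0], (CLEAN[1][0], False), CLEAN[2], CLEAN[3]]

# Hypothetical target concept, chosen only for this illustration.
TARGET = ("Sunny", "Warm", "?", "?", "?", "?")

if __name__ == "__main__":
    print(len(version_space(CLEAN)))             # size with correct labels
    print(TARGET in version_space(NOISY[:2]))    # target right after the bad label
    print(len(version_space(NOISY)))             # size after all four noisy examples
```

Under these assumptions the target drops out of the version space as soon as the false negative is processed, and the version space is empty once all four noisy examples have been seen.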
Of course, given sufficient additional training data the learner will eventually detect an inconsistency by noticing that the S and G boundary sets eventually converge to an empty version space. Such an empty version space indicates that there is no hypothesis in H consistent with all observed training examples. A similar symptom will appear when the training examples are correct, but the target concept cannot be described in the hypothesis representation (e.g., if the target concept is a disjunction of feature attributes and the hypothesis space supports only conjunctive descriptions). We will consider such eventualities in greater detail later. For now, we consider only the case in which the training examples are correct and the true target concept is present in the hypothesis space.

2.6.2 What Training Example Should the Learner Request Next?

Up to this point we have assumed that training examples are provided to the learner by some external teacher. Suppose instead that the learner is allowed to conduct experiments in which it chooses the next instance, then obtains the correct classification for this instance from an external oracle (e.g., nature or a teacher). This scenario covers situations in which the learner may conduct experiments in nature (e.g., build new bridges and allow nature to classify them as stable or unstable), or in which a teacher is available to provide the correct classification (e.g., propose a new bridge and allow the teacher to suggest whether or not it will be stable). We use the term query to refer to such instances constructed by the learner, which are then classified by an external oracle.

Consider again the version space learned from the four training examples of the EnjoySport concept and illustrated in Figure 2.3. What would be a good query for the learner to pose at this point? What is a good query strategy in general?
Clearly, the learner should attempt to discriminate among the alternative competing hypotheses in its current version space. Therefore, it should choose an instance that would be classified positive by some of these hypotheses, but negative by others. One such instance is

(Sunny, Warm, Normal, Light, Warm, Same)

Note that this instance satisfies three of the six hypotheses in the current version space (Figure 2.3). If the trainer classifies this instance as a positive example, the S boundary of the version space can then be generalized. Alternatively, if the trainer indicates that this is a negative example, the G boundary can then be specialized. Either way, the learner will succeed in learning more about the true identity of the target concept, shrinking the version space from six hypotheses to half this number.

In general, the optimal query strategy for a concept learner is to generate instances that satisfy exactly half the hypotheses in the current version space. When this is possible, the size of the version space is reduced by half with each new example, and the correct target concept can therefore be found with only $\lceil \log_2 |VS| \rceil$ experiments. The situation is analogous to playing the game twenty questions, in which the goal is to ask yes-no questions to determine the correct hypothesis. The optimal strategy for playing twenty questions is to ask questions that evenly split the candidate hypotheses into sets that predict yes and no. While we have seen that it is possible to generate an instance that satisfies precisely half the hypotheses in the version space of Figure 2.3, in general it may not be possible to construct an instance that matches precisely half the hypotheses. In such cases, a larger number of queries may be required than $\lceil \log_2 |VS| \rceil$.

2.6.3 How Can Partially Learned Concepts Be Used?
Suppose that no additional training examples are available beyond the four in our example above, but that the learner is now required to classify new instances that it has not yet observed. Even though the version space of Figure 2.3 still contains multiple hypotheses, indicating that the target concept has not yet been fully learned, it is possible to classify certain examples with the same degree of confidence as if the target concept had been uniquely identified. To illustrate, suppose the learner is asked to classify the four new instances shown in Table 2.6.

Instance  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
A         Sunny  Warm     Normal    Strong  Cool   Change    ?
B         Rainy  Cold     Normal    Light   Warm   Same      ?
C         Sunny  Warm     Normal    Light   Warm   Same      ?
D         Sunny  Cold     Normal    Strong  Warm   Same      ?

TABLE 2.6
New instances to be classified.

Note that although instance A was not among the training examples, it is classified as a positive instance by every hypothesis in the current version space (shown in Figure 2.3). Because the hypotheses in the version space unanimously agree that this is a positive instance, the learner can classify instance A as positive with the same confidence it would have if it had already converged to the single, correct target concept. Regardless of which hypothesis in the version space is eventually found to be the correct target concept, it is already clear that it will classify instance A as a positive example. Notice furthermore that we need not enumerate every hypothesis in the version space in order to test whether each classifies the instance as positive. This condition will be met if and only if the instance satisfies every member of S (why?). The reason is that every other hypothesis in the version space is at least as general as some member of S.
By our definition of more-general-than, if the new instance satisfies all members of S it must also satisfy each of these more general hypotheses.

Similarly, instance B is classified as a negative instance by every hypothesis in the version space. This instance can therefore be safely classified as negative, given the partially learned concept. An efficient test for this condition is that the instance satisfies none of the members of G (why?).

Instance C presents a different situation. Half of the version space hypotheses classify it as positive and half classify it as negative. Thus, the learner cannot classify this example with confidence until further training examples are available. Notice that instance C is the same instance presented in the previous section as an optimal experimental query for the learner. This is to be expected, because those instances whose classification is most ambiguous are precisely the instances whose true classification would provide the most new information for refining the version space.

Finally, instance D is classified as positive by two of the version space hypotheses and negative by the other four hypotheses. In this case we have less confidence in the classification than in the unambiguous cases of instances A and B. Still, the vote is in favor of a negative classification, and one approach we could take would be to output the majority vote, perhaps with a confidence rating indicating how close the vote was. As we will discuss in Chapter 6, if we assume that all hypotheses in H are equally probable a priori, then such a vote provides the most probable classification of this new instance. Furthermore, the proportion of hypotheses voting positive can be interpreted as the probability that this instance is positive given the training data.
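The classifications of instances A through D can be checked with a few lines of code. This is a sketch under stated assumptions: the boundary sets `S` and `G` and the six-member `VERSION_SPACE` are written out by hand as the version space of Figure 2.3, and the helper names are illustrative. It shows both the boundary-set tests (satisfies every member of S, satisfies no member of G) and the explicit vote count.

```python
def covers(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(hv == "?" or hv == xv for hv, xv in zip(h, x))


# Boundary sets and full version space, assumed from Figure 2.3.
S = [("Sunny", "Warm", "?", "Strong", "?", "?")]
G = [("Sunny", "?", "?", "?", "?", "?"), ("?", "Warm", "?", "?", "?", "?")]
VERSION_SPACE = S + [
    ("Sunny", "?", "?", "Strong", "?", "?"),
    ("Sunny", "Warm", "?", "?", "?", "?"),
    ("?", "Warm", "?", "Strong", "?", "?"),
] + G

# The four new instances of Table 2.6.
NEW_INSTANCES = {
    "A": ("Sunny", "Warm", "Normal", "Strong", "Cool", "Change"),
    "B": ("Rainy", "Cold", "Normal", "Light", "Warm", "Same"),
    "C": ("Sunny", "Warm", "Normal", "Light", "Warm", "Same"),
    "D": ("Sunny", "Cold", "Normal", "Strong", "Warm", "Same"),
}


def classify(x):
    """Unanimous classification using only the boundary sets."""
    if all(covers(s, x) for s in S):
        return "positive"    # every hypothesis in the version space covers x
    if not any(covers(g, x) for g in G):
        return "negative"    # no hypothesis in the version space covers x
    return "don't know"


def positive_votes(x):
    """Number of version space hypotheses that classify x as positive."""
    return sum(covers(h, x) for h in VERSION_SPACE)


if __name__ == "__main__":
    for name, x in NEW_INSTANCES.items():
        print(name, classify(x), positive_votes(x), "of", len(VERSION_SPACE))
```

The vote counts also connect back to query selection: an instance such as C, which splits the six hypotheses three to three, is exactly the kind of instance that halves the version space when its label is obtained.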
2.7 INDUCTIVE BIAS

As discussed above, the CANDIDATE-ELIMINATION algorithm will converge toward the true target concept provided it is given accurate training examples and provided its initial hypothesis space contains the target concept. What if the target concept is not contained in the hypothesis space? Can we avoid this difficulty by using a hypothesis space that includes every possible hypothesis? How does the size of this hypothesis space influence the ability of the algorithm to generalize to unobserved instances? How does the size of the hypothesis space influence the number of training examples that must be observed? These are fundamental questions for inductive inference in general. Here we examine them in the context of the CANDIDATE-ELIMINATION algorithm. As we shall see, though, the conclusions we draw from this analysis will apply to any concept learning system that outputs any hypothesis consistent with the training data.

2.7.1 A Biased Hypothesis Space

Suppose we wish to assure that the hypothesis space contains the unknown target concept. The obvious solution is to enrich the hypothesis space to include every possible hypothesis. To illustrate, consider again the EnjoySport example in which we restricted the hypothesis space to include only conjunctions of attribute values. Because of this restriction, the hypothesis space is unable to represent even simple disjunctive target concepts such as "Sky = Sunny or Sky = Cloudy." In fact, given the following three training examples of this disjunctive hypothesis, our algorithm would find that there are zero hypotheses in the version space.
Example  Sky     AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny   Warm     Normal    Strong  Cool   Change    Yes
2        Cloudy  Warm     Normal    Strong  Cool   Change    Yes
3        Rainy   Warm     Normal    Strong  Cool   Change    No

To see why there are no hypotheses consistent with these three examples, note that the most specific hypothesis consistent with the first two examples and representable in the given hypothesis space H is

S2 : (?, Warm, Normal, Strong, Cool, Change)

This hypothesis, although it is the maximally specific hypothesis from H that is consistent with the first two examples, is already overly general: it incorrectly covers the third (negative) training example. The problem is that we have biased the learner to consider only conjunctive hypotheses. In this case we require a more expressive hypothesis space.

2.7.2 An Unbiased Learner

The obvious solution to the problem of assuring that the target concept is in the hypothesis space H is to provide a hypothesis space capable of representing every teachable concept; that is, it is capable of representing every possible subset of the instances X. In general, the set of all subsets of a set X is called the power set of X.

In the EnjoySport learning task, for example, the size of the instance space X of days described by the six available attributes is 96. How many possible concepts can be defined over this set of instances? In other words, how large is the power set of X? In general, the number of distinct subsets that can be defined over a set X containing $|X|$ elements (i.e., the size of the power set of X) is $2^{|X|}$. Thus, there are $2^{96}$, or approximately $10^{28}$, distinct target concepts that could be defined over this instance space and that our learner might be called upon to learn. Recall from Section 2.3 that our conjunctive hypothesis space is able to represent only 973 of these: a very biased hypothesis space indeed!
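The counts quoted above, and the claim that no conjunctive hypothesis fits the three Sunny/Cloudy/Rainy examples, can be verified directly. The following sketch assumes the EnjoySport attribute domains from earlier in the chapter (three values for Sky, two for every other attribute); the variable and function names are illustrative.

```python
from itertools import product
from math import prod

# Attribute domains assumed from Section 2.2.
DOMAINS = [("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
           ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change")]

# |X|: one value per attribute.
n_instances = prod(len(d) for d in DOMAINS)
# Semantically distinct conjunctive hypotheses: each attribute is a value or "?",
# plus one hypothesis that covers nothing.
n_conjunctive = 1 + prod(len(d) + 1 for d in DOMAINS)
# Size of the power set of X.
n_concepts = 2 ** n_instances


def covers(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(hv == "?" or hv == xv for hv, xv in zip(h, x))


def consistent_hypotheses(examples):
    """Conjunctive hypotheses that agree with every labeled example.

    The empty hypothesis is skipped: it cannot cover a positive example.
    """
    candidates = product(*[d + ("?",) for d in DOMAINS])
    return [h for h in candidates
            if all(covers(h, x) == label for x, label in examples)]


# The three training examples of the disjunctive concept from the table above.
DISJUNCTIVE = [
    (("Sunny", "Warm", "Normal", "Strong", "Cool", "Change"), True),
    (("Cloudy", "Warm", "Normal", "Strong", "Cool", "Change"), True),
    (("Rainy", "Warm", "Normal", "Strong", "Cool", "Change"), False),
]

if __name__ == "__main__":
    print(n_instances, n_conjunctive, f"{n_concepts:.1e}")
    print(consistent_hypotheses(DISJUNCTIVE))
```

The empty result for `consistent_hypotheses(DISJUNCTIVE)` reflects the argument in the text: covering both positive examples forces Sky to "?", and any such hypothesis also covers the Rainy example.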
Let us reformulate the EnjoySport learning task in an unbiased way by defining a new hypothesis space H' that can represent every subset of instances; that is, let H' correspond to the power set of X. One way to define such an H' is to allow arbitrary disjunctions, conjunctions, and negations of our earlier hypotheses. For instance, the target concept "Sky = Sunny or Sky = Cloudy" could then be described as

(Sunny, ?, ?, ?, ?, ?) $\lor$ (Cloudy, ?, ?, ?, ?, ?)

Given this hypothesis space, we can safely use the CANDIDATE-ELIMINATION algorithm without worrying that the target concept might not be expressible. However, while this hypothesis space eliminates any problems of expressibility, it unfortunately raises a new, equally difficult problem: our concept learning algorithm is now completely unable to generalize beyond the observed examples! To see why, suppose we present three positive examples ($x_1$, $x_2$, $x_3$) and two negative examples ($x_4$, $x_5$) to the learner. At this point, the S boundary of the version space will contain the hypothesis which is just the disjunction of the positive examples

$S : \{(x_1 \lor x_2 \lor x_3)\}$

because this is the most specific possible hypothesis that covers these three examples. Similarly, the G boundary will consist of the hypothesis that rules out only the observed negative examples

$G : \{\lnot(x_4 \lor x_5)\}$

The problem here is that with this very expressive hypothesis representation, the S boundary will always be simply the disjunction of the observed positive examples, while the G boundary will always be the negated disjunction of the observed negative examples. Therefore, the only examples that will be unambiguously classified by S and G are the observed training examples themselves. In order to converge to a single, final target concept, we will have to present every single instance in X as a training example!
It might at first seem that we could avoid this difficulty by simply using the partially learned version space and by taking a vote among the members of the version space as discussed in Section 2.6.3. Unfortunately, the only instances that will produce a unanimous vote are the previously observed training examples. For all the other instances, taking a vote will be futile: each unobserved instance will be classified positive by precisely half the hypotheses in the version space and will be classified negative by the other half (why?). To see the reason, note that when H is the power set of X and x is some previously unobserved instance, then for any hypothesis h in the version space that covers x, there will be another hypothesis h' in the power set that is identical to h except for its classification of x. And of course if h is in the version space, then h' will be as well, because it agrees with h on all the observed training examples.

2.7.3 The Futility of Bias-Free Learning

The above discussion illustrates a fundamental property of inductive inference: a learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances. In fact, the only reason that the CANDIDATE-ELIMINATION algorithm was able to generalize beyond the observed training examples in our original formulation of the EnjoySport task is that it was biased by the implicit assumption that the target concept could be represented by a conjunction of attribute values. In cases where this assumption is correct (and the training examples are error-free), its classification of new instances will also be correct. If this assumption is incorrect, however, it is certain that the CANDIDATE-ELIMINATION algorithm will misclassify at least some instances from X.
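The half-and-half vote of the unbiased learner is easy to observe on a toy problem. The sketch below is an illustration under invented assumptions: the instance space is six anonymous instances numbered 0 to 5, the hypothesis space is their full power set, and the three labeled examples are made up for the demonstration.

```python
from itertools import combinations


def power_set(xs):
    """Every subset of xs, each represented as a frozenset."""
    xs = list(xs)
    return [frozenset(c) for r in range(len(xs) + 1)
            for c in combinations(xs, r)]


def version_space(hypotheses, examples):
    """Hypotheses (sets of positive instances) that agree with every example."""
    return [h for h in hypotheses
            if all((x in h) == label for x, label in examples)]


# Hypothetical toy setting: six instances, two observed positive, one negative.
X = range(6)
EXAMPLES = [(0, True), (1, True), (2, False)]

VS = version_space(power_set(X), EXAMPLES)
# Number of version space members that classify each instance as positive.
VOTES = {x: sum(x in h for h in VS) for x in X}

# Boundary hypotheses: the smallest and the largest consistent subsets.
MOST_SPECIFIC = min(VS, key=len)   # just the observed positives
MOST_GENERAL = max(VS, key=len)    # everything except the observed negatives

if __name__ == "__main__":
    print(len(VS), VOTES)
    print(sorted(MOST_SPECIFIC), sorted(MOST_GENERAL))
```

In this setting the observed instances receive unanimous votes, while every unobserved instance is covered by exactly half of the version space, so the vote carries no information about unseen instances.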
Because inductive learning requires some form of prior assumptions, or inductive bias, we will find it useful to characterize different learning approaches by the inductive bias† they employ. Let us define this notion of inductive bias more precisely. The key idea we wish to capture here is the policy by which the learner generalizes beyond the observed training data, to infer the classification of new instances. Therefore, consider the general setting in which an arbitrary learning algorithm L is provided an arbitrary set of training data $D_c = \{\langle x, c(x) \rangle\}$ of some arbitrary target concept c. After training, L is asked to classify a new instance $x_i$. Let $L(x_i, D_c)$ denote the classification (e.g., positive or negative) that L assigns to $x_i$ after learning from the training data $D_c$. We can describe this inductive inference step performed by L as follows

$(D_c \land x_i) \succ L(x_i, D_c)$

where the notation $y \succ z$ indicates that z is inductively inferred from y. For example, if we take L to be the CANDIDATE-ELIMINATION algorithm, $D_c$ to be the training data from Table 2.1, and $x_i$ to be the first instance from Table 2.6, then the inductive inference performed in this case concludes that $L(x_i, D_c)$ = (EnjoySport = yes).

†The term inductive bias here is not to be confused with the term estimation bias commonly used in statistics. Estimation bias will be discussed in Chapter 5.

Because L is an inductive learning algorithm, the result $L(x_i, D_c)$ that it infers will not in general be provably correct; that is, the classification $L(x_i, D_c)$ need not follow deductively from the training data $D_c$ and the description of the new instance $x_i$. However, it is interesting to ask what additional assumptions could be added to $D_c \land x_i$ so that $L(x_i, D_c)$ would follow deductively. We define the inductive bias of L as this set of additional assumptions. More precisely, we define the
inductive bias of L to be the set of assumptions B such that for all new instances $x_i$

$(B \land D_c \land x_i) \vdash L(x_i, D_c)$

where the notation $y \vdash z$ indicates that z follows deductively from y (i.e., that z is provable from y). Thus, we define the inductive bias of a learner as the set of additional assumptions B sufficient to justify its inductive inferences as deductive inferences. To summarize,

Definition: Consider a concept learning algorithm L for the set of instances X. Let c be an arbitrary concept defined over X, and let $D_c = \{\langle x, c(x) \rangle\}$ be an arbitrary set of training examples of c. Let $L(x_i, D_c)$ denote the classification assigned to the instance $x_i$ by L after training on the data $D_c$. The inductive bias of L is any minimal set of assertions B such that for any target concept c and corresponding training examples $D_c$

$(\forall x_i \in X)\,[(B \land D_c \land x_i) \vdash L(x_i, D_c)]$    (2.1)

What, then, is the inductive bias of the CANDIDATE-ELIMINATION algorithm? To answer this, let us specify $L(x_i, D_c)$ exactly for this algorithm: given a set of data $D_c$, the CANDIDATE-ELIMINATION algorithm will first compute the version space $VS_{H,D_c}$, then classify the new instance $x_i$ by a vote among hypotheses in this version space. Here let us assume that it will output a classification for $x_i$ only if this vote among version space hypotheses is unanimously positive or negative and that it will not output a classification otherwise. Given this definition of $L(x_i, D_c)$ for the CANDIDATE-ELIMINATION algorithm, what is its inductive bias? It is simply the assumption $c \in H$. Given this assumption, each inductive inference performed by the CANDIDATE-ELIMINATION algorithm can be justified deductively.

To see why the classification $L(x_i, D_c)$ follows deductively from $B = \{c \in H\}$, together with the data $D_c$ and description of the instance $x_i$, consider the following argument.
First, notice that if we assume $c \in H$ then it follows deductively that $c \in VS_{H,D_c}$. This follows from $c \in H$, from the definition of the version space $VS_{H,D_c}$ as the set of all hypotheses in H that are consistent with $D_c$, and from our definition of $D_c = \{\langle x, c(x) \rangle\}$ as training data consistent with the target concept c. Second, recall that we defined the classification $L(x_i, D_c)$ to be the unanimous vote of all hypotheses in the version space. Thus, if L outputs the classification $L(x_i, D_c)$, it must be the case that every hypothesis in $VS_{H,D_c}$ also produces this classification, including the hypothesis $c \in VS_{H,D_c}$. Therefore $c(x_i) = L(x_i, D_c)$. To summarize, the CANDIDATE-ELIMINATION algorithm defined in this fashion can be characterized by the following bias

Inductive bias of CANDIDATE-ELIMINATION algorithm. The target concept c is contained in the given hypothesis space H.

Figure 2.8 summarizes the situation schematically. The inductive CANDIDATE-ELIMINATION algorithm at the top of the figure takes two inputs: the training examples and a new instance to be classified. At the bottom of the figure, a deductive

FIGURE 2.8
Modeling inductive systems by equivalent deductive systems. (Diagram panels: an "Inductive system," Candidate Elimination using hypothesis space H, and an "Equivalent deductive system," a theorem prover. Each takes the training examples and a new instance as input and outputs a classification of the new instance, or "don't know." The deductive system additionally takes the assertion "H contains the target concept," the inductive bias made explicit.) The input-output behavior of the CANDIDATE-ELIMINATION algorithm using a hypothesis space H is identical to that of a deductive theorem prover utilizing the assertion "H contains the target concept." This assertion is therefore called the inductive bias of the CANDIDATE-ELIMINATION algorithm. Characterizing inductive systems by their inductive bias allows modeling them by their equivalent deductive systems.
This provides a way to compare inductive systems according to their policies for generalizing beyond the observed training data.

theorem prover is given these same two inputs plus the assertion "H contains the target concept." These two systems will in principle produce identical outputs for every possible input set of training examples and every possible new instance in X. Of course the inductive bias that is explicitly input to the theorem prover is only implicit in the code of the CANDIDATE-ELIMINATION algorithm. In a sense, it exists only in the eye of us beholders. Nevertheless, it is a perfectly well-defined set of assertions.

One advantage of viewing inductive inference systems in terms of their inductive bias is that it provides a nonprocedural means of characterizing their policy for generalizing beyond the observed data. A second advantage is that it allows comparison of different learners according to the strength of the inductive bias they employ. Consider, for example, the following three learning algorithms, which are listed from weakest to strongest bias.

1. ROTE-LEARNER: Learning corresponds simply to storing each observed training example in memory. Subsequent instances are classified by looking them up in memory. If the instance is found in memory, the stored classification is returned. Otherwise, the system refuses to classify the new instance.
2. CANDIDATE-ELIMINATION algorithm: New instances are classified only in the case where all members of the current version space agree on the classification. Otherwise, the system refuses to classify the new instance.
3. FIND-S: This algorithm, described earlier, finds the most specific hypothesis consistent with the training examples. It then uses this hypothesis to classify all subsequent instances.

The ROTE-LEARNER has no inductive bias.
The classifications it provides for new instances follow deductively from the observed training examples, with no additional assumptions required. The CANDIDATE-ELIMINATION algorithm has a stronger inductive bias: that the target concept can be represented in its hypothesis space. Because it has a stronger bias, it will classify some instances that the ROTE-LEARNER will not. Of course the correctness of such classifications will depend completely on the correctness of this inductive bias. The FIND-S algorithm has an even stronger inductive bias. In addition to the assumption that the target concept can be described in its hypothesis space, it has an additional inductive bias assumption: that all instances are negative instances unless the opposite is entailed by its other knowledge.†

As we examine other inductive inference methods, it is useful to keep in mind this means of characterizing them and the strength of their inductive bias. More strongly biased methods make more inductive leaps, classifying a greater proportion of unseen instances. Some inductive biases correspond to categorical assumptions that completely rule out certain concepts, such as the bias "the hypothesis space H includes the target concept." Other inductive biases merely rank order the hypotheses by stating preferences such as "more specific hypotheses are preferred over more general hypotheses." Some biases are implicit in the learner and are unchangeable by the learner, such as the ones we have considered here. In Chapters 11 and 12 we will see other systems whose bias is made explicit as a set of assertions represented and manipulated by the learner.

2.8 SUMMARY AND FURTHER READING

The main points of this chapter include:

• Concept learning can be cast as a problem of searching through a large predefined space of potential hypotheses.
• The general-to-specific partial ordering of hypotheses, which can be defined for any concept learning problem, provides a useful structure for organizing the search through the hypothesis space.

†Notice this last inductive bias assumption involves a kind of default, or nonmonotonic reasoning.

• The FIND-S algorithm utilizes this general-to-specific ordering, performing a specific-to-general search through the hypothesis space along one branch of the partial ordering, to find the most specific hypothesis consistent with the training examples.
• The CANDIDATE-ELIMINATION algorithm utilizes this general-to-specific ordering to compute the version space (the set of all hypotheses consistent with the training data) by incrementally computing the sets of maximally specific (S) and maximally general (G) hypotheses.
• Because the S and G sets delimit the entire set of hypotheses consistent with the data, they provide the learner with a description of its uncertainty regarding the exact identity of the target concept. This version space of alternative hypotheses can be examined to determine whether the learner has converged to the target concept, to determine when the training data are inconsistent, to generate informative queries to further refine the version space, and to determine which unseen instances can be unambiguously classified based on the partially learned concept.
• Version spaces and the CANDIDATE-ELIMINATION algorithm provide a useful conceptual framework for studying concept learning. However, this learning algorithm is not robust to noisy data or to situations in which the unknown target concept is not expressible in the provided hypothesis space. Chapter 10 describes several concept learning algorithms based on the general-to-specific ordering, which are robust to noisy data.
• Inductive learning algorithms are able to classify unseen examples only because of their implicit inductive bias for selecting one consistent hypothesis over another.
The bias associated with the CANDIDATE-ELIMINATION algorithm is that the target concept can be found in the provided hypothesis space ($c \in H$). The output hypotheses and classifications of subsequent instances follow deductively from this assumption together with the observed training data.
• If the hypothesis space is enriched to the point where there is a hypothesis corresponding to every possible subset of instances (the power set of the instances), this will remove any inductive bias from the CANDIDATE-ELIMINATION algorithm. Unfortunately, this also removes the ability to classify any instance beyond the observed training examples. An unbiased learner cannot make inductive leaps to classify unseen examples.

The ideas of concept learning and of using the general-to-specific ordering have been studied for quite some time. Bruner et al. (1957) provided an early study of concept learning in humans, and Hunt and Hovland (1963) an early effort to automate it. Winston's (1970) widely known Ph.D. dissertation cast concept learning as a search involving generalization and specialization operators. Plotkin (1970, 1971) provided an early formalization of the more-general-than relation, as well as the related notion of θ-subsumption (discussed in Chapter 10). Simon and Lea (1973) give an early account of learning as search through a hypothesis space. Other early concept learning systems include (Popplestone 1969; Michalski 1973; Buchanan 1974; Vere 1975; Hayes-Roth 1974). A very large number of algorithms have since been developed for concept learning based on symbolic representations.
Chapter 10 describes several more recent algorithms for concept learning, including algorithms that learn concepts represented in first-order logic, algorithms that are robust to noisy training data, and algorithms whose performance degrades gracefully if the target concept is not representable in the hypothesis space considered by the learner.

Version spaces and the CANDIDATE-ELIMINATION algorithm were introduced by Mitchell (1977, 1982). The application of this algorithm to inferring rules of mass spectroscopy is described in (Mitchell 1979), and its application to learning search control rules is presented in (Mitchell et al. 1983). Haussler (1988) shows that the size of the general boundary can grow exponentially in the number of training examples, even when the hypothesis space consists of simple conjunctions of features. Smith and Rosenbloom (1990) show a simple change to the representation of the G set that can improve complexity in certain cases, and Hirsh (1992) shows that learning can be polynomial in the number of examples in some cases when the G set is not stored at all. Subramanian and Feigenbaum (1986) discuss a method that can generate efficient queries in certain cases by factoring the version space.

One of the greatest practical limitations of the CANDIDATE-ELIMINATION algorithm is that it requires noise-free training data. Mitchell (1979) describes an extension that can handle a bounded, predetermined number of misclassified examples, and Hirsh (1990, 1994) describes an elegant extension for handling bounded noise in real-valued attributes that describe the training examples. Hirsh (1990) describes an INCREMENTAL VERSION SPACE MERGING algorithm that generalizes the CANDIDATE-ELIMINATION algorithm to handle situations in which training information can be different types of constraints represented using version spaces.
The information from each constraint is represented by a version space and the constraints are then combined by intersecting the version spaces. Sebag (1994, 1996) presents what she calls a disjunctive version space approach to learning disjunctive concepts from noisy data. A separate version space is learned for each positive training example, then new instances are classified by combining the votes of these different version spaces. She reports experiments in several problem domains demonstrating that her approach is competitive with other widely used induction methods such as decision tree learning and k-NEAREST NEIGHBOR.

EXERCISES

2.1. Explain why the size of the hypothesis space in the EnjoySport learning task is 973. How would the number of possible instances and possible hypotheses increase with the addition of the attribute WaterCurrent, which can take on the values Light, Moderate, or Strong? More generally, how does the number of possible instances and hypotheses grow with the addition of a new attribute A that takes on k possible values?

2.2. Give the sequence of S and G boundary sets computed by the CANDIDATE-ELIMINATION algorithm if it is given the sequence of training examples from Table 2.1 in reverse order. Although the final version space will be the same regardless of the sequence of examples (why?), the sets S and G computed at intermediate stages will, of course, depend on this sequence. Can you come up with ideas for ordering the training examples to minimize the sum of the sizes of these intermediate S and G sets for the H used in the EnjoySport example?

2.3. Consider again the EnjoySport learning task and the hypothesis space H described in Section 2.2. Let us define a new hypothesis space H' that consists of all pairwise disjunctions of the hypotheses in H. For example, a typical hypothesis in H' is (?, Cold, High, ?, ?, ?)
∨ (Sunny, ?, High, ?, ?, Same)
Trace the CANDIDATE-ELIMINATION algorithm for the hypothesis space H' given the sequence of training examples from Table 2.1 (i.e., show the sequence of S and G boundary sets).

2.4. Consider the instance space consisting of integer points in the x, y plane and the set of hypotheses H consisting of rectangles. More precisely, hypotheses are of the form $a \le x \le b$, $c \le y \le d$, where a, b, c, and d can be any integers.
(a) Consider the version space with respect to the set of positive (+) and negative (-) training examples shown below. What is the S boundary of the version space in this case? Write out the hypotheses and draw them in on the diagram.
(b) What is the G boundary of this version space? Write out the hypotheses and draw them in.
(c) Suppose the learner may now suggest a new x, y instance and ask the trainer for its classification. Suggest a query guaranteed to reduce the size of the version space, regardless of how the trainer classifies it. Suggest one that will not.
(d) Now assume you are a teacher, attempting to teach a particular target concept (e.g., $3 \le x \le 5$, $2 \le y \le 9$). What is the smallest number of training examples you can provide so that the CANDIDATE-ELIMINATION algorithm will perfectly learn the target concept?

2.5. Consider the following sequence of positive and negative training examples describing the concept "pairs of people who live in the same house." Each training example describes an ordered pair of people, with each person described by their sex, hair color (black, brown, or blonde), height (tall, medium, or short), and nationality (US, French, German, Irish, Indian, Japanese, or Portuguese).
+ ((male brown tall US)(female black short US))
+ ((male brown short French)(female black short US))
- ((female brown tall German)(female black short Indian))
+ ((male brown tall Irish)(female brown short Irish))
Consider a hypothesis space defined over these instances, in which each hypothesis is represented by a pair of 4-tuples, and where each attribute constraint may be a specific value, "?," or "∅," just as in the EnjoySport hypothesis representation. For example, the hypothesis
((male ? tall ?)(female ? ? Japanese))
represents the set of all pairs of people where the first is a tall male (of any nationality and hair color), and the second is a Japanese female (of any hair color and height).
(a) Provide a hand trace of the CANDIDATE-ELIMINATION algorithm learning from the above training examples and hypothesis language. In particular, show the specific and general boundaries of the version space after it has processed the first training example, then the second training example, etc.
(b) How many distinct hypotheses from the given hypothesis space are consistent with the following single positive training example?
+ ((male black short Portuguese)(female blonde tall Indian))
(c) Assume the learner has encountered only the positive example from part (b), and that it is now allowed to query the trainer by generating any instance and asking the trainer to classify it. Give a specific sequence of queries that assures the learner will converge to the single correct hypothesis, whatever it may be (assuming that the target concept is describable within the given hypothesis language). Give the shortest sequence of queries you can find. How does the length of this sequence relate to your answer to question (b)?
(d) Note that this hypothesis language cannot express all concepts that can be defined over the instances (i.e., we can define sets of positive and negative examples for which there is no corresponding describable hypothesis).
If we were to enrich the language so that it could express all concepts that can be defined over the instance language, then how would your answer to (c) change?

2.6. Complete the proof of the version space representation theorem (Theorem 2.1).

2.7. Consider a concept learning problem in which each instance is a real number, and in which each hypothesis is an interval over the reals. More precisely, each hypothesis in the hypothesis space H is of the form $a < x < b$, where a and b are any real constants, and x refers to the instance. For example, the hypothesis $4.5 < x < 6.1$ classifies instances between 4.5 and 6.1 as positive, and others as negative. Explain informally why there cannot be a maximally specific consistent hypothesis for any set of positive training examples. Suggest a slight modification to the hypothesis representation so that there will be.

2.8. In this chapter, we commented that given an unbiased hypothesis space (the power set of the instances), the learner would find that each unobserved instance would match exactly half the current members of the version space, regardless of which training examples had been observed. Prove this. In particular, prove that for any instance space X, any set of training examples D, and any instance $x \in X$ not present in D, if H is the power set of X, then exactly half the hypotheses in $VS_{H,D}$ will classify x as positive and half will classify it as negative.

2.9. Consider a learning problem where each instance is described by a conjunction of n boolean attributes $a_1 \ldots a_n$. Thus, a typical instance would be
$$(a_1 = T) \wedge (a_2 = F) \wedge \ldots \wedge (a_n = T)$$
Now consider a hypothesis space H in which each hypothesis is a disjunction of constraints over these attributes. For example, a typical hypothesis would be
$$(a_1 = T) \vee (a_5 = F) \vee (a_7 = T)$$
Propose an algorithm that accepts a sequence of training examples and outputs a consistent hypothesis if one exists.
Your algorithm should run in time that is polynomial in n and in the number of training examples.

2.10. Implement the FIND-S algorithm. First verify that it successfully produces the trace in Section 2.4 for the EnjoySport example. Now use this program to study the number of random training examples required to exactly learn the target concept. Implement a training example generator that generates random instances, then classifies them according to the target concept:
(Sunny, Warm, ?, ?, ?, ?)
Consider training your FIND-S program on randomly generated examples and measuring the number of examples required before the program's hypothesis is identical to the target concept. Can you predict the average number of examples required? Run the experiment at least 20 times and report the mean number of examples required. How do you expect this number to vary with the number of "?" in the target concept? How would it vary with the number of attributes used to describe instances and hypotheses?

REFERENCES

Bruner, J. S., Goodnow, J. J., & Austin, G. A. (1957). A study of thinking. New York: John Wiley & Sons.
Buchanan, B. G. (1974). Scientific theory formation by computer. In J. C. Simon (Ed.), Computer Oriented Learning Processes. Leyden: Noordhoff.
Gunter, C. A., Ngair, T., Panangaden, P., & Subramanian, D. (1991). The common order-theoretic structure of version spaces and ATMS's. Proceedings of the National Conference on Artificial Intelligence (pp. 500-505). Anaheim.
Haussler, D. (1988). Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36, 177-221.
Hayes-Roth, F. (1974). Schematic classification problems and their solution. Pattern Recognition, 6, 105-113.
Hirsh, H. (1990). Incremental version space merging: A general framework for concept learning. Boston: Kluwer.
Hirsh, H. (1991). Theoretical underpinnings of version spaces. Proceedings of the 12th IJCAI (pp. 665-670). Sydney.
Hirsh, H. (1994).
Generalizing version spaces. Machine Learning, 17(1), 5-46.
Hunt, E. G., & Hovland, D. I. (1963). Programming a model of human concept formation. In E. Feigenbaum & J. Feldman (Eds.), Computers and thought (pp. 310-325). New York: McGraw Hill.
Michalski, R. S. (1973). AQVAL/1: Computer implementation of a variable valued logic system VL1 and examples of its application to pattern recognition. Proceedings of the 1st International Joint Conference on Pattern Recognition (pp. 3-17).
Mitchell, T. M. (1977). Version spaces: A candidate elimination approach to rule learning. Fifth International Joint Conference on AI (pp. 305-310). Cambridge, MA: MIT Press.
Mitchell, T. M. (1979). Version spaces: An approach to concept learning (Ph.D. dissertation). Electrical Engineering Dept., Stanford University, Stanford, CA.
Mitchell, T. M. (1982). Generalization as search. Artificial Intelligence, 18(2), 203-226.
Mitchell, T. M., Utgoff, P. E., & Banerji, R. (1983). Learning by experimentation: Acquiring and modifying problem-solving heuristics. In Michalski, Carbonell, & Mitchell (Eds.), Machine Learning (Vol. 1, pp. 163-190). Tioga Press.
Plotkin, G. D. (1970). A note on inductive generalization. In Meltzer & Michie (Eds.), Machine Intelligence 5 (pp. 153-163). Edinburgh University Press.
Plotkin, G. D. (1971). A further note on inductive generalization. In Meltzer & Michie (Eds.), Machine Intelligence 6 (pp. 104-124). Edinburgh University Press.
Popplestone, R. J. (1969). An experiment in automatic induction. In Meltzer & Michie (Eds.), Machine Intelligence 5 (pp. 204-215). Edinburgh University Press.
Sebag, M. (1994). Using constraints to build version spaces. Proceedings of the 1994 European Conference on Machine Learning. Springer-Verlag.
Sebag, M. (1996). Delaying the choice of bias: A disjunctive version space approach. Proceedings of the 13th International Conference on Machine Learning (pp. 444-452). San Francisco: Morgan Kaufmann.
Simon, H. A., & Lea, G. (1973).
Problem solving and rule induction: A unified view. In Gregg (Ed.), Knowledge and Cognition (pp. 105-127). New Jersey: Lawrence Erlbaum Associates.
Smith, B. D., & Rosenbloom, P. (1990). Incremental non-backtracking focusing: A polynomially bounded generalization algorithm for version spaces. Proceedings of the 1990 National Conference on Artificial Intelligence (pp. 848-853). Boston.
Subramanian, D., & Feigenbaum, J. (1986). Factorization in experiment generation. Proceedings of the 1986 National Conference on Artificial Intelligence (pp. 518-522). Morgan Kaufmann.
Vere, S. A. (1975). Induction of concepts in the predicate calculus. Fourth International Joint Conference on AI (pp. 281-287). Tbilisi, USSR.
Winston, P. H. (1970). Learning structural descriptions from examples (Ph.D. dissertation). [MIT Technical Report AI-TR-231].

CHAPTER 3
DECISION TREE LEARNING

Decision tree learning is one of the most widely used and practical methods for inductive inference. It is a method for approximating discrete-valued functions that is robust to noisy data and capable of learning disjunctive expressions. This chapter describes a family of decision tree learning algorithms that includes widely used algorithms such as ID3, ASSISTANT, and C4.5. These decision tree learning methods search a completely expressive hypothesis space and thus avoid the difficulties of restricted hypothesis spaces. Their inductive bias is a preference for small trees over large trees.

3.1 INTRODUCTION

Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. Learned trees can also be re-represented as sets of if-then rules to improve human readability. These learning methods are among the most popular of inductive inference algorithms and have been successfully applied to a broad range of tasks from learning to diagnose medical cases to learning to assess credit risk of loan applicants.
3.2 DECISION TREE REPRESENTATION

Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. Each node in the tree specifies a test of some attribute of the instance, and each branch descending from that node corresponds to one of the possible values for this attribute. An instance is classified by starting at the root node of the tree, testing the attribute specified by this node, then moving down the tree branch corresponding to the value of the attribute in the given example. This process is then repeated for the subtree rooted at the new node.

FIGURE 3.1 A decision tree for the concept PlayTennis. An example is classified by sorting it through the tree to the appropriate leaf node, then returning the classification associated with this leaf (in this case, Yes or No). This tree classifies Saturday mornings according to whether or not they are suitable for playing tennis.

Figure 3.1 illustrates a typical learned decision tree. This decision tree classifies Saturday mornings according to whether they are suitable for playing tennis. For example, the instance
(Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong)
would be sorted down the leftmost branch of this decision tree and would therefore be classified as a negative instance (i.e., the tree predicts that PlayTennis = No). This tree and the example used in Table 3.2 to illustrate the ID3 learning algorithm are adapted from (Quinlan 1986).

In general, decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances. Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself to a disjunction of these conjunctions.
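The sorting procedure and the path-as-conjunction reading can be sketched in a few lines of Python. This is only an illustration: the nested-tuple encoding of the Figure 3.1 tree and the helper names `classify` and `positive_paths` are assumptions made here, not notation from the text.

```python
# Hypothetical encoding of the Figure 3.1 tree (an assumption of this sketch):
# an internal node is (attribute, {value: subtree}); a leaf is a label string.
TREE = ("Outlook", {
    "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain": ("Wind", {"Strong": "No", "Weak": "Yes"}),
})


def classify(tree, instance):
    """Sort an instance down the tree until a leaf label is reached."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree


def positive_paths(tree, path=()):
    """Return each root-to-leaf path ending in Yes as a tuple of (attribute, value) tests."""
    if isinstance(tree, str):
        return [path] if tree == "Yes" else []
    attribute, branches = tree
    paths = []
    for value, subtree in branches.items():
        paths += positive_paths(subtree, path + ((attribute, value),))
    return paths


instance = {"Outlook": "Sunny", "Temperature": "Hot",
            "Humidity": "High", "Wind": "Strong"}
print(classify(TREE, instance))
for path in positive_paths(TREE):
    print(" AND ".join(f"{a} = {v}" for a, v in path))
```

Each printed line is one conjunction of attribute tests; the tree as a whole is the disjunction of those lines.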
For example, the decision tree shown in Figure 3.1 corresponds to the expression

$$(\mathit{Outlook} = \mathit{Sunny} \wedge \mathit{Humidity} = \mathit{Normal}) \vee (\mathit{Outlook} = \mathit{Overcast}) \vee (\mathit{Outlook} = \mathit{Rain} \wedge \mathit{Wind} = \mathit{Weak})$$

3.3 APPROPRIATE PROBLEMS FOR DECISION TREE LEARNING

Although a variety of decision tree learning methods have been developed with somewhat differing capabilities and requirements, decision tree learning is generally best suited to problems with the following characteristics:

• Instances are represented by attribute-value pairs. Instances are described by a fixed set of attributes (e.g., Temperature) and their values (e.g., Hot). The easiest situation for decision tree learning is when each attribute takes on a small number of disjoint possible values (e.g., Hot, Mild, Cold). However, extensions to the basic algorithm (discussed in Section 3.7.2) allow handling real-valued attributes as well (e.g., representing Temperature numerically).
• The target function has discrete output values. The decision tree in Figure 3.1 assigns a boolean classification (e.g., yes or no) to each example. Decision tree methods easily extend to learning functions with more than two possible output values. A more substantial extension allows learning target functions with real-valued outputs, though the application of decision trees in this setting is less common.
• Disjunctive descriptions may be required. As noted above, decision trees naturally represent disjunctive expressions.
• The training data may contain errors. Decision tree learning methods are robust to errors, both errors in classifications of the training examples and errors in the attribute values that describe these examples.
• The training data may contain missing attribute values. Decision tree methods can be used even when some training examples have unknown values (e.g., if the Humidity of the day is known for only some of the training examples). This issue is discussed in Section 3.7.4.
Many practical problems have been found to fit these characteristics. Decision tree learning has therefore been applied to problems such as learning to classify medical patients by their disease, equipment malfunctions by their cause, and loan applicants by their likelihood of defaulting on payments. Such problems, in which the task is to classify examples into one of a discrete set of possible categories, are often referred to as classification problems.

The remainder of this chapter is organized as follows. Section 3.4 presents the basic ID3 algorithm for learning decision trees and illustrates its operation in detail. Section 3.5 examines the hypothesis space search performed by this learning algorithm, contrasting it with algorithms from Chapter 2. Section 3.6 characterizes the inductive bias of this decision tree learning algorithm and explores more generally an inductive bias called Occam's razor, which corresponds to a preference for the simplest hypothesis. Section 3.7 discusses the issue of overfitting the training data, as well as strategies such as rule post-pruning to deal with this problem. This section also discusses a number of more advanced topics such as extending the algorithm to accommodate real-valued attributes, training data with unobserved attributes, and attributes with differing costs.

3.4 THE BASIC DECISION TREE LEARNING ALGORITHM

Most algorithms that have been developed for learning decision trees are variations on a core algorithm that employs a top-down, greedy search through the space of possible decision trees. This approach is exemplified by the ID3 algorithm (Quinlan 1986) and its successor C4.5 (Quinlan 1993), which form the primary focus of our discussion here. In this section we present the basic algorithm for decision tree learning, corresponding approximately to the ID3 algorithm.
In Section 3.7 we consider a number of extensions to this basic algorithm, including extensions incorporated into C4.5 and other more recent algorithms for decision tree learning.

Our basic algorithm, ID3, learns decision trees by constructing them top-down, beginning with the question "which attribute should be tested at the root of the tree?" To answer this question, each instance attribute is evaluated using a statistical test to determine how well it alone classifies the training examples. The best attribute is selected and used as the test at the root node of the tree. A descendant of the root node is then created for each possible value of this attribute, and the training examples are sorted to the appropriate descendant node (i.e., down the branch corresponding to the example's value for this attribute). The entire process is then repeated using the training examples associated with each descendant node to select the best attribute to test at that point in the tree. This forms a greedy search for an acceptable decision tree, in which the algorithm never backtracks to reconsider earlier choices. A simplified version of the algorithm, specialized to learning boolean-valued functions (i.e., concept learning), is described in Table 3.1.

3.4.1 Which Attribute Is the Best Classifier?

The central choice in the ID3 algorithm is selecting which attribute to test at each node in the tree. We would like to select the attribute that is most useful for classifying examples. What is a good quantitative measure of the worth of an attribute? We will define a statistical property, called information gain, that measures how well a given attribute separates the training examples according to their target classification. ID3 uses this information gain measure to select among the candidate attributes at each step while growing the tree.
ID3(Examples, Target_attribute, Attributes)
    Examples are the training examples. Target_attribute is the attribute whose value is to be predicted by the tree. Attributes is a list of other attributes that may be tested by the learned decision tree. Returns a decision tree that correctly classifies the given Examples.
    • Create a Root node for the tree
    • If all Examples are positive, Return the single-node tree Root, with label = +
    • If all Examples are negative, Return the single-node tree Root, with label = -
    • If Attributes is empty, Return the single-node tree Root, with label = most common value of Target_attribute in Examples
    • Otherwise Begin
        • A ← the attribute from Attributes that best* classifies Examples
        • The decision attribute for Root ← A
        • For each possible value, v_i, of A,
            • Add a new tree branch below Root, corresponding to the test A = v_i
            • Let Examples_vi be the subset of Examples that have value v_i for A
            • If Examples_vi is empty
                • Then below this new branch add a leaf node with label = most common value of Target_attribute in Examples
                • Else below this new branch add the subtree ID3(Examples_vi, Target_attribute, Attributes - {A})
    • End
    • Return Root
* The best attribute is the one with highest information gain, as defined in Equation (3.4).

TABLE 3.1 Summary of the ID3 algorithm specialized to learning boolean-valued functions. ID3 is a greedy algorithm that grows the tree top-down, at each node selecting the attribute that best classifies the local training examples. This process continues until the tree perfectly classifies the training examples, or until all attributes have been used.

3.4.1.1 ENTROPY MEASURES HOMOGENEITY OF EXAMPLES

In order to define information gain precisely, we begin by defining a measure commonly used in information theory, called entropy, that characterizes the (im)purity of an arbitrary collection of examples. Given a collection S, containing positive and negative examples of some target concept, the entropy of S relative to this boolean classification is

$$Entropy(S) \equiv -p_{\oplus} \log_2 p_{\oplus} - p_{\ominus} \log_2 p_{\ominus}$$

where $p_{\oplus}$ is the proportion of positive examples in S and $p_{\ominus}$ is the proportion of negative examples in S. In all calculations involving entropy we define $0 \log 0$ to be 0.

To illustrate, suppose S is a collection of 14 examples of some boolean concept, including 9 positive and 5 negative examples (we adopt the notation [9+, 5-] to summarize such a sample of data). Then the entropy of S relative to this boolean classification is

$$Entropy([9+, 5-]) = -(9/14) \log_2 (9/14) - (5/14) \log_2 (5/14) = 0.940$$

Notice that the entropy is 0 if all members of S belong to the same class. For example, if all members are positive ($p_{\oplus} = 1$), then $p_{\ominus}$ is 0, and $Entropy(S) = -1 \cdot \log_2(1) - 0 \cdot \log_2 0 = -1 \cdot 0 - 0 \cdot \log_2 0 = 0$. Note the entropy is 1 when the collection contains an equal number of positive and negative examples. If the collection contains unequal numbers of positive and negative examples, the entropy is between 0 and 1. Figure 3.2 shows the form of the entropy function relative to a boolean classification, as $p_{\oplus}$ varies between 0 and 1.

FIGURE 3.2 The entropy function relative to a boolean classification, as the proportion, $p_{\oplus}$, of positive examples varies between 0 and 1.

One interpretation of entropy from information theory is that it specifies the minimum number of bits of information needed to encode the classification of an arbitrary member of S (i.e., a member of S drawn at random with uniform probability). For example, if $p_{\oplus}$ is 1, the receiver knows the drawn example will be positive, so no message need be sent, and the entropy is zero. On the other hand, if $p_{\oplus}$ is 0.5, one bit is required to indicate whether the drawn example is positive or negative. If $p_{\oplus}$ is 0.8, then a collection of messages can be encoded using on average less than 1 bit per message by assigning shorter codes to collections of positive examples and longer codes to less likely negative examples.
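The entropy definition above can be checked numerically with a short Python sketch. The function name `entropy` and its list-of-counts argument are choices made for this illustration; the sketch assumes a non-empty collection and skips empty classes, which implements the convention that 0 log 0 is 0.

```python
from math import log2


def entropy(counts):
    """Entropy, in bits, of a collection summarized by per-class example counts."""
    total = sum(counts)
    # Classes with zero examples are skipped, following the convention 0 log 0 = 0.
    return sum(-(n / total) * log2(n / total) for n in counts if n > 0)


print(f"{entropy([9, 5]):.3f}")   # the [9+, 5-] sample from the text
print(entropy([7, 7]))            # equal numbers of positive and negative examples
print(entropy([14, 0]))           # all examples in one class
```

The three calls correspond to the cases discussed in the text: the [9+, 5-] sample, an evenly split collection, and a collection whose members all belong to one class.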
Thus far we have discussed entropy in the special case where the target classification is boolean. More generally, if the target attribute can take on c different values, then the entropy of S relative to this c-wise classification is defined as

$$Entropy(S) \equiv \sum_{i=1}^{c} -p_i \log_2 p_i$$

where $p_i$ is the proportion of S belonging to class i. Note the logarithm is still base 2 because entropy is a measure of the expected encoding length measured in bits. Note also that if the target attribute can take on c possible values, the entropy can be as large as $\log_2 c$.

3.4.1.2 INFORMATION GAIN MEASURES THE EXPECTED REDUCTION IN ENTROPY

Given entropy as a measure of the impurity in a collection of training examples, we can now define a measure of the effectiveness of an attribute in classifying the training data. The measure we will use, called information gain, is simply the expected reduction in entropy caused by partitioning the examples according to this attribute. More precisely, the information gain, Gain(S, A), of an attribute A, relative to a collection of examples S, is defined as

$$Gain(S, A) \equiv Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v) \qquad (3.4)$$

where Values(A) is the set of all possible values for attribute A, and $S_v$ is the subset of S for which attribute A has value v (i.e., $S_v = \{s \in S \mid A(s) = v\}$). Note the first term in Equation (3.4) is just the entropy of the original collection S, and the second term is the expected value of the entropy after S is partitioned using attribute A. The expected entropy described by this second term is simply the sum of the entropies of each subset $S_v$, weighted by the fraction of examples $\frac{|S_v|}{|S|}$ that belong to $S_v$. Gain(S, A) is therefore the expected reduction in entropy caused by knowing the value of attribute A. Put another way, Gain(S, A) is the information provided about the target function value, given the value of some other attribute A.
The value of Gain(S, A) is the number of bits saved when encoding the target value of an arbitrary member of S, by knowing the value of attribute A.

For example, suppose S is a collection of training-example days described by attributes including Wind, which can have the values Weak or Strong. As before, assume S is a collection containing 14 examples, [9+, 5-]. Of these 14 examples, suppose 6 of the positive and 2 of the negative examples have Wind = Weak, and the remainder have Wind = Strong. The information gain due to sorting the original 14 examples by the attribute Wind may then be calculated as

$$
\begin{aligned}
Values(Wind) &= Weak, Strong \\
S &= [9+, 5-] \\
S_{Weak} &\leftarrow [6+, 2-] \\
S_{Strong} &\leftarrow [3+, 3-] \\
Gain(S, Wind) &= Entropy(S) - \sum_{v \in \{Weak, Strong\}} \frac{|S_v|}{|S|} Entropy(S_v) \\
&= Entropy(S) - (8/14)\, Entropy(S_{Weak}) - (6/14)\, Entropy(S_{Strong}) \\
&= 0.940 - (8/14)\, 0.811 - (6/14)\, 1.00 \\
&= 0.048
\end{aligned}
$$

Information gain is precisely the measure used by ID3 to select the best attribute at each step in growing the tree. The use of information gain to evaluate the relevance of attributes is summarized in Figure 3.3. In this figure the information gain of two different attributes, Humidity and Wind, is computed in order to determine which is the better attribute for classifying the training examples shown in Table 3.2.

Which attribute is the best classifier?
    Humidity: S = [9+, 5-], E = 0.940; High: [3+, 4-], E = 0.985; Normal: [6+, 1-], E = 0.592
        Gain(S, Humidity) = .940 - (7/14).985 - (7/14).592 = .151
    Wind: S = [9+, 5-], E = 0.940; Weak: [6+, 2-], E = 0.811; Strong: [3+, 3-], E = 1.00
        Gain(S, Wind) = .940 - (8/14).811 - (6/14)1.0 = .048

FIGURE 3.3 Humidity provides greater information gain than Wind, relative to the target classification. Here, E stands for entropy and S for the original collection of examples. Given an initial collection S of 9 positive and 5 negative examples, [9+, 5-], sorting these by their Humidity produces collections of [3+, 4-] (Humidity = High) and [6+, 1-] (Humidity = Normal). The information gained by this partitioning is .151, compared to a gain of only .048 for the attribute Wind.
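The two gain computations of Figure 3.3 can be reproduced from the class counts alone. The following Python sketch is one way to do so; the function names and the (positive, negative) count pairs used as inputs are conventions assumed for this illustration, not notation from the text.

```python
from math import log2


def entropy(pos, neg):
    """Entropy, in bits, of a collection with pos positive and neg negative examples."""
    total = pos + neg
    return sum(-(n / total) * log2(n / total) for n in (pos, neg) if n > 0)


def gain(parent, partitions):
    """Information gain in the sense of Equation (3.4).

    parent is the (pos, neg) count pair for S; partitions maps each attribute
    value v to the (pos, neg) count pair for the subset S_v.
    """
    total = sum(parent)
    expected = sum((p + n) / total * entropy(p, n) for p, n in partitions.values())
    return entropy(*parent) - expected


wind = gain((9, 5), {"Weak": (6, 2), "Strong": (3, 3)})
humidity = gain((9, 5), {"High": (3, 4), "Normal": (6, 1)})
print(f"Gain(S, Wind)     = {wind:.3f}")
print(f"Gain(S, Humidity) = {humidity:.3f}")
```

Because the sketch keeps full floating-point precision in the intermediate entropies, the Humidity gain comes out near 0.152 rather than the .151 shown in the figure, which appears to result from using the rounded entropy values .940, .985, and .592.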
3.4.2 An Illustrative Example

To illustrate the operation of ID3, consider the learning task represented by the training examples of Table 3.2. Here the target attribute PlayTennis, which can have values yes or no for different Saturday mornings, is to be predicted based on other attributes of the morning in question. Consider the first step through the algorithm, in which the topmost node of the decision tree is created. Which attribute should be tested first in the tree? ID3 determines the information gain for each candidate attribute (i.e., Outlook, Temperature, Humidity, and Wind), then selects the one with highest information gain. The computation of information gain for two of these attributes is shown in Figure 3.3. The information gain values for all four attributes are

    Gain(S, Outlook) = 0.246
    Gain(S, Humidity) = 0.151
    Gain(S, Wind) = 0.048
    Gain(S, Temperature) = 0.029

where S denotes the collection of training examples from Table 3.2.

    Day   Outlook    Temperature   Humidity   Wind     PlayTennis
    D1    Sunny      Hot           High       Weak     No
    D2    Sunny      Hot           High       Strong   No
    D3    Overcast   Hot           High       Weak     Yes
    D4    Rain       Mild          High       Weak     Yes
    D5    Rain       Cool          Normal     Weak     Yes
    D6    Rain       Cool          Normal     Strong   No
    D7    Overcast   Cool          Normal     Strong   Yes
    D8    Sunny      Mild          High       Weak     No
    D9    Sunny      Cool          Normal     Weak     Yes
    D10   Rain       Mild          Normal     Weak     Yes
    D11   Sunny      Mild          Normal     Strong   Yes
    D12   Overcast   Mild          High       Strong   Yes
    D13   Overcast   Hot           Normal     Weak     Yes
    D14   Rain       Mild          High       Strong   No

TABLE 3.2 Training examples for the target concept PlayTennis.

According to the information gain measure, the Outlook attribute provides the best prediction of the target attribute, PlayTennis, over the training examples. Therefore, Outlook is selected as the decision attribute for the root node, and branches are created below the root for each of its possible values (i.e., Sunny, Overcast, and Rain). The resulting partial decision tree is shown in Figure 3.4, along with the training examples sorted to each new descendant node.
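As a rough check on this first step, the Python sketch below recomputes the information gain of each attribute from the rows of Table 3.2 and then grows a tree recursively in the manner of Table 3.1. The names `EXAMPLES`, `entropy`, `gain`, and `id3` and the nested-tuple tree encoding are illustrative assumptions. The sketch also branches only on attribute values that occur in the local examples, a simplification of Table 3.1, which adds a majority-label leaf for values with no examples.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]
TARGET = "PlayTennis"

# Rows D1..D14 of Table 3.2: Outlook, Temperature, Humidity, Wind, PlayTennis.
ROWS = """
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""
EXAMPLES = [dict(zip(ATTRIBUTES + [TARGET], row.split()))
            for row in ROWS.strip().splitlines()]


def entropy(examples, target):
    """Entropy, in bits, of the target labels in a non-empty list of examples."""
    total = len(examples)
    counts = Counter(e[target] for e in examples)
    return sum(-(n / total) * log2(n / total) for n in counts.values())


def gain(examples, attribute, target):
    """Information gain of attribute over examples, following Equation (3.4)."""
    total = len(examples)
    expected = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        expected += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - expected


def id3(examples, target, attributes):
    """Return a leaf label, or (attribute, {value: subtree}) for an internal node."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:          # all examples share one label
        return labels[0]
    if not attributes:                 # no tests left: use the most common label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({e[best] for e in examples}):
        subset = [e for e in examples if e[best] == value]
        branches[value] = id3(subset, target, remaining)
    return (best, branches)


for attribute in ATTRIBUTES:
    print(f"Gain(S, {attribute}) = {gain(EXAMPLES, attribute, TARGET):.3f}")
print(id3(EXAMPLES, TARGET, ATTRIBUTES))
```

With unrounded intermediate values the printed gains can differ from the figures quoted in the text in the third decimal place (for example, roughly 0.247 rather than 0.246 for Outlook), but the ranking of the attributes is unchanged and Outlook is still selected at the root.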
Note that every example for which Outlook = Overcast is also a positive example of PlayTennis. Therefore, this node of the tree becomes a leaf node with the classification PlayTennis = Yes. In contrast, the descendants corresponding to Outlook = Sunny and Outlook = Rain still have nonzero entropy, and the decision tree will be further elaborated below these nodes.

The process of selecting a new attribute and partitioning the training examples is now repeated for each nonterminal descendant node, this time using only the training examples associated with that node. Attributes that have been incorporated higher in the tree are excluded, so that any given attribute can appear at most once along any path through the tree. This process continues for each new leaf node until either of two conditions is met: (1) every attribute has already been included along this path through the tree, or (2) the training examples associated with this leaf node all have the same target attribute value (i.e., their entropy is zero). Figure 3.4 illustrates the computations of information gain for the next step in growing the decision tree. The final decision tree learned by ID3 from the 14 training examples of Table 3.2 is shown in Figure 3.1.

    {D1, D2, ..., D14} [9+, 5-]
    Which attribute should be tested here?
    $$Gain(S_{Sunny}, Humidity) = .970 - (3/5)\, 0.0 - (2/5)\, 0.0 = .970$$
    $$Gain(S_{Sunny}, Temperature) = .970 - (2/5)\, 0.0 - (2/5)\, 1.0 - (1/5)\, 0.0 = .570$$
    $$Gain(S_{Sunny}, Wind) = .970 - (2/5)\, 1.0 - (3/5)\, .918 = .019$$

FIGURE 3.4 The partially learned decision tree resulting from the first step of ID3. The training examples are sorted to the corresponding descendant nodes. The Overcast descendant has only positive examples and therefore becomes a leaf node with classification Yes. The other two nodes will be further expanded, by selecting the attribute with highest information gain relative to the new subsets of examples.

3.5 HYPOTHESIS SPACE SEARCH IN DECISION TREE LEARNING

As with other inductive learning methods, ID3 can be characterized as searching a space of hypotheses for one that fits the training examples. The hypothesis space searched by ID3 is the set of possible decision trees. ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space, beginning with the empty tree, then considering progressively more elaborate hypotheses in search of a decision tree that correctly classifies the training data. The evaluation function that guides this hill-climbing search is the information gain measure. This search is depicted in Figure 3.5.

FIGURE 3.5 Hypothesis space search by ID3. ID3 searches through the space of possible decision trees from simplest to increasingly complex, guided by the information gain heuristic.

By viewing ID3 in terms of its search space and search strategy, we can get some insight into its capabilities and limitations.

• ID3's hypothesis space of all decision trees is a complete space of finite discrete-valued functions, relative to the available attributes. Because every finite discrete-valued function can be represented by some decision tree, ID3 avoids one of the major risks of methods that search incomplete hypothesis spaces (such as methods that consider only conjunctive hypotheses): that the hypothesis space might not contain the target function.
• ID3 maintains only a single current hypothesis as it searches through the space of decision trees. This contrasts, for example, with the earlier version space CANDIDATE-ELIMINATION method, which maintains the set of all hypotheses consistent with the available training examples. By determining only a single hypothesis, ID3 loses the capabilities that follow from explicitly representing all consistent hypotheses.
For example, it does not have the ability to determine how many alternative decision trees are consistent with the available training data, or to pose new instance queries that optimally resolve among these competing hypotheses.

• ID3 in its pure form performs no backtracking in its search. Once it selects an attribute to test at a particular level in the tree, it never backtracks to reconsider this choice. Therefore, it is susceptible to the usual risks of hill-climbing search without backtracking: converging to locally optimal solutions that are not globally optimal. In the case of ID3, a locally optimal solution corresponds to the decision tree it selects along the single search path it explores. However, this locally optimal solution may be less desirable than trees that would have been encountered along a different branch of the search. Below we discuss an extension that adds a form of backtracking (post-pruning the decision tree).

• ID3 uses all training examples at each step in the search to make statistically based decisions regarding how to refine its current hypothesis. This contrasts with methods that make decisions incrementally, based on individual training examples (e.g., FIND-S or CANDIDATE-ELIMINATION). One advantage of using statistical properties of all the examples (e.g., information gain) is that the resulting search is much less sensitive to errors in individual training examples. ID3 can be easily extended to handle noisy training data by modifying its termination criterion to accept hypotheses that imperfectly fit the training data.

3.6 INDUCTIVE BIAS IN DECISION TREE LEARNING

What is the policy by which ID3 generalizes from observed training examples to classify unseen instances? In other words, what is its inductive bias? Recall from Chapter 2 that inductive bias is the set of assumptions that, together with the training data, deductively justify the classifications assigned by the learner to future instances.
Given a collection of training examples, there are typically many decision trees consistent with these examples. Describing the inductive bias of ID3 therefore consists of describing the basis by which it chooses one of these consistent hypotheses over the others. Which of these decision trees does ID3 choose? It chooses the first acceptable tree it encounters in its simple-to-complex, hill-climbing search through the space of possible trees. Roughly speaking, then, the ID3 search strategy (a) selects in favor of shorter trees over longer ones, and (b) selects trees that place the attributes with highest information gain closest to the root. Because of the subtle interaction between the attribute selection heuristic used by ID3 and the particular training examples it encounters, it is difficult to characterize precisely the inductive bias exhibited by ID3. However, we can approximately characterize its bias as a preference for short decision trees over complex trees.

Approximate inductive bias of ID3: Shorter trees are preferred over larger trees.

In fact, one could imagine an algorithm similar to ID3 that exhibits precisely this inductive bias. Consider an algorithm that begins with the empty tree and searches breadth first through progressively more complex trees, first considering all trees of depth 1, then all trees of depth 2, etc. Once it finds a decision tree consistent with the training data, it returns the smallest consistent tree at that search depth (e.g., the tree with the fewest nodes). Let us call this breadth-first search algorithm BFS-ID3. BFS-ID3 finds a shortest decision tree and thus exhibits precisely the bias "shorter trees are preferred over longer trees." ID3 can be viewed as an efficient approximation to BFS-ID3, using a greedy heuristic search to attempt to find the shortest tree without conducting the entire breadth-first search through the hypothesis space.
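The greedy search that ID3 performs can be sketched in a few lines. The following is a minimal illustration, not the algorithm of Table 3.1 verbatim: the 14 examples are an assumed transcription of Table 3.2 (which is not reproduced in this section), the names entropy, gain, and id3 are illustrative, and branches are created only for attribute values actually observed at a node.

```python
from collections import Counter
from math import log2

ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]
# Assumed transcription of the 14 PlayTennis examples (D1..D14) of Table 3.2.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
DATA = [dict(zip(ATTRS + ["PlayTennis"], row)) for row in ROWS]


def entropy(examples, target="PlayTennis"):
    """Entropy of the target attribute over a list of examples."""
    counts = Counter(e[target] for e in examples)
    n = len(examples)
    return -sum(c / n * log2(c / n) for c in counts.values())


def gain(examples, attr, target="PlayTennis"):
    """Information gain of partitioning the examples on attr."""
    n = len(examples)
    remainder = 0.0
    for value in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == value]
        remainder += len(subset) / n * entropy(subset, target)
    return entropy(examples, target) - remainder


def id3(examples, attrs, target="PlayTennis"):
    """Greedy, no-backtracking tree growth; returns a label or a nested dict."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:          # entropy is zero: make a leaf
        return labels[0]
    if not attrs:                      # every attribute already used on this path
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(examples, a, target))
    rest = [a for a in attrs if a != best]
    return {best: {value: id3([e for e in examples if e[best] == value], rest, target)
                   for value in sorted({e[best] for e in examples})}}


tree = id3(DATA, ATTRS)
print(tree)
```

Under the assumed data, the search commits to Outlook at the root and never revisits that choice, which is the hill-climbing behavior described above; a breadth-first BFS-ID3 would instead have to enumerate every tree at each depth.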
Because ID3 uses the information gain heuristic and a hill-climbing strategy, it exhibits a more complex bias than BFS-ID3. In particular, it does not always find the shortest consistent tree, and it is biased to favor trees that place attributes with high information gain closest to the root.

A closer approximation to the inductive bias of ID3: Shorter trees are preferred over longer trees. Trees that place high information gain attributes close to the root are preferred over those that do not.

3.6.1 Restriction Biases and Preference Biases

There is an interesting difference between the types of inductive bias exhibited by ID3 and by the CANDIDATE-ELIMINATION algorithm discussed in Chapter 2. Consider the difference between the hypothesis space search in these two approaches:

• ID3 searches a complete hypothesis space (i.e., one capable of expressing any finite discrete-valued function). It searches incompletely through this space, from simple to complex hypotheses, until its termination condition is met (e.g., until it finds a hypothesis consistent with the data). Its inductive bias is solely a consequence of the ordering of hypotheses by its search strategy. Its hypothesis space introduces no additional bias.

• The version space CANDIDATE-ELIMINATION algorithm searches an incomplete hypothesis space (i.e., one that can express only a subset of the potentially teachable concepts). It searches this space completely, finding every hypothesis consistent with the training data. Its inductive bias is solely a consequence of the expressive power of its hypothesis representation. Its search strategy introduces no additional bias.

In brief, the inductive bias of ID3 follows from its search strategy, whereas the inductive bias of the CANDIDATE-ELIMINATION algorithm follows from the definition of its search space.
The inductive bias of ID3 is thus a preference for certain hypotheses over others (e.g., for shorter hypotheses), with no hard restriction on the hypotheses that can be eventually enumerated. This form of bias is typically called a preference bias (or, alternatively, a search bias). In contrast, the bias of the CANDIDATE-ELIMINATION algorithm is in the form of a categorical restriction on the set of hypotheses considered. This form of bias is typically called a restriction bias (or, alternatively, a language bias).

Given that some form of inductive bias is required in order to generalize beyond the training data (see Chapter 2), which type of inductive bias shall we prefer: a preference bias or a restriction bias? Typically, a preference bias is more desirable than a restriction bias, because it allows the learner to work within a complete hypothesis space that is assured to contain the unknown target function. In contrast, a restriction bias that strictly limits the set of potential hypotheses is generally less desirable, because it introduces the possibility of excluding the unknown target function altogether.

Whereas ID3 exhibits a purely preference bias and CANDIDATE-ELIMINATION a purely restriction bias, some learning systems combine both. Consider, for example, the program described in Chapter 1 for learning a numerical evaluation function for game playing. In this case, the learned evaluation function is represented by a linear combination of a fixed set of board features, and the learning algorithm adjusts the parameters of this linear combination to best fit the available training data. Here, the decision to use a linear function to represent the evaluation function introduces a restriction bias (nonlinear evaluation functions cannot be represented in this form).
At the same time, the choice of a particular parameter tuning method (the LMS algorithm in this case) introduces a preference bias stemming from the ordered search through the space of all possible parameter values.

3.6.2 Why Prefer Short Hypotheses?

Is ID3's inductive bias favoring shorter decision trees a sound basis for generalizing beyond the training data? Philosophers and others have debated this question for centuries, and the debate remains unresolved to this day. William of Occam was one of the first to discuss† the question, around the year 1320, so this bias often goes by the name of Occam's razor.

Occam's razor: Prefer the simplest hypothesis that fits the data.

Of course giving an inductive bias a name does not justify it. Why should one prefer simpler hypotheses? Notice that scientists sometimes appear to follow this inductive bias. Physicists, for example, prefer simple explanations for the motions of the planets, over more complex explanations. Why? One argument is that because there are fewer short hypotheses than long ones (based on straightforward combinatorial arguments), it is less likely that one will find a short hypothesis that coincidentally fits the training data. In contrast there are often many very complex hypotheses that fit the current training data but fail to generalize correctly to subsequent data. Consider decision tree hypotheses, for example. There are many more 500-node decision trees than 5-node decision trees. Given a small set of 20 training examples, we might expect to be able to find many 500-node decision trees consistent with these, whereas we would be more surprised if a 5-node decision tree could perfectly fit this data. We might therefore believe the 5-node tree is less likely to be a statistical coincidence and prefer this hypothesis over the 500-node hypothesis.

Upon closer examination, it turns out there is a major difficulty with the above argument.
By the same reasoning we could have argued that one should prefer decision trees containing exactly 17 leaf nodes with 11 nonleaf nodes, that use the decision attribute A1 at the root, and test attributes A2 through A11, in numerical order. There are relatively few such trees, and we might argue (by the same reasoning as above) that our a priori chance of finding one consistent with an arbitrary set of data is therefore small. The difficulty here is that there are very many small sets of hypotheses that one can define, most of them rather arcane. Why should we believe that the small set of hypotheses consisting of decision trees with short descriptions should be any more relevant than the multitude of other small sets of hypotheses that we might define?

A second problem with the above argument for Occam's razor is that the size of a hypothesis is determined by the particular representation used internally by the learner. Two learners using different internal representations could therefore arrive at different hypotheses, both justifying their contradictory conclusions by Occam's razor! For example, the function represented by the learned decision tree in Figure 3.1 could be represented as a tree with just one decision node, by a learner that uses the boolean attribute XYZ, where we define the attribute XYZ to be true for instances that are classified positive by the decision tree in Figure 3.1 and false otherwise. Thus, two learners, both applying Occam's razor, would generalize in different ways if one used the XYZ attribute to describe its examples and the other used only the attributes Outlook, Temperature, Humidity, and Wind.

This last argument shows that Occam's razor will produce two different hypotheses from the same training examples when it is applied by two learners that perceive these examples in terms of different internal representations. On this basis we might be tempted to reject Occam's razor altogether.

† Apparently while shaving.
However, consider the following scenario that examines the question of which internal representations might arise from a process of evolution and natural selection. Imagine a population of artificial learning agents created by a simulated evolutionary process involving reproduction, mutation, and natural selection of these agents. Let us assume that this evolutionary process can alter the perceptual systems of these agents from generation to generation, thereby changing the internal attributes by which they perceive their world. For the sake of argument, let us also assume that the learning agents employ a fixed learning algorithm (say ID3) that cannot be altered by evolution. It is reasonable to assume that over time evolution will produce internal representations that make these agents increasingly successful within their environment. Assuming that the success of an agent depends highly on its ability to generalize accurately, we would therefore expect evolution to develop internal representations that work well with whatever learning algorithm and inductive bias is present. If the species of agents employs a learning algorithm whose inductive bias is Occam's razor, then we expect evolution to produce internal representations for which Occam's razor is a successful strategy. The essence of the argument here is that evolution will create internal representations that make the learning algorithm's inductive bias a self-fulfilling prophecy, simply because it can alter the representation more easily than it can alter the learning algorithm.

For now, we leave the debate regarding Occam's razor. We will revisit it in Chapter 6, where we discuss the Minimum Description Length principle, a version of Occam's razor that can be interpreted within a Bayesian framework.
3.7 ISSUES IN DECISION TREE LEARNING

Practical issues in learning decision trees include determining how deeply to grow the decision tree, handling continuous attributes, choosing an appropriate attribute selection measure, handling training data with missing attribute values, handling attributes with differing costs, and improving computational efficiency. Below we discuss each of these issues and extensions to the basic ID3 algorithm that address them. ID3 has itself been extended to address most of these issues, with the resulting system renamed C4.5 (Quinlan 1993).

3.7.1 Avoiding Overfitting the Data

The algorithm described in Table 3.1 grows each branch of the tree just deeply enough to perfectly classify the training examples. While this is sometimes a reasonable strategy, in fact it can lead to difficulties when there is noise in the data, or when the number of training examples is too small to produce a representative sample of the true target function. In either of these cases, this simple algorithm can produce trees that overfit the training examples.

We will say that a hypothesis overfits the training examples if some other hypothesis that fits the training examples less well actually performs better over the entire distribution of instances (i.e., including instances beyond the training set).

Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H, such that h has smaller error than h' over the training examples, but h' has a smaller error than h over the entire distribution of instances.

Figure 3.6 illustrates the impact of overfitting in a typical application of decision tree learning. In this case, the ID3 algorithm is applied to the task of learning which medical patients have a form of diabetes. The horizontal axis of this plot indicates the total number of nodes in the decision tree, as the tree is being constructed.
The vertical axis indicates the accuracy of predictions made by the tree. The solid line shows the accuracy of the decision tree over the training examples, whereas the broken line shows accuracy measured over an independent set of test examples (not included in the training set). Predictably, the accuracy of the tree over the training examples increases monotonically as the tree is grown. However, the accuracy measured over the independent test examples first increases, then decreases. As can be seen, once the tree size exceeds approximately 25 nodes, further elaboration of the tree decreases its accuracy over the test examples despite increasing its accuracy on the training examples.

[Figure 3.6: accuracy plotted against size of tree (number of nodes), with one curve on training data and one on test data.]

FIGURE 3.6 Overfitting in decision tree learning. As ID3 adds new nodes to grow the decision tree, the accuracy of the tree measured over the training examples increases monotonically. However, when measured over a set of test examples independent of the training examples, accuracy first increases, then decreases. Software and data for experimenting with variations on this plot are available on the World Wide Web at http://www.cs.cmu.edu/~tom/mlbook.html.

How can it be possible for tree h to fit the training examples better than h', but for it to perform more poorly over subsequent examples? One way this can occur is when the training examples contain random errors or noise. To illustrate, consider the effect of adding the following positive training example, incorrectly labeled as negative, to the (otherwise correct) examples in Table 3.2.

(Outlook = Sunny, Temperature = Hot, Humidity = Normal, Wind = Strong, PlayTennis = No)

Given the original error-free data, ID3 produces the decision tree shown in Figure 3.1. However, the addition of this incorrect example will now cause ID3 to construct a more complex tree.
In particular, the new example will be sorted into the second leaf node from the left in the learned tree of Figure 3.1, along with the previous positive examples D9 and D11. Because the new example is labeled as a negative example, ID3 will search for further refinements to the tree below this node. Of course as long as the new erroneous example differs in some arbitrary way from the other examples affiliated with this node, ID3 will succeed in finding a new decision attribute to separate out this new example from the two previous positive examples at this tree node. The result is that ID3 will output a decision tree (h) that is more complex than the original tree from Figure 3.1 (h'). Of course h will fit the collection of training examples perfectly, whereas the simpler h' will not. However, given that the new decision node is simply a consequence of fitting the noisy training example, we expect h' to outperform h over subsequent data drawn from the same instance distribution.

The above example illustrates how random noise in the training examples can lead to overfitting. In fact, overfitting is possible even when the training data are noise-free, especially when small numbers of examples are associated with leaf nodes. In this case, it is quite possible for coincidental regularities to occur, in which some attribute happens to partition the examples very well, despite being unrelated to the actual target function. Whenever such coincidental regularities exist, there is a risk of overfitting.

Overfitting is a significant practical difficulty for decision tree learning and many other learning methods. For example, in one experimental study of ID3 involving five different learning tasks with noisy, nondeterministic data (Mingers 1989b), overfitting was found to decrease the accuracy of learned decision trees by 10-25% on most problems.

There are several approaches to avoiding overfitting in decision tree learning.
These can be grouped into two classes:

• approaches that stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data,

• approaches that allow the tree to overfit the data, and then post-prune the tree.

Although the first of these approaches might seem more direct, the second approach of post-pruning overfit trees has been found to be more successful in practice. This is due to the difficulty in the first approach of estimating precisely when to stop growing the tree.

Regardless of whether the correct tree size is found by stopping early or by post-pruning, a key question is what criterion is to be used to determine the correct final tree size. Approaches include:

• Use a separate set of examples, distinct from the training examples, to evaluate the utility of post-pruning nodes from the tree.

• Use all the available data for training, but apply a statistical test to estimate whether expanding (or pruning) a particular node is likely to produce an improvement beyond the training set. For example, Quinlan (1986) uses a chi-square test to estimate whether further expanding a node is likely to improve performance over the entire instance distribution, or only on the current sample of training data.

• Use an explicit measure of the complexity for encoding the training examples and the decision tree, halting growth of the tree when this encoding size is minimized. This approach, based on a heuristic called the Minimum Description Length principle, is discussed further in Chapter 6, as well as in Quinlan and Rivest (1989) and Mehta et al. (1995).

The first of the above approaches is the most common and is often referred to as a training and validation set approach. We discuss the two main variants of this approach below.
In this approach, the available data are separated into two sets of examples: a training set, which is used to form the learned hypothesis, and a separate validation set, which is used to evaluate the accuracy of this hypothesis over subsequent data and, in particular, to evaluate the impact of pruning this hypothesis. The motivation is this: Even though the learner may be misled by random errors and coincidental regularities within the training set, the validation set is unlikely to exhibit the same random fluctuations. Therefore, the validation set can be expected to provide a safety check against overfitting the spurious characteristics of the training set. Of course, it is important that the validation set be large enough to itself provide a statistically significant sample of the instances. One common heuristic is to withhold one-third of the available examples for the validation set, using the other two-thirds for training.

3.7.1.1 REDUCED ERROR PRUNING

How exactly might we use a validation set to prevent overfitting? One approach, called reduced-error pruning (Quinlan 1987), is to consider each of the decision nodes in the tree to be candidates for pruning. Pruning a decision node consists of removing the subtree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node. Nodes are removed only if the resulting pruned tree performs no worse than the original over the validation set. This has the effect that any leaf node added due to coincidental regularities in the training set is likely to be pruned because these same coincidences are unlikely to occur in the validation set. Nodes are pruned iteratively, always choosing the node whose removal most increases the decision tree accuracy over the validation set. Pruning of nodes continues until further pruning is harmful (i.e., decreases accuracy of the tree over the validation set).
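The pruning loop just described can be sketched as follows. This is a minimal illustration under stated assumptions, not Quinlan's implementation: the node layout (a dict with "attr", "children", and a precomputed "majority" label standing for the most common training classification at that node), the example tree, and the six validation examples are all hypothetical. The tree is the one from Figure 3.1 as recalled, with one extra Wind test under Sunny/Normal of the kind a noisy example might produce.

```python
import copy


def classify(node, x):
    """Follow attribute tests down to a leaf label (leaves are plain strings)."""
    while isinstance(node, dict):
        node = node["children"].get(x[node["attr"]], node["majority"])
    return node


def accuracy(tree, examples):
    return sum(classify(tree, x) == y for x, y in examples) / len(examples)


def internal_paths(node, path=()):
    """Yield the branch-value path leading to every decision node."""
    if isinstance(node, dict):
        yield path
        for value, child in node["children"].items():
            yield from internal_paths(child, path + (value,))


def pruned_copy(tree, path):
    """Copy of tree in which the node at path becomes a majority-label leaf."""
    if not path:
        return tree["majority"]
    new = copy.deepcopy(tree)
    node = new
    for value in path[:-1]:
        node = node["children"][value]
    node["children"][path[-1]] = node["children"][path[-1]]["majority"]
    return new


def reduced_error_prune(tree, validation):
    """Repeatedly apply the single best prune while validation accuracy does not drop."""
    current = accuracy(tree, validation)
    while isinstance(tree, dict):
        candidates = [pruned_copy(tree, p) for p in internal_paths(tree)]
        best = max(candidates, key=lambda t: accuracy(t, validation))
        if accuracy(best, validation) < current:
            break                      # further pruning is harmful
        tree, current = best, accuracy(best, validation)
    return tree


def wind_node(majority):
    return {"attr": "Wind", "majority": majority,
            "children": {"Strong": "No", "Weak": "Yes"}}


# Hypothetical overfit tree: the Wind test under Sunny/Normal is spurious.
TREE = {"attr": "Outlook", "majority": "Yes", "children": {
    "Sunny": {"attr": "Humidity", "majority": "No",
              "children": {"High": "No", "Normal": wind_node("Yes")}},
    "Overcast": "Yes",
    "Rain": wind_node("Yes")}}

# Hypothetical validation examples: (attribute values, PlayTennis label).
VALIDATION = [
    ({"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Strong"}, "Yes"),
    ({"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}, "No"),
    ({"Outlook": "Overcast", "Humidity": "High", "Wind": "Weak"}, "Yes"),
    ({"Outlook": "Rain", "Humidity": "Normal", "Wind": "Strong"}, "No"),
    ({"Outlook": "Rain", "Humidity": "High", "Wind": "Weak"}, "Yes"),
    ({"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak"}, "Yes"),
]

pruned = reduced_error_prune(TREE, VALIDATION)
print(pruned)
```

On this made-up data only the spurious Wind test is removed: every other candidate prune would lower validation accuracy, so the loop stops there. Because ties count as "no worse", a tree whose nodes contribute nothing on the validation set would be pruned all the way back.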
The impact of reduced-error pruning on the accuracy of the decision tree is illustrated in Figure 3.7. As in Figure 3.6, the accuracy of the tree is shown measured over both training examples and test examples. The additional line in Figure 3.7 shows accuracy over the test examples as the tree is pruned. When pruning begins, the tree is at its maximum size and lowest accuracy over the test set. As pruning proceeds, the number of nodes is reduced and accuracy over the test set increases. Here, the available data has been split into three subsets: the training examples, the validation examples used for pruning the tree, and a set of test examples used to provide an unbiased estimate of accuracy over future unseen examples. The plot shows accuracy over the training and test sets. Accuracy over the validation set used for pruning is not shown.

[Figure 3.7: accuracy plotted against size of tree (number of nodes, 0 to 100), with curves on training data, on test data, and on test data during pruning.]

FIGURE 3.7 Effect of reduced-error pruning in decision tree learning. This plot shows the same curves of training and test set accuracy as in Figure 3.6. In addition, it shows the impact of reduced error pruning of the tree produced by ID3. Notice the increase in accuracy over the test set as nodes are pruned from the tree. Here, the validation set used for pruning is distinct from both the training and test sets.

Using a separate set of data to guide pruning is an effective approach provided a large amount of data is available. The major drawback of this approach is that when data is limited, withholding part of it for the validation set reduces even further the number of examples available for training. The following section presents an alternative approach to pruning that has been found useful in many practical situations where data is limited. Many additional techniques have been proposed as well, involving partitioning the available data several different times in multiple ways, then averaging the results. Empirical evaluations of alternative tree pruning methods are reported by Mingers (1989b) and by Malerba et al. (1995).

3.7.1.2 RULE POST-PRUNING

In practice, one quite successful method for finding high accuracy hypotheses is a technique we shall call rule post-pruning. A variant of this pruning method is used by C4.5 (Quinlan 1993), which is an outgrowth of the original ID3 algorithm. Rule post-pruning involves the following steps:

1. Infer the decision tree from the training set, growing the tree until the training data is fit as well as possible and allowing overfitting to occur.
2. Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node.
3. Prune (generalize) each rule by removing any preconditions that result in improving its estimated accuracy.
4. Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.

To illustrate, consider again the decision tree in Figure 3.1. In rule post-pruning, one rule is generated for each leaf node in the tree. Each attribute test along the path from the root to the leaf becomes a rule antecedent (precondition) and the classification at the leaf node becomes the rule consequent (postcondition). For example, the leftmost path of the tree in Figure 3.1 is translated into the rule

IF (Outlook = Sunny) ∧ (Humidity = High) THEN PlayTennis = No

Next, each such rule is pruned by removing any antecedent, or precondition, whose removal does not worsen its estimated accuracy. Given the above rule, for example, rule post-pruning would consider removing the preconditions (Outlook = Sunny) and (Humidity = High).
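Step 2 and the candidate prunings of step 3 can be sketched as below. This is only an illustration: the nested-dict encoding of the Figure 3.1 tree is an assumed transcription (the figure is not reproduced in this section), and tree_to_rules, format_rule, and drop_one are hypothetical helper names. Estimating the accuracy of each candidate is left out.

```python
# Assumed transcription of the tree in Figure 3.1.
TREE = {"Outlook": {
    "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}


def tree_to_rules(node, preconditions=()):
    """One (preconditions, label) rule per root-to-leaf path."""
    if not isinstance(node, dict):
        return [(preconditions, node)]
    (attr, branches), = node.items()
    rules = []
    for value, child in branches.items():
        rules += tree_to_rules(child, preconditions + ((attr, value),))
    return rules


def format_rule(preconditions, label, target="PlayTennis"):
    tests = " AND ".join(f"({a} = {v})" for a, v in preconditions)
    return f"IF {tests} THEN {target} = {label}"


def drop_one(preconditions):
    """All ways of removing a single precondition from a rule."""
    return [preconditions[:i] + preconditions[i + 1:]
            for i in range(len(preconditions))]


rules = tree_to_rules(TREE)
for pre, label in rules:
    print(format_rule(pre, label))

# Candidate prunings of the leftmost rule: each drops one precondition.
first_pre, first_label = rules[0]
for candidate in drop_one(first_pre):
    print("candidate:", format_rule(candidate, first_label))
```

Because each path becomes its own rule, the two candidates for the leftmost rule are independent of what happens to the Humidity test on any other path.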
It would select whichever of these pruning steps produced the greatest improvement in estimated rule accuracy, then consider pruning the second precondition as a further pruning step. No pruning step is performed if it reduces the estimated rule accuracy.

As noted above, one method to estimate rule accuracy is to use a validation set of examples disjoint from the training set. Another method, used by C4.5, is to evaluate performance based on the training set itself, using a pessimistic estimate to make up for the fact that the training data gives an estimate biased in favor of the rules. More precisely, C4.5 calculates its pessimistic estimate by calculating the rule accuracy over the training examples to which it applies, then calculating the standard deviation in this estimated accuracy assuming a binomial distribution. For a given confidence level, the lower-bound estimate is then taken as the measure of rule performance (e.g., for a 95% confidence interval, rule accuracy is pessimistically estimated by the observed accuracy over the training set, minus 1.96 times the estimated standard deviation). The net effect is that for large data sets, the pessimistic estimate is very close to the observed accuracy (e.g., the standard deviation is very small), whereas it grows further from the observed accuracy as the size of the data set decreases. Although this heuristic method is not statistically valid, it has nevertheless been found useful in practice. See Chapter 5 for a discussion of statistically valid approaches to estimating means and confidence intervals.

Why convert the decision tree to rules before pruning? There are three main advantages.

• Converting to rules allows distinguishing among the different contexts in which a decision node is used. Because each distinct path through the decision tree node produces a distinct rule, the pruning decision regarding that attribute test can be made differently for each path.
In contrast, if the tree itself were pruned, the only two choices would be to remove the decision node completely, or to retain it in its original form.

• Converting to rules removes the distinction between attribute tests that occur near the root of the tree and those that occur near the leaves. Thus, we avoid messy bookkeeping issues such as how to reorganize the tree if the root node is pruned while retaining part of the subtree below this test.

• Converting to rules improves readability. Rules are often easier for people to understand.

3.7.2 Incorporating Continuous-Valued Attributes

Our initial definition of ID3 is restricted to attributes that take on a discrete set of values. First, the target attribute whose value is predicted by the learned tree must be discrete valued. Second, the attributes tested in the decision nodes of the tree must also be discrete valued. This second restriction can easily be removed so that continuous-valued decision attributes can be incorporated into the learned tree. This can be accomplished by dynamically defining new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals. In particular, for an attribute A that is continuous-valued, the algorithm can dynamically create a new boolean attribute A_c that is true if A < c and false otherwise. The only question is how to select the best value for the threshold c.

As an example, suppose we wish to include the continuous-valued attribute Temperature in describing the training example days in the learning task of Table 3.2. Suppose further that the training examples associated with a particular node in the decision tree have the following values for Temperature and the target attribute PlayTennis.

Temperature:  40   48   60   72   80   90
PlayTennis:   No   No   Yes  Yes  Yes  No

What threshold-based boolean attribute should be defined based on Temperature?
Clearly, we would like to pick a threshold, c, that produces the greatest information gain. By sorting the examples according to the continuous attribute A, then identifying adjacent examples that differ in their target classification, we can generate a set of candidate thresholds midway between the corresponding values of A. It can be shown that the value of c that maximizes information gain must always lie at such a boundary (Fayyad 1991). These candidate thresholds can then be evaluated by computing the information gain associated with each. In the current example, there are two candidate thresholds, corresponding to the values of Temperature at which the value of PlayTennis changes: (48 + 60)/2 and (80 + 90)/2. The information gain can then be computed for each of the candidate attributes, $\mathit{Temperature}_{>54}$ and $\mathit{Temperature}_{>85}$, and the best can be selected ($\mathit{Temperature}_{>54}$). This dynamically created boolean attribute can then compete with the other discrete-valued candidate attributes available for growing the decision tree. Fayyad and Irani (1993) discuss an extension to this approach that splits the continuous attribute into multiple intervals rather than just two intervals based on a single threshold. Utgoff and Brodley (1991) and Murthy et al. (1994) discuss approaches that define features by thresholding linear combinations of several continuous-valued attributes.

3.7.3 Alternative Measures for Selecting Attributes

There is a natural bias in the information gain measure that favors attributes with many values over those with few values. As an extreme example, consider the attribute Date, which has a very large number of possible values (e.g., March 4, 1979). If we were to add this attribute to the data in Table 3.2, it would have the highest information gain of any of the attributes. This is because Date alone perfectly predicts the target attribute over the training data.
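A small numerical check makes the point about Date concrete. The sketch below is illustrative only: it assumes the 9 positive and 5 negative examples of Table 3.2, stands in for Date with 14 made-up unique identifiers, and uses hypothetical helper names entropy and info_gain.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def info_gain(values, labels):
    """Information gain of splitting labels by the parallel list of attribute values."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


labels = ["Yes"] * 9 + ["No"] * 5             # class counts assumed from Table 3.2
dates = [f"day-{i}" for i in range(14)]       # hypothetical unique Date values

print(round(entropy(labels), 3))
print(round(info_gain(dates, labels), 3))
```

Every subset produced by the stand-in Date attribute holds a single example and so has zero entropy. The gain therefore equals the full entropy of the collection, which is the largest value any attribute could achieve on these examples.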
Thus, it would be selected as the decision attribute for the root node of the tree and lead to a (quite broad) tree of depth one, which perfectly classifies the training data. Of course, this decision tree would fare poorly on subsequent examples, because it is not a useful predictor despite the fact that it perfectly separates the training data.

What is wrong with the attribute Date? Simply put, it has so many possible values that it is bound to separate the training examples into very small subsets. Because of this, it will have a very high information gain relative to the training examples, despite being a very poor predictor of the target function over unseen instances.

One way to avoid this difficulty is to select decision attributes based on some measure other than information gain. One alternative measure that has been used successfully is the gain ratio (Quinlan 1986). The gain ratio measure penalizes attributes such as Date by incorporating a term, called split information, that is sensitive to how broadly and uniformly the attribute splits the data:

$$SplitInformation(S, A) \equiv -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$

where $S_1$ through $S_c$ are the c subsets of examples resulting from partitioning S by the c-valued attribute A. Note that SplitInformation is actually the entropy of S with respect to the values of attribute A. This is in contrast to our previous uses of entropy, in which we considered only the entropy of S with respect to the target attribute whose value is to be predicted by the learned tree.

The GainRatio measure is defined in terms of the earlier Gain measure, as well as this SplitInformation, as follows

$$GainRatio(S, A) \equiv \frac{Gain(S, A)}{SplitInformation(S, A)}$$

Notice that the SplitInformation term discourages the selection of attributes with many uniformly distributed values. For example, consider a collection of n examples that are completely separated by attribute A (e.g., Date). In this case, the SplitInformation value will be $\log_2 n$.
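To make this penalty concrete, the sketch below computes SplitInformation and GainRatio for a small collection of examples. It assumes, as noted above, that SplitInformation is the entropy of the examples with respect to the attribute's own values. The eight-example data set, the attribute names date and b, and all function names are invented here purely for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Entropy, in bits, of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())


def split_information(attr_values):
    """Entropy of the examples with respect to the attribute's values."""
    return entropy(attr_values)


def gain(attr_values, labels):
    """Information gain of partitioning the labels by the attribute."""
    total = len(labels)
    remainder = 0.0
    for v in set(attr_values):
        subset = [l for a, l in zip(attr_values, labels) if a == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def gain_ratio(attr_values, labels):
    """Gain divided by SplitInformation; NaN when the denominator is zero."""
    si = split_information(attr_values)
    if si == 0:
        return float("nan")  # attribute takes a single value on these examples
    return gain(attr_values, labels) / si


# Hypothetical data: 8 examples, 4 positive and 4 negative.
labels = ["Yes"] * 4 + ["No"] * 4
date = list(range(8))            # a distinct value for every example
b = [True] * 4 + [False] * 4     # boolean attribute splitting the data in half

for name, attr in (("date", date), ("b", b)):
    print(name,
          "gain =", round(gain(attr, labels), 3),
          "split info =", round(split_information(attr), 3),
          "gain ratio =", round(gain_ratio(attr, labels), 3))
```

In this constructed case both attributes have a gain of 1 bit, but the eight-valued attribute has a SplitInformation of $\log_2 8 = 3$ and hence a gain ratio of 1/3, whereas the boolean attribute has a SplitInformation of 1 and a gain ratio of 1.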
In contrast, a boolean attribute B that splits the same n examples exactly in half will have SplitInformation of 1. If attributes A and B produce the same information gain, then clearly B will score higher according to the GainRatio measure.

One practical issue that arises in using GainRatio in place of Gain to select attributes is that the denominator can be zero or very small when $|S_i| \approx |S|$ for one of the $S_i$. This either makes the GainRatio undefined or very large for attributes that happen to have the same value for nearly all members of S. To avoid selecting attributes purely on this basis, we can adopt some heuristic such as first calculating the Gain of each attribute, then applying the GainRatio test only considering those attributes with above average Gain (Quinlan 1986).

An alternative to the GainRatio, designed to directly address the above difficulty, is a distance-based measure introduced by Lopez de Mantaras (1991). This measure is based on defining a distance metric between partitions of the data. Each attribute is evaluated based on the distance between the data partition it creates and the perfect partition (i.e., the partition that perfectly classifies the training data). The attribute whose partition is closest to the perfect partition is chosen. Lopez de Mantaras (1991) defines this distance measure, proves that it is not biased toward attributes with large numbers of values, and reports experimental studies indicating that the predictive accuracy of the induced trees is not significantly different from that obtained with the Gain and GainRatio measures. However, this distance measure avoids the practical difficulties associated with the GainRatio measure, and in his experiments it produces significantly smaller trees in the case of data sets whose attributes have very different numbers of values.

A variety of other selection measures have been proposed as well (e.g., see Breiman et al.
1984; Mingers 1989a; Kearns and Mansour 1996; Dietterich et al. 1996). Mingers (1989a) provides an experimental analysis of the relative effectiveness of several selection measures over a variety of problems. He reports significant differences in the sizes of the unpruned trees produced by the different selection measures. However, in his experimental domains the choice of attribute selection measure appears to have a smaller impact on final accuracy than does the extent and method of post-pruning.

3.7.4 Handling Training Examples with Missing Attribute Values

In certain cases, the available data may be missing values for some attributes. For example, in a medical domain in which we wish to predict patient outcome based on various laboratory tests, it may be that the lab test Blood-Test-Result is available only for a subset of the patients. In such cases, it is common to estimate the missing attribute value based on other examples for which this attribute has a known value.

Consider the situation in which Gain(S, A) is to be calculated at node n in the decision tree to evaluate whether the attribute A is the best attribute to test at this decision node. Suppose that (x, c(x)) is one of the training examples in S and that the value A(x) is unknown.

One strategy for dealing with the missing attribute value is to assign it the value that is most common among training examples at node n. Alternatively, we might assign it the most common value among examples at node n that have the classification c(x). The elaborated training example using this estimated value for A(x) can then be used directly by the existing decision tree learning algorithm. This strategy is examined by Mingers (1989a).

A second, more complex procedure is to assign a probability to each of the possible values of A rather than simply assigning the most common value to A(x).
These probabilities can be estimated again based on the observed frequencies of the various values for A among the examples at node n. For example, given a boolean attribute A, if node n contains six known examples with A = 1 and four with A = 0, then we would say the probability that A(x) = 1 is 0.6, and the probability that A(x) = 0 is 0.4. A fractional 0.6 of instance x is now distributed down the branch for A = 1, and a fractional 0.4 of x down the other tree branch. These fractional examples are used for the purpose of computing information Gain and can be further subdivided at subsequent branches of the tree if a second missing attribute value must be tested. This same fractioning of examples can also be applied after learning, to classify new instances whose attribute values are unknown. In this case, the classification of the new instance is simply the most probable classification, computed by summing the weights of the instance fragments classified in different ways at the leaf nodes of the tree. This method for handling missing attribute values is used in C4.5 (Quinlan 1993).

3.7.5 Handling Attributes with Differing Costs

In some learning tasks the instance attributes may have associated costs. For example, in learning to classify medical diseases we might describe patients in terms of attributes such as Temperature, BiopsyResult, Pulse, BloodTestResults, etc. These attributes vary significantly in their costs, both in terms of monetary cost and cost to patient comfort. In such tasks, we would prefer decision trees that use low-cost attributes where possible, relying on high-cost attributes only when needed to produce reliable classifications.

ID3 can be modified to take into account attribute costs by introducing a cost term into the attribute selection measure. For example, we might divide the Gain by the cost of the attribute, so that lower-cost attributes would be preferred.
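The effect of dividing Gain by cost can be seen in a small sketch. The gains and costs below are hypothetical numbers chosen only to illustrate the idea (they are not measurements from any data set), and the helper name gain_per_cost is likewise introduced here for illustration.

```python
def gain_per_cost(gain, cost):
    """Cost-sensitive score: information gain divided by attribute cost."""
    return gain / cost


# Hypothetical gains (in bits) and costs (arbitrary units) for four
# of the medical attributes mentioned in the text.
candidates = {
    "Temperature":      {"gain": 0.10, "cost": 1.0},
    "Pulse":            {"gain": 0.15, "cost": 1.0},
    "BloodTestResults": {"gain": 0.40, "cost": 10.0},
    "BiopsyResult":     {"gain": 0.60, "cost": 50.0},
}

# Attribute chosen by plain information gain.
by_gain = max(candidates, key=lambda a: candidates[a]["gain"])

# Attribute chosen when gain is divided by cost.
by_gain_per_cost = max(candidates,
                       key=lambda a: gain_per_cost(**candidates[a]))

print("highest gain:         ", by_gain)
print("highest gain per cost:", by_gain_per_cost)
```

With these made-up numbers, plain Gain selects the expensive BiopsyResult, while the cost-sensitive score selects the inexpensive Pulse. The high-cost attribute remains available further down the tree if the cheaper tests do not separate the examples.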
While such cost-sensitive measures do not guarantee finding an optimal cost-sensitive decision tree, they do bias the search in favor of low-cost attributes.

Tan and Schlimmer (1990) and Tan (1993) describe one such approach and apply it to a robot perception task in which the robot must learn to classify different objects according to how they can be grasped by the robot's manipulator. In this case the attributes correspond to different sensor readings obtained by a movable sonar on the robot. Attribute cost is measured by the number of seconds required to obtain the attribute value by positioning and operating the sonar. They demonstrate that more efficient recognition strategies are learned, without sacrificing classification accuracy, by replacing the information gain attribute selection measure by the following measure

$$\frac{Gain^2(S, A)}{Cost(A)}$$

Nunez (1988) describes a related approach and its application to learning medical diagnosis rules. Here the attributes are different symptoms and laboratory tests with differing costs. His system uses a somewhat different attribute selection measure

$$\frac{2^{Gain(S, A)} - 1}{(Cost(A) + 1)^w}$$

where $w \in [0, 1]$ is a constant that determines the relative importance of cost versus information gain. Nunez (1991) presents an empirical comparison of these two approaches over a range of tasks.

3.8 SUMMARY AND FURTHER READING

The main points of this chapter include:

• Decision tree learning provides a practical method for concept learning and for learning other discrete-valued functions. The ID3 family of algorithms infers decision trees by growing them from the root downward, greedily selecting the next best attribute for each new decision branch added to the tree.
• ID3 searches a complete hypothesis space (i.e., the space of decision trees can represent any discrete-valued function defined over discrete-valued instances). It thereby avoids the major difficulty associated with approaches that consider only restricted sets of hypotheses: that the target function might not be present in the hypothesis space.
• The inductive bias implicit in ID3 includes a preference for smaller trees; that is, its search through the hypothesis space grows the tree only as large as needed in order to classify the available training examples.
• Overfitting the training data is an important issue in decision tree learning. Because the training examples are only a sample of all possible instances, it is possible to add branches to the tree that improve performance on the training examples while decreasing performance on other instances outside this set. Methods for post-pruning the decision tree are therefore important to avoid overfitting in decision tree learning (and other inductive inference methods that employ a preference bias).
• A large variety of extensions to the basic ID3 algorithm has been developed by different researchers. These include methods for post-pruning trees, handling real-valued attributes, accommodating training examples with missing attribute values, incrementally refining decision trees as new training examples become available, using attribute selection measures other than information gain, and considering costs associated with instance attributes.

Among the earliest work on decision tree learning is Hunt's Concept Learning System (CLS) (Hunt et al. 1966) and Friedman and Breiman's work resulting in the CART system (Friedman 1977; Breiman et al. 1984). Quinlan's ID3 system (Quinlan 1979, 1983) forms the basis for the discussion in this chapter. Other early work on decision tree learning includes ASSISTANT (Kononenko et al. 1984; Cestnik et al. 1987). Implementations of decision tree induction algorithms are now commercially available on many computer platforms.
For further details on decision tree induction, an excellent book by Quinlan (1993) discusses many practical issues and provides executable code for C4.5. Mingers (1989a) and Buntine and Niblett (1992) provide two experimental studies comparing different attribute-selection measures. Mingers (1989b) and Malerba et al. (1995) provide studies of different pruning strategies. Experiments comparing decision tree learning and other learning methods can be found in numerous papers, including (Dietterich et al. 1995; Fisher and McKusick 1989; Quinlan 1988a; Shavlik et al. 1991; Thrun et al. 1991; Weiss and Kapouleas 1989).

EXERCISES

3.1. Give decision trees to represent the following boolean functions:
(a) $A \wedge \neg B$
(b) $A \vee [B \wedge C]$
(c) A XOR B
(d) $[A \wedge B] \vee [C \wedge D]$

3.2. Consider the following set of training examples:

Instance   Classification   a1   a2

(a) What is the entropy of this collection of training examples with respect to the target function classification?
(b) What is the information gain of a2 relative to these training examples?

3.3. True or false: If decision tree D2 is an elaboration of tree D1, then D1 is more-general-than D2. Assume D1 and D2 are decision trees representing arbitrary boolean functions, and that D2 is an elaboration of D1 if ID3 could extend D1 into D2. If true, give a proof; if false, a counterexample. (More-general-than is defined in Chapter 2.)

3.4. ID3 searches for just one consistent hypothesis, whereas the CANDIDATE-ELIMINATION algorithm finds all consistent hypotheses. Consider the correspondence between these two learning algorithms.
(a) Show the decision tree that would be learned by ID3 assuming it is given the four training examples for the Enjoy Sport? target concept shown in Table 2.1 of Chapter 2.
(b) What is the relationship between the learned decision tree and the version space (shown in Figure 2.3 of Chapter 2) that is learned from these same examples? Is the learned tree equivalent to one of the members of the version space?
(c) Add the following training example, and compute the new decision tree. This time, show the value of the information gain for each candidate attribute at each step in growing the tree.

Sky     Air-Temp   Humidity   Wind   Water   Forecast   Enjoy-Sport?
Sunny   Warm       Normal     Weak   Warm    Same       No

(d) Suppose we wish to design a learner that (like ID3) searches a space of decision tree hypotheses and (like CANDIDATE-ELIMINATION) finds all hypotheses consistent with the data. In short, we wish to apply the CANDIDATE-ELIMINATION algorithm to searching the space of decision tree hypotheses. Show the S and G sets that result from the first training example from Table 2.1. Note S must contain the most specific decision trees consistent with the data, whereas G must contain the most general. Show how the S and G sets are refined by the second training example (you may omit syntactically distinct trees that describe the same concept). What difficulties do you foresee in applying CANDIDATE-ELIMINATION to a decision tree hypothesis space?

REFERENCES

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth International Group.
Brodley, C. E., & Utgoff, P. E. (1995). Multivariate decision trees. Machine Learning, 19, 45-77.
Buntine, W., & Niblett, T. (1992). A further comparison of splitting rules for decision-tree induction. Machine Learning, 8, 75-86.
Cestnik, B., Kononenko, I., & Bratko, I. (1987). ASSISTANT-86: A knowledge-elicitation tool for sophisticated users. In I. Bratko & N. Lavrač (Eds.), Progress in machine learning. Bled, Yugoslavia: Sigma Press.
Dietterich, T. G., Hild, H., & Bakiri, G. (1995). A comparison of ID3 and BACKPROPAGATION for English text-to-speech mapping. Machine Learning, 18(1), 51-80.
Dietterich, T. G., Kearns, M., & Mansour, Y. (1996). Applying the weak learning framework to understand and improve C4.5. Proceedings of the 13th International Conference on Machine Learning (pp. 96-104).
San Francisco: Morgan Kaufmann.
Fayyad, U. M. (1991). On the induction of decision trees for multiple concept learning (Ph.D. dissertation). EECS Department, University of Michigan.
Fayyad, U. M., & Irani, K. B. (1992). On the handling of continuous-valued attributes in decision tree generation. Machine Learning, 8, 87-102.
Fayyad, U. M., & Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In R. Bajcsy (Ed.), Proceedings of the 13th International Joint Conference on Artificial Intelligence (pp. 1022-1027). Morgan Kaufmann.
Fayyad, U. M., Weir, N., & Djorgovski, S. (1993). SKICAT: A machine learning system for automated cataloging of large scale sky surveys. Proceedings of the Tenth International Conference on Machine Learning (pp. 112-119). Amherst, MA: Morgan Kaufmann.
Fisher, D. H., & McKusick, K. B. (1989). An empirical comparison of ID3 and back-propagation. Proceedings of the Eleventh International Joint Conference on AI (pp. 788-793). Morgan Kaufmann.
Friedman, J. H. (1977). A recursive partitioning decision rule for non-parametric classification. IEEE Transactions on Computers (pp. 404-408).
Hunt, E. B. (1975). Artificial Intelligence. New York: Academic Press.
Hunt, E. B., Marin, J., & Stone, P. J. (1966). Experiments in Induction. New York: Academic Press.
Kearns, M., & Mansour, Y. (1996). On the boosting ability of top-down decision tree learning algorithms. Proceedings of the 28th ACM Symposium on the Theory of Computing. New York: ACM Press.
Kononenko, I., Bratko, I., & Roskar, E. (1984). Experiments in automatic learning of medical diagnostic rules (Technical report). Jozef Stefan Institute, Ljubljana, Yugoslavia.
Lopez de Mantaras, R. (1991). A distance-based attribute selection measure for decision tree induction. Machine Learning, 6(1), 81-92.
Malerba, D., Floriana, E., & Semeraro, G. (1995). A further comparison of simplification methods for decision tree
induction. In D. Fisher & H. Lenz (Eds.), Learning from data: AI and statistics. Springer-Verlag.
Mehta, M., Rissanen, J., & Agrawal, R. (1995). MDL-based decision tree pruning. Proceedings of the First International Conference on Knowledge Discovery and Data Mining (pp. 216-221). Menlo Park, CA: AAAI Press.
Mingers, J. (1989a). An empirical comparison of selection measures for decision-tree induction. Machine Learning, 3(4), 319-342.
Mingers, J. (1989b). An empirical comparison of pruning methods for decision-tree induction. Machine Learning, 4(2), 227-243.
Murphy, P. M., & Pazzani, M. J. (1994). Exploring the decision forest: An empirical investigation of Occam's razor in decision tree induction. Journal of Artificial Intelligence Research, 1, 257-275.
Murthy, S. K., Kasif, S., & Salzberg, S. (1994). A system for induction of oblique decision trees. Journal of Artificial Intelligence Research, 2, 1-33.
Nunez, M. (1991). The use of background knowledge in decision tree induction. Machine Learning, 6(3), 231-250.
Pagallo, G., & Haussler, D. (1990). Boolean feature discovery in empirical learning. Machine Learning, 5, 71-100.
Quinlan, J. R. (1979). Discovering rules by induction from large collections of examples. In D. Michie (Ed.), Expert systems in the micro electronic age. Edinburgh Univ. Press.
Quinlan, J. R. (1983). Learning efficient classification procedures and their application to chess end games. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach. San Mateo, CA: Morgan Kaufmann.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106.
Quinlan, J. R. (1987). Rule induction with statistical data - a comparison with multiple regression. Journal of the Operational Research Society, 38, 347-352.
Quinlan, J. R. (1988a). An empirical comparison of genetic and decision-tree classifiers. Proceedings of the Fifth International Machine Learning Conference (pp. 135-141).
San Mateo, CA: Morgan Kaufmann.
Quinlan, J. R. (1988b). Decision trees and multi-valued attributes. In Hayes, Michie, & Richards (Eds.), Machine Intelligence 11 (pp. 305-318). Oxford, England: Oxford University Press.
Quinlan, J. R., & Rivest, R. (1989). Information and Computation, 80, 227-248.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.
Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. Annals of Statistics, 11(2), 416-431.
Rivest, R. L. (1987). Learning decision lists. Machine Learning, 2(3), 229-246.
Schaffer, C. (1993). Overfitting avoidance as bias. Machine Learning, 10, 113-152.
Shavlik, J. W., Mooney, R. J., & Towell, G. G. (1991). Symbolic and neural learning algorithms: an experimental comparison. Machine Learning, 6(2), 111-144.
Tan, M. (1993). Cost-sensitive learning of classification knowledge and its applications in robotics. Machine Learning, 13(1), 1-33.
Tan, M., & Schlimmer, J. C. (1990). Two case studies in cost-sensitive concept acquisition. Proceedings of the AAAI-90.
Thrun, S. B. et al. (1991). The Monk's problems: A performance comparison of different learning algorithms (Technical report CMU-CS-91-197). Computer Science Department, Carnegie Mellon Univ., Pittsburgh, PA.
Turney, P. D. (1995). Cost-sensitive classification: empirical evaluation of a hybrid genetic decision tree induction algorithm. Journal of AI Research, 2, 369-409.
Utgoff, P. E. (1989). Incremental induction of decision trees. Machine Learning, 4(2), 161-186.
Utgoff, P. E., & Brodley, C. E. (1991). Linear machine decision trees (COINS Technical Report 91-10). University of Massachusetts, Amherst, MA.
Weiss, S., & Kapouleas, I. (1989). An empirical comparison of pattern recognition, neural nets, and machine learning classification methods. Proceedings of the Eleventh IJCAI (pp. 781-787). Morgan Kaufmann.
CHAPTER 4
ARTIFICIAL NEURAL NETWORKS

Artificial neural networks (ANNs) provide a general, practical method for learning real-valued, discrete-valued, and vector-valued functions from examples. Algorithms such as BACKPROPAGATION use gradient descent to tune network parameters to best fit a training set of input-output pairs. ANN learning is robust to errors in the training data and has been successfully applied to problems such as interpreting visual scenes, speech recognition, and learning robot control strategies.

4.1 INTRODUCTION

Neural network learning methods provide a robust approach to approximating real-valued, discrete-valued, and vector-valued target functions. For certain types of problems, such as learning to interpret complex real-world sensor data, artificial neural networks are among the most effective learning methods currently known. For example, the BACKPROPAGATION algorithm described in this chapter has proven surprisingly successful in many practical problems such as learning to recognize handwritten characters (LeCun et al. 1989), learning to recognize spoken words (Lang et al. 1990), and learning to recognize faces (Cottrell 1990). One survey of practical applications is provided by Rumelhart et al. (1994).

4.1.1 Biological Motivation

The study of artificial neural networks (ANNs) has been inspired in part by the observation that biological learning systems are built of very complex webs of interconnected neurons. In rough analogy, artificial neural networks are built out of a densely interconnected set of simple units, where each unit takes a number of real-valued inputs (possibly the outputs of other units) and produces a single real-valued output (which may become the input to many other units).

To develop a feel for this analogy, let us consider a few facts from neurobiology. The human brain, for example, is estimated to contain a densely interconnected network of approximately $10^{11}$ neurons, each connected, on average, to $10^4$ others.
Neuron activity is typically excited or inhibited through connections to other neurons. The fastest neuron switching times are known to be on the order of $10^{-3}$ seconds, quite slow compared to computer switching speeds of $10^{-10}$ seconds. Yet humans are able to make surprisingly complex decisions, surprisingly quickly. For example, it requires approximately $10^{-1}$ seconds to visually recognize your mother. Notice the sequence of neuron firings that can take place during this $10^{-1}$-second interval cannot possibly be longer than a few hundred steps, given the switching speed of single neurons. This observation has led many to speculate that the information-processing abilities of biological neural systems must follow from highly parallel processes operating on representations that are distributed over many neurons. One motivation for ANN systems is to capture this kind of highly parallel computation based on distributed representations. Most ANN software runs on sequential machines emulating distributed processes, although faster versions of the algorithms have also been implemented on highly parallel machines and on specialized hardware designed specifically for ANN applications.

While ANNs are loosely motivated by biological neural systems, there are many complexities to biological neural systems that are not modeled by ANNs, and many features of the ANNs we discuss here are known to be inconsistent with biological systems. For example, we consider here ANNs whose individual units output a single constant value, whereas biological neurons output a complex time series of spikes.

Historically, two groups of researchers have worked with artificial neural networks. One group has been motivated by the goal of using ANNs to study and model biological learning processes. A second group has been motivated by the goal of obtaining highly effective machine learning algorithms, independent of whether these algorithms mirror biological processes.
Within this book our interest fits the latter group, and therefore we will not dwell further on biological modeling. For more information on attempts to model biological systems using ANNs, see, for example, Churchland and Sejnowski (1992); Zornetzer et al. (1994); Gabriel and Moore (1990).

4.2 NEURAL NETWORK REPRESENTATIONS

A prototypical example of ANN learning is provided by Pomerleau's (1993) system ALVINN, which uses a learned ANN to steer an autonomous vehicle driving at normal speeds on public highways. The input to the neural network is a 30 x 32 grid of pixel intensities obtained from a forward-pointed camera mounted on the vehicle. The network output is the direction in which the vehicle is steered. The ANN is trained to mimic the observed steering commands of a human driving the vehicle for approximately 5 minutes. ALVINN has used its learned networks to successfully drive at speeds up to 70 miles per hour and for distances of 90 miles on public highways (driving in the left lane of a divided public highway, with other vehicles present).

Figure 4.1 illustrates the neural network representation used in one version of the ALVINN system, and illustrates the kind of representation typical of many ANN systems. The network is shown on the left side of the figure, with the input camera image depicted below it. Each node (i.e., circle) in the network diagram corresponds to the output of a single network unit, and the lines entering the node from below are its inputs. As can be seen, there are four units that receive inputs directly from all of the 30 x 32 pixels in the image. These are called "hidden" units because their output is available only within the network and is not available as part of the global network output. Each of these four hidden units computes a single real-valued output based on a weighted combination of its 960 inputs. These hidden unit outputs are then used as inputs to a second layer of 30 "output" units.
Each output unit corresponds to a particular steering direction, and the output values of these units determine which steering direction is recommended most strongly.

The diagrams on the right side of the figure depict the learned weight values associated with one of the four hidden units in this ANN. The large matrix of black and white boxes on the lower right depicts the weights from the 30 x 32 pixel inputs into the hidden unit. Here, a white box indicates a positive weight, a black box a negative weight, and the size of the box indicates the weight magnitude. The smaller rectangular diagram directly above the large matrix shows the weights from this hidden unit to each of the 30 output units.

The network structure of ALVINN is typical of many ANNs. Here the individual units are interconnected in layers that form a directed acyclic graph. In general, ANNs can be graphs with many types of structures: acyclic or cyclic, directed or undirected. This chapter will focus on the most common and practical ANN approaches, which are based on the BACKPROPAGATION algorithm. The BACKPROPAGATION algorithm assumes the network is a fixed structure that corresponds to a directed graph, possibly containing cycles. Learning corresponds to choosing a weight value for each edge in the graph. Although certain types of cycles are allowed, the vast majority of practical applications involve acyclic feed-forward networks, similar to the network structure used by ALVINN.

FIGURE 4.1 Neural network learning to steer an autonomous vehicle. The ALVINN system uses BACKPROPAGATION to learn to steer an autonomous vehicle (photo at top) driving at speeds up to 70 miles per hour. The diagram on the left shows how the image of a forward-mounted camera is mapped to 960 neural network inputs, which are fed forward to 4 hidden units, connected to 30 output units. Network outputs encode the commanded steering direction. The figure on the right shows weight values for one of the hidden units in this network. The 30 x 32 weights into the hidden unit are displayed in the large matrix, with white blocks indicating positive and black indicating negative weights. The weights from this hidden unit to the 30 output units are depicted by the smaller rectangular block directly above the large block. As can be seen from these output weights, activation of this particular hidden unit encourages a turn toward the left.

4.3 APPROPRIATE PROBLEMS FOR NEURAL NETWORK LEARNING

ANN learning is well-suited to problems in which the training data corresponds to noisy, complex sensor data, such as inputs from cameras and microphones. It is also applicable to problems for which more symbolic representations are often used, such as the decision tree learning tasks discussed in Chapter 3. In these cases ANN and decision tree learning often produce results of comparable accuracy. See Shavlik et al. (1991) and Weiss and Kapouleas (1989) for experimental comparisons of decision tree and ANN learning. The BACKPROPAGATION algorithm is the most commonly used ANN learning technique. It is appropriate for problems with the following characteristics:

• Instances are represented by many attribute-value pairs. The target function to be learned is defined over instances that can be described by a vector of predefined features, such as the pixel values in the ALVINN example. These input attributes may be highly correlated or independent of one another. Input values can be any real values.
• The target function output may be discrete-valued, real-valued, or a vector of several real- or discrete-valued attributes. For example, in the ALVINN system the output is a vector of 30 attributes, each corresponding to a recommendation regarding the steering direction.
The value of each output is some real number between 0 and 1, which in this case corresponds to the confidence in predicting the corresponding steering direction. We can also train a single network to output both the steering command and suggested acceleration, simply by concatenating the vectors that encode these two output predictions.
• The training examples may contain errors. ANN learning methods are quite robust to noise in the training data.
• Long training times are acceptable. Network training algorithms typically require longer training times than, say, decision tree learning algorithms. Training times can range from a few seconds to many hours, depending on factors such as the number of weights in the network, the number of training examples considered, and the settings of various learning algorithm parameters.
• Fast evaluation of the learned target function may be required. Although ANN learning times are relatively long, evaluating the learned network, in order to apply it to a subsequent instance, is typically very fast. For example, ALVINN applies its neural network several times per second to continually update its steering command as the vehicle drives forward.
• The ability of humans to understand the learned target function is not important. The weights learned by neural networks are often difficult for humans to interpret. Learned neural networks are less easily communicated to humans than learned rules.

The rest of this chapter is organized as follows: We first consider several alternative designs for the primitive units that make up artificial neural networks (perceptrons, linear units, and sigmoid units), along with learning algorithms for training single units. We then present the BACKPROPAGATION algorithm for training multilayer networks of such units and consider several general issues such as the representational capabilities of ANNs, nature of the hypothesis space search, overfitting problems, and alternatives to the BACKPROPAGATION algorithm.
A detailed example is also presented applying BACKPROPAGATION to face recognition, and directions are provided for the reader to obtain the data and code to experiment further with this application.

4.4 PERCEPTRONS

One type of ANN system is based on a unit called a perceptron, illustrated in Figure 4.2. A perceptron takes a vector of real-valued inputs, calculates a linear combination of these inputs, then outputs a 1 if the result is greater than some threshold and -1 otherwise. More precisely, given inputs $x_1$ through $x_n$, the output $o(x_1, \ldots, x_n)$ computed by the perceptron is

$$o(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}$$

where each $w_i$ is a real-valued constant, or weight, that determines the contribution of input $x_i$ to the perceptron output. Notice the quantity $(-w_0)$ is a threshold that the weighted combination of inputs $w_1 x_1 + \cdots + w_n x_n$ must surpass in order for the perceptron to output a 1.

To simplify notation, we imagine an additional constant input $x_0 = 1$, allowing us to write the above inequality as $\sum_{i=0}^{n} w_i x_i > 0$, or in vector form as $\vec{w} \cdot \vec{x} > 0$. For brevity, we will sometimes write the perceptron function as

$$o(\vec{x}) = \operatorname{sgn}(\vec{w} \cdot \vec{x})$$

where

$$\operatorname{sgn}(y) = \begin{cases} 1 & \text{if } y > 0 \\ -1 & \text{otherwise} \end{cases}$$

Learning a perceptron involves choosing values for the weights $w_0, \ldots, w_n$. Therefore, the space H of candidate hypotheses considered in perceptron learning is the set of all possible real-valued weight vectors.

4.4.1 Representational Power of Perceptrons

We can view the perceptron as representing a hyperplane decision surface in the n-dimensional space of instances (i.e., points). The perceptron outputs a 1 for instances lying on one side of the hyperplane and outputs a -1 for instances lying on the other side, as illustrated in Figure 4.3. The equation for this decision hyperplane is $\vec{w} \cdot \vec{x} = 0$. Of course, some sets of positive and negative examples cannot be separated by any hyperplane. Those that can be separated are called linearly separable sets of examples.

FIGURE 4.2 A perceptron.
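The perceptron function defined above can be sketched in a few lines of Python. This is only an illustration: the names `sgn` and `perceptron_output` and the example weights are assumptions made here, not part of the text. The weights are chosen so that the decision hyperplane is the line $x_1 + x_2 - 1 = 0$.

```python
def sgn(y):
    """Threshold function: 1 if y > 0, otherwise -1."""
    return 1 if y > 0 else -1


def perceptron_output(weights, inputs):
    """Compute o(x) = sgn(w . x) for weights [w0, w1, ..., wn] and inputs [x1, ..., xn].

    A constant input x0 = 1 is prepended so that w0 plays the role of
    the (negated) threshold.
    """
    x = [1.0] + list(inputs)
    net = sum(w * xi for w, xi in zip(weights, x))
    return sgn(net)


# Hypothetical weights: the decision surface is the line x1 + x2 - 1 = 0.
w = [-1.0, 1.0, 1.0]
above = perceptron_output(w, [2.0, 0.5])     # net = 1.5, one side of the line
below = perceptron_output(w, [0.2, 0.3])     # net = -0.5, the other side
on_plane = perceptron_output(w, [0.5, 0.5])  # net = 0, not greater than 0
print(above, below, on_plane)
```

Points exactly on the hyperplane are assigned -1 here, because the definition requires the weighted sum to be strictly greater than zero for an output of 1.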
A single perceptron can be used to represent many boolean functions. For example, if we assume boolean values of 1 (true) and -1 (false), then one way to use a two-input perceptron to implement the AND function is to set the weights $w_0 = -0.8$ and $w_1 = w_2 = 0.5$. This perceptron can be made to represent the OR function instead by altering the threshold to $w_0 = 0.3$. (With inputs of ±1 the sum $0.5x_1 + 0.5x_2$ takes the values 1, 0, or -1, so AND requires $-1 < w_0 \le 0$ and OR requires $0 < w_0 \le 1$.) In fact, AND and OR can be viewed as special cases of m-of-n functions: that is, functions where at least m of the n inputs to the perceptron must be true. The OR function corresponds to m = 1 and the AND function to m = n. Any m-of-n function is easily represented using a perceptron by setting all input weights to the same value (e.g., 0.5) and then setting the threshold $w_0$ accordingly.

Perceptrons can represent all of the primitive boolean functions AND, OR, NAND (¬AND), and NOR (¬OR). Unfortunately, however, some boolean functions cannot be represented by a single perceptron, such as the XOR function whose value is 1 if and only if $x_1 \ne x_2$. Note the set of linearly nonseparable training examples shown in Figure 4.3(b) corresponds to this XOR function.

FIGURE 4.3 The decision surface represented by a two-input perceptron. (a) A set of training examples and the decision surface of a perceptron that classifies them correctly. (b) A set of training examples that is not linearly separable (i.e., that cannot be correctly classified by any straight line). $x_1$ and $x_2$ are the perceptron inputs. Positive examples are indicated by "+", negative by "-".

The ability of perceptrons to represent AND, OR, NAND, and NOR is important because every boolean function can be represented by some network of interconnected units based on these primitives. In fact, every boolean function can be represented by some network of perceptrons only two levels deep, in which the inputs are fed to multiple units, and the outputs of these units are then input to a second, final stage.
One way is to represent the boolean function in disjunctive normal form (i.e., as the disjunction (OR) of a set of conjunctions (ANDs) of the inputs and their negations). Note that the input to an AND perceptron can be negated simply by changing the sign of the corresponding input weight. Because networks of threshold units can represent a rich variety of functions and because single units alone cannot, we will generally be interested in learning multilayer networks of threshold units.

4.4.2 The Perceptron Training Rule

Although we are interested in learning networks of many interconnected units, let us begin by understanding how to learn the weights for a single perceptron. Here the precise learning problem is to determine a weight vector that causes the perceptron to produce the correct ±1 output for each of the given training examples. Several algorithms are known to solve this learning problem. Here we consider two: the perceptron rule and the delta rule (a variant of the LMS rule used in Chapter 1 for learning evaluation functions). These two algorithms are guaranteed to converge to somewhat different acceptable hypotheses, under somewhat different conditions. They are important to ANNs because they provide the basis for learning networks of many units.

One way to learn an acceptable weight vector is to begin with random weights, then iteratively apply the perceptron to each training example, modifying the perceptron weights whenever it misclassifies an example. This process is repeated, iterating through the training examples as many times as needed until the perceptron classifies all training examples correctly. Weights are modified at each step according to the perceptron training rule, which revises the weight $w_i$ associated with input $x_i$ according to the rule

$$w_i \leftarrow w_i + \Delta w_i$$

where

$$\Delta w_i = \eta (t - o) x_i$$

Here $t$ is the target output for the current training example, $o$ is the output generated by the perceptron, and $\eta$ is a positive constant called the learning rate.
The role of the learning rate is to moderate the degree to which weights are changed at each step. It is usually set to some small value (e.g., 0.1) and is sometimes made to decay as the number of weight-tuning iterations increases.

Why should this update rule converge toward successful weight values? To get an intuitive feel, consider some specific cases. Suppose the training example is correctly classified already by the perceptron. In this case, $(t - o)$ is zero, making $\Delta w_i$ zero, so that no weights are updated. Suppose the perceptron outputs a -1, when the target output is +1. To make the perceptron output a +1 instead of -1 in this case, the weights must be altered to increase the value of $\vec{w} \cdot \vec{x}$. For example, if $x_i > 0$, then increasing $w_i$ will bring the perceptron closer to correctly classifying this example. Notice the training rule will increase $w_i$ in this case, because $(t - o)$, $\eta$, and $x_i$ are all positive. For example, if $x_i = 0.8$, $\eta = 0.1$, $t = 1$, and $o = -1$, then the weight update will be $\Delta w_i = \eta (t - o) x_i = 0.1(1 - (-1))0.8 = 0.16$. On the other hand, if $t = -1$ and $o = 1$, then weights associated with positive $x_i$ will be decreased rather than increased.

In fact, the above learning procedure can be proven to converge within a finite number of applications of the perceptron training rule to a weight vector that correctly classifies all training examples, provided the training examples are linearly separable and provided a sufficiently small $\eta$ is used (see Minsky and Papert 1969). If the data are not linearly separable, convergence is not assured.

4.4.3 Gradient Descent and the Delta Rule

Although the perceptron rule finds a successful weight vector when the training examples are linearly separable, it can fail to converge if the examples are not linearly separable. A second training rule, called the delta rule, is designed to overcome this difficulty.
If the training examples are not linearly separable, the delta rule converges toward a best-fit approximation to the target concept.

The key idea behind the delta rule is to use gradient descent to search the hypothesis space of possible weight vectors to find the weights that best fit the training examples. This rule is important because gradient descent provides the basis for the BACKPROPAGATION algorithm, which can learn networks with many interconnected units. It is also important because gradient descent can serve as the basis for learning algorithms that must search through hypothesis spaces containing many different types of continuously parameterized hypotheses.

The delta training rule is best understood by considering the task of training an unthresholded perceptron; that is, a linear unit for which the output $o$ is given by

$$o(\vec{x}) = \vec{w} \cdot \vec{x}$$

Thus, a linear unit corresponds to the first stage of a perceptron, without the threshold.

In order to derive a weight learning rule for linear units, let us begin by specifying a measure for the training error of a hypothesis (weight vector), relative to the training examples. Although there are many ways to define this error, one common measure that will turn out to be especially convenient is

$$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \qquad (4.2)$$

where $D$ is the set of training examples, $t_d$ is the target output for training example $d$, and $o_d$ is the output of the linear unit for training example $d$. By this definition, $E(\vec{w})$ is simply half the squared difference between the target output $t_d$ and the linear unit output $o_d$, summed over all training examples. Here we characterize $E$ as a function of $\vec{w}$, because the linear unit output $o$ depends on this weight vector. Of course $E$ also depends on the particular set of training examples, but we assume these are fixed during training, so we do not bother to write $E$ as an explicit function of these. Chapter 6 provides a Bayesian justification for choosing this particular definition of $E$.
In particular, there we show that under certain conditions the hypothesis that minimizes $E$ is also the most probable hypothesis in $H$ given the training data.

4.4.3.1 VISUALIZING THE HYPOTHESIS SPACE

To understand the gradient descent algorithm, it is helpful to visualize the entire hypothesis space of possible weight vectors and their associated $E$ values, as illustrated in Figure 4.4. Here the axes $w_0$ and $w_1$ represent possible values for the two weights of a simple linear unit. The $w_0, w_1$ plane therefore represents the entire hypothesis space. The vertical axis indicates the error $E$ relative to some fixed set of training examples. The error surface shown in the figure thus summarizes the desirability of every weight vector in the hypothesis space (we desire a hypothesis with minimum error). Given the way in which we chose to define $E$, for linear units this error surface must always be parabolic with a single global minimum. The specific parabola will depend, of course, on the particular set of training examples.

FIGURE 4.4 Error of different hypotheses. For a linear unit with two weights, the hypothesis space $H$ is the $w_0, w_1$ plane. The vertical axis indicates the error of the corresponding weight vector hypothesis, relative to a fixed set of training examples. The arrow shows the negated gradient at one particular point, indicating the direction in the $w_0, w_1$ plane producing steepest descent along the error surface.

Gradient descent search determines a weight vector that minimizes $E$ by starting with an arbitrary initial weight vector, then repeatedly modifying it in small steps. At each step, the weight vector is altered in the direction that produces the steepest descent along the error surface depicted in Figure 4.4. This process continues until the global minimum error is reached.

4.4.3.2 DERIVATION OF THE GRADIENT DESCENT RULE

How can we calculate the direction of steepest descent along the error surface?
This direction can be found by computing the derivative of $E$ with respect to each component of the vector $\vec{w}$. This vector derivative is called the gradient of $E$ with respect to $\vec{w}$, written $\nabla E(\vec{w})$.

$$\nabla E(\vec{w}) \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right] \qquad (4.3)$$

Notice $\nabla E(\vec{w})$ is itself a vector, whose components are the partial derivatives of $E$ with respect to each of the $w_i$. When interpreted as a vector in weight space, the gradient specifies the direction that produces the steepest increase in $E$. The negative of this vector therefore gives the direction of steepest decrease. For example, the arrow in Figure 4.4 shows the negated gradient $-\nabla E(\vec{w})$ for a particular point in the $w_0, w_1$ plane.

Since the gradient specifies the direction of steepest increase of $E$, the training rule for gradient descent is

$$\vec{w} \leftarrow \vec{w} + \Delta \vec{w}$$

where

$$\Delta \vec{w} = -\eta \nabla E(\vec{w}) \qquad (4.4)$$

Here $\eta$ is a positive constant called the learning rate, which determines the step size in the gradient descent search. The negative sign is present because we want to move the weight vector in the direction that decreases $E$. This training rule can also be written in its component form

$$w_i \leftarrow w_i + \Delta w_i$$

where

$$\Delta w_i = -\eta \frac{\partial E}{\partial w_i} \qquad (4.5)$$

which makes it clear that steepest descent is achieved by altering each component $w_i$ of $\vec{w}$ in proportion to $\frac{\partial E}{\partial w_i}$.

To construct a practical algorithm for iteratively updating weights according to Equation (4.5), we need an efficient way of calculating the gradient at each step. Fortunately, this is not difficult. The vector of derivatives $\frac{\partial E}{\partial w_i}$ that form the gradient can be obtained by differentiating $E$ from Equation (4.2), as

$$\begin{aligned} \frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \\ &= \frac{1}{2} \sum_{d \in D} 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \vec{w} \cdot \vec{x}_d) \\ &= \sum_{d \in D} (t_d - o_d)(-x_{id}) \end{aligned} \qquad (4.6)$$

where $x_{id}$ denotes the single input component $x_i$ for training example $d$. We now have an equation that gives $\frac{\partial E}{\partial w_i}$ in terms of the linear unit inputs $x_{id}$, outputs $o_d$, and target values $t_d$ associated with the training examples. Substituting Equation (4.6) into Equation (4.5) yields the weight update rule for gradient descent

$$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d) x_{id} \qquad (4.7)$$

To summarize, the gradient descent algorithm for training linear units is as follows: Pick an initial random weight vector.
Apply the linear unit to all training examples, then compute $\Delta w_i$ for each weight according to Equation (4.7). Update each weight $w_i$ by adding $\Delta w_i$, then repeat this process. This algorithm is given in Table 4.1. Because the error surface contains only a single global minimum, this algorithm will converge to a weight vector with minimum error, regardless of whether the training examples are linearly separable, provided a sufficiently small learning rate $\eta$ is used. If $\eta$ is too large, the gradient descent search runs the risk of overstepping the minimum in the error surface rather than settling into it. For this reason, one common modification to the algorithm is to gradually reduce the value of $\eta$ as the number of gradient descent steps grows.

4.4.3.3 STOCHASTIC APPROXIMATION TO GRADIENT DESCENT

Gradient descent is an important general paradigm for learning. It is a strategy for searching through a large or infinite hypothesis space that can be applied whenever (1) the hypothesis space contains continuously parameterized hypotheses (e.g., the weights in a linear unit), and (2) the error can be differentiated with respect to these hypothesis parameters. The key practical difficulties in applying gradient descent are (1) converging to a local minimum can sometimes be quite slow (i.e., it can require many thousands of gradient descent steps), and (2) if there are multiple local minima in the error surface, then there is no guarantee that the procedure will find the global minimum.

GRADIENT-DESCENT(training_examples, $\eta$)
Each training example is a pair of the form $\langle \vec{x}, t \rangle$, where $\vec{x}$ is the vector of input values, and $t$ is the target output value. $\eta$ is the learning rate (e.g., .05).
• Initialize each $w_i$ to some small random value
• Until the termination condition is met, Do
    • Initialize each $\Delta w_i$ to zero.
    • For each $\langle \vec{x}, t \rangle$ in training_examples, Do
        • Input the instance $\vec{x}$ to the unit and compute the output $o$
        • For each linear unit weight $w_i$, Do
            $\Delta w_i \leftarrow \Delta w_i + \eta (t - o) x_i$    (T4.1)
    • For each linear unit weight $w_i$, Do
        $w_i \leftarrow w_i + \Delta w_i$    (T4.2)

TABLE 4.1 GRADIENT DESCENT algorithm for training a linear unit. To implement the stochastic approximation to gradient descent, Equation (T4.2) is deleted, and Equation (T4.1) replaced by $w_i \leftarrow w_i + \eta (t - o) x_i$.

One common variation on gradient descent intended to alleviate these difficulties is called incremental gradient descent, or alternatively stochastic gradient descent. Whereas the gradient descent training rule presented in Equation (4.7) computes weight updates after summing over all the training examples in $D$, the idea behind stochastic gradient descent is to approximate this gradient descent search by updating weights incrementally, following the calculation of the error for each individual example. The modified training rule is like the training rule given by Equation (4.7) except that as we iterate through each training example we update the weight according to

$$\Delta w_i = \eta (t - o) x_i \qquad (4.10)$$

where $t$, $o$, and $x_i$ are the target value, unit output, and $i$th input for the training example in question. To modify the gradient descent algorithm of Table 4.1 to implement this stochastic approximation, Equation (T4.2) is simply deleted and Equation (T4.1) replaced by $w_i \leftarrow w_i + \eta (t - o) x_i$. One way to view this stochastic gradient descent is to consider a distinct error function $E_d(\vec{w})$ defined for each individual training example $d$ as follows

$$E_d(\vec{w}) = \frac{1}{2} (t_d - o_d)^2 \qquad (4.11)$$

where $t_d$ and $o_d$ are the target value and the unit output value for training example $d$. Stochastic gradient descent iterates over the training examples $d$ in $D$, at each iteration altering the weights according to the gradient with respect to $E_d(\vec{w})$.
The sequence of these weight updates, when iterated over all training examples, provides a reasonable approximation to descending the gradient with respect to our original error function $E(\vec{w})$. By making the value of $\eta$ (the gradient descent step size) sufficiently small, stochastic gradient descent can be made to approximate true gradient descent arbitrarily closely. The key differences between standard gradient descent and stochastic gradient descent are:

• In standard gradient descent, the error is summed over all examples before updating weights, whereas in stochastic gradient descent weights are updated upon examining each training example.
• Summing over multiple examples in standard gradient descent requires more computation per weight update step. On the other hand, because it uses the true gradient, standard gradient descent is often used with a larger step size per weight update than stochastic gradient descent.
• In cases where there are multiple local minima with respect to $E(\vec{w})$, stochastic gradient descent can sometimes avoid falling into these local minima because it uses the various $\nabla E_d(\vec{w})$ rather than $\nabla E(\vec{w})$ to guide its search.

Both stochastic and standard gradient descent methods are commonly used in practice.

The training rule in Equation (4.10) is known as the delta rule, or sometimes the LMS (least-mean-square) rule, Adaline rule, or Widrow-Hoff rule (after its inventors). In Chapter 1 we referred to it as the LMS weight-update rule when describing its use for learning an evaluation function for game playing.

Notice the delta rule in Equation (4.10) is similar to the perceptron training rule in Section 4.4.2. In fact, the two expressions appear to be identical. However, the rules are different because in the delta rule $o$ refers to the linear unit output $o(\vec{x}) = \vec{w} \cdot \vec{x}$, whereas for the perceptron rule $o$ refers to the thresholded output $o(\vec{x}) = \operatorname{sgn}(\vec{w} \cdot \vec{x})$.
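The two variants of gradient descent for a linear unit can be sketched side by side as follows. This is a minimal illustration under stated assumptions, not the book's code: the function names, the zero initial weights (Table 4.1 calls for small random values; zeros are used here only to keep the run reproducible), the fixed number of epochs as the termination condition, and the toy data set generated from $t = 1 + 2x_1$ are all choices made for this example.

```python
def linear_output(w, x):
    """Unthresholded linear unit: o = w . x (x includes the constant x0 = 1)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def gradient_descent(examples, eta, epochs):
    """Standard (batch) gradient descent, following the structure of Table 4.1."""
    n = len(examples[0][0])
    w = [0.0] * n  # assumption: zeros instead of small random values
    for _ in range(epochs):
        delta = [0.0] * n
        for x, t in examples:
            o = linear_output(w, x)
            for i in range(n):
                delta[i] += eta * (t - o) * x[i]  # accumulate, as in (T4.1)
        for i in range(n):
            w[i] += delta[i]                      # apply once per pass, as in (T4.2)
    return w


def stochastic_gradient_descent(examples, eta, epochs):
    """Incremental version: update the weights after every single example."""
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, t in examples:
            o = linear_output(w, x)
            for i in range(n):
                w[i] += eta * (t - o) * x[i]      # delta rule, as in (4.10)
    return w


# Hypothetical noise-free data from t = 1 + 2*x1; each x is [x0 = 1, x1].
examples = [([1.0, float(x1)], 1.0 + 2.0 * x1) for x1 in range(4)]
w_batch = gradient_descent(examples, eta=0.05, epochs=2000)
w_sgd = stochastic_gradient_descent(examples, eta=0.05, epochs=2000)
print(w_batch, w_sgd)
```

Because this toy data set can be fit exactly, both procedures should approach the same weight vector; on data that cannot be fit exactly, the stochastic version with a constant $\eta$ would be expected to hover near the minimum rather than settle on it.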
Although we have presented the delta rule as a method for learning weights for unthresholded linear units, it can easily be used to train thresholded perceptron units, as well. Suppose that $o = \vec{w} \cdot \vec{x}$ is the unthresholded linear unit output as above, and $o' = \operatorname{sgn}(\vec{w} \cdot \vec{x})$ is the result of thresholding $o$ as in the perceptron. Now if we wish to train a perceptron to fit training examples with target values of ±1 for $o'$, we can use these same target values and examples to train $o$ instead, using the delta rule. Clearly, if the unthresholded output $o$ can be trained to fit these values perfectly, then the thresholded output $o'$ will fit them as well (because $\operatorname{sgn}(1) = 1$, and $\operatorname{sgn}(-1) = -1$). Even when the target values cannot be fit perfectly, the thresholded $o'$ value will correctly fit the ±1 target value whenever the linear unit output $o$ has the correct sign. Notice, however, that while this procedure will learn weights that minimize the error in the linear unit output $o$, these weights will not necessarily minimize the number of training examples misclassified by the thresholded output $o'$.

4.4.4 Remarks

We have considered two similar algorithms for iteratively learning perceptron weights. The key difference between these algorithms is that the perceptron training rule updates weights based on the error in the thresholded perceptron output, whereas the delta rule updates weights based on the error in the unthresholded linear combination of inputs.

The difference between these two training rules is reflected in different convergence properties. The perceptron training rule converges after a finite number of iterations to a hypothesis that perfectly classifies the training data, provided the training examples are linearly separable. The delta rule converges only asymptotically toward the minimum error hypothesis, possibly requiring unbounded time, but converges regardless of whether the training data are linearly separable.
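The contrast drawn above can be made concrete with a small sketch in which the two rules share one update expression and differ only in how $o$ is computed. Everything named here is an assumption for illustration: the function `train_unit`, its `thresholded` flag, the zero initial weights, the learning rates, and the choice of the AND function with ±1 encoding as training data.

```python
def sgn(y):
    """Threshold function: 1 if y > 0, otherwise -1."""
    return 1 if y > 0 else -1


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_unit(examples, eta, max_epochs, thresholded):
    """Apply w_i <- w_i + eta * (t - o) * x_i to every example, repeatedly.

    thresholded=True  uses o = sgn(w . x)  (perceptron training rule)
    thresholded=False uses o = w . x       (delta rule)
    Returns the weights and the number of passes made over the data.
    """
    w = [0.0] * len(examples[0][0])  # assumption: start from zero weights
    for epoch in range(1, max_epochs + 1):
        for x, t in examples:
            o = sgn(dot(w, x)) if thresholded else dot(w, x)
            for i in range(len(w)):
                w[i] += eta * (t - o) * x[i]
        # The perceptron rule can stop once every example is classified correctly.
        if thresholded and all(sgn(dot(w, x)) == t for x, t in examples):
            return w, epoch
    return w, max_epochs


# Hypothetical training set: AND with 1 = true, -1 = false; x = [x0 = 1, x1, x2].
and_examples = [
    ([1.0, -1.0, -1.0], -1),
    ([1.0, -1.0, 1.0], -1),
    ([1.0, 1.0, -1.0], -1),
    ([1.0, 1.0, 1.0], 1),
]
w_perceptron, passes = train_unit(and_examples, eta=0.1, max_epochs=100, thresholded=True)
w_delta, _ = train_unit(and_examples, eta=0.05, max_epochs=200, thresholded=False)
print(w_perceptron, passes, w_delta)
```

On this linearly separable set the perceptron rule stops after a finite number of passes, while the delta rule keeps adjusting the weights toward the least-squares fit of the linear output; thresholding that linear output then classifies the four examples as well.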
A detailed presentation of the convergence proofs can be found in Hertz et al. (1991).

A third possible algorithm for learning the weight vector is linear programming. Linear programming is a general, efficient method for solving sets of linear inequalities. Notice each training example corresponds to an inequality of the form $\vec{w} \cdot \vec{x} > 0$ or $\vec{w} \cdot \vec{x} \le 0$, and their solution is the desired weight vector. Unfortunately, this approach yields a solution only when the training examples are linearly separable; however, Duda and Hart (1973, p. 168) suggest a more subtle formulation that accommodates the nonseparable case. In any case, the approach of linear programming does not scale to training multilayer networks, which is our primary concern. In contrast, the gradient descent approach, on which the delta rule is based, can be easily extended to multilayer networks, as shown in the following section.

4.5 MULTILAYER NETWORKS AND THE BACKPROPAGATION ALGORITHM

As noted in Section 4.4.1, single perceptrons can only express linear decision surfaces. In contrast, the kind of multilayer networks learned by the BACKPROPAGATION algorithm are capable of expressing a rich variety of nonlinear decision surfaces. For example, a typical multilayer network and decision surface is depicted in Figure 4.5. Here the speech recognition task involves distinguishing among 10 possible vowels, all spoken in the context of "h_d" (i.e., "hid," "had," "head," "hood," etc.). The input speech signal is represented by two numerical parameters obtained from a spectral analysis of the sound, allowing us to easily visualize the decision surface over the two-dimensional instance space. As shown in the figure, it is possible for the multilayer network to represent highly nonlinear decision surfaces that are much more expressive than the linear decision surfaces of single units shown earlier in Figure 4.3.
This section discusses how to learn such multilayer networks using a gradient descent algorithm similar to that discussed in the previous section.

FIGURE 4.5 Decision regions of a multilayer feedforward network. The network shown here was trained to recognize 1 of 10 vowel sounds occurring in the context "h_d" (e.g., "had," "hid"). The network input consists of two parameters, F1 and F2, obtained from a spectral analysis of the sound. The 10 network outputs correspond to the 10 possible vowel sounds. The network prediction is the output whose value is highest. The plot on the right illustrates the highly nonlinear decision surface represented by the learned network. Points shown on the plot are test examples distinct from the examples used to train the network. (Reprinted by permission from Huang and Lippmann (1988).)

4.5.1 A Differentiable Threshold Unit

What type of unit shall we use as the basis for constructing multilayer networks? At first we might be tempted to choose the linear units discussed in the previous section, for which we have already derived a gradient descent learning rule. However, multiple layers of cascaded linear units still produce only linear functions, and we prefer networks capable of representing highly nonlinear functions. The perceptron unit is another possible choice, but its discontinuous threshold makes it undifferentiable and hence unsuitable for gradient descent. What we need is a unit whose output is a nonlinear function of its inputs, but whose output is also a differentiable function of its inputs. One solution is the sigmoid unit, a unit very much like a perceptron, but based on a smoothed, differentiable threshold function.

The sigmoid unit is illustrated in Figure 4.6. Like the perceptron, the sigmoid unit first computes a linear combination of its inputs, then applies a threshold to the result.
In the case of the sigmoid unit, however, the threshold output is a continuous function of its input. More precisely, the sigmoid unit computes its output $o$ as

$$o = \sigma(\vec{w} \cdot \vec{x})$$

where

$$\sigma(y) = \frac{1}{1 + e^{-y}}$$

$\sigma$ is often called the sigmoid function or, alternatively, the logistic function. Note its output ranges between 0 and 1, increasing monotonically with its input (see the threshold function plot in Figure 4.6). Because it maps a very large input domain to a small range of outputs, it is often referred to as the squashing function of the unit. The sigmoid function has the useful property that its derivative is easily expressed in terms of its output [in particular, $\frac{d\sigma(y)}{dy} = \sigma(y) \cdot (1 - \sigma(y))$]. As we shall see, the gradient descent learning rule makes use of this derivative. Other differentiable functions with easily calculated derivatives are sometimes used in place of $\sigma$. For example, the term $e^{-y}$ in the sigmoid function definition is sometimes replaced by $e^{-k \cdot y}$ where $k$ is some positive constant that determines the steepness of the threshold. The function tanh is also sometimes used in place of the sigmoid function (see Exercise 4.8).

FIGURE 4.6 The sigmoid threshold unit, with $net = \sum_{i=0}^{n} w_i x_i$ and $o = \sigma(net) = \frac{1}{1 + e^{-net}}$.

4.5.2 The BACKPROPAGATION Algorithm

The BACKPROPAGATION algorithm learns the weights for a multilayer network, given a network with a fixed set of units and interconnections. It employs gradient descent to attempt to minimize the squared error between the network output values and the target values for these outputs. This section presents the BACKPROPAGATION algorithm, and the following section gives the derivation for the gradient descent weight update rule used by BACKPROPAGATION.
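The derivative property of the sigmoid function from Section 4.5.1 is easy to check numerically. The sketch below is illustrative only; the function names, the test point $y = 0.7$, and the step size used for the finite-difference estimate are assumptions made for this example.

```python
import math


def sigmoid(y):
    """Logistic function: 1 / (1 + e^(-y))."""
    return 1.0 / (1.0 + math.exp(-y))


def sigmoid_derivative(y):
    """Derivative expressed through the output: sigma(y) * (1 - sigma(y))."""
    s = sigmoid(y)
    return s * (1.0 - s)


# Compare the closed form with a central finite-difference estimate.
y = 0.7       # arbitrary test point
step = 1e-6   # arbitrary small step
numeric = (sigmoid(y + step) - sigmoid(y - step)) / (2 * step)
analytic = sigmoid_derivative(y)
print(numeric, analytic)
```

The two values should agree to many decimal places, and the derivative is largest at $y = 0$, where the output is 0.5.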
Because we are considering networks with multiple output units rather than single units as before, we begin by redefining $E$ to sum the errors over all of the network output units

$$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2$$

where $outputs$ is the set of output units in the network, and $t_{kd}$ and $o_{kd}$ are the target and output values associated with the $k$th output unit and training example $d$.

The learning problem faced by BACKPROPAGATION is to search a large hypothesis space defined by all possible weight values for all the units in the network. The situation can be visualized in terms of an error surface similar to that shown for linear units in Figure 4.4. The error in that diagram is replaced by our new definition of $E$, and the other dimensions of the space correspond now to all of the weights associated with all of the units in the network. As in the case of training a single unit, gradient descent can be used to attempt to find a hypothesis to minimize $E$.

BACKPROPAGATION(training_examples, $\eta$, $n_{in}$, $n_{out}$, $n_{hidden}$)
Each training example is a pair of the form $\langle \vec{x}, \vec{t} \rangle$, where $\vec{x}$ is the vector of network input values, and $\vec{t}$ is the vector of target network output values. $\eta$ is the learning rate (e.g., .05). $n_{in}$ is the number of network inputs, $n_{hidden}$ the number of units in the hidden layer, and $n_{out}$ the number of output units. The input from unit $i$ into unit $j$ is denoted $x_{ji}$, and the weight from unit $i$ to unit $j$ is denoted $w_{ji}$.
• Create a feed-forward network with $n_{in}$ inputs, $n_{hidden}$ hidden units, and $n_{out}$ output units.
• Initialize all network weights to small random numbers (e.g., between -.05 and .05).
• Until the termination condition is met, Do
    • For each $\langle \vec{x}, \vec{t} \rangle$ in training_examples, Do
        Propagate the input forward through the network:
        1. Input the instance $\vec{x}$ to the network and compute the output $o_u$ of every unit $u$ in the network.
        Propagate the errors backward through the network:
        2. For each network output unit $k$, calculate its error term $\delta_k$
            $\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$    (T4.3)
        3. For each hidden unit $h$, calculate its error term $\delta_h$
            $\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{kh} \delta_k$    (T4.4)
        4. Update each network weight $w_{ji}$
            $w_{ji} \leftarrow w_{ji} + \Delta w_{ji}$, where $\Delta w_{ji} = \eta \, \delta_j \, x_{ji}$    (T4.5)

TABLE 4.2 The stochastic gradient descent version of the BACKPROPAGATION algorithm for feedforward networks containing two layers of sigmoid units.

One major difference in the case of multilayer networks is that the error surface can have multiple local minima, in contrast to the single-minimum parabolic error surface shown in Figure 4.4. Unfortunately, this means that gradient descent is guaranteed only to converge toward some local minimum, and not necessarily the global minimum error. Despite this obstacle, in practice BACKPROPAGATION has been found to produce excellent results in many real-world applications.

The BACKPROPAGATION algorithm is presented in Table 4.2. The algorithm as described here applies to layered feedforward networks containing two layers of sigmoid units, with units at each layer connected to all units from the preceding layer. This is the incremental, or stochastic, gradient descent version of BACKPROPAGATION. The notation used here is the same as that used in earlier sections, with the following extensions:

• An index (e.g., an integer) is assigned to each node in the network, where a "node" is either an input to the network or the output of some unit in the network.
• $x_{ji}$ denotes the input from node $i$ to unit $j$, and $w_{ji}$ denotes the corresponding weight.
• $\delta_n$ denotes the error term associated with unit $n$. It plays a role analogous to the quantity $(t - o)$ in our earlier discussion of the delta training rule. As we shall see later, $\delta_n = -\frac{\partial E}{\partial net_n}$.

Notice the algorithm in Table 4.2 begins by constructing a network with the desired number of hidden and output units and initializing all network weights to small random values. Given this fixed network structure, the main loop of the algorithm then repeatedly iterates over the training examples.
For each training example, it applies the network to the example, calculates the error of the network output for this example, computes the gradient with respect to the error on this example, then updates all weights in the network. This gradient descent step is iterated (often thousands of times, using the same training examples multiple times) until the network performs acceptably well.

The gradient descent weight-update rule (Equation [T4.5] in Table 4.2) is similar to the delta training rule (Equation [4.10]). Like the delta rule, it updates each weight in proportion to the learning rate $\eta$, the input value $x_{ji}$ to which the weight is applied, and the error in the output of the unit. The only difference is that the error $(t - o)$ in the delta rule is replaced by a more complex error term, $\delta_j$. The exact form of $\delta_j$ follows from the derivation of the weight-tuning rule given in Section 4.5.3. To understand it intuitively, first consider how $\delta_k$ is computed for each network output unit $k$ (Equation [T4.3] in the algorithm). $\delta_k$ is simply the familiar $(t_k - o_k)$ from the delta rule, multiplied by the factor $o_k (1 - o_k)$, which is the derivative of the sigmoid squashing function. The $\delta_h$ value for each hidden unit $h$ has a similar form (Equation [T4.4] in the algorithm). However, since training examples provide target values $t_k$ only for network outputs, no target values are directly available to indicate the error of hidden units' values. Instead, the error term for hidden unit $h$ is calculated by summing the error terms $\delta_k$ for each output unit influenced by $h$, weighting each of the $\delta_k$'s by $w_{kh}$, the weight from hidden unit $h$ to output unit $k$. This weight characterizes the degree to which hidden unit $h$ is "responsible for" the error in output unit $k$.

The algorithm in Table 4.2 updates weights incrementally, following the presentation of each training example. This corresponds to a stochastic approximation to gradient descent.
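A compact sketch of one stochastic BACKPROPAGATION step for a network with one hidden layer of sigmoid units is given below, using the error terms and update rule labeled (T4.3) to (T4.5) in the text. It is an illustration under several assumptions that are not in the source: the function names, the use of a constant input of 1 at each layer so that every unit has a threshold weight, the fixed random seed, the learning rate of 0.5, the fixed number of passes used as the termination condition, and the toy OR data set with 0/1 targets.

```python
import math
import random


def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))


def forward(x, w_hidden, w_out):
    """Propagate x (which includes x0 = 1) forward.

    Returns the hidden-layer outputs with a leading constant 1, and the
    network outputs.
    """
    h = [1.0] + [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    o = [sigmoid(sum(w * hi for w, hi in zip(ws, h))) for ws in w_out]
    return h, o


def backprop_step(x, t, w_hidden, w_out, eta):
    """One stochastic update for a single training example (x, t)."""
    h, o = forward(x, w_hidden, w_out)
    # Error terms of the output units: o_k (1 - o_k)(t_k - o_k), as in (T4.3).
    delta_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # Error terms of the hidden units, as in (T4.4); h[j + 1] is the output
    # of hidden unit j and w_out[k][j + 1] is the weight from it to output k.
    delta_hid = [
        h[j + 1] * (1 - h[j + 1])
        * sum(w_out[k][j + 1] * delta_out[k] for k in range(len(o)))
        for j in range(len(w_hidden))
    ]
    # Weight updates: w_ji <- w_ji + eta * delta_j * x_ji, as in (T4.5).
    for k, ws in enumerate(w_out):
        for j in range(len(ws)):
            ws[j] += eta * delta_out[k] * h[j]
    for j, ws in enumerate(w_hidden):
        for i in range(len(ws)):
            ws[i] += eta * delta_hid[j] * x[i]


def total_error(examples, w_hidden, w_out):
    """Half the squared error summed over all examples and output units."""
    return 0.5 * sum(
        (tk - ok) ** 2
        for x, t in examples
        for tk, ok in zip(t, forward(x, w_hidden, w_out)[1])
    )


random.seed(0)  # assumption: fixed seed, only for repeatability
n_in, n_hidden, n_out = 2, 2, 1
w_hidden = [[random.uniform(-0.05, 0.05) for _ in range(n_in + 1)] for _ in range(n_hidden)]
w_out = [[random.uniform(-0.05, 0.05) for _ in range(n_hidden + 1)] for _ in range(n_out)]

# Hypothetical toy task: boolean OR with 0/1 targets; x = [x0 = 1, x1, x2].
examples = [
    ([1.0, 0.0, 0.0], [0.0]),
    ([1.0, 0.0, 1.0], [1.0]),
    ([1.0, 1.0, 0.0], [1.0]),
    ([1.0, 1.0, 1.0], [1.0]),
]
error_before = total_error(examples, w_hidden, w_out)
for _ in range(2000):  # assumption: a fixed number of passes as termination condition
    for x, t in examples:
        backprop_step(x, t, w_hidden, w_out, eta=0.5)
error_after = total_error(examples, w_hidden, w_out)
print(error_before, error_after)
```

Both sets of error terms are computed before any weight is changed, so the hidden-unit terms use the same output-layer weights that produced the forward pass.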
To obtain the true gradient of E one would sum the $\delta_j x_{ji}$ values over all training examples before altering weight values.

The weight-update loop in BACKPROPAGATION may be iterated thousands of times in a typical application. A variety of termination conditions can be used to halt the procedure. One may choose to halt after a fixed number of iterations through the loop, or once the error on the training examples falls below some threshold, or once the error on a separate validation set of examples meets some criterion. The choice of termination criterion is an important one, because too few iterations can fail to reduce error sufficiently, and too many can lead to overfitting the training data. This issue is discussed in greater detail in Section 4.6.5.

4.5.2.1 ADDING MOMENTUM

Because BACKPROPAGATION is such a widely used algorithm, many variations have been developed. Perhaps the most common is to alter the weight-update rule in Equation (T4.5) in the algorithm by making the weight update on the nth iteration depend partially on the update that occurred during the (n − 1)th iteration, as follows:

$$\Delta w_{ji}(n) = \eta\, \delta_j\, x_{ji} + \alpha\, \Delta w_{ji}(n - 1) \qquad (4.18)$$

Here $\Delta w_{ji}(n)$ is the weight update performed during the nth iteration through the main loop of the algorithm, and $0 \le \alpha < 1$ is a constant called the momentum. Notice the first term on the right of this equation is just the weight-update rule of Equation (T4.5) in the BACKPROPAGATION algorithm. The second term on the right is new and is called the momentum term. To see the effect of this momentum term, consider that the gradient descent search trajectory is analogous to that of a (momentumless) ball rolling down the error surface. The effect of $\alpha$ is to add momentum that tends to keep the ball rolling in the same direction from one iteration to the next. This can sometimes have the effect of keeping the ball rolling through small local minima in the error surface, or along flat regions in the surface where the ball would stop if there were no momentum.
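The momentum rule is small enough to state directly in code. The sketch below is illustrative only; the helper names `momentum_update` and `accumulated_steps` are invented for this example, and `grad_step` stands for the plain term $\eta\, \delta_j\, x_{ji}$. When that term stays constant, the updates form a geometric series that approaches `grad_step / (1 - alpha)`.

```python
def momentum_update(grad_step, prev_update, alpha):
    """Delta w(n) = grad_step + alpha * Delta w(n-1)."""
    return grad_step + alpha * prev_update

def accumulated_steps(grad_step, alpha, n):
    """Updates produced over n iterations when grad_step never changes."""
    updates, prev = [], 0.0
    for _ in range(n):
        prev = momentum_update(grad_step, prev, alpha)
        updates.append(prev)
    return updates
```

With `alpha = 0` every update equals `grad_step`, which recovers the rule of Equation (T4.5).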
Momentum also has the effect of gradually increasing the step size of the search in regions where the gradient is unchanging, thereby speeding convergence.

4.5.2.2 LEARNING IN ARBITRARY ACYCLIC NETWORKS

The definition of BACKPROPAGATION presented in Table 4.2 applies only to two-layer networks. However, the algorithm given there easily generalizes to feedforward networks of arbitrary depth. The weight update rule seen in Equation (T4.5) is retained, and the only change is to the procedure for computing $\delta$ values. In general, the $\delta_r$ value for a unit r in layer m is computed from the $\delta$ values at the next deeper layer m + 1 according to

$$\delta_r = o_r (1 - o_r) \sum_{s \in layer\ m+1} w_{sr}\, \delta_s \qquad (4.19)$$

Notice this is identical to Step 3 in the algorithm of Table 4.2, so all we are really saying here is that this step may be repeated for any number of hidden layers in the network.

It is equally straightforward to generalize the algorithm to any directed acyclic graph, regardless of whether the network units are arranged in uniform layers as we have assumed up to now. In the case that they are not, the rule for calculating $\delta$ for any internal unit (i.e., any unit that is not an output) is

$$\delta_r = o_r (1 - o_r) \sum_{s \in Downstream(r)} w_{sr}\, \delta_s \qquad (4.20)$$

where Downstream(r) is the set of units immediately downstream from unit r in the network: that is, all units whose inputs include the output of unit r. It is this general form of the weight-update rule that we derive in Section 4.5.3.

4.5.3 Derivation of the BACKPROPAGATION Rule

This section presents the derivation of the BACKPROPAGATION weight-tuning rule. It may be skipped on a first reading, without loss of continuity.

The specific problem we address here is deriving the stochastic gradient descent rule implemented by the algorithm in Table 4.2. Recall from Equation (4.11) that stochastic gradient descent involves iterating through the training examples one at a time, for each training example d descending the gradient of the error $E_d$ with respect to this single example.
In other words, for each training example d every weight $w_{ji}$ is updated by adding to it $\Delta w_{ji}$

$$\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}} \qquad (4.21)$$

where $E_d$ is the error on training example d, summed over all output units in the network

$$E_d(\vec{w}) \equiv \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$$

Here outputs is the set of output units in the network, $t_k$ is the target value of unit k for training example d, and $o_k$ is the output of unit k given training example d.

The derivation of the stochastic gradient descent rule is conceptually straightforward, but requires keeping track of a number of subscripts and variables. We will follow the notation shown in Figure 4.6, adding a subscript j to denote the jth unit of the network as follows:

• $x_{ji}$ = the ith input to unit j
• $w_{ji}$ = the weight associated with the ith input to unit j
• $net_j = \sum_i w_{ji} x_{ji}$ (the weighted sum of inputs for unit j)
• $o_j$ = the output computed by unit j
• $t_j$ = the target output for unit j
• $\sigma$ = the sigmoid function
• outputs = the set of units in the final layer of the network
• Downstream(j) = the set of units whose immediate inputs include the output of unit j

We now derive an expression for $\partial E_d / \partial w_{ji}$ in order to implement the stochastic gradient descent rule seen in Equation (4.21). To begin, notice that weight $w_{ji}$ can influence the rest of the network only through $net_j$. Therefore, we can use the chain rule to write

$$\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j}\, x_{ji} \qquad (4.22)$$

Given Equation (4.22), our remaining task is to derive a convenient expression for $\partial E_d / \partial net_j$. We consider two cases in turn: the case where unit j is an output unit for the network, and the case where j is an internal unit.

Case 1: Training Rule for Output Unit Weights. Just as $w_{ji}$ can influence the rest of the network only through $net_j$, $net_j$ can influence the network only through $o_j$. Therefore, we can invoke the chain rule again to write

$$\frac{\partial E_d}{\partial net_j} = \frac{\partial E_d}{\partial o_j} \frac{\partial o_j}{\partial net_j} \qquad (4.23)$$

To begin, consider just the first term in Equation (4.23)

$$\frac{\partial E_d}{\partial o_j} = \frac{\partial}{\partial o_j} \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$$

The derivatives $\frac{\partial}{\partial o_j}(t_k - o_k)^2$ will be zero for all output units k except when k = j. We therefore drop the summation over output units and simply set k = j.

$$\frac{\partial E_d}{\partial o_j} = \frac{\partial}{\partial o_j} \frac{1}{2} (t_j - o_j)^2 = \frac{1}{2}\, 2\, (t_j - o_j) \frac{\partial (t_j - o_j)}{\partial o_j} = -(t_j - o_j) \qquad (4.24)$$

Next consider the second term in Equation (4.23).
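Since $o_j = \sigma(net_j)$, the derivative $\partial o_j / \partial net_j$ is just the derivative of the sigmoid function, which we have already noted is equal to $\sigma(net_j)(1 - \sigma(net_j))$. Therefore,

$$\frac{\partial o_j}{\partial net_j} = \frac{\partial \sigma(net_j)}{\partial net_j} = o_j (1 - o_j) \qquad (4.25)$$

Substituting expressions (4.24) and (4.25) into (4.23), we obtain

$$\frac{\partial E_d}{\partial net_j} = -(t_j - o_j)\, o_j (1 - o_j)$$

and combining this with Equations (4.21) and (4.22), we have the stochastic gradient descent rule for output units

$$\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}} = \eta\, (t_j - o_j)\, o_j (1 - o_j)\, x_{ji}$$

Note this training rule is exactly the weight update rule implemented by Equations (T4.3) and (T4.5) in the algorithm of Table 4.2. Furthermore, we can see now that $\delta_k$ in Equation (T4.3) is equal to the quantity $-\partial E_d / \partial net_k$. In the remainder of this section we will use $\delta_i$ to denote the quantity $-\partial E_d / \partial net_i$ for an arbitrary unit i.

Case 2: Training Rule for Hidden Unit Weights. In the case where j is an internal, or hidden unit in the network, the derivation of the training rule for $w_{ji}$ must take into account the indirect ways in which $w_{ji}$ can influence the network outputs and hence $E_d$. For this reason, we will find it useful to refer to the set of all units immediately downstream of unit j in the network (i.e., all units whose direct inputs include the output of unit j). We denote this set of units by Downstream(j). Notice that $net_j$ can influence the network outputs (and therefore $E_d$) only through the units in Downstream(j). Therefore, we can write

$$\begin{aligned}
\frac{\partial E_d}{\partial net_j} &= \sum_{k \in Downstream(j)} \frac{\partial E_d}{\partial net_k} \frac{\partial net_k}{\partial net_j} \\
&= \sum_{k \in Downstream(j)} -\delta_k \frac{\partial net_k}{\partial net_j} \\
&= \sum_{k \in Downstream(j)} -\delta_k \frac{\partial net_k}{\partial o_j} \frac{\partial o_j}{\partial net_j} \\
&= \sum_{k \in Downstream(j)} -\delta_k\, w_{kj} \frac{\partial o_j}{\partial net_j} \\
&= \sum_{k \in Downstream(j)} -\delta_k\, w_{kj}\, o_j (1 - o_j)
\end{aligned}$$

Rearranging terms and using $\delta_j$ to denote $-\partial E_d / \partial net_j$, we have

$$\delta_j = o_j (1 - o_j) \sum_{k \in Downstream(j)} \delta_k\, w_{kj}$$

and

$$\Delta w_{ji} = \eta\, \delta_j\, x_{ji}$$

which is precisely the general rule from Equation (4.20) for updating internal unit weights in arbitrary acyclic directed graphs. Notice Equation (T4.4) from Table 4.2 is just a special case of this rule, in which Downstream(j) = outputs.

4.6 REMARKS ON THE BACKPROPAGATION ALGORITHM

4.6.1 Convergence and Local Minima

As shown above, the BACKPROPAGATION algorithm implements a gradient descent search through the space of possible network weights, iteratively reducing the error E between the training example target values and the network outputs.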
Because the error surface for multilayer networks may contain many different local minima, gradient descent can become trapped in any of these. As a result, BACKPROPAGATION over multilayer networks is only guaranteed to converge toward some local minimum in E and not necessarily to the global minimum error.

Despite the lack of assured convergence to the global minimum error, BACKPROPAGATION is a highly effective function approximation method in practice. In many practical applications the problem of local minima has not been found to be as severe as one might fear. To develop some intuition here, consider that networks with large numbers of weights correspond to error surfaces in very high dimensional spaces (one dimension per weight). When gradient descent falls into a local minimum with respect to one of these weights, it will not necessarily be in a local minimum with respect to the other weights. In fact, the more weights in the network, the more dimensions that might provide "escape routes" for gradient descent to fall away from the local minimum with respect to this single weight.

A second perspective on local minima can be gained by considering the manner in which network weights evolve as the number of training iterations increases. Notice that if network weights are initialized to values near zero, then during early gradient descent steps the network will represent a very smooth function that is approximately linear in its inputs. This is because the sigmoid threshold function itself is approximately linear when the weights are close to zero (see the plot of the sigmoid function in Figure 4.6). Only after the weights have had time to grow will they reach a point where they can represent highly nonlinear network functions. One might expect more local minima to exist in the region of the weight space that represents these more complex functions.
One hopes that by the time the weights reach this point they have already moved close enough to the global minimum that even local minima in this region are acceptable.

Despite the above comments, gradient descent over the complex error surfaces represented by ANNs is still poorly understood, and no methods are known to predict with certainty when local minima will cause difficulties. Common heuristics to attempt to alleviate the problem of local minima include:

• Add a momentum term to the weight-update rule as described in Equation (4.18). Momentum can sometimes carry the gradient descent procedure through narrow local minima (though in principle it can also carry it through narrow global minima into other local minima!).
• Use stochastic gradient descent rather than true gradient descent. As discussed in Section 4.4.3.3, the stochastic approximation to gradient descent effectively descends a different error surface for each training example, relying on the average of these to approximate the gradient with respect to the full training set. These different error surfaces typically will have different local minima, making it less likely that the process will get stuck in any one of them.
• Train multiple networks using the same data, but initializing each network with different random weights. If the different training efforts lead to different local minima, then the network with the best performance over a separate validation data set can be selected. Alternatively, all networks can be retained and treated as a "committee" of networks whose output is the (possibly weighted) average of the individual network outputs.

4.6.2 Representational Power of Feedforward Networks

What set of functions can be represented by feedforward networks? Of course the answer depends on the width and depth of the networks.
Although much is still unknown about which function classes can be described by which types of networks, three quite general results are known:

• Boolean functions. Every boolean function can be represented exactly by some network with two layers of units, although the number of hidden units required grows exponentially in the worst case with the number of network inputs. To see how this can be done, consider the following general scheme for representing an arbitrary boolean function: For each possible input vector, create a distinct hidden unit and set its weights so that it activates if and only if this specific vector is input to the network. This produces a hidden layer that will always have exactly one unit active. Now implement the output unit as an OR gate that activates just for the desired input patterns.
• Continuous functions. Every bounded continuous function can be approximated with arbitrarily small error (under a finite norm) by a network with two layers of units (Cybenko 1989; Hornik et al. 1989). The theorem in this case applies to networks that use sigmoid units at the hidden layer and (unthresholded) linear units at the output layer. The number of hidden units required depends on the function to be approximated.
• Arbitrary functions. Any function can be approximated to arbitrary accuracy by a network with three layers of units (Cybenko 1988). Again, the output layer uses linear units, the two hidden layers use sigmoid units, and the number of units required at each layer is not known in general. The proof of this involves showing that any function can be approximated by a linear combination of many localized functions that have value 0 everywhere except for some small region, and then showing that two layers of sigmoid units are sufficient to produce good local approximations.
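The construction described for boolean functions can be made concrete. The sketch below is an illustration under assumptions, not part of the original text: it uses hard threshold units rather than sigmoid units, and the names `build_boolean_network` and `evaluate` are invented here. Each hidden unit gets weight +1 on inputs that are 1 in its pattern and −1 on inputs that are 0, with a bias chosen so that only an exact match pushes the weighted sum above zero.

```python
from itertools import product

def step(y):
    return 1 if y > 0 else 0

def build_boolean_network(f, n):
    """Two-layer threshold network computing boolean function f on n inputs."""
    hidden = []        # one (weights, bias) pair per possible input vector
    out_weights = []   # OR gate: weight 1 on hidden units whose pattern f accepts
    for v in product([0, 1], repeat=n):
        weights = [1 if bit else -1 for bit in v]
        bias = 0.5 - sum(v)   # net is 0.5 on an exact match, at most -0.5 otherwise
        hidden.append((weights, bias))
        out_weights.append(1 if f(v) else 0)
    return hidden, (out_weights, -0.5)

def evaluate(network, x):
    hidden, (out_weights, out_bias) = network
    h = [step(b + sum(w * xi for w, xi in zip(ws, x))) for ws, b in hidden]
    return step(out_bias + sum(w * hi for w, hi in zip(out_weights, h)))
```

For n inputs this builds 2^n hidden units, which is the exponential worst-case growth noted in the first result above.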
These results show that limited depth feedforward networks provide a very expressive hypothesis space for BACKPROPAGATION. However, it is important to keep in mind that the network weight vectors reachable by gradient descent from the initial weight values may not include all possible weight vectors. Hertz et al. (1991) provide a more detailed discussion of the above results.

4.6.3 Hypothesis Space Search and Inductive Bias

It is interesting to compare the hypothesis space search of BACKPROPAGATION to the search performed by other learning algorithms. For BACKPROPAGATION, every possible assignment of network weights represents a syntactically distinct hypothesis that in principle can be considered by the learner. In other words, the hypothesis space is the n-dimensional Euclidean space of the n network weights. Notice this hypothesis space is continuous, in contrast to the hypothesis spaces of decision tree learning and other methods based on discrete representations. The fact that it is continuous, together with the fact that E is differentiable with respect to the continuous parameters of the hypothesis, results in a well-defined error gradient that provides a very useful structure for organizing the search for the best hypothesis. This structure is quite different from the general-to-specific ordering used to organize the search for symbolic concept learning algorithms, or the simple-to-complex ordering over decision trees used by the ID3 and C4.5 algorithms.

What is the inductive bias by which BACKPROPAGATION generalizes beyond the observed data? It is difficult to characterize precisely the inductive bias of BACKPROPAGATION learning, because it depends on the interplay between the gradient descent search and the way in which the weight space spans the space of representable functions. However, one can roughly characterize it as smooth interpolation between data points.
Given two positive training examples with no negative examples between them, BACKPROPAGATION will tend to label points in between as positive examples as well. This can be seen, for example, in the decision surface illustrated in Figure 4.5, in which the specific sample of training examples gives rise to smoothly varying decision regions.

4.6.4 Hidden Layer Representations

One intriguing property of BACKPROPAGATION is its ability to discover useful intermediate representations at the hidden unit layers inside the network. Because training examples constrain only the network inputs and outputs, the weight-tuning procedure is free to set weights that define whatever hidden unit representation is most effective at minimizing the squared error E. This can lead BACKPROPAGATION to define new hidden layer features that are not explicit in the input representation, but which capture properties of the input instances that are most relevant to learning the target function.

Consider, for example, the network shown in Figure 4.7. Here, the eight network inputs are connected to three hidden units, which are in turn connected to the eight output units. Because of this structure, the three hidden units will be forced to re-represent the eight input values in some way that captures their relevant features, so that this hidden layer representation can be used by the output units to compute the correct target values.

Input        Hidden Values       Output
10000000  →  .89  .04  .08  →  10000000
01000000  →  .15  .99  .99  →  01000000
00100000  →  .01  .97  .27  →  00100000
00010000  →  .99  .97  .71  →  00010000
00001000  →  .03  .05  .02  →  00001000
00000100  →  .01  .11  .88  →  00000100
00000010  →  .80  .01  .98  →  00000010
00000001  →  .60  .94  .01  →  00000001

FIGURE 4.7 Learned Hidden Layer Representation. This 8 × 3 × 8 network was trained to learn the identity function, using the eight training examples shown. After 5000 training epochs, the three hidden unit values encode the eight distinct inputs using the encoding shown in the middle column. Notice if the encoded values are rounded to zero or one, the result is the standard binary encoding for eight distinct values.

Consider training the network shown in Figure 4.7 to learn the simple target function $f(\vec{x}) = \vec{x}$, where $\vec{x}$ is a vector containing seven 0's and a single 1. The network must learn to reproduce the eight inputs at the corresponding eight output units. Although this is a simple function, the network in this case is constrained to use only three hidden units. Therefore, the essential information from all eight input units must be captured by the three learned hidden units.

When BACKPROPAGATION is applied to this task, using each of the eight possible vectors as training examples, it successfully learns the target function. What hidden layer representation is created by the gradient descent BACKPROPAGATION algorithm? By examining the hidden unit values generated by the learned network for each of the eight possible input vectors, it is easy to see that the learned encoding is similar to the familiar standard binary encoding of eight values using three bits (e.g., 000, 001, 010, ..., 111). The exact values of the hidden units for one typical run of BACKPROPAGATION are shown in Figure 4.7.

This ability of multilayer networks to automatically discover useful representations at the hidden layers is a key feature of ANN learning. In contrast to learning methods that are constrained to use only predefined features provided by the human designer, this provides an important degree of flexibility that allows the learner to invent features not explicitly introduced by the human designer. Of course these invented features must still be computable as sigmoid unit functions of the provided network inputs. Note when more layers of units are used in the network, more complex features can be invented.
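The identity-function experiment can be tried with a few lines of standard-library Python. This is a rough sketch rather than the code behind Figure 4.7: the function name `train_identity`, the bias-at-index-0 weight layout, and the fixed random seed are assumptions of this example, while the learning rate of 0.3 and the initial weight interval (−0.1, 0.1) follow the settings the text reports for this experiment. Individual runs may pass through, or settle on, different hidden encodings.

```python
import math
import random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def train_identity(epochs, eta=0.3, seed=0):
    """Train an 8 x 3 x 8 sigmoid network on the eight one-hot vectors."""
    rng = random.Random(seed)
    n_in, n_hid, n_out = 8, 3, 8
    # Each weight row is [bias, w_1, ..., w_n], initialised to small random values.
    w_h = [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_hid)]
    w_o = [[rng.uniform(-0.1, 0.1) for _ in range(n_hid + 1)] for _ in range(n_out)]
    examples = [[1.0 if i == j else 0.0 for i in range(8)] for j in range(8)]

    def forward(x):
        xs = [1.0] + x
        hid = [sigmoid(sum(w * v for w, v in zip(row, xs))) for row in w_h]
        hs = [1.0] + hid
        out = [sigmoid(sum(w * v for w, v in zip(row, hs))) for row in w_o]
        return xs, hs, out

    def total_error():
        return sum(0.5 * sum((t - o) ** 2 for t, o in zip(x, forward(x)[2]))
                   for x in examples)

    errors = [total_error()]
    for _ in range(epochs):
        for x in examples:                       # the target equals the input
            xs, hs, out = forward(x)
            d_out = [o * (1 - o) * (t - o) for o, t in zip(out, x)]
            d_hid = [hs[j + 1] * (1 - hs[j + 1]) *
                     sum(w_o[k][j + 1] * d_out[k] for k in range(n_out))
                     for j in range(n_hid)]
            for k in range(n_out):
                for i in range(n_hid + 1):
                    w_o[k][i] += eta * d_out[k] * hs[i]
            for j in range(n_hid):
                for i in range(n_in + 1):
                    w_h[j][i] += eta * d_hid[j] * xs[i]
        errors.append(total_error())
    codes = [forward(x)[1][1:] for x in examples]  # hidden values for each input
    return errors, codes
```

Calling `train_identity(5000)` and rounding each row of `codes` lets one check whether a given run has arrived at eight distinct three-bit patterns; `errors` holds the summed squared error after each epoch.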
Another example of hidden layer features is provided in the face recognition application discussed in Section 4.7.

In order to develop a better intuition for the operation of BACKPROPAGATION in this example, let us examine the operation of the gradient descent procedure in greater detail†. The network in Figure 4.7 was trained using the algorithm shown in Table 4.2, with initial weights set to random values in the interval (−0.1, 0.1), learning rate $\eta$ = 0.3, and no weight momentum (i.e., $\alpha$ = 0). Similar results were obtained by using other learning rates and by including nonzero momentum. The hidden unit encoding shown in Figure 4.7 was obtained after 5000 training iterations through the outer loop of the algorithm (i.e., 5000 iterations through each of the eight training examples). Most of the interesting weight changes occurred, however, during the first 2500 iterations.

We can directly observe the effect of BACKPROPAGATION's gradient descent search by plotting the squared output error as a function of the number of gradient descent search steps. This is shown in the top plot of Figure 4.8. Each line in this plot shows the squared output error summed over all training examples, for one of the eight network outputs. The horizontal axis indicates the number of iterations through the outermost loop of the BACKPROPAGATION algorithm. As this plot indicates, the sum of squared errors for each output decreases as the gradient descent procedure proceeds, more quickly for some output units and less quickly for others.

The evolution of the hidden layer representation can be seen in the second plot of Figure 4.8. This plot shows the three hidden unit values computed by the learned network for one of the possible inputs (in particular, 01000000). Again, the horizontal axis indicates the number of training iterations. As this plot indicates, the network passes through a number of different encodings before converging to the final encoding given in Figure 4.7.
Finally, the evolution of individual weights within the network is illustrated in the third plot of Figure 4.8. This plot displays the evolution of weights connecting the eight input units (and the constant 1 bias input) to one of the three hidden units. Notice that significant changes in the weight values for this hidden unit coincide with significant changes in the hidden layer encoding and output squared errors. The weight that converges to a value near zero in this case is the bias weight $w_0$.

†The source code to reproduce this example is available at http://www.cs.cmu.edu/~tom/mlbook.html.

[Figure 4.8: three plots, titled "Sum of squared errors for each output unit", "Hidden unit encoding for input 01000000", and "Weights from inputs to one hidden unit".]

FIGURE 4.8 Learning the 8 × 3 × 8 Network. The top plot shows the evolving sum of squared errors for each of the eight output units, as the number of training iterations (epochs) increases. The middle plot shows the evolving hidden layer representation for the input string "01000000." The bottom plot shows the evolving weights for one of the three hidden units.

4.6.5 Generalization, Overfitting, and Stopping Criterion

In the description of the BACKPROPAGATION algorithm in Table 4.2, the termination condition for the algorithm has been left unspecified. What is an appropriate condition for terminating the weight update loop? One obvious choice is to continue training until the error E on the training examples falls below some predetermined threshold. In fact, this is a poor strategy because BACKPROPAGATION is susceptible to overfitting the training examples at the cost of decreasing generalization accuracy over other unseen examples.

[Figure 4.9: two plots, titled "Error versus weight updates (example 1)" and "Error versus weight updates (example 2)". Each plots error against the number of weight updates, with one curve for training set error and one for validation set error.]

FIGURE 4.9 Plots of error E as a function of the number of weight updates, for two different robot perception tasks. In both learning cases, error E over the training examples decreases monotonically, as gradient descent minimizes this measure of error. Error over the separate "validation" set of examples typically decreases at first, then may later increase due to overfitting the training examples. The network most likely to generalize correctly to unseen data is the network with the lowest error over the validation set. Notice in the second plot, one must be careful to not stop training too soon when the validation set error begins to increase.

To see the dangers of minimizing the error over the training data, consider how the error E varies with the number of weight iterations. Figure 4.9 shows this variation for two fairly typical applications of BACKPROPAGATION. Consider first the top plot in this figure. The lower of the two lines shows the monotonically decreasing error E over the training set, as the number of gradient descent iterations grows. The upper line shows the error E measured over a different validation set of examples, distinct from the training examples. This line measures the generalization accuracy of the network, that is, the accuracy with which it fits examples beyond the training data.
Notice the error measured over the validation examples first decreases, then increases, even as the error over the training examples continues to decrease. How can this occur? This occurs because the weights are being tuned to fit idiosyncrasies of the training examples that are not representative of the general distribution of examples. The large number of weight parameters in ANNs provides many degrees of freedom for fitting such idiosyncrasies.

Why does overfitting tend to occur during later iterations, but not during earlier iterations? Consider that network weights are initialized to small random values. With weights of nearly identical value, only very smooth decision surfaces are describable. As training proceeds, some weights begin to grow in order to reduce the error over the training data, and the complexity of the learned decision surface increases. Thus, the effective complexity of the hypotheses that can be reached by BACKPROPAGATION increases with the number of weight-tuning iterations. Given enough weight-tuning iterations, BACKPROPAGATION will often be able to create overly complex decision surfaces that fit noise in the training data or unrepresentative characteristics of the particular training sample. This overfitting problem is analogous to the overfitting problem in decision tree learning (see Chapter 3).

Several techniques are available to address the overfitting problem for BACKPROPAGATION learning. One approach, known as weight decay, is to decrease each weight by some small factor during each iteration. This is equivalent to modifying the definition of E to include a penalty term corresponding to the total magnitude of the network weights. The motivation for this approach is to keep weight values small, to bias learning against complex decision surfaces.
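The stated equivalence is easy to check for a single weight. The sketch below assumes one particular penalty, $\gamma w^2$ added to E for each weight, with a hypothetical coefficient `gamma`; the text above does not commit to this exact form, and both function names are invented for the illustration. Here `grad` stands for $\partial E / \partial w$ of the unpenalized error.

```python
def decay_step(w, grad, eta, gamma):
    """Shrink the weight by the factor (1 - 2*eta*gamma), then take the usual step."""
    return (1.0 - 2.0 * eta * gamma) * w - eta * grad

def penalized_step(w, grad, eta, gamma):
    """Plain gradient step on E + gamma * w**2; the penalty adds 2*gamma*w to the gradient."""
    return w - eta * (grad + 2.0 * gamma * w)
```

The two functions return the same value for any inputs, and with a zero error gradient each step simply pulls the weight toward zero.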
One of the most successful methods for overcoming the overfitting problem is to simply provide a set of validation data to the algorithm in addition to the training data. The algorithm monitors the error with respect to this validation set, while using the training set to drive the gradient descent search. In essence, this allows the algorithm itself to plot the two curves shown in Figure 4.9. How many weight-tuning iterations should the algorithm perform? Clearly, it should use the number of iterations that produces the lowest error over the validation set, since this is the best indicator of network performance over unseen examples. In typical implementations of this approach, two copies of the network weights are kept: one copy for training and a separate copy of the best-performing weights thus far, measured by their error over the validation set. Once the trained weights reach a significantly higher error over the validation set than the stored weights, training is terminated and the stored weights are returned as the final hypothesis. When this procedure is applied in the case of the top plot of Figure 4.9, it outputs the network weights obtained after 9100 iterations. The second plot in Figure 4.9 shows that it is not always obvious when the lowest error on the validation set has been reached. In this plot, the validation set error decreases, then increases, then decreases again. Care must be taken to avoid the mistaken conclusion that the network has reached its lowest validation set error at iteration 850.

In general, the issue of overfitting and how to overcome it is a subtle one. The above cross-validation approach works best when extra data are available to provide a validation set. Unfortunately, however, the problem of overfitting is most severe for small training sets.
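The two-copies procedure described above can be sketched independently of any particular network. Everything named here is hypothetical: `update` stands for one weight-tuning iteration on the training set, `validation_error` for the error measured over the validation set, and the `tolerance` factor is one possible reading of "significantly higher", chosen only for illustration.

```python
import copy

def train_with_early_stopping(weights, update, validation_error,
                              max_iterations, tolerance=1.5):
    """Return (best_weights, best_iteration).

    update(weights) performs one weight-tuning iteration in place.
    validation_error(weights) returns the error over the validation set.
    Training stops once the current validation error exceeds
    tolerance times the lowest validation error seen so far.
    """
    best_weights = copy.deepcopy(weights)      # stored copy of the best weights
    best_error = validation_error(weights)
    best_iteration = 0
    for iteration in range(1, max_iterations + 1):
        update(weights)                        # working copy keeps training
        error = validation_error(weights)
        if error < best_error:
            best_weights = copy.deepcopy(weights)
            best_error, best_iteration = error, iteration
        elif error > tolerance * best_error:
            break                              # significantly worse: stop
    return best_weights, best_iteration
```

A larger `tolerance` makes the loop more patient, which matters for curves like the second plot of Figure 4.9, where the validation error rises for a while before falling to a new low.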
When the training set is small, a k-fold cross-validation approach is sometimes used, in which cross validation is performed k different times, each time using a different partitioning of the data into training and validation sets, and the results are then averaged. In one version of this approach, the m available examples are partitioned into k disjoint subsets, each of size m/k. The cross-validation procedure is then run k times, each time using a different one of these subsets as the validation set and combining the other subsets for the training set. Thus, each example is used in the validation set for one of the experiments and in the training set for the other k − 1 experiments. On each experiment the above cross-validation approach is used to determine the number of iterations i that yield the best performance on the validation set. The mean $\bar{i}$ of these estimates for i is then calculated, and a final run of BACKPROPAGATION is performed training on all m examples for $\bar{i}$ iterations, with no validation set. This procedure is closely related to the procedure for comparing two learning methods based on limited data, described in Chapter 5.

4.7 AN ILLUSTRATIVE EXAMPLE: FACE RECOGNITION

To illustrate some of the practical design choices involved in applying BACKPROPAGATION, this section discusses applying it to a learning task involving face recognition. All image data and code used to produce the examples described in this section are available at World Wide Web site http://www.cs.cmu.edu/~tom/mlbook.html, along with complete documentation on how to use the code. Why not try it yourself?

4.7.1 The Task

The learning task here involves classifying camera images of faces of various people in various poses. Images of 20 different people were collected, including approximately 32 images per person, varying the person's expression (happy, sad, angry, neutral), the direction in which they were looking (left, right, straight ahead, up), and whether or not they were wearing sunglasses.
As can be seen from the example images in Figure 4.10, there is also variation in the background behind the person, the clothing worn by the person, and the position of the person's face within the image. In total, 624 greyscale images were collected, each with a resolution of 120 × 128, with each image pixel described by a greyscale intensity value between 0 (black) and 255 (white).

A variety of target functions can be learned from this image data. For example, given an image as input we could train an ANN to output the identity of the person, the direction in which the person is facing, the gender of the person, whether or not they are wearing sunglasses, etc. All of these target functions can be learned to high accuracy from this image data, and the reader is encouraged to try out these experiments. In the remainder of this section we consider one particular task: learning the direction in which the person is facing (to their left, right, straight ahead, or upward).

[Figure 4.10: sample 30 × 32 resolution input images, above two panels titled "Network weights after 1 iteration through each training example" and "Network weights after 100 iterations through each training example"; output units are labeled left, straight, right, up.]

FIGURE 4.10 Learning an artificial neural network to recognize face pose. Here a 960 × 3 × 4 network is trained on grey-level images of faces (see top), to predict whether a person is looking to their left, right, ahead, or up. After training on 260 such images, the network achieves an accuracy of 90% over a separate test set. The learned network weights are shown after one weight-tuning iteration through the training examples and after 100 iterations. Each output unit (left, straight, right, up) has four weights, shown by dark (negative) and light (positive) blocks. The leftmost block corresponds to the weight $w_0$, which determines the unit threshold, and the three blocks to the right correspond to weights on inputs from the three hidden units.
The weights from the image pixels into each hidden unit are also shown, with each weight plotted in the position of the corresponding image pixel.

4.7.2 Design Choices

In applying BACKPROPAGATION to any given task, a number of design choices must be made. We summarize these choices below for our task of learning the direction in which a person is facing. Although no attempt was made to determine the precise optimal design choices for this task, the design described here learns the target function quite well. After training on a set of 260 images, classification accuracy over a separate test set is 90%. In contrast, the default accuracy achieved by randomly guessing one of the four possible face directions is 25%.

Input encoding. Given that the ANN input is to be some representation of the image, one key design choice is how to encode this image. For example, we could preprocess the image to extract edges, regions of uniform intensity, or other local image features, then input these features to the network. One difficulty with this design option is that it would lead to a variable number of features (e.g., edges) per image, whereas the ANN has a fixed number of input units. The design option chosen in this case was instead to encode the image as a fixed set of 30 x 32 pixel intensity values, with one network input per pixel. The pixel intensity values ranging from 0 to 255 were linearly scaled to range from 0 to 1 so that network inputs would have values in the same interval as the hidden unit and output unit activations. The 30 x 32 pixel image is, in fact, a coarse-resolution summary of the original 120 x 128 captured image, with each coarse pixel intensity calculated as the mean of the corresponding high-resolution pixel intensities. Using this coarse-resolution image reduces the number of inputs and network weights to a much more manageable size, thereby reducing computational demands, while maintaining sufficient resolution to correctly classify the images.
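The input encoding just described can be illustrated with a short sketch. It assumes the 120 x 128 image is stored as a NumPy array with 120 rows and 128 columns, so that each coarse pixel is the mean of a 4 x 4 block; the function name `encode_image` and the random test image are hypothetical and are not part of the code distributed with the book.

```python
import numpy as np

def encode_image(img, factor=4):
    """Mean-pool a greyscale image by `factor`, then scale intensities to [0, 1]."""
    img = np.asarray(img, dtype=float)
    rows, cols = img.shape
    # View the image as a grid of factor x factor blocks and average each block.
    blocks = img.reshape(rows // factor, factor, cols // factor, factor)
    coarse = blocks.mean(axis=(1, 3))
    # Linearly rescale 0..255 to 0..1 and flatten to one network input per pixel.
    return (coarse / 255.0).ravel()

# Hypothetical stand-in for one captured 120 x 128 image.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(120, 128))
x = encode_image(image)
print(x.shape)  # one value per coarse pixel: 30 * 32 inputs
```

Averaging uses every high-resolution pixel, which is the choice made for the face task; sampling a single pixel per block would be cheaper at the cost of a noisier summary.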
Recall from Figure 4.1 that the ALVINN system uses a similar coarse-resolution image as input to the network. One interesting difference is that in ALVINN, each coarse-resolution pixel intensity is obtained by selecting the intensity of a single pixel at random from the appropriate region within the high-resolution image, rather than taking the mean of all pixel intensities within this region. The motivation for this in ALVINN is that it significantly reduces the computation required to produce the coarse-resolution image from the available high-resolution image. This efficiency is especially important when the network must be used to process many images per second while autonomously driving the vehicle.

Output encoding. The ANN must output one of four values indicating the direction in which the person is looking (left, right, up, or straight). Note we could encode this four-way classification using a single output unit, assigning outputs of, say, 0.2, 0.4, 0.6, and 0.8 to encode these four possible values. Instead, we use four distinct output units, each representing one of the four possible face directions, with the highest-valued output taken as the network prediction. This is often called a 1-of-n output encoding. There are two motivations for choosing the 1-of-n output encoding over the single unit option. First, it provides more degrees of freedom to the network for representing the target function (i.e., there are n times as many weights available in the output layer of units). Second, in the 1-of-n encoding the difference between the highest-valued output and the second-highest can be used as a measure of the confidence in the network prediction (ambiguous classifications may result in near or exact ties). A further design choice here is "what should be the target values for these four output units?" One obvious choice would be to use the four target values (1,0,0,0) to encode a face looking to the left, (0,1,0,0) to encode a face looking straight, etc.
Instead of 0 and 1 values, we use values of 0.1 and 0.9, so that (0.9,0.1,0.1,0.1) is the target output vector for a face looking to the left. The reason for avoiding target values of 0 and 1 is that sigmoid units cannot produce these output values given finite weights. If we attempt to train the network to fit target values of exactly 0 and 1, gradient descent will force the weights to grow without bound. On the other hand, values of 0.1 and 0.9 are achievable using a sigmoid unit with finite weights.

Network graph structure. As described earlier, BACKPROPAGATION can be applied to any acyclic directed graph of sigmoid units. Therefore, another design choice we face is how many units to include in the network and how to interconnect them. The most common network structure is a layered network with feedforward connections from every unit in one layer to every unit in the next. In the current design we chose this standard structure, using two layers of sigmoid units (one hidden layer and one output layer). It is common to use one or two layers of sigmoid units and, occasionally, three layers. It is not common to use more layers than this because training times become very long and because networks with three layers of sigmoid units can already express a rich variety of target functions (see Section 4.6.2). Given our choice of a layered feedforward network with one hidden layer, how many hidden units should we include? In the results reported in Figure 4.10, only three hidden units were used, yielding a test set accuracy of 90%. In other experiments 30 hidden units were used, yielding a test set accuracy one to two percent higher. Although the generalization accuracy varied only a small amount between these two experiments, the second experiment required significantly more training time.
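As an aside on the output encoding discussed above, the 1-of-n targets with values 0.1 and 0.9, and the use of the gap between the two highest outputs as a confidence measure, might be sketched as follows. The function names and the ordering of the four directions are assumptions made for illustration only.

```python
import numpy as np

# Assumed ordering of the four output units.
DIRECTIONS = ["left", "straight", "right", "up"]

def target_vector(direction, low=0.1, high=0.9):
    """1-of-n target: `high` for the correct unit, `low` for all others."""
    t = np.full(len(DIRECTIONS), low)
    t[DIRECTIONS.index(direction)] = high
    return t

def predict(outputs):
    """Return the highest-valued direction and its margin over the runner-up."""
    outputs = np.asarray(outputs, dtype=float)
    order = np.argsort(outputs)
    best, runner_up = order[-1], order[-2]
    return DIRECTIONS[best], outputs[best] - outputs[runner_up]

print(target_vector("left"))
print(predict([0.2, 0.8, 0.3, 0.1]))
```

A margin near zero signals an ambiguous classification, which a single-unit encoding could not expose in the same way.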
Using 260 training images, the training time was approximately 1 hour on a Sun Sparc5 workstation for the 30 hidden unit network, compared to approximately 5 minutes for the 3 hidden unit network. In many applications it has been found that some minimum number of hidden units is required in order to learn the target function accurately and that extra hidden units above this number do not dramatically affect generalization accuracy, provided cross-validation methods are used to determine how many gradient descent iterations should be performed. If such methods are not used, then increasing the number of hidden units often increases the tendency to overfit the training data, thereby reducing generalization accuracy.

Other learning algorithm parameters. In these learning experiments the learning rate η was set to 0.3, and the momentum α was set to 0.3. Lower values for both parameters produced roughly equivalent generalization accuracy, but longer training times. If these values are set too high, training fails to converge to a network with acceptable error over the training set. Full gradient descent was used in all these experiments (in contrast to the stochastic approximation to gradient descent in the algorithm of Table 4.2). Network weights in the output units were initialized to small random values. However, input unit weights were initialized to zero, because this yields much more intelligible visualizations of the learned weights (see Figure 4.10), without any noticeable impact on generalization accuracy.

The number of training iterations was selected by partitioning the available data into a training set and a separate validation set. Gradient descent was used to minimize the error over the training set, and after every 50 gradient descent steps the performance of the network was evaluated over the validation set. The final selected network was the one with the highest accuracy over the validation set.
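The selection procedure in the preceding paragraph can be sketched in a few lines. To stay short, the sketch trains a single sigmoid unit rather than the 960 x 3 x 4 network, uses synthetic data, and averages the gradient over the training set; all names and data here are hypothetical stand-ins, not the book's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def accuracy(w, X, t):
    """Fraction of examples whose thresholded output matches the 0/1 target."""
    return float(np.mean((sigmoid(X @ w) > 0.5) == (t > 0.5)))

def train_with_validation(X_tr, t_tr, X_val, t_val, eta=0.3, steps=1000, check_every=50):
    w = np.zeros(X_tr.shape[1])
    best_w, best_acc, best_step = w.copy(), accuracy(w, X_val, t_val), 0
    for step in range(1, steps + 1):
        o = sigmoid(X_tr @ w)
        # Gradient of the squared error for a sigmoid unit, averaged over the training set.
        grad = -((t_tr - o) * o * (1 - o)) @ X_tr / len(t_tr)
        w -= eta * grad
        # Every `check_every` steps, score the current weights on the validation set
        # and remember the best-scoring weights seen so far.
        if step % check_every == 0:
            acc = accuracy(w, X_val, t_val)
            if acc > best_acc:
                best_w, best_acc, best_step = w.copy(), acc, step
    return best_w, best_acc, best_step

# Synthetic, linearly separable data; column 0 is the constant threshold input.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
t = (X @ np.array([0.0, 1.0, -1.0]) > 0).astype(float)
best_w, best_acc, best_step = train_with_validation(X[:120], t[:120], X[120:], t[120:])
print(best_step, best_acc)
```

The weights returned are those with the highest validation accuracy, not the weights at the final step, which mirrors the choice of the final selected network described above.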
See Section 4.6.5 for an explanation and justification of this procedure. The final reported accuracy (e.g., 90% for the network in Figure 4.10) was measured over yet a third set of test examples that were not used in any way to influence training.

4.7.3 Learned Hidden Representations

It is interesting to examine the learned weight values for the 2899 weights in the network. Figure 4.10 depicts the values of each of these weights after one iteration through the weight update for all training examples, and again after 100 iterations.

To understand this diagram, consider first the four rectangular blocks just below the face images in the figure. Each of these rectangles depicts the weights for one of the four output units in the network (encoding left, straight, right, and up). The four squares within each rectangle indicate the four weights associated with this output unit: the weight w0, which determines the unit threshold (on the left), followed by the three weights connecting the three hidden units to this output. The brightness of the square indicates the weight value, with bright white indicating a large positive weight, dark black indicating a large negative weight, and intermediate shades of grey indicating intermediate weight values. For example, the output unit labeled "up" has a near zero w0 threshold weight, a large positive weight from the first hidden unit, and a large negative weight from the second hidden unit.

The weights of the hidden units are shown directly below those for the output units. Recall that each hidden unit receives an input from each of the 30 x 32 image pixels. The 30 x 32 weights associated with these inputs are displayed so that each weight is in the position of the corresponding image pixel (with the w0 threshold weight superimposed in the top left of the array).
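The figure of 2899 weights quoted at the start of this section can be checked against the 960 x 3 x 4 architecture, counting one threshold weight w0 per unit in addition to its input weights:

```latex
\underbrace{3 \times (960 + 1)}_{\text{hidden units}}
\;+\;
\underbrace{4 \times (3 + 1)}_{\text{output units}}
\;=\; 2883 + 16 \;=\; 2899
```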
Interestingly, one can see that the weights have taken on values that are especially sensitive to features in the region of the image in which the face and body typically appear.

The values of the network weights after 100 gradient descent iterations through each training example are shown at the bottom of the figure. Notice the leftmost hidden unit has very different weights than it had after the first iteration, and the other two hidden units have changed as well. It is possible to understand to some degree the encoding in this final set of weights. For example, consider the output unit that indicates a person is looking to his right. This unit has a strong positive weight from the second hidden unit and a strong negative weight from the third hidden unit. Examining the weights of these two hidden units, it is easy to see that if the person's face is turned to his right (i.e., our left), then his bright skin will roughly align with strong positive weights in this hidden unit, and his dark hair will roughly align with negative weights, resulting in this unit outputting a large value. The same image will cause the third hidden unit to output a value close to zero, as the bright face will tend to align with the large negative weights in this case.

4.8 ADVANCED TOPICS IN ARTIFICIAL NEURAL NETWORKS

4.8.1 Alternative Error Functions

As noted earlier, gradient descent can be performed for any function E that is differentiable with respect to the parameterized hypothesis space. While the basic BACKPROPAGATION algorithm defines E in terms of the sum of squared errors of the network, other definitions have been suggested in order to incorporate other constraints into the weight-tuning rule. For each new definition of E a new weight-tuning rule for gradient descent must be derived. Examples of alternative definitions of E include:

• Adding a penalty term for weight magnitude. As discussed above, we can add a term to E that increases with the magnitude of the weight vector.
This causes the gradient descent search to seek weight vectors with small magnitudes, thereby reducing the risk of overfitting. One way to do this is to redefine E as

$$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 + \gamma \sum_{i,j} w_{ji}^2$$

which yields a weight update rule identical to the BACKPROPAGATION rule, except that each weight is multiplied by the constant $(1 - 2\gamma\eta)$ upon each iteration. Thus, choosing this definition of E is equivalent to using a weight decay strategy (see Exercise 4.10).

• Adding a term for errors in the slope, or derivative of the target function. In some cases, training information may be available regarding desired derivatives of the target function, as well as desired values. For example, Simard et al. (1992) describe an application to character recognition in which certain training derivatives are used to constrain the network to learn character recognition functions that are invariant to translation within the image. Mitchell and Thrun (1993) describe methods for calculating training derivatives based on the learner's prior knowledge. In both of these systems (described in Chapter 12), the error function is modified to add a term measuring the discrepancy between these training derivatives and the actual derivatives of the learned network. One example of such an error function is

$$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} \left[ (t_{kd} - o_{kd})^2 + \mu \sum_{j \in inputs} \left( \frac{\partial t_{kd}}{\partial x_d^j} - \frac{\partial o_{kd}}{\partial x_d^j} \right)^2 \right]$$

Here $x_d^j$ denotes the value of the jth input unit for training example d. Thus, $\partial t_{kd} / \partial x_d^j$ is the training derivative describing how the target output value $t_{kd}$ should vary with a change in the input $x_d^j$. Similarly, $\partial o_{kd} / \partial x_d^j$ denotes the corresponding derivative of the actual learned network. The constant $\mu$ determines the relative weight placed on fitting the training values versus the training derivatives.

• Minimizing the cross entropy of the network with respect to the target values. Consider learning a probabilistic function, such as predicting whether a loan applicant will pay back a loan based on attributes such as the applicant's age and bank balance.
Although the training examples exhibit only boolean target values (either a 1 or 0, depending on whether this applicant paid back the loan), the underlying target function might be best modeled by outputting the probability that the given applicant will repay the loan, rather than attempting to output the actual 1 and 0 value for each input instance. Given such situations in which we wish for the network to output probability estimates, it can be shown that the best (i.e., maximum likelihood) probability estimates are given by the network that minimizes the cross entropy, defined as

$$-\sum_{d \in D} \left[ t_d \log o_d + (1 - t_d) \log (1 - o_d) \right]$$

Here $o_d$ is the probability estimate output by the network for training example d, and $t_d$ is the 1 or 0 target value for training example d. Chapter 6 discusses when and why the most probable network hypothesis is the one that minimizes this cross entropy and derives the corresponding gradient descent weight-tuning rule for sigmoid units. That chapter also describes other conditions under which the most probable hypothesis is the one that minimizes the sum of squared errors.

• Altering the effective error function can also be accomplished by weight sharing, or "tying together" weights associated with different units or inputs. The idea here is that different network weights are forced to take on identical values, usually to enforce some constraint known in advance to the human designer. For example, Waibel et al. (1989) and Lang et al. (1990) describe an application of neural networks to speech recognition, in which the network inputs are the speech frequency components at different times within a 144 millisecond time window. One assumption that can be made in this application is that the frequency components that identify a specific sound (e.g., "eee") should be independent of the exact time that the sound occurs within the 144 millisecond window. To enforce this constraint, the various units that receive input from different portions of the time window are forced to share weights.
The net effect is to constrain the space of potential hypotheses, thereby reducing the risk of overfitting and improving the chances for accurately generalizing to unseen situations. Such weight sharing is typically implemented by first updating each of the shared weights separately within each unit that uses the weight, then replacing each instance of the shared weight by the mean of their values. The result of this procedure is that shared weights effectively adapt to a different error function than do the unshared weights.

4.8.2 Alternative Error Minimization Procedures

While gradient descent is one of the most general search methods for finding a hypothesis to minimize the error function, it is not always the most efficient. It is not uncommon for BACKPROPAGATION to require tens of thousands of iterations through the weight update loop when training complex networks. For this reason, a number of alternative weight optimization algorithms have been proposed and explored. To see some of the other possibilities, it is helpful to think of a weight-update method as involving two decisions: choosing a direction in which to alter the current weight vector and choosing a distance to move. In BACKPROPAGATION the direction is chosen by taking the negative of the gradient, and the distance is determined by the learning rate constant η.

One optimization method, known as line search, involves a different approach to choosing the distance for the weight update. In particular, once a line is chosen that specifies the direction of the update, the update distance is chosen by finding the minimum of the error function along this line. Notice this can result in a very large or very small weight update, depending on the position of the point along the line that minimizes error.

A second method that builds on the idea of line search is called the conjugate gradient method. Here, a sequence of line searches is performed to search for a minimum in the error surface.
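A single line search step of the kind described above might be sketched as follows. The sketch makes several assumptions that are not in the text: the error is the sum of squared errors of a linear unit (so it is unimodal along any line), the minimum along the line is located by ternary search over a fixed interval of step sizes, and all function names and data are hypothetical.

```python
import numpy as np

def error(w, X, t):
    """Half the sum of squared errors of a linear unit with weights w."""
    r = X @ w - t
    return 0.5 * float(r @ r)

def gradient(w, X, t):
    return X.T @ (X @ w - t)

def line_search(f, w, d, hi=1.0, iters=100):
    """Ternary search for the step a in [0, hi] minimizing f(w + a * d).

    Assumes f is unimodal along the line and that the minimum lies in [0, hi].
    """
    lo = 0.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(w + m1 * d) < f(w + m2 * d):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
t = X @ np.array([1.0, -2.0])
w = np.zeros(2)
d = -gradient(w, X, t)                      # direction: negative of the gradient
a = line_search(lambda v: error(v, X, t), w, d)
w_new = w + a * d                           # distance: minimum of E along the line
print(error(w, X, t), error(w_new, X, t))
```

Unlike a fixed learning rate, the step size here is recomputed for each direction, which is why the resulting update can be very large or very small.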
On the first step in this sequence, the direction chosen is the negative of the gradient. On each subsequent step, a new direction is chosen so that the component of the error gradient that has just been made zero remains zero.

While alternative error-minimization methods sometimes lead to improved efficiency in training the network, methods such as conjugate gradient tend to have no significant impact on the generalization error of the final network. The only likely impact on the final error is that different error-minimization procedures may fall into different local minima. Bishop (1996) contains a general discussion of several parameter optimization methods for training networks.

4.8.3 Recurrent Networks

Up to this point we have considered only network topologies that correspond to acyclic directed graphs. Recurrent networks are artificial neural networks that apply to time series data and that use outputs of network units at time t as the input to other units at time t + 1. In this way, they support a form of directed cycles in the network. To illustrate, consider the time series prediction task of predicting the next day's stock market average y(t + 1) based on the current day's economic indicators x(t). Given a time series of such data, one obvious approach is to train a feedforward network to predict y(t + 1) as its output, based on the input values x(t). Such a network is shown in Figure 4.11(a).

One limitation of such a network is that the prediction of y(t + 1) depends only on x(t) and cannot capture possible dependencies of y(t + 1) on earlier values of x. This might be necessary, for example, if tomorrow's stock market average y(t + 1) depends on the difference between today's economic indicator values x(t) and yesterday's values x(t - 1). Of course we could remedy this difficulty by making both x(t) and x(t - 1) inputs to the feedforward network. However, if we wish the network to consider an arbitrary window of time in the past when predicting y(t + 1), then a different solution is required.

[Figure 4.11 panels: (a) Feedforward network; (b) Recurrent network; (c) Recurrent network unfolded in time.]

FIGURE 4.11 Recurrent networks.

The recurrent network shown in Figure 4.11(b) provides one such solution. Here, we have added a new unit b to the hidden layer, and a new input unit c(t). The value of c(t) is defined as the value of unit b at time t - 1; that is, the input value c(t) to the network at one time step is simply copied from the value of unit b on the previous time step. Notice this implements a recurrence relation, in which b represents information about the history of network inputs. Because b depends on both x(t) and on c(t), it is possible for b to summarize information from earlier values of x that are arbitrarily distant in time. Many other network topologies also can be used to represent recurrence relations. For example, we could have inserted several layers of units between the input and unit b, and we could have added several context units in parallel where we added the single units b and c.

How can such recurrent networks be trained? There are several variants of recurrent networks, and several training methods have been proposed (see, for example, Jordan 1986; Elman 1990; Mozer 1995; Williams and Zipser 1995). Interestingly, recurrent networks such as the one shown in Figure 4.11(b) can be trained using a simple variant of BACKPROPAGATION. To understand how, consider Figure 4.11(c), which shows the data flow of the recurrent network "unfolded" in time. Here we have made several copies of the recurrent network, replacing the feedback loop by connections between the various copies. Notice that this large unfolded network contains no cycles.
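The copy-back recurrence c(t) = b(t - 1), and the way a loop over time steps corresponds to the unfolded network, can be illustrated with a forward pass. This is a deliberately reduced sketch: it uses scalar inputs, a single hidden unit b, sigmoid units throughout, a context value assumed to be zero before the first time step, and hypothetical weight names and values. It does not implement training.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def run_recurrent(xs, w_bx, w_bc, w_b0, w_yb, w_y0):
    """Forward pass over a sequence; each loop iteration is one unfolded copy."""
    c = 0.0  # assumed context input before the first time step
    outputs = []
    for x in xs:
        b = sigmoid(w_b0 + w_bx * x + w_bc * c)   # b depends on x(t) and c(t)
        outputs.append(sigmoid(w_y0 + w_yb * b))  # prediction for y(t + 1)
        c = b                                     # c(t + 1) is copied from b(t)
    return outputs

# Hypothetical weights, shared by every copy of the unfolded network.
W = dict(w_bx=1.0, w_bc=2.0, w_b0=0.0, w_yb=1.5, w_y0=-0.5)
out_a = run_recurrent([0.0, 1.0], **W)
out_b = run_recurrent([1.0, 1.0], **W)
print(out_a[-1], out_b[-1])
```

The two sequences end with the same input x(t) = 1 yet produce different final outputs, because b carries information about the earlier input through the context unit.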
Therefore, the weights in the unfolded network can be trained directly using BACKPROPAGATION. Of course in practice we wish to keep only one copy of the recurrent network and one set of weights. Therefore, after training the unfolded network, the final weight wji in the recurrent network can be taken to be the mean value of the corresponding wji weights in the various copies. Mozer (1995) describes this training process in greater detail. In practice, recurrent networks are more difficult to train than networks with no feedback loops and do not generalize as reliably. However, they remain important due to their increased representational power.

4.8.4 Dynamically Modifying Network Structure

Up to this point we have considered neural network learning as a problem of adjusting weights within a fixed graph structure. A variety of methods have been proposed to dynamically grow or shrink the number of network units and interconnections in an attempt to improve generalization accuracy and training efficiency.

One idea is to begin with a network containing no hidden units, then grow the network as needed by adding hidden units until the training error is reduced to some acceptable level. The CASCADE-CORRELATION algorithm (Fahlman and Lebiere 1990) is one such algorithm. CASCADE-CORRELATION begins by constructing a network with no hidden units. In the case of our face-direction learning task, for example, it would construct a network containing only the four output units completely connected to the 30 x 32 input nodes. After this network is trained for some time, we may well find that there remains a significant residual error due to the fact that the target function cannot be perfectly represented by a network with this single-layer structure. In this case, the algorithm adds a hidden unit, choosing its weight values to maximize the correlation between the hidden unit value and the residual error of the overall network.
The new unit is now installed into the network, with its weight values held fixed, and a new connection from this new unit is added to each output unit. The process is now repeated. The original weights are retrained (holding the hidden unit weights fixed), the residual error is checked, and a second hidden unit added if the residual error is still above threshold. Whenever a new hidden unit is added, its inputs include all of the original network inputs plus the outputs of any existing hidden units. The network is grown in this fashion, accumulating hidden units until the network residual error is reduced to some acceptable level. Fahlman and Lebiere (1990) report cases in which CASCADE-CORRELATION significantly reduces training times, due to the fact that only a single layer of units is trained at each step. One practical difficulty is that because the algorithm can add units indefinitely, it is quite easy for it to overfit the training data, and precautions to avoid overfitting must be taken.

A second idea for dynamically altering network structure is to take the opposite approach. Instead of beginning with the simplest possible network and adding complexity, we begin with a complex network and prune it as we find that certain connections are inessential. One way to decide whether a particular weight is inessential is to see whether its value is close to zero. A second way, which appears to be more successful in practice, is to consider the effect that a small variation in the weight has on the error E. The effect on E of varying w (i.e., $\partial E / \partial w$) can be taken as a measure of the salience of the connection. LeCun et al. (1990) describe a process in which a network is trained, the least salient connections removed, and this process iterated until some termination condition is met. They refer to this as the "optimal brain damage" approach, because at each step the algorithm attempts to remove the least useful connections.
They report that in a character recognition application this approach reduced the number of weights in a large network by a factor of 4, with a slight improvement in generalization accuracy and a significant improvement in subsequent training efficiency.

In general, techniques for dynamically modifying network structure have met with mixed success. It remains to be seen whether they can reliably improve on the generalization accuracy of BACKPROPAGATION. However, they have been shown in some cases to provide significant improvements in training times.

4.9 SUMMARY AND FURTHER READING

Main points of this chapter include:

• Artificial neural network learning provides a practical method for learning real-valued and vector-valued functions over continuous and discrete-valued attributes, in a way that is robust to noise in the training data. The BACKPROPAGATION algorithm is the most common network learning method and has been successfully applied to a variety of learning tasks, such as handwriting recognition and robot control.

• The hypothesis space considered by the BACKPROPAGATION algorithm is the space of all functions that can be represented by assigning weights to the given, fixed network of interconnected units. Feedforward networks containing three layers of units are able to approximate any function to arbitrary accuracy, given a sufficient (potentially very large) number of units in each layer. Even networks of practical size are capable of representing a rich space of highly nonlinear functions, making feedforward networks a good choice for learning discrete and continuous functions whose general form is unknown in advance.

• BACKPROPAGATION searches the space of possible hypotheses using gradient descent to iteratively reduce the error in the network fit to the training examples. Gradient descent converges to a local minimum in the training error with respect to the network weights.
More generally, gradient descent is a potentially useful method for searching many continuously parameterized hypothesis spaces where the training error is a differentiable function of hypothesis parameters.

• One of the most intriguing properties of BACKPROPAGATION is its ability to invent new features that are not explicit in the input to the network. In particular, the internal (hidden) layers of multilayer networks learn to represent intermediate features that are useful for learning the target function and that are only implicit in the network inputs. This capability is illustrated, for example, by the ability of the 8 x 3 x 8 network in Section 4.6.4 to invent the boolean encoding of digits from 1 to 8 and by the image features represented by the hidden layer in the face-recognition application of Section 4.7.

• Overfitting the training data is an important issue in ANN learning. Overfitting results in networks that generalize poorly to new data despite excellent performance over the training data. Cross-validation methods can be used to estimate an appropriate stopping point for gradient descent search and thus to minimize the risk of overfitting.

• Although BACKPROPAGATION is the most common ANN learning algorithm, many others have been proposed, including algorithms for more specialized tasks. For example, recurrent neural network methods train networks containing directed cycles, and algorithms such as CASCADE-CORRELATION alter the network structure as well as the network weights.

Additional information on ANN learning can be found in several other chapters in this book. A Bayesian justification for choosing to minimize the sum of squared errors is given in Chapter 6, along with a justification for minimizing the cross-entropy instead of the sum of squared errors in other cases.
Theoretical results characterizing the number of training examples needed to reliably learn boolean functions and the Vapnik-Chervonenkis dimension of certain types of networks can be found in Chapter 7. A discussion of overfitting and how to avoid it can be found in Chapter 5. Methods for using prior knowledge to improve the generalization accuracy of ANN learning are discussed in Chapter 12.

Work on artificial neural networks dates back to the very early days of computer science. McCulloch and Pitts (1943) proposed a model of a neuron that corresponds to the perceptron, and a good deal of work through the 1960s explored variations of this model. During the early 1960s Widrow and Hoff (1960) explored perceptron networks (which they called "adelines") and the delta rule, and Rosenblatt (1962) proved the convergence of the perceptron training rule. However, by the late 1960s it became clear that single-layer perceptron networks had limited representational capabilities, and no effective algorithms were known for training multilayer networks. Minsky and Papert (1969) showed that even simple functions such as XOR could not be represented or learned with single-layer perceptron networks, and work on ANNs receded during the 1970s.

During the mid-1980s work on ANNs experienced a resurgence, caused in large part by the invention of BACKPROPAGATION and related algorithms for training multilayer networks (Rumelhart and McClelland 1986; Parker 1985). These ideas can be traced to related earlier work (e.g., Werbos 1975). Since the 1980s, BACKPROPAGATION has become a widely used learning method, and many other ANN approaches have been actively explored. The advent of inexpensive computers during this same period has allowed experimenting with computationally intensive algorithms that could not be thoroughly explored during the 1960s.

A number of textbooks are devoted to the topic of neural network learning.
An early but still useful book on parameter learning methods for pattern recognition is Duda and Hart (1973). The text by Widrow and Stearns (1985) covers perceptrons and related single-layer networks and their applications. Rumelhart and McClelland (1986) produced an edited collection of papers that helped generate the increased interest in these methods beginning in the mid-1980s. Recent books on neural network learning include Bishop (1996); Chauvin and Rumelhart (1995); Freeman and Skapura (1991); Fu (1994); Hecht-Nielsen (1990); and Hertz et al. (1991).

EXERCISES

4.1. What are the values of weights w0, w1, and w2 for the perceptron whose decision surface is illustrated in Figure 4.3? Assume the surface crosses the x1 axis at -1, and the x2 axis at 2.

4.2. Design a two-input perceptron that implements the boolean function A ∧ ¬B. Design a two-layer network of perceptrons that implements A XOR B.

4.3. Consider two perceptrons defined by the threshold expression w0 + w1x1 + w2x2 > 0. Perceptron A has weight values and perceptron B has the weight values True or false? Perceptron A is more_general_than perceptron B. (more_general_than is defined in Chapter 2.)

4.4. Implement the delta training rule for a two-input linear unit. Train it to fit the target concept -2 + x1 + 2x2 > 0. Plot the error E as a function of the number of training iterations. Plot the decision surface after 5, 10, 50, 100, ..., iterations.
(a) Try this using various constant values for η and using a decaying learning rate of η0/i for the ith iteration. Which works better?
(b) Try incremental and batch learning. Which converges more quickly? Consider both number of weight updates and total execution time.

4.5. Derive a gradient descent training rule for a single unit with output o, where

4.6. Explain informally why the delta training rule in Equation (4.10) is only an approximation to the true gradient descent rule of Equation (4.7).

4.7.
Consider a two-layer feedforward ANN with two inputs a and b, one hidden unit c, and one output unit d. This network has five weights (w_ca, w_cb, w_c0, w_dc, w_d0), where w_x0 represents the threshold weight for unit x. Initialize these weights to the values (.1, .1, .1, .1, .1), then give their values after each of the first two training iterations of the BACKPROPAGATION algorithm. Assume learning rate η = .3, momentum α = 0.9, incremental weight updates, and the following training examples:

    a  b  d
    1  0  1
    0  1  0

4.8. Revise the BACKPROPAGATION algorithm in Table 4.2 so that it operates on units using the squashing function tanh in place of the sigmoid function. That is, assume the output of a single unit is o = tanh(w · x), where w and x are the weight and input vectors. Give the weight update rule for output layer weights and hidden layer weights. Hint: tanh'(x) = 1 - tanh²(x).

4.9. Recall the 8 × 3 × 8 network described in Figure 4.7. Consider trying to train an 8 × 1 × 8 network for the same task; that is, a network with just one hidden unit. Notice the eight training examples in Figure 4.7 could be represented by eight distinct values for the single hidden unit (e.g., 0.1, 0.2, ..., 0.8). Could a network with just one hidden unit therefore learn the identity function defined over these training examples? Hint: Consider questions such as "do there exist values for the hidden unit weights that can create the hidden unit encoding suggested above?", "do there exist values for the output unit weights that could correctly decode this encoding of the input?", and "is gradient descent likely to find such weights?"

4.10. Consider the alternative error function described in Section 4.8.1. Derive the gradient descent update rule for this definition of E. Show that it can be implemented by multiplying each weight by some constant before performing the standard gradient descent update given in Table 4.2.

4.11. Apply BACKPROPAGATION to the task of face recognition.
See World Wide Web URL http://www.cs.cmu.edu/~tom/book.html for details, including face-image data, BACKPROPAGATION code, and specific tasks.

4.12. Consider deriving a gradient descent algorithm to learn target concepts corresponding to rectangles in the x, y plane. Describe each hypothesis by the x and y coordinates of the lower-left and upper-right corners of the rectangle: llx, lly, urx, and ury respectively. An instance (x, y) is labeled positive by hypothesis (llx, lly, urx, ury) if and only if the point (x, y) lies inside the corresponding rectangle. Define error E as in the chapter. Can you devise a gradient descent algorithm to learn such rectangle hypotheses? Notice that E is not a continuous function of llx, lly, urx, and ury, just as in the case of perceptron learning. (Hint: Consider the two solutions used for perceptrons: (1) changing the classification rule to make output predictions continuous functions of the inputs, and (2) defining an alternative error, such as distance to the rectangle center, as in using the delta rule to train perceptrons.) Does your algorithm converge to the minimum error hypothesis when the positive and negative examples are separable by a rectangle? When they are not? Do you have problems with local minima? How does your algorithm compare to symbolic methods for learning conjunctions of feature constraints?

REFERENCES

Bishop, C. M. (1996). Neural networks for pattern recognition. Oxford, England: Oxford University Press.
Chauvin, Y., & Rumelhart, D. (1995). Backpropagation: Theory, architectures, and applications (edited collection). Hillsdale, NJ: Lawrence Erlbaum Assoc.
Churchland, P. S., & Sejnowski, T. J. (1992). The computational brain. Cambridge, MA: The MIT Press.
Cybenko, G. (1988). Continuous valued neural networks with two hidden layers are sufficient (Technical Report). Department of Computer Science, Tufts University, Medford, MA.
Cybenko, G. (1989).
Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2, 303-314.
Cottrell, G. W. (1990). Extracting features from faces using compression networks: Face, identity, emotion and gender recognition using holons. In D. Touretzky (Ed.), Connection Models: Proceedings of the 1990 Summer School. San Mateo, CA: Morgan Kaufmann.
Dietterich, T. G., Hild, H., & Bakiri, G. (1995). A comparison of ID3 and Backpropagation for English text-to-speech mapping. Machine Learning, 18(1), 51-80.
Duda, R., & Hart, P. (1973). Pattern classification and scene analysis. New York: John Wiley & Sons.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179-211.
Fahlman, S., & Lebiere, C. (1990). The Cascade-Correlation learning architecture (Technical Report CMU-CS-90-100). Computer Science Department, Carnegie Mellon University, Pittsburgh, PA.
Freeman, J. A., & Skapura, D. M. (1991). Neural networks. Reading, MA: Addison Wesley.
Fu, L. (1994). Neural networks in computer intelligence. New York: McGraw Hill.
Gabriel, M., & Moore, J. (1990). Learning and computational neuroscience: Foundations of adaptive networks (edited collection). Cambridge, MA: The MIT Press.
Hecht-Nielsen, R. (1990). Neurocomputing. Reading, MA: Addison Wesley.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison Wesley.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359-366.
Huang, W. Y., & Lippmann, R. P. (1988). Neural net and traditional classifiers. In Anderson (Ed.), Neural Information Processing Systems (pp. 387-396).
Jordan, M. (1986). Attractor dynamics and parallelism in a connectionist sequential machine. Proceedings of the Eighth Annual Conference of the Cognitive Science Society (pp. 531-546).
Kohonen, T. (1984). Self-organization and associative memory. Berlin: Springer-Verlag.
Lang, K.
J., Waibel, A. H., & Hinton, G. E. (1990). A time-delay neural network architecture for isolated word recognition. Neural Networks, 3, 33-43.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4).
LeCun, Y., Denker, J. S., & Solla, S. A. (1990). Optimal brain damage. In D. Touretzky (Ed.), Advances in Neural Information Processing Systems (Vol. 2, pp. 598-605). San Mateo, CA: Morgan Kaufmann.
Manke, S., Finke, M., & Waibel, A. (1995). NPEN++: A writer independent, large vocabulary online cursive handwriting recognition system. Proceedings of the International Conference on Document Analysis and Recognition. Montreal, Canada: IEEE Computer Society.
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115-133.
Mitchell, T. M., & Thrun, S. B. (1993). Explanation-based neural network learning for robot control. In Hanson, Cowan, & Giles (Eds.), Advances in neural information processing systems 5 (pp. 287-294). San Francisco: Morgan Kaufmann.
Mozer, M. (1995). A focused Backpropagation algorithm for temporal pattern recognition. In Y. Chauvin & D. Rumelhart (Eds.), Backpropagation: Theory, architectures, and applications (pp. 137-169). Hillsdale, NJ: Lawrence Erlbaum Associates.
Minsky, M., & Papert, S. (1969). Perceptrons. Cambridge, MA: MIT Press.
Nilsson, N. J. (1965). Learning machines. New York: McGraw Hill.
Parker, D. (1985). Learning logic (MIT Technical Report TR-47). MIT Center for Research in Computational Economics and Management Science.
Pomerleau, D. A. (1993). Knowledge-based training of artificial neural networks for autonomous robot driving. In J. Connell & S. Mahadevan (Eds.), Robot Learning (pp. 19-43). Boston: Kluwer Academic Publishers.
Rosenblatt, F. (1959).
The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386-408.
Rosenblatt, F. (1962). Principles of neurodynamics. New York: Spartan Books.
Rumelhart, D. E., & McClelland, J. L. (1986). Parallel distributed processing: exploration in the microstructure of cognition (Vols. 1 & 2). Cambridge, MA: MIT Press.
Rumelhart, D., Widrow, B., & Lehr, M. (1994). The basic ideas in neural networks. Communications of the ACM, 37(3), 87-92.
Shavlik, J. W., Mooney, R. J., & Towell, G. G. (1991). Symbolic and neural learning algorithms: An experimental comparison. Machine Learning, 6(2), 111-144.
Simard, P. S., Victorri, B., LeCun, Y., & Denker, J. (1992). Tangent prop--A formalism for specifying selected invariances in an adaptive network. In Moody, et al. (Eds.), Advances in Neural Information Processing Systems 4 (pp. 895-903). San Francisco: Morgan Kaufmann.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. (1989). Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech and Signal Processing.
Weiss, S., & Kapouleas, I. (1989). An empirical comparison of pattern recognition, neural nets, and machine learning classification methods. Proceedings of the Eleventh IJCAI (pp. 781-787). San Francisco: Morgan Kaufmann.
Werbos, P. (1975). Beyond regression: New tools for prediction and analysis in the behavioral sciences (Ph.D. dissertation). Harvard University.
Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits. IRE WESCON Convention Record, 4, 96-104.
Widrow, B., & Stearns, S. D. (1985). Adaptive signal processing. Signal Processing Series. Englewood Cliffs, NJ: Prentice Hall.
Williams, R., & Zipser, D. (1995). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin & D. Rumelhart (Eds.), Backpropagation: Theory, architectures, and applications (pp. 433-486). Hillsdale, NJ: Lawrence Erlbaum Associates.
Zornetzer, S.
F., Davis, J. L., & Lau, C. (1994). An introduction to neural and electronic networks (edited collection) (2nd ed.). New York: Academic Press.

CHAPTER 5 EVALUATING HYPOTHESES

Empirically evaluating the accuracy of hypotheses is fundamental to machine learning. This chapter presents an introduction to statistical methods for estimating hypothesis accuracy, focusing on three questions. First, given the observed accuracy of a hypothesis over a limited sample of data, how well does this estimate its accuracy over additional examples? Second, given that one hypothesis outperforms another over some sample of data, how probable is it that this hypothesis is more accurate in general? Third, when data is limited what is the best way to use this data to both learn a hypothesis and estimate its accuracy? Because limited samples of data might misrepresent the general distribution of data, estimating true accuracy from such samples can be misleading. Statistical methods, together with assumptions about the underlying distributions of data, allow one to bound the difference between observed accuracy over the sample of available data and the true accuracy over the entire distribution of data.

5.1 MOTIVATION

In many cases it is important to evaluate the performance of learned hypotheses as precisely as possible. One reason is simply to understand whether to use the hypothesis. For instance, when learning from a limited-size database indicating the effectiveness of different medical treatments, it is important to understand as precisely as possible the accuracy of the learned hypotheses. A second reason is that evaluating hypotheses is an integral component of many learning methods. For example, in post-pruning decision trees to avoid overfitting, we must evaluate the impact of possible pruning steps on the accuracy of the resulting decision tree. Therefore it is important to understand the likely errors inherent in estimating the accuracy of the pruned and unpruned tree.
Estimating the accuracy of a hypothesis is relatively straightforward when data is plentiful. However, when we must learn a hypothesis and estimate its future accuracy given only a limited set of data, two key difficulties arise:

• Bias in the estimate. First, the observed accuracy of the learned hypothesis over the training examples is often a poor estimator of its accuracy over future examples. Because the learned hypothesis was derived from these examples, they will typically provide an optimistically biased estimate of hypothesis accuracy over future examples. This is especially likely when the learner considers a very rich hypothesis space, enabling it to overfit the training examples. To obtain an unbiased estimate of future accuracy, we typically test the hypothesis on some set of test examples chosen independently of the training examples and the hypothesis.
• Variance in the estimate. Second, even if the hypothesis accuracy is measured over an unbiased set of test examples independent of the training examples, the measured accuracy can still vary from the true accuracy, depending on the makeup of the particular set of test examples. The smaller the set of test examples, the greater the expected variance.

This chapter discusses methods for evaluating learned hypotheses, methods for comparing the accuracy of two hypotheses, and methods for comparing the accuracy of two learning algorithms when only limited data is available. Much of the discussion centers on basic principles from statistics and sampling theory, though the chapter assumes no special background in statistics on the part of the reader. The literature on statistical tests for hypotheses is very large. This chapter provides an introductory overview that focuses only on the issues most directly relevant to learning, evaluating, and comparing hypotheses.
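The claim that a smaller test set gives a more variable accuracy estimate can be checked with a small simulation. The sketch below is illustrative only: the true accuracy of 0.7, the test-set sizes, the number of trials, and the function names are all assumptions chosen for the example, and each test example is modeled simply as an independent draw that is classified correctly with the assumed true accuracy.

```python
import random
import statistics


def measured_accuracy(true_accuracy, n_test, rng):
    """Accuracy observed on n_test independently drawn test examples."""
    correct = sum(1 for _ in range(n_test) if rng.random() < true_accuracy)
    return correct / n_test


def spread_of_estimates(true_accuracy, n_test, trials=2000, seed=0):
    """Standard deviation of the measured accuracy across repeated test sets."""
    rng = random.Random(seed)
    estimates = [measured_accuracy(true_accuracy, n_test, rng)
                 for _ in range(trials)]
    return statistics.pstdev(estimates)


# Hypothetical true accuracy of 0.7, measured on small and large test sets.
spread_small = spread_of_estimates(0.7, n_test=25)
spread_large = spread_of_estimates(0.7, n_test=400)
print(f"n_test=25:  spread of estimates {spread_small:.3f}")
print(f"n_test=400: spread of estimates {spread_large:.3f}")
```

Under these assumptions the estimates from the 25-example test sets scatter several times more widely around 0.7 than those from the 400-example test sets, even though neither is biased.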
5.2 ESTIMATING HYPOTHESIS ACCURACY

When evaluating a learned hypothesis we are most often interested in estimating the accuracy with which it will classify future instances. At the same time, we would like to know the probable error in this accuracy estimate (i.e., what error bars to associate with this estimate).

Throughout this chapter we consider the following setting for the learning problem. There is some space of possible instances X (e.g., the set of all people) over which various target functions may be defined (e.g., people who plan to purchase new skis this year). We assume that different instances in X may be encountered with different frequencies. A convenient way to model this is to assume there is some unknown probability distribution D that defines the probability of encountering each instance in X (e.g., D might assign a higher probability to encountering 19-year-old people than 109-year-old people). Notice D says nothing about whether x is a positive or negative example; it only determines the probability that x will be encountered. The learning task is to learn the target concept or target function f by considering a space H of possible hypotheses. Training examples of the target function f are provided to the learner by a trainer who draws each instance independently, according to the distribution D, and who then forwards the instance x along with its correct target value f(x) to the learner.

To illustrate, consider learning the target function "people who plan to purchase new skis this year," given a sample of training data collected by surveying people as they arrive at a ski resort. In this case the instance space X is the space of all people, who might be described by attributes such as their age, occupation, how many times they skied last year, etc. The distribution D specifies for each person x the probability that x will be encountered as the next person arriving at the ski resort.
The target function f : X → {0, 1} classifies each person according to whether or not they plan to purchase skis this year.

Within this general setting we are interested in the following two questions:

1. Given a hypothesis h and a data sample containing n examples drawn at random according to the distribution D, what is the best estimate of the accuracy of h over future instances drawn from the same distribution?
2. What is the probable error in this accuracy estimate?

5.2.1 Sample Error and True Error

To answer these questions, we need to distinguish carefully between two notions of accuracy or, equivalently, error. One is the error rate of the hypothesis over the sample of data that is available. The other is the error rate of the hypothesis over the entire unknown distribution D of examples. We will call these the sample error and the true error respectively.

The sample error of a hypothesis with respect to some sample S of instances drawn from X is the fraction of S that it misclassifies:

Definition: The sample error (denoted error_S(h)) of hypothesis h with respect to target function f and data sample S is

    $$error_S(h) \equiv \frac{1}{n} \sum_{x \in S} \delta(f(x), h(x))$$

where n is the number of examples in S, and the quantity δ(f(x), h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.

The true error of a hypothesis is the probability that it will misclassify a single randomly drawn instance from the distribution D.

Definition: The true error (denoted error_D(h)) of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D.

    $$error_D(h) \equiv \Pr_{x \in D}[f(x) \neq h(x)]$$

Here the notation Pr with subscript x ∈ D denotes that the probability is taken over the instance distribution D.

What we usually wish to know is the true error error_D(h) of the hypothesis, because this is the error we can expect when applying the hypothesis to future examples.
All we can measure, however, is the sample error error_S(h) of the hypothesis for the data sample S that we happen to have in hand. The main question considered in this section is "How good an estimate of error_D(h) is provided by error_S(h)?"

5.2.2 Confidence Intervals for Discrete-Valued Hypotheses

Here we give an answer to the question "How good an estimate of error_D(h) is provided by error_S(h)?" for the case in which h is a discrete-valued hypothesis. More specifically, suppose we wish to estimate the true error for some discrete-valued hypothesis h, based on its observed sample error over a sample S, where

• the sample S contains n examples drawn independent of one another, and independent of h, according to the probability distribution D
• n ≥ 30
• hypothesis h commits r errors over these n examples (i.e., error_S(h) = r/n).

Under these conditions, statistical theory allows us to make the following assertions:

1. Given no other information, the most probable value of error_D(h) is error_S(h).
2. With approximately 95% probability, the true error error_D(h) lies in the interval

    $$error_S(h) \pm 1.96 \sqrt{\frac{error_S(h)\,(1 - error_S(h))}{n}}$$

To illustrate, suppose the data sample S contains n = 40 examples and that hypothesis h commits r = 12 errors over this data. In this case, the sample error error_S(h) = 12/40 = .30. Given no other information, the best estimate of the true error error_D(h) is the observed sample error .30. However, we do not expect this to be a perfect estimate of the true error. If we were to collect a second sample S' containing 40 new randomly drawn examples, we might expect the sample error error_S'(h) to vary slightly from the sample error error_S(h). We expect a difference due to the random differences in the makeup of S and S'. In fact, if we repeated this experiment over and over, each time drawing a new sample S_i containing 40 new examples, we would find that for approximately 95% of these experiments, the calculated interval would contain the true error.
For this reason, we call this interval the 95% confidence interval estimate for error_D(h). In the current example, where r = 12 and n = 40, the 95% confidence interval is, according to the above expression, 0.30 ± (1.96 · .07) = 0.30 ± .14.

    Confidence level N%:   50%   68%   80%   90%   95%   98%   99%
    Constant z_N:          0.67  1.00  1.28  1.64  1.96  2.33  2.58

TABLE 5.1 Values of z_N for two-sided N% confidence intervals.

The above expression for the 95% confidence interval can be generalized to any desired confidence level. The constant 1.96 is used in case we desire a 95% confidence interval. A different constant, z_N, is used to calculate the N% confidence interval. The general expression for approximate N% confidence intervals for error_D(h) is

    $$error_S(h) \pm z_N \sqrt{\frac{error_S(h)\,(1 - error_S(h))}{n}} \tag{5.1}$$

where the constant z_N is chosen depending on the desired confidence level, using the values of z_N given in Table 5.1. Thus, just as we could calculate the 95% confidence interval for error_D(h) to be 0.30 ± (1.96 · .07) (when r = 12, n = 40), we can calculate the 68% confidence interval in this case to be 0.30 ± (1.0 · .07). Note it makes intuitive sense that the 68% confidence interval is smaller than the 95% confidence interval, because we have reduced the probability with which we demand that error_D(h) fall into the interval.

Equation (5.1) describes how to calculate the confidence intervals, or error bars, for estimates of error_D(h) that are based on error_S(h). In using this expression, it is important to keep in mind that this applies only to discrete-valued hypotheses, that it assumes the sample S is drawn at random using the same distribution from which future data will be drawn, and that it assumes the data is independent of the hypothesis being tested. We should also keep in mind that the expression provides only an approximate confidence interval, though the approximation is quite good when the sample contains at least 30 examples, and error_S(h) is not too close to 0 or 1.
A more accurate rule of thumb is that the above approximation works well when

    $$n \, error_S(h)\,(1 - error_S(h)) \geq 5$$

Above we summarized the procedure for calculating confidence intervals for discrete-valued hypotheses. The following section presents the underlying statistical justification for this procedure.

5.3 BASICS OF SAMPLING THEORY

This section introduces basic notions from statistics and sampling theory, including probability distributions, expected value, variance, Binomial and Normal distributions, and two-sided and one-sided intervals. A basic familiarity with these concepts is important to understanding how to evaluate hypotheses and learning algorithms. Even more important, these same notions provide an important conceptual framework for understanding machine learning issues such as overfitting and the relationship between successful generalization and the number of training examples considered. The reader who is already familiar with these notions may skip or skim this section without loss of continuity. The key concepts introduced in this section are summarized in Table 5.2.

• A random variable can be viewed as the name of an experiment with a probabilistic outcome. Its value is the outcome of the experiment.
• A probability distribution for a random variable Y specifies the probability Pr(Y = y_i) that Y will take on the value y_i, for each possible value y_i.
• The expected value, or mean, of a random variable Y is E[Y] = Σ_i y_i Pr(Y = y_i). The symbol μ_Y is commonly used to represent E[Y].
• The variance of a random variable is Var(Y) = E[(Y - μ_Y)²]. The variance characterizes the width or dispersion of the distribution about its mean.
• The standard deviation of Y is √Var(Y). The symbol σ_Y is often used to represent the standard deviation of Y.
• The Binomial distribution gives the probability of observing r heads in a series of n independent coin tosses, if the probability of heads in a single toss is p.
• The Normal distribution is a bell-shaped probability distribution that covers many natural phenomena.
• The Central Limit Theorem is a theorem stating that the sum of a large number of independent, identically distributed random variables approximately follows a Normal distribution.
• An estimator is a random variable Y used to estimate some parameter p of an underlying population.
• The estimation bias of Y as an estimator for p is the quantity (E[Y] - p). An unbiased estimator is one for which the bias is zero.
• An N% confidence interval estimate for parameter p is an interval that includes p with probability N%.

TABLE 5.2 Basic definitions and facts from statistics.

5.3.1 Error Estimation and Estimating Binomial Proportions

Precisely how does the deviation between sample error and true error depend on the size of the data sample? This question is an instance of a well-studied problem in statistics: the problem of estimating the proportion of a population that exhibits some property, given the observed proportion over some random sample of the population. In our case, the property of interest is that h misclassifies the example.

The key to answering this question is to note that when we measure the sample error we are performing an experiment with a random outcome. We first collect a random sample S of n independently drawn instances from the distribution D, and then measure the sample error error_S(h). As noted in the previous section, if we were to repeat this experiment many times, each time drawing a different random sample S_i of size n, we would expect to observe different values for the various error_Si(h), depending on random differences in the makeup of the various S_i. We say in such cases that error_Si(h), the outcome of the ith such experiment, is a random variable. In general, one can think of a random variable as the name of an experiment with a random outcome. The value of the random variable is the observed outcome of the random experiment.
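The idea that the sample error is the outcome of a random experiment can be made concrete with a toy setting. Everything in the sketch below is an assumption made for illustration: the instance space is the integers 0 to 99 drawn uniformly, and the target function `f` and hypothesis `h` are hypothetical thresholds chosen so that they disagree on exactly 30 of the 100 instances, giving a true error of 0.30.

```python
import random


def f(x):
    """Hypothetical target function over the instances 0..99."""
    return x >= 50


def h(x):
    """Hypothetical hypothesis; it disagrees with f exactly on 20..49."""
    return x >= 20


def sample_error(hyp, target, sample):
    """Fraction of the sample that hyp misclassifies."""
    return sum(1 for x in sample if hyp(x) != target(x)) / len(sample)


rng = random.Random(1)
n, k = 40, 1000   # sample size and number of repeated experiments

# Each experiment draws a fresh sample of n instances and measures the error.
errors = [sample_error(h, f, [rng.randrange(100) for _ in range(n)])
          for _ in range(k)]

true_error = sum(1 for x in range(100) if h(x) != f(x)) / 100
mean_error = sum(errors) / k
print(f"true error: {true_error:.2f}")
print(f"first five sample errors: {errors[:5]}")
print(f"mean of {k} sample errors: {mean_error:.3f}")
```

The individual sample errors differ from one experiment to the next, while their average stays close to the true error, which is the behavior the following subsections analyze.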
Imagine that we were to run k such random experiments, measuring the random variables error_S1(h), error_S2(h), ..., error_Sk(h). Imagine further that we then plotted a histogram displaying the frequency with which we observed each possible error value. As we allowed k to grow, the histogram would approach the form of the distribution shown in Table 5.3. This table describes a particular probability distribution called the Binomial distribution.

[Plot: Binomial distribution for n = 40, p = 0.3. Horizontal axis: r from 0 to 40; vertical axis: P(r) from 0 to 0.14.]

A Binomial distribution gives the probability of observing r heads in a sample of n independent coin tosses, when the probability of heads on a single coin toss is p. It is defined by the probability function

    $$P(r) = \frac{n!}{r!\,(n-r)!} \, p^r (1-p)^{n-r}$$

If the random variable X follows a Binomial distribution, then:

• The probability Pr(X = r) that X will take on the value r is given by P(r).
• The expected, or mean value of X, E[X], is E[X] = np.
• The variance of X, Var(X), is Var(X) = np(1 - p).
• The standard deviation of X, σ_X, is σ_X = √(np(1 - p)).

For sufficiently large values of n the Binomial distribution is closely approximated by a Normal distribution (see Table 5.4) with the same mean and variance. Most statisticians recommend using the Normal approximation only when np(1 - p) ≥ 5.

TABLE 5.3 The Binomial distribution.

5.3.2 The Binomial Distribution

A good way to understand the Binomial distribution is to consider the following problem. You are given a worn and bent coin and asked to estimate the probability that the coin will turn up heads when tossed. Let us call this unknown probability of heads p. You toss the coin n times and record the number of times r that it turns up heads. A reasonable estimate of p is r/n.
Note that if the experiment were rerun, generating a new set of n coin tosses, we might expect the number of heads r to vary somewhat from the value measured in the first experiment, yielding a somewhat different estimate for p. The Binomial distribution describes for each possible value of r (i.e., from 0 to n), the probability of observing exactly r heads given a sample of n independent tosses of a coin whose true probability of heads is p.

Interestingly, estimating p from a random sample of coin tosses is equivalent to estimating error_D(h) from testing h on a random sample of instances. A single toss of the coin corresponds to drawing a single random instance from D and determining whether it is misclassified by h. The probability p that a single random coin toss will turn up heads corresponds to the probability that a single instance drawn at random will be misclassified (i.e., p corresponds to error_D(h)). The number r of heads observed over a sample of n coin tosses corresponds to the number of misclassifications observed over n randomly drawn instances. Thus r/n corresponds to error_S(h). The problem of estimating p for coins is identical to the problem of estimating error_D(h) for hypotheses. The Binomial distribution gives the general form of the probability distribution for the random variable r, whether it represents the number of heads in n coin tosses or the number of hypothesis errors in a sample of n examples. The detailed form of the Binomial distribution depends on the specific sample size n and the specific probability p or error_D(h).

The general setting to which the Binomial distribution applies is:

1. There is a base, or underlying, experiment (e.g., toss of the coin) whose outcome can be described by a random variable, say Y. The random variable Y can take on two possible values (e.g., Y = 1 if heads, Y = 0 if tails).
2.
The probability that Y = 1 on any single trial of the underlying experiment is given by some constant p, independent of the outcome of any other experiment. The probability that Y = 0 is therefore (1 - p). Typically, p is not known in advance, and the problem is to estimate it.
3. A series of n independent trials of the underlying experiment is performed (e.g., n independent coin tosses), producing the sequence of independent, identically distributed random variables Y_1, Y_2, ..., Y_n. Let R denote the number of trials for which Y_i = 1 in this series of n experiments.
4. The probability that the random variable R will take on a specific value r (e.g., the probability of observing exactly r heads) is given by the Binomial distribution

    $$\Pr(R = r) = \frac{n!}{r!\,(n-r)!} \, p^r (1-p)^{n-r} \tag{5.2}$$

A plot of this probability distribution is shown in Table 5.3.

The Binomial distribution characterizes the probability of observing r heads from n coin flip experiments, as well as the probability of observing r errors in a data sample containing n randomly drawn instances.

5.3.3 Mean and Variance

Two properties of a random variable that are often of interest are its expected value (also called its mean value) and its variance. The expected value is the average of the values taken on by repeatedly sampling the random variable. More precisely

Definition: Consider a random variable Y that takes on the possible values y_1, ..., y_n. The expected value of Y, E[Y], is

    $$E[Y] \equiv \sum_{i=1}^{n} y_i \Pr(Y = y_i)$$

For example, if Y takes on the value 1 with probability .7 and the value 2 with probability .3, then its expected value is (1 · 0.7 + 2 · 0.3 = 1.3). In case the random variable Y is governed by a Binomial distribution, then it can be shown that

    $$E[Y] = np \tag{5.4}$$

where n and p are the parameters of the Binomial distribution defined in Equation (5.2).

A second property, the variance, captures the "width" or "spread" of the probability distribution; that is, it captures how far the random variable is expected to vary from its mean value.
Definition: The variance of a random variable Y, Var[Y], is

    $$Var[Y] \equiv E[(Y - E[Y])^2] \tag{5.5}$$

The variance describes the expected squared error in using a single observation of Y to estimate its mean E[Y]. The square root of the variance is called the standard deviation of Y, denoted σ_Y.

Definition: The standard deviation of a random variable Y, σ_Y, is

    $$\sigma_Y \equiv \sqrt{E[(Y - E[Y])^2]}$$

In case the random variable Y is governed by a Binomial distribution, then the variance and standard deviation are given by

    $$Var[Y] = np(1 - p), \qquad \sigma_Y = \sqrt{np(1 - p)} \tag{5.7}$$

5.3.4 Estimators, Bias, and Variance

Now that we have shown that the random variable error_S(h) obeys a Binomial distribution, we return to our primary question: What is the likely difference between error_S(h) and the true error error_D(h)?

Let us describe error_S(h) and error_D(h) using the terms in Equation (5.2) defining the Binomial distribution. We then have

    $$error_S(h) = \frac{r}{n}, \qquad error_D(h) = p$$

where n is the number of instances in the sample S, r is the number of instances from S misclassified by h, and p is the probability of misclassifying a single instance drawn from D.

Statisticians call error_S(h) an estimator for the true error error_D(h). In general, an estimator is any random variable used to estimate some parameter of the underlying population from which the sample is drawn. An obvious question to ask about any estimator is whether on average it gives the right estimate. We define the estimation bias to be the difference between the expected value of the estimator and the true value of the parameter.

Definition: The estimation bias of an estimator Y for an arbitrary parameter p is

    $$E[Y] - p$$

If the estimation bias is zero, we say that Y is an unbiased estimator for p. Notice this will be the case if the average of many random values of Y generated by repeated random experiments (i.e., E[Y]) converges toward p.

Is error_S(h) an unbiased estimator for error_D(h)? Yes, because for a Binomial distribution the expected value of r is equal to np (Equation (5.4)).
It follows, given that n is a constant, that the expected value of r/n is p.

Two quick remarks are in order regarding the estimation bias. First, when we mentioned at the beginning of this chapter that testing the hypothesis on the training examples provides an optimistically biased estimate of hypothesis error, it is exactly this notion of estimation bias to which we were referring. In order for $error_S(h)$ to give an unbiased estimate of $error_{\mathcal{D}}(h)$, the hypothesis h and sample S must be chosen independently. Second, this notion of estimation bias should not be confused with the inductive bias of a learner introduced in Chapter 2. The estimation bias is a numerical quantity, whereas the inductive bias is a set of assertions.

A second important property of any estimator is its variance. Given a choice among alternative unbiased estimators, it makes sense to choose the one with least variance. By our definition of variance, this choice will yield the smallest expected squared error between the estimate and the true value of the parameter.

To illustrate these concepts, suppose we test a hypothesis and find that it commits r = 12 errors on a sample of n = 40 randomly drawn test examples. Then an unbiased estimate for $error_{\mathcal{D}}(h)$ is given by $error_S(h)$ = r/n = 0.3. The variance in this estimate arises completely from the variance in r, because n is a constant. Because r is Binomially distributed, its variance is given by Equation (5.7) as np(1 - p). Unfortunately p is unknown, but we can substitute our estimate r/n for p. This yields an estimated variance in r of 40 · 0.3(1 - 0.3) = 8.4, or a corresponding standard deviation of $\sqrt{8.4} \approx 2.9$. This implies that the standard deviation in $error_S(h)$ = r/n is approximately 2.9/40 ≈ .07. To summarize, $error_S(h)$ in this case is observed to be 0.30, with a standard deviation of approximately 0.07. (See Exercise 5.1.)
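The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the calculation described above; the function name `sample_error_std` is made up for illustration, and the unknown p is replaced by the estimate r/n, as in the text.

```python
import math

def sample_error_std(r: int, n: int) -> float:
    """Approximate standard deviation of error_S(h) = r/n.

    The unknown p is replaced by the estimate r/n, as in the text.
    """
    p_hat = r / n
    return math.sqrt(p_hat * (1.0 - p_hat) / n)

r, n = 12, 40
p_hat = r / n                       # observed sample error, 0.30
var_r = n * p_hat * (1.0 - p_hat)   # estimated variance of r
std_r = math.sqrt(var_r)            # estimated standard deviation of r
std_err = sample_error_std(r, n)    # standard deviation of r/n, equal to std_r / n

print(f"error_S(h) = {p_hat:.2f}, Var[r] = {var_r:.1f}, "
      f"sd[r] = {std_r:.1f}, sd[error_S(h)] = {std_err:.2f}")
```

Dividing the standard deviation of r by the constant n gives the same value as applying the square-root formula to r/n directly, which is why the two routes agree.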
In general, given r errors in a sample of n independently drawn test examples, the standard deviation for $error_S(h)$ is given by

$$\sigma_{error_S(h)} = \frac{\sigma_r}{n} = \sqrt{\frac{p(1-p)}{n}} \tag{5.8}$$

which can be approximated by substituting r/n = $error_S(h)$ for p

$$\sigma_{error_S(h)} \approx \sqrt{\frac{error_S(h)\,(1 - error_S(h))}{n}} \tag{5.9}$$

5.3.5 Confidence Intervals

One common way to describe the uncertainty associated with an estimate is to give an interval within which the true value is expected to fall, along with the probability with which it is expected to fall into this interval. Such estimates are called confidence interval estimates.

Definition: An N% confidence interval for some parameter p is an interval that is expected with probability N% to contain p.

For example, if we observe r = 12 errors in a sample of n = 40 independently drawn examples, we can say with approximately 95% probability that the interval 0.30 ± 0.14 contains the true error $error_{\mathcal{D}}(h)$.

How can we derive confidence intervals for $error_{\mathcal{D}}(h)$? The answer lies in the fact that we know the Binomial probability distribution governing the estimator $error_S(h)$. The mean value of this distribution is $error_{\mathcal{D}}(h)$, and the standard deviation is given by Equation (5.9). Therefore, to derive a 95% confidence interval, we need only find the interval centered around the mean value $error_{\mathcal{D}}(h)$, which is wide enough to contain 95% of the total probability under this distribution. This provides an interval surrounding $error_{\mathcal{D}}(h)$ into which $error_S(h)$ must fall 95% of the time. Equivalently, it provides the size of the interval surrounding $error_S(h)$ into which $error_{\mathcal{D}}(h)$ must fall 95% of the time.

For a given value of N how can we find the size of the interval that contains N% of the probability mass? Unfortunately, for the Binomial distribution this calculation can be quite tedious. Fortunately, however, an easily calculated and very good approximation can be found in most cases, based on the fact that for sufficiently large sample sizes the Binomial distribution can be closely approximated by the Normal distribution.
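To make the "tedious" Binomial calculation concrete, the sketch below sums exact Binomial probabilities over the interval that lies within 1.96 standard deviations of the mean, for the illustrative values n = 40 and p = 0.3. The helper names are hypothetical, and 1.96 is the standard Normal constant for a two-sided 95% interval. If the Normal approximation is reasonable, the exact mass should come out close to 0.95.

```python
import math

def binomial_pmf(r: int, n: int, p: float) -> float:
    """Pr(R = r) for a Binomial distribution, as in Equation (5.2)."""
    return math.comb(n, r) * p**r * (1.0 - p) ** (n - r)

def binomial_mass_within(n: int, p: float, num_std: float) -> float:
    """Exact Binomial probability that R lies within num_std standard
    deviations of its mean np."""
    mean = n * p
    std = math.sqrt(n * p * (1.0 - p))
    low, high = mean - num_std * std, mean + num_std * std
    return sum(binomial_pmf(r, n, p) for r in range(n + 1) if low <= r <= high)

# Illustrative values only: n = 40 trials with p = 0.3.
mass = binomial_mass_within(n=40, p=0.3, num_std=1.96)
print(f"exact Binomial mass within 1.96 standard deviations: {mass:.3f}")
```

Because R takes only integer values, the exact mass changes in steps as the interval widens, so it will not equal 0.95 exactly even when the approximation is good.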
The Normal distribution, summarized in Table 5.4, is perhaps the most well-studied probability distribution in statistics. As illustrated in Table 5.4, it is a bell-shaped distribution fully specified by its mean μ and standard deviation σ. For large n, any Binomial distribution is very closely approximated by a Normal distribution with the same mean and variance.

[Plot: Normal distribution with mean 0, standard deviation 1; horizontal axis from -3 to 3.]

A Normal distribution (also called a Gaussian distribution) is a bell-shaped distribution defined by the probability density function

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

A Normal distribution is fully determined by two parameters in the above formula: μ and σ.

If the random variable X follows a normal distribution, then:

• The probability that X will fall into the interval (a, b) is given by $\int_a^b p(x)\,dx$
• The expected, or mean value of X, E[X], is E[X] = μ
• The variance of X, Var(X), is Var(X) = σ²
• The standard deviation of X, $\sigma_X$, is $\sigma_X$ = σ

The Central Limit Theorem (Section 5.4.1) states that the sum of a large number of independent, identically distributed random variables follows a distribution that is approximately Normal.

TABLE 5.4 The Normal or Gaussian distribution.

One reason that we prefer to work with the Normal distribution is that most statistics references give tables specifying the size of the interval about the mean that contains N% of the probability mass under the Normal distribution. This is precisely the information needed to calculate our N% confidence interval. In fact, Table 5.1 is such a table. The constant $z_N$ given in Table 5.1 defines the width of the smallest interval about the mean that includes N% of the total probability mass under the bell-shaped Normal distribution. More precisely, $z_N$ gives half the width of the interval (i.e., the distance from the mean in either direction) measured in standard deviations. Figure 5.1(a) illustrates such an interval for $z_{.80}$.
To summarize, if a random variable Y obeys a Normal distribution with mean μ and standard deviation σ, then the measured random value y of Y will fall into the following interval N% of the time

$$\mu \pm z_N \sigma \tag{5.10}$$

Equivalently, the mean μ will fall into the following interval N% of the time

$$y \pm z_N \sigma \tag{5.11}$$

We can easily combine this fact with earlier facts to derive the general expression for N% confidence intervals for discrete-valued hypotheses given in Equation (5.1). First, we know that $error_S(h)$ follows a Binomial distribution with mean value $error_{\mathcal{D}}(h)$ and standard deviation as given in Equation (5.9). Second, we know that for sufficiently large sample size n, this Binomial distribution is well approximated by a Normal distribution. Third, Equation (5.11) tells us how to find the N% confidence interval for estimating the mean value of a Normal distribution. Therefore, substituting the mean and standard deviation of $error_S(h)$ into Equation (5.11) yields the expression from Equation (5.1) for N% confidence intervals for discrete-valued hypotheses

$$error_S(h) \pm z_N \sqrt{\frac{error_S(h)\,(1 - error_S(h))}{n}}$$

FIGURE 5.1 A Normal distribution with mean 0, standard deviation 1. (a) With 80% confidence, the value of the random variable will lie in the two-sided interval [-1.28, 1.28]. Note $z_{.80}$ = 1.28. With 10% confidence it will lie to the right of this interval, and with 10% confidence it will lie to the left. (b) With 90% confidence, it will lie in the one-sided interval [-∞, 1.28].

Recall that two approximations were involved in deriving this expression, namely:

1. in estimating the standard deviation σ of $error_S(h)$, we have approximated $error_{\mathcal{D}}(h)$ by $error_S(h)$ [i.e., in going from Equation (5.8) to (5.9)], and
2. the Binomial distribution has been approximated by the Normal distribution.

The common rule of thumb in statistics is that these two approximations are very good as long as n ≥ 30, or when np(1 - p) ≥ 5. For smaller values of n it is wise to use a table giving exact values for the Binomial distribution.
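The interval expression above translates directly into code. The following is a minimal sketch, not a reference implementation: the function names are invented for illustration, and the `Z_N` dictionary holds standard two-sided Normal constants (the kind of values Table 5.1 lists), entered here by hand as an assumption.

```python
import math

# Standard two-sided Normal constants z_N, keyed by confidence level N (percent).
# Entered by hand for this sketch; Table 5.1 in the text plays this role.
Z_N = {50: 0.67, 68: 1.00, 80: 1.28, 90: 1.64, 95: 1.96, 98: 2.33, 99: 2.58}

def error_confidence_interval(r: int, n: int, confidence: int = 95):
    """Approximate two-sided N% confidence interval for the true error,
    given r misclassifications among n independently drawn test examples."""
    e = r / n                                   # sample error error_S(h)
    half_width = Z_N[confidence] * math.sqrt(e * (1.0 - e) / n)
    return e - half_width, e + half_width

def normal_approximation_ok(n: int, p: float) -> bool:
    """Rule of thumb quoted in the text: n >= 30, or n p (1 - p) >= 5."""
    return n >= 30 or n * p * (1.0 - p) >= 5

low, high = error_confidence_interval(r=12, n=40, confidence=95)
print(f"95% interval: [{low:.2f}, {high:.2f}]")   # roughly 0.30 +/- 0.14
```

For r = 12 and n = 40 the half-width is about 0.14, matching the 0.30 ± 0.14 interval quoted earlier in the section.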
5.3.6 Two-sided and One-sided Bounds

Notice that the above confidence interval is a two-sided bound; that is, it bounds the estimated quantity from above and from below. In some cases, we will be interested only in a one-sided bound. For example, we might be interested in the question "What is the probability that $error_{\mathcal{D}}(h)$ is at most U?" This kind of one-sided question is natural when we are only interested in bounding the maximum error of h and do not mind if the true error is much smaller than estimated.

There is an easy modification to the above procedure for finding such one-sided error bounds. It follows from the fact that the Normal distribution is symmetric about its mean. Because of this fact, any two-sided confidence interval based on a Normal distribution can be converted to a corresponding one-sided interval with twice the confidence, in the sense that the probability of falling outside the interval is halved (see Figure 5.1(b)). That is, a 100(1 - α)% confidence interval with lower bound L and upper bound U implies a 100(1 - α/2)% confidence interval with lower bound L and no upper bound. It also implies a 100(1 - α/2)% confidence interval with upper bound U and no lower bound. Here α corresponds to the probability that the correct value lies outside the stated interval. In other words, α is the probability that the value will fall into the unshaded region in Figure 5.1(a), and α/2 is the probability that it will fall into the unshaded region in Figure 5.1(b).

To illustrate, consider again the example in which h commits r = 12 errors over a sample of n = 40 independently drawn examples. As discussed above, this leads to a (two-sided) 95% confidence interval of 0.30 ± 0.14. In this case, 100(1 - α) = 95%, so α = 0.05. Thus, we can apply the above rule to say with 100(1 - α/2) = 97.5% confidence that $error_{\mathcal{D}}(h)$ is at most 0.30 + 0.14 = 0.44, making no assertion about the lower bound on $error_{\mathcal{D}}(h)$.
Thus, we have a one-sided error bound on $error_{\mathcal{D}}(h)$ with double the confidence that we had in the corresponding two-sided bound (see Exercise 5.3).

5.4 A GENERAL APPROACH FOR DERIVING CONFIDENCE INTERVALS

The previous section described in detail how to derive confidence interval estimates for one particular case: estimating $error_{\mathcal{D}}(h)$ for a discrete-valued hypothesis h, based on a sample of n independently drawn instances. The approach described there illustrates a general approach followed in many estimation problems. In particular, we can see this as a problem of estimating the mean (expected value) of a population based on the mean of a randomly drawn sample of size n. The general process includes the following steps:

1. Identify the underlying population parameter p to be estimated, for example, $error_{\mathcal{D}}(h)$.
2. Define the estimator Y (e.g., $error_S(h)$). It is desirable to choose a minimum-variance, unbiased estimator.
3. Determine the probability distribution $\mathcal{D}_Y$ that governs the estimator Y, including its mean and variance.
4. Determine the N% confidence interval by finding thresholds L and U such that N% of the mass in the probability distribution $\mathcal{D}_Y$ falls between L and U.

In later sections of this chapter we apply this general approach to several other estimation problems common in machine learning. First, however, let us discuss a fundamental result from estimation theory called the Central Limit Theorem.

5.4.1 Central Limit Theorem

One essential fact that simplifies attempts to derive confidence intervals is the Central Limit Theorem. Consider again our general setting, in which we observe the values of n independently drawn random variables $Y_1 \ldots Y_n$ that obey the same unknown underlying probability distribution (e.g., n tosses of the same coin). Let μ denote the mean of the unknown distribution governing each of the $Y_i$ and let σ denote the standard deviation.
We say that these variables $Y_i$ are independent, identically distributed random variables, because they describe independent experiments, each obeying the same underlying probability distribution. In an attempt to estimate the mean μ of the distribution governing the $Y_i$, we calculate the sample mean $\bar{Y}_n \equiv \frac{1}{n}\sum_{i=1}^{n} Y_i$ (e.g., the fraction of heads among the n coin tosses). The Central Limit Theorem states that the probability distribution governing $\bar{Y}_n$ approaches a Normal distribution as n → ∞, regardless of the distribution that governs the underlying random variables $Y_i$. Furthermore, the mean of the distribution governing $\bar{Y}_n$ approaches μ and the standard deviation approaches $\frac{\sigma}{\sqrt{n}}$. More precisely,

Theorem 5.1. Central Limit Theorem. Consider a set of independent, identically distributed random variables $Y_1 \ldots Y_n$ governed by an arbitrary probability distribution with mean μ and finite variance σ². Define the sample mean, $\bar{Y}_n \equiv \frac{1}{n}\sum_{i=1}^{n} Y_i$. Then as n → ∞, the distribution governing

$$\frac{\bar{Y}_n - \mu}{\sigma / \sqrt{n}}$$

approaches a Normal distribution, with zero mean and standard deviation equal to 1.

This is a quite surprising fact, because it states that we know the form of the distribution that governs the sample mean $\bar{Y}$ even when we do not know the form of the underlying distribution that governs the individual $Y_i$ that are being observed! Furthermore, the Central Limit Theorem describes how the mean and variance of $\bar{Y}$ can be used to determine the mean and variance of the individual $Y_i$.

The Central Limit Theorem is a very useful fact, because it implies that whenever we define an estimator that is the mean of some sample (e.g., $error_S(h)$ is the mean error), the distribution governing this estimator can be approximated by a Normal distribution for sufficiently large n. If we also know the variance for this (approximately) Normal distribution, then we can use Equation (5.11) to compute confidence intervals. A common rule of thumb is that we can use the Normal approximation when n ≥ 30.
Recall that in the preceding section we used such a Normal distribution to approximate the Binomial distribution that more precisely describes $error_S(h)$.

5.5 DIFFERENCE IN ERROR OF TWO HYPOTHESES

Consider the case where we have two hypotheses $h_1$ and $h_2$ for some discrete-valued target function. Hypothesis $h_1$ has been tested on a sample $S_1$ containing $n_1$ randomly drawn examples, and $h_2$ has been tested on an independent sample $S_2$ containing $n_2$ examples drawn from the same distribution. Suppose we wish to estimate the difference d between the true errors of these two hypotheses.

$$d \equiv error_{\mathcal{D}}(h_1) - error_{\mathcal{D}}(h_2)$$

We will use the generic four-step procedure described at the beginning of Section 5.4 to derive a confidence interval estimate for d. Having identified d as the parameter to be estimated, we next define an estimator. The obvious choice for an estimator in this case is the difference between the sample errors, which we denote by $\hat{d}$

$$\hat{d} \equiv error_{S_1}(h_1) - error_{S_2}(h_2)$$

Although we will not prove it here, it can be shown that $\hat{d}$ gives an unbiased estimate of d; that is, $E[\hat{d}] = d$.

What is the probability distribution governing the random variable $\hat{d}$? From earlier sections, we know that for large $n_1$ and $n_2$ (e.g., both ≥ 30), both $error_{S_1}(h_1)$ and $error_{S_2}(h_2)$ follow distributions that are approximately Normal. Because the difference of two independent, Normally distributed random variables is also Normally distributed, $\hat{d}$ will also follow a distribution that is approximately Normal, with mean d. It can also be shown that the variance of this distribution is the sum of the variances of $error_{S_1}(h_1)$ and $error_{S_2}(h_2)$. Using Equation (5.9) to obtain the approximate variance of each of these distributions, we have

$$\sigma_{\hat{d}}^2 \approx \frac{error_{S_1}(h_1)\,(1 - error_{S_1}(h_1))}{n_1} + \frac{error_{S_2}(h_2)\,(1 - error_{S_2}(h_2))}{n_2} \tag{5.12}$$

Now that we have determined the probability distribution that governs the estimator $\hat{d}$, it is straightforward to derive confidence intervals that characterize the likely error in employing $\hat{d}$ to estimate d.
For a random variable $\hat{d}$ obeying a Normal distribution with mean d and variance σ², the N% confidence interval estimate for d is $\hat{d} \pm z_N \sigma$. Using the approximate variance $\sigma_{\hat{d}}^2$ given above, this approximate N% confidence interval estimate for d is

$$\hat{d} \pm z_N \sqrt{\frac{error_{S_1}(h_1)\,(1 - error_{S_1}(h_1))}{n_1} + \frac{error_{S_2}(h_2)\,(1 - error_{S_2}(h_2))}{n_2}} \tag{5.13}$$

where $z_N$ is the same constant described in Table 5.1. The above expression gives the general two-sided confidence interval for estimating the difference between errors of two hypotheses. In some situations we might be interested in one-sided bounds, either bounding the largest possible difference in errors or the smallest, with some confidence level. One-sided confidence intervals can be obtained by modifying the above expression as described in Section 5.3.6.

Although the above analysis considers the case in which $h_1$ and $h_2$ are tested on independent data samples, it is often acceptable to use the confidence interval seen in Equation (5.13) in the setting where $h_1$ and $h_2$ are tested on a single sample S (where S is still independent of $h_1$ and $h_2$). In this latter case, we redefine $\hat{d}$ as

$$\hat{d} \equiv error_S(h_1) - error_S(h_2)$$

The variance in this new $\hat{d}$ will usually be smaller than the variance given by Equation (5.12), when we set $S_1$ and $S_2$ to S. This is because using a single sample S eliminates the variance due to random differences in the compositions of $S_1$ and $S_2$. In this case, the confidence interval given by Equation (5.13) will generally be an overly conservative, but still correct, interval.

5.5.1 Hypothesis Testing

In some cases we are interested in the probability that some specific conjecture is true, rather than in confidence intervals for some parameter. Suppose, for example, that we are interested in the question "what is the probability that $error_{\mathcal{D}}(h_1) > error_{\mathcal{D}}(h_2)$?"
Following the setting in the previous section, suppose we measure the sample errors for $h_1$ and $h_2$ using two independent samples $S_1$ and $S_2$ of size 100 and find that $error_{S_1}(h_1)$ = .30 and $error_{S_2}(h_2)$ = .20, hence the observed difference is $\hat{d}$ = .10. Of course, due to random variation in the data sample, we might observe this difference in the sample errors even when $error_{\mathcal{D}}(h_1) \le error_{\mathcal{D}}(h_2)$. What is the probability that $error_{\mathcal{D}}(h_1) > error_{\mathcal{D}}(h_2)$, given the observed difference in sample errors $\hat{d}$ = .10 in this case? Equivalently, what is the probability that d > 0, given that we observed $\hat{d}$ = .10?

Note the probability Pr(d > 0) is equal to the probability that $\hat{d}$ has not overestimated d by more than .10. Put another way, this is the probability that $\hat{d}$ falls into the one-sided interval $\hat{d} < d + .10$. Since d is the mean of the distribution governing $\hat{d}$, we can equivalently express this one-sided interval as $\hat{d} < \mu_{\hat{d}} + .10$.

To summarize, the probability Pr(d > 0) equals the probability that $\hat{d}$ falls into the one-sided interval $\hat{d} < \mu_{\hat{d}} + .10$. Since we already calculated the approximate distribution governing $\hat{d}$ in the previous section, we can determine the probability that $\hat{d}$ falls into this one-sided interval by calculating the probability mass of the $\hat{d}$ distribution within this interval.

Let us begin this calculation by re-expressing the interval $\hat{d} < \mu_{\hat{d}} + .10$ in terms of the number of standard deviations it allows deviating from the mean. Using Equation (5.12) we find that $\sigma_{\hat{d}} \approx .061$, so we can re-express the interval as approximately

$$\hat{d} < \mu_{\hat{d}} + 1.64\,\sigma_{\hat{d}}$$

What is the confidence level associated with this one-sided interval for a Normal distribution? Consulting Table 5.1, we find that 1.64 standard deviations about the mean corresponds to a two-sided interval with confidence level 90%. Therefore, the one-sided interval will have an associated confidence level of 95%.

Therefore, given the observed $\hat{d}$ = .10, the probability that $error_{\mathcal{D}}(h_1) > error_{\mathcal{D}}(h_2)$ is approximately .95.
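The calculation just described can be reproduced numerically. The sketch below assumes the Normal approximation of the previous section and uses the standard Normal cumulative distribution function, computed with `math.erf`, in place of the Table 5.1 lookup. The function names are illustrative only.

```python
import math

def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard Normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def difference_std(e1: float, n1: int, e2: float, n2: int) -> float:
    """Approximate standard deviation of d_hat, following Equation (5.12)."""
    return math.sqrt(e1 * (1.0 - e1) / n1 + e2 * (1.0 - e2) / n2)

# Values from the example in the text.
e1, n1 = 0.30, 100   # sample error of h1 on S1
e2, n2 = 0.20, 100   # sample error of h2 on S2

d_hat = e1 - e2                          # observed difference, .10
sigma = difference_std(e1, n1, e2, n2)   # about .061
z = d_hat / sigma                        # about 1.64 standard deviations
confidence = normal_cdf(z)               # one-sided mass below the mean + z * sigma

print(f"sigma = {sigma:.3f}, z = {z:.2f}, one-sided confidence = {confidence:.2f}")
```

Using the cumulative distribution function gives the one-sided probability directly, so no conversion from a two-sided table entry is needed.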
In the terminology of the statistics literature, we say that we accept the hypothesis that "$error_{\mathcal{D}}(h_1) > error_{\mathcal{D}}(h_2)$" with confidence 0.95. Alternatively, we may state that we reject the opposite hypothesis (often called the null hypothesis) at a (1 - 0.95) = .05 level of significance.

5.6 COMPARING LEARNING ALGORITHMS

Often we are interested in comparing the performance of two learning algorithms $L_A$ and $L_B$, rather than two specific hypotheses. What is an appropriate test for comparing learning algorithms, and how can we determine whether an observed difference between the algorithms is statistically significant? Although there is active debate within the machine-learning research community regarding the best method for comparison, we present here one reasonable approach. A discussion of alternative methods is given by Dietterich (1996).

As usual, we begin by specifying the parameter we wish to estimate. Suppose we wish to determine which of $L_A$ and $L_B$ is the better learning method on average for learning some particular target function f. A reasonable way to define "on average" is to consider the relative performance of these two algorithms averaged over all the training sets of size n that might be drawn from the underlying instance distribution $\mathcal{D}$. In other words, we wish to estimate the expected value of the difference in their errors

$$E_{S \subset \mathcal{D}}\left[error_{\mathcal{D}}(L_A(S)) - error_{\mathcal{D}}(L_B(S))\right] \tag{5.14}$$

where L(S) denotes the hypothesis output by learning method L when given the sample S of training data and where the subscript $S \subset \mathcal{D}$ indicates that the expected value is taken over samples S drawn according to the underlying instance distribution $\mathcal{D}$. The above expression describes the expected value of the difference in errors between learning methods $L_A$ and $L_B$.

Of course in practice we have only a limited sample $D_0$ of data when comparing learning methods. In such cases, one obvious approach to estimating the above quantity is to divide $D_0$ into a training set $S_0$ and a disjoint test set $T_0$.
The training data can be used to train both $L_A$ and $L_B$, and the test data can be used to compare the accuracy of the two learned hypotheses. In other words, we measure the quantity

$$error_{T_0}(L_A(S_0)) - error_{T_0}(L_B(S_0)) \tag{5.15}$$

Notice two key differences between this estimator and the quantity in Equation (5.14). First, we are using $error_{T_0}(h)$ to approximate $error_{\mathcal{D}}(h)$. Second, we are only measuring the difference in errors for one training set $S_0$ rather than taking the expected value of this difference over all samples S that might be drawn from the distribution $\mathcal{D}$.

One way to improve on the estimator given by Equation (5.15) is to repeatedly partition the data $D_0$ into disjoint training and test sets and to take the mean of the test set errors for these different experiments. This leads to the procedure shown in Table 5.5 for estimating the difference between errors of two learning methods, based on a fixed sample $D_0$ of available data. This procedure first partitions the data into k disjoint subsets of equal size, where this size is at least 30. It then trains and tests the learning algorithms k times, using each of the k subsets in turn as the test set, and using all remaining data as the training set. In this way, the learning algorithms are tested on k independent test sets, and the mean difference in errors $\bar{\delta}$ is returned as an estimate of the difference between the two learning algorithms.

The quantity $\bar{\delta}$ returned by the procedure of Table 5.5 can be taken as an estimate of the desired quantity from Equation (5.14). More appropriately, we can view $\bar{\delta}$ as an estimate of the quantity

$$E_{S \subset D_0}\left[error_{\mathcal{D}}(L_A(S)) - error_{\mathcal{D}}(L_B(S))\right] \tag{5.16}$$

where S represents a random sample of size $\frac{k-1}{k}|D_0|$ drawn uniformly from $D_0$. The only difference between this expression and our original expression in Equation (5.14) is that this new expression takes the expected value over subsets of the available data $D_0$, rather than over subsets drawn from the full instance distribution $\mathcal{D}$.

1. Partition the available data $D_0$ into k disjoint subsets $T_1, T_2, \ldots, T_k$ of equal size, where this size is at least 30.
2. For i from 1 to k, do
   use $T_i$ for the test set, and the remaining data for training set $S_i$
   • $S_i \leftarrow \{D_0 - T_i\}$
   • $h_A \leftarrow L_A(S_i)$
   • $h_B \leftarrow L_B(S_i)$
   • $\delta_i \leftarrow error_{T_i}(h_A) - error_{T_i}(h_B)$
3. Return the value $\bar{\delta}$, where

$$\bar{\delta} \equiv \frac{1}{k}\sum_{i=1}^{k} \delta_i$$

TABLE 5.5 A procedure to estimate the difference in error between two learning methods $L_A$ and $L_B$. Approximate confidence intervals for this estimate are given in the text.

The approximate N% confidence interval for estimating the quantity in Equation (5.16) using $\bar{\delta}$ is given by

$$\bar{\delta} \pm t_{N,k-1}\, s_{\bar{\delta}} \tag{5.17}$$

where $t_{N,k-1}$ is a constant that plays a role analogous to that of $z_N$ in our earlier confidence interval expressions, and where $s_{\bar{\delta}}$ is an estimate of the standard deviation of the distribution governing $\bar{\delta}$. In particular, $s_{\bar{\delta}}$ is defined as

$$s_{\bar{\delta}} \equiv \sqrt{\frac{1}{k(k-1)}\sum_{i=1}^{k}(\delta_i - \bar{\delta})^2} \tag{5.18}$$

Notice the constant $t_{N,k-1}$ in Equation (5.17) has two subscripts. The first specifies the desired confidence level, as it did for our earlier constant $z_N$. The second parameter, called the number of degrees of freedom and usually denoted by ν, is related to the number of independent random events that go into producing the value for the random variable $\bar{\delta}$. In the current setting, the number of degrees of freedom is k - 1. Selected values for the parameter t are given in Table 5.6. Notice that as k → ∞, the value of $t_{N,k-1}$ approaches the constant $z_N$.

Note the procedure described here for comparing two learning methods involves testing the two learned hypotheses on identical test sets. This contrasts with the method described in Section 5.5 for comparing hypotheses that have been evaluated using two independent test sets. Tests where the hypotheses are evaluated over identical samples are called paired tests. Paired tests typically produce tighter confidence intervals because any differences in observed errors in a paired test are due to differences between the hypotheses.
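The procedure of Table 5.5 and the interval of Equations (5.17) and (5.18) can be sketched as follows. Everything here is an illustrative assumption rather than a prescribed implementation: `learn_a`, `learn_b`, and `error` are hypothetical callables supplied by the user, the data is assumed to be a list that has already been shuffled, and the constant $t_{N,k-1}$ must be looked up by the caller (for example in Table 5.6).

```python
import math
from typing import Callable, List, Sequence, Tuple

def k_fold_differences(data: Sequence, k: int,
                       learn_a: Callable, learn_b: Callable,
                       error: Callable) -> List[float]:
    """Steps 1 and 2 of Table 5.5: return the k differences delta_i.

    learn_a / learn_b map a training list to a hypothesis, and
    error(h, test_set) returns the error of h on test_set.
    All three are hypothetical callables supplied by the user.
    """
    fold_size = len(data) // k              # leftover examples are dropped
    used = list(data[:fold_size * k])       # so that all folds have equal size
    deltas = []
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        test_set = used[start:stop]                 # T_i
        train_set = used[:start] + used[stop:]      # S_i = D_0 - T_i
        h_a = learn_a(train_set)
        h_b = learn_b(train_set)
        deltas.append(error(h_a, test_set) - error(h_b, test_set))
    return deltas

def paired_t_interval(deltas: Sequence[float],
                      t_value: float) -> Tuple[float, float, float]:
    """Mean difference and the interval of Equations (5.17) and (5.18).

    t_value is the constant t_{N,k-1}, supplied by the caller.
    """
    k = len(deltas)
    mean = sum(deltas) / k
    s = math.sqrt(sum((d - mean) ** 2 for d in deltas) / (k * (k - 1)))
    return mean, mean - t_value * s, mean + t_value * s

# Made-up differences from k = 3 folds; 4.30 is the standard two-sided 95%
# t value for 2 degrees of freedom, entered by hand for this sketch.
mean_delta, low, high = paired_t_interval([0.1, 0.2, 0.3], t_value=4.30)
print(f"mean difference = {mean_delta:.2f}, interval = [{low:.3f}, {high:.3f}]")
```

With only three folds the interval is wide enough to include zero, which illustrates how strongly the t constant penalizes a small number of degrees of freedom.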
In contrast, when the hypotheses are tested on separate data samples, differences in the two sample errors might be partially attributable to differences in the makeup of the two samples.

Confidence level N: 90%, 95%, 98%, 99%

TABLE 5.6 Values of $t_{N,\nu}$ for two-sided confidence intervals. As ν → ∞, $t_{N,\nu}$ approaches $z_N$.

5.6.1 Paired t Tests

Above we described one procedure for comparing two learning methods given a fixed set of data. This section discusses the statistical justification for this procedure, and for the confidence interval defined by Equations (5.17) and (5.18). It can be skipped or skimmed on a first reading without loss of continuity.

The best way to understand the justification for the confidence interval estimate given by Equation (5.17) is to consider the following estimation problem:

• We are given the observed values of a set of independent, identically distributed random variables $Y_1, Y_2, \ldots, Y_k$.
• We wish to estimate the mean μ of the probability distribution governing these $Y_i$.
• The estimator we will use is the sample mean $\bar{Y} \equiv \frac{1}{k}\sum_{i=1}^{k} Y_i$

This problem of estimating the distribution mean μ based on the sample mean $\bar{Y}$ is quite general. For example, it covers the problem discussed earlier of using $error_S(h)$ to estimate $error_{\mathcal{D}}(h)$. (In that problem, the $Y_i$ are 1 or 0 to indicate whether h commits an error on an individual example from S, and $error_{\mathcal{D}}(h)$ is the mean μ of the underlying distribution.) The t test, described by Equations (5.17) and (5.18), applies to a special case of this problem, namely the case in which the individual $Y_i$ follow a Normal distribution.

Now consider the following idealization of the method in Table 5.5 for comparing learning methods. Assume that instead of having a fixed sample of data $D_0$, we can request new training examples drawn according to the underlying instance distribution.
In particular, in this idealized method we modify the procedure of Table 5.5 so that on each iteration through the loop it generates a new random training set $S_i$ and new random test set $T_i$ by drawing from this underlying instance distribution instead of drawing from the fixed sample $D_0$. This idealized method perfectly fits the form of the above estimation problem. In particular, the $\delta_i$ measured by the procedure now correspond to the independent, identically distributed random variables $Y_i$. The mean μ of their distribution corresponds to the expected difference in error between the two learning methods [i.e., Equation (5.14)]. The sample mean $\bar{Y}$ is the quantity $\bar{\delta}$ computed by this idealized version of the method. We wish to answer the question "how good an estimate of μ is provided by $\bar{\delta}$?"

First, note that the size of the test sets has been chosen to contain at least 30 examples. Because of this, the individual $\delta_i$ will each follow an approximately Normal distribution (due to the Central Limit Theorem). Hence, we have a special case in which the $Y_i$ are governed by an approximately Normal distribution. It can be shown in general that when the individual $Y_i$ each follow a Normal distribution, then the sample mean $\bar{Y}$ follows a Normal distribution as well. Given that $\bar{Y}$ is Normally distributed, we might consider using the earlier expression for confidence intervals (Equation [5.11]) that applies to estimators governed by Normal distributions. Unfortunately, that equation requires that we know the standard deviation of this distribution, which we do not.

The t test applies to precisely these situations, in which the task is to estimate the mean of the distribution governing a collection of independent, identically and Normally distributed random variables, using their sample mean.
In this case, we can use the confidence interval given by Equations (5.17) and (5.18), which can be restated using our current notation as

$$\mu = \bar{Y} \pm t_{N,k-1}\, s_{\bar{Y}}$$

where $s_{\bar{Y}}$ is the estimated standard deviation of the sample mean

$$s_{\bar{Y}} \equiv \sqrt{\frac{1}{k(k-1)}\sum_{i=1}^{k}(Y_i - \bar{Y})^2}$$

and where $t_{N,k-1}$ is a constant analogous to our earlier $z_N$. In fact, the constant $t_{N,k-1}$ characterizes the area under a probability distribution known as the t distribution, just as the constant $z_N$ characterizes the area under a Normal distribution. The t distribution is a bell-shaped distribution similar to the Normal distribution, but wider and shorter to reflect the greater variance introduced by using $s_{\bar{Y}}$ to approximate the true standard deviation $\sigma_{\bar{Y}}$. The t distribution approaches the Normal distribution (and therefore $t_{N,k-1}$ approaches $z_N$) as k approaches infinity. This is intuitively satisfying because we expect $s_{\bar{Y}}$ to converge toward the true standard deviation $\sigma_{\bar{Y}}$ as the sample size k grows, and because we can use $z_N$ when the standard deviation is known exactly.

5.6.2 Practical Considerations

Note the above discussion justifies the use of the confidence interval estimate given by Equation (5.17) in the case where we wish to use the sample mean $\bar{Y}$ to estimate the mean of the distribution governing a sample of k independent, identically and Normally distributed random variables. This fits the idealized method described above, in which we assume unlimited access to examples of the target function. In practice, given a limited set of data $D_0$ and the more practical method described by Table 5.5, this justification does not strictly apply. In practice, the problem is that the only way to generate new $S_i$ is to resample $D_0$, dividing it into training and test sets in different ways. The $\delta_i$ are not independent of one another in this case, because they are based on overlapping sets of training examples drawn from the limited subset $D_0$ of data, rather than from the full distribution $\mathcal{D}$.

When only a limited sample of data $D_0$ is available, several methods can be used to resample $D_0$.
Table 5.5 describes a k-fold method in which $D_0$ is partitioned into k disjoint, equal-sized subsets. In this k-fold approach, each example from $D_0$ is used exactly once in a test set, and k - 1 times in a training set. A second popular approach is to randomly choose a test set of at least 30 examples from $D_0$, use the remaining examples for training, then repeat this process as many times as desired. This randomized method has the advantage that it can be repeated an indefinite number of times, to shrink the confidence interval to the desired width. In contrast, the k-fold method is limited by the total number of examples, by the use of each example only once in a test set, and by our desire to use samples of size at least 30. However, the randomized method has the disadvantage that the test sets no longer qualify as being independently drawn with respect to the underlying instance distribution $\mathcal{D}$. In contrast, the test sets generated by k-fold cross validation are independent because each instance is included in only one test set.

To summarize, no single procedure for comparing learning methods based on limited data satisfies all the constraints we would like. It is wise to keep in mind that statistical models rarely fit perfectly the practical constraints in testing learning algorithms when available data is limited. Nevertheless, they do provide approximate confidence intervals that can be of great help in interpreting experimental comparisons of learning methods.

5.7 SUMMARY AND FURTHER READING

The main points of this chapter include:

• Statistical theory provides a basis for estimating the true error ($error_{\mathcal{D}}(h)$) of a hypothesis h, based on its observed error ($error_S(h)$) over a sample S of data. For example, if h is a discrete-valued hypothesis and the data sample S contains n ≥ 30 examples drawn independently of h and of one another, then the N% confidence interval for $error_{\mathcal{D}}(h)$ is approximately

$$error_S(h) \pm z_N \sqrt{\frac{error_S(h)\,(1 - error_S(h))}{n}}$$

where values for $z_N$ are given in Table 5.1.
• In general, the problem of estimating confidence intervals is approached by identifying the parameter to be estimated (e.g., $error_{\mathcal{D}}(h)$) and an estimator (e.g., $error_S(h)$) for this quantity. Because the estimator is a random variable (e.g., $error_S(h)$ depends on the random sample S), it can be characterized by the probability distribution that governs its value. Confidence intervals can then be calculated by determining the interval that contains the desired probability mass under this distribution.

• One possible cause of errors in estimating hypothesis accuracy is estimation bias. If Y is an estimator for some parameter p, the estimation bias of Y is the difference between p and the expected value of Y. For example, if S is the training data used to formulate hypothesis h, then $error_S(h)$ gives an optimistically biased estimate of the true error $error_{\mathcal{D}}(h)$.

• A second cause of estimation error is variance in the estimate. Even with an unbiased estimator, the observed value of the estimator is likely to vary from one experiment to another. The variance $\sigma^2$ of the distribution governing the estimator characterizes how widely this estimate is likely to vary from the correct value. This variance decreases as the size of the data sample is increased.

• Comparing the relative effectiveness of two learning algorithms is an estimation problem that is relatively easy when data and time are unlimited, but more difficult when these resources are limited. One possible approach described in this chapter is to run the learning algorithms on different subsets of the available data, testing the learned hypotheses on the remaining data, then averaging the results of these experiments.

• In most cases considered here, deriving confidence intervals involves making a number of assumptions and approximations.
For example, the above confidence interval for $error_{\mathcal{D}}(h)$ involved approximating a Binomial distribution by a Normal distribution, approximating the variance of this distribution, and assuming instances are generated by a fixed, unchanging probability distribution. While intervals based on such approximations are only approximate confidence intervals, they nevertheless provide useful guidance for designing and interpreting experimental results in machine learning.

The key statistical definitions presented in this chapter are summarized in Table 5.2.

An ocean of literature exists on the topic of statistical methods for estimating means and testing significance of hypotheses. While this chapter introduces the basic concepts, more detailed treatments of these issues can be found in many books and articles. Billingsley et al. (1986) provide a very readable introduction to statistics that elaborates on the issues discussed here. Other texts on statistics include DeGroot (1986) and Casella and Berger (1990). Duda and Hart (1973) provide a treatment of these issues in the context of numerical pattern recognition.

Segre et al. (1991, 1996), Etzioni and Etzioni (1994), and Gordon and Segre (1996) discuss statistical significance tests for evaluating learning algorithms whose performance is measured by their ability to improve computational efficiency.

Geman et al. (1992) discuss the tradeoff involved in attempting to minimize bias and variance simultaneously. There is ongoing debate regarding the best way to learn and compare hypotheses from limited data. For example, Dietterich (1996) discusses the risks of applying the paired-difference t test repeatedly to different train-test splits of the data.

EXERCISES

5.1. Suppose you test a hypothesis h and find that it commits r = 300 errors on a sample S of n = 1000 randomly drawn test examples. What is the standard deviation in $error_S(h)$? How does this compare to the standard deviation in the example at the end of Section 5.3.4?

5.2.
Consider a learned hypothesis, h, for some boolean concept. When h is tested on a set of 100 examples, it classifies 83 correctly. What are the standard deviation and the 95% confidence interval for the true error rate $error_{\mathcal{D}}(h)$?

5.3. Suppose hypothesis h commits r = 10 errors over a sample of n = 65 independently drawn examples. What is the 90% confidence interval (two-sided) for the true error rate? What is the 95% one-sided interval (i.e., what is the upper bound U such that $error_{\mathcal{D}}(h) \le U$ with 95% confidence)? What is the 90% one-sided interval?

5.4. You are about to test a hypothesis h whose $error_{\mathcal{D}}(h)$ is known to be in the range between 0.2 and 0.6. What is the minimum number of examples you must collect to assure that the width of the two-sided 95% confidence interval will be smaller than 0.1?

5.5. Give general expressions for the upper and lower one-sided N% confidence intervals for the difference in errors between two hypotheses tested on different samples of data. Hint: Modify the expression given in Section 5.5.

5.6. Explain why the confidence interval estimate given in Equation (5.17) applies to estimating the quantity in Equation (5.16), and not the quantity in Equation (5.14).

REFERENCES

Billingsley, P., Croft, D. J., Huntsberger, D. V., & Watson, C. J. (1986). Statistical inference for management and economics. Boston: Allyn and Bacon, Inc.
Casella, G., & Berger, R. L. (1990). Statistical inference. Pacific Grove, CA: Wadsworth and Brooks/Cole.
DeGroot, M. H. (1986). Probability and statistics. (2d ed.) Reading, MA: Addison Wesley.
Dietterich, T. G. (1996). Proper statistical tests for comparing supervised classification learning algorithms (Technical Report). Department of Computer Science, Oregon State University, Corvallis, OR.
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms (Technical Report).
Department of Computer Science, Oregon State University, Corvallis, OR.
Duda, R., & Hart, P. (1973). Pattern classification and scene analysis. New York: John Wiley & Sons.
Efron, B., & Tibshirani, R. (1991). Statistical data analysis in the computer age. Science, 253, 390-395.
Etzioni, O., & Etzioni, R. (1994). Statistical methods for analyzing speedup learning experiments. Machine Learning, 14, 333-347.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1-58.
Gordon, G., & Segre, A. M. (1996). Nonparametric statistical methods for experimental evaluations of speedup learning. Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy.
Maisel, L. (1971). Probability, statistics, and random processes. Simon and Schuster Tech Outlines. New York: Simon and Schuster.
Segre, A., Elkan, C., & Russell, A. (1991). A critical look at experimental evaluations of EBL. Machine Learning, 6(2).
Segre, A. M., Gordon, G., & Elkan, C. P. (1996). Exploratory analysis of speedup learning data using expectation maximization. Artificial Intelligence, 85, 301-319.
Spiegel, M. R. (1991). Theory and problems of probability and statistics. Schaum's Outline Series. New York: McGraw Hill.
Thompson, M. L., & Zucchini, W. (1989). On the statistical analysis of ROC curves. Statistics in Medicine, 8, 1277-1290.
White, A. P., & Liu, W. Z. (1994). Bias in information-based measures in decision tree induction. Machine Learning, 15, 321-329.

CHAPTER 6

BAYESIAN LEARNING

Bayesian reasoning provides a probabilistic approach to inference. It is based on the assumption that the quantities of interest are governed by probability distributions and that optimal decisions can be made by reasoning about these probabilities together with observed data. It is important to machine learning because it provides a quantitative approach to weighing the evidence supporting alternative hypotheses.
Bayesian reasoning provides the basis for learning algorithms that directly manipulate probabilities, as well as a framework for analyzing the operation of other algorithms that do not explicitly manipulate probabilities.

6.1 INTRODUCTION

Bayesian learning methods are relevant to our study of machine learning for two different reasons. First, Bayesian learning algorithms that calculate explicit probabilities for hypotheses, such as the naive Bayes classifier, are among the most practical approaches to certain types of learning problems. For example, Michie et al. (1994) provide a detailed study comparing the naive Bayes classifier to other learning algorithms, including decision tree and neural network algorithms. These researchers show that the naive Bayes classifier is competitive with these other learning algorithms in many cases and that in some cases it outperforms these other methods. In this chapter we describe the naive Bayes classifier and provide a detailed example of its use. In particular, we discuss its application to the problem of learning to classify text documents such as electronic news articles. For such learning tasks, the naive Bayes classifier is among the most effective algorithms known.

The second reason that Bayesian methods are important to our study of machine learning is that they provide a useful perspective for understanding many learning algorithms that do not explicitly manipulate probabilities. For example, in this chapter we analyze algorithms such as the FIND-S and CANDIDATE-ELIMINATION algorithms of Chapter 2 to determine conditions under which they output the most probable hypothesis given the training data. We also use a Bayesian analysis to justify a key design choice in neural network learning algorithms: choosing to minimize the sum of squared errors when searching the space of possible neural networks.
We also derive an alternative error function, cross entropy, that is more appropriate than sum of squared errors when learning target functions that predict probabilities. We use a Bayesian perspective to analyze the inductive bias of decision tree learning algorithms that favor short decision trees and examine the closely related Minimum Description Length principle. A basic familiarity with Bayesian methods is important to understanding and characterizing the operation of many algorithms in machine learning.

Features of Bayesian learning methods include:

• Each observed training example can incrementally decrease or increase the estimated probability that a hypothesis is correct. This provides a more flexible approach to learning than algorithms that completely eliminate a hypothesis if it is found to be inconsistent with any single example.

• Prior knowledge can be combined with observed data to determine the final probability of a hypothesis. In Bayesian learning, prior knowledge is provided by asserting (1) a prior probability for each candidate hypothesis, and (2) a probability distribution over observed data for each possible hypothesis.

• Bayesian methods can accommodate hypotheses that make probabilistic predictions (e.g., hypotheses such as "this pneumonia patient has a 93% chance of complete recovery").

• New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities.

• Even in cases where Bayesian methods prove computationally intractable, they can provide a standard of optimal decision making against which other practical methods can be measured.

One practical difficulty in applying Bayesian methods is that they typically require initial knowledge of many probabilities. When these probabilities are not known in advance they are often estimated based on background knowledge, previously available data, and assumptions about the form of the underlying distributions.
A second practical difficulty is the significant computational cost required to determine the Bayes optimal hypothesis in the general case (linear in the number of candidate hypotheses). In certain specialized situations, this computational cost can be significantly reduced.

The remainder of this chapter is organized as follows. Section 6.2 introduces Bayes theorem and defines maximum likelihood and maximum a posteriori probability hypotheses. The four subsequent sections then apply this probabilistic framework to analyze several issues and learning algorithms discussed in earlier chapters. For example, we show that several previously described algorithms output maximum likelihood hypotheses, under certain assumptions. The remaining sections then introduce a number of learning algorithms that explicitly manipulate probabilities. These include the Bayes optimal classifier, Gibbs algorithm, and naive Bayes classifier. Finally, we discuss Bayesian belief networks, a relatively recent approach to learning based on probabilistic reasoning, and the EM algorithm, a widely used algorithm for learning in the presence of unobserved variables.

6.2 BAYES THEOREM

In machine learning we are often interested in determining the best hypothesis from some space H, given the observed training data D. One way to specify what we mean by the best hypothesis is to say that we demand the most probable hypothesis, given the data D plus any initial knowledge about the prior probabilities of the various hypotheses in H. Bayes theorem provides a direct method for calculating such probabilities. More precisely, Bayes theorem provides a way to calculate the probability of a hypothesis based on its prior probability, the probabilities of observing various data given the hypothesis, and the observed data itself.

To define Bayes theorem precisely, let us first introduce a little notation.
We shall write $P(h)$ to denote the initial probability that hypothesis h holds, before we have observed the training data. $P(h)$ is often called the prior probability of h and may reflect any background knowledge we have about the chance that h is a correct hypothesis. If we have no such prior knowledge, then we might simply assign the same prior probability to each candidate hypothesis. Similarly, we will write $P(D)$ to denote the prior probability that training data D will be observed (i.e., the probability of D given no knowledge about which hypothesis holds). Next, we will write $P(D|h)$ to denote the probability of observing data D given some world in which hypothesis h holds. More generally, we write $P(x|y)$ to denote the probability of x given y. In machine learning problems we are interested in the probability $P(h|D)$ that h holds given the observed training data D. $P(h|D)$ is called the posterior probability of h, because it reflects our confidence that h holds after we have seen the training data D. Notice the posterior probability $P(h|D)$ reflects the influence of the training data D, in contrast to the prior probability $P(h)$, which is independent of D.

Bayes theorem is the cornerstone of Bayesian learning methods because it provides a way to calculate the posterior probability $P(h|D)$ from the prior probability $P(h)$, together with $P(D)$ and $P(D|h)$.

Bayes theorem:

$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)} \tag{6.1}$$

As one might intuitively expect, $P(h|D)$ increases with $P(h)$ and with $P(D|h)$ according to Bayes theorem. It is also reasonable to see that $P(h|D)$ decreases as $P(D)$ increases, because the more probable it is that D will be observed independent of h, the less evidence D provides in support of h.

In many learning scenarios, the learner considers some set of candidate hypotheses H and is interested in finding the most probable hypothesis $h \in H$ given the observed data D (or at least one of the maximally probable if there are several).
Any such maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis. We can determine the MAP hypotheses by using Bayes theorem to calculate the posterior probability of each candidate hypothesis. More precisely, we will say that $h_{MAP}$ is a MAP hypothesis provided

$$h_{MAP} \equiv \operatorname*{argmax}_{h \in H} P(h|D) = \operatorname*{argmax}_{h \in H} \frac{P(D|h)\,P(h)}{P(D)} = \operatorname*{argmax}_{h \in H} P(D|h)\,P(h) \tag{6.2}$$

Notice in the final step above we dropped the term $P(D)$ because it is a constant independent of h.

In some cases, we will assume that every hypothesis in H is equally probable a priori ($P(h_i) = P(h_j)$ for all $h_i$ and $h_j$ in H). In this case we can further simplify Equation (6.2) and need only consider the term $P(D|h)$ to find the most probable hypothesis. $P(D|h)$ is often called the likelihood of the data D given h, and any hypothesis that maximizes $P(D|h)$ is called a maximum likelihood (ML) hypothesis, $h_{ML}$.

$$h_{ML} \equiv \operatorname*{argmax}_{h \in H} P(D|h) \tag{6.3}$$

In order to make clear the connection to machine learning problems, we introduced Bayes theorem above by referring to the data D as training examples of some target function and referring to H as the space of candidate target functions. In fact, Bayes theorem is much more general than suggested by this discussion. It can be applied equally well to any set H of mutually exclusive propositions whose probabilities sum to one (e.g., "the sky is blue," and "the sky is not blue"). In this chapter, we will at times consider cases where H is a hypothesis space containing possible target functions and the data D are training examples. At other times we will consider cases where H is some other set of mutually exclusive propositions, and D is some other kind of data.

6.2.1 An Example

To illustrate Bayes rule, consider a medical diagnosis problem in which there are two alternative hypotheses: (1) that the patient has a particular form of cancer, and (2) that the patient does not. The available data is from a particular laboratory test with two possible outcomes: $\oplus$ (positive) and $\ominus$ (negative).
We have prior knowledge that over the entire population of people only .008 have this disease. Furthermore, the lab test is only an imperfect indicator of the disease. The test returns a correct positive result in only 98% of the cases in which the disease is actually present and a correct negative result in only 97% of the cases in which the disease is not present. In other cases, the test returns the opposite result. The above situation can be summarized by the following probabilities:

$$P(\text{cancer}) = .008, \qquad P(\neg\text{cancer}) = .992$$
$$P(\oplus|\text{cancer}) = .98, \qquad P(\ominus|\text{cancer}) = .02$$
$$P(\oplus|\neg\text{cancer}) = .03, \qquad P(\ominus|\neg\text{cancer}) = .97$$

Suppose we now observe a new patient for whom the lab test returns a positive result. Should we diagnose the patient as having cancer or not? The maximum a posteriori hypothesis can be found using Equation (6.2):

$$P(\oplus|\text{cancer})\,P(\text{cancer}) = (.98)(.008) = .0078$$
$$P(\oplus|\neg\text{cancer})\,P(\neg\text{cancer}) = (.03)(.992) = .0298$$

Thus, $h_{MAP} = \neg\text{cancer}$. The exact posterior probabilities can also be determined by normalizing the above quantities so that they sum to 1 (e.g., $P(\text{cancer}|\oplus) = \frac{.0078}{.0078 + .0298} = .21$). This step is warranted because Bayes theorem states that the posterior probabilities are just the above quantities divided by the probability of the data, $P(\oplus)$. Although $P(\oplus)$ was not provided directly as part of the problem statement, we can calculate it in this fashion because we know that $P(\text{cancer}|\oplus)$ and $P(\neg\text{cancer}|\oplus)$ must sum to 1 (i.e., either the patient has cancer or they do not). Notice that while the posterior probability of cancer is significantly higher than its prior probability, the most probable hypothesis is still that the patient does not have cancer.

As this example illustrates, the result of Bayesian inference depends strongly on the prior probabilities, which must be available in order to apply the method directly. Note also that in this example the hypotheses are not completely accepted or rejected, but rather become more or less probable as more data is observed.

Basic formulas for calculating probabilities are summarized in Table 6.1.

6.3 BAYES THEOREM AND CONCEPT LEARNING

What is the relationship between Bayes theorem and the problem of concept learning?
Since Bayes theorem provides a principled way to calculate the posterior probability of each hypothesis given the training data, we can use it as the basis for a straightforward learning algorithm that calculates the probability for each possible hypothesis, then outputs the most probable. This section considers such a brute-force Bayesian concept learning algorithm, then compares it to concept learning algorithms we considered in Chapter 2. As we shall see, one interesting result of this comparison is that under certain conditions several algorithms discussed in earlier chapters output the same hypotheses as this brute-force Bayesian algorithm, despite the fact that they do not explicitly manipulate probabilities and are considerably more efficient.

TABLE 6.1 Summary of basic probability formulas.

• Product rule: probability $P(A \wedge B)$ of a conjunction of two events A and B
$$P(A \wedge B) = P(A|B)\,P(B) = P(B|A)\,P(A)$$

• Sum rule: probability of a disjunction of two events A and B
$$P(A \vee B) = P(A) + P(B) - P(A \wedge B)$$

• Bayes theorem: the posterior probability $P(h|D)$ of h given D
$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$

• Theorem of total probability: if events $A_1, \ldots, A_n$ are mutually exclusive with $\sum_{i=1}^{n} P(A_i) = 1$, then
$$P(B) = \sum_{i=1}^{n} P(B|A_i)\,P(A_i)$$

6.3.1 Brute-Force Bayes Concept Learning

Consider the concept learning problem first introduced in Chapter 2. In particular, assume the learner considers some finite hypothesis space H defined over the instance space X, in which the task is to learn some target concept $c : X \to \{0, 1\}$. As usual, we assume that the learner is given some sequence of training examples $\langle \langle x_1, d_1 \rangle \ldots \langle x_m, d_m \rangle \rangle$ where $x_i$ is some instance from X and where $d_i$ is the target value of $x_i$ (i.e., $d_i = c(x_i)$). To simplify the discussion in this section, we assume the sequence of instances $\langle x_1 \ldots x_m \rangle$ is held fixed, so that the training data D can be written simply as the sequence of target values $D = \langle d_1 \ldots d_m \rangle$. It can be shown (see Exercise 6.4) that this simplification does not alter the main conclusions of this section.
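To make this setting concrete, the following Python sketch enumerates a small finite hypothesis space and applies Bayes theorem to every hypothesis, in the spirit of the brute-force approach this section analyzes. It is an illustration under stated assumptions rather than an implementation from the text: the instance space (pairs of boolean attributes), the hypothesis space (all boolean functions over it), and the names `likelihood` and `posterior` are hypothetical, while the uniform prior and the noise-free 0/1 likelihood follow the choices this section adopts.

```python
from itertools import product

# Instance space X: all pairs of boolean attribute values (assumed for illustration).
X = list(product([0, 1], repeat=2))

# Hypothesis space H: every boolean function over X, stored as a tuple holding
# one output per instance in X (16 hypotheses in total).
H = list(product([0, 1], repeat=len(X)))

def likelihood(h, D):
    """P(D|h) for noise-free data: 1 if h agrees with every target value, else 0."""
    return 1.0 if all(h[i] == d for i, d in D) else 0.0

def posterior(H, D):
    """Apply Bayes theorem to each hypothesis under a uniform prior P(h) = 1/|H|."""
    prior = 1.0 / len(H)
    joint = [likelihood(h, D) * prior for h in H]  # P(D|h) P(h)
    p_D = sum(joint)                               # theorem of total probability
    return [j / p_D for j in joint]

# Training data D as (index of instance in X, target value) pairs.
D = [(0, 0), (3, 1)]

post = posterior(H, D)
consistent = [h for h, p in zip(H, post) if p > 0]
h_map = max(zip(H, post), key=lambda hp: hp[1])[0]  # one maximally probable hypothesis
```

With the two training examples shown, four of the sixteen hypotheses remain consistent with D, and each receives posterior probability 1/4; every inconsistent hypothesis receives posterior 0. The cost of `posterior` grows linearly with the size of H, which is why the approach is practical only for small hypothesis spaces.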
We can design a straightforward concept learning algorithm to output the maximum a posteriori hypothesis, based on Bayes theorem, as follows:

BRUTE-FORCE MAP LEARNING algorithm

1. For each hypothesis h in H, calculate the posterior probability
$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$

2. Output the hypothesis $h_{MAP}$ with the highest posterior probability
$$h_{MAP} = \operatorname*{argmax}_{h \in H} P(h|D)$$

This algorithm may require significant computation, because it applies Bayes theorem to each hypothesis in H to calculate $P(h|D)$. While this may prove impractical for large hypothesis spaces, the algorithm is still of interest because it provides a standard against which we may judge the performance of other concept learning algorithms.

In order to specify a learning problem for the BRUTE-FORCE MAP LEARNING algorithm we must specify what values are to be used for $P(h)$ and for $P(D|h)$ (as we shall see, $P(D)$ will be determined once we choose the other two). We may choose the probability distributions $P(h)$ and $P(D|h)$ in any way we wish, to describe our prior knowledge about the learning task. Here let us choose them to be consistent with the following assumptions:

1. The training data D is noise free (i.e., $d_i = c(x_i)$).
2. The target concept c is contained in the hypothesis space H.
3. We have no a priori reason to believe that any hypothesis is more probable than any other.

Given these assumptions, what values should we specify for $P(h)$? Given no prior knowledge that one hypothesis is more likely than another, it is reasonable to assign the same prior probability to every hypothesis h in H. Furthermore, because we assume the target concept is contained in H we should require that these prior probabilities sum to 1. Together these constraints imply that we should choose

$$P(h) = \frac{1}{|H|} \quad \text{for all } h \text{ in } H$$

What choice shall we make for $P(D|h)$? $P(D|h)$ is the probability of observing the target values $D = \langle d_1 \ldots d_m \rangle$ for the fixed set of instances $\langle x_1 \ldots x_m \rangle$, given a world in which hypothesis h holds (i.e., given a world in which h is the correct description of the target concept c). Since we assume noise-free training data, the probability of observing classification $d_i$ given h is just 1 if $d_i = h(x_i)$ and 0 if $d_i \ne h(x_i)$. Therefore,

$$P(D|h) = \begin{cases} 1 & \text{if } d_i = h(x_i) \text{ for all } d_i \text{ in } D \\ 0 & \text{otherwise} \end{cases} \tag{6.4}$$

In other words, the probability of data D given hypothesis h is 1 if D is consistent with h, and 0 otherwise.

Given these choices for $P(h)$ and for $P(D|h)$ we now have a fully-defined problem for the above BRUTE-FORCE MAP LEARNING algorithm. Let us consider the first step of this algorithm, which uses Bayes theorem to compute the posterior probability $P(h|D)$ of each hypothesis h given the observed training data D. Recalling Bayes theorem, we have

$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$

First consider the case where h is inconsistent with the training data D. Since Equation (6.4) defines $P(D|h)$ to be 0 when h is inconsistent with D, we have

$$P(h|D) = \frac{0 \cdot P(h)}{P(D)} = 0 \quad \text{if } h \text{ is inconsistent with } D$$

The posterior probability of a hypothesis inconsistent with D is zero.

Now consider the case where h is consistent with D. Since Equation (6.4) defines $P(D|h)$ to be 1 when h is consistent with D, we have

$$P(h|D) = \frac{1 \cdot \frac{1}{|H|}}{P(D)} = \frac{1 \cdot \frac{1}{|H|}}{\frac{|VS_{H,D}|}{|H|}} = \frac{1}{|VS_{H,D}|} \quad \text{if } h \text{ is consistent with } D$$

where $VS_{H,D}$ is the subset of hypotheses from H that are consistent with D (i.e., $VS_{H,D}$ is the version space of H with respect to D as defined in Chapter 2).
It is easy to verify that $P(D) = \frac{|VS_{H,D}|}{|H|}$ above, because the sum over all hypotheses of $P(h|D)$ must be one and because the number of hypotheses from H consistent with D is by definition $|VS_{H,D}|$. Alternatively, we can derive $P(D)$ from the theorem of total probability (see Table 6.1) and the fact that the hypotheses are mutually exclusive (i.e., $(\forall i \ne j)(P(h_i \wedge h_j) = 0)$)

$$P(D) = \sum_{h_i \in H} P(D|h_i)\,P(h_i) = \sum_{h_i \in VS_{H,D}} 1 \cdot \frac{1}{|H|} + \sum_{h_i \notin VS_{H,D}} 0 \cdot \frac{1}{|H|} = \frac{|VS_{H,D}|}{|H|}$$

To summarize, Bayes theorem implies that the posterior probability $P(h|D)$ under our assumed $P(h)$ and $P(D|h)$ is

$$P(h|D) = \begin{cases} \frac{1}{|VS_{H,D}|} & \text{if } h \text{ is consistent with } D \\ 0 & \text{otherwise} \end{cases} \tag{6.5}$$

where $|VS_{H,D}|$ is the number of hypotheses from H consistent with D. The evolution of probabilities associated with hypotheses is depicted schematically in Figure 6.1. Initially (Figure 6.1a) all hypotheses have the same probability. As training data accumulates (Figures 6.1b and 6.1c), the posterior probability for inconsistent hypotheses becomes zero while the total probability summing to one is shared equally among the remaining consistent hypotheses.

The above analysis implies that under our choice for $P(h)$ and $P(D|h)$, every consistent hypothesis has posterior probability $1/|VS_{H,D}|$, and every inconsistent hypothesis has posterior probability 0. Every consistent hypothesis is, therefore, a MAP hypothesis.

6.3.2 MAP Hypotheses and Consistent Learners

The above analysis shows that in the given setting, every hypothesis consistent with D is a MAP hypothesis. This statement translates directly into an interesting statement about a general class of learners that we might call consistent learners. We will say that a learning algorithm is a consistent learner provided it outputs a hypothesis that commits zero errors over the training examples.
Given the above analysis, we can conclude that every consistent learner outputs a MAP hypothesis, if we assume a uniform prior probability distribution over H (i.e., $P(h_i) = P(h_j)$ for all i, j), and if we assume deterministic, noise-free training data (i.e., $P(D|h) = 1$ if D and h are consistent, and 0 otherwise).

Consider, for example, the concept learning algorithm FIND-S discussed in Chapter 2. FIND-S searches the hypothesis space H from specific to general hypotheses, outputting a maximally specific consistent hypothesis (i.e., a maximally specific member of the version space). Because FIND-S outputs a consistent hypothesis, we know that it will output a MAP hypothesis under the probability distributions $P(h)$ and $P(D|h)$ defined above. Of course FIND-S does not explicitly manipulate probabilities at all; it simply outputs a maximally specific member of the version space. However, by identifying distributions for $P(h)$ and $P(D|h)$ under which its output hypotheses will be MAP hypotheses, we have a useful way of characterizing the behavior of FIND-S.

[Figure 6.1: three panels (a), (b), and (c), each with horizontal axis "hypotheses".]

FIGURE 6.1 Evolution of posterior probabilities $P(h|D)$ with increasing training data. (a) Uniform priors assign equal probability to each hypothesis. As training data increases first to $D_1$ (b), then to $D_1 \wedge D_2$ (c), the posterior probability of inconsistent hypotheses becomes zero, while posterior probabilities increase for hypotheses remaining in the version space.

Are there other probability distributions for $P(h)$ and $P(D|h)$ under which FIND-S outputs MAP hypotheses? Yes. Because FIND-S outputs a maximally specific hypothesis from the version space, its output hypothesis will be a MAP hypothesis relative to any prior probability distribution that favors more specific hypotheses. More precisely, suppose $\mathcal{H}$ is any probability distribution $P(h)$ over H that assigns $P(h_1) \ge P(h_2)$ if $h_1$ is more specific than $h_2$.
Then it can be shown that FIND-S outputs a MAP hypothesis assuming the prior distribution $\mathcal{H}$ and the same distribution $P(D|h)$ discussed above.

To summarize the above discussion, the Bayesian framework provides one way to characterize the behavior of learning algorithms (e.g., FIND-S), even when the learning algorithm does not explicitly manipulate probabilities. By identifying probability distributions $P(h)$ and $P(D|h)$ under which the algorithm outputs optimal (i.e., MAP) hypotheses, we can characterize the implicit assumptions under which this algorithm behaves optimally.

Using the Bayesian perspective to characterize learning algorithms in this way is similar in spirit to characterizing the inductive bias of the learner. Recall that in Chapter 2 we defined the inductive bias of a learning algorithm to be the set of assumptions B sufficient to deductively justify the inductive inference performed by the learner. For example, we described the inductive bias of the CANDIDATE-ELIMINATION algorithm as the assumption that the target concept c is included in the hypothesis space H. Furthermore, we showed there that the output of this learning algorithm follows deductively from its inputs plus this implicit inductive bias assumption. The above Bayesian interpretation provides an alternative way to characterize the assumptions implicit in learning algorithms. Here, instead of modeling the inductive inference method by an equivalent deductive system, we model it by an equivalent probabilistic reasoning system based on Bayes theorem. And here the implicit assumptions that we attribute to the learner are assumptions of the form "the prior probabilities over H are given by the distribution $P(h)$, and the strength of data in rejecting or accepting a hypothesis is given by $P(D|h)$." The definitions of $P(h)$ and $P(D|h)$ given in this section characterize the implicit assumptions of the CANDIDATE-ELIMINATION and FIND-S algorithms.
A probabilistic reasoning system based on Bayes theorem will exhibit input-output behavior equivalent to these algorithms, provided it is given these assumed probability distributions.

The discussion throughout this section corresponds to a special case of Bayesian reasoning, because we considered the case where $P(D|h)$ takes on values of only 0 and 1, reflecting the deterministic predictions of hypotheses and the assumption of noise-free training data. As we shall see in the next section, we can also model learning from noisy training data, by allowing $P(D|h)$ to take on values other than 0 and 1, and by introducing into $P(D|h)$ additional assumptions about the probability distributions that govern the noise.

6.4 MAXIMUM LIKELIHOOD AND LEAST-SQUARED ERROR HYPOTHESES

As illustrated in the above section, Bayesian analysis can sometimes be used to show that a particular learning algorithm outputs MAP hypotheses even though it may not explicitly use Bayes rule or calculate probabilities in any form.

In this section we consider the problem of learning a continuous-valued target function, a problem faced by many learning approaches such as neural network learning, linear regression, and polynomial curve fitting. A straightforward Bayesian analysis will show that under certain assumptions any learning algorithm that minimizes the squared error between the output hypothesis predictions and the training data will output a maximum likelihood hypothesis. The significance of this result is that it provides a Bayesian justification (under certain assumptions) for many neural network and other curve fitting methods that attempt to minimize the sum of squared errors over the training data.

Consider the following problem setting. Learner L considers an instance space X and a hypothesis space H consisting of some class of real-valued functions defined over X (i.e., each h in H is a function of the form $h : X \to \Re$, where $\Re$ represents the set of real numbers).
The problem faced by L is to learn an unknown target function f : X → ℜ drawn from H. A set of m training examples is provided, where the target value of each example is corrupted by random noise drawn according to a Normal probability distribution. More precisely, each training example is a pair of the form (x_i, d_i) where d_i = f(x_i) + e_i. Here f(x_i) is the noise-free value of the target function and e_i is a random variable representing the noise. It is assumed that the values of the e_i are drawn independently and that they are distributed according to a Normal distribution with zero mean. The task of the learner is to output a maximum likelihood hypothesis, or, equivalently, a MAP hypothesis assuming all hypotheses are equally probable a priori.

A simple example of such a problem is learning a linear function, though our analysis applies to learning arbitrary real-valued functions. Figure 6.2 illustrates a linear target function f depicted by the solid line, and a set of noisy training examples of this target function. The dashed line corresponds to the hypothesis h_ML with least-squared training error, hence the maximum likelihood hypothesis. Notice that the maximum likelihood hypothesis is not necessarily identical to the correct hypothesis, f, because it is inferred from only a limited sample of noisy training data.

FIGURE 6.2 Learning a real-valued function. The target function f corresponds to the solid line. The training examples (x_i, d_i) are assumed to have Normally distributed noise e_i with zero mean added to the true target value f(x_i). The dashed line corresponds to the linear function that minimizes the sum of squared errors. Therefore, it is the maximum likelihood hypothesis h_ML, given these five training examples.
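The linear setting described here can be imitated numerically. The following is a minimal sketch, not taken from the text: the target function f(x) = 2x + 1, the noise standard deviation 0.5, the five x values, and the helper names `fit_line` and `sum_squared_error` are all assumptions chosen for illustration. It fits the least-squares line to five noisy examples and compares its training error with that of the true target function.

```python
import random

def fit_line(points):
    """Return (slope, intercept) of the line minimizing the sum of squared errors."""
    m = len(points)
    mean_x = sum(x for x, _ in points) / m
    mean_d = sum(d for _, d in points) / m
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxd = sum((x - mean_x) * (d - mean_d) for x, d in points)
    slope = sxd / sxx
    return slope, mean_d - slope * mean_x

def sum_squared_error(points, slope, intercept):
    """Sum over examples of (d_i - h(x_i))^2 for the line h(x) = slope*x + intercept."""
    return sum((d - (slope * x + intercept)) ** 2 for x, d in points)

# Hypothetical target function and noise level (assumptions, not from the text).
TRUE_SLOPE, TRUE_INTERCEPT, NOISE_STD = 2.0, 1.0, 0.5

rng = random.Random(0)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]  # five training instances
train = [(x, TRUE_SLOPE * x + TRUE_INTERCEPT + rng.gauss(0.0, NOISE_STD)) for x in xs]

ml_slope, ml_intercept = fit_line(train)
sse_ml = sum_squared_error(train, ml_slope, ml_intercept)
sse_true = sum_squared_error(train, TRUE_SLOPE, TRUE_INTERCEPT)
print("fitted line:", ml_slope, ml_intercept)
print("training SSE of fitted line vs. true f:", sse_ml, sse_true)
```

On any such sample the fitted line has training error no larger than that of f itself, yet its slope and intercept generally differ from the true ones, which is the point made about a limited sample of noisy data.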
Before showing why a hypothesis that minimizes the sum of squared errors in this setting is also a maximum likelihood hypothesis, let us quickly review two basic concepts from probability theory: probability densities and Normal distributions.

First, in order to discuss probabilities over continuous variables such as e, we must introduce probability densities. The reason, roughly, is that we wish for the total probability over all possible values of the random variable to sum to one. In the case of continuous variables we cannot achieve this by assigning a finite probability to each of the infinite set of possible values for the random variable. Instead, we speak of a probability density for continuous variables such as e and require that the integral of this probability density over all possible values be one. In general we will use lower case p to refer to the probability density function, to distinguish it from a finite probability P (which we will sometimes refer to as a probability mass). The probability density p(x_0) is the limit, as ε goes to zero, of 1/ε times the probability that x will take on a value in the interval [x_0, x_0 + ε).

Probability density function:
$$p(x_0) \equiv \lim_{\epsilon \to 0} \frac{1}{\epsilon} P(x_0 \le x < x_0 + \epsilon)$$

Second, we stated that the random noise variable e is generated by a Normal probability distribution. A Normal distribution is a smooth, bell-shaped distribution that can be completely characterized by its mean μ and its standard deviation σ. See Table 5.4 for a precise definition.

Given this background we now return to the main issue: showing that the least-squared error hypothesis is, in fact, the maximum likelihood hypothesis within our problem setting. We will show this by deriving the maximum likelihood hypothesis starting with our earlier definition Equation (6.3), but using lower case p to refer to the probability density
$$h_{ML} = \operatorname*{argmax}_{h \in H} p(D|h)$$
As before, we assume a fixed set of training instances (x_1 ... x_m) and therefore consider the data D to be the corresponding sequence of target values D = (d_1 ... d_m). Here d_i = f(x_i) + e_i. Assuming the training examples are mutually independent given h, we can write p(D|h) as the product of the various p(d_i|h)
$$h_{ML} = \operatorname*{argmax}_{h \in H} \prod_{i=1}^{m} p(d_i|h)$$
Given that the noise e_i obeys a Normal distribution with zero mean and unknown variance σ², each d_i must also obey a Normal distribution with variance σ² centered around the true target value f(x_i) rather than zero. Therefore p(d_i|h) can be written as a Normal distribution with variance σ² and mean μ = f(x_i). Let us write the formula for this Normal distribution to describe p(d_i|h), beginning with the general formula for a Normal distribution from Table 5.4 and substituting the appropriate μ and σ². Because we are writing the expression for the probability of d_i given that h is the correct description of the target function f, we will also substitute μ = f(x_i) = h(x_i), yielding
$$h_{ML} = \operatorname*{argmax}_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(d_i - \mu)^2} = \operatorname*{argmax}_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(d_i - h(x_i))^2}$$
We now apply a transformation that is common in maximum likelihood calculations: rather than maximizing the above complicated expression we shall choose to maximize its (less complicated) logarithm. This is justified because ln p is a monotonic function of p. Therefore maximizing ln p also maximizes p.
$$h_{ML} = \operatorname*{argmax}_{h \in H} \sum_{i=1}^{m} \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}(d_i - h(x_i))^2$$
The first term in this expression is a constant independent of h, and can therefore be discarded, yielding
$$h_{ML} = \operatorname*{argmax}_{h \in H} \sum_{i=1}^{m} -\frac{1}{2\sigma^2}(d_i - h(x_i))^2$$
Maximizing this negative quantity is equivalent to minimizing the corresponding positive quantity.
$$h_{ML} = \operatorname*{argmin}_{h \in H} \sum_{i=1}^{m} \frac{1}{2\sigma^2}(d_i - h(x_i))^2$$
Finally, we can again discard constants that are independent of h.
$$h_{ML} = \operatorname*{argmin}_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2 \tag{6.6}$$
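The chain of steps in this derivation can be checked numerically. The sketch below is illustrative only: the four training pairs, the three candidate hypotheses, and the noise standard deviation are made-up values, and the function names are hypothetical. It ranks the candidates once by the Gaussian log likelihood ln p(D|h) and once by the sum of squared errors, and the two rankings should agree for any positive σ.

```python
import math

def log_likelihood(data, h, sigma):
    """ln p(D|h) assuming independent zero-mean Normal noise with std sigma."""
    const = math.log(1.0 / math.sqrt(2.0 * math.pi * sigma ** 2))
    return sum(const - (d - h(x)) ** 2 / (2.0 * sigma ** 2) for x, d in data)

def sum_squared_error(data, h):
    """Sum over examples of (d_i - h(x_i))^2."""
    return sum((d - h(x)) ** 2 for x, d in data)

# Hypothetical training pairs (x_i, d_i) and candidate hypotheses.
data = [(0.0, 1.2), (1.0, 2.7), (2.0, 5.1), (3.0, 6.8)]
hypotheses = {
    "h1": lambda x: 2.0 * x + 1.0,
    "h2": lambda x: 1.5 * x + 2.0,
    "h3": lambda x: 3.0 * x,
}
sigma = 0.7  # assumed noise level; the ranking does not depend on its value

by_likelihood = sorted(hypotheses, key=lambda n: -log_likelihood(data, hypotheses[n], sigma))
by_sse = sorted(hypotheses, key=lambda n: sum_squared_error(data, hypotheses[n]))
print(by_likelihood, by_sse)
```

Because ln p(D|h) equals a constant minus the squared error divided by 2σ², changing `sigma` rescales the log likelihoods but cannot reorder the hypotheses.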
Thus, Equation (6.6) shows that the maximum likelihood hypothesis h_ML is the one that minimizes the sum of the squared errors between the observed training values d_i and the hypothesis predictions h(x_i). This holds under the assumption that the observed training values d_i are generated by adding random noise to the true target value, where this random noise is drawn independently for each example from a Normal distribution with zero mean. As the above derivation makes clear, the squared error term (d_i − h(x_i))² follows directly from the exponent in the definition of the Normal distribution. Similar derivations can be performed starting with other assumed noise distributions, producing different results.

Notice the structure of the above derivation involves selecting the hypothesis that maximizes the logarithm of the likelihood (ln p(D|h)) in order to determine the most probable hypothesis. As noted earlier, this yields the same result as maximizing the likelihood p(D|h). This approach of working with the log likelihood is common to many Bayesian analyses, because it is often more mathematically tractable than working directly with the likelihood. Of course, as noted earlier, the maximum likelihood hypothesis might not be the MAP hypothesis, but if one assumes uniform prior probabilities over the hypotheses then it is.

Why is it reasonable to choose the Normal distribution to characterize noise? One reason, it must be admitted, is that it allows for a mathematically straightforward analysis. A second reason is that the smooth, bell-shaped distribution is a good approximation to many types of noise in physical systems. In fact, the Central Limit Theorem discussed in Chapter 5 shows that the sum of a sufficiently large number of independent, identically distributed random variables is itself approximately Normally distributed, regardless of the distributions of the individual variables.
This implies that noise generated by the sum of very many independent, but identically distributed factors will itself be Normally distributed. Of course, in reality, different components that contribute to noise might not follow identical distributions, in which case this theorem will not necessarily justify our choice.

Minimizing the sum of squared errors is a common approach in many neural network, curve fitting, and other approaches to approximating real-valued functions. Chapter 4 describes gradient descent methods that seek the least-squared error hypothesis in neural network learning.

Before leaving our discussion of the relationship between the maximum likelihood hypothesis and the least-squared error hypothesis, it is important to note some limitations of this problem setting. The above analysis considers noise only in the target value of the training example and does not consider noise in the attributes describing the instances themselves. For example, if the problem is to learn to predict the weight of someone based on that person's age and height, then the above analysis assumes noise in measurements of weight, but perfect measurements of age and height. The analysis becomes significantly more complex as these simplifying assumptions are removed.

6.5 MAXIMUM LIKELIHOOD HYPOTHESES FOR PREDICTING PROBABILITIES

In the problem setting of the previous section we determined that the maximum likelihood hypothesis is the one that minimizes the sum of squared errors over the training examples. In this section we derive an analogous criterion for a second setting that is common in neural network learning: learning to predict probabilities.

Consider the setting in which we wish to learn a nondeterministic (probabilistic) function f : X → {0, 1}, which has two discrete output values.
For example, the instance space X might represent medical patients in terms of their symptoms, and the target function f(x) might be 1 if the patient survives the disease and 0 if not. Alternatively, X might represent loan applicants in terms of their past credit history, and f(x) might be 1 if the applicant successfully repays their next loan and 0 if not. In both of these cases we might well expect f to be probabilistic. For example, among a collection of patients exhibiting the same set of observable symptoms, we might find that 92% survive, and 8% do not. This unpredictability could arise from our inability to observe all the important distinguishing features of the patients, or from some genuinely probabilistic mechanism in the evolution of the disease. Whatever the source of the problem, the effect is that we have a target function f(x) whose output is a probabilistic function of the input.

Given this problem setting, we might wish to learn a neural network (or other real-valued function approximator) whose output is the probability that f(x) = 1. In other words, we seek to learn the target function f' : X → [0, 1], such that f'(x) = P(f(x) = 1). In the above medical patient example, if x is one of those indistinguishable patients of which 92% survive, then f'(x) = 0.92, whereas the probabilistic function f(x) will be equal to 1 in 92% of cases and equal to 0 in the remaining 8%.

How can we learn f' using, say, a neural network? One obvious, brute-force way would be to first collect the observed frequencies of 1's and 0's for each possible value of x and to then train the neural network to output the target frequency for each x. As we shall see below, we can instead train a neural network directly from the observed training examples of f, yet still derive a maximum likelihood hypothesis for f'.

What criterion should we optimize in order to find a maximum likelihood hypothesis for f' in this setting?
To answer this question we must first obtain an expression for P(D|h). Let us assume the training data D is of the form D = {(x_1, d_1) ... (x_m, d_m)}, where d_i is the observed 0 or 1 value for f(x_i).

Recall that in the maximum likelihood, least-squared error analysis of the previous section, we made the simplifying assumption that the instances (x_1 ... x_m) were fixed. This enabled us to characterize the data by considering only the target values d_i. Although we could make a similar simplifying assumption in this case, let us avoid it here in order to demonstrate that it has no impact on the final outcome. Thus treating both x_i and d_i as random variables, and assuming that each training example is drawn independently, we can write P(D|h) as
$$P(D|h) = \prod_{i=1}^{m} P(x_i, d_i | h) \tag{6.7}$$
It is reasonable to assume, furthermore, that the probability of encountering any particular instance x_i is independent of the hypothesis h. For example, the probability that our training set contains a particular patient x_i is independent of our hypothesis about survival rates (though of course the survival d_i of the patient does depend strongly on h). When x is independent of h we can rewrite the above expression (applying the product rule from Table 6.1) as
$$P(D|h) = \prod_{i=1}^{m} P(x_i, d_i | h) = \prod_{i=1}^{m} P(d_i | h, x_i) P(x_i) \tag{6.8}$$
Now what is the probability P(d_i|h, x_i) of observing d_i = 1 for a single instance x_i, given a world in which hypothesis h holds? Recall that h is our hypothesis regarding the target function, which computes this very probability. Therefore, P(d_i = 1|h, x_i) = h(x_i), and in general
$$P(d_i | h, x_i) = \begin{cases} h(x_i) & \text{if } d_i = 1 \\ 1 - h(x_i) & \text{if } d_i = 0 \end{cases} \tag{6.9}$$
In order to substitute this into Equation (6.8) for P(D|h), let us first re-express it in a more mathematically manipulable form, as
$$P(d_i | h, x_i) = h(x_i)^{d_i} (1 - h(x_i))^{1 - d_i} \tag{6.10}$$
It is easy to verify that the expressions in Equations (6.9) and (6.10) are equivalent. Notice that when d_i = 1, the second term from Equation (6.10), (1 − h(x_i))^(1 − d_i), becomes equal to 1.
Hence P(d_i = 1|h, x_i) = h(x_i), which is equivalent to the first case in Equation (6.9). A similar analysis shows that the two equations are also equivalent when d_i = 0.

We can use Equation (6.10) to substitute for P(d_i|h, x_i) in Equation (6.8) to obtain
$$P(D|h) = \prod_{i=1}^{m} h(x_i)^{d_i} (1 - h(x_i))^{1 - d_i} P(x_i) \tag{6.11}$$
Now we write an expression for the maximum likelihood hypothesis
$$h_{ML} = \operatorname*{argmax}_{h \in H} \prod_{i=1}^{m} h(x_i)^{d_i} (1 - h(x_i))^{1 - d_i} P(x_i)$$
The last term is a constant independent of h, so it can be dropped
$$h_{ML} = \operatorname*{argmax}_{h \in H} \prod_{i=1}^{m} h(x_i)^{d_i} (1 - h(x_i))^{1 - d_i} \tag{6.12}$$
The expression on the right side of Equation (6.12) can be seen as a generalization of the Binomial distribution described in Table 5.3. The expression in Equation (6.12) describes the probability that flipping each of m distinct coins will produce the outcome (d_1 ... d_m), assuming that each coin x_i has probability h(x_i) of producing a heads. Note the Binomial distribution described in Table 5.3 is similar, but makes the additional assumption that the coins have identical probabilities of turning up heads (i.e., that h(x_i) = h(x_j) for all i, j). In both cases we assume the outcomes of the coin flips are mutually independent, an assumption that fits our current setting.

As in earlier cases, we will find it easier to work with the log of the likelihood, yielding
$$h_{ML} = \operatorname*{argmax}_{h \in H} \sum_{i=1}^{m} d_i \ln h(x_i) + (1 - d_i) \ln (1 - h(x_i)) \tag{6.13}$$
Equation (6.13) describes the quantity that must be maximized in order to obtain the maximum likelihood hypothesis in our current problem setting. This result is analogous to our earlier result showing that minimizing the sum of squared errors produces the maximum likelihood hypothesis in the earlier problem setting. Note the similarity between Equation (6.13) and the general form of the entropy function, −Σ_i p_i log p_i, discussed in Chapter 3. Because of this similarity, the negation of the above quantity is sometimes called the cross entropy.

6.5.1 Gradient Search to Maximize Likelihood in a Neural Net

Above we showed that maximizing the quantity in Equation (6.13) yields the maximum likelihood hypothesis. Let us use G(h, D) to denote this quantity.
In this section we derive a weight-training rule for neural network learning that seeks to maximize G(h, D) using gradient ascent. As discussed in Chapter 4, the gradient of G(h, D) is given by the vector of partial derivatives of G(h, D) with respect to the various network weights that define the hypothesis h represented by the learned network (see Chapter 4 for a general discussion of gradient-descent search and for details of the terminology that we reuse here). In this case, the partial derivative of G(h, D) with respect to weight w_jk from input k to unit j is
$$\frac{\partial G(h, D)}{\partial w_{jk}} = \sum_{i=1}^{m} \frac{\partial G(h, D)}{\partial h(x_i)} \frac{\partial h(x_i)}{\partial w_{jk}} = \sum_{i=1}^{m} \frac{\partial (d_i \ln h(x_i) + (1 - d_i) \ln (1 - h(x_i)))}{\partial h(x_i)} \frac{\partial h(x_i)}{\partial w_{jk}} = \sum_{i=1}^{m} \frac{d_i - h(x_i)}{h(x_i)(1 - h(x_i))} \frac{\partial h(x_i)}{\partial w_{jk}} \tag{6.14}$$
To keep our analysis simple, suppose our neural network is constructed from a single layer of sigmoid units. In this case we have
$$\frac{\partial h(x_i)}{\partial w_{jk}} = \sigma'(x_i)\, x_{ijk} = h(x_i)(1 - h(x_i))\, x_{ijk}$$
where x_ijk is the kth input to unit j for the ith training example, and σ'(x) is the derivative of the sigmoid squashing function (again, see Chapter 4). Finally, substituting this expression into Equation (6.14), we obtain a simple expression for the derivatives that constitute the gradient
$$\frac{\partial G(h, D)}{\partial w_{jk}} = \sum_{i=1}^{m} (d_i - h(x_i))\, x_{ijk}$$
Because we seek to maximize rather than minimize P(D|h), we perform gradient ascent rather than gradient descent search. On each iteration of the search the weight vector is adjusted in the direction of the gradient, using the weight-update rule
$$w_{jk} \leftarrow w_{jk} + \Delta w_{jk}$$
where
$$\Delta w_{jk} = \eta \sum_{i=1}^{m} (d_i - h(x_i))\, x_{ijk} \tag{6.15}$$
and where η is a small positive constant that determines the step size of the gradient ascent search.

It is interesting to compare this weight-update rule to the weight-update rule used by the BACKPROPAGATION algorithm to minimize the sum of squared errors between predicted and observed network outputs. The BACKPROPAGATION update rule for output unit weights (see Chapter 4), re-expressed using our current notation, is
$$w_{jk} \leftarrow w_{jk} + \Delta w_{jk}$$
where
$$\Delta w_{jk} = \eta \sum_{i=1}^{m} h(x_i)(1 - h(x_i))(d_i - h(x_i))\, x_{ijk}$$
Notice this is similar to the rule given in Equation (6.15) except for the extra term h(x_i)(1 − h(x_i)), which is the derivative of the sigmoid function.
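The weight-update rule of Equation (6.15) can be sketched for the simplest case of a single sigmoid unit. This is an illustration under stated assumptions rather than the procedure of any particular system: the seven training examples are hypothetical, the first input of each instance is a constant 1 acting as a bias, the weights are written w_k because there is only one unit, and the step size 0.1 and iteration count 2000 are arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, x):
    """h(x): output of a single sigmoid unit; x[0] is a constant bias input of 1."""
    return sigmoid(sum(w * xk for w, xk in zip(weights, x)))

def log_likelihood(weights, data):
    """G(h, D): sum_i d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i)), as in Equation (6.13)."""
    total = 0.0
    for x, d in data:
        h = predict(weights, x)
        total += d * math.log(h) + (1 - d) * math.log(1.0 - h)
    return total

def gradient_ascent_step(weights, data, eta):
    """One update w_k <- w_k + eta * sum_i (d_i - h(x_i)) x_ik, as in Equation (6.15)."""
    delta = [0.0] * len(weights)
    for x, d in data:
        err = d - predict(weights, x)
        for k, xk in enumerate(x):
            delta[k] += eta * err * xk
    return [w + dw for w, dw in zip(weights, delta)]

# Hypothetical data: (bias, symptom) -> observed 0/1 outcome.
data = [((1.0, 0.0), 1), ((1.0, 0.0), 1), ((1.0, 0.0), 0),
        ((1.0, 1.0), 0), ((1.0, 1.0), 0), ((1.0, 1.0), 1), ((1.0, 1.0), 0)]

weights = [0.0, 0.0]
g_start = log_likelihood(weights, data)
for _ in range(2000):
    weights = gradient_ascent_step(weights, data, eta=0.1)
g_end = log_likelihood(weights, data)
p0 = predict(weights, (1.0, 0.0))  # learned P(f(x) = 1) when symptom = 0
p1 = predict(weights, (1.0, 1.0))  # learned P(f(x) = 1) when symptom = 1
print(g_start, g_end, p0, p1)
```

In this toy data the outcome 1 occurs in 2 of 3 cases with symptom 0 and in 1 of 4 cases with symptom 1, so the learned outputs should approach those observed frequencies even though the unit is only ever shown individual 0/1 outcomes.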
To summarize, these two weight update rules converge toward maximum likelihood hypotheses in two different settings. The rule that minimizes sum of squared error seeks the maximum likelihood hypothesis under the assumption that the training data can be modeled by Normally distributed noise added to the target function value. The rule that minimizes cross entropy seeks the maximum likelihood hypothesis under the assumption that the observed boolean value is a probabilistic function of the input instance.

6.6 MINIMUM DESCRIPTION LENGTH PRINCIPLE

Recall from Chapter 3 the discussion of Occam's razor, a popular inductive bias that can be summarized as "choose the shortest explanation for the observed data." In that chapter we discussed several arguments in the long-standing debate regarding Occam's razor. Here we consider a Bayesian perspective on this issue and a closely related principle called the Minimum Description Length (MDL) principle.

The Minimum Description Length principle is motivated by interpreting the definition of h_MAP in the light of basic concepts from information theory. Consider again the now familiar definition of h_MAP.
$$h_{MAP} = \operatorname*{argmax}_{h \in H} P(D|h) P(h)$$
which can be equivalently expressed in terms of maximizing the log_2
$$h_{MAP} = \operatorname*{argmax}_{h \in H} \log_2 P(D|h) + \log_2 P(h)$$
or alternatively, minimizing the negative of this quantity
$$h_{MAP} = \operatorname*{argmin}_{h \in H} - \log_2 P(D|h) - \log_2 P(h) \tag{6.16}$$
Somewhat surprisingly, Equation (6.16) can be interpreted as a statement that short hypotheses are preferred, assuming a particular representation scheme for encoding hypotheses and data.
To explain this, let us introduce a basic result from information theory: Consider the problem of designing a code to transmit messages drawn at random, where the probability of encountering message i is p_i. We are interested here in the most compact code; that is, we are interested in the code that minimizes the expected number of bits we must transmit in order to encode a message drawn at random. Clearly, to minimize the expected code length we should assign shorter codes to messages that are more probable. Shannon and Weaver (1949) showed that the optimal code (i.e., the code that minimizes the expected message length) assigns −log_2 p_i bits† to encode message i. We will refer to the number of bits required to encode message i using code C as the description length of message i with respect to C, which we denote by L_C(i).

† Notice the expected length for transmitting one message is therefore Σ_i −p_i log_2 p_i, the formula for the entropy (see Chapter 3) of the set of possible messages.

Let us interpret Equation (6.16) in light of the above result from coding theory.

• −log_2 P(h) is the description length of h under the optimal encoding for the hypothesis space H. In other words, this is the size of the description of hypothesis h using this optimal representation. In our notation, L_{C_H}(h) = −log_2 P(h), where C_H is the optimal code for hypothesis space H.
• −log_2 P(D|h) is the description length of the training data D given hypothesis h, under its optimal encoding. In our notation, L_{C_{D|h}}(D|h) = −log_2 P(D|h), where C_{D|h} is the optimal code for describing data D assuming that both the sender and receiver know the hypothesis h.
• Therefore we can rewrite Equation (6.16) to show that h_MAP is the hypothesis h that minimizes the sum given by the description length of the hypothesis plus the description length of the data given the hypothesis.
$$h_{MAP} = \operatorname*{argmin}_{h \in H} L_{C_H}(h) + L_{C_{D|h}}(D|h)$$
where C_H and C_{D|h} are the optimal encodings for H and for D given h, respectively.

The Minimum Description Length (MDL) principle recommends choosing the hypothesis that minimizes the sum of these two description lengths. Of course to apply this principle in practice we must choose specific encodings or representations appropriate for the given learning task. Assuming we use the codes C_1 and C_2 to represent the hypothesis and the data given the hypothesis, we can state the MDL principle as

Minimum Description Length principle: Choose h_MDL where
$$h_{MDL} = \operatorname*{argmin}_{h \in H} L_{C_1}(h) + L_{C_2}(D|h) \tag{6.17}$$
The above analysis shows that if we choose C_1 to be the optimal encoding of hypotheses C_H, and if we choose C_2 to be the optimal encoding C_{D|h}, then h_MDL = h_MAP.

Intuitively, we can think of the MDL principle as recommending the shortest method for re-encoding the training data, where we count both the size of the hypothesis and any additional cost of encoding the data given this hypothesis.

Let us consider an example. Suppose we wish to apply the MDL principle to the problem of learning decision trees from some training data. What should we choose for the representations C_1 and C_2 of hypotheses and data? For C_1 we might naturally choose some obvious encoding of decision trees, in which the description length grows with the number of nodes in the tree and with the number of edges. How shall we choose the encoding C_2 of the data given a particular decision tree hypothesis? To keep things simple, suppose that the sequence of instances (x_1 ... x_m) is already known to both the transmitter and receiver, so that we need only transmit the classifications (f(x_1) ... f(x_m)). (Note the cost of transmitting the instances themselves is independent of the correct hypothesis, so it does not affect the selection of h_MDL in any case.) Now if the training classifications (f(x_1) ... f(x_m)) are identical to the predictions of the hypothesis, then there is no need to transmit any information about these examples (the receiver can compute these values once it has received the hypothesis). The description length of the classifications given the hypothesis in this case is, therefore, zero. In the case where some examples are misclassified by h, then for each misclassification we need to transmit a message that identifies which example is misclassified (which can be done using at most log_2 m bits) as well as its correct classification (which can be done using at most log_2 k bits, where k is the number of possible classifications). The hypothesis h_MDL under the encodings C_1 and C_2 is just the one that minimizes the sum of these description lengths.

Thus the MDL principle provides a way of trading off hypothesis complexity for the number of errors committed by the hypothesis. It might select a shorter hypothesis that makes a few errors over a longer hypothesis that perfectly classifies the training data. Viewed in this light, it provides one method for dealing with the issue of overfitting the data.

Quinlan and Rivest (1989) describe experiments applying the MDL principle to choose the best size for a decision tree. They report that the MDL-based method produced learned trees whose accuracy was comparable to that of the standard tree-pruning methods discussed in Chapter 3. Mehta et al. (1995) describe an alternative MDL-based approach to decision tree pruning, and describe experiments in which an MDL-based approach produced results comparable to standard tree-pruning methods.

What shall we conclude from this analysis of the Minimum Description Length principle? Does this prove once and for all that short hypotheses are best? No.
What we have shown is only that if a representation of hypotheses is chosen so that the size of hypothesis h is −log_2 P(h), and if a representation for exceptions is chosen so that the encoding length of D given h is equal to −log_2 P(D|h), then the MDL principle produces MAP hypotheses. However, to show that we have such a representation we must know all the prior probabilities P(h), as well as the P(D|h). There is no reason to believe that the MDL hypothesis relative to arbitrary encodings C_1 and C_2 should be preferred. As a practical matter it might sometimes be easier for a human designer to specify a representation that captures knowledge about the relative probabilities of hypotheses than it is to fully specify the probability of each hypothesis. Descriptions in the literature on the application of MDL to practical learning problems often include arguments providing some form of justification for the encodings chosen for C_1 and C_2.

6.7 BAYES OPTIMAL CLASSIFIER

So far we have considered the question "what is the most probable hypothesis given the training data?" In fact, the question that is often of most significance is the closely related question "what is the most probable classification of the new instance given the training data?" Although it may seem that this second question can be answered by simply applying the MAP hypothesis to the new instance, in fact it is possible to do better.

To develop some intuitions consider a hypothesis space containing three hypotheses, h_1, h_2, and h_3. Suppose that the posterior probabilities of these hypotheses given the training data are .4, .3, and .3 respectively. Thus, h_1 is the MAP hypothesis. Suppose a new instance x is encountered, which is classified positive by h_1, but negative by h_2 and h_3. Taking all hypotheses into account, the probability that x is positive is .4 (the probability associated with h_1), and the probability that it is negative is therefore .6.
The most probable classification (negative) in this case is different from the classification generated by the MAP hypothesis.

In general, the most probable classification of the new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities. If the possible classification of the new example can take on any value v_j from some set V, then the probability P(v_j|D) that the correct classification for the new instance is v_j, is just
$$P(v_j | D) = \sum_{h_i \in H} P(v_j | h_i) P(h_i | D)$$
The optimal classification of the new instance is the value v_j for which P(v_j|D) is maximum.

Bayes optimal classification:
$$\operatorname*{argmax}_{v_j \in V} \sum_{h_i \in H} P(v_j | h_i) P(h_i | D) \tag{6.18}$$
To illustrate in terms of the above example, the set of possible classifications of the new instance is V = {⊕, ⊖}, and

P(h_1|D) = .4, P(⊖|h_1) = 0, P(⊕|h_1) = 1
P(h_2|D) = .3, P(⊖|h_2) = 1, P(⊕|h_2) = 0
P(h_3|D) = .3, P(⊖|h_3) = 1, P(⊕|h_3) = 0

therefore
$$\sum_{h_i \in H} P(\oplus | h_i) P(h_i | D) = .4 \qquad \sum_{h_i \in H} P(\ominus | h_i) P(h_i | D) = .6$$
and
$$\operatorname*{argmax}_{v_j \in \{\oplus, \ominus\}} \sum_{h_i \in H} P(v_j | h_i) P(h_i | D) = \ominus$$
Any system that classifies new instances according to Equation (6.18) is called a Bayes optimal classifier, or Bayes optimal learner. No other classification method using the same hypothesis space and same prior knowledge can outperform this method on average. This method maximizes the probability that the new instance is classified correctly, given the available data, hypothesis space, and prior probabilities over the hypotheses.

For example, in learning boolean concepts using version spaces as in the earlier section, the Bayes optimal classification of a new instance is obtained by taking a weighted vote among all members of the version space, with each candidate hypothesis weighted by its posterior probability.

Note one curious property of the Bayes optimal classifier is that the predictions it makes can correspond to a hypothesis not contained in H! Imagine using Equation (6.18) to classify every instance in X. The labeling of instances defined in this way need not correspond to the instance labeling of any single hypothesis h from H.
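The weighted vote of Equation (6.18) can be sketched in a few lines. The posteriors and predictions below are those of the three-hypothesis example in the text; the function name `bayes_optimal`, the dictionary layout, and the labels "+" and "-" are illustrative choices rather than notation from the text.

```python
def bayes_optimal(posteriors, predictions, values):
    """Return (best value, {value: P(value|D)}) following Equation (6.18).

    posteriors:  {hypothesis: P(h|D)}
    predictions: {hypothesis: {value: P(value|h)}}
    """
    scores = {
        v: sum(predictions[h].get(v, 0.0) * p for h, p in posteriors.items())
        for v in values
    }
    return max(values, key=lambda v: scores[v]), scores

# The three-hypothesis example: h1 says positive, h2 and h3 say negative.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

best, scores = bayes_optimal(posteriors, predictions, ["+", "-"])

# For comparison: the label assigned by the single MAP hypothesis.
map_h = max(posteriors, key=posteriors.get)
map_label = max(predictions[map_h], key=predictions[map_h].get)
print(best, scores, map_h, map_label)
```

The combined vote selects the negative label while the MAP hypothesis h1 alone would select the positive one, which is the discrepancy the example is meant to expose.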
One way to view predictions that match no single hypothesis in H is to think of the Bayes optimal classifier as effectively considering a hypothesis space H' different from the space of hypotheses H to which Bayes theorem is being applied. In particular, H' effectively includes hypotheses that perform comparisons between linear combinations of predictions from multiple hypotheses in H.

6.8 GIBBS ALGORITHM

Although the Bayes optimal classifier obtains the best performance that can be achieved from the given training data, it can be quite costly to apply. The expense is due to the fact that it computes the posterior probability for every hypothesis in H and then combines the predictions of each hypothesis to classify each new instance.

An alternative, less optimal method is the Gibbs algorithm (see Opper and Haussler 1991), defined as follows:

1. Choose a hypothesis h from H at random, according to the posterior probability distribution over H.
2. Use h to predict the classification of the next instance x.

Given a new instance to classify, the Gibbs algorithm simply applies a hypothesis drawn at random according to the current posterior probability distribution. Surprisingly, it can be shown that under certain conditions the expected misclassification error for the Gibbs algorithm is at most twice the expected error of the Bayes optimal classifier (Haussler et al. 1994). More precisely, the expected value is taken over target concepts drawn at random according to the prior probability distribution assumed by the learner. Under this condition, the expected value of the error of the Gibbs algorithm is at worst twice the expected value of the error of the Bayes optimal classifier.

This result has an interesting implication for the concept learning problem described earlier.
In particular, it implies that if the learner assumes a uniform prior over H, and if target concepts are in fact drawn from such a distribution when presented to the learner, then classifying the next instance according to a hypothesis drawn at random from the current version space (according to a uniform distribution), will have expected error at most twice that of the Bayes optimal classifier. Again, we have an example where a Bayesian analysis of a non-Bayesian algorithm yields insight into the performance of that algorithm.

6.9 NAIVE BAYES CLASSIFIER

One highly practical Bayesian learning method is the naive Bayes learner, often called the naive Bayes classifier. In some domains its performance has been shown to be comparable to that of neural network and decision tree learning. This section introduces the naive Bayes classifier; the next section applies it to the practical problem of learning to classify natural language text documents.

The naive Bayes classifier applies to learning tasks where each instance x is described by a conjunction of attribute values and where the target function f(x) can take on any value from some finite set V. A set of training examples of the target function is provided, and a new instance is presented, described by the tuple of attribute values (a_1, a_2 ... a_n). The learner is asked to predict the target value, or classification, for this new instance.

The Bayesian approach to classifying the new instance is to assign the most probable target value, v_MAP, given the attribute values (a_1, a_2 ... a_n) that describe the instance.
$$v_{MAP} = \operatorname*{argmax}_{v_j \in V} P(v_j | a_1, a_2 \ldots a_n)$$
We can use Bayes theorem to rewrite this expression as
$$v_{MAP} = \operatorname*{argmax}_{v_j \in V} \frac{P(a_1, a_2 \ldots a_n | v_j) P(v_j)}{P(a_1, a_2 \ldots a_n)} = \operatorname*{argmax}_{v_j \in V} P(a_1, a_2 \ldots a_n | v_j) P(v_j) \tag{6.19}$$
Now we could attempt to estimate the two terms in Equation (6.19) based on the training data. It is easy to estimate each of the P(v_j) simply by counting the frequency with which each target value v_j occurs in the training data. However, estimating the different P(a_1, a_2 ... a_n|v_j) terms in this fashion is not feasible unless we have a very, very large set of training data. The problem is that the number of these terms is equal to the number of possible instances times the number of possible target values. Therefore, we need to see every instance in the instance space many times in order to obtain reliable estimates.

The naive Bayes classifier is based on the simplifying assumption that the attribute values are conditionally independent given the target value. In other words, the assumption is that given the target value of the instance, the probability of observing the conjunction a_1, a_2 ... a_n is just the product of the probabilities for the individual attributes: P(a_1, a_2 ... a_n|v_j) = Π_i P(a_i|v_j). Substituting this into Equation (6.19), we have the approach used by the naive Bayes classifier.

Naive Bayes classifier:
$$v_{NB} = \operatorname*{argmax}_{v_j \in V} P(v_j) \prod_{i} P(a_i | v_j) \tag{6.20}$$
where v_NB denotes the target value output by the naive Bayes classifier. Notice that in a naive Bayes classifier the number of distinct P(a_i|v_j) terms that must be estimated from the training data is just the number of distinct attribute values times the number of distinct target values, a much smaller number than if we were to estimate the P(a_1, a_2 ... a_n|v_j) terms as first contemplated.

To summarize, the naive Bayes learning method involves a learning step in which the various P(v_j) and P(a_i|v_j) terms are estimated, based on their frequencies over the training data. The set of these estimates corresponds to the learned hypothesis. This hypothesis is then used to classify each new instance by applying the rule in Equation (6.20). Whenever the naive Bayes assumption of conditional independence is satisfied, this naive Bayes classification v_NB is identical to the MAP classification.
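The learning and classification steps just summarized can be sketched as follows. This is an illustrative sketch, not a reference implementation: the helper names `train_naive_bayes` and `classify` are hypothetical, probabilities are plain observed frequencies, and the 14 rows are our transcription of the PlayTennis data (Table 3.2 of Chapter 3), which should be treated as an assumption and checked against that table.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Estimate P(v) and P(a_i | v) by frequency counts over (attributes, target) pairs."""
    class_counts = Counter(v for _, v in examples)
    attr_counts = defaultdict(Counter)  # (attribute position, target) -> value counts
    for attrs, v in examples:
        for i, a in enumerate(attrs):
            attr_counts[(i, v)][a] += 1
    priors = {v: n / len(examples) for v, n in class_counts.items()}

    def cond(i, a, v):
        """Frequency estimate of P(attribute i has value a | target v)."""
        return attr_counts[(i, v)][a] / class_counts[v]

    return priors, cond

def classify(priors, cond, attrs):
    """Apply Equation (6.20): pick the target value maximizing P(v) * prod_i P(a_i | v)."""
    scores = {}
    for v, prior in priors.items():
        score = prior
        for i, a in enumerate(attrs):
            score *= cond(i, a, v)
        scores[v] = score
    return max(scores, key=scores.get), scores

# (Outlook, Temperature, Humidity, Wind) -> PlayTennis; transcribed, see lead-in.
ROWS = [
    ("sunny", "hot", "high", "weak", "no"), ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"), ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"), ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"), ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"), ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"), ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"), ("rain", "mild", "high", "strong", "no"),
]
examples = [(row[:4], row[4]) for row in ROWS]

priors, cond = train_naive_bayes(examples)
label, scores = classify(priors, cond, ("sunny", "cool", "high", "strong"))
p_no = scores["no"] / (scores["yes"] + scores["no"])  # normalized so the two sum to one
print(label, scores, p_no)
```

Note that "learning" here consists only of counting; no search over hypotheses takes place, and the unnormalized scores are enough to pick the winning target value.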
One interesting difference between the naive Bayes learning method and other learning methods we have considered is that there is no explicit search through the space of possible hypotheses (in this case, the space of possible hypotheses is the space of possible values that can be assigned to the various $P(v_j)$ and $P(a_i \mid v_j)$ terms). Instead, the hypothesis is formed without searching, simply by counting the frequency of various data combinations within the training examples.

6.9.1 An Illustrative Example

Let us apply the naive Bayes classifier to a concept learning problem we considered during our discussion of decision tree learning: classifying days according to whether someone will play tennis. Table 3.2 from Chapter 3 provides a set of 14 training examples of the target concept PlayTennis, where each day is described by the attributes Outlook, Temperature, Humidity, and Wind. Here we use the naive Bayes classifier and the training data from this table to classify the following novel instance:

(Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong)

Our task is to predict the target value (yes or no) of the target concept PlayTennis for this new instance. Instantiating Equation (6.20) to fit the current task, the target value $v_{NB}$ is given by

$$\begin{aligned} v_{NB} &= \operatorname*{argmax}_{v_j \in \{yes, no\}} P(v_j) \prod_i P(a_i \mid v_j) \\ &= \operatorname*{argmax}_{v_j \in \{yes, no\}} P(v_j)\, P(Outlook = sunny \mid v_j)\, P(Temperature = cool \mid v_j)\, P(Humidity = high \mid v_j)\, P(Wind = strong \mid v_j) \end{aligned} \tag{6.21}$$

Notice in the final expression that $a_i$ has been instantiated using the particular attribute values of the new instance. To calculate $v_{NB}$ we now require 10 probabilities that can be estimated from the training data. First, the probabilities of the different target values can easily be estimated based on their frequencies over the 14 training examples:

P(PlayTennis = yes) = 9/14 = .64
P(PlayTennis = no) = 5/14 = .36

Similarly, we can estimate the conditional probabilities.
For example, those for Wind = strong are

P(Wind = strong | PlayTennis = yes) = 3/9 = .33
P(Wind = strong | PlayTennis = no) = 3/5 = .60

Using these probability estimates and similar estimates for the remaining attribute values, we calculate $v_{NB}$ according to Equation (6.21) as follows (now omitting attribute names for brevity):

P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes) = .0053
P(no) P(sunny | no) P(cool | no) P(high | no) P(strong | no) = .0206

Thus, the naive Bayes classifier assigns the target value PlayTennis = no to this new instance, based on the probability estimates learned from the training data. Furthermore, by normalizing the above quantities to sum to one we can calculate the conditional probability that the target value is no, given the observed attribute values. For the current example, this probability is $\frac{.0206}{.0206 + .0053} = .795$.

6.9.1.1 ESTIMATING PROBABILITIES

Up to this point we have estimated probabilities by the fraction of times the event is observed to occur over the total number of opportunities. For example, in the above case we estimated P(Wind = strong | PlayTennis = no) by the fraction $\frac{n_c}{n}$, where n = 5 is the total number of training examples for which PlayTennis = no, and $n_c = 3$ is the number of these for which Wind = strong.

While this observed fraction provides a good estimate of the probability in many cases, it provides poor estimates when $n_c$ is very small. To see the difficulty, imagine that, in fact, the value of P(Wind = strong | PlayTennis = no) is .08 and that we have a sample containing only 5 examples for which PlayTennis = no. Then the most probable value for $n_c$ is 0. This raises two difficulties. First, $\frac{n_c}{n}$ produces a biased underestimate of the probability. Second, when this probability estimate is zero, this probability term will dominate the Bayes classifier if the future query contains Wind = strong. The reason is that the quantity calculated in Equation (6.20) requires multiplying all the other probability terms by this zero value.
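The arithmetic of the PlayTennis example can be checked with a short script. This is a sketch under one stated assumption: apart from the priors and the Wind = strong fractions quoted above, the counts below are read off Table 3.2 of Chapter 3, which is not reproduced in this section. The variable names are illustrative only.

```python
# Each entry is (n_c, n): the number of matching examples over the number of opportunities.
# Priors and the "strong" entries are quoted in the text; the rest are taken from Table 3.2.
counts = {
    "yes": {"prior": (9, 14), "sunny": (2, 9), "cool": (3, 9), "high": (3, 9), "strong": (3, 9)},
    "no":  {"prior": (5, 14), "sunny": (3, 5), "cool": (1, 5), "high": (4, 5), "strong": (3, 5)},
}

def score(label):
    """P(v) times the product of the four conditional probability estimates."""
    s = 1.0
    for n_c, n in counts[label].values():
        s *= n_c / n
    return s

scores = {label: score(label) for label in counts}
v_nb = max(scores, key=scores.get)               # classification chosen by Equation (6.20)
p_no = scores["no"] / sum(scores.values())       # normalize so the two scores sum to one

print(v_nb, round(scores["yes"], 4), round(scores["no"], 4), round(p_no, 3))
# prints: no 0.0053 0.0206 0.795
```

Replacing any single `(n_c, n)` pair with `(0, n)` drives the whole product for that class to zero, which is the difficulty with raw frequency estimates described above.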
To avoid this difficulty we can adopt a Bayesian approach to estimating the probability, using the m-estimate defined as follows.

m-estimate of probability:

$$\frac{n_c + m p}{n + m} \tag{6.22}$$

Here, $n_c$ and n are defined as before, p is our prior estimate of the probability we wish to determine, and m is a constant called the equivalent sample size, which determines how heavily to weight p relative to the observed data. A typical method for choosing p in the absence of other information is to assume uniform priors; that is, if an attribute has k possible values we set $p = \frac{1}{k}$. For example, in estimating P(Wind = strong | PlayTennis = no) we note the attribute Wind has two possible values, so uniform priors would correspond to choosing p = .5. Note that if m is zero, the m-estimate is equivalent to the simple fraction $\frac{n_c}{n}$. If both n and m are nonzero, then the observed fraction $\frac{n_c}{n}$ and prior p will be combined according to the weight m. The reason m is called the equivalent sample size is that Equation (6.22) can be interpreted as augmenting the n actual observations by an additional m virtual samples distributed according to p.

6.10 AN EXAMPLE: LEARNING TO CLASSIFY TEXT

To illustrate the practical importance of Bayesian learning methods, consider learning problems in which the instances are text documents. For example, we might wish to learn the target concept "electronic news articles that I find interesting," or "pages on the World Wide Web that discuss machine learning topics." In both cases, if a computer could learn the target concept accurately, it could automatically filter the large volume of online text documents to present only the most relevant documents to the user.

We present here a general algorithm for learning to classify text, based on the naive Bayes classifier. Interestingly, probabilistic approaches such as the one described here are among the most effective algorithms currently known for learning to classify text documents.
Examples of such systems are described by Lewis (1991), Lang (1995), and Joachims (1996).

The naive Bayes algorithm that we shall present applies in the following general setting. Consider an instance space X consisting of all possible text documents (i.e., all possible strings of words and punctuation of all possible lengths). We are given training examples of some unknown target function f(x), which can take on any value from some finite set V. The task is to learn from these training examples to predict the target value for subsequent text documents. For illustration, we will consider the target function classifying documents as interesting or uninteresting to a particular person, using the target values like and dislike to indicate these two classes.

The two main design issues involved in applying the naive Bayes classifier to such text classification problems are first to decide how to represent an arbitrary text document in terms of attribute values, and second to decide how to estimate the probabilities required by the naive Bayes classifier.

Our approach to representing arbitrary text documents is disturbingly simple: Given a text document, such as this paragraph, we define an attribute for each word position in the document and define the value of that attribute to be the English word found in that position. Thus, the current paragraph would be described by 111 attribute values, corresponding to the 111 word positions. The value of the first attribute is the word "our," the value of the second attribute is the word "approach," and so on. Notice that long text documents will require a larger number of attributes than short documents. As we shall see, this will not cause us any trouble.

Given this representation for text documents, we can now apply the naive Bayes classifier.
For the sake of concreteness, let us assume we are given a set of 700 training documents that a friend has classified as dislike and another 300 she has classified as like. We are now given a new document and asked to classify it. Again, for concreteness let us assume the new text document is the preceding paragraph. In this case, we instantiate Equation (6.20) to calculate the naive Bayes classification as

$$\begin{aligned} v_{NB} &= \operatorname*{argmax}_{v_j \in \{like, dislike\}} P(v_j) \prod_{i=1}^{111} P(a_i \mid v_j) \\ &= \operatorname*{argmax}_{v_j \in \{like, dislike\}} P(v_j)\, P(a_1 = \text{"our"} \mid v_j)\, P(a_2 = \text{"approach"} \mid v_j) \cdots P(a_{111} = \text{"trouble"} \mid v_j) \end{aligned}$$

To summarize, the naive Bayes classification $v_{NB}$ is the classification that maximizes the probability of observing the words that were actually found in the document, subject to the usual naive Bayes independence assumption. The independence assumption $P(a_1, \ldots a_{111} \mid v_j) = \prod_{i=1}^{111} P(a_i \mid v_j)$ states in this setting that the word probabilities for one text position are independent of the words that occur in other positions, given the document classification $v_j$. Note this assumption is clearly incorrect. For example, the probability of observing the word "learning" in some position may be greater if the preceding word is "machine." Despite the obvious inaccuracy of this independence assumption, we have little choice but to make it: without it, the number of probability terms that must be computed is prohibitive. Fortunately, in practice the naive Bayes learner performs remarkably well in many text classification problems despite the incorrectness of this independence assumption. Domingos and Pazzani (1996) provide an interesting analysis of this fortunate phenomenon.

To calculate $v_{NB}$ using the above expression, we require estimates for the probability terms $P(v_j)$ and $P(a_i = w_k \mid v_j)$ (here we introduce $w_k$ to indicate the kth word in the English vocabulary). The first of these can easily be estimated based on the fraction of each class in the training data (P(like) = .3 and P(dislike) = .7 in the current example).
As usual, estimating the class conditional probabilities (e.g., $P(a_1 = \text{"our"} \mid dislike)$) is more problematic because we must estimate one such probability term for each combination of text position, English word, and target value. Unfortunately, there are approximately 50,000 distinct words in the English vocabulary, 2 possible target values, and 111 text positions in the current example, so we must estimate $2 \cdot 111 \cdot 50{,}000 \approx 10$ million such terms from the training data.

Fortunately, we can make an additional reasonable assumption that reduces the number of probabilities that must be estimated. In particular, we shall assume the probability of encountering a specific word $w_k$ (e.g., "chocolate") is independent of the specific word position being considered (e.g., $a_{23}$ versus $a_{95}$). More formally, this amounts to assuming that the attributes are independent and identically distributed, given the target classification; that is, $P(a_i = w_k \mid v_j) = P(a_m = w_k \mid v_j)$ for all i, j, k, m. Therefore, we estimate the entire set of probabilities $P(a_1 = w_k \mid v_j), P(a_2 = w_k \mid v_j) \ldots$ by the single position-independent probability $P(w_k \mid v_j)$, which we will use regardless of the word position. The net effect is that we now require only $2 \cdot 50{,}000$ distinct terms of the form $P(w_k \mid v_j)$. This is still a large number, but manageable. Notice in cases where training data is limited, the primary advantage of making this assumption is that it increases the number of examples available to estimate each of the required probabilities, thereby increasing the reliability of the estimates.

To complete the design of our learning algorithm, we must still choose a method for estimating the probability terms. We adopt the m-estimate (Equation (6.22)) with uniform priors and with m equal to the size of the word vocabulary.
Thus, the estimate for $P(w_k \mid v_j)$ will be

$$\frac{n_k + 1}{n + |Vocabulary|}$$

where n is the total number of word positions in all training examples whose target value is $v_j$, $n_k$ is the number of times word $w_k$ is found among these n word positions, and |Vocabulary| is the total number of distinct words (and other tokens) found within the training data.

To summarize, the final algorithm uses a naive Bayes classifier together with the assumption that the probability of word occurrence is independent of position within the text. The final algorithm is shown in Table 6.2. Notice the algorithm is quite simple. During learning, the procedure LEARN_NAIVE_BAYES_TEXT examines all training documents to extract the vocabulary of all words and tokens that appear in the text, then counts their frequencies among the different target classes to obtain the necessary probability estimates. Later, given a new document to be classified, the procedure CLASSIFY_NAIVE_BAYES_TEXT uses these probability estimates to calculate $v_{NB}$ according to Equation (6.20). Note that any words appearing in the new document that were not observed in the training set are simply ignored by CLASSIFY_NAIVE_BAYES_TEXT. Code for this algorithm, as well as training data sets, are available on the World Wide Web at http://www.cs.cmu.edu/~tom/book.html.

6.10.1 Experimental Results

How effective is the learning algorithm of Table 6.2? In one experiment (see Joachims 1996), a minor variant of this algorithm was applied to the problem of classifying usenet news articles. The target classification for an article in this case was the name of the usenet newsgroup in which the article appeared. One can think of the task as creating a newsgroup posting service that learns to assign documents to the appropriate newsgroup. In the experiment described by Joachims (1996), 20 electronic newsgroups were considered (listed in Table 6.3). Then 1,000 articles were collected from each newsgroup, forming a data set of 20,000 documents.
The naive Bayes algorithm was then applied using two-thirds of these 20,000 documents as training examples, and performance was measured over the remaining third. Given 20 possible newsgroups, we would expect random guessing to achieve a classification accuracy of approximately 5%.

LEARN_NAIVE_BAYES_TEXT(Examples, V)

Examples is a set of text documents along with their target values. V is the set of all possible target values. This function learns the probability terms $P(w_k \mid v_j)$, describing the probability that a randomly drawn word from a document in class $v_j$ will be the English word $w_k$. It also learns the class prior probabilities $P(v_j)$.

1. collect all words, punctuation, and other tokens that occur in Examples
   • Vocabulary ← the set of all distinct words and other tokens occurring in any text document from Examples
2. calculate the required $P(v_j)$ and $P(w_k \mid v_j)$ probability terms
   • For each target value $v_j$ in V do
     • $docs_j$ ← the subset of documents from Examples for which the target value is $v_j$
     • $P(v_j) \leftarrow \frac{|docs_j|}{|Examples|}$
     • $Text_j$ ← a single document created by concatenating all members of $docs_j$
     • n ← total number of distinct word positions in $Text_j$
     • for each word $w_k$ in Vocabulary
       • $n_k$ ← number of times word $w_k$ occurs in $Text_j$
       • $P(w_k \mid v_j) \leftarrow \frac{n_k + 1}{n + |Vocabulary|}$

CLASSIFY_NAIVE_BAYES_TEXT(Doc)

Return the estimated target value for the document Doc. $a_i$ denotes the word found in the ith position within Doc.

   • positions ← all word positions in Doc that contain tokens found in Vocabulary
   • Return $v_{NB}$, where

$$v_{NB} = \operatorname*{argmax}_{v_j \in V} P(v_j) \prod_{i \in positions} P(a_i \mid v_j)$$

TABLE 6.2 Naive Bayes algorithms for learning and classifying text. In addition to the usual naive Bayes assumptions, these algorithms assume the probability of a word occurring is independent of its position within the text.
The accuracy achieved by the program was 89%. The algorithm used in these experiments was exactly the algorithm of Table 6.2, with one exception: Only a subset of the words occurring in the documents were included as the value of the Vocabulary variable in the algorithm. In particular, the 100 most frequent words were removed (these include words such as "the" and "of"), and any word occurring fewer than three times was also removed. The resulting vocabulary contained approximately 38,500 words.

TABLE 6.3 Twenty usenet newsgroups used in the text classification experiment. After training on 667 articles from each newsgroup, a naive Bayes classifier achieved an accuracy of 89% predicting to which newsgroup subsequent articles belonged. Random guessing would produce an accuracy of only 5%.

Similarly impressive results have been achieved by others applying similar statistical learning approaches to text classification. For example, Lang (1995) describes another variant of the naive Bayes algorithm and its application to learning the target concept "usenet articles that I find interesting." He describes the NEWSWEEDER system, a program for reading netnews that allows the user to rate articles as he or she reads them. NEWSWEEDER then uses these rated articles as training examples to learn to predict which subsequent articles will be of interest to the user, so that it can bring these to the user's attention. Lang (1995) reports experiments in which NEWSWEEDER used its learned profile of user interests to suggest the most highly rated new articles each day. By presenting the user with the top 10% of its automatically rated new articles each day, it created a pool of articles containing three to four times as many interesting articles as the general pool of articles read by the user. For example, for one user the fraction of articles rated "interesting" was 16% overall, but was 59% among the articles recommended by NEWSWEEDER.
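The two procedures of Table 6.2 can be sketched in Python as follows. This is a minimal sketch, not the code distributed with the book: the function signatures, the token-list representation of documents, and the toy training set are assumptions made here. It also sums logarithms rather than multiplying probabilities, a common numerical substitution that leaves the argmax unchanged but is not part of Table 6.2.

```python
import math
from collections import Counter

def learn_naive_bayes_text(examples, targets):
    """examples: list of (list_of_tokens, target_value) pairs.
    Returns (vocabulary, prior, cond) with cond[v][w] = (n_k + 1) / (n + |Vocabulary|)."""
    vocabulary = {w for tokens, _ in examples for w in tokens}
    prior, cond = {}, {}
    for v in targets:
        docs_v = [tokens for tokens, label in examples if label == v]
        prior[v] = len(docs_v) / len(examples)
        text_v = [w for tokens in docs_v for w in tokens]  # concatenation of docs_v
        n = len(text_v)                                    # word positions in Text_j
        n_k = Counter(text_v)
        cond[v] = {w: (n_k[w] + 1) / (n + len(vocabulary)) for w in vocabulary}
    return vocabulary, prior, cond

def classify_naive_bayes_text(doc, vocabulary, prior, cond):
    """Return the class maximizing P(v) * prod P(a_i | v) over known word positions."""
    positions = [w for w in doc if w in vocabulary]  # unseen tokens are ignored
    def log_score(v):
        # Assumes every class has at least one training document, so prior[v] > 0.
        return math.log(prior[v]) + sum(math.log(cond[v][w]) for w in positions)
    return max(prior, key=log_score)

# Hypothetical toy training set, invented for illustration.
examples = [
    ("machine learning is fun".split(), "like"),
    ("learning theory".split(), "like"),
    ("stock market news".split(), "dislike"),
    ("market crash".split(), "dislike"),
]
vocabulary, prior, cond = learn_naive_bayes_text(examples, ["like", "dislike"])
label = classify_naive_bayes_text("learning about machine markets".split(),
                                  vocabulary, prior, cond)
print(label)
```

Because the estimate adds one virtual occurrence of every vocabulary word to each class, no conditional probability is ever zero, so the logarithms are always defined.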
Several other, non-Bayesian, statistical text learning algorithms are common, many based on similarity metrics initially developed for information retrieval (e.g., see Rocchio 1971; Salton 1991). Additional text learning algorithms are described in Hearst and Hirsh (1996).

6.11 BAYESIAN BELIEF NETWORKS

As discussed in the previous two sections, the naive Bayes classifier makes significant use of the assumption that the values of the attributes $a_1 \ldots a_n$ are conditionally independent given the target value v. This assumption dramatically reduces the complexity of learning the target function. When it is met, the naive Bayes classifier outputs the optimal Bayes classification. However, in many cases this conditional independence assumption is clearly overly restrictive.

A Bayesian belief network describes the probability distribution governing a set of variables by specifying a set of conditional independence assumptions along with a set of conditional probabilities. In contrast to the naive Bayes classifier, which assumes that all the variables are conditionally independent given the value of the target variable, Bayesian belief networks allow stating conditional independence assumptions that apply to subsets of the variables. Thus, Bayesian belief networks provide an intermediate approach that is less constraining than the global assumption of conditional independence made by the naive Bayes classifier, but more tractable than avoiding conditional independence assumptions altogether. Bayesian belief networks are an active focus of current research, and a variety of algorithms have been proposed for learning them and for using them for inference.

In this section we introduce the key concepts and the representation of Bayesian belief networks. More detailed treatments are given by Pearl (1988), Russell and Norvig (1995), Heckerman et al. (1995), and Jensen (1996).
In general, a Bayesian belief network describes the probability distribution over a set of variables. Consider an arbitrary set of random variables $Y_1 \ldots Y_n$, where each variable $Y_i$ can take on the set of possible values $V(Y_i)$. We define the joint space of the set of variables Y to be the cross product $V(Y_1) \times V(Y_2) \times \ldots V(Y_n)$. In other words, each item in the joint space corresponds to one of the possible assignments of values to the tuple of variables $\langle Y_1 \ldots Y_n \rangle$. The probability distribution over this joint space is called the joint probability distribution. The joint probability distribution specifies the probability for each of the possible variable bindings for the tuple $\langle Y_1 \ldots Y_n \rangle$. A Bayesian belief network describes the joint probability distribution for a set of variables.

6.11.1 Conditional Independence

Let us begin our discussion of Bayesian belief networks by defining precisely the notion of conditional independence. Let X, Y, and Z be three discrete-valued random variables. We say that X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given a value for Z; that is, if

$$(\forall x_i, y_j, z_k)\quad P(X = x_i \mid Y = y_j, Z = z_k) = P(X = x_i \mid Z = z_k)$$

where $x_i \in V(X)$, $y_j \in V(Y)$, and $z_k \in V(Z)$. We commonly write the above expression in abbreviated form as $P(X \mid Y, Z) = P(X \mid Z)$. This definition of conditional independence can be extended to sets of variables as well. We say that the set of variables $X_1 \ldots X_l$ is conditionally independent of the set of variables $Y_1 \ldots Y_m$ given the set of variables $Z_1 \ldots Z_n$ if

$$P(X_1 \ldots X_l \mid Y_1 \ldots Y_m, Z_1 \ldots Z_n) = P(X_1 \ldots X_l \mid Z_1 \ldots Z_n)$$

Note the correspondence between this definition and our use of conditional independence in the definition of the naive Bayes classifier. The naive Bayes classifier assumes that the instance attribute $A_1$ is conditionally independent of instance attribute $A_2$ given the target value V.
This allows the naive Bayes classifier to calculate $P(A_1, A_2 \mid V)$ in Equation (6.20) as follows

$$P(A_1, A_2 \mid V) = P(A_1 \mid A_2, V)\, P(A_2 \mid V) \tag{6.23}$$

$$= P(A_1 \mid V)\, P(A_2 \mid V) \tag{6.24}$$

Equation (6.23) is just the general form of the product rule of probability from Table 6.1. Equation (6.24) follows because if $A_1$ is conditionally independent of $A_2$ given V, then by our definition of conditional independence $P(A_1 \mid A_2, V) = P(A_1 \mid V)$.

Conditional probability table for the Campfire node:

          S,B    S,¬B   ¬S,B   ¬S,¬B
  C       0.4    0.1    0.8    0.2
 ¬C       0.6    0.9    0.2    0.8

FIGURE 6.3 A Bayesian belief network. The network on the left represents a set of conditional independence assumptions. In particular, each node is asserted to be conditionally independent of its nondescendants, given its immediate parents. Associated with each node is a conditional probability table, which specifies the conditional distribution for the variable given its immediate parents in the graph. The conditional probability table for the Campfire node is shown at the right, where Campfire is abbreviated to C, Storm abbreviated to S, and BusTourGroup abbreviated to B.

6.11.2 Representation

A Bayesian belief network (Bayesian network for short) represents the joint probability distribution for a set of variables. For example, the Bayesian network in Figure 6.3 represents the joint probability distribution over the boolean variables Storm, Lightning, Thunder, ForestFire, Campfire, and BusTourGroup. In general, a Bayesian network represents the joint probability distribution by specifying a set of conditional independence assumptions (represented by a directed acyclic graph), together with sets of local conditional probabilities. Each variable in the joint space is represented by a node in the Bayesian network. For each variable two types of information are specified. First, the network arcs represent the assertion that the variable is conditionally independent of its nondescendants in the network given its immediate predecessors in the network. We say X is a descendant of Y if there is a directed path from Y to X.
Second, a conditional probability table is given for each variable, describing the probability distribution for that variable given the values of its immediate predecessors. The joint probability for any desired assignment of values $\langle y_1, \ldots, y_n \rangle$ to the tuple of network variables $\langle Y_1 \ldots Y_n \rangle$ can be computed by the formula

$$P(y_1, \ldots, y_n) = \prod_{i=1}^{n} P(y_i \mid Parents(Y_i))$$

where $Parents(Y_i)$ denotes the set of immediate predecessors of $Y_i$ in the network. Note the values of $P(y_i \mid Parents(Y_i))$ are precisely the values stored in the conditional probability table associated with node $Y_i$.

To illustrate, the Bayesian network in Figure 6.3 represents the joint probability distribution over the boolean variables Storm, Lightning, Thunder, ForestFire, Campfire, and BusTourGroup. Consider the node Campfire. The network nodes and arcs represent the assertion that Campfire is conditionally independent of its nondescendants Lightning and Thunder, given its immediate parents Storm and BusTourGroup. This means that once we know the value of the variables Storm and BusTourGroup, the variables Lightning and Thunder provide no additional information about Campfire. The right side of the figure shows the conditional probability table associated with the variable Campfire. The top left entry in this table, for example, expresses the assertion that

P(Campfire = True | Storm = True, BusTourGroup = True) = 0.4

Note this table provides only the conditional probabilities of Campfire given its parent variables Storm and BusTourGroup. The set of local conditional probability tables for all the variables, together with the set of conditional independence assumptions described by the network, describe the full joint probability distribution for the network.

One attractive feature of Bayesian belief networks is that they allow a convenient way to represent causal knowledge such as the fact that Lightning causes Thunder.
In the terminology of conditional independence, we express this by stating that Thunder is conditionally independent of other variables in the network, given the value of Lightning. Note this conditional independence assumption is implied by the arcs in the Bayesian network of Figure 6.3.

6.11.3 Inference

We might wish to use a Bayesian network to infer the value of some target variable (e.g., ForestFire) given the observed values of the other variables. Of course, given that we are dealing with random variables it will not generally be correct to assign the target variable a single determined value. What we really wish to infer is the probability distribution for the target variable, which specifies the probability that it will take on each of its possible values given the observed values of the other variables. This inference step can be straightforward if values for all of the other variables in the network are known exactly. In the more general case we may wish to infer the probability distribution for some variable (e.g., ForestFire) given observed values for only a subset of the other variables (e.g., Thunder and BusTourGroup may be the only observed values available). In general, a Bayesian network can be used to compute the probability distribution for any subset of network variables given the values or distributions for any subset of the remaining variables.

Exact inference of probabilities in general for an arbitrary Bayesian network is known to be NP-hard (Cooper 1990). Numerous methods have been proposed for probabilistic inference in Bayesian networks, including exact inference methods and approximate inference methods that sacrifice precision to gain efficiency. For example, Monte Carlo methods provide approximate solutions by randomly sampling the distributions of the unobserved variables (Pradham and Dagum 1996). In theory, even approximate inference of probabilities in Bayesian networks can be NP-hard (Dagum and Luby 1993).
Fortunately, in practice approximate methods have been shown to be useful in many cases. Discussions of inference methods for Bayesian networks are provided by Russell and Norvig (1995) and by Jensen (1996).

6.11.4 Learning Bayesian Belief Networks

Can we devise effective algorithms for learning Bayesian belief networks from training data? This question is a focus of much current research. Several different settings for this learning problem can be considered. First, the network structure might be given in advance, or it might have to be inferred from the training data. Second, all the network variables might be directly observable in each training example, or some might be unobservable.

In the case where the network structure is given in advance and the variables are fully observable in the training examples, learning the conditional probability tables is straightforward. We simply estimate the conditional probability table entries just as we would for a naive Bayes classifier.

In the case where the network structure is given but only some of the variable values are observable in the training data, the learning problem is more difficult. This problem is somewhat analogous to learning the weights for the hidden units in an artificial neural network, where the input and output node values are given but the hidden unit values are left unspecified by the training examples. In fact, Russell et al. (1995) propose a similar gradient ascent procedure that learns the entries in the conditional probability tables. This gradient ascent procedure searches through a space of hypotheses that corresponds to the set of all possible entries for the conditional probability tables. The objective function that is maximized during gradient ascent is the probability $P(D \mid h)$ of the observed training data D given the hypothesis h. By definition, this corresponds to searching for the maximum likelihood hypothesis for the table entries.
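For the fully observable case with known structure, estimating a conditional probability table reduces to counting, exactly as in naive Bayes. A minimal sketch, assuming boolean variables and training records stored as dictionaries; the function name `estimate_cpt` and the sample records are hypothetical and were chosen here only for illustration.

```python
from collections import Counter

def estimate_cpt(records, child, parents):
    """Relative-frequency estimate of P(child = True | parents = u) for each
    parent configuration u that occurs in the fully observed records."""
    parent_counts = Counter()
    joint_counts = Counter()
    for r in records:
        u = tuple(r[p] for p in parents)
        parent_counts[u] += 1
        if r[child]:
            joint_counts[u] += 1
    # Parent configurations that never occur in the data get no entry at all.
    return {u: joint_counts[u] / parent_counts[u] for u in parent_counts}

# Hypothetical fully observed training records (values invented for illustration).
records = (
    [{"Storm": True, "BusTourGroup": True, "Campfire": True}] * 2
    + [{"Storm": True, "BusTourGroup": True, "Campfire": False}] * 3
    + [{"Storm": False, "BusTourGroup": True, "Campfire": True}] * 4
    + [{"Storm": False, "BusTourGroup": True, "Campfire": False}] * 1
)
cpt = estimate_cpt(records, "Campfire", ["Storm", "BusTourGroup"])
print(cpt[(True, True)], cpt[(False, True)])
```

With sparse data the same zero-count difficulty discussed for naive Bayes arises, so in practice one might replace the raw fraction by an m-estimate.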
6.11.5 Gradient Ascent Training of Bayesian Networks

The gradient ascent rule given by Russell et al. (1995) maximizes $P(D \mid h)$ by following the gradient of $\ln P(D \mid h)$ with respect to the parameters that define the conditional probability tables of the Bayesian network. Let $w_{ijk}$ denote a single entry in one of the conditional probability tables. In particular, let $w_{ijk}$ denote the conditional probability that the network variable $Y_i$ will take on the value $y_{ij}$, given that its immediate parents $U_i$ take on the values given by $u_{ik}$. For example, if $w_{ijk}$ is the top right entry in the conditional probability table in Figure 6.3, then $Y_i$ is the variable Campfire, $U_i$ is the tuple of its parents $\langle Storm, BusTourGroup \rangle$, $y_{ij} = True$, and $u_{ik} = \langle False, False \rangle$. The gradient of $\ln P(D \mid h)$ is given by the derivatives $\frac{\partial \ln P(D \mid h)}{\partial w_{ijk}}$ for each of the $w_{ijk}$. As we show below, each of these derivatives can be calculated as

$$\frac{\partial \ln P(D \mid h)}{\partial w_{ijk}} = \sum_{d \in D} \frac{P(Y_i = y_{ij}, U_i = u_{ik} \mid d)}{w_{ijk}} \tag{6.25}$$

For example, to calculate the derivative of $\ln P(D \mid h)$ with respect to the upper-rightmost entry in the table of Figure 6.3 we will have to calculate the quantity P(Campfire = True, Storm = False, BusTourGroup = False | d) for each training example d in D. When these variables are unobservable for the training example d, this required probability can be calculated from the observed variables in d using standard Bayesian network inference. In fact, these required quantities are easily derived from the calculations performed during most Bayesian network inference, so learning can be performed at little additional cost whenever the Bayesian network is used for inference and new evidence is subsequently obtained.

Below we derive Equation (6.25) following Russell et al. (1995). The remainder of this section may be skipped on a first reading without loss of continuity.
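To simplify notation, in this derivation we will write the abbreviation $P_h(D)$ to represent $P(D \mid h)$. Thus, our problem is to derive the gradient defined by the set of derivatives $\frac{\partial \ln P_h(D)}{\partial w_{ijk}}$ for all i, j, and k. Assuming the training examples d in the data set D are drawn independently, we write this derivative as

$$\begin{aligned} \frac{\partial \ln P_h(D)}{\partial w_{ijk}} &= \frac{\partial}{\partial w_{ijk}} \ln \prod_{d \in D} P_h(d) \\ &= \sum_{d \in D} \frac{\partial \ln P_h(d)}{\partial w_{ijk}} \\ &= \sum_{d \in D} \frac{1}{P_h(d)} \frac{\partial P_h(d)}{\partial w_{ijk}} \end{aligned}$$

This last step makes use of the general equality $\frac{\partial \ln f(x)}{\partial x} = \frac{1}{f(x)} \frac{\partial f(x)}{\partial x}$. We can now introduce the values of the variables $Y_i$ and $U_i = Parents(Y_i)$, by summing over their possible values $y_{ij'}$ and $u_{ik'}$.

$$\begin{aligned} \frac{\partial \ln P_h(D)}{\partial w_{ijk}} &= \sum_{d \in D} \frac{1}{P_h(d)} \frac{\partial}{\partial w_{ijk}} \sum_{j', k'} P_h(d \mid y_{ij'}, u_{ik'})\, P_h(y_{ij'}, u_{ik'}) \\ &= \sum_{d \in D} \frac{1}{P_h(d)} \frac{\partial}{\partial w_{ijk}} \sum_{j', k'} P_h(d \mid y_{ij'}, u_{ik'})\, P_h(y_{ij'} \mid u_{ik'})\, P_h(u_{ik'}) \end{aligned}$$

This last step follows from the product rule of probability, Table 6.1. Now consider the rightmost sum in the final expression above. Given that $w_{ijk} \equiv P_h(y_{ij} \mid u_{ik})$, the only term in this sum for which $\frac{\partial}{\partial w_{ijk}}$ is nonzero is the term for which j' = j and k' = k. Therefore

$$\begin{aligned} \frac{\partial \ln P_h(D)}{\partial w_{ijk}} &= \sum_{d \in D} \frac{1}{P_h(d)} \frac{\partial}{\partial w_{ijk}} P_h(d \mid y_{ij}, u_{ik})\, P_h(y_{ij} \mid u_{ik})\, P_h(u_{ik}) \\ &= \sum_{d \in D} \frac{1}{P_h(d)} \frac{\partial}{\partial w_{ijk}} P_h(d \mid y_{ij}, u_{ik})\, w_{ijk}\, P_h(u_{ik}) \\ &= \sum_{d \in D} \frac{1}{P_h(d)} P_h(d \mid y_{ij}, u_{ik})\, P_h(u_{ik}) \end{aligned}$$

Applying Bayes theorem to rewrite $P_h(d \mid y_{ij}, u_{ik})$, we have

$$\begin{aligned} \frac{\partial \ln P_h(D)}{\partial w_{ijk}} &= \sum_{d \in D} \frac{1}{P_h(d)} \frac{P_h(y_{ij}, u_{ik} \mid d)\, P_h(d)\, P_h(u_{ik})}{P_h(y_{ij}, u_{ik})} \\ &= \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik} \mid d)\, P_h(u_{ik})}{P_h(y_{ij}, u_{ik})} \\ &= \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik} \mid d)}{P_h(y_{ij} \mid u_{ik})} \\ &= \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik} \mid d)}{w_{ijk}} \end{aligned} \tag{6.26}$$

Thus, we have derived the gradient given in Equation (6.25). There is one more item that must be considered before we can state the gradient ascent training procedure. In particular, we require that as the weights $w_{ijk}$ are updated they must remain valid probabilities in the interval [0,1]. We also require that the sum $\sum_j w_{ijk}$ remains 1 for all i, k. These constraints can be satisfied by updating weights in a two-step process. First we update each $w_{ijk}$ by gradient ascent

$$w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik} \mid d)}{w_{ijk}}$$

where $\eta$ is a small constant called the learning rate. Second, we renormalize the weights $w_{ijk}$ to assure that the above constraints are satisfied. As discussed by Russell et al., this process will converge to a locally maximum likelihood hypothesis for the conditional probabilities in the Bayesian network.

As in other gradient-based approaches, this algorithm is guaranteed only to find some local optimum solution. An alternative to gradient ascent is the EM algorithm discussed in Section 6.12, which also finds locally maximum likelihood solutions.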
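The two-step update (gradient step, then renormalization) can be sketched for the table of a single node as below. This sketch assumes that the posteriors $P_h(y_{ij}, u_{ik} \mid d)$ have already been obtained by some Bayesian network inference routine, which is not shown; the function name, the nested-list layout, and the numbers in the usage example are illustrative assumptions.

```python
def gradient_ascent_step(w, posteriors, eta):
    """One update of the conditional probability table of a single node Y_i.

    w[j][k]             current entry w_ijk = P(Y_i = y_ij | U_i = u_ik), all > 0
    posteriors[d][j][k] P_h(y_ij, u_ik | d) for training example d (from inference)
    eta                 learning rate
    """
    n_j, n_k = len(w), len(w[0])
    # Step 1: w_ijk <- w_ijk + eta * sum_d P_h(y_ij, u_ik | d) / w_ijk
    updated = [[w[j][k] + eta * sum(p[j][k] for p in posteriors) / w[j][k]
                for k in range(n_k)] for j in range(n_j)]
    # Step 2: renormalize so that sum_j w_ijk = 1 for every parent configuration k
    for k in range(n_k):
        total = sum(updated[j][k] for j in range(n_j))
        for j in range(n_j):
            updated[j][k] /= total
    return updated

# Hypothetical usage: a boolean node with one boolean parent, uniform initial table,
# and two fully observed examples (so each posterior puts all mass on one cell).
w0 = [[0.5, 0.5], [0.5, 0.5]]
posteriors = [
    [[1.0, 0.0], [0.0, 0.0]],   # example 1: Y = y_0, U = u_0
    [[0.0, 0.0], [0.0, 1.0]],   # example 2: Y = y_1, U = u_1
]
w1 = gradient_ascent_step(w0, posteriors, eta=0.1)
print(w1)
```

Each entry moves toward the configurations the data support, and because every entry stays positive before renormalization, the result is again a valid table.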
6.11.6 Learning the Structure of Bayesian Networks

Learning Bayesian networks when the network structure is not known in advance is also difficult. Cooper and Herskovits (1992) present a Bayesian scoring metric for choosing among alternative networks. They also present a heuristic search algorithm called K2 for learning network structure when the data is fully observable. Like most algorithms for learning the structure of Bayesian networks, K2 performs a greedy search that trades off network complexity for accuracy over the training data. In one experiment K2 was given a set of 3,000 training examples generated at random from a manually constructed Bayesian network containing 37 nodes and 46 arcs. This particular network described potential anesthesia problems in a hospital operating room. In addition to the data, the program was also given an initial ordering over the 37 variables that was consistent with the partial ordering of variable dependencies in the actual network. The program succeeded in reconstructing the correct Bayesian network structure almost exactly, with the exception of one incorrectly deleted arc and one incorrectly added arc.

Constraint-based approaches to learning Bayesian network structure have also been developed (e.g., Spirtes et al. 1993). These approaches infer independence and dependence relationships from the data, and then use these relationships to construct Bayesian networks. Surveys of current approaches to learning Bayesian networks are provided by Heckerman (1995) and Buntine (1994).

6.12 THE EM ALGORITHM

In many practical learning settings, only a subset of the relevant instance features might be observable. For example, in training or using the Bayesian belief network of Figure 6.3, we might have data where only a subset of the network variables Storm, Lightning, Thunder, ForestFire, Campfire, and BusTourGroup have been observed.
Many approaches have been proposed to handle the problem of learning in the presence of unobserved variables. As we saw in Chapter 3, if some variable is sometimes observed and sometimes not, then we can use the cases for which it has been observed to learn to predict its values when it is not. In this section we describe the EM algorithm (Dempster et al. 1977), a widely used approach to learning in the presence of unobserved variables. The EM algorithm can be used even for variables whose value is never directly observed, provided the general form of the probability distribution governing these variables is known. The EM algorithm has been used to train Bayesian belief networks (see Heckerman 1995) as well as radial basis function networks discussed in Section 8.4. The EM algorithm is also the basis for many unsupervised clustering algorithms (e.g., Cheeseman et al. 1988), and it is the basis for the widely used Baum-Welch forward-backward algorithm for learning Partially Observable Markov Models (Rabiner 1989).

6.12.1 Estimating Means of k Gaussians

The easiest way to introduce the EM algorithm is via an example. Consider a problem in which the data $D$ is a set of instances generated by a probability distribution that is a mixture of $k$ distinct Normal distributions. This problem setting is illustrated in Figure 6.4 for the case where $k = 2$ and where the instances are the points shown along the $x$ axis. Each instance is generated using a two-step process. First, one of the $k$ Normal distributions is selected at random. Second, a single random instance $x_i$ is generated according to this selected distribution. This process is repeated to generate a set of data points as shown in the figure. To simplify our discussion, we consider the special case where the selection of the single Normal distribution at each step is based on choosing each with uniform probability, where each of the $k$ Normal distributions has the same variance $\sigma^2$, and where $\sigma^2$ is known.
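The two-step generation process just described can be sketched in a few lines. This is only an illustration under the simplifying assumptions stated above (uniform selection among the $k$ distributions and a shared, known standard deviation); the function name, the seed, and the particular means are hypothetical.

```python
import random


def sample_mixture(means, sigma, m, rng):
    """Draw m instances from a uniform mixture of Normals sharing one sigma.

    Returns (x, j) pairs: x is the observed value, j the index of the
    distribution that generated it (hidden from the learner in practice).
    """
    instances = []
    for _ in range(m):
        j = rng.randrange(len(means))    # Step 1: select a distribution uniformly
        x = rng.gauss(means[j], sigma)   # Step 2: draw x from the selected Normal
        instances.append((x, j))
    return instances


rng = random.Random(0)  # fixed seed so the sketch is repeatable
instances = sample_mixture(means=[-2.0, 2.0], sigma=1.0, m=200, rng=rng)
xs = [x for x, _ in instances]  # the only part a learner would observe
```

Keeping the index `j` alongside each `x` is a convenience for inspection only; the learning problem in this section assumes that only the values in `xs` are available.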
The learning task is to output a hypothesis $h = \langle \mu_1, \ldots, \mu_k \rangle$ that describes the means of each of the $k$ distributions. We would like to find a maximum likelihood hypothesis for these means; that is, a hypothesis $h$ that maximizes $p(D|h)$.

FIGURE 6.4 Instances generated by a mixture of two Normal distributions with identical variance $\sigma$. The instances are shown by the points along the $x$ axis. If the means of the Normal distributions are unknown, the EM algorithm can be used to search for their maximum likelihood estimates.

Note it is easy to calculate the maximum likelihood hypothesis for the mean of a single Normal distribution given the observed data instances $x_1, x_2, \ldots, x_m$ drawn from this single distribution. This problem of finding the mean of a single distribution is just a special case of the problem discussed in Section 6.4, Equation (6.6), where we showed that the maximum likelihood hypothesis is the one that minimizes the sum of squared errors over the $m$ training instances. Restating Equation (6.6) using our current notation, we have

$$\mu_{ML} = \arg\min_{\mu} \sum_{i=1}^{m} (x_i - \mu)^2 \qquad (6.27)$$

In this case, the sum of squared errors is minimized by the sample mean

$$\mu_{ML} = \frac{1}{m} \sum_{i=1}^{m} x_i \qquad (6.28)$$

Our problem here, however, involves a mixture of $k$ different Normal distributions, and we cannot observe which instances were generated by which distribution. Thus, we have a prototypical example of a problem involving hidden variables. In the example of Figure 6.4, we can think of the full description of each instance as the triple $\langle x_i, z_{i1}, z_{i2} \rangle$, where $x_i$ is the observed value of the $i$th instance and where $z_{i1}$ and $z_{i2}$ indicate which of the two Normal distributions was used to generate the value $x_i$. In particular, $z_{ij}$ has the value 1 if $x_i$ was created by the $j$th Normal distribution and 0 otherwise. Here $x_i$ is the observed variable in the description of the instance, and $z_{i1}$ and $z_{i2}$ are hidden variables. If the values of $z_{i1}$ and $z_{i2}$ were observed, we could use Equation (6.27) to solve for the means $\mu_1$ and $\mu_2$.
Because they are not, we will instead use the EM algorithm.

Applied to our k-means problem the EM algorithm searches for a maximum likelihood hypothesis by repeatedly re-estimating the expected values of the hidden variables $z_{ij}$ given its current hypothesis $\langle \mu_1 \ldots \mu_k \rangle$, then recalculating the maximum likelihood hypothesis using these expected values for the hidden variables. We will first describe this instance of the EM algorithm, and later state the EM algorithm in its general form.

Applied to the problem of estimating the two means for Figure 6.4, the EM algorithm first initializes the hypothesis to $h = \langle \mu_1, \mu_2 \rangle$, where $\mu_1$ and $\mu_2$ are arbitrary initial values. It then iteratively re-estimates $h$ by repeating the following two steps until the procedure converges to a stationary value for $h$.

Step 1: Calculate the expected value $E[z_{ij}]$ of each hidden variable $z_{ij}$, assuming the current hypothesis $h = \langle \mu_1, \mu_2 \rangle$ holds.

Step 2: Calculate a new maximum likelihood hypothesis $h' = \langle \mu'_1, \mu'_2 \rangle$, assuming the value taken on by each hidden variable $z_{ij}$ is its expected value $E[z_{ij}]$ calculated in Step 1. Then replace the hypothesis $h = \langle \mu_1, \mu_2 \rangle$ by the new hypothesis $h' = \langle \mu'_1, \mu'_2 \rangle$ and iterate.

Let us examine how both of these steps can be implemented in practice. Step 1 must calculate the expected value of each $z_{ij}$. This $E[z_{ij}]$ is just the probability that instance $x_i$ was generated by the $j$th Normal distribution

$$E[z_{ij}] = \frac{p(x = x_i \mid \mu = \mu_j)}{\sum_{n=1}^{2} p(x = x_i \mid \mu = \mu_n)} = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{2} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}}$$

Thus the first step is implemented by substituting the current values $\langle \mu_1, \mu_2 \rangle$ and the observed $x_i$ into the above expression. In the second step we use the $E[z_{ij}]$ calculated during Step 1 to derive a new maximum likelihood hypothesis $h' = \langle \mu'_1, \mu'_2 \rangle$. As we will discuss later, the maximum likelihood hypothesis in this case is given by

$$\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}] \, x_i}{\sum_{i=1}^{m} E[z_{ij}]}$$

Note this expression is similar to the sample mean from Equation (6.28) that is used to estimate $\mu$ for a single Normal distribution.
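The two steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the toy data, and the initial means are hypothetical, and the loop runs for a fixed number of iterations rather than testing for convergence. It also assumes every instance lies close enough to some current mean that the exponentials do not all underflow to zero.

```python
import math


def e_step(xs, mus, sigma):
    """Step 1: E[z_ij] for every instance x_i and every mean mu_j."""
    expectations = []
    for x in xs:
        weights = [math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) for mu in mus]
        total = sum(weights)
        expectations.append([w / total for w in weights])
    return expectations


def m_step(xs, expectations, k):
    """Step 2: each new mean is the sample mean weighted by E[z_ij]."""
    new_mus = []
    for j in range(k):
        weighted_sum = sum(e[j] * x for e, x in zip(expectations, xs))
        total_weight = sum(e[j] for e in expectations)
        new_mus.append(weighted_sum / total_weight)
    return new_mus


def em_means(xs, mus, sigma, iterations=50):
    """Alternate Step 1 and Step 2 for a fixed number of iterations."""
    for _ in range(iterations):
        mus = m_step(xs, e_step(xs, mus, sigma), len(mus))
    return mus


# Hypothetical data: two tight clusters around -2 and +2.
xs = [-2.1, -2.0, -1.9, 1.9, 2.0, 2.1]
mu1, mu2 = em_means(xs, mus=[-1.0, 1.0], sigma=0.5)
```

Because the two clusters in this toy data are well separated relative to `sigma`, each expectation ends up very close to 0 or 1 and the estimates settle near the two cluster centers. With overlapping clusters the expectations stay fractional and every instance contributes to every mean.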
Our new expression is just the weighted sample mean for $\mu_j$, with each instance weighted by the expectation $E[z_{ij}]$ that it was generated by the $j$th Normal distribution.

The above algorithm for estimating the means of a mixture of $k$ Normal distributions illustrates the essence of the EM approach: The current hypothesis is used to estimate the unobserved variables, and the expected values of these variables are then used to calculate an improved hypothesis. It can be proved that on each iteration through this loop, the EM algorithm increases the likelihood $P(D|h)$ unless it is at a local maximum. The algorithm thus converges to a local maximum likelihood hypothesis for $\langle \mu_1, \mu_2 \rangle$.

6.12.2 General Statement of EM Algorithm

Above we described an EM algorithm for the problem of estimating means of a mixture of Normal distributions. More generally, the EM algorithm can be applied in many settings where we wish to estimate some set of parameters $\theta$ that describe an underlying probability distribution, given only the observed portion of the full data produced by this distribution. In the above two-means example the parameters of interest were $\theta = \langle \mu_1, \mu_2 \rangle$, and the full data were the triples $\langle x_i, z_{i1}, z_{i2} \rangle$ of which only the $x_i$ were observed. In general let $X = \{x_1, \ldots, x_m\}$ denote the observed data in a set of $m$ independently drawn instances, let $Z = \{z_1, \ldots, z_m\}$ denote the unobserved data in these same instances, and let $Y = X \cup Z$ denote the full data. Note the unobserved $Z$ can be treated as a random variable whose probability distribution depends on the unknown parameters $\theta$ and on the observed data $X$. Similarly, $Y$ is a random variable because it is defined in terms of the random variable $Z$. In the remainder of this section we describe the general form of the EM algorithm. We use $h$ to denote the current hypothesized values of the parameters $\theta$, and $h'$ to denote the revised hypothesis that is estimated on each iteration of the EM algorithm.
The EM algorithm searches for the maximum likelihood hypothesis $h'$ by seeking the $h'$ that maximizes $E[\ln P(Y|h')]$. This expected value is taken over the probability distribution governing $Y$, which is determined by the unknown parameters $\theta$. Let us consider exactly what this expression signifies. First, $P(Y|h')$ is the likelihood of the full data $Y$ given hypothesis $h'$. It is reasonable that we wish to find an $h'$ that maximizes some function of this quantity. Second, maximizing the logarithm of this quantity $\ln P(Y|h')$ also maximizes $P(Y|h')$, as we have discussed on several occasions already. Third, we introduce the expected value $E[\ln P(Y|h')]$ because the full data $Y$ is itself a random variable. Given that the full data $Y$ is a combination of the observed data $X$ and unobserved data $Z$, we must average over the possible values of the unobserved $Z$, weighting each according to its probability. In other words we take the expected value $E[\ln P(Y|h')]$ over the probability distribution governing the random variable $Y$. The distribution governing $Y$ is determined by the completely known values for $X$, plus the distribution governing $Z$.

What is the probability distribution governing $Y$? In general we will not know this distribution because it is determined by the parameters $\theta$ that we are trying to estimate. Therefore, the EM algorithm uses its current hypothesis $h$ in place of the actual parameters $\theta$ to estimate the distribution governing $Y$. Let us define a function $Q(h'|h)$ that gives $E[\ln P(Y|h')]$ as a function of $h'$, under the assumption that $\theta = h$ and given the observed portion $X$ of the full data $Y$. We write this function $Q$ in the form $Q(h'|h)$ to indicate that it is defined in part by the assumption that the current hypothesis $h$ is equal to $\theta$.
In its general form, the EM algorithm repeats the following two steps until convergence:

Step 1: Estimation (E) step: Calculate $Q(h'|h)$ using the current hypothesis $h$ and the observed data $X$ to estimate the probability distribution over $Y$.

$$Q(h'|h) \leftarrow E[\ln P(Y|h') \mid h, X]$$

Step 2: Maximization (M) step: Replace hypothesis $h$ by the hypothesis $h'$ that maximizes this $Q$ function.

$$h \leftarrow \arg\max_{h'} Q(h'|h)$$

When the function $Q$ is continuous, the EM algorithm converges to a stationary point of the likelihood function $P(Y|h')$. When this likelihood function has a single maximum, EM will converge to this global maximum likelihood estimate for $h'$. Otherwise, it is guaranteed only to converge to a local maximum. In this respect, EM shares some of the same limitations as other optimization methods such as gradient descent, line search, and conjugate gradient discussed in Chapter 4.

6.12.3 Derivation of the k Means Algorithm

To illustrate the general EM algorithm, let us use it to derive the algorithm given in Section 6.12.1 for estimating the means of a mixture of $k$ Normal distributions. As discussed above, the k-means problem is to estimate the parameters $\theta = \langle \mu_1 \ldots \mu_k \rangle$ that define the means of the $k$ Normal distributions. We are given the observed data $X = \{\langle x_i \rangle\}$. The hidden variables $Z = \{\langle z_{i1}, \ldots, z_{ik} \rangle\}$ in this case indicate which of the $k$ Normal distributions was used to generate $x_i$.

To apply EM we must derive an expression for $Q(h'|h)$ that applies to our k-means problem. First, let us derive an expression for $\ln p(Y|h')$. Note the probability $p(y_i|h')$ of a single instance $y_i = \langle x_i, z_{i1}, \ldots, z_{ik} \rangle$ of the full data can be written

$$p(y_i|h') = p(x_i, z_{i1}, \ldots, z_{ik} \mid h') = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2\sigma^2} \sum_{j=1}^{k} z_{ij} (x_i - \mu'_j)^2}$$

To verify this note that only one of the $z_{ij}$ can have the value 1, and all others must be 0. Therefore, this expression gives the probability distribution for $x_i$ generated by the selected Normal distribution.
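The verification in the preceding paragraph can also be checked numerically. The sketch below is an illustration only, with hypothetical function names and values: it evaluates the expression for $p(y_i|h')$ with an indicator vector $z$ and compares it with the ordinary Normal density of the one selected component.

```python
import math


def normal_density(x, mu, sigma):
    """Density of a single Normal distribution with mean mu at the point x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)


def full_instance_density(x, z, mus, sigma):
    """p(y_i | h') for y_i = (x, z_1, ..., z_k), with exactly one z_j equal to 1."""
    # The indicator z_j removes every squared-error term except the one
    # belonging to the distribution that generated x.
    exponent = -sum(z_j * (x - mu) ** 2 for z_j, mu in zip(z, mus)) / (2 * sigma ** 2)
    return math.exp(exponent) / math.sqrt(2 * math.pi * sigma ** 2)


# Hypothetical hypothesis with three means; the second one generated x.
mus = [0.0, 2.0, 5.0]
selected = full_instance_density(1.5, z=[0, 1, 0], mus=mus, sigma=1.0)
direct = normal_density(1.5, mu=2.0, sigma=1.0)
```

The two values agree because the sum in the exponent collapses to the single term for which $z_{ij} = 1$.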
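Given this probability for a single instance $p(y_i|h')$, the logarithm of the probability $\ln P(Y|h')$ for all $m$ instances in the data is

$$\begin{aligned}
\ln P(Y|h') &= \ln \prod_{i=1}^{m} p(y_i|h') \\
&= \sum_{i=1}^{m} \ln p(y_i|h') \\
&= \sum_{i=1}^{m} \left( \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2} \sum_{j=1}^{k} z_{ij} (x_i - \mu'_j)^2 \right)
\end{aligned}$$

Finally we must take the expected value of this $\ln P(Y|h')$ over the probability distribution governing $Y$ or, equivalently, over the distribution governing the unobserved components $z_{ij}$ of $Y$. Note the above expression for $\ln P(Y|h')$ is a linear function of these $z_{ij}$. In general, for any function $f(z)$ that is a linear function of $z$, the following equality holds

$$E[f(z)] = f(E[z])$$

This general fact about linear functions allows us to write

$$\begin{aligned}
E[\ln P(Y|h')] &= E\left[ \sum_{i=1}^{m} \left( \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2} \sum_{j=1}^{k} z_{ij} (x_i - \mu'_j)^2 \right) \right] \\
&= \sum_{i=1}^{m} \left( \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2} \sum_{j=1}^{k} E[z_{ij}] (x_i - \mu'_j)^2 \right)
\end{aligned}$$

To summarize, the function $Q(h'|h)$ for the k-means problem is

$$Q(h'|h) = \sum_{i=1}^{m} \left( \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2} \sum_{j=1}^{k} E[z_{ij}] (x_i - \mu'_j)^2 \right)$$

where $h' = \langle \mu'_1, \ldots, \mu'_k \rangle$ and where $E[z_{ij}]$ is calculated based on the current hypothesis $h$ and observed data $X$. As discussed earlier

$$E[z_{ij}] = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{k} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}} \qquad (6.29)$$

Thus, the first (estimation) step of the EM algorithm defines the $Q$ function based on the estimated $E[z_{ij}]$ terms. The second (maximization) step then finds the values $\mu'_1, \ldots, \mu'_k$ that maximize this $Q$ function. In the current case

$$\begin{aligned}
\arg\max_{h'} Q(h'|h) &= \arg\max_{h'} \sum_{i=1}^{m} \left( \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2} \sum_{j=1}^{k} E[z_{ij}] (x_i - \mu'_j)^2 \right) \\
&= \arg\min_{h'} \sum_{i=1}^{m} \sum_{j=1}^{k} E[z_{ij}] (x_i - \mu'_j)^2 \qquad (6.30)
\end{aligned}$$

Thus, the maximum likelihood hypothesis here minimizes a weighted sum of squared errors, where the contribution of each instance $x_i$ to the error that defines $\mu'_j$ is weighted by $E[z_{ij}]$. The quantity given by Equation (6.30) is minimized by setting each $\mu'_j$ to the weighted sample mean

$$\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}] \, x_i}{\sum_{i=1}^{m} E[z_{ij}]} \qquad (6.31)$$

Note that Equations (6.29) and (6.31) define the two steps in the k-means algorithm described in Section 6.12.1.

6.13 SUMMARY AND FURTHER READING

The main points of this chapter include:

• Bayesian methods provide the basis for probabilistic learning methods that accommodate (and require) knowledge about the prior probabilities of alternative hypotheses and about the probability of observing various data given the hypothesis.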
Bayesian methods allow assigning a posterior probability to each candidate hypothesis, based on these assumed priors and the observed data.
• Bayesian methods can be used to determine the most probable hypothesis given the data, that is, the maximum a posteriori (MAP) hypothesis. This is the optimal hypothesis in the sense that no other hypothesis is more likely.
• The Bayes optimal classifier combines the predictions of all alternative hypotheses, weighted by their posterior probabilities, to calculate the most probable classification of each new instance.
• The naive Bayes classifier is a Bayesian learning method that has been found to be useful in many practical applications. It is called "naive" because it incorporates the simplifying assumption that attribute values are conditionally independent, given the classification of the instance. When this assumption is met, the naive Bayes classifier outputs the MAP classification. Even when this assumption is not met, as in the case of learning to classify text, the naive Bayes classifier is often quite effective. Bayesian belief networks provide a more expressive representation for sets of conditional independence assumptions among subsets of the attributes.
• The framework of Bayesian reasoning can provide a useful basis for analyzing certain learning methods that do not directly apply Bayes theorem. For example, under certain conditions it can be shown that minimizing the squared error when learning a real-valued target function corresponds to computing the maximum likelihood hypothesis.
• The Minimum Description Length principle recommends choosing the hypothesis that minimizes the description length of the hypothesis plus the description length of the data given the hypothesis. Bayes theorem and basic results from information theory can be used to provide a rationale for this principle.
• In many practical learning tasks, some of the relevant instance variables may be unobservable.
The EM algorithm provides a quite general approach to learning in the presence of unobservable variables. This algorithm begins with an arbitrary initial hypothesis. It then repeatedly calculates the expected values of the hidden variables (assuming the current hypothesis is correct), and then recalculates the maximum likelihood hypothesis (assuming the hidden variables have the expected values calculated by the first step). This procedure converges to a local maximum likelihood hypothesis, along with estimated values for the hidden variables.

There are many good introductory texts on probability and statistics, such as Casella and Berger (1990). Several quick-reference books (e.g., Maisel 1971; Speigel 1991) also provide excellent treatments of the basic notions of probability and statistics relevant to machine learning.

Many of the basic notions of Bayesian classifiers and least-squared error classifiers are discussed by Duda and Hart (1973). Domingos and Pazzani (1996) provide an analysis of conditions under which naive Bayes will output optimal classifications, even when its independence assumption is violated (the key here is that there are conditions under which it will output optimal classifications even when the associated posterior probability estimates are incorrect).

Cestnik (1990) provides a discussion of using the m-estimate to estimate probabilities.

Experimental results comparing various Bayesian approaches to decision tree learning and other algorithms can be found in Michie et al. (1994). Chauvin and Rumelhart (1995) provide a Bayesian analysis of neural network learning based on the BACKPROPAGATION algorithm.

A discussion of the Minimum Description Length principle can be found in Rissanen (1983, 1989). Quinlan and Rivest (1989) describe its use in avoiding overfitting in decision trees.

EXERCISES

6.1. Consider again the example application of Bayes rule in Section 6.2.1.
Suppose the doctor decides to order a second laboratory test for the same patient, and suppose the second test returns a positive result as well. What are the posterior probabilities of cancer and ¬cancer following these two tests? Assume that the two tests are independent.

6.2. In the example of Section 6.2.1 we computed the posterior probability of cancer by normalizing the quantities $P(+|cancer) \cdot P(cancer)$ and $P(+|\neg cancer) \cdot P(\neg cancer)$ so that they summed to one. Use Bayes theorem and the theorem of total probability (see Table 6.1) to prove that this method is valid (i.e., that normalizing in this way yields the correct value for $P(cancer|+)$).

6.3. Consider the concept learning algorithm FindG, which outputs a maximally general consistent hypothesis (e.g., some maximally general member of the version space).
(a) Give a distribution for $P(h)$ and $P(D|h)$ under which FindG is guaranteed to output a MAP hypothesis.
(b) Give a distribution for $P(h)$ and $P(D|h)$ under which FindG is not guaranteed to output a MAP hypothesis.
(c) Give a distribution for $P(h)$ and $P(D|h)$ under which FindG is guaranteed to output a ML hypothesis but not a MAP hypothesis.

6.4. In the analysis of concept learning in Section 6.3 we assumed that the sequence of instances $\langle x_1 \ldots x_m \rangle$ was held fixed. Therefore, in deriving an expression for $P(D|h)$ we needed only consider the probability of observing the sequence of target values $\langle d_1 \ldots d_m \rangle$ for this fixed instance sequence. Consider the more general setting in which the instances are not held fixed, but are drawn independently from some probability distribution defined over the instance space $X$. The data $D$ must now be described as the set of ordered pairs $\{\langle x_i, d_i \rangle\}$, and $P(D|h)$ must now reflect the probability of encountering the specific instance $x_i$, as well as the probability of the observed target value $d_i$. Show that Equation (6.5) holds even under this more general setting.
Hint: Consider the analysis of Section 6.5.

6.5. Consider the Minimum Description Length principle applied to the hypothesis space $H$ consisting of conjunctions of up to $n$ boolean attributes (e.g., $Sunny \wedge Warm$). Assume each hypothesis is encoded simply by listing the attributes present in the hypothesis, where the number of bits needed to encode any one of the $n$ boolean attributes is $\log_2 n$. Suppose the encoding of an example given the hypothesis uses zero bits if the example is consistent with the hypothesis and uses $\log_2 m$ bits otherwise (to indicate which of the $m$ examples was misclassified; the correct classification can be inferred to be the opposite of that predicted by the hypothesis).
(a) Write down the expression for the quantity to be minimized according to the Minimum Description Length principle.
(b) Is it possible to construct a set of training data such that a consistent hypothesis exists, but MDL chooses a less consistent hypothesis? If so, give such a training set. If not, explain why not.
(c) Give probability distributions for $P(h)$ and $P(D|h)$ such that the above MDL algorithm outputs MAP hypotheses.

6.6. Draw the Bayesian belief network that represents the conditional independence assumptions of the naive Bayes classifier for the PlayTennis problem of Section 6.9.1. Give the conditional probability table associated with the node Wind.

REFERENCES

Buntine, W. L. (1994). Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2, 159-225. http://www.cs.washington.edu/research/jair/hom.html.
Casella, G., & Berger, R. L. (1990). Statistical inference. Pacific Grove, CA: Wadsworth & Brooks/Cole.
Cestnik, B. (1990). Estimating probabilities: A crucial task in machine learning. Proceedings of the Ninth European Conference on Artificial Intelligence (pp. 147-149). London: Pitman.
Chauvin, Y., & Rumelhart, D. (1995). Backpropagation: Theory, architectures, and applications, (edited collection).
Hillsdale, NJ: Lawrence Erlbaum Assoc.
Cheeseman, P., Kelly, J., Self, M., Stutz, J., Taylor, W., & Freeman, D. (1988). AUTOCLASS: A Bayesian classification system. Proceedings of AAAI 1988 (pp. 607-611).
Cooper, G. (1990). Computational complexity of probabilistic inference using Bayesian belief networks (research note). Artificial Intelligence, 42, 393-405.
Cooper, G., & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309-347.
Dagum, P., & Luby, M. (1993). Approximating probabilistic reasoning in Bayesian belief networks is NP-hard. Artificial Intelligence, 60(1), 141-153.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1-38.
Domingos, P., & Pazzani, M. (1996). Beyond independence: Conditions for the optimality of the simple Bayesian classifier. Proceedings of the 13th International Conference on Machine Learning (pp. 105-112).
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: John Wiley & Sons.
Hearst, M., & Hirsh, H. (Eds.) (1996). Papers from the AAAI Spring Symposium on Machine Learning in Information Access, Stanford, March 25-27. http://www.parc.xerox.com/ist~ projects/mlia/
Heckerman, D., Geiger, D., & Chickering, D. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197. Kluwer Academic Publishers.
Jensen, F. V. (1996). An introduction to Bayesian networks. New York: Springer Verlag.
Joachims, T. (1996). A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization, (Computer Science Technical Report CMU-CS-96-118). Carnegie Mellon University.
Lang, K. (1995). Newsweeder: Learning to filter netnews. In Prieditis and Russell (Eds.), Proceedings of the 12th International Conference on Machine Learning (pp. 331-339).
San Francisco: Morgan Kaufmann Publishers.
Lewis, D. (1991). Representation and learning in information retrieval, (Ph.D. thesis), (COINS Technical Report 91-93). Dept. of Computer and Information Science, University of Massachusetts.
Madigan, D., & Rafferty, A. (1994). Model selection and accounting for model uncertainty in graphical models using Occam's window. Journal of the American Statistical Association, 89, 1535-1546.
Maisel, L. (1971). Probability, statistics, and random processes. Simon and Schuster Tech Outlines. New York: Simon and Schuster.
Mehta, M., Rissanen, J., & Agrawal, R. (1995). MDL-based decision tree pruning. In U. M. Fayyard and R. Uthurusamy (Eds.), Proceedings of the First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.
Michie, D., Spiegelhalter, D. J., & Taylor, C. C. (1994). Machine learning, neural and statistical classification, (edited collection). New York: Ellis Horwood.
Opper, M., & Haussler, D. (1991). Generalization performance of Bayes optimal prediction algorithm for learning a perceptron. Physical Review Letters, 66, 2677-2681.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo, CA: Morgan-Kaufmann.
Pradham, M., & Dagum, P. (1996). Optimal Monte Carlo estimation of belief network inference. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (pp. 44-53).
Quinlan, J. R., & Rivest, R. (1989). Inferring decision trees using the minimum description length principle. Information and Computation, 80, 227-248.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257-286.
Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. The Annals of Statistics, 11(2), 416-431.
Rissanen, J. (1989). Stochastic complexity in statistical inquiry. New Jersey: World Scientific Pub.
Rissanen, J.
(1991). Information theory and neural nets. IBM Research Report RJ 8438 (76446), IBM Thomas J. Watson Research Center, Yorktown Heights, NY.
Rocchio, J. (1971). Relevance feedback in information retrieval. In The SMART retrieval system: Experiments in automatic document processing, (Chap. 14, pp. 313-323). Englewood Cliffs, NJ: Prentice-Hall.
Russell, S., & Norvig, P. (1995). Artificial intelligence: A modern approach. Englewood Cliffs, NJ: Prentice-Hall.
Russell, S., Binder, J., Koller, D., & Kanazawa, K. (1995). Local learning in probabilistic networks with hidden variables. Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal. San Francisco: Morgan Kaufmann.
Salton, G. (1991). Developments in automatic text retrieval. Science, 253, 974-979.
Shannon, C. E., & Weaver, W. (1949). The mathematical theory of communication. Urbana: University of Illinois Press.
Speigel, M. R. (1991). Theory and problems of probability and statistics. Schaum's Outline Series. New York: McGraw Hill.
Spirtes, P., Glymour, C., & Scheines, R. (1993). Causation, prediction, and search. New York: Springer Verlag. http://hss.cmu.edu/htmUdepartments/philosophy~~D.BOO~ook.h~

CHAPTER 7 COMPUTATIONAL LEARNING THEORY

This chapter presents a theoretical characterization of the difficulty of several types of machine learning problems and the capabilities of several types of machine learning algorithms. This theory seeks to answer questions such as "Under what conditions is successful learning possible and impossible?" and "Under what conditions is a particular learning algorithm assured of learning successfully?" Two specific frameworks for analyzing learning algorithms are considered.
Within the probably approximately correct (PAC) framework, we identify classes of hypotheses that can and cannot be learned from a polynomial number of training examples and we define a natural measure of complexity for hypothesis spaces that allows bounding the number of training examples required for inductive learning. Within the mistake bound framework, we examine the number of training errors that will be made by a learner before it determines the correct hypothesis.

7.1 INTRODUCTION

When studying machine learning it is natural to wonder what general laws may govern machine (and nonmachine) learners. Is it possible to identify classes of learning problems that are inherently difficult or easy, independent of the learning algorithm? Can one characterize the number of training examples necessary or sufficient to assure successful learning? How is this number affected if the learner is allowed to pose queries to the trainer, versus observing a random sample of training examples? Can one characterize the number of mistakes that a learner will make before learning the target function? Can one characterize the inherent computational complexity of classes of learning problems?

Although general answers to all these questions are not yet known, fragments of a computational theory of learning have begun to emerge. This chapter presents key results from this theory, providing answers to these questions within particular problem settings. We focus here on the problem of inductively learning an unknown target function, given only training examples of this target function and a space of candidate hypotheses. Within this setting, we will be chiefly concerned with questions such as how many training examples are sufficient to successfully learn the target function, and how many mistakes will the learner make before succeeding.
As we shall see, it is possible to set quantitative bounds on these measures, depending on attributes of the learning problem such as:

• the size or complexity of the hypothesis space considered by the learner
• the accuracy to which the target concept must be approximated
• the probability that the learner will output a successful hypothesis
• the manner in which training examples are presented to the learner

For the most part, we will focus not on individual learning algorithms, but rather on broad classes of learning algorithms characterized by the hypothesis spaces they consider, the presentation of training examples, etc. Our goal is to answer questions such as:

• Sample complexity. How many training examples are needed for a learner to converge (with high probability) to a successful hypothesis?
• Computational complexity. How much computational effort is needed for a learner to converge (with high probability) to a successful hypothesis?
• Mistake bound. How many training examples will the learner misclassify before converging to a successful hypothesis?

Note there are many specific settings in which we could pursue such questions. For example, there are various ways to specify what it means for the learner to be "successful." We might specify that to succeed, the learner must output a hypothesis identical to the target concept. Alternatively, we might simply require that it output a hypothesis that agrees with the target concept most of the time, or that it usually output such a hypothesis. Similarly, we must specify how training examples are to be obtained by the learner. We might specify that training examples are presented by a helpful teacher, or obtained by the learner performing experiments, or simply generated at random according to some process outside the learner's control. As we might expect, the answers to the above questions depend on the particular setting, or learning model, we have in mind.
The remainder of this chapter is organized as follows. Section 7.2 introduces the probably approximately correct (PAC) learning setting. Section 7.3 then analyzes the sample complexity and computational complexity for several learning problems within this PAC setting. Section 7.4 introduces an important measure of hypothesis space complexity called the VC-dimension and extends our PAC analysis to problems in which the hypothesis space is infinite. Section 7.5 introduces the mistake-bound model and provides a bound on the number of mistakes made by several learning algorithms discussed in earlier chapters. Finally, we introduce the WEIGHTED-MAJORITY algorithm, a practical algorithm for combining the predictions of multiple competing learning algorithms, along with a theoretical mistake bound for this algorithm.

7.2 PROBABLY LEARNING AN APPROXIMATELY CORRECT HYPOTHESIS

In this section we consider a particular setting for the learning problem, called the probably approximately correct (PAC) learning model. We begin by specifying the problem setting that defines the PAC learning model, then consider the questions of how many training examples and how much computation are required in order to learn various classes of target functions within this PAC model. For the sake of simplicity, we restrict the discussion to the case of learning boolean-valued concepts from noise-free training data. However, many of the results can be extended to the more general scenario of learning real-valued target functions (see, for example, Natarajan 1991), and some can be extended to learning from certain types of noisy data (see, for example, Laird 1988; Kearns and Vazirani 1994).

7.2.1 The Problem Setting

As in earlier chapters, let X refer to the set of all possible instances over which target functions may be defined.
For example, X might represent the set of all people, each described by the attributes age (e.g., young or old) and height (short or tall). Let C refer to some set of target concepts that our learner might be called upon to learn. Each target concept c in C corresponds to some subset of X, or equivalently to some boolean-valued function $c : X \to \{0, 1\}$. For example, one target concept c in C might be the concept "people who are skiers." If x is a positive example of c, then we will write c(x) = 1; if x is a negative example, c(x) = 0.

We assume instances are generated at random from X according to some probability distribution $\mathcal{D}$. For example, $\mathcal{D}$ might be the distribution of instances generated by observing people who walk out of the largest sports store in Switzerland. In general, $\mathcal{D}$ may be any distribution, and it will not generally be known to the learner. All that we require of $\mathcal{D}$ is that it be stationary; that is, that the distribution not change over time. Training examples are generated by drawing an instance x at random according to $\mathcal{D}$, then presenting x along with its target value, c(x), to the learner.

The learner L considers some set H of possible hypotheses when attempting to learn the target concept. For example, H might be the set of all hypotheses describable by conjunctions of the attributes age and height. After observing a sequence of training examples of the target concept c, L must output some hypothesis h from H, which is its estimate of c. To be fair, we evaluate the success of L by the performance of h over new instances drawn randomly from X according to $\mathcal{D}$, the same probability distribution used to generate the training data.

Within this setting, we are interested in characterizing the performance of various learners L using various hypothesis spaces H, when learning individual target concepts drawn from various classes C.
Because we demand that L be general enough to learn any target concept from C regardless of the distribution of training examples, we will often be interested in worst-case analyses over all possible target concepts from C and all possible instance distributions $\mathcal{D}$.

7.2.2 Error of a Hypothesis

Because we are interested in how closely the learner's output hypothesis h approximates the actual target concept c, let us begin by defining the true error of a hypothesis h with respect to target concept c and instance distribution $\mathcal{D}$. Informally, the true error of h is just the error rate we expect when applying h to future instances drawn according to the probability distribution $\mathcal{D}$. In fact, we already defined the true error of h in Chapter 5. For convenience, we restate the definition here using c to represent the boolean target function.

Definition: The true error (denoted $error_{\mathcal{D}}(h)$) of hypothesis h with respect to target concept c and distribution $\mathcal{D}$ is the probability that h will misclassify an instance drawn at random according to $\mathcal{D}$.

$$error_{\mathcal{D}}(h) \equiv \Pr_{x \in \mathcal{D}}[c(x) \neq h(x)]$$

Here the notation $\Pr_{x \in \mathcal{D}}$ indicates that the probability is taken over the instance distribution $\mathcal{D}$.

Figure 7.1 shows this definition of error in graphical form. The concepts c and h are depicted by the sets of instances within X that they label as positive. The error of h with respect to c is the probability that a randomly drawn instance will fall into the region where h and c disagree (i.e., their symmetric difference). Note we have chosen to define error over the entire distribution of instances, not simply over the training examples, because this is the true error we expect to encounter when actually using the learned hypothesis h on subsequent instances drawn from $\mathcal{D}$.
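To make the definition concrete, the following minimal sketch computes the true error over a small finite instance space. The four-instance space, the concept `c`, the hypothesis `h`, and both distributions are illustrative assumptions, not taken from the text.

```python
from itertools import product

# Hypothetical instance space: every (age, height) combination.
X = list(product(["young", "old"], ["short", "tall"]))

def c(x):
    """Assumed target concept: people who are old and tall."""
    age, height = x
    return int(age == "old" and height == "tall")

def h(x):
    """Assumed hypothesis: anyone who is tall."""
    _, height = x
    return int(height == "tall")

def true_error(h, c, dist):
    """Probability, under dist, that h and c disagree on a random instance."""
    return sum(p for x, p in dist.items() if h(x) != c(x))

# Two assumed distributions over the same instance space.
uniform = {x: 1 / len(X) for x in X}
skewed = {("young", "short"): 0.05, ("young", "tall"): 0.85,
          ("old", "short"): 0.05, ("old", "tall"): 0.05}

print(true_error(h, c, uniform))  # 0.25
print(true_error(h, c, skewed))   # 0.85
```

Here h and c disagree only on the instance (young, tall), so the same pair of functions has a different true error under each assumed distribution.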
Note that error depends strongly on the unknown probability distribution $\mathcal{D}$. For example, if $\mathcal{D}$ is a uniform probability distribution that assigns the same probability to every instance in X, then the error for the hypothesis in Figure 7.1 will be the fraction of the total instance space that falls into the region where h and c disagree. However, the same h and c will have a much higher error if $\mathcal{D}$ happens to assign very high probability to instances for which h and c disagree. In the extreme, if $\mathcal{D}$ happens to assign zero probability to the instances for which $h(x) = c(x)$, then the error for the h in Figure 7.1 will be 1, despite the fact that h and c agree on a very large number of (zero probability) instances.

FIGURE 7.1 The error of hypothesis h with respect to target concept c. [Figure labels: instance space X; c; where c and h disagree.] The error of h with respect to c is the probability that a randomly drawn instance will fall into the region where h and c disagree on its classification. The + and - points indicate positive and negative training examples. Note h has a nonzero error with respect to c despite the fact that h and c agree on all five training examples observed thus far.

Finally, note that the error of h with respect to c is not directly observable to the learner. L can only observe the performance of h over the training examples, and it must choose its output hypothesis on this basis only. We will use the term training error to refer to the fraction of training examples misclassified by h, in contrast to the true error defined above. Much of our analysis of the complexity of learning centers around the question "how probable is it that the observed training error for h gives a misleading estimate of the true $error_{\mathcal{D}}(h)$?"

Notice the close relationship between this question and the questions considered in Chapter 5. Recall that in Chapter 5 we defined the sample error of h with respect to a set S of examples to be the fraction of S misclassified by h.
The training error defined above is just the sample error when S is the set of training examples. In Chapter 5 we determined the probability that the sample error will provide a misleading estimate of the true error, under the assumption that the data sample S is drawn independent of h. However, when S is the set of training data, the learned hypothesis h depends very much on S! Therefore, in this chapter we provide an analysis that addresses this important special case.

7.2.3 PAC Learnability

Our aim is to characterize classes of target concepts that can be reliably learned from a reasonable number of randomly drawn training examples and a reasonable amount of computation.

What kinds of statements about learnability should we guess hold true? We might try to characterize the number of training examples needed to learn a hypothesis h for which $error_{\mathcal{D}}(h) = 0$. Unfortunately, it turns out this is futile in the setting we are considering, for two reasons. First, unless we provide training examples corresponding to every possible instance in X (an unrealistic assumption), there may be multiple hypotheses consistent with the provided training examples, and the learner cannot be certain to pick the one corresponding to the target concept. Second, given that the training examples are drawn randomly, there will always be some nonzero probability that the training examples encountered by the learner will be misleading. (For example, although we might frequently see skiers of different heights, on any given day there is some small chance that all observed training examples will happen to be 2 meters tall.)

To accommodate these two difficulties, we weaken our demands on the learner in two ways. First, we will not require that the learner output a zero-error hypothesis; we will require only that its error be bounded by some constant, $\epsilon$, that can be made arbitrarily small.
Second, we will not require that the learner succeed for every sequence of randomly drawn training examples; we will require only that its probability of failure be bounded by some constant, $\delta$, that can be made arbitrarily small. In short, we require only that the learner probably learn a hypothesis that is approximately correct, hence the term probably approximately correct learning, or PAC learning for short.

Consider some class C of possible target concepts and a learner L using hypothesis space H. Loosely speaking, we will say that the concept class C is PAC-learnable by L using H if, for any target concept c in C, L will with probability $(1 - \delta)$ output a hypothesis h with $error_{\mathcal{D}}(h) < \epsilon$, after observing a reasonable number of training examples and performing a reasonable amount of computation. More precisely,

Definition: Consider a concept class C defined over a set of instances X of length n and a learner L using hypothesis space H. C is PAC-learnable by L using H if for all $c \in C$, distributions $\mathcal{D}$ over X, $\epsilon$ such that $0 < \epsilon < 1/2$, and $\delta$ such that $0 < \delta < 1/2$, learner L will with probability at least $(1 - \delta)$ output a hypothesis $h \in H$ such that $error_{\mathcal{D}}(h) \le \epsilon$, in time that is polynomial in $1/\epsilon$, $1/\delta$, n, and size(c).

Our definition requires two things from L. First, L must, with arbitrarily high probability $(1 - \delta)$, output a hypothesis having arbitrarily low error ($\epsilon$). Second, it must do so efficiently, in time that grows at most polynomially with $1/\epsilon$ and $1/\delta$, which define the strength of our demands on the output hypothesis, and with n and size(c), which define the inherent complexity of the underlying instance space X and concept class C. Here, n is the size of instances in X. For example, if instances in X are conjunctions of k boolean features, then n = k. The second space parameter, size(c), is the encoding length of c in C, assuming some representation for C.
For example, if concepts in C are conjunctions of up to k boolean features, each described by listing the indices of the features in the conjunction, then size(c) is the number of boolean features actually used to describe c.

Our definition of PAC learning may at first appear to be concerned only with the computational resources required for learning, whereas in practice we are usually more concerned with the number of training examples required. However, the two are very closely related: If L requires some minimum processing time per training example, then for C to be PAC-learnable by L, L must learn from a polynomial number of training examples. In fact, a typical approach to showing that some class C of target concepts is PAC-learnable is to first show that each target concept in C can be learned from a polynomial number of training examples and then show that the processing time per example is also polynomially bounded.

Before moving on, we should point out a restrictive assumption implicit in our definition of PAC-learnable. This definition implicitly assumes that the learner's hypothesis space H contains a hypothesis with arbitrarily small error for every target concept in C. This follows from the requirement in the above definition that the learner succeed when the error bound $\epsilon$ is arbitrarily close to zero. Of course this is difficult to assure if one does not know C in advance (what is C for a program that must learn to recognize faces from images?), unless H is taken to be the power set of X. As pointed out in Chapter 2, such an unbiased H will not support accurate generalization from a reasonable number of training examples. Nevertheless, the results based on the PAC learning model provide useful insights regarding the relative complexity of different learning problems and regarding the rate at which generalization accuracy improves with additional training examples.
Furthermore, in Section 7.3.1 we will lift this restrictive assumption, to consider the case in which the learner makes no prior assumption about the form of the target concept.

7.3 SAMPLE COMPLEXITY FOR FINITE HYPOTHESIS SPACES

As noted above, PAC-learnability is largely determined by the number of training examples required by the learner. The growth in the number of required training examples with problem size, called the sample complexity of the learning problem, is the characteristic that is usually of greatest interest. The reason is that in most practical settings the factor that most limits success of the learner is the limited availability of training data.

Here we present a general bound on the sample complexity for a very broad class of learners, called consistent learners. A learner is consistent if it outputs hypotheses that perfectly fit the training data, whenever possible. It is quite reasonable to ask that a learning algorithm be consistent, given that we typically prefer a hypothesis that fits the training data over one that does not. Note that many of the learning algorithms discussed in earlier chapters, including all the learning algorithms described in Chapter 2, are consistent learners.

Can we derive a bound on the number of training examples required by any consistent learner, independent of the specific algorithm it uses to derive a consistent hypothesis? The answer is yes. To accomplish this, it is useful to recall the definition of version space from Chapter 2. There we defined the version space, $VS_{H,D}$, to be the set of all hypotheses $h \in H$ that correctly classify the training examples D.

$$VS_{H,D} = \{h \in H \mid (\forall \langle x, c(x) \rangle \in D)\,(h(x) = c(x))\}$$

The significance of the version space here is that every consistent learner outputs a hypothesis belonging to the version space, regardless of the instance space X, hypothesis space H, or training data D.
The reason is simply that by definition the version space $VS_{H,D}$ contains every consistent hypothesis in H. Therefore, to bound the number of examples needed by any consistent learner, we need only bound the number of examples needed to assure that the version space contains no unacceptable hypotheses. The following definition, after Haussler (1988), states this condition precisely.

Definition: Consider a hypothesis space H, target concept c, instance distribution $\mathcal{D}$, and set of training examples D of c. The version space $VS_{H,D}$ is said to be $\epsilon$-exhausted with respect to c and $\mathcal{D}$, if every hypothesis h in $VS_{H,D}$ has error less than $\epsilon$ with respect to c and $\mathcal{D}$.

$$(\forall h \in VS_{H,D})\; error_{\mathcal{D}}(h) < \epsilon$$

This definition is illustrated in Figure 7.2. The version space is $\epsilon$-exhausted just in the case that all the hypotheses consistent with the observed training examples (i.e., those with zero training error) happen to have true error less than $\epsilon$. Of course from the learner's viewpoint all that can be known is that these hypotheses fit the training data equally well: they all have zero training error. Only an observer who knew the identity of the target concept could determine with certainty whether the version space is $\epsilon$-exhausted. Surprisingly, a probabilistic argument allows us to bound the probability that the version space will be $\epsilon$-exhausted after a given number of training examples, even without knowing the identity of the target concept or the distribution from which training examples are drawn.

FIGURE 7.2 Exhausting the version space. [Figure labels: hypothesis space H; each hypothesis annotated with its true error (error) and training error (r).] The version space $VS_{H,D}$ is the subset of hypotheses $h \in H$ which have zero training error (denoted by r = 0 in the figure). Of course the true $error_{\mathcal{D}}(h)$ (denoted by error in the figure) may be nonzero, even for hypotheses that commit zero errors over the training data. The version space is said to be $\epsilon$-exhausted when all hypotheses h remaining in $VS_{H,D}$ have $error_{\mathcal{D}}(h) < \epsilon$.
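The notions of version space and $\epsilon$-exhaustion can be checked by brute force on a toy problem. The sketch below is only an illustration under assumed conditions: the instance space 0..9 with a uniform distribution, the threshold hypothesis space, the target threshold 5, and the two training examples are all hypothetical choices, not from the text.

```python
# Assumed toy setting: instances 0..9 drawn uniformly, threshold hypotheses.
X = range(10)

def make_threshold(t):
    """Hypothesis labeling x positive exactly when x >= t."""
    return lambda x: int(x >= t)

H = {t: make_threshold(t) for t in range(11)}  # thresholds 0..10
c = H[5]                                       # assumed target concept

def version_space(H, examples):
    """Thresholds whose hypotheses agree with every training example."""
    return [t for t, h in H.items()
            if all(h(x) == label for x, label in examples)]

def true_error(h):
    """Fraction of X on which h disagrees with c (uniform distribution)."""
    return sum(h(x) != c(x) for x in X) / len(X)

def is_exhausted(vs, eps):
    """True when every hypothesis in the version space has true error < eps."""
    return all(true_error(H[t]) < eps for t in vs)

examples = [(2, c(2)), (8, c(8))]
vs = version_space(H, examples)
print(vs)                      # [3, 4, 5, 6, 7, 8]
print(is_exhausted(vs, 0.35))  # True
print(is_exhausted(vs, 0.25))  # False
```

Every hypothesis in `vs` has zero training error, yet the threshold-8 hypothesis has true error 0.3, so this version space is exhausted for $\epsilon = 0.35$ but not for $\epsilon = 0.25$. Note that the check requires knowing c, which the learner itself does not.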
Haussler (1988) provides such a bound, in the form of the following theorem.

Theorem 7.1. $\epsilon$-exhausting the version space. If the hypothesis space H is finite, and D is a sequence of $m \ge 1$ independent randomly drawn examples of some target concept c, then for any $0 \le \epsilon \le 1$, the probability that the version space $VS_{H,D}$ is not $\epsilon$-exhausted (with respect to c) is less than or equal to

$$|H| e^{-\epsilon m}$$

Proof. Let $h_1, h_2, \ldots, h_k$ be all the hypotheses in H that have true error greater than $\epsilon$ with respect to c. We fail to $\epsilon$-exhaust the version space if and only if at least one of these k hypotheses happens to be consistent with all m independent random training examples. The probability that any single hypothesis having true error greater than $\epsilon$ would be consistent with one randomly drawn example is at most $(1 - \epsilon)$. Therefore the probability that this hypothesis will be consistent with m independently drawn examples is at most $(1 - \epsilon)^m$. Given that we have k hypotheses with error greater than $\epsilon$, the probability that at least one of these will be consistent with all m training examples is at most

$$k(1 - \epsilon)^m$$

And since $k \le |H|$, this is at most $|H|(1 - \epsilon)^m$. Finally, we use a general inequality stating that if $0 \le \epsilon \le 1$ then $(1 - \epsilon) \le e^{-\epsilon}$. Thus,

$$k(1 - \epsilon)^m \le |H|(1 - \epsilon)^m \le |H| e^{-\epsilon m}$$

which proves the theorem. □

We have just proved an upper bound on the probability that the version space is not $\epsilon$-exhausted, based on the number of training examples m, the allowed error $\epsilon$, and the size of H. Put another way, this bounds the probability that m training examples will fail to eliminate all "bad" hypotheses (i.e., hypotheses with true error greater than $\epsilon$), for any consistent learner using hypothesis space H.

Let us use this result to determine the number of training examples required to reduce this probability of failure below some desired level $\delta$.

$$|H| e^{-\epsilon m} \le \delta \qquad (7.1)$$
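As a rough numerical illustration, the bound $|H| e^{-\epsilon m}$ from Theorem 7.1 can be evaluated directly and searched for the smallest m that pushes it below a chosen $\delta$. The function names and the values $|H| = 1000$, $\epsilon = 0.1$, $\delta = 0.05$ are assumptions made for this sketch.

```python
import math

def failure_bound(h_size, eps, m):
    """Theorem 7.1 upper bound |H| * exp(-eps * m) on the probability
    that the version space is not eps-exhausted after m examples."""
    return h_size * math.exp(-eps * m)

def smallest_m(h_size, eps, delta):
    """Smallest m (found by linear search) with failure_bound <= delta."""
    m = 1
    while failure_bound(h_size, eps, m) > delta:
        m += 1
    return m

# Assumed illustrative values: |H| = 1000, eps = 0.1, delta = 0.05.
for m in (10, 50, 100):
    print(m, failure_bound(1000, 0.1, m))
print(smallest_m(1000, 0.1, 0.05))  # 100
```

For small m the computed value exceeds 1, so the bound says nothing useful there; it only becomes informative once m is large enough to offset the $|H|$ factor.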
Rearranging terms to solve for m, we find

$$m \ge \frac{1}{\epsilon}\left(\ln |H| + \ln(1/\delta)\right) \qquad (7.2)$$

To summarize, the inequality shown in Equation (7.2) provides a general bound on the number of training examples sufficient for any consistent learner to successfully learn any target concept in H, for any desired values of $\delta$ and $\epsilon$. This number m of training examples is sufficient to assure that any consistent hypothesis will be probably (with probability $(1 - \delta)$) approximately (within error $\epsilon$) correct. Notice m grows linearly in $1/\epsilon$ and logarithmically in $1/\delta$. It also grows logarithmically in the size of the hypothesis space H.

Note that the above bound can be a substantial overestimate. For example, although the probability of failing to exhaust the version space must lie in the interval [0, 1], the bound given by the theorem grows linearly with $|H|$. For sufficiently large hypothesis spaces, this bound can easily be greater than one. As a result, the bound given by the inequality in Equation (7.2) can substantially overestimate the number of training examples required. The weakness of this bound is mainly due to the $|H|$ term, which arises in the proof when summing the probability that a single hypothesis could be unacceptable, over all possible hypotheses. In fact, a much tighter bound is possible in many cases, as well as a bound that covers infinitely large hypothesis spaces. This will be the subject of Section 7.4.

7.3.1 Agnostic Learning and Inconsistent Hypotheses

Equation (7.2) is important because it tells us how many training examples suffice to ensure (with probability $(1 - \delta)$) that every hypothesis in H having zero training error will have a true error of at most $\epsilon$. Unfortunately, if H does not contain the target concept c, then a zero-error hypothesis cannot always be found. In this case, the most we might ask of our learner is to output the hypothesis from H that has the minimum error over the training examples.
A learner that makes no assumption that the target concept is representable by H and that simply finds the hypothesis with minimum training error is often called an agnostic learner, because it makes no prior commitment about whether or not $C \subseteq H$.

Although Equation (7.2) is based on the assumption that the learner outputs a zero-error hypothesis, a similar bound can be found for this more general case in which the learner entertains hypotheses with nonzero training error. To state this precisely, let D denote the particular set of training examples available to the learner, in contrast to $\mathcal{D}$, which denotes the probability distribution over the entire set of instances. Let $error_D(h)$ denote the training error of hypothesis h. In particular, $error_D(h)$ is defined as the fraction of the training examples in D that are misclassified by h. Note the $error_D(h)$ over the particular sample of training data D may differ from the true error $error_{\mathcal{D}}(h)$ over the entire probability distribution $\mathcal{D}$. Now let $h_{best}$ denote the hypothesis from H having lowest training error over the training examples. How many training examples suffice to ensure (with high probability) that its true error $error_{\mathcal{D}}(h_{best})$ will be no more than $\epsilon + error_D(h_{best})$? Notice the question considered in the previous section is just a special case of this question, when $error_D(h_{best})$ happens to be zero.

This question can be answered (see Exercise 7.3) using an argument analogous to the proof of Theorem 7.1. It is useful here to invoke the general Hoeffding bounds (sometimes called the additive Chernoff bounds). The Hoeffding bounds characterize the deviation between the true probability of some event and its observed frequency over m independent trials. More precisely, these bounds apply to experiments involving m distinct Bernoulli trials (e.g., m independent flips of a coin with some probability of turning up heads).
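The coin-flip setting can be explored numerically. The sketch below assumes the standard one-sided form of the Hoeffding inequality, $\Pr[p > \hat{p} + \epsilon] \le e^{-2m\epsilon^2}$, where p is the true probability of heads and $\hat{p}$ the observed frequency; it compares that bound with the exact binomial tail. The function names and the values p = 0.3, $\epsilon$ = 0.1 are hypothetical.

```python
import math

def exact_tail(p, m, eps):
    """Exact probability that the observed frequency of heads over m
    independent flips falls more than eps below the true probability p."""
    total = 0.0
    for k in range(m + 1):
        if k / m < p - eps:
            total += math.comb(m, k) * p**k * (1 - p)**(m - k)
    return total

def hoeffding_bound(m, eps):
    """One-sided Hoeffding bound exp(-2 * m * eps^2)."""
    return math.exp(-2 * m * eps**2)

# Assumed illustrative values: p = 0.3, eps = 0.1.
for m in (10, 50, 100, 200):
    print(m, exact_tail(0.3, m, 0.1), hoeffding_bound(m, 0.1))
```

In each row the exact tail should sit below the bound, and both shrink as m grows; the bound is loose because it must hold for every p, not just the one chosen here.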
This is exactly analogous to the setting we consider when estimating the error of a hypothesis in Chapter 5: The probability of the coin being heads corresponds to the probability that the hypothesis will misclassify a randomly drawn instance. The m independent coin flips correspond to the m independently drawn instances. The frequency of heads over the m examples corresponds to the frequency of misclassifications over the m instances.

The Hoeffding bounds state that if the training error $error_D(h)$ is measured over the set D containing m randomly drawn examples, then

$$\Pr[error_{\mathcal{D}}(h) > error_D(h) + \epsilon] \le e^{-2m\epsilon^2}$$

This gives us a bound on the probability that an arbitrarily chosen single hypothesis has a very misleading training error. To assure that the best hypothesis found by L has an error bounded in this way, we must consider the probability that any one of the $|H|$ hypotheses could have a large error

$$\Pr[(\exists h \in H)(error_{\mathcal{D}}(h) > error_D(h) + \epsilon)] \le |H| e^{-2m\epsilon^2}$$

If we call this probability $\delta$, and ask how many examples m suffice to hold $\delta$ to some desired value, we now obtain

$$m \ge \frac{1}{2\epsilon^2}\left(\ln |H| + \ln(1/\delta)\right) \qquad (7.3)$$

This is the generalization of Equation (7.2) to the case in which the learner still picks the best hypothesis $h \in H$, but where the best hypothesis may have nonzero training error. Notice that m depends logarithmically on $|H|$ and on $1/\delta$, as it did in the more restrictive case of Equation (7.2). However, in this less restrictive situation m now grows as the square of $1/\epsilon$, rather than linearly with $1/\epsilon$.

7.3.2 Conjunctions of Boolean Literals Are PAC-Learnable

Now that we have a bound indicating the number of training examples sufficient to probably approximately learn the target concept, we can use it to determine the sample complexity and PAC-learnability of some specific concept classes.

Consider the class C of target concepts described by conjunctions of boolean literals. A boolean literal is any boolean variable (e.g., Old), or its negation (e.g., $\neg$Old).
Thus, conjunctions of boolean literals include target concepts such as "Old $\land$ $\neg$Tall". Is C PAC-learnable? We can show that the answer is yes by first showing that any consistent learner will require only a polynomial number of training examples to learn any c in C, and then suggesting a specific algorithm that uses polynomial time per training example.

Consider any consistent learner L using a hypothesis space H identical to C. We can use Equation (7.2) to compute the number m of random training examples sufficient to ensure that L will, with probability $(1 - \delta)$, output a hypothesis with maximum error $\epsilon$. To accomplish this, we need only determine the size $|H|$ of the hypothesis space.

Now consider the hypothesis space H defined by conjunctions of literals based on n boolean variables. The size $|H|$ of this hypothesis space is $3^n$. To see this, consider the fact that there are only three possibilities for each variable in any given hypothesis: Include the variable as a literal in the hypothesis, include its negation as a literal, or ignore it. Given n such variables, there are $3^n$ distinct hypotheses.

Substituting $|H| = 3^n$ into Equation (7.2) gives the following bound for the sample complexity of learning conjunctions of up to n boolean literals.

$$m \ge \frac{1}{\epsilon}\left(n \ln 3 + \ln(1/\delta)\right) \qquad (7.4)$$

For example, if a consistent learner attempts to learn a target concept described by conjunctions of up to 10 boolean literals, and we desire a 95% probability that it will learn a hypothesis with error less than 0.1, then it suffices to present m randomly drawn training examples, where $m = \frac{1}{0.1}(10 \ln 3 + \ln(1/0.05)) = 140$.

Notice that m grows linearly in the number of literals n, linearly in $1/\epsilon$, and logarithmically in $1/\delta$. What about the overall computational effort? That will depend, of course, on the specific learning algorithm.
However, as long as our learning algorithm requires no more than polynomial computation per training example, and no more than a polynomial number of training examples, then the total computation required will be polynomial as well.

In the case of learning conjunctions of boolean literals, one algorithm that meets this requirement has already been presented in Chapter 2. It is the FIND-S algorithm, which incrementally computes the most specific hypothesis consistent with the training examples. For each new positive training example, this algorithm computes the intersection of the literals shared by the current hypothesis and the new training example, using time linear in n. Therefore, the FIND-S algorithm PAC-learns the concept class of conjunctions of n boolean literals with negations.

Theorem 7.2. PAC-learnability of boolean conjunctions. The class C of conjunctions of boolean literals is PAC-learnable by the FIND-S algorithm using H = C.

Proof. Equation (7.4) shows that the sample complexity for this concept class is polynomial in n, $1/\delta$, and $1/\epsilon$, and independent of size(c). To incrementally process each training example, the FIND-S algorithm requires effort linear in n and independent of $1/\delta$, $1/\epsilon$, and size(c). Therefore, this concept class is PAC-learnable by the FIND-S algorithm. □

7.3.3 PAC-Learnability of Other Concept Classes

As we just saw, Equation (7.2) provides a general basis for bounding the sample complexity for learning target concepts in some given class C. Above we applied it to the class of conjunctions of boolean literals. It can also be used to show that many other concept classes have polynomial sample complexity (e.g., see Exercise 7.2).

7.3.3.1 UNBIASED LEARNERS

Not all concept classes have polynomially bounded sample complexity according to the bound of Equation (7.2). For example, consider the unbiased concept class C that contains every teachable concept relative to X.
The set C of all definable target concepts corresponds to the power set of X (the set of all subsets of X), which contains $|C| = 2^{|X|}$ concepts. Suppose that instances in X are defined by n boolean features. In this case, there will be $|X| = 2^n$ distinct instances, and therefore $|C| = 2^{|X|} = 2^{2^n}$ distinct concepts. Of course to learn such an unbiased concept class, the learner must itself use an unbiased hypothesis space H = C. Substituting $|H| = 2^{2^n}$ into Equation (7.2) gives the sample complexity for learning the unbiased concept class relative to X.

$$m \ge \frac{1}{\epsilon}\left(2^n \ln 2 + \ln(1/\delta)\right) \qquad (7.5)$$

Thus, this unbiased class of target concepts has exponential sample complexity under the PAC model, according to Equation (7.2). Although Equations (7.2) and (7.5) are not tight upper bounds, it can in fact be proven that the sample complexity for the unbiased concept class is exponential in n.

7.3.3.2 K-TERM DNF AND K-CNF CONCEPTS

It is also possible to find concept classes that have polynomial sample complexity, but nevertheless cannot be learned in polynomial time. One interesting example is the concept class C of k-term disjunctive normal form (k-term DNF) expressions. k-term DNF expressions are of the form $T_1 \lor T_2 \lor \cdots \lor T_k$, where each term $T_i$ is a conjunction of n boolean attributes and their negations. Assuming H = C, it is easy to show that $|H|$ is at most $3^{nk}$ (because there are k terms, each of which may take on $3^n$ possible values). Note $3^{nk}$ is an overestimate of $|H|$, because it is double counting the cases where $T_i = T_j$ and where $T_i$ is more_general_than $T_j$. Still, we can use this upper bound on $|H|$ to obtain an upper bound on the sample complexity, substituting this into Equation (7.2).

$$m \ge \frac{1}{\epsilon}\left(nk \ln 3 + \ln(1/\delta)\right)$$

which indicates that the sample complexity of k-term DNF is polynomial in $1/\epsilon$, $1/\delta$, n, and k.
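The bounds obtained from Equation (7.2) for the three hypothesis spaces discussed so far (conjunctions with $|H| = 3^n$, the unbiased space with $|H| = 2^{2^n}$, and k-term DNF with $|H| \le 3^{nk}$) can be compared numerically. The following is a minimal sketch; the function names and the choice n = 10, k = 3, $\epsilon$ = 0.1, $\delta$ = 0.05 are assumptions for illustration.

```python
import math

def sample_complexity(ln_h, eps, delta):
    """Equation (7.2): m >= (1/eps) * (ln|H| + ln(1/delta)).
    Takes ln|H| directly so very large hypothesis spaces do not overflow."""
    return math.ceil((ln_h + math.log(1 / delta)) / eps)

def ln_h_conjunctions(n):
    return n * math.log(3)          # |H| = 3^n

def ln_h_unbiased(n):
    return (2 ** n) * math.log(2)   # |H| = 2^(2^n)

def ln_h_k_term_dnf(n, k):
    return n * k * math.log(3)      # |H| <= 3^(nk)

eps, delta = 0.1, 0.05
print(sample_complexity(ln_h_conjunctions(10), eps, delta))   # 140
print(sample_complexity(ln_h_unbiased(10), eps, delta))       # 7128
print(sample_complexity(ln_h_k_term_dnf(10, 3), eps, delta))  # 360
```

Working with $\ln |H|$ rather than $|H|$ is a design choice for the sketch: the unbiased space has $2^{1024}$ hypotheses at n = 10, which is awkward to handle as a floating-point number. The first printed value reproduces the 140-example figure for conjunctions of up to 10 literals.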
Despite having polynomial sample complexity, the computational complexity is not polynomial, because this learning problem can be shown to be equivalent to other problems that are known to be unsolvable in polynomial time (unless RP = NP). Thus, although k-term DNF has polynomial sample complexity, it does not have polynomial computational complexity for a learner using H = C.

The surprising fact about k-term DNF is that although it is not PAC-learnable when H = C, there is a strictly larger concept class that is! This is possible because the larger concept class has polynomial computational complexity per example and still has polynomial sample complexity. This larger class is the class of k-CNF expressions: conjunctions of arbitrary length of the form $T_1 \land T_2 \land \cdots \land T_j$, where each $T_i$ is a disjunction of up to k boolean attributes. It is straightforward to show that k-CNF subsumes k-term DNF, because any k-term DNF expression can easily be rewritten as a k-CNF expression (but not vice versa). Although k-CNF is more expressive than k-term DNF, it has both polynomial sample complexity and polynomial time complexity. Hence, the concept class k-term DNF is PAC-learnable by an efficient algorithm using H = k-CNF. See Kearns and Vazirani (1994) for a more detailed discussion.

7.4 SAMPLE COMPLEXITY FOR INFINITE HYPOTHESIS SPACES

In the above section we showed that sample complexity for PAC learning grows as the logarithm of the size of the hypothesis space. While Equation (7.2) is quite useful, there are two drawbacks to characterizing sample complexity in terms of $|H|$. First, it can lead to quite weak bounds (recall that the bound on $\delta$ can be significantly greater than 1 for large $|H|$). Second, in the case of infinite hypothesis spaces we cannot apply Equation (7.2) at all!

Here we consider a second measure of the complexity of H, called the Vapnik-Chervonenkis dimension of H (VC dimension, or VC(H), for short).
As we shall see, we can state bounds on sample complexity that use VC(H) rather than $|H|$. In many cases, the sample complexity bounds based on VC(H) will be tighter than those from Equation (7.2). In addition, these bounds allow us to characterize the sample complexity of many infinite hypothesis spaces, and can be shown to be fairly tight.

7.4.1 Shattering a Set of Instances

The VC dimension measures the complexity of the hypothesis space H, not by the number of distinct hypotheses $|H|$, but instead by the number of distinct instances from X that can be completely discriminated using H. To make this notion more precise, we first define the notion of shattering a set of instances. Consider some subset of instances $S \subseteq X$. For example, Figure 7.3 shows a subset of three instances from X. Each hypothesis h from H imposes some dichotomy on S; that is, h partitions S into the two subsets $\{x \in S \mid h(x) = 1\}$ and $\{x \in S \mid h(x) = 0\}$. Given some instance set S, there are $2^{|S|}$ possible dichotomies, though H may be unable to represent some of these. We say that H shatters S if every possible dichotomy of S can be represented by some hypothesis from H.

Definition: A set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.

Figure 7.3 illustrates a set S of three instances that is shattered by the hypothesis space. Notice that each of the $2^3$ dichotomies of these three instances is covered by some hypothesis. Note that if a set of instances is not shattered by a hypothesis space, then there must be some concept (dichotomy) that can be defined over the instances, but that cannot be represented by the hypothesis space. The ability of H to shatter a set of instances is thus a measure of its capacity to represent target concepts defined over these instances.

FIGURE 7.3 A set of three instances (drawn from instance space X) shattered by eight hypotheses. For every possible dichotomy of the instances, there exists a corresponding hypothesis.

7.4.2 The Vapnik-Chervonenkis Dimension

The ability to shatter a set of instances is closely related to the inductive bias of a hypothesis space. Recall from Chapter 2 that an unbiased hypothesis space is one capable of representing every possible concept (dichotomy) definable over the instance space X. Put briefly, an unbiased hypothesis space H is one that shatters the instance space X. What if H cannot shatter X, but can shatter some large subset S of X? Intuitively, it seems reasonable to say that the larger the subset of X that can be shattered, the more expressive H. The VC dimension of H is precisely this measure.

Definition: The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then $VC(H) \equiv \infty$.

Note that for any finite H, $VC(H) \leq \log_2 |H|$. To see this, suppose that VC(H) = d. Then H will require $2^d$ distinct hypotheses to shatter d instances. Hence, $2^d \leq |H|$, and $d = VC(H) \leq \log_2 |H|$.

7.4.2.1 ILLUSTRATIVE EXAMPLES

In order to develop an intuitive feeling for VC(H), consider a few example hypothesis spaces. To get started, suppose the instance space X is the set of real numbers $X = \Re$ (e.g., describing the height of people), and H the set of intervals on the real number line. In other words, H is the set of hypotheses of the form a < x < b, where a and b may be any real constants. What is VC(H)? To answer this question, we must find the largest subset of X that can be shattered by H. Consider a particular subset containing two distinct instances, say S = {3.1, 5.7}. Can S be shattered by H? Yes. For example, the four hypotheses (1 < x < 2), (1 < x < 4), (4 < x < 7), and (1 < x < 7) will do.
Together, they represent each of the four dichotomies over S, covering neither instance, either one of the instances, and both of the instances, respectively. Since we have found a set of size two that can be shattered by H, we know the VC dimension of H is at least two. Is there a set of size three that can be shattered? Consider a set $S = \{x_0, x_1, x_2\}$ containing three arbitrary instances. Without loss of generality, assume $x_0 < x_1 < x_2$. Clearly this set cannot be shattered, because the dichotomy that includes $x_0$ and $x_2$, but not $x_1$, cannot be represented by a single interval. Therefore, no subset S of size three can be shattered, and VC(H) = 2. Note here that H is infinite, but VC(H) is finite.

Next consider the set X of instances corresponding to points on the x, y plane (see Figure 7.4). Let H be the set of all linear decision surfaces in the plane. In other words, H is the hypothesis space corresponding to a single perceptron unit with two inputs (see Chapter 4 for a general discussion of perceptrons). What is the VC dimension of this H? It is easy to see that any two distinct points in the plane can be shattered by H, because we can find four linear surfaces that include neither, either, or both points. What about sets of three points? As long as the points are not colinear, we will be able to find $2^3$ linear surfaces that shatter them. Of course three colinear points cannot be shattered (for the same reason that the three points on the real line could not be shattered in the previous example). What is VC(H) in this case, two or three? It is at least three. The definition of VC dimension indicates that if we find any set of instances of size d that can be shattered, then $VC(H) \geq d$. To show that VC(H) < d, we must show that no set of size d can be shattered. In this example, no sets of size four can be shattered, so VC(H) = 3.
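The interval example can be checked by brute force. The sketch below is an illustration rather than part of the text: the function name and the choice of candidate endpoints are assumptions. It enumerates the labelings that open intervals a < x < b can impose on a finite set of reals and tests whether all dichotomies appear. Because an interval can only select a contiguous run of the sorted points, it is enough to try endpoints below, between, and above the points.

```python
def intervals_shatter(points):
    """Return True if hypotheses of the form a < x < b shatter `points`."""
    pts = sorted(set(points))
    if not pts:
        raise ValueError("need at least one point")
    # Candidate endpoints: below all points, between neighbours, above all points.
    cuts = [pts[0] - 1.0]
    cuts += [(p + q) / 2 for p, q in zip(pts, pts[1:])]
    cuts += [pts[-1] + 1.0]
    realized = set()
    for a in cuts:
        for b in cuts:
            # Labeling that the interval (a, b) imposes on the sorted points.
            realized.add(tuple(a < x < b for x in pts))
    return len(realized) == 2 ** len(pts)


print(intervals_shatter([3.1, 5.7]))       # the two-point set from the text
print(intervals_shatter([1.0, 2.0, 3.0]))  # any three distinct points
```

For three points only seven of the eight dichotomies are realized; the missing one is the labeling that includes the two outer points but excludes the middle one.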
More generally, it can be shown that the VC dimension of linear decision surfaces in an r dimensional space (i.e., the VC dimension of a perceptron with r inputs) is r + 1.

FIGURE 7.4 The VC dimension for linear decision surfaces in the x, y plane is 3. (a) A set of three points that can be shattered using linear decision surfaces. (b) A set of three that cannot be shattered.

As one final example, suppose each instance in X is described by the conjunction of exactly three boolean literals, and suppose that each hypothesis in H is described by the conjunction of up to three boolean literals. What is VC(H)? We can show that it is at least 3, as follows. Represent each instance by a 3-bit string corresponding to the values of each of its three literals $l_1$, $l_2$, and $l_3$. Consider the following set of three instances:

instance_1: 100
instance_2: 010
instance_3: 001

This set of three instances can be shattered by H, because a hypothesis can be constructed for any desired dichotomy as follows: If the dichotomy is to exclude instance_i, add the literal $\neg l_i$ to the hypothesis. For example, suppose we wish to include instance_2, but exclude instance_1 and instance_3. Then we use the hypothesis $\neg l_1 \land \neg l_3$. This argument easily extends from three features to n. Thus, the VC dimension for conjunctions of n boolean literals is at least n. In fact, it is exactly n, though showing this is more difficult, because it requires demonstrating that no set of n + 1 instances can be shattered.

7.4.3 Sample Complexity and the VC Dimension

Earlier we considered the question "How many randomly drawn training examples suffice to probably approximately learn any target concept in C?" (i.e., how many examples suffice to $\epsilon$-exhaust the version space with probability $(1 - \delta)$?). Using VC(H) as a measure for the complexity of H, it is possible to derive an alternative answer to this question, analogous to the earlier bound of Equation (7.2). This new bound (see Blumer et al. 1989) is

$$m \geq \frac{1}{\epsilon}\left(4 \log_2(2/\delta) + 8\,VC(H) \log_2(13/\epsilon)\right) \qquad (7.7)$$

Note that just as in the bound from Equation (7.2), the number of required training examples m grows logarithmically in $1/\delta$. It now grows log times linear in $1/\epsilon$, rather than linearly. Significantly, the $\ln |H|$ term in the earlier bound has now been replaced by the alternative measure of hypothesis space complexity, VC(H) (recall $VC(H) \leq \log_2 |H|$).

Equation (7.7) provides an upper bound on the number of training examples sufficient to probably approximately learn any target concept in C, for any desired $\epsilon$ and $\delta$. It is also possible to obtain a lower bound, as summarized in the following theorem (see Ehrenfeucht et al. 1989).

Theorem 7.3. Lower bound on sample complexity. Consider any concept class C such that $VC(C) \geq 2$, any learner L, and any $0 < \epsilon < \frac{1}{8}$, and $0 < \delta < \frac{1}{100}$. Then there exists a distribution $\mathcal{D}$ and target concept in C such that if L observes fewer examples than

$$\max\left[\frac{1}{\epsilon} \log(1/\delta),\ \frac{VC(C) - 1}{32\epsilon}\right]$$

then with probability at least $\delta$, L outputs a hypothesis h having $error_{\mathcal{D}}(h) > \epsilon$.

This theorem states that if the number of training examples is too few, then no learner can PAC-learn every target concept in any nontrivial C. Thus, this theorem provides a lower bound on the number of training examples necessary for successful learning, complementing the earlier upper bound that gives a sufficient number. Notice this lower bound is determined by the complexity of the concept class C, whereas our earlier upper bounds were determined by H. (Why?)†

This lower bound shows that the upper bound of the inequality in Equation (7.7) is fairly tight. Both bounds are logarithmic in $1/\delta$ and linear in VC(H). The only difference in the order of these two bounds is the extra $\log(1/\epsilon)$ dependence in the upper bound.
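To get a feel for the gap between the two bounds, the following sketch evaluates both. It rests on stated assumptions: Equation (7.7) is taken to be the Blumer et al. (1989) form $m \geq \frac{1}{\epsilon}(4\log_2(2/\delta) + 8\,VC(H)\log_2(13/\epsilon))$, the lower bound is taken to be $\max[\frac{1}{\epsilon}\log(1/\delta), \frac{VC(C)-1}{32\epsilon}]$, and the unspecified logarithm in the lower bound is read as the natural logarithm. The function names are illustrative.

```python
import math


def vc_upper_bound(vc: int, epsilon: float, delta: float) -> int:
    """Number of examples sufficient under the assumed form of Equation (7.7)."""
    total = 4 * math.log2(2 / delta) + 8 * vc * math.log2(13 / epsilon)
    return math.ceil(total / epsilon)


def vc_lower_bound(vc: int, epsilon: float, delta: float) -> float:
    """Threshold from the assumed form of Theorem 7.3.

    The theorem applies for vc >= 2, epsilon < 1/8 and delta < 1/100.
    The log base is an assumption (natural log).
    """
    return max(math.log(1 / delta) / epsilon, (vc - 1) / (32 * epsilon))


# Illustrative settings: VC dimension 3, epsilon = 0.1, delta = 0.005.
print(vc_upper_bound(3, 0.1, 0.005))
print(vc_lower_bound(3, 0.1, 0.005))
```

For these settings the sufficient number of examples is in the low thousands while the necessary number is in the tens, which shows how much of the gap comes from the constant factors rather than from the order of growth.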
7.4.4 VC Dimension for Neural Networks

Given the discussion of artificial neural network learning in Chapter 4, it is interesting to consider how we might calculate the VC dimension of a network of interconnected units such as the feedforward networks trained by the BACKPROPAGATION procedure. This section presents a general result that allows computing the VC dimension of layered acyclic networks, based on the structure of the network and the VC dimension of its individual units. This VC dimension can then be used to bound the number of training examples sufficient to probably approximately correctly learn a feedforward network to desired values of $\epsilon$ and $\delta$. This section may be skipped on a first reading without loss of continuity.

Consider a network, G, of units, which forms a layered directed acyclic graph. A directed acyclic graph is one for which the edges have a direction (e.g., the units have inputs and outputs), and in which there are no directed cycles. A layered graph is one whose nodes can be partitioned into layers such that all directed edges from nodes at layer l go to nodes at layer l + 1. The layered feedforward neural networks discussed throughout Chapter 4 are examples of such layered directed acyclic graphs.

It turns out that we can bound the VC dimension of such networks based on their graph structure and the VC dimension of the primitive units from which they are constructed. To formalize this, we must first define a few more terms. Let n be the number of inputs to the network G, and let us assume that there is just one output node. Let each internal unit $N_i$ of G (i.e., each node that is not an input) have at most r inputs and implement a boolean-valued function $c_i : \Re^r \rightarrow \{0, 1\}$ from some function class C. For example, if the internal nodes are perceptrons, then C will be the class of linear threshold functions defined over $\Re^r$.
We can now define the G-composition of C to be the class of all functions that can be implemented by the network G assuming individual units in G take on functions from the class C. In brief, the G-composition of C is the hypothesis space representable by the network G.

†Hint: If we were to substitute H for C in the lower bound, this would result in a tighter bound on m in the case $H \supset C$.

The following theorem bounds the VC dimension of the G-composition of C, based on the VC dimension of C and the structure of G.

Theorem 7.4. VC-dimension of directed acyclic layered networks. (See Kearns and Vazirani 1994.) Let G be a layered directed acyclic graph with n input nodes and $s \geq 2$ internal nodes, each having at most r inputs. Let C be a concept class over $\Re^r$ of VC dimension d, corresponding to the set of functions that can be described by each of the s internal nodes. Let $C_G$ be the G-composition of C, corresponding to the set of functions that can be represented by G. Then $VC(C_G) \leq 2ds \log(es)$, where e is the base of the natural logarithm.

Note this bound on the VC dimension of the network G grows linearly with the VC dimension d of its individual units and log times linear in s, the number of threshold units in the network.

Suppose we consider acyclic layered networks whose individual nodes are perceptrons. Recall from Chapter 4 that an r input perceptron uses linear decision surfaces to represent boolean functions over $\Re^r$. As noted in Section 7.4.2.1, the VC dimension of linear decision surfaces over $\Re^r$ is r + 1. Therefore, a single perceptron with r inputs has VC dimension r + 1. We can use this fact, together with the above theorem, to bound the VC dimension of acyclic layered networks containing s perceptrons, each with r inputs, as

$$VC(C_G^{perceptrons}) \leq 2(r + 1)s \log(es)$$

We can now bound the number m of training examples sufficient to learn (with probability at least $(1 - \delta)$) any target concept from $C_G^{perceptrons}$ to within error $\epsilon$.
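As a rough numerical illustration of Theorem 7.4, the sketch below evaluates the bound $2ds \log(es)$ and its perceptron special case with d = r + 1. The text does not state the base of the logarithm, so the sketch assumes base 2 and exposes the base as a parameter; the function names are illustrative.

```python
import math


def network_vc_bound(d: int, s: int, log_base: float = 2.0) -> float:
    """Upper bound 2*d*s*log(e*s) on the VC dimension of the G-composition.

    d is the VC dimension of each unit and s the number of internal nodes.
    The log base is an assumption; the theorem requires s >= 2.
    """
    if s < 2:
        raise ValueError("Theorem 7.4 assumes at least two internal nodes")
    return 2 * d * s * math.log(math.e * s, log_base)


def perceptron_network_vc_bound(r: int, s: int, log_base: float = 2.0) -> float:
    """Bound for s perceptrons with r inputs each, using d = r + 1."""
    return network_vc_bound(r + 1, s, log_base)


# Illustrative network: three perceptrons with two inputs each.
print(perceptron_network_vc_bound(2, 3))
```

The bound grows linearly in the number of inputs per unit and slightly faster than linearly in the number of units.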
Substituting the above expression for the network VC dimension into Equation (7.7), we have

$$m \geq \frac{1}{\epsilon}\left(4 \log_2(2/\delta) + 16(r + 1)s \log(es) \log_2(13/\epsilon)\right)$$

As illustrated by this perceptron network example, the above theorem is interesting because it provides a general method for bounding the VC dimension of layered, acyclic networks of units, based on the network structure and the VC dimension of the individual units. Unfortunately the above result does not directly apply to networks trained using BACKPROPAGATION, for two reasons. First, this result applies to networks of perceptrons rather than networks of sigmoid units to which the BACKPROPAGATION algorithm applies. Nevertheless, notice that the VC dimension of sigmoid units will be at least as great as that of perceptrons, because a sigmoid unit can approximate a perceptron to arbitrary accuracy by using sufficiently large weights. Therefore, the above bound on m will be at least as large for acyclic layered networks of sigmoid units. The second shortcoming of the above result is that it fails to account for the fact that BACKPROPAGATION trains a network by beginning with near-zero weights, then iteratively modifying these weights until an acceptable hypothesis is found. Thus, BACKPROPAGATION with a cross-validation stopping criterion exhibits an inductive bias in favor of networks with small weights. This inductive bias, which reduces the effective VC dimension, is not captured by the above analysis.

7.5 THE MISTAKE BOUND MODEL OF LEARNING

While we have focused thus far on the PAC learning model, computational learning theory considers a variety of different settings and questions.
Different learning settings that have been studied vary by how the training examples are generated (e.g., passive observation of random examples, active querying by the learner), noise in the data (e.g., noisy or error-free), the definition of success (e.g., the target concept must be learned exactly, or only probably and approximately), assumptions made by the learner (e.g., regarding the distribution of instances and whether $C \subseteq H$), and the measure according to which the learner is evaluated (e.g., number of training examples, number of mistakes, total time).

In this section we consider the mistake bound model of learning, in which the learner is evaluated by the total number of mistakes it makes before it converges to the correct hypothesis. As in the PAC setting, we assume the learner receives a sequence of training examples. However, here we demand that upon receiving each example x, the learner must predict the target value c(x), before it is shown the correct target value by the trainer. The question considered is "How many mistakes will the learner make in its predictions before it learns the target concept?" This question is significant in practical settings where learning must be done while the system is in actual use, rather than during some off-line training stage. For example, if the system is to learn to predict which credit card purchases should be approved and which are fraudulent, based on data collected during use, then we are interested in minimizing the total number of mistakes it will make before converging to the correct target function. Here the total number of mistakes can be even more important than the total number of training examples.

This mistake bound learning problem may be studied in various specific settings. For example, we might count the number of mistakes made before PAC learning the target concept. In the examples below, we consider instead the number of mistakes made before learning the target concept exactly.
Learning the target concept exactly means converging to a hypothesis such that $(\forall x)\, h(x) = c(x)$.

7.5.1 Mistake Bound for the FIND-S Algorithm

To illustrate, consider again the hypothesis space H consisting of conjunctions of up to n boolean literals $l_1 \ldots l_n$ and their negations (e.g., $Rich \land \neg Handsome$). Recall the FIND-S algorithm from Chapter 2, which incrementally computes the maximally specific hypothesis consistent with the training examples. A straightforward implementation of FIND-S for the hypothesis space H is as follows:

FIND-S:
• Initialize h to the most specific hypothesis $l_1 \land \neg l_1 \land l_2 \land \neg l_2 \ldots l_n \land \neg l_n$
• For each positive training instance x
    • Remove from h any literal that is not satisfied by x
• Output hypothesis h.

FIND-S converges in the limit to a hypothesis that makes no errors, provided $C \subseteq H$ and provided the training data is noise-free. FIND-S begins with the most specific hypothesis (which classifies every instance as a negative example), then incrementally generalizes this hypothesis as needed to cover observed positive training examples. For the hypothesis representation used here, this generalization step consists of deleting unsatisfied literals.

Can we prove a bound on the total number of mistakes that FIND-S will make before exactly learning the target concept c? The answer is yes. To see this, note first that if $c \in H$, then FIND-S can never mistakenly classify a negative example as positive. The reason is that its current hypothesis h is always at least as specific as the target concept c. Therefore, to calculate the number of mistakes it will make, we need only count the number of mistakes it will make misclassifying truly positive examples as negative. How many such mistakes can occur before FIND-S learns c exactly? Consider the first positive example encountered by FIND-S. The learner will certainly make a mistake classifying this example, because its initial hypothesis labels every instance negative.
However, the result will be that half of the 2n terms in its initial hypothesis will be eliminated, leaving only n terms. For each subsequent positive example that is mistakenly classified by the current hypothesis, at least one more of the remaining n terms must be eliminated from the hypothesis. Therefore, the total number of mistakes can be at most n + 1. This number of mistakes will be required in the worst case, corresponding to learning the most general possible target concept $(\forall x)\, c(x) = 1$ and corresponding to a worst case sequence of instances that removes only one literal per mistake.

7.5.2 Mistake Bound for the HALVING Algorithm

As a second example, consider an algorithm that learns by maintaining a description of the version space, incrementally refining the version space as each new training example is encountered. The CANDIDATE-ELIMINATION algorithm and the LIST-THEN-ELIMINATE algorithm from Chapter 2 are examples of such algorithms. In this section we derive a worst-case bound on the number of mistakes that will be made by such a learner, for any finite hypothesis space H, assuming again that the target concept must be learned exactly.

To analyze the number of mistakes made while learning we must first specify precisely how the learner will make predictions given a new instance x. Let us assume this prediction is made by taking a majority vote among the hypotheses in the current version space. If the majority of version space hypotheses classify the new instance as positive, then this prediction is output by the learner. Otherwise a negative prediction is output.

This combination of learning the version space, together with using a majority vote to make subsequent predictions, is often called the HALVING algorithm. What is the maximum number of mistakes that can be made by the HALVING algorithm, for an arbitrary finite H, before it exactly learns the target concept?
Notice that learning the target concept "exactly" corresponds to reaching a state where the version space contains only a single hypothesis (as usual, we assume the target concept c is in H). To derive the mistake bound, note that the only time the HALVING algorithm can make a mistake is when the majority of hypotheses in its current version space incorrectly classify the new example. In this case, once the correct classification is revealed to the learner, the version space will be reduced to at most half its current size (i.e., only those hypotheses that voted with the minority will be retained). Given that each mistake reduces the size of the version space by at least half, and given that the initial version space contains only $|H|$ members, the maximum number of mistakes possible before the version space contains just one member is $\log_2 |H|$. In fact one can show the bound is $\lfloor \log_2 |H| \rfloor$. Consider, for example, the case in which $|H| = 7$. The first mistake must reduce $|H|$ to at most 3, and the second mistake will then reduce it to 1.

Note that $\lfloor \log_2 |H| \rfloor$ is a worst-case bound, and that it is possible for the HALVING algorithm to learn the target concept exactly without making any mistakes at all! This can occur because even when the majority vote is correct, the algorithm will remove the incorrect, minority hypotheses. If this occurs over the entire training sequence, then the version space may be reduced to a single member while making no mistakes along the way.

One interesting extension to the HALVING algorithm is to allow the hypotheses to vote with different weights. Chapter 6 describes the Bayes optimal classifier, which takes such a weighted vote among hypotheses. In the Bayes optimal classifier, the weight assigned to each hypothesis is the estimated posterior probability that it describes the target concept, given the training data. Later in this section we describe a different algorithm based on weighted voting, called the WEIGHTED-MAJORITY algorithm.
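The HALVING procedure described in this section can be sketched in a few lines. This is a minimal illustration, not an implementation from the text: hypotheses are represented as Python callables returning a boolean, and the toy hypothesis space of eight threshold concepts is hypothetical. Ties in the vote are resolved as a negative prediction, following the rule stated above that a positive prediction requires a majority.

```python
import math


def halving(hypotheses, examples):
    """Run HALVING over (instance, label) pairs.

    Returns the number of prediction mistakes and the final version space.
    """
    version_space = list(hypotheses)
    mistakes = 0
    for x, label in examples:
        positive_votes = sum(1 for h in version_space if h(x))
        # Predict positive only on a strict majority; otherwise predict negative.
        prediction = positive_votes > len(version_space) / 2
        if prediction != label:
            mistakes += 1
        # Keep only the hypotheses consistent with the revealed label.
        version_space = [h for h in version_space if h(x) == label]
    return mistakes, version_space


# Hypothetical toy space: eight threshold concepts h_t(x) = (x >= t), t = 0..7.
hypotheses = [lambda x, t=t: x >= t for t in range(8)]
target = hypotheses[5]
examples = [(x, target(x)) for x in (4, 6, 5, 2, 7, 0, 1, 3)]

mistakes, version_space = halving(hypotheses, examples)
bound = math.floor(math.log2(len(hypotheses)))
print(mistakes, len(version_space), bound)
```

On this sequence the learner errs on the first and third examples and then holds a single hypothesis, so the mistake count stays within the floor of log2 of the hypothesis space size. The second example shrinks the version space even though the prediction was correct.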
7.5.3 Optimal Mistake Bounds

The above analyses give worst-case mistake bounds for two specific algorithms: FIND-S and CANDIDATE-ELIMINATION. It is interesting to ask what is the optimal mistake bound for an arbitrary concept class C, assuming H = C. By optimal mistake bound we mean the lowest worst-case mistake bound over all possible learning algorithms. To be more precise, for any learning algorithm A and any target concept c, let $M_A(c)$ denote the maximum over all possible sequences of training examples of the number of mistakes made by A to exactly learn c. Now for any nonempty concept class C, let $M_A(C) \equiv \max_{c \in C} M_A(c)$. Note that above we showed $M_{Find\text{-}S}(C) = n + 1$ when C is the concept class described by up to n boolean literals. We also showed $M_{Halving}(C) \leq \log_2(|C|)$ for any concept class C.

We define the optimal mistake bound for a concept class C below.

Definition: Let C be an arbitrary nonempty concept class. The optimal mistake bound for C, denoted Opt(C), is the minimum over all possible learning algorithms A of $M_A(C)$.

$$Opt(C) \equiv \min_{A \in \text{learning algorithms}} M_A(C)$$

Speaking informally, this definition states that Opt(C) is the number of mistakes made for the hardest target concept in C, using the hardest training sequence, by the best algorithm. Littlestone (1987) shows that for any concept class C, there is an interesting relationship among the optimal mistake bound for C, the bound of the HALVING algorithm, and the VC dimension of C, namely

$$VC(C) \leq Opt(C) \leq M_{Halving}(C) \leq \log_2(|C|)$$

Furthermore, there exist concept classes for which the four quantities above are exactly equal. One such concept class is the power set $C_P$ of any finite set of instances X. In this case, $VC(C_P) = |X| = \log_2(|C_P|)$, so all four quantities must be equal.
Littlestone (1987) provides examples of other concept classes for which VC(C) is strictly less than Opt(C) and for which Opt(C) is strictly less than $M_{Halving}(C)$.

7.5.4 WEIGHTED-MAJORITY Algorithm

In this section we consider a generalization of the HALVING algorithm called the WEIGHTED-MAJORITY algorithm. The WEIGHTED-MAJORITY algorithm makes predictions by taking a weighted vote among a pool of prediction algorithms and learns by altering the weight associated with each prediction algorithm. These prediction algorithms can be taken to be the alternative hypotheses in H, or they can be taken to be alternative learning algorithms that themselves vary over time. All that we require of a prediction algorithm is that it predict the value of the target concept, given an instance. One interesting property of the WEIGHTED-MAJORITY algorithm is that it is able to accommodate inconsistent training data. This is because it does not eliminate a hypothesis that is found to be inconsistent with some training example, but rather reduces its weight. A second interesting property is that we can bound the number of mistakes made by WEIGHTED-MAJORITY in terms of the number of mistakes committed by the best of the pool of prediction algorithms.

The WEIGHTED-MAJORITY algorithm begins by assigning a weight of 1 to each prediction algorithm, then considers the training examples. Whenever a prediction algorithm misclassifies a new training example its weight is decreased by multiplying it by some number $\beta$, where $0 \leq \beta < 1$. The exact definition of the WEIGHTED-MAJORITY algorithm is given in Table 7.1.

$a_i$ denotes the $i$th prediction algorithm in the pool A of algorithms. $w_i$ denotes the weight associated with $a_i$.

• For all i initialize $w_i \leftarrow 1$
• For each training example $\langle x, c(x) \rangle$
    • Initialize $q_0$ and $q_1$ to 0
    • For each prediction algorithm $a_i$
        • If $a_i(x) = 0$ then $q_0 \leftarrow q_0 + w_i$
          If $a_i(x) = 1$ then $q_1 \leftarrow q_1 + w_i$
    • If $q_1 > q_0$ then predict c(x) = 1
      If $q_0 > q_1$ then predict c(x) = 0
      If $q_1 = q_0$ then predict 0 or 1 at random for c(x)
    • For each prediction algorithm $a_i$ in A do
      If $a_i(x) \neq c(x)$ then $w_i \leftarrow \beta w_i$

TABLE 7.1 WEIGHTED-MAJORITY algorithm.

Notice if $\beta = 0$ then WEIGHTED-MAJORITY is identical to the HALVING algorithm. On the other hand, if we choose some other value for $\beta$, no prediction algorithm will ever be eliminated completely. If an algorithm misclassifies a training example, it will simply receive a smaller vote in the future.

We now show that the number of mistakes committed by the WEIGHTED-MAJORITY algorithm can be bounded in terms of the number of mistakes made by the best prediction algorithm in the voting pool.

Theorem 7.5. Relative mistake bound for WEIGHTED-MAJORITY. Let D be any sequence of training examples, let A be any set of n prediction algorithms, and let k be the minimum number of mistakes made by any algorithm in A for the training sequence D. Then the number of mistakes over D made by the WEIGHTED-MAJORITY algorithm using $\beta = \frac{1}{2}$ is at most

$$2.4(k + \log_2 n)$$

Proof. We prove the theorem by comparing the final weight of the best prediction algorithm to the sum of weights over all algorithms. Let $a_j$ denote an algorithm from A that commits the optimal number k of mistakes. The final weight $w_j$ associated with $a_j$ will be $\left(\frac{1}{2}\right)^k$, because its initial weight is 1 and it is multiplied by $\frac{1}{2}$ for each mistake. Now consider the sum $W = \sum_{i=1}^{n} w_i$ of the weights associated with all n algorithms in A. W is initially n. For each mistake made by WEIGHTED-MAJORITY, W is reduced to at most $\frac{3}{4}W$. This is the case because the algorithms voting in the weighted majority must hold at least half of the total weight W, and this portion of W will be reduced by a factor of $\frac{1}{2}$. Let M denote the total number of mistakes committed by WEIGHTED-MAJORITY for the training sequence D. Then the final total weight W is at most $n\left(\frac{3}{4}\right)^M$.
Because the final weight $w_j$ cannot be greater than the final total weight, we have

$$\left(\frac{1}{2}\right)^k \leq n\left(\frac{3}{4}\right)^M$$

Rearranging terms yields

$$M \leq \frac{k + \log_2 n}{-\log_2\left(\frac{3}{4}\right)} \leq 2.4(k + \log_2 n)$$

which proves the theorem.

To summarize, the above theorem states that the number of mistakes made by the WEIGHTED-MAJORITY algorithm will never be greater than a constant factor times the number of mistakes made by the best member of the pool, plus a term that grows only logarithmically in the size of the pool.

This theorem is generalized by Littlestone and Warmuth (1991), who show that for an arbitrary $0 \leq \beta < 1$ the above bound is

$$\frac{k \log_2 \frac{1}{\beta} + \log_2 n}{\log_2 \frac{2}{1 + \beta}}$$

7.6 SUMMARY AND FURTHER READING

The main points of this chapter include:

• The probably approximately correct (PAC) model considers algorithms that learn target concepts from some concept class C, using training examples drawn at random according to an unknown, but fixed, probability distribution. It requires that the learner probably (with probability at least $[1 - \delta]$) learn a hypothesis that is approximately (within error $\epsilon$) correct, given computational effort and training examples that grow only polynomially with $1/\epsilon$, $1/\delta$, the size of the instances, and the size of the target concept.
• Within the setting of the PAC learning model, any consistent learner using a finite hypothesis space H where $C \subseteq H$ will, with probability $(1 - \delta)$, output a hypothesis within error $\epsilon$ of the target concept, after observing m randomly drawn training examples, as long as

$$m \geq \frac{1}{\epsilon}\left(\ln(1/\delta) + \ln |H|\right)$$

This gives a bound on the number of training examples sufficient for successful learning under the PAC model.
• One constraining assumption of the PAC learning model is that the learner knows in advance some restricted concept class C that contains the target concept to be learned. In contrast, the agnostic learning model considers the more general setting in which the learner makes no assumption about the class from which the target concept is drawn.
Instead, the learner outputs the hypothesis from H that has the least error (possibly nonzero) over the training data. Under this less restrictive agnostic learning model, the learner is assured with probability $(1 - \delta)$ to output a hypothesis within error $\epsilon$ of the best possible hypothesis in H, after observing m randomly drawn training examples, provided

$$m \geq \frac{1}{2\epsilon^2}\left(\ln(1/\delta) + \ln |H|\right)$$

• The number of training examples required for successful learning is strongly influenced by the complexity of the hypothesis space considered by the learner. One useful measure of the complexity of a hypothesis space H is its Vapnik-Chervonenkis dimension, VC(H). VC(H) is the size of the largest subset of instances that can be shattered (split in all possible ways) by H.
• An alternative upper bound on the number of training examples sufficient for successful learning under the PAC model, stated in terms of VC(H), is

$$m \geq \frac{1}{\epsilon}\left(4 \log_2(2/\delta) + 8\,VC(H) \log_2(13/\epsilon)\right)$$

A lower bound is

$$m \geq \max\left[\frac{1}{\epsilon} \log(1/\delta),\ \frac{VC(C) - 1}{32\epsilon}\right]$$

• An alternative learning model, called the mistake bound model, is used to analyze the number of training examples a learner will misclassify before it exactly learns the target concept. For example, the HALVING algorithm will make at most $\lfloor \log_2 |H| \rfloor$ mistakes before exactly learning any target concept drawn from H. For an arbitrary concept class C, the best worst-case algorithm will make Opt(C) mistakes, where

$$VC(C) \leq Opt(C) \leq \log_2(|C|)$$

• The WEIGHTED-MAJORITY algorithm combines the weighted votes of multiple prediction algorithms to classify new instances. It learns weights for each of these prediction algorithms based on errors made over a sequence of examples. Interestingly, the number of mistakes made by WEIGHTED-MAJORITY can be bounded in terms of the number of mistakes made by the best prediction algorithm in the pool.

Much early work on computational learning theory dealt with the question of whether the learner could identify the target concept in the limit, given an indefinitely long sequence of training examples.
The identification in the limit model was introduced by Gold (1967). A good overview of results in this area is (Angluin 1992). Vapnik (1982) examines in detail the problem of uniform convergence, and the closely related PAC-learning model was introduced by Valiant (1984). The discussion in this chapter of $\epsilon$-exhausting the version space is based on Haussler's (1988) exposition. A useful collection of results under the PAC model can be found in Blumer et al. (1989). Kearns and Vazirani (1994) provide an excellent exposition of many results from computational learning theory. Earlier texts in this area include Anthony and Biggs (1992) and Natarajan (1991).

Current research on computational learning theory covers a broad range of learning models and learning algorithms. Much of this research can be found in the proceedings of the annual conference on Computational Learning Theory (COLT). Several special issues of the journal Machine Learning have also been devoted to this topic.

EXERCISES

7.1. Consider training a two-input perceptron. Give an upper bound on the number of training examples sufficient to assure with 90% confidence that the learned perceptron will have true error of at most 5%. Does this bound seem realistic?

7.2. Consider the class $C$ of concepts of the form $(a \le x \le b) \wedge (c \le y \le d)$, where $a$, $b$, $c$, and $d$ are integers in the interval (0, 99). Note each concept in this class corresponds to a rectangle with integer-valued boundaries on a portion of the $x, y$ plane. Hint: Given a region in the plane bounded by the points $(0, 0)$ and $(n - 1, n - 1)$, the number of distinct rectangles with integer-valued boundaries within this region is $\left(\frac{n(n+1)}{2}\right)^{2}$.

(a) Give an upper bound on the number of randomly drawn training examples sufficient to assure that for any target concept $c$ in $C$, any consistent learner using $H = C$ will, with probability 95%, output a hypothesis with error at most .15.
(b) Now suppose the rectangle boundaries $a$, $b$, $c$, and $d$ take on real values instead of integer values. Update your answer to the first part of this question.

7.3. In this chapter we derived an expression for the number of training examples sufficient to ensure that every hypothesis will have true error no worse than $\epsilon$ plus its observed training error $error_D(h)$. In particular, we used Hoeffding bounds to derive Equation (7.3). Derive an alternative expression for the number of training examples sufficient to ensure that every hypothesis will have true error no worse than $(1 + \gamma)\,error_D(h)$. You can use the general Chernoff bounds to derive such a result.

Chernoff bounds: Suppose $X_1, \ldots, X_m$ are the outcomes of $m$ independent coin flips (Bernoulli trials), where the probability of heads on any single trial is $\Pr[X_i = 1] = p$ and the probability of tails is $\Pr[X_i = 0] = 1 - p$. Define $S = X_1 + X_2 + \cdots + X_m$ to be the sum of the outcomes of these $m$ trials. The expected value of $S/m$ is $E[S/m] = p$. The Chernoff bounds govern the probability that $S/m$ will differ from $p$ by some factor $0 \le \gamma \le 1$:

$$\Pr[S/m > (1 + \gamma)p] \le e^{-mp\gamma^{2}/3}$$

$$\Pr[S/m < (1 - \gamma)p] \le e^{-mp\gamma^{2}/2}$$

7.4. Consider a learning problem in which $X = \mathbb{R}$ is the set of real numbers, and $C = H$ is the set of intervals over the reals, $H = \{(a < x < b) \mid a, b \in \mathbb{R}\}$. What is the probability that a hypothesis consistent with $m$ examples of this target concept will have error at least $\epsilon$? Solve this using the VC dimension. Can you find a second way to solve this, based on first principles and ignoring the VC dimension?

7.5. Consider the space of instances $X$ corresponding to all points in the $x, y$ plane. Give the VC dimension of the following hypothesis spaces:

(a) $H_r$ = the set of all rectangles in the $x, y$ plane. That is, H = {((a < x < b) ∧ (