
# Classic Machine Learning Tutorials: The Top 10 Machine Learning Algorithms

• Date: 2018-11-07


## Chapter 1: C4.5

*Naren Ramakrishnan*

### Contents

- 1.1 Introduction
- 1.2 Algorithm Description
- 1.3 C4.5 Features
  - 1.3.1 Tree Pruning
  - 1.3.2 Improved Use of Continuous Attributes
  - 1.3.3 Handling Missing Values
  - 1.3.4 Inducing Rulesets
- 1.4 Discussion on Available Software Implementations
- 1.5 Two Illustrative Examples
  - 1.5.1 Golf Dataset
  - 1.5.2 Soybean Dataset
- 1.6 Advanced Topics
  - 1.6.1 Mining from Secondary Storage
  - 1.6.2 Oblique Decision Trees
  - 1.6.3 Feature Selection
  - 1.6.4 Ensemble Methods
  - 1.6.5 Classification Rules
  - 1.6.6 Redescriptions
- 1.7 Exercises
- References

### 1.1 Introduction

C4.5 [30] is a suite of algorithms for classification problems in machine learning and data mining. It is targeted at supervised learning: given an attribute-valued dataset where instances are described by collections of attributes and belong to one of a set of mutually exclusive classes, C4.5 learns a mapping from attribute values to classes that can be applied to classify new, unseen instances. For instance, see Figure 1.1, where rows denote specific days, attributes denote weather conditions on the given day, and the class denotes whether the conditions are conducive to playing golf. Thus, each row denotes an instance, described by values for attributes such as Outlook (a ternary-valued random variable), Temperature (continuous-valued), Humidity (also continuous-valued), and Windy (binary), and the class is the Boolean PlayGolf? class variable.

| Day | Outlook  | Temperature | Humidity | Windy | Play Golf? |
|-----|----------|-------------|----------|-------|------------|
| 1   | Sunny    | 85          | 85       | False | No         |
| 2   | Sunny    | 80          | 90       | True  | No         |
| 3   | Overcast | 83          | 78       | False | Yes        |
| 4   | Rainy    | 70          | 96       | False | Yes        |
| 5   | Rainy    | 68          | 80       | False | Yes        |
| 6   | Rainy    | 65          | 70       | True  | No         |
| 7   | Overcast | 64          | 65       | True  | Yes        |
| 8   | Sunny    | 72          | 95       | False | No         |
| 9   | Sunny    | 69          | 70       | False | Yes        |
| 10  | Rainy    | 75          | 80       | False | Yes        |
| 11  | Sunny    | 75          | 70       | True  | Yes        |
| 12  | Overcast | 72          | 90       | True  | Yes        |
| 13  | Overcast | 81          | 75       | False | Yes        |
| 14  | Rainy    | 71          | 80       | True  | No         |

*Figure 1.1 Example dataset input to C4.5.*
All of the data in Figure 1.1 constitutes "training data": the intent is to learn a mapping from this dataset and apply it to other, new instances that present values only for the attributes, in order to predict the value of the class random variable. C4.5, designed by J. Ross Quinlan, is so named because it is a descendant of the ID3 approach to inducing decision trees [25], which in turn is the third incarnation in a series of "iterative dichotomizers." A decision tree is a series of questions systematically arranged so that each question queries an attribute (e.g., Outlook) and branches based on the value of the attribute. At the leaves of the tree are placed predictions of the class variable (here, PlayGolf?). A decision tree is hence not unlike the series of troubleshooting questions you might find in your car's manual to help determine what could be wrong with the vehicle. In addition to inducing trees, C4.5 can also restate its trees in comprehensible rule form. Further, the rule postpruning operations supported by C4.5 typically result in classifiers that cannot quite be restated as a decision tree.

The historical lineage of C4.5 offers an interesting study in how different subcommunities converged on more or less like-minded solutions to classification. ID3 was developed independently of the original tree induction algorithm developed by Friedman [13], which later evolved into CART [4] with the participation of Breiman, Olshen, and Stone. But, judging from the numerous references to CART in [30], the design decisions underlying C4.5 appear to have been influenced by (and to improve upon) how CART resolved similar issues, such as procedures for handling special types of attributes. (For this reason, due to the overlap in scope, we will aim to minimize overlap with the material covered in the CART chapter, Chapter 10, and point out key differences at appropriate junctures.)
In [25] and [36], Quinlan also acknowledged the influence of the CLS (Concept Learning System [16]) framework on the historical development of ID3 and C4.5. Today, C4.5 is superseded by the See5/C5.0 system, a commercial product offered by Rulequest Research, Inc.

The fact that two of the top 10 algorithms are tree-based attests to the widespread popularity of such methods in data mining. Original applications of decision trees were in domains with nominal-valued or categorical data, but today they span a multitude of domains with numeric, symbolic, and mixed-type attributes. Examples include clinical decision making, manufacturing, document analysis, bioinformatics, spatial data modeling (geographic information systems), and practically any domain where decision boundaries between classes can be captured by tree-like decompositions, or by regions identified by rules.

### 1.2 Algorithm Description

C4.5 is not one algorithm but rather a suite of algorithms (C4.5, C4.5-no-pruning, and C4.5-rules) with many features. We present the basic C4.5 algorithm first and the special features later. The generic description of how C4.5 works is shown in Algorithm 1.1. All tree induction methods begin with a root node that represents the entire given dataset and recursively split the data into smaller subsets by testing for a given attribute at each node. The subtrees denote the partitions of the original dataset that satisfy specified attribute value tests. This process typically continues until the subsets are "pure," that is, all instances in the subset fall in the same class, at which time the tree growing is terminated.
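The recursion just described can be sketched in Python. This is a minimal illustration under simplifying assumptions, not Quinlan's implementation: it handles nominal attributes only, uses plain information gain rather than C4.5's default gain ratio, and does no pruning; the helper names are invented for this sketch.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of the class distribution: sum over classes of -p_i * log2(p_i)."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr):
    """Information gain from splitting (features, label) rows on attr."""
    labels = [lab for _, lab in rows]
    remainder = 0.0
    for v in {feats[attr] for feats, _ in rows}:
        subset = [lab for feats, lab in rows if feats[attr] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

def c45_sketch(rows, attrs):
    """Recursive tree induction: stop on 'pure' subsets, else split greedily."""
    labels = [lab for _, lab in rows]
    if len(set(labels)) == 1 or not attrs:            # pure, or nothing left to test
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    best = max(attrs, key=lambda a: gain(rows, a))    # information-theoretic choice
    branches = {}
    for v in {feats[best] for feats, _ in rows}:      # one subtree per attribute value
        subset = [(feats, lab) for feats, lab in rows if feats[best] == v]
        branches[v] = c45_sketch(subset, [a for a in attrs if a != best])
    return (best, branches)

# The golf data of Figure 1.1, restricted to its nominal attributes.
golf = [
    ({"Outlook": "Sunny", "Windy": False}, "No"),
    ({"Outlook": "Sunny", "Windy": True}, "No"),
    ({"Outlook": "Overcast", "Windy": False}, "Yes"),
    ({"Outlook": "Rainy", "Windy": False}, "Yes"),
    ({"Outlook": "Rainy", "Windy": False}, "Yes"),
    ({"Outlook": "Rainy", "Windy": True}, "No"),
    ({"Outlook": "Overcast", "Windy": True}, "Yes"),
    ({"Outlook": "Sunny", "Windy": False}, "No"),
    ({"Outlook": "Sunny", "Windy": False}, "Yes"),
    ({"Outlook": "Rainy", "Windy": False}, "Yes"),
    ({"Outlook": "Sunny", "Windy": True}, "Yes"),
    ({"Outlook": "Overcast", "Windy": True}, "Yes"),
    ({"Outlook": "Overcast", "Windy": False}, "Yes"),
    ({"Outlook": "Rainy", "Windy": True}, "No"),
]
tree = c45_sketch(golf, ["Outlook", "Windy"])
```

On this data the sketch places Outlook at the root and labels the Overcast branch Yes, consistent with the tree C4.5 induces; the Sunny branch differs because Humidity, a continuous attribute, is omitted here.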
**Algorithm 1.1** C4.5(D)

```
Input: an attribute-valued dataset D
 1: Tree = {}
 2: if D is "pure" OR other stopping criteria met then
 3:     terminate
 4: end if
 5: for all attribute a ∈ D do
 6:     compute information-theoretic criteria if we split on a
 7: end for
 8: a_best = best attribute according to the above computed criteria
 9: Tree = create a decision node that tests a_best in the root
10: D_v = induced sub-datasets from D based on a_best
11: for all D_v do
12:     Tree_v = C4.5(D_v)
13:     attach Tree_v to the corresponding branch of Tree
14: end for
15: return Tree
```

```
Outlook
├── Sunny → Humidity
│             ├── ≤ 75 → Yes
│             └── > 75 → No
├── Overcast → Yes
└── Rainy → Windy
              ├── True  → No
              └── False → Yes
```

*Figure 1.2 Decision tree induced by C4.5 for the dataset of Figure 1.1.*

Figure 1.1 presents the classical "golf" dataset, which is bundled with the C4.5 installation. As stated earlier, the goal is to predict whether the weather conditions on a particular day are conducive to playing golf. Recall that some of the features are continuous-valued while others are categorical. Figure 1.2 illustrates the tree induced by C4.5 using Figure 1.1 as training data (and the default options). Let us look at the various choices involved in inducing such trees from the data.

- **What types of tests are possible?** As Figure 1.2 shows, C4.5 is not restricted to binary tests and allows tests with two or more outcomes. If the attribute is Boolean, the test induces two branches. If the attribute is categorical, the test is multivalued, but different values can be grouped into a smaller set of options, with one class predicted for each option. If the attribute is numerical, the tests are again binary-valued and of the form {≤ θ?, > θ?}, where θ is a suitably determined threshold for that attribute.
- **How are tests chosen?** C4.5 uses information-theoretic criteria such as gain (the reduction in entropy of the class distribution due to applying a test) and gain ratio (a way to correct for the tendency of gain to favor tests with many outcomes).
  The default criterion is gain ratio: at each point in the tree-growing, the test with the best criterion value is greedily chosen.
- **How are test thresholds chosen?** As stated earlier, for Boolean and categorical attributes the test values are simply the different possible instantiations of that attribute. For numerical attributes, the threshold is obtained by sorting on that attribute and choosing the split between successive values that maximizes the criterion above. Fayyad and Irani [10] showed that not all successive values need to be considered: for two successive values v_i and v_{i+1} of a continuous-valued attribute, if all instances involving v_i and all instances involving v_{i+1} belong to the same class, then splitting between them cannot possibly improve information gain (or gain ratio).
- **How is tree-growing terminated?** A branch from a node is declared to lead to a leaf if all instances covered by that branch are pure. Tree-growing is also terminated if the number of instances falls below a specified threshold.
- **How are class labels assigned to the leaves?** The majority class of the instances assigned to the leaf is taken to be the class prediction of that subbranch of the tree.

The above questions are faced by any classification approach modeled after trees, and similar, or other reasonable, decisions are made by most tree induction algorithms. The practical utility of C4.5, however, comes from the next set of features that build upon the basic tree induction algorithm above. But before we present these features, it is instructive to instantiate Algorithm 1.1 for a simple dataset such as the one shown in Figure 1.1. We will work out in some detail how the tree of Figure 1.2 is induced from Figure 1.1. Observe that the first attribute chosen for a decision test is the Outlook attribute. To see why, let us first estimate the entropy of the class random variable (PlayGolf?).
This variable takes two values, with probability 9/14 (for "Yes") and 5/14 (for "No"). The entropy of a class random variable that takes on c values with probabilities p_1, p_2, ..., p_c is given by:

$$\sum_{i=1}^{c} -p_i \log_2 p_i$$

The entropy of PlayGolf? is thus −(9/14) log₂(9/14) − (5/14) log₂(5/14), or 0.940. This means that on average 0.940 bits must be transmitted to communicate information about the PlayGolf? random variable. The goal of C4.5 tree induction is to ask the right questions so that this entropy is reduced. We consider each attribute in turn to assess the improvement in entropy that it affords. For a given attribute, say Outlook, the improvement in entropy, denoted Gain(Outlook), is calculated as:

$$\text{Gain(Outlook)} = \text{Entropy}(\text{PlayGolf? in } D) - \sum_{v} \frac{|D_v|}{|D|}\,\text{Entropy}(\text{PlayGolf? in } D_v)$$

where v ranges over the possible values of Outlook (here, three), D denotes the entire dataset, D_v is the subset of the dataset for which Outlook has value v, and | · | denotes the size of a dataset (in number of instances). This calculation shows that Gain(Outlook) is 0.940 − 0.694 = 0.246. Similarly, we can calculate that Gain(Windy) is 0.940 − 0.892 = 0.048. Working out the above calculations for the other attributes systematically will reveal that Outlook is indeed the best attribute to split on.
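The entropy and gain arithmetic above can be checked with a short script (a sketch; the Outlook, Windy, and PlayGolf? columns are taken from Figure 1.1):

```python
from collections import Counter
from math import log2

# Columns from Figure 1.1, one tuple per day: (Outlook, Windy, PlayGolf?)
rows = [
    ("Sunny", False, "No"), ("Sunny", True, "No"), ("Overcast", False, "Yes"),
    ("Rainy", False, "Yes"), ("Rainy", False, "Yes"), ("Rainy", True, "No"),
    ("Overcast", True, "Yes"), ("Sunny", False, "No"), ("Sunny", False, "Yes"),
    ("Rainy", False, "Yes"), ("Sunny", True, "Yes"), ("Overcast", True, "Yes"),
    ("Overcast", False, "Yes"), ("Rainy", True, "No"),
]

def entropy(labels):
    """sum over classes of -p_i * log2(p_i)."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(col):
    """Entropy(PlayGolf?) minus the size-weighted entropy of each split subset."""
    labels = [r[-1] for r in rows]
    split = 0.0
    for v in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == v]
        split += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - split

print(round(entropy([r[-1] for r in rows]), 3))  # → 0.94
print(round(gain(0), 3))  # Gain(Outlook) → 0.247 (0.246 in the text, which rounds intermediates)
print(round(gain(1), 3))  # Gain(Windy)   → 0.048
```

Since Gain(Outlook) exceeds the gains of the other attributes, Outlook wins the greedy choice at the root, as in Figure 1.2.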
