A Tutorial on ν-Support Vector Machines

Pai-Hsuen Chen¹, Chih-Jen Lin¹, and Bernhard Schölkopf²

¹ Department of Computer Science and Information Engineering,
  National Taiwan University, Taipei 106, Taiwan
² Max Planck Institute for Biological Cybernetics, Tübingen, Germany
  bernhard.schoelkopf@tuebingen.mpg.de

Parts of the present article are based on [31].
Abstract. We briefly describe the main ideas of statistical learning theory, support vector machines (SVMs), and kernel feature spaces. We place particular emphasis on a description of the so-called ν-SVM, including details of the algorithm and its implementation, theoretical results, and practical applications.
1 An Introductory Example
Suppose we are given empirical data

$$ (x_1, y_1), \ldots, (x_m, y_m) \in X \times \{\pm 1\}. \tag{1} $$
Here, the domain X is some nonempty set that the patterns x_i are taken from; the y_i are called labels or targets. Unless stated otherwise, indices i and j will always be understood to run over the training set, i.e., i, j = 1, . . . , m.
Note that we have not made any assumptions on the domain X other than it being a set. In order to study the problem of learning, we need additional structure. In learning, we want to be able to generalize to unseen data points. In the case of pattern recognition, this means that given some new pattern x ∈ X, we want to predict the corresponding y ∈ {±1}. By this we mean, loosely speaking, that we choose y such that (x, y) is in some sense similar to the training examples. To this end, we need similarity measures in X and in {±1}. The latter is easy, as two target values can only be identical or different.
For the former, we require a similarity measure

$$ k : X \times X \to \mathbb{R}, \qquad (x, x') \mapsto k(x, x'), \tag{2} $$

i.e., a function that, given two examples x and x', returns a real number characterizing their similarity. For reasons that will become clear later, the function k is called a kernel ([24], [1], [8]).
A type of similarity measure that is of particular mathematical appeal is the dot product. For instance, given two vectors x, x' ∈ R^N, the canonical dot product is defined as

$$ (x \cdot x') := \sum_{i=1}^{N} (x)_i (x')_i. \tag{3} $$
Here, (x)_i denotes the ith entry of x.

The geometrical interpretation of this dot product is that it computes the cosine of the angle between the vectors x and x', provided they are normalized to length 1. Moreover, it allows computation of the length of a vector x as √(x · x), and of the distance between two vectors as the length of the difference vector. Therefore, being able to compute dot products amounts to being able to carry out all geometrical constructions that can be formulated in terms of angles, lengths and distances.
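For readers who like to see these quantities in code, here is a minimal NumPy sketch (our own illustration, not part of the original tutorial; the vectors are arbitrary example values) evaluating the dot product (3), a length, a distance, and the cosine of the enclosed angle.

```python
import numpy as np

# Two arbitrary example vectors in R^N (here N = 3).
x = np.array([1.0, 2.0, 0.5])
x_prime = np.array([0.5, -1.0, 2.0])

dot = np.dot(x, x_prime)                  # canonical dot product, eq. (3)
length_x = np.sqrt(np.dot(x, x))          # length of x, i.e., sqrt(x . x)
distance = np.linalg.norm(x - x_prime)    # length of the difference vector
cosine = dot / (np.linalg.norm(x) * np.linalg.norm(x_prime))  # cosine of the enclosed angle

print(dot, length_x, distance, cosine)
```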
Note, however, that we have not made the assumption that the patterns live in a dot product space. In order to be able to use a dot product as a similarity measure, we therefore first need to transform them into some dot product space H, which need not be identical to R^N. To this end, we use a map

$$ \Phi : X \to H, \qquad x \mapsto \Phi(x). \tag{4} $$

The space H is called a feature space.
To summarize, there are three benefits to transforming the data into H:

1. It lets us define a similarity measure from the dot product in H,
   $$ k(x, x') := (\Phi(x) \cdot \Phi(x')). \tag{5} $$
   (A small numerical check of this identity follows the list.)
2. It allows us to deal with the patterns geometrically, and thus lets us study learning algorithms using linear algebra and analytic geometry.
3. The freedom to choose the mapping Φ will enable us to design a large variety of learning algorithms. For instance, consider a situation where the inputs already live in a dot product space. In that case, we could directly define a similarity measure as the dot product. However, we might still choose to first apply a nonlinear map Φ to change the representation into one that is more suitable for a given problem and learning algorithm.
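To make identity (5) concrete, the following sketch (our illustration, not from the tutorial) uses the homogeneous degree-2 polynomial feature map on R^2, for which the kernel can be computed directly in the input space as k(x, x') = (x · x')², and checks numerically that this equals the dot product of the mapped points in H.

```python
import numpy as np

def phi(x):
    """Degree-2 polynomial feature map on R^2: (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k(x, x_prime):
    """Kernel computed directly in the input space: (x . x')^2."""
    return np.dot(x, x_prime) ** 2

x = np.array([1.0, 2.0])
x_prime = np.array([3.0, -1.0])

# Identity (5): the kernel value equals the dot product in the feature space H.
assert np.isclose(k(x, x_prime), np.dot(phi(x), phi(x_prime)))
```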
We are now in the position to describe a pattern recognition learning algorithm that is arguably one of the simplest possible. The basic idea is to compute the means of the two classes in feature space,

$$ c_+ = \frac{1}{m_+} \sum_{\{i : y_i = +1\}} x_i, \tag{6} $$

$$ c_- = \frac{1}{m_-} \sum_{\{i : y_i = -1\}} x_i, \tag{7} $$
where m_+ and m_- are the number of examples with positive and negative labels, respectively (see Figure 1). We then assign a new point x to the class whose mean is closer to it.
Fig. 1. A simple geometric classification algorithm: given two classes of points (depicted by ‘o’ and ‘+’), compute their means c_+, c_- and assign a test pattern x to the one whose mean is closer. This can be done by looking at the dot product between x − c (where c = (c_+ + c_-)/2) and w := c_+ − c_-, which changes sign as the enclosed angle passes through π/2. Note that the corresponding decision boundary is a hyperplane (the dotted line) orthogonal to w (from [31]).
This geometrical construction can be formulated in terms of dot products. Half-way in between c_+ and c_- lies the point c := (c_+ + c_-)/2. We compute the class of x by checking whether the vector connecting c and x encloses an angle smaller than π/2 with the vector w := c_+ − c_- connecting the class means, in other words

$$ \begin{aligned} y &= \operatorname{sgn}\bigl((x - c) \cdot w\bigr) \\ &= \operatorname{sgn}\bigl((x - (c_+ + c_-)/2) \cdot (c_+ - c_-)\bigr) \\ &= \operatorname{sgn}\bigl((x \cdot c_+) - (x \cdot c_-) + b\bigr). \end{aligned} \tag{8} $$

Here, we have defined the offset

$$ b := \tfrac{1}{2}\bigl(\|c_-\|^2 - \|c_+\|^2\bigr). \tag{9} $$
It will prove instructive to rewrite this expression in terms of the patterns x_i in the input domain X. To this end, note that we do not have a dot product in X; all we have is the similarity measure k (cf. (5)). Therefore, we need to rewrite everything in terms of the kernel k evaluated on input patterns. To this end, substitute (6) and (7) into (8) to get the decision function

$$ \begin{aligned} y &= \operatorname{sgn}\Biggl(\frac{1}{m_+} \sum_{\{i : y_i = +1\}} (x \cdot x_i) - \frac{1}{m_-} \sum_{\{i : y_i = -1\}} (x \cdot x_i) + b\Biggr) \\ &= \operatorname{sgn}\Biggl(\frac{1}{m_+} \sum_{\{i : y_i = +1\}} k(x, x_i) - \frac{1}{m_-} \sum_{\{i : y_i = -1\}} k(x, x_i) + b\Biggr). \end{aligned} \tag{10} $$
Similarly, the offset becomes

$$ b := \frac{1}{2}\Biggl(\frac{1}{m_-^2} \sum_{\{(i,j) : y_i = y_j = -1\}} k(x_i, x_j) - \frac{1}{m_+^2} \sum_{\{(i,j) : y_i = y_j = +1\}} k(x_i, x_j)\Biggr). \tag{11} $$
Let us consider one well-known special case of this type of classifier. Assume that the class means have the same distance to the origin (hence b = 0), and that k can be viewed as a density, i.e., it is positive and has integral 1,

$$ \int_X k(x, x') \, dx = 1 \qquad \text{for all } x' \in X. \tag{12} $$

In order to state this assumption, we have to require that we can define an integral on X.

If the above holds true, then (10) corresponds to the so-called Bayes decision boundary separating the two classes, subject to the assumption that the two classes were generated from two probability distributions that are correctly estimated by the Parzen windows estimators of the two classes,

$$ p_1(x) := \frac{1}{m_+} \sum_{\{i : y_i = +1\}} k(x, x_i), \tag{13} $$

$$ p_2(x) := \frac{1}{m_-} \sum_{\{i : y_i = -1\}} k(x, x_i). \tag{14} $$
Given some point x, the label is then simply computed by checking which of the two, p_1(x) or p_2(x), is larger, which directly leads to (10). Note that this decision is the best we can do if we have no prior information about the probabilities of the two classes. For further details, see [31].
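The Parzen window view (13)-(14) can be illustrated in the same hedged spirit; the normalized Gaussian used below is one particular kernel satisfying the density condition (12), and the toy data are the same arbitrary values as in the previous sketch.

```python
import numpy as np

def gaussian_density_kernel(x, x_prime, sigma=1.0):
    """Normalized Gaussian on R^2, so that k integrates to 1 as required by (12)."""
    d = x - x_prime
    return np.exp(-np.dot(d, d) / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)

def parzen_label(X_train, y_train, x):
    """Pick the class whose Parzen window estimate, eq. (13) or (14), is larger."""
    p1 = np.mean([gaussian_density_kernel(x, xi) for xi in X_train[y_train == +1]])
    p2 = np.mean([gaussian_density_kernel(x, xi) for xi in X_train[y_train == -1]])
    return +1 if p1 > p2 else -1

X_train = np.array([[1.0, 1.0], [1.5, 0.8], [-1.0, -1.2], [-0.8, -1.0]])
y_train = np.array([+1, +1, -1, -1])
print(parzen_label(X_train, y_train, np.array([0.9, 1.1])))  # expected output: 1
```

When b = 0, comparing p_1(x) and p_2(x) in this way gives the same label as the decision function (10).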
The classifier (10) is quite close to the types of learning machines that we will be interested in. It is linear in the feature space, while in the input domain it is represented by a kernel expansion in terms of the training points. It is example-based in the sense that the kernels are centered on the training examples, i.e., one of the two arguments of the kernels is always a training example. The main points where the more sophisticated techniques to be discussed later deviate from (10) are in the selection of the examples that the kernels are centered on, and in the weights that are put on the individual data in the decision function. Namely, it will no longer be the case that all training examples appear in the kernel expansion, and the weights of the kernels in the expansion will no longer be uniform. In the feature space representation, this statement corresponds to saying that we will study all normal vectors w of decision hyperplanes that can be represented as linear combinations of the training examples. For instance, we might want to remove the influence of patterns that are very far away from the decision boundary, either since we expect that they will not improve the generalization error of the decision function, or since we would like to reduce the computational cost of evaluating the decision function (cf. (10)). The hyperplane will then only depend on a subset of training examples, called support vectors.
2 Learning Pattern Recognition from Examples
With the above example in mind, let us now consider the problem of pattern recognition in a more formal setting ([37], [38]), following the introduction of [30]. In two-class pattern recognition, we seek to estimate a function

$$ f : X \to \{\pm 1\} \tag{15} $$

based on input-output training data (1). We assume that the data were generated independently from some unknown (but fixed) probability distribution P(x, y). Our goal is to learn a function that will correctly classify unseen examples (x, y), i.e., we want f(x) = y for examples (x, y) that were also generated from P(x, y).
If we put no restriction on the class of functions that we choose our estimate f from, however, even a function which does well on the training data, e.g. by satisfying f(x_i) = y_i for all i = 1, . . . , m, need not generalize well to unseen examples. To see this, note that for each function f and any test set (x̄_1, ȳ_1), . . . , (x̄_m̄, ȳ_m̄) ∈ R^N × {±1} satisfying {x̄_1, . . . , x̄_m̄} ∩ {x_1, . . . , x_m} = {}, there exists another function f* such that f*(x_i) = f(x_i) for all i = 1, . . . , m, yet f*(x̄_i) ≠ f(x̄_i) for all i = 1, . . . , m̄.
As we are only given the training data, we have no means of selecting which of the two functions (and hence which of the completely different sets of test label predictions) is preferable. Hence, only minimizing the training error (or empirical risk),

$$ R_{\mathrm{emp}}[f] = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2}\,\bigl|f(x_i) - y_i\bigr|, \tag{16} $$

does not imply a small test error (called risk), averaged over test examples drawn from the underlying distribution P(x, y),

$$ R[f] = \int \frac{1}{2}\,\bigl|f(x) - y\bigr| \, dP(x, y). \tag{17} $$
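Since f(x_i) and y_i both take values in {±1}, the term ½|f(x_i) − y_i| in (16) equals 1 for a misclassified example and 0 otherwise, so the empirical risk is simply the fraction of training errors. A minimal sketch, with a toy classifier and data of our own choosing:

```python
import numpy as np

def empirical_risk(f, X, y):
    """Empirical risk, eq. (16): mean of 0.5 * |f(x_i) - y_i| over the training set."""
    predictions = np.array([f(x) for x in X])
    return np.mean(0.5 * np.abs(predictions - y))

# Toy classifier that always predicts +1 (illustrative only).
X = np.array([[1.0, 1.0], [-1.0, -1.0], [2.0, 0.5], [-0.5, -2.0]])
y = np.array([+1, -1, +1, -1])
print(empirical_risk(lambda x: +1, X, y))  # 0.5: half of the examples are misclassified
```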
Statistical learning theory ([41], [37], [38], [39]), or VC (Vapnik-Chervonenkis) theory, shows that it is imperative to restrict the class of functions that f is chosen from to one which has a capacity that is suitable for the amount of available training data. VC theory provides bounds on the test error. The minimization of these bounds, which depend on both the empirical risk and the capacity of the function class, leads to the principle of structural risk minimization ([37]). The best-known capacity concept of VC theory is the VC dimension, defined as the largest number h of points that can be separated in all possible ways using functions of the given class. An example of a VC bound is the following: if h < m is the VC dimension of the class of functions that the learning machine can implement, then for all functions of that class, with a probability of at least 1 − η, the bound

$$ R(f) \le R_{\mathrm{emp}}(f) + \phi\!\left(\frac{h}{m}, \frac{\log(\eta)}{m}\right) \tag{18} $$

holds, where the confidence term φ is defined as

$$ \phi\!\left(\frac{h}{m}, \frac{\log(\eta)}{m}\right) = \sqrt{\frac{h\left(\log\frac{2m}{h} + 1\right) - \log(\eta/4)}{m}}. \tag{19} $$
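To get a feeling for the confidence term (19), the short sketch below evaluates it for some arbitrary illustrative values of the VC dimension h, the sample size m, and η; the numbers are ours, not taken from the tutorial.

```python
import numpy as np

def confidence_term(h, m, eta):
    """Confidence term phi of the VC bound, eq. (19)."""
    return np.sqrt((h * (np.log(2 * m / h) + 1) - np.log(eta / 4)) / m)

# Arbitrary illustrative values: VC dimension h, sample size m, probability 1 - eta.
h, m, eta = 10, 1000, 0.05
print(confidence_term(h, m, eta))  # bound on R(f) - R_emp(f) holding with probability >= 0.95, cf. (18)
```

For fixed h the term shrinks as m grows, and for fixed m it grows with h, which is the quantitative form of the capacity versus sample-size trade-off discussed above.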