A Tutorial on ν-Support Vector Machines

Pai-Hsuen Chen¹, Chih-Jen Lin¹, and Bernhard Schölkopf²

¹ Department of Computer Science and Information Engineering,
  National Taiwan University, Taipei 106, Taiwan
² Max Planck Institute for Biological Cybernetics, Tübingen, Germany
  bernhard.schoelkopf@tuebingen.mpg.de

Parts of the present article are based on [31].
Abstract. We briefly describe the main ideas of statistical learning theory, support vector machines (SVMs), and kernel feature spaces. We place particular emphasis on a description of the so-called ν-SVM, including details of the algorithm and its implementation, theoretical results, and practical applications.
1 An Introductory Example
Suppose we are given empirical data

$$ (x_1, y_1), \ldots, (x_m, y_m) \in X \times \{\pm 1\}. \tag{1} $$
Here, the domain X is some nonempty set that the patterns x_i are taken from; the y_i are called labels or targets. Unless stated otherwise, indices i and j will always be understood to run over the training set, i.e., i, j = 1, . . . , m.
Note that we have not made any assumptions on the domain X other than it being a set. In order to study the problem of learning, we need additional structure. In learning, we want to be able to generalize to unseen data points. In the case of pattern recognition, this means that given some new pattern x ∈ X, we want to predict the corresponding y ∈ {±1}. By this we mean, loosely speaking, that we choose y such that (x, y) is in some sense similar to the training examples. To this end, we need similarity measures in X and in {±1}. The latter is easy, as two target values can only be identical or different.
For the former, we require a similarity measure

$$ k : X \times X \to \mathbb{R}, \qquad (x, x') \mapsto k(x, x'), \tag{2} $$

i.e., a function that, given two examples x and x', returns a real number characterizing their similarity. For reasons that will become clear later, the function k is called a kernel ([24], [1], [8]).
A type of similarity measure that is of particular mathematical appeal is the dot product. For instance, given two vectors x, x' ∈ R^N, the canonical dot product is defined as

$$ (x \cdot x') := \sum_{i=1}^{N} (x)_i (x')_i. \tag{3} $$
Here, (x)_i denotes the ith entry of x.

The geometrical interpretation of this dot product is that it computes the cosine of the angle between the vectors x and x', provided they are normalized to length 1. Moreover, it allows computation of the length of a vector x as √(x · x), and of the distance between two vectors as the length of the difference vector. Therefore, being able to compute dot products amounts to being able to carry out all geometrical constructions that can be formulated in terms of angles, lengths and distances.
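For readers who like to see these quantities in code, here is a minimal NumPy sketch (our own illustration, not part of the original tutorial; the vectors are arbitrary example values) evaluating the dot product (3), a length, a distance, and the cosine of the enclosed angle.

```python
import numpy as np

# Two arbitrary example vectors in R^N (here N = 3).
x = np.array([1.0, 2.0, 0.5])
x_prime = np.array([0.5, -1.0, 2.0])

dot = np.dot(x, x_prime)                  # canonical dot product, eq. (3)
length_x = np.sqrt(np.dot(x, x))          # length of x, i.e., sqrt(x . x)
distance = np.linalg.norm(x - x_prime)    # length of the difference vector
cosine = dot / (np.linalg.norm(x) * np.linalg.norm(x_prime))  # cosine of the enclosed angle

print(dot, length_x, distance, cosine)
```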
Note, however, that we have not made the assumption that the patterns live in a dot product space. In order to be able to use a dot product as a similarity measure, we therefore first need to transform them into some dot product space H, which need not be identical to R^N. To this end, we use a map

$$ \Phi : X \to H, \qquad x \mapsto \Phi(x). \tag{4} $$

The space H is called a feature space.
To summarize, there are three benefits to transforming the data into H:

1. It lets us define a similarity measure from the dot product in H,
   $$ k(x, x') := (\Phi(x) \cdot \Phi(x')). \tag{5} $$
   (A small numerical check of this identity follows the list.)
2. It allows us to deal with the patterns geometrically, and thus lets us study learning algorithms using linear algebra and analytic geometry.
3. The freedom to choose the mapping Φ will enable us to design a large variety of learning algorithms. For instance, consider a situation where the inputs already live in a dot product space. In that case, we could directly define a similarity measure as the dot product. However, we might still choose to first apply a nonlinear map Φ to change the representation into one that is more suitable for a given problem and learning algorithm.
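To make identity (5) concrete, the following sketch (our illustration, not from the tutorial) uses the homogeneous degree-2 polynomial feature map on R^2, for which the kernel can be computed directly in the input space as k(x, x') = (x · x')², and checks numerically that this equals the dot product of the mapped points in H.

```python
import numpy as np

def phi(x):
    """Degree-2 polynomial feature map on R^2: (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k(x, x_prime):
    """Kernel computed directly in the input space: (x . x')^2."""
    return np.dot(x, x_prime) ** 2

x = np.array([1.0, 2.0])
x_prime = np.array([3.0, -1.0])

# Identity (5): the kernel value equals the dot product in the feature space H.
assert np.isclose(k(x, x_prime), np.dot(phi(x), phi(x_prime)))
```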
We are now in the position to describe a pattern recognition learning algorithm that is arguably one of the simplest possible. The basic idea is to compute the means of the two classes in feature space,

$$ c_+ = \frac{1}{m_+} \sum_{\{i : y_i = +1\}} x_i, \tag{6} $$

$$ c_- = \frac{1}{m_-} \sum_{\{i : y_i = -1\}} x_i, \tag{7} $$
where m_+ and m_- are the number of examples with positive and negative labels, respectively (see Figure 1). We then assign a new point x to the class whose mean is closer to it.
Fig. 1. A simple geometric classification algorithm: given two classes of points (depicted by ‘o’ and ‘+’), compute their means c_+, c_- and assign a test pattern x to the one whose mean is closer. This can be done by looking at the dot product between x − c (where c = (c_+ + c_-)/2) and w := c_+ − c_-, which changes sign as the enclosed angle passes through π/2. Note that the corresponding decision boundary is a hyperplane (the dotted line) orthogonal to w (from [31]).
This geometrical construction can be formulated in terms of dot products. Half-way in between c_+ and c_- lies the point c := (c_+ + c_-)/2. We compute the class of x by checking whether the vector connecting c and x encloses an angle smaller than π/2 with the vector w := c_+ − c_- connecting the class means, in other words

$$ \begin{aligned} y &= \operatorname{sgn}\bigl((x - c) \cdot w\bigr) \\ &= \operatorname{sgn}\bigl((x - (c_+ + c_-)/2) \cdot (c_+ - c_-)\bigr) \\ &= \operatorname{sgn}\bigl((x \cdot c_+) - (x \cdot c_-) + b\bigr). \end{aligned} \tag{8} $$

Here, we have defined the offset

$$ b := \tfrac{1}{2}\bigl(\|c_-\|^2 - \|c_+\|^2\bigr). \tag{9} $$
It will prove instructive to rewrite this expression in terms of the patterns x_i in the input domain X. To this end, note that we do not have a dot product in X; all we have is the similarity measure k (cf. (5)). Therefore, we need to rewrite everything in terms of the kernel k evaluated on input patterns. To this end, substitute (6) and (7) into (8) to get the decision function

$$ \begin{aligned} y &= \operatorname{sgn}\Biggl(\frac{1}{m_+} \sum_{\{i : y_i = +1\}} (x \cdot x_i) - \frac{1}{m_-} \sum_{\{i : y_i = -1\}} (x \cdot x_i) + b\Biggr) \\ &= \operatorname{sgn}\Biggl(\frac{1}{m_+} \sum_{\{i : y_i = +1\}} k(x, x_i) - \frac{1}{m_-} \sum_{\{i : y_i = -1\}} k(x, x_i) + b\Biggr). \end{aligned} \tag{10} $$
Similarly, the offset becomes

$$ b := \frac{1}{2}\Biggl(\frac{1}{m_-^2} \sum_{\{(i,j) : y_i = y_j = -1\}} k(x_i, x_j) - \frac{1}{m_+^2} \sum_{\{(i,j) : y_i = y_j = +1\}} k(x_i, x_j)\Biggr). \tag{11} $$
Let us consider one well-known special case of this type of classifier. Assume that the class means have the same distance to the origin (hence b = 0), and that k can be viewed as a density, i.e., it is positive and has integral 1,

$$ \int_X k(x, x') \, dx = 1 \qquad \text{for all } x' \in X. \tag{12} $$

In order to state this assumption, we have to require that we can define an integral on X.

If the above holds true, then (10) corresponds to the so-called Bayes decision boundary separating the two classes, subject to the assumption that the two classes were generated from two probability distributions that are correctly estimated by the Parzen windows estimators of the two classes,

$$ p_1(x) := \frac{1}{m_+} \sum_{\{i : y_i = +1\}} k(x, x_i), \tag{13} $$

$$ p_2(x) := \frac{1}{m_-} \sum_{\{i : y_i = -1\}} k(x, x_i). \tag{14} $$
Given some point x, the label is then simply computed by checking which of the two, p_1(x) or p_2(x), is larger, which directly leads to (10). Note that this decision is the best we can do if we have no prior information about the probabilities of the two classes. For further details, see [31].
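The Parzen window view (13)-(14) can be illustrated in the same hedged spirit; the normalized Gaussian used below is one particular kernel satisfying the density condition (12), and the toy data are the same arbitrary values as in the previous sketch.

```python
import numpy as np

def gaussian_density_kernel(x, x_prime, sigma=1.0):
    """Normalized Gaussian on R^2, so that k integrates to 1 as required by (12)."""
    d = x - x_prime
    return np.exp(-np.dot(d, d) / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)

def parzen_label(X_train, y_train, x):
    """Pick the class whose Parzen window estimate, eq. (13) or (14), is larger."""
    p1 = np.mean([gaussian_density_kernel(x, xi) for xi in X_train[y_train == +1]])
    p2 = np.mean([gaussian_density_kernel(x, xi) for xi in X_train[y_train == -1]])
    return +1 if p1 > p2 else -1

X_train = np.array([[1.0, 1.0], [1.5, 0.8], [-1.0, -1.2], [-0.8, -1.0]])
y_train = np.array([+1, +1, -1, -1])
print(parzen_label(X_train, y_train, np.array([0.9, 1.1])))  # expected output: 1
```

When b = 0, comparing p_1(x) and p_2(x) in this way gives the same label as the decision function (10).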
The classifier (10) is quite close to the types of learning machines that we will be interested in. It is linear in the feature space, while in the input domain it is represented by a kernel expansion in terms of the training points. It is example-based in the sense that the kernels are centered on the training examples, i.e., one of the two arguments of the kernels is always a training example. The main points where the more sophisticated techniques to be discussed later deviate from (10) are in the selection of the examples that the kernels are centered on, and in the weights that are put on the individual data in the decision function. Namely, it will no longer be the case that all training examples appear in the kernel expansion, and the weights of the kernels in the expansion will no longer be uniform. In the feature space representation, this statement corresponds to saying that we will study all normal vectors w of decision hyperplanes that can be represented as linear combinations of the training examples. For instance, we might want to remove the influence of patterns that are very far away from the decision boundary, either since we expect that they will not improve the generalization error of the decision function, or since we would like to reduce the computational cost of evaluating the decision function (cf. (10)). The hyperplane will then only depend on a subset of training examples, called support vectors.
2 Learning Pattern Recognition from Examples
With the above example in mind, let us now consider the problem of pattern recognition in a more formal setting ([37], [38]), following the introduction of [30]. In two-class pattern recognition, we seek to estimate a function

$$ f : X \to \{\pm 1\} \tag{15} $$

based on input-output training data (1). We assume that the data were generated independently from some unknown (but fixed) probability distribution P(x, y). Our goal is to learn a function that will correctly classify unseen examples (x, y), i.e., we want f(x) = y for examples (x, y) that were also generated from P(x, y).
If we put no restriction on the class of functions that we choose our estimate f from, however, even a function which does well on the training data, e.g. by satisfying f(x_i) = y_i for all i = 1, . . . , m, need not generalize well to unseen examples. To see this, note that for each function f and any test set (x̄_1, ȳ_1), . . . , (x̄_m̄, ȳ_m̄) ∈ R^N × {±1} satisfying {x̄_1, . . . , x̄_m̄} ∩ {x_1, . . . , x_m} = {}, there exists another function f* such that f*(x_i) = f(x_i) for all i = 1, . . . , m, yet f*(x̄_i) ≠ f(x̄_i) for all i = 1, . . . , m̄.
As we are only given the training data, we have no means of selecting which of the two functions (and hence which of the completely different sets of test label predictions) is preferable. Hence, only minimizing the training error (or empirical risk),

$$ R_{\mathrm{emp}}[f] = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2}\,\bigl|f(x_i) - y_i\bigr|, \tag{16} $$

does not imply a small test error (called risk), averaged over test examples drawn from the underlying distribution P(x, y),

$$ R[f] = \int \frac{1}{2}\,\bigl|f(x) - y\bigr| \, dP(x, y). \tag{17} $$
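Since f(x_i) and y_i both take values in {±1}, the term ½|f(x_i) − y_i| in (16) equals 1 for a misclassified example and 0 otherwise, so the empirical risk is simply the fraction of training errors. A minimal sketch, with a toy classifier and data of our own choosing:

```python
import numpy as np

def empirical_risk(f, X, y):
    """Empirical risk, eq. (16): mean of 0.5 * |f(x_i) - y_i| over the training set."""
    predictions = np.array([f(x) for x in X])
    return np.mean(0.5 * np.abs(predictions - y))

# Toy classifier that always predicts +1 (illustrative only).
X = np.array([[1.0, 1.0], [-1.0, -1.0], [2.0, 0.5], [-0.5, -2.0]])
y = np.array([+1, -1, +1, -1])
print(empirical_risk(lambda x: +1, X, y))  # 0.5: half of the examples are misclassified
```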
Statistical learning theory ([41], [37], [38], [39]), or VC (Vapnik-Chervonenkis) theory, shows that it is imperative to restrict the class of functions that f is chosen from to one which has a capacity that is suitable for the amount of available training data. VC theory provides bounds on the test error. The minimization of these bounds, which depend on both the empirical risk and the capacity of the function class, leads to the principle of structural risk minimization ([37]). The best-known capacity concept of VC theory is the VC dimension, defined as the largest number h of points that can be separated in all possible ways using functions of the given class. An example of a VC bound is the following: if h < m is the VC dimension of the class of functions that the learning machine can implement, then for all functions of that class, with a probability of at least 1 − η, the bound

$$ R(f) \le R_{\mathrm{emp}}(f) + \phi\!\left(\frac{h}{m}, \frac{\log(\eta)}{m}\right) \tag{18} $$

holds, where the confidence term φ is defined as

$$ \phi\!\left(\frac{h}{m}, \frac{\log(\eta)}{m}\right) = \sqrt{\frac{h\left(\log\frac{2m}{h} + 1\right) - \log(\eta/4)}{m}}. \tag{19} $$
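To get a feeling for the confidence term (19), the short sketch below evaluates it for some arbitrary illustrative values of the VC dimension h, the sample size m, and η; the numbers are ours, not taken from the tutorial.

```python
import numpy as np

def confidence_term(h, m, eta):
    """Confidence term phi of the VC bound, eq. (19)."""
    return np.sqrt((h * (np.log(2 * m / h) + 1) - np.log(eta / 4)) / m)

# Arbitrary illustrative values: VC dimension h, sample size m, probability 1 - eta.
h, m, eta = 10, 1000, 0.05
print(confidence_term(h, m, eta))  # bound on R(f) - R_emp(f) holding with probability >= 0.95, cf. (18)
```

For fixed h the term shrinks as m grows, and for fixed m it grows with h, which is the quantitative form of the capacity versus sample-size trade-off discussed above.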