Statistical Machine Learning (Stanford University Lecture Notes) 1–12 (Complete)

CS229 Lecture notes
Andrew Ng
Supervised learning
Let’s start by talking about a few examples of supervised learning problems.
Suppose we have a dataset giving the living areas and prices of 47 houses
from Portland, Oregon:
Living area (feet²)   Price (1000$s)
2104                  400
1600                  330
2400                  369
1416                  232
3000                  540
...                   ...

We can plot this data:

[Figure: scatter plot of housing prices, price (in $1000s) versus square feet]
Given data like this, how can we learn to predict the prices of other houses
in Portland, as a function of the size of their living areas?
CS229 Fall 2012
To establish notation for future use, we’ll use x^(i) to denote the “input” variables (living area in this example), also called input features, and y^(i) to denote the “output” or target variable that we are trying to predict (price). A pair (x^(i), y^(i)) is called a training example, and the dataset that we’ll be using to learn, a list of m training examples {(x^(i), y^(i)); i = 1, ..., m}, is called a training set. Note that the superscript “(i)” in the notation is simply an index into the training set, and has nothing to do with exponentiation. We will also use X to denote the space of input values, and Y the space of output values. In this example, X = Y = ℝ.
To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis. Seen pictorially, the process is therefore like this:

    Training set --> Learning algorithm --> h
    x (living area of house) --> h --> predicted y (predicted price of house)
When the target variable that we’re trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict if a dwelling is a house or an apartment, say), we call it a classification problem.
Part I
Linear Regression
To make our housing example more interesting, let’s consider a slightly richer
dataset in which we also know the number of bedrooms in each house:
Living area (feet²)   #bedrooms   Price (1000$s)
2104                  3           400
1600                  3           330
2400                  3           369
1416                  2           232
3000                  4           540
...                   ...         ...
Here, the x’s are two-dimensional vectors in ℝ². For instance, x_1^(i) is the living area of the i-th house in the training set, and x_2^(i) is its number of
bedrooms. (In general, when designing a learning problem, it will be up to
you to decide what features to choose, so if you are out in Portland gathering
housing data, you might also decide to include other features such as whether
each house has a fireplace, the number of bathrooms, and so on. We’ll say
more about feature selection later, but for now let’s take the features as
given.)
To perform supervised learning, we must decide how we’re going to represent functions/hypotheses h in a computer. As an initial choice, let’s say we decide to approximate y as a linear function of x:

    h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2
Here, the θ_i’s are the parameters (also called weights) parameterizing the space of linear functions mapping from X to Y. When there is no risk of confusion, we will drop the θ subscript in h_θ(x), and write it more simply as h(x).
To simplify our notation, we also introduce the convention of letting x_0 = 1 (this is the intercept term), so that

    h(x) = Σ_{i=0}^{n} θ_i x_i = θ^T x,

where on the right-hand side above we are viewing θ and x both as vectors, and here n is the number of input variables (not counting x_0).
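The θ^T x form above can be sketched in a few lines of NumPy. The parameter values and input below are made up for illustration; only the dot-product structure comes from the notes:

```python
import numpy as np

# Hypothetical parameters and one input, with x_0 = 1 prepended.
theta = np.array([50.0, 0.1, 25.0])   # [theta_0, theta_1, theta_2]
x = np.array([1.0, 2104.0, 3.0])      # [x_0, living area, #bedrooms]

# h_theta(x) = theta^T x
h = theta @ x
print(h)  # 50 + 0.1*2104 + 25*3 = 335.4
```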
Now, given a training set, how do we pick, or learn, the parameters θ? One reasonable method seems to be to make h(x) close to y, at least for the training examples we have. To formalize this, we will define a function that measures, for each value of the θ’s, how close the h(x^(i))’s are to the corresponding y^(i)’s. We define the cost function:
    J(θ) = (1/2) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))².
If you’ve seen linear regression before, you may recognize this as the familiar
least-squares cost function that gives rise to the
ordinary least squares
regression model. Whether or not you have seen it previously, let’s keep
going, and we’ll eventually show this to be a special case of a much broader
family of algorithms.
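A minimal sketch of this cost function in NumPy, using the five single-feature houses from the first table (the function name `cost` is ours, not from the notes):

```python
import numpy as np

def cost(theta, X, y):
    # J(theta) = (1/2) * sum_i (h_theta(x^(i)) - y^(i))^2
    residuals = X @ theta - y
    return 0.5 * residuals @ residuals

# First five houses: x_0 = 1 prepended, then living area; y is price in $1000s.
X = np.array([[1.0, 2104.0], [1.0, 1600.0], [1.0, 2400.0],
              [1.0, 1416.0], [1.0, 3000.0]])
y = np.array([400.0, 330.0, 369.0, 232.0, 540.0])

print(cost(np.zeros(2), X, y))  # with theta = 0, J = (1/2) * sum(y^2) = 375242.5
```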
1  LMS algorithm
We want to choose θ so as to minimize J(θ). To do so, let’s use a search algorithm that starts with some “initial guess” for θ, and that repeatedly changes θ to make J(θ) smaller, until hopefully we converge to a value of θ that minimizes J(θ). Specifically, let’s consider the gradient descent algorithm, which starts with some initial θ, and repeatedly performs the update:
    θ_j := θ_j − α (∂/∂θ_j) J(θ).
(This update is simultaneously performed for all values of j = 0, ..., n.) Here, α is called the learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of J.
In order to implement this algorithm, we have to work out what the partial derivative term on the right-hand side is. Let’s first work it out for the case where we have only one training example (x, y), so that we can neglect the sum in the definition of J. We have:
    (∂/∂θ_j) J(θ) = (∂/∂θ_j) (1/2)(h_θ(x) − y)²
                  = 2 · (1/2) · (h_θ(x) − y) · (∂/∂θ_j)(h_θ(x) − y)
                  = (h_θ(x) − y) · (∂/∂θ_j)(Σ_{i=0}^{n} θ_i x_i − y)
                  = (h_θ(x) − y) x_j
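One way to sanity-check this closed form is to compare it with a central finite-difference approximation of ∂J/∂θ_j; the example values here are made up for illustration:

```python
import numpy as np

x = np.array([1.0, 2104.0])    # x_0 = 1, living area (illustrative)
y = 400.0
theta = np.array([0.5, 0.05])

def J(theta):
    # Single-example cost: (1/2)(h_theta(x) - y)^2
    return 0.5 * (theta @ x - y) ** 2

# Closed form derived above: (h_theta(x) - y) * x_j for each j.
analytic = (theta @ x - y) * x

eps = 1e-6
numeric = np.array([(J(theta + eps * e) - J(theta - eps * e)) / (2 * eps)
                    for e in np.eye(2)])

print(np.allclose(analytic, numeric, rtol=1e-4))  # True
```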
For a single training example, this gives the update rule:¹

    θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i).
The rule is called the LMS update rule (LMS stands for “least mean squares”), and is also known as the Widrow-Hoff learning rule. This rule has several properties that seem natural and intuitive. For instance, the magnitude of the update is proportional to the error term (y^(i) − h_θ(x^(i))); thus, for instance, if we are encountering a training example on which our prediction nearly matches the actual value of y^(i), then we find that there is little need to change the parameters; in contrast, a larger change to the parameters will be made if our prediction h_θ(x^(i)) has a large error (i.e., if it is very far from y^(i)).
We’d derived the LMS rule for when there was only a single training example. There are two ways to modify this method for a training set of more than one example. The first is to replace it with the following algorithm:
Repeat until convergence {

    θ_j := θ_j + α Σ_{i=1}^{m} (y^(i) − h_θ(x^(i))) x_j^(i)    (for every j)

}
The reader can easily verify that the quantity in the summation in the update rule above is just ∂J(θ)/∂θ_j (for the original definition of J). So, this is simply gradient descent on the original cost function J. This method looks at every example in the entire training set on every step, and is called batch gradient descent. Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global, and no other local, optima; thus gradient descent always converges (assuming the learning rate α is not too large) to the global minimum. Indeed, J is a convex quadratic function. Here is an example of gradient descent as it is run to minimize a quadratic function.
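As an illustrative sketch, here is batch gradient descent run on the five houses above; the feature scaling, α, and iteration count are our choices, not part of the notes:

```python
import numpy as np

X = np.array([[1.0, 2104.0], [1.0, 1600.0], [1.0, 2400.0],
              [1.0, 1416.0], [1.0, 3000.0]])
y = np.array([400.0, 330.0, 369.0, 232.0, 540.0])

# Standardize the living-area column so a simple fixed alpha converges.
Xs = X.copy()
Xs[:, 1] = (X[:, 1] - X[:, 1].mean()) / X[:, 1].std()

theta = np.zeros(2)
alpha = 0.1
for _ in range(1000):
    # theta_j := theta_j + alpha * sum_i (y^(i) - h_theta(x^(i))) * x_j^(i)
    theta += alpha * (y - Xs @ theta) @ Xs

# Since J is convex, the result matches the exact least-squares solution.
exact = np.linalg.lstsq(Xs, y, rcond=None)[0]
print(np.allclose(theta, exact))  # True
```

Because the standardized design matrix here has Xs^T Xs = 5I, any α below 2/5 converges; α = 0.1 halves the error each step.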
¹We use the notation “a := b” to denote an operation (in a computer program) in which we set the value of a variable a to be equal to the value of b. In other words, this operation overwrites a with the value of b. In contrast, we will write “a = b” when we are asserting a statement of fact, that the value of a is equal to the value of b.