Machine Learning - Andrew Ng: Study Notes (2)

Linear regression with one variable

Model Representation

Supervised Learning

Given the "right answer" for each example in the data.

Regression Problem

Predict real-valued output. From the given data, we predict an exact output value.

Training set

Notation:

  1. m = number of training examples
  2. x = "input" variable / feature
  3. y = "output" variable / "target" variable
  4. (x, y) = one training example
  5. \((x^{(i)}, y^{(i)})\) = the \(i\)-th training example

Hypothesis (a function)

  1. Training set -> Learning Algorithm -> h. We "feed" the training set to the learning algorithm, and the learning algorithm outputs a function.

  2. x -> h -> y. h is a map from \(x\)'s to \(y\)'s.

  3. How do we represent h?

    \(h_{\theta}(x) = \theta_0 + \theta_1 x\).

Summary

What the training set and the hypothesis give us: a linear function that predicts \(y\) from \(x\).
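As a quick sketch (in Python rather than the Octave used in the course; the function names are my own), the hypothesis is just a linear function of \(x\):

```python
def h(theta0, theta1, x):
    """Hypothesis for univariate linear regression: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# With theta0 = 1 and theta1 = 2: h(3) = 1 + 2 * 3 = 7
print(h(1, 2, 3))  # -> 7
```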

Cost Function 代价函数

How do we fit the most likely straight line to our data?

Idea

Choose \(\theta_0, \theta_1\) so that \(h_{\theta}(x)\) is close to \(y\) for our training examples \((x, y)\).

Squared error function

\(J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^2\)

Goal: find \(\theta_0, \theta_1\) that minimize \(J(\theta_0, \theta_1)\). \(J(\theta_0, \theta_1)\) is called the cost function.
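The definition translates directly into code; a minimal sketch (names are my own, not from the notes):

```python
def cost(theta0, theta1, xs, ys):
    """Squared-error cost: J = (1 / 2m) * sum over i of (h(x_i) - y_i)^2."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# For points on the line y = 2x, the line with theta0 = 0, theta1 = 2 has zero cost.
print(cost(0, 2, [1, 2, 3], [2, 4, 6]))  # -> 0.0
```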

Cost Function Intuition

Review

  1. Hypothesis: \(h_{\theta}(x) = \theta_0 + \theta_1 x\)
  2. Parameters: \(\theta_0, \theta_1\)
  3. Cost Function: \(J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^2\)
  4. Goal: find \(\theta_0, \theta_1\) to minimize \(J(\theta_0, \theta_1)\)

Simplified

\(\theta_0 = 0 \rightarrow h_{\theta}(x) = \theta_1 x\)

\(J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^2\)

Goal: find \(\theta_1\) to minimize \(J(\theta_1)\)

Example: the relationship between the hypothesis and the cost function for a training set containing the points (1, 1), (2, 2), (3, 3).

Picture
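The example can be checked numerically. A sketch of the simplified cost \(J(\theta_1)\) for these three points (assuming \(\theta_0 = 0\)):

```python
def J(theta1, xs=(1, 2, 3), ys=(1, 2, 3)):
    """Simplified cost with theta0 = 0, for the sample points (1,1), (2,2), (3,3)."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

print(J(1.0))  # -> 0.0, since the line y = x passes through all three points
print(J(0.5))  # -> (0.25 + 1.0 + 2.25) / 6, a worse fit costs more
```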

Gradient Descent 梯度下降

Background

Have some function \(J(\theta_0, \theta_1, \theta_2, \ldots, \theta_n)\)

We want to find \(\theta_0, \theta_1, \theta_2, \ldots, \theta_n\) that minimize \(J(\theta_0, \theta_1, \theta_2, \ldots, \theta_n)\)

For simplicity -> \(\theta_0, \theta_1\)

Outline

  1. Start with some \(\theta_0, \theta_1\) (e.g. \(\theta_0 = 0, \theta_1 = 0\)). Initialization.
  2. Keep changing \(\theta_0, \theta_1\) to reduce \(J(\theta_0, \theta_1)\) until we hopefully end up at a minimum. We keep searching until we reach a local optimum (starting from different points/directions may lead to different final results).

Picture2

Gradient descent algorithm

  1. repeat until convergence {

    \(\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1)\) (for \(j = 0\) and \(j = 1\))

    }

    Here \(\alpha\) is the learning rate; it controls how large a step we take when updating the parameter \(\theta_j\).

  2. Correct: simultaneous update

    \(temp0 := \theta_0 - \alpha\frac{\partial}{\partial\theta_0}J(\theta_0, \theta_1)\)

    \(temp1 := \theta_1 - \alpha\frac{\partial}{\partial\theta_1}J(\theta_0, \theta_1)\)

    \(\theta_0 := temp0\)

    \(\theta_1 := temp1\)

  3. Incorrect: not a simultaneous update

    \(temp0 := \theta_0 - \alpha\frac{\partial}{\partial\theta_0}J(\theta_0, \theta_1)\)

    \(\theta_0 := temp0\)

    \(temp1 := \theta_1 - \alpha\frac{\partial}{\partial\theta_1}J(\theta_0, \theta_1)\) (evaluated with the already-updated \(\theta_0\))

    \(\theta_1 := temp1\)
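The distinction can be made concrete. A sketch with hypothetical partial-derivative functions `dJ0` and `dJ1` (any functions of both parameters will expose the difference):

```python
def step_simultaneous(theta0, theta1, dJ0, dJ1, alpha):
    """Correct: both partial derivatives are evaluated at the OLD (theta0, theta1)."""
    temp0 = theta0 - alpha * dJ0(theta0, theta1)
    temp1 = theta1 - alpha * dJ1(theta0, theta1)
    return temp0, temp1

def step_sequential(theta0, theta1, dJ0, dJ1, alpha):
    """Incorrect: theta0 is overwritten before dJ1 is evaluated."""
    theta0 = theta0 - alpha * dJ0(theta0, theta1)
    theta1 = theta1 - alpha * dJ1(theta0, theta1)  # sees the NEW theta0
    return theta0, theta1

dJ0 = lambda a, b: a + b   # made-up derivatives, only to illustrate the point
dJ1 = lambda a, b: a * b
print(step_simultaneous(1.0, 1.0, dJ0, dJ1, 0.1))
print(step_sequential(1.0, 1.0, dJ0, dJ1, 0.1))  # theta1 comes out different
```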

Gradient Descent Intuition

Meaning of the derivative term

  1. When \(\frac{\partial}{\partial\theta_1}J(\theta_1) > 0\) (the function is increasing), since \(\alpha > 0\), the update \(\theta_1 := \theta_1 - \alpha\frac{\partial}{\partial\theta_1}J(\theta_1)\) decreases \(\theta_1\), moving it toward the lowest point.
  2. When \(\frac{\partial}{\partial\theta_1}J(\theta_1) < 0\) (the function is decreasing), since \(\alpha > 0\), the update increases \(\theta_1\), again moving it toward the lowest point. (In both cases we move toward the point of minimum cost.)

Learning rate \(\alpha\)

  1. If \(\alpha\) is too small, gradient descent can be slow.
  2. If \(\alpha\) is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
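A quick numerical illustration, using the toy cost \(J(\theta_1) = \theta_1^2\) (so the derivative is \(2\theta_1\) and the minimum is at \(0\)); the cost and the step counts here are my own, not from the course:

```python
def run(theta, alpha, steps=20):
    """Gradient descent on J(theta) = theta^2, whose derivative is 2 * theta."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

print(abs(run(1.0, alpha=0.1)))  # small alpha: steadily shrinks toward 0
print(abs(run(1.0, alpha=1.5)))  # too-large alpha: overshoots and diverges
```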

Remarks

  1. Suppose \(\theta_1\) is initialized at a local minimum. The slope there is \(0\), so the derivative term is \(0\), and the gradient-descent update leaves \(\theta_1 = \theta_1\) unchanged.

  2. Gradient descent can converge to a local minimum, even with the learning rate \(\alpha\) fixed.

    As we approach a local minimum, gradient descent will automatically take smaller steps. So, no need to decrease \(\alpha\) over time. (This is because the slope, and hence the derivative term, shrinks as we approach the minimum.)

Gradient Descent For Linear Regression

Applying gradient descent to linear regression

Simplifying the derivative term

$\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1) = \frac{\partial}{\partial\theta_j} \frac{1}{2m} \sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 = \frac{\partial}{\partial\theta_j} \frac{1}{2m} \sum_{i=1}^{m}(\theta_0 + \theta_1 x^{(i)} - y^{(i)})^2$

Taking partial derivatives of the formula above with respect to \(\theta_0\) and \(\theta_1\):

  1. \(j = 0\) : $\frac{\partial}{\partial\theta_0}J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})$
  2. \(j = 1\) : $\frac{\partial}{\partial\theta_1}J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)}) \times x^{(i)}$

Gradient descent algorithm (substituting the results above back into the update rule)

repeat until convergence {

    $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})$

    $\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)}) \times x^{(i)}$

}

"Batch" Gradient Descent

"Batch": each step of gradient descent uses all the training examples.
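Putting the pieces together, a sketch of batch gradient descent for univariate linear regression (function and variable names are my own, and the example data is made up):

```python
def batch_gradient_descent(xs, ys, alpha=0.1, iterations=1000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent.

    "Batch": every iteration sums over ALL m training examples.
    """
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed from the old thetas.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Points on the line y = 1 + 2x should recover theta0 ~ 1, theta1 ~ 2.
print(batch_gradient_descent([0, 1, 2, 3], [1, 3, 5, 7]))
```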

Review

Quiz

  1. Consider the problem of predicting how well a student does in her second year of college/university, given how well she did in her first year.

    Specifically, let (x) be equal to the number of "A" grades (including A-, A and A+ grades) that a student receives in their first year of college (freshman year). We would like to predict the value of (y), which we define as the number of "A" grades they get in their second year (sophomore year).

    Here each row is one training example. Recall that in linear regression, our hypothesis is (h_ heta(x) = heta_0 + heta_1x), and we use (m) to denote the number of training examples.

    x 3 1 0 4
    y 2 2 1 3

    For the training set given above (note that this training set may also be referenced in other questions in this quiz), what is the value of (m)?

    4

  2. For this question, assume that we are using the training set from Q1. Recall our definition of the cost function was \(J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^m{(h_\theta(x^{(i)}) - y^{(i)})^2}\). What is \(J(0, 1)\)? In the box below, please enter your answer (simplify fractions to decimals when entering your answer, and use '.' as the decimal delimiter, e.g., 1.5).

    0.5

  3. Suppose we set \(\theta_0 = -2, \theta_1 = 0.5\) in the linear regression hypothesis from Q1. What is \(h_{\theta}(6)\)?

    1

  4. Let \(f\) be some function so that \(f(\theta_0, \theta_1)\) outputs a number. For this problem, \(f\) is some arbitrary/unknown smooth function (not necessarily the cost function of linear regression, so \(f\) may have local optima). Suppose we use gradient descent to try to minimize \(f(\theta_0, \theta_1)\) as a function of \(\theta_0\) and \(\theta_1\). Which of the following statements are true? (Check all that apply.)

    • [ ] No matter how \(\theta_0\) and \(\theta_1\) are initialized, so long as \(\alpha\) is sufficiently small, we can safely expect gradient descent to converge to the same solution. (Depending on the initial values of \(\theta\), it may converge to different local optima.)
    • [ ] Setting the learning rate \(\alpha\) to be very small is not harmful, and can only speed up the convergence of gradient descent. (Setting \(\alpha\) too small slows gradient descent down.)
    • [x] If \(\theta_0\) and \(\theta_1\) are initialized at the global minimum, then one iteration will not change their values. (At the global minimum the derivative is 0, so gradient descent does not change the parameters.)
    • [x] If the first few iterations of gradient descent cause \(f(\theta_0, \theta_1)\) to increase rather than decrease, then the most likely cause is that we have set the learning rate \(\alpha\) to too large a value. (Too large a learning rate can make the objective increase during gradient descent.)
    • [ ] If \(\theta_0\) and \(\theta_1\) are initialized so that \(\theta_0 = \theta_1\), then by symmetry (because we do simultaneous updates to the two parameters), after one iteration of gradient descent, we will still have \(\theta_0 = \theta_1\). (The update rules for \(\theta_0\) and \(\theta_1\) are different.)
    • [x] If the learning rate is too small, then gradient descent may take a very long time to converge. (Too small a learning rate makes convergence slow.)
    • [x] If \(\theta_0\) and \(\theta_1\) are initialized at a local minimum, then one iteration will not change their values. (At a local minimum the derivative is 0, so gradient descent does not change the parameters.)
    • [ ] Even if the learning rate \(\alpha\) is very large, every iteration of gradient descent will decrease the value of \(f(\theta_0, \theta_1)\). (Too large a learning rate can make the objective increase during gradient descent.)
  5. Suppose that for some linear regression problem (say, predicting housing prices as in the lecture), we have some training set, and for our training set we managed to find some \(\theta_0\), \(\theta_1\) such that \(J(\theta_0, \theta_1) = 0\). Which of the statements below must then be true? (Check all that apply.)

    • [ ] We can perfectly predict the value of \(y\) even for new examples that we have not yet seen. (e.g., we can perfectly predict prices of even new houses that we have not yet seen.)
    • [x] For these values of \(\theta_0\) and \(\theta_1\) that satisfy \(J(\theta_0, \theta_1) = 0\), we have that \(h_\theta(x^{(i)}) = y^{(i)}\) for every training example \((x^{(i)}, y^{(i)})\).
    • [ ] For this to be true, we must have \(\theta_0 = 0\) and \(\theta_1 = 0\) so that \(h_\theta(x) = 0\).
    • [ ] This is not possible: By the definition of \(J(\theta_0, \theta_1)\), it is not possible for there to exist \(\theta_0\) and \(\theta_1\) so that \(J(\theta_0, \theta_1) = 0\).
    • [x] Our training set can be fit perfectly by a straight line, i.e., all of our training examples lie perfectly on some straight line.
    • [ ] For this to be true, we must have \(y^{(i)} = 0\) for every value of \(i = 1, 2, \ldots, m\).
    • [ ] Gradient descent is likely to get stuck at a local minimum and fail to find the global minimum. (The linear-regression cost function has no local minima other than the global one.)
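The numeric answers in Q2 and Q3 can be verified with a few lines of Python (a quick check, not part of the original notes):

```python
xs, ys = [3, 1, 0, 4], [2, 2, 1, 3]

# Q2: J(0, 1), i.e. the hypothesis h(x) = 0 + 1 * x
m = len(xs)
J = sum((0 + 1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)
print(J)  # -> 0.5

# Q3: h(6) with theta0 = -2, theta1 = 0.5
print(-2 + 0.5 * 6)  # -> 1.0
```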
Original article: https://www.cnblogs.com/songjy11611/p/12173201.html