林嶔 (Lin, Chin)
Lesson 1: Introduction to Deep Neural Networks
For the analysis tools, this course comes with R code examples, but you do not have to use R; tools such as SPSS or Python are also fine.
The focus of this course is explaining the theory behind "deep learning" and introducing several very creative models for solving practical tasks.
– However, this course emphasizes a broad introduction to deep learning. If you want to dig into the depth of each application and push the accuracy of your final models further, you are strongly encouraged to take 「醫療人工智慧實作」 (Hands-on Medical Artificial Intelligence), where you can become familiar with the modeling workflow through data science competitions.
– The strength of deep learning therefore lies in analyzing "unstructured data" such as images, physiological signals, and text; compared with other algorithms, backpropagation is highly extensible, which is why so many models for different tasks have been built on top of it.
Before we begin, some basic background: data science is the study of exploring the unknown, and we will use existing data to construct relationships among the data.
Mathematically, suppose our input is an object \(x\) and our output is an object \(y\); we can then construct a "prediction function" \(f()\) to make predictions:
\[\hat{y} = f(x)\]
\[loss = diff(y, \hat{y})\]
\[loss = diff(y, f(x))\]
\[min(loss)\]
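As a concrete illustration of this predict / compare / minimize framework (this example is not in the original slides; diff_fun, f_hat, x_toy, and y_toy are hypothetical names, and mean squared error is just one possible choice of diff()):
# A minimal sketch: evaluate the loss of one candidate prediction function.
diff_fun <- function(y, y_hat) mean((y - y_hat)^2)   # one possible diff(): mean squared error
f_hat <- function(x) 2 * x + 1                       # a candidate prediction function f()
x_toy <- c(1, 2, 3, 4)
y_toy <- c(3.1, 4.9, 7.2, 8.8)
diff_fun(y_toy, f_hat(x_toy))                        # the loss we would like to minimize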
\[f(x) = x^{2} + 2x + 1\] - Next, we differentiate this function and look for the point where the derivative equals 0, which gives us the location of the function's extremum:
\[\frac{\partial}{\partial x} f(x) = 2x + 2 = 0\]
\[x = -1\]
Why can we use "differentiation" to find a function's extremum? You may need to review the basic idea: the "derivative" obtained by differentiating a "function" is really that function's "tangent-slope function", and a point where the tangent slope equals 0 means the function has stopped changing after a stretch of rising or falling, so that point is naturally where an extremum lies.
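As a quick numeric sanity check (not in the original slides) using base R's optimize(), the minimum of the function above is indeed at \(x = -1\):
# Numerically minimize f(x) = x^2 + 2x + 1 over an interval containing the minimum.
f <- function(x) x^2 + 2 * x + 1
optimize(f, interval = c(-10, 10))   # $minimum is approximately -1, $objective approximately 0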
However, there is a very annoying step in the extremum-finding process above: we had to solve a linear equation in one unknown. Once the function gets more complicated, we would have to solve a system of degree-M equations in N unknowns, which makes the whole process extremely complex, so we need to look for another approach.
Here we formally introduce "gradient descent". First, what is a "gradient"? A "gradient" is essentially a "slope" (note: this definition is not precise, but we will use it for now to avoid a lot of heavy mathematical language). Under this definition, "gradient descent" means that, while solving for the extremum, we keep moving according to the "gradient" until we reach the extremum.
Below we use finding the extremum of the function \(f(x)\) above as an example. We first pick a random starting value and define it as epoch 0:
\[x_{\left(epoch:0\right)} = 10\]
\[x_{\left(epoch:t\right)} = x_{\left(epoch:t - 1\right)} - lr \cdot \frac{\partial}{\partial x}f(x_{\left(epoch:t - 1\right)})\] - Since the derivative of our function is \(2x + 2\), and taking the learning rate \(lr = 0.05\), we can substitute it into the update rule:
\[ \begin{align} x_{\left(epoch:1\right)} & = x_{\left(epoch:0\right)} - lr \cdot \frac{\partial}{\partial x}f(x_{\left(epoch:0\right)}) \\ & = 10 - lr \cdot \frac{\partial}{\partial x}f(10) \\ & = 10 - 0.05 \cdot (2\cdot10+2)\\ & = 8.9 \end{align} \]
\[ \begin{align} x_{\left(epoch:2\right)} & = x_{\left(epoch:1\right)} - lr \cdot \frac{\partial}{\partial x}f(x_{\left(epoch:1\right)}) \\ & = 8.9 - lr \cdot \frac{\partial}{\partial x}f(8.9) \\ & = 8.9 - 0.05 \cdot (2\cdot8.9+2)\\ & = 7.91 \end{align} \]
\[ \begin{align} x_{\left(epoch:3\right)} & = 7.91 - 0.891 = 7.019 \\ x_{\left(epoch:4\right)} & = 7.019 - 0.8019 = 6.2171 \\ x_{\left(epoch:5\right)} & = 6.2171 - 0.72171 = 5.49539 \\ & \dots \\ x_{\left(epoch:\infty\right)} & = -1 \end{align} \]
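The short sketch below (not in the original slides) replays this iteration in R, starting from \(x_{\left(epoch:0\right)} = 10\) with \(lr = 0.05\); after enough epochs it converges toward \(-1\):
# Gradient descent on f(x) = x^2 + 2x + 1, replaying the hand calculation above.
f_prime <- function(x) 2 * x + 2          # derivative of f
lr <- 0.05
x_current <- 10                           # epoch 0
for (epoch in 1:200) {
  x_current <- x_current - lr * f_prime(x_current)   # x_t = x_{t-1} - lr * f'(x_{t-1})
}
x_current                                 # very close to the minimizer -1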
– We are well aware of the limitations of the "linear model": we need our "prediction function" to have enough "nonlinear fitting capacity".
\[L^k(x_1, x_2) = w_{0}^k + w_{1}^kx_1 + w_{2}^kx_2\]
\[ \begin{align} S(x) & = \frac{{1}}{1+e^{-x}} \end{align} \]
\[ \begin{align} h_1 & = S(L^1(x_1, x_2)) \\ h_2 & = S(L^2(x_1, x_2)) \\ o & = S(L^3(h_1, h_2)) \end{align} \]
\[ \begin{align} h_1 & = S(L^1(x_1, x_2)) \\ h_2 & = S(L^2(x_1, x_2)) \\ o & = S(L^3(h_1, h_2)) \\ loss & = CE(y, o) = \frac{{1}}{n}\sum \limits_{i=1}^{n} -\left(y_{i} \cdot log(o_{i}) + (1-y_{i}) \cdot log(1-o_{i})\right) \end{align} \]
– Below are the partial derivatives with respect to \(w_{0}^1\), \(w_{1}^1\), \(w_{2}^1\), \(w_{0}^2\), \(w_{1}^2\), \(w_{2}^2\), \(w_{0}^3\), \(w_{1}^3\), and \(w_{2}^3\); for the detailed derivation, please refer to the material here:
\[ \begin{align} \frac{\partial}{\partial w_{0}^3}CE(y, o) & = \frac{{1}}{n}\sum \limits_{i=1}^{n} \left( o_i-y_i \right) \\ \frac{\partial}{\partial w_{1}^3}CE(y, o) & = \frac{{1}}{n}\sum \limits_{i=1}^{n} \left( o_i-y_i \right) \cdot h_{1i} \\ \frac{\partial}{\partial w_{2}^3}CE(y, o) & = \frac{{1}}{n}\sum \limits_{i=1}^{n}\left( o_i-y_i \right) \cdot h_{2i} \\ \frac{\partial}{\partial w_{0}^2}CE(y, o) & = \frac{{1}}{n}\sum \limits_{i=1}^{n} \left( o_i-y_i \right) \cdot w_{2}^3 \cdot h_{2i} (1 - h_{2i}) \\ \frac{\partial}{\partial w_{1}^2}CE(y, o) & = \frac{{1}}{n}\sum \limits_{i=1}^{n} \left( o_i-y_i \right) \cdot w_{2}^3 \cdot h_{2i} (1 - h_{2i}) \cdot x_{1i} \\ \frac{\partial}{\partial w_{2}^2}CE(y, o) & = \frac{{1}}{n}\sum \limits_{i=1}^{n} \left( o_i-y_i \right) \cdot w_{2}^3 \cdot h_{2i} (1 - h_{2i}) \cdot x_{2i} \\ \frac{\partial}{\partial w_{0}^1}CE(y, o) & = \frac{{1}}{n}\sum \limits_{i=1}^{n} \left( o_i-y_i \right) \cdot w_{1}^3 \cdot h_{1i} (1 - h_{1i}) \\ \frac{\partial}{\partial w_{1}^1}CE(y, o) & = \frac{{1}}{n}\sum \limits_{i=1}^{n} \left( o_i-y_i \right) \cdot w_{1}^3 \cdot h_{1i} (1 - h_{1i}) \cdot x_{1i} \\ \frac{\partial}{\partial w_{2}^1}CE(y, o) & = \frac{{1}}{n}\sum \limits_{i=1}^{n} \left( o_i-y_i \right) \cdot w_{1}^3 \cdot h_{1i} (1 - h_{1i}) \cdot x_{2i} \end{align} \]
– To generate the simulated data we use exactly the same architecture as the "multilayer perceptron" above, so we know exactly what the coefficients should turn out to be.
– The values we specify are: \(w_{0}^1=-4\), \(w_{1}^1=-4\), \(w_{2}^1=-3\), \(w_{0}^2=7\), \(w_{1}^2=-6\), \(w_{2}^2=-1\), \(w_{0}^3=4\), \(w_{1}^3=6\), \(w_{2}^3=-7\)
set.seed(0)
x1 <- rnorm(500, sd = 1)
x2 <- rnorm(500, sd = 1)
l1 <- -4 - 4 * x1 - 3 * x2                     # L^1(x1, x2)
l2 <- 7 - 6 * x1 - 1 * x2                      # L^2(x1, x2)
h1 <- 1 / (1 + exp(-l1))                       # h1 = S(l1)
h2 <- 1 / (1 + exp(-l2))                       # h2 = S(l2)
l3 <- 4 + 6 * h1 - 7 * h2                      # L^3(h1, h2)
o <- 1 / (1 + exp(-l3))                        # o = S(l3)
y <- (o > runif(500)) + 0L                     # draw binary labels with probability o
col_list <- c('#0000FF80', '#FF000080')[y + 1] # blue for y = 0, red for y = 1
plot(x1, x2, col = col_list, xlim = c(-5, 5), ylim = c(-5, 5), pch = 19, cex = 0.5)
#Forward
S.fun <- function (x, eps = 1e-5) {
S = 1/(1 + exp(-x))
S[S < eps] = eps
S[S > 1 - eps] = 1 - eps
return(S)
}
h1.fun <- function (w10, w11, w12, x1 = x1, x2 = x2) {
L1 = w10 + w11 * x1 + w12 * x2
return(S.fun(L1))
}
h2.fun <- function (w20, w21, w22, x1 = x1, x2 = x2) {
L2 = w20 + w21 * x1 + w22 * x2
return(S.fun(L2))
}
o.fun <- function (w30, w31, w32, h1, h2) {
L3 = w30 + w31 * h1 + w32 * h2
return(S.fun(L3))
}
loss.fun <- function (o, y = y) {
loss = -1/length(y) * sum(y * log(o) + (1 - y) * log(1 - o))
return(loss)
}
#Backward
differential.fun.w30 <- function(o, y = y) {
return(-1/length(y)*sum(y-o))
}
differential.fun.w31 <- function(o, h1, y = y) {
return(-1/length(y)*sum((y-o)*h1))
}
differential.fun.w32 <- function(o, h2, y = y) {
return(-1/length(y)*sum((y-o)*h2))
}
differential.fun.w20 <- function(o, h2, w32, y = y) {
return(-1/length(y)*sum((y-o)*w32*h2*(1-h2)))
}
differential.fun.w21 <- function(o, h2, w32, y = y, x1 = x1) {
return(-1/length(y)*sum((y-o)*w32*h2*(1-h2)*x1))
}
differential.fun.w22 <- function(o, h2, w32, y = y, x2 = x2) {
return(-1/length(y)*sum((y-o)*w32*h2*(1-h2)*x2))
}
differential.fun.w10 <- function(o, h1, w31, y = y) {
return(-1/length(y)*sum((y-o)*w31*h1*(1-h1)))
}
differential.fun.w11 <- function(o, h1, w31, y = y, x1 = x1) {
return(-1/length(y)*sum((y-o)*w31*h1*(1-h1)*x1))
}
differential.fun.w12 <- function(o, h1, w31, y = y, x2 = x2) {
return(-1/length(y)*sum((y-o)*w31*h1*(1-h1)*x2))
}
num.iteration = 10000
lr = 0.1
W_matrix = matrix(0, nrow = num.iteration + 1, ncol = 9)
loss_seq = rep(0, num.iteration)
colnames(W_matrix) = c('w10', 'w11', 'w12', 'w20', 'w21', 'w22', 'w30', 'w31', 'w32')
#Start random values
W_matrix[1,] = rnorm(9, sd = 1)
for (i in 2:(num.iteration+1)) {
#Forward
current_H1 = h1.fun(w10 = W_matrix[i-1,1], w11 = W_matrix[i-1,2], w12 = W_matrix[i-1,3],
x1 = x1, x2 = x2)
current_H2 = h2.fun(w20 = W_matrix[i-1,4], w21 = W_matrix[i-1,5], w22 = W_matrix[i-1,6],
x1 = x1, x2 = x2)
current_O = o.fun(w30 = W_matrix[i-1,7], w31 = W_matrix[i-1,8], w32 = W_matrix[i-1,9],
h1 = current_H1, h2 = current_H2)
loss_seq[i-1] = loss.fun(o = current_O, y = y)
#Backward
W_matrix[i,1] = W_matrix[i-1,1] - lr * differential.fun.w10(o = current_O, h1 = current_H1,
w31 = W_matrix[i-1,8], y = y)
W_matrix[i,2] = W_matrix[i-1,2] - lr * differential.fun.w11(o = current_O, h1 = current_H1,
w31 = W_matrix[i-1,8], y = y, x1 = x1)
W_matrix[i,3] = W_matrix[i-1,3] - lr * differential.fun.w12(o = current_O, h1 = current_H1,
w31 = W_matrix[i-1,8], y = y, x2 = x2)
W_matrix[i,4] = W_matrix[i-1,4] - lr * differential.fun.w20(o = current_O, h2 = current_H2,
w32 = W_matrix[i-1,9], y = y)
W_matrix[i,5] = W_matrix[i-1,5] - lr * differential.fun.w21(o = current_O, h2 = current_H2,
w32 = W_matrix[i-1,9], y = y, x1 = x1)
W_matrix[i,6] = W_matrix[i-1,6] - lr * differential.fun.w22(o = current_O, h2 = current_H2,
w32 = W_matrix[i-1,9], y = y, x2 = x2)
W_matrix[i,7] = W_matrix[i-1,7] - lr * differential.fun.w30(o = current_O, y = y)
W_matrix[i,8] = W_matrix[i-1,8] - lr * differential.fun.w31(o = current_O, h1 = current_H1, y = y)
W_matrix[i,9] = W_matrix[i-1,9] - lr * differential.fun.w32(o = current_O, h2 = current_H2, y = y)
}
library(scales)
library(plot3D)
x1_seq = seq(-5, 5, length.out = 201)
x2_seq = seq(-5, 5, length.out = 201)
pre_func = function (x1, x2, W_list = W_matrix[nrow(W_matrix),]) {
H1 = h1.fun(w10 = W_list[1], w11 = W_list[2], w12 = W_list[3], x1 = x1, x2 = x2)
H2 = h2.fun(w20 = W_list[4], w21 = W_list[5], w22 = W_list[6], x1 = x1, x2 = x2)
O = o.fun(w30 = W_list[7], w31 = W_list[8], w32 = W_list[9], h1 = H1, h2 = H2)
return(O)
}
z_matrix = sapply(x2_seq, function(x) {pre_func(x1 = x1_seq, x2 = x)})
image2D(z = z_matrix,
x = x1_seq, xlab = 'x1', xlim = c(-5, 5),
y = x2_seq, ylab = 'x2', ylim = c(-5, 5),
shade = 0.2, rasterImage = TRUE,
col = colorRampPalette(c("#FFA0A0", "#FFFFFF", "#A0A0FF"))(100))
points(x1, x2, col = (y + 1)*2, pch = 19, cex = 0.5)
– Is the result we end up with very close to the true answer?
## w10 w11 w12 w20 w21 w22 w30
## [10001,] -3.673431 -3.44538 -2.358838 5.492932 -4.407179 -0.941106 3.677612
## w31 w32
## [10001,] 5.642388 -6.222671
## w10 w11 w12 w20 w21 w22 w30
## [10001,] 4.351249 4.341602 3.010642 4.877026 -4.003332 -0.8222346 7.696732
## w31 w32
## [10001,] -4.741939 -5.340528
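– For reference, output like the two blocks above can be obtained by printing the final row of the weight matrix (the two blocks presumably come from two different random initializations; the exact chunk is not shown in the source):
W_matrix[nrow(W_matrix), , drop = FALSE]   # weights after the last (10000th) iteration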
– You should agree that, with everything else held fixed, the parameter set \(w_{0}^1=-4\), \(w_{1}^1=-4\), \(w_{2}^1=-3\) and the set \(w_{0}^1=-8\), \(w_{1}^1=-8\), \(w_{2}^1=-6\) give "very similar" results computationally (after all, once passed through \(S(x)\), most of the values end up very close).
– Moreover, if we directly swap \(h_1 = S(L^1(x_1, x_2))\) and \(h_2 = S(L^2(x_1, x_2))\) and correspondingly adjust \(w_{1}^3\) and \(w_{2}^3\), the result is "exactly the same" (a short numerical check follows this list).
– This is also why some people call neural-network methods "black box" algorithms: we have essentially no idea what their internal logic is; we only know that the final predictions are quite good.
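– As a quick check of the swapping argument above (this check is not in the original slides; it reuses the h1, h2, and o computed when the data were simulated):
h1_swap <- h2                     # hidden unit 1 now plays the role of the old unit 2
h2_swap <- h1                     # and vice versa
o_swap  <- 1 / (1 + exp(-(4 - 7 * h1_swap + 6 * h2_swap)))   # w31 and w32 swapped as well
all.equal(o, o_swap)              # TRUE: the output is identical up to floating-point rounding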
– Using matrices, the multilayer perceptron can be written as a more compact set of formulas (the equations below are at the level of a single individual):
\[ \begin{align} L^k_d(X, W^k_d) & = XW^k_d \\ X & = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,m} \end{pmatrix} \\ W^k_d & = \begin{pmatrix} w^k_{0,1} & w^k_{0,2} & \cdots & w^k_{0,d} \\ w^k_{1,1} & w^k_{1,2} & \cdots & w^k_{1,d} \\ w^k_{2,1} & w^k_{2,2} & \cdots & w^k_{2,d} \\ \vdots & \vdots & \ddots & \vdots \\ w^k_{m,1} & w^k_{m,2} & \cdots & w^k_{m,d} \end{pmatrix} \\ \frac{\partial}{\partial W^k_d}L^k_d(X, W^k_d) & = \begin{pmatrix} X^T & & X^T & \cdots & X^T \\ \end{pmatrix} \mbox{ [repeat } d \mbox{ times]} \end{align} \]
\[ \begin{align} S(x) & = \frac{{1}}{1+e^{-x}} \\ \frac{\partial}{\partial x}S(x) & = S(x)(1-S(x)) \end{align} \]
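A quick finite-difference check of the derivative identity above (not in the original slides; the evaluation points are arbitrary):
S <- function(x) 1 / (1 + exp(-x))
x_pts <- c(-2, 0, 1.5)
(S(x_pts + 1e-6) - S(x_pts - 1e-6)) / (2e-6)   # numerical derivative of S
S(x_pts) * (1 - S(x_pts))                      # closed form, agrees to many decimal places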
\[ \begin{align} l_1 & = L^1_d(x^E,W^1_d) \\ h_1 & = S(l_1) \\ l_2 & = L^2_1(h_1^E,W^2_1) \\ o & = S(l_2) \end{align} \]
\[ \begin{align} l_1 & = L^1_d(x^E,W^1_d) \\ h_1 & = S(l_1) \\ l_2 & = L^2_1(h_1^E,W^2_1) \\ o & = S(l_2) \\ loss & = CE(y, o) = \frac{{1}}{n}\sum \limits_{i=1}^{n} -\left(y \cdot log(o) + (1-y) \cdot log(1-o)\right) \end{align} \]
\[ \begin{align} grad.o & = \frac{\partial}{\partial o}loss = \frac{o-y}{o(1-o)} \\ grad.l_2 & = \frac{\partial}{\partial l_2}loss = grad.o \otimes \frac{\partial}{\partial l_2}o= o-y \\ grad.W^2_1 & = \frac{\partial}{\partial W^2_1}loss = grad.l_2 \otimes \frac{\partial}{\partial W^2_1}l_2 = \frac{1}{n} \otimes (h_1^E)^T \bullet grad.l_2\\ grad.h_1^E & = \frac{\partial}{\partial h_1^E}loss = grad.l_2 \otimes \frac{\partial}{\partial h_1^E}l_2 = grad.l_2 \bullet (W^2_1)^T \\ grad.l_1 & = \frac{\partial}{\partial l_1}loss = grad.h_1 \otimes \frac{\partial}{\partial l_1}h_1 = grad.h_1 \otimes h_1 \otimes (1-h_1) \\ grad.W^1_d & = \frac{\partial}{\partial W^1_d}loss = grad.l_1 \otimes \frac{\partial}{\partial W^1_d}l_1 = \frac{1}{n} \otimes (x^E)^T \bullet grad.l_1 \end{align} \]
– Here is an MLP training function; since everything has been written in matrix form, we can conveniently specify the number of hidden neurons (num.hidden):
MLP_Trainer = function (num.iteration = 500, num.hidden = 2, lr = 0.1, x1 = x1, x2 = x2, y = y) {
#Functions
#Forward
S.fun = function (x, eps = 1e-5) {
S = 1/(1 + exp(-x))
S[S < eps] = eps
S[S > 1 - eps] = 1 - eps
return(S)
}
L.fun = function (X, W) {
X.E = cbind(1, X)
L = X.E %*% W
return(L)
}
CE.fun = function (o, y, eps = 1e-5) {
loss = -1/length(y) * sum(y * log(o + eps) + (1 - y) * log(1 - o + eps))
return(loss)
}
#Backward
grad_o.fun = function (o, y) {
return((o - y)/(o*(1-o)))
}
grad_l2.fun = function (grad_o, o) {
return(grad_o*(o*(1-o)))
}
grad_W2.fun = function (grad_l2, h1) {
h1.E = cbind(1, h1)
return(t(h1.E) %*% grad_l2/nrow(h1))
}
grad_h1.fun = function (grad_l2, W2) {
return(grad_l2 %*% t(W2[-1,]))
}
grad_l1.fun = function (grad_h1, h1) {
return(grad_h1*(h1*(1-h1)))
}
grad_W1.fun = function (grad_l1, x) {
x.E = cbind(1, x)
return(t(x.E) %*% grad_l1/nrow(x))
}
#Caculating
X_matrix = cbind(x1, x2)
W1_list = list()
W2_list = list()
loss_seq = rep(0, num.iteration)
#Start random values
W1_list[[1]] = matrix(rnorm(3*num.hidden, sd = 1), nrow = 3, ncol = num.hidden)
W2_list[[1]] = matrix(rnorm(num.hidden + 1, sd = 1), nrow = num.hidden + 1, ncol = 1)
for (i in 2:(num.iteration+1)) {
#Forward
current_l1 = L.fun(X = X_matrix, W = W1_list[[i - 1]])
current_h1 = S.fun(x = current_l1)
current_l2 = L.fun(X = current_h1, W = W2_list[[i - 1]])
current_o = S.fun(x = current_l2)
loss_seq[i-1] = CE.fun(o = current_o, y = y, eps = 1e-5)
#Backward
current_grad_o = grad_o.fun(o = current_o, y = y)
current_grad_l2 = grad_l2.fun(grad_o = current_grad_o, o = current_o)
current_grad_W2 = grad_W2.fun(grad_l2 = current_grad_l2, h1 = current_h1)
current_grad_h1 = grad_h1.fun(grad_l2 = current_grad_l2, W2 = W2_list[[i - 1]])
current_grad_l1 = grad_l1.fun(grad_h1 = current_grad_h1, h1 = current_h1)
current_grad_W1 = grad_W1.fun(grad_l1 = current_grad_l1, x = X_matrix)
W2_list[[i]] = W2_list[[i-1]] - lr * current_grad_W2
W1_list[[i]] = W1_list[[i-1]] - lr * current_grad_W1
}
require(scales)
require(plot3D)
x1_seq = seq(min(x1), max(x1), length.out = 100)
x2_seq = seq(min(x2), max(x2), length.out = 100)
pre_func = function (x1, x2, W1 = W1_list[[length(W1_list)]], W2 = W2_list[[length(W2_list)]]) {
new_X = cbind(x1, x2)
O = S.fun(x = L.fun(X = S.fun(x = L.fun(X = new_X, W = W1)), W = W2))
return(O)
}
z_matrix = sapply(x2_seq, function(x) {pre_func(x1 = x1_seq, x2 = x)})
par(mfrow = c(1, 2))
image2D(z = z_matrix,
x = x1_seq, xlab = 'x1',
y = x2_seq, ylab = 'x2',
shade = 0.2, rasterImage = TRUE,
col = colorRampPalette(c("#FFA0A0", "#FFFFFF", "#A0A0FF"))(100))
points(x1, x2, col = (y + 1)*2, pch = 19, cex = 0.5)
plot(loss_seq, type = 'l', main = 'loss', xlab = 'iter.', ylab = 'CE loss')
}
– This is the same setting as before, with 2 hidden neurons:
– With a few more, 5 hidden neurons, there is actually not much change:
– With an utterly exaggerated number, 50 hidden neurons, things still look roughly the same, but some odd boundaries start to appear (the calls that would produce these three runs are sketched below):
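– For reference, the three runs above could be reproduced with calls like the following (the exact num.iteration used for the original figures is not shown, so the function default of 500 is assumed here):
MLP_Trainer(num.iteration = 500, num.hidden = 2,  lr = 0.1, x1 = x1, x2 = x2, y = y)
MLP_Trainer(num.iteration = 500, num.hidden = 5,  lr = 0.1, x1 = x1, x2 = x2, y = y)
MLP_Trainer(num.iteration = 500, num.hidden = 50, lr = 0.1, x1 = x1, x2 = x2, y = y)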
– Now let's try some other data:
set.seed(0)
x1 <- rnorm(100, sd = 1)
x2 <- rnorm(100, sd = 1)
lp <- -2.5 + x1^2 + 2 * x2^2
y <- (lp > 0) + 0L
– Below is the original 2-layer multilayer perceptron with its loss function, followed by a deeper version with one extra hidden layer, together with its gradients:
\[ \begin{align} l_1 & = L^1_d(x^E,W^1_d) \\ h_1 & = S(l_1) \\ l_2 & = L^2_1(h_1^E,W^2_1) \\ o & = S(l_2) \\ loss & = CE(y, o) = \frac{{1}}{n}\sum \limits_{i=1}^{n} -\left(y \cdot log(o) + (1-y) \cdot log(1-o)\right) \end{align} \]
\[ \begin{align} l_1 & = L^1_d(x^E,W^1_d) \\ h_1 & = S(l_1) \\ l_2 & = L^2_d(h_1^E,W^2_d) \\ h_2 & = S(l_2) \\ l_3 & = L^3_1(h_2^E,W^3_1) \\ o & = S(l_3) \\ loss & = CE(y, o) = \frac{{1}}{n}\sum \limits_{i=1}^{n} -\left(y \cdot log(o) + (1-y) \cdot log(1-o)\right) \end{align} \]
\[ \begin{align} grad.o & = \frac{\partial}{\partial o}loss = \frac{o-y}{o(1-o)} \\ grad.l_3 & = \frac{\partial}{\partial l_3}loss = grad.o \otimes \frac{\partial}{\partial l_3}o= o-y \\ grad.W^3_1 & = \frac{\partial}{\partial W^3_1}loss = grad.l_3 \otimes \frac{\partial}{\partial W^3_1}l_3 = \frac{1}{n} \otimes (h_2^E)^T \bullet grad.l_3\\ grad.h_2^E & = \frac{\partial}{\partial h_2^E}loss = grad.l_3 \otimes \frac{\partial}{\partial h_2^E}l_3 = grad.l_3 \bullet (W^3_1)^T \\ grad.l_2 & = \frac{\partial}{\partial l_2}loss = grad.h_2 \otimes \frac{\partial}{\partial l_2}h_2 = grad.h_2 \otimes h_2 \otimes (1-h_2) \\ grad.W^2_d & = \frac{\partial}{\partial W^2_d}loss = grad.l_2 \otimes \frac{\partial}{\partial W^2_d}l_2 = \frac{1}{n} \otimes (h_1^E)^T \bullet grad.l_2\\ grad.h_1^E & = \frac{\partial}{\partial h_1^E}loss = grad.l_2 \otimes \frac{\partial}{\partial h_1^E}l_2 = grad.l_2 \bullet (W^2_d)^T \\ grad.l_1 & = \frac{\partial}{\partial l_1}loss = grad.h_1 \otimes \frac{\partial}{\partial l_1}h_1 = grad.h_1 \otimes h_1 \otimes (1-h_1) \\ grad.W^1_d & = \frac{\partial}{\partial W^1_d}loss = grad.l_1 \otimes \frac{\partial}{\partial W^1_d}l_1 = \frac{1}{n} \otimes (x^E)^T \bullet grad.l_1 \end{align} \]
DEEP_MLP_Trainer = function (num.iteration = 500, num.hidden = c(10, 10, 10), lr = 0.05,
x1 = x1, x2 = x2, y = y, eps = 1e-8) {
#Functions
#Forward
S.fun = function (x, eps = 1e-5) {
S = 1/(1 + exp(-x))
S[S < eps] = eps
S[S > 1 - eps] = 1 - eps
return(S)
}
L.fun = function (X, W) {
X.E = cbind(1, X)
L = X.E %*% W
return(L)
}
CE.fun = function (o, y, eps = 1e-5) {
loss = -1/length(y) * sum(y * log(o + eps) + (1 - y) * log(1 - o + eps))
return(loss)
}
#Backward
grad_o.fun = function (o, y) {
return((o - y)/(o*(1-o)))
}
grad_s.fun = function (grad_o, o) {
return(grad_o*(o*(1-o)))
}
grad_W.fun = function (grad_l, h) {
h.E = cbind(1, h)
return(t(h.E) %*% grad_l/nrow(h))
}
grad_h.fun = function (grad_l, W) {
return(grad_l %*% t(W[-1,]))
}
grad_l.fun = function (grad_h, h) {
return(grad_h * h * (1 - h))
}
#initialization
X_matrix = cbind(x1, x2)
Y_matrix = t(t(y))
W_list = list()
len_h = length(num.hidden)
for (w_seq in 1:(len_h+1)) {
if (w_seq == 1) {
NROW_W = ncol(X_matrix) + 1
NCOL_W = num.hidden[w_seq]
} else if (w_seq == len_h+1) {
NROW_W = num.hidden[w_seq - 1] + 1
NCOL_W = ncol(Y_matrix)
} else {
NROW_W = num.hidden[w_seq - 1] + 1
NCOL_W = num.hidden[w_seq]
}
W_list[[w_seq]] = matrix(rnorm(NROW_W*NCOL_W, sd = 1), nrow = NROW_W, ncol = NCOL_W)
}
loss_seq = rep(0, num.iteration)
#Caculating
for (i in 1:num.iteration) {
#Forward
current_l_list = list()
current_h_list = list()
for (j in 1:len_h) {
if (j == 1) {
current_l_list[[j]] = L.fun(X = X_matrix, W = W_list[[j]])
} else {
current_l_list[[j]] = L.fun(X = current_h_list[[j-1]], W = W_list[[j]])
}
current_h_list[[j]] = S.fun(x = current_l_list[[j]])
}
current_l_list[[len_h+1]] = L.fun(X = current_h_list[[len_h]], W = W_list[[len_h+1]])
current_o = S.fun(x = current_l_list[[len_h+1]], eps = eps)
loss_seq[i] = CE.fun(o = current_o, y = y, eps = eps)
#Backward
current_grad_l_list = list()
current_grad_W_list = list()
current_grad_h_list = list()
current_grad_o = grad_o.fun(o = current_o, y = y)
current_grad_l_list[[len_h+1]] = grad_s.fun(grad_o = current_grad_o, o = current_o)
current_grad_W_list[[len_h+1]] = grad_W.fun(grad_l = current_grad_l_list[[len_h+1]], h = current_h_list[[len_h]])
for (j in len_h:1) {
current_grad_h_list[[j]] = grad_h.fun(grad_l = current_grad_l_list[[j+1]], W = W_list[[j+1]])
current_grad_l_list[[j]] = grad_l.fun(grad_h = current_grad_h_list[[j]], h = current_h_list[[j]])
if (j != 1) {
current_grad_W_list[[j]] = grad_W.fun(grad_l = current_grad_l_list[[j]], h = current_h_list[[j - 1]])
} else {
current_grad_W_list[[j]] = grad_W.fun(grad_l = current_grad_l_list[[j]], h = X_matrix)
}
}
for (j in 1:(len_h+1)) {
W_list[[j]] = W_list[[j]] - lr * current_grad_W_list[[j]]
}
}
require(scales)
require(plot3D)
x1_seq = seq(min(x1), max(x1), length.out = 100)
x2_seq = seq(min(x2), max(x2), length.out = 100)
pre_func = function (x1, x2) {
new_X = cbind(x1, x2)
current_l_list = list()
current_h_list = list()
for (j in 1:len_h) {
if (j == 1) {
current_l_list[[j]] = L.fun(X = new_X, W = W_list[[j]])
} else {
current_l_list[[j]] = L.fun(X = current_h_list[[j-1]], W = W_list[[j]])
}
current_h_list[[j]] = S.fun(x = current_l_list[[j]])
}
current_l_list[[len_h+1]] = L.fun(X = current_h_list[[len_h]], W = W_list[[len_h+1]])
current_o = S.fun(x = current_l_list[[len_h+1]], eps = eps)
return(current_o)
}
pred_y = pre_func(x1 = x1, x2 = x2)
z_matrix = sapply(x2_seq, function(x) {pre_func(x1 = x1_seq, x2 = x)})
par(mfrow = c(1, 2))
image2D(z = z_matrix,
x = x1_seq, xlab = 'x1',
y = x2_seq, ylab = 'x2',
shade = 0.2, rasterImage = TRUE,
col = colorRampPalette(c("#FFA0A0", "#FFFFFF", "#A0A0FF"))(100))
points(x1, x2, col = (y + 1)*2, pch = 19, cex = 0.5)
plot(loss_seq, type = 'l', main = 'loss', xlab = 'iter.', ylab = 'CE loss')
}
set.seed(0)
x1 <- rnorm(300, sd = 2)
x2 <- rnorm(300, sd = 2)
lp <- x1^2 + x2^2                         # squared distance from the origin
y <- (lp > 9 | (lp < 4 & lp > 1)) + 0L    # label the outer region plus an inner ring
DEEP_MLP_Trainer(num.iteration = 10000, num.hidden = c(20, 20, 20, 20, 20), lr = 0.1, x1 = x1, x2 = x2, y = y)
– Let's try an even deeper network...
DEEP_MLP_Trainer(num.iteration = 10000, num.hidden = rep(10, 10), lr = 0.1, x1 = x1, x2 = x2, y = y)
– As the network's "depth" increases, the objective function we are trying to optimize becomes more and more complex, and optimization becomes harder and harder!
– We actually ran into this problem long ago with logistic regression (if you have forgotten, see the material here): back then we used the residual sum of squares as the loss function, which left a \(p(1-p)\) factor in the partial derivatives, and the fix was to rewrite the loss function so that this factor cancels out, letting the gradient descend smoothly.
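As a reminder, that cancellation is exactly what the matrix-form gradients above already show at the output layer: the sigmoid derivative \(o \otimes (1-o)\) cancels the denominator coming from the cross-entropy loss,
\[ \frac{\partial}{\partial l_2}loss = \underbrace{\frac{o-y}{o(1-o)}}_{grad.o} \otimes \underbrace{o \otimes (1-o)}_{\frac{\partial}{\partial l_2}o} = o - y \]
The intermediate layers, however, keep their \(h \otimes (1-h)\) factors, as the gradients below show: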
\[ \begin{align} grad.l_2 & = \frac{\partial}{\partial l_2}loss = grad.h_2 \otimes \frac{\partial}{\partial l_2}h_2 = grad.h_2 \otimes h_2 \otimes (1-h_2) \\ grad.h_1^E & = \frac{\partial}{\partial h_1^E}loss = grad.l_2 \otimes \frac{\partial}{\partial h_1^E}l_2 = grad.l_2 \bullet (W^2_d)^T \\ grad.l_1 & = \frac{\partial}{\partial l_1}loss = grad.h_1 \otimes \frac{\partial}{\partial l_1}h_1 = grad.h_1 \otimes h_1 \otimes (1-h_1) \end{align} \]
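To see why these repeated \(h \otimes (1-h)\) factors are a problem, here is a tiny numeric illustration (not in the original slides): the sigmoid derivative never exceeds 0.25, so each additional sigmoid layer contributes another elementwise factor of at most 0.25 to the back-propagated gradient.
# The sigmoid derivative S(x)(1 - S(x)) peaks at 0.25 (at x = 0).
S <- function(x) 1 / (1 + exp(-x))
x_grid <- seq(-10, 10, length.out = 1001)
max(S(x_grid) * (1 - S(x_grid)))   # 0.25
0.25^c(1, 5, 10)                   # upper bound on the accumulated sigmoid-derivative factor over 1, 5, 10 layers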
– With "deep neural networks" we now have an extremely powerful nonlinear classifier in hand; if you let it, it can memorize essentially all of the information in the existing sample in some form and reach extremely high accuracy, but its ability to extrapolate to new real-world samples then becomes questionable.
– Do not forget the experimental workflow of data science: you need to set aside test samples to obtain the real accuracy!
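– A minimal sketch of that workflow (not in the original slides; it assumes the x1, x2, y simulated just above, and note that DEEP_MLP_Trainer as written only draws plots, so it would have to be modified to return predictions before the held-out set can actually be scored):
# Hold out 30% of the samples; train on the rest and report accuracy on the held-out part.
set.seed(1)
test_id  <- sample(length(y), size = round(0.3 * length(y)))
train_x1 <- x1[-test_id]; train_x2 <- x2[-test_id]; train_y <- y[-test_id]
test_x1  <- x1[test_id];  test_x2  <- x2[test_id];  test_y  <- y[test_id]
DEEP_MLP_Trainer(num.iteration = 10000, num.hidden = rep(10, 10), lr = 0.1,
                 x1 = train_x1, x2 = train_x2, y = train_y)
# Scoring test_x1/test_x2 against test_y would require the trainer to also
# return its fitted weights or a prediction function.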
We always need a nonlinear transformation between layers, yet nonlinear transformations are exactly what make the gradients unstable.
Because each layer's gradient passes through a different number of "nonlinear-transformation derivatives", it is impossible to design a single "loss function" that suits every layer at once.