Deep Learning: Theory and Practice

林嶔 (Lin, Chin)

Lesson 1: Introduction to Deep Neural Networks

Course Introduction

[Figure F01]

– However, this course focuses on a broad introduction to deep learning. If you want to go deeper into each application and further improve the accuracy of your final models, you are strongly encouraged to take 「醫療人工智慧實作」 (Practical Medical Artificial Intelligence), which uses data science competitions to make you familiar with the modeling workflow.

– Therefore, the emphasis of deep learning is the analysis of unstructured data such as images, physiological signals, and text. Compared with other algorithms, backpropagation is highly extensible, and a large number of models for different tasks have been built on top of it.

Section 1: Fundamentals (1)

– In supervised learning we build a prediction function \(f\) that maps the input \(x\) to a prediction \(\hat{y}\), measure how far the predictions are from the true answers \(y\) with a loss function, and then look for the parameters that minimize that loss:

\[\hat{y} = f(x)\]

\[loss = diff(y, \hat{y})\]

\[loss = diff(y, f(x))\]

\[min(loss)\]
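– The whole recipe can be condensed into a few lines of R. The following toy sketch (all names and numbers are purely illustrative, not part of the course material) defines a one-parameter prediction function, a mean-squared-error loss, and finds the minimizing parameter by a simple grid search:

set.seed(123)
x_toy <- runif(50)
y_toy <- 3 * x_toy + rnorm(50, sd = 0.1)             # data generated with true slope 3
f <- function(x, w) {w * x}                          # prediction function with one parameter w
loss <- function(w) {mean((y_toy - f(x_toy, w))^2)}  # "diff" between y and f(x): mean squared error
w_grid <- seq(0, 6, by = 0.01)
w_grid[which.min(sapply(w_grid, loss))]              # the minimizing w is close to the true value 3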

Section 1: Fundamentals (2)

– Consider the following function:

\[f(x) = x^{2} + 2x + 1\]

– Next, we differentiate this function and look for where the derivative equals 0; that location is the function's extremum:

\[\frac{\partial}{\partial x} f(x) = 2x + 2 = 0\]

\[x = -1\]
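– As a quick sanity check (not part of the original derivation), base R's optimize() locates the same minimum numerically:

f <- function(x) {x^2 + 2 * x + 1}
optimize(f, interval = c(-10, 10))   # $minimum is approximately -1, $objective approximately 0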

Section 1: Fundamentals (3)

– Instead of solving analytically, gradient descent starts from an initial value and repeatedly moves it in the direction opposite to the gradient, with the step size controlled by the learning rate \(lr\):

\[x_{\left(epoch:0\right)} = 10\]

\[x_{\left(epoch:t\right)} = x_{\left(epoch:t - 1\right)} - lr \cdot \frac{\partial}{\partial x}f(x_{\left(epoch:t - 1\right)})\]

– Since the derivative of the function above is \(2x + 2\), we can substitute it into the update rule (here with \(lr = 0.05\)):

\[ \begin{align} x_{\left(epoch:1\right)} & = x_{\left(epoch:0\right)} - lr \cdot \frac{\partial}{\partial x}f(x_{\left(epoch:0\right)}) \\ & = 10 - lr \cdot \frac{\partial}{\partial x}f(10) \\ & = 10 - 0.05 \cdot (2\cdot10+2)\\ & = 8.9 \end{align} \]

Section 1: Fundamentals (4)

\[ \begin{align} x_{\left(epoch:2\right)} & = x_{\left(epoch:1\right)} - lr \cdot \frac{\partial}{\partial x}f(x_{\left(epoch:1\right)}) \\ & = 8.9 - lr \cdot \frac{\partial}{\partial x}f(8.9) \\ & = 8.9 - 0.05 \cdot (2\cdot8.9+2)\\ & = 7.91 \end{align} \]

\[ \begin{align} x_{\left(epoch:3\right)} & = 7.91 - 0.891 = 7.019 \\ x_{\left(epoch:4\right)} & = 7.019 - 0.8019 = 6.2171 \\ x_{\left(epoch:5\right)} & = 6.2171 - 0.72171 = 5.49539 \\ & \dots \\ x_{\left(epoch:\infty\right)} & = -1 \end{align} \]
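– The following minimal R loop (using the same starting value of 10 and \(lr = 0.05\) as above) reproduces this sequence of updates:

lr <- 0.05
x_epoch <- 10                                   # x at epoch 0
for (i in 1:5) {
  x_epoch <- x_epoch - lr * (2 * x_epoch + 2)   # the derivative of x^2 + 2x + 1 is 2x + 2
  cat(sprintf("epoch %d: x = %.5f\n", i, x_epoch))
}
# epoch 1: x = 8.90000, ..., epoch 5: x = 5.49539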

Section 2: Backpropagation and the Multilayer Perceptron (1)

– We are well aware of the limitations of linear models: we need our prediction function to have sufficient capacity for nonlinear fitting.

[Figure F04]

[Figure F02]

  1. Linear prediction function L:

\[L^k(x_1, x_2) = w_{0}^k + w_{1}^kx_1 + w_{2}^kx_2\]

  2. Logistic transformation function S:

\[ \begin{align} S(x) & = \frac{{1}}{1+e^{-x}} \end{align} \]

  3. Multilayer perceptron prediction function (a short R sketch follows this list):

\[ \begin{align} h_1 & = S(L^1(x_1, x_2)) \\ h_2 & = S(L^2(x_1, x_2)) \\ o & = S(L^3(h_1, h_2)) \end{align} \]
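– Below is a minimal sketch of this forward pass for a single illustrative observation, plugging in the weights that will later be used to simulate data in Exercise 1 (the "_demo" names are made up for illustration):

S <- function(x) {1 / (1 + exp(-x))}           # logistic transformation
x1_demo <- 0.5; x2_demo <- -1                  # one illustrative observation
h1_demo <- S(-4 - 4 * x1_demo - 3 * x2_demo)   # h1 = S(L^1(x1, x2))
h2_demo <- S( 7 - 6 * x1_demo - 1 * x2_demo)   # h2 = S(L^2(x1, x2))
o_demo  <- S( 4 + 6 * h1_demo - 7 * h2_demo)   # o  = S(L^3(h1, h2))
o_demo                                         # predicted probability of y = 1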

Section 2: Backpropagation and the Multilayer Perceptron (2)

\[ \begin{align} h_1 & = S(L^1(x_1, x_2)) \\ h_2 & = S(L^2(x_1, x_2)) \\ o & = S(L^3(h_1, h_2)) \\ loss & = CE(y, o) = \frac{{1}}{n}\sum \limits_{i=1}^{n} -\left(y_{i} \cdot log(o_{i}) + (1-y_{i}) \cdot log(1-o_{i})\right) \end{align} \]
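– As a tiny worked example of this cross-entropy loss, with made-up predictions and labels:

o_demo <- c(0.9, 0.2, 0.7)   # predicted probabilities for three observations
y_demo <- c(1, 0, 0)         # their true labels
-mean(y_demo * log(o_demo) + (1 - y_demo) * log(1 - o_demo))   # CE loss, about 0.51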

– Below are the partial derivatives with respect to \(w_{0}^1\), \(w_{1}^1\), \(w_{2}^1\), \(w_{0}^2\), \(w_{1}^2\), \(w_{2}^2\), \(w_{0}^3\), \(w_{1}^3\), and \(w_{2}^3\); for the detailed derivation, please see here.

\[ \begin{align} \frac{\partial}{\partial w_{0}^3}CE(y, o) & = \frac{{1}}{n}\sum \limits_{i=1}^{n} \left( o_i-y_i \right) \\ \frac{\partial}{\partial w_{1}^3}CE(y, o) & = \frac{{1}}{n}\sum \limits_{i=1}^{n} \left( o_i-y_i \right) \cdot h_{1i} \\ \frac{\partial}{\partial w_{2}^3}CE(y, o) & = \frac{{1}}{n}\sum \limits_{i=1}^{n}\left( o_i-y_i \right) \cdot h_{2i} \\ \frac{\partial}{\partial w_{0}^2}CE(y, o) & = \frac{{1}}{n}\sum \limits_{i=1}^{n} \left( o_i-y_i \right) \cdot w_{2}^3 \cdot h_{2i} (1 - h_{2i}) \\ \frac{\partial}{\partial w_{1}^2}CE(y, o) & = \frac{{1}}{n}\sum \limits_{i=1}^{n} \left( o_i-y_i \right) \cdot w_{2}^3 \cdot h_{2i} (1 - h_{2i}) \cdot x_{1i} \\ \frac{\partial}{\partial w_{2}^2}CE(y, o) & = \frac{{1}}{n}\sum \limits_{i=1}^{n} \left( o_i-y_i \right) \cdot w_{2}^3 \cdot h_{2i} (1 - h_{2i}) \cdot x_{2i} \\ \frac{\partial}{\partial w_{0}^1}CE(y, o) & = \frac{{1}}{n}\sum \limits_{i=1}^{n} \left( o_i-y_i \right) \cdot w_{1}^3 \cdot h_{1i} (1 - h_{1i}) \\ \frac{\partial}{\partial w_{1}^1}CE(y, o) & = \frac{{1}}{n}\sum \limits_{i=1}^{n} \left( o_i-y_i \right) \cdot w_{1}^3 \cdot h_{1i} (1 - h_{1i}) \cdot x_{1i} \\ \frac{\partial}{\partial w_{2}^1}CE(y, o) & = \frac{{1}}{n}\sum \limits_{i=1}^{n} \left( o_i-y_i \right) \cdot w_{1}^3 \cdot h_{1i} (1 - h_{1i}) \cdot x_{2i} \end{align} \]

Exercise 1: Solving a Multilayer Perceptron with Gradient Descent (1)

– To generate the simulated data, we use exactly the same architecture as the multilayer perceptron, so we know exactly what the coefficients should later turn out to be.

– The values we assign are: \(w_{0}^1=-4\), \(w_{1}^1=-4\), \(w_{2}^1=-3\), \(w_{0}^2=7\), \(w_{1}^2=-6\), \(w_{2}^2=-1\), \(w_{0}^3=4\), \(w_{1}^3=6\), and \(w_{2}^3=-7\).

set.seed(0)
x1 <- rnorm(500, sd = 1) 
x2 <- rnorm(500, sd = 1) 
l1 <- -4 - 4 * x1 - 3 * x2
l2 <- 7 - 6 * x1 - 1 * x2
h1 <- 1 / (1 + exp(-l1))
h2 <- 1 / (1 + exp(-l2))
l3 <- 4 + 6 * h1 - 7 * h2
o <- 1 / (1 + exp(-l3))
y <- (o > runif(500)) + 0L
col_list <- c('#0000FF80', '#FF000080')[y + 1]
plot(x1, x2, col = col_list, xlim = c(-5, 5), ylim = c(-5, 5), pch = 19, cex = 0.5)

Exercise 1: Solving a Multilayer Perceptron with Gradient Descent (2)

#Forward

S.fun <- function (x, eps = 1e-5) {
  S = 1/(1 + exp(-x))
  S[S < eps] = eps
  S[S > 1 - eps] = 1 - eps
  return(S)
}

h1.fun <- function (w10, w11, w12, x1 = x1, x2 = x2) {
  L1 = w10 + w11 * x1 + w12 * x2
  return(S.fun(L1))
}

h2.fun <- function (w20, w21, w22, x1 = x1, x2 = x2) {
  L2 = w20 + w21 * x1 + w22 * x2
  return(S.fun(L2))
}

o.fun <- function (w30, w31, w32, h1, h2) {
  L3 = w30 + w31 * h1 + w32 * h2
  return(S.fun(L3))
}

loss.fun <- function (o, y = y) {
  loss = -1/length(y) * sum(y * log(o) + (1 - y) * log(1 - o))
  return(loss)
}

#Backward

differential.fun.w30 <- function(o, y = y) {
  return(-1/length(y)*sum(y-o))
}

differential.fun.w31 <- function(o, h1, y = y) {
  return(-1/length(y)*sum((y-o)*h1))
}

differential.fun.w32 <- function(o, h2, y = y) {
  return(-1/length(y)*sum((y-o)*h2))
}

differential.fun.w20 <- function(o, h2, w32, y = y) {
  return(-1/length(y)*sum((y-o)*w32*h2*(1-h2)))
}

differential.fun.w21 <- function(o, h2, w32, y = y, x1 = x1) {
  return(-1/length(y)*sum((y-o)*w32*h2*(1-h2)*x1))
}

differential.fun.w22 <- function(o, h2, w32, y = y, x2 = x2) {
  return(-1/length(y)*sum((y-o)*w32*h2*(1-h2)*x2))
}

differential.fun.w10 <- function(o, h1, w31, y = y) {
  return(-1/length(y)*sum((y-o)*w31*h1*(1-h1)))
}

differential.fun.w11 <- function(o, h1, w31, y = y, x1 = x1) {
  return(-1/length(y)*sum((y-o)*w31*h1*(1-h1)*x1))
}

differential.fun.w12 <- function(o, h1, w31, y = y, x2 = x2) {
  return(-1/length(y)*sum((y-o)*w31*h1*(1-h1)*x2))
}
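– As an optional sanity check (not part of the exercise itself), we can compare one of the analytic gradients above against a finite-difference approximation; this assumes the simulated data (x1, x2, y) and the functions defined above are already in the workspace:

w_test <- rnorm(9)   # random values for w10, w11, w12, w20, w21, w22, w30, w31, w32
loss_at <- function(w) {
  H1 <- h1.fun(w10 = w[1], w11 = w[2], w12 = w[3], x1 = x1, x2 = x2)
  H2 <- h2.fun(w20 = w[4], w21 = w[5], w22 = w[6], x1 = x1, x2 = x2)
  O <- o.fun(w30 = w[7], w31 = w[8], w32 = w[9], h1 = H1, h2 = H2)
  return(loss.fun(o = O, y = y))
}
test_H1 <- h1.fun(w10 = w_test[1], w11 = w_test[2], w12 = w_test[3], x1 = x1, x2 = x2)
test_H2 <- h2.fun(w20 = w_test[4], w21 = w_test[5], w22 = w_test[6], x1 = x1, x2 = x2)
test_O <- o.fun(w30 = w_test[7], w31 = w_test[8], w32 = w_test[9], h1 = test_H1, h2 = test_H2)
delta <- 1e-6
analytic <- differential.fun.w30(o = test_O, y = y)
numerical <- (loss_at(w_test + c(rep(0, 6), delta, 0, 0)) - loss_at(w_test)) / delta
c(analytic = analytic, numerical = numerical)   # the two numbers should nearly coincide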

Exercise 1 Answer (1)

num.iteration = 10000
lr = 0.1
W_matrix = matrix(0, nrow = num.iteration + 1, ncol = 9)
loss_seq = rep(0, num.iteration)
colnames(W_matrix) = c('w10', 'w11', 'w12', 'w20', 'w21', 'w22', 'w30', 'w31', 'w32')

#Start random values
W_matrix[1,] = rnorm(9, sd = 1) 

for (i in 2:(num.iteration+1)) {
  
  #Forward
  
  current_H1 = h1.fun(w10 = W_matrix[i-1,1], w11 = W_matrix[i-1,2], w12 = W_matrix[i-1,3],
                      x1 = x1, x2 = x2)
  
  current_H2 = h2.fun(w20 = W_matrix[i-1,4], w21 = W_matrix[i-1,5], w22 = W_matrix[i-1,6],
                      x1 = x1, x2 = x2)
  
  current_O = o.fun(w30 = W_matrix[i-1,7], w31 = W_matrix[i-1,8], w32 = W_matrix[i-1,9],
                    h1 = current_H1, h2 = current_H2)
  
  loss_seq[i-1] = loss.fun(o = current_O, y = y)
  
  #Backward
  
  W_matrix[i,1] = W_matrix[i-1,1] - lr * differential.fun.w10(o = current_O, h1 = current_H1,
                                       w31 = W_matrix[i-1,8], y = y)
  
  W_matrix[i,2] = W_matrix[i-1,2] - lr * differential.fun.w11(o = current_O, h1 = current_H1,
                                       w31 = W_matrix[i-1,8], y = y, x1 = x1)
  
  W_matrix[i,3] = W_matrix[i-1,3] - lr * differential.fun.w12(o = current_O, h1 = current_H1,
                                       w31 = W_matrix[i-1,8], y = y, x2 = x2)
  
  W_matrix[i,4] = W_matrix[i-1,4] - lr * differential.fun.w20(o = current_O, h2 = current_H2,
                                       w32 = W_matrix[i-1,9], y = y)
    
  W_matrix[i,5] = W_matrix[i-1,5] - lr * differential.fun.w21(o = current_O, h2 = current_H2,
                                       w32 = W_matrix[i-1,9], y = y, x1 = x1)
  
  W_matrix[i,6] = W_matrix[i-1,6] - lr * differential.fun.w22(o = current_O, h2 = current_H2,
                                       w32 = W_matrix[i-1,9], y = y, x2 = x2)
    
  W_matrix[i,7] = W_matrix[i-1,7] - lr * differential.fun.w30(o = current_O, y = y)
    
  W_matrix[i,8] = W_matrix[i-1,8] - lr * differential.fun.w31(o = current_O, h1 = current_H1, y = y)
  
  W_matrix[i,9] = W_matrix[i-1,9] - lr * differential.fun.w32(o = current_O, h2 = current_H2, y = y)
  
}

Exercise 1 Answer (2)

library(scales)
library(plot3D)
  
x1_seq = seq(-5, 5, length.out = 201)
x2_seq = seq(-5, 5, length.out = 201)

pre_func = function (x1, x2, W_list = W_matrix[nrow(W_matrix),]) {
  H1 = h1.fun(w10 = W_list[1], w11 = W_list[2], w12 = W_list[3], x1 = x1, x2 = x2)
  H2 = h2.fun(w20 = W_list[4], w21 = W_list[5], w22 = W_list[6], x1 = x1, x2 = x2)
  O = o.fun(w30 = W_list[7], w31 = W_list[8], w32 = W_list[9], h1 = H1, h2 = H2)
  return(O)
}

z_matrix = sapply(x2_seq, function(x) {pre_func(x1 = x1_seq, x2 = x)})
  
  
image2D(z = z_matrix,
        x = x1_seq, xlab = 'x1', xlim = c(-5, 5),
        y = x2_seq, ylab = 'x2', ylim = c(-5, 5),
        shade = 0.2, rasterImage = TRUE,
        col = colorRampPalette(c("#FFA0A0", "#FFFFFF", "#A0A0FF"))(100))

points(x1, x2, col = (y + 1)*2, pch = 19, cex = 0.5)

Exercise 1 Discussion (1)

– Is the final result we obtained very close to the ground-truth values?

tail(W_matrix, 1)
##                w10      w11       w12      w20       w21       w22      w30
## [10001,] -3.673431 -3.44538 -2.358838 5.492932 -4.407179 -0.941106 3.677612
##               w31       w32
## [10001,] 5.642388 -6.222671
##               w10      w11      w12      w20       w21        w22      w30
## [10001,] 4.351249 4.341602 3.010642 4.877026 -4.003332 -0.8222346 7.696732
##                w31       w32
## [10001,] -4.741939 -5.340528

Exercise 1 Discussion (2)

– You should agree that, with everything else held fixed, the parameter set \(w_{0}^1=-4\), \(w_{1}^1=-4\), \(w_{2}^1=-3\) gives results that are numerically "very similar" to the set \(w_{0}^1=-8\), \(w_{1}^1=-8\), \(w_{2}^1=-6\) (after all, once passed through \(S(x)\), most of the values are nearly identical).

– Moreover, if we directly swap \(h_1 = S(L^1(x_1, x_2))\) and \(h_2 = S(L^2(x_1, x_2))\) and correspondingly exchange \(w_{1}^3\) and \(w_{2}^3\), the result is "exactly the same".
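– A small numeric illustration of this symmetry, using the simulation weights from Exercise 1 (the helper mlp() and the test point are made up for this demonstration): swapping the two hidden units together with \(w_{1}^3\) and \(w_{2}^3\) leaves the output unchanged.

S <- function(x) {1 / (1 + exp(-x))}
mlp <- function(x, W1, W2, W3) {
  h1 <- S(W1[1] + W1[2] * x[1] + W1[3] * x[2])
  h2 <- S(W2[1] + W2[2] * x[1] + W2[3] * x[2])
  S(W3[1] + W3[2] * h1 + W3[3] * h2)
}
W1 <- c(-4, -4, -3); W2 <- c(7, -6, -1); W3 <- c(4, 6, -7)   # Exercise 1 weights
x_demo <- c(0.3, -0.8)                                       # one illustrative point
mlp(x_demo, W1, W2, W3)
mlp(x_demo, W2, W1, W3[c(1, 3, 2)])   # hidden units swapped: identical output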

[Figure F02]

– This is why neural-network methods are sometimes called "black box" algorithms: we have little insight into their internal logic and only know that the final predictions are good.

[Figure F03]

Section 3: The Multilayer Perceptron in Matrix Form (1)

– Using matrices, the multilayer perceptron can be written as a simpler set of formulas (the equations below are at the level of a single observation); a short R sketch follows the list.

  1. Matrix form of the linear prediction function L (d is the output dimension):

\[ \begin{align} L^k_d(X, W^k_d) & = XW^k_d \\ X & = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,m} \end{pmatrix} \\ W^k_d & = \begin{pmatrix} w^k_{0,1} & w^k_{0,2} & \cdots & w^k_{0,d} \\ w^k_{1,1} & w^k_{1,2} & \cdots & w^k_{1,d} \\ w^k_{2,1} & w^k_{2,2} & \cdots & w^k_{2,d} \\ \vdots & \vdots & \ddots & \vdots \\ w^k_{m,1} & w^k_{m,2} & \cdots & w^k_{m,d} \end{pmatrix} \\ \frac{\partial}{\partial W^k_d}L^k_d(X, W^k_d) & = \begin{pmatrix} X^T & X^T & \cdots & X^T \end{pmatrix} \mbox{ [repeat } d \mbox{ times]} \end{align} \]

  2. Logistic transformation function S:

\[ \begin{align} S(x) & = \frac{{1}}{1+e^{-x}} \\ \frac{\partial}{\partial x}S(x) & = S(x)(1-S(x)) \end{align} \]

  3. Matrix form of the MLP prediction function; the superscript E on a matrix means that a column of 1s has been prepended as its first column (this is the single-hidden-layer network from before, with d neurons in the hidden layer; in the earlier example, d = 2):

\[ \begin{align} l_1 & = L^1_d(x^E,W^1_d) \\ h_1 & = S(l_1) \\ l_2 & = L^2_1(h_1^E,W^2_1) \\ o & = S(l_2) \end{align} \]
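– A minimal sketch of this matrix-form forward pass, with illustrative random weights and \(d = 2\) hidden neurons (the same S.fun/L.fun pattern reappears inside the training function in the next part):

set.seed(1)
S.fun <- function (x) {1 / (1 + exp(-x))}
L.fun <- function (X, W) {cbind(1, X) %*% W}   # cbind(1, X) builds the "E" (extended) matrix
X  <- matrix(rnorm(10), ncol = 2)              # 5 observations, m = 2 inputs
W1 <- matrix(rnorm(3 * 2), nrow = 3)           # (m + 1) x d
W2 <- matrix(rnorm(3), nrow = 3)               # (d + 1) x 1
h1 <- S.fun(L.fun(X, W1))                      # n x d hidden activations
o  <- S.fun(L.fun(h1, W2))                     # n x 1 predicted probabilities
o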

Section 3: The Multilayer Perceptron in Matrix Form (2)

\[ \begin{align} l_1 & = L^1_d(x^E,W^1_d) \\ h_1 & = S(l_1) \\ l_2 & = L^2_1(h_1^E,W^2_1) \\ o & = S(l_2) \\ loss & = CE(y, o) = \frac{{1}}{n}\sum \limits_{i=1}^{n} -\left(y \cdot log(o) + (1-y) \cdot log(1-o)\right) \end{align} \]

\[ \begin{align} grad.o & = \frac{\partial}{\partial o}loss = \frac{o-y}{o(1-o)} \\ grad.l_2 & = \frac{\partial}{\partial l_2}loss = grad.o \otimes \frac{\partial}{\partial l_2}o= o-y \\ grad.W^2_1 & = \frac{\partial}{\partial W^2_1}loss = grad.l_2 \otimes \frac{\partial}{\partial W^2_1}l_2 = \frac{1}{n} \otimes (h_1^E)^T \bullet grad.l_2\\ grad.h_1^E & = \frac{\partial}{\partial h_1^E}loss = grad.l_2 \otimes \frac{\partial}{\partial h_1^E}l_2 = grad.l_2 \bullet (W^2_1)^T \\ grad.l_1 & = \frac{\partial}{\partial l_1}loss = grad.h_1 \otimes \frac{\partial}{\partial l_1}h_1 = grad.h_1 \otimes h_1 \otimes (1-h_1) \\ grad.W^1_d & = \frac{\partial}{\partial W^1_d}loss = grad.l_1 \otimes \frac{\partial}{\partial W^1_d}l_1 = \frac{1}{n} \otimes (x^E)^T \bullet grad.l_1 \end{align} \]
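– Continuing the sketch from the previous part (it assumes X, W1, W2, h1, and o are still in the workspace; y_demo is made up for illustration), one hand-written backward pass and gradient-descent update looks like this:

y_demo  <- c(1, 0, 1, 0, 1)                        # illustrative labels for the 5 rows
grad_l2 <- o - y_demo                              # n x 1: grad.l2 = o - y
grad_W2 <- t(cbind(1, h1)) %*% grad_l2 / nrow(h1)  # (d + 1) x 1: (h1^E)^T . grad.l2 / n
grad_h1 <- grad_l2 %*% t(W2[-1, , drop = FALSE])   # n x d: drop the bias row of W2
grad_l1 <- grad_h1 * h1 * (1 - h1)                 # n x d
grad_W1 <- t(cbind(1, X)) %*% grad_l1 / nrow(X)    # (m + 1) x d: (x^E)^T . grad.l1 / n
W2_new  <- W2 - 0.1 * grad_W2                      # one update with lr = 0.1
W1_new  <- W1 - 0.1 * grad_W1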

Section 3: The Multilayer Perceptron in Matrix Form (3)

– Here is an MLP training function; because everything is in matrix form, we can conveniently specify the number of neurons in the hidden layer (num.hidden).

MLP_Trainer = function (num.iteration = 500, num.hidden = 2, lr = 0.1, x1 = x1, x2 = x2, y = y) {
  
  #Functions
  
  #Forward
  
  S.fun = function (x, eps = 1e-5) {
    S = 1/(1 + exp(-x))
    S[S < eps] = eps
    S[S > 1 - eps] = 1 - eps
    return(S)
  }
  
  L.fun = function (X, W) {
    X.E = cbind(1, X)
    L = X.E %*% W
    return(L)
  }
  
  CE.fun = function (o, y, eps = 1e-5) {
    loss = -1/length(y) * sum(y * log(o + eps) + (1 - y) * log(1 - o + eps))
    return(loss)
  }
  
  #Backward
  
  grad_o.fun = function (o, y) {
    return((o - y)/(o*(1-o)))
  }
  
  grad_l2.fun = function (grad_o, o) {
    return(grad_o*(o*(1-o)))
  }
  
  grad_W2.fun = function (grad_l2, h1) {
    h1.E = cbind(1, h1)
    return(t(h1.E) %*% grad_l2/nrow(h1))
  }
  
  grad_h1.fun = function (grad_l2, W2) {
    return(grad_l2 %*% t(W2[-1,]))
  }
  
  grad_l1.fun = function (grad_h1, h1) {
    return(grad_h1*(h1*(1-h1)))
  }
  
  grad_W1.fun = function (grad_l1, x) {
    x.E = cbind(1, x)
    return(t(x.E) %*% grad_l1/nrow(x))
  }
  
  #Calculating
  
  X_matrix = cbind(x1, x2)
  
  W1_list = list()
  W2_list = list()
  loss_seq = rep(0, num.iteration)
  
  #Start random values
  
  W1_list[[1]] = matrix(rnorm(3*num.hidden, sd = 1), nrow = 3, ncol = num.hidden)
  W2_list[[1]] = matrix(rnorm(num.hidden + 1, sd = 1), nrow = num.hidden + 1, ncol = 1)
  
  for (i in 2:(num.iteration+1)) {
    
    #Forward
    
    current_l1 = L.fun(X = X_matrix, W = W1_list[[i - 1]])
    current_h1 = S.fun(x = current_l1)
    current_l2 = L.fun(X = current_h1, W = W2_list[[i - 1]])
    current_o = S.fun(x = current_l2)
    loss_seq[i-1] = CE.fun(o = current_o, y = y, eps = 1e-5)
    
    #Backward
    
    current_grad_o = grad_o.fun(o = current_o, y = y)
    current_grad_l2 = grad_l2.fun(grad_o = current_grad_o, o = current_o)
    current_grad_W2 = grad_W2.fun(grad_l2 = current_grad_l2, h1 = current_h1)
    current_grad_h1 = grad_h1.fun(grad_l2 = current_grad_l2, W2 = W2_list[[i - 1]])
    current_grad_l1 = grad_l1.fun(grad_h1 = current_grad_h1, h1 = current_h1)
    current_grad_W1 = grad_W1.fun(grad_l1 = current_grad_l1, x = X_matrix)
    
    W2_list[[i]] = W2_list[[i-1]] - lr * current_grad_W2
    W1_list[[i]] = W1_list[[i-1]] - lr * current_grad_W1
    
  }
  
  require(scales)
  require(plot3D)
  
  x1_seq = seq(min(x1), max(x1), length.out = 100)
  x2_seq = seq(min(x2), max(x2), length.out = 100)
  
  pre_func = function (x1, x2, W1 = W1_list[[length(W1_list)]], W2 = W2_list[[length(W2_list)]]) {
    new_X = cbind(x1, x2)
    O = S.fun(x = L.fun(X = S.fun(x = L.fun(X = new_X, W = W1)), W = W2))
    return(O)
  }
  
  z_matrix = sapply(x2_seq, function(x) {pre_func(x1 = x1_seq, x2 = x)})
  
  par(mfrow = c(1, 2))
  
  image2D(z = z_matrix,
          x = x1_seq, xlab = 'x1',
          y = x2_seq, ylab = 'x2',
          shade = 0.2, rasterImage = TRUE,
          col = colorRampPalette(c("#FFA0A0", "#FFFFFF", "#A0A0FF"))(100))
  
  points(x1, x2, col = (y + 1)*2, pch = 19, cex = 0.5)
  
  plot(loss_seq, type = 'l', main = 'loss', xlab = 'iter.', ylab = 'CE loss')
  
}

Section 3: The Multilayer Perceptron in Matrix Form (4)

– This is the same setting as before, with 2 hidden neurons:

MLP_Trainer(num.iteration = 10000, num.hidden = 2, lr = 0.1, x1 = x1, x2 = x2, y = y)

– With a few more, 5 hidden neurons, there is essentially no change:

MLP_Trainer(num.iteration = 10000, num.hidden = 5, lr = 0.1, x1 = x1, x2 = x2, y = y)

– With an extreme number, 50 hidden neurons, the result is roughly the same, but odd decision boundaries start to appear:

MLP_Trainer(num.iteration = 10000, num.hidden = 50, lr = 0.1, x1 = x1, x2 = x2, y = y)

Section 3: The Multilayer Perceptron in Matrix Form (5)

– Now let's try the model on some other data:

set.seed(0)
x1 <- rnorm(100, sd = 1) 
x2 <- rnorm(100, sd = 1) 
lp <- -2.5 + x1^2 + 2 * x2^2
y <- (lp > 0) + 0L
MLP_Trainer(num.iteration = 10000, num.hidden = 2, lr = 0.1, x1 = x1, x2 = x2, y = y)

MLP_Trainer(num.iteration = 10000, num.hidden = 5, lr = 0.1, x1 = x1, x2 = x2, y = y)

MLP_Trainer(num.iteration = 10000, num.hidden = 50, lr = 0.1, x1 = x1, x2 = x2, y = y)

Section 4: Extending to Deeper Neural Networks (1)

– This is the original 2-layer (one hidden layer) multilayer perceptron and its loss function:

\[ \begin{align} l_1 & = L^1_d(x^E,W^1_d) \\ h_1 & = S(l_1) \\ l_2 & = L^2_1(h_1^E,W^2_1) \\ o & = S(l_2) \\ loss & = CE(y, o) = \frac{{1}}{n}\sum \limits_{i=1}^{n} -\left(y \cdot log(o) + (1-y) \cdot log(1-o)\right) \end{align} \]

– And this is its extension with one additional hidden layer:

\[ \begin{align} l_1 & = L^1_d(x^E,W^1_d) \\ h_1 & = S(l_1) \\ l_2 & = L^2_d(h_1^E,W^2_d) \\ h_2 & = S(l_2) \\ l_3 & = L^3_1(h_2^E,W^3_1) \\ o & = S(l_3) \\ loss & = CE(y, o) = \frac{{1}}{n}\sum \limits_{i=1}^{n} -\left(y \cdot log(o) + (1-y) \cdot log(1-o)\right) \end{align} \]

Section 4: Extending to Deeper Neural Networks (2)

\[ \begin{align} grad.o & = \frac{\partial}{\partial o}loss = \frac{o-y}{o(1-o)} \\ grad.l_3 & = \frac{\partial}{\partial l_3}loss = grad.o \otimes \frac{\partial}{\partial l_3}o= o-y \\ grad.W^3_1 & = \frac{\partial}{\partial W^3_1}loss = grad.l_3 \otimes \frac{\partial}{\partial W^3_1}l_3 = \frac{1}{n} \otimes (h_2^E)^T \bullet grad.l_3\\ grad.h_2^E & = \frac{\partial}{\partial h_2^E}loss = grad.l_3 \otimes \frac{\partial}{\partial h_2^E}l_3 = grad.l_3 \bullet (W^3_1)^T \\ grad.l_2 & = \frac{\partial}{\partial l_2}loss = grad.h_2 \otimes \frac{\partial}{\partial l_2}h_2 = grad.h_2 \otimes h_2 \otimes (1-h_2) \\ grad.W^2_d & = \frac{\partial}{\partial W^2_d}loss = grad.l_2 \otimes \frac{\partial}{\partial W^2_d}l_2 = \frac{1}{n} \otimes (h_1^E)^T \bullet grad.l_2\\ grad.h_1^E & = \frac{\partial}{\partial h_1^E}loss = grad.l_2 \otimes \frac{\partial}{\partial h_1^E}l_2 = grad.l_2 \bullet (W^2_d)^T \\ grad.l_1 & = \frac{\partial}{\partial l_1}loss = grad.h_1 \otimes \frac{\partial}{\partial l_1}h_1 = grad.h_1 \otimes h_1 \otimes (1-h_1) \\ grad.W^1_d & = \frac{\partial}{\partial W^1_d}loss = grad.l_1 \otimes \frac{\partial}{\partial W^1_d}l_1 = \frac{1}{n} \otimes (x^E)^T \bullet grad.l_1 \end{align} \]

Section 4: Extending to Deeper Neural Networks (3)

DEEP_MLP_Trainer = function (num.iteration = 500, num.hidden = c(10, 10, 10), lr = 0.05,
                             x1 = x1, x2 = x2, y = y, eps = 1e-8) {
  
  #Functions
  
  #Forward
  
  S.fun = function (x, eps = 1e-5) {
    S = 1/(1 + exp(-x))
    S[S < eps] = eps
    S[S > 1 - eps] = 1 - eps
    return(S)
  }
  
  L.fun = function (X, W) {
    X.E = cbind(1, X)
    L = X.E %*% W
    return(L)
  }
  
  CE.fun = function (o, y, eps = 1e-5) {
    loss = -1/length(y) * sum(y * log(o + eps) + (1 - y) * log(1 - o + eps))
    return(loss)
  }
  
  #Backward
  
  grad_o.fun = function (o, y) {
    return((o - y)/(o*(1-o)))
  }
  
  grad_s.fun = function (grad_o, o) {
    return(grad_o*(o*(1-o)))
  }
  
  grad_W.fun = function (grad_l, h) {
    h.E = cbind(1, h)
    return(t(h.E) %*% grad_l/nrow(h))
  }
  
  grad_h.fun = function (grad_l, W) {
    return(grad_l %*% t(W[-1,]))
  }
  
  grad_l.fun = function (grad_h, h) {
    return(grad_h * h * (1 - h))
  }
  
  #initialization
  
  X_matrix = cbind(x1, x2)
  Y_matrix = t(t(y))
  
  W_list = list()
  
  len_h = length(num.hidden)
  
  for (w_seq in 1:(len_h+1)) {
    if (w_seq == 1) {
      NROW_W = ncol(X_matrix) + 1
      NCOL_W = num.hidden[w_seq]
    } else if (w_seq == len_h+1) {
      NROW_W = num.hidden[w_seq - 1] + 1
      NCOL_W = ncol(Y_matrix)
    } else {
      NROW_W = num.hidden[w_seq - 1] + 1
      NCOL_W = num.hidden[w_seq]
    }
    W_list[[w_seq]] = matrix(rnorm(NROW_W*NCOL_W, sd = 1), nrow = NROW_W, ncol = NCOL_W)
  }
  
  loss_seq = rep(0, num.iteration)
  
  #Calculating
  
  for (i in 1:num.iteration) {
    
    #Forward
    
    current_l_list = list()
    current_h_list = list()
    
    for (j in 1:len_h) {
      if (j == 1) {
        current_l_list[[j]] = L.fun(X = X_matrix, W = W_list[[j]])
      } else {
        current_l_list[[j]] = L.fun(X = current_h_list[[j-1]], W = W_list[[j]])
      }
      current_h_list[[j]] = S.fun(x = current_l_list[[j]])
    }
    current_l_list[[len_h+1]] = L.fun(X = current_h_list[[len_h]], W = W_list[[len_h+1]])
    current_o = S.fun(x = current_l_list[[len_h+1]], eps = eps)
    loss_seq[i] = CE.fun(o = current_o, y = y, eps = eps)
    
    #Backward
    
    current_grad_l_list = list()
    current_grad_W_list = list()
    current_grad_h_list = list()
    
    current_grad_o = grad_o.fun(o = current_o, y = y)
    current_grad_l_list[[len_h+1]] = grad_s.fun(grad_o = current_grad_o, o = current_o)
    current_grad_W_list[[len_h+1]] = grad_W.fun(grad_l = current_grad_l_list[[len_h+1]], h = current_h_list[[len_h]])
    
    for (j in len_h:1) {
      current_grad_h_list[[j]] = grad_h.fun(grad_l = current_grad_l_list[[j+1]], W = W_list[[j+1]])
      current_grad_l_list[[j]] = grad_l.fun(grad_h = current_grad_h_list[[j]], h = current_h_list[[j]])
      if (j != 1) {
        current_grad_W_list[[j]] = grad_W.fun(grad_l = current_grad_l_list[[j]], h = current_h_list[[j - 1]])
      } else {
        current_grad_W_list[[j]] = grad_W.fun(grad_l = current_grad_l_list[[j]], h = X_matrix)
      }
    }
    
    for (j in 1:(len_h+1)) {
      
      W_list[[j]] = W_list[[j]] - lr * current_grad_W_list[[j]]
      
    }
    
  }
  
  require(scales)
  require(plot3D)
  
  x1_seq = seq(min(x1), max(x1), length.out = 100)
  x2_seq = seq(min(x2), max(x2), length.out = 100)
  
  pre_func = function (x1, x2) {
    new_X = cbind(x1, x2)
    
    current_l_list = list()
    current_h_list = list()
    
    for (j in 1:len_h) {
      if (j == 1) {
        current_l_list[[j]] = L.fun(X = new_X, W = W_list[[j]])
      } else {
        current_l_list[[j]] = L.fun(X = current_h_list[[j-1]], W = W_list[[j]])
      }
      current_h_list[[j]] = S.fun(x = current_l_list[[j]])
    }
    
    current_l_list[[len_h+1]] = L.fun(X = current_h_list[[len_h]], W = W_list[[len_h+1]])
    current_o = S.fun(x = current_l_list[[len_h+1]], eps = eps)
    
    return(current_o)
  }
  
  pred_y = pre_func(x1 = x1, x2 = x2)
  z_matrix = sapply(x2_seq, function(x) {pre_func(x1 = x1_seq, x2 = x)})
  
  par(mfrow = c(1, 2))
  
  image2D(z = z_matrix,
          x = x1_seq, xlab = 'x1',
          y = x2_seq, ylab = 'x2',
          shade = 0.2, rasterImage = TRUE,
          col = colorRampPalette(c("#FFA0A0", "#FFFFFF", "#A0A0FF"))(100))
  
  points(x1, x2, col = (y + 1)*2, pch = 19, cex = 0.5)
  plot(loss_seq, type = 'l', main = 'loss', xlab = 'iter.', ylab = 'CE loss')
  
}

Section 4: Extending to Deeper Neural Networks (4)

set.seed(0)
x1 <- rnorm(300, sd = 2) 
x2 <- rnorm(300, sd = 2) 
lp <- x1^2 +x2^2
y <- (lp > 9 | (lp < 4 & lp > 1)) + 0L
DEEP_MLP_Trainer(num.iteration = 10000, num.hidden = c(100), lr = 0.1, x1 = x1, x2 = x2, y = y)

DEEP_MLP_Trainer(num.iteration = 10000, num.hidden = c(20, 20, 20, 20, 20), lr = 0.1, x1 = x1, x2 = x2, y = y)

Exercise 2: A Deeper Neural Network

– Let's try an even deeper network…

DEEP_MLP_Trainer(num.iteration = 10000, num.hidden = rep(10, 10), lr = 0.1, x1 = x1, x2 = x2, y = y)

Exercise 2 Answer

– As the "depth" of the network increases, the objective function we are trying to optimize becomes more and more complicated, which makes it harder and harder to solve!

– We actually encountered this problem long ago with logistic regression (see here if you have forgotten): when we used the residual sum of squares as the loss function, a \(p(1-p)\) factor appeared in the partial derivatives, and the fix was to rewrite the loss function so that this factor cancelled out, allowing the gradient to descend smoothly.

\[ \begin{align} grad.l_2 & = \frac{\partial}{\partial l_2}loss = grad.h_2 \otimes \frac{\partial}{\partial l_2}h_2 = grad.h_2 \otimes h_2 \otimes (1-h_2) \\ grad.h_1^E & = \frac{\partial}{\partial h_1^E}loss = grad.l_2 \otimes \frac{\partial}{\partial h_1^E}l_2 = grad.l_2 \bullet (W^2_d)^T \\ grad.l_1 & = \frac{\partial}{\partial l_1}loss = grad.h_1 \otimes \frac{\partial}{\partial l_1}h_1 = grad.h_1 \otimes h_1 \otimes (1-h_1) \end{align} \]
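– A small numeric illustration of why these \(h \otimes (1-h)\) factors cause trouble: the derivative of the logistic function never exceeds 0.25, so (ignoring the weight matrices) each additional sigmoid layer can shrink the upstream gradient by another factor of up to 0.25.

S <- function(x) {1 / (1 + exp(-x))}
x_seq <- seq(-6, 6, by = 0.01)
max(S(x_seq) * (1 - S(x_seq)))   # 0.25, attained at x = 0
0.25^c(1, 5, 10)                 # rough upper bound on the gradient scale after 1, 5, and 10 layers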

Closing Remarks

– With deep neural networks we now have an extremely powerful nonlinear classifier; if you let it, it can memorize essentially all of the information in the existing sample in some form and achieve very high accuracy, but its ability to generalize to new real-world samples then becomes questionable.

– Do not forget the experimental workflow of data science: you need a held-out test set to obtain a realistic estimate of accuracy!

– However, two fundamental problems remain:

  1. We must have nonlinear transformations between layers, yet it is precisely these nonlinear transformations that make the gradients unstable.

  2. Because each layer's gradient passes through a different number of nonlinear-transformation derivatives, it is impossible to design a single loss function that suits every layer.