Multi-class Prediction and Avoiding Overfitting

林嶔 (Lin, Chin)

Lesson 4

Preface

– Recall the three problems that currently limit our networks:

  1. The overfitting problem - it caps the total number of parameters, and therefore the complexity, a network can afford

  2. The vanishing gradient problem - it makes the network hard to train, limiting its depth

  3. The weight initialization problem - it makes training results unstable, limiting the network's accuracy

– Before digging into the solutions to these problems, we will first pick up a few more basics of machine learning and statistics. This background lets us extend the MLP structure we have built so far in very useful ways and makes it considerably more practical.

Multi-class prediction functions (1)

\[\hat{y} = f(x)\]

Multi-class prediction functions (2)

– Recall the network we have been using for binary classification:

\[ \begin{align} l_1 & = L^1_d(x^E,W^1_d) \\ h_1 & = ReLU(l_1) \\ l_2 & = L^2_1(h_1^E,W^2_1) \\ o & = S(l_2) \\ loss & = CE(y, o) = -\left(y \cdot log(o) + (1-y) \cdot log(1-o)\right) \end{align} \]

– To predict \(m\) classes we simply widen the output layer from one unit to \(m\) units; what is still missing is an output function and a loss that can handle an \(m\)-dimensional output:

\[ \begin{align} l_1 & = L^1_d(x^E,W^1_d) \\ h_1 & = ReLU(l_1) \\ l_2 & = L^2_m(h_1^E,W^2_m) \end{align} \]

Multi-class prediction functions (3)

– The role of the output function is played by the Softmax function, which turns the \(m\) scores into probabilities that are positive and sum to 1:

\[ \begin{align} Softmax(x_j) & = \frac{e^{x_j}}{\sum \limits_{i=1}^{m} e^{x_i}} \ \ \ \ \ j \in 1 \ \mbox{to} \ m \end{align} \]

– Its partial derivatives are:

\[ \begin{align} \frac{\partial}{\partial x_i}Softmax(x_j) & = \left\{ \begin{array}{ll} Softmax(x_j)(1 - Softmax(x_i)) & \mbox{ if } i = j \\ Softmax(x_j)(0 - Softmax(x_i)) & \mbox{ otherwise} \end{array} \right. \end{align} \]
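– A minimal numerical check of this derivative in base R (the helper softmax and the test point x below are ours, written only for this check): the finite-difference Jacobian should match \(Softmax(x_j)(\delta_{ij} - Softmax(x_i))\).

softmax <- function (x) {exp(x)/sum(exp(x))}

x <- c(1.2, -0.7, 0.3)
p <- softmax(x)

# analytic Jacobian: element [j, i] is d Softmax(x_j) / d x_i = p_j * ((i == j) - p_i)
J_analytic <- outer(1:3, 1:3, function (j, i) {p[j] * ((i == j) - p[i])})

# finite-difference approximation, column i = d Softmax(x) / d x_i
eps <- 1e-6
J_numeric <- sapply(1:3, function (i) {
  x_eps <- x
  x_eps[i] <- x_eps[i] + eps
  (softmax(x_eps) - p)/eps
})

max(abs(J_analytic - J_numeric))  # should be close to 0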

Multi-class prediction functions (4)

– Earlier we designed the cross-entropy loss specifically for logistic regression so that the final optimization would behave smoothly; now we likewise design a dedicated loss function for \(Softmax()\), called the \(m\)-log likelihood function:

– Here \(m\) again denotes the number of classes and \(n\) the number of samples (the expression below is written for a single sample; over a mini-batch of \(n\) samples we simply average the per-sample losses):

\[ \begin{align} mlogloss(y, p) & = - \sum \limits_{i=1}^{m} y_{i} log(p_{i}) \end{align} \]

– The partial derivative is listed below (derivation omitted):

\[ \begin{align} \frac{\partial}{\partial p_i}mlogloss(y, p) & = - y_{i} \frac {1} {p_{i}} \end{align} \]

– Composing \(Softmax()\) and \(mlogloss()\), the resulting partial derivative turns out to be remarkably elegant:

\[ \begin{align} p_i &= Softmax(x_i)\\ loss &= mlogloss(y, p) \\\\ \frac{\partial}{\partial x_j}loss & = \sum \limits_{i=1}^{m} \frac{\partial loss}{\partial p_i} \frac{\partial p_i}{\partial x_j} \\ & = - y_{j} \frac {1} {p_{j}} p_{j} ( 1-p_{j})-\sum \limits_{i\neq j} y_{i} \frac {1} {p_{i}} p_{i} (0 - p_{j}) \\ & = - y_{j} ( 1-p_{j})-\sum \limits_{i\neq j} y_{i} (0 - p_{j}) \\ & = - y_{j} +y_{j}p_{j}+\sum \limits_{i\neq j} y_{i} p_{j} \\ & = - y_{j} +\sum \limits_{i = 1}^{m} y_{i} p_{j} \\ & = p_{j} - y_{j} \end{align} \]

– (The last step uses the fact that the one-hot label satisfies \(\sum_{i=1}^{m} y_i = 1\).)
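– The same kind of numerical check confirms the composed gradient (again a throwaway sketch; softmax and mlogloss here are small stand-ins written only for this check):

softmax  <- function (x) {exp(x)/sum(exp(x))}
mlogloss <- function (y, p) {-sum(y * log(p))}

x <- c(0.5, -1.0, 2.0)
y <- c(0, 0, 1)                     # one-hot label

grad_analytic <- softmax(x) - y     # the result derived above: p - y

# numerical gradient of mlogloss(y, Softmax(x)) with respect to x
eps <- 1e-6
grad_numeric <- sapply(1:3, function (j) {
  x_eps <- x
  x_eps[j] <- x_eps[j] + eps
  (mlogloss(y, softmax(x_eps)) - mlogloss(y, softmax(x)))/eps
})

max(abs(grad_analytic - grad_numeric))  # should be close to 0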

Exercise 1: Using the multi-class functions to predict the IRIS dataset (1)

[Figure F4_1]

– The dataset contains 150 samples, all belonging to three subspecies of the genus Iris: setosa, versicolor, and virginica. Four features serve as quantitative measurements of each sample: the length and width of the sepal and of the petal.

data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Exercise 1: Using the multi-class functions to predict the IRIS dataset (2)

X = as.matrix(iris[,-5])
Y = matrix(model.matrix(~-1 + iris[,5]), nrow = 150, ncol = 3)
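– The model.matrix(~ -1 + iris[,5]) call turns the three-level Species factor into a 150 × 3 one-hot indicator matrix Y. A quick sanity check (a throwaway snippet, not part of the exercise code):

stopifnot(all(rowSums(Y) == 1))  # every sample belongs to exactly one class
colSums(Y)                       # 50 samples per species in iris

– The trainer below is the binary MLP from the previous lessons, shown as the starting point; the exercise is to adapt it to the multi-class setting with \(Softmax()\) and \(mlogloss()\).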
DEEP_MLP_Trainer = function (num.iteration = 500, num.hidden = c(10, 10, 10), batch_size = 30,
                             lr = 0.001, beta1 = 0.9, beta2 = 0.999, optimizer = 'adam', eps = 1e-8,
                             x1 = x1, x2 = x2, y = y,
                             test_x1 = NULL, test_x2 = NULL, test_y = NULL) {
  
  #Functions
  
  #Forward
  
  S.fun = function (x, eps = eps) {
    S = 1/(1 + exp(-x))
    S[S < eps] = eps
    S[S > 1 - eps] = 1 - eps
    return(S)
  }
  
  ReLU.fun = function (x) {
    x[x < 0] <- 0
    return(x)
  }
  
  L.fun = function (X, W) {
    X.E = cbind(1, X)
    L = X.E %*% W
    return(L)
  }
  
  CE.fun = function (o, y, eps = eps) {
    loss = -1/length(y) * sum(y * log(o + eps) + (1 - y) * log(1 - o + eps))
    return(loss)
  }
  
  #Backward
  
  grad_o.fun = function (o, y) {
    return((o - y)/(o*(1-o)))
  }
  
  grad_s.fun = function (grad_o, o) {
    return(grad_o*(o*(1-o)))
  }
  
  grad_W.fun = function (grad_l, h) {
    h.E = cbind(1, h)
    return(t(h.E) %*% grad_l/nrow(h))
  }
  
  grad_h.fun = function (grad_l, W) {
    return(grad_l %*% t(W[-1,]))
  }
  
  grad_l.fun = function (grad_h, l) {
    de_l = l
    de_l[de_l<0] = 0
    de_l[de_l>0] = 1
    return(grad_h*de_l)
  }
  
  #initialization
  
  X_matrix = cbind(x1, x2)
  Y_matrix = t(t(y))
  
  W_list = list()
  M_list = list()
  N_list = list()
  
  len_h = length(num.hidden)
  
  for (w_seq in 1:(len_h+1)) {
    if (w_seq == 1) {
      NROW_W = ncol(X_matrix) + 1
      NCOL_W = num.hidden[w_seq]
    } else if (w_seq == len_h+1) {
      NROW_W = num.hidden[w_seq - 1] + 1
      NCOL_W = ncol(Y_matrix)
    } else {
      NROW_W = num.hidden[w_seq - 1] + 1
      NCOL_W = num.hidden[w_seq]
    }
    W_list[[w_seq]] = matrix(rnorm(NROW_W*NCOL_W, sd = 1), nrow = NROW_W, ncol = NCOL_W)
    M_list[[w_seq]] = matrix(0, nrow = NROW_W, ncol = NCOL_W)
    N_list[[w_seq]] = matrix(0, nrow = NROW_W, ncol = NCOL_W)
  }
  
  loss_seq = rep(0, num.iteration)
  
  #Calculating
  
  for (i in 1:num.iteration) {
    
    idx = sample(1:length(x1), batch_size)
    
    #Forward
    
    current_l_list = list()
    current_h_list = list()
    
    for (j in 1:len_h) {
      if (j == 1) {
        current_l_list[[j]] = L.fun(X = X_matrix[idx,], W = W_list[[j]])
      } else {
        current_l_list[[j]] = L.fun(X = current_h_list[[j-1]], W = W_list[[j]])
      }
      current_h_list[[j]] = ReLU.fun(x = current_l_list[[j]])
    }
    current_l_list[[len_h+1]] = L.fun(X = current_h_list[[len_h]], W = W_list[[len_h+1]])
    current_o = S.fun(x = current_l_list[[len_h+1]], eps = eps)
    loss_seq[i] = CE.fun(o = current_o, y = y[idx], eps = eps)
    
    #Backward
    
    current_grad_l_list = list()
    current_grad_W_list = list()
    current_grad_h_list = list()
    
    current_grad_o = grad_o.fun(o = current_o, y = y[idx])
    current_grad_l_list[[len_h+1]] = grad_s.fun(grad_o = current_grad_o, o = current_o)
    current_grad_W_list[[len_h+1]] = grad_W.fun(grad_l = current_grad_l_list[[len_h+1]], h = current_h_list[[len_h]])
    
    for (j in len_h:1) {
      current_grad_h_list[[j]] = grad_h.fun(grad_l = current_grad_l_list[[j+1]], W = W_list[[j+1]])
      current_grad_l_list[[j]] = grad_l.fun(grad_h = current_grad_h_list[[j]], l = current_l_list[[j]])
      if (j != 1) {
        current_grad_W_list[[j]] = grad_W.fun(grad_l = current_grad_l_list[[j]], h = current_h_list[[j - 1]])
      } else {
        current_grad_W_list[[j]] = grad_W.fun(grad_l = current_grad_l_list[[j]], h = X_matrix[idx,])
      }
    }
    
    if (optimizer == 'adam') {
      
      for (j in 1:(len_h+1)) {
        M_list[[j]] = beta1 * M_list[[j]] + (1 - beta1) * current_grad_W_list[[j]]
        N_list[[j]] = beta2 * N_list[[j]] + (1 - beta2) * current_grad_W_list[[j]]^2
        M.hat = M_list[[j]]/(1 - beta1^i)
        N.hat = N_list[[j]]/(1 - beta2^i)
        W_list[[j]] = W_list[[j]] - lr*M.hat/sqrt(N.hat+eps)
      }
      
    } else if (optimizer == 'sgd') {
      
      for (j in 1:(len_h+1)) {
        M_list[[j]] = beta1 * M_list[[j]] + lr * current_grad_W_list[[j]]
        W_list[[j]] = W_list[[j]] - M_list[[j]]
      }
      
    } else {
      stop('optimizer must be selected from "sgd" or "adam".')
    }
    
  }
  
  plot(loss_seq, type = 'l', main = 'loss', xlab = 'iter.', ylab = 'CE loss')
  
}

Exercise 1 answer (code)

DEEP_MLP_Trainer = function (num.iteration = 500, num.hidden = c(10, 10, 10), batch_size = 30,
                             lr = 0.001, beta1 = 0.9, beta2 = 0.999, optimizer = 'adam', eps = 1e-8,
                             X_matrix = X, Y_matrix = Y) {
  
  #Functions
  
  #Forward
  
  Softmax.fun = function (x, eps = eps) {
    # subtract the row-wise maximum before exponentiating to keep exp() from overflowing
    out <- apply(x, 1, function (i) {exp(i - max(i))/sum(exp(i - max(i)))})
    return(t(out))
  }
  
  ReLU.fun = function (x) {
    x[x < 0] <- 0
    return(x)
  }
  
  L.fun = function (X, W) {
    X.E = cbind(1, X)
    L = X.E %*% W
    return(L)
  }
  
  Mlogloss.fun = function (o, y, eps = eps) {
    loss = -1/nrow(y) * sum(y * log(o + eps))
    return(loss)
  }
  
  #Backward
  
  grad_o_s.fun = function (o, y) {
    return(o - y)
  }
  
  grad_W.fun = function (grad_l, h) {
    h.E = cbind(1, h)
    return(t(h.E) %*% grad_l/nrow(h))
  }
  
  grad_h.fun = function (grad_l, W) {
    return(grad_l %*% t(W[-1,]))
  }
  
  grad_l.fun = function (grad_h, l) {
    de_l = l
    de_l[de_l<0] = 0
    de_l[de_l>0] = 1
    return(grad_h*de_l)
  }
  
  #initialization
  
  W_list = list()
  M_list = list()
  N_list = list()
  
  len_h = length(num.hidden)
  
  for (w_seq in 1:(len_h+1)) {
    if (w_seq == 1) {
      NROW_W = ncol(X_matrix) + 1
      NCOL_W = num.hidden[w_seq]
    } else if (w_seq == len_h+1) {
      NROW_W = num.hidden[w_seq - 1] + 1
      NCOL_W = ncol(Y_matrix)
    } else {
      NROW_W = num.hidden[w_seq - 1] + 1
      NCOL_W = num.hidden[w_seq]
    }
    W_list[[w_seq]] = matrix(rnorm(NROW_W*NCOL_W, sd = 1), nrow = NROW_W, ncol = NCOL_W)
    M_list[[w_seq]] = matrix(0, nrow = NROW_W, ncol = NCOL_W)
    N_list[[w_seq]] = matrix(0, nrow = NROW_W, ncol = NCOL_W)
  }
  
  loss_seq = rep(0, num.iteration)
  
  #Calculating
  
  for (i in 1:num.iteration) {
    
    idx = sample(1:nrow(X_matrix), batch_size)
    
    #Forward
    
    current_l_list = list()
    current_h_list = list()
    
    for (j in 1:len_h) {
      if (j == 1) {
        current_l_list[[j]] = L.fun(X = X_matrix[idx,], W = W_list[[j]])
      } else {
        current_l_list[[j]] = L.fun(X = current_h_list[[j-1]], W = W_list[[j]])
      }
      current_h_list[[j]] = ReLU.fun(x = current_l_list[[j]])
    }
    current_l_list[[len_h+1]] = L.fun(X = current_h_list[[len_h]], W = W_list[[len_h+1]])
    current_o = Softmax.fun(x = current_l_list[[len_h+1]], eps = eps)
    loss_seq[i] = Mlogloss.fun(o = current_o, y = Y_matrix[idx,], eps = eps)
    
    #Backward
    
    current_grad_l_list = list()
    current_grad_W_list = list()
    current_grad_h_list = list()
    
    current_grad_l_list[[len_h+1]] = grad_o_s.fun(o = current_o, y = Y_matrix[idx,])
    current_grad_W_list[[len_h+1]] = grad_W.fun(grad_l = current_grad_l_list[[len_h+1]], h = current_h_list[[len_h]])
    
    for (j in len_h:1) {
      current_grad_h_list[[j]] = grad_h.fun(grad_l = current_grad_l_list[[j+1]], W = W_list[[j+1]])
      current_grad_l_list[[j]] = grad_l.fun(grad_h = current_grad_h_list[[j]], l = current_l_list[[j]])
      if (j != 1) {
        current_grad_W_list[[j]] = grad_W.fun(grad_l = current_grad_l_list[[j]], h = current_h_list[[j - 1]])
      } else {
        current_grad_W_list[[j]] = grad_W.fun(grad_l = current_grad_l_list[[j]], h = X_matrix[idx,])
      }
    }
    
    if (optimizer == 'adam') {
      
      for (j in 1:(len_h+1)) {
        M_list[[j]] = beta1 * M_list[[j]] + (1 - beta1) * current_grad_W_list[[j]]
        N_list[[j]] = beta2 * N_list[[j]] + (1 - beta2) * current_grad_W_list[[j]]^2
        M.hat = M_list[[j]]/(1 - beta1^i)
        N.hat = N_list[[j]]/(1 - beta2^i)
        W_list[[j]] = W_list[[j]] - lr*M.hat/sqrt(N.hat+eps)
      }
      
    } else if (optimizer == 'sgd') {
      
      for (j in 1:(len_h+1)) {
        M_list[[j]] = beta1 * M_list[[j]] + lr * current_grad_W_list[[j]]
        W_list[[j]] = W_list[[j]] - M_list[[j]]
      }
      
    } else {
      stop('optimizer must be selected from "sgd" or "adam".')
    }
    
  }
  
  plot(loss_seq, type = 'l', main = 'loss', xlab = 'iter.', ylab = 'CE loss')
  
  return(W_list)
  
}
PRED_MLP = function (X_matrix, W_list, eps = 1e-8) {
  
  #Functions
  
  #Forward
  
  Softmax.fun = function (x, eps = eps) {
    # subtract the row-wise maximum before exponentiating to keep exp() from overflowing
    out <- apply(x, 1, function (i) {exp(i - max(i))/sum(exp(i - max(i)))})
    return(t(out))
  }
  
  ReLU.fun = function (x) {
    x[x < 0] <- 0
    return(x)
  }
  
  L.fun = function (X, W) {
    X.E = cbind(1, X)
    L = X.E %*% W
    return(L)
  }
  
  len_h = length(W_list) - 1
  
  current_l_list = list()
  current_h_list = list()
  
  for (j in 1:len_h) {
    if (j == 1) {
      current_l_list[[j]] = L.fun(X = X_matrix, W = W_list[[j]])
    } else {
      current_l_list[[j]] = L.fun(X = current_h_list[[j-1]], W = W_list[[j]])
    }
    current_h_list[[j]] = ReLU.fun(x = current_l_list[[j]])
  }
  current_l_list[[len_h+1]] = L.fun(X = current_h_list[[len_h]], W = W_list[[len_h+1]])
  current_o = Softmax.fun(x = current_l_list[[len_h+1]], eps = eps)
  
  return(current_o)
  
}

Exercise 1 answer (experimental results)

set.seed(0)
TRAIN.seq = sample(1:150, 100)
TRAIN.X = X[TRAIN.seq,]
TRAIN.Y = Y[TRAIN.seq,]
TEST.X = X[-TRAIN.seq,]
TEST.Y = Y[-TRAIN.seq,]
W_LIST = DEEP_MLP_Trainer(num.iteration = 2000, num.hidden = c(10, 10, 10), batch_size = 30,
                          lr = 0.001, beta1 = 0.9, beta2 = 0.999, optimizer = 'adam', eps = 1e-8,
                          X_matrix = TRAIN.X, Y_matrix = TRAIN.Y)

PRED_Y = PRED_MLP(X_matrix = TEST.X, W_list = W_LIST, eps = 1e-8)
table(max.col(PRED_Y), max.col(TEST.Y))
##    
##      1  2  3
##   1 18  0  0
##   2  0 13  2
##   3  0  2 15
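– From the confusion table above, the overall test accuracy can be read off directly; a quick follow-up (reusing the objects created above; note that diag() assumes a square table, as it is here):

tab = table(max.col(PRED_Y), max.col(TEST.Y))
sum(diag(tab))/sum(tab)   # (18 + 13 + 15) / 50 = 0.92 for the table shown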

Data augmentation (1)

– Seen this way, adding variability is clearly important, and in our experiments we also found that it helps test-set accuracy. The question now is whether we can go a step further and inject even more variability to raise the model's accuracy further.

– This family of techniques is collectively known as data augmentation; there are enough methods and variations to fill an entire semester. Here we just take a first small step, as sketched below; later lessons will return to augmentation repeatedly and keep expanding on it.
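– The variant we try below is the simplest possible form: perturb the inputs of each mini-batch with Gaussian noise before the forward pass, while keeping the labels unchanged. In isolation the idea looks like this (a sketch with hypothetical names, not yet wired into the trainer):

augment_batch = function (X_batch, sd_Augmentation = 0.3) {
  noise = matrix(rnorm(prod(dim(X_batch)), sd = sd_Augmentation),
                 nrow = nrow(X_batch), ncol = ncol(X_batch))
  X_batch + noise   # only the inputs are jittered; the labels stay as they are
}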

Data augmentation (2)

DEEP_MLP_Trainer = function (num.iteration = 500, num.hidden = c(10, 10, 10), batch_size = 30,
                             lr = 0.001, beta1 = 0.9, beta2 = 0.999, optimizer = 'adam', eps = 1e-8,
                             sd_Augmentation = 0,
                             x1 = x1, x2 = x2, y = y,
                             test_x1 = NULL, test_x2 = NULL, test_y = NULL) {
  
  #Functions
  
  #Forward
  
  S.fun = function (x, eps = eps) {
    S = 1/(1 + exp(-x))
    S[S < eps] = eps
    S[S > 1 - eps] = 1 - eps
    return(S)
  }
  
  ReLU.fun = function (x) {
    x[x < 0] <- 0
    return(x)
  }
  
  L.fun = function (X, W) {
    X.E = cbind(1, X)
    L = X.E %*% W
    return(L)
  }
  
  CE.fun = function (o, y, eps = eps) {
    loss = -1/length(y) * sum(y * log(o + eps) + (1 - y) * log(1 - o + eps))
    return(loss)
  }
  
  #Backward
  
  grad_o.fun = function (o, y) {
    return((o - y)/(o*(1-o)))
  }
  
  grad_s.fun = function (grad_o, o) {
    return(grad_o*(o*(1-o)))
  }
  
  grad_W.fun = function (grad_l, h) {
    h.E = cbind(1, h)
    return(t(h.E) %*% grad_l/nrow(h))
  }
  
  grad_h.fun = function (grad_l, W) {
    return(grad_l %*% t(W[-1,]))
  }
  
  grad_l.fun = function (grad_h, l) {
    de_l = l
    de_l[de_l<0] = 0
    de_l[de_l>0] = 1
    return(grad_h*de_l)
  }
  
  #initialization
  
  X_matrix = cbind(x1, x2)
  Y_matrix = t(t(y))
  
  W_list = list()
  M_list = list()
  N_list = list()
  
  len_h = length(num.hidden)
  
  for (w_seq in 1:(len_h+1)) {
    if (w_seq == 1) {
      NROW_W = ncol(X_matrix) + 1
      NCOL_W = num.hidden[w_seq]
    } else if (w_seq == len_h+1) {
      NROW_W = num.hidden[w_seq - 1] + 1
      NCOL_W = ncol(Y_matrix)
    } else {
      NROW_W = num.hidden[w_seq - 1] + 1
      NCOL_W = num.hidden[w_seq]
    }
    W_list[[w_seq]] = matrix(rnorm(NROW_W*NCOL_W, sd = 1), nrow = NROW_W, ncol = NCOL_W)
    M_list[[w_seq]] = matrix(0, nrow = NROW_W, ncol = NCOL_W)
    N_list[[w_seq]] = matrix(0, nrow = NROW_W, ncol = NCOL_W)
  }
  
  loss_seq = rep(0, num.iteration)
  
  #Calculating
  
  for (i in 1:num.iteration) {
    
    idx = sample(1:length(x1), batch_size)
    
    #Augmentation
    
    New_X_matrix = X_matrix[idx,]
    New_X_matrix = New_X_matrix + rnorm(prod(dim(New_X_matrix)), sd = sd_Augmentation)
    New_y = y[idx]
    
    #Forward
    
    current_l_list = list()
    current_h_list = list()
    
    for (j in 1:len_h) {
      if (j == 1) {
        current_l_list[[j]] = L.fun(X = New_X_matrix, W = W_list[[j]])
      } else {
        current_l_list[[j]] = L.fun(X = current_h_list[[j-1]], W = W_list[[j]])
      }
      current_h_list[[j]] = ReLU.fun(x = current_l_list[[j]])
    }
    current_l_list[[len_h+1]] = L.fun(X = current_h_list[[len_h]], W = W_list[[len_h+1]])
    current_o = S.fun(x = current_l_list[[len_h+1]], eps = eps)
    loss_seq[i] = CE.fun(o = current_o, y = New_y, eps = eps)
    
    #Backward
    
    current_grad_l_list = list()
    current_grad_W_list = list()
    current_grad_h_list = list()
    
    current_grad_o = grad_o.fun(o = current_o, y = y[idx])
    current_grad_l_list[[len_h+1]] = grad_s.fun(grad_o = current_grad_o, o = current_o)
    current_grad_W_list[[len_h+1]] = grad_W.fun(grad_l = current_grad_l_list[[len_h+1]], h = current_h_list[[len_h]])
    
    for (j in len_h:1) {
      current_grad_h_list[[j]] = grad_h.fun(grad_l = current_grad_l_list[[j+1]], W = W_list[[j+1]])
      current_grad_l_list[[j]] = grad_l.fun(grad_h = current_grad_h_list[[j]], l = current_l_list[[j]])
      if (j != 1) {
        current_grad_W_list[[j]] = grad_W.fun(grad_l = current_grad_l_list[[j]], h = current_h_list[[j - 1]])
      } else {
        current_grad_W_list[[j]] = grad_W.fun(grad_l = current_grad_l_list[[j]], h = New_X_matrix)
      }
    }
    
    if (optimizer == 'adam') {
      
      for (j in 1:(len_h+1)) {
        M_list[[j]] = beta1 * M_list[[j]] + (1 - beta1) * current_grad_W_list[[j]]
        N_list[[j]] = beta2 * N_list[[j]] + (1 - beta2) * current_grad_W_list[[j]]^2
        M.hat = M_list[[j]]/(1 - beta1^i)
        N.hat = N_list[[j]]/(1 - beta2^i)
        W_list[[j]] = W_list[[j]] - lr*M.hat/sqrt(N.hat+eps)
      }
      
    } else if (optimizer == 'sgd') {
      
      for (j in 1:(len_h+1)) {
        M_list[[j]] = beta1 * M_list[[j]] + lr * current_grad_W_list[[j]]
        W_list[[j]] = W_list[[j]] - M_list[[j]]
      }
      
    } else {
      stop('optimizer must be selected from "sgd" or "adam".')
    }
    
  }
  
  require(scales)
  require(plot3D)
  
  x1_seq = seq(min(x1), max(x1), length.out = 100)
  x2_seq = seq(min(x2), max(x2), length.out = 100)
  
  pre_func = function (x1, x2) {
    new_X = cbind(x1, x2)
    
    current_l_list = list()
    current_h_list = list()
    
    for (j in 1:len_h) {
      if (j == 1) {
        current_l_list[[j]] = L.fun(X = new_X, W = W_list[[j]])
      } else {
        current_l_list[[j]] = L.fun(X = current_h_list[[j-1]], W = W_list[[j]])
      }
      current_h_list[[j]] = ReLU.fun(x = current_l_list[[j]])
    }
    
    current_l_list[[len_h+1]] = L.fun(X = current_h_list[[len_h]], W = W_list[[len_h+1]])
    current_o = S.fun(x = current_l_list[[len_h+1]], eps = eps)
    
    return(current_o)
  }
  
  pred_y = pre_func(x1 = x1, x2 = x2)
  MAIN_TXT = paste0('Train-Acc:', formatC(mean((pred_y > 0.5) == y), 2, format = 'f'))
  if (!is.null(test_x1)) {
    pred_test_y = pre_func(x1 = test_x1, x2 = test_x2)
    MAIN_TXT = paste0(MAIN_TXT, '; Test-Acc:', formatC(mean((pred_test_y > 0.5) == test_y), 2, format = 'f'))
  }
  
  z_matrix = sapply(x2_seq, function(x) {pre_func(x1 = x1_seq, x2 = x)})
  
  par(mfrow = c(1, 2))
  
  image2D(z = z_matrix, main = MAIN_TXT,
          x = x1_seq, xlab = 'x1',
          y = x2_seq, ylab = 'x2',
          shade = 0.2, rasterImage = TRUE,
          col = colorRampPalette(c("#FFA0A0", "#FFFFFF", "#A0A0FF"))(100))
  
  points(x1, x2, col = (y + 1)*2, pch = 19, cex = 0.5)
  if (!is.null(test_x1)) {
    points(test_x1, test_x2, col = 'black', bg = c('#C00000', '#0000C0')[(test_y + 1)], pch = 21)
  }
  
  plot(loss_seq, type = 'l', main = 'loss', xlab = 'iter.', ylab = 'CE loss')
  
  return(list(Train_tab = table((pred_y > 0.5) + 0L, y), Test_tab = table((pred_test_y > 0.5) + 0L, test_y)))
  
}

Data augmentation (3)

set.seed(0)
x1 = rnorm(100, sd = 1) 
x2 = rnorm(100, sd = 1) 
lr1 = - 1.5 + x1^2 + x2^2 + rnorm(100)
y = (lr1 > 0) + 0L

test_x1 = rnorm(100, sd = 1) 
test_x2 = rnorm(100, sd = 1) 
lr1 = - 1.5 + test_x1^2 + test_x2^2 + rnorm(100)
test_y = (lr1 > 0) + 0L

– Let us compare the results without data augmentation (sd_Augmentation = 0) against those with it (sd_Augmentation = 0.3 and 0.8; the 0.8 call is sketched after the two runs below):

TABLE_LIST = DEEP_MLP_Trainer(num.iteration = 2000, num.hidden = c(100), batch_size = 20,
                              lr = 0.001, beta1 = 0.9, beta2 = 0.999, optimizer = 'sgd',
                              sd_Augmentation = 0,
                              x1 = x1, x2 = x2, y = y,
                              test_x1 = test_x1, test_x2 = test_x2, test_y = test_y)

TABLE_LIST = DEEP_MLP_Trainer(num.iteration = 2000, num.hidden = c(100), batch_size = 20,
                              lr = 0.001, beta1 = 0.9, beta2 = 0.999, optimizer = 'sgd',
                              sd_Augmentation = 0.3,
                              x1 = x1, x2 = x2, y = y,
                              test_x1 = test_x1, test_x2 = test_x2, test_y = test_y)
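
– The heavier setting mentioned above is run the same way (a sketch mirroring the two calls above; its output figures are not reproduced here):

TABLE_LIST = DEEP_MLP_Trainer(num.iteration = 2000, num.hidden = c(100), batch_size = 20,
                              lr = 0.001, beta1 = 0.9, beta2 = 0.999, optimizer = 'sgd',
                              sd_Augmentation = 0.8,
                              x1 = x1, x2 = x2, y = y,
                              test_x1 = test_x1, test_x2 = test_x2, test_y = test_y)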