Machine Learning and Algorithms

林嶔 (Lin, Chin)

Lesson 15: Introduction to Machine Learning 6 (Nonlinear Support Vector Machines and Kernel Functions)

Section 1: The Soft-Margin Support Vector Machine (1)

\[ \begin{align} min & \text{ } \frac{1}{2}w^Tw & \\ \text{ subject to } & y_i(w^Tx_i + b) \geq 1 \end{align} \]

\[ \begin{align} min & \space \space \space \frac{1}{2} \sum \limits_{i=1}^{n} \sum \limits_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i x_j - \sum \limits_{i=1}^{n} \alpha_i \\ \text{subject to } & \space \space \space \sum \limits_{i=1}^{n} \alpha_i y_i = 0 \space \space \text{and} \space \alpha_i \geq 0 \space \space \space \text{ for all } i \\\\ w & = \sum \limits_{i=1}^{n} \alpha_i y_i x_i \\ b & = y_i - w^Tx_i \space \space \space \text{ for all } \alpha_i > 0 \end{align} \]

– In the lower (dual) formulation, although it "looks like" the original constraint no longer appears, the derivation clearly relied on it. Yet, strangely enough, in the last lesson we applied this dual solution directly to a problem that cannot be perfectly separated, and it still worked. Remember?

– Moreover, when using the package there is a parameter, cost, which we described as "the price to pay for violating \(y(w^Tx_i + b) \geq 1\)". It seems there should be some way to modify our SVM so that it can handle problems that cannot be perfectly separated.

Section 1: The Soft-Margin Support Vector Machine (2)

\[ \begin{align} \text{for correct sample} & \\ min \text{ } & \frac{1}{2}w^Tw \\ \text{ subject to } & y_i(w^Tx_i + b) \geq 1 \\\\ \text{for incorrect sample} & \\ min \text{ } & \sum \xi_i \end{align} \]

\[ \begin{align} min \text{ } & \frac{1}{2}w^Tw + C \sum \xi_i\\ \text{ subject to } & y_i(w^Tx_i + b) \geq 1 - \xi_i \space \space \space \space \space \text{and} \space \space \space \space \space \xi_i \geq 0 \space \space \space \text{ for all } i \\\\ \end{align} \]

Section 1: The Soft-Margin Support Vector Machine (3)

\[ \begin{align} min \text{ } & \frac{1}{2}w^Tw + C \sum \xi_i\\ \text{ subject to } & y_i(w^Tx_i + b) \geq 1 - \xi_i \space \space \space \space \space \text{and} \space \space \space \space \space \xi_i \geq 0 \space \space \space \text{ for all } i \\\\ L(b, w,\xi_i,\alpha_i,\beta_i) & = \frac{1}{2}w^Tw + C \sum \limits_{i=1}^{n} \xi_i + \sum \limits_{i=1}^{n} \alpha_i (1 - \xi_i - y_i(w^Tx_i + b)) + \sum \limits_{i=1}^{n} \beta_i (- \xi_i) \end{align} \]

– In the new formulation there are two constraints, so there are two Lagrange multipliers, \(\alpha_i\) and \(\beta_i\), which absorb the constraints \(y_i(w^Tx_i + b) \geq 1 - \xi_i\) and \(\xi_i \geq 0\) respectively.

– Likewise, when both constraints are satisfied, the terms multiplied by the two Lagrange multipliers, \((1 - \xi_i - y_i(w^Tx_i + b))\) and \((- \xi_i)\), are necessarily non-positive. Since the Lagrange multipliers must be non-negative, these product terms cannot be positive, so we first maximize over \(\alpha_i\) and \(\beta_i\) and then minimize the whole expression:

\[ \begin{align} \min \limits_{w, \space b, \space \xi_i} \left( \max \limits_{\alpha_i \geq 0, \space \beta_i \geq 0} \left( L(b, w,\xi_i,\alpha_i,\beta_i) \right) \right) \end{align} \]

\[ \begin{align} \min \limits_{w, \space b, \space \xi_i} \left( \max \limits_{\alpha_i \geq 0, \space \beta_i \geq 0} \left( L(b, w,\xi_i,\alpha_i,\beta_i) \right) \right) \geq \max \limits_{\alpha_i \geq 0, \space \beta_i \geq 0} \left( \min \limits_{w, \space b, \space \xi_i} \left( L(b, w,\xi_i,\alpha_i,\beta_i) \right) \right) \end{align} \]

Section 1: The Soft-Margin Support Vector Machine (4)

\[ \begin{align} L(b, w,\xi_i,\alpha_i,\beta_i) & = \frac{1}{2}w^Tw + C \sum \limits_{i=1}^{n} \xi_i + \sum \limits_{i=1}^{n} \alpha_i (1 - \xi_i - y_i(w^Tx_i + b)) + \sum \limits_{i=1}^{n} \beta_i (- \xi_i) \\ & = \frac{1}{2}w^Tw + C \sum \limits_{i=1}^{n} \xi_i + \sum \limits_{i=1}^{n} \left( \alpha_i - \alpha_i \xi_i - \alpha_i y_iw^Tx_i - \alpha_i y_i b \right) + \sum \limits_{i=1}^{n} \beta_i (- \xi_i) \\\\ \frac{\partial}{\partial b} L(b, w,\xi_i,\alpha_i,\beta_i) & = - \sum \limits_{i=1}^{n} \alpha_i y_i = 0 \\ \frac{\partial}{\partial w} L(b, w,\xi_i,\alpha_i,\beta_i) & = w - \sum \limits_{i=1}^{n} \alpha_i y_ix_i = 0 \\ \frac{\partial}{\partial \xi_i} L(b, w,\xi_i,\alpha_i,\beta_i) & = C - \alpha_i - \beta_i = 0 \\ \end{align} \]

– It turns out that the partial derivatives with respect to \(b\) and \(w\) are exactly the same as before, which explains why the direct computation earlier already gave exactly the same result!

– From this series of derivations we obtain three equalities (the first two were already used before!):

\[ \begin{align} \sum \limits_{i=1}^{n} \alpha_i y_i = 0 \space \space \space \space w = \sum \limits_{i=1}^{n} \alpha_i y_i x_i \space \space \space \space \beta_i = C - \alpha_i \end{align} \]

Section 1: The Soft-Margin Support Vector Machine (5)

\[ \begin{align} L(b, w,\xi_i,\alpha_i,\beta_i) & = \frac{1}{2}w^Tw + C \sum \limits_{i=1}^{n} \xi_i + \sum \limits_{i=1}^{n} \left( \alpha_i - \alpha_i \xi_i - \alpha_i y_i w^Tx_i - \alpha_i y_i b \right) + \sum \limits_{i=1}^{n} \beta_i (- \xi_i) \\ & = \frac{1}{2}w^Tw + C \sum \limits_{i=1}^{n} \xi_i + \sum \limits_{i=1}^{n} \alpha_i - \sum \limits_{i=1}^{n} \alpha_i \xi_i - \sum \limits_{i=1}^{n} \alpha_i y_i w^Tx_i - \sum \limits_{i=1}^{n} \alpha_i y_i b - \sum \limits_{i=1}^{n} \beta_i \xi_i \\ & = \frac{1}{2}w^Tw + \sum \limits_{i=1}^{n} (C - \alpha_i - \beta_i) \xi_i + \sum \limits_{i=1}^{n} \alpha_i - w^T \sum \limits_{i=1}^{n} \alpha_i y_i x_i - b \sum \limits_{i=1}^{n} \alpha_i y_i \\ \left[ \because C - \alpha_i - \beta_i = 0 \right] & = \frac{1}{2}w^Tw + \sum \limits_{i=1}^{n} \alpha_i - w^T \sum \limits_{i=1}^{n} \alpha_i y_i x_i - b \sum \limits_{i=1}^{n} \alpha_i y_i \\ \left[ \because \sum \limits_{i=1}^{n} \alpha_i y_i = 0 \right] & = \frac{1}{2}w^Tw + \sum \limits_{i=1}^{n} \alpha_i - w^T \sum \limits_{i=1}^{n} \alpha_i y_i x_i \\ \left[ \because w = \sum \limits_{i=1}^{n} \alpha_i y_i x_i \right] & = \frac{1}{2}w^Tw + \sum \limits_{i=1}^{n} \alpha_i - w^T w \\ & = - \frac{1}{2}w^Tw + \sum \limits_{i=1}^{n} \alpha_i\\ & = - \frac{1}{2} \sum \limits_{i=1}^{n} \sum \limits_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i x_j + \sum \limits_{i=1}^{n} \alpha_i \end{align} \]

Section 1: The Soft-Margin Support Vector Machine (6)

– Since \(\beta_i\) must be non-negative, this constraint simply means that \(\alpha_i\), besides being no less than 0, must also be no greater than \(C\); any \(\alpha_i\) that satisfies this automatically yields a valid \(\beta_i\). In effect, we have added the condition \(C \geq \alpha_i \geq 0\).

– We can now rewrite the problem in quadratic programming form again:

\[ \begin{align} min & \space \space \space \frac{1}{2} \sum \limits_{i=1}^{n} \sum \limits_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i x_j - \sum \limits_{i=1}^{n} \alpha_i \\ \text{subject to } & \space \space \space \sum \limits_{i=1}^{n} \alpha_i y_i = 0 \space \space \text{and} \space C \geq \alpha_i \geq 0 \space \space \space \text{ for all } i \end{align} \]

\[ \begin{align} min & \space \space \space \space \space \frac{1}{2}b^TDb - d^Tb \\ \text{ subject to} & \space \space \space \space \space A^Tb \geq b_0 \end{align} \]

\[ \begin{align} b = \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \alpha_3 \\ \alpha_4 \end{pmatrix} \space D =YY^T \otimes (XX^T) = \begin{pmatrix} y_1y_1x_1x_1 & y_1y_2x_1x_2 & y_1y_3x_1x_3 & y_1y_4x_1x_4 \\ y_2y_1x_2x_1 & y_2y_2x_2x_2 & y_2y_3x_2x_3 & y_2y_4x_2x_4 \\ y_3y_1x_3x_1 & y_3y_2x_3x_2 & y_3y_3x_3x_3 & y_3y_4x_3x_4 \\ y_4y_1x_4x_1 & y_4y_2x_4x_2 & y_4y_3x_4x_3 & y_4y_4x_4x_4 \end{pmatrix} \space d = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix} \\ \space A = \begin{pmatrix} y_1 & 1 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\ y_2 & 0 & 1 & 0 & 0 & 0 & -1 & 0 & 0 \\ y_3 & 0 & 0 & 1 & 0 & 0 & 0 & -1 & 0 \\ y_4 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & -1 \end{pmatrix} \space b_0 = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ -C \\ -C \\ -C \\ -C \end{pmatrix} \end{align} \]

Section 1: The Soft-Margin Support Vector Machine (7)

– Note in particular that the intercept was previously computed only from points with \(\alpha_i > 0\); now we must also require \(\alpha_i < C\), because a point with \(\alpha_i = C\) implies \(\beta_i = 0\), which allows \(\xi_i > 0\), so such a point cannot be used for this calculation:

library(quadprog)

data(iris)
sub.iris <- iris[51:150,]
x1 <- sub.iris[,1]
x2 <- sub.iris[,2]
y <- as.integer(sub.iris[,5]) * 2 - 5   # recode the two species (integer codes 2 and 3) to -1 / +1

cost = 1
n.sample = length(y)
small.value = 5e-6

X = cbind(x1, x2)

D.matrix = (y%*%t(y))*(X%*%t(X))                  # D = (y y^T) * (X X^T), elementwise product
D.matrix = D.matrix + small.value*diag(n.sample)  # small ridge keeps D numerically positive definite
A.matrix = t(rbind(matrix(y, ncol = n.sample), diag(n.sample), -diag(n.sample)))  # constraints: sum(alpha*y) = 0, alpha >= 0, -alpha >= -C

fit = solve.QP(Dmat = D.matrix, dvec = rep(1, n.sample), Amat = A.matrix, bvec = c(rep(0, n.sample + 1), rep(-cost, n.sample)), meq = 1, factorized = FALSE)
qpsol <- fit$solution

findCoefs <- function(a, y, X, cost = 1){
  nonzero <- abs(a) > 5e-6   # support vectors: alpha > 0 (within numerical tolerance)
  noncost <- (a < cost)      # alpha < C: only these points are used for the intercept
  W <- rowSums(sapply(which(nonzero), function(i) a[i]*y[i]*X[i,]))
  b <- mean(sapply(which(nonzero & noncost), function(i) y[i]-X[i,]%*%W))
  result <- c(b, W)
  names(result) = c("w0", "w1", "w2")
  return(result)
}

coefs = findCoefs(qpsol, y, X)

A = -coefs[1]/coefs[3]
B = -coefs[2]/coefs[3]

plot(x1, x2, col =  y + 3, pch = 19)
abline(a = A, b = B, lwd = 2, lty = 1)
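– As a quick cross-check (this comparison is an addition to the original material), we can fit the same data with svm() from "e1071" using a linear kernel and the same cost, and recover the hyperplane from the fitted object; up to a possible sign flip caused by the factor level ordering, it should be close to the quadprog solution above.

library(e1071)

svm.check <- svm(x = X, y = factor(y), kernel = "linear", scale = FALSE, type = "C-classification", cost = cost)
w.check <- t(svm.check$coefs) %*% svm.check$SV   # coefs store alpha_i * y_i for each support vector
b.check <- -svm.check$rho                        # the intercept is stored as -rho
c(b.check, w.check)                              # compare with coefs from findCoefs()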

Exercise 1: Tuning the hyperparameter

– Please download the example data from here.

dat <- read.csv("ECG_train.csv", header = TRUE, fileEncoding = 'CP950', stringsAsFactors = FALSE, na.strings = "")

– Let's experiment with predicting LVD from Rate and AGE, and tune the hyperparameter to obtain the best result in the validation set.

– Pay special attention to the parameter scale: you may have noticed that solving the SVM requires \(XX^T\), so without standardization the variables with larger standard deviations get more weight (a small illustration follows below).
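– A tiny illustration of this point (the numbers below are made up for demonstration and are not from the ECG data): when one column has a much larger scale than the other, it dominates every entry of \(XX^T\); standardizing the columns, which is what scale = TRUE does, removes this imbalance.

toy_X <- cbind(Rate = c(60, 80, 100), AGE_in_days = c(20000, 25000, 30000))   # hypothetical, exaggerated units
round(toy_X %*% t(toy_X))               # inner products dominated by the large-scale column
toy_X_std <- scale(toy_X)               # column-wise standardization
round(toy_X_std %*% t(toy_X_std), 2)    # both variables now contribute comparably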

library(e1071)

subdat <- dat[!(dat[,'LVD'] %in% NA) & !(dat[,'Rate'] %in% NA) & !(dat[,'AGE'] %in% NA), c('LVD', 'Rate', 'AGE')]

set.seed(0)
all_idx <- 1:nrow(subdat)

train_idx <- sample(all_idx, nrow(subdat) * 0.6)
valid_idx <- sample(all_idx[!all_idx %in% train_idx], nrow(subdat) * 0.2)
test_idx <- all_idx[!all_idx %in% c(train_idx, valid_idx)]

train_dat <- subdat[train_idx,]
valid_dat <- subdat[valid_idx,]
test_dat <- subdat[test_idx,]

svm.model <- svm(LVD ~ Rate + AGE, data = train_dat, kernel = "linear", scale = TRUE, type = "C-classification", cost = 1)
pred_valid <- predict(svm.model, valid_dat, decision.values = TRUE)
valid_dat[,'pred'] <- attr(pred_valid, "decision.values")

Exercise 1: Answer

library(pROC)

result <- data.frame(cost = 10^(-3:1), valid_auc = NA)

for (i in 1:nrow(result)) {
  
  svm.model <- svm(LVD ~ Rate + AGE, data = train_dat, kernel = "linear", scale = TRUE, type = "C-classification", cost = result[i,'cost'])
  pred_valid <- predict(svm.model, valid_dat, decision.values = TRUE)
  valid_dat[,'pred'] <- attr(pred_valid, "decision.values")
  roc_valid <- roc(LVD ~ pred, data = valid_dat)
  result[i,'valid_auc'] <- roc_valid[['auc']]
  
}

result
##    cost valid_auc
## 1 1e-03 0.6312042
## 2 1e-02 0.6316163
## 3 1e-01 0.6318681
## 4 1e+00 0.6321200
## 5 1e+01 0.6321200

– Let's draw the ROC curve using the best result:

best_pos <- which.max(result[,'valid_auc'])
best.svm.model <- svm(LVD ~ Rate + AGE, data = train_dat, kernel = "linear", scale = TRUE, type = 'C-classification', cost = result[best_pos,'cost'])

library(pROC)

pred_test <- predict(best.svm.model, test_dat, decision.values = TRUE)
roc_curve <- roc(test_dat[,1] ~ attr(pred_test, "decision.values"))
plot(roc_curve)
text(0.5, 0.5, paste0('AUC = ', formatC(roc_curve[['auc']], 4, format = 'f')), col = 'red')

Section 2: Kernel Functions (1)

(Figure F01)

\[ \begin{align} min & \space \space \space \frac{1}{2} \sum \limits_{i=1}^{n} \sum \limits_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i x_j - \sum \limits_{i=1}^{n} \alpha_i \\ \text{subject to } & \space \space \space \sum \limits_{i=1}^{n} \alpha_i y_i = 0 \space \space \text{and} \space C \geq \alpha_i \geq 0 \space \space \space \text{ for all } i \end{align} \]

\[ \begin{align} min & \space \space \space \space \space \frac{1}{2}b^TDb - d^Tb \\ \text{ subject to} & \space \space \space \space \space A^Tb \geq b_0 \\ D & = YY^T \otimes (XX^T) = \begin{pmatrix} y_1y_1x_1x_1 & y_1y_2x_1x_2 & y_1y_3x_1x_3 & y_1y_4x_1x_4 \\ y_2y_1x_2x_1 & y_2y_2x_2x_2 & y_2y_3x_2x_3 & y_2y_4x_2x_4 \\ y_3y_1x_3x_1 & y_3y_2x_3x_2 & y_3y_3x_3x_3 & y_3y_4x_3x_4 \\ y_4y_1x_4x_1 & y_4y_2x_4x_2 & y_4y_3x_4x_3 & y_4y_4x_4x_4 \end{pmatrix} \end{align} \]

Section 2: Kernel Functions (2)

– We will use a "second-order polynomial transformation" for comparison.

– Note that this kernel function is designed to help you grasp the idea quickly; it is not the actual polynomial kernel function used by the SVM function in the "e1071" package.

x1 = c(0, 2, 2, 3)
x2 = c(0, 2, 0, 0)
y = c(1, 1, -1, -1)

X = cbind(x1, x2)
Z = cbind(x1, x2, x1^2, x1*x2, x2*x1, x2^2)

D.matrix_1 = (y%*%t(y))*(Z%*%t(Z))
D.matrix_1
##      [,1] [,2] [,3] [,4]
## [1,]    0    0    0    0
## [2,]    0   72  -20  -42
## [3,]    0  -20   20   42
## [4,]    0  -42   42   90
X.DOT = (X%*%t(X))
D.matrix_2 = (y%*%t(y))*(X.DOT + X.DOT^2)
D.matrix_2
##      [,1] [,2] [,3] [,4]
## [1,]    0    0    0    0
## [2,]    0   72  -20  -42
## [3,]    0  -20   20   42
## [4,]    0  -42   42   90
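– The algebra behind the identity demonstrated above is simple: with the expanded features \(z(x) = (x_1, x_2, x_1^2, x_1x_2, x_2x_1, x_2^2)\), the inner product of two expanded vectors depends only on the original inner product (the duplicated cross terms \(x_1x_2\) and \(x_2x_1\) are what produce the factor of 2):

\[ \begin{align} z(x_i)^T z(x_j) & = x_{i1}x_{j1} + x_{i2}x_{j2} + x_{i1}^2x_{j1}^2 + 2x_{i1}x_{i2}x_{j1}x_{j2} + x_{i2}^2x_{j2}^2 \\ & = x_i^Tx_j + (x_i^Tx_j)^2 \end{align} \]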

Section 2: Kernel Functions (3)

set.seed(0)
x1 = rnorm(20) 
x2 = rnorm(20) 
lr1 = x1^2 + x2^2
y = (lr1 > mean(lr1)) * 2 - 1

plot(x1, x2, col = y + 3, pch = 19)

library(quadprog)

cost = 1
n.sample = length(y)
small.value = 5e-6

X = cbind(x1, x2)
Z = cbind(x1, x2, x1^2, x1*x2, x2*x1, x2^2)

– Let's first demonstrate "dimension expansion" followed by the "inner product" (the traditional approach):

Z.DOT <- (Z%*%t(Z))
D.matrix <- (y%*%t(y))*(Z.DOT)
D.matrix <- D.matrix + small.value*diag(n.sample)
A.matrix <- t(rbind(matrix(y, ncol = n.sample), diag(n.sample), -diag(n.sample)))

fit <- solve.QP(Dmat = D.matrix, dvec = rep(1, n.sample), Amat = A.matrix, bvec = c(rep(0, n.sample + 1), rep(-cost, n.sample)), meq = 1, factorized = FALSE)
qpsol <- fit$solution
qpsol
##  [1]  8.083094e-01 -5.454571e-14 -3.268684e-17 -1.298141e-14 -3.124524e-14
##  [6]  2.532337e-16 -6.374327e-17  4.991869e-15  6.314419e-01  2.836752e-16
## [11]  1.000000e+00  1.000000e+00  1.000000e+00  3.581068e-14  1.000000e+00
## [16]  9.940971e-01  1.000000e+00  1.000000e+00  8.321542e-01  7.339974e-01

– Now let's demonstrate the "inner product" first, followed by the "dimension expansion" (the new approach):

X.DOT <- (X%*%t(X))
D.matrix <- (y%*%t(y))*(X.DOT + X.DOT^2)
D.matrix <- D.matrix + small.value*diag(n.sample)
A.matrix <- t(rbind(matrix(y, ncol = n.sample), diag(n.sample), -diag(n.sample)))

fit = solve.QP(Dmat = D.matrix, dvec = rep(1, n.sample), Amat = A.matrix, bvec = c(rep(0, n.sample + 1), rep(-cost, n.sample)), meq = 1, factorized = FALSE)
qpsol <- fit$solution
qpsol
##  [1]  8.083094e-01 -8.374902e-14 -4.301072e-17 -1.295574e-14 -4.102804e-14
##  [6]  3.274020e-17 -1.587524e-16  5.255536e-15  6.314419e-01  4.871732e-17
## [11]  1.000000e+00  1.000000e+00  1.000000e+00  2.898899e-14  1.000000e+00
## [16]  9.940971e-01  1.000000e+00  1.000000e+00  8.321542e-01  7.339974e-01

Section 2: Kernel Functions (4)

findCoefs_1 <- function(a, y, Z){
  nonzero <-  abs(a) > 5e-6
  noncost <- (a < cost)
  W <- rowSums(sapply(which(nonzero), function(i) a[i]*y[i]*Z[i,]))
  b <- mean(sapply(which(nonzero & noncost), function(i) y[i]-Z[i,]%*%W))
  result <- c(b, W)
  names(result) = c("w0", "w1", "w2", "w11", "w12", "w21", "w22")
  return(result)
}

coefs <- findCoefs_1(qpsol, y, Z)
print(coefs)
##          w0          w1          w2         w11         w12         w21 
## -2.10243693 -0.11098829  0.03399143  1.88669298 -0.16517552 -0.16517552 
##         w22 
##  1.87546756
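– As an extra visual check (this plot is an addition to the original material), the linear hyperplane in the expanded 6-dimensional space corresponds to a curved boundary in the original \((x_1, x_2)\) plane; we can evaluate \(w_0 + w^Tz(x)\) on a grid and draw its zero contour:

grid_x1 <- seq(min(x1) - 0.5, max(x1) + 0.5, length.out = 200)
grid_x2 <- seq(min(x2) - 0.5, max(x2) + 0.5, length.out = 200)
grid_dat <- expand.grid(x1 = grid_x1, x2 = grid_x2)
grid_Z <- cbind(grid_dat$x1, grid_dat$x2, grid_dat$x1^2, grid_dat$x1*grid_dat$x2, grid_dat$x2*grid_dat$x1, grid_dat$x2^2)
dec_val <- coefs[1] + grid_Z %*% coefs[-1]   # w0 + w^T z(x) on the grid

plot(x1, x2, col = y + 3, pch = 19)
contour(grid_x1, grid_x2, matrix(dec_val, nrow = length(grid_x1)), levels = 0, add = TRUE, lwd = 2, drawlabels = FALSE)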

Section 2: Kernel Functions (5)

\[ \begin{align} w & = \sum \limits_{i=1}^{n} \alpha_i y_ix_i \\ b & = y_i - w^Tx_i \space \space \space \text{ for all } C > \alpha_i > 0 \end{align} \]

– So we need a method that computes the distance without explicitly solving for \(w\) and \(b\)!

\[ \begin{align} min & \space \space \space \frac{1}{2} \sum \limits_{i=1}^{n} \sum \limits_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i x_j - \sum \limits_{i=1}^{n} \alpha_i \\ \text{subject to } & \space \space \space \sum \limits_{i=1}^{n} \alpha_i y_i = 0 \space \space \text{and} \space C \geq \alpha_i \geq 0 \space \space \space \text{ for all } i \\\\ \text{ for all support vectors} (\alpha_i > 0) & \\ w & = \sum \limits_{i=1}^{n} \alpha_i y_i x_i \\ b & = \frac{1} {n} \sum \limits_{i=1}^{n} \left( y_i - w^Tx_i \right) \space \text{ for all } (\alpha_i < C) \\ \text{ for a new sample} & \\ \hat{y} & = w^Th + b \space \space \space \space \space \space \end{align} \]

\[ \begin{align} min & \space \space \space \frac{1}{2} \alpha (YY^T) (XX^T) \alpha^T - 1^T \alpha \\ \text{subject to } & \space \space \space \sum \limits_{i=1}^{n} \alpha_i y_i = 0 \space \space \text{and} \space C \geq \alpha_i \geq 0 \space \space \space \text{ for all } i \\\\ \text{ for all support vectors} (\alpha_i > 0) & \\ w & = (\alpha \otimes Y)^TX \\ b & = \frac{1} {n} (\alpha < C)^T(Y - XX^T(\alpha \otimes Y)) \\ \text{ for new samples} & \\ \hat{Y} & = w^TH + b = HX^T(\alpha \otimes Y) + \frac{1} {n} (\alpha < C)^T(Y - XX^T(\alpha \otimes Y)) \end{align} \]

Section 2: Kernel Functions (6)

\[ \begin{align} min & \space \space \space \frac{1}{2} \alpha (YY^T) \phi(XX^T) \alpha^T - 1^T \alpha \\ \text{subject to } & \space \space \space \sum \limits_{i=1}^{n} \alpha_i y_i = 0 \space \space \text{and} \space C \geq \alpha_i \geq 0 \space \space \space \text{ for all } i \\\\ \text{ for new samples} & \\ \hat{Y} & = \phi(HX^T)(\alpha \otimes Y) + \frac{1} {n} (\alpha_i < C)^T(Y - \phi(XX^T)(\alpha \otimes Y)) \end{align} \]

# Sample

set.seed(0)
x1 <- rnorm(20) 
x2 <- rnorm(20) 
lr1 <- x1^2 + x2^2
y <- (lr1 > mean(lr1)) * 2 - 1

# Parameters

library(quadprog)

cost <- 1
n.sample <- length(y)
small.value <- 5e-6
kernel_func <- function (X) {X + X^2}

# QP solve

X <- cbind(x1, x2)

X.DOT <- (X%*%t(X))
D.matrix <- (y%*%t(y))*kernel_func(X.DOT) # Use kernel function
D.matrix <- D.matrix + small.value*diag(n.sample)
A.matrix <- t(rbind(matrix(y, ncol = n.sample), diag(n.sample), -diag(n.sample)))

fit <- solve.QP(Dmat = D.matrix, dvec = rep(1, n.sample), Amat = A.matrix, bvec = c(rep(0, n.sample + 1), rep(-cost, n.sample)), meq = 1, factorized = FALSE)
qpsol <- fit$solution

# Get support vector

sv_pos <- which(qpsol > small.value)
sv_alpha <- qpsol[sv_pos]
sv_Y <- y[sv_pos]
sv_X <- X[sv_pos,]

# Calculate for new data

new_H <- matrix(0:3, ncol = 2)
Y_hat <- (kernel_func(new_H %*% t(sv_X)) %*% (sv_alpha * sv_Y)) + as.numeric((sv_alpha < cost) %*% (sv_Y - kernel_func(sv_X %*% t(sv_X)) %*% (sv_alpha * sv_Y)) / sum(sv_alpha < cost))
Y_hat
##           [,1]
## [1,]  5.467416
## [2,] 15.663397
aug_H <- cbind(new_H, new_H[,1]^2, new_H[,1]*new_H[,2], new_H[,2]*new_H[,1], new_H[,2]^2)
aug_H %*% coefs[-1] + coefs[1]
##           [,1]
## [1,]  5.467416
## [2,] 15.663397

Exercise 2: Reproducing the prediction results of the package's kernel SVM

# Sample

set.seed(0)
x1 <- rnorm(20) 
x2 <- rnorm(20) 
lr1 <- x1^2 + x2^2
y <- (lr1 > mean(lr1)) * 2 - 1

# SVM

svm.model <- svm(y ~ x1 + x2, kernel = "polynomial", scale = FALSE, type = "C-classification", cost = 1, gamma = 1, coef0 = 0, degree = 2)
plot(svm.model, data = data.frame(y = factor(y), x1, x2))

new_H <- data.frame(x1 = 0:1, x2 = 2:3)
predict(svm.model, new_H, decision.values = TRUE)
## 1 2 
## 1 1 
## attr(,"decision.values")
##        1/-1
## 1  5.435932
## 2 15.985613
## Levels: -1 1

(Figure F02)

Exercise 2: Answer

kernel_func <- function (X, gamma = 1, coef0 = 0, degree = 2) {(gamma * X + coef0)^degree}

# Get support vector

sv_alpha.Y <- svm.model$coefs
sv_X <- svm.model$SV
b <- -svm.model$rho

# Validate parameter b

sv_alpha <- abs(sv_alpha.Y)
sv_Y <- sign(sv_alpha.Y)
my_b <- as.numeric(t(sv_alpha < 1) %*% (sv_Y - kernel_func(sv_X %*% t(sv_X)) %*% (sv_alpha * sv_Y)) / sum(sv_alpha < 1))

# Calculate for new data

new_H <- matrix(0:3, ncol = 2)
Y_hat <- kernel_func(new_H %*% t(sv_X)) %*% sv_alpha.Y + b
Y_hat
##           [,1]
## [1,]  5.435932
## [2,] 15.985613

Exercise 3: Confirming your understanding of kernel functions once more

# Sample

set.seed(0)
x1 <- rnorm(20) 
x2 <- rnorm(20) 
lr1 <- x1^2 + x2^2
y <- (lr1 > mean(lr1)) * 2 - 1

# SVM

svm.model <- svm(y ~ x1 + x2, kernel = "polynomial", scale = FALSE, type = "C-classification", cost = 1, gamma = 1, coef0 = 0, degree = 3)
plot(svm.model, data = data.frame(y = factor(y), x1, x2))

Section 3: Using Kernel Functions to Expand into Very High-Dimensional Spaces (1)

– However, given our current understanding of "kernel functions", logistic regression can basically produce almost the same results, as long as you expand the dimensions by hand, so the SVM does not seem that impressive after all.

– To take the SVM somewhere logistic regression cannot go, we have a crazy idea: expand the dimensions to "infinitely many". Logistic regression certainly could not do that, could it?

– There is indeed a way: the "radial basis kernel function", which some call the "Gaussian kernel function".

(Figure F03)

(Figure F04)

– At first glance this gives infinitely many dimensions, but if the \(\gamma\) parameter is set very small, the later terms decay extremely fast, so it is almost equivalent to classifying in a "low-dimensional" space. Conversely, if \(\gamma\) is set very, very large, it really does amount to a linear separation in an "infinite-dimensional" space. What would the consequence be? (The expansion behind this claim is written out below.)
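– For reference, the expansion behind this claim is the standard decomposition of the radial basis kernel for two samples \(x\) and \(x'\): each power \((x^Tx')^k\) corresponds to the degree-\(k\) polynomial features, so the feature space has infinitely many dimensions, and the weight \(\frac{(2\gamma)^k}{k!}\) shows why a small \(\gamma\) makes the higher-order terms vanish almost immediately:

\[ \begin{align} K(x, x') & = \exp(-\gamma \|x - x'\|^2) \\ & = \exp(-\gamma \|x\|^2) \space \exp(-\gamma \|x'\|^2) \space \exp(2\gamma \space x^Tx') \\ & = \exp(-\gamma \|x\|^2) \space \exp(-\gamma \|x'\|^2) \sum \limits_{k=0}^{\infty} \frac{(2\gamma)^k}{k!} (x^Tx')^k \end{align} \]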

Section 3: Using Kernel Functions to Expand into Very High-Dimensional Spaces (2)

– In real-world data, however, there are always some measurement errors. We fully understand that projecting the data into "infinitely many dimensions" will always allow a perfect separation, but might this actually reduce the generalization ability?

– A basic fact from statistics and geometry: if the number of dimensions exceeds the number of samples (and no two identical points carry different labels), there always exists a hyperplane that separates the data perfectly.

– Just as with a multilayer perceptron that has a very large number of hidden neurons, we must beware of the danger of "overfitting".

(Figure F05)

# Sample

set.seed(0)
x1 <- rnorm(20) 
x2 <- rnorm(20) 
lr1 <- x1^2 + x2^2
y <- (lr1 > mean(lr1)) * 2 - 1

# SVM

svm.model <- svm(y ~ x1 + x2, kernel = "radial", scale = FALSE, type = "C-classification", cost = 1e5, gamma = 1e3, coef0 = 0, degree = 3)
plot(svm.model, data = data.frame(y = factor(y), x1, x2))

Section 3: Using Kernel Functions to Expand into Very High-Dimensional Spaces (3)

(Figure F07)

Exercise 4: Experience the power of the SVM

– Pay special attention to the parameter scale: you may have noticed that solving the SVM requires \(XX^T\), so without standardization the variables with larger standard deviations get more weight.

library(e1071)

subdat <- dat[!(dat[,'LVD'] %in% NA) & !(dat[,'Rate'] %in% NA) & !(dat[,'AGE'] %in% NA), c('LVD', 'Rate', 'AGE')]
subdat[,'LVD'] <- factor(subdat[,'LVD'])

set.seed(0)
all_idx <- 1:nrow(subdat)

train_idx <- sample(all_idx, nrow(subdat) * 0.6)
valid_idx <- sample(all_idx[!all_idx %in% train_idx], nrow(subdat) * 0.2)
test_idx <- all_idx[!all_idx %in% c(train_idx, valid_idx)]

train_dat <- subdat[train_idx,]
valid_dat <- subdat[valid_idx,]
test_dat <- subdat[test_idx,]

Exercise 4: Answer

library(pROC)

result <- data.frame(cost = rep(2^(-4:4), each = 9), gamma = 2^(-4:4), valid_auc = NA)

for (i in 1:nrow(result)) {
  
  svm.model <- svm(LVD ~ Rate + AGE, data = train_dat, kernel = "radial", type = 'C-classification', cost = result[i,'cost'], gamma = result[i,'gamma'])
  pred_valid <- predict(svm.model, valid_dat, decision.values = TRUE)
  valid_dat[,'pred'] <- attr(pred_valid, "decision.values")
  roc_valid <- roc(LVD ~ pred, data = valid_dat)
  result[i,'valid_auc'] <- roc_valid[['auc']]
  
}

best_pos <- which.max(result[,'valid_auc'])
result[best_pos,]
##    cost gamma valid_auc
## 65    8 0.125 0.6455586
best.svm.model <- svm(LVD ~ Rate + AGE, data = train_dat, kernel = "radial", type = 'C-classification', cost = result[best_pos,'cost'], gamma = result[best_pos,'gamma'])
plot(best.svm.model, data = train_dat, svSymbol = '', dataSymbol = '')

library(pROC)

pred_test <- predict(best.svm.model, test_dat, decision.values = TRUE)
roc_curve <- roc(test_dat[,1] ~ attr(pred_test, "decision.values"))
plot(roc_curve)
text(0.5, 0.5, paste0('AUC = ', formatC(roc_curve[['auc']], 4, format = 'f')), col = 'red')

Exercise 4: Extension

– Let's try it on our data. Note that the metric it uses is the error rate rather than the AUC; in principle I still recommend writing your own code for the hyperparameter search:

svm_tune <- tune(svm, train.x = train_dat[,2:3], train.y = train_dat[,1],
                 kernel = "radial", ranges = list(cost = 2^(-1:1), gamma = 2^(-1:1), type = 'C-classification'))

print(svm_tune)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma             type
##     1     2 C-classification
## 
## - best performance: 0.3686914

– The experimental workflow in data science is extremely important!

Course Summary

– However, the SVM still classifies on the basis of dimension expansion; even though the parameter "cost" was added to avoid "overfitting", getting this balance right requires a great deal of experimentation.

– A relatively simple approach is to combine it with Random Forest: use the Random Forest's tendency to rely on only a few variables to perform feature selection, then let the SVM handle the dimension expansion for the final classification (a minimal sketch follows below).
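– A minimal sketch of this idea (the data set, the "keep the top 2 variables" rule, and the object names below are illustrative assumptions, not code from this course): use the variable importance of a Random Forest to keep only the most informative features, then fit an RBF-kernel SVM on the reduced feature set.

library(randomForest)
library(e1071)

set.seed(0)
n <- 200
toy_dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy_dat[,'y'] <- factor((toy_dat[,'x1']^2 + toy_dat[,'x2']^2 > 2) * 1)   # only x1 and x2 are informative

# Step 1: feature selection with Random Forest variable importance
rf.model <- randomForest(y ~ ., data = toy_dat, ntree = 500)
imp <- importance(rf.model)[,'MeanDecreaseGini']
top_vars <- names(sort(imp, decreasing = TRUE))[1:2]   # keep the 2 most important variables

# Step 2: kernel SVM on the selected features only
svm.model <- svm(reformulate(top_vars, response = 'y'), data = toy_dat, kernel = "radial", type = "C-classification", cost = 1, gamma = 0.5)
mean(predict(svm.model, toy_dat) == toy_dat[,'y'])     # training accuracy on the reduced feature set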