Machine Learning and Algorithms

林嶔 (Lin, Chin)

Lesson 11 Introduction to Machine Learning 3 (Regularization and the Elastic Net)

Section 1: Regularization (1)

– Because of their mathematical similarity, the four statistical models we have already covered can all be extended with the same technique, which is quite convenient.

– In statistics, a whole family of methods relies on regularization, mainly to handle the "many variables, few cases" situation. The regularization techniques commonly used in statistics fall into two classes:

  1. L1 Regularization - a regression model using this penalty is called LASSO regression

  2. L2 Regularization - a regression model using this penalty is called ridge regression

– Please download the sample data from here

dat <- read.csv("ECG_train.csv", header = TRUE, fileEncoding = 'CP950', stringsAsFactors = FALSE, na.strings = "")

Section 1: Regularization (2)

– Recall the prediction function and loss function of simple linear regression:

\[ \begin{align} \hat{y_i} & = b_{0} + b_{1}x_i \\ loss & = \frac{{1}}{2n}\sum \limits_{i=1}^{n} \left(y_{i} - \hat{y_i}\right)^{2} \end{align} \]

– L1 Regularization simply modifies the \(loss\) term as follows (here \(\lambda\) is a penalty weight):

\[ \begin{align} loss & = \frac{{1}}{2n}\sum \limits_{i=1}^{n} \left(y_{i} - \hat{y_i}\right)^{2} + \lambda ( |b_0| + |b_1|) \end{align} \]

– L2 Regularization is very similar to L1 Regularization; only the modification to the \(loss\) differs:

\[ \begin{align} loss & = \frac{{1}}{2n}\sum \limits_{i=1}^{n} \left(y_{i} - \hat{y_i}\right)^{2} + \frac{\lambda}{2} ( b_0^2 + b_1^2) \end{align} \]

– Now consider the physical meaning: a weight of 0 means that input contributes nothing, so the corresponding variable is dispensable. Regularization therefore tends to drive many weights to 0, which is extremely useful for variable selection when there are many candidate variables.
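– As a quick numeric illustration (a standalone sketch with made-up numbers, not the course data), the three loss functions above can be computed directly:

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
b0 <- 0.2
b1 <- 1.9
lambda <- 0.5

y_hat <- b0 + b1 * x
mse_loss <- sum((y - y_hat)^2) / (2 * length(y))     # the original (unpenalized) loss

l1_loss <- mse_loss + lambda * (abs(b0) + abs(b1))   # LASSO loss
l2_loss <- mse_loss + lambda / 2 * (b0^2 + b1^2)     # ridge loss

c(mse = mse_loss, l1 = l1_loss, l2 = l2_loss)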

Section 1: Regularization (3)

– Substituting the prediction function into the loss function:

\[loss = \frac{{1}}{2n}\sum \limits_{i=1}^{n} \left(y_{i} - \left(b_{0} + b_{1}x_{i}\right)\right)^{2} + \lambda ( |b_0| + |b_1|)\]

The partial derivatives with respect to \(b_0\) and \(b_1\) (derivation omitted):

\[ \begin{align} \frac{\partial}{\partial b_0}loss & = - \frac{{1}}{n} \sum \limits_{i=1}^{n}\left(y_{i} - \left(b_{0} + b_{1}x_{i}\right)\right) + \lambda \frac{b_0}{|b_0|} \\ \frac{\partial}{\partial b_1}loss & = - \frac{{1}}{n} \sum \limits_{i=1}^{n}\left(y_{i} - \left(b_{0} + b_{1}x_{i}\right)\right)\cdot x_{i} + \lambda \frac{b_1}{|b_1|} \end{align} \]

– Also, note that the penalty term seems to provide only a direction, not a magnitude. This is because the derivative of the absolute value function is always either \(1\) or \(-1\).

– Therefore, during training, if some parameter \(b\) is of very low importance, the gradient contributed by the original loss term (the left part) will be too small, and the value of \(b\) will simply oscillate back and forth between \(\lambda\) and \(-\lambda\).

– Since we know perfectly well that values lying between \(\lambda\) and \(-\lambda\) are meaningless, we can set every parameter \(b\) falling within that band to \(0\), which amounts to saying that the corresponding feature \(x\) carries no useful information at all.

– From this point of view, L1 Regularization has a strong "sparsifying" effect and can be used for variable selection.
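– A minimal single-parameter sketch (with a made-up, tiny data gradient; not the course data) shows this behaviour: once the data gradient is much smaller than \(\lambda\), the update is dominated by the \(\pm\lambda\) term and \(b\) just bounces around 0, which is exactly why such coefficients can be clipped to 0:

b <- 0.3
lr <- 0.1
lambda <- 0.05
trace_b <- numeric(60)

for (i in 1:60) {
  data_grad <- 0.001                            # an "unimportant" feature: almost no signal
  b <- b - lr * (data_grad + lambda * sign(b))  # the penalty term dominates the update
  trace_b[i] <- b
}

tail(trace_b)   # b keeps bouncing in a small band around 0 instead of converging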

Section 1: Regularization (4)

– This problem can be addressed by L2 Regularization. Let us again look at its partial derivatives with respect to \(b_0\) and \(b_1\):

\[loss = \frac{{1}}{2n}\sum \limits_{i=1}^{n} \left(y_{i} - \left(b_{0} + b_{1}x_{i}\right)\right)^{2} + \frac{\lambda}{2} ( b_0^2 + b_1^2)\]

The partial derivatives with respect to \(b_0\) and \(b_1\) (derivation omitted):

\[ \begin{align} \frac{\partial}{\partial b_0}loss & = - \frac{{1}}{n} \sum \limits_{i=1}^{n}\left(y_{i} - \left(b_{0} + b_{1}x_{i}\right)\right) + \lambda b_0\\ \frac{\partial}{\partial b_1}loss & = - \frac{{1}}{n} \sum \limits_{i=1}^{n}\left(y_{i} - \left(b_{0} + b_{1}x_{i}\right)\right)\cdot x_{i} + \lambda b_1 \end{align} \]

– Clearly, when \(\lambda\) is large, every weight has difficulty moving away from 0. Unlike L1 Regularization, however, using L2 Regularization alone often leaves many unimportant parameters \(b\) with small but non-zero values that can never be discarded.
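– The same single-parameter sketch with an L2 penalty (again, made-up numbers) shows the contrast: the coefficient decays geometrically toward 0 but never actually reaches it, so nothing is ever selected out:

b <- 0.3
lr <- 0.1
lambda <- 0.05

for (i in 1:200) {
  b <- b * (1 - lr * lambda)   # pure weight-decay update from the L2 term
}

b   # about 0.3 * 0.995^200, i.e. roughly 0.11: shrunken, but never exactly 0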

Exercise 1: Applying regularization in the Cox proportional hazards model (1)

– We use only the first 50 records to speed things up:

subdat <- dat[!(dat[,"time"] %in% c(NA, 0)) & !(dat[,"death"] %in% NA) & !(dat[,"AGE"] %in% NA) & !(dat[,"Rate"] %in% NA),]
subdat <- subdat[1:50,]
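# survival_encode(): sorts subjects by follow-up time and builds two n x n matrices:
# each row of "mask" flags that subject's risk set (rows of censored subjects are zeroed
# out), while "label" flags each subject's own column. Both are mapped back to the
# original subject order and are used below for the Cox partial-likelihood gradient.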
survival_encode <- function (event, time) {
  
  sub_dat <- data.frame(id = 1:length(event), event = event, time = time, stringsAsFactors = FALSE)
  sub_dat <- sub_dat[order(sub_dat[,'event'], decreasing = TRUE),]
  sub_dat <- sub_dat[order(sub_dat[,'time']),]
  
  label <- array(0L, dim = rep(nrow(sub_dat), 2))
  mask <- array(0L, dim = rep(nrow(sub_dat), 2))
  
  diag(label) <- 1L
  
  for (m in 1:nrow(sub_dat)) {
    
    mask[m,m:nrow(sub_dat)] <- 1L
    
    if (sub_dat[m,'event'] == 0) {
      
      mask[m,] <- 0L
      
    } else if (m > 1) {
      
      zero_pos <- which(sub_dat[1:(m-1),'time'] == sub_dat[m,'time'])
      
      if (length(zero_pos) > 0) {
        
        mask[zero_pos,m] <- 0L
        
      }
      
    }
    
  }
  
  original_pos <- order(sub_dat[,'id'])
  
  return(list(label = label[,original_pos,drop = FALSE], mask = mask[,original_pos,drop = FALSE]))
  
}

Exercise 1: Applying regularization in the Cox proportional hazards model (2)

X_mat <- matrix(as.matrix(subdat[,c("AGE", "Rate")]), ncol = 2)
Y_list <- survival_encode(event = subdat[,'death'], time = subdat[,'time'])
Y_mat <- Y_list$label
M_mat <- Y_list$mask

num.iteration <- 2000 
lr <- 0.01
B_mat <- matrix(0, nrow = ncol(X_mat), ncol = 1)

for (i in 1:num.iteration) {
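  # Full-batch gradient descent on the negative Cox partial log-likelihood:
  # P14_mat sums, over events, the risk-set-weighted mean of X (weights exp(H_mat));
  # P24_mat sums each event subject's own X; grad_B is their difference.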
  
  H_mat <- matrix(X_mat %*% B_mat, nrow = nrow(Y_mat), ncol = ncol(Y_mat), byrow = TRUE)

  P11_mat <- M_mat * exp(H_mat)
  P12_mat <- P11_mat %*% X_mat
  P13_mat <- P11_mat %*% matrix(1, nrow = nrow(X_mat), ncol = ncol(X_mat))
  P14_mat <- apply(P12_mat / P13_mat, 2, sum, na.rm = TRUE)
  
  P21_mat <- M_mat * Y_mat * exp(H_mat)
  P22_mat <- P21_mat %*% X_mat
  P23_mat <- P21_mat %*% matrix(1, nrow = nrow(X_mat), ncol = ncol(X_mat))
  P24_mat <- apply(P22_mat / P23_mat, 2, sum, na.rm = TRUE)

  grad_B <- P14_mat - P24_mat
  
  B_mat <- B_mat - lr * grad_B / nrow(X_mat)
  
}

print(B_mat)
##             [,1]
## [1,] 0.037072488
## [2,] 0.008077607

Exercise 1 Answer (1)

– This is the result using L1 Regularization:

lambda <- 0.005
num.iteration <- 2000 
lr <- 0.01
B_mat <- matrix(0, nrow = ncol(X_mat), ncol = 1)

for (i in 1:num.iteration) {
  
  H_mat <- matrix(X_mat %*% B_mat, nrow = nrow(Y_mat), ncol = ncol(Y_mat), byrow = TRUE)

  P11_mat <- M_mat * exp(H_mat)
  P12_mat <- P11_mat %*% X_mat
  P13_mat <- P11_mat %*% matrix(1, nrow = nrow(X_mat), ncol = ncol(X_mat))
  P14_mat <- apply(P12_mat / P13_mat, 2, sum, na.rm = TRUE)
  
  P21_mat <- M_mat * Y_mat * exp(H_mat)
  P22_mat <- P21_mat %*% X_mat
  P23_mat <- P21_mat %*% matrix(1, nrow = nrow(X_mat), ncol = ncol(X_mat))
  P24_mat <- apply(P22_mat / P23_mat, 2, sum, na.rm = TRUE)

  grad_B <- P14_mat - P24_mat
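  # Below: the extra "- lambda * sign(B_mat)" term is the L1 (LASSO) subgradient step,
  # pulling every coefficient toward 0 by a fixed amount per iteration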
  
  B_mat <- B_mat - lr * grad_B / nrow(X_mat) - lambda * sign(B_mat)
  
}

print(B_mat)
##             [,1]
## [1,] 0.023285025
## [2,] 0.003745863

– The L1 Regularization result tells us that the Rate variable is actually not important (its coefficient is pushed toward 0)!

Exercise 1 Answer (2)
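– This is the result using L2 Regularization: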

lambda <- 0.05
num.iteration <- 2000 
lr <- 0.01
B_mat <- matrix(0, nrow = ncol(X_mat), ncol = 1)

for (i in 1:num.iteration) {
  
  H_mat <- matrix(X_mat %*% B_mat, nrow = nrow(Y_mat), ncol = ncol(Y_mat), byrow = TRUE)

  P11_mat <- M_mat * exp(H_mat)
  P12_mat <- P11_mat %*% X_mat
  P13_mat <- P11_mat %*% matrix(1, nrow = nrow(X_mat), ncol = ncol(X_mat))
  P14_mat <- apply(P12_mat / P13_mat, 2, sum, na.rm = TRUE)
  
  P21_mat <- M_mat * Y_mat * exp(H_mat)
  P22_mat <- P21_mat %*% X_mat
  P23_mat <- P21_mat %*% matrix(1, nrow = nrow(X_mat), ncol = ncol(X_mat))
  P24_mat <- apply(P22_mat / P23_mat, 2, sum, na.rm = TRUE)

  grad_B <- P14_mat - P24_mat
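  # Below: the extra "- lambda * B_mat" term is the L2 (ridge) step (weight decay),
  # shrinking each coefficient proportionally rather than by a fixed amount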
  
  B_mat <- B_mat - lr * grad_B / nrow(X_mat) - lambda * B_mat
  
}

print(B_mat)
##             [,1]
## [1,] 0.031659045
## [2,] 0.008512594

– You can try different values of \(\lambda\) yourself.

Section 2: The Elastic Net (1)

– Can the two penalties be combined? The answer is of course yes; let us first look at what each of them means visually.

– Using the data from before, let us look at the loss-surface plots of the "main loss function", "L1 Regularization", and "L2 Regularization" separately; the white dot marks the optimum obtained without any regularization.

– Now combine them: the left panel uses L1 Regularization only, the middle panel uses L2 Regularization only, and the right panel uses half of each:

Section 2: The Elastic Net (2)

– Taking linear regression as the example, we rewrite the loss function as:

\[ \begin{align} \hat{y_i} & = b_{0} + b_{1}x_i \\ loss & = \frac{{1}}{2n}\sum \limits_{i=1}^{n} \left(y_{i} - \hat{y_i}\right)^{2} + \alpha \lambda ( |b_0| + |b_1|) + \frac{(1 - \alpha) \lambda}{2} ( b_0^2 + b_1^2) \end{align} \]
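– A tiny helper (a sketch; glmnet applies this penalty internally) makes the role of \(\alpha\) explicit: \(\alpha = 1\) gives the pure L1 (LASSO) penalty and \(\alpha = 0\) the pure L2 (ridge) penalty:

elastic_penalty <- function(b, lambda, alpha) {
  lambda * (alpha * sum(abs(b)) + (1 - alpha) / 2 * sum(b^2))
}

elastic_penalty(c(0.2, 1.9), lambda = 0.5, alpha = 1)    # pure LASSO penalty
elastic_penalty(c(0.2, 1.9), lambda = 0.5, alpha = 0)    # pure ridge penalty
elastic_penalty(c(0.2, 1.9), lambda = 0.5, alpha = 0.5)  # half of each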

Section 2: The Elastic Net (3)

library(glmnet)

X_mat <- matrix(as.matrix(subdat[,c("AGE", "Rate")]), ncol = 2)
Y_mat <- matrix(as.matrix(subdat[,c("time", "death")]), ncol = 2)
colnames(Y_mat) <- c('time', 'status')

– This is the result with no regularization at all:

fit_glmnet <- glmnet(x = X_mat, y = Y_mat, family = 'cox', lambda = 0, alpha = 0)
coef(fit_glmnet)
## 2 x 1 sparse Matrix of class "dgCMatrix"
##             s0
## V1 0.036989411
## V2 0.007978329

– This is the result using L2 Regularization:

fit_glmnet <- glmnet(x = X_mat, y = Y_mat, family = 'cox', lambda = 0.05, alpha = 0)
coef(fit_glmnet)
## 2 x 1 sparse Matrix of class "dgCMatrix"
##             s0
## V1 0.024919251
## V2 0.008261397

– This is the result using L1 Regularization:

fit_glmnet <- glmnet(x = X_mat, y = Y_mat, family = 'cox', lambda = 0.05, alpha = 1)
coef(fit_glmnet)
## 2 x 1 sparse Matrix of class "dgCMatrix"
##            s0
## V1 0.01422007
## V2 .

– And of course you can give L1 and L2 Regularization half of the weight each:

fit_glmnet <- glmnet(x = X_mat, y = Y_mat, family = 'cox', lambda = 0.05, alpha = 0.5)
coef(fit_glmnet)
## 2 x 1 sparse Matrix of class "dgCMatrix"
##             s0
## V1 0.020567344
## V2 0.003837489

Section 2: The Elastic Net (4)

  1. Example with linear regression:
subdat <- dat[!(dat[,'K'] %in% NA) & !(dat[,'GENDER'] %in% NA) & !(dat[,'AGE'] %in% NA), c('K', 'GENDER', 'AGE')]
subdat[,'GENDER'] <- as.factor(subdat[,'GENDER'])

X_mat <- model.matrix(~ ., data = subdat[,-1])
X_mat <- X_mat[,-1]
Y_mat <- matrix(as.matrix(subdat[,"K"]), ncol = 1)

fit_glmnet <- glmnet(x = X_mat, y = Y_mat, family = 'gaussian', lambda = 0.01, alpha = 0.5)
coef(fit_glmnet)
  2. Example with logistic regression:
subdat <- dat[!(dat[,'LVD'] %in% NA) & !(dat[,'GENDER'] %in% NA) & !(dat[,'AGE'] %in% NA), c('LVD', 'GENDER', 'AGE')]
subdat[,'GENDER'] <- as.factor(subdat[,'GENDER'])

X_mat <- model.matrix(~ ., data = subdat[,-1])
X_mat <- X_mat[,-1]
Y_mat <- matrix(as.matrix(subdat[,"LVD"]), ncol = 1)

fit_glmnet <- glmnet(x = X_mat, y = Y_mat, family = 'binomial', lambda = 0.01, alpha = 0.5)
coef(fit_glmnet)
  3. Example with softmax regression:
subdat <- dat[!(dat[,'AMI'] %in% NA) & !(dat[,'GENDER'] %in% NA) & !(dat[,'AGE'] %in% NA), c('AMI', 'GENDER', 'AGE')]
subdat[,'GENDER'] <- as.factor(subdat[,'GENDER'])

X_mat <- model.matrix(~ ., data = subdat[,-1])
X_mat <- X_mat[,-1]
Y_mat <- matrix(as.matrix(subdat[,"AMI"]), ncol = 1)

fit_glmnet <- glmnet(x = X_mat, y = Y_mat, family = 'multinomial', lambda = 0.01, alpha = 0.5)
coef(fit_glmnet)
  4. Example with the Cox model:
subdat <- dat[!(dat[,'time'] %in% NA) & !(dat[,'death'] %in% NA) & !(dat[,'GENDER'] %in% NA) & !(dat[,'AGE'] %in% NA), c('time', 'death', 'GENDER', 'AGE')]
subdat[,'GENDER'] <- as.factor(subdat[,'GENDER'])

X_mat <- model.matrix(~ ., data = subdat[,-c(1, 2)])
X_mat <- X_mat[,-1]
Y_mat <- matrix(as.matrix(subdat[,c("time", "death")]), ncol = 2)
colnames(Y_mat) <- c('time', 'status')

fit_glmnet <- glmnet(x = X_mat, y = Y_mat, family = 'cox', lambda = 0.01, alpha = 0.5)
coef(fit_glmnet)

Exercise 2: A data science experiment

– For convenience, we fix \(\lambda = 0.001\) for this experiment.

subdat <- dat[apply(is.na(dat[,-1:-3]), 1, sum) %in% 0, -1:-3]

subdat[,'GENDER'] <- as.factor(subdat[,'GENDER'])

X_mat <- model.matrix(~ ., data = subdat[,-1:-2])
X_mat <- X_mat[,-1]

Y_mat <- matrix(as.matrix(subdat[,c("time", "death")]), ncol = 2)
colnames(Y_mat) <- c('time', 'status')

set.seed(0)
all_idx <- 1:nrow(subdat)

train_idx <- sample(all_idx, nrow(subdat) * 0.6)
valid_idx <- sample(all_idx[!all_idx %in% train_idx], nrow(subdat) * 0.2)
test_idx <- all_idx[!all_idx %in% c(train_idx, valid_idx)]

train_X <- X_mat[train_idx,]
valid_X <- X_mat[valid_idx,]
test_X <- X_mat[test_idx,]

train_Y <- Y_mat[train_idx,]
valid_Y <- Y_mat[valid_idx,]
test_Y <- Y_mat[test_idx,]

Exercise 2 Answer (1)

LASSO_fit <- glmnet(x = train_X, y = train_Y, family = 'cox', lambda = 0.001, alpha = 1)
Ridge_fit <- glmnet(x = train_X, y = train_Y, family = 'cox', lambda = 0.001, alpha = 0)
Elastic_fit <- glmnet(x = train_X, y = train_Y, family = 'cox', lambda = 0.001, alpha = 0.5)
library(survival)

pred.LASSO <- predict(LASSO_fit, valid_X)
pred.Ridge <- predict(Ridge_fit, valid_X)
pred.Elastic <- predict(Elastic_fit, valid_X)

concordance(coxph(Surv(valid_Y[,'time'], valid_Y[,'status']) ~ pred.LASSO))[['concordance']]
## [1] 0.7249698
concordance(coxph(Surv(valid_Y[,'time'], valid_Y[,'status']) ~ pred.Ridge))[['concordance']]
## [1] 0.7247502
concordance(coxph(Surv(valid_Y[,'time'], valid_Y[,'status']) ~ pred.Elastic))[['concordance']]
## [1] 0.7246403

Exercise 2 Answer (2)

pred.test <- predict(LASSO_fit, test_X)
concordance(coxph(Surv(test_Y[,'time'], test_Y[,'status']) ~ pred.test))[['concordance']]
## [1] 0.7287844
library(pROC)

par(mfrow = c(1, 3))
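# For each horizon t, subjects followed beyond t are coded 0 (event-free at t) and
# subjects censored before t are set to NA so that they are excluded from the ROC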

for (current_t in c(365, 730, 1825)) {
  
  valid_y <- valid_Y[,'status']
  valid_y[valid_Y[,'time'] > current_t] <- 0
  valid_y[valid_Y[,'time'] <= current_t & valid_Y[,'status'] %in% 0] <- NA
  
  test_y <- test_Y[,'status']
  test_y[test_Y[,'time'] > current_t] <- 0
  test_y[test_Y[,'time'] <= current_t & test_Y[,'status'] %in% 0] <- NA
  
  roc_valid <- roc(valid_y ~ pred.LASSO)
  best_pos <- which.max(roc_valid$sensitivities + roc_valid$specificities)
  best_cut <- roc_valid$thresholds[best_pos]
  
  tab_test <- table(factor(pred.test > best_cut, levels = c(FALSE, TRUE)), factor(test_y, levels = 0:1))
  sens <- tab_test[2,2] / sum(tab_test[,2], 1e-8)
  spec <- tab_test[1,1] / sum(tab_test[,1], 1e-8)
  
  roc_test <- roc(test_y ~ pred.test)
  plot(roc_test, main = paste0('t = ', formatC(current_t, digits = 0, format = 'f')))
  
  points(spec, sens, pch = 19)
  text(0.5, 0.5, paste0('Cut = ', formatC(best_cut, digits = 3, format = 'f'),
                        '\nSens = ', formatC(sens, digits = 3, format = 'f'),
                        '\nSpec = ', formatC(spec, digits = 3, format = 'f'),
                        '\nAUC = ', formatC(roc_test$auc, digits = 3, format = 'f')), col = 'red')
  
}

Section 3: Cross-Validation (1)

– Running cross-validation on the "training set" is a good way to search for hyperparameters.

– Its logic is illustrated in the figure below:

[Figure F01: cross-validation workflow]

– In k-fold cross-validation, different combinations of the data are used to validate the model you train. For example, with 100 samples you might first train on the first 90 and test on the remaining 10, then hold out samples 81 to 90 as the test set instead, and keep repeating this rotation. You thereby obtain several different train/test combinations, giving you more data with which to validate.
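– A minimal sketch of how such folds can be built by hand (hypothetical toy indices; the cv.glmnet() call below handles this for us):

set.seed(0)
fold_id <- sample(rep(1:10, length.out = 100))   # randomly assign each of 100 samples to one of 10 folds

for (k in 1:10) {
  fold_test_idx  <- which(fold_id == k)          # the held-out fold
  fold_train_idx <- which(fold_id != k)          # the remaining 9 folds
  # fit the model on fold_train_idx and evaluate it on fold_test_idx here
}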

Section 3: Cross-Validation (2)

– We again use the same samples and run cross-validation to find the best value of \(\lambda\).

– Note in particular that the function "cv.glmnet()" does not search over the parameter \(\alpha\); if you want to search over it, you have to do so manually.

cv_fit <- cv.glmnet(x = train_X, y = train_Y, family = 'cox', lambda = exp(seq(-5, -1, length.out = 100)), alpha = 0.5, nfolds = 5)
plot(cv_fit)

– If you want to extract the fitted coefficients, you can use the following commands:

coef(cv_fit, s = "lambda.min")
coef(cv_fit, s = "lambda.1se")

Section 3: Cross-Validation (3)

pred.min <- predict(cv_fit, valid_X, s = "lambda.min")
pred.1se <- predict(cv_fit, valid_X, s = "lambda.1se")

concordance(coxph(Surv(valid_Y[,'time'], valid_Y[,'status']) ~ pred.min))[['concordance']]
## [1] 0.7415895
concordance(coxph(Surv(valid_Y[,'time'], valid_Y[,'status']) ~ pred.1se))[['concordance']]
## [1] 0.7476297

Exercise 3: A hyperparameter tuning experiment

– In this experiment we will search over \(\alpha = 0, \space 0.1, \space 0.2, \space \cdots, \space 1.0\).

– For completeness, this time we impute the missing data using multiple imputation (the mice package):

library(mice)

subdat <- dat[!(dat[,"time"] %in% NA) & !(dat[,"death"] %in% NA),]
subdat[,'GENDER'] <- as.factor(subdat[,'GENDER'])
for (i in 1:31) {subdat[,paste0('rhythm.', i)] <- as.factor(subdat[,paste0('rhythm.', i)])}

used_dat.x <- subdat[,-1:-5]
mice_dat <- mice(used_dat.x, m = 1, maxit = 5, meth = 'cart', seed = 123, printFlag = FALSE)
impute_dat.x <- mice::complete(mice_dat, action = 1)

X_mat <- model.matrix(~ ., data = impute_dat.x)
X_mat <- X_mat[,-1]

Y_mat <- matrix(as.matrix(subdat[,c("time", "death")]), ncol = 2)
colnames(Y_mat) <- c('time', 'status')

set.seed(0)
all_idx <- 1:nrow(subdat)

train_idx <- sample(all_idx, nrow(subdat) * 0.6)
valid_idx <- sample(all_idx[!all_idx %in% train_idx], nrow(subdat) * 0.2)
test_idx <- all_idx[!all_idx %in% c(train_idx, valid_idx)]

train_X <- X_mat[train_idx,]
valid_X <- X_mat[valid_idx,]
test_X <- X_mat[test_idx,]

train_Y <- Y_mat[train_idx,]
valid_Y <- Y_mat[valid_idx,]
test_Y <- Y_mat[test_idx,]

Exercise 3 Answer (1)

library(glmnet)
library(survival)

result_elasticnet <- data.frame(alpha = seq(0, 1, by = 0.1), lambda = NA, c_idx = NA)
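# For each candidate alpha: run 5-fold CV on the training set to pick lambda, then keep
# whichever of lambda.min / lambda.1se gives the higher C-index on the validation set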

for (i in 1:nrow(result_elasticnet)) {
  
  cv_fit <- cv.glmnet(x = train_X, y = train_Y, family = 'cox', lambda = exp(seq(-5, -1, length.out = 100)), alpha = result_elasticnet[i, 'alpha'], nfolds = 5)
  
  pred.min <- predict(cv_fit, valid_X, s = "lambda.min")
  pred.1se <- predict(cv_fit, valid_X, s = "lambda.1se")
  
  c_idx.min <- concordance(coxph(Surv(valid_Y[,'time'], valid_Y[,'status']) ~ pred.min))[['concordance']]
  c_idx.1se <- concordance(coxph(Surv(valid_Y[,'time'], valid_Y[,'status']) ~ pred.1se))[['concordance']]
  
  if (c_idx.1se > c_idx.min) {
    
    result_elasticnet[i, 'lambda'] <- cv_fit$lambda.1se
    result_elasticnet[i, 'c_idx'] <- c_idx.1se
    
  } else {
    
    result_elasticnet[i, 'lambda'] <- cv_fit$lambda.min
    result_elasticnet[i, 'c_idx'] <- c_idx.min
    
  }
  
}

result_elasticnet
##    alpha      lambda     c_idx
## 1    0.0 0.055078827 0.7547908
## 2    0.1 0.185098218 0.7690074
## 3    0.2 0.064740148 0.7665125
## 4    0.3 0.057349804 0.7678030
## 5    0.4 0.046859287 0.7680826
## 6    0.5 0.038287712 0.7676524
## 7    0.6 0.039866368 0.7666631
## 8    0.7 0.023577209 0.7658243
## 9    0.8 0.022643583 0.7668351
## 10   0.9 0.028855503 0.7627057
## 11   1.0 0.006737947 0.7626627

– We now apply the result to the "test set".

Exercise 3 Answer (2)

best_pos <- which.max(result_elasticnet[,'c_idx'])

final_fit <- glmnet(x = train_X, y = train_Y, family = 'cox', lambda = result_elasticnet[best_pos,'lambda'], alpha = result_elasticnet[best_pos,'alpha'])

pred.valid <- predict(final_fit, valid_X)
pred.test <- predict(final_fit, test_X)

concordance(coxph(Surv(test_Y[,'time'], test_Y[,'status']) ~ pred.test))[['concordance']]
## [1] 0.7695282
library(pROC)

par(mfrow = c(1, 3))

for (current_t in c(365, 730, 1825)) {
  
  valid_y <- valid_Y[,'status']
  valid_y[valid_Y[,'time'] > current_t] <- 0
  valid_y[valid_Y[,'time'] <= current_t & valid_Y[,'status'] %in% 0] <- NA
  
  test_y <- test_Y[,'status']
  test_y[test_Y[,'time'] > current_t] <- 0
  test_y[test_Y[,'time'] <= current_t & test_Y[,'status'] %in% 0] <- NA
  
  roc_valid <- roc(valid_y ~ pred.valid)
  best_pos <- which.max(roc_valid$sensitivities + roc_valid$specificities)
  best_cut <- roc_valid$thresholds[best_pos]
  
  tab_test <- table(factor(pred.test > best_cut, levels = c(FALSE, TRUE)), factor(test_y, levels = 0:1))
  sens <- tab_test[2,2] / sum(tab_test[,2], 1e-8)
  spec <- tab_test[1,1] / sum(tab_test[,1], 1e-8)
  
  roc_test <- roc(test_y ~ pred.test)
  plot(roc_test, main = paste0('t = ', formatC(current_t, digits = 0, format = 'f')))
  
  points(spec, sens, pch = 19)
  text(0.5, 0.5, paste0('Cut = ', formatC(best_cut, digits = 3, format = 'f'),
                        '\nSens = ', formatC(sens, digits = 3, format = 'f'),
                        '\nSpec = ', formatC(spec, digits = 3, format = 'f'),
                        '\nAUC = ', formatC(roc_test$auc, digits = 3, format = 'f')), col = 'red')
  
}

Lesson Summary

– Compared with plain statistical models, the advantage here is that there are enough hyperparameters (\(\lambda\) and \(\alpha\)) to search for the best model; after the data science experiments in the exercises, you should now have a clearer picture of the full workflow.

– In addition, L1 Regularization can help us select the important variables, providing an alternative to traditional p-value-based variable selection.

– These models, however, are still linear in their inputs. One remedy is to engineer features in advance, but that remains difficult, so we will need some nonlinear models that can do the feature engineering for us "automatically".