林嶔 (Lin, Chin)
Lesson 11 Introduction to Machine Learning 3 (Regularization and the Elastic Net)
– Thanks to their mathematical similarity, the four statistical models we have already covered can all be extended with the same technique, which is very convenient.
– In statistics, a whole family of methods relies on regularization, mainly to address the "many variables, few cases" situation. The regularization techniques commonly used in statistics fall into two classes:
L1 Regularization - a regression model using it is called LASSO regression
L2 Regularization - a regression model using it is called Ridge regression
– Please download the example data here
All these statistical terms make the topic sound rather mysterious (which works well if you want to bluff people or show off a little), but written as mathematics it is actually very simple: it just modifies the loss function.
Taking linear regression as an example, here are the standard prediction function and loss function:
\[ \begin{align} \hat{y_i} & = b_{0} + b_{1}x_i \\ loss & = \frac{{1}}{2n}\sum \limits_{i=1}^{n} \left(y_{i} - \hat{y_i}\right)^{2} \end{align} \]
– L1 Regularization simply adds a correction to the \(loss\) term (here \(\lambda\) is a penalty weight):
\[ \begin{align} loss & = \frac{{1}}{2n}\sum \limits_{i=1}^{n} \left(y_{i} - \hat{y_i}\right)^{2} + \lambda ( |b_0| + |b_1|) \end{align} \]
– L2 Regularization is very similar to L1 Regularization; only the form of the correction differs:
\[ \begin{align} loss & = \frac{{1}}{2n}\sum \limits_{i=1}^{n} \left(y_{i} - \hat{y_i}\right)^{2} + \frac{\lambda}{2} ( b_0^2 + b_1^2) \end{align} \]
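– Before moving on, it may help to see the two penalized losses written directly as R code. This is only a minimal sketch for a one-predictor linear regression; the function names l1_loss()/l2_loss() and their arguments are invented for illustration and are not part of the lesson's code.
l1_loss <- function(b, x, y, lambda) {
  y_hat <- b[1] + b[2] * x
  sum((y - y_hat)^2) / (2 * length(y)) + lambda * sum(abs(b))     # squared-error loss + L1 penalty
}
l2_loss <- function(b, x, y, lambda) {
  y_hat <- b[1] + b[2] * x
  sum((y - y_hat)^2) / (2 * length(y)) + lambda / 2 * sum(b^2)    # squared-error loss + L2 penalty
}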
– Now consider the practical meaning: a weight of 0 means that input does not matter, so the corresponding variable is dispensable. Regularization therefore tends to push many weights to 0, which is very useful when there are many variables and we need to select among them.
– Plugging the prediction function into the loss function
\[loss = \frac{{1}}{2n}\sum \limits_{i=1}^{n} \left(y_{i} - \left(b_{0} + b_{1}x_{i}\right)\right)^{2} + \lambda ( |b_0| + |b_1|)\]
– Partial derivatives with respect to \(b_0\) and \(b_1\) (derivation omitted)
\[ \begin{align} \frac{\partial}{\partial b_0}loss & = - \frac{{1}}{n} \sum \limits_{i=1}^{n}\left(y_{i} - \left(b_{0} + b_{1}x_{i}\right)\right) + \lambda \frac{b_0}{|b_0|} \\ \frac{\partial}{\partial b_1}loss & = - \frac{{1}}{n} \sum \limits_{i=1}^{n}\left(y_{i} - \left(b_{0} + b_{1}x_{i}\right)\right)\cdot x_{i} + \lambda \frac{b_1}{|b_1|} \end{align} \]
– Notice that the penalty term only supplies a direction, not a magnitude: the derivative of the absolute-value function is always either \(1\) or \(-1\).
– Consequently, if some parameter \(b\) is really of little importance, the gradient contributed by the original loss (the left-hand part) will be too small during training, and the value of \(b\) will simply oscillate between \(\lambda\) and \(-\lambda\).
– Since we know perfectly well that values between \(-\lambda\) and \(\lambda\) are meaningless, we can set every parameter \(b\) that falls inside that interval to \(0\), which implies that the feature \(x\) it belongs to carries no information at all.
– From this point of view, L1 Regularization has a strong "sparsifying" effect and can be used for variable selection.
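– Here is a minimal numerical sketch of this "oscillate, then set to zero" behaviour (the simulated data and object names are invented purely for illustration): x2 is pure noise, its coefficient keeps bouncing around inside \(\pm\lambda\), and we therefore zero it out at the end.
set.seed(123)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)                                    # x2 is unrelated to y
y <- 1 + 2 * x1 + rnorm(n)
b <- c(0, 0, 0)                                   # b0, b1, b2
lambda <- 0.1
lr <- 0.05
for (i in 1:5000) {
  y_hat <- b[1] + b[2] * x1 + b[3] * x2
  grad <- c(-mean(y - y_hat),
            -mean((y - y_hat) * x1),
            -mean((y - y_hat) * x2)) + lambda * sign(b)
  b <- b - lr * grad
}
b[abs(b) < lambda] <- 0                           # zero out coefficients stuck inside (-lambda, lambda)
print(b)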
– This problem, on the other hand, can be handled with L2 Regularization; let's look at its partial derivatives with respect to \(b_0\) and \(b_1\):
\[loss = \frac{{1}}{2n}\sum \limits_{i=1}^{n} \left(y_{i} - \left(b_{0} + b_{1}x_{i}\right)\right)^{2} + \frac{\lambda}{2} ( b_0^2 + b_1^2)\]
– Partial derivatives with respect to \(b_0\) and \(b_1\) (derivation omitted)
\[ \begin{align} \frac{\partial}{\partial b_0}loss & = - \frac{{1}}{n} \sum \limits_{i=1}^{n}\left(y_{i} - \left(b_{0} + b_{1}x_{i}\right)\right) + \lambda b_0\\ \frac{\partial}{\partial b_1}loss & = - \frac{{1}}{n} \sum \limits_{i=1}^{n}\left(y_{i} - \left(b_{0} + b_{1}x_{i}\right)\right)\cdot x_{i} + \lambda b_1 \end{align} \]
– Clearly, when \(\lambda\) is large it is hard for any weight to move away from 0. Unlike L1 Regularization, however, L2 Regularization on its own usually leaves many unimportant parameters \(b\) with small non-zero values that can never be discarded.
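– For comparison, here is the same toy data (again invented purely for illustration) trained with the L2 penalty: the coefficient of the noise variable x2 shrinks toward 0 but never becomes exactly 0.
set.seed(123)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)                                    # x2 is again pure noise
y <- 1 + 2 * x1 + rnorm(n)
b <- c(0, 0, 0)
lambda <- 0.1
lr <- 0.05
for (i in 1:5000) {
  y_hat <- b[1] + b[2] * x1 + b[3] * x2
  grad <- c(-mean(y - y_hat),
            -mean((y - y_hat) * x1),
            -mean((y - y_hat) * x2)) + lambda * b # L2: the penalty gradient is lambda * b
  b <- b - lr * grad
}
print(b)                                          # b[3] is small but not exactly 0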
– We use only the first 50 records to speed things up:
subdat <- dat[!(dat[,"time"] %in% c(NA, 0)) & !(dat[,"death"] %in% NA) & !(dat[,"AGE"] %in% NA) & !(dat[,"Rate"] %in% NA),]
subdat <- subdat[1:50,]
survival_encode <- function (event, time) {
  sub_dat <- data.frame(id = 1:length(event), event = event, time = time, stringsAsFactors = FALSE)
  # Sort subjects by follow-up time, keeping events ahead of censored records at tied times
  sub_dat <- sub_dat[order(sub_dat[,'event'], decreasing = TRUE),]
  sub_dat <- sub_dat[order(sub_dat[,'time']),]
  label <- array(0L, dim = rep(nrow(sub_dat), 2))
  mask <- array(0L, dim = rep(nrow(sub_dat), 2))
  diag(label) <- 1L                          # each row marks one subject (in time order)
  for (m in 1:nrow(sub_dat)) {
    mask[m,m:nrow(sub_dat)] <- 1L            # subjects still at risk at this time
    if (sub_dat[m,'event'] == 0) {
      mask[m,] <- 0L                         # censored records contribute no risk set of their own
    } else if (m > 1) {
      zero_pos <- which(sub_dat[1:(m-1),'time'] == sub_dat[m,'time'])
      if (length(zero_pos) > 0) {
        mask[zero_pos,m] <- 0L               # handle tied event times
      }
    }
  }
  # Put the columns back in the original subject order before returning
  original_pos <- order(sub_dat[,'id'])
  return(list(label = label[,original_pos,drop = FALSE], mask = mask[,original_pos,drop = FALSE]))
}
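– Before applying it to the real data, here is a tiny made-up example (4 subjects with invented times) just to inspect what the function returns:
toy <- survival_encode(event = c(1, 0, 1, 1), time = c(5, 3, 8, 3))
toy$label   # one row per subject (in time order); the 1 marks that subject's own column
toy$mask    # event rows mark the subjects still at risk at that time; censored rows are all zero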
X_mat <- matrix(as.matrix(subdat[,c("AGE", "Rate")]), ncol = 2)
Y_list <- survival_encode(event = subdat[,'death'], time = subdat[,'time'])
Y_mat <- Y_list$label
M_mat <- Y_list$mask
num.iteration <- 2000
lr <- 0.01
B_mat <- matrix(0, nrow = ncol(X_mat), ncol = 1)
for (i in 1:num.iteration) {
  # Linear predictor of every subject, replicated across the rows of the label matrix
  H_mat <- matrix(X_mat %*% B_mat, nrow = nrow(Y_mat), ncol = ncol(Y_mat), byrow = TRUE)
  # Risk-set weighted averages of the covariates ("expected" covariate values)
  P11_mat <- M_mat * exp(H_mat)
  P12_mat <- P11_mat %*% X_mat
  P13_mat <- P11_mat %*% matrix(1, nrow = nrow(X_mat), ncol = ncol(X_mat))
  P14_mat <- apply(P12_mat / P13_mat, 2, sum, na.rm = TRUE)
  # Covariate values of the subjects who actually had the event ("observed" values)
  P21_mat <- M_mat * Y_mat * exp(H_mat)
  P22_mat <- P21_mat %*% X_mat
  P23_mat <- P21_mat %*% matrix(1, nrow = nrow(X_mat), ncol = ncol(X_mat))
  P24_mat <- apply(P22_mat / P23_mat, 2, sum, na.rm = TRUE)
  # Gradient of the negative log partial likelihood: expected minus observed
  grad_B <- P14_mat - P24_mat
  B_mat <- B_mat - lr * grad_B / nrow(X_mat)
}
print(B_mat)
## [,1]
## [1,] 0.037072488
## [2,] 0.008077607
– This is the result with L1 Regularization:
lambda <- 0.005
num.iteration <- 2000
lr <- 0.01
B_mat <- matrix(0, nrow = ncol(X_mat), ncol = 1)
for (i in 1:num.iteration) {
  H_mat <- matrix(X_mat %*% B_mat, nrow = nrow(Y_mat), ncol = ncol(Y_mat), byrow = TRUE)
  P11_mat <- M_mat * exp(H_mat)
  P12_mat <- P11_mat %*% X_mat
  P13_mat <- P11_mat %*% matrix(1, nrow = nrow(X_mat), ncol = ncol(X_mat))
  P14_mat <- apply(P12_mat / P13_mat, 2, sum, na.rm = TRUE)
  P21_mat <- M_mat * Y_mat * exp(H_mat)
  P22_mat <- P21_mat %*% X_mat
  P23_mat <- P21_mat %*% matrix(1, nrow = nrow(X_mat), ncol = ncol(X_mat))
  P24_mat <- apply(P22_mat / P23_mat, 2, sum, na.rm = TRUE)
  grad_B <- P14_mat - P24_mat
  B_mat <- B_mat - lr * grad_B / nrow(X_mat) - lambda * sign(B_mat)   # extra L1 step: subtract lambda * sign(b)
}
print(B_mat)
## [,1]
## [1,] 0.023285025
## [2,] 0.003745863
– The L1 Regularization result tells us that the Rate variable is actually unimportant!
– Next, this is the result with L2 Regularization:
lambda <- 0.05
num.iteration <- 2000
lr <- 0.01
B_mat <- matrix(0, nrow = ncol(X_mat), ncol = 1)
for (i in 1:num.iteration) {
  H_mat <- matrix(X_mat %*% B_mat, nrow = nrow(Y_mat), ncol = ncol(Y_mat), byrow = TRUE)
  P11_mat <- M_mat * exp(H_mat)
  P12_mat <- P11_mat %*% X_mat
  P13_mat <- P11_mat %*% matrix(1, nrow = nrow(X_mat), ncol = ncol(X_mat))
  P14_mat <- apply(P12_mat / P13_mat, 2, sum, na.rm = TRUE)
  P21_mat <- M_mat * Y_mat * exp(H_mat)
  P22_mat <- P21_mat %*% X_mat
  P23_mat <- P21_mat %*% matrix(1, nrow = nrow(X_mat), ncol = ncol(X_mat))
  P24_mat <- apply(P22_mat / P23_mat, 2, sum, na.rm = TRUE)
  grad_B <- P14_mat - P24_mat
  B_mat <- B_mat - lr * grad_B / nrow(X_mat) - lambda * B_mat   # extra L2 step: subtract lambda * b
}
print(B_mat)
## [,1]
## [1,] 0.031659045
## [2,] 0.008512594
– You can experiment with different values of \(\lambda\) yourself.
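– As a starting point, here is a small sketch that simply re-runs the L1 loop above over a few \(\lambda\) values, reusing the X_mat, Y_mat and M_mat objects created earlier (the \(\lambda\) grid is arbitrary and chosen only for illustration):
lambda_grid <- c(0.001, 0.005, 0.01, 0.05)
coef_by_lambda <- sapply(lambda_grid, function(lambda) {
  B_mat <- matrix(0, nrow = ncol(X_mat), ncol = 1)
  for (i in 1:2000) {
    H_mat <- matrix(X_mat %*% B_mat, nrow = nrow(Y_mat), ncol = ncol(Y_mat), byrow = TRUE)
    P11_mat <- M_mat * exp(H_mat)
    P12_mat <- P11_mat %*% X_mat
    P13_mat <- P11_mat %*% matrix(1, nrow = nrow(X_mat), ncol = ncol(X_mat))
    P14_mat <- apply(P12_mat / P13_mat, 2, sum, na.rm = TRUE)
    P21_mat <- M_mat * Y_mat * exp(H_mat)
    P22_mat <- P21_mat %*% X_mat
    P23_mat <- P21_mat %*% matrix(1, nrow = nrow(X_mat), ncol = ncol(X_mat))
    P24_mat <- apply(P22_mat / P23_mat, 2, sum, na.rm = TRUE)
    grad_B <- P14_mat - P24_mat
    B_mat <- B_mat - 0.01 * grad_B / nrow(X_mat) - lambda * sign(B_mat)
  }
  B_mat
})
colnames(coef_by_lambda) <- lambda_grid
print(coef_by_lambda)   # each column holds the two coefficients under one lambda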
– Can we use L1 and L2 Regularization at the same time? The answer is of course yes; let's first look at what each of them means visually.
– Using the data from before, here are the loss surfaces of the "main loss function", the "L1 Regularization" term and the "L2 Regularization" term, where the white dot marks the optimum without any regularization.
– Now let's combine them: the left panel uses only L1 Regularization, the middle panel only L2 Regularization, and the right panel half of each:
– Taking linear regression as an example again, we rewrite the loss function:
\[ \begin{align} \hat{y_i} & = b_{0} + b_{1}x_i \\ loss & = \frac{{1}}{2n}\sum \limits_{i=1}^{n} \left(y_{i} - \hat{y_i}\right)^{2} + \alpha \lambda ( |b_0| + |b_1|) + \frac{(1 - \alpha) \lambda}{2} ( b_0^2 + b_1^2) \end{align} \]
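– Written as a small R helper (purely an illustration, not part of glmnet), you can see that \(\alpha = 1\) gives back the pure L1 penalty and \(\alpha = 0\) the pure L2 penalty:
elastic_penalty <- function(b, lambda, alpha) {
  alpha * lambda * sum(abs(b)) + (1 - alpha) * lambda / 2 * sum(b^2)
}
elastic_penalty(c(0.5, -0.2), lambda = 0.1, alpha = 1)    # pure L1
elastic_penalty(c(0.5, -0.2), lambda = 0.1, alpha = 0)    # pure L2
elastic_penalty(c(0.5, -0.2), lambda = 0.1, alpha = 0.5)  # half and half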
This new method combining L1/L2 Regularization is called the "Elastic net". Let's use a package to do the job: the package is the familiar "glmnet" and the function is the familiar "glmnet()". The function is extremely versatile, and all four of the models above can be fitted with it (the family argument selects 'gaussian', 'binomial', 'multinomial' or 'cox').
Now let's try to reproduce the results of Exercise 1.
library(glmnet)
X_mat <- matrix(as.matrix(subdat[,c("AGE", "Rate")]), ncol = 2)
Y_mat <- matrix(as.matrix(subdat[,c("time", "death")]), ncol = 2)
colnames(Y_mat) <- c('time', 'status')
– This is the result with no regularization:
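– (The fitting call is not shown here; presumably it looks like the following, with the penalty switched off by setting lambda = 0.)
fit_glmnet <- glmnet(x = X_mat, y = Y_mat, family = 'cox', lambda = 0)
coef(fit_glmnet)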
## 2 x 1 sparse Matrix of class "dgCMatrix"
## s0
## V1 0.036989411
## V2 0.007978329
– This is the result with L2 Regularization:
fit_glmnet <- glmnet(x = X_mat, y = Y_mat, family = 'cox', lambda = 0.05, alpha = 0)
coef(fit_glmnet)
## 2 x 1 sparse Matrix of class "dgCMatrix"
## s0
## V1 0.024919251
## V2 0.008261397
– This is the result with L1 Regularization:
fit_glmnet <- glmnet(x = X_mat, y = Y_mat, family = 'cox', lambda = 0.05, alpha = 1)
coef(fit_glmnet)
## 2 x 1 sparse Matrix of class "dgCMatrix"
## s0
## V1 0.01422007
## V2 .
– Of course, you can also give L1/L2 Regularization half the weight each:
fit_glmnet <- glmnet(x = X_mat, y = Y_mat, family = 'cox', lambda = 0.05, alpha = 0.5)
coef(fit_glmnet)
## 2 x 1 sparse Matrix of class "dgCMatrix"
## s0
## V1 0.020567344
## V2 0.003837489
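– To get a feel for how \(\alpha\) interpolates between the two penalties (a quick illustration rather than part of the exercise), we can refit over a few \(\alpha\) values at the same \(\lambda\) and compare the coefficients:
alpha_grid <- c(0, 0.25, 0.5, 0.75, 1)
coef_by_alpha <- sapply(alpha_grid, function(a) {
  fit <- glmnet(x = X_mat, y = Y_mat, family = 'cox', lambda = 0.05, alpha = a)
  as.numeric(as.matrix(coef(fit)))
})
colnames(coef_by_alpha) <- alpha_grid
coef_by_alpha   # columns run from pure Ridge (alpha = 0) to pure LASSO (alpha = 1)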
– Linear regression (family = 'gaussian'):
subdat <- dat[!(dat[,'K'] %in% NA) & !(dat[,'GENDER'] %in% NA) & !(dat[,'AGE'] %in% NA), c('K', 'GENDER', 'AGE')]
subdat[,'GENDER'] <- as.factor(subdat[,'GENDER'])
X_mat <- model.matrix(~ ., data = subdat[,-1])
X_mat <- X_mat[,-1]
Y_mat <- matrix(as.matrix(subdat[,"K"]), ncol = 1)
fit_glmnet <- glmnet(x = X_mat, y = Y_mat, family = 'gaussian', lambda = 0.01, alpha = 0.5)
coef(fit_glmnet)
– Logistic regression (family = 'binomial'):
subdat <- dat[!(dat[,'LVD'] %in% NA) & !(dat[,'GENDER'] %in% NA) & !(dat[,'AGE'] %in% NA), c('LVD', 'GENDER', 'AGE')]
subdat[,'GENDER'] <- as.factor(subdat[,'GENDER'])
X_mat <- model.matrix(~ ., data = subdat[,-1])
X_mat <- X_mat[,-1]
Y_mat <- matrix(as.matrix(subdat[,"LVD"]), ncol = 1)
fit_glmnet <- glmnet(x = X_mat, y = Y_mat, family = 'binomial', lambda = 0.01, alpha = 0.5)
coef(fit_glmnet)
– Multinomial logistic regression (family = 'multinomial'):
subdat <- dat[!(dat[,'AMI'] %in% NA) & !(dat[,'GENDER'] %in% NA) & !(dat[,'AGE'] %in% NA), c('AMI', 'GENDER', 'AGE')]
subdat[,'GENDER'] <- as.factor(subdat[,'GENDER'])
X_mat <- model.matrix(~ ., data = subdat[,-1])
X_mat <- X_mat[,-1]
Y_mat <- matrix(as.matrix(subdat[,"AMI"]), ncol = 1)
fit_glmnet <- glmnet(x = X_mat, y = Y_mat, family = 'multinomial', lambda = 0.01, alpha = 0.5)
coef(fit_glmnet)
– Cox regression (family = 'cox'):
subdat <- dat[!(dat[,'time'] %in% NA) & !(dat[,'death'] %in% NA) & !(dat[,'GENDER'] %in% NA) & !(dat[,'AGE'] %in% NA), c('time', 'death', 'GENDER', 'AGE')]
subdat[,'GENDER'] <- as.factor(subdat[,'GENDER'])
X_mat <- model.matrix(~ ., data = subdat[,-c(1, 2)])
X_mat <- X_mat[,-1]
Y_mat <- matrix(as.matrix(subdat[,c("time", "death")]), ncol = 2)
colnames(Y_mat) <- c('time', 'status')
fit_glmnet <- glmnet(x = X_mat, y = Y_mat, family = 'cox', lambda = 0.01, alpha = 0.5)
coef(fit_glmnet)
– For convenience, let's first fix \(\lambda = 0.001\).
subdat <- dat[apply(is.na(dat[,-1:-3]), 1, sum) %in% 0, -1:-3]
subdat[,'GENDER'] <- as.factor(subdat[,'GENDER'])
X_mat <- model.matrix(~ ., data = subdat[,-1:-2])
X_mat <- X_mat[,-1]
Y_mat <- matrix(as.matrix(subdat[,c("time", "death")]), ncol = 2)
colnames(Y_mat) <- c('time', 'status')
set.seed(0)
all_idx <- 1:nrow(subdat)
train_idx <- sample(all_idx, nrow(subdat) * 0.6)
valid_idx <- sample(all_idx[!all_idx %in% train_idx], nrow(subdat) * 0.2)
test_idx <- all_idx[!all_idx %in% c(train_idx, valid_idx)]
train_X <- X_mat[train_idx,]
valid_X <- X_mat[valid_idx,]
test_X <- X_mat[test_idx,]
train_Y <- Y_mat[train_idx,]
valid_Y <- Y_mat[valid_idx,]
test_Y <- Y_mat[test_idx,]
LASSO_fit <- glmnet(x = train_X, y = train_Y, family = 'cox', lambda = 0.001, alpha = 1)
Ridge_fit <- glmnet(x = train_X, y = train_Y, family = 'cox', lambda = 0.001, alpha = 0)
Elastic_fit <- glmnet(x = train_X, y = train_Y, family = 'cox', lambda = 0.001, alpha = 0.5)
library(survival)
pred.LASSO <- predict(LASSO_fit, valid_X)
pred.Ridge <- predict(Ridge_fit, valid_X)
pred.Elastic <- predict(Elastic_fit, valid_X)
concordance(coxph(Surv(valid_Y[,'time'], valid_Y[,'status']) ~ pred.LASSO))[['concordance']]
## [1] 0.7249698
concordance(coxph(Surv(valid_Y[,'time'], valid_Y[,'status']) ~ pred.Ridge))[['concordance']]
## [1] 0.7247502
concordance(coxph(Surv(valid_Y[,'time'], valid_Y[,'status']) ~ pred.Elastic))[['concordance']]
## [1] 0.7246403
– The LASSO model gives the highest concordance on the validation set, so we apply it to the test set:
pred.test <- predict(LASSO_fit, test_X)
concordance(coxph(Surv(test_Y[,'time'], test_Y[,'status']) ~ pred.test))[['concordance']]
## [1] 0.7287844
– Finally, let's look at the time-specific ROC curves (at 365, 730 and 1825 days), choosing the cut-off on the validation set and then applying it to the test set:
library(pROC)
par(mfrow = c(1, 3))
for (current_t in c(365, 730, 1825)) {
  # Dichotomize the outcome at the landmark time current_t:
  # subjects followed beyond current_t count as negatives; subjects censored before current_t are set to NA
  valid_y <- valid_Y[,'status']
  valid_y[valid_Y[,'time'] > current_t] <- 0
  valid_y[valid_Y[,'time'] <= current_t & valid_Y[,'status'] %in% 0] <- NA
  test_y <- test_Y[,'status']
  test_y[test_Y[,'time'] > current_t] <- 0
  test_y[test_Y[,'time'] <= current_t & test_Y[,'status'] %in% 0] <- NA
  # Choose the cut-off on the validation ROC curve (Youden index), then apply it to the test set
  roc_valid <- roc(valid_y ~ pred.LASSO)
  best_pos <- which.max(roc_valid$sensitivities + roc_valid$specificities)
  best_cut <- roc_valid$thresholds[best_pos]
  tab_test <- table(factor(pred.test > best_cut, levels = c(FALSE, TRUE)), factor(test_y, levels = 0:1))
  sens <- tab_test[2,2] / sum(tab_test[,2], 1e-8)
  spec <- tab_test[1,1] / sum(tab_test[,1], 1e-8)
  roc_test <- roc(test_y ~ pred.test)
  plot(roc_test, main = paste0('t = ', formatC(current_t, digits = 0, format = 'f')))
  points(spec, sens, pch = 19)
  text(0.5, 0.5, paste0('Cut = ', formatC(best_cut, digits = 3, format = 'f'),
                        '\nSens = ', formatC(sens, digits = 3, format = 'f'),
                        '\nSpec = ', formatC(spec, digits = 3, format = 'f'),
                        '\nAUC = ', formatC(roc_test$auc, digits = 3, format = 'f')), col = 'red')
}
– Cross-validation on the "training set" is a good way to search for these hyper-parameters.
– The logic of how it is used is illustrated in the figure below:
– In k-fold cross-validation, different splits of the data are used to validate the model you train. For example, with 100 samples you can first train on the first 90 and test on the last 10, then hold out samples 81 to 90 instead, and keep repeating so that every sample gets used for testing once; this gives you several training/testing combinations and therefore more evidence for validation. (A hand-rolled sketch of this splitting appears after this list.)
– We again use the same data for cross-validation in order to find the best \(\lambda\).
– Note in particular that the function "cv.glmnet()" does not search over \(\alpha\); if you want to search it, you have to do so manually.
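– Here is a minimal hand-rolled sketch of the fold idea (the choice of 5 folds and of the single \(\lambda\)/\(\alpha\) pair is an assumption made purely for illustration), reusing train_X and train_Y from above:
set.seed(0)
fold_id <- sample(rep(1:5, length.out = nrow(train_X)))   # assign every training sample to one of 5 folds
fold_c_idx <- numeric(5)
for (k in 1:5) {
  fit_k <- glmnet(x = train_X[fold_id != k,], y = train_Y[fold_id != k,],
                  family = 'cox', lambda = 0.001, alpha = 0.5)
  pred_k <- predict(fit_k, train_X[fold_id == k,])
  fold_c_idx[k] <- concordance(coxph(Surv(train_Y[fold_id == k, 'time'],
                                          train_Y[fold_id == k, 'status']) ~ pred_k))[['concordance']]
}
mean(fold_c_idx)   # average validation performance over the 5 held-out folds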
cv_fit <- cv.glmnet(x = train_X, y = train_Y, family = 'cox', lambda = exp(seq(-5, -1, length.out = 100)), alpha = 0.5, nfolds = 5)
– If you want to obtain the predicted values, you can use commands like these ("lambda.min" is the \(\lambda\) with the smallest cross-validated error, and "lambda.1se" is the largest \(\lambda\) whose error is within one standard error of that minimum):
pred.min <- predict(cv_fit, valid_X, s = "lambda.min")
pred.1se <- predict(cv_fit, valid_X, s = "lambda.1se")
concordance(coxph(Surv(valid_Y[,'time'], valid_Y[,'status']) ~ pred.min))[['concordance']]
## [1] 0.7415895
concordance(coxph(Surv(valid_Y[,'time'], valid_Y[,'status']) ~ pred.1se))[['concordance']]
## [1] 0.7476297
– In this experiment we will search over \(\alpha = 0, \space 0.1, \space 0.2, \space \cdots, \space 1.0\):
– To make the experiment more complete, this time we fill in the missing data with multiple imputation
library(mice)
subdat <- dat[!(dat[,"time"] %in% NA) & !(dat[,"death"] %in% NA),]
subdat[,'GENDER'] <- as.factor(subdat[,'GENDER'])
for (i in 1:31) {subdat[,paste0('rhythm.', i)] <- as.factor(subdat[,paste0('rhythm.', i)])}
used_dat.x <- subdat[,-1:-5]
mice_dat <- mice(used_dat.x, m = 1, maxit = 5, meth = 'cart', seed = 123, printFlag = FALSE)
impute_dat.x <- mice:::complete(mice_dat, action = 1)
X_mat <- model.matrix(~ ., data = impute_dat.x)
X_mat <- X_mat[,-1]
Y_mat <- matrix(as.matrix(subdat[,c("time", "death")]), ncol = 2)
colnames(Y_mat) <- c('time', 'status')
set.seed(0)
all_idx <- 1:nrow(subdat)
train_idx <- sample(all_idx, nrow(subdat) * 0.6)
valid_idx <- sample(all_idx[!all_idx %in% train_idx], nrow(subdat) * 0.2)
test_idx <- all_idx[!all_idx %in% c(train_idx, valid_idx)]
train_X <- X_mat[train_idx,]
valid_X <- X_mat[valid_idx,]
test_X <- X_mat[test_idx,]
train_Y <- Y_mat[train_idx,]
valid_Y <- Y_mat[valid_idx,]
test_Y <- Y_mat[test_idx,]
library(glmnet)
library(survival)
result_elasticnet <- data.frame(alpha = seq(0, 1, by = 0.1), lambda = NA, c_idx = NA)
for (i in 1:nrow(result_elasticnet)) {
  cv_fit <- cv.glmnet(x = train_X, y = train_Y, family = 'cox', lambda = exp(seq(-5, -1, length.out = 100)), alpha = result_elasticnet[i, 'alpha'], nfolds = 5)
  pred.min <- predict(cv_fit, valid_X, s = "lambda.min")
  pred.1se <- predict(cv_fit, valid_X, s = "lambda.1se")
  c_idx.min <- concordance(coxph(Surv(valid_Y[,'time'], valid_Y[,'status']) ~ pred.min))[['concordance']]
  c_idx.1se <- concordance(coxph(Surv(valid_Y[,'time'], valid_Y[,'status']) ~ pred.1se))[['concordance']]
  # Keep whichever of lambda.min / lambda.1se validates better for this alpha
  if (c_idx.1se > c_idx.min) {
    result_elasticnet[i, 'lambda'] <- cv_fit$lambda.1se
    result_elasticnet[i, 'c_idx'] <- c_idx.1se
  } else {
    result_elasticnet[i, 'lambda'] <- cv_fit$lambda.min
    result_elasticnet[i, 'c_idx'] <- c_idx.min
  }
}
result_elasticnet
## alpha lambda c_idx
## 1 0.0 0.055078827 0.7547908
## 2 0.1 0.185098218 0.7690074
## 3 0.2 0.064740148 0.7665125
## 4 0.3 0.057349804 0.7678030
## 5 0.4 0.046859287 0.7680826
## 6 0.5 0.038287712 0.7676524
## 7 0.6 0.039866368 0.7666631
## 8 0.7 0.023577209 0.7658243
## 9 0.8 0.022643583 0.7668351
## 10 0.9 0.028855503 0.7627057
## 11 1.0 0.006737947 0.7626627
– We now apply the chosen model to the "test set".
best_pos <- which.max(result_elasticnet[,'c_idx'])
final_fit <- glmnet(x = train_X, y = train_Y, family = 'cox', lambda = result_elasticnet[best_pos,'lambda'], alpha = result_elasticnet[best_pos,'alpha'])
pred.valid <- predict(final_fit, valid_X)
pred.test <- predict(final_fit, test_X)
concordance(coxph(Surv(test_Y[,'time'], test_Y[,'status']) ~ pred.test))[['concordance']]
## [1] 0.7695282
library(pROC)
par(mfrow = c(1, 3))
for (current_t in c(365, 730, 1825)) {
  valid_y <- valid_Y[,'status']
  valid_y[valid_Y[,'time'] > current_t] <- 0
  valid_y[valid_Y[,'time'] <= current_t & valid_Y[,'status'] %in% 0] <- NA
  test_y <- test_Y[,'status']
  test_y[test_Y[,'time'] > current_t] <- 0
  test_y[test_Y[,'time'] <= current_t & test_Y[,'status'] %in% 0] <- NA
  roc_valid <- roc(valid_y ~ pred.valid)
  best_pos <- which.max(roc_valid$sensitivities + roc_valid$specificities)
  best_cut <- roc_valid$thresholds[best_pos]
  tab_test <- table(factor(pred.test > best_cut, levels = c(FALSE, TRUE)), factor(test_y, levels = 0:1))
  sens <- tab_test[2,2] / sum(tab_test[,2], 1e-8)
  spec <- tab_test[1,1] / sum(tab_test[,1], 1e-8)
  roc_test <- roc(test_y ~ pred.test)
  plot(roc_test, main = paste0('t = ', formatC(current_t, digits = 0, format = 'f')))
  points(spec, sens, pch = 19)
  text(0.5, 0.5, paste0('Cut = ', formatC(best_cut, digits = 3, format = 'f'),
                        '\nSens = ', formatC(sens, digits = 3, format = 'f'),
                        '\nSpec = ', formatC(spec, digits = 3, format = 'f'),
                        '\nAUC = ', formatC(roc_test$auc, digits = 3, format = 'f')), col = 'red')
}
– Compared with plain statistical models, the advantage here is that there are plenty of hyper-parameters available for searching for the best model; after the data-science experiment in the exercise, you should now have a clearer picture of the complete application workflow.
– In addition, the strength of L1 Regularization is that it can help us pick out the important variables, giving us an alternative to the traditional p-value-based selection.
– One remedy is to do the feature engineering in advance, but that is still difficult, so we need some non-linear models that can do the feature engineering for us "automatically".