林嶔 (Lin, Chin)
Lesson 8 Fundamentals of Artificial Intelligence 5 (Multi-class Prediction Models)
\[\hat{y} = f(x)\]
– Please download the sample data here
\[ \begin{align} Y & = XB + E \\\\ Y & = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix} X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} B = \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} E = \begin{pmatrix} \delta_1 \\ \delta_2 \\ \delta_3 \\ \vdots \\ \delta_n \end{pmatrix} \end{align} \]
\[ \begin{align} B & = (X^TX)^{-1}X^TY \end{align} \]
subdat <- dat[!(dat[,"AGE"] %in% NA) & !(dat[,"Rate"] %in% NA) & !(dat[,"PR"] %in% NA),]  # keep complete cases only
X <- subdat[,"AGE"]
Y <- subdat[,c("Rate", "PR")]
X_mat <- cbind(1, as.matrix(X))  # add an intercept column
Y_mat <- as.matrix(Y)
B_mat <- solve(t(X_mat) %*% X_mat) %*% t(X_mat) %*% Y_mat  # B = (X'X)^(-1) X'Y
B_mat
## Rate PR
## [1,] 79.50964419 142.6066267
## [2,] 0.07317945 0.3430289
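To verify, each outcome can be fitted separately with lm(); the two calls below (reconstructed from the Call lines in the printed output) reproduce the same coefficients:

lm(Y[,"Rate"] ~ X)
lm(Y[,"PR"] ~ X)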
##
## Call:
## lm(formula = Y[, "Rate"] ~ X)
##
## Coefficients:
## (Intercept) X
## 79.50964 0.07318
##
## Call:
## lm(formula = Y[, "PR"] ~ X)
##
## Coefficients:
## (Intercept) X
## 142.607 0.343
– Let's try using 'AGE' and 'Rate' together to predict 'AMI'.
– Note in particular that the encoding is a little different this time:
subdat <- dat[!(dat[,"AMI"] %in% NA) & !(dat[,"AGE"] %in% NA) & !(dat[,"Rate"] %in% NA),]
subdat[,"AMI"] <- as.factor(subdat[,"AMI"])
X <- cbind(1, as.matrix(subdat[,c('AGE', 'Rate')]))  # intercept + two predictors
Y <- matrix(model.matrix(~-1 + subdat[,"AMI"]), ncol = 3)  # one-hot encoding: one column per class
head(Y)
## [,1] [,2] [,3]
## [1,] 0 0 1
## [2,] 1 0 0
## [3,] 1 0 0
## [4,] 0 0 1
## [5,] 0 1 0
## [6,] 1 0 0
\[ \begin{align} Softmax(x_j) & = \frac{e^{x_j}}{\sum \limits_{i=1}^{m} e^{x_i}} \ \ \ \ \ j \in 1 \ \mbox{to} \ m \end{align} \]
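As a quick sanity check, the outputs of \(Softmax()\) are positive and sum to 1 (a minimal sketch with a made-up input vector):

softmax <- function (x) {exp(x) / sum(exp(x))}
softmax(c(1, 2, 3))  # 0.0900 0.2447 0.6652, which sums to 1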
\[ \begin{align} \frac{\partial}{\partial x_i}Softmax(x_j) & = \left\{ \begin{array}{ll} Softmax(x_j)(1 - Softmax(x_i)) & \mbox{if } i = j \\ Softmax(x_j)(0 - Softmax(x_i)) & \mbox{otherwise} \end{array} \right. \end{align} \]
– Earlier we designed the cross-entropy loss function for logistic regression so that the final optimization would behave smoothly. Now we will likewise design a special loss function for \(Softmax()\), called the \(m\)-log likelihood function:
– Here \(m\) again denotes the number of classes, and \(n\) denotes the number of samples:
\[ \begin{align} mlogloss(y, p) & = - \sum \limits_{i=1}^{m} y_{i} \log(p_{i}) \end{align} \]
– Here is its partial derivative (derivation omitted):
\[ \begin{align} \frac{\partial}{\partial p_i}mlogloss(y, p) & = - y_{i} \frac {1} {p_{i}} \end{align} \]
– Next, composing \(Softmax()\) and \(mlogloss()\), the resulting partial derivative turns out to be remarkably elegant:
\[ \begin{align} p_i &= Softmax(x_i)\\ loss &= mlogloss(y, p) \\\\ \frac{\partial}{\partial x_j}loss & = \sum \limits_{i=1}^{m} \frac{\partial}{\partial p_i}loss \cdot \frac{\partial}{\partial x_j}p_i \\ & = - y_{j} \frac {1} {p_{j}} p_{j} ( 1-p_{j})-\sum \limits_{i\neq j} y_{i} \frac {1} {p_{i}} p_{i} (0 - p_{j}) \\ & = - y_{j} ( 1-p_{j})-\sum \limits_{i\neq j} y_{i} (0 - p_{j}) \\ & = - y_{j} +y_{j}p_{j}+\sum \limits_{i\neq j} y_{i} p_{j} \\ & = - y_{j} +\sum \limits_{i = 1}^{m} y_{i} p_{j} \\ & = p_{j} - y_{j} \end{align} \]
– The last step uses the fact that \(\sum_{i=1}^{m} y_i = 1\) for a one-hot label.
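We can double-check the result \(\frac{\partial}{\partial x_j}loss = p_j - y_j\) numerically with finite differences (a minimal sketch; the label and input vectors are made-up examples):

softmax <- function (x) {exp(x) / sum(exp(x))}
y <- c(0, 1, 0)  # a hypothetical one-hot label
x <- c(0.5, -0.2, 1.3)  # arbitrary linear predictors
softmax(x) - y  # analytic gradient: p - y
eps <- 1e-6
sapply(1:3, function (j) {  # numeric gradient, matching the line above
  x2 <- x
  x2[j] <- x2[j] + eps
  (-sum(y * log(softmax(x2))) + sum(y * log(softmax(x)))) / eps
})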
\[ \begin{align} L & = XB \\ P & = Softmax(L) \\ loss & = mlogloss(Y, P) \\\\ Y & = \begin{pmatrix} y^1_1 & y^2_1 & y^3_1 \\ y^1_2 & y^2_2 & y^3_2 \\ y^1_3 & y^2_3 & y^3_3 \\ \vdots & \vdots & \vdots \\ y^1_n & y^2_n & y^3_n \end{pmatrix} P = \begin{pmatrix} p^1_1 & p^2_1 & p^3_1 \\ p^1_2 & p^2_2 & p^3_2 \\ p^1_3 & p^2_3 & p^3_3 \\ \vdots & \vdots & \vdots \\ p^1_n & p^2_n & p^3_n \end{pmatrix} X = \begin{pmatrix} 1 & x^1_1 & x^2_1 \\ 1 & x^1_2 & x^2_2 \\ 1 & x^1_3 & x^2_3 \\ \vdots & \vdots & \vdots \\ 1 & x^1_n & x^2_n \end{pmatrix} B = \begin{pmatrix} b^1_0 & b^2_0 & b^3_0 \\ b^1_1 & b^2_1 & b^3_1 \\ b^1_2 & b^2_2 & b^3_2 \end{pmatrix} \end{align} \]
\[ \begin{align} \frac{\partial}{\partial B}loss & = \frac{\partial}{\partial L} mlogloss(Y, Softmax(L)) \frac{\partial}{\partial B} XB \\ & = X^T (P-Y) \end{align} \]
– We will first use the Iris flower data set, which Edgar Anderson originally collected as data on the geographic variation of irises on the Gaspé Peninsula in Canada, and which Ronald Fisher later used as an example of discriminant analysis.
– The data set contains 150 samples belonging to three species of the genus Iris: Iris setosa, Iris versicolor, and Iris virginica. Four features were measured on each sample for quantitative analysis: the length and width of the sepal and of the petal.
– Let's use gradient descent to solve for \(B\):
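The loop below assumes \(X\) and \(Y\) have been prepared from iris in the same way as before: an intercept column plus the four features, and a one-hot \(Y\) (a minimal setup sketch):

data(iris)
X <- cbind(1, as.matrix(iris[,-5]))  # intercept + 4 features
Y <- matrix(model.matrix(~-1 + iris[,5]), ncol = 3)  # one-hot encoding of Species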
num.iteration <- 5000
lr <- 0.05  # learning rate
B_mat <- matrix(0, nrow = 5, ncol = 3)
for (i in 1:num.iteration) {
  L <- X %*% B_mat  # linear predictors
  P <- t(apply(L, 1, function (x) {exp(x) / sum(exp(x))}))  # row-wise softmax
  B_mat <- B_mat - lr * t(X) %*% (P-Y) / nrow(X)  # gradient step: X'(P-Y)/n
}
B_mat
## [,1] [,2] [,3]
## 0.5612794 1.19326975 -1.754549
## Sepal.Length 1.1960152 0.87175566 -2.067771
## Sepal.Width 2.6903757 -0.02834479 -2.662031
## Petal.Length -3.6596999 -0.24019526 3.899895
## Petal.Width -1.7270576 -1.71964671 3.446704
L <- X %*% B_mat
P <- t(apply(L, 1, function (x) {exp(x) / sum(exp(x))}))
table(max.col(Y), max.col(P))
##
## 1 2 3
## 1 50 0 0
## 2 0 48 2
## 3 0 0 50
– The 'glmnet()' function inside this package is extremely powerful, and we will introduce it in detail later. To reproduce our earlier results, we have to set the parameter lambda = 0:
library(glmnet)
subdat <- dat[!(dat[,"K"] %in% NA) & !(dat[,"AGE"] %in% NA) & !(dat[,"QTc"] %in% NA),]
X <- as.matrix(subdat[,c("AGE", "QTc")])
Y <- subdat[,"K"]
fit_glmnet <- glmnet(x = X, y = Y, family = 'gaussian', lambda = 0)  # lambda = 0: no penalty, i.e. ordinary least squares
coef.glmnet(fit_glmnet)
## 3 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 4.244527966
## AGE 0.018570089
## QTc -0.003497804
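For comparison, ordinary least squares gives essentially the same coefficients; a sketch reconstructed to match the printed output below:

coef(lm(K ~ AGE + QTc, data = subdat))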
## (Intercept) AGE QTc
## 4.244527924 0.018570090 -0.003497804
## [,1]
## s0 0.2872139
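– For comparison with the multinomial glmnet fit below, here again is the \(B\) matrix from our gradient-descent solution on iris: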
## [,1] [,2] [,3]
## 0.5612794 1.19326975 -1.754549
## Sepal.Length 1.1960152 0.87175566 -2.067771
## Sepal.Width 2.6903757 -0.02834479 -2.662031
## Petal.Length -3.6596999 -0.24019526 3.899895
## Petal.Width -1.7270576 -1.71964671 3.446704
data(iris)
X <- as.matrix(iris[,-5])  # the four numeric features
Y <- iris[,5]  # Species: a 3-level factor
fit_glmnet <- glmnet(x = X, y = Y, family = 'multinomial', lambda = 0)
coef.glmnet(fit_glmnet)
## $setosa
## 5 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 21.231985
## Sepal.Length .
## Sepal.Width 9.139921
## Petal.Length -10.974156
## Petal.Width -6.699920
##
## $versicolor
## 5 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 10.3546155
## Sepal.Length 0.9359832
## Sepal.Width .
## Petal.Length .
## Petal.Width .
##
## $virginica
## 5 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -31.586601
## Sepal.Length -1.519747
## Sepal.Width -6.574618
## Petal.Length 9.312621
## Petal.Width 17.999778
– Don't worry: because \(Softmax()\) is invariant to adding the same constant to all of its inputs, many different coefficient matrices yield the same probabilities, and the predictions remain similar:
pred_B_mat <- cbind(1, X) %*% B_mat  # linear predictors from the gradient-descent fit
pred_B_mat <- max.col(pred_B_mat)  # predicted class = largest linear predictor
pred_glmnet <- predict(fit_glmnet, X)  # a 3-D array: n samples x 3 classes x 1 lambda
pred_glmnet <- max.col(pred_glmnet[,,1])
table(pred_B_mat, pred_glmnet)
## pred_glmnet
## pred_B_mat 1 2 3
## 1 50 0 0
## 2 0 48 0
## 3 0 2 50
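Comparing each model's predictions against the true labels gives the two tables below (a sketch matching their row labels):

table(pred_B_mat, Y)
table(pred_glmnet, Y)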
## Y
## pred_B_mat setosa versicolor virginica
## 1 50 0 0
## 2 0 48 0
## 3 0 2 50
## Y
## pred_glmnet setosa versicolor virginica
## 1 50 0 0
## 2 0 49 1
## 3 0 1 49
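To see the invariance concretely: shifting every input of \(Softmax()\) by the same constant leaves its output unchanged (a minimal sketch):

softmax <- function (x) {exp(x) / sum(exp(x))}
softmax(c(1, 2, 3))
softmax(c(1, 2, 3) + 10)  # identical probabilities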
Please design an experiment following the data-science workflow: first split the sample randomly into three parts to produce a 'training set', a 'validation set', and a 'test set'.
To simplify the workflow, please use the function 'glmnet()', and use 'Rate' and 'QTc' together to predict 'AMI'.
subdat <- dat[!(dat[,"AMI"] %in% NA) & !(dat[,"Rate"] %in% NA) & !(dat[,"QTc"] %in% NA),]
subdat[,"AMI"] <- as.factor(subdat[,"AMI"])
X <- as.matrix(subdat[,c('Rate', 'QTc')])
Y <- subdat[,"AMI"]
set.seed(0)
all_idx <- 1:nrow(subdat)
train_idx <- sample(all_idx, nrow(subdat) * 0.6)  # 60% training
valid_idx <- sample(all_idx[!all_idx %in% train_idx], nrow(subdat) * 0.2)  # 20% validation
test_idx <- all_idx[!all_idx %in% c(train_idx, valid_idx)]  # remaining 20% test
train_X <- X[train_idx,]
valid_X <- X[valid_idx,]
test_X <- X[test_idx,]
train_Y <- Y[train_idx]
valid_Y <- Y[valid_idx]
test_Y <- Y[test_idx]
fit_glmnet <- glmnet(x = train_X, y = train_Y, family = 'multinomial', lambda = 0)
pred_valid <- predict(fit_glmnet, valid_X)
pred_test <- predict(fit_glmnet, test_X)
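The validation and test confusion tables below can be reproduced by taking the class with the largest predicted value (a sketch; the original chunk is not shown):

table(max.col(pred_valid[,,1]), valid_Y)
table(max.col(pred_test[,,1]), test_Y)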
## valid_Y
## not-AMI NSTEMI STEMI
## 1 220 26 49
## test_Y
## not-AMI NSTEMI STEMI
## 1 208 42 47
train_X <- cbind(1, train_X)  # add an intercept column
train_Y <- matrix(model.matrix(~-1 + train_Y), ncol = 3)  # one-hot encoding
num.iteration <- 5000
lr <- 0.001
B_mat <- matrix(0, nrow = 3, ncol = 3)
for (i in 1:num.iteration) {
  L <- train_X %*% B_mat
  P <- t(apply(L, 1, function (x) {exp(x) / sum(exp(x))}))  # row-wise softmax
  B_mat <- B_mat - lr * t(train_X) %*% (P-train_Y) / nrow(train_X)
}
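The gradient-descent model's confusion tables can then be produced the same way (a sketch; train_X already contains the intercept column, so one is added to valid_X and test_X as well):

table(max.col(cbind(1, valid_X) %*% B_mat), valid_Y)
table(max.col(cbind(1, test_X) %*% B_mat), test_Y)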
## valid_Y
## not-AMI NSTEMI STEMI
## 1 207 25 48
## 2 13 1 1
## test_Y
## not-AMI NSTEMI STEMI
## 1 199 38 43
## 2 9 4 4
– Let's try using 'Rate' and 'QTc' together to predict 'LVD'.
subdat <- dat[!(dat[,"LVD"] %in% NA) & !(dat[,"Rate"] %in% NA) & !(dat[,"QTc"] %in% NA),]
X <- as.matrix(subdat[,c('Rate', 'QTc')])
Y <- subdat[,"LVD"]
set.seed(0)
all_idx <- 1:nrow(subdat)
train_idx <- sample(all_idx, nrow(subdat) * 0.6)
valid_idx <- sample(all_idx[!all_idx %in% train_idx], nrow(subdat) * 0.2)
test_idx <- all_idx[!all_idx %in% c(train_idx, valid_idx)]
train_X <- X[train_idx,]
valid_X <- X[valid_idx,]
test_X <- X[test_idx,]
train_Y <- Y[train_idx]
valid_Y <- Y[valid_idx]
test_Y <- Y[test_idx]
fit_glmnet <- glmnet(x = train_X, y = train_Y, family = 'binomial', lambda = 0)
pred_valid <- predict(fit_glmnet, valid_X)
pred_test <- predict(fit_glmnet, test_X)
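The tables below dichotomize the predictions; since predict() returns values on the link (logit) scale by default, a cutoff of 0 corresponds to a predicted probability of 0.5 (a sketch; the original chunk is not shown and the cutoff is an assumption):

table(pred_valid > 0, valid_Y)
table(pred_test > 0, test_Y)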
## valid_Y
## 0 1
## FALSE 181 86
## TRUE 61 94
## test_Y
## 0 1
## FALSE 174 98
## TRUE 46 104
library(pROC)
roc_valid <- roc(valid_Y ~ pred_valid)
best_pos <- which.max(roc_valid$sensitivities + roc_valid$specificities)  # maximize the Youden index
best_cut <- roc_valid$thresholds[best_pos]  # cutoff chosen on the validation set
tab_test <- table(pred_test >= best_cut, test_Y)
sens <- tab_test[2,2] / sum(tab_test[,2])  # test-set sensitivity
spec <- tab_test[1,1] / sum(tab_test[,1])  # test-set specificity
roc_test <- roc(test_Y ~ pred_test)
plot(roc_test)
points(spec, sens, pch = 19)
text(0.5, 0.5, paste0('Sens = ', formatC(sens, digits = 3, format = 'f'),
'\nSpec = ', formatC(spec, digits = 3, format = 'f'),
'\nAUC = ', formatC(roc_test$auc, digits = 3, format = 'f')), col = 'red')
subdat <- dat[!(dat[,"AMI"] %in% NA) & !(dat[,"Rate"] %in% NA) & !(dat[,"QTc"] %in% NA),]
subdat[,"AMI"] <- as.factor(subdat[,"AMI"])
X <- as.matrix(subdat[,c('Rate', 'QTc')])
Y <- subdat[,"AMI"]
set.seed(0)
all_idx <- 1:nrow(subdat)
train_idx <- sample(all_idx, nrow(subdat) * 0.6)
valid_idx <- sample(all_idx[!all_idx %in% train_idx], nrow(subdat) * 0.2)
test_idx <- all_idx[!all_idx %in% c(train_idx, valid_idx)]
train_X <- X[train_idx,]
valid_X <- X[valid_idx,]
test_X <- X[test_idx,]
train_Y <- Y[train_idx]
valid_Y <- Y[valid_idx]
test_Y <- Y[test_idx]
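Before the ROC curves below can be plotted, the multinomial model must be refitted and the predictions recomputed, exactly as in the earlier chunk:

fit_glmnet <- glmnet(x = train_X, y = train_Y, family = 'multinomial', lambda = 0)
pred_valid <- predict(fit_glmnet, valid_X)
pred_test <- predict(fit_glmnet, test_X)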
par(mfrow = c(1, 2))
roc_valid <- roc((valid_Y == 'NSTEMI') ~ pred_valid[,2,1])  # one-vs-rest ROC using the class-2 (NSTEMI) predictor
roc_test <- roc((test_Y == 'NSTEMI') ~ pred_test[,2,1])
plot(roc_valid, col = 'blue')
plot(roc_test, add = TRUE, col = 'red')
roc_valid <- roc((valid_Y == 'STEMI') ~ pred_valid[,3,1])  # one-vs-rest ROC using the class-3 (STEMI) predictor
roc_test <- roc((test_Y == 'STEMI') ~ pred_test[,3,1])
plot(roc_valid, col = 'blue')
plot(roc_test, add = TRUE, col = 'red')
– For a linear model the effect of this trick (balancing the classes by resampling) may not be very obvious yet, but remember it; many of the nonlinear models we will meet later depend heavily on it.
train_X.0 <- train_X[train_Y == 'not-AMI',]
train_X.1 <- train_X[train_Y == 'NSTEMI',]
train_X.2 <- train_X[train_Y == 'STEMI',]
set.seed(0)
train_X.0 <- train_X.0[sample(1:nrow(train_X.0), 5000, replace = TRUE),]  # oversample each class to 5000 rows
train_X.1 <- train_X.1[sample(1:nrow(train_X.1), 5000, replace = TRUE),]
train_X.2 <- train_X.2[sample(1:nrow(train_X.2), 5000, replace = TRUE),]
train_X <- rbind(train_X.0, train_X.1, train_X.2)  # order must match the labels below
train_Y <- factor(rep(0:2, each = 5000))
fit_glmnet <- glmnet(x = train_X, y = train_Y, family = 'multinomial', lambda = 0)
pred_valid <- predict(fit_glmnet, valid_X)
pred_test <- predict(fit_glmnet, test_X)
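Again the confusion tables for the rebalanced model (a sketch; predicted class = largest predicted value):

table(max.col(pred_valid[,,1]), valid_Y)
table(max.col(pred_test[,,1]), test_Y)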
## valid_Y
## not-AMI NSTEMI STEMI
## 1 78 7 12
## 2 112 14 31
## 3 30 5 6
## test_Y
## not-AMI NSTEMI STEMI
## 1 71 7 16
## 2 117 25 14
## 3 20 10 17
– You should also now see the importance of the 'validation set' more clearly: a great many decisions have to be made with its help.
– The function 'glmnet()' can clearly do everything we have learned so far, and if you read its documentation carefully, it can even perform survival analysis!