Machine Learning and Algorithms

林嶔 (Lin, Chin)

Lesson 8: Fundamentals of Artificial Intelligence 5 (Multiclass Prediction Models)

Section 1: Extending Linear Regression to Multiple Outputs (1)

\[\hat{y} = f(x)\]

– Please download the sample data here.

dat <- read.csv("ECG_train.csv", header = TRUE, fileEncoding = 'CP950', stringsAsFactors = FALSE, na.strings = "")

Section 1: Extending Linear Regression to Multiple Outputs (2)

\[ \begin{align} Y & = XB + E \\\\ Y & = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix} X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} B = \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} E = \begin{pmatrix} \delta_1 \\ \delta_2 \\ \delta_3 \\ \vdots \\ \delta_n \end{pmatrix} \end{align} \]

\[ \begin{align} B & = (X^TX)^{-1}X^TY \end{align} \]

Section 1: Extending Linear Regression to Multiple Outputs (3)

subdat <- dat[!(dat[,"AGE"] %in% NA) & !(dat[,"Rate"] %in% NA) & !(dat[,"PR"] %in% NA),]
X <- subdat[,"AGE"]
Y <- subdat[,c("Rate", "PR")]

X_mat <- cbind(1, as.matrix(X))
Y_mat <- as.matrix(Y)

B_mat <- solve(t(X_mat) %*% X_mat) %*% t(X_mat) %*% Y_mat
B_mat
##             Rate          PR
## [1,] 79.50964419 142.6066267
## [2,]  0.07317945   0.3430289
lm(Y[,'Rate'] ~ X)
## 
## Call:
## lm(formula = Y[, "Rate"] ~ X)
## 
## Coefficients:
## (Intercept)            X  
##    79.50964      0.07318
lm(Y[,'PR'] ~ X)
## 
## Call:
## lm(formula = Y[, "PR"] ~ X)
## 
## Coefficients:
## (Intercept)            X  
##     142.607        0.343

Section 2: Multiclass Prediction Functions (1)

– Let's try using "AGE" and "Rate" together to predict "AMI".

– Note in particular that the way the outcome is encoded differs slightly from before:

subdat <- dat[!(dat[,"AMI"] %in% NA) & !(dat[,"AGE"] %in% NA) & !(dat[,"Rate"] %in% NA),]
subdat[,"AMI"] <- as.factor(subdat[,"AMI"])

X <- cbind(1, as.matrix(subdat[,c('AGE', 'Rate')]))
Y <- matrix(model.matrix(~-1 + subdat[,"AMI"]), ncol = 3)

head(Y)
##      [,1] [,2] [,3]
## [1,]    0    0    1
## [2,]    1    0    0
## [3,]    1    0    0
## [4,]    0    0    1
## [5,]    0    1    0
## [6,]    1    0    0
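– A small side note (a sketch, not needed for what follows): model.matrix() orders the indicator columns by the factor levels of "AMI", so we can attach readable column names to Y for inspection:

# The columns of the one-hot matrix Y follow levels(subdat[,"AMI"])
colnames(Y) <- levels(subdat[,"AMI"])
head(Y)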

Section 2: Multiclass Prediction Functions (2)

\[ \begin{align} Softmax(x_j) & = \frac{e^{x_j}}{\sum \limits_{i=1}^{m} e^{x_i}} \ \ \ \ \ j \in 1 \ \mbox{to} \ m \end{align} \]

\[ \begin{align} \frac{\partial}{\partial x_i}Softmax(x_j) & = \left\{ \begin{array}{ll} Softmax(x_j)(1 - Softmax(x_i)) & \mbox{if } i = j \\ Softmax(x_j)(0 - Softmax(x_i)) & \mbox{otherwise} \end{array} \right. \end{align} \]
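– A quick numerical sanity check of this Jacobian (a minimal sketch; the values of x below are arbitrary):

softmax <- function (x) {exp(x - max(x)) / sum(exp(x - max(x)))}

x <- c(0.5, -1.0, 2.0)
p <- softmax(x)
eps <- 1e-6

# Analytic Jacobian: entry [j, i] is Softmax(x_j) * ((i == j) - Softmax(x_i))
J_analytic <- diag(p) - p %*% t(p)

# Numerical Jacobian by central differences; column i differentiates with respect to x_i
J_numeric <- sapply(1:3, function (i) {
  e <- rep(0, 3); e[i] <- eps
  (softmax(x + e) - softmax(x - e)) / (2 * eps)
})

round(J_analytic - J_numeric, 6)   # should be (near) all zeros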

Section 2: Multiclass Prediction Functions (3)

– Previously we designed the cross-entropy loss function for logistic regression so that the final optimization would behave smoothly. Now we will likewise design a dedicated loss function for \(Softmax()\), called the \(m\)-log likelihood function (\(m\)-log loss):

– Here \(m\) again denotes the number of classes, and \(n\) denotes the number of samples:

\[ \begin{align} mlogloss(y, p) & = - \sum \limits_{i=1}^{m} y_{i} log(p_{i}) \end{align} \]

– Here is the partial derivative with respect to each \(p_j\) (derivation omitted):

\[ \begin{align} \frac{\partial}{\partial p_j}mlogloss(y, p) & = - \frac{y_{j}}{p_{j}} \end{align} \]

– Next, when we compose the two functions \(Softmax()\) and \(mlogloss()\), the resulting partial derivative turns out to be remarkably elegant:

\[ \begin{align} p_i &= Softmax(x_i)\\ loss &= mlogloss(y, p) \\\\ \frac{\partial}{\partial x_j}loss & = \sum \limits_{i=1}^{m} \frac{\partial loss}{\partial p_i} \frac{\partial p_i}{\partial x_j} \\ & = - y_{j} \frac {1} {p_{j}} p_{j} ( 1-p_{j})-\sum \limits_{i\neq j} y_{i} \frac {1} {p_{i}} p_{i} (0 - p_{j}) \\ & = - y_{j} ( 1-p_{j})-\sum \limits_{i\neq j} y_{i} (0 - p_{j}) \\ & = - y_{j} +y_{j}p_{j}+\sum \limits_{i\neq j} y_{i} p_{j} \\ & = - y_{j} +\sum \limits_{i = 1}^{m} y_{i} p_{j} \\ & = p_{j} - y_{j} \ \ \ \ \ (\mbox{since } \sum \limits_{i = 1}^{m} y_{i} = 1) \end{align} \]
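– This elegant result is easy to verify numerically (a minimal sketch with arbitrary values; y is a one-hot label):

softmax <- function (x) {exp(x - max(x)) / sum(exp(x - max(x)))}
mlogloss <- function (y, p) {-sum(y * log(p))}

x <- c(0.5, -1.0, 2.0)
y <- c(0, 1, 0)
p <- softmax(x)
eps <- 1e-6

# Central-difference gradient of the composed loss with respect to x
grad_numeric <- sapply(1:3, function (j) {
  e <- rep(0, 3); e[j] <- eps
  (mlogloss(y, softmax(x + e)) - mlogloss(y, softmax(x - e))) / (2 * eps)
})

round(grad_numeric - (p - y), 6)   # should be (near) all zeros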

Section 2: Multiclass Prediction Functions (4)

\[ \begin{align} L & = XB \\ P & = Softmax(L) \\ loss & = mlogloss(Y, P) \\\\ Y & = \begin{pmatrix} y^1_1 & y^2_1 & y^3_1 \\ y^1_2 & y^2_2 & y^3_2 \\ y^1_3 & y^2_3 & y^3_3 \\ \vdots & \vdots & \vdots \\ y^1_n & y^2_n & y^3_n \end{pmatrix} P = \begin{pmatrix} p^1_1 & p^2_1 & p^3_1 \\ p^1_2 & p^2_2 & p^3_2 \\ p^1_3 & p^2_3 & p^3_3 \\ \vdots & \vdots & \vdots \\ p^1_n & p^2_n & p^3_n \end{pmatrix} X = \begin{pmatrix} 1 & x^1_1 & x^2_1 \\ 1 & x^1_2 & x^2_2 \\ 1 & x^1_3 & x^2_3 \\ \vdots & \vdots & \vdots \\ 1 & x^1_n & x^2_n \end{pmatrix} B = \begin{pmatrix} b^1_0 & b^2_0 & b^3_0 \\ b^1_1 & b^2_1 & b^3_1 \\ b^1_2 & b^2_2 & b^3_2 \end{pmatrix} \end{align} \]

\[ \begin{align} \frac{\partial}{\partial B}loss & = \frac{\partial}{\partial L} mlogloss(Y, Softmax(L)) \frac{\partial}{\partial B} XB \\ & = X^T (P-Y) \end{align} \]

Section 2: Multiclass Prediction Functions (5)

– We will first use the Iris flower data set, which was originally collected by Edgar Anderson as data on the geographic variation of irises on the Gaspé Peninsula in Canada, and was later used by Ronald Fisher as an example of discriminant analysis.

[Figure F01: the Iris flower data set]

– The data set contains 150 samples belonging to three species of the genus Iris: Iris setosa, Iris versicolor, and Iris virginica. Four features were measured on each sample: the length and the width of the sepal and the petal.

data(iris)

X <- cbind(1, as.matrix(iris[,-5]))
Y <- matrix(model.matrix(~-1 + iris[,5]), nrow = 150, ncol = 3)

Section 2: Multiclass Prediction Functions (6)

– Let's solve for \(B\) using gradient descent:

num.iteration <- 5000
lr <- 0.05
B_mat <- matrix(0, nrow = 5, ncol = 3)

for (i in 1:num.iteration) {
  L <- X %*% B_mat                                   # linear predictors (n x 3)
  # Row-wise Softmax; subtracting max(x) guards against overflow without changing the result
  P <- t(apply(L, 1, function (x) {exp(x - max(x)) / sum(exp(x - max(x)))}))
  B_mat <- B_mat - lr * t(X) %*% (P - Y) / nrow(X)   # gradient step: X^T(P - Y) / n
}

B_mat
##                    [,1]        [,2]      [,3]
##               0.5612794  1.19326975 -1.754549
## Sepal.Length  1.1960152  0.87175566 -2.067771
## Sepal.Width   2.6903757 -0.02834479 -2.662031
## Petal.Length -3.6596999 -0.24019526  3.899895
## Petal.Width  -1.7270576 -1.71964671  3.446704
L <- X %*% B_mat
P <- t(apply(L, 1, function (x) {exp(x - max(x)) / sum(exp(x - max(x)))}))
table(max.col(Y), max.col(P))
##    
##      1  2  3
##   1 50  0  0
##   2  0 48  2
##   3  0  0 50
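– In practice it is worth monitoring the loss while training. Here is a minimal sketch (reusing X, Y, and the settings above) that prints the mean m-log loss every 500 iterations so we can confirm that gradient descent is converging:

num.iteration <- 5000
lr <- 0.05
B_mat <- matrix(0, nrow = 5, ncol = 3)

for (i in 1:num.iteration) {
  L <- X %*% B_mat
  P <- t(apply(L, 1, function (x) {exp(x - max(x)) / sum(exp(x - max(x)))}))
  if (i %% 500 == 0) {
    cat('Iteration', i, '- mean m-log loss:', -mean(rowSums(Y * log(P))), '\n')
  }
  B_mat <- B_mat - lr * t(X) %*% (P - Y) / nrow(X)
}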

Section 3: Multiclass Prediction with a Package (1)

– The function "glmnet()" in this package is very powerful, and we will introduce it in detail later. To reproduce our earlier results, we have to set the parameter lambda = 0:

library(glmnet)

subdat <- dat[!(dat[,"K"] %in% NA) & !(dat[,"AGE"] %in% NA) & !(dat[,"QTc"] %in% NA),]
X <- as.matrix(subdat[,c("AGE", "QTc")])
Y <- subdat[,"K"]

fit_glmnet <- glmnet(x = X, y = Y, family = 'gaussian', lambda = 0)
coef.glmnet(fit_glmnet)
## 3 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept)  4.244527966
## AGE          0.018570089
## QTc         -0.003497804
fit_lm <- lm(K ~ AGE + QTc, data = subdat)
coef(fit_lm)
##  (Intercept)          AGE          QTc 
##  4.244527924  0.018570090 -0.003497804
pred_Y <- predict(fit_glmnet, X)
cor(pred_Y, Y)
##         [,1]
## s0 0.2872139

Section 3: Multiclass Prediction with a Package (2)

B_mat
##                    [,1]        [,2]      [,3]
##               0.5612794  1.19326975 -1.754549
## Sepal.Length  1.1960152  0.87175566 -2.067771
## Sepal.Width   2.6903757 -0.02834479 -2.662031
## Petal.Length -3.6596999 -0.24019526  3.899895
## Petal.Width  -1.7270576 -1.71964671  3.446704
data(iris)
X <- as.matrix(iris[,-5])
Y <- iris[,5]

fit_glmnet <- glmnet(x = X, y = Y, family = 'multinomial', lambda = 0)
coef.glmnet(fit_glmnet)
## $setosa
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                      s0
##               21.231985
## Sepal.Length   .       
## Sepal.Width    9.139921
## Petal.Length -10.974156
## Petal.Width   -6.699920
## 
## $versicolor
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                      s0
##              10.3546155
## Sepal.Length  0.9359832
## Sepal.Width   .        
## Petal.Length  .        
## Petal.Width   .        
## 
## $virginica
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                      s0
##              -31.586601
## Sepal.Length  -1.519747
## Sepal.Width   -6.574618
## Petal.Length   9.312621
## Petal.Width   17.999778

Section 3: Multiclass Prediction with a Package (3)

– Don't be alarmed: under the Softmax function, several different sets of coefficients can give the same answer, so the resulting predictions are nearly identical:

pred_B_mat <- cbind(1, X) %*% B_mat
pred_B_mat <- max.col(pred_B_mat)

pred_glmnet <- predict(fit_glmnet, X)
pred_glmnet <- max.col(pred_glmnet[,,1])

table(pred_B_mat, pred_glmnet)
##           pred_glmnet
## pred_B_mat  1  2  3
##          1 50  0  0
##          2  0 48  0
##          3  0  2 50
table(pred_B_mat, Y)
##           Y
## pred_B_mat setosa versicolor virginica
##          1     50          0         0
##          2      0         48         0
##          3      0          2        50
table(pred_glmnet, Y)
##            Y
## pred_glmnet setosa versicolor virginica
##           1     50          0         0
##           2      0         49         1
##           3      0          1        49
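– The reason several coefficient matrices can encode the same classifier is that Softmax is invariant to adding a constant to all of its inputs. A quick check with arbitrary values:

# Shifting every input by the same constant leaves the Softmax probabilities unchanged
softmax <- function (x) {exp(x - max(x)) / sum(exp(x - max(x)))}
x <- c(1.2, -0.5, 3.0)
round(softmax(x) - softmax(x + 10), 10)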

Exercise 1: Carry Out a Multiclass Prediction Task

subdat <- dat[!(dat[,"AMI"] %in% NA) & !(dat[,"Rate"] %in% NA) & !(dat[,"QTc"] %in% NA),]
subdat[,"AMI"] <- as.factor(subdat[,"AMI"])

X <- as.matrix(subdat[,c('Rate', 'QTc')])
Y <- subdat[,"AMI"]
set.seed(0)
all_idx <- 1:nrow(subdat)

train_idx <- sample(all_idx, nrow(subdat) * 0.6)
valid_idx <- sample(all_idx[!all_idx %in% train_idx], nrow(subdat) * 0.2)
test_idx <- all_idx[!all_idx %in% c(train_idx, valid_idx)]

train_X <- X[train_idx,]
valid_X <- X[valid_idx,]
test_X <- X[test_idx,]

train_Y <- Y[train_idx]
valid_Y <- Y[valid_idx]
test_Y <- Y[test_idx]

Exercise 1 Answer (1)

fit_glmnet <- glmnet(x = train_X, y = train_Y, family = 'multinomial', lambda = 0)
pred_valid <- predict(fit_glmnet, valid_X)
pred_test <- predict(fit_glmnet, test_X)
table(max.col(pred_valid[,,1]), valid_Y)
##    valid_Y
##     not-AMI NSTEMI STEMI
##   1     220     26    49
table(max.col(pred_test[,,1]), test_Y)
##    test_Y
##     not-AMI NSTEMI STEMI
##   1     208     42    47

Exercise 1 Answer (2)

train_X <- cbind(1, train_X)
train_Y <- matrix(model.matrix(~-1 + train_Y), ncol = 3)

num.iteration <- 5000
lr <- 0.001
B_mat <- matrix(0, nrow = 3, ncol = 3)

for (i in 1:num.iteration) {
  L <- train_X %*% B_mat
  P <- t(apply(L, 1, function (x) {exp(x - max(x)) / sum(exp(x - max(x)))}))
  B_mat <- B_mat - lr * t(train_X) %*% (P - train_Y) / nrow(train_X)
}
pred_valid <- cbind(1, valid_X) %*% B_mat
pred_test <- cbind(1, test_X) %*% B_mat
table(max.col(pred_valid), valid_Y)
##    valid_Y
##     not-AMI NSTEMI STEMI
##   1     207     25    48
##   2      13      1     1
table(max.col(pred_test), test_Y)
##    test_Y
##     not-AMI NSTEMI STEMI
##   1     199     38    43
##   2       9      4     4
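– A practical note going slightly beyond the answer: with unstandardized predictors such as "Rate" and "QTc", gradient descent needs a very small learning rate and converges slowly. A sketch of one common remedy, standardizing the predictors with training-set statistics (the variable names below are illustrative):

# Standardize using training-set means/sds only
train_means <- colMeans(train_X[,-1])
train_sds <- apply(train_X[,-1], 2, sd)
std_train_X <- cbind(1, scale(train_X[,-1], center = train_means, scale = train_sds))

B_mat <- matrix(0, nrow = 3, ncol = 3)
lr <- 0.05

for (i in 1:5000) {
  L <- std_train_X %*% B_mat
  P <- t(apply(L, 1, function (x) {exp(x - max(x)) / sum(exp(x - max(x)))}))
  B_mat <- B_mat - lr * t(std_train_X) %*% (P - train_Y) / nrow(std_train_X)
}

# The SAME training means/sds must be applied to the validation and test sets
pred_valid <- cbind(1, scale(valid_X, center = train_means, scale = train_sds)) %*% B_mat
table(max.col(pred_valid), valid_Y)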

Section 4: The Class Imbalance Problem (1)

– Let's try using "Rate" and "QTc" together to predict "LVD".

subdat <- dat[!(dat[,"LVD"] %in% NA) & !(dat[,"Rate"] %in% NA) & !(dat[,"QTc"] %in% NA),]

X <- as.matrix(subdat[,c('Rate', 'QTc')])
Y <- subdat[,"LVD"]

set.seed(0)
all_idx <- 1:nrow(subdat)

train_idx <- sample(all_idx, nrow(subdat) * 0.6)
valid_idx <- sample(all_idx[!all_idx %in% train_idx], nrow(subdat) * 0.2)
test_idx <- all_idx[!all_idx %in% c(train_idx, valid_idx)]

train_X <- X[train_idx,]
valid_X <- X[valid_idx,]
test_X <- X[test_idx,]

train_Y <- Y[train_idx]
valid_Y <- Y[valid_idx]
test_Y <- Y[test_idx]

Section 4: The Class Imbalance Problem (2)

fit_glmnet <- glmnet(x = train_X, y = train_Y, family = 'binomial', lambda = 0)
pred_valid <- predict(fit_glmnet, valid_X)
pred_test <- predict(fit_glmnet, test_X)
table(pred_valid > 0, valid_Y)
##        valid_Y
##           0   1
##   FALSE 181  86
##   TRUE   61  94
table(pred_test > 0, test_Y)
##        test_Y
##           0   1
##   FALSE 174  98
##   TRUE   46 104

Section 4: The Class Imbalance Problem (3)

library(pROC)

# Youden's index: choose the threshold that maximizes sensitivity + specificity on the validation set
roc_valid <- roc(valid_Y ~ pred_valid)
best_pos <- which.max(roc_valid$sensitivities + roc_valid$specificities)
best_cut <- roc_valid$thresholds[best_pos]

tab_test <- table(pred_test >= best_cut, test_Y)
sens <- tab_test[2,2] / sum(tab_test[,2])   # sensitivity on the test set
spec <- tab_test[1,1] / sum(tab_test[,1])   # specificity on the test set

roc_test <- roc(test_Y ~ pred_test)
plot(roc_test)

points(spec, sens, pch = 19)
text(0.5, 0.5, paste0('Sens = ', formatC(sens, digits = 3, format = 'f'),
                      '\nSpec = ', formatC(spec, digits = 3, format = 'f'),
                      '\nAUC = ', formatC(roc_test$auc, digits = 3, format = 'f')), col = 'red')

Section 4: The Class Imbalance Problem (4)

subdat <- dat[!(dat[,"AMI"] %in% NA) & !(dat[,"Rate"] %in% NA) & !(dat[,"QTc"] %in% NA),]
subdat[,"AMI"] <- as.factor(subdat[,"AMI"])

X <- as.matrix(subdat[,c('Rate', 'QTc')])
Y <- subdat[,"AMI"]

set.seed(0)
all_idx <- 1:nrow(subdat)

train_idx <- sample(all_idx, nrow(subdat) * 0.6)
valid_idx <- sample(all_idx[!all_idx %in% train_idx], nrow(subdat) * 0.2)
test_idx <- all_idx[!all_idx %in% c(train_idx, valid_idx)]

train_X <- X[train_idx,]
valid_X <- X[valid_idx,]
test_X <- X[test_idx,]

train_Y <- Y[train_idx]
valid_Y <- Y[valid_idx]
test_Y <- Y[test_idx]
fit_glmnet <- glmnet(x = train_X, y = train_Y, family = 'multinomial', lambda = 0)
pred_valid <- predict(fit_glmnet, valid_X)
pred_test <- predict(fit_glmnet, test_X)

Section 4: The Class Imbalance Problem (5)

par(mfrow = c(1, 2))

roc_valid <- roc((valid_Y == 'NSTEMI') ~ pred_valid[,2,1])
roc_test <- roc((test_Y == 'NSTEMI') ~ pred_test[,2,1])
plot(roc_valid, col = 'blue')
plot(roc_test, add = TRUE, col = 'red')

roc_valid <- roc((valid_Y == 'STEMI') ~ pred_valid[,3,1])
roc_test <- roc((test_Y == 'STEMI') ~ pred_test[,3,1])
plot(roc_valid, col = 'blue')
plot(roc_test, add = TRUE, col = 'red')

Section 4: The Class Imbalance Problem (6)

[Figure F02]

– For linear models the effect of this trick may not be very obvious, but keep it in mind; many nonlinear models we will meet later depend heavily on it.

train_X.0 <- train_X[train_Y == 'not-AMI',]
train_X.1 <- train_X[train_Y == 'NSTEMI',]
train_X.2 <- train_X[train_Y == 'STEMI',]

set.seed(0)

train_X.0 <- train_X.0[sample(1:nrow(train_X.0), 5000, replace = TRUE),]
train_X.1 <- train_X.1[sample(1:nrow(train_X.1), 5000, replace = TRUE),]
train_X.2 <- train_X.2[sample(1:nrow(train_X.2), 5000, replace = TRUE),]

train_X <- rbind(train_X.0, train_X.1, train_X.2)  # stack the three resampled classes in label order (0, 1, 2)
train_Y <- factor(rep(0:2, each = 5000))
fit_glmnet <- glmnet(x = train_X, y = train_Y, family = 'multinomial', lambda = 0)
pred_valid <- predict(fit_glmnet, valid_X)
pred_test <- predict(fit_glmnet, test_X)
table(max.col(pred_valid[,,1]), valid_Y)
##    valid_Y
##     not-AMI NSTEMI STEMI
##   1      78      7    12
##   2     112     14    31
##   3      30      5     6
table(max.col(pred_test[,,1]), test_Y)
##    test_Y
##     not-AMI NSTEMI STEMI
##   1      71      7    16
##   2     117     25    14
##   3      20     10    17

Lesson Summary

– In addition, you should now have a clearer sense of how important the validation set is; a great many decisions have to be made with it.

– The function "glmnet()" can evidently handle everything we have learned so far at once, and if you read its documentation carefully, it can even perform survival analysis!