第一節：線性迴歸的多輸出擴展(1)

我們現在已經非常清楚人工智慧的本質就是創造一個『預測函數』\(f()\)，假定Input是\(x\)物件，而Output是\(y\)物件，並使用\(f()\)來對其進行預測：

\[\hat{y} = f(x)\]

然而，目前所學習的結構僅讓我們能面對數值型態的\(x\)，而\(y\)的屬性僅能是單一的連續變項或二元類別變項，顯然這樣的能力是不足以讓我們應付真實世界的，讓我們對這兩者進一步擴展吧！

– 請至這裡下載範例資料

dat <- read.csv("ECG_train.csv", header = TRUE, fileEncoding = 'CP950', stringsAsFactors = FALSE, na.strings = "")

第一節：線性迴歸的多輸出擴展(2)

我們先以線性迴歸為例，再看一次他的方程式(矩陣型態)：

\[ \begin{align} Y & = XB + E \\\\ Y & = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix} X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ 1 & \vdots \\ 1 & x_n \end{pmatrix} B = \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} E = \begin{pmatrix} \delta_1 \\ \delta_2 \\ \delta_3 \\ \vdots \\ \delta_n \end{pmatrix} \end{align} \]

這是他的求解過程：

\[ \begin{align} B & = (X^TX)^{-1}X^TY \end{align} \]

你是不是發現了一件事情，那就是\(Y\)的輸出維度在此之前被我們固定為1。

第一節：線性迴歸的多輸出擴展(3)

但從矩陣運算的角度來看，這似乎是可以擴展的，所以我們試著做個實驗，用「AGE」同時預測「Rate」與「PR」看看：

subdat <- dat[!(dat[,"AGE"] %in% NA) & !(dat[,"Rate"] %in% NA) & !(dat[,"PR"] %in% NA),]
X <- subdat[,"AGE"]
Y <- subdat[,c("Rate", "PR")]

X_mat <- cbind(1, as.matrix(X))
Y_mat <- as.matrix(Y)

B_mat <- solve(t(X_mat) %*% X_mat) %*% t(X_mat) %*% Y_mat
B_mat

##             Rate          PR
## [1,] 79.50964419 142.6066267
## [2,]  0.07317945   0.3430289

我們一次得到兩組係數，比較一下：

lm(Y[,'Rate'] ~ X)

## 
## Call:
## lm(formula = Y[, "Rate"] ~ X)
## 
## Coefficients:
## (Intercept)            X  
##    79.50964      0.07318

lm(Y[,'PR'] ~ X)

## 
## Call:
## lm(formula = Y[, "PR"] ~ X)
## 
## Coefficients:
## (Intercept)            X  
##     142.607        0.343

矩陣很方便對吧！

第二節：多類別的預測函數(1)

我們現在一次可以透過\(X\)預測多個\(Y\)，我們是不是可以透過某種方式來預測多類別的狀況？

– 我們來嘗試做一次使用「AGE」與「Rate」同時來預測「AMI」好了。

還記得之前的虛擬變數製作方式嘛?我們需要先把AMI做這樣的處理。

– 需要特別注意的是編碼方式有一點點不同：

subdat <- dat[!(dat[,"AMI"] %in% NA) & !(dat[,"AGE"] %in% NA) & !(dat[,"Rate"] %in% NA),]
subdat[,"AMI"] <- as.factor(subdat[,"AMI"])

X <- cbind(1, as.matrix(subdat[,c('AGE', 'Rate')]))
Y <- matrix(model.matrix(~-1 + subdat[,"AMI"]), ncol = 3)

head(Y)

##      [,1] [,2] [,3]
## [1,]    0    0    1
## [2,]    1    0    0
## [3,]    1    0    0
## [4,]    0    0    1
## [5,]    0    1    0
## [6,]    1    0    0

第二節：多類別的預測函數(2)

在這裡，我們介紹一個常用的多分類函數\(Softmax()\)，這是邏輯斯轉換函數的一種多元擴展型態，他的公式如下：

\[ \begin{align} Softmax(x_j) & = \frac{e^{x_j}}{\sum \limits_{i=1}^{m} e^{x_i}} \ \ \ \ \ j \in 1 \ \mbox{to} \ m \end{align} \]

而他的偏導函數也相當簡單(過程略)：

\[ \begin{align} \frac{\partial}{\partial x_i}Softmax(x_j) & = \left\{ \begin{array} -Softmax(x_j)(1 - Softmax(x_i)) & \mbox{ if i = j} \\ Softmax(x_j)(0 - Softmax(x_i)) & \mbox{ otherwise} \end{array} \right. \end{align} \]

有了這項工具後，我們將可以在網路的最後一層輸出\(c\)個類別的預測機率，並且他們加總起來的和為1。

第二節：多類別的預測函數(3)

我們如果仔細觀察剛剛函數\(Softmax()\)的偏導函數，你會發現跟邏輯斯轉換函數遇到了相同的問題，也就是有\(Softmax(x_j)(1 - Softmax(x_i))\)的部分。

– 之前我們為了邏輯斯迴歸設計了交叉熵損失函數，使其在最終函數求解時較為平穩，現在我們也要為函數\(Softmax()\)設計一個特異的損失函數，他的名字叫做\(m\)對數似然函數(\(m\)-log likelihood function)：

– 這裡的\(m\)同樣代表著類別數，而\(n\)代表著樣本數：

\[ \begin{align} mlogloss(y_{i}, p_{i}) & = - \sum \limits_{i=1}^{m} y_{i} log(p_{i}) \end{align} \]

– 這裡我們把偏導函數列出來(過程略)：

\[ \begin{align} \frac{\partial}{\partial p_i}mlogloss(y_{i}, p_{i}) & = - \sum \limits_{i=1}^{m} y_{i} \frac {1} {p_{i}} \end{align} \]

– 接著我們將\(Softmax()\)以及\(mlogloss()\)兩個函數合起來，得到的偏導函數就會相當優美：

\[ \begin{align} p_i &= Softmax(x_i)\\ loss &= mlogloss(y_{i}, p_{i}) \\\\ \frac{\partial}{\partial x_j}loss & = \frac{\partial}{\partial p_j}loss \frac{\partial}{\partial x_j}p_i \\ & = - y_{j} \frac {1} {p_{j}} p_{j} ( 1-p_{j})-\sum \limits_{i\neq j} y_{i} \frac {1} {p_{j}} p_{j} (0 - p_{j}) \\ & = - y_{j} ( 1-p_{j})-\sum \limits_{i\neq j} y_{i} (0 - p_{j}) \\ & = - y_{j} +y_{j}p_{j}+\sum \limits_{i\neq j} y_{i} p_{j} \\ & = - y_{j} +\sum \limits_{i = 1}^{m} y_{i} p_{j} \\ & = p_{j} - y_{j} \end{align} \]

第二節：多類別的預測函數(4)

讓我們來親手做一次完整的推導好了(上標代表不同變數)：

\[ \begin{align} L & = XB \\ P & = Softmax(L) \\ loss & = mlogloss(Y, P) \\\\ Y & = \begin{pmatrix} y^1_1 & y^2_1 & y^3_1 \\ y^1_2 & y^2_2 & y^3_2 \\ y^1_3 & y^2_3 & y^3_3 \\ \vdots \\ y^1_n & y^2_n & y^3_n \end{pmatrix} P = \begin{pmatrix} p^1_1 & p^2_1 & p^3_1 \\ p^1_2 & p^2_2 & p^3_2 \\ p^1_3 & p^2_3 & p^3_3 \\ \vdots \\ p^1_n & p^2_n & p^3_n \end{pmatrix} X = \begin{pmatrix} 1 & x^1_1 & x^2_2 \\ 1 & x^1_2 & x^2_2 \\ 1 & x^1_3 & x^2_3 \\ 1 & \cdots & \cdots \\ 1 & x^1_n & x^2_n \end{pmatrix} B = \begin{pmatrix} b^1_0 & b^2_0 & b^3_0 \\ b^1_1 & b^2_1 & b^3_1 \\ b^1_2 & b^2_2 & b^3_2\end{pmatrix} \end{align} \]

我們會用到連鎖率

\[ \begin{align} \frac{\partial}{\partial B}loss & = \frac{\partial}{\partial L} mlogloss(Y, Softmax(L)) \frac{\partial}{\partial B} XB \\ & = X^T (P-Y) \end{align} \]

第二節：多類別的預測函數(5)

讓我們使用梯度下降法來親手做一次好了！

– 我們先使用鳶尾花卉數據集(Iris flower data set)最初是Edgar Anderson 從加拿大加斯帕半島上的鳶尾屬花朵中提取的地理變異數據，後由Ronald Fisher作為判別分析的一個例子。

F01

– 其數據集包含了150個樣本，都屬於鳶尾屬下的三個亞屬，分別是山鳶尾(setosa)、變色鳶尾(versicolor)和維吉尼亞鳶尾(virginica)。四個特徵被用作樣本的定量分析，它們分別是花萼(sepal)和花瓣(petal)的長度和寬度。

在R裡面，我們能輕易的叫出IRIS資料集

data(iris)

X <- cbind(1, as.matrix(iris[,-5]))
Y <- matrix(model.matrix(~-1 + iris[,5]), nrow = 150, ncol = 3)

第二節：多類別的預測函數(6)

– 讓我們使用梯度下降法來求\(B\)：

num.iteration <- 5000
lr <- 0.05
B_mat <- matrix(0, nrow = 5, ncol = 3)

for (i in 1:num.iteration) {
  L <- X %*% B_mat
  P <- t(apply(L, 1, function (x) {exp(x) / sum(exp(x))}))
  B_mat <- B_mat - lr * t(X) %*% (P-Y) / nrow(X)
}

B_mat

##                    [,1]        [,2]      [,3]
##               0.5612794  1.19326975 -1.754549
## Sepal.Length  1.1960152  0.87175566 -2.067771
## Sepal.Width   2.6903757 -0.02834479 -2.662031
## Petal.Length -3.6596999 -0.24019526  3.899895
## Petal.Width  -1.7270576 -1.71964671  3.446704

讓我們來簡單預測一下好了：

L <- X %*% B_mat
P <- t(apply(L, 1, function (x) {exp(x) / sum(exp(x))}))
table(max.col(Y), max.col(P))

##    
##      1  2  3
##   1 50  0  0
##   2  0 48  2
##   3  0  0 50

還滿準的吧？

第三節：使用套件來進行多類別預測(1)

套件「glmnet」可以用來進行多元分類迴歸分析：

– 裡面的這個函數「glmnet()」非常強大，我們之後會再詳細介紹，如果想要重複之前的結果我們必須設定這樣的參數(lambda = 0)：

library(glmnet)

subdat <- dat[!(dat[,"K"] %in% NA) & !(dat[,"AGE"] %in% NA) & !(dat[,"QTc"] %in% NA),]
X <- as.matrix(subdat[,c("AGE", "QTc")])
Y <- subdat[,"K"]

fit_glmnet <- glmnet(x = X, y = Y, family = 'gaussian', lambda = 0)
coef.glmnet(fit_glmnet)

## 3 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept)  4.244527966
## AGE          0.018570089
## QTc         -0.003497804

先跟之前線性回歸的結果比較：

fit_lm <- lm(K ~ AGE + QTc, data = subdat)
coef(fit_lm)

##  (Intercept)          AGE          QTc 
##  4.244527924  0.018570090 -0.003497804

這個函數可以透過這種方式很方便的給出預測值：

pred_Y <- predict(fit_glmnet, X)
cor(pred_Y, Y)

##         [,1]
## s0 0.2872139

第三節：使用套件來進行多類別預測(2)

現在讓我們重現一下剛剛手刻的結果：

B_mat

##                    [,1]        [,2]      [,3]
##               0.5612794  1.19326975 -1.754549
## Sepal.Length  1.1960152  0.87175566 -2.067771
## Sepal.Width   2.6903757 -0.02834479 -2.662031
## Petal.Length -3.6596999 -0.24019526  3.899895
## Petal.Width  -1.7270576 -1.71964671  3.446704

直接使用函數「glmnet()」：

data(iris)
X <- as.matrix(iris[,-5])
Y <- iris[,5]

fit_glmnet <- glmnet(x = X, y = Y, family = 'multinomial', lambda = 0)
coef.glmnet(fit_glmnet)

## $setosa
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                      s0
##               21.231985
## Sepal.Length   .       
## Sepal.Width    9.139921
## Petal.Length -10.974156
## Petal.Width   -6.699920
## 
## $versicolor
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                      s0
##              10.3546155
## Sepal.Length  0.9359832
## Sepal.Width   .        
## Petal.Length  .        
## Petal.Width   .        
## 
## $virginica
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                      s0
##              -31.586601
## Sepal.Length  -1.519747
## Sepal.Width   -6.574618
## Petal.Length   9.312621
## Petal.Width   17.999778

第三節：使用套件來進行多類別預測(3)

看起來結果好像差很多對嘛?

– 不要緊張，因為在Softmax函數的作用下，會有多個式子會給出相同的答案，但預測出來的答案是相似的：

pred_B_mat <- cbind(1, X) %*% B_mat
pred_B_mat <- max.col(pred_B_mat)

pred_glmnet <- predict(fit_glmnet, X)
pred_glmnet <- max.col(pred_glmnet[,,1])

table(pred_B_mat, pred_glmnet)

##           pred_glmnet
## pred_B_mat  1  2  3
##          1 50  0  0
##          2  0 48  0
##          3  0  2 50

跟實際值比一下：

table(pred_B_mat, Y)

##           Y
## pred_B_mat setosa versicolor virginica
##          1     50          0         0
##          2      0         48         0
##          3      0          2        50

table(pred_glmnet, Y)

##            Y
## pred_glmnet setosa versicolor virginica
##           1     50          0         0
##           2      0         49         1
##           3      0          1        49

練習1：執行一次多元預測任務

請你使用資料科學的實驗流程設計一個實驗，先把樣本隨機分成3份，產生「訓練組」、「驗證組」與「測試組」。
為了簡化整個流程，我們希望你採取函數「glmnet()」，並且我們指定使用「Rate」與「QTc」同時來預測「AMI」好了。

subdat <- dat[!(dat[,"AMI"] %in% NA) & !(dat[,"Rate"] %in% NA) & !(dat[,"QTc"] %in% NA),]
subdat[,"AMI"] <- as.factor(subdat[,"AMI"])

X <- as.matrix(subdat[,c('Rate', 'QTc')])
Y <- subdat[,"AMI"]

先用隨機的方式決定「訓練組」、「驗證組」、「測試組」

set.seed(0)
all_idx <- 1:nrow(subdat)

train_idx <- sample(all_idx, nrow(subdat) * 0.6)
valid_idx <- sample(all_idx[!all_idx %in% train_idx], nrow(subdat) * 0.2)
test_idx <- all_idx[!all_idx %in% c(train_idx, valid_idx)]

train_X <- X[train_idx,]
valid_X <- X[valid_idx,]
test_X <- X[test_idx,]

train_Y <- Y[train_idx]
valid_Y <- Y[valid_idx]
test_Y <- Y[test_idx]

如果你有時間的話，可以試著用剛剛解出來的梯度下降法流程重現一遍結果。

練習1答案(1)

直接使用函數「glmnet()」：

fit_glmnet <- glmnet(x = train_X, y = train_Y, family = 'multinomial', lambda = 0)
pred_valid <- predict(fit_glmnet, valid_X)
pred_test <- predict(fit_glmnet, test_X)

與真實值比較一下：

table(max.col(pred_valid[,,1]), valid_Y)

##    valid_Y
##     not-AMI NSTEMI STEMI
##   1     220     26    49

table(max.col(pred_test[,,1]), test_Y)

##    test_Y
##     not-AMI NSTEMI STEMI
##   1     208     42    47

失敗了嘛? 好像有點悲劇? 想一下為什麼會這樣！

練習1答案(2)

讓我們試一下用之前推導出來的方法做吧！

train_X <- cbind(1, train_X)
train_Y <- matrix(model.matrix(~-1 + train_Y), ncol = 3)

num.iteration <- 5000
lr <- 0.001
B_mat <- matrix(0, nrow = 3, ncol = 3)

for (i in 1:num.iteration) {
  L <- train_X %*% B_mat
  P <- t(apply(L, 1, function (x) {exp(x) / sum(exp(x))}))
  B_mat <- B_mat - lr * t(train_X) %*% (P-train_Y) / nrow(train_X)
}

來預測一下：

pred_valid <- cbind(1, valid_X) %*% B_mat
pred_test <- cbind(1, test_X) %*% B_mat

與真實值比較一下：

table(max.col(pred_valid), valid_Y)

##    valid_Y
##     not-AMI NSTEMI STEMI
##   1     207     25    48
##   2      13      1     1

table(max.col(pred_test), test_Y)

##    test_Y
##     not-AMI NSTEMI STEMI
##   1     199     38    43
##   2       9      4     4

好像差不多！

第四節：類別不平衡問題(1)

實際上真的有這麼差嗎?問題其實是出在截距項！

– 我們來試著使用「Rate」與「QTc」同時來預測「LVD」好了。

subdat <- dat[!(dat[,"LVD"] %in% NA) & !(dat[,"Rate"] %in% NA) & !(dat[,"QTc"] %in% NA),]

X <- as.matrix(subdat[,c('Rate', 'QTc')])
Y <- subdat[,"LVD"]

set.seed(0)
all_idx <- 1:nrow(subdat)

train_idx <- sample(all_idx, nrow(subdat) * 0.6)
valid_idx <- sample(all_idx[!all_idx %in% train_idx], nrow(subdat) * 0.2)
test_idx <- all_idx[!all_idx %in% c(train_idx, valid_idx)]

train_X <- X[train_idx,]
valid_X <- X[valid_idx,]
test_X <- X[test_idx,]

train_Y <- Y[train_idx]
valid_Y <- Y[valid_idx]
test_Y <- Y[test_idx]

第四節：類別不平衡問題(2)

直接使用函數「glmnet()」：

fit_glmnet <- glmnet(x = train_X, y = train_Y, family = 'binomial', lambda = 0)
pred_valid <- predict(fit_glmnet, valid_X)
pred_test <- predict(fit_glmnet, test_X)

與真實值比較一下：

table(pred_valid > 0, valid_Y)

##        valid_Y
##           0   1
##   FALSE 181  86
##   TRUE   61  94

table(pred_test > 0, test_Y)

##        test_Y
##           0   1
##   FALSE 174  98
##   TRUE   46 104

你發現問題點了嗎？關鍵在切點。

第四節：類別不平衡問題(3)

上週我們其實是有先在「驗證組」中使用「ROC曲線」找出敏感度與特異度最大的切點，再放到「測試組」的：

library(pROC)

roc_valid <- roc(valid_Y ~ pred_valid)
best_pos <- which.max(roc_valid$sensitivities + roc_valid$specificities)
best_cut <- roc_valid$thresholds[best_pos]

tab_test <- table(pred_test >= best_cut, test_Y)
sens <- tab_test[2,2] / sum(tab_test[,2])
spec <- tab_test[1,1] / sum(tab_test[,1])

roc_test <- roc(test_Y ~ pred_test)
plot(roc_test)

points(spec, sens, pch = 19)
text(0.5, 0.5, paste0('Sens = ', formatC(sens, digits = 3, format = 'f'),
                      '\nSpec = ', formatC(spec, digits = 3, format = 'f'),
                      '\nAUC = ', formatC(roc_test$auc, digits = 3, format = 'f')), col = 'red')

原來我們是有使用「驗證組」找切點，看來驗證組真的很重要！

第四節：類別不平衡問題(4)

讓我們再做一次原來的問題：

subdat <- dat[!(dat[,"AMI"] %in% NA) & !(dat[,"Rate"] %in% NA) & !(dat[,"QTc"] %in% NA),]
subdat[,"AMI"] <- as.factor(subdat[,"AMI"])

X <- as.matrix(subdat[,c('Rate', 'QTc')])
Y <- subdat[,"AMI"]

set.seed(0)
all_idx <- 1:nrow(subdat)

train_idx <- sample(all_idx, nrow(subdat) * 0.6)
valid_idx <- sample(all_idx[!all_idx %in% train_idx], nrow(subdat) * 0.2)
test_idx <- all_idx[!all_idx %in% c(train_idx, valid_idx)]

train_X <- X[train_idx,]
valid_X <- X[valid_idx,]
test_X <- X[test_idx,]

train_Y <- Y[train_idx]
valid_Y <- Y[valid_idx]
test_Y <- Y[test_idx]

直接使用函數「glmnet()」：

fit_glmnet <- glmnet(x = train_X, y = train_Y, family = 'multinomial', lambda = 0)
pred_valid <- predict(fit_glmnet, valid_X)
pred_test <- predict(fit_glmnet, test_X)

第四節：類別不平衡問題(5)

把ROC curve畫出來：

par(mfrow = c(1, 2))

roc_valid <- roc((valid_Y == 'NSTEMI') ~ pred_valid[,2,1])
roc_test <- roc((test_Y == 'NSTEMI') ~ pred_test[,2,1])
plot(roc_valid, col = 'blue')
plot(roc_test, add = TRUE, col = 'red')

roc_valid <- roc((valid_Y == 'STEMI') ~ pred_valid[,3,1])
roc_test <- roc((test_Y == 'STEMI') ~ pred_test[,3,1])
plot(roc_valid, col = 'blue')
plot(roc_test, add = TRUE, col = 'red')

第四節：類別不平衡問題(6)

除了驗證組的作用之外，為了讓模型更加認識稀有類別的問題，我們也會使用過採樣(Oversampling)技術來幫忙，這也能有效解決一些問題。

F02

– 對於線性模型，這個作用可能還不是非常明顯，但你要記住這個技巧，未來有許多非線性模型非常需要他的幫忙

train_X.0 <- train_X[train_Y == 'not-AMI',]
train_X.1 <- train_X[train_Y == 'NSTEMI',]
train_X.2 <- train_X[train_Y == 'STEMI',]

set.seed(0)

train_X.0 <- train_X.0[sample(1:nrow(train_X.0), 5000, replace = TRUE),]
train_X.1 <- train_X.1[sample(1:nrow(train_X.1), 5000, replace = TRUE),]
train_X.2 <- train_X.2[sample(1:nrow(train_X.2), 5000, replace = TRUE),]

train_X <- rbind(train_X.2, train_X.1, train_X.2)
train_Y <- factor(rep(0:2, each = 5000))

直接把新的資料拿來建模+預測：

fit_glmnet <- glmnet(x = train_X, y = train_Y, family = 'multinomial', lambda = 0)
pred_valid <- predict(fit_glmnet, valid_X)
pred_test <- predict(fit_glmnet, test_X)

這樣操作之後甚至可以拿來直接預測！(不過使用ROC曲線找切點還是比較理想的方式)

table(max.col(pred_valid[,,1]), valid_Y)

##    valid_Y
##     not-AMI NSTEMI STEMI
##   1      78      7    12
##   2     112     14    31
##   3      30      5     6

table(max.col(pred_test[,,1]), test_Y)

##    test_Y
##     not-AMI NSTEMI STEMI
##   1      71      7    16
##   2     117     25    14
##   3      20     10    17

課程小結

本週的上課重點在於擴展了我們可以執行的任務，目前\(y\)無論是連續變項、類別變項都有相對應的工具能進行建模了。

– 另外，你應該也更清楚「驗證組」的重要性了，有太多事情需要驗證組來進行決定。

– 函數「glmnet()」顯然可以用來同時做目前我們學到的所有事情，甚至如果你有仔細看他的說明文件，他也能拿來做存活分析！

下週我們將教導Cox proportional hazard model的完整推導過程，以及如何使用套件來進行分析！

機器學習及演算法