Machine Learning and Algorithms

林嶔 (Lin, Chin)

Lesson 7: Fundamentals of Artificial Intelligence 4 (Logistic Regression and Data Science Research Design)

Section 1: Introduction to Logistic Regression (1)

– Recall the three main steps of building a prediction model:

  1. Propose a prediction function

  2. Propose a loss function

  3. Optimize with gradient descent (also called model training)

– The prediction equation of logistic regression:

\[lp_i= log(\frac{{p_i}}{1-p_i}) = b_{0} + b_{1}x_i\]

– where \(p_i\) is the predicted probability that sample \(i\) is positive. Let us first rewrite this expression into its standard form:

\[p_i = \frac{{1}}{1+e^{-lp_i}} = \frac{{1}}{1+e^{-b_{0} - b_{1}x_i}}\]
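
– As a quick check, R's built-in plogis() and qlogis() implement exactly this pair of transformations (a minimal sketch; the coefficient values are arbitrary and not taken from any model in this lesson):

lp <- -0.3 + 0.02 * 60    # hypothetical b0 = -0.3, b1 = 0.02, x = 60, for illustration only
1 / (1 + exp(-lp))        # manual inverse-logit, as in the formula above
plogis(lp)                # same value: plogis() is the built-in inverse-logit
qlogis(plogis(lp))        # recovers lp: qlogis() is the logit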

Section 1: Introduction to Logistic Regression (2)

– Please download the example data here.

dat <- read.csv("ECG_train.csv", header = TRUE, fileEncoding = 'CP950', stringsAsFactors = FALSE, na.strings = "")

– The most common types of Y are:

  1. A continuous variable - the errors are assumed to be normally distributed (this is linear regression)

  2. A binary variable - the probability is assumed to follow a logistic distribution (this is logistic regression)

x <- dat[,"AGE"]
y <- dat[,"LVD"]
model <- glm(y ~ x, family = 'binomial')
model
## 
## Call:  glm(formula = y ~ x, family = "binomial")
## 
## Coefficients:
## (Intercept)            x  
##   -0.308954     0.001626  
## 
## Degrees of Freedom: 2111 Total (i.e. Null);  2110 Residual
##   (2888 observations deleted due to missingness)
## Null Deviance:       2907 
## Residual Deviance: 2906  AIC: 2910

Section 1: Introduction to Logistic Regression (3)

\[log(\frac{{p}}{1-p}) = -0.308954 + 0.001626 \times AGE\]

\[p = \frac{{1}}{1+e^{0.308954 - 0.001626 \times AGE}}\]

– If a person's AGE is 60, the probability that they have LVD is 0.4473474:

\[p = 0.4473474 = \frac{{1}}{1+e^{0.308954 - 0.001626 \times 60}}\]

– Applying the same formula, if a person's AGE is 80 the probability of LVD is 0.4554004, and if AGE is 100 it is 0.4634767:

\[p = 0.4554004 = \frac{{1}}{1+e^{0.308954 - 0.001626 \times 80}}\]

\[p = 0.4634767 = \frac{{1}}{1+e^{0.308954 - 0.001626 \times 100}}\]
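
– These hand calculations can be reproduced directly in R; a minimal sketch, assuming the model object fitted above is still available:

plogis(-0.308954 + 0.001626 * 60)    # approximately 0.4473474, as above
plogis(-0.308954 + 0.001626 * 80)    # approximately 0.4554004
plogis(-0.308954 + 0.001626 * 100)   # approximately 0.4634767

# the same probabilities can also be obtained from the fitted model object
predict(model, newdata = data.frame(x = c(60, 80, 100)), type = 'response')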

Section 1: Introduction to Logistic Regression (4)

– Next we need a loss function that measures the difference between the observed \(y\) and the predicted \(p\):

\[loss = diff(y, p)\]

– This loss function is not easy to specify directly, so let us start from the loss function of simple linear regression, the residual sum of squares, and rewrite it as:

\[loss = diff(y, p) = \frac{{1}}{2n}\sum \limits_{i=1}^{n} \left(y_{i} - p_{i}\right)^{2}\]

\[loss = \frac{{1}}{2n} \sum \limits_{i=1}^{n} \left(y_{i} - \frac{{1}}{1+e^{-b_{0} - b_{1}x_{i}}}\right)^{2}\]

Section 1: Introduction to Logistic Regression (5)

– Differentiating this loss is a little involved, so let us list a few differentiation formulas to help:

  1. The chain rule

\[\frac{\partial}{\partial x}h(x) = \frac{\partial}{\partial x}f(g(x)) = \frac{\partial}{\partial g(x)}f(g(x)) \cdot\frac{\partial}{\partial x}g(x)\]

  2. The quotient rule

\[\frac{\partial}{\partial x}\frac{{f(x)}}{g(x)} = \frac{{g(x) \cdot \frac{\partial}{\partial x} f(x)} - {f(x) \cdot \frac{\partial}{\partial x} g(x)}}{g(x)^2}\]

  3. Derivative of the exponential function

\[\frac{\partial}{\partial x} e^x = e^x\]

  4. Derivative of the sigmoid (S-shaped) function

\[ \begin{align} \frac{\partial}{\partial x}S(x) & = \frac{\partial}{\partial x}\frac{{1}}{1+e^{-x}} \\ & = \frac{\partial}{\partial (1+e^{-x})}\frac{{1}}{1+e^{-x}} \cdot \frac{\partial}{\partial x}(1+e^{-x}) \\ & = \frac{-1}{(1+e^{-x})^2} \cdot (-e^{-x}) \\ & = \frac{e^{-x}}{(1+e^{-x})^2} \\ & = \frac{1}{1+e^{-x}} \cdot (1 - \frac{1}{1+e^{-x}}) \\ & = S(x)(1-S(x)) \end{align} \]
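
– A quick numerical check of the last identity (a minimal sketch; the evaluation point is arbitrary):

S <- function(x) 1 / (1 + exp(-x))        # the sigmoid function
x0 <- 0.7                                 # arbitrary point for the check
eps <- 1e-6
(S(x0 + eps) - S(x0 - eps)) / (2 * eps)   # finite-difference derivative
S(x0) * (1 - S(x0))                       # analytic form S(x)(1 - S(x)): same value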

The partial derivative with respect to \(b_0\):

\[ \begin{align} \frac{\partial}{\partial b_{0}} loss & = \frac{\partial}{\partial p} diff(y, p) \cdot \frac{\partial}{\partial b_{0}} p \\ & = \frac{{1}}{2n}\sum \limits_{i=1}^{n} \frac{\partial}{\partial p_i} \left(y_{i} - p_{i}\right)^{2} \cdot \frac{\partial}{\partial lp_i} \frac{{1}}{1+e^{-lp_i}} \cdot \frac{\partial}{\partial b_{0}} lp_i \\ & = \frac{{1}}{n}\sum \limits_{i=1}^{n} \left(p_{i} - y_{i} \right) \cdot p_{i} \cdot (1 - p_{i}) \cdot \frac{\partial}{\partial b_{0}} (b_{0} + b_{1}x_i) \\ & = \frac{{1}}{n} \sum \limits_{i=1}^{n} \left(p_{i} - y_{i}\right) \cdot p_{i} \cdot (1 - p_{i}) \end{align} \]

The partial derivative with respect to \(b_1\) (derivation omitted):

\[ \begin{align} \frac{\partial}{\partial b_{1}} loss & = \frac{{1}}{n}\sum \limits_{i=1}^{n} \left(p_{i} - y_{i} \right) \cdot p_{i} \cdot (1 - p_{i}) \cdot \frac{\partial}{\partial b_{1}} (b_{0} + b_{1}x_i) \\ & = \frac{{1}}{n} \sum \limits_{i=1}^{n} \left(p_{i} - y_{i}\right) \cdot p_{i} \cdot (1 - p_{i}) \cdot x_i \end{align} \]

Exercise 1: Solving Logistic Regression with Gradient Descent

x <- 1:10 
y <- c(0, 0, 1, 0, 1, 0, 1, 0, 1, 1)

pred.fun <- function(b0, b1, x = x) {
  p = 1 / (1 + exp(- b0 - b1 * x))
  return(p)
}

loss.fun <- function(b0, b1, x = x, y = y) {
  p = pred.fun(b0 = b0, b1 = b1, x = x)
  loss = 1/(2*length(x)) * sum((y - p)^2)
  return(loss)
}

differential.fun.b0 <- function(b0, b1, x = x, y = y) {
  p = pred.fun(b0 = b0, b1 = b1, x = x)
  return(-sum((y - p)*p*(1-p))/length(x))
}

differential.fun.b1 <- function(b0, b1, x = x, y = y) {
  p = pred.fun(b0 = b0, b1 = b1, x = x)
  return(-sum((y - p)*p*(1-p)*x)/length(x))
}
model <- glm(y~x, family = 'binomial')
model
## 
## Call:  glm(formula = y ~ x, family = "binomial")
## 
## Coefficients:
## (Intercept)            x  
##     -1.9957       0.3629  
## 
## Degrees of Freedom: 9 Total (i.e. Null);  8 Residual
## Null Deviance:       13.86 
## Residual Deviance: 11.67     AIC: 15.67

Answer to Exercise 1

num.iteration <- 2000
lr <- 0.1
ans_b0 <- rep(0, num.iteration)
ans_b1 <- rep(0, num.iteration)

for (i in 2:num.iteration) {
  ans_b0[i+1] <- ans_b0[i] - lr * differential.fun.b0(b0 = ans_b0[i], b1 = ans_b1[i], x = x, y = y)
  ans_b1[i+1] <- ans_b1[i] - lr * differential.fun.b1(b0 = ans_b0[i], b1 = ans_b1[i], x = x, y = y)
}

print(tail(ans_b0, 1))

[1] -1.539271

print(tail(ans_b1, 1))

[1] 0.2869881

(Figure F01)
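
– After 2000 iterations the estimates are still far from the glm() results (-1.9957 and 0.3629), partly because the extra \(p_i(1-p_i)\) factor keeps the gradients small, and partly because the squared-error loss is not the likelihood-based objective that glm() optimizes. A minimal sketch (assuming the objects defined above are still in the workspace) for inspecting the loss along the optimization path:

loss_path <- mapply(loss.fun, b0 = ans_b0, b1 = ans_b1, MoreArgs = list(x = x, y = y))
plot(loss_path, type = 'l', xlab = 'iteration', ylab = 'squared-error loss')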

Section 2: A Proper Loss Function (1)

– Recall the gradients of the squared-error loss; note the extra \(p_{i} (1 - p_{i})\) factor, which makes the gradients very small whenever the predicted probabilities approach 0 or 1.

The partial derivative with respect to \(b_0\):

\[ \begin{align} \frac{\partial}{\partial b_{0}} loss & = \frac{{1}}{n} \sum \limits_{i=1}^{n} \left(p_{i} - y_{i}\right) \cdot p_{i} \cdot (1 - p_{i}) \end{align} \]

The partial derivative with respect to \(b_1\):

\[ \begin{align} \frac{\partial}{\partial b_{1}} loss & = \frac{{1}}{n} \sum \limits_{i=1}^{n} \left(p_{i} - y_{i}\right) \cdot p_{i} \cdot (1 - p_{i}) \cdot x_i \end{align} \]

Section 2: A Proper Loss Function (2)

– We must ask ourselves: what makes a good loss function? The spirit of maximum likelihood estimation seems to offer an answer.

– The probability mass function of the Bernoulli distribution is:

\[ Pr(p,y) = p^y(1-p)^{1-y} \]

– where the probability \(p\) is given by the logistic model:

\[p = \frac{{1}}{1+e^{-b_{0} - b_{1}x}}\]

– Maximum likelihood estimation chooses the parameters that maximize the joint likelihood over all \(n\) samples:

\[max(\prod \limits_{i=1}^{n}Pr(p_i,y_i))\]

– Taking the logarithm turns the product into a sum without changing the location of the maximum:

\[max(log(\prod \limits_{i=1}^{n}Pr(p_i,y_i))) = max(\sum \limits_{i=1}^{n}\left(y_{i} \cdot log(p_{i}) + (1-y_{i}) \cdot log(1-p_{i})\right))\]

– Maximizing this quantity is equivalent to minimizing the negative log-likelihood:

\[min(\sum \limits_{i=1}^{n} -\left(y_{i} \cdot log(p_{i}) + (1-y_{i}) \cdot log(1-p_{i})\right))\]
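
– In R this equivalence is easy to verify numerically; a minimal sketch with arbitrary toy values:

y_toy <- c(0, 1, 1, 0)                # hypothetical labels
p_toy <- c(0.2, 0.7, 0.9, 0.4)        # hypothetical predicted probabilities

-sum(dbinom(y_toy, size = 1, prob = p_toy, log = TRUE))    # negative Bernoulli log-likelihood
-sum(y_toy * log(p_toy) + (1 - y_toy) * log(1 - p_toy))    # the cross-entropy sum: same value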

Section 2: A Proper Loss Function (3)

– Averaging the negative log-likelihood over the \(n\) samples gives the cross-entropy loss:

\[loss = CE(y, p) = \frac{{1}}{n}\sum \limits_{i=1}^{n} -\left(y_{i} \cdot log(p_{i}) + (1-y_{i}) \cdot log(1-p_{i})\right)\]

– To differentiate it, two more formulas are useful:

  1. Derivative of the logarithm

\[\frac{\partial}{\partial x} log(x) = \frac{{1}}{x}\]

  2. Derivative of a single cross-entropy term

\[ \begin{align} \frac{\partial}{\partial p}CE(y, p) & = - ( \frac{\partial}{\partial p}y \cdot log(p) + \frac{\partial}{\partial p}(1-y) \cdot log(1-p) ) \\ & = - (\frac{y}{p} - \frac{1-y}{1-p} )\\ & = \frac{p-y}{p(1-p)} \end{align} \]
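
– Again, a quick numerical check (a minimal sketch; the values of y and p are arbitrary):

CE <- function(y, p) -(y * log(p) + (1 - y) * log(1 - p))   # a single cross-entropy term
y0 <- 1; p0 <- 0.3; eps <- 1e-6
(CE(y0, p0 + eps) - CE(y0, p0 - eps)) / (2 * eps)           # finite-difference derivative
(p0 - y0) / (p0 * (1 - p0))                                 # analytic form (p - y) / (p(1 - p))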

The partial derivative with respect to \(b_0\):

\[ \begin{align} \frac{\partial}{\partial b_{0}} loss & = \frac{\partial}{\partial p} CE(y, p) \cdot \frac{\partial}{\partial b_{0}} p \\ & = \frac{1}{n}\sum \limits_{i=1}^{n}\frac{\partial}{\partial p_i}CE(y_i, p_i) \cdot \frac{\partial}{\partial lp_i} \frac{{1}}{1+e^{-lp_i}} \cdot \frac{\partial}{\partial b_{0}} lp_i \\ & = \frac{1}{n}\sum \limits_{i=1}^{n}\frac{p_i - y_i}{p_i(1-p_i)} \cdot p_i(1-p_i) \cdot \frac{\partial}{\partial b_{0}} (b_{0} + b_{1}x_i) \\ & = \frac{1}{n}\sum \limits_{i=1}^{n} p_i - y_i \end{align} \]

The partial derivative with respect to \(b_1\) (derivation omitted):

\[ \begin{align} \frac{\partial}{\partial b_{1}} loss & = \frac{1}{n}\sum \limits_{i=1}^{n}\frac{p_i - y_i}{p_i(1-p_i)} \cdot p_i(1-p_i) \cdot \frac{\partial}{\partial b_{1}} (b_{0} + b_{1}x_i) \\ & = \frac{1}{n}\sum \limits_{i=1}^{n} (p_i - y_i)x_i \end{align} \]

Exercise 2: Logistic Regression with Cross-Entropy as the Loss Function

x <- 1:10 
y <- c(0, 0, 1, 0, 1, 0, 1, 0, 1, 1)

pred.fun <- function(b0, b1, x = x) {
  p = 1 / (1 + exp(- b0 - b1 * x))
  return(p)
}

loss.fun <- function(b0, b1, x = x, y = y) {
  p = pred.fun(b0 = b0, b1 = b1, x = x)
  loss = -1/length(x) * sum(y * log(p) + (1 - y) * log(1 - p))
  return(loss)
}

differential.fun.b0 <- function(b0, b1, x = x, y = y) {
  p = pred.fun(b0 = b0, b1 = b1, x = x)
  return(-sum(y - p)/length(x))
}

differential.fun.b1 <- function(b0, b1, x = x, y = y) {
  p = pred.fun(b0 = b0, b1 = b1, x = x)
  return(-sum((y - p)*x)/length(x))
}
model <- glm(y~x, family = 'binomial')
model
## 
## Call:  glm(formula = y ~ x, family = "binomial")
## 
## Coefficients:
## (Intercept)            x  
##     -1.9957       0.3629  
## 
## Degrees of Freedom: 9 Total (i.e. Null);  8 Residual
## Null Deviance:       13.86 
## Residual Deviance: 11.67     AIC: 15.67

Answer to Exercise 2

num.iteration <- 2000
lr <- 0.1
ans_b0 <- rep(0, num.iteration)
ans_b1 <- rep(0, num.iteration)

for (i in 2:num.iteration) {
  ans_b0[i+1] <- ans_b0[i] - lr * differential.fun.b0(b0 = ans_b0[i], b1 = ans_b1[i], x = x, y = y)
  ans_b1[i+1] <- ans_b1[i] - lr * differential.fun.b1(b0 = ans_b0[i], b1 = ans_b1[i], x = x, y = y)
}

print(tail(ans_b0, 1))

[1] -1.994507

print(tail(ans_b1, 1))

[1] 0.3626772

(Figure F02)

Section 3: The Data Science Experiment Workflow (1)

– In a data science experiment, we usually split the sample into three parts:

  1. Training set (development set): used to build the prediction model; the three main steps we learned earlier are applied here, and the samples may be adjusted freely

  2. Validation set (tuning set): does not take part in model training, but is used to guide the training and to derive key information (such as cut-points); the samples may also be adjusted freely

  3. Testing set (hold-out set; sometimes also called the validation set): the final model is run on it once, and only once, to establish the final accuracy; in principle the sampling must reflect the conditions of future use

– Sometimes there is more than one testing set, divided into internal and external testing sets.

– Sometimes the training set and the validation set can be merged, using techniques such as k-fold cross-validation.

– Some studies may not include a validation set (the whole model may be built on the training set alone). If you find that a study has only a single "testing set" and that set was used for any of these operations, be careful: the study may be overestimating the model's final accuracy.

Section 3: The Data Science Experiment Workflow (2)

  1. Feature engineering: we first perform some computations on the available variables, usually to create nonlinear terms

– For example, if the data originally contain only height and weight, prior knowledge lets us convert them into BMI; applications such as multiple imputation also belong here

  2. Variable selection: a study usually contains a very large number of variables, so we need a way to screen them

– One feasible approach is to build many models on the training set and then pick one model based on its performance on the validation set

  3. Sample modification: some studies filter the training samples (e.g., removing extreme values) or apply data augmentation

– Any operation on the training set is allowed, but it must not contaminate the testing set; for example, changing the prevalence in the testing set will necessarily inflate its positive predictive value (see the sketch after this list)

  4. Model training: define a training strategy (more strategies will be introduced later); you can train several models and then choose one based on its performance on the validation set

– Sometimes we train several different models on the same data; we will cover a series of machine learning models in later lessons

– This also includes choosing among hyperparameters, for example tuning the learning rate in gradient descent

  5. Performance evaluation: present the final accuracy according to the research question

– Apply the final model to the testing set and report metrics such as mean error and sensitivity. If a cut-point is needed, it must be chosen on the validation set and then applied to the testing set
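
– For instance, with sensitivity and specificity held fixed, the positive predictive value rises with prevalence (Bayes' theorem); a minimal sketch with hypothetical numbers:

sens_h <- 0.8                          # hypothetical sensitivity
spec_h <- 0.9                          # hypothetical specificity
prev_h <- c(0.1, 0.3, 0.5)             # hypothetical prevalences in the testing set

ppv_h <- (sens_h * prev_h) / (sens_h * prev_h + (1 - spec_h) * (1 - prev_h))
round(ppv_h, 3)                        # PPV increases as prevalence increases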

Section 3: The Data Science Experiment Workflow (3)

– First we define the research question: we want to predict "LVD" from "GENDER", "AGE", "Rate", "PR", "QRSd", "QT", "QTc", "Axes_P", "Axes_QRS", and "Axes_T".

dat <- read.csv("ECG_train.csv", header = TRUE, fileEncoding = 'CP950', stringsAsFactors = FALSE, na.strings = "")
used_dat <- dat[!dat[,'LVD'] %in% NA, c('LVD', 'GENDER', 'AGE', 'Rate', 'PR', 'QRSd', 'QT', 'QTc', 'Axes_P', 'Axes_QRS', 'Axes_T')]

– Next, because the independent variables in this dataset contain many missing values, we first perform multiple imputation:

library(mice)

used_dat[,'GENDER'] <- as.factor(used_dat[,'GENDER'])

used_dat.x <- used_dat[,c('GENDER', 'AGE', 'Rate', 'PR', 'QRSd', 'QT', 'QTc', 'Axes_P', 'Axes_QRS', 'Axes_T')]
used_dat.y <- used_dat[,'LVD', drop = FALSE]

mice_dat <- mice(used_dat.x, m = 1, maxit = 10, meth = 'cart', seed = 123, printFlag = FALSE)
impute_dat.x <- mice::complete(mice_dat, action = 1)

impute_dat <- cbind(used_dat.y, impute_dat.x)

– We then use simple random sampling to form the training, validation, and testing sets:

set.seed(0)

all_idx <- 1:nrow(used_dat)

train_idx <- sample(all_idx, nrow(used_dat) * 0.6)
valid_idx <- sample(all_idx[!all_idx %in% train_idx], nrow(used_dat) * 0.2)
test_idx <- all_idx[!all_idx %in% c(train_idx, valid_idx)]

train_dat <- impute_dat[train_idx,]
valid_dat <- impute_dat[valid_idx,]
test_dat <- impute_dat[test_idx,]
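
– A quick sanity check (a minimal sketch): the three index sets should be disjoint and should together cover every row:

length(intersect(train_idx, valid_idx))                          # expected: 0
length(intersect(c(train_idx, valid_idx), test_idx))             # expected: 0
length(c(train_idx, valid_idx, test_idx)) == nrow(impute_dat)    # expected: TRUE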

Section 3: The Data Science Experiment Workflow (4)

– You must not run this screening on the full sample, because that would let information from the testing set leak into your model training (here we keep GENDER, Rate, QRSd, QTc, and Axes_T, each with p < 0.001):

summary(glm(LVD ~ GENDER, data = train_dat, family = 'binomial'))
## 
## Call:
## glm(formula = LVD ~ GENDER, family = "binomial", data = train_dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1834  -1.1834  -0.9319   1.1714   1.4446  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.6092     0.1018  -5.985 2.17e-09 ***
## GENDERmale    0.6234     0.1229   5.073 3.91e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1745.2  on 1266  degrees of freedom
## Residual deviance: 1718.9  on 1265  degrees of freedom
## AIC: 1722.9
## 
## Number of Fisher Scoring iterations: 4
summary(glm(LVD ~ AGE, data = train_dat, family = 'binomial'))
## 
## Call:
## glm(formula = LVD ~ AGE, family = "binomial", data = train_dat)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.184  -1.104  -1.037   1.252   1.366  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.561182   0.230433  -2.435   0.0149 *
## AGE          0.005634   0.003372   1.671   0.0948 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1745.2  on 1266  degrees of freedom
## Residual deviance: 1742.4  on 1265  degrees of freedom
## AIC: 1746.4
## 
## Number of Fisher Scoring iterations: 3
summary(glm(LVD ~ Rate, data = train_dat, family = 'binomial'))
## 
## Call:
## glm(formula = LVD ~ Rate, family = "binomial", data = train_dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0309  -1.0428  -0.8116   1.1847   1.7329  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.315867   0.245747  -9.424   <2e-16 ***
## Rate         0.023699   0.002659   8.912   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1745.2  on 1266  degrees of freedom
## Residual deviance: 1656.9  on 1265  degrees of freedom
## AIC: 1660.9
## 
## Number of Fisher Scoring iterations: 4
summary(glm(LVD ~ PR, data = train_dat, family = 'binomial'))
## 
## Call:
## glm(formula = LVD ~ PR, family = "binomial", data = train_dat)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.103  -1.099  -1.097   1.258   1.269  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.739e-01  2.270e-01  -0.766    0.443
## PR          -8.813e-05  1.339e-03  -0.066    0.948
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1745.2  on 1266  degrees of freedom
## Residual deviance: 1745.2  on 1265  degrees of freedom
## AIC: 1749.2
## 
## Number of Fisher Scoring iterations: 3
summary(glm(LVD ~ QRSd, data = train_dat, family = 'binomial'))
## 
## Call:
## glm(formula = LVD ~ QRSd, family = "binomial", data = train_dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8633  -1.0343  -0.9169   1.2574   1.6071  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.074485   0.262346  -7.907 2.63e-15 ***
## QRSd         0.017816   0.002433   7.324 2.41e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1745.2  on 1266  degrees of freedom
## Residual deviance: 1684.3  on 1265  degrees of freedom
## AIC: 1688.3
## 
## Number of Fisher Scoring iterations: 4
summary(glm(LVD ~ QT, data = train_dat, family = 'binomial'))
## 
## Call:
## glm(formula = LVD ~ QT, family = "binomial", data = train_dat)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.201  -1.104  -1.052   1.249   1.433  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.394874   0.392452   1.006    0.314
## QT          -0.001478   0.000985  -1.501    0.133
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1745.2  on 1266  degrees of freedom
## Residual deviance: 1743.0  on 1265  degrees of freedom
## AIC: 1747
## 
## Number of Fisher Scoring iterations: 3
summary(glm(LVD ~ QTc, data = train_dat, family = 'binomial'))
## 
## Call:
## glm(formula = LVD ~ QTc, family = "binomial", data = train_dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2281  -1.0332  -0.8095   1.1722   3.3208  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.509856   0.606206  -9.089   <2e-16 ***
## QTc          0.011293   0.001281   8.816   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1745.2  on 1266  degrees of freedom
## Residual deviance: 1654.4  on 1265  degrees of freedom
## AIC: 1658.4
## 
## Number of Fisher Scoring iterations: 4
summary(glm(LVD ~ Axes_P, data = train_dat, family = 'binomial'))
## 
## Call:
## glm(formula = LVD ~ Axes_P, family = "binomial", data = train_dat)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.395  -1.106  -1.027   1.247   1.472  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.365537   0.087012  -4.201 2.66e-05 ***
## Axes_P       0.003585   0.001334   2.686  0.00722 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1745.2  on 1266  degrees of freedom
## Residual deviance: 1737.9  on 1265  degrees of freedom
## AIC: 1741.9
## 
## Number of Fisher Scoring iterations: 4
summary(glm(LVD ~ Axes_QRS, data = train_dat, family = 'binomial'))
## 
## Call:
## glm(formula = LVD ~ Axes_QRS, family = "binomial", data = train_dat)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.135  -1.099  -1.082   1.256   1.329  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -0.1675835  0.0637138  -2.630  0.00853 **
## Axes_QRS    -0.0006913  0.0009849  -0.702  0.48278   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1745.2  on 1266  degrees of freedom
## Residual deviance: 1744.7  on 1265  degrees of freedom
## AIC: 1748.7
## 
## Number of Fisher Scoring iterations: 3
summary(glm(LVD ~ Axes_T, data = train_dat, family = 'binomial'))
## 
## Call:
## glm(formula = LVD ~ Axes_T, family = "binomial", data = train_dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7110  -1.0429  -0.8212   1.1588   1.7670  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.686879   0.080496  -8.533   <2e-16 ***
## Axes_T       0.007096   0.000784   9.051   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1745.2  on 1266  degrees of freedom
## Residual deviance: 1654.4  on 1265  degrees of freedom
## AIC: 1658.4
## 
## Number of Fisher Scoring iterations: 4
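
– The ten calls above can also be written as a loop that collects each univariable p-value into one table; a minimal sketch, assuming the same train_dat:

candidate_vars <- c('GENDER', 'AGE', 'Rate', 'PR', 'QRSd', 'QT', 'QTc', 'Axes_P', 'Axes_QRS', 'Axes_T')

uni_p <- sapply(candidate_vars, function(v) {
  m <- glm(as.formula(paste('LVD ~', v)), data = train_dat, family = 'binomial')
  summary(m)$coefficients[2, 4]        # p-value of the single predictor
})

sort(uni_p)                            # variables with p < 0.001 are kept for the final model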

Section 3: The Data Science Experiment Workflow (5)

– We now fit the final model on the training set and use the validation set to choose the cut-point (the threshold that maximizes sensitivity + specificity):

final_model <- glm(LVD ~ GENDER + Rate + QRSd + QTc + Axes_T, data = train_dat, family = 'binomial')
valid_dat[,'pred'] <- predict(final_model, valid_dat)
test_dat[,'pred'] <- predict(final_model, test_dat)
library(pROC)

roc_valid <- roc(LVD ~ pred, data = valid_dat)
best_pos <- which.max(roc_valid$sensitivities + roc_valid$specificities)
best_cut <- roc_valid$thresholds[best_pos]
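
– The same cut-point can also be obtained with pROC's coords() helper; a minimal sketch ('youden' maximizes sensitivity + specificity, matching the manual search above):

coords(roc_valid, x = 'best', best.method = 'youden', transpose = FALSE)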

Section 3: The Data Science Experiment Workflow (6)

– Finally, the chosen cut-point is applied to the testing set, and the sensitivity, specificity, and AUC are reported:

tab_test <- table(test_dat$pred >= best_cut, test_dat$LVD)
sens <- tab_test[2,2] / sum(tab_test[,2])
spec <- tab_test[1,1] / sum(tab_test[,1])

roc_test <- roc(LVD ~ pred, data = test_dat)
plot(roc_test)

points(spec, sens, pch = 19)
text(0.5, 0.5, paste0('Sens = ', formatC(sens, digits = 3, format = 'f'),
                      '\nSpec = ', formatC(spec, digits = 3, format = 'f'),
                      '\nAUC = ', formatC(roc_test$auc, digits = 3, format = 'f')), col = 'red')
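
– Other cell-based metrics can be read off the same confusion table (a minimal sketch; it assumes both predicted classes occur, and which metrics to report depends on the research question):

acc <- sum(diag(tab_test)) / sum(tab_test)     # overall accuracy
ppv <- tab_test[2, 2] / sum(tab_test[2, ])     # positive predictive value
npv <- tab_test[1, 1] / sum(tab_test[1, ])     # negative predictive value
c(accuracy = acc, PPV = ppv, NPV = npv)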

Lesson Summary

– For binary classification tasks, report the ROC curve together with sensitivity and specificity; sometimes the positive predictive value is reported as well.

– For continuous (regression) predictions, consider using the mean absolute error and the correlation coefficient as evaluation metrics.
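
– For example, given observed values and model predictions, both metrics are one line each in R (a minimal sketch with hypothetical vectors):

y_obs <- c(2.1, 3.4, 5.0, 4.2)     # hypothetical observed outcomes
y_hat <- c(2.5, 3.0, 4.6, 4.8)     # hypothetical model predictions

mean(abs(y_obs - y_hat))           # mean absolute error (MAE)
cor(y_obs, y_hat)                  # (Pearson) correlation coefficient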