林嶔 (Lin, Chin)
Lesson 7: Foundations of Artificial Intelligence 4 (Logistic Regression and Data Science Research Design)
Propose a prediction function
Propose a loss function
Optimize with gradient descent (this can also be called model training)
– The prediction equation of logistic regression:
\[lp_i= log(\frac{{p_i}}{1-p_i}) = b_{0} + b_{1}x_i\]
– where \(p_i\) is the predicted probability that sample \(i\) is positive. Let's first rewrite this equation into its standard form:
\[p_i = \frac{{1}}{1+e^{-lp_i}} = \frac{{1}}{1+e^{-b_{0} - b_{1}x_i}}\]
– Please download the example data here
dat <- read.csv("ECG_train.csv", header = TRUE, fileEncoding = 'CP950', stringsAsFactors = FALSE, na.strings = "")
– The most common types of Y are:
Continuous variable - the errors are assumed to be normally distributed (this is linear regression)
Binary variable - the probability is assumed to follow a logistic distribution (this is logistic regression; a typical glm() call is sketched below)
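– The compact output below is what you get by printing a fitted glm object. A minimal sketch of such a call (the assignments of y and x are assumptions here, taking y as LVD and x as AGE from the data loaded above, as implied by the equation further down; glm() silently drops rows with missing values):
y <- dat[,'LVD']
x <- dat[,'AGE']
fit <- glm(y ~ x, family = 'binomial')   # logistic regression: binomial family with the default logit link
fit                                      # printing the model gives the compact summary shown below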
##
## Call: glm(formula = y ~ x, family = "binomial")
##
## Coefficients:
## (Intercept) x
## -0.308954 0.001626
##
## Degrees of Freedom: 2111 Total (i.e. Null); 2110 Residual
## (2888 observations deleted due to missingness)
## Null Deviance: 2907
## Residual Deviance: 2906 AIC: 2910
\[log(\frac{{p}}{1-p}) = -0.308954 + 0.001626 \times AGE\]
\[p = \frac{{1}}{1+e^{0.308954 - 0.001626 \times AGE}}\] - If a person's AGE is 60, their predicted probability of LVD is 0.4473474:
\[p = 0.4473474 = \frac{{1}}{1+e^{0.308954 - 0.001626 \times 60}}\] – Applying the same equation, if a person's AGE is 80 the probability of LVD is 0.4554004, and if AGE is 100 it is 0.4634767:
\[p = 0.4554004 = \frac{{1}}{1+e^{0.308954 - 0.001626 \times 80}}\]
\[p = 0.4634767 = \frac{{1}}{1+e^{0.308954 - 0.001626 \times 100}}\]
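– These worked examples are easy to verify in R (a quick sketch that plugs the fitted coefficients into the standard form):
age <- c(60, 80, 100)
1 / (1 + exp(0.308954 - 0.001626 * age))   # should reproduce 0.4473474, 0.4554004, 0.4634767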
\[loss = diff(y, p)\]
– This loss function is not easy to specify directly, so let's first use the loss function of simple linear regression as an example: the quantity to be minimized is the sum of squared residuals, and the expression can be rewritten as:
\[loss = diff(y, p) = \frac{{1}}{2n}\sum \limits_{i=1}^{n} \left(y_{i} - p_{i}\right)^{2}\]
\[loss = \frac{{1}}{2n} \sum \limits_{i=1}^{n} \left(y_{i} - \frac{{1}}{1+e^{-b_{0} - b_{1}x_{i}}}\right)^{2}\]
– This derivation is a bit involved, so let's first list a few differentiation rules to help:
\[\frac{\partial}{\partial x}h(x) = \frac{\partial}{\partial x}f(g(x)) = \frac{\partial}{\partial g(x)}f(g(x)) \cdot\frac{\partial}{\partial x}g(x)\]
\[\frac{\partial}{\partial x}\frac{{f(x)}}{g(x)} = \frac{{g(x) \cdot \frac{\partial}{\partial x} f(x)} - {f(x) \cdot \frac{\partial}{\partial x} g(x)}}{g(x)^2}\]
\[\frac{\partial}{\partial x} e^x = e^x\]
\[ \begin{align} \frac{\partial}{\partial x}S(x) & = \frac{\partial}{\partial x}\frac{{1}}{1+e^{-x}} \\ & = \frac{\partial}{\partial (1+e^{-x})}\frac{{1}}{1+e^{-x}} \cdot \frac{\partial}{\partial x}(1+e^{-x}) \\ & = \frac{-1}{(1+e^{-x})^2} \cdot (-e^{-x}) \\ & = \frac{e^{-x}}{(1+e^{-x})^2} \\ & = \frac{1}{1+e^{-x}} \cdot (1 - \frac{1}{1+e^{-x}}) \\ & = S(x)(1-S(x)) \end{align} \]
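– A quick numerical check of this last identity (a sketch; the evaluation point 0.7 and step size 1e-6 are arbitrary):
S <- function(x) {1 / (1 + exp(-x))}       # the sigmoid function S(x)
(S(0.7 + 1e-6) - S(0.7 - 1e-6)) / 2e-6     # finite-difference approximation of the derivative at 0.7
S(0.7) * (1 - S(0.7))                      # closed form S(x)(1 - S(x)); the two values should agree closely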
– Partial derivative with respect to \(b_0\):
\[ \begin{align} \frac{\partial}{\partial b_{0}} loss & = \frac{\partial}{\partial p} diff(y, p) \cdot \frac{\partial}{\partial b_{0}} p \\ & = \frac{{1}}{2n}\sum \limits_{i=1}^{n} \frac{\partial}{\partial p_i} \left(y_{i} - p_{i}\right)^{2} \cdot \frac{\partial}{\partial lp_i} \frac{{1}}{1+e^{-lp_i}} \cdot \frac{\partial}{\partial b_{0}} lp_i \\ & = \frac{{1}}{n}\sum \limits_{i=1}^{n} \left(p_{i} - y_{i} \right) \cdot p_{i} \cdot (1 - p_{i}) \cdot \frac{\partial}{\partial b_{0}} (b_{0} + b_{1}x_i) \\ & = \frac{{1}}{n} \sum \limits_{i=1}^{n} \left(p_{i} - y_{i}\right) \cdot p_{i} \cdot (1 - p_{i}) \end{align} \]
– Partial derivative with respect to \(b_1\) (derivation omitted):
\[ \begin{align} \frac{\partial}{\partial b_{1}} loss & = \frac{{1}}{n}\sum \limits_{i=1}^{n} \left(p_{i} - y_{i} \right) \cdot p_{i} \cdot (1 - p_{i}) \cdot \frac{\partial}{\partial b_{1}} (b_{0} + b_{1}x_i) \\ & = \frac{{1}}{n} \sum \limits_{i=1}^{n} \left(p_{i} - y_{i}\right) \cdot p_{i} \cdot (1 - p_{i}) \cdot x_i \end{align} \]
x <- 1:10
y <- c(0, 0, 1, 0, 1, 0, 1, 0, 1, 1)
pred.fun <- function(b0, b1, x = x) {   # prediction function: sigmoid of the linear predictor
  p = 1 / (1 + exp(- b0 - b1 * x))
  return(p)
}
loss.fun <- function(b0, b1, x = x, y = y) {   # loss function: half the mean squared error
  p = pred.fun(b0 = b0, b1 = b1, x = x)
  loss = 1/(2*length(x)) * sum((y - p)^2)
  return(loss)
}
differential.fun.b0 <- function(b0, b1, x = x, y = y) {   # partial derivative of the loss with respect to b0
  p = pred.fun(b0 = b0, b1 = b1, x = x)
  return(-sum((y - p)*p*(1-p))/length(x))
}
differential.fun.b1 <- function(b0, b1, x = x, y = y) {   # partial derivative of the loss with respect to b1
  p = pred.fun(b0 = b0, b1 = b1, x = x)
  return(-sum((y - p)*p*(1-p)*x)/length(x))
}
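– Before running gradient descent, the analytic gradients can be sanity-checked against finite differences of loss.fun (a sketch; the test values b0 = 0.5 and b1 = -0.2 are arbitrary):
(loss.fun(0.5 + 1e-6, -0.2, x = x, y = y) - loss.fun(0.5 - 1e-6, -0.2, x = x, y = y)) / 2e-6
differential.fun.b0(0.5, -0.2, x = x, y = y)   # should closely match the finite-difference value above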
##
## Call: glm(formula = y ~ x, family = "binomial")
##
## Coefficients:
## (Intercept) x
## -1.9957 0.3629
##
## Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
## Null Deviance: 13.86
## Residual Deviance: 11.67 AIC: 15.67
num.iteration <- 2000
lr <- 0.1
ans_b0 <- rep(0, num.iteration)
ans_b1 <- rep(0, num.iteration)
for (i in 2:num.iteration) {
  ans_b0[i] <- ans_b0[i-1] - lr * differential.fun.b0(b0 = ans_b0[i-1], b1 = ans_b1[i-1], x = x, y = y)
  ans_b1[i] <- ans_b1[i-1] - lr * differential.fun.b1(b0 = ans_b0[i-1], b1 = ans_b1[i-1], x = x, y = y)
}
print(tail(ans_b0, 1))
## [1] -1.539271
print(tail(ans_b1, 1))
## [1] 0.2869881
– Note that these estimates differ noticeably from the maximum-likelihood estimates reported by glm() above (-1.9957 and 0.3629): the squared-error loss is not the natural loss for a binary outcome, which motivates the cross-entropy loss introduced next.
– Partial derivative with respect to \(b_0\):
\[ \begin{align} \frac{\partial}{\partial b_{0}} loss & = \frac{{1}}{n} \sum \limits_{i=1}^{n} \left(p_{i} - y_{i}\right) \cdot p_{i} \cdot (1 - p_{i}) \end{align} \]
– Partial derivative with respect to \(b_1\):
\[ \begin{align} \frac{\partial}{\partial b_{1}} loss & = \frac{{1}}{n} \sum \limits_{i=1}^{n} \left(p_{i} - y_{i}\right) \cdot p_{i} \cdot (1 - p_{i}) \cdot x_i \end{align} \]
– What we need to think about is: what counts as a good loss function? The spirit of maximum likelihood estimation seems to give us the answer.
– The probability mass function of the Bernoulli distribution is:
\[ Pr(p,y) = p^y(1-p)^{1-y} \]
\[p = \frac{{1}}{1+e^{-b_{0} - b_{1}x}}\]
\[max(\prod \limits_{i=1}^{n}Pr(p_i,y_i))\]
\[max(log(\prod \limits_{i=1}^{n}Pr(p_i,y_i))) = max(\sum \limits_{i=1}^{n}\left(y_{i} \cdot log(p_{i}) + (1-y_{i}) \cdot log(1-p_{i})\right))\]
\[min(\sum \limits_{i=1}^{n} -\left(y_{i} \cdot log(p_{i}) + (1-y_{i}) \cdot log(1-p_{i})\right))\]
\[loss = CE(y, p) = \frac{{1}}{n}\sum \limits_{i=1}^{n} -\left(y_{i} \cdot log(p_{i}) + (1-y_{i}) \cdot log(1-p_{i})\right)\]
The key property of this function is that when \(y=1\) we want \(p\) to be as close to 1 as possible, and when \(y=0\) we want \(p\) to be as close to 0 as possible, which is exactly our goal.
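A small numerical illustration of this property (sketch):
p <- c(0.1, 0.5, 0.9)
-log(p)       # per-sample loss when y = 1: large when p is far from 1, near 0 when p is close to 1
-log(1 - p)   # per-sample loss when y = 0: the mirror image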
Now let's combine cross-entropy with logistic regression and re-derive the partial derivatives with respect to \(b_0\) and \(b_1\):
\[\frac{\partial}{\partial x} log(x) = \frac{{1}}{x}\]
\[ \begin{align} \frac{\partial}{\partial p}CE(y, p) & = - ( \frac{\partial}{\partial p}y \cdot log(p) + \frac{\partial}{\partial p}(1-y) \cdot log(1-p) ) \\ & = - (\frac{y}{p} - \frac{1-y}{1-p} )\\ & = \frac{p-y}{p(1-p)} \end{align} \]
– Partial derivative with respect to \(b_0\):
\[ \begin{align} \frac{\partial}{\partial b_{0}} loss & = \frac{\partial}{\partial p} CE(y, p) \cdot \frac{\partial}{\partial b_{0}} p \\ & = \frac{1}{n}\sum \limits_{i=1}^{n}\frac{\partial}{\partial p_i}CE(y_i, p_i) \cdot \frac{\partial}{\partial lp_i} \frac{{1}}{1+e^{-lp_i}} \cdot \frac{\partial}{\partial b_{0}} lp_i \\ & = \frac{1}{n}\sum \limits_{i=1}^{n}\frac{p_i - y_i}{p_i(1-p_i)} \cdot p_i(1-p_i) \cdot \frac{\partial}{\partial b_{0}} (b_{0} + b_{1}x_i) \\ & = \frac{1}{n}\sum \limits_{i=1}^{n} p_i - y_i \end{align} \]
– Partial derivative with respect to \(b_1\) (derivation omitted):
\[ \begin{align} \frac{\partial}{\partial b_{1}} loss & = \frac{1}{n}\sum \limits_{i=1}^{n}\frac{p_i - y_i}{p_i(1-p_i)} \cdot p_i(1-p_i) \cdot \frac{\partial}{\partial b_{1}} (b_{0} + b_{1}x_i) \\ & = \frac{1}{n}\sum \limits_{i=1}^{n} (p_i - y_i)x_i \end{align} \]
x <- 1:10
y <- c(0, 0, 1, 0, 1, 0, 1, 0, 1, 1)
pred.fun <- function(b0, b1, x = x) {   # prediction function: sigmoid of the linear predictor
  p = 1 / (1 + exp(- b0 - b1 * x))
  return(p)
}
loss.fun <- function(b0, b1, x = x, y = y) {   # loss function: cross-entropy (negative log-likelihood / n)
  p = pred.fun(b0 = b0, b1 = b1, x = x)
  loss = -1/length(x) * sum(y * log(p) + (1 - y) * log(1 - p))
  return(loss)
}
differential.fun.b0 <- function(b0, b1, x = x, y = y) {   # partial derivative of the cross-entropy with respect to b0
  p = pred.fun(b0 = b0, b1 = b1, x = x)
  return(-sum(y - p)/length(x))
}
differential.fun.b1 <- function(b0, b1, x = x, y = y) {   # partial derivative of the cross-entropy with respect to b1
  p = pred.fun(b0 = b0, b1 = b1, x = x)
  return(-sum((y - p)*x)/length(x))
}
##
## Call: glm(formula = y ~ x, family = "binomial")
##
## Coefficients:
## (Intercept) x
## -1.9957 0.3629
##
## Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
## Null Deviance: 13.86
## Residual Deviance: 11.67 AIC: 15.67
num.iteration <- 2000
lr <- 0.1
ans_b0 <- rep(0, num.iteration)
ans_b1 <- rep(0, num.iteration)
for (i in 2:num.iteration) {
  ans_b0[i] <- ans_b0[i-1] - lr * differential.fun.b0(b0 = ans_b0[i-1], b1 = ans_b1[i-1], x = x, y = y)
  ans_b1[i] <- ans_b1[i-1] - lr * differential.fun.b1(b0 = ans_b0[i-1], b1 = ans_b1[i-1], x = x, y = y)
}
print(tail(ans_b0, 1))
## [1] -1.994507
print(tail(ans_b1, 1))
## [1] 0.3626772
– These now agree closely with the glm() estimates above (-1.9957 and 0.3629), as expected, since minimizing the cross-entropy is equivalent to maximizing the likelihood.
– In a data science study, we usually split the sample into three parts:
Training set (development set): used to build the prediction model; the three-step procedure we learned earlier is applied here, and the samples may be adjusted freely
Validation set (tuning set): does not take part in model training directly, but is used to guide training and to construct other needed information; the samples may be adjusted freely
Testing set (hold-out set; in some papers also called the validation set): the final model is run on it once (and only once) to establish the final accuracy; in principle, its samples must reflect the conditions of future use
– Sometimes there is more than one testing set, divided into internal and external testing sets.
– Sometimes the training set and the validation set are merged, using techniques such as k-fold cross-validation
– Some studies may not include a validation set (the model may be built entirely on the training set). If you find that a study has only a single testing set and also uses that set for tuning operations, be careful: such a study may overestimate the model's final accuracy.
– For example, if the raw data contain only height and weight, prior knowledge lets us convert them into BMI; applications such as multiple imputation also belong to this step
– A feasible approach is to build many models on the training set and then select one based on its performance on the validation set
– Any operation on the training set is allowed, but it must not contaminate the testing set; for example, artificially raising the prevalence in the testing set will necessarily inflate its positive predictive value
– Sometimes we train several different models on the same data; a series of machine learning models will be covered later in this course
– This also includes the choice of hyperparameters, for example tuning the learning rate in gradient descent
– Apply the finally selected model to the testing set and report quantities such as the mean error and sensitivity. If a cut-off point is needed, it must be determined on the validation set and then applied to the testing set
– First, let's fix the research question: we want to predict LVD from GENDER, AGE, Rate, PR, QRSd, QT, QTc, Axes_P, Axes_QRS, and Axes_T
dat <- read.csv("ECG_train.csv", header = TRUE, fileEncoding = 'CP950', stringsAsFactors = FALSE, na.strings = "")
used_dat <- dat[!dat[,'LVD'] %in% NA, c('LVD', 'GENDER', 'AGE', 'Rate', 'PR', 'QRSd', 'QT', 'QTc', 'Axes_P', 'Axes_QRS', 'Axes_T')]
– Next, since the predictors in this dataset contain many missing values, we first perform multiple imputation:
library(mice)
used_dat[,'GENDER'] <- as.factor(used_dat[,'GENDER'])
used_dat.x <- used_dat[,c('GENDER', 'AGE', 'Rate', 'PR', 'QRSd', 'QT', 'QTc', 'Axes_P', 'Axes_QRS', 'Axes_T')]
used_dat.y <- used_dat[,'LVD', drop = FALSE]
mice_dat <- mice(used_dat.x, m = 1, maxit = 10, meth = 'cart', seed = 123, printFlag = FALSE)
impute_dat.x <- mice::complete(mice_dat, action = 1)   # extract the (single) completed dataset
impute_dat <- cbind(used_dat.y, impute_dat.x)
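– A quick way to confirm that the imputation filled in all gaps is to compare missing-value counts before and after (sketch):
colSums(is.na(used_dat.x))     # number of missing values per predictor before imputation
colSums(is.na(impute_dat.x))   # should be all zeros after imputation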
– First, randomly assign observations to the training, validation, and testing sets
set.seed(0)
all_idx <- 1:nrow(used_dat)
train_idx <- sample(all_idx, nrow(used_dat) * 0.6)
valid_idx <- sample(all_idx[!all_idx %in% train_idx], nrow(used_dat) * 0.2)
test_idx <- all_idx[!all_idx %in% c(train_idx, valid_idx)]
train_dat <- impute_dat[train_idx,]
valid_dat <- impute_dat[valid_idx,]
test_dat <- impute_dat[test_idx,]
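– A quick sanity check that the three subsets are disjoint and follow the intended 60/20/20 split (sketch):
c(train = nrow(train_dat), valid = nrow(valid_dat), test = nrow(test_dat))   # subset sizes
length(intersect(train_idx, valid_idx))                                      # should be 0 (no overlap)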
– You must not use the full sample for this screening, because that would leak information from the testing set into your model training (we select GENDER, Rate, QRSd, QTc, and Axes_T, each with p < 0.001).
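– The univariable outputs below were presumably produced by fitting one logistic regression per candidate predictor on train_dat, roughly like this sketch (the loop and the candidate_vars object are assumptions, not the original code):
candidate_vars <- c('GENDER', 'AGE', 'Rate', 'PR', 'QRSd', 'QT', 'QTc', 'Axes_P', 'Axes_QRS', 'Axes_T')
for (v in candidate_vars) {
  uni_model <- glm(as.formula(paste('LVD ~', v)), data = train_dat, family = 'binomial')
  print(summary(uni_model))   # keep predictors whose p-value meets the pre-specified threshold
}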
##
## Call:
## glm(formula = LVD ~ GENDER, family = "binomial", data = train_dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1834 -1.1834 -0.9319 1.1714 1.4446
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.6092 0.1018 -5.985 2.17e-09 ***
## GENDERmale 0.6234 0.1229 5.073 3.91e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1745.2 on 1266 degrees of freedom
## Residual deviance: 1718.9 on 1265 degrees of freedom
## AIC: 1722.9
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = LVD ~ AGE, family = "binomial", data = train_dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.184 -1.104 -1.037 1.252 1.366
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.561182 0.230433 -2.435 0.0149 *
## AGE 0.005634 0.003372 1.671 0.0948 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1745.2 on 1266 degrees of freedom
## Residual deviance: 1742.4 on 1265 degrees of freedom
## AIC: 1746.4
##
## Number of Fisher Scoring iterations: 3
##
## Call:
## glm(formula = LVD ~ Rate, family = "binomial", data = train_dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0309 -1.0428 -0.8116 1.1847 1.7329
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.315867 0.245747 -9.424 <2e-16 ***
## Rate 0.023699 0.002659 8.912 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1745.2 on 1266 degrees of freedom
## Residual deviance: 1656.9 on 1265 degrees of freedom
## AIC: 1660.9
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = LVD ~ PR, family = "binomial", data = train_dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.103 -1.099 -1.097 1.258 1.269
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.739e-01 2.270e-01 -0.766 0.443
## PR -8.813e-05 1.339e-03 -0.066 0.948
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1745.2 on 1266 degrees of freedom
## Residual deviance: 1745.2 on 1265 degrees of freedom
## AIC: 1749.2
##
## Number of Fisher Scoring iterations: 3
##
## Call:
## glm(formula = LVD ~ QRSd, family = "binomial", data = train_dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8633 -1.0343 -0.9169 1.2574 1.6071
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.074485 0.262346 -7.907 2.63e-15 ***
## QRSd 0.017816 0.002433 7.324 2.41e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1745.2 on 1266 degrees of freedom
## Residual deviance: 1684.3 on 1265 degrees of freedom
## AIC: 1688.3
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = LVD ~ QT, family = "binomial", data = train_dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.201 -1.104 -1.052 1.249 1.433
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.394874 0.392452 1.006 0.314
## QT -0.001478 0.000985 -1.501 0.133
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1745.2 on 1266 degrees of freedom
## Residual deviance: 1743.0 on 1265 degrees of freedom
## AIC: 1747
##
## Number of Fisher Scoring iterations: 3
##
## Call:
## glm(formula = LVD ~ QTc, family = "binomial", data = train_dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2281 -1.0332 -0.8095 1.1722 3.3208
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.509856 0.606206 -9.089 <2e-16 ***
## QTc 0.011293 0.001281 8.816 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1745.2 on 1266 degrees of freedom
## Residual deviance: 1654.4 on 1265 degrees of freedom
## AIC: 1658.4
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = LVD ~ Axes_P, family = "binomial", data = train_dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.395 -1.106 -1.027 1.247 1.472
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.365537 0.087012 -4.201 2.66e-05 ***
## Axes_P 0.003585 0.001334 2.686 0.00722 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1745.2 on 1266 degrees of freedom
## Residual deviance: 1737.9 on 1265 degrees of freedom
## AIC: 1741.9
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = LVD ~ Axes_QRS, family = "binomial", data = train_dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.135 -1.099 -1.082 1.256 1.329
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.1675835 0.0637138 -2.630 0.00853 **
## Axes_QRS -0.0006913 0.0009849 -0.702 0.48278
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1745.2 on 1266 degrees of freedom
## Residual deviance: 1744.7 on 1265 degrees of freedom
## AIC: 1748.7
##
## Number of Fisher Scoring iterations: 3
##
## Call:
## glm(formula = LVD ~ Axes_T, family = "binomial", data = train_dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7110 -1.0429 -0.8212 1.1588 1.7670
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.686879 0.080496 -8.533 <2e-16 ***
## Axes_T 0.007096 0.000784 9.051 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1745.2 on 1266 degrees of freedom
## Residual deviance: 1654.4 on 1265 degrees of freedom
## AIC: 1658.4
##
## Number of Fisher Scoring iterations: 4
final_model <- glm(LVD ~ GENDER + Rate + QRSd + QTc + Axes_T, data = train_dat, family = 'binomial')
valid_dat[,'pred'] <- predict(final_model, valid_dat)
test_dat[,'pred'] <- predict(final_model, test_dat)
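– The cut-off best_cut used below is not defined in this excerpt. A minimal sketch, assuming it is chosen on the validation set with Youden's index via the pROC package (roc_valid and the best.method choice are assumptions; note that predict() above returns the linear predictor, which is fine because ROC analysis and thresholding only depend on the ordering of the predictions):
library(pROC)                                    # also provides roc(), which is used further below
roc_valid <- roc(LVD ~ pred, data = valid_dat)   # ROC curve computed on the validation set only
best_cut <- coords(roc_valid, x = 'best', best.method = 'youden')[['threshold']]   # assumed: Youden-optimal cut-off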
tab_test <- table(test_dat$pred >= best_cut, test_dat$LVD)   # 2 x 2 confusion table on the testing set
sens <- tab_test[2,2] / sum(tab_test[,2])                    # sensitivity: predicted positives among true positives
spec <- tab_test[1,1] / sum(tab_test[,1])                    # specificity: predicted negatives among true negatives
roc_test <- roc(LVD ~ pred, data = test_dat)
plot(roc_test)
points(spec, sens, pch = 19)
text(0.5, 0.5, paste0('Sens = ', formatC(sens, digits = 3, format = 'f'),
'\nSpec = ', formatC(spec, digits = 3, format = 'f'),
'\nAUC = ', formatC(roc_test$auc, digits = 3, format = 'f')), col = 'red')
Last week we introduced the combination of a prediction function, a loss function, and gradient descent! Today we applied the same procedure to logistic regression, so you should be getting comfortable with designing new prediction functions!
Besides introducing logistic regression, today's most important topic was the data science study workflow; you will find that most studies in top journals follow this logic.
– For binary classification tasks, use the ROC curve together with sensitivity and specificity; sometimes you will also see the positive predictive value.
– For continuous (linear) prediction, consider using the mean absolute error and the correlation coefficient as evaluation metrics
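– For example, for a continuous outcome those two metrics could be computed like this (sketch; y_obs and y_pred are hypothetical vectors of observed and predicted values):
y_obs  <- c(3.2, 4.1, 5.0, 6.3)   # hypothetical observed values
y_pred <- c(3.0, 4.5, 4.8, 6.0)   # hypothetical predicted values
mean(abs(y_obs - y_pred))         # mean absolute error
cor(y_obs, y_pred)                # Pearson correlation between observed and predicted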