Machine Learning and Algorithms

林嶔 (Lin, Chin)

Lesson 5: Foundations of Artificial Intelligence 2 (Maximum Likelihood Estimation and Data Imputation)

Section 1: Introduction to Maximum Likelihood Estimation (1)

– That is, our "optimization" objective is to minimize the "residual sum of squares". Could we instead minimize the "sum of absolute residuals"? (A small numerical comparison follows the list below.)

– How do we judge which is better?

  1. First, define our prediction equation (y = b0 + b1 * x)

  2. Next, define our objective (minimize the squared residuals)

  3. Take partial derivatives of the squared-residual equation (to locate the minimum)
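
– Before deriving anything, a minimal numerical sketch (not from the original slides) may help: we can hand both objectives to R's general-purpose optimizer optim() on a small toy dataset and compare the solutions. The function names rss and sae are illustrative.

x <- c(1, 2, 3, 4, 5)
y <- c(6, 7, 9, 8, 10)

# Objective 1: minimize the sum of squared residuals
rss <- function(b) {sum((y - b[1] - b[2] * x)^2)}
# Objective 2: minimize the sum of absolute residuals
sae <- function(b) {sum(abs(y - b[1] - b[2] * x))}

optim(c(0, 0), rss)$par  # approaches lm()'s solution (b0 = 5.3, b1 = 0.9)
optim(c(0, 0), sae)$par  # may settle on a somewhat different line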

Section 1: Introduction to Maximum Likelihood Estimation (2)

– First consider a simpler problem. Suppose we know that the prevalence of a disease is \(p=0.3\) in Taipei, \(p=0.5\) in Taichung, and \(p=0.7\) in Kaohsiung. Given our sample (49 cases among 80 people), which city was the sample most likely drawn from?

– This is clearly a binomial distribution problem. The binomial probability formula is as follows (note the basic assumptions of the binomial distribution):

\[ \begin{align} likelihood & = C_x^np^x(1-p)^{n-x} \end{align} \]

\[ \begin{align} likelihood_{Taipei} & = C_{49}^{80}0.3^{49}0.7^{31} \end{align} \]

Taipei.p <- choose(80, 49) * 0.3^49 * 0.7^31
Taipei.p
## [1] 5.402339e-09

– Next, let's compute the sample probabilities for Taichung and Kaohsiung:

\[ \begin{align} likelihood_{Taichung} & = C_{49}^{80}0.5^{49}0.5^{31} \\ likelihood_{Kaohsiung} & = C_{49}^{80}0.7^{49}0.3^{31} \end{align} \]

Taichung.p <- choose(80, 49) * 0.5^49 * 0.5^31
Taichung.p
## [1] 0.0118359
Kaohsiung.p <- choose(80, 49) * 0.7^49 * 0.3^31
Kaohsiung.p
## [1] 0.02270722

Section 1: Introduction to Maximum Likelihood Estimation (3)

– The simplest method is just to plug in numbers, right? Let's try a few more values:

– This uses a user-defined function. If you need a refresher on writing your own functions, see the course 「R語言程式設計導論」 (Introduction to R Programming).

sample.p <- function (p) {choose(80, 49) * (p)^49 * (1 - p)^31}
sample.p(0.2)
## [1] 7.977383e-16
sample.p(0.4)
## [1] 6.014914e-05
sample.p(0.6)
## [1] 0.08889393
sample.p(0.8)
## [1] 5.482016e-05

– With this idea in mind, how do we actually solve for the optimum?

Section 1: Introduction to Maximum Likelihood Estimation (4)

– This uses a "for" loop. If you need a refresher on loop statements, see the course 「R語言程式設計導論」 (Introduction to R Programming).

seq.p <- seq(0, 1, by = 0.001)
result <- numeric(length(seq.p))
for (i in 1:length(seq.p)) {
  result[i] <- choose(80, 49) * (seq.p[i])^49 * (1 - seq.p[i])^31
}

which.max(result)
## [1] 613
seq.p[which.max(result)]
## [1] 0.612
plot(seq.p, result, type = "l", xlab = "p", ylab = "likelihood")
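
– Incidentally, since sample.p() as defined above is already vectorized over p, the same grid search can be written without an explicit loop:

result <- sample.p(seq.p)  # equivalent to the for loop above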

Section 1: Introduction to Maximum Likelihood Estimation (5)

– Note that \(n\) and \(x\) are known quantities here:

\[ \begin{align} likelihood & = C_x^np^x(1-p)^{n-x} \end{align} \]

– This uses the product rule of differentiation:

\[\frac{\partial}{\partial x}f(x)g(x) = g(x) \cdot \frac{\partial}{\partial x} f(x) + f(x) \cdot \frac{\partial}{\partial x} g(x)\]

– We want to find the \(p\) that maximizes the \(likelihood\):

\[ \begin{align} \frac{\partial}{\partial p} likelihood & = \frac{\partial}{\partial p} (C_x^np^x(1-p)^{n-x}) \\ &= C_x^n\frac{\partial}{\partial p} (p^x(1-p)^{n-x}) \\ &= C_x^n (\frac{\partial}{\partial p} p^x \cdot (1-p)^{n-x} + \frac{\partial}{\partial p} (1-p)^{n-x} \cdot p^x) \\ &= C_x^n (xp^{x-1} \cdot (1-p)^{n-x} - (n-x)(1-p)^{n-x-1} \cdot p^x) = 0 \end{align} \]

\[ \begin{align} xp^{x-1} \cdot (1-p)^{n-x} & = (n-x)(1-p)^{n-x-1} \cdot p^x \\ x \cdot (1-p) & = (n-x) \cdot p \\ x - xp & = np - xp \\ \frac {x}{n} & = p \end{align} \]
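
– A quick sanity check of the analytic solution \(p = x / n\): with \(x = 49\) and \(n = 80\) it gives 0.6125, agreeing with the grid-search result of 0.612 above.

49 / 80
## [1] 0.6125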

Section 1: Introduction to Maximum Likelihood Estimation (6)

– We use the function "mle()" from the package "stats4" (but it can only find where a minimum occurs, not a maximum, so we must rewrite our sample.p function to return the negative likelihood):

– Note that the function "mle" does not solve the problem by taking derivatives mathematically; we will explain its actual solution method in detail later!

library(stats4)
sample.p <- function (p) {-choose(80, 49) * (p)^49 * (1 - p)^31}
fit <- mle(sample.p, start = list(p = 0.5), method = "SANN")
fit
## 
## Call:
## mle(minuslogl = sample.p, start = list(p = 0.5), method = "SANN")
## 
## Coefficients:
##         p 
## 0.6123949

Section 1: Introduction to Maximum Likelihood Estimation (7)

\[ \begin{align} likelihood & = dnorm(x, \mu, \sigma) = \frac{1}{\sigma \sqrt{2 \pi}} exp (- \frac{(x - \mu)^2}{2 \sigma ^2}) \end{align} \]

mu <- 2
sigma <- 3
x <- 1
likelihood <- 1 / (sigma * sqrt(2 * pi)) * exp(- (x - mu)^2 / (2 * sigma^2))
likelihood
## [1] 0.1257944
likelihood <- dnorm(x = 1, mean = 2, sd = 3, log = FALSE)
likelihood
## [1] 0.1257944

Section 1: Introduction to Maximum Likelihood Estimation (8)

– Following the derivation just now, we should use a fixed \(\mu\) and \(\sigma\) to first compute \(likelihood_1\) through \(likelihood_5\) separately, and then find the pair of \(\mu\) and \(\sigma\) that maximizes their product:

\[ \begin{align} likelihood_i & = dnorm(x_i, \mu, \sigma) \\ likelihood &= \prod \limits_{i=1}^{n} dnorm(x_i, \mu, \sigma) \end{align} \]

– Since a product is awkward to work with and easily underflows to an extremely small value, we apply a log transform to turn it into a sum:

\[ \begin{align} log(likelihood) &= \sum \limits_{i=1}^{n} log(dnorm(x_i, \mu, \sigma)) \end{align} \]
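
– A quick check (a small sketch with arbitrary \(\mu\) and \(\sigma\)) that the log transform preserves the quantity: the product of the densities equals the exponential of the summed log-densities.

x <- c(1, 7, 5, 6, 8)
prod(dnorm(x, mean = 5, sd = 2.5))
exp(sum(dnorm(x, mean = 5, sd = 2.5, log = TRUE)))  # identical value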

– Applying the procedure above: if I give you the values of \(x_1\) through \(x_5\), you should now be able to find a \(\mu\) and \(\sigma\), right?

Exercise 1: Compute the mean and standard deviation

x <- c(1, 7, 5, 6, 8)
mean(x)
## [1] 5.4
sd(x)
## [1] 2.701851

Exercise 1 Answer

\[ \begin{align} log(likelihood) &= \sum \limits_{i=1}^{n} log(dnorm(x_i, \mu, \sigma)) \end{align} \]

library(stats4)
sample.likelihood <- function (mu, sigma) {-sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))}
fit <- mle(sample.likelihood, start = list(mu = 0, sigma = 0), method = "SANN")
fit
## 
## Call:
## mle(minuslogl = sample.likelihood, start = list(mu = 0, sigma = 0), 
##     method = "SANN")
## 
## Coefficients:
##       mu    sigma 
## 5.408800 2.395984
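
– Note that the fitted sigma (about 2.40) is smaller than sd(x) (2.70). This is not just optimizer noise: the MLE of \(\sigma\) divides the sum of squared deviations by \(n\) rather than \(n - 1\), as the check below shows.

sqrt(mean((x - mean(x))^2))  # MLE-style estimate of sigma (divides by n)
## [1] 2.416609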

Section 2: Solving Linear Regression with Maximum Likelihood Estimation (1)

  1. First, define our prediction equation (y = b0 + b1 * x)

  2. Next, define our objective (maximize the sample likelihood)

x <- c(1, 2, 3, 4, 5)
y <- c(6, 7, 9, 8, 10)

linear.p <- function(b0, b1, sd.res) {
  y.hat <- b0 + b1 * x
  res <- y - y.hat
  log_p.res <- dnorm(res, mean = 0, sd = sd.res, log = TRUE)
  return(-sum(log_p.res))
}

fit1 <- mle(linear.p, start = list(b0 = 0, b1 = 0, sd.res = 1), method = "Nelder-Mead")
fit1
## 
## Call:
## mle(minuslogl = linear.p, start = list(b0 = 0, b1 = 0, sd.res = 1), 
##     method = "Nelder-Mead")
## 
## Coefficients:
##        b0        b1    sd.res 
## 5.2992917 0.9001965 0.6166326

Section 2: Solving Linear Regression with Maximum Likelihood Estimation (2)

fit1
## 
## Call:
## mle(minuslogl = linear.p, start = list(b0 = 0, b1 = 0, sd.res = 1), 
##     method = "Nelder-Mead")
## 
## Coefficients:
##        b0        b1    sd.res 
## 5.2992917 0.9001965 0.6166326
fit2 <- lm(y~x)
fit2
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##         5.3          0.9
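
– The intercept and slope agree; the remaining difference is in the residual standard deviation. mle() returns the MLE, which divides the residual sum of squares by \(n\), while lm()'s residual standard error divides by \(n - p\). A quick check reusing fit2:

sqrt(sum(residuals(fit2)^2) / 5)  # MLE-style estimate, matches sd.res above
## [1] 0.6164414
sqrt(sum(residuals(fit2)^2) / 3)  # lm()'s residual standard error
## [1] 0.7958224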

Section 2: Solving Linear Regression with Maximum Likelihood Estimation (3)

vcov(fit2)
##             (Intercept)           x
## (Intercept)   0.6966667 -0.19000000
## x            -0.1900000  0.06333333
slotNames(fit1)
## [1] "call"      "coef"      "fullcoef"  "vcov"      "min"       "details"  
## [7] "minuslogl" "nobs"      "method"
slot(fit1, "vcov")
##                   b0            b1        sd.res
## b0      4.182595e-01 -1.140708e-01 -8.743581e-05
## b1     -1.140708e-01  3.802359e-02  2.425373e-05
## sd.res -8.743581e-05  2.425373e-05  3.805807e-02
fit1@vcov
##                   b0            b1        sd.res
## b0      4.182595e-01 -1.140708e-01 -8.743581e-05
## b1     -1.140708e-01  3.802359e-02  2.425373e-05
## sd.res -8.743581e-05  2.425373e-05  3.805807e-02
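
– The diagonal of a variance-covariance matrix holds the variance of each estimate, so its square roots are the standard errors (a small sketch; the first two entries of fit1's version correspond to b0 and b1):

sqrt(diag(fit1@vcov))   # standard errors from the MLE fit
sqrt(diag(vcov(fit2)))  # standard errors from lm()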

Section 2: Solving Linear Regression with Maximum Likelihood Estimation (4)

– Please download the example data here

dat <- read.csv("ECG_train.csv", header = TRUE, fileEncoding = 'CP950', stringsAsFactors = FALSE, na.strings = "")
subdat <- dat[!(dat[,"K"] %in% NA) & !(dat[,"AGE"] %in% NA),]
x <- subdat[,"AGE"]
y <- subdat[,"K"]

fit1 <- mle(linear.p, start = list(b0 = 0, b1 = 0, sd.res = 1), method = "Nelder-Mead")
fit2 <- lm(y~x)

fit1@coef
##         b0         b1     sd.res 
## 2.66662740 0.01775345 1.27377093
fit2$coefficients
## (Intercept)           x 
##  2.66644512  0.01775045
fit1@vcov
##                   b0            b1       sd.res
## b0      6.256080e-03 -9.152175e-05 7.294819e-08
## b1     -9.152175e-05  1.457622e-06 1.199834e-09
## sd.res  7.294819e-08  1.199834e-09 2.548881e-04
vcov(fit2)
##               (Intercept)             x
## (Intercept)  6.258357e-03 -9.155506e-05
## x           -9.155506e-05  1.458153e-06

Exercise 2: Solve the multiple regression parameters with maximum likelihood estimation

subdat <- dat[!(dat[,"K"] %in% NA) & !(dat[,"AGE"] %in% NA) & !(dat[,"PR"] %in% NA),]
x.1 <- subdat[,"AGE"]
x.2 <- subdat[,"PR"]
y <- subdat[,"K"]

fit <- lm(y ~ x.1 + x.2)

fit$coefficients
##   (Intercept)           x.1           x.2 
##  2.7524737508  0.0162302985 -0.0001931383

Exercise 2 Answer

linear.p <- function(b0, b1, b2, sd.res) {
  y.hat <- b0 + b1 * x.1 + b2 * x.2
  res <- y - y.hat
  log_p.res <- dnorm(res, mean = 0, sd = sd.res, log = TRUE)
  return(-sum(log_p.res))
}

mle_fit <- mle(linear.p, start = list(b0 = 0, b1 = 0, b2 = 0, sd.res = 1), method = "Nelder-Mead")
mle_fit
## 
## Call:
## mle(minuslogl = linear.p, start = list(b0 = 0, b1 = 0, b2 = 0, 
##     sd.res = 1), method = "Nelder-Mead")
## 
## Coefficients:
##            b0            b1            b2        sd.res 
##  2.7527314664  0.0162225537 -0.0001933522  1.2270127940

Section 3: Multiple Imputation (1)

– We now introduce the method most commonly used in current academic journals to handle this problem: Multiple Imputation.

– Using multiple imputation to handle missing values assumes that the data are Missing At Random (MAR).


– Although the underlying computation is somewhat involved, it is not central to this course; we will simply learn how to use the method and skip the implementation details for now.

Section 3: Multiple Imputation (2)

– For any statistical parameter, let \(Q_i\) be the coefficient and \(U_i\) the variance obtained from the \(i\)-th imputed dataset. The pooled results \(Q\) and \(U\) can then be computed as:

\[ \begin{align} Q & = \frac{1}{m} \sum \limits_{i=1}^{m} Q_i \\ U & = \frac{1}{m} \sum \limits_{i=1}^{m} U_i + \frac{m + 1}{m(m - 1)} \sum \limits_{i=1}^{m} (Q_i - Q)^2 \end{align} \]

– If you want to perform a statistical test, refer to these formulas:

\[ \begin{align} df & = (m + 1)(1 + \frac{m(m - 1)U}{(m + 1) \sum \limits_{i=1}^{m} (Q_i - Q)^2}) \\ t_{df} & = \frac{Q}{\sqrt{U}} \end{align} \]
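
– A minimal numeric sketch of the pooling formulas above (the Qi and Ui values below are purely hypothetical, not from our data):

Qi <- c(0.016, 0.017, 0.015, 0.016, 0.018)       # hypothetical per-imputation estimates
Ui <- c(1.4e-6, 1.5e-6, 1.4e-6, 1.6e-6, 1.5e-6)  # hypothetical per-imputation variances
m <- length(Qi)
Q <- mean(Qi)
U <- mean(Ui) + (m + 1) / (m * (m - 1)) * sum((Qi - Q)^2)
c(Q = Q, U = U)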

Section 3: Multiple Imputation (3)

– After installation, load the package with the function "library()":

library(mice)

– In addition, we will later use the imputed values to predict K, and the dependent variable itself must not be imputed:

dat <- read.csv("ECG_train.csv", header = TRUE, fileEncoding = 'CP950', stringsAsFactors = FALSE, na.strings = "")

dat[,'AMI'] <- as.factor(dat[,'AMI'])
dat[,'GENDER'] <- as.factor(dat[,'GENDER'])
dat[,'LVD'] <- as.factor(dat[,'LVD'])

subdat <- dat[!dat[,'K'] %in% NA, c('AMI', 'K', 'LVD', 'GENDER', 'AGE', 'Rate', 'PR', 'QRSd', 'QT', 'QTc', 'Axes_P', 'Axes_QRS', 'Axes_T')]
subdat.x <- subdat[,c('AMI', 'LVD', 'GENDER', 'AGE', 'Rate', 'PR', 'QRSd', 'QT', 'QTc', 'Axes_P', 'Axes_QRS', 'Axes_T')]
subdat.y <- subdat[,'K']

Section 3: Multiple Imputation (4)

– The parameter m specifies how many imputed datasets to generate; the default is 5.

– The parameter method (or meth) specifies the imputation method. In this example I use cart (classification and regression trees), which can handle both the numeric and the categorical variables in our data.

– The parameter maxit specifies the number of iterations used to estimate the missing values; the default is 5. In this example it is set to 10.

impute_dat <- mice(subdat.x, m = 5, maxit = 10, meth = 'cart', seed = 123)
## 
##  iter imp variable
##   1   1  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   1   2  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   1   3  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   1   4  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   1   5  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   2   1  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   2   2  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   2   3  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   2   4  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   2   5  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   3   1  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   3   2  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   3   3  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   3   4  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   3   5  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   4   1  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   4   2  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   4   3  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   4   4  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   4   5  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   5   1  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   5   2  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   5   3  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   5   4  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   5   5  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   6   1  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   6   2  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   6   3  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   6   4  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   6   5  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   7   1  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   7   2  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   7   3  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   7   4  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   7   5  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   8   1  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   8   2  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   8   3  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   8   4  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   8   5  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   9   1  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   9   2  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   9   3  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   9   4  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   9   5  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   10   1  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   10   2  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   10   3  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   10   4  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T
##   10   5  AMI  LVD  PR  QT  QTc  Axes_P  Axes_QRS  Axes_T

Section 3: Multiple Imputation (5)

– Note: if you look this function up directly with "help()", you will find that two different packages provide a function with exactly the same name; we must specify the one from package "mice".

head(subdat.x)
##       AMI  LVD GENDER      AGE Rate  PR QRSd  QT QTc Axes_P Axes_QRS Axes_T
## 2 not-AMI <NA>   male 50.95044  112 192   99 360 492      0       41     -2
## 3    <NA> <NA> female 66.22767   76 154   95 397 447     79       11     30
## 4    <NA> <NA> female 67.46526   65 184   86 440 457     61       40     30
## 6   STEMI    0   male 52.39062   93 224   89 386 481     60      -36     -4
## 8    <NA> <NA> female 38.60219   63 165   94 414 424     67       23     15
## 9    <NA>    0 female 83.70883   71 192   92 412 448     71       49    138
dat1 <- mice:::complete(impute_dat, action = 1)
head(dat1)
##       AMI LVD GENDER      AGE Rate  PR QRSd  QT QTc Axes_P Axes_QRS Axes_T
## 1 not-AMI   0   male 50.95044  112 192   99 360 492      0       41     -2
## 2 not-AMI   0 female 66.22767   76 154   95 397 447     79       11     30
## 3 not-AMI   0 female 67.46526   65 184   86 440 457     61       40     30
## 4   STEMI   0   male 52.39062   93 224   89 386 481     60      -36     -4
## 5 not-AMI   0 female 38.60219   63 165   94 414 424     67       23     15
## 6 not-AMI   0 female 83.70883   71 192   92 412 448     71       49    138
dat2 <- mice:::complete(impute_dat, action = 2)
head(dat2)
##       AMI LVD GENDER      AGE Rate  PR QRSd  QT QTc Axes_P Axes_QRS Axes_T
## 1 not-AMI   1   male 50.95044  112 192   99 360 492      0       41     -2
## 2 not-AMI   0 female 66.22767   76 154   95 397 447     79       11     30
## 3 not-AMI   0 female 67.46526   65 184   86 440 457     61       40     30
## 4   STEMI   0   male 52.39062   93 224   89 386 481     60      -36     -4
## 5 not-AMI   0 female 38.60219   63 165   94 414 424     67       23     15
## 6 not-AMI   0 female 83.70883   71 192   92 412 448     71       49    138

Exercise 3: Construct a sound analysis workflow

\[ \begin{align} m = 1, \hat{y^m} & = b_{0}^m + b_{1}^mx_{1}^m + b_{2}^mx_{2}^m \\ m = 2, \hat{y^m} & = b_{0}^m + b_{1}^mx_{1}^m + b_{2}^mx_{2}^m \\ m = 3, \hat{y^m} & = b_{0}^m + b_{1}^mx_{1}^m + b_{2}^mx_{2}^m \\ m = 4, \hat{y^m} & = b_{0}^m + b_{1}^mx_{1}^m + b_{2}^mx_{2}^m \\ m = 5, \hat{y^m} & = b_{0}^m + b_{1}^mx_{1}^m + b_{2}^mx_{2}^m \\ \end{align} \]

Exercise 3 Answer

dat1 <- mice:::complete(impute_dat, action = 1)
dat2 <- mice:::complete(impute_dat, action = 2)
dat3 <- mice:::complete(impute_dat, action = 3)
dat4 <- mice:::complete(impute_dat, action = 4)
dat5 <- mice:::complete(impute_dat, action = 5)

fit1 <- lm(subdat.y ~ ., data = dat1)
fit2 <- lm(subdat.y ~ ., data = dat2)
fit3 <- lm(subdat.y ~ ., data = dat3)
fit4 <- lm(subdat.y ~ ., data = dat4)
fit5 <- lm(subdat.y ~ ., data = dat5)
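
– To complete the workflow, the five sets of estimates can be pooled with the formulas from earlier in this section (a sketch for the AGE coefficient; the other coefficients are pooled the same way):

fits <- list(fit1, fit2, fit3, fit4, fit5)
Qi <- sapply(fits, function(f) coef(f)["AGE"])         # per-imputation estimates
Ui <- sapply(fits, function(f) vcov(f)["AGE", "AGE"])  # per-imputation variances
m <- length(Qi)
Q <- mean(Qi)
U <- mean(Ui) + (m + 1) / (m * (m - 1)) * sum((Qi - Q)^2)
c(coef = Q, se = sqrt(U))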

Lesson Summary

– You will find that this logic can be applied far more broadly to other models; we no longer have to agonize over the choice of optimization objective.

– Are you still puzzled about why the optimization objective of linear regression is to minimize the "residual sum of squares" rather than the "sum of absolute residuals"?

– After this lesson, however, you should be wondering how the function "mle()" can search from a single starting value all the way to the final optimum; we will explain its underlying principle in detail next lesson.