Machine Learning and Algorithms

林嶔 (Lin, Chin)

Lesson 6: Foundations of Artificial Intelligence 3 (MCMC and Gradient Descent)

Section 1: Markov Chain Monte Carlo (1)

– We have now seen one key point: no matter what form the function takes, the essence of maximum likelihood estimation is that it lets us define an objective function to solve. So let us simplify the problem: we want a method that can find the extremum of some arbitrary function.

– Because MCMC moves completely at random, once it wanders into the neighborhood of an extremum we can exploit that random movement to explore the distribution of that region.

Section 1: Markov Chain Monte Carlo (2)

# Target function whose minimum we want to find
original.fun <- function(x) {return(x^2)}

# Random-walk proposal: step x by a uniform amount in [-val, val]
random.walk <- function(x, val = 0.1) {x + runif(1, min = -val, max = val)}

set.seed(0)

start.value = 5
num.iteration = 1000

x = rep(NA, num.iteration)

# Run the first 100 iterations: accept a move only if it does not increase
# the function value
for (i in 1:100) {
  if (i == 1) {
    x[i] <- start.value
  } else {
    old.x <- x[i-1]
    new.x <- random.walk(old.x)
    if (original.fun(old.x) < original.fun(new.x)) {
      x[i] <- old.x
    } else {
      x[i] <- new.x
    }
  }
}

x[71:100]
##  [1] 3.482776 3.482776 3.450591 3.450591 3.419928 3.386683 3.381953 3.381953
##  [9] 3.381953 3.359951 3.359951 3.359951 3.346883 3.346883 3.326882 3.291952
## [17] 3.291952 3.232490 3.232490 3.156829 3.105927 3.034587 2.982513 2.894300
## [25] 2.894300 2.894300 2.894300 2.894300 2.885355 2.867372
# Continue the walk for iterations 101 to 1000
for (i in 101:1000) {
  old.x = x[i-1]
  new.x = random.walk(old.x)
  if (original.fun(old.x) < original.fun(new.x)) {
    x[i] = old.x
  } else {
    x[i] = new.x
  }
}

x[971:1000]
##  [1] -0.0001650651 -0.0001650651 -0.0001650651 -0.0001650651 -0.0001650651
##  [6] -0.0001650651 -0.0001650651 -0.0001650651 -0.0001650651 -0.0001650651
## [11] -0.0001650651 -0.0001650651 -0.0001650651 -0.0001650651 -0.0001650651
## [16] -0.0001650651 -0.0001650651 -0.0001650651 -0.0001650651 -0.0001650651
## [21] -0.0001650651 -0.0001650651 -0.0001650651 -0.0001650651 -0.0001650651
## [26] -0.0001650651 -0.0001650651 -0.0001650651 -0.0001650651 -0.0001650651

Section 1: Markov Chain Monte Carlo (3)

– Please download the example data here

dat <- read.csv("ECG_train.csv", header = TRUE, fileEncoding = 'CP950', stringsAsFactors = FALSE, na.strings = "")
subdat <- dat[!is.na(dat[,"K"]) & !is.na(dat[,"PR"]),]
x <- subdat[,"PR"]
y <- subdat[,"K"]

prop.fun <- function(b0, b1, x, y) {
  y.hat <- b0 + b1 * x
  res <- y.hat - y
  sd.res <- sd(res)
  # Log-likelihood of the residuals under a normal error model
  log_p <- dnorm(res, mean = 0, sd = sd.res, log = TRUE)
  return(sum(log_p))
}
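As a quick sanity check, the snippet below runs prop.fun on simulated data rather than the ECG data (the intercept 5 and slope 3 are assumptions for illustration only): the log-likelihood should be higher at the true parameters than at a poor guess.

```r
# Self-contained sketch: restate prop.fun and test it on simulated data
prop.fun <- function(b0, b1, x, y) {
  y.hat <- b0 + b1 * x
  res <- y.hat - y
  sd.res <- sd(res)
  sum(dnorm(res, mean = 0, sd = sd.res, log = TRUE))
}

set.seed(123)
sim.x <- runif(100, min = 0, max = 10)
sim.y <- 5 + 3 * sim.x + rnorm(100)

good <- prop.fun(b0 = 5, b1 = 3, x = sim.x, y = sim.y)
bad  <- prop.fun(b0 = 0, b1 = 0, x = sim.x, y = sim.y)
good > bad  # TRUE: the true parameters yield a higher log-likelihood
```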

Section 1: Markov Chain Monte Carlo (4)

start.b0 <- 0
start.b1 <- 0
num.iteration <- 10000

b0.seq <- rep(NA, num.iteration)
b1.seq <- rep(NA, num.iteration)

for (i in 1:num.iteration) {
  if (i == 1) {
    b0.seq[i] <- start.b0
    b1.seq[i] <- start.b1
  } else {
    # Propose new values with random-walk steps on different scales
    b0.seq[i] <- random.walk(b0.seq[i-1], val = 1)
    b1.seq[i] <- random.walk(b1.seq[i-1], val = 0.1)
    old.log_p <- prop.fun(b0 = b0.seq[i-1], b1 = b1.seq[i-1], x = x, y = y)
    new.log_p <- prop.fun(b0 = b0.seq[i], b1 = b1.seq[i], x = x, y = y)
    # Metropolis step: keep the proposal with probability min(1, diff.p)
    diff.p <- exp(new.log_p - old.log_p)
    if (diff.p < runif(1, min = 0, max = 1)) {
      b0.seq[i] <- b0.seq[i-1] 
      b1.seq[i] <- b1.seq[i-1] 
    }
  }
}
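The if-statement in the loop above implements the Metropolis acceptance rule: a proposal that raises the log-likelihood gives diff.p > 1 and is always kept, while a worse proposal survives only with probability exp(new.log_p - old.log_p). A minimal sketch of that acceptance probability:

```r
# Metropolis acceptance probability: min(1, exp(new.log_p - old.log_p))
accept.prob <- function(old.log_p, new.log_p) {
  min(1, exp(new.log_p - old.log_p))
}

accept.prob(old.log_p = -100, new.log_p = -99)   # 1: better proposals always accepted
accept.prob(old.log_p = -100, new.log_p = -102)  # exp(-2), about 0.135
```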

Section 1: Markov Chain Monte Carlo (5)

par(mfcol = c(1, 2))
plot(1:num.iteration, b0.seq, type = "l")
plot(1:num.iteration, b1.seq, type = "l")

– We set a burn-in time and look at the results after the chain has begun oscillating around the target

burn_in <- 5000

par(mfcol = c(2, 2))
hist(b0.seq[(burn_in+1):num.iteration])
abline(v = mean(b0.seq[(burn_in+1):num.iteration]), col = "blue")
abline(v = 5, col = "red")
plot((burn_in+1):num.iteration, b0.seq[(burn_in+1):num.iteration], type = "l")
abline(h = mean(b0.seq[(burn_in+1):num.iteration]), col = "blue")
abline(h = 5, col = "red")
hist(b1.seq[(burn_in+1):num.iteration])
abline(v = mean(b1.seq[(burn_in+1):num.iteration]), col = "blue")
abline(v = 3, col = "red")
plot((burn_in+1):num.iteration, b1.seq[(burn_in+1):num.iteration], type = "l")
abline(h = mean(b1.seq[(burn_in+1):num.iteration]), col = "blue")
abline(h = 3, col = "red")

Section 1: Markov Chain Monte Carlo (6)

fit <- lm(y~x)
summary(fit)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2563 -0.8290 -0.6899  0.4701  6.3229 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.5733390  0.1067549  33.472   <2e-16 ***
## x           0.0008796  0.0006399   1.375    0.169    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.264 on 2820 degrees of freedom
## Multiple R-squared:  0.0006697,  Adjusted R-squared:  0.0003153 
## F-statistic:  1.89 on 1 and 2820 DF,  p-value: 0.1693
mean(b0.seq[(burn_in+1):num.iteration])
## [1] 3.549336
mean(b1.seq[(burn_in+1):num.iteration])
## [1] 0.001074259
sd(b0.seq[(burn_in+1):num.iteration])
## [1] 0.09483694
sd(b1.seq[(burn_in+1):num.iteration])
## [1] 0.000531151

Exercise 1: Adjusting the Random-Walk Parameters

– This is the first time we have found a function extremum without solving it mathematically, and we still obtained an approximate solution

– MCMC seems to have few restrictions: we only need to specify a prediction formula and an optimization target, and the answer comes out easily

Exercise 1 Answer (1)

start.b0 <- 0
start.b1 <- 0
num.iteration <- 10000

b0.seq <- rep(NA, num.iteration)
b1.seq <- rep(NA, num.iteration)

for (i in 1:num.iteration) {
  # Shrink the proposal step size as the chain progresses
  decay_coef <- sqrt(i / 10 + 1)
  if (i == 1) {
    b0.seq[i] <- start.b0
    b1.seq[i] <- start.b1
  } else {
    b0.seq[i] <- random.walk(b0.seq[i-1], val = 10 / decay_coef)
    b1.seq[i] <- random.walk(b1.seq[i-1], val = 1 / decay_coef)
    old.log_p <- prop.fun(b0 = b0.seq[i-1], b1 = b1.seq[i-1], x = x, y = y)
    new.log_p <- prop.fun(b0 = b0.seq[i], b1 = b1.seq[i], x = x, y = y)
    diff.p <- exp(new.log_p - old.log_p)
    if (diff.p < runif(1, min = 0, max = 1)) {
      b0.seq[i] <- b0.seq[i-1] 
      b1.seq[i] <- b1.seq[i-1] 
    }
  }
}

Exercise 1 Answer (2)

burn_in <- 5000

par(mfcol = c(2, 2))
hist(b0.seq[(burn_in+1):num.iteration])
abline(v = mean(b0.seq[(burn_in+1):num.iteration]), col = "blue")
abline(v = 5, col = "red")
plot((burn_in+1):num.iteration, b0.seq[(burn_in+1):num.iteration], type = "l")
abline(h = mean(b0.seq[(burn_in+1):num.iteration]), col = "blue")
abline(h = 5, col = "red")
hist(b1.seq[(burn_in+1):num.iteration])
abline(v = mean(b1.seq[(burn_in+1):num.iteration]), col = "blue")
abline(v = 3, col = "red")
plot((burn_in+1):num.iteration, b1.seq[(burn_in+1):num.iteration], type = "l")
abline(h = mean(b1.seq[(burn_in+1):num.iteration]), col = "blue")
abline(h = 3, col = "red")

mean(b0.seq[(burn_in+1):num.iteration])
## [1] 3.472149
mean(b1.seq[(burn_in+1):num.iteration])
## [1] 0.001550942
sd(b0.seq[(burn_in+1):num.iteration])
## [1] 0.1795203
sd(b1.seq[(burn_in+1):num.iteration])
## [1] 0.001078361

Section 2: Gradient Descent (1)

– This problem is caused by the purely random movement. To solve it, we need a way to guide the search in the correct direction.

– This may still be hard to grasp, so let us walk through solving this function:

\[f(x) = x^{2} + 2x + 1\]

– First, write down its derivative:

\[\frac{\partial}{\partial x} f(x) = 2x + 2\]

– This says that the slope of the tangent line at any point is \(2x + 2\), and the slope means "the amount y changes when x increases by one unit."
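We can confirm this slope numerically with a central finite difference (a sketch; the evaluation point x = 3 is an arbitrary choice):

```r
f <- function(x) {x^2 + 2 * x + 1}

# Central finite difference approximates the derivative at x = 3
h <- 1e-6
numeric.slope <- (f(3 + h) - f(3 - h)) / (2 * h)
numeric.slope  # close to the analytic value 2 * 3 + 2 = 8
```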

Section 2: Gradient Descent (2)

\[\frac{\partial}{\partial x} f(x) = 2x + 2 = 0\]

\[x = -1\]
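Setting the derivative to zero gives the minimum analytically; as a cross-check, R's built-in optimize() finds the same point numerically (a sketch):

```r
f <- function(x) {x^2 + 2 * x + 1}

# One-dimensional numerical minimization over an interval containing -1
opt <- optimize(f, interval = c(-10, 10))
opt$minimum    # approximately -1
opt$objective  # approximately 0
```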

Section 2: Gradient Descent (3)

\[x_{\left(epoch:0\right)} = 10\]

\[x_{\left(epoch:t\right)} = x_{\left(epoch:t - 1\right)} - lr \cdot \frac{\partial}{\partial x}f(x_{\left(epoch:t - 1\right)})\]

– Since the derivative of the function above is \(2x + 2\), we can substitute it into the update rule:

\[ \begin{align} x_{\left(epoch:1\right)} & = x_{\left(epoch:0\right)} - lr \cdot \frac{\partial}{\partial x}f(x_{\left(epoch:0\right)}) \\ & = 10 - lr \cdot \frac{\partial}{\partial x}f(10) \\ & = 10 - 0.05 \cdot (2\cdot10+2)\\ & = 8.9 \end{align} \]

Section 2: Gradient Descent (4)

\[ \begin{align} x_{\left(epoch:2\right)} & = x_{\left(epoch:1\right)} - lr \cdot \frac{\partial}{\partial x}f(x_{\left(epoch:1\right)}) \\ & = 8.9 - lr \cdot \frac{\partial}{\partial x}f(8.9) \\ & = 8.9 - 0.05 \cdot (2\cdot8.9+2)\\ & = 7.91 \end{align} \]

\[ \begin{align} x_{\left(epoch:3\right)} & = 7.91 - 0.891 = 7.019 \\ x_{\left(epoch:4\right)} & = 7.019 - 0.8019 = 6.2171 \\ x_{\left(epoch:5\right)} & = 6.2171 - 0.72171 = 5.49539 \\ & \dots \\ x_{\left(epoch:\infty\right)} & = -1 \end{align} \]
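The iteration above can be run directly; this sketch reproduces the hand-computed values (x at epoch 1 is 8.9, at epoch 2 is 7.91, ...) and converges to -1:

```r
# Derivative of f(x) = x^2 + 2x + 1
differential.fun <- function(x) {2 * x + 2}

x <- 10        # x_(epoch:0)
lr <- 0.05
trace <- c()
for (t in 1:1000) {
  x <- x - lr * differential.fun(x)
  trace <- c(trace, x)
}
trace[1:2]     # 8.90 7.91, matching the derivation above
x              # essentially -1
```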

Section 2: Gradient Descent (5)

\[f(x) = x^{2}\]

# Target function
original.fun = function(x) {
  return(x^2)
}

# Its derivative
differential.fun = function(x) {
  return(2*x)
}

start.value = 5
learning.rate = 0.1
num.iteration = 1000

result.x = rep(NA, num.iteration)

for (i in 1:num.iteration) {
  if (i == 1) {
    result.x[1] = start.value
  } else {
    result.x[i] = result.x[i-1] - learning.rate * differential.fun(result.x[i-1])
  }
}

print(tail(result.x, 1))

## [1] 7.68895e-97


Section 2: Gradient Descent (6)