Downloading the Data

– As with the first assignment, the data for this task must be requested before it can be downloaded; please ask the teaching assistant for an account.

– You will receive 3 files: 'train.csv', 'test.csv', and 'sample_submission.csv'.

Reading the Data

library(data.table)

# fread() reads large CSVs quickly; data.table = FALSE returns plain data.frames
train_dat <- fread('train.csv', data.table = FALSE)
test_dat <- fread('test.csv', data.table = FALSE)
submit_dat <- fread('sample_submission.csv', data.table = FALSE)
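
– A quick check that the files loaded as expected (a minimal sketch):

dim(train_dat)   # rows x columns of the training data
dim(test_dat)
head(submit_dat) # submission template: IDs plus the column to fill in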

Imputing Missing Data

# Variables to process: every column except the first two (assumed to be ID columns)
interest_var <- colnames(train_dat)[-1:-2]

# Encode GENDER as a categorical variable in both sets
train_dat[,'GENDER'] <- as.factor(train_dat[,'GENDER'])
test_dat[,'GENDER'] <- as.factor(test_dat[,'GENDER'])

# Impute each variable with training-set statistics:
# the mean for numeric columns, the most frequent level otherwise
for (i in seq_along(interest_var)) {
  if (class(train_dat[,interest_var[i]])[1] %in% c('numeric', 'integer')) {
    m_val <- mean(train_dat[,interest_var[i]], na.rm = TRUE)
  } else {
    m_val <- names(which.max(table(train_dat[,interest_var[i]])))
  }
  train_dat[is.na(train_dat[,interest_var[i]]), interest_var[i]] <- m_val
  test_dat[is.na(test_dat[,interest_var[i]]), interest_var[i]] <- m_val
}
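
– After imputation, neither dataset should contain missing values among the variables of interest; a quick check (a sketch):

sum(is.na(train_dat[,interest_var]))  # should be 0
sum(is.na(test_dat[,interest_var]))   # should be 0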

Analysis with Extreme Gradient Boosting (XGBoost)

– Since we have no way to obtain the outcomes for 'test.csv', we again have to carve a 'validation set' out of the original sample so that we can gauge accuracy:

library(xgboost)

set.seed(0)

# Randomly take 80% of the rows for training; the remainder is the validation set
train_idx <- sample(1:nrow(train_dat), nrow(train_dat) * 0.8)

# Build numeric design matrices ([,-1] drops the intercept column from model.matrix)
train_X_mat <- model.matrix(~ ., data = train_dat[train_idx,-1:-3])
xgb.data_train <- xgb.DMatrix(data = train_X_mat[,-1], label = train_dat[train_idx,'K'])

valid_X_mat <- model.matrix(~ ., data = train_dat[-train_idx,-1:-3])
xgb.data_valid <- xgb.DMatrix(data = valid_X_mat[,-1], label = train_dat[-train_idx,'K'])

# Train with early stopping monitored on the validation set
# (note: newer versions of xgboost rename the objective "reg:linear" to "reg:squarederror")
xgb_fit <- xgb.train(data = xgb.data_train, watchlist = list(eval = xgb.data_valid),
                     early_stopping_rounds = 10, eval_metric = 'rmse', verbose = FALSE,
                     nthread = 2, nrounds = 200, objective = "reg:linear")
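
# (Sketch) Evaluate the held-out validation RMSE before scoring the test set
valid_pred <- predict(xgb_fit, valid_X_mat[,-1])
sqrt(mean((train_dat[-train_idx,'K'] - valid_pred)^2))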

# Predict on the test set and write out the submission file
test_X_mat <- model.matrix(~ ., data = test_dat[,-1:-2])
submit_dat[,'K'] <- predict(xgb_fit, test_X_mat[,-1])
fwrite(submit_dat, file = 'my_submission.csv', na = '', row.names = FALSE, quote = FALSE)

Linear Mixed Models

The computation above took no account of the variation that may exist within patients; here we introduce the 'linear mixed model' to address this.

\[ \begin{align} y_{i,k} &= \beta_{0,k} + \beta_{1,k} x_{i,k} + \varepsilon_{i,k} \quad \text{ for patient } k \end{align} \]

The equation of a linear mixed model looks no different from ordinary linear regression; what is special is that each patient \(k\) receives a specific slope \(\beta_{1,k}\) and intercept \(\beta_{0,k}\). This is because a linear mixed model contains random effects that describe the relationship between each patient's specific coefficients \(B_{k}\) and the global coefficients \(B\), and the best linear unbiased prediction (BLUP) can be used to compute each patient's specific coefficients \(B_{k}\):

\[ \begin{align} B_{k} & = B + BLUP_k \end{align} \]

The best linear unbiased prediction is computed as follows. \(Y_k\) is an n × 1 vector of all the dependent-variable values for patient \(k\), and \(X_k\) is an n × q matrix of patient \(k\)'s independent variables. Of particular note is the matrix \(Z_k\), an n × p matrix very similar to \(X_k\) except that it contains only the independent variables carrying random effects. These matrices relate to the data as follows:

\[ \begin{align} Y_k = \begin{pmatrix} y_{1,k} \\ y_{2,k} \\ \vdots \\ y_{n,k} \end{pmatrix} \quad X_k = \begin{pmatrix} 1 & x_{1,k} \\ 1 & x_{2,k} \\ \vdots & \vdots \\ 1 & x_{n,k} \end{pmatrix} \quad Z_k = \begin{pmatrix} 1 & z_{1,k} \\ 1 & z_{2,k} \\ \vdots & \vdots \\ 1 & z_{n,k} \end{pmatrix} \end{align} \]

Fitting the linear mixed model yields the matrix \(G\), the vector \(B\), and the residual variance \(\sigma^2\). The matrix \(G\) is the variance-covariance matrix of the random effects (p × p), and the vector \(B\) holds the coefficient estimates of all fixed effects (q × 1). From these results we can additionally construct the matrix \(R_k\) (n × n) as follows:

\[ \begin{align} G = \begin{pmatrix} \tau_1^2 & \tau_{12} \\ \tau_{12} & \tau_2^2 \end{pmatrix} \quad B = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} \quad R_k = \sigma^2 I_{n \times n} = \begin{pmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{pmatrix} \end{align} \]

If we assume independence between the random effects (\(G\)) and the residuals (\(\sigma^2\)), we can compute the variance-covariance matrix (\(\Sigma_k\)) of the dependent-variable vector \(Y_k\):

\[ \begin{align} \Sigma_k & = Z_k G Z_k^T + R_k \end{align} \]

Finally, each patient's best linear unbiased prediction \(BLUP_k\) can be computed with:

\[ \begin{align} BLUP_k & = G Z_k^T \Sigma_k^{-1} \left( Y_k - X_k B \right) \end{align} \]
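
These relationships can be verified concretely with lme4; the following is a minimal, self-contained sketch using the package's bundled sleepstudy data (not the assignment data), where `coef()` returns \(B + BLUP_k\), `fixef()` returns \(B\), and `ranef()` returns \(BLUP_k\):

library(lme4)

# Random intercept and slope for Days within each Subject
toy_fit <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy)

B <- fixef(toy_fit)               # global coefficients B
BLUP_k <- ranef(toy_fit)$Subject  # BLUP_k for each subject
B_k <- coef(toy_fit)$Subject      # subject-specific coefficients B_k

# Verify B_k = B + BLUP_k (up to floating-point error)
all.equal(as.matrix(B_k),
          sweep(as.matrix(BLUP_k), 2, B, '+'),
          check.attributes = FALSE)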

Correcting Predictions with Person-Specific Slopes

– Unlike the earlier random split, this time the 'validation set' consists of the patients who also appear in 'test.csv'; their validation records will later be used to estimate person-specific coefficients:

library(xgboost)

train_idx <- which(!train_dat[,'PID'] %in% test_dat[,'PID'])

train_X_mat <- model.matrix(~ ., data = train_dat[train_idx,-1:-3])
xgb.data_train <- xgb.DMatrix(data = train_X_mat[,-1], label = train_dat[train_idx,'K'])

valid_X_mat <- model.matrix(~ ., data = train_dat[-train_idx,-1:-3])
xgb.data_valid <- xgb.DMatrix(data = valid_X_mat[,-1], label = train_dat[-train_idx,'K'])

xgb_fit <- xgb.train(data = xgb.data_train, watchlist = list(eval = xgb.data_valid),
                     early_stopping_rounds = 10, eval_metric = 'rmse', verbose = FALSE,
                     nthread = 2, nrounds = 200, objective = "reg:linear")
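
– The BLUP step below needs validation records for every test patient, so it is worth checking (a quick sketch) that the split provides them:

# Every patient in test.csv should have at least one validation record
all(test_dat[,'PID'] %in% train_dat[-train_idx,'PID'])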

– Once training is done, we first fill in the validation set's predicted values and fit the 'linear mixed model':

library(lme4)

valid_dat <- train_dat[-train_idx,]
valid_dat[,'xgb_pred'] <- predict(xgb_fit, valid_X_mat[,-1])

# Random intercept and slope for xgb_pred within each patient (PID)
lmm_fit <- lmer(K ~ xgb_pred + (1 + xgb_pred | PID), data = valid_dat)
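
– The quantities from the formulas above can be read straight off the fitted model; a quick inspection using lme4's standard accessor functions:

VarCorr(lmm_fit)    # G: variance-covariance matrix of the random effects
fixef(lmm_fit)      # B: fixed-effect estimates
sigma(lmm_fit)^2    # sigma^2: residual variance
ranef(lmm_fit)$PID  # BLUP_k for each patient, as computed by lme4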

– Next, we can use this method to compute each patient's 'personalized intercept and slope':

# Score the test set with XGBoost; these predictions feed the mixed model
test_X_mat <- model.matrix(~ ., data = test_dat[,-1:-2])
test_dat[,'xgb_pred'] <- predict(xgb_fit, test_X_mat[,-1])

# G: 2 x 2 variance-covariance matrix of the random effects
G_mat <- matrix(as.numeric(VarCorr(lmm_fit)[[1]]), nrow = 2, ncol = 2)

for (i in 1:nrow(test_dat)) {

  # Z_k: intercept and xgb_pred for patient k's validation records
  # (here X_k = Z_k, since both coefficients also carry random effects)
  Z_mat <- cbind(1, valid_dat[valid_dat[,'PID'] %in% test_dat[i,'PID'], 'xgb_pred'])

  # R_k = sigma^2 * I
  R_mat <- diag(sigma(lmm_fit)^2, nrow = nrow(Z_mat))

  # Sigma_k = Z_k G Z_k' + R_k
  Sigma_mat <- Z_mat %*% G_mat %*% t(Z_mat) + R_mat

  # BLUP_k = G Z_k' Sigma_k^{-1} (Y_k - X_k B)
  Res_val <- valid_dat[valid_dat[,'PID'] %in% test_dat[i,'PID'], 'K'] - Z_mat %*% fixef(lmm_fit)
  BLUP <- G_mat %*% t(Z_mat) %*% solve(Sigma_mat) %*% Res_val

  # B_k = B + BLUP_k
  person_coef <- fixef(lmm_fit) + BLUP
  test_dat[i,'b0_person'] <- person_coef[1]
  test_dat[i,'b1_person'] <- person_coef[2]

}
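
– As a sanity check (a sketch), the hand-computed coefficients should match the per-patient coefficients from lme4's `coef()`, which returns exactly \(B + BLUP_k\):

# Compare against lme4's built-in person-specific coefficients
head(coef(lmm_fit)$PID)
head(test_dat[,c('PID', 'b0_person', 'b1_person')])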

# Final prediction: apply each patient's personalized line to the XGBoost output
submit_dat[,'K'] <- test_dat[,'b0_person'] + test_dat[,'b1_person'] * test_dat[,'xgb_pred']
fwrite(submit_dat, file = 'my_submission.csv', na = '', row.names = FALSE, quote = FALSE)