Machine Learning and Algorithms

林嶔 (Lin, Chin)

Lesson 2: Statistical Analysis Practice 1 (Descriptive Statistics and Descriptive Charts)

Section 1: Descriptive Statistics (1)

In this lesson we finally start analyzing data. We begin with a brief introduction to several functions for descriptive statistics.

– Please download the sample data (ECG_train.csv) first.

dat <- read.csv("ECG_train.csv", header = TRUE, fileEncoding = 'CP950', stringsAsFactors = FALSE, na.strings = "")

The following functions compute common descriptive statistics for continuous variables:

The function mean() computes the mean

mean(dat$AGE, na.rm = TRUE)

## [1] 61.14584

The function sd() computes the standard deviation

sd(dat$AGE, na.rm = TRUE)

## [1] 18.45581

The function var() computes the variance

var(dat$AGE, na.rm = TRUE)

## [1] 340.6169

The function median() computes the median

median(dat$AGE, na.rm = TRUE)

## [1] 61.96178

The function quantile() computes quantiles (percentiles)

quantile(dat$AGE, na.rm = TRUE)

##        0%       25%       50%       75%      100% 
##  20.03530  48.18833  61.96178  75.82827 102.63361

quantile(dat$AGE, 0.5, na.rm = TRUE)

##      50% 
## 61.96178

quantile(dat$AGE, 0.95, na.rm = TRUE)

##      95% 
## 89.31906

The function min() finds the minimum

min(dat$AGE, na.rm = TRUE)

## [1] 20.0353

The function max() finds the maximum

max(dat$AGE, na.rm = TRUE)

## [1] 102.6336
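Every call above passes na.rm = TRUE. The following is a minimal sketch of why, using a made-up vector (ages_demo is hypothetical, not the course data). With a missing value present, mean() and its relatives return NA unless na.rm = TRUE is given, and quantile() stops with an error instead.

```r
# Hypothetical toy vector with one missing value (not the course data)
ages_demo <- c(20, 35, 50, 65, NA)

mean_with_na <- mean(ages_demo)                  # NA: the missing value propagates
mean_no_na   <- mean(ages_demo, na.rm = TRUE)    # 42.5: uses the 4 observed values
median_no_na <- median(ages_demo, na.rm = TRUE)  # 42.5
range_no_na  <- range(ages_demo, na.rm = TRUE)   # minimum and maximum together: 20 65

# summary() drops NA by itself and reports how many values were missing
summary(ages_demo)
```

summary() is a convenient first look because it reports the minimum, quartiles, mean, and maximum in one call.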

Section 1: Descriptive Statistics (2)

For categorical variables, the commonly used descriptive statistics are:

The function table() produces a frequency table (with two variables, a contingency table)

table(dat$GENDER)

## 
## female   male 
##   2172   2828

table(dat$GENDER, dat$LVD)

##         
##            0   1
##   female 459 247
##   male   703 703

The function prop.table() converts a table of counts into proportions

tab1 <- table(dat$GENDER)
prop.table(tab1)

## 
## female   male 
## 0.4344 0.5656

tab2 <- table(dat$GENDER, dat$LVD)
prop.table(tab2)

##         
##                  0         1
##   female 0.2173295 0.1169508
##   male   0.3328598 0.3328598

prop.table(tab2, margin = 1)

##         
##                  0         1
##   female 0.6501416 0.3498584
##   male   0.5000000 0.5000000

prop.table(tab2, margin = 2)

##         
##                  0         1
##   female 0.3950086 0.2600000
##   male   0.6049914 0.7400000
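The margin argument decides which total each cell is divided by. Here is a small sketch on made-up counts (tab_demo is hypothetical, not the course data) that makes the three cases easy to check by hand.

```r
# Hypothetical 2 x 2 counts (rows: group, columns: outcome), not the course data
tab_demo <- matrix(c(30, 10,
                     20, 40),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(group = c("a", "b"), outcome = c("0", "1")))

p_all <- prop.table(tab_demo)              # each cell divided by the grand total (100)
p_row <- prop.table(tab_demo, margin = 1)  # each cell divided by its row total
p_col <- prop.table(tab_demo, margin = 2)  # each cell divided by its column total

rowSums(p_row)          # every row sums to 1
colSums(p_col)          # every column sums to 1
round(100 * p_row, 1)   # multiply by 100 to report percentages
```

Which margin to use depends on the question: margin = 1 answers "within each group, what share has each outcome", while margin = 2 answers "within each outcome, what share comes from each group".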

Section 2: Introduction to Basic Plotting Functions (1)

The same information is usually absorbed faster from a figure than from a table or text.

– R can draw practically any statistical chart!

Let's start with a few simple statistical charts

Histogram: use the function hist()

hist(dat[,"AGE"])

Box plot: use the function boxplot()

boxplot(dat[,"AGE"])

Pie chart: use the function pie() together with table()

pie(table(dat[,"AMI"]))

Bar chart: use the function barplot() together with table()

barplot(table(dat[,"AMI"]))

Section 2: Introduction to Basic Plotting Functions (2)

All of these charts can be varied by passing additional arguments, which we can look up with the function help(). For example, the colors can be changed as follows

– The colors available in R are listed in Colors in R

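– Here is also a new function, par(), which sets the graphical parameters of the plotting environment. Its most common use is to place 4 plots on the same canvas: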

par(mfrow = c(2, 2))
hist(dat[,"AGE"], col = "red")
boxplot(dat[,"AGE"], col = "blue")
pie(table(dat[,"AMI"]), col = c("blue", "red", "green"))
barplot(table(dat[,"AMI"]), col = c("gray90", "gray50", "gray10"))

If you like a figure you have drawn, you can save it with the function pdf(). Be sure to close the PDF file with dev.off() at the end

pdf("plot1.pdf", height = 8, width = 8, family = "serif")
par(mfrow = c(2, 2))
hist(dat[,"AGE"], col = "red")
boxplot(dat[,"AGE"], col = "blue")
pie(table(dat[,"AMI"]), col = c("blue", "red", "green"))
barplot(table(dat[,"AMI"]), col = c("gray90", "gray50", "gray10"))
dev.off()
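Two details are easy to miss here: par() changes stay in effect until they are reset, and the PDF is only complete once dev.off() has run. The sketch below is one possible pattern, using the built-in InsectSprays data and a temporary file (out_file is a hypothetical name) so that it runs without the course data.

```r
# Write to a temporary file so the sketch does not touch the working directory
out_file <- tempfile(fileext = ".pdf")

pdf(out_file, height = 4, width = 8)
old_par <- par(mfrow = c(1, 2))   # par() returns the previous settings
hist(InsectSprays$count, col = "red", main = "Histogram", xlab = "count")
boxplot(InsectSprays$count, col = "blue", main = "Box plot")
par(old_par)                      # restore the previous layout
dev.off()                         # the file is only complete after this call

file.exists(out_file)
```

Saving the value returned by par() and passing it back later is a common way to undo layout changes after a multi-panel figure.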

Exercise 1: Simple Plotting

Use the function help() to find out how to produce the figure below:

Exercise 1: Answer

You may want to start by running one of the Examples from the boxplot help page:

boxplot(count ~ spray, data = InsectSprays, col = "lightgray")

As you may have guessed, InsectSprays is a data frame. Let's take a look:

head(InsectSprays)

##   count spray
## 1    10     A
## 2     7     A
## 3    20     A
## 4    14     A
## 5    14     A
## 6    12     A

You can then try a few more arguments and end up with a figure like this:

boxplot(dat[,"AGE"] ~ dat[,"LVD"], col = c("blue", "red"), ylab = "Age", xlab = "LVD", main = "Age value by LVD status", lwd = 1.5)

Section 3: Introduction to Plotting Objects (1)

Next we introduce a powerful function, plot(). It supports many kinds of figures, the most important being the scatter plot:

plot(dat[,"AGE"], dat[,"Rate"], ylab = "Heart rate", xlab = "AGE", main = "Scatter plot of AGE and Heart rate")

We can also change the shape of the points, for example:

plot(dat[,"AGE"], dat[,"Rate"], ylab = "Heart rate", xlab = "AGE", main = "Scatter plot of AGE and Heart rate", pch = 19)

The chart below maps each pch number to its symbol:
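A chart of this kind can also be drawn in R itself. This is a minimal sketch that lays out symbols 0 to 25 on a grid, using only the functions already seen plus text(). The grid layout and the variable names (pch_id, gx, gy) are arbitrary choices for illustration.

```r
# Draw symbols 0 to 25 on a grid, each labelled with its pch number
pch_id <- 0:25
gx <- pch_id %% 7         # column position (0 to 6)
gy <- 3 - pch_id %/% 7    # row position, first row on top

plot.new()
plot.window(xlim = c(-0.5, 6.5), ylim = c(-0.5, 3.5))
# bg only affects symbols 21 to 25, which have a separate fill color
points(gx, gy, pch = pch_id, cex = 2, col = "black", bg = "gray70")
text(gx, gy - 0.35, labels = pch_id, cex = 0.8)
```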

Section 3: Introduction to Plotting Objects (2)

You can add things to a figure. We start with the function lines().

– The function lines() connects a sequence of points in order. For example...

– Note: the functions plot.new() and plot.window() are used to open a new, empty canvas!

x <- c(1, 4, 7)
y <- c(2, 9, 6)
plot.new()
plot.window(xlim = c(0, 10), ylim = c(0, 10))
lines(x, y)

Of course, if the points are dense enough, you can even draw a circle!

z <- 0:1000/100
x <- sin(z) # trigonometric function: sine
y <- cos(z) # trigonometric function: cosine
plot.new()
plot.window(xlim = c(-1, 1), ylim = c(-1, 1))
lines(x, y)
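Whether the result looks round depends on the shape of the plotting region. As a variation (a sketch, not part of the original slides), the angle can run over exactly one turn, and asp = 1 in plot.window() forces the same scale on both axes.

```r
theta <- seq(0, 2 * pi, length.out = 361)  # one full turn in 1-degree steps
cx <- cos(theta)
cy <- sin(theta)

plot.new()
plot.window(xlim = c(-1, 1), ylim = c(-1, 1), asp = 1)  # asp = 1: equal scale on both axes
lines(cx, cy)
```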

Section 3: Introduction to Plotting Objects (3)

Now that we know lines(), we can add a prediction line to a scatter plot...

– The equation of the prediction line comes from linear regression. In R, a linear regression is fitted like this:

# Fit the model that provides the coordinates of the prediction line
X <- dat[,"AGE"]
Y <- dat[,"Rate"]
model <- lm(Y~X)
model

## 
## Call:
## lm(formula = Y ~ X)
## 
## Coefficients:
## (Intercept)            X  
##     78.9313       0.1021

We will leave the mathematics of linear regression to a later lesson. For now, we can add the prediction line with the function lines():

x <- c(10, 150)
y <- 78.9313 + 0.1021 * x

plot(dat[,"AGE"], dat[,"Rate"], ylab = "Heart rate", xlab = "AGE", main = "Scatter plot of AGE and Heart rate", pch = 19)
lines(x, y, col = "red", lwd = 2)

In fact, the function abline() can do the very same thing:

plot(dat[,"AGE"], dat[,"Rate"], ylab = "Heart rate", xlab = "AGE", main = "Scatter plot of AGE and Heart rate", pch = 19)
abline(model, col = "red", lwd = 2)

Section 3: Introduction to Plotting Objects (4)

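Now that we have fitted a linear regression, doesn't typing the regression coefficients by hand feel clumsy? The object model actually has a special class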

class(model)

## [1] "lm"

This special object is built on a list, and a few functions help us inspect such objects:

– The function ls() shows what the object contains

– The function names() lists the same components (in stored order rather than alphabetically)

ls(model)

##  [1] "assign"        "call"          "coefficients"  "df.residual"  
##  [5] "effects"       "fitted.values" "model"         "qr"           
##  [9] "rank"          "residuals"     "terms"         "xlevels"

With that, we can extract the regression coefficients with the indexing operator $:

COEF <- model$coefficients
COEF

## (Intercept)           X 
##  78.9312550   0.1021352

Combined with numeric indexing, isn't the whole process much more elegant?

x <- c(10, 150)
y = COEF[1] + COEF[2] * x

plot(dat[,"AGE"], dat[,"Rate"], ylab = "Heart rate", xlab = "AGE", main = "Scatter plot of AGE and Heart rate", pch = 19)
lines(x, y, col = "red", lwd = 2)
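As an alternative to indexing the coefficients by position, R also offers the accessor coef() and the function predict(). The sketch below uses simulated data (age_demo and rate_demo are hypothetical stand-ins for AGE and Rate) and checks that predict() gives the same line as the hand-written formula.

```r
# Simulated data standing in for AGE and Rate (hypothetical, not the course data)
set.seed(1)
age_demo  <- runif(200, min = 20, max = 100)
rate_demo <- 80 + 0.1 * age_demo + rnorm(200, sd = 15)
fit_demo  <- lm(rate_demo ~ age_demo)

# coef() returns the same vector as fit_demo$coefficients
b <- coef(fit_demo)

# predict() evaluates the fitted line at new inputs;
# the column name in newdata must match the predictor's name
new_x  <- data.frame(age_demo = c(10, 150))
y_pred <- predict(fit_demo, newdata = new_x)
y_hand <- b[1] + b[2] * new_x$age_demo

plot(age_demo, rate_demo, pch = 19, xlab = "AGE", ylab = "Heart rate")
lines(new_x$age_demo, y_pred, col = "red", lwd = 2)
```

predict() becomes more useful once a model has several predictors, because the formula no longer has to be written out by hand.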

Section 3: Introduction to Plotting Objects (5)

In fact, you can add even more to your figure...

The function text() adds text annotations to a figure

x = c(1, 0, -1, 0)
y = c(0, 1, 0, -1)
t = c("A", "B", "C", "D")
plot.new()
plot.window(xlim = c(-1, 1), ylim = c(-1, 1))
text(x, y, t)

The function points() adds points to a figure

x = c(1, 0, -1, 0)
y = c(0, 1, 0, -1)
plot.new()
plot.window(xlim = c(-1, 1), ylim = c(-1, 1))
points(x, y, pch = 1:4)

The function legend() adds a legend to a figure

plot.new()
plot.window(xlim = c(-1, 1), ylim = c(-1, 1))
legend("topleft", c("Female", "Male"), col = c("red", "blue"), pch = c(15, 19), bg = "gray90")
legend(0, 0, c("estimates", "95% CI"), lty = c(1, 2), lwd = 2, col = "black")

The function polygon() draws polygons

x = c(1, 0, -1, 0)
y = c(0, 1, 0, -1)
plot.new()
plot.window(xlim = c(-1, 1), ylim = c(-1, 1))
polygon(x, y, col = "green")

Exercise 2: Annotating Your Pie Chart

This is the basic pie chart:

pie(table(dat[,"GENDER"]), col = c("red", "blue"))

The number of people in each GENDER group is:

table(dat[,"GENDER"])

## 
## female   male 
##   2172   2828

– You already know how to add elements to a figure, so try to draw the figure below:

Exercise 2: Answer

You may first need the function help() to learn how to change the labels around the pie chart. You will find the argument labels

pie(table(dat[,"GENDER"]), col = c("red", "blue"), labels = c('Female', 'Male'), main = 'Distribution of gender')

Next comes adding the text. This may take some fine-tuning of the positions, so start from the origin:

pie(table(dat[,"GENDER"]), col = c("red", "blue"), labels = c('Female', 'Male'), main = 'Distribution of gender')
text(0, 0, 'test-1')
text(0, 0.3, 'test-2')

For a polished figure, use help() again to look up the arguments of text(). That is how the target figure is completed:

pie(table(dat[,"GENDER"]), col = c("red", "blue"), labels = c('Female', 'Male'), main = 'Distribution of gender')
text(0.1, 0.3, 2172, font = 2, col = 'white')
text(-0.1, -0.3, 2828, font = 2, col = 'white')

Section 4: Descriptive Statistics Under Specific Conditions (1)

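With indexing, we can obtain descriptive statistics under a given condition. For example, to get the mean AGE of the case group (LVD = 1), we can use the following syntax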

mean(dat[dat[,"LVD"] %in% 1,]$AGE, na.rm = TRUE)

## [1] 66.36768

The same works for the control group (LVD = 0)

mean(dat[dat[,"LVD"] %in% 0,]$AGE, na.rm = TRUE)

## [1] 65.91322

In this way, we can obtain descriptive statistics for any condition!

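– The syntax below differs slightly from the one above, yet gives the same result. Do you remember the difference between the two? And why do the results still agree?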

mean(dat[dat[,"LVD"] == 1,]$AGE, na.rm = TRUE)

## [1] 66.36768

mean(dat[dat[,"LVD"] == 0,]$AGE, na.rm = TRUE)

## [1] 65.91322
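One way to see the difference is a toy data frame with a missing group label (toy is hypothetical, not the course data). Comparing with == yields NA for the missing label, and indexing rows with NA adds a row filled with NA. %in% never returns NA, so no such row appears. The means agree only because na.rm = TRUE discards the extra NA.

```r
# Hypothetical toy data with a missing group label (not the course data)
toy <- data.frame(LVD = c(1, 0, NA, 1), AGE = c(70, 60, 50, 66))

toy$LVD == 1     # TRUE FALSE    NA TRUE : comparing with NA gives NA
toy$LVD %in% 1   # TRUE FALSE FALSE TRUE : %in% never returns NA

rows_eq <- toy[toy$LVD == 1, ]     # 3 rows: the NA index adds a row filled with NA
rows_in <- toy[toy$LVD %in% 1, ]   # 2 rows: only the real matches

mean_eq     <- mean(rows_eq$AGE, na.rm = TRUE)  # 68: na.rm drops the NA from the extra row
mean_in     <- mean(rows_in$AGE, na.rm = TRUE)  # 68
mean_eq_raw <- mean(rows_eq$AGE)                # NA: without na.rm the extra row matters
```

Row counts are where the two forms disagree, so nrow() on an == subset can overstate a group's size when the grouping variable has missing values.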

Section 4: Descriptive Statistics Under Specific Conditions (2)

To present the result as mean ± standard deviation, we can use the function paste() or the function paste0():

paste(mean(dat[dat[,"LVD"] %in% 1,]$AGE, na.rm = TRUE), "±", sd(dat[dat[,"LVD"] %in% 1,]$AGE, na.rm = TRUE), sep = "")

## [1] "66.3676804579562±15.9551239076992"

paste0(mean(dat[dat[,"LVD"] %in% 1,]$AGE, na.rm = TRUE), "±", sd(dat[dat[,"LVD"] %in% 1,]$AGE, na.rm = TRUE))

## [1] "66.3676804579562±15.9551239076992"

This output is hard to read. The function formatC() lets us set the number of decimal places we want

m = mean(dat[dat[,"LVD"] %in% 1,]$AGE, na.rm = TRUE)
s = sd(dat[dat[,"LVD"] %in% 1,]$AGE, na.rm = TRUE)
formatC(m, digits = 3, format = "f")

## [1] "66.368"

formatC(s, digits = 3, format = "f")

## [1] "15.955"

paste0(formatC(m, digits = 3, format = "f"), "±", formatC(s, digits = 3, format = "f"))

## [1] "66.368±15.955"
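formatC() is not the only option. As a sketch of an alternative, sprintf() formats both numbers and joins them in one call. The values m_demo and s_demo are simply the numbers printed above, typed in by hand so the example runs without the data file.

```r
m_demo <- 66.3676804579562   # the mean printed above, typed in by hand
s_demo <- 15.9551239076992   # the standard deviation printed above

txt_sprintf <- sprintf("%.3f±%.3f", m_demo, s_demo)   # "66.368±15.955"

# round() returns a number, so trailing zeros vanish when it is turned into text
txt_round <- paste0(round(2.5, 3))   # "2.5"
txt_fixed <- sprintf("%.3f", 2.5)    # "2.500"
```

For tables in a report, a fixed number of decimals (formatC() or sprintf()) usually reads better than round(), since every entry keeps the same width.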

Exercise 3: Building a Figure by Hand

We now want to show how AGE differs between the case and control groups. First we compute the key quantities by hand:

m0 = mean(dat[dat[,"LVD"] %in% 0,]$AGE, na.rm = TRUE)
s0 = sd(dat[dat[,"LVD"] %in% 0,]$AGE, na.rm = TRUE)
txt0 = paste0(formatC(m0, digits = 3, format = "f"), "±", formatC(s0, digits = 3, format = "f"))
txt0

## [1] "65.913±17.341"

m1 = mean(dat[dat[,"LVD"] %in% 1,]$AGE, na.rm = TRUE)
s1 = sd(dat[dat[,"LVD"] %in% 1,]$AGE, na.rm = TRUE)
txt1 = paste0(formatC(m1, digits = 3, format = "f"), "±", formatC(s1, digits = 3, format = "f"))
txt1

## [1] "66.368±15.955"
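Repeating the same lines for each group works for two groups but scales poorly. As a sketch of another route, tapply() applies a function within every level of a grouping variable. It is shown here on the built-in InsectSprays data so that it runs without the course file.

```r
# Mean and standard deviation of count within each spray type
grp_mean <- tapply(InsectSprays$count, InsectSprays$spray, mean, na.rm = TRUE)
grp_sd   <- tapply(InsectSprays$count, InsectSprays$spray, sd, na.rm = TRUE)

# formatC() and paste0() are vectorized, so all labels are built at once
grp_txt <- paste0(formatC(grp_mean, digits = 3, format = "f"), "±",
                  formatC(grp_sd, digits = 3, format = "f"))
names(grp_txt) <- names(grp_mean)
grp_txt
```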

Now let's draw this figure! (You will need the function barplot(); the error bars show the standard deviation)

Exercise 3: Answer

Earlier, we drew the bar chart like this:

barplot(table(dat[,"AMI"]))

You will notice that the object passed to barplot() is really just a set of numbers, so we can draw the base figure like this:

x = c(m0, m1)
barplot(x)

What remains is adjusting the arguments and adding the extra elements:

x = c(m0, m1)
barplot(x, col = c("blue", "red"), xlab = "LVD", ylab = "AGE", ylim = c(0, 120))

lines(c(1.9, 1.9), c(m1 - s1, m1 + s1), lwd = 3)
lines(c(1.75, 2.05), c(m1 + s1, m1 + s1), lwd = 3)
lines(c(1.75, 2.05), c(m1 - s1, m1 - s1), lwd = 3)
lines(c(0.7, 0.7), c(m0 - s0, m0 + s0), lwd = 3)
lines(c(0.55, 0.85), c(m0 + s0, m0 + s0), lwd = 3)
lines(c(0.55, 0.85), c(m0 - s0, m0 - s0), lwd = 3)

text(1.9, 15, txt1, col = 'white', font = 2)
text(0.7, 15, txt0, col = 'white', font = 2)

legend("topright", c("Control", "Case"), fill = c("blue", "red"))
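The x positions 0.7 and 1.9 used above are the bar centres under barplot()'s default width and spacing. A sketch that avoids hard-coding them: barplot() returns the centres, and arrows() with angle = 90 and code = 3 draws an error bar with flat caps in one call. The means and standard deviations below are rounded from the values above and typed in by hand (hypothetical stand-ins, so the code runs without the data file).

```r
# Group means and standard deviations, rounded from the values above
means <- c(Control = 65.9, Case = 66.4)
sds   <- c(Control = 17.3, Case = 16.0)

# barplot() returns the x position of each bar's centre
mid <- barplot(means, col = c("blue", "red"), xlab = "LVD", ylab = "AGE", ylim = c(0, 120))

# angle = 90 with code = 3 draws flat caps at both ends of each segment
arrows(mid, means - sds, mid, means + sds, angle = 90, code = 3, length = 0.1, lwd = 3)
text(mid, 15, sprintf("%.1f±%.1f", means, sds), col = "white", font = 2)
```

Because the positions come from barplot() itself, the error bars stay aligned if the number of bars or the spacing changes.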

Section 5: Color Transparency and Functions (1)

In the scatter plot of AGE against Rate drawn earlier, many points overlap.

– This problem is common with large datasets, and in that case we may need to show the reader how dense the points are in different regions.

plot(dat[,"AGE"], dat[,"Rate"], ylab = "Heart rate", xlab = "AGE", main = "Scatter plot of AGE and Heart rate", pch = 19)

Section 5: Color Transparency and Functions (2)

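R uses 6- or 8-digit hexadecimal color codes of the form #[red][green][blue][alpha], with two hex digits per channel. The optional last pair sets the opacity (00 is fully transparent, FF is fully opaque):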

– For example, opaque red is "#FF0000" or "#FF0000FF"

– Red at 50% transparency is "#FF000080"

– Black at 50% transparency is "#00000080"

x = c(1, 0, -1, 0)
y = c(0, 1, 0, -1)
plot.new()
plot.window(xlim = c(-1, 1), ylim = c(-1, 1))
points(x, y, pch = 19, cex = 2, col = c("#FF0000", "#FF0000FF", "#FF000080", "#00000080"))

If you don't want to work out color codes yourself, the function rgb() can mix colors for you

rgb(1, 0, 0, 0.5)

## [1] "#FF000080"

rgb(0.7, 0.5, 0.3, 0.7)

## [1] "#B3804DB3"

With semi-transparent colors, the scatter plot from before finally reveals the density

plot(dat[,"AGE"], dat[,"Rate"], ylab = "Heart rate", xlab = "AGE", main = "Scatter plot of AGE and Heart rate", pch = 19, col = "#00000080")
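Named colors can be made transparent as well. As a sketch, adjustcolor() from grDevices takes a color name and an alpha factor. The data here are simulated (x_demo and y_demo are hypothetical) so the effect can be seen without the course file.

```r
half_red    <- adjustcolor("red", alpha.f = 0.5)    # same color as rgb(1, 0, 0, 0.5)
faint_black <- adjustcolor("black", alpha.f = 0.2)  # lower alpha suits denser data

# Simulated, heavily overlapping points (not the course data)
set.seed(1)
x_demo <- rnorm(3000)
y_demo <- x_demo + rnorm(3000)
plot(x_demo, y_demo, pch = 19, col = faint_black)
```

The right alpha depends on how many points overlap: the more points, the lower the alpha needed before dense regions stop saturating to a solid color.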

Section 5: Color Transparency and Functions (3)

In fact, the function smoothScatter() draws a scatter plot similar to the one above:

smoothScatter(dat[,"AGE"], dat[,"Rate"], nrpoints = 0, ylab = "Heart rate", xlab = "AGE", main = "Scatter plot of AGE and Heart rate")

Exercise 4: Adapting Code (1)

When it comes to plotting, the most important thing is to make use of Google: if you want an attractive figure, searching is the fastest way.

– Now suppose you are still not satisfied with a single-color-scale scatter plot and want something better. Google points you to: R Scatter Plot: symbol color represents number of overlapping points


Exercise 4: Adapting Code (2)

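By copying and pasting, you should get a figure like hers (the data are random, so the exact points differ from run to run):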

## Data in a data.frame
x1 <- rnorm(n=1E3, sd=2)
x2 <- x1*1.2 + rnorm(n=1E3, sd=2)
df <- data.frame(x1,x2)

## Use densCols() output to get density at each point
x <- densCols(x1,x2, colramp=colorRampPalette(c("black", "white")))
df$dens <- col2rgb(x)[1,] + 1L

## Map densities to colors
cols <-  colorRampPalette(c("#000099", "#00FEFF", "#45FE4F", 
                            "#FCFF00", "#FF9400", "#FF3100"))(256)
df$col <- cols[df$dens]

## Plot it, reordering rows so that densest points are plotted on top
plot(x2~x1, data=df[order(df$dens),], pch=20, col=col, cex=2)

How can we apply the code from that web page to our own figure?

Exercise 4: Answer

Think of someone else's code as a function: you only need to change the input:

x1 <- dat[,"AGE"] # the key change is here
x2 <- dat[,"Rate"] # the key change is here
df <- data.frame(x1,x2)

## Use densCols() output to get density at each point
x <- densCols(x1,x2, colramp=colorRampPalette(c("black", "white")))
df$dens <- col2rgb(x)[1,] + 1L

## Map densities to colors
cols <-  colorRampPalette(c("#000099", "#00FEFF", "#45FE4F", 
                            "#FCFF00", "#FF9400", "#FF3100"))(256)
df$col <- cols[df$dens]

## Plot it, reordering rows so that densest points are plotted on top
plot(x2~x1, data=df[order(df$dens),], col = col,  ylab = "Heart rate", xlab = "AGE", main = "Scatter plot of AGE and Heart rate", pch = 19) # the key change is here
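The idea of treating borrowed code as a function can be taken literally. The sketch below wraps the same steps in a helper. The name density_scatter and its arguments are hypothetical, and the removal of missing pairs is an addition that is not in the original snippet. Note that densCols() relies on the KernSmooth package, which normally ships with R.

```r
# Hypothetical helper wrapping the snippet above; name and arguments are illustrative
density_scatter <- function(x1, x2, ...) {
  keep <- is.finite(x1) & is.finite(x2)   # drop pairs with missing values
  df <- data.frame(x1 = x1[keep], x2 = x2[keep])

  # The grey level returned by densCols() encodes local density (1 to 256 after + 1L)
  shade <- densCols(df$x1, df$x2, colramp = colorRampPalette(c("black", "white")))
  df$dens <- col2rgb(shade)[1, ] + 1L

  # Map densities to a blue-to-red palette
  cols <- colorRampPalette(c("#000099", "#00FEFF", "#45FE4F",
                             "#FCFF00", "#FF9400", "#FF3100"))(256)
  df$col <- cols[df$dens]

  # Reorder rows so that the densest points are plotted last, on top
  df <- df[order(df$dens), ]
  plot(df$x1, df$x2, col = df$col, ...)
  invisible(df)
}

# Usage on simulated data (not the course data)
set.seed(1)
a <- rnorm(1000, sd = 2)
b <- a * 1.2 + rnorm(1000, sd = 2)
res <- density_scatter(a, b, pch = 19, xlab = "x1", ylab = "x2")
```

With the course data loaded, the call would be density_scatter(dat[,"AGE"], dat[,"Rate"], pch = 19, xlab = "AGE", ylab = "Heart rate").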

Summary

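In this lesson you learned how to compute basic descriptive statistics in R, along with various ways of visualizing data. These can replace what you would otherwise do in Excel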

– Through the 4 exercises, you should find that drawing a paper-quality figure is not that hard: with some time, you can build it up piece by piece.

– In addition, figures saved as PDF are vector graphics, so they can be scaled to any size without losing sharpness, which is also important.

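In Exercise 4 we experienced the power of Google. This is a skill you need to learn: find a similar figure through Google, then adapt its code to draw an attractive figure of your own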