Missing values are one of the hallmarks of big data, and handling them is a skill you will use constantly when working with it. Below we briefly introduce a few commonly used imputation packages; the code here largely follows this article from Analytics Vidhya.
mice Package
library(mice)
library(missForest)
library(VIM)
data(iris)
iris.mis <- missForest::prodNA(iris, noNA=0.1)
summary(iris.mis)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.300 Median :1.300
## Mean :5.844 Mean :3.077 Mean :3.761 Mean :1.204
## 3rd Qu.:6.400 3rd Qu.:3.400 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.700 Max. :2.500
## NA's :15 NA's :18 NA's :13 NA's :15
## Species
## setosa :43
## versicolor:48
## virginica :45
## NA's :14
##
##
##
After using prodNA() to create the missing values, we can see that every variable now contains NAs.
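As a quick sanity check, the same kind of random holes can be punched with base R alone; the sketch below mimics prodNA() (simple uniform sampling of cells, an assumption — prodNA() may sample differently) and counts the NAs per column:

```r
# Sketch: punch ~10% random holes into iris and count NAs per column
# (mimics missForest::prodNA() using only base R)
data(iris)
set.seed(1)
iris.mis <- iris
n_cells  <- prod(dim(iris.mis))                       # 150 * 5 = 750 cells
holes    <- arrayInd(sample(n_cells, round(0.1 * n_cells)),
                     dim(iris.mis))                   # row/col pairs to blank
iris.mis[holes] <- NA
print(colSums(is.na(iris.mis)))   # NA count per column
print(mean(is.na(iris.mis)))      # overall missing rate, ~0.1
```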
md.pattern(iris.mis)
## Petal.Length Species Sepal.Length Petal.Width Sepal.Width
## 85 1 1 1 1 1 0
## 12 1 1 0 1 1 1
## 14 1 1 1 1 0 1
## 10 0 1 1 1 1 1
## 9 1 1 1 0 1 1
## 10 1 0 1 1 1 1
## 1 0 1 0 1 1 2
## 1 1 1 0 0 1 2
## 2 1 1 1 0 0 2
## 2 0 1 1 0 1 2
## 1 1 0 0 1 1 2
## 2 1 0 1 1 0 2
## 1 1 0 1 0 1 2
## 13 14 15 15 18 75
VIM Package
aggr(iris.mis, labels=names(iris.mis), numbers=T, sortVars=T,
     cex.axis=.7, cex.numbers=0.7, cex.lab=1.2,
     ylab=c("Proportion of missings","Missing pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Sepal.Width 0.12000000
## Sepal.Length 0.10000000
## Petal.Width 0.10000000
## Species 0.09333333
## Petal.Length 0.08666667
fit = mice(
  iris.mis,        # data with missing values
  m = 5,           # no. of imputed datasets
  maxit = 50,      # no. of iterations
  method = 'pmm',  # imputation method
  printFlag = F,
  seed = 500)
summary(fit) # examine the result
## Multiply imputed data set
## Call:
## mice(data = iris.mis, m = 5, method = "pmm", maxit = 50, printFlag = F,
## seed = 500)
## Number of multiple imputations: 5
## Missing cells per column:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 15 18 13 15 14
## Imputation methods:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## "pmm" "pmm" "pmm" "pmm" "pmm"
## VisitSequence:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 2 3 4 5
## PredictorMatrix:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## Sepal.Length 0 1 1 1 1
## Sepal.Width 1 0 1 1 1
## Petal.Length 1 1 0 1 1
## Petal.Width 1 1 1 0 1
## Species 1 1 1 1 0
## Random generator seed value: 500
Multiple imputation (mice) means that it generates several imputed datasets at once; we can simply pick any one of them …
dataset2 = complete(fit,2)
summary(dataset2)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.1
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.525 1st Qu.:0.3
## Median :5.750 Median :3.000 Median :4.300 Median :1.3
## Mean :5.823 Mean :3.068 Mean :3.760 Mean :1.2
## 3rd Qu.:6.400 3rd Qu.:3.375 3rd Qu.:5.100 3rd Qu.:1.8
## Max. :7.900 Max. :4.400 Max. :6.700 Max. :2.5
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
From the summary() output we can see that the missing values in every variable have been filled in.
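Instead of picking one completed dataset, mice's own recommended workflow is to fit the model on every imputed dataset and pool the coefficients with Rubin's rules via with() and pool(). A minimal, self-contained sketch (here letting mice choose its default method per column type — pmm for numerics, polyreg for the Species factor — rather than forcing pmm everywhere, and using fewer iterations to keep it fast):

```r
library(mice)
library(missForest)   # only for prodNA()
data(iris)
set.seed(500)
iris.mis <- prodNA(iris, noNA = 0.1)

# impute 5 datasets with mice's default per-column methods
imp  <- mice(iris.mis, m = 5, maxit = 10, printFlag = FALSE, seed = 500)
# fit the same regression on each of the 5 completed datasets
mods <- with(imp, lm(Sepal.Width ~ Sepal.Length + Petal.Width))
# combine the m sets of coefficients into one pooled estimate
summary(pool(mods))
```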
We can also first compare which dataset performs better, and pick the best one (the one with the highest \(R^2\)) …
formula = "Sepal.Width ~ Sepal.Length + Petal.Width"
R2 = sapply(1:5, function(i)summary(lm(formula, complete(fit,i)))$r.squared)
R2
## [1] 0.2443570 0.2362726 0.2432087 0.2577832 0.2664208
best_data = complete(fit, which.max(R2))

Amelia Package
The Amelia package assumes that the continuous variables jointly follow a multivariate normal distribution, so it can impute across variables, but it cannot handle categorical variables. When there is more data, Amelia can also use parallel processing to speed up execution.
library(Amelia)
fit = amelia(
iris.mis,
m=2000, # 2000 imputations, to demo parallel processing
parallel="snow", # parallel ("snow/multicore" for windows/mac)
ncpus=4, # no. CPU core to use
  noms=c("Species") # amelia cannot handle categorical variables
)
Note that amelia cannot handle categorical variables, which is why Species is declared nominal here.
dataset4 = fit$imp[[4]]
summary(dataset4)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.297 Min. :2.000 Min. :0.9683 Min. :0.04234
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.6000 1st Qu.:0.30000
## Median :5.800 Median :3.000 Median :4.3000 Median :1.30000
## Mean :5.829 Mean :3.067 Mean :3.7660 Mean :1.20867
## 3rd Qu.:6.400 3rd Qu.:3.394 3rd Qu.:5.1000 3rd Qu.:1.80000
## Max. :7.900 Max. :4.400 Max. :6.7000 Max. :2.50000
## Species
## setosa :46
## versicolor:57
## virginica :47
##
##
##
R2 = sapply(1:length(fit$imp),
function(i)summary(lm(formula,fit$imp[[i]]))$r.squared)
hist(R2, 20)
c(which.max(R2), max(R2)) # the maximum R2 is ...
## [1] 1306.0000000 0.2933382
c(mean(R2), median(R2)) # in practice, we should choose around the mean
## [1] 0.2300343 0.2297963
Above we see that \(R^2\) has quite a wide distribution; to avoid overfitting, in practice we should pick a dataset whose \(R^2\) falls around the mean or the median.
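The "pick near the median" rule is easy to encode. pick_median_fit() below is a hypothetical helper (not part of Amelia) that returns the index of the \(R^2\) value closest to the median:

```r
# Hypothetical helper: index of the R^2 value closest to the median
pick_median_fit <- function(r2) which.min(abs(r2 - median(r2)))

r2_demo <- c(0.293, 0.210, 0.231, 0.229, 0.251)
pick_median_fit(r2_demo)   # -> 3 (0.231 is exactly the median here)
```

With the Amelia fit above, the chosen dataset would then be `fit$imp[[pick_median_fit(R2)]]`.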
missForest Package
missForest uses a non-parametric random forest algorithm, so it can handle categorical variables.
library(missForest)
fit = missForest(iris.mis)
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
## missForest iteration 5 in progress...done!
It produces only a single output, so there is no dataset-selection problem.
dataset = fit$ximp
summary(dataset)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.538 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.300 Median :1.300
## Mean :5.842 Mean :3.061 Mean :3.753 Mean :1.196
## 3rd Qu.:6.400 3rd Qu.:3.330 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.700 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
missForest also provides an OOB (out-of-bag) estimate of the imputation error:
fit$OOBerror
## NRMSE PFC
## 0.14076951 0.04411765
We can also estimate the error of the imputed data against the actual data:
mixError(fit$ximp, iris.mis, iris)
## NRMSE PFC
## 0.1341296 0.0000000
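For intuition, NRMSE is commonly defined (following Stekhoven and Bühlmann's missForest paper) as the RMSE over the originally-missing numeric cells, normalized by the variance of the true values. The hand-rolled sketch below illustrates that definition on a toy vector; the exact normalization inside mixError() may differ slightly:

```r
# Sketch of NRMSE over the originally-missing cells of a numeric vector
nrmse <- function(ximp, xmis, xtrue) {
  miss <- is.na(xmis)                         # which cells were missing
  sqrt(mean((ximp[miss] - xtrue[miss])^2) /   # RMSE on those cells ...
       var(xtrue[miss]))                      # ... normalized by variance
}

xtrue <- c(1, 2, 3, 4, 5)     # complete data
xmis  <- c(1, NA, 3, NA, 5)   # data with holes
ximp  <- c(1, 2.5, 3, 3.5, 5) # imputed data
nrmse(ximp, xmis, xtrue)      # sqrt(0.25 / 2) = 0.3535534
```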
Taking advantage of this, we can try tuning the random forest parameters by repeated sampling:
library(doParallel)
cores <- makeCluster(detectCores())
registerDoParallel(cores); getDoParWorkers()
boost = foreach(ntree=seq(40,320,40), .combine=rbind) %:%
foreach(mtry=2:4, .combine=rbind) %dopar% {
library(missForest)
x = rowMeans( sapply(1:20, function(i){ # repeat 20 times
mis = prodNA(iris, noNA = 0.1)
mod = missForest(mis, maxiter = 25, ntree=ntree, mtry=mtry)
mixError(mod$ximp, mis, iris)
}))
data.frame(ntree, mtry, NRMSE=x[1], PFC=x[2])
}
stopCluster(cores)
Use ggplot() to plot the results of the repeated sampling:
library(dplyr)
library(ggplot2)
library(reshape2)
melt(boost, id=c('ntree', 'mtry')) %>%
ggplot(aes(x=ntree, y=value, col=variable)) +
geom_line(size=1.2,alpha=0.5) + geom_point() +
  facet_grid(variable~mtry, scales="free_y")
Because the sample is small, the tuning results are not very stable, but from the plot above ntree=200, mtry=2 looks like a good choice. (Because of the random sampling, the results will differ somewhat from run to run.)
mis = prodNA(iris, noNA = 0.1)
mod = missForest(mis, maxiter = 25, ntree=200, mtry=2)
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
## missForest iteration 5 in progress...done!
## missForest iteration 6 in progress...done!
mixError(mod$ximp, mis, iris)
## NRMSE PFC
## 0.1872111 0.0000000
To keep this post from getting too long, we will introduce the Hmisc and mi packages next time …