Missing values are one of the hallmarks of big data, and imputing them is a skill you will often need when working with real datasets. Below we briefly introduce several popular imputation packages in R; the code mainly follows this article from Analytics Vidhya.



1. The mice Package

Create data set with NA

library(mice)
library(missForest)
library(VIM)

data(iris)
iris.mis <- missForest::prodNA(iris, noNA=0.1)  
summary(iris.mis) 
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.300   Median :1.300  
##  Mean   :5.844   Mean   :3.077   Mean   :3.761   Mean   :1.204  
##  3rd Qu.:6.400   3rd Qu.:3.400   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.700   Max.   :2.500  
##  NA's   :15      NA's   :18      NA's   :13      NA's   :15     
##        Species  
##  setosa    :43  
##  versicolor:48  
##  virginica :45  
##  NA's      :14  
##                 
##                 
## 

After generating missing values with prodNA(), we can see that every variable now contains NAs.
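The per-variable NA counts shown in the summary above can also be computed directly (a small sketch, assuming iris.mis from the previous step):

```r
# count the missing values in each column
colSums(is.na(iris.mis))
```

This returns a named vector, which is convenient when you want to filter or sort columns by their amount of missingness.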

Examine pattern of missing data

md.pattern(iris.mis)           
##    Petal.Length Species Sepal.Length Petal.Width Sepal.Width   
## 85            1       1            1           1           1  0
## 12            1       1            0           1           1  1
## 14            1       1            1           1           0  1
## 10            0       1            1           1           1  1
##  9            1       1            1           0           1  1
## 10            1       0            1           1           1  1
##  1            0       1            0           1           1  2
##  1            1       1            0           0           1  2
##  2            1       1            1           0           0  2
##  2            0       1            1           0           1  2
##  1            1       0            0           1           1  2
##  2            1       0            1           1           0  2
##  1            1       0            1           0           1  2
##              13      14           15          15          18 75

Plot the pattern of missing data (VIM package)

aggr(iris.mis, labels=names(iris.mis), numbers=T, sortVars=T, 
     cex.axis=.7, cex.numbers=0.7, cex.lab=1.2,
     ylab=c("Proportion of missings", "Missing pattern"))

## 
##  Variables sorted by number of missings: 
##      Variable      Count
##   Sepal.Width 0.12000000
##  Sepal.Length 0.10000000
##   Petal.Width 0.10000000
##       Species 0.09333333
##  Petal.Length 0.08666667

Impute data with mice

fit = mice(
  iris.mis,       # data with missing values
  m=5,            # no. of imputed data sets
  maxit = 50,     # no. of iterations
  method = 'pmm', # imputation method
  printFlag=F, 
  seed = 500)

summary(fit)  # examine the result
## Multiply imputed data set
## Call:
## mice(data = iris.mis, m = 5, method = "pmm", maxit = 50, printFlag = F, 
##     seed = 500)
## Number of multiple imputations:  5
## Missing cells per column:
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##           15           18           13           15           14 
## Imputation methods:
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##        "pmm"        "pmm"        "pmm"        "pmm"        "pmm" 
## VisitSequence:
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##            1            2            3            4            5 
## PredictorMatrix:
##              Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## Sepal.Length            0           1            1           1       1
## Sepal.Width             1           0            1           1       1
## Petal.Length            1           1            0           1       1
## Petal.Width             1           1            1           0       1
## Species                 1           1            1           1       0
## Random generator seed value:  500

Select one of the alternative datasets

Multiple imputation (as done by mice) produces several alternative datasets at once; we can simply pick one of them …

dataset2 = complete(fit,2)
summary(dataset2) 
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width 
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.1  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.525   1st Qu.:0.3  
##  Median :5.750   Median :3.000   Median :4.300   Median :1.3  
##  Mean   :5.823   Mean   :3.068   Mean   :3.760   Mean   :1.2  
##  3rd Qu.:6.400   3rd Qu.:3.375   3rd Qu.:5.100   3rd Qu.:1.8  
##  Max.   :7.900   Max.   :4.400   Max.   :6.700   Max.   :2.5  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

From the summary() output we can see that the missing values in every variable have been filled in.
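Besides picking one completed dataset, the workflow mice is actually designed around is to fit the same model on every imputed dataset and pool the estimates with Rubin's rules. A minimal sketch, reusing the fit object from above:

```r
# fit the regression on each of the m imputed datasets,
# then pool coefficients and standard errors across them
est = with(fit, lm(Sepal.Width ~ Sepal.Length + Petal.Width))
summary(pool(est))
```

Pooling propagates the imputation uncertainty into the standard errors, which selecting a single dataset does not.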

Compare the alternative datasets

We can also compare the datasets first and pick the one that performs best (i.e., has the highest \(R^2\)) …

formula = "Sepal.Width ~ Sepal.Length + Petal.Width"
R2 = sapply(1:5, function(i)summary(lm(formula, complete(fit,i)))$r.squared)
R2
## [1] 0.2443570 0.2362726 0.2432087 0.2577832 0.2664208
best_data = complete(fit, which.max(R2))




2. The Amelia Package

The Amelia package assumes that the continuous variables follow a multivariate normal distribution, which lets it impute across variables; categorical variables fall outside that assumption and must be declared separately. When the data is large, Amelia can also use parallel processing to speed things up.

Impute missing data with Amelia

library(Amelia)
fit = amelia(
  iris.mis,  
  m=2000,            # 2000 imputations, to demo parallel processing
  parallel="snow",   # parallel backend ("snow" on Windows, "multicore" on Mac/Linux)
  ncpus=4,           # no. of CPU cores to use
  noms=c("Species")  # declare the categorical variables
  )

Note that categorical variables must be declared with the noms argument, since Amelia's normality assumption does not cover them.

Select one of the alternative datasets

dataset4 = fit$imp[[4]]
summary(dataset4)
##   Sepal.Length    Sepal.Width     Petal.Length     Petal.Width     
##  Min.   :4.297   Min.   :2.000   Min.   :0.9683   Min.   :0.04234  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.6000   1st Qu.:0.30000  
##  Median :5.800   Median :3.000   Median :4.3000   Median :1.30000  
##  Mean   :5.829   Mean   :3.067   Mean   :3.7660   Mean   :1.20867  
##  3rd Qu.:6.400   3rd Qu.:3.394   3rd Qu.:5.1000   3rd Qu.:1.80000  
##  Max.   :7.900   Max.   :4.400   Max.   :6.7000   Max.   :2.50000  
##        Species  
##  setosa    :46  
##  versicolor:57  
##  virginica :47  
##                 
##                 
## 

Compare the alternative datasets

R2 = sapply(1:length(fit$imp), 
            function(i)summary(lm(formula,fit$imp[[i]]))$r.squared)
hist(R2,20)

c(which.max(R2), max(R2))  # the maximum R2 is ...
## [1] 1306.0000000    0.2933382
c(mean(R2), median(R2))    # in practice,  we should choose around the mean
## [1] 0.2300343 0.2297963

We can see that \(R^2\) has a fairly wide distribution; to avoid over-fitting, in practice we should pick a dataset whose \(R^2\) falls around the mean or the median.
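Following that advice, one way to pick the dataset whose \(R^2\) sits closest to the median (a sketch, using the R2 vector computed above):

```r
# index of the imputation whose R^2 is nearest the median
idx = which.min(abs(R2 - median(R2)))
dataset = fit$imp[[idx]]
```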


3. The missForest Package

Impute missing data with missForest

missForest uses a non-parametric random forest algorithm, so it can handle categorical variables directly.

library(missForest)
fit = missForest(iris.mis)
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
##   missForest iteration 5 in progress...done!

The imputed datasets

It produces only one imputed dataset, so there is no selection problem.

dataset = fit$ximp 
summary(dataset)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.538   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.300   Median :1.300  
##  Mean   :5.842   Mean   :3.061   Mean   :3.753   Mean   :1.196  
##  3rd Qu.:6.400   3rd Qu.:3.330   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.700   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Examine the imputation error

missForest provides OOB (out-of-bag) error estimates:

  • NRMSE: Normalized RMSE, for the continuous variables
  • PFC: Proportion of Falsely Classified entries, for the categorical variables

fit$OOBerror
##      NRMSE        PFC 
## 0.14076951 0.04411765
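missForest can also break the OOB error down per variable via the variablewise argument (a sketch; note that with variablewise=TRUE the numeric columns report MSE rather than NRMSE):

```r
# per-variable OOB error: MSE for numeric columns, PFC for factors
fit2 = missForest(iris.mis, variablewise = TRUE)
fit2$OOBerror
```

This is useful for spotting which individual variables are imputed poorly.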

We can also measure the error of the imputed data against the actual data:

mixError(fit$ximp, iris.mis, iris)
##     NRMSE       PFC 
## 0.1341296 0.0000000

Tuning Parameters by Repeated Sampling

Taking advantage of this, we can try to tune the random forest parameters by repeatedly resampling the missing pattern:

library(doParallel)
cores <- makeCluster(detectCores())
registerDoParallel(cores); getDoParWorkers()
boost = foreach(ntree=seq(40,320,40), .combine=rbind) %:% 
  foreach(mtry=2:4, .combine=rbind) %dopar% {
    library(missForest)
    x = rowMeans( sapply(1:20, function(i){ # repeat 20 times
      mis = prodNA(iris, noNA = 0.1)
      mod = missForest(mis, maxiter = 25, ntree=ntree, mtry=mtry)
      mixError(mod$ximp, mis, iris)
    }))
    data.frame(ntree, mtry, NRMSE=x[1], PFC=x[2])
  }
stopCluster(cores)

Plot the repeated-sampling results with ggplot():

library(dplyr)
library(ggplot2)
library(reshape2)
melt(boost, id=c('ntree', 'mtry')) %>% 
  ggplot(aes(x=ntree, y=value, col=variable)) +
  geom_line(size=1.2,alpha=0.5) + geom_point() +
  facet_grid(variable~mtry, scales="free_y")

Because the sample is small, the tuning results are not very stable, but from the plot above ntree=200, mtry=2 looks like a reasonable choice. (Because of random sampling, the results will differ from run to run.)

mis = prodNA(iris, noNA = 0.1)
mod = missForest(mis, maxiter = 25, ntree=200, mtry=2)
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
##   missForest iteration 5 in progress...done!
##   missForest iteration 6 in progress...done!
mixError(mod$ximp, mis, iris) 
##     NRMSE       PFC 
## 0.1872111 0.0000000


To keep this post from getting too long, we will introduce the Hmisc and mi packages next time …