Missing values are one of the hallmarks of big data, and imputing them is a skill you will often need when working with real datasets. Below we briefly introduce several popular imputation packages in R; the code mainly follows this article from Analytics Vidhya.



1. The mice Package

Create data set with NA

library(mice)
library(missForest)
library(VIM)

data(iris)
iris.mis <- missForest::prodNA(iris, noNA=0.1)  
summary(iris.mis) 
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.300   Median :1.300  
##  Mean   :5.844   Mean   :3.077   Mean   :3.761   Mean   :1.204  
##  3rd Qu.:6.400   3rd Qu.:3.400   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.700   Max.   :2.500  
##  NA's   :15      NA's   :18      NA's   :13      NA's   :15     
##        Species  
##  setosa    :43  
##  versicolor:48  
##  virginica :45  
##  NA's      :14  
##                 
##                 
## 

After generating missing values with prodNA(), we can see that every variable now contains NAs.
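The per-variable NA counts shown in the summary above can also be computed directly (a small sketch, assuming iris.mis from the previous step):

```r
# count the missing values in each column
colSums(is.na(iris.mis))
```

This returns a named vector, which is convenient when you want to filter or sort columns by their amount of missingness.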

Examine pattern of missing data

md.pattern(iris.mis)           
##    Petal.Length Species Sepal.Length Petal.Width Sepal.Width   
## 85            1       1            1           1           1  0
## 12            1       1            0           1           1  1
## 14            1       1            1           1           0  1
## 10            0       1            1           1           1  1
##  9            1       1            1           0           1  1
## 10            1       0            1           1           1  1
##  1            0       1            0           1           1  2
##  1            1       1            0           0           1  2
##  2            1       1            1           0           0  2
##  2            0       1            1           0           1  2
##  1            1       0            0           1           1  2
##  2            1       0            1           1           0  2
##  1            1       0            1           0           1  2
##              13      14           15          15          18 75

Plot the pattern of missing data (VIM package)

aggr(iris.mis, labels=names(iris.mis), numbers=T, sortVars=T, 
     cex.axis=.7, cex.numbers=0.7, cex.lab=1.2,
     ylab=c("Proportion of missings", "Missing pattern"))

## 
##  Variables sorted by number of missings: 
##      Variable      Count
##   Sepal.Width 0.12000000
##  Sepal.Length 0.10000000
##   Petal.Width 0.10000000
##       Species 0.09333333
##  Petal.Length 0.08666667

Impute data with mice

fit = mice(
  iris.mis,       # data with missing values
  m=5,            # no. of imputed data sets
  maxit = 50,     # no. of iterations
  method = 'pmm', # imputation method
  printFlag=F, 
  seed = 500)

summary(fit)  # examine the result
## Multiply imputed data set
## Call:
## mice(data = iris.mis, m = 5, method = "pmm", maxit = 50, printFlag = F, 
##     seed = 500)
## Number of multiple imputations:  5
## Missing cells per column:
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##           15           18           13           15           14 
## Imputation methods:
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##        "pmm"        "pmm"        "pmm"        "pmm"        "pmm" 
## VisitSequence:
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##            1            2            3            4            5 
## PredictorMatrix:
##              Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## Sepal.Length            0           1            1           1       1
## Sepal.Width             1           0            1           1       1
## Petal.Length            1           1            0           1       1
## Petal.Width             1           1            1           0       1
## Species                 1           1            1           1       0
## Random generator seed value:  500

Select one of the alternative datasets

Multiple imputation (as done by mice) produces several alternative datasets at once; we can simply pick one of them …

dataset2 = complete(fit,2)
summary(dataset2) 
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width 
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.1  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.525   1st Qu.:0.3  
##  Median :5.750   Median :3.000   Median :4.300   Median :1.3  
##  Mean   :5.823   Mean   :3.068   Mean   :3.760   Mean   :1.2  
##  3rd Qu.:6.400   3rd Qu.:3.375   3rd Qu.:5.100   3rd Qu.:1.8  
##  Max.   :7.900   Max.   :4.400   Max.   :6.700   Max.   :2.5  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

From the summary() output we can see that the missing values in every variable have been filled in.
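Besides picking one completed dataset, the workflow mice is actually designed around is to fit the same model on every imputed dataset and pool the estimates with Rubin's rules. A minimal sketch, reusing the fit object from above:

```r
# fit the regression on each of the m imputed datasets,
# then pool coefficients and standard errors across them
est = with(fit, lm(Sepal.Width ~ Sepal.Length + Petal.Width))
summary(pool(est))
```

Pooling propagates the imputation uncertainty into the standard errors, which selecting a single dataset does not.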

Compare the alternative datasets

We can also compare the datasets first and pick the one that performs best (i.e., has the highest \(R^2\)) …

formula = "Sepal.Width ~ Sepal.Length + Petal.Width"
R2 = sapply(1:5, function(i)summary(lm(formula, complete(fit,i)))$r.squared)
R2
## [1] 0.2443570 0.2362726 0.2432087 0.2577832 0.2664208
best_data = complete(fit, which.max(R2))




2. The Amelia Package

The Amelia package assumes that the continuous variables follow a multivariate normal distribution, which lets it impute across variables; categorical variables fall outside that assumption and must be declared separately. When the data is large, Amelia can also use parallel processing to speed things up.

Impute missing data with Amelia

library(Amelia)
fit = amelia(
  iris.mis,  
  m=2000,            # 2000 imputations, to demo parallel processing
  parallel="snow",   # parallel backend ("snow" on Windows, "multicore" on Mac/Linux)
  ncpus=4,           # no. of CPU cores to use
  noms=c("Species")  # declare the categorical variables
  )

Note that categorical variables must be declared with the noms argument, since Amelia's normality assumption does not cover them.

Select one of the alternative datasets

dataset4 = fit$imp[[4]]
summary(dataset4)
##   Sepal.Length    Sepal.Width     Petal.Length     Petal.Width     
##  Min.   :4.297   Min.   :2.000   Min.   :0.9683   Min.   :0.04234  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.6000   1st Qu.:0.30000  
##  Median :5.800   Median :3.000   Median :4.3000   Median :1.30000  
##  Mean   :5.829   Mean   :3.067   Mean   :3.7660   Mean   :1.20867  
##  3rd Qu.:6.400   3rd Qu.:3.394   3rd Qu.:5.1000   3rd Qu.:1.80000  
##  Max.   :7.900   Max.   :4.400   Max.   :6.7000   Max.   :2.50000  
##        Species  
##  setosa    :46  
##  versicolor:57  
##  virginica :47  
##                 
##                 
## 

Compare the alternative datasets

R2 = sapply(1:length(fit$imp), 
            function(i)summary(lm(formula,fit$imp[[i]]))$r.squared)
hist(R2,20)

c(which.max(R2), max(R2))  # the maximum R2 is ...
## [1] 1306.0000000    0.2933382
c(mean(R2), median(R2))    # in practice,  we should choose around the mean
## [1] 0.2300343 0.2297963

We can see that \(R^2\) has a fairly wide distribution; to avoid over-fitting, in practice we should pick a dataset whose \(R^2\) falls around the mean or the median.
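Following that advice, one way to pick the dataset whose \(R^2\) sits closest to the median (a sketch, using the R2 vector computed above):

```r
# index of the imputation whose R^2 is nearest the median
idx = which.min(abs(R2 - median(R2)))
dataset = fit$imp[[idx]]
```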


3. The missForest Package

Impute missing data with missForest

missForest uses a non-parametric random forest algorithm, so it can handle categorical variables directly.

library(missForest)
fit = missForest(iris.mis)
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
##   missForest iteration 5 in progress...done!

The imputed datasets

It produces only one imputed dataset, so there is no selection problem.

dataset = fit$ximp 
summary(dataset)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.538   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.300   Median :1.300  
##  Mean   :5.842   Mean   :3.061   Mean   :3.753   Mean   :1.196  
##  3rd Qu.:6.400   3rd Qu.:3.330   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.700   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Examine the imputation error

missForest provides OOB (out-of-bag) error estimates:

  • NRMSE: Normalized RMSE, for the continuous variables
  • PFC: Proportion of Falsely Classified entries, for the categorical variables

fit$OOBerror
##      NRMSE        PFC 
## 0.14076951 0.04411765
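missForest can also break the OOB error down per variable via the variablewise argument (a sketch; note that with variablewise=TRUE the numeric columns report MSE rather than NRMSE):

```r
# per-variable OOB error: MSE for numeric columns, PFC for factors
fit2 = missForest(iris.mis, variablewise = TRUE)
fit2$OOBerror
```

This is useful for spotting which individual variables are imputed poorly.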

We can also measure the error of the imputed data against the actual data:

mixError(fit$ximp, iris.mis, iris)
##     NRMSE       PFC 
## 0.1341296 0.0000000

Tuning Parameters by Repeated Sampling

Taking advantage of this, we can try to tune the random forest parameters by repeatedly resampling the missing pattern:

library(doParallel)
cores <- makeCluster(detectCores())
registerDoParallel(cores); getDoParWorkers()
boost = foreach(ntree=seq(40,320,40), .combine=rbind) %:% 
  foreach(mtry=2:4, .combine=rbind) %dopar% {
    library(missForest)
    x = rowMeans( sapply(1:20, function(i){ # repeat 20 times
      mis = prodNA(iris, noNA = 0.1)
      mod = missForest(mis, maxiter = 25, ntree=ntree, mtry=mtry)
      mixError(mod$ximp, mis, iris)
    }))
    data.frame(ntree, mtry, NRMSE=x[1], PFC=x[2])
  }
stopCluster(cores)

Plot the repeated-sampling results with ggplot():

library(dplyr)
library(ggplot2)
library(reshape2)
melt(boost, id=c('ntree', 'mtry')) %>% 
  ggplot(aes(x=ntree, y=value, col=variable)) +
  geom_line(size=1.2,alpha=0.5) + geom_point() +
  facet_grid(variable~mtry, scales="free_y")

Because the sample is small, the tuning results are not very stable, but from the plot above ntree=200, mtry=2 looks like a reasonable choice. (Because of random sampling, the results will differ from run to run.)

mis = prodNA(iris, noNA = 0.1)
mod = missForest(mis, maxiter = 25, ntree=200, mtry=2)
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
##   missForest iteration 5 in progress...done!
##   missForest iteration 6 in progress...done!
mixError(mod$ximp, mis, iris) 
##     NRMSE       PFC 
## 0.1872111 0.0000000


To keep this post from getting too long, we will introduce the Hmisc and mi packages next time …