上一篇文章裡面我們介紹了:

這三個R套件,這裡我們繼續介紹其它兩個補缺項的套件;

這裡我們使用的程式主要是參考Analytics Vidhya的這一篇文章


上次一樣,我們利用prodNA()產生缺項之後,可以看到每一個變數裡面都有缺項。

library(mice)
library(missForest)
library(VIM)
data(iris)

# Create data set with NA
iris.mis <- missForest::prodNA(iris, noNA=0.1)  
summary(iris.mis) 
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.200   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.500   1st Qu.:0.350  
##  Median :5.800   Median :3.000   Median :4.400   Median :1.300  
##  Mean   :5.853   Mean   :3.065   Mean   :3.769   Mean   :1.217  
##  3rd Qu.:6.400   3rd Qu.:3.375   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##  NA's   :8       NA's   :20      NA's   :14      NA's   :15     
##        Species  
##  setosa    :45  
##  versicolor:45  
##  virginica :42  
##  NA's      :18  
##                 
##                 
## 
# Examine pattern of missing data
md.pattern(iris.mis)           
##    Sepal.Length Petal.Length Petal.Width Species Sepal.Width   
## 93            1            1           1       1           1  0
##  2            0            1           1       1           1  1
## 14            1            1           1       1           0  1
##  8            1            0           1       1           1  1
##  8            1            1           0       1           1  1
## 14            1            1           1       0           1  1
##  1            0            1           1       1           0  2
##  1            0            0           1       1           1  2
##  1            1            1           0       1           0  2
##  2            1            0           0       1           1  2
##  1            1            1           0       0           1  2
##  1            0            0           0       1           1  3
##  1            0            1           1       0           0  3
##  1            1            0           1       0           0  3
##  1            0            0           0       1           0  4
##  1            0            1           0       0           0  4
##               8           14          15      18          20 75
# Plot the pattern of missing data (`VIM` package)
aggr(iris.mis,labels=names(iris.mis), numbers=T, sortVars=T, 
     cex.axis=.7, cex.numbers=0.7, cex.lab=1.2,
     ylab=c("缺項比率","缺項樣式"))

## 
##  Variables sorted by number of missings: 
##      Variable      Count
##   Sepal.Width 0.13333333
##       Species 0.12000000
##   Petal.Width 0.10000000
##  Petal.Length 0.09333333
##  Sepal.Length 0.05333333



4. The Hmisc Package

Hmisc是一個多工能的輔助性套件,除了補缺項之外,它還有一系列估計模型適配度和模型除錯的功能。在補缺項這一個功能上面,它提供兩種作法:Hmisc::impute()Hmisc::aregImput()

4.1 Impute data with Hmisc::impute()

Hmisc::impute()這一個function比較簡單,它基本上就是按照使用者指定的方式(mean,median,max,etc.),一次只補一個變數的缺項 …

library(Hmisc)

# 使用平均值補缺項
iris.mis$imputed_age <- with(iris.mis, impute(Sepal.Length, mean))

# 使用隨機抽樣補缺項
iris.mis$imputed_age2 <- with(iris.mis, impute(Sepal.Length, 'random'))

4.2 Impute data with Hmisc::aregImput()

Hmisc::aregImput()則是和mice一樣、假設各變數之間有線性關係,使用者可以選擇使用predictive mean matching (default), boostrapping, or non-parametric additive model (see the manual page of Hmisc for detail)來補齊缺項 …

# 使用predictive mean matching (default)補缺項
impute_arg <- aregImpute(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width +
Species, data = iris.mis, n.impute = 5)
## Iteration 1 
Iteration 2 
Iteration 3 
Iteration 4 
Iteration 5 
Iteration 6 
Iteration 7 
Iteration 8 
# check the result of imputation
impute_arg
## 
## Multiple Imputation using Bootstrap and PMM
## 
## aregImpute(formula = ~Sepal.Length + Sepal.Width + Petal.Length + 
##     Petal.Width + Species, data = iris.mis, n.impute = 5)
## 
## n: 150   p: 5    Imputations: 5      nk: 3 
## 
## Number of NAs:
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##            8           20           14           15           18 
## 
##              type d.f.
## Sepal.Length    s    2
## Sepal.Width     s    2
## Petal.Length    s    2
## Petal.Width     s    2
## Species         c    2
## 
## Transformation of Target Variables Forced to be Linear
## 
## R-squares for Predicting Non-Missing Values for Each Variable
## Using Last Imputations of Predictors
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##        0.863        0.631        0.979        0.942        0.992

miForest一樣,Hmisc::aregImput()也可以處理類別變數。


5. The mi Package

mi這個套件的使用方法和前面幾個套件很類似,一般只要用預設的參數就可以補齊缺項:

5.1 Impute missing data with mi

library(mi)
mi_data <- mi(iris.mis, seed = 335)
summary(mi_data)
## $Sepal.Length
## $Sepal.Length$is_missing
## missing
## FALSE  TRUE 
##   142     8 
## 
## $Sepal.Length$imputed
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.1754369 -0.0421273 -0.0115479 -0.0330390 -0.0007415  0.0766148 
## 
## $Sepal.Length$observed
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.94255 -0.45695 -0.03206  0.00000  0.33214  1.24262 
## 
## 
## $Sepal.Width
## $Sepal.Width$is_missing
## missing
## FALSE  TRUE 
##   130    20 
## 
## $Sepal.Width$imputed
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.75498 -0.51236 -0.13400 -0.09021  0.18794  2.27421 
## 
## $Sepal.Width$observed
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.98300 -0.30145 -0.07427  0.00000  0.38009  1.51601 
## 
## 
## $Petal.Length
## $Petal.Length$is_missing
## missing
## FALSE  TRUE 
##   136    14 
## 
## $Petal.Length$imputed
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.76672 -0.45859  0.06203 -0.01937  0.28273  0.70827 
## 
## $Petal.Length$observed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.7707 -0.6316  0.1756  0.0000  0.3704  0.8714 
## 
## 
## $Petal.Width
## $Petal.Width$is_missing
## missing
## FALSE  TRUE 
##   135    15 
## 
## $Petal.Width$imputed
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.90286 -0.62470 -0.08748 -0.08582  0.38468  1.01879 
## 
## $Petal.Width$observed
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.74920 -0.61506  0.05564  0.00000  0.39099  0.86049 
## 
## 
## $Species
## $Species$crosstab
##             
##              observed imputed
##   setosa          180      24
##   versicolor      180      17
##   virginica       168      31
## 
## 
## $imputed_age
## $imputed_age$is_missing
## [1] "all values observed"
## 
## $imputed_age$observed
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.96892 -0.46974 -0.01648  0.00000  0.34143  1.27739 
## 
## 
## $imputed_age2
## $imputed_age2$is_missing
## [1] "all values observed"
## 
## $imputed_age2$observed
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.93254 -0.44515 -0.04915  0.00000  0.34686  1.26071

詳細的說明請參考mi的說明文件.

以上介紹的幾個方法,各有各的優缺點,實務上我會建議大家多試幾種方法,再從裡面選擇效果比較好的來用。