上一篇文章裡面我們介紹了:
這三個R套件,這裡我們繼續介紹其它兩個補缺項的套件;
這裡我們使用的程式主要是參考Analytics Vidhya的這一篇文章。
跟上次一樣,我們利用prodNA()
產生缺項之後,可以看到每一個變數裡面都有缺項。
library(mice)
library(missForest)
library(VIM)
data(iris)
# Create data set with NA
iris.mis <- missForest::prodNA(iris, noNA=0.1)
summary(iris.mis)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.200 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.500 1st Qu.:0.350
## Median :5.800 Median :3.000 Median :4.400 Median :1.300
## Mean :5.853 Mean :3.065 Mean :3.769 Mean :1.217
## 3rd Qu.:6.400 3rd Qu.:3.375 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## NA's :8 NA's :20 NA's :14 NA's :15
## Species
## setosa :45
## versicolor:45
## virginica :42
## NA's :18
##
##
##
# Examine pattern of missing data
md.pattern(iris.mis)
## Sepal.Length Petal.Length Petal.Width Species Sepal.Width
## 93 1 1 1 1 1 0
## 2 0 1 1 1 1 1
## 14 1 1 1 1 0 1
## 8 1 0 1 1 1 1
## 8 1 1 0 1 1 1
## 14 1 1 1 0 1 1
## 1 0 1 1 1 0 2
## 1 0 0 1 1 1 2
## 1 1 1 0 1 0 2
## 2 1 0 0 1 1 2
## 1 1 1 0 0 1 2
## 1 0 0 0 1 1 3
## 1 0 1 1 0 0 3
## 1 1 0 1 0 0 3
## 1 0 0 0 1 0 4
## 1 0 1 0 0 0 4
## 8 14 15 18 20 75
# Plot the pattern of missing data (`VIM` package)
aggr(iris.mis,labels=names(iris.mis), numbers=T, sortVars=T,
cex.axis=.7, cex.numbers=0.7, cex.lab=1.2,
ylab=c("缺項比率","缺項樣式"))
##
## Variables sorted by number of missings:
## Variable Count
## Sepal.Width 0.13333333
## Species 0.12000000
## Petal.Width 0.10000000
## Petal.Length 0.09333333
## Sepal.Length 0.05333333
Hmisc
PackageHmisc
是一個多工能的輔助性套件,除了補缺項之外,它還有一系列估計模型適配度和模型除錯的功能。在補缺項這一個功能上面,它提供兩種作法:Hmisc::impute()
和Hmisc::aregImput()
Hmisc::impute()
Hmisc::impute()
這一個function比較簡單,它基本上就是按照使用者指定的方式(mean
,median
,max
,etc.),一次只補一個變數的缺項 …
library(Hmisc)
# 使用平均值補缺項
iris.mis$imputed_age <- with(iris.mis, impute(Sepal.Length, mean))
# 使用隨機抽樣補缺項
iris.mis$imputed_age2 <- with(iris.mis, impute(Sepal.Length, 'random'))
Hmisc::aregImput()
Hmisc::aregImput()
則是和mice
一樣、假設各變數之間有線性關係,使用者可以選擇使用predictive mean matching (default), boostrapping, or non-parametric additive model (see the manual page of Hmisc
for detail)來補齊缺項 …
# 使用predictive mean matching (default)補缺項
impute_arg <- aregImpute(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width +
Species, data = iris.mis, n.impute = 5)
## Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Iteration 6
Iteration 7
Iteration 8
# check the result of imputation
impute_arg
##
## Multiple Imputation using Bootstrap and PMM
##
## aregImpute(formula = ~Sepal.Length + Sepal.Width + Petal.Length +
## Petal.Width + Species, data = iris.mis, n.impute = 5)
##
## n: 150 p: 5 Imputations: 5 nk: 3
##
## Number of NAs:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 8 20 14 15 18
##
## type d.f.
## Sepal.Length s 2
## Sepal.Width s 2
## Petal.Length s 2
## Petal.Width s 2
## Species c 2
##
## Transformation of Target Variables Forced to be Linear
##
## R-squares for Predicting Non-Missing Values for Each Variable
## Using Last Imputations of Predictors
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 0.863 0.631 0.979 0.942 0.992
跟miForest
一樣,Hmisc::aregImput()
也可以處理類別變數。
mi
Packagemi
這個套件的使用方法和前面幾個套件很類似,一般只要用預設的參數就可以補齊缺項:
mi
library(mi)
mi_data <- mi(iris.mis, seed = 335)
summary(mi_data)
## $Sepal.Length
## $Sepal.Length$is_missing
## missing
## FALSE TRUE
## 142 8
##
## $Sepal.Length$imputed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1754369 -0.0421273 -0.0115479 -0.0330390 -0.0007415 0.0766148
##
## $Sepal.Length$observed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.94255 -0.45695 -0.03206 0.00000 0.33214 1.24262
##
##
## $Sepal.Width
## $Sepal.Width$is_missing
## missing
## FALSE TRUE
## 130 20
##
## $Sepal.Width$imputed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.75498 -0.51236 -0.13400 -0.09021 0.18794 2.27421
##
## $Sepal.Width$observed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.98300 -0.30145 -0.07427 0.00000 0.38009 1.51601
##
##
## $Petal.Length
## $Petal.Length$is_missing
## missing
## FALSE TRUE
## 136 14
##
## $Petal.Length$imputed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.76672 -0.45859 0.06203 -0.01937 0.28273 0.70827
##
## $Petal.Length$observed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.7707 -0.6316 0.1756 0.0000 0.3704 0.8714
##
##
## $Petal.Width
## $Petal.Width$is_missing
## missing
## FALSE TRUE
## 135 15
##
## $Petal.Width$imputed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.90286 -0.62470 -0.08748 -0.08582 0.38468 1.01879
##
## $Petal.Width$observed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.74920 -0.61506 0.05564 0.00000 0.39099 0.86049
##
##
## $Species
## $Species$crosstab
##
## observed imputed
## setosa 180 24
## versicolor 180 17
## virginica 168 31
##
##
## $imputed_age
## $imputed_age$is_missing
## [1] "all values observed"
##
## $imputed_age$observed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.96892 -0.46974 -0.01648 0.00000 0.34143 1.27739
##
##
## $imputed_age2
## $imputed_age2$is_missing
## [1] "all values observed"
##
## $imputed_age2$observed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.93254 -0.44515 -0.04915 0.00000 0.34686 1.26071
詳細的說明請參考mi
的說明文件.
以上介紹的幾個方法,各有各的優缺點,實務上我會建議大家多試幾種方法,再從裡面選擇效果比較好的來用。