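Missing values are a common feature of large data sets, and handling them is a routine part of working with big data. Below we briefly introduce several commonly used imputation packages. The code mainly follows this article from Analytics Vidhya.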



1. The mice Package

Create data set with NA

library(mice)
library(missForest)
library(VIM)

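data(iris)
# Randomly set 10% of all cells to NA (no seed is set, so results vary between runs)
iris.mis <- missForest::prodNA(iris, noNA = 0.1)
# The NA's row of the summary shows how many values are missing in each variable
summary(iris.mis)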
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.300   Median :1.300  
##  Mean   :5.844   Mean   :3.077   Mean   :3.761   Mean   :1.204  
##  3rd Qu.:6.400   3rd Qu.:3.400   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.700   Max.   :2.500  
##  NA's   :15      NA's   :18      NA's   :13      NA's   :15     
##        Species  
##  setosa    :43  
##  versicolor:48  
##  virginica :45  
##  NA's      :14  
##                 
##                 
## 

After generating missing values with prodNA(), we can see that every variable now contains missing entries.
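
To make the role of prodNA() concrete, here is a minimal base-R sketch of the same idea: pick a fixed share of cells at random and set them to NA. The helper name make_missing and its seed argument are assumptions for illustration only. They are not part of missForest, and the package's own implementation may differ in detail.

```r
# Hypothetical helper (not part of missForest): blank out a fixed share of cells.
make_missing <- function(df, prop = 0.1, seed = 1) {
  set.seed(seed)                                   # fixed seed so the holes are reproducible
  n_cells <- nrow(df) * ncol(df)
  holes <- sample(n_cells, floor(n_cells * prop))  # indices of the cells to blank
  mask <- matrix(FALSE, nrow = nrow(df), ncol = ncol(df))
  mask[holes] <- TRUE
  df[mask] <- NA                                   # logical-matrix indexing on a data frame
  df
}

data(iris)
iris_demo <- make_missing(iris, prop = 0.1)

# Count the missing values per variable and in total
na_per_column <- colSums(is.na(iris_demo))
print(na_per_column)
print(sum(na_per_column))
```

Because the cells are drawn across the whole table rather than per column, the total number of NAs is fixed (10% of 150 x 5 cells), while the count in each variable varies from run to run unless a seed is set.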

Examine pattern of missing data

md.pattern(iris.mis)           
##    Petal.Length Species Sepal.Length Petal.Width Sepal.Width   
## 85            1       1            1           1           1  0
## 12            1       1            0           1           1  1
## 14            1       1            1           1           0  1
## 10            0       1            1           1           1  1
##  9            1       1            1           0           1  1
## 10            1       0            1           1           1  1
##  1            0       1            0           1           1  2
##  1            1       1            0           0           1  2
##  2            1       1            1           0           0  2
##  2            0       1            1           0           1  2
##  1            1       0            0           1           1  2
##  2            1       0            1           1           0  2
##  1            1       0            1           0           1  2
##              13      14           15          15          18 75
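
In the md.pattern() table above, each row is one missingness pattern: 1 means the variable is observed and 0 means it is missing. The number on the left is how many rows of the data share that pattern, the last column is the number of missing variables in the pattern, and the bottom row gives the number of missing values per variable. The same bookkeeping can be sketched in base R. The toy data frame below is made up for illustration and is not taken from iris.mis.

```r
# Small toy data frame (made-up values) so the result is easy to check by eye
toy <- data.frame(
  a = c(1, NA, 3, 4, NA),
  b = c("x", "y", NA, "w", "v"),
  c = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# TRUE where a value is observed, FALSE where it is missing
observed <- !is.na(toy)

# Encode each row as a string of 1s (observed) and 0s (missing)
pattern <- apply(observed, 1, function(r) paste(as.integer(r), collapse = ""))

# Number of rows per pattern, analogous to the left-hand column of md.pattern()
pattern_counts <- table(pattern)
print(pattern_counts)

# Rows with no missing values, analogous to the all-1 row of md.pattern()
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
```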

Plot the pattern of missing data (VIM package)

aggr(iris.mis, labels = names(iris.mis), numbers = TRUE, sortVars = TRUE,
     cex.axis = .7, cex.numbers = 0.7, cex.lab = 1.2,
     ylab = c("Proportion of missing values", "Missing-data pattern"))
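
The left panel of the aggr() plot shows the proportion of missing values in each variable, and sortVars sorts the variables by that proportion. As a rough numeric counterpart, the sketch below recomputes those proportions in base R from the NA counts reported by summary(iris.mis) above. The vector name na_counts is an assumption for illustration.

```r
# NA counts per variable, copied from the summary(iris.mis) output above
na_counts <- c(Sepal.Length = 15, Sepal.Width = 18, Petal.Length = 13,
               Petal.Width = 15, Species = 14)
n_rows <- 150                       # number of rows in iris

# Proportion of missing values per variable, largest first
na_prop <- sort(na_counts / n_rows, decreasing = TRUE)
print(round(na_prop, 4))
```

On an actual data frame the same quantity can be obtained directly with colMeans(is.na(iris.mis)).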