Unit 4 (4A): Simple Data Exploration

(4A) Unit summary: practice with the data from the Week 3 individual assignment

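Data-wrangling packages
- dplyr
- tidyr
Plotting packages
- ggplot2
- plotly
Describing data (descriptive statistics)
- Summary statistics: mean(), median(), min(), max(), …
- Distributions: hist(), table %>% barplot
- Relationships between variables: cor(), plot(x, y)
Simple data exploration (comparison by group)
- Counts by group:
- Statistics by group:
- Distributions by group:
- Relationships by group: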

Load packages

pacman::p_load(dplyr,ggplot2,plotly,gridExtra)

Load data: U.S. county-level demographic data

d = readRDS("data/counties.rds")
d = mutate_at(d, vars(region,metro),factor)
summary(d)

【A】Describing data (descriptive statistics)

Summary statistics: mean(), median(), min(), max(), …
Distributions: hist(), table %>% barplot
Relationships between variables: cor(), plot(x, y)

Summary statistics

summary(d$black)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.60    2.10    8.88   10.18   85.90

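💡 Key concept: distribution
■ A way of describing a variable
■ Distribution: how frequently each value of the variable occurs
■ It can be expressed as counts or as proportions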
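
As a minimal sketch of the two ways to express a distribution, the base R snippet below tabulates a small made-up vector as counts with table() and as proportions with prop.table(). The vector `x` is purely illustrative and is not part of the course data.

```r
# Hypothetical toy data, used only to illustrate the idea
x <- c("a", "b", "a", "c", "a", "b")

# Distribution as counts: how many times each value occurs
counts <- table(x)

# Distribution as proportions: each count divided by the total
props <- prop.table(counts)

print(counts)
print(props)
```

The same two calls apply to a column of the county data, for example table(d$region) followed by prop.table().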

Distribution of a numeric variable

hist(d$black)

Distribution of a categorical variable

table(d$state)


       Alabama         Alaska        Arizona       Arkansas     California 
            67             28             15             75             58 
      Colorado    Connecticut       Delaware        Florida        Georgia 
            64              8              3             67            159 
        Hawaii          Idaho       Illinois        Indiana           Iowa 
             5             44            102             92             99 
        Kansas       Kentucky      Louisiana          Maine       Maryland 
           105            120             64             16             24 
 Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
            14             83             87             82            115 
       Montana       Nebraska         Nevada  New Hampshire     New Jersey 
            56             93             17             10             21 
    New Mexico       New York North Carolina   North Dakota           Ohio 
            33             62            100             53             88 
      Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
            77             36             67              5             46 
  South Dakota      Tennessee          Texas           Utah        Vermont 
            65             95            253             29             14 
      Virginia     Washington  West Virginia      Wisconsin        Wyoming 
           133             39             55             72             23

table(d$state) %>% barplot

par(cex=0.7,mar=c(3,8,4,3))
table(d$state) %>% sort %>% tail(20) %>% 
  barplot(las=2,horiz=T,main="No. Counties")

Relationships between variables

cor(d$black, d$income_per_cap)

[1] -0.26762

cor.test(d$black, d$income_per_cap)$p.value

[1] 1.3085e-52
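
The correlation above is modest in size while its p-value is extremely small. The sketch below, on simulated data rather than the county data, suggests why both can hold at once: the p-value measures evidence against a zero correlation and shrinks as the sample grows, whereas the coefficient measures the strength of the linear relationship. The sample size, slope, and seed are arbitrary assumptions chosen for illustration.

```r
# Simulated data (an assumption for illustration, not the county data)
set.seed(1)
n <- 3000
x <- rnorm(n)
y <- -0.3 * x + rnorm(n)   # weak negative linear relationship plus noise

# Strength of the linear relationship
r <- cor(x, y)

# Evidence against a true correlation of zero, using all n points
p_large <- cor.test(x, y)$p.value

# The same test on only the first 20 points, for comparison
p_small <- cor.test(x[1:20], y[1:20])$p.value

print(c(r = r, p_large = p_large, p_small = p_small))
```

A small p-value on a large sample therefore does not by itself indicate a strong relationship; the coefficient and the scatter plot are still needed to judge that.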

# plot(d$black, d$income_per_cap)

ggplot(d, aes(x=black, y=income_per_cap)) +
  geom_point(color='cyan', alpha=0.2) + 
  geom_smooth(method='lm',se=F)

`geom_smooth()` using formula 'y ~ x'

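【B】Simple data exploration (comparison by group)

Counts by group
Statistics by group
Distributions by group
Relationships by group

Counts by group: the distribution of category combinations (also called a contingency table)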

table(d$metro, d$region)

          
           North Central Northeast South West
  Metro              302       130   591  142
  Nonmetro           752        87   829  305

table(d$metro, d$region) %>% barplot(beside=T)

p1 = ggplot(d, aes(x=region,fill=metro))
p2 = ggplot(d, aes(x=metro,fill=region))  

grid.arrange(
  p1 + geom_bar(show.legend=F),
  p1 + geom_bar(position=position_dodge(),show.legend=F),
  p1 + geom_bar(position=position_fill()),
  p2 + geom_bar(show.legend=F),
  p2 + geom_bar(position=position_dodge(),show.legend=F),
  p2 + geom_bar(position=position_fill()),
  nrow = 2)

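🗿 Questions:
Q: Within each region, compute the proportions of Metro and Nonmetro counties

# Corresponds to the upper-right plot
#

Q: Within Metro and within Nonmetro, compute the proportion of counties in each region

# Corresponds to the lower-right plot
#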
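
One possible way to approach both questions is prop.table() with a margin argument. The sketch below re-enters the counts from the contingency table printed earlier so that it runs on its own; with the data loaded, table(d$metro, d$region) could presumably be used in place of the hand-built matrix `tab`.

```r
# Counts copied from the metro-by-region contingency table shown earlier
tab <- matrix(
  c(302, 752, 130, 87, 591, 829, 142, 305),
  nrow = 2,
  dimnames = list(
    c("Metro", "Nonmetro"),
    c("North Central", "Northeast", "South", "West")
  )
)

# Within each region (columns sum to 1): matches the upper-right plot
within_region <- prop.table(tab, margin = 2)

# Within Metro and within Nonmetro (rows sum to 1): matches the lower-right plot
within_metro <- prop.table(tab, margin = 1)

print(round(within_region, 3))
print(round(within_metro, 3))
```

The margin argument decides which totals serve as the denominator: 2 divides by column totals and 1 divides by row totals.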

Statistics by group:

tapply(d$land_area, d$region, mean)

North Central     Northeast         South          West 
       710.08        746.14        611.04       3879.13

tapply(d$land_area, list(d$region,d$metro), mean)

                Metro Nonmetro
North Central  604.64   752.43
Northeast      571.08  1007.72
South          561.92   646.06
West          2741.58  4408.75
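
To make the behaviour of tapply() easier to follow, here is a small sketch on a made-up data frame. The values in `toy` are hypothetical and only the column names mirror the county data. One grouping factor yields a named vector, while a list of two factors yields a matrix with one cell per combination.

```r
# Hypothetical toy data; column names mirror the county data
toy <- data.frame(
  region    = c("South", "South", "South", "West", "West", "West"),
  metro     = c("Metro", "Nonmetro", "Nonmetro", "Metro", "Metro", "Nonmetro"),
  land_area = c(500, 600, 700, 2000, 3000, 4000)
)

# One grouping factor: a named vector of group means
by_region <- tapply(toy$land_area, toy$region, mean)

# Two grouping factors: a region-by-metro matrix of group means
by_both <- tapply(toy$land_area, list(toy$region, toy$metro), mean)

# The same means in long format (one row per combination), base R only
long <- aggregate(land_area ~ region + metro, data = toy, FUN = mean)

print(by_region)
print(by_both)
print(long)
```

The long format is the shape that ggplot2 expects, which is presumably why the plots that follow switch to group_by() and summarise().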

group_by(d, region, metro) %>% summarise(
  land_area = mean(land_area)) %>% 
  ggplot(aes(x=region, y=land_area, fill=metro)) +
  geom_col(position=position_dodge2())

`summarise()` has grouped output by 'region'. You can override using the `.groups` argument.

group_by(d, region, metro) %>% summarize(
  income_per_cap = weighted.mean(income_per_cap, population)
  ) %>% 
  ggplot(aes(x=region, y=income_per_cap, fill=metro)) +
  geom_col(position=position_dodge2())

`summarise()` has grouped output by 'region'. You can override using the `.groups` argument.
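
The income chart uses weighted.mean() with population as the weight rather than a plain mean(). A plausible reason is that counties differ greatly in population, so an unweighted average of per-capita figures would let small counties count as much as large ones. The two-county example below uses made-up numbers to show the difference.

```r
# Hypothetical figures for two counties in the same group
income_per_cap <- c(20000, 40000)
population     <- c(1000, 9000)

# Unweighted: each county counts equally
plain <- mean(income_per_cap)

# Weighted by population: each person counts equally
weighted <- weighted.mean(income_per_cap, population)

# The weighted mean equals total income divided by total population
check <- sum(income_per_cap * population) / sum(population)

print(c(plain = plain, weighted = weighted, check = check))
```

Here the plain mean is 30000 and the population-weighted mean is 38000, since nine in ten residents live in the higher-income county.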

Distributions by group:

par(cex=0.7)
boxplot(log(land_area,10)~region,d)

grid.arrange(
  ggplot(d, aes(x=land_area)) + geom_histogram() + scale_x_log10(),
  ggplot(d, aes(x=land_area)) + geom_density() + scale_x_log10(),
  nrow=1
  )

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

grid.arrange(
  ggplot(d, aes(x=land_area,fill=region,color=region)) + 
    geom_histogram(alpha=0.5) + scale_x_log10(),
  ggplot(d, aes(x=land_area,fill=region,color=region)) + 
    geom_density(alpha=0.5) + scale_x_log10(),
  nrow=2
  )

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Relationships by group:

ggplot(d, aes(x=black, y=income_per_cap)) +
  geom_point(color='cyan', alpha=0.2) + 
  geom_smooth(method='lm',se=F) +
  facet_grid(metro~region)

`geom_smooth()` using formula 'y ~ x'

卓雍然, College of Management, National Sun Yat-sen University

2021-03-15 13:12:28
