前置準備

1. 下載Dplyr套件

tidyverse套件,包含了dplyrggplot2,與stringr等數據處理常用包

library(tidyverse) # 一次進行下載與載入套件
## ── Attaching packages ──────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

2. 讀取資料

# setwd("~/camp") # 設定環境目錄
df = read.csv('athlete_all.csv', stringsAsFactors = F) # 不要將字串變數轉換成類別變數

Dplyr介紹

1. 什麼是Dplyr?

  • %>%運算子(Pipeline)表示法:兼顧可讀性與精簡的特性
  • 六大主要功能:

    函數 說明
    select() 選擇變數
    filter() 篩選出符合條件的變數
    arrange() 依照變數排序資料
    summarise() 聚合函數,對變數做群組運算
    mutate() 新增變數
    group_by() 依照類別變數分組

2. Dplyr Cheatsheet

實戰演練

1. %>%運算子

假設今天欲確認df中變數Games的年份,是否與變數year的年份相同?


【一般寫法】

year = substr(df$Games, 1, 4) # 取第1到第4個字元
str(year)                     # 查看資料型態: chr
##  chr [1:271116] "1952" "1952" "1900" "1900" "1900" "1900" "1964" ...
year = as.integer(year)       # 將資料型態從字串轉換成整數
str(year)                     # 查看資料型態: int
##  int [1:271116] 1952 1952 1900 1900 1900 1900 1964 1964 1964 1948 ...
identical(df$year, year)      # 查看兩者年份是否相同
## [1] TRUE


【進階寫法】

year = as.integer(substr(df$Games, 1, 4)) # 將兩步驟直接合併為一步驟
identical(df$year, year)
## [1] TRUE


%>%寫法】

df$Games %>% substr(1,4) %>% as.integer() %>% identical(df$year)
## [1] TRUE


【小結】

一般寫法 進階寫法 %>%寫法
優點 易讀性高 效率高 易讀又高效
缺點 效率差 不好閱讀 沒學過看不懂

2. select()

說明:選擇變數


【基礎篇】

只看姓名、年齡、性別

df %>% select(Name, Age, Sex) %>% 
  head(10) # 只看前十筆資料
##                                                    Name Age Sex
## 1                                     Harald Fereberger  23   M
## 2                                      Harald von Musil  44   M
## 3          Jos Eustaquio Luis Francisco Escandn y Barrn  38   M
## 4    Jos Manuel Mara del Corazn de Jess Escandn y Barrn  42   M
## 5  Jos Pablo Eustaquio Manuel Francisco Escandn y Barrn  44   M
## 6                                 William Hayden Wright  NA   M
## 7                                   Mauricio de la Lama  44   M
## 8                                            Juan Frias  46   M
## 9                         Francisco Eduardo Prieto Lpez  52   M
## 10                                        ngel Carrasco  NA   M

3. filter()

說明:篩選條件變數


【基礎篇】

找出18歲以上的女運動員

df %>% filter(Age>=18 & Sex=='F') %>% head(5) 
##     X     country year     ID              Name Sex Age Height Weight NOC
## 1 122 Afghanistan 2004  99303     Friba Razayee   F  18    165     70 AFG
## 2 124 Afghanistan 2004 132125 Robina Muqim Yaar   F  18    180     56 AFG
## 3 128 Afghanistan 2008 132125 Robina Muqim Yaar   F  22    180     56 AFG
## 4 133 Afghanistan 2012  61961 Tahmina Kohistani   F  23    160     52 AFG
## 5 136 Afghanistan 2016 133277     Kamia Yousufi   F  20    165     55 AFG
##         Games Season           City     Sport                        Event
## 1 2004 Summer Summer         Athina      Judo    Judo Women's Middleweight
## 2 2004 Summer Summer         Athina Athletics Athletics Women's 100 metres
## 3 2008 Summer Summer        Beijing Athletics Athletics Women's 100 metres
## 4 2012 Summer Summer         London Athletics Athletics Women's 100 metres
## 5 2016 Summer Summer Rio de Janeiro Athletics Athletics Women's 100 metres
##   Medal ce_rate electricity female_school food_suppiy health_expense
## 1  <NA>   100.0          NA          21.3        1970           6.36
## 2  <NA>   100.0          NA          21.3        1970           6.36
## 3  <NA>   100.0          NA          22.2        2040           6.98
## 4  <NA>    95.3          NA          23.1        2100             NA
## 5  <NA>    84.6          NA          <NA>          NA             NA
##   income_GDP
## 1       1060
## 2       1060
## 3       1300
## 4       1840
## 5       1740


【進階篇】

找出的18歲以上的女運動員的姓名、年齡與運動項目

df %>% filter(Age>=18 & Sex=='F') %>%    # 篩選出18歲以上的女運動員
  select(Name, Age, Sport) %>%           # 挑選想要的欄位
  head(10) 
##                                   Name Age     Sport
## 1                        Friba Razayee  18      Judo
## 2                    Robina Muqim Yaar  18 Athletics
## 3                    Robina Muqim Yaar  22 Athletics
## 4                    Tahmina Kohistani  23 Athletics
## 5                        Kamia Yousufi  20 Athletics
## 6                        Afrdita Tusha  26  Shooting
## 7                       Alma Qeramixhi  28 Athletics
## 8  Enkelejda Shehu (-Xhakaj, -Bekurti)  23  Shooting
## 9                Vera Bitanji (Bregu-)  27 Athletics
## 10              Mirela Maniani-Tzelili  19 Athletics

4. arrange()

說明:依變數排序


【基礎篇】

將資料依據最新的年份排序

df %>% arrange(desc(year)) %>% head(3) # 預設為正序, desc代表倒序
##     X     country year     ID                    Name Sex Age Height
## 1 135 Afghanistan 2016   7050 Mohammad Tawfiq Bakhshi   M  30    181
## 2 136 Afghanistan 2016 133277           Kamia Yousufi   F  20    165
## 3 137 Afghanistan 2016 133698      Abdul Wahab Zahiri   M  24    175
##   Weight NOC       Games Season           City     Sport
## 1     99 AFG 2016 Summer Summer Rio de Janeiro      Judo
## 2     55 AFG 2016 Summer Summer Rio de Janeiro Athletics
## 3     68 AFG 2016 Summer Summer Rio de Janeiro Athletics
##                          Event Medal ce_rate electricity female_school
## 1  Judo Men's Half-Heavyweight  <NA>    84.6          NA          <NA>
## 2 Athletics Women's 100 metres  <NA>    84.6          NA          <NA>
## 3   Athletics Men's 100 metres  <NA>    84.6          NA          <NA>
##   food_suppiy health_expense income_GDP
## 1          NA             NA       1740
## 2          NA             NA       1740
## 3          NA             NA       1740


【進階篇】

找出不重複的18歲以上的女運動員的姓名、年齡與運動

df %>% filter(Age>=18 & Sex=='F') %>% 
  select(ID, Name, Age, Sport) %>%   
  arrange(Age, Sport, Name) %>% # 依照年齡,運動項目,姓名排序
  .[!duplicated(.$ID),] %>%     # 找出不重複的人
  head(10)
##        ID                                Name Age         Sport
## 1   35568 Abigail E. ""Abbi"" Fisher (-Gould)  18 Alpine Skiing
## 2   29764                        Alenka Dovan  18 Alpine Skiing
## 7    1081                   Alessia Afi Dipol  18 Alpine Skiing
## 9   74722                    Alexandra Maasov  18 Alpine Skiing
## 11  18769             Andrea Casasnovas Rocha  18 Alpine Skiing
## 12  17306          Ania Monica Germaine Caill  18 Alpine Skiing
## 16  78375          Ann Eva Margareta Melander  18 Alpine Skiing
## 18  48263                    Anna-Karin Hesse  18 Alpine Skiing
## 20  13225                       Annamria Bnis  18 Alpine Skiing
## 23 100184            Anne-Flore Rey (-Tafani)  18 Alpine Skiing

5. summarise()

說明:聚合運算


【基礎篇】

歷年奧運最年長運動員的歲數

df %>% summarise(MaxAge = max(Age, na.rm=T))
##   MaxAge
## 1     97

6. mutate()

說明:新增變數


【基礎篇】

製作一個新的變數:BMI

df %>% mutate(BMI = (Weight/(Height*0.01)^2)) %>% head(3)
##   X               country year     ID
## 1 1           30. Februar 1952  34666
## 2 2           30. Februar 1952 127384
## 3 3 A North American Team 1900  33106
##                                           Name Sex Age Height Weight NOC
## 1                            Harald Fereberger   M  23    167     70 AUT
## 2                             Harald von Musil   M  44    176     NA AUT
## 3 Jos Eustaquio Luis Francisco Escandn y Barrn   M  38     NA     NA MEX
##         Games Season     City   Sport                             Event
## 1 1952 Summer Summer Helsinki Sailing Sailing Mixed Two Person Keelboat
## 2 1952 Summer Summer Helsinki Sailing Sailing Mixed Two Person Keelboat
## 3 1900 Summer Summer    Paris    Polo                   Polo Men's Polo
##    Medal ce_rate electricity female_school food_suppiy health_expense
## 1   <NA>      NA          NA          <NA>          NA             NA
## 2   <NA>      NA          NA          <NA>          NA             NA
## 3 Bronze      NA          NA          <NA>          NA             NA
##   income_GDP     BMI
## 1         NA 25.0995
## 2         NA      NA
## 3         NA      NA

7. group_by()

說明:將變數分組


【基礎篇】

依照獎牌分組

df %>% group_by(Medal)
## # A tibble: 271,116 x 22
## # Groups:   Medal [4]
##        X country  year     ID Name  Sex     Age Height Weight NOC   Games
##    <int> <chr>   <int>  <int> <chr> <chr> <int>  <int>  <dbl> <chr> <chr>
##  1     1 30. Fe…  1952  34666 Hara… M        23    167     70 AUT   1952…
##  2     2 30. Fe…  1952 127384 Hara… M        44    176     NA AUT   1952…
##  3     3 A Nort…  1900  33106 Jos … M        38     NA     NA MEX   1900…
##  4     4 A Nort…  1900  33107 Jos … M        42     NA     NA MEX   1900…
##  5     5 A Nort…  1900  33109 Jos … M        44     NA     NA MEX   1900…
##  6     6 A Nort…  1900 131733 Will… M        NA     NA     NA USA   1900…
##  7     7 Acipac…  1964  26276 Maur… M        44    171     69 MEX   1964…
##  8     8 Acipac…  1964  37141 Juan… M        46    173     73 MEX   1964…
##  9     9 Acipac…  1964  96982 Fran… M        52    180     84 MEX   1964…
## 10    10 Acturus  1948  18513 ngel… M        NA     NA     NA ARG   1948…
## # … with 271,106 more rows, and 11 more variables: Season <chr>,
## #   City <chr>, Sport <chr>, Event <chr>, Medal <chr>, ce_rate <dbl>,
## #   electricity <dbl>, female_school <chr>, food_suppiy <int>,
## #   health_expense <dbl>, income_GDP <int>

觀察以上資料框,可以發現:單一使用group_by不會對資料框產生變化。 group_by是聚合函數,目的是以此分組依據去做某種動作,因此需配合其他dplyr函數一起使用。


【進階篇】

2016年奧運運動員獎盃得主的最高年齡、平均年齡、性別比與BMI

df %>% filter(year==2016) %>% # 篩選2016年的資料
  group_by(Medal) %>%         # 依照獎牌分類
  mutate(BMI = (Weight/(Height*0.01)^2), # 新增BMI變數
         Male = (Sex=='M')) %>%          # 新增男性變數(True:男; False:女)
  summarise(MaxAge = max(Age),           # 計算最高年齡
            Age = mean(Age),             # 計算平均年齡
            SexRatio = mean(Male),       # 計算男性比
            BMI = first(BMI)) %>%        # 保留BMI的變數
  .[c(2,3,1),]                # 依照金牌>銀牌>銅牌之順序重新排列
## # A tibble: 3 x 5
##   Medal  MaxAge   Age SexRatio   BMI
##   <chr>   <int> <dbl>    <dbl> <dbl>
## 1 Gold       58  26.5    0.522  24.7
## 2 Silver     52  26.4    0.511  23.2
## 3 Bronze     52  26.1    0.529  23.7

其他常用技巧

1. tapply()

  • tapply(x, index, function)
  • 將x依照index分組並做function


tapply(df$Sport, df$year, n_distinct)
## 1896 1900 1904 1906 1908 1912 1920 1924 1928 1932 1936 1948 1952 1956 1960 
##    9   20   18   13   24   17   25   30   25   25   32   29   27   27   27 
## 1964 1968 1972 1976 1980 1984 1988 1992 1994 1996 1998 2000 2002 2004 2006 
##   31   30   33   33   33   35   37   41   12   31   14   34   15   34   15 
## 2008 2010 2012 2014 2016 
##   34   15   32   15   34

2. aggregate()

  • aggregate(x, by=list(), function)
  • 將x by list的內容分組並做function
  • 與tapply相似,但可以多層次分組


aggregate(df$Event, 
          list(Sport=df$Sport,
               Year=df$year), 
          n_distinct) %>% 
  head(20)
##            Sport Year  x
## 1      Athletics 1896 12
## 2        Cycling 1896  6
## 3        Fencing 1896  3
## 4     Gymnastics 1896  8
## 5       Shooting 1896  5
## 6       Swimming 1896  4
## 7         Tennis 1896  2
## 8  Weightlifting 1896  2
## 9      Wrestling 1896  1
## 10       Archery 1900  8
## 11     Athletics 1900 23
## 12 Basque Pelota 1900  1
## 13       Cricket 1900  1
## 14       Croquet 1900  3
## 15       Cycling 1900  3
## 16 Equestrianism 1900  5
## 17       Fencing 1900  7
## 18      Football 1900  1
## 19          Golf 1900  2
## 20    Gymnastics 1900  1

3. unique()

說明:將重複的內容只取出一個代表


unique(df$Sport)
##  [1] "Sailing"                   "Polo"                     
##  [3] "Hockey"                    "Athletics"                
##  [5] "Football"                  "Wrestling"                
##  [7] "Boxing"                    "Judo"                     
##  [9] "Taekwondo"                 "Shooting"                 
## [11] "Weightlifting"             "Swimming"                 
## [13] "Cycling"                   "Alpine Skiing"            
## [15] "Gymnastics"                "Handball"                 
## [17] "Fencing"                   "Tennis"                   
## [19] "Volleyball"                "Rowing"                   
## [21] "Trampolining"              "Table Tennis"             
## [23] "Cross Country Skiing"      "Badminton"                
## [25] "Bobsleigh"                 "Archery"                  
## [27] "Canoeing"                  "Snowboarding"             
## [29] "Biathlon"                  "Basketball"               
## [31] "Beach Volleyball"          "Figure Skating"           
## [33] "Equestrianism"             "Water Polo"               
## [35] "Art Competitions"          "Modern Pentathlon"        
## [37] "Diving"                    "Luge"                     
## [39] "Freestyle Skiing"          "Triathlon"                
## [41] "Skeleton"                  "Synchronized Swimming"    
## [43] "Rugby Sevens"              "Golf"                     
## [45] "Rugby"                     "Alpinism"                 
## [47] "Speed Skating"             "Ice Hockey"               
## [49] "Nordic Combined"           "Rhythmic Gymnastics"      
## [51] "Short Track Speed Skating" "Softball"                 
## [53] "Baseball"                  "Tug-Of-War"               
## [55] "Ski Jumping"               "Motorboating"             
## [57] "Lacrosse"                  "Curling"                  
## [59] "Military Ski Patrol"       "Cricket"                  
## [61] "Croquet"                   "Racquets"                 
## [63] "Jeu De Paume"              "Basque Pelota"            
## [65] "Aeronautics"               "Roque"

4. left_join()

  • left_join(x, y, by=)
  • 保留x所有資料,根據by的變數,把x沒有但y有的資料合併至x資料框


band_members
## # A tibble: 3 x 2
##   name  band   
##   <chr> <chr>  
## 1 Mick  Stones 
## 2 John  Beatles
## 3 Paul  Beatles
band_instruments
## # A tibble: 3 x 2
##   name  plays 
##   <chr> <chr> 
## 1 John  guitar
## 2 Paul  bass  
## 3 Keith guitar
band_members %>% left_join(band_instruments)
## Joining, by = "name"
## # A tibble: 3 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass