tidyverse
套件,包含了dplyr
、ggplot2
,與stringr
等數據處理常用包
library(tidyverse) # 一次進行下載與載入套件
## ── Attaching packages ──────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.3
## ✓ tidyr 1.0.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# setwd("~/camp") # 設定環境目錄
df = read.csv('athlete_all.csv', stringsAsFactors = F) # 不要將字串變數轉換成類別變數
%>%
運算子(Pipeline)表示法:兼顧可讀性與精簡的特性六大主要功能:
函數 | 說明 |
---|---|
select() | 選擇變數 |
filter() | 篩選出符合條件的變數 |
arrange() | 依照變數排序資料 |
summarise() | 聚合函數,對變數做群組運算 |
mutate() | 新增變數 |
group_by() | 依照類別變數分組 |
%>%
運算子假設今天欲確認df中變數Games
的年份,是否與變數year
的年份相同?
year = substr(df$Games, 1, 4) # 取第1到第4個字元
str(year) # 查看資料型態: chr
## chr [1:271116] "1952" "1952" "1900" "1900" "1900" "1900" "1964" ...
year = as.integer(year) # 將資料型態從字串轉換成整數
str(year) # 查看資料型態: int
## int [1:271116] 1952 1952 1900 1900 1900 1900 1964 1964 1964 1948 ...
identical(df$year, year) # 查看兩者年份是否相同
## [1] TRUE
year = as.integer(substr(df$Games, 1, 4)) # 將兩步驟直接合併為一步驟
identical(df$year, year)
## [1] TRUE
%>%
寫法】df$Games %>% substr(1,4) %>% as.integer() %>% identical(df$year)
## [1] TRUE
一般寫法 | 進階寫法 | %>% 寫法 |
|
---|---|---|---|
優點 | 易讀性高 | 效率高 | 易讀又高效 |
缺點 | 效率差 | 不好閱讀 | 沒學過看不懂 |
說明:選擇變數
只看姓名、年齡、性別
df %>% select(Name, Age, Sex) %>%
head(10) # 只看前十筆資料
## Name Age Sex
## 1 Harald Fereberger 23 M
## 2 Harald von Musil 44 M
## 3 Jos Eustaquio Luis Francisco Escandn y Barrn 38 M
## 4 Jos Manuel Mara del Corazn de Jess Escandn y Barrn 42 M
## 5 Jos Pablo Eustaquio Manuel Francisco Escandn y Barrn 44 M
## 6 William Hayden Wright NA M
## 7 Mauricio de la Lama 44 M
## 8 Juan Frias 46 M
## 9 Francisco Eduardo Prieto Lpez 52 M
## 10 ngel Carrasco NA M
說明:篩選條件變數
找出18歲以上的女運動員
df %>% filter(Age>=18 & Sex=='F') %>% head(5)
## X country year ID Name Sex Age Height Weight NOC
## 1 122 Afghanistan 2004 99303 Friba Razayee F 18 165 70 AFG
## 2 124 Afghanistan 2004 132125 Robina Muqim Yaar F 18 180 56 AFG
## 3 128 Afghanistan 2008 132125 Robina Muqim Yaar F 22 180 56 AFG
## 4 133 Afghanistan 2012 61961 Tahmina Kohistani F 23 160 52 AFG
## 5 136 Afghanistan 2016 133277 Kamia Yousufi F 20 165 55 AFG
## Games Season City Sport Event
## 1 2004 Summer Summer Athina Judo Judo Women's Middleweight
## 2 2004 Summer Summer Athina Athletics Athletics Women's 100 metres
## 3 2008 Summer Summer Beijing Athletics Athletics Women's 100 metres
## 4 2012 Summer Summer London Athletics Athletics Women's 100 metres
## 5 2016 Summer Summer Rio de Janeiro Athletics Athletics Women's 100 metres
## Medal ce_rate electricity female_school food_suppiy health_expense
## 1 <NA> 100.0 NA 21.3 1970 6.36
## 2 <NA> 100.0 NA 21.3 1970 6.36
## 3 <NA> 100.0 NA 22.2 2040 6.98
## 4 <NA> 95.3 NA 23.1 2100 NA
## 5 <NA> 84.6 NA <NA> NA NA
## income_GDP
## 1 1060
## 2 1060
## 3 1300
## 4 1840
## 5 1740
找出的18歲以上的女運動員的姓名、年齡與運動項目
df %>% filter(Age>=18 & Sex=='F') %>% # 篩選出18歲以上的女運動員
select(Name, Age, Sport) %>% # 挑選想要的欄位
head(10)
## Name Age Sport
## 1 Friba Razayee 18 Judo
## 2 Robina Muqim Yaar 18 Athletics
## 3 Robina Muqim Yaar 22 Athletics
## 4 Tahmina Kohistani 23 Athletics
## 5 Kamia Yousufi 20 Athletics
## 6 Afrdita Tusha 26 Shooting
## 7 Alma Qeramixhi 28 Athletics
## 8 Enkelejda Shehu (-Xhakaj, -Bekurti) 23 Shooting
## 9 Vera Bitanji (Bregu-) 27 Athletics
## 10 Mirela Maniani-Tzelili 19 Athletics
說明:依變數排序
將資料依據最新的年份排序
df %>% arrange(desc(year)) %>% head(3) # 預設為正序, desc代表倒序
## X country year ID Name Sex Age Height
## 1 135 Afghanistan 2016 7050 Mohammad Tawfiq Bakhshi M 30 181
## 2 136 Afghanistan 2016 133277 Kamia Yousufi F 20 165
## 3 137 Afghanistan 2016 133698 Abdul Wahab Zahiri M 24 175
## Weight NOC Games Season City Sport
## 1 99 AFG 2016 Summer Summer Rio de Janeiro Judo
## 2 55 AFG 2016 Summer Summer Rio de Janeiro Athletics
## 3 68 AFG 2016 Summer Summer Rio de Janeiro Athletics
## Event Medal ce_rate electricity female_school
## 1 Judo Men's Half-Heavyweight <NA> 84.6 NA <NA>
## 2 Athletics Women's 100 metres <NA> 84.6 NA <NA>
## 3 Athletics Men's 100 metres <NA> 84.6 NA <NA>
## food_suppiy health_expense income_GDP
## 1 NA NA 1740
## 2 NA NA 1740
## 3 NA NA 1740
找出不重複的18歲以上的女運動員的姓名、年齡與運動
df %>% filter(Age>=18 & Sex=='F') %>%
select(ID, Name, Age, Sport) %>%
arrange(Age, Sport, Name) %>% # 依照年齡,運動項目,姓名排序
.[!duplicated(.$ID),] %>% # 找出不重複的人
head(10)
## ID Name Age Sport
## 1 35568 Abigail E. ""Abbi"" Fisher (-Gould) 18 Alpine Skiing
## 2 29764 Alenka Dovan 18 Alpine Skiing
## 7 1081 Alessia Afi Dipol 18 Alpine Skiing
## 9 74722 Alexandra Maasov 18 Alpine Skiing
## 11 18769 Andrea Casasnovas Rocha 18 Alpine Skiing
## 12 17306 Ania Monica Germaine Caill 18 Alpine Skiing
## 16 78375 Ann Eva Margareta Melander 18 Alpine Skiing
## 18 48263 Anna-Karin Hesse 18 Alpine Skiing
## 20 13225 Annamria Bnis 18 Alpine Skiing
## 23 100184 Anne-Flore Rey (-Tafani) 18 Alpine Skiing
說明:聚合運算
歷年奧運最年長運動員的歲數
df %>% summarise(MaxAge = max(Age, na.rm=T))
## MaxAge
## 1 97
說明:新增變數
製作一個新的變數:BMI
df %>% mutate(BMI = (Weight/(Height*0.01)^2)) %>% head(3)
## X country year ID
## 1 1 30. Februar 1952 34666
## 2 2 30. Februar 1952 127384
## 3 3 A North American Team 1900 33106
## Name Sex Age Height Weight NOC
## 1 Harald Fereberger M 23 167 70 AUT
## 2 Harald von Musil M 44 176 NA AUT
## 3 Jos Eustaquio Luis Francisco Escandn y Barrn M 38 NA NA MEX
## Games Season City Sport Event
## 1 1952 Summer Summer Helsinki Sailing Sailing Mixed Two Person Keelboat
## 2 1952 Summer Summer Helsinki Sailing Sailing Mixed Two Person Keelboat
## 3 1900 Summer Summer Paris Polo Polo Men's Polo
## Medal ce_rate electricity female_school food_suppiy health_expense
## 1 <NA> NA NA <NA> NA NA
## 2 <NA> NA NA <NA> NA NA
## 3 Bronze NA NA <NA> NA NA
## income_GDP BMI
## 1 NA 25.0995
## 2 NA NA
## 3 NA NA
說明:將變數分組
依照獎牌分組
df %>% group_by(Medal)
## # A tibble: 271,116 x 22
## # Groups: Medal [4]
## X country year ID Name Sex Age Height Weight NOC Games
## <int> <chr> <int> <int> <chr> <chr> <int> <int> <dbl> <chr> <chr>
## 1 1 30. Fe… 1952 34666 Hara… M 23 167 70 AUT 1952…
## 2 2 30. Fe… 1952 127384 Hara… M 44 176 NA AUT 1952…
## 3 3 A Nort… 1900 33106 Jos … M 38 NA NA MEX 1900…
## 4 4 A Nort… 1900 33107 Jos … M 42 NA NA MEX 1900…
## 5 5 A Nort… 1900 33109 Jos … M 44 NA NA MEX 1900…
## 6 6 A Nort… 1900 131733 Will… M NA NA NA USA 1900…
## 7 7 Acipac… 1964 26276 Maur… M 44 171 69 MEX 1964…
## 8 8 Acipac… 1964 37141 Juan… M 46 173 73 MEX 1964…
## 9 9 Acipac… 1964 96982 Fran… M 52 180 84 MEX 1964…
## 10 10 Acturus 1948 18513 ngel… M NA NA NA ARG 1948…
## # … with 271,106 more rows, and 11 more variables: Season <chr>,
## # City <chr>, Sport <chr>, Event <chr>, Medal <chr>, ce_rate <dbl>,
## # electricity <dbl>, female_school <chr>, food_suppiy <int>,
## # health_expense <dbl>, income_GDP <int>
觀察以上資料框,可以發現:單一使用group_by不會對資料框產生變化。 group_by是聚合函數,目的是以此分組依據去做某種動作,因此需配合其他dplyr函數一起使用。
2016年奧運運動員獎盃得主的最高年齡、平均年齡、性別比與BMI
df %>% filter(year==2016) %>% # 篩選2016年的資料
group_by(Medal) %>% # 依照獎牌分類
mutate(BMI = (Weight/(Height*0.01)^2), # 新增BMI變數
Male = (Sex=='M')) %>% # 新增男性變數(True:男; False:女)
summarise(MaxAge = max(Age), # 計算最高年齡
Age = mean(Age), # 計算平均年齡
SexRatio = mean(Male), # 計算男性比
BMI = first(BMI)) %>% # 保留BMI的變數
.[c(2,3,1),] # 依照金牌>銀牌>銅牌之順序重新排列
## # A tibble: 3 x 5
## Medal MaxAge Age SexRatio BMI
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Gold 58 26.5 0.522 24.7
## 2 Silver 52 26.4 0.511 23.2
## 3 Bronze 52 26.1 0.529 23.7
tapply(df$Sport, df$year, n_distinct)
## 1896 1900 1904 1906 1908 1912 1920 1924 1928 1932 1936 1948 1952 1956 1960
## 9 20 18 13 24 17 25 30 25 25 32 29 27 27 27
## 1964 1968 1972 1976 1980 1984 1988 1992 1994 1996 1998 2000 2002 2004 2006
## 31 30 33 33 33 35 37 41 12 31 14 34 15 34 15
## 2008 2010 2012 2014 2016
## 34 15 32 15 34
aggregate(df$Event,
list(Sport=df$Sport,
Year=df$year),
n_distinct) %>%
head(20)
## Sport Year x
## 1 Athletics 1896 12
## 2 Cycling 1896 6
## 3 Fencing 1896 3
## 4 Gymnastics 1896 8
## 5 Shooting 1896 5
## 6 Swimming 1896 4
## 7 Tennis 1896 2
## 8 Weightlifting 1896 2
## 9 Wrestling 1896 1
## 10 Archery 1900 8
## 11 Athletics 1900 23
## 12 Basque Pelota 1900 1
## 13 Cricket 1900 1
## 14 Croquet 1900 3
## 15 Cycling 1900 3
## 16 Equestrianism 1900 5
## 17 Fencing 1900 7
## 18 Football 1900 1
## 19 Golf 1900 2
## 20 Gymnastics 1900 1
說明:將重複的內容只取出一個代表
unique(df$Sport)
## [1] "Sailing" "Polo"
## [3] "Hockey" "Athletics"
## [5] "Football" "Wrestling"
## [7] "Boxing" "Judo"
## [9] "Taekwondo" "Shooting"
## [11] "Weightlifting" "Swimming"
## [13] "Cycling" "Alpine Skiing"
## [15] "Gymnastics" "Handball"
## [17] "Fencing" "Tennis"
## [19] "Volleyball" "Rowing"
## [21] "Trampolining" "Table Tennis"
## [23] "Cross Country Skiing" "Badminton"
## [25] "Bobsleigh" "Archery"
## [27] "Canoeing" "Snowboarding"
## [29] "Biathlon" "Basketball"
## [31] "Beach Volleyball" "Figure Skating"
## [33] "Equestrianism" "Water Polo"
## [35] "Art Competitions" "Modern Pentathlon"
## [37] "Diving" "Luge"
## [39] "Freestyle Skiing" "Triathlon"
## [41] "Skeleton" "Synchronized Swimming"
## [43] "Rugby Sevens" "Golf"
## [45] "Rugby" "Alpinism"
## [47] "Speed Skating" "Ice Hockey"
## [49] "Nordic Combined" "Rhythmic Gymnastics"
## [51] "Short Track Speed Skating" "Softball"
## [53] "Baseball" "Tug-Of-War"
## [55] "Ski Jumping" "Motorboating"
## [57] "Lacrosse" "Curling"
## [59] "Military Ski Patrol" "Cricket"
## [61] "Croquet" "Racquets"
## [63] "Jeu De Paume" "Basque Pelota"
## [65] "Aeronautics" "Roque"
band_members
## # A tibble: 3 x 2
## name band
## <chr> <chr>
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
band_instruments
## # A tibble: 3 x 2
## name plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
band_members %>% left_join(band_instruments)
## Joining, by = "name"
## # A tibble: 3 x 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass