pacman::p_load(dplyr, ggplot2, car, vcd, GGally, mvtnorm)


【A】批發商資料集

 Channel    Region        Fresh            Milk         Grocery     
 Ch1:298   Reg1: 77   Min.   :0.477   Min.   :1.74   Min.   :0.477  
 Ch2:142   Reg2: 47   1st Qu.:3.495   1st Qu.:3.19   1st Qu.:3.333  
           Reg3:316   Median :3.930   Median :3.56   Median :3.677  
                      Mean   :3.792   Mean   :3.53   Mean   :3.666  
                      3rd Qu.:4.229   3rd Qu.:3.86   3rd Qu.:4.028  
                      Max.   :5.050   Max.   :4.87   Max.   :4.968  
     Frozen     Detergents_Paper   Delicassen   
 Min.   :1.40   Min.   :0.477    Min.   :0.477  
 1st Qu.:2.87   1st Qu.:2.409    1st Qu.:2.611  
 Median :3.18   Median :2.912    Median :2.985  
 Mean   :3.17   Mean   :2.947    Mean   :2.895  
 3rd Qu.:3.55   3rd Qu.:3.594    3rd Qu.:3.260  
 Max.   :4.78   Max.   :4.611    Max.   :4.681  


【B】連續變數的相關性(係數) Correlation

B1a. 點狀圖 Simple Scatter Plot

B1b. 點狀圖+回歸線 Scatter Plot with Regrssion Line

`geom_smooth()` using formula 'y ~ x'

B2. 回歸係數 Correlation \[r_{xy}=\frac{Cov(x,y)}{\sigma_x \sigma_y} =\frac{\Sigma_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})} {\sqrt{\Sigma_{i=1}^n(x_i - \bar{x})^2} \sqrt{\Sigma_{i=1}^n(y_i - \bar{y})^2}}\]

[1] 0.75885

B3. 回歸係數檢定 Correlation Test


    Pearson's product-moment correlation

data:  W$Milk and W$Grocery
t = 24.4, df = 438, p-value <0.0000000000000002
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.71617 0.79588
sample estimates:
    cor 
0.75885 

💡 : 簡單講,\(p-value\)大致可以視為「沒有效果」的機率(準確的說它並不是這樣)

💡 : 如果「沒有效果」的機率很小,我們就推路這效果是「顯著」的

💡 : \(p-value\): 給定虛無假設(\(H_0:r=0\)),檢定統計量(\(t\))大於觀察值(\(24.4\))的機率

🗿 : \(p = 0.05\)時,對立假設(\(H_A\))為真的機率是?

💡 : 貝氏定理:\(P(A|B) = P(B|A) \cdot P(A) / P(B)\)

B4. Simulating Bi-Variate Normal Distibution



【C】相關性矩陣

C1. Matrix of Correlation Coefficients

                  Fresh   Milk Grocery Frozen Detergents_Paper Delicassen
Fresh             1.000 -0.020  -0.133  0.384           -0.156      0.255
Milk             -0.020  1.000   0.759 -0.055            0.678      0.338
Grocery          -0.133  0.759   1.000 -0.165            0.796      0.236
Frozen            0.384 -0.055  -0.165  1.000           -0.212      0.255
Detergents_Paper -0.156  0.678   0.796 -0.212            1.000      0.167
Delicassen        0.255  0.338   0.236  0.255            0.167      1.000

💡 : 相關性矩陣:(a)對角線等於1;(b)左下、右上對稱

C1. Matrix of Scatter Plots



【D】類別變數之間的關聯性 Association

D1. 關聯性卡方檢定 Chisq-Test for Association


    Pearson's Chi-squared test

data:  .
X-squared = 4.35, df = 2, p-value = 0.11

D2. 馬賽克圖 Mosaic Plot

D3. Another Example

       Eye
Hair    Brown Blue Hazel Green
  Black    68   20    15     5
  Brown   119   84    54    29
  Red      26   17    14    14
  Blond     7   94    10    16

頭髮顏色的邊際機率

  Black   Brown     Red   Blond 
0.18243 0.48311 0.11993 0.21453 

眼睛顏色的邊際機率

  Brown    Blue   Hazel   Green 
0.37162 0.36318 0.15709 0.10811 

若頭髮、眼睛顏色是不相關的(Independent),則頭髮、眼睛的聯合機率分佈

         Brown     Blue    Hazel    Green
Black 0.067796 0.066255 0.028659 0.019722
Brown 0.179533 0.175453 0.075894 0.052228
Red   0.044569 0.043557 0.018841 0.012966
Blond 0.079723 0.077911 0.033701 0.023192
[1] 1

若頭髮、眼睛顏色是不相關的(Independent),則各頭髮、眼睛組合的期望值是

        Brown    Blue  Hazel   Green
Black  40.135  39.223 16.966 11.6757
Brown 106.284 103.868 44.929 30.9189
Red    26.385  25.785 11.154  7.6757
Blond  47.196  46.123 19.951 13.7297

標準化殘差(Standardized Residual)

       Eye
Hair        Brown      Blue     Hazel     Green
  Black  4.398399 -3.069377 -0.477352 -1.953684
  Brown  1.233458 -1.949477  1.353284 -0.345100
  Red   -0.074978 -1.730125  0.852253  2.282737
  Blond -5.850997  7.049590 -2.227844  0.612698

馬賽克圖

💡 學習重點:
  ■ The Association between Hair and Eye is significant
  ■ For Black-Hair:
    ■ the propotion of Brown-Eye is significantly higher than the expected
    ■ the propotion of Blue-Eye is significantly lower
  ■ For Red-Hair:
    ■ he propotion of Green-Eye is significantly higher
  ■ For Blond-Hair:
    ■ the propotion of Brown-Eye is significantly lower
    ■ the propotion of Blue-Eye is significantly higher
    ■ the propotion of Hazel-Eye is significantly lower


D4. The Expected Value in Contingent Table

       Eye
Hair      Brown    Blue  Hazel   Green
  Black  40.135  39.223 16.966 11.6757
  Brown 106.284 103.868 44.929 30.9189
  Red    26.385  25.785 11.154  7.6757
  Blond  47.196  46.123 19.951 13.7297

It is derived from the marginal probability in both directions

        Brown    Blue  Hazel   Green
Black  40.135  39.223 16.966 11.6757
Brown 106.284 103.868 44.929 30.9189
Red    26.385  25.785 11.154  7.6757
Blond  47.196  46.123 19.951 13.7297

D5. Residuals

       Eye
Hair        Brown      Blue     Hazel     Green
  Black  27.86486 -19.22297  -1.96622  -6.67568
  Brown  12.71622 -19.86824   9.07095  -1.91892
  Red    -0.38514  -8.78547   2.84628   6.32432
  Blond -40.19595  47.87669  -9.95101   2.27027

D6. Standardized Residuals

       Eye
Hair        Brown      Blue     Hazel     Green
  Black  4.398399 -3.069377 -0.477352 -1.953684
  Brown  1.233458 -1.949477  1.353284 -0.345100
  Red   -0.074978 -1.730125  0.852253  2.282737
  Blond -5.850997  7.049590 -2.227844  0.612698


🧙 熱圖與馬賽克圖的比較 Heatmap.Rmd



【E】類別與連續變數之間的關係 - GGally

【F】類別與連續變數的各種可能關係圖示

F3. 兩類別變數

💡 : 如果類別變數裡面有太多分類(或各分類的數量很不平衡),我們應該考慮使用熱圖

F5. 數量 X 數量 (X 類別)
`geom_smooth()` using formula 'y ~ x'

`geom_smooth()` using formula 'y ~ x'

`geom_smooth()` using formula 'y ~ x'


💡 學習重點: 存在整個族群的關係和個別分群之中的關係有可能是不同(甚至是相反)的!

Joining, by = "order_id"
Joining, by = "order_id"
Joining, by = "product_id"
Joining, by = "product_category_name"

    Pearson's product-moment correlation

data:  P2$score and P2$product_photos_qty
t = 3.92, df = 32339, p-value = 0.00009
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.010881 0.032669
sample estimates:
     cor 
0.021778