pacman::p_load(dplyr, ggplot2, plotly, ggpubr)

批發商資料集

W = read.csv('data/wholesales.csv')
W$Channel = factor( paste0("Ch",W$Channel) )
W$Region = factor( paste0("Reg",W$Region) )
W[3:8] = lapply(W[3:8], log, base=10)


[A] 用R做線性回歸

 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"        


[B] 理論 vs 實證模型

[1] -0.0000000000000015543  0.0000000000000257572
[1] -0.0000000000000255906  0.0000000000000015543


[C] 畫出回歸線

\[ Milk_i = b_0 + b_1 Grocery_i\]

`geom_smooth()` using formula 'y ~ x'

🌻 問題



[D] 使用模型做預測(估計)

# A tibble: 9 x 1
  Grocery
    <dbl>
1     1  
2     1.5
3     2  
4     2.5
5     3  
6     3.5
7     4  
8     4.5
9     5  

🌻 \(\hat y\)信賴區間(Confidence Interval)

     fit    lwr    upr
1 1.5670 1.4064 1.7275
2 1.9345 1.8030 2.0661
3 2.3021 2.1993 2.4049
4 2.6697 2.5949 2.7445
5 3.0373 2.9885 3.0861
6 3.4049 3.3746 3.4352
7 3.7725 3.7377 3.8074
8 4.1401 4.0830 4.1973
9 4.5077 4.4236 4.5918

🌻 \(y\)預測區間(Prediction Interval)

     fit     lwr    upr
1 1.5670 0.94409 2.1898
2 1.9345 1.31853 2.5506
3 2.3021 1.69161 2.9127
4 2.6697 2.06329 3.2762
5 3.0373 2.43354 3.6411
6 3.4049 2.80235 4.0075
7 3.7725 3.16969 4.3753
8 4.1401 3.53559 4.7446
9 4.5077 3.90004 5.1154
    x   yhat  c.lwr  c.upr   p.lwr  p.upr
1 1.0 1.5670 1.4064 1.7275 0.94409 2.1898
2 1.5 1.9345 1.8030 2.0661 1.31853 2.5506
3 2.0 2.3021 2.1993 2.4049 1.69161 2.9127
4 2.5 2.6697 2.5949 2.7445 2.06329 3.2762
5 3.0 3.0373 2.9885 3.0861 2.43354 3.6411
6 3.5 3.4049 3.3746 3.4352 2.80235 4.0075
7 4.0 3.7725 3.7377 3.8074 3.16969 4.3753
8 4.5 4.1401 4.0830 4.1973 3.53559 4.7446
9 5.0 4.5077 4.4236 4.5918 3.90004 5.1154

🌻 95%預測區間(Prediction Interval)才會涵蓋95%的\(y\)


[E] 模型摘要功能


Call:
lm(formula = Milk ~ Grocery, data = W)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1371 -0.1875  0.0219  0.1593  1.8732 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept)   0.8318     0.1115    7.46     0.00000000000047 ***
Grocery       0.7352     0.0301   24.39 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.306 on 438 degrees of freedom
Multiple R-squared:  0.576, Adjusted R-squared:  0.575 
F-statistic:  595 on 1 and 438 DF,  p-value: <0.0000000000000002
§ 模型的準確性
  • Residual standard error: 殘差的標準差 (~誤差的平均值)
  • R-squared: 模型所能解釋的(目標變數的)變異比率

💡 : \(b_1\) 係數代表平均而言, \(x\) 每增加一單位時,\(y\) 會增加的數量

  • Coefficients:
    • Estimate: 係數估計值
    • Std. Error: 係數估計值的標準差
    • Pr(>|t|): \(p\)值 (變數之間沒有關係的機率?)
    • Signif. codes: 顯著標記


§ 係數的分布

💡 : 係數(\(b_0, b_1\))也都是隨機變數,他們都會有:

  • 點估計:期望值、最有可能的值
  • 區間估計:某一信心水準之下估計值可能出現的區間