Case Study: Quality of Care for Diabetic Patients
pacman::p_load(caTools, ggplot2, dplyr)
D = read.csv("data/quality.csv") # Read in dataset
set.seed(88)
split = sample.split(D$PoorCare, SplitRatio = 0.75) # split vector
TR = subset(D, split == TRUE)
TS = subset(D, split == FALSE)
glm1 = glm(PoorCare ~ OfficeVisits + Narcotics, data=TR, family=binomial)
summary(glm1)
Predicted Probability (Training)
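The source shows no code for this step, although `pred` is used further down. A minimal sketch, assuming the training step mirrors the testing code in this document (`predict` with `type="response"`, a 10-break histogram, and a red line at 0.5). The data frame here is a synthetic stand-in so the block runs on its own; its values are made up and are not the contents of quality.csv.

```r
# Synthetic stand-in for the training set (NOT the real quality.csv data);
# column names follow the source, values are invented for illustration.
set.seed(88)
n  = 99
TR = data.frame(OfficeVisits = rpois(n, 13), Narcotics = rpois(n, 4))
TR$PoorCare = rbinom(n, 1, plogis(-2.5 + 0.08*TR$OfficeVisits + 0.08*TR$Narcotics))

glm1 = glm(PoorCare ~ OfficeVisits + Narcotics, data = TR, family = binomial)

# Without newdata, predict() returns fitted values for the training rows;
# type = "response" puts them on the probability scale.
pred = predict(glm1, type = "response")
hist(pred, 10)                 # distribution of predicted probabilities
abline(v = 0.5, col = 'red')   # assumed 0.5 classification threshold
```

With the real data, only the last three lines would be needed after fitting `glm1`.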
Confusion Matrix (Training)
      Predict
Actual FALSE TRUE
     0    70    4
     1    15   10
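The code that produced this table (stored as `cmx` and used below) is not shown. A minimal sketch, assuming `cmx` is a cross-tabulation of the actual labels against `pred > 0.5`. The 0.5 cutoff is an assumption, consistent with the red line drawn on the testing histogram. So that the block runs without the data file, the two vectors are rebuilt from the four printed counts; `actual` and `predicted` are hypothetical names.

```r
# Rebuild label and prediction vectors from the four printed cell counts.
# They stand in for TR$PoorCare and (pred > 0.5), the assumed inputs.
counts    = c(70, 4, 15, 10)
actual    = rep(c(0, 0, 1, 1), times = counts)
predicted = rep(c(FALSE, TRUE, FALSE, TRUE), times = counts)

# Rows are actual classes, columns are predicted classes
cmx = table(Actual = actual, Predict = predicted)
cmx
```

With the real objects, the equivalent call would presumably be `table(Actual = TR$PoorCare, Predict = pred > 0.5)`.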
Accuracy Metrics (Training)
AccuracyMetrices = function(x, k=3) c(
  accuracy = sum(diag(x))/sum(x),                    # share of all cases classified correctly
  sensitivity = as.numeric(x[2,2]/rowSums(x)[2]),    # true positive rate: actual 1 predicted TRUE
  specificity = as.numeric(x[1,1]/rowSums(x)[1])     # true negative rate: actual 0 predicted FALSE
  ) %>% round(k)
AccuracyMetrices(cmx)
   accuracy sensitivity specificity
      0.808       0.400       0.946
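Written out with the counts from the training confusion matrix (TN = 70, FP = 4, FN = 15, TP = 10), the three values follow directly. The TP/TN/FP/FN labels are shorthand introduced here, not notation from the source:

```latex
\begin{align*}
\text{accuracy}    &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{10 + 70}{99} \approx 0.808 \\
\text{sensitivity} &= \frac{TP}{TP + FN} = \frac{10}{10 + 15} = 0.400 \\
\text{specificity} &= \frac{TN}{TN + FP} = \frac{70}{70 + 4} \approx 0.946
\end{align*}
```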
Predicted Probability (Testing)
par(cex=0.8)
pred2 = predict(glm1, newdata=TS, type="response")
hist(pred2, 10)
abline(v=0.5, col='red')
Confusion Matrix (Testing)
      Predict
Actual FALSE TRUE
     0    23    1
     1     5    3
Comparing Accuracy Metrics (Training vs. Testing)
            Train  Test
accuracy    0.808 0.812
sensitivity 0.400 0.375
specificity 0.946 0.958
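The code behind this comparison is not shown. A minimal sketch of one way to build it, assuming the two confusion matrices printed in this section. The matrices are re-entered by hand from those counts, and `metrics`, `cm_train`, `cm_test`, and `cmp` are hypothetical names. `metrics` is a base-R variant of `AccuracyMetrices` that avoids the pipe so the block needs no packages.

```r
# Confusion matrices re-entered from the printed counts.
# matrix() fills column by column: first the FALSE column, then the TRUE column.
labels   = list(Actual = c("0", "1"), Predict = c("FALSE", "TRUE"))
cm_train = matrix(c(70, 15, 4, 10), nrow = 2, dimnames = labels)
cm_test  = matrix(c(23, 5, 1, 3),  nrow = 2, dimnames = labels)

# Same three metrics as AccuracyMetrices, written without %>%
metrics = function(x, k = 3) round(c(
  accuracy    = sum(diag(x)) / sum(x),   # correct / all
  sensitivity = x[2, 2] / sum(x[2, ]),   # TP / actual positives
  specificity = x[1, 1] / sum(x[1, ])    # TN / actual negatives
), k)

# One column per data set, one row per metric
cmp = cbind(Train = metrics(cm_train), Test = metrics(cm_test))
cmp
```

The result should reproduce the table above. With only 8 poor-care cases in the test set, a single reclassified patient moves test sensitivity by 0.125, so small Train/Test differences here are not strong evidence either way.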
Distribution of Predicted Probability (DPP) - Training
data.frame(y=factor(TR$PoorCare), pred=pred) %>%
ggplot(aes(x=pred, fill=y)) +
geom_histogram(bins=20, col='white', position="stack", alpha=0.5) +
ggtitle("Distribution of Predicted Probability (DPP,Train)") +
xlab("predicted probability")
Distribution of Predicted Probability (DPP) - Testing
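No code is shown for the testing plot. A minimal sketch, assuming it mirrors the training plot with `TS$PoorCare` and `pred2` swapped in. To keep the block runnable on its own, `y` and `pred2` below are synthetic stand-ins, not values from quality.csv, and the data frame is passed to `ggplot()` directly instead of through the pipe.

```r
library(ggplot2)

# Synthetic stand-ins for TS$PoorCare and pred2 (NOT the real data)
set.seed(88)
y     = rbinom(32, 1, 0.25)
pred2 = plogis(rnorm(32, mean = ifelse(y == 1, -0.5, -1.5)))

# Stacked histogram of predicted probabilities, colored by actual class
p = ggplot(data.frame(y = factor(y), pred = pred2), aes(x = pred, fill = y)) +
  geom_histogram(bins = 20, col = 'white', position = "stack", alpha = 0.5) +
  ggtitle("Distribution of Predicted Probability (DPP,Test)") +
  xlab("predicted probability")
p
```

With the real objects, the data frame would presumably be `data.frame(y=factor(TS$PoorCare), pred=pred2)`.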
ROC - Receiver Operating Characteristic Curve
par(mfrow=c(1,2), cex=0.8)
trAUC = colAUC(pred, y=TR$PoorCare, plotROC=TRUE)
tsAUC = colAUC(pred2, y=TS$PoorCare, plotROC=TRUE)
AUC - Area Under the Curve
[1] 0.77459 0.79948
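The two numbers are presumably `c(trAUC, tsAUC)`, though the call is not shown. To make the quantity concrete: AUC can be read as the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case. A minimal base-R sketch of that definition follows. The `auc` helper and the toy vectors are hypothetical illustrations, not the caTools implementation and not values from quality.csv.

```r
# AUC as the share of (positive, negative) pairs ranked correctly;
# tied pairs count as half
auc = function(p, y) {
  pos = p[y == 1]   # scores of actual positives
  neg = p[y == 0]   # scores of actual negatives
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

# Toy example with made-up values
y = c(0, 0, 1, 1)
p = c(0.1, 0.4, 0.35, 0.8)
auc(p, y)   # 0.75: three of the four positive/negative pairs are ordered correctly
```

Under this reading, the reported values of about 0.77 (training) and 0.80 (testing) mean the model ranks a poor-care patient above a good-care patient in roughly four out of five such pairs.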
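🗿 Exercise:
Using every column except TR$MemberID, build a logistic regression model to predict PoorCare, then:
[A] Plot the DPP for Training and for Testing
[B] Plot the ROC for Training and for Testing
[C] Compute ACC, SENS, and SPEC for Training and for Testing
[D] Compute the AUC for Training and for Testing
[E] Is this model more accurate than the model with two predictors?
[F] Why is it more accurate (or less accurate)?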