13 Regression model

คือ การสร้างสมการถดถอยของความสัมพันธ์ระหว่าง 2 ตัวแปรขึ้นไป โดยประกอบไปด้วย

ตัวแปรต้น (Independent variable, \(x\)) คือ ตัวแปรที่เป็นต้นเหตุ
ตัวแปรตาม (Dependent variable, \(y\)) คือ ตัวแปรที่เป็นปลายเหตุ ซึ่งเป็นผลมาจากการเปลี่ยนแปลงของตัวแปรต้น

ความสัมพันธ์ของตัวแปรต้นและตัวแปรตามจะเขียนในรูปแบบ \(f(x) \sim x\)

13.1 Linear regression

13.1.1 Model summary

คือ การสร้างความสัมพันธ์ของตัวแปรแบบเชิงเส้น ใช้กับข้อมูลแบบต่อเนื่อง (Continuous data)

\[ h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \ … \ + \theta_{n}x_{n} + \epsilon \]

data("Diabetes", package = "heplots")

ggplot(Diabetes, aes(x = glufast, y = sspg)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw()

13.1.2 Model performance

ในการวัดความแม่นยำของ Linear regression นั้นประกอบด้วยสามองค์ประกอบ

## 
## Call:
## lm(formula = sspg ~ glufast, data = Diabetes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -197.030  -59.050   -0.371   68.950  142.060 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.4534    13.3348   2.959  0.00362 ** 
## glufast       1.1866     0.0969  12.247  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 74.33 on 143 degrees of freedom
## Multiple R-squared:  0.5119, Adjusted R-squared:  0.5085 
## F-statistic:   150 on 1 and 143 DF,  p-value: < 2.2e-16

\[ \text{Total variation} = \text{Explained variation} + \text{Unexplained variation/Error} \]

\[ \text{Total sum of squares} \ (TSS) = \text{Sum of squares regression} \ (SSR) + \text{Sum of squares error} \ (SSE) \]

\[ \sum^{n}_{i=1}(y_{i}-\bar{y}_{i})^{2} = \sum^{n}_{i=1}(\hat{y}_{i}-\bar{y}_{i})^{2} + \sum^{n}_{i=1}(y_{i}-\hat{y}_{i})^{2} \]

\(TSS\) = Total variation = ค่าความผันผวนระหว่างข้อมูลกับค่าเฉลี่ยของข้อมูล
\(SSR\) = Explained variation = ค่าความผันผวนระหว่างค่าเฉลี่ยของข้อมูลกับเส้น Regression
\(SSE\) = Error = ค่าความผันผวนระหว่างข้อมูลกับเส้น Regression

diabetes_fit <- lm(sspg ~ glufast, data = Diabetes) 
summary(diabetes_fit)

## 
## Call:
## lm(formula = sspg ~ glufast, data = Diabetes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -197.030  -59.050   -0.371   68.950  142.060 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.4534    13.3348   2.959  0.00362 ** 
## glufast       1.1866     0.0969  12.247  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 74.33 on 143 degrees of freedom
## Multiple R-squared:  0.5119, Adjusted R-squared:  0.5085 
## F-statistic:   150 on 1 and 143 DF,  p-value: < 2.2e-16

Estimate คือ ค่าสัมประสิทธิ์ของตัวแปรที่มีผลต่อสมการนั้นๆ โดย (Intercept) คือ จุดตัดแกนของ Dependent variable เมื่อไม่คิดผลกระทบจากตัวแปรอื่น และ ในส่วน glufast คือ ค่าที่เพิ่มขึ้น เมื่อ glufast เพิ่มขึ้น 1 หน่วย ซึ่งเขียนเป็นสมการได้ว่า

\[ f(x) = 39.4354 \ + 1.1866\times(\text{glufast}) + \epsilon \]

\(R^{2}\) คือ อัตราส่วนระหว่าง Explained variation กับ Total variation = \(\frac{SSR}{TSS} = 1-\frac{SSE}{TSS}\) ซึ่งจะบ่งบอกความสามารถของเส้นถดถอย ในการอธิบายข้อมูล (Goodness of fit)
\(F\)-test คือการเปรียบเทียบความสามารถของเส้นถดถอย ว่าสามารถอธิบายข้อมูลได้ดีกว่า \(H_{0}\) อย่างมีนัยสำคัญหรือไม่ ภายใต้สมมติฐาน
- \(H_{0}\): \(f(x) = \theta_{0} + c\) หรือ สามารถอธิบายข้อมูลได้โดยใช้แค่ค่าเฉลี่ย
- \(H_{a}\): \(f(x) = \theta_{0} + x_{1}\theta_{1} + … + \epsilon\)
ซึ่งเป็นการเทียบอัตราส่วน Explained กับ Unexplained variation เช่นเดียวกับ ANOVA

\[F =\frac{MSR}{MSE} = \frac{TSS-SSE}{SSE}/\frac{DF_{TSS}-DF_{SSE}}{DF_{SSE}}\]

\(t\)-test คือการเปรียบเทียบเส้น Regression เมื่อมีตัวแปรนั้น ว่าอธิบายได้ดีกว่าเมื่อไม่มีตัวแปรนั้นหรือไม่
\(H_{0}\): \(\theta = 0\)
\(H_{a}\): \(\theta \neq 0\)

จะเห็นว่าสมการนั้นเหมือน One-sample t-test แต่เขียนในรูปแบบ Regression

\[ t = \frac{\theta_{H_{a}}-\theta_{H_0}}{SE(\theta)} = \frac{\theta-0}{SE(\theta)} \]

\[ SE(\theta) = \sqrt{\frac{1}{n-2} \times {\sum_{i=1}^{n}\frac{(y_{i} - \hat{y_{i}})^2}{(x_{i} - \bar{x})^2}}} \]

13.1.3 Prediction

ข้อดีของสมการถอดถอย คือ สามารถใช้ในการทำนายข้อมูลตัวแปรตามชุดใหม่ได้ จากผลลัพธ์ขั้นต้นในคอลัมน์ coef พบว่าทุกๆ glufast ที่เพิ่มขึ้น 1.186 หน่วย ส่งผลให้ sspg เพิ่มขึ้น 1 หน่วย

โดยสมการนี้จะอยู่ใน diabetes_fit

new_sspg <- data.frame(glufast = 1:400) 
new_sspg <- new_sspg |> 
  mutate(predicted_sspg = predict(diabetes_fit, newdata = new_sspg))

ggplot(new_sspg, aes(x = glufast, y = predicted_sspg)) +
  geom_point(size = 0.3) + theme_bw()

13.2 Logistic regression

13.2.1 Model summary

คือ สมการถดถอยซึ่งมีคุณสมบัติในการจำแนกตัวแปรแบบสองตัวแปร (Binary classification) ซึ่งเขียนอยู่ในรูปของ ค่าลอการิธึมของอัตราส่วนความเสี่ยง (Log odds)

\[ log(\frac{h(x)}{1-h(x)}) = \theta_{0} + x_{1}\theta_{1} + x_{2}\theta_{2} + \ ... \ + x_{n}\theta_{n} + \epsilon = z \]

\[ \frac{h(x)}{1-h(x)} = e^{z} \]

\[ h(x) = \frac{e^{z}}{1+e^{z}} \]

\[ h(x) = \frac{1}{1+e^{-(\theta_{0} + x_{1}\theta_{1} + x_{2}\theta_{2} + \ ... \ + x_{n}\theta_{n} + \epsilon)}} \]

ความพิเศษของสมการนี้คือ ขอบเขตของ \(h(x)\) จะอยู่ระหว่าง (0, 1) เสมอ ซึ่งส่งผลให้สามารถคำนวณกลับไปทำนายอัตราการเกิดเหตุการณ์จากอัตราส่วนความเสี่ยงได้

log_df <- data.frame(x = -20:20) |> 
  mutate(y = 1/(1+exp(-(0 + 0.75*x))))

ggplot(log_df, aes(x = x, y = y)) + geom_line() + theme_bw() +
  labs(y = "h(x)")

ต่อไปเราจะทำนายว่า glufast เพื่อให้ตัวแปรเป็น binary เราจะรวม Overt_Diabetic และ Chemical_Diabetic เป็นกลุ่มเดียวกัน

Diabetes_mixed <- Diabetes |> 
  mutate(group = case_match(group,
                            "Normal" ~ 0,
                            "Chemical_Diabetic" ~ 1,
                            "Overt_Diabetic" ~ 1))

ggplot(Diabetes_mixed, aes(x = glufast, y = group)) + 
  geom_point() +
  geom_smooth(method = "glm",  method.args = list(family = "binomial"), 
              se = FALSE) +
  theme_bw()

13.2.2 Model performance

logit_fit <- glm(group ~ glufast, family = "binomial", 
                 data = Diabetes_mixed) 
summary(logit_fit)

## 
## Call:
## glm(formula = group ~ glufast, family = "binomial", data = Diabetes_mixed)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -11.76109    2.42017  -4.860 1.18e-06 ***
## glufast       0.11583    0.02492   4.649 3.34e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 200.67  on 144  degrees of freedom
## Residual deviance: 121.94  on 143  degrees of freedom
## AIC: 125.94
## 
## Number of Fisher Scoring iterations: 8

Estimate ในที่นี้คือ \(log(\frac{h(x)}{1-h(x)})\) โดยที่ \(\frac{h(x)}{1-h(x)}\) คือ Odd ratio ของการเกิดเหตุการณ์
Likelihood คือ ค่าของตวามเป็นไปได้ที่จะเกิดเหตุการณ์นั้นๆ

\[ \prod_{i=1}^{n}p_{i}^{y_{i}}(1-p_{i}^{y_{i}}) \]โดย \(y_{i}\) คือ ค่าของความเป็นไปได้ของการเกิดเหตุการณ์ตาม Binomial distribution ซึ่งในที่นี่คือ group = 0 และ group = 1 ซึ่งการคำนวณนั้นจะนิยมใช้ค่า log มากกว่าเพื่อความสะดวกในการคำนวณ

\[ LL = log(\prod_{i=1}^{n}p_{i}^{y_{i}}(1-p_{i}^{1-y_{i}})) = \sum_{i=1}^{n}log(p_{i}^{y_{i}}(1-p_{i}^{1-y_{i}})) \]
Null deviance คือ \(2(LL(\text{Saturated model}) - LL(\text{Null model}))\)
Residual deviance คือ \(2(LL(\text{Saturated model}) - LL(\text{Fitted model}))\)
Akaike information criterion (AIC) คือ ค่าการประมาณความผิดพลาดของสมการ \(2(\text{Parameters}- LL(\text{Fitted model}))\) มักใช้ในการเปรียบเทียบกับสมการตัวแปรชนิดอื่น เพื่อเทียบคุณภาพของแบบจำลองที่มีตัวแปรต่างกัน
Pseudo \(R^{2}\) สามารถคำนวณได้จาก \(1 - \frac{D_{Fitted}}{D_{Null}}\)

ค่าที่เกี่ยวข้องกับ Deviance นั้น ยิ่งน้อยยิ่งดี เนื่องหมายความว่ามี Deviation ต่ำ

logit_performance <- Diabetes_mixed |> select(glufast, group) |> 
  mutate(predicted_prob = predict(logit_fit, newdata = Diabetes_mixed,
                                   type = "response")) |> 
  mutate(binomial_response = 
           (predicted_prob^group)*(1-predicted_prob)^(1-group)) |> 
  mutate(saturated_response = 1) |> 
  mutate(null_response = 0.5)

logit_likelihood <- logit_performance |> 
  summarize(across(contains("response"), ~sum(log(.x))))
logit_likelihood

## Null deviance
nd <- with(logit_likelihood, 2*(saturated_response - null_response))
nd

## [1] 201.0127

## Residual deviance
rd <- with(logit_likelihood, 2*(saturated_response - binomial_response))
rd

## [1] 121.944

## AIC
aic <- 2*(2-logit_likelihood$binomial_response)
aic

## [1] 125.944

จากข้อมูลนี้ สามารถคำนวณ Likelihood ratio statistics ได้

\[ L = -2log(\frac{LL_{Fitted}}{LL_{Null}}) =D_{Null} - D_{Fitted} \]

ซึ่งค่า \(L\) นั้นจะมีลักษณะการกระจายตัวเป็นแบบเดียวกับ Chi-square

L <- nd - rd
L

## [1] 79.06869

pchisq(L, df = 144-143, lower.tail = FALSE)

## [1] 5.998755e-19

library(lmtest) # Use package
null_fit <- glm(group ~ 1, data = Diabetes_mixed, family = "binomial")
lrtest(logit_fit, null_fit)

13.2.3 Prediction and classification

ในส่วนของ Prediction นั้น ใช้ฟังก์ชันลักษณะเดียวกันกับ Linear regression โดยให้ Argument type = response ซึ่งจะทำนายโอกาสที่ผู้ป่วยรายนั้นจะเป็นเบาหวาน

new_group <- data.frame(glufast = 1:200) 
new_group <- new_group |> 
  mutate(predicted_response = predict(logit_fit, 
                                   newdata = new_group, type = "response")) 

ggplot(new_group, aes(x = glufast, y = predicted_response)) +
  geom_point() + 
  geom_hline(yintercept = 0.5, col = "blue", linewidth =0.5) +
  theme_bw()

หลังจากที่ได้โอกาสของคนที่จะเป็นเบาหวานแล้ว ท่านสามารถสร้างเส้นขอบเขตการตัดสินใจ (Decision boundary) ว่าโอกาสที่เท่าไร ถึงจะให้เป็นเบาหวาน สังเกตเส้น ที่ predicted_response = 0.5 หมายความว่า ค่าโอกาสที่ \(>\) 0.5 จะจัดกลุ่มให้เป็นเบาหวาน และค่าโอกาสที่ \(\leq\) 0.5 จะจัดอยู่ในกลุ่มไม่เป็นเบาหวาน

13.2.4 Receiver Operating Characteristic (ROC) curve

ในการจัดกลุ่มนั้น ที่ Decision boundary ต่างกัน ย่อมส่งผลต่อความแม่นยำของการจำแนกที่ต่างกัน ท่านสามารถประเมินความแม่นยำที่ขอบเขตต่างๆ ได้ด้วย ROC ซึ่งคือการสร้างกราฟของ Sensitivity (True positive) และ 1 - Specificity (False positive)

ROC <- map(seq(0,1,0.01), \(x) {
  df <- logit_performance |>
      mutate(predicted_group = ifelse(predicted_prob >= x, 1, 0)) 
  sensitivity <- sum(df$group == 1 & df$predicted_group == 1)/sum(df$group == 1)
  specificity <- sum(df$group == 0 & df$predicted_group == 0)/sum(df$group == 0)
  return(data.frame(cutoff_prob = x, sens = sensitivity, spec = specificity) )
  }) |> 
  list_rbind() |> 
  filter(!(sens > 0 & spec == 1))
ROC

สังเกตว่า ที่ Cut-off สูงขึ้นเรื่อยๆ นั้น Sensitivity นั้นจะต่ำลง ในขณะที่ Specificity นั้นจะสูงขึ้น (1 - Specificity ต่ำลง) เมื่อนำมาสร้างกราฟจะได้ ROC curve

ggplot(ROC, aes(x = 1-spec , y = sens)) + geom_point(size = 1) +
  geom_line() +
  geom_abline(intercept = 0, slope = 1, col = "darkred", linetype = "dashed") +
  scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
  labs(x = "1 - Specificity", y = "Sensitivity") +
  coord_equal() +
  theme_bw()

โดยตัวประสิทธิภาพโดยรวมของ ROC นั้นคำนวณจากพื้นที่ใต้กราฟ (Area under curve: AUC)

group <- logit_performance$group
probs <- logit_performance$predicted_prob
combinations <- expand_grid(pos = probs[group == 1],
                            neg = probs[group == 0])
auc <- mean(with(combinations, pos>neg))
auc

## [1] 0.859077

Credit: https://stats.stackexchange.com/a/146136/322844

13.3 Poisson, quassipoisson, and negative binomial regression

13.3.1 Poisson regression

Poisson regression คือ สมการถดถอยที่ใช้ในการทำนายความถี่ หรือค่าเฉลี่ยของการเกิดเหตุการณ์นั้นๆ มักใช้กับข้อมูลที่ไม่ต่อเนื่อง (Discrete value) พิจารณาข้อมูลแบบ Poisson distribution

\[ f(k) = P(X = k) = \frac{\lambda^{k}}{k!}e^{-\lambda} \]

จะพบว่ามีตัวแปรที่สามารถทำนายได้เมื่อมีข้อมูลอีกชนิดหนึ่ง คือ ค่าเฉลี่ยการเกิดเหตุการณ์ \((\lambda)\) และ จำนวนเหตุการณ์ที่เกิด \((k)\) จึงออกมาเป็นสมการถดถอยได้สองรูปแบบ

สำหรับ \(\lambda\)

\[ \lambda = e^{\theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}+…+\theta_{n}x_{n}} \]

\[ ln(\lambda) = \theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}+…+\theta_{n}x_{n} \]

สำหรับ \(k\) เราจำเป็นต้องเปลี่ยน \(\lambda\) ให้อยู่ในรูป \(k/t\) และย้ายไปเป็นตัวแปรควบคุมเวลา (Offset) ซึ่งจะมี Regression coefficient = 1 เสมอ

\[ \lambda = \frac{k}{t} = e^{\theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}+…+\theta_{n}x_{n}} \]

\[ ln(k) - \ln(t) = \theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}+…+\theta_{n}x_{n} \]

\[ ln(k) = \theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}+…+\theta_{n}x_{n} + \ln(t) \rightarrow \text{offset} \]

ซึ่งสมการนี้เป็นสมการพื้นฐานของการนับจำนวนใดๆ ดังเช่นตัวอย่างนี้ คือความสัมพันธ์ของตัวแปรต่างๆ กับอัตราการตายในโรคมะเร็งปากมดลูก

cervix_mort <- read.csv("Resource/cervix_mort.csv") |>  # aggregrated data for demonstration
                mutate(histo_major = fct_relevel(histo_major, "Squamous cell carcinoma",
                                                 "Adenocarcinoma",
                                                 "Adenosquamous carcinoma",
                                                 "Others"))
glm(dead ~  age_group + histo_major + stage, family = "poisson", data = cervix_mort) |> summary()

## 
## Call:
## glm(formula = dead ~ age_group + histo_major + stage, family = "poisson", 
##     data = cervix_mort)
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         7.172968   0.031870  225.07   <2e-16 ***
## age_group                          -0.748325   0.010905  -68.62   <2e-16 ***
## histo_majorAdenocarcinoma          -1.413401   0.025945  -54.48   <2e-16 ***
## histo_majorAdenosquamous carcinoma -2.850308   0.049905  -57.12   <2e-16 ***
## histo_majorOthers                  -1.731075   0.029393  -58.89   <2e-16 ***
## stage                               0.209258   0.008536   24.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 25897.8  on 59  degrees of freedom
## Residual deviance:  9499.4  on 54  degrees of freedom
## AIC: 9832.1
## 
## Number of Fisher Scoring iterations: 5

ggplot(cervix_mort, aes (x = age_group, y = dead)) +
  geom_jitter() +
  geom_smooth(method = "glm", 
              method.args = list(family = "poisson"), se = FALSE) +
  scale_x_continuous(breaks = 1:4) +
  theme_bw()

สังเกตว่ายิ่งอายุเยอะ อัตราการตายยิ่งน้อยลงอย่างมีนัยสำคัญ ซึ่งไม่น่าจะเป็นไปได้ ทั้งนี้ เพราะยังไม่ได้ปรับระยะเวลาติดตาม โดยการใส่ Offset เป็น person-month เข้าไปคำนวณด้วย

pois_fit <- glm(dead ~  age_group + histo_major + stage + 
                  offset(log(person_month)),
                family = "poisson", data = cervix_mort)
summary(pois_fit)

## 
## Call:
## glm(formula = dead ~ age_group + histo_major + stage + offset(log(person_month)), 
##     family = "poisson", data = cervix_mort)
## 
## Coefficients:
##                                     Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)                        -7.138354   0.035957 -198.523  < 2e-16 ***
## age_group                           0.037519   0.014553    2.578  0.00994 ** 
## histo_majorAdenocarcinoma          -0.076841   0.025947   -2.961  0.00306 ** 
## histo_majorAdenosquamous carcinoma -0.001698   0.049964   -0.034  0.97289    
## histo_majorOthers                   0.408249   0.029643   13.772  < 2e-16 ***
## stage                               0.918984   0.007915  116.101  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 15452.8  on 59  degrees of freedom
## Residual deviance:  2386.7  on 54  degrees of freedom
## AIC: 2719.4
## 
## Number of Fisher Scoring iterations: 5

จะพบว่าหลังจากปรับตามระยะเวลาติดตาม กลุ่มอายุที่เพิ่มขึ้นจะส่งผลให้การตายเพิ่มขึ้นอย่างมีนัยสำคัญ

new_data <- expand_grid(age_group = 1:4,
                       histo_major = unique(cervix_mort$histo_major),
                       stage = 1:4,
                       person_month = 1200)
new_data <- new_data |> 
  mutate(predicted_dead = predict(pois_fit, newdata = new_data, type = "response"))

ggplot(new_data, aes (x = age_group, y = predicted_dead, col = histo_major)) +
  geom_line() +
  facet_wrap(~stage)+
  scale_y_continuous(breaks = seq(0,60,10)) +
  theme_bw()

อย่างไรก็ตาม พิจารณาดูแล้วอัตราการตายไม่ได้เพิ่มขึ้นมากตามกลุ่มอายุ ซึ่งอาจจะเป็นผลบวกลวง เมื่อกลับมาพิจารณาแล้ว Poisson regression มีสมมติฐานที่สำคัญตาม Poisson distribution คือ

ต้องเป็นจำนวนนับ
การกระจายตัวของข้อมูล = Dispersion parameter (\(\phi\)) = Variance/Mean = 1

13.3.2 Quasipoisson regression

เมื่อตรวจสอบจากผลลัพธ์ของสมการถอดถอย โดยพิจารณาจาก \(\text{Residuals/Deviance}\) แล้วพบว่า \(\phi > 1\) ซึ่งหมายความว่าข้อมูลตั้งต้นนั้นมีการกระจายตัวของข้อมูลสูงเกินไป (Overdispersed) จึงไม่เป็นไปตามสมมติฐานของ Poisson ณ จุดนี้จึงจำเป็นต้องใช้ ส่วนขยายสมการของ Poisson regression เรียกว่า Quasipoisson regression ซึ่งจะทำการปรับความแปรปรวน ตามการเพิ่มขึ้นของความแปรปรวนต่อค่าเฉลี่ย

\[ V = \phi\mu_{i} \] \[ \phi = \frac{\chi^{2}}{\text{DF}} = \frac{1}{n-k}{\sum_{i=1}^{n}\frac{(Y_{i}-\hat{\mu_{i}})^{2}}{\hat{\mu_{i}}}} \]

quasipois_fit <- glm(dead ~ age_group + histo_major + stage + 
                       offset(log(person_month)),
                family = "quasipoisson", data = cervix_mort)
summary(quasipois_fit)

## 
## Call:
## glm(formula = dead ~ age_group + histo_major + stage + offset(log(person_month)), 
##     family = "quasipoisson", data = cervix_mort)
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        -7.138354   0.256738 -27.804   <2e-16 ***
## age_group                           0.037519   0.103913   0.361    0.719    
## histo_majorAdenocarcinoma          -0.076841   0.185263  -0.415    0.680    
## histo_majorAdenosquamous carcinoma -0.001698   0.356744  -0.005    0.996    
## histo_majorOthers                   0.408249   0.211657   1.929    0.059 .  
## stage                               0.918984   0.056517  16.260   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 50.98103)
## 
##     Null deviance: 15452.8  on 59  degrees of freedom
## Residual deviance:  2386.7  on 54  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5

จะเห็นว่าหลังจากปรับ Dispersion parameter แล้ว age_group ไม่ได้ส่งผลให้การตายเพิ่มขึ้นอย่างมีนัยสำคัญแต่อย่างใด

13.3.3 Negative binomial regression

ปัญหาของ Quasipoisson คือ อาจจะไม่มี Parameter ที่ต้องการ เช่น AIC เป็นต้น ซึ่งสามารถใช้อีกหนึ่งสมการถดถอยที่นิยมใช้คือ Negative binomial regression จาก Package MASS ซึ่งปรับตามอัตราส่วนความแปรปรวนเช่นกัน

\[ V = u_{i}(1+\frac{u_{i}}{\phi}) \]

nb_fit <- MASS::glm.nb(dead ~  
                         age_group + histo_major + stage + 
                         offset(log(person_month)), 
                 data = cervix_mort)
summary(nb_fit)

## 
## Call:
## MASS::glm.nb(formula = dead ~ age_group + histo_major + stage + 
##     offset(log(person_month)), data = cervix_mort, init.theta = 1.720086921, 
##     link = log)
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        -6.74220    0.37648 -17.908   <2e-16 ***
## age_group                           0.13619    0.09497   1.434    0.152    
## histo_majorAdenocarcinoma          -0.11835    0.28789  -0.411    0.681    
## histo_majorAdenosquamous carcinoma -0.09216    0.30632  -0.301    0.764    
## histo_majorOthers                   0.16574    0.28407   0.583    0.560    
## stage                               0.82974    0.08937   9.285   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.7201) family taken to be 1)
## 
##     Null deviance: 144.871  on 59  degrees of freedom
## Residual deviance:  69.168  on 54  degrees of freedom
## AIC: 598.94
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.720 
##           Std. Err.:  0.339 
## 
##  2 x log-likelihood:  -584.937

ผลลัพธ์ที่ได้จะใกล้เคียงกันกับสมการถดถอย Quasipoisson ทั้งสองสมการนี้จะนิยมใช้ในการตั้งสมการถดถอยในจำนวนนับจากงานทดลอง High-throughput เช่น RNA-sequencing เนื่องจากข้อมูลประเภทนี้มักมีเป็นข้อมูลจำนวนนับที่มีการกระจายตัวสูงมาก (\(\phi > 1\))

12 False discovery

14 Survival analysis