hist(Depression_DF_no_NA$PHQ9.Severity.Final)
hist(log(Depression_DF_no_NA$PHQ9.Severity.Final))
The linear model may not be adequate. Distribution ressembles poisson distribution.
model <- lm(PHQ9.Severity.Final~ Sex*scale(max_age_MHQ, scale = FALSE) + Sex*I(scale(max_age_MHQ, scale = FALSE)^2), data = Depression_DF_no_NA, na.action = na.exclude)
summary(model)
##
## Call:
## lm(formula = PHQ9.Severity.Final ~ Sex * scale(max_age_MHQ, scale = FALSE) +
## Sex * I(scale(max_age_MHQ, scale = FALSE)^2), data = Depression_DF_no_NA,
## na.action = na.exclude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.544 -2.347 -1.133 1.054 25.081
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 2.6846888 0.0132668 202.362
## Sex 0.5577549 0.0265335 21.021
## scale(max_age_MHQ, scale = FALSE) -0.0662899 0.0012603 -52.598
## I(scale(max_age_MHQ, scale = FALSE)^2) 0.0015752 0.0001553 10.146
## Sex:scale(max_age_MHQ, scale = FALSE) -0.0052290 0.0025206 -2.074
## Sex:I(scale(max_age_MHQ, scale = FALSE)^2) -0.0003560 0.0003105 -1.146
## Pr(>|t|)
## (Intercept) <2e-16 ***
## Sex <2e-16 ***
## scale(max_age_MHQ, scale = FALSE) <2e-16 ***
## I(scale(max_age_MHQ, scale = FALSE)^2) <2e-16 ***
## Sex:scale(max_age_MHQ, scale = FALSE) 0.038 *
## Sex:I(scale(max_age_MHQ, scale = FALSE)^2) 0.252
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.692 on 157014 degrees of freedom
## Multiple R-squared: 0.02784, Adjusted R-squared: 0.02781
## F-statistic: 899.4 on 5 and 157014 DF, p-value: < 2.2e-16
plot(model)
Residuals are not normally distributed. Let us try poisson regression.
Poisson regression is often used for modeling count data.
Assumption : conditional variance is equal to the conditional mean -> test overdispersion
model1 <- glm(PHQ9.Severity.Final~ Sex*scale(max_age_MHQ, scale = FALSE) + Sex*I(scale(max_age_MHQ, scale = FALSE)^2), data = Depression_DF_no_NA, family="poisson", na.action = na.exclude)
summary(model1)
##
## Call:
## glm(formula = PHQ9.Severity.Final ~ Sex * scale(max_age_MHQ,
## scale = FALSE) + Sex * I(scale(max_age_MHQ, scale = FALSE)^2),
## family = "poisson", data = Depression_DF_no_NA, na.action = na.exclude)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0294 -2.0945 -0.7920 0.6138 9.6361
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 0.9830455 0.0021924 448.384
## Sex 0.2083897 0.0043848 47.525
## scale(max_age_MHQ, scale = FALSE) -0.0237423 0.0002141 -110.916
## I(scale(max_age_MHQ, scale = FALSE)^2) 0.0002648 0.0000252 10.506
## Sex:scale(max_age_MHQ, scale = FALSE) 0.0024630 0.0004281 5.753
## Sex:I(scale(max_age_MHQ, scale = FALSE)^2) -0.0001671 0.0000504 -3.316
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## Sex < 2e-16 ***
## scale(max_age_MHQ, scale = FALSE) < 2e-16 ***
## I(scale(max_age_MHQ, scale = FALSE)^2) < 2e-16 ***
## Sex:scale(max_age_MHQ, scale = FALSE) 8.76e-09 ***
## Sex:I(scale(max_age_MHQ, scale = FALSE)^2) 0.000913 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 646321 on 157019 degrees of freedom
## Residual deviance: 625054 on 157014 degrees of freedom
## AIC: 943039
##
## Number of Fisher Scoring iterations: 6
Our residual deviance is 625054 for 157014 degrees of freedom. The rule of thumb is ratio = 1, here : 625054/157014 = 3.98 - So we have moderate dispersion, which we can also test with a dispersion test
library("AER")
## Loading required package: car
## Loading required package: carData
## Registered S3 methods overwritten by 'car':
## method from
## influence.merMod lme4
## cooks.distance.influence.merMod lme4
## dfbeta.influence.merMod lme4
## dfbetas.influence.merMod lme4
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: survival
dispersiontest(model1)
##
## Overdispersion test
##
## data: model1
## z = 111.52, p-value < 2.2e-16
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion
## 4.777168
Conditional variance exceedes the conditional mean -> test overdispersion
Maybe our distributional assumption was simply wrong, and we choose a different distribution
https://biometry.github.io/APES/LectureNotes/2016-JAGS/Overdispersion/OverdispersionJAGS.pdf https://stats.idre.ucla.edu/r/dae/negative-binomial-regression/
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
model_2 <- glm.nb(PHQ9.Severity.Final~ Sex*scale(max_age_MHQ, scale = FALSE)+ Sex*I(scale(max_age_MHQ, scale = FALSE)^2), data = Depression_DF_no_NA, na.action = na.exclude)
summary(model_2)
##
## Call:
## glm.nb(formula = PHQ9.Severity.Final ~ Sex * scale(max_age_MHQ,
## scale = FALSE) + Sex * I(scale(max_age_MHQ, scale = FALSE)^2),
## data = Depression_DF_no_NA, na.action = na.exclude, init.theta = 0.7475254415,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7216 -1.4289 -0.4096 0.2778 3.5664
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 9.758e-01 4.708e-03 207.253
## Sex 2.107e-01 9.416e-03 22.377
## scale(max_age_MHQ, scale = FALSE) -2.352e-02 4.488e-04 -52.403
## I(scale(max_age_MHQ, scale = FALSE)^2) 3.838e-04 5.491e-05 6.989
## Sex:scale(max_age_MHQ, scale = FALSE) 2.541e-03 8.976e-04 2.831
## Sex:I(scale(max_age_MHQ, scale = FALSE)^2) -1.960e-04 1.098e-04 -1.785
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## Sex < 2e-16 ***
## scale(max_age_MHQ, scale = FALSE) < 2e-16 ***
## I(scale(max_age_MHQ, scale = FALSE)^2) 2.76e-12 ***
## Sex:scale(max_age_MHQ, scale = FALSE) 0.00464 **
## Sex:I(scale(max_age_MHQ, scale = FALSE)^2) 0.07427 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.7475) family taken to be 1)
##
## Null deviance: 174645 on 157019 degrees of freedom
## Residual deviance: 170198 on 157014 degrees of freedom
## AIC: 680683
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.74753
## Std. Err.: 0.00389
##
## 2 x log-likelihood: -680669.48500
1 - pchisq(summary(model_2)$deviance,
summary(model_2)$df.residual)
## [1] 0
The ratio of deviance 170198/157014 near 1 and thus fine. We can also check with a dispersion test.
library(DHARMa)
## This is DHARMa 0.4.1. For overview type '?DHARMa'. For recent changes, type news(package = 'DHARMa') Note: Syntax of plotResiduals has changed in 0.3.0, see ?plotResiduals for details
simulationOutput <- simulateResiduals(model_2)
testDispersion(simulationOutput)
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 0.97707, p-value = 0.008
## alternative hypothesis: two.sided
testZeroInflation(simulationOutput)
##
## DHARMa zero-inflation test via comparison to expected zeros with
## simulation under H0 = fitted model
##
## data: simulationOutput
## ratioObsSim = 1.022, p-value < 2.2e-16
## alternative hypothesis: two.sided
“As we mentioned earlier, negative binomial models assume the conditional means are not equal to the conditional variances. This inequality is captured by estimating a dispersion parameter (not shown in the output) that is held constant in a Poisson model. Thus, the Poisson model is actually nested in the negative binomial model. We can then use a likelihood ratio test to compare these two and test this model assumption.”
https://stats.idre.ucla.edu/r/dae/negative-binomial-regression/
pchisq(2 * (logLik(model_2) - logLik(model1)), df = 1, lower.tail = FALSE)
## 'log Lik.' 0 (df=7)
library("lmtest")
lrtest(model1, model_2)
## Likelihood ratio test
##
## Model 1: PHQ9.Severity.Final ~ Sex * scale(max_age_MHQ, scale = FALSE) +
## Sex * I(scale(max_age_MHQ, scale = FALSE)^2)
## Model 2: PHQ9.Severity.Final ~ Sex * scale(max_age_MHQ, scale = FALSE) +
## Sex * I(scale(max_age_MHQ, scale = FALSE)^2)
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 6 -471514
## 2 7 -340335 1 262358 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this example the associated chi-squared value estimated from 2*(logLik(m1) – logLik(m3)) is 926.03 with one degree of freedom. This strongly suggests the negative binomial model, estimating the dispersion parameter, is more appropriate than the Poisson model.
Model_summ_coef <- as.data.frame(summary(model_2)$coefficients)
Model_summ_coef$Model <- "PHQ9"
names(Model_summ_coef) <- c("Estimate", "SE", "t/z", "p", "Model")
fwrite( Model_summ_coef, "/scratch2/ukbio/biobank/High_IQ_Project/results/Sex_Age_Effects_PHQ9.csv")