The following four data sets were constructed by Anscombe (1973). The regression output for the four sets is identical (to two decimal places) in every respect, but their scatter plots are dramatically different. Looking only at the numerical regression output may therefore lead to very misleading conclusions about the data and to adopting the wrong model.
fit1=lm(y1~x1, data=anscombe);
summary(fit1)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0000909 1.1247468 2.667348 0.025734051
## x1 0.5000909 0.1179055 4.241455 0.002169629
fit2=lm(y2~x2, data=anscombe);
summary(fit2)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.000909 1.1253024 2.666758 0.025758941
## x2 0.500000 0.1179637 4.238590 0.002178816
fit3=lm(y3~x3, data=anscombe);
summary(fit3)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0024545 1.1244812 2.670080 0.025619109
## x3 0.4997273 0.1178777 4.239372 0.002176305
fit4=lm(y4~x4, data=anscombe);
summary(fit4)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0017273 1.1239211 2.670763 0.025590425
## x4 0.4999091 0.1178189 4.243028 0.002164602
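Plotting the four data sets makes the point immediately; a minimal sketch using base graphics, where the red line is the common fit with intercept 3 and slope 0.5 from the tables above:
par(mfrow=c(2,2))
for (i in 1:4) {
  x = anscombe[[paste0("x", i)]]
  y = anscombe[[paste0("y", i)]]
  plot(x, y, main=paste("Anscombe data set", i), xlab=paste0("x", i), ylab=paste0("y", i))
  abline(a=3, b=0.5, col="red")  # the (nearly) identical fitted line
}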
The savings data frame has 50 rows and 5 columns. The data are averaged over the period 1960-1970. This data frame contains the following columns: sr (aggregate personal savings rate), pop15 (percentage of population under 15), pop75 (percentage of population over 75), dpi (real per-capita disposable income), and ddpi (percentage growth rate of dpi).
library(faraway)
attach(savings)
country=dimnames(savings)[[1]]
n=50; p=5;
g = lm(sr ~ ., data=savings);
lev = influence(g)$hat
lev[lev > 2*p/n]  # rule of thumb: leverage above 2p/n counts as high
## Ireland Japan United States Libya
## 0.2122363 0.2233099 0.3336880 0.5314568
halfnorm(lev, 4, labs=row.names(savings), ylab="Leverages")
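Leverages are the diagonal entries of the hat matrix H = X(X'X)^{-1}X'. As a sanity check, they can be computed directly from the design matrix (a sketch; hatvalues(g) is the idiomatic shortcut):
X = model.matrix(g)
h = diag(X %*% solve(t(X) %*% X) %*% t(X))  # diagonal of the hat matrix
all.equal(unname(h), unname(lev))           # TRUE: matches influence(g)$hat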
Which countries have high leverage?
savings[lev > 2*p/n,]
## sr pop15 pop75 dpi ddpi
## Ireland 11.34 31.16 4.19 1139.95 2.99
## Japan 21.10 27.01 1.91 1257.28 8.21
## United States 7.56 29.81 3.43 4001.89 2.45
## Libya 8.89 43.69 2.07 123.58 16.71
jack = rstudent(g);
qt(.05/(2*n), 44)  # Bonferroni-corrected critical value, df = n - p - 1 = 44
## [1] -3.525801
qt(.05/2, 44)  # uncorrected critical value, for comparison
## [1] -2.015368
sort(abs(jack), decreasing=TRUE)[1:5]  # largest is 2.85 < 3.53: no outliers
## Zambia Chile Philippines Peru Iceland
## 2.853558 2.313429 1.863826 1.823914 1.731200
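Since the largest |studentized residual| (Zambia, 2.85) is below the Bonferroni critical value 3.53, no case is declared an outlier. The test can also be written out explicitly (a minimal sketch):
crit = abs(qt(.05/(2*n), n-p-1))  # df = 50 - 5 - 1 = 44
which(abs(jack) > crit)           # empty: no outliers at the 5% level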
cook = cooks.distance(g)
halfnorm(cook, labs=row.names(savings), ylab="Cook's distances")
max(cook)
## [1] 0.2680704
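R can also aggregate several diagnostics at once; summary(influence.measures(g)) flags potentially influential cases by DFBETAs, DFFITS, leverage, and Cook's distance (output omitted here):
summary(influence.measures(g))  # flags potentially influential cases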
Although there are no highly influential points by the rule of thumb, Cook's distance for Libya is much larger than for the other countries. So let's remove Libya, refit the model, and compare the result with the original fit, which is repeated below.
summary(lm(sr~., data=savings[row.names(savings) != 'Libya',]))
summary(g)  # the original fit on all 50 countries
##
## Call:
## lm(formula = sr ~ ., data = savings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2422 -2.6857 -0.2488 2.4280 9.7509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.5660865 7.3545161 3.884 0.000334 ***
## pop15 -0.4611931 0.1446422 -3.189 0.002603 **
## pop75 -1.6914977 1.0835989 -1.561 0.125530
## dpi -0.0003369 0.0009311 -0.362 0.719173
## ddpi 0.4096949 0.1961971 2.088 0.042471 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.803 on 45 degrees of freedom
## Multiple R-squared: 0.3385, Adjusted R-squared: 0.2797
## F-statistic: 5.756 on 4 and 45 DF, p-value: 0.0007904
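A compact way to compare the two fits is to place the coefficient vectors side by side (a sketch; the variable name g.sub is mine):
g.sub = lm(sr ~ ., data=savings[row.names(savings) != "Libya", ])
round(cbind(full=coef(g), no.Libya=coef(g.sub)), 4)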
An experiment is conducted to investigate the amount of a particular drug present in the liver of a rat: 19 rats are randomly selected, weighed, and given an oral dose of the drug approximately proportional to body weight. …
rat=read.table("rat.dat");
dim(rat)
## [1] 19 4
head(rat)
## V1 V2 V3 V4
## 1 176 6.5 0.88 0.42
## 2 176 9.5 0.88 0.25
## 3 190 9.0 1.00 0.56
## 4 176 8.9 0.88 0.23
## 5 200 7.2 1.00 0.23
## 6 167 8.9 0.83 0.32
“dose” is highly correlated with “body” by the design of the experiment, and “body” and “liver” are known to be correlated. Note that the correlations between the predictors and Y are all low.
names(rat)=c("body", "liver", "dose", "Y")
cor(rat)
## body liver dose Y
## body 1.0000000 0.5000101 0.9902126 0.1510855
## liver 0.5000101 1.0000000 0.4900711 0.2033302
## dose 0.9902126 0.4900711 1.0000000 0.2275436
## Y 0.1510855 0.2033302 0.2275436 1.0000000
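With a 0.99 correlation between body and dose, collinearity is the obvious concern. A minimal sketch computes variance inflation factors by hand, VIF_j = 1/(1 - R_j^2), where R_j^2 comes from regressing predictor j on the other two:
preds = c("body", "liver", "dose")
sapply(preds, function(v) {
  r2 = summary(lm(reformulate(setdiff(preds, v), response=v), data=rat))$r.squared
  1/(1 - r2)  # VIF for predictor v
})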
How many sub-models can we fit on these data? Seven, or eight if we include the intercept-only model.
We tried all seven models. It is surprising to see that whenever “body” and “dose” are both included in the model, both become significant; otherwise, no variables are significant.
Y ~ body, Y ~ liver, Y ~ dose.
summary(lm(Y~body, data=rat))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1962346237 0.221582485 0.8856053 0.3881858
## body 0.0008105376 0.001286209 0.6301757 0.5369590
summary(lm(Y~liver, data=rat))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.22037463 0.13572725 1.6236579 0.1228446
## liver 0.01470945 0.01717914 0.8562387 0.4037739
summary(lm(Y~dose, data=rat))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1330050 0.2109117 0.6306194 0.5366756
## dose 0.2346096 0.2435074 0.9634599 0.3488226
Y ~ body + liver, Y ~ body + dose, Y ~ liver + dose.
summary(lm(Y~body+liver, data=rat))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.178356944 0.227775476 0.7830384 0.4450409
## body 0.000353495 0.001513754 0.2335221 0.8183175
## liver 0.012325997 0.020412653 0.6038410 0.5544149
summary(lm(Y~body+dose, data=rat))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.28551741 0.19126688 1.492770 0.15495313
## body -0.02044423 0.00783845 -2.608198 0.01902132
## dose 4.12532992 1.50647192 2.738405 0.01457729
summary(lm(Y~liver+dose, data=rat))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.117365750 0.21909377 0.5356873 0.5995428
## liver 0.008741867 0.02008518 0.4352397 0.6692025
## dose 0.173550533 0.28626117 0.6062664 0.5528430
Y ~ body + liver + dose.
summary(lm(Y~., data=rat))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.26592177 0.194585255 1.3666080 0.19188433
## body -0.02124634 0.007974304 -2.6643501 0.01767688
## liver 0.01429806 0.017217141 0.8304549 0.41930376
## dose 4.17811141 1.522625206 2.7440183 0.01506639
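All seven fits above can be reproduced in one loop; a sketch using combn and reformulate (the loop is my own restatement, not from the original analysis):
preds = c("body", "liver", "dose")
for (k in 1:3) {
  for (v in combn(preds, k, simplify=FALSE)) {
    f = reformulate(v, response="Y")
    cat(deparse(f), ": p =", round(summary(lm(f, data=rat))$coef[-1, 4], 4), "\n")
  }
}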
fit=lm(Y~., data=rat);
par(mfrow=c(2,2)); plot(fit)
lev=influence(fit)$hat
halfnorm(lev, 4, ylab="Leverages")
sr.ex=rstudent(fit);
sort(sr.ex, decreasing=TRUE)[1:5]
## 19 1 18 3 7
## 2.1388332 1.9170719 0.8426711 0.7972915 0.7774937
qt(0.05/(19*2), 14)  # Bonferroni-corrected critical value, df = 19 - 4 - 1 = 14
## [1] -3.64871
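The largest |studentized residual| (case 19, 2.14) is well below the Bonferroni cutoff 3.65, so no case is flagged as an outlier:
max(abs(sr.ex)) > abs(qt(0.05/(19*2), 14))  # FALSE: no outliers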
cook = cooks.distance(fit)
halfnorm(cook, 4, ylab="Cook's distances")
max(cook)
## [1] 0.929616
rat
## body liver dose Y
## 1 176 6.5 0.88 0.42
## 2 176 9.5 0.88 0.25
## 3 190 9.0 1.00 0.56
## 4 176 8.9 0.88 0.23
## 5 200 7.2 1.00 0.23
## 6 167 8.9 0.83 0.32
## 7 188 8.0 0.94 0.37
## 8 195 10.0 0.98 0.41
## 9 176 8.0 0.88 0.33
## 10 165 7.9 0.84 0.38
## 11 158 6.9 0.80 0.27
## 12 148 7.3 0.74 0.36
## 13 149 5.2 0.75 0.21
## 14 163 8.4 0.81 0.28
## 15 170 7.2 0.85 0.34
## 16 186 6.8 0.94 0.28
## 17 146 7.3 0.73 0.30
## 18 181 9.0 0.90 0.37
## 19 149 6.4 0.75 0.46
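The flagged case can also be located programmatically:
which.max(cook)  # case 3 has the largest Cook's distance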
Remove the 3rd sample, and refit the model.
newfit=lm(Y~., data=rat, subset=(1:19)[-3])
summary(newfit)
##
## Call:
## lm(formula = Y ~ ., data = rat, subset = (1:19)[-3])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.102154 -0.056486 0.002838 0.046519 0.137059
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.311427 0.205094 1.518 0.151
## body -0.007783 0.018717 -0.416 0.684
## liver 0.008989 0.018659 0.482 0.637
## dose 1.484877 3.713064 0.400 0.695
##
## Residual standard error: 0.07825 on 14 degrees of freedom
## Multiple R-squared: 0.02106, Adjusted R-squared: -0.1887
## F-statistic: 0.1004 on 3 and 14 DF, p-value: 0.9585
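Placing the two coefficient vectors side by side makes the collapse explicit: with case 3 removed, the apparent body and dose effects vanish along with the fit's R-squared.
round(cbind(full=coef(fit), drop.case3=coef(newfit)), 4)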