1. Anscombe’s Examples

The following four data sets were constructed by Anscombe (1973). The regression output for the four sets is identical (to two decimal places) in every respect, but their scatter plots are dramatically different. Looking only at the numerical regression output can therefore be very misleading and can lead to adopting the wrong model.

fit1=lm(y1~x1, data=anscombe);
summary(fit1)$coef
##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0000909  1.1247468 2.667348 0.025734051
## x1          0.5000909  0.1179055 4.241455 0.002169629
fit2=lm(y2~x2, data=anscombe);
summary(fit2)$coef
##             Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.000909  1.1253024 2.666758 0.025758941
## x2          0.500000  0.1179637 4.238590 0.002178816
fit3=lm(y3~x3, data=anscombe);
summary(fit3)$coef
##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0024545  1.1244812 2.670080 0.025619109
## x3          0.4997273  0.1178777 4.239372 0.002176305
fit4=lm(y4~x4, data=anscombe);
summary(fit4)$coef
##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0017273  1.1239211 2.670763 0.025590425
## x4          0.4999091  0.1178189 4.243028 0.002164602
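
Since the fitted coefficients, standard errors, and t-statistics are essentially identical, the differences only show up graphically. A minimal plotting sketch with base graphics:

par(mfrow=c(2,2))
for (i in 1:4) {
  x = anscombe[[paste0("x", i)]]
  y = anscombe[[paste0("y", i)]]
  plot(x, y, main=paste("Data set", i), xlab=paste0("x", i), ylab=paste0("y", i))
  abline(lm(y ~ x), col="red")   # essentially the same line in all four panels
}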

2. Re-visit the Savings Data

The savings data frame has 50 rows and 5 columns; the data are averaged over the period 1960-1970. The columns are: sr (aggregate personal savings rate), pop15 (percentage of population under 15), pop75 (percentage of population over 75), dpi (real per-capita disposable income), and ddpi (percentage growth rate of dpi).

library(faraway)
attach(savings)
country=dimnames(savings)[[1]]

n=50; p=5;                      # n = sample size, p = number of coefficients
g = lm(sr ~., data=savings);
lev=influence(g)$hat            # leverages: diagonal of the hat matrix
lev[lev>2*p/n]                  # rule of thumb: flag leverages above 2p/n
##       Ireland         Japan United States         Libya 
##     0.2122363     0.2233099     0.3336880     0.5314568
halfnorm(lev, 4, labs=row.names(savings), ylab="Leverages")
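
As a sanity check, the leverages are just the diagonal of the hat matrix H = X(X'X)^(-1)X', and they always sum to p, the number of coefficients; a quick sketch:

X = model.matrix(g)
H = X %*% solve(t(X) %*% X) %*% t(X)            # hat matrix
all.equal(as.numeric(diag(H)), as.numeric(lev)) # matches influence(g)$hat
sum(lev)                                        # equals p = 5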

Which countries have high leverage?

savings[lev > 2*p/n,]
##                  sr pop15 pop75     dpi  ddpi
## Ireland       11.34 31.16  4.19 1139.95  2.99
## Japan         21.10 27.01  1.91 1257.28  8.21
## United States  7.56 29.81  3.43 4001.89  2.45
## Libya          8.89 43.69  2.07  123.58 16.71
jack=rstudent(g)   # externally studentized (jackknife) residuals
qt(.05/(2*n), 44) # Bonferroni correction
## [1] -3.525801
qt(.05/2, 44)
## [1] -2.015368
sort(abs(jack), decreasing=TRUE)[1:5]  # No outliers. 
##      Zambia       Chile Philippines        Peru     Iceland 
##    2.853558    2.313429    1.863826    1.823914    1.731200
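
Equivalently, one can compute the Bonferroni-adjusted two-sided p-value for the most extreme studentized residual; a quick sketch (it comes out well above 0.05):

2 * n * pt(-max(abs(jack)), df = n - p - 1)   # Bonferroni-adjusted p-value
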
cook = cooks.distance(g)
halfnorm(cook, labs=row.names(savings), ylab="Cook's distances")

max(cook)
## [1] 0.2680704
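
To see which country attains this maximum (Libya, as the half-normal plot shows) and how each coefficient would shift if that case were deleted, a quick check:

which.max(cook)                # the case with the largest Cook's distance
dfbeta(g)[which.max(cook), ]   # change in each coefficient if it is dropped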

Although no point stands out as highly influential by the usual rule of thumb (no Cook's distance is anywhere near 1), Cook's distance for Libya is much larger than for the other countries. So let's remove Libya, refit the model, and compare the result with the full-data fit below.

g2 = lm(sr ~ ., data=savings[row.names(savings) != 'Libya',])
summary(g)
## 
## Call:
## lm(formula = sr ~ ., data = savings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2422 -2.6857 -0.2488  2.4280  9.7509 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.5660865  7.3545161   3.884 0.000334 ***
## pop15       -0.4611931  0.1446422  -3.189 0.002603 ** 
## pop75       -1.6914977  1.0835989  -1.561 0.125530    
## dpi         -0.0003369  0.0009311  -0.362 0.719173    
## ddpi         0.4096949  0.1961971   2.088 0.042471 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.803 on 45 degrees of freedom
## Multiple R-squared:  0.3385, Adjusted R-squared:  0.2797 
## F-statistic: 5.756 on 4 and 45 DF,  p-value: 0.0007904
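
A compact way to see what deleting Libya changes is to line up the two coefficient vectors (g2 is the refit defined above); expect the ddpi coefficient to shift noticeably, given Libya's extreme ddpi value of 16.71:

round(cbind(full = coef(g), without_Libya = coef(g2)), 4)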

3. Rat Data, from Weisberg (1980)

An experiment is conducted to investigate the amount of a particular drug present in the liver of a rat: 19 rats are randomly selected, weighed, and given an oral dose of the drug approximately proportional to body weight. …

rat=read.table("rat.dat"); 
dim(rat)
## [1] 19  4
head(rat)
##    V1  V2   V3   V4
## 1 176 6.5 0.88 0.42
## 2 176 9.5 0.88 0.25
## 3 190 9.0 1.00 0.56
## 4 176 8.9 0.88 0.23
## 5 200 7.2 1.00 0.23
## 6 167 8.9 0.83 0.32

“dose” is highly correlated with “body” by the design of the experiment, and “body” and “liver” are known to be correlated. Note that the correlations between the predictors and Y are all low.

names(rat)=c("body", "liver", "dose", "Y")
cor(rat)  
##            body     liver      dose         Y
## body  1.0000000 0.5000101 0.9902126 0.1510855
## liver 0.5000101 1.0000000 0.4900711 0.2033302
## dose  0.9902126 0.4900711 1.0000000 0.2275436
## Y     0.1510855 0.2033302 0.2275436 1.0000000
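
The 0.99 correlation between body and dose suggests serious collinearity. A quick check with faraway's vif() on the predictor matrix (a sketch; expect very large values for body and dose):

x = model.matrix(lm(Y ~ ., data=rat))[, -1]   # predictor columns, intercept dropped
vif(x)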

How many sub-models can we fit on these data? With three predictors there are 2^3 - 1 = 7 (or eight, if we include the intercept-only model).

We fit all seven models below. It is striking that whenever “body” and “dose” are both included in the model, both become significant; otherwise, no variable is significant.

summary(lm(Y~body, data=rat))$coef
##                 Estimate  Std. Error   t value  Pr(>|t|)
## (Intercept) 0.1962346237 0.221582485 0.8856053 0.3881858
## body        0.0008105376 0.001286209 0.6301757 0.5369590
summary(lm(Y~liver, data=rat))$coef
##               Estimate Std. Error   t value  Pr(>|t|)
## (Intercept) 0.22037463 0.13572725 1.6236579 0.1228446
## liver       0.01470945 0.01717914 0.8562387 0.4037739
summary(lm(Y~dose, data=rat))$coef
##              Estimate Std. Error   t value  Pr(>|t|)
## (Intercept) 0.1330050  0.2109117 0.6306194 0.5366756
## dose        0.2346096  0.2435074 0.9634599 0.3488226
summary(lm(Y~body+liver, data=rat))$coef
##                Estimate  Std. Error   t value  Pr(>|t|)
## (Intercept) 0.178356944 0.227775476 0.7830384 0.4450409
## body        0.000353495 0.001513754 0.2335221 0.8183175
## liver       0.012325997 0.020412653 0.6038410 0.5544149
summary(lm(Y~body+dose, data=rat))$coef
##                Estimate Std. Error   t value   Pr(>|t|)
## (Intercept)  0.28551741 0.19126688  1.492770 0.15495313
## body        -0.02044423 0.00783845 -2.608198 0.01902132
## dose         4.12532992 1.50647192  2.738405 0.01457729
summary(lm(Y~liver+dose, data=rat))$coef
##                Estimate Std. Error   t value  Pr(>|t|)
## (Intercept) 0.117365750 0.21909377 0.5356873 0.5995428
## liver       0.008741867 0.02008518 0.4352397 0.6692025
## dose        0.173550533 0.28626117 0.6062664 0.5528430
summary(lm(Y~., data=rat))$coef
##                Estimate  Std. Error    t value   Pr(>|t|)
## (Intercept)  0.26592177 0.194585255  1.3666080 0.19188433
## body        -0.02124634 0.007974304 -2.6643501 0.01767688
## liver        0.01429806 0.017217141  0.8304549 0.41930376
## dose         4.17811141 1.522625206  2.7440183 0.01506639
fit=lm(Y~., data=rat);
par(mfrow=c(2,2)); plot(fit)

lev=influence(fit)$hat
halfnorm(lev, 4, ylab="Leverages")

sr.ex=rstudent(fit); 
sort(sr.ex, decreasing=TRUE)[1:5]
##        19         1        18         3         7 
## 2.1388332 1.9170719 0.8426711 0.7972915 0.7774937
qt(0.05/(19*2), 14)  # Bonferroni-corrected critical value, df = n - p - 1 = 14
## [1] -3.64871
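
As with the savings data, a sketch of the Bonferroni-adjusted p-value for the largest studentized residual confirms there is nothing extreme here either:

2 * 19 * pt(-max(abs(sr.ex)), df = 14)   # well above 0.05, so no outliers
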
cook = cooks.distance(fit)
halfnorm(cook, 4, ylab="Cook's distances")

max(cook)
## [1] 0.929616
rat
##    body liver dose    Y
## 1   176   6.5 0.88 0.42
## 2   176   9.5 0.88 0.25
## 3   190   9.0 1.00 0.56
## 4   176   8.9 0.88 0.23
## 5   200   7.2 1.00 0.23
## 6   167   8.9 0.83 0.32
## 7   188   8.0 0.94 0.37
## 8   195  10.0 0.98 0.41
## 9   176   8.0 0.88 0.33
## 10  165   7.9 0.84 0.38
## 11  158   6.9 0.80 0.27
## 12  148   7.3 0.74 0.36
## 13  149   5.2 0.75 0.21
## 14  163   8.4 0.81 0.28
## 15  170   7.2 0.85 0.34
## 16  186   6.8 0.94 0.28
## 17  146   7.3 0.73 0.30
## 18  181   9.0 0.90 0.37
## 19  149   6.4 0.75 0.46
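
The half-normal plot flags case 3, which the listing above shows received a relatively high dose for its body weight; a quick confirmation:

which.max(cook)   # the case with the largest Cook's distance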

Remove the 3rd observation (the case with the largest Cook's distance) and refit the model.

newfit=lm(Y~., data=rat, subset=(1:19)[-3])
summary(newfit)
## 
## Call:
## lm(formula = Y ~ ., data = rat, subset = (1:19)[-3])
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.102154 -0.056486  0.002838  0.046519  0.137059 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.311427   0.205094   1.518    0.151
## body        -0.007783   0.018717  -0.416    0.684
## liver        0.008989   0.018659   0.482    0.637
## dose         1.484877   3.713064   0.400    0.695
## 
## Residual standard error: 0.07825 on 14 degrees of freedom
## Multiple R-squared:  0.02106,    Adjusted R-squared:  -0.1887 
## F-statistic: 0.1004 on 3 and 14 DF,  p-value: 0.9585
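
After deleting case 3, nothing is significant (the overall F-test has p-value 0.96): the apparent significance of body and dose in the full fit was driven by that single influential observation. A final sketch lines up the two coefficient vectors:

round(cbind(full = coef(fit), without_case3 = coef(newfit)), 4)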