1. Anscombe’s Examples

The following four data sets were constructed by Anscombe (1973). The regression output for the four sets is identical (to two decimal places) in every respect, but their scatter plots are dramatically different. Looking only at the numerical regression output can therefore be very misleading and can lead to adopting the wrong model.

fit1=lm(y1~x1, data=anscombe);
summary(fit1)$coef
##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0000909  1.1247468 2.667348 0.025734051
## x1          0.5000909  0.1179055 4.241455 0.002169629
fit2=lm(y2~x2, data=anscombe);
summary(fit2)$coef
##             Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.000909  1.1253024 2.666758 0.025758941
## x2          0.500000  0.1179637 4.238590 0.002178816
fit3=lm(y3~x3, data=anscombe);
summary(fit3)$coef
##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0024545  1.1244812 2.670080 0.025619109
## x3          0.4997273  0.1178777 4.239372 0.002176305
fit4=lm(y4~x4, data=anscombe);
summary(fit4)$coef
##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0017273  1.1239211 2.670763 0.025590425
## x4          0.4999091  0.1178189 4.243028 0.002164602
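
Since the fitted coefficients, standard errors, and t-statistics are essentially identical, the differences only show up graphically. A minimal plotting sketch with base graphics:

par(mfrow=c(2,2))
for (i in 1:4) {
  x = anscombe[[paste0("x", i)]]
  y = anscombe[[paste0("y", i)]]
  plot(x, y, main=paste("Data set", i), xlab=paste0("x", i), ylab=paste0("y", i))
  abline(lm(y ~ x), col="red")   # essentially the same line in all four panels
}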

2. Re-visit the Savings Data

The savings data frame has 50 rows and 5 columns; the data are averaged over the period 1960-1970. The columns are: sr (aggregate personal savings rate), pop15 (percentage of population under 15), pop75 (percentage of population over 75), dpi (real per-capita disposable income), and ddpi (percentage growth rate of dpi).

library(faraway)
attach(savings)
country=dimnames(savings)[[1]]

n=50; p=5;                      # n = sample size, p = number of coefficients
g = lm(sr ~., data=savings);
lev=influence(g)$hat            # leverages: diagonal of the hat matrix
lev[lev>2*p/n]                  # rule of thumb: flag leverages above 2p/n
##       Ireland         Japan United States         Libya 
##     0.2122363     0.2233099     0.3336880     0.5314568
halfnorm(lev, 4, labs=row.names(savings), ylab="Leverages")
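
As a sanity check, the leverages are just the diagonal of the hat matrix H = X(X'X)^(-1)X', and they always sum to p, the number of coefficients; a quick sketch:

X = model.matrix(g)
H = X %*% solve(t(X) %*% X) %*% t(X)            # hat matrix
all.equal(as.numeric(diag(H)), as.numeric(lev)) # matches influence(g)$hat
sum(lev)                                        # equals p = 5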

Which countries have high leverage?

savings[lev > 2*p/n,]
##                  sr pop15 pop75     dpi  ddpi
## Ireland       11.34 31.16  4.19 1139.95  2.99
## Japan         21.10 27.01  1.91 1257.28  8.21
## United States  7.56 29.81  3.43 4001.89  2.45
## Libya          8.89 43.69  2.07  123.58 16.71
jack=rstudent(g)   # externally studentized (jackknife) residuals
qt(.05/(2*n), 44) # Bonferroni correction
## [1] -3.525801
qt(.05/2, 44)
## [1] -2.015368
sort(abs(jack), decreasing=TRUE)[1:5]  # No outliers. 
##      Zambia       Chile Philippines        Peru     Iceland 
##    2.853558    2.313429    1.863826    1.823914    1.731200
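
Equivalently, one can compute the Bonferroni-adjusted two-sided p-value for the most extreme studentized residual; a quick sketch (it comes out well above 0.05):

2 * n * pt(-max(abs(jack)), df = n - p - 1)   # Bonferroni-adjusted p-value
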
cook = cooks.distance(g)
halfnorm(cook, labs=row.names(savings), ylab="Cook's distances")

max(cook)
## [1] 0.2680704
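
To see which country attains this maximum (Libya, as the half-normal plot shows) and how each coefficient would shift if that case were deleted, a quick check:

which.max(cook)                # the case with the largest Cook's distance
dfbeta(g)[which.max(cook), ]   # change in each coefficient if it is dropped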

Although no point stands out as highly influential by the usual rule of thumb (no Cook's distance is anywhere near 1), Cook's distance for Libya is much larger than for the other countries. So let's remove Libya, refit the model, and compare the result with the full-data fit below.

g2 = lm(sr ~ ., data=savings[row.names(savings) != 'Libya',])
summary(g)
## 
## Call:
## lm(formula = sr ~ ., data = savings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2422 -2.6857 -0.2488  2.4280  9.7509 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.5660865  7.3545161   3.884 0.000334 ***
## pop15       -0.4611931  0.1446422  -3.189 0.002603 ** 
## pop75       -1.6914977  1.0835989  -1.561 0.125530    
## dpi         -0.0003369  0.0009311  -0.362 0.719173    
## ddpi         0.4096949  0.1961971   2.088 0.042471 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.803 on 45 degrees of freedom
## Multiple R-squared:  0.3385, Adjusted R-squared:  0.2797 
## F-statistic: 5.756 on 4 and 45 DF,  p-value: 0.0007904
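
A compact way to see what deleting Libya changes is to line up the two coefficient vectors (g2 is the refit defined above); expect the ddpi coefficient to shift noticeably, given Libya's extreme ddpi value of 16.71:

round(cbind(full = coef(g), without_Libya = coef(g2)), 4)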

3. Rat Data, from Weisberg (1980)

An experiment is conducted to investigate the amount of a particular drug present in the liver of a rat: 19 rats are randomly selected, weighed, and given an oral dose of the drug approximately proportional to body weight. …

rat=read.table("rat.dat"); 
dim(rat)
## [1] 19  4
head(rat)
##    V1  V2   V3   V4
## 1 176 6.5 0.88 0.42
## 2 176 9.5 0.88 0.25
## 3 190 9.0 1.00 0.56
## 4 176 8.9 0.88 0.23
## 5 200 7.2 1.00 0.23
## 6 167 8.9 0.83 0.32

“dose” is highly correlated with “body” by the design of the experiment, and “body” and “liver” are known to be correlated. Note that the correlations between the predictors and Y are all low.

names(rat)=c("body", "liver", "dose", "Y")
cor(rat)  
##            body     liver      dose         Y
## body  1.0000000 0.5000101 0.9902126 0.1510855
## liver 0.5000101 1.0000000 0.4900711 0.2033302
## dose  0.9902126 0.4900711 1.0000000 0.2275436
## Y     0.1510855 0.2033302 0.2275436 1.0000000
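
The 0.99 correlation between body and dose suggests serious collinearity. A quick check with faraway's vif() on the predictor matrix (a sketch; expect very large values for body and dose):

x = model.matrix(lm(Y ~ ., data=rat))[, -1]   # predictor columns, intercept dropped
vif(x)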

How many sub-models can we fit on these data? With three predictors there are 2^3 - 1 = 7 (or eight, if we include the intercept-only model).

We fit all seven models below. It is striking that whenever “body” and “dose” are both included in the model, both become significant; otherwise, no variable is significant.

summary(lm(Y~body, data=rat))$coef
##                 Estimate  Std. Error   t value  Pr(>|t|)
## (Intercept) 0.1962346237 0.221582485 0.8856053 0.3881858
## body        0.0008105376 0.001286209 0.6301757 0.5369590
summary(lm(Y~liver, data=rat))$coef
##               Estimate Std. Error   t value  Pr(>|t|)
## (Intercept) 0.22037463 0.13572725 1.6236579 0.1228446
## liver       0.01470945 0.01717914 0.8562387 0.4037739
summary(lm(Y~dose, data=rat))$coef
##              Estimate Std. Error   t value  Pr(>|t|)
## (Intercept) 0.1330050  0.2109117 0.6306194 0.5366756
## dose        0.2346096  0.2435074 0.9634599 0.3488226
summary(lm(Y~body+liver, data=rat))$coef
##                Estimate  Std. Error   t value  Pr(>|t|)
## (Intercept) 0.178356944 0.227775476 0.7830384 0.4450409
## body        0.000353495 0.001513754 0.2335221 0.8183175
## liver       0.012325997 0.020412653 0.6038410 0.5544149
summary(lm(Y~body+dose, data=rat))$coef
##                Estimate Std. Error   t value   Pr(>|t|)
## (Intercept)  0.28551741 0.19126688  1.492770 0.15495313
## body        -0.02044423 0.00783845 -2.608198 0.01902132
## dose         4.12532992 1.50647192  2.738405 0.01457729
summary(lm(Y~liver+dose, data=rat))$coef
##                Estimate Std. Error   t value  Pr(>|t|)
## (Intercept) 0.117365750 0.21909377 0.5356873 0.5995428
## liver       0.008741867 0.02008518 0.4352397 0.6692025
## dose        0.173550533 0.28626117 0.6062664 0.5528430
summary(lm(Y~., data=rat))$coef
##                Estimate  Std. Error    t value   Pr(>|t|)
## (Intercept)  0.26592177 0.194585255  1.3666080 0.19188433
## body        -0.02124634 0.007974304 -2.6643501 0.01767688
## liver        0.01429806 0.017217141  0.8304549 0.41930376
## dose         4.17811141 1.522625206  2.7440183 0.01506639
fit=lm(Y~., data=rat);
par(mfrow=c(2,2)); plot(fit)

lev=influence(fit)$hat
halfnorm(lev, 4, ylab="Leverages")

sr.ex=rstudent(fit); 
sort(sr.ex, decreasing=TRUE)[1:5]
##        19         1        18         3         7 
## 2.1388332 1.9170719 0.8426711 0.7972915 0.7774937
qt(0.05/(19*2), 14)  # Bonferroni-corrected critical value, df = n - p - 1 = 14
## [1] -3.64871
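
As with the savings data, a sketch of the Bonferroni-adjusted p-value for the largest studentized residual confirms there is nothing extreme here either:

2 * 19 * pt(-max(abs(sr.ex)), df = 14)   # well above 0.05, so no outliers
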
cook = cooks.distance(fit)
halfnorm(cook, 4, ylab="Cook's distances")

max(cook)
## [1] 0.929616
rat
##    body liver dose    Y
## 1   176   6.5 0.88 0.42
## 2   176   9.5 0.88 0.25
## 3   190   9.0 1.00 0.56
## 4   176   8.9 0.88 0.23
## 5   200   7.2 1.00 0.23
## 6   167   8.9 0.83 0.32
## 7   188   8.0 0.94 0.37
## 8   195  10.0 0.98 0.41
## 9   176   8.0 0.88 0.33
## 10  165   7.9 0.84 0.38
## 11  158   6.9 0.80 0.27
## 12  148   7.3 0.74 0.36
## 13  149   5.2 0.75 0.21
## 14  163   8.4 0.81 0.28
## 15  170   7.2 0.85 0.34
## 16  186   6.8 0.94 0.28
## 17  146   7.3 0.73 0.30
## 18  181   9.0 0.90 0.37
## 19  149   6.4 0.75 0.46
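
The half-normal plot flags case 3, which the listing above shows received a relatively high dose for its body weight; a quick confirmation:

which.max(cook)   # the case with the largest Cook's distance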

Remove the 3rd observation (the case with the largest Cook's distance) and refit the model.

newfit=lm(Y~., data=rat, subset=(1:19)[-3])
summary(newfit)
## 
## Call:
## lm(formula = Y ~ ., data = rat, subset = (1:19)[-3])
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.102154 -0.056486  0.002838  0.046519  0.137059 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.311427   0.205094   1.518    0.151
## body        -0.007783   0.018717  -0.416    0.684
## liver        0.008989   0.018659   0.482    0.637
## dose         1.484877   3.713064   0.400    0.695
## 
## Residual standard error: 0.07825 on 14 degrees of freedom
## Multiple R-squared:  0.02106,    Adjusted R-squared:  -0.1887 
## F-statistic: 0.1004 on 3 and 14 DF,  p-value: 0.9585
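
After deleting case 3, nothing is significant (the overall F-test has p-value 0.96): the apparent significance of body and dose in the full fit was driven by that single influential observation. A final sketch lines up the two coefficient vectors:

round(cbind(full = coef(fit), without_case3 = coef(newfit)), 4)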