Introduction to data analysis using R programming language

To download in-built data in R for use, click below:

https://vincentarelbundock.github.io/Rdatasets/datasets.html

Multiple (Linear) Regression

NB: The goal is to find out the Determinants of Murder Rates in the United States based on an existing empirical data.

Description of data to be used:

NB: A data frame containing 44 observations on 8 variables.

rate: Murder rate per 100,000 (FBI estimate, 1950).

convictions: Number of convictions divided by number of murders in 1950.

executions: Average number of executions during 1946–1950 divided by convictions in 1950.

time: Median time served (in months) of convicted murderers released in 1951.

income: Median family income in 1949 (in 1,000 USD).

lfp: Labor force participation rate in 1950 (in percent).

noncauc: Proportion of population that is non-Caucasian in 1950.

southern: Factor indicating region or residential status.

Specific Tasks

  1. Add a new categorical variable called income level with levels Low level if income is less than 1.5 (in thousand), Medium level if income is from 1.5 but less than 2 (in thousand) and High level if greater or equal to 2 (in thousand).

  2. Compute descriptive summaries of all the 8 variables with appropriate graphs

  3. Perform correlation analysis based on all the numerial varibles with informative correlation matrix plots

  4. Find the relationship between the income level and their residential status (whether in southern area or not). NB: A Chi-square test in this case since both variables are categorical.

  5. Fit a regression model to find out the Determinants of Murder Rates.

NB: Based on the variables: convictions, executions, time, income, lfp, noncauc and southern/residential status

Setting a working directory

In [46]:
setwd("C:/Users/user/Desktop/DataAnalysis_results_R/DataAnalysis_NUGSA_Shangai")

Loading required packages

In [6]:
#Intalling our first R package (NB: You will need an internet to install packages only)
#install.packages("moments") #run this to install the package called "moments"
library("moments") # load the package to be able to calculate skewness and kurtosis only
library("RColorBrewer")
library("PerformanceAnalytics")
In [7]:
#Setting plot size for large view and specific resolution
options(repr.plot.width=8, repr.plot.height=8,repr.plot.res = 300)
In [8]:
Data_murderRates<-read.csv(file="MurderRates_data.csv") 
head(Data_murderRates, n=10) #to view firs t 10 rows of the data
A data.frame: 10 × 8
rateconvictionsexecutionstimeincomelfpnoncaucsouthern
<dbl><dbl><dbl><int><dbl><dbl><dbl><fct>
119.250.2040.035 471.1051.20.321yes
2 7.530.3270.081 580.9248.50.224yes
3 5.660.4010.012 821.7250.80.127no
4 3.210.3180.0701002.1854.40.063no
5 2.800.3500.0622221.7552.40.021no
6 1.410.2830.1001642.2656.70.027no
7 6.180.2040.0501612.0754.60.139yes
812.150.2320.054 701.4352.70.218yes
9 1.340.1990.0862191.9252.30.008no
10 3.710.1380.000 811.8253.00.012no

Checking for missing values

In [10]:
#To check whether there is any missing data
any(is.na(Data_murderRates)) #it returned false implies no missing data
FALSE

Creating the income level variable

In [14]:
min(Data_murderRates$income)

max(Data_murderRates$income)
0.76
2.39
In [30]:
detach(Data_murderRates)
In [32]:
Data_murderRates$rate
  1. 19.25
  2. 7.53
  3. 5.66
  4. 3.21
  5. 2.8
  6. 1.41
  7. 6.18
  8. 12.15
  9. 1.34
  10. 3.71
  11. 5.35
  12. 4.72
  13. 3.81
  14. 10.44
  15. 9.58
  16. 1.02
  17. 7.52
  18. 1.31
  19. 1.67
  20. 7.07
  21. 11.79
  22. 2.71
  23. 13.21
  24. 3.48
  25. 0.81
  26. 2.32
  27. 3.47
  28. 8.31
  29. 1.57
  30. 4.13
  31. 3.84
  32. 1.83
  33. 3.54
  34. 1.11
  35. 8.9
  36. 1.27
  37. 15.26
  38. 11.15
  39. 1.74
  40. 11.98
  41. 3.04
  42. 0.85
  43. 2.83
  44. 2.89
In [23]:
thresholds<- c(0, 1.5, 2, 3)
Data_murderRates$income_levels= cut(Data_murderRates$income,breaks=thresholds, right=FALSE,labels=c("Low","Medium","High"))
In [24]:
#View updated data
head(Data_murderRates, n=10) #to view first 6 rows of the data
A data.frame: 10 × 9
rateconvictionsexecutionstimeincomelfpnoncaucsouthernincome_levels
<dbl><dbl><dbl><int><dbl><dbl><dbl><fct><fct>
119.250.2040.035 471.1051.20.321yesLow
2 7.530.3270.081 580.9248.50.224yesLow
3 5.660.4010.012 821.7250.80.127no Medium
4 3.210.3180.0701002.1854.40.063no High
5 2.800.3500.0622221.7552.40.021no Medium
6 1.410.2830.1001642.2656.70.027no High
7 6.180.2040.0501612.0754.60.139yesHigh
812.150.2320.054 701.4352.70.218yesLow
9 1.340.1990.0862191.9252.30.008no Medium
10 3.710.1380.000 811.8253.00.012no Medium
In [53]:
levels(Data_murderRates$income_levels)
  1. 'Low'
  2. 'Medium'
  3. 'High'
In [54]:
table(Data_murderRates$income_levels)
   Low Medium   High 
    10     18     16 
In [37]:
dim(Data_murderRates)

dim(Data_murderRates)[1]
  1. 44
  2. 9
44
In [42]:
table(Data_murderRates$income_levels,Data_murderRates$southern)
table(Data_murderRates$income_levels,Data_murderRates$southern)/dim(Data_murderRates)[1]

table(Data_murderRates$income_levels,Data_murderRates$southern)/dim(Data_murderRates)[1]*100

round(table(Data_murderRates$income_levels,Data_murderRates$southern)/dim(Data_murderRates)[1]*100, 2)
        
         no yes
  Low     0  10
  Medium 15   3
  High   14   2
        
                 no        yes
  Low    0.00000000 0.22727273
  Medium 0.34090909 0.06818182
  High   0.31818182 0.04545455
        
                no       yes
  Low     0.000000 22.727273
  Medium 34.090909  6.818182
  High   31.818182  4.545455
        
            no   yes
  Low     0.00 22.73
  Medium 34.09  6.82
  High   31.82  4.55

Creating own function to compute descriptive/summary statistics

The summary statistics function should first:

  1. Determine the type of variable
  1. If it's numeric find & return mean, median, mode, standard deviation, standard error, skewness, kurtosis and 95% quantile recorded to 2 decimal places as well as histogram plot of the numeric variable unique colour respectively.
  1. Else if it's categorical, it should find & return percentages for all categories/levels (in 1 decimal place) and the name of the categories as a dataframe as well as plot a pie chart with percentages for each category of the variable with different colours

Function to compute mode

In [53]:
# Creating the function to compute mode in R
getmode <- function(x) {
   uniq_x <- unique(x)
   return(uniq_x[which.max(tabulate(match(x, uniq_x)))])
}
In [54]:
getmode(Data_murderRates$income)
2.07
In [52]:
table(Data_murderRates$income)
0.76 0.92  1.1 1.15 1.24 1.26 1.29 1.35 1.42 1.43 1.55 1.59 1.68  1.7 1.72 1.75 
   1    1    1    1    1    1    1    1    1    1    2    1    2    1    2    1 
1.81 1.82 1.84 1.89  1.9 1.92 1.96 1.97    2 2.04 2.07 2.12 2.18 2.21 2.26 2.29 
   2    1    1    1    1    1    1    1    1    3    3    2    1    1    1    1 
2.34 2.39 
   2    1 
In [48]:
class(Data_murderRates$income_level)
'factor'

Summary statistic function

In [56]:
#Creating a function to estimate summary statistics of the data
#by determining the type of variable
#If it's numeric find & return mean, median, mode standard deviation, standard error, 
#skewness, kurtosis and 95% quantile recorded to 2 decimal places
#But if its categorical find & return percentages for all categories/levels 
#(in 1 decimal place) and the name of the categories as a dataframe

Summary_stats<-function(data, variable_index){
  Variable_name<-names(data)[variable_index]
  Variable<-(data)[,variable_index]
  if(is.numeric(Variable)==TRUE){ #if variable is numeric/quantitative
    #compute mean, median, standard deviation, standard error, 
    #skewness and kurtosis
  mean_value<-mean(Variable) #compute mean
  median_value<-median(Variable) #compute median
  modal_value<-getmode(Variable)
  std<-sd(Variable) #compute standard deviation
  standard_error<-std/sqrt(length(Variable)) #compute standard error
  skewness<-skewness(Variable) #compute skewness
  kurtosis<-kurtosis(Variable) #compute kurtosis
  quantile_95percent<- as.vector(quantile(Variable,c(0.025,0.975))) #compute 95% quantile
    graph<-hist(Variable,xlab=paste(Variable_name),col=variable_index, main="")
  #returns the mean, median, standard deviation, standard error,skewness and kurtosis
     numerical_summaries<- data.frame(mean=round(mean_value,2),median=round(median_value,2),mode=modal_value,std=round(std,2),SE=round(standard_error,2),
    skewness=round(skewness,2),kurtosis=round(kurtosis,2),quantile_95percent_lower=round(quantile_95percent,2)[1],
                               quantile_95percent_upper=round(quantile_95percent,2)[2] ) 
      
  return(list(Variable_name=Variable_name,numerical_summaries=numerical_summaries ,histogram=graph$histogram))           

  } else if(is.factor(Variable)==TRUE){ #else if categorical
    #compute the percentages rounded in 1 decimal place
  percentage<-paste(round((table(Variable)/dim(data)[1])*
                           100,1),"%")
  levels_variable<-levels(Variable)
  output<-data.frame(Categories=levels_variable,percentage=percentage)#storing output as dataframe
      
 #Plotting the pie chart for the categorical variable
   Percentage_values<- round((table(Variable)/dim(data)[1])*100,1)
   labels_variables <- paste(levels(Variable),":", Percentage_values) # add percents to labels
   labels_variables <- paste( labels_variables,"%",sep="") # ad % to labels
    
     
 #Deciding how many colours to choose if the number of categories is < 3 or >=3 before plot
  if(length(levels_variable)==2){
    #colours_two_categories<- c("red","blue")
    colours_two_categories<- rainbow(n=2)
    pie_chart<- pie(x=Percentage_values, labels =labels_variables,radius =.7,cex=0.71,main="",            
    col =colours_two_categories,font=2,clockwise = TRUE,init.angle=90)

      } else if(length(levels_variable)>=3){
      #colours_categories<-brewer.pal(n = length(Percentage_values), name = "Paired")
      colours_categories<-rainbow(n =length(Percentage_values))
      pie_chart<- pie(x=Percentage_values, labels =labels_variables,radius =.7,cex=0.71,main="",            
    col =colours_categories,font=2,clockwise = TRUE,init.angle=90)
     
           }
  
       #return variable name and a dataframe of percentages for each category
     return(list(Variable_name=Variable_name,output=output, pie_chart= pie_chart))
       }
  }
In [57]:
Summary_stats(data=Data_murderRates, variable_index=1)
$Variable_name
'rate'
$numerical_summaries
A data.frame: 1 × 9
meanmedianmodestdSEskewnesskurtosisquantile_95percent_lowerquantile_95percent_upper
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
5.43.6219.254.460.671.170.70.8615.11
$histogram
NULL
In [58]:
Summary_stats(data=Data_murderRates, variable_index=2)
$Variable_name
'convictions'
$numerical_summaries
A data.frame: 1 × 9
meanmedianmodestdSEskewnesskurtosisquantile_95percent_lowerquantile_95percent_upper
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
0.260.230.2040.140.021.753.290.120.67
$histogram
NULL
In [79]:
Summary_stats(data=Data_murderRates, variable_index=3)
$Variable_name
'executions'
$numerical_summaries
A data.frame: 1 × 9
meanmedianmodestdSEskewnesskurtosisquantile_95percent_lowerquantile_95percent_upper
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
0.060.0400.070.013.0312.1100.21
$histogram
NULL
In [80]:
Summary_stats(data=Data_murderRates, variable_index=4)
$Variable_name
'time'
$numerical_summaries
A data.frame: 1 × 9
meanmedianmodestdSEskewnesskurtosisquantile_95percent_lowerquantile_95percent_upper
<dbl><dbl><int><dbl><dbl><dbl><dbl><dbl><dbl>
136.5212410461.69.290.68-0.0447.68279.22
$histogram
NULL
In [81]:
Summary_stats(data=Data_murderRates, variable_index=5)
$Variable_name
'income'
$numerical_summaries
A data.frame: 1 × 9
meanmedianmodestdSEskewnesskurtosisquantile_95percent_lowerquantile_95percent_upper
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
1.781.832.070.40.06-0.64-0.220.932.34
$histogram
NULL
In [82]:
Summary_stats(data=Data_murderRates, variable_index=6)
$Variable_name
'lfp'
$numerical_summaries
A data.frame: 1 × 9
meanmedianmodestdSEskewnesskurtosisquantile_95percent_lowerquantile_95percent_upper
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
53.0753.451.22.480.37-0.29-0.0348.556.88
$histogram
NULL
In [83]:
Summary_stats(data=Data_murderRates, variable_index=7)
$Variable_name
'noncauc'
$numerical_summaries
A data.frame: 1 × 9
meanmedianmodestdSEskewnesskurtosisquantile_95percent_lowerquantile_95percent_upper
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
0.110.060.1270.110.021.441.160.010.38
$histogram
NULL
In [84]:
Summary_stats(data=Data_murderRates, variable_index=8)
$Variable_name
'southern'
$output
A data.frame: 2 × 2
Categoriespercentage
<fct><fct>
no 65.9 %
yes34.1 %
$pie_chart
NULL

Supposing you want to save a plot directly into your working directory

Let's use the pie chart above for income levels and save directory using a command or by a code.

In [93]:
#SAVING PLOT AS A PNG

#step 1
par(mar=c(4,4,1,1))
png(file="Income_level_PieChart.png")
   

#Step 2: Create the plot with R code
Summary_stats(data=Data_murderRates, variable_index=9)$pie_chart

# Step 3: Run dev.off() to create the file!
dev.off()
NULL
png: 2

Correlation analysis using PerformanceAnalytics package

Based on only the quantitative variables.

Automated function to decide on the quantitative variables

In [94]:
names(Data_murderRates)
  1. 'rate'
  2. 'convictions'
  3. 'executions'
  4. 'time'
  5. 'income'
  6. 'lfp'
  7. 'noncauc'
  8. 'southern'
  9. 'income_levels'
In [63]:
Quantitative_variabes<-function(data){
    
    Variable_name=list()
    for (variable_index in seq_along(names(data))){
     Variable<-(data)[,variable_index]   
  if(is.numeric(Variable)==TRUE){ #if variable is numeric/quantitative
   Variable_name[[variable_index]]<-names(data)[variable_index]
  }
        }
    return(unlist(Variable_name))
}
In [65]:
Categorical_variabes<-function(data){
    
    Variable_name=list()
    for (variable_index in seq_along(names(data))){
     Variable<-(data)[,variable_index]   
  if(is.numeric(Variable)==FALSE){ #if variable is numeric/quantitative
   Variable_name[[variable_index]]<-names(data)[variable_index]
  }
        }
    return(unlist(Variable_name))
}
In [66]:
Categorical_variabes(data=Data_murderRates)
  1. 'southern'
  2. 'income_levels'
In [64]:
#printing all numerical variables in any data
Quantitative_variabes(data=Data_murderRates)
  1. 'rate'
  2. 'convictions'
  3. 'executions'
  4. 'time'
  5. 'income'
  6. 'lfp'
  7. 'noncauc'

Determining the normality of all the numeric variables

In [68]:
shapiro.test(Data_murderRates$rate) #0.0000874 p<0.001
shapiro.test(Data_murderRates$convictions)
	Shapiro-Wilk normality test

data:  Data_murderRates$rate
W = 0.86161, p-value = 8.746e-05
	Shapiro-Wilk normality test

data:  Data_murderRates$convictions
W = 0.82442, p-value = 1.038e-05
In [69]:
#shapiro.test(Data_murderRates$rate) #normality test for only the variable: rate

Normality_test<- function(data){
    Numerical_variables<- Quantitative_variabes(data=Data_murderRates)
    
    #Using lapply function instead for loop
    ShapiroWilk_test<- function(k) shapiro.test(data[ ,Numerical_variables[k]]) 
    results<- lapply(seq_along(Numerical_variables),ShapiroWilk_test)
   
    for(i in seq_along(Numerical_variables)) print(paste("Index=",i,"",Numerical_variables[i]))
    
    return(results)      
}
In [70]:
Normality_test(data=Data_murderRates)
[1] "Index= 1  rate"
[1] "Index= 2  convictions"
[1] "Index= 3  executions"
[1] "Index= 4  time"
[1] "Index= 5  income"
[1] "Index= 6  lfp"
[1] "Index= 7  noncauc"
[[1]]

	Shapiro-Wilk normality test

data:  data[, Numerical_variables[k]]
W = 0.86161, p-value = 8.746e-05


[[2]]

	Shapiro-Wilk normality test

data:  data[, Numerical_variables[k]]
W = 0.82442, p-value = 1.038e-05


[[3]]

	Shapiro-Wilk normality test

data:  data[, Numerical_variables[k]]
W = 0.70302, p-value = 3.962e-08


[[4]]

	Shapiro-Wilk normality test

data:  data[, Numerical_variables[k]]
W = 0.95706, p-value = 0.101


[[5]]

	Shapiro-Wilk normality test

data:  data[, Numerical_variables[k]]
W = 0.95673, p-value = 0.09818


[[6]]

	Shapiro-Wilk normality test

data:  data[, Numerical_variables[k]]
W = 0.9861, p-value = 0.8669


[[7]]

	Shapiro-Wilk normality test

data:  data[, Numerical_variables[k]]
W = 0.79548, p-value = 2.331e-06

Correlation matrix plot

In [81]:
Data_numeric=Data_murderRates[,Quantitative_variabes(data=Data_murderRates)] 
#colnames(Data_numeric)=c("rate","convictions","executions","time","income","labour fp","Prop NC")
#method = c("pearson", "kendall", "spearman")
#options(warn=-1) #ignore warnings for instance with running non-parametric correlation due to ties
chart.Correlation(Data_numeric, histogram=TRUE, pch=19,method = c("pearson"))

Relationship between the income level and their residential status

In [83]:
chisq.test(Data_murderRates$income_levels, Data_murderRates$southern)

#Cross tabulation between income levels and residential status
table(Data_murderRates$income_levels, Data_murderRates$southern)

#Cross tabulation between income levels and residential status (as proportions)
table(Data_murderRates$income_levels, Data_murderRates$southern)/dim( Data_murderRates)[1]

#Cross tabulation between income levels and residential status (as percentages)
table(Data_murderRates$income_levels, Data_murderRates$southern)/dim( Data_murderRates)[1]*100


#Cross tabulation between income levels and residential status (as percentages in 2 d.p)
round(table(Data_murderRates$income_levels, Data_murderRates$southern)/dim( Data_murderRates)[1]*100,
      2)
	Pearson's Chi-squared test

data:  Data_murderRates$income_levels and Data_murderRates$southern
X-squared = 25.085, df = 2, p-value = 3.571e-06
        
         no yes
  Low     0  10
  Medium 15   3
  High   14   2
        
                 no        yes
  Low    0.00000000 0.22727273
  Medium 0.34090909 0.06818182
  High   0.31818182 0.04545455
        
                no       yes
  Low     0.000000 22.727273
  Medium 34.090909  6.818182
  High   31.818182  4.545455
        
            no   yes
  Low     0.00 22.73
  Medium 34.09  6.82
  High   31.82  4.55

Fit a regression model to find out the Determinants of Murder Rates

Based on the variables: convictions, executions, time, income, lfp, noncauc and southern/residential status

In [139]:
names(Data_murderRates)
  1. 'rate'
  2. 'convictions'
  3. 'executions'
  4. 'time'
  5. 'income'
  6. 'lfp'
  7. 'noncauc'
  8. 'southern'
  9. 'income_levels'
In [85]:
Glm_gaussian= glm(rate ~ convictions+executions+time+income+lfp
                    +noncauc+southern, data = Data_murderRates,family="gaussian")

summary(Glm_gaussian)
Call:
glm(formula = rate ~ convictions + executions + time + income + 
    lfp + noncauc + southern, family = "gaussian", data = Data_murderRates)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.9913  -1.1943  -0.3538   1.2383   6.5574  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.44436    9.96694   0.045   0.9647  
convictions -4.33938    2.78313  -1.559   0.1277  
executions   2.85276    6.12313   0.466   0.6441  
time        -0.01547    0.00705  -2.194   0.0348 *
income      -2.50013    1.68519  -1.484   0.1466  
lfp          0.19357    0.20614   0.939   0.3540  
noncauc     10.39903    5.40610   1.924   0.0623 .
southernyes  3.26216    1.32980   2.453   0.0191 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 6.046619)

    Null deviance: 856.67  on 43  degrees of freedom
Residual deviance: 217.68  on 36  degrees of freedom
AIC: 213.22

Number of Fisher Scoring iterations: 2
In [86]:
OLS_regression <- lm(rate ~ convictions+executions+time+income+lfp
                    +noncauc+southern, data = Data_murderRates)

summary(OLS_regression)
Call:
lm(formula = rate ~ convictions + executions + time + income + 
    lfp + noncauc + southern, data = Data_murderRates)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.9913 -1.1943 -0.3538  1.2383  6.5574 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.44436    9.96694   0.045   0.9647  
convictions -4.33938    2.78313  -1.559   0.1277  
executions   2.85276    6.12313   0.466   0.6441  
time        -0.01547    0.00705  -2.194   0.0348 *
income      -2.50013    1.68519  -1.484   0.1466  
lfp          0.19357    0.20614   0.939   0.3540  
noncauc     10.39903    5.40610   1.924   0.0623 .
southernyes  3.26216    1.32980   2.453   0.0191 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.459 on 36 degrees of freedom
Multiple R-squared:  0.7459,	Adjusted R-squared:  0.6965 
F-statistic:  15.1 on 7 and 36 DF,  p-value: 5.105e-09
In [87]:
coef(OLS_regression)

coef(Glm_gaussian)
(Intercept)
0.444359915749847
convictions
-4.33937650268746
executions
2.85275974408881
time
-0.0154701553487085
income
-2.5001332491105
lfp
0.193567589685116
noncauc
10.3990315691726
southernyes
3.26216529583923
(Intercept)
0.444359915749847
convictions
-4.33937650268746
executions
2.85275974408881
time
-0.0154701553487085
income
-2.5001332491105
lfp
0.193567589685116
noncauc
10.3990315691726
southernyes
3.26216529583923

Understanding Diagnostic Plots for Linear Regression Analysis

https://data.library.virginia.edu/diagnostic-plots/

In [91]:
par(mfrow=c(2,2))
plot(OLS_regression)

Taking the influential observation 28 out and refitting the OLS model

In [92]:
Data_without_Obs_28<- Data_murderRates[-28, ]
In [93]:
OLS_regression_without_Obs_28 <- lm(rate ~ convictions+executions+time+income+lfp
                    +noncauc+southern, data = Data_without_Obs_28)
summary(OLS_regression_without_Obs_28)
Call:
lm(formula = rate ~ convictions + executions + time + income + 
    lfp + noncauc + southern, data = Data_without_Obs_28)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0980 -1.6476  0.3139  1.3045  5.7031 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)   4.846979   9.212258   0.526  0.60211   
convictions  -5.805350   2.587239  -2.244  0.03127 * 
executions  -18.364530   9.230697  -1.990  0.05450 . 
time         -0.010432   0.006659  -1.566  0.12623   
income       -2.004167   1.545671  -1.297  0.20324   
lfp           0.096977   0.190860   0.508  0.61457   
noncauc      14.116612   5.093368   2.772  0.00887 **
southernyes   3.725205   1.222708   3.047  0.00438 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.241 on 35 degrees of freedom
Multiple R-squared:  0.7927,	Adjusted R-squared:  0.7512 
F-statistic: 19.11 on 7 and 35 DF,  p-value: 3.121e-10
In [96]:
coef(OLS_regression_without_Obs_28)[8]
southernyes: 3.72520546849213
In [158]:
par(mfrow=c(2,2))
plot(OLS_regression_without_Obs_28)

Comparing the coeficient of determination $R^2$ between the two OLS models

That's, with or without the influential observation 28

In [162]:
paste("R-square of model with inflential observation=",summary(OLS_regression)$r.squared) 


paste("R-square of model without inflential observation=",
      summary(OLS_regression_without_Obs_28)$r.squared)
'R-square of model with inflential observation= 0.745901612378261'
'R-square of model without inflential observation= 0.792650175651192'
In [163]:
#Regression coefficients
best_model=OLS_regression_without_Obs_28
coef(best_model)
(Intercept)
4.84697878175395
convictions
-5.80535041064009
executions
-18.3645303515864
time
-0.0104318799685669
income
-2.00416712193378
lfp
0.0969770883033366
noncauc
14.1166124068012
southernyes
3.72520546849213
In [ ]:
Data_murderRates
In [99]:
names(Data_without_Obs_28)
  1. 'rate'
  2. 'convictions'
  3. 'executions'
  4. 'time'
  5. 'income'
  6. 'lfp'
  7. 'noncauc'
  8. 'southern'
  9. 'income_levels'
In [100]:
## Binomial logistic model. Note: southern coefficient
logitic_model <- glm(southern ~ convictions+executions+time+income_levels+lfp
                    +noncauc+southern+rate, data =Data_without_Obs_28 , family = binomial)
summary(logitic_model)
Call:
glm(formula = southern ~ convictions + executions + time + income_levels + 
    lfp + noncauc + southern + rate, family = binomial, data = Data_without_Obs_28)

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-3.224e-05  -2.100e-08  -2.100e-08   2.100e-08   3.947e-05  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
(Intercept)          8.151e+01  6.355e+06   0.000    1.000
convictions          2.657e+02  1.244e+06   0.000    1.000
executions           2.052e+02  3.425e+06   0.000    1.000
time                 1.256e-01  2.996e+03   0.000    1.000
income_levelsMedium -1.337e+02  3.797e+05   0.000    1.000
income_levelsHigh   -4.781e+01  6.889e+05   0.000    1.000
lfp                 -4.507e+00  1.141e+05   0.000    1.000
noncauc              3.978e+01  9.838e+05   0.000    1.000
rate                 2.316e+01  2.706e+04   0.001    0.999

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 5.5618e+01  on 42  degrees of freedom
Residual deviance: 3.8621e-09  on 34  degrees of freedom
AIC: 18

Number of Fisher Scoring iterations: 25
In [ ]: