To download in-built data in R for use, click below:
https://vincentarelbundock.github.io/Rdatasets/datasets.html
Multiple (Linear) Regression
NB: The goal is to find out the Determinants of Murder Rates in the United States based on existing empirical data.
NB: A data frame containing 44 observations on 8 variables.
rate: Murder rate per 100,000 (FBI estimate, 1950).
convictions: Number of convictions divided by number of murders in 1950.
executions: Average number of executions during 1946–1950 divided by convictions in 1950.
time: Median time served (in months) of convicted murderers released in 1951.
income: Median family income in 1949 (in 1,000 USD).
lfp: Labor force participation rate in 1950 (in percent).
noncauc: Proportion of population that is non-Caucasian in 1950.
southern: Factor indicating region or residential status.
Add a new categorical variable called income level with levels Low level if income is less than 1.5 (in thousand), Medium level if income is from 1.5 but less than 2 (in thousand) and High level if greater or equal to 2 (in thousand).
Compute descriptive summaries of all the 8 variables with appropriate graphs
Perform correlation analysis based on all the numerical variables with informative correlation matrix plots
Find the relationship between the income level and their residential status (whether in southern area or not). NB: A Chi-square test in this case since both variables are categorical.
Fit a regression model to find out the Determinants of Murder Rates.
NB: Based on the variables: convictions, executions, time, income, lfp, noncauc and southern/residential status
setwd("C:/Users/user/Desktop/DataAnalysis_results_R/DataAnalysis_NUGSA_Shangai")
# NOTE(review): setwd() with an absolute path makes this script machine-specific;
# prefer project-relative paths so it runs on other machines.
#Installing our first R package (NB: You will need internet access to install packages only)
#install.packages("moments") #run this to install the package called "moments"
library("moments") # load the package to be able to calculate skewness and kurtosis only
library("RColorBrewer") # colour palettes (optional alternative for the pie charts below)
library("PerformanceAnalytics") # provides chart.Correlation() used later
#Setting plot size for large view and specific resolution (repr.* options are read by Jupyter's IRkernel)
options(repr.plot.width=8, repr.plot.height=8,repr.plot.res = 300)
Data_murderRates<-read.csv(file="MurderRates_data.csv")
head(Data_murderRates, n=10) #to view first 10 rows of the data
Checking for missing values
#To check whether there is any missing value anywhere in the data frame
any(is.na(Data_murderRates)) #it returned FALSE, implying no missing data
Creating the income level variable
# Inspect the range of income to choose sensible break points for the categories
min(Data_murderRates$income)
max(Data_murderRates$income)
# BUG FIX: the original called detach(Data_murderRates) here, but the data frame
# was never attach()-ed, so that call errors at runtime; it has been removed.
Data_murderRates$rate # quick look at the response variable
# Breaks (income in 1,000 USD): [0, 1.5) = Low, [1.5, 2) = Medium, [2, 3) = High
thresholds <- c(0, 1.5, 2, 3)
Data_murderRates$income_levels <- cut(Data_murderRates$income, breaks = thresholds,
                                      right = FALSE, labels = c("Low", "Medium", "High"))
#View updated data
head(Data_murderRates, n = 10) #to view first 10 rows of the data
levels(Data_murderRates$income_levels)
table(Data_murderRates$income_levels)
dim(Data_murderRates)
dim(Data_murderRates)[1] # number of observations
# Cross-tabulation of income level vs southern residence:
# counts, then proportions, then percentages, then percentages rounded to 2 d.p.
table(Data_murderRates$income_levels, Data_murderRates$southern)
table(Data_murderRates$income_levels, Data_murderRates$southern) / dim(Data_murderRates)[1]
table(Data_murderRates$income_levels, Data_murderRates$southern) / dim(Data_murderRates)[1] * 100
round(table(Data_murderRates$income_levels, Data_murderRates$southern) / dim(Data_murderRates)[1] * 100, 2)
Before building the summary statistics function, we first need a helper:
Function to compute mode
# Helper: statistical mode (most frequent value) of a vector.
# Ties are broken in favour of the value that appears first in x.
getmode <- function(x) {
  distinct_vals <- unique(x)
  counts <- tabulate(match(x, distinct_vals))
  distinct_vals[which.max(counts)]
}
getmode(Data_murderRates$income) # modal income value
table(Data_murderRates$income)   # frequency of each income value
# BUG FIX: the column is named "income_levels"; "$income_level" only worked via
# R's partial matching on data frames, which is fragile and silently breaks if
# a column named "income_level" is ever added.
class(Data_murderRates$income_levels)
Summary statistic function
#Creating a function to estimate summary statistics of the data
#by determining the type of variable
#If it's numeric find & return mean, median, mode standard deviation, standard error,
#skewness, kurtosis and 95% quantile recorded to 2 decimal places
#But if its categorical find & return percentages for all categories/levels
#(in 1 decimal place) and the name of the categories as a dataframe
Summary_stats <- function(data, variable_index){
  # Summarise one column of `data`, selected by position.
  #
  # Numeric columns: returns the variable name plus a data frame of mean,
  #   median, mode, sd, standard error, skewness, kurtosis (moments package)
  #   and the 2.5%/97.5% quantiles, all to 2 d.p., and draws a histogram.
  # Factor (or character) columns: returns the variable name plus a data frame
  #   of category percentages (1 d.p.) and draws a pie chart.
  Variable_name <- names(data)[variable_index]
  Variable <- data[, variable_index]
  if (is.character(Variable)) {
    # read.csv() no longer converts strings to factors by default (R >= 4.0),
    # so coerce here to keep the categorical branch working for such columns.
    Variable <- factor(Variable)
  }
  if (is.numeric(Variable)) { # quantitative variable
    mean_value <- mean(Variable)
    median_value <- median(Variable)
    modal_value <- getmode(Variable)
    std <- sd(Variable)
    standard_error <- std / sqrt(length(Variable))
    # renamed locals so results do not shadow moments::skewness / moments::kurtosis
    skew_value <- skewness(Variable)
    kurt_value <- kurtosis(Variable)
    quantile_95percent <- as.vector(quantile(Variable, c(0.025, 0.975)))
    graph <- hist(Variable, xlab = paste(Variable_name), col = variable_index, main = "")
    numerical_summaries <- data.frame(
      mean = round(mean_value, 2), median = round(median_value, 2),
      mode = modal_value, std = round(std, 2), SE = round(standard_error, 2),
      skewness = round(skew_value, 2), kurtosis = round(kurt_value, 2),
      quantile_95percent_lower = round(quantile_95percent[1], 2),
      quantile_95percent_upper = round(quantile_95percent[2], 2)
    )
    # BUG FIX: the original returned graph$histogram, which does not exist
    # (hist() returns an object of class "histogram"); return the object itself.
    return(list(Variable_name = Variable_name,
                numerical_summaries = numerical_summaries,
                histogram = graph))
  } else if (is.factor(Variable)) { # categorical variable
    # percentages rounded to 1 decimal place
    Percentage_values <- round((table(Variable) / nrow(data)) * 100, 1)
    percentage <- paste(Percentage_values, "%")
    levels_variable <- levels(Variable)
    output <- data.frame(Categories = levels_variable, percentage = percentage)
    # slice labels like "Low : 25%"
    labels_variables <- paste(levels_variable, ":", Percentage_values)
    labels_variables <- paste0(labels_variables, "%")
    # BUG FIX: the original only drew a pie for 2 or >= 3 categories and left
    # `pie_chart` undefined for a single-level factor (runtime error on return);
    # rainbow(n) covers every category count with the same colours as before.
    pie_chart <- NULL
    if (length(levels_variable) >= 1) {
      pie_chart <- pie(x = Percentage_values, labels = labels_variables,
                       radius = .7, cex = 0.71, main = "",
                       col = rainbow(n = length(Percentage_values)),
                       font = 2, clockwise = TRUE, init.angle = 90)
    }
    # NOTE: pie() returns NULL invisibly, so pie_chart is always NULL; the chart
    # itself is drawn as a side effect (kept for backward compatibility).
    return(list(Variable_name = Variable_name, output = output, pie_chart = pie_chart))
  }
}
# Column-by-column summaries; indices 1-8 are the 8 original variables
# (presumably rate, convictions, executions, time, income, lfp, noncauc,
# southern -- confirm against names(Data_murderRates)).
Summary_stats(data=Data_murderRates, variable_index=1)
Summary_stats(data=Data_murderRates, variable_index=2)
Summary_stats(data=Data_murderRates, variable_index=3)
Summary_stats(data=Data_murderRates, variable_index=4)
Summary_stats(data=Data_murderRates, variable_index=5)
Summary_stats(data=Data_murderRates, variable_index=6)
Summary_stats(data=Data_murderRates, variable_index=7)
Summary_stats(data=Data_murderRates, variable_index=8)
Let's take the pie chart above for income levels and save it to the working directory using code.
#SAVING PLOT AS A PNG
# Step 1: open the PNG device first
png(file = "Income_level_PieChart.png")
# BUG FIX: par() must come AFTER png() -- graphical parameters apply to the
# currently open device, so setting margins before opening the device had no effect.
par(mar = c(4, 4, 1, 1))
# Step 2: draw the plot (the pie chart is drawn as a side effect of the call;
# the $pie_chart element itself is NULL because pie() returns invisibly)
Summary_stats(data = Data_murderRates, variable_index = 9)$pie_chart
# Step 3: run dev.off() to close the device and write the file
dev.off()
Based on only the quantitative variables.
Automated function to decide on the quantitative variables
names(Data_murderRates) # all column names, including the new income_levels
Quantitative_variabes <- function(data){
  # Return the names of all numeric columns in `data`.
  # Returns NULL when no column is numeric (matching the original loop-based
  # implementation, whose unlist() of an empty list yields NULL).
  is_num <- vapply(data, is.numeric, logical(1))
  quantitative_names <- names(data)[is_num]
  if (length(quantitative_names) == 0) {
    return(NULL)
  }
  quantitative_names
}
Categorical_variabes <- function(data){
  # Collect the names of every non-numeric (categorical) column in `data`.
  # Returns NULL when all columns are numeric.
  non_numeric_names <- list()
  for (col_index in seq_along(data)) {
    if (!is.numeric(data[[col_index]])) {
      non_numeric_names[[length(non_numeric_names) + 1]] <- names(data)[col_index]
    }
  }
  unlist(non_numeric_names)
}
# List the categorical (non-numeric) column names
Categorical_variabes(data=Data_murderRates)
#printing all numerical variables in any data
Quantitative_variabes(data=Data_murderRates)
Determining the normality of all the numeric variables
shapiro.test(Data_murderRates$rate) # p = 0.0000874 < 0.001 => rate deviates from normality
shapiro.test(Data_murderRates$convictions)
#shapiro.test(Data_murderRates$rate) #normality test for only the variable: rate
# Run a Shapiro-Wilk normality test on every numeric column of `data`.
# Returns a list of htest objects (one per numeric variable) and prints an
# index -> variable-name key so results can be matched to columns.
Normality_test <- function(data){
  # BUG FIX: the original hard-coded the global Data_murderRates on this line,
  # so the function silently ignored its own `data` argument.
  Numerical_variables <- Quantitative_variabes(data = data)
  #Using lapply function instead of a for loop
  ShapiroWilk_test <- function(k) shapiro.test(data[, Numerical_variables[k]])
  results <- lapply(seq_along(Numerical_variables), ShapiroWilk_test)
  for (i in seq_along(Numerical_variables)) {
    print(paste("Index=", i, "", Numerical_variables[i]))
  }
  return(results)
}
Normality_test(data = Data_murderRates)
Correlation matrix plot
# Keep only the numeric columns for the correlation analysis
Data_numeric=Data_murderRates[,Quantitative_variabes(data=Data_murderRates)]
#colnames(Data_numeric)=c("rate","convictions","executions","time","income","labour fp","Prop NC")
#method = c("pearson", "kendall", "spearman")
#options(warn=-1) #ignore warnings for instance with running non-parametric correlation due to ties
# Scatterplot matrix with histograms and Pearson correlations (PerformanceAnalytics)
chart.Correlation(Data_numeric, histogram=TRUE, pch=19,method = c("pearson"))
Relationship between the income level and their residential status
# Chi-square test of independence between income level and southern residence
# (both categorical). NB: with only 44 observations some expected cell counts
# may be small, so R may warn that the chi-square approximation is inaccurate.
chisq.test(Data_murderRates$income_levels, Data_murderRates$southern)
#Cross tabulation between income levels and residential status
table(Data_murderRates$income_levels, Data_murderRates$southern)
#Cross tabulation between income levels and residential status (as proportions)
table(Data_murderRates$income_levels, Data_murderRates$southern)/dim( Data_murderRates)[1]
#Cross tabulation between income levels and residential status (as percentages)
table(Data_murderRates$income_levels, Data_murderRates$southern)/dim( Data_murderRates)[1]*100
#Cross tabulation between income levels and residential status (as percentages in 2 d.p)
round(table(Data_murderRates$income_levels, Data_murderRates$southern)/dim( Data_murderRates)[1]*100,
2)
Fit a regression model to find out the Determinants of Murder Rates
Based on the variables: convictions, executions, time, income, lfp, noncauc and southern/residential status
names(Data_murderRates) # check the available predictor names
# Gaussian GLM with identity link -- mathematically equivalent to the OLS fit below
Glm_gaussian= glm(rate ~ convictions+executions+time+income+lfp
+noncauc+southern, data = Data_murderRates,family="gaussian")
summary(Glm_gaussian)
# Ordinary least squares fit of the same specification
OLS_regression <- lm(rate ~ convictions+executions+time+income+lfp
+noncauc+southern, data = Data_murderRates)
summary(OLS_regression)
coef(OLS_regression) # coefficients should match the gaussian GLM above
coef(Glm_gaussian)
par(mfrow=c(2,2)) # 2x2 grid for the four standard lm diagnostic plots
plot(OLS_regression)
Taking the influential observation 28 out and refitting the OLS model
# Drop row 28 (flagged as influential in the diagnostic plots) and refit the model
Data_without_Obs_28<- Data_murderRates[-28, ]
OLS_regression_without_Obs_28 <- lm(rate ~ convictions+executions+time+income+lfp
+noncauc+southern, data = Data_without_Obs_28)
summary(OLS_regression_without_Obs_28)
coef(OLS_regression_without_Obs_28)[8] # 8th coefficient (intercept + 7 terms) -- presumably southern; verify with names(coef(...))
par(mfrow=c(2,2)) # diagnostics for the refitted model
plot(OLS_regression_without_Obs_28)
Comparing the coefficient of determination $R^2$ between the two OLS models
That's, with or without the influential observation 28
# Compare R-squared with and without the influential observation 28.
# FIX: "inflential" typo corrected in the printed messages.
paste("R-square of model with influential observation=", summary(OLS_regression)$r.squared)
paste("R-square of model without influential observation=",
      summary(OLS_regression_without_Obs_28)$r.squared)
#Regression coefficients
best_model <- OLS_regression_without_Obs_28 # model chosen after comparing R-squared values
coef(best_model)
Data_murderRates # NOTE(review): prints the entire data set -- likely leftover debugging output
names(Data_without_Obs_28)
## Binomial logistic model: which variables predict southern residence?
# BUG FIX: in the original formula `southern` appeared as both the response AND
# a predictor (southern ~ ... + southern + ...), which makes the model
# degenerate (the response perfectly predicts itself); removed from the RHS.
logitic_model <- glm(southern ~ convictions+executions+time+income_levels+lfp
+noncauc+rate, data =Data_without_Obs_28 , family = binomial)
summary(logitic_model)