Advanced Data Analysis in R (organised by NUGS-China 2022)¶

Facilitator: Clement Twumasi (Postdoctoral Researcher, Oxford University Statistics Department, UK).

Date: July 1, 2022.

Personal Website: https://twumasiclement.wixsite.com/website

YouTube Channel on Advanced R programming videos: https://www.youtube.com/channel/UCxrpjLYi_Akmbc2QICDvKaQ/videos

Objectives¶

Brief introduction to mathematical/statistical programming and important info on data analyses (e.g., variables, data structures, inferential tests & model fitting).

Description of R IDEs including installation & loading of packages (e.g., RStudio, Jupyter Notebook).

Setting working directory, importing and exporting data of different extensions (.CSV, .TXT, XLS/XLSX, SPSS, SAS, STATA, etc.).

Sourcing of external Rscripts/Complex R Functions, descriptive summaries & data visualisations in R.

Interesting Practice Task (Instructor will describe the problem): Using statistics to aid in forensic or crime detection investigation of mummy with the help of a GLM classification model.

1. Brief introduction to mathematical/statistical programming and important info on data analyses (e.g., variables, data structures, inferential tests & model fitting)¶

Inferential tests and model fitting in R¶

NB: The class/type of statistical tests and models to fit are predominantly dependent on the type/nature of your data.

Statistical tests:
- Mean tests (Parametric tests: Paired t-test, independent t-test, ANOVA, Bonferroni pairwise comparison tests, Repeated Measures ANOVA, MANOVA, etc.).
- Median tests (Non-parametric tests: Wilcoxon signed-rank test, Mann-Whitney, Kruskal-Wallis test, Bonferroni-Dunn's test, Friedman test, Multivariate Kruskal-Wallis test, etc.).
- Test of proportions for one or more groups, etc.
- Correlation test/analysis (Using the informative correlation matrix plot from the PerformanceAnalytics package should be enough).
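As a minimal sketch of how the choice between a mean test and its rank-based counterpart looks in code, the example below runs both kinds of test on simulated data using base R only. The group means, sample sizes and seed are assumptions made purely for illustration.

```r
# Simulated example: three groups of 30 observations each (values are illustrative)
set.seed(123)
scores <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 30)),
  value = c(rnorm(30, mean = 50, sd = 5),
            rnorm(30, mean = 53, sd = 5),
            rnorm(30, mean = 58, sd = 5))
)

# Two groups: independent t-test versus Mann-Whitney (Wilcoxon rank-sum) test
two_groups <- droplevels(subset(scores, group %in% c("A", "B")))
t_res  <- t.test(value ~ group, data = two_groups)
mw_res <- wilcox.test(value ~ group, data = two_groups)

# Three groups: one-way ANOVA versus Kruskal-Wallis test
aov_res <- summary(aov(value ~ group, data = scores))
kw_res  <- kruskal.test(value ~ group, data = scores)

# Bonferroni-adjusted pairwise t-tests after the overall comparison
pw_res <- pairwise.t.test(scores$value, scores$group,
                          p.adjust.method = "bonferroni")

# Collect the p-values side by side
c(t_test = t_res$p.value, mann_whitney = mw_res$p.value,
  kruskal_wallis = kw_res$p.value)
```

With real data, normality checks (for example `shapiro.test()`) would usually guide which column of this comparison to report.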

Link to some of the aforementioned univariate tests in R: https://rpubs.com/sujith/IS

Multivariate analysis in R: https://little-book-of-r-for-multivariate-analysis.readthedocs.io/en/latest/src/multivariateanalysis.html

http://www.sthda.com/english/wiki/manova-test-in-r-multivariate-analysis-of-variance

PCA, SVD, Correspondence Analysis and Multidimensional Scaling:

https://web.stanford.edu/class/bios221/labs/multivariate/lab_5_multivariate.html

Mixed effect regression model (we shall fit a Generalized Linear Mixed Model based on empirical data).

Note some of the classes of regression models available. Knowing which one is appropriate for a given dataset and research problem is essential.

GLM (OLS regression, binomial & multinomial regression, Poisson regression, quasi-Poisson regression, negative binomial regression, etc.) [cross-sectional data]
Generalized Linear Mixed Models (for all types of dependent variables) [this class of models is for longitudinal data]
Generalized Additive Models (fixed and mixed-effect types) [both cross-sectional & longitudinal data] and Panel regression models [this class of models is for longitudinal data]
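To make the GLM family concrete, here is a small sketch that fits three members of it with `glm()` on simulated cross-sectional data. The true coefficients, sample size and seed are assumptions chosen for illustration; mixed and additive models need extra packages and are not shown.

```r
set.seed(42)
n <- 200
x <- rnorm(n)

# Gaussian family with identity link reproduces OLS regression
y_cont  <- 2 + 1.5 * x + rnorm(n)
fit_ols <- glm(y_cont ~ x, family = gaussian())

# Binomial family (logit link) for a binary outcome
y_bin   <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))
fit_bin <- glm(y_bin ~ x, family = binomial())

# Poisson family (log link) for counts; quasipoisson relaxes the
# assumption that the variance equals the mean
y_count  <- rpois(n, lambda = exp(0.3 + 0.6 * x))
fit_pois <- glm(y_count ~ x, family = poisson())
fit_qp   <- glm(y_count ~ x, family = quasipoisson())

# The Gaussian GLM and lm() give the same coefficients
all.equal(coef(fit_ols), coef(lm(y_cont ~ x)))

# Estimated coefficients of each model
rbind(ols = coef(fit_ols), binomial = coef(fit_bin), poisson = coef(fit_pois))
```

The only thing that changes between the three fits is the `family` argument, which is why the type of the dependent variable drives the choice of model.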

Introduction to linear regression: https://www.machinelearningplus.com/machine-learning/complete-introduction-linear-regression-r/

OLS linear regression in R: http://r-statistics.co/Linear-Regression.html

Different class of linear models in R: https://www.statmethods.net/advstats/glm.html

Time series analysis (we shall simulate a time series data, learn how to declare time series data and fit its model).

Time series analysis in R:

https://ourcodingclub.github.io/tutorials/time/

https://www.analyticsvidhya.com/blog/2015/12/complete-tutorial-time-series-modeling/
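The three steps mentioned above (simulate, declare, fit) can be sketched with base R as follows. The AR(1) coefficient of 0.7, the monthly frequency and the 2012 start date are arbitrary assumptions for illustration.

```r
set.seed(2022)

# Step 1: simulate 120 observations from an AR(1) process (coefficient assumed)
sim <- arima.sim(model = list(ar = 0.7), n = 120)

# Step 2: declare the values as a monthly time series starting in January 2012
monthly_ts <- ts(as.numeric(sim), start = c(2012, 1), frequency = 12)

# Step 3: fit an AR(1) model and forecast the next 12 months
fit_ar1  <- arima(monthly_ts, order = c(1, 0, 0))
forecast <- predict(fit_ar1, n.ahead = 12)

coef(fit_ar1)   # estimated AR coefficient and mean
forecast$pred   # point forecasts
forecast$se     # forecast standard errors
```

In practice the order would be chosen from the ACF/PACF plots (`acf()`, `pacf()`) or an information criterion rather than assumed.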

Other class of models: Machine learning algorithms¶

Machine learning algorithms (Classification tree, Random forest, Gradient Boosting Machine, etc.) [both cross-sectional & longitudinal data]

Variable selection methods using penalized regression & other methods¶

Method of variable selection: Stepwise regression, Penalized regression (Ridge, LASSO and Elastic net), Recursive Feature Selection, etc.
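As an illustration of the first method in this list, the sketch below runs backward and forward stepwise selection with base R's `step()` on the built-in `mtcars` data. The dataset and the response `mpg` are assumptions picked for convenience; penalized regression would typically use the `glmnet` package instead and is not shown.

```r
# Built-in mtcars data: model fuel consumption (mpg) from all other variables
full_model <- lm(mpg ~ ., data = mtcars)
null_model <- lm(mpg ~ 1, data = mtcars)

# Backward elimination: start from the full model and drop terms by AIC
backward_fit <- step(full_model, direction = "backward", trace = 0)

# Forward selection: start from the intercept-only model and add terms by AIC
forward_fit <- step(null_model, scope = formula(full_model),
                    direction = "forward", trace = 0)

formula(backward_fit)   # variables kept by backward elimination
formula(forward_fit)    # variables added by forward selection
```

The two directions need not agree, which is one reason stepwise results are often cross-checked against a penalized method.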

Data dimension reduction methods¶

Principal Component Analysis--PCA, Factor Analysis--FA (which can either be Confirmatory or Exploratory FA), and Partial Least Square Regression, Multidimensional Scaling, Independent component analysis, t-distributed stochastic neighbor embedding, Uniform manifold approximation and projection (UMAP), etc.

https://rpubs.com/Saskia/520216
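A minimal sketch of two of these methods (PCA and classical multidimensional scaling) using base R and the built-in `USArrests` data. The dataset is an assumption chosen only because it ships with R.

```r
# PCA on standardised variables
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(cumsum(var_explained), 3)

# Scores of the first six states on the first two components
head(pca$x[, 1:2])

# Classical (metric) multidimensional scaling on Euclidean distances
mds <- cmdscale(dist(scale(USArrests)), k = 2)
head(mds)
```

The cumulative proportions help decide how many components to retain before plotting or modelling.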

2. Description of R IDEs including installation & loading of packages (e.g., RStudio, Jupyter Notebook).¶

###How to install R packages#####
#install.packages("parallel") #a package for parallel computing
#install.packages("glmnet") # a package for performing penalised or regularised regression models
#install.packages("ggplot2") #for other customised data visualisation
#install.packages("PerformanceAnalytics") #for customised correlation matrix plots

####Loading installed packages of interest#######
library("parallel")
library("glmnet")
library("ggplot2")
library("PerformanceAnalytics")
library("foreign") #for importing SAS, SPSS and Stata data, etc.
library(haven) #also for importing SAS, SPSS and Stata data, etc.
library("readxl") #package for loading excel data

3. Setting working directory, importing and exporting data of different extensions (.CSV, .TXT, XLS/XLSX, SPSS, SAS, STATA, etc.).¶

options(repr.plot.width=8, repr.plot.height=8,repr.plot.res = 300) #Setting plot size

#setting working directory to import and/or export data
setwd("C:/Users/user/Desktop/DataAnalysis_results_R/NUGSChina_R_class2022")

Importing CSV data¶

#Importing CSV data
MurderRates<-read.csv("MurderRates_data.csv")
head(MurderRates,n=10) # view first 10 rows

tail(MurderRates,n=6) # view last 6 rows

Importing excel data directly without changing to CSV¶

#Importing excel data directly without changing to CSV
library("readxl")
Excel_data<- read_excel("Transformed_data.xlsx")
head(Excel_data)


Excel_data<- Excel_data[,-1]
head(Excel_data)

New names:
* `` -> ...1

Importing SPSS data into R¶

#Importing SPSS data into R
SPSS_data<- read.spss("Combined_data_SPSS.sav", use.value.labels=TRUE, to.data.frame=TRUE)
head(SPSS_data)

re-encoding from UTF-8

Importing Stata data with Haven package¶

#With Haven package
#install.packages("haven")
#library("haven")
ADP_data <- read_dta("ADP.dta")
head(ADP_data)
write.csv(ADP_data,"ADP_excel.csv")

Importing Stata data into R using package "foreign"¶

#Importing Stata data into R using package "foreign"
Stata_data <- read.dta("imm23.dta")
head(Stata_data )

Importing SAS data into R¶

Run in SAS to convert data into CSV before importing into R (Long approach) :)

proc export data=dataset
    outfile="dataset.csv"
    dbms=csv;
run;

And then, run this in R

df <- read.csv("dataset.csv",header=T,as.is=T)

Alternatively (simple approach) using package haven

#Alternatively (simple approach) using package haven
#library(haven)
SAS_data<- read_sas("imm10.sas7bdat")
head(SAS_data)

Declaring some variables in data called SAS_data as categorical with specific levels/categories

#Declaring some variables in data called SAS_data as categorical with specific levels/categories
SAS_data$sex<- factor(SAS_data$sex,levels=c(1,2),labels=c("male","female"))

head(SAS_data)

#Frequency across sex
cat("Frequencies across sex")
table(SAS_data$sex)

cat("Proportions across sex")
table(SAS_data$sex)/sum(table(SAS_data$sex))

Frequencies across sex

  male female 
   132    128

Proportions across sex

     male    female 
0.5076923 0.4923077
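The same proportions can be obtained more directly with `prop.table()`. The sketch below rebuilds a sex variable with the counts printed above (132 and 128), so it does not need the original data file; the object names are illustrative.

```r
# Hypothetical sex variable with the same counts as in the output above
sex <- factor(rep(c("male", "female"), times = c(132, 128)),
              levels = c("male", "female"))

freq <- table(sex)

prop.table(freq)                   # proportions
round(100 * prop.table(freq), 1)   # percentages to 1 decimal place
```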

print(SAS_data$race)

paste("Frequencies across race")
table(as.factor(SAS_data$race))


paste("Percentages across race (%)")
(table(as.factor(SAS_data$race))/sum(table(as.factor(SAS_data$race))))*100

  [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 2 3 3 3 3
 [38] 3 3 3 3 3 3 4 4 1 4 4 4 4 4 4 4 1 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 [75] 4 4 4 4 4 4 4 4 4 4 4 4 4 2 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
[112] 2 2 2 2 2 2 2 2 2 4 2 2 2 2 2 2 2 2 2 2 4 4 4 4 4 4 4 4 4 4 4 4 4 1 3 4 4
[149] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 1 4 4 1 4
[186] 4 1 4 2 4 4 4 4 4 4 4 4 4 2 3 3 4 4 4 4 3 4 4 4 3 3 3 3 4 3 3 3 4 3 4 3 4
[223] 3 3 4 4 3 3 3 4 4 4 3 4 4 4 3 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
[260] 4
attr(,"label")
[1] "race of student, 1=asian, 2=Hispanic, 3=Black, 4=White, 5=Native American"

  1   2   3   4 
  8  23  40 189

        1         2         3         4 
 3.076923  8.846154 15.384615 72.692308

SAS_data$race<- factor(SAS_data$race,levels=c(1,2,3,4),labels=c("asian","Hispanic","Black","White"))

print(SAS_data$race)

  [1] White    White    White    White    White    White    White    White   
  [9] White    White    White    White    White    White    White    White   
 [17] White    White    White    White    White    White    White    Black   
 [25] Black    Black    Black    Black    Black    Black    Black    Black   
 [33] Hispanic Black    Black    Black    Black    Black    Black    Black   
 [41] Black    Black    Black    White    White    asian    White    White   
 [49] White    White    White    White    White    asian    White    White   
 [57] White    White    asian    White    White    White    White    White   
 [65] White    White    White    White    White    White    White    White   
 [73] White    White    White    White    White    White    White    White   
 [81] White    White    White    White    White    White    White    Hispanic
 [89] White    White    White    White    White    White    White    White   
 [97] White    White    White    White    White    White    White    White   
[105] White    White    White    White    White    White    White    Hispanic
[113] Hispanic Hispanic Hispanic Hispanic Hispanic Hispanic Hispanic Hispanic
[121] White    Hispanic Hispanic Hispanic Hispanic Hispanic Hispanic Hispanic
[129] Hispanic Hispanic Hispanic White    White    White    White    White   
[137] White    White    White    White    White    White    White    White   
[145] asian    Black    White    White    White    White    White    White   
[153] White    White    White    White    White    White    White    White   
[161] White    White    White    White    White    White    White    White   
[169] White    White    White    White    White    White    White    White   
[177] asian    White    White    White    asian    White    White    asian   
[185] White    White    asian    White    Hispanic White    White    White   
[193] White    White    White    White    White    White    Hispanic Black   
[201] Black    White    White    White    White    Black    White    White   
[209] White    Black    Black    Black    Black    White    Black    Black   
[217] Black    White    Black    White    Black    White    Black    Black   
[225] White    White    Black    Black    Black    White    White    White   
[233] Black    White    White    White    Black    White    Black    White   
[241] White    White    White    White    White    White    White    White   
[249] White    White    White    White    White    White    White    White   
[257] White    White    White    White   
Levels: asian Hispanic Black White

#Crosstabulations
table(SAS_data$race,SAS_data$sex)

          
           male female
  asian       3      5
  Hispanic   11     12
  Black      20     20
  White      98     91

print(SAS_data$public)

SAS_data$public<- factor(SAS_data$public,levels=c(0,1),labels=c("non-public","public")) #variable label: 1=public, 0=non-public


head(SAS_data)

  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[149] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[186] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[260] 1
attr(,"label")
[1] "Public school: 1=public, 0=non-public"

#Cross-tabulation
table(SAS_data$race,SAS_data$sex,SAS_data$public)

, ,  = non-public

          
           male female
  asian       3      2
  Hispanic    1      0
  Black       0      1
  White      32     28

, ,  = public

          
           male female
  asian       0      3
  Hispanic   10     12
  Black      20     19
  White      66     63

Exporting the updated "SAS_data" into the working directory as "SAS_data_updatedCSV" (saved as CSV file)¶

NB: No package is required for that

#No package is required
#export SAS data as CSV 
write.csv(SAS_data,"SAS_data_updatedCSV.csv")
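One detail worth knowing when exporting with `write.csv()`: by default it also writes the row names as an unnamed first column, which then reappears as an extra column on re-import. The sketch below shows this on a small made-up data frame written to temporary files (the data and file names are illustrative only).

```r
# Small illustrative data frame
toy <- data.frame(id = 1:3, sex = factor(c("male", "female", "male")))

csv_default <- tempfile(fileext = ".csv")
csv_clean   <- tempfile(fileext = ".csv")

write.csv(toy, csv_default)                    # row names written as first column
write.csv(toy, csv_clean, row.names = FALSE)   # row names suppressed

names(read.csv(csv_default))   # "X" "id" "sex"
names(read.csv(csv_clean))     # "id" "sex"
```

Passing `row.names = FALSE` avoids having to drop the first column after re-importing the file.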

Exporting the updated "SAS_data" into the working directory as "SAS_data_updatedExcel" (saved as Excel file)¶

NB: A package is required for that

#install.packages("writexl")

#export SAS data as excel
#Package "writexl" is required
writexl::write_xlsx(SAS_data,"SAS_data_updatedExcel.xlsx")

4. Sourcing of external Rscripts/Complex R Functions, descriptive summaries & data visualisations in R.¶

#Loading an external script called "Novel_summary_computation_script.R"
setwd("C:/Users/user/Desktop/DataAnalysis_results_R/NUGSChina_R_class2022")
source("Novel_summary_computation_script.R")

Instructor's (Clement) novel function created in R to compute descriptive/summary statistics (it will be developed into a package soon).¶

The summary statistics function should first:

Determine the type of the variable.

If it is numeric, find and return the mean, median, mode, standard deviation, standard error, skewness, kurtosis and 95% quantile (each rounded to 2 decimal places), and plot a histogram of the variable in its own colour.

Else if it is categorical, find and return the percentage for each category/level (to 1 decimal place) together with the category names as a dataframe, and plot a pie chart showing the percentage for each category in a different colour.
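The specification above could be sketched roughly as follows. This is not the instructor's `Summary_stats` function (its source is not shown here); `summary_sketch` and its `plot` argument are hypothetical names, the "95% quantile" is interpreted as the 2.5% and 97.5% quantiles, and skewness and kurtosis are computed as simple moment estimates. All of these are assumptions.

```r
# Hypothetical sketch of a type-aware summary function (not the original script)
summary_sketch <- function(data, variable_index, plot = FALSE) {
  stopifnot(is.data.frame(data))
  x <- data[[variable_index]]
  x <- x[!is.na(x)]                        # drop missing values
  var_name <- names(data)[variable_index]

  if (is.numeric(x)) {
    n <- length(x)
    z <- (x - mean(x)) / sqrt(mean((x - mean(x))^2))  # standardised values
    counts <- table(x)
    if (plot) hist(x, main = var_name, xlab = var_name, col = "steelblue")
    round(data.frame(
      mean     = mean(x),
      median   = median(x),
      mode     = as.numeric(names(counts)[which.max(counts)]),
      sd       = sd(x),
      se       = sd(x) / sqrt(n),
      skewness = mean(z^3),                # moment-based skewness
      kurtosis = mean(z^4),                # moment-based kurtosis (normal = 3)
      q2.5     = unname(quantile(x, 0.025)),
      q97.5    = unname(quantile(x, 0.975))
    ), 2)
  } else {
    pct <- round(100 * prop.table(table(x)), 1)
    if (plot) pie(pct, labels = paste0(names(pct), " (", pct, "%)"),
                  col = rainbow(length(pct)), main = var_name)
    data.frame(category = names(pct), percentage = as.numeric(pct))
  }
}

summary_sketch(iris, 1)   # numeric variable
summary_sketch(iris, 5)   # categorical variable
```

Setting `plot = TRUE` draws the histogram or pie chart in addition to returning the table.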

Summarizing a few of the variables using my novel function

NB: The data class must strictly be a dataframe.¶

head(SAS_data)

#Exclude data on ID
SAS_data<- SAS_data[, -1]

head(SAS_data)

class(SAS_data)

#Converting SAS_data to data.frame
SAS_data_df<- as.data.frame(SAS_data)

#**Summarizing a few of the variables using my novel function**
Summary_stats(data=SAS_data_df, variable_index=2)

Summary_stats(data=SAS_data_df, variable_index=8)

Summary_stats(data=SAS_data_df, variable_index=13)

A function created to extract all quantitative variables of your data

#A function created to extract all quantitative variables of your data
Quantitative_variabes<-function(data){
    Variable_name=list()
    for (variable_index in seq_along(names(data))){
        Variable<-data[[variable_index]] #extract the column as a vector
        if(is.numeric(Variable)){ #if variable is numeric/quantitative
            Variable_name[[variable_index]]<-names(data)[variable_index]
        }
    }
    return(unlist(Variable_name))
}

Quantitative_variabes(data=MurderRates)

A function created to extract all categorical variables of your data

Categorical_variabes<-function(data){
    
    Variable_name=list()
    for (variable_index in seq_along(names(data))){
     Variable<-(data)[,variable_index]   
  if(is.numeric(Variable)==FALSE){ #if variable is non-numeric/categorical
   Variable_name[[variable_index]]<-names(data)[variable_index]
  }
        }
    return(unlist(Variable_name))
}

Categorical_variabes(data=MurderRates)

Correlation matrix plot based on MurderRates data¶

From PerformanceAnalytics R package

Data_numeric=MurderRates[ ,Quantitative_variabes(data=MurderRates)] 
#colnames(Data_numeric)=c("rate","convictions","executions","time","income","labour fp","Prop NC")
#method = c("pearson", "kendall", "spearman")
#options(warn=-1) #ignore warnings for instance with running non-parametric correlation due to ties
chart.Correlation(Data_numeric, histogram=TRUE, pch=19,method = c("pearson"))

5. Interesting Practice Task: Using statistics to aid in forensic or crime detection investigation of mummy with the help of a GLM classification model.¶

Description of the problem

NB: This is a published paper entitled: An Experimental Study of Lesions Observed in Bog Body Funerary Performances.¶

Authors: Tiffany Treadway and Clement Twumasi (2021).

URL Link: https://exarc.net/ark:/88735/10595

[Image: Fig 3.jpg]

Importing the data which is saved as an SPSS file into R¶

#Importing SPSS data into R
Data<- read.spss("Data_task.sav", use.value.labels=FALSE, to.data.frame=TRUE)
head(Data)

re-encoding from UTF-8

Comparing the distribution of Knife and Spear areas across groups¶

Differences between the BMI groups in the produced incision area (height/width) were tested using a multivariate analysis of variance (MANOVA). The BMI groups (the fixed factor) were divided into three ranges: 18-22.5 (Group A), 22.5-27 (Group B), and 27-31.5 (Group C).

#Install packages before loading/use
library("mvnormtest")
library("MVTests")
#For Non-parametric pairwise comparison
library("PMCMRplus")
library(ggpubr)
library(ggplot2)
library(gridExtra)

Multivariate normality test¶

# Multivariate normality test
#H0: the data are multivariate normal (rejected at the 5% level when p < 0.05)
mshapiro.test(t(Data[, 4:5]))

shapiro.test(Data[, 4])
shapiro.test(Data[, 5])

	Shapiro-Wilk normality test

data:  Z
W = 0.95851, p-value = 0.03062

	Shapiro-Wilk normality test

data:  Data[, 4]
W = 0.9396, p-value = 0.003636

	Shapiro-Wilk normality test

data:  Data[, 5]
W = 0.87003, p-value = 7.058e-06

p1=ggqqplot(Data, "Area_Knife", facet.by = c("Groups")) +labs(x="",y="Sample quantile (Dagger)")

p2=ggqqplot(Data, "Area_Spear", facet.by = c("Groups")) +labs(x="Normal quantile",y="Sample quantile (Spear)")

grid.arrange(p1,p2,ncol=1)

Box's M test¶

Tests the null hypothesis that the observed covariance matrices of the dependent variables are equal across groups

The hypotheses are defined as H0:The Covariance matrices are homogeneous and H1:The Covariance matrices are not homogeneous

results <- BoxM(data=Data[, 4:5],group=Data$Groups)
summary(results)

       Box's M Test 

Chi-Squared Value = 3.666222 , df = 6  and p-value: 0.722

par(mfrow=c(2,1),mar=c(4,4,1,1))
boxplot(Data$Area_Knife~Data$Groups,
main = "",
ylim=c(0,150),
ylab=expression(paste("Incision area in mm"^"2"  )),        
xlab="",
names = c("Group A (BMI: 18-22.5)","Group B (BMI: 22.5-27)","Group C (BMI: 27-31.5)"),
las = 1,
col = c("red","blue","green"),
border = "black",
horizontal = F,
notch = F
)

mtext(text = "Dagger",
      side = 3, #side 3 = top
      line = 0,cex=1,font=2,adj =0.01)


boxplot(Data$Area_Spear~Data$Groups,
main = "",
ylim=c(0,200),
ylab=expression(paste("Incision area in mm"^"2    "  )),
xlab="",
names = c("Group A (BMI: 18-22.5)","Group B (BMI: 22.5-27)","Group C (BMI: 27-31.5)"),
las = 1,
col = c("red","blue","green"),
border = "black",
horizontal = F,
notch = F
)

mtext(text = "Spear",
      side = 3, #side 3 = top
      line = 0,cex=1,font=2,adj =0.01)

#18-22.5 (Group A), 22.5-27 (Group B), and 27-31.5 (Group C).

Implementing the Multivariate Kruskal Wallis (MKW) test¶

Creating a function for Multivariate Kruskal Wallis test

The function returns the test statistic and its p-value

multkw<- function(group,y,simplify=FALSE){
 ### sort data by group ###
    o<-order(group)
    group<-group[o]
    y<-as.matrix(y[o,])
    n<-length(group)
    k<-dim(y)[2]   #k=p
    
    if (dim(y)[1] != n)
    stop("number of observations not equal to length of group")
    groupls<-unique(group)
    g<-length(groupls) #number of groups#
    groupind<-sapply(groupls,"==",group) #group indicator#
    ni<-colSums(groupind) #number of subjects in each group#
    r<-apply(y,2,rank) #ranks of each response variable#
    
    ### calculation of statistic ###
    r.ik<-t(groupind)%*%r*(1/ni)  #gxp, mean rank of kth variate in ith group#
    m<- (n+1)/2 #expected value of rik#
    u.ik<-t(r.ik-m)
    U<-as.vector(u.ik)
    V<-1/(n-1)*t(r-m)%*%(r-m) #pooled within-group cov matrix
    Vstar<-Matrix::bdiag(lapply(1/ni,"*",V))
    W2<-as.numeric(t(U)%*%solve(Vstar)%*%U)
    
    ### return stat and p-value ###
   returnlist<-data.frame(statistic=W2,d.f.=k*(g-1),
   p.value=pchisq(W2,k*(g-1),lower.tail=F))
    
    if (simplify==TRUE) return (W2)
    else return (returnlist)
    }

Assignment 1:¶

Cut and paste the MKW test function into a separate R script and call it in R using the source() function.

MKW test¶

MKW_result<- multkw(group=Data$Groups,y=Data[,4:5],simplify=FALSE)
MKW_result

Classification problem¶

Determining whether an injury was caused by a knife or a spear

#Importing the data
Weapon_classification<- read.csv(file="Training_Stabs.csv")
head(Weapon_classification,n=10)# view the first 10 rows

min(Weapon_classification$BMI)
max(Weapon_classification$BMI)

names(Weapon_classification)#the variables of the data

table(Weapon_classification$Weapon_Type)

knife spear 
   64    64

Re-categorizing BMI into 3 BMI statuses¶

#Re-categorizing BMI into 3 BMI statuses
Weapon_classification$BMI_status<- cut(Weapon_classification$BMI,
                                breaks=c(0,18,25.1,31.5),right=T,labels=c("A","B","C"))


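#Create Weapon_type as 0 (Knife) and 1 (Spear)
re_categorize_func<- function(variable){
    #as.factor() keeps this working when read.csv() returns character columns (R >= 4.0)
    var<- as.numeric(as.factor(variable)) #knife=1, spear=2 (alphabetical levels)
    for(i in seq_along(variable)){
       if(var[i]==1)   var[i]<- 0 #if a knife
       if(var[i]==2)  var[i]<- 1  # if a spear
        }
    return(var)
}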

Weapon_classification$Weapon_type<- re_categorize_func(Weapon_classification$Weapon_Type)
Weapon_classification$Weapon_type<-factor(Weapon_classification$Weapon_type,
                                         levels=c(0,1),label=c("knife","spear"))

head(Weapon_classification,n=10)# view the first 10 rows

dim(Weapon_classification)#dimension of the data
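To see exactly how `cut()` assigns BMI values to the three statuses with the breaks used above, the sketch below applies the same call to a few hand-picked values (the test values are illustrative). With `right=TRUE` each interval includes its upper bound, so the intervals are (0,18], (18,25.1] and (25.1,31.5], and a BMI of exactly 18 falls in status A. It also shows a vectorised alternative to the loop-based recoding, assuming the weapon column holds the strings "knife" and "spear".

```r
# Illustrative BMI values, including the break points themselves
bmi <- c(18, 22.5, 25.1, 27, 31.5)

# Same breaks and labels as in the cell above
status <- cut(bmi, breaks = c(0, 18, 25.1, 31.5), right = TRUE,
              labels = c("A", "B", "C"))
data.frame(bmi, status)

# Vectorised 0/1 recoding of a weapon variable (0 = knife, 1 = spear)
weapon      <- c("knife", "spear", "spear", "knife")
weapon_code <- as.integer(weapon == "spear")
weapon_code
```

Checking a few boundary values like this is a quick way to confirm that the categories match the intended BMI ranges.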

Dividing the dataset into training and validation sets¶

#Establishing training and validation set
set.seed(1)# for reproducibility
train_index <- caret::createDataPartition(Weapon_classification$Weapon_type, p=0.8, list=FALSE)
training_data<-Weapon_classification[train_index,] #training data 
Validation_data<-Weapon_classification[-train_index,]# Validation data
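`caret::createDataPartition()` samples within each class so that the training and validation sets keep the class balance. For readers without `caret`, the idea can be sketched in base R as below. This is an approximation of the idea on made-up data (64 knives and 64 spears with simulated lengths), not a reimplementation of the `caret` function.

```r
set.seed(1)

# Made-up data with the same class counts as the weapon data
toy <- data.frame(
  weapon = factor(rep(c("knife", "spear"), each = 64)),
  length = c(rnorm(64, mean = 25, sd = 5), rnorm(64, mean = 35, sd = 5))
)

# Sample 80% of the row indices separately within each class
rows_by_class <- split(seq_len(nrow(toy)), toy$weapon)
train_index <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = ceiling(0.8 * length(idx)))
}))

training   <- toy[train_index, ]
validation <- toy[-train_index, ]

table(training$weapon)     # 52 per class
table(validation$weapon)   # 12 per class
```

The resulting sizes (104 training rows and 24 validation rows) match the sample sizes seen in the model summary and confusion matrix later in this section.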

Fitting a binomial logistic regression based on the training data¶

Dependent variable: Weapon type

Independent variables: BMI and stab length

#Setting knife as the reference category
training_data$Weapon_type_new<- relevel(training_data$Weapon_type,ref="knife")
training_data$BMI_status<- relevel(training_data$BMI_status,ref="A")

#Fitting logistic regression model
Logit_model<-glm(Weapon_type~BMI+Lengths,family="binomial",data=training_data)
summary(Logit_model)

Call:
glm(formula = Weapon_type ~ BMI + Lengths, family = "binomial", 
    data = training_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.5689  -0.2412  -0.0040   0.2249   2.3400  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -16.43866    3.96530  -4.146 3.39e-05 ***
BMI          -0.12532    0.09535  -1.314    0.189    
Lengths       0.62810    0.12822   4.898 9.66e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 144.175  on 103  degrees of freedom
Residual deviance:  47.438  on 101  degrees of freedom
AIC: 53.438

Number of Fisher Scoring iterations: 7

Assessing the accuracy of the fitted model based on the validation data¶

Estimating the Gini coefficient (obtained from the AUC of the ROC curve), the Kolmogorov-Smirnov statistic and the percentage of correct classification (obtained from the confusion matrix) as accuracy measures.

ROC stands for Receiver Operating Characteristic.
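Before turning to the packages, it may help to see how the accuracy measures named above relate to each other. The sketch below computes them from scratch in base R on simulated predicted probabilities (the labels, probabilities and seed are made up for illustration): the AUC via the rank-sum identity, the Gini coefficient as 2 x AUC - 1, and the KS statistic as the largest gap between the true and false positive rates.

```r
set.seed(10)

# Hypothetical labels (0 = knife, 1 = spear) and predicted P(spear)
labels <- rep(c(0, 1), each = 50)
probs  <- plogis(c(rnorm(50, mean = -1), rnorm(50, mean = 1)))

# AUC from the rank-sum (Mann-Whitney) identity
n1  <- sum(labels == 1)
n0  <- sum(labels == 0)
auc <- (sum(rank(probs)[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Gini coefficient: (AUC - 0.5) / 0.5, i.e. 2 * AUC - 1
gini <- 2 * auc - 1

# True and false positive rates at every possible cut-off
cutoffs <- sort(unique(probs), decreasing = TRUE)
tpr <- sapply(cutoffs, function(cut_off) mean(probs[labels == 1] >= cut_off))
fpr <- sapply(cutoffs, function(cut_off) mean(probs[labels == 0] >= cut_off))

# KS statistic: maximum vertical distance between TPR and FPR
ks <- max(tpr - fpr)

# Percentage of correct classifications at a 0.5 cut-off
accuracy <- 100 * mean((probs > 0.5) == (labels == 1))

c(AUC = auc, Gini = gini, KS = ks, accuracy_percent = accuracy)
```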

#loading R packages
library("ROCR")# loading the ROCR package to obtain the ROC curve
library(pROC)

pred_weapon<- predict(Logit_model, Validation_data,type="response")

## library(ROCR)
df=data.frame(predictions= pred_weapon,labels=Validation_data$Weapon_type)

#logistic regression
pred <- prediction(df$predictions, df$labels)
perf <- performance(pred,"tpr","fpr")
#plot(perf,col="green",main="ROC Curve using Validation Data",xlab="Specificity",ylab="Sensitivity")
#abline(0,1)#add a 45 degree line

pROC_obj <- roc(df$labels,df$predictions,
            smoothed = TRUE,
            # arguments for ci
            ci=TRUE, ci.alpha=0.9, stratified=FALSE,
            # arguments for plot
            plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
            print.auc=TRUE, show.thres=TRUE)


#sens.ci <- ci.se(pROC_obj)
#plot(sens.ci, type="shape", col="lightblue")
## Warning in plot.ci.se(sens.ci, type = "shape", col = "lightblue"): Low
## definition shape.
#plot(sens.ci, type="bars")

Setting levels: control = knife, case = spear

Setting direction: controls < cases

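#NB: perf@y.values holds the sensitivity (TPR) and perf@x.values the false positive rate,
#so the column named "Specificity" below actually contains 1 - specificity
Specificity_sensitivity=data.frame(cbind(c(perf@y.values),c(perf@x.values)))
names(Specificity_sensitivity)=c("Sensitivity","Specificity")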
head(Specificity_sensitivity)

#Area under the curve
print(paste("Area under the  first ROC curve is:",performance(pred,"auc")@"y.values"))

[1] "Area under the  first ROC curve is: 0.9375"

Gini_Coefficient=function(AUROC){
    return((AUROC-0.5)/0.5)}
area_under_curve<- as.numeric(performance(pred,"auc")@"y.values")
print(paste("Gini Coefficient of ROC=",Gini_Coefficient(area_under_curve)))

[1] "Gini Coefficient of ROC= 0.875"

TPRfromROCR=unlist(perf@y.values)
FPRfromROCR=unlist(perf@x.values)
diff_TPRFPR=TPRfromROCR-FPRfromROCR
KS=max(diff_TPRFPR)

print(paste("Kolmogorov Smirnov Value for ROC curve is:", KS))

[1] "Kolmogorov Smirnov Value for ROC curve is: 0.833333333333333"

# compute a confusion matrix using a 0.5 probability cut-off

#TRUE = predicted spear (the modelled event); knife is the reference category
confusion_matrix<- table(Validation_data$Weapon_type, pred_weapon > 0.5)
confusion_matrix

accuracy= ((confusion_matrix[1]+confusion_matrix[4])/sum(confusion_matrix))*100

print(paste("The percentage of accurate classification is","",accuracy,"%"))

       
        FALSE TRUE
  knife    10    2
  spear     2   10

[1] "The percentage of accurate classification is  83.3333333333333 %"

Prediction from the fitted logistic model for an individual with unknown BMI.¶

$$P(\text{spear})=\frac{e^{b_0+\sum_{i=1}^{n} b_iX_i}}{1+e^{b_0+\sum_{i=1}^{n} b_iX_i}}$$
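As a worked example of this formula, the sketch below plugs in the coefficients reported in the model summary above (intercept -16.43866, BMI -0.12532, Lengths 0.62810) for one hypothetical individual. The BMI of 25 and wound length of 30 mm are assumed values for illustration. It also checks the hand calculation against base R's `plogis()`, which evaluates the same logistic function.

```r
# Coefficients copied from the summary(Logit_model) output above
b0    <- -16.43866
b_bmi <- -0.12532
b_len <-  0.62810

# Hypothetical individual (illustrative values)
bmi         <- 25
stab_length <- 30

# Linear predictor: b0 + b1*BMI + b2*Length
eta <- b0 + b_bmi * bmi + b_len * stab_length

# Logistic transformation, written out and via plogis()
p_manual <- exp(eta) / (1 + exp(eta))
p_plogis <- plogis(eta)

c(linear_predictor = eta, P_spear = p_manual, P_knife = 1 - p_manual)
```

A negative linear predictor gives a probability below 0.5, so under these assumed inputs the model would lean towards a knife.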

Creating a function to estimate the probability of the weapon being a spear from the binomial logistic model

coef(Logit_model)

Prob_spear=function(model,bmi, stab_length){
   b0=as.numeric(coef(model)[1])
   b1=as.numeric(coef(model)[2])
   b2=as.numeric(coef(model)[3])
  Numerator=exp(b0+(b1*bmi)+(b2*stab_length))
  return(Numerator/(1+Numerator))  #return the probability on the 0-1 scale
}

#Assuming values for the bmi are based on the empirical data
bmi_values<- seq(18,31.5,by=.5)
bmi_values

length_values_LM<- c(30,35,60)
length_values_HW<- c(7.5,15,20,21,30,31)

pred_spear_bmi_LM=pred_knife_bmi_LM=NULL
pred_spear_bmi_HW=pred_knife_bmi_HW=NULL
for(i in seq_along(length_values_LM)){
pred_spear_bmi_LM[[i]]<-Prob_spear(model=Logit_model,bmi=bmi_values, stab_length=length_values_LM[i])
pred_knife_bmi_LM[[i]]<- 1-pred_spear_bmi_LM[[i]]    
    }

for(i in seq_along(length_values_HW)){
pred_spear_bmi_HW[[i]]<-Prob_spear(model=Logit_model,bmi=bmi_values, stab_length=length_values_HW[i])
pred_knife_bmi_HW[[i]]<- 1-pred_spear_bmi_HW[[i]]    
    }


colour_LM<- c("blue","red","green")
#par(mfrow=c(2,1),mar=c(4,4,1,1))
plot(bmi_values,pred_spear_bmi_LM[[1]],type="l",lwd=3,
     xlab="Possible BMI values",ylab="Probability of a weapon being a spear relative to a dagger",main="Mummy Man",col=colour_LM[1]
    ,ylim=c(0,1),xlim=c(18,32))
abline(h=0.5,col="black",lwd=3)
text(28,0.5,"Probability=0.5",col="black",font=2,pos=3)

for(i in seq_along(length_values_LM)[-1]){
lines(bmi_values,pred_spear_bmi_LM[[i]],type="l",lwd=3,col=colour_LM[i])

}

text_LM<-c("wound length= 30mm","wound length= 35mm","wound length= 60mm")
legend(x=24,y=.8,text_LM,
       col=colour_LM,bty="n",cex=.9,box.lwd = 2,fill=colour_LM)

#plot(length_values,pred_spear_length,type="l",col="red",lwd=3,xlab="Different stab length values",ylab="Probability of a weapon being a spear")
#lines(bmi_values,pred_knife,type="l",col="red",lwd=3)

#legend("center",c("Spear","Knife"),col=c("blue","red"),bty="n",cex=.9,box.lwd = 2,fill=c("blue","red"))

Prob_LM<- as.data.frame(do.call("cbind",pred_spear_bmi_LM))
names(Prob_LM)<-text_LM
Prob_LM$bmi_values<- bmi_values
Prob_LM
write.csv(Prob_LM,"Predicted_probabilities_LM.csv")

colour_HW<- c("green","yellow","orange","pink","blue","red")
#par(mfrow=c(2,1),mar=c(4,4,1,1))
plot(bmi_values,pred_spear_bmi_HW[[1]],type="l",lwd=3,
     xlab="Possible BMI values",ylab="Probability of a weapon being a spear relative to a dagger",main="Mummy Woman",col=colour_HW[1]
    ,ylim=c(0,.8),xlim=c(18,32))
abline(h=0.5,col="black",lwd=3)
text(28,0.5,"Probability=0.5",col="black",font=2,pos=3)

for(i in seq_along(length_values_HW)[-1]){
lines(bmi_values,pred_spear_bmi_HW[[i]],type="l",lwd=3,col=colour_HW[i])

}
text_HW<-c("wound length= 7.5mm","wound length= 15mm","wound length= 20mm",
                    "wound length= 21mm","wound length= 30mm","wound length= 31mm")
legend(x=24,y=.81,text_HW,
       col=colour_HW,bty="n",cex=.9,box.lwd = 2,fill=colour_HW)

Prob_HW<- as.data.frame(do.call("cbind",pred_spear_bmi_HW))
names(Prob_HW)<-text_HW
Prob_HW$bmi_values<- bmi_values
Prob_HW
write.csv(Prob_HW,"Predicted_probabilities_HW.csv")

	rate	convictions	executions	time	income	lfp	noncauc	southern
	<dbl>	<dbl>	<dbl>	<int>	<dbl>	<dbl>	<dbl>	<fct>
1	19.25	0.204	0.035	47	1.10	51.2	0.321	yes
2	7.53	0.327	0.081	58	0.92	48.5	0.224	yes
3	5.66	0.401	0.012	82	1.72	50.8	0.127	no
4	3.21	0.318	0.070	100	2.18	54.4	0.063	no
5	2.80	0.350	0.062	222	1.75	52.4	0.021	no
6	1.41	0.283	0.100	164	2.26	56.7	0.027	no
7	6.18	0.204	0.050	161	2.07	54.6	0.139	yes
8	12.15	0.232	0.054	70	1.43	52.7	0.218	yes
9	1.34	0.199	0.086	219	1.92	52.3	0.008	no
10	3.71	0.138	0.000	81	1.82	53.0	0.012	no

	rate	convictions	executions	time	income	lfp	noncauc	southern
	<dbl>	<dbl>	<dbl>	<int>	<dbl>	<dbl>	<dbl>	<fct>
39	1.74	0.418	0.000	104	2.04	51.7	0.017	no
40	11.98	0.282	0.032	91	1.59	54.3	0.222	yes
41	3.04	0.194	0.086	199	2.07	53.7	0.026	no
42	0.85	0.378	0.000	101	2.00	54.7	0.012	no
43	2.83	0.757	0.033	109	1.84	47.0	0.057	yes
44	2.89	0.356	0.000	117	2.04	56.9	0.022	no

...1	Zcores	Elements	Locations
<dbl>	<chr>	<chr>	<chr>
1	2.2171772203965099	Carbon	A
2	-8.0415754107645801E-2	Carbon	C
3	-0.76969364645889404	Carbon	E
4	0.14934354334276501	Carbon	A
5	-0.31017505155806502	Carbon	C
6	-0.31017505155806502	Carbon	E

Zcores	Elements	Locations
<chr>	<chr>	<chr>
2.2171772203965099	Carbon	A
-8.0415754107645801E-2	Carbon	C
-0.76969364645889404	Carbon	E
0.14934354334276501	Carbon	A
-0.31017505155806502	Carbon	C
-0.31017505155806502	Carbon	E

	V1	Experience	X.Strabismus_surgery	X.Oculoplastic_surgery	X.Cataract_surgery	VR_surgery	Laser_surgery	Extraocular_surgical._competence	X.stereoacuity_level_extraocular_surgery	Intraocular_surgical_competence	stereoacuity_level_intraocular_surgery	X.Extraocular_surgery_performed	Intraocular_surgery_performed	Stereoacuity_measured	Stereo	Comp	Location	Category	Cataract
	<dbl>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<dbl>	<dbl>	<fct>	<dbl>	<dbl>
1	1	5-10 years	Disagree	Disagree	Disagree	Disagree	Disagree	No	No stereoacuity	No	No stereoacuity	No	Yes	No	3	0	Wales	1	1
2	2	1-5 years	Agree	Agree	Agree	Agree	Agree	Yes	200 secs	Yes	80 secs	No	Yes	Yes	1	1	Wales	1	1
3	3	1-5 years	Disagree	Disagree	Agree	Agree	Agree	No	No stereoacuity	Yes	200 secs	No	Yes	Yes	0	1	Wales	1	1
4	4	10-15 years	Disagree	Disagree	Agree	Agree	Agree	No	No stereoacuity	Yes	400 secs	No	Yes	No	3	1	Wales	1	1
5	5	15-20 years	Agree	Agree	Agree	Agree	Agree	Yes	60 secs of arc or better	Yes	60 secs of arc or better	No	Yes	Yes	1	1	Wales	1	1
6	6	15-20 years	Agree	Agree	Agree	Agree	Agree	Yes	60 secs of arc or better	Yes	60 secs of arc or better	No	Yes	Yes	2	1	Northern	1	1

	Sensitivity	Specificity
	<list>	<list>
1	0.00000000, 0.08333333, 0.16666667, 0.25000000, 0.33333333, 0.41666667, 0.41666667, 0.50000000, 0.58333333, 0.66666667, 0.75000000, 0.83333333, 0.83333333, 0.91666667, 1.00000000, 1.00000000, 1.00000000, 1.00000000, 1.00000000, 1.00000000, 1.00000000, 1.00000000, 1.00000000, 1.00000000, 1.00000000	0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.08333333, 0.16666667, 0.16666667, 0.16666667, 0.25000000, 0.33333333, 0.41666667, 0.50000000, 0.58333333, 0.66666667, 0.75000000, 0.83333333, 0.91666667, 1.00000000

A tibble: 6 × 632
version	date	start_time	hhid	region_id	enumerator_id	enumerator	consent	region	district_id	...	fert_cost_soybean_3	fertilizer_soybean_4	fert_qty_soybean_4	fert_up_soybean_4	fert_cost_soybean_4	duration	hour	start_date	end_date	submission_date
<chr>	<chr>	<chr>	<chr>	<dbl+lbl>	<dbl+lbl>	<chr>	<dbl+lbl>	<chr>	<dbl+lbl>	...	<dbl>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<date>	<date>	<date>
2106012155	2-Jun-2021	11:40	820959	4	11	Appiah Adjei Christina	1	Upper East	10	...	NA		NA	NA	NA	3034	50.56667	2021-06-02	2021-06-04	2021-06-05
2106012155	2-Jun-2021	11:47	646837	4	12	Mustapha Suraj Mohammed	1	Upper East	10	...	NA		NA	NA	NA	2576	42.93333	2021-06-02	2021-06-04	2021-06-04
2106012155	5-Jun-2021	12:26	398199	4	12	Mustapha Suraj Mohammed	1	Upper East	10	...	NA		NA	NA	NA	4088	68.13333	2021-06-05	2021-06-08	2021-06-10
2106012155	3-Jun-2021	14:56	674671	4	12	Mustapha Suraj Mohammed	1	Upper East	10	...	NA		NA	NA	NA	3548	59.13334	2021-06-03	2021-06-06	2021-06-06
2106012155	3-Jun-2021	11:24	467917	4	10	John Azaaziba	1	Upper East	10	...	NA		NA	NA	NA	4974	82.90000	2021-06-03	2021-06-06	2021-06-07
2106012155	4-Jun-2021	11:05	127523	4	12	Mustapha Suraj Mohammed	1	Upper East	10	...	NA		NA	NA	NA	4084	68.06667	2021-06-04	2021-06-07	2021-06-08

A data.frame: 6 × 18
	schid	stuid	ses	meanses	homework	white	parented	public	ratio	percmin	math	sex	race	sctype	cstr	scsize	urban	region
	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
1	6053	1	0.85	0.6997727	1	1	4	0	18	3	50	2	4	4	3	3	1	2
2	6053	2	0.43	0.6997727	1	1	3	0	18	3	43	2	4	4	3	3	1	2
3	6053	4	-0.59	0.6997727	3	0	3	0	18	3	50	2	1	4	3	3	1	2
4	6053	11	1.02	0.6997727	1	1	5	0	18	3	49	2	4	4	3	3	1	2
5	6053	12	0.84	0.6997727	1	1	5	0	18	3	62	1	4	4	3	3	1	2
6	6053	13	1.32	0.6997727	1	1	6	0	18	3	43	2	4	4	3	3	1	2

A tibble: 6 × 19
schid	stuid	ses	meanses	homework	white	parented	public	ratio	percmin	math	sex	race	sctype	cstr	scsize	urban	region	schnum
<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
7472	3	-0.13	-0.4826087	1	1	2	1	19	0	48	2	4	1	2	3	2	2	1
7472	8	-0.39	-0.4826087	0	1	2	1	19	0	48	1	4	1	2	3	2	2	1
7472	13	-0.80	-0.4826087	0	1	2	1	19	0	53	1	4	1	2	3	2	2	1
7472	17	-0.72	-0.4826087	1	1	2	1	19	0	42	1	4	1	2	3	2	2	1
7472	27	-0.74	-0.4826087	2	1	2	1	19	0	43	2	4	1	2	3	2	2	1
7472	28	-0.58	-0.4826087	1	1	2	1	19	0	57	2	4	1	2	3	2	2	1

A tibble: 6 × 19
schid	stuid	ses	meanses	homework	white	parented	public	ratio	percmin	math	sex	race	sctype	cstr	scsize	urban	region	schnum
<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<fct>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
7472	3	-0.13	-0.4826087	1	1	2	1	19	0	48	female	4	1	2	3	2	2	1
7472	8	-0.39	-0.4826087	0	1	2	1	19	0	48	male	4	1	2	3	2	2	1
7472	13	-0.80	-0.4826087	0	1	2	1	19	0	53	male	4	1	2	3	2	2	1
7472	17	-0.72	-0.4826087	1	1	2	1	19	0	42	male	4	1	2	3	2	2	1
7472	27	-0.74	-0.4826087	2	1	2	1	19	0	43	female	4	1	2	3	2	2	1
7472	28	-0.58	-0.4826087	1	1	2	1	19	0	57	female	4	1	2	3	2	2	1