R CODES FOR THE 5TH MEETING/CLASS¶

Below is the outline of today's class

Tasks to be completed¶

Solving a few selected questions in the last week's assignment tasks

a) Extract multiples of 12 from 10 to 100000 (save results as a vector & list respectively) using i) for loop and ii) while loop

b) Create a Monte Carlo Simulation to describe the outcome of pregnancy for 100 women at birth (number of runs=100) who can either give birth to a male or female. Note that male and female babies are equally likely to be born (probability of 0.5 respectively). Record the total number of woman who gave birth to males and female respectively out of the 100 women from the Monte Carlo Simulation. Using a barplot, plot the proportion of male and female borns. Increase the number of women to 1000, and compare the findings with that of the 100 women.

Simulating samples from probability distributions (exponential, normal, lognormal, Poisson, uniform and binomial distributions).
Prove the Central Limit theorem of at least the mean based on simulated samples from different probability distributions.
Use Bootstrapping & Jackknife estimation techniques to estimate and bound unknown parameters such as population means, population proportions, etc.

Q1a) Extract multiples of 12 from 10 to 100000¶

Save results as a vector & list respectively using:

i) for loop and ii) while loop

Using For Loop to find all multiples of 12 from 10 to 100000

# Q1a) Extract multiples of 12 from 10 to 100000 and storing results as a vector
#Using For loop
n <- 10:100000 #seq(10,100000,by=1)
multiple_12<-rep(NA,length=length(n)) #creating a vector of NA to store results
for (i in seq_along(n)) {# for(i in 1:length(n))
  if(n[i]%%12 == 0)  {
    multiple_12[i]<- n[i]
     }
}

#multiple_12 #This willcontain some NA 

multiples_of_12<-na.omit(multiple_12) #stores all multiples of 12 eliminating NAs

print(paste("Total multiples of 12=",length(multiples_of_12))) #total multiples of 12

print(multiples_of_12[1:100])# printing first 100 multiple of 12.

[1] "Total multiples of 12= 8333"
  [1]   12   24   36   48   60   72   84   96  108  120  132  144  156  168  180
 [16]  192  204  216  228  240  252  264  276  288  300  312  324  336  348  360
 [31]  372  384  396  408  420  432  444  456  468  480  492  504  516  528  540
 [46]  552  564  576  588  600  612  624  636  648  660  672  684  696  708  720
 [61]  732  744  756  768  780  792  804  816  828  840  852  864  876  888  900
 [76]  912  924  936  948  960  972  984  996 1008 1020 1032 1044 1056 1068 1080
 [91] 1092 1104 1116 1128 1140 1152 1164 1176 1188 1200

# Q1b) Extract multiples of 12 from 10 to 100000 and storing results as a list 
#Using For loop
n <- 10:100000 
multiple_12<-list()# c() #creating a empty list to store results
for (i in seq_along(n)) {
  if(n[i]%%12 == 0)  {
    multiple_12[[i]]<- n[i]
     }
}



multiples_of_12<-unlist(multiple_12) #unlisting to create a vector of only multiples of 12

print(paste("Total multiples of 12=",length(multiples_of_12))) #total multiples of 12

print(multiples_of_12[1:100]) # printing first 100 multiple of 12.

[1] "Total multiples of 12= 8333"
  [1]   12   24   36   48   60   72   84   96  108  120  132  144  156  168  180
 [16]  192  204  216  228  240  252  264  276  288  300  312  324  336  348  360
 [31]  372  384  396  408  420  432  444  456  468  480  492  504  516  528  540
 [46]  552  564  576  588  600  612  624  636  648  660  672  684  696  708  720
 [61]  732  744  756  768  780  792  804  816  828  840  852  864  876  888  900
 [76]  912  924  936  948  960  972  984  996 1008 1020 1032 1044 1056 1068 1080
 [91] 1092 1104 1116 1128 1140 1152 1164 1176 1188 1200

Using While Loop to find all multiples of 12 from 10 to 100000

# Q1 c) Extract multiples of 12 from 10 to 100000 and storing results as a list
#Using For While

multiple_12<- list() #creating a empty list to store results
#multiple_12=rep(NA, length(n))

last_number=100000

n <-  10

while (n<=last_number){
  
    if(n%%12 == 0)  {
    multiple_12[[n]]<- n
     }
    n=n+1
}

#multiple_12 #This willcontain some NA 

multiples_of_12<-unlist(multiple_12) #stores all multiples of 12 eliminating NAs

print(paste("Total multiples of 12=",length(multiples_of_12))) #total multiples of 12

print(multiples_of_12[1:100]) # printing first 100 multiple of 12.

[1] "Total multiples of 12= 8333"
  [1]   12   24   36   48   60   72   84   96  108  120  132  144  156  168  180
 [16]  192  204  216  228  240  252  264  276  288  300  312  324  336  348  360
 [31]  372  384  396  408  420  432  444  456  468  480  492  504  516  528  540
 [46]  552  564  576  588  600  612  624  636  648  660  672  684  696  708  720
 [61]  732  744  756  768  780  792  804  816  828  840  852  864  876  888  900
 [76]  912  924  936  948  960  972  984  996 1008 1020 1032 1044 1056 1068 1080
 [91] 1092 1104 1116 1128 1140 1152 1164 1176 1188 1200

Monte Carlo Simulation (Assignment question)¶

Create a Monte Carlo Simulation to describe the outcome of pregnancy for 100 women at birth (number of runs=100) who can either give birth to a male or female. Note that male and female babies are equally likely to be born (probability of 0.5 respectively). Record the total number of woman who gave birth to males and female respectively out of the 100 women from the Monte Carlo Simulation. Using a barplot, plot the proportion of male and female borns. Increase the number of women to 1000, and compare the findings with that of the 100 women.

Outcome of birth for a single Mother/birth

#Create a Monte Carlo Simulation to describe the outcome of pregnancy 
#for a single mother at birth (number of runs=1) who can either give birth to 
#a male or female. Note that male and female babies are equally
#likely to be born (probability of 0.5 respectively)

count=rep(0,2) # or numeric(2) #to stores the total of the two outcomes (male or female borns)

rand=runif(n=1,0,1) #a random variable to decide which type of birth occurs ( for a single woman)
for (i in rand){
  if (i>=0 & i<=1/2){ #or just if(i<=1/2)
  count[1]=count[1]+1
    print("Boy")
    } else if(i>1/2 & i<=1){ #or just else{}
     count[2]=count[2]+1
    print("Girl")
     }
}

[1] "Boy"

Outcome of birth for 100 Mothers/births

#Create a Monte Carlo Simulation to describe the outcome of pregnancy 
#for 100 women at birth (number of runs=100) who can either give birth to 
#a male or female. Note that male and female babies are equally
#likely to be born (probability of 0.5 respectively)

count=rep(0,2) # or numeric(2) #to stores the total of the two outcomes (male or female borns)

rand=runif(n=100,0,1) #a random variable to decide which type of birth occurs (100 women)
for (i in rand){
  if (i>=0 & i<=1/2){ #or just if(i<=1/2)
  count[1]=count[1]+1
    #print("Boy")
    } else if(i>1/2 & i<=1){ #or just else{}
     count[2]=count[2]+1
    #print("Girl")
     }
}

#Printing total number of male and female borns
type_born<- c("Boy","Girl")
for (i in seq_along(count)) {
    print(paste(type_born[i],"", "=",count[i]))
}

[1] "Boy  = 53"
[1] "Girl  = 47"

Outcome of birth for 1000 Mothers/births

#Create a Monte Carlo Simulation to describe the outcome of pregnancy 
#for 1000 women at birth (number of runs=100) who can either give birth to 
#a male or female. Note that male and female babies are equally
#likely to be born (probability of 0.5 respectively)

count=rep(0,2) # or numeric(2) #to stores the total of the two outcomes (male or female borns)
#set.seed(1)
rand=runif(n=1000,0,1) #a random variable to decide which type of birth occurs (1000 women)
for (i in rand){
  if (i>=0 && i<=1/2){ #or just if(i<=1/2)
  count[1]=count[1]+1
    #print("Boy")
    } else if(i>1/2 && i<=1){ #or just else{}
     count[2]=count[2]+1
    #print("Girl")
     }
}

#Printing total number of male and female borns
type_born<- c("Boy","Girl")
for (i in seq_along(count)) {
    print(paste(type_born[i],"", "=",count[i]))
}

[1] "Boy  = 519"
[1] "Girl  = 481"

options(repr.plot.width=8, repr.plot.height=8,repr.plot.res = 300) #Setting plot size & resolution

total_women=1000
barplot(count/total_women,col="blue",xlab="Sex of baby",horiz = FALSE,
       names.arg =c("Boy","Girl"))

text(.7, 0.45, paste(round((count[1]/total_women*100),3),"%", "(","n=",count[1],")"),col="black") #pasting percentage of male borns
text(2, 0.4, paste(round((count[2]/total_women*100),3),"%","(","n=",count[2],")"),col="black")#pasting percentage of female borns

What Is the Central Limit Theorem (CLT)?¶

The Central Limit Theorem states that the sampling distribution of the sample means approaches a normal distribution as the sample size gets larger — no matter what the shape of the population distribution or distribution of the population. This fact holds especially true for sample sizes at least 30.

Simulating samples from probability distributions for CLT

Exponential, normal, lognormal, Poisson, uniform and binomial distributions.

set.seed(123) #for reproducibility
names_prob_distn<-c("binomial","poisson","uniform","exponential","normal","lognormal")
prob_distn=NULL #list( )# c()

prob_distn[[1]]<-rbinom(n=1000,size=10,prob=0.5) #binomial sample
prob_distn[[2]]<-rpois(n=1000, lambda=1)#poisson_sample
prob_distn[[3]]<-runif(n=1000,min=0, max=1)#uniform sample
prob_distn[[4]]<-rexp(n=1000, rate=1)#exponential sample
prob_distn[[5]]<-rnorm(n=1000, mean = 0, sd = 1)#normal sample
prob_distn[[6]]<-rlnorm(n=1000, meanlog = 0, sdlog = 1)#lognormal sample


par(mfrow=c(3,2),mar=c(4,4,1,0))

for(i in seq_along(names_prob_distn)){
    hist(prob_distn[[i]],col=i,main=paste(names_prob_distn[i]),xlab=paste(names_prob_distn[i],"","Samples"))
}

CLT based samples from different population distribution¶

For a large number of times (100 thousand replicates each)

#Simulating 100000 different samples of size 30 from different population distribution respectively
names_prob_distn<-c("exponential","poisson","uniform","normal","lognormal","binomial")

#create 6 lists as objects for each distribution
probability_samples=vector("list",length=length(names_prob_distn)) 

for(i in seq_along(names_prob_distn)){ #creating list for each list
    probability_samples[[i]]=list()
}

nsim=100000

  for(k in 1:nsim){ #100000 replicates/repititions for each probability distribution
   probability_samples[[1]][[k]]<- rexp(n=30, rate=1) # 30 exponential samples with mean=1  
   probability_samples[[2]][[k]]<- rpois(n=30, lambda=1)#poisson_sample
   probability_samples[[3]][[k]]<- runif(n=30,min=0, max=1)#uniform sample
   probability_samples[[4]][[k]]<- rnorm(n=30, mean = 0, sd = 1)#normal sample
   probability_samples[[5]][[k]]<- rlnorm(n=30, meanlog = 0, sdlog = 1)#lognormal sample
   probability_samples[[6]][[k]]<- rbinom(n=30,size=10,prob=0.5) #binomial sample
      }

probability_samples[[1]][[1]]#view first exponential sample

probability_samples[[1]][[2]]#view second exponential sample

Finding mean for each exponential sample

#create 6 lists as objects for means of each distribution
means_distribution=vector("list",length=length(names_prob_distn)) 

#Create vector of 0 of size 10000 for each list
for(i in seq_along(names_prob_distn)){ 
means_distribution[[i]]=numeric(nsim)
   }



for(i in seq_along(names_prob_distn)){
    for(k in 1:nsim){
 means_distribution[[i]][k]<- mean(probability_samples[[i]][[k]]) # 30 exponential samples with mean=1
        }
}

Plotting the sampling distribution of means for each probability distribution

par(mfrow=c(3,2),mar=c(4,4,1,0))

for(i in seq_along(names_prob_distn)){
    hist(means_distribution[[i]],col=i,main=paste(names_prob_distn[i]),xlab="Sampling distribution of means")
}

Normal Quantile-Quantile Plot¶

sample=c(2,9,7,10)
sample
sort(sample)

par(mfrow=c(3,2),mar=c(4,4,1,0))

for(i in seq_along(names_prob_distn)){
qqnorm(means_distribution[[i]], col=i, pch = 1, frame = FALSE,main=paste("Normal Q-Q plot:",names_prob_distn[i]))
qqline(means_distribution[[i]], col = "steelblue", lwd = 3)
        }

Testing for normality¶

Shapiro-Wilk test (parameteric test for large samples at most 5000: recommended)
Anderson-Darling test (non-parameteric test)
Lilliefors test
Kolmogorov-Smirnov test (non-parameteric test for large samples more than 5000). Cannot consider ties in the data.
Shapiro-Francia test (parameteric test)
Skewness-Kurtosis test (non-parameteric test)
Jarque–Bera test (for time series data)

The hypothesis for the normality test

H0: The data is normal

H1: The data is not normal

NB: The three common procedures in assessing whether a random sample of independent observations of size n come from a population with a normal distribution are: graphical methods (histograms, boxplots, Q-Q-plots), numerical methods (skewness and kurtosis indices) and formal normality tests.

Use Bootstrapping & Jackknife estimation techniques or resampling method to estimate and bound unknown parameters such as population means, population proportions, etc.

NB: Bootstrap samples are simple samples obtained with replacement

Finding the bootstrap and Jackknife means of samples from exponential distribution¶

Simulating a 100 samples or data from exponential distribution with parameter=2 to represent your population from which bootstrap samples will be obtained

### Finding the bootstrap and Jackknife means of samples from exponential distribution###

#Simulating a 100 samples or data from exponential distribution for instance
set.seed(1)#setting seed number for reproducibility
Exponential_samples<- rexp(n=100,rate=2)

print(Exponential_samples)

  [1] 0.37759092 0.59082139 0.07285336 0.06989763 0.21803431 1.44748427
  [7] 0.61478103 0.26984142 0.47828375 0.07352300 0.69536756 0.38101493
 [13] 0.61880178 2.21196711 0.52727158 0.51762197 0.93801759 0.32737332
 [19] 0.16846674 0.29423986 1.18225763 0.32094629 0.14706019 0.28293276
 [25] 0.05303631 0.02971958 0.28935623 1.97946643 0.58665605 0.49840648
 [31] 0.71764267 0.01863426 0.16200508 0.66023396 0.10175518 0.51136294
 [37] 0.15087047 0.36260715 0.37577135 0.11751373 0.53994057 0.51412345
 [43] 0.64613082 0.62655268 0.27732070 0.15064150 0.64656233 0.49727789
 [49] 0.25708715 1.00391620 0.21112122 1.08938628 1.60889451 0.27891468
 [55] 0.29730883 0.48869790 0.10493329 0.15472393 0.55296813 0.38709388
 [61] 0.04483704 0.55408833 0.12363213 0.78599342 2.41640637 0.21556607
 [67] 1.36519466 0.56841571 0.40668412 0.41850325 0.89238270 1.15623581
 [73] 1.45494363 0.14279549 0.19439339 0.02602772 0.17593525 0.78262067
 [79] 0.40726790 1.37962189 0.19309678 0.50412823 0.40925709 0.02963060
 [85] 1.14192674 0.40208546 0.79184804 0.61689575 0.67282201 1.05018861
 [91] 0.51754857 0.22372595 0.52180426 0.13041412 0.34061455 0.13186913
 [97] 0.22330283 0.10530345 0.06628570 0.17444417

mean(Exponential_samples)

x=c(1,4,7,3,8,0)

x

sample(x,size=length(x),replace=TRUE)

Obtaining 10000 different Bootstrap samples of size 100 from the simulated data¶

Bootstrap_samples=list()

for(i in 1:10000)  Bootstrap_samples[[i]]<- sample(Exponential_samples,size=100, replace=TRUE)

Bootstrap_samples[[1]][1:10] #view first 10 data point of 1st bootstrap sample

Bootstrap_samples[[2]][1:10]  ##view first 10 data point of 2nd bootstrap sample

Finding means of each Bootstrap samples

means_bootstrap_samples<- numeric(10000)

for (i in seq_along(means_bootstrap_samples))  means_bootstrap_samples[i]<- mean(Bootstrap_samples[[i]])

Plotting the sample distribution of the means from the bootstrap samples.¶

NB: We are expecting a symmetrical distribution of the means for the Bootstrappping technique to be accurate or good.

hist(means_bootstrap_samples,col="blue",main="Sampling distribution of means from the Bootstrap resampling",xlab="Means",
    ylim=c(0,2000))

abline(v=mean(means_bootstrap_samples),lwd=3,col="red",lty=2)

text(0.58,1800,paste("True mean=",1/2),col="black")

text(0.58,1740,paste("Bootstrap mean=",round(mean(means_bootstrap_samples),4)),col="red")


text(0.61,1690,paste("Average of the actual data=",round(mean(Exponential_samples),4)),col="green")

The bootstrap mean is the overall mean across all the means from the bootstrap samples

paste("The bootstrap mean=",mean(means_bootstrap_samples))

prediction_interval_95=quantile(means_bootstrap_samples, c(0.025,0.975))
paste("The 95% prediction interval:","[" ,prediction_interval_95[1],";",prediction_interval_95[2],"]" )

paste("Bias of bootstrap estimate=",mean(means_bootstrap_samples)-0.5)

Jackknife estimation of the mean (leave-one-out estimation)¶

leave_one_out_means=numeric(length(Exponential_samples))

for(i in seq_along(Exponential_samples)){
 leave_one_out_means[i]<- mean(Exponential_samples[-i])   
}


Jackknife_estimate<- mean(leave_one_out_means)

paste("The Jackknife mean=",Jackknife_estimate)

prediction_interval_95=quantile(leave_one_out_means, c(0.025,0.975))
paste("The 95% prediction interval:","[" ,prediction_interval_95[1],";",prediction_interval_95[2],"]" )

paste("Bias of jackknife estimate=",Jackknife_estimate-0.5)

hist(leave_one_out_means,main="Sampling distribution of means from Jackknife resampling",xlab="means",col="blue")

abline(v=Jackknife_estimate,lwd=3,col="yellow",lty=2)

text(0.505,55,paste("True mean=",1/2),col="black")

text(0.505,50,paste("Jackknife mean=",round(Jackknife_estimate,4)),col="yellow")

text(0.505,45,paste("Bootstrap mean=",round(mean(means_bootstrap_samples),4)),col="red")

text(0.505,40,paste("Average of the actual data=",round(mean(Exponential_samples),4)),col="green")