R CODES FOR THE 5TH MEETING/CLASS

Below is the outline of today's class

Tasks to be completed

  1. Solving a few selected questions in the last week's assignment tasks

    a) Extract multiples of 12 from 10 to 100000 (save results as a vector & list respectively) using i) for loop and ii) while loop

    b) Create a Monte Carlo Simulation to describe the outcome of pregnancy for 100 women at birth (number of runs=100) who can either give birth to a male or female. Note that male and female babies are equally likely to be born (probability of 0.5 respectively). Record the total number of woman who gave birth to males and female respectively out of the 100 women from the Monte Carlo Simulation. Using a barplot, plot the proportion of male and female borns. Increase the number of women to 1000, and compare the findings with that of the 100 women.

  1. Simulating samples from probability distributions (exponential, normal, lognormal, Poisson, uniform and binomial distributions).

  2. Prove the Central Limit theorem of at least the mean based on simulated samples from different probability distributions.

  3. Use Bootstrapping & Jackknife estimation techniques to estimate and bound unknown parameters such as population means, population proportions, etc.

Q1a) Extract multiples of 12 from 10 to 100000

Save results as a vector & list respectively using:

i) for loop and ii) while loop

Using For Loop to find all multiples of 12 from 10 to 100000

In [5]:
# Q1a) Extract multiples of 12 from 10 to 100000 and storing results as a vector
#Using For loop
n <- 10:100000 #seq(10,100000,by=1)
multiple_12<-rep(NA,length=length(n)) #creating a vector of NA to store results
for (i in seq_along(n)) {# for(i in 1:length(n))
  if(n[i]%%12 == 0)  {
    multiple_12[i]<- n[i]
     }
}

#multiple_12 #This willcontain some NA 

multiples_of_12<-na.omit(multiple_12) #stores all multiples of 12 eliminating NAs

print(paste("Total multiples of 12=",length(multiples_of_12))) #total multiples of 12

print(multiples_of_12[1:100])# printing first 100 multiple of 12.
[1] "Total multiples of 12= 8333"
  [1]   12   24   36   48   60   72   84   96  108  120  132  144  156  168  180
 [16]  192  204  216  228  240  252  264  276  288  300  312  324  336  348  360
 [31]  372  384  396  408  420  432  444  456  468  480  492  504  516  528  540
 [46]  552  564  576  588  600  612  624  636  648  660  672  684  696  708  720
 [61]  732  744  756  768  780  792  804  816  828  840  852  864  876  888  900
 [76]  912  924  936  948  960  972  984  996 1008 1020 1032 1044 1056 1068 1080
 [91] 1092 1104 1116 1128 1140 1152 1164 1176 1188 1200
In [10]:
# Q1b) Extract multiples of 12 from 10 to 100000 and storing results as a list 
#Using For loop
n <- 10:100000 
multiple_12<-list()# c() #creating a empty list to store results
for (i in seq_along(n)) {
  if(n[i]%%12 == 0)  {
    multiple_12[[i]]<- n[i]
     }
}



multiples_of_12<-unlist(multiple_12) #unlisting to create a vector of only multiples of 12

print(paste("Total multiples of 12=",length(multiples_of_12))) #total multiples of 12

print(multiples_of_12[1:100]) # printing first 100 multiple of 12.
[1] "Total multiples of 12= 8333"
  [1]   12   24   36   48   60   72   84   96  108  120  132  144  156  168  180
 [16]  192  204  216  228  240  252  264  276  288  300  312  324  336  348  360
 [31]  372  384  396  408  420  432  444  456  468  480  492  504  516  528  540
 [46]  552  564  576  588  600  612  624  636  648  660  672  684  696  708  720
 [61]  732  744  756  768  780  792  804  816  828  840  852  864  876  888  900
 [76]  912  924  936  948  960  972  984  996 1008 1020 1032 1044 1056 1068 1080
 [91] 1092 1104 1116 1128 1140 1152 1164 1176 1188 1200

Using While Loop to find all multiples of 12 from 10 to 100000

In [18]:
# Q1 c) Extract multiples of 12 from 10 to 100000 and storing results as a list
#Using For While

multiple_12<- list() #creating a empty list to store results
#multiple_12=rep(NA, length(n))

last_number=100000

n <-  10

while (n<=last_number){
  
    if(n%%12 == 0)  {
    multiple_12[[n]]<- n
     }
    n=n+1
}

#multiple_12 #This willcontain some NA 

multiples_of_12<-unlist(multiple_12) #stores all multiples of 12 eliminating NAs

print(paste("Total multiples of 12=",length(multiples_of_12))) #total multiples of 12

print(multiples_of_12[1:100]) # printing first 100 multiple of 12.
[1] "Total multiples of 12= 8333"
  [1]   12   24   36   48   60   72   84   96  108  120  132  144  156  168  180
 [16]  192  204  216  228  240  252  264  276  288  300  312  324  336  348  360
 [31]  372  384  396  408  420  432  444  456  468  480  492  504  516  528  540
 [46]  552  564  576  588  600  612  624  636  648  660  672  684  696  708  720
 [61]  732  744  756  768  780  792  804  816  828  840  852  864  876  888  900
 [76]  912  924  936  948  960  972  984  996 1008 1020 1032 1044 1056 1068 1080
 [91] 1092 1104 1116 1128 1140 1152 1164 1176 1188 1200

Monte Carlo Simulation (Assignment question)

Create a Monte Carlo Simulation to describe the outcome of pregnancy for 100 women at birth (number of runs=100) who can either give birth to a male or female. Note that male and female babies are equally likely to be born (probability of 0.5 respectively). Record the total number of woman who gave birth to males and female respectively out of the 100 women from the Monte Carlo Simulation. Using a barplot, plot the proportion of male and female borns. Increase the number of women to 1000, and compare the findings with that of the 100 women.

Outcome of birth for a single Mother/birth

In [27]:
#Create a Monte Carlo Simulation to describe the outcome of pregnancy 
#for a single mother at birth (number of runs=1) who can either give birth to 
#a male or female. Note that male and female babies are equally
#likely to be born (probability of 0.5 respectively)

count=rep(0,2) # or numeric(2) #to stores the total of the two outcomes (male or female borns)

rand=runif(n=1,0,1) #a random variable to decide which type of birth occurs ( for a single woman)
for (i in rand){
  if (i>=0 & i<=1/2){ #or just if(i<=1/2)
  count[1]=count[1]+1
    print("Boy")
    } else if(i>1/2 & i<=1){ #or just else{}
     count[2]=count[2]+1
    print("Girl")
     }
}
[1] "Boy"

Outcome of birth for 100 Mothers/births

In [32]:
#Create a Monte Carlo Simulation to describe the outcome of pregnancy 
#for 100 women at birth (number of runs=100) who can either give birth to 
#a male or female. Note that male and female babies are equally
#likely to be born (probability of 0.5 respectively)

count=rep(0,2) # or numeric(2) #to stores the total of the two outcomes (male or female borns)

rand=runif(n=100,0,1) #a random variable to decide which type of birth occurs (100 women)
for (i in rand){
  if (i>=0 & i<=1/2){ #or just if(i<=1/2)
  count[1]=count[1]+1
    #print("Boy")
    } else if(i>1/2 & i<=1){ #or just else{}
     count[2]=count[2]+1
    #print("Girl")
     }
}
In [33]:
#Printing total number of male and female borns
type_born<- c("Boy","Girl")
for (i in seq_along(count)) {
    print(paste(type_born[i],"", "=",count[i]))
}
[1] "Boy  = 53"
[1] "Girl  = 47"

Outcome of birth for 1000 Mothers/births

In [41]:
#Create a Monte Carlo Simulation to describe the outcome of pregnancy 
#for 1000 women at birth (number of runs=100) who can either give birth to 
#a male or female. Note that male and female babies are equally
#likely to be born (probability of 0.5 respectively)

count=rep(0,2) # or numeric(2) #to stores the total of the two outcomes (male or female borns)
#set.seed(1)
rand=runif(n=1000,0,1) #a random variable to decide which type of birth occurs (1000 women)
for (i in rand){
  if (i>=0 && i<=1/2){ #or just if(i<=1/2)
  count[1]=count[1]+1
    #print("Boy")
    } else if(i>1/2 && i<=1){ #or just else{}
     count[2]=count[2]+1
    #print("Girl")
     }
}

#Printing total number of male and female borns
type_born<- c("Boy","Girl")
for (i in seq_along(count)) {
    print(paste(type_born[i],"", "=",count[i]))
}
[1] "Boy  = 519"
[1] "Girl  = 481"
In [42]:
options(repr.plot.width=8, repr.plot.height=8,repr.plot.res = 300) #Setting plot size & resolution
In [44]:
total_women=1000
barplot(count/total_women,col="blue",xlab="Sex of baby",horiz = FALSE,
       names.arg =c("Boy","Girl"))

text(.7, 0.45, paste(round((count[1]/total_women*100),3),"%", "(","n=",count[1],")"),col="black") #pasting percentage of male borns
text(2, 0.4, paste(round((count[2]/total_women*100),3),"%","(","n=",count[2],")"),col="black")#pasting percentage of female borns

What Is the Central Limit Theorem (CLT)?

The Central Limit Theorem states that the sampling distribution of the sample means approaches a normal distribution as the sample size gets larger — no matter what the shape of the population distribution or distribution of the population. This fact holds especially true for sample sizes at least 30.

Simulating samples from probability distributions for CLT

Exponential, normal, lognormal, Poisson, uniform and binomial distributions.

In [45]:
set.seed(123) #for reproducibility
names_prob_distn<-c("binomial","poisson","uniform","exponential","normal","lognormal")
prob_distn=NULL #list( )# c()

prob_distn[[1]]<-rbinom(n=1000,size=10,prob=0.5) #binomial sample
prob_distn[[2]]<-rpois(n=1000, lambda=1)#poisson_sample
prob_distn[[3]]<-runif(n=1000,min=0, max=1)#uniform sample
prob_distn[[4]]<-rexp(n=1000, rate=1)#exponential sample
prob_distn[[5]]<-rnorm(n=1000, mean = 0, sd = 1)#normal sample
prob_distn[[6]]<-rlnorm(n=1000, meanlog = 0, sdlog = 1)#lognormal sample


par(mfrow=c(3,2),mar=c(4,4,1,0))

for(i in seq_along(names_prob_distn)){
    hist(prob_distn[[i]],col=i,main=paste(names_prob_distn[i]),xlab=paste(names_prob_distn[i],"","Samples"))
}

CLT based samples from different population distribution

For a large number of times (100 thousand replicates each)

In [183]:
#Simulating 100000 different samples of size 30 from different population distribution respectively
names_prob_distn<-c("exponential","poisson","uniform","normal","lognormal","binomial")

#create 6 lists as objects for each distribution
probability_samples=vector("list",length=length(names_prob_distn)) 

for(i in seq_along(names_prob_distn)){ #creating list for each list
    probability_samples[[i]]=list()
}

nsim=100000

  for(k in 1:nsim){ #100000 replicates/repititions for each probability distribution
   probability_samples[[1]][[k]]<- rexp(n=30, rate=1) # 30 exponential samples with mean=1  
   probability_samples[[2]][[k]]<- rpois(n=30, lambda=1)#poisson_sample
   probability_samples[[3]][[k]]<- runif(n=30,min=0, max=1)#uniform sample
   probability_samples[[4]][[k]]<- rnorm(n=30, mean = 0, sd = 1)#normal sample
   probability_samples[[5]][[k]]<- rlnorm(n=30, meanlog = 0, sdlog = 1)#lognormal sample
   probability_samples[[6]][[k]]<- rbinom(n=30,size=10,prob=0.5) #binomial sample
      }
In [184]:
probability_samples[[1]][[1]]#view first exponential sample
  1. 0.293686571996659
  2. 1.33542986396225
  3. 0.236357956611642
  4. 0.725357025819731
  5. 1.65104663736167
  6. 0.0592711525100793
  7. 0.507466313429177
  8. 1.63255846119466
  9. 2.16184980968449
  10. 2.11476305360412
  11. 0.0370244127698243
  12. 2.83840329183226
  13. 0.36908208206296
  14. 0.00660336975001781
  15. 3.18015953136568
  16. 0.336528518293473
  17. 0.625854187645018
  18. 0.243728240409083
  19. 0.186926712482107
  20. 1.56841491083753
  21. 0.38022939208895
  22. 0.788098775798699
  23. 0.127604602372778
  24. 0.510241292882711
  25. 0.731953634838474
  26. 0.235114172101021
  27. 2.92377563549166
  28. 0.114993593761883
  29. 1.31590255904467
  30. 1.07685372269542
In [185]:
probability_samples[[1]][[2]]#view second exponential sample
  1. 0.79205121487649
  2. 2.21534371123186
  3. 3.68173158706582
  4. 0.0958198946900666
  5. 1.80439872099998
  6. 3.03478032333736
  7. 3.97686002928755
  8. 2.01529437481406
  9. 0.0873427314314203
  10. 1.88998164414408
  11. 0.825289959077538
  12. 0.194568308578529
  13. 0.375121364369988
  14. 1.89408541089835
  15. 0.559571533929557
  16. 0.462050467729568
  17. 0.195737620815635
  18. 0.144275017548352
  19. 0.345725294202566
  20. 0.17291892785579
  21. 2.40893994990354
  22. 0.768799647646856
  23. 0.404170591849834
  24. 2.61504943469975
  25. 2.16345442405587
  26. 0.651554889976978
  27. 0.0670320307035385
  28. 1.04556361767919
  29. 0.674178449902683
  30. 0.406346736941487

Finding mean for each exponential sample

In [179]:
#create 6 lists as objects for means of each distribution
means_distribution=vector("list",length=length(names_prob_distn)) 

#Create vector of 0 of size 10000 for each list
for(i in seq_along(names_prob_distn)){ 
means_distribution[[i]]=numeric(nsim)
   }



for(i in seq_along(names_prob_distn)){
    for(k in 1:nsim){
 means_distribution[[i]][k]<- mean(probability_samples[[i]][[k]]) # 30 exponential samples with mean=1
        }
}

Plotting the sampling distribution of means for each probability distribution

In [186]:
par(mfrow=c(3,2),mar=c(4,4,1,0))

for(i in seq_along(names_prob_distn)){
    hist(means_distribution[[i]],col=i,main=paste(names_prob_distn[i]),xlab="Sampling distribution of means")
}

Normal Quantile-Quantile Plot

In [55]:
sample=c(2,9,7,10)
sample
sort(sample)
  1. 2
  2. 9
  3. 7
  4. 10
  1. 2
  2. 7
  3. 9
  4. 10
In [187]:
par(mfrow=c(3,2),mar=c(4,4,1,0))

for(i in seq_along(names_prob_distn)){
qqnorm(means_distribution[[i]], col=i, pch = 1, frame = FALSE,main=paste("Normal Q-Q plot:",names_prob_distn[i]))
qqline(means_distribution[[i]], col = "steelblue", lwd = 3)
        }

Testing for normality

  1. Shapiro-Wilk test (parameteric test for large samples at most 5000: recommended)

  2. Anderson-Darling test (non-parameteric test)

  3. Lilliefors test

  4. Kolmogorov-Smirnov test (non-parameteric test for large samples more than 5000). Cannot consider ties in the data.

  5. Shapiro-Francia test (parameteric test)

  6. Skewness-Kurtosis test (non-parameteric test)

  7. Jarque–Bera test (for time series data)

The hypothesis for the normality test

H0: The data is normal

H1: The data is not normal

NB: The three common procedures in assessing whether a random sample of independent observations of size n come from a population with a normal distribution are: graphical methods (histograms, boxplots, Q-Q-plots), numerical methods (skewness and kurtosis indices) and formal normality tests.

Use Bootstrapping & Jackknife estimation techniques or resampling method to estimate and bound unknown parameters such as population means, population proportions, etc.

NB: Bootstrap samples are simple samples obtained with replacement

Finding the bootstrap and Jackknife means of samples from exponential distribution

Simulating a 100 samples or data from exponential distribution with parameter=2 to represent your population from which bootstrap samples will be obtained

In [56]:
### Finding the bootstrap and Jackknife means of samples from exponential distribution###

#Simulating a 100 samples or data from exponential distribution for instance
set.seed(1)#setting seed number for reproducibility
Exponential_samples<- rexp(n=100,rate=2)

print(Exponential_samples)
  [1] 0.37759092 0.59082139 0.07285336 0.06989763 0.21803431 1.44748427
  [7] 0.61478103 0.26984142 0.47828375 0.07352300 0.69536756 0.38101493
 [13] 0.61880178 2.21196711 0.52727158 0.51762197 0.93801759 0.32737332
 [19] 0.16846674 0.29423986 1.18225763 0.32094629 0.14706019 0.28293276
 [25] 0.05303631 0.02971958 0.28935623 1.97946643 0.58665605 0.49840648
 [31] 0.71764267 0.01863426 0.16200508 0.66023396 0.10175518 0.51136294
 [37] 0.15087047 0.36260715 0.37577135 0.11751373 0.53994057 0.51412345
 [43] 0.64613082 0.62655268 0.27732070 0.15064150 0.64656233 0.49727789
 [49] 0.25708715 1.00391620 0.21112122 1.08938628 1.60889451 0.27891468
 [55] 0.29730883 0.48869790 0.10493329 0.15472393 0.55296813 0.38709388
 [61] 0.04483704 0.55408833 0.12363213 0.78599342 2.41640637 0.21556607
 [67] 1.36519466 0.56841571 0.40668412 0.41850325 0.89238270 1.15623581
 [73] 1.45494363 0.14279549 0.19439339 0.02602772 0.17593525 0.78262067
 [79] 0.40726790 1.37962189 0.19309678 0.50412823 0.40925709 0.02963060
 [85] 1.14192674 0.40208546 0.79184804 0.61689575 0.67282201 1.05018861
 [91] 0.51754857 0.22372595 0.52180426 0.13041412 0.34061455 0.13186913
 [97] 0.22330283 0.10530345 0.06628570 0.17444417
In [57]:
mean(Exponential_samples)
0.515338218028809
In [68]:
x=c(1,4,7,3,8,0)

x

sample(x,size=length(x),replace=TRUE)
  1. 1
  2. 4
  3. 7
  4. 3
  5. 8
  6. 0
  1. 7
  2. 1
  3. 1
  4. 4
  5. 1
  6. 0

Obtaining 10000 different Bootstrap samples of size 100 from the simulated data

In [66]:
Bootstrap_samples=list()

for(i in 1:10000)  Bootstrap_samples[[i]]<- sample(Exponential_samples,size=100, replace=TRUE)
In [71]:
Bootstrap_samples[[1]][1:10] #view first 10 data point of 1st bootstrap sample
  1. 0.175935248942048
  2. 0.340614554006606
  3. 0.504128228386349
  4. 0.717642671686538
  5. 2.41640637243602
  6. 0.131869132013586
  7. 0.131869132013586
  8. 0.660233964656896
  9. 0.289356231689453
  10. 0.938017586204124
In [72]:
Bootstrap_samples[[2]][1:10]  ##view first 10 data point of 2nd bootstrap sample
  1. 1.14192673650788
  2. 1.15623581310021
  3. 0.289356231689453
  4. 1.0039162011397
  5. 0.478283746849227
  6. 0.511362938655025
  7. 0.223302827226042
  8. 0.12363212667013
  9. 1.36519465682315
  10. 0.0186342631932348

Finding means of each Bootstrap samples

In [74]:
means_bootstrap_samples<- numeric(10000)

for (i in seq_along(means_bootstrap_samples))  means_bootstrap_samples[i]<- mean(Bootstrap_samples[[i]])

Plotting the sample distribution of the means from the bootstrap samples.

NB: We are expecting a symmetrical distribution of the means for the Bootstrappping technique to be accurate or good.

In [273]:
hist(means_bootstrap_samples,col="blue",main="Sampling distribution of means from the Bootstrap resampling",xlab="Means",
    ylim=c(0,2000))

abline(v=mean(means_bootstrap_samples),lwd=3,col="red",lty=2)

text(0.58,1800,paste("True mean=",1/2),col="black")

text(0.58,1740,paste("Bootstrap mean=",round(mean(means_bootstrap_samples),4)),col="red")


text(0.61,1690,paste("Average of the actual data=",round(mean(Exponential_samples),4)),col="green")

The bootstrap mean is the overall mean across all the means from the bootstrap samples

In [279]:
paste("The bootstrap mean=",mean(means_bootstrap_samples))

prediction_interval_95=quantile(means_bootstrap_samples, c(0.025,0.975))
paste("The 95% prediction interval:","[" ,prediction_interval_95[1],";",prediction_interval_95[2],"]" )

paste("Bias of bootstrap estimate=",mean(means_bootstrap_samples)-0.5)
'The bootstrap mean= 0.515845736718423'
'The 95% prediction interval: [ 0.429613009767174 ; 0.609286383324578 ]'
'Bias of bootstrap estimate= 0.0158457367184227'

Jackknife estimation of the mean (leave-one-out estimation)

In [281]:
leave_one_out_means=numeric(length(Exponential_samples))

for(i in seq_along(Exponential_samples)){
 leave_one_out_means[i]<- mean(Exponential_samples[-i])   
}


Jackknife_estimate<- mean(leave_one_out_means)

paste("The Jackknife mean=",Jackknife_estimate)

prediction_interval_95=quantile(leave_one_out_means, c(0.025,0.975))
paste("The 95% prediction interval:","[" ,prediction_interval_95[1],";",prediction_interval_95[2],"]" )

paste("Bias of jackknife estimate=",Jackknife_estimate-0.5)
'The Jackknife mean= 0.515338218028809'
'The 95% prediction interval: [ 0.502327040782556 ; 0.5202439286446 ]'
'Bias of jackknife estimate= 0.0153382180288089'
In [277]:
hist(leave_one_out_means,main="Sampling distribution of means from Jackknife resampling",xlab="means",col="blue")

abline(v=Jackknife_estimate,lwd=3,col="yellow",lty=2)

text(0.505,55,paste("True mean=",1/2),col="black")

text(0.505,50,paste("Jackknife mean=",round(Jackknife_estimate,4)),col="yellow")

text(0.505,45,paste("Bootstrap mean=",round(mean(means_bootstrap_samples),4)),col="red")

text(0.505,40,paste("Average of the actual data=",round(mean(Exponential_samples),4)),col="green")