Investigating The Exponential Distribution & Compare It With The Central Limit Theorem

Apr 30, 2019

The analysis consists of two parts:

A simulation exercise.
Basic inferential data analysis.

Analysis

1. A simulation exercise.

In this analysis I’ll investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. I’ll investigate the distribution of averages of 40 exponentials. Note that you will need to do a thousand simulations.

Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials. You should

Show the sample mean and compare it to the theoretical mean of the distribution.
Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
Show that the distribution is approximately normal.

Task

# set seed for reproducability
set.seed(31)

# set lambda to 0.2
lambda <- 0.2

# 40 samples
n <- 40

# 1000 simulations
simulations <- 1000

# simulate
simulated_exponentials <- replicate(simulations, rexp(n, lambda))

# calculate mean of exponentials
means_exponentials <- apply(simulated_exponentials, 2, mean)

Question 1

Show where the distribution is centered at and compare it to the theoretical center of the distribution.

analytical_mean <- mean(means_exponentials)
analytical_mean

[1] 4.993867

# analytical mean
theory_mean <- 1/lambda
theory_mean

[1] 5

# visualization
hist(means_exponentials, xlab = "mean", main = "Exponential Function Simulations")
abline(v = analytical_mean, col = "red")
abline(v = theory_mean, col = "orange")

The analytics mean is 4.993867 the theoretical mean 5. The center of distribution of averages of 40 exponentials is very close to the theoretical center of the distribution.

Question 2

Show how variable it is and compare it to the theoretical variance of the distribution..

# standard deviation of distribution
standard_deviation_dist <- sd(means_exponentials)
standard_deviation_dist

[1] 0.7931608

# standard deviation from analytical expression
standard_deviation_theory <- (1/lambda)/sqrt(n)
standard_deviation_theory

[1] 0.7905694

# variance of distribution
variance_dist <- standard_deviation_dist^2
variance_dist

[1] 0.6291041

# variance from analytical expression
variance_theory <- ((1/lambda)*(1/sqrt(n)))^2
variance_theory

[1] 0.625

Standard Deviation of the distribution is 0.7931608 with the theoretical SD calculated as 0.7905694. The Theoretical variance is calculated as ((1 / ??) * (1/???n))² = 0.625. The actual variance of the distribution is 0.6291041

Question 3

Show that the distribution is approximately normal.

xfit <- seq(min(means_exponentials), max(means_exponentials), length=100)
yfit <- dnorm(xfit, mean=1/lambda, sd=(1/lambda/sqrt(n)))
hist(means_exponentials,breaks=n,prob=T,col="orange",xlab = "means",main="Density of means",ylab="density")
lines(xfit, yfit, pch=22, col="black", lty=5)

# compare the distribution of averages of 40 exponentials to a normal distribution
qqnorm(means_exponentials)
qqline(means_exponentials, col = 2)

Due to Due to the central limit theorem (CLT), the distribution of averages of 40 exponentials is very close to a normal distribution.

2. Basic inferential data analysis.

Now in the second portion of this analysis, we’re going to analyze the ToothGrowth data in the R datasets package.

Load the ToothGrowth data and perform some basic exploratory data analyses
Provide a basic summary of the data.
Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose. (Only use the techniques from class, even if there’s other approaches worth considering)
State your conclusions and the assumptions needed for your conclusions.

Load the ToothGrowth data and perform some basic exploratory data analyses

# load the data ToothGrowth
data(ToothGrowth)

# preview the structure of the data
str(ToothGrowth)

'data.frame':   60 obs. of  3 variables:
 $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
 $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
 $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

# preview first 5 rows of the data
head(ToothGrowth, 5)

   len supp dose
1  4.2   VC  0.5
2 11.5   VC  0.5
3  7.3   VC  0.5
4  5.8   VC  0.5
5  6.4   VC  0.5

Provide a basic summary of the data.

# data summary
summary(ToothGrowth)

      len        supp         dose      
 Min.   : 4.20   OJ:30   Min.   :0.500  
 1st Qu.:13.07   VC:30   1st Qu.:0.500  
 Median :19.25           Median :1.000  
 Mean   :18.81           Mean   :1.167  
 3rd Qu.:25.27           3rd Qu.:2.000  
 Max.   :33.90           Max.   :2.000

# compare means of the different delivery methods
tapply(ToothGrowth$len,ToothGrowth$supp, mean)

      OJ       VC 
20.66333 16.96333

# plot data graphically
ggplot(ToothGrowth, aes(factor(dose), len, fill = factor(dose))) +
      geom_boxplot() +
      # facet_grid(.~supp)+
      facet_grid(.~supp, labeller = as_labeller(
            c("OJ" = "Orange juice", 
              "VC" = "Ascorbic Acid"))) +
      labs(title = "Tooth growth of 60 guinea pigs by dosage and\nby delivery method of vitamin C",
           x = "Dose in milligrams/day", 
           y = "Tooth Lengh") +
      scale_fill_discrete(name = "Dosage of\nvitamin C\nin mg/day") +
  theme_classic()

Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose.

# comparison by delivery method for the same dosage
t05 <- t.test(len ~ supp, 
       data = rbind(ToothGrowth[(ToothGrowth$dose == 0.5) & 
                                      (ToothGrowth$supp == "OJ"),],
                    ToothGrowth[(ToothGrowth$dose == 0.5) & 
                                      (ToothGrowth$supp == "VC"),]), 
       var.equal = FALSE)

t1 <- t.test(len ~ supp, 
       data = rbind(ToothGrowth[(ToothGrowth$dose == 1) & 
                                      (ToothGrowth$supp == "OJ"),],
                    ToothGrowth[(ToothGrowth$dose == 1) & 
                                      (ToothGrowth$supp == "VC"),]), 
       var.equal = FALSE)

t2 <- t.test(len ~ supp, 
       data = rbind(ToothGrowth[(ToothGrowth$dose == 2) & 
                                      (ToothGrowth$supp == "OJ"),],
                    ToothGrowth[(ToothGrowth$dose == 2) & 
                                      (ToothGrowth$supp == "VC"),]), 
       var.equal = FALSE)

# summary of the conducted t.tests, which compare the delivery methods by dosage,
# take p-values and CI
summaryBYsupp <- data.frame(
      "p-value" = c(t05$p.value, t1$p.value, t2$p.value),
      "Conf.Low" = c(t05$conf.int[1],t1$conf.int[1], t2$conf.int[1]),
      "Conf.High" = c(t05$conf.int[2],t1$conf.int[2], t2$conf.int[2]),
      row.names = c("Dosage .05","Dosage 1","Dosage 2"))

# show data table 
summaryBYsupp

               p.value  Conf.Low Conf.High
Dosage .05 0.006358607  1.719057  8.780943
Dosage 1   0.001038376  2.802148  9.057852
Dosage 2   0.963851589 -3.798070  3.638070

Conclusion

With 95% confidence we reject the null hypothesis, stating that there is no difference in the tooth growth by the delivery method for .5 and 1 milligrams/day. We observe p-values less than the treshold of .05 and the confidence levels don’t include 0. So, for dosage of .5 milligrams/day and 1 milligrams/day does matter the delivery method. With 95% confidence we fail to reject the null hypothesis, stating that there is no difference in the tooth growth by the delivery method for 2 milligrams/day. We observe p-values more than the treshold of .05 and the confidence levels include 0. So, for dosage of 2 milligrams/day the delivery method doesn’t matter.

R Data Analytics

Rihad Variawa

Data Scientist

I am the Sr. Data Scientist at Malastare AI and head of global Fintech Research, responsible for overall vision and strategy, investment priorities and offering development. Working in the financial services industry, helping clients adopt new technologies that can transform the way they transact and engage with their customers. I am passionate about data science, super inquisitive and challenge seeker; looking at everything through a lens of numbers and problem-solver at the core. From understanding a business problem to collecting and visualizing data, until the stage of prototyping, fine-tuning and deploying models to real-world applications, I find the fulfillment of tackling challenges to solve complex problems using data.