Finding the distribution and running a hypothesis test on a data set

user124578 · October 21, 2019, 5:54pm

How can I find out what type of distribution this data set follows? And how can I run a hypothesis test on it?

sample=c(103.85717,
103.14916,
113.90591,
104.91262,
90.84991,
97.60370,
101.22786,
102.18647,
97.69707,
95.88739,
109.02683,
92.12523,
106.21427,
98.92642,
90.83309,
93.55331,
102.26012,
99.50595,
109.58390,
99.52911,
106.94329,
98.57266,
95.04966,
94.62632,
105.90799,
103.72201,
98.21947,
99.34361,
103.06769,
105.33094)

sample02 = c(84.91725,
100.62723,
388.79755,
104.33116,
98.37728,
97.46470,
94.56373,
96.59360,
95.56268,
90.70722,
90.80393,
102.37638,
100.52961,
99.52056,
106.59802,
85.82843,
92.39860,
100.09725,
98.49902,
103.80920,
92.75619,
108.97948,
107.33424,
94.47826,
107.20983,
97.42251,
91.63524,
95.62785,
98.37055,
88.43078)

mattwarkentin · October 21, 2019, 6:04pm

Hi @user124578,

Here is my attempt to show the distribution of your data, and also do a simple t-test to test the hypothesis of whether these data come from populations with equal means. Hope this is helpful.

sample1 =c(103.85717,
         103.14916,
         113.90591,
         104.91262,
         90.84991,
         97.60370,
         101.22786,
         102.18647,
         97.69707,
         95.88739,
         109.02683,
         92.12523,
         106.21427,
         98.92642,
         90.83309,
         93.55331,
         102.26012,
         99.50595,
         109.58390,
         99.52911,
         106.94329,
         98.57266,
         95.04966,
         94.62632,
         105.90799,
         103.72201,
         98.21947,
         99.34361,
         103.06769,
         105.33094)

sample2 <- c(84.91725,	
          100.62723,
          388.79755,
          104.33116,
          98.37728,
          97.46470,
          94.56373,
          96.59360,
          95.56268,
          90.70722,
          90.80393,
          102.37638,
          100.52961,
          99.52056,
          106.59802,
          85.82843,
          92.39860,
          100.09725,
          98.49902,
          103.80920,
          92.75619,
          108.97948,
          107.33424,
          94.47826,
          107.20983,
          97.42251,
          91.63524,
          95.62785,
          98.37055,
          88.43078)

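library(ggplot2)
library(dplyr)   # provides tibble() and the pipe (%>%)

n1 <- length(sample1)
n2 <- length(sample2)

# Stack both samples in one column and label each value with its group
data <- tibble(x = c(sample1, sample2),
               gp = factor(c(rep(0, n1), rep(1, n2))))

# Density plot of x, filled by group
data %>% 
  ggplot(aes(x, fill = gp)) +
  geom_density()

# Two-sample t-test comparing the group means (Welch's test by default)
t.test(x ~ gp, data = data)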
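
If you also want a more formal check of the distribution than the density plot, here is a minimal sketch using base R's shapiro.test() and a normal Q-Q plot. To keep it runnable on its own it uses only the first ten values of sample1; the names x1 and sw are just illustrative and not part of the code above.

```r
# First ten values of sample1 from this thread, repeated so the block runs alone
x1 <- c(103.85717, 103.14916, 113.90591, 104.91262, 90.84991,
        97.60370, 101.22786, 102.18647, 97.69707, 95.88739)

# Shapiro-Wilk test: the null hypothesis is that the data are normally distributed
sw <- shapiro.test(x1)
sw$p.value   # a small p-value (e.g. below 0.05) is evidence against normality

# Normal Q-Q plot: points close to the reference line suggest approximate normality
qqnorm(x1)
qqline(x1)
```

A large p-value does not prove normality; it only means the test found no evidence against it, so it is worth reading alongside the plots.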

user124578 · October 21, 2019, 6:11pm

Thanks for this. When I try to plot the graph, x isn't found.

mattwarkentin · October 21, 2019, 6:13pm

I cannot reproduce your error if you do not share the code you are trying to run.

You may need to load the magrittr package to get access to the pipe (%>%). Try loading the pipe with either library(magrittr) or library(dplyr). Either should work.
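
In case the pipe itself is unfamiliar, here is a small sketch of what it does: x %>% f() is just another way of writing f(x). The vector and the names piped and nested are made up for illustration.

```r
library(magrittr)

x <- c(1, 4, 9)

# Piped form: take x, then take square roots, then sum them
piped <- x %>% sqrt() %>% sum()

# The same computation written as nested calls
nested <- sum(sqrt(x))

identical(piped, nested)
```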

user124578 · October 21, 2019, 6:18pm

It works now after importing the libraries. Can you please explain what this code does:

data <- tibble(x = c(sample1, sample2),
               gp = factor(c(rep(0, n1), rep(1, n2))))

It looks like the data is normally distributed.

mattwarkentin · October 21, 2019, 6:35pm

Glad it worked.

A tibble is very similar to a data.frame, with a few more things going on behind the scenes to make it a little safer to use. Tibbles are often described as lazy and surly, because they don't make decisions for you (lazy) and they produce errors early and often to make you confront data issues (surly).
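
As a rough illustration of that difference, here is a sketch comparing the two on a toy column; df and tb are hypothetical names, and it assumes the tibble package is installed.

```r
library(tibble)

df <- data.frame(value = 1:3)
tb <- tibble(value = 1:3)

# data.frame: `$` partially matches `val` to the column `value`
df$val

# tibble: no partial matching; this returns NULL and issues a warning
tb$val

# Selecting one column with [ , ] drops a data.frame to a plain vector,
# while a tibble stays a tibble
class(df[, "value"])
class(tb[, "value"])
```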

Inside the tibble, we are creating two variables x and gp. First, we create x which is formed by combining (c()) the data for the sample1 and sample2 together.

Then we create the gp variable, which I am treating as a factor variable (i.e. the assigned values of 0 and 1 are group labels, not meaningful numeric values). See ?factor for more details. To create this group variable we repeat (rep()) the values 0 and 1, n1 and n2 times respectively. We then combine the results into a single vector with c() again.

Perhaps the following code is clearer. Here we make the group names more explicit:

data <- tibble(x = c(sample1, sample2),
               gp = factor(c(rep('sample1', times = n1), rep('sample2', times = n2))))
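
To see what rep() and factor() produce on their own, here is a small sketch with toy sizes (2 and 3 instead of n1 and n2); g and g2 are illustrative names only.

```r
# Two labels for the first group, three for the second
g <- factor(c(rep("sample1", times = 2), rep("sample2", times = 3)))
g
levels(g)   # the distinct group labels
table(g)    # how many values fall in each group

# Equivalent, shorter form: rep() accepts one count per element
g2 <- factor(rep(c("sample1", "sample2"), times = c(2, 3)))
identical(g, g2)
```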

user124578 · October 21, 2019, 6:40pm

Thanks! Is there a command in R that would give a numerical summary of both data sets? I am looking to find the mean, variance, sd, etc. Thanks

mattwarkentin · October 21, 2019, 6:53pm

Do you mean a summary of the x variable for each level of the gp variable? If so, here is one way to do it:

library(dplyr)
data %>% 
  group_by(gp) %>% 
  summarise(mean = mean(x),
            variance = var(x),
            sd = sd(x))

user124578 · October 21, 2019, 6:56pm

A summary for the entire data set (sample 1, sample 2)

mattwarkentin · October 21, 2019, 6:57pm

Simply remove the group_by statement from the previous code to apply the functions to the entire data set.

data %>% 
  summarise(mean = mean(x),
            variance = var(x),
            sd = sd(x))

user124578 · October 21, 2019, 7:10pm

Thanks! I would like to be able to summarise each sample individually so I can have a table that compares them side by side. If I try to summarise each sample I get:

Error in UseMethod("summarise_") :
no applicable method for 'summarise_' applied to an object of class "c('double', 'numeric')"

mattwarkentin · October 21, 2019, 7:23pm

Did you run the code I shared above?

data %>% 
  group_by(gp) %>% 
  summarise(mean = mean(x),
            variance = var(x),
            sd = sd(x))

I think it achieves what you are describing: summary statistics for sample1 and sample2 separately.
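
The error message suggests summarise() was given a bare numeric vector rather than a data frame, which it cannot handle. If you would rather work from the raw vectors and get one column per sample, here is a base R sketch. It uses only the first five values of each sample so it runs on its own, and s1, s2, stats and side_by_side are made-up names.

```r
# First five values of each sample from this thread
s1 <- c(103.85717, 103.14916, 113.90591, 104.91262, 90.84991)
s2 <- c(84.91725, 100.62723, 388.79755, 104.33116, 98.37728)

# Helper returning a named vector of summary statistics
stats <- function(v) c(mean = mean(v), variance = var(v), sd = sd(v))

# Apply it to each sample; the result is a matrix with one column per sample
side_by_side <- sapply(list(sample1 = s1, sample2 = s2), stats)
side_by_side
```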

user124578 · October 21, 2019, 7:26pm

Yes, it does. It makes sense now. I was confused by the gp.

olibravo · October 26, 2019, 7:04pm

First of all, plot both samples separately. Knowing nothing about the samples at the beginning of your analysis, you should not assume the two samples have anything in common. In particular, running the t-test straight away may lead you to wrong conclusions. Plot a histogram to get a feeling for what the real distribution might be, and plot several histograms of the same sample with different breaks settings. Play with the samples a little, and once you have some sense of them, look for something similar in both data sets. Then you can form some hypotheses; in particular, you may eventually find the t-test very useful. Please remember there are several variants of the test, so you should be fully conscious of which variant you are going to use.

Your task may seem trivial, but it is not: many methodological errors can be made, and a lack of statistical knowledge may result in wrong conclusions. Your issue seems to me to be one of the first stages of a deeper analysis, but I don't know if that's the case. I won't give you any code, because I think you should consider carefully what you are going to do with the data, what your next steps are, and where the wrong assumptions you are making at the moment would lead you.
