Finding the distribution and running a hypothesis test on a data set

How can I find out what type of distribution this data set follows? And how can I run a hypothesis test on it?

sample=c(103.85717,
103.14916,
113.90591,
104.91262,
90.84991,
97.60370,
101.22786,
102.18647,
97.69707,
95.88739,
109.02683,
92.12523,
106.21427,
98.92642,
90.83309,
93.55331,
102.26012,
99.50595,
109.58390,
99.52911,
106.94329,
98.57266,
95.04966,
94.62632,
105.90799,
103.72201,
98.21947,
99.34361,
103.06769,
105.33094)

sample02=c(84.91725,
100.62723,
388.79755,
104.33116,
98.37728,
97.46470,
94.56373,
96.59360,
95.56268,
90.70722,
90.80393,
102.37638,
100.52961,
99.52056,
106.59802,
85.82843,
92.39860,
100.09725,
98.49902,
103.80920,
92.75619,
108.97948,
107.33424,
94.47826,
107.20983,
97.42251,
91.63524,
95.62785,
98.37055,
88.43078)

Hi @user124578,

Here is my attempt to show the distribution of your data, and also do a simple t-test to test the hypothesis of whether these data come from populations with equal means. Hope this is helpful.

sample1 =c(103.85717,
         103.14916,
         113.90591,
         104.91262,
         90.84991,
         97.60370,
         101.22786,
         102.18647,
         97.69707,
         95.88739,
         109.02683,
         92.12523,
         106.21427,
         98.92642,
         90.83309,
         93.55331,
         102.26012,
         99.50595,
         109.58390,
         99.52911,
         106.94329,
         98.57266,
         95.04966,
         94.62632,
         105.90799,
         103.72201,
         98.21947,
         99.34361,
         103.06769,
         105.33094)

sample2 <- c(84.91725,	
          100.62723,
          388.79755,
          104.33116,
          98.37728,
          97.46470,
          94.56373,
          96.59360,
          95.56268,
          90.70722,
          90.80393,
          102.37638,
          100.52961,
          99.52056,
          106.59802,
          85.82843,
          92.39860,
          100.09725,
          98.49902,
          103.80920,
          92.75619,
          108.97948,
          107.33424,
          94.47826,	
          107.20983	,
          97.42251,
          91.63524,	
          95.62785,	
          98.37055,	
          88.43078)
library(ggplot2)
library(dplyr)

n1 <- length(sample1)
n2 <- length(sample2)

data <- tibble(x = c(sample1, sample2),
               gp = factor(c(rep(0, n1), rep(1, n2))))

data %>% 
  ggplot(aes(x, fill = gp)) +
  geom_density()

t.test(x ~ gp, data = data)
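
For the first part of the question (which distribution the data follow), a normal Q-Q plot and a Shapiro-Wilk test are common first checks. The sketch below is only an illustration in base R: it uses the first ten values of each sample from the post to stay short, and the names s1, s2, sw1, sw2 and out2 are made up for the example. Note that the second sample contains the value 388.79755, which sits far from the rest and is worth checking before trusting any test.

```r
# First ten values of each sample from the post (use the full vectors in practice).
s1 <- c(103.85717, 103.14916, 113.90591, 104.91262, 90.84991,
        97.60370, 101.22786, 102.18647, 97.69707, 95.88739)
s2 <- c(84.91725, 100.62723, 388.79755, 104.33116, 98.37728,
        97.46470, 94.56373, 96.59360, 95.56268, 90.70722)

# Normal Q-Q plots: points close to the reference line suggest approximate normality.
op <- par(mfrow = c(1, 2))
qqnorm(s1, main = "s1"); qqline(s1)
qqnorm(s2, main = "s2"); qqline(s2)
par(op)

# Shapiro-Wilk test: the null hypothesis is that the data are normally distributed.
sw1 <- shapiro.test(s1)
sw2 <- shapiro.test(s2)
print(sw1$p.value)
print(sw2$p.value)

# Flag values far from the bulk of the data using the boxplot rule (1.5 * hinge spread).
out2 <- boxplot.stats(s2)$out
print(out2)
```

A small p-value from shapiro.test() is evidence against normality, but with samples of this size a single extreme value can drive the result, so it helps to read the test together with the plots.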

Thanks for this. When I try to plot the graph, x isn't found.

I cannot reproduce your error if you do not share the code you are trying to run.

You may need to load the magrittr package to get access to the pipe (%>%). Try loading the pipe with either library(magrittr) or library(dplyr). Either should work.
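
To illustrate what the pipe does: it passes the object on its left as the first argument of the function on its right. The sketch below uses the base R pipe |>, which needs no package but assumes R 4.1.0 or later; %>% behaves the same way in this simple case once magrittr or dplyr is loaded. The names values, direct and piped are made up for the example.

```r
# Three values taken from the first sample in the post.
values <- c(103.85717, 103.14916, 113.90591)

# An ordinary function call ...
direct <- mean(values)

# ... and the same call written with the base R pipe (R >= 4.1.0).
piped <- values |> mean()

# Both forms compute the same thing.
print(identical(direct, piped))
```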

It works now after loading the libraries. Can you please explain what this code does:

data <- tibble(x = c(sample1, sample2),
               gp = factor(c(rep(0, n1), rep(1, n2))))

It looks like the data are normally distributed.

Glad it worked.

A tibble is very similar to a data.frame, with a few more things going on behind the scenes to make it a little safer to use. Tibbles are often described as lazy and surly, because they don't make decisions for you (lazy) and they produce errors early and often to make you confront data issues (surly).

Inside the tibble, we are creating two variables, x and gp. First, we create x by combining the data from sample1 and sample2 into one vector with c().

Then we create the gp variable, which I am treating as a factor variable (i.e. the assigned values of 0 and 1 are group labels, not meaningful numeric values). See ?factor for more details. To create this group variable we repeat (rep()) the value 0 n1 times and the value 1 n2 times, and then combine the results into a single vector with c() again.

Perhaps the following code is clearer. Here we make the group names more explicit:

data <- tibble(x = c(sample1, sample2),
               gp = factor(c(rep('sample1', times = n1), rep('sample2', times = n2))))
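
To see what rep(), c() and factor() produce at each step, here is a small base R sketch. The group sizes 3 and 2 are hypothetical and chosen only to keep the output short; in the code above they are the lengths of sample1 and sample2. The name labels is made up for the example.

```r
# Hypothetical small group sizes, used only to keep the output short.
n1 <- 3
n2 <- 2

# rep() repeats each label; c() joins the two pieces into one character vector.
labels <- c(rep("sample1", times = n1), rep("sample2", times = n2))
print(labels)

# factor() marks the values as group labels rather than free text.
gp <- factor(labels)
print(levels(gp))  # the distinct group names
print(table(gp))   # number of observations in each group
```

Because x is built in the same order (all of sample1, then all of sample2), each value in x lines up with the label of the sample it came from.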

Thanks! Is there a command in R that would give a numerical summary of both data sets? I am looking for the mean, variance, sd, etc. Thanks

Do you mean a summary of the x variable for each level of the gp variable? If so, here is one way to do it:

library(dplyr)
data %>% 
  group_by(gp) %>% 
  summarise(mean = mean(x),
            variance = var(x),
            sd = sd(x))

A summary for the entire data set (sample 1, sample 2)

Simply remove the group_by statement from the previous code to apply the functions to the entire data set.

data %>% 
  summarise(mean = mean(x),
            variance = var(x),
            sd = sd(x))

Thanks! I would like to summarise each sample individually so that I have a table comparing them side by side. If I try to summarise each sample I get:

Error in UseMethod("summarise_") :
no applicable method for 'summarise_' applied to an object of class "c('double', 'numeric')"

Did you run this code I shared above:

data %>% 
  group_by(gp) %>% 
  summarise(mean = mean(x),
            variance = var(x),
            sd = sd(x))

I think it does achieve what you are describing. Summary statistics for sample1 and sample2 separately.
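
The error quoted above is what summarise() reports when it is handed a plain numeric vector (such as sample1 on its own) instead of a data frame or tibble. If you would rather work with the raw vectors, a base R sketch along these lines builds a side-by-side table. The helper name describe and the names s1, s2 and summary_table are made up for the example, and only the first five values of each sample are used to keep it short.

```r
# First five values of each sample from the post (use the full vectors in practice).
s1 <- c(103.85717, 103.14916, 113.90591, 104.91262, 90.84991)
s2 <- c(84.91725, 100.62723, 388.79755, 104.33116, 98.37728)

# Hypothetical helper: returns a named vector of summary statistics.
describe <- function(v) c(mean = mean(v), variance = var(v), sd = sd(v))

# sapply() applies the helper to each sample and binds the results as columns.
summary_table <- sapply(list(sample1 = s1, sample2 = s2), describe)
print(summary_table)
```

The result is a matrix with one row per statistic and one column per sample, so adding another statistic only means adding one entry to describe.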

Yes, it does. It makes sense now. I was confused by the gp variable.

First of all, plot both samples separately. Knowing nothing about the samples at the start of your analysis, you should not assume that they have anything in common. In particular, running a t-test straight away may lead you to wrong conclusions. Plot a histogram to get a feeling for what the real distribution might be, and plot several histograms of the same sample with different breaks settings. Play with the samples a little, and once you have a sense of them and have found something similar in both data sets, you can start forming hypotheses. You may well find the t-test very useful eventually, but remember that there are several variants of the test, so you should be fully aware of which variant you are using.

Your task may seem trivial, but it is not: there are many methodological errors to be made, and a lack of statistical knowledge may result in wrong conclusions. Your question looks to me like one of the first stages of a deeper analysis, but I don't know whether that is the case. I won't give you any code, because I think you should consider carefully what you are going to do with the data, what your next steps are, and where the assumptions you are making at the moment would lead you if they are wrong.
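
For readers who want to try the suggestions above (several histograms of one sample with different breaks settings, and more than one variant of the test), a minimal base R sketch might look like the following. It is an illustration, not a recommended analysis: it uses only the first ten values of each sample from the question, and the names s1, s2, welch, pooled and ranks are made up for the example.

```r
# First ten values of each sample from the question (use the full vectors in practice).
s1 <- c(103.85717, 103.14916, 113.90591, 104.91262, 90.84991,
        97.60370, 101.22786, 102.18647, 97.69707, 95.88739)
s2 <- c(84.91725, 100.62723, 388.79755, 104.33116, 98.37728,
        97.46470, 94.56373, 96.59360, 95.56268, 90.70722)

# The same sample drawn with different breaks settings.
# A single number is only a suggestion; hist() picks "pretty" break points near it.
op <- par(mfrow = c(1, 3))
for (b in c(3, 5, 10)) {
  hist(s1, breaks = b, main = paste("breaks =", b), xlab = "s1")
}
par(op)

# Three ways of comparing the two samples; they rest on different assumptions.
welch  <- t.test(s1, s2)                    # default: variances not assumed equal
pooled <- t.test(s1, s2, var.equal = TRUE)  # classic test with a pooled variance
ranks  <- wilcox.test(s1, s2)               # rank-based, no normality assumption

print(welch$method)
print(pooled$method)
print(ranks$method)
```

Comparing the p-values of the three results on the full data, with and without the extreme value in the second sample, is one way to see how much the conclusion depends on the chosen variant.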
