Performing statistical test across rows

ctheisen · September 13, 2022, 8:50pm

So I'm trying to do a wilcox.test across rows for several proteins in the example dataset below

> dput(head(all.data))
structure(list(`Quant View:3404 Proteins in 3219 Clusters<BR>With 26 Decoys` = c("Tripeptidyl-peptidase II OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS1693 PE=3 SV=1", 
"AAA domain-containing protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS15172 PE=4 SV=1", 
"Importin N-terminal domain-containing protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS16570 PE=4 SV=1", 
"Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS8107 PE=4 SV=1", 
"Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS6067 PE=4 SV=1", 
"CBM20 domain-containing protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS16954 PE=3 SV=1"
), `Accession Number` = c("A0A383V887", "A0A383WCV6", "A0A383WGG7", 
"A0A383VTM6", "A0A383VJ69", "A0A383WH22"), `Mann-Whitney Test (p-value)Benjamini-Hochberg (p &lt 0.01200)` = c(0.00011, 
0.00012, 0.00012, 0.00012, 0.00013, 0.00014), R_avg = c(20.2, 
20.45, 20.25, 21.35, 20.25, 20.35), A28d_avg = c(21.25, 21.95, 
21, 20.95, 21, 21.85), R_nolog = c(1204497.52628937, 1432397.0282665, 
1246974.03982109, 2672947.36870384, 1246974.03982109, 1336473.68435192
), A28d_nolog = c(2493948.07964218, 4051430.60815479, 2097152, 
2025715.3040774, 2097152, 3780118.42033046), `A28d/R` = c(2.07052984768276, 
2.82842712474619, 1.68179283050743, 0.757858283255199, 1.68179283050743, 
2.82842712474619), LogFC = c(1.05, 1.5, 0.750000000000002, -0.4, 
0.750000000000002, 1.5)), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

So basically I want to do a wilcox test for each accession number, comparing the value in column "R_avg" and "A28d_avg" and have each of the p-values be placed into a new column so I can make a volcano plot. I tried the two methods below and received errors.

> all.data$p-value<-wilcox.test(R_avg ~ A28d_avg, data=all.data) 
Error in wilcox.test.formula(R_avg ~ A28d_avg, data = all.data) : 
  grouping factor must have exactly 2 levels
> all.data %>% 
+   rowwise() %>% 
+   wilcox.test(R_avg ~ A28d_avg, data=all.data) 
Error in wilcox.test.default(., R_avg ~ A28d_avg, data = all.data) : 
  'x' must be numeric

Any help would be greatly appreciated

FJCC · September 13, 2022, 10:41pm

I am confused about your data. You say you want to do a wilcox test for each accession number, comparing R_avg and A28d_avg. Since each row has a unique accession number, that would mean doing a wilcox test to compare single values in the two levels. Since there is no variation in the populations, the test is not useful. Please explain how I am misunderstanding you.

ctheisen · September 14, 2022, 1:43pm

So the avg columns are from several replicates. I mean I could include the various replicates, but I figured it would be easier to just do it with the average columns. And yes, it is a statistical test for each row, because the expression level of one protein is independent of other proteins, so I need a statistical test for each. My end goal is to create a volcano plot of the data, which is a graph of log(fold change) vs -log(p-value)

FJCC · September 14, 2022, 2:08pm

To use the wilcox.test() function, you need to have the individual replicates. The data should be shaped so that one column contains the measured value and another column holds the R or A28d label.

ctheisen · September 14, 2022, 2:12pm

So the label is the accession number. I want to compare the values of the R (reference) data to the values in the A28d set. If you have a different test to suggest that's fine. But I still need help writing the function so I can get the p-values in a new column and so I can graph them. Maybe a Mann-Whitney test is better, but if you have another suggestion I'm open to it.

FJCC · September 14, 2022, 2:36pm

I did not mean to suggest any particular test. I mentioned the wilcox.test() because you used it in your code. My point is that you cannot use such tests, which compare two populations, on the average of the populations. You need to run your test on the set of individual measurements. Without knowing what those data look like, I cannot suggest specific code for doing that.

nirgrahamuk · September 14, 2022, 2:44pm

I think its a reasonable guess that you got your average column by using group_by/summarise and applying the mean function on it; but you could similarly apply list function on it, that would gather the values into a list rather than averaging them; this would mean that you retain the sample values, and could potentially use them in statistical tests that that require sample values that would make sense to do by group.

Here is a simple example with r's built in iris dataset.



iris |> 
  group_by(Species) |> 
  summarise(across(where(is.numeric),mean)) 

iris |>
  group_by(Species) |> 
  summarise(across(where(is.numeric),list))

ctheisen · September 14, 2022, 2:45pm

Oh okay, I see. So below is the spreadsheet/data I am working with. I have other software that computed the Mann-Whitney test p-value, which is in one of the columns, but the problem is that if it has any p-value less than 0.0001 it just writes "<0.0001" and I want an actual value, which is why I want to do the actual test in RStudio.

> dput(head(all.data[,1:8],10))
structure(list(`Quant View:3404 Proteins in 3219 Clusters<BR>With 26 Decoys` = c("Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS17974 PE=4 SV=1", 
"GCV_T domain-containing protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS17245 PE=4 SV=1", 
"Brix domain-containing protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS15148 PE=4 SV=1", 
"Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS526 PE=3 SV=1", 
"Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS7621 PE=3 SV=1", 
"Pyruvate carboxyltransferase domain-containing protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS357 PE=3 SV=1", 
"CCT-epsilon OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS13778 PE=3 SV=1", 
"Magnesium chelatase OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS7779 PE=4 SV=1", 
"Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS8630 PE=3 SV=1", 
"Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS2038 PE=3 SV=1"
), `Accession Number` = c("A0A383VTX3", "A0A383WIA8", "A0A383WC92", 
"A0A383V6F3", "A0A383VPU2", "A0A383V2H1 (+1)", "A0A383W8E2 (+2)", 
"A0A383VR52", "A0A383VV27", "A0A383VA88"), `Mann-Whitney Test (p-value)Benjamini-Hochberg (p &lt 0.01200)` = c("0.05", 
"0.12", "0.12", "8.9999999999999993E-3", "0.12", "0.05", "8.9999999999999993E-3", 
"< 0.0001", "3.5E-4", "4.2999999999999997E-2"), R_1 = c(18.8, 
21.3, 19.5, 20.4, 20.9, 19.2, 20.6, 20.2, 21, 20.9), R_2 = c(18.6, 
21.3, 19, 20.4, 20.7, 19.1, 20.3, 20.1, 21, 20.6), A28d_1 = c(19, 
20.9, 19.7, 20.3, 21.3, 19.6, 20.4, 20.4, 20.9, 20.4), A28d_2 = c(18.9, 
21.1, 19.7, 20.6, 21, 19.5, 20.7, 20.4, 20.9, 20.5), LogFC = c(2.85, 
2.4, 2.55, 2.45, 2.5, 2.45, 2.45, 2.45, 2.1, 2.2)), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

And what I have manage to get so far is the following:

all.data<-readxl::read_xlsx("MW test with normalized intensity.xlsx", .name_repair = "minimal")
head(all.data)
colnames(all.data)

p_values<-vector("list",nrow(all.data))

for(i in seq_along(1:nrow(all.data))){
  p_values[i]=wilcox.test(all.data[,R_1:R_2],all.data[,A28d_1:"A28d_2"],paired=TRUE,alternative="two.sided",exact=FALSE)$p.value
}

But I get the following error:

Error in `[.tbl_df`(all.data, , R_1:R_2) : object 'R_1' not found

Does this help?

nirgrahamuk · September 14, 2022, 3:36pm

seems like you are trying to use a spreadsheet with calculations that came from the other software you mentioned.
if you want to use R to do equivalent calculations instead, you should be using the data source that you used for that analysis, rather than the output of that analysis (in my opinion)

ctheisen · September 14, 2022, 4:01pm

The raw data is as a .raw file from mass spec data. But what I outputed from the software is a spreadsheet but the data columns are the raw data values. I'm trying to use R because the software didn't do what I needed.

This is what I have so far, but now for some reason R isn't recognizing one of my columns in the data frame:

> dput(head(all.data[,1:8],10))
structure(list(`Quant View:3404 Proteins in 3219 Clusters<BR>With 26 Decoys` = c("Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS17974 PE=4 SV=1", 
"GCV_T domain-containing protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS17245 PE=4 SV=1", 
"Brix domain-containing protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS15148 PE=4 SV=1", 
"Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS526 PE=3 SV=1", 
"Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS7621 PE=3 SV=1", 
"Pyruvate carboxyltransferase domain-containing protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS357 PE=3 SV=1", 
"CCT-epsilon OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS13778 PE=3 SV=1", 
"Magnesium chelatase OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS7779 PE=4 SV=1", 
"Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS8630 PE=3 SV=1", 
"Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS2038 PE=3 SV=1"
), `Accession Number` = c("A0A383VTX3", "A0A383WIA8", "A0A383WC92", 
"A0A383V6F3", "A0A383VPU2", "A0A383V2H1 (+1)", "A0A383W8E2 (+2)", 
"A0A383VR52", "A0A383VV27", "A0A383VA88"), `Mann-Whitney Test (p-value)Benjamini-Hochberg (p &lt 0.01200)` = c("0.05", 
"0.12", "0.12", "8.9999999999999993E-3", "0.12", "0.05", "8.9999999999999993E-3", 
"< 0.0001", "3.5E-4", "4.2999999999999997E-2"), R_1 = c(18.8, 
21.3, 19.5, 20.4, 20.9, 19.2, 20.6, 20.2, 21, 20.9), R_2 = c(18.6, 
21.3, 19, 20.4, 20.7, 19.1, 20.3, 20.1, 21, 20.6), A28d_1 = c(19, 
20.9, 19.7, 20.3, 21.3, 19.6, 20.4, 20.4, 20.9, 20.4), A28d_2 = c(18.9, 
21.1, 19.7, 20.6, 21, 19.5, 20.7, 20.4, 20.9, 20.5), LogFC = c(2.85, 
2.4, 2.55, 2.45, 2.5, 2.45, 2.45, 2.45, 2.1, 2.2)), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

all.data<-readxl::read_xlsx("MW test with normalized intensity.xlsx", .name_repair = "minimal")
head(all.data)
colnames(all.data)

p_values<-vector("list",nrow(all.data))

for(i in seq_along(1:nrow(all.data))){
  p_values[i]=wilcox.test(all.data[,R_1:R_2],all.data[,A28d_1:"A28d_2"],paired=TRUE,alternative="two.sided",exact=FALSE)$p.value
}

Error in `[.tbl_df`(all.data, , R_1:R_2) : object 'R_1' not found

I understand what you are saying, but the data in the data columns is the raw data. The .RAW files are around 20 GB and were converted to a readible format in the software Scaffold, and then I outputted these raw data as an interpretable format. I agree it isn't ideal, but I think it is the simplest way to do it as I am not familiar enough with R to analyze full proteomics/MS datasets in R

nirgrahamuk · September 14, 2022, 4:05pm

the "data columns" , ? do you mean A28d_1 A28d_2 ?
is it these that you averaged together to make A28d_avg ?
Is it only two data points for the examples sake, and there are many more samples in reality , or is it only two samples ?

ctheisen · September 14, 2022, 4:08pm

It is only 2 samples. Again, I know ideally it would be at least 3, but we were limited to the amount of biomass we were sent, so we were unable to do more replicates. Proteomics statistical analysis is rarely ideal. I'm just looking for help on the coding please.

nirgrahamuk · September 14, 2022, 4:23pm

are we theorising that the "Mann-Whitney Test (p-value)Benjamini-Hochberg (p &lt 0.01200)" column you provided, reports a fact about the two values at A28d_1 A28d_2 of each row ? I'm extremely sceptical of that
oh, are those R columns involved? sorry, your layout is a lot to take in ...

ctheisen · September 14, 2022, 4:35pm

So the R_1 and R_2 are the reference columns (they were an initial biomass), and then after treatment I got data for the A28d_1 and A28d_2 samples. I want to perform a statistical test that compares these two groups. I am also skeptical about the Mann-Whitney Test values, which is why I want to do my own test in R. The data in the two treatment groups are the log2 values for the actual intensities, so they are actually very different from each other.

This is how the data is actually formatted. I know there's a lot of information, I apologize for the clumsiness.

> head(all.data)
# A tibble: 6 × 8
  `Quant View:3404 Proteins in 3219 Clusters<BR>With 26 Decoys`                                                       Accession Nu…¹ Mann-…²   R_1   R_2 A28d_1 A28d_2 LogFC
  <chr>                                                                                                               <chr>          <chr>   <dbl> <dbl>  <dbl>  <dbl> <dbl>
1 Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS17974 PE=4 SV=1                              A0A383VTX3     0.05     18.8  18.6   19     18.9  2.85
2 GCV_T domain-containing protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS17245 PE=4 SV=1                      A0A383WIA8     0.12     21.3  21.3   20.9   21.1  2.4 
3 Brix domain-containing protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS15148 PE=4 SV=1                       A0A383WC92     0.12     19.5  19     19.7   19.7  2.55
4 Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS526 PE=3 SV=1                                A0A383V6F3     8.9999…  20.4  20.4   20.3   20.6  2.45
5 Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS7621 PE=3 SV=1                               A0A383VPU2     0.12     20.9  20.7   21.3   21    2.50
6 Pyruvate carboxyltransferase domain-containing protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS357 PE=3 SV=1 A0A383V2H1 (+… 0.05     19.2  19.1   19.6   19.5  2.45

nirgrahamuk · September 14, 2022, 4:39pm

I take no responsibility for this :

a1 <- all.data |> rowwise() |> 
  mutate(mwt = list(wilcox.test(
                    x=c(R_1,R_2),
                    y=c(A28d_1,A28d_2),
                    alternative="less",
                    mu=0.012)),
                    mwt_val = mwt$statistic,
                    mwt_pval = mwt$p.value)

ctheisen · September 15, 2022, 4:42pm

nirgrahamuk:

a1 <- all.data |> rowwise() |> 
  mutate(mwt = list(wilcox.test(
                    x=c(R_1,R_2),
                    y=c(A28d_1,A28d_2),
                    alternative="less",
                    mu=0.012)),
                    mwt_val = mwt$statistic,
                    mwt_pval = mwt$p.value)

It looks like this worked. Thank you.

system · October 27, 2022, 4:43pm

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.