Performing statistical test across rows

So I'm trying to do a wilcox.test across rows for several proteins in the example dataset below

> dput(head(all.data))
structure(list(`Quant View:3404 Proteins in 3219 Clusters<BR>With 26 Decoys` = c("Tripeptidyl-peptidase II OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS1693 PE=3 SV=1", 
"AAA domain-containing protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS15172 PE=4 SV=1", 
"Importin N-terminal domain-containing protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS16570 PE=4 SV=1", 
"Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS8107 PE=4 SV=1", 
"Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS6067 PE=4 SV=1", 
"CBM20 domain-containing protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS16954 PE=3 SV=1"
), `Accession Number` = c("A0A383V887", "A0A383WCV6", "A0A383WGG7", 
"A0A383VTM6", "A0A383VJ69", "A0A383WH22"), `Mann-Whitney Test (p-value)Benjamini-Hochberg (p &lt 0.01200)` = c(0.00011, 
0.00012, 0.00012, 0.00012, 0.00013, 0.00014), R_avg = c(20.2, 
20.45, 20.25, 21.35, 20.25, 20.35), A28d_avg = c(21.25, 21.95, 
21, 20.95, 21, 21.85), R_nolog = c(1204497.52628937, 1432397.0282665, 
1246974.03982109, 2672947.36870384, 1246974.03982109, 1336473.68435192
), A28d_nolog = c(2493948.07964218, 4051430.60815479, 2097152, 
2025715.3040774, 2097152, 3780118.42033046), `A28d/R` = c(2.07052984768276, 
2.82842712474619, 1.68179283050743, 0.757858283255199, 1.68179283050743, 
2.82842712474619), LogFC = c(1.05, 1.5, 0.750000000000002, -0.4, 
0.750000000000002, 1.5)), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

So basically I want to do a wilcox test for each accession number, comparing the value in column "R_avg" and "A28d_avg" and have each of the p-values be placed into a new column so I can make a volcano plot. I tried the two methods below and received errors.

> all.data$p-value<-wilcox.test(R_avg ~ A28d_avg, data=all.data) 
Error in wilcox.test.formula(R_avg ~ A28d_avg, data = all.data) : 
  grouping factor must have exactly 2 levels
> all.data %>% 
+   rowwise() %>% 
+   wilcox.test(R_avg ~ A28d_avg, data=all.data) 
Error in wilcox.test.default(., R_avg ~ A28d_avg, data = all.data) : 
  'x' must be numeric

Any help would be greatly appreciated

I am confused about your data. You say you want to do a wilcox test for each accession number, comparing R_avg and A28d_avg. Since each row has a unique accession number, that would mean doing a wilcox test to compare single values in the two levels. Since there is no variation in the populations, the test is not useful. Please explain how I am misunderstanding you.

So the avg columns are from several replicates. I mean I could include the various replicates, but I figured it would be easier to just do it with the average columns. And yes, it is a statistical test for each row, because the expression level of one protein is independent of other proteins, so I need a statistical test for each. My end goal is to create a volcano plot of the data, which is a graph of log(fold change) vs -log(p-value)

To use the wilcox.test() function, you need to have the individual replicates. The data should be shaped so that one column contains the measured value and another column holds the R or A28d label.

So the label is the accession number. I want to compare the values of the R (reference) data to the values in the A28d set. If you have a different test to suggest that's fine. But I still need help writing the function so I can get the p-values in a new column and so I can graph them. Maybe a Mann-Whitney test is better, but if you have another suggestion I'm open to it.

I did not mean to suggest any particular test. I mentioned the wilcox.test() because you used it in your code. My point is that you cannot use such tests, which compare two populations, on the average of the populations. You need to run your test on the set of individual measurements. Without knowing what those data look like, I cannot suggest specific code for doing that.

I think its a reasonable guess that you got your average column by using group_by/summarise and applying the mean function on it; but you could similarly apply list function on it, that would gather the values into a list rather than averaging them; this would mean that you retain the sample values, and could potentially use them in statistical tests that that require sample values that would make sense to do by group.

Here is a simple example with r's built in iris dataset.



iris |> 
  group_by(Species) |> 
  summarise(across(where(is.numeric),mean)) 

iris |>
  group_by(Species) |> 
  summarise(across(where(is.numeric),list))
1 Like

Oh okay, I see. So below is the spreadsheet/data I am working with. I have other software that computed the Mann-Whitney test p-value, which is in one of the columns, but the problem is that if it has any p-value less than 0.0001 it just writes "<0.0001" and I want an actual value, which is why I want to do the actual test in RStudio.

> dput(head(all.data[,1:8],10))
structure(list(`Quant View:3404 Proteins in 3219 Clusters<BR>With 26 Decoys` = c("Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS17974 PE=4 SV=1", 
"GCV_T domain-containing protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS17245 PE=4 SV=1", 
"Brix domain-containing protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS15148 PE=4 SV=1", 
"Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS526 PE=3 SV=1", 
"Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS7621 PE=3 SV=1", 
"Pyruvate carboxyltransferase domain-containing protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS357 PE=3 SV=1", 
"CCT-epsilon OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS13778 PE=3 SV=1", 
"Magnesium chelatase OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS7779 PE=4 SV=1", 
"Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS8630 PE=3 SV=1", 
"Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS2038 PE=3 SV=1"
), `Accession Number` = c("A0A383VTX3", "A0A383WIA8", "A0A383WC92", 
"A0A383V6F3", "A0A383VPU2", "A0A383V2H1 (+1)", "A0A383W8E2 (+2)", 
"A0A383VR52", "A0A383VV27", "A0A383VA88"), `Mann-Whitney Test (p-value)Benjamini-Hochberg (p &lt 0.01200)` = c("0.05", 
"0.12", "0.12", "8.9999999999999993E-3", "0.12", "0.05", "8.9999999999999993E-3", 
"< 0.0001", "3.5E-4", "4.2999999999999997E-2"), R_1 = c(18.8, 
21.3, 19.5, 20.4, 20.9, 19.2, 20.6, 20.2, 21, 20.9), R_2 = c(18.6, 
21.3, 19, 20.4, 20.7, 19.1, 20.3, 20.1, 21, 20.6), A28d_1 = c(19, 
20.9, 19.7, 20.3, 21.3, 19.6, 20.4, 20.4, 20.9, 20.4), A28d_2 = c(18.9, 
21.1, 19.7, 20.6, 21, 19.5, 20.7, 20.4, 20.9, 20.5), LogFC = c(2.85, 
2.4, 2.55, 2.45, 2.5, 2.45, 2.45, 2.45, 2.1, 2.2)), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

And what I have manage to get so far is the following:

all.data<-readxl::read_xlsx("MW test with normalized intensity.xlsx", .name_repair = "minimal")
head(all.data)
colnames(all.data)

p_values<-vector("list",nrow(all.data))

for(i in seq_along(1:nrow(all.data))){
  p_values[i]=wilcox.test(all.data[,R_1:R_2],all.data[,A28d_1:"A28d_2"],paired=TRUE,alternative="two.sided",exact=FALSE)$p.value
}

But I get the following error:

Error in `[.tbl_df`(all.data, , R_1:R_2) : object 'R_1' not found

Does this help?

seems like you are trying to use a spreadsheet with calculations that came from the other software you mentioned.
if you want to use R to do equivalent calculations instead, you should be using the data source that you used for that analysis, rather than the output of that analysis (in my opinion)

The raw data is as a .raw file from mass spec data. But what I outputed from the software is a spreadsheet but the data columns are the raw data values. I'm trying to use R because the software didn't do what I needed.

This is what I have so far, but now for some reason R isn't recognizing one of my columns in the data frame:

> dput(head(all.data[,1:8],10))
structure(list(`Quant View:3404 Proteins in 3219 Clusters<BR>With 26 Decoys` = c("Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS17974 PE=4 SV=1", 
"GCV_T domain-containing protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS17245 PE=4 SV=1", 
"Brix domain-containing protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS15148 PE=4 SV=1", 
"Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS526 PE=3 SV=1", 
"Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS7621 PE=3 SV=1", 
"Pyruvate carboxyltransferase domain-containing protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS357 PE=3 SV=1", 
"CCT-epsilon OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS13778 PE=3 SV=1", 
"Magnesium chelatase OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS7779 PE=4 SV=1", 
"Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS8630 PE=3 SV=1", 
"Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS2038 PE=3 SV=1"
), `Accession Number` = c("A0A383VTX3", "A0A383WIA8", "A0A383WC92", 
"A0A383V6F3", "A0A383VPU2", "A0A383V2H1 (+1)", "A0A383W8E2 (+2)", 
"A0A383VR52", "A0A383VV27", "A0A383VA88"), `Mann-Whitney Test (p-value)Benjamini-Hochberg (p &lt 0.01200)` = c("0.05", 
"0.12", "0.12", "8.9999999999999993E-3", "0.12", "0.05", "8.9999999999999993E-3", 
"< 0.0001", "3.5E-4", "4.2999999999999997E-2"), R_1 = c(18.8, 
21.3, 19.5, 20.4, 20.9, 19.2, 20.6, 20.2, 21, 20.9), R_2 = c(18.6, 
21.3, 19, 20.4, 20.7, 19.1, 20.3, 20.1, 21, 20.6), A28d_1 = c(19, 
20.9, 19.7, 20.3, 21.3, 19.6, 20.4, 20.4, 20.9, 20.4), A28d_2 = c(18.9, 
21.1, 19.7, 20.6, 21, 19.5, 20.7, 20.4, 20.9, 20.5), LogFC = c(2.85, 
2.4, 2.55, 2.45, 2.5, 2.45, 2.45, 2.45, 2.1, 2.2)), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

all.data<-readxl::read_xlsx("MW test with normalized intensity.xlsx", .name_repair = "minimal")
head(all.data)
colnames(all.data)

p_values<-vector("list",nrow(all.data))

for(i in seq_along(1:nrow(all.data))){
  p_values[i]=wilcox.test(all.data[,R_1:R_2],all.data[,A28d_1:"A28d_2"],paired=TRUE,alternative="two.sided",exact=FALSE)$p.value
}

Error in `[.tbl_df`(all.data, , R_1:R_2) : object 'R_1' not found

I understand what you are saying, but the data in the data columns is the raw data. The .RAW files are around 20 GB and were converted to a readible format in the software Scaffold, and then I outputted these raw data as an interpretable format. I agree it isn't ideal, but I think it is the simplest way to do it as I am not familiar enough with R to analyze full proteomics/MS datasets in R

the "data columns" , ? do you mean A28d_1 A28d_2 ?
is it these that you averaged together to make A28d_avg ?
Is it only two data points for the examples sake, and there are many more samples in reality , or is it only two samples ?

It is only 2 samples. Again, I know ideally it would be at least 3, but we were limited to the amount of biomass we were sent, so we were unable to do more replicates. Proteomics statistical analysis is rarely ideal. I'm just looking for help on the coding please.

are we theorising that the "Mann-Whitney Test (p-value)Benjamini-Hochberg (p &lt 0.01200)" column you provided, reports a fact about the two values at A28d_1 A28d_2 of each row ? I'm extremely sceptical of that
oh, are those R columns involved? sorry, your layout is a lot to take in ...

So the R_1 and R_2 are the reference columns (they were an initial biomass), and then after treatment I got data for the A28d_1 and A28d_2 samples. I want to perform a statistical test that compares these two groups. I am also skeptical about the Mann-Whitney Test values, which is why I want to do my own test in R. The data in the two treatment groups are the log2 values for the actual intensities, so they are actually very different from each other.

This is how the data is actually formatted. I know there's a lot of information, I apologize for the clumsiness.

> head(all.data)
# A tibble: 6 × 8
  `Quant View:3404 Proteins in 3219 Clusters<BR>With 26 Decoys`                                                       Accession Nu…¹ Mann-…²   R_1   R_2 A28d_1 A28d_2 LogFC
  <chr>                                                                                                               <chr>          <chr>   <dbl> <dbl>  <dbl>  <dbl> <dbl>
1 Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS17974 PE=4 SV=1                              A0A383VTX3     0.05     18.8  18.6   19     18.9  2.85
2 GCV_T domain-containing protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS17245 PE=4 SV=1                      A0A383WIA8     0.12     21.3  21.3   20.9   21.1  2.4 
3 Brix domain-containing protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS15148 PE=4 SV=1                       A0A383WC92     0.12     19.5  19     19.7   19.7  2.55
4 Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS526 PE=3 SV=1                                A0A383V6F3     8.9999…  20.4  20.4   20.3   20.6  2.45
5 Uncharacterized protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS7621 PE=3 SV=1                               A0A383VPU2     0.12     20.9  20.7   21.3   21    2.50
6 Pyruvate carboxyltransferase domain-containing protein OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS357 PE=3 SV=1 A0A383V2H1 (+… 0.05     19.2  19.1   19.6   19.5  2.45

I take no responsibility for this :

a1 <- all.data |> rowwise() |> 
  mutate(mwt = list(wilcox.test(
                    x=c(R_1,R_2),
                    y=c(A28d_1,A28d_2),
                    alternative="less",
                    mu=0.012)),
                    mwt_val = mwt$statistic,
                    mwt_pval = mwt$p.value)

It looks like this worked. Thank you.