# Help with understanding functions

I'm trying to figure out what the two following lines of code are doing. I'm going to guess at what they are doing, but please correct me, as I am getting NA values in my data where I don't expect to.

1. This function takes a cell in the dataframe (x), then finds the median of a given sequence (seq)?
``````e.function <- function(x, seq) tapply(x, seq, median)
``````
1. The following takes the dataframe sdat[,1] (so removes the first column of dataframe sdat), and it goes down the column and applies e.function (I'm not sure what the x and seq are for this function), then returns a value of sdat from the first column? Sorry, this one is really confusing me.
``````  temp <- apply(sdat[,-1], 2, e.function,seq=sdat[, 1])
``````

If anyone can explain to me what is going on exactly for these two functions, that would be really helpful. I'm not great at writing and using functions.

It's harder to answer that question in general than knowing what your data looks like. So I'll cheat as I've read your other question.

`apply()` takes a data frame (or similar), and applies an operation on its rows or columns. Here we use `apply(..., 2, ...)` so we apply the function on its columns. For example:

``````X <- data.frame(x1 = 1:3,
x2 = 4:6)
X
#>   x1 x2
#> 1  1  4
#> 2  2  5
#> 3  3  6

apply(X, 2, min)
#> x1 x2
#>  1  4
apply(X, 2, max)
#> x1 x2
#>  3  6
``````

Created on 2023-07-12 with reprex v2.0.2

So an `apply()` is a way to make a loop. In other words, `apply(X, 2, max)` means "take X, and for each column of X take the max".

Here we have:

``````apply(sdat[,-1], 2, e.function, seq=sdat[, 1])
``````

That can be translated in "Take `sdat[,-1]`, and for each column of `sdat[,-1]` take the function `e.function()`". But, as we'll see in a second, `e.function()` requires two parameters, `x` and `seq`. So, `x` will be each column of `sdat[,-1]`, but we also need to provide `seq`. We can give it as the 4th argument: `seq = sdat[,1]`, that means the first column of `sdat`, which is `Sequence`.

So, what this does is, for each column of `sdat` except the first, pass that column as `x` and the first column as `seq` and apply `e.function()`.

Now let's go to 1. and the definition of `e.function()`. I should say `tapply()` can be used in many ways, and can be very confusing. Here, we have a single case where both of its inputs are a vector (a single column of `sdat`).

`tapply()` takes argument `X`, a data vector, and `INDEX`, a grouping factor. It uses the grouping factor to "split" the data, and applies a function to each of the groups:

``````x <- 1:7
fac <- list(c("a","a","a","a","b","b","b"))

tapply(x, fac, min)
#> a b
#> 1 5
tapply(x, fac, max)
#> a b
#> 4 7
``````

Finally, let's put it back together:

``````e.function <- function(x, seq) tapply(x, seq, median)
temp <- apply(sdat[,-1], 2, e.function, seq=sdat[, 1])
``````

What this does is take `sdat`, and separate the first column which has protein sequences from the other columns which contain data. Then, for each data column, it takes the median by peptide.

1 Like

So it is taking the median down the column of all values that have the same sdat[,1] value? So like it selects the protein group, then finds the median for each column in that group and puts that into the new dataframe temp?

Also, I appreaciate your help on both of these questions. I'm still a bit of a novice at coding, but am trying to get better. I appreciate your patience.

No problem, we've all been there! To be honest, this is not the clearest code there is, it feels that was written by someone used to R who only wanted to get things done quickly, not to make them easy to read for novices.

Yes, but note that they never use the values in `temp`, so it actually doesn't matter. Here is a commented version:

``````dat <- data.frame(Protein.Group.Accessions = c("A","A","A","A",
"B","B","B","B"),
Sequence = c("AAAESMIPTTTGAAK","AAAESMIPTTTGAAK",
"AAAVSNIIR","AAAVSNIIR",
"AAAESMIPTTTGAAK","AAAESMIPTTTGAAK",
"AAAVSNIIR","AAAVSNIIR"),
col1 = 1:8,
col2 = 20:13)
cha <- c("col1", "col2")
e.function <- function(x, seq) tapply(x, seq, median)

dat\$Sequence <- toupper(dat\$Sequence)
accessions <- as.character(unique(dat\$Protein.Group.Accessions))
n.proteins <- length(accessions)
n.cha <- length(cha)

output <- NULL

# for each Accession
for(k in 1:n.proteins){
id <- accessions[k]
# keep only the rows for the first Accession, keep only the Sequence and data columns
sdat <- subset(dat, Protein.Group.Accessions==id)[c("Sequence", cha)]
# log2 of all data
sdat[cha] <- log2(sdat[cha])
# subtract the median of each row
sdat[cha] <- sdat[cha] - apply(sdat[cha], 1, median)
# only the data columns
pdat <- sdat[, -1]
# nb spectra is the number of rows
n.spectra <- ifelse(is.integer(dim(pdat)), nrow(pdat), 1)
# take median by Sequence, note, this is saved in `temp`, but temp is NOT used below
temp <- apply(sdat[,-1], 2, e.function, seq=sdat[, 1])
# n.peptides is the number of different Sequences
n.peptides <- ifelse(is.integer(dim(temp)), nrow(temp), 1)
# take the median of each column
if(is.integer(dim(pdat))) pdat <- apply(pdat, 2, median)
# keep the data columns, add columns for nb peptides and spectra
pdat <- c(pdat, n.peptides=n.peptides, n.spectra=n.spectra)
# for the first protein, `output` is NULL, then each subsequent protein gets added as a new row
output <- rbind(output, pdat)
}
``````

All the `ifelse(is.integer(dim(...))` are to count the number of rows handling the fact it could be 1.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.