How to compare several variables in the same column

LeBarlou_08 · November 26, 2022, 5:07pm

Hello,

I'm a beginner with RStudio and I'm stuck on an exercise. From an .RData file (an example is in the screenshot), I have to determine whether each athlete has progressed (a boolean answer) and how much time separates each person's two performances in this tibble. There are two performances per athlete, except for some who did not compete twice.
I have to present my answer in another tibble with 3 variables: the athlete's name, the progression (TRUE or FALSE), and the time that separates the performances.

I hope this is clear and that you can help me.

PS: We have only seen the simple functions of dplyr: mutate()/case_when()/filter()/select() and the pipe.

technocrat · November 26, 2022, 11:40pm

See the FAQ: How to do a minimal reproducible example (reprex) for beginners. A reprex keeps potential helpers from being put off by the drag of reverse-engineering the data.

# fake up some data
tib <- data.frame(
  name = c(
    "Denis Rutherford-Sipes", "Denis Rutherford-Sipes",
    "Tishie Stehr", "Tishie Stehr",
    "Diann O'Connell", "Diann O'Connell",
    "Mareli Mertz", "Doll Nader V",
    "Sharyn Casper", "Sharyn Casper"
  ),
  date = structure(
    c(19322, 19323, 19324, 19325, 19326, 19327, 19328, 19329, 19330, 19331),
    class = "Date"
  ),
  points = c(1383L, 1361L, 1455L, 1413L, 1331L, 1381L, 1414L, 1350L, 1466L, 1476L)
)

suppressPackageStartupMessages({
  library(dplyr)
})

# data
tib
#>                      name       date points
#> 1  Denis Rutherford-Sipes 2022-11-26   1383
#> 2  Denis Rutherford-Sipes 2022-11-27   1361
#> 3            Tishie Stehr 2022-11-28   1455
#> 4            Tishie Stehr 2022-11-29   1413
#> 5         Diann O'Connell 2022-11-30   1331
#> 6         Diann O'Connell 2022-12-01   1381
#> 7            Mareli Mertz 2022-12-02   1414
#> 8            Doll Nader V 2022-12-03   1350
#> 9           Sharyn Casper 2022-12-04   1466
#> 10          Sharyn Casper 2022-12-05   1476

# preprocessing
# discard singletons
census <- tib %>% group_by(name) %>% count()
census <- census %>% filter(n == 2)
tib <- left_join(census,tib,by="name")

# arrange by date
tib <- tib %>% arrange(date)

# setup
# create result table
names <- unique(tib$name)
result <- rep(0,length(names))
compared <- data.frame(names,result)

# main
# find index positions of odd and even rows
# odd rows hold the first performance, even rows the second
begin  <- tib[seq_len(nrow(tib)) %% 2 |> as.logical(),]
finish <- tib[!(seq_len(nrow(tib)) %% 2 |> as.logical()),]
# calculate change in points (second performance minus first)
delta <- finish$points - begin$points

# finish table
compared$result <- delta
compared
#>                    names result
#> 1 Denis Rutherford-Sipes    -22
#> 2           Tishie Stehr    -42
#> 3        Diann O'Connell     50
#> 4          Sharyn Casper     10

Created on 2022-11-26 by the reprex package (v2.0.1)

This example is a mix of {dplyr} and {base}. Some things are more convenient to do in one and some things are more convenient in the other. Be flexible and don't get married to tools.

The example isn't as "efficient" as it could be, and that's all right. The purpose is not to get the code to run faster, because it runs faster than an interactive user can notice and it wouldn't be until really large data sets that speed would be a consideration. The inefficiency should help the user better understand what is being done at each step. Data analysis is all about steps, breaking data down to its constituent parts and transforming them. Divide and conquer.

Thinking in terms of school algebra, f(x) = y, where x is what you have, y is what you want, and f is the function or chain of functions that gets you from x to y. You've described x and y well. Let's look at f.

By the statement of the problem, each name has either one or two entries. Because two point scores can't be compared when there is only one, the singletons should be eliminated, which is what the # discard singletons code block does. First, a count of names is made after grouping by name, and that result is overwritten by filtering on the count, n, to include only names that appear twice. The left_join operation then overwrites the tib object by joining census to it, which keeps only the rows whose name appears in census and so drops the single entries.
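The same filtering step can also be sketched as a single grouped pipeline. This is only a minimal alternative, not the approach used in the reprex; toy and pairs are hypothetical names used for illustration, and group_by() with n() goes slightly beyond the functions listed in the question.

```r
library(dplyr)

# hypothetical toy data: "B" appears only once
toy <- data.frame(
  name   = c("A", "A", "B", "C", "C"),
  points = c(10L, 12L, 7L, 5L, 4L)
)

# keep only the names that appear exactly twice
pairs <- toy %>%
  group_by(name) %>%
  filter(n() == 2) %>%
  ungroup()

pairs$name
```

Here filter(n() == 2) is evaluated once per group, so every row of a name with a single entry is dropped in one step.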

Next, we need to ensure that the begin and finish dates are in chronological order, which is what arrange does. Following that, we create a placeholder data frame to hold the results.

The heavy lifting is done by this ugly looking operation

tib[seq_len(nrow(tib)) %% 2 |> as.logical(),]

This looks scarier than it actually is.

We begin with our re-arranged data.frame tib and subset it with the [] square bracket operator. Notice the comma before the closing ] bracket. It is there because of the syntax of the subset operator: what comes before the comma selects rows, and what comes after it selects columns.

object[1]    # first column
object[1,3]  # third column in first row
object[3,]   # all columns in third row
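
To try these forms out, here is a small sketch on a made-up data frame. The name object and its columns a, b and c are assumptions for illustration only.

```r
# hypothetical 3 x 3 data frame for illustration
object <- data.frame(a = 1:3, b = 4:6, c = 7:9)

object[1]                       # first column, returned as a data frame
object[1, 3]                    # value in the first row, third column
object[3, ]                     # all columns of the third row
object[c(TRUE, FALSE, TRUE), ]  # a logical vector keeps the rows marked TRUE
```

The last line is the form the reprex relies on: a logical vector in the row position keeps exactly the rows marked TRUE.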

So we now know that whatever is between the [] brackets will be used to pull rows out of tib. begin pulls out the odd-numbered rows, and finish the even-numbered ones. That's what the %% modulus operator is for: it returns the remainder after division, so a row number %% 2 is 1 for odd rows and 0 for even rows. To get the even rows, the resulting logical vector has to be inverted with !. Prefixing the row numbers with - does not do that, because -c(1, 2, 3) %% 2 is still 1 for the odd numbers and would select the odd rows a second time.
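
The remainder logic is easy to check on a plain vector of row numbers. This is a small sketch; rows and odd are hypothetical names.

```r
rows <- 1:8

rows %% 2                     # remainder after dividing by 2
#> [1] 1 0 1 0 1 0 1 0

odd <- as.logical(rows %% 2)  # 1 becomes TRUE, 0 becomes FALSE
rows[odd]                     # odd positions
#> [1] 1 3 5 7
rows[!odd]                    # even positions
#> [1] 2 4 6 8

-rows %% 2                    # negating the numbers leaves the remainders unchanged
#> [1] 1 0 1 0 1 0 1 0
```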

The vector of remainders is turned into TRUE and FALSE by piping it with |> to as.logical. The end result is that tib is divided into two pieces, the odd rows and the even rows, and we can subtract the odd rows from the even rows to get delta, the change between the first and second point scores. We put that into the result column of the placeholder table compared and we're done.
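
For comparison, here is a hedged sketch of how the three-column answer asked for in the question (name, progression, time between performances) might be built with dplyr alone. It assumes that more points means progression, and it uses group_by() and summarise(), which go beyond the functions listed in the question. The names answer, progression and days_between are made up for illustration.

```r
library(dplyr)

# same fake data as in the reprex above
tib <- data.frame(
  name = c(
    "Denis Rutherford-Sipes", "Denis Rutherford-Sipes",
    "Tishie Stehr", "Tishie Stehr",
    "Diann O'Connell", "Diann O'Connell",
    "Mareli Mertz", "Doll Nader V",
    "Sharyn Casper", "Sharyn Casper"
  ),
  date = as.Date("2022-11-26") + 0:9,
  points = c(1383L, 1361L, 1455L, 1413L, 1331L, 1381L, 1414L, 1350L, 1466L, 1476L)
)

answer <- tib %>%
  group_by(name) %>%
  filter(n() == 2) %>%                 # discard singletons
  arrange(date, .by_group = TRUE) %>%  # chronological order within each athlete
  summarise(
    # assumption: a higher score on the later date counts as progression
    progression  = last(points) > first(points),
    days_between = as.numeric(last(date) - first(date))
  )

answer
```

Sorting within each athlete means this version does not depend on an athlete's two rows sitting next to each other in the data, which the odd/even split does rely on.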

The mantra should always be: I have x, I want y and if I apply f_1 to x I can move one step closer or if I apply f_2 to y I can get closer the other way.

Finally, see the Homework FAQ. I generally provide only hints, rather than solutions. In this case, it's a combination of new users get a break and this assignment seems unreasonably difficult for a beginner. Don't count on solutions as a matter of course.
