Adding a calculated column in a df

pnwballr · February 27, 2021, 12:40am

I have tried to do this a couple different ways with medium success, but now have a final form I want to do this and am not sure how. I have a df that I want to add a column to with values of 0 and 1 depending on how two other columns compare in that row.

I don't have any code set on how to do this, but what I am trying to do is compare PTS to dPTS and add a column beyond dPTS called W/L.
IF PTS>dPTS put a 1 in the column
IF PTS,dPTS put a 0 in the column

Not a very good reprex at all, but wasn't quite sure how to explain. I check this a lot and will reply quickly to any questions on stuff that's not clear.

ID	Home	Team	Mins	oMins	dMins	oPOSS	dPOSS	ORTG	DRTG	NETRTG	PTS	dPTS	FGA	dFGA

UCLA	0	Seattle U	39	20	19	72	72	72	108	-36	52	78	59	51
Long Beach St.	0	Seattle U	38	22	17	76	77	99	104	-5	75	80	64	59
Washington	0	Seattle U	39	21	18	68	69	60	106	-46	41	73	59	56
Northwest (WA)	1	Seattle U	39	16	23	70	69	127	58	69	89	40	59	58
Col. of Idaho	1	Seattle U	39	19	20	68	70	115	77	38	78	54	63	53
California	0	Seattle U	39	20	19	63	64	103	109	-6	65	70	60	53
Portland	1	Seattle U	40	21	19	79	76	106	89	17	84	68	68	64
Saint Martin's	1	Seattle U	39	18	21	74	73	132	86	46	98	63	60	69
Utah Valley	1	Seattle U	44	23	21	92	90	100	103	-3	92	93	73	57
Dixie St.	0	Seattle U	40	21	19	75	73	101	105	-4	76	77	71	64
Dixie St.	0	Seattle U	39	22	18	78	76	99	74	25	77	56	57	57

kaijabean · February 27, 2021, 1:51am

Hi @pnwballr,
One way to do this would be to use the case_when() function in dplyr, combined with mutate() to add the new column.

So, if your data frame is called df, you might write code like this (note: I'm assuming that your second condition was supposed to say "IF PTS < dPTS"?)

library(dplyr)
df <- df %>%
     mutate(WL = case_when(PTS > dPTS ~ 1, # in this case, set WL to 1
                           PTS < dPTS ~ 0) # in this other case, set it to 0
     )

A couple things to note:

You said you wanted to name the column W/L. I don't think R will allow slashes in column names, though I'm not sure. I've called it WL here, but you could go with W_L or whatever.
case_when() can be a bit tricky because it automatically sets the value for any row that doesn't meet your conditions to NA. I don't know if you would ever have a row where PTS == dPTS; if so, you might want to modify the conditions to <= or >=, or add a third condition for the == case. If you have any NA values in those columns, too, you'll end up with NA values in the resulting column.
case_when() does have an optional argument, confusingly called TRUE, which basically functions like the else part of an if/else statement. I don't think it's super relevant to your case because you're just splitting your data into two parts, but theoretically, you could write something like this:

library(dplyr)
df <- df %>%
     mutate(WL = case_when(PTS > dPTS ~ 1, # in this case, set WL to 1
                           TRUE ~ 0) # else, set it to 0
     )

I hope that's helpful! There are also ways to do this using base R, and I'm aware that this suggestion requires that you load the dplyr package, but if you don't mind doing that, this function is pretty useful.

The documentation for case_when() is here if you want to read more.

Happy coding!

pnwballr · February 27, 2021, 2:34am

Ok thank you so much! It is running without error, but it is not actually creating a new column in my df. Do you know why that is happening or do I need to add the new column in a seperate step?

fdecteamstats=filter(decteamstats, Team == "Seattle U") %>%
  mutate(WL = case_when(PTS > dPTS ~ 1, PTS < dPTS ~ 0))

pnwballr · February 27, 2021, 2:46am

This is the code I am trying to run, but the resulting data is not right. All the data in the new column is getting recorded as WL when i want it to return a 1 or o depending on my conditional. Any suggestions? thank you for your help, love this community

fdecteamstats=filter(decteamstats, Team == "Seattle U") %>%
  add_column(new_col = "WL", .after = "dPTS") %>%
  mutate("WL" = case_when(PTS > dPTS ~ 1, PTS < dPTS ~ 0))

ID	Home	Team	Mins	oMins	dMins	PTS	dPTS	new_col

UCLA	0	Seattle U	39	20	19	52	78	WL
Long Beach St.	0	Seattle U	38	22	17	75	80	WL
Washington	0	Seattle U	39	21	18	41	73	WL
Northwest (WA)	1	Seattle U	39	16	23	89	40	WL
Col. of Idaho	1	Seattle U	39	19	20	78	54	WL
California	0	Seattle U	39	20	19	65	70	WL
Portland	1	Seattle U	40	21	19	84	68	WL
Saint Martin's	1	Seattle U	39	18	21	98	63	WL
Utah Valley	1	Seattle U	44	23	21	92	93	WL
Dixie St.	0	Seattle U	40	21	19	76	77	WL
Dixie St.	0	Seattle U	39	22	18	77	56	WL

lars · February 27, 2021, 7:59am

Your last attempt is definitely not correct, but your previous code is working just fine:

Doing so you get a new dataset fdecteamstats with the column WL appended to it, like so:

library(dplyr)

decteamstats <- tibble::tribble(
                      ~ID, ~Home,       ~Team, ~Mins, ~oMins, ~dMins, ~oPOSS, ~dPOSS, ~ORTG, ~DRTG, ~NETRTG, ~PTS, ~dPTS, ~FGA, ~dFGA,
                   "UCLA",    0L, "Seattle U",   39L,    20L,    19L,    72L,    72L,   72L,  108L,    -36L,  52L,   78L,  59L,   51L,
         "Long Beach St.",    0L, "Seattle U",   38L,    22L,    17L,    76L,    77L,   99L,  104L,     -5L,  75L,   80L,  64L,   59L,
             "Washington",    0L, "Seattle U",   39L,    21L,    18L,    68L,    69L,   60L,  106L,    -46L,  41L,   73L,  59L,   56L,
         "Northwest (WA)",    1L, "Seattle U",   39L,    16L,    23L,    70L,    69L,  127L,   58L,     69L,  89L,   40L,  59L,   58L,
          "Col. of Idaho",    1L, "Seattle U",   39L,    19L,    20L,    68L,    70L,  115L,   77L,     38L,  78L,   54L,  63L,   53L,
             "California",    0L, "Seattle U",   39L,    20L,    19L,    63L,    64L,  103L,  109L,     -6L,  65L,   70L,  60L,   53L,
               "Portland",    1L, "Seattle U",   40L,    21L,    19L,    79L,    76L,  106L,   89L,     17L,  84L,   68L,  68L,   64L,
         "Saint Martin's",    1L, "Seattle U",   39L,    18L,    21L,    74L,    73L,  132L,   86L,     46L,  98L,   63L,  60L,   69L,
            "Utah Valley",    1L, "Seattle U",   44L,    23L,    21L,    92L,    90L,  100L,  103L,     -3L,  92L,   93L,  73L,   57L,
              "Dixie St.",    0L, "Seattle U",   40L,    21L,    19L,    75L,    73L,  101L,  105L,     -4L,  76L,   77L,  71L,   64L,
              "Dixie St.",    0L, "Seattle U",   39L,    22L,    18L,    78L,    76L,   99L,   74L,     25L,  77L,   56L,  57L,   57L
        )

fdecteamstats = filter(decteamstats, Team == "Seattle U") %>%
  mutate(WL = case_when(PTS > dPTS ~ 1, PTS < dPTS ~ 0)) %>% 
  ## place the appended column more to the front
  select(ID:Team, WL, everything())

fdecteamstats
#> # A tibble: 11 x 16
#>    ID     Home Team     WL  Mins oMins dMins oPOSS dPOSS  ORTG  DRTG NETRTG
#>    <chr> <int> <chr> <dbl> <int> <int> <int> <int> <int> <int> <int>  <int>
#>  1 UCLA      0 Seat~     0    39    20    19    72    72    72   108    -36
#>  2 Long~     0 Seat~     0    38    22    17    76    77    99   104     -5
#>  3 Wash~     0 Seat~     0    39    21    18    68    69    60   106    -46
#>  4 Nort~     1 Seat~     1    39    16    23    70    69   127    58     69
#>  5 Col.~     1 Seat~     1    39    19    20    68    70   115    77     38
#>  6 Cali~     0 Seat~     0    39    20    19    63    64   103   109     -6
#>  7 Port~     1 Seat~     1    40    21    19    79    76   106    89     17
#>  8 Sain~     1 Seat~     1    39    18    21    74    73   132    86     46
#>  9 Utah~     1 Seat~     0    44    23    21    92    90   100   103     -3
#> 10 Dixi~     0 Seat~     0    40    21    19    75    73   101   105     -4
#> 11 Dixi~     0 Seat~     1    39    22    18    78    76    99    74     25
#> # ... with 4 more variables: PTS <int>, dPTS <int>, FGA <int>, dFGA <int>

^{Created on 2021-02-27 by the reprex package (v1.0.0)}

kaijabean · February 27, 2021, 6:04pm

Yeah, I agree with lars that I'm not sure why your initial code is not working. But the add_column() part definitely isn't right.

When you write
add_column(new_col = "WL", .after = "dPTS") %>%, you're saying "add a new column called 'new_col' that takes the value 'WL' in all rows". So that's why the new column gets filled with 'WL'.

Does lars's code work for you, or rather, does the first code snippet you included work when you try it again? It definitely should, so if it doesn't then something is weird...

pnwballr · February 27, 2021, 7:20pm

Thank you for the explanation! That makes sense and is now working properly. Being new to R I took the logical approach and figured you needed to create the column first before you could modify that.

kaijabean · March 4, 2021, 12:19am

Glad you got it working! Yeah, the not having to create the column first thing is confusing, but it's common across R. If you wanted to add a new column to a data frame in base R (i.e. not in the tidyverse), you could still just declare it as df$newcolumn <- [values]. When you do that, R checks whether a column by that name already exists. If it does, then that column gets overwritten; if it doesn't, then the column gets created.

system · March 25, 2021, 12:19am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.