Is it possible to automate this code section?

Context: I wonder if I could rewrite the code below to identify where the counts in the set reach double figures, and then remove from single-digits down. I am trying to make my code generalisable to other data.
Question: Might anyone know (a) if this is possible and (b) how to begin this process?

# remove years before the collocate count reaches double figures (here, > 1974) 
# Note: you may choose not to make this judgment call depending on your data
word_count2 <- word_count %>% filter(year >= 1974) # filter >= 1974 for word_count
df_word2 <- df_word %>% filter(year >= 1974) # filter >= 1974 for tibble with Warriner ratings

NOTE: The word_count dataset looks like this:

year <- c(1970, 1971, 1972, 1973, 1974, 1975)
count <- c(16, 37, 31, 9, 39, 27)
df_word <- table(year,count)

If I understand correctly, you'd like to have a cumulative count of words, then only keep the rows where the cumulative count is 100 or more.

There is a base R function called cumsum() which calculates a cumulative sum of a vector.
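For example, on a small vector:

cumsum(c(16, 37, 31, 9)) # returns 16 53 84 93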

I don't work with tables often, but here is a dplyr/tibble solution.

year <- c(1970, 1971, 1972, 1973, 1974, 1975)
count <- c(16, 37, 31, 9, 39, 27)
data <- tibble(year,count)

data <- data %>%
  mutate(cumulative.count = cumsum(count))


data <- data %>% 
  filter(cumulative.count >= 100)

You could combine these steps with the %>% (magrittr pipe).
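For example, the whole thing could be written as one pipeline (a minimal sketch using the same toy data as above):

library(dplyr)

data <- tibble(year, count) %>%
  mutate(cumulative.count = cumsum(count)) %>% # add a running total of count
  filter(cumulative.count >= 100)              # keep rows once the total reaches 100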

The result looks like this.

# A tibble: 2 x 3
   year count cumulative.count
  <dbl> <dbl>            <dbl>
1  1974    39              132
2  1975    27              159

I hope I understood correctly and that this helps.

What is df_word? Please paste a representative dataset.

I have updated the code appropriately above, but pasted it below too:

year <- c(1970, 1971, 1972, 1973, 1974, 1975)
count <- c(16, 37, 31, 9, 39, 27)
df_word <- table(year,count)

Hi @beginnersluke, thanks very much! I didn't explain it well enough before: I would like to keep only the rows where the count is 10 or more. No cumulative count is needed, as I have already summed the frequencies to create df_word. Now I want to automate the process of removing any rows that come before the counts reach 10 (e.g., 16, 37, 31 from df_word would be removed).
Perhaps it is too complicated and I should keep the manual process of filtering and checking instead.

Okay, so the table you posted above looks like this:

> df_word
      count
year   9 16 27 31 37 39
  1970 0  1  0  0  0  0
  1971 0  0  0  0  1  0
  1972 0  0  0  1  0  0
  1973 1  0  0  0  0  0
  1974 0  0  0  0  0  1
  1975 0  0  1  0  0  0

Correct me again if I'm wrong: your goal is to have this table without the '9' column and the '1973' row, and you'd like code that would eliminate any other column/row combinations where the count is less than 10.

If that's right, I think you could accomplish this by subsetting the data first.
(subset() returns the elements in your data that meet a condition.)

# Your data
year <- c(1970, 1971, 1972, 1973, 1974, 1975)
count <- c(16, 37, 31, 9, 39, 27)

# Keep only the subsets of year and count where count >= 10
# Notice that to subset year, you're still asking R to evaluate the elements in count
# (the condition is count >= 10). If element 1 of count is >= 10, it will return element 1 of year.
# (For this reason, it's important to do the subsetting below in the correct order -- year first in this case.)
year <- subset(year, count >= 10)
count <- subset(count, count >= 10)
# You now have the original data minus any year/count combination where count was less than 10.

# Generate your data table
df_word <- table(year,count)
df_word

Working with data tables is far from my expertise, so hopefully we can get some confirmation from someone who works with this type of data often, but as far as I can tell, this should do exactly what you're looking for.

Hope that at least helps a bit,
Luke

Sorry to drag this out, @beginnersluke. I realise now that the table() function does something different to what I thought. The data frame looks as in the image attached.

Given that year is one column, and count is another column, is it still appropriate to use the subset function?

This is how I generated the table and filtered for years >= 1974:

# aggregate repetitions of matched lemmas by year 
word_count <- df_word %>% group_by(year) %>% summarise(count = n_distinct(lemma)) %>% ungroup()
head(word_count) # show df

# remove years before the collocate count reaches double figures (here, > 1974) 
# Note: you may choose not to make this judgment call depending on your data
word_count2 <- word_count %>% filter(year >= 1974) # filter >= 1974 for word_count
df_word2 <- df_word %>% filter(year >= 1974) # filter >= 1974 for tibble with Warriner ratings

df_word

No worries. I'm happy to help if I can.

I have to say though, I'm a little confused.

Your code has the note: "remove years before the collocate count reaches double figures (here, > 1974)", but 1970 had a double-figure count (16), as did 1971 (37).

Likewise, in an earlier message you said "I would like to only keep the rows where the count is 10 or more." but then say "Now, I want to automate the process of removing any rows which come before the rows reach 10 (e.g., 16, 37, 31 from df_word would be removed)."

So is the following correct: You want to find the point at which the count goes below 10 then returns to 10-or-more and filter out everything that comes before?

It may be simplest to think of this as a chart.

Are you trying to only keep the data that comes after the last time the red line is crossed in an ascending direction (going up)?

So if 1969 had a count of 7, you would then keep everything after 1969 (1970 on)?

To answer your broader question... you could use subset on a tibble/dataframe, but if you're working with a tibble, most people would just use filter.
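For example, these two lines are equivalent (a quick sketch on toy year/count data):

library(dplyr)

data <- tibble(year = c(1970, 1971, 1972, 1973), count = c(16, 37, 9, 39))
subset(data, count >= 10)    # base R
data %>% filter(count >= 10) # dplyr equivalent, same result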

If you are trying to do something like "identify the most recent year in which Count is in single digits" so that you can only keep everything after it, a simple way would be to first identify that year (and put it in a variable) then use that variable to filter.

For example:

# Find the most recent (max) year in which 'count' is less than 10.
VeryImportantYear <- max(df_word$year[df_word$count < 10])

# Just as you can say df_word$year[3] to see row 3 of the 'year' column of df_word,
# you can ask for the rows where count is < 10, which is what the above does.

# Then use VeryImportantYear as your filter criteria.
df_word <- df_word %>%
  filter(year > VeryImportantYear)

I hope that helps. Feel free to clarify if I didn't understand something.

Thanks and have a great day!
Luke


Thank you so much, Luke! This makes complete sense and I really like how simple the code is. The max function is very useful in this circumstance and seems to do exactly what I was after. Thank you!

Because you have been so helpful and explain things so well, I would so appreciate your opinion on whether this chunk of code can also be simplified. I sense that it can but am unsure how to proceed. Please know I expect no response, but would really appreciate one if you could take a look:

Context: Here, I am attempting to compute the main outcome variable sev_word and the weighted average for arousal (aro_word) and valence (val_word) components of the word severity index.

# Step 1: Weighted Sum. Sum of (the repetition for each word * sum of severity ratings) per year

  df_word2 <- df_word2 %>% mutate(VA_prod=(repet*VA_mean_sum)) # prep: product of repeats*mean V+A rat
  sumVAprod_word <- aggregate(df_word2[, 8], list(df_word2$year), sum) %>% rename(c("year"="Group.1", "sumVAprod_word"="x"))  # sum VA_prod by year
  
  df_word2 <- df_word2 %>% mutate(A_prod=(repet*A_mean_sum)) # prep: product of repeats*mean A rat
  sumAprod_word <- aggregate(df_word2[, 9], list(df_word2$year), sum) %>% rename(c("year"="Group.1", "sumAprod_word"="x"))    # sum A_prod by year
  
  df_word2 <- df_word2 %>% mutate(V_prod=(repet*V_mean_sum_r)) # prep: product of repeats*mean V rat
  sumVprod_word <- aggregate(df_word2[, 10], list(df_word2$year), sum) %>% rename(c("year"="Group.1", "sumVprod_word"="x"))   # sum V_prod by year

Please let me know if/how I can clarify anything!

That's very nice of you to say. Thank you.

I think what you have is fine.

You asked about making it more efficient, and I'm not sure if you mean this as "the computer will process the data more quickly" or "the code will have fewer words and look neater", but I think you mean the latter.

The first thing I noticed in your stats::aggregate calls is that you group by year each time. I would use group_by() here and do this in one chunk of code. I think it would make things easier to read and more concise.

So you could create your 3 new variables in one chunk:

df_word2 <- df_word2 %>%
  mutate(VA_prod=(repet*VA_mean_sum),
         A_prod=(repet*A_mean_sum),
         V_prod=(repet*V_mean_sum_r)) 

(Personally, I actually write 3 separate mutate statements separated by the pipe (%>%) instead of one mutate statement with commas. I find this easier to read, but I think this is unusual.)
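In case it's useful to see, that alternative style would look like this (same result as the single mutate() above):

df_word2 <- df_word2 %>%
  mutate(VA_prod = repet * VA_mean_sum) %>% # product of repeats * mean V+A rating
  mutate(A_prod = repet * A_mean_sum) %>%   # product of repeats * mean A rating
  mutate(V_prod = repet * V_mean_sum_r)     # product of repeats * mean V rating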

Now you can use group_by and create your summary variables:

df_word2 <- df_word2 %>%
  group_by(year) %>% 
  mutate(sumVAprod_word = sum(VA_prod),
         sumAprod_word = sum(A_prod),
         sumVprod_word = sum(V_prod)) %>%
  ungroup()

You can read the documentation on group_by(), but in short: it groups by the chosen variable or variables, so your mutate functions are done for each group (so the sum for each group in your case).

I don't have your df_word2 dataframe, so I can't really test this, but what I have above should work and get you what you want.


If you do want to use stats::aggregate, the only suggestion I'd have is to use the actual column name instead of the number. This is for readability, but it also makes your code more resilient should a column be added in the future. So aggregate(df_word2[, "VA_prod"]... instead of aggregate(df_word2[, 8],....
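For your first aggregate call, that would look something like this (a sketch, assuming column 8 is the VA_prod column created just above, with the rest of the pipeline unchanged):

# same call as before, but referring to the column by name instead of position
sumVAprod_word <- aggregate(df_word2[, "VA_prod"], list(df_word2$year), sum)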

I hope that helps. Good luck with your work!

Luke


dplyr features across(), which allows constructing the new variables formulaically (rules based), as opposed to explicitly. This isn't super useful for cases of 1 or 2 variables, but becomes very useful when handling more.

library(tidyverse)

iris %>%
  group_by(Species) %>%
  mutate(sum_Petal.Length = sum(Petal.Length),
         sum_Petal.Width = sum(Petal.Width)) %>%
  ungroup()

iris %>%
  group_by(Species) %>%
  mutate(across(starts_with("Petal"),
                list(sum = ~sum(.x)),
                .names = "{.fn}_{.col}")) %>%
  ungroup()

Thanks very much to you, both!

I remember why I used aggregate now, as I have run into a problem I've hit before, so please bear with me as I try to explain it: I now have my computed variables (weighted averages); however, I want to collapse the rows so that each value appears only once per year.

Here, I compute a weighted sum = Sum(sum of mean ratings * the repetition of each word near key word)

df_word <- df_word %>% # compute the product of mean sum ratings (AV,A,V) * lemma repetitions
  mutate(AV.prod=(repet*AV.Mean.Sum),
         A.prod=(repet*A.Mean.Sum),  
         V.prod=(repet*V.Mean.Sum.R))

df_word <- df_word %>% # group by year and sum for each group (AV,A,V)
  group_by(year) %>%
  mutate(sumAVprod.word = sum(AV.prod),
         sumAprod.word = sum(A.prod),
         sumVprod.word = sum(V.prod)) %>% ungroup()

Then I standardize by computing a weighted average = sum of (repetition-weighted severity: AV, A, V) by lemma / sum of (repetitions by year)


df_word <- df_word %>% 
  group_by(year) %>%
  mutate(sum_repet_word=sum(repet)) %>% ungroup()

  # compute standardization
df_word <- df_word %>%
  mutate(sev_word = (sumAVprod.word/sum_repet_word), # V+A
         aro_word = (sumAprod.word/sum_repet_word),  # A
         val_word = (sumVprod.word/sum_repet_word))  # V

However, while the value in the weighted average is correct, I want to figure out how to create a tibble (not manually) which only contains these mutated variables (sev_word, aro_word, val_word) by year.

I am not sure how to start going about creating a reproducible example for this, so will show the current data frame as-is:

compared to a desired data frame (where variables are not repeating per word, but only per year):

I am also wondering how to read the code from above:

 iris %>% 
   group_by(Species) %>% 
   mutate(across(starts_with("Petal"),
                 list(sum=~sum(.x)), 
                 .names = "{.fn}_{.col}")) %>% 
   ungroup()

I can see that it groups the dataset iris by Species and then mutates variables from columns starting with "Petal". After that I am not sure how to read what it does... is it summing across rows in variables starting with "Petal"? I'm not sure what .names is either.

I think you are asking how to pick particular columns from a data.frame. dplyr::select() is the tool for this; it provides many possibilities, including explicit and rules-based selection. You can read about it and see examples.
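For example (a sketch using the column names from your code above):

df_word %>%
  select(year, sev_word, aro_word, val_word) # keep only year and the three computed variables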

You can read about dplyr::across().
It lets you choose the variables to process, the function(s) to run on them, and how the results should be named.
The tilde and . or .x are convenient notation for explaining how each variable should be passed into the function of choice, as functions can take additional parameters, etc.
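To tie that back to the example, here is the same across() call with each part annotated:

library(dplyr)

iris %>%
  group_by(Species) %>%
  mutate(across(starts_with("Petal"),          # which columns to process
                list(sum = ~sum(.x)),          # named list of functions; .x stands in for each column
                .names = "{.fn}_{.col}")) %>%  # how outputs are named, e.g. sum_Petal.Length
  ungroup()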


Thanks! Yeah, the problem with using the select() function to pick the respective columns is that the dataset would still contain too many rows throughout. Because I now used mutate instead of aggregate, there are multiple duplicate rows in the dataset!

If it's simply duplicated, use the distinct() function.
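For example (a sketch combining both steps, using the column names from your posts above):

df_word %>%
  select(year, sev_word, aro_word, val_word) %>% # keep only the per-year columns
  distinct()                                     # collapse duplicate rows to one per year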


Thank you so much for your patience! The code looks much cleaner and simpler now. I will eventually learn how to ask my questions better as I learn to code better!

Yay!

Have you ever used summarise()?

It's a dplyr function in some ways similar to mutate, but with very important differences.

While mutate() will add a column/variable and keep all the observations, summarise() will take grouped data, return the columns/variables you create, and collapse down to one row per unique group.

This is probably best seen with an example. You can copy and paste this and run it yourself so you can see what the resulting data looks like.

Here I create a data frame of 4,000 phone calls with 12 different customers.

calls <- tibble(callNo = 1:4000,
                callDuration = as.integer(runif(4000, min = 60, max = 1200)),
                customerNo = as.integer(runif(4000, min = 1, max = 13)),
                randomVar = runif(4000, min = 0, max = 1))

Now if I want some summary statistics, I can create them via mutate:

callSummary <- calls %>% 
  group_by(customerNo) %>% 
  mutate(callsPerCustomer = n(),
         meanCallDur = sum(callDuration) / callsPerCustomer) %>% 
  ungroup()

This gives the data I want, but keeps a lot that I no longer need. I still have 4,000 rows and some now-useless columns. So in reality, I'd add some code to select() the columns I want, then unique() to keep only the rows I need (eliminating the now-duplicate rows).
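That cleanup would look something like this (a sketch):

callSummary <- callSummary %>%
  select(customerNo, callsPerCustomer, meanCallDur) %>% # keep only the summary columns
  unique()                                              # drop the now-duplicate rows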

But check out summarise() (or summarize()):

callSummary <- calls %>% 
  group_by(customerNo) %>% 
  summarise(callsPerCustomer = n(),
            meanCallDur = sum(callDuration) / callsPerCustomer) %>% 
  ungroup()

Same amount of code; I've only swapped summarise() in for mutate(), but this returns just 12 rows and 3 columns with just the data I was after.

I hope that helps. I have a project to finish this morning, so I just quickly read through your message, but it seems like this might be what you need. If you end up needing more help, please just ask.

Luke

Thank you so much! I had not realised that summarise() can be used in place of mutate. For some reason, I still have to use distinct() to eliminate duplicate rows after using it, but it still saves at least one line of code (where I use the select() function), so I am happy with it and will keep this in mind for the future.

It has also narrowed down the libraries needed in my script to two (here and dplyr), which is also really cool. It looks like dplyr is a very useful package.
