Need help with looping through communities by ID

imantmn · December 19, 2020, 2:28pm

I want to find the top 10 hashtags for each of the several thousand communities in my dataset. Each user_name in the dataset belongs to a specific community (e.g., "a", "b", "c", "d" belong to community 0). A sample of my dataset with only a few communities looks like the following:

df <- data.frame(N = c(1,2,3,4,5,6,7,8,9,10),
                  user_name = c("a","b","c","d","e","f", "g", "h", "i", "j"),
                  community_id =c(0,0,0,0,1,1,2,2,2,3),
                  hashtags   = c("#illness, #ebola", "#coronavirus, #covid", "#vaccine, #lie", "#flue, #ebola, #usa", "#vaccine", "#flue", "#coronavirus", "#ebola", "#ebola, #vaccine", "#china, #virus") )

To find the top 10 hashtags for EACH community (in the following case, community 0), I need to run the following code:

#select community 0 (the column is called community_id)
df_comm_0 <- df %>%
  filter(community_id == 0)

#remove NAs
df_comm_0 <- na.omit(df_comm_0)

#find top 10 hashtags
df_hashtags_0 <- df_comm_0 %>%
  unnest_tokens(hashtag, hashtags, token = "tweets") %>%
  count(hashtag, sort = TRUE) %>%
  top_n(10)

I know that using a loop can save me from running my code ~15,000 times (the number of communities in the dataset). I am not familiar with loops, and even after searching for a couple of hours I was not able to write one. The following code is what I wrote, but it gives me the hashtags for the entire dataset!

x <- (df$community_id)

for (val in x) {
  
print (
df %>%
unnest_tokens(hashtag, hashtags, token = "tweets") %>%
  count(hashtag, sort = TRUE) %>%
  top_n(10)
)
}
print()

Is there a way to compute the hashtag frequencies for all communities by looping through them, and to write the top 10 hashtags for each community to one file (or separate files)?

Your assistance is much appreciated.

jms · December 20, 2020, 11:16pm

Hi!

There are two problems with your loop.
The main problem is that you never tell the code inside the loop to work on one community at a time. Instead, you keep repeating the same code on the complete data.
The second problem is that the vector you loop over (x) is just the community_id column of your data frame. The IDs in it appear multiple times, so each community would be processed more than once. In the example I am posting below, I use comID <- unique(mydf$community_id), which ensures that every ID is visited only once. Inside the loop I included the line mydf %>% filter(community_id == i), so that the rest of the code is applied to a different community on every iteration.
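To make the second point concrete, here is a tiny base R sketch that only counts iterations. It uses the community IDs from the sample data; the counter names are made up for illustration.

```r
# Community IDs as they appear in the sample data (one entry per user)
x <- c(0, 0, 0, 0, 1, 1, 2, 2, 2, 3)

# Looping over x directly runs once per row, so community 0 is visited four times
n_iter_raw <- 0
for (val in x) n_iter_raw <- n_iter_raw + 1

# Looping over the distinct IDs visits each community exactly once
ids <- unique(x)
n_iter_unique <- 0
for (i in ids) n_iter_unique <- n_iter_unique + 1

n_iter_raw     # 10 iterations
n_iter_unique  # 4 iterations
```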

Here is a solution that loops through all community IDs, selects the top 10 hashtags and stores them in a list.

suppressPackageStartupMessages({
  library(dplyr)
  library(tidytext)})

# Data
mydf <- data.frame(N = c(1,2,3,4,5,6,7,8,9,10),
                   user_name = c("a","b","c","d","e","f", "g", "h", "i", "j"),
                   community_id =c(0,0,0,0,1,1,2,2,2,3),
                   hashtags   = c("#illness, #ebola", "#coronavirus, #covid", "#vaccine, #lie", "#flue, #ebola, #usa", "#vaccine", "#flue", "#coronavirus", "#ebola", "#ebola, #vaccine", "#china, #virus"),
                   stringsAsFactors = FALSE)


# create index with all community IDs
comID<-unique(mydf$community_id)

# create list to store the output
output <- vector(mode = "list", length = length(comID))

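# loop over the unique IDs; each result is stored at the position of its ID
# in comID, so this also works when the IDs are not exactly 0, 1, 2, ...
for (i in comID) {
  output[[which(comID == i)]] <- mydf %>%
    filter(community_id == i) %>%
    unnest_tokens(hashtag, hashtags, token = "tweets") %>%
    count(hashtag, sort = TRUE) %>%
    top_n(10, n)
}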

# name list objects 
names(output)<-paste0("Community:_", comID)

output
#> $`Community:_0`
#>        hashtag n
#> 1       #ebola 2
#> 2 #coronavirus 1
#> 3       #covid 1
#> 4        #flue 1
#> 5     #illness 1
#> 6         #lie 1
#> 7         #usa 1
#> 8     #vaccine 1
#> 
#> $`Community:_1`
#>    hashtag n
#> 1    #flue 1
#> 2 #vaccine 1
#> 
#> $`Community:_2`
#>        hashtag n
#> 1       #ebola 2
#> 2 #coronavirus 1
#> 3     #vaccine 1
#> 
#> $`Community:_3`
#>   hashtag n
#> 1  #china 1
#> 2  #virus 1

Created on 2020-12-20 by the reprex package (v0.3.0)

Just a note: the code inside the loop will still run about 15,000 times (once per community), so it will probably still take some minutes to complete.
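On the question of getting the results into one file or separate files: a minimal base R sketch, assuming `output` is a named list of data frames like the one built above. The stand-in list, the file names and the use of `tempdir()` are my own choices for illustration, so adjust the paths to your project.

```r
# Stand-in for the `output` list from the loop (two communities only)
output <- list(
  `Community:_0` = data.frame(hashtag = c("#ebola", "#covid"), n = c(2L, 1L),
                              stringsAsFactors = FALSE),
  `Community:_1` = data.frame(hashtag = c("#flue", "#vaccine"), n = c(1L, 1L),
                              stringsAsFactors = FALSE)
)

# Option 1: one file. Add the list name as a column, then stack the pieces.
pieces <- Map(
  function(d, nm) data.frame(community = nm, d, stringsAsFactors = FALSE),
  output, names(output)
)
combined <- do.call(rbind, unname(pieces))
out_file <- file.path(tempdir(), "top_hashtags_all.csv")
write.csv(combined, out_file, row.names = FALSE)

# Option 2: one file per community
out_dir <- file.path(tempdir(), "top_hashtags")
dir.create(out_dir, showWarnings = FALSE)
for (nm in names(output)) {
  safe_nm <- gsub("[^A-Za-z0-9_]", "", nm)  # drop ":" so the file name is portable
  write.csv(output[[nm]], file.path(out_dir, paste0(safe_nm, ".csv")),
            row.names = FALSE)
}
```

With ~15,000 communities the single combined file is probably easier to work with than thousands of small files.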

mattwarkentin · December 21, 2020, 3:06am

I really don't think looping is necessary. Just grouping your data with dplyr::group_by() will have the same effect with much less code and indirection.

library(dplyr)
library(tidytext)

# Data
mydf <- data.frame(
  N = c(1,2,3,4,5,6,7,8,9,10),
  user_name = c("a","b","c","d","e","f", "g", "h", "i", "j"),
  community_id =c(0,0,0,0,1,1,2,2,2,3),
  hashtags   = c("#illness, #ebola", "#coronavirus, #covid", "#vaccine, #lie", "#flue, #ebola, #usa", "#vaccine", "#flue", "#coronavirus", "#ebola", "#ebola, #vaccine", "#china, #virus"),
  stringsAsFactors = FALSE
)

mydf %>% 
  unnest_tokens(hashtag, hashtags, token = "tweets") %>% 
  group_by(community_id) %>% 
  count(hashtag) %>% 
  slice_max(order_by = n, n = 10)
#> Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
#> # A tibble: 15 x 3
#> # Groups:   community_id [4]
#>    community_id hashtag          n
#>           <dbl> <chr>        <int>
#>  1            0 #ebola           2
#>  2            0 #coronavirus     1
#>  3            0 #covid           1
#>  4            0 #flue            1
#>  5            0 #illness         1
#>  6            0 #lie             1
#>  7            0 #usa             1
#>  8            0 #vaccine         1
#>  9            1 #flue            1
#> 10            1 #vaccine         1
#> 11            2 #ebola           2
#> 12            2 #coronavirus     1
#> 13            2 #vaccine         1
#> 14            3 #china           1
#> 15            3 #virus           1
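Two caveats on the grouped approach, with a hedged variant. First, `slice_max()` keeps ties by default, so a community can return more than 10 rows when several hashtags share the same count; `with_ties = FALSE` caps it at 10. Second, as far as I know `token = "tweets"` was dropped from later tidytext releases, so if it errors for you, splitting the comma-separated string is one alternative. The sketch below assumes the hashtags are always separated by a comma; the reduced data and the name `top_tags` are only for illustration.

```r
library(dplyr)
library(tidyr)

# Reduced version of the sample data
mydf <- data.frame(
  community_id = c(0, 0, 1, 1),
  hashtags = c("#illness, #ebola", "#flue, #ebola, #usa", "#vaccine", "#flue"),
  stringsAsFactors = FALSE
)

top_tags <- mydf %>%
  separate_rows(hashtags, sep = ",\\s*") %>%   # one row per hashtag
  mutate(hashtag = tolower(hashtags)) %>%      # treat #Ebola and #ebola as the same tag
  count(community_id, hashtag, name = "n") %>% # frequency per community
  group_by(community_id) %>%
  slice_max(order_by = n, n = 10, with_ties = FALSE) %>%  # at most 10 rows per community
  ungroup()

top_tags
```

Because the result is a single data frame with a `community_id` column, it can be written to one file with `write.csv(top_tags, "top_hashtags.csv", row.names = FALSE)`.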

imantmn · December 21, 2020, 3:48am

I really appreciate your time and assistance. The code works very well.

imantmn · December 21, 2020, 4:17am

The code is very efficient! Thank you Matt!
