How to do multiple string count over very large dataset

prateek26394 · August 20, 2019, 5:11am

I have Flights data in which trips are present.

library(data.table)
library(stringr)
library(dplyr)
library(tidyr)

### Sample dataset ###
dataset <- data.frame("Trips"=c(1:30),"Connections"=c(
  "[Flight-6][Flight-2][Flight-5][Flight-7]"	,
  "[Flight-4][Flight-9][Flight-5][Flight-3]"	,
  "[Flight-8][Flight-3][Flight-2][Flight-4]"	,
  "[Flight-4][Flight-7][Flight-5][Flight-2]"	,
  "[Flight-3][Flight-10][Flight-7][Flight-5]"	,
  "[Flight-9][Flight-10][Flight-1][Flight-8]"	,
  "[Flight-10][Flight-5][Flight-7][Flight-8]"	,
  "[Flight-5][Flight-9][Flight-2][Flight-10]"	,
  "[Flight-1][Flight-10][Flight-9][Flight-7]"	,
  "[Flight-8][Flight-9][Flight-3][Flight-2]"	,
  "[Flight-10][Flight-3][Flight-5][Flight-6]"	,
  "[Flight-3][Flight-7][Flight-5][Flight-10]"	,
  "[Flight-4][Flight-2][Flight-3][Flight-8]"	,
  "[Flight-4][Flight-3][Flight-7][Flight-9]"	,
  "[Flight-2][Flight-4][Flight-7][Flight-1]"	,
  "[Flight-2][Flight-3][Flight-10][Flight-9]"	,
  "[Flight-4][Flight-7][Flight-2][Flight-5]"	,
  "[Flight-10][Flight-4][Flight-5][Flight-6]"	,
  "[Flight-10][Flight-8][Flight-4][Flight-9]"	,
  "[Flight-4][Flight-6][Flight-9][Flight-5]"	,
  "[Flight-2][Flight-5][Flight-4][Flight-10]"	,
  "[Flight-2][Flight-4][Flight-5][Flight-6]"	,
  "[Flight-8][Flight-7][Flight-10][Flight-9]"	,
  "[Flight-1][Flight-2][Flight-6][Flight-5]"	,
  "[Flight-1][Flight-3][Flight-6][Flight-9]"	,
  "[Flight-3][Flight-5][Flight-7][Flight-9]"	,
  "[Flight-5][Flight-8][Flight-4][Flight-10]"	,
  "[Flight-2][Flight-4][Flight-8][Flight-6]"	,
  "[Flight-1][Flight-5][Flight-8][Flight-4]"	,
  "[Flight-7][Flight-10][Flight-3][Flight-1]"	,
  "[Flight-6][Flight-2][Flight-5][Flight-7]"	,
  "[Flight-4][Flight-9][Flight-5][Flight-3]"	,
  "[Flight-8][Flight-3][Flight-2][Flight-4]"	,
  "[Flight-4][Flight-7][Flight-5][Flight-2]"	,
  "[Flight-3][Flight-10][Flight-7][Flight-5]"	,
  "[Flight-9][Flight-10][Flight-1][Flight-8]"	,
  "[Flight-10][Flight-5][Flight-7][Flight-8]"	,
  "[Flight-5][Flight-9][Flight-2][Flight-10]"	,
  "[Flight-1][Flight-10][Flight-9][Flight-7]"	,
  "[Flight-8][Flight-9][Flight-3][Flight-2]"	,
  "[Flight-10][Flight-3][Flight-5][Flight-6]"	,
  "[Flight-3][Flight-7][Flight-5][Flight-10]"	,
  "[Flight-4][Flight-2][Flight-3][Flight-8]"	,
  "[Flight-4][Flight-3][Flight-7][Flight-9]"	,
  "[Flight-2][Flight-4][Flight-7][Flight-1]"	,
  "[Flight-2][Flight-3][Flight-10][Flight-9]"	,
  "[Flight-4][Flight-7][Flight-2][Flight-5]"	,
  "[Flight-10][Flight-4][Flight-5][Flight-6]"	,
  "[Flight-10][Flight-8][Flight-4][Flight-9]"	,
  "[Flight-4][Flight-6][Flight-9][Flight-5]"	,
  "[Flight-2][Flight-5][Flight-4][Flight-10]"	,
  "[Flight-2][Flight-4][Flight-5][Flight-6]"	,
  "[Flight-8][Flight-7][Flight-10][Flight-9]"	,
  "[Flight-1][Flight-2][Flight-6][Flight-5]"	,
  "[Flight-1][Flight-3][Flight-6][Flight-9]"	,
  "[Flight-3][Flight-5][Flight-7][Flight-9]"	,
  "[Flight-5][Flight-8][Flight-4][Flight-10]"	,
  "[Flight-2][Flight-4][Flight-8][Flight-6]"	,
  "[Flight-1][Flight-5][Flight-8][Flight-4]"	,
  "[Flight-7][Flight-10][Flight-3][Flight-1]"	
  )) # Actual input is data.table from txt file with >2900000 trips

First I extracted all the connections i.e.

# for trip  [Flight-6][Flight-2][Flight-5][Flight-7] connections are:- [Flight-6][Flight-2] ; [Flight-2][Flight-5] ; [Flight-5][Flight-7]

### Making Connectons from dataset ###
connections <- str_match_all(dataset$Connections,pattern = "(?=(\\[Flight-\\d+\\]\\[Flight-\\d+\\]))" )
conn <- unlist(connections,recursive = TRUE)
conn <-unique(conn)
conn<- data.frame(conn)
conn <- subset(conn,str_length(conn)>0)  # All unique connections are created

Now, need to calculate Count of each connection in the dataset

conn1 <- str_replace_all(conn$conn,"\\[","\\\\[")
conn1 <- str_replace_all(conn1,"\\]","\\\\]")        ### Made regex pattern for String matching

dataset_as_String <- toString(dataset$Connections)  ## Converting dataset to String

### Counting each connections from dataset
ans = str_count(dataset_as_String,conn1)

But, I need to do the same for >17000 connections over >2900000 trips. Which is very slow process.
Further, I also need to calculate 4 other types: -

1.No. of trips starting with Flight-x

# Something like might be wrong
pattern <- "\\[Flight-\\d+\\]\\S+"

2.No. of trips ending with Flight-x

pattern <- "\\[Flight-\\d+\\]$" ## Might be wrong

3.Connections starting with Flight-x but not ending at Flight-y

pattern <- "\\[Flight-\\d+\\](?!\\[Flight-\\d+\\])"## Might be wrong

4.Connections not starting with Flight-x but ending at Flight-y

pattern <- "(?!\\[Flight-\\d+\\])\\[Flight-\\d+\\]"  ## Might be wrong

where, x and y are all possible combinations.

How can I calculate all the 5 conditions as fast as possible.
Calculating time of <60 min is acceptable.
Time<15min, I'll be grateful.

Any help is appreciated.

Thank You for your time and concern.

DavoWW · August 22, 2019, 10:07am

I was intrigued by your data munging problem and produced the following:

library(tidyverse)
library(magrittr)
#> 
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#> 
#>     set_names
#> The following object is masked from 'package:tidyr':
#> 
#>     extract
# I have made some name changes here.
# Reduced trips to 30 (original was duplicated 1:30, 1:30)
dataset <- data.frame("trips"=c(1:30),
                      "sequence"=c(
  "[Flight-6][Flight-2][Flight-5][Flight-7]"    ,
  "[Flight-4][Flight-9][Flight-5][Flight-3]"    ,
  "[Flight-8][Flight-3][Flight-2][Flight-4]"    ,
  "[Flight-4][Flight-7][Flight-5][Flight-2]"    ,
  "[Flight-3][Flight-10][Flight-7][Flight-5]"   ,
  "[Flight-9][Flight-10][Flight-1][Flight-8]"   ,
  "[Flight-10][Flight-5][Flight-7][Flight-8]"   ,
  "[Flight-5][Flight-9][Flight-2][Flight-10]"   ,
  "[Flight-1][Flight-10][Flight-9][Flight-7]"   ,
  "[Flight-8][Flight-9][Flight-3][Flight-2]"    ,
  "[Flight-10][Flight-3][Flight-5][Flight-6]"   ,
  "[Flight-3][Flight-7][Flight-5][Flight-10]"   ,
  "[Flight-4][Flight-2][Flight-3][Flight-8]"    ,
  "[Flight-4][Flight-3][Flight-7][Flight-9]"    ,
  "[Flight-2][Flight-4][Flight-7][Flight-1]"    ,
  "[Flight-2][Flight-3][Flight-10][Flight-9]"   ,
  "[Flight-4][Flight-7][Flight-2][Flight-5]"    ,
  "[Flight-10][Flight-4][Flight-5][Flight-6]"   ,
  "[Flight-10][Flight-8][Flight-4][Flight-9]"   ,
  "[Flight-4][Flight-6][Flight-9][Flight-5]"    ,
  "[Flight-2][Flight-5][Flight-4][Flight-10]"   ,
  "[Flight-2][Flight-4][Flight-5][Flight-6]"    ,
  "[Flight-8][Flight-7][Flight-10][Flight-9]"   ,
  "[Flight-1][Flight-2][Flight-6][Flight-5]"    ,
  "[Flight-1][Flight-3][Flight-6][Flight-9]"    ,
  "[Flight-3][Flight-5][Flight-7][Flight-9]"    ,
  "[Flight-5][Flight-8][Flight-4][Flight-10]"   ,
  "[Flight-2][Flight-4][Flight-8][Flight-6]"    ,
  "[Flight-1][Flight-5][Flight-8][Flight-4]"    ,
  "[Flight-7][Flight-10][Flight-3][Flight-1]"   ), stringsAsFactors=FALSE)

str(dataset)
#> 'data.frame':    30 obs. of  2 variables:
#>  $ trips   : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ sequence: chr  "[Flight-6][Flight-2][Flight-5][Flight-7]" "[Flight-4][Flight-9][Flight-5][Flight-3]" "[Flight-8][Flight-3][Flight-2][Flight-4]" "[Flight-4][Flight-7][Flight-5][Flight-2]" ...

# My approach: make each flight in a trip sequence a single 
# factor column (sectors).
# Can then easily get the various connections.

# Get rid of the extraneous characters
# My regex skills are not good enough to do this in one step.

s1 <- str_replace_all(dataset$sequence, "(Flight-)", "")
s2 <- str_replace_all(s1, "(\\[)", "")
s3 <- str_replace_all(s2, "(\\])", ",")

# str_split(s3, ",")
# str_split(s3, ",", simplify=TRUE)

sectors <- str_split(s3, ",", simplify=TRUE) 

new.df <- cbind(dataset, sectors[,c(1:4)])
names(new.df)[3:6] <- c("first","second","third","fourth")
str(new.df)
#> 'data.frame':    30 obs. of  6 variables:
#>  $ trips   : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ sequence: chr  "[Flight-6][Flight-2][Flight-5][Flight-7]" "[Flight-4][Flight-9][Flight-5][Flight-3]" "[Flight-8][Flight-3][Flight-2][Flight-4]" "[Flight-4][Flight-7][Flight-5][Flight-2]" ...
#>  $ first   : Factor w/ 10 levels "1","10","2","3",..: 7 5 9 5 4 10 2 6 1 9 ...
#>  $ second  : Factor w/ 9 levels "10","2","3","4",..: 2 9 3 7 1 1 5 9 1 9 ...
#>  $ third   : Factor w/ 10 levels "1","10","2","3",..: 6 6 3 6 8 1 8 3 10 4 ...
#>  $ fourth  : Factor w/ 10 levels "1","10","2","3",..: 8 4 5 3 6 9 9 2 8 3 ...

pad_left <- function(x) {str_pad(x, width=2, side="left", pad="0")}

new.df %<>%
  mutate_at(c("first","second","third","fourth"), pad_left) %>%
  mutate(conn_1 = factor(paste0(.$first, "_", .$second)),
         conn_2 = factor(paste0(.$second,"_", .$third)),
         conn_3 = factor(paste0(.$third, "_", .$fourth)))

head(new.df)
#>   trips                                  sequence first second third
#> 1     1  [Flight-6][Flight-2][Flight-5][Flight-7]    06     02    05
#> 2     2  [Flight-4][Flight-9][Flight-5][Flight-3]    04     09    05
#> 3     3  [Flight-8][Flight-3][Flight-2][Flight-4]    08     03    02
#> 4     4  [Flight-4][Flight-7][Flight-5][Flight-2]    04     07    05
#> 5     5 [Flight-3][Flight-10][Flight-7][Flight-5]    03     10    07
#> 6     6 [Flight-9][Flight-10][Flight-1][Flight-8]    09     10    01
#>   fourth conn_1 conn_2 conn_3
#> 1     07  06_02  02_05  05_07
#> 2     03  04_09  09_05  05_03
#> 3     04  08_03  03_02  02_04
#> 4     02  04_07  07_05  05_02
#> 5     05  03_10  10_07  07_05
#> 6     08  09_10  10_01  01_08
str(new.df)
#> 'data.frame':    30 obs. of  9 variables:
#>  $ trips   : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ sequence: chr  "[Flight-6][Flight-2][Flight-5][Flight-7]" "[Flight-4][Flight-9][Flight-5][Flight-3]" "[Flight-8][Flight-3][Flight-2][Flight-4]" "[Flight-4][Flight-7][Flight-5][Flight-2]" ...
#>  $ first   : chr  "06" "04" "08" "04" ...
#>  $ second  : chr  "02" "09" "03" "07" ...
#>  $ third   : chr  "05" "05" "02" "05" ...
#>  $ fourth  : chr  "07" "03" "04" "02" ...
#>  $ conn_1  : Factor w/ 27 levels "01_02","01_03",..: 18 15 20 14 10 23 26 17 4 22 ...
#>  $ conn_2  : Factor w/ 26 levels "02_03","02_05",..: 2 22 4 17 25 23 13 20 26 21 ...
#>  $ conn_3  : Factor w/ 25 levels "01_08","02_04",..: 13 11 2 10 18 1 19 4 24 6 ...

# These connections now sort properly due to the leading zeros
levels(new.df$conn_1)
#>  [1] "01_02" "01_03" "01_05" "01_10" "02_03" "02_04" "02_05" "03_05"
#>  [9] "03_07" "03_10" "04_02" "04_03" "04_06" "04_07" "04_09" "05_08"
#> [17] "05_09" "06_02" "07_10" "08_03" "08_07" "08_09" "09_10" "10_03"
#> [25] "10_04" "10_05" "10_08"

# Can now find specific connections
# e.g. [Flight-1][Flight-2]

new.df %>%
  filter_all(any_vars(. =="01_02"))
#>   trips                                 sequence first second third fourth
#> 1    24 [Flight-1][Flight-2][Flight-6][Flight-5]    01     02    06     05
#>   conn_1 conn_2 conn_3
#> 1  01_02  02_06  06_05

# Same connection but at different points in the sequence
new.df %>%
  filter_all(any_vars(. =="02_03"))
#>   trips                                  sequence first second third
#> 1    13  [Flight-4][Flight-2][Flight-3][Flight-8]    04     02    03
#> 2    16 [Flight-2][Flight-3][Flight-10][Flight-9]    02     03    10
#>   fourth conn_1 conn_2 conn_3
#> 1     08  04_02  02_03  03_08
#> 2     09  02_03  03_10  10_09

# All connections that started with [Flight-4]
new.df %>%
  filter_all(any_vars(grepl("^04_", .)))
#>    trips                                  sequence first second third
#> 1      2  [Flight-4][Flight-9][Flight-5][Flight-3]    04     09    05
#> 2      4  [Flight-4][Flight-7][Flight-5][Flight-2]    04     07    05
#> 3     13  [Flight-4][Flight-2][Flight-3][Flight-8]    04     02    03
#> 4     14  [Flight-4][Flight-3][Flight-7][Flight-9]    04     03    07
#> 5     15  [Flight-2][Flight-4][Flight-7][Flight-1]    02     04    07
#> 6     17  [Flight-4][Flight-7][Flight-2][Flight-5]    04     07    02
#> 7     18 [Flight-10][Flight-4][Flight-5][Flight-6]    10     04    05
#> 8     19 [Flight-10][Flight-8][Flight-4][Flight-9]    10     08    04
#> 9     20  [Flight-4][Flight-6][Flight-9][Flight-5]    04     06    09
#> 10    21 [Flight-2][Flight-5][Flight-4][Flight-10]    02     05    04
#> 11    22  [Flight-2][Flight-4][Flight-5][Flight-6]    02     04    05
#> 12    27 [Flight-5][Flight-8][Flight-4][Flight-10]    05     08    04
#> 13    28  [Flight-2][Flight-4][Flight-8][Flight-6]    02     04    08
#>    fourth conn_1 conn_2 conn_3
#> 1      03  04_09  09_05  05_03
#> 2      02  04_07  07_05  05_02
#> 3      08  04_02  02_03  03_08
#> 4      09  04_03  03_07  07_09
#> 5      01  02_04  04_07  07_01
#> 6      05  04_07  07_02  02_05
#> 7      06  10_04  04_05  05_06
#> 8      09  10_08  08_04  04_09
#> 9      05  04_06  06_09  09_05
#> 10     10  02_05  05_04  04_10
#> 11     06  02_04  04_05  05_06
#> 12     10  05_08  08_04  04_10
#> 13     06  02_04  04_08  08_06

# All connections that ended with [Flight-7]
new.df %>%
  filter_all(any_vars(grepl("_07$", .)))
#>    trips                                  sequence first second third
#> 1      1  [Flight-6][Flight-2][Flight-5][Flight-7]    06     02    05
#> 2      4  [Flight-4][Flight-7][Flight-5][Flight-2]    04     07    05
#> 3      5 [Flight-3][Flight-10][Flight-7][Flight-5]    03     10    07
#> 4      7 [Flight-10][Flight-5][Flight-7][Flight-8]    10     05    07
#> 5      9 [Flight-1][Flight-10][Flight-9][Flight-7]    01     10    09
#> 6     12 [Flight-3][Flight-7][Flight-5][Flight-10]    03     07    05
#> 7     14  [Flight-4][Flight-3][Flight-7][Flight-9]    04     03    07
#> 8     15  [Flight-2][Flight-4][Flight-7][Flight-1]    02     04    07
#> 9     17  [Flight-4][Flight-7][Flight-2][Flight-5]    04     07    02
#> 10    23 [Flight-8][Flight-7][Flight-10][Flight-9]    08     07    10
#> 11    26  [Flight-3][Flight-5][Flight-7][Flight-9]    03     05    07
#>    fourth conn_1 conn_2 conn_3
#> 1      07  06_02  02_05  05_07
#> 2      02  04_07  07_05  05_02
#> 3      05  03_10  10_07  07_05
#> 4      08  10_05  05_07  07_08
#> 5      07  01_10  10_09  09_07
#> 6      10  03_07  07_05  05_10
#> 7      09  04_03  03_07  07_09
#> 8      01  02_04  04_07  07_01
#> 9      05  04_07  07_02  02_05
#> 10     09  08_07  07_10  10_09
#> 11     09  03_05  05_07  07_09

HTH

prateek26394 · August 23, 2019, 7:23am

Thank You for your time and concern.

Wanted to ask how can I pass a vector of Connections in the filter_all().
Because, my final report should contain values for the 5 variables, for all the connections.

 a=data.frame("Conn"=levels(factor(new.df$conn_1)))
 b=data.frame("Conn"=levels(factor(new.df$conn_2)))
 c=data.frame("Conn"=levels(factor(new.df$conn_3)))
 d = rbind(a,b,c)))) # Need to find count for d.

Thank You

DavoWW · August 25, 2019, 5:11am

Hi, sorry for delay in replying (weekend tasks!)

If I understand correctly, you want the frequencies of all connection types?
If yes, then see the following code (run after my original code above):

#----- Further section -----
# Get frequencies of connection type across
# each sector, and then all sectors.

# Need to put data into "long" form using tidyr::gather
# What "conn_*" columns are there?
new.df %>%
  select(starts_with("conn_")) %>%
  names(.) -> conn_cols

new.df %>%
  gather(., key="conn_pos", value="conn_all", conn_cols) -> long.df

# Then get summary tables of frequencies
table(long.df$conn_all)
as.data.frame(table(long.df$conn_all))

tab1 <- as.data.frame.matrix(table(long.df$conn_all, long.df$conn_pos))

tab1 %>%
  rownames_to_column(var="conn_type") -> tab2

# Make a nice table for publication
library(flextable)
qflextable(tab2)

Good luck!

prateek26394 · August 26, 2019, 6:28am

This is exactly what I needed. Over the big dataset (>16000 connections) it is doing calculation within 10 seconds for frequency of connections.
Now, I will try to generate the report for the remaining 4 variables described in the problem.
Guessing, won't take more than a minute to calculate all the variables.

Frankly, wasn't expecting output generation to be this fast.

Thank You for your time and concern.
@DavoWW u r awsum

system · September 2, 2019, 6:28am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.