How to separate data using path concept from a CSV file to another CSV file

akib62 · November 22, 2020, 12:20pm

Hello Good People,

I have a CSV file that contains information about a path including the weight of the path. The dataset sample is given below:

start	end	weight
A1	A2	0.6
A2	A5	0.5
A3	A1	0.75
A4	A5	0.88
A5	A3	0.99
(+)-1(10),4-Cadinadiene	Falcarinone	0.09
Leucodelphinidin	(+)-1(10),4-Cadinadiene	0.876
Lignin	(2E,7R,11R)-2-Phyten-1-ol	0.778
(2E,7R,11R)-2-Phyten-1-ol	Leucodelphinidin	0.55
Falcarinone	Lignin	1
A1	(+)-1(10),4-Cadinadiene	1
A2	Lignin	1
A3	(2E,7R,11R)-2-Phyten-1-ol	1
A4	Leucodelphinidin	1
A5	Falcarinone	1

Now I want to create another CSV file based on the path concept. For example, From A1 I can visit A2 and from A2 I can visit Lignin (A1 -> A2 -> Lignin). So, we can say, we can visit Lignin from A1. In the same way, we can visit Falcarinone from A1 (A1 -> A2 -> A5 -> Falcarinone).

However, I am not interested to contain data from (A1 -> A2 -> A5 -> A3 -> (2E,7R,11R)-2-Phyten-1-ol). That's mean, I want to take A at least 2 times and at most 3 times.

My expected 2nd CSF file will contain only this kind of information:

start	end	Total weight
A1	Lignin	1.6
A1	Falcarinone	2.1

I have no idea how can I do this task and I am extremely sorry that I do not have any reproducible code.

I will be grateful if you give any suggestions or ideas.

AlexisW · November 24, 2020, 1:44am

I'm afraid this problem is not trivial: it's about finding paths in a (weighted, directed) graph, but it's not necessarily the shortest path (so the standard algorithms in packages like igraph or network will not be applicable). Actually, if your dataset is big, I suspect it might impossible to compute in a reasonable time (NP-hard). But assuming the dataset is not too big, this should do what you want (but it's pretty inefficient, it could probably be improved).

dta <- read.table(text="start   end     weight
A1  A2  0.6
A2  A5  0.5
A3  A1  0.75
A4  A5  0.88
A5  A3  0.99
(+)-1(10),4-Cadinadiene     Falcarinone     0.09
Leucodelphinidin    (+)-1(10),4-Cadinadiene     0.876
Lignin  (2E,7R,11R)-2-Phyten-1-ol   0.778
(2E,7R,11R)-2-Phyten-1-ol   Leucodelphinidin    0.55
Falcarinone     Lignin  1
A1  (+)-1(10),4-Cadinadiene     1
A2  Lignin  1
A3  (2E,7R,11R)-2-Phyten-1-ol   1
A4  Leucodelphinidin    1
A5  Falcarinone     1", header=TRUE)


library(igraph)
#> 
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#> 
#>     decompose, spectrum
#> The following object is masked from 'package:base':
#> 
#>     union

g <- graph_from_data_frame(dta)

stopifnot(is_directed(g))
stopifnot(is_weighted(g))

# initialize the result data frame
n_guess <- 100
res <- data.frame(start = character(n_guess),
                  end = character(n_guess),
                  nb_A = integer(n_guess),
                  weight = double(n_guess),
                  full_path = character(n_guess))

cnt <- 0

for(i in 1:length(V(g))){
  all_paths <- all_simple_paths(g, from = V(g)[i])

  for(j in 1:length(all_paths)){
    cur_path <- all_paths[[j]]
    nb_A <- sum(startsWith(names(cur_path), "A"))
    if(nb_A > 3 | nb_A < 2){
      next
    } else{
      cnt <- cnt + 1 # number of valid results
      
      res[cnt,"start"] <- names(cur_path[1])
      res[cnt,"end"] <- names(cur_path[length(cur_path)])
      res[cnt,"nb_A"] <- nb_A
      res[cnt,"weight"] <- sum(dta$weight[dta$start %in% names(cur_path[-length(cur_path)]) &
                                         dta$end %in% names(cur_path[-1])])
      res[cnt,"full_path"] <- paste(names(cur_path), collapse = ", ")
    }
  }
}
res[1:cnt, 1:4]
#>    start                       end nb_A weight
#> 1     A1                        A2    2  0.600
#> 2     A1                        A5    3  1.100
#> 3     A1               Falcarinone    3  2.100
#> 4     A1                    Lignin    3  4.100
#> 5     A1 (2E,7R,11R)-2-Phyten-1-ol    3  4.878
#> 6     A1          Leucodelphinidin    3  5.428
#> 7     A1   (+)-1(10),4-Cadinadiene    3  7.304
#> 8     A1                    Lignin    2  1.600
#> 9     A1 (2E,7R,11R)-2-Phyten-1-ol    2  2.378
#> 10    A1          Leucodelphinidin    2  2.928
#> 11    A1   (+)-1(10),4-Cadinadiene    2  4.804
#> 12    A1               Falcarinone    2  4.894
#> 13    A2                        A5    2  0.500
#> 14    A2                        A3    3  1.490
#> 15    A2 (2E,7R,11R)-2-Phyten-1-ol    3  2.490
#> 16    A2          Leucodelphinidin    3  3.040
#> 17    A2   (+)-1(10),4-Cadinadiene    3  3.916
#> 18    A2               Falcarinone    3  5.006
#> 19    A2                    Lignin    3  7.006
#> 20    A2               Falcarinone    2  1.500
#> 21    A2                    Lignin    2  3.500
#> 22    A2 (2E,7R,11R)-2-Phyten-1-ol    2  4.278
#> 23    A2          Leucodelphinidin    2  4.828
#> 24    A2   (+)-1(10),4-Cadinadiene    2  5.704
#> 25    A3                        A1    2  0.750
#> 26    A3                        A2    3  1.350
#> 27    A3                    Lignin    3  2.350
#> 28    A3 (2E,7R,11R)-2-Phyten-1-ol    3  4.128
#> 29    A3          Leucodelphinidin    3  4.678
#> 30    A3   (+)-1(10),4-Cadinadiene    3  6.554
#> 31    A3               Falcarinone    3  6.644
#> 32    A3   (+)-1(10),4-Cadinadiene    2  1.750
#> 33    A3               Falcarinone    2  1.840
#> 34    A3                    Lignin    2  2.840
#> 35    A3 (2E,7R,11R)-2-Phyten-1-ol    2  4.618
#> 36    A3          Leucodelphinidin    2  5.168
#> 37    A4                        A5    2  0.880
#> 38    A4                        A3    3  1.870
#> 39    A4 (2E,7R,11R)-2-Phyten-1-ol    3  2.870
#> 40    A4          Leucodelphinidin    3  4.420
#> 41    A4   (+)-1(10),4-Cadinadiene    3  5.296
#> 42    A4               Falcarinone    3  6.386
#> 43    A4                    Lignin    3  7.386
#> 44    A4               Falcarinone    2  1.880
#> 45    A4                    Lignin    2  2.880
#> 46    A4 (2E,7R,11R)-2-Phyten-1-ol    2  3.658
#> 47    A4          Leucodelphinidin    2  5.208
#> 48    A4   (+)-1(10),4-Cadinadiene    2  6.084
#> 49    A5                        A3    2  0.990
#> 50    A5                        A1    3  1.740
#> 51    A5   (+)-1(10),4-Cadinadiene    3  2.740
#> 52    A5               Falcarinone    3  3.830
#> 53    A5                    Lignin    3  4.830
#> 54    A5 (2E,7R,11R)-2-Phyten-1-ol    3  6.608
#> 55    A5          Leucodelphinidin    3  7.158
#> 56    A5 (2E,7R,11R)-2-Phyten-1-ol    2  1.990
#> 57    A5          Leucodelphinidin    2  2.540
#> 58    A5   (+)-1(10),4-Cadinadiene    2  3.416
#> 59    A5               Falcarinone    2  4.506
#> 60    A5                    Lignin    2  5.506

^{Created on 2020-11-23 by the reprex package (v0.3.0)}

Two notes:

to initialize the results data frame, we need to make a guess on the maximum number of results we can obtain. Here I just used 100, you might want to do something more clever.
note that I added a column with the full description of the path, in case you want to take a look. Of course, you can remove the corresponding line from the code if you're not interested.

akib62 · November 28, 2020, 9:09am

@AlexisW, thank you very much.

system · December 5, 2020, 9:09am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.