I'm afraid this problem is not trivial: it's about finding paths in a (weighted, directed) graph, but it's not necessarily the shortest path (so the standard algorithms in packages like igraph
or network
will not be applicable). Actually, if your dataset is big, I suspect it might impossible to compute in a reasonable time (NP-hard). But assuming the dataset is not too big, this should do what you want (but it's pretty inefficient, it could probably be improved).
dta <- read.table(text="start end weight
A1 A2 0.6
A2 A5 0.5
A3 A1 0.75
A4 A5 0.88
A5 A3 0.99
(+)-1(10),4-Cadinadiene Falcarinone 0.09
Leucodelphinidin (+)-1(10),4-Cadinadiene 0.876
Lignin (2E,7R,11R)-2-Phyten-1-ol 0.778
(2E,7R,11R)-2-Phyten-1-ol Leucodelphinidin 0.55
Falcarinone Lignin 1
A1 (+)-1(10),4-Cadinadiene 1
A2 Lignin 1
A3 (2E,7R,11R)-2-Phyten-1-ol 1
A4 Leucodelphinidin 1
A5 Falcarinone 1", header=TRUE)
library(igraph)
#>
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#>
#> decompose, spectrum
#> The following object is masked from 'package:base':
#>
#> union
g <- graph_from_data_frame(dta)
stopifnot(is_directed(g))
stopifnot(is_weighted(g))
# initialize the result data frame
n_guess <- 100
res <- data.frame(start = character(n_guess),
end = character(n_guess),
nb_A = integer(n_guess),
weight = double(n_guess),
full_path = character(n_guess))
cnt <- 0
for(i in 1:length(V(g))){
all_paths <- all_simple_paths(g, from = V(g)[i])
for(j in 1:length(all_paths)){
cur_path <- all_paths[[j]]
nb_A <- sum(startsWith(names(cur_path), "A"))
if(nb_A > 3 | nb_A < 2){
next
} else{
cnt <- cnt + 1 # number of valid results
res[cnt,"start"] <- names(cur_path[1])
res[cnt,"end"] <- names(cur_path[length(cur_path)])
res[cnt,"nb_A"] <- nb_A
res[cnt,"weight"] <- sum(dta$weight[dta$start %in% names(cur_path[-length(cur_path)]) &
dta$end %in% names(cur_path[-1])])
res[cnt,"full_path"] <- paste(names(cur_path), collapse = ", ")
}
}
}
res[1:cnt, 1:4]
#> start end nb_A weight
#> 1 A1 A2 2 0.600
#> 2 A1 A5 3 1.100
#> 3 A1 Falcarinone 3 2.100
#> 4 A1 Lignin 3 4.100
#> 5 A1 (2E,7R,11R)-2-Phyten-1-ol 3 4.878
#> 6 A1 Leucodelphinidin 3 5.428
#> 7 A1 (+)-1(10),4-Cadinadiene 3 7.304
#> 8 A1 Lignin 2 1.600
#> 9 A1 (2E,7R,11R)-2-Phyten-1-ol 2 2.378
#> 10 A1 Leucodelphinidin 2 2.928
#> 11 A1 (+)-1(10),4-Cadinadiene 2 4.804
#> 12 A1 Falcarinone 2 4.894
#> 13 A2 A5 2 0.500
#> 14 A2 A3 3 1.490
#> 15 A2 (2E,7R,11R)-2-Phyten-1-ol 3 2.490
#> 16 A2 Leucodelphinidin 3 3.040
#> 17 A2 (+)-1(10),4-Cadinadiene 3 3.916
#> 18 A2 Falcarinone 3 5.006
#> 19 A2 Lignin 3 7.006
#> 20 A2 Falcarinone 2 1.500
#> 21 A2 Lignin 2 3.500
#> 22 A2 (2E,7R,11R)-2-Phyten-1-ol 2 4.278
#> 23 A2 Leucodelphinidin 2 4.828
#> 24 A2 (+)-1(10),4-Cadinadiene 2 5.704
#> 25 A3 A1 2 0.750
#> 26 A3 A2 3 1.350
#> 27 A3 Lignin 3 2.350
#> 28 A3 (2E,7R,11R)-2-Phyten-1-ol 3 4.128
#> 29 A3 Leucodelphinidin 3 4.678
#> 30 A3 (+)-1(10),4-Cadinadiene 3 6.554
#> 31 A3 Falcarinone 3 6.644
#> 32 A3 (+)-1(10),4-Cadinadiene 2 1.750
#> 33 A3 Falcarinone 2 1.840
#> 34 A3 Lignin 2 2.840
#> 35 A3 (2E,7R,11R)-2-Phyten-1-ol 2 4.618
#> 36 A3 Leucodelphinidin 2 5.168
#> 37 A4 A5 2 0.880
#> 38 A4 A3 3 1.870
#> 39 A4 (2E,7R,11R)-2-Phyten-1-ol 3 2.870
#> 40 A4 Leucodelphinidin 3 4.420
#> 41 A4 (+)-1(10),4-Cadinadiene 3 5.296
#> 42 A4 Falcarinone 3 6.386
#> 43 A4 Lignin 3 7.386
#> 44 A4 Falcarinone 2 1.880
#> 45 A4 Lignin 2 2.880
#> 46 A4 (2E,7R,11R)-2-Phyten-1-ol 2 3.658
#> 47 A4 Leucodelphinidin 2 5.208
#> 48 A4 (+)-1(10),4-Cadinadiene 2 6.084
#> 49 A5 A3 2 0.990
#> 50 A5 A1 3 1.740
#> 51 A5 (+)-1(10),4-Cadinadiene 3 2.740
#> 52 A5 Falcarinone 3 3.830
#> 53 A5 Lignin 3 4.830
#> 54 A5 (2E,7R,11R)-2-Phyten-1-ol 3 6.608
#> 55 A5 Leucodelphinidin 3 7.158
#> 56 A5 (2E,7R,11R)-2-Phyten-1-ol 2 1.990
#> 57 A5 Leucodelphinidin 2 2.540
#> 58 A5 (+)-1(10),4-Cadinadiene 2 3.416
#> 59 A5 Falcarinone 2 4.506
#> 60 A5 Lignin 2 5.506
Created on 2020-11-23 by the reprex package (v0.3.0)
Two notes:
- to initialize the results data frame, we need to make a guess on the maximum number of results we can obtain. Here I just used
100
, you might want to do something more clever.
- note that I added a column with the full description of the path, in case you want to take a look. Of course, you can remove the corresponding line from the code if you're not interested.