Index/Match Excel Function similarity in R?

HABSaremyjam · April 21, 2021, 3:24pm

Hello!

I am trying to imitate the Index(Match() function in Excel. Basically what has happened is this:

I had this initial dataframe:

dput(EFR.ex, 10)
structure(list(Density_cells_per_mL = c(40477.29429, 919.9385066, 
22998.46266, 32197.84773, 919.9385066, 16098.92387, 505.8492373, 
919.9385066, 71755.20351, 919.9385066), Algal_Group = c("Cyanobacteria", 
"Green Algae", "Cyanobacteria", "Cyanobacteria", "Green Algae", 
"Green Algae", "Green Algae", "Cyanobacteria", "Cyanobacteria", 
"Haptophytes"), Genus = c("Anabaena", "Ankistrodesmus", "Aphanizomenon", 
"Aphanocapsa", "Chlamydomonas", "Chlorella", "Chlorogonium", 
"Chroococcus", "Chroococcus", "Chrysochromulina")), row.names = c(NA, 
10L), class = "data.frame")

I aggregated some algal densities, which created a new data frame. I then used this aggregated information to create an NMDS plot in R and associated the species with the ordination points to get this data frame:

dput(head(species.scores,10))
structure(list(NMDS1 = c(-0.0658985027506765, -0.1057693049412, 
0.0414392803242683, -0.131625100961953, -0.126438528647588, 0.10738777684384, 
-0.018071066906765, 0.0617902321086005, 0.0438645773084015, -0.1619003291023
), NMDS2 = c(-0.0213016404546824, 0.0418361050301208, 0.0338151840015218, 
-0.104557880312153, 0.026295743677768, -0.0268381946290351, 0.0935447463941724, 
-0.0275020026132876, -0.0160527011087588, 0.0107856144200736), 
    species = c("Actinastrum", "Anabaena", "Ankistrodesmus", 
    "Aphanizomenon", "Aphanocapsa", "Carteria", "Ceratium", "Chlamydomonas", 
    "Chlorella", "Chlorogonium")), row.names = c("Actinastrum", 
"Anabaena", "Ankistrodesmus", "Aphanizomenon", "Aphanocapsa", 
"Carteria", "Ceratium", "Chlamydomonas", "Chlorella", "Chlorogonium"
), class = "data.frame")

What I want to do is match the genus in the second data frame (species.scores) to the algal group in the first dataframe (EFR.ex) and create a column that returns that value, so that basically the second data frame now looks like this:

> dput(head(species.scores.2, 10))
structure(list(NA. = c("Actinastrum", "Anabaena", "Ankistrodesmus", 
"Aphanizomenon", "Aphanocapsa", "Carteria", "Ceratium", "Chlamydomonas", 
"Chlorella", "Chlorogonium"), NMDS1 = c(-0.0658985027506765, 
-0.1057693049412, 0.0414392803242683, -0.131625100961953, -0.126438528647588, 
0.10738777684384, -0.018071066906765, 0.0617902321086005, 0.0438645773084015, 
-0.1619003291023), NMDS2 = c(-0.0213016404546824, 0.0418361050301208, 
0.0338151840015218, -0.104557880312153, 0.026295743677768, -0.0268381946290351, 
0.0935447463941724, -0.0275020026132876, -0.0160527011087588, 
0.0107856144200736), Genus = c("Actinastrum", "Anabaena", "Ankistrodesmus", 
"Aphanizomenon", "Aphanocapsa", "Carteria", "Ceratium", "Chlamydomonas", 
"Chlorella", "Chlorogonium"), Algal_Group = c("Green Algae", 
"Cyanobacteria", "Green Algae", "Cyanobacteria", "Cyanobacteria", 
"Green Algae", "Dinoflagellates", "Green Algae", "Green Algae", 
"Green Algae")), row.names = c(NA, 10L), class = "data.frame")

That second dataframe I obtained by pulling the data into Excel and doing an Index(Match() function with the first dataframe, but I want to be able to do it directly in R. I have done a lot of Google searches on this and tried numerous methods but I just can't figure out how to do it. Any help would be so greatly appreciated!

Thank you so much!

FJCC · April 21, 2021, 4:30pm

I think you want to use one of the *_join functions from dplyr. I used left_join in this example.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
DF1 <- structure(list(Density_cells_per_mL = c(40477.29429, 919.9385066, 
                                        22998.46266, 32197.84773, 919.9385066, 16098.92387, 505.8492373, 
                                        919.9385066, 71755.20351, 919.9385066), 
                      Algal_Group = c("Cyanobacteria", "Green Algae", "Cyanobacteria", "Cyanobacteria", 
                                      "Green Algae", "Green Algae", "Green Algae", "Cyanobacteria", 
                                      "Cyanobacteria", "Haptophytes"), 
                      Genus = c("Anabaena", "Ankistrodesmus", "Aphanizomenon", "Aphanocapsa", "Chlamydomonas", 
                                "Chlorella", "Chlorogonium", "Chroococcus", "Chroococcus", "Chrysochromulina")), 
                 row.names = c(NA, 10L), class = "data.frame")
DF2 <- structure(list(NMDS1 = c(-0.0658985027506765, -0.1057693049412, 
                                0.0414392803242683, -0.131625100961953, -0.126438528647588, 0.10738777684384, 
                                -0.018071066906765, 0.0617902321086005, 0.0438645773084015, -0.1619003291023),
                      NMDS2 = c(-0.0213016404546824, 0.0418361050301208, 0.0338151840015218, 
             -0.104557880312153, 0.026295743677768, -0.0268381946290351, 0.0935447463941724, 
             -0.0275020026132876, -0.0160527011087588, 0.0107856144200736),
             species = c("Actinastrum", "Anabaena", "Ankistrodesmus", 
            "Aphanizomenon", "Aphanocapsa", "Carteria", "Ceratium", "Chlamydomonas", 
            "Chlorella", "Chlorogonium")), 
            row.names = c("Actinastrum", "Anabaena", "Ankistrodesmus", "Aphanizomenon", "Aphanocapsa", 
                          "Carteria", "Ceratium", "Chlamydomonas", "Chlorella", "Chlorogonium"
            ), class = "data.frame")

DF3 <- structure(list(NA. = c("Actinastrum", "Anabaena", "Ankistrodesmus", 
                       "Aphanizomenon", "Aphanocapsa", "Carteria", "Ceratium", "Chlamydomonas", 
                       "Chlorella", "Chlorogonium"), NMDS1 = c(-0.0658985027506765, 
                                                               -0.1057693049412, 0.0414392803242683, -0.131625100961953, -0.126438528647588, 
                                                               0.10738777684384, -0.018071066906765, 0.0617902321086005, 0.0438645773084015, 
                                                               -0.1619003291023), NMDS2 = c(-0.0213016404546824, 0.0418361050301208, 
                                                                                            0.0338151840015218, -0.104557880312153, 0.026295743677768, -0.0268381946290351, 
                                                                                            0.0935447463941724, -0.0275020026132876, -0.0160527011087588, 
                                                                                            0.0107856144200736), Genus = c("Actinastrum", "Anabaena", "Ankistrodesmus", 
                                                                                                                           "Aphanizomenon", "Aphanocapsa", "Carteria", "Ceratium", "Chlamydomonas", 
                                                                                                                           "Chlorella", "Chlorogonium"), Algal_Group = c("Green Algae", 
                                                                                                                                                                         "Cyanobacteria", "Green Algae", "Cyanobacteria", "Cyanobacteria", 
                                                                                                                                                                         "Green Algae", "Dinoflagellates", "Green Algae", "Green Algae", 
                                                                                                                                                                         "Green Algae")), row.names = c(NA, 10L), class = "data.frame")
tmp <- select(DF1,Algal_Group,Genus)
left_join(DF2,tmp,by=c("species"="Genus"))
#>          NMDS1       NMDS2        species   Algal_Group
#> 1  -0.06589850 -0.02130164    Actinastrum          <NA>
#> 2  -0.10576930  0.04183611       Anabaena Cyanobacteria
#> 3   0.04143928  0.03381518 Ankistrodesmus   Green Algae
#> 4  -0.13162510 -0.10455788  Aphanizomenon Cyanobacteria
#> 5  -0.12643853  0.02629574    Aphanocapsa Cyanobacteria
#> 6   0.10738778 -0.02683819       Carteria          <NA>
#> 7  -0.01807107  0.09354475       Ceratium          <NA>
#> 8   0.06179023 -0.02750200  Chlamydomonas   Green Algae
#> 9   0.04386458 -0.01605270      Chlorella   Green Algae
#> 10 -0.16190033  0.01078561   Chlorogonium   Green Algae

^{Created on 2021-04-21 by the reprex package (v0.3.0)}

HABSaremyjam · April 21, 2021, 4:54pm

This worked! For some reason it created a bunch of duplicated rows, so all I did was remove them and then I had exactly what I was looking for. Thank you so very much for taking the time to do this!

mikecrobp · April 22, 2021, 12:35pm

The key difference between Excel Index/Match and R joins is that Excel stops searching after the first match. join links each row on one side with each matching row on the other other side. That will explain your duplicated rows.

system · April 29, 2021, 12:36pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.