Merging data between data frame and meta data

Dear R experts--this seems like this may be a common problem, but I am likely not searching correctly?

In short, I am trying to create a ggplot2 point plot of numerical factors (# septic tanks) versus values of tp (in this case), and trying

Here is a some of my data. Data are in long format:

tibble::tribble(
     ~well,       ~date, ~parm.code,   ~adj,                    ~type,
  "DEPFLD", "13-Nov-17",       "tp",   0.04, "Upper Floridan Aquifer",
    "MWBS", "13-Nov-17",       "tp",  0.206,      "Surficial Aquifer",
  "DEPPBD", "14-Nov-17",       "tp",  0.075, "Upper Floridan Aquifer",
  "DEPPBS", "14-Nov-17",       "tp",   0.03,      "Surficial Aquifer",
    "MW01", "14-Nov-17",       "tp", 0.0025,      "Surficial Aquifer",
    "MW02", "14-Nov-17",       "tp",  0.404,   "Intermediate Aquifer",
    "MW04", "14-Nov-17",       "tp",  0.289,   "Intermediate Aquifer",
    "MW06", "14-Nov-17",       "tp",  0.052,      "Surficial Aquifer",
    "MW07", "14-Nov-17",       "tp",   0.08,      "Surficial Aquifer",
    "MW11", "14-Nov-17",       "tp",  0.033,   "Intermediate Aquifer",
    "MW14", "14-Nov-17",       "tp",  0.035,      "Surficial Aquifer",
    "MW17", "14-Nov-17",       "tp",  0.042,   "Intermediate Aquifer",
    "MW20", "14-Nov-17",       "tp",  0.013,      "Surficial Aquifer",
    "MW22", "14-Nov-17",       "tp",  0.011,   "Intermediate Aquifer",
    "MWAI", "14-Nov-17",       "tp",  0.085,   "Intermediate Aquifer",
    "MWBU", "14-Nov-17",       "tp",  0.182, "Upper Floridan Aquifer",
    "BW02", "16-Nov-17",       "tp",  0.077,      "Surficial Aquifer",
    "MWCI", "16-Nov-17",       "tp",  0.517, "Upper Floridan Aquifer",
    "MWEU", "16-Nov-17",       "tp",  0.128, "Upper Floridan Aquifer",
    "SW01", "16-Nov-17",       "tp",  0.126,          "Surface Water",
    "MWDS", "17-Nov-17",       "tp",  0.266,      "Surficial Aquifer",
    "MWDU", "17-Nov-17",       "tp",  0.084, "Upper Floridan Aquifer",
  "DEPFLD",  "3-Apr-18",       "tp",  0.052, "Upper Floridan Aquifer",
    "MW01",  "3-Apr-18",       "tp",  0.008,      "Surficial Aquifer",
    "MW14",  "3-Apr-18",       "tp",  0.113,      "Surficial Aquifer",
    "MW22",  "3-Apr-18",       "tp",  0.024,   "Intermediate Aquifer",
    "MWDS",  "3-Apr-18",       "tp",  0.068,      "Surficial Aquifer"
  )

And the coding data--wells match between each dataset, and the number of septic is constant for each well:

``` r
tibble::tribble(
   ~well,    ~code,                      ~description, ~value,    ~units,
  "MW05",   "1200",      "Residential Medium Density", 576.36,   "acres",
  "MW05",   "1800",                    "Recreational", 124.33,   "acres",
  "MW05",   "5200",                           "Lakes",  32.64,   "acres",
  "MW05",   "1100",         "Residential Low Density",  15.33,   "acres",
  "MW05",   "4300",            "Upland Mixed Forests",   8.54,   "acres",
  "MW05",   "5300",                      "Reservoirs",   7.41,   "acres",
  "MW05",   "1300",        "Residential High Density",   5.72,   "acres",
  "MW05",   "6400", "Vegetated Non-Forested Wetlands",   3.33,   "acres",
  "MW05",   "4100",       "Upland Coniferous Forests",    2.5,   "acres",
  "MW05",   "1700",                   "Institutional",   0.04,   "acres",
  "MW05", "Septic",                          "Septic",      0, "density",
  "MW09",   "1200",      "Residential Medium Density", 609.68,   "acres",
  "MW09",   "1800",                    "Recreational", 115.29,   "acres",
  "MW09",   "1700",                   "Institutional",  16.85,   "acres",
  "MW09",   "1100",         "Residential Low Density",   8.75,   "acres",
  "MW09",   "6300",          "Wetland Forested Mixed",    7.9,   "acres",
  "MW09",   "1300",        "Residential High Density",   5.58,   "acres",
  "MW09",   "6100",                   "Major Springs",   3.78,   "acres",
  "MW09",   "5200",                           "Lakes",   3.32,   "acres",
  "MW09",   "5300",                      "Reservoirs",   3.02,   "acres",
  "MW09",   "6400", "Vegetated Non-Forested Wetlands",   2.04,   "acres",
  "MW09", "Septic",                          "Septic",     25, "density",
  "MW13",   "1200",      "Residential Medium Density", 333.75,   "acres",
  "MW13",   "1400",         "Commercial and Services", 146.57,   "acres",
  "MW13",   "1300",        "Residential High Density",  86.08,   "acres",
  "MW13",   "1100",         "Residential Low Density",  37.58,   "acres",
  "MW13",   "6400", "Vegetated Non-Forested Wetlands",  32.04,   "acres",
  "MW13",   "5200",                           "Lakes",  30.92,   "acres",
  "MW13",   "4300",            "Upland Mixed Forests",  29.26,   "acres",
  "MW13",   "8100",                  "Transportation",   21.3,   "acres",
  "MW13",   "5300",                      "Reservoirs",  17.16,   "acres",
  "MW13",   "1700",                   "Institutional",  16.94,   "acres",
  "MW13",   "8300",                       "Utilities",  14.13,   "acres",
  "MW13",   "3300",                 "Mixed Rangeland",   8.33,   "acres",
  "MW13",   "4200",         "Upland Hardwood Forests",    1.4,   "acres",
  "MW13",   "1900",                       "Open Land",   0.73,   "acres",
  "MW13", "Septic",                          "Septic",     92, "density",
  "MW16",   "1200",      "Residential Medium Density", 463.54,   "acres",
  "MW16",   "1100",         "Residential Low Density", 140.16,   "acres",
  "MW16",   "1800",                    "Recreational", 128.26,   "acres",
  "MW16",   "1300",        "Residential High Density",   15.8,   "acres",
  "MW16",   "4300",            "Upland Mixed Forests",   9.93,   "acres",
  "MW16",   "4100",       "Upland Coniferous Forests",   7.61,   "acres",
  "MW16",   "5300",                      "Reservoirs",   7.41,   "acres",
  "MW16",   "5200",                           "Lakes",   3.48,   "acres",
  "MW16", "Septic",                          "Septic",     19, "density"
  )
#> # A tibble: 46 x 5
#>    well  code  description                      value units
#>    <chr> <chr> <chr>                            <dbl> <chr>
#>  1 MW05  1200  Residential Medium Density      576.   acres
#>  2 MW05  1800  Recreational                    124.   acres
#>  3 MW05  5200  Lakes                            32.6  acres
#>  4 MW05  1100  Residential Low Density          15.3  acres
#>  5 MW05  4300  Upland Mixed Forests              8.54 acres
#>  6 MW05  5300  Reservoirs                        7.41 acres
#>  7 MW05  1300  Residential High Density          5.72 acres
#>  8 MW05  6400  Vegetated Non-Forested Wetlands   3.33 acres
#>  9 MW05  4100  Upland Coniferous Forests         2.5  acres
#> 10 MW05  1700  Institutional                     0.04 acres
#> # ... with 36 more rows

Created on 2020-11-22 by the reprex package (v0.3.0)

The matching factors, are wells. But, I am unclear how to do this. Do I have to make the dataframe in long format?

Again, I want to plot Septic tank numbers versus "tp" values ("adj"). And, I also want to plot septic tank numbers versus mean "tp" values. I have sketched out the two figures I am trying to make. I dont know how to call up the dataframe with the factors in ggplot2 and have it match the dataframe.

image

image

@FJCC this is somewhat similar to that last issue you helped me with, with similar datastructure, but not sure how to start this?

Thanks so much.

I am not sure I completely understand the meaning of the x axis. Do you want to count how many septic tanks are associated with each well? Below is some code to do that. You could then use the left_join() function from dplyr to have the count values placed in the correct rows of the tibble containing the adj values.

library(dplyr, warn.conflicts = FALSE)
library(tibble)
SepticDF <- tibble::tribble(
  ~well,    ~code,                      ~description, ~value,    ~units,
  "MW05",   "1200",      "Residential Medium Density", 576.36,   "acres",
  "MW05",   "1800",                    "Recreational", 124.33,   "acres",
  "MW05",   "5200",                           "Lakes",  32.64,   "acres",
  "MW05",   "1100",         "Residential Low Density",  15.33,   "acres",
  "MW05",   "4300",            "Upland Mixed Forests",   8.54,   "acres",
  "MW05",   "5300",                      "Reservoirs",   7.41,   "acres",
  "MW05",   "1300",        "Residential High Density",   5.72,   "acres",
  "MW05",   "6400", "Vegetated Non-Forested Wetlands",   3.33,   "acres",
  "MW05",   "4100",       "Upland Coniferous Forests",    2.5,   "acres",
  "MW05",   "1700",                   "Institutional",   0.04,   "acres",
  "MW05", "Septic",                          "Septic",      0, "density",
  "MW09",   "1200",      "Residential Medium Density", 609.68,   "acres",
  "MW09",   "1800",                    "Recreational", 115.29,   "acres",
  "MW09",   "1700",                   "Institutional",  16.85,   "acres",
  "MW09",   "1100",         "Residential Low Density",   8.75,   "acres",
  "MW09",   "6300",          "Wetland Forested Mixed",    7.9,   "acres",
  "MW09",   "1300",        "Residential High Density",   5.58,   "acres",
  "MW09",   "6100",                   "Major Springs",   3.78,   "acres",
  "MW09",   "5200",                           "Lakes",   3.32,   "acres",
  "MW09",   "5300",                      "Reservoirs",   3.02,   "acres",
  "MW09",   "6400", "Vegetated Non-Forested Wetlands",   2.04,   "acres",
  "MW09", "Septic",                          "Septic",     25, "density",
  "MW13",   "1200",      "Residential Medium Density", 333.75,   "acres",
  "MW13",   "1400",         "Commercial and Services", 146.57,   "acres",
  "MW13",   "1300",        "Residential High Density",  86.08,   "acres",
  "MW13",   "1100",         "Residential Low Density",  37.58,   "acres",
  "MW13",   "6400", "Vegetated Non-Forested Wetlands",  32.04,   "acres",
  "MW13",   "5200",                           "Lakes",  30.92,   "acres",
  "MW13",   "4300",            "Upland Mixed Forests",  29.26,   "acres",
  "MW13",   "8100",                  "Transportation",   21.3,   "acres",
  "MW13",   "5300",                      "Reservoirs",  17.16,   "acres",
  "MW13",   "1700",                   "Institutional",  16.94,   "acres",
  "MW13",   "8300",                       "Utilities",  14.13,   "acres",
  "MW13",   "3300",                 "Mixed Rangeland",   8.33,   "acres",
  "MW13",   "4200",         "Upland Hardwood Forests",    1.4,   "acres",
  "MW13",   "1900",                       "Open Land",   0.73,   "acres",
  "MW13", "Septic",                          "Septic",     92, "density",
  "MW16",   "1200",      "Residential Medium Density", 463.54,   "acres",
  "MW16",   "1100",         "Residential Low Density", 140.16,   "acres",
  "MW16",   "1800",                    "Recreational", 128.26,   "acres",
  "MW16",   "1300",        "Residential High Density",   15.8,   "acres",
  "MW16",   "4300",            "Upland Mixed Forests",   9.93,   "acres",
  "MW16",   "4100",       "Upland Coniferous Forests",   7.61,   "acres",
  "MW16",   "5300",                      "Reservoirs",   7.41,   "acres",
  "MW16",   "5200",                           "Lakes",   3.48,   "acres",
  "MW16", "Septic",                          "Septic",     19, "density"
)
SepticCount <- SepticDF %>% group_by(well) %>% 
  summarize(Count = n())
#> `summarise()` ungrouping output (override with `.groups` argument)
SepticCount
#> # A tibble: 4 x 2
#>   well  Count
#>   <chr> <int>
#> 1 MW05     11
#> 2 MW09     11
#> 3 MW13     15
#> 4 MW16      9

Created on 2020-11-22 by the reprex package (v0.3.0)

The line 10 up from the bottom-that 92 is the number of septic tanks--do I have my data ordered incorrectly?
"MW13", "Septic", "Septic", 92, "density

@FJCC--just ran a simple ggplot, using "septic" without making it a factor, and it worked great.

Using the following code:

plot1b <- combined.data %>%
  filter(parm.code == "nox")  %>%
  ggplot(aes(x=septic, y=adj)) +
      geom_point()
#> Error in combined.data %>% filter(parm.code == "nox") %>% ggplot(aes(x = septic, : could not find function "%>%"
 plot1b
#> Error in eval(expr, envir, enclos): object 'plot1b' not found

Created on 2020-11-22 by the reprex package (v0.3.0)

If I wanted to plot the mean value for the adj, would I write this within the ggplot code?

I suspect that the x axis is always a factor for box plots. For any plot where the x axis can be numeric, you will have to balance the more regular appearance of the data when you use a factor against the deceptive spacing of the data. I would be inclined to keep the values numeric but you will have to judge what is best for your data.

@FJCC Thanks for your help--the left bind worked perfect.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.

I do not see where in your data it shows that MW13 has 92 septic tanks. Can you point that out to me, please?

@FJCC I actually just want to use the number of septic tanks as my X-axis , not neccessarily count them.

For example, in the data, well MW13 has 92 septic tanks and well MW16 has 19 septic tanks
"MW13", "Septic", "Septic", 92, "density
"MW16", "Septic", "Septic", 19, "density"

(And all the other 23 wells have different numbers of septic tanks)

Again, I want to plot the number of septic tanks fro well MW13 versus all the tp values for this well (there are 14 values over 2 years), and I not sure how to merge them.

thanks

So you need to filter the second table for rows where the code column equals "Septic" and join that result to the first table that contains the adj values? Is that right?

@FJCC--yes, that is correct. I used a left join and it worked.

Do I have to code the new septic column as a factor to make this graph?