Automating an algorithm to complete data below the detection limit in R

I have a dataframe called data_base that looks like this:

The full dataframe is this Data Base. There are more variables than the ones I show here; these are the ones necessary to understand the problem.

Cod_Corto Cuenca    Temporada Alcalinidad Carb_ Bicarb_ Fluoruros      Cloruros Nitratos Nitratos_N Sulfatos Nitritos Nitritos_N
   <fct>     <fct>     <fct>           <dbl> <dbl>   <dbl> <chr>             <dbl> <chr>    <chr>         <dbl> <chr>    <chr>     
 1 GW-026    Tumbes    Avenida         283.    0.3   346.  0.495            572.   <0.031   <0.007        901.  <0.003   <0.001    
 2 GW-007    Zarumilla Avenida         146.    0.3   179.  0.24399999999…   279.   <0.031   <0.007        274.  <0.003   <0.001    
 3 GW-008    Zarumilla Avenida         108.    0.3   132.  0.314             62.8  2.22900… 0.504          61.0 <0.003   <0.001    
 4 GW-009    Zarumilla Avenida         138.    0.3   168.  0.184            584.   <0.031   <0.007        151.  <0.003   <0.001    
 5 SW-056    Tumbes    Avenida         156.    0.3   191.  0.20200000000…    68.6  0.151    3.4000000…    124.  <0.003   <0.001    
 6 GW-005    Tumbes    Avenida         211     0.3   257.  0.157             64.9  1.04499… 0.2359999…    146.  <0.003   <0.001    
 7 GW-004    Tumbes    Avenida         278.    0.3   339   0.17              28.8  5.57800… 1.26          127.  <0.003   <0.001    
 8 GW-003    Tumbes    Avenida         311.    0.3   380   0.09              19.8  0.56699… 0.128          19.4 <0.003   <0.001    
 9 SW-032    Tumbes    Avenida          44.9   0.3    54.7 4.59999999999…     1.45 1.768    0.4            20.7 <0.003   <0.001    
10 SW-036    Tumbes    Avenida          44.7   0.3    54.5 5.29999999999…     1.51 2.323    0.5250000…     18.6 <0.003   <0.001    
# … with 89 more rows, and 101 more variables: Fosfatos <chr>, Fosfatos_P <chr>, Bromuro <chr>, Ag_dis <chr>, Al_dis <chr>,
#   As_dis <chr>, B_dis <chr>, Ba_dis <chr>, Be_dis <chr>, Bi_dis <chr>, Ca_dis <dbl>, Cd_dis <chr>, Ce_dis <chr>, Co_dis <chr>,
#   Cr_dis <chr>, Cs_dis <chr>, Cu_dis <chr>, Fe_dis <chr>, Ga_dis <chr>, Ge_dis <chr>, Hf_dis <chr>, Hg_dis <chr>, K_dis <dbl>,
#   La_dis <chr>, Li_dis <chr>, Lu_dis <chr>, Mg_dis <dbl>, Mn_dis <chr>, Mo_dis <chr>, Na_dis <dbl>, Nb_dis <chr>, Ni_dis <chr>,
#   P_dis <chr>, Pb_dis <chr>, Rb_dis <chr>, Sb_dis <chr>, Se_dis <chr>, Si_dis <dbl>, SiO2_dis <dbl>, Sn_dis <chr>, Sr_dis <chr>,
#   Ta_dis <chr>, Te_dis <chr>, Th_dis <chr>, Ti_dis <chr>, Tl_dis <chr>, U_dis <chr>, V_dis <chr>, W_dis <chr>, Yb_dis <chr>,
#   Zn_dis <chr>, Zr_dis <chr>, Ag_tot <chr>, Al_tot <chr>, As_tot <chr>, B_tot <chr>, Ba_tot <dbl>, Be_tot <chr>, Bi_tot <chr>, …

So I need to achieve one objective (quite complex, I think):

  1. Complete the data below the detection limit in a new column (many new columns, because I have many elements) named "element_com". The element is variable (it can be Cd, Zn, Cu, and so on), and the completion depends on two factors: a spatial factor (Cuenca, Subcuenca or Microcuenca) and a temporal factor (Temporada or time intervals).

In order to achieve this objective, I created the following algorithm:

#### Impute values in water below the detection limit (B.D.) ####
library(tidyverse)
library(NADA)
library(DT)

# Filter to detect some B.D. values
datatable(data_base[ , c("Zn_dis", "Zn_tot")], filter = "top")

# Elements that I need to complete
elementos <- colnames(data_base[ , 70:180])
# Working table: grouping variables plus the elements
data_total <- data_base %>% select("Cuenca", "Temporada", any_of(elementos))

# First Block
# Filter according to Cuenca, Subcuenca or Microcuenca (spatial coherence)
Tumbes <- data_base %>% filter(Cuenca == "Tumbes")
str(Tumbes)

# Second Block
# Each run starts here, first with Avenida:
z <- Tumbes[Tumbes$Temporada == "Avenida", ] # do the same for Estiaje: change Avenida to Estiaje!
# Preparing the data for analysis:
val0 <- unique(grep("<", z$Zn_dis, value = TRUE)) # the "<DL" string (assumes a single detection limit)
z$var0 <- z$Zn_dis
z$ND_var0 <- rep(0, length(z$var0))
indcero0 <- which(z$var0 == val0)                # rows below the detection limit
z$var0[indcero0] <- substr(val0, 2, nchar(val0)) # strip the "<"
z$var0 <- as.numeric(z$var0)
z$ND_var0[indcero0] <- 1
z$ND_var0 <- as.logical(z$ND_var0)               # TRUE = censored
# Sort the data for the fit:
indna0 <- is.na(z$var0)
yn0 <- z$var0[!indna0]
cyn0 <- z$ND_var0[!indna0]
yn0 <- sort(yn0, index.return = TRUE)
cyn0 <- cyn0[yn0$ix]
# Apply ROS (Regression on Order Statistics)
elemento <- ros(yn0$x, cyn0, forwardT = "log", reverseT = "exp")
elemento <- as.data.frame(elemento)
id <- which(elemento$censored == TRUE)
elemento <- elemento[id, ] # keep only the modeled values for the censored rows

# Third Block

# DO IT WITH CARE!!! (only the first time)
# Create the new column of completed values.
Tumbes$Zn_com <- Tumbes$Zn_dis # omit this step after the first run

# Loop to impute the values:
id <- which(Tumbes$Zn_com == val0)
for (i in seq_along(id)) {
  Tumbes$Zn_com[id[i]] <- elemento$modeled[i]
}
Tumbes$Zn_com <- as.numeric(Tumbes$Zn_com)

# Testing if it works
# After running for Avenida and Estiaje:
knitr::kable(Tumbes[ , c("Zn_dis", "Zn_com")])
write.csv(x = Tumbes, file = "Tumbes.csv")

At the beginning I subset the database to the Tumbes Cuenca (First Block), then I work on the Avenida Temporada (Second and Third Blocks), and after that I do the same with the Estiaje Temporada; the initial line of the Third Block runs only once. Finally, I run the "Testing if it works" part.
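
To make the intent of the blocks clearer, here is the same procedure collapsed into a single function. This is only a sketch (untested; impute_element is a name I made up) and it assumes one detection limit per column, no missing values, and at least one value below the detection limit in each group:

library(NADA)

impute_element <- function(df, cuenca, temporada, col) {
  sel <- df$Cuenca == cuenca & df$Temporada == temporada
  x   <- df[[col]][sel]
  cen <- grepl("^<", x)                   # TRUE where below the detection limit
  obs <- as.numeric(sub("^<", "", x))     # "<" stripped, coerced to numeric
  fit <- as.data.frame(ros(obs, cen, forwardT = "log", reverseT = "exp"))
  # every censored row shares the same "<DL" string, so the order in which
  # the modeled values are assigned among them does not matter
  x <- obs
  x[cen] <- fit$modeled[fit$censored]
  com <- paste0(col, "_com")
  if (is.null(df[[com]])) df[[com]] <- as.numeric(sub("^<", "", df[[col]]))
  df[[com]][sel] <- x
  df
}

# Hypothetical usage: one call per combination instead of editing by hand
data_base <- impute_element(data_base, "Tumbes", "Avenida", "Zn_dis")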

Important:
In this case the database does not contain Estiaje data, but sometimes it does. If you run the code for Estiaje when it does not exist, you may get an error.

All of that runs well, but my problem is that I have to run it hundreds of times, because in this case I have 3 Cuencas, 2 Temporadas (the Estiaje data is not here yet) and 101 elements.

I hope someone can help me. Additionally, I am creating an R package that automatically works with dataframes containing non-detected values in water-resources geology and completes them using the Helsel methodology from the United States Geological Survey, combined with spatio-temporal coherence, plus many other water-resources analyses, so I would greatly appreciate any help with this. Thanks so much!!!

Greetings compatriot (I'm assuming you are Peruvian because of the content of your data). I can't test this for every possible scenario (it would take me too much time), and it will fail if the proportion of values below the detection limit is too high for any given Cuenca/Temporada combination (which causes NADA::ros() to fail), but I think it is a good starting point for you to refine.
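
One way to soften that failure mode is a wrapper that falls back to the plain numeric values whenever ros() errors. This is just a sketch (the function name is mine, and it assumes the column has no NAs) that you could swap into the across() call below:

library(tidyverse)
library(NADA)

# Sketch: a safe wrapper around NADA::ros(); if the fit fails (e.g. the group
# is almost entirely censored), return the values with "<" stripped instead
safe_ros_impute <- function(x) {
    obs <- as.numeric(str_remove(x, "<"))  # numeric values, "<" stripped
    cen <- str_detect(x, "<")              # TRUE where below the detection limit
    tryCatch({
        fit <- as.data.frame(ros(obs, cen, forwardT = "log", reverseT = "exp"))
        fit$modeled[order(sort(obs, index.return = TRUE)$ix)]  # undo ros()'s sort
    },
    error = function(e) obs
    )
}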

This code iterates over every Cuenca/Temporada combination and imputes the "below detection limit" values for all the selected columns (as an example I selected columns 71:74, just because they have a low censored proportion).

library(tidyverse)
library(NADA)

# Downloading sample data
data_url <- "https://drive.google.com/uc?export=download&id=15lz4PWfqXR3zIyeixX_DnfBMyajq9JpY"
data_base <- read.csv(data_url)

# Relevant code
data_base %>%
    group_nest(Cuenca, Temporada) %>% # one nested tibble per Cuenca/Temporada combination
    mutate(data = map(data, 
                      ~ .x %>%
                          mutate(across(.cols = 71:74, # columns with few censored values, selected as an example
                                        .fns = ~ ros(as.numeric(str_remove(.x, "<")), # observed values, "<" stripped
                                                     str_detect(.x, "<"),             # TRUE = censored
                                                     forwardT = "log",
                                                     reverseT = "exp") %>%
                                            as.data.frame() %>%
                                            # ros() sorts ascending; restore the original row order
                                            arrange(sort(as.numeric(str_remove(.x, "<")), index.return = TRUE)$ix) %>% 
                                            pull(modeled))))) %>%
    unnest(cols = c(data)) %>% 
    select(Cuenca, Temporada, 71:74) # Showing results
#> # A tibble: 99 × 6
#>    Cuenca            Temporada Alcalinidad Carb_ Bicarb_ Fluoruros
#>    <chr>             <chr>           <dbl> <dbl>   <dbl>     <dbl>
#>  1 Intercuenca 13951 Avenida          214.   0.3    261      0.392
#>  2 Intercuenca 13951 Avenida          208.   0.3    254.     0.299
#>  3 Intercuenca 13951 Avenida          243.   0.3    297      0.33 
#>  4 Intercuenca 13951 Avenida          284.   0.3    346.     0.878
#>  5 Intercuenca 13951 Avenida          329.   0.3    401.     0.486
#>  6 Tumbes            Avenida          283.   0.3    346.     0.495
#>  7 Tumbes            Avenida          156.   0.3    191.     0.202
#>  8 Tumbes            Avenida          211    0.3    257.     0.157
#>  9 Tumbes            Avenida          278.   0.3    339      0.17 
#> 10 Tumbes            Avenida          311.   0.3    380      0.09 
#> # … with 89 more rows

Created on 2022-07-14 by the reprex package (v2.0.1)
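
And if you want to keep the original columns and write the imputed values into new "_com" columns (as in your stated objective), across() can name its outputs via .names. A minimal variation of the pipeline above, under the same assumptions and reusing the safe_ros_impute() wrapper sketched earlier:

data_base %>%
    group_nest(Cuenca, Temporada) %>%
    mutate(data = map(data,
                      ~ .x %>%
                          mutate(across(.cols = 71:74,
                                        .fns = safe_ros_impute,   # wrapper sketched above
                                        .names = "{.col}_com")))) %>% # e.g. Fluoruros_com
    unnest(cols = c(data))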


Greetings compatriot, you assume well. Your example is a great starting point and helps me a lot; I am studying the map() function, because I do not know it very well yet. I created some code to identify the data below the detection limit:

elementos <- colnames(data_base[ ,70:180]) # chemical elements only
data_total <- data_base %>% select("Cuenca","Temporada", any_of(elementos))

data_total |>
  pivot_longer(!1:2, values_transform = list(value = as.character)) |>
  group_by(across(1:3)) |>
  summarize(
    non_detected = sum(str_detect(value, "^<"), na.rm = TRUE),
    completed = n() - non_detected, # group size, not nrow(data_total)
    .groups = "drop")
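
Related to the ros() failure you mention, a possible way to flag the combinations that are probably too censored to fit (the 0.8 cutoff is arbitrary, just an assumption to tune):

data_total |>
  pivot_longer(!1:2, values_transform = list(value = as.character)) |>
  group_by(across(1:3)) |>
  summarize(prop_censored = mean(str_detect(value, "^<"), na.rm = TRUE),
            .groups = "drop") |>
  filter(prop_censored >= 0.8) # hypothetical cutoff: mostly-censored groups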

Still, the problem of automating everything is a bit harder. Thanks so much for your help!! Also, I am from Lima; I'm very glad to find people from Peru here. And again, "muchas gracias".
