How get only 1 value for many repeats values in a column

Hi community

Im have this data with repeats values. The idea is get only a entry for each one. For example when all entry for specific ACC have numeric values only get 1 entry for this and if is all is NA put a values.

The data base have 19.000 rows
Im try with unique but only get a sum.

dats <- data.frame(ACC=c('BRA1290','BRA1290',
                         'CAE401','CAE401',
                         'FLA444-2','FLA444-2','FLA444-2',
                         'PER413-3','PER413-3','PER413-3',
                         'PER426','PER426','PER426',
                         'PER408','PER408','PER408'),
                   core=c('N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N'),
                   lat=c(-14.66670000000000,-14.66670000000000,NA,NA,NA,-15.8,-15.8,NA,-6.8,-6.8,NA,NA,-3.75,NA,NA,NA),
                   long=c(-52.35,-52.35,NA,NA,NA,-48.833300000000001,-48.833300000000001,NA,-63.916699999999999,-63.916699999999999,NA,NA,-73.183300000000003,NA,NA,NA))

head(dats,15)
#>         ACC core      lat     long
#> 1   BRA1290    N -14.6667 -52.3500
#> 2   BRA1290    N -14.6667 -52.3500
#> 3    CAE401    N       NA       NA
#> 4    CAE401    N       NA       NA
#> 5  FLA444-2    N       NA       NA
#> 6  FLA444-2    N -15.8000 -48.8333
#> 7  FLA444-2    N -15.8000 -48.8333
#> 8  PER413-3    N       NA       NA
#> 9  PER413-3    N  -6.8000   -63.9167
#> 10 PER413-3    N -6.8000  -63.9167
#> 11   PER426    N       NA       NA
#> 12   PER426    N       NA       NA
#> 13   PER426    N  -3.7500 -73.1833
#> 14   PER408    N       NA       NA
#> 15   PER408    N       NA       NA
Created on 2023-02-14 by the reprex package (v2.0.1)

data link:

For example, the desire df is something like that:

dats2 <- data.frame(ACC=c('BRA1290','CAE401','FLA444-2','PER413-3','PER426','PER408'),
                    core=c('N','N','N','N','N','N'),
                    lat=c(-14.6667,NA,-15.8000,-6.8000,-3.7500,NA),
                    long=c(-52.3500,NA,-48.8333,-63.9167,-73.1833,NA))
head(dats2)
#>        ACC core      lat     long
#> 1  BRA1290    N -14.6667 -52.3500
#> 2   CAE401    N       NA       NA
#> 3 FLA444-2    N -15.8000 -48.8333
#> 4 PER413-3    N  -6.8000 -63.9167
#> 5   PER426    N  -3.7500 -73.1833
#> 6   PER408    N       NA       NA 

Im try with this:

dats_unique <- distinct(dats, ACC, .keep_all = TRUE)

a bad result
# ACC core      lat   long
# 1  BRA1290    N -14.6667 -52.35
# 2   CAE401    N       NA     NA
# 3 FLA444-2    N       NA     NA
# 4 PER413-3    N       NA     NA
# 5   PER426    N       NA     NA
# 6   PER408    N       NA     NA

Tnks

This gets close to your desired result. I do not understand how you want PER413-3 calculated. It has two lat and two long values and your final answer mixes the two.

dats <- data.frame(ACC=c('BRA1290','BRA1290',
                         'CAE401','CAE401',
                         'FLA444-2','FLA444-2','FLA444-2',
                         'PER413-3','PER413-3','PER413-3',
                         'PER426','PER426','PER426',
                         'PER408','PER408','PER408'),
                   core=c('N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N'),
                   lat=c(-14.66670000000000,-14.66670000000000,NA,NA,NA,-15.8,-15.8,NA,-6.8,-63.916699999999999,NA,NA,-3.75,NA,NA,NA),
                   long=c(-52.35,-52.35,NA,NA,NA,-48.833300000000001,-48.833300000000001,NA,-6.8,-63.916699999999999,NA,NA,-73.183300000000003,NA,NA,NA))

library(purrr)
CheckNA <- function(DF) {
  DF <- unique(DF)
  if(any(!is.na(DF$lat))) DF <- na.omit(DF)
  return(DF)
}
datsUniq <- map_dfr(split(dats, dats$ACC), CheckNA)
datsUniq  
#>        ACC core      lat     long
#> 1  BRA1290    N -14.6667 -52.3500
#> 2   CAE401    N       NA       NA
#> 3 FLA444-2    N -15.8000 -48.8333
#> 4   PER408    N       NA       NA
#> 5 PER413-3    N  -6.8000  -6.8000
#> 6 PER413-3    N -63.9167 -63.9167
#> 7   PER426    N  -3.7500 -73.1833

Created on 2023-02-14 with reprex v2.0.2

1 Like

run length encoding—rle()—along with cumsum will get there.

dats <- data.frame(ACC=c('BRA1290','BRA1290',
                         'CAE401','CAE401',
                         'FLA444-2','FLA444-2','FLA444-2',
                         'PER413-3','PER413-3','PER413-3',
                         'PER426','PER426','PER426',
                         'PER408','PER408','PER408'),
                   core=c('N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N'),
                   lat=c(-14.66670000000000,-14.66670000000000,NA,NA,NA,-15.8,-15.8,NA,-6.8,-63.916699999999999,NA,NA,-3.75,NA,NA,NA),
                   long=c(-52.35,-52.35,NA,NA,NA,-48.833300000000001,-48.833300000000001,NA,-6.8,-63.916699999999999,NA,NA,-73.183300000000003,NA,NA,NA))


dats[cumsum(rle(dats[[1]])$lengths),]
#>         ACC core      lat     long
#> 2   BRA1290    N -14.6667 -52.3500
#> 4    CAE401    N       NA       NA
#> 7  FLA444-2    N -15.8000 -48.8333
#> 10 PER413-3    N -63.9167 -63.9167
#> 13   PER426    N  -3.7500 -73.1833
#> 16   PER408    N       NA       NA

rle() returns the number of times each entry recurs, and the cumsum() of that corresponds to the row indexes first happen. In turn that allows dats to be subset.

1 Like

I do not understand how you want PER413-3

Hi @FJCC, yes I'm check this error when I make the reprex, was my error.
I'm going to check in the original dats. Tnks

Im try with boths code helps but when not all ACC get unique value.
Im goig to put all data.
Im try without space in ACC column but not work.

For example, when I count repeat values FLA444-2 have 2 next to run the codes.

Is good advance because the data pass to get 7000 repeat values to only 17.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.