Having trouble using distm() to add distance between two points

ososoba · May 29, 2021, 5:24pm

I'm working on some practice data and trying to get the distance between two geolocations, then add that as a column to the table.

When I use

all_trips %>% rowwise() %>%
mutate(ride_dist = distm(c(start_lng, start_lat), c(end_lng, end_lat)))

The distance is calculated, but I can't perform any other operation on ride_dist outside of this code snippet; it gives an error that the column is not initialized. And when I run head() I don't see the column either. It feels like it's only there temporarily or something. Is this possible?

However, when I use

all_trips$ride_dist <- distm(c(all_trips$start_lng, all_trips$start_lat), c(all_trips$end_lng, all_trips$end_lat))

I get the error

Error in .pointsToMatrix(x) : Wrong length for a vector, should be 2

I'd like to know how I resolve either of these issues. The data structure can be found here.

technocrat · May 29, 2021, 6:52pm

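Two things are going on. The pipeline in the question never assigns its result back to all_trips, so the new column is only printed. And under rowwise the value returned by distm is stored as a one-column matrix (the num [1:6, 1] in the str() output below) rather than as a plain numeric vector.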

suppressPackageStartupMessages({
  library(dplyr)
  library(geosphere)
  library(magrittr)
})

all_trips <- data.frame(
  start_lng =
    c(-87.666058, -87.666058, -87.63110067, -87.672069, -87.6258275, -87.62025317),
  start_lat =
    c(42.012701, 42.012701, 41.88579467, 41.895634, 41.8347335, 41.89580767),
  end_lng =
    c(-87.661406, -87.669563, -87.62749767, -87.673935, -87.6451235, -87.63197917),
  end_lat =
    c(42.004583, 42.019537, 41.884866, 41.903119, 41.83816333, 41.89488583)
)

all_trips %<>%
  rowwise() %>%
  mutate(ride_dist = distm(c(start_lng, start_lat), c(end_lng, end_lat)))

str(all_trips)
#> rowwise_df [6 × 5] (S3: rowwise_df/tbl_df/tbl/data.frame)
#>  $ start_lng: num [1:6] -87.7 -87.7 -87.6 -87.7 -87.6 ...
#>  $ start_lat: num [1:6] 42 42 41.9 41.9 41.8 ...
#>  $ end_lng  : num [1:6] -87.7 -87.7 -87.6 -87.7 -87.6 ...
#>  $ end_lat  : num [1:6] 42 42 41.9 41.9 41.8 ...
#>  $ ride_dist: num [1:6, 1] 981 813 316 846 1647 ...
#>  - attr(*, "groups")= tibble [6 × 1] (S3: tbl_df/tbl/data.frame)
#>   ..$ .rows: list<int> [1:6] 
#>   .. ..$ : int 1
#>   .. ..$ : int 2
#>   .. ..$ : int 3
#>   .. ..$ : int 4
#>   .. ..$ : int 5
#>   .. ..$ : int 6
#>   .. ..@ ptype: int(0)

all_trips[5]
#> # A tibble: 6 x 1
#> # Rowwise: 
#>   ride_dist[,1]
#>           <dbl>
#> 1          981.
#> 2          813.
#> 3          316.
#> 4          846.
#> 5         1647.
#> 6          978.

The path of least resistance

suppressPackageStartupMessages({
  library(dplyr)
  library(geosphere)
  library(magrittr)
})

all_trips <- data.frame(
  start_lng =
    c(-87.666058, -87.666058, -87.63110067, -87.672069, -87.6258275, -87.62025317),
  start_lat =
    c(42.012701, 42.012701, 41.88579467, 41.895634, 41.8347335, 41.89580767),
  end_lng =
    c(-87.661406, -87.669563, -87.62749767, -87.673935, -87.6451235, -87.63197917),
  end_lat =
    c(42.004583, 42.019537, 41.884866, 41.903119, 41.83816333, 41.89488583)
)

sav_trips <- all_trips

all_trips %>%
  rowwise() %>%
  mutate(ride_dist = distm(c(start_lng, start_lat), c(end_lng, end_lat)))
#> # A tibble: 6 x 5
#> # Rowwise: 
#>   start_lng start_lat end_lng end_lat ride_dist[,1]
#>       <dbl>     <dbl>   <dbl>   <dbl>         <dbl>
#> 1     -87.7      42.0   -87.7    42.0          981.
#> 2     -87.7      42.0   -87.7    42.0          813.
#> 3     -87.6      41.9   -87.6    41.9          316.
#> 4     -87.7      41.9   -87.7    41.9          846.
#> 5     -87.6      41.8   -87.6    41.8         1647.
#> 6     -87.6      41.9   -87.6    41.9          978.

all_trips %<>% rowwise() %>%
  mutate(ride_dist = distm(c(start_lng, start_lat), c(end_lng, end_lat))) 

ride_dist <- all_trips[5]

all_trips <- tibble::add_column(sav_trips,ride_dist)
all_trips
#>   start_lng start_lat   end_lng  end_lat ride_dist
#> 1 -87.66606  42.01270 -87.66141 42.00458  980.5928
#> 2 -87.66606  42.01270 -87.66956 42.01954  812.9084
#> 3 -87.63110  41.88579 -87.62750 41.88487  316.3360
#> 4 -87.67207  41.89563 -87.67394 41.90312  845.6658
#> 5 -87.62583  41.83473 -87.64512 41.83816 1647.4263
#> 6 -87.62025  41.89581 -87.63198 41.89489  978.4702
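
On the second error in the question ("Wrong length for a vector, should be 2"): c() concatenates the longitude and latitude columns into one long vector, while distm expects either a single lon/lat pair or a two-column matrix. The following is a minimal sketch of one way around it, not a tested drop-in for the full data. It assumes the same column names as the reprex and uses geosphere's distGeo(), which pairs row i of the first matrix with row i of the second.

```r
library(geosphere)

# Same six sample trips as in the reprex above
all_trips <- data.frame(
  start_lng =
    c(-87.666058, -87.666058, -87.63110067, -87.672069, -87.6258275, -87.62025317),
  start_lat =
    c(42.012701, 42.012701, 41.88579467, 41.895634, 41.8347335, 41.89580767),
  end_lng =
    c(-87.661406, -87.669563, -87.62749767, -87.673935, -87.6451235, -87.63197917),
  end_lat =
    c(42.004583, 42.019537, 41.884866, 41.903119, 41.83816333, 41.89488583)
)

# c() glues the two columns into one vector of length 2 * nrow, hence the error
length(c(all_trips$start_lng, all_trips$start_lat))
#> [1] 12

# cbind() keeps longitude and latitude as two columns, one row per trip
starts <- cbind(all_trips$start_lng, all_trips$start_lat)
ends   <- cbind(all_trips$end_lng, all_trips$end_lat)

# distGeo() works element-wise and returns a plain numeric vector (metres)
all_trips$ride_dist <- distGeo(starts, ends)
```

Because this is a single vectorized call with no rowwise step, ride_dist comes back as an ordinary numeric column. It may also run faster on large tables, though that was not timed here.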

ososoba · May 29, 2021, 8:04pm

Thanks a lot for this response; I understand the path you took. If you don't mind explaining, why does this code have to be run twice?

I've confirmed the table doesn't have the data at first when I call head() but has it after the second time. Is there a concept I'm missing?

Secondly, this script takes some time to run on the data I'm working with, about 3-5 minutes on the ride_dist section. Is that expected? Or is there an alternative way to make this faster?

Thank you!

technocrat · May 29, 2021, 10:06pm

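I'm not sure of the internals, but rowwise creates ride_dist not as a plain numeric variable but as a one-column matrix (the num [1:6, 1] in the str() output), and the tibble stays grouped by row. The first call in my code only prints the result; the second, with %<>%, assigns it back to all_trips, which is why the column appears only after that one. Piping to ungroup() drops the rowwise grouping, which renders my workaround unnecessary.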
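
A minimal sketch of the ungroup() route, assuming the same column names and using only the first two sample trips from the reprex. The as.numeric() wrapper is an addition for illustration, not part of the original code; it flattens the 1 x 1 matrix that distm returns for each row.

```r
suppressPackageStartupMessages({
  library(dplyr)
  library(geosphere)
})

# First two sample trips from the reprex
all_trips <- data.frame(
  start_lng = c(-87.666058, -87.666058),
  start_lat = c(42.012701, 42.012701),
  end_lng   = c(-87.661406, -87.669563),
  end_lat   = c(42.004583, 42.019537)
)

# Assign the result back; a bare pipeline only prints the new column
all_trips <- all_trips %>%
  rowwise() %>%
  # as.numeric() drops the matrix shape of distm()'s return value
  mutate(ride_dist = as.numeric(
    distm(c(start_lng, start_lat), c(end_lng, end_lat))
  )) %>%
  # ungroup() removes the rowwise grouping before any further work
  ungroup()

# ride_dist is now an ordinary numeric column
mean(all_trips$ride_dist)
```

After this, all_trips is a regular tibble, so later steps such as head() or summarise() see ride_dist like any other column.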

To illustrate the problem, I narrowed the variables to those chosen and the rows to just a few. I noticed that distm seemed to take an inordinate time. The body of distm itself is unexceptionable. It calls .pointsToMatrix internally, whose origin is obscure and which does a lot of error checking before getting down to business:


	if (poly) {
		if (! isTRUE(all.equal(p[1,], p[nrow(p),]))) {
			p <- rbind(p, p[1,])
		} 

		i <- p[-nrow(p),1] == p[-1,1] &  p[-nrow(p),2] == p[-1,2]
		i <- which(isTRUE(i))
		if (length(i) > 0) {
			p <- p[-i, ,drop=FALSE]
		}
	
		.isPolygon(p)
	}

Then .isPolygon must be sought in helper.R (geosphere/R/helper.R in the cran/geosphere repository on GitHub)

.isPolygon <- function(x, fix=FALSE) {
	x <- stats::na.omit(x)
	if (nrow(x) < 4) {
		stop('this is not a polygon (insufficent number of vertices)')
	}
	if (length(unique(x[,1]))==1) {
		stop('All longitudes are the same (not a polygon)')
	}
	if (length(unique(x[,2]))==1) {
		stop('All latitudes are the same (not a polygon)')
	}
	if (! all(!(is.na(x))) ) {
		stop('polygon has NA values)')
	}
	if (! isTRUE(all.equal(x[1,], x[nrow(x),]))) {
		stop('this is not a valid (closed) polygon. The first vertex is not equal to the last vertex')	
	}
	return(x)
}

which removes NAs and then does a lot more error checking.

On the other hand, I ran the whole data set through the following {sf} script, and it took 2.8 minutes on my moderately competent Ubuntu laptop. Running with {future} didn't improve results.

suppressPackageStartupMessages({
  library(dplyr)
  library(future)
  library(geosphere)
  library(magrittr)
  library(sf)
  library(tictoc)
})

tic()
readr::read_csv("~/Desktop/grist.csv") -> all_trips

skimr::skim(all_trips)

# remove the 214 records with a missing end_lat
# (logical indexing stays safe even when nothing is missing,
# whereas negative indexing with an empty which() drops every row)

all_trips <- all_trips[!is.na(all_trips$end_lat), ]

starts <- tibble(pnt = "start",
                 lat = all_trips$start_lat,
                 lon = all_trips$start_lng)
                    
ends <- tibble(pnt = "end",
                 lat = all_trips$end_lat,
                 lon = all_trips$end_lng)

start_sf <- st_as_sf(starts, coords = c('lon','lat'))
end_sf   <- st_as_sf(ends,   coords = c('lon','lat'))
st_crs(start_sf) = 4326
st_crs(end_sf) = 4326

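# %<-% is the future assignment operator from {future}
distances %<-% st_distance(start_sf, end_sf, by_element = TRUE)

all_trips["distance"] <- distances
toc() # report the time elapsed since tic()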

ososoba · May 30, 2021, 6:02pm

Thanks a lot for your response. I'll try out the sf method you suggested and compare it to the original.
