Having trouble using distm() to add distance between two points

I'm working on some practice data and trying to get the distance between two geo locations, then add that as a column to the table.

When I use

all_trips %>% rowwise() %>%
mutate(ride_dist = distm(c(start_lng, start_lat), c(end_lng, end_lat)))

The distance is calculated, but I can't perform any other operation on ride_dist outside of this code snippet; it gives an error that the column is not initialized. And when I run head(), I don't see the column either. It feels like it's only there temporarily or something. Is this possible?

However, when I use

all_trips$ride_dist <- distm(c(all_trips$start_lng, all_trips$start_lat), c(all_trips$end_lng, all_trips$end_lat))

I get the error

Error in .pointsToMatrix(x) : Wrong length for a vector, should be 2

I'd like to know how I resolve either of these issues. The data structure can be found here.

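The culprit is partly rowwise() and partly distm(). The result is there once the pipeline is assigned back to all_trips (done below with %<>%), but distm() returns a matrix, so ride_dist is stored as a one-column matrix (note the num [1:6, 1] in the str() output) rather than as a plain numeric vector.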

suppressPackageStartupMessages({
  library(dplyr)
  library(geosphere)
  library(magrittr)
})

all_trips <- data.frame(
  start_lng =
    c(-87.666058, -87.666058, -87.63110067, -87.672069, -87.6258275, -87.62025317),
  start_lat =
    c(42.012701, 42.012701, 41.88579467, 41.895634, 41.8347335, 41.89580767),
  end_lng =
    c(-87.661406, -87.669563, -87.62749767, -87.673935, -87.6451235, -87.63197917),
  end_lat =
    c(42.004583, 42.019537, 41.884866, 41.903119, 41.83816333, 41.89488583)
)

all_trips %<>%
  rowwise() %>%
  mutate(ride_dist = distm(c(start_lng, start_lat), c(end_lng, end_lat)))

str(all_trips)
#> rowwise_df [6 × 5] (S3: rowwise_df/tbl_df/tbl/data.frame)
#>  $ start_lng: num [1:6] -87.7 -87.7 -87.6 -87.7 -87.6 ...
#>  $ start_lat: num [1:6] 42 42 41.9 41.9 41.8 ...
#>  $ end_lng  : num [1:6] -87.7 -87.7 -87.6 -87.7 -87.6 ...
#>  $ end_lat  : num [1:6] 42 42 41.9 41.9 41.8 ...
#>  $ ride_dist: num [1:6, 1] 981 813 316 846 1647 ...
#>  - attr(*, "groups")= tibble [6 × 1] (S3: tbl_df/tbl/data.frame)
#>   ..$ .rows: list<int> [1:6] 
#>   .. ..$ : int 1
#>   .. ..$ : int 2
#>   .. ..$ : int 3
#>   .. ..$ : int 4
#>   .. ..$ : int 5
#>   .. ..$ : int 6
#>   .. ..@ ptype: int(0)

all_trips[5]
#> # A tibble: 6 x 1
#> # Rowwise: 
#>   ride_dist[,1]
#>           <dbl>
#> 1          981.
#> 2          813.
#> 3          316.
#> 4          846.
#> 5         1647.
#> 6          978.

The path of least resistance

suppressPackageStartupMessages({
  library(dplyr)
  library(geosphere)
  library(magrittr)
})

all_trips <- data.frame(
  start_lng =
    c(-87.666058, -87.666058, -87.63110067, -87.672069, -87.6258275, -87.62025317),
  start_lat =
    c(42.012701, 42.012701, 41.88579467, 41.895634, 41.8347335, 41.89580767),
  end_lng =
    c(-87.661406, -87.669563, -87.62749767, -87.673935, -87.6451235, -87.63197917),
  end_lat =
    c(42.004583, 42.019537, 41.884866, 41.903119, 41.83816333, 41.89488583)
)

sav_trips <- all_trips

all_trips %>%
  rowwise() %>%
  mutate(ride_dist = distm(c(start_lng, start_lat), c(end_lng, end_lat)))
#> # A tibble: 6 x 5
#> # Rowwise: 
#>   start_lng start_lat end_lng end_lat ride_dist[,1]
#>       <dbl>     <dbl>   <dbl>   <dbl>         <dbl>
#> 1     -87.7      42.0   -87.7    42.0          981.
#> 2     -87.7      42.0   -87.7    42.0          813.
#> 3     -87.6      41.9   -87.6    41.9          316.
#> 4     -87.7      41.9   -87.7    41.9          846.
#> 5     -87.6      41.8   -87.6    41.8         1647.
#> 6     -87.6      41.9   -87.6    41.9          978.

all_trips %<>% rowwise() %>%
  mutate(ride_dist = distm(c(start_lng, start_lat), c(end_lng, end_lat))) 

ride_dist <- all_trips[5]

all_trips <- tibble::add_column(sav_trips,ride_dist)
all_trips
#>   start_lng start_lat   end_lng  end_lat ride_dist
#> 1 -87.66606  42.01270 -87.66141 42.00458  980.5928
#> 2 -87.66606  42.01270 -87.66956 42.01954  812.9084
#> 3 -87.63110  41.88579 -87.62750 41.88487  316.3360
#> 4 -87.67207  41.89563 -87.67394 41.90312  845.6658
#> 5 -87.62583  41.83473 -87.64512 41.83816 1647.4263
#> 6 -87.62025  41.89581 -87.63198 41.89489  978.4702

Thanks a lot for this response, I understand the path you took. If you don't mind explaining, why does this code have to be repeated?

I've confirmed the table doesn't have the data at first when I call head() but has it after the second time, is there a concept I'm missing?

Secondly, this script takes some time to run on the data I'm working with, about 3-5 minutes on the ride_dist section. Is that expected? Or is there an alternative way to make this faster?

Thank you!

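I'm not sure of the internals, but the first call only prints its result; nothing is stored until the pipeline is assigned back to all_trips, which is what the second call with %<>% does. Because distm() returns a matrix, ride_dist then comes back as a one-column matrix inside a rowwise tibble rather than as an ordinary numeric column. Wrapping the distm() call in as.numeric() and piping to ungroup() sets things right, which renders my workaround unnecessary.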
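A minimal sketch of that ungroup() route, using two of the sample rows from above. The name trips is just an illustrative placeholder, and wrapping distm() in as.numeric() is my own suggestion for dropping the matrix dimensions, not something the original code did.

```r
library(dplyr)
library(geosphere)

# Two of the sample trips from the reprex above
trips <- data.frame(
  start_lng = c(-87.666058, -87.63110067),
  start_lat = c(42.012701, 41.88579467),
  end_lng   = c(-87.661406, -87.62749767),
  end_lat   = c(42.004583, 41.884866)
)

# Assign the result back so the new column persists
trips <- trips %>%
  rowwise() %>%
  # as.numeric() turns the 1 x 1 matrix from distm() into a plain number
  mutate(ride_dist = as.numeric(
    distm(c(start_lng, start_lat), c(end_lng, end_lat))
  )) %>%
  # ungroup() drops the rowwise grouping so later verbs act on the whole table
  ungroup()

str(trips$ride_dist)
```

After this, ride_dist should behave like any other numeric column in later mutate() or summarise() calls.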

To illustrate the problem, I narrowed the variables to those chosen and the rows to just a few. I noticed that distm seemed to take an inordinate time. The body of distm itself is unexceptionable. It calls .pointsToMatrix internally, a helper whose origin is obscure and which does a lot of error checking before getting down to business:


	if (poly) {
		if (! isTRUE(all.equal(p[1,], p[nrow(p),]))) {
			p <- rbind(p, p[1,])
		} 

		i <- p[-nrow(p),1] == p[-1,1] &  p[-nrow(p),2] == p[-1,2]
		i <- which(isTRUE(i))
		if (length(i) > 0) {
			p <- p[-i, ,drop=FALSE]
		}
	
		.isPolygon(p)
	}

Then .isPolygon must be sought in helper.R (geosphere/R/helper.R in the cran/geosphere repository on GitHub):

.isPolygon <- function(x, fix=FALSE) {
	x <- stats::na.omit(x)
	if (nrow(x) < 4) {
		stop('this is not a polygon (insufficent number of vertices)')
	}
	if (length(unique(x[,1]))==1) {
		stop('All longitudes are the same (not a polygon)')
	}
	if (length(unique(x[,2]))==1) {
		stop('All latitudes are the same (not a polygon)')
	}
	if (! all(!(is.na(x))) ) {
		stop('polygon has NA values)')
	}
	if (! isTRUE(all.equal(x[1,], x[nrow(x),]))) {
		stop('this is not a valid (closed) polygon. The first vertex is not equal to the last vertex')	
	}
	return(x)
}

which removes NAs and then does a lot more error checking.
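If the per-row calls are where the time goes, one thing worth trying is to skip rowwise() and distm() altogether and hand whole coordinate matrices to one of geosphere's vectorised distance functions. The sketch below uses distHaversine(), which is my choice for illustration; distm() may default to a different function depending on your geosphere version, so the numbers can differ slightly from the ones shown earlier. The names p_start and p_end are placeholders. Note that cbind() is what keeps longitude and latitude as two columns; c() glues them into a single long vector, which is what triggers the "Wrong length for a vector, should be 2" error in the original question.

```r
library(geosphere)

all_trips <- data.frame(
  start_lng =
    c(-87.666058, -87.666058, -87.63110067, -87.672069, -87.6258275, -87.62025317),
  start_lat =
    c(42.012701, 42.012701, 41.88579467, 41.895634, 41.8347335, 41.89580767),
  end_lng =
    c(-87.661406, -87.669563, -87.62749767, -87.673935, -87.6451235, -87.63197917),
  end_lat =
    c(42.004583, 42.019537, 41.884866, 41.903119, 41.83816333, 41.89488583)
)

# Two-column (longitude, latitude) matrices, one row per trip
p_start <- cbind(all_trips$start_lng, all_trips$start_lat)
p_end   <- cbind(all_trips$end_lng,   all_trips$end_lat)

# One vectorised call: row i of p_start is paired with row i of p_end
all_trips$ride_dist <- distHaversine(p_start, p_end)

all_trips
```

This makes a single call for the whole table instead of one call per row, so I would expect it to scale much better, though I have not timed it on the full data.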

On the other hand, I ran the whole data set through the following {sf} script, and it took 2.8 minutes on my moderately competent Ubuntu laptop. Running with {future} didn't improve results.

suppressPackageStartupMessages({
  library(dplyr)
  library(future)
  library(geosphere)
  library(magrittr)
  library(sf)
  library(tictoc)
})

tic()
readr::read_csv("~/Desktop/grist.csv") -> all_trips

skimr::skim(all_trips)

# remove the 214 records with a missing end latitude
# (logical indexing stays safe even when nothing is missing)

all_trips <- all_trips[!is.na(all_trips$end_lat), ]

starts <- tibble(pnt = "start",
                 lat = all_trips$start_lat,
                 lon = all_trips$start_lng)
                    
ends <- tibble(pnt = "end",
                 lat = all_trips$end_lat,
                 lon = all_trips$end_lng)

start_sf <- st_as_sf(starts, coords = c('lon','lat'))
end_sf   <- st_as_sf(ends,   coords = c('lon','lat'))
st_crs(start_sf) = 4326
st_crs(end_sf) = 4326

# %<-% is the {future} assignment; with no plan() set it evaluates sequentially
distances %<-% st_distance(start_sf,end_sf, by_element = TRUE)

all_trips["distance"] <- distances
toc()
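For anyone trying the {sf} route on a small sample first, here is a cut-down sketch with two of the trips from earlier in the thread. It assumes the coordinates are WGS 84 longitude/latitude (EPSG:4326), as in the script above. st_distance() returns a units vector, so the as.numeric() step at the end is an optional extra of mine for when a plain numeric column is easier to work with.

```r
library(sf)

# Two sample trips; column names lon/lat follow the script above
starts <- data.frame(lon = c(-87.666058, -87.63110067),
                     lat = c(42.012701, 41.88579467))
ends   <- data.frame(lon = c(-87.661406, -87.62749767),
                     lat = c(42.004583, 41.884866))

# Build point geometries and declare the CRS in one step
start_sf <- st_as_sf(starts, coords = c("lon", "lat"), crs = 4326)
end_sf   <- st_as_sf(ends,   coords = c("lon", "lat"), crs = 4326)

# by_element = TRUE pairs row i with row i instead of building a full matrix
d <- st_distance(start_sf, end_sf, by_element = TRUE)

# Drop the units class to get a plain numeric vector
ride_dist <- as.numeric(d)
ride_dist
```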

Thanks a lot for your response. I'll try out the sf method you suggested and compare it to the original.
