I'm preparing a talk about R performance, and I'm including a section that benchmarks a handful of methods/packages purely on the task of reading in a 1.6 GB tab-delimited file. I thought it would be interesting to include using reticulate to read the file with Python's pandas package.
Out of curiosity, I also ran it straight from Python. I wasn't surprised that there was a difference in how long it took, but I was surprised that reticulate (~50 seconds, using microbenchmark, avg of 5 passes) was so much slower than Python (~35 seconds, using timeit.timeit, avg of 5 passes). Could anyone explain why the overhead is so high? I am curious myself, but I'm also anticipating questions from the audience. I used Python 3.6.8 for both, and reticulate 1.12. Thanks!
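library(reticulate)
library(microbenchmark)

# Point reticulate at the same Python installation used for the REPL run
use_python("/anaconda3/bin/python")

# Time 5 reads of the file through reticulate; pandas is imported in setup
reticulate_bench <- microbenchmark(
  reticulate_tab <- pd$read_table(filepath_or_buffer = "Brain_Amygdala.truncated.txt", sep = "\t"),
  times = 5,
  setup = pd <- import("pandas")
)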
Python REPL version
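import timeit
import pandas

# Note: timeit.timeit returns the total time for all `number` runs,
# so divide by 5 to get the per-run average.
pandas_bench = timeit.timeit(
    'pandas_tab = pandas.read_table("Brain_Amygdala.truncated.txt", sep="\\t")',
    number=5,
    setup='import pandas',
)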
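Since microbenchmark reports each pass separately while timeit.timeit returns one total, here is a minimal sketch of how the Python side could be timed per run with timeit.repeat, which may make the two numbers easier to compare. The helper name per_run_times is hypothetical, and the file name is simply the one from my benchmark above.

```python
import statistics
import timeit


def per_run_times(func, repeat=5):
    """Return a list of `repeat` timings in seconds, one call per timing."""
    # number=1 means each returned value is the duration of a single call,
    # unlike timeit.timeit(number=5), which returns the total for 5 calls.
    return timeit.repeat(func, number=1, repeat=repeat)


if __name__ == "__main__":
    import pandas  # imported up front so import time is not measured

    times = per_run_times(
        lambda: pandas.read_table("Brain_Amygdala.truncated.txt", sep="\t")
    )
    print("per-run seconds:", times)
    print("mean:", statistics.mean(times), "min:", min(times))
```

Looking at the individual runs rather than one total also shows whether the first pass is slower than the rest, for example because the file was not yet in the OS cache.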