Duplicates created by map_df(~ read_csv...), for which R can only detect a handful

Sidenote: those ID numbers are reeeeaaaaaally long (19 digits, I count). So long, in fact, that R's integer type probably can't hold them, and the double class (which is definitely not appropriate for storing ID numbers) can only represent them to about 15 to 17 significant figures:

The 53-bit significand precision gives from 15 to 17 significant decimal digits precision.

If read_csv is reading these ID numbers in as doubles, you're likely losing some of those figures in the import process, which could explain why you're getting so few duplicates. I would've thought you'd get an error or warning if that was happening, but it's hard to tell without seeing what the data frame, read into R, looks like.

If you can't make these ID numbers smaller and you need to use them to find duplicates, maybe have read_csv read them in as strings using them col_types argument :slight_smile:

EDIT: Excel also uses double precision, so if you read your data into Excel and let Excel assume this column is numeric, it'll also probably ruin the numbers.

4 Likes

Correct, these are Twitter IDs and so can't be changed, at least not without creating other issues. Thanks for the tip on strings, I will try this. I think the workflow without Excel makes sense, it's just that I originally collected 3 months worth of tweets in Excel format. If I start collecting again, I'll collect 7 days worth in a .csv, then use purrr to combine them all in a tibble.

1 Like

Having to collect your data again sucks :cry: But hopefully if you stick to handling the IDs as strings, you'll have less problems :slight_smile: The inconsistency between Excel and R re. duplicate counts might also be related to this, if you were inspecting the data in Excel, saving and then analysing further in R! It's hard to test this in a consistent way, because loss of precision tends not to be very consistent :confused:

2 Likes

The workflow works fine with .csv and purr now, also automated the scrip with Windows task scheduler to run weekly. Thanks everyone for the assistance.

1 Like

If your question's been answered (even if by you), would you mind choosing a solution? (See FAQ below for how).

Having questions checked as resolved makes it a bit easier to navigate the site visually and see which threads still need help.

Thanks

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.