Duplicates created by map_df(~ read_csv...), of which R can only detect a handful

#7

I don't have a direct answer to your question, but I find the janitor package really useful here. It has a get_dupes() function that you can use to explore duplicate observations in your data. The function is documented in the package's README.
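
For example (just a sketch; I'm assuming your combined data lives in a data frame called `tweets` with an `id` column):

```r
library(dplyr)
library(janitor)

# All rows whose id occurs more than once, plus a dupe_count column:
tweets %>% get_dupes(id)

# Or check for rows duplicated across every column:
tweets %>% get_dupes()
```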

I suspect that you have some variables containing data that Excel and R treat differently. This is just a hunch, but perhaps there are dates or double values that could be the source of this? Something to keep an eye out for as you check your data.

#8

He is passing an optional variable argument (id) to distinct(), so I think the other variables are not considered. So whatever is going on, it is going on with the id variable. The first error message also hints at a problem with id.

What I suspect is that the variable id got coerced to factors upon import by some base R function (such as read.csv()) and what you are comparing are not the actual id values but the factor levels. Converting your factors to characters before running distinct() might do the trick.
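
If that is what happened, something like this might do it (a sketch; I'm assuming the combined data frame is called `tweets`):

```r
library(dplyr)

tweets %>%
  mutate(id = as.character(id)) %>%   # compare the actual values, not factor codes
  distinct(id, .keep_all = TRUE)
```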

The OP is using read_csv() though... and the factor error message appears when he tries read.csv() instead. So maybe something else is going on with id.

@chalg, if you give us some info on your data, it will be easier to try to understand what is going on.

#9

Oh this makes perfect sense @prosoitos. I had read that part earlier but it slipped my mind. Sorry to confuse things @chalg! Keep the janitor package in your back pocket for next time.

#10

Well, I thought so, until I re-read the code at the very top... (see the end of my previous message which I added while you were typing this).

Maybe there is something weird going on when passing read_csv() to map().

#11

Hmmm... read_csv() allows you to define column types. One way to fiddle around with this could be to force id to be imported as a character vector. The col_types argument is what you’d want to modify here.
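
Something along these lines (a sketch; the file name is a placeholder):

```r
library(readr)

# Read id as character; the other columns are still guessed as usual
tweets <- read_csv(
  "tweets_week_1.csv",
  col_types = cols(id = col_character())
)
```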

#12

True, but none of the tidyverse functions coerce anything into factors (unless you specifically ask them to).

But the workflow is quite complex here. And maybe id was turned into factors in the first import (if it was done with read.csv()).

Anyway, I think that we need more info :smile:

#13

Ok, thanks for all your help. I will try the col_types argument tonight after work to import id as a character vector. Note: "In bind_rows_(x, .id) : Unequal factor levels: coercing to character" was a warning from read.csv(), not an error.

#14

Got yah @chalg - let us know how it goes!

#15

Is there a way of stopping Excel from changing the id column, for example from 1026314554563640000 to 1026310000000000000? This is the problem I try to get around by formatting the id column in Excel before saving as .csv (prior to importing into R), which causes additional problems when trying to combine the data with purrr functions.

I store the data in .xlsx format because you can only collect 7 days of Twitter data at a time, and I have around 3 months' worth in the .xlsx file.

The current workaround is to append new data to the historical data in one file via Excel, format the id column, then save as a .csv and import into R.
Alternatively, if I create two .csv files and apply the same formatting to the id field in both, then R is able to detect the same duplicates as Excel after importing both files with a purrr function. However, this involves a manual workflow, which I was trying to avoid.

#16

What is the purpose of Excel in your workflow? Wouldn't it be possible to save to csv instead? This way Excel won't mangle your ID column.

#17

It seems that when you open a csv and save, Excel mangles the ID column. The reason I use Excel is to store the historical data in .xlsx format, which leaves the id alone, i.e. you can open and save with no issues.

#18

I'm still not sure of the workflow. To me it sounds like this:

  1. Get data from Twitter (via R, I presume?).
  2. Save data to csv (not from Excel, but from R).
  3. Read data from csv directly to R.
  4. Remove duplicates.
  5. Save cleaned data to csv.

If you want to use Excel, then you don't need csvs at all (see the sketch after the list):

  1. Get data from Twitter.
  2. Save data to xlsx directly (with writexl).
  3. Read data from xlsx directly to R (with readxl, for example).
  4. Remove duplicates.
  5. Save cleaned data to xlsx.
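
In code, that second workflow might look roughly like this (a sketch; the data frame and file names are placeholders):

```r
library(writexl)
library(readxl)
library(dplyr)

# 2. Save the freshly collected data straight to xlsx
write_xlsx(tweets, "tweets_week_1.xlsx")

# 3. Read it back into R, with Excel never touching the file
tweets_in <- read_excel("tweets_week_1.xlsx")

# 4. Remove duplicates on id
tweets_clean <- distinct(tweets_in, id, .keep_all = TRUE)

# 5. Save the cleaned data to xlsx
write_xlsx(tweets_clean, "tweets_clean.xlsx")
```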

#19

Thanks, I'll test out those libraries and see what happens.

#20

On the off chance this isn't clear... just because Excel "claims" CSV files (by putting its icon on them and setting itself as the default application that opens when you double-click one) doesn't mean a CSV is in any way dependent on Excel. It's just a plain text file with a certain agreed-upon layout. You can open it up in any plain text editor, or even directly in RStudio. A CSV isn't particularly nice to read on its own, but it's not a good idea to go tweaking it by hand anyway.

So if you want to keep your ID values from getting mangled, you don't need to work with xlsx files (though of course you might choose to do so for other reasons) — you can just cut Excel out of your workflow completely as @mishabalyasin suggested.

#21

Sidenote: those ID numbers are reeeeaaaaaally long (19 digits, I count). So long, in fact, that R's integer type probably can't hold them, and the double class (which is definitely not appropriate for storing ID numbers) can only represent them to about 15 to 17 significant figures:

The 53-bit significand precision gives from 15 to 17 significant decimal digits precision.

If read_csv is reading these ID numbers in as doubles, you're likely losing some of those figures in the import process, which could explain why you're getting so few duplicates. I would've thought you'd get an error or warning if that was happening, but it's hard to tell without seeing what the data frame, read into R, looks like.
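
Here's a quick illustration with a made-up 19-digit id (not taken from your data):

```r
x <- 1026314554563640123        # parsed as a double: base R has no 64-bit integer type
sprintf("%.0f", x)
#> [1] "1026314554563640064"     # the trailing digits are rounded to the nearest double

# Kept as a character string, the id survives intact:
id <- "1026314554563640123"
id
#> [1] "1026314554563640123"
```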

If you can't make these ID numbers smaller and you need to use them to find duplicates, maybe have read_csv read them in as strings using the col_types argument :slight_smile:

EDIT: Excel also uses double precision, so if you read your data into Excel and let Excel assume this column is numeric, it'll also probably ruin the numbers.

#22

Correct, these are Twitter IDs and so can't be changed, at least not without creating other issues. Thanks for the tip on strings, I will try this. I think the workflow without Excel makes sense; it's just that I originally collected 3 months' worth of tweets in Excel format. If I start collecting again, I'll collect 7 days' worth in a .csv, then use purrr to combine them all into a tibble.
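
Something like this is what I have in mind (folder and file pattern are just placeholders):

```r
library(readr)
library(purrr)
library(dplyr)

# One csv per 7-day collection, all in the same folder
files <- list.files("tweet_data", pattern = "\\.csv$", full.names = TRUE)

tweets <- map_df(files, ~ read_csv(.x, col_types = cols(id = col_character()))) %>%
  distinct(id, .keep_all = TRUE)
```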

#23

Having to collect your data again sucks :cry: But hopefully if you stick to handling the IDs as strings, you'll have fewer problems :slight_smile: The inconsistency between Excel and R re. duplicate counts might also be related to this, if you were inspecting the data in Excel, saving, and then analysing further in R! It's hard to test this in a consistent way, because loss of precision tends not to be very consistent :confused:

#24

The workflow works fine with .csv and purrr now, and I've also automated the script with Windows Task Scheduler to run weekly. Thanks everyone for the assistance.

#25

If your question's been answered (even if by you), would you mind choosing a solution? (See FAQ below for how).

Having questions checked as resolved makes it a bit easier to navigate the site visually and see which threads still need help.

Thanks

closed #26

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.
