Efficient code for editing values in a data frame

Egonomist · November 21, 2020, 7:18pm

Good afternoon,
I have a rather embarrassingly basic need for which I cannot seem to find a good and efficient solution.

I have a large data frame in which I periodically need to change specific values, as errors are found through human observation (e.g. "for the security with ISIN DK0030170444, please change the Coupon Rate to 6.5").

Currently, I have been doing this by progressively adding mutate commands, each addressing one correction. I am quickly realizing this is not smart: everything is piecemeal, and I have no database recording which observations needed corrections, which variables were corrected, and what the values were changed to.

So, I started looking for a systematic solution where I could build and maintain data about the changes to make and simply add to it. For example, if my data looks like this:

  ISIN Issuer_Name Ticker Issue_Year Maturity
1  DK0 Danske Bank DANBNK       2009        5
2  LL4      LLoyds   LLOY       2009       10
3  XS0   UniCredit  UCGIM       2010       NA

I would like to be able to maintain some data structure with the ISIN of the security that needs a change, the variable that needs changing, and the value to be substituted, for example:

('DK0' , 'Issue_Year', 2010)

... and then have code that simply reads this data and makes the changes in the data frame, so that each time a new correction is necessary I can just add one extra row to the data and re-execute the code to fix the newly found issue.
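A minimal base R sketch of the idea described above: a small "correction log" data frame with one row per fix, and a loop that applies it. The names `bonds`, `corrections`, and `apply_corrections` are hypothetical, and the second correction (`XS0`, `Maturity`, 7) is made up purely for illustration. This version assumes every replacement value is numeric, because a single `value` column can hold only one type.

```r
# Example data from the post (object name `bonds` is hypothetical).
bonds <- data.frame(
  ISIN        = c("DK0", "LL4", "XS0"),
  Issuer_Name = c("Danske Bank", "LLoyds", "UniCredit"),
  Ticker      = c("DANBNK", "LLOY", "UCGIM"),
  Issue_Year  = c(2009, 2009, 2010),
  Maturity    = c(5, 10, NA),
  stringsAsFactors = FALSE
)

# Correction log: one row per fix. The first row is the example from the
# post; the second row is invented for illustration only.
corrections <- data.frame(
  ISIN     = c("DK0", "XS0"),
  variable = c("Issue_Year", "Maturity"),
  value    = c(2010, 7),
  stringsAsFactors = FALSE
)

# Apply each logged correction to the matching row and column.
apply_corrections <- function(df, log) {
  for (i in seq_len(nrow(log))) {
    rows <- df$ISIN == log$ISIN[i]            # rows with this ISIN
    df[rows, log$variable[i]] <- log$value[i] # overwrite the named column
  }
  df
}

bonds_fixed <- apply_corrections(bonds, corrections)
```

Adding a new correction then means adding one row to `corrections` and re-running `apply_corrections()`; the original `bonds` object is left unchanged, so the log doubles as a record of what was altered.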

I realize there are probably 100 ways to achieve this, but I am a noob and I would like to choose the right path from the start instead of figuring out later that my approach was fundamentally flawed. Which is what I did with my previous solution: it was supposed to be just a few observations that needed a fix, and it is turning out it is hundreds...

I hope my question is understandable, and I thank you in advance for any help.

mmuurr · November 21, 2020, 9:01pm

The first issue you'll likely have to deal with for an "efficient" solution is the replacement value type. Will it always be numeric? Or sometimes will it be numeric, sometimes boolean, sometimes a string, etc.?
(If it's always one type, some very efficient solutions are possible.)

Also, how large is "large"? (How many cols? Rows? Unless this is in the many millions or higher, it seems unlikely you'll need an efficient solution and instead should opt for the semantically-simplest solution.)

And what's your measure for efficiency? (E.g. full value replacement in under x ms?)

Egonomist · November 21, 2020, 10:36pm

Thank you so much for answering.

I definitely used the wrong term when I said I wanted the most "efficient" solution. I don't need computational efficiency at all; my dataset is not that large (I will hopefully need to change no more than a few hundred values in a data frame of ~10,000 observations with around 30 variables).

What I actually meant was that I was looking for a solution that is efficient in terms of managing the code: something flexible enough that I can easily add to the list of values to change in the future.

Right now each change is a mutate command changing the specific value in the data frame. I would like to have a data structure recording which observations need a change, which variable has to be changed, and what value needs to be put in, so that each time I have a new correction I only need to add this information and the process is already set up. I hope I am managing to explain myself.

The values will not always be numeric. The changes are being done to variables that are a mix of numeric, character, boolean and (maybe) dates.
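One possible way to handle mixed types, sketched in base R under stated assumptions: keep every replacement value as text in the correction log, and convert each one to the class of the column it is going into. The columns `Callable` and `Issue_Date`, all the correction rows except the first, and the helper names `convert_like` and `apply_typed_corrections` are hypothetical and only there to exercise each type.

```r
# Example data; `Callable` and `Issue_Date` are invented columns used to
# show logical and Date handling.
bonds <- data.frame(
  ISIN        = c("DK0", "LL4", "XS0"),
  Issuer_Name = c("Danske Bank", "LLoyds", "UniCredit"),
  Issue_Year  = c(2009, 2009, 2010),
  Callable    = c(TRUE, FALSE, TRUE),
  Issue_Date  = as.Date(c("2009-01-15", "2009-06-30", "2010-03-01")),
  stringsAsFactors = FALSE
)

# Correction log with all values stored as character strings.
corrections <- data.frame(
  ISIN     = c("DK0", "LL4", "XS0", "XS0"),
  variable = c("Issue_Year", "Issuer_Name", "Callable", "Issue_Date"),
  value    = c("2010", "Lloyds", "FALSE", "2010-03-15"),
  stringsAsFactors = FALSE
)

# Convert a text value to the same class as the target column.
convert_like <- function(value, target) {
  if (inherits(target, "Date")) as.Date(value)
  else if (is.logical(target)) as.logical(value)
  else if (is.integer(target)) as.integer(value)
  else if (is.numeric(target)) as.numeric(value)
  else value # character columns keep the text as-is
}

apply_typed_corrections <- function(df, log, key = "ISIN") {
  # Fail early if the log names a column that does not exist.
  unknown <- setdiff(log$variable, names(df))
  if (length(unknown) > 0) {
    stop("Unknown column(s): ", paste(unknown, collapse = ", "))
  }
  for (i in seq_len(nrow(log))) {
    col  <- log$variable[i]
    rows <- which(df[[key]] == log[[key]][i])
    if (length(rows) == 0) {
      warning("No row with ", key, " = ", log[[key]][i])
      next
    }
    df[[col]][rows] <- convert_like(log$value[i], df[[col]])
  }
  df
}

bonds_fixed <- apply_typed_corrections(bonds, corrections)
```

Because the log is all text, it could also live in a CSV file read with `read.csv(..., colClasses = "character")`, so that a new correction is just one more line in that file. Dates are assumed to be written as `YYYY-MM-DD`.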
