Replacing messy range values with midpoints

I have a problem with my dataset. It has a column containing the patients' age ranges, but the formatting is messy. For example, there's a value '25-29' but also '25-44' and even '25-99'.

Is it okay if I find the midpoint of each value and assign it to the corresponding record? I want to know exactly what age my patients are, or at least where they fall within a set of ranges (for example: <17, 17-25, 26-35, etc.).

Is it ok if I find the midpoint of each value

I think that will really depend on what you are doing. How are you obtaining the data that you do have? And how would you go about getting the precise age?

I got the data from Kaggle. I think the original source is: here
It's pretty credible.

I thought I could find the midpoint of each age range and replace the messy range with it. For example: 1-5 -> 3
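For reference, here is a minimal sketch of that midpoint idea in Python. It assumes every value is an inclusive range written as 'low-high' with whole-number bounds; the function name `range_midpoint` is only illustrative, and open-ended values such as '<17' are not handled.

```python
def range_midpoint(age_range: str) -> float:
    """Return the midpoint of an inclusive range written as 'low-high'."""
    low_text, high_text = age_range.split("-")
    low, high = int(low_text), int(high_text)
    if low > high:
        raise ValueError(f"lower bound exceeds upper bound in {age_range!r}")
    return (low + high) / 2


# Ranges quoted in this thread.
examples = ["1-5", "25-29", "25-44", "25-99"]
midpoints = {r: range_midpoint(r) for r in examples}

for age_range, midpoint in midpoints.items():
    print(f"{age_range} -> {midpoint}")
```

Note that '25-44' gives 34.5 and '25-99' gives 62.0, so the midpoint alone hides how wide the original range was.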

Should I just take the column out?

So after reading through the repo linked, if I were trying to do any sort of analysis with that data, I think I would probably not use the data from that column, at least not without significant transformation. As a general rule, you should only impute data where you have a reasonable expectation that the imputed value approximates the true value well. This would seem to be even more true for an epidemiological dataset, where I would expect the information value of age to be high.

I think that if you knew more about how this data was collected, that would probably be most informative in your decision to include or exclude the column.
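One rough way to see how far a midpoint could be from the true age is to look at half the width of each range. The sketch below assumes inclusive 'low-high' ranges with whole-number bounds; `midpoint_max_error` is a hypothetical helper name, not something from the dataset or its repo.

```python
def midpoint_max_error(age_range: str):
    """Return (midpoint, worst-case error in years) for a 'low-high' range."""
    low, high = (int(part) for part in age_range.split("-"))
    midpoint = (low + high) / 2
    # The true age can be at most half the range width away from the midpoint.
    max_error = (high - low) / 2
    return midpoint, max_error


for age_range in ["25-29", "25-44", "25-99"]:
    midpoint, max_error = midpoint_max_error(age_range)
    print(f"{age_range}: midpoint {midpoint}, off by up to {max_error} years")
```

For '25-29' the midpoint is within 2 years of the true age, but for '25-99' it could be off by as much as 37 years, which is hard to call a reasonable approximation.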

I have no idea how they collected the data. I was only searching through and playing with datasets, and this one came up. I'm still learning how to find usable datasets.

Thanks for the input, I will remove the column :slight_smile:

If I was going to use a method of analysis that allowed for weighted data, I'd be tempted to duplicate a record for each year of its possible age within the age range, and give it a weight appropriate to keep the total weight contribution of the row as 1 in my dataset. This would allow for the possibility of detecting some signal from noise, whilst avoiding introducing undue bias I think. Curious to know what others would make of that approach.
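A minimal sketch of that duplication idea in Python, under some loud assumptions: ranges are inclusive 'low-high' strings, every year in the range is treated as equally likely (so each copy gets weight 1/n), and the column names `age_range`, `age`, and `weight` are made up for illustration.

```python
def expand_age_range(record: dict, range_key: str = "age_range") -> list:
    """Duplicate a record once per possible age, with weights summing to 1."""
    low, high = (int(part) for part in record[range_key].split("-"))
    ages = range(low, high + 1)  # inclusive of both bounds
    weight = 1 / len(ages)       # assumption: uniform over the range

    expanded = []
    for age in ages:
        # Copy every field except the original range column.
        row = {key: value for key, value in record.items() if key != range_key}
        row["age"] = age
        row["weight"] = weight
        expanded.append(row)
    return expanded


patient = {"patient_id": 1, "age_range": "25-29"}  # hypothetical record
rows = expand_age_range(patient)

for row in rows:
    print(row)
```

The resulting `weight` column would then be passed to whatever weighted method is used, so that each original patient still counts as one observation in total. Whether a uniform spread across the range is sensible depends on how the ranges were produced, which is unknown here.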

Hey, I'm really curious about the technicality of this method you mentioned.

I'm not a statistician and I would like to learn more about weighted data. Which analyses allow weighted data and which do not? Can you please give me a link on how to learn to do this? Thank you ><
