Summarising trend data as a feature using tidymodels


I am currently reading Feature Engineering and Selection by Max Kuhn and Kjell Johnson. In Chapter 9 (p. 206) it is mentioned that a feature could be added to show a specific trend in a group. For example, if I was trying to classify an item and I had 300 days of very noisy longitudinal data, I could run a linear regression and use its slope (feature 1) as a proxy, together with the trimmed median (feature 2) and MAD (feature 3) values, to summarize the per-day history of this group of items (i.e. whether the pattern of this feature is going up or down). My question is twofold:

  • How would I achieve the linear regression portion within the tidymodels framework? (I imagine it would need to be done within the resamples.)

  • Is it possible to use a different model than linear regression to summarise the trend, for example MARS, which may capture a nonlinear pattern better than linear regression? How would I summarise the formula within a column so that it becomes a feature to be picked up by my main model? Is this also possible in tidymodels?

Thank you very much for your time

As a recap, there are two main ways to fit longitudinal models.

First, you can include time as a predictor. This means that rows in the data are not independent and the data need to be resampled appropriately. This also means that you can encode nonlinear relationships with the outcome by adding interactions with time (perhaps coupled with things like splines). This can be done in tidymodels using the regular recipe syntax.
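As a rough illustration of the recipe route, here is a minimal sketch that expands time into natural-spline terms and then interacts a predictor with them. The data frame `long_data` and its columns (`patient`, `time`, `x`, `y`) are made up for the example, and the choice of `deg_free = 3` is arbitrary.

```r
library(recipes)

# Hypothetical long-format data: 10 patients measured at 5 time points each
set.seed(42)
long_data <- data.frame(
  patient = rep(1:10, each = 5),
  time    = rep(1:5, times = 10),
  x       = rnorm(50)
)
long_data$y <- 0.5 * long_data$x + sin(long_data$time) + rnorm(50, sd = 0.1)

rec <- recipe(y ~ x + time, data = long_data) %>%
  # Replace `time` with natural spline basis columns (time_ns_1, time_ns_2, ...)
  step_ns(time, deg_free = 3) %>%
  # Interact x with every spline column so the effect of x can vary over time
  step_interact(terms = ~ x:starts_with("time"))

# Estimate the recipe on the data and return the processed training set
baked <- bake(prep(rec), new_data = NULL)
names(baked)
```

Because several rows still belong to the same patient, a recipe like this would normally be paired with a grouped resampling scheme rather than plain row-wise cross-validation.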

Alternatively, you can summarize the data so that each row corresponds to an independent experimental unit. Here, you would have to create summary statistics within each independent unit. We don't currently support doing that in recipes. It's one of the rare situations where you'd have to do all of that before resampling, modeling, and so on. dplyr's grouping facilities are a good way of approaching this.
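A minimal sketch of that pre-resampling summary with dplyr, assuming a hypothetical long-format data frame `visits` with columns `patient`, `day` and `x`. It computes the three features from the question per patient, with the caveat that a plain median stands in for the trimmed median here.

```r
library(dplyr)

# Hypothetical data: 3 patients, 30 days each, with different underlying trends
set.seed(1)
visits <- tibble(
  patient = rep(1:3, each = 30),
  day     = rep(1:30, times = 3),
  x       = rnorm(90, sd = 0.5) +
    rep(c(0.1, -0.1, 0), each = 30) * rep(1:30, times = 3)
)

trend_features <- visits %>%
  group_by(patient) %>%
  summarise(
    # Slope of a per-patient linear regression of x on day
    slope  = coef(lm(x ~ day))[["day"]],
    # Plain median (assumption: used in place of the trimmed median)
    median = median(x),
    # Median absolute deviation
    mad    = mad(x),
    .groups = "drop"
  )

trend_features
```

The result has one row per patient, so it can be joined to the other patient-level predictors and resampled as usual. The same pattern would apply to a different per-unit summary model, as long as its output can be reduced to a fixed set of numbers per unit.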

Hi @Max

Thank you for taking the time out to answer my question.

If I just take the first paragraph, as I think it's more appropriate to the data I'm looking at: with my fictitious 300 days, if I understand correctly, I would encode a column per day? More concretely, day 300 gets a value, day 299 gets a value, and so on all the way up to day n-1, where n is the current day.

I would then use time slices, I imagine, for my resampling (since this is a moving 300-column trend). Resample 1 would contain 300 samples; resample 2 would move that window by one day so that column 300 is replaced with column 299 and all other columns scooch down one. Column 1 is now yesterday, so to speak. I could then build out interactions this way by comparing profile data at time 20 with time 200, for example. Does this seem like a good approach, or am I talking out of my hat :slight_smile:

So are there just 300 days in the data, or is there some other factor (e.g. patient, company, etc.) that has 300 days nested inside of it?

I guess if we use the fictitious patient example, we could say we have a rolling window of 300 days for each patient, maybe how many times they visit the hospital per day. There would be other factors too, for example weight, height, hair colour, eye colour, etc. When we try to classify a patient, it's on the most recent day, where we don't yet have the number of hospital visits at the time of classification. The next day (tomorrow) would incorporate today's number of visits for tomorrow's classification. I hope this is clear, but if not I could attempt to create some fake data tomorrow morning to clarify, if you think it would be helpful?

Of course, to play devil's advocate, we could have many different hospitals, so the counts of visits are per patient per hospital, where all patients have attributes themselves.

Let's say that there were 3 time points. If the time points are always going to be the same, the data could be:

patient time_1 time_2 time_3     y
      1    1.2    7.5   11.0   0.1
      2    5.4    3.1    1.0   1.9

If that were the case, you could resample as usual since 1 row = 1 independent unit.

If the time points vary, then you would need something like

patient time     x     y
      1    1   1.1   6.1
      1    7   5.6   1.7
      1    9   9.3   1.0

      2    1   0.3   3.1
      2    4   4.3   6.0

In this case, 2+ rows = 1 independent unit, so you have to resample by patient using rsample::group_vfold_cv() or some other method.
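A small sketch of grouped resampling with rsample, using made-up data in the same long layout as the table above. The data frame `long_data` and the choice of `v = 3` are assumptions for illustration only.

```r
library(rsample)

# Hypothetical long-format data: 6 patients with 4 rows each
set.seed(123)
long_data <- data.frame(
  patient = rep(1:6, each = 4),
  time    = rep(c(1, 4, 7, 9), times = 6),
  x       = runif(24, 0, 10),
  y       = runif(24, 0, 7)
)

# All rows for a given patient are kept together in either
# the analysis set or the assessment set of each fold
folds <- group_vfold_cv(long_data, group = patient, v = 3)

# Check the first split: no patient appears on both sides
first_split <- folds$splits[[1]]
intersect(analysis(first_split)$patient, assessment(first_split)$patient)
```

The intersection should be empty for every split, which is what keeps information about a held-out patient from leaking into the rows used for fitting.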

Ah ok

Thanks for walking me through it, it's now much clearer in my head.

Using the first method then eliminates the need to put in a sub-model, for example a linear regression, to show that the trend for the patient might be reducing.
