When and when not to seed?

I know the purpose of set.seed() in general, but when and why should we use this function in a typical tidymodels workflow? For example, Julia does this:

library(tidymodels)

set.seed(123)
trees_split <- initial_split(trees_df, strata = legal_status)
trees_train <- training(trees_split)
trees_test <- testing(trees_split)

But why? For which function does she use set.seed()? Let's say I'm building an ensemble with tidymodels, consisting of n different model types. Where should I use set.seed(), and when does it need another, new set.seed()? And is it possible that two consecutive calls to set.seed(123) could collide with each other? Is it dangerous to forget set.seed()?

Here the clear motivation for setting the seed is to make the initial split reproducible.
If the script is simply run end to end, rather than line by line in an arbitrary order or with lines re-run interactively, then further uses of set.seed() would not be necessary.
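A minimal sketch of that point, assuming the rsample package (part of tidymodels) and the built-in mtcars data instead of trees_df. The helper run_pipeline() is a hypothetical name used only for illustration. One seed at the top is enough when everything after it runs in the same order every time:

```r
library(rsample)

# Hypothetical helper: the whole "script" as one function, run end to end.
run_pipeline <- function(seed) {
  set.seed(seed)                                # single seed at the top
  split <- initial_split(mtcars, prop = 0.75)   # consumes random numbers
  folds <- vfold_cv(training(split), v = 5)     # continues the same stream
  list(
    train_rows = rownames(training(split)),
    fold1_rows = rownames(analysis(folds$splits[[1]]))
  )
}

run_a <- run_pipeline(123)
run_b <- run_pipeline(123)

# Same seed and same order of calls give the same split and the same folds.
identical(run_a, run_b)
```

If lines are re-run interactively or out of order, the random number stream is at a different position each time, which is why tutorials tend to put a fresh set.seed() in front of every random step.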

Why should the split be reproducible?
On Julia's website I see two other standard applications...

  1. set.seed(345)
     folds <- vfold_cv(train, strata = value, v = trainConf[['folds']])

  2. set.seed(678)
     tune_res <- tune_grid(
         tune_wf,
         resamples = folds,
         grid      = modelGrid
     )

... Same question: why should vfold_cv() be reproducible (or for which case/application)?
Why should tune_grid() be reproducible (or for which case/application)?
And is it possible that two consecutive calls to set.seed(123) could collide with each other?
And is it dangerous to forget set.seed() in the context of tidymodels, for example when building model ensembles? Are there any subsequent steps like finalize() etc. which depend on set.seed()? Thanks.

I'm confused, are you asking me to motivate doing reproducible work ...?

I'm asking the questions I wrote down in the post ;-).

Because the work as a whole should be reproducible.
If this part isn't, none of it will be.

So is it meant only for the case where we want to call initial_split() again and get the same split, as long as the seed exists? That would mean it's not essential for a complete tidymodels 'workflow' and is more like a unit-test tool.

What do you mean by "as long as the seed exists"? If it's an R integer, there's no 'non-existence'.

It's essential for your tidymodels workflow, for reproducibility. It is not a unit-test tool. It is a requirement for reproducibility.

Non-existence could happen if you are in another namespace, or when you are using multiple sessions via a shell script, for example. So what is your case for reproducibility? What does it look like? Do you call initial_split() once and then a second time in another part of the script, expecting the same split? Are we talking about ad hoc code reproducibility or explainability in production? Anyway, in my case I don't need ad hoc code reproducibility. So I'm asking about any practical pitfalls in my case, not in your case :-). R is no religion.

Any other productive answers?

Non-existence could be, if you are in another namespace, or when you are using multiple sessions via a shell-script for example

I don't see that this has any relevance to setting the random seed of the active R session. If you pull up the same script in another session, then for that script to be fully reproducible (same outputs when processing the same inputs), it will be possible to set the seed, and indeed necessary. Again, if the seed you set is an R integer, there is no failure case.

So what is your case for reproducibility?

The ability to stand behind my work and justify it on a different day than the day I originally created it.
If that's not a requirement for you, forget seeds altogether, but
a) I would find it maddening to rerun a script and find the numbers jumping all about from one run to the next; it would make debugging more difficult.
b) My work is generally not a one-off that no one would challenge, so reproducibility is a must.

R is not a religion, indeed, and you don't have to use set.seed(). You don't have to make your work reproducible either, but we can infer that this would likely not be a) science or b) commercial work. If it's a pure hobbyist thing, fine, you have my blessing to throw reproducibility out of the window. On the other hand, it's best to have good habits....

So indeed, for this task I need no set.seed(), as long as there are no inherent dependencies in tidymodels. I can use set.seed() for simulations, if I really need them, in other parts of my projects. Your assumptions about commercial work and science are your personal opinion and do not fit my case in any direction. When, where and how to establish reproducibility and persistence depends on the application. There isn't always a workspace, and reproducibility and persistence in the wrong place can lead to unwanted effects. So I prefer to choose it wisely and only when needed.

(2) "Initially, there is no seed; a new one is created from the current
time and the process ID when one is required. Hence different sessions
will give different simulation results, by default. However, the seed
might be restored from a previous session if a previously saved workspace
is restored." This is why you would want to call set.seed() with the same
integer value the next time you want the same sequence of random numbers.

https://livefreeordichotomize.com/2018/01/22/a-set-seed-ggplot2-adventure/
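A small base R sketch of what the quoted documentation implies, and of the "two consecutive set.seed(123)" question from the original post. The variable names are only illustrative. Calling set.seed() again does not collide with an earlier call, it simply resets the generator to the same starting state:

```r
# Same seed, same sequence.
set.seed(123)
first <- runif(3)

set.seed(123)
second <- runif(3)
identical(first, second)

# Two consecutive calls are harmless: the second one is just redundant.
set.seed(123)
set.seed(123)
third <- runif(3)
identical(first, third)

# Without re-seeding, the stream simply continues with new numbers.
fourth <- runif(3)
identical(third, fourth)
```

The first two comparisons return TRUE and the last one returns FALSE. One practical consequence: re-using the same seed before every step is not an error, but it does make those steps draw from the same starting point, so their random numbers are no longer independent of each other.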

Don't forget that it is not only simulations that depend on random number generators.

Any method that requires some sampling is affected: splitting into train and test sets, cross-validation methods, random forests.
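If it is unclear whether a particular step draws random numbers at all, one rough check in base R is to compare the generator state before and after the call. This is only a sketch: uses_rng() is a hypothetical helper, not a tidymodels function, and it overwrites the current seed as a side effect.

```r
# Hypothetical helper: TRUE if calling f() advanced R's random number stream.
uses_rng <- function(f) {
  set.seed(1)  # makes sure .Random.seed exists (and resets the stream!)
  before <- get(".Random.seed", envir = globalenv())
  f()
  after <- get(".Random.seed", envir = globalenv())
  !identical(before, after)
}

draws_sample <- uses_rng(function() sample(10))   # shuffling needs randomness
plain_mean   <- uses_rng(function() mean(1:10))   # deterministic computation

draws_sample
plain_mean
```

Here draws_sample is TRUE and plain_mean is FALSE. The same wrapper could be put around a call such as initial_split() or vfold_cv() to see whether that step is one where a seed matters.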

Setting the seed means you can reproduce errors and debug.

Moreover, if someone wants to audit your work or check your results, then without reproducibility you cannot defend your models.

Second, using a seed in online materials ensures that the learner gets the same results as the author.

Thank you for this comment, Marek_Marek. At least it mentions some of the potentially affected tidymodels functions. I have no students and no audits. I'm currently building an ensemble (framework) with a theoretically unlimited number of models/model types. I also have some helper functions to generate different numbers and set different seeds for every model, passing in the specific unique procedures (code-once approach). I just want to make sure whether this is really necessary and that nothing collides between the different models when setting, or not setting, seeds.
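For what it's worth, a rough base R sketch of the per-model seed idea described above. Everything here is an assumption for illustration: the model labels are placeholders, and fake_fit() stands in for whatever random step (resampling, tuning, fitting) each model would really run.

```r
model_names <- c("model_a", "model_b", "model_c")  # placeholder labels

# Derive one distinct seed per model from a single master seed.
set.seed(2024)
model_seeds <- sample.int(1e6, length(model_names))  # no replacement, so unique
names(model_seeds) <- model_names

# Stand-in for a model's random step: seed first, then draw.
fake_fit <- function(name) {
  set.seed(model_seeds[[name]])
  sample(100, 5)
}

# Results depend only on each model's own seed, not on the fitting order.
forward  <- lapply(model_names, fake_fit)
backward <- rev(lapply(rev(model_names), fake_fit))
identical(forward, backward)
```

The comparison returns TRUE. Seeds set this way do not interfere with each other: each set.seed() call just resets the single session-wide generator, so adding, removing or reordering models leaves the other models' draws unchanged.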
