Does using set.seed() introduce bias in machine learning?

Hi,
To get reproducible results, set.seed() is important because it fixes the data that is randomly generated. Would it create bias if I use the same seed number before running all the code from the link below?
https://www.codementor.io/zulaikhageer/support-vector-machine-in-r-using-svm-to-predict-heart-diseases-to5cy8sbg
If I use a different seed number, would it introduce some unidentified factor that disturbs the randomness, so that the results can no longer be compared fairly (apples to apples)? I am using the code from the link above and want to get reproducible results, but I don't want to introduce any unintentional bias.

Thank you for any suggestions!

If you want to reproduce someone's results exactly, then you would need to use the same seed in order to "randomly" generate the same set of data. Using a different seed will lead to somewhat different data, because the randomness in your data is different than someone else's.

Using a different seed won't systematically bias your results, but it will mean your results will be somewhat different than someone else's. How different your results are (when using a different seed), depends on several considerations.

Based on skimming the article you provided, it seems like the only aspect of that analysis that would be affected by set.seed() is the random training-testing split. With a large enough sample size, the exact split shouldn't have a meaningful effect on the results. But yes, without using the same seed, your split will be different from the one in the article, so your results will vary to some degree.
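
To make the point about the split concrete, here is a minimal sketch in base R. It assumes a simple 70/30 split drawn with sample(); the built-in iris data and the seed values are stand-ins chosen for illustration, not the data or code from the article.

```r
# Illustration only: iris stands in for the heart-disease data from the article.
n <- nrow(iris)
n_train <- floor(0.7 * n)      # 70/30 split is an assumption

set.seed(42)
train_a <- sample(n, n_train)  # row indices of the training set

set.seed(42)
train_b <- sample(n, n_train)  # same seed again

set.seed(2024)
train_c <- sample(n, n_train)  # a different seed

identical(train_a, train_b)    # TRUE: same seed, same split
identical(train_a, train_c)    # FALSE: different seed, different split
```

Neither split is "more biased" than the other; they are just two different random draws from the same data.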

Thank you for the information!

According to the article above, wouldn't trainControl() or train() produce different results without calling set.seed()? I am very new to machine learning, so I am having a hard time understanding the effects of the different functions and arguments, even after reading the documentation for the caret package several times. Besides, most people suggest this document for reading: http://topepo.github.io/caret/recursive-feature-elimination.html#recursive-feature-elimination-via-caret but it is still too hard for me, as I have no knowledge of or experience with data analysis.

I am using recursive feature elimination (https://stackoverflow.com/questions/51933704/feature-selection-with-caret-rfe-and-training-with-another-method) before moving on to the support vector machine, in order to get a more accurate model, so the sample size would be much smaller. May I know what sample size is considered large enough?

Thank you for any suggestions.

Hi,

First of all, I wanna mention that @mattwarkentin provided an excellent explanation on the topic.

Regarding your latest questions:
Every algorithm or piece of code that uses random numbers (which is pretty much all machine learning) will produce (slightly) different results every time you run it, unless you set a seed. Setting a seed does not mean that the same random number is generated over and over, but that the sequence of random numbers is the same on every run.

set.seed(1)
runif(5)
[1] 0.2655087 0.3721239 0.5728534 0.9082078 0.2016819
runif(5)
[1] 0.89838968 0.94467527 0.66079779 0.62911404 0.06178627

If you run this code, you will get the exact same output. But note that the second round of random numbers is completely different from the first one. This means that each time random numbers are needed after you set the seed, they will be different from the previous draw, yet reproducible.

set.seed(1)
runif(10)
[1] 0.26550866 0.37212390 0.57285336 0.90820779 0.20168193 0.89838968 0.94467527 0.66079779 0.62911404 0.06178627

Here I draw all 10 at once, and the sequence of random numbers is the same as in the two draws of 5 above.

Usually there should be no issue or bias with this, as long as your dataset is large and not sparse. There is no easy way to define what dataset size is large enough, but there is an easy way to find out whether randomness matters: just set a seed (before splitting the data), run all of your code, and store the results. Then change the seed, run the code again, and look at the results. If you do this a few times and the results are highly similar, the choice of seed is not affecting your conclusions.

If the results are substantially different, your dataset is either too small or too sparse. In those cases, splitting the data creates sets with different distributions of inputs or outputs, and performance will depend on which split you happened to get. Large or dense (not sparse) sets maintain the distribution of the data even when split.
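
The check described above could look roughly like this. It is only a sketch under stated assumptions: rmse_for_seed() is a hypothetical helper, and the built-in mtcars data with a linear model stands in for your own data and the SVM. The seed values are arbitrary.

```r
# Hypothetical helper: split, fit, and score once for a given seed.
# mtcars and lm() are stand-ins for the real data and model.
rmse_for_seed <- function(seed, data = mtcars, train_frac = 0.7) {
  set.seed(seed)                                  # seed is set before splitting
  n <- nrow(data)
  train_idx <- sample(n, floor(train_frac * n))
  train <- data[train_idx, ]
  test  <- data[-train_idx, ]

  fit  <- lm(mpg ~ wt + hp, data = train)
  pred <- predict(fit, newdata = test)
  sqrt(mean((test$mpg - pred)^2))                 # test-set RMSE
}

seeds <- c(1, 22, 333, 4444, 55555)               # arbitrary seed values
rmse  <- sapply(seeds, rmse_for_seed)

data.frame(seed = seeds, rmse = rmse)             # one result per seed
range(rmse)                                       # how far apart are they?
```

mtcars has only 32 rows, so the spread across seeds may be quite noticeable here, which is exactly the small-dataset situation described above. With a larger dataset the values should sit much closer together.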

Hope this helps,
PJ

With a different seed you would get different results, but you would still be able to compare them. They would be different realizations of the same analysis and, whether we like to admit it or not, analyses have random components.

In fact, it is a good idea to use a different seed just to assess how much things would have changed. That's basically what resampling does.
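
As a rough illustration of that idea, here is a sketch of 5-fold cross-validation written in base R. It is not what caret does internally; the mtcars data, the linear model, the number of folds, and the seed are all assumptions made for the example.

```r
set.seed(123)                                     # arbitrary seed for the fold assignment
k <- 5
# Assign each row to one of k folds at random
folds <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

fold_rmse <- sapply(seq_len(k), function(i) {
  fit      <- lm(mpg ~ wt + hp, data = mtcars[folds != i, ])  # fit on the other folds
  held_out <- mtcars[folds == i, ]                            # score on fold i
  sqrt(mean((held_out$mpg - predict(fit, newdata = held_out))^2))
})

fold_rmse         # one estimate per held-out fold
mean(fold_rmse)   # overall performance estimate
sd(fold_rmse)     # how much it varies from fold to fold
```

The fold-to-fold variation gives a sense of how much a single random split could move the result.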

If you do not set the seed, different random numbers are used and that means different data splits, resamples, and other quantities are different.

As for how large a sample is large enough: unfortunately, that's really hard to say without knowing a lot about the data and the problem. In general, RFE requires a fair amount of replication (usually in the form of resampling) in order to really know what the variation is.

I have a followup question. When I was in undergrad and grad school (over 30 years ago), the actual seed number you picked could result in very different cycle lengths for the pseudo-random numbers being generated. I remember being advised to pick an odd number with at least 5 digits for the simulation program we used (SLAM).

What's the advice for R? Are there any seed numbers which are better or worse? Is the seed number the actual first input into a recursive function or is it a reference to a list of initial values?

Thanks!

Hi,

I am not a computer scientist by training, so I cannot answer this in detail. I think there will always be some issues with (pseudo) random number generators in extreme cases, but in daily use it won't affect the results too much.

I found two pages that might help you with more details:

The random number functions used in R:
https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/Random
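
As a small illustration of what that page describes, here is a sketch that assumes R's default generator (Mersenne-Twister) has not been changed with RNGkind(). The integer passed to set.seed() is not used as the first random number; it is used to initialize a much larger internal state, which R stores in .Random.seed. The seed values below are arbitrary.

```r
set.seed(12345)
kind  <- RNGkind()                                 # generator settings in use
state <- get(".Random.seed", envir = globalenv())  # internal RNG state

kind[1]          # "Mersenne-Twister" unless the default was changed
length(state)    # 626 integers under Mersenne-Twister, not a single number

# Neighbouring seeds give unrelated-looking streams
set.seed(12345); a <- runif(3)
set.seed(12346); b <- runif(3)
identical(a, b)  # FALSE
```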

A paper on random number generation

Hopefully some others might have more in-depth answers than me :)

PJ
