A question about cross-validation

My book provides an example of cross-validation.

housing<-read.table("http://www.jaredlander.com/data/housing.csv",sep=",",header=TRUE,stringAsFactors=FALSE)
names(housing)<-c("neighborhood","class","units","yearbuilt","sqft","income","incomepersqft","expense","valuepersqft","boro")
cv.work<-function(fun,k,data,cost=function(y,yhat)mean((y-yhat)^2),response="y",...)
{
  folds<-data.frame(Fold=sample(rep(x=1:k,length.out=nrow(data))),Row=1:nrow(data))
  error<-0
  for(f in 1:max(folds$Fold))
  {
    theRows<-folds$Row[folds$Fold==f]
    mod<-fun(data=data[-theRows,],...)
    pred<-predict(mod,data[theRows,])
    error<-error+cost(data[theRows,response],pred)*(length(theRows)/nrow(data))
  }
  return(error)
}
cv1<-cv.work(lm,5,housing,response = "valuepersqft",formula=valuepersqft~units*sqft+boro) 

I ran the book's code but got a different result from the book. Another problem is that the numeric value of cv1 changes every time I rerun that line.
Why do the same code and the same dataset produce different numbers?

Thanks!

The value changes because you're using sample to define the folds.

Each time you call cv.work, sample draws a new random assignment of rows to Fold, so the models are fitted and evaluated on different splits and the resulting error differs.

If you want to get the same output every time, consider using set.seed. For example, if you run set.seed(seed = 32121) immediately before the last line defining cv1, it'll always take the value 4.610207.
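To see the effect in isolation, here is a minimal sketch that reuses only the fold-assignment expression from cv.work. The helper name make_folds and the values k = 5 and n = 20 are my own choices for illustration; they are not from the book.

```r
k <- 5   # number of folds (illustrative value)
n <- 20  # number of rows (illustrative value)

# Hypothetical helper wrapping the same expression cv.work uses to assign folds
make_folds <- function() sample(rep(x = 1:k, length.out = n))

set.seed(32121)
folds_a <- make_folds()

set.seed(32121)          # resetting the seed replays the same random draws
folds_b <- make_folds()

folds_c <- make_folds()  # no reset here, so this continues the random stream

identical(folds_a, folds_b)  # TRUE
table(folds_a)               # every fold still holds n / k = 4 rows
```

Only the order of the fold labels is random; the fold sizes stay the same. folds_c will usually differ from folds_a because the seed was not reset before it.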

Hope this helps.


You have made a typo. The argument is stringsAsFactors, not stringAsFactors. Also, you can use read.csv in this case.
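As a rough illustration of the read.csv suggestion, the sketch below reads a tiny inline CSV through the text argument instead of the real URL, so it runs without a network connection. The two rows are made up and only stand in for housing.csv.

```r
# Made-up rows standing in for housing.csv (illustration only)
csv_text <- "Neighborhood,Boro\nFINANCIAL,Manhattan\nTRIBECA,Manhattan"

# read.csv already defaults to sep = "," and header = TRUE
demo <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(demo$Boro)  # "character", not "factor"
```

For the real data, the same call with the URL in place of text = csv_text should work the same way.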

According to your first two sentences, you mean that Fold changes at every repetition because of the sample function. Since I don't pass sample any argument that specifies how to sample, cv.work always ends up using a random seed. Is that also the reason you suggest running set.seed to obtain the same output?

I'm not sure I understand you completely.

Yes, I meant that Fold will vary on each repetition, and hence you'll get different outputs. If you don't want that, you'll have to use set.seed to fix the seed, and that'll lead to the same results.

However, I didn't follow this:

I'm not aware of any argument to sample that specifies a seed or tells it how to sample. Can you please explain what you meant here?
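For reference, a quick way to check this is to list sample's arguments and watch the random number generator's state around a call. This is only a sketch; it reads .Random.seed, the variable in the global environment where R keeps that state.

```r
# sample() takes no seed argument
names(formals(sample))  # "x" "size" "replace" "prob"

set.seed(1)  # initialises the generator state
state_before <- get(".Random.seed", envir = globalenv())

draw <- sample(10)  # consumes random numbers from the shared generator

state_after <- get(".Random.seed", envir = globalenv())

identical(state_before, state_after)  # FALSE: sample() advanced the state
```

So the seed is not a property of sample itself. sample reads from, and advances, one generator shared by the whole session, and set.seed resets that generator.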

In fact, I don't know how to specify a seed or how to tell sample how to sample.
The confusing sentence was just my own further thinking.
You suggest that the differing results can be fixed with set.seed.
So I thought the problem could also be solved if I specified a fixed sampling rule that always picks the same rows.
Admittedly, always getting the same output contradicts the nature of sampling.
Also, setting up a sampling rule to guarantee the same output would cost more effort than your approach.
I'm interested in the logic by which the computer interprets my code; I think it's very important for me.
I'm sorry for the confusion.
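
The fixed-rule idea could be sketched roughly as follows, under the assumption that the random shuffle is simply dropped from the fold assignment. The values k = 5 and n = 12 are arbitrary, and this is not what the book's cv.work does.

```r
k <- 5   # number of folds (illustrative value)
n <- 12  # number of rows (illustrative value)

# Deterministic rule: cycle through the fold labels in row order, no sample()
fixed_folds <- rep(x = 1:k, length.out = n)

fixed_folds  # 1 2 3 4 5 1 2 3 4 5 1 2
```

This gives the same folds on every run without any seed, but the split then depends entirely on the order of the rows in the data.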
