Large dataset: using sapply to return the "mean" and "sd" of the RMSEs for several values of n

I have been tasked with generating a large dataset according to the following request:

"Write a function that takes a size n, then (1) builds a dataset using the code provided in Q1 but with n observations instead of 100 and without the set.seed(1), (2) runs the replicate() loop that you wrote to answer Q1, which builds 100 linear models and returns a vector of RMSEs, and (3) calculates the mean and standard deviation. "

The code for the dataset is below:
library(caret)  # createDataPartition()
library(dplyr)  # %>% and slice()

n <- c(100, 500, 1000, 5000, 10000)
Sigma <- 9*matrix(c(1.0, 0.5, 0.5, 1.0), 2, 2)
dat <- MASS::mvrnorm(n = 100, c(69, 69), Sigma) %>%
  data.frame() %>% setNames(c("x", "y"))

rmse <- replicate(100, {
  test_index <- createDataPartition(dat$y, times = 1, p = 0.5, list = FALSE)
  train_set <- dat %>% slice(-test_index)
  test_set <- dat %>% slice(test_index)
  fit <- lm(y ~ x, data = train_set)
  y_hat <- predict(fit, newdata = test_set)
  sqrt(mean((y_hat - test_set$y)^2))
})

The goal is to return, for each value assigned to the variable "n", the "mean" and "standard deviation" of the RMSEs.

So far I have tried passing "n" and the vector of RMSEs to sapply and storing the output in results:

results <- sapply(n, rmse)

This produces an error: "can't extract residuals from model". However, when I compute the "mean" manually on a single element selected with the index "[1]":

mean(rmse[1])

I get a decimal value that looks incorrect, and the SD is nothing more than "NA":

sd(rmse[1])
[1] NA
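
A small illustration of why this happens may help. Indexing with [1] selects a single element, so mean() just returns that one number, and sd() returns NA because a standard deviation needs at least two values. The vector rmse_demo below is a made-up stand-in for the real RMSE vector, used only for illustration:

```r
# Made-up stand-in for the vector of RMSEs (values are illustrative only)
rmse_demo <- c(2.4, 2.6, 2.5, 2.7, 2.3)

one_value <- rmse_demo[1]   # a single number, not the whole vector

mean(one_value)   # the mean of one number is that number itself
sd(one_value)     # NA: sd() needs at least two observations

# Summarising the whole vector is what the assignment asks for
mean(rmse_demo)
sd(rmse_demo)
```

Passing the full vector (no index) to mean() and sd() gives the summary over all replicates.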

I might have overlooked some critical factors here. A friendly pointer, with other approaches and tips for solving this part, would be highly appreciated.

Thanks,

Irvin
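
One possible approach, offered as a sketch rather than the official course solution: sapply() expects a function as its second argument, whereas rmse in the post is a numeric vector that was computed once from a dataset with n = 100 hard-coded. Wrapping the whole procedure in a function of n lets sapply() rerun it for each size. The function name rmse_summary and the reps argument are hypothetical names chosen for this example. The sketch also swaps the dplyr slice() calls for base row indexing to keep dependencies small; it assumes the MASS and caret packages are installed.

```r
library(caret)  # createDataPartition()

# Hypothetical helper (name not from the assignment): builds a dataset of
# size n, fits `reps` linear models on random 50/50 splits, and returns
# the mean and standard deviation of the resulting RMSEs.
rmse_summary <- function(n, reps = 100) {
  Sigma <- 9 * matrix(c(1.0, 0.5, 0.5, 1.0), 2, 2)
  dat <- data.frame(MASS::mvrnorm(n = n, mu = c(69, 69), Sigma = Sigma))
  names(dat) <- c("x", "y")

  rmse <- replicate(reps, {
    test_index <- as.vector(
      createDataPartition(dat$y, times = 1, p = 0.5, list = FALSE)
    )
    train_set <- dat[-test_index, ]   # base indexing instead of slice()
    test_set <- dat[test_index, ]
    fit <- lm(y ~ x, data = train_set)
    y_hat <- predict(fit, newdata = test_set)
    sqrt(mean((y_hat - test_set$y)^2))
  })

  # Summarise the whole vector, not a single element
  c(mean = mean(rmse), sd = sd(rmse))
}

n <- c(100, 500, 1000, 5000, 10000)
results <- sapply(n, rmse_summary)  # one column per value of n
colnames(results) <- n
results
```

Because rmse_summary() returns a named vector of length two, sapply() simplifies the output to a matrix with rows "mean" and "sd" and one column per dataset size. Calling set.seed() once before sapply() would make the run reproducible, if that is wanted.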
