I have been tasked with generating datasets of increasing size for the following exercise:
"Write a function that takes a size n, then (1) builds a dataset using the code provided in Q1 but with n observations instead of 100 and without the set.seed(1), (2) runs the replicate() loop that you wrote to answer Q1, which builds 100 linear models and returns a vector of RMSEs, and (3) calculates the mean and standard deviation. "
The dataset code from Q1 is below:
library(caret)  # createDataPartition
library(dplyr)  # %>% and slice
n <- c(100, 500, 1000, 5000, 10000)
Sigma <- 9*matrix(c(1.0, 0.5, 0.5, 1.0), 2, 2)
dat <- MASS::mvrnorm(n = 100, c(69, 69), Sigma) %>%
  data.frame() %>% setNames(c("x", "y"))
# 100 random 50/50 splits; each iteration returns one test-set RMSE
rmse <- replicate(100, {
  test_index <- createDataPartition(dat$y, times = 1, p = 0.5, list = FALSE)
  train_set <- dat %>% slice(-test_index)
  test_set <- dat %>% slice(test_index)
  fit <- lm(y ~ x, data = train_set)
  y_hat <- predict(fit, newdata = test_set)
  sqrt(mean((y_hat-test_set$y)^2))
})
The goal is to return, for each value in "n", the mean and standard deviation of the resulting RMSEs.
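For reference, the function the exercise describes could be sketched roughly as follows. This is only a sketch under assumptions: the name rmse_summary and the output labels avg and sd are hypothetical, and the as.vector() call is an addition that hands slice() a plain integer vector instead of the one-column matrix that createDataPartition returns with list = FALSE.

```r
library(caret)  # createDataPartition
library(dplyr)  # %>% and slice

# Hypothetical helper (the name is an assumption, not from the exercise):
# builds a dataset with n observations, collects 100 RMSEs, and
# returns their mean and standard deviation.
rmse_summary <- function(n) {
  Sigma <- 9 * matrix(c(1.0, 0.5, 0.5, 1.0), 2, 2)
  dat <- MASS::mvrnorm(n = n, c(69, 69), Sigma) %>%
    data.frame() %>%
    setNames(c("x", "y"))
  rmse <- replicate(100, {
    # as.vector() drops the matrix shape of the partition index
    test_index <- as.vector(
      createDataPartition(dat$y, times = 1, p = 0.5, list = FALSE)
    )
    train_set <- dat %>% slice(-test_index)
    test_set <- dat %>% slice(test_index)
    fit <- lm(y ~ x, data = train_set)
    y_hat <- predict(fit, newdata = test_set)
    sqrt(mean((y_hat - test_set$y)^2))
  })
  c(avg = mean(rmse), sd = sd(rmse))
}

n <- c(100, 500, 1000, 5000, 10000)
# sapply calls rmse_summary once per element of n
results <- sapply(n, rmse_summary)
```

Because sapply expects a function as its second argument, the helper is passed by name. The result should be a matrix with one column per value of n and the rows avg and sd.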
So far I have tried passing "n" and the RMSEs to sapply to collect the results:
results <- sapply(n, rmse)
This produces an error: "can't extract residuals from model". However, when I compute the mean manually on a single element selected with the index "[1]":
mean(rmse[1])
I get an incorrect decimal value, and the standard deviation is simply NA:
sd(rmse[1])
[1] NA
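The NA itself can be reproduced without any model. The sketch below uses made-up stand-in values (they are not results from the code above) to show that indexing with [1] leaves a single number, for which the mean is just that number and the sample standard deviation is undefined.

```r
# Stand-in values for illustration only; not output from the models above
rmse_example <- c(2.4, 2.6, 2.5, 2.7)

first_only <- rmse_example[1]  # a length-one vector
mean(first_only)               # simply returns that single value
sd(first_only)                 # NA: the sample SD needs at least two values

mean(rmse_example)             # average over all values
sd(rmse_example)               # defined, because there are several values
```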
I may have overlooked something critical here. Any pointers, alternative approaches, or tips for solving this part would be highly appreciated.
Thanks,
Irvin