Regression for data frame

Hi,

There are two data sets.

  1. mtcars: This data set covers 32 different cars (observations) and records fuel consumption in miles per gallon (mpg) plus 10 aspects of car design, such as horsepower (hp) and weight (wt). We need to fit a regression model in which mpg is the dependent variable and wt and hp are the independent variables.

  2. state.x77: This data set covers the 50 states of the USA (50 observations). The variables are Population, Income, Illiteracy, Life Exp (life expectancy), Murder, HS Grad (high school graduates), Frost, and Area. The aim is to fit a regression model in which the dependent variable is Murder and Population, Illiteracy, Income, and Frost are the independent variables.

For 1, the following code works:

B <- lm(mpg ~ hp + wt, data=mtcars)
summary(B)

However, the following code doesn't work for 2:

A <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=state.x77)
summary(A)

Why? What is the difference between the two data sets?

I forgot to mention: please check the homework policy, if applicable.

To debug this type of problem, it helps to look at what there is to work with. First a diversion to clarify my usage.

One of the hard things to get used to in R is the concept that everything is an object that has properties. Some objects have properties that allow them to operate on other objects to produce new objects. Those are functions.

Think of R as school algebra writ large: f(x) = y, where the objects are f, a function; x, an object termed the argument (there may be several); and y, an object termed the value, which can be as simple as a single number (a length-one atomic vector) or a densely packed object with a multitude of data and labels.

And, because functions are also objects, they can be arguments to other functions, like the old g(f(x)) = y. (Trivia: this is called being a first-class object.)
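To make the "functions are objects" idea concrete, here is a minimal sketch. The names square and add_one are made up for illustration; sapply() is base R.

```r
# A function is an ordinary object: it can be stored under a name ...
square <- function(x) x^2

# ... and passed as an argument to another function.
# sapply() applies the function it receives to each element of 1:4.
squares <- sapply(1:4, square)
print(squares)

# Functions can also be nested, as in g(f(x)) = y.
add_one <- function(x) x + 1
result <- add_one(square(3))
print(result)
```

Here square plays the role of f, add_one the role of g, and result is the value y.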

Although there are function objects in R that operate like control statements in imperative/procedural languages, they are best used "under the hood." As it presents to users interactively, R is a functional programming language. Instead of saying

take this, take that, do this, then do that, then if the result is this one thing, do this other thing, but if not do something else and give me the answer

in the style of most common programming languages, R allows the user to say

use this function to take this argument and turn it into the value I want for a result
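As a rough illustration of the two styles, the sketch below computes the same average twice. The numbers in values are hypothetical, chosen only for the example.

```r
values <- c(21.0, 22.8, 18.7)  # hypothetical numbers, for illustration only

# Imperative style: spell out every step with a loop and an accumulator.
total <- 0
for (v in values) {
  total <- total + v
}
loop_mean <- total / length(values)

# Functional style: hand the argument to a function and get the value back.
fn_mean <- mean(values)

# Both routes arrive at the same value.
print(all.equal(loop_mean, fn_mean))
```

The loop still works in R, but the second form is the one you will see most often in interactive use.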

The roles in

A  <-  lm(Murder ~ Population + Illiteracy + Income + Frost, data=state.x77)

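consist of <-, a so-called primitive function that works as an assignment operator, sending the return value of lm to the new object A; lm, the function that fits the linear model; state.x77, the object supplied as the data argument; Murder, a column within state.x77 that serves as the response; and the ~ operator, which separates the response on its left from the predictor terms on its right in the formula passed to lm.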
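These roles can be inspected directly, because each piece is itself an object. A small sketch using only base R; the names f, response and predictors are just for illustration:

```r
# The formula is an object in its own right and can be stored under a name.
f <- Murder ~ Population + Illiteracy + Income + Frost
print(class(f))

# all.vars() lists the variable names in the formula:
# the response (left of ~) comes first, then the predictors (right of ~).
response   <- all.vars(f)[1]
predictors <- all.vars(f)[-1]
print(response)
print(predictors)

# The assignment operator and lm are functions too.
print(is.primitive(`<-`))
print(is.function(lm))
```

Note that building the formula does not look at any data yet; the names are only matched against the data argument when lm runs.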

What do we see when we run the command?

A <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=state.x77)
#> Error in model.frame.default(formula = Murder ~ Population + Illiteracy + : 'data' must be a data.frame, not a matrix or an array

Created on 2020-04-06 by the reprex package (v0.3.0)

This clearly points to state.x77 as the culprit. It's the wrong kind of object.

class(state.x77)
#> [1] "matrix"


matrix ≠ data frame
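That is the difference between the two data sets. A quick way to check it yourself, as a sketch with base R only (dollar_fails is just an illustrative name):

```r
# mtcars is a data frame, so lm() accepts it as the data argument.
print(is.data.frame(mtcars))

# state.x77 is a numeric matrix with named rows and columns, not a data frame.
print(is.matrix(state.x77))
print(is.data.frame(state.x77))
print(dim(state.x77))

# Columns of a matrix are reached by index or name inside [ , ] ...
murder <- state.x77[, "Murder"]

# ... while the $ operator, which works on data frames, fails on a matrix.
dollar_fails <- inherits(try(state.x77$Murder, silent = TRUE), "try-error")
print(dollar_fails)
```

I used is.matrix() and is.data.frame() rather than class() here because the exact output of class() for a matrix differs between R versions.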

So, what to do?

frame.x77 <- as.data.frame(state.x77)  # convert the matrix to a data frame
class(frame.x77)
#> [1] "data.frame"
A <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=frame.x77)
summary(A)
#> 
#> Call:
#> lm(formula = Murder ~ Population + Illiteracy + Income + Frost, 
#>     data = frame.x77)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -4.7960 -1.6495 -0.0811  1.4815  7.6210 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 1.235e+00  3.866e+00   0.319   0.7510    
#> Population  2.237e-04  9.052e-05   2.471   0.0173 *  
#> Illiteracy  4.143e+00  8.744e-01   4.738 2.19e-05 ***
#> Income      6.442e-05  6.837e-04   0.094   0.9253    
#> Frost       5.813e-04  1.005e-02   0.058   0.9541    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.535 on 45 degrees of freedom
#> Multiple R-squared:  0.567,  Adjusted R-squared:  0.5285 
#> F-statistic: 14.73 on 4 and 45 DF,  p-value: 9.133e-08


Ah, good.
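A variation on the same fix, as a sketch: convert only the columns the model needs, then pull the estimates out of the fitted object. The names vars, states and fit are illustrative; as.data.frame(), lm() and coef() are base R / stats.

```r
# Pick the five columns used by the model (state.x77 is still a matrix here).
vars <- c("Murder", "Population", "Illiteracy", "Income", "Frost")

# Convert just that slice of the matrix into a data frame.
states <- as.data.frame(state.x77[, vars])

# The same model as before, now fitted on a data frame.
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)

# coef() extracts the estimates as a named numeric vector.
print(coef(fit))
```

Subsetting first makes no difference to the model; it just keeps the columns you are not using out of the working data frame.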

Two very good resources that I always recommend are

  1. R for Data Science -- Introductory +
  2. R Cookbook, 2nd ed. -- Intermediate

The links are the free online copies, but both are well worth the cost of a paper version.

In a way, coming to R without a programming background is an advantage, because R does things differently than most programming languages. The brief explainer earlier in this thread covers the main idea.


Thanks for your reply. It is not homework. I am learning R on my own and I don't have a background in any programming language. I am using the book "R in Action: Data Analysis and Graphics with R" by Robert I. Kabacoff.

Do you recommend any other books?
