Need help with a multiple regression model (& tests)

Hi guys, I'm trying to test some data with a multiple regression model.
So, let's assume I have this data:

"Fam.ID" "Income" "Consumption" "Status" "Qualification" "Family num" "Age" "Sex"
1 46200 40600 1 4 2 57 2
1 46200 40600 1 4 2 60 1
2 32340 30600 1 4 2 55 2
2 32340 30600 1 3 2 62 1
3 25200 20400 1 3 3 55 1
3 25200 20400 1 3 3 52 2
3 25200 20400 2 4 3 21 1
3 34100 33600 2 3 4 29 2
4 9880 10800 4 3 1 77 2
5 11950 11400 4 2 1 75 2
6 41100 20800 1 3 4 48 2
7 25596 8900 4 2 1 83 1
8 27400 18000 1 2 6 53 1
8 27400 18000 1 3 6 49 2
8 27400 18000 2 3 6 29 1
8 27400 18000 2 3 6 26 1
8 27400 18000 2 3 6 15 2
8 27400 18000 2 3 6 13 2
9 13500 13300 1 2 2 70 1
9 13500 13300 1 2 2 61 2

Of course the real data set is much bigger; this is just a sample. Anyhow, I need to analyze it, so I use

data<-read.table("dataTest.txt",sep=" ",header=T)
str(data)

But I need some of the columns to be factors, because, for example, Sex is a code (1 is male) and not a quantity.

data$Status=factor(data$Status)
data$Qualification=factor(data$Qualification)
data$Family.num=factor(data$Family.num)
data$Sex=factor(data$Sex)
attach(data)

Are these the right columns to convert to factors? Or should Age have been a factor and Family.num not? Then I try to get a correlation matrix.

cor(data)

But I get an error: 'x' must be numeric.

OK, but then how do I get a correlation matrix that leaves out the 4 factor columns? I can get R to skip one column with [,-1], but can I do that for several columns at once? Or should I create another object with just the selected columns?
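Both approaches work. A minimal sketch, assuming a cut-down stand-in for the posted data (only a few rows and columns are reproduced here, and the names cor_a, cor_b and num_cols are just for illustration):

```r
# Small stand-in for the posted sample (a few rows and columns only).
data <- data.frame(
  Fam.ID      = c(1, 2, 3, 4, 5, 6),
  Income      = c(46200, 32340, 25200, 9880, 11950, 41100),
  Consumption = c(40600, 30600, 20400, 10800, 11400, 20800),
  Status      = factor(c(1, 1, 2, 4, 4, 1)),
  Age         = c(57, 55, 21, 77, 75, 48),
  Sex         = factor(c(2, 2, 1, 2, 2, 2))
)

# Option 1: drop several columns at once with a vector of negative indices.
# Here columns 1, 4 and 6 are Fam.ID, Status and Sex.
cor_a <- cor(data[, -c(1, 4, 6)])

# Option 2: keep whichever columns are numeric, then exclude the ID by name.
# This avoids hard-coding column positions.
num_cols <- sapply(data, is.numeric)
cor_b <- cor(data[, num_cols & names(data) != "Fam.ID"])

print(cor_b)
```

Option 2 is usually less fragile, since it keeps working if the column order changes. No separate object is needed unless you want to reuse the subset.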

Another question is about the lm command itself. What should I do about the first column (the family ID)?

Should the command be

reg1=lm(Income~.,data=data[,-1])

or should it be

reg2=lm(Income~Consumption+Status+Qualification+Family.num+Age+Sex,data=data[,-1])

Or are they the same thing? And what does

reg3=lm(Income~1,data=data[,-1])

do? And should the [,-1] even be there?

So many questions, guys. Thanks in advance to anyone who can help me!

Another doubt is about the factors themselves. If I tell R that they are factors and not numeric, it correctly reports the levels, for example

$ Qualification: Factor w/ 3 levels

but then if I look at the summary, I get one fewer coefficient than there are levels for each factor:

Status2      2.255e+03  3.399e+02   6.635 3.34e-11 ***
Status3     -2.664e+03  5.933e+02  -4.490 7.15e-06 ***
Status4     -1.094e+03  4.467e+02  -2.450 0.014290 *

Where is Status1? Did R read it as a dummy variable? So many doubts, lol.

On this, I don't think you need to remove columns or variables from data that you don't refer to in your model formula. You can check this by running both of these models and comparing the results. Are they identical? (They should be.)
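A minimal sketch of that check, assuming a cut-down stand-in for the posted data (ten of the sample rows, fewer columns; the names reg1_full and reg2_full are just for illustration). It also covers the Income ~ 1 question, which fits an intercept-only model:

```r
# Stand-in for the posted sample: ten rows, a subset of the columns.
data <- data.frame(
  Fam.ID      = c(1, 2, 3, 3, 4, 5, 6, 7, 8, 9),
  Income      = c(46200, 32340, 25200, 34100, 9880, 11950, 41100, 25596, 27400, 13500),
  Consumption = c(40600, 30600, 20400, 33600, 10800, 11400, 20800, 8900, 18000, 13300),
  Age         = c(57, 55, 55, 29, 77, 75, 48, 83, 53, 70),
  Sex         = factor(c(2, 2, 1, 2, 2, 2, 2, 1, 1, 1))
)

# "." expands to every column of `data` other than the response.
reg1 <- lm(Income ~ ., data = data[, -1])
reg2 <- lm(Income ~ Consumption + Age + Sex, data = data[, -1])

# With an explicit formula, unused columns are ignored, so [, -1] is optional.
reg2_full <- lm(Income ~ Consumption + Age + Sex, data = data)

print(all.equal(coef(reg1), coef(reg2)))       # same model
print(all.equal(coef(reg2), coef(reg2_full)))  # same model

# With ".", keeping Fam.ID in the data makes it a predictor.
reg1_full <- lm(Income ~ ., data = data)
print(names(coef(reg1_full)))

# Intercept-only model: the single coefficient is the mean of Income.
reg3 <- lm(Income ~ 1, data = data)
print(all.equal(unname(coef(reg3)), mean(data$Income)))
```

So the [,-1] only matters for the dot form, where it keeps Fam.ID out of the model.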


I think this discussion is helpful:

When you run a regression with factor variables, R will automatically drop one of the categories (we can think of the dropped category as the baseline category). In your output with the Status variable, I suspect there are four categories in Status, but you will only see the estimates for three of the four categories (from the output you shared - these are Status2, Status3 and Status4). This is because Status1 is now the baseline category, and the estimate on Status2 (2.255e+03) gives you the effect of being in status 2 compared to status 1. Similarly, the estimate on Status3 is the effect of being in status 3 compared to status 1, etc.
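To see the baseline behaviour directly, here is a small sketch with made-up toy numbers (status and y are hypothetical, not taken from the posted data). It assumes R's default treatment contrasts, under which the intercept is the mean of the first level and every other coefficient is a difference from it:

```r
# Toy data: a four-level factor and a made-up response.
status <- factor(c(1, 1, 2, 2, 3, 3, 4, 4))
y      <- c(10, 12, 15, 17, 6, 8, 9, 11)

fit <- lm(y ~ status)

# Level 1 has no row of its own; it is absorbed into the intercept.
print(names(coef(fit)))

# Compare the coefficients with the group means.
means <- tapply(y, status, mean)
print(means)
print(coef(fit))

# relevel() picks a different baseline; here level 4 becomes the reference.
status4 <- relevel(status, ref = "4")
fit4 <- lm(y ~ status4)
print(levels(status4))
print(coef(fit4))
```

The fitted values are the same under either baseline; only the comparison each coefficient expresses changes.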

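Also, as @EconomiCurtis points out, you don't need to remove columns/variables from your dataset when the formula names its predictors explicitly, as in reg2. If your dataset is called data, then data = data is fine. The one exception is the Income ~ . form: the dot stands for every other column in the data, so Fam.ID would be included as a predictor unless you drop it.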