Error in linear regression

Hello again,
I made a post recently about having some issues when performing one-way ANOVAs. I got a lot of help and managed to solve it, but now I am experiencing another issue when performing a simple linear regression with the same set of data. This is my code:

library(readxl)
Species_measurement_merged <- read_excel("Species_measurement_merged.xlsx")
View(Species_measurement_merged)
data <- Species_measurement_merged

library(tidyr) #Needed for all %>% in the code
library(tidyverse) #Needed for all %>% in the code
data <- Species_measurement_merged %>%
mutate(across(where(is.character), ~na_if(., "NA"))) %>%
type.convert()
warnings()
Date.WPD.lm <- lm(Date ~ WaterPot_Dawn, data=data)

The error I get is:

Date.WPD.lm <- lm(Date ~ WaterPot_Dawn, data=data)
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
NA/NaN/Inf in 'y'
In addition: Warning message:
In storage.mode(v) <- "double" : NAs introduced by coercion

I previously used the mutate command to eliminate the issue of the invalid NAs, but now apparently that error has come back, despite the fact that the ANOVAs are still working. When executing the mutate command I get this warning repeated 18 times:

warnings()
Warning messages:
1: In type.convert.default(x[[i]], ...) :
'as.is' should be specified by the caller; using TRUE

Any idea what could be wrong? I did not expect to see this error again, so this is quite confusing.

Please run the code

data <- Species_measurement_merged %>%
mutate(across(where(is.character), ~na_if(., "NA"))) %>%
type.convert()

and then run

dput(head(data,20))

Post the output of the dput() function.
You should check the class of the Date column. Run

class(data$Date)

Is Date character?


Some remarks:

If you refer to an earlier question, please provide a link to that question.
Of course we can find it, but more help from you leads to more help from us.

The first error/warning you get is often the most important.
If the mutate is not done correctly, how reliable would the subsequent regression be?

I changed the call to the type.convert function, using the data you provided in your last question.
With that change, the creation of data works without errors or warnings.

However, the regression still fails because the Date variable is not numeric (and in this sample dataset it takes only a single value).

You will find the code I used below. My advice is to have a good look at the data before applying the software.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(tibble)

Species_measurement_merged <- structure(list(Art=c("Acerbuergerianum","Acerbuergerianum",
"Acerbuergerianum","Acerbuergerianum","Acerbuergerianum",
"Acerrufinerve","Acerrufinerve","Acerrufinerve","Acerrufinerve",
"Acerrufinerve","Carpinusjaponica","Carpinusjaponica","Carpinusjaponica",
"Carpinusjaponica","Carpinusjaponica","Celtisaustralis",
"Celtisaustralis","Celtisaustralis","Celtisaustralis","Celtisaustralis"
),Date=c("AMay","AMay","AMay","AMay","AMay","AMay","AMay",
"AMay","AMay","AMay","AMay","AMay","AMay","AMay","AMay",
"AMay","AMay","AMay","AMay","AMay"),Ind=c(1,2,3,4,
5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),WaterPot_Dawn=c("NA",
"NA","0.41","0.59599999999999997","0.78","NA","NA","0.62",
"0.75","0.44800000000000001","0.12","0.48799999999999999",
"0.73","NA","NA","0.35499999999999998","0.28000000000000003",
"0.41","0.27800000000000002","0.505"),WaterPot_Noon=c("NA",
"NA","0.14000000000000001","0.28000000000000003","0.31","NA",
"NA","0.255","0.42","0.182","0.31","0.62","0.91","NA",
"NA","0.85","0.93","0.52","NA","NA"),ChloroCont=c("NA",
"NA","21.2","18.100000000000001","18.399999999999999","NA",
"NA","26.1","24.7","26","27.4","24.3","24.8","26.7","23.9",
"31.6","6.2","17.2","29.5","18.7"),Leaf_area=c("NA","NA",
"52.6","63.29","22.97","NA","NA","332","318.04000000000002",
"338.9","41.76","56.04","47.83","65.03","56.11","5.92",
"2.99","7","5.95","3.57"),Fresh_weight=c("NA","NA","1.1599999999999999",
"1.26","0.79","NA","NA","7.23","5.84","5.05","1.06","1.29",
"1.22","1.46","1.24","0.6","0.56999999999999995","0.61",
"0.62","0.6"),Dry_weight=c("NA","NA","0.26","0.27","0.1",
"NA","NA","2.25","1.84","1.6","0.31","0.39","0.3","0.44",
"0.37","4.7E-2","2.4E-2","3.5000000000000003E-2","4.9000000000000002E-2",
"3.9E-2"),DBH=c("NA","NA","18","14.8","9.8000000000000007",
"NA","NA","10","10","10.199999999999999","11.4","10","11.6",
"11.2","9.6","12","11.8","11.8","13.2","13.7"),Height=c("NA",
"NA","371","397","303","NA","NA","352","309","337","251",
"293","313","307","270","372","379","372","362","385"
),'1st_leaf'=c("NA","NA","189","179.5","185","NA","NA",
"182.5","169","178","157","173","195","168","164","196",
"210","185","189","195"),Axis_1=c("NA","NA","123","146",
"80","NA","NA","87","61","68","95","116","118","94",
"124","50","63","67","65","70"),Axis_2=c("NA","NA",
"112","106","90","NA","NA","92","58","63","81","104",
"133","105","109","94","68","69","59","53"),Canopy_size=c("NA",
"NA","182","217.5","118","NA","NA","169.5","140","159",
"94","120","118","139","106","176","169","187","173",
"190"),Leaf_dry_cont=c("NA","NA","0.22413793103448279",
"0.2142857142857143","0.12658227848101267","NA","NA","0.31120331950207469",
"0.31506849315068497","0.31683168316831684","0.29245283018867924",
"0.30232558139534882","0.24590163934426229","0.30136986301369861",
"0.29838709677419356","7.8333333333333338E-2","4.2105263157894743E-2",
"5.7377049180327877E-2","7.9032258064516137E-2","6.5000000000000002E-2"
),Crown_area=c("NA","NA","10502268.842726992","14099593.493017135",
"3558796.1579865171","NA","NA","5682839.5174491908","2074791.564234795",
"2853219.580731479","3029877.6188281402","6064027.8036651621",
"7757187.0699222777","5746726.945652592","6001262.9712366425",
"3464967.2573993024","3032667.3531045276","3621213.3208280397",
"2779073.8053165548","2952678.2153539266"),Specific_leaf=c("NA",
"NA","202.30769230769229","234.40740740740739","229.7","NA",
"NA","147.55555555555554","172.84782608695653","211.81249999999997",
"134.70967741935485","143.69230769230768","159.43333333333334",
"147.79545454545456","151.64864864864865","125.95744680851064",
"124.58333333333334","199.99999999999997","121.42857142857143",
"91.538461538461533")),row.names=c(NA,-20L),class=c("tbl_df",
"tbl","data.frame")) %>% as_tibble()

data <- Species_measurement_merged %>%
mutate(across(where(is.character), ~na_if(., "NA"))) %>%
type.convert(as.is=F)

Date.WPD.lm <- lm(Date ~ WaterPot_Dawn, data=data)
#> Warning in model.response(mf, "numeric"): using type = "numeric" with a factor
#> response will be ignored
#> Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors

sapply(data,class)
#>           Art          Date           Ind WaterPot_Dawn WaterPot_Noon 
#>      "factor"      "factor"     "integer"     "numeric"     "numeric" 
#>    ChloroCont     Leaf_area  Fresh_weight    Dry_weight           DBH 
#>     "numeric"     "numeric"     "numeric"     "numeric"     "numeric" 
#>        Height      1st_leaf        Axis_1        Axis_2   Canopy_size 
#>     "integer"     "numeric"     "integer"     "integer"     "numeric" 
#> Leaf_dry_cont    Crown_area Specific_leaf 
#>     "numeric"     "numeric"     "numeric"
Created on 2023-01-02 with reprex v2.0.2
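As a side note: the as.is argument is exactly what the warning in your opening post was about. Leaving it unspecified makes type.convert() fall back to as.is = TRUE (with that warning), which keeps character columns such as Date as character, while as.is = FALSE turns them into factors. Specifying it either way silences the warning. A minimal sketch of the difference, reusing the Species_measurement_merged tibble above (the chr_version/fct_version names are just illustrative):

chr_version <- Species_measurement_merged %>%
  mutate(across(where(is.character), ~ na_if(., "NA"))) %>%
  type.convert(as.is = TRUE)    # character columns stay character

fct_version <- Species_measurement_merged %>%
  mutate(across(where(is.character), ~ na_if(., "NA"))) %>%
  type.convert(as.is = FALSE)   # character columns become factors

class(chr_version$Date)  # "character"
class(fct_version$Date)  # "factor"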

I have used the same code before, and Date is indeed character. Would changing it to numeric fix this? Here is the link to my previous post (sorry for not providing it at first, I am rather new at this):

I have added the as.is=F as per your suggestion and now it runs with no errors. Thank you. I will change the Date characters to numeric values and try again.

@HanOostdijk @FJCC I changed the characters in Date from month names to numbers and tried again. Now I get the linear regression with no errors, but with a couple of warnings:

Date.WPD.lm <- lm(Date ~ WaterPot_Dawn, data=dataNUM)
Warning messages:
1: In model.response(mf, "numeric") :
using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : '-' not meaningful for factors

Date.WPD.lm
Call:
lm(formula = Date ~ WaterPot_Dawn, data = dataNUM)
Coefficients:
(Intercept) WaterPot_Dawn
3.788 -2.471

Are these warnings normal? Or have I made another mistake? Thank you.

Having a factor as the left-hand-side variable can be okay, but because its numeric value isn't meaningful, the residuals aren't meaningful either.

I am not sure I follow. Do you mean that the linear regression has not found any significant relation between the two variables? As if the numeric variable that represents the date has no impact on the other one? Thank you.

It's unclear what form your dates are in after the changes you mentioned. It seems they are still factors when they should perhaps be properly numeric.
An easy way to check is to use str() on them.

They appear to be of a type named "int", but the values I assigned are indeed numbers. There is a different number per month (May=1, June=2, July=3, August=4, September=5). Here is the output from str():

str(dataNUM)
tibble [450 × 18] (S3: tbl_df/tbl/data.frame)
 $ Art          : Factor w/ 20 levels "Acer buergerianum",..: 1 1 1 1 1 2 2 2 2 2 ...
 $ Date         : int [1:450] 1 1 1 1 1 1 1 1 1 1 ...
 $ Ind          : int [1:450] 1 2 3 4 5 1 2 3 4 5 ...
 $ WaterPot_Dawn: num [1:450] NA NA 0.41 0.596 0.78 NA NA 0.62 0.75 0.448 ...
 $ WaterPot_Noon: num [1:450] NA NA 0.14 0.28 0.31 NA NA 0.255 0.42 0.182 ...

All other variables are num too. I don't see any other way to do this, so does this mean I should do a different type of analysis other than a simple linear regression?

Do you really mean that the outcome for September is five times the outcome for May? That seems unlikely. If it is what you mean, then a linear regression is fine. If the months are just different outcomes, you may want an ordered logit or a multinomial logit.
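If you go that route, here is a minimal sketch of both, assuming dataNUM still holds the 1–5 month codes you described (the DateOrd, ord_fit and mult_fit names are just illustrative; polr() is in MASS and multinom() in nnet, which normally come installed with R):

library(MASS)   # polr() for an ordered (proportional-odds) logit; note MASS masks dplyr::select()
library(nnet)   # multinom() for a multinomial logit

# Treat the month as an ordered categorical outcome (May < June < ... < September)
dataNUM$DateOrd <- factor(dataNUM$Date, levels = 1:5,
                          labels = c("May", "June", "July", "August", "September"),
                          ordered = TRUE)

ord_fit  <- polr(DateOrd ~ WaterPot_Dawn, data = dataNUM, Hess = TRUE)
mult_fit <- multinom(DateOrd ~ WaterPot_Dawn, data = dataNUM)

summary(ord_fit)
summary(mult_fit)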

The outcome of September is not 5 times the outcome of May. I wanted to analyse whether the month when the data was taken had any effect on the data itself. Would an ordered logit or a multinomial logit work for this? Thank you.

Yes, an ordered or multinomial logit would work.

But are you asking whether the month affects the data or whether the data affects the month?

I will look into those two then.
The data I am trying to analyse was taken from May to September. I want to test whether there are any differences in the data that could be explained by the passing of time, i.e., whether the variability in the data could be explained by the month it was taken in.
Thank you very much for your persistent help.

Then this may be easier. The explanatory variables go on the right in a regression. The data being explained goes on the left. There is nothing problematic with having a factor on the right. lm() will translate a factor into dummy variables.
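For instance, a minimal sketch of that mechanism using your dataNUM object (nothing here is specific to your data beyond the column names):

# lm() expands a factor on the right-hand side into 0/1 dummy columns,
# one for each level except the reference level; model.matrix() shows
# the design matrix lm() would build:
head(model.matrix(~ factor(Date), data = dataNUM))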


I see. I will try again as per your suggestion. Thank you very much.

It seems it finally worked properly. These are the results:

Date.WPD.lm <- lm(WaterPot_Dawn ~ Date, data=dataNUM)

Date.WPD.lm
Call:
lm(formula = WaterPot_Dawn ~ Date, data = dataNUM)
Coefficients:
(Intercept) Date
0.61622 -0.06975

summary(Date.WPD.lm)
Call:
lm(formula = WaterPot_Dawn ~ Date, data = dataNUM)
Residuals:
Min 1Q Median 3Q Max
-0.42647 -0.13647 -0.02497 0.09991 0.83528
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.61622 0.03059 20.145 < 2e-16 ***
Date -0.06975 0.01017 -6.861 6.5e-11 ***


Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1919 on 226 degrees of freedom
(222 observations deleted due to missingness)
Multiple R-squared: 0.1724, Adjusted R-squared: 0.1687
F-statistic: 47.07 on 1 and 226 DF, p-value: 6.501e-11

Do these results look OK, or is there anything concerning? The p-value is 6.5e-11 = 6.5 × 10^-11 = 0.000000000065, which would mean the relation between the measurements and the date (predictor) is significant, right? Thank you.

The R-squared is relatively low, so the model hasn't revealed a particularly powerful relationship (one explaining a lot of the variance). The significance is a sign that the weak signal you found is unlikely to be a pure accident, so it is potentially worth a closer look. A linear model is good at drawing straight lines on charts; have you charted your data and the fit to look at it? Unless September has the highest WaterPot_Dawn and May the lowest, with a monotonic trend in between, I would expect rather gnarly residuals.
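For example, a minimal base-graphics sketch of that kind of check (ggplot2 would do just as well; Date.WPD.lm is your WaterPot_Dawn ~ Date fit):

# Raw data with the fitted straight line
plot(WaterPot_Dawn ~ Date, data = dataNUM,
     xlab = "Month (1 = May ... 5 = September)", ylab = "WaterPot_Dawn")
abline(Date.WPD.lm, col = "red")

# Residuals against fitted values, to see how "gnarly" they are
plot(fitted(Date.WPD.lm), resid(Date.WPD.lm),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)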


I have not charted my data, and I do not know what that is. I will have to look into it. Thank you for your help.

It looks like Date is no longer coded as a factor. If you put it in as a factor, you'll get an estimate for each month, if that's what you want.
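A minimal sketch of that, assuming the 1–5 codes still stand for May–September (the DateF and Date.WPD.lm.f names are just illustrative):

# Turn the month codes back into a labelled factor and refit;
# each month then gets its own coefficient relative to the baseline (May):
dataNUM$DateF <- factor(dataNUM$Date, levels = 1:5,
                        labels = c("May", "June", "July", "August", "September"))

Date.WPD.lm.f <- lm(WaterPot_Dawn ~ DateF, data = dataNUM)
summary(Date.WPD.lm.f)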
