Clarification of using models

I'm having some problems building an appropriate statistical analysis of the following data sheet where I have species average counts with their prey average counts and the threats values in the different months.
I wanted to look whether is there any relationship with the shorebirds species with their prey or threats. Also, is it possible to run a General Linear Model (GLM) with these three parameters where shorebirds can be a dependent variable and prey and threats-both of them can be predicted variables (as an interaction term)?

Any tips would be really helpful around fixing these issues.
Thank you!

months prey.avg shorebirds.avg threats
October 646.27 14.33 50
November 1292.55 7.67 64
December 4093.06 14.75 51
January 4954.76 9.75 65
February 5816.46 3.25 68
March 5816.46 18.17 61

I think you can build the models you want using the lm() function. Here are a couple of examples.

DF <- read.csv("~/R/Play/Dummy.csv")
DF
#>     months prey.avg shorebirds.avg threats
#> 1  October   646.27          14.33      50
#> 2 November  1292.55           7.67      64
#> 3 December  4093.06          14.75      51
#> 4  January  4954.76           9.75      65
#> 5 February  5816.46           3.25      68
#> 6    March  5816.46          18.17      61
FIT_prey <- lm(shorebirds.avg ~ prey.avg, data = DF)
summary(FIT_prey)
#> 
#> Call:
#> lm(formula = shorebirds.avg ~ prey.avg, data = DF)
#> 
#> Residuals:
#>      1      2      3      4      5      6 
#>  2.573 -3.997  3.475 -1.404 -7.784  7.136 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)  
#> (Intercept) 11.847683   5.156813   2.297   0.0832 .
#> prey.avg    -0.000140   0.001199  -0.117   0.9127  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 6.086 on 4 degrees of freedom
#> Multiple R-squared:  0.003398,   Adjusted R-squared:  -0.2458 
#> F-statistic: 0.01364 on 1 and 4 DF,  p-value: 0.9127

#Both variables plus their interaction
FIT_both <- lm(shorebirds.avg ~ prey.avg * threats, data = DF)

Created on 2020-05-30 by the reprex package (v0.2.1)

1 Like

These data were collected directly from the field by me. Actually, here I just putted all of the average values for every months. So, the average values of Prey for Feb and March are similar in that way.

I'd be curious to see the raw data beneath the aggregation

1 Like

Thanks, and I am sharing the raw data in following. Also, while I was checking the raw data again by keeping mind your confusions then found some mismatches.
Anyway, I have corrected it now. And, here the habitat points are representing the replications in six months along with the prey/benthic density (in 10 cm3), observed species and the threats responses at that sites.
So, I want to look whether is there any relationship in between the observed species and the benhtic/prey density. Besides, I want to also look whether there is any relationship in between the observed species and the benthos with an interaction with the threats.

habitat.points benthos.10cm3 obs.shorebirds threats
HP1 6 48 13
HP2 6 33 4
HP3 0 17 5
HP4 5 26 5
HP5 0 3 3
HP6 0 9 3
HP7 0 11 1
HP8 0 13 3
HP9 0 12 4
HP10 0 0 9
HP1 6 27 5
HP2 0 19 2
HP3 0 11 1
HP4 5 5 1
HP5 0 2 11
HP6 0 0 15
HP7 7 9 7
HP8 4 10 6
HP9 0 7 6
HP10 5 10 0
HP1 0 33 10
HP2 4 28 4
HP3 5 17 6
HP4 8 18 3
HP5 7 11 2
HP6 3 9 3
HP7 2 16 3
HP8 0 22 6
HP9 4 15 5
HP10 3 8 9
HP1 4 29 7
HP2 7 16 9
HP3 10 23 3
HP4 3 10 11
HP5 7 0 11
HP6 3 4 14
HP7 2 7 3
HP8 0 8 2
HP9 4 11 1
HP10 3 9 4
HP1 17 13 9
HP2 6 5 12
HP3 2 4 9
HP4 5 6 2
HP5 1 0 6
HP6 6 0 11
HP7 2 3 5
HP8 0 5 4
HP9 3 3 5
HP10 4 0 5
HP1 5 48 7
HP2 15 39 5
HP3 5 21 5
HP4 3 23 5
HP5 4 13 8
HP6 5 7 11
HP7 2 18 3
HP8 4 27 11
HP9 3 17 4
HP10 0 5 2
> FIT_benthos <- lm(obs.shorebirds~benthos.10cm3, rawdat)
> summary(FIT_benthos)

Call:
lm(formula = obs.shorebirds ~ benthos.10cm3, data = rawdat)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.107  -8.060  -2.388   5.979  32.956 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     9.8881     2.0490   4.826 1.05e-05 ***
benthos.10cm3   1.0312     0.4133   2.495   0.0155 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.97 on 58 degrees of freedom
Multiple R-squared:  0.09695,	Adjusted R-squared:  0.08138 
F-statistic: 6.227 on 1 and 58 DF,  p-value: 0.01545

> FIT_BOTH <- lm(obs.shorebirds ~ benthos.10cm3+threats, data = rawdat)
> summary(FIT_BOTH)

Call:
lm(formula = obs.shorebirds ~ benthos.10cm3 + threats, data = rawdat)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.828  -7.220  -2.153   5.355  33.817 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    11.4349     3.0099   3.799 0.000355 ***
benthos.10cm3   1.0751     0.4197   2.561 0.013093 *  
threats        -0.2848     0.4044  -0.704 0.484149    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.02 on 57 degrees of freedom
Multiple R-squared:  0.1047,	Adjusted R-squared:  0.07332 
F-statistic: 3.334 on 2 and 57 DF,  p-value: 0.04272

But, I am confused to use them as the data of the variables were over dispersed and did not show normality or any linear pattern.
Any suggestions to look the above mentioned relationships or interactions will be helpful for me.

What do the habitat points represent? I'm guessing these aren't locations so far apart that they are on different countries and so are independent , are they regular or irregulary spaced, i.e are they defined by latitudes and longitudes and are they near enough to point locations or have a known area?

1 Like

Here, these habitat points are situated in an island and selected by maintaining same distances (with 1 km intervals). These sampling or habitat points (also defined by latitudes and longitudes) were used for observing the species and benthic samples were collected from the same sites and also threats were observed there. These ten points were replicated in a same way for continuous six months of winter.

I hope this will give you a sense on the data and the selective variables. I will look forward to the further suggestions on it.

thanks, would it be straightforward to share the latitudes and longitudes in question?

Thanks for the reply, but have a little bit confusion on it.
Are you suggesting to input the coordinates besides the habitat points of my question?

if the latitudes and longditutes are regularly space 1km to form a normal sort of lattice, then probably the particulars of the lattitudes and longditudes arent so important, but it would be good to at least understand the arrangement, which location is closer / further from other locations, as you may expect the relationships to be stronger and weaker.
If you can share whatever info you have on the lattitudes and longditutes and the HP labels, I could look at using that. you wouldnt need to integrate it with your measurement data direcftly per se, having it as a standalone frame would be enough for me to look at.

1 Like

Here it is,

Habitat Points Lat Long
HP-01 20°35'26.75"N 92°19'41.23"E
HP-02 20°35'57.48"N 92°19'54.78"E
HP-03 20°36'22.92"N 92°19'44.81"E
HP-04 20°37'0.73"N 92°19'34.85"E
HP-05 20°37'58.44"N 92°19'2.16"E
HP-06 20°37'26.82"N 92°19'10.98"E
HP-07 20°36'57.90"N 92°19'31.32"E
HP-08 20°36'25.16"N 92°19'33.97"E
HP-09 20°35'54.60"N 92°19'24.72"E
HP-10 20°37'38.04"N 92°19'35.64"E

thank, so far I used this code to look at it

mapcoords <- tribble(
  ~HPCODE , ~lat,~long,
"HP-01",20.352675,	92.194123,
"HP-02",20.355748,	92.195478,
"HP-03",20.362292,	92.194481,
"HP-04",20.37073,	92.193485,
"HP-05",20.375844,	92.19216,
"HP-06",20.372682,	92.191098,
"HP-07",20.365790,	92.193132,
"HP-08",20.362516,	92.193397,
"HP-09",20.355460,	92.192472,
"HP-10",20.373804,	92.193564)

library(leaflet)
m <- leaflet(data=mapcoords) %>%
  addTiles() %>%  # Add default OpenStreetMap map tiles
  addMarkers(lng=~long, lat=~lat, popup=~HPCODE)

zoomed out


zoomed in

Does that seem right ?

1 Like

yes, those are completely correct

Any further suggestions or recommendations on my confusions about the variables from my attached data sheet?

I fit a rather poor quality linear model

library(tidyverse)
library(janitor)
library(gridExtra)
library(grid)
library(broom)
library(cowplot)

mapcoords <- tribble(
  ~habitat_points , ~lat,~long,
"HP1",20.352675,	92.194123,
"HP2",20.355748,	92.195478,
"HP3",20.362292,	92.194481,
"HP4",20.37073,	92.193485,
"HP5",20.375844,	92.19216,
"HP6",20.372682,	92.191098,
"HP7",20.365790,	92.193132,
"HP8",20.362516,	92.193397,
"HP9",20.355460,	92.192472,
"HP10",20.373804,	92.193564)

input_txt <- "habitat_points	benthos.10cm3	obs.shorebirds	threats
HP1	6	48	13
HP2	6	33	4
HP3	0	17	5
HP4	5	26	5
HP5	0	3	3
HP6	0	9	3
HP7	0	11	1
HP8	0	13	3
HP9	0	12	4
HP10	0	0	9
HP1	6	27	5
HP2	0	19	2
HP3	0	11	1
HP4	5	5	1
HP5	0	2	11
HP6	0	0	15
HP7	7	9	7
HP8	4	10	6
HP9	0	7	6
HP10	5	10	0
HP1	0	33	10
HP2	4	28	4
HP3	5	17	6
HP4	8	18	3
HP5	7	11	2
HP6	3	9	3
HP7	2	16	3
HP8	0	22	6
HP9	4	15	5
HP10	3	8	9
HP1	4	29	7
HP2	7	16	9
HP3	10	23	3
HP4	3	10	11
HP5	7	0	11
HP6	3	4	14
HP7	2	7	3
HP8	0	8	2
HP9	4	11	1
HP10	3	9	4
HP1	17	13	9
HP2	6	5	12
HP3	2	4	9
HP4	5	6	2
HP5	1	0	6
HP6	6	0	11
HP7	2	3	5
HP8	0	5	4
HP9	3	3	5
HP10	4	0	5
HP1	5	48	7
HP2	15	39	5
HP3	5	21	5
HP4	3	23	5
HP5	4	13	8
HP6	5	7	11
HP7	2	18	3
HP8	4	27	11
HP9	3	17	4
HP10	0	5	2"
my_temp<-tempfile()
writeLines(input_txt,my_temp)
input_df <- read_delim(my_temp,"\t") %>% clean_names() %>%
  group_by(habitat_points) %>%
  mutate(time=row_number(),
         ratio = benthos_10cm3/obs_shorebirds) %>% ungroup() %>% mutate_all(.funs = ~ifelse(is.infinite(.),NA,.)) %>% na.omit() 
input_df_2 <- input_df
input_df_2$time <- input_df$time + 1
names(input_df_2) <- paste0("lag1_",names(input_df))
input_df_2<-rename(input_df_2,
                   time = lag1_time,
                   habitat_points = lag1_habitat_points)
input_df_3 <- input_df
input_df_3$time <- input_df$time + 2
names(input_df_3) <- paste0("lag2_",names(input_df))
input_df_3<-rename(input_df_3,
                   time = lag2_time,
                   habitat_points = lag2_habitat_points)
full_df <- left_join(input_df,mapcoords ) %>% left_join(input_df_2)%>% left_join(input_df_3)
# library(leaflet)
# m <- leaflet(data=mapcoords) %>%
#   addTiles() %>%  # Add default OpenStreetMap map tiles
#   addMarkers(lng=~long, lat=~lat, popup=~HPCODE)

  
ratio_them <- function(x,y){
  full_df[[paste0("ratio",x,"_",y)]] <<- full_df[[x]] / full_df[[y]]
}

# ratio_them("lag1_benthos_10cm3","lag2_benthos_10cm3")

ratio_them("lag1_obs_shorebirds","lag2_obs_shorebirds")
# 
# ratio_them("lag1_threats","lag2_threats")
# 
# ratio_them("lag1_ratio","lag2_ratio")

lm_bird_target <- lm(obs_shorebirds ~  
                       lat + long +
                       # lag1_benthos_10cm3 +
                       # lag1_obs_shorebirds +
                       lag1_threats +
                       lag1_ratio 
                       # +lag2_benthos_10cm3 + 
                       # lag2_obs_shorebirds +
                       # lag2_threats + lag2_ratio +
                       # ratiolag1_obs_shorebirds_lag2_obs_shorebirds 
                     ,data = full_df)

(smry_lm <- summary(lm_bird_target))
(glanced_lm <- broom::glance(smry_lm))

full_df$predicted_birds_lm <- predict(lm_bird_target,newdata = full_df)

to_plot_df <- full_df %>% filter(time>1) %>% select(habitat_points,
                                                    time,
                                                    predicted_birds_lm,
                                                    obs_shorebirds) %>%
  pivot_longer(cols=3:4,names_to="measurement",
               values_to = "value")
p1<-ggplot(data=to_plot_df,aes(x=habitat_points,y=value,color=measurement),alpha=.5) + 
  geom_point()

p2<-ggplot(data=to_plot_df,aes(x=time,y=value,color=measurement),alpha=.5) + 
  geom_point()



(gt_glm_smry<-tableGrob(glanced_lm %>% mutate_if(is.numeric,round,digits=3)))
(gt_glm<-tableGrob(tidy(smry_lm) %>% mutate_if(is.numeric,round,digits=3)))
cowplot::plot_grid(p1,
                   p2,
                   gt_glm_smry,
                   gt_glm,
                   nrow=4,
                   ncol=1,
                   labels=c("predicting birds at habitats (overplot time)",
                                  "predicting birds at time (overplot habitats)",
                            "summary of linear model"))

1 Like

Thank you so much for your efforts but I have some confusions on it:

-is it necessary to account the habitat points along with other three variables (species/prey/threats)? or, can it also be possible to see the interactive relationship without the habitat points.
-could you please slightly interpret the graphs that has been produced here? (as I did not use the cowplot/overplot yet and a beginner learner of R so, facing struggle to go through with it)
-after running or observing this poor quality outcomes of the linear model, will you suggest or recommend to use it or will it be better to use any other non-parametric test or simple correlation with these data sets?

Sorry, for asking too many questions and taking your time but I just want to clear about my confusions. Hope you will consider that.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

Thank you for your suggestions and examples.

So, I tried the lm() function but a little bit confused about it's interpretation as the p values are not significant (here, p>0.05) and also other values of the models are not that much supportive. Could you suggest anything on this issue that how can I deal with it or interpret it in a better way?

FIT_prey <- lm(shorebirds.avg~prey.avg, nwformdat)
summary(FIT_prey)

Call:
lm(formula = shorebirds.avg ~ prey.avg, data = nwformdat)

Residuals:
     1      2      3      4      5      6 
 2.573 -3.997  3.475 -1.404 -7.784  7.136 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) 11.847683   5.156813   2.297   0.0832 .
prey.avg    -0.000140   0.001199  -0.117   0.9127  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.086 on 4 degrees of freedom
Multiple R-squared:  0.003398,	Adjusted R-squared:  -0.2458 
F-statistic: 0.01364 on 1 and 4 DF,  p-value: 0.9127
#Both variables plus their interaction
FIT_both <- lm(shorebirds.avg ~ prey.avg * threats, data = nwformdat)
summary(FIT_both)

Call:
lm(formula = shorebirds.avg ~ prey.avg * threats, data = nwformdat)

Residuals:
       1        2        3        4        5        6 
 1.09777 -0.05455 -3.51832  0.73642 -3.69853  5.43722 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       3.057e+01  3.767e+01   0.812    0.502
prey.avg          5.599e-03  1.019e-02   0.549    0.638
threats          -3.683e-01  6.637e-01  -0.555    0.635
prey.avg:threats -7.874e-05  1.695e-04  -0.464    0.688

Residual standard error: 5.356 on 2 degrees of freedom
Multiple R-squared:  0.6141,	Adjusted R-squared:  0.03519 
F-statistic: 1.061 on 3 and 2 DF,  p-value: 0.5188

I am very hesitant to comment on data about which I know nothing other than the raw numbers. I will make a few comments but you should not treat them as well informed analysis.

  1. Fitting two variables plus their interaction on only six data points brings a big risk of over fitting. You can see this in the difference between the Multiple R-Squared and Adjusted R-squared for that model.
  2. If you plot shorebirds.avg against the other two variables individually you will see that there is a hint of a relationship with threats and nothing with prey.avg. The results from lm() will convey the same message. If further investigation could pursue only one of those variables, threats seems the more promising candidate.
  3. In general, you should not treat 0.05 as a hard threshold for p values. There is lots of literature explaining why that is an abuse of the p value.
1 Like