Categorize variables for a specific database's logistic model

nanda2021 · April 30, 2021, 6:22am

Good afternoon people,

I have the following problem with a database.
In this database, I have the number of dengue cases reported in each neighborhood of a city. These notifications are reported by month and year, along with other numerical variables.

So, the database is structured like this:

neighborhood / date / cases / precipitation / temperature
xxxx / 2014-01-01 / 3 / 115.4 / 35.5
xxxx / 2014-02-01 / 2 / 118.4 / 34.8
xxxx / 2014-03-01 / 0 / 156.4 / 33.9
xxxx / 2014-04-01 / 1 / 105.4 / 25.6
xxxx / 2014-05-01 / 6 / 15.4 / 32.1
xxxx / 2014-06-01 / 7 / 135.4 / 30.0
... ... ... ... ... ... ...
xxxx / 2014-12-01 / 8 / 115.4 / 35.6
xxxx / 2015-01-01 / 7 / 115.4 / 35.5
xxxx / 2015-02-01 / 10 / 118.4 / 34.8
xxxx / 2015-03-01 / 0 / 156.4 / 33.9
xxxx / 2015-04-01 / 15 / 105.4 / 25.6
xxxx / 2015-05-01 / 7 / 15.4 / 32.1
xxxx / 2015-06-01 / 13 / 135.4 / 30.0
... ... ... ... ... ... ...
xxxx / 2015-12-01 / 12 / 110.4 / 33.2

Altogether there are 225 neighborhoods; the sample above shows only one of them, and all are stored in a single file.

My question concerns the categorical variables in the data (neighborhood / date). I need to use a logistic model to predict cases in these neighborhoods, but I have not been able to encode these variables in R so that I can apply the logistic model, given that its result must always be binary.

If anyone has any ideas on how I can accomplish this I would be grateful for the help.

I appreciate the help and I'm sorry for the English.

technocrat · April 30, 2021, 7:53am

English is a world language; even among those who have it as a first language there is no uniform way of either speaking or writing it. Like all languages, it's for communication and communication requires equal effort between sender and receiver. Your English is better than my attempts at any of the languages that I've studied.

The data described is not an obvious candidate for logistic regression modeling. It appears that the number of cases is the dependent (or response or outcome) variable, and it sits on the borderline between categorical and continuous. If it takes on more than about 12 values, the conventional approach is to treat it as continuous.

fit <- lm(cases ~ neighborhood + precipitation + temperature, data = dengue)

Or fewer variables could be chosen initially.

Many types of continuous variables follow a Gaussian distribution. A variable with a relatively small number of distinct values, however, especially one reported as integer counts, may follow a Poisson distribution. An lm model is appropriate in the first case and a glm model in the second, with family = poisson or family = quasipoisson (the latter when the counts vary more than a Poisson model allows).

d.AD <- data.frame(treatment = gl(3,3),
                   outcome   = gl(3,1,9),
                   counts    = c(18,17,15, 20,10,20, 25,13,12))
glm.D93 <- glm(counts ~ outcome + treatment, d.AD, family = poisson())
summary(glm.D93)
#> 
#> Call:
#> glm(formula = counts ~ outcome + treatment, family = poisson(), 
#>     data = d.AD)
#> 
#> Deviance Residuals: 
#>        1         2         3         4         5         6         7         8  
#> -0.67125   0.96272  -0.16965  -0.21999  -0.95552   1.04939   0.84715  -0.09167  
#>        9  
#> -0.96656  
#> 
#> Coefficients:
#>               Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  3.045e+00  1.709e-01  17.815   <2e-16 ***
#> outcome2    -4.543e-01  2.022e-01  -2.247   0.0246 *  
#> outcome3    -2.930e-01  1.927e-01  -1.520   0.1285    
#> treatment2   1.338e-15  2.000e-01   0.000   1.0000    
#> treatment3   1.421e-15  2.000e-01   0.000   1.0000    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for poisson family taken to be 1)
#> 
#>     Null deviance: 10.5814  on 8  degrees of freedom
#> Residual deviance:  5.1291  on 4  degrees of freedom
#> AIC: 56.761
#> 
#> Number of Fisher Scoring iterations: 4
## Quasipoisson: compare with above / example(glm) :
glm.qD93 <- glm(counts ~ outcome + treatment, d.AD, family = quasipoisson())
summary(glm.qD93)
#> 
#> Call:
#> glm(formula = counts ~ outcome + treatment, family = quasipoisson(), 
#>     data = d.AD)
#> 
#> Deviance Residuals: 
#>        1         2         3         4         5         6         7         8  
#> -0.67125   0.96272  -0.16965  -0.21999  -0.95552   1.04939   0.84715  -0.09167  
#>        9  
#> -0.96656  
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  3.045e+00  1.944e-01  15.665  9.7e-05 ***
#> outcome2    -4.543e-01  2.299e-01  -1.976    0.119    
#> outcome3    -2.930e-01  2.192e-01  -1.337    0.252    
#> treatment2   1.338e-15  2.274e-01   0.000    1.000    
#> treatment3   1.421e-15  2.274e-01   0.000    1.000    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for quasipoisson family taken to be 1.2933)
#> 
#>     Null deviance: 10.5814  on 8  degrees of freedom
#> Residual deviance:  5.1291  on 4  degrees of freedom
#> AIC: NA
#> 
#> Number of Fisher Scoring iterations: 4

(From help(family).)
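To connect the help-page example to data shaped like the dengue table, here is a minimal sketch. The data frame is simulated (three made-up neighborhoods, random weather values and counts), so the numbers mean nothing; only the column names follow the original post. The dispersion check is a common rule of thumb, not a formal test.

```r
# Simulated stand-in for the dengue data; every value here is made up.
set.seed(42)
dengue <- expand.grid(
  neighborhood = c("A", "B", "C"),
  date = seq(as.Date("2014-01-01"), as.Date("2015-12-01"), by = "month")
)
dengue$precipitation <- runif(nrow(dengue), 15, 160)
dengue$temperature   <- runif(nrow(dengue), 25, 36)
dengue$cases         <- rpois(nrow(dengue), lambda = 5)

# Poisson regression for the monthly counts
pfit <- glm(cases ~ neighborhood + precipitation + temperature,
            family = poisson, data = dengue)

# Rough overdispersion check: Pearson chi-square / residual degrees of freedom.
# Values well above 1 suggest trying family = quasipoisson instead.
dispersion <- sum(residuals(pfit, type = "pearson")^2) / df.residual(pfit)
dispersion
```

If the ratio is clearly above 1 on the real data, refitting with family = quasipoisson keeps the same coefficients but widens the standard errors, as in the output above.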

Only if cases were coded as 0 = none and 1 = some would a logistic fit be considered.

lfit <- glm(cases_yn ~ ., family = "binomial", data = dengue)
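A minimal sketch of that recoding, on invented toy data. cases_yn is a hypothetical column name, and the sketch assumes "some" means at least one reported case in the month. Note that the original count column is left out of the predictors, since cases_yn is derived from it.

```r
# Toy data; the values are invented for illustration only.
set.seed(1)
dengue <- data.frame(
  cases         = rpois(60, lambda = 1),
  precipitation = runif(60, 15, 160),
  temperature   = runif(60, 25, 36)
)

# Hypothetical indicator: 1 if any case was reported that month, else 0
dengue$cases_yn <- as.integer(dengue$cases > 0)

# Leave `cases` out of the predictors; cases_yn is computed from it
lfit <- glm(cases_yn ~ precipitation + temperature,
            family = binomial, data = dengue)

# Predicted probability of at least one case, one value per row
p_hat <- predict(lfit, type = "response")
range(p_hat)
```

Collapsing counts to 0/1 throws away information (a month with 1 case and a month with 15 look the same), which is part of why the count models above are the more natural starting point.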

The neighborhood and date variables introduce the potential for spatial and temporal autocorrelation—adjacent neighborhoods may share underlying conditions conducive to disease and the disease may have seasonality, such that one August, for example, is much like the next. There are tools in the time series domain to deal with the temporal case. I've not used them, but I assume that they exist for the spatial case, as well. (Of course, if the commonalities by neighborhood are distinct from location, that wouldn't be a concern.)
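One simple way to allow for seasonality, short of dedicated time series tools, is to add month of year as a factor. The sketch below does that on simulated counts for a single neighborhood and then looks at the residual autocorrelation. It is only an illustration under those assumptions; it does nothing about spatial dependence.

```r
# Simulated monthly counts for one neighborhood; values are made up.
set.seed(7)
dates <- seq(as.Date("2014-01-01"), as.Date("2015-12-01"), by = "month")
dengue <- data.frame(
  date  = dates,
  cases = rpois(length(dates), lambda = 5)
)

# Month of year as a factor, so every January shares one coefficient, etc.
dengue$month <- factor(format(dengue$date, "%m"))

sfit <- glm(cases ~ month, family = poisson, data = dengue)

# Autocorrelation of the Pearson residuals up to 12 lags; values near 0
# suggest little leftover month-to-month dependence
r <- residuals(sfit, type = "pearson")
acf(r, lag.max = 12, plot = FALSE)
```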

Finally, casting neighborhood as a factor can be used for categorizing. See this description of the forcats package.
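As a sketch of what casting to a factor does, here is a base R version on a tiny invented data frame (the neighborhood names are placeholders). forcats offers helpers such as fct_relevel() for the same kind of job; base R is used here only to keep the example free of extra packages.

```r
# Tiny invented example; neighborhood names are placeholders.
dengue <- data.frame(
  neighborhood = c("Centro", "Norte", "Centro", "Sul", "Norte", "Centro"),
  cases        = c(3, 0, 6, 1, 2, 7),
  stringsAsFactors = FALSE
)

# Convert the character column to a factor
dengue$neighborhood <- factor(dengue$neighborhood)
levels(dengue$neighborhood)

# Choose which neighborhood serves as the baseline level in a model
dengue$neighborhood <- relevel(dengue$neighborhood, ref = "Norte")

# model.matrix shows the 0/1 indicator columns R builds from the factor
model.matrix(~ neighborhood, data = dengue)
```

With 225 neighborhoods this produces 224 indicator columns, so lm and glm will estimate one coefficient per neighborhood relative to the baseline.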

nanda2021 · April 30, 2021, 12:33pm

Thank you very much for the answer and for your careful look at the data, especially regarding the distributions. My initial idea for working with these categorical variables was to use a decision tree, but I chose to go with logistic regression instead. One question: why is the data described not an obvious candidate for logistic regression modeling?

technocrat · April 30, 2021, 8:33pm

Logistic models assume a binary response variable. As presented, the dengue database has no binary variables. See this introduction.
