categorical variable to compare

Please help me to compare two categorical variables

#indicator variables
Name[1:10441]
Behaviour[1:10441]

#creating category of interest
active_behaviour<-c("AE","AI")
passive_behaviour<-c("PO","PA","PI")
Cont<-c("Cont 1","Cont 5","Cont 9")
LPD<-c("LPD 11","LPD 6" ,"LPD 7" ,"LPD 8" )

  1. data_mb<-data.frame(active_behaviour,passive_behaviour,Cont_Name,LPD_Name)

Error in data.frame(active_behaviour, passive_behaviour, Cont_Name, LPD_Name) :
arguments imply differing number of rows: 2, 3, 3, 4

  2. glm1 <- glm(data = mb, Behaviour ~ active_behaviour + passive_behaviour, family = binomial)

Error in model.frame.default(formula = Behaviour ~ active_behaviour + :
variable lengths differ (found for 'active_behaviour')

Welcome @Lizachka2309 !

It's a bit hard for me to answer the question. Can you please provide a bit more information about what you have and what you want to get out? If possible, try to create a reprex, possibly using a different dataset.

That said, I can note two things:

  1. A dataframe is defined as an object that has a fixed number of columns and rows (kinda like a matrix) - so when you are trying to create data_mb, it doesn't let you -- since your proposed columns have different lengths (for example -- what should go in the 5th row for column LPD_Name?)

  2. What are mb and Behaviour? This format assumes that mb is a data frame and that it has at least three columns -- Behaviour, active_behaviour, and passive_behaviour -- each with the same number of entries (i.e. the number of rows in mb). Is this accurate?
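
To illustrate the first point, here is a minimal sketch using two of the lookup vectors from the question. It only shows why data.frame() refuses vectors of different lengths and how a list can hold them instead; the name `groups` is made up for this example.

```r
# Lookup vectors as defined in the question
active_behaviour  <- c("AE", "AI")
passive_behaviour <- c("PO", "PA", "PI")

# data.frame() needs columns of equal length (or lengths that recycle evenly),
# so combining a length-2 and a length-3 vector raises an error.
res <- tryCatch(
  data.frame(active_behaviour, passive_behaviour),
  error = function(e) conditionMessage(e)
)
print(res)  # the error message, returned as a character string

# A list has no such restriction: each element keeps its own length.
groups <- list(active = active_behaviour, passive = passive_behaviour)
print(lengths(groups))
```

The lookup vectors describe categories, not observations, so they do not need to live in a data frame at all.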

  1. I have this data:

head(mb)
  Days_after_birth Hour   Name Minute Behaviour Condition
1       16.02.2019    7 Cont 1      1        AE      Cont
2       16.02.1999    7 Cont 9      1        PO      Cont
3       16.02.1999    7  LPD 6      1        PA       LPD
4       16.02.1999    7  LPD 8      1        PI       LPD
5       16.02.1999    9 Cont 1      1        AE      Cont
6       16.02.1999    9 Cont 9      1        AI      Cont

  2. I need to show that "Cont" has more "A..." (active) behaviour than "LPD",
    so I separated the categories of interest:

active_behaviour<-c("AE","AI")
passive_behaviour<-c("PO","PA","PI")
Cont<-c("Cont 1","Cont 5","Cont 9")
LPD<-c("LPD 11","LPD 6" ,"LPD 7" ,"LPD 8" )

  3. And now I would like to build a model

For example glm

glm1<- glm(data=mb, active_behaviour ~Cont + Hour+Minute + Condition + Days_after_birth,family=binomial)
Error in model.frame.default(formula = active_behaviour ~ Cont + Hour + :
variable lengths differ (found for 'Cont')

Please help me to find a solution.
Thank you

Am I correct on the following?

  • There are 5 possibilities for the Behaviour column - AE, AI, PO, PA, and PI. You want to group AE and AI as "active" and PO, PA, and PI as "passive"
  • There are 7 possibilities for the Name columns -- Cont 1, Cont 5, Cont 9, LPD 11, LPD 6, LPD 7, LPD 8. You already grouped them into Cont and LPD in the Condition column.
  • You want to build a regression to predict whether or not Behaviour is active or passive, using Hour, Minute, Condition, and days after birth

If that is correct, I'd suggest the following:

First, create an indicator column about whether behaviour is active -- the output should be 0 or 1

mb <- mb %>% 
  mutate(is_active = if_else(Behaviour %in% active_behaviour, 1, 0))

Also, make an indicator variable for whether the condition is Cont or LPD

mb <- mb %>% 
  mutate(is_cont = if_else(Name %in% Cont, 1, 0))

Then, I'd turn Days_after_birth into a continuous number -- right now it's a character vector in the form of a date:

mb <- mb %>%
  mutate(date0 = parse_date(Days_after_birth, format = "%d.%m.%Y"), 
         birth_date = as.Date("1981-07-01"), 
         age_in_days = as.numeric(date0 - birth_date))

NOW is when I would do something like

glm(formula = is_active ~ Hour + Minute + is_cont + age_in_days, 
    data = mb, 
    family = binomial)

Does this make sense?
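
The steps above can be sketched end to end on a small made-up data frame. This is only an illustration under stated assumptions: the 24 rows are invented, the indicator columns are built with base R rather than dplyr so the block runs without extra packages, and the date step is left out. A cross-tabulation is included because it compares the two groups directly before any model is fitted.

```r
# Category definitions from the question
active_behaviour <- c("AE", "AI")
Cont <- c("Cont 1", "Cont 5", "Cont 9")

# Invented data in the shape of head(mb); not the real observations
mb <- data.frame(
  Name = rep(c("Cont 1", "Cont 9", "LPD 6", "LPD 8"), times = 6),
  Hour = rep(c(7, 9, 11), each = 8),
  Behaviour = c("AE", "AI", "PA", "PI", "AE", "PO", "AE", "PO",
                "AI", "AE", "PA", "PI", "PO", "AI", "PO", "AI",
                "AE", "AE", "PI", "PA", "AI", "PA", "PO", "AE"),
  stringsAsFactors = FALSE
)

# 0/1 indicator columns, one value per row of mb
mb$is_active <- as.numeric(mb$Behaviour %in% active_behaviour)
mb$is_cont   <- as.numeric(mb$Name %in% Cont)

# Counts and row proportions of active behaviour within each group
counts <- table(is_cont = mb$is_cont, is_active = mb$is_active)
print(counts)
print(prop.table(counts, margin = 1))

# Logistic regression on the indicator columns
fit <- glm(is_active ~ is_cont + Hour, data = mb, family = binomial)
print(coef(fit))
```

In this toy data the Cont group is active in 9 of 12 rows and the LPD group in 3 of 12, so the is_cont coefficient comes out positive.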


Yes, your understanding is correct.
I'm trying to create a new column for active behaviour:

data_mb<-with(mb,active_behaviour)
head(data_mb)
data_mb<within(mb,{Active<-active_behaviour})

data_mb<within(mb,{active<-active_behaviour})
     Days_after_birth  Hour Name Minute Behaviour Condition Active
[1,]               NA FALSE   NA  FALSE        NA        NA  FALSE
[2,]               NA FALSE   NA  FALSE        NA        NA  FALSE
[3,]               NA FALSE   NA  FALSE        NA        NA  FALSE
[4,]               NA FALSE   NA  FALSE        NA        NA  FALSE
[5,]               NA FALSE   NA  FALSE        NA        NA  FALSE
[6,]               NA FALSE   NA  FALSE        NA        NA  FALSE
I'm not sure that I ran this correctly, or how to put in the condition numbers.
Please correct me.

I'd suggest you do it one of two ways:

  1. As I suggested above, using the dplyr package:
mb <- mb %>% dplyr::mutate(is_active = if_else(Behaviour %in% active_behaviour, 1, 0))
  2. Using base R:
mb$is_active <- as.numeric(mb$Behaviour %in% active_behaviour)

(this works since as.numeric(TRUE) is 1 and as.numeric(FALSE) is 0).

Another base R method:
mb$is_active <- ifelse(mb$Behaviour %in% active_behaviour, 1, 0)
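
As a quick check that these approaches agree, here is a small sketch on the six Behaviour values shown in head(mb). The names `in_active`, `via_as_numeric`, `via_ifelse`, and `via_dplyr` are made up for this example, and the dplyr comparison only runs if that package is installed.

```r
active_behaviour <- c("AE", "AI")

# Behaviour values copied from the head(mb) output
mb <- data.frame(Behaviour = c("AE", "PO", "PA", "PI", "AE", "AI"),
                 stringsAsFactors = FALSE)

# %in% returns one TRUE/FALSE per row
in_active <- mb$Behaviour %in% active_behaviour
print(in_active)

# Both base R conversions turn TRUE/FALSE into 1/0
via_as_numeric <- as.numeric(in_active)
via_ifelse     <- ifelse(in_active, 1, 0)
print(identical(via_as_numeric, via_ifelse))

# Optional: the dplyr version, compared against the base R result
if (requireNamespace("dplyr", quietly = TRUE)) {
  via_dplyr <- dplyr::mutate(
    mb,
    is_active = dplyr::if_else(Behaviour %in% active_behaviour, 1, 0)
  )
  print(identical(via_dplyr$is_active, via_as_numeric))
}
```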

Thank you for your answers, but I'm still getting error messages.
I tried it in these ways:

  1. Using the dplyr

mb$active_behaviour<- mutate (active_behaviour = if_else(Behaviour<-active_behaviour, 1, 0))
Error: condition must be a logical vector, not a character vector
Call rlang::last_error() to see a backtrace

  2. Base R

mb$active_behaviour<-as.numeric(mb$Behaviour<-active_behaviour)
Warning message:
NAs introduced by coercion
mb$Cont<-as.numeric(mb$Name<-Cont)
Warning message:
NAs introduced by coercion
mb$LPD<-as.numeric(mb$Name<-LPD)
Warning message:
NAs introduced by coercion
mb$Days_after_birth<-as.numeric(mb$Days_after_birth)
glm1<- glm(data=mb, active_behaviour ~Cont + Days_after_birth,family=binomial)
Error in family$linkfun(mustart) :
Argument mu must be a nonempty numeric vector

  3. The same error with the other base R method

mb$active_behaviour <- ifelse(mb$Behaviour<- active_behaviour, 1, 0)
mb$Cont <- ifelse(mb$Name<- Cont, 1, 0)
mb$LPD <- ifelse(mb$Name<- LPD, 1, 0)
mb$Days_after_birth <- ifelse(mb$Days_after_birth, 1, 0)
glm2<- glm(data=mb, active_behaviour ~Cont + Days_after_birth,family=binomial)
Error in family$linkfun(mustart) :
Argument mu must be a nonempty numeric vector

Please help to find a solution

Hi,

Try the following two changes -- one applies to the first (dplyr) chunk, and the other applies to all three:

  1. dplyr operates on the entire data frame, not on a single column. This means two things:
    1. the result should be assigned directly to mb, not to mb$active_behaviour
    2. the first argument of mutate should be mb -- written either as mutate(mb, ...) or as mb %>% mutate(...)
  2. "<-" does not mean "is in"; you are looking for %in%. <- assigns a value to a variable. Therefore, within as.numeric(...) or if_else(...), you should never see a <-. For example, if_else(Behaviour %in% active_behaviour, 1, 0) or as.numeric(mb$Behaviour %in% active_behaviour)
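
The difference between the two operators can be seen on a plain vector. This is a small sketch with made-up values; `behaviour`, `member`, `overwritten`, and `result` are names invented for the example.

```r
active_behaviour <- c("AE", "AI")
behaviour <- c("AE", "PO", "PA", "AI")   # made-up values

# %in% tests membership and leaves behaviour unchanged
member <- behaviour %in% active_behaviour
print(member)

# <- assigns: the expression replaces the target and evaluates to the
# assigned value, here the character vector c("AE", "AI")
overwritten <- behaviour
result <- (overwritten <- active_behaviour)
print(result)

# Converting that character vector to numeric gives NA, which matches the
# "NAs introduced by coercion" warning (suppressed here to keep output clean)
print(suppressWarnings(as.numeric(result)))
```

Because the assignment form evaluates to a character vector rather than a logical one, it also explains the "condition must be a logical vector" error from if_else.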

Now it is working! I just reinstalled "dplyr" and ran it again with "%in%". Thank you!

I have a question about "parse_date":

mb <- mb %>% dplyr::mutate(date0 = parse_date(Days_after_birth, "%m.%d.%Y"),birth_date = as.Date("1981-07-01"), age_in_days = as.numeric(date0 - birth_date))

Error in .Call(C_R_parse_date, dates, approx) %||% numeric() :
approx must the logical of length 1

What can I do in this case?

Hmm... I'm not 100% sure, but I think I implied the wrong package -- try changing parse_date() to readr::parse_date() and see if that does anything (or try readr::parse_date(as.character(Days_after_birth), "%d.%m.%Y"), with the day first, since your dates look like 16.02.1999)
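
If the package clash persists, one alternative (not the approach suggested above, just a base R sketch) is as.Date() with an explicit format, which does not depend on which parse_date() is found first. The dates and the birth date below are invented for illustration.

```r
# Made-up dates in day.month.year form, like the Days_after_birth column
dates <- c("16.02.1999", "17.02.1999", "20.02.1999")

# Hypothetical birth date; replace with the real one
birth_date <- as.Date("1999-02-10")

# %d = day, %m = month, %Y = four-digit year
parsed <- as.Date(dates, format = "%d.%m.%Y")
age_in_days <- as.numeric(parsed - birth_date)

print(parsed)
print(age_in_days)

# With day and month swapped in the format, "16" is not a valid month,
# so the result is NA rather than a date
swapped <- as.Date("16.02.1999", format = "%m.%d.%Y")
print(swapped)
```

The last line shows why the order of %d and %m in the format string matters for this data.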

In the end, I just replaced the dates with a name for each day:
P1<-c("P1")
P2<-c("P2")
P3<-c("P3")
P4<-c("P4")
and then built a regression:
glm1Cont<-glm(formula = is_active ~ Hour + Minute + is_cont + is_p1, data = mb, family = binomial)

Please tell me, is it possible to plot a graph with this regression?
For example

library("GGally")
ggplot(mb, aes(x=active_behaviour,y=Cont,colour=Days_after_birth)) +geom_point() +geom_smooth(method="glm",method.args=list(family="binomial"))

Error: Aesthetics must be either length 1 or the same as the data (10440): x, y

While I don't have much experience with GGally, you should make sure that everything inside aes() is a column of the mb data frame -- maybe you want to use is_cont instead of Cont?
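
A minimal sketch of that check, followed by one possible plot. The data frame is invented, the choice of Hour on the x-axis is an assumption made for illustration, and the plotting part only runs if ggplot2 is installed.

```r
# Invented data; is_active is the 0/1 indicator discussed above
mb <- data.frame(
  Hour = rep(c(7, 9, 11, 13), each = 6),
  Condition = rep(c("Cont", "LPD"), times = 12),
  is_active = c(1, 0, 1, 0, 1, 1,
                1, 0, 1, 1, 0, 0,
                1, 0, 0, 0, 1, 1,
                0, 0, 1, 0, 0, 1)
)

# Every name used inside aes() should be a column of the data frame;
# standalone vectors such as Cont or active_behaviour are not
aes_columns <- c("Hour", "is_active", "Condition")
print(all(aes_columns %in% names(mb)))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(mb, aes(x = Hour, y = is_active, colour = Condition)) +
    geom_jitter(height = 0.05, width = 0.2) +   # spread overlapping 0/1 points
    geom_smooth(method = "glm", formula = y ~ x,
                method.args = list(family = "binomial"), se = FALSE)
  print(p)
}
```

With a 0/1 outcome on the y-axis, the smoothed curve for each Condition is a fitted probability of active behaviour as a function of Hour.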

Thank you so much!

ggplot(mb, aes(x=is_active,y=is_passive,colour=Condition)) +geom_point() +geom_smooth(method="glm",method.args=list(family="binomial"))
