How to assign categorical variables to specific rows of data


#1

I'm new to R and this community, so please excuse any etiquette or common practice violations that I have obliviously made.

I'm working with NCIC (National Crime Information Center) data and I have thousands of rows with different NCIC codes, i.e. 2405, 1110, 3803, etc. Each of these NCIC codes represents a different type of crime. I'm trying to sort the data into four offense categories: 1=violent, 2=property, 3=drug, 4=other. How could I classify crimes 2101 through 3204 as a number 3, for example?

A dataset similar to my own can be found here:
https://secure.ssa.gov/poms.nsf/lnx/0202613900


#2

Hi @palcape, welcome to the community!

A great first step when asking a question like this is to start with detailing what you've tried already. Assuming of course, that you have an idea where to start and have done so - this isn't always easy to do when you're starting out (as I can well remember)! So don't worry if the answer for now is, I don't know where to start.

In terms of etiquette and sharing code etc, most people will ask that you present your code and (a sample of) the data you're working with as a reproducible example, or a reprex.

To get started with working with data, I'd recommend checking out the data transformation section of the R for Data Science book.

It can also be worth breaking down your problem a bit, too. It sounds like you're trying to:

  • Create a new variable in an existing data set;
  • Set that variable to a value based on other values in the data set.

To me, this sounds like a good time to use the mutate() function from the dplyr package. You can find some more details about that here.
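As a rough sketch of what that looks like (the data frame and column names here are made up for illustration, not taken from your data), mutate() adds a column computed from existing ones:

```r
library(dplyr)

# Hypothetical sample data; the column names are assumptions
crimes <- data.frame(ncic_code = c(2405, 1110, 3803))

# mutate() adds a new column derived from an existing one.
# Integer division by 100 keeps the leading two digits of each code.
crimes <- mutate(crimes, code_group = ncic_code %/% 100)

crimes
```

The new code_group column sits alongside the original one, so nothing in the source data is overwritten.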

Again, welcome, and please reply here if there's anything further I/we can add!


#3

Hi! Welcome! :grin:

Take a look at the forcats package (tidyverse tools for working with factor data, like yours), especially the functions fct_collapse and fct_relabel. You can see an overview of all the functions here.
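For instance, fct_collapse lets you list which existing levels go into each new one. This is only a small sketch: the codes and the groupings below are illustrative assumptions, not an authoritative mapping.

```r
library(forcats)

# Hypothetical codes stored as a factor; groupings are illustrative only
codes <- factor(c("0901", "1101", "3501", "2201"))

# Each argument names a new level and lists the old levels it absorbs
offense <- fct_collapse(
  codes,
  violent  = c("0901", "1101"),
  drug     = "3501",
  property = "2201"
)

offense
```

This works well when the list of codes per category is short enough to type out by hand.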

fct_relabel will be especially helpful if there are patterns within the codes — such as “all codes with first two digits in these ranges => violent”, since you can put that logic into a function to relabel lots of codes automatically.

As a very simple example, based on the SSA website’s list of violent and drug-related codes: 09XX, 10XX, 11XX, 12XX, 13XX, 16XX, 21XX, 52XX, you could write a function like:

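library(tidyverse)
library(stringr)
library(forcats)

assign_violent <- function(code) {
  # fct_relabel() passes every level at once, so the test must be vectorised:
  # if_else() works element-wise where a bare if() would not.
  # str_sub(code, 1, 2) extracts the first two characters of each code.
  if_else(
    str_sub(code, 1, 2) %in% c("09", "10", "11", "12", "13", "16", "21", "52"),
    "violent",
    code
  )
}

# suppose a dataframe df with a factor variable crime_code...
df %>% mutate(crime_code = fct_relabel(crime_code, assign_violent))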

Does that help get you started? To give more specific advice, it will be helpful to see the code you’re working with (it’s ok if it doesn’t work right!), made as reproducible as possible. If you want to learn about how things work around here, definitely take a look at the FAQs!


#4

@jim89, thanks a lot for your reply! I appreciate you taking the time to help a novice. :smile:

My bad—I'll be more specific. I created a sample table of my data here:
https://drive.google.com/file/d/1YTkG3suUytgNw8k6CnsARnJ9lcdZvDpM/view?usp=sharing

You nailed it; your suggestion to use mutate() is exactly what I'm trying to do. However, now I'm getting tripped up on the syntax. As a Stata user, I'm a big fan of if commands and I'm trying to use them here (I'll need at least 10 of them). Where am I going wrong?

#LoadPackages
library(dplyr)
library(readxl)
NCICSampleData <- read_xlsx("/Users/Cal/Desktop/Sorenson/NCICSampleData.xlsx")
  names(NCICSampleData)
  names(NCICSampleData) <- c("ncic.code", "description")

  mutate(NCICSampleData, offense.code = 
    1 if "ncic.code"= 2101-2102
    2 if "ncic.code"= 2003-2099


#5

@jcblum thank you so much for your counsel—it is much appreciated!

I have posted a sample table of data and my code in my reply to Jim.

I'm trying to create a new variable, so I'm planning on using the mutate() function. However, I'm having trouble implementing my if commands. I noticed that in your code you used it effectively. Would you mind shedding some light on my own? :grin:


#6

Nice start.

The way that R uses if statements is different to Stata. It looks like what you're trying to do would be a great situation for case_when(), another dplyr function. If you know SQL you'll be familiar with it, but to (very) briefly explain it, it's basically a function that lets you express things like: if A then 1, if B then 2, if C then 3, otherwise...

If I was at a computer with R I'd show you, but unfortunately I'm not. Hopefully this points you in the right direction, though!


#8

To pick up where @jim89 left off, here's an example of using case_when() for this task:

mutate(
  NCICSampleData, 
  offense.code = case_when(
    ncic.code %in% 2101:2102 ~ 1,
    ncic.code %in% 2003:2099 ~ 2
  )
)

I will note that case_when() uses a special shorthand syntax that is unique to this function, and shouldn't be used as a guide to other conditional branching structures in R.
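One thing to watch: rows that match none of the conditions come back as NA. Here is a hedged sketch of a fuller version with a catch-all category, using between() for the ranges. The sample data is made up, the 2101 to 3204 range comes from the original question, and the other cut point is purely an assumption for illustration.

```r
library(dplyr)

# Hypothetical sample data
crimes <- data.frame(ncic.code = c(2405, 1110, 3803, 2101, 3204))

crimes <- mutate(
  crimes,
  offense.code = case_when(
    between(ncic.code, 2101, 3204) ~ 3,  # range from the original question
    between(ncic.code, 900, 1399)  ~ 1,  # assumed range, for illustration only
    TRUE                           ~ 4   # everything else falls through to "other"
  )
)

crimes
```

Conditions are checked in order and the first match wins, so the TRUE line at the end acts as the "otherwise" branch. between() includes both endpoints.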

It's definitely a good idea to get to grips with R's basic syntax, especially if you're coming from a different language — while the concepts are often broadly similar, the details of syntax don't really transfer between languages. This thread has a lot of great resources for brushing up on the basics:


(despite the original topic, plenty of the suggestions work very well for those with programming experience but in a different language)


#9

@jcblum and confused Googlers, I have posted my code for this question on my GitHub here: