Replace entire string by one specific word.

Hi,

I have a column with the following event types as shown in the image below. Now, wherever I find 'Battle...' in the string of distinct event types, I need to replace the entire event type with only 'Battles'.

Similarly, if an event type contains 'Riots/Protests', I need to replace the entire string with 'Riots'.

Basically, I am grouping the event types.

How do I do this in R? Any help would be appreciated. TIA

Hi, @ketan10! First of all, it is going to be much easier for folks around here to help if you include the data and code you've used to get to where you are stuck rather than just a screenshot. Without the data you are working with, I had to type a few rows of your table rather than simply copying and pasting. See this post for more information about best practices for writing questions and including a reproducible example (reprex).

As to your question, I find the most intuitive way to do a relatively simple replacement like this is to use str_detect() from the stringr package. If the text you feed to str_detect() is found in a string, it returns TRUE, and if it is not found, it returns FALSE. I use str_detect() with mutate() and case_when() to do multiple conditional replacements within a single variable.

library(tidyverse)

df <- tribble(
  ~event_type,
  "Violence against civilans",
  "Battle-No change of territory",
  "Riots/Protests",
  "Battle Government regains territory"
  )

df %>%
  mutate(event_type = case_when(
    str_detect(event_type, "Battle") ~ "Battle",
    str_detect(event_type, "Riots")  ~ "Riot",
    TRUE ~ event_type
    )
  )
#> # A tibble: 4 x 1
#>   event_type               
#>   <chr>                    
#> 1 Violence against civilans
#> 2 Battle                   
#> 3 Riot                     
#> 4 Battle

Created on 2018-10-30 by the reprex package (v0.2.1)
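To see what str_detect() returns on its own, outside of mutate(), here is a minimal sketch. The vector `event_types` is just a few values copied from this thread for illustration. Note that the pattern is treated as a regular expression by default; wrapping it in fixed() is one way to match the text literally, which can be safer for patterns that contain special characters.

```r
library(stringr)

# A few event types copied from the thread, used only for illustration
event_types <- c(
  "Battle-No change of territory",
  "Riots/Protests",
  "Strategic development"
)

# One logical value per element: TRUE where the pattern occurs
is_battle <- str_detect(event_types, "Battle")
is_battle
#> [1]  TRUE FALSE FALSE

# fixed() matches the text literally instead of as a regular expression
is_riot <- str_detect(event_types, fixed("Riots/Protests"))
is_riot
#> [1] FALSE  TRUE FALSE
```

These logical vectors are what case_when() uses to decide which replacement applies to each row.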

Hi @mfherman, thanks for the reply. I will keep the best practices in mind.

I am reading a csv file and doing column cleaning stuff on the csv as given in the code below:

raw_armed_conflicts <- read.csv('C:\\ketan\\SampleProject\\Conflicts.csv')

#Removing unwanted columns from the dataset and choosing only the necessary columns
raw_armed_conflicts <- raw_armed_conflicts[,c(1,5,6,8,16,17,22,23,28)]

#Renaming the columns legibly
names(raw_armed_conflicts)[1] <- 'event_data_id'
names(raw_armed_conflicts)[3] <- 'event_year'
names(raw_armed_conflicts)[5] <- 'event_region'
names(raw_armed_conflicts)[6] <- 'event_country'
names(raw_armed_conflicts)[7] <- 'event_latitude'
names(raw_armed_conflicts)[8] <- 'event_longitude'
names(raw_armed_conflicts)[9] <- 'event_fatalities'

I need to group the data in the column 'event_type'. Wherever I find 'Battle...' in the string of distinct event types, I need to replace the entire event type with only 'Battles'.
Similarly, if an event type contains 'Riots/Protests', I need to replace the entire string with 'Riots'.

Data for reference attached:

|data_id|event_date|year|event_type|region|country|iso3|latitude|longitude|fatalities|
|---|---|---|---|---|---|---|---|---|---|
|1892808|20-Oct-18|2018|Battle-No change of territory|Eastern Africa|Burundi|BDI|-2.8772|29.3253|4|
|1892831|20-Oct-18|2018|Violence against civilians|Middle Africa|Cameroon|CMR|5.9333|10.1667|3|
|1892860|20-Oct-18|2018|Battle-Government regains territory|Northern Africa|Egypt|EGY|31.1316|33.7984|4|
|1892861|20-Oct-18|2018|Battle-No change of territory|Northern Africa|Egypt|EGY|31.2163|34.1107|0|
|1892874|20-Oct-18|2018|Strategic development|Eastern Africa|Ethiopia|ETH|12.9667|36.2|0|
|1892875|20-Oct-18|2018|Violence against civilians|Eastern Africa|Ethiopia|ETH|10.15|36.35|6|
|1892920|20-Oct-18|2018|Battle-No change of territory|Western Africa|Mali|MLI|14.795|-1.318|1|
|1892921|20-Oct-18|2018|Violence against civilians|Western Africa|Mali|MLI|16.6314|-3.3256|0|
|1892922|20-Oct-18|2018|Riots/Protests|Western Africa|Mali|MLI|16.8425|-3.8559|1|
|1892961|20-Oct-18|2018|Violence against civilians|Western Africa|Nigeria|NGA|12.1492|12.9907|12|
|1892962|20-Oct-18|2018|Violence against civilians|Western Africa|Nigeria|NGA|11.4953|12.9688|2|
|1893075|20-Oct-18|2018|Non-violent transfer of territory|Eastern Africa|Somalia|SOM|3.0399|43.7969|0|
|1893076|20-Oct-18|2018|Riots|Eastern Africa|Somalia|SOM|8.4064|48.4819|0|
|1893282|20-Oct-18|2018|Violence against civilians|Southern Asia|Afghanistan|AFG|33.6457|62.2696|0|
|1893283|20-Oct-18|2018|Violence against civilians|Southern Asia|Afghanistan|AFG|34.5195|65.2509|6|
|1893284|20-Oct-18|2018|Strategic development|Southern Asia|Afghanistan|AFG|34.9145|65.2884|0|
|1893285|20-Oct-18|2018|Violence against civilians|Southern Asia|Afghanistan|AFG|34.3448|61.4932|0|
|1893286|20-Oct-18|2018|Battle-No change of territory|Southern Asia|Afghanistan|AFG|34.5167|69.1833|16|

Thanks for including some of your data and code. To make a reproducible example, all the data and code someone else would need should be included. Reading a file with read.csv() that exists only on your computer is therefore not reproducible. A better way is to use tribble() or dput() to include a sample of your data in the code, as I do below.

After creating the sample data, I do the string replace as I wrote in my previous code. Does this accomplish what you are trying to do?

library(tidyverse)

raw_armed_conflicts <- tribble(
   ~data_id,   ~event_date,  ~year,                           ~event_type,           ~region,      ~country,  ~iso3,  ~latitude,  ~longitude,  ~fatalities,  
   "1892808",  "20-Oct-18", "2018",       "Battle-No change of territory",  "Eastern Africa",     "Burundi",  "BDI",  "-2.8772",   "29.3253",          "4",  
   "1892831",  "20-Oct-18", "2018",          "Violence against civilians",   "Middle Africa",    "Cameroon",  "CMR",   "5.9333",   "10.1667",          "3",  
   "1892860",  "20-Oct-18", "2018", "Battle-Government regains territory", "Northern Africa",       "Egypt",  "EGY",  "31.1316",   "33.7984",          "4",  
   "1892861",  "20-Oct-18", "2018",       "Battle-No change of territory", "Northern Africa",       "Egypt",  "EGY",  "31.2163",   "34.1107",          "0",  
   "1892874",  "20-Oct-18", "2018",               "Strategic development",  "Eastern Africa",    "Ethiopia",  "ETH",  "12.9667",      "36.2",          "0",  
   "1892875",  "20-Oct-18", "2018",          "Violence against civilians",  "Eastern Africa",    "Ethiopia",  "ETH",    "10.15",     "36.35",          "6",  
   "1892920",  "20-Oct-18", "2018",       "Battle-No change of territory",  "Western Africa",        "Mali",  "MLI",   "14.795",    "-1.318",          "1",  
   "1892921",  "20-Oct-18", "2018",          "Violence against civilians",  "Western Africa",        "Mali",  "MLI",  "16.6314",   "-3.3256",          "0",  
   "1892922",  "20-Oct-18", "2018",                      "Riots/Protests",  "Western Africa",        "Mali",  "MLI",  "16.8425",   "-3.8559",          "1",  
   "1892961",  "20-Oct-18", "2018",          "Violence against civilians",  "Western Africa",     "Nigeria",  "NGA",  "12.1492",   "12.9907",         "12",  
   "1892962",  "20-Oct-18", "2018",          "Violence against civilians",  "Western Africa",     "Nigeria",  "NGA",  "11.4953",   "12.9688",          "2",  
   "1893075",  "20-Oct-18", "2018",   "Non-violent transfer of territory",  "Eastern Africa",     "Somalia",  "SOM",   "3.0399",   "43.7969",          "0",  
   "1893076",  "20-Oct-18", "2018",                               "Riots",  "Eastern Africa",     "Somalia",  "SOM",   "8.4064",   "48.4819",          "0",  
   "1893282",  "20-Oct-18", "2018",          "Violence against civilians",   "Southern Asia", "Afghanistan",  "AFG",  "33.6457",   "62.2696",          "0",  
   "1893283",  "20-Oct-18", "2018",          "Violence against civilians",   "Southern Asia", "Afghanistan",  "AFG",  "34.5195",   "65.2509",          "6",  
   "1893284",  "20-Oct-18", "2018",               "Strategic development",   "Southern Asia", "Afghanistan",  "AFG",  "34.9145",   "65.2884",          "0",  
   "1893285",  "20-Oct-18", "2018",          "Violence against civilians",   "Southern Asia", "Afghanistan",  "AFG",  "34.3448",   "61.4932",          "0",  
   "1893286",  "20-Oct-18", "2018",       "Battle-No change of territory",   "Southern Asia", "Afghanistan",  "AFG",  "34.5167",   "69.1833",         "16"
  )

raw_armed_conflicts %>%
  mutate(event_type = case_when(
    str_detect(event_type, "Battle") ~ "Battles",
    str_detect(event_type, "Riots")  ~ "Riots",
    TRUE ~ event_type
    ))
#> # A tibble: 18 x 10
#>    data_id event_date year  event_type region country iso3  latitude
#>    <chr>   <chr>      <chr> <chr>      <chr>  <chr>   <chr> <chr>   
#>  1 1892808 20-Oct-18  2018  Battles    Easte… Burundi BDI   -2.8772 
#>  2 1892831 20-Oct-18  2018  Violence … Middl… Camero… CMR   5.9333  
#>  3 1892860 20-Oct-18  2018  Battles    North… Egypt   EGY   31.1316 
#>  4 1892861 20-Oct-18  2018  Battles    North… Egypt   EGY   31.2163 
#>  5 1892874 20-Oct-18  2018  Strategic… Easte… Ethiop… ETH   12.9667 
#>  6 1892875 20-Oct-18  2018  Violence … Easte… Ethiop… ETH   10.15   
#>  7 1892920 20-Oct-18  2018  Battles    Weste… Mali    MLI   14.795  
#>  8 1892921 20-Oct-18  2018  Violence … Weste… Mali    MLI   16.6314 
#>  9 1892922 20-Oct-18  2018  Riots      Weste… Mali    MLI   16.8425 
#> 10 1892961 20-Oct-18  2018  Violence … Weste… Nigeria NGA   12.1492 
#> 11 1892962 20-Oct-18  2018  Violence … Weste… Nigeria NGA   11.4953 
#> 12 1893075 20-Oct-18  2018  Non-viole… Easte… Somalia SOM   3.0399  
#> 13 1893076 20-Oct-18  2018  Riots      Easte… Somalia SOM   8.4064  
#> 14 1893282 20-Oct-18  2018  Violence … South… Afghan… AFG   33.6457 
#> 15 1893283 20-Oct-18  2018  Violence … South… Afghan… AFG   34.5195 
#> 16 1893284 20-Oct-18  2018  Strategic… South… Afghan… AFG   34.9145 
#> 17 1893285 20-Oct-18  2018  Violence … South… Afghan… AFG   34.3448 
#> 18 1893286 20-Oct-18  2018  Battles    South… Afghan… AFG   34.5167 
#> # … with 2 more variables: longitude <chr>, fatalities <chr>

Created on 2018-10-30 by the reprex package (v0.2.1)
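For a large file, typing rows into tribble() is not necessary; dput() can generate the code for a small sample. The following is a minimal sketch, assuming a small stand-in data frame named `conflicts` (a hypothetical name; in practice it would be the data frame read from the csv).

```r
# Small stand-in data frame; in practice this would be the data read from the csv
conflicts <- data.frame(
  event_type = c("Battle-No change of territory", "Riots/Protests", "Riots"),
  fatalities = c(4L, 1L, 0L),
  stringsAsFactors = FALSE
)

# dput() prints R code that recreates the object;
# that printed code can be pasted straight into a question
dput(head(conflicts, 2))

# Round trip: evaluating the dput() text rebuilds the same sample
sample_code <- paste(capture.output(dput(head(conflicts, 2))), collapse = "\n")
rebuilt <- eval(parse(text = sample_code))
```

Only the first few rows are needed to show the structure of the data, so head() keeps the pasted sample short.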

Hi @mfherman
I understand your solution, however my original csv contains more than 10k rows of data.
Adding each row of data (something shown below) would not be a generic solution:

raw_armed_conflicts <- tribble(
  ~data_id,   ~event_date,  ~year,                           ~event_type,           ~region,      ~country,  ~iso3,  ~latitude,  ~longitude,  ~fatalities,  
  "1892808",  "20-Oct-18", "2018",       "Battle-No change of territory",  "Eastern Africa",     "Burundi",  "BDI",  "-2.8772",   "29.3253",          "4",  
  "1892831",  "20-Oct-18", "2018",          "Violence against civilians",   "Middle Africa",    "Cameroon",  "CMR",   "5.9333",   "10.1667",          "3",  
  "1892860",  "20-Oct-18", "2018", "Battle-Government regains territory", "Northern Africa",       "Egypt",  "EGY",  "31.1316",   "33.7984",          "4",  

Instead of putting the data here (in the code), I need to read the column event_type from the dataframe raw_armed_conflicts (which contains the raw csv data) and then check each row with the conditions in code above.

Not sure if this is possible ?

The code I wrote above should work just fine on all 10k rows. Many functions in R are "vectorized", meaning that they operate on an entire vector at once, or in this case, on a whole column of a data frame. Have you tried running the code on the complete data set that you read in using read.csv() or something similar?
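As a rough illustration of vectorization, the sketch below applies the same case_when() logic to a plain character vector of 10,000 values. The vector is built with rep() purely as a stand-in for a real event_type column; no loop over rows is needed.

```r
library(dplyr)
library(stringr)

# Stand-in for the event_type column; rep() simulates 10,000 rows
event_type <- rep(
  c(
    "Battle-No change of territory",
    "Riots/Protests",
    "Strategic development",
    "Battle-Government regains territory"
  ),
  times = 2500
)

# The same conditions as above, evaluated for every element in one call
grouped <- case_when(
  str_detect(event_type, "Battle") ~ "Battles",
  str_detect(event_type, "Riots")  ~ "Riots",
  TRUE ~ event_type
)

length(grouped)
#> [1] 10000

# Count how many rows ended up in each group
table(grouped)
```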

@mfherman
This is the code I am trying to run, am I going wrong anywhere ?

#Loading the Armed Conflicts uncleaned data from [Structured Source]
raw_armed_conflicts <- read.csv('C:\\ketan\\SampleProject\\Conflicts.csv')

#Removing unwanted columns from the dataset and choosing only the necessary columns
raw_armed_conflicts <- raw_armed_conflicts[,c(1,5,6,8,16,17,22,23,28)]

#Renaming the columns legibly
names(raw_armed_conflicts)[1] <- 'event_data_id'
names(raw_armed_conflicts)[3] <- 'event_year'
names(raw_armed_conflicts)[5] <- 'event_region'
names(raw_armed_conflicts)[6] <- 'event_country'
names(raw_armed_conflicts)[7] <- 'event_latitude'
names(raw_armed_conflicts)[8] <- 'event_longitude'
names(raw_armed_conflicts)[9] <- 'event_fatalities'


raw_armed_conflicts %>%
  mutate(event_type = case_when(
    str_detect(event_type, "Battle") ~ "Battles",
    str_detect(event_type, "Riots")  ~ "Riots",
    TRUE ~ event_type
  ))

write.csv(raw_armed_conflicts,"raw_armed_conflicts.csv", row.names = FALSE)

I get the following error message:

Error in mutate_impl(.data, dots) : 
  Evaluation error: must be type character, not integer.
> View(raw_armed_conflicts)
> str(raw_armed_conflicts)
'data.frame':	103814 obs. of  9 variables:
 $ event_data_id   : int  1897233 1897234 1897272 1897320 1897368 1897385 1897386 1897402 1897429 1897470 ...
 $ event_date      : Factor w/ 406 levels "01-Apr-18","01-Aug-18",..: 360 360 360 360 360 360 360 360 360 360 ...
 $ event_year      : int  2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
 $ event_type      : Factor w/ 12 levels "Battle-Government regains territory",..: 12 2 9 2 7 2 2 11 9 7 ...
 $ event_region    : Factor w/ 8 levels "Eastern Africa",..: 8 8 2 4 4 8 8 8 8 1 ...
 $ event_country   : Factor w/ 76 levels "Afghanistan",..: 8 8 11 16 39 42 42 49 50 60 ...
 $ event_latitude  : num  13.82 11.75 4.05 31.29 32.03 ...
 $ event_longitude : num  -1.32 -3.3 9.71 34.24 20.07 ...
 $ event_fatalities: int  0 0 0 0 1 7 0 0 3 1 ...

The problem seems to be that the column you are trying to use str_detect() on, event_type, is a factor column and not a character column. You can read more about factors here, but suffice it to say that for your purposes you can just use character columns. Instead of using read.csv() at the beginning of your script, try read_csv() from the readr package (loaded with the tidyverse), which reads strings as characters by default.
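If switching the reading function is not convenient, another option is to convert the factor column to character before grouping. The sketch below assumes a tiny inline csv (`csv_text`, a hypothetical sample standing in for Conflicts.csv) and sets stringsAsFactors = TRUE explicitly to mimic the read.csv() default in R versions before 4.0.0. The error above most likely comes from case_when() being asked to combine character replacements with a factor column, which is stored as integers internally.

```r
library(dplyr)
library(stringr)

# Inline csv text standing in for Conflicts.csv (hypothetical sample)
csv_text <- "event_type,fatalities
Battle-No change of territory,4
Riots/Protests,1
Strategic development,0"

# stringsAsFactors = TRUE mimics the read.csv() default before R 4.0.0
conflicts <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(conflicts$event_type)
#> [1] "factor"

# Convert the factor to character first, then group as before
conflicts_cleaned <- conflicts %>%
  mutate(
    event_type = as.character(event_type),
    event_type = case_when(
      str_detect(event_type, "Battle") ~ "Battles",
      str_detect(event_type, "Riots")  ~ "Riots",
      TRUE ~ event_type
    )
  )

conflicts_cleaned$event_type
#> [1] "Battles"               "Riots"                 "Strategic development"
```

Passing stringsAsFactors = FALSE to read.csv() is a third way to get character columns from the start.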

One more issue that I see in your code is that you don't save the result of the mutate() operation. mutate() returns a modified copy and leaves raw_armed_conflicts unchanged, so you need to assign the result to an object. Then you can write that object, with the modified column, to a csv.

conflicts_cleaned <- raw_armed_conflicts %>%
  mutate(event_type = case_when(
    str_detect(event_type, "Battle") ~ "Battles",
    str_detect(event_type, "Riots")  ~ "Riots",
    TRUE ~ event_type
  ))

write_csv(conflicts_cleaned, "raw_armed_conflicts.csv")
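To make the assignment point concrete, here is a small sketch using a two-row stand-in data frame (`conflicts` is a hypothetical name). Running a pipeline without assignment computes the result but does not store it, so the original object stays as it was.

```r
library(dplyr)

# Two-row stand-in for the real data
conflicts <- data.frame(
  event_type = c("Riots/Protests", "Riots"),
  stringsAsFactors = FALSE
)

# Without assignment, the result is computed but not stored
conflicts %>% mutate(event_type = "Riots")

# The original data frame is unchanged
conflicts$event_type
#> [1] "Riots/Protests" "Riots"

# Assigning the result keeps the modified copy
conflicts_cleaned <- conflicts %>% mutate(event_type = "Riots")
conflicts_cleaned$event_type
#> [1] "Riots" "Riots"
```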

Seems like you're almost there!

Bingo @mfherman! Thanks :) I am new to R and am happy that your well-explained solutions are helping me develop more interest in it. Thanks again :)

So glad you're getting more interested in R! A really great reference guide to R is the book R for Data Science by Hadley Wickham. An online version is available for free here and I strongly encourage you to check it out if you want to develop some more R skills.
