Dealing with similar variable names for different data

Hi everyone, I am extremely new to using R Studio, I'm currently in a class and have to create a decision tree, so please bear with me. This is also a multi-part question. I have two datasets that I need to combine to work with. However, each dataset has the same column names: state id, name, and rank (each of which should be the same since I am looking for information by state), but the other columns are months of the year for the past ten years (ex.: 2009-01, 2009-02, etc....). One dataset contains average home values in each state by month, the other has the number of sales per state in each month.
Should I rename the columns for the months in each set by adding an initial before or after the month (ex.: V2009-01 for the average value, SC2009-01 for sales count)? I can still calculate the percentage increase per state over the ten years for just the values and then incorporate sales count in the decision tree, is that correct? I'm just not sure how to bring the two datasets together correctly since it's different data but the same variable names.
Additionally, the sales count dataset is missing data for a few states for some of the months. Should I put NA in those cells (I currently have the data open in Excel, since I find that easier to work with to prepare the data, we haven't learned that much in class that I feel comfortable working with a lot of R code yet and want to focus on the decision tree in R Studio) or will that completely mess up the data?
I hope I'm explaining this correctly, I'm attaching a screenshot of the data files in Excel so you can see what I mean. Thanks for any help!

First create a new project in RStudio and put your csv files in the project directory. Then create a new R Script to work in.

In R you typically want to get your data into a dataframe where each column is a different kind of data. For rectangular data like yours, you can do this using gather from the tidyr package. Try this:

library(readr)
library(tidyr)

# read in the data
sales <- read_csv("Sale_Counts_State.csv")
value <- read_csv("State_MedianValuePerSqft_AllHomes.csv")

# gather
sales <- sales %>% 
	gather(key="month", value="sales_counts", starts_with("200"))
value <- value %>% 
	gather(key="month", value="median_value", starts_with("200"))

You might need to install the readr and tidyr packages. This can be done with

install.packages("tidyverse")

Great, thanks! I'll give this a try. I appreciate the help!

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.