Help with loop, csv dataset

tsrockon28 · October 12, 2017, 6:50pm

Hi,
I am trying to get the column sums from a dataset where i have 17 observations for each of the states. there are 48 states in the dataset. states are divided into regions. i was trying to run a loop and assign values for column sums to a new matrix. it doesn't work out. any help would be greatly appreciates. thanks.

the script:

data_matrix= read.csv("PRODUC.csv")
str(data_matrix)

data_matrix$Capital= data_matrix$P_CAP+ data_matrix$HWY+ data_matrix$WATER+ data_matrix$UTIL

gulf= subset(data_matrix, ST_ABB %in% c("AL", "FL", "LA", "MS"))
midwest= subset(data_matrix, ST_ABB %in% c("IL", "IN", "KY", "MI", "MN", "OH", "WI"))
mid_atlantic= subset(data_matrix, ST_ABB %in% c("DE", "MD", "NJ", "NY", "PA", "VA"))
MOUNTAIN= subset(data_matrix, ST_ABB %in% c("CO", "ID", "MT", "ND", "SD", "WY"))
NEW_ENGLAND= subset(data_matrix, ST_ABB %in% c("CT", "ME", "MA", "NH", "RI", "VT"))
SOUTH= subset(data_matrix, ST_ABB %in% c("GA", "NC", "SC", "TN", "WV", "RI"))
SOUTHWEST= subset(data_matrix, ST_ABB %in% c("AZ", "NV", "NM", "TX", "UT"))
CENTRAL= subset(data_matrix, ST_ABB %in% c("AK", "IA", "KS", "MO", "NE", "OK"))
WEST_COAST= subset(data_matrix, ST_ABB %in% c("CA", "OR", "WA"))

vector_1= rep.int(0, times = 4*9)
my_matrix_gulf=matrix(vector_1, nrow = 4, ncol = 9)
vectors_gulf_cumulative= for(i in seq(0, 68, by= 17)) for(j in c(1,2,3,4)) {
{
print(i)

my_matrix_gulf[j,1:9]= colSums(gulf[(i-16):i, 4:12])

print(my_matrix_gulf)
}
}

dataset can be downloaded from here https://drive.google.com/file/d/0B11woa7YyVb1dmlLV3hDVWUwRDA/view?usp=sharing

jchou · October 12, 2017, 8:39pm

I think your approach does not cater to R's strengths, nor make use of excellent libraries, like the tidyverse, that further simplify data manipulations. It feels like you are trying to 'brute force' the data by manually forcing it into a matrix.

(One hint is that if you're trying to loop, rather than use vectorized manipulations, you might be going about things in a more complicated way than needed.)

Are you trying to group by some factor (like states or regions or year) and then obtain sums of particular columns?

If so, it would be helpful for you to clarify exactly what you are trying to accomplish, if you want folks to help.

I think it'd take longer to work through your (currently nonfunctioning) code to figure out what your goal is, than to actually write a little bit of 'tidyverse'-compatible code to do the manipulations in dataframes.

As a start, to get your data and regions into an analyzable dataframe, I'd start with:

library(tidyverse)

df <- read.csv(file = '~/Desktop/PRODUC.csv') %>%
  mutate(
    region = case_when( # add a 'region' column, based on the state
      ST_ABB %in% c('AL', 'FL', 'LA', 'MS') ~ 'gulf',
      ST_ABB %in% c('IL', 'IN', 'KY', 'MI', 'MN', 'OH', 'WI') ~ 'midwest',
      ST_ABB %in% c('DE', 'MD', 'NJ', 'NY', 'PA', 'VA') ~ 'mid_atlantic',
      ST_ABB %in% c('CO', 'ID', 'MT', 'ND', 'SD', 'WY') ~ 'MOUNTAIN',
      ST_ABB %in% c('CT', 'ME', 'MA', 'NH', 'RI', 'VT') ~ 'NEW_ENGLAND',
      ST_ABB %in% c('GA', 'NC', 'SC', 'TN', 'WV', 'RI') ~ 'SOUTH',
      ST_ABB %in% c('AZ', 'NV', 'NM', 'TX', 'UT') ~ 'SOUTHWEST',
      ST_ABB %in% c('AK', 'IA', 'KS', 'MO', 'NE', 'OK') ~ 'CENTRAL',
      ST_ABB %in% c('CA', 'OR', 'WA') ~ 'WEST_COAST'
    )
  )

Then, you could explore tidyverse's "group_by" followed by "summarize" manipulations to start to get the column-wise manipulations done.

tsrockon28 · October 12, 2017, 8:43pm

Thanks for the reply.
So for each state i have 17 observations. now i want the sum of all the observations(rows) of a particular state in a new row.
so since i have 48 states in 9 regions, doing them manually is tedious.
so i wanted to know if there is a smarter way to do the same.
Thnaks.

jchou · October 12, 2017, 8:57pm

This is 'tidyverse' style code that does what you want. It doesn't seem like you actually need the regions, but I left that code in there since it was part of your sample code goal.

As you can see, the tidyverse-style of data-munging is rather clean.

When I started learning R a few years ago, I also tried to do stuff in loops (having programmed in non-vectorized languages before). Now, I try to code in this way, which is much easier to folllow once you get used to it.

Read Hadley Wickhams 'R for Data Science' if you're interested in learning more.

library(tidyverse)

df <- read.csv(file = '~/Desktop/PRODUC.csv') %>%
  mutate(
    region = case_when(
      ST_ABB %in% c('AL', 'FL', 'LA', 'MS') ~ 'gulf',
      ST_ABB %in% c('IL', 'IN', 'KY', 'MI', 'MN', 'OH', 'WI') ~ 'midwest',
      ST_ABB %in% c('DE', 'MD', 'NJ', 'NY', 'PA', 'VA') ~ 'mid_atlantic',
      ST_ABB %in% c('CO', 'ID', 'MT', 'ND', 'SD', 'WY') ~ 'MOUNTAIN',
      ST_ABB %in% c('CT', 'ME', 'MA', 'NH', 'RI', 'VT') ~ 'NEW_ENGLAND',
      ST_ABB %in% c('GA', 'NC', 'SC', 'TN', 'WV', 'RI') ~ 'SOUTH',
      ST_ABB %in% c('AZ', 'NV', 'NM', 'TX', 'UT') ~ 'SOUTHWEST',
      ST_ABB %in% c('AK', 'IA', 'KS', 'MO', 'NE', 'OK') ~ 'CENTRAL',
      ST_ABB %in% c('CA', 'OR', 'WA') ~ 'WEST_COAST'
    )
  )

df %>%
  select(-ST_ABB, -YR, -region) %>% # remove the columns which you don't want to sum
  group_by(STATE) %>%
  summarise_all( funs(sum) )

epi_pete · October 12, 2017, 9:01pm

Just a quick observation, and I understand this may not be your actual code. Arkansas doesn't appear in any region.

tsrockon28 · October 12, 2017, 9:23pm

getting this error:
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'C:/Users/Tej/Documents/Desktop/PRODUC.csv': No such file or directory

Thanks

jchou · October 12, 2017, 9:25pm

Yup, I noticed that, too -- wasn't included in the OP's subsetting anywhere.

But it doesn't seem like the OP is actually doing anything with the regions that got defined, so I let it be.

On the plus side for this thread, this is the first time I've ever used the case_when and summarize_all functions, so I got to learn something new! Yay for Google and Stack Overflow.

jchou · October 12, 2017, 9:27pm

Change the path to wherever you put the input file...?

tsrockon28 · October 13, 2017, 4:35am

Yeah i did that too.
THen also the same error is displayed,

tsrockon28 · October 13, 2017, 8:46am

i was able to get the grouped summary data frame with your help.

additionally, i am having issues with grouping on multiple variables. so i want the weighted unemployment in a region in each year. weights are defined for a state in a particular year as the ratio of employment of that state in that year to sum of employmnet rates of all the states in that region in that year. now the unemployment of the region in time t is the weighted sum of unemployment rates of the state in that time t with weights defined as above. any help appreciated.

this is my code

gulf %>%
select(STATE, YR, UNEMP) %>%
group_by(STATE, YR) %>%
summarise(sum_unemp= sum(gulf$UNEMP))

tsrockon28 · October 13, 2017, 9:02am

this might make it clearer

dannyfraser · October 13, 2017, 1:42pm

You just need to add an expression to weight the values in the group - here's how I did it:

library(tidyverse)
read_csv('YOUR_FILE_PATH_HERE') %>% 
  mutate(REGION = case_when( # add a 'region' column, based on the state
      ST_ABB %in% c('AL', 'FL', 'LA', 'MS') ~ 'gulf',
      ST_ABB %in% c('IL', 'IN', 'KY', 'MI', 'MN', 'OH', 'WI') ~ 'midwest',
      ST_ABB %in% c('DE', 'MD', 'NJ', 'NY', 'PA', 'VA') ~ 'mid_atlantic',
      ST_ABB %in% c('CO', 'ID', 'MT', 'ND', 'SD', 'WY') ~ 'MOUNTAIN',
      ST_ABB %in% c('CT', 'ME', 'MA', 'NH', 'RI', 'VT') ~ 'NEW_ENGLAND',
      ST_ABB %in% c('GA', 'NC', 'SC', 'TN', 'WV', 'RI') ~ 'SOUTH',
      ST_ABB %in% c('AZ', 'NV', 'NM', 'TX', 'UT') ~ 'SOUTHWEST',
      ST_ABB %in% c('AK', 'IA', 'KS', 'MO', 'NE', 'OK') ~ 'CENTRAL',
      ST_ABB %in% c('CA', 'OR', 'WA') ~ 'WEST_COAST'
    )
  ) %>% 
  group_by(REGION, YR) %>% 
  summarise(UNEMP = sum(UNEMP * EMP / sum(EMP)))

tsrockon28 · October 13, 2017, 2:17pm

this is what i did

library(tidyverse)

gulf = gulf %>%
  group_by(YR) %>%
  mutate(total_emp= sum(EMP))%>%
  mutate(weights_emp= EMP/total_emp) %>%
  mutate(weighted_unemp= weights_emp * UNEMP)

gulf_summary= gulf %>%
  select(-YR, -ST_ABB, -total_emp, -weights_emp, -weighted_unemp) %>%
  group_by(STATE) %>%
  summarise_all(funs(sum))
gulf_summary  #this works

tsrockon28 · October 13, 2017, 6:46pm

Thanks.
Another query can i write this as a function of a data frame?