Help with loop, csv dataset

Hi,
I am trying to get the column sums from a dataset where i have 17 observations for each of the states. there are 48 states in the dataset. states are divided into regions. i was trying to run a loop and assign values for column sums to a new matrix. it doesn't work out. any help would be greatly appreciates. thanks.

the script:

data_matrix= read.csv("PRODUC.csv")
str(data_matrix)

data_matrix$Capital= data_matrix$P_CAP+ data_matrix$HWY+ data_matrix$WATER+ data_matrix$UTIL

gulf= subset(data_matrix, ST_ABB %in% c("AL", "FL", "LA", "MS"))
midwest= subset(data_matrix, ST_ABB %in% c("IL", "IN", "KY", "MI", "MN", "OH", "WI"))
mid_atlantic= subset(data_matrix, ST_ABB %in% c("DE", "MD", "NJ", "NY", "PA", "VA"))
MOUNTAIN= subset(data_matrix, ST_ABB %in% c("CO", "ID", "MT", "ND", "SD", "WY"))
NEW_ENGLAND= subset(data_matrix, ST_ABB %in% c("CT", "ME", "MA", "NH", "RI", "VT"))
SOUTH= subset(data_matrix, ST_ABB %in% c("GA", "NC", "SC", "TN", "WV", "RI"))
SOUTHWEST= subset(data_matrix, ST_ABB %in% c("AZ", "NV", "NM", "TX", "UT"))
CENTRAL= subset(data_matrix, ST_ABB %in% c("AK", "IA", "KS", "MO", "NE", "OK"))
WEST_COAST= subset(data_matrix, ST_ABB %in% c("CA", "OR", "WA"))

vector_1= rep.int(0, times = 4*9)
my_matrix_gulf=matrix(vector_1, nrow = 4, ncol = 9)
vectors_gulf_cumulative= for(i in seq(0, 68, by= 17)) for(j in c(1,2,3,4)) {
{
print(i)

my_matrix_gulf[j,1:9]= colSums(gulf[(i-16):i, 4:12])

print(my_matrix_gulf)
}
}

dataset can be downloaded from here https://drive.google.com/file/d/0B11woa7YyVb1dmlLV3hDVWUwRDA/view?usp=sharing

I think your approach does not cater to R's strengths, nor make use of excellent libraries, like the tidyverse, that further simplify data manipulations. It feels like you are trying to 'brute force' the data by manually forcing it into a matrix.

(One hint is that if you're trying to loop, rather than use vectorized manipulations, you might be going about things in a more complicated way than needed.)

Are you trying to group by some factor (like states or regions or year) and then obtain sums of particular columns?

If so, it would be helpful for you to clarify exactly what you are trying to accomplish, if you want folks to help.

I think it'd take longer to work through your (currently nonfunctioning) code to figure out what your goal is, than to actually write a little bit of 'tidyverse'-compatible code to do the manipulations in dataframes.

As a start, to get your data and regions into an analyzable dataframe, I'd start with:

library(tidyverse)

df <- read.csv(file = '~/Desktop/PRODUC.csv') %>%
  mutate(
    region = case_when( # add a 'region' column, based on the state
      ST_ABB %in% c('AL', 'FL', 'LA', 'MS') ~ 'gulf',
      ST_ABB %in% c('IL', 'IN', 'KY', 'MI', 'MN', 'OH', 'WI') ~ 'midwest',
      ST_ABB %in% c('DE', 'MD', 'NJ', 'NY', 'PA', 'VA') ~ 'mid_atlantic',
      ST_ABB %in% c('CO', 'ID', 'MT', 'ND', 'SD', 'WY') ~ 'MOUNTAIN',
      ST_ABB %in% c('CT', 'ME', 'MA', 'NH', 'RI', 'VT') ~ 'NEW_ENGLAND',
      ST_ABB %in% c('GA', 'NC', 'SC', 'TN', 'WV', 'RI') ~ 'SOUTH',
      ST_ABB %in% c('AZ', 'NV', 'NM', 'TX', 'UT') ~ 'SOUTHWEST',
      ST_ABB %in% c('AK', 'IA', 'KS', 'MO', 'NE', 'OK') ~ 'CENTRAL',
      ST_ABB %in% c('CA', 'OR', 'WA') ~ 'WEST_COAST'
    )
  )

Then, you could explore tidyverse's "group_by" followed by "summarize" manipulations to start to get the column-wise manipulations done.

1 Like

Thanks for the reply.
So for each state i have 17 observations. now i want the sum of all the observations(rows) of a particular state in a new row.
so since i have 48 states in 9 regions, doing them manually is tedious.
so i wanted to know if there is a smarter way to do the same.
Thnaks.

This is 'tidyverse' style code that does what you want. It doesn't seem like you actually need the regions, but I left that code in there since it was part of your sample code goal.

As you can see, the tidyverse-style of data-munging is rather clean.

When I started learning R a few years ago, I also tried to do stuff in loops (having programmed in non-vectorized languages before). Now, I try to code in this way, which is much easier to folllow once you get used to it.

Read Hadley Wickhams 'R for Data Science' if you're interested in learning more.

library(tidyverse)

df <- read.csv(file = '~/Desktop/PRODUC.csv') %>%
  mutate(
    region = case_when(
      ST_ABB %in% c('AL', 'FL', 'LA', 'MS') ~ 'gulf',
      ST_ABB %in% c('IL', 'IN', 'KY', 'MI', 'MN', 'OH', 'WI') ~ 'midwest',
      ST_ABB %in% c('DE', 'MD', 'NJ', 'NY', 'PA', 'VA') ~ 'mid_atlantic',
      ST_ABB %in% c('CO', 'ID', 'MT', 'ND', 'SD', 'WY') ~ 'MOUNTAIN',
      ST_ABB %in% c('CT', 'ME', 'MA', 'NH', 'RI', 'VT') ~ 'NEW_ENGLAND',
      ST_ABB %in% c('GA', 'NC', 'SC', 'TN', 'WV', 'RI') ~ 'SOUTH',
      ST_ABB %in% c('AZ', 'NV', 'NM', 'TX', 'UT') ~ 'SOUTHWEST',
      ST_ABB %in% c('AK', 'IA', 'KS', 'MO', 'NE', 'OK') ~ 'CENTRAL',
      ST_ABB %in% c('CA', 'OR', 'WA') ~ 'WEST_COAST'
    )
  )

df %>%
  select(-ST_ABB, -YR, -region) %>% # remove the columns which you don't want to sum
  group_by(STATE) %>%
  summarise_all( funs(sum) )
2 Likes

Just a quick observation, and I understand this may not be your actual code. Arkansas doesn't appear in any region.

getting this error:
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'C:/Users/Tej/Documents/Desktop/PRODUC.csv': No such file or directory

Thanks

Yup, I noticed that, too -- wasn't included in the OP's subsetting anywhere.

But it doesn't seem like the OP is actually doing anything with the regions that got defined, so I let it be.

On the plus side for this thread, this is the first time I've ever used the case_when and summarize_all functions, so I got to learn something new! Yay for Google and Stack Overflow.

Change the path to wherever you put the input file...?

2 Likes

Yeah i did that too.
THen also the same error is displayed,

i was able to get the grouped summary data frame with your help.

additionally, i am having issues with grouping on multiple variables. so i want the weighted unemployment in a region in each year. weights are defined for a state in a particular year as the ratio of employment of that state in that year to sum of employmnet rates of all the states in that region in that year. now the unemployment of the region in time t is the weighted sum of unemployment rates of the state in that time t with weights defined as above. any help appreciated.

this is my code

gulf %>%
select(STATE, YR, UNEMP) %>%
group_by(STATE, YR) %>%
summarise(sum_unemp= sum(gulf$UNEMP))

this might make it clearer

You just need to add an expression to weight the values in the group - here's how I did it:

library(tidyverse)
read_csv('YOUR_FILE_PATH_HERE') %>% 
  mutate(REGION = case_when( # add a 'region' column, based on the state
      ST_ABB %in% c('AL', 'FL', 'LA', 'MS') ~ 'gulf',
      ST_ABB %in% c('IL', 'IN', 'KY', 'MI', 'MN', 'OH', 'WI') ~ 'midwest',
      ST_ABB %in% c('DE', 'MD', 'NJ', 'NY', 'PA', 'VA') ~ 'mid_atlantic',
      ST_ABB %in% c('CO', 'ID', 'MT', 'ND', 'SD', 'WY') ~ 'MOUNTAIN',
      ST_ABB %in% c('CT', 'ME', 'MA', 'NH', 'RI', 'VT') ~ 'NEW_ENGLAND',
      ST_ABB %in% c('GA', 'NC', 'SC', 'TN', 'WV', 'RI') ~ 'SOUTH',
      ST_ABB %in% c('AZ', 'NV', 'NM', 'TX', 'UT') ~ 'SOUTHWEST',
      ST_ABB %in% c('AK', 'IA', 'KS', 'MO', 'NE', 'OK') ~ 'CENTRAL',
      ST_ABB %in% c('CA', 'OR', 'WA') ~ 'WEST_COAST'
    )
  ) %>% 
  group_by(REGION, YR) %>% 
  summarise(UNEMP = sum(UNEMP * EMP / sum(EMP)))

this is what i did

library(tidyverse)

gulf = gulf %>%
  group_by(YR) %>%
  mutate(total_emp= sum(EMP))%>%
  mutate(weights_emp= EMP/total_emp) %>%
  mutate(weighted_unemp= weights_emp * UNEMP)

gulf_summary= gulf %>%
  select(-YR, -ST_ABB, -total_emp, -weights_emp, -weighted_unemp) %>%
  group_by(STATE) %>%
  summarise_all(funs(sum))
gulf_summary  #this works

Thanks.
Another query can i write this as a function of a data frame?