Comparing proportion of hits

paulgureghian · May 23, 2018, 9:52pm

# The 'errors' data have already been loaded. 
head(errors)

# Generate an object called 'totals' that contains the numbers of good and bad predictions for polls rated A- and C-
totals <- errors %>% filter(grade %in% c("A-", "C-")) %>% group_by(grade,hit) %>% summarize(num = n()) %>% spread(grade,num)   
totals

# Print the proportion of hits for grade A- polls to the console
mean(hit == TRUE / A-)  

# Print the proportion of hits for grade C- polls to the console
mean(hit == TRUE /  C-)  
#> Error: <text>:9:22: unexpected ')'
#> 8: # Print the proportion of hits for grade A- polls to the console
#> 9: mean(hit == TRUE / A-)
#>                         ^

paulgureghian · May 23, 2018, 9:55pm

What I am trying to do: filter the errors data for just the polls with grades A- and C-, then calculate the proportion of times each grade of poll predicted the correct winner. I am trying to generate a 2 x 2 tibble, but I keep getting a 2 x 3 one.
How do I calculate the number of hits that are TRUE for each of the grades A- and C-?

dchiu · May 23, 2018, 10:11pm

It would be easier to help if you provided a reprex. The error above can be avoided by surrounding non-standard R column names with backticks: `A-`
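
As a minimal sketch of the backtick rule, the snippet below builds a small hypothetical data frame (the name `df` and its values are only for illustration) whose column names are not syntactically valid R names, then accesses them with backticks.

```r
# Hypothetical data frame; check.names = FALSE keeps the non-syntactic
# column names "A-" and "C-" exactly as written.
df <- data.frame(`A-` = c(26, 106), `C-` = c(50, 311), check.names = FALSE)

# Writing df$A- is a parse error, because R reads the trailing "-" as an
# operator with a missing right-hand side. Backticks make the name usable.
df$`A-`
sum(df$`C-`)
```

The same quoting works inside dplyr verbs, where the column would otherwise be referenced as a bare name.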

paulgureghian · May 23, 2018, 10:21pm

This isn't a reprex?

dchiu · May 23, 2018, 10:22pm

No, I cannot reproduce your input object `errors`.

paulgureghian · May 23, 2018, 10:33pm

# The 'errors' data have already been loaded. 
head(errors)

# Generate an object called 'totals' that contains the numbers of good and bad predictions for polls rated A- and C-
totals <- errors %>% filter(grade %in% c("A-", "C-")) %>% group_by(grade,hit) %>% summarize(num = n()) %>% spread(grade,num)   
totals

# Print the proportion of hits for grade A- polls to the console
mean(hit == TRUE / `A-`)  

# Print the proportion of hits for grade C- polls to the console
mean(hit == TRUE /  `C-`)  
#> Error: <text>:9:22: unexpected ')'
#> 8: # Print the proportion of hits for grade A- polls to the console
#> 9: mean(hit == TRUE / A-)
#>                         ^

dchiu · May 23, 2018, 10:45pm

I recently asked about generating a reprex here. On my machine, I get this:

head(errors)
#> Error in head(errors) : object 'errors' not found

paulgureghian · May 23, 2018, 10:56pm

That object was predefined and preloaded into my workspace. How can I find out how it was generated, e.g. the packages and the original dataset used?

dchiu · May 23, 2018, 11:35pm

If you do not know the data source, you can run dput(errors) and then copy the output on the console and paste it here.
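
A minimal sketch of what `dput()` does, using a small hypothetical data frame (`small`) in place of `errors`: it prints R code that recreates the object, so someone else can paste that code and get the same data.

```r
# Hypothetical stand-in for the real `errors` object
small <- data.frame(grade = c("A-", "C-"),
                    hit = c(TRUE, FALSE),
                    stringsAsFactors = FALSE)

# dput() writes a structure(...) call to the console that rebuilds `small`
dput(small)

# Round trip: deparse() produces the same text dput() prints, and
# evaluating that text gives back an equivalent object.
rebuilt <- eval(parse(text = deparse(small)))
all.equal(small, rebuilt)
```

For a large object, something like `dput(head(errors, 20))` keeps the pasted output manageable.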

paulgureghian · May 23, 2018, 11:50pm


library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(dslabs)
data("polls_us_election_2016")

# Create a table called `polls` that filters by state and date, and reports the spread
polls <- polls_us_election_2016 %>% 
  filter(state != "U.S." & enddate >= "2016-10-31") %>% 
  mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100)

# Create an object called `cis` that has columns for the lower and upper bounds of the confidence intervals. Select the columns indicated in the instructions.

N <- polls$samplesize
cis <- polls %>%
  mutate(X_hat = (spread + 1) / 2,
         se = 2 * sqrt(X_hat * (1 - X_hat) / N),
         lower = spread - qnorm(0.975) * se,
         upper = spread + qnorm(0.975) * se) %>%
  select(state, startdate, enddate, pollster, grade, spread, lower, upper)

add <- results_us_election_2016 %>% mutate(actual_spread = clinton/100 - trump/100) %>% select(state, actual_spread)
cis <- cis %>% mutate(state = as.character(state)) %>% left_join(add, by = "state")

errors <- cis %>% mutate(error = (spread - actual_spread),hit = sign(spread) == sign(actual_spread)) 

# The 'errors' data have already been loaded. Examine them using the `head` function.
head(errors)
#>            state  startdate    enddate                pollster grade
#> 1     New Mexico 2016-11-06 2016-11-06                Zia Poll  <NA>
#> 2       Virginia 2016-11-03 2016-11-04   Public Policy Polling    B+
#> 3           Iowa 2016-11-01 2016-11-04        Selzer & Company    A+
#> 4      Wisconsin 2016-10-26 2016-10-31    Marquette University     A
#> 5 North Carolina 2016-11-04 2016-11-06           Siena College     A
#> 6        Georgia 2016-11-06 2016-11-06 Landmark Communications     B
#>   spread        lower         upper actual_spread  error   hit
#> 1   0.02 -0.001331221  0.0413312213         0.083 -0.063  TRUE
#> 2   0.05 -0.005634504  0.1056345040         0.054 -0.004  TRUE
#> 3  -0.07 -0.139125210 -0.0008747905        -0.094  0.024  TRUE
#> 4   0.06  0.004774064  0.1152259363        -0.007  0.067 FALSE
#> 5   0.00 -0.069295191  0.0692951912        -0.036  0.036 FALSE
#> 6  -0.03 -0.086553820  0.0265538203        -0.051  0.021  TRUE

# Generate an object called 'totals' that contains the numbers of good and bad predictions for polls rated A- and C-
totals <- errors %>% filter(grade %in% c("A-", "C-")) %>% group_by(grade,hit) %>% summarize(num = n()) %>% spread(grade,num)   
#> Error in spread(., grade, num): could not find function "spread"
totals
#> Error in eval(expr, envir, enclos): object 'totals' not found

# Print the proportion of hits for grade A- polls to the console
totals %>% mean(hit == TRUE / `A-`)  
#> Error in eval(lhs, parent, parent): object 'totals' not found

# Print the proportion of hits for grade C- polls to the console
totals %>% mean(hit == TRUE /  `C-`)   
#> Error in eval(lhs, parent, parent): object 'totals' not found

paulgureghian · May 23, 2018, 11:51pm

This should work. Let me know.

dchiu · May 24, 2018, 12:00am

Is this what you are looking for?

prop.table(as.matrix(totals[, -1]), margin = 2)
#>             C-        A-
#> [1,] 0.1385042 0.1969697
#> [2,] 0.8614958 0.8030303

paulgureghian · May 24, 2018, 12:02am

I think I need a 2 x 3 tibble with `hit`, `A-`, and `C-`.

paulgureghian · May 24, 2018, 12:05am

Was my reprex OK? Did it run on your machine?

paulgureghian · May 24, 2018, 6:56pm

This is a DataCamp exercise I'm working on, and the libraries and dataset are all in RStudio.

dchiu · May 25, 2018, 4:59pm

It ran OK. This is what I see for `totals`:

totals <- errors %>%
  dplyr::filter(grade %in% c("A-", "C-")) %>%
  dplyr::group_by(grade,hit) %>%
  dplyr::summarize(num = n()) %>%
  tidyr::spread(grade, num)   
totals
#> # A tibble: 2 x 3
#>   hit    `C-`  `A-`
#>   <lgl> <int> <int>
#> 1 FALSE    50    26
#> 2 TRUE    311   106

I thought you wanted 2 by 2?
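
For the original question (the proportion of hits per grade), here is a minimal base R sketch. It assumes the counts from the `totals` output above, typed in by hand as a plain data frame; the names `p_hit_A` and `p_hit_C` are hypothetical.

```r
# Counts copied by hand from the `totals` output above
totals <- data.frame(hit = c(FALSE, TRUE),
                     `C-` = c(50L, 311L),
                     `A-` = c(26L, 106L),
                     check.names = FALSE)

# Proportion of hits = count in the hit == TRUE row / column total.
# Backticks are needed because `A-` and `C-` are non-syntactic names.
p_hit_A <- totals$`A-`[totals$hit] / sum(totals$`A-`)
p_hit_C <- totals$`C-`[totals$hit] / sum(totals$`C-`)

p_hit_A  # 106 / 132, about 0.803
p_hit_C  # 311 / 361, about 0.861
```

These agree with the second row of the `prop.table()` result earlier in the thread. Working from `errors` directly, grouping by `grade` and summarizing with `mean(hit)` should give the same proportions without the `spread()` step.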

paulgureghian · May 25, 2018, 6:15pm

The instructions called for 2 x 2, but who knows what the auto-grader will actually accept. I think the "hit" is 1, the grades are actually counted as 1. I was able to figure it out, though. Thanks. Catch you next time.