Calculating % of a column with binary values

sd09 · September 21, 2019, 3:00pm

I'm really new to R. This question is for a homework assignment where we have the option to use Excel or R but I want to figure it out in R if I can. I'm working with categorical data and have a column of 0 and 1 (dummy/binary variables) and basically need to calculate the % of 0s in the column. I hope that makes sense. I'm not well versed in R or coding terminology so going through articles on this has been confusing.

jcblum · September 21, 2019, 3:49pm

Hi @sd09! Welcome!

For starters, I think it will be easier to help with a little bit more information (this is pretty common: when you're new to this stuff, it's hard to know how much information is enough!). Since you’re asking a question about homework, you should also make sure you’re familiar with our FAQ: Homework Policy (so far, your question is within bounds).

So, my questions for you:

1. Do you already have your data in R, or is that part of what you’re trying to figure out?
   - If the data is in R, it will help if you can paste in the results of running the str() function on the R object that’s storing your data. For instance, if your data is in an R variable called hw01_data, you’d run:
     str(hw01_data)
     …in the console and copy-paste the results. Be sure to format what you paste as code so it doesn’t get garbled by the forum software!
2. How would you solve the problem of calculating the percentage of zeroes in the column if you were using a pencil and paper?
3. Have you been learning about specific R features in class? If so, which things you’ve learned about do you think might apply to this question?

(It’s OK if you’re not quite sure how, and it’s OK to give a “wrong” answer — I’m asking this because the thing that’s hardest about helping with homework questions is that most assignments can be completed many different ways and it’s difficult to know what to suggest without all the context of the class itself).

sd09 · September 21, 2019, 6:38pm

Hi @jcblum, thank you so much for sharing the FAQ link! The course I'm taking is Data Analytics and we're currently learning logistic regression. I'm working with data on different explanatory variables that affect SAT scores. The response variable is Improvement (given in the data set) which is the column I was referring to in my original question.

Here are the results from str()

> str(Kaplandata_)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	235 obs. of  15 variables:
 $ Improvement    : int  1 1 0 1 1 0 0 1 1 0 ...
 $ coaching_kaplan: int  0 0 0 0 0 0 0 0 0 0 ...
 $ coaching_other : int  0 0 0 0 0 0 0 0 0 0 ...
 $ coaching_no    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ hs_prep        : int  1 1 1 1 1 1 1 1 0 1 ...
 $ hs_voc         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ hs_other       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ hs_general     : int  0 0 0 0 0 0 0 0 1 0 ...
 $ Male           : int  1 0 0 1 1 1 1 0 0 0 ...
 $ Female         : int  0 1 1 0 0 0 0 1 1 1 ...
 $ Income         : int  10 12 7 17 14 14 17 2 17 6 ...
 $ HS.Type        : int  1 1 1 1 1 1 1 1 1 0 ...
 $ Rank           : int  3 2 3 1 4 2 3 1 1 2 ...
 $ Math1          : int  580 510 580 440 530 240 430 580 680 350 ...
 $ Verb1          : int  400 420 380 550 370 370 450 440 450 360 ...
 $ attr(*, "spec")=List of 2
  ..$ cols   :List of 15
  .. ..$ Improvement    : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ coaching_kaplan: list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ coaching_other : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ coaching_no    : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ hs_prep        : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ hs_voc         : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ hs_other       : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ hs_general     : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ Male           : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ Female         : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ Income         : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ HS.Type        : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ Rank           : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ Math1          : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ Verb1          : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  ..$ default: list()
  .. ..- attr(*, "class")= chr  "collector_guess" "collector"
  ..- attr(*, "class")= chr "col_spec"

I don't know how to fix the formatting of the output (is this what it's supposed to look like?). I do have the updated version of RStudio and haven't altered the data from the original .csv file so I hope there's not an issue there.

There are 235 rows in the column. I would have to count the number of 0s, divide that number by 235 and multiply it by 100.
We have covered R features more for data modeling purposes (performing regression, getting summary statistics for coefficients, creating residual plots, etc.). Most recently, for logistic regression, we did an example of fitting a logistic regression model where we used factor(), levels() and relevel() to create dummy variables and change reference levels. I'm not sure if factor() or levels() really apply because I don't need to fit the model to get a % for one column. It's also already in dummy variable form:

> P<-factor(Improvement)
> levels(P)
[1] "0" "1"

jcblum · September 21, 2019, 7:18pm

Thanks, your answers help a lot.

I should have probably made my advice on this more obvious: we have an FAQ that shows you how to get the formatting right when you're pasting stuff into posts on this site (it's as easy as clicking a button in the posting box). You can find that info here: FAQ: How to format your code
(I linked to it in my message above, but a little obliquely, sorry!)

For now, I went ahead and fixed the formatting in your reply. Without code formatting, the forum software thinks all those dollar signs mean you're trying to write mathematical equations and things get wonky.

Your sense that you don't need to fit a model to get a percentage for one column is a good intuition! Beyond being a case of using a sledgehammer to swat a fly, a model is not the same thing as a description of your data.

I also agree that factor manipulation functions are not helpful for your question. A factor is just a fancy numeric variable that has an "attribute" storing the information about what the numeric codes correspond to, so as I think you've realized, there's no benefit to factor-izing Improvement at this point.
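
To make the "fancy numeric variable" idea concrete, here is a minimal sketch using a made-up vector (x and f are illustrative names, not part of your data). It shows that a factor stores integer codes plus a "levels" attribute, which is why converting a factor straight to a number gives the codes rather than the original values:

```r
# Hypothetical 0/1 vector, standing in for a column like Improvement
x <- c(0L, 1L, 1L, 0L, 1L)

# Turn it into a factor
f <- factor(x)

# The labels that the internal codes correspond to
levels(f)                    # "0" "1"

# The internal integer codes (note: 1 and 2, not 0 and 1)
as.integer(f)                # 1 2 2 1 2

# The attributes that make it a factor: a list with $levels and $class
attributes(f)

# To get the original values back, go through the labels first
as.integer(as.character(f))  # 0 1 1 0 1
```

Since the column is already stored as plain integers, arithmetic on it works directly, with no conversion step needed.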

Great! So as a first step, can you express your pencil-and-paper algorithm (count the 0s, divide by the total number of rows, multiply by 100) in R code? In case you don't know where to start, try running these statements and seeing what happens:

Kaplandata_$Improvement

?length
length(Kaplandata_$Improvement)

Kaplandata_$Improvement == 0

Kaplandata_$Improvement[Kaplandata_$Improvement == 0]

How I'd translate the above statements into English sentences:

Kaplandata_$Improvement: "Show me the vector 'Improvement' from the data frame 'Kaplandata_'"
?length: "Show me help for the 'length' function"
length(Kaplandata_$Improvement): "How many elements are in Improvement?"
Kaplandata_$Improvement == 0: "Show me the results of going through each element of Improvement and asking if the element is equal to zero"
Kaplandata_$Improvement[Kaplandata_$Improvement == 0]: "Show me the elements of Improvement that are equal to zero", or more verbosely and precisely, "Show me the elements of Improvement where the answer to asking if they are equal to zero is TRUE".
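
If it helps to see those statements on something small enough to check by eye, here is a sketch with a made-up five-row data frame (toy is a hypothetical stand-in for Kaplandata_; only the column name Improvement is borrowed from the real data):

```r
# Hypothetical miniature data frame with one binary column
toy <- data.frame(Improvement = c(1L, 0L, 1L, 1L, 0L))

# "Show me the vector 'Improvement' from the data frame 'toy'"
toy$Improvement                      # 1 0 1 1 0

# "How many elements are in Improvement?"
length(toy$Improvement)              # 5

# "Is each element equal to zero?" (one TRUE/FALSE per element)
is_zero <- toy$Improvement == 0
is_zero                              # FALSE TRUE FALSE FALSE TRUE

# "Show me the elements where that answer is TRUE"
zeros <- toy$Improvement[is_zero]
zeros                                # 0 0
```

The logical vector is_zero is saved under its own name here only to make the two steps easier to see; writing the comparison directly inside the square brackets does the same thing.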

The direct translation of your pencil-and-paper method might not be the fanciest or most clever way to solve the problem, but it is often a very good starting place because it is a method you understand and can reason about. Plus it's easier to have fun trying other methods when you know you've got at least one thing that works.

A hint for going further: given that you're dealing with a column of 1s and 0s, can you think of a simple arithmetic trick that would give you the number of 1s in the column?

sd09 · September 21, 2019, 8:57pm

Wow, this was so helpful!! Thank you so much for taking the time to write out the English "translations" as well. One of my biggest hurdles with R is never fully understanding what the functions do, even after reading the "help" section.

After I entered the code you provided, I used the length function again with the last function nested inside:

> length(Kaplandata_$Improvement[Kaplandata_$Improvement == 0])
[1] 61

So now I can find the percent of 0s by dividing 61 by 235.

I'm guessing a simple arithmetic trick to find the number of 1s would be to subtract the number of 0s (61) from the total number of elements (235).
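
For reference, the arithmetic described above can be sketched in R using the two counts reported in this thread (61 zeros out of 235 rows). The variable names are made up for illustration:

```r
# Counts taken from the output shown in this thread
n_zero  <- 61    # rows where Improvement == 0
n_total <- 235   # total rows in the column

# Percentage of 0s: count of 0s divided by the total, times 100
pct_zero <- n_zero / n_total * 100
round(pct_zero, 2)   # 25.96

# Number of 1s by subtraction, since every row is either 0 or 1
n_one <- n_total - n_zero
n_one                # 174
```

The subtraction step relies on the column containing only 0s and 1s (no missing values), which matches what the str() output above suggests but is worth confirming on the real data.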

I'm really grateful for all your help.

jcblum · September 21, 2019, 9:36pm

sd09:

After I entered the code you provided, I used the length function again with the last function nested inside:
> length(Kaplandata_$Improvement[Kaplandata_$Improvement == 0])
[1] 61
So now I can find the percent of 0s by dividing 61 by 235.

Glad to see you're on the right track. You can keep building on that statement, too. For instance, you now know a way of getting R to calculate the 235 (yes?), so what happens if you combine those two things with a division sign ( / )?
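
As a rough sketch of what combining the two pieces looks like, here it is on a made-up vector (improvement is a hypothetical stand-in for Kaplandata_$Improvement, so the numbers below are not your homework answer):

```r
# Hypothetical toy column of 0s and 1s
improvement <- c(1, 1, 0, 1, 0, 1, 1, 0)

# Count of 0s, using the nested length() approach
n_zero <- length(improvement[improvement == 0])   # 3

# Count of all elements
n_total <- length(improvement)                    # 8

# Combine the two with a division sign to get a proportion,
# then scale to a percentage
prop_zero <- n_zero / n_total                     # 0.375
pct_zero  <- prop_zero * 100                      # 37.5
```

Letting R compute both counts means the calculation keeps working if the data changes, instead of relying on numbers typed in by hand.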

Subtracting the number of 0s from the total number of elements is not exactly the trick I was thinking of, but it's also a good idea!

Some more code for you to experiment with, to see if you can figure out the One Cool Trick of binary variables I had in mind:

# A purposefully simple binary variable. You can probably
# calculate the proportion of 1s just by looking at it!
bin_var <- c(1, 1, 1, 0, 0)

# Count how many elements are in the vector
length(bin_var)

# Count how many of the elements are equal to 1
length(bin_var[bin_var == 1])

# Sooo the number equal to 1 divided by the total number of
# elements is the proportion equal to 1
length(bin_var[bin_var == 1]) / length(bin_var)

# Hmm, what if we... add up all the numbers in the vector?
# Does this look like a number we've already seen?
sum(bin_var)

# So the sum of a binary variable is the same as [ you tell me! ].
# (More importantly, why does that make sense?)

# If I substitute this new method back into my proportion
# calculation, do I get the same answer?
sum(bin_var) / length(bin_var)

# Wait a minute... the sum divided by the total number...
# that sounds awfully familiar! What's that calculation called?
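
One possible way to follow those hints through to the end, sketched on the same toy vector (try answering the questions yourself before reading this). It assumes the column holds only 0s and 1s with no missing values:

```r
# Same purposefully simple binary variable as above
bin_var <- c(1, 1, 1, 0, 0)

# The sum divided by the number of elements is the mean, so the
# mean of a 0/1 variable is the proportion of 1s
prop_ones <- mean(bin_var)          # 0.6

# For the proportion of 0s, take the mean of the logical test.
# TRUE counts as 1 and FALSE as 0 when R does arithmetic on them.
prop_zeros <- mean(bin_var == 0)    # 0.4

# Scale to a percentage
pct_zeros <- 100 * prop_zeros       # 40
```

The two proportions add up to 1, which is a handy sanity check on any binary column.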

And you're welcome for the sentences! Learning to read code takes lots of practice, and people who've been doing it for a long time often forget how hard it is. On top of that, some syntax was designed to be compact and efficient to type, at the expense of being easy to read and comprehend for beginners. It's absolutely OK to ask "what does this line do?" or "what does this part mean?" — most people here will be happy to slow down and explain.
