Finding percentages of binary datasets

ra17mv · December 7, 2021, 6:00pm

I have a little bit of experience learning R through my courses, but most data sets we worked on were not binary. I am now trying to understand how to analyze binary data sets. I have a file in which cases are labelled as 1 or 0 based on a yes/no system.

I am working on an imported CSV dataset. I have multiple columns with binary and string data, but I want to create a table which displays the percentage of people that like apples (from a binary column with 0 for do not like apples and 1 for do like apples) relative to their location (from a string column with three locations: farm 1, farm 2 and farm 3). How can I create a table with the percentage of people that like apples from each location if my data is binary?

The table should look like:

                          | Farm 1 | Farm 2 | Farm 3
% People that like Apples |        |        |

Also, there are certain cases where Farm 2 and Farm 3 have been misspelled as Fram 2/3 multiple times. Is there a way to replace all the mislabelled data entries at once?

Thanks!

FJCC · December 7, 2021, 7:35pm

The usual way of finding the percentage of responses that meet a criterion is to count the cases that meet the criterion, divide by the total number of responses and multiply by 100. With the responses coded as 0 and 1, counting the number of ones is the same as summing all of the responses. That is, if you have five responses with two "likes", you may have (1,0,0,1,0). Summing that set of numbers gives 2 and then you would divide by 5 (the number of responses) and multiply by 100 to get 2/5 * 100 = 40%. However, summing a set of values and then dividing by the number of values is the same as calculating the mean. The percentage of responses that are 1 is therefore the same as the mean of the responses times 100. mean(c(1,0,0,1,0)) * 100 = 40.
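
A quick check of that equivalence in base R, using the five made-up responses from the paragraph above (the names `responses`, `pct_by_count` and `pct_by_mean` are only for illustration). Note that the values have to be wrapped in `c()`, since `mean()` treats only its first argument as the data:

```r
# Five yes/no responses coded as 1/0 (illustrative values only)
responses <- c(1, 0, 0, 1, 0)

# Count the ones, divide by the number of responses, multiply by 100
pct_by_count <- sum(responses) / length(responses) * 100

# The mean of a 0/1 vector is the proportion of ones
pct_by_mean <- mean(responses) * 100

pct_by_count
#> [1] 40
pct_by_mean
#> [1] 40
```
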
Below is an example of doing this sort of calculation using one possible data layout. I also illustrated how to change Fram to Farm using the str_replace() function from the stringr package.

DF <- data.frame(Site = c("Farm 1","Farm 2","Fram 3","Farm 1","Farm 2",
                          "Farm 3","Farm 1","Fram 2","Farm 3","Farm 1"),
                 Like = c(0,1,1,0,0,0,1,0,1,1))
DF
#>      Site Like
#> 1  Farm 1    0
#> 2  Farm 2    1
#> 3  Fram 3    1
#> 4  Farm 1    0
#> 5  Farm 2    0
#> 6  Farm 3    0
#> 7  Farm 1    1
#> 8  Fram 2    0
#> 9  Farm 3    1
#> 10 Farm 1    1
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(stringr)
DF <- DF |> mutate(Site = str_replace(Site, "Fram", "Farm"))
DF
#>      Site Like
#> 1  Farm 1    0
#> 2  Farm 2    1
#> 3  Farm 3    1
#> 4  Farm 1    0
#> 5  Farm 2    0
#> 6  Farm 3    0
#> 7  Farm 1    1
#> 8  Farm 2    0
#> 9  Farm 3    1
#> 10 Farm 1    1
Percents <- DF |> group_by(Site) |> summarize(Perc = mean(Like) * 100)
Percents
#> # A tibble: 3 x 2
#>   Site    Perc
#>   <chr>  <dbl>
#> 1 Farm 1  50  
#> 2 Farm 2  33.3
#> 3 Farm 3  66.7

Created on 2021-12-07 by the reprex package (v2.0.1)
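
The original question asked for the farms as columns rather than rows. One possible way to get that layout is sketched here in base R so that it needs no extra packages. It is an alternative to the dplyr pipeline, not part of it, and it rebuilds the same example data so the block runs on its own. The names `perc` and `wide` are made up for this sketch.

```r
# Same illustrative data as in the example, rebuilt so this block is self-contained
DF <- data.frame(Site = c("Farm 1","Farm 2","Fram 3","Farm 1","Farm 2",
                          "Farm 3","Farm 1","Fram 2","Farm 3","Farm 1"),
                 Like = c(0,1,1,0,0,0,1,0,1,1))

# Base R counterpart of the str_replace() step.
# Like str_replace(), sub() only changes the first match in each string;
# gsub() (or str_replace_all()) would change every match.
DF$Site <- sub("Fram", "Farm", DF$Site)

# Mean of the 0/1 column within each site, times 100
perc <- tapply(DF$Like, DF$Site, mean) * 100

# One-row table with one column per farm, rounded to one decimal place
wide <- matrix(round(perc, 1), nrow = 1,
               dimnames = list("% People that like Apples", names(perc)))
wide
```

For the example data this gives 50, 33.3 and 66.7 for Farm 1, Farm 2 and Farm 3, the same values as the grouped summary.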

ra17mv · December 7, 2021, 9:05pm

Thank-you so much. That was a lot easier than I thought it would be and makes so much sense.

I am having a little bit of trouble renaming the site names. I want it to update in my file which I will extract. Currently I have the data imported into dataset1.

Instead of using a data.frame, I tried to run the function using the dataset as follows:

dataset1 <- dataset1 |> mutate(dataset$site = str_replace(dataset$site, "Fram", "Farm"))

However, I keep getting the following error:

Error: unexpected '=' in "dataset1 <- dataset1 |> mutate(dataset1$site ="

Also, when I attach the dataset and then run the following:

dataset1 <- dataset1 |> mutate(site = str_replace(site, "Fram", "Farm"))

it seems to run fine without any errors, but when I look at the data through table(site), it still returns the incorrect spelling. I am not sure what I am doing wrong there.

FJCC · December 7, 2021, 9:26pm

Notice that in

dataset1 <- dataset1 |> mutate(dataset$site = str_replace(dataset$site, "Fram", "Farm"))

you start by using dataset1, but inside the mutate() function you refer to dataset. Also, there is no need to use the dataset1$site notation inside of mutate(). Just write site; mutate() knows that it is working with dataset1 because that has been passed in by preceding the function with dataset1 |>.

I cannot account for dataset1 not changing after you run the code. Try it without attaching dataset1. There is no need to do that. If that does not work, please post a small subset of your data. Run

dput(head(dataset1))

and post the output of that. Place a line with three back ticks just before and just after the output, like this
```
pasted output here
```
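
One likely explanation for the table(site) result, offered as a guess since the actual data was not posted: attach() copies the columns onto the search path at the moment it is called, and later changes to dataset1 do not update that copy. A small sketch with made-up data (the contents of `dataset1` and the names `stale` and `fresh` are hypothetical):

```r
# Hypothetical stand-in for the imported CSV
dataset1 <- data.frame(site = c("Farm 1", "Fram 2", "Fram 3", "Farm 2"),
                       apples = c(1, 0, 1, 1))

attach(dataset1)   # copies the columns onto the search path as they are now

# Fix the spelling in the data frame itself (base R version of the mutate() call)
dataset1$site <- sub("Fram", "Farm", dataset1$site)

stale <- table(site)            # finds the attached copy, which is unchanged
fresh <- table(dataset1$site)   # looks at the updated data frame

detach(dataset1)   # remove the copy from the search path again

names(stale)       # still contains the "Fram" spellings
names(fresh)       # only "Farm" spellings
```

If that is what happened, checking with table(dataset1$site) instead of table(site), or simply not attaching, should show the corrected names.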

ra17mv · December 7, 2021, 11:57pm

Oh, I see the error! Thank you that worked perfectly!!
