How to sum rows but only rows with different values

Hi there,

Im stuck with my dataanalysis i R. I hope you can help. I am analyzing the impact of migrating from rural to urban areas in Tanzania and want to make a dietary diversity score for rural households before they migrate and after they migrate. I divided every fooditem into 10 groups. If the household consume 5 or more groups they have dietary diversity.

Rice is in group 1, cassava is in group 1, sweet potatoes is in group 8 and so on. How do I do if I want to tell R to give household#1 a score of 1 if they consumed food from 5 or more groups and a 0 if not. Remember that plenty of foods are in under the same foodgroup but I am only interested in the intake of different foodgroups. So the household should not get 1 two times if they consumed food from the same group twice.

To give an example of how my data looks like, see below. There is 59 different fooditems divided in 10 foodgroups. E.g. household#1 has 59 rows. Thereafter comes household#2 and so on downwards. There is 281 households.

ID Fooditemvalue Foodgroups Did you consume (yes/no)
1 101 1 1
1 102 1 2
1 103 8 1
1 104 3 1
2 101 1 2
2 102 1 1

How do I count the mean dietary diversity score (if the household consumes 5 or more foodgroups) from all of the household together and seperately? I would like to make a plot with dietary diversity score on the y-axis (from 1-10) and on the x-axis the different household (hh1rural, hh1urban;hh2rural, hh2urban; hh3rural, hh3urban...) and one plot with the mean dietary diversity score on the y-axis and on the x-axis rural households and urban households.

I hope you understand my question. Thank you so much in advance.

Here is how I would do it, using a small data set. I wrote a small function that filters each family's data for Consume == "Yes" and then counts the number of unique Group values.

#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>     filter, lag
#> The following objects are masked from 'package:base':
#>     intersect, setdiff, setequal, union
#Invent data
DF <- data.frame(ID = rep(1:3, each = 5),
                 Item = sample(101:159, 5),
                 Group = sample(1:10, 15, replace = TRUE),
                 Consume = sample(c("No", "Yes"), 15, replace = TRUE))
#>    ID Item Group Consume
#> 1   1  131    10     Yes
#> 2   1  115     2      No
#> 3   1  151     6     Yes
#> 4   1  114     5      No
#> 5   1  103     4     Yes
#> 6   2  131     6      No
#> 7   2  115     9     Yes
#> 8   2  151    10     Yes
#> 9   2  114     5      No
#> 10  2  103     3      No
#> 11  3  131     9      No
#> 12  3  115     9      No
#> 13  3  151     9     Yes
#> 14  3  114     3      No
#> 15  3  103     8     Yes

#Make a function to count the unique Groups
DiversityFunc <- function(G, C) {
  G <- G[C == "Yes"]

DiversityDF <- DF %>% group_by(ID) %>% 
  summarize(GroupCount = DiversityFunc(Group, Consume))
#> `summarise()` ungrouping output (override with `.groups` argument)

Created on 2021-05-19 by the reprex package (v0.3.0)

Thank you so much for the thorough answer @FJCC. I will try and see how that works out.

I tried to run the code now @FJCC. It did what I told it to do, I tried to interpreted your code into my own R with the exact number of colums, rows etc. Something went wrong during the process. The thing is that the fooditems not are in order, so it will look like this: 101, 102, 203, 204, 205, 300, 301 and so on. So when I run the code it is just chosing randomly numbers it seems like. It looks like this:

So the fooditems "Værdi" is not the actual ones and also I can see that the intake (hh_j01) gives a one twice to foodgroupvalue "Fødevaregruppeværdi" no 6. I want to tell R only to count 1 one time in the same foodgroup and not twice.

Can you help me again with this?

Best Anika

Data as it is now looks like this with all the important colums
y3_hhid: ID, Værdi=fooditemvalue, Fødevare=food, Fødevaregruppe=foodgroup, Fødevaregruppeværdi=foodgroupvalue, hh_j01=1=yes, 2=no (consume)

I think the code you need is

DiversityFunc <- function(G, C) {
  G <- G[C == 1]

DiversityDF <- DF %>% group_by(y3_hhid) %>% 
  summarize(GroupCount = DiversityFunc(Fødevaregruppeværdi, hh_j01))

The resulting data frame DiversityDF should have one row for each value of y3_hhid and its GroupCount column will show how many unique values of Fødevaregruppeværdi had hh_j01 == 1.

If that does not work, please make a small data set with two values of y3_hhid with each of those having a few different foods. Ten rows of data is enough. Let's say you name that small data set DF. Post the output of


and place a line with three back ticks just before and after the output, like this
your output here
We can then both work with the same data set.

It worked @FJCC! Thank you so much for your help.

@FJCC, do you have any idea how to count how many "1" (yes) there is in each of the 10 foodgroups for all the households? The foodintake in each foodgroup still needs to be unique.

What a would like is a simple table looking like this.
Foodgroup n intake
1 201
2 99
3 137

So the n intake will be showing how many of the 281 household that did consume food from the specifik foodgroup. Then after I will calculate the percentage.

Do you have any idea how to do this?

I think this will do what you want.

CountHousehold <- DF %>% filter(hh_j01 == 1) %>% #keep only the "Yes" rows
  select(y3_hhid, Fødevaregruppeværdi) %>% #keep only these two columns
  distinct() %>% #Remove duplicates
  count(Fødevaregruppeværdi) #Count how many times each Fødevaregruppeværdi appears

Hi again @FJCC ,

That was what I was looking for thank you so much!
I noticed that there are some households with dietary diversity=0 which is unlikely. I have identified those households by subsetting. Now I want to delete those from my initial data. I tried all the tips from google nothing works.

I have a dataset "subsetmhdds0" with my 11 households that needs to be removed from another dataset. The IDs are in both datasets called "hh_a09". Do you have any suggestions how to remove these households? The IDs are listet as exponents eg 8.060120e+12 which may be a problem, because there's multiple IDs looking like that.

I hope you understand my question? :slight_smile:

It is hard to be specific without seeing either your code or your data. Starting with the second problem, your IDs have been read as numbers. If they are numbers in both data sets, that is not a problem. If one data set has numbers and the other characters, you can convert the numbers to characters with code like

subsetmhdds0$hh_a09 <- as.character(subsetmhdds0$hh_a09)

If your main data set is called DF, you can filter out all of the IDs that are in subsetmhdds0 with this code.

DF <- DF %>% filter(!hh_a09 %in% subsetmhdds0$hh_a09)

As a tip, statements like "I tried everything and nothing works" do not provide much information. Please show what you did try and what happened.

Hi again,

Thank you so much, that helped. In the future I will post my codes and tell more about the process before posting.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.