Using R to create a proportion

Hi R community,
I wonder if anyone could help me create a proportion using my data set?
Here is the dput:
df1 <-structure(list(data.year = c(2010L, 2010L, 2010L, 2010L, 2010L,
2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2010L,
2010L, 2010L, 2010L, 2010L, 2010L, 2010L), current.group = c("F",
"F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F",
"F", "F", "F", "F", "F", "F"), monkey.id = c("00J", "00O", "00O",
"03J", "03J", "04N", "10S", "10S", "10S", "14I", "14I", "14L",
"14L", "14L", "20F", "20F", "20F", "24B", "24B", "24B"), partner.id = c("63V",
"44J", "55V", "62V", "00J", "24B", "14L", "64P", "68V", "29Z",
"V36", "17C", "68V", "10S", "73B", "87K", "X44", "28J", "59A",
"04N"), Survival.endofyear = c("Y", "Y", "Y", "Y", "Y", "Y",
"Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y",
"Y")), row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"
))

The variable "monkey.id" contains individual monkeys, each with a unique ID code. The variable "partner.id" contains the ID of a social partner they have in a given year. Finally, the variable "data.year" gives the year of data collection. I have data for multiple years of study.
What I want to do is create a proportion that shows partnership stability: if a monkey has a social partner one year, for how many years do they stay partners? For example, if a monkey is in the data set for 10 years and has a social relationship with the same individual for 7 of those years, it would have a proportion of 0.7.
I also have "Survival.endofyear", which records whether the monkey lives to the end of the year. If the monkey dies, that will affect how long it is in the data set. There are also different groups of monkeys; "current.group" represents the group in which they live. I have differing amounts of data for different groups, i.e. for some groups I have more years of data than for others.
Any pointers would be much appreciated.

Would it be possible to share a subset of data that includes multiple years for the same monkey?

A handy way to supply some sample data is the dput() function. In the case of a large dataset something like dput(head(mydata, 100)) should supply the data we need if it includes multiple years as jonspring suggests.
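
To make sure the sample really covers several years for the same monkey, one option is to filter before calling dput(). This is only a minimal sketch in base R: `mydata` and its rows are made up for illustration, and it assumes the columns are named `monkey.id` and `data.year` as in the original post.

```r
# Toy stand-in for the full data set; `mydata` and its rows are made up for illustration.
mydata <- data.frame(
  data.year  = c(2015L, 2016L, 2017L, 2015L, 2016L, 2017L),
  monkey.id  = c("A1", "A1", "A1", "B2", "C3", "C3"),
  partner.id = c("B2", "B2", "C3", "A1", "A1", "A1")
)

# Number of distinct years in which each monkey appears
years_per_monkey <- tapply(mydata$data.year, mydata$monkey.id,
                           function(y) length(unique(y)))

# Keep only monkeys observed in more than one year
multi_year_ids <- names(years_per_monkey)[years_per_monkey > 1]
sample_rows <- mydata[mydata$monkey.id %in% multi_year_ids, ]

# Print a copy-pasteable definition of (at most) the first 100 rows
dput(head(sample_rows, 100))
```

Filtering first means the 100 rows are not used up by a single year of data.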

Thank you both for your responses. I have subsetted some of group V. Included are some data for 3 different years, which features multiple years for the same individual.
df1 <-structure(list(data.year = c(2015L, 2015L, 2015L, 2015L, 2015L,
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L,
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L,
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L,
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L,
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L,
2015L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L,
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L,
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L,
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L,
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L,
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L,
2016L, 2016L, 2016L, 2016L, 2016L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L), current.group = c("V",
"V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V",
"V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V",
"V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V",
"V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V",
"V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V",
"V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V",
"V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V",
"V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V",
"V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V",
"V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V",
"V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V",
"V", "V", "V", "V"), monkey.id = c("0F0", "0F0", "0F0", "14E",
"1B0", "1G0", "1G0", "1G0", "1G1", "1G2", "1G2", "1G2", "1G4",
"1G6", "2E4", "2E4", "2G2", "2G2", "2G4", "3H0", "3H0", "45Z",
"45Z", "45Z", "47Z", "47Z", "4C5", "4C5", "4C9", "4C9", "51Z",
"51Z", "51Z", "52Z", "5C6", "5C6", "5E8", "5E8", "78I", "78I",
"7C4", "7C4", "81E", "81E", "90T", "90T", "90T", "93T", "93T",
"9E8", "9E8", "0F0", "0F0", "0K0", "0K0", "13H", "14E", "15E",
"15E", "1G0", "1G0", "1G0", "1G1", "1G1", "1G6", "1I5", "1I5",
"1I7", "1I7", "1K1", "1K1", "1K2", "1K2", "2E4", "2E4", "2E4",
"2G2", "2G2", "2G2", "2G4", "2I3", "3D9", "3H0", "3H0", "47Z",
"4C5", "4C5", "4C9", "4C9", "4J4", "4J4", "51Z", "51Z", "53Z",
"5C6", "5E8", "78I", "7C4", "7C4", "90T", "90T", "93T", "93T",
"98J", "9C1", "9C1", "9C1", "9E8", "T03", "0F0", "0F0", "0K0",
"0M3", "0M4", "15E", "1G1", "1G1", "1G1", "1G4", "1I6", "1I6",
"1I7", "1I7", "1K1", "2E4", "2E4", "2G2", "2G2", "2K1", "4C5",
"4J4", "4J4", "52Z", "5E8", "5E8", "5L7", "5L7", "78I", "85T",
"90T", "90T", "93T", "93T", "9C1", "9C1", "9C1", "9E8", "9E8"
), partner.id = c("1G2", "2E4", "5E8", "1G2", "00V", "3H0", "4C9",
"90T", "7C4", "0F0", "14E", "1G1", "3H0", "7C4", "5C6", "7C4",
"2G4", "4C5", "2G2", "1G0", "1G4", "5E8", "00V", "2E4", "4C9",
"93T", "4C9", "2G2", "3H0", "47Z", "5C6", "2E4", "45Z", "2E4",
"7C4", "2E4", "0F0", "45Z", "00V", "2G4", "2E4", "5C6", "90T",
"9E8", "1G0", "3H0", "81E", "2G2", "47Z", "7C4", "81E", "0K0",
"2E4", "1K1", "0F0", "51Z", "51Z", "2E4", "5C6", "1G4", "1I5",
"4J4", "1I7", "3H0", "4J4", "00V", "1G0", "2G2", "00V", "2E4",
"0K0", "90T", "1G6", "7C4", "0F0", "1K1", "2I3", "90T", "1I7",
"5C6", "4J4", "52Z", "1G1", "2G4", "53Z", "5E8", "7C4", "93T",
"15E", "52Z", "2I3", "5C6", "14E", "47Z", "51Z", "00V", "00V",
"2E4", "5C6", "93T", "2G2", "1G6", "90T", "90T", "00V", "2G4",
"53Z", "00V", "1I6", "0K0", "2E4", "0F0", "0F0", "1K1", "93T",
"1I7", "7C4", "93T", "51Z", "85T", "1G4", "78I", "1G1", "0M4",
"5E8", "0F0", "4J4", "1I5", "5E8", "5L7", "1G4", "2G2", "5L7",
"1G4", "2E4", "1I5", "4C5", "9E8", "1I6", "93T", "2G2", "5E8",
"90T", "0F0", "4C5", "53Z", "4C9", "78I"), Survival.at.yearsend = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
)), row.names = c(NA, -148L), class = c("tbl_df", "tbl", "data.frame"
))

Thanks for the data. Survival.at.yearsend is coded (0, 1). Does 0 mean "still alive"?

Yes that's correct. Thanks for helping.

I am having a mental block on how to do this. Here is what I have looked at so far, but I feel like I am just exploring the data, not answering the question.

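library(tidyverse)
# number of records per monkey
dat1  <-   df1  %>% count(monkey.id)
# number of records per monkey/partner pair (overwrites the count above)
dat1  <-   df1 %>%  group_by(monkey.id)  %>% count(partner.id)

subset(dat1, monkey.id == "0F0" & partner.id == "0K0")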

Rather promiscuous lot aren't they?

Thanks for your thoughts. The code you've given creates a count of how many years an individual has the same partner, which is helpful for what I'm trying to do. What I'd like to do is somehow calculate how many years an individual is in the data set, and then create a proportion (number of years with the same partner / total number of years the monkey is alive in the data set).
Thanks for your help. Yes, they have plenty of partners.

Try this where I removed the creation of the data set and changed one row to get at least one consecutive relation:

# df1 <-structure ( ) 

library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.1.2
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# to ensure that we have at least one consecutive relation
df1c <- df1 |>
  mutate( partner.id = case_when(
    (data.year == 2017 & monkey.id =="78I" & partner.id == "9E8") ~ "00V" ,
    TRUE ~ partner.id
  ))
          
# number of years per monkey 
nr_year_by_monkey <- df1c |>
  group_by(monkey.id,data.year) |>
  summarise(count=n()) |>
  summarise(count=n())  
#> `summarise()` has grouped output by 'monkey.id'. You can override using the
#> `.groups` argument.

# determine the unique partner in one year (NA_character if more than one)
df2 <- df1c |>
    nest_by(monkey.id,data.year,.key="d") |>
    rowwise() |>
    mutate(partners=  list(unique(d$partner.id)),
           uniq_partner = ifelse(length(partners)>1,NA_character_, partners)) |>
    select(monkey.id,data.year,uniq_partner) |>
    ungroup()

# determine the number of consecutive years for relation
#   assuming there are no missing years
df3 <- df2 |>
  arrange(monkey.id,data.year) |>
  group_by(monkey.id) |>
  mutate(count = ifelse( lag(uniq_partner,default=" ") == uniq_partner,1,0),
         count = ifelse(is.na(count),0,count) ,
         count = ifelse(count == 1 & lag(count,default=0) == 0,2,count)) |>
  summarise(relation_years=sum(count))

# determine percentage
df_percent <- df3 |>
  left_join(nr_year_by_monkey,by=c(monkey.id="monkey.id")) |>
  mutate(perc_relation= relation_years / count)
Created on 2022-07-08 by the reprex package (v2.0.1)

Thanks so much for your input! I ran the code you posted and it has created proportions for each monkey. What I'm looking to do is create separate proportions for each different social partner a monkey has. I also need to account for survival, i.e. some of the monkeys die during the study. Did your code account for this in some way? I appreciate your help.

Please explain:

I assumed that if a monkey dies, there will be no more records for that monkey (??)
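
That assumption can be checked directly. Below is a rough sketch on made-up toy data (the `toy` tibble and its IDs are hypothetical); it assumes the coding confirmed earlier in the thread, where 0 in `Survival.at.yearsend` means still alive and so 1 means the monkey did not survive the year.

```r
library(dplyr)

# Toy data, made up for illustration; 0 = alive at year's end, 1 = did not survive
toy <- tibble(
  data.year            = c(2015L, 2016L, 2015L, 2016L, 2017L),
  monkey.id            = c("A1", "A1", "B2", "B2", "B2"),
  partner.id           = c("B2", "B2", "A1", "A1", "C3"),
  Survival.at.yearsend = c(0, 1, 0, 0, 0)
)

# First year in which each monkey is recorded as not surviving
death_year <- toy %>%
  filter(Survival.at.yearsend == 1) %>%
  group_by(monkey.id) %>%
  summarise(death.year = min(data.year))

# Records dated after a monkey's recorded death year;
# zero rows is consistent with the assumption
records_after_death <- toy %>%
  inner_join(death_year, by = "monkey.id") %>%
  filter(data.year > death.year)

nrow(records_after_death)
```

If this returns a non-zero count on the real data, those rows would need a closer look before computing any proportions.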

I presume this is incomplete in some respect, but to understand the problem I wanted to try something simple to modify later.

What if for each pairing in the data, you collect the first year, last year, and number of years observed?


df1 %>%
  group_by(monkey.id, partner.id) %>%
  summarize(start = min(data.year),
            end   = max(data.year),
            obs   = n()) %>%
  mutate(proportion = obs / (end - start + 1))

In the subset of data you shared, I'm getting 100% for all pairings, which seemed suspect but maybe will look clearer with more data.

I also wasn't sure how survival would fit in to this -- if a monkey doesn't survive at years end should we exclude that year's observation?
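
A variant of the summary above, closer to the proportion described in the original question, would divide the number of years a pair is observed by the number of years the monkey itself appears in the data. This is only a sketch on made-up toy data (the `toy` tibble and the column names `years.together` and `years.observed` are hypothetical), and it assumes a year in which the monkey dies still counts as an observed year.

```r
library(dplyr)

# Toy data, made up for illustration
toy <- tibble(
  data.year  = c(2015L, 2015L, 2016L, 2017L, 2017L, 2015L, 2016L),
  monkey.id  = c("A1", "A1", "A1", "A1", "A1", "B2", "B2"),
  partner.id = c("B2", "C3", "B2", "B2", "C3", "A1", "A1")
)

# Denominator: number of distinct years each monkey appears in the data
years_in_data <- toy %>%
  group_by(monkey.id) %>%
  summarise(years.observed = n_distinct(data.year))

# Numerator: number of distinct years each monkey/partner pair is observed
pair_stability <- toy %>%
  group_by(monkey.id, partner.id) %>%
  summarise(years.together = n_distinct(data.year), .groups = "drop") %>%
  left_join(years_in_data, by = "monkey.id") %>%
  mutate(proportion = years.together / years.observed)

pair_stability
```

In the toy data, A1 appears in three years and is paired with C3 in two of them, so that pair gets 2/3. Using n_distinct() rather than n() keeps duplicate rows within a year from inflating the counts.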

Thank you both very much for your help and thoughts. I've decided it's just too complicated to analyse stability in this way and am going to try something very different.
Thanks again!
