Create turnover point variables

I am trying to recreate the following picture, specifically the 2 variables "Last year of tenureship" and "First year of tenureship". My question consists of two parts, Firstly, the thing is that for a specific company, let's say company A, i want to highlight the first year of the tenureship with their client, and the last year of the tenureship with their client, to calculate the turnover point. If it is not the first or last year, the value is 0 (i was thinking of the mutate and if_else function?).
Secondly, the datas i am using is from 1999 to 2021. If i have like in the picture company C, which only has data from 2005 until 2018, the "first year value" cannot have the value 1 in 2005, but instead 0, as for this company, the data is only available from 2005, and not from the start (1999) , so i cannot measure the tenureship due to a lack of data (in executive, we don't now with who the company was working with in 2004, so therefore, the "first year variable" cannot be equal to 1 in 2005). Same goes for the year 2018, where the data stops again, so before 2021. I would like to give it also the value 0 instead of 1 for the last year variable, as we do not have the information about the year of 2019. Can someone help me out a bit with this? I have explained it a little bit in the image as well, so hopefully it clarifies it a little bit more. I do not know it this is possible to recreate the second part of my question. Thanks in advance already!

https://imgur.com/a/wr3HYGN

EDIT: After the comment from nirgrahamuk, i also am writing the code over here :slight_smile:

df <- data.frame(
stringsAsFactors = FALSE,
year = c(1999L,2000L,
2001L,2002L,2003L,2004L,2005L,2006L,2007L,
2008L,2009L,2010L,2011L,2012L,2013L,2014L,
2015L,2016L,2017L,2018L,2019L,2020L,2021L,
1999L,2000L,2001L,2002L,2003L,2004L,2005L,
2006L,2007L,2008L,2009L,2010L,2011L,2012L,
2013L,2014L,2015L,2016L,2017L,2018L,2019L,
2020L,2021L,2005L,2006L,2007L,2008L,2009L,
2010L,2011L,2012L,2013L,2014L,2015L,2016L,
2017L,2018L),
Company = c("A","A","A",
"A","A","A","A","A","A","A","A","A","A",
"A","A","A","A","A","A","A","A","A","A",
"B","B","B","B","B","B","B","B","B",
"B","B","B","B","B","B","B","B","B","B",
"B","B","B","B","C","C","C","C","C","C",
"C","C","C","C","C","C","C","C"),
clientnumber = c(1,1,1,1,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,
3,3,3,3,3,3,3,4,1,1,1,2,2,2,2,2,
2,2,2,2,2,2)
)

Hello,
I'm sure you shared this image with the best intentions, but perhaps you didnt realise what it implies.
If someone wished to use example data to test code against, they would type it out from your screenshot...

This is very unlikely to happen, and so it reduces the likelihood you will receive the help you desire.
Therefore please see this guide on how to reprex data. Key to this is use of either datapasta, or dput() to share your data as code

df <- data.frame(
  stringsAsFactors = FALSE,
                      year = c(1999L,2000L,
                               2001L,2002L,2003L,2004L,2005L,2006L,2007L,
                               2008L,2009L,2010L,2011L,2012L,2013L,2014L,
                               2015L,2016L,2017L,2018L,2019L,2020L,2021L,
                               1999L,2000L,2001L,2002L,2003L,2004L,2005L,
                               2006L,2007L,2008L,2009L,2010L,2011L,2012L,
                               2013L,2014L,2015L,2016L,2017L,2018L,2019L,
                               2020L,2021L,2005L,2006L,2007L,2008L,2009L,
                               2010L,2011L,2012L,2013L,2014L,2015L,2016L,
                               2017L,2018L),
                   Company = c("A","A","A",
                               "A","A","A","A","A","A","A","A","A","A",
                               "A","A","A","A","A","A","A","A","A","A",
                               "B","B","B","B","B","B","B","B","B",
                               "B","B","B","B","B","B","B","B","B","B",
                               "B","B","B","B","C","C","C","C","C","C",
                               "C","C","C","C","C","C","C","C"),
              clientnumber = c(1,1,1,1,2,
                               2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
                               2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,
                               3,3,3,3,3,3,3,4,1,1,1,2,2,2,2,2,
                               2,2,2,2,2,2)
        )

You mean like this? Sorry, i am quite new to this website, and also quite new to rstudio. I will also update the main post. I by the way do not know why there is a L after every year, this happened when i used dpasta for some reason. Thanks for the comment, noted

It seems important to know the first year that each company has data for, as it matters if its 1999 or after....
so calculate that and then bring it in to use it.

(company_bounds_df <- df %>% group_by(Company) %>%
  summarise(first_year_data=min(year),
            last_year_data = max(year)))


(df2 <- left_join(
  df,
  company_bounds_df
) %>% group_by(Company, clientnumber) %>% mutate(
  first_year_ten = as.integer(
    year == min(year) & (year != first_year_data | first_year_data==1999)
)) )%>% print(n=100)

I'm not sure I understood your logic for first year of tenure, I think it should be grouped at company/client level. then for each year, its got to be the earliest year in order to get a 1 , and the exceptions are if its the first year for which that company has had data for .... unless that year is 1999

1 Like

Thanks! This was indeed where i was looking for, and i have played a bit around with your script, and got totally what i wanted to recreate. I still have one problem where i am running into right now, it is where i am creating a new variable which calculates the length of the partnership between the company and the clientnumber. If again, as is happening with company C, there is missing data before 2005, the value gets 0 or NA (until there is a change of clientnumber, then it starts counting from the start again) as we cannot know the length of the partnership due to lack of information. This is what i am trying to recreate ( i have put this in manually):

Summary

df <- data.frame(
stringsAsFactors = FALSE,
year = c(1999L,2000L,2001L,2002L,
2003L,2004L,2005L,2006L,2007L,2008L,2009L,2010L,
2011L,2012L,2013L,2014L,2015L,2016L,2017L,2018L,
2019L,2020L,2021L,1999L,2000L,2001L,2002L,2003L,
2004L,2005L,2006L,2007L,2008L,2009L,2010L,2011L,
2012L,2013L,2014L,2015L,2016L,2017L,2018L,2019L,
2020L,2021L,2005L,2006L,2007L,2008L,2009L,2010L,
2011L,2012L,2013L,2014L,2015L,2016L,2017L,2018L),
Company = c("A","A","A","A","A","A",
"A","A","A","A","A","A","A","A","A","A","A",
"A","A","A","A","A","A","B","B","B","B","B",
"B","B","B","B","B","B","B","B","B","B","B",
"B","B","B","B","B","B","B","C","C","C","C","C",
"C","C","C","C","C","C","C","C","C"),
clientnumber = c(1,1,1,1,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,1,
1,1,2,2,2,2,2,2,2,2,2,2,2),
first_year_data = c(1999L,1999L,1999L,1999L,
1999L,1999L,1999L,1999L,1999L,1999L,1999L,1999L,
1999L,1999L,1999L,1999L,1999L,1999L,1999L,1999L,
1999L,1999L,1999L,1999L,1999L,1999L,1999L,1999L,
1999L,1999L,1999L,1999L,1999L,1999L,1999L,1999L,
1999L,1999L,1999L,1999L,1999L,1999L,1999L,1999L,
1999L,1999L,2005L,2005L,2005L,2005L,2005L,2005L,
2005L,2005L,2005L,2005L,2005L,2005L,2005L,2005L),
last_year_data = c(2021L,2021L,2021L,2021L,
2021L,2021L,2021L,2021L,2021L,2021L,2021L,2021L,
2021L,2021L,2021L,2021L,2021L,2021L,2021L,2021L,
2021L,2021L,2021L,2021L,2021L,2021L,2021L,2021L,
2021L,2021L,2021L,2021L,2021L,2021L,2021L,2021L,
2021L,2021L,2021L,2021L,2021L,2021L,2021L,2021L,
2021L,2021L,2018L,2018L,2018L,2018L,2018L,2018L,
2018L,2018L,2018L,2018L,2018L,2018L,2018L,2018L),
last_year_tenure = c(0L,0L,0L,1L,0L,0L,0L,
0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,
0L,1L,0L,0L,1L,0L,0L,0L,0L,0L,0L,0L,0L,0L,
0L,0L,0L,0L,0L,0L,0L,0L,0L,1L,1L,0L,0L,1L,
0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L),
first_year_tenure = c(1L,0L,0L,0L,1L,0L,0L,
0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,
0L,0L,1L,0L,0L,1L,0L,0L,0L,0L,0L,0L,0L,0L,
0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,1L,0L,0L,0L,
1L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L),
tenure = c(1,2,3,4,1:19,1,2,3,1:19,1,0,0,0,1:11)

)

I tried this using the row_number() formula, but it gave me this:

Summary

df data.frame(
stringsAsFactors = FALSE,
year = c(1999L,2000L,2001L,2002L,
2003L,2004L,2005L,2006L,2007L,2008L,2009L,2010L,
2011L,2012L,2013L,2014L,2015L,2016L,2017L,2018L,
2019L,2020L,2021L,1999L,2000L,2001L,2002L,2003L,
2004L,2005L,2006L,2007L,2008L,2009L,2010L,2011L,
2012L,2013L,2014L,2015L,2016L,2017L,2018L,2019L,
2020L,2021L,2005L,2006L,2007L,2008L,2009L,2010L,
2011L,2012L,2013L,2014L,2015L,2016L,2017L,2018L),
Company = c("A","A","A","A","A","A",
"A","A","A","A","A","A","A","A","A","A","A",
"A","A","A","A","A","A","B","B","B","B","B",
"B","B","B","B","B","B","B","B","B","B","B",
"B","B","B","B","B","B","B","C","C","C","C","C",
"C","C","C","C","C","C","C","C","C"),
clientnumber = c(1,1,1,1,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,1,
1,1,2,2,2,2,2,2,2,2,2,2,2),
first_year_data = c(1999L,1999L,1999L,1999L,
1999L,1999L,1999L,1999L,1999L,1999L,1999L,1999L,
1999L,1999L,1999L,1999L,1999L,1999L,1999L,1999L,
1999L,1999L,1999L,1999L,1999L,1999L,1999L,1999L,
1999L,1999L,1999L,1999L,1999L,1999L,1999L,1999L,
1999L,1999L,1999L,1999L,1999L,1999L,1999L,1999L,
1999L,1999L,2005L,2005L,2005L,2005L,2005L,2005L,
2005L,2005L,2005L,2005L,2005L,2005L,2005L,2005L),
last_year_data = c(2021L,2021L,2021L,2021L,
2021L,2021L,2021L,2021L,2021L,2021L,2021L,2021L,
2021L,2021L,2021L,2021L,2021L,2021L,2021L,2021L,
2021L,2021L,2021L,2021L,2021L,2021L,2021L,2021L,
2021L,2021L,2021L,2021L,2021L,2021L,2021L,2021L,
2021L,2021L,2021L,2021L,2021L,2021L,2021L,2021L,
2021L,2021L,2018L,2018L,2018L,2018L,2018L,2018L,
2018L,2018L,2018L,2018L,2018L,2018L,2018L,2018L),
last_year_ten = c(0L,0L,0L,1L,0L,0L,0L,
0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,
0L,1L,0L,0L,1L,0L,0L,0L,0L,0L,0L,0L,0L,0L,
0L,0L,0L,0L,0L,0L,0L,0L,0L,1L,1L,0L,0L,1L,
0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L),
first_year_tenure = c(1L,0L,0L,0L,1L,0L,0L,
0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,
0L,0L,1L,0L,0L,1L,0L,0L,0L,0L,0L,0L,0L,0L,
0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,1L,0L,0L,0L,
1L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L),
tenure = c(1L,0L,0L,0L,0L,1L,0L,
0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,
0L,0L,0L,1L,0L,0L,0L,1L,0L,0L,0L,0L,0L,0L,
0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,0L,
0L,1L,0L,0L,0L,0L,0L,0L,0L,0L,0L)
)

What am i doing wrong here? I used this code (changed up your code a bit, as for a beginner like myself this gives me more overview and clarity for myself)

df2 <- left_join(df,company_bounds_df) %>% group_by(Company, clientnumber)
df2 <- df2 %>% mutate(last_year_ten = as.integer(year == max(year) & (year != last_year_data | last_year_data==2021)))
df2 <- df2 %>% mutate(first_year_tenure = as.integer(year == min(year) & (year != first_year_data | first_year_data==1999)))
df2 <- df2 %>% group_by(Company, clientnumber) %>% mutate(tenure = as.integer(clientnumber == row_number() & (year != first_year_data | first_year_data==1999)))

I would guess having something with the place of the row_number(), but i am not sure how i could fix this.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.