Script which pivots my data only writes the column name and first row of data in dataframe, why?

I was handed down a script which transforms and pivots some data. After fixing the code, I finally got it to pivot my data but it only seems to write the first two lines, the column name and the first row of data.

Here is my code:

# libraries
library(dplyr)
library(tidyr)
library(readxl)
library(openxlsx)
library(rlang)


# Use this for leavers survey

# change the file path "<path>/NEW - leavers data 2019 2020.xlsx"
inputFilepath <- "C:/Users/bangathj/OneDrive - Centrica/Documents/Projects/Automation/Excel R Transformation Project/Leavers Survey/SCH0132 - Exit Interview Responses 2022-07-15 04_30 BST.xlsx"
outputFilepath <- "C:/Users/bangathj/OneDrive - Centrica/Documents/Projects/Automation/Excel R Transformation Project/Leavers Survey/SCH0132 - Exit Interview Responses 2022-07-15 04_30 BST Reshaped.xlsx"

# C:/Users/bangathj/OneDrive - Centrica/Documents/Projects/Automation/Excel R Transformation Project/Leavers Survey/SCH0132 - Exit Interview Responses 2022-07-15 04_30 BST.xlsx
df = read_excel(inputFilepath, sheet="Sheet1", skip = 8)

# Step 1: Remove Duplicates
dfDuplicatesRemoved = df %>%
  group_by(!!sym("Employee ID"), !!sym("Question Body")) %>%
  filter(row_number() == n()) %>% # keep last row
  ungroup()

# Step 2: Pivot Data
dfTransformed = dfDuplicatesRemoved %>%
  select(!!sym("Employee ID"), !!sym("Question Body"), !!sym("Questionnaire Answer")) %>%
  pivot_wider(id_cols="Employee ID", names_from="Question Body", values_from="Questionnaire Answer") 

# Step 3: Obtain an Employee Profile by fetching all data except for question related data.
dfEmployeeProfile = df %>%
  dplyr::select(-c("Question Body", "Questionnaire Answer", "Question", "Questionnaire")) %>% 
  group_by("Employee ID") %>%
  filter(row_number() == n()) %>% # keep last row
  ungroup()


# Step 4: Join transformed data with Employee Profile to have a complete view
dfTranformedWithEmployeeProfile = dfTransformed %>% right_join(dfEmployeeProfile)

write.xlsx(dfTranformedWithEmployeeProfile, outputFilepath)
rm(list = ls())

sort(names(df))

I've made a small dummy version of the survey data I'm trying to transform:

Exit Interview Responses
Questionnaire Target Contexts
Questionnaire Targets
Supervisory Organization -----------------------
Include Subordinate Organizations Yes
Created On or After 01/06/2022
Created On or Before 30/06/2022
Questionnaire Answers
Effective Date Business Process Reason Employee ID Voluntary or Involuntary Manager Supervisory Organization Management Level Location Country FTE % Service in Years Worker Type Supervisory Org - 1 From The Top Supervisory Org - 2 From The Top Supervisory Org - 3 From The Top Supervisory Org - 4 From The Top Supervisory Org - 5 From The Top Supervisory Org - 6 From The Top Questionnaire Respondent Question Question Body Questionnaire Answer
22/06/2022 Terminate Employee > Voluntary > Resignation - Career 123456 Voluntary John Smith Commercial Level 8 UK United Kingdom 100 11.13 Employee Adam Smith Michael Smith Terry Smith Jane Smith Ken Doe Jane Doe Exit Interview Raj Patel Exit Interview - Career I knew how I could develop my career across the company Disagree
22/06/2022 Terminate Employee > Voluntary > Resignation - Career 789101 Voluntary John Smith Commercial Level 8 UK United Kingdom 100 11.13 Employee Adam Smith Michael Smith Terry Smith Jane Smith Ken Doe Jane Doe Exit Interview Raj Patel Exit Interview - Centrica Achieve I understand what the company wants to achieve over the next year Disagree
22/06/2022 Terminate Employee > Voluntary > Resignation - Career 121314 Voluntary John Smith Commercial Level 8 UK United Kingdom 100 11.13 Employee Adam Smith Michael Smith Terry Smith Jane Smith Ken Doe Jane Doe Exit Interview Raj Patel Exit Interview - Change Is there anything we could have done to change your mind? Nothing at the moment

What is causing my code to only read the first row of data?

my attempt at reproducing the data I need to transform:

``` r
tibble::tribble(
example_df <-  ~Effective.Date, ~Employee.ID,     ~Manager, ~Location,   ~Questionnaire, ~Respondent,                           ~Question,                                                       ~Question.Body,   ~Questionnaire.Answer,
     "22/06/2022",      123456L, "John Smith",      "UK", "Exit Interview", "Raj Patel",           "Exit Interview - Career",            "I knew how I could develop my career across the company",              "Disagree",
     "22/06/2022",      789101L, "John Smith",      "UK", "Exit Interview", "Raj Patel", "Exit Interview - Centrica Achieve", "I understand  what the company wants to achieve over the next year",              "Disagree",
     "22/06/2022",      121314L, "John Smith",      "UK", "Exit Interview", "Raj Patel",           "Exit Interview - Change",          "Is there anything we could have done to change your mind?", "Nothing at the moment"
  )
#> # A tibble: 3 × 9
#>   Effective.Date Emplo…¹ Manager Locat…² Quest…³ Respo…⁴ Quest…⁵ Quest…⁶ Quest…⁷
#>   <chr>            <int> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
#> 1 22/06/2022      123456 John S… UK      Exit I… Raj Pa… Exit I… I knew… Disagr…
#> 2 22/06/2022      789101 John S… UK      Exit I… Raj Pa… Exit I… I unde… Disagr…
#> 3 22/06/2022      121314 John S… UK      Exit I… Raj Pa… Exit I… Is the… Nothin…
#> # … with abbreviated variable names ¹​Employee.ID, ²​Location, ³​Questionnaire,
#> #   ⁴​Respondent, ⁵​Question, ⁶​Question.Body, ⁷​Questionnaire.Answer

Created on 2022-08-04 by the reprex package (v2.0.1)

It seems from your description that you are disattisfied with dfTransformed
In order for us to attempt to write code that would show us what it does, and allow us to try a different approach we would need access to the input data which it seems is dfDuplicatesRemoved

First of all, please investigate dfDuplicatesRemoved and determine if in principle it seems correct and would be what you want to be working on, if not then this post is a red herring and you need to back up a step. If it seems fine and appropriate, then please take the steps to provide it to the forum.


You can share your data in a forum friendly way by passing the data to share to the dput() function.
If your data is too large you can use standard methods to reduce it before sending to dput().
When you come to share the dput() text that represents your data, please be sure to format your post with triple backticks on the line before your code begins to format it appropriately.

```
( example_df <- structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6, 
5, 4.4, 4.9), Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9, 3.4, 
3.4, 2.9, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 
1.4, 1.5, 1.4, 1.5), Petal.Width = c(0.2, 0.2, 0.2, 0.2, 0.2, 
0.4, 0.3, 0.2, 0.2, 0.1), Species = structure(c(1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("setosa", "versicolor", "virginica"
), class = "factor")), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame")))
```

longer explanation :

1 Like

Hi, I've reproduced a shortened form of the data I'm trying to transform. As for your comment on whether I am satisfied or not with the dfDuplicatesRemoved . I am aware that the the reason why I need to transform my data is because the way the survey data is recorded, the same question is asked to every respondent as the data format is Vertical instead of Horizontal, so I'm trying to pivot all the questions as individual columns instead of repeated cells for every question asked to an individual.

I think you need to better explain your issue in as concrete terms as you can...
I ran your code (adjusting for the fact that the tribble code you shared distorted the names) and the result was a dfTransformed object that seemed fine given what you have said about it. It has 3 rows each for a different employee, and the columns represent each question and the contents of the columns have the answers of absence of answers. Have I misunderstood you, or do you perhaps have no problem with dfTransformed, but have some other problem

What you just said is exactly what I want for my code to accomplish but I seem to only produce an excel file with one row of data. Perhaps the format of the original excel file is causing this and not the code?

After looking at which row is posted, I have realized that it only posts the last employee. Meaning that perhaps the code is only looking or writing the last questions at the bottom of the dataframe. Hence why it worked for you as there were only one set of questions asked.

I recommend that you provide a reprex from 'df' as the start point, with at least as many records as would be reasonable to duplicate your issues

I have found the problem. I was looking at what each version of the df is showing by clicking them in the rstudio environment. dfTransformed needs to merge with dfEmployeeProfile to make dfTransformedWithEmployeeProfile. The issue I believe, lies with the line:

# Step 3: Obtain an Employee Profile by fetching all data except for question related data.
dfEmployeeProfile = df %>%
  dplyr::select(-c("Question Body", "Questionnaire Answer", "Question", "Questionnaire")) %>% 
  group_by("Employee ID") %>%
  filter(row_number() == n()) %>% # keep last row
  ungroup()

as when I looked at the table this produces, it only shows the last line of the original dataframe.
Is there something wrong with this code?

here is the reprex of dfEmployeeProfile:

tibble::tribble(
dfEmployeeProfile <-  ~Effective.Date, ~Business.Process.Reason,                                              ~Employee.ID, ~Voluntary.or.Involuntary,    ~Manager,    ~Supervisory.Organization,                                           ~Management.Level, ~Location,                              ~Country,         ~`FTE.%`, ~Service.in.Years, ~Worker.Type, ~`Supervisory.Org.-.1.From.The.Top`,    ~`Supervisory.Org.-.2.From.The.Top`,     ~`Supervisory.Org.-.3.From.The.Top`,                        ~`Supervisory.Org.-.4.From.The.Top`,       ~`Supervisory.Org.-.5.From.The.Top`,         ~`Supervisory.Org.-.6.From.The.Top`,                                                 ~Respondent,   ~`"Employee.ID"`,
               1L,             "30/06/2022", "Terminate Employee > Voluntary > Resignation - Personal",                   830957L, "Voluntary", "Vincent Gautheron (861145)", "Procurement- Group Functions (Vincent Gautheron (861145))", "Level 6", "UK - Berkshire - Windsor Millstream", "United Kingdom",              100L,         5.12,                          "Employee", "Centrica PLC (Scott Wheway (213219))",  "Centrica PLC (Chris O'Shea (854981))", "British Gas Services and Solutions (Jana Siber (905647))", "COO, British Gas (David House (137182))", "Group Procurement (Peter Sowrey (873613))", "Procurement- Group Functions (Vincent Gautheron (861145))", "Anne Rittmueller"
  )
#> # A tibble: 1 × 20
#>   Effective.Date Busin…¹ Emplo…² Volun…³ Manager Super…⁴ Manag…⁵ Locat…⁶ Country
#>            <int> <chr>   <chr>     <int> <chr>   <chr>   <chr>   <chr>   <chr>  
#> 1              1 30/06/… Termin…  830957 Volunt… Vincen… Procur… Level 6 UK - B…
#> # … with 11 more variables: `FTE.%` <chr>, Service.in.Years <int>,
#> #   Worker.Type <dbl>, `Supervisory.Org.-.1.From.The.Top` <chr>,
#> #   `Supervisory.Org.-.2.From.The.Top` <chr>,
#> #   `Supervisory.Org.-.3.From.The.Top` <chr>,
#> #   `Supervisory.Org.-.4.From.The.Top` <chr>,
#> #   `Supervisory.Org.-.5.From.The.Top` <chr>,
#> #   `Supervisory.Org.-.6.From.The.Top` <chr>, Respondent <chr>, …
#> # ℹ Use `colnames()` to see all variable names

Created on 2022-08-04 by the reprex package (v2.0.1)

it should have 63 rows of data, not the last row of the entire dataframe.

I have resolved the issue. I fixed the line:

# Step 3: Obtain an Employee Profile by fetching all data except for question related data.
dfEmployeeProfile = df %>%
  dplyr::select(-c("Question Body", "Questionnaire Answer", "Question", "Questionnaire")) %>% 
  group_by("Employee ID") %>%
  filter(row_number() == n()) %>% # keep last row
  ungroup()

by changing:
group_by("Employee ID") %>%
to:
group_by(!!sym("Employee ID")) %>%

added the (!!sym notation. I already had to correct many other lines of code. My assumption is that the previous user was using an outdated version of R and ran his transformations with this script for years. So when I used a modern version of Rstudio, it ran into some syntax problems. Either way, exploring my problems with you has been valuable in isolating the problem. And I will use this experience for future debugging endeavours. Thank you @nirgrahamuk , and apologies for the lack of proper formatting when I made this topic as I am new to all this.