Script which pivots my data only writes the column name and first row of data in dataframe, why?

Zcythor · August 4, 2022, 8:39am

I was handed down a script which transforms and pivots some data. After fixing the code, I finally got it to pivot my data but it only seems to write the first two lines, the column name and the first row of data.

Here is my code:

# libraries
library(dplyr)
library(tidyr)
library(readxl)
library(openxlsx)
library(rlang)


# Use this for leavers survey

# change the file path "<path>/NEW - leavers data 2019 2020.xlsx"
inputFilepath <- "C:/Users/bangathj/OneDrive - Centrica/Documents/Projects/Automation/Excel R Transformation Project/Leavers Survey/SCH0132 - Exit Interview Responses 2022-07-15 04_30 BST.xlsx"
outputFilepath <- "C:/Users/bangathj/OneDrive - Centrica/Documents/Projects/Automation/Excel R Transformation Project/Leavers Survey/SCH0132 - Exit Interview Responses 2022-07-15 04_30 BST Reshaped.xlsx"

# C:/Users/bangathj/OneDrive - Centrica/Documents/Projects/Automation/Excel R Transformation Project/Leavers Survey/SCH0132 - Exit Interview Responses 2022-07-15 04_30 BST.xlsx
df = read_excel(inputFilepath, sheet="Sheet1", skip = 8)

# Step 1: Remove Duplicates
dfDuplicatesRemoved = df %>%
  group_by(!!sym("Employee ID"), !!sym("Question Body")) %>%
  filter(row_number() == n()) %>% # keep last row
  ungroup()

# Step 2: Pivot Data
dfTransformed = dfDuplicatesRemoved %>%
  select(!!sym("Employee ID"), !!sym("Question Body"), !!sym("Questionnaire Answer")) %>%
  pivot_wider(id_cols="Employee ID", names_from="Question Body", values_from="Questionnaire Answer") 

# Step 3: Obtain an Employee Profile by fetching all data except for question related data.
dfEmployeeProfile = df %>%
  dplyr::select(-c("Question Body", "Questionnaire Answer", "Question", "Questionnaire")) %>% 
  group_by("Employee ID") %>%
  filter(row_number() == n()) %>% # keep last row
  ungroup()


# Step 4: Join transformed data with Employee Profile to have a complete view
dfTranformedWithEmployeeProfile = dfTransformed %>% right_join(dfEmployeeProfile)

write.xlsx(dfTranformedWithEmployeeProfile, outputFilepath)
rm(list = ls())

sort(names(df))

I've made a small dummy version of the survey data I'm trying to transform:

Exit Interview Responses
Questionnaire Target Contexts
Questionnaire Targets
Supervisory Organization	-----------------------
Include Subordinate Organizations	Yes
Created On or After	01/06/2022
Created On or Before	30/06/2022
																				Questionnaire Answers
Effective Date	Business Process Reason	Employee ID	Voluntary or Involuntary	Manager	Supervisory Organization	Management Level	Location	Country	FTE %	Service in Years	Worker Type	Supervisory Org - 1 From The Top	Supervisory Org - 2 From The Top	Supervisory Org - 3 From The Top	Supervisory Org - 4 From The Top	Supervisory Org - 5 From The Top	Supervisory Org - 6 From The Top	Questionnaire	Respondent	Question	Question Body	Questionnaire Answer
22/06/2022	Terminate Employee > Voluntary > Resignation - Career	123456	Voluntary	John Smith	Commercial	Level 8	UK	United Kingdom	100	11.13	Employee	Adam Smith	Michael Smith	Terry Smith	Jane Smith	Ken Doe	Jane Doe	Exit Interview	Raj Patel	Exit Interview - Career	I knew how I could develop my career across the company	Disagree
22/06/2022	Terminate Employee > Voluntary > Resignation - Career	789101	Voluntary	John Smith	Commercial	Level 8	UK	United Kingdom	100	11.13	Employee	Adam Smith	Michael Smith	Terry Smith	Jane Smith	Ken Doe	Jane Doe	Exit Interview	Raj Patel	Exit Interview - Centrica Achieve	I understand what the company wants to achieve over the next year	Disagree
22/06/2022	Terminate Employee > Voluntary > Resignation - Career	121314	Voluntary	John Smith	Commercial	Level 8	UK	United Kingdom	100	11.13	Employee	Adam Smith	Michael Smith	Terry Smith	Jane Smith	Ken Doe	Jane Doe	Exit Interview	Raj Patel	Exit Interview - Change	Is there anything we could have done to change your mind?	Nothing at the moment

What is causing my code to only read the first row of data?

my attempt at reproducing the data I need to transform:

``` r
tibble::tribble(
example_df <-  ~Effective.Date, ~Employee.ID,     ~Manager, ~Location,   ~Questionnaire, ~Respondent,                           ~Question,                                                       ~Question.Body,   ~Questionnaire.Answer,
     "22/06/2022",      123456L, "John Smith",      "UK", "Exit Interview", "Raj Patel",           "Exit Interview - Career",            "I knew how I could develop my career across the company",              "Disagree",
     "22/06/2022",      789101L, "John Smith",      "UK", "Exit Interview", "Raj Patel", "Exit Interview - Centrica Achieve", "I understand  what the company wants to achieve over the next year",              "Disagree",
     "22/06/2022",      121314L, "John Smith",      "UK", "Exit Interview", "Raj Patel",           "Exit Interview - Change",          "Is there anything we could have done to change your mind?", "Nothing at the moment"
  )
#> # A tibble: 3 × 9
#>   Effective.Date Emplo…¹ Manager Locat…² Quest…³ Respo…⁴ Quest…⁵ Quest…⁶ Quest…⁷
#>   <chr>            <int> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
#> 1 22/06/2022      123456 John S… UK      Exit I… Raj Pa… Exit I… I knew… Disagr…
#> 2 22/06/2022      789101 John S… UK      Exit I… Raj Pa… Exit I… I unde… Disagr…
#> 3 22/06/2022      121314 John S… UK      Exit I… Raj Pa… Exit I… Is the… Nothin…
#> # … with abbreviated variable names ¹Employee.ID, ²Location, ³Questionnaire,
#> #   ⁴Respondent, ⁵Question, ⁶Question.Body, ⁷Questionnaire.Answer

^{Created on 2022-08-04 by the reprex package (v2.0.1)}

nirgrahamuk · August 4, 2022, 10:18am

It seems from your description that you are disattisfied with dfTransformed
In order for us to attempt to write code that would show us what it does, and allow us to try a different approach we would need access to the input data which it seems is dfDuplicatesRemoved

First of all, please investigate dfDuplicatesRemoved and determine if in principle it seems correct and would be what you want to be working on, if not then this post is a red herring and you need to back up a step. If it seems fine and appropriate, then please take the steps to provide it to the forum.

You can share your data in a forum friendly way by passing the data to share to the dput() function.
If your data is too large you can use standard methods to reduce it before sending to dput().
When you come to share the dput() text that represents your data, please be sure to format your post with triple backticks on the line before your code begins to format it appropriately.

```
( example_df <- structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6, 
5, 4.4, 4.9), Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9, 3.4, 
3.4, 2.9, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 
1.4, 1.5, 1.4, 1.5), Petal.Width = c(0.2, 0.2, 0.2, 0.2, 0.2, 
0.4, 0.3, 0.2, 0.2, 0.1), Species = structure(c(1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("setosa", "versicolor", "virginica"
), class = "factor")), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame")))
```

longer explanation :

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

Zcythor · August 4, 2022, 11:07am

Hi, I've reproduced a shortened form of the data I'm trying to transform. As for your comment on whether I am satisfied or not with the dfDuplicatesRemoved . I am aware that the the reason why I need to transform my data is because the way the survey data is recorded, the same question is asked to every respondent as the data format is Vertical instead of Horizontal, so I'm trying to pivot all the questions as individual columns instead of repeated cells for every question asked to an individual.

nirgrahamuk · August 4, 2022, 12:08pm

I think you need to better explain your issue in as concrete terms as you can...
I ran your code (adjusting for the fact that the tribble code you shared distorted the names) and the result was a dfTransformed object that seemed fine given what you have said about it. It has 3 rows each for a different employee, and the columns represent each question and the contents of the columns have the answers of absence of answers. Have I misunderstood you, or do you perhaps have no problem with dfTransformed, but have some other problem

Zcythor · August 4, 2022, 1:08pm

What you just said is exactly what I want for my code to accomplish but I seem to only produce an excel file with one row of data. Perhaps the format of the original excel file is causing this and not the code?

Zcythor · August 4, 2022, 1:41pm

After looking at which row is posted, I have realized that it only posts the last employee. Meaning that perhaps the code is only looking or writing the last questions at the bottom of the dataframe. Hence why it worked for you as there were only one set of questions asked.

nirgrahamuk · August 4, 2022, 2:06pm

I recommend that you provide a reprex from 'df' as the start point, with at least as many records as would be reasonable to duplicate your issues

Zcythor · August 4, 2022, 3:12pm

I have found the problem. I was looking at what each version of the df is showing by clicking them in the rstudio environment. dfTransformed needs to merge with dfEmployeeProfile to make dfTransformedWithEmployeeProfile. The issue I believe, lies with the line:

# Step 3: Obtain an Employee Profile by fetching all data except for question related data.
dfEmployeeProfile = df %>%
  dplyr::select(-c("Question Body", "Questionnaire Answer", "Question", "Questionnaire")) %>% 
  group_by("Employee ID") %>%
  filter(row_number() == n()) %>% # keep last row
  ungroup()

as when I looked at the table this produces, it only shows the last line of the original dataframe.
Is there something wrong with this code?

here is the reprex of dfEmployeeProfile:

tibble::tribble(
dfEmployeeProfile <-  ~Effective.Date, ~Business.Process.Reason,                                              ~Employee.ID, ~Voluntary.or.Involuntary,    ~Manager,    ~Supervisory.Organization,                                           ~Management.Level, ~Location,                              ~Country,         ~`FTE.%`, ~Service.in.Years, ~Worker.Type, ~`Supervisory.Org.-.1.From.The.Top`,    ~`Supervisory.Org.-.2.From.The.Top`,     ~`Supervisory.Org.-.3.From.The.Top`,                        ~`Supervisory.Org.-.4.From.The.Top`,       ~`Supervisory.Org.-.5.From.The.Top`,         ~`Supervisory.Org.-.6.From.The.Top`,                                                 ~Respondent,   ~`"Employee.ID"`,
               1L,             "30/06/2022", "Terminate Employee > Voluntary > Resignation - Personal",                   830957L, "Voluntary", "Vincent Gautheron (861145)", "Procurement- Group Functions (Vincent Gautheron (861145))", "Level 6", "UK - Berkshire - Windsor Millstream", "United Kingdom",              100L,         5.12,                          "Employee", "Centrica PLC (Scott Wheway (213219))",  "Centrica PLC (Chris O'Shea (854981))", "British Gas Services and Solutions (Jana Siber (905647))", "COO, British Gas (David House (137182))", "Group Procurement (Peter Sowrey (873613))", "Procurement- Group Functions (Vincent Gautheron (861145))", "Anne Rittmueller"
  )
#> # A tibble: 1 × 20
#>   Effective.Date Busin…¹ Emplo…² Volun…³ Manager Super…⁴ Manag…⁵ Locat…⁶ Country
#>            <int> <chr>   <chr>     <int> <chr>   <chr>   <chr>   <chr>   <chr>  
#> 1              1 30/06/… Termin…  830957 Volunt… Vincen… Procur… Level 6 UK - B…
#> # … with 11 more variables: `FTE.%` <chr>, Service.in.Years <int>,
#> #   Worker.Type <dbl>, `Supervisory.Org.-.1.From.The.Top` <chr>,
#> #   `Supervisory.Org.-.2.From.The.Top` <chr>,
#> #   `Supervisory.Org.-.3.From.The.Top` <chr>,
#> #   `Supervisory.Org.-.4.From.The.Top` <chr>,
#> #   `Supervisory.Org.-.5.From.The.Top` <chr>,
#> #   `Supervisory.Org.-.6.From.The.Top` <chr>, Respondent <chr>, …
#> # ℹ Use `colnames()` to see all variable names

^{Created on 2022-08-04 by the reprex package (v2.0.1)}

it should have 63 rows of data, not the last row of the entire dataframe.

Zcythor · August 4, 2022, 4:00pm

I have resolved the issue. I fixed the line:

# Step 3: Obtain an Employee Profile by fetching all data except for question related data.
dfEmployeeProfile = df %>%
  dplyr::select(-c("Question Body", "Questionnaire Answer", "Question", "Questionnaire")) %>% 
  group_by("Employee ID") %>%
  filter(row_number() == n()) %>% # keep last row
  ungroup()

by changing:
group_by("Employee ID") %>%
to:
group_by(!!sym("Employee ID")) %>%

added the (!!sym notation. I already had to correct many other lines of code. My assumption is that the previous user was using an outdated version of R and ran his transformations with this script for years. So when I used a modern version of Rstudio, it ran into some syntax problems. Either way, exploring my problems with you has been valuable in isolating the problem. And I will use this experience for future debugging endeavours. Thank you @nirgrahamuk , and apologies for the lack of proper formatting when I made this topic as I am new to all this.

system · August 11, 2022, 4:00pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.