# NEWBIE: Difficulty Cleaning Dataset Comprised of Many Excel Sheets

Hi all, I ended up combining 85 different excel sheets of data and turned them into a dataframe. I listed the code below on how I exactly did that. The difficulty here is that I do not know how to further process the data for analysis/plotting as each of the 85 rows now have many observations in the form of elements (e.g. 1 observation or row actually has 21 elements or data). I have listed sample row 1 under my code as an example. There end up being 7 different elements under each of the columns (the columns being: "Block", "X.Correct", "meanRT"). I have recently started to learn R, so any help would be appreciated!

``````install.packages("openxlsx")
library("openxlsx")

#Finding all excel files I need
list<-list.files(pattern="*individualtemp.xlsx",recursive = TRUE)
#Choosing specific excel sheet and data cells
#Turning list into df
my_df<-data.frame(t(sapply(raw_list,c)))

#Example row of data
my_df[2, ]

Block                                       X.Correct                                                                meanRT
[2] 1, 2, 3, 4, 5, 6, 7 0.875, 0.975, 1.000, 0.975, 0.975, 0.975, 0.975 1166.9429, 801.9487, 806.1000, 758.4615, 654.1282, 601.2051, 592.8974
``````

``````dput(head(my_df))
``````

Place a line with three back ticks just before and after the output.
```
```
Is your goal to get the data into 21 columns?

Hi, this is the output of dput(head(my_df))

``````structure(list(Block = list(c(1, 2, 3, 4, 5, 6, 7), c(1, 2, 3,
4, 5, 6, 7), c(1, 2, 3, 4, 5, 6, 7), c(1, 2, 3, 4, 5, 6, 7),
c(1, 2, 3, 4, 5, 6, 7), c(1, 2, 3, 4, 5, 6, 7)), X.Correct = list(
c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), c(0.875, 0.975, 1, 0.975, 0.975, 0.975, 0.975),
c(0.975, 1, 1, 0.925, 0.975, 1, 1), c(0.925, 0.975, 1, 1,
1, 0.975, 0.975), c(0.875, 0.925, 0.975, 1, 1, 0.925, 1),
c(1, 1, 1, 1, 1, 1, 1)), meanRT = list(c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), c(1166.94285714286,
801.948717948718, 806.1, 758.461538461538, 654.128205128205,
601.205128205128, 592.897435897436), c(877.641025641026, 866.575,
717.2, 699.945945945946, 751.435897435897, 580.175, 603.975),
c(1189.13513513514, 966.923076923077, 849.15, 797.875, 744.975,
742.769230769231, 801.512820512821), c(1122.8, 967.216216216216,
885.769230769231, 836.15, 739.825, 842.648648648649, 746.6
), c(918.725, 873.075, 885.5, 766.025, 740.825, 713.5, 788.525
))), row.names = c(NA, 6L), class = "data.frame")
``````

Thanks for taking the time to respond. I am not sure which format would be best for this dataset. I definitely want the two columns: "X.Correct" and "meanRT". This is the data I want to analyze and look at. I am not sure what to do with the "Block" column. Essentially, there are 6 blocks done for each participant. In this dataset, each row represents a participant, whereby there are 6 blocks of data (Blocks 1:6, X.Correct values 1:6, meanRT values 1:6). In the end, I want each of the 6 blocks to match the 6 data values from "X.Correct" where "X.Correct" is represented on the y axis and blocks are represented on the x axis. This set-up would be the same for "meanRT" and blocks 1:6. I guess this brings me to the question of - does this dataset need to be changed further or is there a way to tell R to read the data so it can plot this information? Most of the dataset examples I have seen only have one element of data per column and row. Is this why I am having so much difficulty with my dataset? Because of the fact I have multiple elements of data per row and column when there should only be one data point?

Here is an easier format for the data. I am not sure exactly what you want the graph to look like but this format should be a good place to start.

``````DF <- structure(list(Block = list(c(1, 2, 3, 4, 5, 6, 7), c(1, 2, 3,
4, 5, 6, 7), c(1, 2, 3, 4, 5, 6, 7), c(1, 2, 3, 4, 5, 6, 7),
c(1, 2, 3, 4, 5, 6, 7), c(1, 2, 3, 4, 5, 6, 7)),
X.Correct = list(c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), c(0.875, 0.975, 1, 0.975, 0.975, 0.975, 0.975),
c(0.975, 1, 1, 0.925, 0.975, 1, 1), c(0.925, 0.975, 1, 1,
1, 0.975, 0.975), c(0.875, 0.925, 0.975, 1, 1, 0.925, 1),
c(1, 1, 1, 1, 1, 1, 1)),
meanRT = list(c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
c(1166.94285714286, 801.948717948718, 806.1, 758.461538461538, 654.128205128205,601.205128205128, 592.897435897436), c(877.641025641026, 866.575,                                                                                                                                                                 717.2, 699.945945945946, 751.435897435897, 580.175, 603.975),
c(1189.13513513514, 966.923076923077, 849.15, 797.875, 744.975,
742.769230769231, 801.512820512821), c(1122.8, 967.216216216216,
885.769230769231, 836.15, 739.825, 842.648648648649, 746.6
), c(918.725, 873.075, 885.5, 766.025, 740.825, 713.5, 788.525
))), row.names = c(NA, 6L), class = "data.frame")
library(tidyr)
DF_unnest <- unnest(data = DF,cols = c("Block","X.Correct","meanRT"))
#> # A tibble: 14 x 3
#>    Block X.Correct meanRT
#>    <dbl>     <dbl>  <dbl>
#>  1     1    NA        NA
#>  2     2    NA        NA
#>  3     3    NA        NA
#>  4     4    NA        NA
#>  5     5    NA        NA
#>  6     6    NA        NA
#>  7     7    NA        NA
#>  8     1     0.875  1167.
#>  9     2     0.975   802.
#> 10     3     1       806.
#> 11     4     0.975   758.
#> 12     5     0.975   654.
#> 13     6     0.975   601.
#> 14     7     0.975   593.
``````

Created on 2021-09-01 by the reprex package (v0.3.0)

Hi, thanks a bunch for your input! Would you please be able to run me through your thought process on why you set the data this way? Or rather, why the original data format was problematic? I have some idea, but I feel that your perspective would help me understand more on this topic. Thanks again.

The usual layout for data is to have single values in each element of a data frame. The usual plotting functions would expect this. It would be difficult to plot the data in the original format. I am sure code could be written to do it, but that is much harder than reshaping the data.
It is also usually easier to have the data in a "long" format so that multiple values for one individual are labeled with one column and the values are in another column, rather than having multiple columns of values. The Block column in your data serves the purpose of the label and the other two columns hold the values.
You might want to add a column to the original data that labels the individual, since that is implicitly done with the row and that information is lost in my version of the data. Whether that is necessary depends on what you want to do with the data.

Thanks so much! This was a big help. Your explanation was easy to understand and I feel that I have a better idea of how to shape datasets. Thanks again!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.