How to divide rows into columns in a large data frame

JB1409 · November 30, 2021, 4:17pm

Hello all,

I've recently started using R and I've come to a problem to which I can't seem to find a solution.

From a list of 140 .txt files I've managed to make one data frame consisting of 3 columns and 37.200 rows. Every 310 rows should be shuffled into a separate column, since that's how I need to export my data into Excel for further analysis. Afterwards it would be great if I can filter out every 2nd column, except for column 2, but that's something I can either do by hand in Excel or figure out later. How do I tackle this issue of shuffling every 310 rows into separate columns?

This is probably a noob question and I'm very sorry if this issue is super generic, but I appreciate all of your help!

Yours,
JB

nirgrahamuk · November 30, 2021, 4:22pm

here is some 'small' example data

data.frame(array(1:12,c(4,3)))

lets say every 2 rows should be suffled into a seperate column ...
what would that mean exactly ?
can you type out what x4 column would contain after the shufflings ?

tip:
use triple backticks to set the forum post to a monospaced font suitable for sharing code

```
code goes here

JB1409 · November 30, 2021, 4:39pm

If your column is like this:

Then my preferred column could be like this:

X1 X2 X3 X1 X2 X3
1  5  9  3  7  11 
2  6  10 4  8  12

Preferably I would then be able to filter out every X1 and X2 column, except for the first X1 and X2 columns. Especially this reorganization of the data frame would be nice to get down!

I hope the formatting in my reply is correct, this is my first time posting on this forum ;).

Thanks for your help!

valentingar · November 30, 2021, 4:52pm

Hi there!

Here is a solution using only base-R. I've also included a example dataset that is more like your original description:


# example data with only 930 rows
df <- data.frame(X = rep(c("A","B","C"), each = 310),
                 Y = rep(c("D","E","F"), each = 310),
                 Z = rep(c("G","H","I"), each = 310))

# number of splits to do
n_cols <- nrow(df)/310

# subsetting each split seperately, store in a list
df_l <- 
lapply(1:n_cols,function(i){
  startrow <- (i-1) * 310 + 1
  endrow <- i*310
  
  df[startrow:endrow, c(1,2,3)]
})

# bind list elements together
df_new <-
do.call(cbind, df_l)

# find which columns to remove per your requirements
ncols <- ncol(df_new)
toremove <- seq(4, ncols, by = 2)

# remove columns
df_new <-
  df_new[-toremove]

Hope this helps!
Best

nirgrahamuk · November 30, 2021, 5:01pm

my attempt. I switched from my example data to valentingar's

dimensions <- 310
# example data with only 930 rows
start_df <- data.frame(X = rep(c("A","B","C"), each = dimensions),
                 Y = rep(c("D","E","F"), each = dimensions),
                 Z = rep(c("G","H","I"), each = dimensions))

start_df$csrow <- cumsum(1 == (seq_len(nrow(start_df)) %% dimensions))

(df_split <- split(start_df,start_df$csrow))

(df_recombined <- Reduce(x = df_split,f=cbind))

(df_newx <- subset(df_recombined,
       select =which(names(df_recombined) != "csrow")))

jrkrideau · November 30, 2021, 5:09pm

I am not clear here. Do you want each of the existing columns divided into 120 columns (37.200 / 310 = 120) thus giving 360 columns?

By the way, what analyses are you doing? It may be faster and easier to do most or all of your analysis is R than Excel. R has rather extensive text analysis capabilities.

Does this sort of thing do the first part of what you want?

dat1 <-   data.frame(xx = sample(letters[1:20], 1000, replace = TRUE), yy = sample(letters[20:1], 1000, 
                  replace = TRUE), zz = sample(letters[10:20], 1000, replace = TRUE)) 


vec1 <-   unlist(dat1)

dat2 <-   matrix(vec1, 250, 12)

dat3 <-  as.data.frame(dat2)

str(dat3)

JB1409 · December 1, 2021, 12:12pm

Thanks for all the help, I've tried what you suggested and it works up untill the last bit. If I try to remove the columns with the code you suggested it removes the rows (observations), not the columns (variables). I've tried to find solutions to this, but couldn't get it to work. I've solved it for now to remove them by hand in Excel. If you've got a solution for this last bit, it's greatly appreciated, that way I can learn how to do it in the future!

Thanks again!

valentingar · December 1, 2021, 12:49pm

Probably just a misplaced comma? Note that

df[-toremove,  ]

will remove the rows, whereas

df[, -toremove]

will remove the columns.

So in general, you access data by df[rows, columns]

Best!
Valentin

Btw: I agree wirth @jrkrideau that it is worth trying to do everything in R as well. It will potentially be loads easier, after you learned a bit of R (which is, granted, a barrier nonetheless). Good luck with that!

JB1409 · December 1, 2021, 12:55pm

Thanks for the suggestions! Doing the data analysis part in R is not doable though, since that will be done using a heavy simulation model written in another language, and will be performed by a colleague of mine.

In the future I will have to handle these big datasets more often though, so the experience I'm gathering with R right now will be very useful in the future! I'm following a Udemy course on R Studio as we speak, and probably will see if I can find additional courses there after I've finished this one! I'm starting to like coding more than I was expecting haha, though right now it's a bit of a short-term time-sink figuring all of this stuff out.

system · December 22, 2021, 12:56pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.