Data rearrangement not working

kuttan98 · September 24, 2022, 5:06pm

Hi,
I have a dataset as shown in the data frame "data". It is in horizontal format. I have tried converting to vertical format, but the data does not come in the desired format. The reprex is shown below.

library(tidyverse)

data<-tibble::tribble(
                      ~en_name, ~week_no,                                 ~schoolid,                ~teacherid,                    ~cal_name1,                             ~cal_name2,                         ~cal_name3,                          ~cal_name4,                        ~cal_name5,                         ~cal_name6,                               ~cal_name7,                           ~cal_name8, ~q_1_1, ~q_2_1, ~q_3_1, ~q_4_1, ~q_5_1, ~q_6_1, ~q_7_1, ~q_8_1,
        "Netravathi Kattimani",       1L, "GLPS CHABBI PLOT NEKAR NAGAR HALE HUBLI",     "Elizabeth Banikatti", "Abbas Ali Nanamattik-ZUQ207",                 "Mushruf Mulla-MZE987",         "Saiyed Navalagund-OGN003",           "Dongrima Walikar-FHG475",        "Umekusum Bellihal-KNT615",   "Fathima Zahara Bellihal-RLY981",            "Yashaswini Khanbargi-TPP643",              "Ananya Kopardi-TQQ828",     1L,     1L,     1L,     0L,     1L,     0L,     1L,     1L,
        "Netravathi Kattimani",       1L, "GLPS CHABBI PLOT NEKAR NAGAR HALE HUBLI", "Sureka nagappa belawadi",    "MD Amaan Nalambad-QYW348",                    "Imbrahim B-UIS798",       "MD. arfan Bavikatti-VYD916",              "Raheem Jakati-YOT663",       "Khasimsab Makandar-VGM989",           "Bharath Mankani-UHK261",                                       NA,                                   NA,     0L,     1L,     0L,     0L,     0L,     0L,     0L,     0L,
            "Gayathri Anikivi",       2L,                       "GHPS KURUBAGATTI.",   "Prabhavati.M.Gunavant",   "Akash.Y.Guddanavar-JXH849",              "Bharath.N.Poojar-BSL042",     "Dyavappa.H.Bettanavar-ZRH198",      "Manikant.R.mummigatti-MPG298",      "Aliyabegum.D.Nadhaf-SYN159",         "Mallige.M.Malledi-OHD802",            "Maktumbi.I.sanigatti-IZI558",                                   NA,     1L,     1L,     0L,     0L,     1L,     0L,     1L,     0L,
            "Gayathri Anikivi",       2L,                       "GHPS KURUBAGATTI.",       "Jyoti.Yavagalmath",   "Abhishek.M.Nargund-CTG024",                "Manoj.N.Katger-XHZ205",         "Pradeep.N.Koobnal-OVN977",             "Paizal M Mulla-TXDD6J",       "Manjunath  P pojar-S6S472",          "Gangavva malledi-H0AFJU",             "Nagaraj F Gundagovi-CZY5R9",             "Manikantha meti-64V58V",     1L,     1L,     1L,     0L,     1L,     1L,     0L,     1L,
            "Maruti Kattimani",       3L,          "GHPS SIDDHESHWAR NAGAR DHARWAD",   "Basavaraj I Managundi",   "Ankitha Kondapilli-RLO453",       "Yashaswini Ontettinavar-YII784",              "Kavya Konnur-HFC295", "Divya Fakeerappa Kariyavar-A5OJLR",      "Gaytri Suresh Morab-BBLKK1",                                 NA,                                       NA,                                   NA,     0L,     1L,     1L,     0L,     0L,     0L,     0L,     0L,
            "Maruti Kattimani",       3L,          "GHPS SIDDHESHWAR NAGAR DHARWAD",      "Jayalalita Badiger",       "Vinayak Gawadi-GNB587", "Amruta Shekappa ontyattinavar-6GGXCW", "Anita Hanumantappa Konnur-7OS697",  "Chaitanya Shivanad Konnur-9BW4FK", "Kalpana Pakkiresh Badagi-XERLLT", "Pavan Dharmo Haranshikari-EEN549", "Shankravva Somanna Haranshikari-ZGKXAN", "Shivaji Hanmanthappa Kulavi-TOM915",     1L,     0L,     1L,     0L,     0L,     0L,     0L,     1L
        )


Students1 <- data |> 
  select(schoolid, starts_with("cal_name"))
Students1 <- Students1 |> 
  pivot_longer(cols = -schoolid,names_to = "ID",values_to = "Name") |> 
  mutate(ID=str_remove(ID,"cal_name")) |> 
  filter(!is.na(Name))

Scores1 <- data |>
  select(schoolid,matches("q_\\d+_\\d+"))
Scores1 <- Scores1 |> 
  pivot_longer(cols = -schoolid,names_to = "ID",values_to = "Score") |> 
  mutate(Q=str_remove(ID,"_\\d+"),
         ID=str_extract(ID,"\\d+")) |> 
  filter(!is.na(Score)) |> 
  pivot_wider(names_from = "Q",values_from = "Score")
#> Warning: Values are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list` to suppress this warning.
#> * Use `values_fn = length` to identify where the duplicates arise
#> * Use `values_fn = {summary_fun}` to summarise duplicates
Final1 <- inner_join(Students1,Scores1,by=c("ID","schoolid"))

data_2<-tibble::tribble(
                 ~en_name, ~week,                                 ~schoolid,                         ~student, ~attendnace,
  "Nethravathi Kattimani",    1L, "GLPS CHABBI PLOT NEKAR NAGAR HALE HUBLI",    "Abbas Ali Nanamattik-ZUQ207",          1L,
  "Nethravathi Kattimani",    1L, "GLPS CHABBI PLOT NEKAR NAGAR HALE HUBLI",           "Mushruf Mulla-MZE987",          1L,
  "Nethravathi Kattimani",    1L, "GLPS CHABBI PLOT NEKAR NAGAR HALE HUBLI",       "Saiyed Navalagund-OGN003",          1L,
  "Nethravathi Kattimani",    1L, "GLPS CHABBI PLOT NEKAR NAGAR HALE HUBLI",        "Dongrima Walikar-FHG475",          0L,
  "Nethravathi Kattimani",    1L, "GLPS CHABBI PLOT NEKAR NAGAR HALE HUBLI",       "Umekusum Bellihal-KNT615",          1L,
  "Nethravathi Kattimani",    1L, "GLPS CHABBI PLOT NEKAR NAGAR HALE HUBLI", "Fathima Zahara Bellihal-RLY981",          0L,
  "Nethravathi Kattimani",    1L, "GLPS CHABBI PLOT NEKAR NAGAR HALE HUBLI",    "Yashaswini Khanbargi-TPP643",          1L,
  "Nethravathi Kattimani",    1L, "GLPS CHABBI PLOT NEKAR NAGAR HALE HUBLI",          "Ananya Kopardi-TQQ828",          1L
  )
Created on 2022-09-24 by the reprex package (v2.0.1)

FactOREO · September 24, 2022, 5:24pm

Hello,

what exactly is the desired result in this case?

Also, your column Q contains q_1 only as entries. Is this intentionally?

Maybe you can specify the expected output, just to be able to check the result.

Kind regards

kuttan98 · September 25, 2022, 4:28am

I have edited the original post with creating a new data frame "data_2". That is how I want the output to be. To give the context, the data is of student attendance. The names of students appear first. Then q_1_1,q_2_1,q_3_1, etc.. begins that shows the response (0 if absent and 1 for present). I hope it's clear now.

FactOREO · September 25, 2022, 8:52am

Iam not 100% confident that this is what you want, since there are different teacher ids per week (which are not present in your desired result). But without them you kind of loose information, hence I decided to keep them for now. The result matches your desired result in terms of students and attendance, but also has the teacherid as further information:

library(tidyverse)

Data<-tibble::tribble(
  ~en_name, ~week_no,                                 ~schoolid,                ~teacherid,                    ~cal_name1,                             ~cal_name2,                         ~cal_name3,                          ~cal_name4,                        ~cal_name5,                         ~cal_name6,                               ~cal_name7,                           ~cal_name8, ~q_1_1, ~q_2_1, ~q_3_1, ~q_4_1, ~q_5_1, ~q_6_1, ~q_7_1, ~q_8_1,
  "Netravathi Kattimani",       1L, "GLPS CHABBI PLOT NEKAR NAGAR HALE HUBLI",     "Elizabeth Banikatti", "Abbas Ali Nanamattik-ZUQ207",                 "Mushruf Mulla-MZE987",         "Saiyed Navalagund-OGN003",           "Dongrima Walikar-FHG475",        "Umekusum Bellihal-KNT615",   "Fathima Zahara Bellihal-RLY981",            "Yashaswini Khanbargi-TPP643",              "Ananya Kopardi-TQQ828",     1L,     1L,     1L,     0L,     1L,     0L,     1L,     1L,
  "Netravathi Kattimani",       1L, "GLPS CHABBI PLOT NEKAR NAGAR HALE HUBLI", "Sureka nagappa belawadi",    "MD Amaan Nalambad-QYW348",                    "Imbrahim B-UIS798",       "MD. arfan Bavikatti-VYD916",              "Raheem Jakati-YOT663",       "Khasimsab Makandar-VGM989",           "Bharath Mankani-UHK261",                                       NA,                                   NA,     0L,     1L,     0L,     0L,     0L,     0L,     0L,     0L,
  "Gayathri Anikivi",       2L,                       "GHPS KURUBAGATTI.",   "Prabhavati.M.Gunavant",   "Akash.Y.Guddanavar-JXH849",              "Bharath.N.Poojar-BSL042",     "Dyavappa.H.Bettanavar-ZRH198",      "Manikant.R.mummigatti-MPG298",      "Aliyabegum.D.Nadhaf-SYN159",         "Mallige.M.Malledi-OHD802",            "Maktumbi.I.sanigatti-IZI558",                                   NA,     1L,     1L,     0L,     0L,     1L,     0L,     1L,     0L,
  "Gayathri Anikivi",       2L,                       "GHPS KURUBAGATTI.",       "Jyoti.Yavagalmath",   "Abhishek.M.Nargund-CTG024",                "Manoj.N.Katger-XHZ205",         "Pradeep.N.Koobnal-OVN977",             "Paizal M Mulla-TXDD6J",       "Manjunath  P pojar-S6S472",          "Gangavva malledi-H0AFJU",             "Nagaraj F Gundagovi-CZY5R9",             "Manikantha meti-64V58V",     1L,     1L,     1L,     0L,     1L,     1L,     0L,     1L,
  "Maruti Kattimani",       3L,          "GHPS SIDDHESHWAR NAGAR DHARWAD",   "Basavaraj I Managundi",   "Ankitha Kondapilli-RLO453",       "Yashaswini Ontettinavar-YII784",              "Kavya Konnur-HFC295", "Divya Fakeerappa Kariyavar-A5OJLR",      "Gaytri Suresh Morab-BBLKK1",                                 NA,                                       NA,                                   NA,     0L,     1L,     1L,     0L,     0L,     0L,     0L,     0L,
  "Maruti Kattimani",       3L,          "GHPS SIDDHESHWAR NAGAR DHARWAD",      "Jayalalita Badiger",       "Vinayak Gawadi-GNB587", "Amruta Shekappa ontyattinavar-6GGXCW", "Anita Hanumantappa Konnur-7OS697",  "Chaitanya Shivanad Konnur-9BW4FK", "Kalpana Pakkiresh Badagi-XERLLT", "Pavan Dharmo Haranshikari-EEN549", "Shankravva Somanna Haranshikari-ZGKXAN", "Shivaji Hanmanthappa Kulavi-TOM915",     1L,     0L,     1L,     0L,     0L,     0L,     0L,     1L
)

Result <- Data |>
  # pivot student names into one column
  pivot_longer(
    cols = starts_with("cal"),
    names_to = 'id_student', values_to = 'student') |>
  # pivot scores into one column
  pivot_longer(
    cols = starts_with("q"),
    names_to = 'id_score', values_to = 'score'
  ) |>
  # edit the IDs and filter
  mutate(
    id_student = str_remove(id_student,"cal_name"),
    id_score = str_remove(sub("(_\\d+.*?)_\\d+", "\\1", id_score), 'q_')
  ) |>
  filter(
    id_student == id_score
  ) |>
  select(
    en_name, teacherid, 'week' = week_no, schoolid, student, 'attendance' = score
  )

Result
#> # A tibble: 48 × 6
#>    en_name              teacherid                week schoolid   student atten…¹
#>    <chr>                <chr>                   <int> <chr>      <chr>     <int>
#>  1 Netravathi Kattimani Elizabeth Banikatti         1 GLPS CHAB… Abbas …       1
#>  2 Netravathi Kattimani Elizabeth Banikatti         1 GLPS CHAB… Mushru…       1
#>  3 Netravathi Kattimani Elizabeth Banikatti         1 GLPS CHAB… Saiyed…       1
#>  4 Netravathi Kattimani Elizabeth Banikatti         1 GLPS CHAB… Dongri…       0
#>  5 Netravathi Kattimani Elizabeth Banikatti         1 GLPS CHAB… Umekus…       1
#>  6 Netravathi Kattimani Elizabeth Banikatti         1 GLPS CHAB… Fathim…       0
#>  7 Netravathi Kattimani Elizabeth Banikatti         1 GLPS CHAB… Yashas…       1
#>  8 Netravathi Kattimani Elizabeth Banikatti         1 GLPS CHAB… Ananya…       1
#>  9 Netravathi Kattimani Sureka nagappa belawadi     1 GLPS CHAB… MD Ama…       0
#> 10 Netravathi Kattimani Sureka nagappa belawadi     1 GLPS CHAB… Imbrah…       1
#> # … with 38 more rows, and abbreviated variable name ¹attendance

^{Created on 2022-09-25 with reprex v2.0.2}

You may want to check the result and see if it fits your needs.

Kind regards

kuttan98 · September 25, 2022, 9:24am

Thanks a lot. I have tried for my sample data and it was coming in the desired format. Also thanks for putting up the teacherid there as I required it. But when I tried it on my original bigger dataset, the following error comes up:
"Cannot allocate vector size of 104 mb."

FactOREO · September 25, 2022, 9:33am

This is an issue with your RAM. You can try to close other not needed programs and run rm(list = ls()) followed by gc() before running all the other stuff from your script you need to do the data rearranging. This will delete all objects in your global environment and gc() cleares the used space from the objects (since this is not done with rm()).

Maybe this already solves your issue regarding the RAM.

kuttan98 · September 25, 2022, 2:06pm

But when I run this, the object (my data) is getting deleted and I cannot proceed with data rearrangement.

FactOREO · September 25, 2022, 2:08pm

Thats why I wrote

So do this (and the closing of other applications on your computer) before you load your data and do everything you need to do before you want to reshape the data.

kuttan98 · September 25, 2022, 2:15pm

Ok. I will just show outline of the script:

Load packages

Import data

Then rearranging it. (this rearrangement is exactly the code that was suggested by you).

This is how the script is. If I place the code for removing things in global environment, then the data gets deleted.

FactOREO · September 25, 2022, 2:20pm

Well, then (if you restart into a fresh R session) and have closed all your other applications which could cause the issue, you should upgrade your RAM or split the actual data in multiple pieces. But 104MB isn't really large, so I don't get it why you have this issue.

kuttan98 · September 25, 2022, 2:22pm

Ok thanks a lot for the help.

system · October 2, 2022, 2:23pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.