How do I sort unique values alphabetically?

Hello, I am new to R programming and struggling to solve what appears to be a simple problem in exploring a dataset. If anyone browsing has the knowledge and could share it with me that would be fantastic for my learning.

The dataset I am using can be found here: French bakery daily sales | Kaggle

Here is my problem:
I am trying to sort unique values in the "article" column alphabetically. I tried using this:

bakery_sales[order(bakery_sales$article),]
This sorted my column alphabetically but I did not want to sort the entire dataset, only the unique values within the column.

So I added unique to the same code here but did not get the results I wanted:

bakery_sales[order(unique(bakery_sales$article)),]

How can I take the unique values in the "article" and sort them cleanly in alphabetical order?

Thank you

It isn't totally clear to me whether you want to pull the article column out of the dataframe and sort its unique values, or just sort the whole dataframe according to the article column. However, neither of these is very difficult:

To do just the individual vector, you want to pull the vector out of the dataframe, remove the duplicate values, then sort it.

For the dataframe, you just use dplyr::arrange on the column you want to sort by.

library(tidyverse)
df <- read_csv(<<PATH TO FILE>>)
# Vector with sorted, unique values
df %>% 
    pull(article) %>% 
    unique() %>% 
    sort()
#>   [1] "."                        "12 MACARON"              
#>   [3] "ARMORICAIN"               "ARTICLE 295"             
#>   [5] "BAGUETTE"                 "BAGUETTE APERO"          
#>   [7] "BAGUETTE GRAINE"          "BANETTE"                 
#>   [9] "BANETTINE"                "BOISSON 33CL"            
#>  [11] "BOTTEREAU"                "BOULE 200G"              
#>  [13] "BOULE 400G"               "BOULE POLKA"             
#>  [15] "BRIOCHE"                  "BRIOCHE DE NOEL"         
#>  [17] "BRIOCHETTE"               "BROWNIES"                
#>  [19] "BUCHE 4PERS"              "BUCHE 6PERS"             
#>  [21] "BUCHE 8PERS"              "CAFE OU EAU"             
#>  [23] "CAKE"                     "CAMPAGNE"                
#>  [25] "CARAMEL NOIX"             "CEREAL BAGUETTE"         
#>  [27] "CHAUSSON AUX POMMES"      "CHOCOLAT"                
#>  [29] "CHOU CHANTILLY"           "COMPLET"                 
#>  [31] "COOKIE"                   "COUPE"                   
#>  [33] "CROISSANT"                "CROISSANT AMANDES"       
#>  [35] "CRUMBLE"                  "CRUMBLECARAMEL OU PISTAE"
#>  [37] "DELICETROPICAL"           "DEMI BAGUETTE"           
#>  [39] "DEMI PAIN"                "DIVERS BOISSONS"         
#>  [41] "DIVERS BOULANGERIE"       "DIVERS CONFISERIE"       
#>  [43] "DIVERS PATISSERIE"        "DIVERS SANDWICHS"        
#>  [45] "DIVERS VIENNOISERIE"      "DOUCEUR D HIVER"         
#>  [47] "ECLAIR"                   "ECLAIR FRAISE PISTACHE"  
#>  [49] "ENTREMETS"                "FICELLE"                 
#>  [51] "FINANCIER"                "FINANCIER X5"            
#>  [53] "FLAN"                     "FLAN ABRICOT"            
#>  [55] "FONDANT CHOCOLAT"         "FORMULE PATE"            
#>  [57] "FORMULE PLAT PREPARE"     "FORMULE SANDWICH"        
#>  [59] "FRAISIER"                 "FRAMBOISIER"             
#>  [61] "GACHE"                    "GAL FRANGIPANE 4P"       
#>  [63] "GAL FRANGIPANE 6P"        "GAL POIRE CHOCO 4P"      
#>  [65] "GAL POIRE CHOCO 6P"       "GAL POMME 4P"            
#>  [67] "GAL POMME 6P"             "GALETTE 8 PERS"          
#>  [69] "GD FAR BRETON"            "GD KOUIGN AMANN"         
#>  [71] "GD NANTAIS"               "GD PLATEAU SALE"         
#>  [73] "GRAND FAR BRETON"         "GRANDE SUCETTE"          
#>  [75] "GUERANDAIS"               "KOUIGN AMANN"            
#>  [77] "MACARON"                  "MERINGUE"                
#>  [79] "MILLES FEUILLES"          "MOISSON"                 
#>  [81] "NANTAIS"                  "NID DE POULE"            
#>  [83] "NOIX JAPONAISE"           "PAILLE"                  
#>  [85] "PAIN"                     "PAIN AU CHOCOLAT"        
#>  [87] "PAIN AUX RAISINS"         "PAIN BANETTE"            
#>  [89] "PAIN CHOCO AMANDES"       "PAIN DE MIE"             
#>  [91] "PAIN GRAINES"             "PAIN NOIR"               
#>  [93] "PAIN S/SEL"               "PAIN SUISSE PEPITO"      
#>  [95] "PALET BRETON"             "PALMIER"                 
#>  [97] "PARIS BREST"              "PATES"                   
#>  [99] "PLAQUE TARTE 25P"         "PLAT"                    
#> [101] "PLAT 6.50E"               "PLAT 7.00"               
#> [103] "PLAT 7.60E"               "PLAT 8.30E"              
#> [105] "PLATPREPARE5,50"          "PLATPREPARE6,00"         
#> [107] "PLATPREPARE6,50"          "PLATPREPARE7,00"         
#> [109] "PT NANTAIS"               "PT PLATEAU SALE"         
#> [111] "QUIM BREAD"               "REDUCTION SUCREES 12"    
#> [113] "REDUCTION SUCREES 24"     "RELIGIEUSE"              
#> [115] "ROYAL"                    "ROYAL 4P"                
#> [117] "ROYAL 6P"                 "SABLE F  P"              
#> [119] "SACHET DE CROUTON"        "SACHET DE VIENNOISERIE"  
#> [121] "SACHET VIENNOISERIE"      "SAND JB"                 
#> [123] "SAND JB EMMENTAL"         "SANDWICH COMPLET"        
#> [125] "SAVARIN"                  "SEIGLE"                  
#> [127] "SPECIAL BREAD"            "SPECIAL BREAD KG"        
#> [129] "ST HONORE"                "SUCETTE"                 
#> [131] "TARTE FINE"               "TARTE FRAISE 4PER"       
#> [133] "TARTE FRAISE 6P"          "TARTE FRUITS 4P"         
#> [135] "TARTE FRUITS 6P"          "TARTELETTE"              
#> [137] "TARTELETTE CHOC"          "TARTELETTE COCKTAIL"     
#> [139] "TARTELETTE FRAISE"        "THE"                     
#> [141] "TRADITIONAL BAGUETTE"     "TRAITEUR"                
#> [143] "TRIANGLES"                "TROIS CHOCOLAT"          
#> [145] "TROPEZIENNE"              "TROPEZIENNE FRAMBOISE"   
#> [147] "TULIPE"                   "VIENNOISE"               
#> [149] "VIK BREAD"

# Whole dataframe sorted according to that column
df %>% 
    arrange(article)
#> # A tibble: 234,005 x 7
#>        X1 date       time   ticket_number article    Quantity unit_price
#>     <dbl> <date>     <time>         <dbl> <chr>         <dbl> <chr>     
#>  1  33726 2021-03-04 12:32         159219 .                 2 0,00 €    
#>  2  43541 2021-03-18 12:59         161853 .                 1 0,00 €    
#>  3  54650 2021-04-04 09:53         164878 .                 1 0,00 €    
#>  4  73667 2021-04-27 16:48         170079 .                 1 0,00 €    
#>  5 135091 2021-07-10 13:25         186662 .                 2 0,00 €    
#>  6 421218 2022-07-13 12:32         264526 12 MACARON        1 10,00 €   
#>  7 426004 2022-07-16 13:06         265779 12 MACARON        1 10,00 €   
#>  8 427234 2022-07-17 10:25         266077 12 MACARON        1 10,00 €   
#>  9 427800 2022-07-17 11:49         266223 12 MACARON        1 10,00 €   
#> 10 428066 2022-07-17 12:48         266300 12 MACARON        1 10,00 €   
#> # ... with 233,995 more rows
#> # i Use `print(n = ...)` to see more rows

Created on 2022-11-22 by the reprex package (v1.0.0)
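
For comparison, here is a minimal base R sketch of the same "remove duplicates, then sort" idea, which also shows why wrapping unique() inside order() did not work. The small data frame is a made-up stand-in for the real file, so its values are assumptions for illustration only.

```r
# Toy stand-in for the bakery data (values are made up for illustration)
bakery_sales <- data.frame(
  article  = c("PAIN", "BAGUETTE", "PAIN", "CROISSANT", "BAGUETTE"),
  Quantity = c(1, 2, 1, 3, 1)
)

# Remove duplicates first, then sort the resulting character vector
items <- sort(unique(bakery_sales$article))
items
#> [1] "BAGUETTE"  "CROISSANT" "PAIN"

# Why the original attempt misbehaves: order() returns positions within
# the short unique vector, and those positions are then used as row numbers
idx <- order(unique(bakery_sales$article))
bakery_sales[idx, ]  # picks rows 2, 3 and 1, which is not what was intended
```

With the real data, order(unique(...)) would return 149 positions, so the original call just picks out 149 of the first rows of the dataframe in an unrelated order.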

Thank you so much. I tried a different approach and got what I needed by creating a new frame with what I needed but your solution was much more succinct and cleaner.

I did this:
items <- unique(bakery_sales$article)
and simply scrolled through the values to accomplish what I needed in terms of exploration, but that led me to a new question.

Is it best practice to create a new dataframe, list or table when doing something like this, so that you can manipulate the data within it, or is it better to keep your environment condensed and simple, with as little code as possible?

That's a good question. And as with all good questions, the answer is: it depends. I can spell out my general thoughts, but I would need to know a bit more about what you are trying to do. Why are you looking at this data? What questions are you trying to answer? Or what problems are you trying to solve?

Personally, I like to stay with data.frames until the very end of an analysis. So if the question you are trying to answer is "how many unique article values are there?", then I would keep a data.frame until the last step, when you pull out the unique article values. But if this is an input to another step in your analysis, then I would keep it as a data.frame throughout.
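
As a rough sketch of what "staying with a data.frame until the last step" could look like, assuming dplyr is installed: the toy data and the names articles, n_articles and article_names are made up for illustration and are not from the real dataset.

```r
library(dplyr)

# Toy stand-in for the bakery data (values are made up for illustration)
bakery_sales <- data.frame(
  article  = c("PAIN", "BAGUETTE", "PAIN", "CROISSANT", "BAGUETTE"),
  Quantity = c(1, 2, 1, 3, 1)
)

# Still a data frame: one row per distinct article, sorted by name
articles <- bakery_sales %>%
  distinct(article) %>%
  arrange(article)

# Leave the data frame only at the very end, when a plain answer is needed
n_articles    <- nrow(articles)
article_names <- pull(articles, article)
```

Because articles is still a data frame, it can be joined back to other tables or passed to a later step, whereas article_names is only useful as a final answer.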

If you are trying to make the fastest possible solution, you want to keep as few objects in memory as possible, and the smaller they are the better (as a general rule). However, I would NOT recommend trying to optimize for speed or size until speed or size becomes a problem. If you know upfront that blazing speed is a requirement for your program, you should probably consider choosing a different language than R.
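
If you want to check whether size or speed is actually a problem before optimizing, base R can measure both. This is only a sketch on simulated data (the 10,000 rows below are randomly generated, not the real file), so the numbers it prints say nothing about the bakery dataset itself.

```r
# Simulated stand-in for the bakery data (randomly generated for illustration)
set.seed(1)
bakery_sales <- data.frame(
  article  = sample(c("PAIN", "BAGUETTE", "CROISSANT"), 10000, replace = TRUE),
  Quantity = sample(1:5, 10000, replace = TRUE)
)
items <- sort(unique(bakery_sales$article))

# Memory used by the full data frame and by the extracted vector
format(object.size(bakery_sales), units = "Kb")
format(object.size(items), units = "Kb")

# Time taken by the deduplicate-and-sort step
system.time(sort(unique(bakery_sales$article)))
```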