Hi @AlexisW - thank you so much for taking a stab at this! It took me a while to work through your code and figure out what you were doing, but I really like that your solution doesn't require transliterating the Arabic characters back and forth. My actual dataframe only has Arabic strings, so I am reproducing a simplified version here.
Regarding your first solution, I played around with it a bit. If you take out the rev() call in the anonymous function passed to lapply() at Step 3, you actually get the letters parsed correctly from right to left (woo hoo!):
# Create sample dataframe
root_letters <- c("أ", "آب", "أباجور", "دار")
entry <- c(1:4)
dict <- data.frame(entry, root_letters)
dict # display dataframe
#> entry root_letters
#> 1 1 أ
#> 2 2 آب
#> 3 3 أباجور
#> 4 4 دار
# Step 1: split the character strings of the root_letters column into substrings using strsplit() in base R
ind_chars <- strsplit(  # create a list of vectors of split character strings
  dict$root_letters,    # from this column
  split = "")           # split at every character
# Step 2: determine length of substrings to find length of longest substring
# sapply(): applies a function (built-in or user-defined) to each element of the input (list, vector or data frame) and returns a vector or a matrix
max_long <- max(sapply(  # find the maximum value from the vector created by...
  ind_chars,             # taking the list of split character strings...
  length))               # and finding the length of each element (built-in function)
# Step 3 (original): ensure all substrings are the same length by filling in the empty elements with NA
# lapply(): applies a function (built-in or user-defined) to each element of the input (list, vector or data frame) and returns a list
filled_chars <- lapply(ind_chars,  # apply a function to every element of the input
  function(x)
    rev(  # reverse the elements of the padded vector
      c(rep(NA, max_long - length(x)), x)))  # pad with NA so that every vector has length max_long
# Step 4 (original) : turn the filled in list into a matrix
dict1 <- do.call(rbind, filled_chars)
dict1 # print output
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] "أ" NA NA NA NA NA
#> [2,] "ب" "آ" NA NA NA NA
#> [3,] "ر" "و" "ج" "ا" "ب" "أ"
#> [4,] "ر" "ا" "د" NA NA NA
# Step 3 (without rev): ensure all substrings are the same length by filling in the empty elements with NA
filled_chars2 <- lapply(ind_chars, function(x) c(rep(NA, max_long - length(x)), x))
# Step 4 (without rev) : turn the filled in list into a matrix
dict2 <- do.call(rbind, filled_chars2)
dict2 # print output
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] NA NA NA NA NA "أ"
#> [2,] NA NA NA NA "آ" "ب"
#> [3,] "أ" "ب" "ا" "ج" "و" "ر"
#> [4,] NA NA NA "د" "ا" "ر"
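For comparison, here is a minimal sketch of padding on the right instead of the left, so that the first stored letter of every word ends up in column 1 and the NAs trail after it. It is only an illustration, not part of AlexisW's solution: the names filled_right and dict_right are made up for this example, and the sample words are written with Unicode escapes so the snippet does not depend on the file encoding.

```r
# Same four sample words as above, written with Unicode escapes
root_letters <- c("\u0623", "\u0622\u0628",
                  "\u0623\u0628\u0627\u062c\u0648\u0631", "\u062f\u0627\u0631")

# Split every string into single characters
ind_chars <- strsplit(root_letters, split = "")

# lengths() returns the length of each list element
max_long <- max(lengths(ind_chars))

# Extending a vector with `length<-` fills the new slots with NA,
# so the padding goes after the letters rather than before them
filled_right <- lapply(ind_chars, function(x) {
  length(x) <- max_long
  x
})

# Bind the padded vectors into a character matrix (hypothetical name)
dict_right <- do.call(rbind, filled_right)
dict_right
```

Whether left or right padding is preferable depends on whether the words should be compared by their last or their first letter.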
As for your second solution, I'm somewhat puzzled by the output. When I click to view the matrix in the source pane, the letters are (mostly) appropriately parsed: R recognizes that the right-most letter is the beginning of the word, although the letters are still split from left to right (the direction of the split can be reversed by simply reordering the columns, so this isn't a problem).
Strangely, however, when I print the output the parsing order changes: R incorrectly interprets the left-most letter as the beginning of the word. I'm not sure how to illustrate my source pane in a reprexable way, so I will just describe it instead:
# Separate strings - AlexisW's 2nd way ----
# Load packages
library(tidyverse)
dict3 <- stringr::str_split_fixed(dict$root_letters, "", max_long)
# When I click on dict3 in the environment pane to view it in the source pane, the parsing is correct
dict3 # when I print the output, the parsing order has been reversed (strings are matched by last letter, not by first)
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] "أ" "" "" "" "" ""
#> [2,] "آ" "ب" "" "" "" ""
#> [3,] "أ" "ب" "ا" "ج" "و" "ر"
#> [4,] "د" "ا" "ر" "" "" ""
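One way to check which order is actually stored, independently of how the console or the viewer draws right-to-left text, might be to look at the Unicode code points of the split characters. This is only a sketch, and it assumes the difference between the two views is a display effect rather than a difference in the data. The names word, chars and code_points are introduced for this example.

```r
# "دار" written with Unicode escapes: U+062F (dal), U+0627 (alif), U+0631 (ra)
word <- "\u062f\u0627\u0631"

# Split into single characters, as in Step 1
chars <- strsplit(word, split = "")[[1]]

# utf8ToInt() converts a single character to its integer code point
code_points <- vapply(chars, utf8ToInt, integer(1), USE.NAMES = FALSE)

# Show the code points in storage order
sprintf("U+%04X", code_points)
#> [1] "U+062F" "U+0627" "U+0631"
```

Here the first element is dal, the first letter of the word, so the split follows storage order whichever side of the screen that letter is drawn on.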