Problem with the "bnlearn" package and character columns when reading data from Excel (readxl)

Hi dear R community,
When I read a CSV file and apply the hill-climbing algorithm, there is no problem.
When I read the same data with readxl, I get this:

===== The check gives this:
str(Donnees)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 54 obs. of 7 variables:
$ thick : chr "petit" "moyen" "gros" "petit" ...
$ Bshaft: chr "fin" "fin" "fin" "medium" ...
$ length: chr "court" "court" "court" "court" ...
....
....
res_hc <- hc(Donnees.Learn,whitelist=NULL,debug=TRUE, score = "aic")
====>>> Error in data.type(x) : variable thick is not supported in bnlearn (type: character).
When I read the CSV file, the column type is factor (instead of "chr"). I have not found any way to convert the data frame after reading the file from Excel.
Thanks a lot for any pointers.

Any way you can supply a limited reproducible example, reprex with small example files?

It sounds like you have two questions: one about loading data from an Excel file, and one about understanding the error message bnlearn gives you.

With the read.csv call, are you familiar with the stringsAsFactors setting? (some fun background). The readr package's read_csv reads string variables as character types.
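To make the stringsAsFactors point concrete, here is a minimal base R sketch. The CSV text and its values are made up for illustration, not taken from the poster's file. The argument is set explicitly in both calls because its default depends on the R version (it was TRUE at the time of this thread and is FALSE from R 4.0.0 on).

```r
# Toy CSV text standing in for the poster's file (hypothetical values).
csv_text <- "thick,Bshaft\npetit,fin\nmoyen,fin\ngros,medium"

# Strings are kept as character vectors.
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Strings are converted to factors on import.
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Compare the column classes of the two results.
sapply(as_chr, class)
sapply(as_fct, class)
```

Only the second data frame has the factor columns that bnlearn expects for discrete data.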

For the bnlearn error message. I'm not too familiar with this package, but I think variables must be either numeric, factors or ordered factors (bnlearn-manual), and depending on what you're doing you might be further limited.

I have a feeling the data-loading issue will be a quick solve here once you give a reprex. But for the bnlearn question you might change the category to #ml, machine learning and modeling.


Thanks for these pointers. I will look into some of them.
My goal is to convert "chr" vectors to factor vectors for bnlearn after reading the data from Excel (rather than from CSV, where the columns are already factors).
In other words, for data read from an Excel file, the question would be:
==> Why does stringsAsFactors not default to TRUE?
Thanks also for "stringsasfactors-an-unauthorized-biography/", which gave me the idea to look into the function "as.factor" or something similar.
As a reprex, here is the simplified set of R commands:

Script:

library(bnlearn)   
library(lattice)  
library(gRain) 
library(readxl)
setwd("My_Dir")   # working dir 
Donnees <- read_excel("My_data.xlsx", sheet = "RB_FLAP") # 
str(Donnees)

Gives:

#	Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	54 obs. of  7 variables:
# 	$ thick : chr  "petit" "moyen" "gros" "petit" ...
# 	$ Bshaft: chr  "fin" "fin" "fin" "medium" ...
# 	$ length: chr  "court" "court" "court" "court" ...
# 	$ Ribs  : chr  "non" "non" "non" "non" ...
# 	$ Strain: chr  "].2-.4]" "[0-.2]" "[0-.2]" "].2-.4]" ...
# 	$ Utotal: chr  "].5-1]" "[0.-.5]" "[0.-.5]" "].5-1]" ...
#	$ MRFY  : chr  "]1.3-2]" "]1.3-2]" "]1.3-2]" "]1.3-2]" ...

Script:

ratio_LV<-0.8 # Learning  80% 
nbligne=nrow(Donnees)
listal<-sort(sample(nbligne,round(nbligne*ratio_LV)))  
Donnees.Learn <- Donnees[listal, ]  # data.frame Learning
Donnees.Valid<-Donnees[-listal, ]  # data.frame validation
#### Problem comes with "bnlearn & "chr" type in "Donnees"
res_hc <- hc(Donnees.Learn,whitelist=NULL,debug=TRUE, score = "aic") 

results:

Error in data.type(x) : 
  variable thick is not supported in bnlearn (type: character).

So I am now looking for a function that converts the data type "chr" to "factor".
Thanks again for the time spent on this.

On just this, check out dplyr::mutate_if. As an example

library(dplyr)
df <- dplyr::tibble(
  c1 = 1:5,
  c2 = LETTERS[1:5]
  )
df
#> # A tibble: 5 x 2
#>      c1 c2   
#>   <int> <chr>
#> 1     1 A    
#> 2     2 B    
#> 3     3 C    
#> 4     4 D    
#> 5     5 E

df %>% 
  mutate_if(
    is.character, as.factor
  )
#> # A tibble: 5 x 2
#>      c1 c2   
#>   <int> <fct>
#> 1     1 A    
#> 2     2 B    
#> 3     3 C    
#> 4     4 D    
#> 5     5 E

Created on 2018-05-03 by the reprex package (v0.2.0).


This sounds quite versatile, but it is yet another package.
I am surprised not to find a simple function that adapts the data as if it came from a CSV file.
Is there a simpler conversion function, something like Donnees2 <- as.factor(Donnees)?
I will give this new package a try. Thanks.
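For readers who want to stay in base R, here is a minimal sketch of a column-by-column conversion. as.factor() and factor() are meant for a single vector, so they are applied to each character column rather than to the whole data frame. The data frame Donnees_demo and its values are hypothetical stand-ins for the poster's data.

```r
# Hypothetical stand-in for the data read from Excel.
Donnees_demo <- data.frame(
  thick = c("petit", "moyen", "gros"),
  n     = 1:3,
  stringsAsFactors = FALSE
)

# Find the character columns, then convert only those to factors.
chr_cols <- vapply(Donnees_demo, is.character, logical(1))
Donnees_demo[chr_cols] <- lapply(Donnees_demo[chr_cols], factor)

str(Donnees_demo)
```

Non-character columns (here n) are left untouched, which mirrors what mutate_if(is.character, as.factor) does.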

Donnees <- read_excel("My_data.xlsx", sheet = "RB_FLAP") %>% 
  mutate_if(
    is.character, as.factor
  )

Strikes me as fairly simple and straightforward.

Check out the ?read_excel help docs. Note the col_types argument.
Oh, and of course they created this package-vignette on Cell and Column Types.


Thanks again for this step-by-step conversion procedure.
After installing two new packages (yaml and dplyr), I ran the command:
Donnees <- read_excel("Excel_vers_R/RB_FLAP_to_R complet.xlsx", sheet = "RB_FLAP_to_categories") %>% mutate_if(
  is.character, as.factor)
And I got this:
Error in mutate_if(., is.character, as.factor) :
  could not find function "mutate_if"

Nonetheless, I found the help page for this function and I am exploring it. Just one question: what does the string %>% mean?
Thanks in advance
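On the %>% question: it is the pipe operator from the magrittr package, which dplyr makes available once loaded. It passes the value on its left as the first argument of the function call on its right. A minimal sketch, with a made-up vector x:

```r
library(dplyr)  # loading dplyr also makes the %>% operator available

x <- c(4, 9, 16)

# Nested call: read from the inside out.
nested <- round(sqrt(x), 1)

# Piped call: read from left to right; x goes into sqrt(),
# and that result becomes the first argument of round().
piped <- x %>% sqrt() %>% round(1)

identical(nested, piped)
```

So read_excel(...) %>% mutate_if(is.character, as.factor) is another way of writing mutate_if(read_excel(...), is.character, as.factor).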

Almost working with the following sequence:
library(bnlearn)
library(lattice)
library(gRain)
library(yaml)
library(readxl)  # provides read_excel
library(dplyr)   # provides mutate_if and %>%
Donnees <- read_excel("My_dir/My_data.xlsx", sheet = "RB_FLAP", col_names = TRUE) %>% mutate_if(is.character, as.factor)

Donnees

# A tibble: 54 x 7
  thick Bshaft length Ribs  Strain  Utotal  MRFY
1 petit fin    court  non   ].2-.4] ].5-1]  ]1.3-2]
2 moyen fin    court  non   [0-.2]  [0.-.5] ]1.3-2]
# ... with 52 more rows

str(Donnees)
$ thick : Factor w/ 3 levels "gros","moyen",..: 3 2 1 3 2 1 3 2 1 3 ...
$ Bshaft: Factor w/ 3 levels "epais","fin",..: 2 2 2 3 3 3 1 1 1 2 ...
$ length: Factor w/ 3 levels "court","inter",..: 1 1 1 1 1 1 1 1 1 2 ...
....
For this test, learn_set = Donnees:

learn_set
  thick Bshaft length Ribs  Strain  Utotal  MRFY
1 petit fin    court  non   ].2-.4] ].5-1]  ]1.3-2]
2 moyen fin    court  non   [0-.2]  [0.-.5] ]1.3-2]
3 gros  fin    court  non   [0-.2]  [0.-.5] ]1.3-2]
4 petit medium court  non   ].2-.4] ].5-1]  ]1.3-2]
# ... with 50 more rows

res_hc <- hc(Donnees.Learn, whitelist = NULL, debug = TRUE, score = "aic")
Error in check.data(x) : variable thick must have at least two levels.

Nonetheless, the column does have three levels:
levels(learn_set[["thick"]])
[1] "gros" "moyen" "petit"

So close to the solution. I will look more closely at the "cell-and-column-types.html" vignette you pointed me to earlier and let you know.
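One possible explanation for the "at least two levels" error, offered as an assumption that is not confirmed anywhere in this thread: read_excel returns a tibble, and bnlearn may not handle tibbles the same way as plain data frames. Converting with as.data.frame() before calling hc() is a cheap thing to try. The sketch below uses hypothetical toy data (the names toy and toy_df, and all values, are made up) rather than the poster's file.

```r
library(bnlearn)
library(dplyr)

# Hypothetical toy data standing in for Donnees (values are made up).
set.seed(1)
toy <- tibble(
  thick  = sample(c("petit", "moyen", "gros"), 30, replace = TRUE),
  length = sample(c("court", "inter", "long"), 30, replace = TRUE)
) %>%
  mutate_if(is.character, as.factor)

# Drop the tibble classes so bnlearn receives a plain data.frame.
toy_df <- as.data.frame(toy)
class(toy_df)

# Hill climbing on the plain data frame.
res_hc <- hc(toy_df, score = "aic")
```

If the error persists after the conversion, the cause lies elsewhere, for example in a subset whose column really does contain a single level.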

For modeling, it makes a lot of sense (to me at least) to make them factors. However, there are a lot of cases where it is much better to work with the raw strings, and the creators of those packages made that decision on that basis.

The modeling packages, specifically recipes, will convert them to factors since that is what you would need for models.


That error is just telling you that you need to load the dplyr package, which provides the mutate_if function.