Problem with "empty" search_index.json

Hi,

I have an issue with the content of the search_index.json file after rendering my gitbook. When I render the book on my laptop, the file is full of textual content and allows searches to be performed when the gitbook is open in a web browser. When I render the book on a remote server, the file is virtually empty (i.e., full of empty strings "") and searches are not possible when the gitbook is open in a web browser.

Any ideas where to look and how to correct this issue?

Thanks in advance

Laptop sessionInfo

R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.3

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C             
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8   
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8  
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                
 [9] LC_ADDRESS=C               LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] bookdown_0.7

loaded via a namespace (and not attached):
 [1] compiler_3.4.4  backports_1.1.1 magrittr_1.5    rprojroot_1.2 
 [5] htmltools_0.3.6 tools_3.4.4     rstudioapi_0.7  yaml_2.1.14   
 [9] Rcpp_0.12.18    stringi_1.1.5   rmarkdown_1.10  knitr_1.20    
[13] stringr_1.2.0   digest_0.6.12   xfun_0.3        evaluate_0.10.1
>

Remote server sessionInfo()

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /usr/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] bookdown_0.7

loaded via a namespace (and not attached):
 [1] compiler_3.4.3  backports_1.1.2 magrittr_1.5    rprojroot_1.3-2
 [5] htmltools_0.3.6 tools_3.4.3     rstudioapi_0.7  yaml_2.1.18
 [9] Rcpp_0.12.16    stringi_1.1.7   rmarkdown_1.9   knitr_1.20
[13] stringr_1.3.0   digest_0.6.15   xfun_0.3        evaluate_0.10.1

Just a quick follow-up: for SOP reasons, I have to use the remote server for production work. So I am looking at fixing the problem on the server (which, by the way, cannot be switched to an Ubuntu-based distribution).

What is your workflow to compile the book on the remote server?

Hi @cderv

The workflow is virtually identical on both machines:

  1. On my Linux laptop, I open a shell and navigate to the location of my Rmd files. For the server, I have to use my Windows workstation and connect remotely using PuTTY; then, in the shell, I navigate to the location of my Rmd files.
  2. I open an interactive R session (no RStudio use at all)
  3. I issue the following command (the output directory is obviously different on the 2 machines): bookdown::render_book('index.Rmd', 'bookdown::gitbook', output_dir=<some directory>)
  4. I go outside for a minute or 2 to enjoy the sun and come back to see the results :smiley:

Unfortunately, I don't know where this could come from. I am not sure how the file is created; looking at the source code might give some hints.

Hi,

An update on our investigation:

  1. After many attempts playing with the source code of bookdown, it appears that the problem lies in the strip_html utility function. This function takes an input object x, which is a vector containing text extracted from the HTML file content. This vector is processed using various gsub calls. One in particular (gsub("\\s{2,}", " ", x)) happens to return a vector of empty strings (i.e., "") if a single element of x contains non-printable characters. This unfortunate conversion happens only on the CentOS server.
  2. In my case, the process of creating HTML pages through bookdown transforms single quote characters (0x27) used in the Rmd files into right single quotation mark Unicode characters (0xe2 0x80 0x99) in the HTML files.
  3. Looking at the server side a bit more, my IT colleagues found out that, if we recompile R from source instead of using the packages provided for CentOS (EPEL repository), the problem mentioned in 1 does not occur. Therefore, they suspect that the maintainers of this repository must have a different encoding on their package build machines...
  4. Using gsub("\\s{2,}", " ", x, useBytes = TRUE) processes the files correctly.

I would like to avoid recompiling R or maintaining a fork of bookdown, so does anyone know how to force bookdown not to convert single quotes into Unicode characters?
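
To make points 2 and 4 above concrete, here is a minimal base-R sketch. It assumes nothing about bookdown internals: collapse_ws is a hypothetical helper written for illustration, not the actual strip_html code, and the test strings are made up.

```r
# Right single quotation mark (U+2019), written as an escape so the example
# does not depend on the encoding of this script file.
x <- c("r\u2019s", "two  spaces", "tabs\t\tand\n\nnewlines")

# UTF-8 bytes of the first element: 72 e2 80 99 73
quote_bytes <- charToRaw(enc2utf8(x[1]))
print(quote_bytes)

# Hypothetical helper (NOT bookdown's strip_html): collapse runs of
# whitespace byte-wise, so the regex engine never has to decode
# multibyte characters.
collapse_ws <- function(x) {
  gsub("\\s{2,}", " ", enc2utf8(x), useBytes = TRUE)
}

out <- collapse_ws(x)
print(out)
```

On a healthy installation the plain gsub call and this byte-wise variant should agree; on the affected server only the byte-wise variant keeps the text.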

If you think this is a bug and that bookdown should be improved by adding useBytes = TRUE to the gsub call, it would make sense to open an issue in the bookdown GitHub repo, linking to this discussion.

Otherwise, I don't have an answer yet. You may be able to force the encoding, if this is related. But what you explained about the R compilation makes me doubt it. Maybe an option in R?

Nice investigation by the way! :clap:

I just remembered that, for simplicity, bookdown expects UTF-8 only. See the sentence on this page: https://bookdown.org/yihui/bookdown/usage.html

You should make sure all your files are UTF-8 before rendering. Is this the case?
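
One way to check this is sketched below in base R. find_non_utf8 is a hypothetical helper name, and the file pattern is an assumption about which source files matter for the book; adjust both as needed.

```r
# Hypothetical helper: return the paths of files under `dir` that contain
# at least one line that is not valid UTF-8.
find_non_utf8 <- function(dir = ".", pattern = "\\.(Rmd|md|yml|yaml)$") {
  files <- list.files(dir, pattern = pattern, full.names = TRUE,
                      recursive = TRUE, ignore.case = TRUE)
  is_valid <- vapply(files, function(f) {
    # readLines() keeps the bytes as they are; validUTF8() inspects them.
    lines <- readLines(f, warn = FALSE)
    all(validUTF8(lines))
  }, logical(1))
  files[!is_valid]
}

# Example: list offending files in the current book directory.
print(find_non_utf8("."))
```

An empty result means every matched file is valid UTF-8, in which case the source files themselves are unlikely to be the cause.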

Thanks @cderv

I am fairly sure that this is a problem with our server / R installation relating to encoding. Without using bookdown at all, and reading in a UTF-8 encoded file containing just r’s (with the middle character being the right single quotation mark Unicode character (0xe2 0x80 0x99)), I get this:

> x <- scan(file='test.html', what='character', sep='\n')
Read 1 item
> x
[1] "r’s"
> gsub('\\s{2,}', ' ', x)
[1] " "
> gsub('\\s{2,}', ' ', x, useBytes=TRUE)
[1] "r’s"

I am honestly not sure if this is a bookdown bug.
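
For comparing the two machines, a small diagnostic along these lines may help. It is only a sketch using base R; encoding_report is a hypothetical helper, and the choice of fields is my guess at what is relevant here.

```r
# Hypothetical helper: gather locale and encoding facts for one string.
encoding_report <- function(x) {
  stopifnot(is.character(x), length(x) == 1)
  info <- l10n_info()
  list(
    locale_ctype = Sys.getlocale("LC_CTYPE"),
    utf8_locale  = isTRUE(info[["UTF-8"]]),
    multibyte    = isTRUE(info[["MBCS"]]),
    declared     = Encoding(x),       # encoding mark R has stored for x
    valid_utf8   = validUTF8(x),      # are the bytes well-formed UTF-8?
    bytes        = charToRaw(x),
    # x has no whitespace runs, so a healthy gsub() must leave it unchanged
    gsub_ok      = identical(charToRaw(gsub("\\s{2,}", " ", x)), charToRaw(x))
  )
}

report <- encoding_report("r\u2019s")
str(report)
```

Running this on both the laptop and the server and comparing the output should show whether the two differ only in gsub_ok or also in the locale fields.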