bookdown search

ecodiv · March 7, 2022, 5:10pm

The search function of the bs4_book bookdown book does not seem to find words beyond a certain number of paragraphs. For example, I created a new book based on the bs4_book template. Next, I created a random text and copied that text to a paragraph in chapter 2, and to a paragraph in chapter 3. Searching for a word from that text finds them in both chapters.

Now, I copied the same text twice in a paragraph in chapter 2, and once in a paragraph in chapter 3. Searching for a word finds it in chapter 3, but not in chapter 2. See https://ecodiv.earth/test/_book/index.html and search e.g., for the word 'bibendum'.

Question: is there a limit to the number of words per paragraph and/or page that can be searched and found?

cderv · March 7, 2022, 5:37pm

Interesting. I would need to look closer into your example to see if we are missing something while creating the search list. Can you share your example repo?

The term is correctly included in the search index search.json created for your book: https://ecodiv.earth/test/_book/search.json

So maybe this is a limitation of fuse.js that is used for bs4_book() or it is due to some configurations.
Doc is here: https://fusejs.io/
The code for using fuse in bookdown is here: bookdown/bs4_book.js at main · rstudio/bookdown · GitHub

We have a scoring filter and some options. As your text is copy-pasted exactly, maybe it is the filter.

I don't have an answer to your question yet as I am not an expert in fuse.js, but I hope the above hints can help you dig into that, and maybe we can adapt on our side.

cderv · March 7, 2022, 5:49pm

From debugging your example in the DevTools console in the browser, it seems that the occurrences in chapter 2 (Hello Bookdown) are filtered out because of the score set by Fuse.js.

According to the Fuse docs, a score of 0 is a perfect match and a score of 1 is a complete mismatch. In bs4_book() we keep only items with score <= 0.75.

That is why the chapter 2 result is not shown: it got a score of 0.82.
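To make the effect of that cutoff concrete, here is a minimal sketch of a score filter. The result objects, the `keepGoodMatches` helper, and the 0.60 score for chapter 3 are hypothetical; only the 0.75 cutoff and the 0.82 score come from the discussion above, and this is not the actual bs4_book.js code.

```javascript
// Hypothetical results, shaped like search output that carries a score.
// 0 = perfect match, 1 = complete mismatch.
const results = [
  { item: { title: "Chapter 2" }, score: 0.82 }, // score reported in this thread
  { item: { title: "Chapter 3" }, score: 0.6 },  // made-up score for illustration
];

// Cutoff described above: keep only items with score <= 0.75.
const MAX_SCORE = 0.75;

// Hypothetical helper: drop every result whose score is above the cutoff.
function keepGoodMatches(allResults, maxScore) {
  return allResults.filter((r) => r.score <= maxScore);
}

const shown = keepGoodMatches(results, MAX_SCORE);
console.log(shown.map((r) => r.item.title));
```

With these numbers only chapter 3 survives, even though chapter 2 contains the same word, which matches the behaviour reported in the first post.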

This is related to the Fuse.js algorithm, and I don't know exactly how the scoring works. You could look into this with your case.

Hope it helps

ecodiv · March 7, 2022, 6:32pm

Hi, thanks for looking into this. The source is here: https://github.com/ecodiv/bookdownsearchtest. It is just the standard bs4_book template with some inserted text.

ecodiv · March 7, 2022, 6:34pm

Yes, I had noticed that the search.json contained all the text. I already had a look at the fuse help pages, but with your clues it might be easier to know what to look for, thanks.

jtbayly · March 7, 2022, 7:42pm

This does feel like a bug in Fuse.js.

I would recommend raising the question there. The weighting just seems completely wrong in this example.

Only thing I can think of that may be negatively affecting this is it not being English, or a “supported” language, possibly. Not sure if that’s a thing in Fuse.js

ecodiv · March 7, 2022, 8:19pm

Reading up on the scoring theory used by Fuse, this seems more like a feature. According to the page https://fusejs.io/concepts/scoring-theory.html there are two possible culprits:

Field-length Norm: The shorter the field, the higher its relevance. If a pattern matches a short field (such as a title field) it is likely to be more relevant than the same pattern matched with a bigger field.
Distance, Threshold, and Location: these determine the number of words that are included in the search.
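The field-length idea can be sketched with a toy model. Everything here is an assumption made for illustration: the `fieldNorm` and `adjustedScore` helpers are hypothetical, and the formula (norm = 1 / sqrt(word count), applied as an exponent on the raw match score) is a simplified guess at how such a norm could work, not Fuse.js's documented API or exact code.

```javascript
// Toy model of a field-length norm (an assumption, not Fuse.js source code).
// The norm shrinks as the field gets longer.
function fieldNorm(wordCount) {
  return 1 / Math.sqrt(wordCount);
}

// A raw score in (0, 1) raised to a power below 1 moves toward 1,
// i.e. toward "mismatch". Longer fields therefore get worse scores.
function adjustedScore(rawScore, wordCount, ignoreFieldNorm = false) {
  const exponent = ignoreFieldNorm ? 1 : fieldNorm(wordCount);
  return Math.pow(rawScore, exponent);
}

const rawScore = 0.001; // hypothetical near-perfect match
for (const words of [100, 200, 400]) {
  console.log(words, adjustedScore(rawScore, words).toFixed(3));
}
```

In this model, pasting the same text twice into a paragraph lengthens the field and pushes the score for the same word closer to 1, which would be consistent with a longer chapter 2 paragraph falling outside a fixed score cutoff. Ignoring the norm leaves the raw score unchanged.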

I tried setting the search engine options to ignore the location and the field norm, but that does not seem to work (or I am using them wrongly).

bookdown::bs4_book:
  css: bs4_style.css
  theme:
    primary: "#096B72"
  repo: https://github.com/rstudio/bookdown-demo
  config:
    search:
      engine: fuse
      options: 
        isCaseSensitive: false
        findAllMatches: true
        includeScore: false
        ignoreLocation: true
        ignoreFieldNorm: true

cderv · March 8, 2022, 10:51am

@ecodiv currently bs4_book() does not support changing the fuse.js settings. Only the gitbook() format does.

So this is not used when using bs4_book():

  config:
    search:
      engine: fuse
      options: 
        isCaseSensitive: false
        findAllMatches: true
        includeScore: false
        ignoreLocation: true
        ignoreFieldNorm: true

(gitbook() does not have the same issue because no filtering is done.)

Adding this to bs4_book() would be a feature request, but deactivating scoring would currently cause a problem because the score value is used in the filter.

You would need to fork bookdown and change some options to see how that would work.

ecodiv · March 11, 2022, 6:22pm

It is also possible to change the effect of the field length. So being able to set these parameters would be a very welcome feature. Is this the best place to make a feature request?

cderv · March 11, 2022, 7:02pm

The bookdown GitHub repo is the best place to post a feature request.

jtbayly · March 14, 2022, 3:32pm

Is it possible that the scoring is relative, @ecodiv ? I haven't read that page yet, but I can imagine that since the exact search phrase happens more than once in the text of the book, that it then tries to rank them by adjusting their scores.

I'm still very unimpressed with the idea that an exact match for my search phrase would not even show up in the results.

Doesn't everybody agree that's really a fundamental problem? Searching for a phrase should find that exact phrase anywhere it's in the book.

What are the ramifications of adjusting the score threshold, @cderv? Does it have to be set? Can it just be left to Fuse.js to determine whether there are matching results and how to prioritize them? Or does a threshold have to be set?

Because if scoring is relative, then I'm guessing we should just get rid of the threshold if we can.

cderv · March 15, 2022, 9:13am

I don't know a lot about Fuse.js as I did not implement the search feature in the first place. In bs4_book() we tried to add some logic to improve it; in gitbook() we just used the default options.

If there is some improvement to make based on a better understanding of fuse.js, I'll be happy to do it. It does indeed seem a bit off that exact matches are not shown. I can't answer your question regarding Fuse.js though; we'll need to search and experiment with the JS library.

I think the threshold is there to avoid having every result in the pop-up box when you search, but if it is not used correctly, then we need to change it.

If someone is willing to improve it, please submit a PR.

jtbayly · March 17, 2022, 2:05am

In bs4_book, the location is apparently already being ignored via ignoreLocation: true

I suspect somebody should try to modify bookdown and set ignoreFieldNorm: true

That wouldn't turn off or break scoring, so it should be a fairly simple test. And I'm guessing it's the most likely culprit, now that I've read the Fuse.js documentation.

If that doesn't help, then I'd try setting findAllMatches: true. However, since the match is apparently being returned according to @cderv's test, just scored too low, I'm guessing that's not the problem.

cderv · March 17, 2022, 10:19am

I have added the field in

One can try

remotes::install_github("rstudio/bookdown#1319")

Playing with the configuration in the DevTools panel in the browser or in an editor should also work.
