Catalog filtering should now be case/diacritics insensitive for all
fields. However it is not validated for language, name and category
fields, and is validated for tags, creator & publisher only for text
supplied in the filter (but not for values read from the book).
Catalog filtering by titles/description was sensitive to diacritics
present in the query string. Fixed that.
Also enhanced the unit test to validate the insensitivity to diacritics
present in either the title/description or the query string.
This change fixes the failure of the LibraryTest.filterByPublisher
unit-test broken by the previous commit.
The previous approach used in `publisherQuery()` for building a phrase
query enforcing the specified prefix for all terms fails if
1. the input phrase contains a non-word term that Xapian's query parser
doesn't like (e.g. a standalone ampersand character, 1/2, a#1, etc);
2. the input phrase contains at least three terms that Xapian's query
parser has no issue with.
Using the `quest` tool (coming with xapian-tools under Ubuntu) the
issue can be demonstrated as follows:
```
$ quest -o phrase -d some_xapian_db "Energy & security"
Parsed Query: Query((energy@1 PHRASE 11 Zsecur@2))
Exactly 0 matches
MSet:
$ quest -o phrase -d some_xapian_db "Energy & security act"
UnimplementedError: OP_NEAR and OP_PHRASE only currently support leaf subqueries
$ quest -o phrase -d some_xapian_db 'Energy 1/2 security act'
UnimplementedError: OP_NEAR and OP_PHRASE only currently support leaf subqueries
$ quest -o phrase -d some_xapian_db "Energy a#1 security act"
UnimplementedError: OP_NEAR and OP_PHRASE only currently support leaf subqueries
```
The problem comes from parsing the query with the default operation set
to `OP_PHRASE` (exemplified by the `-o phrase` option in above
invocations of `quest`). A workaround is to parse the phrase with a
default operation of `OP_OR` and then combine all the terms with
`OP_PHRASE`.
Besides stemming should be disabled in order to target an exact phrase
match (save for the non-word terms, if any, that are ignored by the
query parser).
Moved the `filter.hasQuery()` check inside `buildXapianQuery()`.
`Library::filterViaBookDB()` only cares if the query that is going to be
run on the book DB would match all documents. The rest of changes
related to enhancing the usage of Xapian for the catalog search will
happen inside `buildXapianQuery()` and `updateBookDB()`.
Language code is converted from ISO 639-3 to ISO 639 (which is
understood by Xapian) via ICU. The previous approach via an explicit
map had its advantages since Xapian has more than one stemmer
implementations for some languages (selectable via Xapian-specific
identifiers). This commit relies on the defaults associated with the
ISO 639 language codes.
The search text in the catalog query is interpreted as partial by
default, but partial query mode can be disabled in C++. The latter
possibility is not exposed via the /catalog/search kiwix-serve endpoint,
though.
1. Get the subset of books matching the q (title/description) parameter
of the search
2. Filter out books not matching the other parameters of the search.
Stage 1. currently works in the old way, but will be replaced by Xapian
based search in subsequent commits.
The default value of this parameter is false, in this case all the bookmarks
are returned, otherwise only those who are related to books of the library.
Api changes :
- removeLastPathElement do not takes extra arguments
`removePreSeparator` and `removePostSeparator`.
This is not needed as path do not need special tailing separator.
- Only one function `split`. Arguments can be implicitly convert to
string. No need for overloading functions to explicitly cast them.
- `split` function takes another argument `trimEmpty`. If true, empty
element are removed.
Path manipulation now almost pass trough a vector<string> to store each
path's part.
Most of the complex works is now made in the normalizeParts function.
Instead of having a single method `listBooksIds` that tries to be
exhaustive about all the filter and sort option, split the method in
two separated methods `filter` and `sort`.
The `filter` method takes a `Filter` object that represent on what we are
filtering. This object has to be construct before calling `filter`.
```cpp
Filter filter;
filter.query("Astring");
filter.acceptTags({"nopic"});
// return all book in eng and with "Astring" in the tile or description".
library.filter(filter);
//equivalent to
library.listBooksIds(ALL, UNSORTED, "Astring", "", "", "", {"nopic"});
// or better
library.filter(Filter().query("Astring").acceptTags({"nopic"}));
```
The method `listBooksIds` has been marked as deprecated.
Add a small test on the library.
The `common` name is from the time where kiwix was only one repository
for all the project (android, desktop, server...).
Now we have split the repositories and kiwix-lib is the "common" repo,
the "common" directory is somehow nonsense.
The library now contains (simple) methods to handle bookmarks.
The bookmark are stored in a separate xml file.
Bookmark are mainly a couple (`zimId`, `articleUrl`).
However, in the xml we store a bit more data :
- The article's title (for display)
- The book's title, lang and date (for potential update of zim files)
It is up to the book to manage its attribute.
Also remove the `absolutePath` (and `indexAbsolutePath`). The `Book::path` is always stored
absolute.
The fact that the path can be stored absolute or relative in the
`library.xml` is not relevant for the book.