In the absence of the "userlang" query parameter in the URL, the value
of the "Accept-Language" header is used. However, it is assumed that
"Accept-Language" specifies a single language (rather than a
comma-separated list of languages, possibly weighted with quality values).
Example:
Accept-Language: fr
// should work
Accept-Language: fr-CH, fr;q=0.9, en;q=0.8, de;q=0.7, *;q=0.5
// The requested language will be considered to be
// "fr-CH, fr;q=0.9, en;q=0.8, de;q=0.7, *;q=0.5".
// The i18n code will fail to find resources for such a language
// and will use the default "en" instead.
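The behaviour described above can be sketched as follows. This is only an
illustration, not the actual i18n code: the function names, the tiny
translation table and the convention that an empty string means "no
userlang parameter" are all assumptions made for the example.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the i18n resource table: language -> message.
const std::map<std::string, std::string>& translations()
{
  static const std::map<std::string, std::string> table = {
    {"en", "Hello"},
    {"fr", "Bonjour"},
  };
  return table;
}

// The "userlang" query parameter wins; otherwise the raw Accept-Language
// header value is used verbatim (no parsing of lists or quality values).
// Assumption: an empty userlangParam means the parameter is absent.
std::string requestedLanguage(const std::string& userlangParam,
                              const std::string& acceptLanguageHeader)
{
  return userlangParam.empty() ? acceptLanguageHeader : userlangParam;
}

// Looks up a message, falling back to "en" when the language is unknown.
std::string greeting(const std::string& lang)
{
  const auto& table = translations();
  const auto it = table.find(lang);
  return it != table.end() ? it->second : table.at("en");
}
```

With this sketch a header of "fr" selects the French message, while a
weighted list is treated as one unknown language name and ends up with the
"en" default.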
At this point a potential issue has been revealed. Now we produce
the final HTML via a 2-level template expansion:
1. Render parameterized messages
2. Render the HTML template
In which templates should we use double mustache "{{}}" (HTML-escaping)
tags, and where may we use triple mustache "{{{}}}" (non-escaping) tags?
Introduced a new resource compiler script kiwix-compile-i18n that
processes i18n string data stored in JSON files and generates sorted C++
tables of string keys and values for all languages.
The "Fulltext search unavailable" error page is now generated using the
static/templates/error.html template. Also added two test cases checking
that error page.
404.html no longer contains anything specific to the 404 error and will
henceforth serve (with some enhancements) as a general purpose error
page template.
It is better to try to get the `Search` from the cache directly instead
of getting the `Searcher` first, which would be useless if the `Search`
already exists.
SearchInfo is a small helper structure to store information about the
queried search. It groups already existing information (`patternString`,
geo query, ...) in one structure.
It is also used as the key in the cache instead of a generated string.
Instead of passing the `bookName` and `bookTitle` parameters to
`Response::build_404()`, `withTaskbarInfo()` is applied to its result
when needed. Note that in `InternalServer::handle_raw()`
`withTaskbarInfo()` was not utilized since the results of the `/raw`
endpoint are not supposed to be decorated with a taskbar.
This was done in preparation for removing the `bookName` and `bookTitle`
parameters from `Response::build_404()`, but since the new function
could already be put to some use in this commit that was done too.
Previously, the searchURL was not encoded.
This resulted in an XSS vulnerability. A proof of concept is:
start kiwix-serve
visit - http://192.168.18.1:8081/"><svg onload="alert(1)">
This would display an alert message.
This commit encodes the searchURL before passing it to searchSuggestionHtml.
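The message above does not say which encoding is applied. As an
illustration only, the sketch below assumes HTML escaping of the value
before it is embedded in the page; `escapeForHtml` is a hypothetical
helper, not the function used in the code base.

```cpp
#include <string>

// Hypothetical helper: escapes the characters that are significant in
// HTML text and in quoted attribute values.
std::string escapeForHtml(const std::string& s)
{
  std::string out;
  out.reserve(s.size());
  for (const char c : s) {
    switch (c) {
      case '&':  out += "&amp;";  break;
      case '<':  out += "&lt;";   break;
      case '>':  out += "&gt;";   break;
      case '"':  out += "&quot;"; break;
      case '\'': out += "&#39;";  break;
      default:   out += c;        break;
    }
  }
  return out;
}
```

Applied to the path from the proof of concept, the quote and angle bracket
characters are replaced with entities, so the injected markup can no longer
close the attribute or open a new tag.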
We create a cache for SuggestionSearcher very similar to that of FT
searcher. The user can specify a custom cache size using the environment
variable SUGGESTION_SEARCHER_CACHE_SIZE. It has a default value of 10%
of the number of books in the library.
We use the new cache template to implement two kinds of cache.
1: The Searcher cache is more general in terms of its usage. A Searcher
can be used for multiple searches without much change to itself. We
try to retrieve the searcher and perform searches using it whenever
possible, and if not we put a searcher into the cache. The user can
specify a custom cache length by manipulating the environment
variable SEARCHER_CACHE_SIZE. Its default value is 10% of all the
books available.
2: The search cache is much more restricted in terms of usage. Its main
purpose is to avoid re-searching on the searcher during page changes
to generate SearchResultSet of various ranges. The user can specify a
custom cache length using the environment variable SEARCH_CACHE_SIZE
with a default value of 2.
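A minimal sketch of how such cache sizes could be read from the
environment. The helper names are hypothetical, and two details are
assumptions of the sketch rather than facts from the commit: invalid
values fall back to the default, and the 10% default is clamped to at
least one slot.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>

// Parses a cache size from the raw value of an environment variable.
// A null, empty or non-numeric value yields the supplied default.
std::size_t parseCacheSize(const char* rawValue, std::size_t defaultSize)
{
  if (rawValue == nullptr ||
      !std::isdigit(static_cast<unsigned char>(*rawValue))) {
    return defaultSize;
  }
  char* end = nullptr;
  const unsigned long long parsed = std::strtoull(rawValue, &end, 10);
  return (*end == '\0') ? static_cast<std::size_t>(parsed) : defaultSize;
}

// Reads the named environment variable, e.g. "SEARCH_CACHE_SIZE".
std::size_t cacheSizeFromEnv(const char* name, std::size_t defaultSize)
{
  return parseCacheSize(std::getenv(name), defaultSize);
}

// Default for the Searcher cache: 10% of the books.
// Assumption: never go below one slot.
std::size_t defaultSearcherCacheSize(std::size_t bookCount)
{
  const std::size_t tenth = bookCount / 10;
  return tenth > 0 ? tenth : 1;
}
```

For example, `cacheSizeFromEnv("SEARCH_CACHE_SIZE", 2)` would yield 2
unless the variable is set to a valid number.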
Adds a std::map<std::string, std::string> with display names for language codes not given by libicu.
The faulty language codes are taken from library.kiwix.org.
As we still create a `Reader` in the deprecated code of `Library`,
we need a way to create a reader without raising a deprecation warning.
So we create another constructor with a dummy argument and we use it.
As the `Entry` is still created by `Reader` we need a way to create an
entry without raising a deprecation warning.
To do so we create a second constructor with a dummy argument.
This second constructor is private and is not marked as deprecated so we
can use it.
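The dummy-argument trick can be sketched like this. All names
(`LegacyReader`, `NotDeprecated`, `makeReaderInternally`) are hypothetical
and only mirror the pattern; the real classes have different members.

```cpp
#include <string>
#include <utility>

class LegacyReader
{
public:
  // Public entry point kept for old callers; using it emits a warning.
  [[deprecated("use the new API instead")]]
  explicit LegacyReader(std::string path) : m_path(std::move(path)) {}

  const std::string& path() const { return m_path; }

private:
  // Dummy tag type whose only purpose is to select the second overload.
  struct NotDeprecated {};

  // Same behaviour as the public constructor, but not marked deprecated.
  LegacyReader(std::string path, NotDeprecated) : m_path(std::move(path)) {}

  // Internal code that still needs a reader goes through this friend.
  friend LegacyReader makeReaderInternally(const std::string& path);

  std::string m_path;
};

LegacyReader makeReaderInternally(const std::string& path)
{
  return LegacyReader(path, LegacyReader::NotDeprecated{});
}
```

Internal code calls `makeReaderInternally()` and compiles without the
warning, while external callers of the public constructor still see it.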
The HumanReadableId can contain special characters (`&`/`=`/...).
As it is used to create a URL in the OPDS template,
we must URL-encode it.
- We don't need to encode the book id: as it is a UUID, it never contains
special characters.
- We don't need to encode the book URL as it is read from the library and
the URL must already be correctly encoded in the library.xml.
(tests modified accordingly)
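For illustration, a generic percent-encoding routine is sketched below. It
is not the encoder used in the code base; `urlEncode` here is a
hypothetical helper that keeps only the RFC 3986 unreserved characters
as-is.

```cpp
#include <cctype>
#include <string>

// Percent-encodes every byte that is not an RFC 3986 "unreserved"
// character (letters, digits, '-', '_', '.', '~').
std::string urlEncode(const std::string& value)
{
  static const char hex[] = "0123456789ABCDEF";
  std::string out;
  for (const unsigned char c : value) {
    if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
      out += static_cast<char>(c);
    } else {
      out += '%';
      out += hex[c >> 4];
      out += hex[c & 0x0F];
    }
  }
  return out;
}
```

An id containing `&` or `=` is then safe to place in a query string,
whereas an id made only of unreserved characters is left unchanged.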
kiwix::fileExists only checks for file existence now
kiwix::fileReadable will check if the file is readable (implicitly checking for file existence also)
As the name suggests, this endpoint is not smart:
It returns the content as it is and only if it is present
(no compatibility or whatever).
The only "smart" thing is to return a redirect if the entry is a redirect.
As we render the entry's XML in a separate step, we need to pass the
rootLocation to all the internal rendering.
Testing with and without root is not so easy.
I've simply made all server tests use a ROOT prefix.
We can assume that if the ROOT is present everywhere we need it, it will not
be present where we don't need it. (As long as we don't hardcode "ROOT" in
the server.)
As a result of this clean-up the /suggest endpoint too stopped
generating confusing 404 Not Found errors (which, like in /meta's case,
is not that important). Another functional change is that the "term"
parameter became optional.
Before this fix the /meta endpoint could return a 404 Not Found page
saying
The requested URL "/meta" was not found on this server.
Error cases producing such a result were:
- `/meta?content=NON-EXISTING-BOOK&name=metaname`
- `/meta?content=book&name=BAD-META-NAME`
Now a proper message is shown for each of those cases.
This fix is being done just for consistency (the /meta endpoint is not
a user-facing one and the scripts don't bother about error texts).
Now Response::build_404() takes the URL instead of the entire
RequestContext object. An empty url suppresses the
The requested URL "url" was not found on this server.
part of the error text.
Before this fix the /random endpoint could return a 404 Not Found page
saying
The requested URL "/random" was not found on this server.
Error cases producing such a result were:
- `/random?content=NON-EXISTING-BOOK` (can happen when a server is
restarted or the library is reloaded and the current book is no longer
available).
- Failure of the libkiwix routine for picking a random article.
Now a proper message is shown for each of those cases.
Library became thread-safe with the exception of `getBookById()`
and `getBookByPath()` methods - thread safety in those accessors is
rendered meaningless by their return type (they return a reference
to a book which can be removed any time later by another thread).
Introducing a mutex in `Library` necessitates manually implementing the
move constructor and assignment operator. It's better to still delegate
that work to the compiler to eliminate any possibility of bugs when new
data members are added to `Library`. The trick is to move the data into
an auxiliary class `LibraryBase` and derive `Library` from it.
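A reduced sketch of that trick follows. The class names are taken from the
text above, but the members (`books`, `addBook()`, `bookCount()`) are
hypothetical placeholders, and the sketch assumes that an object is not
used by other threads while it is being moved from.

```cpp
#include <cstddef>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// All movable state lives here, so the compiler-generated move
// operations stay correct when new data members are added.
struct LibraryBase
{
  std::map<std::string, std::string> books; // id -> title (placeholder)
};

class Library : private LibraryBase
{
public:
  Library() = default;

  // std::mutex is not movable, so these two are written by hand, but
  // they only forward to the compiler-generated base-class operations.
  Library(Library&& other) : LibraryBase(std::move(other)) {}

  Library& operator=(Library&& other)
  {
    LibraryBase::operator=(std::move(other));
    return *this;
  }

  void addBook(const std::string& id, const std::string& title)
  {
    std::lock_guard<std::mutex> lock(m_mutex);
    books[id] = title;
  }

  std::size_t bookCount() const
  {
    std::lock_guard<std::mutex> lock(m_mutex);
    return books.size();
  }

private:
  mutable std::mutex m_mutex; // each Library keeps its own mutex
};
```

The hand-written move operations never mention individual data members, so
adding a member to `LibraryBase` requires no change to them.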
Originally `LibraryManipulator` was an abstract class completely decoupled
from `Library`. Its `addBookToLibrary()` and `addBookmarkToLibrary()`
methods could be defined in an arbitrary way. Now `LibraryManipulator` has to be
bound to a library object, those methods are no longer virtual, they always
update the library and allow for some additional actions via virtual
functions `bookWasAddedToLibrary()` and `bookmarkWasAddedToLibrary()`.
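The new shape of the class can be sketched as below. `BookStore` and
`CountingManipulator` are hypothetical, and the signatures are simplified
(a book is just a string here); only the split between a non-virtual
method and a virtual hook reflects the description above.

```cpp
#include <string>
#include <vector>

// Hypothetical, minimal stand-in for the real library class.
struct BookStore
{
  std::vector<std::string> books;
};

class LibraryManipulator
{
public:
  // The manipulator is bound to a library object.
  explicit LibraryManipulator(BookStore& library) : m_library(library) {}
  virtual ~LibraryManipulator() = default;

  // Non-virtual: the library is always updated, then the hook runs.
  void addBookToLibrary(const std::string& book)
  {
    m_library.books.push_back(book);
    bookWasAddedToLibrary(book);
  }

protected:
  // Customization point for additional actions; does nothing by default.
  virtual void bookWasAddedToLibrary(const std::string& /*book*/) {}

private:
  BookStore& m_library;
};

// Example subclass that counts additions.
class CountingManipulator : public LibraryManipulator
{
public:
  using LibraryManipulator::LibraryManipulator;
  int added = 0;

protected:
  void bookWasAddedToLibrary(const std::string&) override { ++added; }
};
```

A subclass can no longer skip the library update; it can only react to it.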
Deduplicated the mustache templates static/templates/catalog_v2_entries.xml
and static/templates/catalog_v2_complete_entry.xml (the latter was
renamed to static/templates/catalog_v2_entry.xml).
This will allow handle_suggest API to accept two arguments `start` and
`suggestionLength` that will allow handle_suggest to retrieve
suggestions in the given range rather than the default 0-10 range.
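The range selection can be illustrated with a small helper. This is a
sketch over an in-memory list, not the actual handler: `suggestionRange`
is a hypothetical name, and the defaults of 0 and 10 simply mirror the
default range mentioned above.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Returns at most `suggestionLength` items starting at index `start`.
std::vector<std::string> suggestionRange(const std::vector<std::string>& all,
                                         std::size_t start = 0,
                                         std::size_t suggestionLength = 10)
{
  if (start >= all.size()) return {};
  const std::size_t count = std::min(suggestionLength, all.size() - start);
  const auto first = all.begin() + static_cast<std::ptrdiff_t>(start);
  return std::vector<std::string>(first,
                                  first + static_cast<std::ptrdiff_t>(count));
}
```

A `start` beyond the end of the list yields an empty result rather than an
error, which is a choice of this sketch.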
Language code to human friendly name translation is now done with the
help of the ICU library. It works if the line
```
-include $(LANGSRCDIR)/resfiles.mk
```
in the file `source/data/Makefile.in` of the icu4c dependency is not
commented out. Currently, the said line is commented out (along with
some other include's) by the `icu4c_custom_data.patch` patch of the
`kiwix-build` tool.
Introduces a new member mp_search that houses the zim::Search object and
adds a new constructor for this purpose. This commit also adds an
overload for getHtml that takes start and end integers as arguments
since they are not part of the search object we include.
With openzim/libzim#540 we now have a new function to get the
illustration (previously the favicon, in 48x48 size and unity scale) in
multiple sizes. We need to replace getFaviconEntry with this new
getIllustrationItem method.
This changes the output of `/catalog/search` as follows:
- Entire search query (rather than only the value of the `q` parameter)
is put in the <title> node.
- Search performed with an empty query presents itself as "All zims".
- The feed id remains stable for identical searches on the same
library.
/catalog/v2/entries is intended to play the combined role of
/catalog/root.xml and /catalog/search of the old OPDS API. Currently,
the latter role is not yet implemented.
Implementation note: instead of tweaking and reusing
`OPDSDumper::dumpOPDSFeed()`, the generation of the OPDS feed is done via `mustache`
and a new template `static/catalog_v2_entries.xml`.
Note: This commit somewhat relaxes validation of non-variable
`<updated>` elements in the OPDS feed - the content of any `<updated>`
element is replaced with the YYYY-MM-DDThh:mm:ssZ placeholder.
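Such a replacement could be done along the following lines. This is a
sketch using `std::regex`, not the helper from the test suite;
`maskVariableDates` is a hypothetical name, and the pattern assumes that
`<updated>` elements carry no attributes or nested markup.

```cpp
#include <regex>
#include <string>

// Replaces the content of every <updated> element with a fixed
// placeholder so that feeds can be compared textually.
std::string maskVariableDates(const std::string& feed)
{
  static const std::regex updated("<updated>[^<]*</updated>");
  return std::regex_replace(feed, updated,
                            "<updated>YYYY-MM-DDThh:mm:ssZ</updated>");
}
```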
Each suggestion used to be stored as a vector of strings holding various
values such as title, path, etc. With this commit, we use the new
dedicated class `SuggestionItem` to do the same.
With openzim/libzim#545 we now support snippet generation of titles
which can be used as the display label on the ui for highlighted titles
via the "label" field.
The old version used plain title which is still available in the value
field.
After switching to Xapian-based search in the library/catalog, an empty
query stopped acting as a match-all query. This commit restores the old
behaviour in that regard.
Returning status code 204 in case of empty results doesn't show the
empty results page as described in #466. Reverting the changes in #396
fixes the issue.
Catalog filtering should now be case/diacritics insensitive for all
fields. However, it is not validated for language, name and category
fields, and is validated for tags, creator & publisher only for text
supplied in the filter (but not for values read from the book).
Catalog filtering by titles/description was sensitive to diacritics
present in the query string. Fixed that.
Also enhanced the unit test to validate the insensitivity to diacritics
present in either the title/description or the query string.
This change fixes the failure of the LibraryTest.filterByPublisher
unit-test broken by the previous commit.
The previous approach used in `publisherQuery()` for building a phrase
query enforcing the specified prefix for all terms fails if
1. the input phrase contains a non-word term that Xapian's query parser
doesn't like (e.g. a standalone ampersand character, 1/2, a#1, etc);
2. the input phrase contains at least three terms that Xapian's query
parser has no issue with.
Using the `quest` tool (coming with xapian-tools under Ubuntu) the
issue can be demonstrated as follows:
```
$ quest -o phrase -d some_xapian_db "Energy & security"
Parsed Query: Query((energy@1 PHRASE 11 Zsecur@2))
Exactly 0 matches
MSet:
$ quest -o phrase -d some_xapian_db "Energy & security act"
UnimplementedError: OP_NEAR and OP_PHRASE only currently support leaf subqueries
$ quest -o phrase -d some_xapian_db 'Energy 1/2 security act'
UnimplementedError: OP_NEAR and OP_PHRASE only currently support leaf subqueries
$ quest -o phrase -d some_xapian_db "Energy a#1 security act"
UnimplementedError: OP_NEAR and OP_PHRASE only currently support leaf subqueries
```
The problem comes from parsing the query with the default operation set
to `OP_PHRASE` (exemplified by the `-o phrase` option in above
invocations of `quest`). A workaround is to parse the phrase with a
default operation of `OP_OR` and then combine all the terms with
`OP_PHRASE`.
Besides, stemming should be disabled in order to target an exact phrase
match (save for the non-word terms, if any, that are ignored by the
query parser).
Moved the `filter.hasQuery()` check inside `buildXapianQuery()`.
`Library::filterViaBookDB()` only cares if the query that is going to be
run on the book DB would match all documents. The rest of changes
related to enhancing the usage of Xapian for the catalog search will
happen inside `buildXapianQuery()` and `updateBookDB()`.
Language code is converted from ISO 639-3 to ISO 639 (which is
understood by Xapian) via ICU. The previous approach via an explicit
map had its advantages since Xapian has more than one stemmer
implementation for some languages (selectable via Xapian-specific
identifiers). This commit relies on the defaults associated with the
ISO 639 language codes.
The search text in the catalog query is interpreted as partial by
default, but partial query mode can be disabled in C++. The latter
possibility is not exposed via the /catalog/search kiwix-serve endpoint,
though.
1. Get the subset of books matching the q (title/description) parameter
of the search
2. Filter out books not matching the other parameters of the search.
Stage 1. currently works in the old way, but will be replaced by Xapian
based search in subsequent commits.
The kiwixlib java wrapper unit test can be run manually via the
src/wrapper/java/org/kiwix/testing/compile_test.sh script.
The test ZIM files in src/wrapper/java/org/kiwix/testing were created
using the create_test_zimfiles script. They must be updated/re-generated and
committed in git whenever their source data or the create_test_zimfiles
script changes. Note: small.zim.embedded is not used at this point, it
was created for testing the enhancement coming in a few commits.
A mimetype may contain parameters.
In that case, the mimetype would be something like "text/html;foo=bar;foz=baz".
It contains `;` and `=`, which conflict with the same operators
we use to separate the items in our list.
We have to use a more advanced algorithm which takes the context into
account.
Fix #416
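One possible context-aware approach is sketched below. The list format is
an assumption of this sketch (items of the form `mimetype=count` joined
with `;`), as are the function names; the actual algorithm in the commit
may differ. An item is considered finished only when a piece ends with
`=<digits>`, so a parameter with a purely numeric value would still be
misread.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// True if `s` is a non-empty run of decimal digits.
bool isNumber(const std::string& s)
{
  if (s.empty()) return false;
  for (const unsigned char c : s) {
    if (!std::isdigit(c)) return false;
  }
  return true;
}

// Parses "mimetype=count;mimetype=count;..." (assumed format) where a
// mimetype may itself carry ";key=value" parameters.
std::map<std::string, unsigned long> parseMimetypeCounter(const std::string& text)
{
  std::map<std::string, unsigned long> counters;
  std::istringstream stream(text);
  std::string piece;
  std::string pending; // parts of the current mimetype seen so far
  while (std::getline(stream, piece, ';')) {
    const std::size_t eq = piece.rfind('=');
    const bool closesItem =
        eq != std::string::npos && isNumber(piece.substr(eq + 1));
    if (!closesItem) {
      // Still inside the mimetype (e.g. a "foo=bar" parameter).
      pending += piece + ";";
      continue;
    }
    const std::string mimetype = pending + piece.substr(0, eq);
    counters[mimetype] = std::stoul(piece.substr(eq + 1));
    pending.clear();
  }
  return counters;
}
```

With this sketch, "text/html;foo=bar;foz=baz=12;image/png=3" is split into
two items, and the first one keeps its parameters as part of the mimetype.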