Tuesday, August 17, 2021

Document searching and indexing export - Part 2

Milestone 2 - Rendering an image of the search result


In part 1, I talked about the functionality added to LibreOffice to create an indexing XML file from a document, which can be fed into a search indexing engine. After a search, we expect a search hit to contain the internal node information that was added in the indexing XML file. The next step is to use that information to render the relevant part of the document into an image.

Thanks to NLnet Foundation for sponsoring this work.

Figure 1: Example of a rectangle for a search string

Calculating the result rectangle

To render an image, we first need to determine the area of the document where the search hit is located. This is implemented in the SearchResultLocator class, which takes a SearchIndexData that contains the internal model index of the hit location (object and paragraph). The algorithm finds the paragraph in the document model and then determines the paragraph's rectangle.

A search hit can span multiple paragraphs, so we need to handle multiple hit locations. This gives us multiple rectangles, which need to be combined into a final rectangle (the union of all rectangles). See figure 1 for an example.
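To make the union step concrete, here is a minimal sketch of combining several hit rectangles into one bounding rectangle. This is not the actual LibreOffice code (which uses its own rectangle classes); the Rect struct and function name are just for illustration:

#include <algorithm>
#include <vector>

// Stand-in for a rectangle in document coordinates; the real implementation
// uses LibreOffice's own rectangle types.
struct Rect
{
    long left, top, right, bottom;
};

// Union of all hit rectangles (assumes at least one rectangle is present).
Rect combineRectangles(const std::vector<Rect>& rRectangles)
{
    Rect aResult = rRectangles.front();
    for (const Rect& rRect : rRectangles)
    {
        aResult.left = std::min(aResult.left, rRect.left);
        aResult.top = std::min(aResult.top, rRect.top);
        aResult.right = std::max(aResult.right, rRect.right);
        aResult.bottom = std::max(aResult.bottom, rRect.bottom);
    }
    return aResult;
}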

Rendering the image from the rectangle in LOKit

This part is implemented in the LOKit API, which can already render parts of a document through its existing tile rendering support.

The new function added to the API is:
bool renderSearchResult(const char* pSearchResult, unsigned char** pBitmapBuffer, int* pWidth, int* pHeight, size_t* pByteSize);

The method renders an image for the search result. The input is pSearchResult (XML); pBitmapBuffer, pWidth, pHeight and pByteSize are output parameters.

If the call succeeds, the function returns true, pBitmapBuffer contains the raw image, pWidth and pHeight contain the width and height of the image in pixels, and pByteSize the byte size of the image.
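To illustrate how a LOKit client might call this, here is a hedged sketch. It assumes an already loaded LibreOfficeKitDocument* and goes through the C function table with the document handle as the first argument, following the usual LOKit pattern; freeing the returned buffer with free() at the end is an assumption, not something stated in the post:

#include <LibreOfficeKit/LibreOfficeKit.h>
#include <cstdio>
#include <cstdlib>

// pDocument is an already loaded document, pSearchResultXml is the search
// result XML described above.
void renderHit(LibreOfficeKitDocument* pDocument, const char* pSearchResultXml)
{
    unsigned char* pBitmapBuffer = nullptr;
    int nWidth = 0;
    int nHeight = 0;
    size_t nByteSize = 0;

    bool bSuccess = pDocument->pClass->renderSearchResult(
        pDocument, pSearchResultXml, &pBitmapBuffer, &nWidth, &nHeight, &nByteSize);

    if (bSuccess && pBitmapBuffer)
    {
        std::printf("Rendered %d x %d image, %zu bytes\n", nWidth, nHeight, nByteSize);
        // The raw pixels in pBitmapBuffer can now be encoded, e.g. into PNG.
        std::free(pBitmapBuffer); // assumption: the caller owns the buffer
    }
}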

Internally, the content of pSearchResult is parsed with an XML parser to create a SearchIndexData, which is sent to SearchResultLocator to get the rectangle of the search hit area. A call to doc_paintTile then renders the part of the document enclosed by that rectangle into pBitmapBuffer.

See desktop/source/lib/init.cxx - function "doc_renderSearchResult"

Collabora Online service "render-search-result"

To actually be useful, we need to provide this functionality in a form that can be "glued" together with a search provider and indexer, so that the rendered image of a search hit from the document can be shown. For this we have implemented a service in Collabora Online. The service is just an HTTP POST request/response: in the request we send the document and the search result to the service, and the response is the image.

What the service does is:
  • load the document
  • run the "renderSearchResult" with the search result XML
  • interpret the bitmap and encode it into the PNG format
  • return the PNG image
As an example of how the service can be used, see test/integration-http-server.cpp in the Collabora Online repository - test method HTTPServerTest::testRenderSearchResult.
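For illustration only, here is a hedged sketch of such a request using Poco (which Collabora Online's own tests use for HTTP). The endpoint path, the port, and the form field names ("document", "result") are assumptions for the sketch rather than details taken from the post; the referenced test shows the real usage:

#include <Poco/Net/HTTPClientSession.h>
#include <Poco/Net/HTTPRequest.h>
#include <Poco/Net/HTTPResponse.h>
#include <Poco/Net/HTMLForm.h>
#include <Poco/Net/FilePartSource.h>
#include <Poco/Net/StringPartSource.h>
#include <Poco/StreamCopier.h>
#include <fstream>
#include <string>

int main()
{
    // Search result XML from the search engine (schema described in part 1,
    // contents elided here).
    const std::string aSearchResultXml = "...";

    Poco::Net::HTTPClientSession aSession("localhost", 9980);
    Poco::Net::HTTPRequest aRequest(Poco::Net::HTTPRequest::HTTP_POST,
                                    "/render-search-result", // path is an assumption
                                    Poco::Net::HTTPMessage::HTTP_1_1);

    // Multipart form carrying the document and the search result.
    Poco::Net::HTMLForm aForm(Poco::Net::HTMLForm::ENCODING_MULTIPART);
    aForm.addPart("document", new Poco::Net::FilePartSource("document.odt"));
    aForm.addPart("result", new Poco::Net::StringPartSource(aSearchResultXml));
    aForm.prepareSubmit(aRequest);
    aForm.write(aSession.sendRequest(aRequest));

    // The response body is the rendered PNG image.
    Poco::Net::HTTPResponse aResponse;
    std::istream& rResponseStream = aSession.receiveResponse(aResponse);
    std::ofstream aOutput("search-hit.png", std::ios::binary);
    Poco::StreamCopier::copyStream(rResponseStream, aOutput);
    return 0;
}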

The following commits implement this milestone 2 functionality:

Core: