Monday, September 13, 2021

Document searching and indexing export - Part 3

Milestone 3 - Gluing everything together into a search solution

In the part 1 we looked into indexing XML export, and in the part 2 into rendering a search result as an image. In this part we will glue together both parts with an indexing search engine (Solr) into a full solution for searching and introduce a "proof of concept" application for searching of documents.

Thanks to NLnet Foundation for sponsoring this work.

Solr search platform

Apache Solr is a popular platform for searching and the idea is to use it as our search and indexing engine. First we need to figure out how to put the indexing data from our indexing XML into Solr. Solr uses the concept of documents (not to be confused with a LibreOffice document), which is an entry in the database, which can contain multiple fields. To add documents into the database, we can use a specially structured Solr XML file (many others are supported, like JSON) and simply send it using a HTTP POST request.   

So we need to convert our indexing XML into Solr structure, which is done like this:
  • Each paragraph or object is a Solr document.
  • All the attributes of a paragraph or object is an field of a Solr document.
  • The paragraph text is stored in a "content" field.
  • An additional field is "filename", which is the name of the source (Writer) document.
For example:
    <paragraph index="9" node_type="writer">Lorem ipsum</paragraph>

transforms to:
    <add>
      <doc>
        <field name="filename">Lorem.odt</field>
        <field name="type">paragraph</field>
        <field name="index">9</field>
        <field name="node_type">writer</field>
        <field name="content">Lorem ipsum</field>
      </doc>
      ...
    </add>

Searching using Solr

Solr has a extensive API for querying/searching, but for our needs we just need a small subset of those. Searching is done by sending a HTTP GET to Solr server. For example with the following URL in browser:

http://localhost:8983/solr/documents/select?q=content:Lorem*

"documents" in the URL is the name of the collection (where we put our index data), "q" parameter is the query string, "content" is the field we want to search in (we put the paragraphs text in "content" field) and "Lorem*" is the expression we want to search for.


Proof of concept web application

Figure 1: Search "proof of concept" web application


The application is written in python for the server side processing and HTTP server and the client side HTML+JavaScript using AngularJS (for data binding, REST services) and Bootstrap (UI). The purpose of the web app is to demonstrate how to implement searching and rendering in other web applications.

The web app (see Figure 1) shows a list of documents in a configurable folder, where each document can be opened in Collabora Online instance. On top there is a edit filed and the "Search" button, with which we can search the documents, and a "Re-Index Documents" button, which triggers re-indexing of all the documents. 
Figure 2: Search "proof of concept" web application - Search Results

After we enter a search expression and click the "Search" button, we get a page with search results, which is a table of the document filename and the rendered image from the document, where in the document the search result has been found. See Figure 2 for an example.
There is a "Clear" button at the bottom, which clears the search results and shows the initial list of documents again.

About Server.py - REST and HTTP server

The server has the following services:
  • Provide the HTML and JS documents to the browser, so the web app can be shown
  • GET service "/document" - returns a list of documents
  • POST service "/search" - triggers a query in Solr and returns the result
  • POST service "/reindex" - triggers the re-indexing process
  • POST service "/image" - triggers rendering of an image for the input search result, and returns the image as base64 encoded string

Re-indexing service

Re-indexing glues together the "convert-to" service of the Collabora Online server, to get the indexing XML for a input document, conversion of the indexing XML to Solr supported XML and updating the entries in the Solr server.

Search service

Search service is using the Solr query REST service to search, and transforms the result to a JSON format, that we can use in the web app and is also compatible to use as an input to render a search result.

Image service

Sending a search result and the document to "render-search-result" HTTP POST service on Collabora Online server, the image of the search result is rendered and sent back. For easier use in the web client, the image is converted to base64 string. 

Demo video

Video showing searching in the WebApp:


Video showing re-indexing in the WebApp:




Proof of concept web app source location and relevant commits

The proof of concept web application is located in Collabora Online source tree inside the indexing sub-folder. Please check the README file on how to start it up.

Collabora Online:

Fixes and changes for LibreOffice core: