Thursday, July 1, 2021

Document searching and indexing export - Part 1

About the idea

Searching for a phrase across multiple documents is nothing new, and many implementations exist. However, such a search usually only tells you whether, and roughly where, the phrase occurs in the documents. With Collabora Online and LibreOffice we can do better and additionally present each search result as a thumbnail of the location where the phrase was found. This makes it easier for the user to see the context of the match: for example, whether it is in a table, a shape, a header or footer, figure text, or the "alt" text of an image.

Thanks to NLnet Foundation, the sponsor of this work, we are implementing this solution for Writer documents.

The solution consists of three parts:

  • preparing the data for indexing
  • indexing and searching
  • rendering of the result

Preparing the data for indexing and rendering the search result are done in LibreOffice core, while the actual indexing and searching are delegated to one of the existing indexing and searching databases / frameworks (we will provide support for Apache Solr).

In this post I will describe what has been done for milestone 1.

Milestone 1 - preparing data for indexing

Indexing data usually consists of (enriched) text. In our case, however, we also need to provide additional internal information about where the text is located, so that it is possible to go to the search result location later and create a thumbnail of that part of the document. In Writer we can provide the node index of the paragraph, which makes it possible to quickly identify the text in the document model and generate a thumbnail of the area around it.

The data for indexing is provided by an "indexing export" filter in LibreOffice, which creates an XML document with a custom structure. The root element is <indexing> and the child elements are paragraphs with an index and text, which can be nested in sub-elements (like image, shape, table, section) depending on where the paragraph is located.

For example:

 <?xml version="1.0" encoding="UTF-8"?>
 <indexing>
  <paragraph index="6">Drawing : Just a Diamond</paragraph>
  <paragraph index="12"></paragraph>
  <shape name="Circle" alt="" description="">
   <paragraph index="0">This is a circle</paragraph>
   <paragraph index="1">This is a second paragraph</paragraph>
  </shape>
  <shape name="Diamond" alt="" description="">
   <paragraph index="0">This is a diamond</paragraph>
  </shape>
  <shape name="Text Frame 1" alt="" description="">
   <paragraph index="0">This is a TextBox - Para1</paragraph>
   <paragraph index="1">Para2</paragraph>
   <paragraph index="2">Para3</paragraph>
  </shape>
 </indexing>
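
To illustrate how a consumer might read this kind of export before handing it to a search backend, here is a minimal Python sketch. It assumes a well-formed file with an <indexing> root, as described above. The names IndexedParagraph, collect_paragraphs and parse_indexing_xml are hypothetical helpers invented for this example and are not part of LibreOffice.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

# Shortened sample in the structure described in this post.
SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<indexing>
 <paragraph index="6">Drawing : Just a Diamond</paragraph>
 <paragraph index="12"></paragraph>
 <shape name="Circle" alt="" description="">
  <paragraph index="0">This is a circle</paragraph>
  <paragraph index="1">This is a second paragraph</paragraph>
 </shape>
 <shape name="Diamond" alt="" description="">
  <paragraph index="0">This is a diamond</paragraph>
 </shape>
</indexing>
"""


@dataclass
class IndexedParagraph:
    """One searchable paragraph (hypothetical record type)."""
    index: int      # value of the "index" attribute
    text: str       # paragraph text, empty string if none
    container: str  # "body", or e.g. "shape:Circle" for nested paragraphs


def collect_paragraphs(element: ET.Element, container: str = "body") -> List[IndexedParagraph]:
    """Walk the element tree and remember the enclosing container of each paragraph."""
    found: List[IndexedParagraph] = []
    for child in element:
        if child.tag == "paragraph":
            found.append(
                IndexedParagraph(int(child.get("index", "-1")), child.text or "", container)
            )
        else:
            # Any other element (shape, table, section, ...) is treated as a container.
            name = child.get("name", "")
            label = f"{child.tag}:{name}" if name else child.tag
            found.extend(collect_paragraphs(child, label))
    return found


def parse_indexing_xml(data: bytes) -> List[IndexedParagraph]:
    root = ET.fromstring(data)
    if root.tag != "indexing":
        raise ValueError(f"unexpected root element: {root.tag}")
    return collect_paragraphs(root)


if __name__ == "__main__":
    for paragraph in parse_indexing_xml(SAMPLE_XML.encode("utf-8")):
        print(paragraph)
```

Keeping the container label next to the paragraph index is one way to carry the "where is it located" information along with the text; a real indexer may choose a different record layout.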

The indexing export is built upon a ModelTraverser class, which was created for indexing but can be reused for other purposes (it is similar to what AccessibilityCheck does, but generalised, so AccessibilityCheck could be refactored to use it in the future).

The purpose of ModelTraverser is to traverse the Writer document model and provide SwNode and SdrObject instances to the consuming objects - in our case the IndexingExport class, which extracts the text from those objects (depending on the object type) and, with the help of an XmlWriter, writes the indexing data to the XML file.
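
The general shape of this traverser-plus-consumer design can be sketched in a few lines of Python. This is only an illustration of the pattern, not the actual C++ code in LibreOffice core: TextNode, Shape, Document, Traverser and IndexingHandler are simplified, hypothetical stand-ins, and the real classes have different interfaces.

```python
from dataclasses import dataclass, field
from typing import List
from xml.sax.saxutils import escape, quoteattr


@dataclass
class TextNode:
    """Hypothetical stand-in for a paragraph node in the document model."""
    index: int
    text: str


@dataclass
class Shape:
    """Hypothetical stand-in for a drawing object that carries text."""
    name: str
    paragraphs: List[str]


@dataclass
class Document:
    nodes: List[TextNode] = field(default_factory=list)
    shapes: List[Shape] = field(default_factory=list)


class Traverser:
    """Walks the model and hands every object to the registered handlers."""

    def __init__(self, document: Document) -> None:
        self.document = document
        self.handlers = []

    def add_handler(self, handler) -> None:
        self.handlers.append(handler)

    def traverse(self) -> None:
        for node in self.document.nodes:
            for handler in self.handlers:
                handler.handle_node(node)
        for shape in self.document.shapes:
            for handler in self.handlers:
                handler.handle_shape(shape)


class IndexingHandler:
    """Consumer that extracts text and collects it as indexing XML lines."""

    def __init__(self) -> None:
        self.lines = ["<indexing>"]

    def handle_node(self, node: TextNode) -> None:
        self.lines.append(f' <paragraph index="{node.index}">{escape(node.text)}</paragraph>')

    def handle_shape(self, shape: Shape) -> None:
        self.lines.append(f" <shape name={quoteattr(shape.name)}>")
        for position, text in enumerate(shape.paragraphs):
            self.lines.append(f'  <paragraph index="{position}">{escape(text)}</paragraph>')
        self.lines.append(" </shape>")

    def finish(self) -> str:
        return "\n".join(self.lines + ["</indexing>"])


if __name__ == "__main__":
    doc = Document(
        nodes=[TextNode(6, "Drawing : Just a Diamond")],
        shapes=[Shape("Circle", ["This is a circle"])],
    )
    traverser = Traverser(doc)
    handler = IndexingHandler()
    traverser.add_handler(handler)
    traverser.traverse()
    print(handler.finish())
```

Because the traverser knows nothing about XML, a second handler (for example one performing accessibility checks) could be registered without changing the traversal code, which is the reuse the design aims for.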

The indexing export filter can be tested with the LibreOffice command-line "convert-to" option in the following way:

soffice --convert-to xml:writer_indexing_export <Writer document file path>
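
For batch use, the same command could be wrapped in a small script. The following Python sketch assumes that soffice is on the PATH and that the installed build includes the indexing export filter. It also assumes the usual convert-to behaviour of writing a file named after the input with the new extension into the output directory. The function names are hypothetical helpers; --headless and --outdir are standard soffice options added here for unattended runs.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(document: Path, outdir: Path, soffice: str = "soffice") -> List[str]:
    """Build the argument list for the indexing export conversion."""
    return [
        soffice,
        "--headless",                     # run without a user interface
        "--convert-to",
        "xml:writer_indexing_export",     # filter name as given in this post
        "--outdir",
        str(outdir),
        str(document),
    ]


def export_for_indexing(document: Path, outdir: Path, soffice: str = "soffice") -> Path:
    """Run the conversion and return the path where the XML is expected to appear."""
    if shutil.which(soffice) is None:
        raise FileNotFoundError(f"{soffice} was not found on PATH")
    outdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(document, outdir, soffice), check=True)
    # Assumption: the output keeps the input's base name with an .xml extension.
    return outdir / (document.stem + ".xml")


if __name__ == "__main__":
    # Only prints the command; call export_for_indexing() to actually run it.
    print(" ".join(build_convert_command(Path("report.odt"), Path("out"))))
```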

The commits implementing this milestone 1 functionality:

In the next milestone, we will render the thumbnail with the provided search result data.

To be continued...