About the idea
Searching for a phrase across multiple documents is nothing new, and many
implementations exist. However, such a search usually only tells you whether,
and roughly where, the phrase occurs in a document. With Collabora Online and
LibreOffice we can do better and additionally present each search result as a
thumbnail of the location where the phrase was found. This makes it easier for
the user to see the context of the match: for example, whether it is located
in a table, a shape, a header or footer, figure text, or the "alt" text of an
image.
Thanks to the sponsor of this work, the NLnet Foundation, we are implementing
this solution for Writer documents.
The solution consists of three parts:
- preparing the data for indexing,
- indexing and searching,
- rendering the result.
Preparing the data for indexing and rendering the search result are done in
LibreOffice core, while the actual indexing and searching are delegated to one
of the existing indexing and search frameworks (we will provide support for
Apache Solr).
In this post I will describe what has been done for milestone 1.
Milestone 1 - preparing data for indexing
Indexing data usually consists of (enriched) text. In our case, however, we
also need to record where in the document the text is located, so that we can
later go to the location of a search result and create a thumbnail of that
part of the document. In Writer we can provide the node index of the
paragraph, which makes it possible to quickly identify the text in the
document model and generate a thumbnail of the area around it.
The data for indexing is produced by an "indexing export" filter in
LibreOffice, which creates an XML document with a custom structure. The root
element is <indexing>, and its child elements are paragraphs, each carrying an
index and its text. Paragraphs can be nested in other elements (such as image,
shape, table or section), depending on where the paragraph is located.
For example:
<?xml version="1.0" encoding="UTF-8"?>
<indexing>
  <paragraph index="6">Drawing : Just a Diamond</paragraph>
  <paragraph index="12"></paragraph>
  <shape name="Circle" alt="" description="">
    <paragraph index="0">This is a circle</paragraph>
    <paragraph index="1">This is a second paragraph</paragraph>
  </shape>
  <shape name="Diamond" alt="" description="">
    <paragraph index="0">This is a diamond</paragraph>
  </shape>
  <shape name="Text Frame 1" alt="" description="">
    <paragraph index="0">This is a TextBox - Para1</paragraph>
    <paragraph index="1">Para2</paragraph>
    <paragraph index="2">Para3</paragraph>
  </shape>
</indexing>
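To illustrate how a consumer might turn this XML into flat records for an indexing framework, here is a minimal Python sketch. It is not part of LibreOffice or of the work described here; the record layout, the flatten function and the "body" label for top-level paragraphs are assumptions made for illustration. Note that in the sample the paragraph index restarts at 0 inside each shape, so the sketch keeps the enclosing element and its name next to the index.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample indexing XML shown above.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<indexing>
  <paragraph index="6">Drawing : Just a Diamond</paragraph>
  <paragraph index="12"></paragraph>
  <shape name="Circle" alt="" description="">
    <paragraph index="0">This is a circle</paragraph>
  </shape>
</indexing>
"""


def flatten(xml_text):
    """Return one record per non-empty paragraph, tagged with its innermost container."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    records = []

    def walk(element, container, name):
        for child in element:
            if child.tag == "paragraph":
                # Collapse line breaks and repeated spaces inside the paragraph text.
                text = " ".join((child.text or "").split())
                if text:  # skip empty paragraphs, there is nothing to index
                    records.append({
                        "index": int(child.get("index")),
                        "text": text,
                        "container": container,
                        "name": name,
                    })
            else:
                # Any other element (shape, table, section, ...) is treated as a container.
                walk(child, child.tag, child.get("name", ""))

    # "body" is a hypothetical label for paragraphs directly under <indexing>.
    walk(root, "body", "")
    return records


if __name__ == "__main__":
    for record in flatten(SAMPLE):
        print(record)
```

Only the innermost container is recorded here; a real consumer might want to keep the full nesting path (for example a table inside a section).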
The indexing export is built upon a ModelTraverser class, which was created
for indexing but can be reused for other purposes (it is similar to what
AccessibilityCheck does, but generalised, so AccessibilityCheck could be
refactored to use it in the future).
The purpose of ModelTraverser is to traverse the Writer document model and
provide SwNode and SdrObject instances to the consuming objects. In our case
the consumer is the IndexingExport class, which extracts the text from those
objects (depending on the object type) and, with the help of an XmlWriter,
writes the indexing data to the XML file.
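The split between a traverser and its consumers can be sketched in a few lines of Python. This is only a toy model of the pattern, not the LibreOffice C++ code: TextNode, DrawObject, Handler, Traverser and TextCollector are hypothetical names standing in for the document nodes, the drawing objects, the traverser and a consumer such as the indexing export.

```python
from dataclasses import dataclass, field


@dataclass
class TextNode:
    """Hypothetical stand-in for a paragraph node in the document model."""
    index: int
    text: str


@dataclass
class DrawObject:
    """Hypothetical stand-in for a drawing object that carries its own text."""
    name: str
    paragraphs: list = field(default_factory=list)


class Handler:
    """Interface for consumers; the traverser calls these for each item it visits."""

    def handle_node(self, node):
        pass

    def handle_object(self, obj):
        pass


class Traverser:
    """Walks the (toy) model once and forwards every item to all handlers."""

    def __init__(self, nodes, objects):
        self.nodes = nodes
        self.objects = objects
        self.handlers = []

    def add_handler(self, handler):
        self.handlers.append(handler)

    def traverse(self):
        for node in self.nodes:
            for handler in self.handlers:
                handler.handle_node(node)
        for obj in self.objects:
            for handler in self.handlers:
                handler.handle_object(obj)


class TextCollector(Handler):
    """Example consumer: gathers (location, index, text) tuples."""

    def __init__(self):
        self.entries = []

    def handle_node(self, node):
        self.entries.append(("body", node.index, node.text))

    def handle_object(self, obj):
        for i, text in enumerate(obj.paragraphs):
            self.entries.append((obj.name, i, text))


if __name__ == "__main__":
    traverser = Traverser(
        nodes=[TextNode(6, "Drawing : Just a Diamond")],
        objects=[DrawObject("Circle", ["This is a circle"])],
    )
    collector = TextCollector()
    traverser.add_handler(collector)
    traverser.traverse()
    print(collector.entries)
```

Because the traversal logic lives in one place, a second consumer (for example an accessibility check) would only need to implement the handler interface and register itself.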
The indexing export filter can be tested with the LibreOffice command line "convert-to" tool in the following way:
soffice --convert-to xml:writer_indexing_export <Writer document file path>
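For batch use, the same command can be driven from a script. The following Python sketch wraps it with subprocess; the function names and the report.odt file name are hypothetical. The --headless and --outdir options are standard soffice switches, and the returned output path assumes the usual convert-to naming (input base name with the new extension), which is worth verifying on your installation.

```python
import subprocess
from pathlib import Path


def build_command(document, outdir, soffice="soffice"):
    """Build the argument list for a headless indexing export of one document."""
    return [
        soffice,
        "--headless",  # run without a UI
        "--convert-to", "xml:writer_indexing_export",  # filter name from the post
        "--outdir", str(outdir),
        str(document),
    ]


def export_indexing_xml(document, outdir, soffice="soffice"):
    """Run the export and return the path where the XML is expected to appear."""
    subprocess.run(build_command(document, outdir, soffice), check=True)
    # Assumption: the output keeps the input's base name with an .xml extension.
    return Path(outdir) / (Path(document).stem + ".xml")


if __name__ == "__main__":
    # Only prints the command; call export_indexing_xml() to actually run soffice.
    print(build_command("report.odt", "out"))
```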
The commits implementing this milestone 1 functionality:
In the next milestone, we will render the thumbnail with the provided search result data.
To be continued...