Asciidoctor is a great language for documentation and its API enables you to do awesome things (as long as you accept paying the initialization cost of the Java binding). However, some tasks are not built in and require some glue code. This is the case for search support.

To add search to your documentation you have multiple options:

  1. Run an Elasticsearch + Shield instance (or equivalent) and use it as a search API,
  2. Pay for an online service providing a search API,
  3. Do a client-side search in JavaScript.

Options 1 and 2 are quite close in terms of documentation impact (but not in terms of infrastructure), but they require managing a runtime or paying. They are relevant for very large documentations but less so for small/medium ones. For that last category, you can implement a client-side search.

To implement a client-side search, you can:

  1. Create a build/rendering-time index and directly look it up at runtime,
  2. Create a documentation dictionary, map it to an index at runtime, then use it to implement your search.

Personally I tend to prefer option 2, even if it is a bit slower (you are unlikely to notice it), because the dictionary can stay the same while you update the search library, and you can filter the indexation at runtime, whereas option 1 prevents you from adapting the index to the current page, the logged-in user or any other criterion.

So the first step for us will be to create a dictionary usable by the search. Here, I recommend checking out your search library and the format it expects. Some support objects (like Fuse.js), others only support strings (like FlexSearch).

In my case I picked FlexSearch because its results are quite accurate and it is fast even for thousands of records.

In terms of code, you will need these steps:

const index = FlexSearch.create(); // 1
getMyDictionnary().forEach((entry, idx) => index.add(idx, entry.content)); // 2
const searchResultIds = index.search(query); // 3
  1. You create an empty index,
  2. You index your dictionary,
  3. You do a search.

Tip: if you are in a SPA, make sure to create the index only once for all pages. In Angular this can be done through a service responsible for the index management.

The previous snippet already gives you hints about the structure we will use. Here is how the search results will be used:

  1. They will be rendered as suggestions, so we need the rendering data; for us it will be a title and a content (short description),
  2. Each result must be linkable (as an href), so we will have a link and optionally an anchor.

Side note: as you may have noticed, FlexSearch is based on identifiers, but in our case we will use the position of the entry in the dictionary array, so we can bypass that; if you have a natural id you can use it instead.

So what we want to generate is this dictionary:

[
  {
    "title": "...",
    "content": "...",
    "link": "...",
    "anchor": "..."
  },
  {
    "title": "...",
    "content": "...",
    "link": "...",
    "anchor": "..."
  }
]

Note: you could of course merge the link and anchor in a single "link" string, but this structure avoids having to reparse the link in a SPA based on a router API like Angular's.

Assuming we have this JSON at the URI search/index.json, here is some code able to handle a search (make sure to adapt it to your website).

First we create a form leading to the search result page:

<form role="search" action="search.html">
  <input type="text" placeholder="Search..." name="query" id="searchInput">
</form>

Nothing crazy there, just a plain old form sending the search input as a query parameter to the search.html page.

Then, to use FlexSearch, we need to import it. You can use npm if you have it, or just a CDN link:

<script src="https://cdnjs.cloudflare.com/ajax/libs/FlexSearch/0.6.22/flexsearch.light.js"
    integrity="sha256-yHgMhLaDJWehbr/+au9lLLUq4+5rJKrwkSCz0s68Zdw="
    crossorigin="anonymous"></script>

Then we just add a div to contain our results in the search.html page:

<div class="search-results"></div>

Finally we need to fetch the JSON index, create a FlexSearch index, index the dictionary and render a list of results based on the search matches. Here is a snippet doing it (which must be added/imported in the search.html page):

new Promise(function (done) {
   var index = new FlexSearch(); // 1
   fetch('search/index.json') // 2
    .then(function (r) {return r.json();}) // 3
    .then(function (data) {
        data.forEach((entry, idx) => index.add(idx, entry.content)); // 4
        var result = {index,data}; // 5
        done(result);
        return result;
    });
}).then(function (wrapper) {
    var query = decodeURIComponent((location.search.split('query=')[1] || '').split('&')[0].replace(/\+/g, ' ')); // 6
    document.querySelector('#searchInput').value = query; // 7
    wrapper.index.search(query, function (result) { // 8
        var resultDiv = document.querySelector('.search-results'); // 9
        if (!result || result.length === 0) {
            resultDiv.innerHTML = 'No result found';
        } else {
            resultDiv.innerHTML = (result || []).map(i => {
                var toRender = wrapper.data[i]; // 10
                return '<div class="search-result-container">' +
                    '<a href="' + toRender.link + '#' + toRender.anchor + '">' + toRender.title + '</a>' +
                    '<p>' + toRender.content + '</p>' +
                '</div>';
            }).join('');
        }
    });
});
  1. We create the FlexSearch index,
  2. We fetch our dictionary,
  3. We map the response to a JavaScript object (JSON),
  4. We index our dictionary, using the position of the entry in the dictionary as identifier and its content as indexed data,
  5. Since the fetch is asynchronous, we use a promise to load the dictionary and build a "message" containing both the index and the dictionary so we can handle the search results later. This "wrapper" is our promise result,
  6. We extract the query from the URL,
  7. We propagate the query string back to the input (since the form made us change page; this is useless in a SPA),
  8. We do the search using the FlexSearch index,
  9. When we get the search results, we render them in the div we created for that purpose,
  10. This is where the FlexSearch identifier (the array index for us) is useful: from the identifier returned by the search, we can look up the original entry in the dictionary and therefore get the title/content/link/anchor to render the search result in HTML.

At that point, we have all the HTML/JavaScript code for a functional search, but it is missing the most important part: the dictionary creation.

What do we have as input? An asciidoc documentation. What do we want as output? A JSON file representing this documentation.

So the creation of the dictionary can be split into these steps:

  1. Visit all pages of the documentation,
  2. Parse them all to extract each part of the documentation,
  3. Create the JSON.

This second step is very important and enables a more accurate search and more precise links. For example, if your documentation contains a reference or configuration page, it will likely be huge, so almost any search will match it, but the user will then land at the top of the page and not exactly where the search matched. To fix that, you must split the asciidoctor page into subparts that are small but still have some consistency. In practice this means paragraphs, snippets and, more generally, unitary elements.

I'll use Java to implement this logic, but Ruby or JavaScript are two very relevant alternatives (although if you have a Java project and a living documentation, you likely already started coding this part in Java).

So the first step is to extract all adoc files. Here is a quick snippet doing that in Java:

final Path root = Paths.get("documentation/asciidoctor"); // 1
final Collection<Path> adocs = new ArrayList<>(); // 2
Files.walkFileTree(root, new SimpleFileVisitor<Path>() { // 3
    @Override // 4
    public FileVisitResult visitFile(final Path file, final BasicFileAttributes attrs) throws IOException {
        final String name = file.getFileName().toString();
        if (name.endsWith(".adoc") && !"README.adoc".equals(name)) {
            adocs.add(file);
        }
        return super.visitFile(file, attrs);
    }
});
  1. We first create a Path representing the base directory of our documentation,
  2. We accumulate all matching files in a list (declared outside the visitor so we can reuse it afterwards; optional, it could be done in streaming, this is just easier to debug),
  3. We visit this directory recursively,
  4. The accumulation is done if the file is an adoc and not the README of the project (make sure to adjust it to your project structure). Note that if some pages must not be indexed, they can be filtered there (or later, if you use an asciidoc attribute to mark them as not indexable; a sketch of that check is shown after the loading snippet below).

Now we need to load each asciidoc document and extract its indexable parts. For that we will use AsciidoctorJ (the Java binding of Asciidoctor, but once again you can do almost the same in Ruby/JS):

final Asciidoctor asciidoctor = Asciidoctor.Factory.create();
// optionally register extensions

Then the overall logic will be:

final Options options = OptionsBuilder.options() // 1
    .backend("html5")
    .inPlace(false)
    .headerFooter(false) // 2
    .baseDir(root.toFile())
    .safe(SafeMode.UNSAFE)
    .get();
final Collection<SearchableEntry> entries = adocs.stream() // 3
        .flatMap(adoc -> {
            try {
                final Document document = asciidoctor.load( // 4
                    String.join("\n", Files.readAllLines(adoc)), options);
                return createDictionnaryElements(adoc, document); // 5
            } catch (final IOException e) {
                throw new IllegalStateException(e);
            }
        }).collect(toList());
  1. We create options shared by all the documents,
  2. We ensure to bypass header and footer elements which don't bring anything to the search,
  3. We iterate over all documents,
  4. We load their AST in memory (Document),
  5. We explode the document into indexable subelements.
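
If some pages are marked as not indexable with an asciidoc attribute (as mentioned earlier), this is a good place to filter them out. A minimal sketch, assuming a hypothetical search-ignore attribute set in the pages to skip, could be placed right after the document is loaded:

// inside the flatMap above, once the document is loaded
// "search-ignore" is a hypothetical attribute name, use whatever fits your documentation
if (document.getAttributes().containsKey("search-ignore")) {
    return Stream.empty(); // skip this page entirely
}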

The question now is how to implement createDictionnaryElements. The first step is to extract all sections of the pages:

final Stream<Section> sections = document.getBlocks().stream()
  .filter(Section.class::isInstance)
  .map(Section.class::cast);

Then we must extract all the subelements, but there is an important thing not to forget: the anchors. We want to create precise links, i.e. the page plus the anchor. Since Asciidoctor generates an "id" for almost all HTML elements, it is mainly a matter of tracking the closest id of any element.

The method doing that will take as input a stream of nodes and the closest id, and will return a stream of ContentWithId objects:

@Data
public class ContentWithId {
    private final String id;
    private final String content;
}

Tip: if you are already on Java 14, don't hesitate to use a record ;).
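
For reference, the record equivalent is a one-liner:

public record ContentWithId(String id, String content) {}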

The method doing this mapping will need to explode blocks, explode tables into cells, explode lists into list items and, once a leaf is reached, create a ContentWithId using the parent id and the current node content. Here is a simple implementation:

private Stream<ContentWithId> extractIndexableTexts(final Stream<StructuralNode> nodes, final String lastId) {
    return nodes.flatMap(n -> { // 1
        if (n.isBlock()) { // 2
            final List<StructuralNode> blocks = n.getBlocks();
            if (!blocks.isEmpty()) { // 3
                return extractIndexableTexts(blocks.stream(), ofNullable(n.getId()).orElse(lastId));
            }
        }
        if (Table.class.isInstance(n)) { // 4
            final Table table = Table.class.cast(n);
            return table.getBody().stream()
                    .flatMap(t -> t.getCells().stream()) // 5
                    .map(Cell::getContent)
                    .map(String::valueOf)
                    .map(c -> new ContentWithId(
                        ofNullable(table.getId()).orElse(lastId),  // 7
                        c));
        }
        return Stream.of(n).map(c -> { // 6
            final Object content = c.getContent();
            if (String.class.isInstance(content) && !String.valueOf(content).isEmpty()) {
                return new ContentWithId(
                    ofNullable(n.getId()).orElse(lastId), // 7
                    String.valueOf(content));
            }
            if (ListItem.class.isInstance(c)) {
                return new ContentWithId(
                    ofNullable(n.getId()).orElse(lastId), // 7
                    ListItem.class.cast(c).getText());
            }
            if ("empty".equals(c.getContentModel())) {
                return null;
            }
            if (Section.class.isInstance(c) && c.getTitle() != null && !c.getTitle().isEmpty()) { // just a title, can happen in generated doc
                return null;
            }
            throw new IllegalArgumentException("unsupported: " + c);
        });
    });
}
  1. We take the nodes and want to convert them to a list of subelements, so we flatMap the original nodes,
  2. If we have a block, we recursively apply the same logic to its subblocks,
  3. If there is no subblock, we handle the current block directly to ensure we index it,
  4. If we have a table, we want to index its content,
  5. So we explode the table into cells,
  6. If the node is not explodable, we process it directly,
  7. Each time we go down the block hierarchy, we make sure to propagate the identifier: we take the id of the current element or inherit the parent one.

This implementation does not handle all potential Asciidoctor AST elements but works for most documentations.

Then we must filter out potentially undesired elements, so we can append a filter on this stream:

contentWithIds.filter(it -> it != null && isIndexable(it.getContent()))

Personally I use this kind of heuristic:

private boolean isIndexable(final String str) {
    return str.length() > 2 && // too small
           str.length() < 2000 && // too big, likely a "full" snippet
           !isHtml(str); // passthrough page
}

private boolean isHtml(final String string) {
    return string.startsWith("<div") || string.startsWith("<script");
}

Now that we have the sections and know how to explode them, let's just do it:

final Collection<SearchableEntry> entries = sections
  .flatMap(s -> extractIndexableTexts(Stream.of(s), s.getId()) // 1
                   .map(c -> new SearchableEntry( // 2
                       document.getTitle() + " :: " + s.getTitle(), // 3
                       root.relativize(adoc).toString().replaceAll(".adoc$", ".html"), // 4
                       c.getId(), // 5
                       c.getContent() // 6
                         .replaceAll("<[^>]+>", "").trim())))
  .collect(toList());
  1. For each section we convert it to the stream of subelements,
  2. For each subelement we create a dictionnary entry,
  3. We create the search result title (here the document title suffixed with the section title),
  4. We create the link to the page (simply the relative path from the root with the adoc extension replaced by html; this may need to be refined depending on your deployment, for instance by prefixing it with a context path),
  5. We propagate the anchor (asciidoctor "parent" id we extracted),
  6. Finally we propagate the content with some minor sanitization (we drop html tags).
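
For completeness, SearchableEntry is just a value object mirroring the JSON structure we defined at the beginning. A minimal sketch, reusing Lombok's @Data like ContentWithId (the constructor parameter order is assumed from the snippet above), could be:

@Data
public class SearchableEntry {
    private final String title;
    private final String link;
    private final String anchor;
    private final String content;
}

Putting the pieces together, createDictionnaryElements is roughly the previous snippets chained. Its exact signature is an assumption (here it receives the adoc path and the loaded document, and root stays accessible as a field):

private Stream<SearchableEntry> createDictionnaryElements(final Path adoc, final Document document) {
    return document.getBlocks().stream()
            .filter(Section.class::isInstance) // extract the sections
            .map(Section.class::cast)
            .flatMap(s -> extractIndexableTexts(Stream.of(s), s.getId())
                    .filter(it -> it != null && isIndexable(it.getContent())) // drop undesired elements
                    .map(c -> new SearchableEntry(
                            document.getTitle() + " :: " + s.getTitle(),
                            root.relativize(adoc).toString().replaceAll(".adoc$", ".html"),
                            c.getId(),
                            c.getContent().replaceAll("<[^>]+>", "").trim())));
}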

At that stage we have our dictionary, so we just have to serialize it to a JSON file:

try (final Jsonb jsonb = JsonbBuilder.create(new JsonbConfig()
                .withFormatting(true)
                .withPropertyOrderStrategy(PropertyOrderStrategy.LEXICOGRAPHICAL)
                .setProperty("johnzon.cdi.activated", false))) {
    Files.write(searchJsonPath, jsonb.toJson(entries).getBytes(StandardCharsets.UTF_8));
}

Of course, make sure searchJsonPath matches the URI you fetch in the JavaScript code.
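
searchJsonPath itself is not defined in the snippets above; assuming the rendered documentation lives in an outputDirectory (a hypothetical Path pointing to your build output), it could simply be:

// hypothetical output directory of the rendered documentation
final Path searchJsonPath = outputDirectory.resolve("search/index.json");
Files.createDirectories(searchJsonPath.getParent()); // make sure the search/ folder exists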

Here we are: we have our dictionary properly built and our FlexSearch consumer which builds an index and is wired to our search input, so our documentation just got a search feature almost for free :).

Of course, it can be enhanced a lot (by highlighting the search keywords in the result rendering, by refining the asciidoc document dictionary, etc.), but this already gives pretty good results and enables searching huge reference pages quite efficiently.

Happy writing!
