How to convert an XML document to a JSON one with Sax and JSON-P
JSON-P provides several kinds of API (generator, reader, writer, parser, pointer, ...). These are generally low-level APIs, but they also let you write bridge code which is quite efficient. To illustrate that, this post shows how to convert an XML document to a JSON one by implementing a SAX handler which parses the XML document and feeds a JSON-P object builder.
Technical note: if you want to implement this for a real conversion ending on a stream, you will likely want to replace the object builder with a JSON-P generator to enable streaming. In the context of this post we just want to build a JsonObject in memory so that existing JSON processing (extraction with a pointer, for instance) can be applied to an XML model.
Sax parsing to visit the XML document
The basis for parsing an XML document with SAX is to create a SAX handler and use a SAXParser to visit the document:
final SAXParserFactory factory = SAXParserFactory.newInstance();
final SAXParser parser = factory.newSAXParser();
final MyHandler handler = new MyHandler();
parser.parse(inputStream, handler);
I would recommend caching the SAXParserFactory instead of creating it each time you need one. However, it is not guaranteed to be thread safe, so either create a pool (a cheap pool is a queue ;)) or use a thread local to ensure you reuse the created instances instead of going through the instantiation chain every single time.
Once you have a factory, it is as easy as calling newSAXParser() to get a parser and then calling parse() on the parser, passing your handler instance.
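The thread-local option can be sketched as follows. This is only a minimal illustration: the SaxParsers class and its factory(), parse() and countElements() methods are hypothetical names introduced here, not something JSON-P or the JDK provides.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the original post.
final class SaxParsers {
    // One factory per thread, created lazily the first time a thread asks for it.
    private static final ThreadLocal<SAXParserFactory> FACTORY =
            ThreadLocal.withInitial(SAXParserFactory::newInstance);

    private SaxParsers() {
    }

    static SAXParserFactory factory() {
        return FACTORY.get();
    }

    // Parses the XML string with a new parser obtained from the cached factory.
    static void parse(final String xml, final DefaultHandler handler) {
        try {
            factory().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (final ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Small demo: counts the start tags seen in the document.
    static int countElements(final String xml) {
        final int[] count = {0};
        parse(xml, new DefaultHandler() {
            @Override
            public void startElement(final String uri, final String localName,
                                     final String qName, final Attributes attributes) {
                count[0]++;
            }
        });
        return count[0];
    }
}
```

Calling SaxParsers.factory() twice from the same thread returns the same instance, while another thread gets its own. A queue-based pool is the better fit if you have many short-lived threads.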
The handler gets called when the XML parser visits a tag or some content.
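To make the callback order concrete, here is a small sketch which only records the events the parser emits. The SaxEventLog class and its events() method are hypothetical names used for illustration; the overridden methods are the standard DefaultHandler callbacks.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original post.
final class SaxEventLog {
    private SaxEventLog() {
    }

    // Returns the SAX events in the order the parser emitted them.
    static List<String> events(final String xml) {
        final List<String> events = new ArrayList<>();
        final DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(final String uri, final String localName,
                                     final String qName, final Attributes attributes) {
                events.add("start:" + qName);
            }

            @Override
            public void characters(final char[] ch, final int start, final int length) {
                // The parser may deliver one text node in several chunks.
                events.add("text:" + new String(ch, start, length));
            }

            @Override
            public void endElement(final String uri, final String localName, final String qName) {
                events.add("end:" + qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (final ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return events;
    }
}
```

For a document like <foo><bar>hi</bar></foo>, the start event of foo comes first, the events of bar are nested inside it, and the end event of foo comes last. This nesting is what makes a stack a natural fit for tracking the current tag.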
The Sax Handler to convert the XML document to JSON
There are tons of ways to implement the handler. For this post we'll use a single one which has two responsibilities - but in more advanced code you should split it into two handlers to respect the separation of concerns:
- Handle the XML stack browsing: each time we visit a node/tag we create a new JSON object, so we need to maintain a kind of stack,
- Extract the needed information from the tags (the tag name will be used as an object attribute key, for instance) and create the objects when the end tag is visited.
Here is our handler:
import java.util.LinkedList;
import java.util.List;
import java.util.stream.IntStream;

import javax.json.JsonBuilderFactory; // jakarta.json.* with Jakarta JSON Processing 2.x
import javax.json.JsonObjectBuilder;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class ObjectHandler extends DefaultHandler {
    private final JsonBuilderFactory jsonBuilderFactory;
    private final LinkedList<ObjectHandler> stack;
    private final List<ObjectHandler> children = new LinkedList<>();
    private JsonObjectBuilder builder;
    private String name;
    private Attributes attributes;
    private StringBuilder characters;

    public ObjectHandler(final JsonBuilderFactory jsonBuilderFactory, final LinkedList<ObjectHandler> stack) {
        this.jsonBuilderFactory = jsonBuilderFactory;
        this.stack = stack;
    }

    // lets the calling code register the root handler before parsing
    public LinkedList<ObjectHandler> getStack() {
        return stack;
    }

    // SAX callbacks: delegate to the handler of the tag currently being visited

    @Override
    public void startElement(final String uri, final String localName, final String qName, final Attributes attributes) throws SAXException {
        stack.getLast().onStartElement(qName, attributes);
    }

    @Override
    public void characters(final char[] ch, final int start, final int length) {
        stack.getLast().onCharacter(ch, start, length);
    }

    @Override
    public void endElement(final String uri, final String localName, final String qName) throws SAXException {
        stack.removeLast().onEndElement();
    }

    // data extraction and JSON conversion

    private void onStartElement(final String qName, final Attributes attributes) {
        // the parser may reuse the Attributes instance once the callback returns, so keep a copy
        final Attributes copy = new AttributesImpl(attributes);
        if (this.name == null) { // root tag: handled by this instance
            this.name = qName;
            this.attributes = copy;
        } else { // nested tag: handled by a new child pushed on the stack
            final ObjectHandler handler = new ObjectHandler(jsonBuilderFactory, stack);
            handler.name = qName;
            handler.attributes = copy;
            stack.add(handler);
            children.add(handler);
        }
    }

    private void onCharacter(final char[] ch, final int start, final int length) {
        if (characters == null) {
            characters = new StringBuilder();
        }
        characters.append(ch, start, length);
    }

    private void onEndElement() {
        builder = jsonBuilderFactory.createObjectBuilder();
        if (characters != null) {
            builder.add("__content__", characters.toString());
            characters = null;
        }
        if (attributes != null && attributes.getLength() > 0) {
            builder.add("__attributes__", IntStream.range(0, attributes.getLength())
                    .boxed()
                    .collect(
                            jsonBuilderFactory::createObjectBuilder,
                            (b, idx) -> b.add(attributes.getQName(idx), attributes.getValue(idx)),
                            JsonObjectBuilder::addAll)
                    .build());
        }
        // children sharing the same name overwrite each other (no array support yet)
        children.forEach(c -> builder.add(c.name, c.builder));
        children.clear();
    }
}
We can clearly identify two parts in this implementation which could be split into two handler layers, as mentioned previously. The overridden methods and the stack list maintain the visiting state of the XML document. Each time a tag is visited, a new handler is appended to the list of "current" handlers and the last one is always used to process the current tag. The other attributes are used to extract the data from the XML payload (text etc.) and convert it, at the end of the corresponding tag, to a JSON object using a JsonObjectBuilder.
Note that in this conversion I convert the tag content to the __content__ JSON attribute; this could be replaced by a JsonString instead of a JsonObject if needed. The advantage of keeping an object is being able to add the XML attributes under the __attributes__ object as a "key/value" JSON object.
Use the ObjectHandler
To use this implementation we need to slightly modify the parsing code to add the root object handler to the initial stack. To be fancy we also add a wrapRoot flag which adds a nested JSON object to keep the first level of JSON instead of swallowing the root tag. For instance <foo>bar</foo> would be {"__content__":"bar"} without that flag and {"foo":{"__content__":"bar"}} with it:
final SAXParserFactory factory = SAXParserFactory.newInstance();
final SAXParser parser = factory.newSAXParser();
final ObjectHandler handler = new ObjectHandler(Json.createBuilderFactory(emptyMap()), new LinkedList<>());
handler.getStack().add(handler); // the change
parser.parse(stream, handler);
Going further
This first implementation is functional but has several limitations:
- It doesn't handle arrays: this is not a hard task, but it requires two things:
  - Being able to detect consecutive tags with the same name at the same level to wrap them in a JsonArray,
  - Being able to detect that a wrapper contains only the same nested tags and replace it with a JsonArray (this can require additional configuration).
- It doesn't handle namespaces: the previous implementation assumes all tags use the default namespace; if you use namespaces, you need another way to convert them to JSON. An option can be to add a __namespace__ attribute in such cases, or to change the attribute name to include the full namespace. The alternative, which consists of adding a first-level child object holding the namespace registry, works but is very XML and not really JSON, so consumers would probably be less happy.
- As mentioned a bit earlier, it loads the model in memory. If your goal is to return the JSON model as an HTTP response, you will probably want to enable streaming and therefore replace the builders with a JsonGenerator, which has almost exactly the same API but adapted to a streaming mode.
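For the namespace limitation, one possible starting point is to make the parser namespace aware and build the JSON key from the namespace URI and the local name. The sketch below only computes such keys. The NamespaceKeys class, its keys() method and the {uri}localName key format are assumptions made for this illustration, not a convention defined by JSON-P.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original post.
final class NamespaceKeys {
    private NamespaceKeys() {
    }

    // Returns one candidate JSON key per start tag, in document order.
    static List<String> keys(final String xml) {
        final SAXParserFactory factory = SAXParserFactory.newInstance();
        // Without this flag the uri and localName callback parameters are not reliable.
        factory.setNamespaceAware(true);
        final List<String> keys = new ArrayList<>();
        final DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(final String uri, final String localName,
                                     final String qName, final Attributes attributes) {
                // Assumed key format: {namespace-uri}localName, or just localName without namespace.
                keys.add(uri.isEmpty() ? localName : "{" + uri + "}" + localName);
            }
        };
        try {
            factory.newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (final ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return keys;
    }
}
```

With this approach two tags using different prefixes for the same namespace end up with the same key, which is usually what a JSON consumer expects.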
None of these limitations are blockers: JSON-P provides all you need to solve them, so there is no reason not to use it ;).