Valid, literal XML in C++ with Blasien

2015-07-05

Creating and processing XML feels awkward in most programming languages. With Blasien, a tiny C++11 header library, XML in C++ feels easy and natural. As a bonus, most of the XML that is written is validated at compile time.

Here is an example:

XHTML:

<!DOCTYPE html PUBLIC "..." "...">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>
      Hello world!
    </title>
  </head>
  <body>
    <p>
      Some nice paragraph text.
    </p>
    <img src="http://example.com/hello.jpg" alt="Hello"/>
  </body>
</html>

C++ with Blasien:

XmlWriter<xhtml11::XHtmlDocument>(stream)
<html
  <head
    <title
      <"Hello world!"
    >title
  >head
  <body
    <p
      <"Some nice paragraph text."
    >p
    <img(src="http://example.com/hello.jpg",alt="Hello")>img
  >body
>html;

The same syntax can be used to create a DOM.

Background

Code to create XML is usually a matter of calling functions like startElement, setAttribute, endElement etc. Such code looks nothing like the desired XML. And there is no static type checking. Here is a typical example:

XHTML:

<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p id="hello">
      Hello world!
    </p>
  </body>
</html>

C++:

out.writeStartElement(xhtmlns, "html");
out.writeStartElement(xhtmlns, "body");
out.writeStartElement(xhtmlns, "p");
out.writeAttribute("id", "hello");
out.writeCharacters("Hello world!");
out.writeEndElement();
out.writeEndElement();
out.writeEndElement();

This code looks unpleasant and it is easy to make errors. The tag names are written as strings: a typo there can go undetected for a long time.

Elements are closed with writeEndElement(). Matching up the opening and closing of tags is hard to do visually and errors there are not caught at compile time.

There are programming languages, like XSLT and XQuery, that work better with XML. Calling code in these languages from C++ is inconvenient and requires that the programmer learns an additional programming language.

A few years ago, I created a way to work with XML from C++ in which wrapper classes were created for each element type from a schema definition. This caught many possible errors at compile time, but the code still did not look like XML. Blasien has all the same checks but with a nicer syntax:

C++ with writeodf:

text_p p(xmlWriter);
p.set_text_style_name("my_italic");
text_span span(p.add_text_span());
span.set_text_style_name("my_bold");
span.addTextNode("Hello Calligra!");

C++ with Blasien:

XmlWriter<office::TextType>(stream)
<text::p( text::style_name="my_italic" )
    <text::span( text::style_name="my_bold" )
        <"Hello Calligra!"
    >text::span
>text::p;

Secret sauce 1: operator< and operator>

Blasien is built on a powerful C++ feature: operator overloading. Nearly all operators can be overloaded in C++. For XML, two characters are most distinctive: < and >. In C++ these operators usually mean "less than" and "greater than" and are used in expressions like if (x > 3) { ... }. No rule limits the use of < and > to mathematical comparisons. As you will see, they can be put to very different use.

The operators < and > have the same precedence and are left-associative, so a chain of them is evaluated from left to right: the left-most combination of expression, operator, expression is evaluated first. The left-most expression is a sink for XML expressions. Each handled expression yields a new sink with a different state.

sink < html < body < "hello" > body > html;

can be written more explicitly as:

const HtmlTag html;
const BodyTag body;
const HtmlDocSink sink;
const HtmlSink sink2 = sink < html;
const BodySink sink3 = sink2 < body;
const BodySink sink4 = sink3 < "hello";
const HtmlSink sink5 = sink4 > body;
const HtmlDocSink sink6 = sink5 > html;

which works because of these operator overloads:

HtmlSink operator<(const HtmlDocSink& sink, const HtmlTag& tag) {
    sink.startElement(tag);
    return HtmlSink(sink);
}
BodySink operator<(const HtmlSink& sink, const BodyTag& tag) {
    sink.startElement(tag);
    return BodySink(sink);
}
BodySink operator<(const BodySink& sink, const char* text) {
    sink.writeCharacters(text);
    return sink;
}
HtmlSink operator>(const BodySink& sink, const BodyTag& tag) {
    sink.endElement();
    return sink.base;
}
HtmlDocSink operator>(const HtmlSink& sink, const HtmlTag& tag) {
    sink.endElement();
    return sink.base;
}

The operators < and > are only implemented for valid combinations of parent and child nodes. The compiler accepts sink <html <body but refuses sink <body <html because < is overloaded for a sink inside <html/> on the left and a <body/> tag on the right, but not for a sink inside <body/> on the left and an <html/> tag on the right. The compiler also catches any text nodes that are put in places where they are not allowed.
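
The idea can be condensed into a small self-contained sketch. This is not Blasien's code: the sink types are plain structs that share a std::string buffer (an assumption made for illustration), and hello_document is a hypothetical helper name.

```cpp
#include <string>

// Tag types: empty structs whose only job is to carry a type.
struct HtmlTag {};
struct BodyTag {};

// Each sink type encodes where we are in the document.
// All sinks write to the same buffer through a pointer.
struct DocSink  { std::string* out; };  // outside any element
struct HtmlSink { std::string* out; };  // inside <html>
struct BodySink { std::string* out; };  // inside <body>

// Opening tags: only valid parent/child pairs get an overload.
inline HtmlSink operator<(DocSink s, HtmlTag) {
    *s.out += "<html>";
    return HtmlSink{s.out};
}
inline BodySink operator<(HtmlSink s, BodyTag) {
    *s.out += "<body>";
    return BodySink{s.out};
}
// Text is only allowed inside <body> in this toy schema.
// Note: no escaping is done here.
inline BodySink operator<(BodySink s, const char* text) {
    *s.out += text;
    return s;
}
// Closing tags return the sink type of the parent element.
inline HtmlSink operator>(BodySink s, BodyTag) {
    *s.out += "</body>";
    return HtmlSink{s.out};
}
inline DocSink operator>(HtmlSink s, HtmlTag) {
    *s.out += "</html>";
    return DocSink{s.out};
}

std::string hello_document() {
    std::string out;
    const HtmlTag html{};
    const BodyTag body{};
    // The chain ends in a DocSink again: every opened element was closed.
    const DocSink done = DocSink{&out} < html < body < "hello" > body > html;
    return *done.out;
}
```

Swapping the two tags, as in DocSink{&out} < body < html, does not compile in this sketch because no operator< takes a DocSink on the left and a BodyTag on the right.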

With Blasien, the compiler also checks the XML attributes. Missing required attributes, duplicate attributes and forbidden elements are all caught at compile time.
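
One way such attribute checks could work is sketched below. Everything in it is an assumption for illustration and not taken from Blasien's headers: AttributeNode, SrcName, AltName, count_of, this ImgTag and img_example are hypothetical names. The sketch gives attribute names an operator= that returns a typed value, and lets the tag's operator() count attribute types with static_assert.

```cpp
#include <string>
#include <type_traits>

// An attribute value that remembers, in its type, which attribute it sets.
template <typename Name>
struct AttributeNode {
    std::string value;
    static const char* name() { return Name::name(); }
};

// Attribute names: writing `src = "x"` produces an AttributeNode<SrcName>.
struct SrcName {
    static const char* name() { return "src"; }
    AttributeNode<SrcName> operator=(const char* v) const { return {v}; }
};
struct AltName {
    static const char* name() { return "alt"; }
    AttributeNode<AltName> operator=(const char* v) const { return {v}; }
};

// count_of<Name, Attrs...>: how many of Attrs are AttributeNode<Name>.
template <typename Name, typename... Attrs>
struct count_of;
template <typename Name>
struct count_of<Name> : std::integral_constant<int, 0> {};
template <typename Name, typename First, typename... Rest>
struct count_of<Name, First, Rest...>
    : std::integral_constant<int,
          (std::is_same<AttributeNode<Name>, First>::value ? 1 : 0) +
              count_of<Name, Rest...>::value> {};

// A toy <img/> tag that requires exactly one src and one alt attribute.
struct ImgTag {
    template <typename... Attrs>
    std::string operator()(const Attrs&... attrs) const {
        static_assert(count_of<SrcName, Attrs...>::value == 1,
                      "img needs exactly one src attribute");
        static_assert(count_of<AltName, Attrs...>::value == 1,
                      "img needs exactly one alt attribute");
        std::string out = "<img";
        // C++11 pack expansion: append each attribute in order.
        // Values are not escaped in this sketch.
        int expand[] = {0, (out += std::string(" ") + Attrs::name() +
                                   "=\"" + attrs.value + "\"", 0)...};
        (void)expand;
        return out + "/>";
    }
};

std::string img_example() {
    const SrcName src{};
    const AltName alt{};
    const ImgTag img{};
    return img(src = "hello.jpg", alt = "Hello");
}
```

In this sketch, img(src = "hello.jpg") and img(src = "a", src = "b", alt = "c") both fail to compile with the messages given to static_assert.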

Secret sauce 2: metaprogramming

Metaprogramming is a fancy word for using templates. Templates in C++ are very powerful and very complex.

The operator overloading from the previous section gets unwieldy quickly for common XML schemas. To avoid writing a lot of code, we translate the XML Schema or Relax NG schema to data structures in C++. The templated code uses this data to generate the required overloaded functions during compilation.

Here are some excerpts from a very simple XHTML schema for use with Qt code (Blasien is not restricted to Qt).

The tags and element types are defined in a dedicated namespace, in this case xhtml11:

namespace xhtml11 {

const QString htmlns = QStringLiteral("http://www.w3.org/1999/xhtml");
const QString htmlTag = QStringLiteral("html");
const QString headTag = QStringLiteral("head");
const QString bodyTag = QStringLiteral("body");

Tags are derived from a template called XmlTag.

using HtmlTag = XmlTag<QString,&htmlns, &htmlTag>;
using HeadTag = XmlTag<QString,&htmlns, &headTag>;
using BodyTag = XmlTag<QString,&htmlns, &bodyTag>;

Each document type and element type defines which tag is used, which attributes are allowed and which are required:

struct XHtmlDocument {
};
struct HtmlType {
    using Tag = HtmlTag;
    using allowedAttributes = std::tuple<xhtml11::IdTag,xhtml11::ClassTag>;
};
struct ImgType {
    using Tag = ImgTag;
    using allowedAttributes = std::tuple<xhtml11::IdTag,xhtml11::ClassTag>;
    using requiredAttributes = std::tuple<xhtml11::SrcTag,xhtml11::AltTag>;
};
}
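
The XmlTag aliases above take the addresses of global strings as template arguments. The following is a minimal sketch of how such a template could look, assuming std::string instead of QString so that it compiles without Qt. The accessors ns() and name() and the variable names are hypothetical and not necessarily what Blasien provides.

```cpp
#include <string>
#include <type_traits>

// Namespace-scope constants have static storage duration, so their
// addresses are valid non-type template arguments in C++11.
const std::string htmlns = "http://www.w3.org/1999/xhtml";
const std::string htmlName = "html";
const std::string bodyName = "body";

// A tag is a distinct type for each (namespace, name) pair.
template <typename String, const String* Ns, const String* Name>
struct XmlTag {
    static const String& ns() { return *Ns; }
    static const String& name() { return *Name; }
};

using HtmlTag = XmlTag<std::string, &htmlns, &htmlName>;
using BodyTag = XmlTag<std::string, &htmlns, &bodyName>;

// Different names give different types, which is what lets the
// compiler tell an <html> tag from a <body> tag during overload resolution.
static_assert(!std::is_same<HtmlTag, BodyTag>::value,
              "each tag is its own type");
```

Because the strings are only dereferenced when ns() or name() is called, the tag types themselves carry no data and cost nothing to pass around.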

Which nodes are allowed inside which other nodes is determined by specializing a template struct called allowed_child_types:

template <>
struct allowed_child_types<xhtml11::HtmlType> {
    using types = std::tuple<xhtml11::HeadType, xhtml11::BodyType>;
};
template <>
struct allowed_child_types<xhtml11::ImgType> {
    using types = std::tuple<>;
};

Blasien uses these structs to generate the right overloaded operators.
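
How a single operator template can be driven by such a struct is sketched below. It is a simplified illustration under stated assumptions, not Blasien's implementation: the operator takes element types directly instead of tags, the primary allowed_child_types template defaults to "no children", and contains, Sink and open_html_body are hypothetical names.

```cpp
#include <string>
#include <tuple>
#include <type_traits>

// Toy element types; each knows its tag name.
struct HeadType { static const char* name() { return "head"; } };
struct BodyType { static const char* name() { return "body"; } };
struct HtmlType { static const char* name() { return "html"; } };

// The schema as data. Assumed default: no children allowed.
template <typename T>
struct allowed_child_types { using types = std::tuple<>; };
template <>
struct allowed_child_types<HtmlType> {
    using types = std::tuple<HeadType, BodyType>;
};

// contains<T, Tuple>: true if T is one of the tuple's element types.
template <typename T, typename Tuple>
struct contains;
template <typename T>
struct contains<T, std::tuple<>> : std::false_type {};
template <typename T, typename Head, typename... Tail>
struct contains<T, std::tuple<Head, Tail...>>
    : std::conditional<std::is_same<T, Head>::value,
                       std::true_type,
                       contains<T, std::tuple<Tail...>>>::type {};

// A sink positioned inside an element of type Self.
template <typename Self>
struct Sink { std::string* out; };

// One template replaces the hand-written overloads. enable_if removes it
// from overload resolution unless Child is allowed inside Parent.
template <typename Parent, typename Child>
typename std::enable_if<
    contains<Child, typename allowed_child_types<Parent>::types>::value,
    Sink<Child>>::type
operator<(Sink<Parent> s, Child) {
    *s.out += std::string("<") + Child::name() + ">";
    return Sink<Child>{s.out};
}

static_assert(contains<BodyType, allowed_child_types<HtmlType>::types>::value,
              "body is allowed inside html");
static_assert(!contains<HtmlType, allowed_child_types<BodyType>::types>::value,
              "html is not allowed inside body");

std::string open_html_body() {
    std::string out;
    const Sink<BodyType> inBody = Sink<HtmlType>{&out} < BodyType{};
    return *inBody.out;
}
```

With this in place, Sink<BodyType>{&out} < HtmlType{} is rejected at compile time because no viable operator< exists for that pair.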

How to use it

To use Blasien in your project, you need to use a provided XmlSink or write one yourself. Two sinks are provided: one for writing XML (30 lines of code) and one for creating a DOM tree (40 lines of code). Both use Qt5.

Here is an example that uses XmlBuilder to create a QDomDocument:

#include <XmlBuilder.h>
#include <XHtml11.h>

struct create_paragraphs {
    const QList<QString> texts;
    template <typename Sink>
    Sink operator()(const Sink& sink) {
        for (const QString& t: texts) {
            sink <p<t>p;
        }
        return sink;
    }
};

QDomDocument
createDocument(const QString& docTitle, const QList<QString>& paragraphs) {
    QDomDocument dom("test");
    XmlBuilder<XHtmlDocument>(dom)
    <html
        <head
            <title
                <docTitle
            >title
        >head
        <body
            <create_paragraphs{{paragraphs}}
        >body
    >html;
    return dom;
}

Future enhancements

This code is very new and these instructions are likely to change. But Blasien is usable now and newer releases will simply bring more features.

C++ projects that rely heavily on XML, like Calligra, LibreOffice, Inkscape, MuseScore and many more, can already benefit from the current version.

Metaprogramming is so powerful that XML generating code can be written that checks against nearly all aspects of a Relax NG schema or XML schema. Future releases will do more and more compile time checking.

An exciting future feature is XPath-like selectors. Code for this is already present in Blasien. It gives a convenient syntax for collecting information from XML documents. Extending this part of Blasien could turn it into a natively compiled alternative to XQuery and XSLT.

Getting the code

Blasien is just a few C++ headers with a reasonable number of unit tests. I've put them on GitHub for now. Feel free to file issues or send pull requests.

The code is currently under LGPL3, but I'm open to additional licenses if a project requires it.
