Translating Haskell to C++ metaprogramming

2015-07-12

Haskell and C++ are very different programming languages. Haskell is purely functional and C++ is imperative. While 'normal' C++ is imperative, C++ metaprogramming is purely functional, just like Haskell. This blog shows examples of Haskell code that I've translated to C++.

Why would you do this?

Last week I blogged about Blasien. Blasien is a set of header files for writing literal validated XML in C++. XML can be validated against DTD, XML schema or Relax NG. Relax NG is of interest to me since that schema language is used in the Opendocument Format standard. Relax NG is an elegant schema language, but validating against it is not trivial. The goal with Blasien is to do this validation at compile time.

James Clark has written an elegant algorithm for Relax NG validation. His algorithm is written in a subset of Haskell. In the past, I've ported his algoritm to JavaScript for use in WebODF. For use in Blasien, I've now ported it to C++ metaprogramming. This took some puzzling, but the result feels very natural once you get used to the syntax.

A simple example

Haskell is a very clean programming language. It is very different from C++. Haskell is a purely functional programming language and so is C++ metaprogramming. Purely functional means that all structures are immutable and functions have no side-effects. All a function does is to take immutable structures as input and create a new immutable structure as output. The same is true in C++ metaprogramming.

Here is an example of a function that adds two integers.

Haskell

C++

add :: Int -> Int -> Int
add a b = a + b



main :: IO ()
main = do
    let r = add 1 2

    putStrLn $ show r
template <int a, int b>
struct add {
    static constexpr int value = a + b;
};

int main() {
    auto r = add<1,2>::value;

    std::cout << r << std::endl;
    return 0;
}

Both of these examples have a main function which prints out a value. That part of the examples is not pure: printing out a value is a side-effect. But the calculation of the value is pure. In C++ the value of r will be calculated at compile time. (Probably in Haskell too.)

Data types

Haskell has algebraic data types. In the code fragment below, NameClass is a data type with four variants: AnyName, QName, NsName and NameClassChoice. Instances of AnyName have no members. Instances of QName have two members of type Uri and LocalName respectively. The members are not named but can be accessed via pattern matching. This will be explained below.

In C++, the variants of NameClass are not connected directly. Each of them is defined as a separate struct. The members of each data type are given via template parameters. We are using structs everywhere now: both functions and data types are defined via template structs.

Haskell

C++



type Uri = String

type LocalName = String

type NameClass = AnyName
               | QName Uri LocalName
               | NsName Uri
               | NameClassChoice NameClass NameClass
using String = const char*;

using Uri = String;

using LocalName = String;

struct AnyName;

template <String U, String L>
struct QName {
    static constexpr String Uri = U;
    static constexpr String LocalName = L;
};

template <String U>
struct NsName {
    static constexpr String Uri = U;
};

template <typename NC1, typename NC2>
struct NameClassChoice {
    using NameClass1 = NC1;
    using NameClass2 = NC2;
};

The C++ definition of QName differs in an important aspect from the definition of NameClassChoice. QName has static constexpr String in front of its members and NameClassChoice has using before its members. This is because C++ makes a distinction between type parameters and value parameters. QName has two String values as parameters and NameClassChoice has two NameClass types as parameters.

Note that AnyName does not have a definition. It has no members and therefor no need for a definition. All it needs is a declaration. C++ metaprogramming is programming with types. None of these types will be instantiated, so they do not require a complete definition.

Function overloading with pattern matching and template specialization

Haskell does not have function overloading. Instead it has pattern matching (and type classes). Only one function with a particlar name is possible. That function can still handle different types of input if the input parameters are algebraic data types. A different implementation is possible for each variant of an algebraic data type.

This is demonstrated in the function contains. The first parameter to this function should be of type NameClass. There are four variants of this data type and hence four implementations are possible. The function contains returns true if the NameClass instance contains the given QName. The members of each variant of NameClass are exposed via pattern matching.

_ is a wildcard and is used in the first variant AnyName. Since AnyName matches any QName instance, the value of QName is not bound to a variable. The implementations of the QName and NsName variants are very straightforward. The implementation of the NameClassChoice variant is recursive.

C++ metaprogramming also uses pattern matching. In C++ this is called template specialization. First a template struct is defined. This may or may not have a definition. Next, a specialization is written for each of the variants of the algebraic data type.

Haskell

C++


contains :: NameClass -> QName -> Bool


contains AnyName _ = True



contains (QName ns1 ln1) (QName ns2 ln2) = (ns1 == ns2) && (ln1 == ln2)



contains (NsName ns1) (QName ns2 _) = (ns1 == ns2)



contains (NameClassChoice nc1 nc2) n = (contains nc1 n) || (contains nc2 n)
template <typename NameClass, typename QName>
struct contains;

template <String U, String L>
struct contains<AnyName, QName<U,L>> {
    static constexpr bool value = true;
};
template <String U1, String L1, String U2, String L2>
struct contains<QName<U1, L1>, QName<U2, L2>> {
    static constexpr bool value = strcmp(U1, U2) == 0 && strcmp(L1, L2) == 0;
};
template <String U1, String U2, String L2>
struct contains<NsName<U1>,QName<U2,L2>> {
    static constexpr bool value = strcmp(U1, U2) == 0;
};
template <typename NameClass1, typename NameClass2,String U2, String L2>
struct contains<NameClassChoice<NameClass1,NameClass2>,QName<U2,L2>> {
    static constexpr bool value = contains<NameClass1,QName<U2,L2>>::value
                                  || contains<NameClass2,QName<U2,L2>>::value;
};

Running unit tests at compile time

Some people prefer type checking. Others choose unit tests. Having both is best. A neat feature of C++ metaprogramming is that it is possible to write unit test that run at compile time. Here are some tests for our new contains function.

constexpr char xhtmlNS[] = "http://www.w3.org/1999/xhtml";
constexpr char divLocalName[] = "div";
constexpr char pLocalName[] = "p";

void testContainsAnyName() {
    using PQName = QName<xhtmlNS, pLocalName>;
    static_assert(contains<AnyName,PQName>::value, "AnyName should match PQName.");
}

void testContainsQName() {
    using PQName = QName<xhtmlNS, pLocalName>;
    using DivQName = QName<xhtmlNS, divLocalName>;
    static_assert(!contains<DivQName,PQName>::value, "DivPQName should not match PQName.");
}

void testContainsNsName() {
    using PQName = QName<xhtmlNS, pLocalName>;
    using HtmlNsName = NsName<xhtmlNS>;
    static_assert(contains<HtmlNsName,PQName>::value, "HtmlNsName should match PQName.");
}

void testContainsNameClassChoice() {
    using PQName = QName<xhtmlNS, pLocalName>;
    using DivQName = QName<xhtmlNS, divLocalName>;
    using HtmlNsName = NsName<xhtmlNS>;
    using NameChoice = NameClassChoice<DivQName,HtmlNsName>;
    static_assert(contains<NameChoice,PQName>::value, "NameChoice should match PQName.");
}

These functions do not even need to be called from anywhere (at least not in GCC, maybe other compilers behave differently) to run these tests. They should, of course, be in a compilation unit.

Again, why are you doing this?

I have translated (most of) this Relax NG validation algorithm to C++ metaprogramming. You can look at the progress so far.

The point of doing this is to further Blasien. Generating XML from C++ (and most programming languages for that matter) is a source of many invalid documents because doing validation at either runtime or compile time is currently incomplete and cumbersome. With a convenient way to write XML inside of C++ I hope this will change. Doing most of the validation at compile time will catch most of the common errors. Solving nice puzzles and learning more Haskell and C++ while developing Blasien is a nice bonus.

Comments

Post a comment