Generating C++ from a DTD with Jinja2 and lxml

I recently stumbled across an XML format specified in a DTD that I wanted to work with from within C++. The XML format is document-centric, which in my limited experience is a bit of a pain with existing data binding compilers.

So to learn something new, and to keep control over generated code, I started to investigate what it would take to write my own little custom data binding compiler.

Writing a program that writes a program

It turns out that two Python libraries make this a lot easier: Jinja2, a template engine, and lxml, whose etree module can load a DTD and iterate over its element and attribute declarations.

To keep my life simple, I am focusing on generating accessors for XML attributes only for now. I leave it up to the library client to figure out how to deal with child elements.

A highly simplified DOM

Inspired by the hybrid example from libstudxml, we define a simple base class that can store raw XML elements.

class element {
public:
  using attributes_type = std::map<xml::qname, std::string>;
  using elements_type = std::vector<std::shared_ptr<element>>;

  element(const xml::qname& name) : tag_name_(name) {}
  virtual ~element() = default;

  xml::qname const& tag_name() const { return tag_name_; }

  attributes_type const& attributes() const { return attributes_; }
  attributes_type&       attributes()       { return attributes_; }

  std::string const& text() const { return text_; }
  void text(std::string const& text) { text_ = text; }

  elements_type const& elements() const { return elements_; }
  elements_type&       elements()       { return elements_; }

  element(xml::parser&, bool start_end = true);

  void serialize (xml::serializer&, bool start_end = true) const;

  template<typename T> static std::shared_ptr<element> create(xml::parser& p) {
    return std::make_shared<T>(p, false);
  }

private:
  xml::qname tag_name_;
  attributes_type attributes_;
  std::string text_;           // Simple content only.
  elements_type elements_;     // Complex content only.
};

For each element name in the DTD, we're going to define a class that inherits from the element class, implementing special methods to make attribute access easier. The element(xml::parser&) constructor is going to create the corresponding class whenever it sees a certain element name. This calls for some sort of factory:

class factory {
public:
  static std::shared_ptr<element> make(xml::parser& p);

protected:
  struct element_info {
    xml::content content_type;
    std::shared_ptr<element> (*construct)(xml::parser&);
  };

  using map_type = std::map<xml::qname, element_info>;

  static map_type *get_map() {
    if (!map) map = new map_type;

    return map;
  }

private:
  static map_type *map;
};

template<typename T>
struct register_element : factory {
  register_element(xml::qname const& name, xml::content const& content) {
    get_map()->insert({name, element_info{content, &element::create<T>}});
  }
};

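// Definition of the static registry pointer declared in the factory class.
factory::map_type *factory::map = nullptr;

std::shared_ptr<element> factory::make(xml::parser& p) {
  auto name = p.qname();
  auto iter = get_map()->find(name);
  if (iter == get_map()->end()) {
    // No subclass found, so store plain data so we do not lose it on roundtrip.
    return std::make_shared<element>(p, false);
  }

  // Named `info` so that it does not shadow the `element` class.
  auto const& info = iter->second;

  p.content(info.content_type);
  return info.construct(p);
}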
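
To see the registration mechanism in isolation, here is a minimal, self-contained sketch that replaces xml::parser and xml::qname with plain std::string. The names node, note, node_factory, and register_node are hypothetical stand-ins and not part of the library. The sketch also swaps the heap-allocated map for a function-local static, which is one common way to make sure the registry exists before the first registration object is constructed.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for dom::element, identified by a plain string name.
struct node {
  explicit node(std::string name) : name_(std::move(name)) {}
  virtual ~node() = default;

  std::string const& name() const { return name_; }

  // Builds a T but hands it back as a pointer to the base class.
  template<typename T>
  static std::shared_ptr<node> create(std::string const& name) {
    return std::make_shared<T>(name);
  }

private:
  std::string name_;
};

class node_factory {
public:
  // Returns the registered subclass for `name`, or a plain node otherwise.
  static std::shared_ptr<node> make(std::string const& name) {
    auto iter = registry().find(name);
    if (iter == registry().end()) return std::make_shared<node>(name);
    return iter->second(name);
  }

protected:
  using constructor = std::shared_ptr<node> (*)(std::string const&);

  // Function-local static: constructed on first use, so registrations that
  // run during static initialization always find a valid map.
  static std::map<std::string, constructor>& registry() {
    static std::map<std::string, constructor> instance;
    return instance;
  }
};

template<typename T>
struct register_node : node_factory {
  explicit register_node(std::string const& name) {
    constructor make_t = &node::create<T>;
    registry().emplace(name, make_t);
  }
};

// A hypothetical subclass plus the object whose constructor registers it.
struct note : node {
  using node::node;
};
register_node<note> const note_registration{"note"};
```

With this in place, node_factory::make("note") yields a note, while an unregistered name such as "rest" falls back to a plain node, mirroring the roundtrip-preserving fallback above.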

The header template

Now that we have our required infrastructure, we can finally start writing Jinja2 templates to generate classes for all elements in our DTD:

{%- for elem in dtd.iterelements() %}
  {%- if elem.name in forwards_for %}
    {%- for forward in forwards_for[elem.name] %}
class {{forward}};
    {%- endfor %}
  {%- endif %}

class {{elem.name}} : public dom::element {
  static register_element<{{elem.name}}> factory_registration;

public:
  {{elem.name}}(xml::parser& p, bool start_end = true) : dom::element(p, start_end) {
  }

  {%- for attr in elem.iterattributes() %}
    {%- if attr is required_string_attribute %}

  std::string {{attr.name}}() const;
  void {{attr.name}}(std::string const&);

    {%- elif attr is implied_string_attribute %}

  optional<std::string> {{attr.name}}() const;
  void {{attr.name}}(optional<std::string>);

    {# more branches to go here #}

    {%- endif %}
  {%- endfor %}
};
{%- endfor %}

required_string_attribute and implied_string_attribute are so-called Jinja2 tests. They are a nice way to isolate predicates such that the Jinja2 templates can stay relatively free of complicated expressions:

templates.tests['required_string_attribute'] = lambda a: \
  a.type in ['id', 'cdata', 'idref'] and a.default == 'required'
templates.tests['implied_string_attribute'] = lambda a: \
  a.type in ['id', 'cdata', 'idref'] and a.default == 'implied'

That is nice, but we have only seen C++ header declarations so far. Let's have a look at the implementation of some of our attribute accessors.
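
The simplest accessors are the string-valued ones declared in the header template. As a rough sketch of what an implementation for an implied (optional) string attribute could look like, assume std::optional from C++17 in place of whatever optional type the library uses, and a string-keyed map in place of attributes_type. The class part and its attribute name are hypothetical, and erasing the attribute when the value is unset is a design assumption rather than something the generated code is known to do.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical element class; attributes are keyed by plain strings here.
class part {
public:
  std::optional<std::string> name() const;
  void name(std::optional<std::string>);

private:
  std::map<std::string, std::string> attributes_;
};

std::optional<std::string> part::name() const {
  auto iter = attributes_.find("name");
  if (iter == attributes_.end()) return std::nullopt;  // Attribute not present.
  return iter->second;
}

void part::name(std::optional<std::string> value) {
  if (value) {
    attributes_["name"] = *value;
  } else {
    attributes_.erase("name");  // Assumption: unset means drop the attribute.
  }
}
```

A required string attribute would look the same, except that the getter returns std::string and reports a missing attribute as an error instead of returning an empty optional.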

Enum conversion

One interesting aspect of DTD-based code generation is that attributes can be declared as enumerations. Assume that we have an extra data structure in Python that gives a nice name to each individual enumeration. Then the part of the Jinja2 template that generates the implementation for an enumeration attribute looks like this:

    {%- elif attr is known_enumeration_attribute %}
      {%- set enum = enumerations[tuple(attr.values())]['name'] %}
      {%- if attr.default == 'required' %}

{{enum}} {{elem.name}}::{{attr.name}}() const {
  auto iter = attributes().find(qname{"{{attr.name}}"});
  if (iter != attributes().end()) {
        {%- for value in attr.values() %}
    {% if not loop.first %}else {% else %}     {% endif -%}
    if (iter->second == "{{value}}") return {{enum}}::{{value | mangle}};
        {%- endfor %}

    throw illegal_enumeration{};
  }

  throw missing_attribute{};
}

void {{elem.name}}::{{attr.name}}({{enum}} value) {
  static qname const attr{"{{attr.name}}"};

  switch (value) {
        {%- for value in attr.values() %}
  case {{enum}}::{{value | mangle}}:
    attributes()[attr] = "{{value}}";
    break;
        {%- endfor %}

  default:
    throw illegal_enumeration{};
  }
}

      {%- elif attr.default == 'implied' %}

{# similar implementation using boost::optional #}

      {%- endif %}
    {%- endif %}

Putting it all together

The header for the library is generated like this:

from jinja2 import DictLoader, Environment
from lxml.etree import DTD

LIBRARY_HEADER = """
{# Our template code #}
"""

bmml = DTD('bmml.dtd')
templates = Environment(loader=DictLoader(globals()))

templates.filters['mangle'] = lambda ident: \
  {'8th_or_128th': 'eighth_or_128th',
   '256th': 'twohundredfiftysixth',
   'continue': 'continue_'
  }.get(ident, ident)

def template(name):
  return templates.get_template(name)

def hpp():
  print(template('LIBRARY_HEADER').render(
    {'dtd': bmml,
     'enumerations': enumerations,
     'forwards_for': {'ornament': ['ornament_type'],
                      'score': ['score_data', 'score_header']}
    }))

With all of this in place, we can have a look at a small use case for our library.

Printing document content

I haven't really explained anything about the document format we're working with until now. Braille Music Markup Language (BMML) is an XML-based plain text markup language. Its purpose is to enhance plain braille music scores with meta information that is usually hard to calculate. Almost all element text content is supposed to be printed as-is to reconstruct the original plain text.

So we could at least define one very basic operation in our library: printing the plain text content of an element.

I found an XML stylesheet that is supposed to convert BMML documents to HTML. This stylesheet apparently has a bug, insofar as it forgets to treat the rest_data element in the same way as it already treats the note_data element.

Note to self: I wish I had done a code review before the EU project that developed BMML was finished. Resurrecting maintenance looks like one of the things I might be able to look into at a meeting in Pisa in the first three days of March this year.

If we keep this in mind, we can easily reimplement what the stylesheet does in idiomatic C++:

template<typename T>
typename std::enable_if<std::is_base_of<element, T>::value, std::ostream&>::type
operator<<(std::ostream &out, std::shared_ptr<T> elem) {
  if (!std::dynamic_pointer_cast<note_data>(elem) &&
      !std::dynamic_pointer_cast<rest_data>(elem) &&
      !std::dynamic_pointer_cast<score_header>(elem))
  {
    auto const& text = elem->text();
    if (text.empty()) {
      // Complex content: recurse into the child elements.
      for (auto const& child : elem->elements()) out << child;
    } else {
      out << text;
    }
  }
  return out;
}

The use of std::enable_if is necessary here so that operator<< is defined for the element class and all of its subclasses. Without the std::enable_if magic, client code would have to make sure manually that it passes a std::shared_ptr<element> each time it wants to use operator<< on any of our specially defined subclasses.
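
The effect can be reproduced in a few lines. The following sketch uses the hypothetical types base and derived and a hypothetical helper to_text. It relies on two facts: the template deduces T from any std::shared_ptr<T>, so no conversion to a pointer-to-base is needed at the call site, and std::enable_if removes the overload for unrelated types, which then still get the standard library's operator<< that prints the stored pointer.

```cpp
#include <memory>
#include <ostream>
#include <sstream>
#include <string>
#include <type_traits>

struct base {
  virtual ~base() = default;
  virtual std::string text() const { return "base"; }
};

struct derived : base {
  std::string text() const override { return "derived"; }
};

// Takes part in overload resolution only if T is base or derives from it.
template<typename T>
typename std::enable_if<std::is_base_of<base, T>::value, std::ostream&>::type
operator<<(std::ostream& out, std::shared_ptr<T> const& p) {
  return out << p->text();
}

// Renders any streamable value into a string.
template<typename T>
std::string to_text(T const& value) {
  std::ostringstream out;
  out << value;
  return out.str();
}
```

Calling to_text(std::make_shared<derived>()) goes through the constrained overload and yields the text of the object, whereas to_text(std::make_shared<int>(7)) is untouched by it and prints the pointer value as usual.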

Now we can easily print BMML documents and get their actual plain text representation.

#include <cstdlib>   // EXIT_FAILURE
#include <fstream>
#include <iostream>
#include <memory>    // std::make_shared

#include <xml/parser>
#include <xml/serializer>

#include "bmml.hxx"

using namespace std;
using namespace xml;

int main (int argc, char *argv[]) {
  if (argc < 2) {
    cerr << "usage: " << argv[0] << " <filename.bmml>..." << endl;
    return EXIT_FAILURE;
  }

  try {
    for (int i = 1; i < argc; ++i) {
      ifstream ifs{argv[i]};

      if (ifs.good()) {
        parser p{ifs, argv[i]};

        p.next_expect(parser::start_element, "score", content::complex);
        cout << make_shared<bmml::score>(p, false) << endl;
        p.next_expect(parser::end_element, "score");
      } else {
        cerr << "Unable to open '" << argv[i] << "'." << endl;
        return EXIT_FAILURE;
      }
    }
  } catch (xml::exception const& e) {
    cerr << e.what() << endl;
    return EXIT_FAILURE;
  }
}

That's it for now. The full source for the actual library which inspired this post can be found on GitHub in my bmmlcxx project.

If you have any comments or questions, send me mail. If you like bmmlcxx, don't forget to star it :-).
