Pleasantly Producing PowerPoint Parsers

2009-09-21

KOffice has the potential to be a widely used office suite. One of the requirements for user adoption is good support for popular file formats, and most presentations are distributed as PowerPoint files. KOffice uses ODF as its native format. It has an import filter for PowerPoint presentations, but that filter is currently incomplete. At KO, we are working to improve this situation.

To convert data from one file format to another, you have to understand both formats. ODF is an open standard and rather well documented. About a year ago, after significant political pressure, Microsoft put documentation for its file formats online. The header of that documentation grants permission to use it to develop software:

Regardless of any other terms that are contained in the terms of use for the Microsoft website that hosts this documentation, you may make copies of it in order to develop implementations of the technologies described in the Open Specifications and may distribute portions of it in your implementations using these technologies or your documentation as necessary to properly document the implementation. You may also distribute in your implementation, with or without modification, any schema, IDL's, or code samples that are included in the documentation.

This documentation is available as PDF files. The file describing PPT runs to 663 pages; the one describing drawings, which are an essential part of presentations, runs to 620 pages. Implementing a parser for all of that is a lot of work, and the exercise would have to be repeated for each language in which one wants to parse these files.

It is easier to convert the documentation to a machine-readable format and generate parsers for different situations from that. This is now being done in msoscheme. It comes with a big file called mso.xml, which already covers a very large part of the documentation. From this file, a C++ parser and a Java parser are generated. Both parsers can deserialize ppt files into a runtime representation that can serve as the starting point for conversion to, for example, ODF.
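To give a feel for the idea, here is a minimal sketch of generating a parser from a machine-readable description. It is not how msoscheme works internally: the XML layout, the type names and the generated function names are all invented for this illustration, and the real mso.xml is structured differently. The one struct described is the 8-byte record header that the PPT documentation defines, with the version and instance bits kept together in a single 16-bit field for simplicity.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical description format, invented for this sketch.
# The real mso.xml uses a different, richer schema.
DESCRIPTION = """
<structs>
  <struct name="RecordHeader">
    <field name="recVerAndInstance" type="uint16"/>
    <field name="recType" type="uint16"/>
    <field name="recLen" type="uint32"/>
  </struct>
</structs>
"""

# Map the sketch's type names to struct format codes.
FORMATS = {"uint8": "B", "uint16": "H", "uint32": "I"}


def generate_parser_source(xml_text):
    """Return Python source with one parse_<Name>(data, offset) per struct."""
    lines = ["import struct", ""]
    for s in ET.fromstring(xml_text).findall("struct"):
        fields = s.findall("field")
        names = [f.get("name") for f in fields]
        # "<" selects little-endian byte order without padding.
        fmt = "<" + "".join(FORMATS[f.get("type")] for f in fields)
        size = struct.calcsize(fmt)
        lines.append(f"def parse_{s.get('name')}(data, offset=0):")
        lines.append(f"    values = struct.unpack_from({fmt!r}, data, offset)")
        lines.append(f"    return dict(zip({names!r}, values)), offset + {size}")
        lines.append("")
    return "\n".join(lines)


# Generate the parser source and load it into a fresh namespace.
source = generate_parser_source(DESCRIPTION)
namespace = {}
exec(source, namespace)

# Made-up sample bytes, not taken from a real ppt file.
sample = struct.pack("<HHI", 0x000F, 0x03E8, 16)
header, next_offset = namespace["parse_RecordHeader"](sample)
print(source)
print(header, next_offset)
```

The point of the design is that the description stays independent of the target language: swapping the string templates in generate_parser_source for C++ or Java templates would yield a parser for that language from the same description.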

A small Qt program called ppttoxml can convert a ppt file to an XML representation. This XML representation is easy to read and understand and therefore very helpful for us in improving our current PowerPoint filter.
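As a rough illustration of why an XML dump is handy, the following sketch turns a parsed, nested record into XML. It is only an assumption of what such a conversion could look like: the element names and the sample record are made up, and the actual output of ppttoxml is not reproduced here.

```python
import xml.etree.ElementTree as ET


def to_xml(name, value):
    """Convert nested dicts, lists and scalars into an ElementTree element."""
    element = ET.Element(name)
    if isinstance(value, dict):
        for key, child in value.items():
            element.append(to_xml(key, child))
    elif isinstance(value, list):
        for child in value:
            element.append(to_xml("item", child))
    else:
        element.text = str(value)
    return element


# Made-up parsed record, for illustration only.
record = {
    "recType": 1000,
    "recLen": 16,
    "children": [{"recType": 1001, "recLen": 0}],
}
xml_text = ET.tostring(to_xml("record", record), encoding="unicode")
print(xml_text)
```

Once the binary data is in this form, ordinary text tools such as a diff or a search are enough to compare two presentations or to find the record that a filter mishandles.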

It would be great to get people from other projects who want to read ppt files on board. It does not matter what programming language or languages you use. You can write a parser generator in fewer than 700 lines of code.

Here are the commands you need to see what a ppt file looks like on the inside: