Strigi partial port to javascript

2009-10-21

You may remember two of my recent blog posts: one about a project to parse powerpoint files, the other about porting hexdump to the browser.

So how about combining those two topics: parsing powerpoint files in the browser? It is quite a feasible task. The powerpoint file format is now largely described in an xml schema. From this schema one would need to generate a parser, as has already been done for c++ and java. The parsers for java and c++ are both fewer than 700 lines of code.

We have not reached that stage yet, and I will not have time to implement a powerpoint parser in javascript soon. I have written some requirements for it, though. To parse the individual data streams in a ppt file, one must first parse the OLE2 file format that contains them. Currently we use pole for this in c++ and poifs in java. I could port either of these libraries to javascript, but there is another nice OLE parser: Strigi.
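As a rough illustration of that first step, here is a minimal sketch of reading the fixed header at the start of an OLE2 file. It is not taken from Strigi, pole or poifs. The function name parseOle2Header and the names of the returned fields are made up for this example, and it assumes a modern javascript engine with Uint8Array and DataView. The byte offsets follow the published compound file header layout.

```javascript
// Sketch only, not Strigi's code. The 8-byte signature that starts
// every OLE2 (compound file) document.
const OLE2_SIGNATURE = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1];

// Hypothetical helper: read a few header fields from the first 512 bytes.
function parseOle2Header(bytes) {
  if (bytes.length < 512) {
    throw new Error("too short to hold an OLE2 header");
  }
  for (let i = 0; i < OLE2_SIGNATURE.length; i++) {
    if (bytes[i] !== OLE2_SIGNATURE[i]) {
      throw new Error("not an OLE2 file: bad signature");
    }
  }
  // All multi-byte header fields are little-endian.
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const sectorShift = view.getUint16(30, true);
  return {
    majorVersion: view.getUint16(26, true),
    sectorSize: 1 << sectorShift, // a shift of 9 means 512-byte sectors
    fatSectorCount: view.getUint32(44, true),
    firstDirectorySector: view.getUint32(48, true),
    miniStreamCutoff: view.getUint32(56, true),
  };
}
```

From firstDirectorySector a parser would follow the sector chain to the directory entries, which is where the stream names and sizes live. That part is left out here.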

In Strigi, the OLE file format is treated like other container formats such as zip, tar and mime. Porting parts of Strigi to javascript seemed like an interesting challenge. In Strigi, we use low-level c++ to ensure speed. Most of the techniques used in the c++ code are not available in javascript, so the javascript version is bound to be much slower. Still, I was curious what Strigi would look like in javascript.

And now it is ready. The parts required for reading OLE files have been ported. The result is one html page of 600 lines. It can read ppt files and list the streams they contain. Clicking a stream shows its contents in a hexdump-style display. The speed is not even that bad: parsing takes about a second per megabyte.
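For readers who have not seen such a display, a hexdump-style view can be sketched in a few lines. This is not the code from the demo page. The function name hexdump and the exact column layout (8-digit offset, hex bytes, printable ASCII) are assumptions chosen for this example.

```javascript
// Sketch only, not the demo's code. Formats bytes as lines of
// "offset  hex bytes  ascii", 16 bytes per line by default.
function hexdump(bytes, bytesPerLine = 16) {
  const lines = [];
  for (let offset = 0; offset < bytes.length; offset += bytesPerLine) {
    const chunk = Array.from(bytes.slice(offset, offset + bytesPerLine));
    const hex = chunk.map((b) => b.toString(16).padStart(2, "0")).join(" ");
    // Show printable ASCII as-is and everything else as a dot.
    const text = chunk
      .map((b) => (b >= 0x20 && b < 0x7f ? String.fromCharCode(b) : "."))
      .join("");
    lines.push(
      offset.toString(16).padStart(8, "0") + "  " +
      hex.padEnd(bytesPerLine * 3 - 1, " ") + "  " + text
    );
  }
  return lines.join("\n");
}
```

Calling hexdump on the bytes of a stream returns one string with a line per 16 bytes, which a page could put in a pre element.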

Enjoy the demo! (Firefox 3.5 or a recent webkit browser required.)
