MS OOXML: Defective By Design
Stephane Rodriguez, an independent software vendor and file format expert, has authored a pretty damning document titled "Microsoft Office Open XML: Defective by Design". We at OpenMalaysiaBlog have been noting the technical problems with OOXML for some time now. Here are some excerpts from an external, independent party:
[On Text Markup]
The sheer size of the ECMA 376 documentation, over 6000 pages, shows how much legacy Microsoft is willing to bring into the future. Taking an example of such legacy clarifies what it takes to implement even a portion of the documentation. The example is text formatting. Each of the 3 applications (Word, Excel and PowerPoint) uses its own text formatting markup. Worse, the shared libraries themselves (VML, DrawingML, MathML, ...) also use separate text formatting markups, each different. Even worse, if that's possible, Word has many ways of its own to do text formatting. Excel has many ways of its own to do text formatting. PowerPoint has many ways of its own to do text formatting.
...
To read a document, you cannot assume what's in that document, therefore you've got to implement all possible combinations of objects that may be part of the document. In particular, you've got to implement every text formatting markup model, because any of them may well be the XML you face. This is a horrible scenario. To support this scenario, either you are Microsoft, or you have a number of years of work ahead on the subject with plenty of implementation done already. There is no way around it: the barrier to entry in this scenario is sky high.
[On Storing Data in Spreadsheets]
We all take for granted that when we type a value such as 1234.1234 in a cell of a spreadsheet, that's what actually gets stored ... Is this storage neutrality true with the new formats? ... Excel 2007 does not store what we entered. If we read the XML, we are going to grab numbers that have rounding errors compared to the actual numbers we typed. The spreadsheet does not reflect the proper values ... Imagine non-Microsoft applications used in healthcare and critical systems relying on the spreadsheet data. Not only does the rounding error seem arbitrary (one would have to go back and study the artefacts of IEEE floating-point values, several decades of work), but it also changes.
...
It's important to understand that if we open the spreadsheet in Excel 2007, we see the proper values. No loss (based on the values entered) seems to have occurred; the problem is that the data in XML just cannot be used as is.
As an aside, the stored value does not use the locale (it always uses the dot as decimal separator), therefore we have to assume this is all US English. If we wrote software in Excel VBA that grabs the value in cells, then processes it, there is no way we could migrate our VBA code to work with this XML part without substantial rework. We are left with Excel's own international implementation artefacts, undocumented.
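To make the rounding issue concrete, here is a minimal sketch from our side (not from the original article). It assumes, purely for illustration, that a writer serializes the cell's IEEE double with 17 significant digits; the helper names `stored_text` and `recover_typed` are hypothetical. It shows that the text in the XML need not match what was typed, and that a reader has to know about floating-point round-tripping to get the typed value back.

```python
from decimal import Decimal


def stored_text(value: float) -> str:
    """Serialize a double with 17 significant digits.

    Assumption for illustration: this mimics a writer that dumps the
    binary double rather than the text the user typed.
    """
    return "%.17g" % value


def recover_typed(text: str) -> str:
    """Parse the stored text and return the shortest decimal string
    that maps back to the same double (Python's repr does this)."""
    return repr(float(text))


if __name__ == "__main__":
    typed = "1234.1234"
    xml_text = stored_text(float(typed))
    print("typed by the user :", typed)
    print("text in the XML   :", xml_text)
    # Exact decimal expansion of the double that actually gets stored.
    print("exact binary value:", Decimal(float(typed)))
    print("recovered         :", recover_typed(xml_text))
```

Note that the stored text always uses a dot as the decimal separator, so `float()` can parse it regardless of the reader's locale.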
[On the Alleged Deprecated VML]
Contrary to what the ECMA 376 documentation says in many places, VML drawing parts are not deprecated at all. VML is in fact very pervasive in Word, Excel and PowerPoint documents, which makes the problem even more blatant.
Here is a way to create a VML part:
1. Start Excel 2007 and create a new spreadsheet.
2. Right-click and choose Insert Comment.
3. Enter a comment.
4. Save the spreadsheet (xlsx file).
5. Close it, then unzip it.
[On Bad Packaging]
The underlying architecture of how zip entries relate together is called by Microsoft "open packaging conventions". What it means is that zip entries are not independent, or even related by way of a single master zip entry which would work as a directory of all zip entries of relevance. There is a logical tree of entries which uses separate zip entries to define relations between zip entries. The logical tree has nothing to do with the physical tree of zip entries in a package, despite Microsoft continuously using screenshots of Windows XP's built-in ZIP folders to mimic a folder hierarchy.
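As an illustration of what "separate zip entries to define relations" means in practice, here is a minimal sketch from our side that lists the relationship entries of a package. The sample package is built in memory and is hypothetical (it is not a valid xlsx, and `urn:example:type` is a placeholder relationship type); the `.rels` naming and the relationships namespace are the ones used by the packaging conventions.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by relationship parts in Open Packaging Conventions packages.
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def make_sample_package() -> bytes:
    """Build a tiny, hypothetical package in memory (not a valid xlsx)."""

    def rels(rel_id: str, target: str) -> str:
        # "urn:example:type" is a placeholder, not a real relationship type.
        return (
            f'<Relationships xmlns="{REL_NS}">'
            f'<Relationship Id="{rel_id}" Type="urn:example:type" Target="{target}"/>'
            "</Relationships>"
        )

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("_rels/.rels", rels("rId1", "xl/workbook.xml"))
        package.writestr("xl/workbook.xml", "<workbook/>")
        package.writestr("xl/_rels/workbook.xml.rels", rels("rId1", "worksheets/sheet1.xml"))
        package.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")
    return buffer.getvalue()


def read_relationships(package: zipfile.ZipFile) -> dict:
    """Map each .rels entry name to its list of (Id, Target) pairs."""
    result = {}
    for name in package.namelist():
        if not name.endswith(".rels"):
            continue
        root = ET.fromstring(package.read(name))
        result[name] = [
            (rel.get("Id"), rel.get("Target"))
            for rel in root.findall(f"{{{REL_NS}}}Relationship")
        ]
    return result


if __name__ == "__main__":
    with zipfile.ZipFile(io.BytesIO(make_sample_package())) as package:
        for rels_name, pairs in read_relationships(package).items():
            print(rels_name, pairs)
```

The point to notice is that the same `rId1` appears twice with different meanings: an Id only makes sense relative to the entry that owns the `.rels` file, so there is no single directory to consult.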
The problem with such an architecture is that a part may or may not relate to another and there is no standard way to know. Often, there is a r:id attribute right in the content of some XML part that tells the application that there is a relation, but this is not standard. By the way, Microsoft's PDF fixed format competitor called XPS is also based on the same underlying architecture, except that the team who developed XPS did not quite want to play by the same rules as the Office team. For instance, the XPS main zip document entry relates to one or more XPS pages with an attribute such as: Source="Pages/1.fpage". In other words, they are not using the r:id attribute, instead relying on their own mechanism. This makes it impossible for a generic library to know which part relates to which part, and it has an unfortunate consequence.
The unfortunate consequence is that, without knowing whether a part relates to another part, you cannot know when you delete a part whether you are going to corrupt the document. The document becomes corrupt if it points or relates (implicitly or explicitly) to a missing part. It's unclear why Microsoft chose this way of doing things, obviously leading to internal chaos, instead of just copying the research from the OpenOffice project, where a central directory is used (the OpenOffice ZIP initiative predates Microsoft's by at least three years, despite Microsoft stealing the thunder).
When you don't know the dependencies of a part, the consequence is obvious: you leave those parts alone. If you do this enough times, it clutters up the package, and soon enough you end up with a package containing any number of parts that are there for God only knows what reason. Add to this that you can add a part of any content type (arbitrary MIME type), and you have a recipe for disaster. Among other things, viruses could proliferate.
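The best a generic tool can do is audit the explicit relationships, which can be sketched as follows. This is our own illustrative sketch under stated assumptions: the demo package, `resolve_target` and `audit_package` are hypothetical names, and the check only sees relationships declared in `.rels` entries. References buried inside part content (such as the XPS `Source` attribute mentioned above) stay invisible, which is exactly the complaint.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def resolve_target(rels_name: str, target: str) -> str:
    """Resolve a relationship target to a zip entry name.

    "xl/_rels/workbook.xml.rels" belongs to the part "xl/workbook.xml",
    so relative targets are resolved against the directory "xl".
    """
    if target.startswith("/"):
        return target.lstrip("/")
    base_dir = posixpath.dirname(posixpath.dirname(rels_name))
    return posixpath.normpath(posixpath.join(base_dir, target))


def audit_package(package: zipfile.ZipFile):
    """Return (dangling relationships, parts no relationship points to)."""
    names = set(package.namelist())
    referenced = set()
    dangling = []
    for name in sorted(names):
        if not name.endswith(".rels"):
            continue
        root = ET.fromstring(package.read(name))
        for rel in root.findall(f"{{{REL_NS}}}Relationship"):
            if rel.get("TargetMode") == "External":
                continue  # points outside the package, nothing to check
            target = resolve_target(name, rel.get("Target"))
            referenced.add(target)
            if target not in names:
                dangling.append((name, rel.get("Id"), target))
    orphans = sorted(
        n for n in names
        if n not in referenced
        and not n.endswith(".rels")
        and n != "[Content_Types].xml"
    )
    return dangling, orphans


def build_demo_package() -> bytes:
    """Hypothetical package: one missing target, one leftover part."""
    def rels(target: str) -> str:
        return (f'<Relationships xmlns="{REL_NS}"><Relationship Id="rId1" '
                f'Type="urn:example:type" Target="{target}"/></Relationships>')

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("_rels/.rels", rels("xl/workbook.xml"))
        package.writestr("xl/workbook.xml", "<workbook/>")
        # The sheet this relationship points to was "deleted".
        package.writestr("xl/_rels/workbook.xml.rels", rels("worksheets/sheet1.xml"))
        # Nothing declares a relationship to this part.
        package.writestr("xl/media/leftover.bin", b"\x00")
    return buffer.getvalue()


if __name__ == "__main__":
    with zipfile.ZipFile(io.BytesIO(build_demo_package())) as package:
        dangling, orphans = audit_package(package)
    print("dangling:", dangling)
    print("orphans :", orphans)
```

Even here, an "orphan" cannot safely be deleted: the audit cannot prove that no part refers to it implicitly.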
[On The Lack of Internationalization Support]
An important ongoing tension with Office documents is the support for locales. Microsoft historically used a number of mechanisms to address this need, but they kept evolving and Microsoft aggregated all mechanisms to keep compatibility with older versions. What was hidden is being surfaced with the new XML. Anything that gets displayed, calculated, rendered or stored depends one way or another on a complex and undocumented combination of locale settings, including: the Office application language, the Office application language settings (per application), the Office document language settings (per document), and the system locale of the operating system.
To save them time, Microsoft chose to store XML using the US English locale regardless of all settings above.
This has an unfortunate consequence for implementers or those willing to make a manual change. Indeed, Microsoft is forcing everybody else to adapt to US English locale options (separators, date formats, formula conventions, ...) despite the fact that when using Office interactively, this is hidden from the user.
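For the simplest case, number separators, the burden on an implementer looks roughly like the sketch below. It is our own illustration, not anything from the specification: `to_display` and `from_display` are hypothetical helpers, and the comma-decimal, dot-grouping convention is just one example locale. Dates and formulas are far messier and are not covered.

```python
from decimal import Decimal


def to_display(stored: str, decimal_sep: str = ",", group_sep: str = ".") -> str:
    """Turn a US-English stored number into the user's display form."""
    us_text = format(Decimal(stored), ",")  # e.g. "1,234.5"
    # Swap both separators in a single pass so they do not clobber each other.
    return us_text.translate(str.maketrans({",": group_sep, ".": decimal_sep}))


def from_display(shown: str, decimal_sep: str = ",", group_sep: str = ".") -> str:
    """Turn what the user typed back into the US-English stored form."""
    us_text = shown.replace(group_sep, "").replace(decimal_sep, ".")
    return str(Decimal(us_text))


if __name__ == "__main__":
    print(to_display("1234.5"))
    print(from_display("1.234,5"))
```

Anyone editing the XML by hand has to do this translation in their head, in both directions.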
...
We are talking about two decades' worth of internationalization issues, both Office-related and Windows-related. To get an idea of how bad the situation is, suffice it to say that a Microsoft employee on the Windows internationalization team has a blog where he posts daily horror stories.
Just think of all the fun things you could do with over six contradictory methods of formatting text in one file format, a fist full of buffer overflows and a heart full of malice aforethought!
It curls my toes too, and so, I don't intend implementing MS OO XML aka ECMA 376.
Posted by: Wesley Parish | Tuesday, 28 August 2007 at 08:54 PM