
Monday, 27 August 2007

MS OOXML: Defective By Design

Stephane Rodriguez, an independent software vendor and file format expert, has authored a pretty damning document titled "Microsoft Office Open XML: Defective by Design". We at OpenMalaysiaBlog have been noting the technical problems with OOXML for some time now. Here are some excerpts from an external independent party:

[On Text Markup]

The sheer size of the ECMA 376 documentation, over 6000 pages, shows how much legacy Microsoft is willing to bring into the future. Taking one example of such legacy clarifies what it takes to implement even a portion of the documentation. The example is text formatting. Each of the three applications (Word, Excel and PowerPoint) uses its own text formatting markup. Worse, the shared libraries themselves (VML, DrawingML, MathML, ...) also use separate text formatting markup, each different. Even worse, if that's possible, Word has many ways of its own to do text formatting. Excel has many ways of its own to do text formatting. PowerPoint has many ways of its own to do text formatting.
...
To read a document, you cannot assume what's in that document, therefore you've got to implement all possible combinations of objects that may be part of the document. In particular, you've got to implement every text formatting markup model, because any of them may well be the XML you face. This is a horrible scenario. To support this scenario, either you are Microsoft, or you have a number of years of work ahead on the subject with plenty of implementation done already. There is no way around it: the barrier to entry in this scenario is sky high.
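To make the point concrete, here is a small sketch of ours (it is not from the quoted document) showing what "several markups for the same thing" means for a reader. It checks for bold in three run-property vocabularies: WordprocessingML (`<w:b/>`), SpreadsheetML rich text (`<b/>`) and DrawingML (a `b="1"` attribute). The function name `is_bold` is hypothetical, and the sketch deliberately ignores style inheritance and every other place formatting can come from.

```python
import xml.etree.ElementTree as ET

# Main namespaces of three of the vocabularies that carry run properties.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
S = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"

# Values treated as "off" in this sketch (a simplification).
OFF = {"0", "false", "off"}


def is_bold(rpr):
    """Return True if a run-properties element directly requests bold.

    Hypothetical helper: it looks only at the element it is given and
    ignores styles, themes and inherited properties.
    """
    ns = rpr.tag[1:].partition("}")[0] if rpr.tag.startswith("{") else ""
    if ns == W:
        # WordprocessingML: a child element <w:b/>, optionally with w:val.
        b = rpr.find(f"{{{W}}}b")
        return b is not None and b.get(f"{{{W}}}val", "true") not in OFF
    if ns == S:
        # SpreadsheetML: a child element <b/>, optionally with val.
        b = rpr.find(f"{{{S}}}b")
        return b is not None and b.get("val", "true") not in OFF
    if ns == A:
        # DrawingML: an attribute b on the <a:rPr> element itself.
        return rpr.get("b", "false") not in OFF
    raise ValueError(f"unknown run-properties vocabulary: {rpr.tag}")


print(is_bold(ET.fromstring(f'<w:rPr xmlns:w="{W}"><w:b/></w:rPr>')))
print(is_bold(ET.fromstring(f'<rPr xmlns="{S}"><b/></rPr>')))
print(is_bold(ET.fromstring(f'<a:rPr xmlns:a="{A}" b="1"/>')))
```

One property, three code paths, and this covers only a fraction of the vocabularies the excerpt lists.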

[On Storing Data in Spreadsheets]

We all take for granted that when we type a value such as 1234.1234 in a cell of a spreadsheet, that's what actually gets stored ... Is this storage neutrality true with the new formats? ... Excel 2007 does not store what we entered. If we read the XML, we are going to grab numbers that have rounding errors compared to the actual numbers we typed. The spreadsheet does not reflect the proper values ... Imagine non-Microsoft applications used in healthcare and critical systems relying on the spreadsheet data. Not only does the rounding error seem arbitrary (one would have to go back and study the artefacts of IEEE floating-point values, several decades of work), but it changes.

...

It's important to understand that if we open the spreadsheet in Excel 2007, we see the proper values. No loss (based on the values entered) seems to have occurred; the problem is that the data in the XML just cannot be used as is.
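The effect described here can be reproduced without Excel. The sketch below is ours and rests on an assumption: that the writer serializes the IEEE double behind the cell with 17 significant digits. We have not verified how Excel 2007 actually formats the value, and the names `stored_form` and `as_typed` are hypothetical.

```python
from decimal import Decimal


def stored_form(value):
    """Serialize a double with 17 significant digits.

    Assumption: this mimics a writer that dumps the full binary value
    rather than the text the user typed.
    """
    return format(value, ".17g")


def as_typed(stored):
    """Recover the shortest decimal text that maps to the same double."""
    return repr(float(stored))


typed = "1234.1234"
stored = stored_form(float(typed))

print(stored)                 # what a naive reader of the XML would grab
print(as_typed(stored))       # shortest text that round-trips to the same double
print(Decimal(float(typed)))  # the exact binary value behind both strings
```

A reader that converts the stored text back to a double and then asks for the shortest round-trip representation gets the typed text back, but it has to know to do that, which is the kind of knowledge the excerpt says is left undocumented.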

As an aside, the stored value does not use the locale (it always uses the dot as decimal separator), therefore we have to assume this is all US English. If we wrote software in Excel VBA that grabs the value in cells, then processes it, there is no way we could migrate our VBA code to work with this XML part without substantial rework. We are left with Excel's own international implementation artefacts, undocumented.


[On the Alleged Deprecated VML]

Contrary to what the ECMA 376 documentation says in many places, VML drawing parts are not deprecated at all. VML is in fact pervasive in Word, Excel and PowerPoint documents, which makes the problem all the more blatant.

Here is a way to create a VML part:

  • start Excel 2007 and create a new spreadsheet
  • right-click and choose Insert Comment
  • enter a comment
  • save the spreadsheet (xlsx file)
  • close it, unzip it
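Instead of unzipping by hand, the last step can be scripted. This is a minimal sketch of ours: `vml_parts` is a hypothetical helper that simply lists zip entries ending in `.vml`, and `demo_package` builds a tiny in-memory stand-in whose entry names are illustrative assumptions, not a dump of a real Excel file.

```python
import io
import zipfile


def vml_parts(package):
    """List zip entries that look like VML drawing parts (by '.vml' suffix)."""
    with zipfile.ZipFile(package) as zf:
        return sorted(n for n in zf.namelist() if n.lower().endswith(".vml"))


def demo_package():
    """Build a tiny in-memory stand-in for a saved .xlsx (hypothetical names)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("xl/workbook.xml", "<workbook/>")
        zf.writestr("xl/comments1.xml", "<comments/>")
        zf.writestr("xl/drawings/vmlDrawing1.vml", "<xml/>")
    buf.seek(0)
    return buf


print(vml_parts(demo_package()))
```

To try it on the spreadsheet saved in the steps above, pass the path of the .xlsx file to `vml_parts` instead of the demo package.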

[On Bad Packaging]

The underlying architecture of how zip entries relate together is called by Microsoft "open packaging conventions". What it means is that zip entries are not independent, or even related by way of a single master zip entry which would work as a directory of all zip entries of relevance. There is a logical tree of entries which uses separate zip entries to define relations between zip entries. The logical tree has nothing to do with the physical tree of zip entries in a package, despite Microsoft continuously using screenshots of Windows XP's built-in ZIP folders to mimic a folder hierarchy.

The problem with such an architecture is that a part may or may not relate to another and there is no standard way to know. Often, there is an r:id attribute right in the content of some XML part that tells the application that there is a relation, but this is not standard. By the way, Microsoft's PDF fixed format competitor called XPS is also based on the same underlying architecture, except that the team who developed XPS did not quite want to play by the same rules as the Office team. For instance, the XPS main zip document entry relates to one or more XPS pages with an attribute such as Source="Pages/1.fpage". In other words, they are not using the r:id attribute, instead relying on their own mechanism. This makes it impossible for a generic library to know which part relates to which part, and it has an unfortunate consequence.

The unfortunate consequence is that, being unable to know whether a part relates to another part, you cannot know, when you delete a part, whether you are going to corrupt the document. The document becomes corrupt if it points or relates (implicitly or explicitly) to a missing part. It's unclear why Microsoft chose this way of doing things, obviously leading to internal chaos, instead of just copying the research from the OpenOffice project, where a central directory is used (the OpenOffice ZIP initiative predates Microsoft's by at least three years, despite Microsoft stealing the thunder).

When you don't know the dependencies of a part, the consequence is obvious: you leave those parts alone. If you do this enough times, it clutters up the package, and soon enough you end up with a package containing any number of parts that are there for reasons god only knows. Add to this that you can add a part of any content type (arbitrary MIME type), and you have a recipe for disaster. Among other things, viruses could proliferate.
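The explicit half of the problem can at least be inspected. The sketch below is ours, not Rodriguez's: it walks the `*.rels` entries of a package, resolves each `Target` relative to the part that owns the relationship, and reports the parts nothing points at. The helper names are hypothetical. Per the excerpt, the result is only a lower bound: a part that looks unreferenced here may still be referenced implicitly by something no `*.rels` entry reveals.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the relationship entries (*.rels) of a package.
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def source_of(rels_name):
    """Map a relationship entry name to the part it describes.

    '_rels/.rels'                -> ''  (the package root)
    'xl/_rels/workbook.xml.rels' -> 'xl/workbook.xml'
    """
    rels_dir, rels_file = posixpath.split(rels_name)
    parent = posixpath.dirname(rels_dir)  # strip the '_rels' folder
    return posixpath.join(parent, rels_file[:-len(".rels")])


def explicit_targets(zf):
    """Return the internal part names that some *.rels entry points at."""
    targets = set()
    for name in zf.namelist():
        if not name.endswith(".rels"):
            continue
        base = posixpath.dirname(source_of(name))
        root = ET.fromstring(zf.read(name))
        for rel in root.iter(REL_NS + "Relationship"):
            if rel.get("TargetMode") == "External":
                continue  # points outside the package
            target = rel.get("Target", "")
            if target.startswith("/"):
                resolved = target.lstrip("/")  # absolute within the package
            else:
                resolved = posixpath.normpath(posixpath.join(base, target))
            targets.add(resolved)
    return targets


def unreferenced_parts(zf):
    """Parts that no relationship entry points at (explicit relations only)."""
    infrastructure = {"[Content_Types].xml"}
    names = {n for n in zf.namelist()
             if not n.endswith(".rels") and not n.endswith("/")}
    return sorted(names - explicit_targets(zf) - infrastructure)
```

Usage is `with zipfile.ZipFile("some.xlsx") as zf: print(unreferenced_parts(zf))`, where the file name is a placeholder. Deleting what it reports is exactly the gamble the excerpt warns about.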

[On The Lack of Internationalization Support]

An important ongoing tension with Office documents is the support for locales. Microsoft historically used a number of mechanisms to address this need, but they kept evolving and Microsoft aggregated all mechanisms to keep compatibility with older versions. What was hidden is being surfaced with the new XML. Anything that gets displayed, calculated, rendered or stored depends one way or another on a complex and undocumented combination of locale settings, including: the Office application language, the Office application language settings (per application), the Office document language settings (per document), and the system locale of the operating system.

To save them time, Microsoft chose to store XML using the US English locale regardless of all settings above.

This has an unfortunate consequence for implementers or those willing to make a manual change. Indeed, Microsoft is forcing everybody else to adapt to US English locale options (separators, date formats, formula conventions, ...) despite the fact that when using Office interactively, this fact is hidden from the user.
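For numbers, the adaptation being forced on implementers looks roughly like this. The sketch is ours and covers only the simplest case: it takes a numeric string stored with a dot as decimal separator and renders it with the separators a user would expect to see. The helper `to_display`, its parameters and the fixed number of decimal places are all hypothetical choices; dates and formulas, which the excerpt also mentions, are not handled.

```python
def to_display(stored, decimal_sep=",", group_sep=".", places=2):
    """Format a dot-decimal numeric string with display-locale separators."""
    value = float(stored)                     # stored text always uses '.'
    us_text = format(value, f",.{places}f")   # US-style text, e.g. '1,234.50'
    # Swap both separators in a single pass so they cannot clobber each other.
    swap = str.maketrans({",": group_sep, ".": decimal_sep})
    return us_text.translate(swap)


print(to_display("1234.5"))             # comma as decimal separator
print(to_display("1234.5", ".", ","))   # US English conventions
```

Every consumer of the XML has to carry a layer like this, in both directions, for every kind of value that Office localizes on screen.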

...

We are talking about two decades' worth of internationalization issues, for Office-related locale issues and Windows-related locale issues. To get an idea of how bad the situation is, suffice it to say that a Microsoft employee on the Windows internationalization team has a blog where he posts daily horror stories.


Comments


Just think of all the fun things you could do with over six contradictory methods of formatting text in one file format, a fist full of buffer overflows and a heart full of malice aforethought!

It curls my toes too, and so, I don't intend implementing MS OO XML aka ECMA 376.

