Thursday, 14 February 2008

Existing corpus of binary documents.

Malaysia raised 23 comments against the OOXML draft standard at the end of the five-month review period. By January 14th, Ecma had responded with a set of proposed dispositions for Malaysia to review. One of the items we raised was about how OOXML handles dates.

I wrote a blog post on this issue "Malaysia's history is ill-formed" back in June 2007, to illustrate what is wrong with the spec, why it's not good for users and how to fix it. I even discovered that the predecessor to MSOOXML, the Microsoft Office 2003 XML file format (MSO2K3XML), actually solved the problem by fully implementing ISO 8601 date format!

MSO2K7 actually supports ISO 8601 dates ... but ironically, only when it writes to OOXML's predecessor file format, MSO2K3XML, which was touted as the next big thing in 2003 but subsequently forgotten within the year.

Ecma's response:

Proposed Disposition

We agree that it is important for SpreadsheetML to support ISO 8601, and we propose the following changes to allow dates in the ISO 8601 format. These changes update the description of date and time representation within SpreadsheetML. They also add the necessary schema to support the added format, and they provide examples of using ISO 8601 dates in SpreadsheetML.

Additionally, there was a request to remove one of the existing serial date base systems, but in order to maintain compatibility with the existing corpus of binary documents, those date bases will remain in the specification

To its credit it goes on to describe how to use the ISO 8601 date formats within the spec. But what is peculiar is that it still includes the old way of using the serial date base within the spec. The new ISO dates are only "recommended" while the old serial dates are still normative (i.e. must be implemented if a third party needs to write applications to support OOXML).

Corpus Porkus

The justification for including the arcane date encoding method is to "maintain compatibility with the existing corpus of binary documents". Ecma would argue that they are protecting current customers' interests, in that those customers have a huge investment in documents in the old format, and therefore OOXML should be compatible with them.

There are two ways of handling this "existing corpus of binary documents". The first is to leave it as it is and hope that, in the future, filters will always be around to open the files. The second is to convert the documents to an archival format or migrate them to a future-proof file format.

The first way is risky. People would not like to rely on any vendor, let alone a tech company, being around for the next 50 - 100 years. The second way is possible, but painful in the short term.

Do nothing or Convert. Those are the choices. OOXML is not a magic device which is "compatible" with any of the binary file formats. It's merely another file format.

The magic lies in the file converters or translators. Translators know how to convert from one format to another. The more a translator knows about the two formats, the higher the fidelity of the conversion. So if Ecma were really worried about the existing corpus of binary documents, it should start work on fully specifying the Binary File Format (BIFF) of Microsoft Office, macros and all. Hopefully we hear good news tomorrow (15th Feb), when it's due to be (re-)released by Microsoft. And pigs may fly over the Petronas Twin Towers 2.

So it doesn't matter if an OOXML application needs to have its dates stored as serial numbers internally. It can easily save them in the ISO 8601 date format. All the translation between string and serial number is done during the File Load/Save process, where the computer converts the internal representation to a user-friendly, XML-readable format.
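To make the Load/Save point concrete, here is a minimal sketch in Python of that translation. The function names are my own and are not taken from any spec or product. It assumes the commonly documented behaviour of the two serial date systems: in the 1900 system serial 1 is 1900-01-01 and serial 60 is a 29 February 1900 that never existed, while in the 1904 system serial 0 is 1904-01-01. Only whole days are handled; times of day are left out.

```python
from datetime import date, timedelta

# Assumed epochs (see the lead-in): 1900 system counts from 1899-12-31,
# 1904 system counts from 1904-01-01.
_EPOCH_1900 = date(1899, 12, 31)
_EPOCH_1904 = date(1904, 1, 1)


def serial_to_iso(serial: int, date1904: bool = False) -> str:
    """Save step: turn the internal serial number into an ISO 8601 date string."""
    if date1904:
        return (_EPOCH_1904 + timedelta(days=serial)).isoformat()
    if serial < 1:
        raise ValueError("the 1900 date system starts at serial 1")
    if serial == 60:
        raise ValueError("serial 60 is the non-existent 1900-02-29")
    # Every serial after the phantom leap day is one too high, so step back a day.
    days = serial if serial < 60 else serial - 1
    return (_EPOCH_1900 + timedelta(days=days)).isoformat()


def iso_to_serial(text: str, date1904: bool = False) -> int:
    """Load step: turn an ISO 8601 date string back into a serial number."""
    d = date.fromisoformat(text)
    if date1904:
        return (d - _EPOCH_1904).days
    days = (d - _EPOCH_1900).days
    # Re-insert the phantom leap day for dates on or after 1900-03-01.
    return days if days < 60 else days + 1
```

For example, `serial_to_iso(39492)` gives `"2008-02-14"`, and `iso_to_serial("2008-02-14")` gives back `39492`. The quirks stay inside the converter, and the file on disk only ever sees the ISO string.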

If they say "Oh no! Then it will not be compatible with our customers' customized programs which go through millions of Excel rows!", the answer is simple. Your customers will have to rewrite all their customized programs anyway to support your new file format, OOXML, regardless of whether you use ISO 8601 or serial numbers.

If they say "But there already exist a few thousand Ecma 376 files in the wild which have encoded dates as serial numbers!", I can only respond: bad, bad excuse. It's the vendor's own risk to base its product on a file format which is immature and liable to change during the standardisation process. Don't expect the National Bodies to compromise just because of a vendor's zeal to rush out a product in the full knowledge that the standardisation process will cause the specification to change. The vendor has to deal with it, which means product recalls, patches or freely available converters.

[Update, 1am 15th Feb: I just remembered that Doug Mahugh had something to say about the risks of implementing Ecma 376 content creators before DIS 29500 was approved:

"Well, it's too early for other vendors to commit to this file format. After the BRM (Ballot Resolution Meeting - in February 2008) there may be changes to it, so it is risky, and may not make commercial sense to implement OpenXML as it is at the moment."

      - Doug Mahugh, TechEd 2007, Kuala Lumpur Malaysia. ]

Back to the date issue.

Malaysia is obviously not the only one worried about this issue. The Czech Republic, Denmark, France, Britain, India, Ireland, Kenya, the Philippines, the USA, and many others have all submitted comments on it. What is interesting is the work of Antonis Christofides of the National Technical University of Athens, a committee member of the Greek National Body. The item was thoroughly reviewed, and a constructive document was prepared which is worth reading.

"Alternative Disposition on Dates"

It's still in its draft stage, but it seems very well thought out. It is presented in a very visual manner and is easy to understand. The solution is elegant in that ultimately only one form of encoding is used, while the hard work is done during the conversion process, which handles the fringe case of a formula's numeric result being displayed as a date.

Greekdate

The Greek contribution goes on to compare how other applications have handled this problem, and demonstrates prior work on how it can be done:

The way to address the problem is similar to what has been done in OpenOffice and Open Document Format (ODF). ODF dictates that timestamps are stored as timestamps, leaving it to the application to handle legacy conversions. While OpenOffice Calc apparently treats timestamps in the same way as Microsoft Excel, in fact it includes underlying conversions so that it properly stores timestamps as required by ODF.

I think the Greek solution is sound, and I would recommend that the National Bodies support it at the BRM in a few weeks' time.

The ramifications

Of course this is not just about dates. It's about all the issues where Ecma's excuse in its proposed dispositions was "No we can't fix the spec, we got to keep the way things are, because that's how it's always been and there are billions of documents out there. So tell you what we can do, let's just add in your suggestion as an additional solution, just to complicate matters further. kthxbai"

To highlight the additional complexities, the proposed disposition also includes new XML elements called valIso, maxValIso and minValIso. These are meant to complement the val, maxVal and minVal already in the spec. Oh, and if valIso is in the cell, use it and ignore val. But val is not really val, as it depends on the epoch and on whether the year is 1900.
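As a rough illustration of what that means for anyone writing a reader, here is a hedged Python sketch. The `resolve_date` function and the dictionary shape are hypothetical and only stand in for however a real parser would expose val and valIso. The serial rules are the commonly documented ones (1900 system with its phantom 29 February 1900, and the 1904 system), not text quoted from the disposition.

```python
from datetime import date, timedelta


def resolve_date(attrs: dict, date1904: bool = False) -> date:
    """Pick the calendar date out of a cell-like mapping (hypothetical shape)."""
    # Branch 1: the ISO 8601 form wins whenever it is present; val is ignored.
    if "valIso" in attrs:
        return date.fromisoformat(attrs["valIso"][:10])  # keep only YYYY-MM-DD

    # Branch 2: otherwise fall back to the serial number and drop any time fraction.
    serial = int(float(attrs["val"]))

    # Branch 3: the meaning of the serial depends on which epoch the workbook uses.
    if date1904:
        return date(1904, 1, 1) + timedelta(days=serial)

    # Branch 4: the 1900 system carries a leap day that never happened.
    if serial == 60:
        raise ValueError("serial 60 is the non-existent 1900-02-29")
    days = serial if serial < 60 else serial - 1
    return date(1899, 12, 31) + timedelta(days=days)
```

Four branches to read one date. With a single ISO 8601 encoding, only the first branch would be needed, and the rest would live in the converter for old files.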

The "existing corpus of binary documents" is Ecma's stock answer to most of Malaysia's comments. Instead of cleaning things up, they give the impression that they are brushing things under the carpet and putting the burden of document fidelity on the shoulders of future developers instead of addressing it today. This is a fixable problem which can be handled by today's conversion software. Let's put an end to the propagation of 20-year-old bugs once and for all.

yk.

[Update, 1am 15th Feb 2008: Identification of the Greek committee member, and Doug Mahugh's comment on the risks of implementing Ecma 376 before it's ratified as an ISO standard]

Comments

How do you convert Macros from Excell97 to ECMA376?
