« Rick Jelliffe - myths debunked? | Main | Is VML in or out now, or was that a typo? »

Friday, 22 June 2007

Will Readability hinder Performance?

One of the main reasons people opt for text based formats is for human readability. XML is ideal for the encoding of information which retains the ability for humans to interpret the information via a simple text based viewer. These human-friendly benefits should NOT be undermined by any efforts to "optimize" it for a specific application to perform better.

Attributes need to be descriptive. Values need to have units. Naming conventions should be consistent. Abbreviations should be avoided. Namespaces should be clear. These are golden rules in ensuring that your format is as accessible as possible.

My previous post about how the Microsoft Office Open XML (MSOOXML) file format handles (or mis-handles) dates brought on quite abit of discussion. One of the reasons was because it was for "speed and optimisation issues"

This is an interesting thing. I remember way back last year, Alan Yates, from Microsoft, was complaining about ODF's performance against MSOOXML. Actually it was OpenOffice.org versus Microsoft Office. Or rather OpenOffice Calc versus Microsoft Office Excel in calculating a spreadsheet. What was curious was that he commented:

Alan Yates, the general manager of Microsoft's information worker strategy [responsible for the Microsoft Office line of products], told ZDNet UK on Wednesday. "The Open XML format is designed for performance. XML is fundamentally slower than binary formats so we have made sure that customers won't notice a big difference in performance."

So how do you make loading faster if XML is fundamentally slower? Perhaps you map it as close as possible to the binary format? You sacrifice on well known XML structures so it conforms better to the architecture of your product? Perhaps forget about the user friendly units and just spew out different machine friendly values? Otherwise you could just output binary bitfields if its too much trouble mapping it out to something useful?  Forget international developers, just give the attributes names which the internal Microsoft programmers use with all the different naming conventions.

Whenever someone pushes "performance" over other factors in their code, I am always skeptical. Speed over readability is almost always not a good thing. It hides future problems of maintainability and expandability. It restricts innovation. Performance is extremely important if you are programming a graphics intensive computer game and every 1% of improvement is important, but in the domain of office productivity however, this should not be so a priority.

There will be some hardcore number crunchers who would benefit from faster processing, but if they are waiting 25 minutes for a 'sheet to calculate 5.5million cells, I would suggest them re-thinking their strategy on relying on a spreadsheet for this sort of work. Contract a programmer to re-architect the solution and re-write it with proper tools using a RDBMS with a customized UI. The time this project saves would pay eventually for itself.

Let us take a more technical look at this issue. We hear that MSOOXML would rather not support ISO 8601 dates for the sake of performance. Instead MSOOXML opts for a confusing, limited, and un-user friendly number format.

I just found out that back in January 2007, Rick Jelliffe declared:

 "The use of numeric forms for storing dates is obviously the result of a conscious trade-off of the load/parse time of dates versus the precalculation benefits of numbers. It might not be the format the we would choose if we had any expectation that the numbers would be entered by hand or read unmediated by a formatter, but for spreadsheets (and scientific data) it is not a high priority requirement; instead, size and load/parse time issues certainly are important."

Additionally, hAl, added to the discussion at the Grokdoc EOOXML objections page:

"The ISO 8601 standard is extremly extensive and put a lot of extra implementation effort to fully implement which is why w3c implemnets the subset which is reused by OOXML."

ISO 8601 looks like parsing this representation "2007-06-22T02:02:02" versus a number representation like "39255.6071286466" is far harder to implement and longer to execute.

How much slower is parsing these "extensive" dates? There is only one way to find out, and that is to write a program to parse and compare these different dates.

So I fired up vi, and wrote a C program, and compiled it using  gcc to do just that. That said, it has been over 10 years since I wrote my last C program, so please be easy on the quality of my code.

This program basically creates millions of dates in both formats and holds it in memory. I have removed the I/O factor from this because reading from the disk would take approximately the same time for both encodings.
The program will then keep track of the duration it takes to parse through the numbers, and ultimately produces an average time taken. The more numbers to crunch, the better the average you'd get. So I ran it with 5,555,555 dates. Thats over 5 million dates.

yky@x1407:~/src/isodates$ ./iso 5555555
Count: 5555555
long: 1182449222  ISO 8601: 2007-06-21T18:07:02

Parsed 5555555 IntDates.
      vInt[5555554]:  '1238004772' -> 1238004772 -> 2009-03-25T18:12:52
      Total time: 1.030s
  Each: 185 ns x1.00

Parsed 5555555 ISODates.
      vISO[5555554]:  '2009-03-25T18:12:52' -> 2009-03-25T18:12:52
      Total time: 1.560s
  Each: 281 ns  x1.51

As you can see, to calculate over 5.5million dates, my 1.7GHz Centrino laptop (about 2 years out of date now) takes 1 second to parse the numeric data type. This means that it takes a mere 185 micro-seconds to process one date.

[Update: 070716: First report was calculated in micro seconds. Its off by a factor by 1000. Its actually nano-seconds] This needs emphasising: this is 185 nano-seconds. Or 0.185 micro-seconds. Or 0.000185 milli-seconds. Or 0.000000185 seconds. In a blink of an eye (1 second), over 5 million dates can be converted from strings to the native date datatype.

To parse an ISO 8601 date, it takes 281 nano-seconds. A cynical statistician would tell you that this is over 50% slower than processing a native date type. It may seem slow, but if we were to put this in perspective, the total time taken to process 5.5million dates is only 1.56 seconds!

This means that the added benefits of ISO 8601 dates (bug-free, humanly readable, with unlimited range) only cost us half a second extra for 5.5 million dates.

The proponents against human friendly encodings will then backpedal and ultimately change tact and a new argument would usually take this form; as Mr Jelliffe commented a few days later in January 2007 in the same thread:

As for the argument that speed does not matter, in general I agree, but that does not mean it never matters. The particular always overrides the general. In the case of office documents, we are talking about a few seconds difference, but this is multiplied by hundreds of millions of users. You are talking of billions or hundreds of billions of seconds of lost productivity. I think there is even a good case to be made that for office document formats, efficiency of loading/opening should be a prime requirement.

Perhaps then to save the hundreds of billions of lost productivity Operating System vendors should cut down on computer boot-up times. Don't load up that fancy 3D GUI! Don't run any annoying taskbar notifications! I believe that much "productivity seconds" could be saved in many other areas, but compromising on readability on dates for performance would reduce very little real time. 0.56 seconds is nothing compared to the 60 seconds "wasted" during the boot-up of unnecessary applications.

What is the price of waiting for the added benefits of readability, usability and flexibility? Also please remember that hardly anyone populates spreadsheets with over 5 million dates. Hence the 50% penalty is a price I am more than willing to pay for readability. Plus I am banking on Mr. Moore and his Law to subsidise the cost in the near future.

Now we come to something more pertinent.

If you can remember, I brought up some issues regarding the contradictory definitions of Percentage units in MSOOXML back in January 2007 To recap, instead of having four different forms of encoding Percentages, I recommended to Ecma to adopt a more "standardised" use of units, which is similar to what all HTML and CSS programmers around the world are used to. The examples of the recommendation was to use something like "tableWidth = 80%" or "zoomFactor = 40%" or "shadingCoverage = 62.5%"

Personally, I thought it was a good solution. However, the J1N8530-22 Ecma Responses to Comments and Perceived Contradictions PDF released after the Fast Track comments were submitted, basically brushed aside all the careful deliberations, issues and recommendations not just by Malaysia, but by all other National Bodies.

Their dismissal of this particular recommendation read:

"In §2.15.1.95, "zoom (Magnification Setting)", §2.18.97, "ST_TblWidth (Table Width Units)", and §5.1.12.41, "ST_Percentage (Percentage)", the percentage values allowed have been limited to integers to permit efficient parsing and processing. (Use of the % symbol and allowing non-integer decimal values would introduce parsing complexity.)"

Read that again: "the percentage values allowed have been limited to integers to permit efficient parsing and processing." and "Use of the % symbol and allowing non-integer decimal values would introduce parsing complexity"

I just wonder how all the web browsers do it on a daily basis when evaluating the HTML and CSS table cell widths so efficiently?

The term "non-integer decimal values" sounds so "difficult". Non-integer decimal values in this case actually means "Floating point numbers." We are no longer in the days where math-coprocessors are a  rarity. It may have been an exotic feature on your CPU 15 years ago, but nowadays, its a commodity.

To test this claim by Ecma, I modified my program to also cater for floats. I then processed 5.5 million non-integers decimals with the % symbol appended and this is what I got:

yky@x1407:~/src/isodates$ ./iso 5555555
Parsed 5555555 Percentages.
      vPerc[5555554]:  '55.54%' -> 55.54 -> 55.54%
      Total time: 4.530s
  Each: 815 ns  x4.12

Yes, it took longer. It took over 4 times longer than a date. But note; dates in MSOOXML are actually formed as 39255.6071286466. And guess what? They are non-integer decimal values too! The only difference? Its a trailing "%" percentage sign in the string.

Because my C skills are not as sharp, to decode "55.54%", I just used the standard built in function:

sscanf( vPerc[i], "%f%%", &vpct );

This obviously is not the best way to parse a Percentage. Given a little bit more effort, a more efficient way can be found. It is hard to believe that parsing an ISO 8601 string with so many more elements (year, month, day, hour, min, sec) is faster than parsing a HTML percentage string, but writing the fastest parser is not the purpose of this exercise.

The purpose of this entry is to show that even if a "non-integer decimal value with a trailing percentage symbol" took 4 times longer than a "non-integer decimal value", the difference to read 5.5 million numbers is a mere 3 seconds. Surely we can trade at most 3 seconds for a readable and human friendly representation of your data?

This commentary would hopefully put to rest the myth that user friendly data-types in XML would cost significantly in performance when parsing through the files. We are fortunate nowadays to live in a world where multi-gigahertz computers are the norm. Faster technology has also spurred the popularity of XML where users do not have to worry about the cost vs benefits of human friendly representations of their data versus their applications performance.

If there are vendors, consultants contracted by vendors, or standards bodies controlled by vendors out there who state that the reason why their version of XML is so hard to read because it has been optimized for speed, then my little C program clearly demonstrates that encoding values in human readable forms cost very little in terms of performance and should not be traded away so cheaply.


yk.

[Update: 070716 - Thanks to Brian Jones' calculation I realised that my units stated was off by a factor of 1000. The units should be in nano-seconds instead of micro-seconds. So to calculate 24 x 10,000 dates in a spread sheet (total = 240,000 dates), it should take no longer than 0.07 seconds, and not 67seconds as per his calculations ]

/*
070621  yoonkit@gmail.com
    Program to calculate how long it takes to parse ISO 8601 datetime
    and the native longint datetimes.
    To make:
        gcc iso.c -o iso
    On my 1.7GHz Centrino:
        '2007-01-20T07:38:38' takes 222μs
        '2007-01-20' takes 168ns
        1232437118  takes 124ns

070622  yoonkit@gmail.com
    Added calculations for percentages. Using sscanf( "%d%%" ).. seems slow
        '54%'    takes 466ns

070625  yoonkit@gmail.com
    Released on /2007/06/will-readabil-1.html
    License: GPL v2 and above. Please fix any silly mistakes.

Responses to Comments and Perceived Contradictions
J1N8530-22 Ecma Responses to Comments and Perceived Contradictions
In §2.15.1.95, “zoom (Magnification Setting)”, §2.18.97, “ST_TblWidth (Table Width Units)”, and §5.1.12.41, “ST_Percentage (Percentage)”, the percentage values allowed have been limited to integers to permit efficient parsing and processing. (Use of the % symbol and allowing non-integer decimal values would introduce parsing complexity.)

*/

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define max_size 30
//#define ISOfmt "%Y-%m-%d"  // short form. 168ns
#define ISOfmt "%Y-%m-%dT%H:%M:%S"  // long. 230ns

int main(int argc, char *argv[])
{
    time_t time_now;
    clock_t vstart, vend;
    float vduration, vbase, vpct;
    struct tm *time_ptr, otime;   
    char vstr[max_size];
    char **vISO, **vInt, **vPerc;
    int i, vcount;

    if (argc > 1) vcount = atoi( argv[1] ); // get command line
    else vcount = 1000000;    // .. or default to a million   

    printf("Count: %s \n",  argv[1]);

    time( &time_now ); // Now!

    time_ptr = gmtime( &time_now );
    strftime( vstr, max_size, ISOfmt, time_ptr );

    printf( "long: %d  ISO 8601: %s \n", time_now, vstr );

    vISO = (char **)malloc( vcount * sizeof(int *) );   
    vInt = (char **)malloc( vcount * sizeof(int *) );   
    vPerc = (char **)malloc( vcount * sizeof(int *) );
    for( i=0; i < vcount; i )
    {
        vISO[i] = (char *) malloc( max_size * sizeof( char ) );
        vInt[i] = (char *) malloc( max_size * sizeof( char ) );
        vPerc[i] = (char *) malloc( max_size * sizeof( char ) );
        time_now = 10;
        time_ptr = gmtime( &time_now );
        // converting from time to strings and storing it into arrays
        strftime( vISO[i], max_size, ISOfmt, time_ptr );
        sprintf( vInt[i], "%d", time_now );
        sprintf( vPerc[i], "%3.2f%%", ((float)(i%10000))/100.0 );
    }

    // ====  Parsing Integer Dates

    vstart = clock(); // mark start time of decoding
    for( i=0; i < vcount; i )
//        sscanf( vInt[i], "%d", &time_now ); // sscanf is slow: 640μs
        time_now = atoi( vInt[i] ); // converting from string to longint
    vend = clock();

    vduration = vend - vstart;

    time_now = atoi( vInt[vcount-1] );
    time_ptr = gmtime( &time_now );
    strftime( vstr, max_size, ISOfmt, time_ptr );
    vbase = vduration;

    printf( "Parsed %d IntDates. \n  "
        "    vInt[%d]:  '%s' -> %d -> %s \n  "
        "    Total time: %3.3fs \n "
        " Each: %3.0f ns x%2.2f\n" ,
        vcount, vcount-1,
        vInt[vcount-1], time_now, vstr,
        vduration/1000000,
        (1000 * vduration) / vcount, 1.0 );

    // ==== Parsing ISO 8601 dates

    vstart = clock(); // mark start time of decoding
    for( i=0; i < vcount; i )
    {
        // converting from string to date using "strptime"
        strptime( vISO[i], ISOfmt, &otime );
        //if (strptime( vISO[i], ISOfmt, &otime ) == NULL)
        //    printf("Couldnt not convert \n\r");
    }
    vend = clock();

    vduration = vend - vstart;

    strptime( vISO[vcount-1], ISOfmt, &otime );
    strftime( vstr, max_size, ISOfmt, &otime );

    printf( "Parsed %d ISODates. \n  "
        "    vISO[%d]:  '%s' -> %s \n  "
        "    Total time: %3.3fs \n "
        " Each: %3.0f ns  x%2.2f \n" ,
        vcount, vcount-1, vISO[vcount-1], vstr, vduration/1000000,
        ( 1000 * vduration / vcount ), vduration/vbase );

    // ====  Parsing Percentages

    vstart = clock(); // mark start time of decoding
    for( i=0; i < vcount; i )
        sscanf( vPerc[i], "%f%%", &vpct ); // converting from percentage str to float
    vend = clock();

    vduration = vend - vstart;

    sscanf( vPerc[vcount-1], "%f%%", &vpct );

    printf( "Parsed %d Percentages. \n  "
        "    vPerc[%d]:  '%s' -> %3.2f -> %3.2f%% \n  "
        "    Total time: %3.3fs \n "
        " Each: %3.0f ns  x%2.2f\n" ,
        vcount, vcount-1,
        vPerc[vcount-1], vpct, vpct,
        vduration/1000000,
        ( 1000 * vduration / vcount ), vduration/vbase );

    return 0;
}

TrackBack

TrackBack URL for this entry:
http://www.typepad.com/t/trackback/686627/19489666

Listed below are links to weblogs that reference Will Readability hinder Performance?:

Comments

It would actually be more relvant to use short integers with values less than 100k and compare them to iso dates in somthing a spreadsheet actually does.
Which would be:
Converting the value from an input file to an internal spreadsheet value.
Recalc the spreadsheet (for instance represented by adding or subtracting a period to each date) as would be required by the loading of a sheet and then rewriting it back to an output file.
Also your example seems to lack that when reading a cell with an ISO date that it requires preparsing before conversion to see what is the actual kind of ISO date value it is representing as ISO dates can be short, dates, long dates or periods and also I miss the parsing of timezone information and recalculation the spreadsheet based on that timezone information.

Could you enhance you test to read 1 million arbitrary format ISO date fields, parse the information to determine what ISO date format is used and whether a timezonecorrection is needed, convert it to an internal spreadsheet format and do an arbitrary recalculation with them and then revert them back to their prior ISO arbitrary date format representation and write them back into the file ?

That would seem a more realistic test of actuall speed

Hmmm I forgot to mention doing XML schema validation for each cell of course. (I do hope that something like an Office implementation would use XML validation).
Validation would also eat time.

Jeffrey,

I use the standard C library function called "strptime" to do the conversions for me. It just works. This is not a case of optimizing or making XML documents faster. It is just to illustrate how fast, ball park figure, a ISO date parses versus a numeric date both stored as strings.

How OpenOffice.org or Microsoft Office optimizes the parsing and their overheads is beyond the scope. This program just shows that parsing ISO dates has a overhead, but that overhead is small.

Your additional tests are unfair. Do you see timezone information encoded in a Excel float? Do you have to validate the numeric dates? Do the numeric dates change formats (4532.3453242; 4.532345e4; 1101001011101110111)?

> Could you enhance you test to read ...

The source code is available, so why dont you do the modifications yourself and post your results?

Regards,

yk.

I do not know C so I cannot interprete your code very well. To me your code could be optimised so it fully runs in superfast L1 processorcache or even with data all in internal CPU registers whilst that cannot be the case if the data has to read from external memory and written back. Because your program does actually use a document a a basis it seems a strange way to test document format performance optimalisation. That Is why I proposed you do something a good spreadsheet program would always do.

Read the dates (depending on testscript it should work for mixed type of ISO dates OR mixed types of decimal dates (integers or floats).
Validate the ISO dates/decimal dates (required for XML files)
Recalculate the data to resemble to initiate all of the spreadsheet.
Rewrite the data in a way resembling their original format (rewrite a ISO 8601 period as a period and an integer as an integer).

From your current test I cannot see if the performance is optimised in a way that could never happen in opening and parsing a real document. Mayby somone with better programming skills can do such a

That should actually read:

Because your program doesN'T actually use a document as a basis it seems a strange way to test a document format performance optimalisation

> To me your code could be optimised
> so it fully runs in superfast L1 processorcache
> or even with data all in internal CPU
> registers whilst that cannot be the case if
> the data has to read from external memory
> and written back.

Hi Jeffery,

This argument was put forward by Rick Jelliffe at the comment area: http://www.oreillynet.com/xml/blog/2007/07/what_is_an_iso_8601_date.html

However the rebuttal is that in the days of megabyte sized CPU cache, even if you read in millions of dates in spreadsheets, you will have code running off the cache. So loading up many rows and columns of dates in a spreadsheet will also be cached, not unlike this tight loop.

In terms of memory usage, as you can see from the code, the memory allocated is interleaved between the three different arrays. As the program runs I can actually see the memory being consumed, taking up about 500MB. This is definitely too big to fit inside the CPU's data cache.

Another point is that this program is not to find the most optimized way of decoding ISO dates or numeric values. I will leave that to the experts. However using the built in C functions, without any optimization taken, written by a novice programmer like myself, you can see that ISO dates dont take that much longer to parse over numeric numbers.

Im not sure why you are fixated at the changing formats of ISO dates. If the MSOOXML spec stated that datetimes should be encoded as "%Y-%M-%DT%H:%m:%s" then that will be the standardised format it would be represented in. There will be no need to cater for different cases, nor will there be need to validate the inputs. Look at the output of MSOffice 2003 XML formats. Its fixed. Even when the time is not provided, the ISO date is output at 00:00:00.

You also said in the previous post, that:

> the MS Office XML 2003 formats were
> notoriouly slow in use and now Micrsoft
> has expanded spreadsheet sizes by huge amounts,
> performace could be an issue again.

Look at the comments in the Brian Jones post.
http://blogs.msdn.com/brian_jones/archive/2007/07/11/wordprocessingml-document-model.php

I dont have a copy of MSOffice 2003 to see how much longer it takes to output a quarter of a million datetimes, but according to some people, it doesnt take that much longer. Perhaps you can verify this?


Regards,

yk.

Post a comment

If you have a TypeKey or TypePad account, please Sign In

Welcome to
Open Malaysia blog!

  • Bloggers @ Open Malaysia
    We are a group of individual bloggers working to build openness in Malaysia's ICT culture. Most of us have day jobs and a couple of us are students. Those with a job work for companies ranging from large international enterprises to self-run Malaysian start-ups.
    Email us at this address:
    open -AT- openmalaysiablog -DOT- com

Disclaimer...

  • We declare our independence of opinions from our employers, institutions, associations and clients, past and present. Thoughts and expressions in the Open Malaysia blog are rightly each blogger's own and each of us stand by what we individually write. Views by readers who post comments and others whose writings we link to in this blog are theirs.

October 2007

Sun Mon Tue Wed Thu Fri Sat
  1 2 3 4 5 6
7 8 9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31      

Subscribe to this site
- FeedBurner Feed

Subscribe to this site
- email alert options

Your email address:


Powered by FeedBlitz

Enter your email address:

Delivered by FeedBurner

Powered by TypePad