Activity Workshop
 

Microsoft OOXML

It used to be that if you saved a document in Microsoft Word, you saved it in "Word format" and it got the file extension .doc. You understood that when you saved it that way, you were effectively tying yourself to Microsoft, and forcing yourself to keep using (and buying) Microsoft products if you wanted to access your documents. Obviously that was a bad scenario, especially when a new version of Word came out with a new document format, and your older version of Word couldn't open the new documents which people sent you. Worse still, you paid Microsoft for the newer version of Word, and then you had problems opening your old documents saved in the old format!

This was the situation for many years with Microsoft enjoying market dominance and selling lots of copies of their office software for good money. Businesses and governments accepted the inevitable costs of upgrades and accepted the vendor lock-in because there appeared to be no choice.

Enter StarOffice, or as it later became known, Open Office. This competitor product was cross-platform (meaning you didn't even need to buy a Microsoft Windows system), cost nothing to use, was made available as open source, and the community was encouraged to help develop and improve it. It was able (no thanks to Microsoft) to open Word documents, sometimes with more success than Word itself, but there was another feature which proved appealing - you could choose to save your documents in an open format, clearly defined as an ISO standard, which removed the lock-in. You could choose to use Open Office now, safe in the knowledge that the file formats you were using were open and available to anyone to implement. If you wanted to switch to another word processor, you could. If you wanted to write your own converter tool, you could. The risk of losing your precious documents to an outdated and no-longer-supported (secret) file format was gone.

This caused interest in some quarters and dismay in others. Some businesses, and in particular governmental departments, liked the sound of open, ISO-approved document standards, and relished the prospect of actually being able to negotiate for word processor software. Open document formats were clearly good for their own agendas, as well as bringing the very sensible opportunity to make information available to the public in a non-vendor-specific way. Requiring the public to buy Microsoft software just to access information from the government is clearly wrong. So a push to promote more open standards began. Of course this caused dismay in those companies who relied on vendor lock-in to ensure repeated upgrade fees for large profits.

So at this point there was Microsoft, with its closed, proprietary, binary formats, which it endeavoured to keep secret and which the open source community endeavoured to decipher and interpret. And there was Open Office, one of its main competitor products, with its open source software and clearly defined, open document standards. Microsoft needed a response to avoid being closed out of lucrative markets.

Microsoft's response

Microsoft clearly needed to change its document formats in response. Given that one of the main competitors was called "Open Office", and given that Open Office stored its documents in XML format (zipped up to save space), what was Microsoft's response? They decided to store their documents in a new format, also as XML (zipped up to save space). And given that the name "Open Office" was the name of a main competitor, Microsoft decided to name its new format "Office Open XML". Can you believe it?

As if that wasn't confusing enough, Microsoft themselves also refer to this format on occasion (perhaps by mistake) as "Open Office XML". And they also insist on calling it "Open XML" as though it were the only open XML format around. One has to wonder whether this was just designed to cause confusion and mix-ups with the XML format used by Open Office. Some commenters use a more accurate name: given that it was created by Microsoft, for Microsoft, why not call it "Microsoft OOXML", or MOOXML for short, making it obvious which open XML you're talking about?

Okay, so apart from the naming, what have we got with this new document format? Remember, it's being promoted as an open format for anyone to implement as they wish. And if there's already an ISO standard for representing exactly this kind of document (the ODF standard used by Open Office), there should be some pretty compelling reasons why ODF isn't good enough, and why ODF can't be extended or enhanced to fulfill Microsoft's specific requirements.

Microsoft have applied for, and been granted, patents covering the use of OOXML. When they claim it is "open", it is due to their "covenant not to sue". So if other software developers try to implement OOXML, they put themselves in the position where they could be sued by Microsoft, except for this promise that they won't. Some contend that this is fine, and the promise is watertight. Others believe that the wording of the covenant allows Microsoft in principle to sue (or just threaten to sue) such competitor products at will. Richard Stallman of the Free Software Foundation is one of those who believe that the terms "do not allow free implementations".

OK, so what is the real goal of OOXML? Is it to provide a clean, uniform standard which will be as useful as possible to all those who want to store their documents in a free, open, portable way? No. That is quite clearly not the goal. The goal of OOXML is to represent, as faithfully as possible, all the features of Microsoft's legacy binary formats. And of course that includes all the quirks, bugs and inconsistencies of them too. So is OOXML good for everybody, including consumers, or is it just good for Microsoft? If it's just good for Microsoft, should it be an ISO standard?

One illustration of this is the file extensions chosen for the files meeting the OOXML specification. Instead of a general file extension relating to "office" or "open" or "xml", to match the name of the file format, Microsoft have defined three file extensions for word processing documents (.docx), spreadsheet documents (.xlsx) and presentation documents (.pptx). How Microsoft-centric is this open ISO standard?

How much specification does OOXML require? It may surprise many people to learn that the documentation is over 6500 pages long. Six and a half thousand pages! Of course, Microsoft Word is a complex program, but this isn't specifying how the program works, this is just for the file format. For comparison, the ODF format used by Open Office (which was already approved as an ISO standard) is specified in fewer than 900 pages. This suggests firstly that the format is overly complex, and poorly designed, with duplicate (and conflicting) definitions. Secondly, it is a huge task to read and review the specification, and implementing and testing any software based on it will also be a mammoth task.

Given that OOXML was submitted to the ISO as a potential worldwide, ubiquitous and open standard, why does it contain elements such as autoSpaceLikeWord95 (used by Microsoft Word 95), lineWrapLikeWord6 (used by Microsoft Word 6), useWord2002TableStyleRules (from Microsoft Word 2002) or useWord97LineBreakRules (from Microsoft Word 97)? Isn't it ridiculous that a forward-looking, "open" standard should contain such proprietary, vendor-specific kludges? The answer is obvious: they don't belong in an ISO standard. They belong in Microsoft's own converter code, which is required to translate their own inconsistent binary formats into the proper ISO standard.
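If you're curious whether a particular document carries any of these flags, you can look for yourself. What follows is only a rough sketch in Python, and it rests on assumptions which aren't stated above: that the settings live in a part called word/settings.xml inside the zip, that the flags sit as children of a w:compat element, and that the document uses the namespace shown. The helper names and the sample document are made up purely for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Assumption: the WordprocessingML namespace used by most .docx files in the wild.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def list_compat_settings(docx_file):
    """Return the names of the elements found under w:compat, if any.

    Assumes the settings part is word/settings.xml and that w:compat
    is a direct child of the root element.
    """
    with zipfile.ZipFile(docx_file) as archive:
        if "word/settings.xml" not in archive.namelist():
            return []
        root = ET.fromstring(archive.read("word/settings.xml"))
    compat = root.find(f"{{{W_NS}}}compat")
    if compat is None:
        return []
    # Strip the "{namespace}" prefix that ElementTree puts on each tag.
    return [child.tag.split("}", 1)[-1] for child in compat]


def make_sample_settings_docx():
    """Build a tiny, hypothetical stand-in for a real .docx, in memory."""
    settings = (
        f'<w:settings xmlns:w="{W_NS}">'
        "<w:compat>"
        "<w:lineWrapLikeWord6/>"
        "<w:useWord97LineBreakRules/>"
        "</w:compat>"
        "</w:settings>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/settings.xml", settings)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name in list_compat_settings(make_sample_settings_docx()):
        print(name)
```

To try it on a real file, pass the file's path to list_compat_settings instead of the in-memory sample. An empty list just means none of these settings were found where the sketch looks for them.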

A good specification reuses other industry standards in an appropriate way. For example, when vector graphics are required, just use the already standardised SVG format rather than inventing another one. For mathematical equations, use the accepted standard MathML. This is what ODF does, and this is part of the reason why ODF's specification is much smaller than OOXML's. Microsoft decided to use their own rules rather than the accepted standard ones, making everyone's lives more complicated than necessary.

How do you define a colour in such a file format specification? The sensible thing to do would be to define how to do it once, and do it that way everywhere. That's not what OOXML does. It defines various ways to do it, and various rules about when to use which format. That's just ridiculous. The same goes for alignment options: instead of doing it one way consistently, there are a bunch of inconsistent ways to right-align things depending on what page of the spec you're on or what part of the spreadsheet you're talking about. The same applies to dates, too. See for example Rob Weir's excellent discussion for more on these inconsistencies. Instead of fixing Microsoft's buggy and inconsistent colour-handling code, they're trying to force their mistakes into an international standard so that everybody else's code is bloated and convoluted too.

Given the contentious nature of this standard, it stretches credulity that Microsoft was able to push it through the ISO's fast-track approval process, which allowed drastically insufficient time for review of the more than 6500 pages. Despite many appeals and petitions, and examples of clearly contradictory information in the specification, the ISO approval was indeed obtained, albeit controversially. There were many subsequent accusations of Microsoft "buying" national committees by offering incentives, "stuffing" panels with their own employees or friendly parties, contriving excuses to exclude employees of other companies from taking part in discussions, and other irregularities.

So if the OOXML specification as "approved" by ISO still contains missing or inconsistent information, how can OOXML be implemented properly and consistently by other software developers? The answer is simple - Microsoft's implementation becomes the de facto standard, independent of what was submitted to ISO, and all other developers are forced to duplicate Microsoft's interpretation (and bugs). Except there's a problem. Microsoft can't implement OOXML properly either. Their software does not read and write documents according to the ISO standard. Even the forthcoming Office 2010 will not support the Strict ISO standard. Microsoft is trying to force a file format on its users on the basis that the format is "open", but that's not what the users are buying.

The outcome

It's a poor outcome for Microsoft's users, as they won't get the "open" format they were hoping for. It's a poor outcome for ISO, whose reputation for impartiality and professionalism has suffered at the hands of the Microsoft buyout. It's poor for Microsoft's competitors, who are once again forced to try to analyse and duplicate Microsoft's bugs with a tangled, broken specification.

One benefit I can think of may come from Microsoft's switch from binary formats to zipped XML. Even if a zipped XML document can't be completely, properly, 100% perfectly rendered for whatever reason, there is a good chance that a lot of it can be recovered. With binary formats you never know what might happen, but even if I don't have any office software available at all, I should at least be able to unzip an OOXML file and extract the text, even with basic, free tools. So instead of a corrupt file and everything lost, I should still be able to recover precious file contents and go from there.
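To show how little is needed for this kind of recovery, here is a minimal sketch using nothing but Python's standard library. It assumes that the main text of a word processing document lives in a part called word/document.xml, with paragraphs as w:p elements and text runs as w:t elements in the namespace shown. The function names and the sample document are hypothetical, invented just for this example. It is not a proper converter: formatting, footnotes, headers and so on are all ignored, text inside nested structures such as text boxes may come out twice, and documents saved in the Strict variant use a different namespace.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Assumption: the WordprocessingML namespace used by most .docx files in the wild.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_paragraphs(docx_file):
    """Return the plain text of each paragraph in a .docx path or file object."""
    with zipfile.ZipFile(docx_file) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split over one or more w:t elements.
        pieces = [t.text or "" for t in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(pieces))
    return paragraphs


def make_sample_docx():
    """Build a tiny, hypothetical stand-in for a real .docx, in memory."""
    body = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello</w:t></w:r>"
        '<w:r><w:t xml:space="preserve"> world</w:t></w:r></w:p>'
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", body)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for line in extract_paragraphs(make_sample_docx()):
        print(line)
```

For a real file, pass its path to extract_paragraphs instead of the in-memory sample. If the zip itself is damaged, zipfile will raise an error, and you would need a more forgiving unzip tool before trying this.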

Another possibly positive outcome from all this is perhaps for the ODF format, as used by Open Office and many other smaller office suites. Presumably as a result of all the pressure towards openness, the unthinkable has happened and the very latest versions of Microsoft Office now support ODF format for reading and writing. It's possible that this will become increasingly used by certain businesses with a view towards avoiding future lock-in. Just as long as Microsoft keep to the specification (like they didn't for HTML).