8 Reasons why XML Sucks
When this text writes "I" it could be either Jakob or Kasper speaking. If you have comments about a specific statement (further questions, arguments etc.), contact us via email. You can find an email address at http://jenkov.com/about/index.html

I was once a believer in XML. Until I started working with it, that is. Now my enthusiasm has faded away - replaced by - well - bitterness or something like that. I just don't like it anymore.

This text lists a series of problems developers encounter when working with XML. I'll expand it as more disadvantages come to mind. If you have some complaints to add, send me an email. Find the address on the About page. So, when your fellow developers start praising XML you can send them a link to this page to get their feet back on the ground.

XML has only one advantage: It is sometimes humanly readable. This is its only advantage, but this is also one of the reasons why it sucks. Most other aspects of XML are just plain lame. XML may be fine for files read and written by both humans and computers, like configuration files. But for computer-computer communication XML is terrible. This text will take you through 8 reasons why we think so.

1. Character Encoding

XML is character based, hence any change you make to an XML document in an editor has to be saved in the character encoding scheme that you specify in the XML file (encoding="blablabla"). This means that if your editor saves the file in ISO Latin 1, but you specify "UTF-16" inside the XML file, the parser may fail. Most editors have no way of letting the user specify the encoding in which to store the edited file, so this is often a problem.
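The mismatch described above can be reproduced in a few lines. This is only a minimal sketch in Python using the standard library parser. The sample document, the name in it, and the helper try_parse are made up for illustration, and it declares UTF-8 rather than UTF-16 to keep the example short.

```python
import xml.etree.ElementTree as ET

# The declaration promises UTF-8. "\u00f8" is the Danish letter o-with-stroke.
declared = '<?xml version="1.0" encoding="UTF-8"?><name>S\u00f8ren</name>'

# A hypothetical editor that ignores the declaration and saves ISO Latin 1.
saved_as_latin1 = declared.encode("iso-8859-1")
# An editor that saves the bytes the declaration promises.
saved_as_utf8 = declared.encode("utf-8")


def try_parse(data: bytes) -> str:
    """Return the root element's text, or an error label if parsing fails."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError as err:
        return f"parse error: {err}"


print(try_parse(saved_as_utf8))    # declaration matches the bytes
print(try_parse(saved_as_latin1))  # declaration does not match the bytes
```

The two byte strings show the same text in an editor, but only the first one parses. In the second, the Latin 1 byte for the non-ASCII letter is not valid UTF-8, so the parser rejects the document.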
Additionally, you may run into problems when you need to embed character data inside the XML file that is stored in a different character set than the one the XML file is encoded in. In that case you will need to manually translate from one encoding to another.

Finally, as XML is character based, element data cannot contain characters that can confuse the XML parser. For instance, the character < must be encoded as &lt;

2. Embedding binary data

Being character based, it is not trivial to embed binary data inside XML. Two commonly used encoding schemes are bin-hex and base64 encoding, which both expand the total amount of data. Bin-hex requires 2 characters (bytes) for each original byte. Base64 requires 4 bytes for every 3 original bytes. An unnecessary overhead.

3. Indentation Characters Mixed With Data

Because XML is character based and hierarchical, spaces or tabs are often used to indent the XML elements to reflect the hierarchy, thereby making the files easier for humans to read and write. When using a normal text editor, these characters are typically stored along with the XML elements themselves. However, these characters are essentially formatting characters, and they should not be saved into the file. They often have no meaning in the XML data itself.

When I first heard of XML it was related to discussions about separating formatting and layout from content. Interestingly, XML often does exactly the opposite, by having the indentation characters mixed with the contained data.

4. Textual Representation of Numeric Data

Being character based, lots of data becomes unnecessarily verbose. For instance, sending integers using XML:

<int>1234567890</int>

Firstly, the tags alone take up 5 + 6 = 11 bytes in total (this is more than the actual content!). And had longer element names been used, the tags would take up even more bytes. For no reason other than being "humanly readable". The integer value itself also takes up 10 bytes, even though its value could easily fit inside a regular 4 byte integer.
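The byte counts above are easy to check. The following is a small Python sketch, not a proposal for a wire format. The big-endian, signed 4 byte layout is just one assumed binary representation chosen for the comparison.

```python
import struct

value = 1234567890

# The XML form from the example: tags plus decimal digits, as ASCII bytes.
as_xml = f"<int>{value}</int>".encode("ascii")

# An assumed binary form: 4 byte signed integer, big-endian.
as_binary = struct.pack(">i", value)

print(len(as_xml))     # 21 bytes: 11 for the tags, 10 for the digits
print(len(as_binary))  # 4 bytes

# The binary form still round-trips to the same value.
assert struct.unpack(">i", as_binary)[0] == value
```

For this one value the XML form is a little over five times the size of the binary form, before any indentation or enclosing elements are added.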
Then there is the problem of decimal numbers. Should they be formatted with a period or a comma as the decimal separator?

123.666 US Style
123,666 Danish Style
123.666 UK Style
123,666 German Style

5. Unnecessarily Verbose

XML is unnecessarily verbose. I remember back in 2000 on a telco project we converted TAP 3 files (billing logs) into XML. The XML files were 16 times larger! When looking at the textual representation of tags, indentation characters, and numbers, this is not surprising, but in reality we gained nothing from it.

If data is not hierarchically structured, consider using another, less verbose format such as CSV. Such formats are much easier to read, take up less space, and are easier to export from various applications.

6. Human un-readability

One of the often claimed advantages of XML is the human readability aspect. This is simply not true. There are so many cases where an XML document is not readable.

7. A single XML root element

The fact that XML files can only have a single root element inside each XML file is a total rookie mistake. There isn't a single good reason why XML has this limitation (if you know of any, send me an email). The single root element limitation makes XML unsuitable for logging and streaming purposes.

A logging example:
<log> ... log data... </log>
<log> ... log data... </log>
Notice how this logfile does not have a single root element. Imagine if you had to have a single root element. Then you would also have to have the root element correctly ended with a root element end tag. Whenever appending to your log file, your logger would then have to remove the old end tag, append the new log record, and re-write the end tag. A total waste!

The single root element limitation also makes it harder to first examine what kind of XML document it is, before passing it on to the appropriate parser / processor. If an XML document could have contained a header and a body element, the header element could have been inspected before passing the body on to the appropriate processor.

Finally, many parsers have problems properly dealing with documents with no single root. Recently, in ANT I tried including one XML document in another using XSLT. The parser complained a lot. Among the errors were "file not found" and "illegal format". Little indication was given as to the missing root element, and the "file not found" turned out to be total bollocks.

8. SOAP - Oh My God!

When people finally agreed on a common remote procedure / service / what-ever protocol, why do we have to settle for XML?

It doesn't even make sense in my opinion to use XML for this purpose. XML's only strength is that it is "humanly readable". But web services are all about computer - computer interaction, aren't they?

On a large IT project I have been working on, the web front-end exclusively retrieved data to display using web services. One problem we had was that many screens were sluggish. Some took 2-3 seconds to display. We assumed bad database queries were to blame. It turned out the database call took 30 milliseconds! The remaining 1-2 seconds were the overhead of producing the XML, retrieving the XML in the web layer, parsing it and displaying it.

In addition, the single root element limitation is really annoying in computer to computer communication.
A SOAP request has an envelope with a header and body element inside, like this:

<envelope>
  <header> </header>
  <body> </body>
</envelope>

It would be really useful if a node in a computer to computer communication network could just parse the header and then decide what component is to process it. Or perhaps forward the request to another computer with the body untouched and unparsed. Unfortunately, because of the way traditional XML parsers work this is not possible. Once you start parsing the XML document you parse it all! Allowing multiple roots would have solved the problem. Like this:

<header> </header>
<header> </header>
...
<body> </body>

Now each header could be parsed one by one, without having to parse any more than that.

In addition, imagine having to exchange a few ints (4-8-12 bytes or so) via SOAP. It'll expand into 1-2 KB!! A complete waste of CPU time and bandwidth!

BML

For my peer-to-peer hobby project I designed a binary equivalent to XML which I call BML (Binary Markup Language). It is much simpler than XML, much less verbose, allows any character encoding inside, allows binary data like ints, longs etc. to be transferred in their natural representation (byte form), allows multiple root elements etc. And guess what? The reader for the data format, which is the equivalent of an XML SAX parser, is only 120 lines of Java code. Any argument for always defaulting to XML because "the re-useable parser is already written" is really not a strong one in the light of a much better, faster and more efficient data format achieved using only 120 lines of Java!

Did you like Jakob Jenkov's writing? See his fantastic free tutorials at http://tutorials.jenkov.com

Comments

If you have any comments on this article, please drop me a mail at firstclassthoughts at gmail dot com. Please indicate if I can't publish whole or parts of your comment on the site.