Oh, to Structure Code

Today I read an article about XHTML which discussed (in a very short space) some of the major points concerning this re-formulation of the HTML spec: what it is, why it is, and why web developers should know about it. It was a nice little article. (Although there are better ones.) Then I read the 'Feedback' message board.

Good god. I didn't get through the whole archive, but what I saw gave me a good scare. This is what really threw me for a loop:

Does anyone else go "ick" at the idea of well-formed HTML? I mean, HTML escaped the rigidity of SGML, and got rid of the concept that every tag was a container and needed a close tag.

It actually made me sick to my stomach. I know that closing tags and well-formed HTML are so very important. Very very important. Why didn't these people see this? It was then that I realized that I knew a secret. A deep, untold secret that most people don't know, and most web developers desperately need to be told: Poor HTML Code Crashes Browsers.

You're kidding me.

It's true. Here, I'll show you. But first, some background.

Background on Parsers: If you pick apart the pieces of a web browser, you'll find that there's an 'engine' that's used to figure out what each character in an HTML text file means. This engine is called a parser. For a great look at the pieces of code in a web browser, see this article on Gecko from WebTechniques magazine.
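To make the idea of a parser concrete, here is a minimal sketch using Python's standard html.parser module. This is only an illustration of event-based parsing, not the engine any browser actually uses: the parser walks the character stream and fires an event for each start tag, end tag, and run of text it recognizes.

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Record each piece of the HTML stream as the parser recognizes it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data.strip()))

parser = TagLogger()
parser.feed("<p>Hello, <b>world</b></p>")
print(parser.events)
# → [('start', 'p'), ('text', 'Hello,'), ('start', 'b'),
#    ('text', 'world'), ('end', 'b'), ('end', 'p')]
```

Notice how clean the output is when every tag is properly closed: each "start" event has a matching "end" event, and the browser can build its display directly from that stream.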

Background on Why Parsers Aren't Strict Enough: In the early days, all HTML was written by hand. Many early browsers set a dangerous precedent by trying to render even poorly coded pages. Because of this lack of feedback between the browser software and the HTML author, the author never learned of the mistakes that were made. Those mistakes were then copied over and over by that author and by others who used that author's pages as examples. Because people were allowed to 'get away with' this sloppy code for the first year or two, the vast majority of the web ended up with poorly formed pages.

If a new browser came out that was more strict and popped up error messages each time it came across a problem, most web surfers would drop it like a hot potato. The new browser would never develop a large audience because most of the web had errors, and most of the browsers ignored them.

Background on The Hard Life of Parsers: Imagine you were trying to follow a recipe for a cake, but the steps were out of order. Now, imagine that you're really stupid and have no idea how to cook, but all you have is this recipe. I think you'd fail. You might come out with a lump of flour, chocolate and eggs, but few would call the results of that "poorly formed" recipe a cake.

In the same way, poorly formed HTML is like a mixed up recipe. A simple parser looking at poorly formed code would produce a web page as mixed up as the code that described it. Browser programmers have done their best to make parsers that will interpret these poorly written HTML pages.

My theory is that parsers are written by humans. Given this premise, I logically deduce that since humans are fallible, their parsers are fallible as well. The more code there is in a program, the more likely it is that there will be bugs. Remember that.

Background on Why Parsers' Error Correcting Code Causes Crashes: Think about how a piece of software would go about 'ignoring' and 'covering up' mistakes. It takes a certain amount of code to parse and display an HTML page. It can take many times as much code to parse and display mixed up code. This isn't easy for a computer to do. The parser has to make guesses about where closing tags should be, along with other assumptions. Making guesses and assumptions is something computers are not cut out to do. The parsers don't always get things right. Now this may not be a problem when it comes to displaying things on a screen, but a parser is more than its graphical output. It's a computer program that manipulates lots of data and bits and bytes.
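To show what that guessing looks like, here is a toy sketch of my own, and emphatically not how any real browser's recovery code works: a stack-based "repairer" that, when it hits a closing tag that doesn't match the most recently opened tag, guesses that the author forgot to close the inner tags and closes them itself.

```python
import re

TAG = re.compile(r"<(/?)(\w+)>")

def repair(html):
    """Toy error-corrector: auto-close tags the author forgot to close."""
    out, stack, pos = [], [], 0
    for m in TAG.finditer(html):
        out.append(html[pos:m.start()])
        closing, name = m.group(1), m.group(2)
        if not closing:
            stack.append(name)
            out.append(m.group(0))
        else:
            # Guess: any still-open inner tags were meant to be closed here.
            while stack and stack[-1] != name:
                out.append(f"</{stack.pop()}>")
            if stack:
                stack.pop()
            out.append(m.group(0))
        pos = m.end()
    out.append(html[pos:])
    # Close anything still open at the end of the document.
    while stack:
        out.append(f"</{stack.pop()}>")
    return "".join(out)

print(repair("<b><i>bold italic</b> plain"))
# → <b><i>bold italic</i></b> plain
```

Even this twenty-line toy has to make a judgment call (close the inner tag where the mismatch appears), and a different guess would render the page differently. Multiply that by hundreds of tag combinations and you get a sense of how much extra code, and how many extra chances for bugs, error recovery adds to a real parser.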

If the parser doesn't work perfectly (in other words, if it has bugs), those bits and bytes get loose and get scattered all over your RAM. This is technically known as a 'memory leak'. What happens soon after this? Your computer crashes. Big ol' honking crashes.

Okay, that's all the background we need. Now here's why structured code is important.

The Heart of The Matter: I noticed a few years ago that when working on certain pages of HTML, Netscape would crash. Most of the time Netscape would be just fine, but while working on pages that didn't have their code finished, it would become unstable. Netscape wouldn't crash the first time the page was rendered; it might happen the eighth or twelfth time. Similar problems would crop up with frames. Sometimes the page wouldn't crash the browser right away, but the browser would crash later on some unrelated web site. But the problem was the unfinished (read: unstructured) HTML code.

However, once I cleaned up the HTML, the crashes stopped. I imagine that the parser was no longer running into bugs in its programming. Remember, the amount of programming that goes into rendering a structured HTML document is minuscule compared to the amount of programming that a parser has to have to clean up poorly coded HTML. Statistically, the bigger sections of programming will have more bugs. Since these bugs don't necessarily rear their ugly heads right away in the displayed page, they are overlooked. But they're there. And they're wreaking havoc.

Poorly structured code makes the parser use its huge "error correction" programming sections. These sections are buggy. HTML code that forces a parser to use this error correction can cause a browser to crash.

No piece of software should crash. But since we don't live in that perfect world, web developers need to take responsibility for their code. HTML may not be programming, but poor HTML can be just as hazardous as poor programming when the two are brought together on a computer screen. Part of being a responsible web developer is keeping your HTML code correctly structured.


Postscript: My friend (and part time technical editor) Jock wrote to me, making sure that I clearly pointed out a few items in this article:

Parsing bad HTML is not the reason for memory leaks, bad parsers are. If loose grammars caused memory leaks, perl, awk, and a host of other languages would be leaking like sieves. Parsing and lexical analysis are well known and understood. Memory leaks come from bad programming practice.

Jock's absolutely right, and that didn't come out clearly in my article. Instead of explaining this, I jumped right over it to the concept of making clean HTML in order to mitigate the problems in some parsers. Jock continues:

I have looked at the original Mosaic code, and it is nothing to write home about, but it looks like it was written by young, bright engineers, but relatively inexperienced at writing real applications. IE2 and IE3 were derived directly from the Mosaic codebase. NS1 through NS4 was the love child of the original Mosaic team so they made most of the same mistakes, and kept heaving more code on top of a poor foundation. It is telling that the IE4 project started at the same time as the IE3 project, so that MS could re-implement the browser. IE4 & IE5 are far less problematic (that being said, YMMV). The original NS5 codebase (that was used to start the Mozilla project) was thrown out by the developers, so they could start from scratch. Something that they should have done years ago (IMHO).

Jock also states that there are not "huge 'error correction' programming sections" in parsers. Jock is a true software engineer and I will take him at his word; however, the mental model, the way I use it to explain the behavior, and the conclusion all remain the same. Thanks for your notes, Jock.

