Web Application Development and Design: relational data

Everything is fodder for argument with XML. Note these 50 pages of point-counterpoint that discuss nearly every aspect of XML's use and usability - it's a pretty quick read [http://www.zapthink.com/report.html?id=ZT-XMLPROCON].

The short of this entry is that storing data on the server-side as XML documents is a very flexible, readable, maintainable, and most importantly scalable option for a Web application. Document size issues that tend to accompany XML-based storage can be avoided by leveraging the size of documents vs. external contextual data (i.e. indexing information). Performance and scalability for even the most demanded applications can be as outstanding as any highly-trafficked Web site, if the metaphor of serving your data in XML documents as Web pages can be kept in mind.

The traditional rule of thumb is to put "data" in a database and store documents on a file system. But breaking up a document's worth of data to shove into a database is not a difficult task, and composing a table or two into a document of information is also not difficult. Where things become complicated is when mounds of contextual metadata is generated to cope with relational data. Separating contextual data from content can give you immense power over what you serve, when you serve it, and how you manage it.

-----

There are many time-sinks and concerns when using XML to describe your data, a few of which are engineering the documents' structures, handling data serialization, performance issues, and information bloat. Before you can massage your data into XML, you need to know what structure the document should be - sometimes this is not trivial, especially if representing relational data, and will take time. Taking into account the hitch that binary serialization is brittle and not always platform/hardware independent, having to massage your data so it can be serialized as String representations may also take time; complex objects need to be broken up into primitives to be represented nicely.

Once this is finished, immediately one might notice that some XML representations are thousands of lines long (I made this wonderful mistake more than once), slowing your parser and increasing the transfer time towards O(eternity); the more complex the object, the more contextual descriptions and metadata will be needed to describe the object. Ugh.

Since there is (obviously) a connection between the size of your XML document and the time it takes to parse its content, a good way to increase performance is to break your data into finer levels of granularity. For instance, if you have a document representing the furniture in your house by room, you can easily break the document up into many documents, each representing one room.

With each room separated out, a natural idea would be to add an additional tag under the "room" parent node (or as an attribute) that identifies which house it belongs to. If we break things down even farther, and separate out each piece of furniture into its own document, we would need to add a tag that identifies which room each piece of furniture belongs in.

This method of "contextualizing" XML documents has a problem. Since the information is internal to the document, each time the context of a document needs to be determined it must be parsed and searched. Additionally, this method only allows the "child" document to know who its "parent" is - a house would not know what furniture it has, only the furniture would know which house it belongs to. There are at least two ways to solve this, and the first is obvious - after breaking up a large document, build a "virtual" document that has references to its pieces and parts. This is not a terrible way of doing things - it can, however, lead to an enormous amount of files.

The alternative is to externalize all "virtual resources" into one document, namely a large indexing file. So if multiple house documents - "FooHouse" and "BarHouse" - have each of its rooms stored as separate files, a master document will subsume the identifiers for both room documents under their particular identifiers. When a user requests the resource "FooHouse", the master document (which is assumedly kept in memory for quick traversal) will either assemble the document, or - assuming a screen could not handle showing every room's contents all at once - simply serve each room document when requested.

This solution can scale well, requires no particular storage model (the master document could refer to tables in a database, files on a disk or even a URL) and allows a conceptual resource to be as complex as needed. It also allows for performance tuning, as the client can receive only the portions of the resource pertinent to their particular task.

Web Application Development and Design

Wednesday, October 24, 2007

Relational Data as XML

Blog Archive

About Me