Tuesday, September 27, 2005

My...

MySQL 5.0 is upon us and includes a formal release of the Archive storage engine, aimed at storing enormous volumes of historical data on-line in a compressed format - but without any indexing at all...! Hmmm... a phrase involving the words "teapot" and "chocolate" comes to mind.
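As a concrete sketch of what that means in practice (the table and data here are hypothetical), the Archive engine accepts only INSERT and SELECT, compresses rows as they arrive, and permits no indexes whatsoever - so every query is a full table scan:

    -- Hypothetical example: an Archive table for historical call records.
    CREATE TABLE call_history (
        account_id INT          NOT NULL,
        called_at  DATETIME     NOT NULL,
        amount     DECIMAL(9,2) NOT NULL
    ) ENGINE = ARCHIVE;

    INSERT INTO call_history VALUES (42, '2005-09-27 10:15:00', 1.50);

    -- No index exists (or can exist), so this scans the entire table.
    SELECT * FROM call_history WHERE account_id = 42;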

Monday, September 26, 2005

The Economics of Indexing

The economics of conventional database indexing is based on the assumption that each piece of data is written once and is subsequently queried many times - hence traditional indexes have concentrated on achieving optimal query performance and relied on repeated queries to amortise the large initial cost of creating the index. These normative economics fail in situations where the likelihood of querying any specific item of data is minimal – you incur the expense of indexing it, but you are unlikely to recover that cost later on. In what situations is that likely? I would suggest granular drill-through from analysis and discovery, compliance retention, and data surveillance as the most obvious examples.

For instance, a credit card fraud surveillance application is unlikely to drill into the detailed transaction history for any particular card until the overall spending profile is deemed suspicious. Thankfully, most credit card usage is legitimate, so the vast majority of detailed transaction history will remain untouched – forcing the conventional indexing economic model into debt. Moreover, as new transactions continuously pour in, each and every one needs to be indexed with minimal latency, as any interrogation for potential fraud needs complete and accurate information – all within the challenging environment of a large retained data population.
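To make those economics concrete, here is a hedged sketch of that access pattern in SQL - the schema, the summary table and the suspicion heuristic are all hypothetical:

    -- Step 1: a cheap pass over a small summary table to flag suspects.
    SELECT card_id
    FROM   spending_profile
    WHERE  daily_total > 3 * rolling_average;  -- crude suspicion heuristic

    -- Step 2: the expensive drill-through, issued only for the few cards
    -- flagged above. Under conventional indexing, every row ever inserted
    -- into card_transactions paid the indexing cost at write time, yet
    -- only this tiny fraction of rows is ever actually looked up.
    SELECT *
    FROM   card_transactions
    WHERE  card_id = 4242
    ORDER  BY txn_time DESC;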

The familiar B-tree, hash and inverted-list (including bit map) indexes cannot service these demands easily and require heavy hardware investment to alleviate their problems. Frankly, B-trees and fast updates go together like a horse and carnage [sic]. Hence the emergence of data appliances, which supplant indexes with extreme hardware parallelism to avoid the impedance mismatch between conventional indexing and this acquisition-dominant workload. But more hardware means a lower MTBF, both theoretically and empirically, so wouldn’t it be better to change the economics of indexing rather than to mask the problem with extra hardware?

Did You Know?

Apparently, Sir Isaac Newton invented the cat-flap... two nouns that I didn't expect to see within the same sentence.

Friday, September 23, 2005

Fit for Purpose

The introduction of new EU retention legislation means that even a small to medium telecommunications network operator will need to retain 100 billion (or so) call detail records on-line and accessible for query. Let’s hope that not too many operators throw an OLTP database at it...

Thursday, September 22, 2005

Paradigm Wars

I am about to poke my vulnerable toe into the ancient cauldron that is the relational-versus-object paradigm debate. But first of all, let’s get a few definitions straight.

- The relational model is a pure mathematical theory for modelling data, which is loosely based around set theory. It offers a formal and elegant approach to data structuring together with a broad (but unspecific) framework for how data is manipulated. It is not SQL. It is not a database.

- The object model (separated from any specific language) is a rigorous model for software development that cohesively ties data with processing. It is not a language. It is not C++, C#, Java etc. It is not a database.

- A database is a piece of infrastructure for managing and persisting data. Nothing more. It may present a particular paradigm view of the data (such as relational, object or XML), but for pragmatic and performance reasons, it’s actually quite unlikely to store its physical data as a direct realisation of the paradigm used at the presentation level.

- SQL, C++, C# and Java are languages that purport to conform to their respective paradigms – but they all have weaknesses to a greater or lesser degree.

If you disagree with any of the above definitions then please don’t bother reading any further as I suspect you have some personal subjectivity issues you need to deal with first... :-)

The relational model is good for achieving clarity in the data structure. It is excellent for environments where users need to discover and understand the structure of the data. However, sadly, SQL is a poor language for exploiting the relational model as it violates or disregards some of its most fundamental principles. Frankly, it is wholly inadequate for the task and many of the problems encountered with relational systems arise from the failings of SQL itself. The relational model is beautiful - SQL is not. If you doubt this, then compare a SQL statement with the simplicity of what it is trying to achieve. I suspect SQL dumped the elegance of simple relational operators in favour of “natural language” and ultimately failed at implementing either satisfactorily. Users do not struggle with the simple concepts of joins and grouping, but they do struggle with SQL’s clumsy and arcane representations of them.
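To illustrate the gap (using the textbook EMP and DEPT relations as hypothetical stand-ins), compare a relational algebra expression with its SQL rendering:

    -- Relational algebra: project[ename, dname]( join( EMP, DEPT ) )
    -- Two operators, read inside out. The SQL equivalent:
    SELECT DISTINCT e.ename, d.dname
    FROM   emp  e
    JOIN   dept d ON d.deptno = e.deptno;

    -- Note that even DISTINCT must be spelled out: SQL tables are bags
    -- rather than sets, one of its quiet departures from the relational
    -- model.

The operations themselves are trivial; it is SQL’s spelling of them that is not.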

Meanwhile, the object model is a good framework for both data and process development, incorporating rigorous principles such as encapsulation, inheritance and polymorphism to promote development productivity and to provide protection against potential implementation mistakes. However, it deliberately hides data: users unfamiliar with the system’s construction have no open visibility of the data structure and must rely on the software implementation to provide a suitable presentation. The object languages available do a reasonable job of implementing the object framework for development purposes, although each has its own specific issues that detract from the rigours of the underlying paradigm to some extent.

Both relational and object paradigms have strengths in different areas, and an attempt to combine them at a fundamental level is highly desirable in order to achieve the open formality of relational structures under the protective framework of object-oriented development. The current attempts to combine object and relational models into a single database are really just clumsy attempts at shoehorning objects into relational databases – a far cry from the fundamental rewrite that the industry needs. Hopefully, someday we can free ourselves from the monster that is SQL and, with luck, that may coincide with a new language that elegantly and efficiently combines the best principles of the distinct paradigms that exist. None of these paradigms are mutually exclusive – it is the bigots who hail their particular paradigm’s singular divinity that create barriers to a better approach.

Finally, database performance has nothing to do with these paradigms. If a database is performing badly then that is largely irrelevant to whether it offers a relational, object or an XML view of its data – the issue sits with the database architecture and not the paradigm.

Now... can we all move on and do something better please...

Wednesday, September 21, 2005

An Insight...

"Technical skill is mastery of complexity, while creativity is mastery of simplicity"
- Erik Zeeman

Monday, September 19, 2005

Elegance

How do you measure the virtue of a software solution? That partly depends on what your priorities are, but you will probably agree that absolute virtues are (a) completeness – it does what you need it to do; (b) performance – it does what it does quickly; and (c) reliability – it can be trusted to do what it purports to do. You would probably also agree that simplicity promotes performance and reliability because there is less to do and less of it to go wrong. But then simplicity conflicts with completeness, doesn’t it? Not if you can lift the solution into a more generic and uniform problem space with fewer dependencies and fewer restrictions. Admittedly, that’s not an easy task and often requires a leap of inspiration, but it would ideally be a key objective for software developers (time allowing...).

It’s akin to seeking the theory of everything in physics, an elegant solution sought to replace the current cocktail of awkward theories that fail at arbitrary boundaries created by localised assumptions. An example in the RDBMS world is cost-based optimisers over their rule-based ancestors – the problem has been lifted to the more generic solution of comparing the true cost of the possible access paths, whereas the original rule-based solution was a glorified bunch of ever-expanding, incomplete and specific heuristics; the cost-based approach is less complex, yet more complete, accurate and predictable. No doubt, as I write, somebody out there is groaning about the dire performance of a cost-based optimiser in one of their databases – but that probably has more to do with the inaccuracy of the underlying data distribution statistics than any flawed reasoning by the optimiser. At least, when there is an anomaly with the cost-based approach, you can readily discover the root cause and fix it – the rule-based approach was always a bit of a black art.
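For instance, in MySQL (the orders table is hypothetical) you can inspect the plan the cost-based optimiser has chosen and then refresh the statistics it depends upon - a poor plan before the refresh usually indicts the statistics rather than the optimiser:

    -- Show the chosen access path; a bad choice here often just means
    -- the row-count and distribution statistics are stale.
    EXPLAIN SELECT * FROM orders WHERE customer_id = 17;

    -- Refresh the statistics and check the plan again.
    ANALYZE TABLE orders;
    EXPLAIN SELECT * FROM orders WHERE customer_id = 17;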

On a related but tangential note, isn’t it about time we got away from the tyranny of collecting data distribution statistics as a distinct process, and instead had relational databases that update them on the fly as and when the data changes, so that they are always accurate? That solution sounds more elegant too... (hint)

BTW, this whole meander was triggered by rational trigonometry and some undocumented connections in my synapses.