Tuesday, November 08, 2005

The structure of data

I was involved in a discussion recently about the distinction between structured and unstructured data. The viewpoint presented to me was that structured data sits in a database and everything else is unstructured (or at best semi-structured). But I disagree and don’t think the location of the data is at all relevant - the distinction should be based purely on the semantic knowledge associated with the data. Thus structured data has an explicit semantic framework defined (somewhere); whereas unstructured data simply consists of symbols (such as words) without any semantic framework at all (other than what can be deduced or inferred from the data itself).

An obvious example of unstructured data is a text document, which has no inherent meaning without human interpretation of its contents.

Now consider an obvious example of some structured data – a row in a relational database table. The row has columns, each with metadata attached that defines the relevance and interpretation of the associated column values - and this explicit semantic framework is recorded within the data dictionary of the database. Does the data now become unstructured if the row is moved into a record within a flat file whereby each column value is written into a bounded field? I think not. The semantics of the data still remain (in a slightly more remote data dictionary) and so the data continues to be structured. There are also examples of structured data where the semantics are hidden and not so explicit - a server log file may not have a data dictionary, but the semantics are wired into the server application code.
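As a rough sketch of the flat-file case, the snippet below writes a row into bounded (fixed-width) fields and reads it back using a separate data dictionary. The field names, widths and types are purely illustrative assumptions, not taken from any real schema; the point is only that the semantics live in the dictionary rather than in the location of the data.

```python
# Hypothetical data dictionary: (field name, start, end, type converter).
# It plays the role of the "slightly more remote" data dictionary.
DATA_DICTIONARY = [
    ("customer_id", 0, 6, int),
    ("name", 6, 26, str),
    ("balance", 26, 36, float),
]


def format_record(record, dictionary=DATA_DICTIONARY):
    """Write each column value into its bounded field."""
    parts = []
    for name, start, end, _convert in dictionary:
        width = end - start
        # Pad to the field width and truncate anything that overflows it.
        parts.append(str(record[name]).ljust(width)[:width])
    return "".join(parts)


def parse_record(line, dictionary=DATA_DICTIONARY):
    """Recover a typed row from a flat-file line using the dictionary."""
    record = {}
    for name, start, end, convert in dictionary:
        record[name] = convert(line[start:end].strip())
    return record


row = {"customer_id": 42, "name": "Ada Lovelace", "balance": 1250.5}
line = format_record(row)
restored = parse_record(line)
```

Without `DATA_DICTIONARY` the line is just a run of characters; with it, the same bytes are a structured row again.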

Indeed, the volume of structured data existing outside of a database is currently burgeoning with self-service environments such as e-commerce and communications networks.

But then, what exactly is a database? A file-system packed with structured data can be viewed as a database of sorts (albeit with limited functionality and accessibility).

Thursday, October 27, 2005


At face value this article would appear to be, at best, sensationalist. But if you delve below the hyperbole, the essence of Gartner’s view is that data will become further decentralised and distributed. Unfortunately, the majority of the article weakens this argument by focusing on some simplistic RFID scenarios that do not withstand further investigation.

Consider the can of soup example with the inventory system that has no historic reference data. Say the inventory system determines that there are 100 cans of soup at one point in time, but later on discovers that there are only 95 cans. Does it even know that 5 cans have gone since the last check? Have 5 cans been sold? What if 2 were stolen? What if a customer purchased a can, left the shop briefly, and later called back still carrying the same can of soup? How would the inventory system recognise that possibility if it accidentally discovered the can? Will the customer get charged again when he leaves the shop? Forget the soup, what we have here is a can of worms.
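To make the problem concrete, here is a minimal sketch contrasting a bare count with a retained event history. The event kinds (`sold`, `exit_unpaid`) and the can identifiers are hypothetical, invented only for illustration; real RFID middleware would record something different.

```python
from collections import Counter

# Without history: two point-in-time counts and nothing else.
previous_count, current_count = 100, 95
snapshot_delta = current_count - previous_count  # 5 cans gone, reason unknown

# With history: a hypothetical event log of (kind, can_id) pairs.
events = [
    ("sold", "can-001"),
    ("sold", "can-002"),
    ("sold", "can-003"),
    ("exit_unpaid", "can-004"),
    ("exit_unpaid", "can-005"),
]


def explain_change(event_log):
    """Tally events by kind so the missing cans can be accounted for."""
    return Counter(kind for kind, _can_id in event_log)


def already_paid(can_id, event_log):
    """True if a sale was ever recorded for this particular can."""
    return any(kind == "sold" and cid == can_id for kind, cid in event_log)


reasons = explain_change(events)
returning_customer_ok = already_paid("can-001", events)
```

The snapshot can only say that the count dropped; the log can say why, and can also recognise a can that has already been paid for when the customer walks back in with it.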

All of the above problems ostensibly come from the lack of any historic reference and yet the article only very briefly mentions that this still remains an essential aspect. This brief mention appears in the final paragraph, after the previous claims have left you in a state of outrage.

The fact is, RFID tag scanners and processes will produce event data essential to the understanding of the distribution system and its efficiency. Moreover, with tagging on individual items, there are likely to be terabytes of the stuff flowing throughout any reasonably large enterprise. You will not want to, or need to, pour all of this event data into a central database. I think Gartner is promoting the idea that you keep the data close to its origination and distribute it. This is more an argument for EII and SOA than an argument for the demise of databases per se. If nothing else, you need to retain historical data for manageability, security and accountability. But it doesn’t need to be centralised. Nor does detailed, immutable historic event data need to be shoehorned into an RDBMS, as you only need to access it for search and aggregate it for analytics.

However, I do dispute the claim that XML is unstructured data... more comments about that later...

Wednesday, October 19, 2005

Maybe Oracle has already achieved its goal for MySQL... through simple FUD.

Thursday, October 13, 2005


The EU’s progress towards a more stringent communications-data retention directive continues with Wednesday’s agreement summarised here. Interestingly, there is a call for unanswered calls to be retained too because these are exactly the type of call used to detonate bombs. I wonder what proportion of all the calls made go unanswered.

Tuesday, October 11, 2005

Oracle acquires Innobase

It must have been a no-brainer for Oracle to acquire Innobase and it probably should have been a no-brainer for MySQL to acquire them too. Oracle now has a presence in the open source database market and an opportunity to exert some influence (albeit limited) over its current and future open source competitors. There are plenty of possibilities - Oracle may wish to shackle MySQL itself; or Oracle may wish to promote MySQL to spoil the potential of more scalable open source databases like Ingres; or Oracle may wish to be the first into a market openly eyed by other influential vendors such as Sun. But for now, Oracle can afford to wait and watch...

Friday, October 07, 2005

The gates are down, the lights are flashing, but the train isn't coming. Good to see those boundaries getting stretched though.

Wednesday, October 05, 2005

Sun & Google Announcement

In a nutshell: Google agrees to distribute Java while Sun agrees to distribute the Google tool bar.

<sarcasm> Clearly, because both of these technologies need to reach a wider audience... </sarcasm>

Coy doesn’t get close to describing this announcement and that may speak volumes in itself. But did we need a press conference to announce this? Surely that just encourages brickbats from underwhelmed speculators. Or is that part of the grand counter double bluff... :-)