Tuesday, November 08, 2005

The structure of data

I was involved in a discussion recently about the distinction between structured and unstructured data. The viewpoint presented to me was that structured data sits in a database and everything else is unstructured (or at best semi-structured). I disagree: I don't think the location of the data is relevant at all. The distinction should be based purely on the semantic knowledge associated with the data. Structured data has an explicit semantic framework defined (somewhere), whereas unstructured data simply consists of symbols (such as words) with no semantic framework beyond whatever can be deduced or inferred from the data itself.

An obvious example of unstructured data is a text document, which has no inherent meaning without human interpretation of its contents.

Now consider an obvious example of structured data: a row in a relational database table. The row has columns, each with attached metadata that defines the relevance and interpretation of the associated column values, and this explicit semantic framework is recorded in the database's data dictionary. Does the data become unstructured if the row is moved into a record in a flat file, with each column value written into a fixed-width field? I think not. The semantics of the data remain (in a slightly more remote data dictionary), so the data continues to be structured. There are also examples of structured data where the semantics are hidden rather than explicit: a server log file may not have a data dictionary, but its semantics are wired into the server application code.
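To make the flat-file case concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the field names, widths and descriptions, and the helper functions write_record and read_record, are my own assumptions rather than any real system's layout. The point is only that the same small data dictionary gives the flat-file record its meaning, just as column metadata does for the database row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """One entry in a (hypothetical) data dictionary for a flat file."""
    name: str          # column name carried over from the table
    start: int         # offset of the field within the record
    width: int         # fixed width of the field in characters
    convert: type      # how to interpret the raw characters
    description: str   # human-readable meaning of the value


# Hypothetical layout: fields are contiguous and listed in record order.
DATA_DICTIONARY = [
    Field("customer_id", 0, 6, int, "Unique customer number"),
    Field("surname", 6, 12, str, "Customer family name"),
    Field("balance", 18, 9, float, "Account balance"),
]


def write_record(row):
    """Write a row (a dict) as one fixed-width flat-file record."""
    parts = []
    for f in DATA_DICTIONARY:
        # Pad with spaces to the field width, truncating if too long.
        parts.append(str(row[f.name]).ljust(f.width)[:f.width])
    return "".join(parts)


def read_record(line):
    """Recover the row from a record using only the data dictionary."""
    row = {}
    for f in DATA_DICTIONARY:
        raw = line[f.start:f.start + f.width].strip()
        row[f.name] = f.convert(raw)
    return row


row = {"customer_id": 1042, "surname": "Smith", "balance": 250.75}
line = write_record(row)       # a bare string of 27 characters
restored = read_record(line)   # the same row, meaning intact
```

Without DATA_DICTIONARY the record is just a run of characters. With it, every value has a name, a type and a meaning again, wherever the file happens to live.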

Indeed, the volume of structured data held outside databases is growing rapidly, driven by self-service environments such as e-commerce and communications networks.

But then, what exactly is a database? A file system packed with structured data can be viewed as a database of sorts (albeit one with limited functionality and accessibility).

