Thursday, March 30, 2006

Unifying data

I can recall hearing back in the early 1990s that the worlds of structured and unstructured data were about to converge. A decade on, and despite the advent of XML, that prospect still looks a long way off. It is like watching two people who have known each other for years and are attracted to each other, yet never seem to find a way of getting together. Some have argued that the data warehouse should simply open up to store unstructured data, but does this really make sense? When DBMS vendors brought out features allowing them to store BLOBs (binary large objects), the question should have been asked: why is this useful? Can I query this and combine it usefully with other data?

Data warehouses deal with numbers (usually business transactions) that can be added up in a variety of ways, according to various sets of business rules (such as cost allocation rules, or the sequence of a hierarchy), which these days can be termed master data. The master data gives the transaction data "structure". A PowerPoint slide, a Word document or an audio clip tends not to have much in the way of structure, which is why document management systems place emphasis on attaching keywords or tags to such files in order to give them structure (just as web pages are given similar tags, or at least they are if you want them to appear high up in the search engines).

You could store files of this type in a data warehouse, but given that these things cannot be added up there is little point in treating them as transactions. Instead we can consider them to be master data of a sort. Hence it is reasonable to want to manage them from a master data repository, though this may or may not be relevant to a data warehouse application.

I am grateful to Chris Angus for pointing out that there is a problem with the terms 'structured data' and 'unstructured data'. Historically the terms came into being to differentiate between data that could at that time be stuffed in a database and data that could not. That distinction is nothing like as important now, and the semantics have shifted. The distinction is now more between data constrained by some form of fixed schema, whose structure is dictated by a computer application, and data or documents not constrained in the same way. An interesting example of "unstructured data" that is a subject in its own right and needs managing is a health and safety notice. This is certainly not just a set of numbers, but it does have structure, and may well be related to other structured data, e.g. HSE statistics. Hence this type of data may well need to be managed in a master data management application. Another example is the technical data sheets that go with some products, such as lubricants; again, these have structure and are clearly related to a traditional type of master data, in this case "product", which will have transactions associated with it. Yet another would be a pharmaceutical regulatory document. Hence "structure" is more of a continuum than a "yes/no" state.

So, while the lines are blurring, the place to reconcile these two worlds may not be in the data warehouse, but in the master data repository. Just as in the case of other master data, for practical purposes you may want to store the data itself elsewhere and maintain links to it. For example, a DBMS might not be an efficient place to store a video clip, but you would want to keep track of it from within your master data repository.
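To make the idea of keeping links rather than the files themselves a little more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the class names, fields, product key and file location are all hypothetical and do not describe any particular product. The repository holds a master data record for a product together with pointers to documents stored elsewhere, while the transactions are added up separately and tied back to the same product key.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DocumentLink:
    """Pointer to a file held outside the repository (hypothetical fields)."""
    kind: str      # e.g. "technical data sheet" or "video clip"
    location: str  # wherever the file actually lives


@dataclass
class MasterRecord:
    """One master data entry, such as a product, plus links to related documents."""
    key: str
    name: str
    documents: list[DocumentLink] = field(default_factory=list)


class MasterDataRepository:
    """Toy in-memory repository; a real one would sit on a database."""

    def __init__(self) -> None:
        self._records: dict[str, MasterRecord] = {}

    def add(self, record: MasterRecord) -> None:
        self._records[record.key] = record

    def link_document(self, key: str, kind: str, location: str) -> None:
        # Only the link is stored here; the file itself stays where it is.
        self._records[key].documents.append(DocumentLink(kind, location))

    def documents_for(self, key: str) -> list[DocumentLink]:
        return list(self._records[key].documents)


def sales_by_product(transactions):
    """Add up transaction amounts per product key."""
    totals = defaultdict(float)
    for product_key, amount in transactions:
        totals[product_key] += amount
    return dict(totals)


# Hypothetical example data: one product with a linked data sheet.
repo = MasterDataRepository()
repo.add(MasterRecord("P1", "Lubricant A"))
repo.link_document("P1", "technical data sheet",
                   "file-store/datasheets/lubricant-a.pdf")

transactions = [("P1", 100.0), ("P1", 250.0)]
totals = sales_by_product(transactions)

# The same product key leads to both the summed numbers and the documents.
for key, total in totals.items():
    print(key, total, [d.location for d in repo.documents_for(key)])
```

The point of the sketch is simply that the numbers and the documents meet at the master data key, not inside the warehouse tables themselves.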


Anonymous said...

I can remember an application we made in the mid-90s. A perfume manufacturer wanted to keep track of its competitors' sales by buying GfK and Nielsen data. We added the images of the bottles which were used (the shape and look of the bottle is very important) and even attached the TV commercials used in the different countries. In this way marketeers could look up the sales figures of the competition and the most important sales "influencers". Otherwise I have indeed heard only a lot of talking about combining structured and unstructured data, but surprisingly(?) very few examples.

4:55 AM  
Andy Hayler said...

Thanks for your comment. I can also think of very few examples indeed.

8:33 PM  
bitblue said...

I agree that the difference between structured and unstructured data is somewhat blurry. For one, consider a CLOB sitting in a database. Is this structured? Are VARCHARs spanning 2000 bytes structured? Maybe, maybe not. The fact that they represent a column within a table is not enough to make that distinction. The same goes for a text document or a JPEG file: I'd argue that if they had no structure, they would both be close to garbage.

1:37 PM  
