Monday, January 16, 2006

The next generation data warehouse has a name

Bill Inmon, the "father of the data warehouse", has come up with a new definition of what he believes a next-generation data warehouse architecture should look like. Labeled "DW 2.0" (and trademarked by Bill), the salient points, as noted in an article in DM Review, are:

- the lifecycle of data
- unstructured data as well as structured data
- local and global metadata, i.e. master data management
- the integrity of integrated data

These seem eminently sensible points to me, and ones that are indeed often overlooked in first-generation custom-built warehouses. Too often these projects concentrated on the initial implementation at the expense of considering the impact of business change, with the consequence that the average data warehouse costs 72% of its implementation cost to support every year: a USD 3M warehouse, for example, would cost over USD 2M a year to support; not a pretty figure. This is a critical point that is discussed remarkably rarely. A data warehouse that is designed on generic principles will reduce this figure to around 15%.

The very real issue of having to deal with local and global metadata, including master data management, is another critical aspect that has only recently come to the attention of most analysts and the media. Managing this, i.e. the process of dealing with master data, is a primary feature of large-scale data warehouse implementations, yet the industry has barely woken up to this fact. Perhaps the only thing I would differ with Bill on here is his rather narrow definition of master data. He classifies it as a subset of business metadata, which is fair enough, but I would argue that it is actually the "business vocabulary", or context, of business transactions, whereas he has a separate "context" category. Anyway, this is perhaps splitting hairs. At least it gets attention in DW 2.0, and hopefully he will expand further on it as DW 2.0 gets more attention.

The integrity of "integrated" data addresses the difference between truly integrated data that can be accessed in a repeatable way, and "interactive" data that needs to be accessed in real time, e.g. "what is the credit rating of customer X?", where the answer will not be the same from one minute to the next. This is a useful distinction to make, as there has been much confusion here, with EII vendors claiming that their way is the true path, when it patently cannot be in isolation.
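To make the distinction concrete, here is a toy sketch (the customer, ratings and function names are all invented for illustration) contrasting a repeatable query against warehouse snapshot data with an interactive lookup against a live source:

```python
from datetime import date

# Toy contrast (all names and values invented) between "integrated" data,
# queried from a time-stamped warehouse snapshot in a repeatable way, and
# "interactive" data fetched live from an operational source.

# Snapshot loaded into the warehouse: (customer_id, as_of_date) -> credit rating
WAREHOUSE_SNAPSHOT = {("cust_x", date(2006, 1, 15)): "BBB"}

def rating_from_warehouse(customer_id: str, as_of: date) -> str:
    # Repeatable: asking the same question about the same date always
    # returns the answer captured at load time.
    return WAREHOUSE_SNAPSHOT[(customer_id, as_of)]

def fetch_live_rating(customer_id: str) -> str:
    # Placeholder for a real-time lookup in the operational system.
    return "BBB-"

def rating_from_source(customer_id: str) -> str:
    # Interactive: a stand-in for an EII/federated call to the live system;
    # the answer can legitimately differ from one minute to the next.
    return fetch_live_rating(customer_id)

print(rating_from_warehouse("cust_x", date(2006, 1, 15)))  # always "BBB"
print(rating_from_source("cust_x"))                        # may change over time
```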

I am pleased that DW 2.0 also points out the importance of time-variance. This is something that is often disregarded in data warehouse designs, mainly because it is hard. Bill Inmon's rival Ralph Kimball calls it the "slowly changing dimension" problem and offers some technical mechanisms for dealing with it, but at an enterprise level these lessons are often lost. Time variance, or "effective dating" (no, this is not like speed dating), is indeed critical in many business applications, and is a key feature of Kalido.
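As a rough illustration of effective dating, the sketch below (with invented names and sample data) keeps every version of a customer attribute with validity dates, along the lines of Kimball's "Type 2" slowly changing dimension, so that the warehouse can answer questions as of any point in time:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Minimal sketch of effective dating: instead of overwriting an attribute,
# each change closes the old row and opens a new one, so history is preserved.
# All identifiers and dates here are illustrative only.

@dataclass
class CustomerVersion:
    customer_id: str
    region: str
    valid_from: date
    valid_to: Optional[date]  # None means this is the current version

history = [
    CustomerVersion("cust_x", "EMEA", date(2004, 1, 1), date(2005, 6, 30)),
    CustomerVersion("cust_x", "APAC", date(2005, 7, 1), None),
]

def region_as_of(customer_id: str, as_of: date) -> Optional[str]:
    """Return the region that was effective for the customer on the given date."""
    for v in history:
        if (v.customer_id == customer_id
                and v.valid_from <= as_of
                and (v.valid_to is None or as_of <= v.valid_to)):
            return v.region
    return None

print(region_as_of("cust_x", date(2005, 1, 1)))  # EMEA
print(region_as_of("cust_x", date(2006, 1, 1)))  # APAC
```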

It would indeed be nice if unstructured data mapped neatly into structured data, but here we are rather at the mercy of the database technologies. In principle Oracle and other databases can store images as "blobs" (binary large objects) but in practice very few people really do this, due to the difficulty in accessing them and the inefficiency of storage. Storing XML directly in the DBMS can be done, but brings its own issues, as we can testify at Kalido. Hence I think that the worlds of structured and unstructured data will remain rather separate for the foreseeable future.
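For what it is worth, here is a minimal illustration of what blob storage looks like, using SQLite rather than Oracle simply because it is easy to run; the table and identifiers are invented. It also hints at why this alone does not integrate the two worlds: the bytes go in and come out, but the database cannot see inside them.

```python
import sqlite3

# Minimal, illustrative example of storing unstructured content as a BLOB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (doc_id TEXT PRIMARY KEY, content BLOB)")

image_bytes = b"\x89PNG\r\n..."  # in reality, read from an image file

conn.execute(
    "INSERT INTO documents (doc_id, content) VALUES (?, ?)",
    ("invoice_scan_001", image_bytes),
)
conn.commit()

# Retrieval returns the raw bytes, but nothing in the database can query
# their contents the way it can query structured columns.
row = conn.execute(
    "SELECT content FROM documents WHERE doc_id = ?", ("invoice_scan_001",)
).fetchone()
print(len(row[0]), "bytes retrieved")
```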

The DW 2.0 material also has an excellent section on "the global data warehouse", where he lays out the issues and approaches involved in deploying a warehouse on a global scale. This is what I term "federation", and examples of this kind of deployment can be found at Unilever, BP and Shell, amongst others. Again this is a topic that seems to have entirely eluded most analysts, and yet it is key to getting a truly global view of the corporation.

Overall it is good to see Bill taking a view and recognizing that data warehouse language and architecture badly need an update from the 1990s and before. Many serious issues are not well addressed by current data warehouse approaches, and I welcome this overdue airing of them. His initiative is quite ambitious, and presumably he is aiming for the same kind of impact on data warehouse architecture as Ted Codd's rules had on relational database theory (the latter's rules were grounded in mathematical theory and were quite rigorous in their definition). It is to be hoped that any certification process that Bill develops for particular designs or products will be an objective one rather than one based on sponsorship.

More detail on DW 2.0 can be found on Bill's web site.
