Tuesday, March 28, 2006

Data warehouse v master data repository

Bill Inmon notes that "Second-generation data warehouses recognize the need for tying metadata closely and intimately with the actual data in the data warehouse". This is indeed a critical point, and is at the heart of why all those enterprise data dictionary projects in the 1990s (and even 1980s; sad to say I am old enough to have been involved with one in the 1980s) failed. Because the dictionaries were just passive catalogs, they were of some use to data modelers but otherwise there was little incentive to keep them up to date. In particular, the business people could not see any direct benefit to them, so after the initial project went live the things quietly got out of date. In order for such initiatives to succeed it is critical that the business metadata (more important than the technical metadata) is tied into the actual instances of master data, so that the repository does not just list the product hierarchy structure (say) but also lists the product codes that reside within this structure. Ideally, the repository would act as the primary master source of master data for the enterprise, and serve up this data to the various applications that need it, probably via an automated link using middleware such as Tibco or IBM Websphere. Not many companies have taken it to this stage, but there are applications at BP and Unilever that do, for example.

However one important architectural point is that you may not want the data warehouse to actually manage all the master data directly; instead it may be better to have a separate master data repository. The reason for this apparently odd approach is that in a data warehouse you want the data to be "clean" i.e. validated, conforming to the company business model etc. On the other hand master data may have separate versions, drafts (e.g. draft three of the planned new product catalog) that need to be managed, and potentially "dirty" master data that is in the process of being improved or cleaned up. Such data has no place in a data warehouse, where you are relying on the integrity of the numbers.

Hence a broader picture may see an enterprise data warehouse alongside a master data repository, the latter feeding a "golden copy" of master data to the warehouse, just as it will feed the same golden copy to other applications that need it. With such an approach, and current technology, those old enterprise modeling skills might just come in handy.

Incidentally, spring is definitely in the air in Europe. The sun is out in London, there is a spring in people's step, and the French have called a general strike.


Post a Comment

Links to this post:

Create a Link

<< Home