Friday, May 26, 2006

Data warehouse architectures

Rick Sherman writes an interesting article about how data warehousing, despite being quite venerable in IT terms, is still poorly understood. He makes a good point, discussing various typical implementation approaches and how they fail to get to the "single version of the truth" dream. Let's consider for a moment a few architectural choices:

(a) direct access (EII)
(b) data marts only - no pesky warehouse
(c) a single warehouse for the enterprise
(d) a federation of linked warehouses.

The first approach covers only a small subset of reporting needs, and is insufficient to meet most enterprise reporting requirements. Having only single-subject data marts was still surprisingly commonly advocated as late as the mid 1990s (born mainly out of the frustration of lengthy or failed data warehouse projects), yet it pretty clearly is not going to scale for a company of any size. The sheer number of combinations of data sources required to build the marts means that the problem of resolving inconsistency has to be tackled every time a mart is built, rather than being dealt with once in the warehouse, so each mart either becomes a major project in itself, or (more likely) people just give up and go with some data source without getting a complete or even accurate picture.

The single giant warehouse certainly has a lot of appeal, as it resolves the semantic differences of source systems just once, allowing dependent data marts to be deployed easily. The trouble is one of practicality: for a large corporation the size of the task is scary. Large enterprises have hundreds (and usually thousands if they are counting properly) of applications where data is being captured, and these applications are often duplicated by country or major business line. Hence getting hold of all these sources and bringing them into line is going to be a massive challenge. In certain industries (retail, Telco, retail banking) the scale of the data itself is also daunting, bringing major technical challenges.

Hence for any large corporation it seems to me that a federated warehouse approach is what you will end up with, whether you like it or not. Few companies will have the energy or resources to deliver the single giant warehouse, and even those few that do will, in reality, have a series of skunk works data marts/warehouses dotted around the corporation since such a behemoth warehouse will be a bottleneck, hard to change and inevitably slow to respond to rapidly changing business needs.

The most pragmatic approach would seem to me to be to acknowledge this reality and architect for a federated approach, rather than staying in denial. It is practical to build a warehouse for either a country-level subsidiary (or group of countries) or each business line, let that deal with the needs of that particular country or business line, and then link these together into a global warehouse which deals at the summary level. The global warehouse does not need to store every transaction in the enterprise; at that level you need to know what the sales were in Germany yesterday by product, channel and perhaps customer, but not that a particular customer bought a specific item at 14:25 at a store in North Rhine-Westphalia. Detailed information like this is the domain of the country-level warehouse. Because the transaction detail is not needed at the enterprise level, you avoid the problems of technical scale that may otherwise occur, and only deal with the data that makes sense to look at across the enterprise as a whole.
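To make the split between the two levels concrete, here is a minimal sketch in Python of the kind of roll-up a country-level warehouse might feed to the global one. Everything in it is an illustrative assumption rather than a real schema or product: the row layout, the field names and the `summarise_daily_sales` function are all hypothetical.

```python
from collections import defaultdict
from datetime import date, datetime

# Hypothetical transaction rows as they might sit in the German warehouse.
german_transactions = [
    {"country": "DE", "timestamp": datetime(2006, 5, 25, 14, 25),
     "product": "widget", "channel": "store", "amount": 20.0},
    {"country": "DE", "timestamp": datetime(2006, 5, 25, 16, 10),
     "product": "widget", "channel": "store", "amount": 30.0},
    {"country": "DE", "timestamp": datetime(2006, 5, 25, 9, 0),
     "product": "widget", "channel": "web", "amount": 15.0},
    {"country": "DE", "timestamp": datetime(2006, 5, 24, 11, 0),
     "product": "gadget", "channel": "store", "amount": 40.0},
]


def summarise_daily_sales(rows):
    """Roll transactions up to one total per country, day, product and channel.

    The time of day and the individual purchase are dropped; only the
    summary is passed on to the (hypothetical) global warehouse.
    """
    totals = defaultdict(float)
    for row in rows:
        key = (row["country"], row["timestamp"].date(),
               row["product"], row["channel"])
        totals[key] += row["amount"]
    return dict(totals)


global_summary = summarise_daily_sales(german_transactions)

# Four transactions collapse into three summary rows.
print(len(global_summary))
print(global_summary[("DE", date(2006, 5, 25), "widget", "store")])
```

The global level can still answer "what were sales in Germany yesterday by product and channel", while the 14:25 purchase stays in the country-level warehouse. The choice of summary key is the design decision that matters here, since anything not in the key can no longer be queried at the enterprise level.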


Anonymous Stephen Barr said...

While I take your point about growing data volumes and the current trend to store it "indefinitely", I can't help but draw parallels between what you are suggesting and the data marts of the 80s and 90s, where non-transactional summary data was the order of the day because of hardware constraints.

It didn't work then - so are you sure it can work now?

An archiving strategy is always a major part of any warehouse design, and with cheap, alternative near-line storage available in abundance, I don't see the need to revert to summary data.

With summary level data we're back again to the situation where someone has to decide how to summarise, which again leads to a situation where you're designing a warehouse around a set of "presently correct" questions. The beauty of transactional level data is that it enables you to answer any question.

11:56 AM  
Blogger Andy Hayler said...

I think it depends on the scale of data you are talking about. In a B2B situation data volumes may be relatively modest and you may be able to get away with storing full transactional data. However, in B2C situations like Telco, retail banking etc. the volumes can be truly huge and very challenging. In such situations it seems to me better to work on summaries and avoid the need to store massive volumes of transactions. Although hardware gets cheaper, the growth in data continues. As the old saying goes, "Intel giveth, and Microsoft taketh away".

8:49 AM  
