Tuesday, November 15, 2005

Data quality blues

A 2005 research note from Gartner says that more than 50% of data warehouse projects through 2007 will either be outright failures or achieve only limited acceptance. This does not surprise me in the least. Data warehouse projects are under unusual strain in several ways, on top of the normal problems that can beset any significant project. A data warehouse takes in data from several separate source systems (ERP, supply chain, CRM, etc.) and consolidates it. Consequently it is dependent on both the quality of the data and the stability of those source systems: if any of the underlying systems undergoes a major structural change (e.g. a new general ledger structure or customer segmentation) then it will affect the warehouse. You might think that data quality was a minor problem these days, with all those shiny new ERP and CRM systems, but you'd be wrong. In Kalido projects in the field we constantly encounter major data quality issues, including with data captured in the ERP systems. Why is this?

An inherent problem is that systems typically capture information that is directly needed by the person entering the data, plus other things that are useful to someone else, but not directly to that person. I remember doing some work in Malaysia and seeing a row of staff entering invoice data into a J.D. Edwards system. I was puzzled to see them carefully typing in a few fields of data, then mashing their fingers at random into the keyboard, and after a while resuming normal typing. After seeing this a few times my curiosity got the better of me and I asked one of them what was going on. The person explained that there were about 40 fields they were expected to enter, many of which were unnecessary to them, and they could not move to the next screen without tabbing through each field in turn. However, if they entered some gibberish in one of the main fields, the system conveniently took them to the last field. So by typing nonsense data into a field that turned out to be quite relevant (though not to them), they could save a lot of keystrokes.
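This kind of keyboard-mashing is exactly what a simple automated quality rule can catch. As a minimal sketch (the heuristic, field names and sample values here are invented for illustration, not taken from any real system):

```python
import re

def looks_like_gibberish(value: str) -> bool:
    """Crude heuristic: a non-empty value with no spaces and no vowels
    suggests random typing rather than a real word or name."""
    return bool(value) and " " not in value and not re.search(r"[aeiouAEIOU]", value)

# Hypothetical invoice record with one carefully-entered field and one mashed one
invoice_fields = {"supplier": "Acme Sdn Bhd", "cost_centre": "qwrtzxcv"}

flagged = [name for name, value in invoice_fields.items()
           if looks_like_gibberish(value)]
print(flagged)  # the mashed field is flagged for review
```

A real data quality tool would of course use richer rules (reference lists, format masks, cross-field checks), but the principle is the same: validate at the point of capture, or at least flag suspect values before they reach the warehouse.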

Of course this is an extreme case, but have you ever filled out an on-line survey, got bored or frustrated because it asked for something you didn't have, and started answering any old thing just to get to the end? The point is that people care about data quality when they are going to get something back. You can be sure they enter their address correctly on the prize draw form. But in many IT systems people are asked to enter information that doesn't affect them, and human nature says they will be less accurate with this than with something that means something to them directly. Some data quality issues can be dramatic. In the North Sea one oil company drilled through an existing pipe because, according to the system that recorded the co-ordinates of the undersea pipes, it was not there. This merely cost a few million dollars to fix; fortunately the pipe was not in active use that day, or the consequences would have been much worse. Another company discovered that it was making no profit margin on a major brand in one market due to a pricing slip in its ERP system that had gone undetected for two years.

The reason data warehouses suffer so much from data quality issues is that they not only inherit the data problems of each source system they deal with, but, because they bring all the information together, the problems often only become clear at that point. For example, the pricing problem became apparent because the data warehouse showed zero gross margin; this was not visible inside the ERP system, since the margin calculation combined data from several systems. It is the data warehouse that shines a light on such issues, yet it is often wrongly blamed when the project is delayed as a result.
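To make the point concrete, here is a minimal sketch of why the margin problem could only surface in the warehouse. The products, prices and costs are entirely invented; the point is that neither system alone holds both halves of the calculation:

```python
# Selling prices as recorded in the ERP system (per unit) -- invented figures
erp_prices = {"BrandX": 10.00, "BrandY": 12.50}

# Unit costs held in a separate supply chain system -- invented figures
supply_chain_costs = {"BrandX": 10.00, "BrandY": 7.30}

# Neither system alone can compute gross margin; only the warehouse,
# which combines both feeds, can run this check.
for product in erp_prices:
    margin = erp_prices[product] - supply_chain_costs[product]
    if margin <= 0:
        print(f"Alert: {product} has a gross margin of {margin:.2f}")
```

Inside the ERP, BrandX's price looks perfectly plausible; it is only when the price is set against the cost from the other system that the zero margin jumps out.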

Data quality problems are one major issue for which there is no magic solution. Data quality tools can help, but this is a people and process issue rather than a technology issue. Another reason data warehouse projects are perceived to fail is that they take a long time to build and cost a lot to maintain. Since it takes 16 months to build an average data warehouse (TDWI survey), it is not surprising that some changes to the business occur in that time. The only way to really address this is to use a packaged data warehouse solution, which takes less time to implement (typically less than six months for Kalido). Maintenance costs are another major problem, and here again there are modern design techniques that can be applied to improve the situation. See my earlier post, "the data warehouse carousel".

It is only by making use of the most modern design approaches, iterative implementation approaches that show customers early results, and the most productive technologies that data warehouse project success rates will improve. There will always be projects that run into trouble due to poor project management, political issues and lack of customer commitment, but data warehouse projects at least need to stop making life harder for themselves than it needs to be.

