In an article in DM Review, Malcolm Chisholm discusses different types of metadata. He sets out a definition which distinguishes between metadata, master data and reference data (separate from “transaction activity” data). I believe that the argument is flawed in several important ways.
Firstly, I believe that the distinction between metadata, master data, enterprise structure data and reference data as made in the article is actually spurious. One point made about master data is the notion that “Customer A is just Customer A” and that there is no more to it than that. However, to the account manager looking after the customer there is a complex semantic which needs data to define it. What if that customer is, say, “Unilever”? There are all kinds of embedded meaning in the definition of Unilever that are not directly implied by the row itself but are defined elsewhere: is it the whole Unilever group of companies, Unilever in the US, a single Unilever factory, or something else? This type of definitional problem occurs at the level of row entries just as it does for the generic class of things called “customer”. Master data can have semantic meaning at the row level, just as “reference data” (as the term is used in the article) can. This point is illustrated further by the article’s own example of the USA having multiple meanings. Both are valid perspectives on the USA, but they are different things; they are defined and differentiated by the states that make them up, i.e. their composition. This is the semantic of the two objects.
The article seems to want to create ever more classifications of data, including “enterprise structure data”. It argues that “Enterprise structure data is often a problem because when it changes it becomes difficult to do historical reporting”. Enterprise structure data is really just another type of master data. The problem of change can be dealt with by ensuring that all data like this (and indeed all master data) has a “valid from” and a “valid to” date. Hence if an organisation splits into two, we can view the data as it was at a point in time: for example, before and after the reorganisation. Time-stamping the data in this way addresses the problem; having yet another type of master data classification does not help.
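The “valid from” and “valid to” idea can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the OrgUnit class, the as_of function, the unit names and the dates are all hypothetical and do not come from the article, and I have assumed that the “valid to” date is exclusive.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class OrgUnit:
    """A hypothetical organisational unit carrying its own validity period."""
    code: str
    name: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "still valid"

    def is_valid_on(self, day: date) -> bool:
        # Assumption: valid_from is inclusive and valid_to is exclusive.
        return self.valid_from <= day and (self.valid_to is None or day < self.valid_to)


def as_of(units: list[OrgUnit], day: date) -> list[str]:
    """Return the sorted names of the units that were valid on the given day."""
    return sorted(u.name for u in units if u.is_valid_on(day))


# Hypothetical reorganisation: "Sales" splits into two units on 1 July 2005.
UNITS = [
    OrgUnit("S", "Sales", date(2000, 1, 1), date(2005, 7, 1)),
    OrgUnit("SE", "Sales Europe", date(2005, 7, 1)),
    OrgUnit("SA", "Sales Americas", date(2005, 7, 1)),
]

before = as_of(UNITS, date(2005, 6, 30))  # structure the day before the split
after = as_of(UNITS, date(2005, 7, 1))    # structure on the day of the split
```

The same two date fields would work for any other kind of master data; nothing in the sketch depends on the record being an organisational unit.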
The distinction between “reference data” and “master data” made in the article seems to be both false and misleading. Just because “volumes of reference data are much lower than what is involved in master data and because reference data changes more slowly” in no way means that it needs to be treated differently. In fact, it is a very difficult line to draw: while master data may typically be more volatile, “reference data” can also change, with major effect, so systems that store and classify it need to expect and deal with these changes.
In fact, one man’s transaction is another man’s reference data. A transaction like "payment" has reference data like Payment Delivery, Customer, Product, Payment Type. A transaction like "delivery", from the point of view of a driver, might consist of Order, Product, Location, Mode of Delivery. Similarly an "order" could be viewed by a clerk as Contract, Product, Customer, Priority. Where is the line between master and reference data to be drawn?
The article argues that identification is a major difference between master and reference data, and that it is better to have meaningful keys rather than meaningless surrogate keys for things, which the author acknowledges is contrary to received wisdom. In fact there are very good reasons not to embed the meaning of something in its coding structure. The article states that: “In reality, they are causing more problems because reference data is even more widely shared than master data, and when surrogate keys pass across system boundaries, their values must be changed to whatever identification scheme is used in the receiving system.”
But this is mistaken. Take the very real-world example of article numbering, such as the Standard Industry Codes (SIC) or the European Article Number (EAN) codes. EAN codes are attached to products like pharmaceuticals to enable pharmacists to identify a product uniquely. Here a high-level part of the key is assigned to an organisation, e.g. GlaxoSmithKline in Europe as distinct from its US or Australian counterparts, and the rest of the key is then defined as Glaxo wishes. If the article is referred to by another system, e.g. that of a supplier of Glaxo, it can still be identified as one of Glaxo’s products. This is an example of what is called a “global or universal unique identifier” (GUID or UUID), for which there are indeed emerging standards.
A complication is that when the packaging changes, even if only because of changed wording on the conditions of use, a new EAN code has to be assigned. The codes themselves are structured, which is often considered bad practice in the IT world, but the idea is to ensure global uniqueness, not to give meaning to the code. Before Glaxo Wellcome and SmithKline Beecham merged they each had separate identifiers, so the ownership of the codes changed when the merger took place.
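To make the point about structured but meaning-free keys concrete, here is a rough Python sketch. Everything in it is an assumption for illustration only: the four-digit prefixes, the registry contents and the function names are invented, and real article-numbering prefixes are not fixed at four digits.

```python
# Hypothetical fixed prefix length, chosen only to keep the sketch simple.
PREFIX_LENGTH = 4

# Hypothetical registry recording which organisation owns each key prefix.
prefix_owner = {
    "5000": "Glaxo Wellcome",
    "5001": "SmithKline Beecham",
}


def split_code(code: str) -> tuple[str, str]:
    """Split a code into the assigned prefix and the locally defined remainder."""
    return code[:PREFIX_LENGTH], code[PREFIX_LENGTH:]


def owner_of(code: str, registry: dict[str, str]) -> str:
    """Look up the owner of a code from its prefix alone."""
    prefix, _local_part = split_code(code)
    return registry[prefix]


code = "500112345"  # hypothetical product code
owner_before = owner_of(code, prefix_owner)

# After a merger the ownership recorded against each prefix changes,
# but the codes already printed on products stay exactly as they were.
merged_registry = {prefix: "GlaxoSmithKline" for prefix in prefix_owner}
owner_after = owner_of(code, merged_registry)
```

The prefix guarantees uniqueness across organisations, while ownership lives in separate data that can change over time, which is why the structure of the code does not have to carry business meaning.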
Another point I disagree with in the article is “we will be working with a much narrower scope” in the first paragraph. Surely we are trying to integrate information across the company to get a complete perspective. Only small transactional applets can get by with a worm’s-eye view of what they are doing.
The article says “Reference data is any kind of data that is used solely to categorize other data in a database, or solely for relating data in a database to information beyond the boundaries of the enterprise”. But someone in the organization does have to manage this data, even if it comes from outside the company, and that person’s transaction may be setting up this data and making it available to others.
For example, consider the setting of a customer’s credit rating. Someone in Finance has to review a new customer’s credit rating against a list of externally defined credit ratings, say from D&B. Someone in the company spends time lobbying D&B (or parliament/congress) to have additional credit classifications. (The article defines them as Gold, Silver, Bronze etc., but D&B call them AAA, AA etc.) Data is always created through someone carrying out some business function (or transaction); even standards have to be managed somewhere.
A good example of this type of external data, where a computer system is used to support the process, is the engineering parts library. It uses the ISO 15926 standard and is maintained through a collaborative process between specialists from multiple engineering companies. It is a high-level classification scheme which is used to create a library of spare parts for cars, aircraft, electronics etc. This is a changing world and there are always new and changing classifications. Groups of engineers who are skilled in some engineering domain define the types and groups of parts: one group defines pumps, another piping. Someone proposes a change and others review it to see whether it will impact their business; it goes through a review process and is ultimately authorized as part of the standard.
This example is about reference data, in the terms of the article, but it clearly has the problem the article attributes to master data. There are multiple versions and name changes and a full history of change has to be maintained if you wish to relate things from last year with things for this year.
The article has an example concerning the marketing department’s view of customer v. the accounts department’s view of customer. It says this is a master data management issue and is semantic, but that this doesn’t apply to reference data. It clearly does relate to reference data (see the definition of the USA above and the ISO example above). But what is more important is that the issue can be resolved for both master and reference data by adopting the standards for integration defined in ISO 15926. Instead of trying to define customer in a way that satisfies everyone, it is best to find what is common and what is different. Customers in both definitions are companies; it is just that some of them have done business with us and others have not (yet). Signed-up customers are a subset of all potential customers.
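The idea of finding what is common and what is different can be sketched as follows. This is a simplified illustration of my own, not something taken from the article or from ISO 15926: the Company class, the has_traded_with_us flag and the company names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Company:
    """What both departments have in common: a customer is a company."""
    name: str
    has_traded_with_us: bool  # hypothetical flag marking the difference


# Hypothetical data set shared by both departments.
COMPANIES = frozenset({
    Company("Alpha Ltd", True),
    Company("Beta GmbH", False),
    Company("Gamma Inc", True),
})


def marketing_view(companies: frozenset) -> set:
    """Marketing counts every company, whether or not it has bought yet."""
    return set(companies)


def accounts_view(companies: frozenset) -> set:
    """Accounts counts only companies that have actually done business with us."""
    return {c for c in companies if c.has_traded_with_us}


potential_customers = marketing_view(COMPANIES)
signed_up_customers = accounts_view(COMPANIES)
signed_up_names = sorted(c.name for c in signed_up_customers)
```

Both views are derived from one shared set of companies, so neither department has to give up its definition, and the subset relationship between them is explicit rather than argued over.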
At the end of the section on The Problem of Meaning the article says “These diverse challenges require very different solutions”; then, in the section on Links between Master and Reference data, it says “If there is a complete separation of master and reference data management, this can be a nightmare”, and it goes on to say “we must think carefully about enterprise information as a whole”. I agree with this final statement, but it is critical that we do not put up artificial boundaries and try to solve specific problems with generic rules which differentiate according to a rather arbitrary definition such as master and reference data.
The line between master and reference data is really fuzzy in the definition used. Clearly “Product” is master data, but if I have a retail gasoline customer which has only three products (Unleaded, Super and Diesel), I guess that means this is reference data. The engineering parts library classification scheme is a complex structure with high volumes (1000s) of classes, so that makes it master data, but it is outside the company, so does that make it reference data?
In summary, the article takes a very IT-centric, transactional view of the world. It tries to create separate classifications where in fact none exist. Far from simplifying things, the suggested approach will cause serious problems if implemented: when these artificial dividing lines blur (which they will), the systems relying on them will break. What is needed is not separation but unity. Master data is master data is master data, whether it refers to the structure of an enterprise, a class of thing or an instance of a thing. It needs to be time-stamped and treated consistently with other types of master data, not treated arbitrarily differently.
I am indebted to Bruce Ottmann, one of the world's leading data modelers, for some of the examples used in this blog.