Thursday, March 16, 2006

Should ETL really be ELT?

Traditionally, ETL (extract/transform/load) products such as Informatica, Ascential and others have fulfilled the role of getting data out of source systems (extract), dealing with inconsistencies between those source systems (transform) and then loading the resultant transformed data into a set of database tables (load): perhaps an operational data store, data marts or a data warehouse directly.

However, a number of issues crop up in the process of doing the "transform". Firstly, you are embedding what is essentially a set of business rules (how different business hierarchies, such as product classifications, actually relate to one another) directly into the transformation rules. That is a dark place to go digging should you want to make sense of those rules in other contexts. If the rules are complex, which they may well be, you can create a Frankenstein's monster of transform rules that becomes difficult to maintain, held in a set of metadata that may be hard to share with other applications.

Moreover, this is a one-way process. Once you have taken your various product hierarchies (say) and reduced them to a lowest-common-denominator form, you can certainly start to analyze the data in that new form, but to all intents and purposes you have lost the component elements. These different product hierarchies did not end up different without reason; they may reflect genuine market differences between countries, for example. They may also contain a level of richness that is lost when you strip everything down to a simpler form.

Ideally, in a data warehouse you would like to be able to take an enterprise view, but also retain the individual perspectives of different business units or countries. For example, it may be interesting to see the overall figures in the format of a particular business line or country. Of course there are limitations here, since data from other businesses may not have sufficient granularity to support the views required, but in some cases this can be fixed (for example by providing additional allocation rules), and at least you have a sporting chance of doing something useful with the data if you have retained its original richness. You have no chance if it is gone.
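
To make this concrete, here is a minimal sketch (all table and column names are hypothetical, and the SQL syntax will vary by DBMS) of how a warehouse that has retained the original data can offer both an enterprise rollup and the original local perspective:

-- Table holding UK sales data copied verbatim from the source system,
-- so the UK's own product hierarchy is preserved intact.
CREATE TABLE sales_uk (
    product_code  VARCHAR(20),
    uk_category   VARCHAR(50),   -- UK-specific hierarchy level
    sales_amount  DECIMAL(12,2),
    sale_date     DATE
);

-- The business rule relating the local hierarchy to the enterprise one
-- lives in an ordinary mapping table, open to interrogation by any tool.
CREATE TABLE map_product_enterprise (
    product_code         VARCHAR(20),
    enterprise_category  VARCHAR(50)
);

-- Enterprise view: the lowest-common-denominator rollup...
CREATE VIEW v_sales_enterprise AS
SELECT m.enterprise_category,
       SUM(s.sales_amount) AS total_sales
FROM sales_uk s
JOIN map_product_enterprise m ON m.product_code = s.product_code
GROUP BY m.enterprise_category;

-- ...while the original UK perspective remains available; nothing is lost.
CREATE VIEW v_sales_uk AS
SELECT uk_category,
       SUM(sales_amount) AS total_sales
FROM sales_uk
GROUP BY uk_category;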

Hence there is a strong argument for an "ELT" approach, whereby data is copied from source systems pretty much untouched into a staging area, and only from there is transformation work done on it to produce cross-enterprise views. If this staging area is controlled by the data warehouse then it is possible to provide other, alternative views and perspectives, possibly involving additional business metadata at this stage. The only real cost of this approach is some extra storage, which is hardly a major issue these days. Crucially, the transformation logic is held within the data warehouse, where it is open to interrogation by other applications, rather than buried away in the depths of a proprietary ETL format. Moreover, the DBMS vendors themselves have added more capability over the last few years to deal with certain transformations; let's face it, a SQL SELECT statement can do a lot of things. Since the DBMS processing is likely to be pretty efficient compared to a transformation engine, there may be performance benefits too.
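
As a rough illustration of the ELT pattern (all object names here are hypothetical, and the exact SQL syntax varies by vendor), the load step lands the source rows untouched and the transform is then just SQL executed by the warehouse engine itself:

-- Step 1 (extract + load): land the source data untouched in a staging
-- table; in practice this would typically use a bulk load utility or a
-- database link rather than the plain INSERT shown here.
INSERT INTO stg_orders_raw
SELECT * FROM source_orders;

-- Step 2 (transform): conform the data with a plain SELECT, letting the
-- DBMS engine do the joins, conversions and lookups.
INSERT INTO dw_orders (order_id, customer_key, order_value_usd, order_month)
SELECT o.order_id,
       c.customer_key,
       o.order_value * fx.usd_rate,    -- currency conversion rule
       TRUNC(o.order_date, 'MM')       -- calendar conformance (Oracle-style)
FROM stg_orders_raw o
JOIN dw_customer  c  ON c.source_customer_id = o.customer_id
JOIN stg_fx_rates fx ON fx.currency  = o.currency
                    AND fx.rate_date = o.order_date;

Because the staging tables and the transform SQL both live in the warehouse, any application that can query the database can also inspect the rules.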

This approach has been taken by more modern tools like Sunopsis, which is explicitly ELT in nature. Intriguingly, Informatica added an ELT option in PowerCenter 8 (called "pushdown optimization"), which suggests that this approach is indeed gaining traction. So far, good on Sunopsis for taking the ELT approach, which I believe is inherently superior to ETL in most cases. It will be interesting to see whether Ascential also respond in a future release.

5 Comments:

Anonymous Vincent McBurney said...

I've enjoyed following your blogs but I've taken exception to some of your comments this week. The ETL v ELT debate certainly has a lot of mileage left in it.

"transform rules that become difficult to maintain, in a set of metadata that may be hard to share with other applications"

Why are ELT transformation rules easier to maintain than ETL? Last time I checked, transformation rules built into RDBMS views, procedures and functions were significantly harder to document and maintain than ETL jobs, which is why ETL tools became popular in the first place.

What RDBMS or ELT tools have better metadata sharing and reporting than Informatica Superglue or WebSphere MetaStage?

regards
Vincent (ETL blogger :))

2:54 PM  
Anonymous Nigel Thomas (ex Oracle, ex Constellar Hub) said...

ETL has always been a misnomer anyway. You often need an element of T at the source; and yes, you often find that putting some T after the L helps, etc.

Which is why, for a decade, most serious "ETL" tools have allowed you to specify schedules, chains or networks of these different elements (E, L, and all sorts of different Ts), often in a graphical way. Bolting an ETL/ELT hub onto a data warehouse works in a lot of cases - but isn't always the most appropriate solution.

So perhaps we should be talking about TETLT instead...?

1:11 AM  
Blogger Andy Hayler said...

Some debate - excellent! The point you make about accessibility of the rules has some validity, but it varies by ETL tool. Certainly in our client base we frequently encounter situations where complex ETL rules are regarded by customers as difficult to access and a major blockage to change. It is true that it MAY not be any better to store the business logic in the warehouse, but I believe that, if properly implemented, this should be a more transparent mechanism. Moreover, there are two other reasons why ELT is likely to be superior to ETL most of the time: efficiency of using the DBMS engine, and (more importantly) retaining the different business structures, allowing at least the possibility of more flexible analysis.

2:39 AM  
Blogger Andy Hayler said...

Thanks also for your comment Nigel. Your idea of the TETLT made me smile; as you say, there will be some cases where the best order will vary.

By the way, whatever happened to Constellar? As I recall it had an excellent reputation, and was that rare beast: a UK enterprise software company. I know that they were bought in the end by DataMirror, but what was the inside story?

5:26 AM  
Anonymous Ian Bennett said...

Have to agree with Vincent. If transparency is important then what better than an ETL tool that is consistent regardless of the source or target? If the rules are stored in the database then you need to look in every database (not to mention understand the language used by different vendors) to get the full picture.

Also, all an ETL tool need do is put more database-specific features into its interface to get the performance benefits of ELT. But the key point that many fail to see is that in most non-operational uses, efficiency/performance is not as much of a priority as consistency, visibility and scalability. ETL vendors are now introducing real-time functionality, which is a farce when marketed for data warehousing, but is important for positioning ETL tools as middleware for all data movement; that would mean more central control, visibility and flexibility, and can support the MDM model for operational uses.

As for the proprietary nature of ETL tools, I feel it is a myth that, in general, the processes applied can be properly understood outside of the graphical presentation that these tools provide. Business rules in this domain are usually far more complex than A + B = C.

Personally I see ELT as a backwards step (hey, that's what we used to do before ETL) unless they wrap it in a more consistent user interface, in which case it just becomes a poor cousin to ETL tools: it doesn't do anything that isn't possible in ETL, yet lacks the flexibility provided by server processing.

5:37 PM  
