The big 'T' stands for 'Transformation'.....and the question is where to position this big 'T' in a Enterprise Data Warehouse architecture. Before I answer this let it be known that the 'big T'-phrase is something I picked up from a session at my client with Nippur. Great guys by the way!
Traditionally the big 'T' is performed on data when it enters the Enterprise Data Warehouse (abbreviated to EDW). Phrases like 'we maintain one single version of the truth' in our EDW. In these architectures the big T is positioned at the supply side of the architecture (near the source). At first, this will seem rather neat and works sufficiently.
But then the pain hits you hard:
- Your data will be increasingly harder to trace (your business will not be happy)
- Very fast it will turn out that your ETL will be increasingly hard to scale (your business will not be happy)
- Your DWH projects do not scale in resourcing (your business will not be happy)
- You will have a increasingly daunting taks of integrating new sources (your business blabla)
- Your flexibiltiy up-stream to the enduser adapting to its ever-changing-requirements will collapse (your blabla...)
- Without knowing you made functional assumptions on how to handle data historically....damn you made the worng assumptions - what now?
- Testing is increasling hard - especially regression testing.....(your business blabla)
Your data warehouse will eventually break down...and yes..your business will not be happy.
The architectural concept of positioning the big T before the EDW is wrong. It's should be punished by law because the client will pay for it big time.
The Big 'T' should always be positioned AFTER the EDW. And yes, all bullets I mentioned in the above will be countered and yes..your business will be happy. I would like to go even further; integration of data should (big T) should be positioned downstream to the enduser as much as possible. Ideally, I would like the big T to be done 'on the fly' by the enduser. However, that's not really achievable, especially performance-wise.
But the concept is there, the big T as downstream (towards the enduser) as possible. What's really fascinating in this simple architectural rule? Well, most architectures talk about integration as being something that needs to be designed upfront. Big T downstream is saying - I only integrate data when asked for.
Integration only when asked for - WE DO NOT INTEGRATE DATA WHEN THE BUSINESS DOES NOT REQUIRE IT TO BE INTEGRATED. In a modern EDW we integrate data between the EDW and the datamart. The EDW is filled with factual 1:1 data that can be (parallel) loaded as soon as it's being staged.
Don't we integrate at all in the EDW...yes we do; but it's semantically poor. We integrate on the data anchors (sorry Dan) of the business; Product, Employee, Customer, location etc.. Those anchors will always be there whatever happens to the source-system-environment. On these anchors the data will be attached - WITH ALL HISTORY (type 2). Where the anchors are defined by the business process, the data that is attached to the anchors are mostly based on the way the data is offered in the staging. In a way you can say we perform a 'very tiny,small T' in loading the EDW.
Just-in-time-integration a collegea of mine called it - and the idea is appealing - not only for EDW enviroments!!! I like this term a lot....
Credit where credit is due....the above is based on the architectural concepts of the Data Vault of Dan Linstedt.
One note to finish; between the EDW and the datamart I position a EDW+ instance. This instance - modelled with Data Vault principles - is seperated from the factual EDW and contains enriched (by the data warehouse) data that needs to be re-used by several datamarts. Hubs+ Links+ and Sattelite+ entities are filled coming from the EDW. This way I decompose the Big T in two stages which makes it more scalable, more re-usable etc.. But again...ONLY WHEN ASKED FOR BY THE BUSINESS.....
About the concepts of this EDW+...next post.
Ronald
100% agree. Using lookup tables (or coded rules) and loading only the transformation results into the EDW is asking for trouble. I got trap into this before (like most of us). With experience and also guidance from Dan Linstedt, I am now integrating source date and transformed data into the same EDW. The life of a transformation rule is sometime really short and based often on the personal taste of the current users. Please, load your source data (not transformed) into your EDW and then add some more entities representing the current transformations. Sometime the rule is so specific to one group of users that it is worth applying it when building their data mart instead of storing the result in the EDW.
With this principle, when the business change the rules, you don't have to reload everything (especially what would represent the fact tables). Another great benefit to this is to let you publish data profiling anomalies (data errors) the same way you would publish any other reports (directly from the EDW). The Data Vault is the best way of achieving this but you can also achieve this goal with more traditional data modeling techniques if desired.
Stef
Posted by: Stephan Deblois | Tuesday, June 03, 2008 at 08:01 AM
Hey Ronald,
Can you expand on what you mean by "Anchors"? I'd like to see if it's just a better term for Business Keys, or if it means something different to you (including architecture).
Thanks,
Dan L
Posted by: Dan Linstedt | Saturday, June 07, 2008 at 05:31 AM
He Dan,
I am trying to find a term that anybody can understand. Business keys, Hubs - all good words, but it is somehow hard to explain to more business oriented people.
Using the metaphor of an anchor (Do not/never confuse this with ANCHOR modelling!) might be an option. A good anchor does not move (or maybe very slightly with the current) - find good anchors in your business. Any busines analyst even a managerial type of person should be able to come up with them; customer, product etc...Most of thse Anchors will never change positions in the business.
It remains hard; but for business sake I really wanna avoid terms like business keys or hubs. These terms are however perfectly fine for using in a more datamodel/technically oriented environment.
Just my 2 cents,
Ronald
Posted by: Ronald Damhof | Sunday, June 08, 2008 at 06:03 AM
Hi Ronald,
Challenging post that counters common wisdom in data warehousing. Just last week, I came across another eye opening reason why integration is only allowed when the business asks for it: it may be against the law! Privacy legislation may forbid the integration of data from different sources (like car taxes and mortgage allowances). Dutch privacy legislation prescribes specific conditions for integration and I've heard from German privacy legislation where it is forbidden at all. This adds to the post made by Stephen, that integration requires business rules and we know business rules may change. Good post!
wouter
Posted by: Wouter | Tuesday, June 10, 2008 at 12:13 PM