Getting your data into a staging area from a variety of sources that are modelled in the most 'creative' ways and stored in heterogenous formats. Add billions of I/U/D statements and you got yourself an explosive cocktail.
And yes my fellow EDW architects - we tend to 'forget' this part of the puzzle....admit it. You make yourself beautifull Powerpoint pictures showing stagings, EDW, datamarts, fantastic BI tooling etc..
Oh ah yes..about getting the delta-driven, granular data.....
Famous words: 'the source owner is responsible for getting the data to us in the Staging'.
And yes - that is true. The source owner is responsible for getting the data to the staging. But....It must be avoided that every source 'develops' his own method of extraction. In fact, you really want to standardize the source extraction as much as you can!
About source extraction I could write a whole book. It's a grey-tinted discussion - no silver bullets. So do not - not ever - come to me and say 'Hey Ronald, we use the Enterprise Service Bus for all source extractions - we standardized - you happy now?'.
Well no...because it's bullshit. It's Powerpoint lingo.
Sources that are event-driven, or transaction driven - could be a candidate for extraction with an ESB. There is however that little something called 'a canonical model' that needs to be on top of the source - this is often not present for legacy systems. But hey - if it's there - you might have a fantastic standard (infrastructural) extraction method for getting your data delivered into the stagin area. Point of sale data, stock transactions, etc...could be all prime candidates.
But...
What if your source is proces-driven....I am at the moment working in the goverment and the majority of systems is proces driven. Using the ESB for extraction is a bad choice. Why? well...identifying every change on a granular level is hard and should be programmed (oh damn - you mean we have to program an interface to each proces-driven source? - yes, you have). You program the message and put it on the Bus - how is this different from the classic file-oriented extraction? Not much no. The ESB does not offer a distinctive advantage now anymore, and it's not a standard extraction method.
What could be another infrastructural, standard way of getting source data to your staging? Replication technology. Replicating the change data capture logs must be considered when you got tons of I/U/D loaded into your staging. There is no source-related programming necessary, it's very efficient on the resources of the source system and you got huge data-reliability in terms of completeness and timeliness (you can trickle-feed your staging and EDW - if modelled correctly (!) - getting near realtime loading performance). And last but certainly not least; the advanced replication technologies out there are extremely scalable in capture, processing and deploying data.
Now I already hear people say....'the software is kinda expensive'...Well, is it? Is the alternative more cheaper? Every source needs to make it's own delta-driven, granular data interface. On top of it, they need to maintain it, they need to monitor it, they need to correct it etc..FOR EVERY SOURCE. In the end, when you are architecting an Enterprise Data Warehouse in (yes I am talking large environments here) a data intensive company, the case for replication is propably a sound one.
The following however must be considered; infrastructural extraction is only an option if you own and maintain the source and you got control over change management. If you don't have that, you are propably dealing with a 'buy'-system like for example SAP, Siebel etc.. You do not want an infrastructural dependency with these sources, simply because you do not maintain it. You must insist on a talking to a functional interface - delivered to you by the vendor. In fact - this functional interface MUST be a selection requirement for buying a new source, if you ever want it to be loaded into (for example) a data warehouse.
Now - something you need to think about: what if every source that is outthere in the market had a Data Vault-like interface on top of there sources.......wouldn't that be great? So we got two flavors of standardized functional interfaces; canonical models and 'Data Vault' like models, each with different objectives, each with other requirements to fulfill.
If anyone is still following me on this braindump....let me know, give me your input. There is very little written down on this subject.
may be the first step into building a Data Vault could be to bring the source tables themselves into the Data Vault. Yes, they are not integrated yet but you have some benefits to bring them early in the DV. In real life, source systems are not always sending correct information and receiving corrected past records can be manage gracefully if the source tables are already stored in the DV with it's full history tracking. You can then load the integrated parts of the DV from the current view of the source DV tables. I don't see any problem with merging source and integrated data into one DV but you could also build two Data Vault, a "Staging Data Vault" and an "Integrated Data Vault". A benefit of a one DV architecture would be to give the business access to a layer of "grey" data...not scrubbed yet not integrated yet but still useful for some reporting scenarios.
Posted by: Stephan Deblois | Monday, November 03, 2008 at 08:27 AM
Stephan, the problem I mentioned with source extraction is typically a scale-problem. It's a freakin volume to get (delta) data from a source that is just huge in terms of volume.
Putting source tables next to the DV would result in huge databases....why? It's in my opion not necessary and not a valid strategy to put source tables inyyour DV. I really got questions about the extensiability of the DV model when you do such a thing. And how do you maintain a simple, yet standard and very performant load architecture?
Back to the topic....even if you put source tables in the data vault. How do you copy/extract/move the data from the source when it's huge, how do you do delta comparison when the volume is huge? That's the nasty stuff we gotta deal with in extraction of high-volume sources. They are not solved in moving data to the DV.
Posted by: Ronald | Monday, November 03, 2008 at 10:46 PM
Yes, you are right. The volume of source data makes a big difference in the strategy chosen. Currently, I am dealing with low volume but very complex data source. The sources are also all external (business partners) with not much control on the content. Fixing sources and keeping track of the changes is part of our daily routine...the source DV makes a lot of sense in our case. As usual, it depends :-)
Posted by: Stephan Deblois | Tuesday, November 04, 2008 at 05:52 AM