Getting your data into a staging area from a variety of sources that are modelled in the most 'creative' ways and stored in heterogenous formats. Add billions of I/U/D statements and you got yourself an explosive cocktail.
And yes my fellow EDW architects - we tend to 'forget' this part of the puzzle....admit it. You make yourself beautifull Powerpoint pictures showing stagings, EDW, datamarts, fantastic BI tooling etc..
Oh ah yes..about getting the delta-driven, granular data.....
Famous words: 'the source owner is responsible for getting the data to us in the Staging'.
And yes - that is true. The source owner is responsible for getting the data to the staging. But....It must be avoided that every source 'develops' his own method of extraction. In fact, you really want to standardize the source extraction as much as you can!
About source extraction I could write a whole book. It's a grey-tinted discussion - no silver bullets. So do not - not ever - come to me and say 'Hey Ronald, we use the Enterprise Service Bus for all source extractions - we standardized - you happy now?'.
Well no...because it's bullshit. It's Powerpoint lingo.
Sources that are event-driven, or transaction driven - could be a candidate for extraction with an ESB. There is however that little something called 'a canonical model' that needs to be on top of the source - this is often not present for legacy systems. But hey - if it's there - you might have a fantastic standard (infrastructural) extraction method for getting your data delivered into the stagin area. Point of sale data, stock transactions, etc...could be all prime candidates.
But...
What if your source is proces-driven....I am at the moment working in the goverment and the majority of systems is proces driven. Using the ESB for extraction is a bad choice. Why? well...identifying every change on a granular level is hard and should be programmed (oh damn - you mean we have to program an interface to each proces-driven source? - yes, you have). You program the message and put it on the Bus - how is this different from the classic file-oriented extraction? Not much no. The ESB does not offer a distinctive advantage now anymore, and it's not a standard extraction method.
What could be another infrastructural, standard way of getting source data to your staging? Replication technology. Replicating the change data capture logs must be considered when you got tons of I/U/D loaded into your staging. There is no source-related programming necessary, it's very efficient on the resources of the source system and you got huge data-reliability in terms of completeness and timeliness (you can trickle-feed your staging and EDW - if modelled correctly (!) - getting near realtime loading performance). And last but certainly not least; the advanced replication technologies out there are extremely scalable in capture, processing and deploying data.
Now I already hear people say....'the software is kinda expensive'...Well, is it? Is the alternative more cheaper? Every source needs to make it's own delta-driven, granular data interface. On top of it, they need to maintain it, they need to monitor it, they need to correct it etc..FOR EVERY SOURCE. In the end, when you are architecting an Enterprise Data Warehouse in (yes I am talking large environments here) a data intensive company, the case for replication is propably a sound one.
The following however must be considered; infrastructural extraction is only an option if you own and maintain the source and you got control over change management. If you don't have that, you are propably dealing with a 'buy'-system like for example SAP, Siebel etc.. You do not want an infrastructural dependency with these sources, simply because you do not maintain it. You must insist on a talking to a functional interface - delivered to you by the vendor. In fact - this functional interface MUST be a selection requirement for buying a new source, if you ever want it to be loaded into (for example) a data warehouse.
Now - something you need to think about: what if every source that is outthere in the market had a Data Vault-like interface on top of there sources.......wouldn't that be great? So we got two flavors of standardized functional interfaces; canonical models and 'Data Vault' like models, each with different objectives, each with other requirements to fulfill.
If anyone is still following me on this braindump....let me know, give me your input. There is very little written down on this subject.