Last week I wrote a post on the Raw Data Vault that got some good, insightful comments. This post is a joint effort by Dan Linstedt and me on the same subject.
In his book (published this week) Dan mentions a Raw Data Vault as well. We discussed this and came to the conclusion that the Raw Data Vault as mentioned by Dan in his book is in fact the actual DV (integrated on the HUBs, using business keys). He used the term ‘Raw’ to distinguish it from the Business Data Vault.
Let us be clear: the Raw Data Vault as described in my blog post “Data Vault Schools” is not the same Raw Data Vault as described in Dan’s book. In fact, it differs fundamentally from DV methodology as Dan intended it. This is in line with my blog post "the case against the Raw Data Vault".
We both agree that there is no way to generate everything, because identification of the business keys has to happen. We do however acknowledge the possibility that, if you can specify the business keys, there are options to generate the model.
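To illustrate that last point, here is a minimal Python sketch. It assumes a hypothetical `SourceTable` description in which a modeller has already named the business keys; the class, the function `generate_hub_ddl`, the column types and the hub layout (surrogate key, load timestamp, record source) are illustrative assumptions, not a definition taken from Dan's book or from any existing generator.

```python
from dataclasses import dataclass, field


@dataclass
class SourceTable:
    """Hypothetical description of one source table."""
    name: str
    columns: dict[str, str]  # column name -> SQL type
    # Business keys must be supplied by a person; they are not derived here.
    business_keys: list[str] = field(default_factory=list)


def generate_hub_ddl(table: SourceTable) -> str:
    """Generate a hub definition, but only if business keys were specified."""
    if not table.business_keys:
        raise ValueError(
            f"No business keys specified for '{table.name}': "
            "identification has to happen before anything can be generated."
        )
    missing = [k for k in table.business_keys if k not in table.columns]
    if missing:
        raise ValueError(f"Unknown business key column(s): {missing}")

    lines = [f"CREATE TABLE hub_{table.name} ("]
    lines.append(f"    {table.name}_sk INTEGER NOT NULL PRIMARY KEY,")
    for key in table.business_keys:
        lines.append(f"    {key} {table.columns[key]} NOT NULL,")
    lines.append("    load_dts TIMESTAMP NOT NULL,")
    lines.append("    record_source VARCHAR(50) NOT NULL,")
    lines.append(f"    UNIQUE ({', '.join(table.business_keys)})")
    lines.append(");")
    return "\n".join(lines)


customer = SourceTable(
    name="customer",
    columns={"customer_number": "VARCHAR(20)", "name": "VARCHAR(100)"},
    business_keys=["customer_number"],
)
print(generate_hub_ddl(customer))
```

The generation step itself is mechanical. The part that cannot be automated in this sketch is filling in `business_keys`, which is why the function refuses to run without them.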
Ronald Damhof & Dan Linstedt
***update***
The first two comments on this post are valid, and we felt we needed to be more precise on the terminology.
1) Raw Data Vault = A term that should no longer be used in DV methodology. If it is used in formal writings, communications, blogs or whatever, then it refers to a Data Vault (integrated on the HUBs, using business keys, etc.) as defined by Dan Linstedt.
I will also update my other 2 posts to reflect this terminology.
Ronald, can you rename your meaning of a raw DV? Then if we talk about the raw DV we are talking about the same thing, namely the actual Data Vault as Dan described it in his book. The Raw DV you (and many others) are talking about could, for example, be named Staging Data Vault?
Posted by: Basstiekema | Thursday, February 17, 2011 at 02:48 AM
I'm a bit confused now. There are two types of raw data vaults? In what way do they differ from each other? Please make a list of differences (and agreements) to clarify this! Please give examples!
The raw data vault defined by you is the one that is generated from the source (based on database keys) and the one of Linstedt is defined on the business keys (which is the craftsman's job)?
Posted by: Hennie de Nooijer | Thursday, February 17, 2011 at 04:23 AM
Yeah, we are aware of the confusion. We were already busy with the terminology. ***update posted***
Posted by: Ronald Damhof | Thursday, February 17, 2011 at 05:04 AM
Staging Vault is too suggestive IMO.
If we start discussing names we could have:
Stovepipe Data Vault
Source (generated/oriented) Data Vault
Technical Data Vault
Posted by: DM_Unseen | Thursday, February 17, 2011 at 06:42 AM
DATPROF will soon release a data vault demo containing Historical Staging, Staging DV, Raw DV. It will contain sample data and will run on Oracle and SQL Server. This demo will show the differences between the three approaches in all their aspects (modeling, generation and deployment).
Posted by: Harald Kikkers | Thursday, February 17, 2011 at 07:39 AM
We just call it the source data vault (sDV). It's a source model converted to a data vault model. Hence the name.
How could this not be part of the DV methodology? I am baffled by that statement. Of course, on its own, an sDV is pretty useless. It's whatever you do with your sDV next that gives it value. In our case: a business data vault (bDV). It's the complete solution that gives you value, not a singled out component.
I think the only difference between schools/approaches is _where_ in the stream you have placed your different EDW functionalities (history tracking, business key integration, semantic conversion, etc.).
As long as you end up with a correct data vault _with business value_, I don't care how you get there (whether you generate an sDV or not, whether you integrate immediately or in a later stage).
Posted by: Johannesvdb | Friday, February 18, 2011 at 03:04 AM
@Johannes: The staging vault is a copy of the source, DV modelled, yes, but not according to DV methodology. Why are you baffled? Is this not a 'fact'?
I do not pass judgement - good or bad - on your solution. You seem to think that?
Wrong. Tbh if it's working for you - perfect! No problem. But DV methodology is not just a DV modelled data model, it is more than that. We need to differentiate this approach from DV methodology as Dan intended it. Simple.
In fact I am hugely curious about how you implemented your solution and would like to sit with you and watch it work!
Posted by: Ronald Damhof | Friday, February 18, 2011 at 02:41 PM
Hi Ronald and Dan,
Nice discussions about names and terms. Let's join :)
I think 'staging DV' is a bad combination of terms: staging is volatile and a DV is not. Maybe just stick to the term Johannes uses, 'source DV'. It's directly derived from the source interface. Maybe 'interface DV' is a better term, who knows?
@Ronald, as you stated in a comment in the 'case against the raw DV', you often end up using TK's in the hubs. So in real life a (elementary/fact?) DV consists of hubs with TK's and BK's.
I think the discussion is too black and white, it should be more colorful (or should I say there is no one version of the truth, just a couple of viewpoints?). I see the collection of sDV's (with TK's and BK's) as a starting point. It's an evolution: you start with the TK's and after some time (when business knowledge grows) you see more BK's popping up from your sources. The collection of sDV's CAN be coupled on BK's more and more, and it starts looking more like an (elementary/fact) DV.
Source systems are often badly modeled and users are creative, so we have to use business rules to integrate further. I like the definition of 'business DV' Dan uses in his comment on the DV schools post:
"a subset of tables coming from the Raw DV (the true EDW), where the data is processed through common business rules used by all data marts"
Dan also states that we can connect DV's, creating one big virtual DV (a nice hypergraph :)
But to keep it manageable we create logical layers. So we end up with the four layers (and smash in some new terms):
- the Interface (Staging/CDC) layer
- the Fact (Dan's Raw DV, the collection of (mostly connected) sDV's) layer
- the Common Business layer (Staging out, EDW+, bDV)
- the User layer (Data Marts)
Just a little contribution from my side. Maybe not completely 'DV methodology' proof, but that's not a problem ;)
Regards,
JJ.
Posted by: Delostilos | Friday, February 18, 2011 at 06:29 PM
Hey JJ - good to see you here.
I started this discussion to polarise - yes, be black and white in order to more clearly see boundaries and get some kind of logic in the mishmash of terminology.
Your post is nuanced and rightly so.
Naming a 100% source driven model 'SourceDv' 'Staging Dv' ...I honestly do not care about it much. But, I saw the term 'Raw Data Vault' also mentioned in Dan's book and it made me/us wanna clarify it, because the 'Raw DV' used in the NL and the 'Raw DV' (=True EDW, DV) used by Dan, are not the same.
I still have extreme doubts about the usefulness of a 'Source DV'. I have not heard any good arguments for it over a persistent staging area or a DV (True EDW) as it was meant.
I also doubt the evolutionary character you're describing - from source DV's to a true EDW (=DV). I think it's an illusion. Would be nice if you could elaborate a bit more.
Finally - the origin of my worries (;-)) stems from the fact that I see certain practitioners/Service Providers/automated tooling selling generated DWH's and DV's like they sell cookies. In my opinion the customer is getting squat/zip/zero (the semantic gap is just as big as it was before), it will hurt the DV community and it will constrain DV innovation in the coming years. This should not be confused with DV methodology.
A few weeks ago I met a 'consultant' saying he can generate any DV-DWH in 8 hours. In my opinion he generated a copy of a source, which has nothing to do with DV methodology.
I can make a copy even faster btw....
And btw - the concept of the bDV (EDW+, Staging Out), which I first coined at the Tax Authority service and which was inspired by Albert Heijn's Pallas project (I discussed it with Dan in 2007 over beers), is mismatched already as well.
The bDV coming from the 'Source DV' guys is a different bDV (EDW+, Staging Out) from the one Dan and I defined.
Whatever opinion we all might have (Clint Eastwood: opinions are like assholes, everybody's got one), I think we all agree that we need STANDARDS. For all I care we get to have several DV methodologies. I just want them out in the open, with more transparency and more discussion.
At this moment peeps/companies/products are all screaming that they support/generate Data Vaults... But do they really? Or is it some kind of 'fork'/mutation?
Again - thx for your input!
Ronald
Posted by: Ronald Damhof | Saturday, February 19, 2011 at 02:08 AM
@Ronald & JJ
An evolutionary DV is what I'm currently working on at the RU. This is mainly due to the fact that source system integration is ad hoc and incomplete.
I'm not against source driven DV's, but I *AM* against unnecessary usage of TK's in a DV, and I will go to great lengths to avoid them. While my current approach is generatable, current DV generation tools are not sophisticated enough to handle this (JJ knows what I mean). Besides, solving TK issues is impossible in standard DV. You actually need to borrow transformations from Anchor Modeling to make this work in a generic and automatable fashion.
This state of affairs leaves correct handling of TK issues beyond the scope of current DV generators.
Posted by: DM_Unseen | Sunday, February 20, 2011 at 05:37 AM
I would like to point back to a blog post by Dan Linstedt (August 2010). Where does the "Staging Area" as described by Dan fit in the "Raw Data Vault" discussion?
http://danlinstedt.com/datavaultcat/data-vault-and-staging-area/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+DataVaultCoaching+%28Data+Vault+Coaching%29
Posted by: Joey Moelands | Monday, February 21, 2011 at 01:10 AM