
Saturday, February 12, 2011



Marco Schreuder

Nice: 100% semantic gap.
I really like this post.
I have one question:
In the 3rd logical layer, is staging out modelled "dimensional" and the EDW+/CDW+ Data Vault style?



The BDV/EDW+/CDW+ is modelled DV style.
Staging out is modelled dimensional or otherwise.
IMO a staging out does not have to be persistent; just like a staging in, it can be either volatile or permanent.


A source-oriented semantic layer is easily constructed on top of a raw DV, so no big deal there. A raw DV still adds value on temporalization, although not on integration.



While it's easy to TECHNICALLY construct a source-oriented semantic layer, most of the difficult work in building the data warehouse resides precisely in bridging the semantic gap.

And since we can also get value on temporalization by just copying the source tables and adding start and end dates to them, I fail to see what we gain by using Data Vault. More to the point: I do not think we should call that "Data Vault", because if all DV is, is a data transform, then there is no point in doing it. A Raw DV will provide no more business value than any other ODS with history, and when people think that "DV is just a difficult way of building an ODS" then pretty soon DV will be going the way of the Dodo.
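To make the comparison concrete, here is a minimal sketch of what that plain "copy the source and add start/end dates" approach amounts to. All names (`apply_snapshot`, the attribute dict, the dates) are invented for illustration, not taken from any tool discussed here:

```python
from datetime import date

def apply_snapshot(history, key, attrs, load_date):
    """Close the current open row for `key` and open a new one when attributes change."""
    current = next((row for row in history
                    if row["key"] == key and row["end"] is None), None)
    if current is not None and current["attrs"] == attrs:
        return  # unchanged: keep the open row as-is
    if current is not None:
        current["end"] = load_date  # close the previous version
    history.append({"key": key, "attrs": attrs,
                    "start": load_date, "end": None})

history = []
apply_snapshot(history, 42, {"city": "Utrecht"}, date(2011, 1, 1))
apply_snapshot(history, 42, {"city": "Den Haag"}, date(2011, 2, 1))
# history now holds two versions: the first closed on 2011-02-01, the second still open
```

That is all the "temporalization" a source copy with validity dates gives you, which is the point being made: by itself it involves no integration at all.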



No discussion there (at this point;)

But just as Veldwijk uses a DV-like solution for an operational historical system, we can use DV on a system and gain benefits in schema, timeline and storage management. They are still real gains! Especially the space management is very important. An HSA is *NOT* a long-term solution with current DBMS systems, unless storage costs are not relevant and maintaining schema changes in a consistent fashion is not required/mandatory.

With current mainstream DBMSes, and for generic temporalization of arbitrary data models, I would always look at anchor-oriented modelling approaches (AM/DV/HTC) and not just a basic 5TNF approach.



I like this post and follow your logic. However, I don't fully agree with it. I don't see the situation as black and white.

Our reasons to have an SDV (= raw DV) are quite pragmatic.

Since it's not possible to fully generate a BDV (a business modeled DV) from a source system (it involves interpretation and thus intelligence and thus handwork to produce), you have two choices:

1) Skip the SDV and build a BDV by hand
2) Generate a SDV and build a BDV by hand

Put simply like this, it seems that 2) is more work. However, in my experience and in our situation, building a BDV on top of a SDV is really easy. The complex stuff like tracking history, and the work-intensive stuff like building a physical DV structure and loading routines, have already been taken care of, without any expensive handwork. Our BDV is pretty easy stuff from there. In our case, it is a collection of (pretty simple) views (SELECTs and JOINs). Really no rocket science. For example, it's not difficult to integrate two hubs into one shared hub and to combine their satellites. However, building this directly on top of a source system by hand is more work.
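As an illustration of how light such BDV views can be, here is a toy sketch using Python's built-in sqlite3. All table and column names (`hub_customer_crm`, `customer_nr`, etc.) are invented, not taken from the project discussed; the point is only that merging two raw hubs into one shared hub is a single UNION view:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Two raw (source-oriented) hubs, one per source system.
cur.execute("CREATE TABLE hub_customer_crm (customer_nr TEXT PRIMARY KEY)")
cur.execute("CREATE TABLE hub_customer_erp (customer_nr TEXT PRIMARY KEY)")
cur.executemany("INSERT INTO hub_customer_crm VALUES (?)", [("C1",), ("C2",)])
cur.executemany("INSERT INTO hub_customer_erp VALUES (?)", [("C2",), ("C3",)])
# The shared business hub is just a view: UNION deduplicates the shared key C2.
cur.execute("""
    CREATE VIEW hub_customer AS
    SELECT customer_nr FROM hub_customer_crm
    UNION
    SELECT customer_nr FROM hub_customer_erp
""")
rows = sorted(r[0] for r in cur.execute("SELECT customer_nr FROM hub_customer"))
print(rows)  # ['C1', 'C2', 'C3']
```

A disposable semantic layer in exactly the sense described: drop and recreate the view whenever the interpretation changes, without touching the stored history underneath.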

In practice, option 2 is a lot less work for us (= cost-effective) than option 1. Plus it gives us more robustness; we're less likely to make errors since there is less handwork. Most importantly, we didn't need a high initial investment in license costs or labour to produce results (a first-iteration EDW), and in our current political situation (convincing a council to part with cash in times of considerable cutbacks), that's a giant plus.

Furthermore: a BDV is, by design, an interpretation of data. Isn't this a form of "one version of the truth" that we all agree we definitely don't want at this level in an EDW? Something that changes whenever interpretation changes? Is it then such a bad idea to make this a semantic, disposable layer?

However, let me add: if we were able to automate a BDV (by somehow entering the required business logic into the software), that could be even better. The open source tool we use (Quipu) has some plans in that direction and we'd be happy to contribute to that. In the meantime, our EDW is still cheap to build and still offers us the advantages of a DV-modelled EDW.

I don't see how there can be one good and one bad way of achieving this; it depends on the situation, context and aspired goals.

Ronald Damhof

Quick reaction to Johannes's comment;

You seem to distinguish very strongly between a Raw DV and a Business DV. In my opinion both may be following the DV modelling style, but they are not following the DV methodology.

Let it be clear, btw, that this is in itself no problem (I am not the methodology police), but it confuses the DV landscape big time. And since standardization is a prime argument for DV, I try to differentiate between DV 'schools'.

Your interpretation of a Business DV seems to adhere to the SPOT (single point of truth) idea, correct?

The Raw DV is generated (model and loading) 100% from the source, correct?

Again, in my opinion you may be modelling DV style, but you are not using the DV methodology for either the Raw or the Business DV. The DV methodology produces a model integrated only on the business keys/hubs (a very light, but extremely important, integration). This also diminishes the "semantic gap" you have to overcome somehow. Another argument is that you become somewhat more loosely coupled from your sources and are able to track business entities across sources and across time. A final argument is that you build some level of protection against changes in sources that you do not want to resonate all the way through your models and data logistics. The remainder of the objects in the DV methodology follow the source. The DV methodology adheres to the single-version-of-the-facts idea: business rules downstream.

The only creative activity that is required in the above-stated DV methodology is designing the model; the loading can be accommodated by generation engines. Although the latter is not to be considered a holy grail, this work is pretty straightforward.

My point is: somewhere down the line you've got to do the work and bridge the "semantic gap", and I am still puzzled by, and not convinced of, the contribution of a Raw Data Vault in this respect.

I do acknowledge, however, that the Raw school exists and seems to be gaining ground.


In the light of this discussion it may be useful to look at the background of the Raw Data Vault from a historical perspective.
In its origin the DV method used the business keys as the basis of the HUBS.
In the first courses given in Holland, the Northwind database was used as an example. In this database it was rather easy to recognize the business keys.
The usage of business keys had important advantages, such as reducing the semantic gap, using hubs to integrate data from several sources, and achieving independence from the often somewhat far-fetched data models of source systems.
The addition of Links and Sats made it possible to store history and to parallelize load processes, and it made the model easy to maintain.
So far so good.
However, in practice it was not so easy to identify unique business keys. Business users often do not think in terms of business keys, and keys in source systems are often technical and not in line with the business keys.
Another problem was the concept of 100% traceability that was linked to the DV method. First of all, you should watch out with 100% principles: they are rarely cost-effective (when reachable at all).
Also, in my opinion there was a misconception that all the source data should be forward-traceable into the data warehouse. I think it is far more important to be able to trace information from the data warehouse back to the source than the other way round.
Another 100% principle was the "100% of the data, all the time" rule. This rule was often interpreted as "you never know what information the business is going to ask for, so let's give them all the data from the source system". In my opinion a rather lazy rule, because it allows you not to think about the business value of the data.
Because the load processes for HUBS, SATS and LINKS look rather standard, and because of the 100% rules, the need arose to automate the DV modelling and the generation of the load processes. The difficulty of identifying the business keys could be overcome by using the technical keys of the source systems instead.
And so the Raw Data Vault was born. And I agree with Ronald: all we have achieved is a persistent staging area with history.
Luckily we still have the real challenges of business intelligence: delivering the right information at the right time, in the right form, to the right people. Therefore we should focus on things like analyzing what information delivers real business value, how we link information to data by defining business rules, and defining rules for integrating data from different sources.
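The "rather standard" hub load process mentioned above is indeed mechanical enough to generate from a little metadata, which is why the automation tools appeared. A hypothetical sketch (function and table names are invented for illustration; real generators are of course more elaborate):

```python
def hub_load_sql(hub, key, staging):
    """Generate the standard hub load: insert business keys not yet in the hub."""
    return (
        f"INSERT INTO {hub} ({key}) "
        f"SELECT DISTINCT s.{key} FROM {staging} s "
        f"WHERE NOT EXISTS (SELECT 1 FROM {hub} h WHERE h.{key} = s.{key})"
    )

sql = hub_load_sql("hub_customer", "customer_nr", "stg_customer")
print(sql)
```

Because every hub (and similarly every sat and link) loads with the same pattern, only the model design remains as creative work, exactly as argued earlier in the thread.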



I agree with your assessment of the issues with technical keys. However, most of these issues are solvable, albeit *NOT* in standard/basic Data Vault, but only when you see DV as one of many anchor-oriented modelling techniques. Then all of these issues can be resolved practically and (most of the time) elegantly, more or less depending on the situation.
For me, the total focus on *pure* DV hampers us in defining standard (and automatable!) approaches to solve these types of issues.

IMO the issue is that when you treat DV as a *recipe* you'll be forced to accept a *lot* of (often implied) data rules. Solving DV modelling in a more generic fashion will help you solve them without resorting to Raw DVs or other technical tricks. That said, I realise there is still a long way to go before this becomes an accepted practice.

Ronald Damhof

@Rob; I'm not sure I agree wholeheartedly with the business key/technical key analysis. It takes great effort to identify the REAL business keys (they are usually present, and they should be; otherwise the business might well have a problem too). @Rob; I know, reality is a bitch - see my last paragraph.

The advantages of discovering them and modelling them correctly in the DV are huge (see my other comment). Unfortunately I see the community retreat a bit and say: well, let's take the easy (lazy) road and use the technical keys. Alas, the Raw DV is born.

Now some nuance - and I agree with Rob in this respect:
Although the principal design rule should be "use business keys, unless", shit seems to happen ;-).

I am at the moment involved in a 'SAP to Data Vault' project; although it's hard, we do discover the business keys. And yes, we have situations where there is simply no other choice than to use the technical key (although this technical key is often a very evil business key ;)). The projects we did at the Tax Authority service (with Rob) also had issues where we had no choice but to use technical keys for hubs.


As you guys have stated: sometimes in a source system the business keys are also the technical keys, and sometimes the technical keys aren't business keys.

Our raw DV is, by design, modeled with technical keys as DV business keys.

Consider this situation: we have an entity, Person, that is registered in two systems. The business key is 'person number'. In system A, the person number is also the technical key; in system B it isn't. In other words, in system A the business key is in the hub, while in system B it's in the satellite.

In our approach, we have created a shared BDV hub for Person. It combines the two raw hubs into one person_hub, linking them on the business key (which, for system B, we look up in the satellite). No problem, this is working fine for us.
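A toy sketch of that shared hub, again with Python's sqlite3 and invented names (`hub_person_a`, `sat_person_b`, etc.): system A carries the business key in its hub, system B carries it in a satellite keyed by a technical id, and the shared hub is a view that looks the satellite key up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# System A: person_nr is both the technical and the business key, so it is in the hub.
cur.execute("CREATE TABLE hub_person_a (person_nr TEXT PRIMARY KEY)")
# System B: the hub holds a technical key; the business key sits in a satellite.
cur.execute("CREATE TABLE hub_person_b (tech_id INTEGER PRIMARY KEY)")
cur.execute("CREATE TABLE sat_person_b (tech_id INTEGER, person_nr TEXT)")
cur.executemany("INSERT INTO hub_person_a VALUES (?)", [("P1",), ("P2",)])
cur.executemany("INSERT INTO hub_person_b VALUES (?)", [(101,), (102,)])
cur.executemany("INSERT INTO sat_person_b VALUES (?, ?)",
                [(101, "P2"), (102, "P3")])
# Shared BDV hub: union system A's hub with system B's keys looked up in the satellite.
cur.execute("""
    CREATE VIEW person_hub AS
    SELECT person_nr FROM hub_person_a
    UNION
    SELECT s.person_nr FROM hub_person_b b
    JOIN sat_person_b s ON s.tech_id = b.tech_id
""")
rows = sorted(r[0] for r in cur.execute("SELECT person_nr FROM person_hub"))
print(rows)  # ['P1', 'P2', 'P3']
```

Person P2, known to both systems, collapses to a single row in the shared hub, which is the integration-on-business-key the thread is debating.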

Elwin Oost

I get the impression that we agree about most things:

* Yes, a source DV (I'll avoid "Raw" for now) should be skipped if not necessary. The closer you are to the source systems, the more likely it is you won't need it.
* Yes, there's a semantic gap to be bridged somewhere (though I disagree that the sDV leaves 100% of it still to do).
* No, it's not always possible to skip a source DV, depending on customer demands.

With wavering business requirements and business models, a long (influence/query) distance to the sources, and high time pressure, I stand by our Tax Authority sDV+CDW+ solution.

In a source DV you can already massage the model to be closer to the business model by creating links with the right granularity (unit of work). This can bring the model a lot closer to the business model (but of course this requires proper business analysis and not just the press of a button).

As the business DV only needs to contain entities for which this wasn't enough, it's not a doubling of the core DWH. Hopefully, again depending on the source, it's only a small addition.


