There is a fundamental choice to be made when data is to be 'processed':
- a choice between consistency and availability, or
- a choice between work upstream and work downstream, or
- a choice between a sustainable (long-term) view and an opportunistic (short-term) view on data
Cryptic, I know. Let me explain myself a bit.
Let me take you on a short journey. Suppose we receive a dataset from <XXX>. In the agreement with <XXX> we state the structure, semantics and rules regarding the dataset. This might be shaped as a logical data model or - for communication's sake - it might just be natural language; I don't care, as long as the level of ambiguity is low. On the spectrum of consistency vs. availability, I choose a position skewed towards consistency.
If I choose consistency, I need to validate the data before it is processed, right? We have an agreement, and I am honoring this agreement by validating whether the 'goods' are delivered as agreed. In data, I like to validate the data against the logical data model. So, when the data violates the logical model, the data does not adhere to the agreement, agree?
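To make this a bit more tangible, here is a minimal sketch in Python of what such a validation gate could look like. The 'customer' fields and rules below are hypothetical; they are not taken from any real agreement.

```python
# Hypothetical rules derived from an agreed logical model (illustrative only).
REQUIRED_FIELDS = {"customer_id", "name", "birth_date"}

def validate_record(record: dict) -> list[str]:
    """Return the rule violations for a single record."""
    errors = []
    missing = REQUIRED_FIELDS - set(record)
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if not record.get("customer_id"):
        errors.append("customer_id must be non-empty")
    return errors

def validate_delivery(records: list[dict]) -> list[str]:
    """Validate a whole delivery; any violation means the agreement is broken."""
    return [f"row {i}: {err}"
            for i, record in enumerate(records)
            for err in validate_record(record)]

delivery = [
    {"customer_id": "C001", "name": "Alice", "birth_date": "1980-01-01"},
    {"customer_id": "", "name": "Bob"},  # violates the logical model
]

violations = validate_delivery(delivery)
if violations:
    # Consistency-first: reject the delivery and feed the violations back to <XXX>.
    print("Delivery rejected:", violations)
```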
"But, but, but...we need the data....", data scientists, BI developers, dashboard magicians are becoming worried. Oh my, now we don't have data.
What are the options? Simple: you give feedback to <XXX> that they need to solve this issue fast; it's a violation of the agreement.
"But, but, but...we can't ask them that, they never change it, we gotta deal with the data as we receive it".
Ah, so you want to process the data, despite the fact that it does not adhere to the logical model? So, what you are saying is that you want to slide to the right on the spectrum of consistency vs. availability? Fine, no problem, we will 'weaken' the logical model a bit, so the data survives validation, will be processed and will be made available.
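In practice, 'weakening' the model often means nothing more than relaxing or dropping a rule so the delivery survives validation. A small sketch, reusing the hypothetical birth_date rule from above (not a real agreement):

```python
# Rule from the original, agreed logical model (hypothetical): birth_date is mandatory.
def strict_rules(record: dict) -> list[str]:
    return [] if record.get("birth_date") else ["birth_date is mandatory"]

# Weakened model: the rule is dropped, so the delivery passes validation,
# gets processed and is made available. Every downstream user now has to
# decide for themselves what a missing birth_date means.
def weakened_rules(record: dict) -> list[str]:
    return []

record = {"customer_id": "C002", "name": "Carol"}  # no birth_date
print("strict:  ", strict_rules(record))    # ['birth_date is mandatory']
print("weakened:", weakened_rules(record))  # []
```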
But what happened here is crucial for any data engineer to fully comprehend. We chose to weaken the logical model, a model that correctly reflects the business. We chose to accept something that is broken. The burden of this acceptance shifts downstream, towards the users of the data. They need to cope with the problem now. What they can no longer say is 'oh crap, data dudes, you gotta fix this'; it is their problem now!
There is another aspect at play here, which might well be the most important one: data integration. The logical model ideally stems from a conceptual model (either formal or informal) that states the major concepts of the business, or in other words, future integration points. Logical models reuse these concepts to ensure proper integration of any data coming in. Suppose I have a dataset coming in where the data is so bad that the only unique key we can identify is the row number. We basically get a 'data lake' strategy: the logical model is one entity (maybe two) and the data is loaded pretty much as it was received. We are far towards the availability end of the consistency vs. availability spectrum. You probably guessed it; data integration (if at all possible) is pushed downstream.
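A tiny sketch of those two extremes (with hypothetical data): one entity keyed only by row number, versus a logical model that reuses the shared business concept customer_id as the integration point:

```python
raw_rows = [
    "C001;Alice;1980-01-01",
    ";Bob;",                   # no usable business key
    "C001;Alice;1980-01-01",   # duplicate? impossible to tell without a key
]

# Availability end of the spectrum: load as received, row number is the only key.
# Integrating this with other sources is pushed entirely downstream.
data_lake_entity = [{"row_number": i, "raw": row} for i, row in enumerate(raw_rows)]

# Consistency end of the spectrum: the logical model reuses the business concept
# 'customer_id' (an integration point from the conceptual model), so joining
# with other sources later is straightforward.
modeled_entity = {"C001": {"name": "Alice", "birth_date": "1980-01-01"}}

print(data_lake_entity[1])     # only addressable by row number
print(modeled_entity["C001"])  # addressable by the shared business concept
```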
Being able to push the slider towards consistency holds great value for the business, since it relieves (e.g.) the data scientists of the burden of endlessly prepping, mangling and cleaning data before they can actually do the work they were hired to do. But we also have to acknowledge the fact that sometimes (often) we do not have a choice and are forced to slide towards availability. It's not a bad thing! Be conscious about it though: you are introducing 'data debt' and somebody is paying the price. My advice: communicate that 'price' as clearly as you can....
And if you are forced into the realm of availability, set up your data governance accordingly. Are you able to set up policies in such a way that the slider - in time - gradually moves towards consistency? A great option is to keep the agreed-upon logical model (highly skewed towards consistency), but agree on a weakened version so you can process the data and make it available. However, you keep reporting back to <XXX> on the validation errors against the original logical model.
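A sketch of how that dual validation could work, again with the hypothetical rules from above and a stand-in process() step: the weakened model gates the processing, while the original model drives the feedback to <XXX>.

```python
def strict_rules(record: dict) -> list[str]:
    # Original agreed model (hypothetical rule): birth_date is mandatory.
    return [] if record.get("birth_date") else ["birth_date is mandatory"]

def weakened_rules(record: dict) -> list[str]:
    # Weakened model: the rule is dropped so the delivery can be processed.
    return []

def validate(records: list[dict], rules) -> list[str]:
    """Collect all violations of a rule set over a whole delivery."""
    return [f"row {i}: {err}"
            for i, record in enumerate(records)
            for err in rules(record)]

def process(records: list[dict]) -> None:
    """Stand-in for the actual processing / making-available step."""
    print(f"processing {len(records)} records")

delivery = [
    {"customer_id": "C001", "name": "Alice", "birth_date": "1980-01-01"},
    {"customer_id": "C002", "name": "Carol"},  # passes weakened, fails strict
]

# The weakened model decides whether the data gets processed (availability)...
if not validate(delivery, weakened_rules):
    process(delivery)

# ...while the original model produces the report back to <XXX>, so the
# slider can gradually move towards consistency over time.
print("Feedback to <XXX>:", validate(delivery, strict_rules))
```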
Final thought: let's be honest, we often tend to go for the easy way out: let's process all we receive and deal with it later. Our product owners and management are happy; yeah, we got data. But then reality kicks in: other sources have to be combined and the data is very hard to use because of its inconsistencies ("we seem to own cars with five wheels, is that correct?"). Let's buy some data-mangling tools and let each data scientist code their way out of the problem they are facing, increasing the data debt even more. My suggestion: make a conscious choice between consistency and availability - all the work upstream done in the name of consistency will - in the end - pay high returns. Resist the opportunistic urges....;-)
This post tries to describe a very subtle orchestration that is going on between data architecture, data governance, data processing and data quality.