People familiar with my thinking know that I am a bit of a 'fundamentalist' when it comes to 'data'. I am the guy who pushes the non-sexy parts of data: data quality, data governance, metadata, data protection, data integration, semantics, rules, etc.
It is hard to stand your ground in a time where short-termism, technology fetishism and data populism are thriving. I see 'data architectures' in my industry that boil down to super-duper databases, ultrafast massively parallel hardware and, of course, huge amounts of software that seem to glorify 'coding' your way to the promised kingdom.
Call me old school, but I want (data) architectures to separate concerns on various levels (conceptual, logical and technical), dimensions (process, data, interaction) and aspects (law & regulation, people, organisation, governance, culture). Architectures should enable businesses to reach certain goals that (preferably) serve the customer, citizen, patient, student, etc.
Lately I have been studying the 'datascience' community, attempting to comprehend how they think, act and serve the common goals of an organisation. I have abandoned (just a bit :-)) my declarative nature of data modelling, semantics, data quality and governance, and I have drowned myself in Coursera courses, learning to code in Python, Julia and R. Dusting off the old statistics books from my university days, I installed Anaconda, Jupyter Notebook, RStudio, Git, etc.
And oh my, it is soooo cool. Give me some data, I don't care what it is, where it comes from or what exactly it means, but I can do something cool with it. Promise!
Now my problem…
- (1) It seems to me that the 'science' in 'datascience' is on average extremely low to non-existent. Example: I have heard of 'data science' labs/environments where the code is not versioned at all and the data is not temporally frozen; ergo, reproducibility is next to zero. Discovering a relationship between variables does not make it a proven fact; more is needed. Datascience is not equal to data analysis with R (or whatever), is it? (A minimal sketch of the reproducibility discipline I mean follows after this list.)
- (2) There seems to be a huge trust in the relevance and quality of data, wherever it comes from, whatever its context and however it is tortured. Information sits at the fabric of our universe¹; it is life, it is the real world. Data is the 'retarded little brother' of this 'information': it is humankind's attempt to capture information, and a very poor one. Huge amounts of context are lost in this capturing. Attempting to retrofit 'information' from this 'retarded brother' called 'data' is dangerous and should be done with great care. Having these conversations with data scientists is hard, and we simply seem to disconnect completely.
- (3) There seems to be little business focus, little bottom-line focus. Datascientists love to 'play'; they call it experimenting or innovating. I call it 'play' (if you are lucky they call their environment a 'sandbox', wtf?). Playing on company resources should be killed. Experiments (or innovations) start with a hypothesis, something you want to prove or investigate. You can fail, you can succeed, but you serve the bottom line (and yes, failing serves the bottom line too!) and the purpose/mission of an organisation. Datascientists seem to think they are done when they have made some fascinating machine-learning, predictive or whatever model in their sandbox or other lab-kind-of environment. Getting that model deployed at scale in a production environment for 'everyone' to use, affecting the real world... that is where the bottom-line value really shines; you are not done until this is achieved. (A small hypothesis-driven example follows after this list.)
- (4) There seems to be little regard for data protection aspects. The new GDPR (General Data Protection Regulation) is also highly relevant for datascience. Your 'sandbox' or your separate research environment needs to be compliant as well! The penalties for non-compliance are huge. (A sketch of pseudonymising data before it enters a sandbox also follows after this list.)
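To illustrate point (1): even a minimal reproducibility discipline goes a long way. A sketch of what I mean, in Python; the file names, paths and values are hypothetical, not a prescription:

```python
# Minimal reproducibility bookkeeping for an analysis run (illustrative sketch).
# File names and paths are hypothetical.
import hashlib
import json
import random
import subprocess
from datetime import datetime, timezone

DATA_FILE = "customers_2017-01-15.csv"   # a frozen snapshot, not a live table
SEED = 42

def sha256_of(path: str) -> str:
    """Fingerprint the exact data snapshot used in this run."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

random.seed(SEED)  # deterministic sampling / shuffling

run_manifest = {
    "run_at": datetime.now(timezone.utc).isoformat(),
    "data_file": DATA_FILE,
    "data_sha256": sha256_of(DATA_FILE),
    "random_seed": SEED,
    # the exact code version, so the analysis can be re-run as it was
    "git_commit": subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip(),
}

with open("run_manifest.json", "w") as f:
    json.dump(run_manifest, f, indent=2)
```

With the data fingerprint, the seed and the code version written down next to the result, somebody else can at least re-run the analysis as it was.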
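To illustrate point (3): an experiment that starts from a hypothesis can be as small as this sketch. The column names, the A/B variants and the significance level are assumptions of mine:

```python
# Hypothesis-driven experiment in miniature (illustrative sketch).
# H0: the new variant does not change the average order value.
import pandas as pd
from scipy import stats

orders = pd.read_csv("orders_snapshot.csv")            # hypothetical frozen snapshot
control = orders.loc[orders["variant"] == "A", "order_value"]
treatment = orders.loc[orders["variant"] == "B", "order_value"]

t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the effect is unlikely to be chance; worth taking towards production.")
else:
    print("No evidence of an effect; the experiment 'failed' but still served the bottom line.")
```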
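And to illustrate point (4): a minimal sketch of pseudonymising direct identifiers before data reaches a sandbox. The column names and the keyed-hash approach are assumptions of mine, and real GDPR compliance obviously involves much more (lawful basis, retention, access control, and so on):

```python
# Pseudonymising direct identifiers before data enters a sandbox (illustrative sketch).
import hashlib
import hmac
import pandas as pd

SECRET_KEY = b"store-me-in-a-vault-not-in-code"   # placeholder

def pseudonymise(value: str) -> str:
    """Keyed hash: stable across files (so joins still work), not reversible without the key."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

patients = pd.read_csv("patients.csv")                          # hypothetical source extract
patients["patient_id"] = patients["patient_id"].astype(str).map(pseudonymise)
patients = patients.drop(columns=["name", "email", "phone"])    # drop direct identifiers
patients.to_csv("patients_sandbox.csv", index=False)
```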
There is huge value in datascience; its potential is staggering and it is so much fun. But please, stop fooling around. This is serious business, with serious consequences and opportunities for everyone and for probably every domain you can think of, whether it be automotive, banking, healthcare, poverty, climate control, energy, education, etc.
The 'science of data' and 'datascience' are the yin and yang of fulfilling the promise of being truly data-driven. Both are needed.
For my Data Quadrant Model followers: it is the marriage of quadrants I & II with quadrants III & IV.
¹ Increasingly, there is 'evidence' originating from theoretical physics that this statement holds some truth, link [Dutch]. I would also like to draw attention to this blogpost by John O'Gorman, very insightful, putting DIKW to rest....
I see you are stepping into one of those other worlds in Information & Communication Technology.
So maybe the time is ready for a new data model. Proposal: call it OO-LC. It is CDC-based, but the focus is the Object (document container, name it whatever) with its lifecycle. Nothing in the object itself is changed; it only needs a unique object identifier, the date/time of the change and the date/time from which it is valid. To double-check incoming data you can include the complete before and after images in the change-data message.
That unique object identifier and those two time indicators, preferably together with the source origin (for traceability), are propagated to all logical records/tables extracted from the object.
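A rough sketch of what such a change-data message could look like; the field names are my own guess:

```python
# Rough sketch of an OO-LC change-data message (field names are my own guess).
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ObjectChange:
    object_id: str                 # unique object identifier
    change_ts: datetime            # date/time the change was recorded
    valid_from_ts: datetime        # date/time from which this version is valid
    source: str                    # origin of the data, for traceability
    before: Optional[dict]         # complete image before the change (None for inserts)
    after: Optional[dict]          # complete image after the change (None for deletes)

# Example: a new version of a document arriving from the source system
change = ObjectChange(
    object_id="DOC-12345",
    change_ts=datetime(2017, 1, 22, 9, 30),
    valid_from_ts=datetime(2017, 1, 20, 0, 0),
    source="notary-feed",
    before={"owner": "A"},
    after={"owner": "B"},
)
```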
The question then is what data the business or analytics needs for a specific goal (a sketch of both options follows the list):
- The latest situation, or some other moment in time reflecting that temporal status (time travelling is possible).
- The events of all changes to selected objects, maybe in a dedicated clustered approach for some association analyses and/or time-series event analyses.
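Sketched on such a change log, assuming a table with one row per object version (object_id, valid_from_ts, extracted attributes):

```python
# Time travel and event selection on an OO-LC style change log (illustrative sketch).
import pandas as pd

versions = pd.read_csv("object_versions.csv", parse_dates=["valid_from_ts"])

# 1) The state of all objects as of a chosen moment (time travelling)
as_of = pd.Timestamp("2017-01-01")
state_as_of = (
    versions[versions["valid_from_ts"] <= as_of]
    .sort_values("valid_from_ts")
    .groupby("object_id")
    .tail(1)                       # latest version known at that moment
)

# 2) All change events for selected objects, e.g. for time-series or association analysis
selected = versions[versions["object_id"].isin(["DOC-12345", "DOC-67890"])]
events = selected.sort_values(["object_id", "valid_from_ts"])
```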
There are several ways to look at that temporal dimension: building ABTs (analytical base tables), classic BI or even operational information.
For old-fashioned BI, OLAP included, the needed tables can be built from this. There will be a lot of duplication and denormalized data, as usual.
This OO-LC breaks with the rigid culture of data modelling in either facts (quadrant I) or context (quadrant II). It can serve research far better, as I described with the intermediate steps towards an ABT.
When there is a valid LC (lifecycle), a time series (ARIMA) with events on the content is reliable. Finding gaps signals issues, but the document content can still be valid. That removes the contradiction between consistency and availability.
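Signalling such gaps can be as simple as looking at the spacing between versions; a sketch, with the threshold as my own assumption:

```python
# Signalling gaps in one object's lifecycle (illustrative sketch).
import pandas as pd

versions = pd.read_csv("object_versions.csv", parse_dates=["valid_from_ts"])
one_object = versions[versions["object_id"] == "DOC-12345"].sort_values("valid_from_ts")

deltas = one_object["valid_from_ts"].diff()
gap_threshold = pd.Timedelta(days=45)   # assumption: roughly monthly updates expected
gaps = one_object.loc[deltas > gap_threshold, "valid_from_ts"]
print(f"{len(gaps)} suspicious gap(s), versions arriving after a long silence:")
print(gaps.to_list())
```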
Do not try to close something that turns left by screwing to the right (or vice versa). Data analytics turns left where ICT is used to turning right. Changing direction is only possible at approved, agreed moments.
Posted by: Ja Karman | Sunday, January 22, 2017 at 01:51 AM
Jaap, Thx for the comment!
I cannot really grasp what you are saying, and that makes it interesting. Tbh, I already halted at the very first sentence: the term 'document container'. Please define this term as precisely as you possibly can on a logical basis (no technical implementations, please).
To me it sounds very ambiguous....:-)
Posted by: Ronald | Sunday, January 22, 2017 at 02:25 AM
I see a document container object as the unit containing all related information inside it: 'the basic unit' of information.
To be honest, this comes from a real-life experience, so an example could help.
The description of ownership of some property can be very complicated.
Through actions of the notary it can all be made official. That notary document has a complicated ER relationship with many 0-n's. Expect every possibility to occur in real data. One thing is certain: what is made official once cannot be reverted in any other way than by a new event. Every event of document change is information in its own right.
Just see that notary document as an example of the "document container". By now I am seeing some 70 technically different tables (types of information) being extracted, somehow to be joined later, for ease of use, into about 30. The first concern is not how to manage the information inside those documents but the documents as a whole.
Working on this topic, I am seeing it as a far more generic approach.
It is a very common way of processing information, in ways humans already know.
Posted by: Ja Karman | Sunday, January 22, 2017 at 08:12 AM
I see the resemblance to the Damhof quadrants in the figure "big data reference architecture". It is a good blog by Xomnia.
http://www.xomnia.com/expertise/big-data-engineering/big-data-architecture/
Now take the middle part, which is also the conflict area in your quadrants.
Break it apart after staging, in the middle of the DV (Data Vault) / data lake.
Take the advantages of the DV, but remove any business structure and add a control that checks technical input completeness (a minimal sketch of such a control follows below).
That serves both the old BI (IT-push, restricted design) and the freer ABTs going into analytics.
Nice: even the shortcut arrow for getting data into analytics is there.
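A minimal sketch of such a technical input-completeness control; the names and counts are purely invented:

```python
# Technical input-completeness control on staged data (illustrative sketch).
import pandas as pd

staged = pd.read_csv("staged_changes.csv")

# The source system announces how many change records it sent per object type
announced = {"notary_document": 70421, "party": 188302}

received = staged.groupby("object_type").size().to_dict()

for object_type, expected_count in announced.items():
    got = received.get(object_type, 0)
    status = "OK" if got == expected_count else "INCOMPLETE"
    print(f"{object_type}: expected {expected_count}, received {got} -> {status}")
```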
Posted by: Ja Karman | Saturday, January 28, 2017 at 05:14 AM
Jaap,
Your example is quite confusing; you are talking about a document as a concept, which is perfectly fine, but how is the context of such a document identified? Time? Parties involved? Object (real estate, ...)? And how is this context identified (since the desire to integrate is high in analytics)? I think you are mixing up (at least) logical and technical concerns in your example (70 'tables' sounds pretty technical), which makes it confusing what you actually want to accomplish.
If you execute a pure document approach - which is not a bad thing, depending on the use case - the context is hidden in the document and often not explicitly stated. This could be OK when documents are somewhat standardized, but they are - by definition (unless we are talking XBRL-type docs) - not. So I have tons of documents with huge variety and I want to do some analytics... good luck on ya.
R.
Posted by: Ronald | Sunday, January 29, 2017 at 06:22 AM
Ronald, take this one as an example. It is the ER relationship of all kinds of information, with all kinds of legal-rights events you make official at the notary (paperwork - document). https://www1.kadaster.nl/1/imkad/documentatie/20120508/index.htm
XSD and fully, highly structured, although still complicated.
By nature, this specific one will have a full description with every update, not only partial updates referring to previous ones.
Where is the issue you are seeing?
Combine this https://www.linkedin.com/pulse/big-data-reference-architecture-martijn-imrich?forceNoSplash=true with your quadrant model: the middle area is the conflict zone. The others fit nicely in place according to your quadrants. Even the shortcut from upper left down to the right is there.
Posted by: Ja Karman | Monday, January 30, 2017 at 12:14 PM
Marvelous: We ‘the architects’ should stand firm in stating the message ‘this is complicated, we can do it, but I am not sure how’.
Posted by: Ja Karman | Monday, January 30, 2017 at 12:32 PM
The data and analysis allow us to make informed decisions and to stop guessing. I was never fond of making decisions based on gut feeling, perhaps because the gut says one thing one day and something very different the next day. To be honest, this is a real-life experience.
Posted by: mounika | Sunday, April 09, 2017 at 11:33 PM