People familiar with my thinking know that I am a bit of a 'fundamentalist' when it comes to 'data'. I am the guy who pushes the non-sexy parts of data: data quality, data governance, metadata, data protection, data integration, semantics, rules, etc.
It is hard to stand your ground in a time when short-termism, technology fetishism and data populism are thriving. I see ‘data architectures’ in my industry that boil down to super-duper databases, ultrafast massively parallel hardware and, of course, huge amounts of software that seem to glorify ‘coding’ your way to the promised kingdom.
Call me old school, but I want (data) architectures to separate concerns on various levels (conceptual, logical and technical), dimensions (process, data, interaction) and aspects (law & regulation, people, organisation, governance, culture). Architectures should enable businesses to reach certain goals that (preferably) serve the customer, citizen, patient, student, etc.
Lately I have been studying the ‘datascience’ community, attempting to comprehend how they think, act & serve the common goals of an organisation. I have abandoned (just a bit :-)) my declarative nature of data modelling, semantics, data quality and governance, and I have drowned myself in Coursera courses, learning to code in Python, Julia and R. I dusted off the old statistics books I studied at uni and installed Anaconda, Jupyter Notebook, RStudio, Git, etc.
And oh my, it is soooo cool. Give me some data, I don’t care what it is, where it comes from or what exactly it means, and I can do something cool with it. Promise!
Now my problem…
- (1) It seems to me that the amount of ‘science’ in ‘datascience’ is, on average, extremely low to non-existent. Example: I have heard of ‘data science’ labs/environments where the code is not versioned at all & the data is not frozen in time; ergo, reproducibility is next to zero. Discovering a relationship between variables does not make it a proven fact; more is needed. Datascience is not equal to data analysis with R (or whatever), is it?
- (2) There seems to be a huge trust in the relevance and quality of data, wherever it comes from, whatever its context and however it is tortured. Information sits at the fabric of our universe [1], it’s life, it’s the real world. Data is the ‘stunted little brother’ of this ‘information’; it is humankind’s attempt to capture information, and a very poor one. Huge amounts of context are lost in this capturing. Attempting to retrofit ‘information’ from this ‘little brother’ called ‘data’ is dangerous and should be done with great care. Having these conversations with data scientists is hard, and we simply seem to disconnect completely.
- (3) There seems to be little business focus, little bottom-line focus. Datascientists love to ‘play’; they call it experimenting or innovating. I call it ‘play’ (if you are lucky they call their environment a ‘sandbox’, wtf?). Playing on company resources should be killed. Experiments (or innovations) start with a hypothesis, something you want to prove or investigate. You can fail, you can succeed, but you serve the bottom line (and yes, failing is serving the bottom line!) and the purpose/mission of an organisation. Datascientists seem to think they are done when they’ve made some fascinating machine learning, predictive or whatever model in their sandbox or other lab kind of environment. Getting this model deployed at scale in a production environment for ‘everyone’ to use, affecting the real world… that is where the bottom-line value really shines. You are not done until this is achieved.
- (4) There seems to be little regard for data protection aspects. The new GDPR (General Data Protection Regulation) is also highly relevant for datascience. Your ‘sandbox’ or your separated research environment needs to be compliant as well! The penalties for non-compliance are huge.
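To make point (1) a bit more tangible: below is a minimal sketch, in Python, of what 'versioned code and frozen data' could look like for a single analysis run. It is an illustration, not a recipe. The function names (`fingerprint`, `freeze_run`, `is_reproducible`), the manifest layout and the `code_version` argument are all hypothetical; in practice the code version would typically be something like a Git commit hash.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def freeze_run(data_path, code_version, manifest_path):
    """Record which data snapshot and which code version produced a result."""
    manifest = {
        "data_file": Path(data_path).name,
        "data_sha256": fingerprint(data_path),
        "code_version": code_version,  # assumption: e.g. a Git commit hash
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest


def is_reproducible(data_path, code_version, manifest_path):
    """True only if both data and code match what the manifest recorded."""
    recorded = json.loads(Path(manifest_path).read_text())
    return (
        recorded["data_sha256"] == fingerprint(data_path)
        and recorded["code_version"] == code_version
    )
```

If either the data file or the code version has changed since the manifest was written, `is_reproducible` returns `False`, which is exactly the situation in which a result can no longer be traced back to what produced it.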
There is huge value in datascience, its potential is staggering and it is soo much fun. But please, stop fooling around. This is serious business with serious consequences and opportunities for everyone and for probably every domain you can think of, whether it be automotive, banking, healthcare, poverty, climate control, energy, education, etc.
The ‘science of data’ and ‘datascience’ are the yin & yang of fulfilling the promise of being truly data-driven. Both are needed.
For my Data Quadrant Model followers: it is the marriage between quadrants I & II and quadrants III & IV.
[1] Increasingly, there is more and more 'evidence' originating from theoretical physics that this statement holds some truth, link [Dutch]. I would also like to draw attention to this blog post by John O'Gorman, very insightful, putting DIKW to rest.