'I made some Python code that really rocks, Ronald'
'It extracts data from various sources, validates it, does some cleansing, codes xxx business rules, lots of integration of course, executes some kind of predictive model, outputs the data and visualizes it'.
And then the magical words are uttered: let's deploy it to production…
Alas, the magic of data science ends abruptly. IT is blamed for not being agile, and architects are scorned for being too restrictive and even for killing innovation in the process.
Data science has a problem: it fails to operationalize its brilliant models, and therefore it fails to deliver value to the business. There, I said it. I know it's pretty polarizing, but I encounter it on a daily basis. Data science needs to grow up…
It's all about (1) the required quality of services and (2) separating concerns. Neither seems to be considered that important in data science. They should be!
Let me clarify:
Quality of services
Two use cases:
(a) deploying a risk model at scale (let's say 500K transactions per day) that evaluates a customer transaction in real time based on contra-information and, in the process, determines the level of supervision needed. Oh, and by the way: one has to take 'equality of rights' into account, since the organization is publicly owned.
(b) doing a one-time analysis on various sources, using advanced machine learning, where the output feeds a one-time policy-influencing decision.
The quality of services for (a) and (b) are like night and day. (a) needs to run at scale, in real time (direct feedback), using contra-information; provenance is hugely important; it is subject-based, so there are privacy concerns; it's an automated decision (there is heavy legal shit here); equality of rights requires metadata such as: what model did we use on which transaction, what data did we evaluate, … and many more.
(b) is a one-off… its output influences new policy or contributes to some insight. The quality of services might be that the model is versioned, properly annotated, and that the dataset is archived properly to ensure repeatability.
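To make the contrast concrete, here is a minimal sketch (hypothetical names throughout, assuming a Python deployment and a scikit-learn-style model) of the kind of provenance metadata use case (a) would have to record for every automated decision:

import json
import uuid
from datetime import datetime, timezone

MODEL_VERSION = "risk-model-2.3.1"  # hypothetical version identifier

def extract_features(transaction: dict) -> list:
    # Hypothetical feature extraction; a real model would use far richer contra-information.
    return [transaction["amount"], transaction["customer_age"]]

def score_transaction(transaction: dict, model) -> dict:
    """Score one transaction and record the provenance needed for 'equality of rights':
    which model, which data, which decision, and when."""
    features = extract_features(transaction)
    risk = float(model.predict_proba([features])[0][1])  # assumes a scikit-learn-style classifier

    decision = {
        "decision_id": str(uuid.uuid4()),
        "transaction_id": transaction["id"],
        "model_version": MODEL_VERSION,
        "features_evaluated": features,
        "risk_score": risk,
        "supervision_level": "enhanced" if risk > 0.8 else "regular",
        "scored_at": datetime.now(timezone.utc).isoformat(),
    }
    # Append-only audit log: every automated decision can be reconstructed afterwards.
    with open("decision_audit.jsonl", "a") as log:
        log.write(json.dumps(decision) + "\n")
    return decision

Nothing in this sketch is exotic, but little of it tends to exist in a one-off analysis like (b), and for (b) it usually does not have to.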
My point is: whenever you start on an analytic journey, establish the quality of services you require beforehand, as much as possible. And for that to happen you need a clear, explicit statement of how the required information product contributes to the bottom line. So yes: a proper portfolio management process, a risk-based impact assessment (!) and deployment patterns (architecture!) that are designed in advance!
With regard to data science it is vital to make a conscious choice, before you start, about the required quality of services. If these requirements are high, you might want to work closely with system engineers, data modelling experts, rule experts, legal experts, etc. Only then might you be able to deploy stuff and generate the value the field of data science promises us.
Separation of concerns
For those who do not know what 'separation of concerns' means, start with Wikipedia or google Edsger Dijkstra, one of the greatest (Dutch) computer scientists…
Anything IT-related suffers from the 'overloading of concerns' issue. Some examples:
- XBRL is a great standard, but it suffers from overloading: integrity, validation, structure, meaning and presentation concerns are all bundled into one technical exchange format.
- Data Vault is a great technical modeling paradigm, but it does not capture logical, linguistic or semantic concerns, and yet the data modelling community keeps trying.
- ArchiMate is a great modeling notation in the Enterprise Architecture arena, so why is it overloaded with process concerns? BPMN is a much better choice for those.
And of course we see it in code, and we have seen it for ages in all programming languages: the human tendency to solve every challenge with the tool one is most familiar with or trained in. Data science is no different. Failing to separate concerns lies at the root of many software problems concerning maintainability, transparency, changeability, performance, scalability and many, many more.
Take the example I started this blog with:
A brilliant Python script in which a staggering number of concerns are all dealt with. That might not be a problem when the required quality of services is not that high. But when the required quality of services is high, it becomes painfully clear that 'deploying' this code to production is a fantasy.
Extraction, validation, cleansing and integration concerns might be better dealt with by making use of tools and techniques in the information (modeling) arena.
Business rules might be better off designed (for example) by means of RuleSpeak, which also makes them more transparent to legal people and domain experts (which is, by the way, a huge concern, especially in AI!).
Visualization and presentation might be better off handled by tools your organization has already purchased, be it Tableau, SAS Visual Analytics, Qlik, Tibco or whatever.
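To illustrate what separating those concerns can look like in code, here is a minimal sketch (hypothetical module and function names; the real split depends on your landscape and toolset). Instead of one script doing everything, each concern sits behind its own narrow interface, so it can be owned, tested and replaced independently:

# pipeline.py: hypothetical orchestration layer; each concern lives behind its own function or module.
import pandas as pd

# Extraction / integration concern (could equally be handled by an ETL or information-modeling tool).
def extract(source_path: str) -> pd.DataFrame:
    return pd.read_csv(source_path)

# Validation / cleansing concern.
def validate(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna(subset=["amount"]).query("amount >= 0")

# Business-rule concern (ideally specified with domain and legal experts, e.g. in RuleSpeak).
def needs_review(row: pd.Series) -> bool:
    return row["amount"] > 10_000

# Model concern (assumes a scikit-learn-style model).
def score(df: pd.DataFrame, model) -> pd.Series:
    return pd.Series(model.predict(df[["amount"]]), index=df.index)

# Orchestration: the only place where the concerns meet.
def run(source_path: str, model) -> pd.DataFrame:
    df = validate(extract(source_path))
    df["review"] = df.apply(needs_review, axis=1)
    df["score"] = score(df, model)
    return df  # presentation/visualization is handled elsewhere (Tableau, Qlik, ...)

The point is not this particular decomposition; it is that each concern can now be handed to the discipline, tool or team best suited to it.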
Finally
Dear data scientist, stop blaming IT, architects or whoever else for the fact that your brilliant code is not being deployed to production. Instead, reflect on your own role and your own part in the total supply chain of things that need to happen to actually get code working in production at the required quality of services.
Dear organization, stop blaming data scientists for not delivering the value that was promised. Start organizing, consciously, the operationalization of data science. It is not a walk in the park; it is an assignment that requires an extremely broad skill set, a holistic view, cooperation and, of course, attention to human behavior and nature.
And the majority of these challenges fall to management!
Starting a data science team or department without organizing the operationalization is a waste of resources.
Operationalization of data science IS NOT a technical problem!
For my data quadrant fans: it is all about transitioning from quadrant IV to quadrant II.