Why the Data Mesh can’t and shouldn’t replace data integration

A few weeks ago, I wrote that Self-service analytics isn’t everyone’s dream and a few days later I discussed the content with my friend Alessio.
It turned out that one of the points I was trying to make didn’t come out as sharply as I believed, and this post is to elaborate on it and make it clearer.

  1. Data self-service is more feasible and cost-effective with a proper Data Mesh than with most 1st generation data lakes.
  2. Adopting the Data Mesh model doesn’t imply that data integration efforts should be halted.
    I see at least three reasons why data integration initiatives should continue:
    a) The cost of finding, understanding, and processing data from multiple sources is lower with the Data Mesh, but when the same activity is repeated by hundreds or thousands of employees, the cumulative cost becomes greater than the cost of integrating the data once for everyone.
    b) When each person or small team integrates the data autonomously, the likely outcome is a Babel of numbers that don’t match those created by other teams, and the organization ends up with a modern edition of the spread-mart proliferation of the past.
    Until the results are pasted into a PowerPoint document, this modern version removes the problem of stale data, but it’s a small consolation.
    c) Truly federated analytical queries can’t guarantee, at least today, the low latency needed in many business scenarios.
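
Point a) is a break-even argument, and a rough sketch in Python makes it concrete. Everything here is an assumption for illustration: the function name, the parameters, and the figures in the usage note are hypothetical and not measurements from any client.

```python
import math
from typing import Optional


def break_even_repetitions(
    integration_cost: float,
    self_service_cost_per_use: float,
    integrated_cost_per_use: float = 0.0,
) -> Optional[int]:
    """Smallest number of repetitions at which integrating once is cheaper.

    Self-service total:  n * self_service_cost_per_use
    Integrated total:    integration_cost + n * integrated_cost_per_use
    All inputs are hypothetical figures in the same currency unit.
    """
    saving_per_use = self_service_cost_per_use - integrated_cost_per_use
    if saving_per_use <= 0:
        # Integration never pays off if each use is not cheaper afterwards.
        return None
    # First integer n with n * saving_per_use strictly greater than the one-off cost.
    return math.floor(integration_cost / saving_per_use) + 1


if __name__ == "__main__":
    # Hypothetical: 10,000 to integrate once, 50 per self-service use, 10 per use afterwards.
    print(break_even_repetitions(10_000, 50, 10))
```

With these made-up numbers the one-off integration wins from the 251st repetition onward, a threshold that hundreds or thousands of employees repeating the same work would cross quickly.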

In the previous paragraph I mentioned the cost of producing valuable knowledge from data, but I didn’t dig into the details of computing that cost.
I tried to come up with a nice construct for it, but all I can offer so far is a small WIP to hopefully start a little brainstorm.
The cost of knowledge production (CoKP) is a function of computing costs and human effort.
– The computing component appears relatively simple, but paying attention to the efficiency of the data processing is key when queries are executed at scale.
I’ve seen more than 100K analytical queries per day in production environments at several clients: this is what happens when an analytical platform is successful in generating value. It’s better to plan ahead than to be forced to rethink the architecture by skyrocketing costs, potentially losing momentum and motivation in the process.
– In the human component I’d include the time needed to find the data, get access, understand what the specific data means, assess its fit for the specific purpose, and process the data to extract the knowledge.
I know that cost in itself is not enough to evaluate an investment, but estimating the return of an investment in data transformation is complex and I don’t dare take a stab at it here.
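
To feed the brainstorm, here is one possible way to write the CoKP down, as a minimal Python sketch. The additive form, the class and function names, the hourly rate, and the per-query cost are all assumptions of mine; the only things taken from the text above are the two components and the five human activities.

```python
from dataclasses import dataclass


@dataclass
class HumanEffort:
    """Hours spent on each human activity (all values hypothetical)."""
    find_hours: float          # finding the data
    access_hours: float        # getting access
    understand_hours: float    # understanding what the data means
    assess_fit_hours: float    # checking fit for the specific purpose
    process_hours: float       # processing the data to extract knowledge

    def total_hours(self) -> float:
        return (
            self.find_hours
            + self.access_hours
            + self.understand_hours
            + self.assess_fit_hours
            + self.process_hours
        )


def cokp(effort: HumanEffort, hourly_rate: float,
         queries: int, cost_per_query: float) -> float:
    """Cost of knowledge production = human cost + computing cost.

    A simple additive model is assumed here; a real one may need
    per-query cost tiers, storage, or shared-infrastructure allocation.
    """
    human_cost = effort.total_hours() * hourly_rate
    computing_cost = queries * cost_per_query
    return human_cost + computing_cost


if __name__ == "__main__":
    effort = HumanEffort(2, 1, 3, 1, 4)   # 11 hours in total
    print(cokp(effort, hourly_rate=80, queries=4_000, cost_per_query=0.25))
```

Even in this toy form, the computing term scales linearly with the number of queries, which is why query efficiency matters so much at high daily volumes.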

I’d like to take the opportunity to follow up on a couple of comments related to this subject that I’ve received recently:
– My friend Dan suggested that IT should be the group doing the data integration.
I’m not sure it has to be IT; they might have better visibility of the overall data landscape and thus be in a better position than other organizations, but I’d say that today there should be a chief data officer (CDO) to lead the data integration initiative.
Gaetano’s reply to my comment a week ago seems to go in the same direction, seeing IT as the agent enabling cross-domain pollination.
– Andrea kindly pointed out an improper use of the term “insight” on my side.
I’m pretty sure my former colleague Dirk would completely agree with him, and I’ll do my best to be more rigorous with the terms I use.

As usual…
