Why the Data Mesh can’t and shouldn’t replace data integration

A few weeks ago, I wrote that Self-service analytics isn’t everyone’s dream and a few days later I discussed the content with my friend Alessio.
It turned out that one of the points I was trying to make didn’t come out as sharply as I believed, and this post is to elaborate on it and make it clearer.

  1. Data self-service is more feasible and cost-effective with a proper Data Mesh than with most 1st generation data lakes.
  2. Adopting the Data Mesh model doesn’t imply that data integration efforts should be halted.
    I see at least three reasons why data integration initiatives should continue:
    a) The cost of finding, understanding, and processing data from multiple sources is lowered when adopting the Data Mesh, but when the same activity is repeated multiple times by hundreds or thousands of employees the cumulative cost will become greater than the cost of integrating the data once for everyone.
    b) When each person or small team integrates the data autonomously the potential outcome is a Babel of numbers that don’t match those created by other teams and the organization ends up with a modern edition of the spread-mart proliferation of the past.
    Until the results are incorporated in a PowerPoint document this modern version removes the problem of the stale data, but it’s a small consolation.
    c) Truly federated analytical queries can’t guarantee, at least today, the low latency needed in many business scenarios.
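
Point (a) can be made concrete with a back-of-the-envelope break-even calculation. The sketch below is only an illustration: the function name and the figures in the usage note are assumptions of mine, not measurements from any client.

```python
import math


def break_even_repetitions(central_cost: float, self_service_cost: float) -> int:
    """Return the smallest number of repeated self-service integrations whose
    cumulative cost is strictly greater than integrating the data once.

    Both costs are hypothetical inputs expressed in the same currency.
    """
    if central_cost < 0 or self_service_cost <= 0:
        raise ValueError("central_cost must be >= 0 and self_service_cost > 0")
    # n * self_service_cost > central_cost  <=>  n > central_cost / self_service_cost
    return math.floor(central_cost / self_service_cost) + 1
```

With an assumed central integration cost of 120,000 and an assumed 400 per self-service repetition, `break_even_repetitions(120_000, 400)` gives the number of repetitions after which integrating once for everyone becomes the cheaper option.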

In the previous paragraph I mentioned the cost of producing valuable knowledge from data, but I didn’t dig into the details of computing that cost.
I tried to come up with a nice construct for it, but all I can offer so far is a small WIP to hopefully start a little brainstorm.
The cost of knowledge production (CoKP) is a function of computing costs and human effort.
– The computing component appears relatively simple but paying attention to the efficiency of the data processing is key when queries are executed at scale.
I’ve seen >100K analytical queries per day in production environments at several clients: this is what happens when an analytical platform is successful in generating value. It’s better to plan ahead of time than to be forced to re-think the architecture by skyrocketing costs, potentially losing momentum and motivation in the process.
– In the human component I’d include the time needed to find the data, get access, understand what the specific data means, understand the fit for the specific purpose, and process the data to extract the knowledge.
I know that cost in itself is not enough to evaluate an investment, but estimating the return of an investment in data transformation is complex and I don’t dare take a stab at it here.
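
To help the brainstorm along, here is a minimal sketch of the CoKP as code. It assumes the simplest possible form, a plain sum of the computing cost and the cost of the human effort; the additive shape, the names, and the per-query and hourly rates are all hypothetical and open to debate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HumanEffort:
    """Hours spent on each human activity listed above (hypothetical inputs)."""
    find_hours: float
    access_hours: float
    understand_meaning_hours: float
    assess_fit_hours: float
    process_hours: float

    def total_hours(self) -> float:
        return (
            self.find_hours
            + self.access_hours
            + self.understand_meaning_hours
            + self.assess_fit_hours
            + self.process_hours
        )


def cost_of_knowledge_production(
    queries: int,
    cost_per_query: float,
    effort: HumanEffort,
    hourly_rate: float,
) -> float:
    """CoKP under the assumption that it is computing cost plus human cost."""
    computing_cost = queries * cost_per_query
    human_cost = effort.total_hours() * hourly_rate
    return computing_cost + human_cost
```

The point of splitting the human effort into its five activities is that each one can be reduced by a different investment: a catalog shortens the finding, good documentation shortens the understanding, and integrated data shortens the processing.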

I take the opportunity to follow up on a couple of comments related to this subject that I’ve received recently:
– My friend Dan suggested that IT should be the group doing the data integration.
I’m not sure it has to be IT; they might have better visibility of the overall data landscape and thus be in a better position than other organizations, but I’d say today there should be a chief data officer (CDO) to lead the data integration initiative.
Gaetano’s reply to my comment a week ago seems to go in the same direction of seeing IT as the agent enabling cross-domain pollination.
– Andrea kindly pointed out an improper use of the term “insight” on my side.
I’m pretty sure my former colleague Dirk would completely agree with him, and I’ll do my best to be more rigorous with the terms I use.

As usual…

Self-service analytics is not everyone’s dream

According to the recent messaging of many players in the data space, self-service analytics is the next-big-thing in the data space, the end-users want it, and the organizations should acquire as quickly as possible the technologies needed to deliver it.

I beg to differ on the starting assumption: no matter which definition of “self-service analytics” is adopted (and there are many), what the business users really want is access to the information contained in the organization’s data in a time-frame that makes said information usable to drive a business decision.

The definition of “self-service analytics” is not standardized and this can cause misunderstandings when discussing the subject.
The spectrum goes from a relatively conservative “self-service reporting capability”, where the data is integrated by the IT team (for example Gartner’s definition and usually the BI tool vendors’), to a more modern “data-mesh-enabled self-service-everything-data” where simpler-to-use tools and standard interfaces are made available and allow the end-users to integrate the data as they deem appropriate without having to wait for the IT team to perform the task (TDWI’s and Snowflake’s definitions lean more in this direction).

Self-service analytics, in the modern definition, might today be the best way to get to the desired timely insights from data given the current technological and organizational landscapes (i.e. without waiting a few months for the IT data integration process to complete).
“Best today” is not synonymous with “great” and I believe that a generalized shift of the end-users toward modern self-service analytics should not be advocated.
Organizations should keep investing toward a truly integrated data landscape.
The fact that data self-help can now be performed using big data technologies in a cloud environment doesn’t make the practice substantially different, in terms of risks, from the Excel-based “spread-marts” of the past. Even with the modern tools it is still possible for different people to apply the same label to data that is integrated, filtered, and aggregated in different ways at different points in time, resulting in siloed, mismatching data marts.

The division of work and specialization is what enabled the standards of living to improve at an increasingly fast pace when humans started to leverage them at scale.
Mass production drove up the quality and volume of goods produced per unit of time (and money) in exchange for a longer setup time of the production line and a reduced space for customization (“Any customer can have a car painted any colour that he wants so long as it is black” is a famous quote attributed to Henry Ford), but production techniques have evolved over time and, remaining in the car domain, today’s production gives us a degree of flexibility in designing our vehicle that Ford couldn’t even dream of.
This evolution didn’t completely destroy the demand for artisanal work and custom realizations, but made these realizations the exception rather than standard, and the prices for high-quality, personalized solutions are significantly higher.

When thinking about self-service analytics and data democratization, organizations should be careful to properly model the associated hidden costs.
It is easy to quantify the cost of a centralized data modeling and engineering team, but it is hard to model the costs of self-service analytics practices.
We should have learned from the experiments with the Hadoop-based data lakes that the apparently cheap schema-on-read approach resulted either in an increase of costs, as multiple people had to figure out the same data integration over and over, or in data swamps that no one wanted or could use.

There is no doubt that self-service analytics is more cost-effective with a proper data mesh (i.e. one that has real data products and is based on an enterprise-wide domain-driven design) than with the first-generation data lakes, but this doesn’t mean that the specialized, factory-like creation of integrated data (products), which takes longer to build, should be abandoned.
While it’s great that today I could get the material needed to quickly replace the shattered glass of my table delivered at home with a few clicks, without having to lease a small truck and go to the shop like a decade ago, it doesn’t mean that now I should build my own kitchen.
Organizations should strive to have a faster, more flexible, mass production of high-quality insights and augment it, rather than replace it, with an efficient and effective delivery of raw data.
Just like I should buy my next kitchen here again rather than try to build it myself.

Replacing the word “source” with the word “product” is not enough to change the reality of your data.

I recently shared my perspective about the Data Mess vs. Data Mesh.
Shortly afterward I was chatting with my friend Luca, asking for his feedback, and his main comments were along the lines of “it’s all good, but it’s very hard to find an effective mechanism to reward and incentivize the change”.
In this post I’m sharing my view on this challenge.

Citing Zhamak Dehghani’s original work, I highlighted that “Data as a product” is one of the pillars of the Data Mesh.
To reap the benefits promised by the data mesh organizations then feel urged to morph their data sets into data products.

Unfortunately there are many definitions of product and they are not equally useful to transform the organization and create a real, useful data mesh.

One definition of product is: “a thing that is the result of an action or process”.
This definition is the easiest to apply and also the most dangerous.
It makes it possible to quickly and automatically label every existing data source as a “data product” without the need to change anything in the existing processes.
It is a sure guarantee that the data mess will remain in place for the years to come with data remaining a by-product of the business processes rather than a “real” product.
Just like today, but with the trendy label.

A definition of product much more useful to incentivize the organizational change is the following: “an article or substance that is manufactured or refined for sale”.
The key part is “for sale” because it implies the existence of a historically strong product improvement driver: money changing hands and increasing a producer’s wealth in the process.

Many organizations have created, or are in the process of creating, a “data marketplace” to facilitate data product exchanges.
Unfortunately a lack of general agreement about what a “data marketplace” should be can lead to the creation of something slightly, but significantly different: a data catalog.
In recent implementations, hopefully, the catalog is paired with a set of tools to self-service the data access and/or transfer (I’ll come to the subject of transfer vs. access in a future post).
This kind of data marketplace works nicely with the first (and lesser) definition of a data product but does not fully support the value creation expected from the adoption of the second definition.

The data marketplace that supports continuously improving data products is something slightly, but significantly different.
On top of the functional characteristics of the basic marketplace I listed earlier it enables a low friction exchange of (data) goods for a certain amount of an agreed currency.

The technical means to easily move money around are many, well known, and broadly available.
The tricky part is, once again, an organization and people problem: defining the “certain amount” of currency that should change hands.

Who sets the price of the data products and how?

Being fundamentally Austrian in my vision of economy my first answer was: the free market!
Unfortunately this is a bad approach for the data market because the producers of raw data (I’ll tentatively blog about raw data/data by-products in the future) are in most cases natural monopolies: having only monopolists setting the prices would immediately lead to a complete failure of the marketplace.
Centrally regulated prices appear to be the only option.

Historically, centralized economies have been trailing free economies in terms of wealth generation and this can be concerning: how could we prevent the same from happening in our data marketplace?
The special nature of digital goods compared to physical goods (produce once, sell many times) helps us a bit in this matter.
Setting a fixed price centrally will promote efficiencies in data production at the assigned quality point (by making the production more efficient the producer will increase his gain), but shouldn’t completely destroy the incentives to improve the products, because the same price is paid by each consumer and not by the corporation through a budget directly allocated to the producer.
In this scenario the producer has an incentive to get more consumers to buy the data product by improving it (there is a quantifiable return on additional investments made on the data product) and by sharing ideas about new ways to create value from data.

Deciding the prices is, once again, a non-technical problem.
I would promote value-based prices over cost-based prices any day because inefficient production processes lead to a higher price for a set value generated (I am being Austrian again here), but there is, again, a data-product-specific constraint here: most organizations have a hard time materializing the expected return of their data investment.
I dare to say that many can’t even quantify the obtained return at all and this makes a proper value-based central pricing of data assets close to impossible.
The only option left is to set the initial price of data products based on a linear combination of incurred and recurring production costs.
The unit price of the data product (subscription) for each consumer will be calculated by dividing the (calculated) current cost by the number of current consumers, and the budgets will be aligned according to the current consumers.
Organizations can (and should) apply a periodic price deflation factor to the initial prices to drive efficiency up and prevent omission bias and complacency on the producer side.
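
The pricing mechanism described above can be sketched in a few lines. This is only one possible reading of it: the weights of the linear combination, the compounding form of the deflation, and all the function and parameter names are my assumptions, and each organization would have to pick its own.

```python
def current_cost(
    incurred_cost: float,
    recurring_cost: float,
    incurred_weight: float = 1.0,
    recurring_weight: float = 1.0,
) -> float:
    """Linear combination of incurred and recurring production costs.

    The weights are hypothetical knobs, e.g. to amortize the incurred cost.
    """
    return incurred_weight * incurred_cost + recurring_weight * recurring_cost


def unit_price(cost: float, consumers: int) -> float:
    """Price paid by each consumer: the current cost split across consumers."""
    if consumers <= 0:
        raise ValueError("a data product needs at least one consumer to be priced")
    return cost / consumers


def deflated_price(initial_price: float, deflation_rate: float, periods: int) -> float:
    """Initial price after applying the deflation factor once per period.

    Assumes compounding: each period the price is multiplied by (1 - rate).
    """
    if not 0.0 <= deflation_rate < 1.0:
        raise ValueError("deflation_rate must be in [0, 1)")
    return initial_price * (1.0 - deflation_rate) ** periods
```

For example, `unit_price(current_cost(60_000, 40_000, incurred_weight=0.5), consumers=20)` prices a subscription for a product with 20 consumers, and `deflated_price` then lowers that price period after period so that the producer has to become more efficient to keep the same margin.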

Will the Data Mesh save organizations from the Data Mess?

The “Data Mess” is almost as old as the installation of a second database within a single organization. Or maybe even older and paper-based.

Many companies, all over the world, have tried to solve the data mess problem for decades with varying degrees of success.
Which is a nice way to say: in many cases with limited or no success.
Despite the many promises of technical silver bullets made over the years, like the MPP databases earlier or the Hadoop-based data lakes later, the task of integrating data is still far from being a trivial one.

About 6 months ago I had a chat with a friend, and former Teradata colleague, and he told me he had to discuss the data mesh with the CIO of a large Italian company that was extremely excited about the subject.
Unsurprisingly, given the ripples that this post of Zhamak Dehghani had in the market, in the preceding weeks I had several conversations about the data mesh with my teammates and we are still debating the subject.

I’m writing today because I’m concerned by the fact that the data mesh is perceived, in almost all the conversations I have, as the (new&improved) silver bullet that will finally kill the data mess monster for good.
I think this might be the case. But only as long as the data mesh is not reduced to the technology/architecture part of the solution.

The “data mess” is generated by a combination of shortcomings in 3 key areas:
1) people
2) processes
3) technologies

The data mesh discussions I’ve had so far focus mostly, if not only, on the technical solutions with an unexpressed assumption (or hope) that removing the technical obstacles will be enough to magically fix also people and processes shortcomings.
I guess it might be because a lot of people in IT are more comfortable dealing with technologies than with processes and other people.
Or maybe I am just perceived as too much of a geek for my counterparts to discuss the non-technical aspects of the data mesh with me.

Frankly I hope it’s the latter scenario and the people and processes pillars are being addressed in other streams I’m not part of.
I say this because what the experience in the software quality space taught me is that technologies can facilitate processes, but don’t change them (with a few notable exceptions, such as when packaged ERPs replaced custom solutions ahead of Y2K and many organizations in a hurry just had to adapt to the processes supported by the ERP they picked).
I also learned that people with enough motivation to do so can ignore, or even hijack, the best processes.

Both the first and the second post of Zhamak Dehghani touch on the process aspects multiple times.
Are processes prominently missing only from the conversations I am having and hearing about, or is it a common pattern?

I tend to think that the people pillar (a.k.a. incentives to embrace a new way of doing things) is still not sorted out, or maybe is even perceived as too hard to approach, in many organizations and for this reason is simply removed from the debate.

I believe that solving the people part of the problem is strongly tied to a real transformation of data into a product, rather than just a dump of JSON by-products of the organization’s processes that the potential consumer has to figure out how to use.
What incentive is given to the marketing team (or the e-commerce one, or the customer service, or the production lines…) to invest part of their limited budget to produce high-quality, easy-to-use data available in the mesh and, maybe, also increase the data value over time?
No ROI, no party.

In the end my answer to the question I asked in the title is:
“Building a data mesh infrastructure without creating effective processes (and the right incentives for individuals and organizations to embrace the new processes) is not going to remove the data mess from the map.”