Data Federation is ETL. Maybe without the T but for sure with the outcome discarded afterward and calculated over and over.

Posted on June 2, 2025 by Marco Ullasci

Product marketing leverages the desires of people.
For many in IT, getting rid of complex ETL development, operations, and maintenance is one of these desires. Sometimes the business people share this desire: to finally be free from the need to explain to non-business people (and sometimes challenge their own beliefs about) how the business works.
The data lakes in their original incarnation (in practice, data swamps) promised to achieve this but didn’t live up to the promise.
There is another solution, one that didn’t suffer the same bad press, that promises to free the IT team from the need to have “complex ETL”: data virtualization/data federation.
In this post I’ll try to dispel two myths about data federation.

Myth 1: Data Federation removes the need to design complex ETL

ETL usually is not complex because of the tools used or a lack of skills.
The purpose of ongoing ETL is to extract new data from a number of source systems, manipulate it (clean, conform, track historical changes for slowly changing dimensions, make hierarchies easier to use…) and produce new datasets that are useful for the business (i.e. data that enables the organization to make better and/or faster business decisions that will increase its profits).
The intrinsic complexity of the manipulation that needs to be implemented in the ETL is derived from the business processes: no tool will ever remove this complexity.
Intrinsic complexity can only be moved around, but never removed.
When a virtualization tool promises to “remove the complex ETL” what it is really saying is “someone outside the ETL team will have to manage this complexity”: at best it is shifting the effort from one team to another within the organization. For example from the IT team to the business people. Or from the ETL team to the BI team within IT.
This is pretty much the same sleight of hand that was attempted with many of the early data lakes: store the data without much/any transformation.
The net result of many of the attempts was that the business users couldn’t utilize the data in these lakes in an efficient way. If at all.

Myth 2: Data Federation removes the effort needed to “run” the ETL

This is very true in the literal sense, but it is also very partial.
Replacing the ETL with a federation solution makes it much harder to troubleshoot problems because intermediate datasets are not saved and available for review.
Because the data in the source systems is changing continuously it’s almost impossible to reproduce exactly the results the users complained about.
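The reproducibility problem can be shown with a toy example. This is only a sketch: an in-memory SQLite database stands in for the operational source, a view stands in for the federated query, and a materialized table stands in for the ETL output. The table and column names are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO source_orders VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

# Federation-style access (stand-in): recomputed from the live source on every read.
conn.execute(
    "CREATE VIEW federated_total AS SELECT SUM(amount) AS total FROM source_orders"
)

# ETL-style access (stand-in): computed once and persisted for later review.
conn.execute(
    "CREATE TABLE etl_total AS SELECT SUM(amount) AS total FROM source_orders"
)


def read_total(relation: str) -> float:
    """Read the single 'total' value from the given table or view."""
    return conn.execute(f"SELECT total FROM {relation}").fetchone()[0]


# The number the user saw and later complained about.
reported_to_user = read_total("federated_total")

# The operational system keeps changing after the user ran the query.
conn.execute("UPDATE source_orders SET amount = 25.0 WHERE id = 2")

federated_now = read_total("federated_total")  # no longer matches the report
etl_now = read_total("etl_total")              # still matches the persisted run

print(reported_to_user, federated_now, etl_now)
```

The persisted table still holds what was computed at load time, while the view can only return what the source contains now.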
In the federation approach, ensuring a level of data quality comparable with what the ETL achieves can be cost prohibitive, and the queries can become so slow that the user experience is unacceptable.
Extracting data from the source system during regular working hours adds load to the operational systems and also to the teams managing them.
If the number of queries executed for the specific use case is small, the data moved is modest, and there is no expectation of significant growth in either dimension, you are safely in the “federation space”. If this is not the case then you must properly benchmark your solution and avoid a “kick the tires” test.

Conclusion

What you can get easily with a data federation solution is different from what you can get easily with an ETL solution.
Pretending this is not the case suggests a gap of knowledge and real-world experience.
The best solution is the one that addresses your organization’s (i.e. including the users’) needs with the most acceptable set of trade-offs today and tomorrow.

P.S. In any case never challenge an experienced supporter of one of the two approaches: you can rest assured the person will find a very intricate, and costly, way to prove his tool of choice can do everything you say it can’t.
This is why I purposely added the adverb “easily” in my conclusion.

Challenges of Distributed Data Queries in Analytics: the “federation space”

Posted on May 3, 2025 by Marco Ullasci

I joined Teradata in 2008.
At the time, inside the organization, the idea of using a data federation solution to run analytics workloads at scale, instead of a traditional ETL/ELT approach, was considered at best extravagant and in most cases plainly nonsensical.

With the arrival of Hadoop this attitude changed because a lot of organizations felt IT had finally found the philosopher’s stone of analytics and suddenly the demand to merge the best of the old world (database) and the new world (Hadoop) soared.
In this context in 2015 I ran my first benchmark of a data federation solution at scale thanks to the recently introduced product called Teradata QueryGrid.
The setup included a Teradata database cluster, a Hadoop cluster, and an Oracle database.

With Teradata and Hadoop nodes connected over an Infiniband network it felt like nothing could stop me from sailing smoothly through the test workloads.
The feeling lasted just until I started to go beyond the functional tests.
With one of the client-provided queries retrieving 300,000,000 (three hundred million) rows from Hadoop (for each execution) the account team and I quickly became aware of the scalability limits we faced even when working with an infrastructure that was undoubtedly top notch.
The volume of data that needed to be moved was not the only dimension that was pushing the limit of the architecture, but it was the most important of all.
This reality of querying distributed data sets is schematically summarized in the diagram below: the volume of data that needs to be moved can make a distributed architecture not viable.

I didn’t come up with this diagram until 2020: at the time I was in a meeting with my teammates in the AWS ASEAN analytics specialist solution architects team (at the time led by Raja) and we were discussing, among other topics, the recent announcement of Google’s BigQuery Omni running on Anthos.
There was some concern among the colleagues that the new offering from Google might be a game changer and put us in a difficult spot.
I drafted the diagram to help us all put the announcement in the context of its intrinsic limits.
We were much more relaxed after using it to frame our reasoning.

The diagram above comes without units of measure on the axes on purpose, for two main reasons:
– each organization has a different technology landscape (source systems and interconnects) with different sustainable throughput and this would change the numbers
– the acceptable query response latency moves, ceteris paribus, the yellow curve closer to or farther away from the X and Y axes
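
As a rough illustration of the reasoning, here is a back-of-the-envelope check. It is a sketch, not the model behind the diagram: it assumes the two dimensions are the number of queries and the data moved per query, and it reduces the interconnect to a single sustained throughput figure. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    queries_per_hour: float
    gb_moved_per_query: float


def in_federation_space(
    workload: Workload,
    sustained_gb_per_s: float,
    max_latency_s: float,
) -> bool:
    """Return True when both assumed constraints hold.

    1. Moving the data for one query fits within the acceptable latency.
    2. The hourly data volume fits within what the interconnect can sustain.
    """
    transfer_s = workload.gb_moved_per_query / sustained_gb_per_s
    hourly_demand_gb = workload.queries_per_hour * workload.gb_moved_per_query
    hourly_capacity_gb = sustained_gb_per_s * 3600
    return transfer_s <= max_latency_s and hourly_demand_gb <= hourly_capacity_gb


# Hypothetical workloads on a hypothetical 1 GB/s sustained interconnect.
small = Workload(queries_per_hour=20, gb_moved_per_query=0.5)
# 300,000,000 rows at an assumed 200 bytes per row is about 60 GB per execution.
large = Workload(queries_per_hour=50, gb_moved_per_query=60.0)

print(in_federation_space(small, 1.0, max_latency_s=10.0))   # True
print(in_federation_space(large, 1.0, max_latency_s=10.0))   # False
print(in_federation_space(large, 1.0, max_latency_s=120.0))  # True
```

The last call shows the second point in the list: relaxing the acceptable latency moves the boundary, while the throughput of the landscape stays the same.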

The careful reader will notice I haven’t mentioned commonly considered techniques to help reduce the load on the interconnection.
I’ll write about data caching and query push down in the next post.

In the cloud you don’t need a development environment. You need two.

Posted on December 31, 2023 by Marco Ullasci

When I was much younger
Clients seldom agreed to the number and configuration of IT environments that constituted the perfect setup.
Some would have a dedicated unit testing environment, one for the integration test, one more for the preproduction (where the performance tests would be performed) and finally one for production.
Some would have a hot/standby production. Some would have an active-active production. Some would have a DR site in addition to the active-active production.
The only area where all my clients in the early 2000s agreed was that only one development environment was needed. Sometimes the development environment was owned, hosted and managed by the system integrator creating a custom solution rather than by the client.

The first time it came to my attention that the traditional definition of development environment was getting obsolete.
A few years ago, circa 2015-2016, I went to visit a customer in Malaysia to run a proof of concept for a data federation solution together with a few colleagues.
We took into account a number of technical and non-technical factors to select the platform we would use and finally we agreed with the customer to utilize the VMWare edition of our database that was recently made available.
We provided the installation files to the person managing the development environment of the customer and the installation scripts failed very soon, already during the environment checks, because the environment was over-provisioned. Such a check is indeed very reasonable for a production database but not so much for a PoC.
This initial failure led the customer to scrutinize the entire installation script and raise a number of concerns about how it would interact with the existing VMWare setup. The script was not just installing our software, it was also configuring the virtual hardware of the VMWare environment to ensure it matched the configuration expected by the installer.
The key feedback we received went along the lines of: “my VMWare development environment is my developers’ production environment, and I won’t allow your script to change it potentially destroying the productivity of my team”.
A member of our team was very talented with VMWare and was able to edit the standard installation scripts to remove all the parts the customer considered a risk for its environment, enabling the installation without a glitch while leaving the existing infrastructure setup unchanged.
We just noted down that in the future we should ask for a dedicated infrastructure to run our VMWare edition and I put the experience in the long-term storage of my brain.

Infrastructure as code (IaC) is the new norm and this changes the management of the development environments forever.
Having a single development environment for both software and IaC development creates two challenges for an organization.
On one side there is the risk of IaC developments disrupting the productivity of the other developers.
On the other side, to minimize the risks associated with the IaC changes, the organization might put in place a set of restrictive guardrails that effectively cripple its ability to innovate and bring the ways of working back to the pre-cloud era.

Just having a second, isolated, development environment for IaC is not enough.
Many infrastructure mistakes are only discovered when applications are executed.
To ensure the IaC errors are detected early, when the blast radius is minimal, the pipelines for the “normal” software components have to deploy and automatically test in the IaC development environment as well.
Any failure detected only in the IaC development environment shall then be raised to the infrastructure team while the failures happening both in IaC and SIT are raised to the component development team like before.
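
The routing rule can be written down as a small function. This is a minimal sketch: the names are hypothetical, and the handling of a failure seen only in SIT is my assumption (it goes to the component team, as it would have before the IaC environment existed).

```python
from enum import Enum


class Owner(Enum):
    NONE = "no action"
    INFRASTRUCTURE = "infrastructure team"
    COMPONENT = "component development team"


def route_failure(failed_in_iac_dev: bool, failed_in_sit: bool) -> Owner:
    """Decide who receives the failure of a component's automated tests."""
    if failed_in_iac_dev and not failed_in_sit:
        # Only the environment with the new infrastructure code is broken.
        return Owner.INFRASTRUCTURE
    if failed_in_sit:
        # Fails in SIT too (or, by assumption, only in SIT): a component problem.
        return Owner.COMPONENT
    return Owner.NONE


print(route_failure(failed_in_iac_dev=True, failed_in_sit=False).value)
print(route_failure(failed_in_iac_dev=True, failed_in_sit=True).value)
```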

How do you approach the challenges and opportunities introduced by IaC?

Why the Data Mesh can’t and shouldn’t replace data integration

Posted on November 13, 2022 by Marco Ullasci

A few weeks ago, I wrote that Self-service analytics isn’t everyone’s dream and a few days later I discussed the content with my friend Alessio.
It turned out that one of the points I was trying to make didn’t come out as sharply as I believed, and this post is to elaborate on it and make it clearer.

Data self-service is more feasible and cost-effective with a proper Data Mesh than with most 1st generation data lakes.
Adopting the Data Mesh model doesn’t imply that data integration efforts should be halted.
I see at least three reasons why data integration initiatives should continue:
a) The cost of finding, understanding, and processing data from multiple sources is lowered when adopting the Data Mesh, but when the same activity is repeated multiple times by hundreds or thousands of employees the cumulative cost will become greater than the cost of integrating the data once for everyone.
b) When each person or small team integrates the data autonomously the potential outcome is a Babel of numbers that don’t match those created by other teams and the organization ends up with a modern edition of the spread-mart proliferation of the past.
Until the results are incorporated in a PowerPoint document this modern version removes the problem of the stale data, but it’s a small consolation.
c) Truly federated analytical queries can’t guarantee, at least today, the low latency needed in many business scenarios.

In the previous paragraph I mentioned the cost of producing valuable knowledge from data, but I didn’t dig in the detail of computing such cost.
I tried to come up with a nice construct for it, but all I can offer so far is a small WIP to hopefully start a little brainstorm.
The cost of knowledge production (CoKP) is a function of computing costs and human effort.
– The computing component appears relatively simple but paying attention to the efficiency of the data processing is key when queries are executed at scale.
I’ve seen >100K analytical queries per day in production environments at several clients: this is what happens when an analytical platform is successful in generating value. It’s better to plan ahead of time rather than ending up forced to re-think the architecture by skyrocketing costs and potentially losing momentum and motivation in the process.
– In the human component I’d include the time needed to find the data, get access, understand what the specific data means, understand the fit for the specific purpose, and process the data to extract the knowledge.
I know that cost in itself is not enough to evaluate an investment but estimating the return of an investment in data transformation is complex and I don’t dare take a stab at it here.
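
To make the brainstorm a bit more concrete, here is one possible shape for the calculation. It is only a sketch under strong assumptions: CoKP is taken as a plain sum of computing cost and human effort priced at an hourly rate, and every number below is invented for the example.

```python
import math
from typing import Optional


def cokp(compute_cost: float, human_hours: float, hourly_rate: float) -> float:
    """Cost of knowledge production for one person, once (assumed additive)."""
    return compute_cost + human_hours * hourly_rate


def break_even_people(
    integration_once_cost: float,
    self_service_cokp: float,
    integrated_cokp: float,
) -> Optional[int]:
    """Number of people at which integrating once costs no more than
    everyone repeating the integration on their own."""
    saving_per_person = self_service_cokp - integrated_cokp
    if saving_per_person <= 0:
        return None  # integrating once never pays back
    return math.ceil(integration_once_cost / saving_per_person)


# Invented numbers: 8 hours of finding, understanding and processing data per
# person with self-service, 1 hour when the data is already integrated.
self_service = cokp(compute_cost=5.0, human_hours=8, hourly_rate=50.0)
integrated = cokp(compute_cost=5.0, human_hours=1, hourly_rate=50.0)

print(break_even_people(70_000.0, self_service, integrated))  # 200
```

With these made-up figures the one-off integration pays back once about 200 people need the same data, which is the point made in a) above about hundreds or thousands of employees.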

I take the opportunity to follow up on a couple of comments related to this subject that I’ve received recently:
– My friend Dan suggested that IT should be the group doing the data integration.
I’m not sure it has to be IT; they might have better visibility of the overall data landscape and thus be in a better position than other organizations, but I’d say today there should be a chief data officer (CDO) to lead the data integration initiative.
– Gaetano’s reply to my comment a week ago seems to go in the same direction of seeing IT as the agent enabling cross-domain pollination.
– Andrea kindly pointed out an improper use of the term “insight” on my side.
I’m pretty sure my former colleague Dirk would completely agree with him, and I’ll do my best to be more rigorous with the terms I use.

As usual…

Self-service analytics is not everyone’s dream

Posted on October 16, 2022 by Marco Ullasci

According to the recent messaging of many players in the data space, self-service analytics is the next big thing, the end-users want it, and organizations should acquire the technologies needed to deliver it as quickly as possible.

I beg to differ on the starting assumption: no matter which definition of “self-service analytics” is adopted (and there are many), what the business users really want is access to the information contained in the organization’s data in a time-frame that makes said information usable to drive a business decision.

The definition of “self-service analytics” is not standardized and this can cause misunderstandings when discussing the subject.
The spectrum goes from a relatively conservative “self-service reporting capability”, where the data is integrated by the IT team (for example Gartner’s and usually the BI tool vendors’), to a more modern “data-mesh-enabled self-service-everything-data” where simpler-to-use tools and standard interfaces are made available and allow the end-users to integrate the data as they deem appropriate without having to wait for the IT team to perform the task (TDWI’s and Snowflake’s definitions lean more in this direction).

Self-service analytics, in the modern definition, might today be the best way to get to the desired timely insights from data given the current technological and organizational landscapes (i.e. without waiting a few months for the IT data integration process to complete).
“Best today” is not synonymous with “great” and I believe that a generalized shift of the end-users toward modern self-service analytics should not be advocated.
Organizations should keep investing toward a truly integrated data landscape.
The fact that data self-help can now be performed using big data technologies in a cloud environment doesn’t make the practice substantially different, in terms of risks, from the Excel-based “spread-marts” of the past. Even with the modern tools it is still possible to have different people label the same way data that is integrated, filtered, and aggregated in different ways at different points in time, resulting in siloed, mismatching data marts.

The division of work and specialization is what enabled the standards of living to improve at an increasingly fast pace when humans started to leverage them at scale.
Mass production drove up the quality and volume of goods produced per unit of time (and money) in exchange for a longer setup time of the production line and a reduced space for customization (“Any customer can have a car painted any colour that he wants so long as it is black” is a famous quote attributed to Henry Ford), but production techniques have evolved over time and, remaining in the car domain, today’s production gives us a degree of flexibility in designing our vehicle Ford couldn’t even dream of.
This evolution didn’t completely destroy the demand for artisanal work and custom realizations, but made these realizations the exception rather than standard, and the prices for high-quality, personalized solutions are significantly higher.

When thinking about self-service analytics and data democratization organizations should be careful to properly model the associated sunken costs.
It is easy to quantify the cost of a centralized data modeling and engineering team, but it is hard to model the costs of self-service analytics practices.
We should have learned from the experiments with the Hadoop-based data lakes that the apparently cheap schema-on-read approach resulted either in an increase of costs, as multiple people had to figure out the same data integration over and over, or in data swamps that no one wanted or could use.

There is no doubt that self-service analytics is more cost-effective with a proper data mesh (i.e. one that has real data products and is based on an enterprise-wide domain-driven design) than with the first-generation data lakes, but this doesn’t mean that the specialized, factory-like creation of integrated data (products), which takes longer to build, should be abandoned.
While it’s great that today I could get the material needed to quickly replace the shattered glass of my table delivered at home with a few clicks, without having to lease a small truck and go to the shop like a decade ago, it doesn’t mean that now I should build my own kitchen.
Organizations should strive to have a faster, more flexible, mass production of high-quality insights and augment it, rather than replace it, with an efficient and effective delivery of raw data.
Just like I should buy my next kitchen here again rather than try to build it myself.

Being a tech-savvy parent sometimes can be very frustrating.

Posted on September 4, 2022 by Marco Ullasci

The COVID-19 pandemic has forced a great acceleration in the adoption of information technology by schools scrambling to keep students able to learn while at home.
The private school of my children took the path of cloud technology, a very reasonable one to scale capacity rapidly, and went for Google’s ecosystem: Google Drive and Google Docs/Sheets form the backbone of their solution.

All is good as long as you have reliable connectivity, which is the case in Singapore, and the students are able to self-manage their use of non-school-related Google services.
The moment the children need help in resisting the temptation of burning tens of hours per week on YouTube things become more complicated for parents that neither have a solid understanding of the technology nor can spend all their time watching the computer use of their children.
Having a fairly solid IT background I thought I was in a better position than the average parent and could easily implement an automated solution to the problem.

I have a firewall at home and I thought that I could simply block YouTube through it.
Unfortunately blocking the YouTube-related domains (m.youtube.com, www.youtube-nocookie.com, www.youtube.com, youtube.googleapis.com, youtubei.googleapis.com) not only works perfectly to block YouTube, but also works perfectly to prevent the Google Drive website from loading.
A quick search confirmed it wasn’t something I did wrong on my side, but the way Google has set up their services.
But you can use the official app to make Google Drive look like a local disk and access the files while blocking YouTube, right?
I did so and felt quite proud of it, but only for a short while. Just until I realized this is not enough to edit documents created with Google’s productivity suite.
The documents in the virtual disk in reality contain only the metadata to open the remote documents with Google’s productivity suite online, not the actual documents: I needed YouTube access again.

Another quick search provided me with a manual workaround for the documents created with Google’s tools, but I find it is fairly impractical:
1) Download locally through the web interface the document created with Google’s productivity suite. This will automatically trigger a conversion to a standard office format for the downloaded copy.
2) Upload the converted file again after ensuring you didn’t configure the settings to automatically convert to Google’s format.
3) Edit the standard-format document both online (with Google’s tool) and locally (with LibreOffice if you don’t have a license for M365).
Because my son already had >100 documents in Google format and the first 2 steps require access to YouTube every time he forgets to perform them at school I find this route quite inconvenient.
For the time being I am giving up on finding a way to automatically block YouTube while having full access to Google Docs.

From the business point of view I can see why Google would rather not have parents blocking children’s access to YouTube; this is one way they make money with their “free” productivity tools.
I would also have understood the business decision to go with Google’s productivity suite if it was made by a public institute in a country with financial troubles because it’s still much better than leaving the students unable to learn.
But when the school asks for annual tuition fees in the range of tens of thousands of dollars, picking Google’s productivity suite over Microsoft 365, which is also cloud based but works by default with industry-standard formats, simply makes no sense from a parent’s point of view.
The school should have invested adequately in tools and personnel to facilitate the remote learning needs of their students at a level matching their premium tuition fees.

If someone knows a fully automatic way to block YouTube while at the same time working seamlessly with Google’s documents please share it in the comments: I’m sure I will not be the only parent happy to know how to do it.

Synology DSM 7.0 as a simple ad-blocker

Posted on September 20, 2021 by Marco Ullasci

I’m ok with the ad-supported business model, but when I find 10 (ten!) advertisements for the same computer vendor on the screen I feel a strong urge to get some ad-freedom.
Oftentimes the ads are not static but dynamic, and this makes older computers obsolete faster than strictly needed. This motivates me even further.

A few years back I successfully used a nice little tool to convert my DS411Slim into an ad-blocker.
The process was simple and all was good until my DS411Slim died for good.
After recovering all the data (with a fair amount of stress and work) I kept using the same solution on the Xpenology running on an HP microserver Gen8 that replaced it.

Fast forward a few years and I moved to Singapore without bringing the server over, but still needed a baseline level of ad-freedom.
For a while I used OPNsense (running on a Kodlix GK45) for this purpose, but after a few (several) months I grew tired of Suricata turning off seemingly randomly on the box (with little useable logs to troubleshoot and no improvement with the OPNsense patches) and sometimes the DNS service doing just the same.
The OPNsense box got decommissioned (it is waiting to be repurposed) and a Unifi USG (I had it sitting around unused since in Italy I moved to a Sophos firewall for my dual-wan needs) took over the basic firewall duties leaving me with my ad-invasion problem again.

Sunday last week I thought it was a good idea to get rid of my ad problem by leveraging my recently purchased Synology DS120J in tandem with a used DS115J (yes, I did learn from the DS411Slim experience that a NAS is a single point of failure even when it supports multiple disks).
Because I went through the process in the past I thought it was going to be a matter of a few minutes and a good idea to relax a bit during my study session for the AWS Certified Solutions Architect – Professional exam.

After almost 30 years dealing with computers I should have known that oftentimes “a few minutes” becomes “a few hours”.
In this case the time-stretching happened because I’m running DSM 7.0 instead of DSM 6.x like in my previous NAS and Synology in this major version changed the way some of the built-in services are managed.
Because of this change the directory structure changed too and the original script didn’t work out of the box anymore.
The former:
/etc/zone/data/
in DSM7 becomes:
/volume1/@appstore/DNSServer/named/etc/zone/data/
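
A script that has to work on both versions could probe for the directory instead of hard-coding it. The snippet below is a minimal sketch of that idea in Python and is not the hack I attached to the GitHub issue; the two paths are the ones listed above, while the function and constant names are made up for the example.

```python
from pathlib import Path
from typing import Iterable, Optional

# Zone data locations mentioned above.
DSM6_ZONE_DIR = Path("/etc/zone/data")
DSM7_ZONE_DIR = Path("/volume1/@appstore/DNSServer/named/etc/zone/data")


def find_zone_dir(
    candidates: Iterable[Path] = (DSM7_ZONE_DIR, DSM6_ZONE_DIR),
) -> Optional[Path]:
    """Return the first candidate that exists as a directory, or None."""
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    return None


zone_dir = find_zone_dir()
if zone_dir is None:
    print("DNS Server zone directory not found: is the package installed?")
else:
    print(f"Zone files live in {zone_dir}")
```

Note that the DSM 7 path assumes the DNS Server package is installed on volume1, as it is on my box.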

After hammering here and there for a while I got to a hack that seems to be working for me.
If someone wants to take a look at it I attached it in the GitHub issue here.
The usual disclaimer applies: there is no guarantee implied or expressed that it will work for anyone else, it’s totally unrelated to my current or past job and employers, and if the script is used I bear no responsibility for anything bad happening. You can read the full, wordy, detail at the bottom.

Because I passed my certification exam last Friday (Friday the 17th, what an auspicious day did I pick!) I feel much less guilty now of having subtracted two hours from my study and I hope with this post I will save some time for others trying to follow the same approach.

https://marcoullasci.com/2019/08/18/disclaimer/

Dell 6430u updated to bios A10

Posted on February 2, 2016 by Marco Ullasci

Two days ago I did a new bios update on the notebook.

The process worked fine as usual and, again as usual, did not fix or improve the issue with the fan noise.
After 36 months with it I only have to bear it for 12 more months, until the notebook is due for a refresh.

A positive note about the 6430U: it no longer triggers the security scanner in the Ben Gurion airport. Whatever chemical was there has now completely evaporated.

Olivetti M10: amarcord purchase

Posted on February 5, 2013 by Marco Ullasci

When I was 11 I had my first programming training.
At the time computers were still a fairly esoteric subject in Italy, but my school had the opportunity to get a few Olivetti M10s when they were introduced and offered the students the opportunity, on a voluntary basis, to be trained to use the systems.

30 years later I’ve decided to buy a piece of my computing history and now it’s part of my collection of old hardware.

Welcome home M10

Asus EA-66N: a great little AP

Posted on January 30, 2013 by Marco Ullasci

After living for quite some time with the wi-fi built into the ADSL modems (I have two lines at home) I’ve decided that the signal needed some improvement to work reliably with the Nexus 7.
For this reason, after reading a lot of reviews online, I’ve selected this small device: it’s not the cheapest device for the purpose but I trust Smallnetbuilder.

The design is unconventional and the size was surprisingly small when I got it.
The installation manual is relatively fat but it’s only because it covers a dozen different languages: the actual content is quite skinny; this fortunately is not an issue as the setup, once connected to the web interface, is really easy to do.

Signal improved significantly on the Nexus 7: from 1-2 bars with some occasional complete disconnections to 4 bars (out of 4) with a few drops to 3 bars.
Also the Nokia Lumia 800 and E7 both have shown a significant improvement in signal quality.
The Acer 3810T was already working fine with the older solution: this is likely due to the larger radio antenna and greater available power.

The device can also be used as a wi-fi to ethernet bridge to connect a single device (an easy way to implement what I did using OpenWRT) and to extend the wi-fi range, but I haven’t used it in this way.

Overall I had a very positive experience and would suggest this device to anyone having a need like mine.

marcoullasci

The blog of a gourmand technologist that loves to read books

Tag Archives: technology
