Distributed Data Queries in Analytics: data caching and query push-down to the rescue?

I recently posted about the “federation space” and how there is a sweet spot for the use of data federation / data virtualization solutions to perform data analytics.
In that post I promised to discuss two optimization techniques that aim to extend the viability of running queries on distributed data.

Query push-down

To anchor the discussion I’ll use a pseudo-code query.
The simplest query on distributed data can be sketched like this:
Select some_calculation from remote_table
For my example I’ll take “some_calculation” to be the average of the values of “column_a” in a remote table containing, let’s say, 10 rows with 10 columns each.
– The most trivial implementation will perform the sample query by fetching the entire remote_table locally (100 values retrieved) and then calculate the average.
– A slightly smarter implementation will push to the remote system the request to send across only the values of column_a (10 values retrieved).
– A good, modern implementation will push the entire query to the remote system and retrieve only the desired average (1 value transferred across).
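To make the trade-off concrete, here is a minimal Python sketch of the three approaches. It uses sqlite3 purely as a stand-in for the remote system, with made-up data; the only point is how many values cross the wire in each case, not the specific API.

import sqlite3

# Stand-in for the remote system: a 10-row, 10-column table with columns column_a..column_j.
remote = sqlite3.connect(":memory:")
remote.execute("CREATE TABLE remote_table (%s)" % ", ".join("column_%s REAL" % c for c in "abcdefghij"))
remote.executemany("INSERT INTO remote_table VALUES (%s)" % ", ".join(["?"] * 10),
                   [[float(r * 10 + c) for c in range(10)] for r in range(10)])

# 1) Trivial implementation: fetch the whole table (100 values cross the wire), average locally.
rows = remote.execute("SELECT * FROM remote_table").fetchall()
avg_naive = sum(r[0] for r in rows) / len(rows)

# 2) Column projection pushed down: fetch only column_a (10 values cross the wire).
col = [r[0] for r in remote.execute("SELECT column_a FROM remote_table")]
avg_projected = sum(col) / len(col)

# 3) Full push-down: the remote system computes the aggregate (1 value crosses the wire).
avg_pushed = remote.execute("SELECT AVG(column_a) FROM remote_table").fetchone()[0]

assert avg_naive == avg_projected == avg_pushed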

It all looks great in the simple example, but the reality of analytical queries is seldom this trivial.
A slightly more realistic pseudo query might look like the following:
Select customer from remote_spending_table, local_customers_segments_table where customer_spending > 2*average(all_foreign_customers_spending) and customer_segment = 'foreign'
The purpose is to retrieve the list of foreign customers that are spending more than twice the average spending of all foreign customers.
This generates two challenges:
1) correctly identifying the extent of the possible query push-down and translating it into the remote system’s SQL dialect (this is already not trivial for this small example; just imagine doing it for the 3,000 lines of SQL generated by the MicroStrategy report of one of my German customers in 2010). A sketch of one possible decomposition follows below.
2) getting the owner of the remote system to agree to spend its resources executing our query.
Running a simple filter is much lighter than running an aggregation, and it is easier to get agreement for the former.
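To illustrate the first challenge, here is a rough Python sketch of one possible decomposition of the pseudo-query above: the per-customer aggregation is pushed to the remote system, while the segment filter, the average and the final comparison run locally. sqlite3 stands in for both systems and all the data is made up; a real federation engine would also have to decide this split automatically and translate it into the remote dialect.

import sqlite3

# Stand-in "remote" system holding the spending fact table.
remote = sqlite3.connect(":memory:")
remote.execute("CREATE TABLE remote_spending_table (customer TEXT, customer_spending REAL)")
remote.executemany("INSERT INTO remote_spending_table VALUES (?, ?)",
                   [("c1", 100), ("c1", 50), ("c2", 2000), ("c3", 20), ("c4", 400)])

# Stand-in "local" system holding the customer segments.
local = sqlite3.connect(":memory:")
local.execute("CREATE TABLE local_customers_segments_table (customer TEXT, customer_segment TEXT)")
local.executemany("INSERT INTO local_customers_segments_table VALUES (?, ?)",
                  [("c1", "foreign"), ("c2", "foreign"), ("c3", "domestic"), ("c4", "foreign")])

# Pushed-down part: the remote system reduces the fact table to one row per customer.
per_customer = remote.execute(
    "SELECT customer, SUM(customer_spending) FROM remote_spending_table GROUP BY customer").fetchall()

# Local part: keep only foreign customers, compute their average, apply the 2x filter.
foreign = {c for (c,) in local.execute(
    "SELECT customer FROM local_customers_segments_table WHERE customer_segment = 'foreign'")}
foreign_spending = {c: s for (c, s) in per_customer if c in foreign}
avg_foreign = sum(foreign_spending.values()) / len(foreign_spending)
print([c for c, s in foreign_spending.items() if s > 2 * avg_foreign])   # the big spenders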

The second challenge is partly mitigated if the source system is in the cloud.
Thanks to cloud elasticity, the negative impact on the source system can be made negligible or non-existent.
But the question of who pays for the compute remains (unless a single entity pays for all the infrastructure costs).

Caching

As soon as a data virtualization project gets adopted, the increased load on non-elastic, on-premises source systems becomes unacceptable (because the performance of the original workload suffers) and the users of the analytical queries start to experience longer response times (because the source system is overwhelmed).
The solution to this challenge is to introduce a caching layer that stores a copy of (part of) the data near the data virtualization system. If properly sized, the cache improves the experience both for the source system’s users and for the people running the analytical queries.
The caching layer also helps reduce egress costs when consuming data in a hybrid cloud setup. Microsoft’s blog post about the introduction of S3 caching in OneLake just confirms it.
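To make the concept tangible (this is a toy sketch, not how any specific product implements it), a result cache can be as simple as a dictionary with a time-to-live sitting between the query engine and the remote source; the names and the TTL below are made up.

import time

class QueryResultCache:
    """Toy result cache: serves repeated queries locally for ttl_seconds,
    so the remote source is hit only when the cached copy has expired."""
    def __init__(self, fetch_from_remote, ttl_seconds=600):
        self.fetch_from_remote = fetch_from_remote    # callable: sql -> rows
        self.ttl_seconds = ttl_seconds
        self._entries = {}                            # sql -> (timestamp, rows)

    def query(self, sql):
        hit = self._entries.get(sql)
        if hit and time.time() - hit[0] < self.ttl_seconds:
            return hit[1]                             # no load on the source, no egress
        rows = self.fetch_from_remote(sql)            # only path that touches the source
        self._entries[sql] = (time.time(), rows)
        return rows

# Hypothetical usage: run_on_source is whatever connector reaches the remote system.
# cache = QueryResultCache(run_on_source, ttl_seconds=900)
# rows = cache.query("SELECT customer, SUM(spend) FROM sales GROUP BY customer")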

The first victim of caching is the vision of having “no data duplication (thanks to data virtualization)”, but it is not the only one.

Challenges of Distributed Data Queries in Analytics: the “federation space”

I joined Teradata in 2008.
At the time, inside the organization, the idea of using a data federation solution to run analytics workloads at scale, instead of a traditional ETL/ELT approach, was considered at best extravagant and in most cases plainly nonsensical.

With the arrival of Hadoop this attitude changed: many organizations felt IT had finally found the philosopher’s stone of analytics, and suddenly the demand to merge the best of the old world (databases) and the new one (Hadoop) soared.
In this context, in 2015, I ran my first benchmark of a data federation solution at scale, thanks to the then recently introduced product called Teradata QueryGrid.
The setup included a Teradata database cluster, a Hadoop cluster, and an Oracle database.

With the Teradata and Hadoop nodes connected over an InfiniBand network, it felt like nothing could stop me from sailing smoothly through the test workloads.
The feeling lasted just until I started to go beyond the functional tests.
With one of the client-provided queries retrieving 300,000,000 (three hundred million) rows from Hadoop for each execution, the account team and I quickly became aware of the scalability limits we faced, even when working with an infrastructure that was undoubtedly top notch.
The volume of data that needed to be moved was not the only dimension pushing the limits of the architecture, but it was the most important one.
This reality of querying distributed data sets is schematically summarized in the diagram below: the volume of data that needs to be moved can make a distributed architecture unviable.
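A back-of-the-envelope calculation gives a feel for the orders of magnitude involved; the row size and throughput below are assumptions for illustration, not measurements from that benchmark.

# Back-of-the-envelope: time spent just moving the intermediate result.
rows = 300_000_000          # rows returned by the remote side per execution
bytes_per_row = 100         # assumed average row size
gbit_per_s = 10             # assumed sustainable throughput of the interconnect

transfer_seconds = rows * bytes_per_row * 8 / (gbit_per_s * 1e9)
print(f"{rows * bytes_per_row / 1e9:.0f} GB moved, ~{transfer_seconds:.0f} s per execution")
# ~30 GB per execution, roughly 24 seconds of pure transfer at 10 Gbit/s,
# before any processing and before concurrent queries start competing for the link.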

I didn’t come up with this diagram until 2020: at the time I was in a meeting with my teammates in the AWS ASEAN analytics specialist solution architects team (then led by Raja) and we were discussing, among other topics, the recent announcement of Google’s BigQuery Omni running on Anthos.
There was some concern among my colleagues that the new offering from Google might be a game changer and put us in a difficult spot.
I drafted the diagram to help us all put the announcement in the context of its intrinsic limits.
We were much more relaxed after using it to frame our reasoning.

The diagram above comes without units of measure on the axes on purpose, for two main reasons:
– each organization has a different technology landscape (source systems and interconnects) with different sustainable throughputs, and this would change the numbers
– the acceptable query response latency moves, ceteris paribus, the yellow curve closer to or farther away from the X and Y axes (a rough way to frame this trade-off is sketched below)
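One informal way to read the curve (my own framing, not something printed on the diagram) is as a budget check: the time spent moving the data plus the time spent processing it must fit within the acceptable response latency.

# Informal viability check behind the curve; every input is an estimate.
def federated_query_viable(bytes_to_move, interconnect_bytes_per_s,
                           processing_seconds, acceptable_latency_seconds):
    transfer_seconds = bytes_to_move / interconnect_bytes_per_s
    return transfer_seconds + processing_seconds <= acceptable_latency_seconds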

The careful reader will notice I haven’t mentioned some commonly considered techniques that help reduce the load on the interconnect.
I’ll write about data caching and query push-down in the next post.

What is a performance benchmark?

In my experience, when adopting a buyer’s perspective, I have seen three main classes of performance benchmarks:

  1. Check-mark benchmarks
  2. Decision confirmation benchmarks
  3. Decision driving (risk-reducing) benchmarks


Check-mark benchmarks are usually driven by internal processes, tied to quality best practices or required by laws and regulations.

In most cases the benchmarks falling into this class are perceived as a pure cost that needs to be minimized: an industry-standard benchmark is usually the cheapest answer to the need for a benchmark result.

The Wikipedia article on the general subject of performance benchmarking adopts a perspective that matches this type of benchmark very well.

The key principle, among the seven proposed Benchmarking Principles, that in my opinion positions the article as a description of “check-mark benchmarks” is the one called Representativeness: “Benchmark performance metrics should be broadly accepted by industry and academia”.

Several years ago Curt Monash wrote here that “The TPC-H benchmark is a blight upon the industry”.

Not only do I fully agree with him about TPC-H, I would expand the statement further: as of today, all the industry-standard benchmarks serve (some) vendors, but not the buyers.

I find the rest of the principles listed in the article sound and relevant for all the classes I’m describing in this post, but I use a slightly different definition for relevance (the test should measure metrics relevant to the business problem that needs to be solved by the technical solutions being tested) and transparency (not only do the metrics need to be easy to understand; the test conditions, and how changing those conditions can influence the results, should be clear as well).


Decision confirmation benchmarks are executed to demonstrate the correctness of a decision that has already been taken.

When running such a test there is a high risk of confirmation bias coming into play in the way the test is defined, with the tests favoring the technical solution that has already been selected.

Because the decision has already been made, in this case too the benchmark is seen as a cost to minimize rather than an investment.


Risk-reducing benchmarks are executed, as the name implies, to minimize the risks associated with the selection of a specific technical solution to address a set of business needs.

The costs associated with the selection of an incorrect or sub-optimal solution can be very significant for an enterprise, with the direct ones (incurred to implement the solution) usually being just a fraction of the total. The opportunity cost is usually the largest part.

When looking at the performance benchmark from this perspective, the buyer sees the costs associated with its preparation and execution as an investment, much like the premium of an insurance policy.

Cost minimization is no longer the main design criterion: it is replaced by a balance between the ability to predict the future behavior of the different technical solutions under the buyer’s specific processing pattern and the cost of defining and running the test.


A single exercise might show characteristics of more than one of these classes, but in my experience the (mainly) risk-reducing performance benchmarks are a very small fraction.

What is your experience in this regard?

Performance benchmarks for the operational people

In August I promised (here) to post a target table of contents for this series of posts about performance benchmarks and delivered the first part, covering the subjects I believe are relevant to managers.

I label “operational people” any individual or organization that is involved hands-on in the definition, implementation, execution and technical evaluation of a performance benchmark.

Below you find the candidate list of subjects I believe are relevant to the operational people.


1) Knowing your starting state

  1.   The system(s) you are going to replace
  2.   The interaction between the systems
  3.   The workload on each system

2) Simulating your final state

  1.   Black box definition of the final state
  2.   Defining the workload hitting the black box, based on the current state
  3.   Defining the workload generated by functional and/or organizational changes

3) Ensuring the technologies are set up to perform at their best for the desired final state

  1.   Defining the metrics
  2.   Defining how the metrics are grouped
  3.   Sharing the metrics with the vendors/implementation teams

4)  Executing the tests and reviewing the results.



Performance benchmarks for the manager

I promised last week (here) to post a target table of contents for this series of posts about performance benchmarks.

It is quite a long list of topics, which I split into two main areas:

  • Topics that are relevant for everyone in the organization, labeled “for the manager”,
  • Topics that are of interest mostly for the technical people, labeled “for the operational people”.

Below you find the candidate subjects that I believe are of general interest.


1)   What is a performance benchmark

2)  Types of technical benchmark

  1.   The check mark benchmark
  2.   The confirmation benchmark
  3.   The industry standard benchmark
  4.   The valuable benchmark

3)  Organizational challenges to implement a valuable benchmark

  1.  The IT architecture perspective
  2.  The IT operations perspective
  3.  The CFO perspective
  4.  The CEO perspective

The next post will contain the list of technical subjects, while the following ones will start to dig into each subject in the two lists in an orderly fashion.

As I wrote earlier: your feedback will be key in shaping how this will move forward. Feel free to comment here or to reach me directly.

Performance benchmarks

It is time for me to give back.

Dealing with performance benchmarks has occupied a fair share of my life since my early days in the computer world in the mid ’80s.

In the beginning it was mostly reading, with just a bit of writing (that today I would be ashamed of) in one of the early Italian BBS “newspapers”, “Corriere Telematico”.

At the time I could never have imagined that benchmarks would play such a large role in my career, to the point that for about 8 years they even defined my job title.

Now, as my transition into a new role is almost complete, it feels like the right time to write something about benchmarks that can help many people in the industry.


I recall reading in one of the paper magazines of my early days something along the lines of “benchmarks don’t lie, but liars do use benchmarks”. I believe it was in MCmicrocomputer, but I can’t bet on it.

This bleak statement about benchmarks was true 30+ years ago and it’s still true now, but we should not throw the good away together with the bad: proper benchmarks were and still are useful tools for individuals and organizations alike. It’s all about defining “proper” correctly in each context.

For a while, given the scarcity of published material on the subject, I was thinking of putting together a book, with the help of a friend of mine.

I fear I will not be able to put in all the time needed to complete it in a reasonable time frame, and for this reason I decided to blog on the subject instead.

In the coming weeks (or months, I don’t know yet how this will work) I will share what I have learned over many years, as a source for anyone wanting to get closer to the holy grail of the “proper benchmark”.

I will be vendor and technology neutral, covering both the business and the technical sides.

Your feedback will be key in shaping how this will move forward.

In the next post I’ll share the target table of contents for this series of posts.