I recently wrote a post about the concept of the Data Lakehouse, which in some ways picks up parts of my rant about databases and what I want to see in a new database system. In this post, I will go over some recent big data developments and try to describe what you should be aware of.

Let’s start with the lowest level of the big data stack, which in many cases is the Apache Spark processing engine, as it powers many big data components. The engine itself is clearly not new, but an interesting feature has been added in Spark 3.0: Adaptive Query Execution (AQE). This feature allows Spark to optimize and adjust query plans based on runtime statistics collected while the query is running. Make sure to turn it on for Spark SQL (spark.sql.adaptive.enabled), as it is off by default.
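If you want to try AQE, a minimal PySpark sketch looks like the following; the config key spark.sql.adaptive.enabled is the real switch, while the data path and query are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

# Minimal sketch: enabling Adaptive Query Execution (AQE) in Spark 3.0+.
spark = (
    SparkSession.builder
    .appName("aqe-demo")
    .config("spark.sql.adaptive.enabled", "true")                     # turn AQE on (off by default in 3.0)
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  # let AQE coalesce shuffle partitions
    .getOrCreate()
)

# AQE re-optimizes the physical plan at runtime, e.g. picking a broadcast
# join once it sees the actual size of the shuffle output.
df = spark.read.parquet("/data/events")  # hypothetical path
df.groupBy("user_id").count().show()
```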

The next component of interest is Apache Kudu. You are probably familiar with Parquet. Unfortunately, Parquet has some significant drawbacks, such as its inherently batch-oriented approach (you have to commit written data before it is available for reading), especially when it comes to real-time applications. Kudu’s on-disk data format is similar to Parquet, with a few differences to support efficient random access as well as updates. It is also worth noting that Kudu cannot use cloud object storage: it runs on ext4 or XFS and relies on a consensus algorithm (Raft), neither of which cloud object stores support.
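For completeness, here is a hedged sketch of reading a Kudu table from Spark via the kudu-spark integration; it assumes the kudu-spark jar is on the classpath, and the master address and table name are placeholders:

```python
from pyspark.sql import SparkSession

# Hedged sketch: reading a Kudu table through the kudu-spark datasource.
# The short format name "kudu" requires a recent kudu-spark jar on the
# classpath; master address and table name below are hypothetical.
spark = SparkSession.builder.appName("kudu-demo").getOrCreate()

df = (
    spark.read.format("kudu")
    .option("kudu.master", "kudu-master:7051")        # hypothetical master address
    .option("kudu.table", "impala::default.events")   # hypothetical table name
    .load()
)
df.show()
```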

At the same level of the stack as Kudu and Parquet, we have to mention Apache Hudi. Like Kudu, it brings stream processing to big data by providing fresh, up-to-date data. Like Kudu, it allows updates and deletes. Unlike Kudu, however, Hudi does not provide a storage layer, so you usually want to use Parquet as its storage format. That’s probably one of the main differences: Kudu tries to be a storage layer for OLTP, while Hudi is strictly OLAP. Another powerful feature of Hudi is that it exposes a ‘change stream’, which enables incremental pulls. In addition, it supports three types of queries (see the sketch after this list):

  • Snapshot queries: Queries see the latest snapshot of the table as of a given commit or compaction action. Here the concepts of ‘copy on write’ and ‘merge on read’ become important; the latter is useful for near-real-time queries.
  • Incremental queries: Queries only see new data written to the table since a given commit/compaction.
  • Read-optimized queries: Queries see the latest snapshot of the table as of a given commit/compaction action. This is mostly used for high-speed querying.
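Here is a hedged sketch of a Hudi upsert followed by an incremental pull, using Hudi’s Spark datasource options; it assumes the hudi-spark bundle is on the classpath, and the table name, fields, paths, and commit time are placeholders:

```python
from pyspark.sql import SparkSession

# Hedged sketch: upsert into a Hudi table, then read its change stream
# incrementally. Requires the hudi-spark bundle; names are hypothetical.
spark = SparkSession.builder.appName("hudi-demo").getOrCreate()
base_path = "/tmp/hudi/events"

updates = spark.createDataFrame(
    [("u1", "click", 1617000000)], ["uuid", "event", "ts"]
)
(updates.write.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.recordkey.field", "uuid")      # primary key
    .option("hoodie.datasource.write.partitionpath.field", "event") # partitioning
    .option("hoodie.datasource.write.precombine.field", "ts")       # dedup by latest ts
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save(base_path))

# Incremental query: only records written after the given commit time.
incremental = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20210101000000")
    .load(base_path))
incremental.show()
```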

The Hudi documentation is a great place to get more details. And here’s a diagram I borrowed from XenonStack:

[Diagram: What is Apache Hudi]

Then what about Apache Iceberg and Delta Lake? These table formats are yet another way to organize your data. They can be backed by Parquet, and each differs slightly in its specific use cases and in how it handles data changes. Like Hudi, they can be used with Spark as well as Presto or Hive. For a more detailed discussion of the differences, take a look here, and this blog shows you how to use Hudi and Delta Lake.
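As a flavor of the API, here is a minimal Delta Lake sketch in PySpark; it assumes the delta-core jar is available, and the path is a placeholder:

```python
from pyspark.sql import SparkSession

# Minimal sketch: writing and reading a Delta Lake table.
# Requires the delta-core jar; the path below is a placeholder.
spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.range(5).withColumnRenamed("id", "user_id")
# A Delta table is Parquet files plus a transaction log.
df.write.format("delta").mode("overwrite").save("/tmp/delta/users")

spark.read.format("delta").load("/tmp/delta/users").show()
```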

Enough about table and storage formats. I’m more interested in the query layer, which is what matters when you need to deal with large amounts of data.

The project to watch here is Apache Calcite, which is a ‘data management framework’, or what I would call a SQL engine. It is not a complete database, mainly because it omits the storage layer, but it supports multiple storage engines. Another cool feature is its support for streaming and graph SQL. Usually you don’t have to bother with this project directly, as it is built into a number of existing engines like Hive, Drill, Solr, etc.

As a quick summary, and as a slightly different way of seeing why all the projects mentioned so far came into existence, it helps to look at the data pipeline challenge from another perspective. Remember the days when we deployed Lambda architectures? You had two different data paths: one for real-time and one for batch ingest. Apache Pinot can help unify the two paths. At LinkedIn, rather than having developers rewrite their pipelines, they let developers write the batch layer, use Calcite to automatically generate the equivalent real-time processing code, and use Apache Pinot to merge the real-time and batch outputs. (Source: LinkedIn Engineering)
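If you want to poke at Pinot yourself, a minimal sketch using the pinotdb Python client might look like the following; the broker host/port and the events table are hypothetical:

```python
from pinotdb import connect  # pip install pinotdb

# Hedged sketch: querying Pinot over its broker SQL endpoint.
# Host, port, and table are hypothetical placeholders.
conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
curs = conn.cursor()
curs.execute("SELECT user_id, COUNT(*) FROM events GROUP BY user_id LIMIT 10")
for row in curs:
    print(row)
```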

The great thing is that there is a Presto connector for Pinot, which allows you to stay in your favorite query engine. Side note: don’t worry too much about Apache Samza here. It is yet another distributed processing engine like Flink or Spark.
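And if you would rather stay in Presto, here is a hedged sketch with the presto-python-client; it assumes the Pinot connector is configured under a catalog named pinot, and the host, user, and table are placeholders:

```python
import prestodb  # pip install presto-python-client

# Hedged sketch: querying a Pinot table through Presto's Pinot connector.
# Assumes a catalog named "pinot"; host, user, and table are placeholders.
conn = prestodb.dbapi.connect(
    host="presto-coordinator", port=8080, user="demo",
    catalog="pinot", schema="default",
)
cur = conn.cursor()
cur.execute("SELECT user_id, COUNT(*) AS views FROM events GROUP BY user_id LIMIT 10")
print(cur.fetchall())
```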

OK, enough geeking out. I’m sure your head hurts as much as mine does, trying to keep track of all these projects and how they fit together. Maybe another interesting lens is what AWS is doing around databases. To begin with, there is PartiQL. In short, it is a SQL-compatible query language that enables you to query data regardless of where or in what format it is stored: structured, unstructured, columnar, row-based, you name it. You can use PartiQL with DynamoDB or from the project’s REPL, and AWS Glue Elastic Views also supports PartiQL at this point.
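As an illustration, here is a minimal sketch of running a PartiQL statement against DynamoDB with boto3; the table name, key attribute, and region are hypothetical:

```python
import boto3

# Minimal sketch: a PartiQL statement against DynamoDB via boto3.
# Table name, attribute, and region are hypothetical; AWS credentials required.
dynamodb = boto3.client("dynamodb", region_name="us-east-1")

resp = dynamodb.execute_statement(
    Statement="SELECT * FROM Users WHERE user_id = ?",
    Parameters=[{"S": "42"}],  # bound to the ? placeholder
)
for item in resp["Items"]:
    print(item)
```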

As I see it, building a general-purpose data store that just does the right thing, i.e. one that is fast, has strong data integrity properties, and so on, is a hard problem. Hence all these specialized data stores (search, graph, columnar, row) and processing and storage projects (from Hudi to Parquet and Impala back to Presto and CSV files). But at the end of the day, I just want a database that does all of these things for me. I don’t want to learn about all these projects and nuances. Just give me a system that lets me dump my data into it and answers my SQL queries quickly (real-time and batch) …

This story originally appeared on Rafi.ch. Copyright 2021.
