Data Warehouse Deep-Dive Match-up: BigQuery, Snowflake and Redshift
Data warehouses are nothing new. They’ve been in use for decades, but cloud data warehouses are a relatively new paradigm. As the world generates more and more data, new solutions are required to meet the demands of organizations that need to store, query, and analyze unprecedented amounts of data.
In recent months, the cost of cloud computing has come under scrutiny as companies tighten their belts and look to cut costs. Still, for most companies, particularly small ones, the costs associated with cloud data warehouses are eclipsed by the cost of building and maintaining in-house deployments.
Processing and Storage
Two key features of modern data warehouse solutions are massively parallel processing (MPP) and columnar data storage. MPP is an architecture that provides high-performance queries on large data volumes. MPP architectures consist of many servers running in parallel to distribute processing and input/output (I/O) loads.
Columnar data stores are the most common form of storage for data warehouses - they are the most flexible and economical option for analytics. Data is stored and processed by column instead of by row, so aggregate queries (common in reporting) read only the columns they reference rather than every full row, which makes them perform much faster.
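For example, a columnar engine can satisfy a reporting query like the minimal sketch below (the table and column names are hypothetical) by scanning just two columns on disk:

    -- Hypothetical reporting query: a columnar store reads only the
    -- order_date and amount columns, skipping every other column.
    SELECT order_date, SUM(amount) AS daily_revenue
    FROM sales
    GROUP BY order_date;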
Let’s look at three major data warehouse providers, how they work and what they do.
Amazon Redshift
Redshift is the granddaddy of cloud data warehouses. It is a fully managed data warehouse service that can scale from a few gigabytes to petabytes of data.
The core component of Redshift is a cluster. Clusters are composed of one or more compute nodes, and when more than one node is present, a leader node coordinates the compute nodes and external communication.
The leader node also distributes SQL statements to the compute nodes when a query references tables stored on those nodes; otherwise, the leader node processes the query by itself. The leader node compiles code for individual elements of the execution plan, and the compute nodes execute the compiled code and send intermediate results back, which the leader node then aggregates. Every compute node is partitioned into slices, and each slice is allocated a portion of the node’s memory and disk space. The number of slices per node depends on the node size.
Redshift uses a SQL dialect based on PostgreSQL, but there are some important differences that users accustomed to Postgres will want to fully understand before adopting the product - table DDL is one visible example, as the sketch below shows.
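In Redshift, a table definition can declare how rows are distributed across slices and how they are sorted on disk - clauses plain Postgres does not have. A minimal sketch, with hypothetical table and column names:

    -- Hypothetical Redshift DDL: DISTKEY controls which slice a row
    -- lands on; SORTKEY sets the on-disk sort order, letting queries
    -- skip blocks. Neither clause exists in vanilla PostgreSQL.
    CREATE TABLE page_views (
        view_id   BIGINT,
        user_id   BIGINT,
        viewed_at TIMESTAMP
    )
    DISTSTYLE KEY
    DISTKEY (user_id)
    SORTKEY (viewed_at);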
Google BigQuery
BigQuery is another fully managed, serverless data warehouse that can autoscale to match storage and computing needs. BQ provides columnar storage and ANSI SQL, and it can analyze terabytes to petabytes of data with great performance. BQ also allows for geospatial data analysis using BQ GIS. BQ makes a great warehouse for machine learning tasks, as you can quickly build and operationalize ML models on large-scale structured or semi-structured data using SQL alone. BigQuery also supports a BI Engine for interactive dashboarding.
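As an illustration of SQL-driven ML, BigQuery ML lets you train and score a model with ordinary statements. A minimal sketch - the project, dataset, table, and column names are all hypothetical:

    -- Hypothetical BigQuery ML example: train a logistic regression
    -- classifier directly over warehouse data, then score new rows.
    CREATE OR REPLACE MODEL `my_project.my_dataset.churn_model`
    OPTIONS (model_type = 'logistic_reg',
             input_label_cols = ['churned']) AS
    SELECT plan_type, monthly_spend, support_tickets, churned
    FROM `my_project.my_dataset.customers`;

    SELECT *
    FROM ML.PREDICT(MODEL `my_project.my_dataset.churn_model`,
                    (SELECT plan_type, monthly_spend, support_tickets
                     FROM `my_project.my_dataset.new_customers`));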
BigQuery utilizes an array of proprietary internal tools developed at Google, including Borg, Colossus, Jupiter and Dremel. Dremel is the execution engine used by BQ. Dremel transforms SQL queries into execution trees, where each leaf of the tree is called a slot. Slots do the heavy lifting of reading the data from Colossus, and doing any necessary computation.
Colossus is Google’s latest-generation distributed file system. Each Google datacenter has its own Colossus cluster, and each cluster can provide users with thousands of disks at a time. Colossus handles replication, recovery, and distributed management. It’s fast enough to provide performance akin to in-memory databases, while being cheaper and highly parallelized.
The branches of the execution tree are called mixers, which perform the aggregation. In between sits the shuffle tier, which takes advantage of Google’s Jupiter network to move data quickly from one place to another. Jupiter is the network BQ runs on - Google has reported that it can deliver a petabit per second of total bisection bandwidth, enough to easily handle even very large tasks, and considers it the biggest differentiator for the platform.
Borg handles computation. Borg is a cluster management system that can coordinate thousands of CPU cores for user tasks. Machines may crash, power may fail, networks may die, but Borg works around these issues.
Fun fact: Kubernetes was born out of Google’s internal orchestration tool, Borg.
As for storage, BigQuery uses a columnar storage format called Capacitor, which allows for highly efficient compression and encoding of data. Capacitor applies encodings such as run-length and dictionary encoding, and it can reorder rows within a partition so those encodings compress better, trading extra work at write time for smaller files and faster scans at query time.
Snowflake
One of the newest and most popular cloud data warehouse solutions is Snowflake. Snowflake is a fully managed MPP cloud data warehouse that runs on AWS, GCP, and Azure - making it the only multi-cloud technology on this list.
The platform supports ACID-compliant relational processing and has native support for semi-structured document formats such as JSON, Avro, ORC (Optimized Row Columnar), Parquet, and XML. Because of this support for semi-structured data, Snowflake can be a great choice for machine learning workloads. Snowflake’s VARIANT data type lets users store semi-structured records in their native form inside a relational table, which allows for easy loading of data into schema-less tables whose properties Snowflake discovers automatically - as the sketch below shows.
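A minimal sketch of the VARIANT type in action, with hypothetical table, column, and field names:

    -- Hypothetical Snowflake example: land raw JSON in a VARIANT
    -- column, then query nested fields with path syntax and casts.
    CREATE TABLE events (raw VARIANT);

    INSERT INTO events
    SELECT PARSE_JSON('{"user": {"id": 42}, "type": "click"}');

    SELECT raw:user.id::INT  AS user_id,
           raw:type::STRING  AS event_type
    FROM events;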
Snowflake’s secret sauce is its hybrid architecture. Snowflake combines the benefits of shared-disk architecture (where every cluster node has access to all data on shared storage) and shared-nothing architecture (which stores and processes portions of the data on different cluster nodes in parallel and independently).
Like shared-nothing, Snowflake uses MPP-based computing to process queries concurrently, with each node locally storing a portion of the data. Like shared-disk, there is also a centralized repository holding a single copy of the data that can be accessed from all independent compute nodes. As such, data management is as simple as in shared-disk, with the performance and scalability of shared-nothing.
Snowflake’s architecture consists of three layers - the database storage layer, the query processing layer, and the cloud services layer. Database storage handles tasks related to elastic storage of data coming from disparate sources. Snowflake supports ETL and ELT processes and loads data in bulk or batches (a bulk load is sketched below) - not ideal for real-time analytics. Users can upgrade to continuous data ingestion with Snowpipe, but it will cost you. Data is optimized and compressed into an internal columnar format and broken down into micro-partitions.
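A typical bulk load pulls staged files into a table with COPY INTO. A minimal sketch, assuming a hypothetical stage and table:

    -- Hypothetical Snowflake bulk load: copy staged CSV files into a
    -- table; Snowflake compresses the rows into micro-partitions.
    COPY INTO sales
    FROM @my_stage/sales/
    FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);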
The query processing (compute) layer provides the means of executing SQL statements. It relies on multiple independent compute clusters whose nodes process queries in parallel. These clusters are called virtual warehouses, and each one bundles compute resources such as CPU and memory - they are created and sized in SQL, as the sketch below shows.
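A minimal sketch of creating a virtual warehouse (the warehouse name is hypothetical):

    -- Hypothetical virtual warehouse: sized in T-shirt sizes, set to
    -- suspend when idle so it stops accruing compute charges.
    CREATE WAREHOUSE reporting_wh WITH
        WAREHOUSE_SIZE = 'XSMALL'
        AUTO_SUSPEND = 60      -- seconds of inactivity before suspending
        AUTO_RESUME = TRUE;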
The cloud services layer contains everything else Snowflake offers - authentication and access control, infrastructure management, metadata management, query optimization, and more.
Each solution has strengths and weaknesses you should consider before choosing one:
Snowflake
Strengths
Snowflake is a fully managed service, which means that it handles all of the infrastructure and maintenance required for running a data warehouse.
It has a unique "multi-cluster, shared data architecture" which allows for a high level of concurrency and performance.
It offers a wide range of data integration and data sharing options.
It supports semi-structured data and advanced analytics.
Multi-cloud support
Weaknesses
Relatively new player in the data warehousing space
Many useful features are extremely costly
May struggle to handle extremely large datasets
Low level of customization
BigQuery
Strengths
Handles even the largest datasets
Part of the Google Cloud ecosystem
First-class support for ML features
Good support for data mining
Best for varied workloads
Weaknesses
Not as customizable as other solutions
Charges based on the amount of data scanned per query, which can get expensive with large datasets
Redshift
Strengths
Highly customizable
Part of the AWS ecosystem
Built for big data warehousing
Weaknesses
Not as scalable as other solutions
Poor cost management for varied workloads
Requires more hands-on maintenance, such as vacuuming and compression
Conclusion
Today we’ve looked at three major players in the data warehouse space and discussed some of the under-the-hood mechanisms that drive their value propositions. Before choosing a data warehouse, thoroughly evaluate your expected usage patterns to discover what will be the most cost-effective and efficient use of your resources. There’s no one right answer, but one thing is for sure: you want to get it right the first time - migrations are hell.