Snowflake expands key features of cloud data warehouse

In 2020, Snowflake announced an ambitious roadmap. Two years later, the multicloud data warehouse vendor has delivered almost all of the features on that list. At Snowflake Summit 2022, it laid out plans covering at least two more years.

First, there are the performance improvements, one of the vendor's original promises. On AWS, Snowflake has introduced new instances based on Graviton3 processors. Performance is said to be roughly 10% better than on the Intel-based instances commonly deployed on that cloud. The vendor has also improved storage compression by another 30%. Data latency was reduced by 50%, while replication response time dropped by 55%.

Next, Snowflake needs to deliver better data consistency while maintaining a high level of performance across the majority of data formats, including those that do not come from the relational world.

Apache Iceberg: Better consistency for analytics

In response to this demand, Snowflake plans to support the Apache Iceberg table format, which will soon enter private preview. This open source technology competes with Delta Lake, developed by Databricks and already supported by Snowflake's external tables, and should facilitate analytical processing on large volumes of data, at petabyte scale.

First, Apache Iceberg acts as a guarantor of data consistency. It allows multiple applications to work on the same data while tracking changes to the files in a table, which avoids the corruption problems that plague many data stores. Designed as an alternative to Apache Hive, the open source project promises better performance, schema evolution, time travel, and ACID transaction support for tables stored in customers' buckets.
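To make the snapshot mechanism behind these guarantees more concrete, here is a minimal PySpark sketch that lists an Iceberg table's snapshot history and reads the table as it existed at an earlier snapshot. It assumes the iceberg-spark-runtime package is available to Spark; the catalog name, warehouse path, table, and snapshot ID are hypothetical.

```python
from pyspark.sql import SparkSession

# Minimal sketch: the catalog ("lake"), warehouse path and table are hypothetical,
# and the iceberg-spark-runtime JAR is assumed to be on Spark's classpath.
spark = (
    SparkSession.builder
    .appName("iceberg-snapshot-sketch")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Every commit produces a new snapshot; concurrent readers always see a consistent one.
spark.sql("SELECT snapshot_id, committed_at FROM lake.sales.orders.snapshots").show()

# Time travel: read the table as it existed at a given (hypothetical) snapshot.
old_state = (
    spark.read
    .option("snapshot-id", 4358109269898522000)
    .table("lake.sales.orders")
)
old_state.show()
```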

Snowflake already offers external tables in Iceberg format, whether for migration and ingestion from cloud systems or for processing data that cannot be moved into the cloud data warehouse. In addition, the vendor announced a private preview of external tables stored on on-premises systems, starting with those of Dell and Pure Storage.

With native Apache Iceberg support, company spokespeople promise that all of the platform's features (governance, encryption, replication, compression, etc.) will be compatible with this kind of table. Here, Snowflake's engineers chose to pair the Parquet data format with Iceberg's metadata and metadata catalog. The vendor did not say whether its Iceberg tables will support ORC, Avro, JSON, or other formats. For the record, the table format itself is agnostic about the data format it encapsulates. Above all, Iceberg is compatible with a variety of data processing engines, including Dremio, Trino, Flink, and Apache Spark.
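As an illustration of what a Snowflake-managed Iceberg table might look like from the Python connector, here is a hedged sketch. The feature was only entering private preview at the time, so the exact DDL (the external volume, catalog, and base location options shown here) may differ; the account, credentials, and object names are placeholders.

```python
import snowflake.connector

# Placeholder credentials: replace with a real account, user and warehouse.
conn = snowflake.connector.connect(
    account="myorg-myaccount", user="data_engineer", password="***", warehouse="etl_wh"
)
cur = conn.cursor()

# Hypothetical DDL: an Iceberg table whose Parquet data files and Iceberg metadata
# live in the customer's bucket (referenced through an external volume), while the
# platform's features (governance, encryption, replication, ...) still apply.
cur.execute("""
    CREATE ICEBERG TABLE analytics.public.orders (
        order_id   NUMBER,
        order_date DATE,
        amount     NUMBER(10, 2)
    )
    EXTERNAL_VOLUME = 'lake_volume'
    CATALOG = 'SNOWFLAKE'
    BASE_LOCATION = 'orders/'
""")
```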

Given the capabilities listed above, Apache Iceberg is fertile ground for data mesh deployments, a path the vendor intends to explore.

Most importantly, Iceberg makes it possible to avoid relying solely on Snowflake's proprietary tables. “Some of our customers have told us they want a certain amount of their data to be readable in open file formats,” said Christian Kleinerman, senior vice president of product at Snowflake. “It gives us a form of interoperability. This is very important for us,” he added.

Unistore and Hybrid Tables: translytical processing according to Snowflake

Like MongoDB, or Google with AlloyDB, Snowflake now wants to support both analytical and transactional data processing.

To do this, the vendor is offering a private preview of Unistore, a feature based on “hybrid tables”, which are essentially the vehicle for HTAP (Hybrid Transactional/Analytical Processing) capabilities. Concretely, Unistore is a row-oriented engine that can host transaction processing on hybrid tables. It also makes it possible to run analytical processing on that transactional data. Above all, the engine allows at least one primary key and one foreign key to be defined, which helps reduce duplicate entries. Once the key system is in place, the constraint mechanism returns an error if the data is already present in the data warehouse. In principle, this should also streamline data ingestion and migration by avoiding unwanted copies.
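A minimal sketch of the hybrid-table idea, written with the Python connector, may help. The DDL below assumes syntax close to what Snowflake has described for hybrid tables and could differ from the preview; connection details and table names are hypothetical.

```python
import snowflake.connector
from snowflake.connector import errors

conn = snowflake.connector.connect(
    account="myorg-myaccount", user="app_user", password="***",
    warehouse="txn_wh", database="shop", schema="public",
)
cur = conn.cursor()

# Hybrid tables: row-oriented storage with enforced primary and foreign keys.
cur.execute("""
    CREATE HYBRID TABLE customers (
        customer_id INT PRIMARY KEY,
        name        STRING
    )
""")
cur.execute("""
    CREATE HYBRID TABLE orders (
        order_id    INT PRIMARY KEY,
        customer_id INT,
        amount      NUMBER(10, 2),
        FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
    )
""")

cur.execute("INSERT INTO customers VALUES (42, 'Acme')")
cur.execute("INSERT INTO orders VALUES (1, 42, 19.99)")
try:
    # Same primary key again: the constraint mechanism rejects the duplicate.
    cur.execute("INSERT INTO orders VALUES (1, 42, 19.99)")
except errors.Error as exc:
    print("Duplicate rejected:", exc)
```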

“We have many customers, including Novartis, UiPath, IQVIA, Crane and Adobe, who have tested this feature. The feedback is fairly positive,” confirms Christian Kleinerman.

Stronger data consistency, a hybrid approach to processing… Snowflake seems to have the cards in hand to offer its customers a true multicloud data lake capable of supporting the majority of workloads.

However, some patience is still required. The vendor has only supported unstructured data in general availability since April. Rather than parsing specific formats or data types, it handles files stored in object stores (Azure Blob Storage, Amazon S3, and Google Cloud Storage) and referenced by URLs.
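As an example of this file-oriented approach, the following sketch lists the files of an external stage through a directory table and builds scoped URLs for them. The bucket, storage integration, and credentials are placeholders.

```python
import snowflake.connector

cur = snowflake.connector.connect(
    account="myorg-myaccount", user="analyst", password="***",
    warehouse="wh", database="docs", schema="public",
).cursor()

# Hypothetical external stage over an S3 bucket, with a directory table enabled
# so that the files it contains can be listed and addressed by URL.
cur.execute("""
    CREATE OR REPLACE STAGE contracts_stage
      URL = 's3://my-bucket/contracts/'
      STORAGE_INTEGRATION = s3_int
      DIRECTORY = (ENABLE = TRUE)
""")
cur.execute("ALTER STAGE contracts_stage REFRESH")

# List the files and build scoped URLs that downstream tools can use to fetch them.
cur.execute("""
    SELECT relative_path,
           BUILD_SCOPED_FILE_URL(@contracts_stage, relative_path) AS file_url
    FROM DIRECTORY(@contracts_stage)
""")
for path, url in cur.fetchall():
    print(path, url)
```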

Better governance and cost management tools

While some customers are waiting for this type of functionality, others are more concerned with data governance and cost management capabilities. The user interface dedicated to data governance will enter private preview “soon”, as will column-level lineage. As promised by the vendor, tag-based data masking will soon be available in public preview, and data classification is due to reach general availability “soon”.
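To give an idea of what tag-based masking looks like in practice, here is a short sketch using the Python connector; the policy, tag, and table names are illustrative, and the feature was still in preview at the time.

```python
import snowflake.connector

cur = snowflake.connector.connect(
    account="myorg-myaccount", user="governance_admin", password="***",
    warehouse="admin_wh", database="crm", schema="public",
).cursor()

# A masking policy that only reveals the value to a privileged role.
cur.execute("""
    CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
        CASE WHEN CURRENT_ROLE() = 'PII_ADMIN' THEN val ELSE '*** masked ***' END
""")

# Tag-based masking: attach the policy to a tag, then tag the sensitive column.
cur.execute("CREATE TAG IF NOT EXISTS pii")
cur.execute("ALTER TAG pii SET MASKING POLICY email_mask")
cur.execute("ALTER TABLE customers MODIFY COLUMN email SET TAG pii = 'email'")
```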

On the FinOps side, Snowflake intends to introduce a feature called Resource Groups, which will let compute and storage resources be associated with tables or other data objects so that their cost can be monitored.

Some users have been waiting for a year for the replication and cloning features Snowflake promised. Client Redirect will enter general availability very soon. Through a secure connection URL, Client Redirect makes it possible to fail over to a different region of the same cloud, or to another cloud. Ideally, processing is interrupted only briefly when an outage brings an instance down.
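The sketch below illustrates the connection-object mechanism behind Client Redirect, with hypothetical organization and account names; availability of these commands may depend on edition.

```python
import snowflake.connector

# On the source (primary) account: create the connection object clients will target.
primary = snowflake.connector.connect(
    account="myorg-prod_eu", user="admin", password="***", role="ACCOUNTADMIN"
).cursor()
primary.execute("CREATE CONNECTION IF NOT EXISTS prod_conn")

# On the target account (another region or another cloud): replicate that connection.
secondary = snowflake.connector.connect(
    account="myorg-prod_us", user="admin", password="***", role="ACCOUNTADMIN"
).cursor()
secondary.execute("CREATE CONNECTION prod_conn AS REPLICA OF myorg.prod_eu.prod_conn")

# During an outage, promoting the replica redirects clients that connect through
# the stable URL <org>-<connection>.snowflakecomputing.com to the surviving account.
secondary.execute("ALTER CONNECTION prod_conn PRIMARY")
```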

In the same vein, account replication, soon in public preview, should extend the same mechanism to entire accounts. “With this feature, you’re not just copying the data; you’re replicating all kinds of metadata about the account: users, roles, warehouses, and all the metadata that surrounds the data,” says Christian Kleinerman.
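Assuming the account-level replication described here is configured through a grouping of account objects, as in the failover mechanism Snowflake exposes, a sketch could look like the following; the group, database, and account names are hypothetical.

```python
import snowflake.connector

cur = snowflake.connector.connect(
    account="myorg-prod_eu", user="admin", password="***", role="ACCOUNTADMIN"
).cursor()

# Hypothetical failover group replicating account metadata (users, roles, warehouses)
# together with selected databases to another account of the same organization.
cur.execute("""
    CREATE FAILOVER GROUP prod_fg
        OBJECT_TYPES = USERS, ROLES, WAREHOUSES, RESOURCE MONITORS, DATABASES
        ALLOWED_DATABASES = sales, finance
        ALLOWED_ACCOUNTS = myorg.prod_us
        REPLICATION_SCHEDULE = '10 MINUTE'
""")
```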

The vendor has also made pipeline replication available in private preview. This last option will be particularly useful once Snowflake releases its streaming pipelines. Indeed, the vendor announced the private preview of Snowpipe Streaming. This system should make it possible to perform micro-batch data ingestion from serverless environments. As part of this, Snowflake has reworked its Kafka Connect connector to improve its ingestion capabilities.

Along the same lines, the vendor has a packaged pipeline system under development. It should make it possible to build materialized views declaratively and keep them up to date through incremental refreshes.
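No DDL for this packaged pipeline system had been published at the time of the announcement. Purely as an illustration of the declarative, incrementally refreshed approach, the sketch below borrows the dynamic-table style syntax Snowflake later shipped for this kind of capability; the names and lag value are arbitrary.

```python
import snowflake.connector

cur = snowflake.connector.connect(
    account="myorg-myaccount", user="data_engineer", password="***",
    warehouse="etl_wh", database="analytics", schema="public",
).cursor()

# Declarative pipeline: describe the end state as a query, set a freshness target,
# and let the platform handle the incremental refreshes.
cur.execute("""
    CREATE OR REPLACE DYNAMIC TABLE daily_revenue
        TARGET_LAG = '15 minutes'
        WAREHOUSE  = etl_wh
    AS
        SELECT order_date, SUM(amount) AS revenue
        FROM raw.orders
        GROUP BY order_date
""")
```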
