Google Cloud’s attempt to standardize it

Query federation was just a first step. On April 6, Google Cloud announced a preview of BigLake. The service is described as a way to unify data lakes and data warehouses across multiple clouds, with GCP becoming the central point of control for accessing and securing these environments.

“BigLake brings the decade of experience we have with BigQuery to data lakes,” said Gerrit Kazmaier, Vice President and General Manager of Databases, Data Analytics, and Looker at Google Cloud, during a press conference. “It allows you to combine performance, governance, access control, and security with open file formats,” he claimed.

Sudhir Hasbe, Senior Director of Product Management at Google Cloud, frames the problem. “Historically, data has been stored in different storage systems, some in data warehouses, some in data lakes, and these offer different capabilities and create data silos within organizations,” he says.

According to the executive, these silos built on different technologies do not benefit from the same level of governance. A data warehouse can provide fine-grained access control and thus consistent governance, but a data lake, which holds much larger volumes of data, does not necessarily offer this mechanism.

As organizations become more aware of the governance policies they need to deploy, “we have to move forward and bring consistency across these different platforms and strategies,” says Sudhir Hasbe.

BigLake: a “unified” storage engine for open source formats

BigLake capabilities as touted by GCP.

So what exactly does BigLake do? The product is, in effect, a “unified” storage engine adjacent to BigQuery, meant to simplify access to and management of tables in open formats across multiple cloud services. The data must reside in the object storage services of the three cloud giants, namely Google Cloud Storage, Amazon S3, and Azure Data Lake Storage Gen2. GCP’s promise to customers is that they can take advantage of their existing cloud infrastructure.

However, to achieve the promised level of governance, BigLake introduces a new type of table. It is still possible to use “external tables”, which only require that the metadata and schemas of these assets be stored in BigQuery, but GCP does not guarantee the governance and consistency of the linked data. To compensate, it makes it easy to convert external tables into BigLake tables. The mechanism is similar to the “governed tables” introduced by AWS in Lake Formation.
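As a rough illustration only, a BigLake table is declared much like an external table, but bound to a cloud resource connection. The sketch below uses the google-cloud-bigquery Python client to run such a DDL statement; the project, dataset, connection, and bucket names are all hypothetical, and the exact syntax may vary by release.

```python
from google.cloud import bigquery

# Hypothetical project; credentials are taken from the environment.
client = bigquery.Client(project="my-project")

# Registers Parquet files in Cloud Storage as a BigLake table, using a
# (hypothetical) cloud resource connection for delegated access.
ddl = """
CREATE EXTERNAL TABLE `my-project.analytics.biglake_orders`
WITH CONNECTION `my-project.us.gcs-connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/orders/*.parquet']
)
"""

client.query(ddl).result()
```

Without the WITH CONNECTION clause, the statement would define a plain external table, with the weaker governance guarantees mentioned above.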

In practice, the vendor ties the creation of BigLake tables to the configuration of access rights in Google IAM. Three roles are involved: a data lake administrator, who manages IAM rules on Cloud Storage buckets and objects; a data warehouse administrator, who creates, deletes, and updates BigLake tables (the equivalent of the “BigQuery Admin” role); and a data analyst, who can read and query the data under certain conditions. Row- and column-level access control is driven by tags set from the BigLake table’s schema editor, and the access rules are enforced by the BigQuery APIs. For customers who want consistent control over data across data lakes, data warehouses, and data marts, GCP will integrate Dataplex, its unified data management (and data mesh) service, with BigLake.
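As a sketch of what a row-level rule might look like, and assuming BigLake tables accept the same row access policy DDL as native BigQuery tables, the snippet below restricts a hypothetical analysts group to a subset of rows; the table, policy, and group names are invented for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Only members of the (hypothetical) group see rows where country = 'US'.
# Column-level rules would instead rely on tags attached to the table schema.
row_policy = """
CREATE ROW ACCESS POLICY us_rows_only
ON `my-project.analytics.biglake_orders`
GRANT TO ('group:us-analysts@example.com')
FILTER USING (country = 'US')
"""

client.query(row_policy).result()
```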

A BigLake table behaves like its BigQuery counterparts and is subject to the same limits, but different APIs are available to work with it. The BigQuery Storage Read API, based on the gRPC protocol, makes it possible to read BigLake tables in JSON, CSV, Avro, and ORC formats from open source processing engines such as Apache Spark. There are also connectors for the Spark, Hive, and Trino engines hosted on Dataproc VMs or in containers, for processing data stored in Google Cloud Storage. GCP has even worked on the data transfer layer toward these open source analytics engines: it relies on Apache Arrow to speed up the transfer of (large) batches of data.
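A minimal PySpark sketch of that path might look as follows, assuming a Dataproc-style cluster where the spark-bigquery-connector is already on the classpath, and reusing the hypothetical table from the earlier example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("biglake-read").getOrCreate()

# The connector reads through the BigQuery Storage Read API and only
# transfers the columns the job actually uses.
orders = (
    spark.read.format("bigquery")
    .option("table", "my-project.analytics.biglake_orders")
    .load()
)

orders.groupBy("country").count().show()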

Note that the Avro and ORC formats are not yet supported on Google Cloud Storage. GCP promises to support the Delta Lake table format (based on Parquet), then later Apache Iceberg and Apache Hudi.

If the data won’t come to Google Cloud, Google Cloud will go to the data

By default, in Amazon S3 and Azure Data Lake Storage Gen2, external tables can be read through the API of BigQuery Omni, the distributed, multicloud version of BigQuery. GCP has also made BigLake tables compatible with this service, which is where the conversion mechanism proves especially useful.

For data processing, GCP deploys and manages the BigQuery control plane on GCP. This control plane drives data planes running on AWS or Microsoft Azure instances, close to the data stored in S3 or Azure Blob Storage; these data planes run the BigQuery query engine, then store the query results in the user’s object storage service or send them back to the control plane on GCP. The user only has to set up the external connections and write the queries. BigQuery Omni is entirely managed by GCP, and the customer does not pay egress fees to the third-party providers.
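To make the workflow concrete, the hedged sketch below declares a table over Amazon S3 through a BigQuery Omni connection and then queries it like any other BigQuery table. All project, dataset, connection, and bucket names are hypothetical, and the dataset is assumed to live in the same Omni-handled AWS region as the connection.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Declare a BigLake table over Parquet files sitting in S3, via a
# (hypothetical) connection located in the aws-us-east-1 region.
ddl = """
CREATE EXTERNAL TABLE `my-project.aws_dataset.s3_orders`
WITH CONNECTION `aws-us-east-1.s3-connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['s3://my-bucket/orders/*.parquet']
)
"""
client.query(ddl).result()

# The same SQL surface then applies: the query executes on the data plane
# in AWS, and only the result set comes back to the client.
job = client.query(
    "SELECT country, COUNT(*) AS orders "
    "FROM `my-project.aws_dataset.s3_orders` GROUP BY country"
)
for row in job.result():
    print(row.country, row.orders)
```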

“BigQuery Omni is a big differentiator, because we don’t ask you to bear heavy ETL costs,” says Gerrit Kazmaier.

“We would love to see more data generated on BigQuery, but we know our customers have data spread across multiple data lakes in multiple clouds, including AWS and Azure.”

Sudhir Hasbe, Senior Director of Product Management, Google Cloud

So adds Sudhir Hasbe, who insists that GCP believes in bringing the computation closer to the data rather than moving the data. “We accept the fact that the files are in different places, and we go to the data rather than gathering it in one place.”

BigQuery Omni has been publicly available since December 2021. It is probably too early to verify whether deploying the solution, given its pricing model, is more advantageous than multiplying ETL pipelines, once data egress costs are factored in.

Data Cloud Alliance: Commitments, no roadmap yet

In any case, BigLake should limit, if not prevent, data movement and the duplication of copies. This unification of uses is touted on the one hand by Snowflake, for its multicloud platform, and on the other by Databricks, which was the first to bet on the somewhat marketing-driven term “lakehouse”, a combination of a data lake and a data warehouse (and which is less convinced by the multicloud principle). “I think the biggest difference is that we believe in an open data architecture,” says Gerrit Kazmaier, to distinguish GCP’s approach from that of players like Snowflake. “With BigLake, we don’t ask customers to compromise between open and proprietary storage, or between open source and proprietary processing engines.” For example, GCP expects customers using the solution to analyze data from various sources, such as SaaS applications (Salesforce, Workday, or Marketo), and to visualize it with Looker, Power BI, or Tableau.

“Customers don’t want to be locked in to any vendor, including us.”

Gerrit Kazmaier, Vice President and General Manager of Databases, Data Analytics and Looker, Google Cloud

As for Databricks, it is a partner that shares “the same philosophy” as GCP, Sudhir Hasbe points out. “We are working with Databricks, whose Spark engine is integrated with BigQuery, and we will continue to work with this company to solve customer problems together, in a manner compatible with open source formats.”

Along these lines, Google announced the creation of the Data Cloud Alliance. Databricks is one of the members of this group, along with Starburst, MongoDB, Elastic, Fivetran, Neo4j, Redis, Dataiku, Accenture, and Deloitte. These partners “commit” to accelerating the adoption of open data models and standards, reducing the complexity of governance, compliance, and security, and strengthening the training of talent and practitioners in these areas.

“Customers don’t want to be locked in to any vendor, including us,” admits Gerrit Kazmaier. “It’s about bringing the best of everyone together and solving the problems of our mutual customers,” he adds. That does not make the initiative any more concrete for the moment. Officials promise that this “founding moment” will be followed by announcements, but no launch timetable has been revealed. For its part, Databricks cites its contribution to this initiative as a way to improve data sharing, which is currently one of its priorities.
