Snowflake is a columnar, cloud-native data platform for data warehousing and big data analytics. It provides large-scale, high-performance compute for data in a multi-cluster, three-layer architecture consisting of storage, compute, and services.
The storage layer is built on the blob storage provided by the three main cloud service providers (Amazon, Azure, Google), whereas the compute and services layers are provided by the Snowflake Data Cloud.
The compute resources, also known as virtual warehouses, process large quantities of data quickly and efficiently in dedicated clusters, whereas the services layer coordinates workload activity such as data loading and query processing.
Snowflake Architecture
- Storage: Blob storage where data is divided into small partitions and stored in columnar format. Each partition is optimized, compressed, and stored using a shared-disk approach, so a single storage layer can be used by every virtual warehouse.
- Compute: An MPP compute cluster consisting of multiple nodes with CPU and memory is called a virtual warehouse. The nodes of one virtual warehouse are not connected to any other virtual warehouse, so multiple workloads can run in parallel.
- Service: Consists of all the operations that coordinate and manage Snowflake, such as security, query management and optimization, and configuration metadata.
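To make the storage bullet more concrete, here is a minimal toy sketch of splitting rows into small partitions and storing each partition column by column. It is an illustration only: the function names, the sample rows, and the partition size are hypothetical, and Snowflake's actual partition format and sizing are internal to the service.

```python
from typing import Any

Row = dict[str, Any]
Partition = dict[str, list[Any]]


def to_columnar_partitions(rows: list[Row], partition_size: int) -> list[Partition]:
    """Split rows into fixed-size partitions and store each one column by column."""
    if partition_size <= 0:
        raise ValueError("partition_size must be positive")
    partitions: list[Partition] = []
    for start in range(0, len(rows), partition_size):
        chunk = rows[start:start + partition_size]
        # Column names come from the first row; this toy model assumes a uniform schema.
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        partitions.append(columns)
    return partitions


def read_column(partitions: list[Partition], name: str) -> list[Any]:
    """Read a single column across all partitions without touching other columns."""
    return [value for part in partitions for value in part[name]]


# Hypothetical sample data.
rows = [
    {"id": 1, "region": "EU", "amount": 10.0},
    {"id": 2, "region": "US", "amount": 25.5},
    {"id": 3, "region": "EU", "amount": 7.25},
]

partitions = to_columnar_partitions(rows, partition_size=2)
amounts = read_column(partitions, "amount")
```

An analytical query that only needs `amount` reads just that column from each partition, which is the main appeal of a columnar layout.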
As a cloud-native platform for structured and semi-structured data, Snowflake provides an all-in-one enterprise data platform for operational data warehousing, big data, and AI/ML from a single data lake on an existing cloud provider. Its architecture is a hybrid of shared-disk, in which storage is accessible by all cluster nodes, and shared-nothing, in which each cluster node has its own disk storage and data can be partitioned and distributed across the cluster nodes.
The Snowflake data platform supports the following workloads:
- Data Warehouse: Traditional structured data that is transformed and processed using advanced analytical queries.
- Data Lake: Provides a landing zone for raw and semi-structured data in formats such as CSV, JSON, Avro, and Parquet.
- Data Pipeline: Automates continuous data loading from blob storage into staging tables.
- Data Exchange: Provides secure data collaboration among a select group of internal or external users by publishing source data to be discovered and consumed by members.
- Data Application: Provides a number of application interfaces for connecting to Snowflake, such as client APIs, connectors, and drivers.
- Data Science: Provides a data science platform for machine learning and artificial intelligence with Snowpark and Python integration.
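As a rough illustration of the Data Pipeline workload, the sketch below builds the SQL text for a pipe that copies staged files into a staging table. The helper function and all object names (`raw_events_pipe`, `staging.raw_events`, `azure_landing_stage`) are hypothetical, and the code only produces a string; it does not connect to Snowflake. Automatic ingestion on file arrival needs further cloud-side configuration that is not shown here.

```python
def build_pipe_statement(pipe: str, table: str, stage: str, file_type: str = "JSON") -> str:
    """Return a CREATE PIPE statement wrapping a COPY INTO from a named stage.

    All identifiers are passed through as-is, so only use trusted values.
    """
    return (
        f"CREATE PIPE IF NOT EXISTS {pipe} AS\n"
        f"COPY INTO {table}\n"
        f"FROM @{stage}\n"
        f"FILE_FORMAT = (TYPE = '{file_type}')"
    )


# Hypothetical object names, for illustration only.
statement = build_pipe_statement(
    pipe="raw_events_pipe",
    table="staging.raw_events",
    stage="azure_landing_stage",
)
print(statement)
```

The generated statement could then be run from the web interface or any of the supported clients.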
Snowflake Software Layer vs Azure Infrastructure Service Layer
The above diagram illustrates the Snowflake application service layer alongside the Azure infrastructure/platform layer that supports the Snowflake SaaS deployment.
Snowflake Security
Snowflake provides a secure framework that protects customers and their data with three layers of protection, along with auditing of login and query execution history.
- Network Security: The first layer isolates the network through network policies, private endpoints, and firewall rules.
- Identity & Access Management: The second layer is implemented through role-based access control (roles, SSO, sessions, and object-level, column-level, and row-level access control).
- Data Security: The third layer is based on transparent data encryption, which keeps data encrypted at all times.
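The idea behind a network policy in the first layer can be sketched as an allow list and a block list of IP ranges. The following is a minimal local model using only the Python standard library, not Snowflake's implementation; the function name and the sample addresses are hypothetical, and the rule that a blocked entry overrides an allowed range is an assumption of this sketch.

```python
import ipaddress


def is_allowed(client_ip: str, allowed: list[str], blocked: list[str]) -> bool:
    """Return True if client_ip falls in an allowed range and in no blocked range."""
    ip = ipaddress.ip_address(client_ip)
    in_allowed = any(ip in ipaddress.ip_network(cidr) for cidr in allowed)
    in_blocked = any(ip in ipaddress.ip_network(cidr) for cidr in blocked)
    # Assumption of this sketch: the block list takes precedence over the allow list.
    return in_allowed and not in_blocked


# Hypothetical corporate range with one excluded host.
allowed_ranges = ["10.0.0.0/24"]
blocked_ranges = ["10.0.0.5"]

office_client = is_allowed("10.0.0.7", allowed_ranges, blocked_ranges)
excluded_client = is_allowed("10.0.0.5", allowed_ranges, blocked_ranges)
outside_client = is_allowed("192.168.1.1", allowed_ranges, blocked_ranges)
```

Here only the first client passes: the second is explicitly blocked, and the third is outside the allowed range.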
Getting Started with Snowflake
To start using the Snowflake data platform, create an account at signup.snowflake.com. Because Snowflake is delivered as a database as a service, you can get started with basic functionality through the web-based user interface. In addition, Snowflake supports a command-line client (SnowSQL), ODBC, JDBC, Python, Spark, and third-party connectors.
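For the Python option, a minimal sketch of connecting with the Snowflake Connector for Python (the `snowflake-connector-python` package) might look like the following. The environment variable names and both helper functions are assumptions made for this example, and password authentication is used only to keep the sketch short.

```python
import os
from typing import Mapping


def connection_params(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Collect connection settings from environment variables (names are hypothetical)."""
    required = {
        "account": "SNOWFLAKE_ACCOUNT",
        "user": "SNOWFLAKE_USER",
        "password": "SNOWFLAKE_PASSWORD",
    }
    missing = [var for var in required.values() if var not in env]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {key: env[var] for key, var in required.items()}


def current_version(params: dict[str, str]) -> str:
    """Open a connection, run a trivial query, and return the Snowflake version."""
    # Imported here so the helper above works without the connector installed.
    import snowflake.connector

    conn = snowflake.connector.connect(**params)
    try:
        cur = conn.cursor()
        cur.execute("SELECT CURRENT_VERSION()")
        return cur.fetchone()[0]
    finally:
        conn.close()
```

With the three variables set, calling `current_version(connection_params())` should return the version string of the account's Snowflake deployment, which is a quick way to confirm that the account and credentials work.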
In this Introduction to Snowflake series, we will cover a secure implementation of the Snowflake architecture on the Azure cloud platform using Azure AD Single Sign-On, Private Endpoint, Private Link, network policies, and internal and external storage, as well as a continuous data pipeline, as illustrated in the diagram below.
Next:
- Azure Active Directory Integration to Snowflake
- Automatic account provisioning for Snowflake in Azure AD
- Snowflake Private Link configuration
- Azure storage account & private endpoint (External Stage)
- Snowflake Internal Stage
- Snowflake continuous data pipeline (Snowpipe)
- Snowflake Integration to Power BI