Use Snowflake to Share Data Across Cloud Regions

Cover Image for Use Snowflake to Share Data Across Cloud Regions

Background

I have been learning how to use Snowflake, a managed data warehouse service. This is my experience on trying the replication group feature.

I was following below official documentation on sharing data across different Snowflake accounts within the same organization, and could get most work done, but still there were some nuances to figure out before I was able to make the data flow from source account to target account successfully.

https://docs.snowflake.com/en/user-guide/secure-data-sharing-across-regions-platforms

When I had read it multiple times, I realized that above tutorial was somewhat an overkill for me, because I only wanted to share data for accounts located in different AWS regions whtin the SAME org. So I did't necessarily have to create a share, or add a share to the replication group.

Summary

Based on my reading of above doc, setting up replication with replication group entails a few steps,

  • disable individual databases' replication
  • use ORGADMIN role to enable account-level replication
  • create a replication group in source account
  • create a replica of the replication group in target account
  • schedule periodic refresh of the replication, or just refresh on-demand

Detailed Steps

Disable database-level replication

Database-level replication is set up per each database, while account-level replication with replicaiton group is the more recommended way from Snowflake.

See https://docs.snowflake.com/en/sql-reference/functions/system_disable_database_replication

USE ROLE ACCOUNTADMIN;
SELECT SYSTEM$DISABLE_DATABASE_REPLICATION('mydb');

Enable account-level replication

See https://docs.snowflake.com/en/user-guide/account-replication-config#label-enabling-accounts-for-replication

Just substitute FAILOVER GROUP with REPLICATION GROUP in the commands from the doc.

It's so annoying that Snowflake official doc only provides commands for their failover group feature, but they don't provide the equivalent of replication group. I guess they do so because failover group is a more expensive feature, so they make more $$$ by alluring customers to use it at the beginning. 🤷

USE ROLE ORGADMIN;

-- View the list of the accounts in your organization
-- Note the organization name and account name for each account for which you are enabling replication
SHOW ORGANIZATION ACCOUNTS;

-- Enable replication by executing this statement for each source and target account in your organization
SELECT SYSTEM$GLOBAL_ACCOUNT_SET_PARAMETER('<organization_name>.<account_name>', 'ENABLE_ACCOUNT_DATABASE_REPLICATION', 'true');

Create replication group in source account

Since I will be using the replicated data in the target account only, there is no need to create a replication group with share in OBJECT_TYPES. Just do databases.

USE ROLE ACCOUNTADMIN;

CREATE REPLICATION GROUP my_rg
  OBJECT_TYPES = databases
  ALLOWED_DATABASES = db1
  ALLOWED_ACCOUNTS = acme.account_2;

If you do create a replication group with a share, ensure the role you use to share(it's a verb) database with the share, has ownership of the share.

you either

  • grant a database role to a share that your role doesn't own

  • OR you grant specific database objects to a share that you own

See https://docs.snowflake.com/en/user-guide/data-sharing-provider#step-2-grant-privileges-for-a-database-and-objects-to-the-share

Create a replica of the replication group in target account

See https://docs.snowflake.com/en/user-guide/secure-data-sharing-across-regions-platforms

USE ROLE ACCOUNTADMIN;

CREATE REPLICATION GROUP my_rg
  AS REPLICA OF acme.account1.my_rg;

Refresh the replica

It's pretty simple to manually refresh the replica.

ALTER REPLICATION GROUP my_rg REFRESH;

If you don't see the replicated database in the target account, it might be a permission issue, that your role is not allowed to view the databases. Since in previous step, the replica is created with role ACCOUNTADMIN!

Try grant usage on the replicated database, schema, and grant select on the tables in the database/schema.

E.g.

USE ROLE ACCOUNTADMIN;
GRANT USAGE ON DATABASE db1 TO ROLE my_role;
GRANT USAGE ON SCHEMA db1.schema1 TO ROLE my_role;
GRANT SELECT ON FUTURE TABLES IN SCHEMA db1.schem1 TO ROLE my_role;;

Note the last grant is for FUTURE TABLES, because the database grant in a replica is not retroactive, i.e. what you replicate is just the database itself. In the target account, it has no knowledge of which roles have access to it, thus it defaults to ACCOUNDADMIN, which is used to create the replication group in target account.

Terraform

If you are using Terraform to manage your Snowflake infrastructure, everything should be simple, since they have their counterpart in the Terraform Snowflake provider, except the replication group LMAO.

The Terraform does have a resource for failover group tho. 😂

Just use unsafe_execute.

See https://registry.terraform.io/providers/Snowflake-Labs/snowflake/latest/docs/resources/unsafe_execute

E.g.

resource "snowflake_unsafe_execute" "test" {
  execute  <<EOT
CREATE REPLICATION GROUP my_rg
OBJECT_TYPES = databases
ALLOWED_DATABASES = db1
ALLOWED_ACCOUNTS = acme.account_2;
EOT
  revert  = "DROP REPLICATION GROUP IF EXISTS my_rg"
}

Conclusion

Based on my experience, the replication happens really fast. It was able to replicate ~100M data in less than 1 minute from AWS ap-southeast-2 to us-west-2.

This is it, my experience with replication group. Hope it helps.