Snowflake でCloudera Data PlatformのCDW Iceberg テーブルをアクセス

Snowflake でCloudera Data PlatformのCDW Iceberg テーブルをアクセス

Cloudera テクニカルシリーズ

目的:

Cloudera Data PlatformのCDWに、既にIcebergテーブルに保存したデータがあります。Snowflake からアクセスしてみたいです。

1) アーキテクチャ

Image source : https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-external-volume

2) 新規作成

2.1) Snowflake の外部Volumeを新規作成

Step 1: Configure access permissions for the S3 bucket

Create a policy in AWS IAM.

I created a policy called zzeng-Snowflake-ext :

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:DeleteObject",
                "s3:DeleteObjectVersion"
            ],
            "Resource": "arn:aws:s3:::<my-bucket>/data/zzeng/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::<my-bucket>",
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "data/zzeng/*"
                    ]
                }
            }
        }
    ]
}

AWS Console の設定:

Step 2: AWSのIAMロールを作成

zzeng-Snowflake-ext-role:

作成した後、ARNを控える

Step 3: SSE-KMS暗号化のIAMロール (optional)

今回暗号化しないので、SKIP

Step 4: Snowflakeの外部Volumeを作成

Snowflake のSQL実行:

 

CREATE OR REPLACE EXTERNAL VOLUME extIcebergVolC
   STORAGE_LOCATIONS =
      (
         (
            NAME = 'zzeng-iceberg-se-s3-ap-northeast-1'
            STORAGE_PROVIDER = 'S3'
            STORAGE_BASE_URL = 's3://<my-bucket>/data/zzeng/'
            STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<my-AWS-id*****>:role/zzeng-Snowflake-ext-role'
         )
      );
      ```


### Step 5: Snowflake accountとAWS IAM userを紐づけ

Snowflake で下記SQLを実行:

```sql
DESC EXTERNAL VOLUME extIcebergVolC;

実行結果:

zzeng#COMPUTE_WH@ZZENG.PUBLIC>DESC EXTERNAL VOLUME extIcebergVolC;
+-------------------+--------------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------+
| parent_property   | property           | property_type | property_value                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | property_default |
|-------------------+--------------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------|
|                   | ALLOW_WRITES       | Boolean       | true                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | true             |
| STORAGE_LOCATIONS | STORAGE_LOCATION_1 | String        | {"NAME":"zzeng-iceberg-se-s3-ap-northeast-1","STORAGE_PROVIDER":"S3","STORAGE_BASE_URL":"s3://<my-bucket-id>/data/zzeng/","STORAGE_ALLOWED_LOCATIONS":["s3://<my-bucket-id>/data/zzeng/*"],"STORAGE_REGION":"us-east-2","PRIVILEGES_VERIFIED":true,"STORAGE_AWS_ROLE_ARN":"<STORAGE_AWS_ROLE_ARN>","STORAGE_AWS_IAM_USER_ARN":"<STORAGE_AWS_ROLE_ARN>","STORAGE_AWS_EXTERNAL_ID":"<a long string for STORAGE_AWS_EXTERNAL_ID>","ENCRYPTION_TYPE":"NONE","ENCRYPTION_KMS_KEY_ID":""} |                  |
| STORAGE_LOCATIONS | ACTIVE             | String        | zzeng-iceberg-se-s3-ap-northeast-1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |                  |
+-------------------+--------------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------+
3 Row(s) produced. Time Elapsed: 0.949s
zzeng#COMPUTE_WH@ZZENG.PUBLIC>

STORAGE_AWS_IAM_USER_ARN と STORAGE_AWS_EXTERNAL_IDを控える

Step 6: Object StoreのBucketeアクセスのIAM user権限を設定

前のStep 5で貰った値で、AWSのIAM RoleのTrustshipを編集。

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "AWS": "<snowflake_user_arn>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<snowflake_external_id>"
        }
      }
    }
  ]
}

2.2) CDWで新しいIcebergテーブルを作成

下記S3フォルダで作成する予定。

s3a://${my-test-bucket}/data/${user_id}/airlines/airlines

HUE画面で、下記SQLを実行

CREATE DATABASE ${user_id}_airlines_ice;
drop table if exists ${user_id}_airlines_ice.airlines;

CREATE EXTERNAL TABLE ${user_id}_airlines_ice.airlines (code string, description string) 
STORED BY ICEBERG
STORED AS PARQUET
LOCATION 's3a://${cdp_env_bucket}/data/${user_id}/airlines/airlines'
tblproperties("format-version"="2",'external.table.purge'='true');

INSERT INTO ${user_id}_airlines_ice.airlines
  SELECT * FROM ${user_id}_airlines_csv.airlines_csv;
  
  
select * from ${user_id}_airlines_ice.airlines;
  
select count(*) from ${user_id}_airlines_ice.airlines;

SQLを実行したら下記フォルダが出来上がる。

Metadataの情報をチェック:

SHOW CREATE TABLE airlines;

実行結果:

CREATE EXTERNAL TABLE `zzeng_airlines_ice`.`airlines`(
  `code` string, 
  `description` string)
ROW FORMAT SERDE 
  'org.apache.iceberg.mr.hive.HiveIcebergSerDe' 
STORED BY 
  'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' 

LOCATION
    's3a://<my-bucket-name>/data/zzeng/airlines/airlines'
  TBLPROPERTIES (
    'bucketing_version'='2', 
    'current-schema'='{"type":"struct","schema-id":0,"fields":[{"id":1,"name":"code","required":false,"type":"string"},{"id":2,"name":"description","required":false,"type":"string"}]}', 
    'current-snapshot-id'='6686807318178502869', 
    'current-snapshot-summary'='{"added-data-files":"1","added-records":"1491","added-files-size":"25686","changed-partition-count":"1","total-records":"1491","total-files-size":"25686","total-data-files":"1","total-delete-files":"0","total-position-deletes":"0","total-equality-deletes":"0"}', 
    'current-snapshot-timestamp-ms'='1710324490640', 
    'external.table.purge'='true', 
    'format-version'='2', 
    'iceberg.orc.files.only'='false', 
    'metadata_location'='s3a://<my-bucket-name>/data/zzeng/airlines/airlines/metadata/00001-7e28a998-42f9-4466-8884-32d450af5c85.metadata.json', 
    'previous_metadata_location'='s3a://<my-bucket-name>/data/zzeng/airlines/airlines/metadata/00000-9179e76f-369b-47ca-b01f-20e6bffd36a5.metadata.json', 
    'serialization.format'='1', 
    'snapshot-count'='1', 
    'table_type'='ICEBERG', 
    'transient_lastDdlTime'='1710324491', 
    'uuid'='1a86667c-6c7c-4318-a976-ba7fd49c13b0', 
    'write.delete.mode'='merge-on-read', 
    'write.format.default'='parquet', 
    'write.merge.mode'='merge-on-read', 
    'write.update.mode'='merge-on-read')

2.3) Snowflake でIceberg Table を作成し、先ほどのCDW Iceberg Tableをアクセス

SQL

CREATE OR REPLACE ICEBERG TABLE airlines
  CATALOG='zzengIcebergCatalogInt'
  EXTERNAL_VOLUME='extIcebergVolC'
  BASE_LOCATION='airlines/airlines'
  METADATA_FILE_PATH='metadata/00001-7e28a998-42f9-4466-8884-32d450af5c85.metadata.json'
;

3) 結果チェック

In Snowflake:

zzeng#COMPUTE_WH@ZZENG.PUBLIC>select count(*) from AIRLINES;
+----------+
| COUNT(*) |
|----------|
|     1491 |
+----------+
1 Row(s) produced. Time Elapsed: 0.393s
zzeng#COMPUTE_WH@ZZENG.PUBLIC>select * from AIRLINES limit 3;
+------+--------------------+
| CODE | DESCRIPTION        |
|------+--------------------|
| 02Q  | Titan Airways      |
| 04Q  | Tradewind Aviation |
| 05Q  | "Comlux Aviation   |
+------+--------------------+
3 Row(s) produced. Time Elapsed: 4.705s
zzeng#COMPUTE_WH@ZZENG.PUBLIC>

In CDW Hive HUE:

select count(*) from ${user_id}_airlines_ice.airlines;

4) データ変更

4.1) データ追加

現時点、SnowflakeのIcebergカタログを使わず、外部のIcebergテーブルを利用する場合、Insertできないです。
Reference URL:

こちらの説明では、

You can use INSERT and UPDATE statements to modify an Iceberg table that uses Snowflake as the catalog.

Snowflake をIcebergのカタログとして利用する時はINSERT/UPDATEできますが、Catalog Integrationの場合は更新できません。

実際やってみました。

zzeng#COMPUTE_WH@ZZENG.PUBLIC>INSERT INTO airlines (code, description) VALUES
                                   ('A1', 'Airline 1 Description'),
                                   ('A2', 'Airline 2 Description'),
                                   ('A3', 'Airline 3 Description'),
                                   ('A4', 'Airline 4 Description'),
                                   ('A5', 'Airline 5 Description'),
                                   ('A6', 'Airline 6 Description'),
                                   ('A7', 'Airline 7 Description'),
                                   ('A8', 'Airline 8 Description'),
                                   ('A9', 'Airline 9 Description'),
                                   ('A10', 'Airline 10 Description'),
                                   ('A11', 'Airline 11 Description'),
                                   ('A12', 'Airline 12 Description'),
                                   ('A13', 'Airline 13 Description'),
                                   ('A14', 'Airline 14 Description'),
                                   ('A15', 'Airline 15 Description'),
                                   ('A16', 'Airline 16 Description'),
                                   ('A17', 'Airline 17 Description'),
                                   ('A18', 'Airline 18 Description'),
                                   ('A19', 'Airline 19 Description'),
                                   ('A20', 'Airline 20 Description');
091357 (42601): SQL Compilation error: Iceberg table AIRLINES with an external catalog integration is a read-only table and cannot be modified
zzeng#COMPUTE_WH@ZZENG.PUBLIC>

エラーです。

CDWのHiveでやってみると、

INSERT INTO `zzeng_airlines_ice`.`airlines` (code, description) VALUES
('A1', 'Airline 1 Description'),
('A2', 'Airline 2 Description'),
('A3', 'Airline 3 Description'),
('A4', 'Airline 4 Description'),
('A5', 'Airline 5 Description'),
('A6', 'Airline 6 Description'),
('A7', 'Airline 7 Description'),
('A8', 'Airline 8 Description'),
('A9', 'Airline 9 Description'),
('A10', 'Airline 10 Description'),
('A11', 'Airline 11 Description'),
('A12', 'Airline 12 Description'),
('A13', 'Airline 13 Description'),
('A14', 'Airline 14 Description'),
('A15', 'Airline 15 Description'),
('A16', 'Airline 16 Description'),
('A17', 'Airline 17 Description'),
('A18', 'Airline 18 Description'),
('A19', 'Airline 19 Description'),
('A20', 'Airline 20 Description');

成功。

4.2) CDWでIcebergテーブルを更新したら、Snowflake 側に反映

現在の課題:Hive側でIcebergテーブルを更新しましたが、Snowflake 側は反映されていない。

なぜ?

Icebergを更新したら、Metadataが変わりました。
Snowflake側はそれを検知していないから、前の世代のMetadataを見ていた。

zzeng#COMPUTE_WH@ZZENG.PUBLIC>select count(*) from AIRLINES;
+----------+
| COUNT(*) |
|----------|
|     1491 |
+----------+
1 Row(s) produced. Time Elapsed: 0.678s
zzeng#COMPUTE_WH@ZZENG.PUBLIC>

Snowflake 公式ドキュメント:

In Snowflake, The metadata files do not identify the most recent snapshot of an Iceberg table.

Ref URL: https://docs.snowflake.com/en/user-guide/tables-iceberg

手動でSnowflake のMetadata設定変更。

まずは新しいMetadataのファイル名を確認。

これでALTER文作成:

ALTER ICEBERG TABLE AIRLINES REFRESH 'metadata/00002-c33ae888-1af3-4f64-830b-eac9e0a95983.metadata.json';

更新、動作確認結果:

zzeng#COMPUTE_WH@ZZENG.PUBLIC>ALTER ICEBERG TABLE AIRLINES REFRESH 'metadata/00002-c33ae888-1af3-4f64-830b-eac9e0a95983.metadata.json';
+----------------------------------+
| status                           |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
1 Row(s) produced. Time Elapsed: 10.199s
zzeng#COMPUTE_WH@ZZENG.PUBLIC>select count(*) from AIRLINES;
+----------+
| COUNT(*) |
|----------|
|     1511 |
+----------+
1 Row(s) produced. Time Elapsed: 0.204s
zzeng#COMPUTE_WH@ZZENG.PUBLIC>

5) 制約事項

https://docs.snowflake.com/en/user-guide/tables-iceberg

Copied from Snowflake documents:

Metadata files

    The metadata files do not identify the most recent snapshot of an Iceberg table.

    You cannot modify the location of the data files or snapshot using the ALTER ICEBERG TABLE command. To modify either of these settings, you must recreate the table (using the CREATE OR REPLACE ICEBERG TABLE syntax).
Zhen Zeng
SE Manager
この著者の他の記事

コメントする

あなたのメールアドレスは公開されません。また、コメントにリンクを貼ることはできません。