DynamoDB Data Connector

Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. This connector enables using DynamoDB tables as data sources for federated SQL queries in Spice.

datasets:
  - from: dynamodb:users
    name: users
    params:
      dynamodb_aws_region: us-west-2
      dynamodb_aws_access_key_id: ${secrets:aws_access_key_id} # Optional
      dynamodb_aws_secret_access_key: ${secrets:aws_secret_access_key} # Optional
      dynamodb_aws_session_token: ${secrets:aws_session_token} # Optional

Configuration

from

The from field should specify the DynamoDB table name:

from | Description
dynamodb:table | Read data from a DynamoDB table named table
note

If an expected table is not found, verify the dynamodb_aws_region parameter. DynamoDB tables are region-specific.

name

The dataset name. This will be used as the table name within Spice.

Example:

datasets:
  - from: dynamodb:users
    name: my_users
    params: ...

SELECT COUNT(*) FROM my_users;

The dataset name cannot be a reserved keyword.

params

The DynamoDB data connector supports the following configuration parameters:

Parameter Name | Description
dynamodb_aws_region | Required. The AWS region containing the DynamoDB table.
dynamodb_aws_access_key_id | Optional. AWS access key ID for authentication. If not provided, credentials are loaded from environment variables or IAM roles.
dynamodb_aws_secret_access_key | Optional. AWS secret access key for authentication. If not provided, credentials are loaded from environment variables or IAM roles.
dynamodb_aws_session_token | Optional. AWS session token for authentication.
unnest_depth | Optional. Maximum nesting depth for unnesting embedded documents into a flattened structure. Higher values expand deeper nested fields.
schema_infer_max_records | Optional. The number of documents used to infer the schema. Defaults to 10.
scan_segments | Optional. Number of segments for the Scan request. Defaults to 'auto', which calculates the number of segments based on the number of records in the table.

Authentication

If AWS credentials are not explicitly provided in the configuration, the connector will automatically load credentials from the following sources in order.

  1. Environment Variables:

    • AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
    • AWS_SESSION_TOKEN (if using temporary credentials)
  2. Shared AWS Config/Credentials Files:

    • Config file: ~/.aws/config (Linux/Mac) or %UserProfile%\.aws\config (Windows)

    • Credentials file: ~/.aws/credentials (Linux/Mac) or %UserProfile%\.aws\credentials (Windows)

    • The AWS_PROFILE environment variable can be used to specify a named profile, otherwise the [default] profile is used.

    • Supports both static credentials and SSO sessions

    • Example credentials file:

      # Static credentials
      [default]
      aws_access_key_id = YOUR_ACCESS_KEY
      aws_secret_access_key = YOUR_SECRET_KEY

      # SSO profile
      [profile sso-profile]
      sso_start_url = https://my-sso-portal.awsapps.com/start
      sso_region = us-west-2
      sso_account_id = 123456789012
      sso_role_name = MyRole
      region = us-west-2
    tip

    To set up SSO authentication:

    1. Run aws configure sso to configure a new SSO profile
    2. Use the profile by setting AWS_PROFILE=sso-profile
    3. Run aws sso login --profile sso-profile to start a new SSO session
  3. AWS STS Web Identity Token Credentials:

    • Used primarily with OpenID Connect (OIDC) and OAuth
    • Common in Kubernetes environments using IAM roles for service accounts (IRSA)
  4. ECS Container Credentials:

    • Used when running in Amazon ECS containers
    • Automatically uses the task's IAM role
    • Retrieved from the ECS credential provider endpoint
    • Relies on the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI or AWS_CONTAINER_CREDENTIALS_FULL_URI environment variable, which ECS injects automatically.
  5. AWS EC2 Instance Metadata Service (IMDSv2):

    • Used when running on EC2 instances.
    • Automatically uses the instance's IAM role.
    • Retrieved securely using IMDSv2.

The connector will try each source in order until valid credentials are found. If no valid credentials are found, an authentication error will be returned.

IAM Permissions

Regardless of the credential source, the IAM role or user must have the appropriate DynamoDB permissions (e.g., dynamodb:Scan, dynamodb:DescribeTable) to access the table. If the Spicepod connects to multiple AWS services, the permissions should cover all of them.

Required IAM Permissions

The IAM role or user needs the following permissions to access DynamoDB tables:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:Scan",
        "dynamodb:Query",
        "dynamodb:DescribeTable"
      ],
      "Resource": [
        "arn:aws:dynamodb:*:*:table/YOUR_TABLE_NAME"
      ]
    }
  ]
}

Permission Details

Permission | Purpose
dynamodb:Scan | Required. Allows reading all items from the table.
dynamodb:Query | Required. Allows reading items from the table using the partition key.
dynamodb:DescribeTable | Required. Allows fetching table metadata and schema information.

Example IAM Policies

Minimal Policy (Read-only access to specific table)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:Scan",
        "dynamodb:Query",
        "dynamodb:DescribeTable"
      ],
      "Resource": "arn:aws:dynamodb:us-west-2:123456789012:table/users"
    }
  ]
}

Access to Multiple Tables

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:Scan",
        "dynamodb:Query",
        "dynamodb:DescribeTable"
      ],
      "Resource": [
        "arn:aws:dynamodb:us-west-2:123456789012:table/users",
        "arn:aws:dynamodb:us-west-2:123456789012:table/orders"
      ]
    }
  ]
}

Access to All Tables in a Region

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:Scan",
        "dynamodb:Query",
        "dynamodb:DescribeTable"
      ],
      "Resource": "arn:aws:dynamodb:us-west-2:123456789012:table/*"
    }
  ]
}
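
The three example policies differ only in their Resource list. As an illustration (not part of Spice, and the helper name is hypothetical), a small script can generate such a read-only policy document for a set of table ARNs:

```python
import json

# Read-only DynamoDB actions required by the connector (from the table above).
READ_ACTIONS = ["dynamodb:Scan", "dynamodb:Query", "dynamodb:DescribeTable"]

def read_only_policy(table_arns):
    # Build an IAM policy document granting read-only access to the given tables.
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": READ_ACTIONS, "Resource": list(table_arns)}
        ],
    }

policy = read_only_policy([
    "arn:aws:dynamodb:us-west-2:123456789012:table/users",
    "arn:aws:dynamodb:us-west-2:123456789012:table/orders",
])
print(json.dumps(policy, indent=2))
```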

Data Types

The table below shows the DynamoDB data types supported, along with the type mapping to Apache Arrow types in Spice.

DynamoDB Type | Description | Arrow Type | Notes
Bool | Boolean | Boolean |
S | String | Utf8 |
S | String | Timestamp(Millisecond) | Naive timestamp if it matches time_format without timezone
S | String | Timestamp(Millisecond, <timezone>) | Timezone-aware timestamp if it matches time_format with timezone
Ss | String Set | List<Utf8> |
N | Number | Int64 or Float64 |
Ns | Number Set | List<Int64 or Float64> |
B | Binary | Binary |
Bs | Binary Set | List<Binary> |
L | List | List<Utf8> | DynamoDB lists can be heterogeneous (e.g. [1, "foo", true]) while Arrow arrays must be homogeneous, so strings are used to preserve all data
M | Map | Utf8 or unflattened | Depending on the unnest_depth value

Time format

Since DynamoDB stores timestamps as strings, Spice supports parsing timestamps using a customizable format. By default, Spice will try to parse timestamps using ISO8601 format, but you can provide a custom format using the time_format parameter.

Once Spice is able to parse a timestamp, it will convert it to a Timestamp(Millisecond) Arrow type, and will use the same format to serialize it back to DynamoDB for filter pushdown.

This parameter uses Go-style time formatting, which uses a reference time of Mon Jan 2 15:04:05 MST 2006.

Format Pattern | Example Value | Description
2006-01-02T15:04:05Z07:00 | 2024-03-15T14:30:00Z | ISO8601 / RFC3339 with timezone (default)
2006-01-02T15:04:05.999Z07:00 | 2024-03-15T14:30:00.123-07:00 | ISO8601 with milliseconds and timezone
2006-01-02T15:04:05 | 2024-03-15T14:30:00 | ISO8601 without timezone (naive timestamp)
2006-01-02 15:04:05 | 2024-03-15 14:30:00 | Date and time with space separator
01/02/2006 15:04:05 | 03/15/2024 14:30:00 | US-style date with time
02/01/2006 15:04:05 | 15/03/2024 14:30:00 | European-style date with time
Jan 2, 2006 3:04:05 PM | Mar 15, 2024 2:30:00 PM | Human-readable with 12-hour clock
20060102150405 | 20240315143000 | Compact format (no separators)

Go's format uses specific reference values that must appear exactly as shown:

Component | Reference Value | Alternatives
Year | 2006 | 06 (2-digit)
Month | 01 | 1, Jan, January
Day | 02 | 2
Hour (24h) | 15 |
Hour (12h) | 03 | 3
Minute | 04 | 4
Second | 05 | 5
AM/PM | PM | pm
Timezone | Z07:00 | -0700, MST
Milliseconds | .000 | .999 (trailing zeros trimmed)
Microseconds | .000000 | .999999 (trailing zeros trimmed)
Nanoseconds | .000000000 | .999999999 (trailing zeros trimmed)
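
To see how the reference values map onto a concrete timestamp, the sketch below translates a Go-style layout into a Python strptime pattern. This is a simplified illustration, not Spice's parser: it covers only the numeric components above and ignores 2-digit years, named months, timezones, and fractional seconds.

```python
from datetime import datetime

# Go reference components -> strptime directives (numeric subset only).
GO_TO_STRPTIME = {
    "2006": "%Y", "01": "%m", "02": "%d",
    "15": "%H", "04": "%M", "05": "%S",
}

def go_layout_to_strptime(layout: str) -> str:
    # Replace longer tokens first to avoid partial matches.
    for go_tok, py_tok in sorted(GO_TO_STRPTIME.items(), key=lambda kv: -len(kv[0])):
        layout = layout.replace(go_tok, py_tok)
    return layout

fmt = go_layout_to_strptime("2006-01-02 15:04:05")
print(fmt)  # %Y-%m-%d %H:%M:%S
print(datetime.strptime("2024-03-15 14:30:00", fmt))
```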


Unnesting

Consider the following document:

{
"a": 1,
"b": {
"x": 2,
"y": {
"z": 3
}
}
}

The unnest_depth parameter controls the unnesting behavior, as the following examples show:

unnest_depth: 0

sql> select * from test_table;
+-----------+---------------------+
| a (Int32) | b (Utf8)            |
+-----------+---------------------+
| 1         | {"x":2,"y":{"z":3}} |
+-----------+---------------------+

unnest_depth: 1

sql> select * from test_table;
+-----------+-------------+------------+
| a (Int32) | b.x (Int32) | b.y (Utf8) |
+-----------+-------------+------------+
| 1         | 2           | {"z":3}    |
+-----------+-------------+------------+

unnest_depth: 2

sql> select * from test_table;
+-----------+-------------+---------------+
| a (Int32) | b.x (Int32) | b.y.z (Int32) |
+-----------+-------------+---------------+
| 1         | 2           | 3             |
+-----------+-------------+---------------+
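
The semantics above can be mimicked with a short Python sketch (an illustration only, not Spice's implementation): maps are flattened into dotted keys until the depth budget is spent, and any remaining maps are serialized as compact JSON strings, matching the Utf8 columns shown above.

```python
import json

def unnest(doc: dict, depth: int, prefix: str = "") -> dict:
    # Flatten nested maps into dotted keys until `depth` is exhausted;
    # deeper maps are kept as compact JSON strings.
    out = {}
    for key, value in doc.items():
        name = prefix + key
        if isinstance(value, dict):
            if depth > 0:
                out.update(unnest(value, depth - 1, name + "."))
            else:
                out[name] = json.dumps(value, separators=(",", ":"))
        else:
            out[name] = value
    return out

doc = {"a": 1, "b": {"x": 2, "y": {"z": 3}}}
print(unnest(doc, 0))  # {'a': 1, 'b': '{"x":2,"y":{"z":3}}'}
print(unnest(doc, 1))  # {'a': 1, 'b.x': 2, 'b.y': '{"z":3}'}
print(unnest(doc, 2))  # {'a': 1, 'b.x': 2, 'b.y.z': 3}
```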

Examples

Basic Configuration with Environment Credentials

version: v1
kind: Spicepod
name: dynamodb

datasets:
  - from: dynamodb:users
    name: users
    params:
      dynamodb_aws_region: us-west-2
    acceleration:
      enabled: true

Configuration with Explicit Credentials

version: v1
kind: Spicepod
name: dynamodb

datasets:
  - from: dynamodb:users
    name: users
    params:
      dynamodb_aws_region: us-west-2
      dynamodb_aws_access_key_id: ${secrets:aws_access_key_id}
      dynamodb_aws_secret_access_key: ${secrets:aws_secret_access_key}
    acceleration:
      enabled: true

Configuration with time_format

version: v1
kind: Spicepod
name: dynamodb

datasets:
  - from: dynamodb:users
    name: users
    params:
      dynamodb_aws_region: us-west-2
      time_format: '2006-01-02 15:04:05'
    acceleration:
      enabled: true

Querying Nested Structures

DynamoDB supports complex nested JSON structures. These fields can be queried using SQL:

-- Query nested structs
SELECT metadata.registration_ip, metadata.user_agent
FROM users
LIMIT 5;

-- Query nested structs in arrays
SELECT address.city
FROM (
  SELECT unnest(addresses) AS address
  FROM users
)
WHERE address.city = 'San Francisco';
Limitations
  • The DynamoDB connector scans the first 10 items (configurable via schema_infer_max_records) to infer the table schema. Columns that do not appear in those items will be missed.
  • The DynamoDB connector does not support the Decimal type.
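
The sampling limitation can be demonstrated with a minimal sketch (illustrative only, not Spice's inference code): inferring column names from only the first N documents drops any attribute that first appears later.

```python
def infer_columns(docs, sample_size=10):
    # Union of attribute names across the first `sample_size` documents only.
    columns = set()
    for doc in docs[:sample_size]:
        columns.update(doc.keys())
    return sorted(columns)

docs = [{"id": i, "email": f"u{i}@example.com"} for i in range(10)]
# An attribute that only appears in the 11th item is outside the sample.
docs.append({"id": 10, "email": "u10@example.com", "nickname": "late"})

print(infer_columns(docs))  # ['email', 'id'] -- 'nickname' is missed
```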

Example schema from a users table:

describe users;
+-------------+--------------+-------------+
| column_name | data_type    | is_nullable |
+-------------+--------------+-------------+
| email       | Utf8         | YES         |
| id          | Int64        | YES         |
| metadata    | Struct       | YES         |
| addresses   | List(Struct) | YES         |
| preferences | Struct       | YES         |
| created_at  | Utf8         | YES         |
...
+-------------+--------------+-------------+

Streams

The DynamoDB Data Connector integrates with DynamoDB Streams to enable real-time streaming of table changes. This feature supports both initial table bootstrapping and continuous change data capture (CDC), allowing Spice to automatically detect and stream inserts, updates, and deletes from DynamoDB tables.

warning

Streaming with the DynamoDB connector requires acceleration with refresh_mode: changes and an on_conflict configuration.

Basic Configuration

To enable streaming from DynamoDB, enable acceleration and set the refresh_mode to changes in your dataset configuration.

You also need to configure the on_conflict parameter to specify how the connector should handle updates to existing records. The keys defined in on_conflict must match your DynamoDB table's partition key and range key (if your table has one).

datasets:
  - from: dynamodb:my_table
    name: orders_stream
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      refresh_mode: changes
      on_conflict:
        (id, version): upsert

Configuration Parameters

Dataset Parameters

  • ready_lag - Defines the maximum lag threshold before the dataset is reported as "Ready". Once the stream lag falls below this value, queries can be executed against the dataset. Default behavior reports ready immediately after bootstrap completes.

  • scan_interval - Controls the polling frequency for checking new records in the DynamoDB stream. Lower values provide more real-time updates but increase API calls. Higher values reduce API usage but may introduce additional latency.

Acceleration Parameters

  • on_conflict - Specifies the conflict resolution strategy when streaming changes that match existing records. The keys in the tuple should correspond to your DynamoDB table's partition key and range key (if applicable). The upsert action will insert new records or update existing ones based on these key columns.

    Examples:

    • Single partition key: id: upsert
    • Partition key + range key: (partition_key, sort_key): upsert
  • snapshots_trigger_threshold - Determines how frequently snapshots are created during streaming. A value of 5 means a snapshot is created every 5 batch updates. Snapshots enable faster recovery and better query performance but consume additional storage.

Metrics

The following Component Metrics are provided for monitoring streaming performance and health:

Metric | Type | Description
shards_active | Gauge | Current number of active shards in the stream
records_consumed_total | Counter | Total number of records consumed from the stream
lag_ms | Gauge | Current lag in milliseconds between the stream watermark and the current time
errors_transient_total | Counter | Total number of transient errors encountered while polling the stream

These metrics are not enabled by default; enable them by setting the metrics parameter:

datasets:
  - from: dynamodb:user_events
    name: events
    metrics:
      - name: shards_active
      - name: lag_ms

You can find an example dashboard for DynamoDB Streams in monitoring/grafana-dashboard.json.

Advanced Configuration

For production workloads requiring fine-tuned control over streaming behavior and performance characteristics:

datasets:
  - from: dynamodb:my_table
    name: orders_stream
    params:
      ready_lag: 1s # Dataset reports as Ready when lag is below 1 second
      scan_interval: 100ms # Poll for new stream records every 100 milliseconds
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      refresh_mode: changes
      on_conflict:
        (id, version): upsert
      params:
        snapshots_trigger_threshold: 5 # Create snapshot every 5 batch updates
    metrics:
      - name: shards_active
        enabled: true
      - name: records_consumed_total
        enabled: true
      - name: lag_ms
        enabled: true
      - name: errors_transient_total
        enabled: true

Cookbooks