External S3 bucket storage
Snorkel can use Amazon S3 as external storage for datasets and related objects such as datasources and uploaded files. When external storage is configured, Snorkel performs in-platform authorization checks to verify that a user has access to each file stored in S3.
Prerequisites
Before configuring external S3 storage, ensure you have:
- A Snorkel instance that is either Snorkel-hosted or running on Amazon EKS
- An S3 bucket with appropriate permissions
- AWS credentials and permissions to create IAM roles
Configuration
On-premises instances
For on-premises deployments, configure external storage in your Snorkel configuration:
external_storage:
  enabled: false
  # bucket: "s3://my-company-snorkel-storage-bucket"
  # roleArn: "arn:aws:iam::123456789012:role/SnorkelStorageRole"
  # region: "us-west-2"
Note: For on-premises installations, follow the Amazon EKS documentation on IAM roles for service accounts to set up the IAM configuration.
Contact Snorkel support for detailed installation guidance specific to your environment.
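Before enabling external storage, it can help to sanity-check the shape of the three values in the configuration above. The following is a minimal sketch, not part of Snorkel: the helper name check_external_storage and the regular expressions are assumptions made for illustration, and they only check formatting, not whether the bucket or role actually exists.

```python
import re
from typing import List

# Hypothetical helper (not a Snorkel API): checks only the *format* of the
# external_storage values, not whether the AWS resources exist.
BUCKET_RE = re.compile(r"^s3://[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")
ROLE_ARN_RE = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/[\w+=,.@/-]+$")
REGION_RE = re.compile(r"^[a-z]{2}(-[a-z]+)+-\d$")


def check_external_storage(settings: dict) -> List[str]:
    """Return a list of formatting problems; an empty list means none were found."""
    problems: List[str] = []
    if not settings.get("enabled", False):
        # Nothing to validate while the feature is disabled.
        return problems
    if not BUCKET_RE.match(settings.get("bucket", "")):
        problems.append("bucket should look like s3://<bucket-name>")
    if not ROLE_ARN_RE.match(settings.get("roleArn", "")):
        problems.append("roleArn should look like arn:aws:iam::<account-id>:role/<name>")
    if not REGION_RE.match(settings.get("region", "")):
        problems.append("region should look like us-west-2")
    return problems


if __name__ == "__main__":
    example = {
        "enabled": True,
        "bucket": "s3://my-company-snorkel-storage-bucket",
        "roleArn": "arn:aws:iam::123456789012:role/SnorkelStorageRole",
        "region": "us-west-2",
    }
    print(check_external_storage(example) or "settings look well-formed")
```

A check like this catches common slips, such as a missing s3:// prefix, before the values reach the deployment.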
Snorkel-hosted instances
For Snorkel-hosted instances, follow these steps to configure cross-account S3 access:
Step 1: Obtain OIDC issuer URL
Contact Snorkel support to receive the OIDC issuer URL from your Snorkel cluster.
Step 2: Create IAM OIDC provider
Follow the AWS cross-account access documentation to:
- Create an IAM OIDC provider for your cluster.
- Assign IAM roles to Kubernetes service accounts using the issuer URL obtained from Snorkel in the previous step.
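As an illustration of how the issuer URL ties the role to a service account, the sketch below builds a trust policy in the shape AWS documents for IAM roles for service accounts. The function name, the namespace snorkel, and the service account name snorkel-storage are placeholders assumed for this example; Snorkel support supplies the real issuer URL and service account details.

```python
import json
from urllib.parse import urlparse


def build_trust_policy(account_id: str, issuer_url: str,
                       namespace: str, service_account: str) -> dict:
    """Build a role trust policy that lets one Kubernetes service account
    assume the role through the cluster's OIDC provider."""
    parsed = urlparse(issuer_url)
    # The provider identifier is the issuer URL without the https:// scheme.
    issuer = parsed.netloc + parsed.path.rstrip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{issuer}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    f"{issuer}:aud": "sts.amazonaws.com",
                    f"{issuer}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                }
            },
        }],
    }


if __name__ == "__main__":
    # All values below are placeholders, not real Snorkel settings.
    policy = build_trust_policy(
        account_id="123456789012",
        issuer_url="https://oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE",
        namespace="snorkel",
        service_account="snorkel-storage",
    )
    print(json.dumps(policy, indent=2))
```

The sub condition restricts the role to a single service account, so other workloads on the same cluster cannot assume it.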
Step 3: Configure IAM role permissions
The created IAM role must have the following S3 permissions:
{
  "Effect": "Allow",
  "Action": [
    "s3:CreateBucket",
    "s3:GetObject",
    "s3:PutObject",
    "s3:ListBucket",
    "s3:DeleteObject"
  ],
  "Resource": [
    "arn:aws:s3:::your-bucket-name",
    "arn:aws:s3:::your-bucket-name/*"
  ]
}
Important: Replace your-bucket-name with your actual S3 bucket name.
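The statement above is a single policy statement. A complete IAM policy document wraps it in a Version and a Statement list. The sketch below generates such a document for a given bucket name. The function name build_s3_policy is an assumption for illustration, and the bucket name in the usage example is the placeholder from the on-premises configuration.

```python
import json

# The S3 actions listed in the statement above.
S3_ACTIONS = [
    "s3:CreateBucket",
    "s3:GetObject",
    "s3:PutObject",
    "s3:ListBucket",
    "s3:DeleteObject",
]


def build_s3_policy(bucket_name: str) -> dict:
    """Wrap the required statement in a full IAM policy document."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": list(S3_ACTIONS),
            "Resource": [
                f"arn:aws:s3:::{bucket_name}",    # the bucket itself
                f"arn:aws:s3:::{bucket_name}/*",  # every object in the bucket
            ],
        }],
    }


if __name__ == "__main__":
    # Placeholder bucket name; substitute your own.
    print(json.dumps(build_s3_policy("my-company-snorkel-storage-bucket"), indent=2))
```

Both resource entries are needed: bucket-level actions such as s3:ListBucket apply to the bucket ARN, while object-level actions such as s3:GetObject apply to the /* ARN.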
Step 4: Provide role ARN to Snorkel
After creating the IAM role, provide the role ARN to Snorkel support to complete the configuration.
Final configuration
Once configured with the S3 bucket, role ARN, and AWS region, Snorkel can store and manage datasets in your external S3 bucket while maintaining proper access controls and authorization.