
Supported data source types

This page provides information about the data source types that the Snorkel AI Data Development Platform supports. You can follow the steps outlined in Data preparation and Data upload to prepare and upload your dataset into the Snorkel AI Data Development Platform.

The Snorkel AI Data Development Platform supports the following types of data sources: Amazon S3, Google Cloud Storage (GCS), local files, SQL, Databricks SQL, Google BigQuery, and Snowflake.

note

When you use credentials to access your data, your credentials are encrypted, stored, and managed by Snorkel. Credentials are not logged or stored unencrypted.

Amazon S3

Select Amazon S3 to add data from private or public S3 buckets. Turn on the "Use credentials" toggle to use data in private buckets.

Fill out the required fields:

  • Access key ID: Your AWS access key ID.
  • Access key: Your AWS secret access key.
  • Region: The AWS region of your bucket.
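
If you want to verify that your credentials can read the bucket before adding the data source, a quick check with boto3 (run outside the platform) looks like the sketch below. The bucket name, key values, and region are placeholders.

# Minimal sketch, not part of the platform: confirm that the credentials
# can read the bucket. All values shown are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="AKIA...",      # Access key ID
    aws_secret_access_key="...",      # Secret access key
    region_name="us-west-2",          # Region
)

# List a few objects to confirm read access to the private bucket.
response = s3.list_objects_v2(Bucket="my-private-bucket", MaxKeys=5)
for obj in response.get("Contents", []):
    print(obj["Key"])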

Google Cloud Storage

Select Google Cloud Storage to add data from your GCS buckets. Turn on the "Use credentials" toggle to use data in private buckets.

Fill out the required fields:

  • Project ID: Your Google Cloud project ID.
  • JSON service account file contents: The raw JSON contents of a key file that belongs to a service account with read access to your Cloud Storage bucket.
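
To confirm that a service account key can read your bucket before pasting it into the form, you can run a quick check with the google-cloud-storage client, as in the sketch below. The project ID, key file path, and bucket name are placeholders.

# Minimal sketch, not part of the platform: verify that the service
# account key can read a GCS bucket. All values shown are placeholders.
import json

from google.cloud import storage
from google.oauth2 import service_account

with open("service-account-key.json") as f:
    key_info = json.load(f)  # the same JSON you paste into the form

credentials = service_account.Credentials.from_service_account_info(key_info)
client = storage.Client(project="my-gcp-project", credentials=credentials)

# List a few objects to confirm read access.
for blob in client.list_blobs("my-gcs-bucket", max_results=5):
    print(blob.name)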

Local file

Select File Upload to add data sources from your local machine. You must upload a Parquet or CSV file.

note

Local data upload currently supports data source files up to 100 MB. If your data files are larger, use a cloud storage service like Amazon S3 or Google Cloud Storage.
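
If a CSV file is only slightly over the limit, converting it to Parquet often reduces its size enough for local upload. A minimal sketch with pandas, using placeholder filenames:

# Minimal sketch: convert a CSV file to Parquet, which is typically much
# smaller. Requires pyarrow (or fastparquet) to be installed.
import pandas as pd

df = pd.read_csv("my_dataset.csv")
df.to_parquet("my_dataset.parquet", index=False)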

SQL

Select SQL DB to add data sources from queries against SQL databases like Postgres and SQLite.

Fill out the required fields:

  • Connection URI: A database connection string, such as those used by SQLAlchemy.
  • SQL query: A SQL query where each result row is a data point.
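
For reference, the sketch below shows a SQLAlchemy-style connection URI and a query whose result rows would each become a data point. The connection string, table, and columns are placeholders.

# Minimal sketch: a SQLAlchemy connection URI and query of the kind this
# data source expects. Placeholder values; requires a database driver
# such as psycopg2 for Postgres.
import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:password@db-host:5432/mydb")

with engine.connect() as conn:
    result = conn.execute(sa.text("SELECT id, text, label FROM documents LIMIT 5"))
    for row in result:
        print(row)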

Databricks SQL

Select Databricks SQL to add data sources from queries against a Databricks SQL warehouse. Currently, all table columns are read.

Fill out the required fields:

  • Server hostname: The server hostname of your Databricks SQL warehouse.
  • HTTP path: The HTTP path of your Databricks SQL warehouse.
  • Access token: Your Databricks access token.
  • SQL query: A Databricks query where each result row is a data point.

See the Databricks documentation for more details about these items.
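
For reference, the sketch below shows how the same three values are used by the databricks-sql-connector Python package to run a query against a SQL warehouse. The hostname, HTTP path, access token, and table are placeholders.

# Minimal sketch, not part of the platform: run a query against a
# Databricks SQL warehouse. All values shown are placeholders.
from databricks import sql

with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/abcdef1234567890",
    access_token="dapi...",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT id, text, label FROM my_catalog.my_schema.my_table LIMIT 5")
        for row in cursor.fetchall():
            print(row)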


Google BigQuery

Select Google BigQuery to add data sources from Google BigQuery tables. Currently, all table columns are read.

Fill out the required fields:

  • Project ID: Your Google Cloud project ID.

  • JSON service account file contents: The raw JSON contents of a key file that belongs to a service account with access to Google BigQuery. Please note that the service account requires roles/bigquery.readSessionUser and roles/bigquery.dataViewer to read a table.

  • Google BigQuery table specification: A JSON specification for the table columns to read, where each result row is a data point. The specification must have the following keys, and an example is provided below.

    • dataset_id: The BigQuery dataset ID.
    • table_id: The BigQuery table ID.
    • columns: The list of columns to include.

{
  "dataset_id": "bigquery-public-data.noaa_tsunami",
  "table_id": "historical_source_event",
  "columns": ["id", "year", "location_name"]
}
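
For reference, the specification above reads the listed columns from the table, roughly equivalent to the query in the sketch below, which uses the google-cloud-bigquery client. The key file path and project ID are placeholders.

# Minimal sketch, not part of the platform: read the same columns with
# the google-cloud-bigquery client. Key file path and project ID are
# placeholders.
import json

from google.cloud import bigquery
from google.oauth2 import service_account

with open("service-account-key.json") as f:
    key_info = json.load(f)

credentials = service_account.Credentials.from_service_account_info(key_info)
client = bigquery.Client(project="my-gcp-project", credentials=credentials)

query = """
    SELECT id, year, location_name
    FROM `bigquery-public-data.noaa_tsunami.historical_source_event`
    LIMIT 5
"""
for row in client.query(query).result():
    print(row.id, row.year, row.location_name)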

Snowflake

Select Snowflake to add data sources from queries against a Snowflake data warehouse.

Fill out the required fields:

  • Username: Your Snowflake username.
  • Account identifier: Your Snowflake account identifier.
  • Password: Your Snowflake password.
  • Snowflake query: A Snowflake query where each result row is a data point. Queries can specify the database, schema, and table, as in the example below.
SELECT c_name, c_address
FROM snowflake_sample_data.tpch_sf1.customer
LIMIT 10;

note

The default warehouse that is specified for your Snowflake account will be used.
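
For reference, the sketch below shows how the same values are used by the snowflake-connector-python package to run the query above. The username, password, and account identifier are placeholders; because no warehouse is set, the account's default warehouse is used.

# Minimal sketch, not part of the platform: run the sample query with
# snowflake-connector-python. All credential values are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    user="MY_USER",
    password="...",
    account="myorg-myaccount",   # account identifier
)
try:
    cur = conn.cursor()
    cur.execute(
        "SELECT c_name, c_address "
        "FROM snowflake_sample_data.tpch_sf1.customer LIMIT 10"
    )
    for row in cur.fetchall():
        print(row)
finally:
    conn.close()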