Version: 25.5

Deploying with Helm

Prerequisites

Before installing the Snorkel AI Data Development Platform, make sure the following have been done.

  • You have sent Snorkel AI a Docker ID that you’ll use for the initial installation and updates.
  • You have access to a Kubernetes cluster with specs outlined in the Snorkel Kubernetes Installation Overview.
  • You have privileges to schedule pods, deployments, services, ingresses, and secrets, as well as to create namespaces, within a specific AWS EKS cluster. The remainder of this document refers to this cluster as "the EKS cluster."
  • You have a workspace with access to EKS, as well as the ability to save the Snorkel AI Data Development Platform configuration file, which will be generated during this setup, to a permanent location. This guide assumes an Ubuntu virtual machine is available.
  • You have the set of Snorkel AI Data Development Platform Helm charts sent over by Snorkel AI.

Note: These instructions assume that you will install the Snorkel AI Data Development Platform using Amazon's Elastic Kubernetes Service. Snorkel can run on any Kubernetes installation that follows the specs in the Snorkel Kubernetes Installation Overview, but this tutorial focuses on EKS. If you have specific questions about how the Snorkel AI Data Development Platform is configured, please contact Snorkel support. For general management questions regarding Kubernetes or EKS, please refer to their documentation.

System Dependencies

From your workspace, install the command line tools required to set up Kubernetes resources:
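As a sketch, on an Ubuntu workspace the tools can be installed as follows. The version pins below are illustrative, not prescribed by Snorkel; match them to the versions your cluster requires.

```shell
# Illustrative version pins; align these with your cluster's Kubernetes version.
KUBECTL_VERSION="v1.29.0"
HELM_VERSION="v3.14.0"

# kubectl, from the official Kubernetes release bucket
curl -LO "https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

# Helm, via the official installer script
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 \
  | DESIRED_VERSION="${HELM_VERSION}" bash

# AWS CLI v2, used to generate a kubeconfig for the EKS cluster
curl -fsSL "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o awscliv2.zip
unzip -q awscliv2.zip && sudo ./aws/install
```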

Ensure Cluster Access

Next, you will need to ensure that your kubectl is configured to access the cluster that the Snorkel AI Data Development Platform will be deployed into. Run the following to confirm that the pods returned are what you expect:

kubectl get pods --all-namespaces

If the output of this command matches what you expect to be running in the cluster, you can move on.
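If kubectl is not yet pointed at the right cluster, one way to configure it for EKS is shown below. The cluster name and region are placeholders for your environment.

```shell
# Placeholder values for your environment
CLUSTER_NAME="my-snorkel-cluster"
AWS_REGION="us-west-2"

# Write an EKS-aware kubeconfig entry and switch to it
aws eks update-kubeconfig --name "$CLUSTER_NAME" --region "$AWS_REGION"

# Confirm which context kubectl will talk to
kubectl config current-context
```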

Creating the Kubernetes Namespace

All resources created by the Snorkel utilities will live in a single namespace within the EKS cluster. Create this namespace, and set your kubectl context to point to this namespace.

PROJECT_NAME=<A short, unique, alphanumeric name for this instance of Snorkel, such as “snorkeldemo” or “snorkelproduction”>
kubectl create namespace $PROJECT_NAME
kubectl config set-context --current --namespace $PROJECT_NAME

With the namespace set as the default for the current context, the remainder of the setup operates in this namespace unless otherwise noted.
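Note that Kubernetes namespace names must be valid RFC 1123 DNS labels: lowercase alphanumerics and hyphens, at most 63 characters, starting and ending with an alphanumeric character. A quick sanity check before creating the namespace might look like:

```shell
PROJECT_NAME="snorkeldemo"  # hypothetical example value

# RFC 1123 DNS label: lowercase alphanumerics and '-', max 63 chars,
# must start and end with an alphanumeric character.
if printf '%s' "$PROJECT_NAME" | grep -Eq '^[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?$'; then
  echo "ok: $PROJECT_NAME is a valid namespace name"
else
  echo "invalid: $PROJECT_NAME is not a valid namespace name"
fi
```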

Accessing Docker Hub In Kubernetes

For your EKS cluster to download Snorkel images, you must give the cluster a registry credential that it can use as an image pull secret. Create it by running the following:

DOCKER_USERNAME=<username for accessing the Docker registry>
DOCKER_PASSWORD=<password for the associated Docker registry user>
DOCKER_EMAIL=<Email used for Docker hub>
kubectl create secret docker-registry regcred \
-n $PROJECT_NAME \
--docker-server=index.docker.io \
--docker-username=$DOCKER_USERNAME \
--docker-password=$DOCKER_PASSWORD \
--docker-email=$DOCKER_EMAIL
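Once created, the secret is referenced by the charts as an image pull secret. For reference, this is the general shape of the imagePullSecrets stanza in a pod spec that consumes it; the pod, container, and image names here are generic Kubernetes illustrations, not excerpts from the Snorkel charts:

```yaml
# Generic illustration of how a pod consumes the regcred pull secret
apiVersion: v1
kind: Pod
metadata:
  name: example            # hypothetical pod name
spec:
  imagePullSecrets:
    - name: regcred        # the secret created above
  containers:
    - name: app            # hypothetical container
      image: index.docker.io/snorkelai/example:latest  # hypothetical image
```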

Install Snorkel with Helm

Finally, we'll use the set of Snorkel Helm charts to install the full Snorkel AI Data Development Platform. With Helm, installation involves two steps:

  • Edit the values.yaml file for your specific installation
  • Use the helm CLI to deploy Snorkel

Tailor the values.yaml File

The following is a guide for fields within the values.yaml file. Documentation is also provided in the values.yaml file itself.

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| projectName | string | snorkelflow | The namespace the Snorkel AI Data Development Platform will be installed in. |
| version | string | [YOUR_SNORKEL_FLOW_VERSION] | The version of the Snorkel AI Data Development Platform being deployed. |
| image.imageNames | map | {} | (Optional) A key/value mapping of service to images, only used if an internal repository is being referenced. |
| pagerduty_key | string |  | (Optional) A PagerDuty key if paging is needed. |
| affinity.binPackAlwaysPinnedPods | boolean | false | More efficiently bin-pack pods for scale-up and scale-down purposes. |
| affinity.tolerations | list |  | (Optional) A list of tolerations for non-GPU pods. |
| affinity.nodeAffinity | map |  | (Optional) A map of node affinity fields for non-GPU pods. |
| autoscaling.worker_autoscaling | string | "0" | Turn autoscaling on or off for worker services (engine, model-trainer, ray-worker). |
| autoscaling.cluster_autoscaling | map | {"pod_disruption_budget_toggling": "0", "business_hour_start_utc": 11, "business_hour_end_utc": 4} | Configure the ability for Kubernetes to move Snorkel AI Data Development Platform pods around during non-business hours. |
| traffic.basePath | string |  | (Optional) Configure a base path for the Snorkel AI Data Development Platform. |
| traffic.istio.enabled | boolean | false | Configure Istio if using Istio as an ingress gateway. |
| traffic.istio.mtls | map | {"enabled": false} | Configures mutual TLS for Istio if Istio is enabled. |
| traffic.istio.gateway | map | {"create": true} | Whether to allow automatic creation of an Istio gateway if Istio is enabled. |
| traffic.ingresses.domain | string | "snorkel-ai.com" | URL domain of the Snorkel AI Data Development Platform install (e.g., for snorkelflow.snorkel-ai.com, the domain would be snorkel-ai.com). |
| traffic.ingresses.ingressClassName | string | null | The name of the ingress class being used for ingresses, if not using the default. |
| traffic.ingresses.serviceType | string | "ClusterIP" | The ServiceType of services that require an ingress (should be either NodePort or ClusterIP). |
| traffic.ingresses.cloudProvider | string |  | (Optional) Specify the cloud provider the ingresses are for, or leave blank. Currently supported providers: gcp, aws, azure. |
| traffic.ingresses.tlsHosts | map | {"enabled": false} | Add TLS hosts to the ingress (typically not needed). |
| traffic.ingresses.annotations | map | {} | Global annotations applied to all ingress objects, typically used for ingress-controller-specific annotations. |
| traffic.ingresses.services.[SERVICE] | map | {[SERVICE]: {"enabled": true, "urlPrefix": [SERVICE_URL_PREFIX], "annotations": }} | Configure ingress objects for each individual ingress. |
| traffic.tls | map | {"key_secret_name": "envoy-front-proxy-envoy-tls-key-pem-secret", "cert_secret_name": "envoy-front-proxy-envoy-tls-cert-pem-secret"} | Configure a cert/key pair for Envoy to terminate TLS. |
| traffic.allowAllInboundTrafficOnKeyServices | boolean | true | Permit select services to receive all inbound traffic. This applies to services that have a direct ingress object. |
| traffic.allowInternetAccess | boolean | true | Permit services to access the internet, typically used for downloading external models. |
| traffic.networkPolicies.enabled | boolean | false | Enable the Snorkel AI Data Development Platform network policies. |
| traffic.networkPolicies.ingresses | map |  | (Optional) Create additional networkPolicy ingress blocks for services with a direct ingress. Typically used to allow inbound traffic from a specific in-cluster ingress controller. |
| gpu.enabled | boolean | false | Deploy the Snorkel AI Data Development Platform with GPU support. |
| gpu.gpu_config.tolerations | list |  | (Optional) A list of tolerations for GPU-enabled pods. |
| gpu.gpu_config.node_selectors | map |  | (Optional) Key/value pairs of node selectors for GPU pods. |
| gpu.gpu_config.schedulerName | string |  | (Optional) Configure a GPU scheduler (e.g., if using Run:ai). |
| gpu.separate_gpu_pods | boolean | false | Separate worker pods into CPU and GPU pods. |
| prefect.enabled | boolean | true | Use the Prefect workflow engine. |
| namespace.enabled | boolean | true | Enable namespace creation as part of the Helm deploy. |
| services.env | map | {} | Key/value pairs of shared environment variables to append to all pods. |
| services.labels | map | {} | Key/value pairs of shared labels to append to all pods. |
| services.[SERVICE].resources | map | {} | Configure non-default resource allocations for each service. |
| services.[SERVICE].env | map | {} | Key/value pairs of environment variables to append to the given service. |
| services.[SERVICE].labels | map | {} | Key/value pairs of labels to append to the given service. |
| services.[SERVICE].min_replicas | int | 0 | For autoscaled services, configure a minimum replica count. |
| services.[SERVICE].max_replicas | int | VARIES | For autoscaled services, configure a maximum replica count. |
| services.db.shared_buffers | string | "2GB" | Configure the database shared buffers size. This should ideally be 25% of the requested database memory. |
| services.jupyterhub.enabled | boolean | true | Enable the in-platform per-user notebook service. Disabling this falls back to a single shared notebook service. |
| services.jupyterhub.singleUserNotebook.serviceAccountName | string | "snorkelflow-jupyterhub-user-sa" | The name of the service account to bind to the single-user notebook pods. |
| services.jupyterhub.singleUserNotebook.startTimeout | int | 300 | The start timeout of a single-user notebook pod, in seconds. |
| services.jupyterhub.singleUserNotebook.gpu | boolean | false | Whether notebook pods should spin up with a GPU (gpu.gpu_config must be filled out). |
| services.jupyterhub.singleUserNotebook.resources | map | {"cpu_guarantee": 1, "cpu_limit": 1, "memory_guarantee": "2G", "memory_limit": "8G"} | Define resource requests and limits for single-user notebook pods. |
| services.jupyterhub.singleUserNotebook.storage | map | {"dynamicClass": "null", "type": "dynamic"} | Define storage settings for the single-user notebook pods. |
| services.secretGenerator.enabled | boolean | false | Enable the secrets generator job, which dynamically creates secrets once and exits. Typically not needed. |
| volumes.[VOLUME].storageClass | string | VARIES | Specify the storage class of the volume if not using the default. |
| volumes.[VOLUME].storageRequest | string | VARIES | Amount of storage requested for a particular volume. |
| volumes.[VOLUME].volumeName | string |  | (Optional) Specify the volumeName if using a specific PersistentVolume. |
| volumes.[VOLUME].persistentVolume.enabled | boolean | false | Enable creation of a corresponding PersistentVolume object in the charts for a PVC. |
| volumes.[VOLUME].persistentVolume.driver | map | {} | Configure a driver for the PersistentVolume being created. Refer to the Kubernetes documentation for all driver plugins. |
| authorization.adRoles.enabled | boolean | false | Enable Active Directory roles in Snorkel. |
| authorization.adRoles.oidc | map | {} | (Optional) Configure authorization for OIDC. This can also be configured once the platform is running. Example: {"claim": "claim", "prefix": "prefix", "separator": "_"} |
| authorization.adRoles.saml | map | {} | (Optional) Configure authorization for SAML. This can also be configured once the platform is running. Example: {"attributeName": "SnorkelRoles", "prefix": "prefix", "separator": "_"} |
| authentication.jwt | map | {"enabled": false} | Use an external JWT to log in. Please consult Snorkel support before enabling. |
| authentication.oidc | map |  | Configure OIDC authentication at deploy time. This is not required, as OIDC can also be configured once the platform is running. |
| authentication.role | map | {"key": null, "value": null} | (Optional) Define a role from your cloud provider for deployments to use. |
| pretrained_models.enabled | string | false | Enable the model crate Kubernetes job and configure worker pods to use it. (Also requires specifying a valid model crate image in image.imageNames.pretrainedModelImage.) |
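As an illustration, a minimal values.yaml override touching a few of the fields above might look like the following. All values are placeholders for your environment, and the nesting follows the field paths in the table; consult the comments in the shipped values.yaml for the authoritative structure.

```yaml
projectName: snorkeldemo            # should match the namespace created earlier
version: "[YOUR_SNORKEL_FLOW_VERSION]"
traffic:
  ingresses:
    domain: example.com             # placeholder domain
    serviceType: ClusterIP
gpu:
  enabled: false
services:
  db:
    shared_buffers: "2GB"           # ~25% of the database memory request
```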

Deploy Snorkel with the Helm CLI

To see the output of the templated charts with the values in the values.yaml file, run the following:

$ helm template --values [PATH_TO_VALUES_FILE] [PATH_TO_CHART_DIRECTORY]

If things look good, install Snorkel by running the following:

$ helm install [RELEASE_NAME] --values [PATH_TO_VALUES_FILE] [PATH_TO_CHART_DIRECTORY]

At this point the Snorkel AI Data Development Platform should be successfully installed into your cluster!
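To sanity-check the deployment, you can inspect the release and its pods. The release name and namespace below are placeholders; use the values from your own install.

```shell
PROJECT_NAME="snorkeldemo"   # placeholder; use your actual namespace
RELEASE_NAME="snorkel"       # placeholder; use the name passed to helm install

# Show the release's deployment status
helm status "$RELEASE_NAME" -n "$PROJECT_NAME"

# Check that all pods reach Running or Completed
kubectl get pods -n "$PROJECT_NAME"
```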