
Kubernetes Installation (Legacy)

note

For a GPU-enabled environment, please see GPU-enabled Installation.

Prerequisites

Before installing Snorkel Flow, make sure the following prerequisites are met.

  • You have sent Snorkel AI a Docker ID that you’ll use for the initial installation and updates.
  • You have permissions in AWS EC2 to allocate instances with the required hardware specs, as described in the previous section.
  • You have privileges to schedule pods, deployments, services, ingresses, and secrets, as well as to create namespaces, within a specific AWS EKS cluster. When referring to “the EKS cluster” in the remainder of this document, we will be referring to this cluster.
  • You have a workspace with access to EKS and the ability to save the Snorkel Flow configuration file, which will be generated during this setup, to a permanent location. This guide assumes an Ubuntu virtual machine is available.
note

These instructions assume you will install Snorkel Flow on Amazon Elastic Kubernetes Service (EKS). Snorkel Flow can run on any Kubernetes installation, but this tutorial focuses on EKS. If you have specific questions about how Snorkel Flow is configured, please contact Snorkel support. For general management questions regarding Kubernetes or EKS, please refer to their documentation.

System Dependencies

From your workspace, install the command line tools required to set up Kubernetes resources on EKS:

Update apt, install pip and other tools

sudo apt update
sudo apt install python3-pip python-pip-whl
pip3 install kubernetes awscli tenacity --upgrade

Install kubectl

kubectl installation guide: https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/
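At the time of writing, the guide's steps for Linux on amd64 look roughly like the following; prefer the linked page if the two differ:

# Download the current stable kubectl release and install it (from the kubectl guide above)
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
kubectl version --client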

Ensure AWS access

Next, you will need to ensure that your AWS credentials are in place and are for a user with permissions to manage the EKS cluster. You can verify that you are able to communicate with authenticated resources by running:

aws sts get-caller-identity

You should see an ARN that either matches the EKS cluster creator or otherwise has been granted management authorization to the cluster.
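For example, the output will look something like the following (the account and user shown here are hypothetical):

{
    "UserId": "AIDAEXAMPLEUSERID",
    "Account": "123456789012",
    "Arn": "arn:aws:iam::123456789012:user/eks-admin"
}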

Install the snorkelflow Python package

Finally, you will need to install the Snorkel Flow wheel. This will be the main tool for generating Kubernetes configuration files and interacting with the Snorkel SDK. The snorkelflow SDK is bundled in a Docker image called snorkelai/snorkelflow-whl, hosted on Docker Hub. It is recommended to extract the snorkelflow SDK by running the following:

VERSION=<snorkelflow version>
LOCAL_WHL_PATH=/tmp/whl
mkdir -p $LOCAL_WHL_PATH
CONTAINER_ID=$(docker create snorkelai/snorkelflow-whl:$VERSION bash)
docker cp $CONTAINER_ID:/ $LOCAL_WHL_PATH

replacing <snorkelflow version> with the version of snorkelflow you were provided. The wheel file will now be located in $LOCAL_WHL_PATH.
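You can confirm the wheel was extracted; its filename matches the version you set:

ls $LOCAL_WHL_PATH/snorkelflow-$VERSION-py3-none-any.whl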

You can now install the snorkelflow SDK by running:

python3 -m pip install "$LOCAL_WHL_PATH/snorkelflow-$VERSION-py3-none-any.whl[install]"

Set up your cluster

Snorkel Flow can run on any Kubernetes installation as long as it satisfies our requirements on Storage and Ingress.

If you plan to run Snorkel Flow in your existing cluster, whether it's running in a private data center or a public cloud (AWS/GCP/Azure), we will need the following:

Creating the Kubernetes namespace

All resources created by the Snorkel Flow utilities will live in a single namespace within the EKS cluster.

PROJECT_NAME=<A short, unique, alphanumeric name for this instance of Snorkel Flow, such as “snorkeldemo” or “snorkelproduction”>
kubectl create namespace $PROJECT_NAME
kubectl config set-context --current --namespace $PROJECT_NAME

We also set the namespace of the current kubectl context to this new namespace; for the remainder of the setup, we will operate in this namespace unless otherwise noted.
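You can confirm the active namespace for the current context with a standard kubectl check:

kubectl config view --minify --output 'jsonpath={..namespace}'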

Accessing Docker Hub in Kubernetes

In order for your EKS cluster to download Snorkel Flow images, you must give the cluster a registry credential that it can use as an image pull secret. This can be done by running the following:

DOCKER_USERNAME=<username for accessing the Docker registry>
DOCKER_PASSWORD=<password for the associated Docker registry user>
DOCKER_EMAIL=<Email used for Docker Hub>
kubectl create secret docker-registry regcred \
-n $PROJECT_NAME \
--docker-server=index.docker.io \
--docker-username=$DOCKER_USERNAME \
--docker-password=$DOCKER_PASSWORD \
--docker-email=$DOCKER_EMAIL
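For reference only, Kubernetes consumes a pull secret through an imagePullSecrets entry in a pod spec; the generated Snorkel manifests are expected to reference the regcred secret this way:

# Generic pod spec fragment showing how a pull secret is consumed
spec:
  imagePullSecrets:
    - name: regcred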

Generate a Snorkel Flow config

User installation settings for the Snorkel Flow platform are stored in a configuration file, generated below as snorkel-config.yaml. The following instructions generate a default configuration file. For instructions on customizing installation settings, contact a Snorkel AI team member.

DOMAIN=<The domain component of the HOST_IP. For example, if your ultimate location for this instance of Snorkel Flow is snorkeldemo.superscience.com, this variable would be “superscience.com”>
PROJECT_NAME=<A short, unique, alphanumeric name for this instance of Snorkel Flow, such as “snorkeldemo” or “snorkelproduction”. For example, if your ultimate location for this instance of Snorkel Flow is snorkeldemo.superscience.com, this variable would be “snorkeldemo”>
HOST_IP="$PROJECT_NAME.$DOMAIN"
LOCAL_DATA_PATH=<absolute path to a directory that Snorkel Flow will read and write data to. It is recommended to make this data a persistent volume>
WORK_DIR=<absolute path to a directory that Snorkel will use during installation. Some sensitive data is stored in this directory and should not be the same as the LOCAL_DATA_PATH.>
VERSION=<The version of Snorkel Flow being installed. This should be the same as the value specified in the previous step>
snorkel-install generate-config \
--host-ip $HOST_IP \
--platform k8s \
--domain $DOMAIN \
--mount-directory $LOCAL_DATA_PATH \
--work-directory $WORK_DIR \
--project-name $PROJECT_NAME \
--version=$VERSION \
--path snorkel-config.yaml

The HOST_IP will be the DNS name used for the primary ingress. DOMAIN and PROJECT_NAME will be used by the Kubernetes ingress objects to create DNS entries for different sub-services, such as our Training Data Manager API.
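For instance, plugging in the example values from the placeholders above:

# Illustration using the example values from above
PROJECT_NAME=snorkeldemo
DOMAIN=superscience.com
HOST_IP="$PROJECT_NAME.$DOMAIN"
echo $HOST_IP   # prints: snorkeldemo.superscience.com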

Install Snorkel Flow

Finally, we’ll use the snorkel-install CLI to install the full Snorkel Flow platform. The snorkel-install command below will pull the container images from Snorkel AI’s private registries, generate Kubernetes YAML files, and then apply them to the namespace identified by PROJECT_NAME, which we set in the previous section.

snorkel-install bootstrap -c snorkel-config.yaml

If you see an output that says

💫 Snorkel Flow bootstrap successful!

then congratulations, you have successfully installed Snorkel Flow! You now have an ingress resource that points to http://$HOST_IP, which can be used to access the main Snorkel Flow interface. To enable the Snorkel Flow product, you will need to finalize the installation by adding a license key.

If for any reason you encounter an error when bootstrapping, you can inspect the generated YAML files inside $WORK_DIR/kubernetes and troubleshoot the configuration.
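A few generic kubectl checks (not Snorkel-specific) can help narrow down the failure:

# List the generated manifests, then check the state of the pods in the namespace
ls $WORK_DIR/kubernetes
kubectl get pods -n $PROJECT_NAME
kubectl describe pod <pod-name> -n $PROJECT_NAME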

note

Each organization’s needs with respect to ingresses are different. The Snorkel CLI will only generate bare ingresses with minimal configuration; the CLI does not assume what metadata will be required on your ingresses to make them work for your organization. To set up ALB ingress controllers, refer to the primary documentation for ALB ingresses as well as external DNS controllers.

Managing Kubernetes YAML

The Snorkel CLI generates a number of Kubernetes YAML files, including secrets, in the directory $WORK_DIR/kubernetes. These files are applied to the Kubernetes namespace you are active in during bootstrapping and during future upgrades.

During future upgrades, the snorkel-install CLI will regenerate and modify these files. Any customizations you would like to preserve, such as modifications to Snorkel ingresses, will need to be re-applied after the snorkel-install configuration has been regenerated.

This guide describes these generated files in detail. If at any point you have questions, do not hesitate to reach out to your Snorkel point of contact for support.

note

Configuration generation for Snorkel Flow creates secrets that should be kept secure after the Snorkel Flow instance is created. Please be careful to avoid committing these to an insecure location, such as a git repository.

Generated Config

The config package that was generated into $WORK_DIR/kubernetes should include these directories:

  • configmaps
  • volumes
  • ingresses
  • networkpolicies
  • services
  • deployments
  • secrets
  • namespaces

Volumes

Persistent volume claim objects needed for Snorkel Flow are in the volumes directory.

Snorkel Flow uses a volume called data that is shared across almost all pods. The existing config mounts a volume of storage class nfs with a ReadWriteMany access mode into each pod.

You will need to:

  • Pre-configure an NFS or other volume that has ReadWriteMany mode enabled
  • Example NFS volume:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: fileserver
    spec:
      capacity:
        storage: 1T
      storageClassName: nfs
      accessModes:
        - ReadWriteMany
      nfs:
        path: /file-share
        server: ip-address
  • Edit volumes/snorkel-data-pvc.yaml to request a volume from the storage class in the previous step as ReadWriteMany
  • Run kubectl apply -f on the file in the volumes directory to ensure this PVC object gets created (see the check after this list).
  • Edit deployments/*-deployment.yaml to mount the volumes from the previous step (this should be the default setting in these files).

    apiVersion: apps/v1
    kind: Deployment
    ...
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: snorkel-data
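After applying the PVC, you can confirm that the claim binds to your pre-configured volume, using the claim name from the deployment snippet above:

kubectl get pvc snorkel-data -n $PROJECT_NAME
# The STATUS column should read "Bound" once a matching volume is found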

The amount of space needed in the PVC (volumes/snorkel-data-pvc.yaml) depends on your use case; this is where all the datasets in Snorkel Flow will be stored.

Snorkel Flow's Postgres deployment (deployments/db-deployment.yaml) also uses a volume and a corresponding persistent volume claim to store data (volumes/snorkel-postgres-pvc.yaml). You can edit the pvc config as necessary to use the correct storage class. The pvc has access mode ReadWriteOnce, as the Postgres pod is the only pod that will have it mounted.

Snorkel Flow's InfluxDB deployment (deployments/influxdb-deployment.yaml) also uses a volume and a corresponding persistent volume claim to store data (volumes/snorkel-influxdb-pvc.yaml). This pvc should be configured the same way as the postgres pvc is (ReadWriteOnce).

Other methods for configuring data access (such as via a hostPath) are possible.

Ingresses

In the ingresses/ directory, we've provided ingress objects to use with an ingress-nginx controller. For all of these ingress objects, you will need to edit the syntax to fit your exact flavor of ingress (e.g. ingress-nginx, contour) and desired URL paths; a sketch follows the list below.

Ingress objects to adjust:

  • snorkelflow-ingress.yaml: ingress for the primary Snorkel Flow GUI
  • minio-ingress.yaml: ingress for local file upload via a MinIO server
  • tdm-api-ingress.yaml: ingress for API access to Snorkel Flow's training data manager (e.g. for usage with the Snorkel Flow SDK)
  • notebook-ingress.yaml: ingress for Jupyter Notebooks available in Snorkel Flow (e.g. for LF development and modeling)
  • grafana-ingress.yaml: ingress for system and application metrics dashboards
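For example, with ingress-nginx you would typically point each object at your controller's ingress class. A minimal sketch (the exact field placement depends on your Kubernetes version and controller setup):

# Sketch: route an ingress through an ingress-nginx controller
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: snorkelflow-ingress
spec:
  ingressClassName: nginx
  # rules: hostnames and paths as generated by the Snorkel CLI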
note

Each organization’s needs with respect to ingresses are different. The Snorkel CLI will only generate bare ingresses with minimal configuration; the CLI does not assume what metadata will be required on your ingresses to make them work for your organization.

Network Policies

Optional network policy objects are in the networkpolicies/ directory if your Kubernetes infrastructure makes use of network policies.

You can optionally adjust the ingress policies in studio-api, tdm-api, notebook, influxdb, grafana, and flow-ui that are marked # Allow ingress from ingress controller/nodeport to refer to your specific flavor of ingress (e.g. ingress-nginx, contour, nodeport).

You can also optionally adjust egress rules marked Allow DNS resolution to refer to your specific Kubernetes DNS solution (e.g. coredns, kube-dns).
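As an illustration, an adjusted ingress rule for an ingress-nginx controller running in its own namespace might look like the following sketch (the namespace and labels depend on how your controller is deployed):

# Sketch: allow ingress only from pods in the ingress-nginx namespace
ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: ingress-nginx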

Services

Snorkel Flow service objects are in the services/ directory.

The following service objects are configured as NodePort to allow ingress from ingress objects. If your ingress is configured not to use NodePort, or your Kubernetes setup does not support NodePort, the type: NodePort line is safe to delete as long as ingress can still be configured to route to these services (see the sketch after this list).

Services with NodePort:

  • notebook-service.yaml
  • flow-ui-service.yaml
  • minio-service.yaml
  • tdm-api-service.yaml
  • influxdb-service.yaml
  • grafana-service.yaml
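For reference, the line in question is the service type; a sketch of the fragment you would remove or change:

# Sketch of a service spec fragment; dropping "type: NodePort" falls back to ClusterIP
spec:
  type: NodePort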

Deployments

Snorkel Flow deployment objects are in the deployments/ directory.

A few minor adjustments are needed to set these up properly in your Kubernetes cluster:

  • In notebook-deployment.yaml, change the NOTEBOOK_IP environment variable to the full hostname of the ingress in notebook-ingress.yaml
  • In notebook-deployment.yaml, change NOTEBOOK_PORT and NOTEBOOK_TLS_PORT to the ingress listen ports (80 for HTTP and 443 for HTTPS, respectively)
  • In notebook-deployment.yaml, change the value of TORNADO_HOST_IP to the full hostname of the ingress in snorkelflow-ingress.yaml

The CPU/memory requests and limits are benchmarks for getting a basic version of Snorkel Flow running, which is typically suitable for a small number of users. These resource values may be increased as needed, as sketched below.
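As a sketch, a resources block in one of the deployment files can be scaled up like this (the values shown are hypothetical, not Snorkel's sizing recommendations):

resources:
  requests:
    cpu: "2"
    memory: 8Gi
  limits:
    cpu: "4"
    memory: 16Gi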

Secrets

The minio_access_key and minio_secret_key values set in secrets/minio-secret.yaml, along with the influxdb_access_key value set in secrets/influxdb-secret.yaml, are identical base64-encoded strings for snorkeladmin, the default access key and secret key for the MinIO, InfluxDB, and Grafana instances. The influxdb_bucket value set in secrets/influxdb-secret.yaml is a base64-encoded string for snorkel, the default bucket where metrics will be populated in InfluxDB.

These can be updated to the base64-encoded keys of your choice (the MinIO keys will then need to be configured in the SDK).
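To rotate these values, base64-encode your new keys and substitute them into the secret files (the keys shown are hypothetical):

echo -n 'my-new-access-key' | base64    # value for minio_access_key
echo -n 'my-new-secret-key' | base64    # value for minio_secret_key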

Namespaces

The namespaces/ directory contains a simple namespace definition for Snorkel Flow.

GPU-enabled Installation

If you'd like to install a GPU-enabled instance, you'll need to add "default-runtime": "nvidia" at the top level of /etc/docker/daemon.json. For example:

$ cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}

Restart the docker service: sudo service docker restart

You can confirm that the GPU runtime now works by running the following (or whichever nvidia/cuda image is available to your organization):

docker run --rm nvidia/cuda:11.0-base nvidia-smi

$ docker run --rm nvidia/cuda:11.0-base nvidia-smi
Thu Nov 18 22:00:59 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.142.00   Driver Version: 450.142.00   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   22C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Once the above steps have been verified, you can run the installation steps. Running Snorkel Flow containers will utilize this nvidia runtime. You can validate this using the following Python commands in the Snorkel Flow Jupyter Notebook:

>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.current_device()
0
>>> torch.cuda.device(0)
<torch.cuda.device object at 0x...>
>>> torch.cuda.device_count()
1
>>> torch.cuda.get_device_name(0)
'GeForce GTX 950M'
note

If you already started a Snorkel Flow cluster before completing the steps above, you'll have to re-bootstrap the cluster to enable GPU support.

note

If the "nvidia" runtime does not already exist in /etc/docker/daemon.json, you may need to follow the NVIDIA Container Toolkit installation guide to enable it: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker