Version: 0.94

Using pre-trained models with Snorkel Flow in an air-gapped or disconnected environment

This guide explains the requirements, steps, and validation for deploying your pre-trained model in an air-gapped or disconnected environment.

Prerequisites

This guide applies to the following scenario:

- You have an existing Snorkel Flow installation in an air-gapped or network-constrained environment.
- You want to use our embeddings or FM-related features (e.g., Warm Start, Prompts), or your use cases require other pre-trained models.
- You or your admin have access to the infrastructure running Snorkel Flow (version 0.76+). Currently, we only support Kubernetes-based installations.

Enable model in your environment

Work with your Snorkel contacts to determine which models your use case needs:
- Do you want to use FM Warm Start?
- Do you want to use Prompts?
- Do you want to use other models? For example, SimCSE (https://huggingface.co/princeton-nlp/sup-simcse-roberta-large)
Snorkel then provides the following artifacts:
- A Docker image in snorkelai/pretrained-model-image with a specific version tag
- A copy-pretrained-models-job.yaml file used for deploying the pretrained models
Import the Docker image into your internal image repository and note its location (referred to below as MY_IMAGE_REPO).
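One way to do the import is sketched below. It assumes a workstation with Docker installed that can reach both the source registry and your internal one. The helper name mirror_image and its arguments are hypothetical and not part of Snorkel Flow.

```shell
# Sketch: copy the Snorkel-provided image into an internal registry.
# mirror_image is a hypothetical helper, not part of Snorkel Flow.
# Usage: mirror_image <version-tag> <internal-repo>
mirror_image() {
  src="snorkelai/pretrained-model-image:$1"
  dst="$2/pretrained-model-image:$1"
  docker pull "$src" &&
    docker tag "$src" "$dst" &&
    docker push "$dst" &&
    echo "$dst"   # the location to use for the image field in the Job
}
```

Call it with the version tag Snorkel gave you and your internal repository path. In a fully disconnected setup, running docker save on a connected machine and docker load on the internal side would take the place of the pull step.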

Make relevant edits to your copy-pretrained-models-job.yaml, including these key changes:

- Update image to point to your internal image location.
- Update claimName to the PersistentVolumeClaim you use for the Snorkel Flow data volume. If you are unsure, check which PersistentVolumeClaim is attached in the engine deployment YAML file.
- Update namespace to the Snorkel Flow namespace.

apiVersion: batch/v1
kind: Job
metadata:
  name: snorkel-copy-pretrained-models
  namespace: <REPLACE_WITH_SNORKEL_NAMESPACE>
spec:
  template:
    spec:
      imagePullSecrets:
        - name: regcred
      containers:
      - name: snorkel-pretrained-model-image
        image: <MY_IMAGE_REPO/pretrained-model-image:YOUR_VERSION>
        volumeMounts:
        - mountPath: /data
          name: data
        command: ["python3", "copy_pretrained_models.py"]
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: <REPLACE_WITH_DATA_CLAIM>
      restartPolicy: Never
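
If you prefer not to edit the file by hand, the three placeholders in the manifest above can be filled in with sed. This is a minimal sketch: fill_job_manifest is a hypothetical helper, and it assumes the values you pass contain no |, & or backslash characters.

```shell
# Sketch: fill the three placeholders in the Job manifest with sed.
# fill_job_manifest is a hypothetical helper. It writes the result to stdout
# and leaves the input file untouched.
# Assumes the values contain no "|", "&" or backslash characters.
# Usage: fill_job_manifest <file> <namespace> <image> <claim-name>
fill_job_manifest() {
  sed \
    -e "s|<REPLACE_WITH_SNORKEL_NAMESPACE>|$2|" \
    -e "s|<MY_IMAGE_REPO/pretrained-model-image:YOUR_VERSION>|$3|" \
    -e "s|<REPLACE_WITH_DATA_CLAIM>|$4|" \
    "$1"
}
```

Redirect the output to a new file and review it before handing it to your administrator.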

Ask your administrator to deploy copy-pretrained-models-job.yaml and wait for the Job to complete.
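
For an administrator with kubectl access, deploying and waiting could look like the sketch below. run_copy_job is a hypothetical helper, the Job name is taken from the manifest above, and the 30-minute timeout is an arbitrary assumption to adjust for your model sizes and storage speed.

```shell
# Sketch: deploy the copy Job and block until it completes.
# run_copy_job is a hypothetical helper. The namespace for "apply" comes
# from the manifest itself; the 30m timeout is an arbitrary assumption.
# Usage: run_copy_job <namespace> <manifest-file>
run_copy_job() {
  kubectl apply -f "$2" &&
    kubectl wait -n "$1" --for=condition=complete --timeout=30m \
      job/snorkel-copy-pretrained-models
}
```

If the wait times out, kubectl logs job/snorkel-copy-pretrained-models in the same namespace is a reasonable first place to look.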

In your userconfig, change the enable_pretrained_model_directory flag from false to true and redeploy.

Alternatively, update your helm chart:

helm upgrade -n $namespace --reuse-values --set image.imageNames.pretrainedModelImage=snorkelai/image:tag --set pretrained_models.enabled=true $namespace src/python/snorkelflow/config/helm/snorkelflow

Validation

Check the tdm-api deployment for the following environment variable:

...
  - name: ENABLE_PRETRAINED_MODEL_DIRECTORY
    value: "True"
...
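
A quick way to pull that entry out of the live deployment is sketched below. show_pretrained_flag is a hypothetical helper, and it assumes the deployment is named tdm-api in your Snorkel Flow namespace.

```shell
# Sketch: print the flag as it is set on the tdm-api deployment.
# show_pretrained_flag is a hypothetical helper. It assumes the deployment
# is named tdm-api and lives in the given namespace.
# Usage: show_pretrained_flag <namespace>
show_pretrained_flag() {
  kubectl get deployment tdm-api -n "$1" -o yaml |
    grep -A 1 'name: ENABLE_PRETRAINED_MODEL_DIRECTORY'
}
```

The helper returns a non-zero status when the variable is absent, which usually means the redeploy has not picked up the flag yet.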

If you have access to the underlying filesystem via MinIO or kubectl, check /data/snorkel-pretrained-models/.transformer-models/ to ensure the models you need exist. For example, SimCSE appears as models--princeton-nlp--unsup-simcse-bert-base-uncased.
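
The directory check can be scripted as in the sketch below, run from wherever the data volume is mounted. has_model is a hypothetical helper. It relies on the Hugging Face cache naming scheme, in which a model ID of the form org/name is stored in a directory named models--org--name.

```shell
# Sketch: check that a model's cache directory exists under the models root.
# has_model is a hypothetical helper. It relies on the Hugging Face cache
# naming scheme, where "org/name" is stored as "models--org--name".
# Usage: has_model <models-root> <huggingface-model-id>
has_model() {
  dir="models--$(printf '%s' "$2" | sed 's|/|--|g')"
  if [ -d "$1/$dir" ]; then
    echo "found: $dir"
  else
    echo "missing: $dir" >&2
    return 1
  fi
}
```

For example, has_model /data/snorkel-pretrained-models/.transformer-models princeton-nlp/unsup-simcse-bert-base-uncased checks for the SimCSE directory named above.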
