Using Pre-trained Models with Snorkel Flow in an Air-Gapped or Disconnected Environment
Prerequisites
This guide applies to the following scenario:
- You have an existing Snorkel Flow installation in an air-gapped or network-constrained environment.
- You want to use our embeddings or FM-related features (e.g., Warm Start, Prompts), or your use case requires other pre-trained models.
- You or your admin have access to the infrastructure running Snorkel Flow (version 0.76+). Currently, only Kubernetes-based installations are supported.
Steps
- Talk to your Snorkel Flow contacts, who will work with you to determine which models are needed for your use case:
  - Do you want to use FM Warm Start?
  - Do you want to use Prompts?
  - Do you want to use other models?
    - For example, SimCSE (https://huggingface.co/princeton-nlp/sup-simcse-roberta-large)
- After that, we will provide the following artifacts:
  - A Docker image in the snorkelai/pretrained-model-image repository with a specific version tag
  - A YAML file (copy-pretrained-models-job.yaml) that is used to deploy the pre-trained models
- Import the Docker image into your internal image repository, and make note of its location (e.g., MY_IMAGE_REPO).
- Edit your copy-pretrained-models-job.yaml. The key changes are:
  - Update image to point to your internal image location.
  - Update claimName to the PersistentVolumeClaim that you use for the Snorkel Flow data volume.
    - If you are unsure, check the PersistentVolumeClaim attached to the engine deployment YAML.
  - Update namespace to the namespace where Snorkel Flow is installed.
apiVersion: batch/v1
kind: Job
metadata:
  name: snorkel-copy-pretrained-models
  namespace: <REPLACE_WITH_SNORKEL_NAMESPACE>
spec:
  template:
    spec:
      imagePullSecrets:
        - name: regcred
      containers:
        - name: snorkel-pretrained-model-image
          image: <MY_IMAGE_REPO/pretrained-model-image:YOUR_VERSION>
          volumeMounts:
            - mountPath: /data
              name: data
          command: ["python3", "copy_pretrained_models.py"]
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: <REPLACE_WITH_DATA_CLAIM>
      restartPolicy: Never
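The three edits to the Job manifest can also be scripted. The following is a minimal sketch, not part of the Snorkel Flow tooling: fill_job_template is a hypothetical helper, and it assumes your copy of the YAML still contains the three placeholder strings exactly as shown above. The example values in the usage note are illustrative only.

```python
# Hypothetical helper (not shipped with Snorkel Flow): fills the three
# placeholders that appear in copy-pretrained-models-job.yaml.
PLACEHOLDERS = {
    "<REPLACE_WITH_SNORKEL_NAMESPACE>": "namespace",
    "<MY_IMAGE_REPO/pretrained-model-image:YOUR_VERSION>": "image",
    "<REPLACE_WITH_DATA_CLAIM>": "claim_name",
}


def fill_job_template(template: str, namespace: str, image: str, claim_name: str) -> str:
    """Return the manifest text with every placeholder replaced.

    Raises ValueError if a placeholder is missing, so a manifest that was
    already edited by hand is not silently passed through.
    """
    values = {"namespace": namespace, "image": image, "claim_name": claim_name}
    filled = template
    for placeholder, key in PLACEHOLDERS.items():
        if placeholder not in filled:
            raise ValueError(f"placeholder not found in template: {placeholder}")
        filled = filled.replace(placeholder, values[key])
    return filled
```

For example, reading the file, calling fill_job_template(text, "snorkel", "registry.example.internal/pretrained-model-image:1.0", "snorkel-data"), and writing the result back produces a manifest ready to hand to your administrator. All three argument values here are made up; use the ones from your own environment.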
- Ask your administrator to deploy copy-pretrained-models-job.yaml and wait for the Job to complete.
- In your userconfig, change the enable_pretrained_model_directory flag from false to true and redeploy (or make the relevant changes in your Helm chart).
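The deploy-and-wait step can be sketched as follows for administrators who prefer to script it. This is an illustration under assumptions, not an official procedure: it assumes kubectl is installed and configured for the cluster, the Job name is taken from metadata.name in the manifest above, and the 30-minute timeout is an arbitrary choice.

```python
import subprocess

# Job name as given in metadata.name of copy-pretrained-models-job.yaml.
JOB_NAME = "snorkel-copy-pretrained-models"


def build_deploy_commands(manifest_path: str, namespace: str, timeout: str = "30m"):
    """Build the kubectl invocations that apply the Job and wait for it."""
    apply_cmd = ["kubectl", "apply", "-f", manifest_path]
    wait_cmd = [
        "kubectl", "wait",
        "--for=condition=complete",
        f"job/{JOB_NAME}",
        "--namespace", namespace,
        f"--timeout={timeout}",  # assumed value; large models may need longer
    ]
    return [apply_cmd, wait_cmd]


def deploy_and_wait(manifest_path: str, namespace: str, timeout: str = "30m") -> None:
    """Run the commands in order; raises CalledProcessError on failure or timeout."""
    for cmd in build_deploy_commands(manifest_path, namespace, timeout):
        subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution makes it easy to print and review the exact commands before running them against the cluster.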
Validation
- Check that the tdm-api deployment has the following environment variable set:
...
- name: ENABLE_PRETRAINED_MODEL_DIRECTORY
  value: "True"
...
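The environment variable check can be automated with a small script. The sketch below assumes the deployment is named tdm-api and that you have saved its definition as JSON (for example with kubectl get deployment tdm-api -n <namespace> -o json); find_env_value is a hypothetical helper that only reads the standard Deployment structure.

```python
import json
import sys


def find_env_value(deployment: dict, var_name: str):
    """Return the value of var_name from the first container that defines it.

    Returns None if no container in the pod template sets the variable.
    """
    containers = deployment["spec"]["template"]["spec"]["containers"]
    for container in containers:
        for env in container.get("env", []):
            if env.get("name") == var_name:
                return env.get("value")
    return None


if __name__ == "__main__":
    # Reads the deployment JSON from standard input.
    value = find_env_value(json.load(sys.stdin), "ENABLE_PRETRAINED_MODEL_DIRECTORY")
    print("ENABLE_PRETRAINED_MODEL_DIRECTORY =", value)
    sys.exit(0 if value == "True" else 1)
```

When the script is saved to a file and the kubectl JSON output is piped into it, it exits with a non-zero status if the flag is missing or not set to "True".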
- If you have access to the underlying filesystem (either via MinIO or kubectl), check `/data/snorkel-pretrained-models/.transformer-models/` and make sure the models you need exist.
  - For example: models--princeton-nlp--unsup-simcse-bert-base-uncased for SimCSE
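The filesystem check can be scripted as well. The sketch below assumes the directory names follow the pattern in the example above, where a Hugging Face repository ID such as princeton-nlp/unsup-simcse-bert-base-uncased becomes models--princeton-nlp--unsup-simcse-bert-base-uncased. The helper names are hypothetical, and the script must run somewhere the data volume is mounted.

```python
from pathlib import Path
from typing import List, Sequence

# Directory named in the validation step above.
MODEL_ROOT = Path("/data/snorkel-pretrained-models/.transformer-models")


def cache_dir_name(repo_id: str) -> str:
    """Map a repository ID like 'org/model' to 'models--org--model'."""
    return "models--" + repo_id.replace("/", "--")


def missing_models(model_root: Path, repo_ids: Sequence[str]) -> List[str]:
    """Return the repository IDs that have no directory under model_root."""
    return [
        repo_id
        for repo_id in repo_ids
        if not (model_root / cache_dir_name(repo_id)).is_dir()
    ]
```

Calling missing_models(MODEL_ROOT, ["princeton-nlp/unsup-simcse-bert-base-uncased"]) returns an empty list when the model directory is present, and the list of absent repository IDs otherwise.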