Using pre-trained models with Snorkel Flow in an air-gapped or disconnected environment
This guide explains the requirements, steps, and validation for deploying your pre-trained model in an air-gapped or disconnected environment.
Prerequisites
This guide applies to the following scenario:
- You have an existing Snorkel Flow installation in an air-gapped or network-constrained environment.
- You want to use our embeddings or FM-related features (e.g., Warm Start, Prompts), or your use cases require other pre-trained models.
- You or your admin have access to the infrastructure running Snorkel Flow (version 0.76+). Currently, we only support Kubernetes-based installations.
Enable model in your environment
- Work with your Snorkel contacts to figure out which models are needed for your use case.
  - Do you want to use FM Warm Start?
  - Do you want to use Prompts?
  - Do you want to use other models? For example, SimCSE (https://huggingface.co/princeton-nlp/sup-simcse-roberta-large)
- After that, Snorkel provides the following artifacts:
  - A Docker image on snorkelai/pretrained-model-image with a specific version tag
  - A copy-pretrained-models-job.yaml file used for deploying the pretrained models
- Import the Docker image into your internal image repository, and note its location, such as _MY_IMAGE_REPO_.
- Make relevant edits to your copy-pretrained-models-job.yaml, including these key changes:
  - Update image to point to your internal image location.
  - Update claimName to be the same PersistentVolumeClaim you use for the Snorkel Flow data volume. If you are unsure, check the PersistentVolumeClaim attached to the engine deployment YAML file.
  - Update namespace to be the Snorkel Flow namespace.
apiVersion: batch/v1
kind: Job
metadata:
  name: snorkel-copy-pretrained-models
  namespace: <REPLACE_WITH_SNORKEL_NAMESPACE>
spec:
  template:
    spec:
      imagePullSecrets:
        - name: regcred
      containers:
        - name: snorkel-pretrained-model-image
          image: <MY_IMAGE_REPO/pretrained-model-image:YOUR_VERSION>
          volumeMounts:
            - mountPath: /data
              name: data
          command: ["python3", "copy_pretrained_models.py"]
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: <REPLACE_WITH_DATA_CLAIM>
      restartPolicy: Never
- Ask your administrator to deploy copy-pretrained-models-job.yaml and wait for the Job to complete.
- In your userconfig, change the enable_pretrained_model_directory flag from false to true and redeploy. Alternatively, update your Helm chart:
  helm upgrade -n $namespace --reuse-values --set image.imageNames.pretrainedModelImage=snorkelai/image:tag --set pretrained_models.enabled=true $namespace src/python/snorkelflow/config/helm/snorkelflow
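The three manifest edits described in this section (image, claimName, namespace) can be scripted so that no placeholder is left unfilled. The following is a minimal sketch, not a Snorkel-provided tool: it assumes the placeholder strings in your copy of copy-pretrained-models-job.yaml match the ones shown in this guide, and the function name render_job_manifest and all example values (registry host, namespace, claim name) are hypothetical.

```python
# Placeholder strings copied from the Job manifest in this guide.
PLACEHOLDERS = {
    "namespace": "<REPLACE_WITH_SNORKEL_NAMESPACE>",
    "image": "<MY_IMAGE_REPO/pretrained-model-image:YOUR_VERSION>",
    "claim_name": "<REPLACE_WITH_DATA_CLAIM>",
}


def render_job_manifest(template_text, namespace, image, claim_name):
    """Return the manifest text with all three placeholders replaced."""
    values = {"namespace": namespace, "image": image, "claim_name": claim_name}
    rendered = template_text
    for key, placeholder in PLACEHOLDERS.items():
        if placeholder not in rendered:
            # Stop here so a renamed or missing placeholder is not silently skipped.
            raise ValueError(f"placeholder not found: {placeholder}")
        rendered = rendered.replace(placeholder, values[key])
    return rendered


if __name__ == "__main__":
    # Shortened stand-in for copy-pretrained-models-job.yaml.
    # Every value passed below is hypothetical.
    sample = "\n".join([
        "namespace: <REPLACE_WITH_SNORKEL_NAMESPACE>",
        "image: <MY_IMAGE_REPO/pretrained-model-image:YOUR_VERSION>",
        "claimName: <REPLACE_WITH_DATA_CLAIM>",
    ])
    print(render_job_manifest(
        sample,
        namespace="snorkelflow",
        image="registry.example.internal/pretrained-model-image:1.0",
        claim_name="snorkelflow-data",
    ))
```

In practice you would read the real file's text, pass it through the function, and write the result back before handing it to your administrator. Raising an error on a missing placeholder is a deliberate choice: it surfaces a manifest that differs from the one shown here instead of deploying a partially edited Job.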
Validation
- Check the tdm-api deployment for the environment variable:
  ...
  - name: ENABLE_PRETRAINED_MODEL_DIRECTORY
    value: "True"
  ...
- If you have access to the underlying filesystem via MinIO or kubectl, check /data/snorkel-pretrained-models/.transformer-models/ to ensure the models you need exist. For example, models--princeton-nlp--unsup-simcse-bert-base-uncased for SimCSE.
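Both validation checks can be sketched in a few lines of Python. This is an illustrative helper under stated assumptions, not part of Snorkel Flow: it assumes you have exported the deployment as JSON (for example with kubectl get deployment tdm-api -n <namespace> -o json) and that the model folders follow the Hugging Face cache naming seen in the example above, where princeton-nlp/unsup-simcse-bert-base-uncased becomes models--princeton-nlp--unsup-simcse-bert-base-uncased. All function names are hypothetical.

```python
import tempfile
from pathlib import Path


def pretrained_flag_value(deployment, name="ENABLE_PRETRAINED_MODEL_DIRECTORY"):
    """Return the env var's value from a Deployment object (parsed JSON), or None."""
    containers = deployment["spec"]["template"]["spec"]["containers"]
    for container in containers:
        for env in container.get("env", []):
            if env.get("name") == name:
                return env.get("value")
    return None


def cache_dir_name(model_id):
    """Map 'org/name' to the cache folder name 'models--org--name'."""
    return "models--" + model_id.replace("/", "--")


def missing_models(models_root, model_ids):
    """Return the model ids that have no folder under models_root."""
    root = Path(models_root)
    return [m for m in model_ids if not (root / cache_dir_name(m)).is_dir()]


if __name__ == "__main__":
    # Hand-written stand-in for the tdm-api deployment JSON.
    deployment = {"spec": {"template": {"spec": {"containers": [
        {"name": "tdm-api", "env": [
            {"name": "ENABLE_PRETRAINED_MODEL_DIRECTORY", "value": "True"},
        ]},
    ]}}}}
    print("flag:", pretrained_flag_value(deployment))

    # A temporary directory stands in for
    # /data/snorkel-pretrained-models/.transformer-models/.
    with tempfile.TemporaryDirectory() as root:
        wanted = ["princeton-nlp/unsup-simcse-bert-base-uncased"]
        (Path(root) / cache_dir_name(wanted[0])).mkdir()
        print("missing:", missing_models(root, wanted))
```

Against a real installation, load the kubectl output with json.load and point missing_models at the mounted data volume. A returned value of None from pretrained_flag_value means the variable is not set as a plain value on any container in the deployment.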