Version: 0.93

Using Pre-trained Models with Snorkel Flow in an Air-Gapped or Disconnected Environment

Prerequisites

This guide applies to the following scenario:

  • You have an existing Snorkel Flow installation in an air-gapped or network-constrained environment
  • You want to use our embeddings or FM-related features (e.g. Warm Start, Prompts), or your use case requires other pre-trained models
  • You or your admin have access to the infrastructure running Snorkel Flow (version 0.76+). Currently, we only support Kubernetes-based installations.

Steps

  1. Talk to your Snorkel Flow contacts, who will work with you to figure out which models are needed for your use case.
    1. Do you want to use FM Warm Start?
    2. Do you want to use Prompts?
    3. Do you want to use other models?
      1. For example, SimCSE (https://huggingface.co/princeton-nlp/sup-simcse-roberta-large)
  2. After that, we will provide the following artifacts:
    1. A Docker image from snorkelai/pretrained-model-image with a specific version tag
    2. A YAML file (copy-pretrained-models-job.yaml) that will be used for deploying the pretrained models
  3. Import the docker image into your internal image repository, and make note of its location (e.g. MY_IMAGE_REPO)
  4. Make the relevant edits to your copy-pretrained-models-job.yaml. The three key changes are:
    1. Update image to point to your internal image location
    2. Update claimName to the PersistentVolumeClaim you use for the Snorkel Flow data volume
      1. If you are unsure, check which PersistentVolumeClaim is attached in the engine deployment YAML
    3. Update namespace to the namespace where Snorkel Flow is installed
apiVersion: batch/v1
kind: Job
metadata:
  name: snorkel-copy-pretrained-models
  namespace: <REPLACE_WITH_SNORKEL_NAMESPACE>
spec:
  template:
    spec:
      imagePullSecrets:
        - name: regcred
      containers:
      - name: snorkel-pretrained-model-image
        image: <MY_IMAGE_REPO/pretrained-model-image:YOUR_VERSION>
        volumeMounts:
        - mountPath: /data
          name: data
        command: ["python3", "copy_pretrained_models.py"]
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: <REPLACE_WITH_DATA_CLAIM>
      restartPolicy: Never
  5. Ask your administrator to deploy copy-pretrained-models-job.yaml and wait for the Job to complete.
  6. In your userconfig, change the enable_pretrained_model_directory flag from false to true and redeploy (or make the equivalent changes in your Helm chart).
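
The placeholder edits in step 4 can also be scripted so that they are applied the same way every time. The following is a minimal Python sketch, not an official Snorkel Flow tool. The helper names (render_job, render_job_file) and the example values are hypothetical, and the sketch assumes your copy of copy-pretrained-models-job.yaml uses exactly the three placeholder strings shown in the YAML above.

```python
from pathlib import Path

# Placeholder strings as they appear in the job template shown above.
# Assumption: the file you received uses exactly these placeholders.
NAMESPACE_PLACEHOLDER = "<REPLACE_WITH_SNORKEL_NAMESPACE>"
IMAGE_PLACEHOLDER = "<MY_IMAGE_REPO/pretrained-model-image:YOUR_VERSION>"
CLAIM_PLACEHOLDER = "<REPLACE_WITH_DATA_CLAIM>"


def render_job(template: str, namespace: str, image: str, claim_name: str) -> str:
    """Return the job manifest text with all three placeholders filled in."""
    replacements = {
        NAMESPACE_PLACEHOLDER: namespace,
        IMAGE_PLACEHOLDER: image,
        CLAIM_PLACEHOLDER: claim_name,
    }
    rendered = template
    for placeholder, value in replacements.items():
        if placeholder not in rendered:
            # Fail loudly instead of producing a half-edited manifest.
            raise ValueError(f"placeholder not found in template: {placeholder}")
        rendered = rendered.replace(placeholder, value)
    return rendered


def render_job_file(path: Path, namespace: str, image: str, claim_name: str) -> None:
    """Fill the placeholders in the file at `path`, in place."""
    path.write_text(render_job(path.read_text(), namespace, image, claim_name))


# Example call (all values are hypothetical):
# render_job_file(
#     Path("copy-pretrained-models-job.yaml"),
#     namespace="snorkel",
#     image="registry.example.com/pretrained-model-image:1.0.0",
#     claim_name="snorkel-data",
# )
```

Once the file is rendered, an administrator would typically deploy it with kubectl apply -f copy-pretrained-models-job.yaml and can block until the Job finishes with, for example, kubectl wait --for=condition=complete --timeout=30m job/snorkel-copy-pretrained-models -n <namespace>. The 30-minute timeout is only an illustrative value.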

Validation

  1. Check that the tdm-api deployment sets the following environment variable:
...
        - name: ENABLE_PRETRAINED_MODEL_DIRECTORY
          value: "True"
...
  2. If you have access to the underlying filesystem (either via MinIO or kubectl), check `/data/snorkel-pretrained-models/.transformer-models/` and make sure the models you need exist.
    1. For example: models--princeton-nlp--unsup-simcse-bert-base-uncased for SimCSE
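
Both validation checks can be scripted. The sketch below is illustrative Python under stated assumptions, not part of Snorkel Flow: env_flag_enabled and missing_models are hypothetical helper names, the deployment dictionary is assumed to be the parsed output of kubectl get deployment tdm-api -n <namespace> -o json, and the model directory is assumed to be reachable as a local path (for example, from a pod that mounts the data volume).

```python
from pathlib import Path
from typing import Iterable, List


def env_flag_enabled(deployment: dict,
                     name: str = "ENABLE_PRETRAINED_MODEL_DIRECTORY") -> bool:
    """Return True if any container in the Deployment sets `name` to "True".

    `deployment` is a Kubernetes Deployment object parsed from JSON.
    The comparison is case-insensitive, which is an assumption; the
    documented value is "True".
    """
    containers = deployment["spec"]["template"]["spec"]["containers"]
    for container in containers:
        for env in container.get("env", []):
            if env.get("name") == name:
                return str(env.get("value", "")).lower() == "true"
    return False


def missing_models(model_dir: Path, expected: Iterable[str]) -> List[str]:
    """Return the expected model directory names not found under `model_dir`."""
    return [m for m in expected if not (model_dir / m).is_dir()]


# Example call (the path is the one from step 2; the list is whatever
# models were agreed on for your use case):
# missing = missing_models(
#     Path("/data/snorkel-pretrained-models/.transformer-models"),
#     ["models--princeton-nlp--unsup-simcse-bert-base-uncased"],
# )
# if missing:
#     print("Missing models:", missing)
```

An empty list from missing_models means every expected model directory is present. A False result from env_flag_enabled suggests the userconfig change has not been redeployed yet.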