Version: 0.96

Using pre-trained models with Snorkel Flow in an air-gapped or disconnected environment

This guide explains the requirements, steps, and validation for deploying your pre-trained model in an air-gapped or disconnected environment.

Prerequisites

This guide applies to the following scenario:

  • You have an existing Snorkel Flow installation in an air-gapped or network-constrained environment.
  • You want to use our embeddings or FM-related features (e.g. Warm Start, Prompts), or your use case requires other pre-trained models.
  • You or your admin have access to the infrastructure running Snorkel Flow (version 0.76+). Currently, only Kubernetes-based installations are supported.

Enable model in your environment

  1. Work with your Snorkel contacts to determine which models your use case needs.

  2. After that, Snorkel provides the following artifacts:

    • A Docker image, snorkelai/pretrained-model-image, with a specific version tag
    • A copy-pretrained-models-job.yaml file used for deploying the pretrained models
  3. Import the Docker image into your internal image repository, and note its location, such as MY_IMAGE_REPO.

  4. Make relevant edits to your copy-pretrained-models-job.yaml, including these key changes:

    1. Update image to point to your internal image location.
    2. Update claimName to the same PersistentVolumeClaim that you use for the Snorkel Flow data volume. If you are unsure, check the PersistentVolumeClaim attached to the engine deployment yaml file.
    3. Update namespace to be the Snorkel Flow namespace.
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: snorkel-copy-pretrained-models
      namespace: <REPLACE_WITH_SNORKEL_NAMESPACE>
    spec:
      template:
        spec:
          imagePullSecrets:
            - name: regcred
          containers:
          - name: snorkel-pretrained-model-image
            image: <MY_IMAGE_REPO/pretrained-model-image:YOUR_VERSION>
            volumeMounts:
            - mountPath: /data
              name: data
            command: ["python3", "copy_pretrained_models.py"]
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: <REPLACE_WITH_DATA_CLAIM>
          restartPolicy: Never
  5. Ask your administrator to deploy copy-pretrained-models-job.yaml, then wait for the Job to complete.

  6. In your userconfig, change the enable_pretrained_model_directory flag from false to true and redeploy.

    Alternatively, update your Helm chart:

    helm upgrade -n $namespace --reuse-values --set image.imageNames.pretrainedModelImage=snorkelai/image:tag --set pretrained_models.enabled=true $namespace src/python/snorkelflow/config/helm/snorkelflow
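
Steps 3 to 5 above can be sketched as a few shell functions. This is a minimal sketch under stated assumptions, not an official Snorkel procedure: the registry path registry.example.internal/snorkel, the namespace snorkelflow, and the claim name snorkelflow-data are hypothetical defaults, and the helper names render_job, import_image, and deploy_job are introduced here for illustration only. It also assumes the image is already loaded into a local Docker daemon and that docker and kubectl are available on a host that can reach your registry and cluster.

```shell
# Hypothetical defaults; override them for your environment.
MY_IMAGE_REPO="${MY_IMAGE_REPO:-registry.example.internal/snorkel}"
VERSION="${VERSION:-YOUR_VERSION}"
NAMESPACE="${NAMESPACE:-snorkelflow}"
DATA_CLAIM="${DATA_CLAIM:-snorkelflow-data}"

# Step 4: read the job template on stdin and fill in its three placeholders.
render_job() {
  sed \
    -e "s|<REPLACE_WITH_SNORKEL_NAMESPACE>|${NAMESPACE}|" \
    -e "s|<MY_IMAGE_REPO/pretrained-model-image:YOUR_VERSION>|${MY_IMAGE_REPO}/pretrained-model-image:${VERSION}|" \
    -e "s|<REPLACE_WITH_DATA_CLAIM>|${DATA_CLAIM}|"
}

# Step 3: retag the image for the internal registry and push it.
# Assumes the image already exists in the local Docker daemon.
import_image() {
  docker tag "snorkelai/pretrained-model-image:${VERSION}" \
    "${MY_IMAGE_REPO}/pretrained-model-image:${VERSION}"
  docker push "${MY_IMAGE_REPO}/pretrained-model-image:${VERSION}"
}

# Step 5: apply the rendered Job and block until it completes or times out.
deploy_job() {
  render_job < copy-pretrained-models-job.yaml | kubectl apply -f -
  kubectl wait -n "${NAMESPACE}" --for=condition=complete --timeout=30m \
    job/snorkel-copy-pretrained-models
}
```

Running import_image followed by deploy_job covers the three steps. Keeping the placeholder substitution in render_job lets you inspect the rendered manifest (render_job < copy-pretrained-models-job.yaml) before anything is applied to the cluster.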

Validation

  1. Check that the tdm-api deployment has the following environment variable:

    ...
      - name: ENABLE_PRETRAINED_MODEL_DIRECTORY
        value: "True"
    ...
  2. If you have access to the underlying filesystem via MinIO or kubectl, check /data/snorkel-pretrained-models/.transformer-models/ to ensure that the models you need exist. For example, SimCSE requires models--princeton-nlp--unsup-simcse-bert-base-uncased.
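
Both validation checks can be scripted. The sketch below is illustrative only: check_models is a hypothetical helper, not part of Snorkel Flow, and it assumes you run it in a shell where the data volume is mounted at /data (for example, inside a pod that mounts the same PersistentVolumeClaim). The commented kubectl line assumes the deployment is named tdm-api, as in step 1, and that NAMESPACE holds your Snorkel Flow namespace.

```shell
# Check 1 (run from a host with cluster access):
#   kubectl get deployment tdm-api -n "${NAMESPACE}" -o yaml \
#     | grep -A1 ENABLE_PRETRAINED_MODEL_DIRECTORY

# Check 2: report which expected model directories exist under a root.
# Returns 0 if every model is present, 1 if any is missing.
check_models() {
  local root="$1"
  shift
  local missing=0
  local model
  for model in "$@"; do
    if [ -d "${root}/${model}" ]; then
      echo "found: ${model}"
    else
      echo "missing: ${model}"
      missing=1
    fi
  done
  return "${missing}"
}
```

For the SimCSE example above, the call would be check_models /data/snorkel-pretrained-models/.transformer-models models--princeton-nlp--unsup-simcse-bert-base-uncased. The non-zero exit status on a missing model makes the function usable in automated checks.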