Evaluation Limits
When using Snorkel for evaluation, each dataset is limited to fewer than 1,000 traces and fewer than 100 steps per trace.
LLM-as-a-judge (LLMAJ) iteration can only be used on train and valid splits.
The benchmark configuration can be exported only via the Snorkel SDK.
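
The trace and step limits above can be checked locally before a dataset is uploaded. The following is a minimal sketch in plain Python, not part of the Snorkel SDK: the function names, the constants, and the representation of a dataset as a list of traces (each a list of steps) are assumptions made for illustration. It also assumes that "1k" means 1,000, that both limits are strict, and that the splits are literally named "train" and "valid".

```python
from typing import List, Sequence

# Limits taken from the text above; treated as strict upper bounds (assumption).
MAX_TRACES = 1000           # a dataset must have fewer than 1,000 traces
MAX_STEPS_PER_TRACE = 100   # each trace must have fewer than 100 steps

# Splits on which LLMAJ iteration is allowed; names are assumed literal.
LLMAJ_ITERATION_SPLITS = {"train", "valid"}


def check_dataset_limits(traces: Sequence[Sequence[object]]) -> List[str]:
    """Return a list of human-readable limit violations (empty if none).

    `traces` is a hypothetical in-memory representation: one inner
    sequence of steps per trace.
    """
    problems: List[str] = []
    if len(traces) >= MAX_TRACES:
        problems.append(
            f"dataset has {len(traces)} traces; must be fewer than {MAX_TRACES}"
        )
    for index, steps in enumerate(traces):
        if len(steps) >= MAX_STEPS_PER_TRACE:
            problems.append(
                f"trace {index} has {len(steps)} steps; "
                f"must be fewer than {MAX_STEPS_PER_TRACE}"
            )
    return problems


def split_allows_llmaj_iteration(split: str) -> bool:
    """Return True if LLMAJ iteration is available on the given split."""
    return split in LLMAJ_ITERATION_SPLITS
```

Running a check like this before upload surfaces every violation at once, so an oversized dataset can be trimmed or split in a single pass.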