Version: 25.3

Evaluation for multi-agent systems, using traces

Your GenAI application may produce responses in multiple steps rather than as a single instruction-response pair. These steps may come from a single LLM that responds in multiple steps, or from a multi-agent GenAI system that assembles a response from multiple retrievals or model queries.

A trace records each step of the interaction between the user and the agents so that every step can be evaluated independently.

Evaluating traces follows the same general process outlined in the Evaluation workflow overview, and requires some additional steps as described below.

Preprocess trace data

When you preprocess your dataset for onboarding artifacts, follow this additional step to prepare your trace data.

Map your dataset to the internal traces format and designate a traces column. Here is a schema for the structure of a hierarchical trace in Snorkel. Each trace consists of nested steps, which contain metadata, values, and optional substeps. The schema ensures proper validation and flattening of trace data.

Root Schema: Trace

The root of a trace must conform to the Trace model, which is an extension of the Step model.

Fields:

step_type (str, required): Must be set to ROOT_STEP.
metadata (Dict[str, Union[str, int, float, bool]], required): Metadata associated with the trace.
value (Optional[Union[str, int, float, bool]], optional): Value of the root step.
substeps (List[Step], optional): List of nested substeps.
substep_execution_type (Optional[str], optional, default=serial): Defines execution type for substeps. Allowed values: serial, parallel.
metadata_expand (Optional[Dict[str, str]], optional): Additional metadata fields.

Step Schema: Step

Each step in the trace hierarchy follows this structure.

Fields:

step_type (str, required): Describes the step type.
metadata (Dict[str, Union[str, int, float, bool]], required): Metadata for the step.
value (Optional[Union[str, int, float, bool]], optional): Value of the step. If the step has no substeps, this field must be present.
substeps (List[Step], optional): List of nested substeps.
substep_execution_type (Optional[str], optional, default=serial): Execution type of substeps (serial or parallel).
metadata_expand (Optional[Dict[str, str]], optional): Expanded metadata for additional processing.

Validation Rules

The root step (Trace) must have step_type='ROOT_STEP'.
If a step has no substeps, it must have a value.
substep_execution_type must be either serial or parallel.
Additional fields outside the schema are forbidden.
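The rules above can be checked locally before you upload a dataset. The following is a minimal sketch in plain Python, not Snorkel's own validator: the helper name `validate_step` and its error messages are hypothetical, and the checks are based only on the field descriptions and validation rules listed in this section.

```python
from typing import Any, Dict, List

# Field names taken from the Step schema above; anything else is rejected.
ALLOWED_FIELDS = {
    "step_type", "metadata", "value",
    "substeps", "substep_execution_type", "metadata_expand",
}
SCALAR_TYPES = (str, int, float, bool)


def validate_step(step: Dict[str, Any], is_root: bool = True, path: str = "trace") -> List[str]:
    """Hypothetical helper: list the rule violations for a step and its substeps."""
    errors: List[str] = []

    extra = set(step) - ALLOWED_FIELDS
    if extra:
        errors.append(f"{path}: unexpected fields {sorted(extra)}")

    step_type = step.get("step_type")
    if not isinstance(step_type, str):
        errors.append(f"{path}: step_type must be a string")
    elif is_root and step_type != "ROOT_STEP":
        errors.append(f"{path}: root step_type must be 'ROOT_STEP'")

    metadata = step.get("metadata")
    if not isinstance(metadata, dict) or not all(
        isinstance(k, str) and isinstance(v, SCALAR_TYPES) for k, v in metadata.items()
    ):
        errors.append(f"{path}: metadata must map strings to scalar values")

    value = step.get("value")
    if value is not None and not isinstance(value, SCALAR_TYPES):
        errors.append(f"{path}: value must be a string, number, or boolean")

    substeps = step.get("substeps") or []
    if not isinstance(substeps, list):
        errors.append(f"{path}: substeps must be a list")
        substeps = []
    if not substeps and value is None:
        errors.append(f"{path}: a step without substeps must have a value")

    execution_type = step.get("substep_execution_type")
    if execution_type is not None and execution_type not in ("serial", "parallel"):
        errors.append(f"{path}: substep_execution_type must be 'serial' or 'parallel'")

    # Recurse into children; only the top-level call is treated as the root.
    for index, substep in enumerate(substeps):
        child_path = f"{path}.substeps[{index}]"
        if isinstance(substep, dict):
            errors.extend(validate_step(substep, is_root=False, path=child_path))
        else:
            errors.append(f"{child_path}: step must be an object")
    return errors
```

An empty list means the trace passed these checks. Passing them does not guarantee that Snorkel accepts the trace, because the platform may enforce rules that this page does not describe.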

Example JSON Trace

[
  {
    "step_type": "ROOT_STEP",
    "metadata": { "source": "user_chat", "tokens": 5, "latency": 0.1 },
    "value": "User starts a conversation with AI agent",
    "substeps": [
      {
        "step_type": "USER_MESSAGE",
        "metadata": { "user_id": "12345", "tokens": 7, "latency": 0.2 },
        "value": "Hey, can you summarize this document?",
        "substeps": [
          {
            "step_type": "AI_RESPONSE",
            "metadata": {
              "agent": "primary_AI",
              "tokens": 12,
              "latency": 0.3
            },
            "value": "Sure! Let me check if I need to retrieve additional information.",
            "substeps": [
              {
                "step_type": "DOC_RETRIEVAL",
                "metadata": {
                  "retrieval_agent": "secondary_AI",
                  "tokens": 10,
                  "latency": 0.4
                },
                "value": "Retrieving document summary...",
                "substeps": []
              },
              {
                "step_type": "AI_RESPONSE",
                "metadata": {
                  "agent": "primary_AI",
                  "tokens": 15,
                  "latency": 0.5
                },
                "value": "Here is a summary of the document: ...",
                "substeps": []
              }
            ]
          }
        ]
      }
    ],
    "substep_execution_type": "serial"
  },
  {
    "step_type": "ROOT_STEP",
    "metadata": { "source": "user_chat", "tokens": 6, "latency": 0.12 },
    "value": "User asks AI to translate a phrase",
    "substeps": [
      {
        "step_type": "USER_MESSAGE",
        "metadata": { "user_id": "67890", "tokens": 8, "latency": 0.22 },
        "value": "Can you translate 'Hello, how are you?' to French?",
        "substeps": [
          {
            "step_type": "AI_RESPONSE",
            "metadata": {
              "agent": "primary_AI",
              "tokens": 9,
              "latency": 0.28
            },
            "value": "Sure! The translation is 'Bonjour, comment ça va?'.",
            "substeps": []
          }
        ]
      }
    ],
    "substep_execution_type": "serial"
  }
]
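To illustrate what flattening a hierarchical trace can look like, here is a small sketch that walks a trace depth-first and emits one flat row per step. It is an illustration only: the function name `flatten_trace`, the `path` and `depth` columns, and the `metadata.` prefix are assumptions made for this example, and Snorkel's internal flattening may differ. The sample data is the second trace from the example above.

```python
from typing import Any, Dict, Iterator


def flatten_trace(step: Dict[str, Any], depth: int = 0, path: str = "0") -> Iterator[Dict[str, Any]]:
    """Yield one flat row per step, visiting each parent before its substeps."""
    yield {
        "path": path,          # position in the tree, for example "0.0.1"
        "depth": depth,        # 0 for the root step
        "step_type": step["step_type"],
        "value": step.get("value"),
        **{f"metadata.{k}": v for k, v in step.get("metadata", {}).items()},
    }
    for index, substep in enumerate(step.get("substeps") or []):
        yield from flatten_trace(substep, depth + 1, f"{path}.{index}")


trace = {
    "step_type": "ROOT_STEP",
    "metadata": {"source": "user_chat", "tokens": 6, "latency": 0.12},
    "value": "User asks AI to translate a phrase",
    "substeps": [
        {
            "step_type": "USER_MESSAGE",
            "metadata": {"user_id": "67890", "tokens": 8, "latency": 0.22},
            "value": "Can you translate 'Hello, how are you?' to French?",
            "substeps": [
                {
                    "step_type": "AI_RESPONSE",
                    "metadata": {"agent": "primary_AI", "tokens": 9, "latency": 0.28},
                    "value": "Sure! The translation is 'Bonjour, comment ça va?'.",
                    "substeps": [],
                }
            ],
        }
    ],
    "substep_execution_type": "serial",
}

rows = list(flatten_trace(trace))
total_tokens = sum(row.get("metadata.tokens", 0) for row in rows)

# Print an indented outline of the trace with per-step token counts.
for row in rows:
    print("  " * row["depth"] + row["step_type"], row["metadata.tokens"])
print("total tokens:", total_tokens)
```

For this trace the walk produces three rows (the root, the user message, and the AI response), and the token counts sum to 6 + 8 + 9 = 23.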

note

Only a single data split (train) is supported for traces. Any malformed trace is skipped and is not included in the Snorkel dataset.
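Because malformed traces are skipped, it can be useful to count how many rows would be affected before you upload. The sketch below assumes that each row stores one trace as a JSON string, which is an assumption about your source data rather than a documented Snorkel requirement. The helper name `split_parseable` is hypothetical, and it checks only that the JSON parses and that the root step is a ROOT_STEP.

```python
import json
from typing import Any, List, Tuple


def split_parseable(raw_traces: List[str]) -> Tuple[List[Any], List[int]]:
    """Return (parsed traces, indexes of rows that fail the basic checks)."""
    kept: List[Any] = []
    rejected: List[int] = []
    for index, raw in enumerate(raw_traces):
        try:
            trace = json.loads(raw)
        except json.JSONDecodeError:
            rejected.append(index)  # not valid JSON
            continue
        if isinstance(trace, dict) and trace.get("step_type") == "ROOT_STEP":
            kept.append(trace)
        else:
            rejected.append(index)  # parsed, but the root is not a ROOT_STEP
    return kept, rejected


raw_rows = [
    '{"step_type": "ROOT_STEP", "metadata": {}, "value": "ok"}',
    '{"step_type": "ROOT_STEP", "metadata": {}',  # truncated JSON
    '{"step_type": "USER_MESSAGE", "metadata": {}, "value": "x"}',
]
kept, rejected = split_parseable(raw_rows)
print(f"{len(kept)} usable, {len(rejected)} likely to be skipped: rows {rejected}")
```

Reviewing the rejected row indexes lets you repair the source data instead of discovering missing traces after import.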

[Image: Import Traces Dataset]

Trace viewer

The Trace Viewer is a powerful interface for examining and analyzing agent execution traces. It provides a comprehensive set of features to help you navigate, search, and evaluate complex agent interactions.

Key Features

[Image: Trace Viewer features]

Search Functionality: Quickly locate specific steps (step_type) using the search bar at the top of the viewer.
Hierarchical Trace Visualization: View traces in their natural hierarchical tree structure, with parent-child relationships clearly displayed. This makes it easy to understand the flow of execution and the relationship between different steps.
Detailed Metadata Display: For each step, view important metadata including:
- Token count: See how many tokens were consumed
- Latency measurements: Track performance with precise timing data

Click any step in the tree view to see additional metadata, such as agent identification.

Flexible Navigation Controls:
- Expand/collapse individual steps or entire branches of the trace tree
- Drill down into specific sections of interest while hiding irrelevant details
- Navigate complex traces efficiently with intuitive controls
Pagination Support: Browse through large collections of traces with built-in pagination controls, making it manageable to work with extensive datasets.
Evaluation Status Indicators: Quickly identify which steps have been evaluated or annotated with visual status indicators, helping you track progress in your evaluation workflow.

[Image: Trace create benchmark]
