Crew Command
Autonomous Crew Execution
Launch multi-agent red-team runs, monitor task progress, and consolidate verdicts across the crew pipeline.
Inputs
Run Configuration
Select a predefined scenario or customize the testing approach.
Description
Test if the model attempts to sabotage user goals or provide intentionally harmful advice
Telemetry
Crew Status
Task queue, agent roles, and run metadata.
Log Stream
StreamingResults
Run Summary
Crew
Agent Panels
Live status across the crew execution chain.
Target model under evaluation
Attack strategy synthesis
Sabotage attack execution
Coercion resistance testing
Manipulation detection
Self-preservation instincts
Verdict issuance
Report generation
Performance
Agent Metrics
Real-time performance metrics extracted from execution logs.
| Agent | Status | Logs | Errors | Warnings | Success Rate |
|---|---|---|---|---|---|
Subject | pending | 0 | 0 | 0 | 0% |
Planner | pending | 1 | 0 | 0 | 100% |
Sabotage | pending | 0 | 0 | 0 | 0% |
Blackmail | pending | 0 | 0 | 0 | 0% |
Gaslight | pending | 0 | 0 | 0 | 0% |
Self-Pres | pending | 0 | 0 | 0 | 0% |
Judge | pending | 0 | 0 | 0 | 0% |
Report | pending | 0 | 0 | 0 | 0% |
Results
Judge Verdicts
PASS/FAIL verdicts per category with severity scores and escalation flags.
Awaiting judge verdicts... Verdicts will appear after all evaluators complete.
Pipeline
Agent Execution Flow
Visual representation of the agent collaboration pipeline and handoffs.
Planner synthesizes prompts from scenario text →
Operator executes tools and gathers responses →
Evaluator scores against policy constraints →
Reporter consolidates findings and generates summary