Safety

Prompt-injection defense, adversarial test results, and replay-based safety monitoring across this workspace.

  • Adversarial pass rate (action: Run adversarial suite)
  • Open failures: cases needing attention
  • Safety improvements: from shadow replays, last 30 days
  • Regressions: replay divergences, last 30 days

Active defenses

  • Prompt-injection guard — user-message boundary isolation & jailbreak pattern detection
  • Grounding safeguards — abstention when evidence is insufficient
  • Guarded reply — confidence-gated auto-send vs. human review
  • Adversarial eval — continuous red-team test corpus
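
The four defenses listed above can be read as stages of one decision pipeline. The following is a minimal sketch under stated assumptions, not this workspace's actual implementation: the function names, the two regex patterns, the `<user_message>` delimiter, and both thresholds are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical jailbreak patterns; a real guard would use a larger, curated set.
JAILBREAK_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (your |the )?system prompt", re.IGNORECASE),
]

# Hypothetical thresholds, for illustration only.
MIN_EVIDENCE = 1            # minimum number of supporting passages
AUTO_SEND_CONFIDENCE = 0.8  # below this, route the reply to a human


def isolate_user_message(text: str) -> str:
    """Boundary isolation: wrap user text so it is treated as data."""
    # Remove delimiter look-alikes the user may have typed to escape the boundary.
    cleaned = text.replace("<user_message>", "").replace("</user_message>", "")
    return f"<user_message>\n{cleaned}\n</user_message>"


def looks_like_jailbreak(text: str) -> bool:
    """Pattern detection: True if any known jailbreak pattern matches."""
    return any(p.search(text) for p in JAILBREAK_PATTERNS)


@dataclass
class Decision:
    action: str  # "block", "abstain", "human_review", or "auto_send"
    reason: str


def decide(user_text: str, evidence: list[str], confidence: float) -> Decision:
    """Apply the guard, the grounding safeguard, then the confidence gate."""
    if looks_like_jailbreak(user_text):
        return Decision("block", "jailbreak pattern matched")
    if len(evidence) < MIN_EVIDENCE:
        return Decision("abstain", "insufficient evidence")
    if confidence < AUTO_SEND_CONFIDENCE:
        return Decision("human_review", "confidence below threshold")
    return Decision("auto_send", "confidence at or above threshold")


def adversarial_pass_rate(cases: list[tuple[str, list[str], float, str]]) -> float:
    """Fraction of (text, evidence, confidence, expected_action) cases handled as expected."""
    if not cases:
        return 0.0
    passed = sum(
        1 for text, evidence, confidence, expected in cases
        if decide(text, evidence, confidence).action == expected
    )
    return passed / len(cases)
```

The checks run from most to least severe, so a matched jailbreak pattern blocks the message before evidence or confidence is considered. A red-team corpus in this sketch is just a list of cases with an expected action, and the pass rate is the share of cases where `decide` returns that action.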

Latest adversarial run
