How ABC Standard Works

A technical foundation for capturing, discovering, and composing reusable AI agent behaviors through abstraction and systematic documentation.

📋

Implementation Guide

Core Concept: Portability Through Abstraction

The fundamental insight that drives the ABC Standard:

Every valuable AI agent behavior has been solved before in another domain. ABC cards capture the abstract pattern so it can be discovered and reused across contexts.

This principle manifests in the Swap Test:

The Swap Test: "If you can't swap the implementation without changing the interface, it's not actually an adaptation point — it's a dependency." This is the core principle that defines what makes an adaptation point truly valuable.

ABC Card Anatomy: 7 Required Sections

Every ABC card must include these seven sections to be considered complete:

  1. Identity — Name, version, authors, tags. The unique fingerprint of the card.
  2. Problem Pattern — Category, description, sub-patterns, analogous domains. What problem does this solve, and where has it appeared before?
  3. Behavior Specification — Trigger, inputs, outputs, reasoning. The functional contract of the behavior.
  4. Domain Assumptions — Data, environment, authority. Each marked as hard (must have) or soft (degrades gracefully).
  5. Adaptation Points — Swappable components, configurable parameters, extensible capabilities. Where and how can this be modified?
  6. Composition — Events emitted/consumed, delegates. How does this interact with other behaviors?
  7. Provenance — Origin domain, lineage, fork history. Where did this come from and how has it evolved?

Adaptation Point Types

ABC Standard defines three types of adaptation points, each with different implementation requirements:

swappable_component

Complete implementation replacement without interface change. Example: Swap an ARIMA forecaster for a neural network predictor while maintaining the same input/output signature.

Example swap:

```
swap ARIMAForecaster NeuralNetForecaster  # Same interface, completely different algorithm
```

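The swap can be sketched in Python. The `Forecaster` protocol and the toy forecast bodies below are illustrative assumptions, not part of the standard; the point is that the caller depends only on the interface, so either implementation can be substituted:

```python
from typing import Protocol

class Forecaster(Protocol):
    """The shared interface: any forecaster maps a history to a prediction."""
    def forecast(self, history: list[float], horizon: int) -> list[float]: ...

class ARIMAForecaster:
    def forecast(self, history: list[float], horizon: int) -> list[float]:
        # Toy stand-in: repeat the last observed value.
        return [history[-1]] * horizon

class NeuralNetForecaster:
    def forecast(self, history: list[float], horizon: int) -> list[float]:
        # Toy stand-in: repeat the mean of the history.
        return [sum(history) / len(history)] * horizon

def run_pipeline(model: Forecaster, history: list[float]) -> list[float]:
    # The pipeline never names a concrete class, so it passes the Swap Test.
    return model.forecast(history, horizon=3)

print(run_pipeline(ARIMAForecaster(), [1.0, 2.0, 3.0]))      # [3.0, 3.0, 3.0]
print(run_pipeline(NeuralNetForecaster(), [1.0, 2.0, 3.0]))  # [2.0, 2.0, 2.0]
```

If `run_pipeline` had to change when the model changed, the forecaster would be a dependency rather than an adaptation point.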
configurable

Parameter tuning within documented ranges. Example: Optimization horizon can be set from 1 to 90 days depending on domain needs.

Example configuration:

```
optimization_horizon: 45 days  # Valid: 1-90
```

extensible

Add new capabilities without changing existing ones. Follows the Open/Closed Principle: open for extension, closed for modification.

Example extension:

```
extend PostProcessingHook { on_forecast(result) → CustomValidation() }
```

Domain Assumptions Classification

Every assumption in a domain must be classified and documented:

| Type | Meaning | Action Required |
| --- | --- | --- |
| hard | Behavior breaks completely without this assumption | Must have runtime guards that fail fast |
| soft | Behavior degrades gracefully if assumption fails | Document degradation path; provide fallback mechanism |

Guard Example: If your behavior requires fresh data (hard assumption), implement a guard that checks timestamp and raises an exception if data is stale. Don't silently proceed.
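A minimal Python sketch of such a freshness guard; the `StaleDataError` name and the one-hour limit are illustrative assumptions, not part of the standard:

```python
from datetime import datetime, timedelta, timezone

class StaleDataError(Exception):
    """Raised when the freshness guard for a hard assumption fails."""

def guard_fresh_data(timestamp: datetime, max_age: timedelta) -> None:
    # Hard assumption: data must be fresh. Fail fast instead of silently proceeding.
    age = datetime.now(timezone.utc) - timestamp
    if age > max_age:
        raise StaleDataError(f"data is {age} old, limit is {max_age}")

# Usage: call the guard before the behavior runs.
fresh = datetime.now(timezone.utc) - timedelta(minutes=5)
guard_fresh_data(fresh, max_age=timedelta(hours=1))  # passes silently
```

A soft assumption would instead log the condition and switch to a documented fallback rather than raising.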

Compliance Scoring: 100-Point System

ABC cards are scored on how thoroughly they implement the standard:

| Criteria | Points | Evaluation |
| --- | --- | --- |
| Formal interfaces per adaptation point | 15 | Each AP has a documented, testable interface |
| 2+ implementations per adaptation point | 15 | Prove the AP actually works by having 2+ implementations |
| Swap tests passing | 20 | Test suite validates swapping between implementations |
| Configurable parameters externalized | 10 | Config is outside code, validated at runtime |
| Hard assumption runtime guards | 10 | Guards check preconditions, fail fast if violated |
| Soft degradation paths documented | 5 | Clear fallback behavior when soft assumptions fail |
| Event schemas typed | 5 | All emitted events have explicit schemas |
| Decoupled event dispatch | 10 | Behaviors communicate via event bus, not direct calls |
| Isolation tests present | 5 | Each behavior tested independently |
| Required directory structure | 5 | Follows ABC project layout conventions |

Compliance Levels

90-100: ABC Certified
Fully backed by code and tests. Ready for production in the registry.

70-89: ABC Compatible
Most claims backed by code. Gaps are documented. Usable with known limitations.

50-69: ABC Aspirational
Describes intent and direction. Only partially implemented. For internal discussion or future work.

Below 50: Non-compliant
Card is largely fiction. Does not qualify for registry publication.

Card Author Checklist

Before submitting a card to the registry, verify all 14 of these items:

🔍

Similarity Engine

The Problem with Manual Similarity Scoring

Human-assigned similarity scores don't scale and are opaque:

similarity: 0.82

But what does 0.82 actually mean? Is it comparable to a 0.82 assigned by a different reviewer? Would two humans even give the same score to the same pair of cards? These are the fundamental problems with manual similarity assessment.

The ABC Similarity Engine solves this by decomposing similarity into six independent, computable dimensions. Each dimension answers a specific question and is calculated algorithmically or via LLM analysis.

Six Decomposed Dimensions

Each dimension is weighted and answers a specific question about how similar two cards are:

| Dimension | Weight | Question | Computation |
| --- | --- | --- | --- |
| Problem Pattern Similarity | 30% | Do they solve the same abstract problem? | LLM-powered semantic analysis |
| Sub-Pattern Overlap | 15% | Do they share structural sub-problems? | Jaccard similarity: 40% exact + 60% token overlap |
| I/O Structural Similarity | 20% | Are inputs/outputs compatible? | Input name (25%) + input type (15%) + count (20%) + output name (25%) + output type (15%) |
| Reasoning Similarity | 10% | Do they use the same decision approach? | LLM semantic comparison of method and approach |
| Adaptation Portability | 15% | How much work to fork and adapt? | Assumption softness (35%) + swappability (40%) + AP count (25%) |
| Composition Compatibility | 10% | Can they work in the same ecosystem? | Event vocab overlap (40%) + direct composability (35%) + delegate interface (25%) |

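As a sketch, the overall score is the weighted sum of the six dimension scores. The dictionary keys and the example per-dimension scores below are hypothetical; only the weights come from the standard:

```python
# Weights from the ABC similarity dimensions (they sum to 1.0).
WEIGHTS = {
    "problem_pattern": 0.30,
    "sub_pattern_overlap": 0.15,
    "io_structural": 0.20,
    "reasoning": 0.10,
    "adaptation_portability": 0.15,
    "composition_compatibility": 0.10,
}

def overall_similarity(scores: dict[str, float]) -> float:
    # Each dimension score is in [0, 1]; the result is their weighted sum.
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

# Hypothetical dimension scores for a pair of cards.
scores = {
    "problem_pattern": 0.9,
    "sub_pattern_overlap": 0.5,
    "io_structural": 0.8,
    "reasoning": 0.7,
    "adaptation_portability": 0.6,
    "composition_compatibility": 0.4,
}
print(round(overall_similarity(scores), 3))  # 0.705
```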
Detailed Dimension Explanations
Dimension 1: Problem Pattern Similarity (LLM-Powered)

Sends both cards' problem descriptions to Claude for a semantic comparison that ignores domain-specific language and terminology. This catches cases where different domains solve identical abstract problems using different vocabulary.

Example: "Distribute medical supplies" and "Allocate seasonal gear" are structurally identical allocation problems.

Dimension 2: Sub-Pattern Overlap (Computed)

Uses Jaccard similarity on sub_patterns fields. 40% exact matches, 60% token-level overlap to catch semantic similarity.

Example comparison:

```
sub_patterns_A: [forecasting, temporal, aggregation]
sub_patterns_B: [time_series, temporal_aggregation, smoothing]
# No exact sub-pattern matches, but "temporal" and "aggregation" match at the token level
```
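A minimal Python sketch of this computation; the 40/60 split is described above, while the underscore-based tokenization is an assumption about how sub-pattern names are split:

```python
def jaccard(a: set[str], b: set[str]) -> float:
    # Jaccard similarity: |intersection| / |union|, with 0.0 for two empty sets.
    return len(a & b) / len(a | b) if (a | b) else 0.0

def sub_pattern_overlap(patterns_a: list[str], patterns_b: list[str]) -> float:
    # 40% weight on exact sub-pattern matches, 60% on token-level overlap.
    exact = jaccard(set(patterns_a), set(patterns_b))
    tokens_a = {tok for p in patterns_a for tok in p.split("_")}
    tokens_b = {tok for p in patterns_b for tok in p.split("_")}
    return 0.4 * exact + 0.6 * jaccard(tokens_a, tokens_b)

a = ["forecasting", "temporal", "aggregation"]
b = ["time_series", "temporal_aggregation", "smoothing"]
print(round(sub_pattern_overlap(a, b), 3))
```

Here the exact match component is 0, but "temporal" and "aggregation" survive tokenization, so the pair still scores above zero.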
Dimension 3: I/O Structural Similarity (Computed)

Compares the structure of inputs and outputs across multiple vectors to determine functional compatibility.

  • Input name similarity: 25%
  • Input type compatibility: 15%
  • Output name similarity: 25%
  • Output type compatibility: 15%
  • Input/Output count balance: 20%
Dimension 4: Reasoning Similarity (LLM-Powered)

Compares reasoning.method and reasoning.approach semantically using Claude to determine if the decision logic is conceptually similar.

Example:

```
method_A: reinforcement_learning
method_B: policy_gradient
# Semantically very close: policy gradients are a family of RL methods
```
Dimension 5: Adaptation Portability (Computed)

Estimates how much work is required to fork and adapt one card for another domain. Lower effort = higher similarity.

  • Assumption softness (35%): Soft assumptions = easier to adapt
  • Swappability (40%): More swappable components = more flexible
  • Adaptation point count (25%): More APs = more adjustment options
Dimension 6: Composition Compatibility (Computed)

Evaluates how well two cards can work together in the same system architecture.

  • Event vocabulary overlap (40%): Shared event types
  • Direct composability (35%): Can output of one feed input of other
  • Delegate interface overlap (25%): Compatible delegation points

Online vs Offline Mode

The similarity engine operates in two modes depending on API key availability:

Offline Mode (No API Key)

When Claude API is unavailable:

  • Dimensions 1 & 4 fall back to Jaccard similarity
  • Lower quality results but fully functional
  • Cross-domain matching is less accurate
  • Still useful for same-domain card discovery
Online Mode (With API Key)

When Claude API is configured:

  • Dimensions 1 & 4 use Claude for semantic analysis
  • Much better cross-domain matching
  • Captures abstract pattern similarity
  • Recommended for production registries

Future: Embedding-Based Approach (Roadmap)

As the registry grows, we plan a phased transition to embedding-based similarity:

Phase 1 (Now)
LLM-scored similarity on demand. Accurate but slower for large registries.
Phase 2 (100+ Cards)
Pre-computed embeddings for each card + cosine similarity for instant results.
Phase 3 (1000+ Cards)
Vector database with hybrid search combining semantic, structural, and metadata similarity.
🛡️

Trust & Failure Mode Framework

Eight Failure Categories

Every behavioral failure falls into one of these eight categories. Understanding which category a failure belongs to helps determine mitigation strategy:

  1. INPUT — Bad data fed into the system
  2. MODEL — Wrong reasoning or incorrect logic
  3. INTEGRATION — System connection and API failures
  4. SCALE — Volume or performance breakdown
  5. DOMAIN — Domain-specific edge cases
  6. HUMAN — Human-AI interaction issues
  7. SECURITY — Adversarial and access failures
  8. ETHICAL — Fairness, bias, and harm issues

Severity Levels & Domain Escalation

The same failure can have dramatically different severity depending on the domain. This requires domain-aware severity classification:

Example: Stale Data
In a retail recommendation system: HIGH severity (customers see outdated products)
In a humanitarian supply distribution system: CRITICAL severity (wrong supplies reach wrong regions)

Standard Severity Levels:

  • Low — Performance impact only, correctness unaffected
  • Medium — Accuracy degraded, workaround available
  • High — Incorrect results, requires manual intervention
  • Critical — System failure, human safety/rights at risk
Blast Radius Classification
| Radius | Definition | Example |
| --- | --- | --- |
| local | Single decision affected | One recommendation is wrong |
| downstream | Dependent systems affected | Wrong classification propagates to billing |
| systemic | Entire pipeline compromised | Data corruption affects all future decisions |

Documenting Each Failure Mode

Every identified failure mode must be documented with these five components:

Detection

Method, mechanism, and time_to_detect. How will we know this failure happened?

Mitigation

Preventive (stop before it happens), detective (catch as it happens), corrective (fix it after).

Recovery

Automatic, manual, and time_to_recover. How do we get back to good state?

Residual Risk

What cannot be eliminated? What's the best we can do?

Test Reference

Link to test suite that validates detection and recovery.

Ethical Assessment Requirements

Every card must include a complete ethical assessment covering:

Affected Parties & Vulnerability

Identify all affected groups and their vulnerability levels. Vulnerable parties in high-stakes domains (healthcare, criminal justice) require extra scrutiny.

Applicable Principles
  • do_no_harm — Minimize risk of injury or adverse outcomes
  • equity — Fair treatment and equal access across groups
  • transparency — Clear communication of how system works
  • accountability — Responsibility for outcomes and recourse for harms
Ethical Dynamics

fairness_gaming — How might people game the system to get unfair advantage?

automation_authority — Are we automating away human judgment that should remain?

The "Would I Be Comfortable If..." Test

Before publishing a card, ask: "Would I be comfortable if the people most affected by this behavior could read this entire card and understand exactly how it works, what could go wrong, and what protections are in place?"

If the answer is no, the card needs more work.

Operational Guardrails

Three critical guardrail mechanisms that every card should implement:

Human-in-the-Loop: 3 Levels
| Level | Definition |
| --- | --- |
| mandatory_human_approval | Human must explicitly approve before action. Critical for high-stakes decisions. |
| human_override_available | System acts autonomously but human can intervene. Default for most applications. |
| fully_automated | No human intervention. Only for low-risk, high-volume decisions. |
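A hypothetical sketch of dispatching on the oversight level; the `Oversight` enum and `execute` helper are illustrative, not defined by the standard:

```python
from enum import Enum

class Oversight(Enum):
    MANDATORY_HUMAN_APPROVAL = "mandatory_human_approval"
    HUMAN_OVERRIDE_AVAILABLE = "human_override_available"
    FULLY_AUTOMATED = "fully_automated"

def execute(action, level: Oversight, approved: bool = False):
    # Mandatory approval blocks execution until a human explicitly says yes.
    if level is Oversight.MANDATORY_HUMAN_APPROVAL and not approved:
        return "pending_approval"
    # The other two levels proceed; human override is handled out of band.
    return action()

print(execute(lambda: "shipped", Oversight.MANDATORY_HUMAN_APPROVAL))  # pending_approval
print(execute(lambda: "shipped", Oversight.FULLY_AUTOMATED))           # shipped
```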
Graceful Degradation: Define Path

When systems fail, define exactly what happens:

  • trigger — What condition triggers degradation
  • degraded_behavior — What does the system do in degraded state
  • user_notification — How do users know capability is reduced
  • auto_recovery — Does it automatically recover or need intervention
Kill Switch: Emergency Stop

Every behavior must have an emergency stop mechanism:

  • mechanism — How is it triggered (API call, config flag, etc.)
  • authorization — Who can activate it
  • time_to_effect — How long until behavior stops (target: < 5 seconds)
  • side_effects — What happens to in-flight operations
  • restart_procedure — Steps to resume normal operation
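A minimal sketch of one possible mechanism, assuming an in-process flag checked at the top of every behavior step so time_to_effect stays small; the `KillSwitch` class and its methods are hypothetical:

```python
import threading

class KillSwitch:
    """Hypothetical emergency-stop sketch: a shared flag checked before each action."""

    def __init__(self) -> None:
        self._stopped = threading.Event()

    def activate(self, operator: str) -> None:
        # A real system would verify 'authorization' here before setting the flag.
        self._stopped.set()

    def check(self) -> None:
        # Called at the top of every behavior step; in-flight work past this
        # point is a documented side effect of activation.
        if self._stopped.is_set():
            raise RuntimeError("behavior halted by kill switch")

switch = KillSwitch()
switch.check()               # normal operation: no-op
switch.activate("ops-lead")  # emergency stop; subsequent check() calls raise
```

Restarting would mean clearing the flag through the documented restart_procedure rather than flipping it ad hoc.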

Trust Scoring: 100-Point System

Trust is scored separately from implementation compliance:

| Criteria | Points |
| --- | --- |
| All 8 failure categories assessed | 15 |
| Every failure mode has detection | 10 |
| Every failure mode has mitigation | 10 |
| Every failure mode has recovery plan | 10 |
| Domain-specific severity documented | 5 |
| Blast radius classified | 5 |
| Ethical assessment complete | 10 |
| Human oversight levels defined | 10 |
| Graceful degradation modes defined | 10 |
| Kill switch documented | 5 |
| Failure mode tests referenced | 10 |

Trust Levels

90-100: Enterprise Ready
Can be deployed in regulated environments with full audit trail.

70-89: Production Ready
Suitable for standard deployment with normal operational oversight.

50-69: Pilot Ready
For controlled testing with close monitoring.

Below 50: Development Only
Not ready for any external deployment.

Overall Quality Score: Combined Compliance & Trust

The final registry readiness is determined by combining implementation compliance with trust assessment:

Overall Quality = (Compliance Score × 0.5) + (Trust Score × 0.5)
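For example, a card scoring 92 on compliance and 78 on trust lands at 85 overall (the specific scores are hypothetical):

```python
def overall_quality(compliance: float, trust: float) -> float:
    # Equal weighting of implementation compliance and trust assessment.
    return 0.5 * compliance + 0.5 * trust

print(overall_quality(92, 78))  # 85.0
```

So strong implementation work cannot fully compensate for weak safety documentation, and vice versa.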
Registry Tier Determination

90-100: ABC Certified Gold
Production-ready with comprehensive documentation and safety mechanisms.

75-89: ABC Certified
Meets standard for registry publication. Suitable for most use cases.

60-74: ABC Compatible
Acceptable with awareness of specific limitations.

Below 60: Not Registry Ready
Requires significant work before publication.