Implementation Guide
● Core Concept: Portability Through Abstraction
The fundamental insight that drives the ABC Standard: a behavior is portable when its contract is separated from its implementation. This principle manifests in the Swap Test: if one implementation can be replaced by another without changing the interface, the behavior is genuinely abstracted.
● ABC Card Anatomy: 7 Required Sections
Every ABC card must include these seven sections to be considered complete:
- Identity — Name, version, authors, tags. The unique fingerprint of the card.
- Problem Pattern — Category, description, sub-patterns, analogous domains. What problem does this solve, and where has it appeared before?
- Behavior Specification — Trigger, inputs, outputs, reasoning. The functional contract of the behavior.
- Domain Assumptions — Data, environment, authority. Each marked as hard (must have) or soft (degrades gracefully).
- Adaptation Points — Swappable components, configurable parameters, extensible capabilities. Where and how can this be modified?
- Composition — Events emitted/consumed, delegates. How does this interact with other behaviors?
- Provenance — Origin domain, lineage, fork history. Where did this come from and how has it evolved?
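A minimal sketch of how these seven sections might be represented as data; every field value and the helper name below are illustrative, not part of the standard:

```python
# Hypothetical representation of an ABC card as plain data. The top-level
# keys mirror the seven required sections; the nested fields are examples.
REQUIRED_SECTIONS = {
    "identity", "problem_pattern", "behavior_specification",
    "domain_assumptions", "adaptation_points", "composition", "provenance",
}

card = {
    "identity": {"name": "demand_forecaster", "version": "1.0",
                 "authors": ["example author"], "tags": ["forecasting"]},
    "problem_pattern": {"category": "forecasting", "description": "...",
                        "sub_patterns": ["temporal"]},
    "behavior_specification": {"trigger": "...", "inputs": [], "outputs": [],
                               "reasoning": "..."},
    "domain_assumptions": {"data": [], "environment": [], "authority": []},
    "adaptation_points": {"swappable": [], "configurable": [], "extensible": []},
    "composition": {"events_emitted": [], "events_consumed": [], "delegates": []},
    "provenance": {"origin_domain": "retail", "lineage": [], "fork_history": []},
}

def missing_sections(card: dict) -> set[str]:
    """Return the required sections the card does not yet include."""
    return REQUIRED_SECTIONS - card.keys()

assert missing_sections(card) == set()
```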
● Adaptation Point Types
ABC Standard defines three types of adaptation points, each with different implementation requirements:
Swappable: Complete implementation replacement without any interface change. Example: swap an ARIMA forecaster for a neural-network predictor while keeping the same input/output signature.
swap ARIMAForecaster → NeuralNetForecaster
# Same interface, completely different algorithm
Configurable: Parameter tuning within documented ranges. Example: the optimization horizon can be set anywhere from 1 to 90 days depending on domain needs.
optimization_horizon: 45 days # Valid: 1-90
Extensible: Add new capabilities without changing existing ones. Follows the Open/Closed Principle: open for extension, closed for modification.
extend PostProcessingHook {
on_forecast(result) → CustomValidation()
}
● Domain Assumptions Classification
Every assumption in a domain must be classified and documented:
| Type | Meaning | Action Required |
|---|---|---|
| hard | Behavior breaks completely without this assumption | Must have runtime guards that fail fast |
| soft | Behavior degrades gracefully if assumption fails | Document degradation path; provide fallback mechanism |
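A sketch of what the two required actions could look like in code; the helper names (`guard_hard`, `with_soft_fallback`) are illustrative, not prescribed by the standard:

```python
# Hard assumptions get fail-fast runtime guards; soft assumptions get a
# documented fallback that keeps the behavior running in a degraded mode.
class AssumptionViolated(RuntimeError):
    pass

def guard_hard(condition: bool, name: str) -> None:
    """Hard assumption: refuse to proceed if the precondition fails."""
    if not condition:
        raise AssumptionViolated(f"hard assumption violated: {name}")

def with_soft_fallback(value, condition: bool, fallback, name: str):
    """Soft assumption: degrade gracefully and report the degradation path."""
    if condition:
        return value
    print(f"soft assumption failed: {name}; using fallback")
    return fallback

history = [3.0, 4.0, 5.0]
guard_hard(len(history) > 0, "non-empty demand history")      # hard: must have
horizon = with_soft_fallback(45, False, 7, "long history")    # soft: shrink horizon
assert horizon == 7
```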
● Compliance Scoring: 100-Point System
ABC cards are scored on how thoroughly they implement the standard:
| Criteria | Points | Evaluation |
|---|---|---|
| Formal interfaces per adaptation point | 15 | Each AP has a documented, testable interface |
| 2+ implementations per adaptation point | 15 | Prove the AP actually works by having 2+ implementations |
| Swap tests passing | 20 | Test suite validates swapping between implementations |
| Configurable parameters externalized | 10 | Config is outside code, validated at runtime |
| Hard assumption runtime guards | 10 | Guards check preconditions, fail fast if violated |
| Soft degradation paths documented | 5 | Clear fallback behavior when soft assumptions fail |
| Event schemas typed | 5 | All emitted events have explicit schemas |
| Decoupled event dispatch | 10 | Behaviors communicate via event bus, not direct calls |
| Isolation tests present | 5 | Each behavior tested independently |
| Required directory structure | 5 | Follows ABC project layout conventions |
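The swap-test criterion above can be illustrated with a minimal sketch: two interchangeable implementations behind one interface, validated by the same contract check. All class and function names are hypothetical:

```python
from typing import Protocol

class Forecaster(Protocol):
    """The shared interface every implementation must honor."""
    def forecast(self, history: list[float], horizon: int) -> list[float]: ...

class MeanForecaster:
    def forecast(self, history, horizon):
        avg = sum(history) / len(history)
        return [avg] * horizon

class NaiveForecaster:
    def forecast(self, history, horizon):
        return [history[-1]] * horizon

def swap_test(impl: Forecaster) -> None:
    """The same contract must hold regardless of which implementation runs."""
    out = impl.forecast([1.0, 2.0, 3.0], horizon=5)
    assert len(out) == 5
    assert all(isinstance(x, float) for x in out)

# Passing for 2+ implementations is what proves the adaptation point works.
for impl in (MeanForecaster(), NaiveForecaster()):
    swap_test(impl)
```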
90-100: ABC Certified
Fully backed by code and tests. Ready for production in the registry.
70-89: ABC Compatible
Most claims backed by code. Gaps are documented. Usable with known limitations.
50-69: ABC Aspirational
Describes intent and direction. Only partially implemented. For internal discussion or future work.
Below 50: Non-compliant
Card is largely fiction. Does not qualify for registry publication.
● Card Author Checklist
Before submitting a card to the registry, verify all 14 checklist items.
Similarity Engine
● The Problem with Manual Similarity Scoring
Human-assigned similarity scores neither scale nor explain themselves. Suppose a reviewer rates two cards at 0.82. What does 0.82 actually mean? How is it comparable to a 0.82 from a different reviewer? Would two humans even give the same pair the same score? These are the fundamental problems with manual similarity assessment.
The ABC Similarity Engine solves this by decomposing similarity into six independent, computable dimensions. Each dimension answers a specific question and is calculated algorithmically or via LLM analysis.
● Six Decomposed Dimensions
Each dimension is weighted and answers a specific question about how similar two cards are:
| Dimension | Weight | Question | Computation |
|---|---|---|---|
| Problem Pattern Similarity | 30% | Do they solve the same abstract problem? | LLM-powered semantic analysis |
| Sub-Pattern Overlap | 15% | Do they share structural sub-problems? | Jaccard similarity: 40% exact + 60% token overlap |
| I/O Structural Similarity | 20% | Are inputs/outputs compatible? | Name (25%) + Type (15%) + Count (20%) + Output name (25%) + Output type (15%) |
| Reasoning Similarity | 10% | Do they use the same decision approach? | LLM semantic comparison of method and approach |
| Adaptation Portability | 15% | How much work to fork and adapt? | Assumption softness (35%) + Swappability (40%) + AP count (25%) |
| Composition Compatibility | 10% | Can they work in the same ecosystem? | Event vocab overlap (40%) + Direct composability (35%) + Delegate interface (25%) |
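Assuming each dimension has already been scored in [0, 1], the weighted combination could be sketched as:

```python
# Weights taken directly from the six-dimension table above.
WEIGHTS = {
    "problem_pattern": 0.30, "sub_pattern": 0.15, "io_structure": 0.20,
    "reasoning": 0.10, "adaptation": 0.15, "composition": 0.10,
}

def overall_similarity(scores: dict[str, float]) -> float:
    """Weighted sum of the six per-dimension scores, each in [0, 1]."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

scores = {"problem_pattern": 0.9, "sub_pattern": 0.6, "io_structure": 0.8,
          "reasoning": 0.7, "adaptation": 0.5, "composition": 0.4}
print(round(overall_similarity(scores), 3))  # → 0.705
```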
Problem Pattern Similarity: Sends both problem descriptions to Claude for a semantic comparison that abstracts away domain-specific language and terminology. This catches cases where different domains solve identical abstract problems using different vocabulary.
Example: "Distribute medical supplies" and "Allocate seasonal gear" are structurally identical allocation problems.
Sub-Pattern Overlap: Uses Jaccard similarity on the sub_patterns fields: 40% weight on exact matches, 60% on token-level overlap to catch semantic similarity.
sub_patterns_A: [forecasting, temporal, aggregation]
sub_patterns_B: [time_series, temporal_aggregation, smoothing]
# "temporal" matches exactly, others have token overlap
I/O Structural Similarity: Compares the structure of inputs and outputs across several weighted components to determine functional compatibility.
- Input name similarity: 25%
- Input type compatibility: 15%
- Output name similarity: 25%
- Output type compatibility: 15%
- Input/Output count balance: 20%
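A rough sketch of the weighted I/O comparison, using naive Jaccard name overlap and positional type matching as stand-ins for the engine's real comparators (the dict field names here are assumptions):

```python
def _name_sim(a: list[str], b: list[str]) -> float:
    """Crude name similarity: Jaccard overlap of field names."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def _type_compat(a: list[str], b: list[str]) -> float:
    """Fraction of aligned positions with the same declared type."""
    pairs = list(zip(a, b))
    return sum(x == y for x, y in pairs) / len(pairs) if pairs else 1.0

def _count_balance(na: int, nb: int) -> float:
    """1.0 when input counts match, shrinking as they diverge."""
    return min(na, nb) / max(na, nb) if max(na, nb) else 1.0

def io_similarity(a: dict, b: dict) -> float:
    """Weighted I/O structural similarity per the documented split."""
    return (0.25 * _name_sim(a["in_names"], b["in_names"])
            + 0.15 * _type_compat(a["in_types"], b["in_types"])
            + 0.20 * _count_balance(len(a["in_names"]), len(b["in_names"]))
            + 0.25 * _name_sim(a["out_names"], b["out_names"])
            + 0.15 * _type_compat(a["out_types"], b["out_types"]))

a = {"in_names": ["history", "horizon"], "in_types": ["list", "int"],
     "out_names": ["forecast"], "out_types": ["list"]}
assert abs(io_similarity(a, a) - 1.0) < 1e-9  # identical I/O scores 1.0
```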
Reasoning Similarity: Uses Claude to compare reasoning.method and reasoning.approach semantically and determine whether the decision logic is conceptually similar.
method_A: reinforcement_learning
method_B: policy_gradient
# Semantically identical for many domains
Adaptation Portability: Estimates how much work is required to fork one card and adapt it to another domain. Lower effort means higher similarity.
- Assumption softness (35%): Soft assumptions = easier to adapt
- Swappability (40%): More swappable components = more flexible
- Adaptation point count (25%): More APs = more adjustment options
Composition Compatibility: Evaluates how well two cards can work together in the same system architecture.
- Event vocabulary overlap (40%): Shared event types
- Direct composability (35%): Can output of one feed input of other
- Delegate interface overlap (25%): Compatible delegation points
● Online vs Offline Mode
The similarity engine operates in two modes depending on API key availability:
Offline mode (Claude API unavailable):
- Dimensions 1 & 4 fall back to Jaccard similarity
- Lower quality results but fully functional
- Cross-domain matching is less accurate
- Still useful for same-domain card discovery
Online mode (Claude API configured):
- Dimensions 1 & 4 use Claude for semantic analysis
- Much better cross-domain matching
- Captures abstract pattern similarity
- Recommended for production registries
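The mode switch might be sketched as follows; the environment-variable name and the `llm_score` stub are assumptions, not the engine's actual API:

```python
import os

def llm_score(a: str, b: str) -> float:
    """Stand-in for a real Claude API call (not implemented here)."""
    raise NotImplementedError

def semantic_similarity(a: str, b: str) -> float:
    """Dimension 1 & 4 scorer: LLM when online, Jaccard token overlap offline."""
    if os.environ.get("ANTHROPIC_API_KEY"):  # assumed key location
        return llm_score(a, b)
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0
```

The offline path illustrates why cross-domain matching degrades: "distribute medical supplies" and "allocate seasonal gear" share no tokens, so Jaccard scores them 0 even though they are structurally the same allocation problem.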
● Future: Embedding-Based Approach (Roadmap)
As the registry grows, we plan a phased transition to embedding-based similarity:
- Phase 1 (current): LLM-scored similarity computed on demand. Accurate, but slow for large registries.
- Phase 2: Pre-computed embeddings for each card plus cosine similarity for instant results.
- Phase 3: A vector database with hybrid search combining semantic, structural, and metadata similarity.
Trust & Failure Mode Framework
● Eight Failure Categories
Every behavioral failure falls into one of these eight categories. Understanding which category a failure belongs to helps determine mitigation strategy:
- Bad data fed into the system
- Wrong reasoning or incorrect logic
- System connection and API failures
- Volume or performance breakdown
- Domain-specific edge cases
- Human-AI interaction issues
- Adversarial and access failures
- Fairness, bias, and harm issues
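For illustration, the eight categories as an enum; the identifier names are hypothetical labels derived from the descriptions above, not official category IDs:

```python
from enum import Enum

class FailureCategory(Enum):
    # Hypothetical identifiers; the values are the descriptions from the text.
    INPUT_QUALITY = "Bad data fed into the system"
    LOGIC = "Wrong reasoning or incorrect logic"
    INTEGRATION = "System connection and API failures"
    SCALE = "Volume or performance breakdown"
    DOMAIN_EDGE = "Domain-specific edge cases"
    HUMAN_INTERACTION = "Human-AI interaction issues"
    SECURITY = "Adversarial and access failures"
    ETHICAL = "Fairness, bias, and harm issues"

assert len(FailureCategory) == 8
```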
● Severity Levels & Domain Escalation
The same failure can have dramatically different severity depending on the domain, which requires domain-aware severity classification. Consider one and the same failure in two deployments:
- In a retail recommendation system: HIGH severity (customers see outdated products)
- In a humanitarian supply distribution system: CRITICAL severity (wrong supplies reach wrong regions)
Standard Severity Levels:
- Low — Performance impact only, correctness unaffected
- Medium — Accuracy degraded, workaround available
- High — Incorrect results, requires manual intervention
- Critical — System failure, human safety/rights at risk
Blast Radius:
| Radius | Definition | Example |
|---|---|---|
| local | Single decision affected | One recommendation is wrong |
| downstream | Dependent systems affected | Wrong classification propagates to billing |
| systemic | Entire pipeline compromised | Data corruption affects all future decisions |
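One way to sketch domain-aware escalation, assuming a hypothetical set of high-stakes domains:

```python
SEVERITIES = ["low", "medium", "high", "critical"]

# Hypothetical escalation rule: high-stakes domains bump severity one level.
HIGH_STAKES_DOMAINS = {"humanitarian", "healthcare", "criminal_justice"}

def escalate(base: str, domain: str) -> str:
    """Return the domain-adjusted severity for a failure."""
    idx = SEVERITIES.index(base)
    if domain in HIGH_STAKES_DOMAINS:
        idx = min(idx + 1, len(SEVERITIES) - 1)
    return SEVERITIES[idx]

assert escalate("high", "retail") == "high"            # stays HIGH
assert escalate("high", "humanitarian") == "critical"  # escalates to CRITICAL
```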
● Documenting Each Failure Mode
Every identified failure mode must be documented with these five components:
- Detection: method, mechanism, and time_to_detect. How will we know this failure happened?
- Mitigation: preventive (stop it before it happens), detective (catch it as it happens), corrective (fix it afterwards).
- Recovery: automatic, manual, and time_to_recover. How do we get back to a good state?
- Residual risk: what cannot be eliminated? What is the best we can do?
- Test coverage: a link to the test suite that validates detection and recovery.
● Ethical Assessment Requirements
Every card must include a complete ethical assessment covering:
Affected Parties & Vulnerability: Identify all affected groups and their vulnerability levels. A vulnerable party in a high-stakes domain (healthcare, criminal justice) requires more scrutiny.
Applicable Principles:
- do_no_harm — Minimize risk of injury or adverse outcomes
- equity — Fair treatment and equal access across groups
- transparency — Clear communication of how system works
- accountability — Responsibility for outcomes and recourse for harms
- fairness_gaming — How might people game the system to gain an unfair advantage?
- automation_authority — Are we automating away human judgment that should remain human?
Before publishing a card, ask: "Would I be comfortable if the people most affected by this behavior could read this entire card and understand exactly how it works, what could go wrong, and what protections are in place?"
If the answer is no, the card needs more work.
● Operational Guardrails
Three critical guardrail mechanisms that every card should implement:
Human-in-the-Loop: 3 Levels
| Level | Definition |
|---|---|
| mandatory_human_approval | Human must explicitly approve before action. Critical for high-stakes decisions. |
| human_override_available | System acts autonomously but human can intervene. Default for most applications. |
| fully_automated | No human intervention. Only for low-risk, high-volume decisions. |
Graceful Degradation: when systems fail, define exactly what happens:
- trigger — What condition triggers degradation
- degraded_behavior — What does the system do in degraded state
- user_notification — How do users know capability is reduced
- auto_recovery — Does it automatically recover or need intervention
Kill Switch: every behavior must have an emergency stop mechanism:
- mechanism — How is it triggered (API call, config flag, etc.)
- authorization — Who can activate it
- time_to_effect — How long until behavior stops (target: < 5 seconds)
- side_effects — What happens to in-flight operations
- restart_procedure — Steps to resume normal operation
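A minimal kill-switch sketch covering the mechanism and authorization fields; the class design and names are illustrative only:

```python
import time

class KillSwitch:
    """Minimal kill-switch sketch: an engageable flag the behavior loop checks."""
    def __init__(self, authorized: set[str]):
        self.authorized = authorized       # who can activate it
        self.engaged_at: float | None = None

    def engage(self, operator: str) -> None:
        if operator not in self.authorized:        # authorization check
            raise PermissionError(f"{operator} may not engage the kill switch")
        self.engaged_at = time.monotonic()         # start of time_to_effect

    @property
    def active(self) -> bool:
        """The behavior loop polls this and halts when it turns True."""
        return self.engaged_at is not None

switch = KillSwitch(authorized={"oncall_lead"})
switch.engage("oncall_lead")
assert switch.active
```

In a real deployment the flag would live in shared config or a control API rather than process memory, so that time_to_effect stays under the 5-second target across all instances.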
● Trust Scoring: 100-Point System
Trust is scored separately from implementation compliance:
| Criteria | Points |
|---|---|
| All 8 failure categories assessed | 15 |
| Every failure mode has detection | 10 |
| Every failure mode has mitigation | 10 |
| Every failure mode has recovery plan | 10 |
| Domain-specific severity documented | 5 |
| Blast radius classified | 5 |
| Ethical assessment complete | 10 |
| Human oversight levels defined | 10 |
| Graceful degradation modes defined | 10 |
| Kill switch documented | 5 |
| Failure mode tests referenced | 10 |
90-100: Enterprise Ready
Can be deployed in regulated environments with full audit trail.
70-89: Production Ready
Suitable for standard deployment with normal operational oversight.
50-69: Pilot Ready
For controlled testing with close monitoring.
Below 50: Development Only
Not ready for any external deployment.
● Overall Quality Score: Combined Compliance & Trust
The final registry readiness is determined by combining implementation compliance with trust assessment:
Overall Quality = (Compliance Score × 0.5) + (Trust Score × 0.5)
90-100: ABC Certified Gold
Production-ready with comprehensive documentation and safety mechanisms.
75-89: ABC Certified
Meets standard for registry publication. Suitable for most use cases.
60-74: ABC Compatible
Acceptable with awareness of specific limitations.
Below 60: Not Registry Ready
Requires significant work before publication.