Implementation Guide
● Core Concept: Portability Through Abstraction
The fundamental insight that drives the ABC Standard: a behavior is portable when its contract is separated from its implementation. This principle manifests in the Swap Test: if one implementation can be replaced by another without changing the interface, the behavior is genuinely abstracted.
● ABC Card Anatomy: 7 Required Sections
Every ABC card must include these seven sections to be considered complete:
- Identity — Name, version, authors, tags. The unique fingerprint of the card.
- Problem Pattern — Category, description, sub-patterns, analogous domains. What problem does this solve, and where has it appeared before?
- Behavior Specification — Trigger, inputs, outputs, reasoning. The functional contract of the behavior.
- Domain Assumptions — Data, environment, authority. Each marked as hard (must have) or soft (degrades gracefully).
- Adaptation Points — Swappable components, configurable parameters, extensible capabilities. Where and how can this be modified?
- Composition — Events emitted/consumed, delegates. How does this interact with other behaviors?
- Provenance — Origin domain, lineage, fork history. Where did this come from and how has it evolved?
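A minimal sketch of how these seven sections might be represented as data; every field value and the helper name below are illustrative, not part of the standard:

```python
# Hypothetical representation of an ABC card as plain data. The top-level
# keys mirror the seven required sections; the nested fields are examples.
REQUIRED_SECTIONS = {
    "identity", "problem_pattern", "behavior_specification",
    "domain_assumptions", "adaptation_points", "composition", "provenance",
}

card = {
    "identity": {"name": "demand_forecaster", "version": "1.0",
                 "authors": ["example author"], "tags": ["forecasting"]},
    "problem_pattern": {"category": "forecasting", "description": "...",
                        "sub_patterns": ["temporal"]},
    "behavior_specification": {"trigger": "...", "inputs": [], "outputs": [],
                               "reasoning": "..."},
    "domain_assumptions": {"data": [], "environment": [], "authority": []},
    "adaptation_points": {"swappable": [], "configurable": [], "extensible": []},
    "composition": {"events_emitted": [], "events_consumed": [], "delegates": []},
    "provenance": {"origin_domain": "retail", "lineage": [], "fork_history": []},
}

def missing_sections(card: dict) -> set[str]:
    """Return the required sections the card does not yet include."""
    return REQUIRED_SECTIONS - card.keys()

assert missing_sections(card) == set()
```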
● Adaptation Point Types
ABC Standard defines three types of adaptation points, each with different implementation requirements:
Swappable: Complete implementation replacement without any interface change. Example: swap an ARIMA forecaster for a neural-network predictor while keeping the same input/output signature.
swap ARIMAForecaster → NeuralNetForecaster
# Same interface, completely different algorithm
Configurable: Parameter tuning within documented ranges. Example: the optimization horizon can be set anywhere from 1 to 90 days depending on domain needs.
optimization_horizon: 45 days # Valid: 1-90
Extensible: Add new capabilities without changing existing ones. Follows the Open/Closed Principle: open for extension, closed for modification.
extend PostProcessingHook {
on_forecast(result) → CustomValidation()
}
● Domain Assumptions Classification
Every assumption in a domain must be classified and documented:
| Type | Meaning | Action Required |
|---|---|---|
| hard | Behavior breaks completely without this assumption | Must have runtime guards that fail fast |
| soft | Behavior degrades gracefully if assumption fails | Document degradation path; provide fallback mechanism |
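A sketch of what the two required actions could look like in code; the helper names (`guard_hard`, `with_soft_fallback`) are illustrative, not prescribed by the standard:

```python
# Hard assumptions get fail-fast runtime guards; soft assumptions get a
# documented fallback that keeps the behavior running in a degraded mode.
class AssumptionViolated(RuntimeError):
    pass

def guard_hard(condition: bool, name: str) -> None:
    """Hard assumption: refuse to proceed if the precondition fails."""
    if not condition:
        raise AssumptionViolated(f"hard assumption violated: {name}")

def with_soft_fallback(value, condition: bool, fallback, name: str):
    """Soft assumption: degrade gracefully and report the degradation path."""
    if condition:
        return value
    print(f"soft assumption failed: {name}; using fallback")
    return fallback

history = [3.0, 4.0, 5.0]
guard_hard(len(history) > 0, "non-empty demand history")      # hard: must have
horizon = with_soft_fallback(45, False, 7, "long history")    # soft: shrink horizon
assert horizon == 7
```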
● Compliance Scoring: 100-Point System
ABC cards are scored on how thoroughly they implement the standard:
| Criteria | Points | Evaluation |
|---|---|---|
| Formal interfaces per adaptation point | 15 | Each AP has a documented, testable interface |
| 2+ implementations per adaptation point | 15 | Prove the AP actually works by having 2+ implementations |
| Swap tests passing | 20 | Test suite validates swapping between implementations |
| Configurable parameters externalized | 10 | Config is outside code, validated at runtime |
| Hard assumption runtime guards | 10 | Guards check preconditions, fail fast if violated |
| Soft degradation paths documented | 5 | Clear fallback behavior when soft assumptions fail |
| Event schemas typed | 5 | All emitted events have explicit schemas |
| Decoupled event dispatch | 10 | Behaviors communicate via event bus, not direct calls |
| Isolation tests present | 5 | Each behavior tested independently |
| Required directory structure | 5 | Follows ABC project layout conventions |
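The swap-test criterion above can be illustrated with a minimal sketch: two interchangeable implementations behind one interface, validated by the same contract check. All class and function names are hypothetical:

```python
from typing import Protocol

class Forecaster(Protocol):
    """The shared interface every implementation must honor."""
    def forecast(self, history: list[float], horizon: int) -> list[float]: ...

class MeanForecaster:
    def forecast(self, history, horizon):
        avg = sum(history) / len(history)
        return [avg] * horizon

class NaiveForecaster:
    def forecast(self, history, horizon):
        return [history[-1]] * horizon

def swap_test(impl: Forecaster) -> None:
    """The same contract must hold regardless of which implementation runs."""
    out = impl.forecast([1.0, 2.0, 3.0], horizon=5)
    assert len(out) == 5
    assert all(isinstance(x, float) for x in out)

# Passing for 2+ implementations is what proves the adaptation point works.
for impl in (MeanForecaster(), NaiveForecaster()):
    swap_test(impl)
```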
90-100: ABC Certified
Fully backed by code and tests. Ready for production in the registry.
70-89: ABC Compatible
Most claims backed by code. Gaps are documented. Usable with known limitations.
50-69: ABC Aspirational
Describes intent and direction. Only partially implemented. For internal discussion or future work.
Below 50: Non-compliant
Card is largely fiction. Does not qualify for registry publication.
● Card Author Checklist
Before submitting a card to the registry, verify all 14 checklist items.
Similarity Engine
● The Problem with Manual Similarity Scoring
Human-assigned similarity scores neither scale nor explain themselves. Suppose a reviewer rates two cards at 0.82. What does 0.82 actually mean? How is it comparable to a 0.82 from a different reviewer? Would two humans even give the same pair the same score? These are the fundamental problems with manual similarity assessment.
The ABC Similarity Engine solves this by decomposing similarity into six independent, computable dimensions. Each dimension answers a specific question and is calculated algorithmically or via LLM analysis.
● Six Decomposed Dimensions
Each dimension is weighted and answers a specific question about how similar two cards are:
| Dimension | Weight | Question | Computation |
|---|---|---|---|
| Problem Pattern Similarity | 30% | Do they solve the same abstract problem? | LLM-powered semantic analysis |
| Sub-Pattern Overlap | 15% | Do they share structural sub-problems? | Jaccard similarity: 40% exact + 60% token overlap |
| I/O Structural Similarity | 20% | Are inputs/outputs compatible? | Name (25%) + Type (15%) + Count (20%) + Output name (25%) + Output type (15%) |
| Reasoning Similarity | 10% | Do they use the same decision approach? | LLM semantic comparison of method and approach |
| Adaptation Portability | 15% | How much work to fork and adapt? | Assumption softness (35%) + Swappability (40%) + AP count (25%) |
| Composition Compatibility | 10% | Can they work in the same ecosystem? | Event vocab overlap (40%) + Direct composability (35%) + Delegate interface (25%) |
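Assuming each dimension has already been scored in [0, 1], the weighted combination could be sketched as:

```python
# Weights taken directly from the six-dimension table above.
WEIGHTS = {
    "problem_pattern": 0.30, "sub_pattern": 0.15, "io_structure": 0.20,
    "reasoning": 0.10, "adaptation": 0.15, "composition": 0.10,
}

def overall_similarity(scores: dict[str, float]) -> float:
    """Weighted sum of the six per-dimension scores, each in [0, 1]."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

scores = {"problem_pattern": 0.9, "sub_pattern": 0.6, "io_structure": 0.8,
          "reasoning": 0.7, "adaptation": 0.5, "composition": 0.4}
print(round(overall_similarity(scores), 3))  # → 0.705
```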
Problem Pattern Similarity: Sends both problem descriptions to Claude for a semantic comparison that abstracts away domain-specific language and terminology. This catches cases where different domains solve identical abstract problems using different vocabulary.
Example: "Distribute medical supplies" and "Allocate seasonal gear" are structurally identical allocation problems.
Sub-Pattern Overlap: Uses Jaccard similarity on the sub_patterns fields: 40% weight on exact matches, 60% on token-level overlap to catch semantic similarity.
sub_patterns_A: [forecasting, temporal, aggregation]
sub_patterns_B: [time_series, temporal_aggregation, smoothing]
# "temporal" matches exactly, others have token overlap
I/O Structural Similarity: Compares the structure of inputs and outputs across several weighted components to determine functional compatibility.
- Input name similarity: 25%
- Input type compatibility: 15%
- Output name similarity: 25%
- Output type compatibility: 15%
- Input/Output count balance: 20%
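A rough sketch of the weighted I/O comparison, using naive Jaccard name overlap and positional type matching as stand-ins for the engine's real comparators (the dict field names here are assumptions):

```python
def _name_sim(a: list[str], b: list[str]) -> float:
    """Crude name similarity: Jaccard overlap of field names."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def _type_compat(a: list[str], b: list[str]) -> float:
    """Fraction of aligned positions with the same declared type."""
    pairs = list(zip(a, b))
    return sum(x == y for x, y in pairs) / len(pairs) if pairs else 1.0

def _count_balance(na: int, nb: int) -> float:
    """1.0 when input counts match, shrinking as they diverge."""
    return min(na, nb) / max(na, nb) if max(na, nb) else 1.0

def io_similarity(a: dict, b: dict) -> float:
    """Weighted I/O structural similarity per the documented split."""
    return (0.25 * _name_sim(a["in_names"], b["in_names"])
            + 0.15 * _type_compat(a["in_types"], b["in_types"])
            + 0.20 * _count_balance(len(a["in_names"]), len(b["in_names"]))
            + 0.25 * _name_sim(a["out_names"], b["out_names"])
            + 0.15 * _type_compat(a["out_types"], b["out_types"]))

a = {"in_names": ["history", "horizon"], "in_types": ["list", "int"],
     "out_names": ["forecast"], "out_types": ["list"]}
assert abs(io_similarity(a, a) - 1.0) < 1e-9  # identical I/O scores 1.0
```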
Reasoning Similarity: Uses Claude to compare reasoning.method and reasoning.approach semantically and determine whether the decision logic is conceptually similar.
method_A: reinforcement_learning
method_B: policy_gradient
# Semantically identical for many domains
Adaptation Portability: Estimates how much work is required to fork one card and adapt it to another domain. Lower effort means higher similarity.
- Assumption softness (35%): Soft assumptions = easier to adapt
- Swappability (40%): More swappable components = more flexible
- Adaptation point count (25%): More APs = more adjustment options
Composition Compatibility: Evaluates how well two cards can work together in the same system architecture.
- Event vocabulary overlap (40%): Shared event types
- Direct composability (35%): Can output of one feed input of other
- Delegate interface overlap (25%): Compatible delegation points
● Online vs Offline Mode
The similarity engine operates in two modes depending on API key availability:
Offline mode (Claude API unavailable):
- Dimensions 1 & 4 fall back to Jaccard similarity
- Lower quality results but fully functional
- Cross-domain matching is less accurate
- Still useful for same-domain card discovery
Online mode (Claude API configured):
- Dimensions 1 & 4 use Claude for semantic analysis
- Much better cross-domain matching
- Captures abstract pattern similarity
- Recommended for production registries
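The mode switch might be sketched as follows; the environment-variable name and the `llm_score` stub are assumptions, not the engine's actual API:

```python
import os

def llm_score(a: str, b: str) -> float:
    """Stand-in for a real Claude API call (not implemented here)."""
    raise NotImplementedError

def semantic_similarity(a: str, b: str) -> float:
    """Dimension 1 & 4 scorer: LLM when online, Jaccard token overlap offline."""
    if os.environ.get("ANTHROPIC_API_KEY"):  # assumed key location
        return llm_score(a, b)
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0
```

The offline path illustrates why cross-domain matching degrades: "distribute medical supplies" and "allocate seasonal gear" share no tokens, so Jaccard scores them 0 even though they are structurally the same allocation problem.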
● Future: Embedding-Based Approach (Roadmap)
As the registry grows, we plan a phased transition to embedding-based similarity:
- Phase 1 (current): LLM-scored similarity computed on demand. Accurate, but slow for large registries.
- Phase 2: Pre-computed embeddings for each card plus cosine similarity for instant results.
- Phase 3: A vector database with hybrid search combining semantic, structural, and metadata similarity.
Trust & Failure Mode Framework
● Eight Failure Categories
Every behavioral failure falls into one of these eight categories. Understanding which category a failure belongs to helps determine mitigation strategy:
- Bad data fed into the system
- Wrong reasoning or incorrect logic
- System connection and API failures
- Volume or performance breakdown
- Domain-specific edge cases
- Human-AI interaction issues
- Adversarial and access failures
- Fairness, bias, and harm issues
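For illustration, the eight categories as an enum; the identifier names are hypothetical labels derived from the descriptions above, not official category IDs:

```python
from enum import Enum

class FailureCategory(Enum):
    # Hypothetical identifiers; the values are the descriptions from the text.
    INPUT_QUALITY = "Bad data fed into the system"
    LOGIC = "Wrong reasoning or incorrect logic"
    INTEGRATION = "System connection and API failures"
    SCALE = "Volume or performance breakdown"
    DOMAIN_EDGE = "Domain-specific edge cases"
    HUMAN_INTERACTION = "Human-AI interaction issues"
    SECURITY = "Adversarial and access failures"
    ETHICAL = "Fairness, bias, and harm issues"

assert len(FailureCategory) == 8
```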
● Severity Levels & Domain Escalation
The same failure can have dramatically different severity depending on the domain, which requires domain-aware severity classification. Consider one and the same failure in two deployments:
- In a retail recommendation system: HIGH severity (customers see outdated products)
- In a humanitarian supply distribution system: CRITICAL severity (wrong supplies reach wrong regions)
Standard Severity Levels:
- Low — Performance impact only, correctness unaffected
- Medium — Accuracy degraded, workaround available
- High — Incorrect results, requires manual intervention
- Critical — System failure, human safety/rights at risk
Blast Radius:
| Radius | Definition | Example |
|---|---|---|
| local | Single decision affected | One recommendation is wrong |
| downstream | Dependent systems affected | Wrong classification propagates to billing |
| systemic | Entire pipeline compromised | Data corruption affects all future decisions |
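One way to sketch domain-aware escalation, assuming a hypothetical set of high-stakes domains:

```python
SEVERITIES = ["low", "medium", "high", "critical"]

# Hypothetical escalation rule: high-stakes domains bump severity one level.
HIGH_STAKES_DOMAINS = {"humanitarian", "healthcare", "criminal_justice"}

def escalate(base: str, domain: str) -> str:
    """Return the domain-adjusted severity for a failure."""
    idx = SEVERITIES.index(base)
    if domain in HIGH_STAKES_DOMAINS:
        idx = min(idx + 1, len(SEVERITIES) - 1)
    return SEVERITIES[idx]

assert escalate("high", "retail") == "high"            # stays HIGH
assert escalate("high", "humanitarian") == "critical"  # escalates to CRITICAL
```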
● Documenting Each Failure Mode
Every identified failure mode must be documented with these five components:
- Detection: method, mechanism, and time_to_detect. How will we know this failure happened?
- Mitigation: preventive (stop it before it happens), detective (catch it as it happens), corrective (fix it afterwards).
- Recovery: automatic, manual, and time_to_recover. How do we get back to a good state?
- Residual risk: what cannot be eliminated? What is the best we can do?
- Test coverage: a link to the test suite that validates detection and recovery.
● Ethical Assessment Requirements
Every card must include a complete ethical assessment covering:
Affected Parties & Vulnerability: Identify all affected groups and their vulnerability levels. A vulnerable party in a high-stakes domain (healthcare, criminal justice) requires more scrutiny.
Applicable Principles:
- do_no_harm — Minimize risk of injury or adverse outcomes
- equity — Fair treatment and equal access across groups
- transparency — Clear communication of how system works
- accountability — Responsibility for outcomes and recourse for harms
- fairness_gaming — How might people game the system to gain an unfair advantage?
- automation_authority — Are we automating away human judgment that should remain human?
Before publishing a card, ask: "Would I be comfortable if the people most affected by this behavior could read this entire card and understand exactly how it works, what could go wrong, and what protections are in place?"
If the answer is no, the card needs more work.
● Operational Guardrails
Three critical guardrail mechanisms that every card should implement:
Human-in-the-Loop: 3 Levels
| Level | Definition |
|---|---|
| mandatory_human_approval | Human must explicitly approve before action. Critical for high-stakes decisions. |
| human_override_available | System acts autonomously but human can intervene. Default for most applications. |
| fully_automated | No human intervention. Only for low-risk, high-volume decisions. |
Graceful Degradation: when systems fail, define exactly what happens:
- trigger — What condition triggers degradation
- degraded_behavior — What does the system do in degraded state
- user_notification — How do users know capability is reduced
- auto_recovery — Does it automatically recover or need intervention
Kill Switch: every behavior must have an emergency stop mechanism:
- mechanism — How is it triggered (API call, config flag, etc.)
- authorization — Who can activate it
- time_to_effect — How long until behavior stops (target: < 5 seconds)
- side_effects — What happens to in-flight operations
- restart_procedure — Steps to resume normal operation
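A minimal kill-switch sketch covering the mechanism and authorization fields; the class design and names are illustrative only:

```python
import time

class KillSwitch:
    """Minimal kill-switch sketch: an engageable flag the behavior loop checks."""
    def __init__(self, authorized: set[str]):
        self.authorized = authorized       # who can activate it
        self.engaged_at: float | None = None

    def engage(self, operator: str) -> None:
        if operator not in self.authorized:        # authorization check
            raise PermissionError(f"{operator} may not engage the kill switch")
        self.engaged_at = time.monotonic()         # start of time_to_effect

    @property
    def active(self) -> bool:
        """The behavior loop polls this and halts when it turns True."""
        return self.engaged_at is not None

switch = KillSwitch(authorized={"oncall_lead"})
switch.engage("oncall_lead")
assert switch.active
```

In a real deployment the flag would live in shared config or a control API rather than process memory, so that time_to_effect stays under the 5-second target across all instances.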
● Trust Scoring: 100-Point System
Trust is scored separately from implementation compliance:
| Criteria | Points |
|---|---|
| All 8 failure categories assessed | 15 |
| Every failure mode has detection | 10 |
| Every failure mode has mitigation | 10 |
| Every failure mode has recovery plan | 10 |
| Domain-specific severity documented | 5 |
| Blast radius classified | 5 |
| Ethical assessment complete | 10 |
| Human oversight levels defined | 10 |
| Graceful degradation modes defined | 10 |
| Kill switch documented | 5 |
| Failure mode tests referenced | 10 |
90-100: Enterprise Ready
Can be deployed in regulated environments with full audit trail.
70-89: Production Ready
Suitable for standard deployment with normal operational oversight.
50-69: Pilot Ready
For controlled testing with close monitoring.
Below 50: Development Only
Not ready for any external deployment.
● Overall Quality Score: Combined Compliance & Trust
The final registry readiness is determined by combining implementation compliance with trust assessment:
Overall Quality = (Compliance Score × 0.5) + (Trust Score × 0.5)
90-100: ABC Certified Gold
Production-ready with comprehensive documentation and safety mechanisms.
75-89: ABC Certified
Meets standard for registry publication. Suitable for most use cases.
60-74: ABC Compatible
Acceptable with awareness of specific limitations.
Below 60: Not Registry Ready
Requires significant work before publication.