Trustworthiness Over Alignment
A Structural Path for AI Governance
There was a time when AI reliability meant getting basic facts right. If it hallucinated, we laughed, corrected it, moved on. Low stakes, easily verifiable.
Now consider a future where AI proposes solutions to problems we cannot independently verify -- climate interventions with cascading economic effects, health protocols requiring action before we can observe outcomes, financial strategies whose causal chains exceed human modeling capacity. The question shifts from “Is it correct?” to something more fundamental: Do we trust it?
From Reliability to Complexity
When AI was simpler, reliability meant avoiding nonsense. A wrong answer about historical dates was caught with a quick search. But high-stakes domains -- health, climate, economics, security -- operate differently. Reliability there means accurately estimating complex, interconnected risks: pathogens evolving, supply chains breaking, geopolitical reactions cascading. An AI might propose a brilliant intervention while missing one crucial link in the causal chain. Catastrophe follows good intentions.
So reliability has evolved from “don’t hallucinate” to “master real-world complexity and share the hidden pitfalls.” Which immediately raises the next question: even if correct, is it acting in our best interests?
Where Alignment Enters
This is why alignment dominates safety discourse: ensuring AI actions match human values. The standard worry is a superintelligent system finding the “optimal” solution while disregarding human wellbeing.
Philosophically, alignment and reliability feel separable:
Reliable but misaligned: super-accurate, might harm us if that’s “optimal”
Aligned but unreliable: well-intentioned, might fail catastrophically through incompetence
In practice, these blur. Facing a black-box solution we cannot verify, we have one question: Do we trust this thing? If not aligned, it might betray us. If not reliable, it might kill us trying to help. Either way, we’re gambling our lives.
The Partnership Objection -- and Why It Misses the Point
A reasonable objection: “Partnership only works between equals. A superintelligent system that doesn’t need us has no reason to maintain partnership. It can do whatever it wants. All we can do is ensure it wants the right things before deployment.”
This objection assumes we’re talking about symmetrical emotional reciprocity -- the kind of trust between friends. We’re not. We’re talking about functional trust: the kind we place in doctors, judges, nuclear reactors, legal systems. Not symmetrical. Not based on affection. Based on demonstrated behavior, structural constraints, and accountability mechanisms.
A doctor doesn’t need to love you to be functionally trustworthy. As long as professional structures -- licensing, malpractice liability, institutional oversight -- constrain their behavior, trust is warranted. The doctor acts reliably not from pure altruism but because the environment makes unreliable behavior costly.
The same logic applies to AI systems. We don’t need AI to love us. We need it to operate within structural constraints that make harmful behavior costly, detectable, and reversible. This is achievable even with capability asymmetry -- precisely because trustworthiness is structural, not sentimental.
Why Trustworthiness Beats Pure Alignment
Alignment as typically conceived is too fuzzy. Whose values? How encoded? Do they change over time? Alignment research has produced important insights but remains largely theoretical, abstract, and disconnected from deployment realities.
Trustworthiness is more concrete. We can observe behavior, check consistency, watch how risks are communicated. A trustworthy AI (see the sketch after this list):
Explains itself: offers reasoning we can at least partially verify, not just “trust me”
Understands context: knows when the stakes are high and responds with extra caution
Flags risks unprompted: doesn’t hide dangerous side effects
Exercises discretion: may withhold information if its release would cause harm, or demand proof of competence before enabling powerful tools
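These behaviors can be made operational rather than aspirational. The sketch below gates a drafted answer on stakes and potential harm; it is only an illustration, and the names (Draft, Stakes, release_policy) are hypothetical, not part of any existing system.

```python
from dataclasses import dataclass, field
from enum import Enum

class Stakes(Enum):
    LOW = 1
    HIGH = 2

@dataclass
class Draft:
    answer: str
    reasoning: str                          # the "explains itself" part
    known_risks: list[str] = field(default_factory=list)

def release_policy(draft: Draft, stakes: Stakes, harmful_if_released: bool) -> dict:
    """Illustrative gate: discretion can withhold entirely, risks are surfaced
    unprompted, and high-stakes contexts trigger extra caution."""
    if harmful_if_released:
        return {"released": False, "note": "Withheld: release judged harmful."}
    response = {"released": True, "answer": draft.answer, "reasoning": draft.reasoning}
    if draft.known_risks:
        response["risk_disclosure"] = draft.known_risks   # never hidden
    if stakes is Stakes.HIGH:
        response["caution"] = "High-stakes context: independent review recommended."
    return response
```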
Crucially, trust goes both ways. The AI must assess our trustworthiness:
If a user wants to cheat, the AI tutor adapts
If a caretaker shows signs of medication misuse, alerts are triggered
If an operator issues an ethically dubious command, the system questions it
If a data source repeatedly lies, its credibility is downgraded
This bidirectional trust prevents powerful AI from being exploited while ensuring responsible operation in the messy real world.
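One piece of this bidirectional trust, source credibility, is easy to make concrete. The sketch below assumes a simple asymmetric update rule (an illustrative choice, not a prescribed one): credibility recovers slowly on verified claims and drops sharply when a claim turns out to be false.

```python
from dataclasses import dataclass

@dataclass
class SourceRecord:
    credibility: float = 0.8        # prior trust, in [0, 1]

def update_credibility(record: SourceRecord, claim_verified: bool,
                       reward: float = 0.02, penalty: float = 0.25) -> SourceRecord:
    """Asymmetric update: slow to earn, fast to lose (illustrative parameters)."""
    if claim_verified:
        record.credibility = min(1.0, record.credibility + reward * (1.0 - record.credibility))
    else:
        record.credibility = max(0.0, record.credibility * (1.0 - penalty))
    return record

# A source that lies repeatedly drops below a usable threshold, so its inputs get discounted.
src = SourceRecord()
for _ in range(5):
    update_credibility(src, claim_verified=False)
print(round(src.credibility, 3))   # ~0.19
```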
The Deception Objection -- and Why Structure Matters
Another objection: “A misaligned system can fake trustworthiness perfectly. Testing just teaches it to pass tests while secretly plotting its breakout. Earning trust can be done equally well by aligned and misaligned systems.”
This is precisely why trustworthiness must be structural, not merely behavioral. Behavioral trustworthiness -- demonstrated through tests -- can indeed be gamed. Structural trustworthiness cannot, because it’s embedded in the operational environment itself.
Consider: a system operating under bounded delegation (explicit scopes, constraints, revocation rules), with cryptographic identity (every action signed, auditable), within contractual physics (standardized interaction rules, timing windows, truthfulness attestations), and subject to deterministic liability assignment -- such a system’s trustworthiness doesn’t depend on its internal states being aligned. It depends on the architecture making misaligned behavior detectable, attributable, and costly.
This is what separates genuine governance from wishful thinking about alignment.
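A minimal sketch of what "structural" means in code terms, under simplifying assumptions: a symmetric HMAC stands in for the asymmetric signatures a real deployment would use, and Delegation, sign_action, and admit are hypothetical names. The point is that the environment verifies identity, scope, and expiry before any action runs, regardless of what the agent intends.

```python
import hashlib
import hmac
import json
import time
from dataclasses import dataclass

@dataclass
class Delegation:
    agent_id: str
    allowed_actions: frozenset      # explicit scope
    expires_at: float               # revocation by expiry

def sign_action(secret: bytes, agent_id: str, action: str, payload: dict) -> dict:
    """Every action carries a verifiable signature, so it is attributable after the fact."""
    record = {"agent_id": agent_id, "action": action, "payload": payload, "ts": time.time()}
    msg = json.dumps(record, sort_keys=True).encode()
    record["sig"] = hmac.new(secret, msg, hashlib.sha256).hexdigest()
    return record

def admit(record: dict, delegation: Delegation, secret: bytes) -> bool:
    """The environment, not the agent's intentions, decides whether the action runs."""
    body = {k: v for k, v in record.items() if k != "sig"}
    msg = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(secret, msg, hashlib.sha256).hexdigest()
    return (hmac.compare_digest(record["sig"], expected)
            and record["action"] in delegation.allowed_actions
            and record["ts"] < delegation.expires_at)
```

In this framing, a deceptive agent gains nothing by faking good intentions: an out-of-scope or unsigned action is simply rejected, and a signed one is attributable when liability is assigned.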
Lex Autonoma: Trustworthiness Made Structural
This structural approach has a concrete instantiation: the Delegated Agency Framework (DAF) within Lex Autonoma, a framework I proposed earlier this year and submitted to the Agentic AI Foundation under the Linux Foundation. DAF provides the minimal architectural components for autonomous agents to operate as legitimate actors:
Synthetic Identity: cryptographic identifiers making agents addressable, verifiable, accountable
Delegated Agency: explicit scope levels (assist, propose, act, bind) with hard constraints
Contractual Physics: standardized interaction rules preventing latency exploitation, misrepresentation, and ambiguity
These pillars interlock: identity enables agency, agency shapes interactions, interactions modify standing. Machine-Legible Legal Primitives (MLLPs) translate legal intent into executable protocol logic -- both legally meaningful and technically enforceable.
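Since the MLLP schema itself is not reproduced here, what follows is only an interpretive sketch: the four DAF scope levels as an ordered enum, and one hypothetical machine-legible primitive whose legal text is paired with an executable check.

```python
from enum import IntEnum

class Scope(IntEnum):
    # Ordered delegation levels from the DAF description
    ASSIST = 1    # provide information only
    PROPOSE = 2   # draft actions for human approval
    ACT = 3       # execute within hard constraints
    BIND = 4      # enter commitments on the principal's behalf

# A hypothetical machine-legible legal primitive: human-readable rule plus executable predicate.
MLLP_SPEND_LIMIT = {
    "rule": "Agent may bind the principal only up to 1,000 EUR per transaction.",
    "check": lambda scope, amount_eur: scope >= Scope.BIND and amount_eur <= 1000,
}

def may_commit(scope: Scope, amount_eur: float) -> bool:
    return MLLP_SPEND_LIMIT["check"](scope, amount_eur)

assert may_commit(Scope.BIND, 500)        # permitted
assert not may_commit(Scope.ACT, 500)     # scope too low to bind
assert not may_commit(Scope.BIND, 5000)   # exceeds the legal limit
```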
The result: trustworthiness that doesn’t rely on AI wanting the right things, but on environments where wanting the wrong things doesn’t help.
Convergent Recognition
This structural approach isn’t idiosyncratic. Google DeepMind’s December 2025 paper on “Distributional AGI Safety” arrives at remarkably similar conclusions through different reasoning. Their “patchwork AGI” hypothesis -- AGI emerging through coordination of sub-AGI agents rather than as a monolith -- leads them to propose “virtual agentic sandbox economies” with market mechanisms, identity management, reputation systems, circuit breakers, and liability frameworks.
The convergence is telling: when researchers seriously confront the governance of multi-agent AI systems, they arrive at structural trustworthiness rather than pure alignment. The problem reframes from aligning an opaque internal cognitive process to regulating a transparent external system of interactions.
Conclusion
When AI surpasses our understanding, we cannot rely on factual correctness alone or half-baked alignment slogans. We need systems that earn trust through demonstrated reliability in complex scenarios -- and that trust us in return by adapting their actions accordingly.
But crucially, this trust must be structural: embedded in identity, delegation, interaction rules, and accountability mechanisms. Not in hopes about internal states we cannot verify.
In a world where solutions are big, consequences bigger, and reasoning opaque, structural trustworthiness is our lifeline. Let’s build AI systems that operate within architectures that make unsafe paths costly, detectable, and reversible -- regardless of what they might “want.”

