Deterministic execution substrate

Deterministic execution for AI‑generated research.

BLISP lets stochastic agents propose computations while a typed execution layer grounds, canonicalizes, executes, hashes, and replays them deterministically. No unwarranted operation reaches execution. Every result is replayable by hash.

BLISP does not try to make LLMs deterministic; it makes the execution boundary deterministic.

Prompt → Agent Proposal → Grounding Gate → Canonical Execution → 8-Layer Provenance → Replayable Result

23.3% → 10.0%: valid-but-unwarranted executions, reduced by the grounding gate
100% → 0%: unwarranted executions on undiscoverable prompts
50/50: replay runs that produced bit-identical execution hashes
<14 ms: grounding overhead per request

The problem

AI agents can reason. They cannot be trusted to execute unchecked.

Large language models propose computational pipelines from natural-language prompts. The operations they select may be structurally valid but semantically unwarranted—the operation exists in the system, but the user's request does not justify it. Schema validation catches malformed output. It does not catch valid-but-wrong execution.

Example: valid-but-unwarranted execution

User request

“Build a momentum strategy on equity futures, ranked by Sharpe ratio.”

vs

Agent proposal

Family: MOM_REV (mean-reversion)
Metric: SRP (Sharpe)

Both are valid capabilities. Schema validation passes. The pipeline executes—and produces the opposite computational signal.

The output is correct in form and exactly wrong in substance. Constrained decoding restricts the model to the full set of valid names, all 36 family×metric pairs (4 strategy families × 9 metrics), but not to the subset discovered for a given prompt. The grounding gate admits only the discovered names.

System design

The missing boundary between proposal and execution.

BLISP interposes a mandatory admissibility boundary—the grounding gate—between stochastic reasoning and deterministic execution. Above the boundary, agents propose. Below it, everything is deterministic, typed, and content-addressed.

01 Registry

A live capability registry of operations, families, signal blocks, and recipes, covering 244 operations, 4 strategy families, and 9 metrics. Each entry is hashed over semantic, algebraic, and implementation layers.

02 Discovery

Given natural-language terms, the system matches against the live registry using a four-tier cascade: exact, alias, tag, keyword. Unresolved terms cannot reach execution.
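The cascade can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Capability` record, the three registry entries (including the family name `MOM`), and the matching rule used in each tier are hypothetical and are not taken from the BLISP registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    """Hypothetical registry entry; field names are illustrative only."""
    name: str
    aliases: tuple = ()
    tags: tuple = ()
    description: str = ""


# Hypothetical three-entry registry used only for this sketch.
REGISTRY = (
    Capability("dlog", aliases=("log returns",), tags=("returns",),
               description="first difference of log prices"),
    Capability("MOM", aliases=("momentum",), tags=("trend",),
               description="trend following strategy family"),
    Capability("SRP", aliases=("sharpe", "sharpe ratio"), tags=("risk-adjusted",),
               description="sharpe ratio performance metric"),
)


def discover(term, registry=REGISTRY):
    """Return (capability name, tier) from the first tier that matches, else None."""
    t = term.strip().lower()
    tiers = (
        ("exact", lambda cap: t == cap.name.lower()),
        ("alias", lambda cap: t in {a.lower() for a in cap.aliases}),
        ("tag", lambda cap: t in {g.lower() for g in cap.tags}),
        ("keyword", lambda cap: t in cap.description.lower().split()),
    )
    # Tiers are tried in a fixed order, so a stricter match always wins.
    for tier, matches in tiers:
        for cap in registry:
            if matches(cap):
                return cap.name, tier
    return None  # unresolved term: nothing to hand to the grounding gate
```

In this sketch `discover("Sharpe ratio")` resolves through the alias tier, while a term such as `"carry"` resolves to nothing and therefore contributes no evidence.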

03 Grounding Gate

A deterministic function that checks whether every capability name in the agent's proposal has evidence in the discovery result. Names lacking evidence are rejected.
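As a rough sketch of the check described here, the gate can be written as a pure function over the proposal and the discovery result. The function name, the dictionary shape of the proposal, and the family name `MOM` are assumptions for illustration; only `MOM_REV` and `SRP` appear in the example earlier on this page.

```python
def grounding_gate(proposal, discovered):
    """Admit a proposal only if every proposed capability name has discovery evidence.

    proposal:   mapping of slot -> capability name (hypothetical shape)
    discovered: set of capability names produced by discovery for this prompt
    Returns (admitted, rejected_names).
    """
    rejected = sorted(name for name in proposal.values() if name not in discovered)
    return (not rejected, rejected)


# Hypothetical discovery result for the momentum / Sharpe prompt.
discovered = frozenset({"MOM", "SRP"})

# The agent proposal from the example: a valid family, but not a discovered one.
proposal = {"family": "MOM_REV", "metric": "SRP"}
admitted, rejected = grounding_gate(proposal, discovered)
```

Because the function has no hidden state and no randomness, the same proposal and the same discovery result always yield the same decision. Here the proposal is rejected on `MOM_REV` even though that name would pass schema validation.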

04 Specification

Admitted proposals become typed specification records with family, metric, parameter ranges, and data source. Parameter ranges expand into a morphism grid via Cartesian product.
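The grid expansion can be sketched with `itertools.product`. The record layout and the parameter names `lookback` and `holding` are hypothetical; the sketch only shows how ranges multiply into individual grid points.

```python
from itertools import product


def expand_grid(family, metric, param_ranges):
    """Expand parameter ranges into one record per point of their Cartesian product."""
    names = sorted(param_ranges)  # fixed parameter order keeps the grid order deterministic
    return [
        {"family": family, "metric": metric, "params": dict(zip(names, values))}
        for values in product(*(param_ranges[name] for name in names))
    ]


# Hypothetical specification: 3 lookbacks x 2 holding periods = 6 grid points.
grid = expand_grid("MOM", "SRP", {"lookback": [20, 60, 120], "holding": [5, 10]})
```

Sorting the parameter names before taking the product is a design choice of this sketch: it makes the grid order independent of the order in which the ranges were written.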

05 Canonicalization

Expressions pass through a six-stage typed compilation pipeline whose stages include parsing, normalization, canonicalization, planning, and optimization. Surface syntax differences collapse to canonical identity.
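A toy illustration of collapsing surface syntax to one identity, assuming expressions are nested tuples and that the only normalization is ordering the arguments of commutative operators. The operator names and the hashing scheme are assumptions; the actual pipeline is not reproduced here.

```python
import hashlib
import json

# Hypothetical set of operators whose argument order does not matter.
COMMUTATIVE = {"add", "mul"}


def canonicalize(expr):
    """Return a canonical nested-list form of an expression tree."""
    if not isinstance(expr, (list, tuple)):
        return expr  # leaf: a name or a constant
    op, *args = expr
    args = [canonicalize(arg) for arg in args]
    if op in COMMUTATIVE:
        # Order arguments by their serialized form so a+b and b+a coincide.
        args.sort(key=json.dumps)
    return [op, *args]


def identity_hash(expr):
    """SHA-256 over the canonical form, so equivalent spellings share one hash."""
    payload = json.dumps(canonicalize(expr), separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


h1 = identity_hash(("add", "x", ("mul", "y", 2)))
h2 = identity_hash(("add", ("mul", 2, "y"), "x"))
```

In this sketch `h1` and `h2` are equal because both spellings reduce to the same canonical tree, while a non-commutative operator such as a hypothetical `sub` keeps its argument order and therefore its own hash.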

06 Execution

Each admissible morphism runs on a typed, deterministic execution engine: same input, same registry, same output. No randomness below the boundary.

07 Provenance

Every execution produces an 8-layer hash decomposing provenance into registry, request, morphisms, plans, artifacts, score, selection, and data. Fault localization without re-execution.
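One way to picture a layered hash is sketched below. The layer names come from the text above; the JSON serialization, the use of SHA-256, and the way sub-hashes are combined into a single execution hash are assumptions of this sketch, not a description of BLISP's scheme.

```python
import hashlib
import json

# Layer names as listed in the text; their order here is an assumption.
LAYERS = ("registry", "request", "morphisms", "plans",
          "artifacts", "score", "selection", "data")


def _digest(obj):
    """Deterministic SHA-256 over a canonical JSON serialization."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def provenance_record(layers):
    """Hash each layer separately, then hash the ordered sub-hashes together."""
    missing = [name for name in LAYERS if name not in layers]
    if missing:
        raise ValueError(f"missing layers: {missing}")
    sub_hashes = {name: _digest(layers[name]) for name in LAYERS}
    execution_hash = _digest([sub_hashes[name] for name in LAYERS])
    return {"layers": sub_hashes, "execution_hash": execution_hash}


# Placeholder layer contents; real layers would hold the actual execution data.
example_layers = {name: {"example": name} for name in LAYERS}
record = provenance_record(example_layers)
```

Keeping one sub-hash per layer is what allows a later comparison to say which layer changed instead of only reporting that something changed.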

08 Replay

Identical grounded requests against identical data and registry produce bit-identical hashes. Compare two hashes to verify replay. Compare sub-hashes to localize divergence.
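Sub-hash comparison can be sketched as a per-layer diff. The record shape (a mapping from layer name to sub-hash) and the placeholder strings standing in for real hashes are assumptions for illustration.

```python
# Layer names as listed for the 8-layer provenance hash.
LAYERS = ("registry", "request", "morphisms", "plans",
          "artifacts", "score", "selection", "data")


def diverging_layers(run_a, run_b):
    """Return the layers whose sub-hashes differ; an empty list means the replay matches."""
    return [name for name in LAYERS if run_a[name] != run_b[name]]


# Placeholder strings stand in for real sub-hashes in this sketch.
run_a = {name: f"{name}-v1" for name in LAYERS}
run_b = dict(run_a, data="data-v2")  # same request and registry, different data

same = diverging_layers(run_a, run_a)
diff = diverging_layers(run_a, run_b)
```

Here `same` is empty and `diff` names only the `data` layer, which is the kind of localization the text describes: the divergence is attributed to one layer without running anything again.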

Design principle

Description/identity separation.

Each capability is hashed over three layers: semantic properties, algebraic type signature, and implementation details. A fourth layer—discovery metadata (aliases, tags, descriptions)—is explicitly excluded from the identity hash. Adding an alias like “log returns” → dlog changes what agents can discover; it does not change what dlog computes. The registry can improve discoverability without invalidating any prior execution hash.
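The separation can be sketched by hashing only the three identity layers and leaving discovery metadata out of the payload. The entry layout and the field values for `dlog` are hypothetical; only the alias "log returns" and the name `dlog` come from the text.

```python
import hashlib
import json

# Layers that define identity; discovery metadata is deliberately not listed.
IDENTITY_LAYERS = ("semantic", "algebraic", "implementation")


def identity_hash(entry):
    """Hash the identity layers only, ignoring any discovery metadata."""
    core = {layer: entry[layer] for layer in IDENTITY_LAYERS}
    payload = json.dumps(core, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical registry entry for dlog.
dlog = {
    "semantic": {"kind": "transform"},
    "algebraic": "Series -> Series",
    "implementation": "diff(log(x))",
    "discovery": {"aliases": [], "tags": ["returns"]},
}

before = identity_hash(dlog)

# Adding an alias touches discovery metadata only.
with_alias = {**dlog, "discovery": {"aliases": ["log returns"], "tags": ["returns"]}}
after = identity_hash(with_alias)

# Changing the implementation touches an identity layer.
reimplemented = identity_hash({**dlog, "implementation": "diff(log1p(x))"})
```

In this sketch `before` equals `after`, so prior execution hashes that depend on the entry stay valid, while `reimplemented` differs because an identity layer changed.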


Research Program
Eleven papers

The scientific backbone.

BLISP is built on an eleven-paper research program that formalizes the execution semantics, computation identity, provenance structure, semantic coordinates, and behavioral geometry of AI-generated computation. Paper 11 maps the emerging “verified AI actions” landscape and presents the first implemented system for runtime semantic verification of AI tool selection.

Program DOI: 10.5281/zenodo.20459958

Paper 1

The Grounding Gate

A mandatory admissibility boundary between stochastic AI reasoning and deterministic execution. Proposals whose capability names lack evidence from the user's terms are rejected before execution.

F3 rate: 23.3% → 10.0% (p = 0.027)
Undiscoverable: 100% → 0%
DOI: 10.5281/zenodo.20817087
Paper 2

Canonical Execution Semantics

A typed specification space, canonicalization pipeline, and content-addressed hashing scheme that provides execution identity independent of surface syntax.

278 → 235 canonical ops · 1,200 LLM generations
50/50 bit-identical replays
DOI: 10.5281/zenodo.20457255
Paper 3

Execution Categories

Stochastic prompt variation defines an equivalence relation on the execution space. Prompts that produce the same canonical execution form an equivalence class. Execution fibers bundle equivalent proposals.

Congruence · quotient category · fiber projection
DOI: 10.5281/zenodo.20457403
Paper 4

Provenance Algebra

Every execution produces a decomposable provenance record. Sub-hash comparison localizes divergence without re-execution. Drift detection isolates which semantic layer changed.

Compositional provenance · divergence localization · partial replay
DOI: 10.5281/zenodo.20457667
Paper 5

Execution Fibers

Under stochastic prompt variation, many distinct proposals collapse into few execution identities. Synonym perturbations stay intra-fiber. Metric/family substitutions produce clean inter-fiber transitions.

2,200 proposals · synonym ρ = 0.985
metric/family ρ = 0.000 · σ = 1.000
DOI: 10.5281/zenodo.20457990
Paper 6

The Semantic Structure of Execution

A single 7-valued coordinate (DependencyClass) predicts four independent optimizer behaviors with 99.6% accuracy. The coordinate is a predictive object, not a label.

243/244 predictions · z = 13.0 · p < 10^−38
DOI: 10.5281/zenodo.20612709
Paper 7

Semantic Coordinates as Predictive Objects

Frozen taxonomy generalizes to 25 unseen operations at 100% accuracy. Ablation confirms the coordinate is minimal. Random baselines achieve chance.

100/100 holdout · MI explains 96.7–100% of entropy
DOI: 10.5281/zenodo.20706294
Paper 8

Cross-System Transfer

Frozen dependency-shape taxonomy predicts execution behavior in Polars and DuckDB at 91.1% combined accuracy. Zero errors from taxonomy assignments.

180 predictions · buffering 96.7% in both systems
DOI: 10.5281/zenodo.20706086
Paper 9

Agent Convergence

Independent model families reconstruct structurally equivalent execution-identity primitives under task pressure. 7/8 primitives converge above 0.90.

55 runs · 3 model families · 3 domains · ~178k tokens/run
DOI: 10.5281/zenodo.20706156
Paper 10

When Data-Hash Caching Fails

Safe compositional caching requires the cache key to induce a congruence on the computation algebra. Data-hash keying violates this; identity-hash keying satisfies it.

97 false hits (DataHash) · 0 false hits (IdentityHash)
Theorem 1: cache key correctness iff congruence
DOI: 10.5281/zenodo.20815342
Paper 11

Verified AI Actions

The first implemented system for runtime semantic verification of AI tool selection. Three verification layers, a runtime grounding wall, and the architecture requirements for a cross-framework verification protocol.

Wrong-tool: 23.3% → 10.0% (p = 0.027)
9 property tests · 1,600/1,600 total pass
DOI: 10.5281/zenodo.20816935
Why it matters

Different audiences, one execution problem.

For AI research

Agents need execution substrates, not just tool APIs.

Tool-augmented LLMs select tools directly with no admission gate between selection and execution. A valid but wrong tool call produces a silent failure. The grounding gate makes tool admission evidence-based and deterministic.

For research

Computations must be replayable, comparable, and attributable.

Two researchers running the same grounded request against the same data and registry get bit-identical results. When results differ, 8-layer sub-hash comparison localizes the divergence to a specific semantic layer without re-execution.

For finance

Systematic research needs deterministic provenance from prompt to portfolio.

Strategy families, scoring metrics, and parameter grids are content-addressed. Every research pipeline has a verifiable execution fingerprint. Six months later, the hash still validates.

For infrastructure

BLISP turns agent outputs into typed, admissible, content-addressed executions.

The execution layer is domain-independent. Finance is the first package. The architecture—discovery, grounding, canonicalization, provenance—applies to any domain where AI-generated pipelines must be validated before execution.

Infrastructure thesis

Why this can become infrastructure.

BLISP does not make the model truthful. It prevents unwarranted proposals from silently becoming executions. The model reasons stochastically. The execution layer operates deterministically. The boundary between them is the contribution.
Formal structure

The execution pipeline, formally.

$$E_R \longrightarrow \Gamma \longrightarrow B_R/{\sim_R} \longrightarrow \kappa_R \longrightarrow \varepsilon_R \longrightarrow P_R$$

$E_R$: Stochastic proposal space, all agent-generated proposals
$\Gamma$: Grounding gate, rejects proposals without discovery evidence
$B_R/{\sim_R}$: Execution identity, equivalence classes under canonicalization
$\kappa_R$: Canonical representative, one expression per equivalence class
$\varepsilon_R$: Deterministic execution, same canonical input, same output
$P_R$: 8-layer provenance record, decomposable and content-addressed

Stochastic prompt variation generates many elements of $E_R$. The grounding gate $\Gamma$ admits only proposals with discovery evidence. Canonicalization collapses admitted proposals into equivalence classes $B_R/{\sim_R}$, each with a unique canonical representative $\kappa_R$. Execution $\varepsilon_R$ is a function on canonical representatives and is deterministic by construction. The provenance record $P_R$ decomposes the full execution into 8 semantic layers for audit and fault localization.