Causality Is the Contract Between Runtime and Observability

I used to think agent causality was an observability problem.

Add traces. Add a dashboard. Add a session graph. Make the spans searchable. Once the data exists, the chain should become visible.

That framing was incomplete.

Observability can show activity. It can show which model ran, how long it took, how much it cost, which tool fired, and which request failed. But it cannot reliably preserve causality beyond the fidelity of the signals the runtime gives it.

The runtime knows when work starts. It knows which session owns the turn. It knows whether a run came from a chat message, a cron job, a delegated sub-agent, or a provider fallback. If those facts are not emitted as part of execution, the trace layer can only infer them later.

Sometimes that inference is good enough.

Sometimes it lies by omission.

The stronger thesis I have landed on is this:

Causality is not a runtime feature or an observability feature. It is the contract between them.

#The Previous Layer: Explanation

The last thing I wrote about was Mux, a small routing layer between my agents and model providers.

The obvious goal was model routing: stop sending every request to the same default model. Route lightweight prompts to cheaper models, coding prompts to stronger ones, and deeper reasoning tasks to the expensive path.

But the useful part was not the router. It was the explanation layer.

For each request, Mux records:

prov.route.requested_model
prov.route.resolved_model
prov.route.reason
prov.route.runtime
prov.route.message_count

That answers a specific question: why did this model run?
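As a rough sketch, those five attributes could travel together as one flat record attached to the request's span. Only the prov.route.* keys come from the list above; the `RouteDecision` type, the `toSpanAttributes` helper, and the example values are hypothetical and are not Mux's actual API.

```typescript
// Hypothetical shape for one routing decision. Only the prov.route.* keys
// are taken from the article; the field names are illustrative.
interface RouteDecision {
  requestedModel: string;
  resolvedModel: string;
  reason: string;
  runtime: string;
  messageCount: number;
}

// Flatten the decision into span-style key/value attributes.
function toSpanAttributes(d: RouteDecision): Record<string, string | number> {
  return {
    "prov.route.requested_model": d.requestedModel,
    "prov.route.resolved_model": d.resolvedModel,
    "prov.route.reason": d.reason,
    "prov.route.runtime": d.runtime,
    "prov.route.message_count": d.messageCount,
  };
}

// Example values are invented for illustration.
const routeAttrs = toSpanAttributes({
  requestedModel: "auto",
  resolvedModel: "small-model",
  reason: "lightweight prompt",
  runtime: "openclaw",
  messageCount: 3,
});
```

Keeping the requested and resolved model side by side is the point: the difference between them is the routing decision, and the reason field explains it.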

It does not answer a different question: why did this downstream action happen at all?

That second question shows up when agents delegate to other agents, when scheduled jobs run without a human in the loop, when sub-agents finish but disappear from the parent view, or when a provider fallback silently bypasses the proxy you expected it to use.

Mux made model choice explainable.

The next problem was runtime causality.

#What the Runtime Has to Emit

The more I dogfood this setup, the more I think every agent runtime needs a small but explicit causality contract.

At minimum, it should emit:

a stable session identifier
a stable run or turn identifier
a parent session identifier when work is delegated
a task label or source label
a queued event before execution starts
a processing state transition
an idle or completed state transition
a processed event with duration and outcome
trace context that can be propagated into child work

The exact names do not matter. The shape does.
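One way to pin down that shape is a small set of types. This is a sketch under my own naming: `CausalContext`, `LifecycleEvent`, and `isCompleteTurn` are hypothetical and do not come from any runtime's SDK.

```typescript
// All names below are hypothetical; they mirror the checklist, not a real SDK.
interface CausalContext {
  sessionId: string;        // stable session identifier
  runId: string;            // stable run or turn identifier
  parentSessionId?: string; // set only when work is delegated
  source: string;           // task or source label, e.g. "chat" or "cron"
  traceParent?: string;     // trace context to propagate into child work
}

type LifecycleEvent =
  | ({ kind: "queued" } & CausalContext)
  | ({ kind: "state"; state: "processing" | "idle" } & CausalContext)
  | ({ kind: "processed"; durationMs: number; outcome: "ok" | "error" } & CausalContext);

// A turn satisfies the contract if it was queued first, processed last,
// and passed through "processing" somewhere in between.
function isCompleteTurn(events: LifecycleEvent[]): boolean {
  if (events.length < 3) return false;
  const first = events[0];
  const last = events[events.length - 1];
  const sawProcessing = events.some(
    (e) => e.kind === "state" && e.state === "processing",
  );
  return first?.kind === "queued" && last?.kind === "processed" && sawProcessing;
}
```

A check like `isCompleteTurn` is deliberately strict: a turn that starts without a queued event is exactly the kind of gap the rest of this post is about.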

In OpenClaw, the useful lifecycle is built around events like message.queued, session.state, and message.processed. AgentWeave can listen to those events and create root spans that downstream LLM calls, tool calls, and delegated sessions attach to.

That root span matters.

Without it, every LLM call looks real but contextless. With it, the trace has a beginning. A cron run, a Telegram message, a sub-agent handoff, and a model call can all become part of the same causal chain.
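To make the difference concrete, here is a minimal in-memory sketch of a subscriber that opens a root span on message.queued and closes it on message.processed. The event names follow the ones above; the `TurnTracer` class, its methods, and the span fields are assumptions for illustration and are not AgentWeave's implementation.

```typescript
// Hypothetical in-memory tracer. Payload fields and method names are assumed.
interface TraceSpan {
  id: string;
  parentId: string | null;
  label: string;
  ended: boolean;
}

class TurnTracer {
  private rootBySession = new Map<string, TraceSpan>();
  private nextId = 1;

  // message.queued opens the root span for the session's turn.
  onQueued(sessionId: string, source: string): TraceSpan {
    const root: TraceSpan = {
      id: `span-${this.nextId++}`,
      parentId: null,
      label: `turn:${source}`,
      ended: false,
    };
    this.rootBySession.set(sessionId, root);
    return root;
  }

  // LLM and tool calls attach to the open root when one exists.
  // Without a root they are still recorded, but with no parent.
  recordChild(sessionId: string, label: string): TraceSpan {
    const root = this.rootBySession.get(sessionId);
    return {
      id: `span-${this.nextId++}`,
      parentId: root ? root.id : null,
      label,
      ended: true,
    };
  }

  // message.processed closes the root span.
  onProcessed(sessionId: string): void {
    const root = this.rootBySession.get(sessionId);
    if (root) root.ended = true;
    this.rootBySession.delete(sessionId);
  }
}
```

A call recorded before any queued event comes back with `parentId` set to `null`. That is the "real but contextless" case: the span exists, but nothing says what started it.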

[Figure: runtime-to-observability contract. OpenClaw emits lifecycle events, AgentWeave creates root spans, and the trace graph preserves child work.]

The observability layer should not invent that chain if the runtime can emit it directly.

It should preserve it.

#Where This Broke in Practice

This is not theoretical for me. It came directly from running the system every day.

My setup has a few moving parts:

Nix runs on OpenClaw as the always-on orchestrator.
Max runs on a Mac Mini for browser and desktop work.
AgentWeave traces LLM calls, tool work, routing decisions, and session graphs.
Mux handles model routing through a shared policy layer.
Scheduled jobs run daily for portfolio briefings, research digests, and job-search automation.
Sub-agents get spawned for longer coding or research tasks.

This is exactly the kind of system where a pretty trace view can miss the real story.

The graph can show what crossed the event boundary. It cannot show what never crossed it.

The useful lesson came from five OpenClaw changes. Three have landed upstream. Two are running in my fork and are open upstream as follow-ups.

That distinction matters. The landed fixes are proof. The open PRs show where the contract is still being pushed deeper into the runtime and the plugin SDK.

#A Provider Fallback Escaped the Proxy

The first landed fix was model-provider routing.

I had configured OpenClaw's openai-codex provider with a custom baseUrl so traffic could flow through a local proxy path. That proxy path matters because it is where routing, attribution, and trace capture happen.

But when the model registry did not have a template row to clone, OpenClaw synthesized a fallback model with a hardcoded default base URL. The configured baseUrl was silently ignored.

The result was subtle:

the config looked correct
the runtime appeared to be using the right provider
the synthesized model bypassed the proxy
the call failed or fell back before the expected trajectory was visible

This is the kind of bug that makes operators distrust dashboards. The dashboard is not wrong, exactly. It is downstream of a runtime path that escaped instrumentation.

The fix was tiny:

const synthBaseUrl =
  ctx.providerConfig?.baseUrl ?? OPENAI_CODEX_BASE_URL;

But tiny fixes at the runtime boundary matter. If the provider path can silently bypass the proxy, no observability layer above it can recover the missing evidence.
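A self-contained version of the same idea, for anyone who wants to poke at the behavior. The `ProviderConfig` type, the `resolveSynthBaseUrl` name, and the default URL are placeholders of mine; the real constant's value is not shown here.

```typescript
// Placeholder default; the real OPENAI_CODEX_BASE_URL value is not given here.
const DEFAULT_BASE_URL = "https://default.invalid/v1";

// Hypothetical, trimmed-down provider config.
interface ProviderConfig {
  baseUrl?: string;
}

// Prefer the operator's configured baseUrl. The ?? operator falls back only
// when the value is null or undefined, so an explicit setting always wins.
function resolveSynthBaseUrl(providerConfig?: ProviderConfig): string {
  return providerConfig?.baseUrl ?? DEFAULT_BASE_URL;
}
```

With a proxy configured, `resolveSynthBaseUrl({ baseUrl: "http://127.0.0.1:8080/v1" })` returns the proxy URL, so the synthesized model stays on the instrumented path.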

This landed upstream in openclaw/openclaw#76428.

#Sub-Agents Were Alive but Invisible

The second landed fix was sub-agent retention.

OpenClaw has more than one way to spawn sub-agents. In one mode, the sub-agent runs as a managed run. In another, it gets its own session. Those modes had drifted apart in a small but operator-visible way.

Run-mode sub-agents respected the configured archiveAfterMinutes window. Session-mode sub-agent registry rows were swept after a hardcoded 5 minutes.

That created a strange failure mode.

The child session still existed on disk. The work had completed. But if I asked the parent agent what happened after the five-minute window, subagents list could be empty.

From a human perspective, the workers looked like they had silently disappeared.

Again, the issue was not that all the session data was gone. It was that the runtime had dropped the operator-facing relationship too early.

The fix was to make session-mode retention honor the same configuration as run-mode:

default retention: 60 minutes
custom archiveAfterMinutes: respected for both modes
archiveAfterMinutes set to 0: reaping disabled for both modes
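Those three rules fit in a few lines. This is a sketch of the semantics only: `shouldReapSubagent` is a hypothetical helper, not the function OpenClaw uses.

```typescript
const DEFAULT_ARCHIVE_AFTER_MINUTES = 60;

// Hypothetical helper: decides whether a finished sub-agent's registry row
// may be swept. The same rule applies to run-mode and session-mode.
function shouldReapSubagent(
  minutesSinceFinished: number,
  archiveAfterMinutes?: number,
): boolean {
  const windowMinutes = archiveAfterMinutes ?? DEFAULT_ARCHIVE_AFTER_MINUTES;
  if (windowMinutes === 0) return false; // 0 disables reaping entirely
  return minutesSinceFinished >= windowMinutes;
}
```

Under this rule a session-mode child that finished six minutes ago is still listed, where the old hardcoded five-minute sweep would already have dropped it.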

That sounds like housekeeping. It is actually part of causality.

If the parent cannot explain which child sessions it spawned, whether they finished, and where their session records live, the system loses its causal structure from the operator's point of view.

This landed upstream in openclaw/openclaw#78263.

#Cron Jobs Were Doing Work Without a Root Turn

The clearest example is the latest one to land upstream.

OpenClaw already emitted diagnostic lifecycle events for channel-driven turns. If a Telegram message came in, the runtime emitted message.queued, moved the session into processing, eventually moved it back to idle, then emitted message.processed.

That gave observability subscribers a clean envelope for the turn.

But isolated cron jobs followed a different execution path. They created real sessions. They made real LLM calls. They delivered real outputs. But they did not emit the same queued and processed lifecycle.

From the tracing side, that meant cron traffic existed, but attribution collapsed.

In one live deployment window, the native session data showed 63 distinct cron-driven sessions. The diagnostic-event subscriber saw only 4 distinct session IDs reaching the trace store.

That is not a small gap. 59 of 63 sessions never reached the trace store, which is a 94% attribution loss.
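The arithmetic behind that figure, as a small helper. The `attributionLoss` function is mine; the two counts are the ones from the deployment window above.

```typescript
// Fraction of natively recorded sessions that never reached the trace store.
function attributionLoss(nativeSessions: number, tracedSessions: number): number {
  if (nativeSessions <= 0) {
    throw new RangeError("nativeSessions must be positive");
  }
  return (nativeSessions - tracedSessions) / nativeSessions;
}

const loss = attributionLoss(63, 4);        // (63 - 4) / 63 = 59 / 63
const lossPercent = Math.round(loss * 100); // 94
```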

In my fork, the cron runner now emits the same lifecycle as the channel dispatch path:

logMessageQueued({
  sessionId,
  sessionKey,
  channel: "cron",
  source: "cron-isolated",
});

logSessionStateChange({ state: "processing" });

try {
  await runAgentTurn();
} finally {
  logSessionStateChange({ state: "idle" });
  logMessageProcessed({ outcome, durationMs, error });
}

That changes the nature of the trace. The model call is no longer just "some LLM call from the main agent." It belongs to a specific scheduled run with a specific session key and outcome.
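The same envelope can be written as a reusable wrapper, which also shows where outcome and duration come from. This is a simplified sketch, not the code in my fork: `emit` stands in for the runtime's diagnostic bus, and the payload fields and outcome values are assumptions.

```typescript
// Hypothetical sketch: emit() stands in for the runtime's diagnostic bus.
type Outcome = "completed" | "error";
type Emit = (eventName: string, payload: Record<string, unknown>) => void;

async function runWithLifecycle<T>(
  emit: Emit,
  ctx: { sessionId: string; sessionKey: string; source: string },
  runTurn: () => Promise<T>,
): Promise<T> {
  emit("message.queued", { ...ctx, channel: "cron" });
  emit("session.state", { ...ctx, state: "processing" });
  const startedAt = Date.now();
  let outcome: Outcome = "completed";
  let errorMessage: string | undefined;
  try {
    return await runTurn();
  } catch (err) {
    outcome = "error";
    errorMessage = err instanceof Error ? err.message : String(err);
    throw err; // the caller still sees the failure
  } finally {
    // Runs on success and on failure, so the envelope always closes.
    emit("session.state", { ...ctx, state: "idle" });
    emit("message.processed", {
      ...ctx,
      outcome,
      durationMs: Date.now() - startedAt,
      error: errorMessage,
    });
  }
}
```

The `finally` block is the part that matters. A turn that throws still produces an idle transition and a processed event, so a failed cron run shows up as a failed run instead of a missing one.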

This is the strongest version of the thesis: observability can infer activity, but durable causality requires the runtime to emit lifecycle.

This landed upstream in openclaw/openclaw#79214.

That makes it the centerpiece of the story.

#Plugin Health Protects the Contract

There is one more layer that surprised me: plugin compatibility.

The AgentWeave bridge for OpenClaw depends on the runtime loading a plugin and that plugin subscribing to diagnostic events. If an upgrade changes plugin loading behavior, or if a plugin entry no longer resolves on disk, the gateway can boot while the observability path quietly disappears.

That failure mode is boring in the way production failures are boring. Nothing dramatic happens. The system just emits less truth.

So in my fork I added an openclaw doctor --post-upgrade --json mode, which I have proposed upstream. It checks plugin compatibility after an upgrade and emits structured findings:

{
  "probesRun": ["plugin.entry_unresolved", "plugin.manifest_drift"],
  "findings": [
    {
      "level": "error",
      "code": "plugin.entry_unresolved",
      "plugin": "agentweave-bridge"
    }
  ]
}

This is not tracing. But it protects tracing.

If the bridge plugin fails to load, the runtime-observability contract is broken before the first span is created. A structured post-upgrade probe gives CI, upgrade tooling, and local operators a machine-readable way to catch that.
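A consumer of that output can be very small. The types below mirror the JSON example; the `blockingFindings` helper and the idea of gating on the "error" level are my assumptions about how a CI step might use it, and other level values are not documented here.

```typescript
// Types mirror the JSON example above; everything else is hypothetical.
interface DoctorFinding {
  level: string; // the example shows "error"; other levels are assumed to exist
  code: string;
  plugin: string;
}

interface DoctorReport {
  probesRun: string[];
  findings: DoctorFinding[];
}

// Returns the findings that should fail an upgrade gate.
function blockingFindings(rawJson: string): DoctorFinding[] {
  const report = JSON.parse(rawJson) as DoctorReport;
  return report.findings.filter((f) => f.level === "error");
}

const sampleReport = JSON.stringify({
  probesRun: ["plugin.entry_unresolved", "plugin.manifest_drift"],
  findings: [
    { level: "error", code: "plugin.entry_unresolved", plugin: "agentweave-bridge" },
  ],
});

// A CI step could exit non-zero whenever this array is not empty.
const blockers = blockingFindings(sampleReport);
```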

The upstream PR is open at openclaw/openclaw#79260.

#Model Events Need an Explicit Plugin Path

The fifth change addresses a different kind of observability gap.

OpenClaw already emits model lifecycle events internally: model.call.started, model.call.completed, model.call.error, model.usage, and model.failover. Those events carry the details an observability bridge needs to attribute a call to a provider, model, token count, and cost.

But the public plugin diagnostic subscription intentionally filters trusted internal events. That is the right default for a broad API. It also meant model lifecycle events were invisible to plugins.

For AgentWeave, that showed up as a real attribution bug. Codex turns were traced, but the bridge could not see the model lifecycle event that said which model actually ran. The downstream dashboard bucketed calls as unknown instead of gpt-5.5.

That is another version of the same contract failure:

the runtime knew the model call happened
the runtime knew the trusted model identity
the plugin bridge was loaded and listening
the public diagnostic API still could not deliver the specific signal the bridge needed

The proposed fix does not weaken the broad diagnostic API. It adds a focused opt-in API for exactly the trusted model.* lifecycle events:

import { onModelDiagnosticEvent } from "openclaw/plugin-sdk/diagnostic-runtime";

const unsubscribe = onModelDiagnosticEvent((event) => {
  // model.call.completed, model.usage, model.failover, ...
});

That makes the boundary explicit. General plugins do not suddenly receive every trusted runtime event. Observability plugins can opt into the narrow model lifecycle they need for cost and attribution.

After testing this against a real Telegram-driven Codex turn, the bridge started receiving model.usage and the dashboard attributed the turn to gpt-5.5 instead of unknown.
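On the bridge side, the attribution logic can be sketched as a small lookup keyed by run. The event type names are the ones listed earlier; the payload fields (`runId`, `model`) and the `ModelAttribution` class are assumptions, since the real event payloads are not shown here.

```typescript
// Hypothetical event shape: the event types are named in the article,
// but the runId and model fields are assumptions.
interface ModelDiagEvent {
  type:
    | "model.call.started"
    | "model.call.completed"
    | "model.call.error"
    | "model.usage"
    | "model.failover";
  runId: string;
  model?: string;
}

class ModelAttribution {
  private modelByRun = new Map<string, string>();

  // Feed this from the opt-in model lifecycle subscription.
  handle(evt: ModelDiagEvent): void {
    // The latest event carrying a model wins, so a failover replaces the
    // originally requested model with the one that actually ran.
    if (evt.model) this.modelByRun.set(evt.runId, evt.model);
  }

  // Without any model event the bridge can only report "unknown".
  modelFor(runId: string): string {
    return this.modelByRun.get(runId) ?? "unknown";
  }
}
```

The callback passed to `onModelDiagnosticEvent` would be the natural place to call `handle`, after mapping the real payload onto this shape. Before the subscription exists, every lookup falls through to "unknown", which is the bucket the dashboard was showing.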

This one is still open upstream at openclaw/openclaw#80497.

It is a useful reminder that causality is not only about session graphs. Model identity, usage, failover, and cost also need a path across the runtime boundary.

#What AgentWeave Should Do

This changed how I think about AgentWeave's role.

AgentWeave should not pretend to be the source of truth for causality. It should be where causality becomes visible once the runtime emits it.

That means AgentWeave should be excellent at:

accepting runtime lifecycle events
creating root spans for turns and scheduled runs
linking child agents to parent sessions
recording model routing metadata
preserving task labels and source labels
subscribing to explicit model lifecycle events
showing cost and latency inside the causal chain
making broken chains obvious

But it should not silently guess that two spans are related because they happened near each other in time. It should not hide missing lifecycle events behind a polished graph. It should not make a disconnected chain look complete.

The best observability tools make uncertainty visible.

If a cron run has no root event, show that. If a child session has no parent, show that. If a provider path skipped the proxy, make the absence obvious.

That is more useful than a graph that looks complete.
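A sketch of what "show that" could mean in code: scan the stored spans and report gaps instead of inferring links from timing. The `ChainSpan` record and the `findChainGaps` function are hypothetical; a real trace store will carry more fields.

```typescript
// Hypothetical span record; a real trace store will carry more fields.
interface ChainSpan {
  id: string;
  parentId: string | null;
  isLifecycleRoot: boolean; // true only for spans opened by a runtime lifecycle event
}

interface ChainGap {
  spanId: string;
  problem: "missing-parent" | "no-root-event";
}

// Report gaps explicitly. Nothing here guesses a parent from timestamps.
function findChainGaps(spans: ChainSpan[]): ChainGap[] {
  const knownIds = new Set(spans.map((s) => s.id));
  const gaps: ChainGap[] = [];
  for (const s of spans) {
    if (s.parentId !== null && !knownIds.has(s.parentId)) {
      // The span names a parent that never reached the store.
      gaps.push({ spanId: s.id, problem: "missing-parent" });
    } else if (s.parentId === null && !s.isLifecycleRoot) {
      // The span sits at the top of a chain but no lifecycle event opened it.
      gaps.push({ spanId: s.id, problem: "no-root-event" });
    }
  }
  return gaps;
}
```

The output is a list of things the graph does not know, which a UI can render as broken edges or warnings next to the spans involved.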

#The Takeaway

I started this work thinking the hard part was agent observability.

Now I think the hard part is the contract between runtime and observability.

Every agent host eventually needs to answer:

what started this work?
what session did it belong to?
which parent caused it?
what child work did it spawn?
which model path did it use?
which model actually ran?
when did it finish?
did the observability bridge load correctly after the last upgrade?

Those answers cannot live only in logs. They cannot be reconstructed reliably from LLM calls. They need to be emitted by the runtime as part of execution.

Mux made model routing explainable.

AgentWeave made agent work traceable.

The OpenClaw work made the next layer clear:

Durable causality lives in the contract between the system doing the work and the system observing it.