Building a Financial Agent with OpenClaw

In the previous article, we built a FinChat-style financial research agent using LangGraph and LangSmith. That system established a clean baseline: structured data, explicit workflows, deterministic operations, and observable traces. It deliberately avoided embeddings and retrieval in order to keep reasoning and execution transparent.

That baseline is useful - but it is also fragile.

This article examines where the initial design breaks down as the system grows, and how introducing OpenClaw changes execution from “some code ran” into a formal, auditable process. The goal is not to add new capabilities, but to make correctness and failure explicit properties of the system.

Recap

The first version of the agent had several strong properties:

  • Natural language queries were mapped to a structured QueryPlan

  • All financial logic was deterministic and testable

  • Agent behavior was modeled as an explicit graph

  • Traces and lightweight evaluation were available via LangSmith

This was enough to answer a wide class of quantitative questions correctly most of the time. However, it relied on an implicit assumption: that if a tool function ran without raising an exception, the result was acceptable. That assumption does not hold once the system is exposed to real usage.
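As a reminder of the shape involved, the QueryPlan can be pictured as a small structured model. The sketch below is an assumption, reconstructed from how the plan is used later in this article (plan.tickers, quarters, metrics); the previous article's exact fields may differ.

from dataclasses import dataclass, field

@dataclass
class QueryPlan:
    # Hypothetical shape, inferred from how the plan is used in this article.
    operation: str                              # e.g. "trend", "rank", "compare"
    tickers: list[str] = field(default_factory=list)
    quarters: list[str] = field(default_factory=list)
    metrics: list[str] = field(default_factory=list)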

Reality Check

As soon as the agent is extended - even modestly - several problems emerge.

First, execution is in-process and informal. Tool calls are ordinary Python function invocations. There is no persistent record of what was attempted, what constraints applied, or whether retries occurred. If a number is questioned later, the only evidence is a log line or a trace snapshot.

Second, tool correctness is assumed rather than enforced. Inputs are schema-validated, but deeper invariants are implicit:

  • Are the metrics comparable?

  • Are quarters aligned across tickers?

  • Are derived values consistent with raw fields?

When these assumptions fail, the system either produces incorrect output or silently drops rows.
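To see what enforcing the second invariant would even look like, here is a minimal sketch of an explicit quarter-alignment check. The helper is hypothetical; nothing equivalent existed in the baseline, which is exactly the problem:

import pandas as pd

def check_quarter_alignment(df: pd.DataFrame, tickers: list[str]) -> None:
    # Each ticker must cover the same set of quarters; otherwise a
    # comparison silently mixes misaligned periods.
    if not tickers:
        return
    quarter_sets = {t: set(df.loc[df["ticker"] == t, "quarter"]) for t in tickers}
    common = set.intersection(*quarter_sets.values())
    for ticker, quarters in quarter_sets.items():
        extra = quarters - common
        if extra:
            raise ValueError(f"{ticker} has quarters not shared by all tickers: {sorted(extra)}")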

Third, failure modes are qualitative. We can see that something went wrong, but we cannot easily answer:

  • How often does it happen?

  • Which tool fails most?

  • Under what conditions?

Finally, adding unstructured context (for example, earnings commentary) risks contaminating execution. Without strong boundaries, narrative text can subtly influence numeric behavior.

These are not theoretical concerns. They are exactly the issues that cause early LLM-based financial systems to stall before production.

Shifting the architecture towards execution

Let us introduce a single conceptual change: execution is externalized and formalized.

Instead of LangGraph nodes calling Python functions directly, they issue execution requests to OpenClaw. OpenClaw becomes the authority on how actions are carried out, not LangGraph.

This has several consequences:

  • Execution is no longer an implementation detail

  • Every action becomes an explicit, inspectable event

  • The agent’s reasoning and the system’s actions are clearly separated

LangGraph continues to decide what should happen.
OpenClaw decides what actually happened.

Introducing OpenClaw

At this point, it is important to be precise about what OpenClaw is - and just as importantly, what it is not.

OpenClaw is NOT a reasoning engine, and it is not a smarter LLM. It does not think, plan, or decide in any meaningful sense. Under the hood, it is best described as an agent runtime with an execution gateway: a long-running process that receives inputs, routes them to agents, executes actions, and persists state across time.

This distinction matters because the value OpenClaw adds in a system like ours has nothing to do with “intelligence.” It has to do with making execution explicit, durable, and inspectable.

OpenClaw inputs can come from many sources: human messages, scheduled timers, cron jobs, webhooks from external systems, internal lifecycle hooks, or even other agents. These inputs are placed into a queue and processed sequentially. There is no background reasoning loop and no autonomous decision-making in the abstract sense - only reactions to events that have been configured ahead of time.

This is why OpenClaw systems often appear autonomous. When time itself becomes an input (via timers or cron-like schedules), agents can act even when no human is actively interacting with them. But nothing is spontaneous: time fires an event, the event triggers an agent turn, and the agent executes according to its instructions.

For a financial agent, this model is extremely powerful. It means that execution is no longer tied to a single request–response lifecycle. Instead, every action - fetching data, computing metrics, persisting results, generating reports - can be treated as a discrete, replayable event.

A central concept in OpenClaw is the gateway. The gateway does not reason. It routes. It accepts inputs, determines which agent or skill should receive them, and manages execution order and isolation.

This separation mirrors the architectural boundary we introduced earlier:

  • LangGraph decides what should happen next (planning, routing, recovery).

  • OpenClaw decides how actions are executed and records what actually occurred.

In practical terms, this means tool calls are no longer “just function calls.” They are executions mediated by a runtime that can log inputs, enforce constraints, retry safely, and persist outcomes.

OpenClaw agents are defined declaratively. Each agent has:

  • a role (often described in a soul.md-style instruction file),

  • a bounded set of tools or “skills” it is allowed to use,

  • and a working directory or memory space that persists across executions.

This matters for financial systems because capability restriction becomes enforceable, not advisory. An agent that is only allowed to compute metrics cannot suddenly browse the web or modify files. An agent that ranks tickers cannot trigger trades or external notifications unless explicitly permitted.

Previously, these boundaries were only implied by code structure. With OpenClaw, they become runtime-enforced properties.

Execution as a first-class artifact

One of the most consequential differences introduced by OpenClaw is that execution leaves behind durable artifacts. Every action can be recorded with:

  • the triggering input,

  • the tool invoked,

  • the arguments supplied,

  • the result returned,

  • validation outcomes,

  • and timing metadata.

These artifacts are not logs meant only for debugging. They are replayable execution records. If a number is questioned, the system can re-run the same execution against the same data. If a regression is suspected, historical executions can be compared across versions.

This is fundamentally different from tracing reasoning alone. LangSmith tells you how the agent reasoned. OpenClaw tells you what the system actually did.
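One way to picture such a record is as a small structured object mirroring the fields listed above. This is an illustrative model, not OpenClaw's actual storage format:

from dataclasses import dataclass
from datetime import datetime
from typing import Any

@dataclass
class ExecutionRecord:
    trigger: str               # the triggering input (message, timer, webhook)
    tool: str                  # the tool invoked
    args: dict[str, Any]       # the arguments supplied
    result: Any                # the result returned (None on failure)
    validation_ok: bool        # outcome of pre/postcondition checks
    started_at: datetime       # timing metadata
    duration_ms: float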

Why does this matter for correctness and failure? Once execution is externalized and recorded, failure modes change character. Instead of discovering problems through user reports or ad-hoc inspection, failures become measurable signals:

  • how often preconditions are violated,

  • which tools fail most frequently,

  • how often retries are required,

  • where execution latency concentrates.

This directly supports operating under real financial constraints, where correctness is not binary and trust is built over time through consistency and explainability.
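With records like these accumulated, those signals reduce to simple aggregations. A sketch, assuming each record is stored as a dictionary with tool, ok, and retries fields:

from collections import Counter

def failure_summary(records: list[dict]) -> dict:
    # Which tools fail most often, and how often retries were needed.
    failures = Counter(r["tool"] for r in records if not r["ok"])
    retried = sum(1 for r in records if r.get("retries", 0) > 0)
    return {
        "failures_by_tool": dict(failures),
        "retry_rate": retried / len(records) if records else 0.0,
    }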

A note on power and risk

OpenClaw’s power comes from access. It can run commands, read and write files, and interact with external systems. That same power introduces real risk. Security analyses have shown that a significant portion of available skills contain vulnerabilities, and OpenClaw’s own documentation emphasizes that no configuration is perfectly safe.

This is not a reason to avoid OpenClaw in analytical systems. It is a reason to use it deliberately:

  • restrict enabled skills,

  • isolate execution environments,

  • enforce strict preconditions,

  • and monitor execution artifacts continuously.

This risk is also less acute in our system, because OpenClaw is not introduced as a general-purpose automation layer. Instead, it is integrated narrowly, as a way to make execution explicit, bounded, and auditable - which is exactly what the baseline financial agent lacked.

System overview: updated roles

With OpenClaw added, the system becomes more clearly layered. The components are the same, but their responsibilities are no longer overlapping.

LangGraph remains the control plane. It interprets the query plan, decides which operation should run, and defines the agent’s flow and recovery logic. What changes is that LangGraph no longer performs execution directly. It decides what should happen, not how it happens.

OpenClaw becomes the execution layer. All actions - data access, metric computation, validation, retries - are carried out through OpenClaw. Execution is now explicit, constrained by preconditions and postconditions, and recorded as replayable artifacts. This is where correctness is enforced.

LangSmith remains the observability layer for reasoning. It traces how the LLM interprets language, produces plans, and routes decisions. LangSmith explains why the agent decided to act, while OpenClaw records what actually happened when it did.

Together, the system separates decision-making, execution, and observability. That separation is what allows the agent to grow in complexity without losing correctness or debuggability.

Financial skills and execution boundaries

Introducing OpenClaw forces a concrete architectural question that the baseline system did not have to answer explicitly: what is a skill, and what is not? The answer matters, because skills are the unit at which execution becomes governed, auditable, and replayable.

At a high level, a skill in OpenClaw is a bounded capability that performs real work. It accepts structured inputs, executes under declared constraints, and produces observable outputs. Skills are not about reasoning or decision-making; they are about doing. They are where data is read, computations are performed, and side effects - if any - are applied. Because of that, skills are the natural place to attach preconditions, postconditions, retries, and execution records.

This immediately implies what should not be a skill. Anything that involves interpretation, intent resolution, or control flow does not belong there. Planning, routing, and recovery logic remain in LangGraph, where decisions are explicit and inspectable. The LLM continues to translate language into intent and to explain results, but it does not execute. Skills sit below all of that, operating only on structured inputs and producing structured outputs.

From this perspective, the correct skill boundary in a financial system is the execution surface: anything that reads financial data or computes values that downstream logic will treat as factual truth. That boundary is where correctness must be enforced, not inferred.

In our case, that surface falls naturally into two conceptual categories. The first is data access. Loading quarterly fundamentals, normalizing tickers and dates, and selecting subsets of the canonical dataset are all actions that determine what data the system is allowed to reason over. When this logic lives in orchestration code, it is implicit and easy to bypass. When it becomes a skill, access is bounded by contract: only specific columns can be read, only limited time windows can be queried, and the resulting data slice is explicitly recorded.

The second category is deterministic financial computation. Resolving quarters, extracting metrics, computing trends, changes, and rankings are all pure operations, but they are also where subtle errors tend to hide. Turning these functions into skills makes their assumptions explicit. Inputs must satisfy declared constraints, outputs must satisfy postconditions, and failures become structured outcomes rather than exceptions bubbling up through the graph.

Conceptually, this means that the earlier “tool layer” stops being a convenience abstraction and becomes an execution interface. Whether this is exposed as a single dispatcher skill that executes a full query plan, or as multiple operation-specific skills, is an implementation choice. Early on, a dispatcher keeps the orchestration graph simple. As the system grows, finer-grained skills can be introduced without changing the control flow above them.
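A dispatcher skill in this sense is nothing more than a thin, deterministic mapping from the plan's operation to a handler. A sketch, with hypothetical handler names standing in for the baseline's operations and the plan represented as a plain dict for brevity:

# Stubs standing in for the deterministic operations from the baseline.
def compute_trend(plan: dict) -> dict: ...
def compute_ranking(plan: dict) -> dict: ...
def compute_comparison(plan: dict) -> dict: ...

def dispatch_plan(plan: dict) -> dict:
    # One governed entry point that executes a full query plan.
    handlers = {
        "trend": compute_trend,
        "rank": compute_ranking,
        "compare": compute_comparison,
    }
    operation = plan["operation"]
    if operation not in handlers:
        raise ValueError(f"Unknown operation: {operation}")
    return handlers[operation](plan)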

Once execution is expressed as skills, several things follow naturally. Each skill invocation is an auditable event, with inputs, outputs, and validation results that can be replayed. Failures become measurable rather than anecdotal. Testing improves because skills are isolated and deterministic: they can be unit-tested with small fixtures, contract-tested for schema stability, and integration-tested through the runtime with minimal overhead. End-to-end tests still exist, but they are no longer the only place where correctness is asserted.

FactsStore as an OpenClaw skill

In the previous article, FactsStore.load() loaded the entire Parquet table and returned a DataFrame copy:

import pandas as pd

# Required columns as assumed from the previous article's schema.
REQUIRED_COLUMNS = {"ticker", "quarter", "report_date"}

class FactsStore:
    def __init__(self, facts_path: str):
        self.facts_path = facts_path
        self._df: pd.DataFrame | None = None

    def load(self) -> pd.DataFrame:
        if self._df is None:
            df = pd.read_parquet(self.facts_path)
            missing = REQUIRED_COLUMNS.difference(df.columns)
            if missing:
                raise ValueError(f"Missing required columns: {sorted(missing)}")
            df["report_date"] = pd.to_datetime(df["report_date"])
            df["ticker"] = df["ticker"].astype(str).str.upper()
            df = df.sort_values(["ticker", "report_date"]).reset_index(drop=True)
            self._df = df
        return self._df.copy()

In a skill-based design, you generally avoid returning the whole dataset. That approach is convenient, but it rests on an important assumption: anything inside the process can freely access all facts, at any time, in any shape. Once you introduce OpenClaw, you want a tighter boundary: instead of handing out the full table, you expose a bounded query surface.

The equivalent of FactsStore.load() in a skill-based system is not “return the whole DataFrame.” It is a query skill that exposes a controlled slice of the dataset. A good shape for that is a skill like:

  • Inputs: tickers, quarters, columns

  • Output: a small, structured payload (rows as JSON records, or Arrow for efficiency)

  • Guarantees:

    • only allowlisted columns can be requested

    • the size of the request is bounded

    • required identifiers are always present (ticker, quarter, report_date)

    • results are normalized (ticker casing, report_date parsing)

    • the call either returns a complete slice or fails explicitly

This turns “data access” into a contract, not a convention. 
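For illustration, a bounded request and its response might look like the following. Field names follow the guarantees above, and all values are dummies:

request = {
    "tickers": ["AAPL", "MSFT"],
    "quarters": ["2024Q1", "2024Q2"],
    "columns": ["revenue", "net_income"],
}

response = [
    # One record per (ticker, quarter); identifiers are always present.
    {"ticker": "AAPL", "quarter": "2024Q1", "report_date": "2024-02-01",
     "revenue": 100.0, "net_income": 25.0},
    # ...
]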

Skill lifecycle and execution boundaries

A skill invocation in OpenClaw is best understood as a small, repeatable transaction with a clear contract. When financial.facts.query runs, it follows a predictable lifecycle: it receives a structured request (tickers, quarters, metrics), validates that the request is permitted, loads and normalizes the canonical dataset, selects a bounded slice, validates the result, and returns a compact payload. OpenClaw records the entire execution - inputs, outputs, timing, and validation outcomes - so it can be audited or replayed later.

This lifecycle is why skills form a clean architectural boundary. They behave like minimal domain services with strict APIs, even when they run locally. Anything outside this boundary does not execute logic; it only decides when execution should occur.

Unlike "conventional" programming, preconditions must be introduced. These include things like maximum tickers, maximum quarters, and column allowlists. They should exist because once execution is exposed as a capability, the system must assume it will be called in unexpected ways—especially when driven by LLM-generated plans.

Bounding requests keeps execution predictable. Without limits, a single query could request hundreds of tickers or decades of data, leading to large I/O, memory pressure, and unstable latency. In production systems, performance degradation quickly becomes a correctness problem.

Allowlisting columns serves a different purpose: it enforces what is permissible, not merely what is technically possible. It prevents accidental or adversarial access to unintended fields and ensures that the execution surface remains narrow and auditable.

Skills sit on the critical path of agent execution, so they are designed to be deterministic and bounded in cost. They do not need to be “real-time,” but their runtime characteristics should be predictable and composable. This is why policy checks should exist: they indirectly enforce a latency budget by limiting the amount of work a skill can perform.

A useful mental model is that LangGraph is allowed to be flexible and exploratory, while skills should behave more like database queries or microservice endpoints - measurable, constrained, and reliable.

import pandas as pd

class ExecutionError(Exception): ...
class NoDataError(ExecutionError): ...

# Columns the skill may return; in practice this comes from configuration.
ALLOWED_COLUMNS = {"revenue", "net_income", "eps"}  # illustrative

def load_facts_parquet() -> pd.DataFrame:
    # Stand-in for the canonical dataset load (FactsStore in the baseline);
    # the path is illustrative.
    return pd.read_parquet("data/facts.parquet")

def facts_query_skill(tickers, quarters, columns):
    # Preconditions: enforce policy and bound work
    if len(tickers) > 50:
        raise ExecutionError("Too many tickers requested")
    if len(quarters) > 12:
        raise ExecutionError("Too many quarters requested")
    if not set(columns).issubset(ALLOWED_COLUMNS):
        raise ExecutionError("Requested columns are not allowlisted")

    # Load and normalize data
    df = load_facts_parquet()
    df["report_date"] = pd.to_datetime(df["report_date"])
    df["ticker"] = df["ticker"].astype(str).str.upper()

    # Select bounded slice
    out = df.loc[
        df["ticker"].isin([t.upper() for t in tickers]) &
        df["quarter"].isin(quarters),
        ["ticker", "quarter", "report_date"] + list(columns)
    ]

    # Postconditions: explicit success criteria
    if out.empty:
        raise NoDataError("No matching rows found")
    if out[["ticker", "quarter", "report_date"]].isna().any().any():
        raise ExecutionError("Missing required identifiers")

    return out.to_dict(orient="records")

This is the conceptual conversion of FactsStore into a skill: the same normalization logic, but exposed through a bounded interface with explicit guarantees. A skill represents an action with intent, not a best-effort query. If inputs are valid and execution succeeds, returning no rows usually signals a broken assumption - wrong plan, missing data, or ingestion failure. Letting empty results pass hides these issues and spreads ambiguity.

Failing fast keeps correctness localized. The agent can retry or replan, and traces clearly show where execution broke. The UI may still say “no data available,” but the skill itself must remain strict.
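At the graph level, that strictness is what makes recovery mechanical. Here is a sketch of an execution node that reuses the skill and exceptions defined above and maps failures to explicit statuses the graph can route on; the node and state shapes are hypothetical:

def execute_node(state: dict) -> dict:
    # Strict skill below, explicit failure handling here.
    plan = state["plan"]
    try:
        rows = facts_query_skill(plan.tickers, plan.quarters, plan.metrics)
        return {**state, "rows": rows, "status": "ok"}
    except NoDataError as e:
        # The graph can route this outcome to a replan or clarification branch.
        return {**state, "status": "no_data", "error": str(e)}
    except ExecutionError as e:
        return {**state, "status": "execution_failed", "error": str(e)}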

From function to skill

Up to this point, facts_query_skill has just been a Python function. To become useful in an OpenClaw-based system, it has to cross a boundary: it must be exposed as a governed execution capability that an agent is allowed to invoke, rather than a helper that happens to be imported.

This transition happens in three conceptual steps: packaging the function as a skill, granting an agent permission to use it, and invoking it through the OpenClaw runtime.

Packaging the skill

The first step is to treat facts_query_skill as a stable entry point with a well-defined contract. Instead of living next to orchestration code, it moves into a skills module and becomes responsible for enforcing its own invariants: which columns are allowed, how much data can be requested, and what constitutes a valid result.

# skills/financial_facts.py
from dataclasses import dataclass
from pathlib import Path

class ExecutionError(Exception): ...

@dataclass
class FactsSkill:
    facts_path: Path

    def query(self, tickers: list[str], quarters: list[str], columns: list[str]) -> list[dict]:
        # This is our facts_query_skill: the same preconditions,
        # normalization, bounded slice, and postconditions as above.
        ...

At this point, nothing about the logic changes. It still loads the same Parquet file, normalizes dates and tickers in the same way, and returns the same records. What changes is where responsibility lives. The function now owns its correctness guarantees, rather than relying on callers to behave.

Registering the intent

OpenClaw needs a mapping from a skill name like financial.facts.query to a callable. The exact mechanism depends on how you run OpenClaw, but conceptually you create a registry.

# skills/registry.py
from pathlib import Path

from skills.financial_facts import FactsSkill

def build_skills(config: dict):
    facts = FactsSkill(Path(config["FACTS_PATH"]))
    return {
        "financial.facts.query": facts.query,
        # later:
        # "financial.metrics.rank": rank_skill,
        # "financial.metrics.trend": trend_skill,
    }

Once a function is packaged as a skill, it is not automatically usable. OpenClaw requires that every agent explicitly declare which skills it is allowed to invoke. This declaration lives in soul.md.

For the financial agent, soul.md does not describe how to load data or how to compute metrics. Instead, it states intent and limits: this agent may query canonical quarterly facts, using a specific skill, under bounded conditions.

# FinChat Analyst

You are a financial analysis agent. You must never invent numeric values.
All financial facts must come from skills.

## Allowed Skills
- financial.facts.query: Query canonical quarterly facts by (tickers, quarters, columns).

## Rules
- Only request allowlisted columns.
- Keep requests bounded (<=50 tickers, <=12 quarters).
- If no rows are found, ask for a different quarter/timeframe or explain that data is missing.

This step is subtle but important. By listing financial.facts.query in soul.md, you turn an implementation detail into a governed capability. If the agent attempts to access data in any other way, the runtime will reject the action. The boundary is enforced by configuration, not convention.

Runtime invocation

With the skill defined and the agent permitted to use it, the OpenClaw runtime loads both at startup. Internally, it maintains a registry that maps symbolic names like financial.facts.query to concrete callables.

From the perspective of the LangGraph-based agent, nothing about planning or routing changes. What changes is execution. Instead of calling FactsStore.load() or a local function, the execution node issues a request to OpenClaw: “invoke financial.facts.query with these arguments.”

# execution_client.py
class OpenClawClient:
    def call(self, skill_name: str, args: dict) -> dict:
        # HTTP / IPC call to OpenClaw gateway
        ...

# And your execute_node becomes:

openclaw = OpenClawClient()  # assumed to be configured to reach the gateway
rows = openclaw.call(
    "financial.facts.query",
    {"tickers": plan.tickers, "quarters": quarters, "columns": metrics},
)

OpenClaw executes the skill, applies preconditions and postconditions, records the execution, and returns either a structured result or a structured failure. That record becomes part of the system’s execution history and can be replayed or audited later.
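From the caller's perspective, the returned value is therefore best thought of as a structured envelope rather than a bare result. A hedged sketch of what such an envelope could contain (illustrative only, not OpenClaw's actual wire format):

result = {
    "ok": True,
    "skill": "financial.facts.query",
    "output": [...],              # the row records when ok is True
    "error": None,                # structured failure details otherwise
    "execution_id": "exec-0001",  # links back to the recorded artifact
    "duration_ms": 42,
}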

Conclusion

By moving data access and financial computations into OpenClaw skills, execution becomes governed, auditable, and replayable. Correctness is no longer an assumption embedded in orchestration code; it is enforced at the execution boundary.

LangGraph continues to control flow and decision-making, LangSmith continues to observe reasoning, and OpenClaw formalizes action. Together, they separate intent, execution, and observability in a way that allows the system to grow without losing trust.

In the next article, we will introduce MCP to standardize tool interfaces, permissions, and coordination across agents, and show how this architecture scales beyond a single hardened financial agent.
