Building the Financial Bot with OpenClaw — POC

In the previous article, we moved from the idea of a Damodaran-style financial bot toward a practical multi-agent architecture: a system where different agents are responsible for data retrieval, assumptions, valuation, supervision, and report writing. That architecture gave us the direction of travel. The next question is more concrete: how do we turn that architecture into a project that can be tested, evolved, and eventually connected to real financial data sources?

The goal is to build a small, deterministic, testable version of the system that proves the shape of the application before we introduce live data, complex assumptions, model drift, flaky APIs, or LLM variability.

A financial agent system can become difficult to debug very quickly. If an answer is wrong, the source of the problem may be the financial data, the normalization logic, the assumption layer, the valuation formula, the scenario model, the report writer, the agent instruction, or the orchestration path between them. If all of those are built at once, the system becomes impressive but opaque. A mocked POC gives us the opposite: a narrow, controlled, deterministic pipeline where every layer can be tested independently.

We start with one company fixture, CROX on NASDAQ. The financial statements are mocked. The assumptions are mocked. The DCF and scenario outputs are deterministic. The report is deterministic. That may sound limited, but it is exactly the point. We are not testing whether we can fetch every company in the market yet. We are testing:

  • whether the architecture works,
  • whether each responsibility is isolated,
  • whether the agents call the right tools,
  • whether the MCP server exposes the correct interface,
  • and whether the valuation core can be trusted by tests before it is expanded.

The Project Shape

The POC is structured around three boundaries: agents, the MCP server, and the valuation core package.

At the top level, the project looks like this:

├── agents
│   ├── assumptions-agent
│   ├── data-agent
│   ├── supervisor-agent
│   ├── valuation-agent
│   └── writer-agent
├── mcp_server
│   ├── dbot_mcp
│   └── tests
├── packages
│   └── valuation_core

This separation is intentional. The agents describe behavior. The MCP server exposes capabilities. The valuation core implements deterministic business logic. Those are different concerns and they should not be mixed.

Agents

The agents should not contain financial formulas. They should not invent assumptions. They should not normalize statement values themselves. They are instruction-driven workers that know when to call a skill and what shape of input and output to preserve.

MCP server

The MCP server should not become a financial modeling library. It should expose tools over a protocol boundary and convert requests into calls to the underlying package.

Valuation core

The valuation core should not know anything about OpenClaw, agents, prompts, or MCP sessions. It should be a normal Python package that can be tested with normal Python tests.

That gives us a layered structure:

Agent instructions
  ↓
Agent skill contract
  ↓
MCP tool call
  ↓
MCP server tool wrapper
  ↓
valuation_core package
  ↓
Data output

This shape is one of the most important architectural decisions in the POC:

  • We can test the valuation library without running agents.
  • We can test the MCP server without invoking a full multi-agent workflow.
  • We can test agent contracts by checking their expected input and output shapes.
  • Finally, we can run an integration test over the stdio MCP server to verify that the protocol boundary works.

Why Start With Mocked Data?

In a finance bot, mocked data is not a shortcut. It is an engineering control.

If we immediately connect to live APIs, every test can fail for reasons unrelated to our architecture. An external provider may change its response format. A network call may time out. A ticker may have missing fields. A provider may restate historical values. Currency handling may differ between exchanges. Those are all real problems, but they are not the first problems we should solve.

The first problem is whether the pipeline itself is correct. For example:

  • The data agent should retrieve structured statements. 
  • The assumptions agent should add assumptions without changing the statements. 
  • The valuation agent should add DCF and scenario outputs without rewriting the input. 
  • The writer agent should render a report without recalculating valuation.
  • The supervisor should pass each specialist output forward without modifying values.

Those invariants can be tested with mocked data. In fact, they are easier to test with mocked data because the expected values do not move.

The mocked CROX fixture gives us a stable baseline:

def parse_statements(ticker: str, exchange: Exchange) -> FinancialStatements:
    """POC: returns hardcoded CROX statements regardless of input."""
    if ticker.upper() != "CROX":
        raise DataNotFoundError(f"No fixture available for ticker: {ticker}")

    return FinancialStatements(
        ticker="CROX",
        exchange=exchange,
        period="FY2024 (TTM)",
        currency=Currency.USD,
        income_statement=IncomeStatement(
            revenue=3_900_000_000,
            ebit=1_050_000_000,
            ebit_margin=0.27,
            effective_tax_rate=0.22,
            interest_expense=120_000_000,
            sbc=80_000_000,
            da=120_000_000,
        ),
        balance_sheet=BalanceSheet(
            total_debt=2_000_000_000,
            cash=200_000_000,
            net_debt=1_800_000_000,
            book_equity=1_600_000_000,
            invested_capital=3_400_000_000,
        ),
        cash_flow=CashFlowStatement(
            operating_cash_flow=900_000_000,
            capex=150_000_000,
            change_in_wc=50_000_000,
            fcff=770_000_000,
        ),
    )

The code is deliberately simple. It returns a fixed DTO (Data Transfer Object) for one supported ticker and raises a domain error for unsupported tickers. This is not the final data layer. It is a test fixture wrapped in the same interface that a real parser or API adapter will later use.

That is the key benefit: when live data arrives, the interface can stay stable. The implementation behind parse_statements can change from hardcoded fixture to API adapter, CSV loader, SEC parser, or database query, while the rest of the system continues to depend on the same contract.
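One way to make that contract explicit, shown here as a hypothetical sketch rather than code from the POC, is a typing.Protocol that every future implementation must satisfy (import paths are assumed from the package layout shown later):

from typing import Protocol

from valuation_core.common import Exchange, FinancialStatements  # paths assumed
from valuation_core.statements import parse_statements


class StatementSource(Protocol):
    """Anything that can produce FinancialStatements for a ticker."""

    def __call__(self, ticker: str, exchange: Exchange) -> FinancialStatements: ...


# The POC fixture already satisfies the contract:
source: StatementSource = parse_statements


def api_statement_source(ticker: str, exchange: Exchange) -> FinancialStatements:
    # A future live adapter only has to match the same signature.
    # fetch_from_provider and map_to_dto are hypothetical helpers.
    raw = fetch_from_provider(ticker, exchange)
    return map_to_dto(raw)

The rest of the system never needs to know which source is behind the contract.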

Test-Driven Development

Test-driven development, or TDD, is often summarized as “write the test before the implementation.” That summary is technically true, but incomplete. In this project, TDD is more useful as a design method than as a ritual.

The test forces us to state what the system should do before we implement how it does it. That is especially valuable in multi-agent systems because natural language instructions can hide ambiguity. A test removes some of that ambiguity by turning the desired behavior into executable expectations.

For example, before we care about how financial statements are parsed, we can state what the parser must return for the CROX fixture:

def test_parse_crox_returns_fixture():
    ticker = "CROX"
    exchange = Exchange.NASDAQ

    fs = parse_statements(ticker, exchange)

    assert fs.ticker == ticker
    assert fs.exchange == exchange
    assert fs.income_statement.revenue == 3_900_000_000.0
    assert fs.balance_sheet.net_debt == 1_800_000_000.0
    assert fs.cash_flow.fcff == 770_000_000.0

This test does several things:

  • It confirms that the parser supports the expected fixture.
  • It confirms that exchange is preserved.
  • It confirms that the financial statement DTO exposes income statement, balance sheet, and cash flow fields in the expected structure.
  • It also gives future contributors a warning: if they change one of these values, they are changing the baseline fixture and should do so intentionally.

The same idea applies to assumptions:

def test_load_crox_assumptions():
    a = load_assumptions("CROX", Exchange.NASDAQ)

    assert a.wacc == 0.098
    assert a.terminal_growth == 0.025
    assert a.beta == 1.45
    assert a.weight_equity + a.weight_debt == pytest.approx(1.0)

This test is more than a value check. It encodes a financial invariant: the capital structure weights should add up to approximately 100%. A good test suite should include both exact expected outputs and domain sanity checks. Exact values are useful for deterministic fixtures. Domain checks are useful because they protect the model from impossible or inconsistent assumptions.
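As another sketch in the same spirit (import paths assumed), a check that terminal growth stays below the discount rate protects the Gordon growth terminal value from dividing by zero or going negative:

from valuation_core.assumptions import load_assumptions
from valuation_core.common import Exchange


def test_crox_assumptions_are_internally_consistent():
    a = load_assumptions("CROX", Exchange.NASDAQ)

    # Gordon growth terminal value requires g < WACC.
    assert a.terminal_growth < a.wacc
    # Rates should be fractions, not percentages.
    assert 0.0 < a.wacc < 0.25
    assert 0.0 <= a.terminal_growth < 0.05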

Why Tests Matter More in Agentic Systems

Traditional software can fail because of bugs. Agentic software can fail because of bugs, ambiguous instructions, tool misuse, malformed intermediate state, hidden assumptions, and output drift.

That means tests are not optional glue. They are the main mechanism that keeps the system understandable.

In this POC, tests serve several roles:

  • They define the contract of each core valuation function.
  • They protect DTO shapes from accidental changes.
  • They verify MCP tools expose the expected outputs.
  • They check the stdio MCP server can be launched and called through the protocol.
  • They prevent agents from becoming responsible for logic that belongs in the core package.

A valuation model is also a high-trust domain. If the system says that intrinsic value is $154 per share, the user needs a way to understand where that number came from. Tests do not make the model financially correct by themselves, but they make the software path reproducible. Reproducibility is the first requirement before deeper financial validation can happen.

The Core Package: Keeping Finance Logic Outside the Agents

The valuation_core package is where the deterministic valuation logic lives. It is structured as a normal Python package:

packages
└── valuation_core
    ├── pyproject.toml
    └── valuation_core
        ├── assumptions
        ├── common
        ├── statements
        └── valuation

The common module contains DTOs, enums, money helpers, period helpers, and domain errors. This layer is intentionally boring. That is a good thing. DTOs are the shared language of the system. If they are stable, every other layer becomes easier to reason about.
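A minimal sketch of that common layer, assuming only the names already used in this article (the real package will carry more members):

from enum import Enum


class Exchange(str, Enum):
    NASDAQ = "NASDAQ"
    NYSE = "NYSE"  # illustrative; the POC only exercises NASDAQ


class Currency(str, Enum):
    USD = "USD"


class DataNotFoundError(Exception):
    """Domain error: no statements or assumptions exist for a ticker."""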

A simplified example is the FinancialStatements DTO:

@dataclass
class FinancialStatements:
    ticker: str
    exchange: Exchange
    period: str
    currency: Currency
    income_statement: IncomeStatement
    balance_sheet: BalanceSheet
    cash_flow: CashFlowStatement

    def as_dict(self) -> dict:
        return asdict(self)

This DTO is important because agents and MCP tools pass JSON-like structures around, while the valuation package can work with typed Python objects. The as_dict method provides a clean exit point from the typed domain model back into serializable data.
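A small usage sketch of that exit point (imports assumed to match the package layout above):

import json

from valuation_core.common import Currency, Exchange
from valuation_core.statements import parse_statements

fs = parse_statements("CROX", Exchange.NASDAQ)

# Typed access inside the core package...
assert fs.income_statement.revenue == 3_900_000_000
assert fs.currency == Currency.USD

# ...and a serializable payload at the protocol boundary.
payload = json.dumps(fs.as_dict(), indent=2, default=str)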

The valuation module follows the same pattern. The POC implementation of DCF is fixed, but the return type is already the shape we expect from a real model:

def run_dcf(ticker: str, exchange: Exchange) -> DCFResult:
    """POC: returns hardcoded CROX DCF result."""
    if ticker.upper() != "CROX":
        raise DataNotFoundError(f"No fixture available for ticker: {ticker}")

    return DCFResult(
        ticker=ticker,
        exchange=exchange,
        pv_explicit_fcff=5_000_000_000.0,
        pv_terminal_value=5_400_000_000.0,
        enterprise_value=10_400_000_000.0,
        net_debt=1_800_000_000.0,
        equity_value=8_600_000_000.0,
        diluted_shares=56_000_000.0,
        intrinsic_value_per_share=154.0,
        wacc=0.098,
        terminal_growth=0.025,
        fcff_projection=[],
    )

In the actual project, the projection list contains a ten-year explicit forecast. The important detail is not whether the current POC computes that forecast dynamically. It does not. The important detail is that the result already looks like a real DCF result. That allows the MCP tool, valuation agent, writer agent, and report renderer to be built against the final interface before the final financial engine exists.

This is one of the most useful POC patterns: mock the internals, not the boundary.
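One way to keep a mocked boundary honest is to test the accounting identities every DCF result must satisfy, fixture or not. A sketch (import paths assumed; the tolerance accounts for the rounded per-share value in the fixture):

import pytest

from valuation_core.common import Exchange
from valuation_core.valuation import run_dcf


def test_crox_dcf_identities_hold():
    r = run_dcf("CROX", Exchange.NASDAQ)

    # Enterprise value must equal the sum of its discounted parts.
    assert r.enterprise_value == r.pv_explicit_fcff + r.pv_terminal_value
    # Equity bridge: enterprise value minus net debt.
    assert r.equity_value == r.enterprise_value - r.net_debt
    # Per-share value is rounded in the fixture, so allow ~1% drift.
    assert r.intrinsic_value_per_share == pytest.approx(
        r.equity_value / r.diluted_shares, rel=0.01
    )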

The MCP Server as a Tool Boundary

Before going deeper into this project’s MCP implementation, it is useful to connect it to the earlier OpenClaw MCP article. There we explained the role of MCP as the execution boundary between agent reasoning and deterministic backend tools. The same principle applies here: agents should decide which capability is needed, but the actual calculation, data retrieval, and report generation should happen inside tested tools exposed by an MCP server.

The MCP server sits between the agents and the valuation package. Its job is to expose a small set of tools:

  • financial_statements
  • assumptions
  • scenario_valuation
  • report_writer

Each tool is intentionally narrow. The financial_statements tool returns normalized statements. The assumptions tool returns valuation assumptions. The scenario_valuation tool returns DCF and scenario results. The report_writer tool returns markdown.

This prevents one giant tool from doing everything. It also gives each agent one obvious tool to call.

A typical tool wrapper is small:

from valuation_core.assumptions import load_assumptions
from dbot_mcp.tools.common import TICKER_EXCHANGE_INPUT_SCHEMA

NAME = "assumptions"
DESCRIPTION = "Return WACC, growth, and reinvestment assumptions for a ticker."
INPUT_SCHEMA = TICKER_EXCHANGE_INPUT_SCHEMA


def run(arguments: dict) -> dict:
    ticker = arguments["ticker"]
    exchange = arguments["exchange"]
    return load_assumptions(ticker, exchange).as_dict()

This wrapper does not calculate WACC. It does not validate finance theory. It does not format prose. It simply adapts MCP tool input to a core package function and returns a serializable dictionary.

The shared input schema is also important:

TICKER_EXCHANGE_INPUT_SCHEMA = {
    "type": "object",
    "properties": {
        "ticker": {"type": "string", "description": "Stock ticker (e.g. CROX)."},
        "exchange": {
            "type": "string",
            "description": "Stock exchange (e.g. NASDAQ).",
            "enum": EXCHANGES,
        },
    },
    "required": ["ticker", "exchange"],
    "additionalProperties": False,
}

By making ticker and exchange explicit, every tool receives enough context to resolve the fixture. By using an enum for exchange, the tool boundary rejects unsupported exchange names early. In a future production version, this same schema can be extended with period type, filing source, currency preference, restatement policy, or data provider options.
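A hypothetical sketch of such an extension, reusing the shared schema via dict merging (the new field names are illustrative, not part of the POC):

EXTENDED_INPUT_SCHEMA = {
    **TICKER_EXCHANGE_INPUT_SCHEMA,
    "properties": {
        **TICKER_EXCHANGE_INPUT_SCHEMA["properties"],
        # Illustrative future options; names are not from the POC.
        "period_type": {
            "type": "string",
            "enum": ["annual", "quarterly", "ttm"],
            "description": "Which reporting period to resolve.",
        },
        "currency_preference": {
            "type": "string",
            "description": "Preferred reporting currency (e.g. USD).",
        },
    },
}

Because required and additionalProperties are inherited from the shared schema, the extension stays backward compatible with existing callers.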

The server then registers the tools and exposes them over stdio:

_TOOLS = (
    financial_statements,
    assumptions,
    scenario_valuation,
    report_writer,
)
_TOOLS_BY_NAME = {t.NAME: t for t in _TOOLS}


@server.list_tools()
async def list_tools() -> list[types.Tool]:
    return [
        types.Tool(
            name=t.NAME,
            description=t.DESCRIPTION,
            inputSchema=t.INPUT_SCHEMA,
        )
        for t in _TOOLS
    ]


@server.call_tool()
async def call_tool(name: str, arguments: dict) -> list[types.TextContent]:
    tool = _TOOLS_BY_NAME.get(name)
    if tool is None:
        raise ValueError(f"Unknown tool: {name}")

    result = tool.run(arguments or {})
    text = json.dumps(result, indent=2, default=str)

    return [types.TextContent(type="text", text=text)]

This gives the system a clean protocol layer. An MCP-aware host, such as OpenClaw, can discover available tools, inspect their schemas, and call them through a standard interface. The finance logic remains inside valuation_core. The agent instructions remain inside agents. The MCP server is the bridge.

Testing the MCP Layer

The MCP tools have their own tests because they are not just pass-through functions. They are the public interface exposed to the agent runtime. If a tool changes its output shape, the agents may break even if the core package still passes its own tests.

For example, the scenario_valuation tool is tested like this:

def test_returns_crox_dcf_and_scenarios():
    out = run({"ticker": "CROX", "exchange": "NASDAQ"})

    assert out["dcf"]["intrinsic_value_per_share"] == 154.0
    assert out["scenarios"]["iv_range_high"] == 200.0
    assert {s["name"] for s in out["scenarios"]["scenarios"]} == {
        "Bull",
        "Base",
        "Bear",
        "Stress",
    }

This test confirms that the tool returns both DCF and scenario outputs. It also checks the scenario set. The valuation agent depends on that structure because it places this tool output under the valuation key in its pipeline response.
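For orientation, the pipeline response the valuation agent is expected to emit has roughly this shape, trimmed here to a few representative fields taken from the fixtures (the full objects carry more keys):

{
  "financial_statements": { "ticker": "CROX", "exchange": "NASDAQ" },
  "assumptions": { "wacc": 0.098, "terminal_growth": 0.025 },
  "valuation": {
    "dcf": { "intrinsic_value_per_share": 154.0 },
    "scenarios": { "iv_range_high": 200.0 }
  }
}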

The report writer tool has a different kind of test:

def test_returns_crox_report_markdown():
    out = run({"ticker": "CROX", "exchange": "NASDAQ"})

    assert out["ticker"] == "CROX"
    assert out["exchange"] == "NASDAQ"
    assert out["report_md"].startswith("# CROX (NASDAQ) Valuation Report")
    assert "WACC" in out["report_md"]

This test is intentionally not too strict about the full markdown body. For a report, we usually want to verify the essential structure rather than make the test brittle against every formatting change. The test confirms that the report is for the expected ticker and exchange, that it begins with the expected title, and that it includes a cost-of-capital section.

There is also an integration test over the stdio server:

def test_integration_pipeline_runs_over_stdio_server():
    async def _run() -> None:
        server = StdioServerParameters(
            command="python",
            args=["-m", "dbot_mcp.server"],
        )

        async with stdio_client(server) as (read_stream, write_stream):
            async with ClientSession(read_stream, write_stream) as session:
                await session.initialize()

                tools = await session.list_tools()
                assert {tool.name for tool in tools.tools} == {
                    "financial_statements",
                    "assumptions",
                    "scenario_valuation",
                    "report_writer",
                }

                result = await session.call_tool(
                    "report_writer",
                    {"ticker": "CROX", "exchange": "NASDAQ"},
                )

                payload = json.loads(result.content[0].text)
                assert payload["report_md"].startswith(
                    "# CROX (NASDAQ) Valuation Report"
                )

    asyncio.run(_run())

This test is valuable because it checks the real server path, not just individual Python functions. It launches the server, initializes a client session, lists the tools, calls a tool, parses the returned content, and verifies the report. That gives confidence that the MCP packaging and protocol wiring are correct.

Agents as Contracts, Not Calculation Engines

The agents are deliberately instruction-first. Each agent has an AGENTS.md file that defines its role, responsibilities, restrictions, input, and output. Agents that need tools also have a skill file under skills/SKILL.md.

The distinction between an agent and a skill is useful. The agent file describes the worker’s general behavior. The skill file describes a specific callable capability and how it maps pipeline input to MCP tool input.

For example, the Data Agent is responsible for retrieving deterministic company financial statement data. Its restrictions are just as important as its responsibilities:

# Data Agent

You provide deterministic company financial statement data.

Responsibilities:
- Retrieve income statement, balance sheet, and cash flow statement data.
- Return structured financial statement data.
- Preserve reported metric names, periods, and values.

Restrictions:
- Do not calculate valuation.
- Do not create assumptions.
- Do not infer missing values.
- Do not modify source values.
- Do not browse the web.
- Do not call reporting or valuation tools.

This instruction file prevents scope creep. Without it, the data agent might try to “help” by filling missing values, calculating margins, or adding commentary. That behavior may look useful in a demo, but it is dangerous in a financial system. The data agent should retrieve and preserve data. If normalization is needed, it belongs in the approved tool or the core package, not in free-form agent reasoning.

The corresponding skill makes the tool mapping explicit:

---
name: retrieve-financial-statements
description: Retrieve deterministic company financial statement data.
---

Use this skill when company financial statements are needed.

Call this MCP tool:
- financial_statements

Pipeline input:
{
    "ticker": "<ticker>",
    "exchange": "<exchange>"
}

MCP tool input:
{
    "ticker": "<ticker>",
    "exchange": "<exchange>"
}

Rules:
- Return only the pipeline output.
- Preserve `exchange` from the pipeline input when the MCP tool does not return it.
- Do not summarize.
- Do not normalize values unless the MCP tool already does it.
- Do not fill missing values.
- Do not call other DBOT tools.

This skill is a contract. It tells the agent which tool to call, what arguments to pass, and how to treat the response. In larger systems, this is the difference between an agent that behaves predictably and an agent that improvises.

The Writer Agent is a different kind of specialist. It does not retrieve data or calculate valuation. It receives completed valuation results and calls the report writer tool:

# Writer Agent

You produce deterministic valuation reports from provided valuation results.

Workflow:
1. Receive the Valuation Agent pipeline output.
2. Extract `ticker` and `exchange` from `valuation.dcf`.
3. Call the MCP tool `report_writer`.
4. Return `report_md` exactly as produced by the tool.

Rules:
- Do not calculate valuation.
- Do not change valuation values.
- Do not retrieve financial statements.
- Do not create assumptions.
- Do not add unsupported market commentary.
- Output markdown only.

This design choice may look strict, but it is necessary. If the writer agent is allowed to add unsupported commentary, the final report may contain claims that did not come from the valuation engine. By forcing the writer to return report_md exactly as produced by the tool, we keep the final narrative deterministic.

The Supervisor Agent

The supervisor is the coordinator. It should not do specialist work. It should not fetch financial statements itself. It should not calculate intrinsic value. It should not write the report. It should only move outputs from one specialist to the next.

The workflow is simple:

User request
  ↓
Supervisor Agent
  ↓
Data Agent
  ↓
Assumptions Agent
  ↓
Valuation Agent
  ↓
Writer Agent
  ↓
Final markdown report

The supervisor’s contract defines the pipeline:

Pipeline contracts:
- Data Agent input: `{ "ticker": "<ticker>", "exchange": "<exchange>" }`
- Data Agent output -> Assumptions Agent input: financial statements JSON.
- Assumptions Agent output -> Valuation Agent input:
`{ "financial_statements": {}, "assumptions": {} }`
- Valuation Agent output -> Writer Agent input:
`{ "financial_statements": {}, "assumptions": {}, "valuation": { "dcf": {}, "scenarios": {} } }`
- Writer Agent output: markdown report.

This is the orchestration layer. The supervisor does not need to understand the internal structure of the DCF model. It only needs to know which output goes where. If any specialist returns an error, the supervisor stops and returns a structured error. That behavior is critical because continuing after a missing data error or malformed assumption set would produce a misleading report.
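What that structured error looks like is an implementation choice; a minimal sketch, with illustrative field names and an unsupported ticker as the trigger:

{
  "status": "error",
  "failed_stage": "data-agent",
  "reason": "No fixture available for ticker: TSLA"
}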

Why Start with a Subset of Agents

A full target system might eventually include a wide range of specialists: market data, filings ingestion, macro analysis, risk modeling, peer comparison, accounting adjustments, citation handling, and compliance checks. Building all of them at once introduces unnecessary complexity and weakens testability.

A proof of concept should instead use the smallest viable set of agents required to validate the architecture end to end.

For this project, that subset consists of:

  • Data Agent
  • Assumptions Agent
  • Valuation Agent
  • Writer Agent
  • Supervisor Agent

This is sufficient to execute a complete valuation pipeline, from a ticker input to a structured markdown report. Each agent has a single responsibility and maps cleanly to one MCP tool, while the Supervisor orchestrates the sequence.

The structure mirrors a real valuation workflow:

  1. Retrieve financial statements
  2. Apply assumptions
  3. Perform valuation
  4. Generate the report

The simplicity is intentional. It keeps the system testable while still reflecting a realistic process.

The end-to-end flow operates as follows:

  • A user requests a valuation for CROX on NASDAQ
  • The Supervisor routes the request to the Data Agent
  • The Data Agent calls the financial_statements tool, which returns a deterministic CROX fixture via valuation_core.statements.normalize_statements
  • The Supervisor passes the result to the Assumptions Agent
  • The Assumptions Agent calls the assumptions tool, which returns deterministic parameters such as WACC, beta, and terminal growth
  • The Supervisor forwards both datasets to the Valuation Agent
  • The Valuation Agent calls scenario_valuation, which executes run_dcf and run_scenarios
  • The Supervisor sends the valuation output to the Writer Agent
  • The Writer Agent calls report_writer and produces deterministic markdown

The resulting report begins with a predictable structure:

# CROX (NASDAQ) Valuation Report

## Inputs
- Period: FY2024 (TTM)
- Revenue: $3.9B
- EBIT: $1.05B (margin 27%)
- Net debt: $1.8B

## Cost of Capital
- WACC: 9.80%
- Cost of equity: 11.45%
- Terminal growth: 2.50%

The specific formatting is secondary. What matters is pipeline discipline. Each component operates within strict boundaries:

  • The Data Agent retrieves data but does not interpret it
  • The Assumptions Agent defines parameters but does not perform valuation
  • The Valuation Agent computes outputs but does not fetch inputs
  • The Writer Agent formats results but does not generate new values
  • The Supervisor coordinates without mutating data

This separation ensures that every output is traceable to a deterministic source, making the system testable at every layer.

What this POC demonstrates is not financial accuracy or market validity. It does not validate that CROX is correctly priced, nor that the assumptions generalize across contexts. Those are domain-level concerns to be addressed later.

Instead, it validates the architectural foundation:

  • Clear separation between agents, tools, and domain logic
  • Typed DTOs returned from the valuation core
  • Deterministic MCP tool interfaces
  • Unit-testable tool wrappers
  • Integration-testable server boundaries
  • Explicit agent contracts
  • A supervisor-driven orchestration model
  • Report generation from structured data rather than free-form reasoning

This establishes a stable skeleton on which more complex functionality can be built.

What Comes Next

The next step is not to expand the number of agents, but to operationalize the existing architecture.

The current POC relies on deterministic fixtures and controlled execution paths. To move beyond this stage, the system must be configured to process real user input and allow agents to interact through actual runtime constraints.

The next article will focus on:

  • Configuring OpenClaw for this architecture
  • Setting up permissions and execution boundaries for tools
  • Enabling the POC to safely process live user input
  • Allowing agents to communicate through the defined protocol rather than mocked flows

This transition introduces real-world constraints: input validation, permission management, tool access control, and inter-agent communication under a governed runtime.

Once that layer is in place, the system moves from a controlled demonstration to an executable architecture capable of handling dynamic requests.
