Building a Financial Agent with LangGraph and LangSmith
Financial LLMs Overview
Financial LLM applications fail for predictable reasons: ungrounded numbers, opaque reasoning, and workflows that are hard to debug or evaluate. In this article we'll tackle some of those issues and build a tool-first, graph-based agent for financial research - similar in spirit to FinChat or Koyfin - but one that is actually useful ;)
The goal is not to predict prices or trade, but to answer structured financial questions such as:
“Compare NVDA and AMD margins using their latest quarterly reports.”
This problem is representative of real production constraints: structured data, deterministic calculations, clear provenance, and observable failures.
Scope and assumptions
We assume that quarterly financial statements for different companies are already available locally on disk in some structured form. The process that produces these files is treated as an upstream data engineering concern and is explicitly out of scope.
This is reasonably aligned with real systems: the team designing analytical or agentic systems rarely owns the entire data lifecycle end to end. Financial data typically flows through ingestion pipelines, quality checks, and versioned storage long before it is exposed to downstream consumers or fetched from third parties like FactSet.
The system is simple, on-demand, and synchronous. There are no background agents, no scheduled jobs, and no persistent agent identity. A user asks a question, the agent executes a bounded workflow, and a response is returned. This mirrors the interaction model of products like FinChat or Koyfin, where the primary expectation is accurate, explainable analysis rather than autonomy.
The language model will be responsible for planning, orchestration, and explanation, but never for inventing or inferring financial metrics from text.
To RAG or not to RAG
At first glance, “asking questions about quarterly reports” sounds like a classic retrieval-augmented generation use case, like the one we discussed in the previous article about RAG. In practice, using RAG for core financial metrics is one of the most common architectural mistakes in LLM-based applications, especially in finance.
Margins, growth rates, and cash flow figures are not ambiguous concepts that benefit from semantic similarity search. They are deterministic quantities derived from well-defined fields in structured financial statements. Retrieving paragraphs from filings and asking a model to extract or recompute these numbers introduces multiple sources of error: inconsistent phrasing, unit mismatches, partial context, and silent arithmetic mistakes. More importantly, it makes those errors difficult to detect and evaluate.
For this reason, we'll exclude RAG for now. All financial metrics are computed directly from structured data via tools with clear schemas. If the data is missing, inconsistent, or incomplete, the agent must surface that fact explicitly rather than attempting to infer an answer from text. When numbers come from tools, it becomes straightforward to assert properties such as:
- which quarter was used,
- whether both companies were evaluated on comparable periods,
- whether the same formulas were applied consistently.
These properties are nearly impossible to guarantee when numbers are derived implicitly from retrieved text.
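To make this concrete, here is a minimal sketch of the kind of check this design enables; the row shape and field names (ticker, fiscal_quarter) are illustrative, not tied to any specific library:
def assert_comparable_periods(rows: list[dict]) -> None:
    """Fail loudly if the compared companies were not evaluated on the same quarter."""
    quarters = {row["fiscal_quarter"] for row in rows}
    if len(quarters) != 1:
        raise ValueError(f"Non-comparable periods in comparison: {sorted(quarters)}")

# rows as a deterministic metrics tool might return them (metric values omitted for brevity)
rows = [
    {"ticker": "NVDA", "fiscal_quarter": "2024Q3"},
    {"ticker": "AMD", "fiscal_quarter": "2024Q3"},
]
assert_comparable_periods(rows)  # raises if the two rows use different quarters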
RAG is still valuable in financial applications - but primarily for qualitative context: management commentary, explanations of business drivers, or descriptions of risk factors. By postponing RAG, we establish a stable, auditable version 0 baseline against which future extensions can be measured.
System overview
The system will be designed as a tool-first financial reasoning pipeline, where the language model coordinates actions but never acts as a source of financial truth. Architecturally, it sits between a user-facing query interface and a set of deterministic data tools backed by local, structured financial data.
At a high level, the system has three layers:
- Orchestration layer - controls reasoning, flow, and state
- Execution layer - deterministic tools over financial data
- Observability and evaluation layer - traces, metrics, and regressions
This separation mirrors how production ML systems are typically structured: control logic, data access, and measurement are isolated so each can evolve independently.
From a data-flow perspective, a user query moves through the system as follows:
- A user submits a financial question (e.g., a margin comparison).
- The LangGraph agent interprets intent and determines which steps are required.
- The agent invokes explicit financial tools to retrieve structured data.
- Metrics are computed deterministically inside the workflow.
- The agent synthesizes a response, clearly grounded in the computed results.
- LangSmith records the full execution trace for inspection and evaluation.
Crucially, the language model never bypasses the tool layer. All financial facts must be traceable to a tool call and, ultimately, to a specific quarterly report on disk.
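One way to enforce that traceability, sketched here with hypothetical field names, is to have every tool return records that carry their own provenance:
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricFact:
    ticker: str
    fiscal_quarter: str   # e.g. "2024Q3"
    metric: str           # e.g. "gross_margin"
    value: float
    source_file: str      # path to the quarterly report the value was read from

# The renderer can then cite source_file and fiscal_quarter for every number it reports.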
System Diagram
LangGraph, LlamaIndex, and LangSmith: what we use, what we don’t, and why
The LangChain ecosystem is often discussed as a single toolkit, but in practice it consists of components that solve very different problems. Treating them as interchangeable usually leads to overcomplicated or fragile systems. Let's clarify the role of each component in the context of this financial agent:
LangGraph is the orchestration layer. It is responsible for controlling how the agent reasons and acts over time. In this system, LangGraph defines the explicit workflow that turns a financial question into a sequence of bounded steps: resolving which quarter is “latest,” fetching structured data, computing metrics, validating assumptions, and synthesizing a response. The key reason for using LangGraph is that it makes agent behavior inspectable and deterministic. Financial analysis is inherently sequential and stateful, and LangGraph allows that structure to be represented directly rather than hidden inside a prompt loop. This is especially important when debugging or extending the system, because failures can be localized to specific nodes in the graph.
LlamaIndex is a data access layer optimized for unstructured information. Its strength lies in indexing, retrieving, and assembling context from large document collections - filings, transcripts, research notes, or internal knowledge bases. In this article, LlamaIndex is intentionally not used. The questions we target rely on structured numerical data, where retrieval and semantic similarity add little value and introduce avoidable risk. Leaving it out keeps the initial system focused on correctness and explicit computation. LlamaIndex becomes relevant once we introduce qualitative context, such as earnings call commentary or narrative explanations, where controlled retrieval is necessary but must be layered on top of a hardened execution path.
LangSmith is neither an agent framework nor a data layer. It is an observability and evaluation system. LangSmith records how the agent behaves: which prompts were used, which tools were called, what intermediate values were produced, and how long each step took. In a financial setting, this capability is critical. Without traces, it is impossible to audit results, diagnose errors, or compare changes across versions. LangSmith also provides the foundation for dataset-based evaluation and regression testing, which allows known financial questions to be rerun automatically as the system evolves. We'll use LangSmith even though the agent is simple, to establish a measurement baseline before additional complexity is introduced.
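Enabling that instrumentation is mostly configuration. A typical setup looks roughly like this; the exact variable names depend on your LangSmith SDK version:
import os

# When these are unset, the agent still runs; tracing is simply skipped.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your LangSmith API key>"
os.environ["LANGCHAIN_PROJECT"] = "financial-agent-v0"  # example project name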
Evaluation and correctness
Evaluation in financial AI systems cannot be an afterthought. The cost of a wrong answer is rarely just user confusion; it is loss of trust, operational risk, and downstream decision errors that are difficult to unwind. For that reason, evaluation and system design are treated as a single concern rather than two separate phases.
Our system is intentionally structured so that correctness is observable. Because all financial metrics are derived from structured tools and explicit calculations, it becomes possible to assert concrete properties about every answer. The agent must select a specific quarter, retrieve a defined set of statements, apply known formulas, and surface any gaps or inconsistencies. Each of these steps produces artifacts that can be inspected, logged, and tested independently.
LangSmith provides the instrumentation that makes this possible. Every run of the agent produces a full execution trace, including the interpreted intent, tool calls, intermediate values, and final response. This trace is not merely for debugging individual failures; it enables systematic evaluation over time. Known financial questions can be captured as datasets and replayed automatically to detect regressions in behavior, latency, or cost as prompts, models, or workflows change.
At first, our evaluation will focus on a small but meaningful set of signals. These include whether the agent correctly resolves temporal ambiguity (for example, identifying the most recent reported quarter), whether numerical calculations are internally consistent, and whether responses include all requested entities and metrics. Latency and tool-call counts are also tracked to establish performance baselines. These metrics are deliberately simple, but they are grounded in the realities of financial analysis rather than abstract language quality scores. When data is missing or ambiguous, the agent should be forced to surface that limitation explicitly rather than compensating with language.
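As a sketch of what one of these checks might look like - a plain Python evaluator, not a LangSmith API - consider a completeness check over the final answer:
def check_completeness(answer: str, expected_tickers: list[str], expected_metrics: list[str]) -> dict:
    """Did the answer mention every requested company and metric?"""
    missing_tickers = [t for t in expected_tickers if t not in answer]
    missing_metrics = [m for m in expected_metrics if m.replace("_", " ") not in answer.lower()]
    return {
        "complete": not missing_tickers and not missing_metrics,
        "missing_tickers": missing_tickers,
        "missing_metrics": missing_metrics,
    }

# Evaluators like this can be attached to a LangSmith dataset so that known questions
# are re-scored automatically whenever prompts, models, or the graph change.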
Equally important, this design scales as complexity increases. When unstructured context is introduced later, or when multiple agents are coordinated, the same evaluation framework can be extended rather than replaced. New failure modes can be identified and turned into measurable checks, rather than discovered through user complaints. This ability to convert qualitative issues into quantitative signals is what allows agentic systems to operate under real financial constraints.
Implementation
Let's get to it! The key idea is to treat the LLM as a planner and narrator, while all financial logic is executed deterministically over structured data.
The runtime flow is intentionally simple and explicit:
- Take a user query in natural language
- Convert it into a structured QueryPlan
- Validate that plan against allowed operations and constraints
- Execute the plan over a canonical quarterly facts table
- Render a traceable answer
LangGraph is used to make this flow visible and inspectable, rather than implicit. The agent is implemented as a state machine, not a loop. Each node has a single responsibility, and all shared data is carried through an explicit state object.
from typing import Any, TypedDict

class GraphState(TypedDict, total=False):
    user_query: str                    # raw question from the user
    plan: QueryPlan                    # structured plan produced by the planner (defined below)
    result_rows: list[dict[str, Any]]  # deterministic tool output
    answer_text: str                   # rendered, human-readable answer
    evaluation: dict[str, Any]         # lightweight self-checks and metadata
This makes the agent’s behavior auditable: at any point you can see what the system thought the plan was, what data it actually used, and what it returned.
The graph itself follows a predictable structure:
graph = StateGraph(GraphState)
graph.add_edge(START, "plan")
graph.add_edge("plan", "validate")
graph.add_edge("validate", "route")
graph.add_conditional_edges(
"route",
lambda s: s["plan"].operation.value,
{
"GET": "exec_get",
"COMPARE": "exec_compare",
"TREND": "exec_trend",
"CHANGE": "exec_change",
"RANK": "exec_rank",
},
)
for node in ["exec_get", "exec_compare", "exec_trend", "exec_change", "exec_rank"]:
graph.add_edge(node, "render")
graph.add_edge("render", END)
Instead of hardcoding “compare NVDA vs AMD,” the graph routes execution based on the validated operation type. This is what enables multiple question types without embedding-based intent classifiers or brittle regex logic.
Planning and validation
The planning node asks the LLM to turn the raw question into a structured QueryPlan:
@traceable("plan_node")
def plan_node(state: GraphState) -> GraphState:
    return {"plan": llm.plan(state["user_query"])}
The plan includes:
- operation (GET, COMPARE, TREND, etc.)
- metrics
- tickers
- timeframe
- output format
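A minimal Pydantic sketch of such a plan could look like the following; the exact field names and enum members are assumptions based on the list above:
from enum import Enum
from pydantic import BaseModel

class OperationEnum(str, Enum):
    GET = "GET"
    COMPARE = "COMPARE"
    TREND = "TREND"
    CHANGE = "CHANGE"
    RANK = "RANK"

class QueryPlan(BaseModel):
    operation: OperationEnum
    metrics: list[str]           # e.g. ["gross_margin"]
    tickers: list[str]           # e.g. ["NVDA", "AMD"]
    timeframe: str               # e.g. "LATEST_QUARTER" or "LAST_N_QUARTERS"
    output_format: str = "text"  # how the renderer should present the result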
Immediately after planning, the plan is validated:
def validate_plan(plan: QueryPlan) -> None:
    if plan.operation == OperationEnum.RANK and len(plan.metrics) != 1:
        raise PlanValidationError("RANK requires exactly one metric")
This separation is important:
- the LLM proposes
- the system decides whether it is allowed
That boundary is what keeps the agent safe and predictable in a financial context.
All financial logic lives in a small set of deterministic operations. These functions operate over a single canonical quarterly facts table (one row per ticker × quarter).
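For orientation, the facts table and a derived metric might look like this; the column names and numbers are placeholders:
import pandas as pd

facts = pd.DataFrame(
    [
        {"ticker": "NVDA", "fiscal_quarter": "2024Q3", "revenue": 100.0, "gross_profit": 74.0},
        {"ticker": "AMD", "fiscal_quarter": "2024Q3", "revenue": 100.0, "gross_profit": 50.0},
    ]
)
# Derived metrics are plain column arithmetic, never LLM output
facts["gross_margin"] = facts["gross_profit"] / facts["revenue"]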
Examples include:
def trend_metric(df, tickers, metrics, n_quarters):
    return get_metrics(df, tickers, metrics, "LAST_N_QUARTERS", n_quarters=n_quarters)

def rank_tickers(df, tickers, metric, timeframe_kind, ...):
    # slice the facts table for the resolved timeframe, then order by the metric
    latest = get_metrics(df, tickers, [metric], timeframe_kind)
    ranked = latest.sort_values(metric, ascending=False)
    ranked["rank"] = ranked[metric].rank(method="dense", ascending=False)
    return ranked
Key properties:
- no LLM involvement
- no hidden state
- fully testable with unit tests
- easy to extend by adding new metrics or operations
This is what allows the system to answer many questions without hardcoding any single answer. The agent does not call operations directly. Instead, it goes through a thin tool abstraction:
class FinancialTools:
    def execute(self, plan: QueryPlan) -> list[dict[str, object]]:
        if plan.operation.value == "COMPARE":
            out = compare_metrics(...)
        elif plan.operation.value == "TREND":
            out = trend_metric(...)
        ...
This keeps LangGraph nodes generic and makes it easy to harden execution later.
Every node is wrapped with a @traceable decorator. Tracing is optional by design. If LangSmith environment variables are not set, tracing becomes a no-op and the system runs fully offline.
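Here is one way such an optional decorator might be implemented - a sketch that assumes a project-local traceable helper which defers to the LangSmith SDK only when credentials are present:
import os

def traceable(name: str):
    """Trace the wrapped node only when LangSmith is configured; otherwise run it unchanged."""
    def decorator(fn):
        if not (os.getenv("LANGCHAIN_API_KEY") or os.getenv("LANGSMITH_API_KEY")):
            return fn  # no credentials: the node runs untraced, fully offline
        from langsmith import traceable as ls_traceable
        return ls_traceable(name=name)(fn)
    return decorator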
When enabled, traces give you:
- the inferred query plan
- routing decisions
- executed operations
- final rendered output
- timing and failure points
This is critical for finance: you can answer why the system produced a given result, not just what it produced.
@traceable("render_node")
def render_node(state: GraphState) -> GraphState:
...Lastly, the LLM
The mock:
- maps language → QueryPlan
- formats results into readable output
- runs fully offline
This serves two purposes:
- Reproducibility – tests don’t depend on external APIs
- Architectural clarity – the LLM’s role is explicit and limited
In production, this mock can be replaced with:
- OpenAI
- Gemini
- Claude
- or any other LLM that can output structured plans, similarly to what we did in the last article.
Nothing else in the system changes.
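For example, with LangChain's structured-output helper the swap might look roughly like this, assuming the QueryPlan model sketched earlier; this is illustrative, not the article's code:
from langchain_openai import ChatOpenAI

class OpenAIPlanner:
    def __init__(self, model: str = "gpt-4o-mini"):  # model name is an example
        # with_structured_output returns a runnable that yields a validated QueryPlan
        self._planner = ChatOpenAI(model=model, temperature=0).with_structured_output(QueryPlan)

    def plan(self, user_query: str) -> QueryPlan:
        return self._planner.invoke(user_query)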
Conclusion
This first version of the system is intentionally restrained. It does not try to be clever, autonomous, or exhaustive. Instead, it establishes a set of boundaries that are often missing in early LLM-based financial applications - and that absence is usually what causes them to fail as they grow.
By separating language from logic and logic from data, the system makes a simple but important claim: financial questions should be answered by computation first, and explained by language second. LangGraph provides the structure to express that idea clearly. The LLM is allowed to interpret intent and communicate results, but never to decide what is true. Deterministic code, operating over structured quarterly data, is responsible for producing the facts.
Equally important, the system is observable from the start. Tracing and lightweight evaluation are not introduced as optimizations or afterthoughts; they are part of the core design. This shifts the mindset from “does this answer look reasonable?” to “can we explain exactly how this answer was produced?” In finance, that difference matters.
What this article shows is not a finished product, but a foundation. It demonstrates that a useful FinChat-style experience does not require embeddings, retrieval, or complex autonomy to begin with. It requires clear responsibilities, explicit workflows, and the discipline to let each component do only what it is suited for. With those pieces in place, adding unstructured context, harder constraints, or additional agents becomes an evolution of the system - not a rewrite.
In the next article, that foundation will be stressed by introducing qualitative context and more complex failure modes. The goal will not be to make the system smarter, but to keep it correct as it becomes richer.
