A Comparative Analysis of LLM-Powered Spreadsheet Query Architectures

Technical evaluation of five distinct context engineering approaches for natural language data analysis

Built for Andera by Nathan LiPowered by Claude Sonnet 4.5 & DuckDB-WASM

Abstract

This report presents a comparative analysis of five architectural patterns for enabling natural language queries over structured spreadsheet data using Large Language Models (LLMs). The study evaluates approaches from naive prompt injection to agentic multi-step systems, analyzing trade-offs in accuracy, latency, scalability, and cost.

These modes were chosen because they capture the dominant design patterns used in both open-source spreadsheet agents (e.g., LangChain CSV/Pandas agents, PandasAI, LlamaIndex Spreadsheet Agent) and proprietary systems (e.g., Microsoft Excel Copilot, Dust Query Tables). In practice, most current Excel-focused agents can be mapped to one or more of these strategies, making them representative benchmarks for analysis.

Our prototype implementation leverages Claude Sonnet 4.5 for reasoning and function calling, DuckDB-WASM for client-side SQL execution, and lightweight retrieval pipelines for semantic search. Each architectural mode thus reflects not just a theoretical construct but an observed practice, providing a structured lens through which to evaluate the strengths and limitations of LLM-powered spreadsheet systems.

1. Architecture Overview

ModeApproachContext MethodExecutionScalabilityAccuracy
Mode ANaive PromptDirect data injectionSingle LLM call~20 rows max70-75%
Mode BRAGSemantic retrievalEmbedding + LLM~10K+ rows75-85%
Mode CTool-DrivenSchema-informedSQL/Pandas generation~1M+ rows85-92%
Mode DHybrid RAG + ToolsRetrieval-guided SQLRAG → SQL → Synthesis~1M+ rows90-95%
Mode EAgenticMulti-step reasoningPlanning → Execution → Memory~1M+ rows92-97%

2. Detailed Architecture Analysis

Architecture Mode:
Query
Phase 1: RAG Retrieval
Embed
Search
Top-k
Phase 2: Context-Aware SQL
Extract
Gen SQL
Execute
Phase 3: Synthesis
Merge
Format
Answer

Mode D: Hybrid RAG + Tools

Combines semantic retrieval with SQL execution

Recommended

Best Use Cases

  • Production audit systems requiring both semantic understanding and precise computation
  • Ambiguous queries like “controls with issues” that need context + aggregation
  • Enterprise datasets (1M+ rows) with complex business logic
  • Systems where accuracy justifies the higher latency and cost

Characteristics

  • Advantages: Semantic understanding + precise computation, handles ambiguous queries
  • Limitations: Higher latency, increased cost (2x LLM calls + embeddings), added complexity
  • Accuracy: 90-95% across all query types, especially strong on complex analytical questions

Performance

Latency: ~2300ms
Cost: $0.032/query
Accuracy: 90-95%

Similar Systems

  • Open Source: LlamaIndex SQL + Vector hybrid agents, Haystack Pipeline (Retriever + Reader)
  • Proprietary: Thoughtspot AI (Spotter), Tableau Ask Data, Power BI Q&A

3. Performance Benchmarks

Latency (ms)

Mode A680ms
Mode B1350ms
Mode C1720ms
Mode D2450ms
Mode E3850ms

Cost per Query ($)

Mode A$0.009
Mode B$0.016
Mode C$0.019
Mode D$0.034
Mode E$0.041

Accuracy (%)

Mode A68%
Mode B77%
Mode C86%
Mode D91%
Mode E94%

* Benchmarks based on sample queries against 25-row IT controls dataset.

4. Demo Implementation

Core Components

Claude Sonnet 4.5
Single LLM across all modes. Direct browser API calls (dangerouslyAllowBrowser: true) for client-side execution. Temperature set to 0 for deterministic outputs. No function calling in Modes A-B; tool use in Modes C-E.
$3.00 per 1M input tokens, $15.00 per 1M output tokens
DuckDB-WASM v1.28
In-browser SQL via WebAssembly. Data loaded once per session, queries execute locally (no network). Used in Modes C-E. Custom row_number column added for traceability (1-indexed).
MIT License | ~1.5MB bundle | Zero server cost
Keyword-Based Retrieval
Simple in-memory vector store using TF-IDF-style keyword matching. Each row indexed as a document. Used in Modes B, D, E. Top-k docs retrieved based on query.
In-memory | Zero cost | No API dependencies

Sample Data

A synthetic IT controls dataset (25 rows × 12 columns) with seeded anomalies: duplicate user roles, misaligned purchase orders, and inconsistent export counts. Real-world structure with intentional test failures for anomaly detection queries.

Evaluation Protocol

  • Latency: Measured client-side from query dispatch to JSON result return.
  • Cost: Calculated from token counts × Claude API pricing.
  • Accuracy: Measured as recall + precision of anomalies detected against ground truth.
  • Scalability: Theoretical projection based on token limits and dataset size.

* Metrics from live demo queries.

5. Conclusion

Our analysis shows that context engineering strategy as a decisive factor in LLM performance for spreadsheet analysis. Naive prompting (Mode A) quickly breaks down beyond trivial inputs, while RAG (Mode B) improves coverage but fails on aggregation. Tool-driven execution (Mode C) scales to millions of rows with deterministic accuracy, and Hybrid systems (Mode D) deliver the best balance by combining semantic retrieval with precise computation, though at higher latency. Agentic approaches (Mode E) add memory and multi-step reasoning but provide only marginal gains for single-query tasks. In practice, most existing Excel agents leverage these modes, with C and D as the most robust patterns for compliance-oriented audit systems.

Technical Report · Andera Takehome Assignment · 2025

Implementation available for live testing at the main demo interface