Technical evaluation of five context engineering approaches for natural language data analysis
This report presents a comparative analysis of five architectural patterns for enabling natural language queries over structured spreadsheet data using Large Language Models (LLMs). The study evaluates approaches from naive prompt injection to agentic multi-step systems, analyzing trade-offs in accuracy, latency, scalability, and cost.
These modes were chosen because they capture the dominant design patterns used in both open-source spreadsheet agents (e.g., LangChain CSV/Pandas agents, PandasAI, LlamaIndex Spreadsheet Agent) and proprietary systems (e.g., Microsoft Excel Copilot, Dust Query Tables). In practice, most current Excel-focused agents can be mapped to one or more of these strategies, making them representative benchmarks for analysis.
Our prototype implementation leverages Claude Sonnet 4.5 for reasoning and function calling, DuckDB-WASM for client-side SQL execution, and lightweight retrieval pipelines for semantic search. Each architectural mode thus reflects not just a theoretical construct but an observed practice, providing a structured lens through which to evaluate the strengths and limitations of LLM-powered spreadsheet systems.
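To make the tool-driven pattern concrete, the sketch below shows a minimal Mode C loop: the model sees only the table schema, proposes a SQL query via a tool call, DuckDB-WASM executes it deterministically, and a second call synthesizes the answer. The tool name `run_sql`, the `controls` table, and the surrounding wiring are illustrative assumptions, not the prototype's actual code.

```typescript
// Minimal sketch of a Mode C (tool-driven) loop, assuming an initialized
// DuckDB-WASM connection with the spreadsheet loaded as a `controls` table.
// Tool name `run_sql` and the prompt wiring are illustrative assumptions.
import Anthropic from "@anthropic-ai/sdk";
import type { AsyncDuckDBConnection } from "@duckdb/duckdb-wasm";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const tools: Anthropic.Tool[] = [{
  name: "run_sql",
  description: "Execute a read-only DuckDB SQL query against the spreadsheet.",
  input_schema: {
    type: "object",
    properties: { query: { type: "string", description: "DuckDB SQL" } },
    required: ["query"],
  },
}];

export async function answerQuestion(
  question: string,
  schema: string, // e.g. "controls(control_id, owner, status, ...)"
  conn: AsyncDuckDBConnection,
) {
  const prompt = `Schema:\n${schema}\n\nQuestion: ${question}`;

  // First call: the model sees only the schema, never the raw rows.
  const first = await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    tools,
    messages: [{ role: "user", content: prompt }],
  });

  const toolUse = first.content.find((b) => b.type === "tool_use");
  if (!toolUse || toolUse.type !== "tool_use") return first; // answered directly

  // Deterministic execution happens in DuckDB, not in the model.
  const { query } = toolUse.input as { query: string };
  const rows = (await conn.query(query)).toArray().map((r) => r.toJSON());

  // Second call: the model synthesizes a natural-language answer from the
  // exact query result. (BigInt columns such as COUNT(*) need a replacer.)
  const resultJson = JSON.stringify(rows, (_, v) =>
    typeof v === "bigint" ? v.toString() : v,
  );
  return client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    tools,
    messages: [
      { role: "user", content: prompt },
      { role: "assistant", content: first.content },
      {
        role: "user",
        content: [{
          type: "tool_result",
          tool_use_id: toolUse.id,
          content: resultJson,
        }],
      },
    ],
  });
}
```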
Mode | Approach | Context Method | Execution | Scalability | Accuracy* |
---|---|---|---|---|---|
Mode A | Naive Prompt | Direct data injection | Single LLM call | ~20 rows max | 70-75% |
Mode B | RAG | Semantic retrieval | Embedding + LLM | ~10K+ rows | 75-85% |
Mode C | Tool-Driven | Schema-informed | SQL/Pandas generation | ~1M+ rows | 85-92% |
Mode D | Hybrid RAG + Tools† | Retrieval-guided SQL | RAG → SQL → Synthesis | ~1M+ rows | 90-95% |
Mode E | Agentic | Multi-step reasoning | Planning → Execution → Memory | ~1M+ rows | 92-97% |
† Combines semantic retrieval with SQL execution.
* Benchmarks are based on live demo queries against a 25-row IT controls dataset.

Dataset: a synthetic IT controls table (25 rows × 12 columns) with seeded anomalies: duplicate user roles, misaligned purchase orders, and inconsistent export counts. It mirrors a real-world structure, with intentional test failures to exercise anomaly detection queries; a representative query appears after these notes.
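For instance, a question like "which users hold duplicate role assignments?" would compile under Mode C to a small aggregate query. The column names below (`user_id`, `role`) are hypothetical, since the dataset's twelve columns are not enumerated here.

```typescript
// Hypothetical Mode C output for "which users have duplicate roles?".
// Column names `user_id` and `role` are illustrative; the report does not
// enumerate the dataset's actual 12 columns.
import type { AsyncDuckDBConnection } from "@duckdb/duckdb-wasm";

const DUPLICATE_ROLE_SQL = `
  SELECT user_id, role, COUNT(*) AS assignments
  FROM controls
  GROUP BY user_id, role
  HAVING COUNT(*) > 1
  ORDER BY assignments DESC;
`;

export async function findDuplicateRoles(conn: AsyncDuckDBConnection) {
  const result = await conn.query(DUPLICATE_ROLE_SQL);
  return result.toArray().map((row) => row.toJSON());
}
```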
Our analysis shows that context engineering strategy is a decisive factor in LLM performance for spreadsheet analysis. Naive prompting (Mode A) breaks down quickly beyond trivial inputs, while RAG (Mode B) improves coverage but fails on aggregation. Tool-driven execution (Mode C) scales to millions of rows with deterministic accuracy, and hybrid systems (Mode D) deliver the best balance by combining semantic retrieval with precise computation, though at higher latency. Agentic approaches (Mode E) add memory and multi-step reasoning but provide only marginal gains for single-query tasks. In practice, most existing Excel agents map onto these modes, with C and D the most robust patterns for compliance-oriented audit systems; a sketch of the Mode D pipeline follows.
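The sketch below illustrates the three-stage Mode D flow described above, with the retrieval, generation, and synthesis steps passed in as dependencies. All five function signatures in `HybridDeps` are assumptions for illustration, not the prototype's API.

```typescript
// Sketch of the Mode D (hybrid) pipeline: semantic retrieval narrows the schema
// context, SQL supplies exact computation, and a final LLM call synthesizes the
// answer. Every dependency below is a hypothetical stand-in, not the prototype's API.
interface HybridDeps {
  embedQuery(question: string): Promise<number[]>;
  searchColumnIndex(vector: number[], topK: number): Promise<string[]>;
  generateSql(question: string, relevantColumns: string[]): Promise<string>;
  executeSql(sql: string): Promise<Record<string, unknown>[]>;
  synthesize(
    question: string,
    sql: string,
    rows: Record<string, unknown>[],
  ): Promise<string>;
}

export async function hybridAnswer(
  question: string,
  deps: HybridDeps,
): Promise<string> {
  // 1. RAG: retrieve the column descriptions most relevant to the question.
  const vector = await deps.embedQuery(question);
  const relevantColumns = await deps.searchColumnIndex(vector, 5);

  // 2. Tools: generate a SQL query grounded in the retrieved schema slice,
  //    then execute it deterministically (e.g. in DuckDB-WASM).
  const sql = await deps.generateSql(question, relevantColumns);
  const rows = await deps.executeSql(sql);

  // 3. Synthesis: have the model explain the exact result in natural language.
  return deps.synthesize(question, sql, rows);
}
```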