Building Custom Analytics: An AI SQL Agent over Per-Tenant DuckDB Files
BoxHero is an inventory management product used by thousands of businesses across a wide range of industries. We hold our customers' full operational data — items, stock movements, locations, purchase orders, and sales history — and turn it into real-time stock levels and pre-built reports.
Pre-built reports cover the cases that show up across most businesses, but every customer has questions that fall outside of “common use cases.” Real requests we hear regularly:
"I want monthly purchase totals broken down by supplier."
"I need inventory asset value using FIFO cost."
"I want to see available stock that accounts for sold-but-not-yet-shipped quantities."
"I want a monthly sales report, but exclude this one customer."
We could keep adding toggles and filter panels until the report screen covered every variation. But customers consistently tell us BoxHero is simple enough to use without training, and piling on configuration would undo that.
We needed a way to give analytical depth to power users, without making the product harder for everyone else.
This post is about how we built Custom Analytics: a feature that lets users ask questions in natural language, has an AI write DuckDB SQL against a per-team snapshot of their data, and runs it entirely in the browser. It's currently in beta with a subset of customers.
SQL as the escape hatch
SQL is the obvious tool for arbitrary analytical questions, but asking non-developers to write it has never been realistic.
What changed is that modern LLMs are genuinely good at translating natural language into SQL — good enough that the bottleneck is no longer the user's ability to write SQL, but our ability to give the model a safe place to run it.
“What if the AI wrote SQL, and we gave the user a sandbox where that SQL could actually execute against their own data?”
That question became the starting point for Custom Analytics.
A safe place to execute user-written SQL
BoxHero's primary database is PostgreSQL, and every customer's data lives in shared tables. Tenant isolation is enforced by row-level filters that depend on the authenticated user's team. Letting a user (or an LLM acting on their behalf) execute arbitrary SQL against that database is a non-starter: one missing WHERE team_id = ? and another customer's data leaks.
The cleanest way to remove that risk is to remove the shared database out of the picture. If each team has its own embedded database file containing only their data, an arbitrary query against that file is safe by construction.
Why DuckDB
SQLite was the obvious starting point. It's the default embedded database, and most teams pick it without thinking twice. We picked DuckDB instead, for three reasons:
- Analytical query performance. Our workload is OLAP — group-bys, window functions, time-series rollups — and that's exactly what DuckDB is built for.
- A much richer type system and standard library than SQLite. SQLite doesn't even have a real
TIMESTAMP WITH TIME ZONEtype; every date is a string or an integer you have to interpret yourself. DuckDB has proper date/time types and a deep catalog of analytical functions (QUALIFY,ARG_MAX, window aggregates, time-series helpers) that an LLM can use directly. - Columnar storage compresses extremely well on inventory data. Most of our columns are highly repetitive: the same item ids, location ids, transaction types, and status enums recur across millions of rows. Columnar layout dictionary-encodes that redundancy down to almost nothing, which keeps per-team files small.
We've been running DuckDB internally since 2020, mostly for ad-hoc analytics where querying production Postgres directly was too slow. Until the 1.0 release the on-disk format changed between versions, which made it hard to use in production. The format is now stable across minor versions, which was the last thing holding us back.
Syncing Postgres into per-team DuckDB files
With the storage choice settled, the rest of the work was a sync pipeline: read from a handful of large multi-tenant Postgres tables, produce one <team_id>.duckdb file per team, upload to Cloudflare R2. A few problems had to be solved.
1. Translating the Postgres schema into a DuckDB schema, reliably
Rather than hand-maintaining a parallel DuckDB schema, we defined a set of Postgres views that shape the data exactly the way we want it to appear in DuckDB. Then we let DuckDB's postgres extension read those views and materialize them into the DuckDB file.
The views aren't 1:1 mirrors of the underlying tables. They rename internal columns to names the LLM (and humans) can reason about, decode magic-number enums into readable strings, normalize timestamp types, and attach comments that document what each column means and which values it can take.
The txs view is a representative example:
CREATE VIEW duckdb_export.txs AS
SELECT
id,
tx_time::timestamptz,
CASE
WHEN tx_type = 1000 THEN 'in'
WHEN tx_type = 1001 THEN 'out'
WHEN tx_type = 1002 THEN 'adjust'
WHEN tx_type = 1003 THEN 'move'
ELSE tx_type::text
END AS tx_type,
from_location_id,
to_location_id,
vendor_id AS partner_id,
memo,
created_time::timestamptz AS created_at,
order_id,
return_id
FROM location_txs
WHERE team_id = current_setting('duckdb_export.team_id')::INTEGER;
COMMENT ON VIEW duckdb_export.txs IS
'Inventory transactions: records of all inventory movements. '
'tx_type: in=Stock In, out=Stock Out, move=Move Stock, adjust=Adjust Stock';
COMMENT ON COLUMN duckdb_export.txs.tx_type IS
'Transaction type. Allowed values: [in, out, move, adjust]';
COMMENT ON COLUMN duckdb_export.txs.partner_id IS
'Reference to partners.id (for purchases/sales)';
A few specifics from the script:
- The
team_idfilter is the tenant boundary.current_setting('duckdb_export.team_id')is a Postgres GUC the export script sets per process. If the script ever forgets to set it, the cast fails and the view returns nothing. Failure is empty, not cross-tenant. vendor_idbecomespartner_id. The internal name is a historical artifact; the LLM sees the name we want it to use.- Magic numbers become string enums (
1000→'in'), and the column comment lists the exact allowed values. This is what stops the LLM from inventing'incoming'or'stockin'when writing a query. - Timestamps are explicitly cast to
timestamptzso DuckDB picks up a clean type rather than guessing.
Schema changes only need to touch the Postgres view. The DuckDB file picks them up on the next sync, and the AI prompt does too, because the prompt is built by reading the schema and column comments out of the live DuckDB file at runtime rather than from a hard-coded copy.
2. A short export script
DuckDB itself is the ETL runtime, which is what kept the script small. There's no Airflow, no dbt, no bespoke pipeline — the export script attaches Postgres as a remote catalog, reads from the views, and writes into a local DuckDB file:
await connection.run(`ATTACH '${pgURL}' AS pg (TYPE POSTGRES, READ_ONLY);`);
await connection.run(
`CALL postgres_execute('pg', 'BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;');`
);
await connection.run(
`CALL postgres_execute('pg', 'SET duckdb_export.team_id = ''${teamId}''');`
);
// Discover views dynamically — new views require zero changes here.
const tables = (await connection.runAndReadAll(`
SELECT table_name FROM postgres_query('pg', $$
SELECT table_name FROM information_schema.views
WHERE table_schema = 'duckdb_export'
$$);
`)).getRowObjects().map(r => r.table_name);
for (const table of tables) {
await connection.run(`
CREATE TABLE ${table} AS
SELECT * FROM postgres_query('pg', 'SELECT * FROM duckdb_export.${table}');
`);
await copyTableAndColumnComments(table); // reads obj_description / col_description from PG
}
await connection.run(`DETACH pg;`);
await connection.run(`CHECKPOINT;`); // flush WAL so the file is self-contained
A few details worth calling out:
- Dynamic discovery from
information_schema.views. Adding a new view to theduckdb_exportschema in Postgres is the only change needed to expose a new table to users. The export script picks it up automatically on the next run. REPEATABLE READfor the whole export. Every view a single team's export touches reads from the same Postgres snapshot, so a transaction landing mid-export can't leave items and item attributes inconsistent with each other.- Comments are copied across. After creating each DuckDB table, the script reads
obj_descriptionandcol_descriptionfrom Postgres for the corresponding view and re-applies them asCOMMENT ON TABLE/COMMENT ON COLUMNin DuckDB. This is what lets the AI prompt, hours later in the user's browser, read meaningful column documentation directly out of the file. CHECKPOINTat the end flushes the WAL into the main file so what gets uploaded to R2 is a single self-contained.duckdbfile with no sidecar.
The whole script is around 200 lines. The Postgres views and DuckDB's postgres extension carry most of the load, which means there's not much pipeline code for us to maintain.
3. Making sure data from different teams can never mix
The GUC-based filter shown above is half the story; the other half is process isolation. The export script runs each team in its own subprocess and sets duckdb_export.team_id to exactly one value at the start of that process.
There's no shared connection where a stale GUC could carry over from one team to the next, and no code path that constructs the team_id from anything other than the subprocess argument.
4. Keeping the sync from hammering Postgres
We decided up front to sync once per day, and accept that data can lag by up to 24 hours. The first scheduler ran every team at midnight; as team count grew, that one minute became a sharp load spike on Postgres.
Instead, we hash each team_id into one of 24 buckets and sync each bucket on the corresponding hour of the day. Team distribution is uniform enough that this spreads the load to roughly 1/24 of the daily total at any given time.
The sync also runs against a Postgres read replica, not the primary.
Server-side vs. client-side execution
With per-team .duckdb files sitting in R2, we needed somewhere to run queries against them. The two options had clear trade-offs:
| Pros | Cons | |
|---|---|---|
| Server-side | User never fetches the DB file; only result rows cross the wire. | We have to operate a runtime that gives each team an isolated DuckDB instance. |
| Client-side | The DB file is the isolation boundary; cross-tenant leakage is impossible by construction. | The DB file has to reach the browser somehow. |
Server-side is probably the better long-term answer, but operating per-tenant isolated runtimes is genuinely hard. The most promising option we're aware of is Cloudflare Containers, which (unlike Lambda/Heroku/Railway) lets you key container instances by an arbitrary ID and guarantee at most one is running per key. That maps perfectly onto "one container per team," runs at the edge near the user, and the container itself is a strong isolation boundary.
Cloudflare Containers is still in beta, though, and in our testing it had enough rough edges that we weren't comfortable putting it in front of customers yet.
So we shipped client-side first, and the "user has to download a database" cost turned out to be tolerable. DuckDB's columnar compression keeps the average per-team .duckdb file around 5 MB, which is fast enough on any reasonable connection that we don't pay much for it perceptually.
Setup is two lines:
await connection.query(`ATTACH '${dbUrl}' AS db`);
await connection.query(`USE db`);
The client also caches the file with a 48-hour staleness check and ETag-based revalidation, so subsequent loads return 304 Not Modified instead of re-streaming the file.
The AI prompt
With a sandbox in place, the user could now run arbitrary SQL and either inspect results in a table or export to Excel. Writing DuckDB SQL by hand is still beyond most non-developers, so the first thing we shipped was a downloadable prompt: paste it into ChatGPT or Claude, ask your question, and paste the SQL back into our editor.
The prompt includes the full schema context: tables, views, macros, plus the column-level comments described above. Because it's built by querying the DuckDB file in the user's browser (duckdb_tables(), duckdb_columns(), duckdb_views(), duckdb_functions()), there's nothing hard-coded. Update the Postgres view, and on the next sync the prompt is automatically up to date everywhere.
The output is plain markdown.
An abbreviated version of what the model actually sees:
# BoxHero Custom Analytics Database Schema
> This document describes the database schema for BoxHero inventory management system.
> Use DuckDB SQL syntax to query this data.
## Instructions
You are a DuckDB SQL query generator.
**DUCKDB SPECIFICS:**
- Use DuckDB SQL syntax
- All timestamps are TIMESTAMP WITH TIME ZONE type
- ALWAYS filter out soft-deleted records with: `table.deleted = FALSE`
**INVENTORY SPECIFICS:**
- When txs.tx_type is 'out', tx_items.quantity is NEGATIVE
- To find largest outbound quantity: ORDER BY quantity ASC (most negative = largest)
- current_quantities shows CURRENT inventory levels only
- To calculate past inventory: current_quantities - SUM(tx_items.quantity after that date)
## Tables
### txs
Inventory transactions: records of all inventory movements.
tx_type: in=Stock In, out=Stock Out, move=Move Stock, adjust=Adjust Stock
| Column | Type | Description |
|-------------------|--------------------------|-----------------------------------------------------------|
| id | BIGINT | Unique transaction identifier |
| tx_time | TIMESTAMP WITH TIME ZONE | When the transaction occurred |
| tx_type | VARCHAR | Transaction type. Allowed values: [in, out, move, adjust] |
| from_location_id | BIGINT | Source location (for moves) |
| to_location_id | BIGINT | Destination location (for moves) |
| partner_id | BIGINT | Reference to partners.id (for purchases/sales) |
| order_id | BIGINT | Reference to orders.id (if from order) |
(...remaining tables, views, and macros follow the same shape.)
Two parts of this prompt are worth pointing out, because they're handled differently.
- The
Allowed values: [in, out, move, adjust]lines come straight from the Postgres column comments shown in the SQL example earlier. They survive the export, get re-applied as DuckDB column comments, and end up in the prompt without anyone curating a list by hand. - The inventory-specific guidance block ("when tx_type is 'out', quantity is NEGATIVE", etc.) is the one piece we do hand-write. These are domain quirks the LLM can't infer from the schema alone, and they're worth their weight in correctness.
From prompt to agent
Handing users a prompt got us to "AI writes SQL for me," but the failure mode was obvious within a week. LLMs follow instructions well but don't guarantee runnable output: they'd call a function DuckDB doesn't have, forget a cast, or — worse — invent column values.
For example, we document in a schema comment that txs.tx_type is one of 'in' | 'out' | 'move' | 'adjust', and the model would still guess 'incoming' and produce a query that silently returned nothing.
That's the ceiling of any prompt-only approach: the model can't verify what it wrote. So we built a chat UI on top of an agent that has tools to check its own work:
run_sql applies the final SQL to the user's editor and executes it. The client validates with EXPLAIN first:• On failure it returns
{ ok: false, error } without touching the editor, so the model can fix and retry.• On success it returns only
{ ok: true, totalRows }. The actual rows are never sent back to the model, so it can't describe or hallucinate about them.inspect_sql runs a small read-only query and returns rows to the model (not the user). This is what the model uses before writing a final query: distinct enum values, row counts, date ranges, sample rows. Capped at 20 rows; rejects any DDL/DML.ask_user asks the user to clarify when intent is genuinely ambiguous, with optional choice buttons. We tell the model not to use this for trivial decisions; pick a default and state the assumption instead.With just those three tools, the model writes the right SQL on the first user turn most of the time. inspect_sql in particular eliminated the entire class of "model invented a column value" failures.
Model choice and cost
Once the agent worked reliably, the next question was whether we could afford to run it.
We're paying per-token API pricing, where a small number of heavy users can easily flip the unit economics. Picking a cheap model isn't a free fix either: a model that writes broken SQL just makes the user retry, which costs more than a smart model that gets it right on turn one.
We tested Anthropic models — Claude Sonnet 4.6 and Haiku 4.5 — since they're particularly strong at code generation. Haiku produced invalid SQL more often than Sonnet, but run_sql's retry loop generally got it to a working query eventually. The real difference showed up on hard prompts: complex requests like FIFO inventory valuation were beyond Haiku in practice.
We launched on Sonnet 4.6.
▶︎ Making the prompt cacheable
The biggest single lever on cost is prompt caching: cached input tokens are roughly 1/10 the price of fresh ones, so the goal is to maximize the prefix that's identical across turns.
For Custom Analytics, the only context the model actually needs is the current SQL in the editor and the recent user turns. Old conversations from months ago aren't useful; they bloat the prompt and make it more likely the model misreads current intent. Past assistant SQL outputs and their result sets are even worse: long, irrelevant, and never cache-aligned with the current turn.
So we structured the request as follows, optimizing every part for cache reuse:
- Trim history to the most recent ~100 messages. The first message must be a
usermessage, so if trimming leaves anassistantat the front, we drop additional messages until auseris first. - Replace past
run_sqlcalls with placeholders before sending, on both sides. The input SQL gets swapped for[omitted: see <current_sql> in latest user message]and the output for[omitted]. The reasoning: the SQL the model wrote on a previous turn is already represented in the editor state we inject as<current_sql>on the current turn, so leaving it in the tool call is pure duplication that compounds every turn. The result{ ok, totalRows }carries no information forward either. - Leave
inspect_sqlcalls intact. Unlikerun_sql, the rowsinspect_sqlreturned in earlier turns carry learned facts — distinct enum values, row counts, sample values — that the model legitimately reuses when revising the query later. Stripping them would force the model to look the same things up again. - Inject editor state into the current user message, not the system prompt: the new turn starts with
<current_sql>...</current_sql>. This keeps the static prefix (system prompt + schema + tool definitions) byte-identical across turns, which is what the cache actually keys on.
stepCountIs(12) in the AI SDK). That stops a misbehaving conversation from looping run_sql → fix → run_sql → fix indefinitely on our dime.A snapshot of one day of real usage (top 5 teams by cost, names anonymized):
| Team | Input tokens | No-cache | Cache read | Cache write | Output tokens | Cost (USD) | Requests |
|---|---|---|---|---|---|---|---|
| Team A | 118,436 | 13,426 | 90,660 | 14,350 | 4,351 | $0.1866 | 17 |
| Team B | 76,428 | 8,147 | 49,412 | 18,869 | 2,860 | $0.1529 | 13 |
| Team C | 72,100 | 15,336 | 47,051 | 9,713 | 1,533 | $0.1195 | 9 |
| Team D | 21,357 | 1,661 | 13,893 | 5,803 | 565 | $0.0394 | 4 |
| Team E | 10,332 | 640 | 4,631 | 5,061 | 807 | $0.0344 | 2 |
Cache reads dominate input tokens by roughly 5–8× on the active teams, exactly what the prompt-shaping decisions above were aiming for. Per-request cost lands around $0.01, and output tokens stay small because the model returns SQL, not prose. run_sql doesn't echo result rows back, which keeps both cost and the chance of hallucinating about results down.
▶︎ Tracking spend
That table didn't come from a dashboard we bought. It came from the logging we added on day one, because cheap-per-message doesn't protect us from a single team running thousands of messages.
We log (team_id, model, input_tokens, cache_read, cache_write, output_tokens) for every request to Cloudflare Analytics Engine.
It's a lightweight time-series store with a SQL query interface, which is more than enough for this: write one row per request, run aggregations later, no separate database or schema migrations to maintain. It's also already in our stack.
Closing thoughts
The product question we started with was: how do we serve every customer's analytical need without turning the report screen into a cockpit?
The answer turned out not to be a new report feature at all. It was a small composition of pieces we already had access to:
• DuckDB to make per-tenant data files cheap, fast, and small.
• DuckDB-Wasm to push execution to the browser and make tenant isolation a property of the file, not the runtime.
• An LLM with a small set of SQL-aware tools to turn natural language into a verified, runnable query.
None of these components are novel on their own.
What's interesting is that combining them sidestepped the false choice between "complex feature" and "feature that doesn't quite fit." Instead of building configuration UIs for every requested cut of the data, we gave users a sandbox flexible enough that every cut is expressible.
The open question is cost.
Custom Analytics is in beta with a subset of customers right now, and at that scale the per-message numbers look fine — but we won't really know how it holds up until we open it to everyone. That's why every request is logged. The data will shape how we price the feature when it's generally available.
This is also BoxHero's first AI-powered feature, and it's a good template for the ones that come next: take a step back, see what primitives you already have, and use AI to make them speak the user's language.
If you're building something similar — natural-language SQL, per-tenant DuckDB, or LLM agents over your own data — we'd love to compare notes. Drop us a line.