Lessons from 15 months of building LLM agents
Or: What I want out of an LLM observability platform
I’ve spent the past 15 months building LLM agents, currently Ellipsis, a virtual software engineer. Previously, I worked on structured data extraction, codebase migrations, and text-to-SQL.
As a result, startups working on LLM observability/safety/reliability reach out to me several times per week to hear about my problems, so I’m consolidating my experience and advice here.
What does my current setup look like?
My workflow has looked remarkably similar across the varied LLM agents I’ve worked on.
Evals in CI
Most of my day is spent running evals; everything else flows from this. Why?
Prompts are brittle. You add an extra space or remove a comma somewhere, and the LLM output may be completely different, i.e. degraded on an important use case.
Small changes accumulate: A single agent run might have dozens, hundreds, or thousands (!) of LLM calls, so the actual outputs of a small change must be verified empirically.
All the normal reasons you have tests: they’re great. Run the tests, tests pass, you ship. Ship ship ship.
How do evals work? There are two main kinds of evals:
“Unit tests”: Verify a small piece of the agent internals; for example, verify the agent calls a particular tool when it’s the obvious choice. Good for testing obscure scenarios. Easy to verify expected behavior.
“Integration tests”: a complicated scenario end to end. Much more valuable because it tells you “does it actually work”. It’s often hard to write assertions for these because the outcome is amorphous, such as an answer to an ambiguous question.
So, the big question:
How do you tell if the agent was successful?
Here are some ideas:
Problem-specific heuristics: Often the agent’s objective has some other characteristics you can use to check. If it was writing a SQL query: does the query execute? If it was writing code: does the code pass tests? If it was extracting data: are all the expected fields present?
LLM self-evaluation: great idea. But here’s the thing: any reasonably advanced agent includes a self-critique sub-agent that it will use internally. This means your agent (probably) produced something that already passed your best LLM tests. (If your tests are finding room for improvement, they should be part of your agent!).
Sadly, this just doesn’t cut it most of the time. There are too many degenerate edge cases. There’s only one solution I’ve found:
*snapshot the outputs and read them manually*
This is a huge pain.
And if your tests check the exact outputs, they need to be *exactly* the same, so you’re going to need:
Caching
This is one of the first things I set up on a new agent project. A cache is needed for your evals to be:
Deterministic: you want to stay sane. GPT-4 is non-deterministic at temperature zero (due to Sparse MoE).
Fast: a roundtrip to OpenAI can take seconds or minutes, so an agent might run for minutes or even hours - a cache speeds this up by 100x.
Cheap: at $0.06/1k input tokens, a fully maxed-out call to GPT-4-32k costs $1.92 (up to $3.84 for large outputs). If you experiment with agents enough, you’ll find there are many ways to blow through $1k in a few minutes of experiments.
There are many caching providers out there, but I’ve always rolled my own (I did try Helicone and hit various issues). Why?
You do NOT want another service in between you and your LLM provider. LLM providers are unreliable enough as it is.
A cache is trivial to set up and maintain.
So, I’ve built the exact same thing for every LLM project I’ve been on: stable serialize the request => hash it to get a key => stuff in a key/value store. You can build your own in Postgres in minutes.
Observability, at long last
OK, you ran your evals (or a customer ran something in prod), and your agent produced some hot garbage. Time for the fun part.
Getting an intuitive feeling for your agent
Over time I find you get quite a good gut feeling for what your agent is (or should) be doing, just as with acclimating to a large new codebase or getting to know a literal human being.
Mostly, this comes from reading a LOT of:
Logs
I find good “normal” logging is even more valuable than fancy observability tools. The other AI engineers I know all agree: you spend a LOT of time poring over logs.
However much data you’re logging, it’s not enough, even if you take into account this blog post. (I promise I don’t work for DataDog and am not trying to convince you to spend $65M/year on observability.)
In a complicated agent, the “root cause” of an issue is often in the agent’s tools or environment, not in the prompts itself. But sometimes the agent “just did something dumb”, and for that you’ll need:
LLM request UI
You need a nice UI to view agent conversation histories. I use PromptLayer - simple, fast, reliable. I’ve tried a few others and hit misc issues.
Some other platforms have agent-first UIs (LangSmith and W&B come to mind) - haven’t tried them because it just hasn’t been a priority to have a nice visualization, even with thousands of LLM calls in a single agent run.
From viewing the conversation history, you find the spot where the agent went off the rails, and you ask yourself, “what could I have changed about the prompt to get it to do the right thing?” For that, you’ll head into:
Prompt Playground
Most of the LLM request UIs seem to have one built-in now. If you prefer copy-pasting into a different tab, OpenAI has one, Vercel built one to compare different models, loads exist. This is super useful.
You play around in the playground a few minutes, you find a change that works (maybe changing the system prompt, maybe it’s adding a new tool), and now…
Full circle! You need to add a new eval, and then run all your old evals to make sure we didn’t hopelessly break all our other use cases (which happens far too often).
Things I don’t need
Prompt Library / Prompt Templates
Everyone wants me to put my prompts in their database so I can change it in their UI. This makes sense for some products, such as if you have some simple chatbot and you want a PM to be able to tweak the prompt without having to touch any code.
For agents, it’s a complete non-starter:
Prompts must be in version control for your evals to be reliable
Agents are composed not of a single prompt, but dozens or hundreds of smaller prompts: various tools, sub-agents, and error-handling cases.
Langchain and similar libraries
There are many cool libraries for building LLM agents, and if I only have 30 minutes to prototype something, they can be useful.
However, I’ve yet to find a real use case for them outside of demos. LLM agents are pre-paradigm with no killer apps (yet), and building abstractions for applications that don’t exist is quite difficult.
Customizations to your agent are always necessary.
The libraries have hooks in all the wrong places, and many simple behaviors are completely inexpressible.
Agents are just a while-loop; you can write your own in around as much time as it takes to learn the API of whatever library, and it’ll be far simpler and easier to customize.
For now, the wrong abstraction is far worse than no abstraction.
Things I’ll pay someone to build
Here are some things I’d love:
Fuzz testing: prompt instability is a huge problem, and a constant cause for worry. Are all my prompts stuck in a super unstable local optima, which make changes extremely difficult? Try out hundreds of small variations and measure the variance in the results.
Prompt optimization: there’s now tons of literature on this.
Auditor agent: a first step would be “find the point in the conversation where things went wrong”, useful for long agents.
[Edit 2023-12-03] Observability for Embeddings: The use case is debugging RAG workflows and evaluating different embedding models. For example, Nomic Atlas has a great visualizer for exploring embedding spaces, but it’s meant for static datasets, not for production use cases.
Conclusion
For now, I think just like “great ML engineers spend a lot of time looking at their data”, AI engineers [have to] spend a lot of time reading agent logs and manually inspecting results.
If you are building something that would improve this workflow, I would love to chat.
Edits
A few people have asked about the desirability of determinism, and the assumptions inherent therein. The reason for determinism is rooted in agent reliability, with major exceptions:
- building a product where you want high variance (like writing poems)
- building in a pass@k-style architecture, where you generate many results and choose the best one
- intrinsically unsolvable by agent (like requiring non-indexed data to answer)
I think "human in the loop" (or "agent in the loop") workflows will be around for many years yet, and the UX is the key, just as how ChatGPT was fundamentally a UX advance to make LLMs easy to interact with.
So happy that I finally got around to reading this. Very informative!
I think there's an implicit assumption in this post is that your target customers need consistent results and predictable outcomes. Understandably so, btw. But since agents are built on non-deterministic LLMs, achieving enterprise-grade quality seems crazy... unless you're framing the product for early-adopter customers mid-market and smaller. I think one way to unlock better outcomes with agents is to involve humans in their workflows, enabling them review any proposed actions that the agent want to take (ie. spend some money, push some code, etc). It might be crazy, but I've been thinking about whether there's an opportunity to build agentic software for everyday consumers that leans into this interaction paradigm instead of trying to eliminate it entirely, at this stage.
Curious what you think.
Love this post! I'm interested in the caching section. Do you have an example for that in action?