So happy that I finally got around to reading this. Very informative!
I think there's an implicit assumption in this post that your target customers need consistent results and predictable outcomes. Understandably so, btw. But since agents are built on non-deterministic LLMs, achieving enterprise-grade quality seems crazy... unless you're framing the product for early-adopter customers, mid-market and smaller. I think one way to unlock better outcomes with agents is to involve humans in their workflows, enabling them to review any proposed actions the agent wants to take (i.e. spend some money, push some code, etc.). It might be crazy, but I've been thinking about whether there's an opportunity to build agentic software for everyday consumers that leans into this interaction paradigm instead of trying to eliminate it entirely, at this stage.
Curious what you think.
totally! the push for determinism in agents is rooted in reliability, with a few major exceptions:
- building a product where you want high variance (like writing poems)
- building in a pass@k-style architecture, where you generate many results and choose the best one
- problems intrinsically unsolvable by the agent (like requiring non-indexed data to answer)
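The pass@k idea above fits in a few lines (a toy sketch; `generate` and `score` are hypothetical stand-ins for your non-deterministic model call and whatever quality heuristic you use to pick a winner):

```python
def pass_at_k_best(generate, score, k: int = 5):
    """Generate k candidate outputs and keep the highest-scoring one."""
    candidates = [generate() for _ in range(k)]
    return max(candidates, key=score)

# toy usage: pick the longest of three draft strings
drafts = iter(["ok", "a longer draft", "mid"])
best = pass_at_k_best(lambda: next(drafts), len, k=3)  # "a longer draft"
```

The point is that even with high per-call variance, selecting the best of k samples can give you a much more reliable end result.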
I think "human in the loop" (or "agent in the loop") workflows will be around for many years yet, and the UX is key, just as ChatGPT was fundamentally a UX advance that made LLMs easy to interact with
Love this post! I'm interested in the caching section. Do you have an example for that in action?
thanks! you can get up and running with Helicone easily just by swapping a URL; otherwise, a Postgres table with a string key and a JSON payload field is all you need, depending on your preference. Here's the Python to serialize:
```python
import hashlib
import json

def _stable_serialize(request: ChatCompletionRequest) -> str:
    # sort_keys gives a canonical JSON string, so identical requests
    # always serialize (and therefore hash) identically
    return json.dumps(request.get_request_body_dict(), sort_keys=True)

def _get_cache_key(request: ChatCompletionRequest) -> str:
    request_serialized = _stable_serialize(request)
    return hashlib.sha256(request_serialized.encode()).hexdigest()
```
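To show the key in action end to end, here's a sketch of the lookup side. sqlite3 stands in for the Postgres table described above, a plain dict stands in for the request object, and `call_llm` is a hypothetical stand-in for your actual completion call:

```python
import hashlib
import json
import sqlite3

# in-memory sqlite standing in for the Postgres table:
# one string key column, one JSON payload column
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE llm_cache (key TEXT PRIMARY KEY, payload TEXT)")

def cache_key(request_body: dict) -> str:
    # stable serialization -> identical requests hash to the same key
    return hashlib.sha256(
        json.dumps(request_body, sort_keys=True).encode()
    ).hexdigest()

def cached_completion(request_body: dict, call_llm) -> dict:
    """Return the cached response for this exact request if we've seen
    it before; otherwise call the model and store the result."""
    key = cache_key(request_body)
    row = conn.execute(
        "SELECT payload FROM llm_cache WHERE key = ?", (key,)
    ).fetchone()
    if row:
        return json.loads(row[0])
    response = call_llm(request_body)
    conn.execute(
        "INSERT INTO llm_cache VALUES (?, ?)", (key, json.dumps(response))
    )
    return response
```

Repeated identical requests then hit the cache instead of the model, which is what makes re-running an eval suite fast.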
Thanks! I initially didn't fully grasp the purpose of caching and how it could speed up evaluation, so I was hoping to understand it with some code for context.
I have spent the last few weeks building a chatbot and I resonate with your article. Caching is a great strategy to speed up testing, thanks!
Thanks! would love to check out the chatbot if it's public