
LLM Providers and Abstractions

Dec 9, 2025

Recently I read some posts about LLM provider abstraction SDKs. I particularly liked Armin's post explaining the problems with current abstractions like the Vercel AI SDK. I also liked Mario's post and his whole approach to Pi.

Both posts resonated with me, but I arrived at slightly different conclusions.

TL;DR: Most unified LLM abstractions solve the wrong problem. You don't need seamless model switching mid-conversation; you need development flexibility. The real issue is that assistant messages carry provider-specific state you can't recreate. My approach: store user messages and tool results in a standardized format, but keep assistant messages native. Standardize the streaming interface, not the underlying state.

Why an Abstraction Anyway?

Based on my experience building harnesses, I came up with two reasons for wanting a unified LLM API:

  • Try different models with the same harness during development
  • Switch models between sessions

That's it. And I'm not a fan of the second idea for harnesses. You lose caching state, thinking traces, and whatever internal representations the provider doesn't expose.

So we're really left with one solid reason: development flexibility. Being able to test Claude against GPT against Gemini without rewriting your entire system. That's genuinely valuable.

But here's where I diverge from most abstraction libraries: I don't think the goal should be making models interchangeable, especially between sessions.

Harnesses Should Be Model-Specific

I believe agent harnesses should be built around specific models. This might sound like vendor lock-in, and in some ways it is. But hear me out.

The gap between frontier models has been closing. Benchmark scores are converging. But that doesn't mean the models are the same. Each model has its own quirks, strengths, and failure modes. A harness that squeezes the most out of claude-opus-4.5 won't necessarily do the same for gpt-5.1-codex-max-high (oh god the naming).

Providers handle things like caching, reasoning, and provider-specific tools differently. Unifying them often leaves loose ends, and these features are fundamental to how you design an effective agent.
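Take caching as one example. As I understand the current APIs, Anthropic's prompt caching is opt-in, marked per content block with cache_control breakpoints, while OpenAI applies prompt caching automatically on the server. A rough sketch (request shapes simplified, and the details change often):

```ts
// Illustrative request shapes only; simplified and likely to lag behind the
// real provider APIs.
const LONG_SYSTEM_PROMPT = "...";   // large, stable prompt worth caching
const conversation: unknown[] = []; // prior messages in each provider's format

// Anthropic: caching is opt-in, marked per content block with cache_control.
const anthropicRequest = {
  model: "claude-opus-4.5",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: LONG_SYSTEM_PROMPT,
      cache_control: { type: "ephemeral" }, // explicit cache breakpoint
    },
  ],
  messages: conversation,
};

// OpenAI: prompt caching happens automatically server-side; there is no
// per-block marker, so there is nothing for a unified type to map the
// Anthropic breakpoint onto.
const openaiRequest = {
  model: "gpt-5.1",
  input: conversation,
};
```

A unified type either exposes the breakpoint and lets it be a no-op elsewhere, or hides it and gives up control. Either way, something leaks.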

So is vendor lock-in bad? Yes, but not for the reasons people usually cite.

It's bad because you want to try different models during development. It's bad because you might want different models for different sub-tasks. It's bad because providers have outages and you want fallback options.

It's not bad because you need seamless mid-conversation model switching. You probably don't, for the most part.

A Different Kind of Abstraction

Most unified LLM APIs try to normalize everything. They flatten different provider responses into unified types. But that's not really necessary if we're not changing providers mid-session. Even if we are, we can accept the risk of a lossy translation, but only during the transition.

Not all messages are created equal.

User messages are your data. You wrote them. You can convert them to any format because you know exactly what they contain.

Tool results are also your data. Your tools produced them. Same story.

But assistant messages? Those aren't really yours. They contain provider-specific state. Cache markers. Thinking traces. Internal blobs that the provider requires you to replay on subsequent requests. When you normalize an assistant message into some unified format and then convert it back, you destroy some information you may never recover.

You can build a unified library that doesn't lose any information, but that's quite difficult given the rapid pace of development (new changes land every week), and I haven't seen one that actually achieves it.

So my approach stores user messages and tool results in a standardized format, but keeps assistant messages in their native provider format. The streaming events are standardized to keep the interface unified across providers.
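Concretely, here's a minimal sketch of what that message model could look like. The type and function names are mine for illustration, not the SDK's actual API:

```ts
// Rough sketch of the message model described above. Names are illustrative.

// User messages and tool results are "ours": a small standardized shape.
interface UserMessage {
  role: "user";
  content: string;
}

interface ToolResultMessage {
  role: "tool";
  toolCallId: string;
  result: string;
}

// Assistant messages are not normalized. We keep the provider's raw payload
// (thinking blocks, cache markers, opaque state) and tag it with its origin.
interface AssistantMessage {
  role: "assistant";
  provider: "anthropic" | "openai" | "google";
  raw: unknown; // exactly what the provider returned, replayed verbatim
}

type HarnessMessage = UserMessage | ToolResultMessage | AssistantMessage;

// Sending a conversation: standardized messages are converted into the
// provider's format on the way out; assistant messages pass through as-is
// as long as the provider matches.
function toProviderMessages(
  provider: AssistantMessage["provider"],
  history: HarnessMessage[],
): unknown[] {
  return history.map((msg) => {
    if (msg.role === "assistant") {
      if (msg.provider !== provider) {
        throw new Error("Cross-provider replay requires an explicit convert step");
      }
      return msg.raw; // full fidelity: nothing was normalized away
    }
    return convertOwnedMessage(provider, msg); // lossless: it's our data
  });
}

// Placeholder for the per-provider conversion of user/tool messages.
declare function convertOwnedMessage(
  provider: AssistantMessage["provider"],
  msg: UserMessage | ToolResultMessage,
): unknown;
```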

When you're working within a single provider (which is 95% of the time), you get full fidelity. All the caching works. All the provider-specific features work. Nothing is lost.

The streaming interface is standardized, so my UI doesn't care which provider is responding. But the underlying messages stay native to their providers, and you simply choose a provider before a session.
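A sketch of what that standardized stream could look like, reusing the AssistantMessage type from the sketch above; again, event names are illustrative, not the SDK's actual API:

```ts
// Hypothetical standardized streaming events. The UI consumes these and never
// sees provider-specific wire formats.
type StreamEvent =
  | { type: "text_delta"; text: string }
  | { type: "thinking_delta"; text: string }
  | { type: "tool_call"; name: string; args: unknown }
  | { type: "done"; assistantMessage: AssistantMessage };

// The UI loop is provider-agnostic: it only switches on event type.
async function renderStream(events: AsyncIterable<StreamEvent>) {
  for await (const event of events) {
    switch (event.type) {
      case "text_delta":
        process.stdout.write(event.text);
        break;
      case "thinking_delta":
        // optionally render reasoning separately
        break;
      case "tool_call":
        console.log(`\n[tool] ${event.name}`);
        break;
      case "done":
        // the native assistant message goes back into history untouched
        break;
    }
  }
}
```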

When you need to test with a different model or fork a conversation to a different provider, you do that explicitly, as an intentional operation with understood trade-offs. You can loosely convert the assistant messages from one provider to another.
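Building on the types from the earlier sketch, an explicit fork might look something like this (illustrative names, and deliberately lossy):

```ts
// Sketch of an explicit, intentionally lossy conversion when forking a
// conversation to a different provider.
function forkToProvider(
  history: HarnessMessage[],
  target: AssistantMessage["provider"],
): HarnessMessage[] {
  return history.map((msg) => {
    if (msg.role !== "assistant" || msg.provider === target) {
      return msg; // user/tool messages and same-provider messages carry over
    }
    return {
      role: "assistant",
      provider: target,
      // Lossy step: rebuild a minimal assistant message for the target
      // provider from the visible text only; caching and thinking state
      // are dropped.
      raw: buildTargetAssistantMessage(target, extractVisibleText(msg)),
    };
  });
}

// Placeholders: pull visible text out of a native payload, and wrap it back
// up as a minimal assistant message in the target provider's format.
declare function extractVisibleText(msg: AssistantMessage): string;
declare function buildTargetAssistantMessage(
  provider: AssistantMessage["provider"],
  text: string,
): unknown;
```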

The SDK is still in development and testing. Check it out here.

Open Questions

There are several options out there like the Vercel AI SDK, Pydantic AI, and the newly released TanStack AI SDK. They're fine for many use cases. But for building good harnesses, I've found that preserving provider-native state matters more than achieving perfect unification.

If you've found better patterns, I'd like to hear about them.