I recently participated in a panel at the AI Native Summit organized by Zetta Venture Partners. The topic of discussion was “Beyond Tokens: The Hidden Levers of Inference” and it was truly such a fun conversation.
While the recording should be out soon, I thought it would be worthwhile to share some of my key talking points. I will note, as I did in the session, that I’m relatively new to this world, so I’m always open to feedback and intelligent debate!
How do you make inference more cost-effective?
While larger models have become cheaper to train, inference can still be expensive and slow, given:
- The increasingly large number of parameters (671B for DeepSeek R1, for instance).
- Chain-of-thought reasoning, which exacerbates the problem.
Speculative decoding is a technique that speeds up LLM decoding by using a smaller, faster “speculator” model to propose the next few tokens, which are then verified in parallel by the larger “target” model (the one being sped up). For example, an 8B model can quickly generate a sequence of candidate tokens, which the 671B model can efficiently verify in a single forward pass.
A great speculator is both fast and aligned with the target model. The narrower the domain, the better the speculator performs. At Together, we train custom speculators. As customers run more inference on our service, we get more data, which in turn enhances the custom speculator.
We’ve written about this at length on our blog; in short, custom speculators can make inference 2-3x faster and lower costs by up to 50%.
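To make the mechanics concrete, here is a toy sketch of the draft-then-verify loop (greedy acceptance only). The `draft_next` and `target_next` functions are hypothetical stand-ins for the small and large models, and a real system verifies all drafted positions in a single batched forward pass rather than a Python loop; this is not Together’s implementation.

```python
# Toy greedy speculative decoding: a hypothetical fast draft model proposes k
# tokens, and a stand-in target model accepts the longest agreeing prefix,
# substituting its own token at the first disagreement.

def draft_next(tokens):
    # Stand-in for the small speculator: cheap, usually right.
    return (tokens[-1] + 1) % 50

def target_next(tokens):
    # Stand-in for the large target model: authoritative, occasionally disagrees.
    last = tokens[-1]
    return (last + 2) % 50 if last % 7 == 0 else (last + 1) % 50

def speculative_decode(prompt, n_new, k=4):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1) Draft k candidate tokens cheaply, one at a time.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2) Verify: the target scores every drafted position (one parallel
        #    forward pass in a real system) and keeps the agreeing prefix.
        accepted = []
        for i, proposed in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            if expected == proposed:
                accepted.append(proposed)
            else:
                accepted.append(expected)  # target's correction, then stop
                break
        tokens.extend(accepted)
    return tokens[: len(prompt) + n_new]

print(speculative_decode([1, 2, 3], n_new=10))
```

When the draft model agrees with the target most of the time (as in a narrow domain), most rounds accept all k tokens, which is where the speedup comes from.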
When it comes to techniques like prompt engineering, long context windows, fine-tuning, and RL, where should startups invest?
When weighing brute force against the strategic use of inputs, I think strategic use wins. Despite how long context windows are getting, putting too much information in the context can:
- Add to your inference cost.
- Make things a lot slower.
- Hurt quality, since models aren’t great at paying attention to everything in a large context window.
To put it more plainly: garbage in, garbage out. So I’d be judicious about what to include in a context window. It also makes sense, IMO, to spend more time on the retrieval and orchestration side of things (a rough sketch of what that can look like follows the list below).
- Retrieval = On-demand recall from a large knowledge source, like searching the web.
- Persistent memory = Always-on context about the user/system, like your personal history and habits.
- One-shot/few-shot examples = Show the model how to apply that knowledge.
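As a minimal sketch of that “strategic use of inputs” idea, here is one way to assemble a prompt from those three sources under a token budget. Everything here (the budget, the crude word-count tokenizer, the `build_prompt` helper) is a hypothetical illustration, not a specific framework or recommendation.

```python
# A minimal sketch of assembling context from persistent memory, few-shot
# examples, and retrieved documents under a token budget, rather than
# brute-forcing everything into the window.

def approx_tokens(text: str) -> int:
    return len(text.split())  # crude proxy; use a real tokenizer in practice

def build_prompt(question, retrieved_docs, user_memory, few_shot, budget=1500):
    candidates = (
        [("Memory", user_memory)]                         # always-on user/system context
        + [("Example", ex) for ex in few_shot]            # show how to apply the knowledge
        + [("Retrieved", doc) for doc in retrieved_docs]  # on-demand recall, ranked upstream
    )
    parts, used = [], approx_tokens(question)
    for label, text in candidates:
        cost = approx_tokens(text)
        if used + cost > budget:  # be judicious: drop what doesn't fit
            continue
        parts.append(f"[{label}] {text}")
        used += cost
    return "\n".join(parts + [f"[Question] {question}"])

print(build_prompt(
    question="What changed in my deployment costs last month?",
    retrieved_docs=["Doc: March invoice summary ...", "Doc: GPU pricing update ..."],
    user_memory="User serves a 70B model and cares about cost per query.",
    few_shot=["Q: Why did latency spike? A: Longer prompts after the new retriever ..."],
))
```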
Generally, I’ve seen startups begin with retrieval and prompt engineering. As they find product-market fit, they transition to fine-tuning (especially when usage patterns are stable and scale pushes costs up). Reserve RL for when you have data, funding, and a clear quality moat.
What is the future of custom inference chips?
Having started my career in hardware and chips, I’m really excited about all the developments that are happening in this space with companies like Etched, Groq, and Cerebras. But my sense is you need to build a ton of optionality into the compile/runtime layer because the industry is still so new and changing quickly.
What happens post-transformers, when MoEs (Mixture of Experts) and algorithmic improvements start reshaping FLOP/memory needs, or when some fundamental assumption baked into the silicon breaks? Having to do a fab redesign would be the kiss of death.
What is one metric that matters for unit economics at scale?
Obviously there are a lot of ways to slice and dice this, but I do like cost per query (C/Q) because it bundles together so many things and is more expansive than just tokens. C/Q is shaped by the factors below (with a toy back-of-the-envelope calculation after the list):
- Model choice (size, quantization).
- Context strategy (retrieval vs brute force).
- Hardware efficiency (throughput, kernel optimizations).
- Orchestration (batching, KV cache reuse, speculative decoding).
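To get a feel for how those levers interact, here is a toy back-of-the-envelope C/Q calculation for self-hosted serving. Every number (GPU price, throughputs, token counts) is an illustrative assumption, not a benchmark or Together pricing.

```python
# Toy cost-per-query model: the cluster-seconds of capacity a query consumes
# (amortized across batched requests) times the cluster's cost per second.
# All numbers are illustrative assumptions.

def cost_per_query(
    gpu_hour_cost=2.50,     # $ per GPU-hour (assumed)
    gpus=8,                 # GPUs needed to serve the model (model choice, quantization)
    prompt_tokens=2_000,    # context strategy: retrieval vs brute force
    output_tokens=400,
    prefill_tok_s=50_000,   # aggregate prefill throughput (hardware, kernels)
    decode_tok_s=5_000,     # aggregate decode throughput (batching, KV reuse, spec decoding)
):
    seconds = prompt_tokens / prefill_tok_s + output_tokens / decode_tok_s
    cluster_cost_per_second = gpu_hour_cost * gpus / 3600
    return seconds * cluster_cost_per_second

# Halving prompt tokens or doubling decode throughput shows up directly in C/Q.
print(f"~${cost_per_query():.5f} per query")
print(f"~${cost_per_query(prompt_tokens=1_000, decode_tok_s=10_000):.5f} per query")
```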
What is next for inference?
Okay, a lot of the usual suspects, like disaggregated serving and RL, are exciting. But to be slightly contrarian, I like the idea of inference-time scaling or test-time scaling (increasing compute at inference). For example, Test-Time Preference Optimization (TPO) is an iterative process that aligns LLM outputs with human preferences during inference without altering the underlying model weights (like RL, but at inference time). It has the potential for big wins at a small cost. TBD on how we do this at scale in production systems. Sidebar: I’ve only seen research on this, so please point me to practical applications if you know of them!
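To make test-time scaling slightly more concrete, here is a sketch in that spirit: spend extra inference compute on a generate → score → revise loop, with no weight updates. The `generate`, `score`, and `revise_prompt` functions are hypothetical stand-ins for an LLM call and a preference/reward model, and this is a loose analogue of the idea, not the actual TPO algorithm.

```python
# A loose, hypothetical analogue of test-time preference optimization: iterate
# generate -> score -> revise at inference time, never touching model weights.

def generate(prompt: str) -> str:
    return f"draft answer for: {prompt}"  # stand-in for an LLM call

def score(prompt: str, answer: str) -> float:
    return len(answer) / 100.0            # stand-in for a preference/reward model

def revise_prompt(prompt: str, answer: str, critique: str) -> str:
    return f"{prompt}\nPrevious attempt: {answer}\nPlease improve: {critique}"

def test_time_optimize(prompt: str, rounds: int = 3) -> str:
    best = generate(prompt)
    best_score = score(prompt, best)
    for _ in range(rounds):               # more rounds = more inference-time compute
        # In practice the critique would come from a critic model or from
        # comparing sampled candidates; here it is a fixed placeholder.
        critique = "be more specific and ground claims in the provided context"
        candidate = generate(revise_prompt(prompt, best, critique))
        s = score(prompt, candidate)
        if s > best_score:                # keep the preferred output; weights never change
            best, best_score = candidate, s
    return best

print(test_time_optimize("Summarize the panel takeaways."))
```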
In general, I think the future of inference will likely be shaped not just by model breakthroughs, but by clever engineering choices that compound over time. And that’s what I’m excited to be working on every day!