I think 'actual parallelism' is a vastly easier and more fruitful way to get better performance out of these kinds of systems than pushing for faster single-threaded generation. Tool calls and their responses are often embarrassingly parallel. Code generation tasks naturally have a dependency tree that can be unrolled into a fixed budget of parallelism. Tasks can be hierarchically decomposed into subtasks.
It's the same asynchronous stream pattern we're used to dealing with in regular software engineering. We have a fixed thread pool, lots of work that can be scheduled concurrently. Since these are streams, we can do the compute incrementally to reduce the time-to-first-byte/token/response.
Since so many tool calls are inherently asynchronous, and subagent task decomposition can be modelled as such, the IO streams can be oversubscribed, and incoming responses can be priority queued.
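To make the thread-pool analogy concrete, here's a rough sketch of what I mean, using plain asyncio. Everything in it is made up for illustration (`call_tool`, the tool names, the delays, the priorities); it just shows a fixed concurrency budget, oversubscribed scheduling, and responses handled by priority rather than arrival order.

```python
import asyncio

async def call_tool(name: str, delay: float, priority: int) -> tuple[int, str]:
    # Hypothetical stand-in for an asynchronous tool call; `delay` simulates I/O latency.
    await asyncio.sleep(delay)
    return priority, name

async def run_tools(calls, budget: int = 2) -> list[str]:
    limit = asyncio.Semaphore(budget)  # fixed budget of concurrent calls
    inbox: asyncio.PriorityQueue = asyncio.PriorityQueue()

    async def worker(name: str, delay: float, priority: int) -> None:
        async with limit:
            await inbox.put(await call_tool(name, delay, priority))

    # Oversubscribe: schedule every call up front; the semaphore caps how many run at once.
    await asyncio.gather(*(worker(*call) for call in calls))

    # Drain responses by priority (lowest number first), not by arrival order.
    handled = []
    while not inbox.empty():
        _, name = inbox.get_nowait()
        handled.append(name)
    return handled

def demo() -> list[str]:
    calls = [("lint", 0.03, 2), ("search", 0.01, 1), ("tests", 0.02, 0)]
    return asyncio.run(run_tools(calls))
```

For simplicity this waits for every call before draining the queue. A real harness would consume responses incrementally as they arrive, which is where the time-to-first-response win comes from.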
On the intelligence front, it's incredible how much better frontier models perform when you just interrupt them every so often and go 'is that the best you can do?', or reiterate instructions, or repeat the overall goal. I find instruction following _so poor_, especially for 'presentation layer' aspects. Yet if I ask the model to rewrite its last response, it does so perfectly. Why can't the model do this 'internally' and save me having to say 'try again'?
Just because the 'model' is autoregressive doesn't mean the system as a whole needs to present a single stream of immutable text.
I do this kind of parallelism with a little merge request tool I slopped together. I spin up multiple small agents, assign each a specific code review task (security, coding standards, etc.), and have them spit out a GitLab API draft JSON object with code examples for the MR that I can deterministically validate. If an agent fails to include code examples (depending on the task) or to match the proper JSON schema, I have "ask it to try again" logic in place.
Works fine. Forcing LLMs to output parsable responses is a good workaround to get them to do what you want until they improve. It also lets you use the fast models (e.g., I spin up the Gemini 3.1 flash lite model for these tasks) and get these tasks done in seconds rather than minutes.
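The "ask it to try again" loop is roughly this shape. This is a simplified sketch, not my actual tool: the `generate` callable stands in for whatever model client you use, and the `note` / `code_example` keys are a made-up schema for illustration, not the real GitLab draft note format.

```python
import json
from typing import Callable, Optional

REQUIRED_KEYS = {"note", "code_example"}  # hypothetical schema, for illustration only

def validate(raw: str) -> Optional[dict]:
    """Return the parsed object if it is well-formed and complete, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not REQUIRED_KEYS.issubset(obj):
        return None
    if not str(obj["code_example"]).strip():  # the review must carry a code example
        return None
    return obj

def review_with_retry(generate: Callable[[str], str], prompt: str,
                      max_attempts: int = 3) -> dict:
    """Call the model until its reply passes deterministic validation."""
    for _ in range(max_attempts):
        obj = validate(generate(prompt))
        if obj is not None:
            return obj
        # Feed the failure back so the next attempt knows what went wrong.
        prompt += "\nYour last reply was invalid. Return only the JSON object."
    raise ValueError(f"no valid reply after {max_attempts} attempts")
```

Because the validation is deterministic, a cheap fast model that fails once or twice still ends up quicker than a slow model that gets it right the first time.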
Really cool paper and easy to follow. Lots of thoughts (in parallel, hah!) after a first read. I can see many benefits of the parallel streams w/ dynamic systems. Start thinking, fire up a tool call, adjust thinking on the fly. Or add a "clock tick" on one stream, and hope that the model learns how to output something under time constraints. Maybe some "time passing" concept can be had "for free"? Lots and lots of directions this could go.
It also gives a lot of new levers to play with. I'd assume you could tweak (sweep?) the amount of attention given to the same stream vs. cross stream, have different streams prompted / seeded with an objective, score each independently vs. together, etc. A bit reminiscent of the direction oAI took w/ their harmony template, where they define channels and the model learns to output to each channel (but that's sequential).
Would have loved to see even a small attempt at RL on top of this. Could probably get gnarly with so many avenues to explore, but even a few hundred steps could have informed if there's something to it.
One concern I have is w/ how the data was prepared. They used an 80B model to transform the data from the sequential instruct format to this multi-stream format. There are a lot of ways stuff can "leak" from that process and contaminate the results. That's why I'd have loved to see some further RL on this, but anyway. Cool paper, worth a revisit sometime.
The potential of tweaking cross-stream attention is a very interesting avenue, like they note in their discussion: "one-way interactions for security, or partial stream isolation for fine-grained privilege control".
Splitting system streams from user streams already decreases the likelihood of successful attacks (e.g., prompt injection) in their research, and that is, as they say, with dense attention patterns between streams.
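For anyone wondering what "one-way interactions" could look like mechanically, here's a toy sketch of a stream-level visibility rule. To be clear, this is my own guess at the idea, not the paper's implementation: the stream names, the `CAN_READ` policy, and the exact causal rule (own stream up to the current timestep, other streams only at strictly earlier timesteps) are all assumptions.

```python
STREAMS = ["system", "user", "assistant"]

# Hypothetical policy: which streams each stream may read from.
CAN_READ = {
    "system": {"system"},                         # nothing flows into the system stream
    "user": {"user"},                             # the user stream is isolated too
    "assistant": {"system", "user", "assistant"}, # assistant reads everything
}

def allowed(q_stream: str, q_t: int, k_stream: str, k_t: int) -> bool:
    """May a token at (q_stream, q_t) attend to a token at (k_stream, k_t)?"""
    if k_stream not in CAN_READ[q_stream]:
        return False          # blocked by the stream-level policy
    if q_stream == k_stream:
        return k_t <= q_t     # ordinary causal attention within a stream
    return k_t < q_t          # other streams: strictly earlier timesteps only

def build_mask(steps: int):
    """Flatten (stream, timestep) positions and build a boolean attention mask."""
    positions = [(s, t) for t in range(steps) for s in STREAMS]
    mask = [[allowed(qs, qt, ks, kt) for (ks, kt) in positions]
            for (qs, qt) in positions]
    return positions, mask
```

With this policy, injected text in the user stream can never be attended to by the system stream, while the assistant stream still sees both. Dense attention would correspond to every stream being allowed to read every other stream.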
I am a bit suspicious of these ideas. When I disabled parallel tool calls in my custom gpt5.4 harness, the quality of results went up dramatically. It looks like it's running slower and it probably is for some problems, but it's correct way more often than if I allow parallel calls.
I am perfectly content with a medium-speed golden goose. It seems to be a lot more predictable and happy this way. The business and other developers are already saturated by the serialized technique. Going faster would only serve to distract others at this point.
Am I understanding correctly that an implication of this is reduced context? Since they split the input into streams, the total context is now divided among those streams, so a particular stream's context would be shortened to roughly context / streams?
Do I understand correctly that this allows models to generate two contradicting tokens (Contemplating Stream: y op z = 3, Thinking Stream: y op z = 5) in separate streams at any given point? What would happen in such cases? Sounds like an interesting problem or quirk of this architecture.
Yep, in general, the interesting question is how you merge multiple streams that have been worked on in parallel without imposing a particular order or prioritizing one over another. (We do know how to do this in principle, that's exactly what a vector sum of encoded representations does. But it's not clear how to train the model so that it can recognize the outcome of that vector sum operation as meaningful.) Working on multiple streams at the same time is just what subagents do naturally, so the merge is the interesting part.
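A tiny illustration of why a vector sum imposes no order: element-wise addition is commutative, so every permutation of the streams merges to the same result. The vectors here are made-up integers standing in for encoded representations.

```python
from itertools import permutations

def merge(stream_vectors):
    # Element-wise sum across streams; no stream is treated as first or last.
    return [sum(values) for values in zip(*stream_vectors)]

streams = [[1, 0, 2], [0, 3, 1], [4, 1, 0]]  # hypothetical per-stream encodings
merged = merge(streams)

# Every ordering of the streams produces the same merged vector.
order_invariant = all(merge(list(p)) == merged for p in permutations(streams))
```

The flip side is that the sum also throws away which stream contributed what, which is exactly why training the model to read the result is the hard part.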
This sounds like a gamechanger for speed and efficiency if it can scale up.
"However, our models are nevertheless relatively small and trained on tiny amounts of instruction examples, compared to the scale of modern instruction data and multiple post-training stages used to reinforce the default message-based format. We do think that parallel streams are a conceptually enticing format, and that future work on a larger scale will go further to show these benefits."
I was thinking about this just the other day!! Frequently I find myself wanting to interleave thoughts in a conversation, this is the natural way to do so. I just wasn't sure how you'd train it.
New paper out of the Max Planck Institute for Intelligent Systems. If this holds up, it seems big.
Abstract: The continued improvements in language model capability have unlocked their widespread use as drivers of autonomous agents, for example in coding or computer use applications. However, the core of these systems has not changed much since early instruction-tuned models like ChatGPT. Even advanced AI agents function on message exchange formats, successively exchanging messages with users, systems, with itself (i.e. chain-of-thought) and tools in a single stream of computation. This bottleneck to a single stream in chat models leads to a number of limitations: the agent cannot act (generate output) while reading, and in reverse, cannot react to new information while writing. Similarly, the agent cannot act while thinking and cannot think while reading or acting on information.
In this work, we show that models can be unblocked by switching from instruction-tuning for sequential message formats to instruction-tuning for multiple, parallel streams of computation, splitting each role into a separate stream. Every forward pass of the language model then simultaneously reads from multiple input streams and generates tokens in multiple output streams, all of which causally depend on earlier timesteps. We argue that this data-driven change remedies a number of usability limitations as outlined above, improves model efficiency through parallelization, improves model security through better separation of concerns and can further improve model monitorability.