
The Fog We’re In
With the recent surge of AI tools and chatbots, there is a dense fog of fear in the air: are we losing the battle of human intuition to machine precision? Everyone is worried about jobs and automation. But my main concern is completely different: have we built the AI we actually wanted? In the pursuit of human-centered AI that supports and augments our cognitive flow, one has to ask: with the plethora of AI tools, and the tools they use (including the Model Context Protocol, MCP), are we really getting work done?
There have been so many instances where I have lost the initial thread of thought. I've found myself losing deep focus, not because of distractions, but because AI responses send me spiralling. One question leads to ten new terms, which means ten new links. Some might argue that this is a feature, and it is great if you're in the ideation or exploration phase. But when I am in my "flow state," I just need clarity, not a rabbit hole.
The experience brings to mind the distinction between a useful dialogue and a disorienting monologue. When we are trying to achieve something specific, even the most informative AI can become a hindrance rather than a help.
What We Wanted (Human-Centered AI) vs. What We Got (Automation)
So the real question remains: is this the AI we initially set out to create? We started out pushing frontiers, but all we've done is speed up existing tasks. Tools like the Model Context Protocol (MCP) are genuinely useful, reasoning models are an important step, and context lengths have grown enormously. We have made progress, sure. But often it feels like we're stuck in an echo chamber of productivity tools: smart, fast, but fundamentally transactional.
As Andrej Karpathy has pointed out, LLMs are a rare kind of technology.
LLMs are among the only technologies that were not handed down from the military to corporations and finally to consumers; instead, they bubbled up from consumers, indie researchers, and hobbyists. And yet they are now becoming as corporate and overengineered as everything else. But that origin is significant. It means we are still the ones shaping what comes next.
What Are We Missing?
What we currently have is a series of question-and-answer exchanges with chatbots, mainly the reasoning models, when what we need is something that has a conversation with us. The model should not just tokenize my prompt and spit out the next likely token. You know that one friend who doesn't share your tastes but deeply shares your dislikes? When the topic turns to one of those dislikes, the two of you practically become a single person talking. That shared wavelength? That's what real conversation feels like.
And that led me to this question: What’s the next step before AGI? What should “AI” actually feel like?
Sparks That Lit the Fire
What led to this vision of more cooperative AI? Three different experiences shaped my thinking:
Jordan Rudess x Rick Beato — watching that interview reminded me that the best artists are also technologists: Jordan Rudess, Hans Zimmer, and others. Rudess doesn't just play the instrument; he reinvents it through intense improvisation. And, more importantly, there is his latest work with MIT.
Moshi — a talking AI you can interrupt mid-sentence, and it adapts. That feels natural. That feels human.
Karl Jenkins' Adiemus — it wasn't just music; it was a narrative, a rhythm, collaboration, and unity. It gave me a vision of AI not as a tool, but as an orchestra. I had it on loop all day while writing this, and it transported me.
Together, these three experiences point toward a new paradigm for AI: one of collaboration rather than automation, of enhancement rather than replacement.
A Vision of Collaborative, Human-Centered AI
AI-augmented systems, such as the one Jordan Rudess developed with MIT, exemplify how generative AI can function as a real-time collaborative tool: listening and responding simultaneously in a way that mimics human-to-human interaction.
Rudess's prototype demonstrates this paradigm shift away from automation and towards symbiotic creativity. The AI-augmented instrument enables real-time musical co-creation by responding to Rudess's playing while also contributing its own musical ideas. Here, the human-centered AI does not replace Rudess; it becomes an extension of his creativity. The model is a GPT-2-based transformer fine-tuned on his personal style, creating what the researchers call "symbiotic virtuosity": the performer trains the model, interacts with it, and feeds the new musical explorations back into the system. Rudess himself describes the relationship as mentorship:
“I’m training this young program, [and] I feel responsible as a mentor to my ‘young student’”
Similarly, Moshi AI's conversational framework shows how AI can enhance rather than supplant human interaction. Unlike conventional chatbots, Moshi's full-duplex capability allows it to listen and respond at the same time, creating a natural conversational flow that adapts to the user's pace and style. Its low latency (around 200 ms) and its ability to handle interruptions let users guide and shape the interaction.
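To make the full-duplex idea concrete, here is a purely illustrative sketch (it does not use Moshi's actual API) of two concurrent tasks: one keeps "listening" while the other "speaks", and an interruption cuts the reply off mid-sentence.

```python
# Illustrative only: a toy full-duplex loop, not Moshi's real interface.
import asyncio

async def speak(text: str, interrupted: asyncio.Event) -> None:
    # Stream the reply word by word, checking for an interruption each step.
    for word in text.split():
        if interrupted.is_set():
            print("[assistant stops mid-sentence]")
            return
        print(f"assistant: {word}")
        await asyncio.sleep(0.2)  # simulate audio playback time

async def listen(interrupted: asyncio.Event) -> None:
    # Simulate the user cutting in after half a second.
    await asyncio.sleep(0.5)
    print("user: wait, actually...")
    interrupted.set()

async def main() -> None:
    interrupted = asyncio.Event()
    # Listening and speaking run concurrently, like Moshi's two audio streams.
    await asyncio.gather(
        speak("here is a long answer that keeps going and going", interrupted),
        listen(interrupted),
    )

asyncio.run(main())
```

The point is architectural: because listening never stops while speaking, the user's interruption can reshape the response instead of having to wait for it to finish.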
These collaborative approaches keep the human at the center of the creative process, with AI becoming an extension that enhances human creativity and expression.
The technical breakthroughs behind these systems—from multimodal processing by Moshi to anticipatory generation methods by the MIT team—point toward a future where AI exists as a responsive partner, not a passive tool.
The Technical Foundation of Human-Centered AI
What makes these co-creative AI systems different from typical chatbots? The MIT/Rudess collaboration uses a GPT-2-derived Music Transformer architecture trained on more than 150,000 MIDI files and then fine-tuned exclusively on Rudess's personal style. That personalization ensures the AI produces output that conforms to the artist's musical persona while retaining the flexibility to introduce new variations.
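As a rough sketch of what "fine-tuned on a personal style" means in practice, the snippet below trains a small GPT-2-style model on tokenized sequences with a standard next-token objective. The vocabulary size, model dimensions, and random stand-in "corpus" are my own assumptions for illustration; this is not the MIT team's actual pipeline.

```python
# A minimal, hypothetical fine-tuning loop on tokenized MIDI-like sequences.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import GPT2Config, GPT2LMHeadModel

VOCAB_SIZE = 512  # assumed size of a MIDI event vocabulary
config = GPT2Config(vocab_size=VOCAB_SIZE, n_layer=4, n_head=4, n_embd=256)
model = GPT2LMHeadModel(config)

# Stand-in corpus: random "event token" sequences in place of real MIDI data.
sequences = torch.randint(0, VOCAB_SIZE, (64, 128))
loader = DataLoader(TensorDataset(sequences), batch_size=8, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for (batch,) in loader:
    # Next-token prediction: the model learns to continue the artist's phrases.
    out = model(input_ids=batch, labels=batch)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```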
The system generates near-instantaneous responses using methods such as post-training quantization, bringing model latency down from 850 ms to 120 ms, which is critical for live musical collaboration. It also runs parallel inference threads to produce multiple possible musical futures, so it can accommodate variations in the performer's input without breaking the flow.
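Post-training quantization itself is a standard technique; a minimal PyTorch sketch looks something like the following. The toy model and the int8 dynamic-quantization recipe are my assumptions for illustration, not the exact optimization the MIT team applied.

```python
# Sketch of post-training dynamic quantization: weights go to int8 after
# training, shrinking the model and speeding up CPU inference.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 512),
)

quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 256)
with torch.no_grad():
    print(quantized(x).shape)  # same interface, lighter arithmetic
```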
Moshi AI likewise prioritizes responsiveness through its Helium model, a 7-billion-parameter multimodal architecture built natively for speech. Its Mimi neural audio codec streams audio at 24 kHz while compressing it to a mere 1.1 kbps, allowing the system to run two audio streams concurrently: one for listening and one for speaking.
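Some quick, back-of-the-envelope math shows how aggressive that compression is (the 16-bit mono PCM baseline is my assumption for comparison, not a published Moshi figure):

```python
# Rough compression ratio implied by the quoted Mimi codec numbers.
sample_rate_hz = 24_000
bits_per_sample = 16  # assumed raw PCM depth for the comparison
raw_kbps = sample_rate_hz * bits_per_sample / 1000  # 384 kbps uncompressed
codec_kbps = 1.1
print(f"raw: {raw_kbps:.0f} kbps, codec: {codec_kbps} kbps, "
      f"ratio: {raw_kbps / codec_kbps:.0f}x")  # roughly 350x smaller
```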
These technical breakthroughs produce human-centered AI systems that do not merely execute instructions but fully engage in the creative process, responding to the user's pace and style and occasionally volunteering ideas of their own to enrich the partnership.
The Purpose
Karl Jenkins' Adiemus played a key role in helping me realise the purpose of AI and what I believe in. I would highly, highly recommend listening to it first and then reading the rest of this. The first time I heard the song, it triggered a scene in my mind: walking through a paddy field, my hand brushing the grain, knowing that a broken village needs rebuilding. Then, as the music builds, a montage of the villagers starting to believe in themselves and rebuilding their village. But anyway, there is this particular chorus that builds up the entire song, and it made me realise the true power of creation: art, music, science, in fact anything worthwhile requires individual people working together to achieve greatness, like the cogs in a watch working together to tick consistently and relentlessly.
We and our computers have to beat in harmony: not only faster, but smarter. The future of AI isn't about replacing human productivity; it's about making systems that know us well enough to be genuine partners in our intellectual and creative pursuits. Just as the villagers in my Adiemus-inspired "vision" work together to create something bigger than any single person could alone, humans and AI can join forces to break new ground that neither could reach on their own.
I didn't mean to get this philosophical, and I blame Adiemus entirely. But maybe this philosophical shift is exactly what we need: to step back from thinking of AI as a mere tool and begin to view it as a possible partner in our human quest.
Moving Forward: The Next Frontier
Standing at this juncture of AI development, we have the chance to determine what comes next. Examples like Jordan Rudess's symbiotic virtuosity and Moshi's conversational fluidity hint at a future in which AI systems don't merely respond to our queries but actually work alongside us: anticipating our intentions, evolving to fit our style, and sometimes surprising us with novel possibilities.
This vision requires us to rethink how we design and interact with AI systems. It demands responsiveness, adaptability, and the capacity for genuine collaboration, not just accuracy and efficiency. We need systems that support our flow state instead of disrupting it, that join our train of thought instead of derailing it.
The future beyond LLMs isn't merely more advanced models or broader context windows; it's AI systems that actually understand what it means to work alongside humans. And in that work, we might find new horizons that neither humans nor machines could reach alone.