Teaching Tinny When to Talk
How a fine-tuned classifier drives Tinny’s conversation instincts
In a recent post (“Tinny’s Growing Up”) we shared how Continua’s been evolving. It only took a few months for us to reach two new milestones: over 60,000 users have invited Tinny into their conversations and we’ve processed over 15 million messages!
But as Tinny’s adoption increased, we learned a surprising lesson. The hardest part of building a multiplayer AI agent isn’t teaching it what to say. It’s teaching it when to speak and when to stay quiet.
To speak or not to speak
In a group chat, interacting with Tinny should feel as natural as messaging anyone else. Users shouldn’t need to tag Tinny or send follow-ups to be heard.
The problem is, group chats are messy. It is not always clear how a human should behave in certain contexts, let alone an AI chatbot.
For instance, should Tinny respond when a message is open-ended? What if one user asks a question but another tells Tinny not to respond? Or, a common scenario: should Tinny send a message and interrupt when other people are talking?
Initially, we tried solving this by adding instructions to Continua’s prompt, but we quickly realized that prompting on its own wouldn’t work. With so much variation in chats, our prompt changes made Tinny’s behavior feel arbitrary and unpredictable. We had to try another way!
The Silence Detector
We decided our next step would be to give Tinny a way to check incoming messages and decide whether to stay silent. We trained a lightweight classifier we called the Silence Detector by fine-tuning Gemini 2.5 on anonymized, LLM-labeled group chat conversations. The model produced a score from 1 to 5 and was invoked whenever a new message came through.
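Concretely, the gating logic amounted to collapsing that score into a binary decision. Here is a minimal sketch; the threshold value and function names are ours for illustration, and the score itself would come from the fine-tuned model:

```python
# Illustrative sketch of the Silence Detector gate. The 1-5 score would
# come from the fine-tuned model; the threshold here is an assumption.
SILENCE_THRESHOLD = 3

def should_stay_silent(score: int) -> bool:
    """Collapse the 1-5 silence score into a binary stay-silent decision."""
    if not 1 <= score <= 5:
        raise ValueError(f"silence score out of range: {score}")
    return score >= SILENCE_THRESHOLD
```

Note that the threshold itself is a calibration choice, which foreshadows one of the problems we ran into below.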
Compared to prompt-based approaches, the improvement was clear: Tinny interrupted conversations far less and respected group chat dynamics better.
But once the system went into production, a new problem surfaced.
Users began reporting that Continua was ignoring them.
In other words, we had solved the interruption problem by creating a responsiveness problem.
To quantify the gap, we manually labelled a larger ground-truth dataset and ran a new benchmark. This time, we compared our Silence Detector against zero-shot predictions from Gemini 3, prompted to classify messages as reply or do_not_reply.
Although our silence model was fast (a core requirement), it couldn’t come close to the performance of the larger models. And its response recall confirmed our suspicion: Tinny was ignoring nearly half of valid user messages!
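For reference, response recall here is just the fraction of messages that deserved a reply that the model actually replied to. A minimal version, with our own naming:

```python
def response_recall(should_reply: list, did_reply: list) -> float:
    """Of the messages labelled as deserving a reply, what fraction
    did the model actually reply to? (True = reply in both lists.)"""
    replied = [pred for label, pred in zip(should_reply, did_reply) if label]
    return sum(replied) / len(replied) if replied else 0.0
```

A value near 0.5 on this metric is exactly the "ignoring nearly half of valid messages" failure described above.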
Why was this happening?
In retrospect, the scoring system was unnecessarily complicated. In practice, we only cared about a binary outcome (respond or stay silent), and the extra granularity made the model harder to calibrate and reason about. For instance, which scores from 1 to 5 should actually trigger a response?
The training data was another issue. Our original dataset was relatively small and lacked sufficient diversity, which meant the model struggled to generalize to unfamiliar conversation patterns.
Rethinking the Problem
Our first instinct had been to frame the task as silence detection, but after the benchmark, we asked ourselves a simpler question: what does Tinny actually need to understand about a message before deciding how or if to respond?
The answer we landed on was user intent.
We defined a set of intent categories that arise when users engage with Tinny or when they expect the AI to silently observe. We decided to fine-tune a new model to predict these categories and drive Continua’s responses.
This reframing matters for a reason that goes beyond semantics. A binary yes/no reply task collapses many distinct interaction patterns into the same label, limiting the signal the model can learn from. By switching to a multi-class task, the model can learn to recognize distinct patterns of interaction, and the final decision is more interpretable.
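As a sketch of the structure, the fine-grained intent label determines the top-level respond/observe decision. The category names below are invented for illustration; we are not describing the real taxonomy here:

```python
from enum import Enum

class Intent(Enum):
    # Hypothetical intent categories, for illustration only.
    DIRECT_QUESTION = "direct_question"
    FOLLOW_UP = "follow_up"
    EXPLICIT_OPT_OUT = "explicit_opt_out"
    HUMAN_TO_HUMAN = "human_to_human"

# The binary respond/observe decision is derived from the intent label,
# so the reason for each decision stays inspectable.
RESPOND_INTENTS = {Intent.DIRECT_QUESTION, Intent.FOLLOW_UP}

def should_respond(intent: Intent) -> bool:
    return intent in RESPOND_INTENTS
```

With this shape, a wrong decision can be traced back to a specific misclassified intent rather than an opaque score.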
The Intent Classifier
Before we could train the model, we had to gather data from several sources.
Human-annotated group chat data, randomly sampled from anonymized product usage. This gave us realistic examples of how people interact with Continua in the wild.
Synthetic group conversations, generated to target specific scenarios while varying topic, tone, participant count, and complexity. This helped us fill gaps that organic data alone couldn’t cover.
Human-to-human dialogue data, conversations with no AI present at all. This was critical in teaching Tinny when not to interrupt.
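In rough terms, assembling the training set from these three sources looked like the sketch below. Tagging each example with its origin (our own convention here, not a published pipeline) makes it easy to track per-source performance during evaluation:

```python
import random

def build_training_mix(human_annotated, synthetic, human_to_human, seed=0):
    """Merge the three data sources into one shuffled training set,
    tagging each example with where it came from."""
    examples = (
        [{"source": "human_annotated", **ex} for ex in human_annotated]
        + [{"source": "synthetic", **ex} for ex in synthetic]
        + [{"source": "human_to_human", **ex} for ex in human_to_human]
    )
    random.Random(seed).shuffle(examples)  # seeded shuffle for reproducibility
    return examples
```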
With our data in hand, we fine-tuned another Gemini model and called it the Intent Classifier.
The results validated the new approach. The Intent Classifier tops the leaderboard with a level-1 accuracy of 88.5% (respond vs. do not respond) and a level-2 accuracy of 72.4% (our specific intent categories), while keeping average latency under 1.5 seconds.
The zero-shot models told an interesting story. Gemini 3.0 Pro has the strongest respond precision across both L1 and L2, and does well on do-not-respond too — but its average latency of 9.38 seconds rules it out for production use. Gemini 3.0 Flash follows a similar pattern: competitive accuracy with lower but still too-high latency.
ChatGPT 5.2 is the outlier on speed: latency under 1.15 seconds puts it in the same tier as our fine-tune. The tradeoff is noticeably worse performance, as it would let more unwanted responses through. It’s also more expensive to run at scale. That said, we’re curious what a fine-tuned version of GPT could look like.
For now, our fine-tuned classifier hits the right balance: best overall accuracy, best precision, and latency fast enough to keep up with a live conversation.
Where we go next
Since we deployed the intent classifier, reports about Tinny’s response behavior have dropped significantly. With the engage/observe decision now more reliable, we’ve been able to shift attention to the next set of system improvements.
Cost and Latency Optimization
Now that we can reliably classify intent before generating a response, we can potentially use that classification as a router for our generation model. The system can scale its effort based on the incoming message, which has real implications for cost and response quality.
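A minimal sketch of what such a router could look like. The intent names and model tiers below are placeholders, not a description of our production setup:

```python
# Hypothetical intent-based router: a cheap model handles lightweight
# intents, a larger model handles anything open-ended. All names are
# placeholders for illustration.
LIGHTWEIGHT_INTENTS = {"acknowledgement", "simple_reaction"}

def pick_generation_model(intent: str) -> str:
    """Route the message to a generation tier based on classified intent."""
    if intent in LIGHTWEIGHT_INTENTS:
        return "small-fast-model"
    return "large-capable-model"
```

The appeal is that the classifier already runs on every message, so routing comes essentially for free.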
Multi-Agent Coordination
Beyond Continua, we expect that the future of AI will be a world where many humans interact with multiple specialized agents. Models like our Intent Classifier become the routing layer. How we scale this approach (whether each agent needs its own classifier, or a single model can coordinate across agents) is a challenge we’re actively working on, and we hope to share findings soon.
Ready to experience a personal AI in your group chat? Send a message to Continua AI!