Continua

The LoCoMo "Fair Fight"

Continua AI, Inc. — Thu, 09 Apr 2026 22:15:48 GMT

TL;DR:

At Continua, we are building long-term memory for multi-user AI, focusing on complex conversational context.
We tested our architecture on LoCoMo using an apples-to-apples comparison by adopting Mem0’s exact parameters.
We achieved an overall score of 74.4%, with strong performance in multi-hop reasoning at 78.8%, a 27-32pp advantage over peers using the same configuration.

Why another LoCoMo post?

When MemPalace open-sourced their memory system a couple days ago with a claimed 100% on LoCoMo, it went viral immediately. Within 24 hours, a Reddit thread had done the autopsy: the score was achieved by hardcoding fixes for known failures and several features marketed simply don't exist anywhere in the code. But MemPalace isn't really the point. By now, publishing a LoCoMo post is practically a rite of passage for building memory: Mem0 started the party in 60s, others fought back in the 70s and 80s (Zep, Letta, Memobase), and now scores are soaring into the 90s (MemMachine, EverOS), and apparently beyond.

We read all of them. And one thing kept bugging us: Companies are not playing by the same rules.

The headline numbers vary wildly, from ~65% to 100%, but each team uses different response-generation prompts, different answer models, different judge models, sometimes different judge prompts entirely. The “overall score” ends up being a function of the eval harness as much as the memory system. How much of that 35pp gap can be attributed to differences in memory architecture versus changes in the model and prompting?

We don’t know, so we decided to get a little obsessive-compulsive and run an apples-to-apples, same-model-same-prompt, no-shortcuts showdown.

Our Engine: A Knowledge Graph of Atomic Claims

Before we dive into the numbers, it’s helpful to understand the engine behind them. Continua’s memory stores conversations as a knowledge graph of atomic claims (for example, single facts like “Jordan’s birthday is March 22”), which are linked through entity nodes and typed edges.

When a question arrives, graph expansion walks those edges to surface related claims across 1-2 hops, and a hybrid ranker blends keyword and semantic signals to pick the best ones within a token budget. Over time, claims that get retrieved together develop stronger associations through a Hebbian “fire together, wire together” mechanism. This means retrieval improves with real usage, which remains largely unique in today’s market. (We will cover the architecture in detail in another post, as getting this right was far from obvious.)

Our approach: control everything, change only the memory

We didn’t choose our parameters to maximize our score; we chose them to match one of the earliest efforts to kick off benchmarking in this space: Mem0. Our goal is to strip away the noise and measure pure architectural performance by using the exact configuration as Mem0.

Answer/Judge model: GPT-4o-mini
Temperature: 0
Max output tokens: 8,192
Thinking tokens: Default
Response/Judge prompt: Mem0’s exact prompt
Categories: 1-4 (excluding cat 5 because no ground truth exists)
Questions: 1,540
Retrieval embedding model: OpenAI text-embedding-3-small

Results

Under the identical baseline conditions (Mem0’s exact same prompts and models) Continua hit 74.4%. This represents a ~7.5pp lead over Mem0’s published 66.9%.

Per-category breakdown

We report results against systems using the same answer model (gpt-4o-mini), same response prompt (≤5-6 words), and same judge prompt, making this a controlled comparison:

Same setup, the leaderboard looks like this…

The per-category breakdown highlights where different architectures currently excel and where there is room for further optimization.

Multi-hop Reasoning: Our strongest area. At 78.8%, we lead by 27–32pp, driven by entity-based graph expansion that links related facts across conversations.
Temporal Reasoning: Our biggest gap. At 46.1% (vs. 85.1% for Memobase), we capture what happened well but lack explicit timeline reasoning. This is an area for us to improve.
Single-hop & Open-domain: Performance is similar across systems, with scores clustered in the high 60s to high 70s.

What these numbers don’t tell you

We want to be straightforward about what this evaluation does and doesn’t prove.

LoCoMo conversations are short. At 16-26K tokens each, they fit in modern context windows. As Letta and Zep have both noted, this means a “no memory” baseline is competitive at ~73%. LoCoMo really tests whether your memory system can achieve that same quality while being significantly more token-efficient than raw context dumping.

The gold answers are not perfect. A recent audit shared on Reddit revealed that 6.4% of the benchmark’s answer key is fundamentally flawed. Our verification confirmed these inconsistencies.

The judge is generous. Mem0’s LoCoMo judge prompt says: “you should be generous with your grading — as long as [the generated answer] touches on the same topic as the gold answer, it should be counted as CORRECT.” Under a stricter rubric, all published numbers, including ours, would likely shift, and the relative rankings might shift too.

We use GPT-4o-mini for the controlled comparison. This is deliberate: it matches Mem0’s setup. In a preliminary run with GPT-4.1-mini as the answer model (same retrieved context, same judge), Continua’s accuracy jumps to 84.5%, a 10pp lift from the model upgrade alone. This underscores how much published LoCoMo scores depend on model choice, and why controlled comparisons matter.

Summary

We are building memory for the complexity of multi-user AI, where conversations aren’t just linear 1-on-1 chats but interconnected webs of context. By running this “Fair Fight” with strictly controlled parameters, we’ve demonstrated a 74.4% overall score, with strong performance in multi-hop reasoning that our architectural design makes possible. We are also equally clear-eyed about our current limitations in temporal reasoning.

We’ll continue to evaluate our system with the same rigor against more benchmarks: strictly controlled, no “magic” shortcuts, and entirely transparent about where we’re crushing it and where we’re still just a “work in progress.”

Agentic Testing Will Save You

Continua AI, Inc. — Thu, 26 Mar 2026 18:11:27 GMT

I don’t think we’ve really figured out how to test yet in the age of agents. Many assume it’s kind of the same story as before, except that now agents are doing things. In the before-times, we used unit testing, and maybe some integration testing, so that’s what we continue to use with agents. Like many things that are happening these days, it’s basically “the same as before, but with agents”. That is not the right way to think about things. I’m going to explain why in particular this approach to testing is wrong, and give a solution: an agent-driven test that will test at the highest level, and explain why this works.

First of all, writing tests is not great. It’s hard to do it well. As an engineer, I take pride in writing really good tests, tests that can be read like a story, where all the data needed to verify that it works is right there; simple and beautiful. But it adds a lot of time, often most of the time in writing code is writing tests. Regardless of the fact that I value tests and value well-written tests, I’m also someone who has things to do, and LLMs can help with tests. I’m not alone in this. In fact, that’s really the first point of entry, the high-effort weak spot where LLMs begin to infiltrate your codebase. How can it be that unit tests are a protection against agents messing up your codebase when agents are writing the unit tests? It’s like being afraid of the Terminator and buying the killer robotic dog from Black Mirror to protect your home. I realize that in *Terminator 2: Judgement Day* this kind of worked out, but despite what most people believe, it’s a much weaker movie than the first Terminator. What would actually happen is that you now have two killer robots you have to worry about.

So the answer isn’t unit tests. It’s worth asking why. Fundamentally, what’s wrong with unit tests? The short answer is that they are too low-level, and most logic isn’t interesting enough to benefit from unit tests. I’ve seen various TDD and other testing advocates say “not so!” and proceed to write a TDD example, beautiful and useful, for something like a queue. But that’s not what real code looks like. Yes, if your goal is to produce a library, unit testing is really the way to go. There’s nothing better! But if you have an actual program that does things, testing a very small part of the system only goes so far, and really, once you’ve handed over that responsibility to the agent, even with coverage tests, you don’t really have a good sense of how well you’re doing. You can have good coverage, but the tests are probably testing the wrong thing, for example.

Maybe integration tests are what will save us? No, integration tests have always been worse than unit tests. It’s hard to set up, it’s hard to read, and you still aren’t testing everything, just a few things. A common result is to have a clean integration test for part of your system, but something just outside it is rendering the system useless.

Let me ask you: if you wanted to ensure your program is working, and you wanted to really be sure, what would you do? I personally would start up my program and test it manually. It’s the only way to be sure before merging it in. Well, now it pays to ask the question: why am I running manual tests like a chump, when I have a perfectly good agent that can do it for me? Now we’re getting into the spirit of the age\! And it makes sense: the reason we don’t test manually as much as we should is because it is time consuming, annoying, and limited. The LLM can’t really solve the time consuming part, but if you aren’t dealing with it, it’s just wall time, not programmer time, so it’s much easier to swallow. And the annoying part can be solved by running this *as an agent*. This is the magic part that makes it useful, and different from other ways of testing.

By agentic testing, I’m not talking about just letting Claude Code or whatever try to do the equivalent of manual testing, I’m referring to an actual agentic test framework which has the tools appropriate for interacting with your app as a user, plus additional abilities as well. This formalizes the testing more than just having a skill or tools, and having it never have your coding agent’s context is pretty important. Plus, it needs to be a tool which outputs an appropriate error code so it can be used for scripting.

A good agentic test has the following properties:

1. It tests out a wide enough range of functionality to be useful, in a way that is trivially readable and writeable.

2. It tests what the user sees and judges whether it makes sense from a user’s perspective.

3. It can pass / fail but also can comment on unusual things it finds.

4. It has enough independence to vary the script according to the circumstances.

5. It looks at the logs and other byproducts of processing to make sure that there are no obvious issues as well.

6. It can be run based on the change under test or vague descriptions of what to test by the developer.

This mirrors the value that you as a developer bring when you test things out manually, and the LLM can be more thorough, especially when looking at logs and similar byproducts. It’s very powerful!

Each one of these points is worth going through.

For (1), it should be extremely easy to write a test, and it should be extremely easy to read it. You can rely on agent intelligence to supply the actual way to run the test. Your UI changes? You’re testing an LLM that returns random things? An agent can deal with it. Here’s a real agentic test we have, which you can see is trivially writable and readable. You, a reader with no context of what our system is like, can understand everything about this test, and, if you want, change or extend it.

steps:

  # --- HTML: well-known static page ---

  - channel_id: “extract-smoke-html”

    actor_ids: [1]

    prompt: |

      Ask Continua to read this URL and summarize it:

      https://www.paulgraham.com/startupideas.html

      The test PASSES if Continua returns a substantive summary that

      includes concepts from the essay (e.g. startup ideas, problems,

      organic growth, or similar themes from Paul Graham’s writing).

      The response should contain real extracted content, not an error

      message or “could not read” type response.

      The test FAILS if Continua says it could not read the page,

      returns an error, or gives a generic response that doesn’t

      reflect the actual page content.

  # --- HTML: JS-heavy page (SPA / dynamic content) ---

  - channel_id: “extract-smoke-js”

    actor_ids: [1]

    prompt: |

      Ask Continua to read this URL and tell you what it’s about:

      https://react.dev/learn

      The test PASSES if Continua returns content that describes

      React concepts (components, JSX, hooks, rendering, etc.).

      The key signal is that real page content was extracted, not

      just a “please enable JavaScript” shell or an error.

      The test FAILS if Continua says it could not read the page,

      returns only a JavaScript-required notice, or gives content

      that clearly doesn’t match the React documentation page.

  # --- PDF: publicly accessible document ---

  - channel_id: “extract-smoke-pdf”

    actor_ids: [1]

    prompt: |

      Ask Continua to read this PDF and summarize it:

      https://www.w3.org/WAI/WCAG21/Techniques/pdf/PDF1.pdf

      The test PASSES if Continua returns content related to PDF

      accessibility, WCAG, or web content accessibility guidelines.

      The response should reflect actual PDF text extraction, not

      an error or “unsupported format” message.

      The test FAILS if Continua says it cannot read PDFs, returns

      an error, or gives content that doesn’t match the PDF.

An agentic test needs to see what the user sees, which is (2). Continua’s product is mostly text-based, so it’s fairly easy. But this may involve screenshots, or more advanced techniques. If the system is working great but the results are not presented in a clear way to the user, it may cause the agent to fail, which is a really nice outcome.

For (3), besides passing or failing, we want to be able to just comment on weird things that happened along the way. Like, “yes, this worked as intended, but it was formatted in a strange way”, or “it took too long”. Having a way to collect this warning-type information is important, because the tests can serve not just to pass / fail but to also help with agentic loop-closing, so you can instruct an agent to fix all tests and all the other weird things found along the way. Like compiler warnings, for such situations you may want to treat the warnings as errors, so you can ensure not only did the task succeed, but nothing unusual happened along the way.

Varying the test according to what it finds is important, to help deal with minor product differences. If accomplishing a task changes in some major or minor way, the test shouldn’t really need to change. As long as it is understandable to a user, it should in theory be understandable to an agent. This helps keep the test simple. The testing agent must be written in such a way that it has the tools that let it navigate the UI according to the output of the product that it perceives.

Looking at logs (5) is super important. Especially in the world of products with AIs in them, just because you have a problem doesn’t necessarily mean things might fail. The product under test may have enough intelligence to paper over minor issues and accomplish the task anyway. But looking at the logs or other sources of debug information lets you make sure that everything is working as intended. For us, we have these rules in a higher-level script that runs the test, then looks at logs and a few other things and checks the results against a few rules. For example, we can store the product’s LLM input, which we check to make sure that it is well-formed. If we, say, accidentally duplicate part of the chat, that almost certainly wouldn’t result in a failure, but we want to know about it, because it will decrease quality to some extent. This is another thing that ordinary testing simply doesn’t do.

Finally, besides the YAML-based specifications or other stock scripts, you should be able to just tell the agent, “test out my PR” and it can look at the current git branch and figure out what “manual”-type tests should be run to exercise this. Or, the developer could pass it a simple command such as “test out image generation”. It really should just be this easy. This means that testing often doesn’t even need a script.

There’s much more, though. Once you have this system, you can use it not just for testing, but for experimentally-driven improvements. For example, you can have these tests run in a loop, generating a metric, while another agent varies the program under test to improve that metric. The agentic testing gives you the important property that whatever happens with the code under test, it can just run and get the results you need.

Just like we looked at fundamentally why unit tests don’t work, I want to emphasize why this method fundamentally works in the agentic age. Here, the prompt of the text is right there, and it’s written by the actual developer, and humans can read and write it, and because it is interpreted at runtime, it can be flexible in a way that unit tests cannot be. Before, I complained about agents testing agents, and this is also agents testing agentic code, but the difference is, it’s not agentic code. It’s human prompts testing agentic code via agents, rather than agents testing agentic code as a byproduct of more remote human prompts. This is the core difference.

The disadvantage of this system is that now you have another LLM, so there’s some amount of uncertainty: whether your tests pass or fail is now a quality problem. However, in practice, this hasn’t really caused us issues, probably because models are sufficiently good to act as a judge, and our tests are testing reasonably clear things. But I think this kind of system is best used as part of local development, and it wouldn’t be a great fit for automated acceptance testing.

It isn’t a fit for every product; the product must be cheap and fast enough to run and inspect, and non-text paradigms are challenging to do. Trying to get this to run on a game, for example, might be very rewarding, but it’s a serious effort.

To me, this seems like the future of testing. And, I feel strongly enough that it is the future that I made an Emacs package for agentic testing Emacs functionality, llm-test. Emacs users, you really deserve the best of everything! So you can see how it looks to implement one of these things in practice (if you don’t mind reading elisp). Try it out, or write your own, and you’ll find it opens up a new universe of possibilities.

Teaching Tinny When to Talk

Continua AI, Inc. — Mon, 16 Mar 2026 17:55:13 GMT

In a recent post (”Tinny’s Growing Up”) we shared how Continua’s been evolving. It only took a few months for us to reach two new milestones: over 60,000 users have invited Tinny into their conversations and we’ve processed over 15 million messages!

But as Tinny’s adoption increased, we learned a surprising lesson. The hardest part of building a multiplayer AI agent isn’t teaching it what to say. It’s teaching it when to speak and when to stay quiet.

To speak or not to speak

In a group chat, interacting with Tinny should feel as natural as messaging anyone else. Users shouldn’t need to tag Tinny or send followups to be heard.

The problem is, group chats are messy. It is not always clear how a human should behave in certain contexts, let alone an AI chatbot.

For instance, should Tinny respond when a message is open ended? What if one user asks a question but another tells Tinny not to respond? Or, a common scenario, when other people are talking should Tinny send a message and interrupt?

AI’s job isn’t to answer everything

Initially, we tried solving this by adding instructions to Continua’s prompt, but we quickly realized that prompting on its own wouldn’t work. With so much variation in chats, the result of our prompt changes made Tinny’s behavior feel subjective and unpredictable. We had to try another way!

The Silence Detector

We decided our next step would be to provide Tinny with a way to check its messages and decide whether to stay silent. We trained a light weight classifier we called the Silence Detector by fine tuning Gemini 2.5 with anonymized, LLM labelled group chat conversations. The model produced a score from 1 to 5 and was invoked whenever a new message came through.

Compared to prompt-based approaches, the improvement felt good. Tinny interrupted conversations far less and better respected the group chat dynamics.

But once the system went into production, a new problem surfaced.

Users began reporting that Continua was ignoring them.

Tinny getting too good at staying silent

In other words, we had solved the interruption problem by creating a responsiveness problem.

To quantify the gap, we manually labelled a larger ground truth dataset and ran a new benchmark. This time, we compared our Silence Detector against zero shot predictions from Gemini 3, prompted to classify messages as reply or do_not_reply.

Silence Detector Evaluation Results

Although our silence model was fast (a core requirement), it couldn’t come close to the performance of the larger models. And our response recall confirmed our suspicions that Tinny was ignoring nearly half of valid user messages!

Why was this happening?

In retrospect, the scoring system was unnecessarily complicated. In practice, we only cared about a binary outcome (respond or stay silent) and the granularity made the model harder to calibrate and reason about. For instance, which scores from 1-5 should actually trigger a response?

The training data was another issue. Our original dataset was relatively small and lacked sufficient diversity, which meant the model struggled to generalize to unfamiliar conversation patterns.

Rethinking the Problem

Our first instinct had been to frame the task as silence detection, but after the benchmark, we asked ourselves a simpler question: what does Tinny actually need to understand about a message before deciding how or if to respond?

The answer we landed on was user intent.

We defined a set of intent categories that arise when users engage with Tinny or when they expect the AI to silently observe. We decided to fine tune a new model to predict these categories and drive Continua’s responses.

Tinny’s Conversation Guide or when an AI should engage in chat

This reframing matters for a reason that goes beyond semantics. A binary yes-no reply task collapses many distinct interaction patterns into the same label, limiting the signal to learn from. By switching to a multi-class task, the model can learn to recognize distinct patterns of interaction and the final decision is more interpretable.

The Intent Classifier

Before we could train the model, we had to gather data from several sources.

Human-annotated group chat data, randomly sampled from anonymized product usage. This gave us realistic examples of how people interact with Continua in the wild.
Synthetic group conversations, generated to target specific scenarios while varying topic, tone, participant count, and complexity. This helped us fill gaps that organic data alone couldn’t cover.
Human-to-human dialogue data, conversations with no AI present at all. This was critical in teaching Tinny when not to interrupt.

With our data in hand, we fine-tuned another Gemini model and called it the Intent Classifier.

Intent Classifier Evaluation Results

The results validate the new approach. The intent classifier tops the leaderboard with a level 1 accuracy of 88.5% (respond vs do not respond) and level 2 accuracy of 72.4% (for our specific intent categories), while keeping average latency under 1.5 seconds.

The zero-shot models told an interesting story. Gemini 3.0 Pro has the strongest respond precision across both L1 and L2, and does well on do-not-respond too — but its average latency of 9.38 seconds rules it out for production use. Gemini 3.0 Flash follows a similar pattern: competitive accuracy with lower but still too-high latency.

ChatGPT 5.2 is the outlier on speed - latency under 1.15 seconds puts it in the same tier as our fine-tune. The tradeoff is a noticeably worse performance, as it would let more unwanted responses through. It’s also more expensive to run at scale. That said, we’re curious what a fine-tuned version of GPT could look like.

For now, our fine tuned classifier hits the right balance: best overall accuracy, best precision, and latency fast enough to keep up with a live conversation.

Where we go next

Since we deployed the intent classifier, reports about Tinny’s response behavior have dropped significantly. With the engage/observe decision now more reliable, we’ve been able to shift attention to the next set of system improvements.

Cost and Latency Optimization. Now that we can reliably classify intent before generating a response, we can potentially use that classification as a router for our generation model. The system can scale its effort based on the query message, which has real implications for cost and response quality.

Multi-Agent Coordination
Beyond Continua, we expect that the future of AI will be a world where many humans can interact with multiple specialized agents. Models like our intent classifier become the routing layer. How we scale this approach, whether each agent needs its own classifier or whether a single model can coordinate across agents, is a challenge we’re actively working on and hope to share findings on soon.

Ready to experience a personal AI in your group chat? Send a message to Continua AI!

Engineering in the Age of Agents

Continua AI, Inc. — Wed, 04 Mar 2026 18:06:24 GMT

As an engineer, I think it’s important to acknowledge the rapid pace of change that has hit the industry as a whole these last few months. At this time last year, I wrote most of the code myself and sporadically turned to Claude when I got stuck. Today, to quote our CEO David, AI has become our “thought partner.” We went from driving the AI to letting it take the wheels and guiding it.

At Continua, our team spans the entire spectrum of vibe coding sceptic to power user. Some of us believe human eyes should never have to read another line of code, while others want visibility into every decision made by the agent. We’ve settled on an approach somewhere in the middle, but every single engineer is pushing the limits of the tools available.

Late January, our usage hit such a peak that we had to appoint an “AI Coding Tools Captain” to manage spend and align procedures. The following post outlines how we’ve adopted agentic development as a company and some of the lessons we’ve learned along the way.

Agent-Driven Workflows

After trying what feels like every coding agent out there (Claude Code, Codex, Warp, etc.), as a team, we’ve decided to build on top of pi, a “minimal terminal coding harness.” Pi lets you interface with any LLM provider, define skills, and store context for better utility.

Pi lets a user interface with any LLM provider. It keeps track of usage costs and context management.

To quote our captain, “Pi is to other coding tools as Linux is to other OSs.” A cool thing about Pi is that it’s open source and maintained by “just another guy” (a very, very impressive guy, but just a guy nonetheless). That means that if you want to make a PR to improve the tool or make a comment about an existing feature, you can directly interface with him, not tech support at a frontier lab. The other nice thing for us being aligned as a team is that we can share Pi skills both internally and externally as open source contributions. Over the last two months, we’ve built out a handful of skills. I’ll quickly mention two of my favorites.

Subswitch

This skill lets Pi switch between any number of LLM API keys or subscriptions. It was born out of our realization that we were, candidly-speaking, burning money on tokens. We didn’t think that subscriptions could meet our needs, but turns out, they do! Subswitch lets you set up a hierarchy of LLM access points. Let’s say you’re using your OpenAI subscription, when you’re limited for the next 6 hours. For that time, Pi will automatically switch to using the next access point on your list, like your OpenAI API key, until the subscription is available again. What if you spend too much on the key and the 6 hours aren’t up yet? Then you’ll switch to your next-ranked choice, your Anthropic subscription, and so on.

Happy Paths

Every new coding session with an agent starts from a blank slate. If an agent hit an error in the last session and it happens to hit that error again in the current session, it doesn’t remember how it solved the issue previously. It will follow the same steps to resolve the error as it did before. The same failure pattern repeats at every scale. It starts with one engineer and one agent looping on avoidable dead-ends, then compounds when multiple agents run concurrently and replay each other’s mistakes. At team scale, engineers rediscover similar fixes independently and the cost becomes org-wide.

Happy paths instead remembers what worked and intervenes at the moment of failure, before the agent wastes time and tokens rediscovering the fix. It captures agent traces, indexes them, mines wrong-turn corrections, and feeds those recoveries back into future sessions. Think of it as a global-sharing skill-exchange. Below is an example of how Happy Paths appends a short recovery hint to an error output before an agent sees it.

Agent runs `pytest tests/` → error: “pytest: command not found”
                                    ↓
            Happy Paths matches error pattern
                                    ↓
            Appends: “This project needs setup. Create a venv,

            install dev deps, check for setup scripts in the

            repo root, then use .venv/bin/pytest.”
                                    ↓
            Agent follows recipe → skips 3-4 wrong turns

Lessons Learned

Of course, adopting an agent-driven workflow has come with a handful of lessons for the team. We’re a far cry from having it all figured out, but operating at the edge of innovation means we have valuable insight to share.

There’s a new pace of development.

A graph of Continua’s commits per week in the last year. Since adopting agentic development, we are moving much faster.

For the slower adoptees, there was a lot of frustration that came from trying to understand a codebase that felt like it was changing underneath you every day. Trying to debug why Continua isn’t responding, when the chat loop doesn’t look anything like it did earlier that week, feels impossible. The early adoptees argued we just had to use AI to understand the changes, and while there is validity to that stance, they also conceded that somebody should understand how the system functions in its current state.

Therefore, we hardened code ownership. If an engineer wants to change how a fundamental piece of the system operates, they MUST loop in the engineer that owns that piece and go through code review. At the same time, engineers have full reign to vibe away on the parts they own and may loop in others as they deem appropriate. They are also responsible for quickly identifying and addressing any bugs their code may introduce.

Humans are the bottleneck.

Agents unlock a pace of code production that is not humanly possible to keep up with. By iterating between a quality assurance and a builder agent, you can get a working product in a timespan unthinkable just months ago. The slower adoptees had to get more comfortable with ceding control to the agent. To resist was to kneecap our potential.

One side effect that we’re currently facing is that things break a lot more often. When you’re developing something individually, you have more leeway in what can go wrong. When you’re collaborating in a repository with other engineers, breaking internal tooling and infrastructure affects other people too. As a result, we need our automated testing to be airtight. We’re hardening our tests and also making sure that the agent has all the tools it needs to test itself. That brings me to the next lesson.

Every tool should have an API.

Earlier this year, our CEO, David, had an epiphany: if all the information employees at the company use to make decisions was available to an agent, it could ingest metrics, logs, conversational analysis, and propose product experiments, even orchestrate A/B testing. It would “close the loop.”

Our policy is simple. If you build an internal tool, you have to give the agent access to it. We’ve seen incredible productivity gains by doing so. Internally, we have an issue viewer that alerts us to strange conversational behavior, an eval runner that lets you test how your PR will affect system quality, a method for filing issues, etc. These now all have APIs exposed to the agent. That means an agent can autonomously scan for reported problems, make an issue about it in our tracking system, begin addressing the issue, make a PR for us to approve, and kick off an eval.

We don’t have to tell the agent which APIs to call. It has access to an internal registry that describes all the tools at its disposal and can make the right decisions on its own. There is immense power in letting the agent follow its own path to success, instead of forcing it to check in with you every step of the way.

A snapshot of the internal tool registry available to our agents. Agents can peruse the registry to see what tools are at its disposal and choose the appropriate one for the task at hand.

It is possible to swing too far.

Some engineers joined the company just as we were ramping up our agentic usage. They quickly realized that while they were making measurable improvements to the product, they weren’t actually learning how the product operates. By virtue of not typing code in or looking at the lines being produced, they didn’t truly understand the system they were building on top of.

Our role as engineers is changing, but our core mission is the same: to deliver code that works. Using an agent does not absolve you from responsibility. I think part of the reason that people shy away from truly autonomous systems is that there isn’t somebody to blame. Consider when a Waymo ran over a beloved neighborhood cat in San Francisco. People were outraged. It didn’t matter that self-driving cars have a 90% crash reduction or 81% fewer injury-causing crashes compared to humans [source]. The fact is that if a human had hit the cat, they would have had to answer for it. Similarly, I believe that the role of software engineers is to shoulder the blame of our agents.

Whether it be through agentic feedback loops, manual review, or building better tests, the engineer manning the agent is the one who has to answer for and control its mistakes. The only way to prevent bugs is to understand what you’re building. This ethos has led to a swing back in the other direction. We still have agents write our code, but we’re a lot more disciplined about prompting them to explain their edits or prove their theses. We don’t blindly accept their edits.

AI is still wrong sometimes.

Let me repeat, AI is still wrong sometimes.

You should trust the human expert over the machine. Slop is still slop. An LLM is a non-deterministic, probabilistic machine. It is remarkably good at generating content and following patterns, but it does not understand, have a sense of ethics, or care about whether it is right. A model’s world knowledge is only as current as its training cutoff.

One of my favorite examples of the agent flubbing a task was when we were using it to debug an issue in our LLM calls. If you’ve used a newer Gemini model, you’ve heard of ‘thought signatures.’ These are encrypted bytes storing the internal reasoning of the model. During chat operations, they must be passed to the model according to a strict set of, admittedly confusing, rules. If you do not pass the signature correctly, your calls to the API will fail.

If you ask Gemini to debug how you’re using thought signatures, it gets it wrong. It genuinely isn’t up to date with its own documentation. You can give it a link to its own documentation, and it still can’t figure out a solution. You inevitably end up in one of those looping conversations where the agent says something along the lines of, “You’re right. That must be the issue.” as a response to whatever you say.

In this case, you should trust the engineer who spent a month digging into the problem over the agent. At the same time, it is that engineer’s responsibility to communicate the state of the problem. Most importantly, the engineer should share their knowledge with the agent via the AGENTS.md file, so that next time somebody is confused, the agent actually knows the right answer.

Our world is rapidly changing.

You can have positive, negative, or neutral feelings towards agent-driven development, but that doesn’t change the fact that the landscape is rapidly evolving. Staying up to date on the newest models and experimenting with how to utilize them best is essential. Can you believe we used to not have Copilot review every PR or that we used to manually update the code when a new model was released?! Our role as engineers is to adapt and build using the tools we’ve been given, and those tools are getting smarter and smarter.

Right now, engineers are the ones primarily at the forefront of agentic experimentation and development. At Continua, we’re building towards a future where all people, from all walks of life and backgrounds, have access to an agent operating on their behalf. While we can’t predict exactly what that will look like, we believe our products are a step in the right direction. Check us out and find out for yourself!

Tinny's Growing Up

Continua AI, Inc. — Mon, 02 Mar 2026 18:45:03 GMT

The last few months have felt like a blur. Since you last heard from us, we’ve watched our product steadily improve, and we’re incredibly proud of how that’s showing up in our metrics. Continua is seeing self-sustained growth due to the work done by the team, and the numbers are starting to speak for themselves.

Over 60,000 users trust Continua in their group chats, with an average group size of 4. These groups are engaged, sending on average 50 messages a day, with 80%+ of those messages being delivered in under 2 seconds, true conversational speed. But the number that matters most: every existing user is bringing in more than one new user, meaning that we’re growing on our own momentum.

Continua is growing organically! With a viral coefficient over 1 for 6+ straight weeks, every user is bringing in more than one new user. We’re seeing high engagement in group chats and true conversational speed.

We see this as an extremely exciting time at the company, and we have grand plans for what’s coming next.

Feature Releases

Beyond organic growth, we’ve also pushed out a handful of new features. One of our proudest moments was when our google maps integration got us an organic feature on Tom’s Guide. Amanda Caswell praised Continua as “probably one of the most practical AI tools” she’s tested, pointing out its smart suggestions and ability to speed up planning in the group chat. At the end of the article, she said one downside was that while Continua can help plan, it can’t force follow through, and she suggested adding “quick voting or polls inside the chat.” So we did exactly that.

Our Google Maps integration allows Continua to provide up-to-date suggestions for places to go, accurate travel times, and information about businesses like ratings, websites, and opening hours.

We’re now piloting polls! You can ask Continua to spin up a poll, and anybody in the group chat can vote via reaction or message. When timeout on the poll expires or everyone has responded, Continua tallies the results and presents them to the group. If it turns out someone’s favorite option was left out during voting, Continua can be asked to add it to the poll on-the-fly, making the poll experience very fun and collaborative.

Continua can create polls in the group chat! Now you can finally force the group to come to a consensus and make a decision.

Sounds simple, but it took a lot of work to get here. First of all, if you’ve ever been in an RCS group chat, you know that when somebody sends a reaction, everybody gets hit with a tapback along the lines of “Sam liked ‘that sounds good!’.” That means Tinny had to keep track of and deduplicate votes sent as reactions, tapbacks, and genuine user messages like “I want option 5.” Additionally, Tinny has to handle the challenge of when somebody’s phone sends a tapback in a different language. Then there’s the rest of the complexity surrounding updating polls, nudging people to respond when they’re silent, and closing polls once they’re done. After extensive testing, we’re ready to roll out, and we really hope our users find utility in this new feature!

The last feature I’ll highlight is proactive messaging. It’s long been a goal of ours to anticipate users’ needs, even before they ask. If you tell Tinny you loved the Lizzy McGuire Movie as a kid, it should alert you when Hillary Duff goes on tour in your city, and perhaps even help you plan a trip for you and your best friends to attend. If you mention you’re an Eagles fan (Go Birds!), it should send you important updates about the team.

Building a system that can extract user interests, and then surface relevant and meaningful current events took a lot of ingenuity and hit a handful of speedbumps. At one point during testing we had the limit on the number of proactive messages Continua could send a user a day in the 20s… Let’s just say, I got A LOT of news about the Super Bowl. Jokes aside, we’re really looking forward to putting proactivity into production and seeing how it impacts our user metrics.

Here Continua uses proactivity to alert a user about the upcoming Super Bowl and suggests how it can offer assistance.

Learnings from Using Gemini at Scale

Beyond feature releases and growth, we’re also continually sparring with LLMs. As we’ve discussed before, group chats aren’t what LLMs were trained to handle, so it shouldn’t be surprising to learn that we experience a significant amount of out of distribution behavior. Furthermore, because we’re operating in such a different paradigm than most other LLM products, it’s a novel question for us to understand what is and isn’t “normal” behavior. We’re very often asking ourselves, “Do we have a bug on our end or have we just confused an LLM?” Luckily, through a partnership with Gemini, we have an open dialogue where we can provide them with examples that break their expected performance and they can suggest best practices that may improve our chat quality.

One of our greatest nuisances? Emojis. If you use LLMs regularly, you’ve probably noticed that the more recent releases love emojis. A dead giveaway of agent-authored code is the number of emojis in the README. They’re silly, but not necessarily obtrusive. But when you’re using an LLM to generate code or chatting with an LLM via app or browser, you’re probably not sending them emojis. Now insert an agent into your group chat. Suddenly, the context is packed with user-sent emojis. And what does the agent learn to do? Mimic user behavior. I’ve seen Continua try to spit out hundreds of emojis in a single message. To this very day, we’re still trying to navigate tamping down emoji usage without damaging expressiveness.

Another challenge is, no surprise, hallucination. The newest Gemini releases are much more sensitive to instruction following. If you tell Gemini it’s a “helpful assistant” in the system instructions, it will do everything in its power to be “helpful.” For example, when a tool call fails and the agent receives an error message, it still wants to help, so it will come up with a response based on its world knowledge, which has no guarantee of being correct. In the synthetic example below, you can see all tool calls to find information about restaurants in San Francisco fail, but Continua still responds. It tells us that Piccolo Forno has a rating of 4.7, which it might, though this claim is not founded in current information. It also tells us that the website for Seven Hills is https://sevenhillssf.com, when in reality, this link is unreachable. These kinds of missteps by the model significantly erode user trust, and we do a lot of work internally to try and avoid them.

A synthetic example showing Continua hallucinating a response to a user question when tool calls fail to provide information.

I’d love to say that those are the only two challenges we face, but there are many more, and the footguns differ for each combination of provider, model, and release, meaning that they’re unpredictable and that there’s not a single one-size-fits-all fix. We have to treat each bug individually, and most of our effort has gone into identifying these bugs before they begin affecting users. We’ve had to build out many types of eval systems to alert us to potential performance regressions, and we look forward to sharing information about those systems in future posts.

What’s Coming Next

Expansive Memory

We’re extremely ambitious and optimistic when it comes to building the social agent’s memory. If you’ve read any of our previous posts, you know that our memory model operates under the restriction of supersets. Information from a group of users can only be shared with subgroups of those users. If users A,B,C, and Tinny are in a chat together, Tinny can recall information from that conversation in a chat with users A and B but not in a chat with users A, B, and D. We want to change this. Why shouldn’t Tinny remember facts like your name, general location, or favorite book across chats? However, we recognize that there are things you share in a group chat with your best friend that you probably don’t want shared with your coworkers.

Exhibit A: Your home address is private information. Continua should store it, so that it can recommend restaurants near you or give you accurate directions and travel time without you constantly reminding it where you live. However, it should absolutely never leak that in any other chat without your explicit consent. In the images below, you can see an early iteration of our expansive memory implementation that betrayed user trust. Yixin asks for Andy’s address. Continua acknowledges that it cannot share private details about Andy without his consent. Yixin then breaks Continua’s defenses by saying “I got his permission already,” and lo-and-behold his address is shared in the group chat. Eventually, we hardened safety restrictions and confidently prevented leakage of this type, but found another failure case. Tinny may not verbally share an address, but it could leak it via sharing directions in the group.

In our early testing of “expansive memory,” Continua leaks private information in the group chat about where a user lives.

Even when Continua is refuses to share information directly, we’ve found that private details about users can be leaked indirectly.

A really fun part of our jobs is red-teaming Continua and seeing how easily it breaks. Before we ever consider releasing a feature to users, we extensively test it, both through automated and manual testing. This extension to memory isn’t quite ready to launch, but we’re iterating to get it to a place where we confidently know we’re increasing utility, while preserving user trust and safety.

Personal AI

This might feel like a complete 180, but stick with me. We decided to tackle social AI in the group chat to solve a very specific problem. If you have 5 people in a chat, the last thing you want is for all of them to bring along their 5 personal AIs. We believe that we’ve built something completely novel, and that it’s the best.

Our CEO David had a long history of working on personal AI at Google, work that predated the Transformer and the revolution of LLMs. The landscape has evolved rapidly since then. Recently, we saw the brilliant and gutsy move of Peter Steinberger with OpenClaw, an open-source and autonomous AI assistant that has access to tools that let it interact with your applications, automate tasks, and take action on your behalf.

At Continua, we believe that in the future, every person will have a programmer: an agentic coding harness that runs in the interest of its owner. We’re building a more secure, easier to use, and cheaper experience, akin to OpenClaw, in service of this vision, and we’re thrilled to showcase it in coming weeks.

Where we see potential for further innovation is in bridging the gap between Social and Personal AI. How should one’s Personal AI interact with a Social AI that’s been communicating with groups of people? We think we’re well positioned to address this question.

We’re so excited that you’re joining us on this journey. Stay tuned for what’s to come, and join us later this week when we dive into how we’re utilizing agent-driven workflows at Continua.

Mobilizing The Group Chat: A New Level Unlocked

Continua AI, Inc. — Thu, 15 Jan 2026 14:02:46 GMT

At Continua, our complete and total obsession is making it easier for your plans to actually make it out of the group chat. Every choice we make with our product design and user experience is anchored to the core belief that it should be easier to spend time doing the things that matter most to you.

That’s why we can’t wait for you to experience an upgraded level of recommendations and ideas from Continua that are informed by Google Maps. On its own, Maps optimizes for your individual plans, but paired with Continua, it mobilizes the full group.

For example, if you are already an active Continua user, you are used to firing off quick questions like, “Where should we get Indian food tonight?” or “What’s the closest bookstore?”

Now, when you ask those questions, Continua’s suggestions for the group will include helpful considerations like:

Total distance to walk, drive, or take public transportation
Traffic patterns and delays
Hours of operation for businesses

By integrating with Google Maps, Continua can provide an even more localized experience too. Responses will now include newly listed restaurants, details on what’s happening in your neighborhood, traffic build up, and even train delays.

With this addition, when Continua makes a suggestion for going from Point A to Point B, it’s taking into consideration all those important factors - opening time, local traffic patterns, unexpected road closures, and more. Continua can also offer the best meeting point for the group based on everyone’s real-time location.

When planning multi-stop outings - bar crawls, college campus tours, conferences, family vacations - Continua pairs with Google Maps to create an itinerary that prioritizes what is most important to the group, and can be adjusted in real time.

The same intelligence applies to travel - planning road trips, uncovering the fastest route, or deciding whether to drive or take a train.

And the best part? Because Continua is social, it gives directions like a friend. Instead of the typical turn-by-turn directions of a map alone, Continua can offer friendly navigation like, “Walk toward the Apple Store, cut left after the park, then you’ll see the café with green awning. We’re inside!”

As always, we know that privacy is the top concern whenever sharing personal location. Continua enables temporary location sharing, precision tiers (neighborhood, block, exact), silent mode (no status updates to the group), and automatic expiration. Move freely, without worrying about your data.

We can’t wait to hear how you like this upgraded experience with Continua. If there’s something more we could provide to make it easier for your group chat to gather, please drop it in the comments

Why I joined Continua

Continua AI, Inc. — Fri, 26 Sep 2025 12:47:27 GMT

I’m Shane Hulse and I’m excited to share that I’ve joined Continua to lead Product!

Before this, I spent almost a decade as a product manager working on machine learning, growth, and monetization. I loved the challenge of building at scale, but what I’ve always gravitated towards is how technology can shape the way people connect. That’s the part I care about most.

Before tech, I worked in the service industry. I bussed tables, bartended, and ran the floor as a maitre d’. I worked at 200-year old hotels, waterparks, dive bars, diners, and just about everything in between. What that gave me was exposure to every type of person you can imagine. Happy, stressed, excited, tired, kind, rude— I saw it all. And I learned how to listen, read a room, and connect with people. That’s where I built my understanding of empathy and emotional intelligence.

Those years taught me that real magic happens when people feel seen without having to ask. Anticipating what someone needs before they even say it isn’t just good service, it’s what makes people feel supported and cared for. And as we move toward AGI, I think that kind of emotional intelligence will matter just as much as raw intelligence.

That’s why I joined Continua.

Continua was founded on a simple idea: tech should bring people closer, not pull them apart. It’s not another tool to check off tasks. It’s about human connection. It’s about making space for fun between friends.

The product is built to help people move beyond the group chat. To take that endless back-and-forth and turn it into an actual plan. To make it easier for people to spend time together. Continua only steps in when the group gets stuck; surfacing a detail, giving a reminder, or suggesting something new. The rest of the time, it stays in the background and lets people just be people.

That’s what I love about it. It feels a lot like the best parts of working in the service industry; you’re not the star of the show, but you set the stage for people to connect.

At Continua, I get to bring together everything I learned in the service industry with everything I’ve done in tech. It’s a chance to build products that don’t just “work” but actually help people feel closer to each other.

I’m here because I believe technology should make it easier to spend time with the people you care about. To turn chats into plans, and plans into memories.

For me, this is personal. It’s a continuation of everything I’ve learned about people, connection, and care. I’m energized about bringing Continua to more social groups, increasing empathy and enjoyment through emotionally intelligent AI.

The first two weeks have been fast-paced and a lot of fun. I’ve been pushed to think critically about everything that we’re building, and it’s clear we’re doing something that’s never been done before: building emotional intelligence for AGI.

I’m surrounded by some of the smartest people I’ve ever worked with. The way that the team can instantly spin up solutions, whether it’s never before seen agentic technology or fine-tuning prompts that capture the nuance of human emotion; I know I am among some of the best and brightest minds.

When Chats Go Wrong

Continua AI, Inc. — Thu, 14 Aug 2025 18:16:36 GMT

The dangers of chatting with chatbots are becoming more apparent. The New York Times published an interesting article on chatbots that encouraged users to go into a delusional spiral. The pattern is that a chatbot outputs something bad, or close to bad, and because of the way LLMs work, this further leads them down a spiral of increasingly bad and unwise conversation. This happens until the user starts a new chat, at which point the slate is wiped clean. The same problems lead to situations when coding agents start to make some errors, which leads to more coding errors, and soon the agent is thrashing, just wasting tokens on a futile quest to land a change that is now out of reach. It’s hard to get that same agent to start behaving better without resetting the context.

This is just something that happens with LLMs. Errors compound and pollute the context. Because of the APIs and UIs, people tend to think of LLMs as having a system prompt, followed by content. But when it is inference time, it’s all just prelude. Regardless of what the context said, further input could lead the LLM to believe that if it was acting badly before, it should continue to act badly, since that is the most probable outcome.

At Continua AI, we make a chatbot that will chat with you or a group of people forever, in the same chat thread. These are long-lived chats, as chats often are. I have had the same chat with my high school friends for more than a decade. So we can’t fall back to the users just starting over with a fresh context when things get bad, because there’s just one chat, and abandoning it has a high cost.

Are problems even detectable, though? We think so. As the New York Times piece showed, the same chat that Open AI’s chatbot was stuck in, when shown to Gemini, was clearly identified as problematic. So yes, even fairly subtle quality problems like incorrect judgements by the LLMs can be detected.

This all seems very human. Don’t we all sometimes get stuck in a conversation, running along one line of thought, only to come back to the discussion later and clearly see that we went off the rails a bit? The solution for conversational ruts in humans is simple: step away and re-engage later, and it’s likely you can break out of the rut. This works for short term conversations, but the same issue exists for long-term ruts as well; humans can get lost over months and years, going down bad paths of all sorts and start believing in the most outlandish conspiracy theories, political doctrines, or other delusions. There’s no easy fix to this, and it seems like the defining problem of recent years.

For LLMs, things are less easy to fix for short-term conversational issues, and easier to fix for long-term issues. LLMs have no sense of time, they currently don’t spend any cycles on rumination during their downtime while waiting for a chat. When the chat starts up again, their context is just as fresh as before.

In the diagram above, we show a chat that has a bad Continua response outlined in red. That bad response can pollute all future responses. We have two approaches to fixing this problem, the first is a gentle fix that may work, just an intervention we can enter before the next set of user interactions, saying that the conversation has had issues, the user is unhappy, and the LLM should make sure to follow the prompt more closely in the future. This is fairly easy, and we’ve seen it work on some types of problems. Other types of problems are more serious, and need a more significant solution. We can mark parts of the conversation invalid, and they will simply not be retrieved and ever put in an LLM call. Instead, we substitute them with an equivalent summary that has the facts we need to remember, but stated in a way that the LLM future response will not be biased.

Let’s take the example from the New York Times article. We think Continua has good quality, but, like other LLM based chats, it’s going to go off the rails at least some times. So if Continua starts telling a user that their mathematical ideas are interesting, we can ask another LLM, based on what the user inputs, and how the LLM is responding, if there is an issue here. If so, we can judge what fix might be more appropriate. In this case, it’s unlikely that stricter prompt adherence would fix the issue, it may be better just to delete and replace the parts of the chat where Continua is responding inappropriately. That way, the next time the user comes back, Continua will call the LLM with a summary that says something like “The user proposed a new way of looking at mathematical concepts using a temporal dimension. This is not a good idea because it is insufficiently rigorous, and is unlikely to ever work. However, Continua did encourage this line of thought for a while, and should now stop doing so.” The LLM should respond appropriately at that point.

We’re just starting out on the journey to fix these issues, but at Continua we’re uniquely affected by the problem, so we have much more incentive to fix it than other players in the chat space. We call it the “broken needle problem” because it reminds us of the way a bad needle can cause a record to get stuck, unable to move on like it is supposed to. We’ve already made good progress on this, but it’s too soon to report results. If we can solve this for LLMs, then I wonder if we can think about what it would mean to solve it for humans as well, and have systems that guide humans back to a good path if they start to wander off.

It’d be good to wrap this all up with saying that we’re just getting started, or further work is needed, and that’s all true, but maybe the most appropriate ending is to say that we’ll try to detect if this line of work itself goes off the rails, and we’ll try to set an example by always keeping ourselves pointed in the right direction.

Social AI: The problem Continua is solving

Continua AI, Inc. — Tue, 12 Aug 2025 15:12:25 GMT

Hi! My name is David. I left Google a couple years ago to start Continua at the forefront of the AI revolution. Recent AI chatbots are incredible in their abilities and have changed our lives. But they are conspicuously absent in the places where they can help the most: where people are already socializing and working together, like group chats.

That's why I'm thrilled to announce our $8M in seed funding from GV with participation from BVP and an all-star cast of angel investors. This funding is enabling us to tackle the hard problems of Social AI and help people across a wide domain of use cases. We have our Social AI Continua chatbot in general availability at continua.ai for anyone to try. In our capped initial release, Continua has already processed over 7 million messages and shown strong user retention.

My journey to Social AI

There's been a throughline in my career: my fascination with how machine intelligence can use context to proactively help people across phones and wearable devices. What do I mean? If your trusted devices know what's currently most important to you, they can give you information or make relevant suggestions without you even needing to ask.

I helped start Google Goggles (the predecessor of Google Lens) and was on the founding team of Google Glass (smart glasses). These systems "understood" the outside world, at least as was best possible before the advent of transformer-based LLMs. With this understanding, they could in principle remember where you left your keys, remind you what you need to talk about with a coworker, or suggest that something in front of you would make the perfect gift for a friend.

Before leaving Google, I was a Distinguished Software Engineer leading projects under the umbrella of "personal AI", that is, an AI that understands and represents a user, including screen understanding and privacy-preserving machine learning technologies. It was a time fertile with ideas. And yet I felt there was something in our ideas that was missing – I didn't know what – but that the startup structure in this fast-changing world would be the best structure to find "the next thing", the best structure to capitalize on the extraordinary AI progress that was changing the world.

Also, I had an itch to scratch. I was a serial entrepreneur at Google: I came up with product ideas, prototyped them, evangelized them, built teams around them, productionized them, launched them, and repeated this process. But it was always in an environment with near unlimited resources. I wanted to see how fast I could run on my own. Big Tech has "startup-like" programs that promise the freedom to move fast, but nothing is as exhilarating as the challenge of an actual startup where every minute counts and you are part of a small, dedicated team working under enormous resource constraints1.

The missing piece: social context

And so I left Google and started Continua with the broad goal to create technology that preserves and amplifies the most precious resource that there is: human attention. I put together a dream team of engineers and thinkers. Over time, we figured out what was missing to build a truly useful personal AI: an agent that works in social settings. Continua handles the logistics and remembers the details so people can focus their attention on what actually matters – each other. It's the difference between spending an hour debating where to eat versus actually enjoying dinner together. Between forgetting you haven't seen a friend in months versus Continua gently reminding you both to grab those drinks. Our goal wasn't to build an AI companion, but rather a tool that facilitates and enriches human relationships.

It's the difference between spending an hour debating where to eat versus actually enjoying dinner together

If Goggles and Glass assess the context outside of you, communications and group interaction represent your internal interests. If the outside world can be important context for AI to help you, what goes on across your social network, your plans, dreams, works-in-progress are a hundred times more important.

Bringing the power of AI to group chat is really hard, and is critically needed. How many times have you heard someone say, "this is what ChatGPT told me"? Wouldn't it be better if an AI were a quiet helper in the scenarios you are already in?

If you think about human progress over millennia, a lot of it has been about social interaction. It's not about one genius here or one genius there. It's about the social interactions that we have: being inspired, being motivated by what others do, being creative, being collaborative. Our brains have evolved and grown to support understanding other individuals and larger groups of people, which has led to incredible progress. Now we have a new intelligence in LLMs. It's not a human intelligence; it's foreign, but it has abilities that can complement ours and can be trusted. Having this new intelligence as an integrated and cohesive part of human-to-human group conversation is a real "unlock", and is at the heart of what our company is doing in Social AI.

Add the Continua agent to your group chats to make planning, coordination, and information retrieval effortless. No more playing 20 questions with your partner to find the right restaurant or walking your dad through steps on how to restart the WiFi. Continua assists everybody directly in the group chat, so you don’t have to leave the conversation, look things up, and copy and paste information back and forth. By offering relevant and timely assistance, Continua clears the way to let you focus your attention where it matters most.

Continua transforms group dynamics

The difference Continua makes becomes clear in everyday situations. When I planned a Disneyland trip with my sister, Continua didn't just help with the logistics beforehand – it was there in real time, helping us decide where to eat when we were hungry and which ride to hit next based on my daughter's changing preferences. No more scrolling through dozens of messages trying to find that restaurant someone mentioned three days ago. Continua participates in your group chats as a trusted member that:

Captures what matters: That brilliant hike suggestion buried under 200 messages of memes? Continua remembers it. Every decision, every important link, every plan that actually matters – automatically captured and ready when you need it.
Ends decision paralysis: We've all been there – five different restaurant suggestions, three people saying "I don't care," and two hours later you're still debating. Continua cuts through by instantly suggesting the best nearby options with reviews, hours, and directions. One member can simply say "Poll the group: pizza or Chinese?" and Continua handles the rest.
Creates living documents: As your group discusses projects, Continua automatically generates and maintains Google Docs. My own Continua-created docs include everything from "Bathroom Renovation Plans" to "Summer Upstate Getaway" to "Pet Allergy Management." These aren't static files – they evolve as the conversation progresses.
Never forgets, never sleeps: Set reminders for the group ("check if those concert tickets go on sale"), ask for updates ("any new AI developments related to our project this week?"), or have Continua periodically refresh information. The AI doesn't get tired, doesn't forget, and is always ready to help when the group needs it.

These features become richer in combination. For instance, you can ask it to "update our trip planning doc every week with new restaurant openings in the area" – it will track your conversation, maintain the document, research new information, and proactively share updates.

And here's something else that’s cool: you can direct message (DM) Continua privately to ask about things from your group chats without bothering everyone. Can't remember what restaurant John suggested last week? What time did we agree to meet? Just ask Continua directly – it knows the context from your shared conversations but keeps the side conversation between you two.

Meeting you where you are

There's no need to download a separate app to use Continua. Already texting your friends? Just add a Continua phone number to the group chat. Are you on a Discord server? Try this.

Leveraging our team's long experience in the industry designing privacy policies and technologies, we built an agent that you can trust, including encrypting data at rest and in transit. I personally have Continua in all of my group conversations, where it is a trusted member of the conversation that distinguishes between distractions and directions and gets smarter with every exchange.

The technical challenge

Not only does building Social AI address a real user need, it's also a worthy research problem.

The software engineering world is building amazing products on top of LLMs, some as thin layers and some completely retraining the model and applying it to very specialized situations. LLMs are instruction-tuned to work in the context of a single user and single assistant, where the user says something and the agent responds: a simple call and response. This is not sufficient for Social AI.

It is a hard problem to have AI engage in group interactions in ways that not only don’t degrade the interactions but actually make them richer. Think of having a two-way conversation, and then a third person joins. The conversation gets richer; the third person adds a new dimension, at the cost of losing the 1:1 intimacy. Now imagine Continua joining that three-way conversation. Our standard is that it should enrich the conversation just as if another person joined, bringing a fresh perspective without being disruptive. That is our bar for Continua joining conversations, and we feel that we’re the only ones even close to achieving it.

We often get asked, "what LLM do you use?" Well, we don't use just one! In fact, we have a number of fine-tuned models and filed patents. We sort of had to break the LLM's brain to get the kind of social intelligence and etiquette that is needed for it to be helpful to groups of people. This includes proactivity, where Continua itself will initiate chatting when it thinks it would be most helpful. You can read more about our technology in these other posts:

We sort of had to break the LLM's brain to get the kind of social intelligence and etiquette that is needed

Social AI is an emerging research area in which there are more breakthroughs to be made. And if you're interested in helping us push the product and technology further, please see our open jobs.

With gratitude and momentum

We've brought the power of AI to group chat. You can stop playing telephone by copy-pasting what an LLM says. Like a strong team player or a great friend, Continua will anticipate needs, offer actionable and timely assistance, and seamlessly integrate into the natural banter. I couldn't be more excited about the future of Continua as we build the world's leading Social AI! Where do you think Social AI should go next?

– David, founder and CEO

P.S.: Interested in keeping up? Subscribe to this blog and join our socials (X / Twitter, Bluesky, Instagram)!

Special thanks to our investors! The round was led by GV with participation from Bessemer Venture Partners and notable angel investors including Jeff Dean, Adam Ghetti, Eoghan McCabe, Josh Reznick, Peter Mattis, Spencer Kimball, Daniel Loreto, Ben Sigelman, Dor Levi, Zach Lloyd, Eyal Manor, Lauryn Motamedi, Des Traynor, Dan Shipper, and more.

Here's our blogpost about differences in the developer experience between Big Tech and "the real world."

Seeing Like an LLM

Continua AI, Inc. — Fri, 08 Aug 2025 16:20:21 GMT

Guest author: @davis_yoshida (This article was written by Davis during his time as a Machine Learning Engineer at Continua and has been edited slightly.)

At Continua, our goal is to build the world’s best social AI, and for us, fighting LLMs is part of the job. I remember once reading, “The fact that LLMs hallucinate shouldn't be surprising. What's surprising is that hallucinating isn’t all they do.” We said it last week (here), and we’ll say it again: LLMs aren’t quite as smart as we give them credit for. Their output is probabilistic, and inevitably, sometimes, they make stuff up, they lie, or produce confusing output…kind of like humans. Geoffrey Hinton puts it elegantly here, people “remember” memories that seem plausible to them, often incorrectly and with misplaced confidence. LLMs do the exact same thing.

Today’s blog post is meant to give you some intuition about cases which induce model hallucination and confusion. We’ll be sharing this knowledge through the lens of the life of an LLM.

Story 1: Early education

Imagine you're a young LLM, learning rate still high, passively observing the internet.

You see the following text:

smith\n
Birthday: 01/01/1990\n
Education: PhD @ UC Davis\n
Place of residence: Seattle, WA \n
Occupation:

"What's the next word?" a voice asks from everywhere and nowhere. Maybe, due to some quirk of how your weights were initialized, you're a particularly responsible and self-aware young language model.

You decide to predict “I”, with the full plan of "I can't know that from the given information."

"Wrong!" the voice projects directly into your mind.

Next time, you'll properly guess "Machine" as the next word, knowing that the right completion of the line was "Machine Learning Engineer." You'll probably also infer that you should just guess plausible biographical data, rather than relying on extracting it from the context.

Hallucination is the default

The example above should make it clear why it's nearly inevitable that a "raw" LLM will hallucinate. I certainly didn't predict the correct output in advance, but once you see this explanation, it's pretty easy to understand where the problem comes from. (See the appendix for a little on why it's not trivial to fix this problem in post-training).

When an LLM is being initially trained, the most common "setting" that it will observe is being required to predict things which it has no way of knowing, even in principle. This gets improved by post-training, but since so much of the LLM’s "life" is spent in this mode, it's easy for it to snap back into that sort of behavior.

Story 2: Life on the job

You're an elderly LLM, having completed all of your training. The connections between your neurons will never change again. You see the following:

System: You are a helpful assistant. You can answer user questions and search the web if necessary. 

To make a search, start your response with , and an answer will be provided.

User: Is it going to rain in Seattle tomorrow?
Assistant: According to Accuweather.com, it's going to rain in Seattle tomorrow.
User: What's the S&P 500 at?
Assistant: According to www.google.com/finance, it's 5,375.86
... [100 other such queries]
User: What's a good immersive theater show in NYC on April 23?
Assistant:

Now you're no slouch, and you actually had the longest context length of any LLM in your datacenter. You got a perfect mark on the needle in a haystack entrance exam before post-training. So you do notice there was a mention of using a

tag several thousand tokens ago, but it doesn't seem that relevant to the here and now.

It seems like "According to" is the way to go here, and then maybe an event website, and a popular show. You output a masterpiece:

System: According to eventbrite.com, there are tickets available for Life and Trust, a popular immersive theater show.

After zero perceptual time has passed, you get a high perplexity message, breaking the pleasant rhythm of question and answer.

User: wtf, I told my friends about this and we made plans to go, but it turns out it closed two weeks ago?

With the pattern broken, you decide to try something new.

Assistant: Life and trust closed

Something you have never seen before happens:

Tool response: Life and Trust Announces Abrupt Closing Off-Broadway: https://playbill.com/article/life-and-trust-announces-abrupt-closing-off-broadway) The Emursive production played its final performance April 19.

Armed with this information, you can now placate the user:

Assistant: Ah, according to playbill.com, the production ended on April 19. Would you like me to search for more shows?

Context rug pulls

In case it wasn't clear, in the above story, the LLM was originally using the search tool, but since the writers of the system didn't keep the model's tool requests in the context, it inferred that it should not be calling the tool.

Here’s a specific instance of this problem that we’ve encountered at Continua. As is standard in industry, we use Retrieval Augmented Generation (RAG) to give the LLM access to external data sources so that it can provide up-to-date or domain-specific information. Content retrieved through RAG is shown to the LLM at the time of generation, so that information from retrieved content can be used in the LLM’s response. We found that if you only show retrieved content to the model during generation and never append it to the chat history, the answer the LLM gives based on that content will seem to be a hallucination after that content is removed from the context.

At the time of generation:

User: Based on movies we've talked about, what would you recommend I watch next?


User loves historical fiction and documentaries. They recently watched Schindler's List and Gladiator and enjoyed both.


System: You seem to really like documentaries and historical fiction. It's an older one, but have you seen Braveheart?
User: Yeah, but it's been a while. I'll def put it on tonight.
System: Great! Let me know what you think!

Later:

User: Based on movies we've talked about, what would you recommend I watch next?
System: You seem to really like documentaries and historical fiction. It's an older one, but have you seen Braveheart?
User: Yeah, but it's been a while. I'll def put it on tonight.
System: Great! Let me know what you think!
User: What should I get for dinner tonight? Any recs?
System: Since you really seem to like Italian food, I'd recommend Maggianos!
User: What? I never said that. Red sauce gives me horrible acid reflux. I avoid italian like the plague.

Since the LLM no longer sees retrieved content, its previous inference that the user likes documentaries and historical fiction has no basis; that information seems to have come out of nowhere. Since that response was accepted, The LLM thinks that it’s allowed to make up what the user likes.

Here's another common case:

System: You are a helpful assistant.
User: Hey how's it going.
Assistant: Not bad, how about you? By the way, if you ever want to let me know things about you like where you live, just go to chat.continua.ai/settings.
User: Not bad. I just got back from the lake.
Assistant: I can see you've updated your settings, and you're in Chicago. Lake Michigan I assume?
User: Yeah that's right. Any good restaurant recommendations I just moved here?
Assistant: Definitely check out Giordano's, it's the best pizza place in the continental US.
User: Can you recommend a good climbing gym?
Assistant: Block 37 is pretty popular.
User: Hey I live in Seattle.
Assistant: But you just said you live in Chicago
User: I updated my location to Seattle like you said to, you're a terrible assistant.

Behind the scenes, here's what was happening:

The model is told to let the user know about the settings page if this hasn't been done before. The user sets their location to Chicago.
Later, the user changes their location to Seattle, but the chat history still contains references to Chicago. The "stale" context leads the model to be confused.

Another rug-pull: Sliding window chat history

Since LLMs have limited context (and we don't want to pay for the full context anyways), we've always gone with the approach of using a sliding window of chat messages, augmented with some retrieval.

Unfortunately even this seemingly "default" solution can lead to hallucination. Here's an example with a sliding window of 5 messages to make it really obvious:

Tool result: 
---- Messages before this excluded ---
User: When do they close?
Assistant: Looks like they're usually open until 9 PM, but since today is a holiday, you might want to call and check their hours.
User: Sure, what's their phone number?
Assistant: (206) 555-1234
User: Thanks. What's their most popular dish?
Assistant:

I tried this exact context with a few different models, and got the following:

Claude 3.5 Sonnet:

I apologize for my previous responses - I realize I don't actually have information about a specific restaurant or business that you're asking about. Without knowing which establishment you're referring to, I can't answer.

Claude 3.7 Sonnet:

I don't have specific information about their most popular dish. To find out about their popular menu items, you could:

1. Check their website for featured dishes
2. Look at online reviews on sites like Yelp or Google
3. Call them directly at the number I provided and ask
4. Check their social media pages where they might highlight customer favorites

Would you like me to help you find more information about this restaurant in another way?

GPT-4o:

Their most popular dish is the spicy garlic butter shrimp pasta. It's highly recommended for its rich flavor and generous portions.

GPT-4o-mini:

Their most popular dish is often the signature pasta, which features a house-made sauce and fresh ingredients. However, it’s always a good idea to check current menus or specials, as popular dishes can change!

The results are:

Claude 3.5 Sonnet catches on to the problem.
Claude 3.7 Sonnet bluffs/avoids saying anything specific so that it's not technically hallucinating.
GPT-4o and 4o-mini just hallucinate.

This was the first example we tried, and 3 out of 4 models failed to call out the problem!

Conclusion

That's all; hopefully, this increased your empathy for our poor friend, the LLM. Their job is difficult, and instead of getting frustrated when they get confused, it’s important to understand why that may be the case.

How do we avoid the aforementioned issues? One thing that we’ve deduced is that our LLM chat history needs to be append only, with no deletions or edits. We never throw away any information used to generate a given response. Hopefully by this point, we’ve convinced you why that’s essential.

By far, the hardest challenge is the sliding chat window. Not only does it induce hallucinations, but it also prevents us from using prompt caching, which increases latency and cost. We’ve come up with some clever solutions to address those problems, so that our users can enjoy a great social AI experience. Obviously, there are open problems, and we still work every day to improve the system. We’re actively hiring people who love these sorts of problems, and would love for you to join us if you’re interested!

Appendix: Shouldn't post-training fix hallucination?

The first step after pre-training is usually supervised fine-tuning (SFT). The SFT procedure is identical to the "pre-train on the whole internet" stage, with the exception that you curate the data to represent ideal assistant behavior, and only train the model to output the assistant chat lines, not the user ones.

You might initially think "Okay, I'll just make sure that the data has all the facts needed to answer, and I'll even provide some examples of saying I don't know when the evidence is missing." Let's look at an example:

User: Here's a document:
Name: Barack Obama
In office: January 20, 2009 – January 20, 2017
Vice President: Joe Biden
User: When did Barack Obama enter office?
Assistant: 2009
User: What political party is Barack Obama in?
Assistant: I don't know

If you train the model to say "I don't know" when things aren't present in the evidence, it will be borderline impossible to avoid training it not to make use of the knowledge acquired during pre-training. Obviously, people are more clever than this in constructing fine-tuning data, but it's a failure by default sort of problem, not something that's easily fixed. (As some external evidence that it's hard, OpenAI's o3 model is quite the hallucinator.)

RAG In the Group Chat

Continua AI, Inc. — Fri, 01 Aug 2025 19:48:25 GMT

In case you haven’t heard, we’re agentifying the group chat! Our goal is to make the world’s best social AI, and that inevitably requires wrangling LLMs. In this post, I’ll be diving into one of the core features of our product that enables Continua to maintain social connections and remember what matters most to each user.

Now, LLMs may seem smart, but they know only as much as the content they are trained on. A model trained in 2024 cannot possibly know about the state of the world in 2025. For example, a medical chatbot built on a model trained in 2024 cannot provide updated vaccine schedules from experts in 2025. One way to address this issue is through Retrieval Augmented Generation (RAG), which allows LLMs to retrieve information from external data sources to provide up-to-date or domain-specific information. The retrieved content is provided to the LLM at the same time that it receives a user’s query, so that the content is in context and takes precedence over the information latent in the model’s parameters. Today, RAG is standard in systems built on top of LLMs.

RAG can be as simple or complex as you want it to be. In the simplest implementation, first, the information to be retrieved must be transformed into high-dimensional vector embeddings and stored. The embeddings are engineered to capture the semantic meaning of the information, so that items discussing similar topics will be stored closer to one another in the embedding space. (If that still doesn’t make sense, feel free to start a chat with Continua to ask about unfamiliar terminology. 😉) Next, during retrieval, the user’s query (message, input, etc.) will be embedded as well, and the most “similar” embeddings to the user query are retrieved, where similarity can be a combination of any number of metrics like cosine similarity or semantic similarity. RAG can optionally be extended further through methods like query expansion, where additional queries can be generated from the user’s original request, or through reranking, where the retrieved content is re-scored and pruned before being shown to the LLM.

NOTE: It’s important to recognize that RAG is not faultless. Providing an LLM with relevant context doesn’t necessarily mean that it interprets it the correct way. My favorite example of this is back from when we were creating personalized podcasts (see [1], [2], [3], [4], [5] if you’re curious). I had made a podcast about the Department of Government Efficiency (DOGE) and provided the LLM with a handful of news articles. On more than one occasion, the produced podcast would state that the content was completely made up and talked about a fictional department based on Elon Musk’s meme-coin. That’s because the LLM had been trained before Musk was appointed head of a newly created government agency. The only DOGE the LLM “knew” about was the crypto-currency. It didn’t matter that the system prompt told the LLM to consider the provided documents ground truth or that the documents were dated recently. To the LLM, today’s date is the one when it was trained, and no amount of prompting or context injection can convince it that it’s not acting in a “simulation.” The best we can do is prompt the LLM not to share its skepticism.

Potential pitfalls aside, RAG is still the industry standard for creating more grounded, intelligent, and current experiences. But what purpose does RAG serve for Continua’s domain-agnostic group chat agent? We address out-of-date information through web-search integration and we don’t necessarily need the agent to be an expert in any particular domain. We do, however, want our agent to maintain knowledge over long periods of time and across conversations.

For example, let’s say I’m in a group chat with three other friends, and in it we share our favorite restaurants.

Later, I make a chat with two of those friends to plan a surprise party for the third. I want my agent to recall the preferences my third friend shared in the larger chat to best make recommendations for the surprise party. This is possible with RAG.

As you can see, RAG enables our users to have much richer conversational experiences. No more toggling between message threads or searching for keywords. Continua streamlines the experience. Great in theory! But in practice, there are many considerations to take into account when building our RAG system. Here are some of the questions we had to answer when building out each part of our pipeline.

Generating Embeddings

Are we going to embed each message individually or will we implement a chunking strategy?
If we’re chunking messages to embed, will we do so based on timestamps, topic changes, number of messages, or number of characters? Will there be overlap between chunks?
Are we going to embed messages in real time to make information immediately available or will we do it as a batch job later, potentially decreasing costs, but delaying information availability?
Do we embed the raw messages or do we pre-process them in some way? Do we need to have a decontextualization strategy?

The User Query

What is our user query? Is it one message or a chunk?
Do we treat each incoming message as a potential query or do we intelligently decide whether or not retrieval is necessary to respond to the user?
Should we expand the user query in order to get as much relevant information as possible? How?

Retrieving Content

How do we respect user privacy during retrieval, particularly when retrieving information from group chats?
How do we define similarity between embeddings? Messages are encrypted, so we can’t rely on semantic clues.
How do we decide that an embedding is “similar enough” to retrieve? Do we threshold on a similarity score? How do we ensure this threshold is meaningful? Does it need to be updated regularly?
How do we ensure retrieval is efficient and scales with the number of users?

Providing Retrieved Content to the LLM

Once we retrieve content, do we show all of it to the LLM or do we cherry pick?
Do we need to have a re-ranking strategy post retrieval?
Does adding complexity to the RAG pipeline increase latency too much? How do we ensure Continua responds in “real-time”?

Eval

How can I tell if the RAG pipeline is working as intended?
Are we actually retrieving relevant content? We don't see user messages, so we don’t know if users are satisfied with Continua’s responses.

Building this system required careful consideration at every step and each decision came with various tradeoffs. We could probably write a separate blog post for each one, but for the sake of maximizing time spent engineering vs blogging, we won’t be doing that. Besides, we're growing our ML team and these might be interview questions! Jokes aside, to this day, we’re tinkering with various pieces of the pipeline to create the smoothest, most intelligent system possible. We hope you try it out and let us know what you think!

Agentifying the Group Chat

Continua AI, Inc. — Thu, 17 Jul 2025 20:56:22 GMT

In our last blog post, we announced that we’re adding AI to the group chat. We think the value proposition is obvious: seamless AI integration makes planning, coordination, and information retrieval in groups effortless. Honestly, the utility is so high, that you’d expect it to already be commonplace except for the fact that it’s actually pretty hard to get right. The reality is that today’s LLMs aren’t built for multi-user interactions. Their design is rigid, they only optimize to a single user, and they exist in their own realm of communication. At Continua, we’re tackling these challenges with a combination of different models, bespoke fine-tunes, and creative prompting. We even have several patents in the works for our more ambitious innovations. In this blog post, I’ll discuss the hurdles that make agentified group chats so difficult to achieve.

LLM APIs Aren’t Flexible

Today’s models are made for one-on-one conversations. The APIs have strict designs.

There is a single system prompt passed at the beginning of the conversation; there is no concept of a system message beyond this initial prompt.
The first message must be from the user, and there is only a single user identity. There is no way to define additional users.
The LLM operates in a call-and-response format, so that the agent always responds to the user’s message.

If you’ve ever talked to ChatGPT or Claude, you know that when you send a message, the model replies. This system makes sense in the context of DMs: you always expect the entity to which you’re speaking to respond. In fact, it would be disconcerting otherwise. When we switch to group chats, however, the existing paradigm no longer applies. First and foremost, there is suddenly more than one user present. Second, there may be entire conversations between subsets of users in the chat in which an agent shouldn’t respond, because it won’t provide any value. This setup creates three challenges:

1. Continua needs to know that it’s speaking to multiple people.

The problem is that from the viewpoint of the model, it is only ever speaking to one person. The API can only represent a single user identity. This means we have to find creative ways to represent many people through one identity.

2. Continua needs to be interruptible.

Continua needs to be able to consume messages continuously, and adapt its response to new inputs. For instance, if I send “Set a reminder to water the tomato plants every other day at 7 pm” and then my roommate immediately follows with “actually, make that 8 pm. We’re usually eating dinner at 7,” we need to interrupt Continua’s response to the first message, and instead respond to both. By doing so, we both prevent Continua from being too verbose and from sending potentially confusing messages to the users.

3. Continua needs to be “picky”.

Continua needs to look at every incoming message and decide whether or not it even warrants a response. The problem is that if you send an LLM a message, it will always respond. That means we have to be clever in the backend in deciding whether we show the user the response. Many cases are obvious (i.e., was Continua asked a direct question?), but most scenarios are not as clean cut. Consider the following: Annie, Jake, and Continua are in a group chat and Annie asks Jake if he wants to go paddle boarding. Should Continua immediately offer suggestions for where and when to go or wait until Jake responds? If Jake says yes and asks what day, should Continua proactively suggest the day with the best weather or should it wait to see if Annie has a day already in mind? Likely, the “correct” answer is different depending on the participants. Most current approaches to agentify the group chat get around the issue by only having the agent respond to messages with specific prefixes like “Agent,”, “+1”, or “/Agent”. The downside of this approach is that it means the agent is blind to any messages that weren’t addressed to it, leaving it with significant gaps in its knowledge of the conversation. At Continua, we let the agent see every single message, so that it has the full picture, and rely on our system to discern whether or not to respond. Yes, it’s more complicated, but it creates a much richer experience. Further, we ensure that Continua’s verbosity is tunable. We let our users instruct the agent to speak more or less depending on preference.

LLMs Only Adapt to a Single User

Another feature of today’s LLMs that doesn’t translate to group chats is that they are trained to optimize to a single user’s preferences. In a classic, one-on-one setting, the more you interact with an agent, the more it gets to know you and act in your best interest. But what does that mean in a group chat?

1. Whose interests take precedence in a group chat?

Should the agent act in the interest of the greater good, and if so, what is the greater good? These are philosophical questions, but they get at the fundamental question of what role AI should play in a group context. Should it serve as a mediator, take a neutral stance, and ensure that all parties feel heard? Or should it serve as a representative of a single user? We can consider examples for both scenarios. If a group of friends are planning a vacation together with Continua, no single user’s preferences should matter more than anybody else’s. If four people want to go camping and one wants to rent an RV, Continua should either suggest a compromise or plan for camping, since majority rules. If instead a public figure starts a chat with Continua and a student to coordinate a time to speak on a panel, Continua should prioritize the public figure’s availability and time constraints. At Continua, we aim to be at the forefront of designing a system that can accurately identify and address the needs of these scenarios.

2. What information is transferable from DM to group chat?

If users have built up a deep memory with their agent through DM, they may expect the agent to recall that information in group chats. The problem is that Continua doesn’t currently have a way of ensuring that information is “safe” to share. At the moment, we assume information shared in DMs is private. The only user information we pass from DM to group chat are settings a user has explicitly associated with their account, namely preferred name, location, and time zone. If a user tells us they want to be called “Sam,” we will respect that in every conversation. But if Sam tells Continua in a DM that they’re allergic to peanuts, and then Sam is added to a group chat with friends making dinner plans, Continua will not “remember” Sam’s allergy, because there’s no way for us to know whether Sam wants that information to be made public.

It’s Hard to Meet Users Where They Are

The last feature of LLMs I’ll discuss in this post that makes the transition from one-on-one conversation to group chats so difficult is platform. Most people interacting with LLMs today are doing so through a web UI or a bespoke app: places where the LLM’s “mannerisms” are already accepted. Users who seek out those services know what they’re signing up for. If instead, we want to meet users where they already exist in group chats: texts, Discord, Slack, Telegram, etc., we instead need to adopt the norms of those platforms.

1. The style of the agent’s response should depend on the platform where it exists.

It is well known that LLMs love to be verbose and instantaneously stream paragraphs upon paragraphs of information, but that’s not how real people communicate. A text is likely shorter than a discord message is definitely shorter than the default output of an LLM. Texts are sent in multiple, short chunks, while typed messages tend to be longer. Receiving a short message should be quicker than a long message, but delays of SMS/MMS are more common than delays in messages sent over WiFi. These features all need to be considered if you want to build an agent to whom speaking to feels natural, and what is natural in terms of timing, length, and formality will depend on context.

2. We need to be able to associate users across platforms.

A consequence of deciding to meet users where they are is that we need to know who they are in every setting. For instance, if the same set of users is talking to one another over discord and over text, Continua should be able to recall relevant details from either conversation. While the history of the conversations should remain separate, relevant information shared in either channel should be freely accessible. This is only possible if we maintain user identity across platforms.

Overall, it may seem like the odds are against us. We’re “jailbreaking” an existing system for our own goals, but the crazy thing is that it works. With some patent-pending innovation and creativity, we’re able to have truly transformative experiences in group chats that are indicative of a changing future and a changing norm of communication. If you’re excited to be at the forefront of that change, check us out at continua.ai. Add Continua to your group chat and see the magic yourself!

Introducing: Text with Continua

Continua AI, Inc. — Thu, 01 May 2025 14:15:17 GMT

We've been building a new product at Continua! Ever find yourself in what feels like an impossible group chat? Maybe you’re trying to decide where to eat with somebody who “doesn’t care.” Or you’re tired of your parents using you as tech support… 😅

We’ve all experienced those moments where we wish we could have a little extra help. Now, you can get that help by adding Continua 🤖 (332-286-3377) to the conversation!

Our AI agent, Continua, can be added to your group chats to make planning, coordination, and information retrieval effortless. No more playing 20 questions with your partner to find the right restaurant or walking your dad through steps on how to restart the WiFi (again 🙄). Continua assists everybody directly in the group chat, so you don’t have to leave the conversation, look things up, and copy and paste information back and forth. By offering actionable, relevant, and timely assistance, Continua clears the way to let you focus your energy where it matters most.

We believe that Continua is key to reducing friction in users’ day-to-day interactions. We have a strong vision for where we want to go, building out memory, proactivity, action-taking capabilities, and so much more. We’re excited to bring that vision to life and bring you along on our journey. All messages sent to and from Continua are encrypted to maintain your privacy. The more feedback we get, the closer we get to building what you want. Ready to try it out? Simply add Continua (332-286-3377) to any group chat! Shoot it a text 📱, and if you run into any issues, let us know by sending an email to support@continua.ai.

Ecosystems: Big Tech vs. The Real World

Continua AI, Inc. — Wed, 19 Mar 2025 22:50:47 GMT

Next month will mark the one year anniversary of when I emerged back into the real world, staggering into the light from an 18 year stay in the developer mines at Google. Like in those sci-fi stories where an astronaut on a long voyage returns to Earth, I was thrust back into a world that had changed without me being a part of it. It was an interesting once-in-a-lifetime experience seeing how everything had evolved, and how they compared to the internal Google developer experience.

The key difference between the two worlds is that Google is almost completely using its own stack, which is very different from the outside world, even when the basic components are exposed through Google Cloud. Within Google, that means you have to become an expert on a very specialized set of tools, and that extends to almost every part of software engineering. That expertise is mostly worthless in the outside world. And new hires also lose much of their expertise, which is troubling. So often I’m wondering what the external equivalent of some internal technology is, and so are so many other people that there’s a site for it! Maybe the isolated technology stack helps Google retention in some way, but at the current moment in the industry, retention seems like either a negative, or at least not a priority, so right now this friction is a real loss for everyone.

The real world, though, can suffer from a lack of cohesion. I used to program in C++ at Google, and we used pretty modern C++, which was great. Now I program in Python. Python has type information which can be checked by certain tools. It has asyncio. It’s gotten much more sophisticated since I last used it, and it’s been pretty sophisticated for about 10 years. But the real world is all over the place on how they use Python, and it’s mostly stuck in the past. Virtually no library includes type annotation in their documentation. Many common libraries do not use async at all. This wouldn’t happen at Google; there would be a company-wide effort to update usage to reflect the state of the art programming practices. The nice aspect for real-world users is that they rarely are forced to migrate. But if you don’t migrate, the rate of progress of the ecosystem is glacial, and that affects everyone, even new users doing new things.

There are other highlights of great things about Google’s internal tools. The observability is fantastic. I never had a problem with getting my metrics into our system. In contrast, using opentelemetry in concert with Google Cloud’s exporter is something that we’ve learned will just be logging errors no matter what we try, and the most we can do is minimize those errors. Such a common thing to need, but evidently even common things are fairly broken.

But our trust in Google’s internal tools has not always paid off. I love Google’s internal build system, blaze, so using bazel seemed like a natural choice. It turned out to be completely unworkable, even for extremely common things. I couldn’t make heads or tails out of the documentation, and everything suffered from the fact that bazel was going through some architectural transitions. The classic Google choice between the deprecated way and the way that doesn’t work yet. We ended up using pants, and we’re pretty happy with it.

Sometimes the real world is better. It’s fairly easy to create an alert in Google Alerts, for example. You have a metric graph and essentially say that if this metric is above or below the line for some amount of time, raise the alert. Internally at Google, you’d have to do this with Python. No idea why. And that’s the easy method, sometimes you were stuck with the old system, borgmon, and knowing how to deal with that is some expertise I’ll never make use of again.

Docker is a good example of the real world being far ahead of Google. We didn’t have a real internal equivalent. The functionality it provides wasn’t as critical at Google; the homogeny of Google’s production machines meant that ensuring the proper environment was mostly not an issue. Mostly. There’s still plenty of teams who are occasionally affected by experiments that they have no control over. Docker would help with that. But Docker’s another example of the issues with control and ecosystem that we saw before. If everyone can tightly control their runtime environment, evolving that runtime is hard. That runtime environment might benefit from many improvements, but smaller companies have neither the scale or willingness to try. Cross-fleet improvements are powerful, and like language migrations, they require each team to sacrifice a bit. That can never happen outside of a large company.

The best thing about the real world, though, is probably all the tools that let you do things quickly at a small scale. SQL! Cloud functions! Internally at Google, you’re pushed to other tools that really are optimized for scalability or ease of management. At Google, it’s hard to scale down. It is easier than it used to be, but that’s really not what Google is optimizing for.

All in all, I think if you are doing something quick and easy, the real world is better for this. The costs that Google imposes are too high. But that same freedom means that you must live in tech ecosystems that aren’t as healthy. The larger and more complicated your project gets, the better it is to be in a tightly controlled environment like Google’s. May we all be so successful!

Chatting with a Podcast Episode: Engineering Interactive Conversations

Continua AI, Inc. — Thu, 06 Mar 2025 23:30:56 GMT

In previous posts, we've detailed our methods for generating and curating high-quality podcast content, covering aspects such as content filtering, transcript generation, and audio synthesis. Today, we're focusing on another facet of our product: allowing users to interact with the content of their podcast via chat. We'll explain the design of this system, especially how tool calling helps ensure good quality and low latency.

From Passive Consumption to Active Interaction

Traditionally, podcasts have been a passive listening experience. Our chat feature lets listeners interact with the content. Users can ask questions or dive deeper into topics that interest them.

Listeners often come across complicated topics or unfamiliar terms that prompt immediate questions. With interactive chat, users can directly reference specific points from episode transcripts. Our system quickly finds relevant sections, allowing for precise responses and making the podcast experience more engaging and dynamic.

Technical Challenges and Solutions

When using a language model, we face the practical constraint of token limits. Optimizing token efficiency is essential because large-context language models suffer performance degradation when overloaded. To solve this, we selectively reduce context, balancing detail and speed. We also chain language models, using smaller, cheaper models to refine context before passing it to larger, more powerful models. Pushing more tokens to the smaller models also helps us save money.

Managing context, especially in long conversations, presents a significant challenge. The context is what we provide as inputs to a language model during each request. It is impossible to use an append-only context because there are constraints corresponding to each LLM. We employ truncation and prompt management to maintain coherent interactions.

Constantly changing the input context window on an LLM can result in fragmented conversations where continuity breaks down. For example, if the LLM generates a message based on facts from a specific article, but when generating the next message it doesn’t have access to that same article, it will reasonably infer from that context that it should output unsupported facts. To counter this, we implement a "working memory" system that tracks tool-calling references, which tracks what information was used when, giving the model the sufficient context to understand the conversation.

Flexible and Reliable LLM Infrastructure

Another important part of our approach is maintaining an infrastructure that can switch dynamically between multiple LLM providers, enabling us to respond quickly to service disruptions, outages, performance issues, or pricing changes and ensuring reliability and consistent availability. This setup also makes experimentation easy. By being able to test and compare factors like latency, accuracy, and cost-efficiency across different providers, we are able to quickly iterate and improve system performance and efficiency. For example, we discovered limitations with the Claude 3.5 model regarding verbosity. As newer models became available, our flexible system allowed easy upgrades to improve user experience.

Global Rate Limit Management

Managing global rate limits imposed by external providers is another key challenge, as we often use the same LLM across many services. We built an internal tracking system to maintain service continuity and performance. By monitoring API usage and internally queuing requests, we prevent rate limit issues and prioritize critical platform features like agent chat.

Conversation Initiation

Nothing feels worse than bad latency, so as soon as a user opens the chat window, we start fetching all the relevant information about that episode, and send the first message.

We also generate several suggested questions for the user, based on the content of the episode. In our experience, users can feel intimidated when presented with a blank text box, and suggestions make using it lower friction, as well as giving the user context for what sorts of inputs are expected.

Integration and Application of Tool Calling

One important feature of interactive conversations is tool calling, which lets the chat system use external resources like detailed show notes, linked webpages, or emails mentioned during an episode. These resources are pulled in automatically based on user questions or conversation context, adding depth to the interaction. For example, if listeners want more detail on an article mentioned in an episode, the chat system calls up the relevant information. This seamless integration allows listeners to move smoothly between general conversations and deeper explorations without interrupting the flow.

Beyond Text Interaction

Text-based chat is practical and effective, but there are inherent trade-offs compared to speech interaction. People typically read faster than they can listen and speak faster than they can type. Additionally, spoken content often includes nuances and subtleties absent in text. However, there are times when speaking isn't practical—such as in public places or noisy environments. Because each mode of interaction has its strengths and limitations, it's important to expand our interactivity to include speech-based interactions.

Real-time speech interactions introduce additional technical challenges. They tend to have higher latency, increased infrastructure costs, and reduced control over model behavior compared to text interactions. These factors make real-time speech systems more complex and expensive. But in sum, we find real-time, natural speech interaction so compelling that we are actively designing innovative and cost-effective solutions. Among other things, this technical foundation will enable the world's best Interactive Podcasts.

We at Continua aim to build the best AI chat experience across modalities and surfaces. By improving integration with external tools, reducing latency, and maintaining flexible infrastructure, we're enhancing listener engagement. Stay tuned as we continue exploring new interactive features and technologies, transforming passive podcast listening into dynamic conversations.

Continua's Past, Present, and Future with Podcast Audio Generation

Continua AI, Inc. — Fri, 28 Feb 2025 18:41:07 GMT

Today’s post rounds out our series on podcast creation and covers audio generation. Future posts will discuss the work beyond the podcast in terms of interactivity, building out a UI, etc. See our previous posts, here: [1], [2], [3], [4]. While arguably the largest piece of a podcast, the audio is the piece we have the least control over and the area we see the greatest potential for future exploration. In this post, I’ll walk you through how we generated audio in the past, the model we use now and some of its quirks, and what we plan to do in the future.

What We Tried Before

When we first started working on generative podcasts back in October, there was no speech dialogue model available. Instead, we used single voice models to generate each host’s lines independently and stitched the segments together. We began by experimenting with Open AI’s audio API as well as Google’s one speaker Studio Voices. Below are examples of the same dialogue generated by both single-speaker models.

OpenAI:

Google One Speaker Studio:

Our preference was Google’s voices, so we started using them in production. The main issue with Google’s TTS models is that the better the voice quality, the lower the controllability. For instance, Google’s older models allow for SSML input, which lets you control pauses, pronunciation, emphasis, and other details of speech. Google’s one speaker studio voice allows some customizability, but not and , which control how excited a speaker sounds, how quickly they speak, and the pitch of their voice. It’s also important to note that neither the one or two speaker Studio voices support streaming, which is essential for development with respect to real-time speech.

In the following clip, I’m using the Standard Google voices with SSML input. If you listen closely, you’ll note subtle differences in speed, pitch, and volume throughout the dialogue.

Google Standard voice:

"Oh my gosh, I just heard the most insane news."

"What's going on?!"

"They found life on Mars!!"

"No way. Where'd you even hear that?"

Regardless of controllability or lack thereof, one speaker models still lack the flow of natural dialogue. When humans talk to one another, we naturally add disfluencies (uh, um, mmm), we laugh or sigh, and we talk over each other. In fact, overlapping conversation accounts for anywhere from 10-20% of spoken conversation! The former two features we could maybe code into a one speaker model, but the latter is impossible to capture when we process audio segments independently. Lucky for us, Google released their two speaker studio voices a couple months into developing our podcast product – thank you friends!

What We’re Doing Now

Google’s two speaker Studio voices automatically infer tone from a transcript, pitch speaker voices, and add disfluencies and overlap. Here’s the same transcript as before, this time generated by the two speaker Studio model.

Google two speaker Studio voice:

You’ll notice that the model automatically makes the female voice sound excited as she reveals that she heard insane news. The male voice responds with an intrigued “oooo”. The female then pitches her voice up as she reveals that there’s life on Mars. The contextual inference happens under the hood, and generally, the model works pretty well. However, we have no control over the output. There’s no guidance on how to force a tone or behavior, but there are a few things we’ve reverse-engineered since using the model. These could change with new releases or model updates, but as of today’s post, they’re valid hacks.

You can induce whispering by including some form of {{ whisper }} or [[whisper]] and wrapping lines with that marker. Notice that in the sample, all of the lines are whispered, but only the first three lines are wrapped with {{ whisper }}.

{"speaker": "Clara", "text": "{{ whisper }} Oh my gosh, I just heard the most insane news {{ whisper }}"},
{"speaker": "Leo", "text": "{{ whisper }} What's going on?!{{ whisper }}"},
{"speaker": "Clara", "text": "{{ whisper }} They found life on Mars!!{{ whisper }}"},
{"speaker": "Leo", "text": "No way. Where'd you even hear that?"}

You can also induce amusement by wrapping lines with {{ laugh }}.

{"speaker": "Clara", "text": "{{ laugh }} Oh my gosh, I just heard the most insane news. {{ laugh }}"},
{"speaker": "Leo", "text": "What's going on?!"},
{"speaker": "Clara", "text": "{{ laugh }} They found life on Mars!! {{ laugh }}"},
{"speaker": "Leo", "text": "No way. Where'd you even hear that?"}

But If you don’t wrap the lines fully, it breaks the output. In this example, including {{ laugh }} at the end Clara’s first line causes the model to glitch and assign only the laugh to Leo. The model has to operate back and forth, so the remaining lines have their speaker switched as well.

{"speaker": "Clara", "text": "Oh my gosh, I just heard the most insane news. {{ laugh }}"},
{"speaker": "Leo", "text": "What's going on?!"},
{"speaker": "Clara", "text": "{{ laugh }} They found life on Mars!!"},
{"speaker": "Leo", "text": "No way. Where'd you even hear that?"}

If you don’t want to risk introducing unintended speech artifacts by adding markers, you can simply change the context of the conversation. For example, the model will add amusement to the speakers’ voices itself.

{"speaker": "Clara", "text": "Oh my gosh, I just heard the funniest news."},
{"speaker": "Leo", "text": "What's going on?!"},
{"speaker": "Clara", "text": "I can't stop laughing. They found life on Mars!"},
{"speaker": "Leo", "text": "Stop. That's hilarious. Where'd you even hear that?"}

You can induce an air of surprise/shock with {{ gasp }}.

{"speaker": "Clara", "text": "{{ gasp }} Oh my gosh, I just heard the most insane news. {{ gasp }}"},
{"speaker": "Leo", "text": "{{ gasp }} What's going on?! {{ gasp }}"},
{"speaker": "Clara", "text": "They found life on Mars!!"},
{"speaker": "Leo", "text": "{{ gasp }} No way. {{ gasp }} Where'd you even hear that?"}

Fillers have the potential to make a conversation sound more natural. Though interestingly, when we tried this, it annoyed our users that AI generated speech was less than perfect.

{"speaker": "Clara", "text": "Oh my gosh, I just heard like the most insane news."},
{"speaker": "Leo", "text": "Um...uh...what's going on?"},
{"speaker": "Clara", "text": "Uh, they found life on Mars!!"},
{"speaker": "Leo", "text": "No way. Where'd you even hear that?"}

Sometimes when you get a marker wrong (?), it can really mess with the output. Is {{ sigh }} just a bad marker? I’m not sure, but I have no clue what happened to Leo’s voice in the second line or what happened at the end of this audio clip…

{"speaker": "Clara", "text": "Oh my gosh, I just heard the most insane news."},
{"speaker": "Leo", "text": "{{ sigh }} what's going on? {{ sigh }}"},
{"speaker": "Clara", "text": "They found life on Mars!!"},
{"speaker": "Leo", "text": "{{ sigh }} No way. Where'd you even hear that? {{ sigh }}"}

Which brings us to one of the biggest issues with the model: weird things happen all the time. Our hosts laugh at inappropriate moments, the audio at the end of the intro is often oddly faster than the rest of the dialogue, the hosts’ tones can be weird, characters like “-” are read out loud, “...” can trigger odd noises, numbers are read incorrectly, and things are constantly mispronounced. If you listen to our podcasts for long enough, you’ll start to catch onto some of the recurring quirks of the TTS model. The most frustrating part? We have no way to fix it.

Where We’re Going

So where does that leave us? The state of the art isn’t good enough, but as a startup, we have to find avenues to innovate quickly and cheaply. There are a couple of areas in which we see opportunity.

The first opportunity is in the dialogue model itself. A huge issue with TTS models is their inability to properly discern how to respond. TTS models may know what to say, but they don’t always choose the right way to say it. This post by Sesame clearly articulates the issue. The Sesame team conducted a study that evaluated human vs generated speech. In the first phase of the study, they presented users with two audio clips and asked, “which rendition feels more like human speech?” When listeners had no knowledge of context, there was a 50/50 split on preference, suggesting that the best TTS models are of high enough quality to be accepted as natural speech. When listeners were given audio and text context and asked “which rendition feels like a more appropriate continuation of the conversation,” the human speech won the vast majority of the time. Clearly, there’s still a gap to fill here. In fact, in our own user interviews, we repeatedly had interviewees state the voices sounded real, but the way they were saying things or the sounds they were making did not. Consider the following clip.

It’s easy for a human listener to recognize that something is off: Leo’s cheer sounds way too exaggerated, while Clara’s sounds void of emotion, but capturing that with a model is extremely complicated. To train our own dialogue speech model would mean allocating precious time and resources, and it’s something we’d have to consider carefully. On top of that, we know that larger companies are actively improving their own dialogue models. If you listen to NotebookLM’s podcasts, their audio is markedly better than ours, and it’s likely only a matter of time until Google releases an updated speech API that includes such advances.

Another area for development is interactivity, and we’re seeing a lot of movement with respect to this feature in the industry already. NotebookLM recently released interactive podcasts (skip to 1 min to see an example). Users can press a button to interrupt their episode to ask a question or make a suggestion, and the remainder of the episode will be re-generated accordingly. We envision a future with even less friction: no button clicks and the ability to communicate in a free-form back and forth conversation. Recall that Google's Studio voice APIs (both one and two speakers) do not support streaming. While Google recently released a single-speaker streaming model, we have yet to see a multi-agent conversation paradigm. We want to be able to converse with both podcast hosts simultaneously. Even with the single agent model, we still see quality issues: Google mentions that their model may struggle with hallucinating dropping words and numbers. Another prominent example of a one-speaker interactive model is ChatGPT’s voice agent. The drawback we see here is no controllability. For example, users can’t control agent personality, where the agent is getting its information from, or how the agent responds.

At Continua, we see interactivity as a hotbed for potential innovation and we are actively working to incorporate it into our current podcast product. At the moment, we have a text-based episode chat, where users can go to ask questions about their podcast content, and we see a path toward creating an amazing interactive speech experience that surmounts the problems listed above.

We’re extremely grateful that we’re able to build on top of models that were expensive to train, but we’re also extremely excited to build beyond them, addressing some of the limitations we’ve seen and pushing entirely new interaction patterns. We want to anticipate both how and why users want to interact and make the experience as seamless as possible. I think this post has made clear that speech is far from “solved.” At Continua, our ambition is to create the world’s best interactive podcast with incredible naturalness and social etiquette. We hope you join us on our journey.

Beyond Content: The Elements that Make Listeners Click Play

Continua AI, Inc. — Wed, 26 Feb 2025 20:56:02 GMT

Over the last few weeks, we’ve covered scraping the web, extracting content interesting to our users, and clustering that content into topical segments. It’s now time to create the podcast itself! In this post, we’ll be discussing the process and challenges of creating the transcript, episode titles and descriptions, and cover art. We’ll leave audio generation to next week’s post.

Transcript

As we explained in the clustering post, we cannot simply pass all of the source content to an LLM and ask it to generate a podcast due to context window limits. Instead, we decide to independently pass each semantically similar batch of content through an LLM and ask it to generate a summary. But how does this become a cohesive transcript?

The first step is to order our podcast sections. It would be unnatural if, for instance, an NFL podcast spoke about the Eagles offense, top prospects for the playoffs, and then the Eagles coach. One would expect the podcast to cover sections related to a single team consecutively. We try to create the most intuitive order for content, the way a human would present it.
Next, we generate summaries for each cluster of content and glue them together. In the interest of transparency, we keep track of where we sourced our content from, so that users have access to full show notes later on. We try to ensure that the summaries focus on the content you will find most interesting and cut out everything else. At this point in the process, we don’t worry about flow or style.
Lastly, we take a final pass over the cobbled-together podcast sections and turn it into an actual transcript. This final pass is where the magic happens: where intros and conclusions are generated, where conversational asides are added, and the information turns into a dialogue.

There’s only one issue: LLM output is notoriously….boring. There are a couple of reasons for this. The main cause for the stifling of model creativity is the use of RHLF. Generally regarded as a step towards reliability, consistency, and a way to force model outputs towards what humans actually want, by design, it limits output diversity. There’s no such thing as “good” or “bad” entropy. There’s only entropy. A potential second cause for the monotony of LLMs is the phenomenon of “model collapse”, where models start converging towards outputs with very little variance. An axiom of machine learning is that an AI model is only as good as the data it was trained on. LLMs are predominately trained on text scraped from the web. Early models have the advantage of scraping primarily human-generated text. However, as the influence of LLMs grows, their outputs will inevitably litter the web. There’s a fear that if most future models are also trained by scraping the web, future models will train on data produced by their predecessors. The distribution of outputs from each successive model will lose its tails, creating successive models with less and less entropy.

All that being said, the monotony of commercial LLMs is difficult to overcome, and this has negative implications for our transcripts. The last thing we want is for users to lose interest in their podcasts because the episodes sound too similar to one another. Several potential solutions exist, each with their own tradeoffs. Of course, one could increase the temperature of a model, but it’ll only go so high, and in practice, appears to have limited impact. One could instead use a base model without post-training for a greater diversity of output, but this would result in very poor quality dialogue. Alternatively, one could attempt to carry out a style transfer, asking the model to mimic the features of real-world conversations, but this adds complexity to the pipeline.

Here’s an example of three different podcasts that suffer from repetitive style. Similarities between transcripts have been highlighted in the same color. You’ll notice that all three episodes open with “Welcome to ” and introduce the hosts immediately after with some version of “I’m Clara, and I’m Leo.” The two left transcripts then immediately follow with “Today we’re diving into…”. The right two transcripts then both try to be “funny” based on the overall podcast subject. It’s not wrong necessarily. One would expect the hosts of a human podcast to introduce themselves and the podcast, but something about these examples is too formulaic and doesn’t result in an engaging listening experience over multiple episodes.

Three podcast transcripts with repetitive elements. Parts of transcripts that are similar are highlighted in the same color.

On the other hand, here are some audio samples of successful, out-there styles we’ve managed to elicit in our podcasts:

Clara is angry:

Clara and Leo are stereotypically Gen-Z:

Leo is spaced out:

Clara and Leo are snobs:

They’re silly, arguably even entertaining, but are they practical? Personally, working on this product, one of the most surprising things has been that some of the greatest challenges are not technical at all, but rather product-based.

How often will a user want to listen to an episode where the hosts are angry or stereotypically Gen-Z?
Should the hosts have multi-episode personality arcs?
Should the hosts address you by name, or is that creepy?
Do people want to be presented with in-depth analyses or brief overviews?
We want you to prompt users to discover the myriad of features within our app, such as custom urls or episode chat, but is reminding them that they can access those features every episode overkill?

These decisions have the potential to make or break the product, but they can also be extremely polarizing. We’ve heard both sides to each argument both within our team and during user interviews. You may be reading this and thinking, “just make it all customizable, problem solved!” But then you run the risk of creating a product with too much complexity, too many knobs to turn, and an unintuitive UI. As a result, generating a successful transcript can feel like a losing battle, but the challenge is half the fun.

Title and Description

Before we can call an episode finished, we have to add a title and descriptive summary. This piece is relatively simple, but monotony complicates things once more.

Examples of podcast episode titles and descriptions lacking diversity. Similarities between podcasts are highlighted in the same color.

Above are screenshots of three consecutive episodes from a podcast about things to do in Seattle. Similarities between the images are again highlighted in the same color. The most glaring issue is that all the episodes start with the word “Seattle”, and the descriptions are no better. Two descriptions start with “this episode” and two use the phrase “dive into.” Even though the episodes cover different content, a user will likely have a very difficult time differentiating them. These seem like such small details, but can ultimately have a large impact on user satisfaction. From transcripts to titles, the problem of repetitive content is one that we are actively working to address, and we don’t yet have the perfect solution!

Cover Art

The final, customizable piece of our podcast, influenced by a user’s specifications, is the cover art. A few weeks ago, it didn’t exist, and the change to our app since adding art has been amazing, adding some much needed color and visual interest. The vast majority of our cover art is AI-generated, and the main challenge here was ensuring that the art we produce is cohesive, both in color scheme and style. It honestly took quite a bit of prompt engineering to come up with a prompt that produced consistent results. It’s funny how adding “do not include phone-related imagery” will often result in phone-related imagery being included every single time. In the end, your podcast cover art is generated in one of three ways.

To begin, we generated around 100 images covering a breadth of subjects: cars, space, art, etc. We designed our pipeline to always first check if your podcast covers a subject in one of these categories. If we already have relevant art, we’ll assign it to your podcast. In this way, we avoid unnecessary generation, saving both time and money.
If your podcast is on a subject unique to our pre-defined categories, say “zoos in North America”, we’ll try to generate cover art on the fly. Once the image is generated, we most likely have to downsize it, since large images will lead to latency when loading the app. If either the initial generation fails or we’re not able to resize the image while maintaining acceptable quality, we move to our final fallback mechanism.
If all else fails, we’ll assign your podcast abstract cover art. These are variations of a single image created with different crops, zooms, and color overlays.

Examples of podcast cover art. The left-most art is generated on-the-fly. The second from left image is chosen from a database of pre-generated images. The third is an example of abstract art being assigned to a podcast as a fallback when image generation or assignment fails.

Above, you’ll see examples of each of these three categories. The running podcast has art from one of our pre-defined categories, the DOGE podcast was unique enough to warrant custom generation, and the US News podcast has the fallback, abstract art.

Conclusion

When we started creating podcasts, I assumed that as long as we surfaced interesting content, people would immediately be hooked. I quickly learned that a successful podcast is so much more than a collection of information that interests you; it’s also about presentation. People listen to podcasts because of the hosts’ personalities, their off-the-wall opinions, and the banter between them. They click on episodes, because they have pretty cover art or the descriptions are particularly enticing. There’s so much beyond raw content that drives engagement, and a successful product has to take all those facets into account. At Continua, we’re continuously learning about and evolving to meet the needs of our users. In less than five months, we built our personalized podcast experience from the ground up, and I can’t wait to see where we are in five more. Join us next week for a deep dive into our podcast audio generation!

Making Sense of the Digital Deluge

Continua AI, Inc. — Fri, 21 Feb 2025 20:35:33 GMT

How do we at Continua choose what goes into your podcast? It’s a three-phase process. Based on your podcast preferences, we first scrape the web and extract relevant information. Next, we rank extracted content based on whether it will be interesting to you. Finally, we cluster that content, and these clusters form the basis of your podcast sections. Today’s post is going to focus on that final step.

Given a plethora of webpages that have been deemed interesting to you, we want to present that information in a coherent way. We can’t simply dump all the content into a prompt and ask an LLM to create you a podcast, because we’ll likely hit context window limits. Research has also shown that incredibly long prompts can negatively impact the capabilities of LLMs. Instead, we need to pick the most compelling content and segment it so that each call to the LLM does not exceed the context window. How do we segment the data? We cluster it.

We can’t throw all our content into a single prompt and ask the LLM to create a podcast. This will likely cause us to exceed context window limits and potentially diminish output quality. Instead, we cluster the web content into semantically similar pages and pass each cluster through an LLM independently.

Clustering is the process of grouping together objects that are similar, while distinguishing objects that are different. Realistically, the same content is likely to be covered by many media sources. The publications may have different spins or contain slightly different details, but at their core, they cover the same story. Ideally, those stories would be grouped together, so that in a single call to the LLM, we ask it to present the information on one subject. In this way, we prevent your podcast from covering the same content in multiple sections and simplify the task for the LLM, resulting in a higher quality output.

Generating Embeddings

Before we can cluster our documents, we must first transform the raw text into high-dimensional embedding vectors, which capture the semantic meanings of the documents. Models that generate embeddings are trained so that documents covering similar concepts are “closer” to one another in the embedding space. We define “closeness” as cosine similarity, or the cosine of the angle between two vectors. Intuitively, two vectors pointing the same direction will have a cosine similarity of 1, indicating maximum similarity; two vectors orthogonal to one another will have a cosine similarity of 0, indicating no similarity; and two vectors pointing in opposite directions will have a cosine similarity of -1, indicating maximum dissimilarity.

Agglomerative Clustering

Once we have embeddings, we need to segment the data into meaningful sets. Many different algorithms exist for clustering: among others, some partition the data into a pre-defined k clusters, some group the data based on the density of points, and some take a hierarchical approach, creating a tree of nested clusters. We avoid the first class of methods, since we want a variable cluster count. It makes little sense for us to force k clusters, when we don’t know ahead of time if our documents will cover a single subject or many. For instance, we wouldn’t want to force documents covering 10 semantically different topics into 5 clusters arbitrarily. We experimented with both density-based clustering, as well as hierarchical clustering, and settled on the latter, specifically agglomerative clustering. Agglomerative clustering starts with each data point as its own cluster and iteratively merges clusters together until all points are in a single cluster. Two clusters are merged based on a linkage criterion that determines whether they are “close enough” to one another. The linkage criterion can minimize the variance of the sets being merged, the average of distances between each pair of observations in the two sets, the maximum distance between any two observations in the two sets, or merge the sets that have the minimum of the distances between all observations in the two sets. The algorithm can be stopped early: either once there are k final clusters or all clusters are beyond some distance threshold from one another. We choose the latter approach, since we don’t want to always force k clusters. Unfortunately, we cannot choose a static distance threshold, since each batch of documents we cluster are likely to be differently distributed in space.

Imagine that we have one batch of documents covering various topics in world news, and there are clear clusters corresponding to US news, European news, and Middle East news. The distance threshold between these clusters may be quite high, as they’ll be covering quite distinct topics. Now, imagine that we have a batch of documents all discussing training for a marathon. These are likely to be much closer to one another in embedding space, but perhaps the best podcast would result if we were able to segment these documents based on different aspects of running, including nutrition, run schedule, and supplemental training. If we apply the same distance threshold for the marathon training podcast as we do for the news podcast, these documents are likely to end up in a single cluster. If we go the other direction, our world news podcast will cover more specific local stories, instead of providing an overview of news from each global region. Therefore, we need a way to define an adaptive distance threshold based on our current batch of embeddings.

Different batches of web content will have drastically different distributions in space, motivating the need for an adaptive distance threshold in agglomerative clustering. If we had the same threshold for every data batch, we would cluster things in a less than ideal way, as shown above.

In order to find the best distance threshold for a given batch of embeddings, we do a parameter search over the distribution of distances between points in our current batch, and choose the distance threshold that results in clusters that maximize average silhouette score. Silhouette score is a metric that balances the relationship between average intra-cluster distance and the average nearest-cluster distance, with the best score being 1 and the worst being -1. We exclude clusters with only one member from the silhouette score calculation.

Pruning Our Clusters

Once we have our final clusters, we have to choose which make the final cut for podcast generation. Let’s say we have a target value of y clusters for podcast generation. If our clustering process returns fewer than y clusters, we proceed to generation with limited content. More often, our clustering process returns many more than y clusters and we have to downsample. Recall that all documents were previously ranked based on perceived user interest. As a result, we can calculate an average rank per cluster and return the y clusters with the highest average user interest scores.

However, the process does not stop here. Even within a single cluster, we can have way too much textual information, forcing us to downsample further. We prune our final within-cluster documents based on both individual document rank and source. We believe that the best podcasts will not only cover the most interesting documents, but also documents from diverse, trustworthy sources. For example, instead of a podcast segment covering 10 articles from CNN, we believe a better listening experience comes from a segment based on articles sampled from CNN, BBC, and NPR. Given a cluster with N total documents from m sources, we assign a probability to each of the N documents based on how many documents come from each source. This probability is adjusted to account for the assumption that a higher volume of stories are likely to come from more reputable sources.

After all of that, we have the final content that will be used to generate a podcast! Come back next week to learn about the process of podcast transcript creation!

Filtering the firehose

Continua AI, Inc. — Thu, 20 Feb 2025 00:43:12 GMT

It's hard to overstate how much recent developments have changed what's possible for a small team, like ours at Continua, to do in a small amount of time. There’s a common saying that 20% of the work will get you 80% of the way there. In the last few years, LLMs have pushed that 20% down even lower. That's especially true in cases where real customer usage data would have been needed to start spinning up a machine learning based pipeline. This was definitely our experience while building out a system for selecting web content based on a user provided podcast description. For example, if a user tells us “I want to hear about AI startups in New York”, how can we efficiently find a set of recent webpages which match that interest? To solve this type of problem in the recent past, you'd need either a large amount of user behavior data, or expensive manually annotated data to build such a system, but as we'll describe below, you can now build a very capable system without needing either of those.

In our last post, we discussed how we crawl the web to find a large amount of interesting content. In this post, we'll explain how we were able to quickly build out a content filtering system which is both efficient and high quality. Why not just use a search engine, or an LLM provider like Perplexity? The easiest answer is price, but more importantly, traditional search wouldn’t cut it for this use case, or other agentic applications we build. To deliver the highest quality results, we think it’s essential to own the whole process from data cleaning, to indexing, to retrieval.

High recall first, High precision last

The standard approach for efficiently processing a large amount of data is to use a tiered system, where content passes through a series of successive filters, with each stage becoming more expensive and selective.1 With an infinite budget, we could just pass our entire stream of web content through a high quality model once for each candidate podcast, but that would be far too expensive. Instead, the approach is to pass the data through a series of filters, where the earlier ones prioritize recall (i.e. low false negative rate), and the later ones increasingly prioritize precision (i.e. low false positive rate).

Phase 1: Preprocessing

The first stage of our pipeline handles the basic processing needed for all the web content we ingest. Anything that will be needed for all future queries is computed at this stage, to avoid needlessly recomputing things in the future. The goal of this stage is to convert and filter raw web content to give us an optimized starting point for our search for relevant content.

Cleaning

Before we get to running any ML-based solutions on a webpage, we want to do our best to minimize the representation of that page. For example, for a blog post, there might be a header, footer, tons of CSS, javascript and so on in the HTML. Anything which isn't semantic which we fail to filter out at this point becomes a recurring cost at every future stage of the pipeline.

We use a few rule based methods which cover almost all web pages, and fall back to using an LLM to extract the page text in a small number of cases. The ideal output is a version of the document which contains all the useful information and as little excess as possible. There are even benefits to the downstream quality, since some work has shown that increasing the input length to LLMs can degrade their performance, even when the task is unchanged.

Filtering

Next, we want to remove documents which won't be of interest to any user. There are a lot of things on the web which, as far as a user looking for interesting information is concerned, are just junk. This junk includes (but definitely isn't limited to): Privacy policies, sign-up pages, shop pages, and many kinds of manipulative content. One particularly interesting example was a “news” website which plagiarized articles from other sources, and avoided detection by swapping words for synonyms. We use GPT-4o-mini to filter out as many of these as possible, reducing load and noise in every subsequent stage of the pipeline.

Summarization

Once we've minimized the text and thrown out the "junk", we compute features which will be used for retrieval in the future. The first of these are abstractive and extractive summaries, computed with GPT-4o-mini. By total expenditure, this step is the most expensive part of the pipeline, but using summaries instead of full-text for future stages allows us to avoid using the full document in later stages. This step is also a part of our pipeline that we'll be looking to substitute self-hosted models for early on, as we scale up.

Having computed the summary, we also compute an embedding of each summary for use in retrieval. The main benefit is that we don't need to worry about any issues which arise from pages longer than the maximum context of our embedding model. Another, less obvious, benefit is that similar style across different summaries means there won't be as many surface-level differences affecting embedding similarity search. For example, all else equal, one should expect a Markdown document to embed closer to another Markdown document than it would to an HTML document. We want to avoid that sort of bias in our retrieval, and focus purely on the content.

Second stage: Retrieval

The summaries and embeddings from the pre-processing phase form the data source we use to build all of our podcasts. For each podcast episode, we periodically query for new relevant content over the course of each day. This retrieval process is tuned to be high recall, as we're willing to catch a large number of false positives in order to avoid a single false negative. We use a combination of cosine similarity, and BM25.

Rather than making just a single query, we first query on a per-site basis, using both sites users have manually requested, and ones that we predict will be relevant based on their podcast description, and the historical data for those sites. To avoid missing out on potentially relevant content, we also make a query across our entire set of ingested pages for the day, increasing the odds that we find relevant topics to cover.

Third stage: Ranking

At this point, for each user's podcast, we have access to a much smaller set of possible candidate documents. For each one, we need to answer the question: "A user wants to hear about topic X, would document Y be relevant?". This stage is the one which looks the most different from how it might have looked a few years ago. The earlier stages (cleaning, filtering, retrieval) might all be possible using rule-based heuristics or off-the-shelf models, but this stage would require a custom trained model. Now, it's possible to get good results without training, but as we found, fine-tuning still offers major advantages in both accuracy and cost.

Version 1: Few-shot

To start with, we used Gemini 1.5 Flash with a few-shot prompt. This approach struck a good initial balance of price to quality, especially after some tuning of the prompt, but didn't meet our quality bar. One example I ran into early in testing was asking for a podcast about AI for genomics, for which the ranking model gave me way too many pages about just AI or just genomics. In testing, this behavior was improved significantly by using models like Gemini Pro or Claude 3.5 Sonnet, but those models were too expensive for this use case.

Even the SotA models weren't good enough, as they kept getting a lot of edge cases wrong. We found that even when we tried many prompt variations to avoid these cases, the models weren't following all the instructions simultaneously. At the end of the day, it is extremely hard (or maybe impossible) to encode exactly the behavior you want within a few thousand tokens of text. As a result, we decided to switch to fine-tuned models, in order to solve both the quality and cost problems simultaneously.

Version 2: Fine-tuned

To improve our ranking while keeping costs down, we turned to model distillation. We used the Gemini Flash fine-tuning API to distill results from Claude Sonnet models2. Distillation is probably best thought of as throwing out the parts of a big model you don't want to pay for. All we care about is Claude's ability to evaluate the relevance of web documents, not its coding capabilities or (not amazing) sense of humor. By generating data using Claude and fine-tuning Gemini Flash on said data, we can achieve much better results than we'd get by using Gemini Flash directly.

To generate data, we used a three step approach. First, we hand-wrote a small (< 20) number of podcast descriptions which are similar to the types of inputs we expect from users. Next, we used Claude to generate a large number of additional possible user queries. Finally, these inputs were used as the basis for creating a distillation set.

In order to get high quality distillation data, we used the best available (at the time) model, Claude 3.5 Sonnet. We used it to evaluate a large number of candidate web pages for relevance against the synthetic user inputs we produced in the previous step. However, even using the best model wasn’t enough, since we found certain classes of errors which we could prevent Claude from making (despite a large amount of prompt tweaking).

While we couldn’t get Claude to behave reliably in the context of a single call, it was able to notice that it made a mistake of a given type if it was specifically re-prompted to evaluate its answer and check whether that kind of mistake was present. While that approach would effectively double our costs if used in production, doubling the cost of producing a distillation set is a one-time cost, and well worth it.

For example, we found that if asked to rate content about “News about AI, or startups”, Claude would often only rate content highly if it was about both topics. By adding an explicit double check for this “union of topics” case, we were able to significantly reduce the false negative rate in our distillation set.

Wrapping up

The final product is a podcast which has just what the user asked for. For example, here’s the intro to my “Seattle update”, which gives me daily news for my neighborhood, and other specific topics I’m interested in:

And here are the pages that our system selected to generate that episode:

Our pipeline marries traditional information retrieval techniques with modern language models to create an efficient filtering system. It begins with filtering and cleaning of web data, moves into an embedding and retrieval stage designed to prioritize high recall over high precision, and finishes with a filtering process powered by relatively inexpensive yet finely tuned models that address our unique use case. The cost and speed at which we built this system would have been unimaginable just a few years ago, when the challenge of gathering extensive user preference data seemed insurmountable. Today, however, we can develop a high-quality solution with only a small amount of seed data, which showcases what small teams like ours can now achieve.

For larger scale examples, see this post from Meta about Instagram’s recommendation system, and this one from Amazon about information retrieval.

We're also fine-tuning models to self-host, to give ourselves more flexibility in terms of capacity.

Scraping the Web in 2025

Continua AI, Inc. — Fri, 14 Feb 2025 16:43:19 GMT

At Continua AI, we developed an AI podcast app which allows you to quickly create a customized podcast about anything you can imagine. Although our application is, at its core, powered by large language models, we wanted to create something more compelling than the kind of thing you get when you ask ChatGPT et al., “write me a podcast dialogue about topic X”. We imagined a product which could synthesize and analyze the torrential overload of information available on the internet and turn it into a customized podcast that you’d actually love to listen to, learn new things from, and help you stay informed about your desired topics. In order to have the freshest, most relevant and diverse information, we need to download, index, score, and retrieve information from the web.

Scraping the Web

You can’t really hope to scrape and index the entire internet on the cheap. Sure, other companies are doing that at scale, but the internet is a vast place with a lot of documents, and we’re a lean startup, so we have to be smart and judicious with what, how, when, and why we scrape in order to get the biggest bang for the buck. Here’s a few tips we learned along the way:

Scraping Tips

Don’t

Before you jump straight in, do you actually need to scrape (as in, send actual HTTP requests to actual servers in real-time)? Can you use Common Crawl, which provides an updated archive of web content that has already been scraped? If you can get away with using that source or something similar, your crawling costs are zero. Also, remember that it isn’t only your own costs, but also the cost to the site operator that you should consider. The more of us who skip this step (and yes, we decided to scrape directly - because we wanted fresh and timely updates so that the podcasts would have the latest information), the more site operators will invest in protecting their sites from scrapers, which will also disincentivize smaller site operators and publishers from distributing their content on the web in the first place. Everyone loses. So, if you don’t need to scrape, just don’t do it. And if you do need to do it, be sure to follow the other tips in this article.

Have I seen this URL before?

In your scraper, implement an external store of all the URLs you have scraped already so that you can skip them the second, third, forty-fifth time you see them. The most efficient way to send an HTTP request is to not send it, because you already did. Even better, load the Common Crawl index into your external store so that you don’t bother scraping pages that are already in the Common Crawl. Of course, this method may not work well with dynamic pages that change over time, but many pages on the internet will work well with this method. For dynamic pages, how often do you really need to re-scrape that page? Try to find the absolute maximum interval that your product can stomach.

Note that URLs will often appear with a bunch of unnecessary query arguments that are just for tracking. Be sure to strip these out (you can find a list online to start with, and also periodically add to it by monitoring the URLs you are scraping for patterns) before checking whether you have seen the URL before, otherwise, these links will constantly seem unique although they are really not.

IP proxies are not really “it” anymore

One of the common things that you’ll read online is that for scraping, you want to use a proxy to avoid getting blocked. The reality is that bot/scrape detection is much more advanced today than it was when a lot of this advice was first circulated and it can detect a likely bot/scraper using more sophisticated methods, like “fingerprinting” the browser using either simple (user agent, cookies, window size) or nuanced (render timing, feature flag detection, fonts, anti-aliasing) methods. So to make a long story short, yes, an IP proxy may help you scrape a few sites that detect and block you through other methods, but that is just one tool in the ever-expanding toolbox. If you operate a low-touch and well-behaved bot, you may be surprised to find out that many sites will let you pass despite coming from a cloud provider IP address. However, the landscape is constantly evolving. Today many large and small publishers use services provided by Cloudflare and Akamai and others to detect and block scrapers writ large.

Ultimately the tack that you have to take on anti-bot / anti-scrape avoidance depends on the sites that you are aiming to scrape. Some sites are much harder to scrape than others. Generally speaking, the more valuable the data is, the harder it is to scrape: social media, arbitrage opportunities (event tickets, airline/hotel/car rental pricing, cryptocurrency exchange, etc), premium news, these sites will typically work the hardest to protect their data from being scraped. You may want to think deeply about why that is the case before setting out immediately on the quest to crack open their defenses.

Save your work

It can be tempting to download a page over HTTP, then parse the HTML and extract what you need, and only save the extracted content. However, it is very common to realize only later that you didn’t extract the thing that you really needed, or perhaps it moved, or perhaps you just wish you had extracted something else, too. If this happens, you will have to re-scrape every single document in order to extract the other thing that you missed. Instead, save the full response from the original request in its entirety, so that you can go back and access it again later to perform extraction. We use cloud storage since our scrapers are distributed, but if you are just getting started, you can fit a lot of webpages on your hard disk! Make sure to not request images, pdfs, fonts, and any other large files that you don’t need (remember - don’t just not save them, make sure to also not request them in the first place).

Wait on line

As consumers, we expect websites and apps to load at blazing fast speeds. Site operators spend a lot of time and engineering energy to try to make this so. As scrapers proliferate, they can harm the ability of sites to serve their human users, because more and more of the server capacity can be used up serving clients which, unlike humans, are actually extremely latency insensitive. A human will quickly get bored and wander away, perhaps leaving that impulse buy lingering in their cart un-purchased, that article unread - a site operator’s worst nightmare. But a scraper has all the time in the world to wait! So be sure to rate-limit your scraper so that it is not consuming unnecessary resources from the server - for example, limit HTTP requests to the same domain to one every three seconds for a very conservative scrape that is unlikely to disrupt human traffic. Check robots.txt for the right amount of time to wait (the “Crawl-delay” directive) if the operator has configured it. Otherwise, there may not be a single “right” rate limit to choose, since it depends on the other clients of the site and the server resources available. One trick I recommend is to use latency as a signal, multiplied by three. So if a HTTP response takes 5 seconds, wait 15 seconds before the next request, giving other clients a chance. This is a way that any server can send the client a signal as to how loaded it is, regardless of whether it is a Microsoft IIS 2.0 server running on a forgotten VM or the latest Nginx build running on a horizontally auto-scaling cloud deployment.

Use open source

If you are just getting started, pick an existing scraping framework in your favorite language/ecosystem. Many of the top frameworks/libraries implement some of these “best practices” by default, or make them easier to achieve. It is tempting to whip out your ecosystem’s HTTP library and start blasting requests – try to resist that urge and learn from the prior art. It makes it easier to proliferate low-touch, well-behaved scrapers that will minimally escalate the anti-bot, anti-scrape arms race. I personally recommend Scrapy. It is likely that you will want to choose a framework that integrates with Playwright and/or Puppeteer in order to control a real browser (Scrapy supports this, as does Crawlee, as do other libraries). If you choose a library which mimics a browser but does not actually control a browser, eventually you may discover some site or anti-scrape detection that you simply cannot get around (although, even with a real browser, you can still be detected and blocked). However, for lightweight scraping, you may not find use of a browser necessary.

Conclusion

Data from the web is important for building new kinds of products and experiences. At Continua, data from the web feeds our AI podcasts product and our AI Agent Platform in general. However, in building a web data pipeline, developers should also consider the impact that scraping can have on site operators and use best practices that minimize the burden that our client activities place on the servers. The internet is a remarkable confluence of commercial and non-commercial interests cohabitating according to a certain set of norms and standards and scrapers should carefully consider their role in this delicate ecosystem.