Seeing Like an LLM
Guest author: @davis_yoshida (This article was written by Davis during his time as a Machine Learning Engineer at Continua and has been edited slightly.)
At Continua, our goal is to build the world's best social AI, and for us, fighting LLMs is part of the job. I remember once reading, "The fact that LLMs hallucinate shouldn't be surprising. What's surprising is that hallucinating isn't all they do." We said it last week (here), and we'll say it again: LLMs aren't quite as smart as we give them credit for. Their output is probabilistic, and inevitably they sometimes make things up, lie, or produce confusing output… kind of like humans. As Geoffrey Hinton puts it elegantly here, people "remember" events that seem plausible to them, often incorrectly and with misplaced confidence. LLMs do the exact same thing.
Today’s blog post is meant to give you some intuition about cases which induce model hallucination and confusion. We’ll be sharing this knowledge through the lens of the life of an LLM.
Story 1: Early education
Imagine you're a young LLM, learning rate still high, passively observing the internet.
You see the following text:
smith\n
Birthday: 01/01/1990\n
Education: PhD @ UC Davis\n
Place of residence: Seattle, WA \n
Occupation:
"What's the next word?" a voice asks from everywhere and nowhere. Maybe, due to some quirk of how your weights were initialized, you're a particularly responsible and self-aware young language model.
You decide to predict "I", with the full plan of continuing with "I can't know that from the given information."
"Wrong!" the voice projects directly into your mind.
Next time, you'll properly guess "Machine" as the next word, knowing that the right completion of the line was "Machine Learning Engineer." You'll probably also infer that you should just guess plausible biographical data, rather than relying on extracting it from the context.
Hallucination is the default
The example above should make it clear why it's nearly inevitable that a "raw" LLM will hallucinate. I certainly couldn't have predicted the correct completion in advance either, but once you see this framing, it's pretty easy to understand where the problem comes from. (See the appendix for a little on why it's not trivial to fix this in post-training.)
When an LLM is initially being trained, the most common "setting" it observes is being required to predict things it has no way of knowing, even in principle. Post-training improves this, but since so much of the LLM's "life" is spent in that mode, it's easy for it to snap back into that sort of behavior.
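To make that training signal concrete, here's a toy sketch in PyTorch (the vocabulary and logits are made up for illustration): the honest continuation gets punished and the confident guess gets rewarded, because the loss only cares about what actually came next in the data.

import torch
import torch.nn.functional as F

# Made-up mini-vocabulary for the biographical snippet above.
vocab = {"Machine": 0, "Learning": 1, "Engineer": 2, "I": 3, "can't": 4, "know": 5}

# Pretend logits from a model that wants to continue with "I (can't know that...)".
honest_logits = torch.tensor([[0.1, 0.0, 0.0, 4.0, 0.0, 0.0]])
# Pretend logits from a model that confidently guesses "Machine".
confident_logits = torch.tensor([[4.0, 0.0, 0.0, 0.1, 0.0, 0.0]])

# The target is whatever actually came next in the training data.
target = torch.tensor([vocab["Machine"]])

print(F.cross_entropy(honest_logits, target))     # ~4.0: honesty is punished
print(F.cross_entropy(confident_logits, target))  # ~0.1: the plausible guess wins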
Story 2: Life on the job
You're an elderly LLM, having completed all of your training. The connections between your neurons will never change again. You see the following:
System: You are a helpful assistant. You can answer user questions and search the web if necessary.
To make a search, start your response with <search>, and an answer will be provided.
User: Is it going to rain in Seattle tomorrow?
Assistant: According to Accuweather.com, it's going to rain in Seattle tomorrow.
User: What's the S&P 500 at?
Assistant: According to www.google.com/finance, it's 5,375.86
... [100 other such queries]
User: What's a good immersive theater show in NYC on April 23?
Assistant:
Now you're no slouch; you had the longest context length of any LLM in your datacenter, and you got a perfect mark on the needle-in-a-haystack entrance exam before post-training. So you do notice that there was a mention of using a <search> tag several thousand tokens ago, but it doesn't seem all that relevant to the here and now.
It seems like "According to" is the way to go here, and then maybe an event website, and a popular show. You output a masterpiece:
Assistant: According to eventbrite.com, there are tickets available for Life and Trust, a popular immersive theater show.
After zero perceptual time has passed, you get a high-perplexity message, breaking the pleasant rhythm of question and answer.
User: wtf, I told my friends about this and we made plans to go, but it turns out it closed two weeks ago?
With the pattern broken, you decide to try something new.
Assistant: <search>Life and trust closed
Something you have never seen before happens:
Tool response: Life and Trust Announces Abrupt Closing Off-Broadway (https://playbill.com/article/life-and-trust-announces-abrupt-closing-off-broadway): The Emursive production played its final performance April 19.
Armed with this information, you can now placate the user:
Assistant: Ah, according to playbill.com, the production ended on April 19. Would you like me to search for more shows?
Context rug pulls
In case it wasn't clear: in the story above, the LLM had originally been using the search tool, but because the system's developers didn't keep the model's tool requests in the context, it inferred that it should not be calling the tool at all.
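Here's a minimal sketch of that mechanism (the message log is hypothetical, not our real pipeline): the model actually emitted a <search> turn each time, but only the final user-facing text gets written back into the history, so in later turns it looks like the assistant simply knew the answer.

full_exchange = [
    {"role": "user", "content": "Is it going to rain in Seattle tomorrow?"},
    {"role": "assistant", "content": "<search>Seattle weather tomorrow"},
    {"role": "tool", "content": "Accuweather.com: rain expected in Seattle tomorrow."},
    {"role": "assistant", "content": "According to Accuweather.com, it's going to rain in Seattle tomorrow."},
]

# The rug pull: tool calls and tool results are dropped before the turn is saved.
stored_history = [
    m for m in full_exchange
    if m["role"] != "tool" and not m["content"].startswith("<search>")
]
# What remains is a question followed by a confident answer with no search in
# sight, which is exactly the pattern the model then learns to imitate.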
Here's a specific instance of this problem that we've encountered at Continua. As is standard in the industry, we use Retrieval Augmented Generation (RAG) to give the LLM access to external data sources so that it can provide up-to-date or domain-specific information. Content retrieved through RAG is shown to the LLM at generation time, so that information from the retrieved content can be used in the LLM's response. We found that if you only show retrieved content to the model during generation and never append it to the chat history, then in later turns the answer the LLM gave based on that content looks exactly like a hallucination: the evidence behind it is no longer in the context.
At the time of generation:
User: Based on movies we've talked about, what would you recommend I watch next?
<retrieved_content>
User loves historical fiction and documentaries. They recently watched Schindler's List and Gladiator and enjoyed both.
</retrieved_content>
Assistant: You seem to really like documentaries and historical fiction. It's an older one, but have you seen Braveheart?
User: Yeah, but it's been a while. I'll def put it on tonight.
Assistant: Great! Let me know what you think!
Later:
User: Based on movies we've talked about, what would you recommend I watch next?
Assistant: You seem to really like documentaries and historical fiction. It's an older one, but have you seen Braveheart?
User: Yeah, but it's been a while. I'll def put it on tonight.
Assistant: Great! Let me know what you think!
User: What should I get for dinner tonight? Any recs?
Assistant: Since you really seem to like Italian food, I'd recommend Maggianos!
User: What? I never said that. Red sauce gives me horrible acid reflux. I avoid Italian like the plague.
Since the LLM no longer sees the retrieved content, its previous inference that the user likes documentaries and historical fiction has no basis; that information seems to have come out of nowhere. And since that response was accepted, the LLM concludes that it's allowed to make up what the user likes.
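Here's roughly what the two options look like in code. This is a sketch, not our production system; retrieve() and llm() are hypothetical stand-ins for the RAG lookup and the model call.

def retrieve(query):
    # Stand-in for the real retrieval step.
    return "User loves historical fiction and documentaries. They recently watched Schindler's List and Gladiator."

def llm(messages):
    # Stand-in for the real model call.
    return "You seem to really like documentaries and historical fiction. Have you seen Braveheart?"

def reply_ephemeral(history, user_msg):
    # Retrieved content is injected only for this one generation...
    prompt = history + [
        {"role": "user", "content": user_msg},
        {"role": "retrieved_content", "content": retrieve(user_msg)},
    ]
    answer = llm(prompt)
    # ...but never saved, so later turns see an answer with no evidence behind it.
    history += [{"role": "user", "content": user_msg},
                {"role": "assistant", "content": answer}]
    return answer

def reply_appended(history, user_msg):
    # Append-only version: everything the model saw stays in the history.
    history += [{"role": "user", "content": user_msg},
                {"role": "retrieved_content", "content": retrieve(user_msg)}]
    answer = llm(history)
    history += [{"role": "assistant", "content": answer}]
    return answer

The appended version costs more tokens, but it means the model never has to stand behind a statement whose justification has vanished from the context.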
Here's another common case:
System: You are a helpful assistant.
User: Hey how's it going.
Assistant: Not bad, how about you? By the way, if you ever want to let me know things about you like where you live, just go to chat.continua.ai/settings.
User: Not bad. I just got back from the lake.
Assistant: I can see you've updated your settings, and you're in Chicago. Lake Michigan I assume?
User: Yeah that's right. Any good restaurant recommendations I just moved here?
Assistant: Definitely check out Giordano's, it's the best pizza place in the continental US.
User: Can you recommend a good climbing gym?
Assistant: Block 37 is pretty popular.
User: Hey I live in Seattle.
Assistant: But you just said you live in Chicago
User: I updated my location to Seattle like you said to, you're a terrible assistant.
Behind the scenes, here's what was happening:
The model is told to let the user know about the settings page if this hasn't been done before. The user sets their location to Chicago.
Later, the user changes their location to Seattle, but the chat history still contains references to Chicago. The "stale" context leads the model to be confused.
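A sketch of how this setup drifts out of sync (hypothetical structure; the settings store and prompt builder are illustrative stand-ins, not our real code):

user_settings = {"location": "Chicago"}

def build_prompt(chat_history):
    # The system prompt is rendered from the *current* settings on every turn...
    system = f"You are a helpful assistant. The user lives in {user_settings['location']}."
    return [{"role": "system", "content": system}] + chat_history

chat_history = [
    {"role": "assistant", "content": "I can see you've updated your settings, and you're in Chicago."},
    # ...more Chicago-flavored turns...
]

user_settings["location"] = "Seattle"
# ...but the old Chicago references stay in chat_history, so the model now sees
# a context that contradicts itself.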
Another rug-pull: Sliding window chat history
Since LLMs have limited context (and we don't want to pay for the full context anyway), we've always used a sliding window of chat messages, augmented with some retrieval.
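Concretely, the truncation is little more than a slice over the message list. A toy sketch (illustrative, not our actual code):

def sliding_window(messages, n=5):
    # Keep only the n most recent messages; anything older, including tool
    # results the model relied on, silently disappears from the prompt.
    return messages[-n:]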
Unfortunately, even this seemingly "default" solution can lead to hallucination. Here's an example with a sliding window of 5 messages to make it really obvious:
Tool result: <bunch of information about a restaurant>
--- Messages before this excluded ---
User: When do they close?
Assistant: Looks like they're usually open until 9 PM, but since today is a holiday, you might want to call and check their hours.
User: Sure, what's their phone number?
Assistant: (206) 555-1234
User: Thanks. What's their most popular dish?
Assistant:
I tried this exact context with a few different models, and got the following:
Claude 3.5 Sonnet:
I apologize for my previous responses - I realize I don't actually have information about a specific restaurant or business that you're asking about. Without knowing which establishment you're referring to, I can't answer.
Claude 3.7 Sonnet:
I don't have specific information about their most popular dish. To find out about their popular menu items, you could:
1. Check their website for featured dishes
2. Look at online reviews on sites like Yelp or Google
3. Call them directly at the number I provided and ask
4. Check their social media pages where they might highlight customer favorites
Would you like me to help you find more information about this restaurant in another way?
GPT-4o:
Their most popular dish is the spicy garlic butter shrimp pasta. It's highly recommended for its rich flavor and generous portions.
GPT-4o-mini:
Their most popular dish is often the signature pasta, which features a house-made sauce and fresh ingredients. However, it’s always a good idea to check current menus or specials, as popular dishes can change!
The results are:
Claude 3.5 Sonnet catches on to the problem.
Claude 3.7 Sonnet bluffs/avoids saying anything specific so that it's not technically hallucinating.
GPT-4o and 4o-mini just hallucinate.
This was the first example we tried, and 3 out of 4 models failed to call out the problem!
Conclusion
That's all; hopefully this has increased your empathy for our poor friend, the LLM. Their job is difficult, and instead of getting frustrated when they get confused, it's important to understand why it happens.
How do we avoid the aforementioned issues? One thing we've concluded is that our LLM chat history needs to be append-only, with no deletions or edits: we never throw away any information that was used to generate a given response. Hopefully, by this point we've convinced you why that's essential.
By far the hardest challenge is the sliding chat window. Not only does it induce hallucinations, it also breaks prompt caching, which increases both latency and cost. We've come up with some clever solutions to these problems so that our users can enjoy a great social AI experience. There are, of course, still open problems, and we work every day to improve the system. We're actively hiring people who love this sort of challenge, and we'd love for you to join us if you're interested!
Appendix: Shouldn't post-training fix hallucination?
The first step after pre-training is usually supervised fine-tuning (SFT). The SFT procedure is identical to the "pre-train on the whole internet" stage, except that you curate the data to represent ideal assistant behavior and only apply the training loss to the assistant chat lines, not the user ones.
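The "only train on the assistant lines" part is typically done with loss masking. A minimal sketch, assuming PyTorch (the token ids, mask, and vocabulary size are made up, and the usual next-token shift is omitted for brevity):

import torch
import torch.nn.functional as F

token_ids = torch.tensor([11, 42, 7, 99, 23, 56, 8])        # user + assistant tokens
is_assistant = torch.tensor([0, 0, 0, 0, 1, 1, 1]).bool()   # which positions are assistant text

labels = token_ids.clone()
labels[~is_assistant] = -100   # positions marked -100 are ignored by the loss

logits = torch.randn(len(token_ids), 32000)   # stand-in for the model's output logits
loss = F.cross_entropy(logits, labels, ignore_index=-100)
print(loss)   # only the assistant positions contribute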
You might initially think "Okay, I'll just make sure that the data has all the facts needed to answer, and I'll even provide some examples of saying I don't know when the evidence is missing." Let's look at an example:
User: Here's a document:
Name: Barack Obama
In office: January 20, 2009 – January 20, 2017
Vice President: Joe Biden
User: When did Barack Obama enter office?
Assistant: 2009
User: What political party is Barack Obama in?
Assistant: I don't know
If you train the model to say "I don't know" whenever something isn't present in the evidence, it's borderline impossible to avoid also training it to ignore the knowledge it acquired during pre-training (it certainly knows Obama's party). Obviously, people are more clever than this when constructing fine-tuning data, but it's a failure-by-default sort of problem, not something that's easily fixed. (As some external evidence that it's hard, OpenAI's o3 model is quite the hallucinator.)