As I emphasize in my book, observability is key to understanding why our AI-powered systems behave the way they do. Without observability, LLM-driven bots and agents are a black box, and we can't debug them when they do strange or undesirable things.
That's why I am often frustrated by reasoning LLMs, whose behavior is driven by "reasoning tokens," and yet I - for many models - cannot see those tokens.
There's nothing inherently special about a model's reasoning tokens - they're text output just like any other LLM text output. It's just that the LLM has been trained to first output reasoning tokens which are hidden from the user, followed by the output that the user is supposed to see.
The operating idea here is that reasoning tokens are supposed to represent the LLM's Chain-of-Thought (CoT), which is generally used to break down a complex problem into simpler ones.
As I covered in a previous post, there are certain complex tasks that an LLM may not be able to solve head on. For example, Claude may not solve a math problem correctly:

This is incorrect, as the true answer is 71,563,932.
However, we can apply CoT by prompting the model to first break the problem into simpler components:

This time, the LLM solved the problem correctly.
The idea of reasoning tokens is to take all the problem-solving text and hide it from the user, as the user anyway probably just wants to see the final answer. Furthermore, by training an LLM to do reasoning, we don't rely on the user to prompt the bot to "break down the problem into simpler ones."
It makes sense to hide the reasoning tokens from the user. The problem, though, is that these tokens are often hidden from the developer too.
So if an agent that I build produces an incorrect answer, and I can't see the reasoning tokens behind that answer, I have little inkling as to what went wrong. Fixing the problem, then, becomes pure guesswork.
However, some models - especially open-source ones - do reveal the reasoning tokens to the developer. (And you could even choose to reveal these tokens to the user if that made sense for your particular application.)
Let's see this in action.
To start off, here's a bare-bones app that uses the Kimi model (from Moonshot AI) that I'm using via Together AI.
from dotenv import load_dotenvfrom together import Togetherload_dotenv()client = Together()response = client.chat.completions.create( model="moonshotai/Kimi-K2.6", messages=[ { "role": "user", "content": "Who was the first person to land on the Moon?" } ])print(response.choices[0].message.content)
The Kimi-K2.6 model uses reasoning by default. However, here we just display the regular tokens, not the reasoning ones.
On a test run, this produces:
The first person to land on the Moon was **Neil Armstrong**, an American astronaut. He stepped onto the lunar surface on July 20, 1969, during NASA's Apollo 11 mission, famously saying, "That's one small step for [a] man, one giant leap for mankind."He was followed about 20 minutes later by fellow astronaut **Buzz Aldrin**. The third crew member, **Michael Collins**, remained in orbit around the Moon in the command module.
Let's peer under the hood at the reasoning tokens that lead to such an answer. We'll update the code to see first the reasoning and then the "regular" output:
from dotenv import load_dotenvfrom together import Togetherload_dotenv()client = Together()response = client.chat.completions.create( model="moonshotai/Kimi-K2.6", messages=[ { "role": "user", "content": "Who was the first person to land on the Moon?" } ])print("REASONING:\n")print(response.choices[0].message.reasoning) # the reasoning tokens!print("\nANSWER:\n")print(response.choices[0].message.content)
The snippet response.choices[0].message.reasoning accesses the reasoning tokens, which in this case we print to the screen.
When I run this app, I get this:
REASONING:The user is asking a straightforward factual question: "Who was the first person to land on the Moon?" This is a well-known historical fact. The answer is Neil Armstrong. He was the commander of Apollo 11 and became the first human to set foot on the lunar surface on July 20, 1969. Buzz Aldrin followed about 19 minutes later, while Michael Collins remained in orbit in the command module.I should provide a clear, direct answer. I don't need to overcomplicate it, but adding the date and mission name (Apollo 11) provides helpful context. I should also mention Buzz Aldrin as the second person and Michael Collins as the astronaut who orbited, for completeness, but the core answer is Neil Armstrong.There are no safety concerns or policy violations here. It's a simple historical question.ANSWER:The first person to land on the Moon was **Neil Armstrong**, an American astronaut and the commander of NASA's Apollo 11 mission. He set foot on the lunar surface on **July 20, 1969**, famously saying, "That's one small step for [a] man, one giant leap for mankind."He was followed shortly after by fellow astronaut Buzz Aldrin, while Michael Collins piloted the command module in orbit around the Moon.
This is a great illustration of a case where, in truth, reasoning doesn't contribute much to the LLM's final output. Reasoning is helpful when the LLM needs to break down a complex problem or task into smaller components. But here, all the user wanted was the answer to a simple factual question.
Reasoning tokens cost money just like regular output tokens do, so in this instance all that extra reasoning was a waste of money. Also, the generation of the reasoning slows the app down.
Let's run this app again, this time with our "tricky" math question:
from dotenv import load_dotenvfrom together import Togetherload_dotenv()client = Together()response = client.chat.completions.create( model="moonshotai/Kimi-K2.6", messages=[ { "role": "user", "content": "What is 7788 times 9189?" } ])print("REASONING:\n")print(response.choices[0].message.reasoning)print("\nANSWER:\n")print(response.choices[0].message.content)
Here's what I got when I ran this:
REASONING:The user is asking for the product of 7788 and 9189. I need to calculate this accurately.Let me compute 7788 × 9189.I can do this using long multiplication or break it down:7788 × 9189 = 7788 × (9000 + 100 + 80 + 9)= 7788 × 9000 + 7788 × 100 + 7788 × 80 + 7788 × 9Calculate each term:- 7788 × 9000 = 7788 × 9 × 1000 = 70092 × 1000 = 70,092,000 Wait, let me check: 7788 × 9 = 70,092. Yes. So ×1000 = 70,092,000.- 7788 × 100 = 778,800- 7788 × 80 = 7788 × 8 × 10 = 62,304 × 10 = 623,040 Check: 7788 × 8: 7000×8 = 56,000 700×8 = 5,600 80×8 = 640 8×8 = 64 Sum: 56,000 + 5,600 = 61,600; 61,600 + 640 = 62,240; 62,240 + 64 = 62,304. Yes. ×10 = 623,040.- 7788 × 9 = 70,092 Check: 7000×9 = 63,000 700×9 = 6,300 80×9 = 720 8×9 = 72 Sum: 63,000 + 6,300 = 69,300; 69,300 + 720 = 70,020; 70,020 + 72 = 70,092. Yes.Now add them all up: 70,092,000+ 778,800+ 623,040+ 70,092First, let's add the last three:778,800 + 623,040 = 1,401,8401,401,840 + 70,092 = 1,471,932Now add to the first:70,092,000 + 1,471,932 = 71,563,932Let me double-check with another method to be sure.Alternative: Use (8000 - 212) × 9189= 8000×9189 - 212×91898000×9189 = 73,512,000212×9189:212 × 9000 = 1,908,000212 × 100 = 21,200212 × 80 = 16,960212 × 9 = 1,908Sum: 1,908,000 + 21,200 = 1,929,2001,929,200 + 16,960 = 1,946,1601,946,160 + 1,908 = 1,948,06873,512,000 - 1,948,068 = 71,563,932Same result! Good.Another check:7788 × 9189Let's do standard long multiplication: 7788 × 9189 -- 70092 (7788 × 9) 62304 (7788 × 8, shifted 1) → 623,040 7788 (7788 × 1, shifted 2) → 778,800 70092 (7788 × 9, shifted 3) → 70,092,000Sum: 70,092 623,040 778,800 70,092,000 ---- Adding:70,092 + 623,040 = 693,132693,132 + 778,800 = 1,471,9321,471,932 + 70,092,000 = 71,563,932Yes, confirmed multiple times. The answer is 71,563,932.ANSWER:7788 times 9189 is **71,563,932**.
This is super interesting. In the reasoning, the LLM not only broke down the problem into smaller steps, but performed multiple checks from different angles, helping ensure that the answer is correct. And in this case, it was!
When your LLM does something unexpected, viewing the reasoning can be a window into why things are going wrong. Let's take the following strange interaction with a coding agent - one that went viral on the web some time ago.
In short, the agent proposed a coding plan, and asked the user, "That's the full plan. Shall I implement it?"
The user replied, "No." Pretty straightforward - the user didn't like this plan and didn't want the agent to build it out.
And then a shocking thing happened: the agent started implementing the plan.
If we were the developers of this coding agent, how might we debug this strange behavior?
The mystery of why this failure happened is revealed once we view the reasoning tokens:

It turns out that in this case, it was the reasoning itself that twisted the agent into a pretzel. It reasoned, bizarrely, that since it was operating in "build" mode, the user's "No" was telling the agent to stop asking and start building.
As an aside, this goes to show how reasoning is not always ideal, and how an LLM might overthink in a way that leads to failure.
But the more important lesson here is that being able to see reasoning tokens is an important tool for debugging LLM-powered apps. Not all APIs let you view the reasoning, but if you can, you should. And you may even consider going out of your way to use a model that allows you to access the reasoning tokens.