Using a Hosted Shell for LLM Code Execution

Sometimes, you may want an LLM to generate some arbitrary code and run it. For example, LLMs are notoriously bad at math. In my book, where we wanted our agent to multiply two numbers, we sidestepped the whole issue by building a Python tool that multiplies numbers deterministically.

However, what if you need your LLM to be capable of doing any sort of math---not just multiplication? One approach would be to build a tool that executes arbitrary Python code, and you can simply let the LLM generate the code that gets passed into this tool. Here's a simple agent that does just this:

    import os
    import json
    from dotenv import load_dotenv
    from openai import OpenAI

    load_dotenv()
    llm = OpenAI()

    TOOLS = [
        {
            "type": "function",
            "name": "execute_math_code",
            "description": "Executes an arbitrary Python expression to perform math.",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {
                        "type": "string",
                        "description": "a Python expression"
                    },
                },
                "required": ["code"],
            },
        }
    ]

    def execute_math_code(code):
        return eval(code)

    def llm_response(history):
        response = llm.responses.create(
            model="gpt-4.1-mini",
            temperature=0,
            input=history,
            tools=TOOLS
        )
        return response

    def agent_loop(history):
        while True:
            response = llm_response(history)
            history += response.output
            tool_calls = [obj for obj in response.output if getattr(obj, "type", None) == "function_call"]
            if not tool_calls:
                break
            for tool_call in tool_calls:
                function_name = tool_call.name
                args = json.loads(tool_call.arguments)
                if function_name == "execute_math_code":
                    result = {"execute_math_code": execute_math_code(**args)}
                    history += [{"type": "function_call_output",
                                "call_id": tool_call.call_id,
                                "output": json.dumps(result)}]
        return response

    def system_prompt():
        return """You are a friendly AI assistant. If you ever need to do math, you MUST USE YOUR execute_math_code
        tool rather than try to do the math on your own. To do math, call this execute_math_code tool, whose parameter is
        a string of Python code that you'll generate.
        That is, you'll write a Python expression to perform the necessary
        math, and the tool will run your code, giving the accurate result."""

    assistant_message = "How can I help?"
    user_input = input(f"\nAssistant: {assistant_message}\n\nUser: ")
    history = [
        {"role": "developer", "content": system_prompt()},
        {"role": "assistant", "content": assistant_message},
        {"role": "user", "content": user_input}
    ]

    while user_input != "exit":
        response = agent_loop(history)
        print(f"\nAssistant: {response.output_text}")
        user_input = input("\nUser: ")
        history += [{"role": "user", "content": user_input}]

    print("****HISTORY****")
    print(history)

Here's what a sample conversation history looks like---or at least the gist of it:

    Assistant: How can I help?
    User: what is 6789 * 9876
    FUNCTION CALL: { execute_math_code({"code": "6789 * 9876"}) }
    FUNCTION CALL OUTPUT: { 67048164 }
    Assistant: The result of 6789 multiplied by 9876 is 67,048,164.
    User: What is the square root of 45 to the fourth power?
    FUNCTION CALL: { execute_math_code({"code": "(45**4)**0.5"}) }
    FUNCTION CALL OUTPUT: { 2025.0 }
    Assistant: The square root of 45 to the fourth power is 2025.0.

Sweet! Our agent can now flexibly calculate all sorts of math.

Note that our execute_math_code function can process any arbitrary code expression---not just math. It's just that we're directing our agent to (hopefully) only use this function when calculating math.

Now, allowing our LLM to run an arbitrary code script is Dangerous with a capital D. I spoke recently with a software engineer whose coding agent decided to delete his hard drive, so, well, yeah.
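If you only ever need arithmetic, one way to shrink the blast radius is to stop calling eval altogether and walk the expression's syntax tree yourself. The following is a minimal sketch, not something the app above does: safe_math_eval and its operator whitelist are hypothetical names I'm introducing for illustration, and the whitelist is an assumption about what counts as "math."

```python
import ast
import operator

# Hypothetical whitelist: only these arithmetic operators are accepted.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_math_eval(code):
    """Evaluate a purely arithmetic expression and reject everything else."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant):
            # Accept plain numbers only (bool is a subclass of int, so exclude it).
            if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
                return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attribute access, strings, etc. all land here.
        raise ValueError(f"disallowed expression: {type(node).__name__}")

    return _eval(ast.parse(code, mode="eval"))


print(safe_math_eval("6789 * 9876"))
```

With this in place, an expression such as `__import__('os').getcwd()` raises a ValueError instead of running. It narrows the risk rather than eliminating it (a huge exponent can still eat CPU and memory), and it gives up the "any sort of code" flexibility, which is exactly what a sandbox is meant to preserve.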

The Shell "Tool"

Conveniently, some LLM providers like OpenAI provide a hosted shell feature in which your agent can run arbitrary code in a cloud-hosted container. This is an easy way to sandbox your agent's code; after all, it's not running on your local machine. (OpenAI also provides a local shell option, which has some practical use cases, but that may be the subject of another blog post.)

Let's rewrite our previous app to make use of a hosted shell, which OpenAI treats as a tool. As such, we no longer have to write our own Python tool function:

    import os
    import json
    from dotenv import load_dotenv
    from openai import OpenAI

    load_dotenv()
    llm = OpenAI()

    # create "container" - that is a hosted shell environment:
    container = llm.containers.create(
        name="math-container",
        memory_limit="1g",
        expires_after={"anchor": "last_active_at", "minutes": 20},
    )

    TOOLS = [
        {
            "type": "shell",
            "environment": {
                "type": "container_reference",
                "container_id": container.id,
            },
        }
    ]

    def llm_response(history):
        response = llm.responses.create(
            model="gpt-5.2",
            input=history,
            tools=TOOLS
        )
        return response

    def agent_loop(history):
        while True:
            response = llm_response(history)
            history += response.output
            tool_calls = [obj for obj in response.output if getattr(obj, "type", None) == "function_call"]
            if not tool_calls:
                break
        return response

    def system_prompt():
        return """You are a friendly AI assistant. If you ever need to do math,
        you MUST USE YOUR shell tool rather than try to do the math on your own.
        To do math, write a code expression in the shell to perform the necessary
        math, and use the result."""

    assistant_message = "How can I help?"
    user_input = input(f"\nAssistant: {assistant_message}\n\nUser: ")
    history = [
        {"role": "developer", "content": system_prompt()},
        {"role": "assistant", "content": assistant_message},
        {"role": "user", "content": user_input}
    ]

    while user_input != "exit":
        response = agent_loop(history)
        print(f"\nAssistant: {response.output_text}")
        user_input = input("\nUser: ")
        history += [{"role": "user", "content": user_input}]

    print("****HISTORY****")
    print(history)

When I now ask the agent "what is 6789 * 9876", here's the essence of what gets appended to the conversation history:

    ResponseFunctionShellToolCall(
        id='sh_0a0bc2bf2f322434381a1987dd8b42c488743',
        action=Action(commands=["python3 - << 'PY'\nprint(6789*9876)\nPY"],
            max_output_length=None, timeout_ms=None),
        call_id='call_YA8AQSEvBUdfrP3ug2',
        environment=ResponseContainerReference(
            container_id='cntr_6993cfdd435asdf2e4230d5be',
            type='container_reference'),
        status='completed',
        type='shell_call',
        created_by=None),

    ResponseFunctionShellToolCallOutput(
        id='sho_0a0bc2bf2f32cd26006993cads8bed618a93c5c4be',
        call_id='call_YA8AQSEvBUdfrP3ug2',
        max_output_length=None,
        output=[Output(outcome=OutputOutcomeExit(exit_code=0, type='exit'),
            stderr='',
            stdout='67048164\n',
            created_by=None)],
        status='completed',
        type='shell_call_output',
        created_by=None)

In this case, the agent ran a short Python script in the shell:

python3 - << 'PY'\nprint(6789*9876)\nPY

This printed 67048164\n to the shell, and that output is then treated as the "result of the shell tool." That is, the SDK treats the use of the shell as a tool call, and the shell's output (stdout) is returned as the result of that tool call.

A single response contains both the ResponseFunctionShellToolCall and the ResponseFunctionShellToolCallOutput (as items in the response.output array), and our code adds both to the conversation history with history += response.output.
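Since the call and its output sit side by side in the history, it's handy to have a small helper that pairs them up by call_id. Here's a rough sketch. The function name summarize_shell_calls is my own invention, and it assumes only the attribute names visible in the printout above (type, call_id, action.commands, output, outcome.exit_code, stdout, stderr). The SimpleNamespace objects are stand-ins so the snippet runs without an API call.

```python
from types import SimpleNamespace


def summarize_shell_calls(items):
    """Pair each shell_call with its shell_call_output by call_id."""
    commands_by_call = {}
    summary = []
    for item in items:
        # Plain dict messages have no .type attribute, so they are skipped.
        item_type = getattr(item, "type", None)
        if item_type == "shell_call":
            commands_by_call[item.call_id] = list(item.action.commands)
        elif item_type == "shell_call_output":
            for chunk in item.output:
                summary.append({
                    "commands": commands_by_call.get(item.call_id, []),
                    "exit_code": getattr(chunk.outcome, "exit_code", None),
                    "stdout": chunk.stdout,
                    "stderr": chunk.stderr,
                })
    return summary


# Stand-ins shaped like the items shown above (not real SDK objects).
call = SimpleNamespace(
    type="shell_call",
    call_id="call_1",
    action=SimpleNamespace(commands=["python3 -c 'print(6789*9876)'"]),
)
output = SimpleNamespace(
    type="shell_call_output",
    call_id="call_1",
    output=[SimpleNamespace(
        outcome=SimpleNamespace(type="exit", exit_code=0),
        stdout="67048164\n",
        stderr="",
    )],
)

for entry in summarize_shell_calls([{"role": "user", "content": "hi"}, call, output]):
    print(entry)
```

In the real app, you'd pass history (or response.output) instead of the stand-ins.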

With this, the agent can output the correct answer, as can be seen in my example conversation with it:

    Assistant: How can I help?
    User: what is 6789 * 9876
    Assistant: 6789 × 9876 = **67,048,164**

The eagle-eyed reader may have noticed that I specifically used gpt-5.2 for the latest app. It turns out that the older models don't support this new hosted-shell feature.

Another interesting thing to note is that in the system prompt, I didn't tell the agent what programming language to use. It just happened to use Python.

Now, when I asked this shell-based agent "What is the square root of 45 to the fourth power?", I noticed some bizarre stuff in the conversation history. Specifically, it tried to use the shell six times. It seems that there were five failed attempts before it found success. Here are the six commands it tried running:

    python3 - <<'PY'\nimport sympy as sp\nx=sp.sqrt(45)**4\nprint(x)\nprint(sp.simplify(x))\nPY
    python3 - <<'PY'\nimport sympy as sp\nprint((sp.sqrt(45))**4)\nPY
    python3 -c "import sympy as sp; print((sp.sqrt(45))**4)"
    python3 -c "import math; print((math.sqrt(45))**4)"
    python3 - <<'PY'\nimport sympy as sp\nprint(sp.Integer(45)**2)\nPY
    python3 -c "print(45**2)"

Unfortunately, the OpenAI SDK doesn't output the result of most of these calls, so I can't see exactly what went wrong. Presumably, there were errors with importing the sympy library, as the hosted shell didn't have it installed.
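If the culprit really was a missing sympy (again, that's my guess, since I couldn't see the errors), a script can check whether an optional library is installed before leaning on it, and fall back to the standard library otherwise. Here's a small sketch of that pattern. The function name is hypothetical, and importlib.util.find_spec is the standard-library way to probe for a module without importing it.

```python
import importlib.util


def sqrt45_to_the_fourth():
    """Compute (sqrt(45))**4 exactly, using sympy only if it is installed."""
    if importlib.util.find_spec("sympy") is not None:
        import sympy as sp
        return int(sp.sqrt(45) ** 4)
    # Fallback: (sqrt(45))**4 equals 45**2, which needs only integer arithmetic.
    return 45 ** 2


print(sqrt45_to_the_fourth())
```

You could nudge the agent toward this style in the system prompt, for example by telling it to prefer the standard library, though I haven't tested whether that cuts down on the retries.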

Now, the fourth script did output a result of '2025.0000000000005\n', but I guess the agent wasn't satisfied with that result. (I guess the agent was right to keep going, since the exact answer is 2025, but I'm really curious as to how it made that determination.)
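The stray digits at the end come from floating-point rounding: the square root of 45 is irrational, so it gets rounded to the nearest float before being raised to the fourth power. This quick sketch (standard library only, nothing specific to the hosted shell) contrasts the float route with exact integer arithmetic:

```python
import math

approx = math.sqrt(45) ** 4  # sqrt(45) is rounded to a float, then raised
exact = 45 ** 2              # pure integer arithmetic, no rounding involved

print(approx)
print(exact)

# The two agree within floating-point tolerance, even if not digit for digit.
print(math.isclose(approx, exact))

# Reading the question the other way, as sqrt(45**4), also stays exact
# when done with the integer square root.
print(math.isqrt(45 ** 4))
```

So comparing against the integer result with math.isclose, or avoiding floats entirely, is a more reliable check than eyeballing the printed float.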

With persistence, the agent finally arrived at the correct answer, while also recognizing that 45 to the second power is equivalent to the square root of 45 to the fourth power.

It's unclear to me at the moment why the agent had an easier time with this problem when using the execute_math_code tool versus the shell tool. In either case, it's the LLM generating the code.

In any case, a hosted shell may be a useful option when you need to give your agent the flexibility to generate and run arbitrary code. However, it may be tricky if that code relies on external dependencies that the container doesn't have installed.

We'll explore shell-based features more in future posts, especially as they relate to agent skills.
