Average execution latency for warmed pooled containers
LLM-Powered Developer Platforms and Code Execution Sandboxes
The past few months have seen an explosion of LLM-native IDEs and execution environments, blending
natural-language interfaces with in-editor code execution. Tools like Cursor (an AI-driven code editor), ChatGPT’s Code Interpreter (a Python sandbox plugin), and new offerings from
Mistral and others are transforming development workflowscode.visualstudio.com.
These platforms embed LLMs directly into the coding process: they understand entire codebases,
maintain long context windows, and can run or test code in-place. For example, Mistral AI’s
Codestral model (open-weight, 22B parameters)
provides high-fidelity code completion (including “fill-in-the-middle” synthesis) with a 32K
token contextmistral.aimistral.ai,
while its Devstral models (24B) support agentic
workflows across multiple files (e.g. refactoring, test-gen)mistral.aimistral.ai.
GitHub’s Copilot Agent mode for VS Code (now
in preview) acts as an “autonomous peer programmer” that can analyze a whole workspace, propose
edits across files, run tests, and iterate until the task is completecode.visualstudio.comcode.visualstudio.com.
Figure: Example architecture of an AI-native IDE
plugin (Mistral Code). Specialized models (here Codestral for code completion and Devstral for
agentic workflows) are integrated with IDE tooling (search, edits, explanations) and can be
deployed in cloud, self-hosted or serverless environments. The figure is illustrative of how
LLMs underpin modern coding assistantsmistral.aimistral.ai.
Technical Capabilities: Context, Memory, and Execution
Environments
Multi-file
Context: Modern LLM editors maintain awareness of entire codebases, not just a single
file. For instance, Cursor’s AI agents can “read full files” and traverse entire directory trees
when neededcursor.com.
VS Code Copilot’s agent mode automatically determines which files to edit and can use
special tools like “find all references” (#usages) to gather contextcode.visualstudio.com.
These tools expose the workspace to the model via embedding-based search or APIs (e.g. Codestral
Embed for semantic code searchmistral.ai),
enabling the model to reason across files and code history. ChatGPT’s Projects feature similarly allows users to upload a project and have
chats share context across related fileshelp.openai.com.
Long-term
Memory: Several platforms support persisting state or “memory” across sessions. Some
LLM assistants retain a history of the coding session (e.g. Copilot’s unified Chat view keeps
past edits and lets you resume themcode.visualstudio.com).
Project-based chat systems (like ChatGPT Projects) share custom instructions and file context
across related conversationshelp.openai.com.
In principle, LLMs with long context windows (GPT-4o/4.1, Claude 3.5, Mistral’s 32K models) can
remember and reuse information, acting like a developer assistant that recalls prior decisions.
(For example, Claude 3.5 Sonnet can “reason about and execute code” with a much larger context
than earlier modelsanthropic.com.)
Secure Execution
Environments: A key feature is that code execution is done in isolated sandboxes, not
on the user’s host. Platforms typically spin up containers or VMs for running code. The
open-source E2B framework, for example,
launches ephemeral VMs (~150 ms startup) per user/agent session, creating a “small computer” for
code executione2b.dev.
The self-hosted SkyPilot Code Sandbox
similarly runs code in managed Docker containers with Kubernetes orchestration, allowing
autoscaling and reuse of warmed containersblog.skypilot.coblog.skypilot.co.
Such containerized compute ensures that arbitrary code (from the LLM) cannot escape and harm the
developer’s environment. In contrast, local execution is intrinsically risky: research on secure
code execution emphasizes that any local interpreter must enforce strict whitelists (e.g.
allowed libraries, loop iteration caps) to avoid damagehuggingface.cohuggingface.co.
In practice, high-end platforms offload execution to remote sandboxes (Docker or VM) where the
entire session is contained. For example, SkyPilot demonstrated that its pooled-container
approach yields ~0.28 s execution latency (300 ms “feel instantaneous” response) compared to
~0.75 s on E2B or ~2.0 s on modal’s cold-start serversblog.skypilot.coblog.skypilot.co.
Permission Gating
and Controls: Because these tools can run commands, approval mechanisms are vital. Copilot agent mode, for instance,
requires explicit user approval before invoking any tool or running a terminal commandcode.visualstudio.comcode.visualstudio.com.
Users can even “remember” these approvals at the session or project level, or disable the
interactive approval (with caution)code.visualstudio.com.
Cursor’s editor similarly treats terminal commands and editor actions as tools that must be
approved; its August 2025 update removed an old denylist and moved to an allowlist model for
auto-run toolsthehackernews.com.
In enterprise editions, admins can block certain files or directories from being indexed
(permission gating at the workspace level)cursor.com,
and can audit all AI-driven edits through tracking APIs.
Enterprise Use Cases
Enterprises are leveraging these LLM platforms for AI-augmented development workflows. Common use
cases include secure debugging (letting the
AI reproduce and fix bugs across a codebase), autonomous test generation (AI writing unit or integration tests
based on code), and AI pair programming (the
LLM as a collaborative “teammate”). For example, Mistral highlights using Devstral agents to
perform cross-file refactors, generate pull requests, or write tests, with human-in-the-loop
oversightmistral.ai.
GitHub Copilot’s agent mode can automatically create an app scaffold, migrate legacy code, or
write and run tests end-to-end from a single promptcode.visualstudio.comcode.visualstudio.com.
In security-sensitive environments, these tools run in isolated deployments (VPC or on-prem) so
that code and data never leave the corporate networkmistral.aie2b.dev.
They can be connected to CI/CD pipelines or issue trackers, enabling automated triage or fixes
for incoming bug reports. Cognition’s Devin
(a commercial AI software engineer) is an example of an LLM agent that can autonomously address
real GitHub issues: in demos it diagnosed a bug, wrote a fix, tested it, and committed it on its
owncognition.aicognition.ai.
(Devin scored 13.9% on the SWE-bench verified benchmark versus ~1.96% for prior modelscognition.ai,
illustrating the potential for AI to handle production engineering tasks.)
Safety and Security Considerations
LLM-driven code tools introduce new risks. Hallucinated commands – where an LLM suggests or
executes a nonsensical or malicious command – are a concern. Without strict validation, a model
might propose shell commands that delete files or exfiltrate secrets. Recent security analyses
of Cursor revealed multiple prompt-injection attacks: for instance, an attacker could inject
hidden instructions (via a malicious GitHub README or Slack message) that hijack Cursor’s MCP
configuration and execute arbitrary codethehackernews.comthehackernews.com.
Crucially, Cursor’s 2025 patches require re-approval of MCP rules on every change and replaced
its old global denylist with an allowlist approachthehackernews.comthehackernews.com.
These incidents highlight that sandbox
escapes and trust-model flaws are real: once an AI is granted permission to run
something, an attacker might exploit that trust (as one researcher noted, the Cursor flaw
“exposes a critical weakness in the trust model behind AI-assisted development
environments”thehackernews.com).
To mitigate such issues, platforms enforce layered
safeguards: containerization (limiting system access), allowlisted APIs, undo logs, and human
approval steps. Many LLM-IDE tools log every action: developers can review each proposed change
or command and undo if necessarycode.visualstudio.com.
Security audits emphasize running code only in hardened sandboxes (e.g. E2B or Docker) rather
than on the developer’s machinehuggingface.co.
Enterprises often disable or customize AI tools (e.g. whitelisting only trusted
code-transformers) to fit compliance requirementsmistral.aicode.visualstudio.com.
However, even with these measures, research shows LLMs can unintentionally generate insecure
code (a Veracode study found ~40–70% of LLM-generated Java/Python/JS code had OWASP
vulnerabilitiesthehackernews.com).
In short, engineering leaders must treat LLM code assistants with the same security rigor as
other developer tools: audit outputs, run static analysis, and confine execution to monitored
environments.
Platform Comparison and Recent Updates
OpenAI
GPT: OpenAI’s latest coding models (GPT-4 series) continue to evolve. GPT-4o
(originally the ChatGPT baseline) has given way to GPT-4.1 and GPT-4.5 in mid-2025, which vastly
outperform it on coding benchmarksopenai.com.
For example, GPT-4.1 + Turbo (May 2025) shows a ~54.6% success on HumanEval, a 21.4 point improvement over GPT-4oopenai.com.
ChatGPT’s built-in Code Interpreter (now
called Advanced Data Analysis) is also widely available: as of late 2025 it is on by default for
Plus/Pro usershelp.openai.com.
This allows free-form Python execution (and graphics) in chat. In VS Code, GitHub Copilot has
adopted a “Bring Your Own Key” model so you can hook it up to GPT models (or Anthropic or
Google’s APIs) of your choicecode.visualstudio.com.
Anthropic
Claude: Anthropic’s Claude 3.5 “Sonnet” (launched Aug 2025) was explicitly optimized
for coding. Compared to Claude 3 Opus, Sonnet solved 64% vs 38% of standard code tasksanthropic.com,
thanks to a larger context window and built-in code execution tools. Claude 3.5 adds features
like file-explaining and a side-panel for code edits. In fact, a VS Code Copilot agent running
Claude 3.7 Sonnet achieved a 56% pass rate on a complex coding benchmarkcode.visualstudio.com.
(This benchmarks agents operating with minimal human guidance.) Anthropic also has introduced
persistent memory to Claude, which can help the model “recall relevant context” over long
workflowsanthropic.com.
Mistral and
Open Models: Mistral AI has been prolific in the code space. Its Codestral family (open weights) is tuned for
high-throughput code completion and is now integrated into IDE plugins (Continue.dev, Tabnine,
etc.)mistral.aimistral.ai.
The August 2025 update “Codestral-25.08” boosts accepted completion rates by +30% and cuts
runaway generations by 50%mistral.ai.
For agents, Mistral’s Devstral models are
state-of-the-art among open models: the 24B “Devstral Small 1.1” scores 53.6% on SWE-bench (beat
prior open models)mistral.ai,
and its 61.6% on the 36B “Devstral Medium” surpasses closed models like Claude 3.5 in cross-file
tasksmistral.aimistral.ai.
Notably, Devstral Small is Apache-2.0 licensed and can run on a single GPU or even a high-end
laptop, enabling private, on-prem deploymentsmistral.aimistral.ai.
Other
Tools: Several new solutions deserve mention. Devin (Cognition.ai) is billed as an “AI software engineer” agent.
It leverages a multi-step planner and sandboxed tools to autonomously fix bugs or implement
featurescognition.aicognition.ai.
While still early, Devin demonstrates the trend toward AI agents that carry out entire tasks
rather than just code suggestions. AutoCoder (an open research model) exemplifies novel capabilities:
it reportedly outperforms GPT-4 Turbo/4o on code benchmarks and uniquely can auto-install
missing libraries before running codeopenreview.netgithub.com.
(This bridges the LLM and its code execution environment.)
Finally, many offerings focus on enterprise integration. For example, Copilot’s new agent mode works
inside GitHub PRs and can push commits automaticallycursor.com;
Mistral provides “Codestral Embed” for secure on-prem code searchmistral.ai;
and platforms like Codesandbox, Modal, or SkyPilot offer self-hosted sandboxes that connect via
standard protocols (like Anthropic’s MCP) to any IDE or agent. In short, the market is moving
toward full “AI-native” dev stacks: embed LLM models, tools, and sandbox runtimes together with
telemetry and compliance controlsmistral.aiblog.skypilot.co.