MultiAlpha Research
Premium Market Intelligence

Sep 2025
Premium Report
Devstral SWE-bench
61.6%
SWE-bench Verified score for Devstral Medium
Codestral Completion
+30%
Accepted completion rate improvement
Sandbox Exec Latency
~0.28 s
Average execution latency for warmed pooled containers

LLM-Powered Developer Platforms and Code Execution Sandboxes

The past few months have seen an explosion of LLM-native IDEs and execution environments that blend natural-language interfaces with in-editor code execution. Tools like Cursor (an AI-driven code editor), ChatGPT's Code Interpreter (a sandboxed Python environment), and new offerings from Mistral and others are transforming development workflows (code.visualstudio.com). These platforms embed LLMs directly into the coding process: they understand entire codebases, maintain long context windows, and can run or test code in place. For example, Mistral AI's Codestral model (open-weight, 22B parameters) provides high-fidelity code completion, including "fill-in-the-middle" synthesis, with a 32K-token context (mistral.ai), while its Devstral models (24B) support agentic workflows across multiple files, such as refactoring and test generation (mistral.ai). GitHub Copilot's agent mode for VS Code (now in preview) acts as an "autonomous peer programmer" that can analyze a whole workspace, propose edits across files, run tests, and iterate until the task is complete (code.visualstudio.com).

Figure: Example architecture of an AI-native IDE plugin (Mistral Code). Specialized models (Codestral for code completion, Devstral for agentic workflows) are integrated with IDE tooling (search, edits, explanations) and can be deployed in cloud, self-hosted, or serverless environments. The figure illustrates how LLMs underpin modern coding assistants (mistral.ai).

Technical Capabilities: Context, Memory, and Execution Environments

Multi-file Context: Modern LLM editors maintain awareness of entire codebases, not just a single file. For instance, Cursor's AI agents can "read full files" and traverse entire directory trees when needed (cursor.com). VS Code Copilot's agent mode automatically determines which files to edit and can use special tools like "find all references" (#usages) to gather context (code.visualstudio.com). These tools expose the workspace to the model via embedding-based search or APIs (e.g., Codestral Embed for semantic code search, mistral.ai), enabling the model to reason across files and code history. ChatGPT's Projects feature similarly allows users to upload a project and have chats share context across related files (help.openai.com).
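To make the embedding-based retrieval step concrete, here is a minimal, hypothetical sketch of how a plugin might index a workspace and fetch the files most relevant to a request. The `embed` function here is a deliberately toy stand-in for a real code-embedding API such as Codestral Embed; the chunking and ranking shown are illustrative, not any vendor's actual implementation.

```python
import math
from collections import Counter
from pathlib import Path

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-tokens vector. A real plugin would call a
    code-embedding API (e.g. Codestral Embed) and get a dense vector back."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def index_workspace(root: str) -> dict:
    """Embed every Python file so the agent can search the whole codebase."""
    return {str(p): embed(p.read_text(errors="ignore"))
            for p in Path(root).rglob("*.py")}

def top_k_files(index: dict, query: str, k: int = 5) -> list:
    """Return the k files most semantically similar to the user's request."""
    qv = embed(query)
    return sorted(index, key=lambda p: cosine(index[p], qv), reverse=True)[:k]

# Usage: rank workspace files against a natural-language task description.
# index = index_workspace("./my_project")
# print(top_k_files(index, "where do we parse the config file?"))
```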

Long-term Memory: Several platforms support persisting state or "memory" across sessions. Some LLM assistants retain a history of the coding session (e.g., Copilot's unified Chat view keeps past edits and lets you resume them, code.visualstudio.com). Project-based chat systems like ChatGPT Projects share custom instructions and file context across related conversations (help.openai.com). In principle, LLMs with long context windows (GPT-4o/4.1, Claude 3.5, Mistral's 32K models) can remember and reuse information, acting like a developer assistant that recalls prior decisions. (For example, Claude 3.5 Sonnet can "reason about and execute code" with a much larger context than earlier models, anthropic.com.)
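As an illustration of the pattern, the hypothetical sketch below persists session "memory" across restarts by appending notes about prior decisions to a JSON file and prepending them to later prompts. Real products implement memory very differently; this only demonstrates the general mechanism, and the file name is an assumption.

```python
import json
from pathlib import Path

MEMORY_FILE = Path(".assistant_memory.json")  # hypothetical per-project store

def remember(note: str) -> None:
    """Append a decision or fact so future sessions can recall it."""
    notes = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else []
    notes.append(note)
    MEMORY_FILE.write_text(json.dumps(notes, indent=2))

def build_prompt(user_request: str) -> str:
    """Prepend remembered context to the new request, budgeting for the
    model's context window (long-context models can afford far more)."""
    notes = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else []
    context = "\n".join(f"- {n}" for n in notes[-20:])  # most recent notes only
    return f"Known project decisions:\n{context}\n\nTask: {user_request}"

remember("We use pytest, not unittest, for all new tests.")
print(build_prompt("Add tests for the parser module."))
```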

Secure Execution Environments: A key feature is that code execution happens in isolated sandboxes, not on the user's host. Platforms typically spin up containers or VMs for running code. The open-source E2B framework, for example, launches ephemeral VMs (~150 ms startup) per user/agent session, creating a "small computer" for code execution (e2b.dev). The self-hosted SkyPilot Code Sandbox similarly runs code in managed Docker containers with Kubernetes orchestration, allowing autoscaling and reuse of warmed containers (blog.skypilot.co). Such containerized compute ensures that arbitrary code from the LLM cannot escape and harm the developer's environment. In contrast, local execution is intrinsically risky: research on secure code execution emphasizes that any local interpreter must enforce strict whitelists (e.g., allowed libraries, loop-iteration caps) to avoid damage (huggingface.co). In practice, high-end platforms offload execution to remote sandboxes (Docker or VM) where the entire session is contained. For example, SkyPilot reports that its pooled-container approach yields ~0.28 s execution latency, under the ~300 ms threshold at which responses feel instantaneous, compared with ~0.75 s on E2B or ~2.0 s on Modal's cold-start servers (blog.skypilot.co).
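The sketch below shows the basic container-isolation pattern under simple assumptions: untrusted code is written to a temp directory and run via the Docker CLI with networking disabled and CPU, memory, and process limits applied. The image and limits are illustrative; production sandboxes (E2B, SkyPilot, etc.) layer much more on top, such as container pooling, orchestration, and hardened runtimes.

```python
import subprocess
import tempfile
from pathlib import Path

def run_untrusted(code: str, timeout_s: int = 10) -> subprocess.CompletedProcess:
    """Run LLM-generated Python inside a locked-down, throwaway container."""
    workdir = Path(tempfile.mkdtemp())
    (workdir / "snippet.py").write_text(code)
    cmd = [
        "docker", "run", "--rm",
        "--network=none",          # no exfiltration over the network
        "--memory=256m",           # cap RAM
        "--cpus=0.5",              # cap CPU
        "--pids-limit=64",         # stop fork bombs
        "--read-only",             # immutable root filesystem
        "-v", f"{workdir}:/work:ro",
        "python:3.12-slim",
        "python", "/work/snippet.py",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)

result = run_untrusted("print(sum(range(10)))")
print(result.stdout)  # -> 45 (assuming Docker is installed locally)
```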

Permission Gating and Controls: Because these tools can run commands, approval mechanisms are vital. Copilot agent mode, for instance, requires explicit user approval before invoking any tool or running a terminal command (code.visualstudio.com). Users can even "remember" these approvals at the session or project level, or disable the interactive approval entirely (with caution) (code.visualstudio.com). Cursor's editor similarly treats terminal commands and editor actions as tools that must be approved; its August 2025 update removed an old denylist and moved to an allowlist model for auto-run tools (thehackernews.com). In enterprise editions, admins can block certain files or directories from being indexed (permission gating at the workspace level, cursor.com) and can audit all AI-driven edits through tracking APIs.
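To illustrate the allowlist-plus-approval pattern, here is a hypothetical gate an agent runtime might place in front of every shell command: only commands whose first token is on the allowlist may auto-run, everything else requires interactive confirmation, and the user can remember an approval for the session. This sketches the general mechanism, not how Copilot or Cursor actually implement it.

```python
import shlex

AUTO_RUN_ALLOWLIST = {"ls", "cat", "git", "pytest"}  # illustrative allowlist
_session_approvals: set[str] = set()                  # "remembered" approvals

def gate_command(command: str) -> bool:
    """Return True if the agent may execute `command`."""
    tool = shlex.split(command)[0]
    if tool in AUTO_RUN_ALLOWLIST or tool in _session_approvals:
        return True
    answer = input(f"Agent wants to run {command!r}. Allow? [y/N/always] ").strip().lower()
    if answer == "always":
        _session_approvals.add(tool)  # remember for the rest of the session
        return True
    return answer == "y"

if gate_command("rm -rf build/"):
    print("approved; the sandboxed runner would execute it here")
else:
    print("denied; command dropped")
```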

Enterprise Use Cases

Enterprises are leveraging these LLM platforms for AI-augmented development workflows. Common use cases include secure debugging (letting the AI reproduce and fix bugs across a codebase), autonomous test generation (AI writing unit or integration tests based on code), and AI pair programming (the LLM as a collaborative "teammate"). For example, Mistral highlights using Devstral agents to perform cross-file refactors, generate pull requests, or write tests, with human-in-the-loop oversight (mistral.ai). GitHub Copilot's agent mode can automatically create an app scaffold, migrate legacy code, or write and run tests end-to-end from a single prompt (code.visualstudio.com). In security-sensitive environments, these tools run in isolated deployments (VPC or on-prem) so that code and data never leave the corporate network (mistral.ai, e2b.dev). They can also be connected to CI/CD pipelines or issue trackers, enabling automated triage or fixes for incoming bug reports (a minimal sketch of such a hook follows below). Cognition's Devin (a commercial AI software engineer) is an example of an LLM agent that can autonomously address real GitHub issues: in demos it diagnosed a bug, wrote a fix, tested it, and committed it on its own (cognition.ai). (Devin resolved 13.86% of issues end-to-end on the SWE-bench benchmark, versus 1.96% for the prior state of the art (cognition.ai), illustrating the potential for AI to handle production engineering tasks.)
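As a sketch of the issue-tracker integration just mentioned, the hypothetical handler below receives a bug-report webhook, asks an agent (any chat-completion backend would do) for a proposed fix, and stages a draft change for human review rather than merging anything automatically. The route name and payload fields are assumptions for illustration.

```python
from flask import Flask, request  # pip install flask

app = Flask(__name__)

def ask_agent(prompt: str) -> str:
    """Stand-in for a call to your LLM agent / chat-completion endpoint."""
    return f"(proposed patch for: {prompt[:60]}...)"

@app.post("/webhooks/issues")  # hypothetical route your tracker posts to
def on_new_issue():
    issue = request.get_json()
    title, body = issue.get("title", ""), issue.get("body", "")
    patch = ask_agent(f"Reproduce and propose a fix.\nTitle: {title}\n{body}")
    # Open a *draft* PR for human review; never auto-merge agent output.
    print(f"Would open draft PR with:\n{patch}")
    return {"status": "triaged"}, 200

if __name__ == "__main__":
    app.run(port=8080)
```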

Safety and Security Considerations

LLM-driven code tools introduce new risks. Hallucinated commands, where an LLM suggests or executes a nonsensical or malicious command, are a concern. Without strict validation, a model might propose shell commands that delete files or exfiltrate secrets. Recent security analyses of Cursor revealed multiple prompt-injection attacks: for instance, an attacker could inject hidden instructions (via a malicious GitHub README or Slack message) that hijack Cursor's MCP configuration and execute arbitrary code (thehackernews.com). Crucially, Cursor's 2025 patches require re-approval of MCP rules on every change and replaced its old global denylist with an allowlist approach (thehackernews.com). These incidents highlight that sandbox escapes and trust-model flaws are real: once an AI is granted permission to run something, an attacker might exploit that trust (as one researcher noted, the Cursor flaw "exposes a critical weakness in the trust model behind AI-assisted development environments," thehackernews.com).

To mitigate such issues, platforms enforce layered safeguards: containerization (limiting system access), allowlisted APIs, undo logs, and human approval steps. Many LLM-IDE tools log every action: developers can review each proposed change or command and undo it if necessary (code.visualstudio.com). Security audits emphasize running code only in hardened sandboxes (e.g., E2B or Docker) rather than on the developer's machine (huggingface.co). Enterprises often disable or customize AI tools (e.g., whitelisting only trusted code transformers) to fit compliance requirements (mistral.ai, code.visualstudio.com). However, even with these measures, research shows LLMs can unintentionally generate insecure code (a Veracode study found ~40-70% of LLM-generated Java/Python/JS code had OWASP vulnerabilities, thehackernews.com). In short, engineering leaders must treat LLM code assistants with the same security rigor as other developer tools: audit outputs, run static analysis, and confine execution to monitored environments.
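One cheap layer of that rigor can run before any human even reviews the output: statically scanning generated code for obviously dangerous constructs. The hypothetical checker below walks Python's AST and flags snippets that call `eval`/`exec` or import modules outside an allowlist. It is a first-pass filter in the spirit of the whitelist approach described above, not a substitute for a real SAST tool.

```python
import ast

ALLOWED_IMPORTS = {"math", "json", "re", "collections"}  # illustrative whitelist
BANNED_CALLS = {"eval", "exec", "compile", "__import__"}

def audit_generated_code(source: str) -> list[str]:
    """Return a list of policy violations found in LLM-generated code."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            names = ([a.name for a in node.names] if isinstance(node, ast.Import)
                     else [node.module or ""])
            for name in names:
                if name.split(".")[0] not in ALLOWED_IMPORTS:
                    problems.append(f"disallowed import: {name}")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BANNED_CALLS:
                problems.append(f"banned call: {node.func.id}()")
    return problems

print(audit_generated_code("import os\neval('2+2')"))
# -> ['disallowed import: os', 'banned call: eval()']
```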

Platform Comparison and Recent Updates

OpenAI GPT: OpenAI's latest coding models (the GPT-4 series) continue to evolve. GPT-4o (originally the ChatGPT baseline) has given way to GPT-4.1 and GPT-4.5, released in the first half of 2025, which substantially outperform it on coding benchmarks (openai.com). For example, GPT-4.1 (April 2025) scores 54.6% on SWE-bench Verified, a 21.4-point improvement over GPT-4o (openai.com). ChatGPT's built-in Code Interpreter (now called Advanced Data Analysis) is also widely available: as of late 2025 it is on by default for Plus/Pro users (help.openai.com). This allows free-form Python execution (and graphics) in chat. In VS Code, GitHub Copilot has adopted a "Bring Your Own Key" model, so you can hook it up to GPT models (or Anthropic's or Google's APIs) of your choice (code.visualstudio.com).
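In the same spirit as BYOK, most of these assistants speak an OpenAI-compatible chat-completions API, so pointing a client at a different provider is largely a matter of swapping the key and base URL. A minimal sketch with the official `openai` Python client follows; the model name and endpoint URL are placeholders you would replace with your provider's values.

```python
import os
from openai import OpenAI  # pip install openai

# Point the same client at any OpenAI-compatible endpoint by swapping
# the API key and base URL (both values here are placeholders).
client = OpenAI(
    api_key=os.environ["MY_PROVIDER_API_KEY"],
    base_url="https://api.example-provider.com/v1",
)

response = client.chat.completions.create(
    model="my-coding-model",  # placeholder model name
    messages=[
        {"role": "system", "content": "You are a careful coding assistant."},
        {"role": "user", "content": "Write a Python function that reverses a string."},
    ],
)
print(response.choices[0].message.content)
```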

Anthropic Claude: Anthropic's Claude 3.5 Sonnet (launched June 2024) was explicitly optimized for coding. Compared with Claude 3 Opus, Sonnet solved 64% versus 38% of tasks in an internal agentic coding evaluation in which the model could reason about and execute code (anthropic.com). Claude 3.5 adds features like file explanation and a side panel for code edits. In fact, a VS Code Copilot agent running Claude 3.7 Sonnet achieved a 56% pass rate on SWE-bench Verified (code.visualstudio.com). (This benchmarks agents operating with minimal human guidance.) Anthropic has also introduced persistent memory to Claude, which can help the model "recall relevant context" over long workflows (anthropic.com).

Mistral and Open Models: Mistral AI has been prolific in the code space. Its Codestral family (open weights) is tuned for high-throughput code completion and is now integrated into IDE plugins (Continue.dev, Tabnine, etc.) (mistral.ai). The August 2025 update, Codestral 25.08, boosts accepted completion rates by 30% and cuts runaway generations by 50% (mistral.ai). For agents, Mistral's Devstral models are state of the art among open models: the 24B Devstral Small 1.1 scores 53.6% on SWE-bench Verified, beating prior open models (mistral.ai), and Devstral Medium scores 61.6%, surpassing closed models like Claude 3.5 on cross-file tasks (mistral.ai). Notably, Devstral Small is Apache-2.0 licensed and can run on a single GPU or even a high-end laptop, enabling private, on-prem deployments (mistral.ai).
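To make "fill-in-the-middle" concrete: the model receives a prefix and a suffix and generates only the span between them, which is what powers mid-line completions in the IDE plugins above. Below is a minimal sketch using Mistral's Python client, to the best of my knowledge of its v1 FIM API; treat the exact method and model names as assumptions to verify against current documentation.

```python
import os
from mistralai import Mistral  # pip install mistralai

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Fill-in-the-middle: the model completes only the gap between
# `prompt` (the code before the cursor) and `suffix` (the code after it).
response = client.fim.complete(
    model="codestral-latest",  # assumed model alias; check current docs
    prompt="def fibonacci(n: int) -> int:\n    ",
    suffix="\n\nprint(fibonacci(10))",
)
print(response.choices[0].message.content)
```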

Other Tools: Several new solutions deserve mention. Devin (Cognition) is billed as an "AI software engineer" agent. It leverages a multi-step planner and sandboxed tools to autonomously fix bugs or implement features (cognition.ai). While still early, Devin demonstrates the trend toward AI agents that carry out entire tasks rather than just offering code suggestions. AutoCoder (an open research model) exemplifies novel capabilities: it reportedly outperforms GPT-4 Turbo/4o on code benchmarks and can uniquely auto-install missing libraries before running code (openreview.net, github.com). (This bridges the LLM and its code execution environment.)

Finally, many offerings focus on enterprise integration. For example, Copilot's agent mode works inside GitHub PRs and can push commits automatically (cursor.com); Mistral provides Codestral Embed for secure on-prem code search (mistral.ai); and platforms like CodeSandbox, Modal, or SkyPilot offer self-hosted sandboxes that connect via standard protocols (like Anthropic's Model Context Protocol, MCP) to any IDE or agent. In short, the market is moving toward full "AI-native" dev stacks: LLM models, tools, and sandbox runtimes embedded together with telemetry and compliance controls (mistral.ai, blog.skypilot.co).
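As an illustration of that protocol glue, here is a minimal MCP tool server using the official Python SDK's FastMCP helper, to the best of my knowledge of its API; the tool itself (a trivial sandbox-status check) is hypothetical. Any MCP-capable IDE or agent could then discover and invoke this tool.

```python
from mcp.server.fastmcp import FastMCP  # pip install mcp

# A tiny MCP server exposing one tool that an IDE or agent can discover.
mcp = FastMCP("sandbox-tools")

@mcp.tool()
def sandbox_status(sandbox_id: str) -> str:
    """Report the status of a (hypothetical) execution sandbox."""
    # A real implementation would query your sandbox orchestrator here.
    return f"sandbox {sandbox_id}: warm, ~0.28s average exec latency"

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```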

Sources: This analysis is based on recent platform announcements and security analyses (primarily July-September 2025, with earlier releases where noted) from Anthropic, OpenAI, Mistral, Cognition, and security researchers (anthropic.com, mistral.ai, cursor.com, thehackernews.com, code.visualstudio.com). These cover the technical features, benchmarks, and incidents that illustrate current trends.

Citations

Introducing GitHub Copilot agent mode (preview)
https://code.visualstudio.com/blogs/2025/02/24/introducing-copilot-agent-mode

Codestral | Mistral AI
https://mistral.ai/news/codestral

Announcing Codestral 25.08 and the Complete Mistral Coding Stack for Enterprise | Mistral AI
https://mistral.ai/news/codestral-25-08

Upgrading agentic coding capabilities with the new Devstral models | Mistral AI
https://mistral.ai/news/devstral-2507

March 2025 (version 1.99)
https://code.visualstudio.com/updates/v1_99

Changelog | Cursor - The AI Code Editor
https://cursor.com/ja/changelog

ChatGPT — Release Notes | OpenAI Help Center
https://help.openai.com/en/articles/6825453-chatgpt-release-notes

Introducing Claude 3.5 Sonnet | Anthropic
https://www.anthropic.com/news/claude-3-5-sonnet

E2B - Code Interpreting for AI apps
https://e2b.dev/docs

Self-host open-source LLM agent sandbox on your own cloud | SkyPilot Blog
https://blog.skypilot.co/skypilot-llm-sandbox/

Secure code execution
https://huggingface.co/docs/smolagents/en/tutorials/secure_code_execution

Cursor AI Code Editor Fixed Flaw Allowing Attackers to Run Commands via Prompt Injection
https://thehackernews.com/2025/08/cursor-ai-code-editor-fixed-flaw.html

Cursor AI Code Editor Vulnerability Enables RCE via Malicious MCP File Swaps Post Approval
https://thehackernews.com/2025/08/cursor-ai-code-editor-vulnerability.html

Cognition | Introducing Devin, the first AI software engineer
https://cognition.ai/blog/introducing-devin

Introducing GPT-4.1 in the API | OpenAI
https://openai.com/index/gpt-4-1/

AutoCoder (OpenReview)
https://openreview.net/pdf?id=cDdeTXOnAK

GitHub - bin123apple/AutoCoder
https://github.com/bin123apple/AutoCoder
