ToolGym

ToolGym is an open-world tool-using environment for scalable agent testing and data curation.

Large tool pools • long-horizon workflows • wild constraints • unreliable tool states

Quick links

🏆 Leaderboard
📦 Dataset(s): /datasets/ToolGym/ToolGym
📄 [Paper](https://arxiv.org/abs/2601.06328)
💻 Code

Key highlights

5,571 validated tools (unified in MCP format)
204 real-world apps covered, from 276 MCP servers
Long-horizon, constraint-dense tasks
- Avg. 28.5 tool-use rounds per task (averaged across evaluated models)
A State Controller that injects realistic failures and drift (timeouts, rate limits, transient unavailability, etc.)
Planner–Actor agent framework
- ToolGym supports and releases data signals for both:
  - Planner: deliberate reasoning, reflection, progress tracking, self-correction
  - Actor: step-wise tool retrieval, invocation, and execution
Data-efficient training (experiment): we show strong gains using only 1,170 curated training samples. This number is the size of the training subset used in our experiments, not an upper bound on how much data ToolGym can produce as a data engine.

What is ToolGym?

ToolGym is designed to close the gap between “clean” function-calling benchmarks and messy real-world tool ecosystems. It supports both:

Benchmarking: stress-test agents on long, multi-tool workflows under constraints and failures
Data curation: collect high-quality trajectories for training tool-using agents

Core components

1) Tool universe (MCP)

We curate and validate a large library of production-like tools, then standardize them under a unified Model Context Protocol (MCP) interface so agents can call tools consistently across apps and servers.
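As an illustration of what a unified interface buys, here is a minimal sketch of a tool descriptor and an argument check. The field names (`name`, `description`, `inputSchema`) follow the MCP tool definition; the `get_forecast` tool and the `check_arguments` helper are hypothetical and are not part of ToolGym's released code.

```python
from typing import Any

# Hypothetical tool descriptor. The keys mirror an MCP tool definition,
# but this particular tool is invented for the example.
WEATHER_TOOL: dict[str, Any] = {
    "name": "get_forecast",
    "description": "Return the weather forecast for a city.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "days": {"type": "integer"},
        },
        "required": ["city"],
    },
}

# Maps a small subset of JSON Schema type names to Python types.
_JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def check_arguments(tool: dict[str, Any], arguments: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the call looks well-formed."""
    schema = tool["inputSchema"]
    props = schema.get("properties", {})
    problems = []
    for key in schema.get("required", []):
        if key not in arguments:
            problems.append(f"missing required argument: {key}")
    for key, value in arguments.items():
        if key not in props:
            problems.append(f"unknown argument: {key}")
            continue
        type_name = props[key].get("type")
        expected = _JSON_TYPES.get(type_name)
        if expected is None:
            continue  # type not covered by this sketch
        # bool is a subclass of int in Python, so reject it explicitly.
        bool_mismatch = isinstance(value, bool) and type_name != "boolean"
        if bool_mismatch or not isinstance(value, expected):
            problems.append(f"wrong type for {key}")
    return problems
```

Because every tool exposes the same descriptor shape, one check like this can run in front of any tool call regardless of which app or server provides it. A full validator would cover the rest of JSON Schema; this sketch handles only required keys and a few primitive types.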

2) Tool retrieval index

Selecting the right tools from a large, open-world pool is a central challenge, so ToolGym includes a retrieval layer: agents search the tool library with natural-language queries and load only the relevant tools on demand.
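The idea of querying tools in natural language can be sketched with a toy bag-of-words retriever. This is an assumption-laden illustration, not ToolGym's actual index: the tool list, the `search_tools` function, and the cosine-over-token-counts scoring are all hypothetical stand-ins for whatever retriever the environment uses.

```python
import math
import re
from collections import Counter

# Hypothetical mini tool library; real descriptors would come from the MCP servers.
TOOLS = [
    {"name": "send_email", "description": "Send an email message to a recipient."},
    {"name": "get_forecast", "description": "Return the weather forecast for a city."},
    {"name": "create_event", "description": "Create a calendar event with a start time."},
]


def _tokens(text: str) -> Counter:
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search_tools(query: str, tools: list[dict], top_k: int = 2) -> list[str]:
    """Return up to top_k tool names whose name and description overlap the query."""
    q = _tokens(query)
    scored = [(_cosine(q, _tokens(f"{t['name']} {t['description']}")), t["name"]) for t in tools]
    scored = [(score, name) for score, name in scored if score > 0]
    scored.sort(key=lambda item: (-item[0], item[1]))  # best score first, ties by name
    return [name for _, name in scored[:top_k]]
```

An agent would call something like `search_tools("weather forecast for Paris", TOOLS)` and load only the returned tools into its context, rather than carrying the whole library in every prompt.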

3) Task creation engine

ToolGym synthesizes long-horizon, multi-tool workflows that resemble real user requests:

multi-step dependencies
cross-app orchestration
dense constraints (format, ordering, trade-offs, verification requirements, etc.)

4) State Controller (robustness testing)

To go beyond “happy-path” evaluation, ToolGym introduces a controllable middleware that can inject:

tool-level failures (timeouts, temporary unavailability)
state-level drift (corrupted/delayed results, expired sessions)
constraint changes mid-execution (updated preferences, shifting deadlines)
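The tool-level failure case can be sketched as a thin wrapper around tool calls. This is a minimal sketch under stated assumptions: `FaultInjector`, `ToolTimeout`, `call_with_retry`, and the `get_forecast` stub are hypothetical names, and only injected timeouts are modeled, not state drift or mid-execution constraint changes.

```python
import random
from typing import Any, Callable


class ToolTimeout(Exception):
    """Raised when the middleware simulates a tool timeout."""


class FaultInjector:
    """Hypothetical middleware that makes a fraction of tool calls time out."""

    def __init__(self, timeout_rate: float, seed: int = 0) -> None:
        self.timeout_rate = timeout_rate
        self._rng = random.Random(seed)  # seeded so runs are reproducible
        self.injected = 0

    def call(self, tool: Callable[..., Any], **arguments: Any) -> Any:
        # random() is in [0, 1), so a rate of 1.0 always fails and 0.0 never does.
        if self._rng.random() < self.timeout_rate:
            self.injected += 1
            raise ToolTimeout(f"injected timeout for {tool.__name__}")
        return tool(**arguments)


def call_with_retry(injector: FaultInjector, tool: Callable[..., Any],
                    max_attempts: int = 3, **arguments: Any) -> Any:
    """Agent-side recovery: retry on timeout, re-raise after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return injector.call(tool, **arguments)
        except ToolTimeout:
            if attempt == max_attempts:
                raise


def get_forecast(city: str) -> str:
    """Stub tool used only for this example."""
    return f"sunny in {city}"
```

Keeping the injector separate from both the tool and the agent means the same task can be replayed with different failure rates, which is what makes robustness measurable rather than anecdotal.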

5) Evaluation protocol

ToolGym evaluates agents on multiple axes, including:

Answer quality (completeness, grounding)
Robustness (schema compliance, recovery, flexibility)
Constraint following (format + other constraints)
Planning (goal decomposition, progress tracking, efficiency)

6) Planner–Actor decomposition

To better handle long-horizon objectives and error-prone tool ecosystems, we separate agent behavior into:

Planner: global reasoning & self-correction (keeps the agent aligned over long trajectories)
Actor: efficient step-by-step execution (retrieval → tool call → observe → iterate)
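The division of labor can be sketched as a small control loop. Everything here is a simplified assumption for illustration: the `Planner`, `Actor`, and `run` names are hypothetical, the plan is a fixed list of step names, and "self-correction" is reduced to skipping a step that has no matching tool. ToolGym's released framework is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Planner:
    """Tracks global progress over a plan and decides what to do next."""
    steps: list[str]
    done: list[str] = field(default_factory=list)
    skipped: list[str] = field(default_factory=list)

    def next_step(self) -> Optional[str]:
        for step in self.steps:
            if step not in self.done and step not in self.skipped:
                return step
        return None  # plan finished

    def record(self, step: str, ok: bool) -> None:
        # Simplified self-correction: a failed step is skipped, not retried.
        (self.done if ok else self.skipped).append(step)


class Actor:
    """Executes one step at a time by looking up and invoking a tool."""

    def __init__(self, tools: dict[str, Callable[[], str]]) -> None:
        self.tools = tools

    def execute(self, step: str) -> tuple[bool, str]:
        tool = self.tools.get(step)
        if tool is None:
            return False, f"no tool for step {step!r}"
        return True, tool()


def run(planner: Planner, actor: Actor, max_rounds: int = 10) -> list[str]:
    """Alternate planning and acting until the plan is finished or rounds run out."""
    observations = []
    for _ in range(max_rounds):
        step = planner.next_step()
        if step is None:
            break
        ok, observation = actor.execute(step)
        observations.append(observation)
        planner.record(step, ok)
    return observations
```

In this sketch the Actor never sees the whole plan and the Planner never touches a tool, which mirrors the separation described above: long-horizon bookkeeping lives in one place and per-step execution in another.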

Leaderboard

We maintain a public leaderboard for ToolGym.
➡️ Leaderboard link

License

This organization and its public repos are released under the MIT license unless otherwise specified in each repo.

Contributing

Community contributions are welcome:

Open a discussion: /datasets/ToolGym/ToolGym/discussions
Submit PRs to the relevant repo (dataset / code / leaderboard Space)

Contact

For questions, collaborations, or leaderboard submissions, please open an issue/discussion or contact the maintainers via the links above.