ToolGym

ToolGym is an open-world tool-using environment for scalable agent testing and data curation.

Large tool pools • long-horizon workflows • wild constraints • unreliable tool states


Quick links


Key highlights


What is ToolGym?

ToolGym is designed to close the gap between “clean” function-calling benchmarks and messy real-world tool ecosystems. It supports both scalable agent testing and data curation.


Core components

1) Tool universe (MCP)

We curate and validate a large library of production-like tools, then standardize them under a unified Model Context Protocol (MCP) interface so agents can call tools consistently across apps and servers.
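As a rough illustration of what a unified interface buys, the sketch below routes every tool call through a single code path. `ToolSpec`, `ToolRegistry`, and the `weather.lookup` tool are hypothetical names invented for this example; they are not ToolGym's actual classes or the MCP SDK's API. The only thing borrowed from MCP is the general shape of a tool record (a name, a description, and a JSON-Schema-style input description).

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class ToolSpec:
    """Hypothetical uniform tool record: name, description, input schema, handler."""
    name: str
    description: str
    input_schema: Dict[str, Any]
    handler: Callable[..., Any]


class ToolRegistry:
    """Hypothetical registry that exposes every tool through one call path."""

    def __init__(self) -> None:
        self._tools: Dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        self._tools[spec.name] = spec

    def call(self, name: str, arguments: Dict[str, Any]) -> Any:
        spec = self._tools[name]  # raises KeyError for unknown tools
        # Check only the "required" keys; a real system would validate the full schema.
        required = spec.input_schema.get("required", [])
        missing = [key for key in required if key not in arguments]
        if missing:
            raise ValueError(f"missing arguments for {name}: {missing}")
        return spec.handler(**arguments)


registry = ToolRegistry()
registry.register(ToolSpec(
    name="weather.lookup",  # toy tool, for illustration only
    description="Return a canned forecast for a city.",
    input_schema={"type": "object", "required": ["city"]},
    handler=lambda city: {"city": city, "forecast": "sunny"},
))
result = registry.call("weather.lookup", {"city": "Oslo"})
```

Because the agent only ever sees `call(name, arguments)`, tools from different apps or servers can be swapped without changing agent code.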

2) Tool retrieval index

Because selecting the right tools from a large, open-world pool is a central challenge, ToolGym includes a retrieval layer so agents can search tools using natural language queries and load relevant tools on demand.
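The idea of searching tools by natural language can be sketched with a simple bag-of-words cosine similarity over tool descriptions. This is a stand-in chosen to keep the example dependency-free, not ToolGym's retrieval method; `search_tools` and the `TOOLS` table are assumptions made up for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> Counter:
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search_tools(query: str, descriptions: Dict[str, str], top_k: int = 2) -> List[str]:
    """Return up to top_k tool names ranked by similarity, dropping zero-score tools."""
    q = tokenize(query)
    scored = [(cosine(q, tokenize(desc)), name) for name, desc in descriptions.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, then by name
    return [name for score, name in scored[:top_k] if score > 0]


# Hypothetical tool descriptions, for illustration only.
TOOLS = {
    "calendar.create_event": "create a calendar event with a title and time",
    "email.send": "send an email message to a recipient",
    "weather.lookup": "look up the weather forecast for a city",
}
hits = search_tools("what is the weather forecast in Oslo", TOOLS, top_k=1)
```

Only the tools returned in `hits` would then be loaded into the agent's context, which keeps the prompt small even when the pool is large.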

3) Task creation engine

ToolGym synthesizes long-horizon, multi-tool workflows that resemble real user requests.

4) State Controller (robustness testing)

To go beyond “happy-path” evaluation, ToolGym introduces a controllable middleware that can inject unreliable tool states into tool calls.
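A minimal sketch of such a middleware is a wrapper that sits between the agent and a tool and fails some calls at a configurable rate. `FaultInjector`, `ToolUnavailable`, and the `echo` tool are hypothetical names for this example, and a single "tool unavailable" failure is an assumed fault type; the source does not specify which faults the State Controller injects.

```python
import random
from typing import Any, Callable


class ToolUnavailable(Exception):
    """Hypothetical error representing an injected tool failure."""


class FaultInjector:
    """Hypothetical middleware: fails wrapped tool calls with a fixed probability."""

    def __init__(self, failure_rate: float, seed: int = 0) -> None:
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be in [0, 1]")
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so runs are reproducible
        self.injected = 0  # number of failures injected so far

    def wrap(self, tool: Callable[..., Any]) -> Callable[..., Any]:
        def wrapped(**arguments: Any) -> Any:
            # random() is in [0, 1), so rate 0.0 never fails and rate 1.0 always fails.
            if self._rng.random() < self.failure_rate:
                self.injected += 1
                raise ToolUnavailable("injected failure")
            return tool(**arguments)
        return wrapped


def echo(text: str) -> str:
    """Toy tool used only for this example."""
    return text.upper()


reliable = FaultInjector(failure_rate=0.0).wrap(echo)
broken_injector = FaultInjector(failure_rate=1.0)
broken = broken_injector.wrap(echo)
```

Keeping the injector outside the tool itself means the same tool can be evaluated under several failure rates without modification.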

5) Evaluation protocol

ToolGym evaluates agents on multiple axes.

6) Planner–Actor decomposition

To better handle long-horizon objectives and error-prone tool ecosystems, we separate agent behavior into a planner and an actor.
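One way to picture this separation is a loop in which one function decides the steps and another executes them against tools and records what happened. Everything in the sketch below is an assumption for illustration: the hard-coded `plan`, the `act` loop, and the toy `search` and `summarize` tools are not ToolGym's implementation, and a real planner would typically be a language model.

```python
from typing import Any, Callable, Dict, List, Tuple

Step = Tuple[str, Dict[str, Any]]  # (tool name, arguments)


def plan(goal: str) -> List[Step]:
    """Toy planner: returns a fixed two-step plan for any goal."""
    return [
        ("search", {"query": goal}),
        ("summarize", {"max_words": 5}),
    ]


def act(steps: List[Step], tools: Dict[str, Callable[..., Any]]) -> List[Tuple[str, str, Any]]:
    """Toy actor: runs steps in order, feeding each result into the next tool."""
    trace: List[Tuple[str, str, Any]] = []
    previous: Any = None
    for name, arguments in steps:
        try:
            previous = tools[name](previous, **arguments)
            trace.append((name, "ok", previous))
        except Exception as error:
            # Stop at the first failure; the planner could use the trace to replan.
            trace.append((name, "error", str(error)))
            break
    return trace


def search(previous: Any, query: str) -> str:
    return f"results for {query}"


def summarize(previous: str, max_words: int) -> str:
    return " ".join(previous.split()[:max_words])


trace = act(plan("cheap flights to Oslo"), {"search": search, "summarize": summarize})
```

The returned `trace` gives the planner a compact record of successes and failures, which is what makes replanning after a tool error possible.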


Leaderboard

We maintain a public leaderboard for ToolGym.
➡️ Leaderboard link


License


Contributing

Community contributions are welcome.


Contact

For questions, collaborations, or leaderboard submissions, please open an issue/discussion or contact the maintainers via the links above.