Hugging Face has introduced a new benchmarking methodology for evaluating open-source AI models on agentic tasks, allowing developers to assess model performance using their own custom tooling and infrastructure. The approach addresses a growing need in the AI community to move beyond generic performance metrics and test models in real-world application scenarios where they interact with external tools and systems. The framework lets researchers and engineers run standardized evaluations of whether open models have sufficient agentic capabilities, meaning the ability to plan, reason, and execute multi-step tasks that involve tool use. This matters most for organizations choosing a model for agent-based applications: it gives them a transparent, reproducible way to compare models across different tooling environments rather than relying solely on leaderboard rankings that may not reflect their specific use cases.