{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Welcome to ITBench-Lite-Space!\n", "\n", "This interactive environment lets you run and evaluate AI agents on real-world IT automation tasks." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Quick Start Guide" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 1: Duplicate This Space First (If You Haven't Already)\n", "\n", "**Important:** You need your own copy to set up API keys.\n", "\n", "1. Click the **⋮ menu** at the top of the page\n", "2. Select **\"Duplicate this Space\"**\n", "3. Choose a name and wait for it to build" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2: Set Up Your API Keys (Required)\n", "\n", "Once you have your duplicated Space:\n", "\n", "1. Get your API keys:\n", " - [HuggingFace Token](https://huggingface.co/settings/tokens) - for agent execution\n", " - [OpenRouter Key](https://openrouter.ai) - for Gemini Judge evaluation\n", "2. In **your Space**, go to **Settings → Repository secrets**\n", "3. Add the secrets:\n", " - `HF_TOKEN` = your HuggingFace token\n", " - `OPENROUTER_API_KEY` = your OpenRouter key\n", "4. 
(Optional) Before using llama-3.3-70b, **accept the Llama license**: visit [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) and click \"Agree and access repository\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3: Choose Your Path\n", "\n", "**New to ITBench?** → Start with `download_run_scenario.ipynb`\n", "- Download a scenario from the ITBench-Lite dataset\n", "- Run an agent interactively to see how it works\n", "- Experiment with different models (HuggingFace: Llama, Qwen, GPT-OSS | OpenRouter: Gemini, Claude, GPT)\n", "\n", "**Ready to Evaluate?** → Jump to `evaluation.ipynb`\n", "- Analyze agent performance across multiple scenarios\n", "- View detailed metrics, trajectories, and visualizations\n", "- Generate comprehensive evaluation reports" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 4: Open a Notebook\n", "\n", "Click one of the notebook files in the left sidebar:\n", "- **download_run_scenario.ipynb** - Download scenarios and run agents\n", "- **evaluation.ipynb** - Comprehensive evaluation and analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What's in This Space?\n", "\n", "- `download_run_scenario.ipynb` - Interactive agent execution notebook\n", "- `evaluation.ipynb` - Evaluation and analysis notebook\n", "- `analysis_src/` - Python modules for evaluation metrics\n", "- `ITBench-SRE-Agent/` - Reference agent implementation (pre-installed)" ] }, { "cell_type": "markdown", "metadata": {}, "source": "## Useful Links\n\n- [Why Do Enterprise Agents Fail? 
Insights from IT-Bench using MAST](https://ucb-mast.notion.site/) - Research insights and analysis\n- [ITBench-Lite Dataset](https://huggingface.co/datasets/ibm-research/ITBench-Lite) - 50 scenarios across SRE and FinOps\n- [ITBench-Trajectories](https://huggingface.co/datasets/ibm-research/ITBench-Trajectories) - Complete execution traces\n- [ITBench GitHub](https://github.com/itbench-hub/ITBench) - Main repository\n- [ITBench-SRE-Agent](https://github.com/itbench-hub/ITBench-SRE-Agent) - Agent implementation\n\n**Happy benchmarking!**" } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.0" } }, "nbformat": 4, "nbformat_minor": 4 }