HUD Documentation — Evaluations and RL Environments.

You’ve built an environment with tools and scenarios. Deploy it to the platform and you can run evals at scale—hundreds of parallel runs across models, all traced, all generating training data.

Deploying Environments

Start with hud init (see Environments) to scaffold locally. When ready:

1. Go to hud.ai → New → Environment
2. Connect your GitHub repo and name your environment
3. Push changes and it rebuilds automatically, like Vercel

Your environment—tools, scenarios, everything—is now live. Connect from anywhere:

env.connect_hub("my-env")

Running at Scale

Once deployed, create evals on hud.ai from your scenarios. Each eval is a frozen configuration—same prompt, same scoring, every time. Your scenario might take arguments:

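@env.scenario("checkout")
async def checkout_flow(product_name: str, apply_coupon: bool = False):
    # First yield: the prompt for the agent, built from the scenario arguments
    yield f"Complete checkout for {product_name}" + (" with coupon" if apply_coupon else "")
    # Second yield: the score (order_confirmed() stands in for your own check)
    yield 1.0 if order_confirmed() else 0.0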

On the platform, click New Eval → select your scenario → fill in the arguments. Create multiple evals from the same scenario:

| Eval Name | Arguments |
| --- | --- |
| `checkout-laptop` | `product_name="Laptop"`, `apply_coupon=False` |
| `checkout-phone-coupon` | `product_name="Phone"`, `apply_coupon=True` |
| `checkout-headphones` | `product_name="Headphones"`, `apply_coupon=False` |
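
One way to read the table above: each eval is a name bound to one fixed set of keyword arguments for the same scenario. The sketch below illustrates that idea in plain Python and makes no HUD SDK calls. `checkout_prompt`, `EVALS`, and `render_prompts` are hypothetical names introduced here; `checkout_prompt` only mirrors the prompt expression from `checkout_flow`.

```python
# Illustrative only: plain Python, no HUD SDK calls.
# checkout_prompt is a hypothetical helper that mirrors the prompt
# expression used in the checkout_flow scenario.

def checkout_prompt(product_name: str, apply_coupon: bool = False) -> str:
    return f"Complete checkout for {product_name}" + (" with coupon" if apply_coupon else "")

# Each eval freezes one set of arguments for the same scenario.
EVALS = {
    "checkout-laptop": {"product_name": "Laptop", "apply_coupon": False},
    "checkout-phone-coupon": {"product_name": "Phone", "apply_coupon": True},
    "checkout-headphones": {"product_name": "Headphones", "apply_coupon": False},
}

def render_prompts(evals: dict) -> dict:
    """Return the prompt each eval would produce from its frozen arguments."""
    return {name: checkout_prompt(**args) for name, args in evals.items()}
```

Calling `render_prompts(EVALS)` gives one prompt per eval name. Because the arguments are fixed, the prompt is identical on every run of a given eval.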

Then run them: select an eval, choose variants and groups, and launch hundreds of runs in parallel. Every run is traced. Results show scores, score distributions, and side-by-side model comparisons, and the traced runs become your training data. For A/B testing with variants and groups, see A/B Evals.
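
To make the comparison step concrete, here is a minimal sketch of averaging scores per model or per eval. It assumes each run can be reduced to a record with `model`, `eval`, and `score` fields. That record shape, the model names, and the scores are made-up placeholders, not the platform's export format or real results.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical run records with placeholder values; real traces carry more fields.
RUNS = [
    {"model": "model-a", "eval": "checkout-laptop", "score": 1.0},
    {"model": "model-a", "eval": "checkout-phone-coupon", "score": 0.0},
    {"model": "model-b", "eval": "checkout-laptop", "score": 1.0},
    {"model": "model-b", "eval": "checkout-phone-coupon", "score": 1.0},
]

def mean_score_by(runs: list, key: str) -> dict:
    """Group runs by the given field and average their scores."""
    groups = defaultdict(list)
    for run in runs:
        groups[run[key]].append(run["score"])
    return {name: mean(scores) for name, scores in groups.items()}
```

`mean_score_by(RUNS, "model")` compares models across all evals. `mean_score_by(RUNS, "eval")` shows which evals score lowest overall.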

What’s Next?

With your environment deployed:

Scale: Launch thousands of rollouts. Every run generates traces—prompts, tool calls, rewards.
Analyze: See which evals agents struggle with. Compare models across your entire benchmark.
Train: Use runs as training data. Fine-tune on successful completions. Run reinforcement learning to optimize for your specific environment.
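
As a rough illustration of the Train step, the sketch below keeps only high-reward traces and serializes them as JSON Lines for a fine-tuning job. The trace fields (`prompt`, `tool_calls`, `reward`) follow the list above, but the exact record shape, the sample values, and the helper names are assumptions made for this example.

```python
import json

# Hypothetical traces with placeholder values; the real trace schema may differ.
TRACES = [
    {"prompt": "Complete checkout for Laptop", "tool_calls": ["click", "type"], "reward": 1.0},
    {"prompt": "Complete checkout for Phone with coupon", "tool_calls": ["click"], "reward": 0.0},
]

def successful(traces: list, threshold: float = 1.0) -> list:
    """Keep traces whose reward meets the threshold."""
    return [t for t in traces if t["reward"] >= threshold]

def to_jsonl(traces: list) -> str:
    """Serialize traces as JSON Lines, one trace per line."""
    return "\n".join(json.dumps(t, sort_keys=True) for t in traces)
```

Lowering `threshold` admits partially successful runs. Whether that helps depends on how your scenario assigns rewards.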

The loop: deploy → eval at scale → analyze → train → redeploy. Agents get better at your environment.

Integrations

Connect OpenAI, Anthropic, LangChain, and more.

Sandboxing

Turn production services into safe test environments.
