CSVAnalystEnv

An OpenEnv-compatible benchmark for tabular reasoning agents.

Evaluation Tasks

Difficulty Levels

100% OpenEnv Compliant

Live FastAPI HTTP

How it works

Agents interact with a fixed CSV dataset of e-commerce orders. Rather than writing raw code, an agent must choose from a constrained action space (actions such as filter_rows or groupby_aggregate) to explore the data and find the answer.
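To make the idea of a constrained action space concrete, here is a minimal sketch in plain Python. Only the action names filter_rows and groupby_aggregate come from this page; the sample rows, the column names, the argument names (column, value, by, agg), and the apply_action dispatcher are assumptions made for illustration and may not match the real environment.

```python
from collections import defaultdict

# Hypothetical sample rows; the real dataset's columns are not described here.
ORDERS = [
    {"order_id": 1, "category": "books", "amount": 12.0},
    {"order_id": 2, "category": "toys", "amount": 30.0},
    {"order_id": 3, "category": "books", "amount": 8.0},
]


def filter_rows(rows, column, value):
    """Keep only the rows whose `column` equals `value`."""
    return [row for row in rows if row[column] == value]


def groupby_aggregate(rows, by, column, agg):
    """Group rows by `by` and reduce `column` with a named aggregation."""
    funcs = {
        "sum": sum,
        "count": len,
        "mean": lambda values: sum(values) / len(values),
    }
    if agg not in funcs:
        raise ValueError(f"unsupported aggregation: {agg!r}")
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row[column])
    return {key: funcs[agg](values) for key, values in groups.items()}


# The whitelist is the whole point: anything not listed here is rejected.
ACTIONS = {
    "filter_rows": filter_rows,
    "groupby_aggregate": groupby_aggregate,
}


def apply_action(rows, action):
    """Dispatch an action dict of the assumed form {"name": ..., "args": {...}}."""
    name = action.get("name")
    if name not in ACTIONS:
        raise ValueError(f"unknown action: {name!r}")
    return ACTIONS[name](rows, **action.get("args", {}))
```

For example, apply_action(ORDERS, {"name": "groupby_aggregate", "args": {"by": "category", "column": "amount", "agg": "sum"}}) totals the amount per category, while an action name outside the whitelist raises ValueError instead of executing arbitrary code.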

The environment grades answers programmatically, caps episode length, and shapes behavior with normalized rewards (+1 for success, penalties for invalid tool use).
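The grading and reward rules above can be sketched as a small episode object. This is an illustration under stated assumptions, not the environment's actual code: the step cap of 10, the penalty of -0.1, the submit_answer action, and exact string matching as the grading rule are all hypothetical. Only the +1 success reward and the existence of a penalty for invalid tool use come from this page.

```python
from dataclasses import dataclass

MAX_STEPS = 10            # assumed episode cap
INVALID_PENALTY = -0.1    # assumed penalty size
VALID_ACTIONS = {"filter_rows", "groupby_aggregate", "submit_answer"}


@dataclass
class Episode:
    expected: str          # ground-truth answer used by the grader
    steps: int = 0
    done: bool = False

    def step(self, action):
        """Return (reward, done) for one action dict."""
        if self.done:
            raise RuntimeError("episode already finished; reset first")
        self.steps += 1
        name = action.get("name")
        reward = 0.0
        if name not in VALID_ACTIONS:
            # Invalid tool use is penalized but does not end the episode.
            reward = INVALID_PENALTY
        elif name == "submit_answer":
            # Programmatic grading: exact match after trimming whitespace.
            answer = str(action.get("answer", "")).strip()
            reward = 1.0 if answer == self.expected else 0.0
            self.done = True
        if self.steps >= MAX_STEPS:
            # Hitting the step cap ends the episode regardless of outcome.
            self.done = True
        return reward, self.done
```

Keeping the success reward at 1.0 and the penalties small in magnitude keeps per-step rewards within [-1, 1], which is one plausible reading of "normalized" here.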

Core Endpoints

GET /tasks lists the question bank.
POST /reset begins an episode.
POST /step submits an action and returns the next observation.
GET /state returns the full episode transcript.
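The four endpoints above could be wrapped in a thin client along these lines. The paths and HTTP methods are taken from the list; the base URL, the JSON payload shapes ({"task_id": ...} and {"action": ...}), and the EnvClient class itself are assumptions, so check the API docs for the real request and response schemas.

```python
import json
from urllib import request


class EnvClient:
    """Minimal client sketch; `send` can be swapped out for offline testing."""

    def __init__(self, base_url, send=None):
        self.base_url = base_url.rstrip("/")
        self._send = send or self._http_send

    def _http_send(self, method, path, body):
        # Encode the body as JSON when present; GET requests carry no body.
        data = json.dumps(body).encode("utf-8") if body is not None else None
        req = request.Request(
            self.base_url + path,
            data=data,
            method=method,
            headers={"Content-Type": "application/json"},
        )
        with request.urlopen(req) as resp:
            return json.loads(resp.read().decode("utf-8"))

    def tasks(self):
        """GET /tasks: list the question bank."""
        return self._send("GET", "/tasks", None)

    def reset(self, task_id):
        """POST /reset: begin an episode (payload shape is assumed)."""
        return self._send("POST", "/reset", {"task_id": task_id})

    def step(self, action):
        """POST /step: submit an action (payload shape is assumed)."""
        return self._send("POST", "/step", {"action": action})

    def state(self):
        """GET /state: fetch the full episode transcript."""
        return self._send("GET", "/state", None)
```

A typical loop would call tasks() once, reset() with a chosen task, then step() repeatedly until the response signals the episode is over, using state() to inspect the transcript afterwards.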

Open API Docs | Human Interface | Health Check | GitHub Repo