Agents interact with a fixed CSV dataset representing e-commerce orders. Instead of writing raw code, agents must use a constrained action space (like filter_rows or groupby_aggregate) to explore the data and find the answer.
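As a rough illustration of what a constrained action space can look like, the sketch below dispatches named actions over rows loaded from a CSV. Only the action names filter_rows and groupby_aggregate come from the description above. The sample columns, the argument names, the action dictionary shape, and the `apply_action` helper are all hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data; the real dataset's columns are not specified here.
SAMPLE_CSV = """order_id,country,amount
1,US,20.0
2,DE,35.5
3,US,10.0
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def filter_rows(rows, column, value):
    """Keep rows whose column equals value (string comparison)."""
    return [r for r in rows if r[column] == value]


def groupby_aggregate(rows, by, column, agg):
    """Group rows by one column and aggregate another numerically."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[by]].append(float(r[column]))
    funcs = {"sum": sum, "count": len, "mean": lambda v: sum(v) / len(v)}
    return {key: funcs[agg](values) for key, values in groups.items()}


# Whitelist of allowed actions; anything else is rejected as invalid.
ACTIONS = {"filter_rows": filter_rows, "groupby_aggregate": groupby_aggregate}


def apply_action(rows, action):
    """Run one action of the assumed form {"name": ..., "args": {...}}."""
    name = action.get("name")
    if name not in ACTIONS:
        raise ValueError(f"unknown action: {name!r}")
    return ACTIONS[name](rows, **action.get("args", {}))
```

Because the agent can only name an entry in the whitelist, every step can be validated before anything touches the data, which is what makes invalid tool use easy to detect and penalize.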
The environment enforces strict programmatic grading, limits episode length, and shapes behavior via normalized rewards (+1 for success, penalties for invalid tool use).
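The reward rules can be sketched roughly as follows. The +1 success reward is taken from the description; the penalty size, the step limit, the `Episode` class, and the use of a normalized exact string match as the grader are assumptions made for illustration only.

```python
from dataclasses import dataclass

SUCCESS_REWARD = 1.0     # from the description
INVALID_PENALTY = -0.1   # assumed value; the actual penalty is not stated
MAX_STEPS = 10           # assumed episode length limit


@dataclass
class Episode:
    expected: str        # ground-truth answer used by the grader
    steps: int = 0
    done: bool = False


def score_step(episode, action_valid, submitted=None):
    """Return the reward for one step and update the episode in place."""
    if episode.done:
        raise RuntimeError("episode already finished")
    episode.steps += 1
    reward = 0.0
    if not action_valid:
        # Invalid tool use is penalized but does not end the episode.
        reward = INVALID_PENALTY
    elif submitted is not None:
        # Submitting an answer ends the episode; grading here is an
        # assumed case-insensitive, whitespace-trimmed exact match.
        episode.done = True
        if submitted.strip().lower() == episode.expected.strip().lower():
            reward = SUCCESS_REWARD
    if episode.steps >= MAX_STEPS:
        episode.done = True
    return reward
```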
GET /tasks lists the question bank. POST /reset begins an episode. POST /step submits an action and returns the next observation. GET /state returns the full episode transcript.
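A minimal client sketch for these four endpoints, using only the Python standard library, might look like the following. The base URL, the JSON request bodies, the empty payload sent to /reset, and the helper names are assumptions; only the methods and paths come from the description.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed; adjust to the actual server


def build_request(method, path, payload=None):
    """Build an HTTP request, JSON-encoding the payload if one is given."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"} if data else {}
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


def send(request):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_episode(actions, transport=send):
    """List tasks, reset, submit each action, then fetch the transcript."""
    tasks = transport(build_request("GET", "/tasks"))
    transport(build_request("POST", "/reset", {}))  # payload shape assumed
    observations = [
        transport(build_request("POST", "/step", action)) for action in actions
    ]
    state = transport(build_request("GET", "/state"))
    return tasks, observations, state
```

The `transport` parameter is there so the call sequence can be exercised with a stub function in place of a live server.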