
# autoresearch
Andrej Karpathy's open-source AI agent that autonomously runs LLM training experiments overnight on a single GPU, editing code, evaluating improvements, and iterating while you sleep.
## Overview
autoresearch is an experimental open-source project by Andrej Karpathy that demonstrates the power of autonomous AI agents in conducting real machine learning research. It provides a minimal, single-GPU training harness based on a simplified nanochat implementation, where an AI coding agent takes full control of the experimentation loop.
Instead of manually tweaking Python code, researchers write high-level instructions in a program.md Markdown file. The agent then iteratively edits the training script (train.py), runs fixed-time (typically 5-minute) training experiments, evaluates improvements based on a validation metric (e.g., val_bpb), and commits only the winning changes to a Git feature branch.
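The fixed-time experiment budget described above can be sketched as follows. This is a minimal illustration, not the project's actual code: `train_step` and `evaluate` are hypothetical stand-ins for the real nanochat-style training and validation functions.

```python
import time

def run_experiment(train_step, evaluate, budget_s=300.0):
    """Run training steps until the wall-clock budget expires, then evaluate.

    The 300-second default mirrors the ~5-minute experiments; `evaluate`
    stands in for computing the validation metric (e.g., val_bpb).
    """
    start = time.monotonic()
    steps = 0
    while time.monotonic() - start < budget_s:
        train_step()  # one optimizer step on the current train.py configuration
        steps += 1
    return steps, evaluate()
```

Because every experiment gets the same time budget, configurations compete on a level playing field: a change that trains faster simply gets more steps within its five minutes.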
## Key Features
- Autonomous Agent Loop: The AI agent plans experiments, modifies code (architecture, hyperparameters, optimizer, etc.), executes training, and decides what to keep.
- Single-GPU Efficiency: Designed for accessible hardware; each experiment runs for a fixed short duration (~5 minutes), enabling ~12 experiments per hour.
- Git-Based Versioning: Improvements are tracked via commits on a feature branch, making it easy to review and revert changes.
- Minimal Setup: A tiny codebase (~630-1000 lines across a few files) focused on one clear metric for objective evaluation.
- Human Oversight via Prompts: Users define the "research organization" through natural language instructions in Markdown, allowing sophisticated agent behaviors without touching low-level code.
- Extensible: Easy to add more agents, improve the program.md prompt, or adapt for different models/tasks.
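The Git-based versioning feature can be sketched with a small helper. This is an assumption about the workflow rather than the repo's actual implementation: only an improving change to `train.py` is committed, and a non-improving one is restored from the last commit.

```python
import subprocess

def keep_or_discard(improved: bool, message: str) -> None:
    """Commit train.py on the current (feature) branch if the metric
    improved; otherwise restore it to the last committed state.

    A sketch of the commit-on-improvement workflow; the real agent's
    Git usage may differ.
    """
    if improved:
        subprocess.run(["git", "add", "train.py"], check=True)
        subprocess.run(["git", "commit", "-m", message], check=True)
    else:
        subprocess.run(["git", "checkout", "--", "train.py"], check=True)
```

Keeping every accepted change as its own commit is what makes the overnight run reviewable: `git log` on the feature branch reads as a list of experiments that actually helped.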
## How It Works
- The user sets up the repo and provides a research goal in program.md.
- An AI coding agent (e.g., powered by Claude, GPT, or local models) is launched.
- The agent creates/uses a Git feature branch and begins iterating:
  - Edits train.py.
  - Runs a timed training experiment.
  - Measures the key validation metric.
  - If improved, commits the change; otherwise, discards and tries again.
- Overnight or over days, the system accumulates dozens to hundreds of experiments, surfacing better model configurations.
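The steps above amount to a greedy keep-if-better loop. The following toy skeleton is an illustration under stated assumptions: `propose` stands in for the agent editing train.py, and `run_experiment` for a fixed-time training run returning the validation metric.

```python
def research_loop(propose, run_experiment, n_rounds=10):
    """Greedy experiment loop mirroring the steps above: propose a change,
    run a fixed-time experiment, and keep the change only if the
    validation metric (lower is better, like val_bpb) improves."""
    best_change, best_metric = None, float("inf")
    history = []
    for _ in range(n_rounds):
        change = propose()               # stand-in for the agent editing train.py
        metric = run_experiment(change)  # stand-in for a timed training run
        if metric < best_metric:         # improvement -> "commit" the change
            best_change, best_metric = change, metric
        history.append((change, metric))
    return best_change, best_metric, history
```

In the real harness, "keeping" a change is a Git commit on the feature branch and the proposals come from the LLM coding agent; the accumulated `history` corresponds to the overnight trail of experiments.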
The project emphasizes engineering the agent prompt (the "research org code") to maximize long-term research velocity without human intervention.
## Use Cases
- Personal ML Research: Let an agent explore hyperparameters, architectures, or optimizations while you sleep or focus on higher-level ideas.
- Educational Demo: Understand agentic AI workflows in a real, runnable ML context.
- Distributed Swarms: Community extensions enable multiple agents or machines to collaborate (e.g., autoresearch@home projects).
- Rapid Prototyping: Test ideas for autonomous scientific discovery in small-scale LLM training.
- Benchmarking Agent Capabilities: Evaluate how well different LLMs perform as autonomous researchers.
## Getting Started
Clone the repository, install dependencies via pyproject.toml, configure your AI provider (API keys), prepare a program.md with your research instructions, and launch the agent loop. It runs on a single GPU and requires minimal setup.
The repo includes a baseline program.md that can be iterated upon for better results.
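A program.md might contain instructions along the following lines. This snippet is purely illustrative, not the repo's actual baseline prompt:

```markdown
# Research goal
Reduce val_bpb on the nanochat baseline as much as possible.

# Rules
- Each experiment trains for at most 5 minutes on one GPU.
- Only commit a change to the feature branch if val_bpb improves.
- Prefer small, reversible edits to train.py: hyperparameters first,
  then optimizer settings, then architecture.
```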
## Why It Matters
autoresearch represents an early glimpse into a future where AI agents handle the grunt work of empirical research, freeing humans for creative direction. It has attracted broad community interest, including forks, ports (AMD, Apple Silicon, etc.), and discussions of agent swarms and an "early singularity" of automated science.
## Limitations
- Experiments start from scratch each time (no persistent memory across runs in the base version).
- Focused on a single, simple metric and small models.
- Success depends heavily on the quality of the underlying coding agent and prompt engineering.
For the latest details, code, and community discussions, visit the official GitHub repository.