
# autoresearch
Andrej Karpathy's open-source AI agent that autonomously runs LLM training experiments overnight on a single GPU, editing code, evaluating improvements, and iterating while you sleep.
## Overview
autoresearch is an experimental open-source project by Andrej Karpathy that demonstrates the power of autonomous AI agents in conducting real machine learning research. It provides a minimal, single-GPU training harness based on a simplified nanochat implementation, where an AI coding agent takes full control of the experimentation loop.
Instead of manually tweaking Python code, researchers write high-level instructions in a program.md Markdown file. The agent then iteratively edits the training script (train.py), runs fixed-time (typically 5-minute) training experiments, evaluates improvements based on a validation metric (e.g., val_bpb), and commits only the winning changes to a Git feature branch.
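The fixed-time experiment budget described above can be sketched as follows. This is a minimal illustration, not the project's actual code: `train_step` and `evaluate` are hypothetical stand-ins for the real nanochat-style training and validation functions.

```python
import time

def run_experiment(train_step, evaluate, budget_s=300.0):
    """Run training steps until the wall-clock budget expires, then evaluate.

    The 300-second default mirrors the ~5-minute experiments; `evaluate`
    stands in for computing the validation metric (e.g., val_bpb).
    """
    start = time.monotonic()
    steps = 0
    while time.monotonic() - start < budget_s:
        train_step()  # one optimizer step on the current train.py configuration
        steps += 1
    return steps, evaluate()
```

Because every experiment gets the same time budget, configurations compete on a level playing field: a change that trains faster simply gets more steps within its five minutes.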
## Key Features
- Autonomous Agent Loop: The AI agent plans experiments, modifies code (architecture, hyperparameters, optimizer, etc.), executes training, and decides what to keep.
- Single-GPU Efficiency: Designed for accessible hardware; each experiment runs for a fixed short duration (~5 minutes), enabling ~12 experiments per hour.
- Git-Based Versioning: Improvements are tracked via commits on a feature branch, making it easy to review and revert changes.
- Minimal Setup: A tiny codebase (~630-1000 lines across a few files) focused on one clear metric for objective evaluation.
- Human Oversight via Prompts: Users define the "research organization" through natural language instructions in Markdown, allowing sophisticated agent behaviors without touching low-level code.
- Extensible: Easy to add more agents, improve the program.md prompt, or adapt for different models/tasks.
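The Git-based versioning feature can be sketched with a small helper. This is an assumption about the workflow rather than the repo's actual implementation: only an improving change to `train.py` is committed, and a non-improving one is restored from the last commit.

```python
import subprocess

def keep_or_discard(improved: bool, message: str) -> None:
    """Commit train.py on the current (feature) branch if the metric
    improved; otherwise restore it to the last committed state.

    A sketch of the commit-on-improvement workflow; the real agent's
    Git usage may differ.
    """
    if improved:
        subprocess.run(["git", "add", "train.py"], check=True)
        subprocess.run(["git", "commit", "-m", message], check=True)
    else:
        subprocess.run(["git", "checkout", "--", "train.py"], check=True)
```

Keeping every accepted change as its own commit is what makes the overnight run reviewable: `git log` on the feature branch reads as a list of experiments that actually helped.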
## How It Works
- The user sets up the repo and provides a research goal in program.md.
- An AI coding agent (e.g., powered by Claude, GPT, or local models) is launched.
- The agent creates/uses a Git feature branch and begins iterating:
  - Edits train.py.
  - Runs a timed training experiment.
  - Measures the key validation metric.
  - If improved, commits the change; otherwise, discards and tries again.
- Overnight or over days, the system accumulates dozens to hundreds of experiments, surfacing better model configurations.
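The steps above amount to a greedy keep-if-better loop. The following toy skeleton is an illustration under stated assumptions: `propose` stands in for the agent editing train.py, and `run_experiment` for a fixed-time training run returning the validation metric.

```python
def research_loop(propose, run_experiment, n_rounds=10):
    """Greedy experiment loop mirroring the steps above: propose a change,
    run a fixed-time experiment, and keep the change only if the
    validation metric (lower is better, like val_bpb) improves."""
    best_change, best_metric = None, float("inf")
    history = []
    for _ in range(n_rounds):
        change = propose()               # stand-in for the agent editing train.py
        metric = run_experiment(change)  # stand-in for a timed training run
        if metric < best_metric:         # improvement -> "commit" the change
            best_change, best_metric = change, metric
        history.append((change, metric))
    return best_change, best_metric, history
```

In the real harness, "keeping" a change is a Git commit on the feature branch and the proposals come from the LLM coding agent; the accumulated `history` corresponds to the overnight trail of experiments.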
The project emphasizes engineering the agent prompt (the "research org code") to maximize long-term research velocity without human intervention.
## Use Cases
- Personal ML Research: Let an agent explore hyperparameters, architectures, or optimizations while you sleep or focus on higher-level ideas.
- Educational Demo: Understand agentic AI workflows in a real, runnable ML context.
- Distributed Swarms: Community extensions enable multiple agents or machines to collaborate (e.g., autoresearch@home projects).
- Rapid Prototyping: Test ideas for autonomous scientific discovery in small-scale LLM training.
- Benchmarking Agent Capabilities: Evaluate how well different LLMs perform as autonomous researchers.
## Getting Started
Clone the repository, install dependencies via pyproject.toml, configure your AI provider (API keys), prepare a program.md with your research instructions, and launch the agent loop. It runs on a single GPU and requires minimal setup.
The repo includes a baseline program.md that can be iterated upon for better results.
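A program.md might contain instructions along the following lines. This snippet is purely illustrative, not the repo's actual baseline prompt:

```markdown
# Research goal
Reduce val_bpb on the nanochat baseline as much as possible.

# Rules
- Each experiment trains for at most 5 minutes on one GPU.
- Only commit a change to the feature branch if val_bpb improves.
- Prefer small, reversible edits to train.py: hyperparameters first,
  then optimizer settings, then architecture.
```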
## Why It Matters
autoresearch represents an early glimpse into a future where AI agents handle the grunt work of empirical research, freeing humans for creative direction. It has attracted broad community interest, including forks, ports (AMD, Apple Silicon, etc.), and discussions of agent swarms and an "early singularity" of automated science.
## Limitations
- Experiments start from scratch each time (no persistent memory across runs in the base version).
- Focused on a single, simple metric and small models.
- Success depends heavily on the quality of the underlying coding agent and prompt engineering.
For the latest details, code, and community discussions, visit the official GitHub repository.