Automated Machine Learning Engineering with Large Language Models
AIDE is an LLM-based agent that automates machine learning engineering by systematically drafting, debugging, and refining solutions. It tracks each solution in a tree-structured search, allowing it to reuse promising approaches and quickly discard failing ones. AIDE delivers robust performance on tasks ranging from tabular machine learning to AI R&D, and is the state-of-the-art agent on OpenAI's MLE-Bench, METR's RE-Bench, and Weco's internal Kaggle Bench.
AIDE tracks all solutions in a tree of scripts, where each node represents a distinct code version and each edge represents a single improvement step. New nodes branch off when the agent drafts a fresh solution from the empty root, debugs a broken script, or refines a working one with new ideas. This structure lets AIDE reuse promising solutions, quickly discard failing paths, and isolate the impact of each change.
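As a rough sketch (not AIDE's actual internals), each node can be represented as a script plus bookkeeping about where it came from and how it scored; the class and field names below are illustrative assumptions:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """One candidate solution script in the search tree (illustrative sketch)."""
    code: str                                    # full solution script at this node
    parent: Optional["Node"] = None              # None for the empty root
    children: List["Node"] = field(default_factory=list)
    metric: Optional[float] = None               # validation score, unset until evaluated
    is_buggy: bool = False                       # script crashed or failed its checks

    def add_child(self, child: "Node") -> "Node":
        """Attach a new version derived from this one (a single improvement step)."""
        child.parent = self
        self.children.append(child)
        return child
```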
Within this tree, each node stands on its own, so the agent can evaluate it independently against validation metrics. If a script is valid but needs improvement, AIDE generates a new branch from that node and tests only the modified code. If a script breaks or shows errors, the agent attempts to fix it, or prunes the branch if it proves unhelpful. Over multiple iterations, this systematic search surfaces robust solutions, making the entire process both more efficient and less prone to errors.
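To make the loop concrete, here is a minimal sketch of one way such a search could be driven, reusing the Node class from the sketch above. The llm.draft / llm.debug / llm.improve calls, the evaluate callback, and the selection policy are all hypothetical simplifications, not AIDE's actual API:

```python
def all_nodes(root):
    """Walk the tree depth-first."""
    yield root
    for child in root.children:
        yield from all_nodes(child)

def select_parent(root):
    """Toy selection policy: debug the most recent buggy leaf if one exists,
    otherwise improve the best-scoring working node, otherwise draft from the root.
    (A real search would also prune branches that keep failing.)"""
    nodes = list(all_nodes(root))
    buggy_leaves = [n for n in nodes if n.is_buggy and not n.children]
    if buggy_leaves:
        return buggy_leaves[-1]
    working = [n for n in nodes if n.metric is not None and not n.is_buggy]
    return max(working, key=lambda n: n.metric) if working else root

def run_search(task_description, llm, evaluate, steps=20):
    """Draft, debug, or improve one node per step and evaluate only the new code."""
    root = Node(code="")                              # empty root; fresh drafts branch from here
    best = None
    for _ in range(steps):
        parent = select_parent(root)
        if parent is root:
            new_code = llm.draft(task_description)    # draft a fresh solution
        elif parent.is_buggy:
            new_code = llm.debug(parent.code)         # try to fix a broken script
        else:
            new_code = llm.improve(parent.code)       # refine a working one
        child = parent.add_child(Node(code=new_code))
        child.metric, child.is_buggy = evaluate(child.code)   # each node scored independently
        if not child.is_buggy and child.metric is not None and (best is None or child.metric > best.metric):
            best = child
    return best
```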
Weco Kaggle Bench is a collection of Kaggle competitions chosen to evaluate AIDE's performance on practical machine learning challenges, especially tabular tasks. For some competitions, we submit solutions to Kaggle and record the official private leaderboard score. In other cases, we estimate the score offline based on local evaluation or available data splits. This approach provides a broad benchmark that shows how well AIDE compares to real Kaggle participants and enables repeatable measurements on a range of datasets.
AIDE demonstrates competitive performance on Kaggle benchmarks given only natural-language task descriptions.
On a smaller subset of the Weco Kaggle Bench, which focuses on tabular machine learning, AIDE outperforms conventional AutoML methods as well as general-purpose AI agents.
In parallel, MLE-Bench is a large-scale benchmark with 75 Kaggle competitions, spanning diverse tasks from basic tabular challenges to advanced GPU-based deep learning. It evaluates each agent by tracking estimated competition scores, valid submissions, and medal achievements, allowing a consistent comparison across different LLM-based approaches to automated machine learning.
AIDE achieves three times as many medals as the second-place agent on MLE-Bench.
AIDE pairs well with reasoning models: on MLE-Bench Lite, it boosts performance 3.5x over using o1-preview alone.
AIDE shows unexpectedly strong performance on RE-Bench, which focuses on AI R&D tasks such as kernel optimization and fine-tuning language models. Despite being primarily designed for machine learning engineering, the agent briefly surpasses human experts in overall score within the first 6 hours. This outcome highlights the potential of AIDE’s structured, iterative approach to generalize beyond machine learning benchmarks.