Measuring the Progress of AI Research

This pilot project collects problems and metrics/datasets from the AI research literature, and tracks progress on them.

You can use this Notebook to see how things are progressing in specific subfields, or in AI/ML as a whole; as a place to report new results you've obtained; as a place to look for problems that might benefit from having new datasets/metrics designed for them; or as a source to build on for data science projects.

At EFF, we're ultimately most interested in how this data can influence our understanding of the likely implications of AI. To begin with, we're focused on gathering it.

Original authors: Peter Eckersley and Yomna Nasser at EFF. Contact:

With contributions from: Yann Bayle, Owain Evans, and Gennie Gebhart

Inspired by and merging data from:

Thanks to many others for valuable conversations, suggestions and corrections, including: Dario Amodei, James Bradbury, Miles Brundage, Mark Burdett, Breandan Considine, Owen Cotton-Barratt, Marc Bellemare, Will Dabney, Eric Drexler, Otavio Good, Katja Grace, Hado van Hasselt, Anselm Levskaya, Clare Lyle, Toby Ord, Michael Page, Maithra Raghu, Anders Sandberg, Laura Schatzkin, Daisy Stanton, Gabriel Synnaeve, Stacey Svetlichnaya, Helen Toner, and Jason Weston. EFF's work on this project has been supported by the Open Philanthropy Project.


The notebook collates data with the following structure:

    problem
       \
        \   metrics  -  measurements
         \
          - subproblems

Problems describe the ability to learn an important category of task.

Metrics should ideally be formulated in the form "software is able to learn to do X given training data of type Y". In some cases X is the interesting part, but sometimes also Y.

Measurements are the score that a specific instance of a specific algorithm was able to get on a Metric.

Problems are tagged with attributes: e.g., vision, abstract-games, language, world-modelling, safety.

Some of these attributes concern performance relative to humans (which is of course a very arbitrary standard, but one we're familiar with):

  • agi -- most capable humans can do this, so AGIs can do this (note it's conceivable that an agent might pass the Turing test before all of these are won)
  • super -- the very best humans can do this, or human organisations can do this
  • verysuper -- neither humans nor human orgs can presently do this

Problems can have "subproblems": simpler cases of, or preconditions for, solving the problem in general.

A "metric" is one way of measuring progress on a problem, commonly associated with a test dataset. There will often be several metrics for a given problem, but in some cases we'll start out with zero metrics and will need to start proposing some...

A "measurement" is a score on a given metric, achieved by a particular codebase/team/project at a particular time.
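
The problem → metric → measurement relationships above can be sketched with plain Python classes. This is an illustrative toy model only, not the actual `taxonomy.py` API; the class and method names here are hypothetical:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Measurement:
    value: float       # the score achieved on the metric
    name: str          # the codebase/team/project that produced it
    when: date         # when the result was obtained

@dataclass
class Metric:
    name: str
    measurements: list = field(default_factory=list)

    def measure(self, when, value, name):
        m = Measurement(value, name, when)
        self.measurements.append(m)
        return m

@dataclass
class Problem:
    name: str
    attributes: list = field(default_factory=list)  # e.g. ["vision", "agi"]
    metrics: list = field(default_factory=list)
    subproblems: list = field(default_factory=list)

    def metric(self, name):
        m = Metric(name)
        self.metrics.append(m)
        return m

    def add_subproblem(self, sub):
        self.subproblems.append(sub)

# A problem with a subproblem, one metric, and one measurement:
vision = Problem("Vision", ["vision", "agi"])
classification = Problem("Image classification", ["vision"])
vision.add_subproblem(classification)
imagenet = classification.metric("Imagenet Image Recognition")
imagenet.measure(date(2012, 6, 16), 16.4, "AlexNet")
```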

The present state of the actual taxonomy is at the bottom of this notebook.

Source Code

  • Code implementing the taxonomy of Problems and subproblems, Metrics and Measurements is defined in a free-standing Python file, taxonomy.py; scales.py contains definitions of the various unit systems used by Metrics.
  • Most source data is now defined in a series of separate files by topic:

    • data/ for hand-entered computer vision data
    • data/ for hand-entered and merged language data
    • data/ for data on abstract strategy games
    • data/ a combination of hand-entered and scraped Atari data (other video game data can also go here)
    • data/ for data on scientific & technical problems

    • data imported from specific scrapers (and then subsequently edited):

    • For now, some of the Problems and Metrics are still defined in this Notebook, especially in areas that do not have many active results yet.
  • Scrapers for specific data sources:
    • scrapers/ for importing data from Rodrigo Benenson's Are We There Yet? site
    • scrapers/ for processing a pasted table of data from the Evolutionary Strategies Atari paper (probably a useful model for other Atari papers).
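
The scrapers above mostly pull (date, algorithm, score) rows out of HTML results tables. A minimal sketch of that kind of table-scraping, using only the standard library and made-up data (the real scrapers use lxml and CSSSelector against live pages):

```python
import xml.etree.ElementTree as ET
from datetime import date

# A made-up fragment in the style of a leaderboard results table.
HTML_TABLE = """
<table>
  <tr><th>Date</th><th>Algorithm</th><th>Error</th></tr>
  <tr><td>2012-06-16</td><td>AlexNet</td><td>16.4</td></tr>
  <tr><td>2014-09-01</td><td>VGG</td><td>7.3</td></tr>
</table>
"""

def scrape_measurements(html):
    """Parse (date, name, score) rows out of a well-formed results table."""
    root = ET.fromstring(html)
    rows = []
    for tr in root.findall("tr")[1:]:  # skip the <th> header row
        when, name, err = (td.text for td in tr.findall("td"))
        rows.append((date.fromisoformat(when), name, float(err)))
    return rows

scraped = scrape_measurements(HTML_TABLE)
```

Each scraped row could then be fed into a Metric's measurement list, with hand edits afterwards where the source table is wrong or incomplete.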
In [1]:
from IPython.display import HTML

HTML('''<script>
    if (typeof code_show == "undefined") {
        code_show = true;
    } else {
        code_show = !code_show; // FIXME hack, because we toggle on load :/
    }
    function toggle_one(mouse_event) {
        var button = mouse_event.target;
        console.log("Unhiding " + button);
        var parent = button.parentNode;
        console.log("Parent " + parent);
        var input = parent.querySelector(".input");
        console.log("Input " + input + " " + input.classList);
        input.style.display = "block";
    }
    function code_toggle() {
        if (!code_show) {
            var inputs = $('div.input');
            for (var n = 0; n < inputs.length; n++) {
                if (inputs[n].innerHTML.match('# hidd' + 'encode')) {
                    inputs[n].style.display = "none";
                    var button = document.createElement("button");
                    button.innerHTML = "unhide code";
                    button.style.width = "100px";
                    button.style.height = "90px";
                    button.addEventListener("click", toggle_one);
                    // inputs[n].parentNode.appendChild(button);
                }
            }
        } else {
            $('div.input').show();
        }
        code_show = !code_show;
    }
    $( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()">
    <input type="submit" value="Click here to show/hide source code cells."> <br><br>(you can mark a cell as code with <tt># hiddencode</tt>)
</form>''')

In [2]:
# hiddencode
from __future__ import print_function

%matplotlib inline  
import matplotlib as mpl
try:
    from lxml.cssselect import CSSSelector
except ImportError:
    # terrifying magic for Azure Notebooks
    import os
    if os.getcwd() == "/home/nbuser":
        !pip install cssselect
        from lxml.cssselect import CSSSelector

import datetime
import json
import re

from matplotlib import pyplot as plt

date = datetime.date

import taxonomy
from taxonomy import Problem, Metric, problems, metrics, measurements, all_attributes, offline, render_tables
from scales import *

Problems, Metrics, and Datasets


(Imagenet example data)

The simplest vision subproblem is probably image classification, which determines what objects are present in a picture. From 2010-2017, Imagenet has been a closely watched contest for progress in this domain.
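
Imagenet classification is usually scored by top-5 error: a prediction counts as correct if the true label appears among the model's five highest-ranked guesses. A toy illustration of that scoring rule, with made-up labels and predictions:

```python
def top5_error(predictions, truths):
    """predictions: per-image label lists ranked best-first; truths: true labels."""
    misses = sum(1 for ranked, truth in zip(predictions, truths)
                 if truth not in ranked[:5])
    return misses / len(truths)

# Two images: the first is a top-5 hit, the second a miss.
preds = [["cat", "dog", "fox", "wolf", "lynx", "bear"],
         ["car", "bus", "van", "truck", "tram", "bike"]]
truth = ["lynx", "bike"]
err = top5_error(preds, truth)  # one miss out of two images -> 0.5
```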

Image classification includes not only recognising single objects within an image, but also localising them: essentially, specifying which pixels belong to which object. MSRC-21 is a metric specifically for that task:

(MSRC 21 example data)
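
Segmentation metrics of this kind are typically scored per pixel: every pixel gets a predicted class, compared against a ground-truth labelling. A toy per-pixel accuracy computation on made-up 2×3 label grids (not the actual MSRC-21 evaluation code):

```python
def pixel_accuracy(predicted, truth):
    """Fraction of pixels whose predicted class matches the ground truth."""
    total = correct = 0
    for p_row, t_row in zip(predicted, truth):
        for p, t in zip(p_row, t_row):
            total += 1
            correct += (p == t)
    return correct / total

# Five of the six pixels agree; one "cow" pixel is mislabelled "grass".
pred  = [["sky",   "sky",   "tree"],
         ["grass", "grass", "tree"]]
truth = [["sky",   "sky",   "tree"],
         ["grass", "cow",   "tree"]]
acc = pixel_accuracy(pred, truth)  # 5/6
```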
In [3]:
from import *
In [4]:
from import *
from data.awty import *
In [5]:
for m in sorted(image_classification.metrics, key=lambda m: m.name):
    if m != imagenet: m.graph()
/home/pde/.local/lib/python2.7/site-packages/matplotlib/axes/ UserWarning: No labelled objects found. Use label='...' kwarg on individual plots.
  warnings.warn("No labelled objects found. "