Measuring the Progress of AI Research
This pilot project collects problems and metrics/datasets from the AI research literature, and tracks progress on them.
You can use this Notebook to see how things are progressing in specific subfields or AI/ML as a whole, as a place to report new results you've obtained, as a place to look for problems that might benefit from having new datasets/metrics designed for them, or as a source to build on for data science projects.
At EFF, we're ultimately most interested in how this data can influence our understanding of the likely implications of AI. To begin with, we're focused on gathering it.
Inspired by and merging data from:
- Rodrigo Benenson's "Who is the Best at X / Are we there yet?" collating machine vision datasets & progress
- Jack Clark and Miles Brundage's collection of AI progress measurements
- Sarah Constantin's Performance Trends in AI
- Katja Grace's Algorithmic Progress in Six Domains
- The Swedish Computer Chess Association's History of Computer Chess performance
- Gabriel Synnaeve's WER are We collation of speech recognition performance data
- Qi Wu et al.'s Visual Question Answering: A survey of Methods and Datasets
- Eric Yuan's Comparison of Machine Reading Comprehension Datasets
Thanks to many others for valuable conversations, suggestions and corrections, including: Dario Amodei, James Bradbury, Miles Brundage, Mark Burdett, Breandan Considine, Owen Cotton-Barratt, Marc Bellemare, Will Dabney, Eric Drexler, Otavio Good, Katja Grace, Geoffrey Irving, Hado van Hasselt, Anselm Levskaya, Clare Lyle, Toby Ord, Michael Page, Maithra Raghu, Anders Sandberg, Laura Schatzkin, Daisy Stanton, Gabriel Synnaeve, Stacey Svetlichnaya, Helen Toner, and Jason Weston. EFF's work on this project has been supported by the Open Philanthropy Project.
Table of Contents
- Source code for defining and importing data
- Game Playing
- Vision and image modelling
- Written Language
- Spoken Language
- Music Information Retrieval
- Learning to Learn Better
- Safety and Security
- Transparency, Explainability & Interpretability
- Fairness and Debiasing
- Privacy Problems
This notebook collates data with the following structure:
    problem
        \   \
         \   metrics  -  measures
          \
           - subproblems
                \
              metrics
                 \
                measure[ment]s
Problems describe the ability to learn an important category of task.
Metrics should ideally be formulated in the form "software is able to learn to do X given training data of type Y". In some cases X is the interesting part, but sometimes also Y.
Measurements are the score that a specific instance of a specific algorithm was able to get on a Metric.
Problems are tagged with attributes, e.g. vision, abstract-games, language, world-modelling, safety.
Some of these are about performance relative to humans (which is of course a very arbitrary standard, but one we're familiar with):
- agi -- most capable humans can do this, so AGIs can do this (note it's conceivable that an agent might pass the Turing test before all of these are won)
- super -- the very best humans can do this, or human organisations can do this
- verysuper -- neither humans nor human orgs can presently do this
Problems can have "subproblems", including simpler cases and preconditions for solving the problem in general.
A "metric" is one way of measuring progress on a problem, commonly associated with a test dataset. There will often be several metrics for a given problem, but in some cases we'll start out with zero metrics and will need to start proposing some.
A measure[ment] is a score on a given metric, by a particular codebase/team/project, at a particular time.
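The problem / metric / measurement hierarchy described above can be pictured as a small set of nested records. The sketch below is illustrative only: the class and field names are assumptions chosen to mirror the prose, and they are not the definitions found in taxonomy.py. The data in it is made up.

```python
# Hypothetical sketch only: these classes are NOT the definitions in taxonomy.py.
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Measurement:
    """A score on a metric, by a particular project, at a particular time."""
    when: date
    value: float
    name: str  # codebase/team/project that produced the score


@dataclass
class Metric:
    """One way of measuring progress on a problem, often tied to a test dataset."""
    name: str
    measurements: List[Measurement] = field(default_factory=list)


@dataclass
class Problem:
    """An important category of task, tagged with attributes."""
    name: str
    attributes: List[str] = field(default_factory=list)
    metrics: List[Metric] = field(default_factory=list)
    subproblems: List["Problem"] = field(default_factory=list)

    def all_metrics(self) -> List[Metric]:
        """Collect metrics from this problem and, recursively, its subproblems."""
        found = list(self.metrics)
        for sub in self.subproblems:
            found.extend(sub.all_metrics())
        return found


# Illustrative names and a made-up score; not project data.
toy_metric = Metric("toy benchmark accuracy")
toy_metric.measurements.append(Measurement(date(2017, 1, 1), 0.5, "example-system"))
child = Problem("toy subproblem", attributes=["vision"], metrics=[toy_metric])
parent = Problem("toy problem", attributes=["vision", "agi"], subproblems=[child])

print([m.name for m in parent.all_metrics()])
```

Keeping metrics on the subproblem and walking the tree on demand is one way to let a general problem start with zero metrics of its own while still reporting the metrics attached to its simpler cases.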
The present state of the actual taxonomy is at the bottom of this notebook.
- Code implementing the taxonomy of Problems and subproblems, Metrics and Measurements is defined in a free-standing Python file, taxonomy.py. scales.py contains definitions of various unit systems used by Metrics.
Most source data is now defined in a series of separate files by topic:
- data/vision.py for hand-entered computer vision data
- data/language.py for hand-entered and merged language data
- data/strategy_games.py for data on abstract strategy games
- data/video_games.py for a combination of hand-entered and scraped Atari data (other video game data can also go here)
- data/stem.py for data on scientific & technical problems
- data imported from specific scrapers (and then subsequently edited):
  - Are We There Yet? image data, generated by scrapers/awty.py but then edited by hand
- For now, some of the Metrics are still defined in this Notebook, especially in areas that do not have many active results yet.
- Scrapers for specific data sources:
# hiddencode
from __future__ import print_function

%matplotlib inline
import matplotlib as mpl

try:
    from lxml.cssselect import CSSSelector
except ImportError:
    # terrifying magic for Azure Notebooks
    import os
    if os.getcwd() == "/home/nbuser":
        !pip install cssselect
        from lxml.cssselect import CSSSelector
    else:
        raise

import datetime
import json
import re

from matplotlib import pyplot as plt

date = datetime.date

import taxonomy
#reload(taxonomy)
from taxonomy import Problem, Metric, problems, metrics, measurements, all_attributes, offline, render_tables
from scales import *
The simplest vision subproblem is probably image classification, which determines what objects are present in a picture. From 2010 to 2017, ImageNet was a closely watched contest for progress in this domain.
from data.vision import *
imagenet.graph()

Image classification includes not only recognising single things within an image, but also localising them and, essentially, specifying which pixels belong to which object. MSRC-21 is a metric that is specifically for that task:

from data.vision import *
from data.awty import *

# Graph every image classification metric except ImageNet, in name order
for m in sorted(image_classification.metrics, key=lambda m: m.name):
    if m != imagenet:
        m.graph()
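Because each measurement is a score recorded at a particular time, progress on a single metric can be summarised as the sequence of results that set a new record. The following is a minimal, library-free sketch of that idea. The function `running_best` and the scores are hypothetical, and the sketch makes no claim about what `graph()` in taxonomy.py actually draws.

```python
from datetime import date


def running_best(measurements, higher_is_better=True):
    """Return the (date, value) pairs that set a new best score, in date order."""
    frontier = []
    best = None
    for when, value in sorted(measurements, key=lambda pair: pair[0]):
        if best is None:
            improved = True
        elif higher_is_better:
            improved = value > best
        else:
            improved = value < best
        if improved:
            best = value
            frontier.append((when, value))
    return frontier


# Made-up error rates (lower is better), for illustration only.
toy_scores = [
    (date(2012, 1, 1), 0.30),
    (date(2011, 1, 1), 0.40),
    (date(2013, 1, 1), 0.35),
    (date(2014, 1, 1), 0.20),
]

for when, value in running_best(toy_scores, higher_is_better=False):
    print(when.isoformat(), value)
```

The `higher_is_better` flag is there because some metrics are accuracies and others are error rates, so the direction of improvement has to be stated per metric.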