This pilot project collects problems and metrics/datasets from the AI research literature, and tracks progress on them.
You can use this Notebook to see how things are progressing in specific subfields or in AI/ML as a whole, as a place to report new results you've obtained, as a place to look for problems that might benefit from having new datasets/metrics designed for them, or as a source to build on for data science projects.
At EFF, we're ultimately most interested in how this data can influence our understanding of the likely implications of AI. To begin with, we're focused on gathering it.
Inspired by and merging data from:
Thanks to many others for valuable conversations, suggestions and corrections, including: Dario Amodei, James Bradbury, Miles Brundage, Mark Burdett, Breandan Considine, Owen Cotton-Barratt, Marc Bellemare, Will Dabney, Eric Drexler, Otavio Good, Katja Grace, Hado van Hasselt, Anselm Levskaya, Clare Lyle, Toby Ord, Michael Page, Maithra Raghu, Anders Sandberg, Laura Schatzkin, Daisy Stanton, Gabriel Synnaeve, Stacey Svetlichnaya, Helen Toner, and Jason Weston. EFF's work on this project has been supported by the Open Philanthropy Project.
The notebook collates data with the following structure:
problem
    metrics
        measure[ment]s
    subproblems
        metrics
            measure[ment]s
Problems describe the ability to learn an important category of task.
Metrics should ideally be formulated in the form "software is able to learn to do X given training data of type Y". In some cases X is the interesting part; in others, it is Y.
A measurement is the score that a specific instance of a specific algorithm was able to achieve on a Metric.
Problems are tagged with attributes, e.g. vision, abstract-games, language, world-modelling, safety.
Some of these are about performance relative to humans (which is of course a very arbitrary standard, but one we're familiar with).
Problems can have "subproblems", including simpler cases and preconditions for solving the problem in general.
A "metric" is one way of measuring progress on a problem, commonly associated with a test dataset. There will often be several metrics for a given problem, but in some cases we'll start out with zero metrics and will need to start proposing some...
A measure[ment] is a score on a given metric, achieved by a particular codebase/team/project at a particular time (see the illustrative sketch below).
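To make this structure concrete, here is a minimal, purely illustrative sketch of how a problem, a metric, and a measurement might be registered. The constructor and method signatures shown (Problem(...), .metric(...), .measure(...)) and all of the values are assumptions for illustration only; the real interface is defined in taxonomy.py and the real entries live in the data/ files.

# Illustrative sketch only: signatures and values are placeholders, not the
# taxonomy module's actual API or real results.
from datetime import date
from taxonomy import Problem

# A problem: an important category of task that software should learn.
example_problem = Problem("Example image classification", ["vision"])

# A metric: one way of measuring progress on the problem, usually tied to a
# test dataset ("learn to do X given training data of type Y").
example_metric = example_problem.metric("Example test-set accuracy")

# A measurement: the score a particular codebase/team/project achieved on the
# metric at a particular time.
example_metric.measure(date(2017, 1, 1), 0.9, "ExampleNet",
                       url="https://example.org/paper")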
The present state of the actual taxonomy is at the bottom of this notebook.
Most source data is now defined in a series of separate files by topic:
data/stem.py for data on scientific & technical problems
data imported from specific scrapers (and then edited by hand):
scrapers/awty.py
Metrics are still defined in this Notebook, especially in areas that do not have many active results yet.
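Once those files have been imported, the collated data can be explored directly. The snippet below is a hedged sketch for finding problems that still have no metrics (and so might benefit from having a new dataset or metric designed for them, as noted above); it assumes that problems behaves like a name-to-Problem mapping and that each Problem exposes .metrics and .attributes, which should be checked against taxonomy.py.

# Hedged sketch: assumes `problems` maps names to Problem objects and that
# each Problem exposes `.metrics` and `.attributes`; see taxonomy.py for the
# actual interface.
from taxonomy import problems

for name, problem in sorted(problems.items()):
    if not problem.metrics:
        # Problems with zero metrics are candidates for new datasets/metrics.
        print("needs a metric:", name, sorted(problem.attributes))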
# hiddencode
from __future__ import print_function
%matplotlib inline
import matplotlib as mpl
try:
    from lxml.cssselect import CSSSelector
except ImportError:
    # terrifying magic for Azure Notebooks
    import os
    if os.getcwd() == "/home/nbuser":
        !pip install cssselect
        from lxml.cssselect import CSSSelector
    else:
        raise

import datetime
import json
import re

from matplotlib import pyplot as plt

date = datetime.date

import taxonomy
#reload(taxonomy)
from taxonomy import Problem, Metric, problems, metrics, measurements, all_attributes, offline, render_tables
from scales import *
The simplest vision subproblem is probably image classification, which determines what objects are present in a picture. From 2010 to 2017, ImageNet was a closely watched contest for progress in this domain.
Image classification includes not only recognising single things within an image, but also localising them and essentially specifying which pixels belong to which object. MSRC-21 is a metric specifically for that task:
from data.vision import *
imagenet.graph()
from data.vision import *
from data.awty import *

for m in sorted(image_classification.metrics, key=lambda m: m.name):
    if m != imagenet:
        m.graph()
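Beyond the graphs, the raw numbers behind a metric can be listed directly. This is a hedged sketch that assumes each Metric keeps its measurements in a .measures list whose entries carry .date, .value, and .name attributes; the actual attribute names are defined in taxonomy.py.

# Hedged sketch: the attribute names (`.measures`, `.date`, `.value`, `.name`)
# are assumptions; consult taxonomy.py for the real ones.
for metric in image_classification.metrics:
    print(metric.name, "--", len(metric.measures), "measurements")
    for m in sorted(metric.measures, key=lambda m: m.date):
        print("   ", m.date, m.value, m.name)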