This pilot project collects problems and metrics/datasets from the AI research literature, and tracks progress on them.
You can use this Notebook to see how specific subfields, or AI/ML as a whole, are progressing; to report new results you've obtained; to find problems that might benefit from having new datasets/metrics designed for them; or as a source to build on for data science projects.
At EFF, we're ultimately most interested in how this data can influence our understanding of the likely implications of AI. To begin with, we're focused on gathering it.
Original authors: Peter Eckersley and Yomna Nasser at EFF. Contact: ai-metrics@eff.org.
With contributions from: Yann Bayle, Owain Evans, Gennie Gebhart and Dustin Schwenk.
Inspired by and merging data from:
Thanks to many others for valuable conversations, suggestions and corrections, including: Dario Amodei, James Bradbury, Miles Brundage, Mark Burdett, Breandan Considine, Owen Cotton-Barratt, Marc Bellemare, Will Dabney, Eric Drexler, Otavio Good, Katja Grace, Hado van Hasselt, Anselm Levskaya, Clare Lyle, Toby Ord, Michael Page, Maithra Raghu, Anders Sandberg, Laura Schatzkin, Daisy Stanton, Gabriel Synnaeve, Stacey Svetlichnaya, Helen Toner, and Jason Weston. EFF's work on this project has been supported by the Open Philanthropy Project.
Problems, Metrics and Datasets
Taxonomy
It collates data with the following structure:
problem
 ├── metrics
 │     └── measure[ment]s
 └── subproblems
       └── metrics
             └── measure[ment]s
Problems describe the ability to learn an important category of task.
Metrics should ideally be formulated in the form "software is able to learn to do X given training data of type Y". In some cases X is the interesting part; in others, Y is.
Measurements are the score that a specific instance of a specific algorithm was able to achieve on a Metric.
Problems are tagged with attributes: e.g., vision, abstract-games, language, world-modelling, safety.
Some of these are about performance relative to humans (which is of course a very arbitrary standard, but one we're familiar with).
Problems can have "subproblems", including simpler cases and preconditions for solving the problem in general.
A "metric" is one way of measuring progress on a problem, and is commonly associated with a test dataset. There will often be several metrics for a given problem, but in some cases we'll start out with zero metrics and will need to start proposing some...
A measure[ment] is a score on a given metric, obtained by a particular codebase/team/project at a particular time.
The present state of the actual taxonomy is at the bottom of this notebook.
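To make this structure concrete, here is a minimal sketch of how an entry might be declared. Problem and Metric are the real classes imported from taxonomy below, but the constructor and method signatures shown here are guesses for illustration only; the authoritative definitions live in taxonomy.py and the data/ files.
# hiddencode
from datetime import date
from taxonomy import Problem, Metric  # real classes; argument details below are assumptions

# A Problem, tagged with attributes:
example_problem = Problem("Classify images of everyday objects", ["vision"])

# A Metric for that problem, typically tied to a test dataset (how a Metric
# is linked to its parent Problem is defined in taxonomy.py):
example_metric = Metric("Hypothetical held-out image test set", "https://example.org/dataset")

# A measurement: the score a particular system obtained at a particular time.
# The value is ResNet-152's real 3.57% top-5 ImageNet error (He et al. 2015):
example_metric.measure(date(2015, 12, 10), 0.0357, "ResNet-152", url="https://arxiv.org/abs/1512.03385")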
Most source data is now defined in a series of separate files, by topic:
data/stem.py for data on scientific & technical problems
data imported from specific scrapers (and then subsequently edited):
scrapers/awty.py, but then edited by hand
Problems and Metrics are still defined in this Notebook, especially in areas that do not have many active results yet.
from IPython.display import HTML
HTML('''
<script>
if (typeof code_show == "undefined") {
code_show=false;
} else {
code_show = !code_show; // FIXME hack, because we toggle on load :/
}
function toggle_one(mouse_event) {
    // Un-hide the code cell associated with the clicked "unhide code" button.
    var button = mouse_event.target;
    var parent = button.parentNode;
    var input = parent.querySelector(".input");
    input.style.display = "block";
}
function code_toggle() {
    if (!code_show) {
        var inputs = $('div.input');
        for (var n = 0; n < inputs.length; n++) {
            if (inputs[n].innerHTML.match('# hidd' + 'encode')) {
                inputs[n].style.display = "none";
                var button = document.createElement("button");
                button.innerHTML = "unhide code";
                button.style.width = "100px";
                button.style.marginLeft = "90px";
                button.addEventListener("click", toggle_one);
                button.classList.add("cell-specific-unhide");
                // inputs[n].parentNode.appendChild(button);
            }
        }
    } else {
        $('div.input').show();
        $('button.cell-specific-unhide').remove();
    }
    code_show = !code_show;
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()">
<input type="submit" value="Click here to show/hide source code cells."> <br><br>(you can mark a cell as code with <tt># hiddencode</tt>)
</form>
''')
# hiddencode
from __future__ import print_function
%matplotlib inline
import matplotlib as mpl
try:
from lxml.cssselect import CSSSelector
except ImportError:
# terrifying magic for Azure Notebooks
import os
if os.getcwd() == "/home/nbuser":
!pip install cssselect
from lxml.cssselect import CSSSelector
else:
raise
import datetime
import json
import re
from matplotlib import pyplot as plt
date = datetime.date
import taxonomy
#reload(taxonomy)
from taxonomy import Problem, Metric, problems, metrics, measurements, all_attributes, offline, render_tables
from scales import *
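Everything declared this way ends up in the shared problems, metrics and measurements collections imported above, so the data can also be sliced directly. As a small example, this helper (ours, not part of taxonomy.py, and assuming each Problem exposes .attributes and .metrics) gathers the metrics of every problem carrying a given attribute tag:
# hiddencode
def metrics_for_attribute(attr):
    # `problems` may be a dict (name -> Problem) or a plain list,
    # depending on taxonomy.py internals; handle both.
    ps = problems.values() if isinstance(problems, dict) else problems
    return [m for p in ps if attr in p.attributes for m in p.metrics]

# e.g., to graph every vision-related metric:
# for m in metrics_for_attribute("vision"):
#     m.graph()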
The simplest vision subproblem is probably image classification, which determines what objects are present in a picture. From 2010 to 2017, ImageNet was a closely watched contest for measuring progress in this domain.
Image classification extends beyond recognising a single object in an image: it also includes localising objects and, ultimately, specifying which pixels belong to which object (semantic segmentation). MSRC-21 is a metric specifically for that segmentation task:
from data.vision import *
imagenet.graph()
from data.vision import *
from data.awty import *
for m in sorted(image_classification.metrics, key=lambda m: m.name):
    if m != imagenet:
        m.graph()
Results from AWTY ("Are We There Yet?"), not yet imported:
Handling 'Pascal VOC 2011 comp3' detection_datasets_results.html#50617363616c20564f43203230313120636f6d7033
Skipping 40.6 mAP Fisher and VLAD with FLAIR CVPR 2014
Handling 'Leeds Sport Poses' pose_estimation_datasets_results.html#4c656564732053706f727420506f736573
69.2 % Strong Appearance and Expressive Spatial Models for Human Pose Estimation ICCV 2013
64.3 % Appearance sharing for collective human pose estimation ACCV 2012
63.3 % Poselet conditioned pictorial structures CVPR 2013
60.8 % Articulated pose estimation with flexible mixtures-of-parts CVPR 2011
55.6% Pictorial structures revisited: People detection and articulated pose estimation CVPR 2009
Handling 'Pascal VOC 2007 comp3' detection_datasets_results.html#50617363616c20564f43203230303720636f6d7033
Skipping 22.7 mAP Ensemble of Exemplar-SVMs for Object Detection and Beyond ICCV 2011
Skipping 27.4 mAP Measuring the objectness of image windows PAMI 2012
Skipping 28.7 mAP Automatic discovery of meaningful object parts with latent CRFs CVPR 2010
Skipping 29.0 mAP Object Detection with Discriminatively Trained Part Based Models PAMI 2010
Skipping 29.6 mAP Latent Hierarchical Structural Learning for Object Detection CVPR 2010
Skipping 32.4 mAP Deformable Part Models with Individual Part Scaling BMVC 2013
Skipping 34.3 mAP Histograms of Sparse Codes for Object Detection CVPR 2013
Skipping 34.3 mAP Boosted local structured HOG-LBP for object localization CVPR 2011
Skipping 34.7 mAP Discriminatively Trained And-Or Tree Models for Object Detection CVPR 2013
Skipping 34.7 mAP Incorporating Structural Alternatives and Sharing into Hierarchy for Multiclass Object Recognition and Detection CVPR 2013
Skipping 34.8 mAP Color Attributes for Object Detection CVPR 2012
Skipping 35.4 mAP Object Detection with Discriminatively Trained Part Based Models PAMI 2010
Skipping 36.0 mAP Machine Learning Methods for Visual Object Detection archives-ouvertes 2011
Skipping 38.7 mAP Detection Evolution with Multi-Order Contextual Co-occurrence CVPR 2013
Skipping 40.5 mAP Segmentation Driven Object Detection with Fisher Vectors ICCV 2013
Skipping 41.7 mAP Regionlets for Generic Object Detection ICCV 2013
Skipping 43.7 mAP Beyond Bounding-Boxes: Learning Object Shape by Model-Driven Grouping ECCV 2012
Handling 'Pascal VOC 2007 comp4' detection_datasets_results.html#50617363616c20564f43203230303720636f6d7034
Skipping 59.2 mAP Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition ECCV 2014
Skipping 58.5 mAP Rich feature hierarchies for accurate object detection and semantic segmentation CVPR 2014
Skipping 29.0 mAP Multi-Component Models for Object Detection ECCV 2012
Handling 'Pascal VOC 2010 comp3' detection_datasets_results.html#50617363616c20564f43203230313020636f6d7033
Skipping 24.98 mAP Learning Collections of Part Models for Object Recognition CVPR 2013
Skipping 29.4 mAP Discriminatively Trained And-Or Tree Models for Object Detection CVPR 2013
Skipping 33.4 mAP Object Detection with Discriminatively Trained Part Based Models PAMI 2010
Skipping 34.1 mAP Segmentation as selective search for object recognition ICCV 2011
Skipping 35.1 mAP Selective Search for Object Recognition IJCV 2013
Skipping 36.0 mAP Latent Hierarchical Structural Learning for Object Detection CVPR 2010
Skipping 36.8 mAP Object Detection by Context and Boosted HOG-LBP ECCV 2010
Skipping 38.4 mAP Segmentation Driven Object Detection with Fisher Vectors ICCV 2013
Skipping 39.7 mAP Regionlets for Generic Object Detection ICCV 2013
Skipping 40.4 mAP Fisher and VLAD with FLAIR CVPR 2014
Handling 'Pascal VOC 2010 comp4' detection_datasets_results.html#50617363616c20564f43203230313020636f6d7034
Skipping 53.7 mAP Rich feature hierarchies for accurate object detection and semantic segmentation CVPR 2014
Skipping 40.4 mAP Bottom-up Segmentation for Top-down Detection CVPR 2013
Skipping 33.1 mAP Multi-Component Models for Object Detection ECCV 2012
from IPython.display import HTML
HTML(image_classification.tables())

Comprehending an image involves more than recognising what objects or entities are within it; it also means recognising events, relationships, and context from the image. This problem requires sophisticated image recognition, language, and world-modelling, combined into "image comprehension". There are several datasets in use. The illustration is from VQA, which was generated by asking Amazon Mechanical Turk workers to propose questions about photos from Microsoft's COCO image collection.
plot = vqa_real_oe.graph(keep=True, title="COCO Visual Question Answering (VQA) real open ended", llabel="VQA 1.0")
vqa2_real_oe.graph(reuse=plot, llabel="VQA 2.0", fcol="#00a0a0", pcol="#a000a0")
for m in image_comprehension.metrics:
    if not m.graphed:
        m.graph()
HTML(image_comprehension.tables())
In principle, games are a sufficiently open-ended framework that all of intelligence could be captured within them. We can imagine a "ladder of games" that grows in sophistication and complexity, from simple strategy and arcade games to others that require very sophisticated language, world-modelling, vision and reasoning ability. At present, published reinforcement learning agents are climbing the first few rungs of this ladder.
As an easier case, abstract games like chess, go, and checkers can be played with no knowledge of the human world or physics. Although this domain has largely been solved to superhuman performance levels, there are a few loose ends to tie up, especially in having agents learn the rules of arbitrary abstract games effectively, given various plausible starting points (e.g., textual descriptions of the rules or examples of correct play).
from data.strategy_games import *
computer_chess.graph()
HTML(computer_chess.table())
Computer and video games are a very open-ended domain. It is possible that some existing or future games could be so elaborate that they are "AI complete". In the meantime, a lot of interesting progress is likely in exploring the "ladder of games" of increasing complexity on various fronts.
Atari 2600 games have been a popular target for reinforcement learning, especially at DeepMind and OpenAI. RL agents now play most but not all of these games better than humans.
In the Atari 2600 data, the label "noop" indicates that the game was played with a random number (up to 30) of "no-op" actions at the beginning, while the "hs" label indicates that the starting condition was a state sampled from 100 games played by expert human players. These forms of randomisation give RL systems a diversity of game states to learn from.
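To make the "noop" condition concrete, here is a rough sketch of a single evaluation episode under it, assuming a classic OpenAI-Gym-style Atari interface; run_noop_start_episode and agent.act() are illustrative names, not part of this notebook's scrapers or data files.
# hiddencode
import random

def run_noop_start_episode(env, agent, max_noops=30):
    # Begin with a random number (up to 30) of no-op actions, so the agent
    # faces a slightly different initial state each episode. In the Atari
    # Learning Environment, action 0 is NOOP.
    obs = env.reset()
    for _ in range(random.randint(1, max_noops)):
        obs, reward, done, info = env.step(0)
    total_reward, done = 0.0, False
    while not done:
        obs, reward, done, info = env.step(agent.act(obs))
        total_reward += reward
    return total_reward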
from data.video_games import *
from scrapers.atari import *
simple_games.graphs()