Deep Dive: EFF's New Wordlists for Random Passphrases
Randomly-generated passphrases offer a major security upgrade over user-chosen passwords. Estimating how hard it is to guess or crack a human-chosen password is very difficult. It was the primary topic of my own PhD thesis and remains an active area of research. (One of many difficulties when people choose passwords themselves is that people aren't very good at making random, unpredictable choices.)
Measuring the security of a randomly-generated passphrase is easy. The most common approach to randomly-generated passphrases (immortalized by XKCD) is to simply choose several words from a list of words, at random. The more words you choose, or the longer the list, the harder it is to crack. Looking at it mathematically, for k words chosen from a list of length n, there are n^k possible passphrases of this type. It will take an adversary about n^k/2 guesses on average to crack this passphrase. This leaves a big question, though: where do we get a list of words suitable for passphrases, and how do we choose the length of that list?
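To make the arithmetic concrete, here is a minimal Python sketch of the counting argument. The function names are illustrative, and it assumes words are drawn independently and uniformly, with repeats allowed, and that the attacker knows the list and tries candidates in random order.

```python
def passphrase_count(n: int, k: int) -> int:
    """Number of distinct passphrases made of k words drawn from a list of n words."""
    return n ** k


def average_guesses(n: int, k: int) -> int:
    """Guesses needed on average: about half of the whole space."""
    return passphrase_count(n, k) // 2


if __name__ == "__main__":
    # 7,776 is the length of a list indexed by five six-sided dice.
    for k in (4, 5, 6):
        print(k, passphrase_count(7776, k), average_guesses(7776, k))
```

Python integers have arbitrary precision, so the counts stay exact even for long passphrases.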
Several word lists have been published for different purposes; thus far, there has been little scientific evaluation of their usability. The most popular is Arnold Reinhold's Diceware list, first published in 1995. This list contains 7,776 words, equal to the number of possible ordered rolls of five six-sided dice (7,776 = 6^5), making it suitable for using standard dice as a source of randomness. While the Diceware list has been used for over twenty years, we believe there are several avenues to improve the usability and are introducing three new lists for use with a set of five dice (as part of its Summer Security Reboot Campaign, EFF is providing a dice set to donors).
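The link between dice and list length can be sketched in a few lines of Python. This is only an illustration: it assumes the list is stored in roll order (11111 first, 66666 last), and it uses the standard library's secrets module as a stand-in for physical dice. The helper names are hypothetical.

```python
import secrets


def roll_dice(count: int = 5) -> list[int]:
    """Simulate `count` fair six-sided dice with a cryptographic random source."""
    return [secrets.randbelow(6) + 1 for _ in range(count)]


def rolls_to_index(rolls: list[int]) -> int:
    """Read an ordered sequence of rolls as a base-6 number, giving a 0-based index."""
    index = 0
    for roll in rolls:
        if not 1 <= roll <= 6:
            raise ValueError(f"not a valid die face: {roll}")
        index = index * 6 + (roll - 1)
    return index


if __name__ == "__main__":
    rolls = roll_dice()
    print(rolls, "-> word number", rolls_to_index(rolls))
```

Five dice give indexes 0 through 7,775, so every word in a 7,776-word list is equally likely to be picked.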
Enhancements over the Diceware list
The Diceware list can provide strong security, but offers some challenges to usability. In particular, some of the words on the list can be hard to memorize, hard to spell, or easy to confuse with another word.
- It contains many rare words such as buret, novo, vacuo
- It contains unusual proper names such as della, ervin, eaton, moran
- It contains a few strange letter sequences such as aaaa, ll, nbis
- It contains some words with punctuation such as ain't, don't, he'll
- It contains individual letters and non-word bigrams like tl, wq, zf
- It contains numbers and variants such as 46, 99 and 99th
- It contains many vulgar words
- Diceware passwords need spaces to be correctly decoded, e.g. in and put are in the list as well as input.
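The last point, about spaces, can be checked mechanically. The following Python sketch counts how many ways a string with its spaces removed can be split back into list words. The three-word set is a toy stand-in for the real list, and the function name is illustrative.

```python
def count_segmentations(text: str, words: set[str]) -> int:
    """Count the ways `text` can be split into a sequence of words from `words`."""
    # ways[i] is the number of ways to split the first i characters.
    ways = [0] * (len(text) + 1)
    ways[0] = 1
    for end in range(1, len(text) + 1):
        for start in range(end):
            if ways[start] and text[start:end] in words:
                ways[end] += ways[start]
    return ways[len(text)]


if __name__ == "__main__":
    toy_list = {"in", "put", "input"}
    # Two readings: "in put" (two words) and "input" (one word).
    print(count_segmentations("input", toy_list))
```

Any result above 1 means the spaces carried information, so a passphrase typed without them no longer has the strength its word count suggests.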
Note that several of these problems are exacerbated for users with a soft keyboard or other typing systems that rely on word recognition. Using only valid dictionary words makes this setup much easier.
Our new "long" list
Our first new list matches the original Diceware list in size (7,776 words, or 6^5), offering equivalent security for each word you choose. However, we have fixed the above problems, resulting in a list that is hopefully easy to type and remember.
We based our list on data collected by Ghent University's Center for Reading Research. The Ghent team has long studied word recognition; you can participate yourself in their online quiz to measure your English vocabulary. This list gives us a good idea of which words are most likely to be familiar to English speakers and eliminates most of the unusual words in the original Diceware list. This data also includes "concreteness" ratings for each word, from very concrete words (such as screwdriver) to very abstract words (such as love).
We took all words between 3 and 9 characters from the list, prioritizing the most recognized words and then the most concrete words. We manually checked and attempted to remove as many profane, insulting, sensitive, or emotionally-charged words as possible, and also filtered based on several public lists of vulgar English words (for example this one published by Luis von Ahn). We further removed words which are difficult to spell as well as homophones (which might be confused during recall). We also ensured that no word is an exact prefix of any other word.
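Two of these steps, the ranking and the prefix check, are easy to sketch in Python. This is not the script used to build the list: the row layout (word, recognition score, concreteness score), the helper names, and the sample numbers are all invented for illustration, and the manual review steps are not modeled.

```python
def rank_candidates(rows: list[tuple[str, float, float]]) -> list[str]:
    """Keep words of 3 to 9 characters, ordered by recognition, then by concreteness.

    Each row is (word, recognition, concreteness); higher scores are better.
    """
    eligible = [row for row in rows if 3 <= len(row[0]) <= 9]
    eligible.sort(key=lambda row: (-row[1], -row[2]))
    return [row[0] for row in eligible]


def prefix_conflicts(words: list[str]) -> list[tuple[str, str]]:
    """Return (shorter, longer) pairs of sorted neighbors where one is a prefix of the other.

    In sorted order, a word that is a prefix of any other word is always a
    prefix of its immediate successor, so an empty result means the list is
    prefix-free. It does not enumerate every conflicting pair.
    """
    ordered = sorted(set(words))
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b.startswith(a)]


if __name__ == "__main__":
    sample = [("love", 0.99, 1.5), ("hammer", 0.99, 4.8), ("ox", 0.90, 4.9), ("buret", 0.30, 4.5)]
    print(rank_candidates(sample))
    print(prefix_conflicts(["in", "input", "put"]))
```

A prefix-free list is what makes a passphrase decodable without separators: reading left to right, at most one list word can match at each position.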
The result is our own list of 7,776 words [.txt] suitable for use in dice-generated passphrases. The words in our list are longer on average (7.0 characters) than those in Reinhold's Diceware list (4.3 characters). This is a result of banning words under 3 characters as well as prioritizing familiar words over short but unusual words.
Note that the security of a passphrase generated using either list is identical; the differences are in usability, including memorability, not in security. For most uses, we recommend generating a six-word passphrase with this list, for a strength of 77 bits of entropy. ("Bits of entropy" is a common measure for the strength of a password or passphrase. Adding one bit of entropy doubles the number of guesses required, which makes it twice as difficult to brute force.) Each additional word will strengthen the passphrase by about 12.9 bits.
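The entropy figures follow directly from the list length. As a small Python sketch (function names are illustrative, and the calculation assumes each word is chosen uniformly and independently):

```python
import math


def bits_per_word(list_size: int) -> float:
    """Entropy contributed by one uniformly chosen word, in bits."""
    return math.log2(list_size)


def passphrase_bits(list_size: int, word_count: int) -> float:
    """Total entropy of a passphrase of `word_count` independently chosen words."""
    return word_count * bits_per_word(list_size)


if __name__ == "__main__":
    print(f"{bits_per_word(7776):.1f} bits per word")
    print(f"{passphrase_bits(7776, 6):.1f} bits for six words")
```

Six words come to roughly 77.5 bits, which the text rounds down to 77.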
Our new "short" lists
We are also introducing new lists containing only 1,296 words (6^4), suitable for use with four six-sided dice. By reducing the number of words in the list, we were able to use words with a maximum of five characters. This can lead to more efficient typing for the same security if it requires fewer characters to enter N short words than N-1 long words.
Passphrases generated using the shorter lists will be weaker than the long list on a per-word basis (10.3 bits/word). Put another way, this means you would need to choose more words from the short list to get comparable security to the long list. For example, using eight words from the short list will provide a strength of about 82 bits, slightly stronger than six words from the long list.
The first short list [.txt] is designed to include the 1,296 most memorable and distinct words. Our hope is that this approach might offer a usability improvement for longer passphrases. Further study is needed to determine conclusively which list will yield passphrases that are easier to remember.
Finally, we're publishing one more short list [.txt] with a few additional features making the words easy to type:
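The trade-off between list length and word count can be turned around: given a target strength, how many words does each list need? A minimal Python sketch follows, with an illustrative function name and a target of 77.5 bits chosen to match six words from the long list.

```python
import math


def words_needed(target_bits: float, list_size: int) -> int:
    """Smallest number of words whose combined entropy reaches `target_bits`."""
    return math.ceil(target_bits / math.log2(list_size))


def total_bits(list_size: int, word_count: int) -> float:
    """Entropy of `word_count` uniformly chosen words from a list of `list_size`."""
    return word_count * math.log2(list_size)


if __name__ == "__main__":
    for name, size in (("long", 7776), ("short", 1296)):
        k = words_needed(77.5, size)
        print(f"{name}: {k} words, {total_bits(size, k):.1f} bits")
```

Seven short words fall just below six long words in strength, which is why the comparison in the text uses eight.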
- Each word has a unique three-character prefix. This means that future software could auto-complete words in the passphrase after the user has typed the first three characters.
- All words are at least an edit distance of 3 apart. This means that future software could correct any single typo in the user's passphrase (and in many cases more than one typo).
We've added these features in the hope that future software might be specially designed to take advantage of them. They will not offer a significant benefit today, so for individual users this list is mostly a proof of concept. Software developers might be able to find interesting uses for it.
Different lists might be preferable in different situations, and that's perfectly fine. For example, you might consider using one of the short lists when you are prioritizing ease of remembering, or when you know that the highest level of passphrase strength is not necessary. This might cover a website login that offers additional protections, like two-factor authentication, and that rate-limits guesses to protect against brute force.
If you are typing the passphrase frequently (as opposed to using a passphrase database), you might prioritize reducing the length of the words. Our long list has an average length of 7.0 characters per word, and 12.9 bits of entropy per word, yielding an efficiency of 1.8 bits of entropy per character. Our short list has an average length of 4.5 characters per word, and 10.3 bits of entropy per word, yielding 2.3 bits of entropy per character. Our typo-tolerant list is much less efficient at only 1.4 bits of entropy per character. However, using a future autocomplete software feature, only three characters would need to be typed per word, in which case this would be the most efficient list to use at 3.1 bits of entropy per character typed.
You might find the shorter average length in the original Diceware list to be preferable. That's perfectly fine as well, given the caveats we mentioned about the difficulty of using this list. Note that the original Diceware list offers 3.0 bits of entropy per character and hence less typing. As discussed above, we feel the many short words in this list (including single letters and bigrams) are hard to remember, which makes them a bad tradeoff for decreased typing time.
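The efficiency figures for the long list, the first short list, and the original Diceware list can be reproduced from the list sizes and average word lengths quoted above. A minimal Python sketch, with an illustrative function name and separator characters not counted:

```python
import math


def bits_per_character(list_size: int, average_word_length: float) -> float:
    """Entropy per typed character: bits per word divided by average word length."""
    return math.log2(list_size) / average_word_length


if __name__ == "__main__":
    lists = {
        "EFF long": (7776, 7.0),
        "EFF short": (1296, 4.5),
        "Diceware": (7776, 4.3),
    }
    for name, (size, avg_len) in lists.items():
        print(f"{name}: {bits_per_character(size, avg_len):.1f} bits per character")
```

Counting the spaces between words would lower every figure somewhat, and would lower it most for the lists with the shortest words.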
Since passphrases are individually chosen, it's okay for multiple lists to exist. In fact, this might even increase security, as it means the attacker has some uncertainty about which list was used to generate a passphrase.
We think our lists will be useful for people generating passphrases using EFF's dice (or otherwise), though they certainly aren't the last word on the matter. There's plenty of room for further research and experimentation on memorability and ways of optimizing lists and we hope people will keep exploring this area.