External Data Sources

The target word, key, and distractors for each item were evaluated along several dimensions. Lexical, semantic, orthographic, and phonological word features were coded. The databases and other external sources used for item feature coding are listed below:

Lexical Features

Word Frequency of individual words: The Educator’s Word Frequency Guide (Zeno et al., 1995; [no link available]); Wikipedia Corpus (Davies, 2015); Brown Corpus [no link available]; and Hyperspace Analogue to Language (HAL) Corpus (Lund & Burgess, 1996; [no link available]).
Word Age: number of years between the first recorded use of the word and the year 2000, from Google Ngrams

Semantic Features

Number of Morphemes present in the word, from the English Lexicon Project (Balota et al., 2007)
Number of Meanings and Senses associated with the word, from WordNet (Fellbaum, 1998; Miller et al., 1990)
Semantic Precision: depth of hypernym chain for a word, from WordNet (Fellbaum, 1998; Miller et al., 1990)
Dispersion: number of different subject areas where a word appears (scaled from 0 to 1), from the Educator’s Word Frequency Guide (Zeno et al., 1995; [no link available])
Semantic Diversity: semantic similarities of all contexts in which a word appears, from Hoffman et al. (2013)
Contextual Diversity: the number of different texts in which a word appears, from Adelman et al. (2006; [no link available]) and SUBTLEX-UK (van Heuven et al., 2014)
Semantic Similarity: similarity between target word and key word within a constructed semantic space, from the LSA Project at CU Boulder
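
As an illustration of how Contextual Diversity differs from raw word frequency, the following minimal Python sketch counts the number of distinct texts in which each word appears. The function name, the tokenizer, and the three-sentence toy corpus are assumptions made for this example only; they are not the procedures or data used by Adelman et al. (2006) or SUBTLEX-UK.

```python
import re
from collections import Counter


def contextual_diversity(documents):
    """Count, for each word, the number of distinct texts it occurs in.

    Hypothetical helper for illustration: a word is counted once per
    text, no matter how often it is repeated inside that text.
    """
    counts = Counter()
    for text in documents:
        # Simplified tokenizer (an assumption): lowercase alphabetic runs.
        tokens = set(re.findall(r"[a-z']+", text.lower()))
        counts.update(tokens)
    return counts


# Toy corpus, invented for this example.
TOY_DOCUMENTS = [
    "The cat sat on the mat.",
    "A cat and a dog.",
    "The dog barked at the dog next door.",
]

diversity = contextual_diversity(TOY_DOCUMENTS)
print(diversity["dog"])  # number of texts containing "dog"
print(diversity["mat"])  # number of texts containing "mat"
```

In this toy corpus "dog" occurs three times in total but in only two texts, so its contextual diversity is 2 while its raw frequency is 3.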

Orthographic Features

Word Length: number of letters in the word
Mean Bigram Frequency: the average bigram (two letter string) frequency of all bigrams in a word, from the English Lexicon Project (Balota et al., 2007)
Number of Orthographic Neighbors: Coltheart’s N, the number of words that can be made by substituting one letter in the word, from the English Lexicon Project (Balota et al., 2007)
Levenshtein Distance (LD): the minimum number of operations (substitution, insertion, deletion) needed to turn one letter string into another. The distance between target and key was calculated using the R package vwr (Keuleers, 2013).
Orthographic Levenshtein Distance 20 (OLD20): the mean LD from a word to its 20 closest orthographic neighbors (Yarkoni et al., 2013), calculated using the R package vwr (Keuleers, 2013)
Decodability: a measure of the ease with which the word can be decoded, from Saha et al. (under review)
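
The orthographic distance measures defined above can be sketched in a few lines of Python. This is an illustrative sketch only: the feature coding itself used the English Lexicon Project and the R package vwr, and the function names and the toy lexicon below are assumptions introduced for this example. With a realistic lexicon, `n=20` yields OLD20; a smaller `n` is used here because the toy lexicon has only a handful of words.

```python
def levenshtein(a, b):
    """Minimum number of substitutions, insertions, and deletions
    needed to turn sequence a into sequence b (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def coltheart_n(word, lexicon):
    """Number of lexicon words formed by substituting exactly one letter."""
    return sum(
        1
        for other in lexicon
        if len(other) == len(word)
        and sum(x != y for x, y in zip(word, other)) == 1
    )


def mean_closest_distance(word, lexicon, n=20):
    """Mean LD from word to its n closest lexicon entries (OLD20 when n=20)."""
    distances = sorted(levenshtein(word, other)
                       for other in lexicon if other != word)
    closest = distances[:n]
    if not closest:
        raise ValueError("lexicon contains no other words")
    return sum(closest) / len(closest)


# Toy lexicon, invented for this example.
TOY_LEXICON = ["cat", "cot", "coat", "cut", "bat", "dog", "cart"]

print(levenshtein("kitten", "sitting"))
print(coltheart_n("cat", TOY_LEXICON))
print(mean_closest_distance("cat", TOY_LEXICON, n=3))
```

Note that "coat" and "cart" are one edit away from "cat" by insertion, so they contribute to the mean Levenshtein distance but not to Coltheart’s N, which counts substitutions only.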

Phonological Features

Number of Phonemes present in the word, from the English Lexicon Project (Balota et al., 2007)
Number of Syllables present in the word, from the English Lexicon Project (Balota et al., 2007)
Number of Phonological Neighbors: number of words that can be formed by substituting one phoneme in the word, from the English Lexicon Project (Balota et al., 2007)
Number of Phonographic Neighbors: number of words that are both orthographic and phonological neighbors of the word, from the English Lexicon Project (Balota et al., 2007)
Phonological Levenshtein Distance 20 (PLD20): the mean LD from a word to its 20 closest phonological neighbors, from the English Lexicon Project (Balota et al., 2007)
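
The relationship between orthographic, phonological, and phonographic neighbors can be illustrated with a small Python sketch. The pronunciation table below is a toy example with ARPAbet-style phoneme labels chosen for illustration; it is an assumption of this sketch and is not drawn from the English Lexicon Project, and all function names are hypothetical.

```python
# Toy pronunciation table (an assumption for this example only).
TOY_PRONUNCIATIONS = {
    "cat": ("K", "AE", "T"),
    "cot": ("K", "AA", "T"),
    "kit": ("K", "IH", "T"),
    "bat": ("B", "AE", "T"),
    "cut": ("K", "AH", "T"),
    "gnat": ("N", "AE", "T"),
    "dog": ("D", "AO", "G"),
}


def one_substitution_apart(a, b):
    """True if two equal-length sequences differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def orthographic_neighbors(word, pronunciations):
    """Words formed by substituting one letter of the word."""
    return {w for w in pronunciations if one_substitution_apart(word, w)}


def phonological_neighbors(word, pronunciations):
    """Words formed by substituting one phoneme of the word."""
    target = pronunciations[word]
    return {w for w, phones in pronunciations.items()
            if one_substitution_apart(target, phones)}


def phonographic_neighbors(word, pronunciations):
    """Words that are both orthographic and phonological neighbors."""
    return (orthographic_neighbors(word, pronunciations)
            & phonological_neighbors(word, pronunciations))


print(sorted(phonological_neighbors("cat", TOY_PRONUNCIATIONS)))
print(sorted(phonographic_neighbors("cat", TOY_PRONUNCIATIONS)))
```

In this toy table, "kit" and "gnat" are phonological neighbors of "cat" without being orthographic neighbors, so they are excluded from the phonographic count. PLD20 follows the same logic as OLD20, with the Levenshtein distance computed over phoneme sequences instead of letter strings.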