Using Lexical Resources Effectively

Frequency Distributions, Wordlists, WordNet, Semantic Similarity

Jake Batsuuri
15 min read · Jul 19, 2021

Work in Natural Language Processing typically uses large bodies of linguistic data. In this article, we explore some lexical resources that help us ingest and analyze corpora. These resources are part of Python or the NLTK library.

Getting NLTK Corpora

We can access the corpora bundled with NLTK in one of two ways:

import nltk
nltk.download('gutenberg')  # fetch the corpus data the first time

emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))

or like this:

from nltk.corpus import gutenberg
gutenberg.fileids()
emma = gutenberg.words('austen-emma.txt')

Language Statistics

We can write a quick little script to display some standard language statistics for each text: average word length, average sentence length, and lexical diversity (how often each vocabulary item is reused).
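Here is a minimal sketch of such a script, adapted from the NLTK book; using int() to truncate matches the numbers displayed below:

import nltk
from nltk.corpus import gutenberg

nltk.download('gutenberg')  # one-time download

for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))   # includes spaces and punctuation
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    # average word length, average sentence length, words per vocabulary item
    print(int(num_chars / num_words), int(num_words / num_sents), int(num_words / num_vocab), fileid)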

It turns out that average word length is fairly uniform across these English texts: it is 3 throughout (displayed as 4 because the character count includes spaces). The columns below are average word length, average sentence length, lexical diversity score, and file name:

4 24 26 austen-emma.txt 
4 26 16 austen-persuasion.txt
4 28 22 austen-sense.txt

4 33 79 bible-kjv.txt
4 19 5 blake-poems.txt
4 19 14 bryant-stories.txt
4 17 12 burgess-busterbrown.txt
4 20 12 carroll-alice.txt
4 20 11 chesterton-ball.txt
4 22 11 chesterton-brown.txt
4 18 10 chesterton-thursday.txt
4 20 24 edgeworth-parents.txt
4 25 15 melville-moby_dick.txt
4 52 10 milton-paradise.txt
4 11 8 shakespeare-caesar.txt
4 12 7 shakespeare-hamlet.txt
4 12 6 shakespeare-macbeth.txt

4 36 12 whitman-leaves.txt

Sentence length and lexical diversity, on the other hand, seem to vary across authors.

If the script above measures central tendency, we can look at the extremes with something like a longest-sentence function:
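Here is a minimal sketch of such a function (my own helper, not a built-in); it returns the longest sentence(s) of a Gutenberg text by token count:

from nltk.corpus import gutenberg

def longest_sentence(fileid):
    sentences = gutenberg.sents(fileid)
    longest_len = max(len(s) for s in sentences)             # length of the longest sentence
    return [s for s in sentences if len(s) == longest_len]   # all sentences of that length

longest_sentence('shakespeare-macbeth.txt')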

NLTK Corpora

NLTK has lots of different types of texts:

  • Gutenberg: classic books & novels
  • Webtext: discussion forums, overheard conversations, movie scripts, personal ads, wine reviews
  • Chat: web chats from chatrooms
  • Brown: 500 sources of different genres of text including news, editorial, romance, humour & science fiction…
  • Reuters: 10 thousand or so news documents
  • Inaugural: Presidential addresses over the years

This is all great for learning, but what you probably want to do, and what I want to do here, is process our own texts.

Loading Your Own Corpus

To accomplish this we use the PlaintextCorpusReader class: you give it the directory containing your text files, plus one of the following (see the sketch after this list):

  • fileids: an explicit list such as ['a.txt', 'dir/b.txt']
  • a regex pattern such as '.*' that matches the fileids you want
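Here is a minimal sketch of how the corpus used below was loaded; the directory /content/nltk and the file trump.txt are the ones referenced later in this post:

from nltk.corpus import PlaintextCorpusReader

corpus_root = '/content/nltk'                                   # directory holding your .txt files
trump_corpus = PlaintextCorpusReader(corpus_root, r'.*\.txt')   # or pass an explicit list of fileids
trump_corpus.fileids()                                          # e.g. ['trump.txt']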

Trump Speech Analysis

trump_corpus.words('trump.txt')

I copied and pasted the speech text into a plain text file and added it to /content/nltk.

['“', 'My', 'fellow', 'Americans', ':', 'Tonight', ',', ...]

sentences = trump_corpus.sents('trump.txt')
sentences[11]

[‘Over’, ‘the’, ‘years’, ‘,’, ‘thousands’, ‘of’, ‘Americans’, ‘have’, ‘been’, ‘brutally’, ‘killed’,’by’, ‘those’, ‘who’, ‘illegally’, ‘entered’, ‘our’, ‘country’, ‘,’, ‘and’, ‘thousands’, ‘more’, ‘lives’, ‘will’, ‘be’, ‘lost’, ‘if’, ‘we’, ‘don’, “‘“, ‘t’, ‘act’, ‘right’, ‘now’, ‘.’]

len(trump_corpus.words('trump.txt'))

1301

His speech had 1301 words in total.

set(trump_corpus.words('trump.txt'))

{‘“‘, ‘$’, “‘“, ‘,’, ‘-’, ‘ — ‘, ‘.’, ‘.”’, ‘000’, ‘100’, ‘13’, ‘16’, ‘20’, ‘266’, ‘30’, ‘300’, ‘4’, ‘45’, ‘5’, ‘500’, ‘7’, ‘90’, ‘:’, ‘?’, ‘African’, ‘Air’, ‘America’, ‘American’, ‘Americans’, ‘Among’, ‘And’, ‘At’, ‘Border’, ‘But’, ‘California’, ‘Call’, ‘Christmas’, ‘Chuck’, ‘Congress’, ‘Congressional’, ‘Customs’, ‘Day’, ‘Democrats’, ‘Department’, ‘Every’, ‘Finally’, ‘Force’, ‘Furthermore’, ‘Georgia’, ‘God’, ‘Hispanic’, ‘Homeland’, ‘Hopefully’, ‘House’, ‘How’, ‘I’, ‘ICE’, ‘Imagine’, ‘In’, ‘It’, ‘Last’, ‘MS’, ‘Maryland’, ‘Mexico’, ‘More’, ‘My’, ‘Oath’, ‘Office’, ‘One’, ‘Our’, ‘Over’, ‘Pass’, ‘Patrol’, ‘President’, ‘Schumer’, ‘Security’, ‘Senator’, ‘So’, ‘Some’, ‘States’, ‘Thank’, ‘The’, ‘Then’, ‘These’, ‘They’, ‘This’, ‘To’, ‘Tonight’, ‘United’, ‘Vietnam’, ‘War’, ‘We’, ‘When’, ‘White’, ‘Women’, ‘a’, ‘about’, ‘above’, ‘absolutely’, ‘acknowledge’, ‘across’, ‘act’, ‘administration’, ‘after’, ‘agents’, ‘alien’, ‘aliens’, ‘all’, ‘allow’, ‘alone’, ‘along’, ‘also’, ‘always’, ‘am’, ‘an’, ‘and’, ‘approach’, ‘are’, ‘around’, ‘arrested’, ‘arrests’, ‘arrived’, ‘as’, ‘ask’, ‘asked’, ‘assaulted’, ‘assaults’, ‘assistance’, ‘at’, ‘back’, ‘barrier’, ‘be’, ‘beaten’, ‘beating’, ‘because’, ‘bed’, ‘been’, ‘before’, ‘beheading’, ‘between’, ‘biggest’, ‘bill’, ‘billion’, ‘blood’, ‘border’, ‘borders’, ‘brave’, ‘broke’, ‘broken’, ‘brought’, ‘brutally’, ‘build’, ‘but’, ‘by’, ‘came’, ‘can’, ‘changed’, ‘charged’, ‘child’, ‘children’, ‘choice’, ‘citizen’, ‘citizens’, ‘close’, ‘cocaine’, ‘cold’, ‘common’, ‘compromise’, ‘concrete’, ‘contains’, ‘continue’, ‘contraband’, ‘contribute’, ‘convicted’, ‘cost’, ‘could’, ‘country’, ‘coyotes’, ‘crimes’, ‘criminal’, ‘crisis’, ‘critical’, ‘cruelly’, ‘cut’, ‘cutting’, ‘cycle’, ‘dangerous’, ‘day’, ‘deal’, ‘death’, ‘decades’, ‘defends’, ‘desperately’, ‘detailed’, ‘detecting’, ‘determined’, ‘developed’, ‘die’, ‘dismembering’, ‘do’, ‘does’, ‘doing’, ‘don’, ‘done’, ‘down’, ‘dozens’, ‘dramatic’, ‘drives’, ‘drug’, ‘drugs’, ‘duty’, ‘economy’, ‘edge’, ‘elected’, ‘embraced’, ‘encounter’, ‘end’, ‘ends’, ‘enforcement’, ‘enrich’, ‘enter’, ‘entered’, ‘entire’, ‘ever’, ‘every’, ‘everything’, ‘exceeds’, ‘eyes’, ‘fact’, ‘families’, ‘far’, ‘fathers’, ‘federal’, ‘fellow’, ‘fences’, ‘fentanyl’, ‘finally’, ‘floods’, ‘for’, ‘forget’, ‘from’, ‘fueled’, ‘fulfill’, ‘fund’, ‘gang’, ‘gangs’, ‘gates’, ‘get’, ‘girl’, ‘goodnight’, ‘government’, ‘great’, ‘grief’, ‘gripping’, ‘growing’, ‘had’, ‘hammer’, ‘hands’, ‘hardest’, ‘has’, ‘hate’, ‘have’, ‘hearing’, ‘heart’, ‘held’, ‘help’, ‘hero’, ‘heroin’, ‘his’, ‘history’, ‘hit’, ‘hold’, ‘home’, ‘homes’, ‘horribly’, ‘human’, ‘humanely’, ‘humanitarian’, ‘hurt’, ‘husband’, ‘if’, ‘illegal’, ‘illegally’, ‘immigrant’, ‘immigrants’, ‘immigration’, ‘immoral’, ‘impacted’, ‘in’, ‘includes’, ‘including’, ‘increase’, ‘indirectly’, ‘injustice’, ‘innocent’, ‘inside’, ‘into’, ‘invited’, ‘is’, ‘it’, ‘its’, ‘itself’, ‘job’, ‘jobs’, ‘judges’, ‘just’, ‘justice’, ‘keep’, ‘killed’, ‘killing’, ‘killings’, ‘last’, ‘later’, ‘law’, ‘lawful’, ‘leadership’, ‘life’, ‘lives’, ‘long’, ‘loopholes’, ‘lost’, ‘love’, ‘loved’, ‘made’, ‘many’, ‘me’, ‘medical’, ‘meeting’, ‘member’, ‘members’, ‘met’, ‘meth’, ‘migrant’, ‘migration’, ‘millions’, ‘mind’, ‘minors’, ‘minute’, ‘mission’, ‘month’, ‘more’, ‘mothers’, ‘much’, ‘murder’, ‘murdered’, ‘must’, ‘name’, ‘nation’, ‘national’, ‘need’, ‘neighbor’, ‘never’, ‘new’, ‘no’, ‘not’, ‘nothing’, ‘now’, ‘of’, ‘officer’, ‘officers’, ‘old’, ‘on’, ‘one’, ‘ones’, ‘only’, ‘opens’, ‘or’, ‘order’, ‘other’, ‘our’, ‘out’, ‘outside’, ‘overall’, ‘paid’, ‘pain’, ‘part’, ‘partisan’, 
‘pass’, ‘past’, ‘pawns’, ‘pay’, ‘people’, ‘percent’, ‘perform’, ‘physical’, ‘pipeline’, ‘plan’, ‘police’, ‘politicians’, ‘politics’, ‘power’, ‘precious’, ‘presented’, ‘problem’, ‘process’, ‘professionals’, ‘promptly’, ‘properly’, ‘proposal’, ‘protect’, ‘proudly’, ‘provide’, ‘public’, ‘quantities’, ‘quickly’, ‘raped’, ‘rather’, ‘re’, ‘reality’, ‘reason’, ‘recently’, ‘records’, ‘refuse’, ‘refused’, ‘remains’, ‘repeatedly’, ‘request’, ‘requested’, ‘resources’, ‘return’, ‘returned’, ‘right’, ‘rise’, ‘ruthless’, ‘s’, ‘sacred’, ‘sad’, ‘sadness’, ‘safe’, ‘safely’, ‘safer’, ‘savagely’, ‘secure’, ‘security’, ‘sense’, ‘serve’, ‘several’, ‘sex’, ‘sexually’, ‘sharp’, ‘shattered’, ‘shed’, ‘short’, ‘shut’, ‘situation’, ‘smugglers’, ‘so’, ‘society’, ‘solution’, ‘solved’, ‘someone’, ‘soul’, ‘souls’, ‘southern’, ‘space’, ‘speaking’, ‘spending’, ‘stabbing’, ‘steel’, ‘stolen’, ‘stop’, ‘strains’, ‘stricken’, ‘strong’, ‘suffering’, ‘suggested’, ‘support’, ‘supported’, ‘swore’, ‘system’, ‘t’, ‘technology’, ‘tell’, ‘terrible’, ‘than’, ‘that’, ‘the’, ‘their’, ‘them’, ‘there’, ‘these’, ‘they’, ‘thing’, ‘things’, ‘this’, ‘those’, ‘thousands’, ‘three’, ‘through’, ‘to’, ‘tomorrow’, ‘tonight’, ‘took’, ‘tools’, ‘totally’, ‘trade’, ‘traffickers’, ‘tragic’, ‘trek’, ‘tremble’, ‘tremendous’, ‘trying’, ‘two’, ‘unaccompanied’, ‘uncontrolled’, ‘unlawful’, ‘up’, ‘urgent’, ‘used’, ‘vast’, ‘vastly’, ‘ve’, ‘very’, ‘veteran’, ‘vicious’, ‘viciously’, ‘victimized’, ‘victims’, ‘violated’, ‘violent’, ‘voices’, ‘wages’, ‘wall’, ‘walls’, ‘want’, ‘was’, ‘way’, ‘we’, ‘wealthy’, ‘weapons’, ‘week’, ‘weeping’, ‘welcomes’, ‘were’, ‘what’, ‘when’, ‘whether’, ‘which’, ‘who’, ‘whose’, ‘why’, ‘wife’, ‘will’, ‘with’, ‘women’, ‘would’, ‘wrong’, ‘year’, ‘years’, ‘you’, ‘young’, ‘your’}

len(set(trump_corpus.words('trump.txt')))

552

His speech had 552 unique words.

from nltk import FreqDist

fdist1 = FreqDist(trump_corpus.words('trump.txt'))
fdist1.plot(25, cumulative=True)
fdist1.hapaxes()  # words that occur only once

[‘“‘, ‘fellow’, ‘Tonight’, ‘speaking’, ‘there’, ‘growing’, ‘Customs’, ‘Border’, ‘Patrol’, ‘encounter’, ‘trying’, ‘enter’, ‘out’, ‘hold’, ‘way’, ‘promptly’, ‘return’, ‘proudly’, ‘welcomes’, ‘millions’, ‘lawful’, ‘enrich’, ‘society’, ‘contribute’, ‘hurt’, ‘uncontrolled’, ‘strains’, ‘public’, ‘drives’, ‘jobs’, ‘wages’, ‘Among’, ‘hardest’, ‘hit’, ‘African’, ‘Hispanic’, ‘pipeline’, ‘vast’, ‘quantities’, ‘meth’, ‘cocaine’, ‘fentanyl’, ‘week’, ‘300’, ‘alone’, ‘90’, ‘percent’, ‘which’, ‘floods’, ‘More’, ‘die’, ‘entire’, ‘Vietnam’, ‘War’, ‘two’, ‘ICE’, ‘officers’, ‘266’, ‘arrests’, ‘aliens’, ‘records’, ‘convicted’, ‘100’, ‘assaults’, ‘30’, ‘sex’, ‘crimes’, ‘4’, ‘violent’, ‘killings’, ‘been’, ‘brutally’, ‘entered’, ‘lost’, ‘act’, ‘now’, ‘soul’, ‘Last’, ‘month’, ‘20’, ‘migrant’, ‘brought’, ‘into’, ‘dramatic’, ‘increase’, ‘used’, ‘pawns’, ‘vicious’, ‘coyotes’, ‘ruthless’, ‘One’, ‘three’, ‘women’, ‘sexually’, ‘assaulted’, ‘dangerous’, ‘trek’, ‘up’, ‘through’, ‘Women’, ‘biggest’, ‘victims’, ‘far’, ‘system’, ‘tragic’, ‘reality’, ‘cycle’, ‘suffering’, ‘determined’, ‘end’, ‘presented’, ‘detailed’, ‘stop’, ‘drug’, ‘smugglers’, ‘traffickers’, ‘tremendous’, ‘problem’, ‘developed’, ‘Department’, ‘properly’, ‘perform’, ‘mission’, ‘keep’, ‘safe’, ‘fact’, ‘safer’, ‘ever’, ‘includes’, ‘cutting’, ‘edge’, ‘technology’, ‘detecting’, ‘weapons’, ‘contraband’, ‘things’, ‘judges’, ‘bed’, ‘process’, ‘sharp’, ‘unlawful’, ‘fueled’, ‘strong’, ‘economy’, ‘plan’, ‘contains’, ‘urgent’, ‘assistance’, ‘medical’, ‘Furthermore’, ‘asked’, ‘close’, ‘loopholes’, ‘immigrant’, ‘safely’, ‘humanely’, ‘returned’, ‘Finally’, ‘part’, ‘overall’, ‘approach’, ‘At’, ‘steel’, ‘rather’, ‘concrete’, ‘absolutely’, ‘critical’, ‘want’, ‘common’, ‘sense’, ‘quickly’, ‘pay’, ‘itself’, ‘cost’, ‘exceeds’, ‘500’, ‘vastly’, ‘paid’, ‘indirectly’, ‘great’, ‘new’, ‘trade’, ‘deal’, ‘Senator’, ‘Chuck’, ‘Schumer’, ‘hearing’, ‘later’, ‘tonight’, ‘repeatedly’, ‘supported’, ‘past’, ‘along’, ‘changed’, ‘mind’, ‘elected’, ‘President’, ‘acknowledge’, ‘provide’, ‘brave’, ‘tools’, ‘desperately’, ‘federal’, ‘remains’, ‘shut’, ‘not’, ‘fund’, ‘doing’, ‘everything’, ‘power’, ‘impacted’, ‘solution’, ‘pass’, ‘spending’, ‘defends’, ‘re’, ‘opens’, ‘could’, ‘solved’, ‘45’, ‘minute’, ‘meeting’, ‘invited’, ‘Congressional’, ‘leadership’, ‘White’, ‘House’, ‘tomorrow’, ‘get’, ‘done’, ‘Hopefully’, ‘above’, ‘partisan’, ‘politics’, ‘order’, ‘national’, ‘Some’, ‘suggested’, ‘Then’, ‘why’, ‘wealthy’, ‘fences’, ‘gates’, ‘around’, ‘homes’, ‘hate’, ‘outside’, ‘but’, ‘love’, ‘inside’, ‘thing’, ‘nothing’, ‘continue’, ‘allow’, ‘innocent’, ‘horribly’, ‘victimized’, ‘broke’, ‘Christmas’, ‘when’, ‘young’, ‘police’, ‘officer’, ‘savagely’, ‘cold’, ‘came’, ‘hero’, ‘someone’, ‘had’, ‘Day’, ‘precious’, ‘cut’, ‘short’, ‘violated’, ‘Air’, ‘Force’, ‘veteran’, ‘raped’, ‘beaten’, ‘death’, ‘hammer’, ‘long’, ‘history’, ‘Georgia’, ‘recently’, ‘murder’, ‘killing’, ‘beheading’, ‘dismembering’, ‘his’, ‘neighbor’, ‘Maryland’, ‘MS’, ‘13’, ‘gang’, ‘members’, ‘arrived’, ‘unaccompanied’, ‘minors’, ‘arrested’, ‘viciously’, ‘stabbing’, ‘beating’, ‘16’, ‘old’, ‘girl’, ‘several’, ‘met’, ‘dozens’, ‘loved’, ‘ones’, ‘held’, ‘hands’, ‘weeping’, ‘mothers’, ‘embraced’, ‘grief’, ‘stricken’, ‘fathers’, ‘sad’, ‘terrible’, ‘never’, ‘forget’, ‘pain’, ‘eyes’, ‘tremble’, ‘voices’, ‘sadness’, ‘gripping’, ‘souls’, ‘How’, ‘much’, ‘must’, ‘shed’, ‘does’, ‘its’, ‘job’, ‘refuse’, ‘compromise’, ‘name’, ‘ask’, ‘Imagine’, ‘child’, ‘husband’, ‘wife’, ‘cruelly’, ‘shattered’, ‘totally’, ‘member’, ‘Pass’, ‘ends’, ‘citizen’, ‘Call’, ‘tell’, 
‘finally’, ‘these’, ‘decades’, ‘choice’, ‘between’, ‘wrong’, ‘justice’, ‘injustice’, ‘about’, ‘whether’, ‘fulfill’, ‘sacred’, ‘duty’, ‘serve’, ‘When’, ‘took’, ‘Oath’, ‘Office’, ‘swore’, ‘always’, ‘me’, ‘God’, ‘Thank’, ‘goodnight’, ‘.”’]

V = set(trump_corpus.words('trump.txt')) 
long_words = [w for w in V if len(w) > 5]
sorted(long_words)
fdist1 = FreqDist(long_words)
fdist1.plot(25, cumulative=True)
plot_bigrams = nltk.bigrams(trump_corpus.words('trump.txt'))
list(plot_bigrams)

[(‘“‘, ‘My’), (‘My’, ‘fellow’), (‘fellow’, ‘Americans’), (‘Americans’, ‘:’), (‘:’, ‘Tonight’), (‘Tonight’, ‘,’), (‘,’, ‘I’), (‘I’, ‘am’), (‘am’, ‘speaking’), (‘speaking’, ‘to’), (‘to’, ‘you’), (‘you’, ‘because’), (‘because’, ‘there’), (‘there’, ‘is’)…

nltk.Text(trump_corpus.words('trump.txt')).collocations()  # bigrams that occur together more often than chance would predict

List Comprehensions in Python

Before we move on to the next topic, let’s review list comprehensions. List comprehensions look like this:

fruits = ["apple", "banana", "cherry", "kiwi", "mango"]

newlist = [x for x in fruits if "a" in x]

or like this:

from nltk.corpus import brown

genre_word = [(genre, word)
              for genre in ['news', 'romance']
              for word in brown.words(categories=genre)]

List comprehensions help us construct lists from other lists. Instead of doing this:

fruits = ["apple", "banana", "cherry", "kiwi", "mango"]
newlist = []

for x in fruits:
    if "a" in x:
        newlist.append(x)

In some ways, it's just syntactic sugar. But it is easier on the eyes, and it has two other nice little features (see the example after this list):

  • You can pass them directly into a function call as an argument
  • You can omit the square brackets [] when you do, which turns them into generator expressions
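For example, a comprehension can be passed straight into a function call, and when it is the only argument the brackets can be dropped, turning it into a generator expression:

fruits = ["apple", "banana", "cherry", "kiwi", "mango"]

# as a parameter, with brackets (a list comprehension)
sorted([len(x) for x in fruits])

# as a parameter, brackets omitted (a generator expression)
max(len(x) for x in fruits)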

Conditional Frequency

Suppose we have a large corpus of text and we want to count the number of words that have 7 or more characters. That is an example of a conditional frequency distribution.

The condition is the requirement of having 7 or more characters, and the events are all the occurrences in the text that meet this condition.
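Here is a minimal sketch of that word-length example (my own illustration; it assumes the Brown Corpus has been downloaded):

import nltk
from nltk.corpus import brown

# condition: does the word have 7 or more characters? event: each word occurrence
cfd = nltk.ConditionalFreqDist(
    ('long' if len(word) >= 7 else 'short', word.lower())
    for word in brown.words())

cfd['long'].N()  # total number of word tokens with 7 or more characters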

import nltk
from nltk.corpus import brown

cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))

genre_word = [(genre, word)
              for genre in ['news', 'romance']
              for word in brown.words(categories=genre)]
len(genre_word)

cfd = nltk.ConditionalFreqDist(genre_word)

Once we have declared the conditional frequency distribution, if we call a specific condition, we get the frequency distribution of that condition:

cfd['news']

FreqDist({'The': 806,
          'Fulton': 14,
          'County': 35,
          'Grand': 6,
          'Jury': 2,
          'said': 402,
          ...
          'deaf': 5,
          'designed': 16,
          'special': 41,
          'schooling': 2,
          ...})

Tabulate

We can graph conditional frequency distributions (a plotting sketch follows the table below). We can also tabulate the results for multiple conditions to make a comparison.

from nltk.corpus import udhr

languages = ['Chickasaw', 'English', 'German_Deutsch',
             'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']

cfd = nltk.ConditionalFreqDist(
    (lang, len(word))
    for lang in languages
    for word in udhr.words(lang + '-Latin1'))

cfd.tabulate(conditions=['English', 'German_Deutsch'], samples=range(10), cumulative=True)

This gives us the cumulative frequency of word lengths for English and German from the corpus.

                  0    1    2    3    4    5    6    7    8    9
       English    0  185  525  883  997 1166 1283 1440 1558 1638
German_Deutsch    0  171  263  614  717  894 1013 1110 1213 1275
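The same distribution can be plotted rather than tabulated; a minimal sketch (NLTK's plotting relies on matplotlib being installed):

# cumulative word-length curves for the same two languages
cfd.plot(conditions=['English', 'German_Deutsch'], samples=range(10), cumulative=True)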

Generating Random Text with Bigrams

Remember that bigrams are just pairs of adjacent words. For example:

import nltk

sent = ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven', 'and', 'the', 'earth', '.']
bis = nltk.bigrams(sent)
list(bis)

The bigrams here would be:

[('In', 'the'),  
('the', 'beginning'),
('beginning', 'God'),
('God', 'created'),
('created', 'the'),
('the', 'heaven'),
('heaven', 'and'),
('and', 'the'),
('the', 'earth'),
('earth', '.')]

If we feed a large text into the bigram function, we can build a conditional frequency distribution over its bigrams. From this we can estimate, for each word, which word is most likely to follow it.

def generate_model(cfdist, word, num=15):
    for i in range(num):
        print(word)
        word = cfdist[word].max()

text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)

To use this simple function:

generate_model(cfd, 'living')

The function only needs a conditional frequency distribution of bigrams and a seed word.

living
creature
that
he
said
,
and
the
land
of
the
land
of
the
land

Reusing Code

Functions and Methods

def plural(word):
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    else:
        return word + 's'
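A quick sanity check of what the function returns (my own examples):

plural('fairy')   # 'fairies'
plural('wish')    # 'wishes'
plural('woman')   # 'women'
plural('book')    # 'books'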

Classes and Modules

Over time you will find that you create a variety of useful little text-processing functions, and you end up copying them from old programs to new ones. Which file contains the latest version of the function you want to use? It makes life a lot easier if you can collect your work into a single place, and access previously defined functions without making copies.

To do this, save your function(s) in a file called (say) textproc.py. Now, you can access your work simply by importing it from the file:

from textproc import plural
plural('wish')

Packages and Libraries

A collection of variable and function definitions in a file is called a Python module. A collection of related modules is called a package. NLTK’s code for processing the Brown Corpus is an example of a module, and its collection of code for processing all the different corpora is an example of a package. NLTK itself is a set of packages, sometimes called a library.

If you are creating a file to contain some of your Python code, do not name your file nltk.py: it may get imported in place of the “real” NLTK package. When it imports modules, Python first looks in the current directory (folder).

Lexical Resources

A lexicon or a lexical resource is a collection of words or phrases along with associated information, such as part of speech and sense definitions. These are usually secondary to the corpus.

While a corpus may be defined simply as a text (a sequence of tokens):

my_text = '...'

The lexical resource will be defined as:

vocab = sorted(set(my_text))

Whereas the frequency distribution is defined as:

word_freq = FreqDist(my_text)

Both vocab and word_freq are simple lexical resources. A concordance can also be considered a lexical resource.

Lexical Entry

A lexical entry consists of a headword (or lemma), along with additional information such as:

  • part of speech
  • sense definition or gloss

saw₁ ~ [verb], past tense of see

saw₂~ [noun], cutting instrument

These two entries are homonyms: distinct words that happen to share the same spelling.

Homonym

homonym ~ [noun], lexical entries for 2 headwords, having the same spelling, and different combinations of part of speech and gloss information

One way to remember this is to think of it as kind of a hash collision.

Types of Lexical Resources:

  • Wordlists
  • Stopwords
  • Pronunciation dictionaries: for each word, a list of phonetic codes, used by speech synthesizers (see the sketch after this list)
  • Comparative wordlists: lists of about 200 common words in several languages (the Swadesh wordlists)
  • Toolboxes
  • Relational dictionaries
  • Dictionaries
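Two of these are available directly in NLTK: the CMU Pronouncing Dictionary and the Swadesh comparative wordlists. A minimal sketch, assuming the cmudict and swadesh data have been downloaded:

from nltk.corpus import cmudict, swadesh

# pronouncing dictionary: phonetic codes for each word
prondict = cmudict.dict()
prondict['fire']          # [['F', 'AY1', 'ER0'], ['F', 'AY1', 'R']]

# comparative wordlist: the same Swadesh items across languages
swadesh.words('en')[:5]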

Word Lists

NLTK has word lists, which are just lists of words; they can be used for spell checking or for filtering out unusual words.

import nltk

def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab.difference(english_vocab)
    return sorted(unusual)

unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))
unusual_words(nltk.corpus.nps_chat.words())

Stop Words

Stop words are a special type of word list: filler words that don't carry a whole lot of lexical information.

import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
stopwords.words('english')

The output:

['i',  'me',  'my',  'myself',  'we',  'our',  'ours',  'ourselves',  'you',  "you're",  "you've",  "you'll",  "you'd",  'your',  'yours',  'yourself',  'yourselves',  'he',  'him',  'his',  'himself',  'she',  "she's",  'her',  'hers',  'herself',  'it',  "it's",  'its',  'itself',  'they',  'them',  'their',  'theirs',  'themselves',  'what',  'which',  'who',  'whom',  'this',  'that',  "that'll",  'these',  'those',  'am',  'is',  'are',  'was',  'were',  'be',  'been',  'being',  'have',  'has',  'had',  'having',  'do',  'does',  'did',  'doing',  'a',  'an',  'the',  'and',  'but',  'if',  'or',  'because',  'as',  'until',  'while',  'of',  'at',  'by',  'for',  'with',  'about',  'against',  'between',  'into',  'through',  'during',  'before',  'after',  'above',  'below',  'to',  'from',  'up',  'down',  'in',  'out',  'on',  'off',  'over',  'under',  'again',  'further',  'then',  'once',  'here',  'there',  'when',  'where',  'why',  'how',  'all',  'any',  'both',  'each',  'few',  'more',  'most',  'other',  'some',  'such',  'no',  'nor',  'not',  'only',  'own',  'same',  'so',  'than',  'too',  'very',  's',  't',  'can',  'will',  'just',  'don',  "don't",  'should',  "should've",  'now',  'd',  'll',  'm',  'o',  're',  've',  'y',  'ain',  'aren',  "aren't",  'couldn',  "couldn't",  'didn',  "didn't",  'doesn',  "doesn't",  'hadn',  "hadn't",  'hasn',  "hasn't",  'haven',  "haven't",  'isn',  "isn't",  'ma',  'mightn',  "mightn't",  'mustn',  "mustn't",  'needn',  "needn't",  'shan',  "shan't",  'shouldn',  "shouldn't",  'wasn',  "wasn't",  'weren',  "weren't",  'won',  "won't",  'wouldn',  "wouldn't"]

Complement Sets

The complement of the stop words would be the words that do carry meaning. To find the fraction of those:

def content_fraction(text):
    stopwords = nltk.corpus.stopwords.words('english')
    content = [w for w in text if w.lower() not in stopwords]
    return len(content) / len(text)

content_fraction(nltk.corpus.brown.words())

Which gives 0.5909126139346464: roughly 59% of the tokens in the Brown Corpus are content words.

Intersect Sets

Word lists can also be intersected. For example, NLTK's names corpus contains separate male and female name lists, and intersecting them finds names used for both:

import nltk

names = nltk.corpus.names
names.fileids()
male_names = names.words('male.txt')
female_names = names.words('female.txt')
[w for w in male_names if w in female_names]
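Since both are plain Python lists, the membership test above rescans the whole female list for every male name; the same intersection can be computed more directly with sets (a minor variation of my own):

from nltk.corpus import names

# names that appear in both the male and female wordlists
ambiguous = sorted(set(names.words('male.txt')) & set(names.words('female.txt')))
ambiguous[:10]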

Lexical Relations

My area of interest within linguistics is the subfield of semantics. We can study semantics at the level of words, phrases, sentences, or even whole texts.

Lexical relations, then, concern meaning at the word level.

Types of Lexical Relations

  • Homonyms: words that have the same spelling but mean completely different things (spelling collisions)
  • Synonyms: different words that mean the same thing (meaning collisions)
  • Antonyms: different words with opposite meanings, e.g. hot & cold (if we were to vectorize words according to their meaning, would these be orthogonal vectors, or vectors pointing in opposite directions?)
  • Hypernyms: the parent class of a word, e.g. color is a hypernym of red
  • Hyponyms: the child class of a word, e.g. red is a hyponym of color
  • Meronyms: a "part of" relationship between two items, e.g. a finger "is a part of" a hand, so finger is a meronym of hand
  • Holonyms: the whole that a meronym is part of, e.g. hand is a holonym of finger
  • Entailments: appropriate for describing relationships between verbs, e.g. walking entails stepping

Semantic Similarity

Semantic similarity is the idea that sperm whale and orca are semantically close, sperm whale and monkey less so, and sperm whale and black hole completely dissimilar.

We need a way to quantify this.

WordNet

WordNet is a dictionary that defines words using other words. But it's not just a cloud of words: it places each lexical item in specific relations to other words, so that the structure of those relations itself carries meaning.

For example, it links the word "hand" with its synonyms, antonyms, meronyms, holonyms, homonyms, hyponyms, hypernyms, and so on.

You can think of WordNet as a graph, where the nodes are words or sets of words, and edges are relations. There are 155k words and 117k synonym sets in WordNet.

Synonym Sets

from nltk.corpus import wordnet as wn

wn.synsets('motorcar')
//[Synset('car.n.01')]
print(wn.synset('car.n.01').lemma_names())
//['car', 'auto', 'automobile', 'machine', 'motorcar']

Definition

wn.synset('car.n.01').definition()
//a motor vehicle with four wheels; usually propelled by an internal combustion engine

To find all the definitions of an overloaded word:

for synset in wn.synsets('mint', wn.NOUN):
    print(synset.name() + ':', synset.definition())

//batch.n.02: (often followed by `of') a large number or amount or extent
//mint.n.02: any north temperate plant of the genus Mentha with aromatic leaves and small mauve flowers
//mint.n.03: any member of the mint family of plants
//mint.n.04: the leaves of a mint plant used fresh or candied
//mint.n.05: a candy that is flavored with a mint oil
//mint.n.06: a plant where money is coined by authority of the government

Example Use

wn.synset('car.n.01').examples()
//['he needs a car to get to work']

Hyponyms

motorcar = wn.synset('car.n.01')
types_of_motorcar = motorcar.hyponyms()
sorted([lemma.name() for synset in types_of_motorcar for lemma in synset.lemmas()])

['Model_T',
'S.U.V.',
'SUV',
'Stanley_Steamer',
'ambulance',
'beach_waggon',
'beach_wagon',
'bus',
'cab',
'compact',
'compact_car',
'convertible',
'coupe',
'cruiser',
'electric',
'electric_automobile',
'electric_car',
'estate_car',
'gas_guzzler',
'hack',
'hardtop',
'hatchback',
'heap',
'horseless_carriage',
'hot-rod',
'hot_rod',
'jalopy',
'jeep',
'landrover',
'limo',
'limousine',
'loaner',
'minicar',
'minivan',
'pace_car',
'patrol_car',
'phaeton',
'police_car',
'police_cruiser',
'prowl_car',
'race_car',
'racer',
'racing_car',
'roadster',
'runabout',
'saloon',
'secondhand_car',
'sedan',
'sport_car',
'sport_utility',
'sport_utility_vehicle',
'sports_car',
'squad_car',
'station_waggon',
'station_wagon',
'stock_car',
'subcompact',
'subcompact_car',
'taxi',
'taxicab',
'tourer',
'touring_car',
'two-seater',
'used-car',
'waggon',
'wagon']

Hypernyms

motorcar.hypernyms()

paths = motorcar.hypernym_paths()
len(paths)
[synset.name() for synset in paths[0]]
[synset.name() for synset in paths[1]]
motorcar.root_hypernyms()

Meronyms

from nltk.corpus import wordnet

wordnet.synset('tree.n.01').part_meronyms()
wordnet.synset('tree.n.01').substance_meronyms()

Holonyms

wordnet.synset('tree.n.01').member_holonyms()

wordnet.synset('mint.n.04').part_holonyms()
[Synset('mint.n.02')]
wordnet.synset('mint.n.04').substance_holonyms()
[Synset('mint.n.05')]

Entailments

wordnet.synset('walk.v.01').entailments()
wordnet.synset('eat.v.01').entailments()
wordnet.synset('tease.v.03').entailments()
[Synset('arouse.v.07'), Synset('disappoint.v.01')]

Antonyms

wordnet.lemma('supply.n.02.supply').antonyms()
wordnet.lemma('rush.v.01.rush').antonyms()
wordnet.lemma('horizontal.a.01.horizontal').antonyms()
wordnet.lemma('staccato.r.01.staccato').antonyms()

Semantic Similarity Finally

Lowest Common Hypernyms

Also known as the lowest common ancestor: the most specific concept that two senses share in the hypernym hierarchy.

right = wordnet.synset('right_whale.n.01')
orca = wordnet.synset('orca.n.01')
minke = wordnet.synset('minke_whale.n.01')
tortoise = wordnet.synset('tortoise.n.01')
novel = wordnet.synset('novel.n.01')
right.lowest_common_hypernyms(minke)
//[Synset('baleen_whale.n.01')]
right.lowest_common_hypernyms(orca)
//[Synset('whale.n.02')]
right.lowest_common_hypernyms(tortoise)
//[Synset('vertebrate.n.01')]
right.lowest_common_hypernyms(novel)
//[Synset('entity.n.01')]

Quantifying Similarity

wordnet.synset('baleen_whale.n.01').min_depth()
14
wordnet.synset('whale.n.02').min_depth()
13
wordnet.synset('vertebrate.n.01').min_depth()
8
wordnet.synset('entity.n.01').min_depth()
0

Of course, 8 doesn't mean all that much until you have other numbers to compare it to. min_depth() gives us an absolute depth measured from "entity", the root of the hierarchy.

Hypernym Hierarchy Similarity Quantified

right.path_similarity(minke)
//0.25
right.path_similarity(orca)
//0.16666666666666666
right.path_similarity(tortoise)
//0.076923076923076927
right.path_similarity(novel)
//0.043478260869565216

This one is much more intuitive: path_similarity assigns a score in the range 0–1 based on the shortest path connecting the two senses in the hypernym hierarchy, so higher scores mean more closely related senses.

Help for Specific Lexical Object

dir(wordnet.synset('harmony.n.02'))

General Help

help(wn)

VerbNet

NLTK also includes VerbNet, a hierarchical verb lexicon linked to WordNet. It can be accessed as nltk.corpus.verbnet.
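A minimal sketch of poking at it, assuming the verbnet data has been downloaded with nltk.download('verbnet'):

from nltk.corpus import verbnet

verbnet.classids('give')  # VerbNet classes that contain the verb 'give'
verbnet.lemmas()[:10]     # a few of the verb lemmas in the lexicon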

Other Articles

This post is part of a series of stories that explores the fundamentals of natural language processing:

1. Context of Natural Language Processing: Motivations, Disciplines, Approaches, Outlook
2. Notes on Formal Language Theory: Objects, Operations, Regular Expressions and Finite State Automata
3. Natural Language Tool Kit 3.5: Search Functions, Statistics, Pronoun Resolution
4. Using Lexical Resources Effectively: Frequency Distributions, Wordlists, WordNet & Semantic Similarity

In the next article, we will explore Chomsky's hierarchy of languages, one of the formal pillars of computational linguistics, whose results continue to shape modern research and development in NLP.


