Inputting & PreProcessing Text

Input Methods, String & Unicode, Regular Expression Use Cases

Jake Batsuuri
Computronium Blog

--

NLTK has preprocessed texts. But we can also import and process our own texts.

Importing

from __future__ import division 
import nltk, re, pprint

To Import a Book as a Txt

In Python 3, urlopen lives in the standard-library module urllib.request, so there is nothing extra to install:

import urllib.request
url = "https://www.gutenberg.org/files/2554/2554.txt"   # Project Gutenberg book 2554, Crime and Punishment
raw = urllib.request.urlopen(url).read().decode('utf-8')
type(raw)
# <class 'str'>
len(raw)
# 1176831
raw[:75]
# 'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'

Tokenization:

tokens = nltk.word_tokenize(raw)   # requires the 'punkt' tokenizer models: nltk.download('punkt')
type(tokens)
# <class 'list'>
len(tokens)
# 255809
tokens[:10]
# ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']

Textization, or just turning it into NLTK’s Text object, so that we can run methods like collocations:

text = nltk.Text(tokens)
type(text)
# <class 'nltk.text.Text'>
text[1020:1060]
# ['CHAPTER', 'I', 'On', 'an', 'exceptionally', 'hot', 'evening', 'early', 'in',
#  'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in',
#  'which', 'he', 'lodged', 'in', 'S', '.', 'Place', 'and', 'walked', 'slowly',
#  ',', 'as', 'though', 'in', 'hesitation', ',', 'towards', 'K', '.', 'bridge', '.']
text.collocations()
# Katerina Ivanovna; Pulcheria Alexandrovna; Avdotya Romanovna; Pyotr
# Petrovitch; Project Gutenberg; Marfa Petrovna; Rodion Romanovitch;
# Sofya Semyonovna; Nikodim Fomitch; did not; Hay Market; Andrey
# Semyonovitch; old woman; Literary Archive; Dmitri Prokofitch; great
# deal; United States; Praskovya Pavlovna; Porfiry Petrovitch; ear rings

Getting Just the Good Stuff

text.find("CHAPTER I")
# 1007
text.find("THE END")
# 148848
text = text[1007:148848]

Lot of books will have header and footers, here we just find the header index and the footer index and simply remove ‘em.

If there are more than one “THE END”s, you can use:

text.rfind("THE END")

Which will find indexes from the bottom of the text.

Handling HTML

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urllib.request.urlopen(url).read()
print(html)

The printout:

b'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">\r\n<html>\r\n<head>\r\n<title>BBC NEWS | Health | Blondes \'to die out in 200 years\'</title>\r\n<meta name="keywords" content="BBC, News, BBC News, news online, world, uk, international, foreign, british, online, service">\r\n<meta (none)~RS~a~RS~International~RS~q~RS~~RS~z~RS~25~RS~">\r\n</noscript>\r\n\r\n\r\n\r\n<br>\r\n<link type="text/css" rel="stylesheet" href="/nol/shared/stylesheets/uki_globalstylesheet.css">\r\n\r\n</body>\r\n</html>\r\n'

Not that great, so let’s clean it up:

!pip install beautifulsoup4

Beautiful Soup has tons of easy methods for getting at the text:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())

Stringify:

text = soup.get_text()
type(text)

Tokenify:

tokens = nltk.word_tokenize(text)
tokens

Textify:

text = nltk.Text(tokens)
text.concordance('gene')

Reading Local Files

f = open('document.txt')
raw = f.read()

If the file is in a different directory, you can list the contents to find it:

import os
os.listdir('.')

Print line by line:

f = open('document.txt', 'r')
for line in f:
    print(line.strip())

Binary Files

Text sometimes comes in PDF or Microsoft Word format; there are third-party libraries for processing these, such as pypdf and pywin32.
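As a quick sketch, text can be pulled out of a PDF with the third-party pypdf package and then tokenized as usual ('report.pdf' is just a placeholder file name):

!pip install pypdf

import nltk
from pypdf import PdfReader

reader = PdfReader('report.pdf')  # placeholder path
raw = '\n'.join(page.extract_text() or '' for page in reader.pages)
tokens = nltk.word_tokenize(raw)  # then proceed as with any other raw text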

User Input

s = input("Enter some text: ")
print("You typed", len(nltk.word_tokenize(s)), "words.")

Strings

Single Quote

monty = 'Monty Python'
monty
# 'Monty Python'

However, if you want to use a single quote inside the string itself, you need to escape it:

circus = 'Monty Python\'s Flying Circus'
circus
"Monty Python's Flying Circus"

Or you can use the…

Double Quotation Mark

circus = "Monty Python's Flying Circus"
circus
"Monty Python's Flying Circus"

Triple Quotation Mark

couplet = "Shall I compare thee to a Summer's day?"\
"Thou are more lovely and more temperate:"
print(couplet)
# Shall I compare thee to a Summer's day?Thou are more lovely and more temperate:

The problem is, the above doesn’t print newlines, but this does:

couplet = """Shall I compare thee to a Summer's day?
Thou are more lovely and more temperate:"""
print(couplet)
# Shall I compare thee to a Summer's day?
# Thou are more lovely and more temperate:

The same works with three single quotation marks:

couplet = '''Rough winds do shake the darling buds of May,
And Summer's lease hath all too short a date:'''
print(couplet)
# Rough winds do shake the darling buds of May,
# And Summer's lease hath all too short a date:

Concatenation

'very' + 'very' + 'very'
# 'veryveryvery'
'very' * 3
# 'veryveryvery'

Printing

grail = 'Holy Grail'
print(monty + grail)
# Monty PythonHoly Grail
print(monty, grail)
# Monty Python Holy Grail
print(monty, "and the", grail)
# Monty Python and the Holy Grail

Individual Chars

monty[0]
# 'M'
monty[3]
# 't'
monty[5]
# ' '

Negative Indexing Chars

monty[-1]
# 'n'
monty[5]
# ' '
monty[-7]
# ' '

The last character is at index -1, and the negative indices count down (-2, -3, ...) as you move backwards through the string.

Print Chars

sent = 'colorless green ideas sleep furiously'
for char in sent:
    print(char, end=' ')
# c o l o r l e s s g r e e n i d e a s s l e e p f u r i o u s l y

Count Chars

from nltk.corpus import gutenberg
raw = gutenberg.raw('melville-moby_dick.txt')
fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())
[char for (char, count) in fdist.most_common()]
# ['e', 't', 'a', 'o', 'n', 'i', 's', 'h', 'r', 'l', 'd', 'u', 'm',
# 'c', 'w', 'f', 'g', 'p', 'b', 'y', 'v', 'k', 'q', 'j', 'x', 'z']

You can also visualize this frequency distribution:

fdist.plot()

Each language has a characteristic letter-frequency distribution, which makes this a good way to distinguish between languages.
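As a rough sketch of that idea, we can compare the most frequent letters of two languages side by side using NLTK's UDHR corpus (the fileids 'English-Latin1' and 'German_Deutsch-Latin1' are assumed here):

import nltk
from nltk.corpus import udhr
nltk.download('udhr')

for lang in ['English-Latin1', 'German_Deutsch-Latin1']:
    raw = udhr.raw(lang)
    fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())
    print(lang, [char for char, count in fdist.most_common(5)])  # five most frequent letters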

Substrings

monty[6:10]
# 'Pyth'

The slice (m,n) contains the substring from index m through n-1.

Substring Membership

phrase = 'And now for something completely different'
if 'thing' in phrase:
    print('found "thing"')
# found "thing"

More Operations

  • s.find(t) Index of first instance of string t inside s (-1 if not found)
  • s.rfind(t) Index of last instance of string t inside s (-1 if not found)
  • s.index(t) Like s.find(t), except it raises ValueError if not found
  • s.rindex(t) Like s.rfind(t), except it raises ValueError if not found
  • s.join(text) Combine the words of the text into a string using s as the glue
  • s.split(t) Split s into a list wherever a t is found (whitespace by default)
  • s.splitlines() Split s into a list of strings, one per line
  • s.lower() A lowercased version of the string s
  • s.upper() An uppercased version of the string s
  • s.title() A titlecased version of the string s
  • s.strip() A copy of s without leading or trailing whitespace
  • s.replace(t, u) Replace instances of t with u inside s
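A few of these in action (the values are just illustrative):

s = '  Monty Python  '
s.strip()                          # 'Monty Python'
s.strip().lower()                  # 'monty python'
'Monty Python'.split()             # ['Monty', 'Python']
'-'.join(['Monty', 'Python'])      # 'Monty-Python'
'Monty Python'.replace('o', '0')   # 'M0nty Pyth0n'
'Monty Python'.find('Py')          # 6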

Lists

beatles = ['John', 'Paul', 'George', 'Ringo']
beatles[2]
# 'George'
beatles[:2]
# ['John', 'Paul']
beatles + ['Brian']
# ['John', 'Paul', 'George', 'Ringo', 'Brian']

Unicode

ASCII can represent only 128 characters (256 in extended variants), because it uses at most one byte per character. UTF-8, by contrast, can encode every Unicode code point, over a million of them, because it uses one to four bytes per character.

We can manipulate Unicode strings exactly like ordinary strings; however, when we store or transmit them, they are serialized as a stream of bytes. Simple encodings such as ASCII are often enough to support a single language.

Unicode supports virtually all written languages, as well as special characters like emoji.

Since Unicode is the universal representation shared across languages, translating from a specific byte encoding into Unicode is called decoding, and translating out of Unicode into a specific byte encoding is called encoding.
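A quick sketch of the round trip, using a made-up string:

data = 'Straße café'.encode('utf-8')   # str -> bytes: encoding
type(data)
# <class 'bytes'>
text = data.decode('utf-8')            # bytes -> str: decoding
print(text)
# Straße café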

Code Point

Unicode defines a space of over a million possible characters; each character is assigned a number in that space, which we call its code point.
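For example, the euro sign sits at code point U+20AC:

ord('€')        # 8364, the code point as a decimal number
hex(ord('€'))   # '0x20ac'
'\u20ac'        # '€'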

Glyphs

Fonts are a mapping from characters to glyphs. Glyphs are what appear on printouts and on screen; the characters themselves are just code points, conventionally written as hexadecimal numbers such as U+0061.

Codecs

In general, a codec is a program or device for encoding and decoding data. In this context, a codec handles converting text between Unicode and a particular byte encoding.

path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')
import codecs
f = codecs.open(path, encoding='latin2')       # read Latin-2 bytes, decoding them to Unicode
f = codecs.open(path, 'w', encoding='utf-8')   # write Unicode strings out as UTF-8 bytes
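In Python 3 the built-in open() accepts an encoding directly, so the same thing can be done without the codecs module (a small sketch reusing the path found above):

f = open(path, encoding='latin2')   # decode Latin-2 bytes into Unicode strings
lines = f.read().splitlines()
f.close()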

Ordinal

ord('a')
# 97
hex(97)
# '0x61'
char = u'\u0061'
print(char)
# a

Regular Expression Applications to Tokenizing

Lots of linguistic tasks require pattern matching. For example, to find words that end in 'ed', we could use endswith('ed').

Regular expressions help us do that very efficiently.

import re
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]

Basic Metacharacters

Metacharacters mark additional things like the start or end of a string, wildcards, and so on.

Start: Caret

^ matches the start of the string; you can think of it as anchoring the pattern to the position just before the first character of the word.

import nltk
nltk.download('words')
[w for w in wordlist if re.search('^pre', w)]
# 'predestinately', 'predestination', 'predestinational', 'predestinationism', 'predestinationist', 'predestinative', 'predestinator', 'predestine', 'predestiny', 'predestitute', 'predestitution', 'predestroy', 'predestruction', 'predetach', 'predetachment', 'predetail', 'predetain', 'predetainer', 'predetect', 'predetention', 'predeterminability', 'predeterminable', 'predeterminant', 'predeterminate', 'predeterminately', 'predetermination', 'predeterminative', 'predetermine', 'predeterminer',

End: Dollar Sign

[w for w in wordlist if re.search('ed$', w)] # ['abaissed', 'abandoned', 'abased', 'abashed', 'abatised', 'abed', 'aborted', ...]

$ matches the end of the string.

Single Character Wildcard: Dot

[w for w in wordlist if re.search('^..j..t..$', w)]
# ['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector', 'majestic',
#  'objectee', 'objector', 'rejecter', 'rejector', 'unjilted', 'unjolted', 'unjustly']

Optional Characters: Question Mark

‹‹^e-?mail$››

Here the ? metacharacter makes the character immediately before it optional, so the regular expression inside ‹‹›› matches both email and e-mail.

[w for w in wordlist if re.search('^e-?mail$', w)]

Ranges

The words “golf” and “hold” are textonyms: words entered with the same sequence of keystrokes on a phone keypad.

[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)] # ['gold', 'golf', 'hold', 'hole']
  • Set = [ghi]
  • Range = [g-i]

Closures

import nltk
nltk.download('nps_chat')
chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))
[w for w in chat_words if re.search('^m+i+n+e+$', w)]
# ['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',
#  'miiiiiinnnnnnnnnneeeeeeee', 'mine',
#  'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']
  • The next metacharacter is the + in ‹‹^m+i+n+e+$››, which means one or more instances of the preceding character.
  • The next metacharacter is the * in ‹‹^m*i*n*e*$››, which means zero or more instances of the preceding character, so it also admits partial and even empty strings, as shown below.
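Here is the starred pattern as a one-liner, reusing the chat_words list from above:

[w for w in chat_words if re.search('^m*i*n*e*$', w)]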
['',  'e',  'i',  'in',  'm',  'me',  'meeeeeeeeeeeee',  'mi',  'miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',  'miiiiiinnnnnnnnnneeeeeeee',  'min',  'mine',  'mm',  'mmm',  'mmmm',  'mmmmm',  'mmmmmm',  'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee',  'mmmmmmmmmm',  'mmmmmmmmmmmmm',  'mmmmmmmmmmmmmm',  'n',  'ne']

These + and * operators are known as Kleene closures; the closure of a pattern is the set of all strings it can match.

You can also combine closures with character sets:

[w for w in chat_words if re.search('^[ha]+$', w)]

The result:

['a',  'aaaaaaaaaaaaaaaaa',  'aaahhhh',  'ah',  'ahah',  'ahahah',  'ahh',  'ahhahahaha',  'ahhh',  'ahhhh',  'ahhhhhh',  'ahhhhhhhhhhhhhh',  'h',  'ha',  'haaa',  'hah',  'haha',  'hahaaa',  'hahah',  'hahaha',  'hahahaa',  'hahahah',  'hahahaha',  'hahahahaaa',  'hahahahahaha',  'hahahahahahaha',  'hahahahahahahahahahahahahahahaha',  'hahahhahah',  'hahhahahaha']

Logical Not Operator: Caret Inside a Bracket

«[^aeiouAEIOU]» matches any single character that is not a vowel; anchored as «^[^aeiouAEIOU]+$», it matches whole tokens that contain no vowels at all (a sketch of that query follows the list below), giving us tokens like:

  • :):):),
  • grrr,
  • cyb3r, and
  • zzzzzzzz
  • or just !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
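A sketch of that search, reusing the chat_words list defined earlier (the exact matches depend on the corpus):

[w for w in chat_words if re.search('^[^aeiouAEIOU]+$', w)]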

Matching Patterns with Separators: Escape with Backslash

import nltk
nltk.download('treebank')
wsj = sorted(set(nltk.corpus.treebank.words()))
[w for w in wsj if re.search(r'^[0-9]+\.[0-9]+$', w)]

This gets us all decimal numbers:

# ['0.0085', '0.05', '0.1', '0.16', '0.2', '0.25', '0.28', '0.3', '0.4', '0.5', # '0.50', '0.54', '0.56', '0.60', '0.7', '0.82', '0.84', '0.9', '0.95', '0.99', # '1.01', '1.1', '1.125', '1.14', '1.1650', '1.17', '1.18', '1.19', '1.2', ...]

To get currencies:

[w for w in wsj if re.search(r'^[A-Z]+\$$', w)]

The result:

['C$', 'US$']

Limit Characters: Curly Brackets

[w for w in wsj if re.search('^[0-9]{4}$', w)]

The result:

['1614',  '1637',  '1787',  '1901',  '1903',  '1917',  '1925',  '1929',  '1933',  '1934',  '1948',  '1953',  '1955',  '1956',  '1961',  '1965',  '1966',  '1967',  '1968',  '1969',  '1970',  '1971',  '1972',  '1973',  '1975',  '1976',  '1977',  '1979',  '1980',  '1981',  '1982',  '1983',  '1984',  '1985',  '1986',  '1987',  '1988',  '1989',  '1990',  '1991',  '1992',  '1993',  '1994',  '1995',  '1996',  '1997',  '1998',  '1999',  '2000',  '2005',  '2009',  '2017',  '2019',  '2029',  '3057',  '8300']

To apply it to several ranges:

[w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)]

And the result:

['10-day',  '10-lap',  '10-year',  '100-share',  '12-point',  '12-year',  '14-hour',  '15-day',  '150-point',  '190-point',  '20-point',  '20-stock',  '21-month',  '237-seat',  '240-page',  '27-year',  '30-day',  '30-point',  '30-share',  '30-year',  '300-day',  '36-day',  '36-store',  '42-year',  '50-state',  '500-stock',  '52-week',  '69-point',  '84-month',  '87-store',  '90-day']

Order of Operations: Brackets

What does «w(i|e|ai|oo)t» match?

[w for w in wsj if re.search('^w(i|e|ai|oo)t', w)]

Gives results like:

['wait',  'waited',  'waiting',  'witches',  'with',  'withdraw',  'withdrawal',  'withdrawn',  'withdrew',  'withhold',  'within',  'without',  'withstand',  'witness',  'witnesses']

In Python string literals, a backslash introduces an escape sequence; for example, '\b' is the backspace character. In a regular expression, however, \b means a word boundary.

To make sure the backslash reaches the re library intact, we prefix the string with r to make it a raw string, like so: r'\band\b'.
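A small illustration with a made-up sentence:

re.findall('\band\b', 'sand and gravel and salt')    # without r, \b becomes a backspace character
# []
re.findall(r'\band\b', 'sand and gravel and salt')   # with r, \b reaches re as a word boundary
# ['and', 'and']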

Extracting Word Pieces

word = 'supercalifragilisticexpialidocious'
list_vowels = re.findall(r'[aeiou]', word)
len(list_vowels)
# 16

The previous examples all used re.search(regex, word); here we switch to finding every instance with re.findall(regex, word).

The example below finds all sequences of two or more vowels from the set [aeiou] and counts how often each sequence occurs.

import nltk
nltk.download('treebank')
wsj = sorted(set(nltk.corpus.treebank.words()))
fd = nltk.FreqDist(vs for word in wsj for vs in re.findall(r'[aeiou]{2,}', word))
fd.items()

The output is:

dict_items([('ea', 476), ('oi', 65), ('ou', 329), ('io', 549), ('ee', 217), ('ie', 331), ('ui', 95), ('ua', 109), ('ai', 261), ('ue', 105), ('ia', 253), ('ei', 86), ('iai', 1), ('oo', 174), ('au', 106), ('eau', 10), ('oa', 59), ('oei', 1), ('oe', 15), ('eo', 39), ('uu', 1), ('eu', 18), ('iu', 14), ('aii', 1), ('aiia', 1), ('ae', 11), ('aa', 3), ('oui', 6), ('ieu', 3), ('ao', 6), ('iou', 27), ('uee', 4), ('eou', 5), ('aia', 1), ('uie', 3), ('iao', 1), ('eei', 2), ('uo', 8), ('uou', 5), ('eea', 1), ('ueui', 1), ('ioa', 1), ('ooi', 1)])

Reconstructing Words from Word Pieces

regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'

def compress(word):
    pieces = re.findall(regexp, word)
    return ''.join(pieces)

print(nltk.tokenwrap(compress(w) for w in wsj[0:-75]))

This function strips out word-internal vowels, keeping any vowel sequence at the very start or end of a word. The (truncated) result:

wgh wghd wghng wght wrd wlcme wlcmd wlfre wll wll-cnnctd wll-knwn wnt wre wht wht whl-ldr whls whn whn-ssd whnvr whre whrby whrwthl whthr whch whchvr whle whmscl whppng whpsw whrlng whstle whte wht-cllr who whle whlsle whlslr whm whse why wde wdly wdsprd wdgt wdgts wdw wld wfe wld wldly wll wllng wllngnss wn wndfll wndng wndw wne wn-byng wn-mkng wns wngs wnnr wnnrs wnnng wns wntr wrs wsdm wsh wtchs wth wthdrw wthdrwl wthdrwn wthdrw wthhld wthn wtht wthstnd wtnss wtnsss wvs wzrds wo wmn wmn wn wndr wng wrd wrd- prcssng wrds wrk wrkble wrkbks wrkd wrkr

Conditional Frequency Distributions

words = sorted(set(nltk.corpus.treebank.words()))
cvs = [cv for w in words for cv in re.findall(r'[bcdfghjklmnpqrstvxyz][aeiou]', w)]
cfd = nltk.ConditionalFreqDist(cvs)
cfd.tabulate()

The output is a conditional frequency distribution of consonant-vowel sequences in the Treebank sample:

      a     e     i     o     u
b   166   202   113   133    89
c   345   427   240   554   149
d   127   632   433   121   111
f   115   150   206   128    80
g   126   329   124    60    62
h   270   336   276   224    45
j     7    22     4    26    37
k    19   216   111     8     8
l   458   675   573   288   107
m   321   453   290   193    62
n   289   545   286   153    73
p   233   359   133   256    98
q     0     0     0     0   109
r   679  1229   665   486   127
s   130   577   391   175   229
t   429  1053  1130   352   182
v   100   516   214    60     0
x    17    28    21     7     3
y    13   104    44    22     1
z    21    76    21    10     2

treebank.words() is a tokenized sample of the Wall Street Journal.

Finding All Instances Of

Now let's go the other way and index the words by the consonant-vowel sequences they contain:

cv_word_pairs = [(cv, w) for w in words for cv in re.findall(r'[bcdfghjklmnpqrstvxyz][aeiou]', w)]
cv_index = nltk.Index(cv_word_pairs)
cv_index['ba']

The output for 'ba':

['Albany',  'Atlanta-based',  'Barbados',  'Barbara',  'Barbaresco',  'Bermuda-based',  'Cabbage',  'Calif.-based',  'Carballo',  'Centerbank',  'Citibank',  'Conn.based',  'Embassy',  'Erbamont',  'Francisco-based',  'Freshbake',  'Garbage',  'Germany-based' ...

Finding Word Stems

Word stems are the core of a word, its root. In a search engine we often want to match not just the literal query string but all related words that share the same stem.

regex = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$'
[w for w in wsj if re.findall(regex, w)]

Which finds all words with those suffixes.

["'30s",  "'40s",  "'50s",  "'80s",  "'s",  '1920s',  '1940s',  '1950s',  '1960s',  '1970s',  '1980s',  '1990s',  '20s',  '30s',  '62%-owned',  '8300s',  'ADRs',  'Absorbed',  'Academically',  'According',  'Achievement'...

If you apply the pattern to a single word on its own:

re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')

we get:

[('process', 'es')]

Finding Stems In a Better Way

def stem(word):
    regex = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regex, word)[0]
    return stem

raw = """DENNIS: Listen, strange women lying in ponds
distributing swords is no basis for a system of government.
Supreme executive power derives from a mandate from the masses,
not from some farcical aquatic ceremony."""
tokens = nltk.word_tokenize(raw)
[stem(t) for t in tokens]

Which outputs:

['DENNIS',  ':',  'Listen',  ',',  'strange',  'women',  'ly',  'in',  'pond',  'distribut',  'sword',  'i',  'no',  'basi',  'for',  'a',  'system',  'of',  'govern',  '.',  'Supreme',  'execut',  'power',  'deriv',  'from',  'a',  'mandate',  'from',  'the',  'mass',  ',',  'not',  'from',  'some',  'farcical',  'aquatic',  'ceremony',  '.']

So even this improved method makes errors, for example giving 'ly' for 'lying', 'i' for 'is', and 'basi' for 'basis' above.

Searching Tokenized Text

What if you wanted to search multiple words? We can use regular expressions for that too.

from nltk.corpus import gutenberg
moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
moby.findall(r"<a> (<.*>) <man>")
# monied; nervous; dangerous; white; white; white; pious; queer;
# good; mature; white; Cape; great; wise; wise; butterless; white;
# fiendish; pale; furious; better; certain; complete; dismasted;
# younger; brave; brave; brave; brave

The above regular expression will match “a (anything) man”. <.*> will match any single token.

  • If we leave out the parentheses, findall prints the whole matched phrase.
  • If we use the parentheses, it prints only the parenthesized word.

from nltk.corpus import gutenberg
moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
moby.findall(r"<a> <.*> <man>")
# a monied man; a nervous man; a dangerous man; a white man; a white # man; a white man; a pious man; a queer man; a good man; a mature
# man; a white man; a Cape man; a great man; a wise man; a wise man; # a butterless man; a white man; a fiendish man; a pale man; a
# furious man; a better man; a certain man; a complete man; a
# dismasted man; a younger man; a brave man; a brave man; a brave
# man; a brave man

To be able to match 3 word phrases:

gutenberg_text = nltk.Text(gutenberg.words())
gutenberg_text.findall(r"<.*> <.*> <whale>")
# or a whale; as a whale; in the whale; that the whale; of a whale; # name a whale; - piggledy whale; s ( whale; " This whale; of the
# whale; of one whale; like a whale; the wounded whale; of a whale; # While the whale; say the whale; see a whale; of a whale; once a
# whale; of the ...

To be able to match sequences of 3 or more words that start with “l”:

moby.findall(r"<l.*>{3,}")# little lower layer; little lower layer; lances lie levelled; long # lance lightly; like live legs

Exploring Hypernyms

Some linguistic phenomena, such as superordinate (hypernym) words, tend to show up in characteristic surface patterns in text, for example “X and other Ys”.

import nltk
nltk.download('brown')
from nltk.corpus import brown
hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")

The pattern matches phrases like:

speed and other activities; water and other liquids; tomb and other landmarks; Statues and other monuments; pearls and other jewels; charts and other items; roads and other features; figures and other objects; military and other areas; demands and other factors; abstracts and other compilations; iron and other metals

Notice that one result is “water and other liquids”, which tells us that water is a kind of liquid: liquid is the hypernym and water is the hyponym.

Of course, this method isn't perfect; there can be false positives.

Other Articles

This post is part of a series of stories that explores the fundamentals of natural language processing:

1. Context of Natural Language Processing
Motivations, Disciplines, Approaches, Outlook
2. Notes on Formal Language Theory
Objects, Operations, Regular Expressions and Finite State Automata
3. Natural Language Tool Kit 3.5
Search Functions, Statistics, Pronoun Resolution
4. What Are Regular Languages?
Minimization, Finite State Transducers, Regular Relations
5. What Are Context Free Languages?
Grammars, Derivations, Expressiveness, Hierarchies
6. Inputting & PreProcessing Text
Input Methods, String & Unicode, Regular Expression Use Cases

Up Next…

In the next article, we will explore Normalizing, Tokenizing and Sentence Segmentation.

For the table of contents and more content click here.

References

Clark, Alexander. The Handbook of Computational Linguistics and Natural Language Processing. Wiley-Blackwell, 2013.

Eisenstein, Jacob. Introduction to Natural Language Processing. The MIT Press, 2019.

Bird, Steven, et al. Natural Language Processing with Python. O’Reilly, 2009.

Jurafsky, Dan, and James H. Martin. Speech and Language Processing. Pearson, 2014.

Barker-Plummer, Dave, et al. Language, Proof and Logic. CSLI Publ., Center for the Study of Language and Information, 2011.
