Inputting & PreProcessing Text

Input Methods, String & Unicode, Regular Expression Use Cases

Jake Batsuuri
Computronium Blog

--

NLTK has preprocessed texts. But we can also import and process our own texts.

Importing

from __future__ import division 
import nltk, re, pprint

To Import a Book as a Txt

In Python 3, urlopen lives in the standard-library module urllib.request, so there is nothing extra to install:

import urllib.request
url = "https://www.gutenberg.org/files/2554/2554.txt"   # Project Gutenberg book 2554, Crime and Punishment
raw = urllib.request.urlopen(url).read().decode('utf-8')
type(raw)
# <class 'str'>
len(raw)
# 1176831
raw[:75]
# 'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'

Tokenization:

tokens = nltk.word_tokenize(raw)   # requires the 'punkt' tokenizer models: nltk.download('punkt')
type(tokens)
# <class 'list'>
len(tokens)
# 255809
tokens[:10]
# ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']

Textization, or just turning it into NLTK’s Text object, so that we can run methods like collocations:

text = nltk.Text(tokens)
type(text)
# <class 'nltk.text.Text'>
text[1020:1060]
# ['CHAPTER', 'I', 'On', 'an', 'exceptionally', 'hot', 'evening', 'early', 'in',
#  'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in',
#  'which', 'he', 'lodged', 'in', 'S', '.', 'Place', 'and', 'walked', 'slowly',
#  ',', 'as', 'though', 'in', 'hesitation', ',', 'towards', 'K', '.', 'bridge', '.']
text.collocations()
# Katerina Ivanovna; Pulcheria Alexandrovna; Avdotya Romanovna; Pyotr
# Petrovitch; Project Gutenberg; Marfa Petrovna; Rodion Romanovitch;
# Sofya Semyonovna; Nikodim Fomitch; did not; Hay Market; Andrey
# Semyonovitch; old woman; Literary Archive; Dmitri Prokofitch; great
# deal; United States; Praskovya Pavlovna; Porfiry Petrovitch; ear rings

Getting Just the Good Stuff

text.find("CHAPTER I")
# 1007
text.find("THE END")
# 148848
text = text[1007:148848]

Lot of books will have header and footers, here we just find the header index and the footer index and simply remove ‘em.

If there are more than one “THE END”s, you can use:

text.rfind("THE END")

Which will find indexes from the bottom of the text.

Handling HTML

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urllib.request.urlopen(url).read()
print(html)

The printout:

b'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">\r\n<html>\r\n<head>\r\n<title>BBC NEWS | Health | Blondes \'to die out in 200 years\'</title>\r\n<meta name="keywords" content="BBC, News, BBC News, news online, world, uk, international, foreign, british, online, service">\r\n<meta (none)~RS~a~RS~International~RS~q~RS~~RS~z~RS~25~RS~">\r\n</noscript>\r\n\r\n\r\n\r\n<br>\r\n<link type="text/css" rel="stylesheet" href="/nol/shared/stylesheets/uki_globalstylesheet.css">\r\n\r\n</body>\r\n</html>\r\n'

Not that great, so let’s clean it up:

!pip install beautifulsoup4

Beautiful Soup has tons of easy methods for getting at the text:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())

Stringify:

text = soup.get_text()
type(text)

Tokenify:

tokens = nltk.word_tokenize(text)
tokens

Textify:

text = nltk.Text(tokens)
text.concordance('gene')

Reading Local Files

f = open('document.txt')
raw = f.read()

If the file is in a different directory, you can list the contents to find it:

import os
os.listdir('.')

Print line by line:

f = open('document.txt', 'r')
for line in f:
    print(line.strip())

Binary Files

Text sometimes comes in PDF or Microsoft Word format; there are third-party libraries for processing these, such as pypdf and pywin32.
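As a quick sketch, text can be pulled out of a PDF with the third-party pypdf package and then tokenized as usual ('report.pdf' is just a placeholder file name):

!pip install pypdf

import nltk
from pypdf import PdfReader

reader = PdfReader('report.pdf')  # placeholder path
raw = '\n'.join(page.extract_text() or '' for page in reader.pages)
tokens = nltk.word_tokenize(raw)  # then proceed as with any other raw text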

User Input

s = input("Enter some text: ")
print("You typed", len(nltk.word_tokenize(s)), "words.")

Strings

Single Quote

monty = 'Monty Python'
monty
# 'Monty Python'

However, if you want to use a single quote inside the string itself, you need to escape it:

circus = 'Monty Python\'s Flying Circus'
circus
"Monty Python's Flying Circus"

Or you can use the…

Double Quotation Mark

circus = "Monty Python's Flying Circus"
circus
"Monty Python's Flying Circus"

Triple Quotation Mark

couplet = "Shall I compare thee to a Summer's day?"\
"Thou are more lovely and more temperate:"
print(couplet)
# Shall I compare thee to a Summer's day?Thou are more lovely and more temperate:

The problem is, the above doesn’t print newlines, but this does:

couplet = """Shall I compare thee to a Summer's day?
Thou are more lovely and more temperate:"""
print(couplet)
# Shall I compare thee to a Summer's day?
# Thou are more lovely and more temperate:

The same works with three single quotation marks:

couplet = '''Rough winds do shake the darling buds of May,
And Summer's lease hath all too short a date:'''
print(couplet)
# Rough winds do shake the darling buds of May,
# And Summer's lease hath all too short a date:

Concatenation

'very' + 'very' + 'very'
# 'veryveryvery'
'very' * 3
# 'veryveryvery'

Printing

grail = 'Holy Grail'
print(monty + grail)
# Monty PythonHoly Grail
print(monty, grail)
# Monty Python Holy Grail
print(monty, "and the", grail)
# Monty Python and the Holy Grail

Individual Chars

monty[0]
# 'M'
monty[3]
# 't'
monty[5]
# ' '

Negative Indexing Chars

monty[-1]
# 'n'
monty[5]
# ' '
monty[-7]
# ' '

The last character is at index -1, and the negative indices count down (-2, -3, ...) as you move backwards through the string.

Print Chars

sent = 'colorless green ideas sleep furiously'
for char in sent:
    print(char, end=' ')
# c o l o r l e s s g r e e n i d e a s s l e e p f u r i o u s l y

Count Chars

from nltk.corpus import gutenberg
raw = gutenberg.raw('melville-moby_dick.txt')
fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())
[char for (char, count) in fdist.most_common()]
# ['e', 't', 'a', 'o', 'n', 'i', 's', 'h', 'r', 'l', 'd', 'u', 'm',
# 'c', 'w', 'f', 'g', 'p', 'b', 'y', 'v', 'k', 'q', 'j', 'x', 'z']

You can also visualize this frequency distribution:

fdist.plot()

Each language has a characteristic letter-frequency distribution, which makes this a good way to distinguish between languages.
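As a rough sketch of that idea, we can compare the most frequent letters of two languages side by side using NLTK's UDHR corpus (the fileids 'English-Latin1' and 'German_Deutsch-Latin1' are assumed here):

import nltk
from nltk.corpus import udhr
nltk.download('udhr')

for lang in ['English-Latin1', 'German_Deutsch-Latin1']:
    raw = udhr.raw(lang)
    fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())
    print(lang, [char for char, count in fdist.most_common(5)])  # five most frequent letters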

Substrings

monty[6:10]
# 'Pyth'

The slice (m,n) contains the substring from index m through n-1.

Substring Membership

phrase = 'And now for something completely different'
if 'thing' in phrase:
    print('found "thing"')
# found "thing"

More Operations

  • s.find(t) Index of first instance of string t inside s (-1 if not found)
  • s.rfind(t) Index of last instance of string t inside s (-1 if not found)
  • s.index(t) Like s.find(t), except it raises ValueError if not found
  • s.rindex(t) Like s.rfind(t), except it raises ValueError if not found
  • s.join(text) Combine the words of the text into a string using s as the glue
  • s.split(t) Split s into a list wherever a t is found (whitespace by default)
  • s.splitlines() Split s into a list of strings, one per line
  • s.lower() A lowercased version of the string s
  • s.upper() An uppercased version of the string s
  • s.title() A titlecased version of the string s
  • s.strip() A copy of s without leading or trailing whitespace
  • s.replace(t, u) Replace instances of t with u inside s
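A few of these in action (the values are just illustrative):

s = '  Monty Python  '
s.strip()                          # 'Monty Python'
s.strip().lower()                  # 'monty python'
'Monty Python'.split()             # ['Monty', 'Python']
'-'.join(['Monty', 'Python'])      # 'Monty-Python'
'Monty Python'.replace('o', '0')   # 'M0nty Pyth0n'
'Monty Python'.find('Py')          # 6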

Lists

beatles = ['John', 'Paul', 'George', 'Ringo']
beatles[2]
# 'George'
beatles[:2]
# ['John', 'Paul']
beatles + ['Brian']
# ['John', 'Paul', 'George', 'Ringo', 'Brian']

Unicode

ASCII can represent only 128 characters (256 in extended variants), because it uses at most one byte per character. UTF-8, by contrast, can encode every Unicode code point, over a million of them, because it uses one to four bytes per character.

We can manipulate Unicode strings exactly like ordinary strings; however, when we store or transmit them, they are serialized as a stream of bytes. Simple encodings such as ASCII are often enough to support a single language.

Unicode supports virtually all written languages, as well as special characters like emoji.

Since Unicode is the universal representation shared across languages, translating from a specific byte encoding into Unicode is called decoding, and translating out of Unicode into a specific byte encoding is called encoding.
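A quick sketch of the round trip, using a made-up string:

data = 'Straße café'.encode('utf-8')   # str -> bytes: encoding
type(data)
# <class 'bytes'>
text = data.decode('utf-8')            # bytes -> str: decoding
print(text)
# Straße café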

Code Point

Unicode defines a space of over a million possible characters; each character is assigned a number in that space, which we call its code point.
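For example, the euro sign sits at code point U+20AC:

ord('€')        # 8364, the code point as a decimal number
hex(ord('€'))   # '0x20ac'
'\u20ac'        # '€'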

Glyphs

Fonts are a mapping from characters to glyphs. Glyphs are what appear on printouts and on screen; the characters themselves are just code points, conventionally written as hexadecimal numbers such as U+0061.

Codecs

In general, a codec is a program or device for encoding and decoding data. In this context, a codec handles converting text between Unicode and a particular byte encoding.

path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')
import codecs
f = codecs.open(path, encoding='latin2')       # read Latin-2 bytes, decoding them to Unicode
f = codecs.open(path, 'w', encoding='utf-8')   # write Unicode strings out as UTF-8 bytes
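In Python 3 the built-in open() accepts an encoding directly, so the same thing can be done without the codecs module (a small sketch reusing the path found above):

f = open(path, encoding='latin2')   # decode Latin-2 bytes into Unicode strings
lines = f.read().splitlines()
f.close()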

Ordinal

ord('a')
# 97
hex(97)
# '0x61'
char = u'\u0061'
print(char)
# a

Regular Expression Applications to Tokenizing

Lots of linguistic tasks require pattern matching. For example, to find words that end in 'ed', we could use endswith('ed').

Regular expressions help us do that very efficiently.

import re
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]

Basic Metacharacters

Metacharacters mark additional things like the start or end of a string, wildcards, and so on.

Start: Caret

^ matches the start of the string; you can think of it as anchoring the pattern to the position just before the first character of the word.

import nltk
nltk.download('words')
[w for w in wordlist if re.search('^pre', w)]
# 'predestinately', 'predestination', 'predestinational', 'predestinationism', 'predestinationist', 'predestinative', 'predestinator', 'predestine', 'predestiny', 'predestitute', 'predestitution', 'predestroy', 'predestruction', 'predetach', 'predetachment', 'predetail', 'predetain', 'predetainer', 'predetect', 'predetention', 'predeterminability', 'predeterminable', 'predeterminant', 'predeterminate', 'predeterminately', 'predetermination', 'predeterminative', 'predetermine', 'predeterminer',

End: Dollar Sign

[w for w in wordlist if re.search('ed$', w)] # ['abaissed', 'abandoned', 'abased', 'abashed', 'abatised', 'abed', 'aborted', ...]

$ matches the end of the string.

Single Character Wildcard: Dot

[w for w in wordlist if re.search('^..j..t..$', w)]
# ['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector', 'majestic',
#  'objectee', 'objector', 'rejecter', 'rejector', 'unjilted', 'unjolted', 'unjustly']

Optional Characters: Question Mark

‹‹^e-?mail$››

Here the ? metacharacter makes the character immediately before it optional, so the regular expression inside ‹‹›› matches both email and e-mail.

[w for w in wordlist if re.search('^e-?mail$', w)]

Ranges

The words “golf” and “hold” are textonyms: words entered with the same sequence of keystrokes on a phone keypad.

[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)] # ['gold', 'golf', 'hold', 'hole']
  • Set = [ghi]
  • Range = [g-i]

Closures

import nltk
nltk.download('nps_chat')
chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))
[w for w in chat_words if re.search('^m+i+n+e+$', w)]
# ['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',
#  'miiiiiinnnnnnnnnneeeeeeee', 'mine',
#  'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']
  • The next metacharacter is the + in ‹‹^m+i+n+e+$››, which means one or more instances of the preceding character.
  • The next metacharacter is the * in ‹‹^m*i*n*e*$››, which means zero or more instances of the preceding character, so it also admits partial and even empty strings, as shown below.
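Here is the starred pattern as a one-liner, reusing the chat_words list from above:

[w for w in chat_words if re.search('^m*i*n*e*$', w)]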
['',  'e',  'i',  'in',  'm',  'me',  'meeeeeeeeeeeee',  'mi',  'miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',  'miiiiiinnnnnnnnnneeeeeeee',  'min',  'mine',  'mm',  'mmm',  'mmmm',  'mmmmm',  'mmmmmm',  'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee',  'mmmmmmmmmm',  'mmmmmmmmmmmmm',  'mmmmmmmmmmmmmm',  'n',  'ne']

These + and * operators are known as Kleene closures; the closure of a pattern is the set of all strings it can match.

You can also combine closures with character sets:

[w for w in chat_words if re.search('^[ha]+$', w)]

The result:

['a',  'aaaaaaaaaaaaaaaaa',  'aaahhhh',  'ah',  'ahah',  'ahahah',  'ahh',  'ahhahahaha',  'ahhh',  'ahhhh',  'ahhhhhh',  'ahhhhhhhhhhhhhh',  'h',  'ha',  'haaa',  'hah',  'haha',  'hahaaa',  'hahah',  'hahaha',  'hahahaa',  'hahahah',  'hahahaha',  'hahahahaaa',  'hahahahahaha',  'hahahahahahaha',  'hahahahahahahahahahahahahahahaha',  'hahahhahah',  'hahhahahaha']

Logical Not Operator: Caret Inside a Bracket

«[^aeiouAEIOU]» matches any single character that is not a vowel; anchored as «^[^aeiouAEIOU]+$», it matches whole tokens that contain no vowels at all (a sketch of that query follows the list below), giving us tokens like:

  • :):):),
  • grrr,
  • cyb3r, and
  • zzzzzzzz
  • or just !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
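A sketch of that search, reusing the chat_words list defined earlier (the exact matches depend on the corpus):

[w for w in chat_words if re.search('^[^aeiouAEIOU]+$', w)]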

Matching Patterns with Separators: Escape with Backslash

import nltk
nltk.download('treebank')
wsj = sorted(set(nltk.corpus.treebank.words()))
[w for w in wsj if re.search(r'^[0-9]+\.[0-9]+$', w)]

This gets us all decimal numbers:

# ['0.0085', '0.05', '0.1', '0.16', '0.2', '0.25', '0.28', '0.3', '0.4', '0.5', # '0.50', '0.54', '0.56', '0.60', '0.7', '0.82', '0.84', '0.9', '0.95', '0.99', # '1.01', '1.1', '1.125', '1.14', '1.1650', '1.17', '1.18', '1.19', '1.2', ...]

To get currencies:

[w for w in wsj if re.search(r'^[A-Z]+\$$', w)]

The result:

['C$', 'US$']

Limit Characters: Curly Brackets

[w for w in wsj if re.search('^[0-9]{4}$', w)]

The result:

['1614',  '1637',  '1787',  '1901',  '1903',  '1917',  '1925',  '1929',  '1933',  '1934',  '1948',  '1953',  '1955',  '1956',  '1961',  '1965',  '1966',  '1967',  '1968',  '1969',  '1970',  '1971',  '1972',  '1973',  '1975',  '1976',  '1977',  '1979',  '1980',  '1981',  '1982',  '1983',  '1984',  '1985',  '1986',  '1987',  '1988',  '1989',  '1990',  '1991',  '1992',  '1993',  '1994',  '1995',  '1996',  '1997',  '1998',  '1999',  '2000',  '2005',  '2009',  '2017',  '2019',  '2029',  '3057',  '8300']

To apply it to several ranges:

[w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)]

And the result:

['10-day',  '10-lap',  '10-year',  '100-share',  '12-point',  '12-year',  '14-hour',  '15-day',  '150-point',  '190-point',  '20-point',  '20-stock',  '21-month',  '237-seat',  '240-page',  '27-year',  '30-day',  '30-point',  '30-share',  '30-year',  '300-day',  '36-day',  '36-store',  '42-year',  '50-state',  '500-stock',  '52-week',  '69-point',  '84-month',  '87-store',  '90-day']

Order of Operations: Brackets

What does «w(i|e|ai|oo)t» match?

[w for w in wsj if re.search('^w(i|e|ai|oo)t', w)]

Gives results like:

['wait',  'waited',  'waiting',  'witches',  'with',  'withdraw',  'withdrawal',  'withdrawn',  'withdrew',  'withhold',  'within',  'without',  'withstand',  'witness',  'witnesses']

In Python string literals, a backslash introduces an escape sequence; for example, '\b' is the backspace character. In a regular expression, however, \b means a word boundary.

To make sure the backslash reaches the re library intact, we prefix the string with r to make it a raw string, like so: r'\band\b'.
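A small illustration with a made-up sentence:

re.findall('\band\b', 'sand and gravel and salt')    # without r, \b becomes a backspace character
# []
re.findall(r'\band\b', 'sand and gravel and salt')   # with r, \b reaches re as a word boundary
# ['and', 'and']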

Extracting Word Pieces

word = 'supercalifragilisticexpialidocious'
list_vowels = re.findall(r'[aeiou]', word)
len(list_vowels)
# 16

The previous examples all used re.search(regex, word); here we switch to finding every instance with re.findall(regex, word).

The example below finds all sequences of two or more vowels from the set [aeiou] and counts how often each sequence occurs.

import nltk
nltk.download('treebank')
wsj = sorted(set(nltk.corpus.treebank.words()))
fd = nltk.FreqDist(vs for word in wsj for vs in re.findall(r'[aeiou]{2,}', word))
fd.items()

The output is:

dict_items([('ea', 476), ('oi', 65), ('ou', 329), ('io', 549), ('ee', 217), ('ie', 331), ('ui', 95), ('ua', 109), ('ai', 261), ('ue', 105), ('ia', 253), ('ei', 86), ('iai', 1), ('oo', 174), ('au', 106), ('eau', 10), ('oa', 59), ('oei', 1), ('oe', 15), ('eo', 39), ('uu', 1), ('eu', 18), ('iu', 14), ('aii', 1), ('aiia', 1), ('ae', 11), ('aa', 3), ('oui', 6), ('ieu', 3), ('ao', 6), ('iou', 27), ('uee', 4), ('eou', 5), ('aia', 1), ('uie', 3), ('iao', 1), ('eei', 2), ('uo', 8), ('uou', 5), ('eea', 1), ('ueui', 1), ('ioa', 1), ('ooi', 1)])

Reconstructing Words from Word Pieces

regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'

def compress(word):
    pieces = re.findall(regexp, word)
    return ''.join(pieces)

print(nltk.tokenwrap(compress(w) for w in wsj[0:-75]))

This function strips out word-internal vowels, keeping any vowel sequence at the very start or end of a word. The (truncated) result:

wgh wghd wghng wght wrd wlcme wlcmd wlfre wll wll-cnnctd wll-knwn wnt wre wht wht whl-ldr whls whn whn-ssd whnvr whre whrby whrwthl whthr whch whchvr whle whmscl whppng whpsw whrlng whstle whte wht-cllr who whle whlsle whlslr whm whse why wde wdly wdsprd wdgt wdgts wdw wld wfe wld wldly wll wllng wllngnss wn wndfll wndng wndw wne wn-byng wn-mkng wns wngs wnnr wnnrs wnnng wns wntr wrs wsdm wsh wtchs wth wthdrw wthdrwl wthdrwn wthdrw wthhld wthn wtht wthstnd wtnss wtnsss wvs wzrds wo wmn wmn wn wndr wng wrd wrd- prcssng wrds wrk wrkble wrkbks wrkd wrkr

Conditional Frequency Distributions

words = sorted(set(nltk.corpus.treebank.words()))
cvs = [cv for w in words for cv in re.findall(r'[bcdfghjklmnpqrstvxyz][aeiou]', w)]
cfd = nltk.ConditionalFreqDist(cvs)
cfd.tabulate()

The output is a conditional frequency distribution of consonant-vowel sequences in the Treebank sample:

      a     e     i     o     u
b   166   202   113   133    89
c   345   427   240   554   149
d   127   632   433   121   111
f   115   150   206   128    80
g   126   329   124    60    62
h   270   336   276   224    45
j     7    22     4    26    37
k    19   216   111     8     8
l   458   675   573   288   107
m   321   453   290   193    62
n   289   545   286   153    73
p   233   359   133   256    98
q     0     0     0     0   109
r   679  1229   665   486   127
s   130   577   391   175   229
t   429  1053  1130   352   182
v   100   516   214    60     0
x    17    28    21     7     3
y    13   104    44    22     1
z    21    76    21    10     2

treebank.words() is a tokenized sample of the Wall Street Journal.

Finding All Instances Of

Now let's go the other way and index the words by the consonant-vowel sequences they contain:

cv_word_pairs = [(cv, w) for w in words for cv in re.findall(r'[bcdfghjklmnpqrstvxyz][aeiou]', w)]
cv_index = nltk.Index(cv_word_pairs)
cv_index['ba']

The output for 'ba':

['Albany',  'Atlanta-based',  'Barbados',  'Barbara',  'Barbaresco',  'Bermuda-based',  'Cabbage',  'Calif.-based',  'Carballo',  'Centerbank',  'Citibank',  'Conn.based',  'Embassy',  'Erbamont',  'Francisco-based',  'Freshbake',  'Garbage',  'Germany-based' ...

Finding Word Stems

Word stems are the core of a word, its root. In a search engine we often want to match not just the literal query string but all related words that share the same stem.

regex = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$'
[w for w in wsj if re.findall(regex, w)]

Which finds all words with those suffixes.

["'30s",  "'40s",  "'50s",  "'80s",  "'s",  '1920s',  '1940s',  '1950s',  '1960s',  '1970s',  '1980s',  '1990s',  '20s',  '30s',  '62%-owned',  '8300s',  'ADRs',  'Absorbed',  'Academically',  'According',  'Achievement'...

If you apply the pattern to a single word on its own:

re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')

we get:

[('process', 'es')]

Finding Stems In a Better Way

def stem(word):
    regex = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regex, word)[0]
    return stem

raw = """DENNIS: Listen, strange women lying in ponds
distributing swords is no basis for a system of government.
Supreme executive power derives from a mandate from the masses,
not from some farcical aquatic ceremony."""
tokens = nltk.word_tokenize(raw)
[stem(t) for t in tokens]

Which outputs:

['DENNIS',  ':',  'Listen',  ',',  'strange',  'women',  'ly',  'in',  'pond',  'distribut',  'sword',  'i',  'no',  'basi',  'for',  'a',  'system',  'of',  'govern',  '.',  'Supreme',  'execut',  'power',  'deriv',  'from',  'a',  'mandate',  'from',  'the',  'mass',  ',',  'not',  'from',  'some',  'farcical',  'aquatic',  'ceremony',  '.']

So even this improved method makes errors, for example giving 'ly' for 'lying', 'i' for 'is', and 'basi' for 'basis' above.

Searching Tokenized Text

What if you wanted to search multiple words? We can use regular expressions for that too.

from nltk.corpus import gutenberg
moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
moby.findall(r"<a> (<.*>) <man>")
# monied; nervous; dangerous; white; white; white; pious; queer;
# good; mature; white; Cape; great; wise; wise; butterless; white;
# fiendish; pale; furious; better; certain; complete; dismasted;
# younger; brave; brave; brave; brave

The above regular expression will match “a (anything) man”. <.*> will match any single token.

  • If we leave out the parentheses, findall prints the whole matched phrase.
  • If we use the parentheses, it prints only the parenthesized word.

from nltk.corpus import gutenberg
moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
moby.findall(r"<a> <.*> <man>")
# a monied man; a nervous man; a dangerous man; a white man; a white # man; a white man; a pious man; a queer man; a good man; a mature
# man; a white man; a Cape man; a great man; a wise man; a wise man; # a butterless man; a white man; a fiendish man; a pale man; a
# furious man; a better man; a certain man; a complete man; a
# dismasted man; a younger man; a brave man; a brave man; a brave
# man; a brave man

To be able to match 3 word phrases:

gutenberg_text = nltk.Text(gutenberg.words())
gutenberg_text.findall(r"<.*> <.*> <whale>")
# or a whale; as a whale; in the whale; that the whale; of a whale; # name a whale; - piggledy whale; s ( whale; " This whale; of the
# whale; of one whale; like a whale; the wounded whale; of a whale; # While the whale; say the whale; see a whale; of a whale; once a
# whale; of the ...

To be able to match sequences of 3 or more words that start with “l”:

moby.findall(r"<l.*>{3,}")# little lower layer; little lower layer; lances lie levelled; long # lance lightly; like live legs

Exploring Hypernyms

Some linguistic phenomena, such as superordinate (hypernym) words, tend to show up in characteristic surface patterns in text, for example “X and other Ys”.

import nltk
nltk.download('brown')
from nltk.corpus import brown
hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")

The pattern matches phrases like:

speed and other activities; water and other liquids; tomb and other landmarks; Statues and other monuments; pearls and other jewels; charts and other items; roads and other features; figures and other objects; military and other areas; demands and other factors; abstracts and other compilations; iron and other metals

Notice that one result is “water and other liquids”, which tells us that water is a kind of liquid: liquid is the hypernym and water is the hyponym.

Of course, this method isn't perfect; there can be false positives.

Other Articles

This post is part of a series of stories that explores the fundamentals of natural language processing:

1. Context of Natural Language Processing
Motivations, Disciplines, Approaches, Outlook
2. Notes on Formal Language Theory
Objects, Operations, Regular Expressions and Finite State Automata
3. Natural Language Tool Kit 3.5
Search Functions, Statistics, Pronoun Resolution
4. What Are Regular Languages?
Minimization, Finite State Transducers, Regular Relations
5. What Are Context Free Languages?
Grammars, Derivations, Expressiveness, Hierarchies
6. Inputting & PreProcessing Text
Input Methods, String & Unicode, Regular Expression Use Cases

Up Next…

In the next article, we will explore Normalizing, Tokenizing and Sentence Segmentation.

For the table of contents and more content click here.

References

Clark, Alexander. The Handbook of Computational Linguistics and Natural Language Processing. Wiley-Blackwell, 2013.

Eisenstein, Jacob. Introduction to Natural Language Processing. The MIT Press, 2019.

Bird, Steven, et al. Natural Language Processing with Python. O’Reilly, 2009.

Jurafsky, Dan, and James H. Martin. Speech and Language Processing. Pearson, 2014.

Barker-Plummer, Dave, et al. Language, Proof and Logic. CSLI Publ., Center for the Study of Language and Information, 2011.
