Text Segmentation

Normalization, Tokenization, Sentence Segmentation + Useful Methods

Jake Batsuuri
Computronium Blog

--

What does normalizing a text do?

We have previously used the method .lower() to turn all of the words lowercase, so that strings like “the” and “The” both become “the” and we don’t double count them.

What if we want to do even more?

Stemming

For example, we can strip the affixes from words in a process called stemming. In the word “preprocessing”, there’s a prefix “pre-” and a suffix “-ing”, and the resulting stem is “process”.

NLTK has several stemmers; you can make your own using regular expressions, but the NLTK stemmers handle many irregular cases.

Two commonly used ones are the Porter and Lancaster stemmers.
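
Both stemmer examples below assume a list called tokens. One possible setup, using the Monty Python quote that the sample outputs come from, is:

import nltk
nltk.download('punkt')  # word_tokenize relies on the Punkt models

raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government. Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = nltk.word_tokenize(raw)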

lancaster = nltk.LancasterStemmer()
[lancaster.stem(t) for t in tokens]
# ['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in',
# 'pond', 'distribut', 'sword', 'is', 'no', 'bas', 'for',
# 'a',

This produces some questionable results, such as “wom” for “women” and “bas” for “basis”.

porter = nltk.PorterStemmer()
[porter.stem(t) for t in tokens]
# ['denni', ':', 'listen', ',', 'strang', 'women', 'lie',
# 'in', 'pond', 'distribut', 'sword', 'is',

The Porter stemmer gives slightly better output. The following class builds a concordance not by direct string search but by matching stems, so that the different variations of “lie” are all found together.

class IndexedText(object):

    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        self._index = nltk.Index((self._stem(word), i)
                                 for (i, word) in enumerate(text))

    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = width // 4  # words of context; integer division so slices get ints
        for i in self._index[key]:
            lcontext = ' '.join(self._text[i-wc:i])
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = '%*s' % (width, lcontext[-width:])
            rdisplay = '%-*s' % (width, rcontext[:width])
            print(ldisplay, rdisplay)

    def _stem(self, word):
        return self._stemmer.stem(word).lower()

porter = nltk.PorterStemmer()
grail = nltk.corpus.webtext.words('grail.txt')  # requires nltk.download('webtext')
text = IndexedText(porter, grail)
text.concordance('lie')
# r king ! DENNIS : Listen , strange women lying in ponds
# beat a very brave retreat . ROBIN : All lies ! MINSTREL : [
# Nay . Nay . Come . Come . You may lie here . Oh , but you are
# doctors immediately ! No , no , please ! Lie down . [ clap clap ]
# ere is much danger , for beyond the cave lies the Gorge of Eternal
# you . Oh ... TIM : To the north there lies a cave -- the cave of
# h it and lived ! Bones of full fifty men lie strewn about its lair
# not stop our fight ' til each one of you lies dead , and the Holy

Lemmatization

The word produced by stemming may or may not be a dictionary term; the process of making sure the result is a real word is called lemmatization.

Stemming isn’t much use if the resulting root word is not in the dictionary. The WordNet lemmatizer removes affixes only if the resulting word is in WordNet.

import nltk
nltk.download('wordnet')
wnl = nltk.WordNetLemmatizer()
[wnl.lemmatize(t) for t in tokens]
'''
['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond', 'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']
'''
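
One detail worth noting: lemmatize() treats its input as a noun unless you pass a part-of-speech tag, which is why “lying” was left unchanged above. A small illustration:

wnl.lemmatize('lying')       # 'lying'
wnl.lemmatize('lying', 'v')  # 'lie'
wnl.lemmatize('women')       # 'woman'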

This idea of keeping only stems that are real words is helpful when you want to do further analysis.

Furthermore, you can map special words such as numbers, abbreviations, emails and addresses into special sub-vocabularies, which can improve a language model significantly.

To map these special words, you sometimes have to write the patterns yourself.
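
As a rough sketch of what that can look like, the snippet below maps numbers and email addresses to placeholder tokens before tokenization. The patterns and the NUM/EMAIL names are illustrative choices, not a standard:

import re

def normalize_special(text):
    # Crude illustrative patterns; real systems need more careful rules.
    text = re.sub(r'\S+@\S+\.\S+', 'EMAIL', text)  # email addresses
    text = re.sub(r'\d+(?:\.\d+)?', 'NUM', text)   # integers and decimals
    return text

normalize_special('Send 250 units to sales@example.com by 3.15')
# 'Send NUM units to EMAIL by NUM'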

Regular Expression for Tokenization

Tokenization is a special type of segmentation where we segment the entire text into words, as opposed to sentences or phrases.

The simplest way to tokenize a text is to split it on whitespace.

The simplest way to segment a text into sentences is to split it on periods.

import re

raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
well without--Maybe it's always pepper that makes people hot-tempered,'..."""

re.split(r' ', raw)
'''
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone\nthough),',
'''

Then we come across a problem: the newline is keeping what should be two separate tokens joined together. So we split on a set of whitespace characters, spaces, tabs and newlines:

re.split(r'[ \t\n]+', raw)
'''
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]
'''

This works much better.

We can even shorten the character class to r'\s+', which matches one or more whitespace characters of any kind.
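
For our sample text, this produces the same token list as the character-class version:

re.split(r'\s+', raw)  # same output as re.split(r'[ \t\n]+', raw)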

The Complement Method

Consider the character class [a-zA-Z0-9_], which can be abbreviated as \w; it matches word characters (letters, digits and the underscore) rather than spaces. Its complement, \W, matches any character that is not a letter, digit or underscore.

re.split(r'\W+', raw)
'''
['', 'When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', 'I', 'won', 't', 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without', 'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot', 'tempered', '']
'''

Splitting on \W+ breaks up strings like “I’M” and “hot-tempered”, though. A more powerful approach is to match the tokens themselves with re.findall, which lets us write patterns for exactly these cases:

re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw)
'''
["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I', "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '--', 'Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', 'hot-tempered', ',', "'", '...']
'''
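
NLTK also wraps this findall-style tokenization in a helper, so the same pattern can be reused as a tokenizer (a brief aside):

nltk.regexp_tokenize(raw, r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*")
# same token list as the re.findall call above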

This is the best resource I found for regexes.

Issues with Tokenization

Tokenization never gives a perfect solution across all types of text; the tokenizer usually has to be adapted to the kind of text at hand.

One way to handle this is to compare your tokenizer’s output against texts that have already been tokenized by hand, to make sure that it’s good.

Another issue for tokenizers is contractions such as “didn’t”, which must be treated as special cases.
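
For example, NLTK’s default word tokenizer splits a contraction into two tokens:

nltk.word_tokenize("She didn't like the pepper.")
# ['She', 'did', "n't", 'like', 'the', 'pepper', '.']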

And what about cases where we can’t really tell where the word boundaries are, such as when our input system just gets this: “doyouseethekittyseethedoggydoyoulikethekittylikethedoggy”?

Simulated Annealing
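
One way to attack an unsegmented string like the one above is to treat segmentation as an optimization problem: score each candidate segmentation by the size of the lexicon it implies plus the length of the segmented text, and use an annealing-style random search to find a low-scoring segmentation. The sketch below loosely follows the approach described in the NLTK book; it is a condensed illustration, not a polished implementation:

from random import randint

def segment(text, segs):
    # segs is a string of 0/1 flags, one per character boundary;
    # a '1' at position i means "end a word after character i".
    words, last = [], 0
    for i, flag in enumerate(segs):
        if flag == '1':
            words.append(text[last:i + 1])
            last = i + 1
    return words + [text[last:]]

def evaluate(text, segs):
    # Objective: cost of storing the lexicon plus cost of the segmented text.
    words = segment(text, segs)
    lexicon_size = sum(len(w) + 1 for w in set(words))
    return len(words) + lexicon_size

def flip_n(segs, n):
    # Toggle n randomly chosen boundary flags to get a neighbouring state.
    for _ in range(n):
        pos = randint(0, len(segs) - 1)
        segs = segs[:pos] + str(1 - int(segs[pos])) + segs[pos + 1:]
    return segs

def anneal(text, segs, iterations=5000, cooling_rate=1.2):
    # Start with large random perturbations and shrink them as the
    # "temperature" cools, keeping the best segmentation seen at each level.
    temperature = float(len(segs))
    while temperature > 0.5:
        best_segs, best = segs, evaluate(text, segs)
        for _ in range(iterations):
            guess = flip_n(segs, round(temperature))
            score = evaluate(text, guess)
            if score < best:
                best, best_segs = score, guess
        segs = best_segs
        temperature = temperature / cooling_rate
    return segment(text, segs)

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
print(anneal(text, '0' * (len(text) - 1)))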

Sentence Segmentation

What about segmenting into sentences?

import nltk
nltk.download('brown')
len(nltk.corpus.brown.words()) / len(nltk.corpus.brown.sents())
# 20.250994070456922, the average sentence length in words

Sentence segmentation is a bit harder: we could split on periods, but periods are also used in abbreviations and numbers, which complicates things.

import nltk
import pprint
nltk.download('gutenberg')
nltk.download('punkt')  # pre-trained Punkt sentence tokenizer models
sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
sents = sent_tokenizer.tokenize(text)
pprint.pprint(sents[171:181])
'''
['In the wild events which were to follow this girl had no\n'
 'part at all; he never saw her again until all his tale was over.',
 'And yet, in some indescribable way, she kept recurring like a\n'
 'motive in music through all his mad adventures afterwards, and the\n'
 'glory of her strange hair ran like a red thread through those dark\n'
 'and ill-drawn tapestries of the night.',
 'For what followed was so\nimprobable, that it might well have been a dream.',
 'When Syme went out into the starlit street, he found it for the\n'
 'moment empty.',
 'Then he realised (in some odd way) that the silence\n'
 'was rather a living silence than a dead one.',
 'Directly outside the\n'
 'door stood a street lamp, whose gleam gilded the leaves of the tree\n'
 'that bent out over the fence behind him.',
 'About a foot from the\n'
 'lamp-post stood a figure almost as rigid and motionless as the\n'
 'lamp-post itself.',
 'The tall hat and long frock coat were black; the\n'
 'face, in an abrupt shadow, was almost as dark.',
 'Only a fringe of\n'
 'fiery hair against the light, and also something aggressive in the\n'
 'attitude, proclaimed that it was the poet Gregory.',
 'He had something\n'
 'of the look of a masked bravo waiting sword in hand for his foe.']
'''

Formatting

Lists to Strings

silly = ['We', 'called', 'him', 'Tortoise', 'because', 'he', 'taught', 'us', '.']
' '.join(silly)
# 'We called him Tortoise because he taught us .'
';'.join(silly)
# 'We;called;him;Tortoise;because;he;taught;us;.'

Newline Printing

sentence = """hello
world"""
print(sentence)
'''
hello
world
'''
sentence
# 'hello\nworld'

Frequency Distribution Counters

fdist = nltk.FreqDist(['dog', 'cat', 'dog', 'cat', 'dog', 'snake', 'dog', 'cat'])
for word in fdist:
    print(word, '->', fdist[word], ';', end=' ')
'''
dog -> 4 ; cat -> 3 ; snake -> 1 ;
'''
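
FreqDist also has a most_common() method that returns the counts already sorted, which saves the manual loop:

fdist.most_common()
# [('dog', 4), ('cat', 3), ('snake', 1)]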

String Formatting Expressions

for word in fdist:
    print('%s->%d;' % (word, fdist[word]))

These help us construct nicely formatted outputs:

'%s->%d;' % ('cat', 3)
# 'cat->3;'

Conversion Specifiers

  • %s is for strings
  • %d is for decimal integers

The number of values in the tuple must match the number of conversion specifiers in the format string.

"%s wants a %s %s" % ("Lee", "sandwich", "for lunch")

More use cases:

template = 'Lee wants a %s right now'
menu = ['sandwich', 'spam fritter', 'pancake']
for snack in menu:
    print(template % snack)
'''
Lee wants a sandwich right now
Lee wants a spam fritter right now
Lee wants a pancake right now
'''

Lining It Up

Left Padding:

'%6s' % 'dog'
# '   dog'

Right Padding:

'%-6s' % 'dog'
# 'dog   '

Variable Padding:

width = 10
'%-*s' % (width, 'dog')
# 'dog       '

Decimals:

total = 9375
count = 3205
"accuracy for %d words: %2.4f%%" % (total, 100 * count / total)
# accuracy for 9375 words: 34.1867%
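
The same output can be produced with the newer f-string syntax, if you prefer it (a small aside):

f"accuracy for {total} words: {100 * count / total:.4f}%"
# 'accuracy for 9375 words: 34.1867%'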

Writing To a File

output_file = open('output.txt', 'w')
words = set(nltk.corpus.genesis.words('english-kjv.txt'))  # requires nltk.download('genesis')
for word in sorted(words):
    output_file.write(word + "\n")

If the data is non-text, then convert to string first:

output_file.write(str(len(words)) + "\n")
output_file.close()
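
A slightly safer variant uses a with-statement, which closes the file automatically even if an error occurs:

with open('output.txt', 'w') as output_file:
    for word in sorted(words):
        output_file.write(word + "\n")
    output_file.write(str(len(words)) + "\n")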

Other Articles

This post is part of a series of stories that explores the fundamentals of natural language processing:

1. Context of Natural Language Processing
Motivations, Disciplines, Approaches, Outlook
2. Notes on Formal Language Theory
Objects, Operations, Regular Expressions and Finite State Automata
3. Natural Language Tool Kit 3.5
Search Functions, Statistics, Pronoun Resolution
4. What Are Regular Languages?
Minimization, Finite State Transducers, Regular Relations
5. What Are Context Free Languages?
Grammars, Derivations, Expressiveness, Hierarchies
6. Inputting & PreProcessing Text
Input Methods, String & Unicode, Regular Expression Use Cases
7. Text Segmentation
Normalization, Tokenization, Sentence Segmentation + Useful Methods

Up Next…

In the next article, we will explore Computational Complexity for language processing.

For the table of contents and more content click here.

