Normalization, Tokenization, Sentence Segmentation + Useful Methods

What does normalizing a text do?

We have previously called the .lower() method to turn all of the words lowercase, so that strings like “the” and “The” both become “the” and we don’t double count them.
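Here is a tiny sketch of the double counting problem. The token list is made up for illustration, it is not from any corpus:

```python
from collections import Counter

# Hypothetical sample tokens, for illustration only.
tokens = ["The", "cat", "saw", "the", "other", "cat"]

# Without normalization, "The" and "the" are separate keys.
raw_counts = Counter(tokens)

# Lowercasing first merges them into one key.
normalized_counts = Counter(t.lower() for t in tokens)

print(raw_counts["the"])         # counts only the lowercase spelling
print(normalized_counts["the"])  # counts "The" and "the" together
```

The second count is higher because both spellings land on the same key.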

What if we wanna do even more?


For example, we can strip the affixes from words in a process called stemming. In the word “preprocessing”, there’s a prefix “pre-” and a suffix “-ing”, and stripping both leaves the stem “process”.

NLTK has several stemmers. You can make your own using regular expressions, but the NLTK stemmers handle many irregular cases.
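To get a feel for the do-it-yourself route, here is a minimal regex stemmer sketch. The suffix list and the name simple_stem are my own assumptions for illustration, and it only strips suffixes, not prefixes:

```python
import re

# Hypothetical suffix list for illustration; real stemmers use many more rules.
SUFFIX_PATTERN = re.compile(r"^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$")

def simple_stem(word):
    """Strip one common English suffix from a word, if present."""
    # The non-greedy first group keeps the stem as short as possible,
    # so the optional suffix group gets a chance to match.
    match = SUFFIX_PATTERN.match(word)
    return match.group(1)

words = ["processing", "walked", "quickly", "cats", "the"]
stems = [simple_stem(w) for w in words]
print(stems)

# An irregular case: the naive rule also chops the "s" off "is".
print(simple_stem("is"))
```

The last line shows why a hand-rolled pattern falls over quickly: it has no idea that “is” should be left alone.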

Two of NLTK’s stemmers are Porter and Lancaster:

lancaster = nltk.LancasterStemmer()
[lancaster.stem(t) for t in tokens]
# ['den'…

Input Methods, String & Unicode, Regular Expression Use Cases

NLTK has preprocessed texts. But we can also import and process our own texts.


from __future__ import division 
import nltk, re, pprint

To Import a Book as a Txt

There is nothing to install: urlopen is part of urllib.request in Python’s standard library.


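import urllib.request
url = ""
# read() returns bytes in Python 3, so decode to get a str
raw = urllib.request.urlopen(url).read().decode('utf8')
type(raw)  # <class 'str'>
len(raw)   # 1176831
raw[:75]   # 'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'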


tokens = nltk.word_tokenize(raw)
type(tokens)  # <class 'list'>
len(tokens)   # 255809
tokens[:10]   # ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']

Textization, or just turning the tokens into NLTK’s Text object so we can run things like collocations:

text = nltk.Text(tokens)
type(text)  # <class 'nltk.text.Text'>

Grammars, Derivation, Expressiveness, Chomsky Hierarchy

Previously, we talked about how languages are studied using the notion of a formal language. A formal language is a mathematical construction that uses sets to describe a language and understand its properties.

We introduced the notion of a string, which is a word or sequence of characters, symbols or letters. Then we formally defined the alphabet, which is a set of symbols. The alphabet goes hand in hand with the language because we define a formal language as a set of strings over a given alphabet.
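These definitions map onto plain Python sets quite directly. Below is a small sketch, assuming a two-symbol alphabet and a made-up language (strings starting with “a”). The helper names strings_up_to and is_over are hypothetical:

```python
from itertools import product

# Hypothetical two-symbol alphabet, chosen only for illustration.
alphabet = {"a", "b"}

def strings_up_to(alphabet, max_length):
    """Return every string over `alphabet` with length 0..max_length."""
    symbols = sorted(alphabet)
    result = []
    for length in range(max_length + 1):
        # product(..., repeat=0) yields one empty tuple: the empty string.
        for combo in product(symbols, repeat=length):
            result.append("".join(combo))
    return result

def is_over(string, alphabet):
    """Check that every symbol of `string` comes from `alphabet`."""
    return all(symbol in alphabet for symbol in string)

universe = strings_up_to(alphabet, 2)
# One possible language: the strings in `universe` that start with "a".
language = {s for s in universe if s.startswith("a")}

print(universe)
print(sorted(language))
print(is_over("abba", alphabet), is_over("abc", alphabet))
```

The language here is just a subset of all the strings we can build from the alphabet, which is the whole idea of the definition.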

Then we explored some operations on the string.

Then we explored some operations…

Frequency Distributions, Wordlists, WordNet, Semantic Similarity

Work in Natural Language Processing typically uses large bodies of linguistic data. In this article, we explore some lexical resources that help us ingest and analyze corpora. These resources are part of Python or the NLTK library.

Getting NLTK Corpora

We can access pre-imported corpora in NLTK in one of 2 ways:

import nltk
emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))

or like this:

from nltk.corpus import gutenberg
emma = gutenberg.words('austen-emma.txt')

Language Statistics

We can write a quick little script to display a bunch of standard language statistics like average word length, average sentence length, and lexical diversity.
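Here is one way such a script could look, as a sketch. It assumes words is a flat list of tokens and sents is a list of token lists, and it takes lexical diversity to mean distinct words divided by total words (definitions vary). The function name language_stats and the sample sentences are made up:

```python
def language_stats(words, sents):
    """Return average word length, average sentence length, and lexical diversity."""
    num_chars = sum(len(w) for w in words)
    num_words = len(words)
    num_sents = len(sents)
    # Lowercase before building the vocabulary so "The" and "the" count once.
    vocab = {w.lower() for w in words}
    return {
        "avg_word_length": num_chars / num_words,
        "avg_sentence_length": num_words / num_sents,
        # Assumption: diversity = distinct words / total words.
        "lexical_diversity": len(vocab) / num_words,
    }

# Hypothetical sample data, for illustration only.
sents = [["The", "cat", "sat"], ["The", "dog", "sat", "down"]]
words = [w for sent in sents for w in sent]

stats = language_stats(words, sents)
for name, value in stats.items():
    print(name, round(value, 2))
```

The same function should work on any corpus that can be turned into a word list and a sentence list.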

It turns out that average word length is a universal attribute of…

#WEEK3DAY5: Adobe Libraries, Linking from Illustrator or Photoshop

Okay so I’m gonna review After Effects a bit. Adobe’s very own tutorials are actually pretty good. Nice pace and nicely divided into small videos, with notes.

Highly recommend.


  • Create a new composition, make sure the frame height and width make sense
  • Organize your assets into a folder in Windows, and organize from within After Effects by Video, Photo, Vector etc…
  • Note that After Effects doesn’t copy files into the project, it references them from the Windows folder. So don’t move or delete them, just organize that stuff before the project starts
  • Compositions have layers, layer dragging up and down…


Design composition I think is a lot like photography composition.

  • Rule of thirds
  • Diagonals
  • Different patterns like golden ratio
  • Leading lines — use the natural lines to draw the eyes to the subject
  • Negative space — have more negative space, busy photos are not great
  • Colour (make the subject stand out)
  • Focus
  • Content
  • Implied story
  • Style

Okay so let’s try to put everything into a remixed graphic.

I wanna mix a statue with a photo. And add these other things:

  • Background with lots of negative space
  • Frame with typography creating leading lines
  • 2D and 3D
  • Consistent lighting
  • Splash of colour…


I was watching this video on 2021 design trends. Illustration, or at least certain types of illustration, was in trend. What was even more reassuring was that what was in trend seemed to be simpler things to draw.

It seemed doable.

And basically it’s very clear how real designers and digital artists do things these days: Adobe Illustrator and ProCreate.

So I thought I could learn ProCreate, then learn how to draw, then learn some Illustrator.

So two apps and these illustration related topics:

  • Illustrate people or characters
  • Full page drawings

I looked up YouTube and SkillShare courses. Found these:


Why graphic design? Because basically everything I do could benefit from amazing design skills!

From video to websites and apps, making things look good is an essential skill for anyone doing any kind of creative and visual work.


Jake Batsuuri

I write about software && math
