# Summary of the Previous Article

• Language
• Strings
• Symbols
• Sets
• Operations
• Properties
• Rules
• Classes of Languages

Previously, we discussed how languages are studied using the notion of a formal language: a mathematical construction that uses sets to describe a language and understand its properties.

We introduced the notion of a string, which is a word or sequence of characters, symbols or letters. Then we formally defined the alphabet, which is a set of symbols. The alphabet goes hand in hand with the language, because we define a formal language as a set of strings over a given alphabet.

Then we explored some operations on strings.

Then we explored some operations on languages.

Exploring these operations naturally led us to the properties of languages and of the languages those operations produce.

The last two topics of the article were regular expressions and finite-state automata: computational formalisms that help us express or generate regular languages.

This article will continue to explore finite-state automata and discover how these formalisms process natural language.

# Deterministic Systems

A deterministic system is a system in which there is no random process in the transition from state to state.

If you try to recall the definition of finite-state automata, you will realize that an FSA is an information-generating machine. Moreover, this information generation can be either deterministic or non-deterministic.

Here we further formalize how to check if an FSA is deterministic or not.

DEFINITION 1. A finite-state automaton is a five-tuple (Q, q₀, ∑, δ, F) where:

• ∑ is a finite set of alphabet symbols
• Q is a finite set of states
• q₀ ∈ Q is the initial state
• F ⊆ Q is the set of final states, a subset of Q
• δ ⊆ Q × ∑ × Q is a relation from states and alphabet symbols to states

GRAPHICAL REPRESENTATION. Finite-state automata are depicted as graphs. We have nodes and links; each node represents a state, and each link represents a letter from the alphabet.

If states, relations and the alphabet define an FSA, then we check whether it is deterministic using the following.

DEFINITION 2. A finite-state automaton is a five-tuple (Q, q₀, ∑, δ, F), and it is deterministic if and only if it has no ϵ-transition moves and δ is a function from Q × ∑ to Q.

Remember that relations need not be functional, so the general definition of the FSA is relational. A deterministic FSA, however, must have a functional transition relation.

Remember that we extended our FSA definition to include ϵ-moves by saying that the alphabet is now ∑∪{ϵ}. So a deterministic FSA also cannot have any ϵ-moves.
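The two conditions of Definition 2 can be checked mechanically. The sketch below assumes an FSA whose relation δ is encoded as a set of (state, symbol, state) triples, with the empty string standing in for ϵ; the encoding and names are illustrative, not from the article.

```python
def is_deterministic(triples, eps=""):
    """Definition 2 as code: no ϵ-moves, and δ is a function,
    i.e. each (state, symbol) pair has at most one successor."""
    if any(a == eps for (_, a, _) in triples):
        return False
    pairs = [(q, a) for (q, a, _) in triples]
    return len(pairs) == len(set(pairs))

nfa = {("q0", "a", "q0"), ("q0", "a", "q1")}   # two ways out of q0 on "a"
dfa = {("q0", "a", "q1"), ("q1", "a", "q1")}   # one way out per (state, symbol)
print(is_deterministic(nfa), is_deterministic(dfa))   # False True
```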

FSA can be used to check membership of a string in a language in linear time, which is fantastic. However, when an FSA is non-deterministic, checking membership may require exploring several paths, so the running time is no longer guaranteed to be linear.
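The linear-time membership check can be sketched as follows, assuming a deterministic FSA whose δ is encoded as a dict (the state names and the example language are illustrative):

```python
def accepts(delta, q0, finals, word):
    """Run a deterministic FSA over `word` in time linear in len(word).

    delta -- dict mapping (state, symbol) -> state; because each pair has
             exactly one successor, the FSA is deterministic.
    """
    state = q0
    for symbol in word:
        if (state, symbol) not in delta:   # no transition: reject early
            return False
        state = delta[(state, symbol)]
    return state in finals

# Example: strings over {a, b} that end in "ab".
delta = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q1", ("q1", "b"): "q2",
    ("q2", "a"): "q1", ("q2", "b"): "q0",
}
print(accepts(delta, "q0", {"q2"}, "aab"))   # True
print(accepts(delta, "q0", {"q2"}, "aba"))   # False
```

Note that the loop touches each letter of the word exactly once, regardless of how many states the automaton has.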

# Non Deterministic Systems

These systems are random in a sense: a single state can split off into several directions on the same input symbol, so the next state is not uniquely determined.

Sometimes it is easier to construct non-deterministic FSA, but we often want to work with deterministic ones. Lucky for us, some beneficial FSA properties can get us there.

DEFINITION 3. Two FSAs, A₁ and A₂, are equivalent if and only if L(A₁) = L(A₂).

It turns out that you can construct many different types of FSA. Furthermore, you can equate them to each other by the languages they generate.

The above definition, coupled with the fact that every non-deterministic FSA is equivalent to some deterministic one, guarantees that we can first construct a non-deterministic FSA and then find an efficient deterministic FSA for actual use.

We call these steps determinization and minimization. The following theorems formalize them.

THEOREM 1. For every FSA A (with n states), there exists a deterministic FSA A’ (with at most 2ⁿ states) such that L(A) = L(A’)

This theorem guarantees determinization and provides a helpful upper bound as well.
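The construction behind Theorem 1 is the subset construction: each state of the deterministic automaton is a *set* of original states, which is where the 2ⁿ bound comes from. A sketch, assuming an ϵ-free non-deterministic FSA encoded as (state, symbol, state) triples (names and the example are illustrative):

```python
def determinize(triples, q0, finals, alphabet):
    """Subset construction: deterministic transitions over frozensets of states."""
    start = frozenset({q0})
    delta, todo, seen = {}, [start], {start}
    while todo:
        S = todo.pop()
        for a in alphabet:
            # All states reachable from any state in S on symbol a.
            T = frozenset(r for (q, sym, r) in triples if q in S and sym == a)
            if not T:
                continue
            delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    # A subset state is final if it contains an original final state.
    new_finals = {S for S in seen if S & finals}
    return delta, start, new_finals

# Non-deterministic FSA over {a, b} for strings containing "ab":
triples = {("0", "a", "0"), ("0", "b", "0"), ("0", "a", "1"),
           ("1", "b", "2"), ("2", "a", "2"), ("2", "b", "2")}
delta, start, new_finals = determinize(triples, "0", {"2"}, {"a", "b"})
print(delta[(start, "a")] == frozenset({"0", "1"}))   # True
```

Only reachable subsets are ever built, so in practice the result is usually far smaller than the 2ⁿ worst case.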

THEOREM 2. For every regular language L, there exists a minimal FSA A such that no other FSA A’ has fewer states than A when L(A) = L(A’). A is unique (up to isomorphism).

This theorem tells us that the minimal FSA exists and is unique, which makes minimization well-defined.

# Operations on FSA

We know that Finite State Automata are equivalent to Regular Expressions.

We know that Regular Expressions have operations on them, and sometimes they are closed under these operations.

Similarly, Finite State Automata have operations, and their closure is of interest to us.

Specifically, let us say there are two regular languages, L₁ and L₂, and automata A₁ and A₂ that accept them.

We know L₁ * L₂, the concatenation, is also a regular language, so some FSA must exist that accepts it. The question is: can we construct this FSA using A₁ and A₂?

Let A₁ be a finite-state automaton such that L(A₁) = L₁.

Let A₂ be a finite-state automaton such that L(A₂) = L₂.

We then describe an automaton A such that L(A) = L₁ * L₂. A word w is in L₁ * L₂ if and only if it can be broken down into two parts, w₁ and w₂, such that w = w₁w₂, w₁ ∈ L₁ and w₂ ∈ L₂.

This implies that there is an accepting path for w₁ in A₁, and there is an accepting path for w₂ in A₂.

If we allow ϵ-transitions from the final states of A₁ to the initial state of A₂, we can get accepting paths for the words of L₁ * L₂.

Remember the original definition of FSA.

DEFINITION 1. A finite-state automaton is a five-tuple (Q, q₀, ∑, δ, F) where:

• ∑ is a finite set of alphabet symbols
• Q is a finite set of states
• q₀ ∈ Q is the initial state
• F ⊆ Q is the set of final states, a subset of Q
• δ ⊆ Q × ∑ × Q is a relation from states and alphabet symbols to states

We modify this a bit now.

DEFINITION 4. A finite-state automaton A is a five-tuple (Q, q₀, ∑, δ, F) where:

• ∑ is a finite set of alphabet symbols, the union of ∑₁ and ∑₂
• Q is a finite set of states and is a union of Q₁ and Q₂
• q₀ ∈ Q is the initial state, specifically q₀ ∈ Q₁
• F ⊆ Q is the set of final states, a subset of Q, specifically F = F₂
• δ ⊆ Q × ∑ × Q is a relation from states and alphabet symbols to states; δ = δ₁ ∪ δ₂ ∪ {(qf, ϵ, q₀₂) | qf ∈ F₁}, where q₀₂ is the initial state of A₂
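Definition 4 can be turned into a small construction. The sketch below assumes each automaton is a dict with a triple-set δ and disjoint state names; the encoding and example automata are illustrative.

```python
def concat_fsa(A1, A2):
    """Build the concatenation automaton of Definition 4.

    Each automaton is a dict with keys Q, q0, sigma, delta, F, where delta
    is a set of (state, symbol, state) triples and the state sets of the
    two automata are disjoint. EPS marks an ϵ-transition.
    """
    EPS = ""
    bridge = {(qf, EPS, A2["q0"]) for qf in A1["F"]}   # ϵ-moves: F₁ -> q₀₂
    return {
        "Q": A1["Q"] | A2["Q"],
        "q0": A1["q0"],
        "sigma": A1["sigma"] | A2["sigma"],
        "delta": A1["delta"] | A2["delta"] | bridge,
        "F": A2["F"],
    }

# A1 accepts "a", A2 accepts "b"; the result accepts "ab".
A1 = {"Q": {"p0", "p1"}, "q0": "p0", "sigma": {"a"},
      "delta": {("p0", "a", "p1")}, "F": {"p1"}}
A2 = {"Q": {"r0", "r1"}, "q0": "r0", "sigma": {"b"},
      "delta": {("r0", "b", "r1")}, "F": {"r1"}}
A = concat_fsa(A1, A2)
print(("p1", "", "r0") in A["delta"])   # True: the bridging ϵ-move
```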

Remember that regular languages are known to be closed under many operations:

• Intersection
• Complementation
• Exponentiation
• Substitution
• Homomorphism

So it is not surprising that FSA are also closed under the same operations.

If L is a regular language over an alphabet ∑, then the complement of L is the set of all the words from ∑* that are not in L, denoted L̄ = ∑* ∖ L.

With a deterministic and complete FSA A = <Q, q₀, ∑, δ, F>, we denote the complement as Ā = <Q, q₀, ∑, δ, Q ∖ F>: the same automaton with final and non-final states swapped.

With the understanding that complementation is closed for regular languages, we can show that the intersection of two regular languages is also regular.
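A minimal sketch of complementation, assuming a complete deterministic FSA with a dict-encoded δ (names and the example language are illustrative):

```python
def complement(Q, q0, sigma, delta, F):
    """Complement a *complete, deterministic* FSA by swapping final and
    non-final states; delta (a total function dict) is left unchanged."""
    return Q, q0, sigma, delta, Q - F

# This complete DFA accepts strings over {a} of even length;
# its complement accepts strings of odd length.
Q, F = {"even", "odd"}, {"even"}
delta = {("even", "a"): "odd", ("odd", "a"): "even"}
_, _, _, _, F_bar = complement(Q, "even", {"a"}, delta, F)
print(F_bar)   # {'odd'}
```

With complement and union in hand, intersection follows from De Morgan's law: the intersection of two regular languages is the complement of the union of their complements.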

We primarily thought of the FSA as a computational device that generates a regular language. However, if we have a word, we can use the FSA to detect the word’s membership in the regular language.

This process takes time linear in the length of the word, regardless of how big the FSA is, which is very attractive. This simple use case is called a dictionary lookup.

It turns out that the morphology of most words is a simple concatenation of affixes. Since regular languages are closed under concatenation, this is easy to implement with FSA.

Also, the closure properties of FSA help us build models in a modular way.

# Regular Relations

A regular relation is a relation over strings from two alphabets, which can be identical or different. Its elements are pairs (w₁ over ∑₁, w₂ over ∑₂), much like elements of a Cartesian product.

Part of speech is a classification of words into groups, based on their function:

• Verb
• Noun
• Pronoun
• Preposition
• Conjunction
• Interjection

So let us create an alphabet based on the part-of-speech categories: ∑₂ = {Verb, Noun, Pronoun, Preposition, Conjunction, Interjection}.

Now consider the English alphabet: ∑₁ = {a, b, c, …, x, y, z}.

Now consider a sequence of strings over ∑₁: “I know some new tricks said the Cat in the Hat.”

If we map each string to an element of ∑₂, we get a sequence of part-of-speech tags, one tag per word.

This mapping is called a part of speech tagger. You can see how this might be useful for language processing.
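At its simplest, a tagger is a lookup from strings over ∑₁ to letters of ∑₂. The sketch below is a toy illustration with a hand-written lexicon; real taggers are statistical, and the word list and names here are assumptions, not real data.

```python
# A toy part-of-speech tagger: a hand-written lookup from words to the
# POS categories listed above. The lexicon is illustrative only.
LEXICON = {
    "i": "Pronoun", "know": "Verb", "tricks": "Noun", "said": "Verb",
    "cat": "Noun", "in": "Preposition", "hat": "Noun",
}

def tag(sentence):
    """Pair each word with its tag, or 'Unknown' for out-of-lexicon words."""
    return [(w, LEXICON.get(w.lower(), "Unknown")) for w in sentence.split()]

print(tag("the Cat in the Hat"))
```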

You can play with an online version of this tool.

However, this maps strings over ∑₁ to single letters from ∑₂. How can we map from string to string? Moreover, what can we gain from that?

We now define ∑₄, which comes from combining ∑₂ and ∑₃, the part-of-speech and morphological tags respectively. Furthermore, these tag letters are always appended as suffixes.

We are now mapping to full words over ∑₄, which is more useful since it gives us more information about the words over ∑₁.

This would be fairly straightforward if all English plurals were regular, that is, if we consistently just added “s”. However, sometimes we get irregular mappings, like “goose” becoming “geese.”

# Finite-State Transducers

If FSA is a computational device that defines regular languages, then FST is a computational device that defines regular relations.

DEFINITION 1. A finite-state automaton is a five-tuple (Q, q₀, ∑, δ, F) where:

• ∑ is a finite set of alphabet symbols
• Q is a finite set of states
• q₀ ∈ Q is the initial state
• F ⊆ Q is the set of final states, a subset of Q
• δ ⊆ Q × ∑ × Q is a relation from states and alphabet symbols to states

DEFINITION 5. A finite-state transducer is a six-tuple (Q, q₀, ∑₁, ∑₂, δ, F) where:

• ∑₁ and ∑₂ are two alphabets
• Q is a finite set of states
• q₀ ∈ Q is the initial state
• F ⊆ Q is the set of final states, a subset of Q
• δ ⊆ Q × ∑₁ × ∑₂ × Q is a relation from states and pairs of alphabet symbols to states

Now if we are mapping from singular to plural, we are going from English to English. Those alphabets are ∑₁ = ∑₂ = {a,b,c…x,y,z}.

You can think of each edge as a dictionary pair data structure, where the left-side symbol is the input and the right-side symbol is the output.

In this FST, we have an input singular of “goose” and an output plural of “geese.”

We see that this works great when singular and plural are the same length. So how does it work with a “dog” and “dogs” pair? To handle this situation, we extend the definition of the relation to include ϵ-moves.

DEFINITION 6. A finite-state transducer is a six-tuple (Q, q₀, ∑₁, ∑₂, δ, F) where:

• ∑₁ and ∑₂ are two alphabets
• Q is a finite set of states
• q₀ ∈ Q is the initial state
• F ⊆ Q is the set of final states, a subset of Q
• δ ⊆ Q × ∑₁∪{ϵ} × ∑₂∪{ϵ} × Q is a relation from states and pairs of (possibly empty) alphabet symbols to states

The ϵ-moves can happen on either side, and several may occur in a row.
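Definition 6 can be sketched in code. The encoding below (arcs as (state, input, output, state) tuples, with the empty string as ϵ) and the pluralization example are illustrative assumptions, not the article's construction.

```python
EPS = ""   # ϵ: the empty string on either tape

def transduce(arcs, q0, finals, word):
    """Collect the outputs of every accepting path of an FST over `word`.

    arcs -- set of (state, in_symbol, out_symbol, state); either symbol
            may be EPS, matching Definition 6.
    """
    results = set()

    def walk(state, i, out):
        if i == len(word) and state in finals:
            results.add(out)
        for (q, a, b, r) in arcs:
            if q != state:
                continue
            if a == EPS:                          # consume no input, emit b
                walk(r, i, out + b)
            elif i < len(word) and word[i] == a:  # consume a, emit b
                walk(r, i + 1, out + b)

    walk(q0, 0, "")
    return results

# Identity arcs copy each letter; a final ϵ-arc appends the plural "s",
# handling the length mismatch between "dog" and "dogs".
arcs = {(0, c, c, 0) for c in "abcdefghijklmnopqrstuvwxyz"} | {(0, EPS, "s", 1)}
print(transduce(arcs, 0, {1}, "dog"))   # {'dogs'}
```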

# Regular Relations and their Properties

Consider the relation δ ⊆ Q × ∑₁∪{ϵ} × ∑₂∪{ϵ} × Q; if one of the alphabets is taken away, we are left with an FSA.

We call these the projections of T onto ∑₁ or ∑₂. Since each projection yields an FSA, and an FSA defines a regular language, we say the relations an FST defines are regular.

Unlike regular languages, however, regular relations are closed under neither intersection nor complementation.

Composition is a new type of operation that works on transducers. The intuition behind a single FST is that it relates one word to another. Now, what would happen if we chained transducers together?

If one transducer maps from ∑₁ to ∑₂ and another maps from ∑₂ to ∑₃, then, as long as the inner languages are the same, the composition maps from ∑₁ directly to ∑₃, bypassing the middle language. So something like English to French composed with French to Spanish results in English to Spanish.
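A toy sketch of composition, using word-level relations encoded as dicts; the tiny “translation” tables are illustrative assumptions, not real data.

```python
def compose(t1, t2):
    """Compose two word-level relations given as dicts: apply t1, then t2.

    The middle language (t1's outputs, t2's inputs) disappears from the
    result, just as in the English -> French -> Spanish example.
    """
    return {w: t2[v] for w, v in t1.items() if v in t2}

# Illustrative toy dictionaries:
en_to_fr = {"dog": "chien", "cat": "chat"}
fr_to_es = {"chien": "perro", "chat": "gato"}
en_to_es = compose(en_to_fr, fr_to_es)
print(en_to_es)   # {'dog': 'perro', 'cat': 'gato'}
```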

# Other Articles

This post is part of a series of stories that explores the fundamentals of natural language processing:

1. Context of Natural Language Processing: Motivations, Disciplines, Approaches, Outlook
2. Notes on Formal Language Theory: Objects, Operations, Regular Expressions and Finite State Automata
3. Natural Language Tool Kit 3.5: Search Functions, Statistics, Pronoun Resolution
4. What Are Regular Languages? Minimization, Finite State Transducers, Regular Relations

# Up Next…

In the next article, we will explore context-free languages, as this article marks the end of our coverage of regular languages.


I write about software && math