I know what a Markov chain is, and I still have a very hard time decoding the terminology and notation above. Hopefully this writeup will help those who don't yet have a Ph.D. in Mathematics understand what a Markov chain is and how it works.
Named after the Russian mathematician Andrey Markov, a Markov chain is basically a structure in which the future state depends only on the current state - past states are not considered when deciding what to do next. The next state is chosen according to probability.
While Markov chains have uses in physics, statistics, economics, biology, gambling, sports, and more, this writeup will deal with the more fun aspect - generating (semi-) meaningful/coherent sentences from existing text. In any case, the concepts are the same: Forget where you were. You are here now. Decide where to go next.
One of the most famous implementations of a Markov chain used to generate random text is Mark V. Shaney (say it out loud, and run the words together, and it sounds similar to "Markov Chainy"). Mark V. Shaney was a Usenet "user" whose posts were generated through an algorithm. Because no one on Usenet had ever seen anything like this, many people were led to believe that these submissions were just posted by a real person with an unusual writing style.
How do they work?
The basic algorithm for generating text with Markov chains is actually quite simple; you can even do it on paper - no computers required:
- Start with some text input. It can be a single sentence, a nursery rhyme, a book, the entire works of Shakespeare - pretty much anything.
- Read in each word, and keep track of what word comes after it. Since words can be repeated, a single word may end up with several followers in its list. For example, if your input was "the boy went to the store" you would have:
- the: boy, store (both "boy" and "store" can follow the word "the")
- boy: went
- went: to
- to: the
- After you are finished making your list of words, you can output your own generated text. Pick a starting word, and randomly pick one of the words from the list that follows that word. Now with your randomly chosen word, pick a word from the list that follows it. Repeat until you feel like quitting (a rough code sketch of these steps follows this list).
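If you would rather let a computer do the bookkeeping, here is a minimal sketch of those steps. This is just my own rough Python - none of the names come from any particular library, and a real implementation would do more (punctuation, capitalization, and so on):

```python
import random

def build_table(text):
    """Map each word to the list of words that follow it (duplicates kept,
    so common followers are more likely to be picked)."""
    words = text.split()
    table = {}
    for current, following in zip(words, words[1:]):
        table.setdefault(current, []).append(following)
    return table

def generate(table, start, length=10):
    """Walk the chain: from the current word, randomly pick one of its followers."""
    word, output = start, [start]
    for _ in range(length - 1):
        followers = table.get(word)
        if not followers:          # dead end: we hit the input's final word
            break
        word = random.choice(followers)
        output.append(word)
    return " ".join(output)

table = build_table("the boy went to the store")
print(generate(table, "the"))      # e.g. "the boy went to the store"
```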
Using the terminology from above, the word you are on is the current "state", and you are randomly choosing the future state by picking a valid word from the list. This simplistic algorithm can be fun to play with, but its grammar will be rather poor. To generate better output, instead of just keeping track of one word for the current state, you keep track of two words. In other words, instead of making a list of all the individual words in your input, you keep track of all the word pairs, and all the words that can follow each pair. This is called an n-tier algorithm (where in this case, n is 2). You can keep track of as many tiers as you want, but the higher the number, the more likely your generated output will be exactly the same as your input. For this reason, 2 or 3 tiers are typically the highest a text generator will go.
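In code, going from 1-tier to 2-tier only really changes the key: you look things up by a pair of words instead of a single word. Again, a rough sketch along the same lines as the one above, not anyone's production code:

```python
import random

def build_pair_table(text):
    """Map each pair of consecutive words to the words that follow that pair."""
    words = text.split()
    table = {}
    for a, b, following in zip(words, words[1:], words[2:]):
        table.setdefault((a, b), []).append(following)
    return table

def generate_pairs(table, start_pair, length=10):
    """Walk the chain, sliding the two-word window forward one word at a time."""
    a, b = start_pair
    output = [a, b]
    for _ in range(length - 2):
        followers = table.get((a, b))
        if not followers:              # dead end: the input's final pair
            break
        nxt = random.choice(followers)
        output.append(nxt)
        a, b = b, nxt                  # the new pair is (old second word, new word)
    return " ".join(output)

table = build_pair_table("the boy went to the store")
print(generate_pairs(table, ("the", "boy")))
```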
Okay, I think I understand, now give me an example
(or, I don't understand this at all, give me an example)
1-tier
Alright. Let's start with a 1-tier example using Genesis 1:1-5. For our word list we are going to ignore all punctuation.
In the beginning God created the heavens and the earth.
And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.
And God said, Let there be light: and there was light.
And God saw the light, that it was good: and God divided the light from the darkness.
And God called the light Day, and the darkness he called Night. And the evening and the morning were the first day.
On the left are all the words used in the input, and on the right are all the words that follow each of the words. Notice that we include duplicate words - since the input uses these words multiple times, they should have a better chance of occurring in the output. I have also alphabetized the list, but that is not necessary for the algorithm (it just makes it easier to keep track of your words when you are doing this on paper instead of with a computer).
and: | the, the, void, darkness, the, God, there, God, God, God, the, the, the |
be: | light |
beginning: | God |
called: | the, night |
created: | the |
darkness: | was, and, he |
day: | and |
deep: | and |
divided: | the |
earth: | and, was |
evening: | and |
face: | of, of |
first: | day |
form: | and |
from: | the |
God: | created, moved, said, saw, divided, called |
good: | and |
he: | called |
heavens: | and |
in: | the |
it: | was |
let: | there |
light: | and, and, that, from, day |
of: | the, God, the |
morning: | were |
moved: | upon |
night: | and |
said: | let |
saw: | the |
spirit: | of |
that: | it |
the: | beginning, heavens, earth, earth, face, deep, spirit, face, waters, light, light, darkness, light, darkness, evening, morning, first |
there: | be, was |
upon: | the, the |
void: | and |
was: | without, upon, light, good |
waters: | and |
were: | the |
without: | form |
Now let's pick our words to generate a sentence. Since the input starts with "In" that's what I will start with. When using a computer, you would use a random number generator for choosing the next word, but I'm just going to use a counter to pick what word comes next. I'm going to start at 1, then increase by one after I pick each word. In other words, I'll look at the word "In" and pick the first word in the list (the); then I'll look at "the" and pick the second word in the list (heavens), then I'll pick the 3rd word for the next one, and so on. If there aren't enough words in a list, I'll loop back to the first word and keep counting.
This gives the output:
In the heavens and darkness and God created the waters and the darkness was upon the morning
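To double-check the paper exercise by machine, here is a rough sketch of the same counter-based walk (again, just my own illustration). The only changes from the earlier sketch are that punctuation is stripped and everything is lowercased before building the lists; with the follower lists built in input order, it should land on the same 17 words as above, just in lowercase:

```python
import re

GENESIS = (
    "In the beginning God created the heavens and the earth. "
    "And the earth was without form, and void; and darkness was upon the face "
    "of the deep. And the Spirit of God moved upon the face of the waters. "
    "And God said, Let there be light: and there was light. "
    "And God saw the light, that it was good: and God divided the light from "
    "the darkness. And God called the light Day, and the darkness he called "
    "Night. And the evening and the morning were the first day."
)

def build_table(text):
    """Lowercase the text, strip punctuation, and map each word to its followers."""
    words = re.sub(r"[^\w\s]", "", text).lower().split()
    table = {}
    for current, following in zip(words, words[1:]):
        table.setdefault(current, []).append(following)
    return table

def generate_with_counter(table, start, length):
    """Instead of choosing randomly, pick follower 1, 2, 3, ... in turn,
    wrapping around whenever a follower list is shorter than the counter."""
    word, output, counter = start, [start], 1
    while len(output) < length:
        followers = table.get(word)
        if not followers:                          # dead end
            break
        word = followers[(counter - 1) % len(followers)]
        output.append(word)
        counter += 1
    return " ".join(output)

# Should print the same 17-word sentence as the 1-tier example above (all lowercase).
print(generate_with_counter(build_table(GENESIS), "in", 17))
```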
Not too bad, but it doesn't make much sense. Let's try 2-tier...
2-tier
Using the same input, but tracking pairs of words instead of single words, we get:
and darkness: | was |
and God: | said, saw, divided, called |
and the: | earth, earth, spirit, darkness, evening, morning |
and there: | was |
and void: | and |
be light: | and |
beginning God: | created |
called night: | and |
called the: | light |
created the: | heavens |
darkness and: | God |
darkness he: | called |
darkness was: | upon |
day and: | the |
deep and: | the |
divided the: | light |
earth and: | the |
earth was: | without |
evening and: | the |
face of: | the, the |
form and: | void |
from the: | darkness |
God called: | the |
God created: | the |
God divided: | the |
God moved: | upon |
God said: | let |
God saw: | the |
good and: | God |
he called: | night |
heavens and: | the |
in the: | beginning |
it was: | good |
let there: | be |
light and: | there, God |
light day: | and |
light from: | the |
light that: | it |
morning were: | the |
moved upon: | the |
night and: | the |
of God: | moved |
of the: | deep, waters |
said let: | there |
saw the: | light |
spirit of: | God |
that it: | was |
the beginning: | God |
the darkness: | and, he |
the deep: | and |
the earth: | and, was |
the evening: | and |
the face: | of, of |
the first: | day |
the heavens: | and |
the light: | that, from, day |
the morning: | were |
the spirit: | of |
the waters: | and |
there be: | light |
there was: | light |
upon the: | face, face |
void and: | darkness |
was good: | and |
was light: | and |
was upon: | the |
was without: | form |
waters and: | God |
were the: | first |
without form: | and |
In this case, we are going to start with "in the", and follow the same word-choosing process as above:
In the beginning God created the heavens and the spirit of God moved upon the face of the deep
That's a bit better!
Punctuation
In the examples above, we have ignored punctuation; however, to generate meaningful output that is more than one sentence long, it is a necessity. Believe it or not, we don't have to change the algorithm in any way to get decent punctuation handling (other than NOT ignoring the punctuation) - we just need to consider punctuation as part of the word. In other words, in the example above, the first verse ends in "earth." (with the period). We just handle that as a separate word from "earth" (without the period) in the second verse. As I said, this gives you "decent" punctuation handling - not great. If you try to do better, the algorithm gets much more complicated - especially for punctuation that has to be opened and closed, like parentheses, quotation marks, brackets, etc. Basically, you need to keep track of what punctuation you have already seen and use that information to help you decide whether the words in your word list are a good choice. If not, you need to decrease their likelihood of being chosen (typically by temporarily removing them from the list of choices).
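Here is one rough sketch of that bookkeeping for parentheses and square brackets (a hypothetical helper of my own - E2D2's actual handling is more involved, and quotation marks, which open and close with the same character, are ignored here):

```python
PAIRS = {"(": ")", "[": "]"}          # opener -> expected closer
CLOSERS = set(PAIRS.values())

def open_brackets(text):
    """Return the closers the text still owes, or None if the text
    closes something that was never opened."""
    stack = []
    for ch in text:
        if ch in PAIRS:
            stack.append(PAIRS[ch])
        elif ch in CLOSERS:
            if stack and stack[-1] == ch:
                stack.pop()
            else:
                return None
    return stack

def filter_candidates(generated, candidates):
    """Temporarily drop candidate words whose punctuation would unbalance
    the output so far; if that rules out every candidate, fall back to the
    full list rather than getting stuck."""
    text = " ".join(generated)
    allowed = [w for w in candidates
               if open_brackets(text + " " + w) is not None]
    return allowed or candidates
```

In the generator sketches above, you would call filter_candidates on the follower list right before making the random choice.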
A Real World Example: E2D2
E2D2 uses 2-tier Markov chains extensively for generating things to say. While he does a lot to handle punctuation, he is especially strict (though still not perfect) with the use of square brackets to handle hard links. He uses two forms of input: an archive of the chatterbox conversation, and noder writeups.
When you say: e2d2 tell me about everything2, he will find all the chatterbox messages that mention "everything2", combine them all together, and generate a Markov chain based on all these messages. Sometimes the output will include the word "everything2", and sometimes it won't, simply due to how the words are chosen.
When you say: e2d2 tell me about [everything2], he will download all the writeups in the Everything2 node, remove all the HTML (and also remove the text from certain tags that tend to cause problems), combine them all together, and generate a Markov chain based on the writeups.
Similarly, when you say: e2d2 process my writeups, he will download all of your writeups, remove all the HTML, combine them all together, and generate a Markov chain based on the writeups.
Some Markov chain programs keep a database of their word lists; however, with a properly coded algorithm, this is completely unnecessary. E2D2 does all his generating in memory. If you ignore the time it takes to download the user search and individual writeups, E2D2 can typically generate 10 random paragraphs from someone's writeups in less than 1 second. Even The Custodian, who has over 1600 writeups containing over 1 million words, can be processed in about 2 seconds. Usually the delay you see when you ask him something is caused either by having to download the content, or simply by the fact that he only refreshes the catbox messages every 10 seconds.
Additional real world examples include:
- Donginger
- Mark V. Shaney
- The Book of Markov
- Dissociated press
- computer-generated poetry
- Jung-Markov Collective Dream Constructor
- The poetry of Jeff Harrison
- The following generated writeups by E2D2: May 19, 2009, May 22, 2009, May 28, 2009, June 2, 2009, June 4, 2009, June 8, 2009, and June 11, 2009
- Spam email quite often uses Markov chains to help it get past your spam filters