MarkovLab: Markov Chain Monte Carlo Text Generation
The Algorithm
(This is a synopsis of a discussion in Chapter 3: What to
Measure, of
A Guide to Experimental Algorithmics, referred to here as the
Guide.) The Markov Chain Monte Carlo (MCMC) Text Generation algorithm takes the following
parameters as input:
T | Sample input text, containing n words. |
m | Number of output words to generate. |
k | Parameter controlling the phrase size; a small integer. |
The algorithm generates a ``real-looking''
random text of m words, according to
word frequencies calculated from T. Parameter k is a small
integer specifying the phrase size, explained below.
The algorithm starts by building a table of all k-word
phrases in the text. For example, suppose k=2 and the text is
This is a test, this is only a test. This is a test of the emergency
broadcasting system.
In this text the phrases this is and a test
each appear 3 times, is a appears 2 times, and so forth.
Once the table is built, the algorithm sets a string variable
phrase equal to
the first k words in the text and prints them. It then generates
the next m-k words as follows:
- Find all copies of phrase in the table. For example
this is has three copies in the table.
- Select one of the copies uniformly at random. Print the successor
word that comes after that copy of the phrase in the text. For example,
the three successor words for this is are:
a, only, and a. With probability 2/3, a
is printed next, and with probability 1/3, only is printed next.
- Remove the first word from phrase and append the selected successor
word. For example with probability 2/3 the new phrase is is a.
- Return to the first step and repeat until m words have been printed.
Modeling computation time. Chapter 3 surveys several choices of
performance indicators for measuring the time performance of
this algorithm, including the number of word comparisons; the number of
character comparisons; the number of instructions executed in the program;
profiling information from gprof; CPU times; and wall-clock times.
A performance model is a function that is
built to predict performance in terms of the parameters
T, n, m, and k. The accuracy and precision
of the performance model depend on the choice of performance indicator.
Resources and Links
In AlgLab
- markov.c. A C implementation by Jon Bentley
(with the interface slightly modified by C. McGeoch) that implements the lookup table
with a sorted array.
External Resources
- The MCMC algorithm is described in
Section 15.1 of Programming Pearls, by Jon Bentley
(Addison-Wesley 2000): here is a
link to the companion website. The discussion covers implementation and performance
issues for strings.
The site contains implementations of: markov.c,
the original C program; markovhash.c, a C program implementing the lookup step with
a hash table; markovlet.c, a C program that generates text at the character
level instead
of the word level; and C++ and C tools for counting and listing phrases in text files.
- The algorithm is also described in Chapter 3 of The Practice of Programming,
by Brian W. Kernighan and Rob Pike (Addison-Wesley 1999). The chapter considers issues of
data structure design and choice of programming language. Here is a link to
source code from the book,
which contains implementations of MCMC in C, Java, C++, Awk, and Perl. There are also
two sample text files (King James Bible and the Book of Psalms) for testing the code.
- The text files mentioned in the Guide were
downloaded from Project Gutenberg, which
creates online versions of copyright-free texts. Remove the Project Gutenberg preamble
before testing your MCMC implementations.
- ``Markov Chain Monte Carlo'' is a general algorithm paradigm that has many applications
besides text generation.
The Wikipedia Page on
Markov Chains and their applications includes a discussion of text generation. (Accessed
7/18/2011).
- Here is a link to Valgrind.
Projects
Here are some suggestions for experimental projects using the MCMC algorithm.
- Build and evaluate new models. Here is the
C implementation of the algorithm, written by
Jon Bentley, that is described in the book.
- Run experiments to extend the word comparison model to cases where k>1.
- Add another term p(k,T,n,m) to the word-comparison or character-comparison
model that gives a tighter prediction of the cost of the Random Selection step. For
example, it might be based on the average (or maximum?) number of
phrase duplicates in T. You will have to write a program that measures
this property in T (preferably in O(n) time or less).
- Develop predictive models for non-English prose, poetry or
song lyrics, or files of source code.
- Use the procedure in Chapter 3 to develop models to predict
CPU time on your favorite operating system. How closely can you predict times for the
target experiment?
- Develop a character-comparison cost model that takes input
parameters n, m and k in units of characters
instead of words. How does it compare to the word-comparison and character-comparison models?
Evaluate the accuracy and precision of your models: first, using the same text but different
values of n,m,k within the range of your experiments; second, using values of
n,m,k outside the range of your experiments; third, using different text files T.
How do these variations affect the accuracy and precision of your model?
How does your choice of performance indicator affect the accuracy
and precision of your model? What would you do next to improve it?
- Evaluate performance indicators.
- Compare CPU times and elapsed times (clock ticks and instruction
cycles) on your favorite system. How much variation can you observe?
- Compare CPU times and elapsed times on two different systems. Can you map measurements
on one system to predictions on another?
- Use tools like gprof and Cachegrind to build a link between instruction
counts and CPU times. How much variation do you observe on different systems?
- Is the memory hierarchy a factor here? Does
time-per-instruction depend on the memory footprint of the program?
- Does the cost of
the Random Select operation depend on the fact that the caches were previously ``warmed up''
by binary search? Devise experiments to measure random selection times, with and without
binary search. Is there a difference?
- Compare implementations. Download an alternative implementation
of the MCMC algorithm (see the resource list above).
- Build new performance models for various performance indicators for
this implementation. Is one implementation easier to predict than another?
- Which performance indicators would be best for
comparing the sorted-array and hash-table implementations? Which performance indicators
would be best for tuning one of these implementations? Which performance indicators would
you use to compare implementations in different languages?