MarkovLab: Markov Chain Monte Carlo Text Generation
The Algorithm
(This is a synopsis of a discussion in Chapter 3: What to
Measure, of
A Guide to Experimental Algorithmics, referred to here as the
Guide.) The Markov Chain Monte Carlo (MCMC) Text Generation algorithm takes the following
parameters as input:
T | Sample input text, containing n words. |
m | Number of output words to generate. |
k | Parameter controlling the phrase size; a small integer. |
The algorithm generates a ``real-looking''
random text of m words, according to
word frequencies calculated from T. Parameter k is a small
integer specifying the phrase size, explained below.
The algorithm starts by building a table of all k-word
phrases in the text. For example, suppose k=2 and the text is
This is a test, this is only a test. This is a test of the emergency
broadcasting system.
In this text the phrases this is and a test
each appear 3 times, is a appears 2 times, and so forth.
Once the table is built, the algorithm sets a string variable
phrase equal to
the first k words in the text and prints them. It then generates
the next m-k words as follows:
- Find all copies of phrase in the table. For example
this is has three copies in the table.
- Select one of the copies uniformly at random. Print the successor
word that comes after that copy of the phrase in the text. For example,
the three successor words for this is are:
a, only, and a. With probability 2/3, a
is printed next, and with probability 1/3, only is printed next.
- Remove the first word from phrase and append the selected successor
word. For example with probability 2/3 the new phrase is is a.
- Return to the first step and repeat until m words have been printed.
Modeling computation time. Chapter 3 surveys several choices of
performance indicators for measuring the time performance of
this algorithm, including the number of word comparisons; the number of
character comparisons; the number of instructions executed in the program;
profiling information from gprof; CPU times; and wall-clock times.
A performance model is a function that is
built to predict performance in terms of the parameters
T, n, m, and k. The accuracy and precision
of the performance model depend on the choice of performance indicator.
Resources and Links
In AlgLab
- markov.c. A C implementation by Jon Bentley
(with the interface slightly modified by C. McGeoch) that implements the lookup table
with a sorted array.
External Resources
- The MCMC algorithm is described in
Section 15.1 of Programming Pearls, by Jon Bentley
(Addison-Wesley 2000): here is a
link to the companion website. The discussion covers implementation and performance
issues for strings.
The site contains implementations of: markov.c,
the original C program; markovhash.c, a C program implementing the lookup step with
a hash table; markovlet.c, a C program that generates text at the character
level instead
of the word level; and C++ and C tools for counting and listing phrases in text files.
- The algorithm is also described in Chapter 3 of The Practice of Programming,
by Brian W. Kernighan and Rob Pike (Addison-Wesley 1999). The chapter considers issues of
data structure design and choice of programming language. Here is a link to
source code from the book,
which contains implementations of MCMC in C, Java, C++, Awk, and Perl. There are also
two sample text files (King James Bible and the Book of Psalms) for testing the code.
- The text files mentioned in the Guide were
downloaded from Project Gutenberg, which
creates online versions of copyright-free texts. Remove the Project Gutenberg preamble
before testing your MCMC implementations.
- ``Markov Chain Monte Carlo'' is a general algorithm paradigm that has many applications
besides text generation.
The Wikipedia Page on
Markov Chains and their applications includes a discussion of text generation. (Accessed
7/18/2011).
- Here is a link to Valgrind.
Projects
Here are some suggestions for experimental projects using the MCMC algorithm.
- Build and evaluate new models. Here is the
C implementation of the algorithm, written by
Jon Bentley, that is described in the book.
- Run experiments to extend the word comparison model to cases where k>1.
- Add another term p(k,T,n,m) to the word-comparison or character-comparison
model that gives a tighter prediction of the cost of the Random Selection step. For
example, it might be based on the average (or maximum?) number of
phrase duplicates in T. You will have to write a program that measures
this property in T (preferably in O(n) time or less).
- Develop predictive models for non-English prose, poetry or
song lyrics, or files of source code.
- Use the procedure in Chapter 3 to develop models to predict
CPU time on your favorite operating system. How closely can you predict times for the
target experiment?
- Develop a character-comparison cost model that takes input
parameters n, m and k in units of characters
instead of words. How does it compare to the word-comparison and character-comparison models?
Evaluate the accuracy and precision of your models: first, using the same text but different
values of n,m,k within the range of your experiments; second, using values of
n,m,k outside the range of your experiments; third, using different text files T.
How do these variations affect the accuracy and precision of your model?
How does your choice of performance indicator affect the accuracy
and precision of your model? What would you do next to improve it?
- Evaluate performance indicators.
- Compare CPU times and elapsed times (clock ticks and instruction
cycles) on your favorite system. How much variation can you observe?
- Compare CPU times and elapsed times on two different systems. Can you map measurements
on one system to predictions on another?
- Use tools like gprof and Cachegrind to build a link between instruction
counts and CPU times. How much variation do you observe on different systems?
- Is the memory hierarchy a factor here? Does
time-per-instruction depend on the memory footprint of the program?
- Does the cost of
the Random Select operation depend on the fact that the caches were previously ``warmed up''
by binary search? Devise experiments to measure random selection times, with and without
binary search. Is there a difference?
- Compare implementations. Download an alternative implementation
of the MCMC algorithm (see the resource list above).
- Build new performance models for various performance indicators for
this implementation. Is one implementation easier to predict than another?
- Which performance indicators would be best for
comparing the sorted-array and hash-table implementations? Which performance indicators
would be best for tuning one of these implementations? Which performance indicators would
you use to compare implementations in different languages?