Paul Vitanyi has been a deep advocate for Kolmogorov complexity for many years. His book with Ming Li, An Introduction to Kolmogorov Complexity and Its Applications, remains on my book shelf (and was a bit of an investment in grad school).
I came across a rather interesting paper by Vitanyi with Rudi Cilibrasi called “Clustering by Compression” that illustrates perhaps more easily and clearly than almost any other recent work the tight connections between meaning, repetition, and informational structure. Rather than describing the paper, however, I wanted to conduct an experiment that demonstrates their results. To do this, I asked the question: are the writings of Dante more similar to other writings of Dante than to Thackeray? And is the same true of Thackeray relative to Dante?
Now, we could pursue these questions at many different levels. We might ask scholars, well-versed in the works of each, to compare and contrast the two authors. They might invoke cultural factors, the memes of their respective eras, and their writing styles. Ultimately, though, the scholars would have to get down to some textual analysis, looking at the words on the page. And in so doing, they would draw distinctions by lifting features of the text, comparing and contrasting grammatical choices, word choices, and other basic elements of the prose and poetry on the page. We might very well be able to take parts of the knowledge of those experts and distill it into some kind of a logical procedure or algorithm that would parse the texts and draw distinctions based on the distributions of words and other structural cues. If asked, we might say that a similar method might work for the so-called language of life, DNA, but that it would require a different kind of background knowledge to build the analysis, much less create an algorithm to perform the same task. And perhaps a similar procedure would apply to music, finding both simple similarities between features like tempos, as well as complex stylistic and content-based alignments.
Yet, what if we could invoke some kind of universal approach to finding exactly what features of each domain are important for comparisons? The universal approach would need to infer the appropriate symbols, grammars, relationships, and even meaning in order to do the task. And this is where compression comes in. In order to efficiently compress a data stream, an algorithm must infer repetitive patterns that are both close together like the repetition of t-h-e in English, as well as further apart, like verb agreement and other grammatical subtleties in the same language. By inferring those patterns, the compressor can then create a dictionary and emit just a reference to the pattern in the dictionary, rather than the pattern itself.
Cilibrasi and Vitanyi, in their paper, conduct a number of studies on an innovative application of compression to finding similarities and, conversely, differences. To repeat their experiment, I used a standard data compressor called bzip2 and grabbed two cantos from Dante’s Divine Comedy at random, and then two chapters from Thackeray’s The Book of Snobs. I then followed their procedure and compressed each individually using bzip2, as well as concatenating each together (pairwise) and compressing the combined document. The idea is that when you concatenate them together, the similarities present between the two documents should manifest as improved compression (smaller sized files) because the pattern dictionary will be more broadly applicable. The length of the files needs to be normalized a bit, however, because the files themselves vary in length, so following Cilibrasi and Vitanyi’s procedure, I subtracted the minimum of the compressed sizes of the two independent files and divided by the maximum of the same.
The results were perfect:
X1 | X1 Size | X2 | X2 Size | X1X2 Size | NCD |
d1 | 2828 | d2 | 3030 | 5408 | 4933.748232 |
d1 | 2828 | t1 | 3284 | 5834 | 6201.203678 |
d1 | 2828 | t2 | 2969 | 5529 | 5280.703324 |
d2 | 3030 | t1 | 3284 | 6001 | 5884.148845 |
d2 | 3030 | t2 | 2969 | 5692 | 5000.694389 |
t1 | 3284 | t2 | 2969 | 5861 | 4599.207369 |
Note that d1 has the lowest distance (NCD) to d2. t1 is also closest to t2. In this table, d1 and d2 are the Dante cantos. t1 and t2 are the Thackeray chapters. NCD is the Normalized Compression Distance. X1 Size is the compressed size of X1 in bytes. X2 Size is the compressed size for X2. Combined size is the compressed size of the combined original text (combined by concatenation in the order X1 followed by X2). NCD is calculated per the paper.