Theoretical Reorganization

Sean Carroll of Caltech takes on the philosophy of science in his paper, Beyond Falsifiability: Normal Science in a Multiverse, as part of a larger conversation on modern theoretical physics and experimental methods. Carroll breaks down the problems with Popper’s falsification criterion and arrives at a more pedestrian Bayesian formulation of how to view science. Theories arise and their priors get amplified or deflated; for Carroll, that prior support often shifts for reasons of coherence with other theories and considerations, and, in the best case, the posterior support improves with better experimental data.
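As a minimal sketch of that bookkeeping (my own toy setup, not anything from Carroll’s paper), assume a single theory competing against a pooled rival, and let coherence nudge the prior before the data weigh in:

```python
def update_credence(prior, likelihood_theory, likelihood_rivals):
    """Bayes' rule over 'this theory' vs. 'all rival theories pooled'."""
    evidence = prior * likelihood_theory + (1 - prior) * likelihood_rivals
    return prior * likelihood_theory / evidence

# hypothetical numbers: coherence with the rest of physics inflates the prior,
# then new data shift the posterior
prior = min(0.2 * 1.5, 1.0)                 # coherence "boost" before any data arrive
print(update_credence(prior, 0.8, 0.3))     # ~0.53 after a favorable observation
```

The numbers are arbitrary; the point is only that credence moves through two channels, coherence acting on the prior and evidence acting on the likelihood.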

Continuing with the previous posts’ work on expanding Bayes via AIT considerations, the discontinuous changes to a group of scientific theories that arrive with new theories or data require a better model than just adjusting priors. How exactly does coherence play a part in theory formation? If we treat each theory as a binary string that encodes a Turing machine, then the best theory, inductively, is the shortest machine that accepts the data. But we know that there is no machine that can compute that shortest machine, so there needs to be an algorithm that searches through the state space to try to locate the minimal machine. Meanwhile, the data may vary, and the machine may need to incorporate other machines that improve its coverage or that are driven by other factors, as Carroll points out:

We use our taste, lessons from experience, and what we know about the rest of physics to help guide us in hopefully productive directions.

The search algorithm is clearly not brute force, examining every micro-variation that comes from flipping bits in the machine. Instead, large reusable blocks of subroutines get reparameterized or reused with variation. This is the normal operation of theoretical physicists, who don’t look, for instance, at mathematical options that include illogical operations. Those theoretical bit patterns are excluded from the search procedure.
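As a toy illustration of the difference between the two search regimes, here is a short Python sketch; the bit-string “machines,” the block list, and all the names are mine and purely illustrative, not an actual program-induction algorithm:

```python
import itertools
import random

def brute_force_candidates(n_bits):
    """Every possible machine of n_bits: hopeless beyond tiny n (2**n of them)."""
    return (''.join(bits) for bits in itertools.product('01', repeat=n_bits))

def block_variants(machine, blocks, rng, k=10):
    """Propose k variants by splicing in reusable subroutine blocks,
    sometimes 'reparameterizing' a block by flipping one of its bits."""
    variants = []
    for _ in range(k):
        block = rng.choice(blocks)
        if rng.random() < 0.5:                          # reparameterize the block
            i = rng.randrange(len(block))
            block = block[:i] + ('1' if block[i] == '0' else '0') + block[i + 1:]
        cut = rng.randrange(len(machine) + 1)
        variants.append(machine[:cut] + block + machine[cut:])
    return variants

rng = random.Random(0)
print(sum(1 for _ in brute_force_candidates(16)))       # 65,536 candidates already
print(block_variants('1010011010', ['1100', '0011'], rng, k=3))
```

The point is only that the variation operates over structured chunks rather than individual bits, which is closer to the level at which theorists actually tinker.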

Luckily, there are data-driven construction methods that can be applied to create something like minimal machines, yet still have the feeling of radical or catastrophic reorganization as part of their methodology. For instance, take a simple approach like agglomeratively building a production/recognition tree from a linear symbol string that has repetition, say, English text.

Let’s operate at the character level rather than the word level, though the latter works essentially the same way. Take each bigram of characters, count them, and then assign a probability based on their frequency. Next take the frequencies of bigrams of bigrams and add those to our counting tabulations, and so forth. Now we can create a lexicon of these patterns and get some pretty predictable results for English. For instance, we would expect “th” to occur quite often, as well as “he.” We would further expect that “th” and “he” would co-occur at a high rate in the next level of the lexicon. No surprise there.
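Here is roughly what that tabulation looks like as a Python sketch; the two-level depth, the toy string, and names like pair_counts are my own choices for illustration:

```python
from collections import Counter

def pair_counts(symbols):
    """Count adjacent pairs in a sequence of symbols."""
    return Counter(zip(symbols, symbols[1:]))

text = "the theory then thews the heath"           # toy English-ish input
chars = list(text)

bigrams = pair_counts(chars)                        # level 1: character bigrams
total = sum(bigrams.values())
probs = {b: c / total for b, c in bigrams.items()}  # frequencies -> probabilities

# level 2: treat each bigram token as a symbol and count pairs of bigrams,
# e.g. (('t','h'), ('h','e')) captures the "th"+"he" co-occurrence in "the"
bigram_tokens = list(zip(chars, chars[1:]))
bigrams_of_bigrams = pair_counts(bigram_tokens)

print(bigrams.most_common(3))
print(bigrams_of_bigrams.most_common(3))
```

At the word level the same code applies; only the tokenization changes.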

Now we can enforce reusability of the entries in our mini-grammar and ensure that groups of highly reusable letters remain in the lexicon while very rare ones are simply tossed as noise. We start to notice that word groupings get optional “s” or “es” suffixes as we winnow the patterns down to the more frequent ones. How do we decide which groupings to retain and which to throw away? It is a trade-off between the encoded size of the sequence grammar and the encoded size of the strings. Since both must be accounted for, the combined size in bits is the proper metric. Moreover, this enforces the Occam’s Razor nature of using a grammar as a theory of the underlying language: a shorter lexicon that still accounts for the data sequence is better than a longer one.
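A back-of-the-envelope Python version of that trade-off, assuming a crude fixed-width code (the costing scheme and the toy strings are mine, not a real compressor):

```python
import math

def combined_bits(lexicon, text):
    """Total cost in bits: encode the lexicon entries, then the text with each
    entry replaced by a single new symbol."""
    encoded = text
    for i, entry in enumerate(lexicon):
        encoded = encoded.replace(entry, chr(0x100 + i))
    alphabet = set(encoded) | {c for e in lexicon for c in e}
    bits_per_symbol = max(1, math.ceil(math.log2(len(alphabet) + len(lexicon))))
    lexicon_bits = sum((len(e) + 1) * bits_per_symbol for e in lexicon)  # +1 for a separator
    return lexicon_bits + len(encoded) * bits_per_symbol

text = "cats and hats and bats and rats"
print(combined_bits([], text))            # raw cost, no lexicon
print(combined_bits(["ats and "], text))  # frequent block: total cost drops
print(combined_bits(["xq"], text))        # useless entry: pure overhead
```

The middle case wins because the bits saved in the re-encoded string more than pay for the bits spent storing the lexicon entry, which is exactly the Occam trade-off described above.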

This is, not coincidentally, how data compression works, but the simplicity of using an agglomerative approach belies the complexity of actually rearranging the lexicon or grammar in order to account for longer range patterns or deeper recursion present in the data itself. So also for science. We need a broader model of how to reorganize theoretical models that allows for radical and transformative change within the model system.

We can grab a helping hand from Donald Campbell’s evolutionary epistemology. Reorganization of the explanatory submachines in a complex hierarchy of machines can be accomplished using a variation-and-retention model. A new theory or subtheory is a variant that ought to provide better explanatory coverage for the data while also fitting coherently with the other theories in the overall machine. The thing has to run successfully against the data, after all. The new theory add-on gets an increased prior from the perspective of the theorist, is tested against the data, and is watched for incompatibilities with other theories. Success is a new, compact theory and, in Carroll’s area of expertise, a move towards more “credence” for theories that are not easily or readily falsifiable (at least not yet!).
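As a closing sketch, here is what a variation-and-retention loop might look like over the toy lexicons from above; the scorer, the operators, and the sample string are my own illustrative choices, not Campbell’s or Carroll’s formalism:

```python
import random

def total_bits(lexicon, text):
    """Two-part cost: the lexicon entries plus the text re-encoded with them
    (8 bits per symbol is a crude stand-in for a real coding scheme)."""
    encoded = text
    for i, entry in enumerate(lexicon):
        encoded = encoded.replace(entry, chr(0x100 + i))
    return 8 * sum(len(e) + 1 for e in lexicon) + 8 * len(encoded)

def evolve(text, generations=1000, seed=2):
    """Blind variation (add or drop a lexicon entry) with selective retention
    (keep a variant only if the combined description length drops)."""
    rng = random.Random(seed)
    lexicon, best = [], total_bits([], text)
    for _ in range(generations):
        candidate = list(lexicon)
        if candidate and rng.random() < 0.25:
            candidate.pop(rng.randrange(len(candidate)))     # variation: drop an entry
        else:
            i = rng.randrange(len(text) - 2)
            candidate.append(text[i:i + rng.randint(2, 6)])  # variation: add a block
        score = total_bits(candidate, text)
        if score < best:                                     # retention: it must earn its keep
            lexicon, best = candidate, score
    return lexicon, best

print(evolve("the theory that the theorist retained then held"))
```

A variant that conflicts with what is already retained simply fails to lower the total cost and gets discarded, a crude stand-in for the coherence check Carroll describes.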

Note: for more on sequence induction, check out Craig Nevill-Manning’s 1996 doctoral thesis on SEQUITUR.
