Action on Hadoop

The back rooms of everyone from Pandora to the NSA are filled with machines working in parallel to enrich and analyze data. At the core of most of them is Doug Cutting's Hadoop, which provides an open source implementation of Google's MapReduce framework combined with a distributed file system for replication and failover. With Hadoop Summit arriving this week (the 6th I've been to and the 7th ever), the importance and impact of these technologies continues to grow.
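For readers new to the programming model, here is a minimal word-count sketch in the Hadoop Streaming style; it is illustrative only, and the file names (mapper.py, reducer.py) are placeholders rather than anything from the book.

```python
#!/usr/bin/env python
# mapper.py -- emit a (word, 1) pair for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word.lower(), 1))
```

```python
#!/usr/bin/env python
# reducer.py -- sum counts per word; Hadoop delivers input sorted by key
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

You can dry-run the pair locally with `cat input.txt | python mapper.py | sort | python reducer.py` before submitting the same scripts through the Hadoop Streaming jar.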

I hope to see you there and I’ll take this opportunity to announce that I am co-authoring Hadoop in Action, 2nd Edition with the original author, Chuck Lam. The new version will provide updates to this best-selling book and introduce all of the newest animals in the Hadoop zoo.… Read the rest

Inching Towards Shannon’s Oblivion

Following Bill Joy's concerns over the future world of nanotechnology, biological engineering, and robotics in his 2000 essay Why the Future Doesn't Need Us, it has become fashionable to worry over "existential threats" to humanity. Nuclear power and weapons used to be dreadful enough, and clearly remain in the top five, but these rapidly developing technologies, asteroids, and global climate change have joined Oppenheimer's misquoted "destroyer of all things" in portending our doom. Here are Max Tegmark, Stephen Hawking, and others in the Huffington Post warning again about artificial intelligence:

One can imagine such technology outsmarting financial markets, out-inventing human researchers, out-manipulating human leaders, and developing weapons we cannot even understand. Whereas the short-term impact of AI depends on who controls it, the long-term impact depends on whether it can be controlled at all.

I almost always begin my public talks on Big Data and intelligent systems with a presentation on industrial revolutions that progresses through Robert Gordon's phases and then highlights Paul Krugman's argument that Big Data and the improvements we are seeing in intelligent systems potentially represent the next industrial revolution. I am usually less enthusiastic about the timeline than nonspecialists, but after giving a talk at the PASS Business Analytics Conference in San Jose on Friday, I stuck around to listen in on a highly technical talk concerning statistical regularization and deep learning, and I found myself enthused about the topic once again. Deep learning uses artificial neural networks to classify information, but is distinct from traditional ANNs in that the networks are pre-trained with auto-encoders to acquire general knowledge about the data domain. To be clear, though, most of the problems that have been tackled are "subsymbolic" image recognition and speech problems.… Read the rest
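To make the pre-training idea concrete, here is a minimal sketch, assuming synthetic data and a single hidden layer; real deep networks stack several auto-encoder layers and then fine-tune the whole stack with backpropagation on labeled data.

```python
# A toy auto-encoder pre-training pass using plain NumPy (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Synthetic "domain" data: 500 samples with 20 features.
X = rng.normal(size=(500, 20))

n_hidden = 8
W_enc = rng.normal(scale=0.1, size=(20, n_hidden))  # encoder weights
W_dec = rng.normal(scale=0.1, size=(n_hidden, 20))  # decoder weights

# Unsupervised pre-training: learn to reconstruct X through the bottleneck.
lr = 0.01
for epoch in range(200):
    H = sigmoid(X @ W_enc)      # encode
    X_hat = H @ W_dec           # decode (linear output)
    err = X_hat - X             # reconstruction error
    # Gradient descent on mean squared reconstruction error.
    grad_dec = H.T @ err / len(X)
    grad_enc = X.T @ ((err @ W_dec.T) * H * (1.0 - H)) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

# W_enc would now seed the first layer of a supervised classifier,
# which is then fine-tuned on labeled examples.
print("reconstruction MSE:", np.mean((sigmoid(X @ W_enc) @ W_dec - X) ** 2))
```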

Computing the Madness of People

The best paper I've read so far this year has to be Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance by David Bailey, Jonathan Borwein, Marcos López de Prado, and Qiji Jim Zhu. The title should ring alarm bells for anyone who has ever puzzled over the disclaimer made by mutual funds or investment strategists that "past performance is not a guarantee of future performance." No, but we have nothing but that past performance to judge the fund or firm on; we could instead pick based on vague investment "philosophies" like the heroizing profiles in Kiplinger's seem to promote, or trust that all the arbitraging has squeezed the markets into perfect equilibria and therefore just use index funds.

The paper's core tenets extend well beyond financial charlatanism, however. The authors point out that the same problem arises in drug discovery, where the main effects of novel compounds may be due to pure randomness in the sample population in a way that is masked by the sample selection procedure. The history of mental illness research has similar failures, with the head of NIMH remarking that clinical trials and the DSM for treating psychiatric symptoms are too often "shooting in the dark."

The core suggestion of the paper is remarkably simple, however: use held-out data to validate models. Remarkably simple, but apparently rarely done in quantitative financial analysis. The researchers show how simple random walks can look like a seasonal price pattern, and how, by sending binary signals about market performance to clients (market will rise/market will fall), investment advisors can create a subpopulation that thinks they are geniuses while the other clients walk away due to losses. These practices rise to the level of charlatanism, but the overfitting problem is just one instance of pseudo-mathematics, in which insufficient care is used in managing the data.… Read the rest
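Here is a rough simulation of that selection effect (my illustration, not an experiment from the paper): generate many strategies with no true edge, pick the one with the best in-sample Sharpe ratio, and then check it against held-out data.

```python
# Backtest overfitting in miniature: the "best" of many random strategies
# looks skilled in-sample and reverts to noise out-of-sample.
import numpy as np

rng = np.random.default_rng(42)
n_strategies, n_days = 1000, 500

# Daily returns for strategies that have zero true edge.
returns = rng.normal(loc=0.0, scale=0.01, size=(n_strategies, n_days))
in_sample, held_out = returns[:, :250], returns[:, 250:]

def sharpe(r):
    # Annualized Sharpe ratio along the last axis.
    return np.sqrt(252) * r.mean(axis=-1) / r.std(axis=-1)

best = np.argmax(sharpe(in_sample))
print("best in-sample Sharpe:  %.2f" % sharpe(in_sample)[best])
print("same strategy held out: %.2f" % sharpe(held_out)[best])
# The first number looks like genius; the second is the sound of randomness.
```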

Saving Big Data from the Zeros

Because of the hype cycle, Big Data inevitably attracts dissenters who want to deflate the lofty expectations built around new technologies that appear mystifying to those outside the Silicon Valley machine. The first response is generally "so what?": there is nothing new here, just a rehashing of efforts like grid computing and Beowulf clusters. This skepticism is generally a healthy inoculation against aggrandizement and any kind of hangover from unmet expectations. Hence, the NY Times op-ed from April 6th, Eight (No, Nine!) Problems with Big Data, should be embraced for enumerating eight or nine different ways that Big Data technologies, algorithms, and thinking might be stretching the balloon of hope toward a loud, but ineffectual, pop.

The eighth item on the list bears some scrutiny, though. The authors, with whom I am not familiar, focus on the overuse of trigrams in building statistical language models. They note that language is very productive and that even a short phrase from Rob Lowe, "dumbed-down escapist fare," doesn't appear in Google's indexed corpus. Shades of Chomsky's "colorless green ideas…," but an important lesson in how to manage the composition of meaning. "Dumbed-down escapist fare" doesn't translate well back and forth through German via Google Translate. For the authors, that shows the failure of the statistical translation methodology linked to Big Data, and it ties into their other concerns about predicting rare occurrences or even, in the case of Lowe's quote, zero occurrences.
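The zero-count problem is easy to reproduce with a toy trigram model; the tiny corpus below is invented purely for illustration.

```python
# Unsmoothed trigram model: any grammatical phrase absent from the training
# corpus gets probability zero, which is the authors' point about rare events.
from collections import Counter

corpus = "the critics called it escapist fare and the audience loved it".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def trigram_prob(w1, w2, w3):
    # Maximum-likelihood estimate of P(w3 | w1, w2), no smoothing.
    if bigrams[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(trigram_prob("it", "escapist", "fare"))            # seen: 1.0
print(trigram_prob("dumbed-down", "escapist", "fare"))   # unseen: 0.0
```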

In reality, though, these methods of statistical translation through parallel text learning date to the late 1980s and reflect a distinct journey through ways of thinking about natural language and computing.… Read the rest

Parsimonious Portmanteaus

Meaning is a problem. We think we might know what something means, but we keep being surprised by the facts, research, and logical difficulties that surround the notion of meaning. Putnam's Representation and Reality runs through a few different ways of thinking about meaning, though without reaching any definitive conclusions beyond what meaning can't be.

Children are a useful touchstone concerning meaning because we know that they acquire linguistic skills and consequently at least an operational understanding of meaning. And how they do so is rather interesting: first, presume that whole objects are the first topics for naming; next, assume that syntactic differences lead to semantic differences (“the dog” refers to the class of dogs while “Fido” refers to the instance); finally, prefer that linguistic differences point to semantic differences. Paul Bloom slices and dices the research in his Précis of How Children Learn the Meanings of Words, calling into question many core assumptions about the learning of words and meaning.

These preferences become useful if we want to try to formulate an algorithm that assigns meaning to objects or groups of objects. Probabilistic Latent Semantic Analysis, for example, assumes that words are signals from underlying probabilistic topic models and then derives those models by estimating all of the probabilities from the available signals. The outcome lacks labels, however: the “meaning” is expressed purely in terms of co-occurrences of terms. Reconciling an approach like PLSA with the observations about children’s meaning acquisition presents some difficulties. The process seems too slow, for example, which was always a complaint about connectionist architectures of artificial neural networks as well. As Bloom points out, kids don’t make many errors concerning meaning and when they do, they rapidly compensate.… Read the rest
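A compact sketch of PLSA's expectation-maximization updates makes the labeling point visible: the toy counts below are random, and what comes out is nothing more than unlabeled distributions over words.

```python
# Toy PLSA via EM on a random term-document count matrix (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
n_docs, n_words, n_topics = 6, 12, 2

# n(d, w): word counts per document; a real run would use an actual corpus.
counts = rng.integers(0, 5, size=(n_docs, n_words)).astype(float)

# Random initialization of P(z|d) and P(w|z).
p_z_given_d = rng.random((n_docs, n_topics))
p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)
p_w_given_z = rng.random((n_topics, n_words))
p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)

for _ in range(100):
    # E-step: P(z|d,w) proportional to P(z|d) * P(w|z); shape (docs, words, topics).
    joint = p_z_given_d[:, None, :] * p_w_given_z.T[None, :, :]
    p_z_given_dw = joint / joint.sum(axis=2, keepdims=True)

    # M-step: re-estimate P(w|z) and P(z|d) from expected counts.
    weighted = counts[:, :, None] * p_z_given_dw
    p_w_given_z = weighted.sum(axis=0).T
    p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)
    p_z_given_d = weighted.sum(axis=1)
    p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)

# Each row of p_w_given_z is a "meaning" only in the sense of co-occurrence:
# a distribution over words with no label attached.
print(np.round(p_w_given_z, 3))
```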

Algorithmic Aesthetics

Jared Tarbell's work in algorithmic composition via processing.org continues to amaze me. See more here. The relatively compact descriptions of complex landscapes lend themselves to treatment as aesthetic phenomena, where the small scale of the grammars versus the complexity of the results raises the question of what art is and how it relates to human neurosystems.

… Read the rest

Substitutions, Permutations, and Economic Uncertainty

When Robert Shiller was awarded the near-Nobel for economics, there was also a tacit blessing that the limits of economics as a science were being recognized. You see, Shiller's most important contributions included debunking the essentials of market behavior and replacing them with the irrationalities of behavioral psychology.

Shiller's pairing with Eugene Fama in the Nobel award is ironic in that Fama is the father of the efficient market hypothesis, which suggests that rational behavior should overcome those irrational tendencies to reach a cybernetic homeostasis…if only the system were free of regulatory entanglements that drag on the clarity of the mass signals. Then all the bubbles that grow and burst would be smoothed out of the economy.

But technological innovation can sometimes trump old-school musings and analysis: Bitcoin represents a bubble in value under the efficient market hypothesis because the currency's value has no underlying factual basis. As the economist John Quiggin points out in The National Interest:

But in the case of Bitcoin, there is no source of value whatsoever. The computing power used to mine the Bitcoin is gone once the run has finished and cannot be reused for a more productive purpose. If Bitcoins cease to be accepted in payment for goods and services, their value will be precisely zero.

In fact, that specific computing power consists of just two basic functions: substitution and permutation. A long string of transactions has all its bits substituted with other bits, then blocks of those bits are rotated and generally permuted until we end up with a bit signature that is of fixed length but statistically uncorrelated with the original content. And there is no other value to those specific (and hard to perform) computations.… Read the rest
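A small example makes the fixed-length, no-residual-value point concrete: hashing two invented transaction strings with SHA-256 (the substitution-and-permutation workhorse behind Bitcoin mining) yields digests that are always 256 bits and look unrelated both to their inputs and to each other.

```python
# SHA-256 digests: fixed length, and a one-character change in the input
# produces an output that shares no apparent structure with the original.
import hashlib

for tx in ["Alice pays Bob 1 BTC", "Alice pays Bob 2 BTC"]:
    digest = hashlib.sha256(tx.encode("utf-8")).hexdigest()
    print("%-22s -> %s" % (tx, digest))
```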

Novelty in the Age of Criticism

Gary Gutting from Notre Dame and the New York Times knows how to incite an intellectual riot, as demonstrated by his most recent piece for The Stone, Mozart vs. the Beatles. "High art" is superior to "low art" because of its "stunning intellectual and emotional complexity." He sums up:

My argument is that this distinctively aesthetic value is of great importance in our lives and that works of high art achieve it much more fully than do works of popular art.

But what makes up these notions of complexity and distinctive aesthetic value? One might try to enumerate those values in a list. Or, alternatively, one might claim that time serves as a sieve for the values that Gutting says make one work of art superior to another, leaving open the possibility that the enumerated list is incomplete but still a useful retrospective system of valuation.

I previously argued in a 1994 paper (published in 1997), Complexity Formalisms, Order and Disorder in the Structure of Art, that simplicity and random chaos exist in a careful balance in art, one that reflects the underlying grammatical systems we use to predict the environment. Jürgen Schmidhuber took the approach further by applying algorithmic information theory to novelty-seeking behavior that leads, in turn, to aesthetically pleasing models. The reflection of this behavioral optimization in our sideline preoccupations emerges as art, with the ultimate causation machine of evolution driving the proximate consequences for men and women.
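As a crude proxy for that algorithmic-information view (my illustration, not Schmidhuber's actual compression-progress measure), one can compare how well differently structured byte strings compress: pure regularity compresses to almost nothing, pure noise barely compresses at all, and the interesting cases sit in between.

```python
# Compressed size as a rough stand-in for algorithmic complexity.
import os
import zlib

def compression_ratio(data: bytes) -> float:
    return len(zlib.compress(data, 9)) / len(data)

ordered = b"ABAB" * 256                      # highly regular
noise = os.urandom(1024)                     # pure randomness
mixed = (b"ABAB" * 128) + os.urandom(512)    # structure plus surprise

for name, data in [("ordered", ordered), ("noise", noise), ("mixed", mixed)]:
    print("%-8s compression ratio: %.2f" % (name, compression_ratio(data)))
```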

But let's get back to the flaw I see in Gutting's argument, one that fits better with Schmidhuber's approach: much of what is important in art is cultural novelty. Picasso is not aesthetically superior to the detailed hyper-reality of the Dutch Masters, for instance, but is notable for his cultural deconstruction of the role of art as photography and reproduction took hold.… Read the rest

Signals and Apophenia

The central theme in Signals and Noise is the inverse problem and its consequences: given an ocean of data, how does one uncover the true signals hidden in the noise? Is there even such a thing? There's an obsessive balance between apophenia and modeling somewhere built into our skulls.

The cover art for Signals and Noise reflects those tendencies. There is a QR Code that encodes a passage from the book, and then there is a distortion of the content of the QR Code. The distortion, in turn, creates a compelling image. Is it a fly creeping to the left or a lion’s head tilted to the right?

Yes.

A free hard-cover copy of Signals and Noise to anyone who decodes the QR Code. Post a copy of the text to claim your reward.… Read the rest