Action on Hadoop

The back rooms of everyone from Pandora to the NSA are filled with machines working in parallel to enrich and analyze data. At the core of most of them is Doug Cutting’s Hadoop, which provides an open source implementation of Google’s MapReduce framework combined with a distributed file system for replication and failover. With Hadoop Summit arriving this week (the 6th I’ve been to and the 7th ever), the importance and impact of these technologies continue to grow.

I hope to see you there and I’ll take this opportunity to announce that I am co-authoring Hadoop in Action, 2nd Edition with the original author, Chuck Lam. The new version will provide updates to this best-selling book and introduce all of the newest animals in the Hadoop zoo.… Read the rest

Inching Towards Shannon’s Oblivion

Following Bill Joy’s concerns over the future world of nanotechnology, biological engineering, and robotics in 2000’s Why the Future Doesn’t Need Us, it has become fashionable to worry over “existential threats” to humanity. Nuclear power and weapons used to be dreadful enough, and clearly remain in the top five, but these rapidly developing technologies, asteroids, and global climate change have joined Oppenheimer’s misquoted “destroyer of all things” in portending our doom. Here are Max Tegmark, Stephen Hawking, and others in Huffington Post warning again about artificial intelligence:

One can imagine such technology outsmarting financial markets, out-inventing human researchers, out-manipulating human leaders, and developing weapons we cannot even understand. Whereas the short-term impact of AI depends on who controls it, the long-term impact depends on whether it can be controlled at all.

I almost always begin my public talks on Big Data and intelligent systems with a presentation on industrial revolutions that progresses through Robert Gordon’s phases and then highlights Paul Krugman’s argument that Big Data and the intelligent systems improvements we are seeing potentially represent a next industrial revolution. I am usually less enthusiastic about the timeline than nonspecialists, but after giving a talk at PASS Business Analytics Friday in San Jose, I stuck around to listen in on a highly technical talk concerning statistical regularization and deep learning and I found myself enthused about the topic once again. Deep learning uses artificial neural networks to classify information, but is distinct from traditional ANNs in that the systems are pre-trained using auto-encoders to acquire general knowledge about the data domain. To be clear, though, most of the problems that have been tackled so far are “subsymbolic” ones such as image recognition and speech.… Read the rest

Computing the Madness of People

The best paper I’ve read so far this year has to be Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-sample Performance by David Bailey, Jonathan Borwein, Marcos López de Prado, and Qiji Jim Zhu. The title should ring alarm bells with anyone who has ever puzzled over the disclaimers made by mutual funds or investment strategists that “past performance is not a guarantee of future performance.” No, but we have nothing but that past performance to judge the fund or firm on; we could just pick based on vague investment “philosophies” like the heroizing profiles in Kiplinger’s seem to promote or trust that all the arbitraging has squeezed the markets into perfect equilibria and therefore just use index funds.

The paper’s core tenets extend well beyond financial charlatanism, however. They point out that the same problem arises in drug discovery where main effects of novel compounds may be due to pure randomness in the sample population in a way that is masked by the sample selection procedure. The history of mental illness research has similar failures, with the head of NIMH remarking that clinical trials and the DSM for treating psychiatric symptoms are too often “shooting in the dark.”

The core suggestion of the paper is remarkably simple, however: use held-out data to validate models. Remarkably simple but apparently rarely done in quantitative financial analysis. The researchers show how simple random walks can look like a seasonal price pattern, and how, by sending binary signals about market performance to clients (market will rise/market will fall), investment advisors can create a subpopulation of clients who think the advisors are geniuses while other clients walk away due to losses. These rise to the level of charlatanism but the problem of overfitting is just one of pseudo-mathematics where insufficient care is used in managing the data.… Read the rest
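A minimal sketch of why held-out data matters, under stated assumptions: this is not the paper's own experiment but a simplified stand-in in which many purely random long/short "strategies" are scored on noise returns and the best in-sample one is then checked out of sample. All function names and parameter values here are hypothetical. The second function captures the arithmetic of the binary-signal scheme: if half the clients are told "rise" and half "fall" each round, the group that has only ever seen correct calls halves every round.

```python
import random
import statistics


def noise_returns(n_days, rng):
    """Daily returns with zero true mean: there is no real signal to find."""
    return [rng.gauss(0.0, 0.01) for _ in range(n_days)]


def random_strategy(n_days, rng):
    """A 'strategy' that is just a random long (+1) or short (-1) call per day."""
    return [rng.choice((-1, 1)) for _ in range(n_days)]


def mean_daily_pnl(positions, returns):
    """Average daily profit of holding the given positions against the returns."""
    return statistics.fmean(p * r for p, r in zip(positions, returns))


def best_in_sample(n_strategies=1000, n_days=500, split=250, seed=0):
    """Select the best strategy on the first `split` days, then score it on the rest."""
    rng = random.Random(seed)
    returns = noise_returns(n_days, rng)
    strategies = [random_strategy(n_days, rng) for _ in range(n_strategies)]
    best = max(
        strategies,
        key=lambda s: mean_daily_pnl(s[:split], returns[:split]),
    )
    in_sample = mean_daily_pnl(best[:split], returns[:split])
    out_of_sample = mean_daily_pnl(best[split:], returns[split:])
    return in_sample, out_of_sample


def surviving_believers(n_clients, rounds):
    """Clients who received only correct calls after `rounds` of 50/50 split signals."""
    return n_clients // (2 ** rounds)


if __name__ == "__main__":
    in_sample, out_of_sample = best_in_sample()
    print(f"in-sample mean daily P&L:     {in_sample:.5f}")
    print(f"out-of-sample mean daily P&L: {out_of_sample:.5f}")
    print("believers left from 1024 clients after 5 rounds:",
          surviving_believers(1024, 5))
```

Because the winner is chosen from many noise-driven candidates, its in-sample score looks good by construction, while the held-out score has no reason to.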

Saving Big Data from the Zeros

Because of the hype cycle, Big Data inevitably attracts dissenters who want to deflate a bit the lofty expectations that are built around new technologies that appear mystifying to those on the outside of the Silicon Valley machine. The first response is generally “so what?” and that there is nothing new here, just rehashing efforts like grid computing and Beowulf and whatnot. This skepticism is generally a healthy inoculation against aggrandizement and any kind of hangover from unmet expectations. Hence, the NY Times op-ed from April 6th, Eight (No, Nine!) Problems with Big Data, should be embraced for enumerating eight or nine different ways that Big Data technologies, algorithms and thinking might be stretching the balloon of hope towards a loud, but ineffectual, pop.

The eighth item on the list bears some scrutiny, though. The authors, with whom I am not familiar, focus on the overuse of trigrams in building statistical language models. And they note that language is very productive and that even a short sentence from Rob Lowe, “dumbed-down escapist fare,” doesn’t appear in the indexed corpus of Google. Shades of “colorless green ideas…” from Chomsky, but an important lesson in how to manage the composition of meaning. Dumbed-down escapist fare doesn’t translate well back-and-forth through German via the Google translate capability. For the authors, that shows the failure of the statistical translation methodology linked to Big Data, and ties in to their other concerns about predicting rare occurrences or even, in the case of Lowe’s quote, zero occurrences.

In reality, though, these methods of statistical translation through parallel text learning date to the late 1980s and reflect a distinct journey through ways of thinking about natural language and computing.… Read the rest
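To make the zero-occurrence concern concrete, here is a small sketch of a trigram model on a toy corpus. The corpus, the function names, and the use of add-k smoothing are all assumptions chosen for illustration; add-k is only one simple, well-known way of keeping unseen trigrams from getting probability zero, not the method of any particular translation system.

```python
from collections import Counter


def train_trigrams(sentences):
    """Count trigrams and their two-word contexts over whitespace tokens."""
    trigram_counts = Counter()
    context_counts = Counter()
    vocab = set()
    for sentence in sentences:
        tokens = sentence.lower().split()
        vocab.update(tokens)
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            trigram_counts[(w1, w2, w3)] += 1
            context_counts[(w1, w2)] += 1
    return trigram_counts, context_counts, vocab


def trigram_prob(w1, w2, w3, model, k=0.0):
    """Estimate P(w3 | w1, w2); k=0 is the raw relative frequency, k>0 is add-k."""
    trigram_counts, context_counts, vocab = model
    denominator = context_counts[(w1, w2)] + k * len(vocab)
    if denominator == 0:
        return 0.0
    return (trigram_counts[(w1, w2, w3)] + k) / denominator


# Toy corpus (hypothetical): the phrase of interest never occurs as a trigram.
corpus = [
    "the studio sells escapist fare",
    "critics dislike dumbed-down escapist movies",
]
model = train_trigrams(corpus)

unsmoothed = trigram_prob("dumbed-down", "escapist", "fare", model)
smoothed = trigram_prob("dumbed-down", "escapist", "fare", model, k=1.0)
print(unsmoothed, smoothed)
```

With raw counts the unseen trigram scores exactly zero; with k=1 it receives a small non-zero probability borrowed from the seen events.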

Signals and Apophenia

The central theme in Signals and Noise is that of the inverse problem and its consequences: given an ocean of data, how does one uncover the true signals hidden in the noise? Is there even such a thing? There’s an obsessive balance between apophenia and modeling somewhere built into our skulls.

The cover art for Signals and Noise reflects those tendencies. There is a QR Code that encodes a passage from the book, and then there is a distortion of the content of the QR Code. The distortion, in turn, creates a compelling image. Is it a fly creeping to the left or a lion’s head tilted to the right?

Yes.

A free hard-cover copy of Signals and Noise to anyone who decodes the QR Code. Post a copy of the text to claim your reward.… Read the rest

A Paradigm of Guessing

The most interesting thing I’ve read this week comes from Jurgen Schmidhuber’s paper, Algorithmic Theories of Everything, which should be provocative enough to pique the most jaded of interests. And the quote is from way into the paper:

The first number is 2, the second is 4, the third is 6, the fourth is 8. What is the fifth? The correct answer is “250,” because the nth number is n^5 − 5n^4 − 15n^3 + 125n^2 − 224n + 120. In certain IQ tests, however, the answer “250” will not yield maximal score, because it does not seem to be the “simplest” answer consistent with the data (compare [73]). And physicists and others favor “simple” explanations of observations.

And this is the beginning and the end of logical positivism. How can we assign truth to inductive judgments without crossing from fact to value, and what should that value system be?… Read the rest
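The arithmetic in the quoted passage can be checked directly. The sketch below evaluates the quoted polynomial next to the "simple" rule 2n; the function names are made up for illustration. The factored form is not in the quote: it is an observation that expands to the same polynomial and shows why the first four terms agree with 2n.

```python
def quoted_polynomial(n):
    """The nth number according to the polynomial in the quoted passage."""
    return n**5 - 5 * n**4 - 15 * n**3 + 125 * n**2 - 224 * n + 120


def simple_rule(n):
    """The 'simplest' rule most test-takers would infer from 2, 4, 6, 8."""
    return 2 * n


def factored_form(n):
    """2n plus a term that vanishes at n = 1, 2, 3, 4 (same polynomial, expanded)."""
    return 2 * n + (n + 5) * (n - 1) * (n - 2) * (n - 3) * (n - 4)


first_five = [quoted_polynomial(n) for n in range(1, 6)]
print(first_five)                              # [2, 4, 6, 8, 250]
print([simple_rule(n) for n in range(1, 6)])   # [2, 4, 6, 8, 10]
```

Both rules fit the four observations perfectly; only a preference for simplicity, not the data, separates 10 from 250.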

Industrial Revolution #4

Paul Krugman at the New York Times consumes Robert Gordon’s analysis of economic growth and the role of technology and comes up more hopeful than Gordon. The kernel in Krugman’s hope is that Big Data analytics can provide a shortcut to intelligent machines by bypassing the specification and programming that were once assumed to be a requirement for artificial intelligence. Instead, we don’t specify but use “data-intensive ways” to achieve a better result. And we might get to IR#4, following Gordon’s taxonomy where IR stands for “industrial revolution.” IR#1 was steam and locomotives. IR#2 was everything up to computers. IR#3 is computers and cell phones and whatnot.

Krugman implies that IR#4 might spur the typical economic consequences of grand technological change, including the massive displacement of workers, but like in previous revolutions it is also assumed that economic growth built from new industries will ultimately eclipse the negatives. This is not new, of course. Robert Anton Wilson argued decades ago for the R.I.C.H. economy (Rising Income through Cybernetic Homeostasis). Wilson may have been on acid, but Krugman wasn’t yet tuned in, man. (A brief aside: the Krugman/Wilson notions probably break down over extraction and agribusiness/land rights issues. If labor is completely replaced by intelligent machines, the land and the ingredients it contains nevertheless remain a bottleneck for economic growth. Look at the global demand for copper and rare earth materials, for instance.)

But why the particular focus on Big Data technologies? Krugman’s hope teeters on the assumption that data-intensive algorithms possess a fundamentally different scale and capacity than human-engineered approaches. Having risen through the computational linguistics and AI community working on data-driven methods for approaching intelligence, I can certainly sympathize with the motivation, but there are really only modest results to report at this time.… Read the rest

Keep Suspicious and Carry On

I’ve previously argued that it is unlikely that resource-constrained simulations can achieve adequate levels of fidelity to be sufficient for what we observe around us. This argument was a combination of computational irreducibility and assumptions about the complexity of evolutionary trajectories of living beings. There may also be an argument about the observed contingency of the evolutionary process that is an argument against any kind of “intelligent” organizing principle though not against simulation itself.

Leave it to physicists to envision a test of the Bostrom hypothesis that we are living in a computer simulation. Martin Savage and his colleagues look at quantum chromodynamics (QCD) and current simulation methods for QCD. They conclude that if we are, in fact, living in a simulation, then we might observe specific inconsistencies that arise from finite computing power for the universe as a whole. Those inconsistencies would be observed in looking at the distribution of cosmic ray energies, specifically. Note that if the distribution is not unusual the universe could either be a simulation (just a sophisticated one) or could be a truly physical one (free running and not on another entity’s computational framework). It is only if the distribution is unusual that it might be a simulation.… Read the rest

Sparse Grokking

Jeff Hawkins of Palm fame shows up in the New York Times hawking his Grok for Big Data predictions. Interestingly, if one drills down into the details of Grok, we see once again that randomized sparse representations are the core of the system. That is, if we assign symbols random representational vectors that are sparse, we can sum the vectors for co-occurring symbols and, following J.R. Firth’s pithy “words shall be known by the company that they keep,” start to develop a theory of meaning that would not offend Wittgenstein.
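The idea of summing sparse random vectors for co-occurring symbols can be sketched in a few lines. This is a generic illustration in the spirit of Random Indexing, not Grok's or Numenta's implementation; the dimensionality, sparsity, window size, toy sentences, and all names are assumptions.

```python
import math
import random
from collections import defaultdict

DIM = 512      # vector dimensionality (assumed, illustrative value)
NONZERO = 8    # non-zero entries per random index vector (assumed)


def index_vector(symbol):
    """Sparse random vector for a symbol, stored as {position: +1 or -1}.

    Seeding the generator with the symbol makes the vector repeatable.
    """
    rng = random.Random(symbol)
    positions = rng.sample(range(DIM), NONZERO)
    return {pos: rng.choice((-1, 1)) for pos in positions}


def context_vectors(sentences, window=2):
    """For each word, sum the index vectors of words within `window` positions."""
    vectors = defaultdict(lambda: [0] * DIM)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                for pos, sign in index_vector(tokens[j]).items():
                    vectors[word][pos] += sign
    return vectors


def cosine(a, b):
    """Cosine similarity between two dense vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


sentences = [
    "the cat drinks milk",
    "the dog drinks milk",
    "stocks fell sharply today",
]
vectors = context_vectors(sentences)
print("cat vs dog:   ", cosine(vectors["cat"], vectors["dog"]))
print("cat vs stocks:", cosine(vectors["cat"], vectors["stocks"]))
```

Words that keep the same company ("cat" and "dog" here) end up with near-identical context vectors, with no iterative training pass at all.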

Is there anything new to Hawkins’ effort? For certain types of time-series prediction, the approach parallels artificial neural network designs but makes an end-run around their computational complexity: it replaces the shifting, multi-epoch training regimens that, in effect, build the high-dimensional distances between co-occurring events by gradually moving time-correlated data together and uncorrelated data apart. But then there is Random Indexing, which I’ve previously discussed. If one restricts Random Indexing to operating on temporal patterns, or on spatial patterns, then the results start to look like Numenta’s offering.

While there is a bit of opportunism in Hawkins latching onto Big Data to promote an application of methods he has been working on for years, there are very real opportunities for trying to mine leading indicators to help with everything from ecommerce to research and development. Many flowers will bloom, grok, die, and be reborn.… Read the rest