Google Translate has long been a useful tool for getting awkward gists of short texts. The method was based on building a phrase-based statistical translation model. To do this, you gather up “parallel” texts, which are existing human translations. You then “align” them by finding the most likely corresponding phrases in each sentence or set of sentences. Between languages, more or fewer sentences are often used to express the same ideas. Once you have that collection of phrasal translation candidates, you can guess the most likely translation of a new sentence by looking up the sequence of likely phrase groups that corresponds to it. IBM was the progenitor of this approach in the late 1980s.
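To make the lookup idea concrete, here is a minimal Python sketch of greedy phrase-table decoding. The German-to-English phrase table and its probabilities are invented for illustration; a real system learns millions of entries from aligned corpora and layers reordering and language models on top.

```python
# Toy phrase-based lookup translation. The phrase table maps a source
# phrase to candidate translations with hypothetical probabilities
# p(target | source); real tables are learned from aligned corpora.
phrase_table = {
    ("das", "haus"): [("the house", 0.8), ("the home", 0.2)],
    ("ist",): [("is", 0.9), ("exists", 0.1)],
    ("klein",): [("small", 0.7), ("little", 0.3)],
}

def translate(tokens):
    """Greedy decoding: scan left to right, always taking the longest
    source phrase found in the table and its most probable translation."""
    output, i = [], 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):  # try the longest span first
            phrase = tuple(tokens[i:j])
            if phrase in phrase_table:
                best, _ = max(phrase_table[phrase], key=lambda c: c[1])
                output.append(best)
                i = j
                break
        else:
            output.append(tokens[i])  # unknown word: pass it through
            i += 1
    return " ".join(output)

print(translate(["das", "haus", "ist", "klein"]))  # -> "the house is small"
```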
It’s simple and elegant, but it was always criticized for telling us very little about language. Other methods, using techniques like interlingual transfer and parsers, showed a more linguist-friendly face. In these methods, the source language is parsed into a parse tree, and that tree is then converted into a generic representation of the sentence’s meaning. A generator then uses that representation to render a surface form in the target language. The interlingua is supposed to resemble the deep semantic representations of linguistic theory, though the computer science versions tended to look a lot like ontological representations with fixed meanings. Flexibility was never the strong suit of these approaches, but their flaws ran much deeper than that.
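As a caricature of that pipeline, here is a toy parse-interlingua-generate sketch. The pattern-matching “parser,” the ontology symbols, and the lexicons are all invented for illustration, and the rigidity on display is exactly the inflexibility just described.

```python
# A deliberately tiny caricature of the parse -> interlingua -> generate
# pipeline. Everything here is hypothetical; real systems required
# broad-coverage parsers and vast ontologies that never materialized.

def parse(sentence):
    """Pretend parser: maps a fixed subject-verb-object pattern to a frame."""
    subj, verb, obj = sentence.split()
    return {"predicate": verb, "agent": subj, "patient": obj}

def to_interlingua(frame, lexicon):
    """The 'interlingua': a language-neutral frame of fixed ontology symbols."""
    return {role: lexicon.get(word, word) for role, word in frame.items()}

def generate(meaning, target_lexicon, order):
    """Pretend generator: renders the frame in the target word order."""
    return " ".join(target_lexicon[meaning[role]] for role in order)

en_to_concepts = {"dog": "CANINE", "bites": "BITE-EVENT", "man": "HUMAN-MALE"}
concepts_to_de = {"CANINE": "Hund", "BITE-EVENT": "beißt", "HUMAN-MALE": "Mann"}

frame = parse("dog bites man")
meaning = to_interlingua(frame, en_to_concepts)
print(generate(meaning, concepts_to_de, ["agent", "predicate", "patient"]))
# -> "Hund beißt Mann"
```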
For one, nobody was ever able to build a robust parser for any language. Next, no ontology was ever vast enough to accommodate the rich productivity of real human language. Generators, being the inverse of parsers, remained toy projects in the computational linguistics community. And, at the end of the day, no functional systems were built.
Instead, the statistical methods plodded along but had their own limitations. For instance, the translation of a never-before-seen sentence built from never-before-seen phrases is the null set: there is simply no candidate to propose. Rare and unusual words are problematic too, because their very low probabilities get swamped by well-represented candidates that lack the nuances of the rarer forms. The model doesn’t care, of course; the probabilities rule everything. So you need more and more data. But then noisy data gets mixed in with the good data and distorts the probabilities. And you still have to handle completely new words and groupings, like proper nouns and numbers, that arise from the unique productivity of these classes of forms.
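A few lines of arithmetic, with hypothetical phrase probabilities, show both failure modes: an unseen phrase zeroes out a candidate entirely, and a rare phrase scores so low that common alternatives swamp it.

```python
import math

# Hypothetical phrase probabilities; a candidate's score is the product
# of its phrase probabilities (summed in log space for stability).
phrase_probs = {
    ("the", "house"): 0.8,
    ("is", "small"): 0.7,
    ("quaintly", "diminutive"): 0.0001,  # rare phrasing, barely attested
}

def score(phrases):
    """Log-probability of a candidate segmented into known phrases.
    Any never-before-seen phrase zeroes out the whole candidate."""
    total = 0.0
    for p in phrases:
        prob = phrase_probs.get(p, 0.0)
        if prob == 0.0:
            return float("-inf")  # unseen phrase: the null set
        total += math.log(prob)
    return total

print(score([("the", "house"), ("is", "small")]))             # finite score
print(score([("the", "house"), ("quaintly", "diminutive")]))  # tiny: swamped
print(score([("the", "house"), ("never", "seen")]))           # -inf: no candidate
```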
So, where to go from here? For Google, with its recent commitment to Deep Learning, the answer was to apply deep neural network approaches. The effort threw every little advance of recent history at the problem, to pretty good effect. For instance, to cope with novel and rare words, they broke the input text up into sub-word letter groupings. The segmentation was itself based on a learned model of the most common break-ups of terms, though the pieces didn’t necessarily correspond to syllables or other common linguistic expectations. Sometimes they also used character-level models. The models were then combined into an ensemble, a common way of overcoming brittleness and overtraining on subsets of the data set. They used GPUs, as well as reduced-precision arithmetic, to speed up the training of the models. They also used an attention-based intermediary between the encoder layers and the decoder layers, letting the decoder weight the most relevant parts of the source sentence and limiting the influence of less relevant context.
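As a simplified illustration of the sub-word idea, here is a greedy longest-match segmenter over a hypothetical vocabulary. The actual wordpiece model learns both its units and its segmentation from corpus statistics, which is exactly why the pieces need not respect syllable boundaries.

```python
# Sketch of wordpiece-style subword segmentation: greedily match the
# longest vocabulary unit. This vocabulary is invented; a real one is
# learned from the most common break-ups of terms in the corpus.
vocab = {"trans", "lat", "ion", "un", "believ", "able", "s"}

def segment(word, vocab):
    """Greedy longest-match segmentation into subword units."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # longest substring first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to single characters
            i += 1
    return pieces

print(segment("translation", vocab))   # -> ['trans', 'lat', 'ion']
print(segment("unbelievable", vocab))  # -> ['un', 'believ', 'able']
print(segment("translations", vocab))  # -> ['trans', 'lat', 'ion', 's']
```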
The results improved translation quality by as much as 60% over the baseline phrase-based approach and, interestingly, came close to the performance of the average human translator. Is this enough? Not at all. You are not going to translate poetry this way any time soon. The productivity of human language and the open classes of named entities remain a barrier. The subtleties of pragmatics may still vex any data-driven approach, at least until a few examples show up in the corpora. And a multi-sensory model might need to be merged somehow with the purely linguistic one to help rank some translation candidates. For instance, knowing the way objects fall could help move a translation from “plummeted to the ground” to “settled to the ground.”
Still, data-driven methods continue to reshape the intelligent machines of the future.