Here's an interesting article highlighting how Google's DeepMind trained its systems to read documents.

Link: http://observer.com/2015/06/googles-ai-firm-used-daily-mail-cnn-articles-to-teach-their-computers-to-read/


Google’s AI Firm Used Daily Mail, CNN Articles to Teach Their Computers to Read

What do their articles have in common? Bullet points.

Teaching machines to read and understand natural language has been one of the biggest challenges facing computer scientists working to advance artificial intelligence. Systems have been put to the task of reading documents and answering questions about their contents, but the lack of large-scale training and test datasets has limited their progress.

DeepMind Technologies, Google’s AI lab known for shaking the field time and time again, figured out they didn’t need to create their own dataset because the perfect one was right in front of them—the deep vault of articles published online by CNN and The Daily Mail. In a recent study, their researchers used hundreds of thousands of articles from the two media companies to teach their AI systems to read.

So why articles from The Daily Mail and CNN? It turns out their style of including concise bullet points that summarize each article is key to teaching machine reading systems to understand natural language content.

To train a computer to accomplish a task like this, researchers use a neural network. That network, however, must first learn from a carefully annotated dataset that gives the system a standard to learn from. The Daily Mail and CNN articles not only serve as examples of annotated natural language; there are also hundreds of thousands of them, making the collection an ideal training database.
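
For illustration, here is a minimal sketch (with hypothetical names and toy text) of the kind of fill-in-the-blank training pair that an article plus one of its bullet points can yield: an entity is blanked out of the summary to form a question, and the article body supplies the context needed to answer it.

```python
# Hypothetical sketch: turning an article and its bullet-point summary
# into a Cloze-style (fill-in-the-blank) question-answer pair.

def make_cloze_example(article_text, bullet_point, entity):
    """Blank out one entity in the bullet point to form a question.

    The article body is the context; the removed entity is the answer.
    """
    assert entity in bullet_point, "entity must appear in the summary"
    question = bullet_point.replace(entity, "@placeholder")
    return {"context": article_text, "question": question, "answer": entity}

article = (
    "Acme Corp announced on Tuesday that it will open a new "
    "research lab in Berlin, its first office in Germany."
)
bullet = "Acme Corp to open research lab in Berlin"

example = make_cloze_example(article, bullet, "Berlin")
print(example["question"])  # Acme Corp to open research lab in @placeholder
print(example["answer"])    # Berlin
```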

“This allows us to develop a class of attention based deep neural networks that learn to read real documents and answer complex questions with minimal prior knowledge of language structure,” the researchers wrote in their recently published paper, “Teaching Machines to Read and Comprehend.”
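
As a rough illustration of the “attention” idea in that quote (not the paper's exact architecture), such a network scores each piece of the document against a representation of the question and weights the document accordingly. The toy sketch below, using random vectors as stand-ins for learned embeddings, shows that weighting step.

```python
import numpy as np

# Toy sketch of dot-product attention: score each document token
# against a query vector, then softmax the scores into weights.
# The vectors here are random stand-ins for learned embeddings.

rng = np.random.default_rng(0)
doc_tokens = rng.normal(size=(6, 4))   # 6 tokens, 4-dim embeddings
query = rng.normal(size=(4,))          # embedding of the question

scores = doc_tokens @ query                      # one score per token
weights = np.exp(scores) / np.exp(scores).sum()  # softmax -> attention weights
summary = weights @ doc_tokens                   # weighted sum of the document

print(weights.round(3))  # where the "reader" is looking
print(summary.round(3))  # attention-weighted document representation
```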

In total, 110,000 articles from CNN and 218,000 articles from The Daily Mail were used. A collection of annotated natural-language examples this vast is exactly what researchers have needed. Previous attempts to teach systems to read and understand natural language relied on small, hand-built collections, which were ultimately too small to train the necessary neural networks.

This news isn’t far off from the recent announcement that researchers from Microsoft and the University of Science and Technology of China built a deep learning machine that outperforms the average human on the verbal reasoning questions of an IQ test. Computers have never been particularly successful at problems involving analogies, classifications, synonyms, antonyms and multiple word meanings, but that is changing.

Understanding natural language has always been one of the main areas where AI skeptics expected computer scientists to hit a wall, but recent progress is proving such advances possible.