The data revolution of the last decade has led to an exponential growth in heterogeneous textual data. The Web contains massive amounts of heterogeneous pages, including newspaper and encyclopedic sites, as well as user-generated content such as social media and blogs. Organizations such as business corporations and governmental agencies also deal with ever-growing amounts of textual data produced for internal use. Information explosion further characterizes focused domains such as the scientific literature in various fields. For example, the complexity of biological systems and the large volume of literature describing them make it impossible for even the most capable expert groups to master the material, even within a relatively narrow domain. Converting these vast amounts of textual data into relevant information, in a form that is both human-usable and computer-readable, is crucial to exploiting the potential of the big data era.
The answer to this challenge may come from Natural Language Processing (NLP), a field that combines linguistics, cognitive science, statistical machine learning, optimization, and other areas of computer science in order to build intelligent computer systems that can understand human languages. NLP has various applications, among them machine translation, question answering, and search engines. The field is of growing interest to the scientific community due to its key role in textual data analysis, especially in large-scale heterogeneous domains such as the World Wide Web, where the data revolution takes place.
The field of NLP has, over the past two decades, come to simultaneously rely on and challenge the field of machine learning. Statistical methods now dominate NLP and have moved the field forward substantially, opening up new possibilities for exploiting data in the development of NLP components and applications. Many state-of-the-art natural language algorithms are based on supervised learning techniques. In this type of learning, a corpus of texts annotated by human experts is compiled and used to train a learning algorithm. The overlap between the data revolution in people's day-to-day lives and the statistical revolution in NLP paves the way to many interesting applications, but it also poses substantial research challenges.
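As a minimal sketch of the supervised paradigm described above, the toy example below trains a Naive Bayes text classifier on a handful of hand-labeled sentences. The corpus, the labels, and the choice of classifier are illustrative assumptions, not details drawn from the text; the point is only that the learner's knowledge comes entirely from expert-provided annotations.

```python
from collections import Counter, defaultdict
import math

# A toy annotated corpus: sentences labeled by (hypothetical) human experts.
corpus = [
    ("the movie was wonderful and moving", "pos"),
    ("a truly great and enjoyable film", "pos"),
    ("the plot was dull and the acting terrible", "neg"),
    ("a boring poorly written script", "neg"),
]

def train(corpus):
    """Estimate Naive Bayes parameters from the labeled examples."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in corpus:
        label_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def predict(text, label_counts, word_counts, vocab):
    """Choose the label maximizing log P(label) + sum of log P(word | label)."""
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            # Add-one smoothing so unseen words do not zero out the probability.
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

label_counts, word_counts, vocab = train(corpus)
print(predict("a wonderful film", label_counts, word_counts, vocab))  # → pos
```

The annotation bottleneck discussed below is visible even in this sketch: every training example required a human judgment, and the model can generalize only as far as the labeled data allows.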
Textual big data analysis is, naturally, one of the main future goals of NLP research. Given the large number of textual domains and the many languages in use, NLP should move towards a language- and domain-independent era. This, however, is very challenging because of a problem often referred to as the annotation bottleneck, which stems from the supervised nature of most successful NLP algorithms.
While supervised learning has made substantial contributions to NLP, it faces some significant challenges. Many fundamental NLP tasks that are crucial for the big data era, including information extraction and semantic web search, involve structured prediction and sequential labeling. For tasks of this nature, compiling annotated corpora is costly and error-prone due to the complex nature of the annotation. In the big data era, when multiple domains and languages are involved, this problem becomes so difficult that supervised learning is considered infeasible. In order to deal with the new challenges, the field of NLP has to move towards high-quality algorithms that do not rely on manually labeled examples but rather make sophisticated use of the abundant textual information of the big data era. This kind of technology is called unsupervised learning, and it has become increasingly dominant in NLP research over the past few years. Unsupervised NLP results are constantly improving, and in some cases these algorithms have come to outperform their supervised counterparts.
The fields of big data processing and NLP may therefore find themselves in a unique situation of mutual dependency. Will the abundant, freely available textual information provide a good substrate for high-quality unsupervised NLP algorithms? Will NLP be able to overcome the annotation bottleneck and contribute significantly to big data applications? Or will big data researchers and engineers instead find a way to develop sophisticated tools that extract useful information from text without using NLP tools that enable text understanding? Even this last possibility would not be overwhelmingly surprising, considering that leading search techniques such as Google's search algorithm do not rely on text-understanding technology but rather apply purely statistical information retrieval tools. Only the future will tell the answer to these interesting questions.