The eight finalists of the Enlighten Your Research 4 contest (EYR4) each wrote a blog post about the project they proposed. This week: the post from Piek Vossen (VU University Amsterdam).
Thousands of articles a day
Thousands of news articles are published daily presenting us with new events or providing updates to events reported earlier. Decision and policy makers in companies and government alike need to be aware of what is going on in order to make informed decisions. It is impossible to keep track of the vast amount of information coming in daily, especially if relevant information stretches over long periods of time, such as the financial crisis. New information may need to be linked to articles published months or even years earlier in order to get a complete picture of the current situation.
Linking today’s news to existing information
In NewsReader, we aim to extract information about events automatically and link today’s news streams to information collected from earlier news articles and complementary resources such as encyclopedias or company profiles. To this end, we apply state-of-the-art language technology to process texts from daily incoming news in English, Spanish, Italian and Dutch.
Scaling up linguistic processing
Processing one article typically takes about 6 minutes on one standard machine; we’re looking at approximately 1 million new articles per day for English only, with a backlog of 35 million articles from previous years. One of the main challenges in this project lies in scaling up linguistic processing and maximising the usage of computational resources to manage the daily stream of incoming information.
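To give a feel for the scale, the figures above can be turned into a rough back-of-the-envelope estimate. The sketch below is only illustrative: it assumes perfect parallelism, machines that run 24 hours a day and no scheduling or I/O overhead, none of which will hold exactly in practice. The function names and the figure of 1,000 spare machines are hypothetical and chosen purely for illustration.

```python
import math

# Figures taken from the text above.
MINUTES_PER_ARTICLE = 6        # about 6 minutes per article on one standard machine
ARTICLES_PER_DAY = 1_000_000   # roughly 1 million new English articles per day
BACKLOG_ARTICLES = 35_000_000  # backlog of articles from previous years

MINUTES_PER_DAY = 24 * 60


def machines_for_daily_stream(articles_per_day, minutes_per_article):
    """Machines needed to keep up with the daily stream.

    Assumes perfect parallelism and machines that are busy around the clock.
    """
    total_minutes = articles_per_day * minutes_per_article
    return math.ceil(total_minutes / MINUTES_PER_DAY)


def days_to_clear_backlog(backlog, spare_machines, minutes_per_article):
    """Days needed for a set of extra machines to work through the backlog."""
    articles_per_machine_per_day = MINUTES_PER_DAY / minutes_per_article
    return backlog / (spare_machines * articles_per_machine_per_day)


if __name__ == "__main__":
    machines = machines_for_daily_stream(ARTICLES_PER_DAY, MINUTES_PER_ARTICLE)
    print(f"Machines needed for the daily English stream: {machines}")

    # Hypothetical: 1,000 additional machines dedicated to the backlog.
    days = days_to_clear_backlog(BACKLOG_ARTICLES, 1000, MINUTES_PER_ARTICLE)
    print(f"Days to clear the backlog with 1,000 spare machines: {days:.1f}")
```

Under these simplifying assumptions, keeping up with the English stream alone would occupy more than four thousand standard machines around the clock, which is why reducing the processing time per article matters as much as adding hardware.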
Lots of questions to be answered…
As part of the Enlighten Your Research Challenge we address questions such as the following: How can we optimise our linguistic pipeline, which consists of 18 modules that all interact with each other in different ways? What resources are needed to support processing all the news articles coming in every day? How should we store information from previous news articles in such a way that it can easily interact with new data coming in?
Answering these questions provides the basis to build a ‘history recorder’ that follows the news, stores it and can directly link new events to events in the past. As such, it will provide the complete story line to decision makers, pointing out forgotten details and possibly even finding new links between current situations and events that happened in the past.
Computational Lexicology & Terminology Lab (CLTL)
The Network Institute, VU University Amsterdam
NewsReader is supported by the European Union’s 7th Framework Programme via the NewsReader Project (ICT-316404).