German news-articles database for Natural Language Processing

In recent years, the volume of news published and read online has been increasing, making news data an interesting resource for many fields of study, such as investigative journalism, the social sciences and linguistics. With this growth, manually processing the content uploaded to the web every day becomes a challenging and time-consuming task. One step towards automating it is the development of a web-scraping tool that automatically extracts news data from a specified list of news portals.

Advances in Natural Language Processing (NLP) and Deep Learning have produced state-of-the-art technologies in many branches of media technology. However, this progress has mainly been applied to the English language. There is currently a lack of publicly available datasets of German-language news articles that can be used for NLP applications. A pipeline for gathering data from German news articles, including data preparation for NLP tasks, has therefore been designed and implemented at IRT in the form of a student thesis.

News scraping pipeline

The scraping tool defines a custom spider class for each news portal of the ARD broadcasters. The framework used to gather the news data is Scrapy, a web crawling and scraping library for Python. The tool enables the collection of German news data and its further preparation for NLP tasks. The resulting database is accessed through a user interface prototype, described in the section below.
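As an illustration of this per-portal structure, a minimal Scrapy spider for one ARD portal could look like the sketch below. The class name, CSS selectors and item fields are assumptions chosen for demonstration, not the actual thesis code.

```python
import scrapy


class TagesschauSpider(scrapy.Spider):
    """Hypothetical spider for one ARD portal (selectors are illustrative)."""

    name = "tagesschau"
    allowed_domains = ["tagesschau.de"]
    start_urls = ["https://www.tagesschau.de/"]

    def parse(self, response):
        # Follow the teaser links on the front page to the article pages.
        for href in response.css("a.teaser__link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        # Emit one item per article with the fields stored in the database.
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(default="").strip(),
            "text": " ".join(response.css("article p::text").getall()),
            "date": response.css("time::attr(datetime)").get(),
        }
```

One such class per portal is sufficient, since Scrapy schedules and runs the requests of all registered spiders asynchronously.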

In order to demonstrate the usefulness of the generated dataset, two NLP tasks have been implemented: 1) automatic keyword extraction, which suggests the keywords that best describe an article and can be used for article grouping, and 2) automatic article summarisation, which provides a short (approximately three sentences) summary of each article for quick reference. The latter is the first step towards a broader and more complex NLP subject, text summarisation, which is discussed at the end of this article.
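For the keyword extraction task, a common baseline is TF-IDF weighting. The following sketch, built on scikit-learn, is only one assumption about how such a component could look, not the method actually used in the thesis.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def extract_keywords(articles, top_k=5):
    """Return the top_k highest-weighted TF-IDF terms for each article."""
    vectorizer = TfidfVectorizer(max_features=10_000)
    tfidf = vectorizer.fit_transform(articles)  # one row per article
    terms = vectorizer.get_feature_names_out()
    keywords = []
    for row in tfidf:
        weights = row.toarray().ravel()
        top = weights.argsort()[::-1][:top_k]
        keywords.append([terms[i] for i in top if weights[i] > 0])
    return keywords
```

In practice one would additionally strip German stop words and lemmatise the tokens before weighting, so that inflected forms of the same word are counted together.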

User Interface

Access to the database is provided through a user interface prototype, whose requirements have been defined based on journalistic input gathered in a survey designed specifically for this purpose. Moreover, the usability of the User Interface (UI) has been tested in a remotely moderated study with 17 participants from IRT and LMU Munich. The prototype provides functionalities such as news article visualisation and analysis, as well as convenient access to the articles by means of an advanced search that allows looking for specific news articles by entering a combination of multiple keywords, news portals and topics.
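To make the search behaviour concrete, the following hypothetical helper shows how such criteria could be combined with a logical AND; the field names (title, text, portal, topic) are assumptions for illustration, not the prototype's actual data model.

```python
def search_articles(articles, keywords=(), portals=(), topics=()):
    """Filter article records: every keyword must occur in the article,
    and portal/topic must match one of the given values (if any)."""

    def matches(article):
        text = (article["title"] + " " + article["text"]).lower()
        return (
            all(k.lower() in text for k in keywords)
            and (not portals or article["portal"] in portals)
            and (not topics or article["topic"] in topics)
        )

    return [a for a in articles if matches(a)]
```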

Here you can have a look at the UI prototype and the functionalities it offers.

Outlook

A growing database of German news articles opens up possibilities for numerous data analysis, visualisation and NLP tasks. One NLP task in particular motivated the implementation of a growing database: abstractive text summarisation.

As the amount of online news articles increases, so does the time journalists need to crawl and read all the content. This is particularly costly for time-sensitive subjects. Research in Natural Language Processing has approached automated summarisation in two main ways. The first, extractive summarisation, is based on extracting the most important sentences of a text. The extraction can rely on semantics or simply on surface features of the text. Because of this, extractive summarisation is usually easy to implement in multiple languages, and plenty of open-source models already achieve it. The second type, abstractive summarisation, is more challenging: the text is summarised using new words and concepts. For example, the sentence “Today, Russian President Putin has decided to close down bars and pubs.” could be summarised as “Moscow closes bars and pubs.”. Here, the model has to understand the meaning of “Russian President Putin” well enough to link it to “Moscow”. This is particularly complex, as it requires the model to understand the meaning of a word in its context and to link it to another word that must already be in its vocabulary. This also presents practical challenges: such a model needs to be trained on large corpora, which in turn requires high computational power.
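A frequency-based extractive summariser of the kind described above can be sketched in a few lines of plain Python. This is a naive illustration of the principle, not the summariser built in the thesis.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=3):
    """Score each sentence by the average frequency of its words in the
    whole article and keep the top-scoring sentences in original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]),
                    reverse=True)
    keep = sorted(ranked[:num_sentences])  # preserve reading order
    return " ".join(sentences[i] for i in keep)
```

An abstractive summariser, by contrast, would have to generate new sentences with a trained language model rather than select existing ones, which is where the training data and compute requirements mentioned above come in.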

Abstractive summarisation is only one of the NLP tasks achievable with the database. As the database grows, it can also be used for other tasks such as named-entity recognition, sentiment analysis and text classification.
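For instance, named-entity recognition on German text is readily available through off-the-shelf pipelines such as spaCy's de_core_news_sm model. The snippet below is a generic example of applying it to article text, not part of the described system.

```python
import spacy

# Requires: python -m spacy download de_core_news_sm
nlp = spacy.load("de_core_news_sm")

doc = nlp("Bundeskanzlerin Angela Merkel sprach am Montag in Berlin.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Angela Merkel PER", "Berlin LOC"
```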
