Name of participant: Neha Deshpande
Project name: Investigating Advanced Evaluation Techniques for LLM-Generated Summaries of German News Articles
Project Description: In an era of information overload, AI-generated news summaries have the potential to help readers quickly grasp key points without being overwhelmed. However, ensuring these summaries are accurate, clear, and free from misleading content is a critical challenge. This project aims to develop an advanced, automated evaluation framework for assessing Large Language Model (LLM)-generated summaries of German news articles, addressing key concerns such as factual accuracy, coherence, and relevance.
With misinformation and biased reporting on the rise, the reliability of automated news summaries is more important than ever. Many existing evaluation methods rely on human-generated reference summaries, which can introduce biases and inconsistencies. While manual evaluation offers deeper insights, it is time-consuming and difficult to scale. This project seeks to overcome these challenges by reducing dependence on human evaluations and leveraging knowledge-lean techniques such as knowledge graphs and LLM-based evaluators.
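A reference-free, LLM-based evaluator of the kind mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the German rubric prompt, the three criteria, and the helper names (`build_judge_prompt`, `parse_judge_reply`) are assumptions, and the call to the judge LLM itself is deliberately left out.

```python
# Sketch of a reference-free "LLM as judge" evaluation step. The rubric,
# criteria, and reply format are illustrative assumptions; the actual LLM
# call is omitted.

def build_judge_prompt(article: str, summary: str) -> str:
    """Compose a German rubric prompt that asks an LLM to grade a summary
    against its source article, with no human reference summary needed."""
    return (
        "Bewerte die folgende Zusammenfassung des Artikels auf einer Skala "
        "von 1 (schlecht) bis 5 (sehr gut) für: faktische Korrektheit, "
        "Kohärenz und Relevanz. Antworte nur mit drei Zahlen, "
        "kommagetrennt.\n\n"
        f"Artikel:\n{article}\n\nZusammenfassung:\n{summary}\n"
    )

def parse_judge_reply(reply: str) -> dict:
    """Parse a judge reply like '4, 5, 3' into named criterion scores."""
    scores = [int(part.strip()) for part in reply.split(",")]
    criteria = ["factual_accuracy", "coherence", "relevance"]
    return dict(zip(criteria, scores))

# Prompt construction and reply parsing (the LLM call happens in between).
prompt = build_judge_prompt("Der Bundestag hat ...", "Der Bundestag ...")
print(parse_judge_reply("4, 5, 3"))
```

Because the prompt grades the summary directly against the article, no human-written reference summary is required, which is exactly what makes the approach scale.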
One of the key focuses is identifying the most relevant metrics for evaluating summaries—covering aspects like grammatical correctness, conciseness, and factual consistency. By collecting user ratings and analyzing reader preferences, the project aims to fine-tune evaluation models that align closely with human judgment. This research will also explore how large general-purpose LLMs compare with smaller, custom-trained models, investigating whether the large models can provide accurate, scalable assessments.
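Checking how closely an automatic metric aligns with human judgment typically comes down to correlating metric scores with collected user ratings. A minimal sketch, using a hand-rolled Pearson correlation and invented example numbers (the scores and ratings below are not real project data):

```python
# Pearson correlation between automatic metric scores and human ratings,
# as a simple alignment check. All numbers below are invented examples.
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

metric_scores = [0.82, 0.55, 0.91, 0.40, 0.73]  # automatic metric output
user_ratings  = [4.5, 3.0, 5.0, 2.5, 4.0]       # collected reader ratings
print(round(pearson(metric_scores, user_ratings), 3))
```

A correlation near 1.0 would suggest the metric ranks summaries much as readers do; in practice, rank correlations such as Spearman's are also commonly reported for this purpose.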
Another crucial aspect is addressing the issue of AI “hallucinations”—instances where models generate misleading or fabricated information. The framework will incorporate mechanisms to detect and penalize such inaccuracies, ensuring summaries remain trustworthy. By focusing specifically on the complexities of the German language, the project aims to fill a significant research gap in NLP, where most studies have been centered on English-language models.
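One knowledge-lean way to detect and penalize such unsupported content is to flag terms that appear in the summary but nowhere in the source article. The sketch below uses capitalized words as a crude proxy for German nouns and named entities; a real system would use proper NER, and the heuristic, function names, and penalty weight here are all illustrative assumptions.

```python
# Knowledge-lean hallucination check: flag capitalized summary tokens
# (a crude stand-in for German nouns/named entities) that never occur in
# the source article, then penalize the summary's score per flagged term.
# The regex heuristic and penalty weight are illustrative assumptions.
import re

def hallucinated_terms(article: str, summary: str) -> set[str]:
    """Return capitalized summary tokens absent from the article."""
    def tokenize(text: str) -> set[str]:
        return set(re.findall(r"\b[A-ZÄÖÜ][a-zäöüß]+\b", text))
    return tokenize(summary) - tokenize(article)

def penalized_score(base_score: float, n_hallucinations: int,
                    penalty: float = 0.1) -> float:
    """Subtract a fixed penalty per unsupported term, floored at 0."""
    return max(0.0, base_score - penalty * n_hallucinations)

article = "Die Regierung in Berlin plant neue Klimagesetze."
summary = "Die Regierung in München plant neue Klimagesetze."
terms = hallucinated_terms(article, summary)
print(terms)  # {'München'} — the summary swapped the city
print(penalized_score(0.9, len(terms)))
```

The appeal of such checks is that they need no external knowledge base or reference summary, only the article itself, which keeps the framework scalable across models and outlets.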
Designed for generalizability, the framework will be applicable across different LLMs, such as GPT-4, Vicuna, and Mistral, making it a versatile tool for researchers, news organizations, and AI developers. Its insights will not only improve the quality of AI-generated news summaries but also contribute to broader advancements in NLP, particularly for low-resource languages.
By building a scalable, reliable, and adaptable evaluation system, this project aims to strengthen trust in AI-generated content, supporting a more informed public and a healthier digital news ecosystem.
Software Campus partners: Technische Universität Berlin and Holtzbrinck Publishing Group
Implementation period: 01.01.2025 – 31.12.2026