Research project: Towards Trustworthy Large Language Models

Large language models (LLMs) power today’s AI assistants, yet they often behave like black boxes. This project explores why LLMs can be unreliable and how explainability methods can make them more transparent, trustworthy, and sustainable.

Illustration: Korbinian Randl.

Large language models (LLMs) such as Llama and GPT-5 have become a key technology behind chatbots, content creation, and software assistance. Despite their impressive abilities, these systems often act as “black boxes”: they produce confident answers while giving the user no insight into their internal “reasoning”.

Modern LLMs generate text one token at a time by predicting what comes next, rather than by reasoning over facts. As a result, they may produce plausible-sounding outputs even when they lack relevant information.
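
To make this prediction loop concrete, below is a minimal sketch of autoregressive generation using the Hugging Face transformers library. The choice of GPT-2, the example prompt, and plain greedy decoding are illustrative assumptions, not the setup studied in this project.

```python
# Minimal sketch: an LLM generates text one token at a time by repeatedly
# scoring every vocabulary token and appending the most likely one (greedy
# decoding). The model (gpt2) and prompt are illustrative choices only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("The capital of Sweden is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(10):                       # generate 10 tokens, one per step
        logits = model(ids).logits            # scores for every vocabulary token
        next_id = logits[0, -1].argmax()      # pick the single most likely token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(ids[0]))
```

Note that nothing in this loop consults facts or checks correctness: each step simply selects whichever token the model scores as most probable, which is why fluent but unsupported answers can arise.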

Their massive size and training on largely uncurated internet data further amplify issues such as bias, hallucinations, and environmental cost. This is especially problematic in safety-critical areas such as healthcare or law, where mistakes can have serious consequences.

This research project investigates why current LLMs struggle with trustworthiness and how we can better understand their behavior. By combining technical insight with human-understandable explanations, we aim to move LLMs from impressive but opaque tools toward systems that can be meaningfully trusted and responsibly deployed.

This is Korbinian Randl’s PhD project. Tony Lindgren is the main supervisor; Aron Henriksson and John Pavlopoulos are co-supervisors.

The project was funded from February 2023 to March 2026 by EFRA, a research and innovation project under the European Union’s Horizon Europe programme.

More about EFRA – Extreme Food Risk Analytics

Members

John Pavlopoulos

Department of Informatics, Athens University of Economics and Business, Greece

Publications

Randl, K., Pavlopoulos, J., Henriksson, A., and Lindgren, T. (2024).
CICLe: Conformal In-Context Learning for Largescale Multi-Class Food Risk Classification. In: Findings of the Association for Computational Linguistics: ACL 2024, pages 7695–7715, Bangkok, Thailand. Association for Computational Linguistics.
Read the article

Randl, K., Pavlopoulos, J., Henriksson, A., and Lindgren, T. (2025).
Evaluating the Reliability of Self-explanations in Large Language Models. In: Pedreschi, D., Monreale, A., Guidotti, R., Pellungrini, R., Naretto, F. (eds) Discovery Science. DS 2024. Lecture Notes in Computer Science, vol. 15243. Springer, Cham.
Read the article

Randl, K., Pavlopoulos, J., Henriksson, A., and Lindgren, T. (2025).
Mind the gap: from plausible to valid self-explanations in large language models. Machine Learning 114, 220.
Read the article

Randl, K., Rocchietti, G., Henriksson, A., Abedjan, Z., Lindgren, T., and Pavlopoulos, J. (2026).
RAG-E: Quantifying Retriever-Generator Alignment and Failure Modes.
Read the article
