Understanding Large Language Models
We offer two sub-themes focusing on furthering our understanding of large language models: evaluating various aspects of LLM training, and dissecting the linguistic and factual knowledge stored within these models.
NB! Applicants are asked to apply for one of these sub-themes. Please indicate clearly which sub-theme you have chosen for your proposal by using one of the codes LTG1 or LTG2.
Mentoring and an internship will be offered by a relevant external partner.
Theme LTG1. Evaluating large language models: model architectures, training regimes, and data selection
Contact person: Yves Scherrer
Keywords: large language models, evaluation, benchmarking
Research group: Language Technology Group (LTG)
Department of Informatics
In recent years, (generative) large language models have become the core foundation models for a wide range of traditional NLP tasks, and they have also seen widespread adoption by the general public. At the same time, little is known about the specific training setups of commercial models, and some design decisions (in terms of model architecture, training regime, and data selection) are based on tradition rather than on empirical or theoretical considerations. Moreover, most current LLMs rely heavily on English training and evaluation data, and their performance on non-English languages remains difficult to assess. Potential candidates are expected to formulate their research project within the broad area of LLM evaluation. Examples of research topics are given below.
Methodological research topics:
- Compare fine-tuning external pre-trained LLMs with training language-specific LLMs from scratch.
- Compare encoder-decoder LLMs with decoder-only LLMs.
- Evaluate generative LLMs on various text generation tasks, such as summarization, simplification, and text normalization (a minimal sketch of such an evaluation is given after this list).
- Assess the multilingual (e.g. machine translation) and cross-lingual transfer capabilities of LLMs.
- Investigate how closely related low-resource languages are best accommodated in LLMs.
- Develop benchmarking datasets for LLM evaluation.
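To make the flavour of this work concrete, the sketch below shows a minimal evaluation loop of the kind mentioned in the list: a generative LLM is prompted to summarize a few texts and scored against reference summaries with ROUGE. The model (gpt2), the prompt format, and the toy data are illustrative assumptions, not part of the project description; an actual project would use an established benchmark and stronger models.

```python
# Minimal sketch: score a generative LLM on abstractive summarization with ROUGE.
# Model, prompt format, and data are placeholders chosen for illustration only.
from transformers import pipeline
import evaluate

# Toy source/reference pairs; a real study would load an established benchmark.
sources = [
    "The city council met on Tuesday and approved the new cycling plan for 2025.",
    "Researchers released a multilingual corpus covering fifty low-resource languages.",
]
references = [
    "Council approves cycling plan.",
    "New corpus covers fifty low-resource languages.",
]

# Any generative model could be plugged in here; gpt2 keeps the example small.
generator = pipeline("text-generation", model="gpt2")

predictions = []
for text in sources:
    prompt = f"Summarize: {text}\nSummary:"
    output = generator(prompt, max_new_tokens=30, return_full_text=False)
    predictions.append(output[0]["generated_text"].strip())

# ROUGE is one standard surface-overlap metric for generation quality.
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))
```

The same loop generalizes to the other topics above by swapping the model, the prompt, the benchmark data, and the metric (e.g. BLEU or chrF for machine translation, SARI for simplification).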
Theme LTG2. Separation of Linguistic and Factual Knowledge in Large Language Models
Contact person: Andrey Kutuzov
Keywords: large language models, machine learning, natural language processing, computational linguistics
Research group: LTG/Integreat
Department of Informatics
During training, modern large language models learn both language structure and knowledge about the world as one indivisible whole. Decoupling these two components is a challenging but extremely promising field of research. Having more control over what is stored in the model weights should make it possible to optimize the model better. In particular, it might become possible to avoid re-learning world knowledge from scratch every time a model is trained for a particular language.
The question is whether it is possible to develop an architecture in which the neural network is limited to learning linguistic structure, while knowledge about the world is stored in and retrieved from an external knowledge graph (cf. the retrieval-augmented approaches of [1, 2] and the roadmap in [3]). Will such a model focus on learning language-specific linguistic skills without spending its parameters on time-dependent and language-agnostic factual knowledge? And will such language models be less prone to hallucinations and bias?
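As a toy illustration of this division of labour, the sketch below keeps factual triples in an external, swappable knowledge store, while a template-based verbalizer stands in for the purely linguistic component; in a real system that component would be a neural language model conditioned on the retrieved triples, e.g. by prepending them to the prompt as in retrieval-augmented generation [1]. All entities, triples, and templates are invented for illustration.

```python
from typing import Optional

# External, swappable knowledge store: (subject, relation) -> object.
# In a real system this would be a proper knowledge graph, updated
# independently of the language model's weights.
knowledge_graph = {
    ("Oslo", "capital_of"): "Norway",
    ("Norway", "currency"): "Norwegian krone",
}

def retrieve(subject: str, relation: str) -> Optional[str]:
    """Look up a fact in the external store instead of in model weights."""
    return knowledge_graph.get((subject, relation))

def verbalize(subject: str, relation: str, obj: str) -> str:
    """Stand-in for the linguistic component: turn a retrieved triple into a sentence."""
    templates = {
        "capital_of": "{s} is the capital of {o}.",
        "currency": "The currency of {s} is the {o}.",
    }
    return templates[relation].format(s=subject, o=obj)

def answer(subject: str, relation: str) -> str:
    fact = retrieve(subject, relation)
    if fact is None:
        # No retrieved evidence: refuse rather than hallucinate an answer.
        return f"I have no stored fact about '{relation}' for {subject}."
    return verbalize(subject, relation, fact)

print(answer("Oslo", "capital_of"))   # Oslo is the capital of Norway.
print(answer("Oslo", "population"))   # refusal instead of a guessed number
```

Updating or correcting a fact then only requires editing the knowledge store, not retraining the model, which is one motivation behind the research questions above.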
References:
- [1] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474.
- [2] Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., Van Den Driessche, G. B., Lespiau, J.-B., Damoc, B., Clark, A., et al. (2022). Improving language models by retrieving from trillions of tokens. In International conference on machine learning, pages 2206–2240. PMLR.
- [3] Pan, S., Luo, L., Wang, Y., Chen, C., Wang, J., and Wu, X. (2023). Unifying large language models and knowledge graphs: A roadmap.