Abstract
In this project, we assist a research group at the Center for Computing in Science Education that specialises in qualitative analysis. The project is affiliated with the Center for Interdisciplinary Education, where the group employs traditional qualitative data analysis techniques alongside advanced Natural Language Processing (NLP) methods. Their objective is to uncover themes and trends in text data effectively. Because manual analysis is inherently time-consuming, NLP and machine learning alternatives are explored. The research group already has strong results in applying general-purpose language models in this field, and we have assisted in further exploring smaller, domain-adapted models. In addition, we have provided the group with a pipeline for efficient text extraction from the diverse document formats in which their data are stored.
Background
Language models have immense potential for enhancing qualitative text analysis, offering tools to automate the categorization of text and the extraction of meaningful themes from large and complex datasets. By capturing semantic structure, they facilitate deeper insights and enable more efficient data processing than traditional methods alone. For further background on this topic, see:
Tor Ole Bigton Odden’s profile and Research Section for Physics Education.
A common dilemma in applying language models is the choice between large general-purpose models and smaller models tuned to a specific domain. Our assistance in this project concerns the latter approach.
Methodology
Word embeddings are numerical representations of semantic content that can be distinguished by theme and topic in the embedding space. To illustrate this, we used GPT to create example sentences for three topics and generated embeddings for each using the “raw” SciBERT model. As the plot below demonstrates, texts from different topics form well-separated clusters in the embedding space. Each embedding is a vector of size 768; for plotting, the t-SNE algorithm projected the dataset to 2D, so that nearby points in the plot are also nearby in the high-dimensional embedding space.
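The projection step can be sketched as follows. This is a minimal illustration, not the project's plotting code: synthetic 768-dimensional vectors stand in for real SciBERT embeddings, and the topic names and cluster parameters are invented for the example.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder embeddings standing in for SciBERT output: in the real
# pipeline, each 768-dimensional vector is the embedding of one example
# sentence. Here we draw three synthetic clusters, one per topic, so
# the script runs without downloading the model.
rng = np.random.default_rng(0)
topics = ["mechanics", "electromagnetism", "thermodynamics"]  # illustrative
embeddings = np.vstack([
    rng.normal(loc=10.0 * i, scale=1.0, size=(5, 768)) for i in range(3)
])
labels = [t for t in topics for _ in range(5)]  # used to colour the plot

# t-SNE projects the 768-dimensional vectors to 2D while preserving
# local neighbourhoods, so points close in embedding space stay close.
tsne = TSNE(n_components=2, perplexity=5, random_state=0)
points_2d = tsne.fit_transform(embeddings)

print(points_2d.shape)  # one 2D point per sentence
```

The 2D coordinates can then be scatter-plotted with one colour per topic label to reveal the clusters.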
The main task is to further fine-tune a pre-trained model using scientific articles from the domain of Physics Education. The objective is to obtain contextually trained embeddings that capture the subtle semantic nuances unique to this specialized field. We have implemented a pipeline based on Hugging Face's pre-trained models. As illustrated in the diagram below, we start with the pre-trained BERT model and refine it using domain-specific samples, adapting it to the field of interest. The training pipeline achieves satisfactory performance on the NVIDIA A100 GPUs of Educloud's Fox computing cluster.
Monitoring standard training metrics, such as the training and validation losses, allows for routine oversight, with hyperparameters adjusted based on these metrics. However, fine-tuning poses a challenge: balancing domain adaptation against retention of the pre-trained model's general knowledge. We have therefore implemented LoRA (Low-Rank Adaptation) as an option. Briefly, the idea behind LoRA is to let the model adapt only within a subspace of the full space of model weights, thereby conserving some of the pre-training. The subspace size is an adjustable input parameter of the pipeline. Qualitative analysis adds one further complication: validating the qualitative capabilities of the fine-tuned model requires human input, or at least comparison against a nontrivial competing model. To facilitate this, we frequently save model "checkpoints" during training, i.e. snapshots of the model weights at that point. These checkpoints enable validation of the model at various stages of fine-tuning.
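The core idea behind LoRA can be sketched in miniature with NumPy. In this toy example, which is not the pipeline's implementation, a frozen weight matrix W receives a trainable low-rank update B @ A, so learning happens in an r-dimensional subspace rather than the full parameter space; the layer sizes and rank are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 768, 768, 8   # layer dimensions and adaptation rank
                        # (r corresponds to the pipeline's subspace-size parameter)

W = rng.normal(size=(d, k))  # pre-trained weights, kept frozen
B = np.zeros((d, r))         # initialised to zero, so the update starts at nil
A = rng.normal(size=(r, k))  # only A and B are trained

W_adapted = W + B @ A        # effective weights during fine-tuning

# Trainable parameters shrink from d*k to r*(d + k):
full_params = d * k          # 589824
lora_params = r * (d + k)    # 12288
print(full_params, lora_params)
```

Because B starts at zero, the adapted model is initially identical to the pre-trained one, and the small rank r keeps the update confined to a low-dimensional subspace of the weights.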