Háskóli Íslands
Building a Modern Faroese POS Tagger with Encoder–Decoder Transfer
Description
Faroese is a morphologically rich North Germanic language spoken by roughly 70,000 people, and it has very limited NLP resources. This project aims to build a next-generation part-of-speech (POS) tagger for Faroese by applying techniques from the ModernBERT paper (Warner et al., 2024) in a multilingual setting. The approach has two key steps:
1. Fine-tuning: A multilingual encoder–decoder model is fine-tuned on Faroese text data to strengthen its Faroese representations.
2. Encoder extraction: The encoder component is extracted and converted into a standalone encoder-only model, which is then evaluated as a POS tagger on the Faroese Sosialurin corpus.
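The extraction step can be illustrated with a minimal sketch. It assumes a T5-style model from Hugging Face Transformers, which the project description does not specify; a tiny randomly initialised model stands in for the fine-tuned multilingual checkpoint so that nothing needs to be downloaded. The `EncoderTagger` class, the toy configuration, and the tag count are hypothetical placeholders, not the project's actual setup.

```python
import torch
from torch import nn
from transformers import T5Config, T5ForConditionalGeneration


class EncoderTagger(nn.Module):
    """Encoder taken from a seq2seq model, plus a per-token classification head."""

    def __init__(self, encoder: nn.Module, hidden_size: int, num_tags: int):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, num_tags)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        return self.head(hidden)  # shape: (batch, seq_len, num_tags)


# Assumption: a tiny random T5 stands in for the fine-tuned encoder-decoder model.
config = T5Config(vocab_size=100, d_model=32, d_kv=8, d_ff=64,
                  num_layers=2, num_decoder_layers=2, num_heads=4)
seq2seq = T5ForConditionalGeneration(config)

NUM_TAGS = 5  # placeholder; the real value depends on the corpus tagset

# Keep only the encoder stack; the decoder and LM head are left behind.
tagger = EncoderTagger(seq2seq.get_encoder(), config.d_model, NUM_TAGS)
tagger.eval()

input_ids = torch.randint(0, config.vocab_size, (2, 6))
attention_mask = torch.ones_like(input_ids)
with torch.no_grad():
    logits = tagger(input_ids, attention_mask)

print(logits.shape)
```

In a real run the head would be trained on the tagged corpus, and the encoder could either be frozen or fine-tuned together with it.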
The project builds on our existing work, in which a ScandiBERT-based tagger achieved 94–98% accuracy using constrained multi-label classification. The goal is to determine whether the transfer approach described above can surpass these results.
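To give a rough idea of what constrained multi-label classification can look like for morphological tags, here is a small sketch of one common reading: the model scores every label, the word class is chosen first, and only the feature groups compatible with that word class are decoded. The label groups, the `ALLOWED` table, and the scores below are invented for illustration and are not taken from the existing tagger or the Sosialurin tagset.

```python
# Hypothetical label groups; the real tagset is defined by the corpus.
GROUPS = {
    "word_class": ["noun", "verb", "adj"],
    "case": ["nom", "acc", "dat", "gen"],
    "number": ["sg", "pl"],
    "tense": ["pres", "past"],
}

# Hypothetical constraint table: feature groups each word class may carry.
ALLOWED = {
    "noun": ["case", "number"],
    "verb": ["number", "tense"],
    "adj": ["case", "number"],
}


def argmax_label(scores, labels):
    """Return the label with the highest score among the given labels."""
    return max(labels, key=lambda label: scores[label])


def decode(scores):
    """Turn per-label scores for one token into a consistent tag."""
    word_class = argmax_label(scores, GROUPS["word_class"])
    tag = {"word_class": word_class}
    # Only decode the feature groups that are valid for this word class.
    for group in ALLOWED[word_class]:
        tag[group] = argmax_label(scores, GROUPS[group])
    return tag


# Made-up scores for a single token.
example_scores = {
    "noun": 2.1, "verb": 0.3, "adj": 0.5,
    "nom": 0.2, "acc": 1.7, "dat": 0.1, "gen": -0.4,
    "sg": 1.2, "pl": 0.1,
    "pres": 0.9, "past": 0.8,
}

result = decode(example_scores)
print(result)  # {'word_class': 'noun', 'case': 'acc', 'number': 'sg'}
```

Note that the tense scores are ignored because the token was decoded as a noun; this is the sense in which the output is constrained to well-formed tags.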
Existing datasets and evaluation pipelines are available.
Additional information
Qualifications: This project is suitable for students or researchers interested in low-resource NLP and transformer architectures. Candidates should have experience with deep learning frameworks (PyTorch / Hugging Face), and ideally a solid understanding of transformer models. Faroese language skills are a plus but not required.