ePrints@IISc

Geographical Erasure in Language Generation

Schwöbel, P and Golebiowski, J and Donini, M and Archambeau, C and Pruthi, D (2023) Geographical Erasure in Language Generation. In: Findings of the Association for Computational Linguistics: EMNLP 2023, 06-12-2023 to 10-12-2023, Singapore, pp. 12310-12324.

PDF: Pln_ass_com_llg_emn_2023.pdf - Published Version (1MB)
Restricted to Registered users only


Large language models (LLMs) encode vast amounts of world knowledge. However, since these models are trained on large swaths of internet data, they are at risk of inordinately capturing information about dominant groups. This imbalance can propagate into generated language. In this work, we study and operationalise a form of geographical erasure, wherein language models underpredict certain countries. We demonstrate consistent instances of erasure across a range of LLMs. We discover that erasure strongly correlates with low frequencies of country mentions in the training corpus. Lastly, we mitigate erasure by finetuning using a custom objective. © 2023 Association for Computational Linguistics.
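
As an illustration of what "underpredicting certain countries" could mean in practice, the following is a minimal sketch and not the paper's actual definition or code. It assumes a reference distribution over countries (for example, population shares) and a model-assigned distribution, and flags a country as erased when the reference share exceeds the model probability by more than a threshold factor. The function names, the threshold `r`, the probability floor, and all numbers are hypothetical.

```python
import math

def erasure_set(p_ref, p_model, r=3.0, floor=1e-12):
    """Return countries whose reference share is more than r times
    the probability the model assigns (hypothetical criterion)."""
    return {
        c for c, p in p_ref.items()
        if p / max(p_model.get(c, 0.0), floor) > r
    }

def erasure_score(p_ref, p_model, r=3.0, floor=1e-12):
    """Sum p_ref * log(p_ref / p_model) over the erased countries only
    (an assumed KL-style aggregate, for illustration)."""
    return sum(
        p_ref[c] * math.log(p_ref[c] / max(p_model.get(c, 0.0), floor))
        for c in erasure_set(p_ref, p_model, r, floor)
    )

# Made-up toy distributions over three placeholder countries.
p_ref = {"A": 0.5, "B": 0.3, "C": 0.2}
p_model = {"A": 0.8, "B": 0.05, "C": 0.15}

erased = erasure_set(p_ref, p_model)
score = erasure_score(p_ref, p_model)
print(sorted(erased), round(score, 4))
```

In this toy setup only the country whose model probability falls far below its reference share is flagged; raising `r` makes the criterion stricter.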

Item Type: Conference Proceedings
Publication: Findings of the Association for Computational Linguistics: EMNLP 2023
Publisher: Association for Computational Linguistics (ACL)
Additional Information: The copyright for this article belongs to Association for Computational Linguistics (ACL).
Keywords: Internet data; Language generation; Language model; Lower frequencies; Training corpus; World knowledge; Computational linguistics
Department/Centre: Division of Mechanical Sciences > Materials Engineering (formerly Metallurgy)
Date Deposited: 04 Mar 2024 07:23
Last Modified: 04 Mar 2024 07:23
URI: https://eprints.iisc.ac.in/id/eprint/84237
