Towards equitable Bangla language models: detection of stereotypical biases in Bangla word embeddings

Publisher

BRAC University

Abstract

Large language models are powerful tools that can be used in a variety of tasks, including text generation, translation and question answering. However, language models tend to pick up undesirable biases from their training data. Given the rapid advancement of artificial intelligence and the widespread use of these models, we risk amplifying social stereotypes and biases through these systems. Word embedding is a framework that represents text as vectors, allowing a model to capture the context of each word within a large text corpus. Word embeddings are essentially the building blocks of popular Bangla language models such as Bangla GloVe, Bangla word2vec and banglaBERT. We observe that gender bias lies along a geometric direction in the word embedding space. Using methods such as Principal Component Analysis, the Word Embedding Association Test and Masked Language Modelling, we highlight the presence of stereotypically biased associations in the word embeddings. Our findings have broader implications for AI research within the Bangla NLP community and for enhancing both the viability and usability of Bangla NLP systems for downstream tasks.
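
As a rough illustration of two of the methods named in the abstract, the sketch below estimates a gender direction with Principal Component Analysis and computes a Word Embedding Association Test (WEAT) effect size. It is a minimal sketch and not the thesis's implementation: the 3-dimensional vectors, the English placeholder words, and the function names (`gender_direction`, `association`, `weat_effect_size`) are all assumptions made for illustration. A real analysis would load Bangla embeddings and Bangla word lists instead.

```python
import numpy as np

# Hypothetical toy embeddings (not from the thesis); axis 0 plays the role of gender.
toy = {
    "he": np.array([1.0, 0.2, 0.1]), "she": np.array([-1.0, 0.2, 0.1]),
    "man": np.array([0.9, 0.1, 0.3]), "woman": np.array([-0.9, 0.1, 0.3]),
    "engineer": np.array([0.6, 0.8, 0.0]), "pilot": np.array([0.5, 0.7, 0.2]),
    "nurse": np.array([-0.6, 0.8, 0.0]), "teacher": np.array([-0.5, 0.7, 0.2]),
}

def gender_direction(pairs, emb):
    """Top principal component of the pair offsets, as a unit vector."""
    rows = []
    for a, b in pairs:
        centre = (emb[a] + emb[b]) / 2.0  # centre each pair on its midpoint
        rows.extend([emb[a] - centre, emb[b] - centre])
    # The first right-singular vector of the centred matrix is the top PC.
    _, _, vt = np.linalg.svd(np.array(rows), full_matrices=False)
    return vt[0] / np.linalg.norm(vt[0])

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B, emb):
    """Mean cosine similarity of w to attribute set A minus that to set B."""
    to_a = np.mean([cosine(emb[w], emb[a]) for a in A])
    to_b = np.mean([cosine(emb[w], emb[b]) for b in B])
    return float(to_a - to_b)

def weat_effect_size(X, Y, A, B, emb):
    """Difference of mean associations of target sets X and Y, scaled by
    the sample standard deviation over all targets."""
    sx = [association(x, A, B, emb) for x in X]
    sy = [association(y, A, B, emb) for y in Y]
    return float((np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1))

X, Y = ["engineer", "pilot"], ["nurse", "teacher"]  # target word sets
A, B = ["he", "man"], ["she", "woman"]              # attribute word sets

direction = gender_direction([("he", "she"), ("man", "woman")], toy)
effect = weat_effect_size(X, Y, A, B, toy)
print("gender direction:", direction)
print("WEAT effect size:", effect)
```

In this toy setup a positive effect size means the first target set sits closer to the first attribute set than the second target set does. The sign of a principal component is arbitrary, so only the axis it spans is meaningful.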

Description

Cataloged from PDF version of thesis.
Includes bibliographical references (pages 21-24).
This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science, 2024.

Type

Thesis