Towards equitable Bangla language models: detection of stereotypical biases in Bangla word embeddings
Date
Publisher
BRAC University
Authors
Citation
Abstract
Large language models are powerful tools that can be used in a variety of tasks, including
text generation, translation, and question answering. However, language models tend
to pick up undesirable biases from their training data. Given the rapid advancement of
artificial intelligence and the widespread use of these models, we risk amplifying
social stereotypes and biases through these systems. Word embedding is a framework
that represents text as vectors, allowing it to capture the context of each word
within a large text corpus. Word embeddings are essentially the building blocks of
popular Bangla language models such as Bangla Glove, Bangla word2vec, and banglaBERT.
We observe that gender bias exists along a geometric direction in the word embedding space. Using
methods such as Principal Component Analysis, the Word Embedding Association Test, and
Masked Language Modelling, we highlight the presence of stereotypically biased associations
in the word embeddings.
Our findings have broader implications for AI research within the Bangla NLP community
and for enhancing both the viability and usability of Bangla NLP systems for downstream
tasks.
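To make the Word Embedding Association Test mentioned in the abstract more concrete, the following is a minimal sketch of the commonly used WEAT effect size, not the thesis's own implementation. The function names, the use of the sample standard deviation, and the two-dimensional toy vectors are all assumptions made for illustration; real experiments would use vectors taken from a trained Bangla embedding model.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, attrs_a, attrs_b):
    """How much closer w is to attribute set A than to attribute set B."""
    sim_a = [cosine(w, a) for a in attrs_a]
    sim_b = [cosine(w, b) for b in attrs_b]
    return mean(sim_a) - mean(sim_b)


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """WEAT effect size: difference of mean associations of the two target
    sets, normalised by the standard deviation over all target words.
    Assumption: the sample standard deviation is used here."""
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)


# Invented toy vectors (hypothetical, not taken from any real embedding).
targets_x = [[1.0, 0.1], [0.9, 0.2]]   # e.g. one group of target words
targets_y = [[0.1, 1.0], [0.2, 0.9]]   # e.g. a contrasting group
attrs_a = [[1.0, 0.0]]                 # e.g. male attribute words
attrs_b = [[0.0, 1.0]]                 # e.g. female attribute words

effect = weat_effect_size(targets_x, targets_y, attrs_a, attrs_b)
print(f"WEAT effect size: {effect:.3f}")
```

A positive value indicates that the first target set is more strongly associated with attribute set A than the second target set is; swapping the two target sets flips the sign.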
Description
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 21-24).
This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science, 2024.
Publisher Link
Type
Thesis