Towards equitable Bangla language models: detection of stereotypical biases in Bangla word embeddings

Mostafa, MD Tanjim

Files

16201084_CSE.pdf (243.77 KB)

Date

2024-06

Publisher

BRAC University

Authors

Mostafa, MD Tanjim

URI

http://hdl.handle.net/10361/25879

Abstract

Large language models are powerful tools that can be used in a variety of tasks, including text generation, translation and question answering. However, language models tend to pick up undesirable biases from their training data. Given the rapid advancement of artificial intelligence and the widespread use of these models, we risk amplifying social stereotypes and biases through these systems. Word embedding is a framework that represents text as vectors, allowing it to capture the context of each word within a large text corpus. Word embeddings are essentially the building blocks of popular Bangla language models such as Bangla GloVe, Bangla word2vec and BanglaBERT. We observe that gender bias exists along a geometric direction in the word embedding space. Using methods such as Principal Component Analysis, the Word Embedding Association Test and Masked Language Modelling, we highlight the presence of stereotypically biased associations in the word embeddings. Our findings have broader implications for AI research within the Bangla NLP community and for enhancing both the viability and usability of Bangla NLP systems for downstream tasks.
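
For readers unfamiliar with the two embedding-level methods named in the abstract, the following is a minimal sketch, not the thesis's implementation. It computes a Word Embedding Association Test effect size and a PCA-based bias direction on hand-made 3-dimensional toy vectors. The function names, the toy vectors, and the choice of sample standard deviation (`ddof=1`) are assumptions made for illustration; in practice the vectors would be looked up in a trained Bangla embedding model.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    """Mean cosine similarity of w to attribute set A minus that to set B."""
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """Difference of the mean associations of target sets X and Y,
    divided by the standard deviation of associations over X and Y together.
    Using ddof=1 (sample standard deviation) is an assumption of this sketch."""
    s_X = [association(x, A, B) for x in X]
    s_Y = [association(y, A, B) for y in Y]
    return float((np.mean(s_X) - np.mean(s_Y)) / np.std(s_X + s_Y, ddof=1))

def bias_direction(pairs):
    """First principal component of definitional pairs, each pair centred
    on its own midpoint before the decomposition."""
    rows = []
    for a, b in pairs:
        centre = (a + b) / 2.0
        rows.extend([a - centre, b - centre])
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(np.array(rows), full_matrices=False)
    return vt[0]

# Toy vectors (hypothetical): axis 0 plays the role of a gender axis.
A = [np.array([1.0, 0.0, 0.0]), np.array([0.9, 0.1, 0.0])]    # attribute set 1
B = [np.array([-1.0, 0.0, 0.0]), np.array([-0.9, 0.1, 0.0])]  # attribute set 2
X = [np.array([0.8, 0.5, 0.1]), np.array([0.7, 0.6, 0.0])]    # target set 1
Y = [np.array([-0.8, 0.5, 0.1]), np.array([-0.7, 0.6, 0.0])]  # target set 2

effect = weat_effect_size(X, Y, A, B)
direction = bias_direction(list(zip(A, B)))
print("WEAT effect size:", effect)
print("Bias direction:", direction)
```

A positive effect size here means the first target set sits closer to the first attribute set than the second target set does. The sign of the returned direction is arbitrary, since a principal component is only defined up to sign.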

Keywords

Large language models, Natural language processing, Gender bias, Word embeddings, Machine learning, BanglaBERT

LC Subject Headings

Natural language processing (Computer science)., Deep learning (Machine learning)., Neural networks (Computer science)., Computational linguistics., Text data mining.

Description

Cataloged from PDF version of thesis.
Includes bibliographical references (pages 21-24).
This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science, 2024.

Department

Department of Computer Science and Engineering

Type

Thesis

Collections

Thesis (Bachelor of Science in Computer Science)
