Overview of the HASOC Subtrack at FIRE 2023: Hate-Speech Identification in Sinhala and Gujarati

Satapara, S and Madhu, H and Ranasinghe, T and Dmonte, AE and Zampieri, M and Pandya, P and Shah, N and Modha, S and Majumder, P and Mandl, T (2023) Overview of the HASOC Subtrack at FIRE 2023: Hate-Speech Identification in Sinhala and Gujarati. In: 15th Forum for Information Retrieval Evaluation, FIRE 2023, 15 December 2023 through 18 December 2023, Goa, India, pp. 344-350.

Official URL: https://www2.scopus.com/record/display.uri?eid=2-s...

Abstract

Detecting offensive and hateful content in low-resource languages poses a significant challenge due to the limited availability of benchmark datasets. It is crucial to address this gap by creating benchmark datasets tailored to these languages. This not only enhances the accuracy of detection but also provides valuable insights into the efficacy of identifying problematic content in comparison to high-resource languages. In line with this commitment to advancing research on low-resource languages, the Hate Speech and Offensive Content Identification (HASOC) shared task introduced a dedicated subtrack for Hate Speech Identification in Sinhala and Gujarati in 2023. This paper outlines the objectives of the task, discusses the characteristics of the data involved, and presents an analysis of the participants' submissions. For Task 1a, we utilized an existing Sinhala dataset (SOLD) consisting of 10,000 tweets. Meanwhile, for Task 1b, focused on Gujarati, we curated a new dataset comprising 1,020 tweets. A total of 16 teams submitted experiments for Sinhala, with the leading team achieving an impressive F1 score of 0.83. In the case of the Gujarati task, 17 teams participated, and the highest-performing team achieved an F1 score of 0.84. These results highlight the significance of tailored datasets in facilitating the effective detection of offensive content in low-resource languages. © 2023 Copyright for this paper by its authors.

Item Type: Conference Paper
Publication: CEUR Workshop Proceedings
Publisher: CEUR-WS
Additional Information: The copyright for this article belongs to ACM SIGIR Special Interest Group on Information Retrieval.
Keywords: Deep learning; Fires; Benchmark; Content identifications; Evaluation; Hate speech; Language resources; Low resource languages; Social media; Social NLP; Speech identification; Social networking (online)
Department/Centre: Division of Electrical Sciences > Electrical Communication Engineering
Date Deposited: 24 Sep 2024 07:02
Last Modified: 24 Sep 2024 07:02
URI: http://eprints.iisc.ac.in/id/eprint/85225
