With the Bahrain Corpus, we aimed to create a specialized corpus of the Bahraini Arabic dialect that includes written texts as well as transcripts of audio files, spanning different genres (folktales, comedy shows, plays, cooking shows, etc.).
At the time of publication, the carefully curated corpus comprises 620K words. We also enrich the Bahrain Corpus text with automatic morphological annotations using state-of-the-art morphosyntactic disambiguation for Gulf Arabic. We validate the quality of the annotations on a 7.6K-word sample. We make the full corpus as well as the annotated sample publicly available to support researchers interested in Arabic NLP.
More details on this project can be found in Abdulrahim et al. (2022):
Abdulrahim, Dana, Go Inoue, Latifa Shamsan, Salam Khalifa and Nizar Habash. 2022. The Bahrain Corpus: A Multi-genre Corpus of Bahraini Arabic. In Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022), Marseille, France. European Language Resources Association (ELRA).