The Arabic–English Parallel Corpus of Authentic Hadith

Shatha Altammami, Eric Atwell, Ammar Alsalka

Abstract


We present a bilingual parallel corpus of Islamic Hadith, the set of narratives reporting different aspects of the Prophet Muhammad's life. The collection is extracted from the six canonical Hadith books. These books have distinctive linguistic features and patterns, which are automatically extracted and annotated using a domain-specific tool for Hadith segmentation. In this article, we present the methodology for creating the corpus of 39,038 annotated Hadiths, which will be freely available to the research community.




