Sunnah Arabic Corpus: Design and Methodology

Abdulrahman Alosaimy, Eric Atwell

Abstract


Sunnah Arabic Corpus is an annotated linguistic resource consisting of 144K words/170K tokens of Hadith narratives (utterances attributed to the Prophet Mohammed) extracted from the book Riyāḍu Aṣṣāliḥīn. As a first layer of annotation, the corpus has been fully diacritized. In addition, each orthographic word/token is segmented into its syntactic words, and each syntactic word is tagged with its part of speech and multiple morphological features. Several translations of the hadiths into different languages are provided and aligned at the narrative/paragraph level. Sunnah Arabic Corpus follows the standards of the successful Quranic Arabic Corpus (corpus.quran.com). Sunnah Arabic Corpus is freely available under the Creative Commons Attribution-ShareAlike 4.0 International License.



