Text Analytics and Transcription Technology for Quranic Arabic

Majdi Sawalha, Claire Brierley, Eric Atwell, James Dickins

Abstract


Natural Language Processing Working Together with Arabic and Islamic Studies is a 2-year project funded by the UK Engineering and Physical Sciences Research Council (EPSRC) to study prosodic-syntactic mark-up in the Quran (Atwell et al. 2013). Tajwīd, or correct Quranic recitation, is very important in Islam. The original insight informing this project is to view tajwīd mark-up in the Quran as additional text-based data for computational analysis. This mark-up is already incorporated into Quranic Arabic script: it identifies phrase boundaries of different strengths, plus lengthened syllables denoting prosodically and semantically salient words. We have developed a grapheme-phoneme mapping scheme (Brierley et al. 2016), plus state-of-the-art software (Sawalha et al. 2014) for generating a stressed and syllabified phonemic transcription, or citation form, for each word in the entire text of the Quran, using the International Phonetic Alphabet (IPA). This canonical pronunciation tier for Classical Arabic is informed and evaluated by Arabic linguists, tajwīd scholars, and phoneticians, and is published in an open-source Boundary-Annotated Quran corpus and machine learning dataset (ibid.). We use statistical techniques such as keyword extraction to explore semiotic relationships between sound and meaning in the Quran, invoking a Saussurean-type view of the sign as ‘...a bi-unity of expression and content...’ (Dickins 2007). Our investigation entails: (i) text data mining for statistically significant phonemes, syllables, words, and correlates of rhythmic juncture; and (ii) interpretation of results from interdisciplinary perspectives: Corpus Linguistics; tajwīd science; Arabic Linguistics; and Phonetics and Phonology.
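As an illustration of the kind of keyword extraction mentioned in the abstract, the sketch below scores items (phonemes, syllables, or words) by log-likelihood keyness against a reference sample. This is a minimal sketch under stated assumptions: the abstract does not name the statistic used in the project, so log-likelihood is a common corpus-linguistics choice swapped in here, and the function names, the 3.84 threshold (p < 0.05, 1 d.f.), and the toy symbols are hypothetical rather than taken from the project's software or data.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness for one item.

    a: frequency in the study sample, b: frequency in the reference sample,
    c: size of the study sample,      d: size of the reference sample.
    Assumes c and d are both greater than zero.
    """
    # Expected frequencies if the item were equally common in both samples.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(study_items, reference_items, threshold=3.84):
    """Return (item, score) pairs over-represented in the study sample.

    The threshold is an assumed cut-off, not a value from the source.
    """
    study, ref = Counter(study_items), Counter(reference_items)
    c, d = len(study_items), len(reference_items)
    scored = []
    for item, a in study.items():
        b = ref.get(item, 0)
        # Keep only items that are relatively more frequent in the study sample.
        if a / c > b / d:
            score = log_likelihood(a, b, c, d)
            if score >= threshold:
                scored.append((item, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data only: placeholder symbols, not real transcription output.
study_sample = ["q"] * 30 + ["a"] * 70
reference_sample = ["q"] * 10 + ["a"] * 190
for item, score in keywords(study_sample, reference_sample):
    print(item, round(score, 2))
```

In practice the study sample might be the IPA transcription of one chapter and the reference sample the rest of the corpus; that partitioning is also an assumption, chosen only to show how the scoring would be applied.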


References


Atwell, E.S., Dickins, J. and Brierley, C. 2013. Natural Language Processing Working Together with Arabic and Islamic Studies. Engineering and Physical Sciences Research Council (EPSRC). EP/K015206/1. Online. Accessed: 29.06.2014. http://gow.epsrc.ac.uk/NGBOViewGrant.aspx?GrantRef=EP/K015206/1

Beckman, M. and Hirschberg, J. 1994. The ToBI annotation conventions. The Ohio State University and AT&T Bell Laboratories, unpublished manuscript. Online. Accessed September 2011. ftp://ftp.ling.ohio-state.edu/pub/phonetics/TOBI/ToBI/ToBI.6.html.

Bird, S., Klein, E. and Loper, E. 2009. Natural Language Processing with Python. Sebastopol, CA. O’Reilly Media, Inc.

Brierley, C. 2011. Prosody Resources and Symbolic Prosodic Features for Automated Phrase Break Prediction. PhD Thesis. School of Computing. University of Leeds.

Brierley, C. and Atwell, E. 2011. ‘Non-Traditional Prosodic Features for Automated Phrase-Break Prediction.’ In Journal of Literary and Linguistic Computing. (Digital Humanities 2010 Special Issue). doi: 10.1093/llc/fqr023

Brierley, C., Sawalha, M. and Atwell, E. 2012. ‘Open-Source Boundary-Annotated Corpus for Arabic Speech and Language Processing’. In Proceedings of the Language Resources and Evaluation Conference (LREC) 2012. Istanbul, Turkey. Online. Accessed: 29.06.2014. http://www.lrec-conf.org/proceedings/lrec2012/index.html

Brierley, C., Atwell, E., Rowland, C. and Anderson, J. 2013. Semantic Pathways: a novel visualization of varieties of English. In Journal of International Computer Archive of Modern and Medieval English (ICAME). Volume 37.

Brierley, C., Sawalha, M. and Atwell, E. 2014. Tools for Arabic Natural Language Processing: a case study in qalqalah prosody. To appear in Proceedings of the Language Resources and Evaluation Conference (LREC 2014). Reykjavik, Iceland.

Brierley, C., Sawalha, M., Heselwood, B. and Atwell, E. 2016. A Verified Arabic-IPA Mapping for Arabic Transcription Technology, Informed by Quranic Recitation, Traditional Arabic Linguistics, and Modern Phonetics. Journal of Semitic Studies (1), 157-186.

Dickins, J. 2007. Sudanese Arabic: phonematics and syllable structure. Wiesbaden: Otto Harrassowitz Verlag.

Dukes, K. and Habash, N. 2010. ‘Morphological Annotation of Qur’anic Arabic.’ In Proceedings of Language Resources and Evaluation Conference (LREC 2010), Valletta, Malta.

Ostendorf, M., Price, P. and Shattuck-Hufnagel, S. 1996. Boston University Radio Speech Corpus. Philadelphia. Linguistic Data Consortium.

Sawalha, M. 2011. Open-source Resources and Standards for Arabic Word Structure Analysis. PhD Thesis. University of Leeds.

Sawalha, M., Brierley, C., and Atwell, E. 2012a. ‘Predicting Phrase Breaks in Classical and Modern Standard Arabic Text.’ In Proceedings of LREC 2012: Language Resources and Evaluation Conference. Istanbul, Turkey. May 2012.

Sawalha, M., Brierley, C., and Atwell, E. 2012b. ‘Open-Source Boundary-Annotated Qur’an Corpus for Arabic and Phrase Breaks Prediction in Classical and Modern Standard Arabic Text.’ In Journal of Speech Sciences, 2.2.

Sawalha, M., Brierley, C. and Atwell, E. 2014. ‘Automatically generated, phonemic Arabic-IPA pronunciation tiers for the Boundary Annotated Qur'ān Dataset for Machine Learning (version 2.0)’. In Proceedings of the Workshop on Language Resources and Evaluation for Religious Texts (LRE-Rel2) at LREC 2014. Reykjavik, Iceland.

Taylor, L.J. and Knowles, G. 1988. ‘Manual of Information to Accompany the SEC Corpus: The machine readable corpus of spoken English.’ Accessed: January 2010.

Wells, J.C. 2002. SAMPA for Arabic. Online. Accessed: 25.04.2013. http://www.phon.ucl.ac.uk/home/sampa/arabic.htm

