Evaluation of Arabic Named Entity Recognition Models on Sahih Al-Bukhari Text

Ibtisam Khalaf Alshammari, Eric Atwell, Mohammad Ammar Alsalka

Abstract


In this paper, four Arabic named entity recognition (ANER) models were applied to the Sahih Al-Bukhari (صحيح البخاري) dataset: CAMeLBERT-CA, Hatmimoha, Marefa-NER, and Stanza. The main aim of this study is to identify the best-performing model for use with other Hadith datasets. The Stanza and Marefa-NER models performed best, obtaining F1-scores of 0.826191 and 0.807396, respectively. A new test dataset of approximately 5,000 words was then created based on the CANERCorpus annotation. When the four models were evaluated on this new test dataset, all of them obtained disappointing F1-scores, although Hatmimoha had the best results. The low scores likely result from the small size of the dataset. However, we observed that when a model has many named entity classes and its labels match the CANERCorpus labels, it can achieve high performance, as the Hatmimoha and Marefa-NER models did.
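For readers unfamiliar with the metric, the following is a minimal sketch of how an F1-score can be computed from gold and predicted labels. It assumes token-level, micro-averaged scoring over all labels other than the outside tag "O"; the abstract does not state which scoring scheme the study used, and the function name `token_f1` and the labels "PER" and "LOC" are hypothetical, chosen only for illustration.

```python
def token_f1(gold, pred, outside="O"):
    """Return (precision, recall, F1), micro-averaged over entity tokens.

    Assumption: one label per token, with `outside` marking tokens that
    belong to no entity. This is an illustrative scheme, not necessarily
    the one used in the study.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != outside and p == g:
            tp += 1          # correct entity label
        else:
            if p != outside:
                fp += 1      # predicted an entity label that is wrong
            if g != outside:
                fn += 1      # missed (or mislabeled) a gold entity token

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical five-token example.
gold = ["PER", "O", "LOC", "O", "PER"]
pred = ["PER", "O", "O", "LOC", "LOC"]
precision, recall, f1 = token_f1(gold, pred)
print(precision, recall, f1)
```

Under this scheme a token given the wrong entity class counts as both a false positive and a false negative, which is one reason a mismatch between a model's label set and the annotation scheme of the test data can lower the score.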



