Impact of Text Preprocessing on Named Entity Recognition Based on Conditional Random Field in Indonesian Text
Abstract
The text preprocessing stage of a natural language processing application removes noise and other parts of the text that do not help the analysis. Despite its potential impact on final application performance, text preprocessing has received little attention in the text analysis literature, especially for named entity recognition (NER) in Indonesian text. This paper examines the impact of text preprocessing on Indonesian NER using a baseline Conditional Random Field model, with the aim of finding the preprocessing procedures that yield the best NER performance. The contribution of various forms of text preprocessing to the recognition of named entities is assessed comparatively across three categories: people, places, and organizations. Experimental analysis of the data set shows that several combinations of preprocessing forms are useful. Rather than enabling or disabling all of them, selected combinations can significantly improve the accuracy of Indonesian NER, depending on the entity category.
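The idea of comparing combinations of preprocessing steps, rather than switching all of them on or off, can be sketched as follows. This is only an illustration under stated assumptions: the three steps, the stopword list, the BIO-style labels, and the example sentence are hypothetical, since the abstract does not list the steps or the tagging scheme actually used. Each step works on (token, label) pairs so that labels stay aligned when tokens are dropped.

```python
from itertools import combinations

# Hypothetical stopword list for illustration only, not a real resource.
STOPWORDS = {"di", "dan", "yang"}

def lowercase(pairs):
    """Lowercase every token, keeping its label."""
    return [(tok.lower(), lab) for tok, lab in pairs]

def strip_punctuation(pairs):
    """Drop tokens that contain no letter or digit."""
    return [(tok, lab) for tok, lab in pairs if any(ch.isalnum() for ch in tok)]

def remove_stopwords(pairs):
    """Drop tokens found in the (assumed) stopword list."""
    return [(tok, lab) for tok, lab in pairs if tok.lower() not in STOPWORDS]

# Assumed preprocessing steps; the paper's actual set may differ.
STEPS = {
    "lowercase": lowercase,
    "strip_punctuation": strip_punctuation,
    "remove_stopwords": remove_stopwords,
}

def step_combinations(names):
    """Yield every subset of step names, from the empty set to all steps."""
    for size in range(len(names) + 1):
        yield from combinations(names, size)

def apply_steps(pairs, combo):
    """Apply the selected steps in order to one labeled sentence."""
    for name in combo:
        pairs = STEPS[name](pairs)
    return pairs

# Hypothetical labeled sentence with assumed BIO-style tags.
sentence = [
    ("Joko", "B-PER"), ("Widodo", "I-PER"), ("tinggal", "O"),
    ("di", "O"), ("Jakarta", "B-LOC"), (".", "O"),
]

combos = list(step_combinations(list(STEPS)))
for combo in combos:
    print(combo, apply_steps(sentence, combo))
```

In an experiment of the kind the abstract describes, each combination would be used to preprocess the training and test data before fitting the CRF, and the scores would be compared per entity category.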

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.