Semantic Classification of Sentences Using SMOTE and BiLSTM

Irvan Tanjung, Rid Ilyas, Melina Melina

Abstract


A paraphrase is a sentence re-expressed with a different word arrangement without changing its meaning (semantics). Measuring the semantic proximity of a pair of citation sentences that paraphrase each other requires a computational model. Classification often runs into the class imbalance problem, in which the data are unevenly distributed across classes: some classes have fewer examples (minority) and others have more (majority). Imbalanced real-world data can degrade the performance of classification methods. One way to deal with this is SMOTE, an over-sampling method that generates synthetic data from existing minority-class samples until the minority class has as much data as the majority class. This study applied SMOTE to the classification of semantic proximity of citation sentence pairs, used Word2Vec to convert words into vectors, and used a BiLSTM model for learning. The research was conducted through 8 scenarios that differ in the data used, the choice of learning model, and whether SMOTE was applied. The results showed that scenarios using data from previous research together with the BiLSTM model and SMOTE provided the best accuracy and performance.
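The over-sampling idea described in the abstract can be illustrated with a minimal sketch. It follows the general SMOTE recipe (interpolating between a minority sample and one of its nearest minority neighbours) and is not the authors' implementation; the function name `smote`, the parameters `n_new`, `k`, and `seed`, and the toy two-dimensional vectors are all assumptions made for illustration. In the study, the inputs would be sentence-pair features rather than hand-written points.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Return n_new synthetic points interpolated between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy data (hypothetical values): 6 majority vectors, 3 minority vectors.
majority = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.1, 0.0), (0.0, 0.2)]
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]

# Generate enough synthetic points to match the majority class size.
new_points = smote(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because each synthetic point lies on a segment between two existing minority samples, the new data stay inside the region the minority class already occupies instead of being exact copies.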

Keywords


BiLSTM, imbalance class, semantics, SMOTE, Word2Vec


DOI: https://doi.org/10.46336/ijqrm.v5i3.750

Copyright (c) 2024 Irvan Tanjung, Rid Ilyas, Melina Melina

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Published By: 

IJQRM: Jalan Riung Ampuh No. 3, Riung Bandung, Kota Bandung 40295, Jawa Barat, Indonesia

 
