Identification and Clustering of Template Texts in the Large Arrays of Messages

Authors: Vishnyakov I.E., Ivanov I.P., Karkin I.A. Published: 24.12.2022
Published in issue: #4(141)/2022  
DOI: 10.18698/0236-3933-2022-4-20-35

Category: Informatics, Computer Engineering and Control | Chapter: System Analysis, Control, and Information Processing  
Keywords: pattern identification, text clustering, big data


A lot of services are using short messages for various purposes today, for example, stores are sending promotional offers, and EMERCOM of Russia informs population in the event of a threat of natural and technogenic emergencies. Selecting short texts of the template messages from general traffic could be used to filter spam and mailings, as well as to protect users from fraudulent activities. Such arrays of messages are often reaching such a large size that their storage and processing on a single dedicated personal computer or server becomes practically impossible. This work aims at developing approaches to the efficient identification and clustering of the template texts from large arrays of the short messages using the Apache Spark framework for distributed processing of the unstructured data. Main approaches to identifying templates and clustering textual information are considered. Approaches were developed making it possible to cluster in large arrays of messages using distributed computation without preliminary acquisition of the text vector representations. Algorithms are provided for efficient identification of the template messages from large arrays of short texts. Algorithms were compared in terms of performance and quality of pattern identification

Please cite this article in English as:

Vishnyakov I.E., Ivanov I.P., Karkin I.A. Identification and clustering of template texts in the large arrays of messages. Herald of the Bauman Moscow State Technical University, Series Instrument Engineering, 2022, no. 4 (141), pp. 20--35 (in Russ.). DOI: https://doi.org/10.18698/0236-3933-2022-4-20-35


[1] Kirichenko K.M., Gerasimov M.B. [Review on text information clustering methods]. Dialogue 2001. Moscow, 2001 (in Russ.). Available at: https://www.dialog-21.ru/digest/2001/articles/kirichenko (accessed: 20.12.2021).

[2] Apache Spark: website. Аvailable at: https://spark.apache.org (accessed: 20.12.2021).

[3] Mikolov T., Chen K., Corrado G., et al. Distributed representations of words and phrases and their compositionality. Proc. 26th NIPS, 2013, vol. 2, pp. 3111--3119.

[4] Ogurtsov A.N. Osnovy bioinformatiki [Basics of bioinformatics]. Kharkov, KhPI Publ., 2013.

[5] Hartigan J.A., Wong M.A. A K-means clustering algorithm. J. R. Stat. Soc. Ser. C Appl. Stat., 1979, vol. 28, pp. 100--108. DOI: https://doi.org/10.2307/2346830

[6] Tan P.N., Steinbach М.М., Kumar V. Introduction to data mining. New York, Addison Wesley, 2006.

[7] Gusfield D. Algorithms on strings, trees, and sequences.‎ Cambridge, Cambridge University Press, 1997.

[8] Ukkonen E. On-line construction of suffix trees. Algorithmica, 1995, vol. 14, no. 3, pp. 249--260. DOI: https://doi.org/10.1007/BF01206331

[9] Sabbir A., Mohammad R.R., Motaher H., et al. An improved frequent pattern mining algorithm using suffix tree & suffix automata. Bangladesh, Dhaka, Univ. of Asia Pacific, 2014.

[10] Virtseva N.S., Vishnyakov I.E. Pattern identification and retrieval in short text messages. Nauka i obrazovanie: nauchnoe izdanie MGTU im. N.E. Baumana [Science and Education: Scientific Publication], 2016, no. 10 (in Russ.). DOI: 10.7463/1016.0848929

[11] Levenshteyn V.I. Binary codes capable of correcting deletions, insertions, and reversals. Doklady AN SSSR, 1965, vol. 163, no. 4, pp. 845--848 (in Russ.).

[12] Dubanov A.V. The comparison of program sources using the sequence alignment of tokens. Inzhenernyy zhurnal: nauka i innovatsii [Engineering Journal: Science and Innovation], 2014, no. 9 (in Russ.). DOI: https://doi.org/10.18698/2308-6033-2014-9-1318

[13] Yujian L., Bo L.A. Normalized Levenshtein distance metric. IEEE Trans. Pattern Anal. Mach. Intell., 2007, vol. 29, no. 6, pp. 1091--1095. DOI: https://doi.org/10.1109/TPAMI.2007.1078

[14] Boytsov L.M. Using signature hashing for fuzzy string search. Prikladnaya matematika i informatika, 2000, no. 7, pp. 135--153 (in Russ.).

[15] Bahmani B., Moseley B., Vattani A., et al. Scalable K-Means++. Proc. VLDB Endow., 2012, vol. 5, no. 7, pp. 622--633. DOI: http://dx.doi.org/10.14778/2180912.2180915

[16] Wang Y., Gu Y., Shun J. Theoretically-efficient and practical parallel. Proc. SIGMOD’20, 2020, pp. 2555--2571. DOI: https://doi.org/10.1145/3318464.3380582

[17] Apache ML Clustering --- DBSCANCluster. commons.apache.org: website. Available at: https://commons.apache.org/proper/commonsmath/javadocs/api-3.6.1/org/apache/commons/math3/ml/clustering/DBSCANClusterer.html (accessed: 20.12.2021).

[18] Index of /ruwiki/. dumps.wikimedia.org: website. Available at: https://dumps.wikimedia.org/ruwiki (accessed: 20.12.2021).

[19] Russian troll tweets. kaggle.com: website. Available at: https://www.kaggle.com/vikasg/russian-troll-tweets?select=tweets.csv(accessed: 20.12.2021).

[20] Apache Hadoop: website. Available at: https://hadoop.apache.org (accessed: 20.12.2021).