The algorithm for selecting publications on a given topic considering keyword priorities

Authors

DOI:

https://doi.org/10.20535/2786-8729.5.2024.316521

Keywords:

search of scientific publications, similarity of sets, Jaccard criterion, edit distance

Abstract

The article examines the shortcomings of current search engines for scientific publications and describes the search algorithms they use. The aim of the article is to develop a method for selecting publications on a given topic by assessing the relevance of keyword sets. A review of the literature analyzed during the research is presented, including work on the theory of set similarity, namely the Jaccard coefficient and edit distance. A measure of similarity between keyword sets is presented; it is based on the Jaccard coefficient and takes the weighting coefficients of keywords into account. An algorithm is presented that determines the degree of similarity of publications to a user's search query based on keyword sets with weighting coefficients. The algorithm relies on the similarity measure and the edit distance that we present. It can be used to rank search results in search engines for scientific publications, to compare the efficiency of different search engines, and to assess the quality of the results they return. It can also be used in book and film recommendation systems based on user preferences. The article provides pseudocode for the algorithm and demonstrates, on a limited data set, how the measure it computes changes with the distribution of keyword weights in the user's query and with the number of keywords.
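The abstract does not state the formula, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a Jaccard-style ratio in which matched query keywords contribute their weights to the numerator, a standard Levenshtein edit distance to tolerate small spelling differences between keywords, and a fixed weight for publication keywords that match nothing in the query. The names `edit_distance`, `weighted_similarity`, `max_dist`, and `unmatched_weight` are hypothetical and chosen for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def weighted_similarity(query: dict[str, float],
                        pub_keywords: set[str],
                        max_dist: int = 1,
                        unmatched_weight: float = 1.0) -> float:
    """Jaccard-style score: matched query weight over the weight of the union.

    Assumptions (not taken from the article): two keywords match when their
    lowercase edit distance is at most `max_dist`, and every publication
    keyword without a match adds `unmatched_weight` to the denominator.
    """
    pubs = [p.lower() for p in pub_keywords]

    def close(x: str, y: str) -> bool:
        return edit_distance(x, y) <= max_dist

    # Weight of the query keywords that have a close publication keyword.
    matched = sum(w for q, w in query.items()
                  if any(close(q.lower(), p) for p in pubs))
    # Number of publication keywords with no close query keyword.
    extra = sum(1 for p in pubs
                if not any(close(q.lower(), p) for q in query))
    union = sum(query.values()) + unmatched_weight * extra
    return matched / union if union > 0 else 0.0


def rank(query: dict[str, float], publications: dict[str, set[str]]) -> list[str]:
    """Return publication titles ordered from most to least similar."""
    return sorted(publications,
                  key=lambda title: weighted_similarity(query, publications[title]),
                  reverse=True)
```

Under these assumptions, shifting weight toward a keyword that a publication contains raises that publication's score, which is the kind of dependence on the weight distribution that the abstract says the article demonstrates.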

Author Biographies

Olha Suprun, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv

Master's degree student at the Department of Information Systems and Technologies of the Faculty of Informatics and Computer Technique

Oksana Zhurakovska, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv

Associate Professor at the Department of Information Systems and Technologies of the Faculty of Informatics and Computer Technique, Ph.D.


Published

2024-12-26

How to Cite

[1]
O. Suprun and O. Zhurakovska, “The algorithm for selecting publications on a given topic considering keyword priorities”, Inf. Comput. and Intell. syst. j., no. 5, pp. 101–111, Dec. 2024.