A Multimodal Retrieval-Augmented Generation System with ReAct Agent Logic for Multi-Hop Reasoning

Authors

Denys Yuvzhenko, Viacheslaw Chymshyr, Volodymyr Shymkovych, Kyrylo Znova, Grzegorz Nowakowski, Sergii Telenyk

DOI:

https://doi.org/10.20535/2786-8729.6.2025.330777

Keywords:

Multimodal system, Retrieval-Augmented Generation, RAG, ReAct agent, Multi-Hop reasoning, semantic search

Abstract

The rapid advancement of generative artificial intelligence models is significantly influencing modern methods of information processing and the way users interact with information systems. One promising direction in this domain is Retrieval-Augmented Generation (RAG), which combines generative models with information retrieval to improve the accuracy and relevance of responses. However, most existing RAG systems focus primarily on textual data, which falls short of contemporary needs for multimodal information processing (text, images, tables).

The object of this research is a multimodal RAG system based on ReAct agent logic and capable of multi-hop reasoning. The main emphasis is placed on integrating textual, graphical, and tabular information to generate accurate, complete, and relevant responses. The system is implemented using the ChromaDB vector store, the OpenAI text-embedding-ada-002 embedding model, and the GPT-4 language model.
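
The paper's implementation code is not reproduced on this page; the following minimal sketch (assuming the chromadb and openai Python packages with the 1.x client, and purely hypothetical chunk identifiers and contents) illustrates how text, table, and figure fragments of a report could be embedded with text-embedding-ada-002 and stored in a ChromaDB collection.

    import chromadb
    from openai import OpenAI

    openai_client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    chroma_client = chromadb.PersistentClient(path="./tb_report_index")
    collection = chroma_client.get_or_create_collection(
        name="global_tb_report_2024",
        metadata={"hnsw:space": "cosine"},  # cosine distance suits normalized ada-002 vectors
    )

    def embed(texts):
        """Embed a batch of strings with text-embedding-ada-002."""
        response = openai_client.embeddings.create(
            model="text-embedding-ada-002", input=texts
        )
        return [item.embedding for item in response.data]

    # Hypothetical fragments; the modality tag lets textual, tabular, and
    # graphical content be distinguished and reported separately at retrieval time.
    chunks = [
        {"id": "p12-text-03", "text": "Estimated TB incidence in 2023 ...", "modality": "text"},
        {"id": "p34-table-01", "text": "Table 2.1. TB notifications by WHO region ...", "modality": "table"},
        {"id": "p48-fig-02", "text": "Fig. 2.5. Trends in TB mortality, 2010-2023 ...", "modality": "figure"},
    ]

    collection.add(
        ids=[c["id"] for c in chunks],
        documents=[c["text"] for c in chunks],
        embeddings=embed([c["text"] for c in chunks]),
        metadatas=[{"modality": c["modality"]} for c in chunks],
    )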

The purpose of the study is the development, deployment, and empirical evaluation of the proposed multimodal RAG system based on the ReAct agent approach, which is capable of effectively integrating diverse knowledge sources into a unified information context.

The experimental evaluation used the World Health Organization's Global Tuberculosis Report 2024, which contains diverse textual, graphical, and tabular data. A specialized test set of 50 queries (30 textual, 10 tabular, and 10 graphical) was created for the empirical analysis, allowing all aspects of multimodal integration to be tested comprehensively.
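
The composition and grading procedure of the test set are not published on this page, so the following is only an assumed shape for such a labelled query set and the simplest possible aggregation of accuracy and recall over it; all field names and values are illustrative.

    # Hypothetical records from the 50-query test set (30 textual, 10 tabular,
    # 10 graphical): each stores the query, its modality, whether the answer
    # was judged correct, and how many of the expected facts it recovered.
    test_results = [
        {"query": "What was the estimated global TB incidence in 2023?",
         "modality": "text", "correct": True, "expected_facts": 2, "recovered_facts": 2},
        {"query": "Which WHO region reported the most TB notifications?",
         "modality": "table", "correct": True, "expected_facts": 1, "recovered_facts": 1},
        # ... remaining 48 labelled queries
    ]

    accuracy = sum(r["correct"] for r in test_results) / len(test_results)
    recall = (sum(r["recovered_facts"] for r in test_results)
              / sum(r["expected_facts"] for r in test_results))
    print(f"answer accuracy: {accuracy:.0%}, answer recall: {recall:.0%}")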

The research employed semantic vector search, multi-hop agent-based planning with ReAct logic, and evaluation of answer accuracy, answer recall, and response latency. Additionally, the dependence of response time on query volume was analyzed.
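
As an illustration of the multi-hop ReAct logic rather than the authors' actual implementation, the agent loop can be sketched as alternating retrieval and reasoning over the collection built above; the retrieve helper, the prompt wording, and the SEARCH: follow-up convention are all assumptions.

    def retrieve(query, k=6):
        """One retrieval hop: semantic vector search over the multimodal index."""
        result = collection.query(query_embeddings=embed([query]), n_results=k)
        return result["documents"][0]

    def react_answer(question, max_hops=3):
        """Thought -> Action (retrieve) -> Observation loop with a hop limit."""
        context, query = [], question
        for _ in range(max_hops):
            context.extend(retrieve(query))
            response = openai_client.chat.completions.create(
                model="gpt-4",
                messages=[
                    {"role": "system", "content":
                        "Answer strictly from the given context. If information "
                        "is missing, reply exactly: SEARCH: <follow-up query>."},
                    {"role": "user", "content":
                        f"Question: {question}\n\nContext:\n" + "\n".join(context)},
                ],
            )
            answer = response.choices[0].message.content
            if not answer.startswith("SEARCH:"):
                return answer  # enough evidence gathered: return the final answer
            query = answer.removeprefix("SEARCH:").strip()  # plan the next hop
        return answer

In this reading, a query such as "How did TB mortality change between 2015 and 2023, and which figure shows it?" would trigger additional retrieval hops until the answer can be grounded in textual, tabular, and graphical fragments or the hop limit is reached.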

The results confirm the high efficiency of the proposed approach. The system demonstrated an answer accuracy of 92% and an answer recall of 89%, and ensured complete (100%) coverage of all data types. The average response time was approximately 5 seconds, which meets the requirements for interactive systems. Optimal parameters were determined experimentally (e.g., k = 6, a classification threshold of 0.35, and up to three reasoning iterations), providing the best balance among completeness, speed, and operational efficiency.
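
The abstract does not specify what the 0.35 threshold classifies; one plausible reading, sketched below as an assumption, is a relevance cut-off applied to the k = 6 fragments retrieved per hop, with the limit of three iterations reused as max_hops in the loop above.

    K_RESULTS = 6               # fragments retrieved per hop (reported optimum)
    RELEVANCE_THRESHOLD = 0.35  # reported threshold; its exact role is assumed here
    MAX_HOPS = 3                # reported upper bound on reasoning iterations

    def retrieve_filtered(query, k=K_RESULTS, threshold=RELEVANCE_THRESHOLD):
        """Retrieve top-k fragments and drop those below the relevance cut-off."""
        result = collection.query(
            query_embeddings=embed([query]),
            n_results=k,
            include=["documents", "distances"],
        )
        docs, dists = result["documents"][0], result["distances"][0]
        # With a cosine-distance index, similarity = 1 - distance.
        return [doc for doc, dist in zip(docs, dists) if 1 - dist >= threshold]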

The study's findings highlight significant advantages of the multimodal agent-based approach over traditional text-only RAG solutions and confirm this as a promising direction for further research.

Author Biographies

Denys Yuvzhenko, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv

PhD student at the Department of Computer Engineering, Faculty of Informatics and Computer Engineering

Viacheslaw Chymshyr, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv

PhD student at the Department of Information Systems and Technologies, Faculty of Informatics and Computer Engineering; Candidate of Technical Sciences, Associate Professor

Volodymyr Shymkovych, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv

Associate Professor at the Department of Information Systems and Technologies, Faculty of Informatics and Computer Engineering, Ph.D.

Kyrylo Znova, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv

PhD student at the Department of Information Systems and Technologies, Faculty of Informatics and Computer Engineering

Grzegorz Nowakowski, Cracow University of Technology, Warszawska Street 24, Cracow

Teaching and research assistant at the Department of Automatics and Information Technologies, Cracow University of Technology

Sergii Telenyk, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv

Professor at the Department of Information Systems and Technologies, Faculty of Informatics and Computer Engineering; Doctor of Technical Sciences, Professor

References

P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, D. Kiela, and S. Riedel, “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” Adv. Neural Inf. Process. Syst., vol. 33, pp. 9459-9474, 2020. https://doi.org/10.48550/arXiv.2005.11401.

G. Mialon, R. Dessì, M. Lomeli, C. Nalmpantis, R. Pasunuru, R. Raileanu, B. Rozière, T. Schick, J. Dwivedi-Yu, A. Celikyilmaz, E. Grave, Y. LeCun, and T. Scialom, “Augmented Language Models: A Survey,” arXiv preprint arXiv:2302.07842, 2023. https://doi.org/10.48550/arXiv.2302.07842.

Z. Jiang, F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y. Yang, J. Callan, G. Neubig, “Active Retrieval Augmented Generation,” in Proc. Empirical Methods in Natural Language Processing (EMNLP 2023), 2023. https://doi.org/10.18653/v1/2023.emnlp-main.495.

W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettlemoyer, and W.-T. Yih, “REPLUG: Retrieval-Augmented Black-Box Language Models,” in Proc. North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT) 2024, 2024. https://doi.org/10.18653/v1/2024.naacl-long.463.

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, I. Sutskever, “Learning Transferable Visual Models From Natural Language Supervision,” in Proc. Int. Conf. Mach. Learn. (ICML 2021), 2021. https://doi.org/10.48550/arXiv.2103.00020.

H. Liu, Z. Wang, X. Chen, Z. Li, F. Xiong, Q. Yu, and W. Zhang, “HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation,” arXiv preprint arXiv:2502.12442, 2025. https://doi.org/10.48550/arXiv.2502.12442.

S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, and L. Sifre, “Improving Language Models by Retrieving from Trillions of Tokens,” in Proc. Int. Conf. Mach. Learn. (ICML 2022), 2022. https://doi.org/10.48550/arXiv.2112.04426.

T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, and T. Scialom, “Toolformer: Language Models Can Teach Themselves to Use Tools,” arXiv preprint arXiv:2302.04761, 2023. https://doi.org/10.48550/arXiv.2302.04761.

W. Chen, H. Hu, X. Chen, P. Verga, and W. W. Cohen, “MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text,” in Proc. Empirical Methods in Natural Language Processing (EMNLP 2022), pp. 5558-5570, 2022. https://doi.org/10.18653/v1/2022.emnlp-main.375.

D. Caffagni, F. Cocchi, N. Moratelli, S. Sarto, M. Cornia, L. Baraldi, and R. Cucchiara, “Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs,” in CVPR Workshops, MMFM 2024, 2024. https://doi.org/10.1109/CVPRW63382.2024.00188.

S. Yu, C. Tang, B. Xu, J. Cui, J. Ran, Y. Yan, Z. Liu, S. Wang, X. Han, Z. Liu, and M. Sun, “VisRAG: Vision-based Retrieval-Augmented Generation on Multi-modality Documents,” arXiv preprint arXiv:2410.10594, 2024. https://doi.org/10.48550/arXiv.2410.10594.

J. Cho, D. Mahata, O. Irsoy, Y. He, and M. Bansal, “M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding,” arXiv preprint arXiv:2411.04952, 2024. https://doi.org/10.48550/arXiv.2411.04952.

M. Suri, P. Mathur, F. Dernoncourt, K. Gowswami, R. A. Rossi, and D. Manocha, “VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation,” arXiv preprint arXiv:2412.10704, 2024. https://doi.org/10.48550/arXiv.2412.10704.

World Health Organization, Global Tuberculosis Report 2024, Geneva, Switzerland, 2024. [Online]. Available: https://iris.who.int/bitstream/handle/10665/379339/9789240101531-eng.pdf. Accessed: Sep. 17, 2025.

P. Kashyap, “React Agents Using Langchain,” Medium, 2024. [Online]. Available: https://medium.com/@piyushkashyap045/react-agents-using-langchain-388dab893fc9. Accessed: Sep. 17, 2025.

Chroma, “Chroma: The Open-Source AI Application Database,” 2024. [Online]. Available: https://www.trychroma.com/. Accessed: Sep. 17, 2025.

OpenAI, “GPT-4 Technical Report,” arXiv preprint arXiv:2303.08774, 2023. https://doi.org/10.48550/arXiv.2303.08774.

OpenAI, “New and Improved Embedding Model: text-embedding-ada-002,” OpenAI Blog, 2022. [Online]. Available: https://openai.com/index/new-and-improved-embedding-model/. Accessed: Sep. 17, 2025.

PyMuPDF, “Text Extraction Recipes,” PyMuPDF Documentation, 2024. [Online]. Available: https://pymupdf.readthedocs.io/en/latest/recipes-text.html. Accessed: Sep. 17, 2025.

Camelot, “PDF Table Extraction for Humans,” Camelot Documentation, 2024. [Online]. Available: https://camelot-py.readthedocs.io/en/master/. Accessed: Sep. 17, 2025.

spaCy, “Language Processing Pipelines,” spaCy Documentation, 2024. [Online]. Available: https://spacy.io/usage/processing-pipelines. Accessed: Sep. 17, 2025.

N. Reimers and I. Gurevych, “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,” arXiv preprint arXiv:1908.10084, 2019. https://doi.org/10.48550/arXiv.1908.10084.

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning Transferable Visual Models From Natural Language Supervision,” in Proc. 38th Int. Conf. Mach. Learn. (ICML 2021), 2021. https://doi.org/10.48550/arXiv.2103.00020.

G. Nowakowski, “Fuzzy queries on relational databases,” in Proc. 2018 Int. Interdisciplinary PhD Workshop (IIPhDW), 2018, pp. 293–299. https://doi.org/10.1109/IIPHDW.2018.8388376.

Published

2025-09-19

How to Cite

[1]
D. Yuvzhenko, V. Chymshyr, V. Shymkovych, K. Znova, G. Nowakowski, and S. Telenyk, “A Multimodal Retrieval-Augmented Generation System with ReAct Agent Logic for Multi-Hop Reasoning”, Inf. Comput. and Intell. syst. j., no. 6, pp. 42–57, Sep. 2025.