Integration of Multi-Modal Data into Generation of Text Descriptions: Methods, Challenges and Prospects
https://doi.org/10.21686/2413-2829-2026-1-72-78
Abstract
Modern AI systems increasingly rely on multi-modal data, combining visual, textual and audio information to solve complex problems. One key application of such systems is the generation of descriptions from images and video. Integrating multi-modal data makes it possible to improve the accuracy and expressiveness of the generated texts, providing a more complete and meaningful representation of the content. The article examines current methods for integrating multi-modal data into the generation of text descriptions, analyzes the key challenges facing researchers, and discusses promising directions of development in this field. Special attention is paid to the use of convolutional neural networks (CNNs) and transformers for processing visual information, as well as to attention mechanisms and sequential text generation models. The article reviews approaches to fusing data from different modalities, including early and late feature fusion and multi-modal models trained on large datasets. Despite significant progress, the integration of multi-modal data raises a number of challenges, including information synchronization, difficulties with interpretation and context, limitations of training data, and others. Promising directions of development are discussed. The results can be used by developers working in computer vision, natural language processing and multi-modal machine learning, and for building intelligent applications for automatic image captioning, video summarization and human-machine interaction.
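To make the distinction between the fusion strategies mentioned in the abstract concrete, below is a minimal, hypothetical sketch (not the author's implementation) contrasting early and late fusion of visual and textual features in PyTorch. The feature dimensions, layer sizes and class names are illustrative assumptions only.

```python
# Hypothetical sketch of early vs. late fusion of image and text features.
# Dimensions (2048 for CNN image embeddings, 768 for transformer text
# embeddings) are assumptions chosen for illustration.
import torch
import torch.nn as nn


class EarlyFusion(nn.Module):
    """Concatenate modality features first, then process them jointly."""

    def __init__(self, img_dim=2048, txt_dim=768, hidden=512, out_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, img_feat, txt_feat):
        # Joint representation is learned over the concatenated features.
        return self.mlp(torch.cat([img_feat, txt_feat], dim=-1))


class LateFusion(nn.Module):
    """Encode each modality separately, then combine the projections."""

    def __init__(self, img_dim=2048, txt_dim=768, out_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, out_dim)
        self.txt_proj = nn.Linear(txt_dim, out_dim)

    def forward(self, img_feat, txt_feat):
        # Modalities interact only at the final combination step.
        return self.img_proj(img_feat) + self.txt_proj(txt_feat)


if __name__ == "__main__":
    img = torch.randn(4, 2048)  # e.g. pooled CNN image embeddings
    txt = torch.randn(4, 768)   # e.g. transformer text embeddings
    print(EarlyFusion()(img, txt).shape)  # torch.Size([4, 256])
    print(LateFusion()(img, txt).shape)   # torch.Size([4, 256])
```

In practice, the fused representation would feed a sequential decoder (e.g. an attention-based transformer) that generates the caption token by token; the sketch stops at the fusion step, which is the point being illustrated.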
About the Author
N. A. Chinyakov (Russia)
Nikita A. Chinyakov, Post-Graduate Student of the Department of Informatics, Plekhanov Russian University of Economics (PRUE).
36 Stremyanny Lane, Moscow, 109992
References
1. Baltrusaitis T., Ahuja C., Morency L.-P. Multimodal Machine Learning: A Survey and Taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, Vol. 41 (2), pp. 423–443.
2. Buolamwini J., Gebru T. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. Proceedings of the 1st Conference on Fairness, Accountability and Transparency. Available at: https://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf
3. Esteva A., Kuprel B., Novoa R. A. Dermatologist-Level Classification of Skin Cancer with Deep Neural Networks. Available at: https://www.researchgate.net/publication/312890808_Dermatologist-level_classification_of_skin_cancer_with_deep_neural_networks
4. Goodfellow I., Pouget-Abadie J., Mirza M. Generative Adversarial Nets. Available at: https://www.researchgate.net/publication/263012109_Generative_Adversarial_Networks
5. Hochreiter S., Schmidhuber J. Long Short-Term Memory. Neural Computation, 1997, Vol. 9 (8), pp. 1735–1780.
6. Jobin A., Ienca M., Andorno R. The Global Landscape of AI Ethics Guidelines. Nature Machine Intelligence. Available at: https://www.nature.com/articles/s42256-019-0088-2
7. Kizilcec R. F., Piech C., Schneider E. F. Deconstructing Disengagement: Analyzing Learner Subpopulations in Massive Open Online Courses. Available at: https://www.researchgate.net/publication/260265661_Deconstructing_Disengagement_Analyzing_Learner_Subpopulations_in_Massive_Open_Online_Courses
8. McCormack J., Gifford T., Hutchings P. Autonomy, Authenticity and the Role of the Artist in the Age of AI. Available at: https://www.researchgate.net/publication/331562062_Autonomy_Authenticity_Authorship_and_Intention_in_computer_generated_art
9. Nguyen H., Wang Y., Zhang J. Multimodal Sentiment Analysis: A Survey on Methods and Applications. IEEE Transactions on Affective Computing. Available at: https://arxiv.org/abs/2305.07611
10. Rihem F. Image Captioning Using Multimodal Deep Learning Approach. Computers, Materials & Continua, 2024, Vol. 81 (3), pp. 3951–3968.
11. Stojkoska B. R., Avramova A. P., Chatzimisios P. Application of Wireless Sensor Networks for Indoor Temperature Regulation. Available at: https://arxiv.org/abs/1606.07386
12. Sutskever I., Vinyals O., Le Q. V. Sequence to Sequence Learning with Neural Networks. Available at: https://arxiv.org/abs/1409.3215
13. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Gomez A. N., Kaiser L., Polosukhin I. Attention is All You Need. Available at: https://arxiv.org/abs/1706.03762
14. Vinyals O., Toshev A., Bengio S., Erhan D. Show and Tell: A Neural Image Caption Generator. Available at: https://arxiv.org/abs/1411.4555
15. Zadeh A. B., Liang P. P., Poria S., Cambria E., Morency L.-P. Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph. Available at: https://aclanthology.org/P18-1208/
16. Zhang Y., Liu F., Wang H., Hu Z. Multimodal Learning for Medical Image Analysis: A Survey. Medical Image Analysis, 2023, No. 85, p. 102759.
For citations:
Chinyakov N.A. Integration of Multi-Modal Data into Generation of Text Descriptions: Methods, Challenges and Prospects. Vestnik of the Plekhanov Russian University of Economics. 2026;(1):72-78. (In Russ.) https://doi.org/10.21686/2413-2829-2026-1-72-78