Integrative Vision-Language Reinforcement Learning for Autonomous Robotics
Abstract
In recent years, the integration of deep learning and robotics has become a central paradigm in intelligent systems research, enabling robots to perceive, reason, and act autonomously in unstructured environments. This paper presents a unified framework that combines deep reinforcement learning (DRL) with large vision-language models (VLMs) to enhance robotic interaction, decision-making, and adaptability. The proposed system uses a multi-modal perception backbone in which vision and language embeddings are jointly optimized to interpret complex sensory inputs and contextual commands. Reinforcement learning refines the control policy through interaction with the environment, enabling the robot to translate high-level semantic understanding into precise motor actions. The fusion mechanism employs a cross-modal attention network to align latent representations from the perception and reasoning layers, improving interpretability and reducing decision ambiguity. Experiments on robotic manipulation, navigation, and human-robot collaboration tasks demonstrate significant improvements in task completion rates and in generalization to unseen scenarios. Compared with traditional DRL-based agents, the method achieves higher sample efficiency and greater robustness under noisy sensory conditions. This work offers a pathway toward cognitive robotics, integrating reasoning and embodiment through deep learning architectures inspired by human-like intelligence.
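To make the described fusion concrete, the sketch below illustrates one plausible reading of the architecture: language token embeddings attend over visual features via cross-modal attention, and the pooled, fused representation conditions an actor-critic policy head trained by reinforcement learning. This is a minimal illustrative sketch, not the authors' implementation; the framework (PyTorch), the module name CrossModalPolicy, the embedding dimensions, and the action dimensionality are all assumptions not stated in the abstract.

# Minimal sketch (assumed PyTorch): cross-modal attention fuses vision and
# language embeddings; the fused state feeds an actor-critic policy head.
import torch
import torch.nn as nn

class CrossModalPolicy(nn.Module):  # hypothetical module name
    def __init__(self, vis_dim=768, lang_dim=768, d_model=256, n_actions=7):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)    # project visual patch features
        self.lang_proj = nn.Linear(lang_dim, d_model)  # project language token embeddings
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.actor = nn.Sequential(nn.Linear(d_model, 128), nn.ReLU(), nn.Linear(128, n_actions))
        self.critic = nn.Sequential(nn.Linear(d_model, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, vis_feats, lang_feats):
        # vis_feats: (B, num_patches, vis_dim); lang_feats: (B, num_tokens, lang_dim)
        v = self.vis_proj(vis_feats)
        q = self.lang_proj(lang_feats)
        # Language queries attend over visual keys/values (cross-modal alignment).
        fused, _ = self.cross_attn(query=q, key=v, value=v)
        pooled = fused.mean(dim=1)                     # pool fused tokens into one state vector
        return self.actor(pooled), self.critic(pooled) # action logits and value estimate

# Usage example: a batch of 2 observations with 196 visual patches and 12 instruction tokens.
logits, value = CrossModalPolicy()(torch.randn(2, 196, 768), torch.randn(2, 12, 768))

In such a setup, the actor and critic outputs would typically be optimized with a standard policy-gradient objective (e.g., PPO) while the attention weights provide the alignment between instruction tokens and visual regions that the abstract attributes to improved interpretability.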