Chapter 3: Bridging AI Agents with ROS 2
Learning Objectives
By the end of this chapter, you will:
- Explain how Python-based AI agents serve as cognitive layers in robot systems
- Describe rclpy as the interface connecting AI code to ROS 2 communication
- Diagram natural language command to robot action flow using ROS 2 primitives
- Identify safety boundaries between high-level AI planning and low-level robot control
Python Agents as Cognitive Layers
Modern AI models—large language models, vision transformers, and reinforcement learning policies—are typically developed in Python using frameworks like PyTorch, TensorFlow, and Hugging Face Transformers. These models serve as the "brain" of physical AI systems, making high-level decisions based on sensor input and user commands.
What is a cognitive layer: A cognitive layer is software that performs reasoning, planning, and decision-making using AI models. In humanoid robotics, the cognitive layer interprets sensor data (what does the camera see?), understands commands (what does the user want?), and decides actions (how should the robot respond?). This layer operates at a conceptual level—"pick up the cup," "navigate to the kitchen," "wave hello"—rather than specifying exact motor commands.
Why Python: AI research and development happen predominantly in Python because of its rich ecosystem for machine learning. Models trained using PyTorch or JAX can be loaded and executed in Python with minimal integration work. ROS 2 supports Python through rclpy, allowing AI developers to write nodes in the same language as their models.
Role in the system: The cognitive layer sits between perception and control. It receives processed sensor data (detected objects, estimated positions, recognized speech) and outputs high-level goals or actions. Lower-level controllers translate these goals into motor commands. This separation allows AI researchers to focus on decision-making without understanding motor control, while robotics engineers optimize controllers without retraining AI models.
Example architecture:
- Perception nodes (C++ for performance): Process camera images, LIDAR scans, IMU data
- Cognitive nodes (Python with AI models): Interpret perceptions, plan actions, understand language
- Control nodes (C++ for real-time): Execute motions, maintain balance, ensure safety
This layered design mirrors how biological brains work: sensory processing, high-level reasoning, and motor control operate in distinct but coordinated subsystems.
This architecture parallels distributed ML systems where data preprocessing runs on CPUs, model inference runs on GPUs, and serving layers handle requests. In robotics, perception runs on specialized hardware, AI models run where compute is available (often GPUs), and controllers run on real-time hardware. ROS 2 handles the communication between these distributed components.
Introduction to rclpy (Conceptual)
rclpy is the ROS 2 Client Library for Python. It provides Python APIs for creating nodes, publishing to topics, calling services, and sending action goals. rclpy wraps the underlying ROS 2 middleware (DDS) with Pythonic interfaces, making it straightforward to integrate AI code with robot systems.
What rclpy provides:
- Node creation: Define a Python class as a ROS 2 node with lifecycle management
- Publishers and subscribers: Send and receive messages on topics using Python types
- Service clients and servers: Implement request/response patterns with Python functions
- Action clients and servers: Handle long-running goals with feedback callbacks
- Timers and callbacks: Schedule periodic operations or respond to events
- Parameter handling: Configure node behavior dynamically
Purpose, not exhaustive API: This chapter introduces rclpy conceptually to show how AI agents connect to ROS 2. We focus on the purpose and patterns, not complete API documentation. Detailed tutorials and hands-on coding come in later modules.
Example - AI vision node:
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from std_msgs.msg import String
class VisionCognitiveNode(Node):
def __init__(self):
super().__init__('vision_cognitive_node')
# Subscribe to camera images
self.image_sub = self.create_subscription(
Image, '/camera/image', self.image_callback, 10)
# Publish detected objects
self.objects_pub = self.create_publisher(
String, '/detected_objects', 10)
def image_callback(self, msg):
# Run AI vision model
detected = self.run_object_detection(msg)
# Publish results
result_msg = String()
result_msg.data = detected
self.objects_pub.publish(result_msg)
This pattern—subscribe to sensor topics, process with AI, publish results—is fundamental to cognitive nodes in ROS 2.
rclpy is like a translator that lets your Python AI code talk to the robot. Just as you might use a library to connect Python to a database (like SQLAlchemy) or a web API (like requests), rclpy connects Python to the robot's communication system. You write normal Python code, and rclpy handles the robot-specific details.
ROS 2 as LLM-to-Robot Interface
Large language models (LLMs) like GPT-4, Claude, or LLaMA understand natural language and can reason about tasks. Connecting LLMs to robots enables natural interaction: users can give verbal commands, and the robot translates those commands into physical actions.
The interface challenge: LLMs output text. Robots need structured commands (move arm to position X, navigate to waypoint Y). ROS 2 serves as the interface layer that bridges natural language understanding and robot actuation.
How it works:
- Speech recognition node: Converts user voice to text, publishes on
/voice_commandtopic - LLM cognitive node: Subscribes to
/voice_command, queries LLM to interpret intent, determines required robot actions - Task translation: LLM node translates high-level intent ("bring me a water bottle") into ROS 2 action goals (navigate to kitchen, detect bottle, grasp bottle, navigate to user)
- Action execution: Navigation and manipulation controllers receive action goals and execute motions
- Feedback: Controllers publish progress, LLM node can adjust plan if needed
Example flow - "Wave hello":
class LLMCognitiveNode(Node):
def __init__(self):
super().__init__('llm_cognitive_node')
self.command_sub = self.create_subscription(
String, '/voice_command', self.command_callback, 10)
self.action_client = ActionClient(
self, ExecuteGesture, '/execute_gesture')
def command_callback(self, msg):
# Query LLM to interpret command
intent = self.query_llm(msg.data)
if intent['action'] == 'wave':
# Send action goal to gesture controller
goal = ExecuteGesture.Goal()
goal.gesture_type = 'wave'
goal.hand = 'right'
self.action_client.send_goal_async(goal)
Why this matters: This pattern enables robots to work with humans using natural language instead of requiring programming knowledge or button interfaces. For humanoid robots designed to assist people in homes and workplaces, natural language interaction is essential.
This is similar to voice assistants (Alexa, Siri) but connected to physical hardware. The LLM interprets intent like a voice assistant, but instead of playing music or setting timers, it sends commands to motors and actuators through ROS 2 topics and actions. Think of it as an API layer between natural language and robot APIs.
High-Level Planning vs Low-Level Control
Physical AI systems separate high-level planning (what to do) from low-level control (how to do it). This separation allows AI agents to focus on decision-making while controllers ensure safe and accurate execution.
High-level planning (AI cognitive layer):
- Interprets sensor data and user commands
- Decides which actions accomplish the goal
- Sequences tasks (first navigate, then grasp, then return)
- Monitors task progress and adjusts plans when needed
- Operates at the level of objects, locations, and goals
Low-level control (ROS 2 controllers):
- Translates high-level goals into motor commands
- Maintains balance and stability during motion
- Handles kinematics and dynamics constraints
- Ensures safety limits are not violated
- Operates at the level of joint angles, velocities, and forces
Why separation matters: AI models reason in terms of objects and actions ("pick up the cup"). Controllers reason in terms of joint angles and torques. If the AI layer had to compute inverse kinematics, plan collision-free paths, and maintain balance, it would become impossibly complex. Separating concerns allows each layer to focus on what it does well.
Example - Grasping task:
- High-level (AI): "Grasp the red cup on the table"
- Identify red cup from camera image (vision AI)
- Determine 3D position (perception system)
- Send action goal: move gripper to cup position and close
- Low-level (Controller): Receives action goal with target position
- Compute inverse kinematics to find joint angles
- Plan smooth trajectory avoiding obstacles
- Execute motion while monitoring force sensors
- Close gripper when contact detected
The AI layer treats the controller as a black box: send a goal, receive feedback, get result. The controller treats the AI layer as a black box: receive goals, execute safely, report progress.
This mirrors the distinction between task planning (AI planning, behavior trees) and motion planning (trajectory optimization, feedback control). The cognitive layer operates in task space (Cartesian coordinates, object poses), while the control layer operates in configuration space (joint angles, velocities). ROS 2 actions provide the interface between these spaces.
ROS 2 Controllers as Motor Cortex
In neuroscience, the motor cortex translates intentions ("reach for the cup") into coordinated muscle activations. The prefrontal cortex (decision-making) doesn't control individual muscles; it sends high-level commands to the motor cortex, which handles the details.
The analogy in robotics: ROS 2 controllers serve as the robot's motor cortex. The AI cognitive layer (prefrontal cortex) sends high-level goals. Controllers (motor cortex) translate those goals into precise motor commands while handling real-time constraints.
What controllers do:
- Inverse kinematics: Convert desired end-effector position to joint angles
- Trajectory planning: Generate smooth paths between positions
- Feedback control: Adjust commands based on actual vs desired state
- Safety monitoring: Enforce joint limits, collision avoidance, force limits
- Real-time execution: Run at high frequency (100-1000 Hz) for stability
Example - Arm reaching controller:
class ArmReachingController(Node):
def __init__(self):
super().__init__('arm_reaching_controller')
# Action server for high-level goals
self.action_server = ActionServer(
self, ReachTarget, '/reach_target', self.execute_reach)
# Publisher for low-level joint commands
self.joint_cmd_pub = self.create_publisher(
JointCommand, '/arm/joint_commands', 10)
def execute_reach(self, goal_handle):
target_pose = goal_handle.request.target_pose
# Compute inverse kinematics
joint_angles = self.inverse_kinematics(target_pose)
# Plan trajectory
trajectory = self.plan_smooth_trajectory(joint_angles)
# Execute with feedback control
for waypoint in trajectory:
cmd = self.compute_control_command(waypoint)
self.joint_cmd_pub.publish(cmd)
# Publish feedback
feedback = ReachTarget.Feedback()
feedback.current_pose = self.get_current_pose()
goal_handle.publish_feedback(feedback)
return ReachTarget.Result(success=True)
The cognitive layer sends a goal ("reach this pose"), and the controller handles all the motion planning and execution details.
Think of it like driving a car. You (the brain) decide "turn left at the intersection," but you don't consciously control every muscle in your hands, arms, and legs. Your motor cortex handles the detailed coordination. Similarly, the AI decides "move arm to cup," and the controller handles all the joint movements to make it happen smoothly and safely.
Safety and Control Boundaries
When AI agents control physical robots, safety is paramount. Unlike software-only systems where bugs cause errors or crashes, physical AI failures can cause damage or injury. Safety boundaries define what the AI can and cannot do.
Types of boundaries:
-
Workspace limits: AI can command motions only within defined regions
- Arm must not strike the robot's own body
- Navigation must stay within mapped areas
- Gripper force must not exceed safe thresholds
-
Command validation: Controllers verify that AI commands are physically feasible
- Joint angles within hardware limits
- Velocities and accelerations achievable
- Trajectories collision-free
-
Emergency override: Safety systems can override AI decisions
- Person detected nearby → stop all motion
- Joint torque exceeds limit → halt and alert
- Battery critical → return to charging station
-
Temporal constraints: AI decisions must happen within time bounds
- Balance controller needs commands at 100 Hz minimum
- If AI doesn't respond, use default safe behavior
- Timeout on long-running actions
Implementation pattern:
class SafetyMonitorNode(Node):
def __init__(self):
super().__init__('safety_monitor')
# Subscribe to all motion commands
self.cmd_sub = self.create_subscription(
JointCommand, '/arm/joint_commands',
self.validate_command, 10)
# Subscribe to sensors
self.proximity_sub = self.create_subscription(
ProximitySensor, '/proximity',
self.check_obstacles, 10)
# Service to emergency stop all motion
self.estop_client = self.create_client(
EmergencyStop, '/emergency_stop')
def validate_command(self, cmd):
# Check if command is within safe limits
if not self.is_safe(cmd):
self.trigger_emergency_stop()
def check_obstacles(self, sensor_data):
# If person too close, stop
if sensor_data.distance < 0.5: # 50cm
self.trigger_emergency_stop()
Why this matters for humanoid robots: Humanoids operate in human environments, often in close proximity to people. A humanoid must never harm a person, even if the AI makes a mistake. Safety boundaries ensure that even if the cognitive layer fails or makes poor decisions, the robot remains safe.
This is analogous to guardrails in LLM deployment (content filters, output validation). Just as production LLM systems validate outputs before showing them to users, robot systems validate AI decisions before executing them physically. The difference is that robot failures can cause physical harm, so validation must be real-time and hardware-enforced, not just software checks.
Example: Natural Language Command to ROS 2 Action
Let's trace a complete flow from natural language input to robot execution, showing how all components work together.
Scenario: User says "Please wave hello to me."
Step-by-step flow:
-
Speech Recognition (external service or node)
- Microphone captures audio
- Speech-to-text converts to string: "Please wave hello to me"
- Publishes to
/voice_commandtopic
-
LLM Cognitive Node (Python with GPT-4/Claude)
def command_callback(self, msg):
prompt = f"User command: {msg.data}\nIntent (JSON):"
response = self.query_llm(prompt)
intent = json.loads(response)
# Result: {"action": "wave_greeting", "hand": "right"}- Interprets command using LLM
- Determines intent: execute wave gesture
- Selects appropriate hand (right is convention for greetings)
-
Task Planning (same cognitive node)
goal = ExecuteGesture.Goal()
goal.gesture_type = 'wave'
goal.hand = 'right'
goal.repetitions = 3
self.gesture_action.send_goal_async(
goal, feedback_callback=self.on_feedback)- Formulates action goal in ROS 2 format
- Sends to gesture controller action server
-
Gesture Controller (C++ or Python node)
- Receives action goal
- Loads predefined wave trajectory (sequence of arm poses)
- Sends waypoints to arm controller at 10 Hz
- Publishes feedback: current repetition count
-
Arm Controller (low-level C++ node)
- Receives desired joint positions for each waypoint
- Computes torque commands using PID control
- Publishes commands to motor drivers at 100 Hz
- Monitors joint state for safety
-
Safety Monitor (always running)
- Subscribes to proximity sensors
- If person gets closer than 30cm during wave, cancels action
- Arm stops immediately and returns to neutral pose
-
Feedback to User (optional)
- When action completes, LLM node receives result
- Can trigger speech synthesis: "Hello! Nice to meet you."
Key observations:
- Natural language processing happens at cognitive layer (LLM)
- Motion execution happens at control layer (controllers)
- ROS 2 actions bridge the two with goals, feedback, and results
- Safety monitoring runs independently and can override at any time
- Each layer operates in its own space: LLM in concepts, controller in joint angles
This architecture enables complex, interactive behaviors while maintaining modularity and safety.
This is like ordering food at a restaurant. You tell the waiter what you want (natural language), the waiter translates it to the kitchen (task planning), the chef prepares it (execution), and you get updates on timing (feedback). The kitchen staff (controllers) handle the cooking details; you (the cognitive layer) just specify what you want.
Why This Matters for Humanoid Robots
Humanoid robots aim to operate in human environments and interact naturally with people. This requires combining human-like reasoning (AI) with precise physical control (robotics). The AI-ROS 2 bridge makes this possible.
Natural interaction: People communicate with language and gestures, not programming interfaces. LLMs enable humanoids to understand verbal commands, answer questions, and engage in dialogue while performing physical tasks.
Adaptability: Humanoid robots encounter unpredictable environments—furniture layouts change, objects move, people walk by. AI agents can perceive these changes and adapt plans in real-time. Controllers execute the adapted plans safely.
Learning from experience: Future humanoid systems will learn from demonstrations and corrections. A person shows the robot how to fold laundry, and the robot learns the task. This learning happens at the AI cognitive layer, while ROS 2 controllers handle the physical execution using learned policies.
Complex task coordination: Humanoid tasks often require coordinating vision, language, navigation, and manipulation. "Bring me the book from the shelf" involves recognizing the book (vision), understanding the request (language), walking to the shelf (navigation), and picking up the book (manipulation). AI agents orchestrate these subsystems through ROS 2 communication.
Safety in human spaces: Humanoids work alongside people, not behind safety cages like industrial robots. AI-powered safety monitoring can recognize people, predict their movements, and adjust robot behavior to maintain safe distances and gentle interactions.
This convergence of AI reasoning and robotic control, mediated by ROS 2, defines the future of humanoid robotics.
This is the physical embodiment of intelligent systems. Web applications connect AI (recommendations, chatbots) to user interfaces (web pages, mobile apps). Humanoid robots connect AI to physical interfaces (sensors, motors). ROS 2 plays the role of APIs and message buses in distributed web systems, but optimized for real-time physical constraints.
Preparing for Vision-Language-Action Systems
Vision-Language-Action (VLA) systems represent the frontier of physical AI: models that perceive visual scenes, understand language, and output robot actions in a single learned model. This chapter has prepared the conceptual groundwork.
What are VLA systems: VLA models are trained end-to-end to map camera images and language commands directly to robot actions. Instead of separate models for vision, language understanding, and motion planning, a single neural network learns the entire perception-to-action pipeline.
Why ROS 2 still matters: Even with VLA models, ROS 2 provides essential infrastructure:
- Sensor integration: VLAs need camera images, joint states, and tactile feedback—ROS 2 topics provide these
- Action execution: VLA models output high-level actions; ROS 2 controllers translate them to motor commands
- Safety boundaries: VLA outputs must be validated and constrained before execution
- Multi-modal coordination: Real humanoid systems combine VLA with other subsystems (navigation, speech) through ROS 2 communication
Conceptual preparation: Understanding nodes, topics, actions, and AI-ROS 2 integration patterns prepares you to work with VLA systems. Later modules will explore VLA implementations in simulation (Isaac Sim) and how they integrate with ROS 2 ecosystems.
Current chapter focus: We've focused on the interface between AI and ROS 2 using current technologies (LLMs, rclpy, explicit controllers). These patterns form the foundation for more advanced systems like VLAs, which automate parts of this pipeline through learned models.
VLA systems are analogous to end-to-end self-driving models that map sensor input directly to steering commands. However, robotics is more diverse than driving—VLAs must handle manipulation, navigation, and interaction. ROS 2 provides the standardized interface layer that makes VLA models reusable across different robot platforms and tasks, just as standard datasets (ImageNet) made vision models reusable.
Summary
This chapter explored how AI agents connect to ROS 2, enabling intelligent physical systems.
Key takeaways:
- Python-based AI agents serve as cognitive layers that interpret sensors and make high-level decisions
- rclpy provides Pythonic interfaces for AI code to create nodes, use topics, call services, and send action goals
- ROS 2 bridges LLMs and robots by translating natural language intent into structured commands
- High-level planning (what to do) separates from low-level control (how to do it) for modularity and safety
- Controllers act as the robot's motor cortex, handling kinematics, dynamics, and real-time constraints
- Safety boundaries validate AI commands and enable emergency overrides to prevent harm
- Natural language to robot action flows demonstrate end-to-end integration through ROS 2 communication
- This architecture enables humanoid robots to interact naturally while maintaining safety in human environments
- VLA systems will build on these patterns to create end-to-end learned models for physical AI
Completion: You've now completed Module 1, understanding ROS 2 as the nervous system for physical AI. You know what ROS 2 is, how its communication primitives work, and how AI agents connect to robot hardware through ROS 2 interfaces.
Next steps: Later modules will cover simulation environments (Isaac Sim, Gazebo) where you'll build and test robot behaviors, and Vision-Language-Action systems that enable robots to learn from demonstration and adapt to new tasks.