Chapter 3: Bridging AI Agents with ROS 2

Learning Objectives

By the end of this chapter, you will:

Explain how Python-based AI agents serve as cognitive layers in robot systems
Describe rclpy as the interface connecting AI code to ROS 2 communication
Diagram natural language command to robot action flow using ROS 2 primitives
Identify safety boundaries between high-level AI planning and low-level robot control

Python Agents as Cognitive Layers

Modern AI models—large language models, vision transformers, and reinforcement learning policies—are typically developed in Python using frameworks like PyTorch, TensorFlow, and Hugging Face Transformers. These models serve as the "brain" of physical AI systems, making high-level decisions based on sensor input and user commands.

What is a cognitive layer: A cognitive layer is software that performs reasoning, planning, and decision-making using AI models. In humanoid robotics, the cognitive layer interprets sensor data (what does the camera see?), understands commands (what does the user want?), and decides actions (how should the robot respond?). This layer operates at a conceptual level—"pick up the cup," "navigate to the kitchen," "wave hello"—rather than specifying exact motor commands.

Why Python: AI research and development happen predominantly in Python because of its rich ecosystem for machine learning. Models trained using PyTorch or JAX can be loaded and executed in Python with minimal integration work. ROS 2 supports Python through rclpy, allowing AI developers to write nodes in the same language as their models.

Role in the system: The cognitive layer sits between perception and control. It receives processed sensor data (detected objects, estimated positions, recognized speech) and outputs high-level goals or actions. Lower-level controllers translate these goals into motor commands. This separation allows AI researchers to focus on decision-making without understanding motor control, while robotics engineers optimize controllers without retraining AI models.

Example architecture:

Perception nodes (C++ for performance): Process camera images, LIDAR scans, IMU data
Cognitive nodes (Python with AI models): Interpret perceptions, plan actions, understand language
Control nodes (C++ for real-time): Execute motions, maintain balance, ensure safety

This layered design mirrors how biological brains work: sensory processing, high-level reasoning, and motor control operate in distinct but coordinated subsystems.

For AI Researchers

This architecture parallels distributed ML systems where data preprocessing runs on CPUs, model inference runs on GPUs, and serving layers handle requests. In robotics, perception runs on specialized hardware, AI models run where compute is available (often GPUs), and controllers run on real-time hardware. ROS 2 handles the communication between these distributed components.

Introduction to rclpy (Conceptual)

rclpy is the ROS 2 Client Library for Python. It provides Python APIs for creating nodes, publishing to topics, calling services, and sending action goals. rclpy wraps the underlying ROS 2 middleware (DDS) with Pythonic interfaces, making it straightforward to integrate AI code with robot systems.

What rclpy provides:

Node creation: Define a Python class as a ROS 2 node with lifecycle management
Publishers and subscribers: Send and receive messages on topics using Python types
Service clients and servers: Implement request/response patterns with Python functions
Action clients and servers: Handle long-running goals with feedback callbacks
Timers and callbacks: Schedule periodic operations or respond to events
Parameter handling: Configure node behavior dynamically

Purpose, not exhaustive API: This chapter introduces rclpy conceptually to show how AI agents connect to ROS 2. We focus on the purpose and patterns, not complete API documentation. Detailed tutorials and hands-on coding come in later modules.

Example - AI vision node:

import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from std_msgs.msg import String

class VisionCognitiveNode(Node):
    def __init__(self):
        super().__init__('vision_cognitive_node')

        # Subscribe to camera images
        self.image_sub = self.create_subscription(
            Image, '/camera/image', self.image_callback, 10)

        # Publish detected objects
        self.objects_pub = self.create_publisher(
            String, '/detected_objects', 10)

    def image_callback(self, msg):
        # Run AI vision model
        detected = self.run_object_detection(msg)

        # Publish results
        result_msg = String()
        result_msg.data = detected
        self.objects_pub.publish(result_msg)

This pattern—subscribe to sensor topics, process with AI, publish results—is fundamental to cognitive nodes in ROS 2.

For Beginners

rclpy is like a translator that lets your Python AI code talk to the robot. Just as you might use a library to connect Python to a database (like SQLAlchemy) or a web API (like requests), rclpy connects Python to the robot's communication system. You write normal Python code, and rclpy handles the robot-specific details.

ROS 2 as LLM-to-Robot Interface

Large language models (LLMs) like GPT-4, Claude, or LLaMA understand natural language and can reason about tasks. Connecting LLMs to robots enables natural interaction: users can give verbal commands, and the robot translates those commands into physical actions.

The interface challenge: LLMs output text. Robots need structured commands (move arm to position X, navigate to waypoint Y). ROS 2 serves as the interface layer that bridges natural language understanding and robot actuation.

How it works:

Speech recognition node: Converts user voice to text, publishes on /voice_command topic
LLM cognitive node: Subscribes to /voice_command, queries LLM to interpret intent, determines required robot actions
Task translation: LLM node translates high-level intent ("bring me a water bottle") into ROS 2 action goals (navigate to kitchen, detect bottle, grasp bottle, navigate to user)
Action execution: Navigation and manipulation controllers receive action goals and execute motions
Feedback: Controllers publish progress, LLM node can adjust plan if needed

Example flow - "Wave hello":

class LLMCognitiveNode(Node):
    def __init__(self):
        super().__init__('llm_cognitive_node')
        self.command_sub = self.create_subscription(
            String, '/voice_command', self.command_callback, 10)

        self.action_client = ActionClient(
            self, ExecuteGesture, '/execute_gesture')

    def command_callback(self, msg):
        # Query LLM to interpret command
        intent = self.query_llm(msg.data)

        if intent['action'] == 'wave':
            # Send action goal to gesture controller
            goal = ExecuteGesture.Goal()
            goal.gesture_type = 'wave'
            goal.hand = 'right'
            self.action_client.send_goal_async(goal)

Why this matters: This pattern enables robots to work with humans using natural language instead of requiring programming knowledge or button interfaces. For humanoid robots designed to assist people in homes and workplaces, natural language interaction is essential.

For Software Engineers

This is similar to voice assistants (Alexa, Siri) but connected to physical hardware. The LLM interprets intent like a voice assistant, but instead of playing music or setting timers, it sends commands to motors and actuators through ROS 2 topics and actions. Think of it as an API layer between natural language and robot APIs.

High-Level Planning vs Low-Level Control

Physical AI systems separate high-level planning (what to do) from low-level control (how to do it). This separation allows AI agents to focus on decision-making while controllers ensure safe and accurate execution.

High-level planning (AI cognitive layer):

Interprets sensor data and user commands
Decides which actions accomplish the goal
Sequences tasks (first navigate, then grasp, then return)
Monitors task progress and adjusts plans when needed
Operates at the level of objects, locations, and goals

Low-level control (ROS 2 controllers):

Translates high-level goals into motor commands
Maintains balance and stability during motion
Handles kinematics and dynamics constraints
Ensures safety limits are not violated
Operates at the level of joint angles, velocities, and forces

Why separation matters: AI models reason in terms of objects and actions ("pick up the cup"). Controllers reason in terms of joint angles and torques. If the AI layer had to compute inverse kinematics, plan collision-free paths, and maintain balance, it would become impossibly complex. Separating concerns allows each layer to focus on what it does well.

Example - Grasping task:

High-level (AI): "Grasp the red cup on the table"
- Identify red cup from camera image (vision AI)
- Determine 3D position (perception system)
- Send action goal: move gripper to cup position and close
Low-level (Controller): Receives action goal with target position
- Compute inverse kinematics to find joint angles
- Plan smooth trajectory avoiding obstacles
- Execute motion while monitoring force sensors
- Close gripper when contact detected

The AI layer treats the controller as a black box: send a goal, receive feedback, get result. The controller treats the AI layer as a black box: receive goals, execute safely, report progress.

For Robotics Students

This mirrors the distinction between task planning (AI planning, behavior trees) and motion planning (trajectory optimization, feedback control). The cognitive layer operates in task space (Cartesian coordinates, object poses), while the control layer operates in configuration space (joint angles, velocities). ROS 2 actions provide the interface between these spaces.

ROS 2 Controllers as Motor Cortex

In neuroscience, the motor cortex translates intentions ("reach for the cup") into coordinated muscle activations. The prefrontal cortex (decision-making) doesn't control individual muscles; it sends high-level commands to the motor cortex, which handles the details.

The analogy in robotics: ROS 2 controllers serve as the robot's motor cortex. The AI cognitive layer (prefrontal cortex) sends high-level goals. Controllers (motor cortex) translate those goals into precise motor commands while handling real-time constraints.

What controllers do:

Inverse kinematics: Convert desired end-effector position to joint angles
Trajectory planning: Generate smooth paths between positions
Feedback control: Adjust commands based on actual vs desired state
Safety monitoring: Enforce joint limits, collision avoidance, force limits
Real-time execution: Run at high frequency (100-1000 Hz) for stability

Example - Arm reaching controller:

class ArmReachingController(Node):
    def __init__(self):
        super().__init__('arm_reaching_controller')

        # Action server for high-level goals
        self.action_server = ActionServer(
            self, ReachTarget, '/reach_target', self.execute_reach)

        # Publisher for low-level joint commands
        self.joint_cmd_pub = self.create_publisher(
            JointCommand, '/arm/joint_commands', 10)

    def execute_reach(self, goal_handle):
        target_pose = goal_handle.request.target_pose

        # Compute inverse kinematics
        joint_angles = self.inverse_kinematics(target_pose)

        # Plan trajectory
        trajectory = self.plan_smooth_trajectory(joint_angles)

        # Execute with feedback control
        for waypoint in trajectory:
            cmd = self.compute_control_command(waypoint)
            self.joint_cmd_pub.publish(cmd)

            # Publish feedback
            feedback = ReachTarget.Feedback()
            feedback.current_pose = self.get_current_pose()
            goal_handle.publish_feedback(feedback)

        return ReachTarget.Result(success=True)

The cognitive layer sends a goal ("reach this pose"), and the controller handles all the motion planning and execution details.

For Beginners

Think of it like driving a car. You (the brain) decide "turn left at the intersection," but you don't consciously control every muscle in your hands, arms, and legs. Your motor cortex handles the detailed coordination. Similarly, the AI decides "move arm to cup," and the controller handles all the joint movements to make it happen smoothly and safely.

Safety and Control Boundaries

When AI agents control physical robots, safety is paramount. Unlike software-only systems where bugs cause errors or crashes, physical AI failures can cause damage or injury. Safety boundaries define what the AI can and cannot do.

Types of boundaries:

Workspace limits: AI can command motions only within defined regions
- Arm must not strike the robot's own body
- Navigation must stay within mapped areas
- Gripper force must not exceed safe thresholds
Command validation: Controllers verify that AI commands are physically feasible
- Joint angles within hardware limits
- Velocities and accelerations achievable
- Trajectories collision-free
Emergency override: Safety systems can override AI decisions
- Person detected nearby → stop all motion
- Joint torque exceeds limit → halt and alert
- Battery critical → return to charging station
Temporal constraints: AI decisions must happen within time bounds
- Balance controller needs commands at 100 Hz minimum
- If AI doesn't respond, use default safe behavior
- Timeout on long-running actions

Implementation pattern:

class SafetyMonitorNode(Node):
    def __init__(self):
        super().__init__('safety_monitor')

        # Subscribe to all motion commands
        self.cmd_sub = self.create_subscription(
            JointCommand, '/arm/joint_commands',
            self.validate_command, 10)

        # Subscribe to sensors
        self.proximity_sub = self.create_subscription(
            ProximitySensor, '/proximity',
            self.check_obstacles, 10)

        # Service to emergency stop all motion
        self.estop_client = self.create_client(
            EmergencyStop, '/emergency_stop')

    def validate_command(self, cmd):
        # Check if command is within safe limits
        if not self.is_safe(cmd):
            self.trigger_emergency_stop()

    def check_obstacles(self, sensor_data):
        # If person too close, stop
        if sensor_data.distance < 0.5:  # 50cm
            self.trigger_emergency_stop()

Why this matters for humanoid robots: Humanoids operate in human environments, often in close proximity to people. A humanoid must never harm a person, even if the AI makes a mistake. Safety boundaries ensure that even if the cognitive layer fails or makes poor decisions, the robot remains safe.

For AI Researchers

This is analogous to guardrails in LLM deployment (content filters, output validation). Just as production LLM systems validate outputs before showing them to users, robot systems validate AI decisions before executing them physically. The difference is that robot failures can cause physical harm, so validation must be real-time and hardware-enforced, not just software checks.

Example: Natural Language Command to ROS 2 Action

Let's trace a complete flow from natural language input to robot execution, showing how all components work together.

Scenario: User says "Please wave hello to me."

Step-by-step flow:

Speech Recognition (external service or node)
- Microphone captures audio
- Speech-to-text converts to string: "Please wave hello to me"
- Publishes to /voice_command topic

LLM Cognitive Node (Python with GPT-4/Claude)

def command_callback(self, msg):
    prompt = f"User command: {msg.data}\nIntent (JSON):"
    response = self.query_llm(prompt)
    intent = json.loads(response)
    # Result: {"action": "wave_greeting", "hand": "right"}

Interprets command using LLM
Determines intent: execute wave gesture
Selects appropriate hand (right is convention for greetings)

Task Planning (same cognitive node)

goal = ExecuteGesture.Goal()
goal.gesture_type = 'wave'
goal.hand = 'right'
goal.repetitions = 3

self.gesture_action.send_goal_async(
    goal, feedback_callback=self.on_feedback)

Formulates action goal in ROS 2 format
Sends to gesture controller action server

Gesture Controller (C++ or Python node)
- Receives action goal
- Loads predefined wave trajectory (sequence of arm poses)
- Sends waypoints to arm controller at 10 Hz
- Publishes feedback: current repetition count
Arm Controller (low-level C++ node)
- Receives desired joint positions for each waypoint
- Computes torque commands using PID control
- Publishes commands to motor drivers at 100 Hz
- Monitors joint state for safety
Safety Monitor (always running)
- Subscribes to proximity sensors
- If person gets closer than 30cm during wave, cancels action
- Arm stops immediately and returns to neutral pose
Feedback to User (optional)
- When action completes, LLM node receives result
- Can trigger speech synthesis: "Hello! Nice to meet you."

Key observations:

Natural language processing happens at cognitive layer (LLM)
Motion execution happens at control layer (controllers)
ROS 2 actions bridge the two with goals, feedback, and results
Safety monitoring runs independently and can override at any time
Each layer operates in its own space: LLM in concepts, controller in joint angles

This architecture enables complex, interactive behaviors while maintaining modularity and safety.

For Beginners

This is like ordering food at a restaurant. You tell the waiter what you want (natural language), the waiter translates it to the kitchen (task planning), the chef prepares it (execution), and you get updates on timing (feedback). The kitchen staff (controllers) handle the cooking details; you (the cognitive layer) just specify what you want.

Why This Matters for Humanoid Robots

Humanoid robots aim to operate in human environments and interact naturally with people. This requires combining human-like reasoning (AI) with precise physical control (robotics). The AI-ROS 2 bridge makes this possible.

Natural interaction: People communicate with language and gestures, not programming interfaces. LLMs enable humanoids to understand verbal commands, answer questions, and engage in dialogue while performing physical tasks.

Adaptability: Humanoid robots encounter unpredictable environments—furniture layouts change, objects move, people walk by. AI agents can perceive these changes and adapt plans in real-time. Controllers execute the adapted plans safely.

Learning from experience: Future humanoid systems will learn from demonstrations and corrections. A person shows the robot how to fold laundry, and the robot learns the task. This learning happens at the AI cognitive layer, while ROS 2 controllers handle the physical execution using learned policies.

Complex task coordination: Humanoid tasks often require coordinating vision, language, navigation, and manipulation. "Bring me the book from the shelf" involves recognizing the book (vision), understanding the request (language), walking to the shelf (navigation), and picking up the book (manipulation). AI agents orchestrate these subsystems through ROS 2 communication.

Safety in human spaces: Humanoids work alongside people, not behind safety cages like industrial robots. AI-powered safety monitoring can recognize people, predict their movements, and adjust robot behavior to maintain safe distances and gentle interactions.

This convergence of AI reasoning and robotic control, mediated by ROS 2, defines the future of humanoid robotics.

For Software Engineers

This is the physical embodiment of intelligent systems. Web applications connect AI (recommendations, chatbots) to user interfaces (web pages, mobile apps). Humanoid robots connect AI to physical interfaces (sensors, motors). ROS 2 plays the role of APIs and message buses in distributed web systems, but optimized for real-time physical constraints.

Preparing for Vision-Language-Action Systems

Vision-Language-Action (VLA) systems represent the frontier of physical AI: models that perceive visual scenes, understand language, and output robot actions in a single learned model. This chapter has prepared the conceptual groundwork.

What are VLA systems: VLA models are trained end-to-end to map camera images and language commands directly to robot actions. Instead of separate models for vision, language understanding, and motion planning, a single neural network learns the entire perception-to-action pipeline.

Why ROS 2 still matters: Even with VLA models, ROS 2 provides essential infrastructure:

Sensor integration: VLAs need camera images, joint states, and tactile feedback—ROS 2 topics provide these
Action execution: VLA models output high-level actions; ROS 2 controllers translate them to motor commands
Safety boundaries: VLA outputs must be validated and constrained before execution
Multi-modal coordination: Real humanoid systems combine VLA with other subsystems (navigation, speech) through ROS 2 communication

Conceptual preparation: Understanding nodes, topics, actions, and AI-ROS 2 integration patterns prepares you to work with VLA systems. Later modules will explore VLA implementations in simulation (Isaac Sim) and how they integrate with ROS 2 ecosystems.

Current chapter focus: We've focused on the interface between AI and ROS 2 using current technologies (LLMs, rclpy, explicit controllers). These patterns form the foundation for more advanced systems like VLAs, which automate parts of this pipeline through learned models.

For AI Researchers

VLA systems are analogous to end-to-end self-driving models that map sensor input directly to steering commands. However, robotics is more diverse than driving—VLAs must handle manipulation, navigation, and interaction. ROS 2 provides the standardized interface layer that makes VLA models reusable across different robot platforms and tasks, just as standard datasets (ImageNet) made vision models reusable.

Summary

This chapter explored how AI agents connect to ROS 2, enabling intelligent physical systems.

Key takeaways:

Python-based AI agents serve as cognitive layers that interpret sensors and make high-level decisions
rclpy provides Pythonic interfaces for AI code to create nodes, use topics, call services, and send action goals
ROS 2 bridges LLMs and robots by translating natural language intent into structured commands
High-level planning (what to do) separates from low-level control (how to do it) for modularity and safety
Controllers act as the robot's motor cortex, handling kinematics, dynamics, and real-time constraints
Safety boundaries validate AI commands and enable emergency overrides to prevent harm
Natural language to robot action flows demonstrate end-to-end integration through ROS 2 communication
This architecture enables humanoid robots to interact naturally while maintaining safety in human environments
VLA systems will build on these patterns to create end-to-end learned models for physical AI

Completion: You've now completed Module 1, understanding ROS 2 as the nervous system for physical AI. You know what ROS 2 is, how its communication primitives work, and how AI agents connect to robot hardware through ROS 2 interfaces.

Next steps: Later modules will cover simulation environments (Isaac Sim, Gazebo) where you'll build and test robot behaviors, and Vision-Language-Action systems that enable robots to learn from demonstration and adapt to new tasks.

Learning Objectives​

Python Agents as Cognitive Layers​

Introduction to rclpy (Conceptual)​

ROS 2 as LLM-to-Robot Interface​

High-Level Planning vs Low-Level Control​

ROS 2 Controllers as Motor Cortex​

Safety and Control Boundaries​

Example: Natural Language Command to ROS 2 Action​

Why This Matters for Humanoid Robots​

Preparing for Vision-Language-Action Systems​

Summary​

Further Reading​

Learning Objectives

Python Agents as Cognitive Layers

Introduction to rclpy (Conceptual)

ROS 2 as LLM-to-Robot Interface

High-Level Planning vs Low-Level Control

ROS 2 Controllers as Motor Cortex

Safety and Control Boundaries

Example: Natural Language Command to ROS 2 Action

Why This Matters for Humanoid Robots

Preparing for Vision-Language-Action Systems

Summary

Further Reading