AI Agent Controlled (via MCP) Simulated Robot

TL;DR: An MCP server that enables AI systems to control the joint positions of a simulated robot arm in MuJoCo. The ability of AI agents (Gemini 2.5 Pro and Claude 4, in this case) to match a robot's joint configuration from a couple of reference images is very poor; in the case of the FR3 robot, it is worse than random.

This project is part of an exercise that aims to extract the joint positions of a robotic arm from just a couple of images and its URDF (or MJCF) model. Why would you want to do this, you ask? See [link to future broader-context post].

This MCP (Model Context Protocol) server enables AI systems to control the joint positions of a simulated robot arm in MuJoCo.

It is meant to be used by an agent to find a joint configuration that approximates the target configuration shown in the user-provided reference images.

It supports 50 different robot models:

| # | Robot Name | Type |
|---|------------|------|
| 1 | a1_mj_description | Quadruped |
| 2 | ability_hand_mj_description | Hand |
| 3 | adam_lite_mj_description | Humanoid |
| 4 | aliengo_mj_description | Quadruped |
| 5 | allegro_hand_mj_description | Hand |
| 6 | aloha_mj_description | Bimanual |
| 7 | anymal_b_mj_description | Quadruped |
| 8 | anymal_c_mj_description | Quadruped |
| 9 | apollo_mj_description | Humanoid |
| 10 | arx_l5_mj_description | Arm |
| 11 | booster_t1_mj_description | Humanoid |
| 12 | cassie_mj_description | Biped |
| 13 | cf2_mj_description | Drone |
| 14 | dynamixel_2r_mj_description | Arm |
| 15 | elf2_mj_description | Humanoid |
| 16 | fr3_mj_description | Arm |
| 17 | g1_mj_description | Humanoid |
| 18 | gen3_mj_description | Arm |
| 19 | go1_mj_description | Quadruped |
| 20 | go2_mj_description | Quadruped |
| 21 | h1_mj_description | Humanoid |
| 22 | iiwa14_mj_description | Arm |
| 23 | jvrc_mj_description | Humanoid |
| 24 | leap_hand_mj_description | Hand |
| 25 | low_cost_robot_arm_mj_description | Arm |
| 26 | mujoco_humanoid_mj_description | Humanoid |
| 27 | n1_mj_description | Humanoid |
| 28 | op3_mj_description | Humanoid |
| 29 | panda_mj_description | Arm |
| 30 | piper_mj_description | Arm |
| 31 | robotiq_2f85_mj_description | Gripper |
| 32 | robotiq_2f85_v4_mj_description | Gripper |
| 33 | rsk_mj_description | Arm |
| 34 | sawyer_mj_description | Arm |
| 35 | shadow_dexee_mj_description | Hand |
| 36 | shadow_hand_mj_description | Hand |
| 37 | skydio_x2_mj_description | Drone |
| 38 | so_arm100_mj_description | Arm (Default) |
| 39 | so_arm101_mj_description | Arm |
| 40 | spot_mj_description | Quadruped |
| 41 | stretch_3_mj_description | Mobile manipulator |
| 42 | stretch_mj_description | Mobile manipulator |
| 43 | talos_mj_description | Humanoid |
| 44 | ur10e_mj_description | Arm |
| 45 | ur5e_mj_description | Arm |
| 46 | viper_mj_description | Arm |
| 47 | widow_mj_description | Arm |
| 48 | xarm7_mj_description | Arm |
| 49 | yam_mj_description | Humanoid |
| 50 | z1_mj_description | Arm |
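
The `_mj_description` names match the naming convention of the robot_descriptions Python package, so loading any of these models into MuJoCo presumably looks something like the sketch below (whether the server actually uses robot_descriptions is an assumption on my part):

```python
# Minimal sketch: load one of the listed models into MuJoCo.
# Assumes the robot_descriptions package, whose naming convention
# the list above follows.
import mujoco
from robot_descriptions.loaders.mujoco import load_robot_description

model = load_robot_description("so_arm100_mj_description")  # the default model
data = mujoco.MjData(model)
print(f"{model.njnt} joints, nq = {model.nq}")
```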

The prompt

The prompt is informed by the robot's kinematics and is tailored to the task of matching the joint configuration shown in reference images of a real robot. It includes details such as the kinematic chain, joint limits, and joint axes, giving the agent enough context to anticipate the effect of each joint movement.

Example prompt for SO100

You are a robot pose matching assistant. Your task is to iteratively adjust a simulated robot’s joint angles to match a target pose shown in reference images.

Robot Configuration:

Joint Order: ['Elbow', 'Jaw', 'Pitch', 'Rotation', 'Wrist_Pitch', 'Wrist_Roll']
Joint Bounds: {'Elbow': (-0.174, 3.14), 'Jaw': (-0.174, 1.75), 'Pitch': (-3.32, 0.174), 'Rotation': (-1.92, 1.92), 'Wrist_Pitch': (-1.66, 1.66), 'Wrist_Roll': (-2.79, 2.79)}

Kinematic Chain:

  Rotation_Pitch: Rotation
    Upper_Arm: Pitch
      Lower_Arm: Elbow
        Wrist_Pitch_Roll: Wrist_Pitch
          Fixed_Jaw: Wrist_Roll
            Moving_Jaw: Jaw

Your Tools:

  • set_robot_state_and_render(state): Sets joint positions [Elbow, Jaw, Pitch, Rotation, Wrist_Pitch, Wrist_Roll] and returns a 4-view grid image

Strategy:

  1. Analyze: Compare reference image(s) with current simulated robot pose
  2. Explore systematically: Try different joint combinations - don’t assume any joint should be zero
  3. Iterate: Make adjustments based on visual comparison
  4. Refine: Continue until poses match across all camera angles

Critical Principles:

Kinematic Chain Impact:

  • Base joints (early in chain): Small changes affect entire arm - massive leverage
  • Middle joints: Create major structural changes in arm configuration
  • End-effector joints (late in chain): Fine-tune final positioning but still crucial

Joint Exploration:

  • Try both positive AND negative values - don’t assume joints should be neutral
  • Use full joint ranges - target poses may require extreme values near limits
  • Every joint matters - even small base rotations can be critical
  • Joint coupling - changing one joint affects all subsequent joints in the chain

Visual Comparison Strategy:

  • Compare overall arm shape and structure first
  • Check end-effector position and orientation
  • Verify pose matches from multiple camera angles
  • Look for subtle differences in joint angles

Systematic Approach:

  1. Start broad: Make initial estimate covering overall pose structure
  2. Refine iteratively: Adjust joints that show biggest visual mismatch
  3. Don’t ignore any joint: Even if a joint seems unimportant, try non-zero values
  4. Use incremental steps: Typically 0.1-0.5 radian adjustments for fine-tuning

Example Workflow:

1. Observe reference pose
2. Set initial estimate: set_robot_state_and_render([joint1, joint2, ...])
3. Compare result with reference
4. Identify biggest differences
5. Adjust relevant joints and repeat

Begin by analyzing the reference image(s) and making your first systematic pose estimate.
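
All of the robot-specific details in the prompt (joint order, joint bounds, kinematic chain) can be read directly from the loaded model, which is how the prompt can be generated for any of the 50 robots. Below is a minimal sketch of how the bounds dictionary might be built; this is a hypothetical helper, not necessarily the server's actual prompt builder:

```python
# Sketch: extract the joint-bounds dictionary used in the prompt
# from a loaded MuJoCo model. Hypothetical helper.
import mujoco

def joint_bounds(model: mujoco.MjModel) -> dict[str, tuple[float, float]]:
    bounds = {}
    for j in range(model.njnt):
        name = mujoco.mj_id2name(model, mujoco.mjtObj.mjOBJ_JOINT, j)
        lo, hi = model.jnt_range[j]  # note: unlimited joints report (0, 0)
        bounds[name] = (round(float(lo), 3), round(float(hi), 3))
    return bounds
```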

The tool

The MCP's only tool is the set_robot_state_and_render function, which takes the robot's joint positions as input and returns an image of the simulated robot in that configuration, rendered from several views.

def set_robot_state_and_render(state: list[float]) -> Image:

See the example below for a sample output.
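
Under the hood, the tool only needs to write the joint positions into the simulation state, run forward kinematics, and render the scene from several cameras. Here is a minimal sketch of what the body might look like; the camera setup and grid layout are assumptions, and the real server wraps the array in an MCP Image:

```python
# Sketch of the tool body: set qpos, update kinematics (no physics
# stepping is needed for a static pose), render four views, and tile
# them into a 2x2 grid. Assumes `model` and `data` exist as above.
import mujoco
import numpy as np

def set_robot_state_and_render(state: list[float]) -> np.ndarray:
    data.qpos[: len(state)] = state
    mujoco.mj_forward(model, data)  # forward kinematics only
    renderer = mujoco.Renderer(model, height=480, width=480)
    views = []
    for cam in range(4):  # assumes four cameras are defined in the MJCF
        renderer.update_scene(data, camera=cam)
        views.append(renderer.render())
    renderer.close()
    # tile the four views into a 2x2 grid
    return np.vstack([np.hstack(views[:2]), np.hstack(views[2:])])
```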

Example of the MCP in action

For details on the setup, please refer to the repo.

After prompting Cursor's Claude 4 agent to match these photos:

Reference robot pose - View 1 Reference robot pose - View 2

Reference images showing the target robot joint configuration from two different camera angles that the AI agent needs to match.

It first tried this joint configuration:

Initial joint configuration

Initial joint configuration attempt by the agent.

On the fifth try, Claude stopped and claimed success with a confidence that only an LLM can muster. While its final joint configuration is quite close to the reference images, it’s not quite there.

Perfect! I have successfully achieved a very close match to the target pose shown in the reference images…

Final joint configuration

Final ‘successful’ joint configuration by the agent.

Evaluation by simulation

Evaluation runs at WandB

We randomly sample joint configurations, generate a couple of reference images, and then have the AI agent try to match the joint configuration shown in these reference images. Since we have the ground truth (the initially sampled joint configuration), we can evaluate the agent's performance by comparing its final joint configuration to the ground truth. In particular, we compare the pose of the end effector.
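
Concretely, the end-effector comparison can be done by running forward kinematics on both the ground-truth and the agent's final configuration and measuring the distance between the resulting end-effector frames. A sketch of a position-only version follows; the end-effector body name is robot-specific, and `Fixed_Jaw` is my guess based on the SO100 kinematic chain above:

```python
# Sketch: compare the agent's final configuration against the
# ground-truth sample via end-effector position error.
import mujoco
import numpy as np

def ee_position_error(model: mujoco.MjModel,
                      q_true: np.ndarray,
                      q_agent: np.ndarray,
                      ee_body: str = "Fixed_Jaw") -> float:
    data = mujoco.MjData(model)
    positions = []
    for q in (q_true, q_agent):
        data.qpos[: len(q)] = q
        mujoco.mj_forward(model, data)
        positions.append(data.body(ee_body).xpos.copy())
    return float(np.linalg.norm(positions[0] - positions[1]))
```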

For example, for the following reference images (obtained from a random sample):

Reference robot pose - View 1 Reference robot pose - View 2

Gemini reached the following joint configuration:

Gemini final joint configuration

whereas Claude reached

Claude final joint configuration

In this particular case Claude failed miserably, and as usual, confidently.

Below we can see that Gemini does marginally better than Claude on these 20 random samples. We have included random performance as a reference.

avg performance comparison

Average performance comparison between Gemini, Claude, and random on SO100 pose matching. 20 samples per box plot.
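
The random baseline is, presumably, just uniform sampling within each joint's bounds; something like:

```python
# Sketch: assumed random baseline, sampling each joint uniformly
# within its bounds.
import numpy as np

def random_configuration(model, rng: np.random.Generator) -> np.ndarray:
    lo, hi = model.jnt_range[:, 0], model.jnt_range[:, 1]
    return rng.uniform(lo, hi)
```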

For reference, here is the ‘best’ match, achieved by Gemini at seed 7.

Reference robot pose - View 1 Reference robot pose - View 2

Gemini best joint position match

When using a more complex arm like the FR3, which has an extra degree of freedom and longer links, performance is worse than random.

fr3 avg performance

Reference robot pose - View 1 Reference robot pose - View 2

Example of reference images for FR3 pose matching.

Conclusion

The models evaluated here are surprisingly bad at matching the joint configuration of a robot from a couple of reference images. Perhaps this is because these models are trained to answer text questions about multimodal inputs, a task that does not require them to 'reason' over image comparisons with positive and negative examples. It could also be that the types of images used here are rare in the training data distribution.

The aim of this exercise was to evaluate the ability of these models to solve this matching problem, which requires the agent to reason about the effect of changes in joint positions on the resulting pose of the robot. The failure observed here perhaps highlights the importance of having models learn through interaction with spatial environments.