AI Agent Controlled (via MCP) Simulated Robot

TL;DR: An MCP server that enables AI systems to control the joint positions of a simulated robot arm in MuJoCo. The ability of AI agents (Gemini 2.5 Pro and Claude 4, in this case) to match a robot's joint configuration from a couple of reference images is very poor; in the case of the FR3 robot, it is worse than random.

This project is part of an exercise that aims to extract the joint positions of a robotic arm from just a couple of images and its URDF (or MJCF) model. Why would you want to do this, you ask? See [link to future broader-context post].

This MCP (Model Context Protocol) server enables AI systems to control the joint positions of a simulated robot arm in MuJoCo.

It is meant to be used by an agent to find a joint configuration that approximates the target configuration shown in the user-provided reference images.

It supports 50 different robot models:

| # | Robot Name | Type |
|---|------------|------|
| 1 | a1_mj_description | Quadruped |
| 2 | ability_hand_mj_description | Hand |
| 3 | adam_lite_mj_description | Humanoid |
| 4 | aliengo_mj_description | Quadruped |
| 5 | allegro_hand_mj_description | Hand |
| 6 | aloha_mj_description | Bimanual |
| 7 | anymal_b_mj_description | Quadruped |
| 8 | anymal_c_mj_description | Quadruped |
| 9 | apollo_mj_description | Humanoid |
| 10 | arx_l5_mj_description | Arm |
| 11 | booster_t1_mj_description | Humanoid |
| 12 | cassie_mj_description | Biped |
| 13 | cf2_mj_description | Drone |
| 14 | dynamixel_2r_mj_description | Arm |
| 15 | elf2_mj_description | Humanoid |
| 16 | fr3_mj_description | Arm |
| 17 | g1_mj_description | Humanoid |
| 18 | gen3_mj_description | Arm |
| 19 | go1_mj_description | Quadruped |
| 20 | go2_mj_description | Quadruped |
| 21 | h1_mj_description | Humanoid |
| 22 | iiwa14_mj_description | Arm |
| 23 | jvrc_mj_description | Humanoid |
| 24 | leap_hand_mj_description | Hand |
| 25 | low_cost_robot_arm_mj_description | Arm |
| 26 | mujoco_humanoid_mj_description | Humanoid |
| 27 | n1_mj_description | Humanoid |
| 28 | op3_mj_description | Humanoid |
| 29 | panda_mj_description | Arm |
| 30 | piper_mj_description | Arm |
| 31 | robotiq_2f85_mj_description | Gripper |
| 32 | robotiq_2f85_v4_mj_description | Gripper |
| 33 | rsk_mj_description | Arm |
| 34 | sawyer_mj_description | Arm |
| 35 | shadow_dexee_mj_description | Hand |
| 36 | shadow_hand_mj_description | Hand |
| 37 | skydio_x2_mj_description | Drone |
| 38 | so_arm100_mj_description | Arm (Default) |
| 39 | so_arm101_mj_description | Arm |
| 40 | spot_mj_description | Quadruped |
| 41 | stretch_3_mj_description | Mobile manipulator |
| 42 | stretch_mj_description | Mobile manipulator |
| 43 | talos_mj_description | Humanoid |
| 44 | ur10e_mj_description | Arm |
| 45 | ur5e_mj_description | Arm |
| 46 | viper_mj_description | Arm |
| 47 | widow_mj_description | Arm |
| 48 | xarm7_mj_description | Arm |
| 49 | yam_mj_description | Humanoid |
| 50 | z1_mj_description | Arm |
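
The `_mj_description` names match the naming convention of the robot_descriptions Python package, so loading any of these models into MuJoCo presumably looks something like the sketch below (whether the server actually uses robot_descriptions is an assumption on my part):

```python
# Minimal sketch: load one of the listed models into MuJoCo.
# Assumes the robot_descriptions package, whose naming convention
# the list above follows.
import mujoco
from robot_descriptions.loaders.mujoco import load_robot_description

model = load_robot_description("so_arm100_mj_description")  # the default model
data = mujoco.MjData(model)
print(f"{model.njnt} joints, nq = {model.nq}")
```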

The prompt

The prompt is informed by the robot's kinematics and is tailored to the task of matching the joint configuration shown in reference images of a real robot. It includes details such as the kinematic chain, joint limits, and joint axes, giving the agent enough context to anticipate the effect of each joint movement.

Example prompt for SO100

You are a robot pose matching assistant. Your task is to iteratively adjust a simulated robot’s joint angles to match a target pose shown in reference images.

Robot Configuration:

Joint Order: ['Elbow', 'Jaw', 'Pitch', 'Rotation', 'Wrist_Pitch', 'Wrist_Roll']
Joint Bounds: {'Elbow': (-0.174, 3.14), 'Jaw': (-0.174, 1.75), 'Pitch': (-3.32, 0.174), 'Rotation': (-1.92, 1.92), 'Wrist_Pitch': (-1.66, 1.66), 'Wrist_Roll': (-2.79, 2.79)}

Kinematic Chain:

  Rotation_Pitch: Rotation
    Upper_Arm: Pitch
      Lower_Arm: Elbow
        Wrist_Pitch_Roll: Wrist_Pitch
          Fixed_Jaw: Wrist_Roll
            Moving_Jaw: Jaw

Your Tools:

  • set_robot_state_and_render(state): Sets joint positions [Elbow, Jaw, Pitch, Rotation, Wrist_Pitch, Wrist_Roll] and returns a 4-view grid image

Strategy:

  1. Analyze: Compare reference image(s) with current simulated robot pose
  2. Explore systematically: Try different joint combinations - don’t assume any joint should be zero
  3. Iterate: Make adjustments based on visual comparison
  4. Refine: Continue until poses match across all camera angles

Critical Principles:

Kinematic Chain Impact:

  • Base joints (early in chain): Small changes affect entire arm - massive leverage
  • Middle joints: Create major structural changes in arm configuration
  • End-effector joints (late in chain): Fine-tune final positioning but still crucial

Joint Exploration:

  • Try both positive AND negative values - don’t assume joints should be neutral
  • Use full joint ranges - target poses may require extreme values near limits
  • Every joint matters - even small base rotations can be critical
  • Joint coupling - changing one joint affects all subsequent joints in the chain

Visual Comparison Strategy:

  • Compare overall arm shape and structure first
  • Check end-effector position and orientation
  • Verify pose matches from multiple camera angles
  • Look for subtle differences in joint angles

Systematic Approach:

  1. Start broad: Make initial estimate covering overall pose structure
  2. Refine iteratively: Adjust joints that show biggest visual mismatch
  3. Don’t ignore any joint: Even if a joint seems unimportant, try non-zero values
  4. Use incremental steps: Typically 0.1-0.5 radian adjustments for fine-tuning

Example Workflow:

1. Observe reference pose
2. Set initial estimate: set_robot_state_and_render([joint1, joint2, ...])
3. Compare result with reference
4. Identify biggest differences
5. Adjust relevant joints and repeat

Begin by analyzing the reference image(s) and making your first systematic pose estimate.
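
All of the robot-specific details in the prompt (joint order, joint bounds, kinematic chain) can be read directly from the loaded model, which is how the prompt can be generated for any of the 50 robots. Below is a minimal sketch of how the bounds dictionary might be built; this is a hypothetical helper, not necessarily the server's actual prompt builder:

```python
# Sketch: extract the joint-bounds dictionary used in the prompt
# from a loaded MuJoCo model. Hypothetical helper.
import mujoco

def joint_bounds(model: mujoco.MjModel) -> dict[str, tuple[float, float]]:
    bounds = {}
    for j in range(model.njnt):
        name = mujoco.mj_id2name(model, mujoco.mjtObj.mjOBJ_JOINT, j)
        lo, hi = model.jnt_range[j]  # note: unlimited joints report (0, 0)
        bounds[name] = (round(float(lo), 3), round(float(hi), 3))
    return bounds
```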

The tool

The MCP's only tool is the set_robot_state_and_render function, which takes the robot's joint positions as input and returns an image of the simulated robot in that configuration, rendered from several views.

def set_robot_state_and_render(state: list[float]) -> Image:

See the example below for a sample output.
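
Under the hood, the tool only needs to write the joint positions into the simulation state, run forward kinematics, and render the scene from several cameras. Here is a minimal sketch of what the body might look like; the camera setup and grid layout are assumptions, and the real server wraps the array in an MCP Image:

```python
# Sketch of the tool body: set qpos, update kinematics (no physics
# stepping is needed for a static pose), render four views, and tile
# them into a 2x2 grid. Assumes `model` and `data` exist as above.
import mujoco
import numpy as np

def set_robot_state_and_render(state: list[float]) -> np.ndarray:
    data.qpos[: len(state)] = state
    mujoco.mj_forward(model, data)  # forward kinematics only
    renderer = mujoco.Renderer(model, height=480, width=480)
    views = []
    for cam in range(4):  # assumes four cameras are defined in the MJCF
        renderer.update_scene(data, camera=cam)
        views.append(renderer.render())
    renderer.close()
    # tile the four views into a 2x2 grid
    return np.vstack([np.hstack(views[:2]), np.hstack(views[2:])])
```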

Example of the MCP in action

For details on the setup, please refer to the repo.

After prompting Cursor's Claude 4 agent to match these photos:

Reference robot pose - View 1 Reference robot pose - View 2

Reference images showing the target robot joint configuration from two different camera angles that the AI agent needs to match.

It first tried this joint configuration:

Initial joint configuration

Initial joint configuration attempt by the agent.

On the fifth try, Claude stopped and claimed success with a confidence that only an LLM can muster. While its final joint configuration is quite close to the reference images, it’s not quite there.

Perfect! I have successfully achieved a very close match to the target pose shown in the reference images…

Final joint configuration

Final ‘successful’ joint configuration by the agent.

Evaluation by simulation

Evaluation runs at WandB

We randomly sample joint configurations, generate a couple of reference images, and then have the AI agent try to match the joint configuration shown in these reference images. Since we have the ground truth (the initially sampled joint configuration), we can evaluate the agent's performance by comparing its final joint configuration to the ground truth. In particular, we compare the pose of the end effector.
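
Concretely, the end-effector comparison can be done by running forward kinematics on both the ground-truth and the agent's final configuration and measuring the distance between the resulting end-effector frames. A sketch of a position-only version follows; the end-effector body name is robot-specific, and `Fixed_Jaw` is my guess based on the SO100 kinematic chain above:

```python
# Sketch: compare the agent's final configuration against the
# ground-truth sample via end-effector position error.
import mujoco
import numpy as np

def ee_position_error(model: mujoco.MjModel,
                      q_true: np.ndarray,
                      q_agent: np.ndarray,
                      ee_body: str = "Fixed_Jaw") -> float:
    data = mujoco.MjData(model)
    positions = []
    for q in (q_true, q_agent):
        data.qpos[: len(q)] = q
        mujoco.mj_forward(model, data)
        positions.append(data.body(ee_body).xpos.copy())
    return float(np.linalg.norm(positions[0] - positions[1]))
```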

For example, for the following reference images (obtained from a random sample):

Reference robot pose - View 1 Reference robot pose - View 2

Gemini reached the following joint configuration:

Gemini final joint configuration

whereas Claude reached

Claude final joint configuration

In this particular case Claude failed miserably, and as usual, confidently.

Below we can see that Gemini does marginally better than Claude on these 20 random samples. We have included random performance as a reference.

avg performance comparison

Average performance comparison between Gemini, Claude, and random on SO100 pose matching. 20 samples per box plot.
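
The random baseline is, presumably, just uniform sampling within each joint's bounds; something like:

```python
# Sketch: assumed random baseline, sampling each joint uniformly
# within its bounds.
import numpy as np

def random_configuration(model, rng: np.random.Generator) -> np.ndarray:
    lo, hi = model.jnt_range[:, 0], model.jnt_range[:, 1]
    return rng.uniform(lo, hi)
```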

For reference, here is the ‘best’ match, achieved by Gemini at seed 7.

Reference robot pose - View 1 Reference robot pose - View 2

Gemini best joint position match

When using a more complex arm like the FR3, which has an extra degree of freedom and longer links, performance is worse than random.

fr3 avg performance

Reference robot pose - View 1 Reference robot pose - View 2

Example of reference images for FR3 pose matching.

Conclusion

The models evaluated here are surprisingly bad at matching the joint configuration of a robot from a couple of reference images. Perhaps this is because these models are trained to answer text questions about multimodal inputs, a task that does not require them to 'reason' over image comparisons with positive and negative examples. It could also be that the types of images used here are rare in the training data distribution.

The aim of this exercise was to evaluate the ability of these models to solve this matching problem, which requires the agent to reason about the effect of changes in joint positions on the resulting pose of the robot. The failure observed here perhaps highlights the importance of having models learn through interaction with spatial environments.