Cannot reproduce paper results with released RoboTwin-MeM checkpoint (0/20 across 3 tasks vs paper's 48–90%, after fixing several setup issues) #3

Description

@jiang-tianyi666

Thanks for releasing the code and checkpoints! We spent considerable effort evaluating the released RoboTwin-MeM checkpoint and would like to report our findings and ask a few questions.

Setup issues we had to fix first

(sharing in case they help others)

  1. Assets: as reported in #2 ("missing 003_cover asset"), the download script pulls RoboTwin 2.0 base objects whose indices collide with RoboTwin-MeM objects (e.g. 009_kettle vs 009_toycar, 010_pen vs 010_mouse). Fixed by overlaying the full assets from the HF dataset repo.
  2. Embodiment: the training data metadata (lerobot_2.1/*/meta/info.json) says robot_type: "aloha", so we evaluate with embodiment: [aloha-agilex]. Note assets/embodiments/aloha-agilex/curobo_left.yml and curobo_right.yml contain hardcoded absolute paths (/mnt/workspace/yangganlin/code/RMBench/...) that need manual fixing.
  3. Checkpoint config: the released config.yaml has framework.name: QwenOFT, which build_framework rejects; we changed it to EventVLA. The weights then load with zero missing/unexpected/shape-mismatched keys.
  4. Websocket: client ping_interval=20 drops the connection when one inference exceeds 20s; we set it to None.
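
For anyone hitting items 1 and 2, here is a minimal sketch of how the index collisions can be detected and the hardcoded path prefix rewritten. It is illustrative only and not the exact commands we ran: the function names, the replacement root `/path/to/RMBench`, and the sample config line are hypothetical, and the asset names are just the ones mentioned above.

```python
import re
from collections import defaultdict


def find_index_collisions(*asset_name_lists):
    """Return {index: sorted names} for numeric prefixes used by more than one name."""
    by_index = defaultdict(set)
    for names in asset_name_lists:
        for name in names:
            match = re.match(r"(\d+)_(.+)", name)
            if match:  # ignore entries that do not follow the NNN_name pattern
                by_index[match.group(1)].add(name)
    return {idx: sorted(names) for idx, names in by_index.items() if len(names) > 1}


def rewrite_path_prefix(text, old_prefix, new_prefix):
    """Replace every occurrence of a hardcoded absolute prefix in a config's text."""
    return text.replace(old_prefix, new_prefix)


# Asset directory names from two sources (only the examples named in this issue).
set_a = ["009_kettle", "010_pen"]
set_b = ["009_toycar", "010_mouse", "003_cover"]
collisions = find_index_collisions(set_a, set_b)
print(collisions)

# Made-up config line; the real curobo_left.yml / curobo_right.yml contents differ.
OLD_PREFIX = "/mnt/workspace/yangganlin/code/RMBench"
NEW_PREFIX = "/path/to/RMBench"  # hypothetical local checkout location
sample_line = f"some_key: {OLD_PREFIX}/some/file\n"
patched_line = rewrite_path_prefix(sample_line, OLD_PREFIX, NEW_PREFIX)
print(patched_line, end="")
```

Running the collision check on the real asset directories before launching evaluation makes it obvious when the base download has shadowed a RoboTwin-MeM object.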

Results

After these fixes the policy behaves qualitatively correctly: the ALOHA arms reach and press buttons, and keyframe events fire (conf≈1.0) exactly at press moments, with the scene matching the training videos. However:

| Task | Paper (Table 2) | Ours (seeds 100000+, demo_clean, instruction_type: unseen, step limit 1000) |
| --- | --- | --- |
| press_button_keyframe | 48% | 0/10 |
| put_back_block_hard | 62% | 0/5 |
| pick_objects_in_order | 90% | 0/5 |

All runs are clean (no crashes/disconnects). A typical failure: in press_button_keyframe the policy presses 2–5 times and then idles until timeout, i.e. it fails the memory-dependent counting — even though keyframes are being committed.
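
To gauge whether these zeros could be sampling noise, here is a rough back-of-the-envelope check. It assumes episodes are independent and that the true per-episode success rate equals the paper's reported number; both are simplifying assumptions, and the helper name is ours.

```python
def prob_zero_successes(success_rate, n_episodes):
    """Probability of 0 successes in n independent episodes at the given success rate."""
    return (1.0 - success_rate) ** n_episodes


# (paper success rate, number of episodes we ran) per task, taken from the results above.
runs = {
    "press_button_keyframe": (0.48, 10),
    "put_back_block_hard": (0.62, 5),
    "pick_objects_in_order": (0.90, 5),
}

for task, (rate, n) in runs.items():
    p_zero = prob_zero_successes(rate, n)
    print(f"{task}: P(0/{n} | rate={rate:.2f}) = {p_zero:.2e}")
```

Under those assumptions each task's all-failure outcome has well under a 1% chance on its own, so we think a systematic difference (checkpoint, config, or inference path) is more likely than bad luck with seeds.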

Questions

  1. Is RoboTwin-MeM/final_model/pytorch_model.pt the exact model used for Table 2?
  2. Could you share the exact evaluation configuration (task_config yaml, instruction_type, step limits, seed range)?
  3. Is there any known issue where committed keyframes fail to be injected into the VLM at inference in the released code path (examples/RoboTwin-Mem/eval_files/)?

Happy to provide full logs. Thanks!
