Cannot reproduce paper results with released RoboTwin-MeM checkpoint (0/20 across 3 tasks vs paper's 48–90%, after fixing several setup issues)

Thanks for releasing the code and checkpoints! We spent considerable effort evaluating the released RoboTwin-MeM checkpoint and would like to report our findings and ask a few questions.

## Setup issues we had to fix first

(sharing in case they help others)

1. **Assets**: as in #2, the download script pulls RoboTwin 2.0 base objects whose indices collide with RoboTwin-MeM objects (e.g. `009_kettle` vs `009_toycar`, `010_pen` vs `010_mouse`). Fixed by overlaying the full assets from the HF dataset repo.
2. **Embodiment**: the training data metadata (`lerobot_2.1/*/meta/info.json`) says `robot_type: "aloha"`, so we evaluate with `embodiment: [aloha-agilex]`. Note `assets/embodiments/aloha-agilex/curobo_left.yml` and `curobo_right.yml` contain hardcoded absolute paths (`/mnt/workspace/yangganlin/code/RMBench/...`) that need manual fixing.
3. **Checkpoint config**: the released `config.yaml` has `framework.name: QwenOFT`, which `build_framework` rejects; we changed it to `EventVLA`. The weights then load with **zero missing/unexpected/shape-mismatched keys**.
4. **Websocket**: the client's `ping_interval=20` drops the connection whenever a single inference call takes longer than 20 s; we set it to `None`.
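For item 2, the manual path fix can be scripted. The following is a minimal sketch, not part of the released code: `rewrite_prefix` and `patch_file` are hypothetical helper names, the old prefix is taken only as far as it is quoted above, and it assumes the remainder of each path mirrors the layout of your local checkout.

```python
from pathlib import Path

# Prefix as quoted in item 2 (the rest of each path is left untouched).
OLD_PREFIX = "/mnt/workspace/yangganlin/code/RMBench"


def rewrite_prefix(text, new_root, old_prefix=OLD_PREFIX):
    """Replace every occurrence of old_prefix with new_root.

    Returns (patched_text, number_of_replacements).
    """
    count = text.count(old_prefix)
    return text.replace(old_prefix, new_root.rstrip("/")), count


def patch_file(path, new_root):
    """Patch one config file in place, keeping a .bak copy of the original.

    Assumption: the file is plain UTF-8 text, so a string replace is enough
    and no YAML parser is needed.
    """
    path = Path(path)
    original = path.read_text(encoding="utf-8")
    patched, count = rewrite_prefix(original, new_root)
    if count:
        backup = path.with_name(path.name + ".bak")
        backup.write_text(original, encoding="utf-8")
        path.write_text(patched, encoding="utf-8")
    return count
```

Calling `patch_file("assets/embodiments/aloha-agilex/curobo_left.yml", "/path/to/your/checkout")`, and the same for `curobo_right.yml`, returns the number of paths rewritten; a return value of 0 means the file did not contain the old prefix.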

## Results

After these fixes the policy behaves qualitatively correctly: the ALOHA arms reach and press buttons, and keyframe events fire (conf≈1.0) exactly at press moments, with the scene matching the training videos. However:

| Task | Paper (Table 2) | Ours (seeds 100000+, `demo_clean`, `instruction_type: unseen`, step limit 1000) |
|---|---|---|
| press_button_keyframe | 48% | **0/10** |
| put_back_block_hard | 62% | **0/5** |
| pick_objects_in_order | 90% | **0/5** |

All runs complete cleanly (no crashes or disconnects). A typical failure: in press_button_keyframe the policy presses 2–5 times and then idles until timeout, i.e. it fails the memory-dependent counting even though keyframes are being committed.
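As a rough check that these zeros are unlikely to be sampling noise from our small episode counts, the sketch below computes the binomial probability of seeing zero successes if the true success rate equaled the paper's number. It assumes episodes are independent and that the Table 2 percentage is the true per-episode success probability; `prob_at_most` is a helper name introduced here for illustration.

```python
from math import comb


def prob_at_most(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# task -> (paper success rate, our successes, our episodes), from the table above
results = {
    "press_button_keyframe": (0.48, 0, 10),
    "put_back_block_hard": (0.62, 0, 5),
    "pick_objects_in_order": (0.90, 0, 5),
}

for task, (p, k, n) in results.items():
    prob = prob_at_most(k, n, p)
    print(f"{task}: P(<= {k}/{n} successes | p={p}) = {prob:.2e}")
```

Under those assumptions each probability is below 1%, so the gap looks systematic rather than a result of too few episodes.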

## Questions

1. Is `RoboTwin-MeM/final_model/pytorch_model.pt` the exact model used for Table 2?
2. Could you share the exact evaluation configuration (task_config yaml, instruction_type, step limits, seed range)?
3. Is there any known issue where committed keyframes fail to be injected into the VLM at inference in the released code path (`examples/RoboTwin-Mem/eval_files/`)?

Happy to provide full logs. Thanks!

