Generalization in robotic manipulation refers to the ability to perform a task successfully across a wide variety of novel object instances beyond those seen during training. However, conventional behavior cloning (BC) methods struggle to generalize across instance-level differences such as object geometry, size, and appearance. We address this challenge with a new general-purpose, object-centric representation based on 3D keypoints, which achieves semantic generalization across intra-category variations. Our method, KeyGen, incorporates a standalone keypoint detector that canonicalizes object orientation and extracts stable, semantically consistent keypoints from point cloud data. Our visuomotor policy combines the extracted keypoints with object-centric point clouds to construct a robust scene representation. Furthermore, we create a novel synthetic dataset featuring three manipulation tasks, each with a diverse set of object instances, enabling the assessment of category-level generalization. Experimental results demonstrate that our approach improves sample efficiency and generalization, achieving higher success rates than existing methods on both seen and unseen object instances.
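To illustrate the idea of combining keypoints with an object-centric point cloud, the following is a minimal sketch and not the KeyGen implementation. It assumes the keypoints have already been produced by some detector (KeyGen's detector is not reproduced here), and the function name `build_scene_representation`, the centroid-centred frame, and the extra `is_keypoint` flag channel are all hypothetical choices made for illustration only.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def centroid(points: Sequence[Point]) -> Point:
    """Mean position of a non-empty set of 3D points."""
    n = len(points)
    return (
        sum(p[0] for p in points) / n,
        sum(p[1] for p in points) / n,
        sum(p[2] for p in points) / n,
    )


def build_scene_representation(
    object_points: Sequence[Point],
    keypoints: Sequence[Point],
) -> List[List[float]]:
    """Merge an object point cloud and its keypoints into one feature set.

    Hypothetical layout: each row is [x, y, z, is_keypoint], with coordinates
    expressed relative to the centroid of the object point cloud so that the
    representation does not depend on where the object sits in the scene.
    """
    c = centroid(object_points)
    rows: List[List[float]] = []
    # Ordinary surface points carry flag 0.0.
    for p in object_points:
        rows.append([p[0] - c[0], p[1] - c[1], p[2] - c[2], 0.0])
    # Keypoints are appended last and carry flag 1.0.
    for k in keypoints:
        rows.append([k[0] - c[0], k[1] - c[1], k[2] - c[2], 1.0])
    return rows
```

A policy network could consume the resulting rows as a single point set. Centring on the object centroid is only one simple way to make the input object-centric; it does not canonicalize orientation, which the abstract attributes to the keypoint detector.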
The first video in each row is the training video for that task.
Pour Task – Training
Pour Task – Video 1
Pour Task – Video 2
Pour Task – Video 3
Knife Task – Training
Knife Task – Video 1
Knife Task – Video 2
Knife Task – Video 3
Stack Task – Training
Stack Task – Video 1
Stack Task – Video 2
Stack Task – Video 3