Object pose detection is a task that is highly useful for a variety of object manipulation tasks such as robotic grasping and tool handling. Perspective-n-Point matching between keypoints on the objects offers a way to perform pose estimation where the keypoints also provide inherent object information, such as corner locations and object part sections, without the need to reference a separate 3D model. Existing works focus on scenes with little occlusion and limited object categories. In this study, we demonstrate the feasibility of a pose estimation network based on detecting semantically important keypoints on the MetagraspNet dataset which contains heavy occlusion and greater scene complexity. We further discuss various challenges in using semantically important keypoints as a way to perform object pose estimation. These challenges include maintaining consistent keypoint definition, as well as dealing with heavy occlusion and similar visual features.
pdf