Relational Grasp Dataset


Robotic grasping is a fundamental problem in robotics and plays a basic role in nearly all robotic manipulation tasks. Visual perception is particularly important for grasping, since it provides rich observations of the surroundings. With the rapid development of deep learning techniques, robotic grasping has made impressive progress in recent years, and a number of excellent studies on robust grasp generation have appeared. However, grasping in realistic scenarios is usually target-driven: in most cases it is not an isolated task but one that requires comprehensive, high-level visual perception.

Despite the impressive progress achieved in robust grasp detection, robots are still not skilled at sophisticated grasping tasks (e.g., searching for and grasping a specific object in clutter). Such tasks involve not only grasping but also comprehensive perception of the visual world (e.g., the relationships between objects). Recent advances in deep learning provide a promising way to understand such high-level visual concepts, encouraging robotics researchers to explore solutions in this hard and complicated area. However, deep learning is data-hungry, and the lack of data severely limits the performance of deep-learning-based algorithms. Therefore, we propose a novel, large-scale, automatically generated dataset for safe and object-specific robotic grasping in clutter.


Dataset Introduction

RElational GRAsp Dataset (REGRAD) is a new dataset for modeling the relationships among objects and grasps. It aims to establish a new and robust benchmark for object-specific grasping in dense clutter. For each image we collect annotations of object poses, segmentations, grasps, and relationships, enabling comprehensive perception for grasping. The data are provided both as 2D images and as 3D point clouds. Moreover, since all the data are generated automatically, users are free to import their own object models and generate as much data as they want.


Fig. 1: Some examples of REGRAD. The images are taken from 9 different views, and the background is randomly generated.


Dataset Features

To support comprehensive perception for realistic grasping and train large deep models, our dataset possesses the following features:

  • Rich data. The dataset contains 2D color and depth images as well as 3D point clouds. The labels include: the 6D pose of each object; bounding boxes and segmentations on 2D images; point cloud segmentations; a Manipulation Relationship Graph indicating the grasping order; collision-free and stable 6D grasps of each object; and rectangular 2D grasps.
  • Multiple views. We record observations from 9 different views for each scene, aiming to encourage research on robust goal-oriented object searching and grasping with multi-view visual fusion.
  • More objects and categories. Our dataset is built upon the well-known 3D model dataset ShapeNet. Specifically, it covers 55 categories with 50K different object models.
  • More scalable. Compared to previous datasets like VMRD, it is much easier to expand REGRAD with more training data, since all labels are generated automatically; the only thing left to do is to find more suitable 3D object models. We also provide open-source code for dataset generation.
  • Systematic scene generation without human intervention. Previous relationship datasets were annotated manually, which introduces human bias. Here, all scenes are generated automatically according to a carefully designed procedure to avoid such bias.
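The Manipulation Relationship Graph listed above encodes which objects must be removed before others can be grasped safely. As a minimal sketch of how such a graph yields a grasping order (the object names and support edges below are made-up toy data, not taken from REGRAD), a valid order is a topological sort of the graph:

```python
from collections import deque

def grasp_order(objects, supports):
    """Return a safe grasping order via Kahn's topological sort.

    objects:  iterable of object IDs.
    supports: list of (lower, upper) pairs meaning `lower` supports
              `upper`, so `upper` must be grasped before `lower`.
    """
    indeg = {o: 0 for o in objects}
    blocked = {o: [] for o in objects}
    for lower, upper in supports:
        # `upper` rests on `lower`, so removing `upper` unblocks `lower`.
        blocked[upper].append(lower)
        indeg[lower] += 1
    queue = deque(o for o in objects if indeg[o] == 0)  # nothing on top
    order = []
    while queue:
        o = queue.popleft()
        order.append(o)
        for b in blocked[o]:
            indeg[b] -= 1
            if indeg[b] == 0:
                queue.append(b)
    if len(order) != len(objects):
        raise ValueError("relationship graph contains a cycle")
    return order

# Toy scene: a box rests on a book, the book rests on a table.
print(grasp_order(["table", "book", "box"],
                  [("table", "book"), ("book", "box")]))
# -> ['box', 'book', 'table']
```

The same traversal works on any acyclic relationship graph, regardless of how many objects sit on each support.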


Dataset Download

Downloading our dataset is free and open. If you use it, please cite the paper: “REGRAD: A Large-Scale Relational Grasp Dataset for Safe and Object-Specific Robotic Grasping in Clutter”.

A subset of the dataset can be downloaded at this link:

To obtain the full dataset, please email:


Object Stacking Grasping Dataset (OSGD)


    In task-oriented grasping, the robot is supposed to manipulate objects in a task-compatible manner, which is more meaningful but also more challenging than merely grasping stably. Worse still, real-world scenes usually contain multiple stacked objects with severe overlaps and occlusions. Therefore, we construct a synthetic dataset, the Object Stacking Grasping Dataset (OSGD), for task-oriented grasping in object stacking scenes.



    To reduce data collection time, we construct OSGD by synthesis. The dataset contains 8000 depth maps of object stacking scenes, and for each object in a scene we annotate its bounding box, category, and task-compatible grasps, as shown below:



    The generation process of our dataset is described in detail in the paper "Task-oriented Grasping in Object Stacking Scenes with CRF-based Semantic Model".

    Models trained on OSGD generalize well to real-world scenes. Note, however, that the object materials should not exhibit strong specular reflection or light absorption, and the objects should not be too small or too thin.


Dataset Format

    The dataset OSGD contains 4 folders: 'JPEGImages', 'Annotations', 'Grasps', and 'ImageSets'. 'JPEGImages' contains the 8000 synthetic depth maps. 'Annotations' contains the object annotation files, in which each object's bounding box and category are stored in XML format. 'Grasps' contains the grasp annotation files, in which each grasp is identified by its 4 vertices, its task label, and the object it belongs to. 'ImageSets' contains the files that split the dataset into a training set (7000 depth maps) and a test set (1000 depth maps).
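Assuming the .xml files in 'Annotations' follow a PASCAL-VOC-like layout (the tag names below are illustrative, not verified against the dataset), the bounding boxes and categories can be read with the Python standard library alone:

```python
import xml.etree.ElementTree as ET

# Illustrative annotation snippet; the real OSGD tag names may differ.
XML = """
<annotation>
  <object>
    <name>hammer</name>
    <bndbox><xmin>34</xmin><ymin>50</ymin><xmax>210</xmax><ymax>180</ymax></bndbox>
  </object>
  <object>
    <name>cup</name>
    <bndbox><xmin>120</xmin><ymin>90</ymin><xmax>200</xmax><ymax>160</ymax></bndbox>
  </object>
</annotation>
"""

def parse_objects(xml_text):
    """Return a list of (category, (xmin, ymin, xmax, ymax)) tuples."""
    root = ET.fromstring(xml_text)
    out = []
    for obj in root.findall("object"):
        name = obj.findtext("name")
        box = obj.find("bndbox")
        coords = tuple(int(box.findtext(t)) for t in ("xmin", "ymin", "xmax", "ymax"))
        out.append((name, coords))
    return out

print(parse_objects(XML))
# -> [('hammer', (34, 50, 210, 180)), ('cup', (120, 90, 200, 160))]
```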



     Downloading the dataset OSGD is free and open. If you use it, please cite the paper "Task-oriented Grasping in Object Stacking Scenes with CRF-based Semantic Model". The download link is:


    The folder OSGD contains two compressed files, one of which is OSGD_ORI.rar. In the other archive, the depth maps are normalized to [0, 255] and stored in .jpg format. In OSGD_ORI.rar, the depth maps are stored in .npy format and the depth unit is dm.
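The two storage formats can be related by a simple per-image normalization. A minimal sketch (per-image min-max scaling is assumed here as one common choice; the dataset's actual normalization is not specified):

```python
import numpy as np

def normalize_depth(depth_dm):
    """Map a raw depth map (in decimeters, as in OSGD_ORI.rar)
    to uint8 [0, 255], as in the .jpg version of the dataset.

    Min-max scaling is assumed; OSGD's exact scheme may differ.
    """
    d = np.asarray(depth_dm, dtype=np.float64)
    lo, hi = d.min(), d.max()
    if hi == lo:                       # flat depth map: avoid divide-by-zero
        return np.zeros(d.shape, dtype=np.uint8)
    return np.round((d - lo) / (hi - lo) * 255).astype(np.uint8)

depth = np.array([[5.0, 5.5], [6.0, 7.0]])   # toy depth map in dm
print(normalize_depth(depth))
```

Note that this scaling is per image, so absolute depth in dm can only be recovered from the .npy files, not from the normalized .jpg files.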



Visual Manipulation Relationship Dataset


Perception and cognition play important roles in intelligent robot research. Before interacting with the environment, for example in grasping or manipulation tasks, the robot first needs to understand and infer what to do and how to do it. In these situations, detecting targets and predicting manipulation relationships in real time ensure that the robot can complete its tasks safely and reliably. Therefore, we collect the Visual Manipulation Relationship Dataset (VMRD), which can be used to train a robot to perceive and understand its environment, locate its target, and find the proper order in which to grasp it.


Dataset Introduction

The Visual Manipulation Relationship Dataset (VMRD) was collected and labeled using hundreds of objects from 31 categories. It contains 5185 images in total, with 17688 object instances and 51530 manipulation relationships. Some example images and the data distribution are shown below. Each object node includes the category, the bounding box location, the index of the current node, and the indexes of its parent nodes and child nodes.




Fig. 1. (a) Category and manipulation relationship distribution. (b) Some dataset examples.


Annotation Format

Objects and manipulation relationships in VMRD are labeled in XML format. Each annotation file contains several objects with their locations, indexes, fathers, and children. Each object's index is unique within an image, even when several objects belong to the same category. Details are given below:

  1. "location" is defined as (xmin, ymin, xmax, ymax), the coordinates of the top-left and bottom-right vertices of the object bounding box.

  2. "name" is a string which is the category of the object instance.

  3. "index" is the unique ID of the object instance in one image.

  4. "father" includes all indexes of parent nodes of one object instance.

  5. "children" includes all indexes of child nodes of one object instance.
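Assuming each of the fields above appears as a per-object element in the XML file (the tag names mirror the field names, but the exact file layout below is illustrative, not verified), the father/children links can be loaded and cross-checked for consistency:

```python
import xml.etree.ElementTree as ET

# Illustrative VMRD-style annotation; tag names follow the fields above.
XML = """
<annotation>
  <object>
    <name>pen</name><index>1</index>
    <father>2</father>
    <children></children>
  </object>
  <object>
    <name>notebook</name><index>2</index>
    <father></father>
    <children>1</children>
  </object>
</annotation>
"""

def load_relations(xml_text):
    """Return {index: (name, fathers, children)} and verify that
    every father link has the matching child link."""
    root = ET.fromstring(xml_text)
    nodes = {}
    for obj in root.findall("object"):
        idx = int(obj.findtext("index"))
        fathers = [int(t) for t in (obj.findtext("father") or "").split()]
        children = [int(t) for t in (obj.findtext("children") or "").split()]
        nodes[idx] = (obj.findtext("name"), fathers, children)
    for idx, (_, fathers, _) in nodes.items():
        for f in fathers:
            # A father link must be mirrored by the father's child list.
            assert idx in nodes[f][2], f"object {f} missing child link to {idx}"
    return nodes

print(load_relations(XML))
```

Here the pen rests on the notebook, so the notebook is the pen's father and must be grasped after it.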


Fig. 2. One example of our dataset.


Grasp Annotation Format

In our dataset, each grasp has 5 dimensions, (x, y, w, h, θ), as shown in the following figure: (x, y) is the coordinate of the grasp rectangle center, (w, h) is the width and height of the grasp rectangle, and θ is the rotation angle w.r.t. the horizontal axis.
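The (x, y, w, h, θ) representation can be converted to corner points by rotating the rectangle's corners about its center. A minimal sketch (θ in degrees is assumed here; check the unit against the actual annotations):

```python
import math

def rect_to_vertices(x, y, w, h, theta_deg):
    """Convert a grasp (x, y, w, h, theta) to its 4 corner points.

    (x, y) is the rectangle center, (w, h) its width/height, and theta
    the rotation w.r.t. the horizontal axis (degrees assumed).
    """
    t = math.radians(theta_deg)
    c, s = math.cos(t), math.sin(t)
    # Corner offsets in the rectangle's own frame.
    corners = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each offset by theta and translate to the center.
    return [(x + dx * c - dy * s, y + dx * s + dy * c) for dx, dy in corners]

# Axis-aligned sanity check: a 4x2 rectangle centered at (10, 20).
print(rect_to_vertices(10, 20, 4, 2, 0))
# -> [(8.0, 19.0), (12.0, 19.0), (12.0, 21.0), (8.0, 21.0)]
```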



However, in the grasp annotation files, grasps are labeled by the coordinates of their 4 vertices, (x1, y1, x2, y2, x3, y3, x4, y4). The first two points of a rectangle define the line representing the orientation of the gripper plate, and all 4 vertices are listed in counter-clockwise order. To indicate which object each grasp belongs to, the object index is also added to each annotation. The last field is "e" or "h", meaning that the grasp is easy or hard (its execution may be hindered by other objects) to execute. In total, 4683 images are labeled with grasps, divided into a training set (4233 images) and a testing set (450 images).
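Putting these fields together, one line of a grasp annotation file can be parsed as follows (whitespace separation and the field ordering are assumed from the description above, not verified against the files):

```python
def parse_grasp_line(line):
    """Parse 'x1 y1 x2 y2 x3 y3 x4 y4 obj_index difficulty' into a dict.

    The first two vertices define the gripper-plate line; 'difficulty'
    is 'e' (easy) or 'h' (hard). The format is assumed, not verified.
    """
    fields = line.split()
    coords = [float(v) for v in fields[:8]]
    vertices = list(zip(coords[0::2], coords[1::2]))  # 4 (x, y) points
    return {
        "vertices": vertices,
        "object_index": int(fields[8]),   # which object the grasp belongs to
        "difficulty": fields[9],          # 'e' = easy, 'h' = hard to execute
    }

g = parse_grasp_line("100 40 160 40 160 80 100 80 2 e")
print(g["vertices"], g["object_index"], g["difficulty"])
```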


Downloading of our dataset is free and open. Please cite the following paper:

H. Zhang, X. Lan*, X. Zhou, Z. Tian, Y. Zhang, N. Zheng, "Visual Manipulation Relationship Network for Autonomous Robotics," IEEE-RAS 18th International Conference on Humanoid Robots (Humanoids), Beijing, China, November 6-9, 2018.


The complete dataset can be downloaded at this link: 


V1 (5185 images without grasps):


V2 (4683 images with grasps):