- KITTI main page
- Paper on the KITTI dataset
- KITTI annotation format
#Values    Name         Description
   1       type         Describes the type of object: 'Car', 'Van', 'Truck',
                        'Pedestrian', 'Person_sitting', 'Cyclist', 'Tram',
                        'Misc' or 'DontCare'
   1       truncated    Float from 0 (non-truncated) to 1 (truncated), where
                        truncated refers to the object leaving image boundaries
   1       occluded     Integer (0,1,2,3) indicating occlusion state:
                        0 = fully visible, 1 = partly occluded
                        2 = largely occluded, 3 = unknown
   1       alpha        Observation angle of object, ranging [-pi..pi]
   4       bbox         2D bounding box of object in the image (0-based index):
                        contains left, top, right, bottom pixel coordinates
   3       dimensions   3D object dimensions: height, width, length (in meters)
   3       location     3D object location x,y,z in camera coordinates (in meters)
   1       rotation_y   Rotation ry around Y-axis in camera coordinates [-pi..pi]
   1       score        Only for results: Float, indicating confidence in
                        detection, needed for p/r curves, higher is better.
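0: object type (Car, Van, Truck, Pedestrian, Person_sitting, Cyclist, Tram, Misc, DontCare)
1: truncation, a float from 0 (non-truncated) to 1 (truncated)
2: occlusion state, an integer (0, 1, 2, 3)
3: orientation of the object as seen from the camera, alpha [-pi..pi]
4: xmin (left) of the 2D bounding box
5: ymin (top) of the 2D bounding box
6: xmax (right) of the 2D bounding box
7: ymax (bottom) of the 2D bounding box
8: height of the 3D object dimensions
9: width of the 3D object dimensions
10: length (depth) of the 3D object dimensions
11: x coordinate of the 3D object
12: y coordinate of the 3D object
13: z coordinate of the 3D object
14: orientation of the object in camera coordinates, rotation_y [-pi..pi]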
"The difference between rotation_y and alpha is, that rotation_y is directly given in camera coordinates, while alpha also considers the vector from the camera center to the object center, to compute the relative orientation of the object with respect to the camera. For example, a car which is facing along the X-axis of the camera coordinate system corresponds to rotation_y=0, no matter where it is located in the X/Z plane (bird's eye view), while alpha is zero only, when this object is located along the Z-axis of the camera."
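The relation between the two angles is often approximated as `alpha = rotation_y - atan2(x, z)`, where `x` and `z` are the object location in camera coordinates. The following is a minimal sketch under that assumption; the function names `wrap_to_pi` and `alpha_from_rotation_y` are hypothetical and are not part of the KITTI devkit.

```python
import math


def wrap_to_pi(angle: float) -> float:
    """Wrap an angle in radians into the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def alpha_from_rotation_y(rotation_y: float, x: float, z: float) -> float:
    """Approximate alpha by subtracting the viewing-ray angle atan2(x, z).

    Assumption: the ray from the camera center to the object center is
    described by its angle in the X/Z plane, measured from the Z-axis.
    """
    return wrap_to_pi(rotation_y - math.atan2(x, z))


if __name__ == "__main__":
    # rotation_y and location (x, z) copied from the first Car label in this article;
    # the annotated alpha for that object is 2.29.
    print(alpha_from_rotation_y(rotation_y=1.72, x=-2.75, z=4.10))
    # An object on the Z-axis with rotation_y = 0 gives alpha = 0, as the quote states.
    print(alpha_from_rotation_y(rotation_y=0.0, x=0.0, z=10.0))
```

The computed value agrees with the annotated alpha only approximately, since the label values are rounded to two decimals.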
Car 0.89 0 2.29 0.00 194.70 414.71 373.00 1.57 1.67 4.14 -2.75 1.70 4.10 1.72
Pedestrian 0.00 1 0.71 1021.76 133.28 1101.39 316.63 1.81 1.06 0.73 4.75 1.33 7.59 1.25
Pedestrian 0.00 0 -1.58 672.23 171.73 690.13 224.33 1.73 0.84 0.86 2.46 1.41 24.14 -1.48
Pedestrian 0.00 0 -1.59 692.68 169.14 712.48 224.03 1.81 0.90 0.95 3.30 1.40 24.22 -1.46
Pedestrian 0.00 0 -1.46 537.84 168.08 560.90 224.34 1.78 0.92 1.01 -1.79 1.36 23.30 -1.54
DontCare -1 -1 -10 916.14 175.59 973.32 253.55 -1 -1 -1 -1000 -1000 -1000 -10
DontCare -1 -1 -10 628.80 173.19 653.95 196.66 -1 -1 -1 -1000 -1000 -1000 -10
DontCare -1 -1 -10 610.47 175.76 625.72 205.87 -1 -1 -1 -1000 -1000 -1000 -10
DontCare -1 -1 -10 766.54 168.32 841.38 261.86 -1 -1 -1 -1000 -1000 -1000 -10
DontCare -1 -1 -10 464.22 177.67 508.93 230.69 -1 -1 -1 -1000 -1000 -1000 -10
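Label lines like the ones above can be split into the fields described in the format table. The following is a minimal parsing sketch; the class `KittiObject` and the function `parse_kitti_label_line` are hypothetical names chosen for illustration, not an official API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class KittiObject:
    type: str
    truncated: float
    occluded: int
    alpha: float
    bbox: List[float]        # left, top, right, bottom (pixels)
    dimensions: List[float]  # height, width, length (meters)
    location: List[float]    # x, y, z in camera coordinates (meters)
    rotation_y: float
    score: Optional[float] = None  # only present in result files


def parse_kitti_label_line(line: str) -> KittiObject:
    """Parse one whitespace-separated label line (15 fields, or 16 with score)."""
    fields = line.split()
    if len(fields) not in (15, 16):
        raise ValueError(f"expected 15 or 16 fields, got {len(fields)}")
    values = [float(v) for v in fields[1:]]
    return KittiObject(
        type=fields[0],
        truncated=values[0],
        occluded=int(values[1]),
        alpha=values[2],
        bbox=values[3:7],
        dimensions=values[7:10],
        location=values[10:13],
        rotation_y=values[13],
        score=values[14] if len(values) == 15 else None,
    )


if __name__ == "__main__":
    line = "Car 0.89 0 2.29 0.00 194.70 414.71 373.00 1.57 1.67 4.14 -2.75 1.70 4.10 1.72"
    print(parse_kitti_label_line(line))
```

DontCare lines parse the same way; their placeholder values (-1, -10, -1000) are kept as they are, so callers would typically filter on `type` before using the 3D fields.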
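- Camera parameters
Specifically, when considering the transformation from 3D camera coordinates to image coordinates, the projection matrix P of each camera (P0, P1, P2, P3) is as follows.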
P0:
7.070493000000e+02 0.000000000000e+00 6.040814000000e+02 0.000000000000e+00
0.000000000000e+00 7.070493000000e+02 1.805066000000e+02 0.000000000000e+00
0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 0.000000000000e+00
P1:
7.070493000000e+02 0.000000000000e+00 6.040814000000e+02 -3.797842000000e+02
0.000000000000e+00 7.070493000000e+02 1.805066000000e+02 0.000000000000e+00
0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 0.000000000000e+00
P2:
7.070493000000e+02 0.000000000000e+00 6.040814000000e+02 4.575831000000e+01
0.000000000000e+00 7.070493000000e+02 1.805066000000e+02 -3.454157000000e-01
0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 4.981016000000e-03
P3:
7.070493000000e+02 0.000000000000e+00 6.040814000000e+02 -3.341081000000e+02
0.000000000000e+00 7.070493000000e+02 1.805066000000e+02 2.330660000000e+00
0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 3.201153000000e-03
R0_rect:
9.999128000000e-01 1.009263000000e-02 -8.511932000000e-03
-1.012729000000e-02 9.999406000000e-01 -4.037671000000e-03
8.470675000000e-03 4.123522000000e-03 9.999556000000e-01
Tr_velo_to_cam:
6.927964000000e-03 -9.999722000000e-01 -2.757829000000e-03 -2.457729000000e-02
-1.162982000000e-03 2.749836000000e-03 -9.999955000000e-01 -6.127237000000e-02
9.999753000000e-01 6.931141000000e-03 -1.143899000000e-03 -3.321029000000e-01
Tr_imu_to_velo:
9.999976000000e-01 7.553071000000e-04 -2.035826000000e-03 -8.086759000000e-01
-7.854027000000e-04 9.998898000000e-01 -1.482298000000e-02 3.195559000000e-01
2.024406000000e-03 1.482454000000e-02 9.998881000000e-01 -7.997231000000e-01
Incidentally, the 3D box detection task reportedly uses the left color image, i.e. camera P2.
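To illustrate how such a 3x4 projection matrix maps camera coordinates to pixels, here is a minimal sketch that multiplies the homogeneous point by P2 and divides by the third component. The function name `project_to_image` is hypothetical. The example assumes that the calibration values and the label lines shown in this article belong to the same frame, which the article does not state.

```python
from typing import List, Sequence, Tuple

# P2 (left color camera), copied from the calibration values in this article.
P2: List[List[float]] = [
    [7.070493e+02, 0.0, 6.040814e+02, 4.575831e+01],
    [0.0, 7.070493e+02, 1.805066e+02, -3.454157e-01],
    [0.0, 0.0, 1.0, 4.981016e-03],
]


def project_to_image(P: Sequence[Sequence[float]],
                     point_xyz: Sequence[float]) -> Tuple[float, float]:
    """Project a 3D point in camera coordinates to pixel coordinates (u, v)."""
    x, y, z = point_xyz
    homogeneous = (x, y, z, 1.0)
    # Each row of P dotted with the homogeneous point gives (u*w, v*w, w).
    u_h, v_h, w = (sum(p * c for p, c in zip(row, homogeneous)) for row in P)
    if w <= 0.0:
        raise ValueError("point is behind the camera or on the image plane")
    return u_h / w, v_h / w


if __name__ == "__main__":
    # Location (x, y, z) of the Pedestrian label with bbox 672.23 171.73 690.13 224.33.
    u, v = project_to_image(P2, (2.46, 1.41, 24.14))
    print(u, v)
```

Under these assumptions the point projects to roughly (678, 222), close to the bottom edge of that label's 2D bounding box. That would be consistent with the location referring to the bottom center of the 3D box.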
- Arsalan Mousavian et al., 3D Bounding Box Estimation Using Deep Learning and Geometry, CVPR2017
"In many scenarios the objects can be assumed to be always upright. In this case, the 2D box top and bottom correspond only to the projection of vertices from the top and bottom of the 3D box, respectively, which reduces the number of correspondences to 1024"
"Furthermore, when the relative object roll is close to zero, the vertical 2D box side coordinates xmin and xmax can only correspond to projections of points from vertical 3D box sides. Similarly, ymin and ymax can only correspond to point projections from the horizontal 3D box sides. Consequently, each vertical side of the 2D detection box can correspond to [±dx/2, ., ±dz/2] and each horizontal side of the 2D bounding corresponds to [., ±dy/2, ±dz/2], yielding 4^4 = 256 possible configurations"
"In the KITTI dataset, object pitch and roll angles are both zero, which further reduces of the number of configurations to 64"
"Assuming that the origin of the object coordinate frame is at the center of the 3D bounding box and the object dimensions D are known"
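The idea behind these correspondences is that the projected 3D box should fit tightly inside the 2D box. The sketch below illustrates only that premise, not the paper's solver: it builds the eight corners [±dx/2, ±dy/2, ±dz/2] around the box center, rotates them about the Y-axis, projects them with P2, and takes the min/max. The function names are hypothetical. It also assumes that the KITTI location is the bottom center of the box, and that the calibration and label values in this article belong to the same frame.

```python
import math
from typing import List, Sequence, Tuple

# P2 (left color camera), copied from the calibration values in this article.
P2: List[List[float]] = [
    [7.070493e+02, 0.0, 6.040814e+02, 4.575831e+01],
    [0.0, 7.070493e+02, 1.805066e+02, -3.454157e-01],
    [0.0, 0.0, 1.0, 4.981016e-03],
]


def box_corners_camera(center: Sequence[float], dims_hwl: Sequence[float],
                       rotation_y: float) -> List[Tuple[float, float, float]]:
    """Return the 8 box corners in camera coordinates (origin at the box center)."""
    h, w, l = dims_hwl
    cx, cy, cz = center
    c, s = math.cos(rotation_y), math.sin(rotation_y)
    corners = []
    for sx in (-0.5, 0.5):
        for sy in (-0.5, 0.5):
            for sz in (-0.5, 0.5):
                ox, oy, oz = sx * l, sy * h, sz * w  # length along X, height along Y, width along Z
                # Rotation about the camera Y-axis, then translation to the center.
                corners.append((cx + c * ox + s * oz, cy + oy, cz - s * ox + c * oz))
    return corners


def projected_2d_box(P: Sequence[Sequence[float]],
                     corners: Sequence[Sequence[float]]) -> Tuple[float, float, float, float]:
    """Project all corners and return (xmin, ymin, xmax, ymax) in pixels."""
    us, vs = [], []
    for x, y, z in corners:
        u_h, v_h, d = (sum(a * b for a, b in zip(row, (x, y, z, 1.0))) for row in P)
        us.append(u_h / d)
        vs.append(v_h / d)
    return min(us), min(vs), max(us), max(vs)


if __name__ == "__main__":
    # Pedestrian label from this article: bbox 672.23 171.73 690.13 224.33.
    h, w, l = 1.73, 0.84, 0.86
    x, y, z = 2.46, 1.41, 24.14
    center = (x, y - h / 2.0, z)  # assumption: location is the bottom center
    print(projected_2d_box(P2, box_corners_camera(center, (h, w, l), -1.48)))
```

Under these assumptions the projected box lies within about ten pixels of the annotated 2D box on every side.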