Notes on KITTI - Hirotaka Hachiya's Memo Blog

KITTI main page

http://www.cvlibs.net/datasets/kitti/

Paper on the KITTI dataset

http://www.cvlibs.net/publications/Geiger2013IJRR.pdf

KITTI annotation format

https://github.com/NVIDIA/DIGITS/issues/866

An excerpt of the format description follows:

#Values  Name        Description
1        type        Describes the type of object: 'Car', 'Van', 'Truck',
                     'Pedestrian', 'Person_sitting', 'Cyclist', 'Tram',
                     'Misc' or 'DontCare'
1        truncated   Float from 0 (non-truncated) to 1 (truncated), where
                     truncated refers to the object leaving image boundaries
1        occluded    Integer (0,1,2,3) indicating occlusion state:
                     0 = fully visible, 1 = partly occluded
                     2 = largely occluded, 3 = unknown
1        alpha       Observation angle of object, ranging [-pi..pi]
4        bbox        2D bounding box of object in the image (0-based index):
                     contains left, top, right, bottom pixel coordinates
3        dimensions  3D object dimensions: height, width, length (in meters)
3        location    3D object location x,y,z in camera coordinates (in meters)
1        rotation_y  Rotation ry around Y-axis in camera coordinates [-pi..pi]
1        score       Only for results: Float, indicating confidence in
                     detection, needed for p/r curves, higher is better.

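In other words, each line is space-separated and lists the fields in the following order (0-based index):
0: object type (Car, Van, Truck, Pedestrian, Person_sitting, Cyclist, Tram, Misc, DontCare)
1: fraction of the object that extends beyond the image boundary (0 = entirely inside the image, 1 = entirely outside)
2: occlusion state (0: fully visible, 1: partly occluded, 2: largely occluded, 3: unknown)
3: observation angle alpha of the object as seen from the camera [-pi, pi]
4: minx (left) of the 2D bounding box
5: miny (top) of the 2D bounding box
6: maxx (right) of the 2D bounding box
7: maxy (bottom) of the 2D bounding box
8: height of the 3D object dimensions
9: width of the 3D object dimensions
10: length (depth) of the 3D object dimensions
11: x coordinate of the 3D object
12: y coordinate of the 3D object
13: z coordinate of the 3D object
14: object orientation rotation_y in camera coordinates [-pi..pi]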

The difference between fields 3 and 14 is described as follows:
"The difference between rotation_y and alpha is, that rotation_y is directly given in camera coordinates, while alpha also considers the vector from the camera center to the object center, to compute the relative orientation of the object with respect to the camera. For example, a car which is facing along the X-axis of the camera coordinate system corresponds to rotation_y=0, no matter where it is located in the X/Z plane (bird's eye view), while alpha is zero only, when this object is located along the Z-axis of the camera."
In other words, alpha (field 3) is the angle relative to the line drawn between the camera center and the object center, whereas rotation_y (field 14) is the angle around the Y-axis of the camera coordinate system.
Therefore, for example, when a car faces along the X-axis, rotation_y (field 14) is always zero, whereas alpha (field 3) is zero only when the car lies on the Z-axis.
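The relation between the two angles can be sketched in a few lines of Python. The excerpt above does not give this formula. The sketch assumes the commonly used approximation alpha = rotation_y - atan2(x, z), where x and z are the object location in the camera X/Z plane, and the function name `alpha_from_rotation_y` is hypothetical.

```python
import math


def alpha_from_rotation_y(rotation_y, x, z):
    """Approximate the observation angle alpha (field 3) from rotation_y
    (field 14) and the object location x, z (fields 11 and 13).

    Assumption (not stated in the KITTI excerpt): the viewing-ray angle
    atan2(x, z) is subtracted from rotation_y, then wrapped to [-pi, pi).
    """
    alpha = rotation_y - math.atan2(x, z)
    # Wrap the result into [-pi, pi).
    return (alpha + math.pi) % (2.0 * math.pi) - math.pi


# Object on the camera Z-axis (x = 0): alpha equals rotation_y.
on_axis = alpha_from_rotation_y(0.0, 0.0, 10.0)
# Same orientation, shifted sideways: alpha is no longer zero.
off_axis = alpha_from_rotation_y(0.0, 5.0, 10.0)
```

With the values of the first Pedestrian line in the 000015.txt example (rotation_y = 1.25, x = 4.75, z = 7.59), this approximation gives roughly 0.69, close to the annotated alpha of 0.71.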

The following is an example, 000015.txt:

Car 0.89 0 2.29 0.00 194.70 414.71 373.00 1.57 1.67 4.14 -2.75 1.70 4.10 1.72
Pedestrian 0.00 1 0.71 1021.76 133.28 1101.39 316.63 1.81 1.06 0.73 4.75 1.33 7.59 1.25
Pedestrian 0.00 0 -1.58 672.23 171.73 690.13 224.33 1.73 0.84 0.86 2.46 1.41 24.14 -1.48
Pedestrian 0.00 0 -1.59 692.68 169.14 712.48 224.03 1.81 0.90 0.95 3.30 1.40 24.22 -1.46
Pedestrian 0.00 0 -1.46 537.84 168.08 560.90 224.34 1.78 0.92 1.01 -1.79 1.36 23.30 -1.54
DontCare -1 -1 -10 916.14 175.59 973.32 253.55 -1 -1 -1 -1000 -1000 -1000 -10
DontCare -1 -1 -10 628.80 173.19 653.95 196.66 -1 -1 -1 -1000 -1000 -1000 -10
DontCare -1 -1 -10 610.47 175.76 625.72 205.87 -1 -1 -1 -1000 -1000 -1000 -10
DontCare -1 -1 -10 766.54 168.32 841.38 261.86 -1 -1 -1 -1000 -1000 -1000 -10
DontCare -1 -1 -10 464.22 177.67 508.93 230.69 -1 -1 -1 -1000 -1000 -1000 -10
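
As an illustration of the field order, a label line like the ones above could be parsed as follows. This is a minimal sketch: the class `KittiObject` and the function `parse_kitti_label_line` are hypothetical names, not part of any KITTI tool, and the optional score field (present only in result files) is ignored.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class KittiObject:
    type: str
    truncated: float
    occluded: int
    alpha: float
    bbox: Tuple[float, float, float, float]   # left, top, right, bottom (pixels)
    dimensions: Tuple[float, float, float]    # height, width, length (meters)
    location: Tuple[float, float, float]      # x, y, z in camera coordinates (meters)
    rotation_y: float


def parse_kitti_label_line(line):
    """Parse one space-separated KITTI label line (fields 0 to 14)."""
    f = line.split()
    if len(f) < 15:
        raise ValueError("expected at least 15 fields, got %d" % len(f))
    return KittiObject(
        type=f[0],
        truncated=float(f[1]),
        occluded=int(float(f[2])),
        alpha=float(f[3]),
        bbox=tuple(float(v) for v in f[4:8]),
        dimensions=tuple(float(v) for v in f[8:11]),
        location=tuple(float(v) for v in f[11:14]),
        rotation_y=float(f[14]),
    )


car = parse_kitti_label_line(
    "Car 0.89 0 2.29 0.00 194.70 414.71 373.00 1.57 1.67 4.14 -2.75 1.70 4.10 1.72"
)
dont_care = parse_kitti_label_line(
    "DontCare -1 -1 -10 916.14 175.59 973.32 253.55 -1 -1 -1 -1000 -1000 -1000 -10"
)
```

Note that DontCare lines carry placeholder values (-1, -10, -1000) in every field except the 2D bounding box, so only `bbox` is meaningful for them.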


Camera parameters

The camera parameters are explained on page 3 of the following paper.
http://www.cvlibs.net/publications/Geiger2013IJRR.pdf

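Specifically, consider the transform from 3D camera coordinates to image coordinates; the projection matrices P of the cameras (P0, P1, P2, P3) are listed below.
Note that this transform uses homogeneous coordinates (that is, x and y are divided by the last component, z).
For homogeneous coordinates, see
http://www.wakayama-u.ac.jp/~tokoi/lecture/gg/ggbook03.pdf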

[Figure: transform from 3D camera coordinates to image coordinates using the projection matrix P]

P0:
7.070493000000e+02 0.000000000000e+00 6.040814000000e+02 0.000000000000e+00
0.000000000000e+00 7.070493000000e+02 1.805066000000e+02 0.000000000000e+00
0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 0.000000000000e+00
P1:
7.070493000000e+02 0.000000000000e+00 6.040814000000e+02 -3.797842000000e+02
0.000000000000e+00 7.070493000000e+02 1.805066000000e+02 0.000000000000e+00
0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 0.000000000000e+00
P2:
7.070493000000e+02 0.000000000000e+00 6.040814000000e+02 4.575831000000e+01
0.000000000000e+00 7.070493000000e+02 1.805066000000e+02 -3.454157000000e-01
0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 4.981016000000e-03
P3:
7.070493000000e+02 0.000000000000e+00 6.040814000000e+02 -3.341081000000e+02
0.000000000000e+00 7.070493000000e+02 1.805066000000e+02 2.330660000000e+00
0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 3.201153000000e-03
R0_rect:
9.999128000000e-01 1.009263000000e-02 -8.511932000000e-03
1.012729000000e-02 9.999406000000e-01 -4.037671000000e-03
8.470675000000e-03 4.123522000000e-03 9.999556000000e-01
Tr_velo_to_cam:
6.927964000000e-03 -9.999722000000e-01 -2.757829000000e-03 -2.457729000000e-02
1.162982000000e-03 2.749836000000e-03 -9.999955000000e-01 -6.127237000000e-02
9.999753000000e-01 6.931141000000e-03 -1.143899000000e-03 -3.321029000000e-01
Tr_imu_to_velo:
9.999976000000e-01 7.553071000000e-04 -2.035826000000e-03 -8.086759000000e-01
7.854027000000e-04 9.998898000000e-01 -1.482298000000e-02 3.195559000000e-01
2.024406000000e-03 1.482454000000e-02 9.998881000000e-01 -7.997231000000e-01
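
The homogeneous projection described above can be sketched as follows, using the P2 values listed here. The function name `project_to_image` is hypothetical. The sketch assumes the input point is already expressed in the camera coordinates used by the label files, and that this calibration belongs to the same frame as the 000015.txt example, which the post does not state.

```python
# Projection matrix P2 (3x4), copied from the calibration values above.
P2 = [
    [7.070493e+02, 0.0, 6.040814e+02, 4.575831e+01],
    [0.0, 7.070493e+02, 1.805066e+02, -3.454157e-01],
    [0.0, 0.0, 1.0, 4.981016e-03],
]


def project_to_image(P, point_xyz):
    """Project a 3D point in camera coordinates to pixel coordinates (u, v).

    The point is extended to homogeneous form (x, y, z, 1), multiplied by
    the 3x4 matrix P, and the first two components are divided by the third.
    """
    x, y, z = point_xyz
    homogeneous = (x, y, z, 1.0)
    u_h, v_h, w_h = (sum(p * c for p, c in zip(row, homogeneous)) for row in P)
    if w_h <= 0.0:
        raise ValueError("point is behind the camera")
    return u_h / w_h, v_h / w_h


# Location (fields 11 to 13) of the first Pedestrian in the 000015.txt example.
u, v = project_to_image(P2, (4.75, 1.33, 7.59))
```

Under these assumptions the point projects to roughly (1052, 304), which lies horizontally inside that pedestrian's 2D box (left 1021.76, right 1101.39) and near its bottom edge (316.63).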

Incidentally, the 3D box detection task reportedly uses the left color image, that is, the camera with projection matrix P2.
http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d

Arsalan Mousavian et al., 3D Bounding Box Estimation Using Deep Learning and Geometry, CVPR2017

https://arxiv.org/pdf/1612.00496.pdf
https://cs.gmu.edu/~amousavi/papers/3D-Deepbox-Supplementary.pdf
This method regresses an object's 3D dimensions, its orientation, and the 2D BB from CNN features, using the difference between the GT and the respective mean as the regression target.
Each of the 4 sides of the 2D BB can correspond to any of the 8 vertices of the regressed 3D BB, which gives 8^4 = 4096 possible correspondences. By introducing various assumptions (such as those quoted below), the method reduces this to 64 configurations, solves for the translation, and thereby estimates the 3D position of the object.
"In many scenarios the objects can be assumed to be always upright. In this case, the 2D box top and bottom correspond only to the projection of vertices from the top and bottom of the 3D box, respectively, which reduces the number of correspondences to 1024"
"Furthermore, when the relative object roll is close to zero, the vertical 2D box side coordinates xmin and xmax can only correspond to projections of points from vertical 3D box sides. Similarly, ymin and ymax can only correspond to point projections from the horizontal 3D box sides. Consequently, each vertical side of the 2D detection box can correspond to [±dx/2, ., ±dz/2] and each horizontal side of the 2D bounding corresponds to [., ±dy/2, ±dz/2], yielding 4^4 = 256 possible configurations"
"In the KITTI dataset, object pitch and roll angles are both zero, which further reduces of the number of configurations to 64"
In addition, the origin of the world coordinate frame (the object coordinate frame in the paper's wording) is assumed to be at the center of the object's 3D bounding box.
"Assuming that the origin of the object coordinate frame is at the center of the 3D bounding box and the object dimensions D are known"
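
To make the geometry concrete, the following sketch builds the 8 corners [±dx/2, ±dy/2, ±dz/2] around the box center, rotates them about the Y-axis, translates them, projects them, and takes the tight 2D box around the result. It is only the forward direction of the constraint, not the paper's solver, which searches for the translation that makes this box match the detected 2D box. All function names are hypothetical, the intrinsics are taken from the left 3x3 block of P0, and the mapping dx = length, dy = height, dz = width is an assumption about the KITTI axis convention.

```python
import itertools
import math

# Intrinsics: left 3x3 block of the P0 matrix listed in the camera parameters.
K = [
    [707.0493, 0.0, 604.0814],
    [0.0, 707.0493, 180.5066],
    [0.0, 0.0, 1.0],
]


def box_corners(dims):
    """Return the 8 corners (+-dx/2, +-dy/2, +-dz/2) in the object frame,
    whose origin is assumed to be the center of the 3D box."""
    dx, dy, dz = dims
    return [
        (sx * dx / 2.0, sy * dy / 2.0, sz * dz / 2.0)
        for sx, sy, sz in itertools.product((-1.0, 1.0), repeat=3)
    ]


def rotate_about_y(point, theta):
    """Rotate a point by theta around the camera Y-axis."""
    x, y, z = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x + s * z, y, -s * x + c * z)


def project(K, point):
    """Pinhole projection of a camera-frame point with z > 0."""
    x, y, z = point
    return (K[0][0] * x / z + K[0][2], K[1][1] * y / z + K[1][2])


def tight_2d_box(dims, theta, center, K):
    """Project all 8 corners and return (xmin, ymin, xmax, ymax)."""
    pixels = []
    for corner in box_corners(dims):
        rx, ry, rz = rotate_about_y(corner, theta)
        pixels.append(project(K, (rx + center[0], ry + center[1], rz + center[2])))
    us = [u for u, _ in pixels]
    vs = [v for _, v in pixels]
    return min(us), min(vs), max(us), max(vs)


# Illustrative values based on the first Pedestrian in the 000015.txt example.
h, w, l = 1.81, 1.06, 0.73
dims = (l, h, w)                       # assumed mapping: dx=length, dy=height, dz=width
center = (4.75, 1.33 - h / 2.0, 7.59)  # assumes the label location is the bottom center
box = tight_2d_box(dims, 1.25, center, K)

# Without any assumption, each of the 4 box sides may touch any of the 8 corners.
unconstrained = 8 ** 4  # 4096
```

The projected box center always falls inside the resulting 2D box, because the center is a convex combination of the corners and all corners lie in front of the camera.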