Real-time 3D Pose Estimation with a Monocular Camera

Real-time 3D Pose Estimation with a Monocular Camera Using Deep Learning and Object Priors On an Autonomous Racecar

Background

When a 3D object is projected onto an image plane, one dimension is lost: the object's distance is unknown. However, with prior information about the 3D object, its distance can be recovered.
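As a rough illustration of why a size prior resolves the lost depth dimension, here is a minimal pinhole-model sketch; the focal length and cone height below are made-up illustrative values, not numbers from the paper:

```python
# Minimal pinhole-model sketch: recovering depth from a known object height.
# The focal length and cone height are illustrative, not taken from the paper.

def depth_from_known_height(focal_px: float, real_height_m: float, pixel_height: float) -> float:
    """Depth Z follows from similar triangles: h_px = f * H / Z  =>  Z = f * H / h_px."""
    return focal_px * real_height_m / pixel_height

# Example: a 0.325 m tall cone that appears 40 px tall under a 1400 px focal length
# is roughly 11.4 m away.
print(depth_from_known_height(focal_px=1400.0, real_height_m=0.325, pixel_height=40.0))
```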

To this end, we propose a low-latency, real-time pipeline to detect and estimate the 3D position of multiple objects of interest using just a single measurement, i.e., a single image, without the need for any special external markers.

We propose a novel “keypoint regression” scheme that exploits prior information about the object’s shape and size to regress and find specific feature points on the image.

We propose a complete pipeline that performs object detection and simultaneously estimates the pose of multiple objects from just a single image by exploiting object priors.

As per the rules of the competition, the track is marked by cones. The left and right track limits are marked by blue and yellow traffic cones, respectively.

A novel feature regression scheme, “keypoint regression”, is introduced and used to match 2D-3D correspondences.

This section shifts the focus to how to estimate the 3D position of multiple objects from a single image. Although this is an ill-posed problem, with a priori information in the form of the shape, size and geometry of the object of interest it is solvable, as elaborated in this chapter.

Advantages of using ROS

1. ROS communicates via nodes and provides message types for various sensors and for navigation
2. ROS is open source and comes with a suite of visualization and simulation tools

The pipeline’s sub-modules are run as nodes using Robot Operating System or ROS [5] as the framework that eases handling of communication and data messages across multiple systems as well as different nodes. Different sub-modules communicate via messages: they receive data and output processed information. Another important aspect is that ROS is open-source and provides tools for visualization, monitoring and simulation, making it easy to integrate, test, diagnose and develop the complete software system.
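For readers unfamiliar with ROS, a minimal node sketch shows how such a sub-module could be wired up; the topic names and the use of PoseArray are assumptions for illustration, not the project's actual interfaces:

```python
#!/usr/bin/env python
# Hypothetical minimal ROS node sketch. Topic names and message types are
# illustrative assumptions, not the project's actual interfaces.
import rospy
from sensor_msgs.msg import Image
from geometry_msgs.msg import PoseArray

class ConeDetectorNode(object):
    def __init__(self):
        self.pub = rospy.Publisher("/perception/cone_poses", PoseArray, queue_size=1)
        rospy.Subscriber("/camera/image_raw", Image, self.on_image, queue_size=1)

    def on_image(self, msg):
        poses = PoseArray()
        poses.header = msg.header  # keep the timestamp/frame of the source image
        # ... run detection + keypoint regression + PnP here, fill poses.poses ...
        self.pub.publish(poses)

if __name__ == "__main__":
    rospy.init_node("cone_detector")
    ConeDetectorNode()
    rospy.spin()
```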

Visual perception system (two parts)

  1. Stereo
  2. Monocular
    The stereo pipeline uses the sub-modules explained in this section as an extremely efficient way of triangulating and estimating depth from binocular vision. This methodology drastically reduces the search space and cleverly tackles the issue of having numerous and often incorrect feature matches.

Monocular pipeline

The monocular pipeline has 3 crucial sub-modules which enable it to detect multiple objects of interest and accurately estimate their 3D position up to a distance of 15 meters by making use of a single measurement in the form of an image captured by the monocular camera.

Three sub-modules

The monocular pipeline can be broken down into three parts. (1) Multiple object detection, (2) Keypoint regression and (3) 2D-3D correspondence followed by 3D pose estimation from a single image

4.2 Multiple Object Detection

Object recognition has 4 main categories of tasks:
(1) classification, (2) classification and localization, (3) object detection and (4) instance segmentation.

Instead of using slow and computationally intensive cascade and sliding window approaches, we employ a quick, real-time and powerful object detector in our pipeline in the form of YOLOv2.

4.2.1 Importance of color information

The path planning then has a cost function with a penalization term for potential paths that drive the car through same-colored cones.
How is the cone color obtained?
We design the detector such that the cone color information can be directly obtained from it. In other words, we treat each colored cone as a different class for the object detector.

4.2.2 Customizing YOLOv2 for Formula Student Driverless

Controlling the thresholds
We choose YOLOv2 for the purpose of detecting different colored cones. Thresholds for it are chosen such that false positives, incorrect detections and misclassifications are avoided at any cost, even if that translates into not being able to detect all cones in a given image.
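As a rough sketch of such conservative thresholding (class names and threshold values here are made up for illustration, not the project's settings):

```python
# Illustrative post-detection filtering: detections below a high per-class
# confidence threshold are discarded, preferring missed cones over false positives.
CONF_THRESHOLDS = {"blue_cone": 0.8, "yellow_cone": 0.8, "orange_cone": 0.85}

def filter_detections(detections):
    """detections: list of (class_name, confidence, bounding_box) tuples."""
    return [d for d in detections if d[1] >= CONF_THRESHOLDS.get(d[0], 1.0)]
```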

Not entirely clear to me, but the idea seems to be recomputing the anchor box shapes so that they better fit the cone annotations.
Since the annotations for cones are long and thin rectangular bounding boxes, we exploit such prior information by re-calculating the anchor boxes used by YOLOv2. This is done by performing k-means clustering on the aspect-ratio of the rectangle annotations in the dataset and improves the object detector’s performance.
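A minimal sketch of what recomputing anchors from the annotations could look like; the paper's exact clustering metric is not given here, so plain k-means on normalized widths and heights is only an approximation:

```python
# Sketch of re-computing YOLOv2 anchor shapes from dataset annotations.
# Plain k-means on (width, height) is an illustrative approximation of the
# aspect-ratio clustering described in the text.
import numpy as np
from sklearn.cluster import KMeans

def compute_anchor_boxes(boxes_wh: np.ndarray, n_anchors: int = 5) -> np.ndarray:
    """boxes_wh: (N, 2) annotation widths/heights, normalized to the image size."""
    km = KMeans(n_clusters=n_anchors, n_init=10, random_state=0).fit(boxes_wh)
    return km.cluster_centers_  # (n_anchors, 2) prototype widths/heights

# Tall, thin cone annotations yield tall, thin anchors.
example = np.array([[0.03, 0.08], [0.04, 0.11], [0.02, 0.06], [0.05, 0.13], [0.03, 0.09]])
print(compute_anchor_boxes(example, n_anchors=2))
```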

The detector needs to distinguish and detect ‘yellow’, ‘blue’ and ‘orange’ cones that provide information about the track.

4.2.3 Training to detect cones

4.3 Keypoint Regression

How is the geometry in the prior information known?
However, since there is prior information about the 3D shape, size and geometry of the cone, one can hope to recover 3D pose from a single measurement.

4.3.1 From patches to features: the need for “keypoint regression”

Using an object detector, cones can be detected in an image. However, one needs more information to go from detections on the image to 3D positions. We exploit a priori knowledge about the cone and a calibrated camera to help estimate its depth via 2D-3D correspondences

When the image resolution is low, or in similar situations, not enough 3D information can be extracted.
To this end, we introduce a feature extraction scheme that is inspired by classical computer vision but has a flavor of learning from data via machine learning.

4.3.2 Design and architecture of the “keypoint regressor”

Convolutional neural network
The primary difference between this scheme and other feature extraction processes is that it targets very specific points, in contrast to commonly used generic techniques.

In our case, we want to find the positions of very specific points on the image that correspond to 3D counterparts whose locations can be measured in 3D from an arbitrary world frame F_w.
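As a rough idea of what a “keypoint regressor” could look like, here is a hypothetical PyTorch sketch mapping a fixed-size cone patch to 7 (x, y) locations; the layer sizes are illustrative and do not reproduce the paper's architecture:

```python
# Hypothetical keypoint-regressor sketch: a small CNN that maps a fixed-size
# cone patch to 7 (x, y) keypoint locations. Layer sizes are illustrative only.
import torch
import torch.nn as nn

class KeypointRegressor(nn.Module):
    def __init__(self, num_keypoints: int = 7):
        super().__init__()
        self.num_keypoints = num_keypoints
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # an 80x80 patch becomes a 10x10 feature map after three stride-2 convolutions
        self.head = nn.Linear(64 * 10 * 10, num_keypoints * 2)

    def forward(self, patch):                       # patch: (B, 3, 80, 80)
        x = self.features(patch).flatten(1)
        return self.head(x).view(-1, self.num_keypoints, 2)  # (B, 7, 2) pixel coords

print(KeypointRegressor()(torch.zeros(1, 3, 80, 80)).shape)  # torch.Size([1, 7, 2])
```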

4.3.3 Loss function 損失函數

The “keypoint network” also exploits a priori information about the object’s 3D geometry and appearance through the loss function. It uses the concept of the cross-ratio.
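For intuition: the cross-ratio of four collinear points is invariant under perspective projection, so the value computed from four collinear image keypoints should match the value computed from their known 3D positions on the cone. A minimal sketch of such a consistency term follows (how the paper actually weights it in its loss is not shown here):

```python
# Sketch of a cross-ratio consistency term. For collinear points A, B, C, D the
# cross-ratio (AC/BC) / (AD/BD) is preserved by perspective projection, so the
# 2D value should match the value from the known 3D keypoint layout on the cone.
import numpy as np

def cross_ratio(a, b, c, d):
    a, b, c, d = (np.asarray(p, dtype=float) for p in (a, b, c, d))
    ac, bc = np.linalg.norm(c - a), np.linalg.norm(c - b)
    ad, bd = np.linalg.norm(d - a), np.linalg.norm(d - b)
    return (ac / bc) / (ad / bd)

def cross_ratio_penalty(pred_2d_pts, target_cross_ratio):
    """Squared deviation of the predicted keypoints' cross-ratio from the 3D value."""
    return (cross_ratio(*pred_2d_pts) - target_cross_ratio) ** 2
```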

4.3.4 Training scheme

There are 7 keypoints on a cone; a convolutional neural network is trained on samples until it can detect these 7 points. I do not fully understand the loss function definition and training scheme here.
[Figure: keypoint regression results on cone image patches]
I do not fully understand the intermediate steps, but the result is as shown in the figure above: even when a sample is blurry or partially occluded, deep learning can still detect the exact locations of the 7 keypoints.

4.4 2D-3D Correspondences and 3D Pose Estimation

The “keypoint network” provides accurate locations of very specific features, the keypoints. Since there is a priori information available about the shape, size, appearance and 3D geometry of the object (the cone, in this case), 2D-3D correspondences can be matched. With access to a calibrated camera and 2D-3D correspondences, it is possible to estimate the pose of the object in question from a single image.

We use Perspective-n-Point, or PnP, to estimate the pose of every detected cone. (That is, it solves for the transformation between the world coordinate frame and the camera coordinate frame.)
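A minimal OpenCV sketch of this step; the cone-model keypoints and intrinsics below are made up, and a synthetic projection stands in for the regressed keypoints so the example is self-contained:

```python
# Minimal PnP sketch: given the 7 keypoints on the 3D cone model, their 2D
# locations in the image, and calibrated intrinsics, solvePnP returns the
# cone's pose in the camera frame. All numbers below are illustrative.
import cv2
import numpy as np

# Rough keypoint layout on a cone silhouette (object frame, meters, y pointing down).
object_points = np.array([
    [0.00, -0.325, 0.0], [-0.05, -0.20, 0.0], [0.05, -0.20, 0.0],
    [-0.08, -0.10, 0.0], [0.08, -0.10, 0.0], [-0.11, 0.00, 0.0], [0.11, 0.00, 0.0],
], dtype=np.float64)

K = np.array([[1400.0, 0.0, 640.0], [0.0, 1400.0, 360.0], [0.0, 0.0, 1.0]])
dist = np.zeros(5)

# Stand-in for the regressed keypoints: a synthetic projection of a cone 8 m ahead.
true_rvec, true_tvec = np.zeros(3), np.array([0.0, 0.0, 8.0])
image_points, _ = cv2.projectPoints(object_points, true_rvec, true_tvec, K, dist)

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist,
                              flags=cv2.SOLVEPNP_ITERATIVE)
print(ok, tvec.ravel())  # tvec is the cone position in the camera frame, ~ [0, 0, 8]
```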


6.1 Improvements

  1. Adjust YOLOv2's detection thresholds, weighing the trade-offs
  2. The field of view is limited; the camera orientation could be changed
  3. The biggest problem: data latency and processing speed, which could be addressed with better hardware

6.2 Using the “keypoint regression” for efficient stereo triangulation
Using keypoint regression with two monocular cameras
we use the “keypoint regression” and PnP on a single image from the left camera to acquire 3D position of detected cones. This 3D position is further improved via additional information in the form of a second image of the same scene (captured at the same time instance) from the right camera. The position accuracy is improved by performing triangulation.
This presumably means narrowing the detection range: the position estimated from the left camera narrows down the search region in the right camera.
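A minimal sketch of the triangulation step under that reading, using placeholder projection matrices for a calibrated, rectified stereo pair:

```python
# Sketch of the refinement idea: keypoints matched between the left and right
# images are triangulated for a more accurate 3D position. P_left / P_right are
# the 3x4 projection matrices of a rectified stereo pair; all values are placeholders.
import cv2
import numpy as np

def triangulate_keypoints(P_left, P_right, kps_left, kps_right):
    """kps_*: (N, 2) matched keypoints; returns (N, 3) points in the left camera frame."""
    pts4d = cv2.triangulatePoints(P_left, P_right, kps_left.T.astype(np.float64),
                                  kps_right.T.astype(np.float64))
    return (pts4d[:3] / pts4d[3]).T  # de-homogenize

# Example with a rectified pair (baseline 0.2 m along x) and a point 8 m ahead.
K = np.array([[1400.0, 0.0, 640.0], [0.0, 1400.0, 360.0], [0.0, 0.0, 1.0]])
P_left = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P_right = K @ np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])
pt = np.array([0.5, -0.3, 8.0])
uv_l = (P_left @ np.append(pt, 1.0))[:2] / 8.0
uv_r = (P_right @ np.append(pt, 1.0))[:2] / 8.0
print(triangulate_keypoints(P_left, P_right, uv_l[None], uv_r[None]))  # ~ [[0.5, -0.3, 8.0]]
```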
