Real-time 3D Pose Estimation with a Monocular Camera

Real-time 3D Pose Estimation with a Monocular Camera Using Deep Learning and Object Priors On an Autonomous Racecar

Background

When a 3D object is projected onto an image plane, one dimension is lost: the object's distance is unknown. However, with prior information about the 3D object, its distance can be recovered.
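As a rough illustration of why a size prior resolves the lost depth dimension, here is a minimal pinhole-model sketch; the focal length and cone height below are made-up illustrative values, not numbers from the paper:

```python
# Minimal pinhole-model sketch: recovering depth from a known object height.
# The focal length and cone height are illustrative, not taken from the paper.

def depth_from_known_height(focal_px: float, real_height_m: float, pixel_height: float) -> float:
    """Depth Z follows from similar triangles: h_px = f * H / Z  =>  Z = f * H / h_px."""
    return focal_px * real_height_m / pixel_height

# Example: a 0.325 m tall cone that appears 40 px tall under a 1400 px focal length
# is roughly 11.4 m away.
print(depth_from_known_height(focal_px=1400.0, real_height_m=0.325, pixel_height=40.0))
```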

To this end, we propose a low-latency, real-time pipeline to detect and estimate the 3D position of multiple objects of interest using just a single measurement, i.e., a single image, without the need for any special external markers.

We propose a novel “keypoint regression” scheme that exploits prior information about the object’s shape and size to regress and find specific feature points on the image.

We propose a complete pipeline that performs object detection and simultaneously estimates the pose of multiple objects from just a single image by exploiting object priors.

As per the rules of the competition, the track is marked by cones. The left and right track limits are marked by blue and yellow traffic cones, respectively.

A novel feature regression scheme, “keypoint regression”, is introduced and used to match 2D-3D correspondences.

This section shifts the focus to how to estimate the 3D position of multiple objects from a single image. Although this is an ill-posed problem, with a priori information in the form of the shape, size and geometry of the object of interest it is solvable, as elaborated in this chapter.

Advantages of using ROS

1. ROS communicates via nodes and provides message types for various sensors and for navigation
2. ROS is open source and comes with a suite of visualization and simulation tools

The pipeline’s sub-modules are run as nodes using Robot Operating System or ROS [5] as the framework that eases handling of communication and data messages across multiple systems as well as different nodes. Different sub-modules communicate via messages: they receive data and output processed information. Another important aspect is that ROS is open-source and provides tools for visualization, monitoring and simulation, making it easy to integrate, test, diagnose and develop the complete software system.
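For readers unfamiliar with ROS, a minimal node sketch shows how such a sub-module could be wired up; the topic names and the use of PoseArray are assumptions for illustration, not the project's actual interfaces:

```python
#!/usr/bin/env python
# Hypothetical minimal ROS node sketch. Topic names and message types are
# illustrative assumptions, not the project's actual interfaces.
import rospy
from sensor_msgs.msg import Image
from geometry_msgs.msg import PoseArray

class ConeDetectorNode(object):
    def __init__(self):
        self.pub = rospy.Publisher("/perception/cone_poses", PoseArray, queue_size=1)
        rospy.Subscriber("/camera/image_raw", Image, self.on_image, queue_size=1)

    def on_image(self, msg):
        poses = PoseArray()
        poses.header = msg.header  # keep the timestamp/frame of the source image
        # ... run detection + keypoint regression + PnP here, fill poses.poses ...
        self.pub.publish(poses)

if __name__ == "__main__":
    rospy.init_node("cone_detector")
    ConeDetectorNode()
    rospy.spin()
```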

Visual perception system (two parts)

  1. Stereo
  2. Monocular
    The stereo pipeline uses the sub-modules explained in this section as an extremely efficient way of triangulating and estimating depth from binocular vision. This methodology drastically reduces the search space and cleverly tackles the issue of having numerous and often incorrect feature matches.

Monocular pipeline

The monocular pipeline has 3 crucial sub-modules which enable it to detect multiple objects of interest and accurately estimate their 3D position up to a distance of 15 meters by making use of a single measurement in the form of an image captured by the monocular camera.

Three sub-modules

The monocular pipeline can be broken down into three parts. (1) Multiple object detection, (2) Keypoint regression and (3) 2D-3D correspondence followed by 3D pose estimation from a single image

4.2 Multiple Object Detection

Object recognition has 4 main categories of tasks:
(1) classification, (2) classification and localization, (3) object detection and (4) instance segmentation.

Instead of using slow and computationally intensive cascade and sliding window approaches, we employ a quick, real-time and powerful object detector in our pipeline in the form of YOLOv2.

4.2.1 Importance of color information

The path planning then has a cost function with a penalization term for potential paths that drive the car through same-colored cones.
How is the cone color obtained?
We design the detector such that the cone color information can be directly obtained from it. In other words, we treat each colored cone as a different class for the object detector.

4.2.2 Customizing YOLOv2 for Formula Student Driverless

Controlling the thresholds
We choose YOLOv2 for the purpose of detecting different colored cones. Thresholds for it are chosen such that false positives, incorrect detections and misclassifications are avoided at any cost, even if that translates into not being able to detect all cones in a given image.
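As a rough sketch of such conservative thresholding (class names and threshold values here are made up for illustration, not the project's settings):

```python
# Illustrative post-detection filtering: detections below a high per-class
# confidence threshold are discarded, preferring missed cones over false positives.
CONF_THRESHOLDS = {"blue_cone": 0.8, "yellow_cone": 0.8, "orange_cone": 0.85}

def filter_detections(detections):
    """detections: list of (class_name, confidence, bounding_box) tuples."""
    return [d for d in detections if d[1] >= CONF_THRESHOLDS.get(d[0], 1.0)]
```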

Not entirely clear to me, but the idea seems to be recomputing the anchor box shapes so that they better fit the cone annotations.
Since the annotations for cones are long and thin rectangular bounding boxes, we exploit such prior information by re-calculating the anchor boxes used by YOLOv2. This is done by performing k-means clustering on the aspect-ratio of the rectangle annotations in the dataset and improves the object detector’s performance.
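A minimal sketch of what recomputing anchors from the annotations could look like; the paper's exact clustering metric is not given here, so plain k-means on normalized widths and heights is only an approximation:

```python
# Sketch of re-computing YOLOv2 anchor shapes from dataset annotations.
# Plain k-means on (width, height) is an illustrative approximation of the
# aspect-ratio clustering described in the text.
import numpy as np
from sklearn.cluster import KMeans

def compute_anchor_boxes(boxes_wh: np.ndarray, n_anchors: int = 5) -> np.ndarray:
    """boxes_wh: (N, 2) annotation widths/heights, normalized to the image size."""
    km = KMeans(n_clusters=n_anchors, n_init=10, random_state=0).fit(boxes_wh)
    return km.cluster_centers_  # (n_anchors, 2) prototype widths/heights

# Tall, thin cone annotations yield tall, thin anchors.
example = np.array([[0.03, 0.08], [0.04, 0.11], [0.02, 0.06], [0.05, 0.13], [0.03, 0.09]])
print(compute_anchor_boxes(example, n_anchors=2))
```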

The detector needs to distinguish and detect ‘yellow’, ‘blue’ and ‘orange’ cones that provide information about the track.

4.2.3 Training to detect cones

4.3 Keypoint Regression

How is the geometry in the prior information known?
However, since there is prior information about the 3D shape, size and geometry of the cone, one can hope to recover 3D pose from a single measurement.

4.3.1 From patches to features: the need for “keypoint regression”

Using an object detector, cones can be detected in an image. However, one needs more information to go from detections on the image to 3D positions. We exploit a priori knowledge about the cone and a calibrated camera to help estimate its depth via 2D-3D correspondences

When the image resolution is low, or in similar situations, not enough 3D information can be extracted.
To this end, we introduce a feature extraction scheme that is inspired by classical computer vision but has a flavor of learning from data via machine learning.

4.3.2 Design and architecture of the “keypoint regressor”

Convolutional neural network
The primary difference between this scheme and other feature extraction processes is that it targets very specific points, in contrast to commonly used generic techniques.

In our case, we want to find the positions of very specific points on the image that correspond to 3D counterparts whose locations can be measured in 3D from an arbitrary world frame F_w.
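As a rough idea of what a “keypoint regressor” could look like, here is a hypothetical PyTorch sketch mapping a fixed-size cone patch to 7 (x, y) locations; the layer sizes are illustrative and do not reproduce the paper's architecture:

```python
# Hypothetical keypoint-regressor sketch: a small CNN that maps a fixed-size
# cone patch to 7 (x, y) keypoint locations. Layer sizes are illustrative only.
import torch
import torch.nn as nn

class KeypointRegressor(nn.Module):
    def __init__(self, num_keypoints: int = 7):
        super().__init__()
        self.num_keypoints = num_keypoints
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # an 80x80 patch becomes a 10x10 feature map after three stride-2 convolutions
        self.head = nn.Linear(64 * 10 * 10, num_keypoints * 2)

    def forward(self, patch):                       # patch: (B, 3, 80, 80)
        x = self.features(patch).flatten(1)
        return self.head(x).view(-1, self.num_keypoints, 2)  # (B, 7, 2) pixel coords

print(KeypointRegressor()(torch.zeros(1, 3, 80, 80)).shape)  # torch.Size([1, 7, 2])
```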

4.3.3 Loss function 損失函數

The “keypoint network” also exploits a priori information about the object’s 3D geometry and appearance through the loss function. It uses the concept of the cross-ratio.
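For intuition: the cross-ratio of four collinear points is invariant under perspective projection, so the value computed from four collinear image keypoints should match the value computed from their known 3D positions on the cone. A minimal sketch of such a consistency term follows (how the paper actually weights it in its loss is not shown here):

```python
# Sketch of a cross-ratio consistency term. For collinear points A, B, C, D the
# cross-ratio (AC/BC) / (AD/BD) is preserved by perspective projection, so the
# 2D value should match the value from the known 3D keypoint layout on the cone.
import numpy as np

def cross_ratio(a, b, c, d):
    a, b, c, d = (np.asarray(p, dtype=float) for p in (a, b, c, d))
    ac, bc = np.linalg.norm(c - a), np.linalg.norm(c - b)
    ad, bd = np.linalg.norm(d - a), np.linalg.norm(d - b)
    return (ac / bc) / (ad / bd)

def cross_ratio_penalty(pred_2d_pts, target_cross_ratio):
    """Squared deviation of the predicted keypoints' cross-ratio from the 3D value."""
    return (cross_ratio(*pred_2d_pts) - target_cross_ratio) ** 2
```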

4.3.4 Training scheme

There are 7 keypoints on a cone; a convolutional neural network is trained on samples until it can detect these 7 points. I do not fully understand the loss function definition and training scheme here.
[Figure: keypoint regression results on cone image patches]
I do not fully understand the intermediate steps, but the result is as shown in the figure above: even when a sample is blurry or partially occluded, deep learning can still detect the exact locations of the 7 keypoints.

4.4 2D-3D Correspondences and 3D Pose Estimation

The “keypoint network” provides accurate locations of very specific features, the keypoints. Since there is a priori information available about the shape, size, appearance and 3D geometry of the object (the cone, in this case), 2D-3D correspondences can be matched. With access to a calibrated camera and 2D-3D correspondences, it is possible to estimate the pose of the object in question from a single image.

We use Perspective-n-Point, or PnP, to estimate the pose of every detected cone. (That is, it solves for the transformation between the world coordinate frame and the camera coordinate frame.)
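A minimal OpenCV sketch of this step; the cone-model keypoints and intrinsics below are made up, and a synthetic projection stands in for the regressed keypoints so the example is self-contained:

```python
# Minimal PnP sketch: given the 7 keypoints on the 3D cone model, their 2D
# locations in the image, and calibrated intrinsics, solvePnP returns the
# cone's pose in the camera frame. All numbers below are illustrative.
import cv2
import numpy as np

# Rough keypoint layout on a cone silhouette (object frame, meters, y pointing down).
object_points = np.array([
    [0.00, -0.325, 0.0], [-0.05, -0.20, 0.0], [0.05, -0.20, 0.0],
    [-0.08, -0.10, 0.0], [0.08, -0.10, 0.0], [-0.11, 0.00, 0.0], [0.11, 0.00, 0.0],
], dtype=np.float64)

K = np.array([[1400.0, 0.0, 640.0], [0.0, 1400.0, 360.0], [0.0, 0.0, 1.0]])
dist = np.zeros(5)

# Stand-in for the regressed keypoints: a synthetic projection of a cone 8 m ahead.
true_rvec, true_tvec = np.zeros(3), np.array([0.0, 0.0, 8.0])
image_points, _ = cv2.projectPoints(object_points, true_rvec, true_tvec, K, dist)

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist,
                              flags=cv2.SOLVEPNP_ITERATIVE)
print(ok, tvec.ravel())  # tvec is the cone position in the camera frame, ~ [0, 0, 8]
```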


6.1 Improvements

  1. Adjust YOLOv2's detection thresholds, weighing the trade-offs
  2. The field of view is limited; the camera orientation could be changed
  3. The biggest problem: data latency and processing speed, which could be addressed with better hardware

6.2 Using the “keypoint regression” for efficient stereo triangulation
Using keypoint regression with two monocular cameras
we use the “keypoint regression” and PnP on a single image from the left camera to acquire 3D position of detected cones. This 3D position is further improved via additional information in the form of a second image of the same scene (captured at the same time instance) from the right camera. The position accuracy is improved by performing triangulation.
This presumably means narrowing the detection range: the position estimated from the left camera narrows down the search region in the right camera.
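A minimal sketch of the triangulation step under that reading, using placeholder projection matrices for a calibrated, rectified stereo pair:

```python
# Sketch of the refinement idea: keypoints matched between the left and right
# images are triangulated for a more accurate 3D position. P_left / P_right are
# the 3x4 projection matrices of a rectified stereo pair; all values are placeholders.
import cv2
import numpy as np

def triangulate_keypoints(P_left, P_right, kps_left, kps_right):
    """kps_*: (N, 2) matched keypoints; returns (N, 3) points in the left camera frame."""
    pts4d = cv2.triangulatePoints(P_left, P_right, kps_left.T.astype(np.float64),
                                  kps_right.T.astype(np.float64))
    return (pts4d[:3] / pts4d[3]).T  # de-homogenize

# Example with a rectified pair (baseline 0.2 m along x) and a point 8 m ahead.
K = np.array([[1400.0, 0.0, 640.0], [0.0, 1400.0, 360.0], [0.0, 0.0, 1.0]])
P_left = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P_right = K @ np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])
pt = np.array([0.5, -0.3, 8.0])
uv_l = (P_left @ np.append(pt, 1.0))[:2] / 8.0
uv_r = (P_right @ np.append(pt, 1.0))[:2] / 8.0
print(triangulate_keypoints(P_left, P_right, uv_l[None], uv_r[None]))  # ~ [[0.5, -0.3, 8.0]]
```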
