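Study Notes | PyTorch Tutorial 34
These study notes are mainly excerpted from the "深度之眼" course and are summarized here for easy reference.
PyTorch version used: 1.2
- What is image object detection?
- How does a model perform object detection?
- A brief introduction to deep learning object detection models
- Training Faster RCNN in PyTorch
4. Training Faster RCNN in PyTorch
- 1.**torchvision.models.detection.fasterrcnn_resnet50_fpn()** returns a FasterRCNN instance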
- 2.class FasterRCNN(GeneralizedRCNN)
- 3.class GeneralizedRCNN(nn.Module)
forward():
- features = self.backbone(images.tensors)
- proposals, proposal_losses = self.rpn(images, features, targets)
- detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
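Next, in the code from Study Notes | PyTorch Tutorial 33 (A Glimpse of Image Object Detection, Part 1), set a breakpoint at output_list = model(input_list) and step into it.
Set a breakpoint at that location and step in again. This leads to the base class of Faster RCNN.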
class GeneralizedRCNN(nn.Module):
    """
    Main class for Generalized R-CNN.
    Arguments:
        backbone (nn.Module):
        rpn (nn.Module):
        heads (nn.Module): takes the features + the proposals from the RPN and computes
            detections / masks from it.
        transform (nn.Module): performs the data transformation from the inputs to feed into
            the model
    """
    def __init__(self, backbone, rpn, roi_heads, transform):
        super(GeneralizedRCNN, self).__init__()
        self.transform = transform
        self.backbone = backbone
        self.rpn = rpn
        self.roi_heads = roi_heads
    def forward(self, images, targets=None):
        """
        Arguments:
            images (list[Tensor]): images to be processed
            targets (list[Dict[Tensor]]): ground-truth boxes present in the image (optional)
        Returns:
            result (list[BoxList] or dict[Tensor]): the output from the model.
                During training, it returns a dict[Tensor] which contains the losses.
                During testing, it returns list[BoxList] contains additional fields
                like `scores`, `labels` and `mask` (for Mask R-CNN models).
        """
        if self.training and targets is None:
            raise ValueError("In training mode, targets should be passed")
        original_image_sizes = [img.shape[-2:] for img in images]
        images, targets = self.transform(images, targets)
        features = self.backbone(images.tensors)
        if isinstance(features, torch.Tensor):
            features = OrderedDict([(0, features)])
        proposals, proposal_losses = self.rpn(images, features, targets)
        detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
        detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)
        losses = {}
        losses.update(detector_losses)
        losses.update(proposal_losses)
        if self.training:
            return losses
        return detections
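Set a breakpoint at features = self.backbone(images.tensors) to see how the backbone extracts features, and step into it. Likewise, inside __call__(self, *input, **kwargs), set a breakpoint at result = self.forward(*input, **kwargs) and step in again.
The module turns out to be already wrapped: it was defined at initialization time.
So go to where the model is defined, model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True), and use Go to definition.
Then use Go to definition once more.
Here backbone_name is 'resnet50'. In other words, resnet50 is used for feature extraction.
Now inspect features directly.
It holds 5 key-value pairs. Judging from the model structure printed later in these notes (four FPN inner_blocks plus LastLevelMaxPool), these are the four FPN levels built on the ResNet stages plus one extra max-pooled level.
Recall the feature map data flow mentioned in Study Notes | PyTorch Tutorial 33 (A Glimpse of Image Object Detection, Part 1): [256, h_f, w_f]
Next, look at the RPN network. Step into proposals, proposal_losses = self.rpn(images, features, targets).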
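Code: objectness, pred_bbox_deltas = self.head(features)
This implements the core of the RPN. It outputs the classification vector objectness, which performs foreground/background classification, and stores the four offsets of each candidate box in pred_bbox_deltas.
Code: anchors = self.anchor_generator(images, features)
This generates the anchor boxes.
Code: boxes, scores = self.filter_proposals(proposals, objectness, images.image_sizes, num_anchors_per_level)
This selects boxes using NMS.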
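How the four offsets and an anchor combine into a proposal box can be sketched in plain Python. This is a minimal illustration that assumes the standard Faster R-CNN box parameterization (center shift scaled by anchor size, exponential width/height scaling) with unit weights. The function name decode_box is hypothetical and is not the torchvision implementation.

```python
import math

def decode_box(anchor, deltas):
    """Apply (dx, dy, dw, dh) offsets to one anchor given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = anchor
    dx, dy, dw, dh = deltas
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    # Shift the center in proportion to the anchor size,
    # and scale width/height exponentially.
    pred_cx, pred_cy = cx + dx * w, cy + dy * h
    pred_w, pred_h = w * math.exp(dw), h * math.exp(dh)
    return (pred_cx - 0.5 * pred_w, pred_cy - 0.5 * pred_h,
            pred_cx + 0.5 * pred_w, pred_cy + 0.5 * pred_h)

anchor = (10.0, 20.0, 50.0, 60.0)
same = decode_box(anchor, (0.0, 0.0, 0.0, 0.0))     # zero offsets keep the anchor
shifted = decode_box(anchor, (0.5, 0.0, 0.0, 0.0))  # moves right by half the anchor width
print(same, shifted)
```

With zero offsets the anchor is returned unchanged, which is why the network only has to learn small corrections relative to each anchor.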
A. First step into objectness, pred_bbox_deltas = self.head(features)
class RPNHead(nn.Module):
    """
    Adds a simple RPN Head with classification and regression heads
    Arguments:
        in_channels (int): number of channels of the input feature
        num_anchors (int): number of anchors to be predicted
    """
    def __init__(self, in_channels, num_anchors):
        super(RPNHead, self).__init__()
        self.conv = nn.Conv2d(
            in_channels, in_channels, kernel_size=3, stride=1, padding=1
        )
        self.cls_logits = nn.Conv2d(in_channels, num_anchors, kernel_size=1, stride=1)
        self.bbox_pred = nn.Conv2d(
            in_channels, num_anchors * 4, kernel_size=1, stride=1
        )
        for l in self.children():
            torch.nn.init.normal_(l.weight, std=0.01)
            torch.nn.init.constant_(l.bias, 0)
    def forward(self, x):
        logits = []
        bbox_reg = []
        for feature in x:
            t = F.relu(self.conv(feature))
            logits.append(self.cls_logits(t))
            bbox_reg.append(self.bbox_pred(t))
        return logits, bbox_reg
Here x is the set of 5 feature maps produced in the previous step.
It is enough to look at what happens to a single feature map.
- 1.F.relu(self.conv(feature)): apply one more convolution, then the ReLU activation. self.conv is self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=1, padding=1).
- 2.Apply logistic-regression-style classification to the feature map t generated above: logits.append(self.cls_logits(t)). self.cls_logits is self.cls_logits = nn.Conv2d(in_channels, num_anchors, kernel_size=1, stride=1). Its shape shows that num_anchors=3, i.e. every pixel of the feature map predicts 3 anchors.
- 3.Apply bounding-box regression to the same feature map t: bbox_reg.append(self.bbox_pred(t)). self.bbox_pred is self.bbox_pred = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1, stride=1), so box regression predicts 4 values per anchor. This is why 12 = 3 * 4.
- 4.Once all 5 feature maps have been processed, the lists objectness and pred_bbox_deltas hold the results computed for those 5 feature maps.
- 5.Recall the "2 Softmax" data flow mentioned in Study Notes | PyTorch Tutorial 33 (A Glimpse of Image Object Detection, Part 1): [num_anchors, h_f, w_f], and Regressors: [num_anchors * 4, h_f, w_f].
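B. Next, a series of anchors is generated to go with the results above: anchors = self.anchor_generator(images, features). Inspecting the result shows more than 200,000 anchors.
Here there are 225603 anchors. Each of them uses the offsets computed above, so the number of anchors matches the number of predicted offset sets (each anchor needs one set of 4 offsets):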
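Where an anchor count of this magnitude comes from can be estimated with a little arithmetic. The sketch below assumes five feature levels with strides 4, 8, 16, 32 and 64, each level being half the previous one (rounded up), and 3 anchors per location. The helper names and the 800 x 1088 input size are hypothetical, and the example is not meant to reproduce the exact 225603 figure.

```python
import math

def fpn_level_sizes(height, width, num_levels=5, first_stride=4):
    """Spatial size (h, w) of each feature level for a padded input image."""
    sizes = []
    h, w = math.ceil(height / first_stride), math.ceil(width / first_stride)
    for _ in range(num_levels):
        sizes.append((h, w))
        h, w = math.ceil(h / 2), math.ceil(w / 2)  # each level halves the previous one
    return sizes

def count_anchors(height, width, anchors_per_location=3):
    """Total anchors = anchors per location * number of locations on all levels."""
    locations = sum(h * w for h, w in fpn_level_sizes(height, width))
    return locations * anchors_per_location

sizes = fpn_level_sizes(800, 1088)
total = count_anchors(800, 1088)
print(sizes)      # per-level feature map sizes
print(total)      # total number of anchors
print(total * 4)  # total number of predicted offset values
```

For this hypothetical input the total is a bit over 217,000 anchors, the same order of magnitude as the count seen in the debugger.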
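C. The next step filters the generated anchors: boxes, scores = self.filter_proposals(proposals, objectness, images.image_sizes, num_anchors_per_level). Step into it.
The top_n anchors are selected. As the figure below shows, the 1 is there because this batch contains only 1 image, and the 4693 anchors were selected from the 200,000+ anchors.
Next, NMS filters them further: keep = box_ops.batched_nms(boxes, scores, lvl, self.nms_thresh). Then keep = keep[:self.post_nms_top_n] retains the top entries that survive NMS, which finally yields final_boxes and final_scores. Looking at the output shape, why is it 1000? It is a hyperparameter, set in faster_rcnn.py.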
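The filtering idea (sort by score, suppress overlapping boxes, then truncate to a fixed number) can be sketched as a greedy NMS in plain Python. This is an illustration, not the torchvision code: batched_nms additionally keeps boxes from different levels apart, and the 0.5 threshold and the post_nms_top_n value of 1 are values chosen only for this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
keep = nms(boxes, scores, iou_thresh=0.5)
post_nms_top_n = 1             # example value for the truncation step
final = keep[:post_nms_top_n]
print(keep, final)
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed, and the truncation step then caps how many proposals move on to the next stage.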
After this computation returns, RoI generation begins. Step into that code.
Look at the relevant code:
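- 1.Code: box_features = self.box_roi_pool(features, proposals, image_shapes). It uses the proposals to "crop" the feature maps, i.e. extract sub-feature regions, and pools them to a uniform size (so that a series of feature regions of different sizes become feature maps of the same size). Now step into box_features = self.box_roi_pool(features, proposals, image_shapes) to take a look. Focus on the forward function. The core code is as follows:
After it runs, the uniform shape it produces is:
- 2.Code: box_features = self.box_head(box_features). This is really just two FC layers. Step into it and check the output shape: 256*7*7 (= 12544) has been mapped to 1024.
- 3.Code: class_logits, box_regression = self.box_predictor(box_features). This performs class prediction and bounding-box regression. Step into it. self.cls_score and self.bbox_pred are each a fully connected layer. Check the shapes as follows. Because the COCO dataset is used, there are 91 classes (90 object classes + 1 background class).
- 4.After the computation finishes, inspect the output of detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets).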
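The "crop and pool to a uniform size" idea from step 1 can be illustrated with a simplified RoI max pooling in plain Python. This is a sketch under stated assumptions: the model itself uses MultiScaleRoIAlign (bilinear sampling, with the feature level chosen by box size), whereas the hypothetical roi_max_pool below works on one small 2-D feature map with integer box coordinates.

```python
import math

def roi_max_pool(feature, box, out_size):
    """Max-pool the region box = (x1, y1, x2, y2) of a 2-D list into out_size x out_size."""
    x1, y1, x2, y2 = box
    h, w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_size):
        # Row range of bin i (bins may overlap by one row when h is not divisible).
        r0 = y1 + math.floor(i * h / out_size)
        r1 = y1 + math.ceil((i + 1) * h / out_size)
        row = []
        for j in range(out_size):
            c0 = x1 + math.floor(j * w / out_size)
            c1 = x1 + math.ceil((j + 1) * w / out_size)
            row.append(max(feature[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled

# A 4 x 6 feature map whose value encodes its position: row * 10 + column.
feature = [[r * 10 + c for c in range(6)] for r in range(4)]
small = roi_max_pool(feature, (0, 0, 4, 4), out_size=2)  # 4 x 4 region
wide = roi_max_pool(feature, (1, 0, 6, 3), out_size=2)   # 3 x 5 region
print(small, wide)
```

Both regions come out as 2 x 2 grids even though they differ in size, which is what lets the following fully connected layers take a fixed-length input.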
Finally, detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes) maps the predicted coordinates back onto the original image.
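Mapping boxes back to the original image amounts to rescaling by the ratio between the original size and the resized size. A minimal sketch, assuming sizes are given as (height, width) and boxes as (x1, y1, x2, y2). The function below is a plain-Python illustration with example sizes, not the library routine.

```python
def resize_boxes(boxes, resized_size, original_size):
    """Rescale boxes predicted on the resized image to original-image coordinates."""
    ratio_h = original_size[0] / resized_size[0]
    ratio_w = original_size[1] / resized_size[1]
    return [(x1 * ratio_w, y1 * ratio_h, x2 * ratio_w, y2 * ratio_h)
            for x1, y1, x2, y2 in boxes]

# The network saw an 800 x 1200 image that was originally 400 x 600.
predicted = [(100.0, 200.0, 300.0, 400.0)]
mapped = resize_boxes(predicted, resized_size=(800, 1200), original_size=(400, 600))
print(mapped)
```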
Summary:
Next, train and test on the dataset mentioned above:
Test code:
import os
import time
import torch.nn as nn
import torch
import random
import numpy as np
import torchvision.transforms as transforms
import torchvision
from PIL import Image
from tools.my_dataset import PennFudanDataset
from tools.common_tools import set_seed
from torch.utils.data import DataLoader
from matplotlib import pyplot as plt
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
# F refers to torchvision's functional transforms (used by ToTensor below)
from torchvision.transforms import functional as F
set_seed(1)  # set the random seed
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# classes_coco
COCO_INSTANCE_CATEGORY_NAMES = [
'__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]
def vis_bbox(img, output, classes, max_vis=40, prob_thres=0.4):
    fig, ax = plt.subplots(figsize=(12, 12))
    ax.imshow(img, aspect='equal')
    # read from the `output` argument rather than a global variable
    out_boxes = output["boxes"].cpu()
    out_scores = output["scores"].cpu()
    out_labels = output["labels"].cpu()
    num_boxes = out_boxes.shape[0]
    for idx in range(0, min(num_boxes, max_vis)):
        score = out_scores[idx].numpy()
        bbox = out_boxes[idx].numpy()
        class_name = classes[out_labels[idx]]
        # skip low-confidence detections
        if score < prob_thres:
            continue
        ax.add_patch(plt.Rectangle((bbox[0], bbox[1]), bbox[2] - bbox[0], bbox[3] - bbox[1], fill=False,
                                   edgecolor='red', linewidth=3.5))
        ax.text(bbox[0], bbox[1] - 2, '{:s} {:.3f}'.format(class_name, score), bbox=dict(facecolor='blue', alpha=0.5),
                fontsize=14, color='white')
    plt.show()
    plt.close()
class Compose(object):
    def __init__(self, transforms):
        self.transforms = transforms
    def __call__(self, image, target):
        for t in self.transforms:
            image, target = t(image, target)
        return image, target
class RandomHorizontalFlip(object):
    def __init__(self, prob):
        self.prob = prob
    def __call__(self, image, target):
        if random.random() < self.prob:
            height, width = image.shape[-2:]
            image = image.flip(-1)
            bbox = target["boxes"]
            bbox[:, [0, 2]] = width - bbox[:, [2, 0]]
            target["boxes"] = bbox
        return image, target
class ToTensor(object):
    def __call__(self, image, target):
        image = F.to_tensor(image)
        return image, target
if __name__ == "__main__":
    # config
    LR = 0.001
    num_classes = 2
    batch_size = 1
    start_epoch, max_epoch = 0, 30
    train_dir = os.path.join(BASE_DIR, "..", "..", "data", "PennFudanPed")
    train_transform = Compose([ToTensor(), RandomHorizontalFlip(0.5)])
    # step 1: data
    train_set = PennFudanDataset(data_dir=train_dir, transforms=train_transform)
    # function that collects the batch data
    def collate_fn(batch):
        return tuple(zip(*batch))
    train_loader = DataLoader(train_set, batch_size=batch_size, collate_fn=collate_fn)
    # step 2: model
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)  # replace the pre-trained head with a new one
    model.to(device)
    # step 3: loss
    # in lib/python3.6/site-packages/torchvision/models/detection/roi_heads.py
    # def fastrcnn_loss(class_logits, box_regression, labels, regression_targets)
    # step 4: optimizer scheduler
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=LR, momentum=0.9, weight_decay=0.0005)
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
    # step 5: Iteration
    for epoch in range(start_epoch, max_epoch):
        model.train()
        for iteration, (images, targets) in enumerate(train_loader):
            images = list(image.to(device) for image in images)
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
            # if torch.cuda.is_available():
            #     images, targets = images.to(device), targets.to(device)
            loss_dict = model(images, targets)  # images is list; targets is [ dict["boxes":**, "labels":**], dict[] ]
            losses = sum(loss for loss in loss_dict.values())
            print("Training:Epoch[{:0>3}/{:0>3}] Iteration[{:0>3}/{:0>3}] Loss: {:.4f} ".format(
                epoch, max_epoch, iteration + 1, len(train_loader), losses.item()))
            optimizer.zero_grad()
            losses.backward()
            optimizer.step()
        lr_scheduler.step()
    # test
    model.eval()
    # config
    vis_num = 5
    vis_dir = os.path.join(BASE_DIR, "..", "..", "data", "PennFudanPed", "PNGImages")
    img_names = list(filter(lambda x: x.endswith(".png"), os.listdir(vis_dir)))
    random.shuffle(img_names)
    preprocess = transforms.Compose([transforms.ToTensor(), ])
    for i in range(0, vis_num):
        path_img = os.path.join(vis_dir, img_names[i])
        # preprocess
        input_image = Image.open(path_img).convert("RGB")
        img_chw = preprocess(input_image)
        # to device
        if torch.cuda.is_available():
            img_chw = img_chw.to('cuda')
            model.to('cuda')
        # forward
        input_list = [img_chw]
        with torch.no_grad():
            tic = time.time()
            print("input img tensor shape:{}".format(input_list[0].shape))
            output_list = model(input_list)
            output_dict = output_list[0]
            print("pass: {:.3f}s".format(time.time() - tic))
        # visualization
        vis_bbox(input_image, output_dict, COCO_INSTANCE_CATEGORY_NAMES, max_vis=20, prob_thres=0.5)  # for 2 epoch for nms
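Output:
Here the model is fine-tuned so that it can be trained for this pedestrian detection task (only two classes need to be predicted: pedestrian (foreground) and background):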
# step 2: model
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes) # replace the pre-trained head with a new one
First take a look at the structure of model:
ipdb> model
FasterRCNN(
  (transform): GeneralizedRCNNTransform()
  (backbone): BackboneWithFPN(
    (body): IntermediateLayerGetter(
      (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
      (bn1): FrozenBatchNorm2d(original_name=FrozenBatchNorm2d)
      (relu): ReLU(inplace=True)
      (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
      (layer1): Sequential(
        ......
      )
      (layer2): Sequential(
        ......
      )
    (fpn): FeaturePyramidNetwork(
      (inner_blocks): ModuleList(
        (0): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
        (1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
        (2): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1))
        (3): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1))
      )
      (layer_blocks): ModuleList(
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      )
      (extra_blocks): LastLevelMaxPool()
    )
  )
  (rpn): RegionProposalNetwork(
    (anchor_generator): AnchorGenerator()
    (head): RPNHead(
      (conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (cls_logits): Conv2d(256, 3, kernel_size=(1, 1), stride=(1, 1))
      (bbox_pred): Conv2d(256, 12, kernel_size=(1, 1), stride=(1, 1))
    )
  )
  (roi_heads): RoIHeads(
    (box_roi_pool): MultiScaleRoIAlign()
    (box_head): TwoMLPHead(
      (fc6): Linear(in_features=12544, out_features=1024, bias=True)
      (fc7): Linear(in_features=1024, out_features=1024, bias=True)
    )
    (box_predictor): FastRCNNPredictor(
      (cls_score): Linear(in_features=1024, out_features=91, bias=True)
      (bbox_pred): Linear(in_features=1024, out_features=364, bias=True)
    )
  )
)
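So we can see that the pretrained box_predictor outputs 91 class scores (cls_score) and 364 = 91 * 4 box regression values (bbox_pred).
After fine-tuning, i.e. after replacing the head with num_classes = 2, cls_score has 2 outputs and bbox_pred has 8 = 2 * 4 outputs: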
Finally:
Recommended GitHub repository for object detection: https://github.com/amusi/awesome-object-detection