1. FCOS Network
Among common computer vision tasks, I consider detection to be relatively complex, largely because of the anchor-generation mechanism: it involves tuning hyperparameters such as anchor scales and aspect ratios, matching boxes to ground truth, handling the imbalance between positive and negative samples, high computational cost, and so on. Even so, anchor-based mechanisms have been the mainstream of detection in recent years.
Of course, some people have started to challenge the orthodoxy and proposed anchor-free detection, an idea that makes a newbie like me feel I finally have a chance to fully master a detection model. First, the paper: https://arxiv.org/pdf/1904.01355.pdf
First, the network structure:
The model structure is quite simple. The backbone outputs three feature maps C3, C4, C5 (output strides 8, 16, 32; with the paper's ResNet-50 they carry 512, 1024, 2048 channels), which 1x1 convolutions project to a common channel count. P3~P7 then clearly form an FPN (feature pyramid network); multi-scale FPN features are used a lot in detection and help with objects of different sizes. P6 and P7 are obtained from P5 by successive downsampling (stride-2 convolutions), while P4 and P3 are built while upsampling from P5 by summing in the projected C4 and C3.

After entering the head there are only two simple pipelines (branches), each made of four convolution layers. Splitting into two branches follows RetinaNet: sharing parameters between classification and regression in the head works worse than keeping them separate, which is understandable, as each branch can focus on its own job. The classification branch carries an extra small center-ness branch, whose main role is to measure how far a prediction deviates from the center of the ground-truth box. Its output (between 0 and 1) is multiplied with the classification score, which nicely suppresses low-quality boxes.
Next, how the network produces predicted boxes:
Clearly, the model regresses from a single point to a detection box. These points are the locations of the P3~P7 feature maps mapped back onto the original image: a feature-map location (x, y) maps to (⌊s/2⌋ + xs, ⌊s/2⌋ + ys), where s is the stride of that level. The points also have to be split into positive and negative samples: if a point, mapped back to the image, falls inside a GT box it is a positive sample; outside, it is a negative one. We only train on the positive samples (in practice some of them still carry background information, but compared with anchor-based methods the number of negative samples drops sharply, so training and inference are faster). A point mapped back to the image may of course land inside several GT boxes; in that case it simply takes the smallest-area box as its class label and regression target.

This process inevitably produces many low-quality boxes. The paper therefore adds a center-ness branch that predicts a value between 0 and 1 representing how far the point deviates from the box center; multiplying it with the classification score suppresses low-quality boxes far from the center. In the paper the center-ness branch shares convolution layers with the classification branch, but it was later found that sharing with the regression branch works better.
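To make this concrete, here is a minimal sketch of the location mapping and of the center-ness target (plain PyTorch; the function names are my own, not from the official repo):

import torch

def coords_on_image(h, w, stride):
    # map every feature-map location (x, y) back to image coordinates
    # (s // 2 + x * s, s // 2 + y * s), with s the stride, as in the paper
    xs = torch.arange(w) * stride + stride // 2
    ys = torch.arange(h) * stride + stride // 2
    return xs, ys

def centerness_target(l, t, r, b):
    # center-ness = sqrt( min(l, r) / max(l, r) * min(t, b) / max(t, b) ),
    # in (0, 1]; it decays as the location moves away from the box center
    return torch.sqrt((torch.min(l, r) / torch.max(l, r)) *
                      (torch.min(t, b) / torch.max(t, b)))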
There are some constraints along the way: the four regressed values (l, t, r, b) are restricted to a per-level range rather than being unbounded. The paper uses [0, 64, 128, 256, 512, ∞], where each pair of neighboring values is the regression range of the corresponding level P3~P7; a location whose targets fall outside its level's range is filtered out, as sketched below. These values should be tuned for your own task: FCOS has few hyperparameters, but they strongly affect performance, so it is not as robust as anchor-based detectors; with good settings, though, it can fully surpass some anchor-based models.
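A sketch of that per-level filtering (the thresholds follow the paper's defaults; the variable names are my own):

import torch

# (m_{i-1}, m_i] regression ranges for P3~P7: [0, 64, 128, 256, 512, inf]
RANGES = [(0, 64), (64, 128), (128, 256), (256, 512), (512, float('inf'))]

def keep_on_level(ltrb, level):
    # ltrb: (N, 4) regression targets (l, t, r, b) for candidate locations.
    # A location stays positive on level i only if max(l, t, r, b) falls
    # inside that level's range; otherwise another pyramid level handles it.
    m = ltrb.max(dim=1).values
    lo, hi = RANGES[level]
    return (m > lo) & (m <= hi)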
The loss function is the sum of a focal loss for classification and an IoU loss for regression, plus a binary cross-entropy loss on the center-ness branch. Upgraded IoU losses such as GIoU perform even better.
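For reference, a minimal self-contained sketch of the (l, t, r, b)-style IoU loss (following the UnitBox formulation that FCOS uses; the eps guard and the mean reduction are my simplifications, the paper normalizes by the number of positives):

import torch

def iou_loss(pred, target, eps=1e-7):
    # pred / target: (N, 4) distances (l, t, r, b) from a location to box sides
    area_p = (pred[:, 0] + pred[:, 2]) * (pred[:, 1] + pred[:, 3])
    area_t = (target[:, 0] + target[:, 2]) * (target[:, 1] + target[:, 3])
    # both boxes contain the same location, so the intersection width/height
    # is the sum of the elementwise minima of the side distances
    w_i = torch.min(pred[:, 0], target[:, 0]) + torch.min(pred[:, 2], target[:, 2])
    h_i = torch.min(pred[:, 1], target[:, 1]) + torch.min(pred[:, 3], target[:, 3])
    inter = w_i * h_i
    union = area_p + area_t - inter
    return -torch.log((inter + eps) / (union + eps)).mean()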
2. Code
2.1 Backbone
I chose my favorite network, VoVNet.
This backbone is also very simple; the paper is at https://arxiv.org/abs/1904.09730
import torch
import torch.nn as nn
import torch.nn.functional as F
from collections import OrderedDict
from torch.utils.model_zoo import load_url as load_state_dict_from_url

__all__ = ['vovnet39']

model_urls = {
    'vovnet39': 'https://dl.dropbox.com/s/1lnzsgnixd8gjra/vovnet39_torchvision.pth?dl=1'
}


def conv3x3(in_channels, out_channels, module_name, postfix,
            stride=1, groups=1, kernel_size=3, padding=1):
    """3x3 convolution with padding"""
    return [
        ('{}_{}/conv'.format(module_name, postfix),
         nn.Conv2d(in_channels, out_channels,
                   kernel_size=kernel_size,
                   stride=stride,
                   padding=padding,
                   groups=groups,
                   bias=False)),
        ('{}_{}/norm'.format(module_name, postfix),
         nn.BatchNorm2d(out_channels)),
        ('{}_{}/relu'.format(module_name, postfix),
         nn.ReLU(inplace=True)),
    ]


def conv1x1(in_channels, out_channels, module_name, postfix,
            stride=1, groups=1, kernel_size=1, padding=0):
    """1x1 convolution"""
    return [
        ('{}_{}/conv'.format(module_name, postfix),
         nn.Conv2d(in_channels, out_channels,
                   kernel_size=kernel_size,
                   stride=stride,
                   padding=padding,
                   groups=groups,
                   bias=False)),
        ('{}_{}/norm'.format(module_name, postfix),
         nn.BatchNorm2d(out_channels)),
        ('{}_{}/relu'.format(module_name, postfix),
         nn.ReLU(inplace=True)),
    ]

class _OSA_module(nn.Module):
    def __init__(self,
                 in_ch,
                 stage_ch,
                 concat_ch,
                 layer_per_block,
                 module_name,
                 identity=False):
        super(_OSA_module, self).__init__()
        self.identity = identity
        self.layers = nn.ModuleList()
        in_channel = in_ch
        for i in range(layer_per_block):
            self.layers.append(nn.Sequential(
                OrderedDict(conv3x3(in_channel, stage_ch, module_name, i))))
            in_channel = stage_ch
        # feature aggregation
        in_channel = in_ch + layer_per_block * stage_ch
        self.concat = nn.Sequential(
            OrderedDict(conv1x1(in_channel, concat_ch, module_name, 'concat')))

    def forward(self, x):
        identity_feat = x
        output = []
        output.append(x)
        for layer in self.layers:
            x = layer(x)
            output.append(x)
        x = torch.cat(output, dim=1)
        xt = self.concat(x)
        if self.identity:
            xt = xt + identity_feat
        return xt

class _OSA_stage(nn.Sequential):
    def __init__(self,
                 in_ch,
                 stage_ch,
                 concat_ch,
                 block_per_stage,
                 layer_per_block,
                 stage_num):
        super(_OSA_stage, self).__init__()
        if stage_num != 2:
            self.add_module('Pooling',
                            nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True))
        module_name = f'OSA{stage_num}_1'
        self.add_module(module_name,
                        _OSA_module(in_ch,
                                    stage_ch,
                                    concat_ch,
                                    layer_per_block,
                                    module_name))
        for i in range(block_per_stage - 1):
            module_name = f'OSA{stage_num}_{i+2}'
            self.add_module(module_name,
                            _OSA_module(concat_ch,
                                        stage_ch,
                                        concat_ch,
                                        layer_per_block,
                                        module_name,
                                        identity=True))

class VoVNet(nn.Module):
    def __init__(self,
                 config_stage_ch,
                 config_concat_ch,
                 block_per_stage,
                 layer_per_block):
        super(VoVNet, self).__init__()
        # Stem module
        stem = conv3x3(3, 64, 'stem', '1', 2)
        stem += conv3x3(64, 64, 'stem', '2', 1)
        stem += conv3x3(64, 128, 'stem', '3', 2)
        self.add_module('stem', nn.Sequential(OrderedDict(stem)))
        stem_out_ch = [128]
        in_ch_list = stem_out_ch + config_concat_ch[:-1]
        self.stage_names = []
        for i in range(4):  # num_stages
            name = 'stage%d' % (i + 2)
            self.stage_names.append(name)
            self.add_module(name,
                            _OSA_stage(in_ch_list[i],
                                       config_stage_ch[i],
                                       config_concat_ch[i],
                                       block_per_stage[i],
                                       layer_per_block,
                                       i + 2))
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight)
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        x = self.stem(x)
        outs = []
        for name in self.stage_names:
            x = getattr(self, name)(x)
            outs.append(x)
        return tuple(outs[1:])  # drop stage2, keep C3, C4, C5

    def freeze_bn(self):
        for layer in self.modules():
            if isinstance(layer, nn.BatchNorm2d):
                layer.eval()

def _vovnet(arch,
            config_stage_ch,
            config_concat_ch,
            block_per_stage,
            layer_per_block,
            pretrained,
            progress,
            **kwargs):
    model = VoVNet(config_stage_ch, config_concat_ch,
                   block_per_stage, layer_per_block,
                   **kwargs)
    if pretrained:
        state_dict = load_state_dict_from_url(model_urls[arch], progress=progress)
        model.load_state_dict(state_dict, strict=False)
    return model


def vovnet39(pretrained=False, progress=True, **kwargs):
    """
    pretrained (bool): If True, returns a model pre-trained on ImageNet
    progress (bool): If True, displays a progress bar of the download to stderr
    """
    return _vovnet('vovnet39', [128, 160, 192, 224], [256, 512, 768, 1024],
                   [1, 1, 2, 2], 5, pretrained, progress, **kwargs)

if __name__ == "__main__":
    # inspect the model outputs
    test_inp = torch.randn((1, 3, 480, 640)).to("cuda")
    model = vovnet39()
    model.cuda()
    out = model(test_inp)
    for i in range(len(out)):
        print("output size of C%d:" % (i + 3), out[i].size())
    # count the model parameters
    k = 0
    params = list(model.parameters())
    for i in params:
        l = 1
        print("layer structure: " + str(list(i.size())))
        for j in i.size():
            l *= j
        print("layer parameter count: " + str(l))
        k = k + l
    print("total parameter count: " + str(k))
    # output size of C3: torch.Size([1, 512, 60, 80])
    # output size of C4: torch.Size([1, 768, 30, 40])
    # output size of C5: torch.Size([1, 1024, 15, 20])
    # total parameter count: 21575296
VoVNet is a simplified version of DenseNet, but it performs very well because it was designed with GPU efficiency in mind. I only built vovnet39; if you need a deeper one, just stack more OSA modules per stage, as sketched below.
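For example, a deeper variant along the lines of the paper's VoVNet-57 only changes the per-stage block counts (a sketch; keep pretrained=False, since model_urls above has no entry for it):

def vovnet57(pretrained=False, progress=True, **kwargs):
    # same stage widths as vovnet39, more OSA modules in stages 4 and 5
    return _vovnet('vovnet57', [128, 160, 192, 224], [256, 512, 768, 1024],
                   [1, 1, 4, 3], 5, pretrained, progress, **kwargs)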
Next, the FPN. I provide two optional backbones, resnet18 and vovnet39; since my own dataset is still small, I have been sticking to lightweight networks.
import torch.nn as nn
import torch.nn.functional as F
import math
from model.config import DefaultConfig as cfg


class FPN(nn.Module):
    '''only support resnet18 or vovnet39'''
    def __init__(self, features=256, use_p5=True):
        super(FPN, self).__init__()
        if cfg.backbone_choice == "resnet18":
            print("backbone use resnet18")
            self.prj_5 = nn.Conv2d(512, features, kernel_size=1)
            self.prj_4 = nn.Conv2d(256, features, kernel_size=1)
            self.prj_3 = nn.Conv2d(128, features, kernel_size=1)
        elif cfg.backbone_choice == "vovnet39":
            print("backbone use vovnet39")
            self.prj_5 = nn.Conv2d(1024, features, kernel_size=1)
            self.prj_4 = nn.Conv2d(768, features, kernel_size=1)
            self.prj_3 = nn.Conv2d(512, features, kernel_size=1)
        self.conv_5 = nn.Conv2d(features, features, kernel_size=3, padding=1)
        self.conv_4 = nn.Conv2d(features, features, kernel_size=3, padding=1)
        self.conv_3 = nn.Conv2d(features, features, kernel_size=3, padding=1)
        if use_p5:
            self.conv_out6 = nn.Conv2d(features, features, kernel_size=3, padding=1, stride=2)
        else:
            # note: the hard-coded 512 assumes C5 has 512 channels (resnet18);
            # with vovnet39 (1024-channel C5) keep use_p5=True
            self.conv_out6 = nn.Conv2d(512, features, kernel_size=3, padding=1, stride=2)
        self.conv_out7 = nn.Conv2d(features, features, kernel_size=3, padding=1, stride=2)
        self.use_p5 = use_p5
        self.apply(self.init_conv_kaiming)

    def upsamplelike(self, inputs):
        src, target = inputs
        return F.interpolate(src, size=(target.shape[2], target.shape[3]),
                             mode='nearest')

    def init_conv_kaiming(self, module):
        if isinstance(module, nn.Conv2d):
            nn.init.kaiming_uniform_(module.weight, a=1)
            if module.bias is not None:
                nn.init.constant_(module.bias, 0)

    def forward(self, x):
        C3, C4, C5 = x
        P5 = self.prj_5(C5)
        P4 = self.prj_4(C4)
        P3 = self.prj_3(C3)
        # top-down pathway: upsample and sum with the lateral projections
        P4 = P4 + self.upsamplelike([P5, C4])
        P3 = P3 + self.upsamplelike([P4, C3])
        P3 = self.conv_3(P3)
        P4 = self.conv_4(P4)
        P5 = self.conv_5(P5)
        P5 = P5 if self.use_p5 else C5
        P6 = self.conv_out6(P5)
        P7 = self.conv_out7(F.relu(P6))
        return [P3, P4, P5, P6, P7]
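A quick shape check against the vovnet39 outputs above (a sketch; it assumes cfg.backbone_choice is set to "vovnet39"):

import torch

backbone = vovnet39()
fpn = FPN(features=256)
c3, c4, c5 = backbone(torch.randn(1, 3, 480, 640))
for i, p in enumerate(fpn([c3, c4, c5])):
    print('P%d:' % (i + 3), tuple(p.shape))
# expected: P3 (1, 256, 60, 80), P4 (1, 256, 30, 40), P5 (1, 256, 15, 20),
#           P6 (1, 256, 8, 10), P7 (1, 256, 4, 5)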
Next is the head, which mainly produces the outputs of the two pipelines:
import torch.nn as nn
import torch
import math


class ScaleExp(nn.Module):
    def __init__(self, init_value=1.0):
        super(ScaleExp, self).__init__()
        self.scale = nn.Parameter(torch.tensor([init_value], dtype=torch.float32))

    def forward(self, x):
        # learnable per-level scale before exp(), keeping regressed distances positive
        return torch.exp(x * self.scale)


class ClsCntRegHead(nn.Module):
    def __init__(self, in_channel, class_num, GN=True, cnt_on_reg=True, prior=0.01):
        '''
        Args:
            in_channel: channel count of the FPN features
            class_num: number of object classes
            GN: use GroupNorm in both branches
            cnt_on_reg: attach the center-ness layer to the regression branch
            prior: prior probability used to initialize the classification bias
        '''
        super(ClsCntRegHead, self).__init__()
        self.prior = prior
        self.class_num = class_num
        self.cnt_on_reg = cnt_on_reg
        cls_branch = []
        reg_branch = []
        for i in range(4):
            cls_branch.append(nn.Conv2d(in_channel, in_channel, kernel_size=3, padding=1, bias=True))
            if GN:
                cls_branch.append(nn.GroupNorm(32, in_channel))
            cls_branch.append(nn.ReLU(True))
            reg_branch.append(nn.Conv2d(in_channel, in_channel, kernel_size=3, padding=1, bias=True))
            if GN:
                reg_branch.append(nn.GroupNorm(32, in_channel))
            reg_branch.append(nn.ReLU(True))
        self.cls_conv = nn.Sequential(*cls_branch)
        self.reg_conv = nn.Sequential(*reg_branch)
        self.cls_logits = nn.Conv2d(in_channel, class_num, kernel_size=3, padding=1)
        self.cnt_logits = nn.Conv2d(in_channel, 1, kernel_size=3, padding=1)
        self.reg_pred = nn.Conv2d(in_channel, 4, kernel_size=3, padding=1)
        self.apply(self.init_conv_RandomNormal)
        nn.init.constant_(self.cls_logits.bias, -math.log((1 - prior) / prior))
        self.scale_exp = nn.ModuleList([ScaleExp(1.0) for _ in range(5)])

    def init_conv_RandomNormal(self, module, std=0.01):
        if isinstance(module, nn.Conv2d):
            nn.init.normal_(module.weight, std=std)
            if module.bias is not None:
                nn.init.constant_(module.bias, 0)

    def forward(self, inputs):
        '''inputs: [P3~P7]'''
        cls_logits = []
        cnt_logits = []
        reg_preds = []
        for index, P in enumerate(inputs):
            cls_conv_out = self.cls_conv(P)
            reg_conv_out = self.reg_conv(P)
            cls_logits.append(self.cls_logits(cls_conv_out))
            if not self.cnt_on_reg:
                cnt_logits.append(self.cnt_logits(cls_conv_out))
            else:
                cnt_logits.append(self.cnt_logits(reg_conv_out))
            reg_preds.append(self.scale_exp[index](self.reg_pred(reg_conv_out)))
        return cls_logits, cnt_logits, reg_preds
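And a quick shape check for the head (a sketch; class_num=2 for mask / no-mask is my assumption):

import torch

head = ClsCntRegHead(in_channel=256, class_num=2)
# stand-ins for P3~P7 of a 480x640 input
fpn_outs = [torch.randn(1, 256, h, w)
            for h, w in [(60, 80), (30, 40), (15, 20), (8, 10), (4, 5)]]
cls_logits, cnt_logits, reg_preds = head(fpn_outs)
# per level: cls (1, 2, H, W), center-ness (1, 1, H, W),
# regression (1, 4, H, W) holding (l, t, r, b)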
That is the whole model implementation. I used a face-mask detection dataset (the download link is on my GitHub; remember to give it a star).
3. Training on Your Own Dataset
3.1 First, organize your data into VOC format, with the layout sketched below.
Annotations holds the labelimg XML annotation files, JPEGImages holds the original images, and ImageSets/Main holds the txt files.
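The layout looks roughly like this (the dataset root name is just a placeholder):

VOCdevkit/
├── Annotations/    # labelimg XML files, one per image
├── JPEGImages/     # original images
└── ImageSets/
    └── Main/       # train.txt / val.txt listing the image IDs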
The utils folder on GitHub holds two scripts. convert_json2VOCSEG.py converts the labels into XML files plus mask .npy files, since I later want to try instance segmentation with FCOS; anyone interested is welcome to discuss it with me. maketxt.py is a simple script for generating the txt files; a rough sketch of what such a script does follows.
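This is not the script itself, just a hedged sketch of a typical split generator (the 0.9 train ratio and the relative paths are my assumptions, not necessarily what maketxt.py uses):

import os
import random

# list image IDs and write train/val splits into ImageSets/Main
ids = [f[:-4] for f in os.listdir("JPEGImages") if f.endswith(".jpg")]
random.shuffle(ids)
n_train = int(0.9 * len(ids))
for name, subset in [("train.txt", ids[:n_train]), ("val.txt", ids[n_train:])]:
    with open(os.path.join("ImageSets", "Main", name), "w") as f:
        f.write("\n".join(subset))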
3.2 Code
https://github.com/2anchao/FCOS_DET_MASK
3.3 Training results
3.3.1 Without a mask
3.3.2 Wearing a mask
My senior labmate says this girl's photo is heavily retouched; what do you think?
4. Summary
I did not train for long, just for fun; around epoch 10 my leader confiscated my workstation... Training longer would certainly improve the results a lot. I hope this post helps you.