Author: 陈e
Source: GiantPandaCV
Editor: 极市平台
Overview
This article discusses how to improve model performance on low-end mobile devices. It covers two kinds of optimization: the model itself (without changing the model's original ops and without retraining) and the post-processing. Corrections are welcome if anything is inaccurate.
一、Model Optimization
Model optimization here refers to the familiar fusion of a model's convolution and BN layers, and the structural reparameterization of branches such as skip connections and identity paths. The idea came from a discussion I happened to join one day:

The expert's view was that fusion can be done but is not really necessary: the value of fuse(conv+bn)=CB lies elsewhere, and its contribution to speed is negligible. I stuck to my own position, though, because YOLOv5's comparison was run on high-compute GPUs; on low-end cards, or on devices with no GPU or NPU at all, fusion brings a clear speedup.
This is especially true for models that reuse many group convolutions or depthwise convolutions. For example, ShuffleNetV2 is regarded as an efficient mobile network and is often used as an edge-side backbone, yet a single shuffle block (stride=2) alone uses two depthwise separable convolutions:

A full network uses 25 groups of depthwise convolutions (the ShuffleNet family was designed for low-compute CPU devices, so heavy reuse of depthwise separable convolutions is unavoidable).
With that motivation, I ran a set of experiments on the v5lite-s model; the test results are shared below for discussion:

The results above come from fusing all convolution and BN layers in the shuffle blocks, tested on 1,000 images sampled from COCO val2017. On an i5 CPU, the fused model shows a clear speedup for a single forward pass on x86; on ARM CPUs the effect should be even more pronounced.
The fusion script is as follows:
```python
import torch
from thop import profile
from copy import deepcopy
from models.experimental import attempt_load


def model_print(model, img_size):
    # Model information. img_size may be int or list, i.e. img_size=640 or img_size=[640, 320]
    n_p = sum(x.numel() for x in model.parameters())  # number of parameters
    n_g = sum(x.numel() for x in model.parameters() if x.requires_grad)  # number of gradients
    stride = max(int(model.stride.max()), 32) if hasattr(model, 'stride') else 32
    img = torch.zeros((1, model.yaml.get('ch', 3), stride, stride),
                      device=next(model.parameters()).device)  # input
    flops = profile(deepcopy(model), inputs=(img,), verbose=False)[0] / 1E9 * 2  # stride GFLOPS
    img_size = img_size if isinstance(img_size, list) else [img_size, img_size]  # expand if int/float
    fs = ', %.6f GFLOPS' % (flops * img_size[0] / stride * img_size[1] / stride)  # imh x imw GFLOPS
    print(f"Model Summary: {len(list(model.modules()))} layers, {n_p} parameters, {n_g} gradients{fs}")


if __name__ == '__main__':
    load = 'weights/v5lite-e.pt'
    save = 'weights/repv5lite-e.pt'
    test_size = 320
    print(f'Loading weights: ({load})')
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = attempt_load(load, map_location=device)  # load FP32 model
    torch.save(model, save)
    model_print(model, test_size)
    print(model)
```
The core code for fusing the ops is as follows:
```python
if type(m) is Shuffle_Block:
    if hasattr(m, 'branch1'):
        re_branch1 = nn.Sequential(
            nn.Conv2d(m.branch1[0].in_channels, m.branch1[0].out_channels,
                      kernel_size=m.branch1[0].kernel_size, stride=m.branch1[0].stride,
                      padding=m.branch1[0].padding, groups=m.branch1[0].groups),
            nn.Conv2d(m.branch1[2].in_channels, m.branch1[2].out_channels,
                      kernel_size=m.branch1[2].kernel_size, stride=m.branch1[2].stride,
                      padding=m.branch1[2].padding, bias=False),
            nn.ReLU(inplace=True),
        )
        re_branch1[0] = fuse_conv_and_bn(m.branch1[0], m.branch1[1])
        re_branch1[1] = fuse_conv_and_bn(m.branch1[2], m.branch1[3])
        m.branch1 = re_branch1
    if hasattr(m, 'branch2'):
        re_branch2 = nn.Sequential(
            nn.Conv2d(m.branch2[0].in_channels, m.branch2[0].out_channels,
                      kernel_size=m.branch2[0].kernel_size, stride=m.branch2[0].stride,
                      padding=m.branch2[0].padding, groups=m.branch2[0].groups),
            nn.ReLU(inplace=True),
            nn.Conv2d(m.branch2[3].in_channels, m.branch2[3].out_channels,
                      kernel_size=m.branch2[3].kernel_size, stride=m.branch2[3].stride,
                      padding=m.branch2[3].padding, bias=False),
            nn.Conv2d(m.branch2[5].in_channels, m.branch2[5].out_channels,
                      kernel_size=m.branch2[5].kernel_size, stride=m.branch2[5].stride,
                      padding=m.branch2[5].padding, groups=m.branch2[5].groups),
            nn.ReLU(inplace=True),
        )
        re_branch2[0] = fuse_conv_and_bn(m.branch2[0], m.branch2[1])
        re_branch2[2] = fuse_conv_and_bn(m.branch2[3], m.branch2[4])
        re_branch2[3] = fuse_conv_and_bn(m.branch2[5], m.branch2[6])
        m.branch2 = re_branch2
self.info()
```
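The algebra behind `fuse_conv_and_bn` is simple enough to verify by hand. For a 1×1 convolution (a per-pixel matmul), folding BN means scaling each output channel's weights by gamma/sqrt(var+eps) and absorbing the BN shift into the bias. The sketch below is illustrative (the `fold_bn` name is mine, not from the repository) and checks that conv→BN and the single fused conv agree numerically:

```python
import numpy as np

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm stats into a 1x1 conv weight w (out_c, in_c) and bias b (out_c,)."""
    scale = gamma / np.sqrt(var + eps)           # per-output-channel BN scale
    return w * scale[:, None], beta + (b - mean) * scale

rng = np.random.default_rng(0)
out_c, in_c = 4, 3
w = rng.standard_normal((out_c, in_c)); b = rng.standard_normal(out_c)
gamma = rng.standard_normal(out_c); beta = rng.standard_normal(out_c)
mean = rng.standard_normal(out_c); var = rng.random(out_c) + 0.1

x = rng.standard_normal(in_c)                    # one pixel of a feature map
y_ref = gamma * ((w @ x + b) - mean) / np.sqrt(var + 1e-5) + beta  # conv -> BN
wf, bf = fold_bn(w, b, gamma, beta, mean, var)
y_fused = wf @ x + bf                            # single fused conv
assert np.allclose(y_ref, y_fused)
```

The same scaling applies per filter for k×k and depthwise convolutions, which is why the fused model computes one op where it previously computed two.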
The figure below shows the un-fused model's parameter count and FLOPs, and the structure of a single shuffle block; a single branch2 branch in the un-fused block alone contains 8 sub-ops.

After fusion, the parameter count drops by about 5K and the computation by about 6K, mostly from the removed BN layers. A single branch2 branch ends up with three fewer ops, and across the whole backbone a total of 25 BN layers are removed.

The reparameterization mentioned in the overview matters even more than op fusion. Take the previously released g model (pogg: 追求极致:Repvgg重参化对YOLO工业落地的实验和思考, https://zhuanlan.zhihu.com/p/410874403). Since the g model was designed for high-performance GPUs, its backbone uses RepVGG: during training, the rbr_1x1 and identity branches improve accuracy, but at inference they must be reparameterized into a single 3×3 convolution to be cost-effective. Most directly, the following code reparameterizes and fuses each RepVGG block:
```python
if type(m) is RepVGGBlock:
    if hasattr(m, 'rbr_1x1'):
        kernel, bias = m.get_equivalent_kernel_bias()
        rbr_reparam = nn.Conv2d(in_channels=m.rbr_dense.conv.in_channels,
                                out_channels=m.rbr_dense.conv.out_channels,
                                kernel_size=m.rbr_dense.conv.kernel_size,
                                stride=m.rbr_dense.conv.stride,
                                padding=m.rbr_dense.conv.padding,
                                dilation=m.rbr_dense.conv.dilation,
                                groups=m.rbr_dense.conv.groups,
                                bias=True)
        rbr_reparam.weight.data = kernel
        rbr_reparam.bias.data = bias
        for para in self.parameters():
            para.detach_()
        m.rbr_dense = rbr_reparam
        m.__delattr__('rbr_1x1')
        if hasattr(m, 'rbr_identity'):
            m.__delattr__('rbr_identity')
        if hasattr(m, 'id_tensor'):
            m.__delattr__('id_tensor')
        m.deploy = True
        m.forward = m.fusevggforward  # update forward
if type(m) is Conv and hasattr(m, 'bn'):
    m.conv = fuse_conv_and_bn(m.conv, m.bn)  # update conv
    delattr(m, 'bn')  # remove batchnorm
    m.forward = m.fuseforward  # update forward
# Note: reparameterization must happen before the conv+bn fuse,
# otherwise reparameterization will fail.
```
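What `get_equivalent_kernel_bias` produces can be understood through the linearity of convolution: after folding each branch's BN, the 1×1 kernel is zero-padded to 3×3, the identity path is written as a 3×3 kernel with a 1 at the center, and the three kernels are summed into one. The single-channel sketch below (BN already folded, illustrative names) verifies the equivalence with a naive convolution:

```python
import numpy as np

def conv3x3(x, k):
    # Naive single-channel 3x3 convolution, stride 1, zero padding 1
    h, w = x.shape
    xp = np.pad(x, 1)
    out = np.zeros_like(x)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(xp[i:i+3, j:j+3] * k)
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((5, 5))
k3 = rng.standard_normal((3, 3))      # dense 3x3 branch
k1 = rng.standard_normal()            # 1x1 branch (a scalar for one channel)

# Zero-pad the 1x1 kernel to 3x3; write identity as a 3x3 kernel with 1 at the center
k1_pad = np.zeros((3, 3)); k1_pad[1, 1] = k1
k_id = np.zeros((3, 3)); k_id[1, 1] = 1.0

merged = k3 + k1_pad + k_id           # single equivalent 3x3 kernel
y_branches = conv3x3(x, k3) + k1 * x + x   # three branches summed, as in training
y_merged = conv3x3(x, merged)              # one conv, as in deployment
assert np.allclose(y_branches, y_merged)
```

This is why the deployed model runs a single 3×3 conv per block with no loss in accuracy: the merged kernel is mathematically identical to the three-branch sum.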
The results below show clear changes in layer count, FLOPs, and parameter count:

二、Post-Processing Optimization
2.1 The inverse-function trick
Optimizing post-processing matters just as much. The goal is to cut low-efficiency loops and branches and to avoid heavy use of expensive operators.
We test and modify the YOLOv5 ncnn demo code. Since the source pulls in too many libraries, we extract the generate_proposals function on its own and, modeled on it, write a snippet that computes confidence with sigmoid, compares it across the 80 classes, and computes the bbox coordinates.
```cpp
float sigmoid(float x)
{
    return static_cast<float>(1.f / (1.f + exp(-x)));
}

// Generate `num` random class scores to mimic the 80-class raw outputs
std::vector<float> ram_cls_num(int num)
{
    std::vector<float> res;
    float a = 10.0, b = 100.0;
    srand(time(NULL));  // seed the RNG so each run produces a different sequence
    cout << "number class:" << num << endl;
    // ... (remainder of the snippet truncated in the original)
}
```
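The "inverse function" the section title refers to works by moving the expensive operator out of the hot loop. Since sigmoid is monotonic, `sigmoid(x) > thresh` is equivalent to `x > unsigmoid(thresh)`, so instead of calling `exp` once per anchor per class, we apply the inverse sigmoid to the threshold once up front and compare raw scores directly. A minimal Python sketch of the idea (function names are mine, for illustration):

```python
import numpy as np

def unsigmoid(y):
    # Inverse of sigmoid: the raw score x such that sigmoid(x) == y
    return -np.log(1.0 / y - 1.0)

def filter_naive(raw_scores, thresh):
    # Baseline: apply sigmoid to every raw score, then compare (one exp per score)
    return np.where(1.0 / (1.0 + np.exp(-raw_scores)) > thresh)[0]

def filter_fast(raw_scores, thresh):
    # Optimized: one unsigmoid up front, no exp in the hot loop
    return np.where(raw_scores > unsigmoid(thresh))[0]

scores = np.random.default_rng(2).standard_normal(80) * 4  # fake raw logits for 80 classes
thresh = 0.25
assert np.array_equal(filter_naive(scores, thresh), filter_fast(scores, thresh))
```

Since most anchors fall below the confidence threshold, this removes the vast majority of `exp` calls; sigmoid only needs to be computed for the few candidates that survive the comparison.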