PyTorch RefineDet implemented in libtorch

PyTorch/libtorch QQ group: 1041467052

First, you need a working knowledge of libtorch syntax; the following link is a good reference:
[https://www.cnblogs.com/yanghailin/p/12901586.html]

The rough PyTorch-to-libtorch workflow:

1. Train the PyTorch model and test it.

2. Convert the PyTorch model to a .pt (TorchScript) file.

3. Write the post-processing.

2. Convert the PyTorch model to a .pt file

You can either write a standalone script for this, or just add a couple of lines at the right spot while running your test script.
A standalone script looks like this:

import torch
from net import resnet

# build the model and load the trained weights
model = resnet()
state_dict = torch.load("e130_i391.pth")
model.load_state_dict(state_dict, strict=True)

# freeze parameters and switch to inference mode on CPU
for p in model.parameters():
    p.requires_grad = False
model.eval()
model = model.cpu()

# trace with a dummy input of the expected shape (N, C, H, W)
example = torch.rand(1, 3, 48, 640)
traced_script_module = torch.jit.trace(model, example)
print(traced_script_module)
traced_script_module.save("./01077cls.pt")

Personally, I prefer to just drop a couple of lines somewhere into the test script:

#########################
        traced_script_module = torch.jit.trace(net, x)
        traced_script_module.save("RefineDet.PyTorch-master/save_pt/refinedet320_0522_0_pytorch1_0.pt")
        print("sys.exit(1)")
        sys.exit(1)
#################

Here net is the initialized network and x is the input. That is all it takes to generate the .pt file.
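Before moving on to C++, it is worth sanity-checking that a traced module really reproduces the original's outputs. A minimal round-trip sketch (toy Conv2d stand-in, not the RefineDet network — the same pattern applies to the real .pt):

```python
import torch

# Round-trip check: trace a tiny module, save it, reload it,
# and compare outputs against the original eager-mode module.
net = torch.nn.Conv2d(3, 8, 3)
net.eval()
x = torch.rand(1, 3, 48, 64)

traced = torch.jit.trace(net, x)
traced.save("tiny.pt")
reloaded = torch.jit.load("tiny.pt")

with torch.no_grad():
    assert torch.allclose(net(x), reloaded(x), atol=1e-6)
```

If this assertion holds for your model, loading the same .pt from libtorch should give matching numbers as well.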

3. Write the post-processing

Post-processing turns the network's raw inference output into the results you actually need. You can follow the PyTorch implementation and translate it into libtorch statement by statement.
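For orientation, the heart of that post-processing is the SSD-style box decode that the detect step performs (the repo's version lives in layers/box_utils.py). A minimal numpy sketch, assuming the common SSD variances of 0.1/0.2 — check the repo's box_utils.decode for its actual values:

```python
import numpy as np

def decode(loc, priors, variances=(0.1, 0.2)):
    """Decode regression offsets against default boxes.

    loc:    (N, 4) predicted offsets
    priors: (N, 4) default boxes as (cx, cy, w, h)
    """
    # recover box center and size from the offsets
    cxcy = priors[:, :2] + loc[:, :2] * variances[0] * priors[:, 2:]
    wh = priors[:, 2:] * np.exp(loc[:, 2:] * variances[1])
    # convert (cx, cy, w, h) -> (xmin, ymin, xmax, ymax)
    return np.concatenate([cxcy - wh / 2, cxcy + wh / 2], axis=1)

# zero offsets must give back the prior box itself
priors = np.array([[0.5, 0.5, 0.2, 0.4]])
print(decode(np.zeros((1, 4)), priors))
```

Thresholding by class confidence and NMS then run on top of the decoded boxes; those are the parts you translate into libtorch line by line.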

Converting RefineDet to libtorch was not all smooth sailing. I hit an error nobody online had an answer for: exporting the trained model to .pt fails!!! The error:

Finished loading model!
/opt/conda/conda-bld/pytorch_1573049310284/work/torch/csrc/autograd/python_function.cpp:622: UserWarning: Legacy autograd function with non-static forward method is deprecated and will be removed in 1.3. Please use new-style autograd function with static forward method. (Example: https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function)
/opt/conda/conda-bld/pytorch_1573049310284/work/torch/csrc/autograd/python_function.cpp:622: UserWarning: Legacy autograd function with non-static forward method is deprecated and will be removed in 1.3. Please use new-style autograd function with static forward method. (Example: https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function)
Traceback (most recent call last):
  File "/data_1/Yang/project_new/2020/pytorch_refinedet/RefineDet.PyTorch-master/eval_refinedet_320.py", line 480, in <module>
    thresh=args.confidence_threshold)
  File "/data_1/Yang/project_new/2020/pytorch_refinedet/RefineDet.PyTorch-master/eval_refinedet_320.py", line 402, in test_net
    output_names=['output'])
  File "/data_1/Yang/software_install/Anaconda1105/envs/DB_cuda10_2/lib/python3.7/site-packages/torch/onnx/__init__.py", line 26, in _export
    result = utils._export(*args, **kwargs)
  File "/data_1/Yang/software_install/Anaconda1105/envs/DB_cuda10_2/lib/python3.7/site-packages/torch/onnx/utils.py", line 382, in _export
    fixed_batch_size=fixed_batch_size)
  File "/data_1/Yang/software_install/Anaconda1105/envs/DB_cuda10_2/lib/python3.7/site-packages/torch/onnx/utils.py", line 249, in _model_to_graph
    graph, torch_out = _trace_and_get_graph_from_model(model, args, training)
  File "/data_1/Yang/software_install/Anaconda1105/envs/DB_cuda10_2/lib/python3.7/site-packages/torch/onnx/utils.py", line 206, in _trace_and_get_graph_from_model
    trace, torch_out, inputs_states = torch.jit.get_trace_graph(model, args, _force_outplace=True, _return_inputs_states=True)
  File "/data_1/Yang/software_install/Anaconda1105/envs/DB_cuda10_2/lib/python3.7/site-packages/torch/jit/__init__.py", line 275, in get_trace_graph
    return LegacyTracedModule(f, _force_outplace, return_inputs, _return_inputs_states)(*args, **kwargs)
  File "/data_1/Yang/software_install/Anaconda1105/envs/DB_cuda10_2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/data_1/Yang/software_install/Anaconda1105/envs/DB_cuda10_2/lib/python3.7/site-packages/torch/jit/__init__.py", line 352, in forward
    out = self.inner(*trace_inputs)
  File "/data_1/Yang/software_install/Anaconda1105/envs/DB_cuda10_2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 539, in __call__
    result = self._slow_forward(*input, **kwargs)
  File "/data_1/Yang/software_install/Anaconda1105/envs/DB_cuda10_2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 525, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/data_1/Yang/project_new/2020/pytorch_refinedet/RefineDet.PyTorch-master/models/refinedet.py", line 208, in forward
    self.priors.type(type(x.data))                  # default boxes
RuntimeError: Attempted to trace Detect_RefineDet, but tracing of legacy functions is not supported

Sigh, no answers anywhere. Roughly, it says some part is not supported for export. After digging in, I found this in refinedet.py:

if self.phase == "test":
    #print(loc, conf)
    output = self.detect(
        arm_loc.view(arm_loc.size(0), -1, 4),           # arm loc preds
        self.softmax(arm_conf.view(arm_conf.size(0), -1,
                     2)),                               # arm conf preds
        odm_loc.view(odm_loc.size(0), -1, 4),           # odm loc preds
        self.softmax(odm_conf.view(odm_conf.size(0), -1,
                     self.num_classes)),                # odm conf preds
        self.priors.type(type(x.data))                  # default boxes
    )

This self.detect calls the function defined in
RefineDet.PyTorch-master/layers/functions/detection.py,
which begins with:

import torch
from torch.autograd import Function
from ..box_utils import decode, nms
from data import voc as cfg

It seems that `from torch.autograd import Function` — that is, the legacy-style autograd Function Detect_RefineDet is built on — is what makes the export unsupported...
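For context, the tracer rejects legacy autograd Functions, which define forward as an instance method and are called like a plain object (exactly how Detect_RefineDet is invoked). New-style Functions use static forward/backward and are called via .apply(). A toy illustration of the supported style (not code from the repo):

```python
import torch

# New-style autograd Function: static forward/backward, invoked via .apply().
# Legacy-style Functions (forward as an instance method) trigger the
# "tracing of legacy functions is not supported" error shown above.
class Double(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x * 2

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out * 2

y = Double.apply(torch.tensor([3.0]))
print(y)  # tensor([6.])
```

Rewriting Detect_RefineDet in this style would be one way out; the route taken below — returning the raw tensors and doing the post-processing outside the model — avoids the problem entirely.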
I struggled with it for a long time, and then I noticed:

output = self.detect(
                arm_loc.view(arm_loc.size(0), -1, 4),           # arm loc preds
                self.softmax(arm_conf.view(arm_conf.size(0), -1,
                             2)),                               # arm conf preds
                odm_loc.view(odm_loc.size(0), -1, 4),           # odm loc preds
                self.softmax(odm_conf.view(odm_conf.size(0), -1,
                             self.num_classes)),                # odm conf preds
                self.priors.type(type(x.data))                  # default boxes
            )

The first four arguments passed in are just the model's raw inference outputs, and detect is really just post-processing. So I can return those four tensors directly and write the post-processing myself! I tried it, changing the corresponding spot in refinedet.py to:

        if self.phase == "test":
            output = (arm_loc.view(arm_loc.size(0), -1, 4),
                      self.softmax(arm_conf.view(arm_conf.size(0), -1, 2)),
                      odm_loc.view(odm_loc.size(0), -1, 4),
                      self.softmax(odm_conf.view(odm_conf.size(0), -1, self.num_classes))
                      )
            #print(loc, conf)
            #output = self.detect(
            #    arm_loc.view(arm_loc.size(0), -1, 4),           # arm loc preds
            #    self.softmax(arm_conf.view(arm_conf.size(0), -1,
            #                 2)),                               # arm conf preds
            #    odm_loc.view(odm_loc.size(0), -1, 4),           # odm loc preds
            #    self.softmax(odm_conf.view(odm_conf.size(0), -1,
            #                 self.num_classes)),                # odm conf preds
            #    self.priors.type(type(x.data))                  # default boxes
            #)

And it worked! The .pt file was generated!!!
Then, after a lot more tinkering, I wrote the matching post-processing in libtorch, mirroring the PyTorch version. With plenty of searching and reference-hunting it took about a week, but I finally got it working — and I created my first GitHub repo and uploaded the result:
[https://github.com/wuzuowuyou/libtorch_RefineDet_2020]
libtorch really is frugal with GPU memory: only 860 MB for a 320x320 image.
The GitHub version targets CUDA 8 and PyTorch 1.0.

I originally implemented it with libtorch 1.3 / PyTorch 1.3 / CUDA 10.0; download link:
https://download.csdn.net/download/yang332233/12461623

#################################################
20200528
How should I even describe my mood right now???
I had already integrated this module into the project, but when it ran together with the other modules the GPU ran out of memory!!!! And the RefineDet module was the cause!!! We were about to go live, and now this???
!!!!!!!!!!!!!!!!!!!!!!!!!!
Stay calm. Let's find out where the problem is.
The symptom: on the first forward pass, GPU memory quickly spikes to over 5000 MB and then settles back to the normal 860 MB; every later forward is fine... A libtorch version issue? But the other libtorch-based modules in our project were fine when I tested them.
Since I had started with libtorch 1.3, I tried that too — same memory explosion on the first forward... So it looked like a problem in PyTorch RefineDet itself. I went back to the PyTorch version and, sure enough, the same thing happened. Debugging localized the problem to the VGG part.

for k in range(30):
    x = self.vgg[k](x)  ###############
    if 22 == k:
        s = self.conv4_3_L2Norm(x)
        sources.append(s)
    elif 29 == k:
        s = self.conv5_3_L2Norm(x)
        sources.append(s)

# apply vgg up to fc7
aaa = len(self.vgg)
for k in range(30, len(self.vgg)):
    x = self.vgg[k](x)
sources.append(x)

At the line x = self.vgg[k](x), while stepping through the VGG forward pass:
when k = 5, executing x = self.vgg[k](x) makes GPU memory spike to 4851 MB and then quickly fall back to 800-odd MB;
when k = 7, executing the same line makes it spike to 1609 MB and then fall back to 800-odd MB;
...
I checked — these are all convolution layers.
But why does it only happen the first time and never again afterwards!!!! So, why the first pass and not the later ones??
Ugh, no idea! How to fix it? No idea!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Of course!! In the end it did get solved!! It took an entire day!!
Thanks here go to Brother Shui (水哥) from the QQ group!!
Since memory spikes up and crashes back down during the first forward, I needed to watch the changes in real time:
nvidia-smi -lms 5 | grep example
which produces:

|    0     12293      C   ...ineDet_2020-unknown-Default/example-app   215MiB |
|    0     12293      C   ...ineDet_2020-unknown-Default/example-app   217MiB |
|    0     12293      C   ...ineDet_2020-unknown-Default/example-app   217MiB |
|    0     12293      C   ...ineDet_2020-unknown-Default/example-app   221MiB |
|    0     12293      C   ...ineDet_2020-unknown-Default/example-app   223MiB |
|    0     12293      C   ...ineDet_2020-unknown-Default/example-app   225MiB |
|    0     12293      C   ...ineDet_2020-unknown-Default/example-app   225MiB |
|    0     12293      C   ...ineDet_2020-unknown-Default/example-app   227MiB |
|    0     12293      C   ...ineDet_2020-unknown-Default/example-app   229MiB |
|    0     12293      C   ...ineDet_2020-unknown-Default/example-app   257MiB |
|    0     12293      C   ...ineDet_2020-unknown-Default/example-app   259MiB |
|    0     12293      C   ...ineDet_2020-unknown-Default/example-app   261MiB |
|    0     12293      C   ...ineDet_2020-unknown-Default/example-app   261MiB |
|    0     12293      C   ...ineDet_2020-unknown-Default/example-app   265MiB |
|    0     12293      C   ...ineDet_2020-unknown-Default/example-app   265MiB |
|    0     12293      C   ...ineDet_2020-unknown-Default/example-app   265MiB |
|    0     12293      C   ...ineDet_2020-unknown-Default/example-app   269MiB |

This output refreshes continuously, which makes it easy to watch GPU memory.
Brother Shui first had me take the loop

for k in range(30):
    x = self.vgg[k](x)  ###############

and run it outside the model's forward while monitoring memory — that is, placing it here:

def build_refinedet(phase, size=320, num_classes=21):
    if phase != "test" and phase != "train":
        print("ERROR: Phase: " + phase + " not recognized")
        return
    if size != 320 and size != 512:
        print("ERROR: You specified size " + repr(size) + ". However, " +
              "currently only RefineDet320 and RefineDet512 is supported!")
        return
    base_ = vgg(base[str(size)], 3)

I did, and to my surprise memory was completely normal!!!!!! That felt like a glimmer of hope!!!
But I still didn't know the cause!! I kept trying things for a long time with no luck.
A lot more fiddling here...
Then I looked at the code Brother Shui used for his experiment:

if __name__ == '__main__':
    # load net
    num_classes = len(labelmap) + 1                      # +1 for background
    net = build_refinedet('test', int(args.input_size), num_classes)  # initialize SSD
    net.load_state_dict(torch.load(args.trained_model))
    net.eval()

    xx = torch.randn(1, 3, 320, 320)
    xx = net(xx)

He said he could not reproduce my problem this way. I tried it, and indeed it was fine — no memory issue. So I went ahead, generated the .pt file this way and tested it in libtorch: no problem there either!!!
But this merely sidestepped the problem rather than solving it!!!
Later I kept digging — I just had a feeling the problem lived inside the function at the bottom:

test_net(args.save_folder, net, args.cuda, dataset,
         BaseTransform(net.size, dataset_mean), args.top_k, int(args.input_size),
         thresh=args.confidence_threshold)

But when I stepped in, all it does beforehand is preprocess the input. So I simply replaced the input with
x = torch.randn(1, 3, 320, 320), skipping the preprocessing — and the problem was still there!! Sigh.
Then I kept experimenting, somewhat aimlessly:

num_classes = len(labelmap) + 1                      # +1 for background
net = build_refinedet('test', int(args.input_size), num_classes)  # initialize SSD
net.load_state_dict(torch.load(args.trained_model))
net.eval()

# xx = torch.randn(1,3,320,320)
# for k in range(30):
#     xx = net.vgg[k](xx)
#     aa = 0

print('Finished loading model!')
# load data
dataset = VOCDetection(args.voc_root, [('2007', set_type)],
                       BaseTransform(int(args.input_size), dataset_mean),
                       VOCAnnotationTransform())
if args.cuda:
    net = net.cuda()
    cudnn.benchmark = True

xx = torch.randn(1,3,320,320)
for k in range(30):
    xx = net.vgg[k](xx)
    aa = 0

Note the test snippet here:

xx = torch.randn(1, 3, 320, 320)
for k in range(30):
    xx = net.vgg[k](xx)
    aa = 0

There is an identical (commented-out) copy higher up. With the snippet in the upper position, memory was normal; moving it to the lower position reproduced the memory problem!!!
That turned my attention to

    cudnn.benchmark = True

What on earth is this!!? A quick search says: in most cases, setting this flag lets cuDNN's built-in auto-tuner search for the most efficient algorithms for the current configuration, improving runtime efficiency.
If the network's input dimensions and types don't vary much, setting torch.backends.cudnn.benchmark = True can improve performance;
if the input changes every iteration, cuDNN re-runs the search each time, which actually hurts performance.
So it tunes algorithms for the current hardware — a bit like TVM! And that first-forward benchmarking is exactly what allocates the huge temporary workspace.
Never mind the details — set it to False and see! Sure enough!!!
Problem gone!!!!
So cudnn.benchmark was the culprit all along!! Brother Shui says this flag is usually turned on for training and off for testing.
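In code, the takeaway boils down to this (the flag is torch.backends.cudnn.benchmark; the bare cudnn.benchmark in the eval script is the same thing after from torch.backends import cudnn):

```python
import torch

# Training: input shapes are fixed, so let the cuDNN auto-tuner search
# for the fastest convolution algorithms.
torch.backends.cudnn.benchmark = True

# Testing / exporting to .pt: turn it off to avoid the first-forward
# workspace spike (5000+ MB in this case).
torch.backends.cudnn.benchmark = False
```

The default is False, so simply not enabling it in the eval script would also have avoided the spike.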

The problem is solved, and the hard-won lesson is worth keeping: experiment! Keep experimenting!! But experimenting in the right direction gets you twice the result for half the effort!

Original post (in Chinese): https://www.cnblogs.com/yanghailin/p/12965695.html