Engineering Practice: Debugging Common Problems When Training the LFFD Model

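1. The MXNet version of LFFD requires CUDA 10.1 and cuDNN

If either requirement is not met, the following errors appear.

The installed CUDA version is too old, or CUDA is not installed at all: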

Traceback (most recent call last):
  File "configuration_10_320_20L_5scales_v2.py", line 17, in <module>
    import mxnet
  File "/usr/local/lib/python3.6/dist-packages/mxnet/__init__.py", line 24, in <module>
    from .context import Context, current_context, cpu, gpu, cpu_pinned
  File "/usr/local/lib/python3.6/dist-packages/mxnet/context.py", line 24, in <module>
    from .base import classproperty, with_metaclass, _MXClassPropertyMetaClass
  File "/usr/local/lib/python3.6/dist-packages/mxnet/base.py", line 213, in <module>
    _LIB = _load_lib()
  File "/usr/local/lib/python3.6/dist-packages/mxnet/base.py", line 204, in _load_lib
    lib = ctypes.CDLL(lib_path[0], ctypes.RTLD_LOCAL)
  File "/usr/lib/python3.6/ctypes/__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcudart.so.10.1: cannot open shared object file: No such file or directory

cuDNN is not installed:

terminate called after throwing an instance of 'dmlc::Error'
  what():  [20:48:36] ../include/mshadow/./stream_gpu-inl.h:173: Check failed: err == CUDNN_STATUS_SUCCESS (4 vs. 0) : CUDNN_STATUS_INTERNAL_ERROR


Aborted (core dumped)
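Before launching training, a quick check of whether the dynamic loader can find the required libraries can save a failed run. The snippet below is a minimal sketch, not part of LFFD. The library name libcudart.so.10.1 is taken from the traceback above, while libcudnn.so.7 is an assumption (cuDNN 7.x is commonly paired with CUDA 10.1) and should be adjusted to the installed version. The helper names are hypothetical.

```python
import ctypes

# libcudart.so.10.1 comes from the OSError above.
# libcudnn.so.7 is an assumption; change it to match your cuDNN release.
REQUIRED_LIBRARIES = ["libcudart.so.10.1", "libcudnn.so.7"]


def can_load_shared_library(name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        return False


def report_missing(libraries=REQUIRED_LIBRARIES):
    """Return the subset of `libraries` that could not be loaded."""
    return [name for name in libraries if not can_load_shared_library(name)]


if __name__ == "__main__":
    missing = report_missing()
    if missing:
        print("Cannot load:", ", ".join(missing))
    else:
        print("All required libraries can be loaded.")
```

A library that is installed but outside the loader search path (for example, not listed in LD_LIBRARY_PATH) is also reported as missing, which is the same condition that makes `import mxnet` fail.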

2. Use the right Python and install the right MXNet version

If CUDA and cuDNN are installed correctly but the following error still appears:

terminate called after throwing an instance of 'dmlc::Error'
  what():  [20:48:36] ../include/mshadow/./stream_gpu-inl.h:173: Check failed: err == CUDNN_STATUS_SUCCESS (4 vs. 0) : CUDNN_STATUS_INTERNAL_ERROR


Aborted (core dumped)

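There are two possible causes: the installed MXNet version is wrong, or the script is picking up a different MXNet build from a hard-coded path. First check that the MXNet version is correct, then comment out the following lines in configuration_10_560_25L_8scales_v1.py: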

# add mxnet python path to path env if need
mxnet_python_path = '/home/heyonghao/libs/incubator-mxnet/python'
sys.path.append(mxnet_python_path)

The default local Python installation, with MXNet installed into it, is all we need.
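To confirm which MXNet the interpreter would actually import, the module can be located without importing it. This is a minimal sketch using only the standard library. The helper names are hypothetical, and the keyword "incubator-mxnet" is simply taken from the path in the snippet above.

```python
import importlib.util
import sys


def locate_module(name):
    """Return the file a top-level module would be loaded from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


def suspicious_path_entries(keyword="incubator-mxnet"):
    """Return sys.path entries containing `keyword`, e.g. a leftover source checkout."""
    return [p for p in sys.path if isinstance(p, str) and keyword in p]


if __name__ == "__main__":
    print("Python executable:", sys.executable)
    print("mxnet resolves to:", locate_module("mxnet"))
    print("Suspicious sys.path entries:", suspicious_path_entries())
```

If `mxnet` resolves to a location outside the default dist-packages directory, or the list of suspicious entries is not empty, the script is probably still using a foreign MXNet path.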

3. Install OpenCV correctly

If the following error appears:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "_ctypes/callbacks.c", line 234, in 'calling callback function'
  File "/root/work/mxnet/python/mxnet/operator.py", line 1052, in backward_entry
    print('Error in CustomOp.backward: %s' % traceback.format_exc())
UnicodeEncodeError: 'ascii' codec can't encode characters in position 369-376: ordinal not in range(128)

This indicates that OpenCV is not installed correctly. Uninstall the old version, then install the following one:

pip install opencv-python==3.4.5.20
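After reinstalling, it can help to verify which OpenCV build Python actually sees. The following is a minimal sketch with a hypothetical helper name. It assumes that `cv2.__version__` reports the library version, which may omit the final packaging component of the pip version string, so compare against the prefix 3.4.5 rather than the full 3.4.5.20.

```python
def installed_opencv_version():
    """Return the version string reported by cv2, or None if cv2 cannot be imported."""
    try:
        import cv2
    except ImportError:
        return None
    return cv2.__version__


if __name__ == "__main__":
    version = installed_opencv_version()
    if version is None:
        print("OpenCV (cv2) is not importable in this Python environment.")
    elif not version.startswith("3.4.5"):
        print("Unexpected OpenCV version:", version)
    else:
        print("OpenCV version looks right:", version)
```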

4. Set batch_size correctly

If you hit the following error, batch_size is most likely set too large:

MXNetError: cudaMalloc retry failed: out of memory

Reducing it, for example to batch_size=16, resolves the problem.
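When the right value is not known in advance, one option is to halve the batch size until a short trial run no longer fails. The sketch below illustrates that idea only. `run_training` is a hypothetical callable standing in for whatever starts a training run with a given batch size, and detecting the error by its message text is an assumption based on the log line above. Retrying inside one process after a real GPU out-of-memory error is not always reliable, so starting a fresh process per attempt may be safer.

```python
def find_fitting_batch_size(run_training, start=64, minimum=1):
    """Halve the batch size until run_training(batch_size) stops raising an OOM error.

    `run_training` is a hypothetical callable; it is expected to raise an
    exception whose message contains "out of memory" when the batch is too large.
    """
    batch_size = start
    while batch_size >= minimum:
        try:
            run_training(batch_size)
            return batch_size
        except Exception as err:
            # Re-raise anything that is not an out-of-memory failure.
            if "out of memory" not in str(err):
                raise
            batch_size //= 2
    raise RuntimeError("no batch size >= %d fits in GPU memory" % minimum)


if __name__ == "__main__":
    # Stand-in for a real training call: pretend anything above 16 exhausts memory.
    def fake_training(batch_size):
        if batch_size > 16:
            raise RuntimeError("cudaMalloc retry failed: out of memory")

    print(find_fitting_batch_size(fake_training, start=64))
```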
