Notes on PyTorch multi-GPU problems

Shared memory problem: unable to open shared memory object </torch_> in read-write mode

I'm doing NAS (neural architecture search), and the network is too large to fit on a single GPU, so I tried DDP to do multi-GPU training.

(py36torch15) xx@cluster:~/wang/FasterCrowdCountingNAS/FBNetBranch$ python main.py 
/home//anaconda3/envs/py36torch15/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:23: UserWarning: You requested multiple GPUs but did not specify a backend, e.g. Trainer(distributed_backend=dp) (or ddp, ddp2). Setting distributed_backend=ddp for you.
  warnings.warn(*args, **kwargs)
GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0,1,2]
Traceback (most recent call last):
  File "main.py", line 29, in <module>
  File "/home//anaconda3/envs/py36torch15/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 844, in fit
  File "/home//anaconda3/envs/py36torch15/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
  File "/home//anaconda3/envs/py36torch15/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 149, in start_processes
  File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/process.py", line 105, in start
  File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
  File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
  File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
  File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
  File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
  File "/home//anaconda3/envs/py36torch15/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 333, in reduce_storage
RuntimeError: unable to open shared memory object </torch_12222_563474802> in read-write mode

At first I thought I had written something wrong somewhere... Stepping into the spawn code, it turned out to be an open-files limit problem.

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 514771
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 514771
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

An open-files limit of 1024 is far too small. Running ulimit -SHn 51200 (raising both the soft and hard limits) solved the problem.
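The same fix can also be applied from inside Python before any workers are started, using the standard-library resource module. A minimal sketch (raising the soft limit up to the existing hard limit, which an unprivileged process is always allowed to do):

```python
import resource

# Query the current open-file limits, equivalent to `ulimit -n` / `ulimit -Hn`
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Raise the soft limit to the hard limit; each tensor shared between
# worker processes can consume a file descriptor, so 1024 runs out fast
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```

Note that this only affects the current process and its children, so it has to run before the DataLoader workers or DDP processes are spawned.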

Multiprocessing problem: The "freeze_support()" line can be omitted if the program is not going to be frozen to produce an executable.

  File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/spawn.py", line 143, in get_preparation_data
    _check_not_importing_main()
  File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/spawn.py", line 136, in _check_not_importing_main
    is not going to be frozen to produce an executable.''')
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

Using multiprocessing, this error popped up. It is really a matter of how Python multiprocessing starts processes.

Python supports several process start methods. On Unix the default is fork: the child process is a copy of the parent, it can run a different function from the main program, and it inherits the parent's data, so data flows conveniently from parent to child. Windows does not support fork and uses spawn instead. Spawn also creates a new process, but that process re-executes the top-level code of the main module, just as the parent did, before running its target function. This causes a problem: without any guard, each new process would keep copying itself, creating new processes indefinitely. Python's designers anticipated this, so if you try to create a new process while a spawned process is still in its bootstrapping phase, it raises an error and exits. How do you tell the main process apart from a spawned child? The usual way is to check the __name__ attribute.
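As a minimal sketch of the idiom (the worker/queue names are mine, not from the traceback): forcing the spawn start method on Linux reproduces the Windows behaviour, and the __name__ guard keeps the re-imported child module from spawning again:

```python
import multiprocessing as mp

def worker(q):
    # Runs only in the child. Under spawn, the child first re-imports
    # this module, which is why unguarded top-level Process() calls
    # would recurse and trigger the RuntimeError above.
    q.put(42)

if __name__ == '__main__':
    ctx = mp.get_context('spawn')  # the start method torch.multiprocessing.spawn also uses
    q = ctx.Queue()
    p = ctx.Process(target=worker, args=(q,))
    p.start()
    print(q.get())
    p.join()
```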
Solution:

import multiprocessing as mp

def worker():  # the original snippet passed an undefined `v` as the target
    pass       # real work goes here

if __name__ == '__main__':
    p = mp.Process(target=worker)
    p.start()
    p.join()