[ pytorch ] --- Summary of errors and fixes

Original

小小的行者

2020-03-08 04:54

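GPU memory

1. Out of GPU memory: GPU memory usage keeps growing as the program runs

This usually happens because the program holds many intermediate variables in GPU memory and never releases them (for example with del).
An example: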

feature_loader = []
for i,(data, target) in enumerate(train_loader):
    # data, target = data.to(device), target.to(device)
    data, target = Variable(data.cuda()), Variable(target.cuda())

    feature, pred = model(data)
    
    loss = CELOSS(pred, target)
    optimizer4nn.zero_grad()
    loss.backward()
    optimizer4nn.step()

    feature_loader.append(feature)   # <--- here

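Here feature is still a variable.cuda(), that is, a tensor living on the GPU. On each iteration the name feature is rebound to a new tensor, so that part alone adds no GPU memory. feature_loader, however, keeps a reference to every feature it collects, so GPU memory usage grows a little with each iteration.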

The fix is to move feature from GPU memory (.cuda()) to host memory (.cpu()) when collecting it, so that feature_loader no longer occupies GPU memory.

feature_loader.append(feature.data.cpu())   # <--- add .data.cpu()
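The same idea can be sketched end to end on current PyTorch, where feature.detach().cpu() is the usual spelling of feature.data.cpu(). This is only a minimal sketch: TinyNet and the random batches are hypothetical stand-ins for the article's model and train_loader, and the code falls back to the CPU when no GPU is present.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# Hypothetical stand-in for the article's model: returns (feature, pred).
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 4)
        self.head = nn.Linear(4, 3)

    def forward(self, x):
        feature = self.backbone(x)
        return feature, self.head(feature)


model = TinyNet().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Hypothetical stand-in for train_loader: 5 random batches of 16 samples.
batches = [(torch.randn(16, 8), torch.randint(0, 3, (16,))) for _ in range(5)]

feature_loader = []
for data, target in batches:
    data, target = data.to(device), target.to(device)

    feature, pred = model(data)

    loss = criterion(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # detach() drops the autograd history; cpu() copies the values to host
    # memory, so the list holds no reference to GPU storage.
    feature_loader.append(feature.detach().cpu())
```

Every tensor kept in feature_loader then lives in host memory, so the list can grow without touching GPU memory.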

Uncategorized

1. Parameters fail to load because the checkpoint keys carry an extra module. prefix. Cause: the model was trained with `nn.DataParallel(model_structure,device_ids=gpu_ids)`

RuntimeError: Error(s) in loading state_dict for ftnet_EncoderDecoder:
	
Missing key(s) in state_dict: 
"BackBone.model.conv1.weight", "BackBone.model.bn1.weight", "BackBone.model.bn1.bias", "BackBone.model.bn1.running_mean", "BackBone.model.bn1.running_var", .....

Unexpected key(s) in state_dict: 
"module.BackBone.model.conv1.weight", "module.BackBone.model.bn1.weight", "module.BackBone.model.bn1.bias", "module.BackBone.model.bn1.running_mean", "module.BackBone.model.bn1.running_var", "module.BackBone.model.l  "  ........

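Fix: wrap the model in nn.DataParallel before loading, so that its parameter names carry the same module. prefix as the checkpoint: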

model_structure = net()
model_structure = nn.DataParallel(model_structure,device_ids=gpu_ids)
model = load_network(model_structure)
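If the model should stay unwrapped (for example for single-GPU or CPU inference), another common approach is to strip the module. prefix from the checkpoint keys before loading. The sketch below is an assumption-laden illustration, not the author's method: strip_module_prefix is a hypothetical helper, and a small nn.Linear stands in for the real network.

```python
from collections import OrderedDict

import torch
import torch.nn as nn


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading `prefix` removed from each key."""
    return OrderedDict(
        (key[len(prefix):] if key.startswith(prefix) else key, value)
        for key, value in state_dict.items()
    )


# Hypothetical stand-in for a checkpoint saved from a DataParallel-wrapped model.
trained = nn.Linear(4, 2)
saved_state = nn.DataParallel(trained).state_dict()  # keys: module.weight, module.bias

# Load into a plain (unwrapped) model after stripping the prefix.
plain = nn.Linear(4, 2)
cleaned_state = strip_module_prefix(saved_state)
plain.load_state_dict(cleaned_state)
```

Saving model.module.state_dict() instead of model.state_dict() during training avoids the prefix in the first place.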

2. `There is an imbalance between your GPUs. You may want to exclude GPU 0 which has less than 75% of the memory or cores of GPU 1. You can do so by setting the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES environment variable.`

Fix:
As mentioned in the link below, you have to call model.cuda() before passing the model to nn.DataParallel.

net = nn.DataParallel(model.cuda(), device_ids=[0,1])

URL that solved this problem: https://stackoverflow.com/questions/55343893/how-to-do-parallel-processing-in-pytorch
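The warning text itself suggests another route: restrict device_ids to GPUs of comparable size. The following is a rough sketch of that idea under stated assumptions. pick_balanced_gpus is a hypothetical helper, it compares total memory only (the warning also mentions cores), and the 0.75 threshold is taken from the "75%" in the message.

```python
import torch


def pick_balanced_gpus(total_memory, threshold=0.75):
    """Return indices of devices whose memory is at least `threshold` of the largest."""
    if not total_memory:
        return []
    largest = max(total_memory)
    return [i for i, mem in enumerate(total_memory) if mem >= threshold * largest]


# Query the visible GPUs; the list is empty on a machine without CUDA.
memories = [
    torch.cuda.get_device_properties(i).total_memory
    for i in range(torch.cuda.device_count())
]
device_ids = pick_balanced_gpus(memories)
```

The resulting device_ids can be passed to nn.DataParallel(model.cuda(), device_ids=device_ids) when it is non-empty; setting CUDA_VISIBLE_DEVICES before the program starts has a similar effect.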
