GPU memory
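1. GPU memory overflow: GPU memory usage keeps growing as the program runs
This usually happens because the program keeps many intermediate variables alive in GPU memory, and that memory is never released (for example with del).
For example: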
feature_loader = []
for i, (data, target) in enumerate(train_loader):
    # data, target = data.to(device), target.to(device)
    data, target = Variable(data.cuda()), Variable(target.cuda())
    feature, pred = model(data)
    loss = CELOSS(pred, target)
    optimizer4nn.zero_grad()
    loss.backward()
    optimizer4nn.step()
    feature_loader.append(feature)  # <--- here
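Here feature is still a Variable that lives on the GPU. As the loop iterates, the name feature is rebound to a new result each time, so that alone does not add GPU memory. feature_loader, however, keeps collecting every new feature, so GPU memory usage grows bit by bit.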
The fix is to move feature from GPU memory (.cuda()) to host memory (.cpu()) as it is collected, so that feature_loader no longer occupies GPU memory.
feature_loader.append(feature.data.cpu())  # <--- add .data.cpu()
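The same idea can be sketched with the newer tensor API, where feature.detach().cpu() plays the role of feature.data.cpu(). This is only a minimal sketch, not the original training script: TinyNet, the random batches, and the layer sizes are hypothetical stand-ins for model and train_loader, and the code falls back to the CPU when no GPU is present.

```python
import torch
import torch.nn as nn

# Fall back to the CPU so the sketch also runs on a machine without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class TinyNet(nn.Module):
    """Hypothetical stand-in for the model above: returns (feature, pred)."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 4)
        self.head = nn.Linear(4, 3)

    def forward(self, x):
        feature = self.backbone(x)
        return feature, self.head(feature)


model = TinyNet().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Hypothetical random batches standing in for train_loader.
batches = [(torch.randn(5, 8), torch.randint(0, 3, (5,))) for _ in range(3)]

feature_loader = []
for data, target in batches:
    data, target = data.to(device), target.to(device)
    feature, pred = model(data)
    loss = criterion(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # detach() cuts the tensor off from the autograd graph;
    # cpu() copies its values into host memory.
    feature_loader.append(feature.detach().cpu())
```

After the loop, every entry in feature_loader is a plain CPU tensor that no longer requires gradients, so the list itself holds no GPU memory.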
Uncategorized
1. Parameters fail to load because the saved keys carry an extra module. prefix. Cause: the model was wrapped in nn.DataParallel(model_structure,device_ids=gpu_ids) during training.
RuntimeError: Error(s) in loading state_dict for ftnet_EncoderDecoder:
Missing key(s) in state_dict:
"BackBone.model.conv1.weight", "BackBone.model.bn1.weight", "BackBone.model.bn1.bias", "BackBone.model.bn1.running_mean", "BackBone.model.bn1.running_var", .....
Unexpected key(s) in state_dict:
"module.BackBone.model.conv1.weight", "module.BackBone.model.bn1.weight", "module.BackBone.model.bn1.bias", "module.BackBone.model.bn1.running_mean", "module.BackBone.model.bn1.running_var", "module.BackBone.model.l " ........
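Solution: wrap the model in nn.DataParallel again before loading, so that its keys also carry the module. prefix: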
model_structure = net()
model_structure = nn.DataParallel(model_structure,device_ids=gpu_ids)
model = load_network(model_structure)
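An alternative that is sometimes used, when the model should be loaded without nn.DataParallel, is to strip the module. prefix from the saved keys before loading. The sketch below is not part of the original fix: strip_module_prefix is a hypothetical helper, and the dictionary uses placeholder integers where a real checkpoint would hold tensors.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        if key.startswith(prefix):
            key = key[len(prefix):]
        cleaned[key] = value
    return cleaned


# Placeholder values; a real checkpoint maps these keys to tensors.
saved = OrderedDict([
    ("module.BackBone.model.conv1.weight", 0),
    ("module.BackBone.model.bn1.weight", 1),
])
cleaned = strip_module_prefix(saved)
print(list(cleaned.keys()))
```

Assuming the checkpoint file stores a plain state_dict, the cleaned dictionary could then be passed to load_state_dict on the unwrapped model. Saving model.module.state_dict() at training time avoids the prefix in the first place.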
2. There is an imbalance between your GPUs. You may want to exclude GPU 0 which has less than 75% of the memory or cores of GPU 1. You can do so by setting the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES environment variable.
Solution:
As mentioned in this link, you have to do model.cuda() before passing it to nn.DataParallel.
net = nn.DataParallel(model.cuda(), device_ids=[0,1])
URL that solved this problem: https://stackoverflow.com/questions/55343893/how-to-do-parallel-processing-in-pytorch