1. Training errors
When training with BCE loss, the following errors occurred:
Error | batch_size | epoch | hidden_size | lr_D | lr_DZ | lr_Eref | lr_model | z_dim |
'ViewBackward' returned nan values | 8 | 50 | 128 | 5e-05 | 0.001 | 0.001 | 0.001 | 16 |
MvBackward | 16 | 40 | 256 | 0.01 | 0.001 | 0.001 | 5e-05 | 16 |
AddmmBackward | 32 | 40 | 256 | 0.01 | 5e-05 | 5e-05 | 5e-05 | 128 |
ViewBackward | 32 | 75 | 128 | 5e-05 | 0.01 | 0.001 | 0.0001 | 64 |
ViewBackward | 8 | 25 | 64 | 0.001 | 5e-05 | 0.01 | 0.0001 | 32 |
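However, no clear pattern is visible in these hyperparameters.
The failures are also rare: out of 50 models, only 6 produced NaN values. Can this problem simply be ignored?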
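One quick check on the rows in the table above is the largest learning rate in each failing run. The sketch below only re-reads the five listed rows; the names `COLUMNS`, `FAILED_RUNS`, and `max_learning_rate` are made up for illustration. It cannot establish a pattern on its own, since the hyperparameters of the successful runs are not listed here and would be needed for comparison.

```python
# Column names and the five failing configurations, copied from the table above.
COLUMNS = ["batch_size", "epoch", "hidden_size",
           "lr_D", "lr_DZ", "lr_Eref", "lr_model", "z_dim"]
FAILED_RUNS = [
    ("ViewBackward",  8, 50, 128, 5e-05, 0.001, 0.001, 0.001,  16),
    ("MvBackward",    16, 40, 256, 0.01,  0.001, 0.001, 5e-05,  16),
    ("AddmmBackward", 32, 40, 256, 0.01,  5e-05, 5e-05, 5e-05,  128),
    ("ViewBackward",  32, 75, 128, 5e-05, 0.01,  0.001, 0.0001, 64),
    ("ViewBackward",  8, 25, 64,  0.001, 5e-05, 0.01,  0.0001, 32),
]
LR_NAMES = ("lr_D", "lr_DZ", "lr_Eref", "lr_model")


def max_learning_rate(row):
    """Return the largest of the four learning rates in one table row."""
    config = dict(zip(COLUMNS, row[1:]))  # row[0] is the error name
    return max(config[name] for name in LR_NAMES)


max_lrs = [max_learning_rate(row) for row in FAILED_RUNS]
print(max_lrs)
```

Running the same function over the successful configurations would show whether large learning rates are actually more common among the failures.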
2. Solution
https://github.com/pytorch/pytorch/issues/51196 says:
This error is only here in anomaly mode to help you find where nans appeared in the backward pass. This is not related to a bug in PyTorch but just that your current code generate nan values.
You can remove this error by just disabling anomaly detection.
Comment out:
torch.autograd.set_detect_anomaly(True)
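What that switch changes can be seen in a toy case unrelated to the model in these notes. In the sketch below, `grad_of_masked_sqrt` is a made-up helper: the backward pass of `sqrt` at 0 computes 0/0, so the gradient is NaN either way, and anomaly mode only decides whether that raises an error.

```python
import torch


def grad_of_masked_sqrt(detect_anomaly):
    """Backpropagate through sqrt(0) * 0 and return the gradient of x."""
    x = torch.tensor([0.0], requires_grad=True)
    # set_detect_anomaly also works as a context manager, so the setting
    # is restored when the block exits.
    with torch.autograd.set_detect_anomaly(detect_anomaly):
        y = (torch.sqrt(x) * 0.0).sum()
        y.backward()
    return x.grad


# Anomaly mode off: no error, but the gradient is silently NaN.
silent_grad = grad_of_masked_sqrt(False)
print(silent_grad)

# Anomaly mode on: the same NaN is reported as a RuntimeError that
# names the backward function that produced it.
try:
    grad_of_masked_sqrt(True)
except RuntimeError as err:
    print("anomaly mode raised:", err)
```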
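But this is not a real fix, is it? Disabling anomaly detection only silences the error; the NaN values are still produced in the backward pass.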
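A more direct approach is to keep the NaN from reaching the weights in the first place. The notes do not show how the loss is computed, so the following is only a sketch under assumptions: the model outputs raw logits, `nn.BCEWithLogitsLoss` (which the PyTorch docs describe as more numerically stable than a sigmoid followed by `nn.BCELoss`) replaces the original loss, and gradients are clipped before each update. The toy model, `train_step`, and `max_grad_norm` are hypothetical names, not part of the original code.

```python
import torch
import torch.nn as nn


def train_step(model, optimizer, x, y, max_grad_norm=1.0):
    """Run one guarded update; return the loss value, or None if skipped."""
    criterion = nn.BCEWithLogitsLoss()  # expects raw logits, not sigmoid outputs
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    if not torch.isfinite(loss):
        # Skip the update instead of pushing NaN/inf into the weights.
        return None
    loss.backward()
    # Rescale gradients so that their total norm is at most max_grad_norm.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()


# Toy stand-in for the real model and data (assumed shapes).
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(8, 16)
y = torch.randint(0, 2, (8, 1)).float()

loss_value = train_step(model, optimizer, x, y)
print(loss_value)
```

Counting how often `train_step` returns `None` in a run gives a rough signal of whether that configuration is diverging, which is more informative than switching the error off.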