While training a PyTorch model, I hit this error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2702, 1]], which is output 0 of TanhBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
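The tensor named in the message is the output of a tanh, and tanh's backward pass reuses its own output (d tanh(x)/dx = 1 - tanh²(x)), so mutating that output in place invalidates the saved tensor. A minimal sketch (hypothetical shapes, not the original model) that triggers the same class of error:

```python
import torch

x = torch.randn(4, 1, requires_grad=True)
y = torch.tanh(x)   # output is saved for TanhBackward0, at version 0
y.add_(1.0)         # inplace op bumps y's version counter to 1

try:
    y.sum().backward()
except RuntimeError as e:
    # "one of the variables needed for gradient computation has been
    # modified by an inplace operation ..."
    print(e)
```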
I immediately enabled
torch.autograd.set_detect_anomaly(True)
and it pointed to a specific location:
File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 1442, in relu
result = torch.relu(input)
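For reference, anomaly detection can also be enabled as a context manager around the forward and backward pass; when the backward fails, PyTorch then prints a second traceback for the forward op that produced the failing gradient. A small sketch (same hypothetical tanh example as above):

```python
import torch

with torch.autograd.set_detect_anomaly(True):
    x = torch.randn(4, 1, requires_grad=True)
    y = torch.tanh(x)
    y.add_(1.0)  # the inplace culprit
    try:
        y.sum().backward()
    except RuntimeError as e:
        # the extra warning traceback points at torch.tanh, i.e. the
        # forward op whose saved output was modified, not at add_()
        print(e)
```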
Deleting that line made training work again, so the problem seemed to be right there! But checking the PyTorch docs, torch.relu is not an inplace operation at all.
At that point the only option left was the debugger: step through every operation leading up to the error. It turned out the loss function was indeed modifying the model's output in place. Fixing that made everything work.
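The original loss code isn't shown, but the fix follows a general pattern: replace the inplace write on the model output with an out-of-place operation that builds a new tensor. A hypothetical before/after with an MSE-style loss:

```python
import torch

def buggy_loss(output, target):
    # inplace: mutates the tensor autograd saved for the backward pass
    output.sub_(target)
    return output.pow(2).mean()

def fixed_loss(output, target):
    # out-of-place: (output - target) is a new tensor, output stays at version 0
    return (output - target).pow(2).mean()

x = torch.randn(8, 1, requires_grad=True)
target = torch.zeros(8, 1)
out = torch.tanh(x)          # backward needs this output intact
fixed_loss(out, target).backward()   # fine; buggy_loss would raise on backward
```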
So anomaly detection is still far from smart: it reports the forward op whose saved tensor was clobbered (here the relu), not the inplace write itself, so its location is misleading and single-step debugging is still needed in practice.
Q. E. D.