While training a PyTorch model, I hit this error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2702, 1]], which is output 0 of TanhBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
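The tensor named in the message is the output of a tanh, and tanh's backward pass reuses its own output (d tanh(x)/dx = 1 - tanh²(x)), so mutating that output in place invalidates the saved tensor. A minimal sketch (hypothetical shapes, not the original model) that triggers the same class of error:

```python
import torch

x = torch.randn(4, 1, requires_grad=True)
y = torch.tanh(x)   # output is saved for TanhBackward0, at version 0
y.add_(1.0)         # inplace op bumps y's version counter to 1

try:
    y.sum().backward()
except RuntimeError as e:
    # "one of the variables needed for gradient computation has been
    # modified by an inplace operation ..."
    print(e)
```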
I immediately enabled
torch.autograd.set_detect_anomaly(True)
and it pointed to a specific location:
File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 1442, in relu
result = torch.relu(input)
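For reference, anomaly detection can also be enabled as a context manager around the forward and backward pass; when the backward fails, PyTorch then prints a second traceback for the forward op that produced the failing gradient. A small sketch (same hypothetical tanh example as above):

```python
import torch

with torch.autograd.set_detect_anomaly(True):
    x = torch.randn(4, 1, requires_grad=True)
    y = torch.tanh(x)
    y.add_(1.0)  # the inplace culprit
    try:
        y.sum().backward()
    except RuntimeError as e:
        # the extra warning traceback points at torch.tanh, i.e. the
        # forward op whose saved output was modified, not at add_()
        print(e)
```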
Deleting that line made training work again, so the problem seemed to be right there! But checking the PyTorch docs, torch.relu is not an inplace operation at all.
At that point the only option left was the debugger: step through every operation leading up to the error. It turned out the loss function was indeed modifying the model's output in place. Fixing that made everything work.
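The original loss code isn't shown, but the fix follows a general pattern: replace the inplace write on the model output with an out-of-place operation that builds a new tensor. A hypothetical before/after with an MSE-style loss:

```python
import torch

def buggy_loss(output, target):
    # inplace: mutates the tensor autograd saved for the backward pass
    output.sub_(target)
    return output.pow(2).mean()

def fixed_loss(output, target):
    # out-of-place: (output - target) is a new tensor, output stays at version 0
    return (output - target).pow(2).mean()

x = torch.randn(8, 1, requires_grad=True)
target = torch.zeros(8, 1)
out = torch.tanh(x)          # backward needs this output intact
fixed_loss(out, target).backward()   # fine; buggy_loss would raise on backward
```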
So anomaly detection is still far from smart: it reports the forward op whose saved tensor was clobbered (here the relu), not the inplace write itself, so its location is misleading and single-step debugging is still needed in practice.
Q. E. D.