pytorch/torch/distributed/algorithms
Yi Wang a419a3e25d Add assertion on any NaN error on the error feedback (#49374)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49374

After the assertion is added, the NaN error seen in certain training runs disappears.

It seems that the real error is caused by an underlying illegal memory access; the assertion is only a temporary workaround.
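
The assertion lives in the PowerSGD hook under ddp_comm_hooks. As a minimal sketch (not the hook's actual code; the tensor and bucket-index names here are assumed for illustration), the check amounts to:

```python
import torch

def assert_no_nan_in_error_feedback(error_feedback: torch.Tensor, bucket_index: int) -> None:
    # Hypothetical helper: fail fast if the locally accumulated error-feedback
    # tensor for a gradient bucket already contains NaNs, rather than letting
    # them propagate into the compressed gradients.
    assert not torch.any(torch.isnan(error_feedback)), (
        f"NaN found in the error feedback of bucket {bucket_index}"
    )
```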

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118572471

Test Plan:
Real run on Ads 10X model: scripts/wayi/mast_prof_gradient_compression.sh POWER_SGD 8

To reproduce the error, just comment out the assertion.
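
The MAST script above is internal. As a rough public-API sketch of how the PowerSGD hook (and thus the error-feedback path this assertion guards) gets exercised, something along these lines should work, assuming the process group is already initialized and treating the model, sizes, and local_rank as placeholders (parameter names follow the current PowerSGDState API, which may differ slightly from the version at the time of this commit):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD
from torch.nn.parallel import DistributedDataParallel as DDP

# Placeholders: in a real run these come from the launcher environment.
local_rank = 0
model = DDP(nn.Linear(1024, 1024).cuda(local_rank), device_ids=[local_rank])

# Error feedback must be enabled for the NaN assertion to be on the hot path.
state = powerSGD.PowerSGDState(
    process_group=None,            # use the default process group
    matrix_approximation_rank=1,   # rank-1 low-rank compression
    use_error_feedback=True,       # keep per-bucket error-feedback tensors
)
model.register_comm_hook(state, powerSGD.powerSGD_hook)

# Any subsequent backward pass now routes gradients through the hook.
loss = model(torch.randn(32, 1024, device=f"cuda:{local_rank}")).sum()
loss.backward()
```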

Reviewed By: rohan-varma

Differential Revision: D25548299

fbshipit-source-id: 039af7d94a27e0f47ef647c6163fd0e5064951d5
2020-12-14 20:15:39 -08:00
ddp_comm_hooks Add assertion on any NaN error on the error feedback (#49374) 2020-12-14 20:15:39 -08:00
__init__.py [Gradient Compression] Add unit tests that test default Python comm hook implementations (#47158) 2020-11-06 00:28:09 -08:00