pytorch/torch/distributed/algorithms
Yi Wang a419a3e25d Add assertion on any NaN error on the error feedback (#49374)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49374

After the assertion is added, the NaN error seen in certain training runs disappears.

It seems that the real error is caused by an underlying illegal memory access; the assertion is only a temporary workaround.
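
The assertion lives in the PowerSGD hook under ddp_comm_hooks. As a minimal sketch (not the hook's actual code; the tensor and bucket-index names here are assumed for illustration), the check amounts to:

```python
import torch

def assert_no_nan_in_error_feedback(error_feedback: torch.Tensor, bucket_index: int) -> None:
    # Hypothetical helper: fail fast if the locally accumulated error-feedback
    # tensor for a gradient bucket already contains NaNs, rather than letting
    # them propagate into the compressed gradients.
    assert not torch.any(torch.isnan(error_feedback)), (
        f"NaN found in the error feedback of bucket {bucket_index}"
    )
```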

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118572471

Test Plan:
Real run on Ads 10X model: scripts/wayi/mast_prof_gradient_compression.sh POWER_SGD 8

To reproduce the error, just comment out the assertion.
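
The MAST script above is internal. As a rough public-API sketch of how the PowerSGD hook (and thus the error-feedback path this assertion guards) gets exercised, something along these lines should work, assuming the process group is already initialized and treating the model, sizes, and local_rank as placeholders (parameter names follow the current PowerSGDState API, which may differ slightly from the version at the time of this commit):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD
from torch.nn.parallel import DistributedDataParallel as DDP

# Placeholders: in a real run these come from the launcher environment.
local_rank = 0
model = DDP(nn.Linear(1024, 1024).cuda(local_rank), device_ids=[local_rank])

# Error feedback must be enabled for the NaN assertion to be on the hot path.
state = powerSGD.PowerSGDState(
    process_group=None,            # use the default process group
    matrix_approximation_rank=1,   # rank-1 low-rank compression
    use_error_feedback=True,       # keep per-bucket error-feedback tensors
)
model.register_comm_hook(state, powerSGD.powerSGD_hook)

# Any subsequent backward pass now routes gradients through the hook.
loss = model(torch.randn(32, 1024, device=f"cuda:{local_rank}")).sum()
loss.backward()
```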

Reviewed By: rohan-varma

Differential Revision: D25548299

fbshipit-source-id: 039af7d94a27e0f47ef647c6163fd0e5064951d5
2020-12-14 20:15:39 -08:00
ddp_comm_hooks Add assertion on any NaN error on the error feedback (#49374) 2020-12-14 20:15:39 -08:00
__init__.py [Gradient Compression] Add unit tests that test default Python comm hook implementations (#47158) 2020-11-06 00:28:09 -08:00