RocksDB 7 has started to use C++17 in its headers.
We should make this configurable, in case users need a higher standard version.
The list of files to change was found with `git grep 'CMAKE_[^_]*_STANDARD'`.
The doc string is copied from the CMake code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75519
Approved by: https://github.com/malfet
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62445
PyTorch currently uses the old style of compiling CUDA in CMake, which is just a
bunch of scripts in `FindCUDA.cmake`. Newer versions of CMake support CUDA natively
as a language, just like C++ or C.
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D31503350
fbshipit-source-id: 2ee817edc9698531ae1b87eda3ad271ee459fd55
Summary:
The GoogleTest `TEST` macro is non-compliant with the `cppcoreguidelines-avoid-non-const-global-variables` check, as is `DEFINE_DISPATCH`.
All changes except the ones to `.clang-tidy` were generated using the following script:
```
for i in $(find . -type f -iname "*.c*" -or -iname "*.h" \
             | xargs grep cppcoreguidelines-avoid-non-const-global-variables \
             | cut -f1 -d: | sort | uniq); do
  sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" "$i"
done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60543
Now that c10d is part of libtorch, it would also be nice if the sources all lived in one place.
ghstack-source-id: 132306292
Test Plan: It builds
Reviewed By: cbalioglu
Differential Revision: D29062002
fbshipit-source-id: d9e1301e9d73e1643fa0f0119cd2d618f1ad52e6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59359
Move `prepare_for_backward` into the `_DDPSink` backward instead of calling it in the DDP forward pass, so that we can run multiple backwards in DDP with `retain_graph=True`.
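For reference, a minimal libtorch sketch of the multiple-backward pattern this change enables (plain autograd here, not DDP; the tensor and loss are illustrative):
```
#include <torch/torch.h>

int main() {
  auto x = torch::randn({4}, torch::requires_grad());
  auto loss = (x * x).sum();
  loss.backward({}, /*retain_graph=*/true); // keep the autograd graph alive
  loss.backward();                          // second backward now succeeds
  return 0;
}
```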
ghstack-source-id: 131774159
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D28855226
fbshipit-source-id: 6b7b25d75b7696f5b5629078233433f97663d61c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59719
Added FileStore functionality to the c10d backend. If FileStore is selected as the store type, it creates a temporary file in the /tmp directory to use. Appropriate tests were added as well.
FileStore was modified to expose the path field for testing. It was also modified so that the numWorkers field in the constructor is optional (defaulting to -1). A negative value indicates that there is no fixed number of workers; in this case, no attempt is made to clean up the file at the end.
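A minimal sketch of constructing a FileStore directly from C++ (header path per the current source layout; the file path and key are illustrative):
```
#include <torch/csrc/distributed/c10d/FileStore.hpp>

#include <cstdint>
#include <vector>

int main() {
  // numWorkers < 0 means there is no fixed number of workers; the store then
  // makes no attempt to clean up the backing file when it is destroyed.
  c10d::FileStore store("/tmp/c10d_example_store", /*numWorkers=*/-1);

  std::vector<uint8_t> value = {1, 2, 3};
  store.set("example_key", value);
  auto fetched = store.get("example_key"); // std::vector<uint8_t>
  return fetched == value ? 0 : 1;
}
```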
Test Plan: Unit tests for creating a c10d backend with filestore and simple error handling.
Reviewed By: cbalioglu, H-Huang
Differential Revision: D28997436
fbshipit-source-id: 24c9b2c9b13ea6c947e8b1207beda892bdca2217
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59684
Same reasoning as in the below diff.
ghstack-source-id: 131167212
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D28981326
fbshipit-source-id: 264a7f787ea8be76f743a2eaca67ae1d3bd8073a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59697
The c10d build process selectively adds files based on the `USE_C10D_FOO` flags (where `FOO` is one of `GLOO`, `NCCL` or `MPI`). Replicating this logic inside libtorch will be harder, since libtorch uses a simpler approach (i.e., it lists the files in `build_variables.bzl`). So instead we could always include all files, and "disable" each file as needed using `#ifdef`s. Note that this is not a new approach: we already do the same for all the files of the TensorPipe agent based on the flag `USE_TENSORPIPE`.
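A minimal sketch of the guard pattern described above (macro and file names follow the description; the body is elided):
```
// ProcessGroupGloo.cpp -- always listed in build_variables.bzl, but the whole
// translation unit compiles to nothing unless the Gloo backend is enabled.
#ifdef USE_C10D_GLOO

namespace c10d {
// Gloo-specific implementation goes here.
} // namespace c10d

#endif // USE_C10D_GLOO
```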
ghstack-source-id: 131169540
Test Plan: CI
Reviewed By: agolynski
Differential Revision: D28987577
fbshipit-source-id: 4c6195de4e9a58101dad9379537e8d055dfd38af
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59696
Some files in c10d refer to dist autograd. However, on Windows, dist autograd isn't built. Hence we need to "mask out" those references under Windows. This was already partly done, but when moving c10d to libtorch some issues came up, possibly due to the different way in which linking happens. Hence I masked out the remaining references.
ghstack-source-id: 131169541
Test Plan: CI
Reviewed By: agolynski
Differential Revision: D28987579
fbshipit-source-id: c29c5330f8429d699554972d30f99a89b2e3971d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59667
Use `TORCH_CHECK` instead of `throw std::runtime_error` in monitored barrier so
that it works with `TORCH_SHOW_CPP_STACKTRACES` to reveal the entire call stack
where the monitored barrier failed, which can help determine where the
particular rank encountered an issue.
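A hedged before/after sketch (the function and message are illustrative, not the exact c10d code):
```
#include <c10/util/Exception.h>

void checkRankResponded(bool responded, int rank) {
  // Before: throw std::runtime_error("Rank " + std::to_string(rank) + " failed");
  // After: TORCH_CHECK raises a c10::Error, which carries a C++ stack trace
  // when TORCH_SHOW_CPP_STACKTRACES=1 is set.
  TORCH_CHECK(responded, "Rank ", rank, " failed to pass monitored barrier");
}
```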
ghstack-source-id: 130993689
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D28974510
fbshipit-source-id: 6a6958995c1066cddcd647ca88c74473079b69fc
Summary:
Switches most of the simple for loops outside of `jit` directories to use `c10::irange`.
Generated with D28874212.
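For reference, the pattern the codemod introduces looks roughly like this (the loop body is illustrative):
```
#include <c10/util/irange.h>

#include <vector>

// Assumes out.size() >= in.size().
void doubleAll(const std::vector<int>& in, std::vector<int>& out) {
  // Before: for (size_t i = 0; i < in.size(); ++i) { ... }
  // c10::irange deduces the index type from the bound and evaluates it once.
  for (const auto i : c10::irange(in.size())) {
    out[i] = in[i] * 2;
  }
}
```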
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59481
Test Plan: Sandcastle
Reviewed By: ngimel
Differential Revision: D28909681
fbshipit-source-id: ec9ab1bd602933238d9d0f73d4d8d027b75d9d85
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59523
Use snake case instead of camel case for consistency.
ghstack-source-id: 130759655
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_grad_div_uneven_inputs
Reviewed By: cbalioglu
Differential Revision: D28922896
fbshipit-source-id: e04298284a78b2e71b562f790a878731962f873a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59576
If the gradients before allreduce are large, then the sum after allreduce may overflow, especially for FP16. Therefore, apply the division before allreduce.
This fix is applied to both C++ and Python comm hooks.
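A hedged sketch of the ordering change (header path per the current layout; the helper is illustrative, not the actual hook):
```
#include <ATen/ATen.h>
#include <torch/csrc/distributed/c10d/ProcessGroup.hpp>

#include <vector>

void allreduceAveraged(c10d::ProcessGroup& pg, at::Tensor& grad) {
  // Divide first: each shard shrinks by the world size, so the FP16 sum
  // produced by the allreduce stays within the representable range.
  grad.div_(pg.getSize());
  std::vector<at::Tensor> tensors{grad};
  pg.allreduce(tensors)->wait(); // sums the already-scaled shards
}
```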
ghstack-source-id: 130754510
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl_grad_is_view
Reviewed By: rohan-varma
Differential Revision: D28941327
fbshipit-source-id: 932e8ddbdb2bfd609a78943f6dc390d3d6ca333f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59574
Remove `work` attribute from Reducer class in favor of `future_work`.
Additionally, remove the `copy_grad_to_bucket` method, since it is now only a one-line implementation, and create a new C++ comm hook called `_AllReduceCommHookWithDivFactor` to replace allreduce and also support handling uneven inputs.
1) Compared with the reverted https://github.com/pytorch/pytorch/pull/58937, `_AllReduceCommHookWithDivFactor` in `default_comm_hooks.cpp` is updated to apply the division first and hence avoid FP16 overflow.
2) Compared with the reverted https://github.com/pytorch/pytorch/pull/59520, `test_DistributedDataParallel_non_default_stream` is disabled on AMD, because applying the division first now hurts the gradient averaging accuracy on AMD.
See [07:48:26]:
https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.2-py3.6-test1/1129/console
#Original PR Issue: https://github.com/pytorch/pytorch/issues/41266
ghstack-source-id: 130752393
Test Plan:
buck test caffe2/test/distributed:distributed_gloo_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_grad_div_uneven_inputs
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_non_default_stream
Reviewed By: rohan-varma
Differential Revision: D28940800
fbshipit-source-id: 1ba727ac951ebc1e7875dc1a1be8108a2c8d9462
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59522
If the gradients before allreduce are large, then the sum after allreduce may overflow, especially for FP16. Therefore, apply the division before allreduce.
This fix is applied to both C++ and Python comm hooks.
ghstack-source-id: 130686229
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl_grad_is_view
Reviewed By: rohan-varma
Differential Revision: D28922548
fbshipit-source-id: 442bd3cc7a35a8b948f626062fa7ad2e3704c5be
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59520
Remove `work` attribute from Reducer class in favor of `future_work`.
Additionally, remove the `copy_grad_to_bucket` method, since it is now only a one-line implementation, and create a new C++ comm hook called `_AllReduceCommHookWithDivFactor` to replace allreduce and also support handling uneven inputs.
Compared with the reverted https://github.com/pytorch/pytorch/pull/58937, `_AllReduceCommHookWithDivFactor` in `default_comm_hooks.cpp` is updated to apply the division first and hence avoid FP16 overflow.
#Original PR Issue: https://github.com/pytorch/pytorch/issues/41266
ghstack-source-id: 130685351
Test Plan:
buck test caffe2/test/distributed:distributed_gloo_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_grad_div_uneven_inputs
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_grad_is_view
Reviewed By: walterddr
Differential Revision: D28922305
fbshipit-source-id: 6388a96eda7a06f292873afed6d1362096c13e1c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58331
This PR is the final part of a stack that addresses the GitHub issue #41614; it introduces the multi-tenancy feature to the `TCPStore` class, allowing two server stores to be instantiated with the same host:port pair.
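A hedged sketch against the current C++ API (field names and header path may differ slightly from the version introduced in this stack):
```
#include <c10/util/intrusive_ptr.h>
#include <torch/csrc/distributed/c10d/TCPStore.hpp>

int main() {
  c10d::TCPStoreOptions opts;
  opts.port = 29500;
  opts.isServer = true;
  opts.multiTenant = true; // allow another server store on the same host:port

  // Without multiTenant the second construction would fail because the port
  // is already bound; with it, both server stores coexist.
  auto first = c10::make_intrusive<c10d::TCPStore>("localhost", opts);
  auto second = c10::make_intrusive<c10d::TCPStore>("localhost", opts);
  return 0;
}
```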
ghstack-source-id: 130676394
Test Plan:
- Run the existing and newly-introduced tests.
- Run several smoke tests including the short code snippet referred in GitHub issue #41614.
Reviewed By: H-Huang
Differential Revision: D28453850
fbshipit-source-id: f9066b164305de0f8c257e9d5736e93fd7e21ec6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58330
This PR is part of a stack that addresses the GitHub issue #41614; it introduces a major refactoring of the `TCPStore` class in preparation of the multi-tenancy feature.
- All TCP sockets are wrapped with a new `TCPSocket` RAII type (a minimal sketch of this pattern follows the list).
- `BackgroundThread` and daemon types are moved from the header to the cpp file.
- Server, client, and callback sockets are refactored into their own internal types `TCPServer`, `TCPClient`, and `TCPCallbackClient`.
- Calls to `tcputil::send*` and `tcputil::recv*` are wrapped in `TCPClient` for readability and maintainability.
- Two `TODO` statements are added to reference future improvements. Based on feedback, I will either create separate GitHub issues for them or address them as part of this stack.
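A minimal sketch of the RAII wrapping from the first bullet, assuming a plain POSIX file descriptor; the real `TCPSocket` carries more state and error handling:
```
#include <unistd.h>

#include <utility>

class TCPSocket {
 public:
  explicit TCPSocket(int fd = -1) noexcept : fd_(fd) {}
  ~TCPSocket() {
    if (fd_ >= 0) {
      ::close(fd_); // released on every exit path
    }
  }

  // Movable but not copyable, so ownership of the descriptor stays unique.
  TCPSocket(TCPSocket&& other) noexcept : fd_(std::exchange(other.fd_, -1)) {}
  TCPSocket& operator=(TCPSocket&& other) noexcept {
    if (this != &other) {
      if (fd_ >= 0) {
        ::close(fd_);
      }
      fd_ = std::exchange(other.fd_, -1);
    }
    return *this;
  }
  TCPSocket(const TCPSocket&) = delete;
  TCPSocket& operator=(const TCPSocket&) = delete;

  int handle() const noexcept { return fd_; }

 private:
  int fd_;
};
```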
ghstack-source-id: 130676392
Test Plan: Run the existing tests since there are no user-facing behavioral changes.
Reviewed By: H-Huang
Differential Revision: D28448981
fbshipit-source-id: 415b21e74b3cd51d673c1d5c349c6a2cb21dd667
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58329
This PR is part of a stack that addresses the GitHub issue #41614; it introduces:
- A new `multiTenant` constructor option for the `TCPStore` class indicating whether multiple store instances can be initialized with the same host:port pair.
- Updates to the C10d distributed (elastic) rendezvous and the `init_process_group` method to leverage the new `multiTenant` feature.
Note that the multi-tenancy feature itself is implemented in the fourth PR of this stack. In this PR, passing `true` to `multiTenant` results only in a warning output.
ghstack-source-id: 130676389
Test Plan: Run the existing tests since there are no behavioral changes.
Reviewed By: rohan-varma
Differential Revision: D28424978
fbshipit-source-id: fb1d1d81b8b5884cc5b54486700a8182a69c1f29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58328
This PR is part of a stack that addresses the GitHub issue #41614; it introduces a new `TCPStore` constructor that takes its optional parameters via a newly introduced `TCPStoreOptions` structure. This gives the API callers the flexibility to specify only the desired options while skipping the rest.
The main motivation behind this change is the introduction of the `multiTenant` constructor option in the second PR of this stack.
ghstack-source-id: 130676384
Test Plan: Run the existing tests since there are no behavioral changes.
Reviewed By: H-Huang
Differential Revision: D28417742
fbshipit-source-id: e6ac2a057f7ad1908581176ee6d2c2554c3c74a9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58937
Remove `work` attribute from Reducer class in favor of `future_work`.
Additionally, remove the `copy_grad_to_bucket` method, since it is now only a one-line implementation, and create a new C++ comm hook called `_AllReduceCommHookWithDivFactor` to replace allreduce and also support handling uneven inputs.
#Original PR Issue: https://github.com/pytorch/pytorch/issues/41266
ghstack-source-id: 130673249
Test Plan:
buck test caffe2/test/distributed:distributed_gloo_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_grad_div_uneven_inputs
Reviewed By: agolynski
Differential Revision: D28677383
fbshipit-source-id: 85e0620378b7e9d837e436e94b9d807631d7d752
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59281
Adds the ability to log when the reducer/DDP encounters an error. We add the fields "has_error" and "error" to indicate that an error has
occurred in this iteration and that the other fields (performance stats) are not
guaranteed to be updated.
Errors encountered in Python-side DDP will be added in the next diff.
ghstack-source-id: 130412974
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28652717
fbshipit-source-id: 9772abc2647a92dac6a325da6976ef5eb877c589
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59070
This log is too verbose, especially in the case where we call monitored
barrier before every collective, as we do in ProcessGroupWrapper.
ghstack-source-id: 130052822
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D28738189
fbshipit-source-id: f2899537caa4c13508da31134d5dd0f4fd6a1f3a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58992
Currently, we define Torchbind custom classes in the same place that we define Python bindings.
This is nice from a code location perspective, but has two downsides:
1. These custom classes are not available in a C++-only build.
2. These break when included in torch::deploy.
Some explanation on the second issue: torch::deploy creates many Python
interpreters, and creates a full copy of all the bindings for each one. This
will run the static initialization code once for each copy of the bindings,
leading to multiple registrations of the custom classes (and therefore an
error).
This PR splits out the relevant custom class binding code into its own source
file to be included in libc10d, which can be compiled and statically
initialized a single time and linked against from the c10d python bindings.
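A hedged sketch of the pattern (the class and names are illustrative, not the real c10d bindings): the Torchbind registration sits in an ordinary C++ translation unit that is linked into the library exactly once, separate from the Python-binding code.
```
#include <torch/custom_class.h>

namespace {

struct Counter : torch::CustomClassHolder {
  int64_t value = 0;
  int64_t increment() {
    return ++value;
  }
};

// This static initializer runs once per shared library. Keeping it out of the
// Python-binding files means torch::deploy's per-interpreter copies of the
// bindings no longer re-register the class.
auto counter_class =
    torch::class_<Counter>("illustrative_classes", "Counter")
        .def(torch::init<>())
        .def("increment", &Counter::increment);

} // namespace
```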
ghstack-source-id: 130168942
Test Plan: CI
Reviewed By: wconstab
Differential Revision: D28690832
fbshipit-source-id: 3c5e3fff28abb8bcdb4a952794c07de1ee2ae5a8
Summary:
`makeDeviceForHostname` and `makeDeviceForInterface` are almost duplicates,
differing only in their default argument values.
Create a generic `makeGlooDevice` helper (in an anonymous namespace) that takes
both the host name and the interface name, and call it from both
makeDeviceFor[Hostname|Interface]; a sketch follows below.
Also fix two other minor issues:
- Do not call `getenv("GLOO_DEVICE_TRANSPORT")` at library load time
- Raise an exception rather than crash if GLOO_DEVICE_TRANSPORT is set to an unknown value
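A hedged sketch of the consolidation (the `Device` type and `createDevice` factory are stand-ins for the real gloo/c10d types, which are elided):
```
#include <cstdlib>
#include <memory>
#include <stdexcept>
#include <string>

// Stand-ins for the real gloo transport types and factory helpers.
struct Device {};
bool isKnownTransport(const std::string& name) {
  return name == "tcp" || name == "uv";
}
std::shared_ptr<Device> createDevice(
    const std::string& interfaceName,
    const std::string& hostname,
    const std::string& transport) {
  (void)interfaceName; (void)hostname; (void)transport;
  return std::make_shared<Device>();
}

namespace {

std::shared_ptr<Device> makeGlooDevice(
    const std::string& interfaceName,
    const std::string& hostname) {
  // Read GLOO_DEVICE_TRANSPORT lazily, on first use, not at library load time.
  const char* env = std::getenv("GLOO_DEVICE_TRANSPORT");
  const std::string transport = env != nullptr ? env : "tcp";
  if (!isKnownTransport(transport)) {
    // Raise instead of crashing on an unknown transport value.
    throw std::invalid_argument("Unknown GLOO_DEVICE_TRANSPORT: " + transport);
  }
  return createDevice(interfaceName, hostname, transport);
}

} // namespace

// Both public entry points forward to the shared helper, defaulting the
// argument they do not take.
std::shared_ptr<Device> makeDeviceForHostname(const std::string& hostname) {
  return makeGlooDevice(/*interfaceName=*/"", hostname);
}
std::shared_ptr<Device> makeDeviceForInterface(const std::string& interfaceName) {
  return makeGlooDevice(interfaceName, /*hostname=*/"");
}
```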
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58996
Reviewed By: pbelevich
Differential Revision: D28713324
Pulled By: malfet
fbshipit-source-id: cb33b438078d163e3ec6f047f2e5247b07d94f8d