Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45433
Primarily in order to pick up the fix landed in https://github.com/pytorch/tensorpipe/pull/225, which repairs the handling of scopes in link-local IPv6 addresses, an issue that was reported by a user.
Test Plan: The specific upstream change is covered by new unit tests. The submodule update will be validated by the PyTorch CI.
Reviewed By: beauby
Differential Revision: D23962289
fbshipit-source-id: 4ed762fc19c4aeb1398d1337d61b3188c4c228be
Summary:
Includes commits to fix the Windows CI failure of the "enable distributed training on Windows" PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45025
Reviewed By: beauby
Differential Revision: D23807995
Pulled By: mrshenli
fbshipit-source-id: a2f4c1684927ca66d7d3e9920ecb588fb4386f7c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45014
Pull Request resolved: https://github.com/pytorch/tensorpipe/pull/219
Pull Request resolved: https://github.com/pytorch/tensorpipe/pull/212
+ Introduce buffer.h defining the buffer struct(s). The `CpuBuffer`
struct is always defined, while the `CudaBuffer` struct is defined
only when `TENSORPIPE_SUPPORTS_CUDA` is true.
+ Update all channels to take a `CpuBuffer` or `CudaBuffer` for
`send`/`recv` rather than a raw pointer and a length.
+ Make the base `Channel`/`Context` classes templated on `TBuffer`,
effectively creating two channel hierarchies (one for CPU channels,
one for CUDA channels); see the sketch after this list.
+ Update the Pipe and the generic channel tests to use the new API. So
far, generic channel tests are CPU only, and tests for the CUDA IPC
channel are (temporarily) disabled. A subsequent PR will take care of
refactoring tests so that generic tests work for CUDA channels. Another
PR will add support for CUDA tensors in the Pipe.
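To make the shape of the new API concrete, here is a minimal Python sketch that mirrors the design described above (the actual implementation is C++ templates; the field names here are illustrative, not the real struct layout):
```
from dataclasses import dataclass
from typing import Generic, TypeVar

@dataclass
class CpuBuffer:
    ptr: int     # address of the data
    length: int  # number of bytes

@dataclass
class CudaBuffer:
    ptr: int     # device address of the data
    length: int  # number of bytes
    stream: int  # CUDA stream on which to order the transfer

# One channel hierarchy per buffer type, instead of raw (pointer, length).
TBuffer = TypeVar("TBuffer", CpuBuffer, CudaBuffer)

class Channel(Generic[TBuffer]):
    def send(self, buffer: TBuffer) -> None: ...
    def recv(self, buffer: TBuffer) -> None: ...
```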
Differential Revision: D23598033
Test Plan: Imported from OSS
Reviewed By: lw
Pulled By: beauby
fbshipit-source-id: 1d6c3f91e288420858835cd5e7962e8da051b44b
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33394.
This PR does two things:
1. Implement CUDA scatter reductions with revamped GPU atomic operations.
2. Remove support for divide and subtract for CPU reduction, as discussed with ngimel.
I've also updated the docs to reflect that only multiply and add are supported.
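For illustration, a small usage sketch of the remaining reductions, assuming the `reduce=` keyword of `Tensor.scatter_` that this work targets:
```
import torch

src = torch.tensor([1., 2., 3., 4.])
index = torch.tensor([0, 1, 0, 1])

# Only 'add' and 'multiply' are supported as scatter reductions.
out = torch.ones(2)
out.scatter_(0, index, src, reduce='add')       # out == [1+1+3, 1+2+4] == [5., 7.]

out = torch.ones(2)
out.scatter_(0, index, src, reduce='multiply')  # out == [1*1*3, 1*2*4] == [3., 8.]
```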
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41977
Reviewed By: mruberry
Differential Revision: D23748888
Pulled By: ngimel
fbshipit-source-id: ea643c0da03c9058e433de96db02b503514c4e9c
Summary:
- Bump oneDNN (mkl-dnn) to 1.6 for bug fixes
- Fixes https://github.com/pytorch/pytorch/issues/42446: `RuntimeError: label is redefined` for convolutions with large filter sizes on Intel AVX512
- Implemented a workaround for an internal compiler error when building oneDNN with Microsoft Visual Studio 2019 (https://github.com/pytorch/pytorch/pull/43169)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44706
Reviewed By: ngimel
Differential Revision: D23705967
Pulled By: albanD
fbshipit-source-id: 65e8fecc52a76c9f3324403a8b60ffa8a8948bc6
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: 1d710393d5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44647
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: dskhudia
Differential Revision: D23684528
fbshipit-source-id: 316ff2e448707a6e5a83248c9b22e58118bc8741
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: 0725301da5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44581
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: dskhudia, VitalyFedyunin
Differential Revision: D23665173
fbshipit-source-id: 03cee22335eef0517e561827795bbe2036942ea0
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: d5ace7ca70
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44177
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: dskhudia
Differential Revision: D23533561
fbshipit-source-id: 9e580f8dbfb83e57bebc28f8e459caa0c5fc7317
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44082
The automated submodule update is running into some test failures, and I am not sure how to rebase it.
Automated submodule update:
https://github.com/pytorch/pytorch/pull/43817
Test Plan: CI tests
Reviewed By: jianyuh
Differential Revision: D23489240
fbshipit-source-id: a49b01786ebf0a59b719a0abf22398e1eafa90af
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: 685149bbc0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43251
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: YazhiGao
Differential Revision: D23207016
fbshipit-source-id: 54e13b246bb5189260ed11316ddf3d26d52c6b24
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: 29d5eb9f3c
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42834
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: jspark1105
Differential Revision: D23040145
fbshipit-source-id: 1d7209ea1910419b7837703122b8a4c76380ca4a
Summary:
Not sure what happened, but possibly I landed a PR on PyTorch which updated the TensorPipe submodule to the commit hash of a *PR* of TensorPipe. Now that the latter PR has been merged, that same commit has a different hash, so the commit referenced by PyTorch has become orphaned. This is causing some issues.
Hence I am updating the submodule commit here, which does not change a single line of code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42789
Reviewed By: houseroad
Differential Revision: D23023238
Pulled By: lw
fbshipit-source-id: ca2dcf6b7e07ab64fb37e280a3dd7478479f87fd
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: a989b99279
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42713
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: amylittleyang
Differential Revision: D22990108
Pulled By: jspark1105
fbshipit-source-id: 3252a0f5ad9546221ef2fe908ce6b896252e1887
Summary:
Because 2.7.3 has a bug on GA100 that is fixed in 2.7.6.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42645
Reviewed By: malfet
Differential Revision: D22977280
Pulled By: mrshenli
fbshipit-source-id: 74779eff90d7d660a988ff33659f3a2237ca7e29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42522
Main changes:
- Consolidated CMake files to have a single entry point, rather than having a specialized one for PyTorch.
- Changed the way the preprocessor flags are provided, and changed their name.
There were a few instances in PyTorch's CMake files where we were directly adding TensorPipe's source directory as an include path, which does not contain the newly added auto-generated header. We fix that by linking those targets against the `tensorpipe` CMake target as a dependency, so that they pick up the include directories defined by TensorPipe, which do contain that header.
I'm turning off SHM and CMA for now because they have never been covered by the CI. I'll enable them in a separate PR so that if they turn out to be flaky we can revert that change without reverting this one.
Test Plan: CI
Reviewed By: malfet
Differential Revision: D22959472
fbshipit-source-id: 1959a41c4a66ef78bf0f3bd5e3964969a2a1bf67
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41603
Pull Request resolved: https://github.com/pytorch/glow/pull/4704
Previously in the glow onnxifi path, when an error was encountered, we logged it to stderr and then just returned ONNXIFI_STATUS_INTERNAL_ERROR to C2. C2 then does CAFFE2_ENFORCE_EQUAL(return_code, ONNXIFI_STATUS_SUCCESS). The error message that eventually went to the user was something like
```
[enforce fail at onnxifi_op.cc:545] eventStatus == ONNXIFI_STATUS_SUCCESS. 1030 vs 0
```
This diff adds plumbing to get human-readable error messages out of glow into C2.
Test Plan:
Run ads replayer. Overload it with traffic. The error message sent back to the client used to be
```
E0707 00:57:45.697196 3709559 Caffe2DisaggAcceleratorTask.cpp:493] During running REMOTE_OTHER net: [enforce fail at onnxifi_op.cc:545] eventStatus == ONNXIFI_STATUS_SUCCESS. 1030 vs 0 (Error from operator:....
```
Now it's
```
E0707 16:46:48.366263 1532943 Client.cpp:966] Exception when calling caffe2_run_disagg_accelerator on remote predictor for model 190081310_0 : apache::thrift::TApplicationException: c10::Error: [enforce fail at onnxifi_op.cc:556] .
Error code: RUNTIME_REQUEST_REFUSED
Error message: The number of allowed queued requests has been exceeded. queued requests: 100 allowed requests: 100
Error return stack:
glow/glow/lib/Runtime/HostManager/HostManager.cpp:673
glow/glow/lib/Onnxifi/HostMana (Error from operator:...
```
Reviewed By: gcatron, yinghai
Differential Revision: D22416857
fbshipit-source-id: 564bc7644d9666eb660725c2dca5637affae9b73
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: 4abc34af1a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42584
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: dskhudia
Differential Revision: D22941475
fbshipit-source-id: 29863cad7f77939edb44d337918693879b35cfaa
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: 87c378172a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42496
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: dskhudia
Differential Revision: D22911638
fbshipit-source-id: f20c83908b51ff56d8bf1d8b46961f70d023c81a
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: e04b9ce034
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42302
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: efiks
Differential Revision: D22841424
fbshipit-source-id: 211463b0207da986fc5b451242ae99edf32b9f68
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42225
Main changes:
- Consolidated CMake files to have a single entry point, rather than having a specialized one for PyTorch.
- Changed the way the preprocessor flags are provided, and changed their name.
There were a few instances in PyTorch's CMake files where we were directly adding TensorPipe's source directory as an include path, which does not contain the newly added auto-generated header. We fix that by adding the `tensorpipe` CMake target as a dependency, so that the include paths defined by TensorPipe, which do contain that header, are used.
I'm turning off SHM and CMA for now because they have never been covered by the CI. I'll enable them in a separate PR so that if they turn out to be flaky we can revert that change without reverting this one.
Test Plan: CircleCI is all green.
Reviewed By: beauby
Differential Revision: D22812445
fbshipit-source-id: e6d824bb28f5afe75fd765de0430968174f3531f
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: cad1c21404
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42205
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: dskhudia
Differential Revision: D22806731
Pulled By: efiks
fbshipit-source-id: 779a9f7f00645e7e65f183e2832dc79117eae5fd
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: 139c6f2292
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41814
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: dskhudia
Differential Revision: D22648844
fbshipit-source-id: 4cfa8d83585407f870ea2bdee74e1c1f371082eb
Summary:
A minor spell check!
I have gone through a dozen .md files to fix the typos.
zou3519 take a look!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41599
Reviewed By: ezyang
Differential Revision: D22601629
Pulled By: zou3519
fbshipit-source-id: 68d8f77ad18edc1e77874f778b7dadee04b393ef
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).
New submodule commit: 73ea1f5828
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40332
Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Reviewed By: gchanan, yns88
Differential Revision: D22150737
fbshipit-source-id: fe7e6787adef9e2fedee5d1a0a1e57bc4760b88c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40614
This update pulls in a one-liner fix, which sets the TCP_NODELAY option on the TCP sockets of the UV transport. This leads to exceptional latency gains, with about a 25x improvement in one simple benchmark. It resolves a regression that TensorPipe had compared to the ProcessGroup agent and, in fact, ends up beating it by 2x.
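For reference, a minimal sketch of what enabling this option looks like at the plain-socket level (illustrative only; the actual fix applies the option inside TensorPipe's libuv transport, not in Python):
```
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable Nagle's algorithm: send small RPC messages immediately instead
# of buffering them, which is what caused the latency regression.
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
```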
The benchmark I ran is this, with the two endpoints pinned to different cores of the same machine:
```
import torch
from torch.distributed import rpc

@torch.jit.script
def remote_fn(t: int):
    return t

@torch.jit.script
def local_fn():
    # Round-trip a trivial RPC many times to measure per-call latency.
    for _ in range(1_000_000):
        fut = rpc.rpc_async("rhs", remote_fn, (42,))
        fut.wait()
```
And the average round-trip time (one iteration) is:
- TensorPipe with SHM: 97.2 us
- TensorPipe with UV _after the fix_: 205 us
- Gloo: 440 us
- TensorPipe with UV _before the fix_: 5 ms
Test Plan: Ran PyTorch RPC test suite
Differential Revision: D22255393
fbshipit-source-id: 3f6825d03317d10313704c05a9280b3043920507
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40374
To pick up two fixes to MPT:
4b1b855f21 and 462200aad3
MPT isn't yet used by PyTorch, so this should have no effect.
Test Plan: Export to CircleCI and test
Reviewed By: patricklabatut
Differential Revision: D22160029
fbshipit-source-id: 202ea7487fcde015e5856f71ad6aebdfa6564ee1
Summary:
This is to import a few features:
- a fix to a race condition happening in SHM's use of epoll
- a new XTH channel, that uses a memcpy to transfer between threads of the same process
- a new MPT channel, that chunks and multiplexes tensors over multiple transport event loops
Test Plan: Run in CircleCI
Reviewed By: patricklabatut
Differential Revision: D22140736
fbshipit-source-id: a3cee8a3839d98a42b8438844a9fd24fd85b2744
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39945
In order to pick up 8fb1fe66f8.
Test Plan: Export to CircleCI and make sure tests pass.
Reviewed By: patricklabatut
Differential Revision: D22019033
fbshipit-source-id: eb192ea3950e4f27ed222f84e2d9de8bf6eb927c