Optimizes the nccl python bindings to reserve space when converting PythonObject* into Tensors. This should reduce the number of unnecessary allocations in the nccl bindings as the std::vector grows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88203
Approved by: https://github.com/ezyang
We define specializations for pybind11 defined templates
(in particular, PYBIND11_DECLARE_HOLDER_TYPE) and consequently
it is important that these specializations *always* be #include'd
when making use of pybind11 templates whose behavior depends on
these specializations, otherwise we can cause an ODR violation.
The easiest way to ensure that all the specializations are always
loaded is to designate a header (in this case, torch/csrc/util/pybind.h)
that ensures the specializations are defined, and then add a lint
to ensure this header is included whenever pybind11 headers are
included.
The existing grep linter didn't have enough knobs to do this
conveniently, so I added some features. I'm open to suggestions
for how to structure the features better. The main changes:
- Added an --allowlist-pattern flag, which turns off the grep lint
if some other line exists. This is used to stop the grep
lint from complaining about pybind11 includes if the util
include already exists.
- Added --match-first-only flag, which lets grep only match against
the first matching line. This is because, even if there are multiple
includes that are problematic, I only need to fix one of them.
We don't /really/ need this, but when I was running lintrunner -a
to fixup the preexisting codebase it was annoying without this,
as the lintrunner overall driver fails if there are multiple edits
on the same file.
I excluded any files that didn't otherwise have a dependency on
torch/ATen, this was mostly caffe2 and the valgrind wrapper compat
bindings.
Note the grep replacement is kind of crappy, but clang-tidy lint
cleaned it up in most cases.
See also https://github.com/pybind/pybind11/issues/4099
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82552
Approved by: https://github.com/albanD
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56016
Missed these because I don't build on CUDA
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: bdhirsh
Differential Revision: D27765124
Pulled By: ezyang
fbshipit-source-id: aa202f594659d53c903b88c9d4a4cbb0e1c0b40a
Summary:
Converts loops of the form:
```
for(int64_t VAR=0;VAR<LIMIT;VAR++)
```
to the form
```
for(const auto VAR : c10::irange(LIMIT))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55148
Test Plan: Sandcastle
Reviewed By: ngimel
Differential Revision: D27447811
fbshipit-source-id: 6311a094ec4a81a0b57383aaee0ba1b1dc2445c4
Summary:
While working on https://github.com/pytorch/pytorch/issues/38911, I realized that `nccl.reduce` only needs a single output tensor, while our current implementation requires a list of output tensors. This, along with a TODO I fixed in reduce_add, should have some speed up for data parallel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39688
Differential Revision: D22034547
Pulled By: mrshenli
fbshipit-source-id: e74d54d673ebbb062474b1bb5cc93a095a3a5f6c
Summary:
Because `torch_python` is supposed to be thin wrapper around `torch`
In this PR, all invocation of functions from nccl library are moved from python_nccl.cpp (which is part of torch_python) to nccl.cpp (which is part of torch_cuda)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36193
Test Plan: CI
Differential Revision: D20930047
Pulled By: malfet
fbshipit-source-id: 7f278610077df6ac5dc3471c1a1b5d51e653ef9c
Summary:
Given that pybind11 implements these gil functions, I don't think it makes sense for Pytorch to have its own bespoke versions.
Fixes https://github.com/pytorch/pytorch/issues/29065
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29095
Differential Revision: D18301806
Pulled By: ezyang
fbshipit-source-id: 03da6a26c41ee65aaadf7b67b9f0b14d2def2a5a
Summary:
Thanks to AddyLaddy ptrblck for tracking this fix down.
In torch/csrc/cuda/nccl.cpp and torch/csrc/cuda/python_nccl.cpp, construction of the `AutoNcclGroup` guard (which calls `ncclGroupStart()`) [precedes](https://github.com/pytorch/pytorch/pull/29014/files#diff-3b6a42619dd44000cf58c0328b679a1cL239-L241) a possible call to `get_communicators`, which may call `ncclCommInitAll()`. Calling `ncclCommInitAll()` within a `ncclGroupStart()/End()` is incorrect according to our Nccl people.
It seemed ok (relevant tests were silently passing) as long as Pytorch was compiled/linked against Nccl 2.4.x (which is currently what's locked into your third_party/nccl subrepo). However, when we tried to compile and link against Nccl 2.5.x in internal builds, we began to see test hangs (TestAutogradDeviceTypeCUDA.test_unused_output_device_cuda was what initially brought it to our attention).
The present PR fixes those hangs, as far as we know, and will prevent a nasty future surprise when you start building against nccl 2.5.
The backend affected by this PR is exposed via https://github.com/pytorch/pytorch/blob/master/torch/cuda/nccl.py. I'm not sure if the exposure is actually used anywhere (I think the distributed frontend is now backed by ProcessGroupNCCL in torch/lib/c10d). So this PR may affect code that is already dead or dying, but still tested, it seems.
I skimmed ProcessGroupNCCL.cpp for potential similar vulnerabilities and didn't spot anything obvious.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29014
Differential Revision: D18274799
Pulled By: ezyang
fbshipit-source-id: c5f88cf187960d61736be14458be01e3675c6702
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27124
ncclCommAbort() and ncclGetAsyncError() were two APIs added in NCCL
2.4 to detect errors in NCCL communicators. These were used as part of
ProcesGroupNCCL and we also enforced that only NCCL versions 2.4+ were
supported. Although, there is still legitimate use for older NCCL versions and
hence we should still support those.
For that purpose, in this change I've ensured we disable NCCL error checking
for versions < 2.4.
ghstack-source-id: 91452959
Test Plan:
1) Test with 2.4.8
2) Test with 2.2.13
3) unit tests.
Differential Revision: D17178988
fbshipit-source-id: 5dc44b5f7b4b00466c67fd452315f1d4f5c47698
Summary:
As part of the Variable/Tensor merge work: https://github.com/pytorch/pytorch/issues/13638, we make the following changes in this PR:
1. Remove the `Variable::Impl` class and the `DifferentiableViewImpl` class
2. Change all `Variable.data()` call sites to either use `Variable` directly, or use `Variable.tensor_data()`
3. Remove `Variable.data()` API
3. Add `Variable.variable_data()` that matches `tensor.data` in Python API, which creates a new `Variable` that shares the same storage and tensor metadata with the original `Variable`, but with a completely new autograd history.
After this PR, Variable doesn't wrap a Tensor internally anymore, and both Variable and Tensor use the same TensorImpl class as its `impl_`. The only difference is that Variable always has AutogradMeta in its TensorImpl, but Tensor doesn't.
**Note that this PR is BC-breaking in the following use cases:**
**Use Case 1:**
Previously, `x.data = y` works even if `x` and `y` are of different TensorImpl type (e.g. `x` is a CPU dense tensor whose impl is of type TensorImpl, while `y` is a CPU sparse tensor whose impl is of type SparseTensorImpl). However, after this PR, `x.data = y` doesn't work anymore if `x` and `y` are of different TensorImpl type, because the underlying implementation `variable.set_data(tensor)` no longer works if `variable` and `tensor` have different TensorImpl type.
**Use Case 2:**
If a tensor `x`'s `grad` is sparse, accumulating dense gradients to `x` will change the tensor that `x.grad` is pointing to. This is better illustrated with the following example:
```python
params = torch.tensor([1.5, 1.5]).requires_grad_()
with torch.no_grad():
# Change gradient to a sparse tensor
params.grad = torch.sparse_coo_tensor(torch.tensor([[1, 1]]).long(), torch.tensor([1., 1.]))
grad_saved = params.grad
params.backward(torch.tensor([1.5, 1.5]))
assert id(grad_saved) == id(params.grad) # This will fail after this PR
```
The assertion in the last line will fail after this PR, because adding dense gradients to sparse gradients will change the `params.grad` tensor reference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17072
Differential Revision: D14075257
Pulled By: yf225
fbshipit-source-id: 0e681df641270dea586042dd26db59f2e76b5957
Summary:
Some renaming and renamespacing also took place. I was originally planning not to do anything, but it turns out that it was easier to make HIPify work by using a namespace CUDACachingAllocator:: rather than THCCachingAllocator_, since :: is a word boundary but _ is not.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16119
Reviewed By: smessmer
Differential Revision: D13718768
fbshipit-source-id: 884a481d99027fd3e34471c020f826aa12225656
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16117
This means I can move it to c10_cuda with minimal fuss.
Reviewed By: smessmer
Differential Revision: D13717836
fbshipit-source-id: a94c7dc649af64542480fc1c226b289588886c00
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14248
This diff also introduces a horrifying hack to override CUDA's DeviceGuardImpl
with a HIPGuardImplMasqueradingAsCUDA, to accommodate PyTorch's current
behavior of pretending CUDA is HIP when you build with ROCm enabled.
Reviewed By: bddppq
Differential Revision: D13145293
fbshipit-source-id: ee0e207b6fd132f0d435512957424a002d588f02
Summary:
Anywhere we used #include "foo.h", we now say #include <foo.h>
Paths are adjusted to be rooted out of aten/src, torch/lib, or
the root level directory.
I modified CMakeLists.txt by hand to remove TH and THC from
the include paths.
I used the following script to do the canonicalization:
```
import subprocess
import re
import os.path
files = subprocess.check_output(['git', 'ls-files']).decode('utf-8').rstrip().split('\n')
for fn in files:
if not any(fn.endswith(suff) for suff in ['.cu', '.cpp', '.in', '.h', '.hpp', '.cu', '.cuh', '.cc']):
continue
if not any(fn.startswith(pref) for pref in ["aten/", "torch/"]):
continue
with open(fn, 'r') as f:
c = f.read()
def fmt(p):
return "#include <{}>".format(p)
def repl(m):
p = m.group(1)
if p in ["dlfcn.h", "unistd.h", "nvrtc.h", "cuda.h", "cuda_runtime.h", "cstdint", "cudnn.h", "Python.h", "cusparse.h", "cuda_runtime_api.h", "cuda_fp16.h", "cublas_v2.h", "stdint.h", "curand_kernel.h"]:
return fmt(p)
if any(p.startswith(pref) for pref in ["torch/csrc", "c10/", "ATen/", "caffe2/", "TH/", "THC/", "Eigen/", "gtest/", "zdl/", "gloo/", "onnx/", "miopen/"]):
return fmt(p)
for root in ["aten/src", "torch/lib", ""]:
for bad_root in [os.path.dirname(fn), "aten/src/TH", "aten/src/THC", "torch/csrc"]:
new_p = os.path.relpath(os.path.join(bad_root, p), root)
if not new_p.startswith("../") and (os.path.exists(os.path.join(root, new_p)) or os.path.exists(os.path.join(root, new_p + ".in"))):
return fmt(new_p)
print("ERROR: ", fn, p)
return m.group(0)
new_c = re.sub(r'#include "([^"]+)"', repl, c)
if new_c != c:
print(fn)
with open(fn, 'w') as f:
f.write(new_c)
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14849
Reviewed By: dzhulgakov
Differential Revision: D13363445
Pulled By: ezyang
fbshipit-source-id: 52361f878a672785f9306c9e9ab2513128092b68
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13342
This PR introduces a few new concepts:
- DeviceGuardImplInterface, and implementations for CPU and CUDA, which
provide a generic interface for interfacing with device and stream state,
without requiring a direct dependency on the code in question.
- InlineDeviceGuard, a general template for generating both specialized
and dynamically dispatched device guard implementations. Dynamic
dispatch is done by specializing it on a VirtualGuardImpl.
- Provide a device-independent DeviceGuard class, which can be used even
from CPU code. It uses the aforementioned dynamic dispatch.
- CUDA-specialized CUDAGuard class, which doesn't have a dynamic dispatch
but can only be used from CUDA.
- StreamGuard, which is the same as above, but for streams rather than
devices.
- Optional variants of all the aforementioned guards, which are a no-op if
no device/stream is specified
- CUDAMultiStreamGuard, specifically for the case when we want to set
a device on every guard.
There are some subtle semantic changes, which have been thoroughly documented
in the class definition.
BC-breaking changes:
- Move constructor/assignment have been removed from all device guard
implementations.
- In some cases where you previously wrote 'set_device' (or 'set_stream'), you now must write
'reset_device', because if you switch devices/device types, the stream/device on the
previous device is unset. This is different from previous behavior.
- CUDAGuard no longer handles streams, or multiple streams. Use CUDAStreamGuard
or CUDAMultiStreamGuard as appropriate for your use case.
Reviewed By: dzhulgakov
Differential Revision: D12849620
fbshipit-source-id: f61956256f0b12be754b3234fcc73c2abc1be04e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13275
This resulted in a bunch of knock-on changes, which I will now
describe:
- s/original_index/original_device/
- s/last_index/last_device/
- A bunch of places that used set_index, now use CUDAGuard (which does have
set_index) because they were CUDA-specific code.
Major caveat: DeviceGuard doesn't *actually* work non-CUDA/CPU devices, To make
that happen, I plan on totally replacing the implementation of DeviceGuard; what
I mostly care about here is wrangling the API into an acceptable state.
Reviewed By: gchanan
Differential Revision: D12832080
fbshipit-source-id: 7de068c7cec35663dc8a533026a626331336e61d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13125
Previously, it returned a vector of THCStream*, which we eventually turned
into CUDAStream. No need to spatter the conversion code everywhere: just
do it correctly to begin with. An important side effect of doing it this
way is that we no longer pass nullptr to CUDAStream; instead, we create
the default stream. I will rely on this in a later patch.
Reviewed By: gchanan
Differential Revision: D10853224
fbshipit-source-id: f6bd6594eba4626eb41a4a5e67fc64c9bbb46a1a
Summary:
As the title says, we should always use the current stream on device in NCCL.
This can unblock ezyang on his further work
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13089
Reviewed By: ezyang
Differential Revision: D10847172
Pulled By: teng-li
fbshipit-source-id: 7fc7c4248b5efa1971d2af4d43f62d3379debfe4
Summary:
fully working version by using continuing on goldsborough 's initial version.
waiting on the stream guard to be merged before adding more stream perf logics into the c++ version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12852
Differential Revision: D10468696
Pulled By: teng-li
fbshipit-source-id: 8e46d408796973817abfd9dbd6566e0ca5b7a13f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11805
Some of our headers in Caffe2 pollute the macro namespace with things like MAX,
MIN, CHECK, so I renamed these in places where this is a problem.
This patch courtesy of gchanan, extracted out of #11721
Reviewed By: Yangqing
Differential Revision: D9917757
fbshipit-source-id: 17fc692ca04b208dcb8ae00731ed60e393284f7c
Summary:
How did we get so many uses of `NULL` again?
ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11047
Differential Revision: D9566799
Pulled By: goldsborough
fbshipit-source-id: 83469f352ac69aa65bdaf1a1a21f922d892e0db3
* Created TensorOptions
Storing the type in TensorOptions to solve the Variable problem
Created convenience creation functions for TensorOptions and added tests
Converted zeros to TensorOptions
Converted rand to TensorOptions
Fix codegen for TensorOptions and multiple arguments
Put TensorOptions convenience functions into torch namespace too
All factory functions except *_like support TensorOptions
Integrated with recent JIT changes
Support *_like functions
Fix in place modification
Some cleanups and fixes
Support sparse_coo_tensor
Fix bug in Type.cpp
Fix .empty calls in C++ API
Fix bug in Type.cpp
Trying to fix device placement
Make AutoGPU CPU compatible
Remove some auto_gpu.h uses
Fixing some headers
Fix some remaining CUDA/AutoGPU issues
Fix some AutoGPU uses
Fixes to dispatch_tensor_conversion
Reset version of new variables to zero
Implemented parsing device strings
Random fixes to tests
Self review cleanups
flake8
Undo changes to variable.{h,cpp} because they fail on gcc7.2
Add [cuda] tag to tensor_options_cuda.cpp
Move AutoGPU::set_index_from into .cpp file because Windows is stupid and sucks
Fix linker error in AutoGPU.cpp
Fix bad merge conflict in native_functions.yaml
Fixed caffe2/contrib/aten
Fix new window functions added to TensorFactories.cpp
* Removed torch::TensorOptions
Added code to generate wrapper functions for factory methods
Add implicit constructor from Backend to TensorOptions
Remove Var() from C++ API and use torch:: functions
Use torch:: functions more subtly in C++ API
Make AutoGPU::set_device more exception safe
Check status directly in DynamicCUDAHooksInterface
Rename AutoGPU to DeviceGuard
Removed set_requires_grad from python_variables.h and warn appropriately in Variable::set_requires_grad
remove python_default_init: self.type()
Add back original factory functions, but with deprecation warnings
Disable DeviceGuard for a couple functions in ATen
Remove print statement
Fix DeviceGuard construction from undefined tensor
Fixing CUDA device compiler issues
Moved as many methods as possible into header files
Dont generate python functions for deprecated factories
Remove merge conflict artefact
Fix tensor_options_cuda.cpp
Fix set_requires_grad not being checked
Fix tensor_new.h
TEMPORARILY put some methods in .cpp files to see if it solves issues on windows and mac
Fix bug in DeviceGuard.h
Missing includes
TEMPORARILY moving a few more methods into .cpp to see if it fixes windows
Fixing linker errors
* Fix up SummaryOps to use new factories
Undo device agnostic behavior of DeviceGuard
Use -1 instead of optional for default device index
Also move DeviceGuard methods into header
Fixes around device index after optional -> int32_t switch
Fix use of DeviceGuard in new_with_tensor_copy
Fix tensor_options.cpp
* Fix Type::copy(
* Remove test_non_float_params from ONNX tests
* Set requires_grad=False in ONNX tests that use ints
* Put layout/dtype/device on Tensor
* Post merge fixes
* Change behavior of DeviceGuard to match AutoGPU
* Fix C++ API integration tests
* Fix flip functions
There is a bug in NCCL that causing seg faults while calling ncclCommDestroy() in the destructor during program exit. According to Nvidia, "Whether the NCCL destructor will be called before or after the CUDA runtime destructor is undefined, which can lead to crashes."
For the immediate workaround, skip calling ncclCommDestroy ihe NCCL destructor. This is UGLY and we'll follow up with Nvidia to solve this ASAP.
Changelist:
- Move *.c to *.cpp
- Change includes of ".c" to ".cpp"
- A bunch of cmake configuration modifying CMAKE_C_FLAGS changed
to CMAKE_CXX_FLAGS or add_compile_options, because if you do CMAKE_C_FLAGS it only applies when you compile C code
- Explicitly cast void* to T* in a number of places
- Delete extern "C" { ... } blocks; instead, properly apply TH_API to everything that should have it (TH_API handles extern "C")
- Stop using stdatomic.h, instead, use <atomic>. This resulted in a bunch of placement-new/delete to be "totally properly correct"
- Refactor of THLongStorageView to not have static constructor methods (since it no longer has a copy/move constructor)
- Documentation about how the TH C interface (and extern C business) works
- Note that THD master_worker mode is dead
- C++ headers in TH libraries are given .hpp suffix, to make it less likely that you'll confuse them with the C-compatible headers (now suffixed .h)
- New function THCStream_stream and THCStream_device to project out fields of THCStream instead of accessing fields directly
- New function THStorage_(retainIfLive), which is equivalent to a retain but only if the refcount is greater than zero.
- In general, I tried to avoid using hpp headers outside of ATen/TH. However, there were a few places where I gave up and depended on the headers for my own sanity. See Note [TH abstraction violation] for all the sites where this occurred. All other sites were refactored to use functions
- Some extra Werror fixes (char* versus const char*)
This deletes most of the dead Tensor code paths, including the TensorMethods cwrap and generic/Tensor.cpp.
This also moves the THNN.cwrap/.cpp generation to generate_code which can use ninja if installed.
The Tensor and Variable classes are being merged in Python. This means
that all interfaces to C++ must accept Variables where they previously
accepted Tensors.