Number of OSS PR were reverted, because new signed-unsigned comparison warnings, which are treated as errors in some internal builds.
Not sure how those selective rules are applied, but this PR removes `-Wno-sign-compare` from PyTorch codebase.
The only tricky part in this PR, as making sure that non-ASCII character detection works for both signed and unsigned chars here:
6e3d51b08a/torch/csrc/jit/serialization/python_print.cpp (L926)
Exclude several files from sign-compare if flash attention is used, due to the violation in cutlass, to be fixed by https://github.com/NVIDIA/cutlass/pull/869
Do not try to fix sign compare violations in caffe2 codebase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96723
Approved by: https://github.com/albanD
Summary:
This PR suppresses clang-tidy warnings in the codebase (for now) so that we can re-enable clang-tidy checks on master.
I ran this script to add the `NOLINTNEXTLINE` comments (on a devserver):
```bash
python3 setup.py develop
# Uses same script that's run on CI and adds the -j (parallel), -s (add comments), -k (continue if diagnostic errors are found) options
python3 tools/clang_tidy.py \
-j \
-s \
-k \
-v \
--paths torch/csrc/ \
-g"-torch/csrc/jit/passes/onnx/helper.cpp" \
-g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \
-g"-torch/csrc/jit/serialization/onnx.cpp" \
-g"-torch/csrc/jit/serialization/export.cpp" \
-g"-torch/csrc/jit/serialization/import.cpp" \
-g"-torch/csrc/jit/serialization/import_legacy.cpp" \
-g"-torch/csrc/onnx/init.cpp" \
-g"-torch/csrc/cuda/nccl.*" \
-g"-torch/csrc/cuda/python_nccl.cpp" \
-g"-torch/csrc/autograd/FunctionsManual.cpp" \
-g"-torch/csrc/generic/*.cpp" \
-g"-torch/csrc/jit/codegen/cuda/runtime/*" \
-g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
-g"-torch/csrc/deploy/interpreter/interpreter.h" \
-g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
-g"-torch/csrc/deploy/interpreter/test_main.cpp"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60649
Test Plan: Verified changes by re-running the script (without the `-s` option) and seeing no warnings/errors.
Reviewed By: walterddr, janeyx99
Differential Revision: D29504258
Pulled By: 1ntEgr8
fbshipit-source-id: 78310b30ee8213b73ddb4771ad874665323e7a4e
Summary:
Switches most of the simple for loops outside of `jit` directories to use `c10::irange`.
Generated with D28874212.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59481
Test Plan: Sandcastle
Reviewed By: ngimel
Differential Revision: D28909681
fbshipit-source-id: ec9ab1bd602933238d9d0f73d4d8d027b75d9d85
Summary:
In my last PR I've missed CUDA and distributed folders, fixing this now
This change is autogenerated by `python tool/clang_tidy.py -s`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57235
Reviewed By: janeyx99
Differential Revision: D28084444
Pulled By: malfet
fbshipit-source-id: bf222f69ee90c7872c3cb0931e8cdb84f0cb3cda
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45900
Use `torch:cuda::nccl:all2all` from `ProcesGroupNCCL.cpp`
Fixes https://github.com/pytorch/pytorch/issues/42517
Here is a NCCL dependency graph:
```
libnccl.a --> libtorch_cuda.so ---> libtorch_python.so
| ^
| |
--------> libc10d.a -----------------
```
When static library is linked into a dynamic library or an executable, linker is removes all unused/duplicate symbols from that library, unless `-whole-archive` option is used. Before https://github.com/pytorch/pytorch/pull/42514 all nccl call made from `ProcessGroupNCCL.cpp` were also made from `torch/csrc/cuda/nccl.cpp`, which is compiled as part of `libtorch_cuda.so`
But adding `ncclSend`|`ncclRecv` to ProcesGroupNCCL.cpp forced linker to embed those into `libtorch_python.so`, which also resulted in linking other dependent symbols into the library.
This PR adds `nccl[Send|Recv]` call to `torch_cuda.so` by implementing `all2all` in `torch_cuda` and thus avoids double linking the static library.
More involved, but prone solution, would be to use wrappers exported in `torch::cuda::nccl` namespace, instead of making direct NCCL API calls.
Test Plan: Imported from OSS
Reviewed By: mingzhe09088
Differential Revision: D24138011
Pulled By: malfet
fbshipit-source-id: 33305197fc7d8707b7fd3a66b543f7733b9241a1
Summary:
The error message produced by AT_ASSERT() in gather() encouraged users to file a bug report ("please report a bug to PyTorch..."). The assertion should be a regular argument check since it can be triggered by passing tensors with different dimensionality, e.g. `torch.cuda.comm.gather([torch.rand(1, device='cuda'), torch.rand(1, 1, device='cuda')])`.
See: https://github.com/pytorch/pytorch/issues/26400
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27456
Differential Revision: D19300270
Pulled By: ezyang
fbshipit-source-id: ec87d225e23445020b377521e0daccceb4748215
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29653
I didn't remove is_variable from Tensor for BC reasons, but I did
remove as many uses as I could from the codebase.
at::impl::variable_excluded_from_dispatch got moved to TensorBody.h
so that it's more widely accessible.
This diff is NOT semantics preserving. Here are the major differences:
- In a number of native operator implementations, we tested that arguments
are not variable. I replaced these with asserts that variable is
excluded from dispatch. I actually don't think these asserts are really
necessary now (they should certainly be true, but it's hard to get
it wrong), but I've kept them for old time's sake. At least, they'll detect
if you call these functions before you've processed variable (indicating
a bug in your kernel.)
- There are a number of places where we do a per-tensor test for being a
variable, for better error reporting when someone commits Tensor/Variable
confusion. Although these tests are substantively the same as the
tests above, in these cases I decided to *delete* the test entirely.
The reasoning is that in these cases, we didn't really care about
dispatch (also, see above; I'm not too sure we really need the dispatch
asserts), we cared about Tensor/Variable confusion. Since Tensor/Variable
confusion is impossible now, we don't need the tests. One of the key
factors which pushed me one way or another was whether or not a function
was doing per-tensor validation; if I kept the assert in such functions,
I'd repeatedly access the TLS. Even if we want to bring back the asserts,
they would have to go somewhere else.
Another similar idiom is the number of places we do !x.defined() ||
x.is_variable(); I treated this equivalently.
- nuclear_norm's computation of compute_uv is a bit weird, but I think
it's OK to just delete the is_variable case (I *suspect* that it is
always the case that self.is_variable(), but it doesn't really matter.)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D18496168
Pulled By: ezyang
fbshipit-source-id: 5a1ded931e0c10a6b758ba64a8380d34110e0c3e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29213
A trivial use of make_variable is one where requires_grad=False. This
transformation is not technically semantics preserving, as make_variable
will create a shallow copy of the tensor in question; however, I
am guessing that we have the invariant that we don't actually make
use of this shallow copy in a nontrivial way.
There were some cases where the surrounding code expected a Variable proper
to be returned; I retained those sites.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D18353503
Pulled By: ezyang
fbshipit-source-id: 57fe34d82e009c0cc852266fb0b79d6d9c62bb03
Summary:
The original PR (#16071) is not working anymore after `caffe2` and `torch` is unified. What's more, It is making the binary big since the optimizing flag is disabled on a very big project(the `torch` library used to be small, but it now applies on the whole `caffe2` and `caffe2_gpu` library). We need to get it reverted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21335
Differential Revision: D15622163
Pulled By: soumith
fbshipit-source-id: 900bd400106d27a1512eed1e9f2288114f5f41bb
Summary:
As part of the Variable/Tensor merge work: https://github.com/pytorch/pytorch/issues/13638, we make the following changes in this PR:
1. Remove the `Variable::Impl` class and the `DifferentiableViewImpl` class
2. Change all `Variable.data()` call sites to either use `Variable` directly, or use `Variable.tensor_data()`
3. Remove `Variable.data()` API
3. Add `Variable.variable_data()` that matches `tensor.data` in Python API, which creates a new `Variable` that shares the same storage and tensor metadata with the original `Variable`, but with a completely new autograd history.
After this PR, Variable doesn't wrap a Tensor internally anymore, and both Variable and Tensor use the same TensorImpl class as its `impl_`. The only difference is that Variable always has AutogradMeta in its TensorImpl, but Tensor doesn't.
**Note that this PR is BC-breaking in the following use cases:**
**Use Case 1:**
Previously, `x.data = y` works even if `x` and `y` are of different TensorImpl type (e.g. `x` is a CPU dense tensor whose impl is of type TensorImpl, while `y` is a CPU sparse tensor whose impl is of type SparseTensorImpl). However, after this PR, `x.data = y` doesn't work anymore if `x` and `y` are of different TensorImpl type, because the underlying implementation `variable.set_data(tensor)` no longer works if `variable` and `tensor` have different TensorImpl type.
**Use Case 2:**
If a tensor `x`'s `grad` is sparse, accumulating dense gradients to `x` will change the tensor that `x.grad` is pointing to. This is better illustrated with the following example:
```python
params = torch.tensor([1.5, 1.5]).requires_grad_()
with torch.no_grad():
# Change gradient to a sparse tensor
params.grad = torch.sparse_coo_tensor(torch.tensor([[1, 1]]).long(), torch.tensor([1., 1.]))
grad_saved = params.grad
params.backward(torch.tensor([1.5, 1.5]))
assert id(grad_saved) == id(params.grad) # This will fail after this PR
```
The assertion in the last line will fail after this PR, because adding dense gradients to sparse gradients will change the `params.grad` tensor reference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17072
Differential Revision: D14075257
Pulled By: yf225
fbshipit-source-id: 0e681df641270dea586042dd26db59f2e76b5957
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17991
changes:
-Breaks bc: Tensor::type() now returns DeprecatedTypeProperties& rather than Type&.
-Added DeprecatedTypeProperties, it serves as a temporary replacement for Type as the return value of Tensor::type(). This contributes to making Type just for dispatch purposes so that we can make it dtype agnostic.
-Tensor::dispatch_type() now returns Type& like Tensor::type() used to do.
-Changed callsites of Tensor::type() appropriately.
Reviewed By: ezyang
Differential Revision: D14443117
fbshipit-source-id: 239ccb7a09626279a71d1a37f8f82e7f57bf7d9e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16751
This was made more complicated by the fact that ivalue::IntList
is a thing. So I had to fix all of the sites where we referring
to IValue post facto.
The following codemods were run, in this order:
```
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in IntList IntArrayRef
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in IntArrayRef::create IntList::create
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in ivalue::IntArrayRef ivalue::IntList
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in Tag::IntArrayRef Tag::IntList
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in isIntArrayRef isIntList
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in toIntArrayRef toIntList
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in 'Shared<IntArrayRef>' 'Shared<IntList>'
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in 'intrusive_ptr<IntArrayRef>' 'intrusive_ptr<IntList>'
```
Some manual fixups were done afterwards; they can be reviewed separately
at https://github.com/pytorch/pytorch/pull/16752
Reviewed By: dzhulgakov
Differential Revision: D13954363
fbshipit-source-id: b5c40aacba042402155a2f5a229fa6db7992ac64
Summary:
This puts stubs in the autograd profiler for the use of cuda APIs allowing the cuda parts of libtorch to be linked separately from the CPU parts.
This also edits the buck build.
Previous:
For GPU builds:
_C -> csrc -> caffe2
For CPU builds:
_C -> csrc-cpu -> caffe2
Now:
GPU:
_C -> libtorch_cuda -> (libtorch -> caffe2, for CPU)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15898
Reviewed By: ailzhang
Differential Revision: D13617991
Pulled By: zdevito
fbshipit-source-id: 6d84a50bb356a54b4217f93219902755601b00e1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14248
This diff also introduces a horrifying hack to override CUDA's DeviceGuardImpl
with a HIPGuardImplMasqueradingAsCUDA, to accommodate PyTorch's current
behavior of pretending CUDA is HIP when you build with ROCm enabled.
Reviewed By: bddppq
Differential Revision: D13145293
fbshipit-source-id: ee0e207b6fd132f0d435512957424a002d588f02
Summary:
Anywhere we used #include "foo.h", we now say #include <foo.h>
Paths are adjusted to be rooted out of aten/src, torch/lib, or
the root level directory.
I modified CMakeLists.txt by hand to remove TH and THC from
the include paths.
I used the following script to do the canonicalization:
```
import subprocess
import re
import os.path
files = subprocess.check_output(['git', 'ls-files']).decode('utf-8').rstrip().split('\n')
for fn in files:
if not any(fn.endswith(suff) for suff in ['.cu', '.cpp', '.in', '.h', '.hpp', '.cu', '.cuh', '.cc']):
continue
if not any(fn.startswith(pref) for pref in ["aten/", "torch/"]):
continue
with open(fn, 'r') as f:
c = f.read()
def fmt(p):
return "#include <{}>".format(p)
def repl(m):
p = m.group(1)
if p in ["dlfcn.h", "unistd.h", "nvrtc.h", "cuda.h", "cuda_runtime.h", "cstdint", "cudnn.h", "Python.h", "cusparse.h", "cuda_runtime_api.h", "cuda_fp16.h", "cublas_v2.h", "stdint.h", "curand_kernel.h"]:
return fmt(p)
if any(p.startswith(pref) for pref in ["torch/csrc", "c10/", "ATen/", "caffe2/", "TH/", "THC/", "Eigen/", "gtest/", "zdl/", "gloo/", "onnx/", "miopen/"]):
return fmt(p)
for root in ["aten/src", "torch/lib", ""]:
for bad_root in [os.path.dirname(fn), "aten/src/TH", "aten/src/THC", "torch/csrc"]:
new_p = os.path.relpath(os.path.join(bad_root, p), root)
if not new_p.startswith("../") and (os.path.exists(os.path.join(root, new_p)) or os.path.exists(os.path.join(root, new_p + ".in"))):
return fmt(new_p)
print("ERROR: ", fn, p)
return m.group(0)
new_c = re.sub(r'#include "([^"]+)"', repl, c)
if new_c != c:
print(fn)
with open(fn, 'w') as f:
f.write(new_c)
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14849
Reviewed By: dzhulgakov
Differential Revision: D13363445
Pulled By: ezyang
fbshipit-source-id: 52361f878a672785f9306c9e9ab2513128092b68
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13342
This PR introduces a few new concepts:
- DeviceGuardImplInterface, and implementations for CPU and CUDA, which
provide a generic interface for interfacing with device and stream state,
without requiring a direct dependency on the code in question.
- InlineDeviceGuard, a general template for generating both specialized
and dynamically dispatched device guard implementations. Dynamic
dispatch is done by specializing it on a VirtualGuardImpl.
- Provide a device-independent DeviceGuard class, which can be used even
from CPU code. It uses the aforementioned dynamic dispatch.
- CUDA-specialized CUDAGuard class, which doesn't have a dynamic dispatch
but can only be used from CUDA.
- StreamGuard, which is the same as above, but for streams rather than
devices.
- Optional variants of all the aforementioned guards, which are a no-op if
no device/stream is specified
- CUDAMultiStreamGuard, specifically for the case when we want to set
a device on every guard.
There are some subtle semantic changes, which have been thoroughly documented
in the class definition.
BC-breaking changes:
- Move constructor/assignment have been removed from all device guard
implementations.
- In some cases where you previously wrote 'set_device' (or 'set_stream'), you now must write
'reset_device', because if you switch devices/device types, the stream/device on the
previous device is unset. This is different from previous behavior.
- CUDAGuard no longer handles streams, or multiple streams. Use CUDAStreamGuard
or CUDAMultiStreamGuard as appropriate for your use case.
Reviewed By: dzhulgakov
Differential Revision: D12849620
fbshipit-source-id: f61956256f0b12be754b3234fcc73c2abc1be04e
Summary:
In `broadcast_coalesced`, since multiple variables can be "views" of a big flattened tensor, they can share the same version counter. However, this base flat tensor is not exposed and they don't share any memory locations, so this is not necessary. Furthermore, it can cause problems, e.g., when two buffers are broadcast together in `DataParallel` and one of them is modified in-place during `forward` but the other is needed in backward, autograd engine will complain.
Fixing the bug discovered at https://github.com/pytorch/pytorch/pull/13350#issuecomment-436011370
edit: This is a very real problem. E.g., consider using Spectral Norm + Batch Norm together.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13594
Differential Revision: D12967311
Pulled By: SsnL
fbshipit-source-id: 52998dbabe149f575cf0fb79e7016f0b95e4b9e5
Summary:
Followup to #12841
Changed these to not require type dispatch:
tensor.type().is_cuda() -> tensor.is_cuda()
tensor.type().is_sparse() -> tensor.is_sparse()
isVariable(tensor.type()) -> tensor.is_variable()
This probably does not affect performance
very much in most cases but it is nice to have.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13590
Reviewed By: ezyang
Differential Revision: D12929301
Pulled By: zou3519
fbshipit-source-id: 8ac5c6200c579dd7a44fb4ee58fc9bb170feb1d7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13232
DeviceGuard should be device agnostic, which means that it shouldn't
assume that int64_t means select the CUDA device.
Reviewed By: gchanan
Differential Revision: D10858024
fbshipit-source-id: b40e8337e4046906fd8f83a95e6206367fb29dbe