Commit Graph

1727 Commits

Author SHA1 Message Date
Edward Z. Yang
e33f1eeeb7 SymIntify resize_ and deduplicate memory format logic (#90442)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90442
Approved by: https://github.com/bdhirsh
2022-12-11 14:38:38 +00:00
mikey dagitses
c8954a8907 simplify implementation of c10::isIntegralType (#90193)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/90193).
* __->__ #90193

simplify implementation of c10::isIntegralType

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90193
Approved by: https://github.com/ezyang
2022-12-09 12:22:06 +00:00
Nikita Shulga
36ac095ff8 Migrate PyTorch to C++17 (#85969)
With CUDA-10.2 gone we can finally do it!

This PR mostly contains build-system-related changes; more invasive functional ones are to follow.
Among many expected tweaks to the build system, here are a few unexpected ones:
 - Force the onnx_proto project to be updated to C++17 to avoid a `duplicate symbols` error when compiled by gcc-7.5.0, as the storage rule for `constexpr` changed in C++17, but gcc does not seem to follow it
 - Do not use `std::apply` on CUDA but rely on the built-in variant, as it results in test failures when the CUDA runtime picks the host rather than the device function when `std::apply` is invoked from CUDA code.
 - `std::decay_t` -> `::std::decay_t` and `std::move` -> `::std::move`, as VC++ for some reason claims that the `std` symbol is ambiguous (see the sketch after this list)
 - Disable use of `std::aligned_alloc` on Android, as its `libc++` does not implement it.
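
For illustration, a minimal sketch of the qualification change described in the third bullet above (illustrative code, not taken from the PR):

```c++
#include <type_traits>
#include <utility>

// Minimal illustration (not the actual PyTorch code) of the VC++ workaround:
// fully qualifying the std names sidesteps the reported ambiguity.
template <typename T>
auto forward_decayed(T&& t) {
  using U = ::std::decay_t<T>;   // was: std::decay_t<T>
  return U(::std::move(t));      // was: std::move(t)
}
```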

Some prerequisites:
 - https://github.com/pytorch/pytorch/pull/89297
 - https://github.com/pytorch/pytorch/pull/89605
 - https://github.com/pytorch/pytorch/pull/90228
 - https://github.com/pytorch/pytorch/pull/90389
 - https://github.com/pytorch/pytorch/pull/90379
 - https://github.com/pytorch/pytorch/pull/89570
 - https://github.com/facebookincubator/gloo/pull/336
 - https://github.com/facebookincubator/gloo/pull/343
 - 919676fb32

Fixes https://github.com/pytorch/pytorch/issues/56055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85969
Approved by: https://github.com/ezyang, https://github.com/kulinseth
2022-12-08 02:27:48 +00:00
Richard Barnes
ad188a227e Introduce CUDA Device Assertions Infrastructure (#84609)
Summary:
This diff introduces a set of changes that makes it possible for the host to get assertions from CUDA devices. This includes the introduction of

**`CUDA_KERNEL_ASSERT2`**

A preprocessor macro to be used within a CUDA kernel that, upon an assertion failure, writes the assertion message, file, line number, and possibly other information to UVM (Managed memory). Once this is done, the original assertion is triggered, which places the GPU in a Bad State requiring recovery. In my tests, data written to UVM appears there before the GPU reaches the Bad State and is still accessible from the host after the GPU is in this state.

Messages are written to a multi-message buffer which can, in theory, hold many assertion failures. I've done this as a precaution in case there are several, but I don't actually know whether that is possible and a simpler design which holds only a single message may well be all that is necessary.

**`TORCH_DSA_KERNEL_ARGS`**

This preprocessor macro is added as an _argument_ to a kernel function's signature. It expands to supply the standardized names of all the arguments needed by `C10_CUDA_COMMUNICATING_KERNEL_ASSERTION` to handle device-side assertions. This includes, e.g., the name of the pointer to the UVM memory the assertion would be written to. This macro abstracts the arguments so there is a single point of change if the system needs to be modified.

**`c10::cuda::get_global_cuda_kernel_launch_registry()`**

This host-side function returns a singleton object that manages the host's part of the device-side assertions. Upon allocation, the singleton allocates sufficient UVM (Managed) memory to hold information about several device-side assertion failures. The singleton also provides methods for getting the current traceback (used to identify when a kernel was launched). To avoid consuming all the host's memory the singleton stores launches in a circular buffer; a unique "generation number" is used to ensure that kernel launch failures map to their actual launch points (in the case that the circular buffer wraps before the failure is detected).
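
For illustration, a hedged, generic sketch of the circular-buffer-plus-generation-number idea described here (names and structure are assumptions, not the actual registry implementation):

```c++
#include <array>
#include <cstddef>
#include <cstdint>
#include <mutex>

struct LaunchRecord {
  uint64_t generation;
  // In the real registry this would also hold the traceback / kernel name.
};

class LaunchRing {
 public:
  // Records a launch and returns the generation number stamped on it.
  uint64_t record_launch() {
    std::lock_guard<std::mutex> guard(mu_);
    const uint64_t gen = next_generation_++;
    ring_[gen % kCapacity] = LaunchRecord{gen};
    return gen;
  }

  // A failure reported later with generation `gen` still maps to its launch
  // record only if the ring has not wrapped past that slot in the meantime.
  bool launch_record_still_present(uint64_t gen) {
    std::lock_guard<std::mutex> guard(mu_);
    return ring_[gen % kCapacity].generation == gen;
  }

 private:
  static constexpr size_t kCapacity = 1024;
  std::array<LaunchRecord, kCapacity> ring_{};
  uint64_t next_generation_ = 0;
  std::mutex mu_;
};
```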

**`TORCH_DSA_KERNEL_LAUNCH`**

This host-side preprocessor macro replaces the standard
```
kernel_name<<<blocks, threads, shmem, stream>>>(args)
```
invocation with
```
TORCH_DSA_KERNEL_LAUNCH(blocks, threads, shmem, stream, args);
```
Internally, it fetches the UVM (Managed) pointer and generation number from the singleton and appends these to the standard argument list. It also checks to ensure the kernel launches correctly. This abstraction on kernel launches can be modified to provide additional safety/logging.

**`c10::cuda::c10_retrieve_device_side_assertion_info`**

This host-side function checks, when called, that no kernel assertions have occurred. If one has, it raises an exception with:
1. Information (file, line number) about which kernel was launched.
2. Information (file, line number, message) about the device-side assertion.
3. Information (file, line number) about where the failure was detected.

**Checking for device-side assertions**

Device-side assertions are most likely to be noticed by the host when a CUDA API call such as `cudaDeviceSynchronize` is made and fails with a `cudaError_t` indicating
> CUDA error: device-side assert triggered CUDA kernel errors

Therefore, we rewrite `C10_CUDA_CHECK()` to include a call to `c10_retrieve_device_side_assertion_info()`. To make the code cleaner, most of the logic of `C10_CUDA_CHECK()` is now contained within a new function `c10_cuda_check_implementation()` to which `C10_CUDA_CHECK` passes the preprocessor information about filenames, function names, and line numbers. (In C++20 we can use `std::source_location` to eliminate macros entirely!)
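
A minimal sketch of the shape of this restructuring (the macro name and the exact signature of `c10_cuda_check_implementation` below are assumptions, not the literal c10 source):

```c++
#include <cuda_runtime_api.h>

namespace c10 {
namespace cuda {
// Assumed declaration for the purposes of this sketch.
void c10_cuda_check_implementation(
    cudaError_t err,
    const char* filename,
    const char* function_name,
    int line_number,
    bool include_device_assertions);
} // namespace cuda
} // namespace c10

// Sketch only: the macro captures call-site information and forwards the error
// code to a single out-of-line function, which in turn consults
// c10_retrieve_device_side_assertion_info() when the error looks like a
// device-side assert.
#define C10_CUDA_CHECK_SKETCH(EXPR)                       \
  do {                                                    \
    const cudaError_t __err = (EXPR);                     \
    ::c10::cuda::c10_cuda_check_implementation(           \
        __err, __FILE__, __func__, __LINE__,              \
        /*include_device_assertions=*/true);              \
  } while (0)
```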

# Notes on special cases

* Multiple assertions from the same block are recorded
* Multiple assertions from different blocks are recorded
* Launching kernels from many threads on many streams seems to be handled correctly
* If two processes are using the same GPU and one of them fails with a device-side assertion, the other process continues without issue
* X Multiple assertions from separate kernels on different streams seem to be recorded, but we can't reproduce the test condition
* X Multiple assertions from separate devices should all be shown upon exit, but we've been unable to generate a test that produces this condition

Differential Revision: D37621532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84609
Approved by: https://github.com/ezyang, https://github.com/malfet
2022-12-08 01:26:07 +00:00
mikey dagitses
368a1cbd02 fix c10::detail::integer_iterator for C++17 (#90174)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/90174).
* __->__ #90174

fix c10::detail::integer_iterator for C++17

Summary: std::iterator is deprecated.
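
For context, a simplified sketch of the usual fix for this deprecation (illustrative, not the actual c10 code): the member type aliases formerly supplied by the deprecated `std::iterator` base class are declared directly.

```c++
#include <cstddef>
#include <iterator>

// Simplified sketch, not the actual c10::detail::integer_iterator: the five
// member type aliases replace inheritance from the deprecated std::iterator.
template <typename I>
struct integer_iterator /* was: : std::iterator<std::forward_iterator_tag, I> */ {
  using iterator_category = std::forward_iterator_tag;
  using value_type = I;
  using difference_type = std::ptrdiff_t;
  using pointer = I*;
  using reference = I&;
  // operator*, operator++, operator== etc. stay as they were
};
```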

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90174
Approved by: https://github.com/clee2000, https://github.com/malfet
2022-12-05 18:39:47 +00:00
Lukas N Wirz
301d9c0556 Remove deprecated usage of is_pod/is_pod_v (#88918)
… as equivalent replacements for std::is_pod and std::is_pod_v because they are deprecated in C++20.

When consuming libtorch header files in a project that uses C++20, there are warnings about std::is_pod being deprecated.  This patch fixes that issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88918
Approved by: https://github.com/ezyang
2022-12-05 16:50:00 +00:00
Driss Guessous
78bdb858f9 Call _sdp_attention in nn.functional.mha (#89470)
# Summary
Replaces the inline block of code in nn.functional.mha with `_scaled_dot_product_attention`. This function allows the fused kernels to be called if all the required input conditions are met.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89470
Approved by: https://github.com/cpuhrsch, https://github.com/mikekgfb
2022-12-02 19:46:22 +00:00
chengscott
9dffc56008 Intel compiler support in c10/util/TypeIndex.h (#89610)
Build passed with icc (ICC) 2021.7.1 20221019.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89610
Approved by: https://github.com/kit1980
2022-12-02 05:32:21 +00:00
Sean Ross-Ross
5f881ac2d1 Adding dispatch alias 'FuncTorchBatchedDecomposition' (#88771)
part of https://github.com/pytorch/functorch/issues/1009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88771
Approved by: https://github.com/zou3519
2022-12-02 04:38:28 +00:00
PyTorch MergeBot
f1415b8cb6 Revert "Call _sdp_attention in nn.functional.mha (#89470)"
This reverts commit 4d7ec30220.

Reverted https://github.com/pytorch/pytorch/pull/89470 on behalf of https://github.com/jeanschmidt due to breaking internal builds
2022-11-30 16:16:24 +00:00
PyTorch MergeBot
4cc5be3a06 Revert "Add bits tensor types (#88594)"
This reverts commit f3b1315eee.

Reverted https://github.com/pytorch/pytorch/pull/88594 on behalf of https://github.com/jeanschmidt due to breaking internal builds
2022-11-30 11:37:56 +00:00
Driss Guessous
4d7ec30220 Call _sdp_attention in nn.functional.mha (#89470)
# Summary
Replaces the inline block of code in nn.functional.mha with `_scaled_dot_product_attention`. This function allows the fused kernels to be called if all the required input conditions are met.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89470
Approved by: https://github.com/cpuhrsch, https://github.com/mikekgfb
2022-11-29 03:02:10 +00:00
Angela Yi
f3b1315eee Add bits tensor types (#88594)
TODO (in later PRs)
- [ ] the other bits8, 4x2, 2x4, 1x8
- [ ] bits printer function
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88594
Approved by: https://github.com/ezyang
2022-11-28 23:39:57 +00:00
Jean Schmidt
d089fbdc33 suppress Werror introduced by lack of override by #86786 on bool initialized() (#89687)
2022-11-28 15:16:15 +01:00
mfkasim1
1588ea0dbf Added log1p for complex in c10 (#89214)
One PR towards #89205.
The content is mostly from PR #38465, but the expression was slightly changed to make it faster.

Here is some benchmarking code:
```c++
#include <complex>
#include <iostream>
#include <chrono>

// main.cc

template<typename T> inline std::complex<T> log1p_v0(const std::complex<T> &z) {
    // this PR
    T x = z.real();
    T y = z.imag();
    T theta = std::atan2(y, x + T(1));
    T r = x * (x + T(2)) + y * y;
    return {T(0.5) * std::log1p(r), theta};
}

template<typename T> inline std::complex<T> log1p_v1(const std::complex<T> &z) {
    // PR #38465
    T x = z.real();
    T y = z.imag();
    std::complex<T> p1 = z + T(1);
    T r = std::abs(p1);
    T a = std::arg(p1);
    T rm1 = (x * x + y * y + x * T(2)) / (r + 1);
    return {std::log1p(rm1), a};
}

template<typename T>
inline std::complex<T> log1p_v2(const std::complex<T> &z) {
    // naive, but numerically inaccurate
    return std::log(T(1) + z);
}

int main() {
    int n = 1000000;
    std::complex<float> res(0.0, 0.0);
    std::complex<float> input(0.5, 2.0);
    auto start = std::chrono::system_clock::now();
    for (int i = 0; i < n; i++) {
        res += log1p_v0(input);
    }
    auto end = std::chrono::system_clock::now();
    auto elapsed = end - start;
    std::cout << "time for v0: " << elapsed.count() << '\n';

    start = std::chrono::system_clock::now();
    for (int i = 0; i < n; i++) {
        res += log1p_v1(input);
    }
    end = std::chrono::system_clock::now();
    elapsed = end - start;
    std::cout << "time for v1: " << elapsed.count() << '\n';

    start = std::chrono::system_clock::now();
    for (int i = 0; i < n; i++) {
        res += log1p_v2(input);
    }
    end = std::chrono::system_clock::now();
    elapsed = end - start;
    std::cout << "time for v2: " << elapsed.count() << '\n';
    std::cout << res << '\n';
}
```

Compiling with `g++ main.cc` and running the resulting program produces the following results:
```
time for v0: 237812271
time for v1: 414524941
time for v2: 360585994
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89214
Approved by: https://github.com/lezcano
2022-11-24 11:11:51 +00:00
Charlie West-Taylor
953f39578a Mark IPU device as not supports_as_strided (#89130)
Currently causes issues in calls to `.to`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89130
Approved by: https://github.com/albanD
2022-11-23 19:51:53 +00:00
Emilio Castillo
c9d4390d13 Add Pluggable CUDA allocator backend (#86786)
Fixes #43144

This uses the Backend system added by [82682](https://github.com/pytorch/pytorch/pull/82682) to change allocators dynamically during code execution. This will allow us to use RMM, to use CUDA managed memory for portions of the code that do not fit in GPU memory, to write static memory allocators that reduce fragmentation while training models, and to improve interoperability with external DL compilers/libraries.

For example, we could have the following allocator in c++

```c++
#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>

extern "C" {
void* my_malloc(ssize_t size, int device, cudaStream_t stream) {
   void *ptr;
   std::cout<<"alloc "<< size<<std::endl;
   cudaMalloc(&ptr, size);
   return ptr;
}

void my_free(void* ptr) {
   std::cout<<"free "<<std::endl;
   cudaFree(ptr);
}
}
```

Compile it as a shared library
```
nvcc allocator.cc -o alloc.so -shared --compiler-options '-fPIC'
```

And use it from PyTorch as follows

```python
import torch

# Init caching
# b = torch.zeros(10, device='cuda')
new_alloc = torch.cuda.memory.CUDAPluggableAllocator('alloc.so', 'my_malloc', 'my_free')
old = torch.cuda.memory.get_current_allocator()
torch.cuda.memory.change_current_allocator(new_alloc)
b = torch.zeros(10, device='cuda')
# This will error since the current allocator was already instantiated
torch.cuda.memory.change_current_allocator(old)
```

Things to discuss
- How to test this, needs compiling external code ...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86786
Approved by: https://github.com/albanD
2022-11-23 17:54:36 +00:00
kvathupo
8ac58bc2e3 Add nullptr_t overload to c10::intrusive_ptr (#89196)
__What?__

Fixes #82413
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89196
Approved by: https://github.com/ezyang
2022-11-19 21:40:07 +00:00
Nikita Shulga
5654fed23e Export c10/[macros|util] headers to be used by internal inductor builds (#89249)
Summary: Fixes package boundary violation that existed in previous implementation

Test Plan: CI

Differential Revision: D41391862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89249
Approved by: https://github.com/izaitsevfb
2022-11-18 10:51:07 +00:00
Sherlock Huang
f1fb586bc6 Symintify repeat_interleave.self_int (#89111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89111
Approved by: https://github.com/ezyang
2022-11-18 05:04:02 +00:00
Rachel030219
70fb673e51 Use software approach to catch overflow (c10/utils/safe_numerics.h) on ARM devices (#89042)
Fixes #89040
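
For illustration, a hedged sketch of a purely software unsigned multiply-overflow check of the kind such a change typically uses (assumed example, not the literal safe_numerics.h code):

```c++
#include <cstdint>

// Illustrative software check (an assumption, not the actual c10 code):
// unsigned multiplication wraps, so overflow is detected by checking whether
// dividing the product by one operand recovers the other.
inline bool mul_overflows_u64(uint64_t a, uint64_t b, uint64_t* out) {
  *out = a * b;
  return a != 0 && *out / a != b;
}
```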

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89042
Approved by: https://github.com/malfet
2022-11-17 05:55:28 +00:00
Edward Z. Yang
4908a12542 Reland "SymIntify convolution backend calculation (#89069)" (#89142)
This reverts commit 90db86be10.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89142
Approved by: https://github.com/albanD, https://github.com/malfet
2022-11-16 21:41:47 +00:00
PyTorch MergeBot
90db86be10 Revert "SymIntify convolution backend calculation (#89069)"
This reverts commit 09ed8b67e2.

Reverted https://github.com/pytorch/pytorch/pull/89069 on behalf of https://github.com/DanilBaibak due to breaking internal builds
2022-11-16 16:36:27 +00:00
Edward Z. Yang
09ed8b67e2 SymIntify convolution backend calculation (#89069)
We will need this to implement a convolution meta function that
is SymInt aware.  I use templates so that regular convolution code
is not affected by the change.  No tests for symbolic ints directly; that will
come in a subsequent PR which also needs to refactor fake tensors.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89069
Approved by: https://github.com/SherlockNoMad
2022-11-16 14:02:43 +00:00
Edward Z. Yang
d96dd8ff09 Add int64_t, SymInt overloads for all binary operators in C++ (#89063)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89063
Approved by: https://github.com/SherlockNoMad
2022-11-16 01:08:31 +00:00
Aaron Gokaslan
48dc24ddce Fix: [ATen] Add some missing moves (#88514)
Related to #88512, but for ATen. This should reduce a number of copies and inefficient atomic smart pointer increments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88514
Approved by: https://github.com/jgong5, https://github.com/ezyang
2022-11-13 22:05:41 +00:00
Edward Z. Yang
46796fe5e9 Fix XLA symbolic shapes binding (#88928)
Obsoletes https://github.com/pytorch/pytorch/pull/88772

Mostly revolves around NOT assuming that the inner object is a SymNode,
but instead treating it as duck-typed to behave like a SymNode.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88928
Approved by: https://github.com/SherlockNoMad
2022-11-13 00:31:27 +00:00
Eddie Yan
3e30a9ea1c Fix CUDA_MAX_THREADS_PER_SM for sm_87 (#88644)
#88326
CC @ngimel @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88644
Approved by: https://github.com/ngimel
2022-11-08 19:44:23 +00:00
Howard Huang
bc66ddb5cb Add torch.distributed.DistBackendError exception type, thrown from C10D_NCCL_CHECK (#88134)
Currently all of the distributed errors are thrown from the `TORCH_CHECK` macro, which throws a generic `RuntimeError`. This change introduces a new error type `DistBackendError`, which derives from `RuntimeError`, to signify that there was an error with the backend communication library. This allows for better error handling and analysis at higher levels in the stack. Motivation: https://docs.google.com/document/d/1j6VPOkC6znscliFuiDWMuMV1_fH4Abgdq7TCHMcXai4/edit#heading=h.a9rc38misyx8

Changes:
- introduce new error type
- Update `C10D_NCCL_CHECK`

Sample script to demonstrate new error type

```python
# python -m torch.distributed.run --nproc_per_node=2 <script>.py

import torch
import torch.distributed as dist

if __name__ == "__main__":
    dist.init_process_group("nccl")
    dist.broadcast(torch.tensor([1, 2, 3]).cuda(), 0)
```

Differential Revision: [D40998803](https://our.internmc.facebook.com/intern/diff/D40998803)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88134
Approved by: https://github.com/rohan-varma
2022-11-08 13:26:42 +00:00
biubiuX
ced71e8e82 [Pytorch] add an option to disable TORCH_WARN and TORCH_WARN_ONCE log (#87188)
Summary: Add an option to disable TORCH_WARN; some ops could trigger spammy TORCH_WARN logs, which are not desired in certain scenarios.

Test Plan:
Tested with
-pt.disable_warn = 1 and -pt.disable_warn = 0

Verified that TORCH_WARN and TORCH_WARN_ONCE are properly handled.

Tested with
-pt.strip_error_messages = 1, -pt.disable_warn = 0

Verified that strip_error_messages is respected when a warning is printed.

Differential Revision: D40321550

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87188
Approved by: https://github.com/kurtamohler, https://github.com/ezyang
2022-11-08 04:49:45 +00:00
Edward Z. Yang
825f4e602b Add support for symbolic shapes to sparse tensor (#88573)
Along the way, I undid making sparse/dense dim symint (they're
dimensions, so they should be static.)

Also symintify set_indices_and_values_unsafe

There is a little bit of a nontrivial infra change here: previously, we didn't populate the strides field on sparse tensors. It is now populated with "empty" strides, and this meant that sparse tensors were falsely reporting they were non-overlapping dense/contiguous. I added in a hack to work around this case.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88573
Approved by: https://github.com/anjali411
2022-11-08 03:13:42 +00:00
Aaron Gokaslan
b14e06503a (fix): Add some missing std::moves to C10 (#88512)
I saw some missed optimization opportunities in C10 using std::move and thought I would submit a PR to fix them. There are particularly many of them dealing with the symbolic operators, which are used in quite a few places, including in loops.
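
For illustration, a generic sketch of the kind of change involved (names are hypothetical, not from the PR): moving a ref-counted handle such as a symbolic int into its destination avoids an atomic refcount increment on every loop iteration.

```c++
#include <utility>
#include <vector>

// Generic illustration (not the actual C10 code): push_back(std::move(s))
// transfers ownership instead of bumping a reference count each iteration,
// which matters for SymInt-like handle types used inside loops.
template <typename SymIntLike>
std::vector<SymIntLike> collect(std::vector<SymIntLike> src) {
  std::vector<SymIntLike> out;
  out.reserve(src.size());
  for (auto& s : src) {
    out.push_back(std::move(s));  // was: out.push_back(s);
  }
  return out;
}
```
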
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88512
Approved by: https://github.com/ezyang
2022-11-07 22:17:13 +00:00
Edward Z. Yang
0e3031f7e7 Functionalize and compute joint simultaneously. (#88063)
This also comes with some bug fixes that were uncovered from doing
this:

- Forward device calls to inner tensor in FunctionalTensorWrapper

- Make legacyExtractDispatchKey exclude Functionalize, so that
  it can get at the real device type key.  This is noncontroversial.

- Stop stripping dense from key set.  The reason for this is
  FunctionalWrapperTensor may be used in contexts where people
  query if it is dense or not.  If it doesn't report this correctly
  (from the dispatch key), it will cause errors.  This caused some
  torchbench models to fail when I did one-pass tracing.

- Save and restore reapply views TLS correctly

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88063
Approved by: https://github.com/bdhirsh
2022-11-05 03:52:40 +00:00
Codrin Popa
5b767d404e Modified roundup_power2_divisions to specify the number of divisions for each power of two interval (#87290)
Summary:
Improved the roundup_power2_divisions knob so it allows better control of rounding in the PyTorch CUDA caching allocator.

This new version allows setting the number of divisions per power-of-two interval starting from 1MB and ending at 64GB and above. An example use case is when rounding is desirable for small allocations, but there are also very large, persistent allocations that would not benefit from rounding and would take up extra space.

Test Plan: Tested locally

Differential Revision: D40103909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87290
Approved by: https://github.com/zdevito
2022-11-04 19:31:16 +00:00
Pruthvi Madugundu
fbd08fb358 Introduce TORCH_DISABLE_GPU_ASSERTS (#84190)
- Asserts for CUDA are enabled by default
- Disabled for ROCm by default by setting `TORCH_DISABLE_GPU_ASSERTS` to `ON`
- Can be enabled for ROCm by setting the above variable to `OFF` during the build, or forcefully enabled by setting `ROCM_FORCE_ENABLE_GPU_ASSERTS:BOOL=ON`

This is a follow-up change as per a comment in PR #81790, comment [link](https://github.com/pytorch/pytorch/pull/81790#issuecomment-1215929021)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84190
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2022-11-04 04:43:05 +00:00
Edward Z. Yang
f884e817d4 Make Python op registration work with torchdeploy/multipy (#87162)
See strategy at PythonOpRegistrationTrampoline.cpp for the
big picture.

Along the way, I made OperatorHandle support == and hashing,
and slightly changed the low level python_dispatch impl API
to disallow empty strings for dispatch key, which had the knock
on effect of requiring us to explicitly make sure we pass in
CompositeImplicitAutograd if we would have passed in "" (I didn't apply
this to the rest of the file because I'm lazy.)

Test strategy is we delete the logic for preventing Python op
registrations in torch from being skipped in a torchdeploy context
and show CI still works.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87162
Approved by: https://github.com/anjali411, https://github.com/bdhirsh
2022-11-03 12:56:44 +00:00
Richard Barnes
e59d307e2f Improve perf by avoiding implicit string creation in c10_cuda_check_implementation (#88350)
Test Plan: Sandcastle

Differential Revision: D40949947

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88350
Approved by: https://github.com/Skylion007, https://github.com/soumith
2022-11-03 02:48:41 +00:00
Scott Wolchok
1c0d47cb17 [PyTorch] Make c10::irange(x) generate the same assembly as for loop (#86841)
`c10::irange(n)` generated an extra `sar` and `andn` instruction compared to a traditional `for` loop. Now it doesn't.
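
For context, a side-by-side of the two loop forms being compared (illustrative snippet, not from the diff):

```c++
#include <c10/util/irange.h>
#include <cstdint>

// Both loops are intended to compile to identical code after this change.
int64_t sum_traditional(int64_t n) {
  int64_t s = 0;
  for (int64_t i = 0; i < n; ++i) s += i;
  return s;
}

int64_t sum_irange(int64_t n) {
  int64_t s = 0;
  for (const auto i : c10::irange(n)) s += i;
  return s;
}
```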

Differential Revision: [D40321009](https://our.internmc.facebook.com/intern/diff/D40321009/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86841
Approved by: https://github.com/r-barnes, https://github.com/malfet
2022-11-02 21:34:22 +00:00
PyTorch MergeBot
0fa23663cc Revert "Introduce TORCH_DISABLE_GPU_ASSERTS (#84190)"
This reverts commit 1e2c4a6e0e.

Reverted https://github.com/pytorch/pytorch/pull/84190 on behalf of https://github.com/malfet due to Needs internal changes, has to be landed via co-dev
2022-11-02 18:13:37 +00:00
Pruthvi Madugundu
1e2c4a6e0e Introduce TORCH_DISABLE_GPU_ASSERTS (#84190)
- Asserts for CUDA are enabled by default
- Disabled for ROCm by default by setting `TORCH_DISABLE_GPU_ASSERTS` to `ON`
- Can be enabled for ROCm by setting the above variable to `OFF` during the build, or forcefully enabled by setting `ROCM_FORCE_ENABLE_GPU_ASSERTS:BOOL=ON`

This is a follow-up change as per a comment in PR #81790, comment [link](https://github.com/pytorch/pytorch/pull/81790#issuecomment-1215929021)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84190
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2022-11-02 17:41:57 +00:00
Nikita Shulga
a6acbad5c3 [BE] Use default constructor in LoggerVoidify (#88054)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88054
Approved by: https://github.com/kit1980
2022-11-01 03:59:51 +00:00
sanchitintel
9c793b366f Move incorrectly placed closing curly brace of extern "C" block (#87853)
### Bug description
When `__SYCL_DEVICE_ONLY__` is defined, while building PyTorch, the output of the preprocessing step would not have the closing curly brace of the `extern "C"` block, as it was incorrectly placed. Compilers don't seem to report an error or a warning for a missing closing brace of an `extern "C"` block.

### Impact of the bug
If `c10/macros/Macros.h` were included in a C++ file, and, after the preprocessing stage, the preprocessed source file had some templated code after `extern "C" {`, then, after compilation, linking might fail with the error `templates must have c++ linkage`, e.g. https://stackoverflow.com/questions/61717819/template-with-c-linkage-error-when-using-template-keyword-in-main-cpp/61717908#61717908 (its answer also has a small snippet of code to reproduce such an issue).
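
A generic illustration of the linkage issue (not the actual `Macros.h` contents): the `extern "C"` block must be closed before any template code follows.

```c++
// Generic illustration, not the real c10/macros/Macros.h: C-linkage
// declarations stay inside the braces, and the block is closed before any
// template appears, so the template keeps C++ linkage.
extern "C" {
void some_c_api(int);   // C linkage, as intended
}                       // this closing brace must not be compiled out

template <typename T>
T twice(T x) { return x + x; }  // would trigger "templates must have C++ linkage"
                                // if the brace above were missing
```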

### Solution in this PR
A one-liner bug fix that rectifies the placement of the closing curly brace (`}`), so that the `extern "C"` block ends properly when `__SYCL_DEVICE_ONLY__` is defined.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87853
Approved by: https://github.com/jgong5, https://github.com/kit1980, https://github.com/malfet
2022-10-28 03:42:20 +00:00
Edward Z. Yang
1ff52225f1 Unify SymIntNode and SymFloatNode into SymNode (#87817)
This refactor was prompted by challenges handling mixed int/float
operations in C++.  A previous version of this patch added overloads
for each permutation of int/float and was unwieldy
(https://github.com/pytorch/pytorch/pull/87722/).  This PR takes a
different approach.

The general outline of the patch is to combine the C++ types SymIntNode
and SymFloatNode into a single type, SymNode.  This is type erased; we
no longer know statically at C++ if we have an int/float and have to test
it with the is_int()/is_float() virtual methods.  This has a number of
knock on effects.

- We no longer have C++ classes to bind to Python.  Instead, we take an
  entirely new approach to our Python API, where we have a SymInt/SymFloat
  class defined entirely in Python, which hold a SymNode (which corresponds
  to the C++ SymNode).  However, SymNode is not pybind11-bound; instead,
  it lives as-is in Python, and is wrapped into C++ SymNode using PythonSymNode
  when it goes into C++.  This implies a userland rename.

  In principle, it is also possible for the canonical implementation of SymNode
  to be written in C++, and then bound to Python with pybind11 (we have
  this code, although it is commented out.)  However, I did not implement
  this as we currently have no C++ implementations of SymNode.

  Because we do return SymInt/SymFloat from C++ bindings, the C++ binding
  code needs to know how to find these classes.  Currently, this is done
  just by manually importing torch and getting the attributes.

- Because SymInt/SymFloat are easy Python wrappers, __sym_dispatch__ now
  takes SymInt/SymFloat, rather than SymNode, bringing it in line with how
  __torch_dispatch__ works.

Some miscellaneous improvements:

- SymInt now has a constructor that takes SymNode.  Note that this
  constructor is ambiguous if you pass in a subclass of SymNode,
  so an explicit downcast is necessary.  This means toSymFloat/toSymInt
  are no more.  This is a mild optimization as it means rvalue reference
  works automatically.

- We uniformly use the caster for c10::SymInt/SymFloat, rather than
  going the long way via the SymIntNode/SymFloatNode.

- Removed some unnecessary toSymInt/toSymFloat calls in normalize_*
  functions, pretty sure this doesn't do anything.

- guard_int is now a free function, since to guard on an int you cannot
  assume the method exists.  A function can handle both int and SymInt
  inputs.

- We clean up the magic method definition code for SymInt/SymFloat/SymNode.
  ONLY the user classes (SymInt/SymFloat) get magic methods; SymNode gets
  plain methods; this is to help avoid confusion between the two types.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87817
Approved by: https://github.com/albanD, https://github.com/anjali411
2022-10-27 20:56:02 +00:00
Richard Barnes
85ffbedfb2 Strip GCC5 stuff from PyTorch (#85914)
[This file](https://github.com/pytorch/pytorch/pull/63208/files) indicates that we don't support anything less than GCC 7.5. Given that, let's remove this GCC 5 stuff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85914
Approved by: https://github.com/ezyang
2022-10-26 00:07:44 +00:00
albanD
b085c80126 Add /= to c10::SymInt (#87603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87603
Approved by: https://github.com/bdhirsh
2022-10-24 23:55:13 +00:00
samdow
169ec120ef [Modes] refactor modes to only use a stack in cpp (#86458)
Refactors the mode code to only have the C++ mode stack and not the "C++ mode" like we originally had. This also simplifies the mode logic in a number of places.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86458
Approved by: https://github.com/zou3519
2022-10-21 19:18:23 +00:00
Brian Hirsh
ce0c6e828e Reland "add an API for external backends to register custom device names (#86992)" (#87453)
Re-land of https://github.com/pytorch/pytorch/pull/86992

This reverts commit a895af9250.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87453
Approved by: https://github.com/ezyang, https://github.com/albanD
2022-10-21 16:51:36 +00:00
PyTorch MergeBot
a895af9250 Revert "add an API for external backends to register custom device names (#86992)"
This reverts commit fb6826bfd8.

Reverted https://github.com/pytorch/pytorch/pull/86992 on behalf of https://github.com/jeanschmidt due to breaking internal builds - D40534212 - arstudio-windows-tests-landcastle-0
2022-10-20 14:51:08 +00:00
Zachary DeVito
0d2c2110f1 [allocator] Introduce the abstract class CUDACachingAllocator (#87251)
This replaces the manual function pointers, making it easier to write
new drop-in allocators.

Note that most allocation goes through the Allocator interface, which
CUDAAllocator inherits from, and this arrangement avoids adding an
additional layer of dispatch along this pathway compared to what existed before.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87251
Approved by: https://github.com/wconstab
2022-10-20 01:17:00 +00:00
albanD
12b2f70a89 Symintify pad ops (#87046)
Following comments below, we need to add support for `std::negate`/`std::min`/`std::max`/`operator-` for SymInt
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87046
Approved by: https://github.com/ezyang
2022-10-19 21:43:08 +00:00