Commit Graph

604 Commits

Author SHA1 Message Date
Nikita Shulga
ca470fc59f [BE] Make test_no_triton_on_import simple (#102674)
Do not try to parse the raised exception for no good reason
Add a short description
Reduce the script to a single line
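Roughly, the check boils down to a one-line subprocess invocation; a hedged sketch (the exact command used in `test_no_triton_on_import` may differ):

```python
import subprocess, sys

# Spawn a fresh interpreter, import torch, and fail if triton was pulled in as a side effect.
cmd = "import sys, torch; sys.exit(2 if 'triton' in sys.modules else 0)"
rc = subprocess.run([sys.executable, "-c", cmd]).returncode
assert rc == 0, "importing torch implicitly imported triton"
```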

### <samp>🤖 Generated by Copilot at ea4164e</samp>

> _`test_no_triton_on_import`_
> _Cleans up the code, adds docs_
> _No hidden errors_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102674
Approved by: https://github.com/cpuhrsch, https://github.com/albanD
2023-06-01 20:31:18 +00:00
Nikita Vedeneev
d80d3b18d0 nn.Linear with BSR inputs: spare the user from explicit Triton kernel registrations (#98403)
### <samp>🤖 Generated by Copilot at 08f7a6a</samp>

This pull request adds support for triton kernels in `torch` and `torch/cuda`, and refactors and tests the existing triton kernel for BSR matrix multiplication. It also adds a test case to ensure that importing `torch` does not implicitly import `triton`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98403
Approved by: https://github.com/malfet, https://github.com/cpuhrsch
2023-05-31 13:09:45 +00:00
Masaki Kozuki
c8579b7374 Run test_cpp_memory_snapshot_pickle only when linux and x86_64 (#101366)
On Arm, I got

```
Traceback (most recent call last):
  File "/opt/pytorch/pytorch/test/test_cuda.py", line 5260, in test_cpp_memory_snapshot_pickle
    mem = run()
  File "/opt/pytorch/pytorch/test/test_cuda.py", line 5257, in run
    t = the_script_fn()
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 496, in prof_func_call
    return prof_callable(func_call, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 493, in prof_callable
    return callable(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/opt/pytorch/pytorch/test/test_cuda.py", line 5254, in the_script_fn
                @torch.jit.script
                def the_script_fn():
                    return torch.rand(311, 411, device='cuda')
                           ~~~~~~~~~~ <--- HERE
RuntimeError: record_context_cpp is not support on non-linux non-x86_64 platforms
```

dfe484a3b3/torch/csrc/profiler/unwind/unwind.cpp (L4-L24) seems related

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101366
Approved by: https://github.com/zdevito
2023-05-17 19:44:21 +00:00
Elias Ellison
3edff6b6ec Improve detection of workspace/non-output allocations in cudagraphs (#99985)
When we run cudagraph trees we are not allowed to have permanent workspace allocations like in cuBLAS, because we might need to reclaim that memory for a previous cudagraph recording, and because that memory is not accounted for in output weakrefs it does not work with checkpointing. Previously, I would check that we didn't have any additional allocations through snapshotting, but this was extremely slow so I had to turn it off.

This PR first does a quick check to see whether we are in an error state, and only then runs the slower logic of creating a snapshot. It also turns on history recording so we get a stack trace of where the bad allocation came from.
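As an illustration only (the real logic lives in the cudagraph-trees code; this helper is hypothetical), the fast-path/slow-path split looks roughly like:

```python
import torch

def check_for_unexpected_allocations(expected_live_bytes: int):
    # Fast path: a cheap stats lookup; skip the expensive snapshot when nothing looks wrong.
    current = torch.cuda.memory_stats().get("allocated_bytes.all.current", 0)
    if current == expected_live_bytes:
        return None
    # Slow path: take a full snapshot; with history recording enabled it carries the
    # stack traces showing where the unexpected allocation came from.
    return torch.cuda.memory._snapshot()
```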

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99985
Approved by: https://github.com/zdevito
2023-05-01 15:58:45 +00:00
Jane Xu
808267767c Prevent grad scale from overflowing (#98876)
Fixes #98828 by capping the growth in the kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98876
Approved by: https://github.com/ngimel
2023-04-25 20:59:44 +00:00
Aaron Gokaslan
e2a3817dfd [BE] Enable C419 rule for any all shortcircuiting (#99890)
Apparently https://github.com/pytorch/pytorch/pull/78142 made TorchScript (`torch.jit`) accept simple generator expressions, which allows us to enable rules that replace unnecessary list comprehensions with generators in `any`/`all`. This was originally part of #99280, but I split it off into this PR so that it can be easily reverted should anything break.
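For illustration, the pattern C419 flags versus the preferred form:

```python
# any()/all() can short-circuit on a generator expression, while a list
# comprehension materializes every element before any() even starts.
xs = range(10**6)

found_list = any([x > 5 for x in xs])  # flagged by C419: builds the whole list first
found_gen = any(x > 5 for x in xs)     # preferred: stops at the first match
```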

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99890
Approved by: https://github.com/justinchuby, https://github.com/kit1980, https://github.com/malfet
2023-04-25 15:02:13 +00:00
Masaki Kozuki
b87c7ab6d6 Remove redundant found_inf recompute from _step_supports_amp_unscaling path (#98620)
following https://github.com/pytorch/pytorch/pull/97415#issuecomment-1499787115.

Rel: https://github.com/pytorch/pytorch/pull/98613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98620
Approved by: https://github.com/janeyx99
2023-04-20 19:24:09 +00:00
Animesh Jain
971df458db Reland of "Python binding to set/get CUDA rng state offset" (#99565)
Why?
* To reduce the latency of hot path in https://github.com/pytorch/pytorch/pull/97377

Concern - I had to add `set_offset` in all instances of `GeneratorImpl`. I don't know if there is a better way.

~~~~
import torch
torch.cuda.manual_seed(123)
print(torch.cuda.get_rng_state())
torch.cuda.set_rng_state_offset(40)
print(torch.cuda.get_rng_state())

tensor([123,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0], dtype=torch.uint8)
tensor([123,   0,   0,   0,   0,   0,   0,   0,  40,   0,   0,   0,   0,   0,
          0,   0], dtype=torch.uint8)
~~~~

Reland of https://github.com/pytorch/pytorch/pull/98965

(cherry picked from commit 8214fe07e8)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99565
Approved by: https://github.com/anijain2305
2023-04-20 15:42:25 +00:00
PyTorch MergeBot
bb2cd4a107 Revert "Python binding to set/get CUDA rng state offset (#98965)"
This reverts commit 8214fe07e8.

Reverted https://github.com/pytorch/pytorch/pull/98965 on behalf of https://github.com/DanilBaibak due to Break internal build
2023-04-19 11:23:32 +00:00
Animesh Jain
8214fe07e8 Python binding to set/get CUDA rng state offset (#98965)
Why?
* To reduce the latency of hot path in https://github.com/pytorch/pytorch/pull/97377

Concern - I had to add `set_offset` in all instances of `GeneratorImpl`. I don't know if there is a better way.

~~~~
import torch
torch.cuda.manual_seed(123)
print(torch.cuda.get_rng_state())
torch.cuda.set_rng_state_offset(40)
print(torch.cuda.get_rng_state())

tensor([123,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0], dtype=torch.uint8)
tensor([123,   0,   0,   0,   0,   0,   0,   0,  40,   0,   0,   0,   0,   0,
          0,   0], dtype=torch.uint8)
~~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98965
Approved by: https://github.com/kulinseth, https://github.com/ezyang
2023-04-18 07:52:21 +00:00
Zachary DeVito
7ff1f3f3f6 Revert "Revert "Expandable blocks in allocator (#96995)"" (#99275)
This reverts commit 851e89c8e8.

Differential Revision: [D45034526](https://our.internmc.facebook.com/intern/diff/D45034526)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99275
Approved by: https://github.com/eellison
2023-04-17 23:46:08 +00:00
PyTorch MergeBot
851e89c8e8 Revert "Expandable blocks in allocator (#96995)"
This reverts commit 6a50b83b73.

Reverted https://github.com/pytorch/pytorch/pull/96995 on behalf of https://github.com/izaitsevfb due to Breaks internal tests
2023-04-16 19:23:37 +00:00
Zachary DeVito
6a50b83b73 Expandable blocks in allocator (#96995)
Common advice we give for handling memory fragmentation issues is to
allocate a big block upfront to reserve memory which will get split up later.
For programs with changing tensor sizes this can be especially helpful to
avoid OOMs that happen the first time we see a new largest input and would
otherwise have to allocate new segments.

However, the issue with allocating a block upfront is that it is nearly impossible
to correctly estimate the size of that block. If too small, space in the block
will run out and the allocator will allocate separate blocks anyway. Too large,
and other non-PyTorch libraries might stop working because they cannot allocate
any memory.

This patch provides the same benefits as using a pre-allocating block but
without having to choose its size upfront. Using the cuMemMap-style APIs,
it adds the ability to expand the last block in a segment when more memory is
needed.

Compared to universally using cudaMallocAsync to avoid fragmentation,
this patch can fix this common fragmentation issue while preserving most
of the existing allocator behavior. This behavior can be enabled and disabled dynamically.
 This should allow users to, for instance, allocate long-lived parameters and state in individual buffers,
and put temporary state into the large expandable blocks, further reducing
fragmentation.

See inline comments for information about the implementation and its limitations.
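A minimal usage sketch, assuming the feature is toggled through the allocator config string (the exact setting name is an assumption, and it must be applied before CUDA is initialized):

```python
import os
# Hypothetical opt-in via the allocator config; set before the first CUDA call.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
x = torch.empty(1024, 1024, device="cuda")  # served from an expandable segment
```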

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96995
Approved by: https://github.com/eellison
2023-04-14 09:49:11 +00:00
Peeyush Agarwal
ebd4c165ff Back out "GradScaler recomputes optimizer_state["found_inf_per_device"] before optimizer.step (#97415)" (#98613)
Summary: This change causes a multi-GPU job from the XI team to hang after 8K steps.

Differential Revision: D44797248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98613
Approved by: https://github.com/ngimel
2023-04-07 23:31:31 +00:00
Zachary DeVito
b1a83c4da4 [memory history] cleanup recording API (#97406)
This makes the options for recording memory history
easier to understand and makes the default record
the most information.

### <samp>🤖 Generated by Copilot at 4706acf</samp>

This pull request enhances the memory profiling and debugging capabilities of PyTorch on CUDA devices. It introduces a new API for memory history recording in `torch/cuda/memory.py` and `test/test_cuda.py`, and adds new functions for memory snapshot management and visualization in `torch/cuda/memory.py`.

Also adds a quick _dump_snapshot function to make
it easier to look at the common visualizations.
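A hedged sketch of the reworked API as described here (argument values and the `_dump_snapshot` signature follow this description and may differ from what finally landed):

```python
import torch
from torch.cuda import memory

memory._record_memory_history(enabled="all")   # record stack traces and the event history
x = torch.randn(1024, 1024, device="cuda")
y = x @ x
memory._dump_snapshot("memory_snapshot")        # write out the snapshot for visualization
memory._record_memory_history(enabled=None)     # stop recording
```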

### <samp>🤖 Generated by Copilot at 4706acf</samp>

*  Modify the `_record_memory_history` function to use a new API that accepts a string argument for the `enabled` parameter and more parameters to control the stack trace collection and memory event history ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-80bd98caafb20d758f45a4d23711810f7e0b9ce7a6505094f9dbb0e00a657377L620-R696))
* Add a new function `_dump_snapshot` that allows users to dump a memory snapshot to a directory with HTML plots of the memory segments and events ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-80bd98caafb20d758f45a4d23711810f7e0b9ce7a6505094f9dbb0e00a657377R703-R713))
* Update the test cases in `test/test_cuda.py` to use the new API for memory history recording and check the expected output of the memory plots ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L4946-R4946), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L4984-R4984), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5000-R5000), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5015-R5015), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5035-R5038), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R5045-R5046), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5060-R5059), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5068-R5065), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5088-R5085))
* Add missing imports and types to the `torch/cuda/memory.py` module ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-80bd98caafb20d758f45a4d23711810f7e0b9ce7a6505094f9dbb0e00a657377L5-R15))
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97406
Approved by: https://github.com/ezyang
2023-03-28 16:31:10 +00:00
soulitzer
51c3fd39a5 Modify all calls to checkpoint pass use_reentrant explicitly (#97376)

This is the first step toward making use_reentrant=False the default.
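The call-site change looks like this minimal sketch:

```python
import torch
from torch.utils.checkpoint import checkpoint

def block(x):
    return torch.nn.functional.gelu(x @ x.t())

x = torch.randn(8, 8, requires_grad=True)
# Pass use_reentrant explicitly instead of relying on the current default.
out = checkpoint(block, x, use_reentrant=False)
out.sum().backward()
```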
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97376
Approved by: https://github.com/albanD
2023-03-27 13:37:42 +00:00
Masaki Kozuki
b5edf18334 GradScaler recomputes optimizer_state["found_inf_per_device"] before optimizer.step (#97415)
I found a discrepancy between non-fused and fused optimizers: whether to use `optimizer_state["found_inf"]` or to recompute `found_inf`.

- non fused: e64ddd1ab9/torch/cuda/amp/grad_scaler.py (L289)
- fused: e64ddd1ab9/torch/cuda/amp/grad_scaler.py (L353)
    - where `_check_inf_per_device` is e64ddd1ab9/torch/cuda/amp/grad_scaler.py (L564-L573)

The other way to align the behavior is to use the existing `found_inf` in e64ddd1ab9/torch/cuda/amp/grad_scaler.py (L353).

I'd say this PR is for the sake of "safety" and the alternative is to keep the existing behavior.
I honestly have no idea if it's expected to double-check the sanity of gradients in `GradScaler.step`.

---

What I've observed in the huggingface/transformers T5-base example so far is that non-fused optimizers lead to invalid parameters while the fused ones do not.
The cause seems to be that gradients become inf/nan after `GradScaler._unscale_grads_` (more precisely, the call of `torch._amp_foreach_non_finite_check_and_unscale_`) but before `GradScaler.step(optimizer)` in the script of the issue linked below, i.e. the gradient clipping and/or unscaling lead to inf/nan because these happen after the grad check. See
788300cc2a/aten/src/ATen/native/cuda/AmpKernels.cu (L165-L174).

Fixes #96755 🙏

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97415
Approved by: https://github.com/ngimel, https://github.com/janeyx99
2023-03-24 17:36:47 +00:00
Tailing Yuan
63e1f12b49 Speedup bincount and histc on CUDA (#97090)
This is to speed up torch.bincount and torch.histc on CUDA.

1. Speed up the int64_t gpuAtomicAdd.
2. Optimize the histogram kernel.

# Fixes #96626
After speedup, time cost in #96626 would be

```
... (run 2 times and ignore the first run)
case 1 CPU  0.0003631114959716797 seconds
case 1 CUDA 0.0005860328674316406 seconds
case 2 CPU  0.0013742446899414062 seconds
case 2 CUDA 0.0008623600006103516 seconds
```

Note that in "*case 1 CUDA*", the **max** op takes the most time, i.e., 5ee5a164ff/aten/src/ATen/native/cuda/SummaryOps.cu (L334-L335), which is not to be optimized in this PR.

# Benchmark

Time is measured on i7-10700 + RTX 3080, Ubuntu 22.04 (in WSL). The baseline is PyTorch 2.0.0+cu117. My dev version of PyTorch is compiled with CUDA 11.8. Each case is measured 15 times to take the median.

## torch.bincount
#elem | nbins | distribution | CPU (s) | PyTorch 2.0.0 (s) | this PR (s) | speedup
-- | -- | -- | -- | -- | -- | --
2**20 | 80 | random.uniform | 0.000834 | 0.005783 | 0.000266 | 21.8x
2**20 | 80 | narrow in 1 bin | 0.001576 | 0.003967 | 0.000563 | 7.0x
2**20 | 500 | random.uniform | 0.000852 | 0.003641 | 0.000334 | 10.9x
2**20 | 500 | narrow in 1% bins | 0.000894 | 0.001878 | 0.000349 | 5.4x
2**20 | 2048 | random.uniform | 0.000891 | 0.000820 | 0.000298 | 2.8x
2**20 | 2048 | narrow in 1% bins | 0.000958 | 1.043251 | 0.000335 | 3,116.6x
2**26 | 80 | random.uniform | 0.067715 | 0.322409 | 0.003032 | 106.3x
2**26 | 80 | narrow in 1 bin | 0.110940 | 0.194644 | 0.017651 | 11.0x
2**26 | 500 | random.uniform | 0.066666 | 0.192302 | 0.002535 | 75.8x
2**26 | 500 | narrow in 1% bins | 0.066130 | 0.092237 | 0.005462 | 16.9x
2**26 | 2048 | random.uniform | 0.066371 | 0.035308 | 0.002476 | 14.3x
2**26 | 2048 | narrow in 1% bins | 0.068453 | 72.122858 | 0.003185 | 22,644.3x

## torch.histc (float32)
#elem | nbins | distribution | CPU (s) | PyTorch 2.0.0 (s) | this PR (s) | speedup
-- | -- | -- | -- | -- | -- | --
2**20 | 80 | random.uniform | 0.001261 | 0.000145 | 9.47E-05 | 1.5x
2**20 | 80 | narrow in 1 bin | 0.001074 | 0.000356 | 0.000311 | 1.1x
2**20 | 500 | random.uniform | 0.001162 | 0.000227 | 9.18E-05 | 2.5x
2**20 | 500 | narrow in 1% bins | 0.001082 | 0.000201 | 0.000152 | 1.3x
2**20 | 2048 | random.uniform | 0.001100 | 0.000203 | 0.000118 | 1.7x
2**20 | 2048 | narrow in 1% bins | 0.001089 | 0.000396 | 0.000107 | 3.7x
2**26 | 80 | random.uniform | 0.064219 | 0.001170 | 0.000786 | 1.5x
2**26 | 80 | narrow in 1 bin | 0.056471 | 0.013283 | 0.011939 | 1.1x
2**26 | 500 | random.uniform | 0.078183 | 0.003411 | 0.000562 | 6.1x
2**26 | 500 | narrow in 1% bins | 0.056711 | 0.002763 | 0.002738 | 1.0x
2**26 | 2048 | random.uniform | 0.059296 | 0.003503 | 0.000533 | 6.6x
2**26 | 2048 | narrow in 1% bins | 0.061754 | 0.015703 | 0.000962 | 16.3x

## torch.histc (int64)
#elem | nbins | distribution | CPU (s) | PyTorch 2.0.0 (s) | this PR (s) | speedup
-- | -- | -- | -- | -- | -- | --
2**20 | 80 | random.uniform | N/A | 0.005614 | 9.47E-05 | 59.3x
2**20 | 80 | narrow in 1 bin | N/A | 0.003799 | 0.000395 | 9.6x
2**20 | 500 | random.uniform | N/A | 0.003665 | 9.58E-05 | 38.2x
2**20 | 500 | narrow in 1% bins | N/A | 0.001760 | 0.000178 | 9.9x
2**20 | 2048 | random.uniform | N/A | 0.000693 | 0.000111 | 6.2x
2**20 | 2048 | narrow in 1% bins | N/A | 1.082904 | 0.000123 | 8,802.4x
2**26 | 80 | random.uniform | N/A | 0.320400 | 0.001145 | 279.9x
2**26 | 80 | narrow in 1 bin | N/A | 0.193668 | 0.015229 | 12.7x
2**26 | 500 | random.uniform | N/A | 0.182897 | 0.000823 | 222.2x
2**26 | 500 | narrow in 1% bins | N/A | 0.089363 | 0.00376 | 23.8x
2**26 | 2048 | random.uniform | N/A | 0.033190 | 0.000832 | 39.9x
2**26 | 2048 | narrow in 1% bins | N/A | 71.721012 | 0.001525 | 47,017.8x

## Benchmark code

Here is the benchmark code:

```python3
import time
import torch

cases = [
    ("bincount    bins=80   wide  ", torch.randint(80, [2**20]),   lambda x: torch.bincount(x, minlength=80)),
    ("bincount    bins=80   narrow", torch.randint(1, [2**20]),    lambda x: torch.bincount(x, minlength=80)),
    ("bincount    bins=500  wide  ", torch.randint(500, [2**20]),  lambda x: torch.bincount(x, minlength=500)),
    ("bincount    bins=500  narrow", torch.randint(5, [2**20]),    lambda x: torch.bincount(x, minlength=500)),
    ("bincount    bins=2048 wide  ", torch.randint(2048, [2**20]), lambda x: torch.bincount(x, minlength=2048)),
    ("bincount    bins=2048 narrow", torch.randint(20, [2**20]),   lambda x: torch.bincount(x, minlength=2048)),
    ("histc_float bins=80   wide  ", torch.rand(2**20),            lambda x: torch.histc(x, bins=80, min=0., max=1.)),
    ("histc_float bins=80   narrow", torch.rand(2**20)*.01,        lambda x: torch.histc(x, bins=80, min=0., max=1.)),
    ("histc_float bins=500  wide  ", torch.rand(2**20),            lambda x: torch.histc(x, bins=500, min=0., max=1.)),
    ("histc_float bins=500  narrow", torch.rand(2**20)*.01,        lambda x: torch.histc(x, bins=500, min=0., max=1.)),
    ("histc_float bins=2048 wide  ", torch.rand(2**20),            lambda x: torch.histc(x, bins=2048, min=0., max=1.)),
    ("histc_float bins=2048 narrow", torch.rand(2**20)*.01,        lambda x: torch.histc(x, bins=2048, min=0., max=1.)),
    ("histc_int   bins=80   wide  ", torch.randint(80, [2**20]),   lambda x: torch.histc(x, bins=80, min=0., max=80.)),
    ("histc_int   bins=80   narrow", torch.randint(1, [2**20]),    lambda x: torch.histc(x, bins=80, min=0., max=80.)),
    ("histc_int   bins=500  wide  ", torch.randint(500, [2**20]),  lambda x: torch.histc(x, bins=500, min=0., max=500.)),
    ("histc_int   bins=500  narrow", torch.randint(5, [2**20]),    lambda x: torch.histc(x, bins=500, min=0., max=500.)),
    ("histc_int   bins=2048 wide  ", torch.randint(2048, [2**20]), lambda x: torch.histc(x, bins=2048, min=0., max=2048.)),
    ("histc_int   bins=2048 narrow", torch.randint(20, [2**20]),   lambda x: torch.histc(x, bins=2048, min=0., max=2048.)),
]

def test(case, device):
    name, x, func = case
    x = x.to(device)
    time_samples = []
    for _ in range(15):
        torch.cuda.synchronize()
        t1 = time.time()
        func(x)
        torch.cuda.synchronize()
        t2 = time.time()
        time_samples.append(t2 - t1)
    median = sorted(time_samples)[len(time_samples) // 2]
    print(device, name, median)

for case in cases:
    test(case, device="cuda")

# for case in cases:
#     test(case, device="cpu")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97090
Approved by: https://github.com/ngimel
2023-03-24 00:25:34 +00:00
Masaki Kozuki
22ea21da3d Change 1D Tensor of 1 element to 0D Tensor (#96994)
Add 0-D tensor cases to the graphed Adam/AdamW tests.

Affected:
- `torch.cuda.amp.GradScaler`'s `found_inf`, `_scale`, and `_growth_tracker`
- `step` of Adam & AdamW of `capturable`

Fixes #96776 🤞

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96994
Approved by: https://github.com/janeyx99
2023-03-21 18:24:19 +00:00
Elias Ellison
571f96bf59 cudagraph trees (#89146)
CUDA Graph Trees

Design doc: https://docs.google.com/document/d/1ZrxLGWz7T45MSX6gPsL6Ln4t0eZCSfWewtJ_qLd_D0E/edit

Not currently implemented:

- Right now, we are using weak tensor refs from outputs to check if a tensor has died. This doesn't work because of a) aliasing, and b) aot_autograd detaching tensors (see note [Detaching saved tensors in AOTAutograd]). Would need either https://github.com/pytorch/pytorch/issues/91395 to land to use storage weak refs, or to manually add a deleter fn that does what I want. This is doable but there are some interactions with the caching allocator checkpointing, so saving it for a stacked PR.

- Reclaiming memory from the inputs during model recording. This isn't terribly difficult but is deferred to another PR. You would need to write over the input memory during warmup, and therefore copy the inputs to CPU. Saving for a stacked PR.

- Warning on overwriting previous-generation outputs, and handling nested torch.compile() calls in generation tracking.

Differential Revision: [D43999887](https://our.internmc.facebook.com/intern/diff/D43999887)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89146
Approved by: https://github.com/ezyang
2023-03-17 02:47:03 +00:00
Elias Ellison
ea7415087a Expose Stream Recording Apis in python (#96384)
Differential Revision: [D43999891](https://our.internmc.facebook.com/intern/diff/D43999891)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96384
Approved by: https://github.com/zdevito
2023-03-16 23:45:43 +00:00
Zachary DeVito
e74f70d212 Revert "Revert "[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541)"" (#96878)
This reverts commit e1ea584b1c.
Adds __has_include check to fix fbcode build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96878
Approved by: https://github.com/ezyang
2023-03-16 04:12:54 +00:00
PyTorch MergeBot
e1ea584b1c Revert "[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541)"
This reverts commit 4e1060c609.

Reverted https://github.com/pytorch/pytorch/pull/95541 on behalf of https://github.com/DanilBaibak due to breaking internal builds
2023-03-15 13:28:41 +00:00
Zachary DeVito
4e1060c609 [memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541)
This refactors the stack trace facility specific to memory profiling in python+cuda into a generic facility for generating combined stack traces.

The generic facility (combined_traceback.h) does not require python to be around to work, but will return python stacks if it is present.

This facility is then used to add support for stack trace gathering in memory profiling that happens directly from C++.

It is also used to expose a python API for gathering and symbolizing combined stacks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95541
Approved by: https://github.com/ezyang
2023-03-14 18:26:05 +00:00
Elias Ellison
da265652d6 Return Live Data Pointers from Checkpoint, swap onto tensors (#95020)
When we checkpoint the state of the private pool allocator, we need to make sure that its currently live allocated blocks get properly cleaned up when the tensors they correspond to die. Return DataPtrs for these newly allocated blocks that the callee can swap onto live Tensors.

The exact API for setting the checkpoint can still be adjusted after this as the cudagraph implementation is built out, but this at least shows it's sufficiently general.

This should be the last PR touching cuda caching allocator necessary for new cudagraphs integration.

Differential Revision: [D43999888](https://our.internmc.facebook.com/intern/diff/D43999888)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95020
Approved by: https://github.com/zdevito
2023-03-14 01:22:19 +00:00
Elias Ellison
1cc32aedb0 Handle additional live allocations not in checkpointed state (#94943)
We choose to ignore certain blocks that are currently allocated when we set the pool to its checkpoint. For those blocks, we need to swap out the deleter function of their corresponding blocks so that a deallocation is not triggered when they die.

Differential Revision: [D43999886](https://our.internmc.facebook.com/intern/diff/D43999886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94943
Approved by: https://github.com/zdevito
2023-03-14 01:00:47 +00:00
Elias Ellison
d798de2b05 Checkpoint CUDA Allocator Private Pool State (#94653)
Copying note from cuda caching allocator:

```
   * Note [Checkpointing PrivatePoolState]
   *
   * Refer above to Note [Interaction with CUDA graph capture]. Allocations made
   * during graph capture are made from a separate private pool. During graph
   * capture allocations behave as usual. During graph replay the allocator
   * state does not change even as new tensors are created. The private pool
   * will not free its blocks to the main caching allocator until cuda graph use
   * is finished to prevent an allocation from eager clobbering the memory from
   * a live but unaccounted for tensor that was created during replay.
   *
   * `make_graphed_callables`, a series of separate callables chained in
   * successive cuda graphs, can share a memory pool because after a cuda graph
   * recording the allocations in the shared private pool exactly reflect the
   * tensors that are allocated.
   *
   * We would like to extend callable chaining to support a graphed callable
   * tree. In this scenario, we have a tree of callable chains which will be
   * captured with cuda graphs. In the diagram below, we have a tree with four
   * callables, A, B, C, and D. Suppose we have captured, and subsequently
   * replayed, A, B, and C. Then on a new invocation, we replay A and B, but
   * would now like to record D. At this point the private pool will not reflect
   * any of the live tensors created during graph replay. Allocations made
   * during a new recording with the pool could overwrite those live tensors.
   *
   * In order to record a new graph capture after replaying prior callables in
   * the tree, we need the allocator to reflect the state of the live tensors.
   * We checkpoint the state of the private pool after each recording, and then
   * reapply it when we are starting a new recording chain. Additionally, we
   * must free the allocations for any tensors that died between the end of our
   * previous graph replaying and our new recording (TODO). All of the allocated
   * segments that existed in the checkpointed state must still exist in the
   * pool. There may also exist new segments, which we will free (TODO : link
   * note [live tensors between iterations] when it exists).
   *
   *
   *  ---------------> A ---------------> B ---------------> C
   *                                |
   *                                |
   *                                |
   *                                |
   *                                  ---------------> D
```

A few TODOs:
- Need to add logic for freeing tensors that have died between the last replay and the current new recording
- Add logic for handling free being called on a pointer multiple times (because we are manually freeing live tensors)

The two scenarios above have not been exercised in the tests yet.

Differential Revision: [D43999889](https://our.internmc.facebook.com/intern/diff/D43999889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94653
Approved by: https://github.com/zdevito
2023-03-14 00:47:30 +00:00
Zachary DeVito
4b372e3958 [memory profiling] C++ tracing support (#95357)
Adds the ability to quickly generate stack traces for C++,
and combine Python, TorchScript, and C++ frames into a single trace.

This makes it possible for the memory tracer to record allocations inside
C++ code (e.g. convolution temporaries, backward operators).

The unwinder code is ~10x faster than execinfo.h's backtrace because it
caches fast unwinder routines for instruction pointers that have already been seen.
It is also only 1.2-2x slower than copying the entire stack (the approach perf takes),
while using 2 orders of magnitude less space per stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95357
Approved by: https://github.com/bertmaher
2023-03-12 07:24:14 +00:00
Zachary DeVito
266089a3fe [memory snapshots] record scripted stack traces (#95356)
Adds support for seeing both python and script stack traces in memory
debugging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95356
Approved by: https://github.com/aaronenyeshi
2023-03-12 07:24:14 +00:00
Zachary DeVito
d6d8d3484e _memory_viz.py: Visualize how blocks fit into segments. (#91336)
Add a segment_plot command that visualizes how blocks are allocated into segments.
This is similar to the 'stats' command but produces an interactive html viewer rather
than a text dump, allowing exploration of stack traces.

It also adds the ability to see the layout at any point in the trace by starting from the
snapshot and then applying the events backwards to reconstruct what memory would have looked like.

Example:
![Screen Shot 2022-12-22 at 3 32 49 PM](https://user-images.githubusercontent.com/370202/209242650-b952372e-37ac-400a-a01c-13be2b5426fa.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91336
Approved by: https://github.com/bhosmer
2023-03-07 21:07:18 +00:00
Zachary DeVito
71f369092d Revert "Revert "memory viz: Add colors for categories and a legend (#90587)"" (#96133)
This reverts commit b38b39c441.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96133
Approved by: https://github.com/bhosmer
2023-03-07 21:07:18 +00:00
Catherine Lee
eea0733045 Reduce pytest blocklist (#96016)
`TestCase = object` or variations of it get switched to `TestCase = NoTest`.

unittest collects tests based on subclassing unittest.TestCase, so setting `TestCase = object` removes a class from unittest test collection. pytest collects based on name (https://docs.pytest.org/en/7.1.x/reference/reference.html#confval-python_classes) but can be told to ignore a class (see the bottom of https://docs.pytest.org/en/7.1.x/example/pythoncollection.html#changing-naming-conventions).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96016
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2023-03-07 18:30:27 +00:00
Eli Uriegas
b38b39c441 Revert "memory viz: Add colors for categories and a legend (#90587)"
This reverts commit ee43842505.
2023-03-06 11:38:58 -08:00
Zachary DeVito
ee43842505 memory viz: Add colors for categories and a legend (#90587)
Adds a category legend to memory trace plots that colors allocations by their role (activation, parameter, gradient, etc.) as captured by kineto.

Differential Revision: [D43757381](https://our.internmc.facebook.com/intern/diff/D43757381)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90587
Approved by: https://github.com/aaronenyeshi
2023-03-03 20:42:22 +00:00
Mark Saroufim
9f707f164e Add more GPU metric instrumentation (#91717)
Fixes https://github.com/pytorch/serve/issues/1937

A fairly common query I see folks running while using pytorch is

`nvidia-smi --format=csv,noheader,nounits --query-gpu=utilization.gpu,utilization.memory,memory.total,memory.used,temperature.gpu,power.draw,clocks.current.sm,clocks.current.memory -l 10`

Existing metrics we have:
* For kernel utilization: `torch.cuda.utilization()`
* For memory utilization: the memory allocated, under `torch.cuda.memory`, via `torch.cuda.memory.memory_allocated()`
* For total available memory: `torch.cuda.get_device_properties(0).total_memory`

Which means the only metrics we're missing are
* Temperature: now in `torch.cuda.temperature()`
* Power draw: now in `torch.cuda.power()`
* Clock speed: now in `torch.cuda.clock_speed()`

With some important details on each

* Clock speed settings: I picked the SM clock domain which is documented here https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceEnumvs.html#group__nvmlDeviceEnumvs_1g805c0647be9996589fc5e3f6ff680c64
* Temperature: I use `pynvml.nvmlDeviceGetTemperature(handle, 0)` where 0 refers to the GPU die temperature
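A hedged usage sketch using the function names as written in this message (the names that actually landed in `torch.cuda` may differ):

```python
import torch

print("gpu utilization %:", torch.cuda.utilization())
print("memory allocated:", torch.cuda.memory.memory_allocated())
print("total memory:", torch.cuda.get_device_properties(0).total_memory)
print("temperature (C):", torch.cuda.temperature())
print("power draw:", torch.cuda.power())        # name as given above
print("sm clock:", torch.cuda.clock_speed())    # name as given above
```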
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91717
Approved by: https://github.com/ngimel
2023-02-24 00:38:03 +00:00
Pearu Peterson
cece63f197 Add warn-once deprecation warning to legacy sparse constructors (#94850)
Addresses https://github.com/pytorch/pytorch/issues/68323#issuecomment-1425174341

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94850
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2023-02-23 15:05:12 +00:00
puririshi98
8aa34602f7 Jetson Update for CI Redo (#94549)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94549
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-02-21 17:13:38 +00:00
dllehr-amd
98012e4a59 [ROCm] hipGraph support for pytorch mainline (#88202)
With the release of ROCm 5.3, HIP now supports a hipGraph implementation.

All necessary backend work and hipification is done to support the same functionality as cudaGraph.

Unit tests are modified to support a new TEST_GRAPH feature which allows us to create a single check for graph support instead of attempting to gather the CUDA level in annotations for every graph test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88202
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet
2023-02-14 22:18:56 +00:00
Xuehai Pan
b005ec62b9 [BE] Remove dependency on six and future (#94709)
Remove the Python 2 and 3 compatibility libraries [six](https://pypi.org/project/six) and [future](https://pypi.org/project/future), along with `torch._six`. We only support Python 3.8+ now. It's time to retire them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94709
Approved by: https://github.com/malfet, https://github.com/Skylion007
2023-02-14 09:14:14 +00:00
Xuehai Pan
046e88a291 [BE] [3/3] Rewrite super() calls in test (#94592)
Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied.

- #94587
- #94588
- #94592

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```

Some cases where the rewrite would change the semantics are kept unchanged. E.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94592
Approved by: https://github.com/ezyang, https://github.com/seemethere
2023-02-12 22:20:53 +00:00
c-odrin
54b7c7d5e9 Added requested_bytes to CUDA Caching Allocator Stats (#88575)
Summary:
The caching allocator can be configured to round memory allocations in order to reduce fragmentation. Sometimes, however, the overhead from rounding can be higher than the fragmentation it helps reduce.

We have added a new stat to CUDA caching allocator stats to help track if rounding is adding too much overhead and help tune the roundup_power2_divisions flag:
    - "requested_bytes.{current,peak,allocated,freed}": memory requested by client code, compare this with allocated_bytes to check if allocation rounding adds too much overhead

Test Plan: Added test case in caffe2/test/test_cuda.py

Differential Revision: D40810674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88575
Approved by: https://github.com/zdevito
2023-02-09 21:37:25 +00:00
Masaki Kozuki
6ba041fcae Look up group["capturable"], not defaults["capturable"] in Adam(W) (#94149)
We could set different values in each `param_group` when calling `__init__` of `torch.optim` optimizers, as in e.g. https://github.com/pytorch/pytorch/issues/89987.

So check whether or not `capturable` is `True` in each `param_group`, rather than only in `defaults`.
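A minimal sketch of why the per-group lookup matters:

```python
import torch

p1 = torch.nn.Parameter(torch.randn(4, device="cuda"))
p2 = torch.nn.Parameter(torch.randn(4, device="cuda"))

# capturable can differ per param_group, so step() has to look at each group
# rather than only optimizer.defaults["capturable"].
opt = torch.optim.Adam(
    [
        {"params": [p1], "capturable": True},
        {"params": [p2]},  # falls back to the default, capturable=False
    ],
    lr=1e-3,
)
```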
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94149
Approved by: https://github.com/albanD
2023-02-07 00:24:35 +00:00
Masaki Kozuki
4207d3c330 FusedAdam(W) should take OptState into account before unscaling grads (#94060)
The optimizers have to consult `OptState` before unscaling gradients because we could call `GradScaler.unscale_` explicitly, e.g. for `clip_grad_norm_`, as mentioned in e52786f3d1/torch/cuda/amp/grad_scaler.py (L235-L266) and https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-unscaled-gradients; a sketch of that pattern is below.
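A minimal sketch of that documented pattern (explicit unscale, then clipping, then step):

```python
import torch

model = torch.nn.Linear(16, 16, device="cuda")
opt = torch.optim.Adam(model.parameters(), lr=1e-3, fused=True)
scaler = torch.cuda.amp.GradScaler()

with torch.autocast("cuda"):
    loss = model(torch.randn(4, 16, device="cuda")).sum()
scaler.scale(loss).backward()

scaler.unscale_(opt)                                     # explicit unscale before clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(opt)                                         # must not unscale a second time here
scaler.update()
```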

Related #90752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94060
Approved by: https://github.com/albanD
2023-02-04 05:20:13 +00:00
Masaki Kozuki
a23ed38f9a [mta][foreach] Implement fused adamw (#88015)
related: https://github.com/pytorch/pytorch/issues/68041, https://github.com/pytorch/pytorch/issues/71274, https://github.com/pytorch/pytorch/issues/80167
possibly related to https://github.com/pytorch/pytorch/issues/80595#issuecomment-1178519436
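A minimal usage sketch of the fused implementation added here (requires CUDA, floating-point parameters):

```python
import torch

model = torch.nn.Linear(16, 16, device="cuda")
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, fused=True)

loss = model(torch.randn(4, 16, device="cuda")).sum()
loss.backward()
opt.step()
```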

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88015
Approved by: https://github.com/albanD, https://github.com/ngimel
2023-02-01 19:32:29 +00:00
albanD
d8aa68c683 make sure that our error handling runs with the GIL enabled (#92848)
Fixes https://github.com/pytorch/pytorch/issues/92684

I checked the other use cases of this API and they never release the GIL.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92848
Approved by: https://github.com/ngimel
2023-01-24 09:30:42 +00:00
eqy
fb38b9ff2a [cuBLAS][TF32] Fix TF32 get/set test when TORCH_ALLOW_TF32_CUBLAS_OVERRIDE is set (#92052)
Follow up of #85859 to fix the test for when the environment variable is set.

CC @xwang233 @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92052
Approved by: https://github.com/ngimel
2023-01-12 05:36:06 +00:00
eqy
97ff20d722 [cuBLAS] (re-open) Fix default cuBLAS workspace size and parsing for multiple workspaces (#91564)
re-open of #89027
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91564
Approved by: https://github.com/ngimel
2023-01-03 23:48:15 +00:00
PyTorch MergeBot
39d49dbe45 Revert "[cuBLAS] Fix default cuBLAS workspace size and parsing for multiple workspaces (#89027)"
This reverts commit b407d98dbe.

Reverted https://github.com/pytorch/pytorch/pull/89027 on behalf of https://github.com/kit1980 due to Fails test_cublas_workspace_explicit_allocation on ROCm
2022-12-31 23:04:57 +00:00
eqy
b407d98dbe [cuBLAS] Fix default cuBLAS workspace size and parsing for multiple workspaces (#89027)
Follow-up of #86167; the number of pools was mistakenly ignored, and the default workspace size appears to be too small to match the cuBLAS kernels selected before the explicit allocation change.
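A hedged illustration of the multi-workspace configuration string this parsing handles, using the `:size_kib:count` syntax of `CUBLAS_WORKSPACE_CONFIG` (the values are only an example, and the variable must be set before CUDA is initialized):

```python
import os
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:2:16:8"  # two 4 MiB workspaces and eight 16 KiB workspaces

import torch
a = torch.randn(128, 128, device="cuda")
b = a @ a  # cuBLAS GEMM runs with the configured workspaces
```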

CC @ptrblck @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89027
Approved by: https://github.com/ngimel
2022-12-31 06:58:04 +00:00
lezcano
484dd40022 Implement PReLU in a compositional way (#91238)
The PReLU implementation was all over the place. This led to a number
of bugs like https://github.com/pytorch/pytorch/issues/68760.  We fix it by:
- Keeping the weird broadcasting logic it has as a CompositeImplicit kernel that calls into a second kernel
- This second kernel is just a good-ol' pointwise kernel.
- We implement the derivative for the pointwise kernel via TI as well for speed.
- We implement the second derivative for the pointwise kernel and the forward AD derivatives compositionally

This fixes a number of issues:
- We don't perform copies any more when the inputs are not contiguous
- The derivatives are now correct
- We fix vmap and many other functorch-related issues.
- CPU and CUDA now share the relevant broadcasting logic
- The implementation is about 1/3 the length.
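A small illustration of the channel-wise broadcasting behavior the composite kernel preserves:

```python
import torch

m = torch.nn.PReLU(num_parameters=3)               # one learnable weight per channel
x = torch.randn(2, 3, 4, 4, requires_grad=True)    # weight of shape (3,) applies along dim 1
y = m(x)
y.sum().backward()                                  # exercises the pointwise backward path
```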

Fixes https://github.com/pytorch/pytorch/issues/68760
Fixes https://github.com/pytorch/pytorch/issues/89895

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91238
Approved by: https://github.com/kshitij12345, https://github.com/jbschlosser, https://github.com/albanD
2022-12-30 10:42:30 +00:00