Commit Graph

640 Commits

Aaron Enye Shi
7973ac586d [Memory Snapshot] Add CUDAAllocatorConfig details into snapshot metadata (#119404)
Summary:
Include the CUDAAllocatorConfig at the time of the snapshot in the snapshot file. This adds the following variables:

```
  double garbage_collection_threshold;
  size_t max_split_size;
  size_t pinned_num_register_threads;
  bool expandable_segments;
  bool release_lock_on_cudamalloc;
  bool pinned_use_cuda_host_register;
  std::string last_allocator_settings;
  std::vector<size_t> roundup_power2_divisions;
```

Test Plan:
`PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` produces
```
{'PYTORCH_CUDA_ALLOC_CONF': 'expandable_segments:True',
 'max_split_size': -1,
 'garbage_collection_threshold': 0.0,
 'expandable_segments': True,
 'pinned_num_register_threads': 1,
 'release_lock_on_cudamalloc': False,
 'pinned_use_cuda_host_register': False,
 'roundup_power2_divisions': {'1': 0,
  '2': 0,
  '4': 0,
  '8': 0,
  '16': 0,
  '32': 0,
  '64': 0,
  '128': 0,
  '256': 0,
  '512': 0,
  '1024': 0,
  '2048': 0,
  '4096': 0,
  '8192': 0,
  '16384': 0,
  '32768': 0}}
```
`PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:2000,roundup_power2_divisions:[256:1,512:2,1024:4,>:8]"` produces
```
{'PYTORCH_CUDA_ALLOC_CONF': 'max_split_size_mb:2000,roundup_power2_divisions:[256:1,512:2,1024:4,>:8]',
 'max_split_size': 2097152000,
 'garbage_collection_threshold': 0.0,
 'expandable_segments': False,
 'pinned_num_register_threads': 1,
 'release_lock_on_cudamalloc': False,
 'pinned_use_cuda_host_register': False,
 'roundup_power2_divisions': {'1': 1, '2': 1, '4': 1, '8': 1, '16': 1, '32': 1, '64': 1, '128': 1, '256': 1, '512': 2, '1024': 8, '2048': 8, '4096': 8, '8192': 8, '16384': 8, '32768': 8}
}
```

Differential Revision: D53536199

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119404
Approved by: https://github.com/zdevito
2024-02-17 01:16:37 +00:00
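For context, a minimal sketch of how the snapshot metadata above might be inspected; `_snapshot()` and `_record_memory_history()` are the existing `torch.cuda.memory` helpers, while treating everything outside `segments`/`device_traces` as metadata is an assumption:

```python
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # must be set before CUDA init

import torch

torch.cuda.memory._record_memory_history()
x = torch.randn(1024, 1024, device="cuda")      # trigger an allocation
snapshot = torch.cuda.memory._snapshot()        # dict of segments, traces, and metadata

for key, value in snapshot.items():
    if key not in ("segments", "device_traces"):  # assumed: remaining keys are metadata
        print(key, value)
```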
cyy
d4882e438a [DeviceIndex][5/N] Use DeviceIndex in more places (#119866)
This PR follows the series of patches beginning with #119142 and fixes various CUDA related methods to use DeviceIndex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119866
Approved by: https://github.com/Skylion007
2024-02-15 07:01:43 +00:00
cyy
87c6cd2f00 [1/N] Replace std::tie with structural binding (#119774)
This PR replaces some std::tie calls with structured bindings from C++17. This not only makes the code more compact, but also brings some performance gain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119774
Approved by: https://github.com/albanD, https://github.com/malfet
2024-02-14 09:25:04 +00:00
cyy
cb0886ecf2 [DeviceIndex][4/N] Use DeviceIndex in more places (#119741)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119741
Approved by: https://github.com/aaronenyeshi, https://github.com/ezyang
2024-02-14 00:29:10 +00:00
cyy
10f3abc6b8 [DeviceIndex][3/N] Use DeviceIndex in more places (#119635)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119635
Approved by: https://github.com/ezyang
2024-02-12 21:31:27 +00:00
Ke Wen
b2043c0543 [c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421)
Part 2 and last part of #118674:
Introduce actual "single-device" code change to ProcessGroupNCCL.

assert size == 1 and test refactor have been done in #119099.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119421
Approved by: https://github.com/shuqiangzhang
2024-02-12 18:45:49 +00:00
PyTorch MergeBot
0342b227e5 Revert "[c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421)"
This reverts commit f3e7d80993.

Reverted https://github.com/pytorch/pytorch/pull/119421 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/119421#issuecomment-1938169747))
2024-02-12 07:34:20 +00:00
cyy
8a3c241094 Remove unused header inclusion (#119667)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119667
Approved by: https://github.com/Skylion007
2024-02-12 05:36:25 +00:00
cyy
568740f080 [DeviceIndex][2/N] Use DeviceIndex instead of int in allocators (#119545)
Follows #119142
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119545
Approved by: https://github.com/ezyang
2024-02-10 20:27:59 +00:00
Ke Wen
f3e7d80993 [c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421)
Part 2 and last part of #118674:
Introduce actual "single-device" code change to ProcessGroupNCCL.

assert size == 1 and test refactor have been done in #119099.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119421
Approved by: https://github.com/shuqiangzhang
2024-02-09 20:23:20 +00:00
cyy
560c92c324 [DeviceIndex] Use DeviceIndex instead of int in CUDA wrappers (#119142)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119142
Approved by: https://github.com/ezyang
2024-02-08 23:00:56 +00:00
Natalia Gimelshein
6fe5a3adaf release GIL for cudaEventDestroy (#119393)
cudaEventDestroy can become blocking under some circumstances, and then holding GIL will lead to deadlocks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119393
Approved by: https://github.com/Skylion007
2024-02-07 22:16:18 +00:00
Yu, Guangye
5c46600f84 [RELAND] refactor lazy init to device-agnostic (#119248)
# Motivation
This PR intends to extend `cuda_lazy_init` to `device_lazy_init`, a device-agnostic API that can support any backend, and changes `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for CUDA while maintaining scalability.

# Design
We maintain a flag for each backend to manage the lazy initialization state separately.

# Additional Context
No additional UTs are needed.
This is a reland PR, the original PR is [refactor lazy init to device-agnostic](https://github.com/pytorch/pytorch/pull/118846).
This is a common PR, and does not trigger xpu ciflow.

Differential Revision: [D53478332](https://our.internmc.facebook.com/intern/diff/D53478332)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119248
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/atalman
2024-02-07 15:58:51 +00:00
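A small sketch of the lazy-initialization behavior this refactor preserves, viewed through the public CUDA API (illustrative only, not code from the PR):

```python
import torch

# Importing torch does not initialize the CUDA context...
assert not torch.cuda.is_initialized()

# ...the first CUDA operation triggers maybe_initialize_device() lazily.
x = torch.empty(8, device="cuda")
assert torch.cuda.is_initialized()
```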
PyTorch MergeBot
ab613a4019 Revert "refactor lazy init to device-agnostic (#118846)"
This reverts commit 520771d7b3.

Reverted https://github.com/pytorch/pytorch/pull/118846 on behalf of https://github.com/atalman due to Failing, tests https://github.com/pytorch/torchdistx/blob/main/src/python/torchdistx/_C/fake.cc#L11  ([comment](https://github.com/pytorch/pytorch/pull/118846#issuecomment-1927651305))
2024-02-05 18:06:30 +00:00
Yu, Guangye
520771d7b3 refactor lazy init to device-agnostic (#118846)
# Motivation
This PR intends to extend `cuda_lazy_init` to `device_lazy_init`, a device-agnostic API that can support any backend, and changes `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for CUDA while maintaining scalability.

# Design
We maintain a flag for each backend to manage the lazy initialization state separately.

# Additional Context
No additional UTs are needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118846
Approved by: https://github.com/malfet
2024-02-02 12:10:39 +00:00
albanD
54668ad6dc Cleanup max cuda device (#118779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118779
Approved by: https://github.com/ezyang
2024-02-01 21:11:28 +00:00
cyy
6da0e7f84b [Clang-tidy header][17/N] Apply clang-tidy on headers in torch/csrc/cuda (#117829)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117829
Approved by: https://github.com/albanD
2024-01-26 13:33:24 +00:00
eqy
8f7caaee67 [cuDNN] Fix cuDNN version parsing against future versions of cuDNN (#117908)
Remove the unnecessary dependence on assuming a fixed number of digits per field.

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117908
Approved by: https://github.com/cpuhrsch
2024-01-21 05:00:01 +00:00
cyy
2b5a201aa6 [Exception] [3/N] Replace torch::NotImplementedError and torch::LinAlgError with C10 counterparts. (#116824)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116824
Approved by: https://github.com/albanD
2024-01-11 11:27:04 +00:00
Edward Yang
b4a35632f9 Add function to materialize COW storages (#117053)
Summary: From Kurt Mohler, see https://github.com/pytorch/pytorch/pull/113396 (manually imported due to ghimport problems)

Test Plan: sandcastle, OSS CI

Differential Revision: D52610522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117053
Approved by: https://github.com/malfet, https://github.com/kurtamohler
2024-01-10 15:34:16 +00:00
cyy
91bbcf8c71 [1/N] replace THPUtils_assert with TORCH_CHECK (#116675)
This PR replaces THPUtils_assert with TORCH_CHECK.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116675
Approved by: https://github.com/albanD
2024-01-04 11:15:33 +00:00
Eddie Yan
ba06951c66 [BE] [cuDNN] Always build assuming cuDNN >= 8.1 (#95722)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 27084ed</samp>

This pull request simplifies and cleans up the code that uses the cuDNN library for convolution, batch normalization, CTC loss, and quantized operations. It removes the unnecessary checks and conditions for older cuDNN versions and the experimental cuDNN v8 API, and ~~replaces them with the stable `cudnn_frontend` API that requires cuDNN v8 or higher. It also adds the dependency and configuration for the `cudnn_frontend` library in the cmake and bazel files.~~ Correction: The v7 API will still be available with this PR, and can still be used, without any changes to the defaults. This change simply always _builds_ the v8 API, and removes the case where _only_ the v7 API is built.

This is a re-land of https://github.com/pytorch/pytorch/pull/91527

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95722
Approved by: https://github.com/malfet, https://github.com/atalman
2024-01-03 15:41:28 +00:00
Nikita Shulga
0aa185f394 [BE] Make torch.cuda.has_magma a build time check (#116299)
Perhaps originally one needed to query about GPU capability, but right now it's a simple check for a build time flag: 52f0457d7d/aten/src/ATen/cuda/detail/CUDAHooks.cpp (L165-L171)

Alternatively, to avoid the `at::hasMAGMA()` call, one can implement it as follows:
```cpp
  const auto use_magma = caffe2::GetBuildOptions().at("USE_MAGMA");
  return PyBool_FromLong(use_magma == "1");
```

Make this check very similar to `_has_mkldnn`
0978482afa/torch/csrc/Module.cpp (L1793-L1794)

Test plan:
 Run `lldb -- python3 -c "import torch;print(torch.cuda.has_magma)"` and make sure it returns True and that `cuInit` is not called

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116299
Approved by: https://github.com/seemethere, https://github.com/albanD
2023-12-26 23:37:23 +00:00
zdevito
4afe2687d5 Reland "Serve multistream graph captures from correct pool (#114647)" (#116199)
Fixes a variable shadowing problem that broke internal builds.

This reverts commit fe15645619.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116199
Approved by: https://github.com/eellison
2023-12-20 21:22:34 +00:00
PyTorch MergeBot
fe15645619 Revert "Serve multistream graph captures from correct pool (#114647)"
This reverts commit 8a445f7bd5.

Reverted https://github.com/pytorch/pytorch/pull/114647 on behalf of https://github.com/jeanschmidt due to breaking multiple internal build jobs, please check internal diff in order to obtain more details ([comment](https://github.com/pytorch/pytorch/pull/114647#issuecomment-1864840724))
2023-12-20 17:11:42 +00:00
zdevito
8a445f7bd5 Serve multistream graph captures from correct pool (#114647)
This fixes #114320 by placing the logic for determining whether to allocate
to a pool inside a callback that is controlled by CUDAGraph.cpp or by the
python bound api to allocate a stream directly to a pool.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114647
Approved by: https://github.com/ngimel, https://github.com/eellison
2023-12-18 18:24:15 +00:00
Nikita Shulga
5989e1222d [BE] Set torch.cuda.has_half to True (#115884)
This check was introduced by https://github.com/pytorch/pytorch/pull/5417 and then turned into a tautology by https://github.com/pytorch/pytorch/pull/10147

So I guess it's time to let go of all that dynamic initialization (and maybe just delete it in 2.3?)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115884
Approved by: https://github.com/kit1980
2023-12-15 02:30:55 +00:00
Hongtao Yu
01ec71e466 [NFC][Autotune] Use device_prop.regsPerMultiprocessor instead of hardcoded reg number. (#115094)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115094
Approved by: https://github.com/jansel
2023-12-05 23:49:46 +00:00
PyTorch MergeBot
ee96399bb4 Revert "[Reland2] Update NVTX to NVTX3 (#109843)"
This reverts commit dcb486232d.

Reverted https://github.com/pytorch/pytorch/pull/109843 on behalf of https://github.com/atalman due to Diff broke internal builds and tests ([comment](https://github.com/pytorch/pytorch/pull/109843#issuecomment-1841105398))
2023-12-05 16:10:20 +00:00
cyyever
dcb486232d [Reland2] Update NVTX to NVTX3 (#109843)
Another attempt to update NVTX to NVTX3. We now avoid changing NVTX header inclusion of existing code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109843
Approved by: https://github.com/peterbell10
2023-12-04 19:02:07 +00:00
PyTorch MergeBot
f36d09fcb7 Revert "Add function to materialize COW storages (#113396)"
This reverts commit e2f090086b.

Reverted https://github.com/pytorch/pytorch/pull/113396 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/113396#issuecomment-1818769090))
2023-11-20 10:26:01 +00:00
Kurt Mohler
e2f090086b Add function to materialize COW storages (#113396)
Part of #109833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113396
Approved by: https://github.com/ezyang
2023-11-17 01:58:51 +00:00
cyy
f9bf104c64 [2/N] Fixes clang-tidy warnings in header files (#113727)
This PR fixes more clang-tidy warnings in common headers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113727
Approved by: https://github.com/Skylion007
2023-11-16 13:21:15 +00:00
Aaron Enye Shi
3b80577212 [Memory Snapshot] Add timestamps to memory events collected in snapshots (#112266)
Summary: Use the same clock as the profiler to collect the timestamps on when memory events occurred. Save these to the snapshot dicts as well, so that they can be saved with the raw memory events.

Test Plan:
CI

Observed that trace_entry will now have time_us field, and it is ascending. For example:
```
trace entry: {'action': 'free_requested', 'addr': 140366476918784, 'size': 8192, 'stream': 0, 'time_us': 1698326576864190}
trace entry: {'action': 'free_completed', 'addr': 140366476918784, 'size': 8192, 'stream': 0, 'time_us': 1698326576864190}
trace entry: {'action': 'free_requested', 'addr': 140366476936192, 'size': 8192, 'stream': 0, 'time_us': 1698326576864194}
trace entry: {'action': 'free_completed', 'addr': 140366476936192, 'size': 8192, 'stream': 0, 'time_us': 1698326576864194}
trace entry: {'action': 'free_requested', 'addr': 140366641430528, 'size': 8192000, 'stream': 0, 'time_us': 1698326576864205}
trace entry: {'action': 'free_completed', 'addr': 140366641430528, 'size': 8192000, 'stream': 0, 'time_us': 1698326576864205}
trace entry: {'action': 'free_requested', 'addr': 140366403571712, 'size': 4000, 'stream': 0, 'time_us': 1698326576864209}
trace entry: {'action': 'free_completed', 'addr': 140366403571712, 'size': 4000, 'stream': 0, 'time_us': 1698326576864209}
```

Differential Revision: D50602011

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112266
Approved by: https://github.com/zdevito
2023-11-14 18:48:59 +00:00
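A hedged sketch of reading the new `time_us` field back from a snapshot (the `device_traces` key name is an assumption about the snapshot dict layout):

```python
import torch

torch.cuda.memory._record_memory_history()
x = torch.randn(1000, device="cuda")
del x
snap = torch.cuda.memory._snapshot()

# Each device's trace is a list of entries like the ones shown above.
for trace in snap.get("device_traces", []):
    for entry in trace:
        print(entry["action"], entry.get("time_us"))
```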
PyTorch MergeBot
3c9a59cb8d Revert "[BE] [cuDNN] Always build assuming cuDNN >= 8.0 (#95722)"
This reverts commit df4f0b3829.

Reverted https://github.com/pytorch/pytorch/pull/95722 on behalf of https://github.com/PaliC due to is breaking a bunch of internal pytorch users ([comment](https://github.com/pytorch/pytorch/pull/95722#issuecomment-1806131675))
2023-11-10 17:26:36 +00:00
Eddie Yan
df4f0b3829 [BE] [cuDNN] Always build assuming cuDNN >= 8.0 (#95722)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 27084ed</samp>

This pull request simplifies and cleans up the code that uses the cuDNN library for convolution, batch normalization, CTC loss, and quantized operations. It removes the unnecessary checks and conditions for older cuDNN versions and the experimental cuDNN v8 API, and ~~replaces them with the stable `cudnn_frontend` API that requires cuDNN v8 or higher. It also adds the dependency and configuration for the `cudnn_frontend` library in the cmake and bazel files.~~ Correction: The v7 API will still be available with this PR, and can still be used, without any changes to the defaults. This change simply always _builds_ the v8 API, and removes the case where _only_ the v7 API is built.

This is a re-land of https://github.com/pytorch/pytorch/pull/91527

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95722
Approved by: https://github.com/malfet
2023-11-08 07:53:23 +00:00
Wes Bland
9d765d28ca [pytorch] Add binding to get nccl version suffix (#112884)
Summary: Adds a Python-to-C binding to get the NCCL_SUFFIX value for more accurate NCCL version information and adds that to the NCCL version tuple.

Differential Revision: D50978181

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112884
Approved by: https://github.com/kwen2501
2023-11-08 02:51:22 +00:00
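For reference, the version tuple is queried from Python roughly as below; whether a suffix component appears depends on the build (sketch, not code from the PR):

```python
import torch
import torch.distributed as dist

if dist.is_available() and dist.is_nccl_available():
    print(torch.cuda.nccl.version())  # e.g. (2, 18, 3), possibly with a suffix component
```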
Min Si
94ebf52ea3 [cuda] introduce trace tracker callback in cache allocator (#112238)
Summary:
This patch prototypes a trace tracker callback mechanism based on existing TraceEntry records.

- It allows external of cache allocator to "attach" trace tracker callbacks.
- When a TraceEntry is recorded, it triggers all attached callbacks. Callbacks can selectively behave based on the trace action.
- **RISK**: The attached callback would be called within an allocator call stack (e.g., free during an allocate call). Potential deadlock may occur if other locks are called within the callback and has interdependency w/ the device allocator lock. It is the callback developer's responsibility to avoid any potential deadlock.
- **ADVICE**: The callback mechanism is designed **only for Pytorch internal use**. We should not expose it to Python layer due to Python GIL that would cause a deadlock.

See example in D50726970 that attaches NCCL register/deregister hooks via the trace tracker callback, so that all CUDA segments allocated by the allocator can be registered to NCCL communicators before any NCCL communication happens. This enables fast zero copy algorithms in NCCL.

Differential Revision: D50726971

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112238
Approved by: https://github.com/zdevito
2023-11-03 07:38:09 +00:00
Jithun Nair
333d5821ee [ROCm] Add gcnArchName to collect_env and torch.cuda.get_device_properties (#107477)
Printing just the device name is not helpful when investigating PyTorch issues filed for specific AMD GPUs, as the support/issue might depend on the gfx arch, which is part of the gcnArchName property.

`torch.cuda.get_device_properties(0).gcnArchName` will print the value of the `gcnArchName` property: eg.
```
>>> torch.cuda.get_device_properties(0).gcnArchName
'gfx906:sramecc+:xnack-'
```

```
root@6f064e3c19fb:/data/pytorch/test# python ../torch/utils/collect_env.py
...
GPU models and configuration: AMD Radeon Graphics(gfx906:sramecc+:xnack-)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107477
Approved by: https://github.com/albanD
2023-10-31 23:05:36 +00:00
PyTorch MergeBot
9c7391ea36 Revert " [1/N] Apply clang-tidy to c10 cuda files (#111137)"
This reverts commit 43b023694e.

Reverted https://github.com/pytorch/pytorch/pull/111137 on behalf of https://github.com/malfet due to Was reverted internally due to the failures in torch.cuda.memory_stats(device=0) (presumably) ([comment](https://github.com/pytorch/pytorch/pull/111137#issuecomment-1769274103))
2023-10-18 20:32:53 +00:00
cyy
43b023694e [1/N] Apply clang-tidy to c10 cuda files (#111137)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111137
Approved by: https://github.com/zou3519, https://github.com/Skylion007
2023-10-17 04:52:50 +00:00
cyy
a6b452dfdc [2/N] Enable Wunused-result, Wunused-variable and Wmissing-braces in torch targets (#110836)
This PR enables Wunused-result, Wunused-variable and Wmissing-braces, since our code base is already clean of these warnings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110836
Approved by: https://github.com/Skylion007
2023-10-11 23:49:15 +00:00
cyy
3ec33957eb [1/N] Enable Wunused-result and Wunused-variable in torch targets (#110722)
They are useful for checking results of function calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110722
Approved by: https://github.com/Skylion007
2023-10-08 23:43:45 +00:00
cyy
e75f2e2ea1 Fix clang-tidy warnings in CUDAPluggableAllocator (#110678)
This PR fixes clang-tidy warnings in CUDAPluggableAllocator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110678
Approved by: https://github.com/Skylion007
2023-10-06 17:33:08 +00:00
Banit Agrawal
30c4c6ff9b [PyTorch CCA] Refactor caching allocator config code (#110123)
Summary: This diff refactors the code by moving CUDAAllocatorConfig into the header file. This config refactoring is done so that we can use the same config code for CUDA pinned memory as well.

Test Plan: sandcastle

Differential Revision: D49653265

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110123
Approved by: https://github.com/zdevito
2023-10-04 14:58:23 +00:00
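A sketch of the shared config driving pinned host memory as well; the option names come from the config struct listed at the top of this log, and the exact runtime effect is an assumption:

```python
import os
# The same env string carries host (pinned) allocator options; set it before first use.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "pinned_use_cuda_host_register:True,pinned_num_register_threads:4"
)

import torch

buf = torch.empty(1 << 20, pin_memory=True)  # pinned allocation via the host allocator
```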
Pritam Damania
5565a29568 Release GIL in torch.cuda ops wherever possible. (#109159)
Most `torch.cuda` ops (ex: `torch.cuda.synchronize`) do not release the GIL in C++ land. This has the potential of causing deadlocks and freezing the Python process. For example, `torch.cuda.synchronize` could hold the GIL and get blocked on some operation. However, that operation might never complete in Python land since the GIL is held by `torch.cuda.synchronize`.

In this PR, I've tried to release GIL as much as possible in `torch.cuda` ops.

See https://github.com/pytorch/pytorch/issues/109074 for an example of how holding GIL causes a deadlock.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109159
Approved by: https://github.com/ezyang
2023-09-25 14:35:31 +00:00
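A hypothetical illustration of the scenario described above (not from the PR): the synchronizing thread must not hold the GIL, or a second Python thread can never enqueue the work being waited on.

```python
import threading
import torch

def producer(done):
    # Needs the GIL to run Python bytecode and launch kernels.
    x = torch.randn(1 << 20, device="cuda")
    (x * 2).sum().item()
    done.set()

done = threading.Event()
t = threading.Thread(target=producer, args=(done,))
t.start()
torch.cuda.synchronize()  # with this PR, releases the GIL while blocking
t.join()
assert done.is_set()
```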
cyy
01fc6466d1 [Reland] [1/N] fix clang-tidy warnings in torch/csrc (#108114)
Reland of PR #107648 with auto replaced with Py_ssize_t in eval_frame.c. This PR applies fixes to some issues found by clang-tidy in torch/csrc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108114
Approved by: https://github.com/Skylion007
2023-08-30 17:11:16 +00:00
PyTorch MergeBot
8cbf77585d Revert "[1/N] fix clang-tidy warnings in torch/csrc (#107648)"
This reverts commit 49eeca00d1.

Reverted https://github.com/pytorch/pytorch/pull/107648 on behalf of https://github.com/osalpekar due to This causes breakages due to underspecified type ([comment](https://github.com/pytorch/pytorch/pull/107648#issuecomment-1696372588))
2023-08-28 20:35:12 +00:00
Elias Ellison
0a9778a372 Expose cudaStreamCaptureMode in CUDA Graphs, use local setting in inductor (#107407)
>  capture_error_mode (str, optional): specifies the cudaStreamCaptureMode for the graph capture stream.
Can be "global", "thread_local" or "relaxed". During cuda graph capture, some actions, such as cudaMalloc,
 may be unsafe. "global" will error on actions in other threads, "thread_local" will only error for
 actions in the current thread, and "relaxed" will not error on these actions.

Inductor codegen is single-threaded, so it should be safe to enable "thread_local" for inductor's cuda graph capturing. We have seen errors when inductor cudagraphs has been used concurrently with data preprocessing in other threads.

Differential Revision: [D48656014](https://our.internmc.facebook.com/intern/diff/D48656014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107407
Approved by: https://github.com/albanD, https://github.com/eqy
2023-08-25 01:44:26 +00:00
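A minimal sketch of the new argument as described in the quoted docstring (assuming the `torch.cuda.graph` signature exposed by this PR; not code from the PR itself):

```python
import torch

g = torch.cuda.CUDAGraph()
static_inp = torch.randn(16, device="cuda")

# Warm up on a side stream before capture (the usual CUDA-graph recipe).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_inp * 2
torch.cuda.current_stream().wait_stream(s)

# "thread_local" only errors on unsafe actions from the capturing thread,
# so e.g. data preprocessing running in other threads is not affected.
with torch.cuda.graph(g, capture_error_mode="thread_local"):
    static_out = static_inp * 2

static_inp.copy_(torch.randn(16, device="cuda"))
g.replay()  # recomputes static_out from the updated static_inp
```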
cyy
49eeca00d1 [1/N] fix clang-tidy warnings in torch/csrc (#107648)
Apply fixes to some issues found by clang-tidy in torch/csrc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107648
Approved by: https://github.com/Skylion007
2023-08-25 00:30:09 +00:00
Zachary DeVito
c9b5e9d7a8 [allocator] register oom observers on every device (#107399)
This change is to match the behavior of _record_memory_history which was
recently changed to enable history recording on all devices rather than
the current one. It prevents confusing situations where the observer
was registered before the device was set for the training run.

It also ensures the allocators have been initialized in the python binding just in case this is the first call to the CUDA API.
Fixes #107330
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107399
Approved by: https://github.com/eellison
ghstack dependencies: #107171
2023-08-23 18:57:24 +00:00
Zachary DeVito
cc54448a07 [memory snapshot] add 'address' key to block (#107171)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107171
Approved by: https://github.com/ngimel
2023-08-23 18:57:24 +00:00
PyTorch MergeBot
22cade56ba Revert "[Reland] Upgrade NVTX to NVTX3 (#97582)"
This reverts commit 5bbfb96203.

Reverted https://github.com/pytorch/pytorch/pull/97582 on behalf of https://github.com/izaitsevfb due to Breaks meta RL builds ([comment](https://github.com/pytorch/pytorch/pull/97582#issuecomment-1679568525))
2023-08-15 20:55:12 +00:00
Zachary DeVito
80988b6277 Introduce memory stacks for free (#106758)
Previously when we recorded a free action in a memory trace, we would provide
the stack for when the block was allocated. This is faster because we do not
have to record stacks for free, which would otherwise double the number of stacks
collected. However, sometimes knowing the location of a free is useful for
figuring out why a tensor was live. So this PR adds this behavior. If
performance ends up being a concern the old behavior is possible by passing
"alloc" to the context argument rather than "all".

Also refactors some of glue logic to be consistent across C++ and Python and
routes the Python API through the C++ version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106758
Approved by: https://github.com/albanD
2023-08-14 20:38:15 +00:00
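A short sketch of the choice described above (the `context` parameter name is taken from the commit message; the exact signature may differ between releases):

```python
import torch

# Record stacks for both allocation and free events ("all"), or keep the
# older, cheaper behavior of allocation stacks only ("alloc").
torch.cuda.memory._record_memory_history(context="all")
# torch.cuda.memory._record_memory_history(context="alloc")
```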
cyy
5bbfb96203 [Reland] Upgrade NVTX to NVTX3 (#97582)
PR #90689 replaces NVTX with NVTX3. However, the torch::nvtoolsext target is created only when the third-party NVTX is used.
This is clearly a logical error. We now move the creation code out of the branch to cover all cases. This should fix the issues reported in the comments of #90689.

It would be better to move configurations of the failed FRL jobs to CI tests so that we can find such issues early before merging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97582
Approved by: https://github.com/peterbell10
2023-08-14 16:55:25 +00:00
Alexander Pivovarov
02abbb8109 Fix some typos, mostly "that that" (#106901)
Fix some typos
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106901
Approved by: https://github.com/janeyx99
2023-08-10 19:46:53 +00:00
Nikita Shulga
dfd441a12c [BE] Use nested namespaces in torch/csrc/cuda (#106928)
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 6b1dde1</samp>

> _`namespace` syntax_
> _Simplified with C++17_
> _Code is more readable_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106928
Approved by: https://github.com/huydhn, https://github.com/izaitsevfb
2023-08-10 03:56:09 +00:00
Zachary DeVito
3e5a52cedd [memory snapshot] track context for segments (#106113)
We want to display the stack for the original cudaMalloc that created a segment.
Previously we could only report the last time the segment memory was used,
or the record of the segment_alloc could appear in the list of allocator actions.
This PR ensures that, regardless of whether we still have the segment_alloc action,
the context for a segment is still available. The visualizer is updated to
be able to incorporate this information.

This PR adds a new field to Block. However the previous stacked cleanup PR
 removed a field of the same size, making the change to Block size-neutral.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106113
Approved by: https://github.com/aaronenyeshi
2023-07-28 06:45:48 +00:00
Zachary DeVito
45b564766d [memory snapshots] removed chained history (#106079)
For free blocks of memory in the allocator, we previously kept a linked list
of the stack frames of previous allocations that lived there. This was only
ever used in one flamegraph visualization and never proved useful at
understanding what was going on. When memory history tracing was added, it
became redundant, since we can see the history of the free space from recording
the previous actions anyway.

This patch removes this functionality and simplifies the snapshot format:
allocated blocks directly have a 'frames' attribute rather than burying stack frames in the history.
Previously the memory history tracked the real size of allocations before rounding.
Since history was added, 'requested_size' has been added directly to the block which records the same information,
so this patch also removes that redundancy.

None of this functionality has been part of a PyTorch release with BC guarantees, so it should be safe to alter
this part of the format.

This patch also updates our visualization tools to work with the simplified format. Visualization tools keep
support for the old format in `_legacy` functions so that during the transition old snapshot files can still be read.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106079
Approved by: https://github.com/eellison
2023-07-28 06:45:48 +00:00
Xiaodong Wang
22eaedacd3 [nccl] Do no skip send/recv 0 byte tensor (#103140)
Summary:
Since NCCL 2.12.10, NCCL supports 0-byte send/recv: https://github.com/NVIDIA/nccl/issues/696. Therefore we don't have to skip them.

One issue is that if a rank has 0 bytes to send and 0 bytes to recv, it will skip send/recv completely and proceed to the next collective, where it can send/recv something, which is confusing to the other ranks. Another solution is to add a barrier, but that's very expensive.

Test Plan: will add a unit test

Differential Revision: D46507785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103140
Approved by: https://github.com/malfet, https://github.com/kwen2501
2023-06-23 01:57:53 +00:00
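A hedged sketch of the case being discussed, a zero-element send/recv pair (assumes a 2-rank NCCL process group launched e.g. via torchrun; illustrative only):

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

t = torch.empty(0, device="cuda")   # zero bytes to transfer
if rank == 0:
    dist.send(t, dst=1)             # no longer skipped with NCCL >= 2.12.10
elif rank == 1:
    dist.recv(t, src=0)
```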
atalman
6ac8a11746 Switch cuda 12.1 docker images to gcc9 (#102380)
Update CUDA-12.1 CI docker images to gcc-9, which should tentatively fix the internal compiler error in [libtorch-linux-bionic-cuda12.1-py3.7-gcc7 / build](https://github.com/pytorch/pytorch/actions/runs/5071681366/jobs/9135310361)

Co-authored by: Nikita Shulga <nshulga@meta.com>

Fixes: https://github.com/pytorch/pytorch/issues/102372
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102380
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-05-30 23:03:55 +00:00
Natalia Gimelshein
ecd79b1fef add additional stream priority for cuda streams (#101956)
Changes the StreamID encoding to use the last bit to distinguish between external and internal streams, 4 bits for IdType (DEFAULT, EXT or user-created streams possibly with high priority), and 5 bits for index. This allows us to have more stream priorities exposed to users (I'm currently setting 4, but that's easy to change now). Note that we are pre-creating all 32 streams in the pool for each allowed priority; I don't know if it's a problem in practice. Currently CUDA 11.8/A100 GPUs allow 6 different stream priorities; the number may differ across cards and CUDA versions.

Previous call sites explicitly requesting a high-priority stream (`isHighPriority=true`) now get the highest-priority stream.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101956
Approved by: https://github.com/ezyang
2023-05-27 02:36:16 +00:00
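A small usage sketch of the extra priorities (more negative means higher priority; the number of distinct levels depends on the device and CUDA version, as noted above):

```python
import torch

high = torch.cuda.Stream(priority=-1)   # higher-priority stream
low = torch.cuda.Stream(priority=0)     # default priority

with torch.cuda.stream(high):
    y = torch.randn(1024, device="cuda").sum()
```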
PyTorch MergeBot
6c9b94dcda Revert "add additional stream priority for cuda streams (#101956)"
This reverts commit 5da497cabb.

Reverted https://github.com/pytorch/pytorch/pull/101956 on behalf of https://github.com/osalpekar due to Broke internal builds that used -Wunused-function since this PR removed the call to StreamIdType::<< ([comment](https://github.com/pytorch/pytorch/pull/101956#issuecomment-1563875493))
2023-05-26 06:35:23 +00:00
Natalia Gimelshein
5da497cabb add additional stream priority for cuda streams (#101956)
Changes the StreamID encoding to use the last bit to distinguish between external and internal streams, 4 bits for IdType (DEFAULT, EXT or user-created streams possibly with high priority), and 5 bits for index. This allows us to have more stream priorities exposed to users (I'm currently setting 4, but that's easy to change now). Note that we are pre-creating all 32 streams in the pool for each allowed priority; I don't know if it's a problem in practice. Currently CUDA 11.8/A100 GPUs allow 6 different stream priorities; the number may differ across cards and CUDA versions.

Previous call sites explicitly requesting a high-priority stream (`isHighPriority=true`) now get the highest-priority stream.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101956
Approved by: https://github.com/ezyang
2023-05-24 23:26:47 +00:00
Elias Ellison
a8ff647e42 Disable conv cache emptying (#101038)
We warm up cudagraph trees in the cudagraph memory pool so that if we are part of the way through a run, and a large majority of memory is already allocated to cudagraphs, we don't try to allocate again to eager, which would split the memory pool in half. However, this is causing us to fail the following assert due to the `emptyCache` call in cuDNN benchmarking: https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp#L2959.

Disable the empty-cache call during cudagraph warmup to fix the error. Disabling did not have a significant effect on memory:

![image](https://github.com/pytorch/pytorch/assets/11477974/90513a1e-aa77-410c-a32e-2f80b99e673f)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101038
Approved by: https://github.com/shunting314, https://github.com/ngimel
2023-05-12 18:49:46 +00:00
Elias Ellison
0ec4646588 CUDA Graph Trees - error on deallocated access (#100927)
Turn warning to error if we detect tensor is accessed after its memory is overwritten/released by a new invocation of cudagraphs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100927
Approved by: https://github.com/zou3519
2023-05-11 17:17:14 +00:00
PyTorch MergeBot
cbfed470bd Revert "CUDA Graph Trees - error on deallocated access (#100927)"
This reverts commit 3941bbc5ba.

Reverted https://github.com/pytorch/pytorch/pull/100927 on behalf of https://github.com/jeanschmidt due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/100927#issuecomment-1543874258))
2023-05-11 12:07:20 +00:00
Elias Ellison
3941bbc5ba CUDA Graph Trees - error on deallocated access (#100927)
Turn warning to error if we detect tensor is accessed after its memory is overwritten/released by a new invocation of cudagraphs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100927
Approved by: https://github.com/zou3519
2023-05-10 17:15:33 +00:00
Jeff Daily
bf214f40d4 explicitly check or discard cudaGetLastError return value (#100488)
cudaGetLastError and hipGetLastError will clear any error value within CUDA and HIP, respectively. This is often done on purpose to clear benign errors. Discarding the return value should be indicated by casting to void and a nearby comment. This silences warnings from HIP:

warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]

Performing an audit of pytorch sources found one use of cudaGetLastError that was incorrectly ignored in IndexKernel.cu.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100488
Approved by: https://github.com/ezyang
2023-05-10 01:24:07 +00:00
Elias Ellison
3edff6b6ec Improve detection of workspace/non-output allocations in cudagraphs (#99985)
When we run cudagraph trees we are not allowed to have permanent workspace allocations like in cublas because we might need to reclaim that memory for a previous cudagraph recording, and it is memory that is not accounted for in output weakrefs so it does not work with checkpointing. Previously, I would check that we didn't have any additional allocations through snapshotting. This was extremely slow so I had to turn it off.

This PR first does a quick check to see if we are in an error state and, only if we are, runs the slow logic of creating a snapshot. It also turns on history recording so we get a stack trace of where the bad allocation came from.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99985
Approved by: https://github.com/zdevito
2023-05-01 15:58:45 +00:00
Elias Ellison
d881b2978c Make autocast cache and buffer stealing aware of cudagraph static output tensors (#99368)
In this stack of PRs we add caching of output tensors for cudagraph trees after the initial recording. On initial recording we do not cache tensor outputs because this would prevent memory from being reclaimed. On subsequent executions we do cache them to avoid overhead. However, because there is an extra reference around, this caused divergent recording & execution behavior in both autocast caching and autograd gradient stealing. Divergent recording & execution would keep re-recording and eventually stabilize, but it's not what you want to see happen.

This pr makes the autocast cache and buffer stealing aware of the cudagraph static output tensors.

I will add this to the other cudagraph impl in another pr.

Not sure if this should be in autograd or in autocast since it affects both, or somewhere else.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99368
Approved by: https://github.com/albanD, https://github.com/ezyang
2023-04-24 20:23:12 +00:00
Ke Wen
9df8b1b594 Init comm_nonblocking_ when creating AutoNcclGroup (#99679)
A quick, trial fix for #99677.

My guess is that when the code instantiates an `AutoNcclGroup` object, it comes with an uninitialized random value for member `comm_nonblocking_`. Then `if (comm_nonblocking_)` evaluates to true, and `NCCL_CHECK_TIMEOUT` is triggered.

This change is safe (and needed) anyway whether it indeed fixes #99677.

Cc @eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99679
Approved by: https://github.com/eqy, https://github.com/awgu
2023-04-21 07:56:39 +00:00
Abhishek Jindal
16a4dc0f93 Correct typo for NCCL_MAJOR (#99482)
Correct Typo from NCCL_MACJOR to NCCL_MAJOR

Fixes Typo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99482
Approved by: https://github.com/eqy, https://github.com/kwen2501
2023-04-21 03:27:08 +00:00
Elias Ellison
472f46635e Cache output tensors on execution (#98944)
Caches output tensors for the common case when the output Tensor storage is unaliased for all graph outputs in all paths. For these persisted tensors we adjust the liveness tracking by also checking that the output tensor does not have an additional python reference.

I limit cached output tensors to be unaliased. If a descendent node discovers it has an alias of a prior output, then the aliased output will no longer be persisted in the ancestor.

The large majority of tensors are unaliased, and preserving aliased output tensors would add significant additional complexity with marginal gains. For instance, when doing checkpointing and re-recordings, we need to remove the persisted tensors; otherwise they would prevent memory from being reclaimed. If a single persisted tensor were present in multiple paths, that would create an inter-path dependence which adds complexity. Additionally, each further caching of the output would affect the reference count of the other caches, and that reference count would also need to be adjusted depending on whether a node was checkpointed.

I still need to do a complete run, but for the models I tried this makes the performance extremely close between the trees and non-trees impl.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98944
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-04-18 19:44:47 +00:00
Elias Ellison
93b64f0ad3 [Easy] Remove C++ call now that it wont be on hot path (#98943)
Since we will be caching output tensors, it is no longer necessary for this logic to be in C++

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98943
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-04-18 19:28:37 +00:00
Zachary DeVito
7ff1f3f3f6 Revert "Revert "Expandable blocks in allocator (#96995)"" (#99275)
This reverts commit 851e89c8e8.

Differential Revision: [D45034526](https://our.internmc.facebook.com/intern/diff/D45034526)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99275
Approved by: https://github.com/eellison
2023-04-17 23:46:08 +00:00
PyTorch MergeBot
851e89c8e8 Revert "Expandable blocks in allocator (#96995)"
This reverts commit 6a50b83b73.

Reverted https://github.com/pytorch/pytorch/pull/96995 on behalf of https://github.com/izaitsevfb due to Breaks internal tests
2023-04-16 19:23:37 +00:00
Animesh Jain
fdbc8625a1 Functionalization of torch.rand/rand_like ops (#97377)
This PR introduces the functionalization of RNG ops. Key points are

* Introduces a new `philox_rand` prim operator that accepts seed, offset.
* Adds decompositions for random operators that use these philox_rand prims
* Adds a PhiloxStateTracker to track the offset for each occurrence of rand ops
* Changes calling convention of AOT Autograd and adds <fwd_seed, fwd_base_offset> and <bwd_seed, bwd_base_offset>
* Monkeypatches set_rng_state and get_rng_state while AOT Autograd tracing to record the rng state behavior
* Raises an assertion for CPU because CPU does not use Philox RNG.

Not dealt in this PR
* dropout op - offset calculation is different
* other distributions like normal, poisson etc
* Inductor support
* Cudagraph support
* Dynamic shape support

An example
~~~

class Custom(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        a = torch.rand_like(x) * x
        a = torch.rand_like(x) * a
        return a

    @staticmethod
    def backward(ctx, grad_out):
        x, = ctx.saved_tensors
        return grad_out * torch.rand_like(grad_out) * torch.cos(x)

====== Forward graph 0 ======
def forward(self, fwd_seed_1: i64[], fwd_base_offset_1: i64[], primals_1: f32[16, 16]):
    # No stacktrace found for following nodes
    add: i64[] = torch.ops.aten.add.Tensor(fwd_base_offset_1, 0)
    philox_rand: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], fwd_seed_1, add, [16, 1], device(type='cuda', index=0), torch.float32);  add = None
    mul: f32[16, 16] = torch.ops.aten.mul.Tensor(philox_rand, primals_1);  philox_rand = None
    add_1: i64[] = torch.ops.aten.add.Tensor(fwd_base_offset_1, 4);  fwd_base_offset_1 = None
    philox_rand_1: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], fwd_seed_1, add_1, [16, 1], device(type='cuda', index=0), torch.float32);  fwd_seed_1 = add_1 = None
    mul_1: f32[16, 16] = torch.ops.aten.mul.Tensor(philox_rand_1, mul);  philox_rand_1 = mul = None
    return [mul_1, primals_1]

====== Backward graph 0 ======
def forward(self, bwd_seed_1: i64[], bwd_base_offset_1: i64[], primals_1: f32[16, 16], tangents_1: f32[16, 16]):
    # No stacktrace found for following nodes
    add_2: i64[] = torch.ops.aten.add.Tensor(bwd_base_offset_1, 0);  bwd_base_offset_1 = None
    philox_rand_2: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], bwd_seed_1, add_2, [16, 1], device(type='cuda', index=0), torch.float32);  bwd_seed_1 = add_2 = None
    mul_2: f32[16, 16] = torch.ops.aten.mul.Tensor(tangents_1, philox_rand_2);  tangents_1 = philox_rand_2 = None
    cos: f32[16, 16] = torch.ops.aten.cos.default(primals_1);  primals_1 = None
    mul_3: f32[16, 16] = torch.ops.aten.mul.Tensor(mul_2, cos);  mul_2 = cos = None
    return [mul_3]

~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97377
Approved by: https://github.com/ezyang
2023-04-16 09:55:56 +00:00
Zachary DeVito
6a50b83b73 Expandable blocks in allocator (#96995)
Common advice we give for handling memory fragmentation issues is to
allocate a big block upfront to reserve memory which will get split up later.
For programs with changing tensor sizes this can be especially helpful to
avoid OOMs that happen the first time we see a new largest input and would
otherwise have to allocate new segments.

However, the issue with allocating a block upfront is that it is nearly impossible
to correctly estimate the size of that block. If too small, space in the block
will run out and the allocator will allocate separate blocks anyway. Too large,
and other non-PyTorch libraries might stop working because they cannot allocate
any memory.

This patch provides the same benefits as using a pre-allocated block but
without having to choose its size upfront. Using the cuMemMap-style APIs,
it adds the ability to expand the last block in a segment when more memory is
needed.

Compared to universally using cudaMallocAsync to avoid fragmentation,
this patch can fix this common fragmentation issue while preserving most
of the existing allocator behavior. This behavior can be enabled and disabled dynamically.
 This should allow users to, for instance, allocate long-lived parameters and state in individual buffers,
and put temporary state into the large expandable blocks, further reducing
fragmentation.

See inline comments for information about the implementation and its limitations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96995
Approved by: https://github.com/eellison
2023-04-14 09:49:11 +00:00
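A minimal sketch of enabling this behavior through the allocator config string (the same knob that shows up in the snapshot metadata commit at the top of this log):

```python
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # before CUDA init

import torch

# Growing allocations can now expand the last block of a segment instead of
# forcing a brand-new segment, reducing fragmentation for changing shapes.
for n in (1024, 2048, 4096):
    x = torch.randn(n, n, device="cuda")
```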
Eddie Yan
c4f81cb6f4 [NCCL] Add experimental Nonblocking NCCL Fault Tolerance/Checking (#95715)
Support for nonblocking NCCL communicators/fault tolerance/checking which was added in 2.14 as an experimental feature.
Enabled via the environment variable:
```
TORCH_NCCL_USE_COMM_NONBLOCKING=1
```

CC @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95715
Approved by: https://github.com/kwen2501
2023-04-14 02:03:33 +00:00
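A sketch of opting in via the environment variable from the commit message (set before the process group creates its communicators; illustrative only):

```python
import os
os.environ["TORCH_NCCL_USE_COMM_NONBLOCKING"] = "1"

import torch
import torch.distributed as dist

dist.init_process_group("nccl")   # communicators created in nonblocking mode
```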
PyTorch MergeBot
1149ba5553 Revert "[NCCL] Add experimental Nonblocking NCCL Fault Tolerance/Checking (#95715)"
This reverts commit a33eac3988.

Reverted https://github.com/pytorch/pytorch/pull/95715 on behalf of https://github.com/PaliC due to This pr has caused a regression on distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess causing it to timeout (https://hud.pytorch.org/failure/distributed%2Ftest_dynamo_distributed.py%3A%3ATestMultiProc%3A%3Atest_ddp_baseline_aot_eager_multiprocess)
2023-04-12 21:15:49 +00:00
Eddie Yan
a33eac3988 [NCCL] Add experimental Nonblocking NCCL Fault Tolerance/Checking (#95715)
Support for nonblocking NCCL communicators/fault tolerance/checking which was added in 2.14 as an experimental feature.
Enabled via the environment variable:
```
TORCH_NCCL_USE_COMM_NONBLOCKING=1
```

CC @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95715
Approved by: https://github.com/kwen2501
2023-04-12 18:33:10 +00:00
Xiaodong Wang
78ad800a2a [nccl] Remove lock for nccl collective launch for 2.0+ (#97904)
Summary: It looks like NCCL 2.0+ no longer needs a lock to avoid being called concurrently with cudaFree.

Test Plan: sandcastle + OSS CI

Differential Revision: D44514446

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97904
Approved by: https://github.com/malfet, https://github.com/kwen2501
2023-04-11 23:58:54 +00:00
Aidyn-A
69eef5a4be [CUDA12] set_device change (#94864)
This PR adds workaround for CUDA 12 [`cudaSetDevice` change](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g159587909ffa0791bbe4b40187a4c6bb) which will always create primary context on target device. So operations like this:
```Python
import torch
x = torch.randn(1, device="cuda:1")
```
would always create a primary context on device `cuda:1`, because it is creating a tensor on it, and on device `cuda:0`, because the destructor of the CUDA device guard calls `cudaSetDevice(0)`.
After this PR the CUDA device guard will not call `cudaSetDevice(0)` if a primary context does not exist on `cuda:0`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94864
Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/ezyang
2023-04-10 17:31:12 +00:00
Elias Ellison
5c8fea5647 Reduce overhead in CUDAGraph Trees (#98529)
Significantly reduces overhead of constructing Tensors and Storages and checking Storage Liveness. Removes the regression for HF models that I tested and removes 75% of overhead of the extremely overhead bound resnet50 training we have in torchbench. (.91x base commit, 1.02x torchinductor default, 1.16x this PR, 1.25 previous cudagraphs impl).

This PR takes care of all of the lower hanging fruit.

- Computes storage aliasing at record time instead of at runtime. We no longer need to use a runtime storage cache and can instead index directly into the existing alias if there is one, or construct a new Storage

- Moves the heavyweight C++ calls into a batch - getting storage weakrefs and constructing tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98529
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-04-07 05:46:08 +00:00
Zachary DeVito
e085acc9f3 Cleanup Copy.cu logic (#97071)
Some of the logic specific to the cudaMallocAsync allocator related to peer access is placed outside of the allocator itself. This PR refactors, documents, and encapsulates it, while maintaining the same behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97071
Approved by: https://github.com/ngimel, https://github.com/eellison
2023-04-06 17:22:35 +00:00
PyTorch MergeBot
279ca5f9db Revert "[CUDA12] set_device change (#94864)"
This reverts commit c18be2b2ec.

Reverted https://github.com/pytorch/pytorch/pull/94864 on behalf of https://github.com/ezyang due to avoid affecting cuda 11
2023-04-05 14:53:00 +00:00
Aidyn-A
c18be2b2ec [CUDA12] set_device change (#94864)
This PR adds workaround for CUDA 12 [`cudaSetDevice` change](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g159587909ffa0791bbe4b40187a4c6bb) which will always create primary context on target device. So operations like this:
```Python
import torch
x = torch.randn(1, device="cuda:1")
```
would always create a primary context on device `cuda:1`, because it is creating a tensor on it, and on device `cuda:0`, because the destructor of the CUDA device guard calls `cudaSetDevice(0)`.
After this PR the CUDA device guard will not call `cudaSetDevice(0)` if a primary context does not exist on `cuda:0`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94864
Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/ezyang
2023-04-05 14:34:00 +00:00
mikey dagitses
da28af3286 distinguish mutability of StorageImpl::data_ptr() member (#97651)
See D44409928.

Differential Revision: [D44410323](https://our.internmc.facebook.com/intern/diff/D44410323/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97651
Approved by: https://github.com/ezyang
2023-03-30 19:13:56 +00:00
Zachary DeVito
759e527ea1 Use internal symbolizer for FBCODE (#97172)
Summary:
addr2line does not work fast on fbcode binaries, so use the
internal symbolization pathway.

Test Plan: sandcastle

Differential Revision: D44227690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97172
Approved by: https://github.com/eellison
2023-03-27 19:24:12 +00:00
PyTorch MergeBot
5170995b2a Revert "Upgrade NVTX to NVTX3 (#90689)"
This reverts commit e64ddd1ab9.

Reverted https://github.com/pytorch/pytorch/pull/90689 on behalf of https://github.com/osalpekar due to Build Failures due to not being able to find one nvtx3 header in FRL jobs: [D42332540](https://www.internalfb.com/diff/D42332540)
2023-03-24 18:16:06 +00:00
cyy
e64ddd1ab9 Upgrade NVTX to NVTX3 (#90689)
Due to the recent upgrade to CUDA 11, we can upgrade NVTX to NVTX3 as well, which is a header-only library that can simplify the build system a lot.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90689
Approved by: https://github.com/soumith, https://github.com/malfet
2023-03-23 01:56:42 +00:00
Nikita Shulga
24ce3a7c34 Move hasPrimaryContext to c10::cuda (#96800)
This method has to be accessible from `c10` to enable CUDA-12 integration.
Implemented by providing a private `c10::cuda::_internal::setHasPrimaryContext` that passes the pointer to the implementation (in `torch_cuda`) back to c10.
Use global class constructor/destructor to guarantee RAII.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96800
Approved by: https://github.com/ngimel
2023-03-17 04:50:35 +00:00
Elias Ellison
571f96bf59 cudagraph trees (#89146)
CUDA Graph Trees

Design doc: https://docs.google.com/document/d/1ZrxLGWz7T45MSX6gPsL6Ln4t0eZCSfWewtJ_qLd_D0E/edit

Not currently implemented:

- Right now, we are using weak tensor refs from outputs to check if a tensor has died. This doesn't work because of a) aliasing, and b) aot_autograd detaching tensors (see note [Detaching saved tensors in AOTAutograd]). We would need either https://github.com/pytorch/pytorch/issues/91395 to land to use storage weak refs, or to manually add a deleter fn that does what I want. This is doable but there are some interactions with the caching allocator checkpointing, so saving for a stacked PR.

- Reclaiming memory from the inputs during model recording. This isn't terribly difficult but deferring to another PR. You would need to write over the input memory during warmup, and therefore copy the inputs to cpu. Saving for a stacked pr.

- Warning on overwriting previous-generation outputs, and handling nested torch.compile() calls in generation tracking

Differential Revision: [D43999887](https://our.internmc.facebook.com/intern/diff/D43999887)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89146
Approved by: https://github.com/ezyang
2023-03-17 02:47:03 +00:00
Elias Ellison
ea7415087a Expose Stream Recording Apis in python (#96384)
Differential Revision: [D43999891](https://our.internmc.facebook.com/intern/diff/D43999891)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96384
Approved by: https://github.com/zdevito
2023-03-16 23:45:43 +00:00
Zachary DeVito
e74f70d212 Revert "Revert "[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541)"" (#96878)
This reverts commit e1ea584b1c.
Adds __has_include check to fix fbcode build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96878
Approved by: https://github.com/ezyang
2023-03-16 04:12:54 +00:00
PyTorch MergeBot
e1ea584b1c Revert "[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541)"
This reverts commit 4e1060c609.

Reverted https://github.com/pytorch/pytorch/pull/95541 on behalf of https://github.com/DanilBaibak due to breaking internal builds
2023-03-15 13:28:41 +00:00
Nikita Shulga
a229e78544 [BE] Enforce sign-compare (#96723)
A number of OSS PRs were reverted because of new signed-unsigned comparison warnings, which are treated as errors in some internal builds.
Not sure how those selective rules are applied, but this PR removes `-Wno-sign-compare` from the PyTorch codebase.

The only tricky part in this PR is making sure that non-ASCII character detection works for both signed and unsigned chars here:
6e3d51b08a/torch/csrc/jit/serialization/python_print.cpp (L926)

Exclude several files from sign-compare checks when flash attention is used, due to a violation in cutlass, to be fixed by https://github.com/NVIDIA/cutlass/pull/869
Do not try to fix sign-compare violations in the caffe2 codebase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96723
Approved by: https://github.com/albanD
2023-03-15 06:04:20 +00:00
Zachary DeVito
85639c1a88 [allocator] Generalize recording to a pool (#96542)
Previously the allocator would query whether a stream was recording a graph,
and look up the pool associated with a graph. This change has the allocator
directly associate a stream with a mempool, decoupling "record this stream to a pool"
from the action of "record all actions to a cuda graph".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96542
Approved by: https://github.com/eellison
2023-03-15 04:28:49 +00:00
Zachary DeVito
4e1060c609 [memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541)
This refactors the stack trace facility specific to memory profiling
    in python+cuda to make a generic facility to generate combined stack
    traces.

    The generic facility (combined_traceback.h) does not require
    python to be around to work, but will return python stacks if it is
    present.

    This facility is then used to add support for stack trace gathering in memory profiling that
    happens directly from C++.

    It is also used to expose a python API for gathering and symbolizing
    combined stacks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95541
Approved by: https://github.com/ezyang
2023-03-14 18:26:05 +00:00
Elias Ellison
da265652d6 Return Live Data Pointers from Checkpoint, swap onto tensors (#95020)
When we checkpoint the state of the private pool allocator, we will need to make sure that its current live allocated blocks will get properly cleaned up when the tensors they correspond to die. Return DataPtrs for these newly allocated blocks that the callee can swap onto live Tensors.

The exact API for setting the checkpoint can be adjusted after this as the cudagraph implementation is built out, but this at least shows it's sufficiently general.

This should be the last PR touching cuda caching allocator necessary for new cudagraphs integration.

Differential Revision: [D43999888](https://our.internmc.facebook.com/intern/diff/D43999888)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95020
Approved by: https://github.com/zdevito
2023-03-14 01:22:19 +00:00
Elias Ellison
1cc32aedb0 Handle additional live allocations not in checkpointed state (#94943)
We choose to ignore certain blocks that are currently allocated when we set the pool to its checkpoint. For those blocks, we need to swap out their deleter functions so that a deallocation is not triggered when they die.

Differential Revision: [D43999886](https://our.internmc.facebook.com/intern/diff/D43999886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94943
Approved by: https://github.com/zdevito
2023-03-14 01:00:47 +00:00
Elias Ellison
d798de2b05 Checkpoint CUDA Allocator Private Pool State (#94653)
Copying note from cuda caching allocator:

```
   * Note [Checkpointing PrivatePoolState]
   *
   * Refer above to Note [Interaction with CUDA graph capture]. Allocations made
   * during graph capture are made from a separate private pool. During graph
   * capture allocations behave as usual. During graph replay the allocator
   * state does not change even as new tensors are created. The private pool
   * will not free its blocks to the main caching allocator until cuda graph use
   * is finished to prevent an allocation from eager clobbering the memory from
   * a live but unaccounted for tensor that was created during replay.
   *
   * `make_graphed_callables`, a series of separate callables chained in
   * successive cuda graphs, can share a memory pool because after a cuda graph
   * recording the allocations in the shared private pool exactly reflect the
   * tensors that are allocated.
   *
   * We would like to extend callable chaining to support a graphed callable
   * tree. In this scenario, we have a tree of callable chains which will be
   * captured with cuda graphs. In the diagram below, we have a tree with four
   * callables, A, B, C, and D. Suppose we have captured, and subsequently
   * replayed, A, B, and C. Then on a new invocation, we replay A and B, but
   * would now like to record D. At this point the private pool will not reflect
   * any of the live tensors created during graph replay. Allocations made
   * during a new recording with the pool could overwrite those live tensors.
   *
   * In order to record a new graph capture after replaying prior callables in
   * the tree, we need the allocator to reflect the state of the live tensors.
   * We checkpoint the state of the private pool after each recording, and then
   * reapply it when we are starting a new recording chain. Additionally, we
   * must free the allocations for any tensors that died between the end of our
   * previous graph replaying and our new recording (TODO). All of the allocated
   * segments that existed in the checkpointed state must still exist in the
   * pool. There may also exist new segments, which we will free (TODO : link
   * note [live tensors between iterations] when it exists).
   *
   *
   *  ---------------> A ---------------> B ---------------> C
   *                                |
   *                                |
   *                                |
   *                                |
   *                                  ---------------> D
```

A few TODOs:
- need to add logic for freeing tensors that have died between the last replay and the current new recording
- Add logic to handle free being called on a pointer multiple times (because we are manually freeing live tensors)

The two scenarios above have not been exercised in the tests yet.
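
For reference, the user-level entry point that this pool checkpointing ultimately serves is `make_graphed_callables`; a hedged sketch of the chained-callables case described in the note (not part of this diff):

```python
import torch

a = torch.nn.Linear(8, 8).cuda()
b = torch.nn.Linear(8, 8).cuda()
xa = torch.randn(4, 8, device="cuda", requires_grad=True)
xb = torch.randn(4, 8, device="cuda", requires_grad=True)

# The graphed callables share one private memory pool, which is why the pool's
# state has to be checkpointed/restored when extending a chain (or tree) of captures.
graphed_a, graphed_b = torch.cuda.make_graphed_callables((a, b), ((xa,), (xb,)))

out = graphed_b(graphed_a(xa))
out.sum().backward()
```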

Differential Revision: [D43999889](https://our.internmc.facebook.com/intern/diff/D43999889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94653
Approved by: https://github.com/zdevito
2023-03-14 00:47:30 +00:00
Zachary DeVito
4b372e3958 [memory profiling] C++ tracing support (#95357)
Adds the ability to quickly generate stack traces for C++,
and combine Python, TorchScript, and C++ frames into a single trace.

This makes it possible for the memory tracer to record allocations inside
C++ code (e.g. convolution temporaries, backward operators).

The unwinder code is ~10x faster than execinfo.h's backtrace because it
caches fast unwinder routines for instruction pointers that have already been seen.
It is also only 1.2--2x slower than copying the entire stack (the approach perf takes),
while using 2 orders of magnitude less space per stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95357
Approved by: https://github.com/bertmaher
2023-03-12 07:24:14 +00:00
Zachary DeVito
48490cec28 [memory profiling] Move Context object to c10 (#96280)
Minor refactor so that a follow-up PR can have objects that meet the GatheredContext
interface without having to depend on CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96280
Approved by: https://github.com/eellison
2023-03-12 07:24:14 +00:00
Zachary DeVito
266089a3fe [memory snapshots] record scripted stack traces (#95356)
Adds support for seeing both python and script stack traces in memory
debugging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95356
Approved by: https://github.com/aaronenyeshi
2023-03-12 07:24:14 +00:00
cyy
6786a24fd2 fix some tiny code issues (#95757)
This PR tries to fix:
1. a misspelled NDEBUG preprocessing condition.
2. getting rid of all writable-strings warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95757
Approved by: https://github.com/soulitzer
2023-03-01 23:27:32 +00:00
Zachary DeVito
4f84c57c87 Fix potential deadlock when recording memory traces (#95273)
See comment in the diff

Differential Revision: [D43490668](https://our.internmc.facebook.com/intern/diff/D43490668)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95273
Approved by: https://github.com/eellison
2023-02-27 19:04:47 +00:00
c-odrin
54b7c7d5e9 Added requested_bytes to CUDA Caching Allocator Stats (#88575)
Summary:
The caching allocator can be configured to round memory allocations in order to reduce fragmentation. Sometimes however, the overhead from rounding can be higher than the fragmentation it helps reduce.

We have added a new stat to CUDA caching allocator stats to help track if rounding is adding too much overhead and help tune the roundup_power2_divisions flag:
    - "requested_bytes.{current,peak,allocated,freed}": memory requested by client code, compare this with allocated_bytes to check if allocation rounding adds too much overhead

Test Plan: Added test case in caffe2/test/test_cuda.py

Differential Revision: D40810674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88575
Approved by: https://github.com/zdevito
2023-02-09 21:37:25 +00:00
cyy
1a32db15e7 Some performance fixes (#94034)
Applies some performance fixes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94034
Approved by: https://github.com/Skylion007
2023-02-04 02:17:48 +00:00
cyy
27efdc5eed fix writable-strings warnings (#93246)
clang reports "ISO C++11 does not allow conversion from string
literal to 'char *'"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93246
Approved by: https://github.com/malfet
2023-02-04 02:11:15 +00:00
cyy
bfe5e1258b avoid unnecessary static_cast (#93898)
avoid unnecessary static_cast
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93898
Approved by: https://github.com/Skylion007
2023-02-03 03:44:43 +00:00
cyy
e292ddff4e More clang-tidy fixes (#92944)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92944
Approved by: https://github.com/Skylion007
2023-01-25 19:11:51 +00:00
albanD
d8aa68c683 make sure that our error handling runs with the GIL enabled (#92848)
Fixes https://github.com/pytorch/pytorch/issues/92684

I checked the other use case of this API and they never release the GIL

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92848
Approved by: https://github.com/ngimel
2023-01-24 09:30:42 +00:00
cyy
045d1de02d Fix some code issues (#92760)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92760
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-01-24 08:19:03 +00:00
cyy
85851b1e8f remove useless clang-tidy suppression (#92287)
remove NOLINTNEXTLINE(cppcoreguidelines-pro-type-member-init)
remove NOLINTNEXTLINE(performance-move-const-arg)
remove NOLINTNEXTLINE(performance-no-automatic-move)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92287
Approved by: https://github.com/albanD
2023-01-21 02:33:24 +00:00
PyTorch MergeBot
523d4f2562 Revert "[cuDNN][cuDNN V8 API] Always build assuming cuDNN >= 8.0 (#91527)"
This reverts commit 4d07ad74f1.

Reverted https://github.com/pytorch/pytorch/pull/91527 on behalf of https://github.com/DanilBaibak due to Break internal build
2023-01-16 13:28:09 +00:00
Eddie Yan
4d07ad74f1 [cuDNN][cuDNN V8 API] Always build assuming cuDNN >= 8.0 (#91527)
We've been building with V8 (incl. V8 API) by default for a while now; this PR cleans up some guards for cuDNN < 8.0.

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91527
Approved by: https://github.com/ngimel
2023-01-13 18:55:37 +00:00
Peter Bell
eece6da162 [inductor] Reduce device context manager overhead (#91045)
This adds `torch.cuda._DeviceGuard` which is a stripped down version of
`torch.cuda.device` with lower overhead. To do this, it only accepts `int` as
the device so we don't need to call `_get_device_index` and is implemented
with a new C++ helper `torch._C._cuda_exchangeDevice` that allows
`_DeviceGuard.__enter__` to be just a single function call. On my machine,
I see a drop from 3.8us of overhead to 0.94 us with this simple benchmark:

```python
def set_device():
    with torch.cuda.device(0):
        pass

%timeit set_device()
```
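
And a hedged sketch of the lower-overhead guard this PR adds (`_DeviceGuard` is a private helper; the name and int-only argument are taken from the description above, so treat this as illustrative rather than a stable API):

```python
import torch

# Accepts a bare device index, skipping _get_device_index and most Python work.
with torch.cuda._DeviceGuard(0):
    y = torch.ones(4, device="cuda")
```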

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91045
Approved by: https://github.com/ngimel, https://github.com/anijain2305
2023-01-12 16:51:59 +00:00
Eddie Yan
e096d2db5a [BC-Breaking] Separate stream_id, device_index, and device_type in pack and unpack for Streams (#81596)
#75854

A naive attempt at working around the limitations of using a single 64-bit integer to pack `stream_id`, `device_index`, and `device_type`.

Still needs sanity checks, testing, and minimization of BC-breaking changes.

Currently a Holder for the `StreamData3` struct is used for `IValue` compatibility. While this seems to work for `ivalue.h` and `ivalue_inl.h`, it doesn't seem to work naively for the JIT CUDA stream wrapper (something about ambiguous calls if an `intrusive_ptr` to `c10::ivalue::StreamData3Holder` is used as the return type for `pack()`). It turns out that the methods required to access the fields for rematerializing a CUDA Stream are basically already present anyway, so `pack` is simply removed in the wrapper for now and the methods to access the required fields are called directly.

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81596
Approved by: https://github.com/ezyang
2023-01-12 14:16:49 +00:00
Emilio Castillo
07e595e88a Add device_idx to free_fn in CUDAPluggableAllocator (#91398)
This was requested by NVIDIA folks: also track the device index in the free function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91398
Approved by: https://github.com/albanD
2023-01-12 05:03:48 +00:00
PyTorch MergeBot
b3603f8129 Revert "Deduplicate c10 error and PyTorchError hierarchy (#87855)"
This reverts commit 34f2d3e6ae.

Reverted https://github.com/pytorch/pytorch/pull/87855 on behalf of https://github.com/osalpekar due to perf regression in quantization tests
2023-01-06 19:56:35 +00:00
William Phetsinorath
34f2d3e6ae Deduplicate c10 error and PyTorchError hierarchy (#87855)
Fixes #53370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87855
Approved by: https://github.com/albanD
2023-01-02 15:53:36 +00:00
Jianyu Huang
63b8ecc415 [CUDA12] Make PyTorch compatible with CUDA 12 (#91118)
Fix the failure when building PyTorch from source code using CUDA 12

```
In file included from /home/jianyuhuang/Work/Github/pytorch/c10/cuda/CUDAFunctions.h:12,
                 from /home/jianyuhuang/Work/Github/pytorch/c10/cuda/CUDAStream.h:10,
                 from /home/jianyuhuang/Work/Github/pytorch/c10/cuda/CUDAGraphsC10Utils.h:3,
                 from /home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.h:5,
                 from /home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.cpp:2:
/home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.cpp: In member function ‘void at::cuda::CUDAGraph::capture_end()’:
/home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.cpp:168:75: warning: converting to non-pointer type ‘long long unsigned int’ from NULL [-Wconversion-null]
     AT_CUDA_CHECK(cudaGraphInstantiate(&graph_exec_, graph_, NULL, NULL, 0));
                                                                           ^
/home/jianyuhuang/Work/Github/pytorch/c10/cuda/CUDAException.h:31:42: note: in definition of macro ‘C10_CUDA_CHECK’
     C10_UNUSED const cudaError_t __err = EXPR;                           \
                                          ^~~~
/home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.cpp:168:5: note: in expansion of macro ‘AT_CUDA_CHECK’
     AT_CUDA_CHECK(cudaGraphInstantiate(&graph_exec_, graph_, NULL, NULL, 0));
     ^~~~~~~~~~~~~
/home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.cpp:168:75: error: too many arguments to function ‘cudaError_t cudaGraphInstantiate(CUgraphExec_st**, cudaGraph_t, long long unsigned int)’
     AT_CUDA_CHECK(cudaGraphInstantiate(&graph_exec_, graph_, NULL, NULL, 0));
                                                                           ^
/home/jianyuhuang/Work/Github/pytorch/c10/cuda/CUDAException.h:31:42: note: in definition of macro ‘C10_CUDA_CHECK’
     C10_UNUSED const cudaError_t __err = EXPR;                           \
                                          ^~~~
/home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.cpp:168:5: note: in expansion of macro ‘AT_CUDA_CHECK’
     AT_CUDA_CHECK(cudaGraphInstantiate(&graph_exec_, graph_, NULL, NULL, 0));
     ^~~~~~~~~~~~~
In file included from /home/jianyuhuang/Work/Github/pytorch/c10/cuda/CUDAStream.h:6,
                 from /home/jianyuhuang/Work/Github/pytorch/c10/cuda/CUDAGraphsC10Utils.h:3,
                 from /home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.h:5,
                 from /home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.cpp:2:
/usr/local/cuda/include/cuda_runtime_api.h:11439:39: note: declared here
 extern __host__ cudaError_t CUDARTAPI cudaGraphInstantiate(cudaGraphExec_t *pGraphExec, cudaGraph_t graph, unsigned long long flags __dv(0));
                                       ^~~~~~~~~~~~~~~~~~~~
ninja: build stopped: subcommand failed.
```

```
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp: In function ‘void torch::cuda::shared::initCudartBindings(PyObject*)’:
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp:34:13: error: ‘cudaOutputMode_t’ was not declared in this scope
   py::enum_<cudaOutputMode_t>(
             ^~~~~~~~~~~~~~~~
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp:34:13: note: suggested alternative: ‘cudaGraphNode_t’
   py::enum_<cudaOutputMode_t>(
             ^~~~~~~~~~~~~~~~
             cudaGraphNode_t
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp:34:29: error: template argument 1 is invalid
   py::enum_<cudaOutputMode_t>(
                             ^
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp:38:30: error: ‘cudaKeyValuePair’ was not declared in this scope
       .value("KeyValuePair", cudaKeyValuePair)
                              ^~~~~~~~~~~~~~~~
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp:39:21: error: ‘cudaCSV’ was not declared in this scope
       .value("CSV", cudaCSV);
                     ^~~~~~~
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp:39:21: note: suggested alternative: ‘cudart’
       .value("CSV", cudaCSV);
                     ^~~~~~~
                     cudart
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp:99:7: error: ‘cudaProfilerInitialize’ was not declared in this scope
       cudaProfilerInitialize);
       ^~~~~~~~~~~~~~~~~~~~~~
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp:99:7: note: suggested alternative: ‘cudaProfilerStart’
       cudaProfilerInitialize);
       ^~~~~~~~~~~~~~~~~~~~~~
       cudaProfilerStart
ninja: build stopped: subcommand failed.
```

After these fixes, we can see CUDA 12 is successfully built with OSS PyTorch instructions.

USE_CUDA=1 python setup.py develop  2>&1 | tee compile.log
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91118
Approved by: https://github.com/ngimel, https://github.com/brad-mengchi
2022-12-20 10:58:53 +00:00
mikey dagitses
322e4b4c8a set -Wsuggest-override for builds (#89852)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/89852).
* __->__ #89852
* #89851

set -Wsuggest-override for builds

Summary: This was flagged by a Meta internal build.

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89852
Approved by: https://github.com/malfet
2022-12-19 22:08:47 +00:00
Aaron Gokaslan
12671fe620 Reserve space for std::vector output in extract_tensors for nccl python bindings (#88203)
Optimizes the nccl python bindings to reserve space when converting PyObject* into Tensors. This should reduce the number of unnecessary allocations in the nccl bindings as the std::vector grows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88203
Approved by: https://github.com/ezyang
2022-12-10 20:28:19 +00:00
eqy
62e450d55f [CUDA Graphs] Add option to dump a captured graph for debugging (#85519)
CC @xwang233 @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85519
Approved by: https://github.com/ngimel
2022-12-06 22:03:05 +00:00
Emilio Castillo
1c8b0779de Fix segfault when swapping custom allocator (#89613)
Just screwed it before merging ...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89613
Approved by: https://github.com/albanD
2022-11-24 18:25:28 +00:00
albanD
c79489c8e6 Expose to python the backward AD view_func (#89586)
This will be useful for other systems (AOTAutograd) that want to replay autograd views.

FYI @bdhirsh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89586
Approved by: https://github.com/soulitzer
2022-11-24 03:39:58 +00:00
Emilio Castillo
c9d4390d13 Add Pluggable CUDA allocator backend (#86786)
Fixes #43144

This uses the Backend system added by [82682](https://github.com/pytorch/pytorch/pull/82682) to change allocators dynamically during code execution. This will allow us to use RMM, to use CUDA managed memory for portions of the code that do not fit in GPU memory, to write static memory allocators that reduce fragmentation while training models, and to improve interoperability with external DL compilers/libraries.

For example, we could have the following allocator in c++

```c++
#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>

extern "C" {
void* my_malloc(ssize_t size, int device, cudaStream_t stream) {
   void *ptr;
   std::cout<<"alloc "<< size<<std::endl;
   cudaMalloc(&ptr, size);
   return ptr;
}

void my_free(void* ptr) {
   std::cout<<"free "<<std::endl;
   cudaFree(ptr);
}
}
```

Compile it as a shared library
```
nvcc allocator.cc -o alloc.so -shared --compiler-options '-fPIC'
```

And use it from PyTorch as follows

```python
import torch

# Init caching
# b = torch.zeros(10, device='cuda')
new_alloc = torch.cuda.memory.CUDAPluggableAllocator('alloc.so', 'my_malloc', 'my_free')
old = torch.cuda.memory.get_current_allocator()
torch.cuda.memory.change_current_allocator(new_alloc)
b = torch.zeros(10, device='cuda')
# This will error since the current allocator was already instantiated
torch.cuda.memory.change_current_allocator(old)
```

Things to discuss
- How to test this, needs compiling external code ...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86786
Approved by: https://github.com/albanD
2022-11-23 17:54:36 +00:00
Yuxin Wu
e6996ea172 Don't redefine __STDC_FORMAT_MACROS (#89310)
Similar to https://github.com/pytorch/pytorch/pull/39608 and https://github.com/pytorch/pytorch/pull/6676

This causes a compile error in our internal build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89310
Approved by: https://github.com/kit1980
2022-11-19 02:24:21 +00:00
Peter Bell
3d79ced8cf wrap_pybind_function: support member function pointers (#88932)
This updates `wrap_pybind_function` to use `invoke` and adds the
`invoke_traits` object, which is analogous to `function_traits` but,
for member functions, includes the class as an explicit argument.

To test this is working properly, I've also applied it to the
`CUDAGraph` binding code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88932
Approved by: https://github.com/albanD
2022-11-14 18:47:34 +00:00
Richard Barnes
3ece9fb45d Check all CUDA API calls for errors in torch/ (#81560)
Summary:
Original commit changeset: 0bb770d2cdb2

Original Phabricator Diff: D35194935 (79e5b053b6)

Differential Revision: D35291874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81560
Approved by: https://github.com/ezyang
2022-10-28 00:40:48 +00:00
Zachary DeVito
0d2c2110f1 [allocator] Introduce the abstract class CUDACachingAllocator (#87251)
This replaces the manual function pointers, making it easier to write
new drop-in allocators.

Note that most allocation goes through the Allocator interface, which
CUDAAllocator inherits from, and this arrangement avoids adding an
additional layer of dispatch along this pathway compared to what existed before.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87251
Approved by: https://github.com/wconstab
2022-10-20 01:17:00 +00:00
Zachary DeVito
f56ce8dbad [allocator] Move getFreeMutex (#87237)
It isn't used by the allocators at all, and this change makes that clearer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87237
Approved by: https://github.com/wconstab
2022-10-19 18:00:40 +00:00
Eddie Yan
25725fd624 (Re-open) Adds cudaMallocAsync as an alternative backend for the CUDA allocator (#82682)
Rebased version of @mcarilli 's cudaMallocAsync #65365 for continued testing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82682
Approved by: https://github.com/ngimel
2022-10-12 03:44:21 +00:00
eqy
352d926482 [CUBLAS][CUDA GRAPHS] (re-re-re-re-open of #83461) Explicitly set the workspace for cuBLAS handles (#86645)
re-opening (again) in hopes of working around failed/stuck CLA check

CC @ptrblck @ngimel @huydhn
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86645
Approved by: https://github.com/zdevito
2022-10-11 16:03:49 +00:00
Zachary DeVito
91b1bae1df Caching allocator tracing (#86241)
We currently can take snapshots of the state of the allocated cuda memory, but we do not have a way to correlate these snapshots with the actions the allocator took between snapshots. This PR adds a simple fixed-size buffer that records the major actions that the allocator takes (ALLOC, FREE, SEGMENT_ALLOC, SEGMENT_FREE, OOM, SNAPSHOT) and includes these with the snapshot information. Capturing periodic snapshots with a big enough trace buffer makes it possible to see how the allocator state changes over time.

We plan to use this functionality to guide how settings in the allocator can be adjusted and eventually have a more robust overall algorithm.

As a component of this functionality, we also add the ability to get a callback when the allocator will throw an OOM, primarily so that snapshots can be taken immediately to see why the program ran out of memory (most programs have some C++ state that would free tensors before the OutOfMemory exception can be caught).

This PR also updates the _memory_viz.py script to pretty-print the trace information and provide a better textual summary of snapshots distinguishing between internal and external fragmentation.
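
A hedged sketch of pulling the recorded trace out from Python (the keyword argument and dict keys below are my assumptions and have changed across versions, hence the defensive reads):

```python
import torch

torch.cuda.memory._record_memory_history(True)   # enable history/trace recording

bufs = [torch.empty(1 << 20, device="cuda") for _ in range(4)]
del bufs

snap = torch.cuda.memory._snapshot()
# Besides per-segment state, the snapshot now carries recent allocator actions
# (ALLOC, FREE, SEGMENT_ALLOC, ...).
for device_trace in snap.get("device_traces", []):
    for event in device_trace[:5]:
        print(event.get("action"), event.get("size"))
```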
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86241
Approved by: https://github.com/ngimel
2022-10-07 23:19:54 +00:00
Edward Z. Yang
adf5919720 Add option to record C++ backtraces in _record_memory_history (#86145)
I used this to debug https://github.com/pytorch/pytorch/issues/86136 so it is useful. The implementation is not so fast so it is not enabled by default.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86145
Approved by: https://github.com/albanD, https://github.com/zdevito
2022-10-06 04:07:37 +00:00
Edward Z. Yang
97d6b5bbf8 Refactor _cuda_recordMemoryHistory to use pybind11 (#86139)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86139
Approved by: https://github.com/albanD
2022-10-06 04:07:37 +00:00
Zachary DeVito
736adc0808 Memory snapshots from C++ (#86190)
Sometimes the driving process wants to save memory snapshots but isn't Python.
Add a simple API to turn it on without Python stack traces. It still
saves to the same format for the visualization and summary scripts, using
the C++ Pickler.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86190
Approved by: https://github.com/ezyang
2022-10-05 07:36:39 +00:00
Zachary DeVito
db13049b88 [allocator tracing] missing GIL acquire (#86254)
Bug where the context destructor needs to hold the GIL to free the context.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86254
Approved by: https://github.com/ezyang
2022-10-05 05:54:05 +00:00
PyTorch MergeBot
71eb04403c Revert "[CUBLAS][CUDA GRAPHS] (re-re-open of #83461) Explicitly set the workspace for cuBLAS handles (#85447)"
This reverts commit b04b2fa9aa.

Reverted https://github.com/pytorch/pytorch/pull/85447 on behalf of https://github.com/seemethere due to Caused a CUDA memory leak, detected by our performance benchmark suite
2022-09-30 20:53:41 +00:00
Eddie Yan
b04b2fa9aa [CUBLAS][CUDA GRAPHS] (re-re-open of #83461) Explicitly set the workspace for cuBLAS handles (#85447)
Now includes @dagitses 's optimizations and fixes for teardown

CC @ngimel @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85447
Approved by: https://github.com/malfet
2022-09-28 16:04:58 +00:00
PyTorch MergeBot
0ac6311356 Revert "[CUBLAS][CUDA GRAPHS] (re-open of #83461) Explicitly set the workspace for cuBLAS handles (#85292)"
This reverts commit 4012e623e8.

Reverted https://github.com/pytorch/pytorch/pull/85292 on behalf of https://github.com/dagitses due to broke an internal test during shutdown. Re-submit with #85399 in stack
2022-09-21 17:57:49 +00:00
eqy
4012e623e8 [CUBLAS][CUDA GRAPHS] (re-open of #83461) Explicitly set the workspace for cuBLAS handles (#85292)
re-open of #83461 with fix for 10.2 build

CC @ngimel @malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85292
Approved by: https://github.com/malfet
2022-09-20 16:31:54 +00:00
Hector Yuen
d23ce29761 allow changing the cuda allocator settings even after the process started (#84970)
Summary:
- expose a python call to set the allocator settings; it uses the same format as the value of PYTORCH_CUDA_ALLOC_CONF (a minimal usage sketch follows below)
- keep the implementation contained within the cpp file to avoid increasing build times; only expose a function to apply the settings
- make some of the Allocator Config methods public, so now it looks more like a singleton
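
A hedged usage sketch (the module path `torch.cuda.memory._set_allocator_settings` is my assumption of where the call is exposed; the commit only states that a Python call exists):

```python
import torch

# Same "key:value,key:value" format as the PYTORCH_CUDA_ALLOC_CONF env var,
# but applied after the process (and the caching allocator) has already started.
torch.cuda.memory._set_allocator_settings(
    "max_split_size_mb:500,garbage_collection_threshold:0.8"
)
```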

Test Plan: added the unit test

Differential Revision: D39487522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84970
Approved by: https://github.com/zdevito
2022-09-17 09:42:42 +00:00
Masaki Kozuki
ab6c57217a Add NCCL PreMul Sum to c10d redce ops (#84243)
This is based on #81272, but this version conforms to the TorchScript compiler.

## TODO
- [ ] Update abaf8112e6/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp (L64-L73) to use `ReduceOp::RedOpType`. In my first try with `USE_SYSTEM_UCC=1`, this change wasn't necessary (I think) because of `ReduceOp::RedOpType` operator. That being said, I want to make it more explicit.

cc @ptrblck @kwen2501 @aazzolini
cc @zasdfgbnm for visibility to the TODO above
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84243
Approved by: https://github.com/kwen2501
2022-09-02 21:57:45 +00:00
PyTorch MergeBot
1f61c39ac4 Revert "Support NCCL Premul Sum (#81272)"
This reverts commit 432c508e71.

Reverted https://github.com/pytorch/pytorch/pull/81272 on behalf of https://github.com/weiwangmeta due to breaking internal builds
2022-08-25 05:01:37 +00:00
Masaki Kozuki
432c508e71 Support NCCL Premul Sum (#81272)
This PR adds the support for https://docs.nvidia.com/deeplearning/nccl/archives/nccl_21212/user-guide/docs/api/ops.html?highlight=premul#c.ncclRedOpCreatePreMulSum.

The major changes include
- convert enum ReduceOp to struct
- add premul sum specific paths to init.cpp and Ops.cpp.

note:
- For pip wheels / conda binaries to support this, ~~I think https://github.com/pytorch/pytorch/pull/79132 would be needed~~ https://github.com/pytorch/pytorch/pull/82775 landed

The commit titled "add nccl premul" whose current hash is cb99ad6744 was authored by @mcarilli and @ptrblck.

cc @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81272
Approved by: https://github.com/kwen2501
2022-08-24 04:53:25 +00:00
Zachary DeVito
726d040692 annotated allocator snapshots (#82146)
Record stack trace information for each allocated segment in the allocator.
It takes around 1.5us to record 50 stack frames of context.
Since invoking a PyTorch operator takes around 8us, this adds minimal overhead, but we still leave it disabled by default so that we can test it more on real workloads first.

Stack information is kept both for allocated blocks and for the last allocation that used now-inactive blocks. We could potentially keep around the _first_ allocation that caused the block to get allocated from cuda as well.

Potential Followups:
* stack frame entries are small (16 bytes), but the list of Frames is not compressed even though most frames will share some entries. So far this doesn't produce huge dumps (7MB for one real workload that uses all memory on the GPU), but it could be made much smaller through compression.
* Code to format the information is slow (a few seconds) because it uses python and FlameGraph.pl
* Things allocated during the backward pass have no stack frames because they are run on another C++ thread.
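
A hedged sketch of turning recording on and dumping a snapshot that the `_memory_viz.py` / FlameGraph tooling mentioned above can consume (argument names and the viz subcommand are my assumptions for this era of the private API):

```python
import pickle
import torch

torch.cuda.memory._record_memory_history(True)   # disabled by default, per the note above

model = torch.nn.Linear(1024, 1024).cuda()
out = model(torch.randn(64, 1024, device="cuda"))

snap = torch.cuda.memory._snapshot()              # segments now carry stack frames
with open("snapshot.pickle", "wb") as f:
    pickle.dump(snap, f)

# The dump can then be summarized/visualized, e.g.:
#   python torch/cuda/_memory_viz.py stats snapshot.pickle
```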
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82146
Approved by: https://github.com/albanD
2022-08-09 17:21:35 +00:00
Jeff Daily
263c05c918 [ROCm] work-around missing hipProfilerStart/Stop (#82778)
### Description
cudaProfilerStart and cudaProfilerStop are deprecated but exposed by torch.cuda.cudart().  HIP has corresponding functions stubbed out, hipProfilerStart and hipProfilerStop, but they return hipErrorNotSupported.  Profiling in HIP is supported, but not via these deprecated APIs.

See https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__PROFILER__DEPRECATED.html.

These functions are indirectly used by one or more unit tests that would otherwise pass if the non-functional HIP APIs were replaced with a dummy function.

### Testing
Unskipped a related unit test, run by ciflow/trunk.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82778
Approved by: https://github.com/ezyang
2022-08-08 18:25:13 +00:00
Edward Z. Yang
df69660832 Revert "Revert "Add a lint rule for torch/csrc/util/pybind.h include (#82552)"" (#82599)
This reverts commit 532b8a9e00.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82599
Approved by: https://github.com/albanD
2022-08-02 19:37:02 +00:00
PyTorch MergeBot
532b8a9e00 Revert "Add a lint rule for torch/csrc/util/pybind.h include (#82552)"
This reverts commit 9465c0e0b5.

Reverted https://github.com/pytorch/pytorch/pull/82552 on behalf of https://github.com/zengk95 due to This seems to be breaking windows binary wheels
2022-08-01 20:25:35 +00:00
Edward Z. Yang
9465c0e0b5 Add a lint rule for torch/csrc/util/pybind.h include (#82552)
We define specializations for pybind11 defined templates
(in particular, PYBIND11_DECLARE_HOLDER_TYPE) and consequently
it is important that these specializations *always* be #include'd
when making use of pybind11 templates whose behavior depends on
these specializations, otherwise we can cause an ODR violation.

The easiest way to ensure that all the specializations are always
loaded is to designate a header (in this case, torch/csrc/util/pybind.h)
that ensures the specializations are defined, and then add a lint
to ensure this header is included whenever pybind11 headers are
included.

The existing grep linter didn't have enough knobs to do this
conveniently, so I added some features.  I'm open to suggestions
for how to structure the features better.  The main changes:

- Added an --allowlist-pattern flag, which turns off the grep lint
  if some other line exists.  This is used to stop the grep
  lint from complaining about pybind11 includes if the util
  include already exists.

- Added --match-first-only flag, which lets grep only match against
  the first matching line.  This is because, even if there are multiple
  includes that are problematic, I only need to fix one of them.
  We don't /really/ need this, but when I was running lintrunner -a
  to fixup the preexisting codebase it was annoying without this,
  as the lintrunner overall driver fails if there are multiple edits
  on the same file.

I excluded any files that didn't otherwise have a dependency on
torch/ATen, this was mostly caffe2 and the valgrind wrapper compat
bindings.

Note the grep replacement is kind of crappy, but clang-tidy lint
cleaned it up in most cases.

See also https://github.com/pybind/pybind11/issues/4099

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82552
Approved by: https://github.com/albanD
2022-08-01 17:16:58 +00:00
Kurt Mohler
32cf6c6fb0 Remove THPTensor defs, override macros, and GenerateByteType.h (#82503)
### Description
These are old definitions and files that aren't used anymore.

### Issue
Fixes #82502

### Testing
N/A
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82503
Approved by: https://github.com/ezyang
2022-07-30 19:40:16 +00:00
Kurt Mohler
863176a1c7 Remove torch/csrc/generic (#82373)
### Description
Remove `torch/csrc/generic` since it is no longer needed.

### Issue
#82372

### Testing
No tests added

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82373
Approved by: https://github.com/ezyang
2022-07-28 07:45:31 +00:00
Can Balioglu
56dea92d97 Fix set_requires_cuda_init (#81183)
Fixes the buggy `set_requires_cuda_init` introduced in #80788.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81183
Approved by: https://github.com/ezyang
2022-07-11 15:36:42 +00:00
Eddie Yan
ae6dd20ba7 [cuDNN V8 API] (reopen 2) Allow the number of kernels profiled under torch.backends.cudnn.benchmark = True to be limitedCudnnv8 benchmark limit (#78299)
Reopen of #77002 to address comments by @malfet

CC @ngimel @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78299
Approved by: https://github.com/ngimel
2022-07-07 23:25:23 +00:00
Can Balioglu
081b56fd41 Improve readability of cuda_lazy_init (#80788)
This PR cleans up the implementation of `cuda_lazy_init.cpp` and improves its readability. No behavioral changes are introduced.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80788
Approved by: https://github.com/ezyang
2022-07-04 16:47:11 +00:00
jjsjann123
9e86796fe3 simple c10 implementation for std::call_once (#78051)
A long-standing bug in std::call_once: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66146
It could hang during re-entry after exception handling.

Added a c10 implementation built around a plain mutex. Not the most efficient thing, but at least it shouldn't hang.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78051
Approved by: https://github.com/albanD
2022-06-28 15:47:03 +00:00
George Gensure
7d17e3b884 Windows nvtx for CUDA 11.6 require _wgetenv decl (#79798)
Summary:
The include ordering modified via clang-format dropped a
declaration of _wgetenv (declared and used in nvtxInit.h) that was
previously acquired transitively through the pybind include. Provide an
explicit platform include of wchar.h, which defines it on Windows, prior
to including nvToolsExt.h.

Test Plan: Windows msvc cuda11 build should pass.

Differential Revision: D37244689

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79798
Approved by: https://github.com/kit1980
2022-06-18 04:42:57 +00:00
Nikita Shulga
e895672b35 Followup fix after #78828 (#79554)
Will be skipped when imported internally, for more details see https://www.internalfb.com/diff/D37114156?src_version_fbid=3331368873807344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79554
Approved by: https://github.com/albanD
2022-06-14 20:20:18 +00:00
Michael Suo
30fb2c4aba [lint] autoformat test/cpp and torch/csrc
Let's have some fun.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78828

Approved by: https://github.com/ezyang
2022-06-11 21:11:16 +00:00
Sherlock Huang
6db8440f35 Python Jiterator supports multiple outputs (#78139)
This PR is part3.
Part1: https://github.com/pytorch/pytorch/pull/77902
Part2: https://github.com/pytorch/pytorch/pull/77921

Python Jiterator now supports returning multiple outputs

```
fn = torch.cuda.jiterator._create_multi_output_jit_fn(
"""
template <typename T>
T binary_2outputs(T i0, T i1, T& out0, T& out1) {
    out0 = i0 + i1;
    out1 = i0 - i1;
}
""",
num_outputs=2)

x = torch.rand(3, device='cuda')
y = torch.rand(3, device='cuda')
out0, out1 = fn(x, y)

torch.allclose(out0, x+y)
torch.allclose(out1, x-y)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78139
Approved by: https://github.com/ngimel
2022-05-24 21:52:56 +00:00
Natalia Gimelshein
4ea176ea57 expose fast get_current_stream (#78165)
Expose fast no-frills version of getting raw `cudaStream_t` in python (200 ns instead of 4 us)
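
A hedged sketch of the fast path versus the existing one (the private binding name reflects my understanding of what inductor uses and may differ from the exact symbol added here):

```python
import torch

# Existing, fully-featured path (a few microseconds of Python/binding overhead):
s = torch.cuda.current_stream().cuda_stream

# New no-frills path: raw cudaStream_t for a given device index.
raw = torch._C._cuda_getCurrentRawStream(0)
```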

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78165
Approved by: https://github.com/SherlockNoMad, https://github.com/soumith, https://github.com/gchanan
2022-05-24 15:54:47 +00:00
Kurt Mohler
aea6e2c396 Merge torch.cuda._UntypedStorage into torch._UntypedStorage (#75459)
Fixes #74933

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75459
Approved by: https://github.com/ezyang
2022-05-19 13:54:39 +00:00
Michael Carilli
929f1d5317 [RELAND] Adds torch.cuda.is_current_stream_capturing (#77789)
Resubmit of https://github.com/pytorch/pytorch/pull/77673, which was reverted due to Windows test failures: https://github.com/pytorch/pytorch/pull/77673#issuecomment-1130425845.

I suspect these failures happened because I don't explicitly set a side stream for graph capture in the new test.
Not setting a side stream explicitly is alright on Linux because cuda tests implicitly use a side stream.
I think Windows cuda tests implicitly use the default stream, breaking capture and leaving the backend in a bad state.
Other graph tests explicitly set side streams and don't error in Windows builds, so I'm 95% sure doing the same for the new test will work.
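
A minimal sketch of the query this PR exposes:

```python
import torch

g = torch.cuda.CUDAGraph()
x = torch.zeros(8, device="cuda")

print(torch.cuda.is_current_stream_capturing())   # False outside capture

with torch.cuda.graph(g):
    # torch.cuda.graph captures on a side stream; the query reflects that.
    assert torch.cuda.is_current_stream_capturing()
    y = x + 1

print(torch.cuda.is_current_stream_capturing())   # False again after capture
```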
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77789
Approved by: https://github.com/ezyang
2022-05-18 23:18:53 +00:00
Jeff Daily
de86146c61 rocblas alt impl during backward pass only (#71881)
In preparation for adopting future rocblas library options, it is necessary to track when the backward pass of training is executing.  The scope-based helper class `BackwardPassGuard` is provided to toggle this state.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71881
Approved by: https://github.com/albanD
2022-05-18 19:42:58 +00:00
PyTorch MergeBot
0d8a0f186b Revert "Adds torch.cuda.is_current_stream_capturing (#77673)"
This reverts commit d03d43df52.

Reverted https://github.com/pytorch/pytorch/pull/77673 on behalf of https://github.com/suo
2022-05-18 19:31:49 +00:00
Michael Carilli
d03d43df52 Adds torch.cuda.is_current_stream_capturing (#77673)
Exposes a way to query if CUDA graph capture is underway on the current stream.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77673
Approved by: https://github.com/ezyang
2022-05-18 16:46:35 +00:00
Sherlockk Huang
8b6a78f39f Python Interface for Jiterator
This PR allows users to author CUDA kernels in Python.

```
from torch.cuda.jiterator import create_jit_fn

code_string = "template <typename T> T my_kernel(T x, T y, T alpha) { return  -x * y + x - y + alpha; }"
jitted_fn = create_jit_fn(code_string, alpha=0)

a = torch.rand(3, device='cuda')
b = torch.rand(3, device='cuda')
result = jitted_fn(a, b, alpha=1.0)
```

Limitations:
- Only supports elementwise kernels
- 1~8 tensor inputs (empty input, e.g. factory methods, is not supported)
- input tensors must live on a CUDA device
- CPU Scalars are not supported
- kwargs must be pre-declared when calling create_jit_fn
- kwargs must be convertible to at::Scalar, one of float64, int64_t, bool (complex is not supported for now)

TODOs:
- [x] consolidate union and c10::variant implementation
- [x] plug into existing op testing framework
- [ ] rename files, place files in the right folder
- [ ] place util functions in the right file
- [x] enforce assumptions in python interface e.g <8 inputs, kwargs types
- [x] Add user-facing documentation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76394
Approved by: https://github.com/mruberry
2022-05-06 18:44:28 +00:00
KyleCZH
97fbe6f0a4 [ROCm] use ncclAllToAll for rocm
use ncclAllToAll for rocm version > 5.0; https://github.com/ROCmSoftwarePlatform/rccl/pull/503

detail on ncclAllToAll:
https://github.com/ROCmSoftwarePlatform/rccl/pull/503

@jithunnair-amd  @amathews-amd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75128
Approved by: https://github.com/wenkaidu, https://github.com/yzygitzh, https://github.com/seemethere
2022-04-22 18:39:48 +00:00
Nikita Shulga
c40a009d66 Revert D35194935: Check all CUDA API calls for errors in torch/
Test Plan: revert-hammer

Differential Revision:
D35194935 (79e5b053b6)

Original commit changeset: f5ec5be87cdf

Original Phabricator Diff: D35194935 (79e5b053b6)

fbshipit-source-id: 0bb770d2cdb29b8e724c0b6a125c748f363d3358
(cherry picked from commit 04e5a73da4a53b0ec296f3df2c85626d19290c1f)
2022-03-31 05:48:30 +00:00
Richard Barnes
79e5b053b6 Check all CUDA API calls for errors in torch/ (#74923)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74923

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D35194935

fbshipit-source-id: f5ec5be87cdf775eb9c99f8c3baed6b0366dda49
(cherry picked from commit 7284c4ed7d57261d4936055e0c1a3f8f911fb1f0)
2022-03-31 05:08:55 +00:00
Shubhorup Biswas
e505f06a79 More folders in clang-tidy (#74908)
Summary:
1. Added folders torch/csrc/onnx and torch/csrc/cuda to clang-tidy.
2. Fixed clang-tidy violations in torch/csrc/cuda
3. Fixed the clang-tidy Python runner (added Python import paths)

Fixes some of https://github.com/pytorch/pytorch/issues/62011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74908

Reviewed By: atalman

Differential Revision: D35221843

Pulled By: shahofblah

fbshipit-source-id: f0d1f066550b383aa48449b12d194009977c0bd8
(cherry picked from commit 830186a673f432c9f3558f3e9cf1cd4294fa0fb0)
2022-03-29 22:59:16 +00:00
Stephen Oakley
1646a0033d Use irange in PyTorch (#72836)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72836

Replacing increment iterator loops with ranged loops. It allows loops such as for(int i=0;i<10;i++) to be expressed as for(const auto i : c10::irange(10)). This auto-types the loops and adds const-safety to the iteration variable.

Reviewed By: albanD

Differential Revision: D34136539

fbshipit-source-id: 760a70ad43ce6f05630ba8fea261d4dbb699e62e
(cherry picked from commit 0428408d88)
2022-02-18 19:29:07 +00:00
Louis Feng
83b3b5fb00 [PyTorch] Support NVTX range_start and range_end (#70030)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70030

range_push and range_pop do not support multi-threading; they only work when the push and pop of a range happen on the same thread.

For process-level ranges, we should use range_start and range_end. This is important because the PyTorch forward pass runs on one thread, while autograd runs on a different thread.

See NVidia implementation documentation:
cab2dec760/NSight/nvToolsExt.h (L397-L407)
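
A hedged usage sketch (function names as exposed under `torch.cuda.nvtx` per this change; treat the exact spellings as my assumption):

```python
import torch

# range_start returns a handle that can be closed from a different thread,
# unlike range_push/range_pop, which must pair up on the same thread.
handle = torch.cuda.nvtx.range_start("fwd+bwd")

model = torch.nn.Linear(32, 32).cuda()
loss = model(torch.randn(4, 32, device="cuda")).sum()
loss.backward()          # autograd work runs on another thread; the range still covers it

torch.cuda.nvtx.range_end(handle)
```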

Test Plan:
```
buck test caffe2/test:cuda

Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/8162774391483460
    ✓ ListingSuccess: caffe2/test:cuda - main (19.640)
Summary
  ListingSuccess: 1
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/8162774391483460
```

Reviewed By: malfet

Differential Revision: D33155244

fbshipit-source-id: c7d5143f6da9b6ef0e0811e2fcae03a3e76f24de
(cherry picked from commit 22134e91b7)
2022-02-07 17:31:57 +00:00
Wanchao Liang
6feba4bc7e Implement scatter primitive for ProcessGroupNCCL (#70029)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70029

This PR implements NCCL scatter and adds scatter to ProcessGroupNCCL.

NCCL doesn’t directly provide primitives for scatter, so it needs to be implemented on top of NCCL’s send/recv API; a user-level sketch follows the steps below.

1. In ProcessGroupNCCL.cpp, the inputTensors are first flattened, then outputTensors and inputFlattened are passed by the collective class to scatter() function in nccl.cpp.
2. In nccl.cpp, scatter is implemented using ncclSend/ncclRecv: the root rank uses a for loop to send(distribute) the inputTensors to each rank, then all the ranks receive the inputTensor from the root rank.
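
A hedged user-level sketch of the new primitive (assumes an initialized NCCL process group; the launch command is illustrative):

```python
import torch
import torch.distributed as dist

# Minimal sketch; launch with: torchrun --nproc_per_node=2 scatter_demo.py
dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

out = torch.empty(4, device="cuda")
scatter_list = (
    [torch.full((4,), float(i), device="cuda") for i in range(dist.get_world_size())]
    if rank == 0
    else None
)
dist.scatter(out, scatter_list, src=0)   # exercises the ncclSend/ncclRecv path described above
print(rank, out)
```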
ghstack-source-id: 147754837

Test Plan:
test_scatter_ops
test_scatter_stress
test_scatter_checks

Reviewed By: pritamdamania87

Differential Revision: D33154823

fbshipit-source-id: 4513e7eaf7d47a60eb67da99dc6c2e9a2882f3fd
(cherry picked from commit 93201f9d4a)
2022-01-27 19:37:55 +00:00
Wanchao Liang
9b53d3194c Implement gather primitive for ProcessGroupNCCL (#66745)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66745

This PR implements NCCL gather and adds gather to ProcessGroupNCCL using the NCCL send/recv API.

NCCL doesn’t directly provide primitives for gather, so it needs to be implemented on top of NCCL’s send/recv API.
1. In ProcessGroupNCCL.cpp, the outputTensors are first flattened, then inputTensors and outputFlattened are passed by the collective class to gather() function in nccl.cpp.
1. In nccl.cpp, gather is implemented using ncclSend/ncclRecv: all the ranks send inputTensor to the root rank, and the root rank uses a for loop to receive these inputTensors.
ghstack-source-id: 147754838

Test Plan:
test_gather_ops
test_gather_checks
test_gather_stress

Reviewed By: pritamdamania87

Differential Revision: D29616361

fbshipit-source-id: b500d9b8e67113194c5cc6575fb0e5d806dc7782
(cherry picked from commit d560ee732e)
2022-01-27 19:37:55 +00:00
Andrey Talman
efd274bbcb Fix for windows builds with python 3.10 , getting rid of ssize_t (ssize_t is not a C++ defined type) (#71390)
Summary:
Fix for Windows builds with Python 3.10, getting rid of ssize_t.

Here is the completed binary build: https://app.circleci.com/pipelines/github/pytorch/pytorch/441527/workflows/144edb79-b398-4d70-92fe-b63158c1b439/jobs/16954881

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71390

Reviewed By: samdow

Differential Revision: D33637686

Pulled By: atalman

fbshipit-source-id: fcdfca672dc20385a3d2339c20e69bd2d1717e88
(cherry picked from commit 2ac58b0dc1)
2022-01-18 22:12:41 +00:00
Shintaro Iwasaki
5cae40c169 [pytorch][aten][cuda] move CUDAGeneratorImpl.h to ATen/cuda (#70650)
Summary:
This patch moves a CUDA-specific file, `CUDAGeneratorImpl.h` to `ATen/cuda` as the following TODO comment in  `CUDAGeneratorImpl.h` suggests:
```
// TODO: this file should be in ATen/cuda, not top level
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70650

Reviewed By: jianyuh, xw285cornell

Differential Revision: D33414890

Pulled By: shintaro-iwasaki

fbshipit-source-id: 4ff839205f4e4ea4c8767f164d583eb7072f1b8b
2022-01-10 22:27:04 -08:00
Natalia Gimelshein
c9e898fef8 delete TH (#69929)
Summary:
Move TH<C>GenerateByteType includes into torch/csrc (the only place they are used), and we can remove the TH folder altogether!
The only things left in THC are includes kept for BC compatibility.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69929

Reviewed By: mruberry

Differential Revision: D33133013

Pulled By: ngimel

fbshipit-source-id: 78c87cf93d2d641631b0f71051ace318bf4ec3c1
2021-12-16 10:45:30 -08:00
Peter Bell
b08d64202a Remove THGeneral (#69041)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69041

`TH_CONCAT_{N}` is still being used by THP, so I've moved that into
its own header, but all the compiled code is gone.

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D32872477

Pulled By: ngimel

fbshipit-source-id: 06c82d8f96dbcee0715be407c61dfc7d7e8be47a
2021-12-13 16:14:28 -08:00
Peter Bell
9962bfb3c9 Remove THTensor (#69040)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69040

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D32872478

Pulled By: ngimel

fbshipit-source-id: f93e16509d64308d91e374744410a6a811e7f4e3
2021-12-10 02:29:11 -08:00
Peter Bell
b2e79ed5ec Remove WindowsTorchApiMacro.h in favor of Export.h (#69585)
Summary:
Follow up to https://github.com/pytorch/pytorch/issues/68095

This also changes the files from the ATen folder to include c10's `Export.h` instead since they can't ever be exporting `TORCH_PYTHON_API`.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69585

Reviewed By: mrshenli

Differential Revision: D32958594

Pulled By: albanD

fbshipit-source-id: 1ec7ef63764573fa2b486928955e3a1172150061
2021-12-09 17:30:09 -08:00
Mike Ruberry
dc87cf5fe1 Fixes mem_get_info when querying on a device other than the current device (#69640)
Summary:
Also fixes the documentation failing to appear and adds a test to validate that op works with multiple devices properly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69640

Reviewed By: ngimel

Differential Revision: D32965391

Pulled By: mruberry

fbshipit-source-id: 4fe502809b353464da8edf62d92ca9863804f08e
2021-12-08 23:04:30 -08:00
Peter Bell
e279963eef Remove remaining THC code (#69039)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69039

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D32872476

Pulled By: ngimel

fbshipit-source-id: 7972aacc24aef9450fb59b707ed6396c501bcb31
2021-12-08 12:18:08 -08:00
Peter Bell
bf01cd5228 Move THC_sleep to ATen (#69038)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69038

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D32872479

Pulled By: ngimel

fbshipit-source-id: 97c7592b16eee2ecc66c42507c358aa92cc8ee50
2021-12-06 10:20:43 -08:00
Thomas Viehmann
ac1fe91dc9 Clean up some THC includes (#69024)
Summary:
These seem to not be needed and cause ninja to rebuild the files at every build.

(There also is THCStorage.cu, but hopefully this will go away with https://github.com/pytorch/pytorch/issues/68556 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69024

Reviewed By: soulitzer

Differential Revision: D32705309

Pulled By: ngimel

fbshipit-source-id: 5255297f213fdcf36e7203de7460a71291f8c9a0
2021-11-29 20:55:27 -08:00
Kurt Mohler
d9e7d85390 Remove TH/THC Storage (#68556)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67852

cc ezyang bhosmer smessmer ljk53 bdhirsh

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68556

Reviewed By: ejguan

Differential Revision: D32652758

Pulled By: ngimel

fbshipit-source-id: 170956fca112606f9008abe09b92c6ddc411be09
2021-11-29 12:55:20 -08:00
Yifan Xiong
c7eaec86f0 [NCCL] Patch bfloat16 support (#67843)
Summary:
Patch bfloat16 support in NCCL. PR https://github.com/pytorch/pytorch/issues/63260 adds bfloat16 support but is
still not complete enough to enable bfloat16 allreduce in end-to-end training; a minimal usage sketch follows the list below.

This patch does the following:
* fix minimum NCCL version from 2.9.7 to 2.10, NCCL adds bf16 support in
  v2.10.3-1 (commit 7e51592)
* update bfloat16 datatype flag in `csrc/cuda/nccl.cpp` so that NCCL
  operations like all reduce can use it
* enable unit tests for bfloat16 datatype if possible
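
A hedged end-to-end sketch of what this enables (assumes NCCL >= 2.10, per the version bump above, and an initialized process group):

```python
import torch
import torch.distributed as dist

# Launch with torchrun --nproc_per_node=<N> allreduce_bf16.py
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

grad = torch.ones(1024, dtype=torch.bfloat16, device="cuda")
dist.all_reduce(grad)   # bf16 allreduce, enabled by the datatype mapping fixed above
```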

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67843

Reviewed By: H-Huang

Differential Revision: D32248132

Pulled By: mrshenli

fbshipit-source-id: 081e96e725af3b933dd65ec157c5ad11c6873525
2021-11-09 13:46:13 -08:00
Kurt Mohler
4d99bc839b Remove TH/THC Storage functions for unused dtypes (#67480)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67466

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67480

Reviewed By: mruberry

Differential Revision: D32023494

Pulled By: ngimel

fbshipit-source-id: 8827e1d6e765fee7219b5ee9888a1a3e3c5fbe89
2021-11-01 11:45:20 -07:00
Peter Bell
b3bb234e16 Remove THCGeneral.cpp (#66766)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66766

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D31721647

Pulled By: ngimel

fbshipit-source-id: 5033a2800871c8745a1a92e379c9f97c98af212e
2021-10-19 16:09:19 -07:00
Natalia Gimelshein
719d43a2a2 Revert D31547709: Remove THCGeneral.cpp
Test Plan: revert-hammer

Differential Revision:
D31547709 (aa0c31876b)

Original commit changeset: 059c47621863

fbshipit-source-id: e8c3597f2badbc5ecf356b381edea06a07331f24
2021-10-16 21:50:19 -07:00
Peter Bell
aa0c31876b Remove THCGeneral.cpp (#66391)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66391

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D31547709

Pulled By: ngimel

fbshipit-source-id: 059c47621863738fb560f4257e7765afa9b952aa
2021-10-16 14:53:52 -07:00
Kurt Mohler
5883523c1d Remove dtype from torch.Storage and use only torch.ByteStorage (#62030)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62030

Remove dtype tracking from Python Storage interface, remove all the different `<type>Storage` classes except for `ByteStorage`, and update serialization accordingly, while maintaining as much FC/BC as possible

Fixes https://github.com/pytorch/pytorch/issues/47442

* **THE SERIALIZATION FORMAT IS FULLY FC/BC.** We worked very hard to make sure this is the case. We will probably want to break FC at some point to make the serialization structure of tensors make more sense, but not today.
* There is now only a single torch.ByteStorage class. Methods like `Tensor.set_` no longer check that the dtype of storage is appropriate.
* As we no longer know what the dtype of a storage is, we've **removed** the `size` method from Storage, replacing it with `nbytes`. This helps catch otherwise silent errors where the number of elements is confused with the number of bytes.
* `Storage._new_shared` takes an `nbytes` kwarg and rejects the previous positional-only calls. `Storage._new_with_file` and `_set_from_file` require explicit element-size arguments.
* It's no longer possible to convert storages to different types using the float/double/etc. methods. Instead, do the conversion using a tensor (see the sketch after this list).
* It's no longer possible to allocate a typed storage directly using FloatStorage/DoubleStorage/etc constructors. Instead, construct a tensor and extract its storage. The classes still exist but they are used purely for unpickling.
* The preexisting serialization format stores dtype with storage, and in fact this dtype is used to determine the dtype of the tensor overall.
 To accommodate this case, we introduce a new TypedStorage concept that exists only at unpickling time and temporarily holds the dtype so we can construct a tensor. **If you overrode the handling of pickling/unpickling, you MUST add handling for TypedStorage** or your serialization code will degrade to standard file-based serialization.
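
As a concrete sketch of the new surface (illustrative values only, written
against the API as described above): sizes are queried in bytes, and dtype
conversion goes through a tensor rather than per-dtype storage methods.

```
import torch

t = torch.arange(4, dtype=torch.float32)
s = t.storage()               # the storage itself no longer carries a dtype

print(s.nbytes())             # 16 (4 elements * 4 bytes); size() is gone, so
                              # element counts and byte counts can't be mixed up

# Converting a storage to another dtype now goes through a tensor,
# not through s.double()/s.float()/etc.:
d = t.double()
print(d.storage().nbytes())   # 32

# Allocating a typed storage directly (FloatStorage(), DoubleStorage(), ...) is
# no longer supported; build a tensor and take its storage instead:
s2 = torch.empty(8, dtype=torch.float16).storage()
```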

Original pull request: https://github.com/pytorch/pytorch/pull/59671

Reviewed By: soulitzer, ngimel

Differential Revision: D29466819

Pulled By: ezyang

fbshipit-source-id: 4a14e5d3c2b08e06e558683d97f7378a3180b00e
2021-10-05 13:50:34 -07:00
Peter Bell
f6dfac6974 Migrate THCCachingHostAllocator to ATen (#65746)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65746

This also removes the cudaHostAllocator field on THCState, since there
doesn't seem to be an API anywhere for customizing it.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D31236630

Pulled By: ngimel

fbshipit-source-id: 2a8e756222ae70565e77f8e7139d60ec5be32276
2021-09-30 21:26:38 -07:00
Pruthvi Madugundu
085e2f7bdd [ROCm] Changes not to rely on CUDA_VERSION or HIP_VERSION (#65610)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65610

- Replace HIP_PLATFORM_HCC with USE_ROCM
- Don't rely on CUDA_VERSION or HIP_VERSION; use USE_ROCM and ROCM_VERSION instead.

- In the next PR:
   - We will remove the mapping from CUDA_VERSION to HIP_VERSION and from CUDA to HIP in hipify.
   - HIP_PLATFORM_HCC is deprecated, so we will add HIP_PLATFORM_AMD to support HIP host code compilation on gcc.

cc jeffdaily sunway513 jithunnair-amd ROCmSupport amathews-amd

Reviewed By: jbschlosser

Differential Revision: D30909053

Pulled By: ezyang

fbshipit-source-id: 224a966ebf1aaec79beccbbd686fdf3d49267e06
2021-09-29 09:55:43 -07:00
Natalia Gimelshein
50c29fef3e remove utils.cpp (#65184)
Summary:
Dead code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65184

Reviewed By: mruberry

Differential Revision: D31031777

Pulled By: ngimel

fbshipit-source-id: 13633888229a7af8cfd8ea7e55ea2880b2e47273
2021-09-17 17:31:15 -07:00