Commit Graph

2496 Commits

Author SHA1 Message Date
Mu-Chu Lee
c3a18f6126 [AOTInductor] Add states for constant folding process (#151273)
Summary:
We add states in the constant folding process for AOTInductor.
Basically, there's 3 states, which is
(1) None: The state when no constants are loaded and uninitialized.
(2) Initialized: The state when constants are loaded, but not yet
folded.
(3) Folded: The state where the model is fully ready with folded
constants.

Note that even if constant folding is not enabled, we still only run
when state is FOLDED, this is okay because without constant folding, the
transition from INITIALIZED to FOLDED is just a pass-throught.

Test Plan:
python test/inductor/test_aot_inductor.py -k test_constant_folding_with_update

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D73002538](https://our.internmc.facebook.com/intern/diff/D73002538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151273
Approved by: https://github.com/jingsh, https://github.com/desertfire
2025-04-17 16:41:38 +00:00
Nariaki Tateiwa
23a3cef5d9 [c10d] Add _allgather_base , reduce_scatter , and _reduce_scatter_base into ProcessGroupMPI to enable FSDP with MPI backend (#150162)
This PR implements _allgather_base, reduce_scatter, and _reduce_scatter_base in the MPI backend (ProcessGroupMPI), enabling support for Fully Sharded Data Parallel (FSDP) in environments that use MPI for distributed communication.

### Context

As noted in https://github.com/pytorch/pytorch/issues/85628, FSDP currently supports only the NCCL backend. Due to this limitation, FSDP cannot run on legacy HPC environments or clusters that rely on MPI.

By implementing just these three collective operations, we can enable FSDP to work with the MPI backend. These collectives are implemented in a similar manner to existing operations such as allgather.

### Testing

We validated this PR using pytorch/build/bin/ProcessGroupMPITest with OpenMPI, and all tests passed successfully.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150162
Approved by: https://github.com/H-Huang
2025-04-14 19:31:38 +00:00
Shivam Raikundalia
ad5e9065ac [Profiler/Easy] Remove temp flag for on-demand Memory Snapshot (#151068)
Summary: Now that we have profiler impl in we don't need the temporary flag. submodule update too.

Test Plan: CI

Reviewed By: sanrise

Differential Revision: D72672186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151068
Approved by: https://github.com/davidberard98
2025-04-11 18:50:25 +00:00
fduwjj
f663aa4e81 [c10d][tcp_store] Fix connection reset caused by wrong socket close (#150987)
While fixing the memory leak in https://github.com/pytorch/pytorch/pull/145757, we accidentally close the socket for the case when nread == 0 and thought it is the case when connection is closed. This is not true. According to libuv doc: https://docs.libuv.org/en/v1.x/stream.html#c.uv_read_cb.

> nread might be 0, which does not indicate an error or EOF. This is equivalent to EAGAIN or EWOULDBLOCK under read(2).

We found this bug when debugging a broken pipe issue when users first call a set and then wait for all keys right afterwards on 128 ranks. This might also cause other broken pipe issues we have seen in the prod jobs recently.

Added a unit test to test this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150987
Approved by: https://github.com/d4l3k, https://github.com/XilunWu
2025-04-10 18:48:57 +00:00
Mu-Chu Lee
f3cf3ec591 [AOTInductor] Add User Managed buffer for AOTI constant buffer. (#150276)
Summary:
We add the functionality to allow users to directly pass in a at::Tensor
into AOTInductor, that would be used as the constant.
This user managed buffer skips the copying step in AOTInductor, and let
users to directly manage the memory usage themselve.

Test Plan:
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
/data/users/$USER/pytorch/build/bin/test_aoti_inference

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D72589514](https://our.internmc.facebook.com/intern/diff/D72589514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150276
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2025-04-10 00:15:44 +00:00
Shivam Raikundalia
99c9a31386 [submodule] [Snapshot/Profiler] Memory Snapshot On Demand (#150559)
Summary:
Profiler side of memory snapshot.

1. Add API to actually do snapshot when client interface is called
2. Add ifdefs to builds so that kineto hooks snapshot correctly.

Design Philosophy: There is one interesting part of this implementation and it is during export. For export we are callign the python impl of the export rather than CPP even though we are already in CPP. This is because it is better to simply have one path of export rather than 2. Personally, I want there to be parity between auto-trace and on-demand so it if we can limit the side paths then we will have an easier time maintaining this relationship

Test Plan: {F1976563426}

Reviewed By: sanrise

Differential Revision: D70733247

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150559
Approved by: https://github.com/sanrise
2025-04-07 13:04:38 +00:00
Mu-Chu Lee
063ea5d669 [AOTInductor] Modify test for Memory tracking for memory-related (#150269)
operations

Summary:
Fix the test for memory tracking. This PR does:
(1) Add tracking before and after for all memory-related operations.
Make sure the operation do indeed captures memory both in CUDA and
torch's CUDACachAllocator Make sure the operation do indeed captures
consumed memory both in CUDA and torch's CUDACachAllocator.
(2) Keep track of memory being reserved by CUDACacheAllocator in
torch and it's relationship with global CUDA memory consumption.

Test Plan:
This PR is adding tests.

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150269
Approved by: https://github.com/jingsh, https://github.com/chenyang78, https://github.com/desertfire
2025-04-02 04:18:18 +00:00
Ke Wen
35c45a4a31 [Reland] Launch kernel on current stream & remove record_stream entirely (#150398)
Relanding #148590 due to merge conflict.

This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels.

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Squashed contents:

* [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)
PTD current workflow:
- PTD creates its own dedicated `ncclStream` for comm operation
- it will first add a dependency on current-stream (typically the compute stream) to ensure tensors are ready before invoking collective
such stream synchronization become expensive in Inference world (cpu overhead: 70us vs GPU kernel time: 160us).
This diff:
- async=False [default], will use current-stream as nccl-stream and avoid the stream-sync overhead
- async=True, will retain existing logic: create new nccl-stream, let it wait on current-stream to ensure tensors are ready
- pass down async from c10d down to NCCL-PG
this helps shave off 50% CPU overhead **(70us -> 35us)**, which reduce total CPU/GPU from **230us to 195us by 15%**

* [PGNCCL] Make avoid-record-stream default

* [c10d] Add asyncOp argument to Ops

* Change python side wait

* Pass asyncOp at ProcessGroup level

* Watchdog unstashing tensors as a safety net

* Stash tensors for reduce_scatter_v and all_gather_v
Pull Request approved: https://github.com/pytorch/pytorch/pull/149753

* [c10d] Move unstashing from watchdog to main thread
Pull Request approved: https://github.com/pytorch/pytorch/pull/150079

* [PGNCCL][BE] Merge mutex into TensorShelf for encapsulation
Pull Request approved: https://github.com/pytorch/pytorch/pull/150130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150398
Approved by: https://github.com/atalman
2025-04-01 16:46:07 +00:00
Mu-Chu Lee
a2070e2fd5 [AOTInductor] Free tensors in test (#150274)
Summary:
This PR frees tensor that were new-ed within the test itself to prevent
memory leak.

Test Plan:
Fixing tests itself.

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150274
Approved by: https://github.com/chenyang78
2025-03-31 23:28:13 +00:00
Irshad CC
f3c77b2458 Set requires grad in TensorMaker::make_tensor() (#148255)
Fixes #146419

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148255
Approved by: https://github.com/soulitzer
2025-03-29 08:06:42 +00:00
Mu-Chu Lee
03313c6619 [AOTInductor] Add function for users to extract constants in container (#150163)
Summary: Add extract_constant_map that allows users to inspect the constants being used by AOTInductor

Test Plan:
`python test/inductor/test_aot_inductor.py -k extract_constants_map`

`LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib /data/users/$USER/pytorch/build/bin/test_aoti_inference`

Differential Revision: D72020400

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150163
Approved by: https://github.com/chenyang78
2025-03-29 03:36:12 +00:00
Mu-Chu Lee
e6afb51805 [AOTInductor] Free folded constants that's managed by AOTInductor (#149825)
internally.

Summary:
This diff allows freeing the usage of folded constants that's created by
AOTInductor through CUDACachingAllocator instead of the constant blob
from cudaMalloc directly.

Test Plan:
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
/home/$USER/local/pytorch/build/bin/test_aoti_inference

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149825
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/jingsh
2025-03-27 06:05:50 +00:00
Mu-Chu Lee
12628ba24d [AOTInductor] Bug fix for freeing buffers when freeing multiple times (#149810)
Summary:
We might free the active buffer if we free the buffer twice.

Test Plan:
```
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
/home/$USER/local/pytorch/build/bin/test_aoti_inference
```
Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149810
Approved by: https://github.com/chenyang78
2025-03-25 20:26:36 +00:00
Scott Wolchok
c73a526599 Extract reusable portions of elu_kernel into header (#149673)
Similar to #140425, we are making the implementation usable via header-only code sharing.

Review note: #62546 by @yanbing-j removed expm1 usage from this path. I don't know why and expm1 should be more efficient, so I've put it back. Please let me know if there is a good reason I shouldn't.

Testing: existing correctness tests should cover.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149673
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-03-21 23:54:26 +00:00
Shangdi Yu
46dd226702 Fakify torchbind objects in compile_fx and add tests for SigridTransformsInstanceTorchBind (#149529)
Summary:
We need to properly fakify torchbind objects, including the ones in graph module attributes, so the resgitered fake implementation works properly.

- _fakify_script_objects in `compile_fx`
- Allow fake torchbind objects in `torchbind_constants`

Remove `node.meta["unbacked_bindings"]` for `aot_compile` in `compile_fx`. Otherwise `ShapeProp` will fail when trying to resolve the `unbacked_bindings` of `with_effect` tokens.

Update `sigrid_transforms_test` to use the latest `torch._inductor.aot_compile` API.

Add a test for `Fakify torchbind objects in compile_fx and add tests for SigridTransformsInstanceTorchBind` in `e2e_test`.

Test Plan:
```
buck run //caffe2/torch/fb/sparsenn:sigrid_test -- -r test_transform_torch_bind

buck run //sigmoid/inference/test:e2e_test_cpu -- -r SigridTransforms

buck2 run mode/dev-nosan sigmoid/inference/ts_migration:pt2i_readiness_main -- --model_id 545017754 --test_suite ads_all --mode test_preproc

```

Differential Revision: D70013257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149529
Approved by: https://github.com/angelayi
2025-03-21 18:58:28 +00:00
Bin Bao
04e251a7dd [AOTI] Add num_runners to AOTIModelPackageLoader (#149364)
Summary: AOTIModelContainerRunner takes a num_runners argument for multi-threaded inference, but AOTIModelPackageLoader forgot to take the same parameter, although its run() API already expects to take an optional cudaStream_t parameter for multi-threaded inference.

Differential Revision: [D71357418](https://our.internmc.facebook.com/intern/diff/D71357418)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149364
Approved by: https://github.com/angelayi
2025-03-19 02:28:06 +00:00
Mu-Chu Lee
bb42e4d137 [AOTInductor] Add function to free buffer (#149161)
Summary:
We add a function that allows users to free the unused buffer.

Test Plan:
Testing correctness:
    python test/inductor/test_aot_inductor.py -k free_inactive

    Testing memory consumption:
    LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
    /home/$USER/local/pytorch/build/bin/test_aoti_inference

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149161
Approved by: https://github.com/chenyang78, https://github.com/desertfire
ghstack dependencies: #149249
2025-03-18 02:43:14 +00:00
PyTorch MergeBot
afa1eda901 Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)"
This reverts commit ef6296e7f2.

Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/izaitsevfb due to reverted internally, see D71292427 ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2731114626))
2025-03-17 22:43:15 +00:00
Joel Schlosser
5e1b715dda BC fix for AOTIModelPackageLoader() constructor defaults (#149082)
The default value for `run_single_threaded` was wrongly specified in the .cpp file instead of the header, breaking C++-side instantiation of `AOTIModelPackageLoader` with no arguments. This PR fixes this and adds a test for the use case of running with `AOTIModelPackageLoader` instead of `AOTIModelContainerRunner` on the C++ side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149082
Approved by: https://github.com/desertfire
2025-03-13 18:40:53 +00:00
Bin Bao
b9803a5c81 [AOTI] Re-enable AOTI cpp unit test (#149085)
Summary: test_inductor_aoti was removed by accident previously. Add it back.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149085
Approved by: https://github.com/jbschlosser
2025-03-13 16:00:38 +00:00
Ke Wen
ef6296e7f2 [PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels.

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

Differential Revision: [D70937982](https://our.internmc.facebook.com/intern/diff/D70937982)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
2025-03-11 18:36:12 +00:00
PyTorch MergeBot
a95eb0c0a7 Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)"
This reverts commit 2149f6c684.

Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/ZainRizvi due to Breaking internally, see D70873275. Discussed reverting this with Ke. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2712001270))
2025-03-10 22:38:40 +00:00
Ke Wen
2149f6c684 [PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels.

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
2025-03-09 07:32:23 +00:00
PyTorch MergeBot
9cb25f0ea2 Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)"
This reverts commit 17dbeb11db.

Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/janeyx99 due to PR break backward compat test ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2708641172))
2025-03-09 03:01:55 +00:00
Ke Wen
17dbeb11db [PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels.

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
2025-03-08 20:00:12 +00:00
rpsilva
4abff4b271 Introduce cache clearing APIs for the lazy graph executor (#144489)
This PR introduces two new methods to the LazyGraphExecutor class:

- ClearComputationCache(): Allows clearing the entire computation cache.
- RemoveFromComputationCache(hash): Enables removal of specific cache entries based on their hash.

The main objective is to expose cache management functionality for debugging cache hits and misses across different computations. For instance:
- Reset the cache state in tests, allowing reuse of the same computation client to evaluate cache logic consistently.
- Selectively remove cache entries to analyze the impact on subsequent computations.
- Improve observability into the cache behavior, aiding in the investigation of cache-related issues or optimizations.

On the XLA lazy graph executor, we want to run a series of tests that modify some parts of the HLO module proto of the computation, and we need a means to ensure that the hash is agnostic to some elements (OpMetadata in the XLA proto data). Hence, it would be easy to parameterize the test, clear the cache and validate that the resulting hash is the same between runs. Otherwise, we'd need to hardcode the resulting serialized hash.

Simultaneously, **another motivation**, is that users could also clear some computation hashes for an added flexibility in their applications, by introducing their own custom strategies for maintaining the cache (without relying on the default LRU).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144489
Approved by: https://github.com/wconstab
2025-01-29 17:38:01 +00:00
Shuqiang Zhang
c0861d092c [PGNCCL] Add an API to get the status/error code at the PG level (#144498)
Summary:
This PR is basically a replacement of
https://github.com/pytorch/pytorch/pull/140087, which caused some perf
drop due to frequent TCPStore check in watchdog thread. The fix is to move the
tcpstore check in monitoring thread

If unhealthy, the user should be able to get the type of errors, e.g.,
timeout,nccl error or remote error.

This API is applied to PG level, compared to the
work.get_future_result() API which is applied to Work Level.
Error detection at PG level is much more convenient for users to handle
the PG failure as a whole, e.g, restarting the PG.

Error handling at the work level is still useful for users to attach
work specific context and debug the RC of the specific failing
work/collective

Note it is critical for all ranks in the PG to be notified about an
error as soon as it occurs, so we introduce an errorType of
REMOTE_ERROR, which is 'broadcasted' from a src rank (which detects a
local error) to all other ranks in the PG, the broadcast is done through
TCPStore currently

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144498
Approved by: https://github.com/kwen2501
2025-01-24 16:47:32 +00:00
fduwjj
ae7df51232 [c10d] Fix CudaEventCache for dangling references (#144496)
Reported in https://github.com/pytorch/pytorch/issues/143470, we have a dangling references in `CudaEventCache`. So we want to fix it.
1. We add a unit test to repro the issue mentioned in the issue.
2. Instead of converting variables to shared pointers as suggested in the issue, we then make the cache itself a shared pointer. So if the thread creates the cache dies before all events get recycled, the cache is still there until the last CudaEvent get deleted. (thanks for the suggestion from @kwen2501 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144496
Approved by: https://github.com/kwen2501
2025-01-15 05:11:48 +00:00
PyTorch MergeBot
b80ecc4457 Revert "Fix poision child process issue when call getAccelerator() (#144368)"
This reverts commit 2583d831d4.

Reverted https://github.com/pytorch/pytorch/pull/144368 on behalf of https://github.com/clee2000 due to broke internal tests D68023262, probably the same problem as noted in the issue this PR is mentioned above ([comment](https://github.com/pytorch/pytorch/pull/144368#issuecomment-2584848568))
2025-01-10 23:36:43 +00:00
Yu, Guangye
2583d831d4 Fix poision child process issue when call getAccelerator() (#144368)
# Motivation
fix https://github.com/pytorch/pytorch/issues/144152

# Solution

- Align `at::globalContext()::hasXXX` to determine if accelerator XXX is built with PyTorch or an extension already registered to PyTorch.
- Define `at::hasXXX` to determine if accelerator XXX is available at runtime.
- Use `at::globalContext()::hasXXX` in `getAccelerator` rather than `at::hasXXX` to avoid initializing the XXX runtime (which can poison child processes) while detecting the current accelerator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144368
Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/gujinghui
2025-01-10 09:28:27 +00:00
Aaron Gokaslan
bbec35f028 [BE]: Replace clone detach with detach clone to be more efficient (#144469)
Follow up to #144270 and fix some vulkan code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144469
Approved by: https://github.com/awgu
2025-01-09 18:28:39 +00:00
Jithun Nair
1365ae859c [ROCm][CI] upgrade CI to ROCm 6.3 (#142152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142152
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-01-09 17:14:16 +00:00
Will Feng
bf7009d839 [rpc] Fix unit test after c10::nullopt removal (#143690)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143690
Approved by: https://github.com/yifuwang, https://github.com/c-p-i-o, https://github.com/XilunWu
2024-12-20 23:36:07 +00:00
Richard Barnes
f9da639950 [codemod] Fix a few unused-variable issues in pytorch (#143517)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, it's associated code or (b) qualifies the variable with `[[maybe_unused]]`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143517
Approved by: https://github.com/mhorowitz
2024-12-19 00:18:08 +00:00
James
d4ed5941db Fix floating point literals in IRPrinter (#142119)
Fixes #114035
This is a recreation of #140002 with approval from its author. Original description:
>when v larger than 1e16, the format will be error. example: v is 1.2e17, the output is 1.2e17.f, it have two point '.'

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142119
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-12-18 21:59:48 +00:00
cyy
075905b7bd [14/N] Fix extra warnings brought by clang-tidy-17 (#141644)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141644
Approved by: https://github.com/ezyang

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2024-12-13 06:22:13 +00:00
Richard Barnes
82ce888273 c10::string_view -> std::string_view in more places (#142517)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142517
Approved by: https://github.com/malfet
2024-12-12 19:45:59 +00:00
PyTorch MergeBot
2f0fe82f6d Revert "[14/N] Fix extra warnings brought by clang-tidy-17 (#141644)"
This reverts commit 24a5a2ef25.

Reverted https://github.com/pytorch/pytorch/pull/141644 on behalf of https://github.com/clee2000 due to failing internally D67112938 ([comment](https://github.com/pytorch/pytorch/pull/141644#issuecomment-2539602023))
2024-12-12 17:43:36 +00:00
Richard Barnes
7667235a23 c10::optional -> std::optional (#142514)
Fixes issues introduced in https://github.com/pytorch/pytorch/pull/141348 and https://github.com/pytorch/pytorch/pull/139578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142514
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-12 17:23:46 +00:00
cyy
24a5a2ef25 [14/N] Fix extra warnings brought by clang-tidy-17 (#141644)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141644
Approved by: https://github.com/ezyang
2024-12-11 18:40:42 +00:00
Richard Barnes
7e41717a26 c10::string_view -> std::string_view in caffe2/jit (#142383)
Test Plan: Sandcastle

Differential Revision: D66939979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142383
Approved by: https://github.com/malfet
2024-12-10 15:42:28 +00:00
Mu-Chu Lee
d3d1a78774 [AOTInductor] Add standalone test for compilation from ExportedProgram (#142327)
Summary: Provide a standalone path to compile and run a ExportedProgram in C.

Test Plan:
(1) Generate a compiled model from ExportedProgram
```
python generate_lowered_cpu.py --input-path /tmp/$USER/ep.pt --output-path /tmp/$USER/final.pt
```
(2) Compile a standalone test runner
```
TORCH_ROOT_DIR=/data/users/$USER/pytorch sh standalone_compile.sh standalone_test.cpp standalone_test.out
```
(3) Run test for the compiled model in step (1)
```
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib ./standalone_test.out /tmp/$USER/final.pt
```

Differential Revision: D66872380

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142327
Approved by: https://github.com/hl475
2024-12-10 06:50:09 +00:00
Bin Bao
1cb2ebd740 [AOTI] Fix #140546 and support AOTI package load for Intel GPU. (#140664)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* #140686
* __->__ #140664
* #140269
* #140268
* #135320
* #135318
* #139026

Fix #140546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140664
Approved by: https://github.com/desertfire, https://github.com/EikanWang
ghstack dependencies: #140268, #140269

Co-authored-by: Bin Bao <binbao@meta.com>
2024-12-10 05:05:08 +00:00
PyTorch MergeBot
6fcb294e18 Revert "[AOTI] Fix #140546 and support AOTI package load for Intel GPU. (#140664)"
This reverts commit 91d30546a4.

Reverted https://github.com/pytorch/pytorch/pull/140664 on behalf of https://github.com/clee2000 due to breaks forward compatibility?  D66937097 ([comment](https://github.com/pytorch/pytorch/pull/140269#issuecomment-2528828555))
2024-12-09 17:33:28 +00:00
xinan.lin
91d30546a4 [AOTI] Fix #140546 and support AOTI package load for Intel GPU. (#140664)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* #140686
* __->__ #140664
* #140269
* #140268
* #135320
* #135318
* #139026

Fix #140546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140664
Approved by: https://github.com/desertfire, https://github.com/EikanWang
ghstack dependencies: #140268, #140269
2024-12-07 19:22:04 +00:00
rzou
215f5d77b5 [functional autograd] Refactor validate_outputs into a functional variant (#141348)
Today, validate_outputs is stateful (it depends on the autograd graph).
This PR refactors it into a stateless form that just depends on
InputMetadata.

Test Plan:
- new unittest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141348
Approved by: https://github.com/soulitzer
ghstack dependencies: #141278
2024-12-04 18:06:31 +00:00
Nikita Shulga
38bbe37187 Enable CI on SM89 (#140305)
Using EC2 G6 instance, based on NVIDIA L4, added to scale config in https://github.com/pytorch/test-infra/pull/5376

To enable more balanced sharding, had to push 148ae19935

Added `@xfailIfSM89` to the following tests:
 - test_fp8_pattern_2
 - test_original_aten_preserved_split_addmm
 - test_sparse_semi_structured_scaled_mm
 - test_sparse_semi_structured_scaled_mm_fp8
 - test_sparse_fp8fp8_mm

Increased tolerance to 2e-4 for `RNNTest.BidirectionalMultilayerGRU_CPU_vs_CUDA`

Skipped following inductor tests (that either flaky OOMs or timeouts):
 - test_reduction_fn_std_float64
 - test_reduction_fn_var_mean_float64
 - test_multi_output_unbacked_custom_op

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140305
Approved by: https://github.com/wdvr, https://github.com/ZainRizvi
2024-12-03 04:49:46 +00:00
Ke Wen
ad39a2fc46 [1/N] Decouple Flight Recorder from NCCL utils (#141648)
Part of the effort to make Flight Recorder device agnostic.

Step 1: Move it out of NCCLUtils.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141648
Approved by: https://github.com/fduwjj
2024-11-27 18:29:42 +00:00
fduwjj
5b4c864672 [c10d] Enable CudaEventCache by default and add multi device support (#140975)
We added `CudaEventCache` in https://github.com/pytorch/pytorch/pull/133727 and this is a feature which tries to reuse CudaEvent so that we don't call destroy of CudaEvent which causes hang in the past. We had a bunch of tests and testing on TorchTitan and internal workload already. So far no errors or crash are found at the moment so we decide to roll out to all OSS users. For internal workload, this PR would not affect it because of some internal gating.

Also we observed some multi-device use cases in OSS, so that we want to bring back multi-device support originally proposed in https://github.com/pytorch/pytorch/pull/122732/files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140975
Approved by: https://github.com/eqy, https://github.com/kwen2501
2024-11-26 18:42:45 +00:00
cyy
263d8f7a94 [8/N] Don't skip ASAN on some tests (#140081)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140081
Approved by: https://github.com/ezyang
2024-11-09 01:00:13 +00:00
Bin Bao
1868fc63d8 [AOTI] Update C++ runner API to take a const vector (#139955)
Summary: Tighten the AOTIModelContainerRunner::run interface to take a const vector of at::Tensor, which 1) makes it clear that the runner will not modify the input tensor vector; 2) runner will be able to take a temp vector of tensors as the input.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139955
Approved by: https://github.com/chenyang78
2024-11-08 16:59:10 +00:00
Ke Wen
e474f0de82 [PGNCCL] Slimming watchdog loop (#139834)
- Refactored traceback code into `work.printTraceback()`.  cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @shuqiangzhang
- Refactored desync debug code into `class DesyncDebugger`.
- Moved occurrences of `futureWorkResult_->markCompleted` into `checkAndSetException` and `checkTimeout`, respectively. cc @shuqiangzhang
- Modularized dump signal broadcast code into `ProcessGroupNCCL::broadcastDumpSignal`. cc @fduwjj @c-p-i-o

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139834
Approved by: https://github.com/shuqiangzhang
2024-11-07 17:22:44 +00:00
PyTorch MergeBot
7e02386303 Revert "[2/N] Replace c10::sv with std::sv (#139456)"
This reverts commit 028c5d3426.

Reverted https://github.com/pytorch/pytorch/pull/139456 on behalf of https://github.com/ZainRizvi due to Sorry but this breaks internally. @ezyang can you please help get this landed? See D65546398 for more details ([comment](https://github.com/pytorch/pytorch/pull/139456#issuecomment-2462768891))
2024-11-07 17:00:59 +00:00
cyy
028c5d3426 [2/N] Replace c10::sv with std::sv (#139456)
Follows  #139453

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139456
Approved by: https://github.com/ezyang
2024-11-06 01:50:38 +00:00
cyyever
46d0b635b9 [CMake] Remove pthread linking (#134436)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134436
Approved by: https://github.com/r-barnes
2024-10-29 23:14:40 +00:00
Richard Barnes
068f7e7a78 torch::optional -> std::optional (#138987)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138987
Approved by: https://github.com/Skylion007
2024-10-28 19:09:46 +00:00
Richard Barnes
42994234a6 std::value/std::type -> std::_v/std::_t (#138746)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138746
Approved by: https://github.com/cyyever, https://github.com/malfet
2024-10-26 20:59:24 +00:00
Mwiza Kunda
22d2e2d9a0 Set RUNPATH so installed tests can find the required shared libraries (#136627)
This change fixes the RUNPATH of installed c++ tests so that the linker can find the shared libraries they depend on.

For example, currently:
```bash
venv/lib/python3.10/site-packages/torch $ ./bin/test_lazy
./bin/test_lazy: error while loading shared libraries: libtorch.so: cannot open shared object file: No such file or directory
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136627
Approved by: https://github.com/malfet
2024-10-25 09:38:08 +00:00
Angela Yi
51f6b946ae [torchbind] Add generic __deepcopy__ method (#137613)
Summary: Added a generic `__deepcopy__` method which will use the torchbind object's existing `__getattr__` and `__setattr__` to copy the torchbind object. This will later be used in [D64124825](https://www.internalfb.com/diff/D64124825)

Differential Revision: D64124826

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137613
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2024-10-24 22:14:55 +00:00
FFFrog
af0bc75460 Remove deprecated alias macro(1/3) (#137556)
**Detailed Descriptions:**
- Remove AT_ERROR Macro

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137556
Approved by: https://github.com/ezyang
2024-10-21 17:32:32 +00:00
Richard Barnes
fddabc6e0b C10_UNUSED to [[maybe_unused]] (#6357) (#138364)
Summary: Pull Request resolved: https://github.com/pytorch/executorch/pull/6357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138364
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-10-19 13:17:43 +00:00
Edward Yang
b14269dcfb Make Context to be Device-agnostic Step by Step (1/N) (#136519) (#138155)
Summary:
- make init to be device-agnostic and move it to AcceleratorHooksInterface
- refactoring context related to device initialization

Original pull request: https://github.com/pytorch/pytorch/pull/136519

Test Plan: contbuild & OSS CI, see 4a8e49389c

Reviewed By: malfet

Differential Revision: D64471142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138155
Approved by: https://github.com/malfet, https://github.com/bobrenjc93
2024-10-17 20:58:56 +00:00
Shivam Raikundalia
dfb5ac05cc [Record Function] Add Kwargs only USER_SCOPE Macro (#138020)
Summary: Add a macro such that users can easily add a USER annotation with kwargs only

Test Plan: Will use D63801503 to test this E2E. Added unit test as well that makes sure that the kwargs get recorded correctly

Differential Revision: D64420328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138020
Approved by: https://github.com/davidberard98, https://github.com/aaronenyeshi
2024-10-17 18:48:49 +00:00
fduwjj
7e704c2073 [c10d] Add unit test for CUDAEventCache to ensure caching is working (#138059)
We created a simple test to validate the cache is indeed working and when the cache is indeed used up. I revert the fix in (https://github.com/pytorch/pytorch/pull/138040) and the test indeed failed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138059
Approved by: https://github.com/kwen2501
ghstack dependencies: #138040, #138048
2024-10-16 17:34:57 +00:00
Shuqiang Zhang
f4158558aa [c10d] disable watchdog thread in blockingWait mode (#138001)
Summary:
Blocking wait mode is not widely used, probably useful in debugging.
in blockingWait mode, we don't need to enable the watchdog thread to
check the timeout or nccl error because the main thread would throw an
exception if error happens and it is obvious to user which work fails
and its user's responsibility to handle the exception.
Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138001
Approved by: https://github.com/fduwjj, https://github.com/c-p-i-o
ghstack dependencies: #137799
2024-10-16 07:42:22 +00:00
PyTorch MergeBot
d4d687ffb2 Revert "Make Context to be Device-agnostic Step by Step (1/N) (#136519)"
This reverts commit 4a8e49389c.

Reverted https://github.com/pytorch/pytorch/pull/136519 on behalf of https://github.com/clee2000 due to breaking internal tests related to MITA, @ezyang has a forward fix? ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2414588302))
2024-10-15 17:19:16 +00:00
Richard Barnes
b7f798caa4 Use C10_UNUSED instead of (void)X (#137239)
Summary:
Auto-generated with
```
buck run //scripts/rbarnes/regex_multiline_replacer:regex_multiline_replacer -- --find '^(\s*for\s*\()(const.*\n)\s*\(void\)[A-Za-z]+;\s*//\s*Suppress.*\s*\n(.*)'  --replace '\1C10_UNUSED \2\3' `find caffe2/ -regex ".*\.\(cpp\|h\)"`
```

Differential Revision: D33432600

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137239
Approved by: https://github.com/Skylion007
2024-10-15 14:32:59 +00:00
Catherine Lee
8ac06467d4 Forward fix test (#137910)
Summary: Add back in a deleted file to fix test

It was removed in https://github.com/pytorch/pytorch/pull/137404

Test Plan: `buck2 build --flagfile fbcode//mode/opt fbcode//caffe2/test/cpp/c10d:ProcessGroupGlooAsyncTest` succeeded

Differential Revision: D64341074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137910
Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/kit1980
2024-10-14 22:07:29 +00:00
FFFrog
4a8e49389c Make Context to be Device-agnostic Step by Step (1/N) (#136519)
----

- make init to be device-agnostic and move it to AcceleratorHooksInterface
- refactoring context related to device initialization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519
Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/guangyey
2024-10-13 12:38:02 +00:00
Nichols A. Romero
bd63ec4f45 [ROCm] LoadHIP CMake cleanup (#137112)
Should help mitigate issues reported here: https://github.com/pytorch/pytorch/issues/128313

While working on https://github.com/pytorch/pytorch/pull/136700, we realized that some of the ROCm CMake can be streamlined.

This PR does not fix any bugs or provide any new functionality. Strictly clean-up.

The remaining `${ROCM_ROCTX_LIB}` will be removed when we transition to the rocprofiler-sdk (to be done in a separate PR).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137112
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2024-10-13 00:06:41 +00:00
PyTorch MergeBot
079f909263 Revert "Make Context to be Device-agnostic Step by Step (1/N) (#136519)"
This reverts commit be0b75256a.

Reverted https://github.com/pytorch/pytorch/pull/136519 on behalf of https://github.com/jovianjaison due to this pr is causing errors internally ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2405781093))
2024-10-10 18:32:17 +00:00
cyy
94e12f97dc [Distributed] [16/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#137404)
Follows #137072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137404
Approved by: https://github.com/Skylion007
2024-10-10 18:05:34 +00:00
sanshang
249152475d fix sequence number for group (#134578)
Summary:
Fix sequence number in execution trace dump for matching between collective/p2p op and wait in execution trace replay.

`ProcessGroupNCCL` has 2 sequence number counter, `seqCollective_` and `seqP2P_`.
b18ba9419e/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp (L1188-L1191)
However, `WorkNCCL` only has one sequence number member `seq_`. b18ba9419e/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp (L387)
We need to match collective and p2p with wait separately.
29b5a462dc

Depend on: https://github.com/pytorch/pytorch/pull/135132

Test Plan: buck2 run mode/dev-nosan kineto/libkineto/fb/integration_tests:pytorch_execution_trace_integration_test

Differential Revision:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134578
Approved by: https://github.com/kwen2501, https://github.com/c-p-i-o
2024-10-10 04:24:06 +00:00
Shuqiang Zhang
47a515d260 [c10d] simplify barrier implementation and further decouple CPU/GPU (#137516)
synchronization
Summary:
Barrier is  essentially intended to block CPU thread (instead of GPU
streams). Before we used 2 stream synchronizations (1. current stream
blocked by nccl stream end event, 2. CPU thread blocked on current
stream). This is unnecessary as we already have CPU thread blocking
logic in wait(). Also, adding barrier specific code block in the general
GPU synchronize() API is intrusive and confusing.

This PR cleans this.

Test Plan:
CI

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137516
Approved by: https://github.com/fduwjj, https://github.com/kwen2501
2024-10-09 23:55:28 +00:00
FFFrog
be0b75256a Make Context to be Device-agnostic Step by Step (1/N) (#136519)
- make init to be device-agnostic and move it to AcceleratorHooksInterface
- refactoring context related to device initialization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519
Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/guangyey
2024-10-09 02:13:36 +00:00
cyy
6327a71880 [Environment Variable][2/N] Use thread-safe setenv wrapper (#124485)
This follows #119449 to make setenv thread-safe.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124485
Approved by: https://github.com/eqy
2024-10-04 07:30:51 +00:00
abhishek-fujitsu
63d6908da0 fix build error with gcc 12+ (#137092)
Fixes #127920

This commit addresses a build failure occurring with GCC 12 and above due to the -Werror=nonnull flag. The error manifests in the test_api target.

**Issue:**
When building with GCC 12+, the following error occurs:
```
error: argument 1 null where non-null expected [-Werror=nonnull]
  431 |             __builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
      |             ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

This change ensures that:
1. The flag is only added for GCC 12 or higher
2. The flag is only added if it's supported by the compiler
3. The flag is added specifically to the test_api target, not globally

By disabling this specific error, we allow the build to proceed while maintaining other compiler warnings.

**Test Plan:**
- Verified successful build with GCC 12 and above
- Ensured no regression in builds with earlier GCC versions and other compilers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137092
Approved by: https://github.com/malfet
2024-10-02 00:37:15 +00:00
fduwjj
911a43f930 [TCPStore] Remove deprecated constructor (#136004)
While looking at TCPStore code again and found it confusing that we still keep the deprecated constructor for TCPStore in cpp while we don't expose it in python via pybind already. I checked both internal and external, all use cases in cpp (aside from unit test fixed in this PR) already moved to using option. So let's remove this legacy constructor to avoid confusion.

Differential Revision: [D62653634](https://our.internmc.facebook.com/intern/diff/D62653634)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136004
Approved by: https://github.com/Skylion007, https://github.com/XilunWu
2024-09-14 04:25:47 +00:00
PyTorch MergeBot
564d00f364 Revert "Fix clang-tidy warnings in Caffe2 code (#134935)"
This reverts commit 7cfd23636c.

Reverted https://github.com/pytorch/pytorch/pull/134935 on behalf of https://github.com/izaitsevfb due to breaks internal builds, caffe2 is still used internally ([comment](https://github.com/pytorch/pytorch/pull/134935#issuecomment-2349368152))
2024-09-13 16:42:37 +00:00
cyy
7cfd23636c Fix clang-tidy warnings in Caffe2 code (#134935)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134935
Approved by: https://github.com/ezyang
2024-09-12 03:27:09 +00:00
PyTorch MergeBot
c044deb9ce Revert "c10d/logging: add C10D_LOCK_GUARD (#134131)"
This reverts commit f33bcbe5fd.

Reverted https://github.com/pytorch/pytorch/pull/134131 on behalf of https://github.com/kit1980 due to See D61985186 ([comment](https://github.com/pytorch/pytorch/pull/134131#issuecomment-2327556381))
2024-09-03 22:35:14 +00:00
Bin Bao
310eb6d8c6 [AOTI] Fix test_aoti_inference CPU build issue (#134675)
Summary: Fixes https://github.com/pytorch/pytorch/issues/130311. We need to guard CUDA-only code in test_aoti_inference with macros so that it won't fail for CPU-only platform.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134675
Approved by: https://github.com/atalman, https://github.com/chunyuan-w
2024-08-28 17:42:19 +00:00
Tristan Rice
f33bcbe5fd c10d/logging: add C10D_LOCK_GUARD (#134131)
This adds logs if we can't acquire locks in NCCLUtils and ProcessGroupNCCL for 30s.

This is motivated by some deadlocks were seeing and it's unclear if it's in NCCL or on the PyTorch side of things.

This required replacing most `std::mutex` with `std::timed_mutex` and `std::condition_variable_any` as appropriate.

Test plan:

existing CI for regressions

will add unit tests on `C10D_LOCK_GUARD`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134131
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
2024-08-28 01:40:42 +00:00
PyTorch MergeBot
1c4780e69a Revert "c10d/logging: add C10D_LOCK_GUARD (#134131)"
This reverts commit 4c28a0eb0b.

Reverted https://github.com/pytorch/pytorch/pull/134131 on behalf of https://github.com/ZainRizvi due to Sorry but this causes formatting errors internally which make it fail to build. See D61759282 ([comment](https://github.com/pytorch/pytorch/pull/134131#issuecomment-2310455878))
2024-08-26 15:19:27 +00:00
Sheng Fu
519342962d Pass process group info into NcclWork (#134269)
Summary: Pass process group info into NcclWork

Test Plan: buck2 run mode/dev-nosan kineto/libkineto/fb/integration_tests:pytorch_execution_trace_integration_test

Differential Revision: D61677160

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134269
Approved by: https://github.com/wconstab
2024-08-24 01:04:43 +00:00
Tristan Rice
4c28a0eb0b c10d/logging: add C10D_LOCK_GUARD (#134131)
This adds logs if we can't acquire locks in NCCLUtils and ProcessGroupNCCL for 30s.

This is motivated by some deadlocks were seeing and it's unclear if it's in NCCL or on the PyTorch side of things.

This required replacing most `std::mutex` with `std::timed_mutex` and `std::condition_variable_any` as appropriate.

Test plan:

existing CI for regressions

will add unit tests on `C10D_LOCK_GUARD`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134131
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
2024-08-24 00:27:39 +00:00
Xuehai Pan
758a0a88a2 [BE][Easy] enable ruff rule PIE790: unnecessary pass statement (#133200)
This PR removes unnecessary `pass` statement. This is semanticly safe because the bytecode for the Python code does not change.

Note that if there is a docstring in the function, a empty function does not need a `pass` statement as placeholder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133200
Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/kit1980
2024-08-15 15:50:19 +00:00
cyy
c2eeda5da0 [structural binding][12/N] Replace std::tie with structural binding (#131031)
Follows #130830
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131031
Approved by: https://github.com/ezyang
2024-08-14 00:51:34 +00:00
Chirag Pandya
40767e8468 [BE] rename testHelperPrefix test (#132916)
Summary:
Re-enable testHelperPrefix test that was erroneously disabled in CI.
Fixes #50701

Test Plan:
Test passes locally:
```
❯ ./TCPStoreTest --gtest_filter=TCPStoreTest.testHelperPrefix
Running main() from
/data/users/cpio/pytorch/third_party/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = TCPStoreTest.testHelperPrefix
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TCPStoreTest
[ RUN      ] TCPStoreTest.testHelperPrefix
[W807 12:01:31.531576727 socket.cpp:462] [c10d] waitForInput: poll for
socket SocketImpl(fd=6, addr=[localhost]:37984,
remote=[localhost]:37171) returned 0, likely a timeout
[W807 12:01:31.531663710 socket.cpp:487] [c10d] waitForInput: socket
SocketImpl(fd=6, addr=[localhost]:37984, remote=[localhost]:37171) timed
out after 100ms
[       OK ] TCPStoreTest.testHelperPrefix (314 ms)
[----------] 1 test from TCPStoreTest (314 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (314 ms total)
[  PASSED  ] 1 test.
╭─ ~/local/pytorch/build/bin  main *1 +1 ···················· ✔
/home/cpio/local/a/pytorch-env   cpio@devgpu011 ─╮
╰─
```
Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132916
Approved by: https://github.com/Skylion007
2024-08-08 20:54:52 +00:00
Prachi Gupta
c326533999 [ROCm][Inductor] Enable AOT Inductor CPP UTs for ROCm (#131521)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131521
Approved by: https://github.com/jataylo, https://github.com/pruthvistony, https://github.com/malfet
2024-08-08 19:49:56 +00:00
Shangdi Yu
21906ddaba [AOTI] Fix complex64 not defined (#132810)
Partially fixes #122980

- change cpp type mapping for complex64 to std::complex<float>
- add `aoti_torch_item_complex64` and `aoti_torch_scalar_to_tensor_complex64`.
- add `expensiveCopyToTensor()` to convert `ArrayRefTensor<T>` type to `AtenTensorHandle` type.

- if we want to fully fix #122980, we still need to let ArrayRef and MiniArrayRef to consider underlying storage number of elements. See more details in https://github.com/pytorch/pytorch/pull/132347 (#132347  broke some internal tests, so we need more work before landing it).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132810
Approved by: https://github.com/desertfire
2024-08-08 18:08:23 +00:00
Matthew Hoffman
258f47fc0b Add padding_side to pad_sequence with "left" and "right" options ("right" as default) (#131884)
Fixes #10536

Reattempt of #61467. Thank you so much to @mskoh52 for your excellent work!

As I was trying to create a more efficient LLM data collator, I realized that `pad_sequence` only supports right padding, even though left padding is a very common format for LLMs, like Llama and Mistral.

The proposed alternative implementation was to use multiple flips, which tends to be 1.5x-2x slower. Instead we can add a [`padding_side` parameter as there is for for Hugging Face tokenizers](9d6c0641c4/src/transformers/tokenization_utils_base.py (L1565)), which requires only a very small change in the C++ code.

Here are the benchmarks of the new implementation!

`float32`:

![eaaa95ef-9384-45d2-be56-6898bc1d3514](https://github.com/user-attachments/assets/3b0eb309-e5a0-4a4d-97bb-4e3298783dbb)

`bool`:

![892f32da-8d9a-492b-9507-18d3f0a41e8e](https://github.com/user-attachments/assets/6824ea15-7d4e-4b89-95f0-8546635f0c2e)

Code:

```python
from __future__ import annotations

import random
import time
from typing import Literal

import numpy as np
import torch

def pad_sequence_with_flips(
    sequences: list[torch.Tensor],
    batch_first: bool = False,
    padding_value: int | float | bool = 0.0,
    padding_side: Literal["left", "right"] | str = "left",
) -> torch.Tensor:
    if padding_side == 'right':
        padded_sequence = torch._C._nn.pad_sequence([t.flatten() for t in sequences], batch_first=batch_first, padding_value=padding_value)
    elif padding_side=='left':
        padded_sequence = torch._C._nn.pad_sequence([t.flatten().flip(0) for t in sequences], batch_first=batch_first, padding_value=padding_value)  # pyright: ignore[reportArgumentType]
        padded_sequence = padded_sequence.flip(int(batch_first))
    else:
        raise ValueError(f"padding_side should be either 'right' or 'left', but got {padding_side}")

    return padded_sequence

sequence_lengths: list[int] = []

flip_left_pad_times: list[float] = []
flip_left_pad_times_std: list[float] = []

left_pad_times: list[float] = []
left_pad_times_std: list[float] = []

RUNS_PER_LOOP: int = 100

for i in range(1, 7):
    sequence_length = i * int(1e6) // 6
    sequence_lengths.append(sequence_length)

    sequences = [torch.randint(0, 2, (random.randint(1, sequence_length),), dtype=torch.bool) for _ in range(64)]

    inner_left_pad_times: list[float] = []
    inner_right_pad_times: list[float] = []

    inner_flip_left_pad_times: list[float] = []
    inner_flip_right_pad_times: list[float] = []

    for _ in range(RUNS_PER_LOOP):

        start = time.perf_counter()
        torch._C._nn.pad_sequence(sequences, batch_first=True, padding_value=False, padding_side="left")
        end = time.perf_counter()
        inner_left_pad_times.append(end - start)

        start = time.perf_counter()
        pad_sequence_with_flips(sequences, batch_first=True, padding_value=False, padding_side="left")
        end = time.perf_counter()
        inner_flip_left_pad_times.append(end - start)

    left_pad_times.append(sum(inner_left_pad_times) / len(inner_left_pad_times))
    left_pad_times_std.append(np.std(inner_left_pad_times))

    flip_left_pad_times.append(sum(inner_flip_left_pad_times) / len(inner_flip_left_pad_times))
    flip_left_pad_times_std.append(np.std(inner_flip_left_pad_times))

    print(f"Sequence Length: {sequence_length}, Left Pad Time: {left_pad_times[-1]}, Left with Flips Pad Time: {flip_left_pad_times[-1]}")

import matplotlib.pyplot as plt

plt.plot(sequence_lengths, left_pad_times, label="new pad_sequence left")
plt.scatter(sequence_lengths, left_pad_times)
plt.errorbar(sequence_lengths, left_pad_times, yerr=left_pad_times_std, linestyle='None', marker='^')

plt.plot(sequence_lengths, flip_left_pad_times, label="old pad_sequence left (2 flips)")
plt.scatter(sequence_lengths, flip_left_pad_times)
plt.errorbar(sequence_lengths, flip_left_pad_times, yerr=flip_left_pad_times_std, linestyle='None', marker='^')

plt.xlabel("Sequence Length")
plt.ylabel("Time (s)")
plt.legend(loc="upper right")

# Sequence Length: 166666, Left Pad Time: 0.06147645162009212, Left with Flips Pad Time: 0.09842291727001794
# Sequence Length: 333333, Left Pad Time: 0.08933195920990329, Left with Flips Pad Time: 0.15597836187991562
# Sequence Length: 500000, Left Pad Time: 0.08863158334006585, Left with Flips Pad Time: 0.15224887342999863
# Sequence Length: 666666, Left Pad Time: 0.10524682551997103, Left with Flips Pad Time: 0.18177212480995877
# Sequence Length: 833333, Left Pad Time: 0.11801802741003485, Left with Flips Pad Time: 0.20821274195001024
# Sequence Length: 1000000, Left Pad Time: 0.131894061660023, Left with Flips Pad Time: 0.23223503091008751
```

Co-authored-by: mskoh52 <mskoh52@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131884
Approved by: https://github.com/ezyang
2024-08-07 15:53:07 +00:00
Ivan Zaitsev
841cadd555 Fix discrepancies from 129973 (#132545)
#129973 ([D59132793](https://www.internalfb.com/diff/D59132793)) was exported missing changes in `test/cpp/jit/CMakeLists.txt` this PR remediates that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132545
Approved by: https://github.com/kit1980
2024-08-03 01:57:49 +00:00
Oguz Ulgen
221350e3a4 Add None return type to init -- tests (#132352)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132352
Approved by: https://github.com/ezyang
ghstack dependencies: #132335, #132351
2024-08-01 15:44:51 +00:00
Xuehai Pan
548c460bf1 [BE][Easy][7/19] enforce style for empty lines in import segments in test/[a-c]*/ and test/[q-z]*/ (#129758)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129758
Approved by: https://github.com/ezyang
2024-07-31 10:54:03 +00:00
cyy
89da94594e [11/N] Fix clang-tidy warnings in jit (#132131)
Follows #132122

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132131
Approved by: https://github.com/Skylion007
2024-07-31 03:45:52 +00:00
cyy
73d0f484b3 [structural binding][11/N] Replace std::tie with structural binding (#130830)
Follows  #130784

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130830
Approved by: https://github.com/janeyx99
2024-07-18 00:45:06 +00:00
cyy
168e41009b [structural binding][10/N] Replace std::tie with structural binding (#130784)
Follows  #130404

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130784
Approved by: https://github.com/malfet
2024-07-16 10:28:14 +00:00
cyy
28f6ae2718 [9/N] Replace c10::optional with std::optional (#130674)
Follows  #130509

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130674
Approved by: https://github.com/Skylion007
2024-07-15 00:48:43 +00:00
Yidi Wu
0bf9a091ec [torchbind] add tracing_mode support (#129586)
Sometimes, it could be difficult to write a fake class e.g. when the original implementation is using some third-party libraries or users are certain that the class is safe to trace with the real object.

This PR allows user to specify their intention by implementing a "safe_to_trace_with_real_obj" method on their script class.

Test Plan:
`pytest test/export/test_torchbind.py -k safe`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129586
Approved by: https://github.com/zou3519
2024-07-12 18:01:47 +00:00
cyy
fb5888c719 Remove unused type traits in torch/csrc/utils (#128799)
Follows  #127852

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128799
Approved by: https://github.com/ezyang
2024-06-27 23:51:18 +00:00
Shivam Raikundalia
1d0efedc85 [Profiler] Add TSC Clock Callback to CUPTI (#125036)
Summary:
Right now we use the default clock for CUPTI which is not monotonic nor particularly fast. We have already added the Kineto side of the implementation here: https://www.internalfb.com/diff/D56525885

This diff only adds the compile flags such that the TSC format is used and sets the converter using a libkineto call in the profiler

Test Plan:
Obtained following trace using resnet test:
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Apr_25_11_03_18.3862943.pt.trace.json.gz&bucket=gpu_traces

TBD: Add benchmarks

Differential Revision: D56584521

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125036
Approved by: https://github.com/aaronenyeshi
2024-06-27 21:07:43 +00:00
Yidi Wu
b9697eacd3 [torchbind] support tensor ops inside of __obj_flatten__ (#129605)
As titled. Previously, __obj_flatten__ can run in a fake tensor mode, e.g. in process_input of aot_autograd, which is surrounded by a fake tensor mode. This causes the tensor ops inside __obj_flatten__ to run under fake tensor mode. However, tensors inside of script obejct are real tensors, this causes the fake tensor mode to error out saying that we need to first fakify fall the tensors (because allow_non_fake_inputs is set to True).

In this PR, we disable all the dispatch modes when running to_fake_obj.

 Note that, the output of `__obj_flatten__` will be fakified and filled inside of the corresponding FakeScriptObject. So during traicng, we'll be using FakeScriptObject that has fake tensor contents.

Test Plan:
Add a new test: pytest test/export/test_torchbind.py -k test_compile_tensor_op_in_tensor_flatten

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129605
Approved by: https://github.com/angelayi
2024-06-27 03:07:31 +00:00
Tristan Rice
0298560ca2 TCPStore: improve connect and retry logic (#129261)
We've been facing issues where TCPStore can successfully connect but then fail in the validate() function due to resets from listen backlog queue overflow when combined with reset enabled as well as long init times.

This PR does a few things:
* Retry that connect and validate up to the specified timeout.
* Use exponential backoff for the retry logic with jitter instead of a fixed 1s sleep.
* Eliminate the `sleep(std::chrono::milliseconds(numWorkers))` on init which can add significant delays to startup. This is no longer necessary per @XilunWu https://github.com/pytorch/pytorch/pull/116141

Test plan:

```
python test/distributed/test_store.py -v
./build/bin/BackoffTest
```

Will do internal testing with some large scale jobs to ensure TCPStore works correctly.

At 4k scale: 4x improvement

```
tristanr@devvm4382 ~/pt_tests [SIGABRT]> time TORCH_SHOW_CPP_STACKTRACES=1 python tcpstore_large_test.py                                                                                                   (pytorch-3.10)
started 0
init 0
set 0
joined all

________________________________________________________
Executed in    1.98 secs    fish           external
   usr time    0.93 secs   91.00 micros    0.93 secs
   sys time    1.98 secs  954.00 micros    1.97 secs

tristanr@devvm4382 ~/pt_tests> conda activate torchdrive-3.10                                                                                                                                              (pytorch-3.10)
tristanr@devvm4382 ~/pt_tests> time TORCH_SHOW_CPP_STACKTRACES=1 python tcpstore_large_test.py                                                                                                          (torchdrive-3.10)
started 0
init 0
set 0
joined all

________________________________________________________
Executed in    8.20 secs    fish           external
   usr time    2.15 secs    0.00 micros    2.15 secs
   sys time    2.76 secs  843.00 micros    2.76 secs
```

```py
import time
import os
import threading
from multiprocessing import Pool

WORLD_SIZE = 10000

import torch.distributed as dist

def run(rank):
    should_log = rank % (WORLD_SIZE // 10) == 0
    if should_log:
        print(f"started {rank}")
    store = dist.TCPStore(
        host_name="devvm4382.nao0.facebook.com",
        port=29500,
        world_size=WORLD_SIZE,
        is_master=rank == 0,
        use_libuv=True,
    )
    if should_log:
        print(f"init {rank}")
    store.set(f"key{rank}", "1234")
    if should_log:
        print(f"set {rank}")
    del store

def noop(rank):
    pass

print("starting pool")
with Pool(WORLD_SIZE) as pool:
    pool.map(noop, range(WORLD_SIZE), 1)
    print("pool hot")
    start = time.time()
    pool.map(run, range(WORLD_SIZE), 1)
    print("run finished", time.time()-start)
```

```
tristanr@devvm4382 ~/pt_tests> python tcpstore_large_test.py                                                                                                                                (pytorch-3.10)
starting pool
pool hot
started 0
[W624 16:58:09.086081750 TCPStore.cpp:343] [c10d] Starting store with 10000 workers but somaxconn is 4096.This might cause instability during bootstrap, consider increasing it.
started 1000
init 1000
set 1000
started 2000
init 2000
set 2000
started 3000
init 3000
set 3000
started 4000
init 4000
set 4000
started 5000
init 5000
set 5000
started 6000
init 6000
set 6000
started 7000
init 7000
set 7000
started 8000
init 8000
set 8000
started 9000
init 9000
set 9000
init 0
set 0
run finished 0.705092191696167
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129261
Approved by: https://github.com/rsdcastro, https://github.com/wconstab, https://github.com/kurman, https://github.com/XilunWu, https://github.com/c-p-i-o
2024-06-25 19:24:22 +00:00
Yidi Wu
b22f0f5f51 [torchbind] fix bug of mutating FakeScriptObjects twice in aot_export (#128844)
This PR does two things:
1. it duplicates the fake script object because aot_export trace the program twice. The result of tracing in the first time would cause the tracing result of second time be wrong.
2. Also add a new test for methods that return constant outputs. Before the PR, there's is no meta["val"] for these nodes because fx won't track these constants. We still need to preserve these constant return operators in the graph because torchbind objects are stateful and deleting it would remove the implicit state mutation inside of the object.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128844
Approved by: https://github.com/angelayi
2024-06-24 23:14:34 +00:00
cyy
e4c32d14a8 [3/N] Remove inclusion of c10/util/string_utils.h (#128504)
Follows #128372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128504
Approved by: https://github.com/malfet
2024-06-15 06:38:40 +00:00
Kiuk Chung
8629939a51 [torch/c10] Add C10_UBSAN_ENABLED macro and use it to disable SymInt_… (#127967)
Adds `C10_UBSAN_ENABLED` macro and use it to disable `SymIntTest::Overflows` (fails under `signed-integer-overflow` UBSAN check).

Also cleans up UBSAN guard in `jit/test_misc.cpp` to use `C10_UBSAN_ENABLED`  and the existing `C10_ASAN_ENABLED` instead of locally defining `HAS_ASANUBSAN`.

> NOTE: This should fix `SymIntTest::Overflows` failing under ubsan in fbcode too...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127967
Approved by: https://github.com/atalman, https://github.com/d4l3k, https://github.com/malfet
2024-06-14 16:01:12 +00:00
cyy
b054470db2 Remove unused functions (#127881)
Some unused functions detected by g++ warnings can be removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127881
Approved by: https://github.com/zou3519
2024-06-05 05:21:24 +00:00
cyy
e7cb43a2d2 Check unused variables in tests (#127498)
Enables unused variable checks in CMake.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127498
Approved by: https://github.com/ezyang
2024-06-04 05:35:25 +00:00
cyy
8629f9b3f2 Remove more unused variables in tests (#127510)
Follows #127379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127510
Approved by: https://github.com/Skylion007, https://github.com/r-barnes
2024-05-31 03:39:45 +00:00
Richard Barnes
3f5b59eef4 [codemod] c10::optional -> std::optional in caffe2/aten/src/ATen/DeviceGuard.h +117 (#126901)
Summary:
Generated with
```
fbgs -f '.*\.(cpp|cxx|cc|h|hpp|cu|cuh)$' c10::optional -l | perl -pe 's/^fbsource.fbcode.//' | grep -v executorch | xargs -n 50 perl -pi -e 's/c10::optional/std::optional/g'
```

 - If you approve of this diff, please use the "Accept & Ship" button :-)

(117 files modified.)

Test Plan: Sandcastle

Reviewed By: palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126901
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-05-24 00:26:15 +00:00
cyy
95e5c994f9 [Submodule] Clear USE_QNNPACK build option (#126941)
Following the removal of QNNPACK third-party module #126657, we can clear more build system code. Also third_party/neon2sse was removed because it is not used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126941
Approved by: https://github.com/ezyang
2024-05-24 00:12:56 +00:00
Chirag Pandya
a83e745356 [BE] split seq_id to collective_seq_id and p2p_seq_id (#125727)
Summary:
Split out `seq_id` into `collective_seq_id` and `p2p_seq_id`. The main idea here is that collectives that go to all machines should have identical `collective_seq_id` and therefore it makes it easier to spot if one of machines isn't handling a collective operation.
Next, we can attempt to match up p2p operations to ensure that the sender(s)/receivers(s) are in sync.

Resolves issue: https://github.com/pytorch/pytorch/issues/125173

Test Plan:
Unit tests.

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125727
Approved by: https://github.com/zdevito
2024-05-21 03:26:49 +00:00
David Berard
cb3b8cd0d3 Use object identity for deepcopy memo (#126126)
Copy of #126089, with some additional fixes & tests

Partial fix for #125635: previously, the deepcopy implementation would group together any tensors with any aliasing relationship and assign them to the same tensor. This was sort of good if you have two tensors `b = a.detach()`, because then if you deepcopy `list = [a, b]` to `list2 = list.deepcopy()`, then writes to `list2[0]` will also modify `list2[1]`. But for the most part, it's bad; (1) if you have `b = a.as_strided((4, 4), (16, 1), 16)`, then it'll make `b == a` in the deepcopied implementation, which is completely wrong; and (2) even if you have `b = a.detach()`, these are still initially two different tensors which become the same tensor after the old deepcopy implementation.

The new implementation only groups together tensors that have the same identity. This is a partial fix, but it's more reasonable. What changes:
* (becomes more correct): different views of the same base tensor will no longer all become equal after deepcopying
* (still kind of wrong): views won't actually alias each other after deepcopying.
* (arguably a minor regression): equivalent views of the same tensor will no longer be copied to the same tensor - so they won't alias.

BC breaking: C++ deepcopy interface changes from accepting `IValue::HashAliasedIValueMap memo` to accepting `IValue::HashIdentityIValueMap memo`. If there are objections, we can keep the old API. However, it seems likely that users generally won't try to deepcopy from C++.

Differential Revision: [D57406306](https://our.internmc.facebook.com/intern/diff/D57406306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126126
Approved by: https://github.com/ezyang
2024-05-17 00:06:26 +00:00
Richard Barnes
ed327876f5 [codemod] c10:optional -> std::optional (#126135)
Generated by running the following from PyTorch root:
```
find . -regex ".*\.\(cpp\|h\|cu\|hpp\|cc\|cxx\)$" | grep -v "build/" | xargs -n 50 -P 4 perl -pi -e 's/c10::optional/std::optional/'
```

`c10::optional` is just an alias for `std::optional`. This removes usages of that alias in preparation for eliminating it entirely.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126135
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/albanD, https://github.com/aaronenyeshi
2024-05-14 19:35:51 +00:00
ydwu4
461ffaaaf3 [dynamo] support torchbind object input (#124978)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124978
Approved by: https://github.com/jansel
2024-05-07 03:02:00 +00:00
Zhirui Dai
3411d54811 fix loading optimizer options from archive (#125215)
This PR makes libtorch behave the same as PyTorch when loading optimizer state from archive. With PyTorch, options of parameter groups are loaded from the archive, which is missing currently in libtorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125215
Approved by: https://github.com/janeyx99
2024-05-06 23:58:40 +00:00
Shuqiang Zhang
bfd5bb0c44 [c10d] only PG0 should dump when monitoring thread timed out (#125356)
Summary:
We found that some dumps are missing when monitoring thread timeout.
This is likely due to multiple PGs could still dump the same records
at the same time. So we should allow only PG0 to actualy dump
Test Plan:
 unit test
python test/run_test.py --cpp --verbose -i cpp/ProcessGroupNCCLErrorsTest
Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125356
Approved by: https://github.com/c-p-i-o
2024-05-04 00:43:20 +00:00
Stefan-Alin Pahontu
bebefcf845 Driver folder check (#117548)
Added extra check for driver folders for Libtorch, as stat struct does not recognize driver folders, so torch.save should work for them as well. (e.g. save model.pt directly under C: )

Fixes [#111121](https://github.com/pytorch/pytorch/issues/111121) and #105488

Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117548
Approved by: https://github.com/malfet
2024-05-03 09:10:11 +00:00
Wes Bland
6f5f405b05 [ncclx] Rename NCCL-EXP to NCCLX (#125238)
Reviewed By: kryanchun

Differential Revision: D56534548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125238
Approved by: https://github.com/kwen2501
2024-05-01 23:29:55 +00:00
PyTorch MergeBot
724c7491d0 Revert " [Distributed] [7/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124987)"
This reverts commit b3fd94d15e.

Reverted https://github.com/pytorch/pytorch/pull/124987 on behalf of https://github.com/ezyang due to broke downstream extensions ([comment](https://github.com/pytorch/pytorch/pull/124987#issuecomment-2083956511))
2024-04-30 00:37:53 +00:00
cyy
b3fd94d15e [Distributed] [7/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124987)
This PR continues to clean clang-tidy warnings in torch/csrc/distributed/c10d, following #124701. In addition, libfmt dependency is added in CMake code to enable using it in the headers. The libfmt has to be added as private dependency to torch_cuda and torch_hip because they include torch/csrc/distributed/c10d/Utils.hpp which uses libfmt.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124987
Approved by: https://github.com/malfet
2024-04-27 07:22:27 +00:00
Kurt Mohler
abcb42cdd2 Avoid COW materialize in various places (1) (#124984)
Most, not all, of these cases were found automatically with `git grep -n '^\s*\<const\>.*\*.*=.*\<data_ptr\>'`

Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124984
Approved by: https://github.com/Skylion007
2024-04-26 19:06:28 +00:00
Shivam Raikundalia
63d4dc5a80 Remove TMP_LIBKINETO_NANOSECOND flag from Compilation (#124734)
Summary: Now that we have reached nanosecond granularity, we can now remove the temporary guards that were previously required for nanosecond precision.

Test Plan: Regression should cover this change

Reviewed By: aaronenyeshi

Differential Revision: D56444570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124734
Approved by: https://github.com/aaronenyeshi
2024-04-26 06:57:03 +00:00
Bin Bao
b2fd224f27 [AOTI] Add more ABI-compatiblity unit test (#123900)
Summary: Follow https://github.com/pytorch/pytorch/pull/123848, and test more c10 util functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123900
Approved by: https://github.com/chenyang78
2024-04-23 16:06:40 +00:00
Bin Bao
4946638f06 [AOTI] Add ABI-compatiblity tests (#123848)
Summary: In AOTInductor generated CPU model code, there can be direct references to some aten/c10 utility functions and data structures, e.g. at::vec and c10::Half. These are performance critical and thus it doesn't make sense to create C shim for them. Instead, we make sure they are implemented in a header-only way, and use this set of tests to guard future changes.

There are more header files to be updated, but we will do it in other followup PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123848
Approved by: https://github.com/jansel
ghstack dependencies: #123847
2024-04-19 00:51:24 +00:00
Xuehai Pan
93e249969b [BE] enable ruff rule RSE and remove useless parentheses in raise statements (#124261)
Remove useless parentheses in `raise` statements if the exception type is raised with no argument.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261
Approved by: https://github.com/albanD
2024-04-17 19:29:34 +00:00
Shivam Raikundalia
3ebbeb75fd [Profiler] Make Kineto traces export ns granularity for finer timestamps (#122425) (#123650)
Summary:

Kineto traces use microsecond level granularity because of chrome tracing defaults to that precision. Fix by adding preprocessor flag to TARGETS and BUCK files. Also remove any unnecessary ns to us conversions made in the profiler itself.

This diff contains profiler changes only. Libkineto changes found in D54964435.

Test Plan:
Check JSON and chrome tracing to make sure values are as expected. Tracing with flags enabled should have ns precision. Tracings without flags should be same as master.
Zoomer: https://www.internalfb.com/intern/zoomer/?profiling_run_fbid=796886748550189
Ran key_averages() to make sure FunctionEvent code working as expected:
--  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls

                                          ProfilerStep*         0.74%       3.976ms        64.40%     346.613ms      69.323ms       0.000us         0.00%      61.710ms      12.342ms             5
                      Optimizer.zero_grad#SGD.zero_grad         0.76%       4.109ms         0.76%       4.109ms     821.743us       0.000us         0.00%       0.000us       0.000us             5
                                          ## forward ##         6.89%      37.057ms        27.19%     146.320ms      29.264ms       0.000us         0.00%      58.708ms      11.742ms             5
                                           aten::conv2d         0.22%       1.176ms         7.74%      41.658ms     157.199us       0.000us         0.00%      27.550ms     103.962us           265
                                      aten::convolution         0.79%       4.273ms         7.52%      40.482ms     152.762us       0.000us         0.00%      27.550ms     103.962us           265
                                     aten::_convolution         0.69%       3.688ms         6.73%      36.209ms     136.637us       0.000us         0.00%      27.550ms     103.962us           265
                                aten::cudnn_convolution         6.04%      32.520ms         6.04%      32.520ms     122.719us      27.550ms         8.44%      27.550ms     103.962us           265
                                             aten::add_         2.42%      13.045ms         2.42%      13.045ms      30.694us      12.700ms         3.89%      12.700ms      29.882us           425
                                       aten::batch_norm         0.19%       1.027ms         8.12%      43.717ms     164.971us       0.000us         0.00%      16.744ms      63.185us           265
                           aten::_batch_norm_impl_index         0.31%       1.646ms         7.93%      42.691ms     161.096us       0.000us         0.00%      16.744ms      63.185us           265
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------

Differential Revision: D55925068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123650
Approved by: https://github.com/aaronenyeshi
2024-04-11 04:29:20 +00:00
PyTorch MergeBot
c66d503194 Revert "[Profiler][submodule] Make Kineto traces export ns granularity for finer timestamps (#122425)"
This reverts commit 6f7dd2f84a.

Reverted https://github.com/pytorch/pytorch/pull/122425 on behalf of https://github.com/malfet due to Breaks ROCM builds ([comment](https://github.com/pytorch/pytorch/pull/122425#issuecomment-2041129241))
2024-04-06 16:19:00 +00:00
Shivam Raikundalia
6f7dd2f84a [Profiler][submodule] Make Kineto traces export ns granularity for finer timestamps (#122425)
Summary:
Kineto traces use microsecond level granularity because of chrome tracing defaults to that precision. Fix by adding preprocessor flag to TARGETS and BUCK files. Also remove any unnecessary ns to us conversions made in the profiler itself.

This diff contains profiler changes only. Libkineto changes found in D54964435.

Test Plan:
Check JSON and chrome tracing to make sure values are as expected. Tracing with flags enabled should have ns precision. Tracings without flags should be same as master.
Tracing with flags enabled: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_37_22.4155151.pt.trace.json.gz&bucket=gpu_traces
Tracing without flags enabled: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_39_15.4166047.pt.trace.json.gz&bucket=gpu_traces
Tracing on main: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_42_43.4177559.pt.trace.json.gz&bucket=gpu_traces

Ran key_averages() to make sure FunctionEvent code working as expected:
--  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls

                                          ProfilerStep*         0.74%       3.976ms        64.40%     346.613ms      69.323ms       0.000us         0.00%      61.710ms      12.342ms             5
                      Optimizer.zero_grad#SGD.zero_grad         0.76%       4.109ms         0.76%       4.109ms     821.743us       0.000us         0.00%       0.000us       0.000us             5
                                          ## forward ##         6.89%      37.057ms        27.19%     146.320ms      29.264ms       0.000us         0.00%      58.708ms      11.742ms             5
                                           aten::conv2d         0.22%       1.176ms         7.74%      41.658ms     157.199us       0.000us         0.00%      27.550ms     103.962us           265
                                      aten::convolution         0.79%       4.273ms         7.52%      40.482ms     152.762us       0.000us         0.00%      27.550ms     103.962us           265
                                     aten::_convolution         0.69%       3.688ms         6.73%      36.209ms     136.637us       0.000us         0.00%      27.550ms     103.962us           265
                                aten::cudnn_convolution         6.04%      32.520ms         6.04%      32.520ms     122.719us      27.550ms         8.44%      27.550ms     103.962us           265
                                             aten::add_         2.42%      13.045ms         2.42%      13.045ms      30.694us      12.700ms         3.89%      12.700ms      29.882us           425
                                       aten::batch_norm         0.19%       1.027ms         8.12%      43.717ms     164.971us       0.000us         0.00%      16.744ms      63.185us           265
                           aten::_batch_norm_impl_index         0.31%       1.646ms         7.93%      42.691ms     161.096us       0.000us         0.00%      16.744ms      63.185us           265
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------

Differential Revision: D55087993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122425
Approved by: https://github.com/aaronenyeshi
2024-04-06 06:04:28 +00:00
Arun Pa
f71e368969 UFMT formatting on test/autograd test/ao test/cpp test/backends (#123369)
Partially addresses #123062

Ran lintrunner on
- test/_test_bazel.py
- test/ao
- test/autograd test/backends test/benchmark_uitls test/conftest.py test/bottleneck_test test/cpp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123369
Approved by: https://github.com/huydhn
2024-04-05 18:51:38 +00:00
Chun Cai
691054eeef Fix error message of autograd (#123154)
This PR updates the error message in autograd when an input tensor does not set to `require_grad`. The original message does not contain the index info, making users hard to debug.
The error message style consists with that on line 105-109.
Co-authored-by: Jeffrey Wan <soulitzer@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123154
Approved by: https://github.com/soulitzer
2024-04-03 19:07:21 +00:00
ydwu4
c77352b5cc Add torch._library.register_fake_class to fakify torchBind class (#122622)
This PR only adds abstract class registration logic without touching existing tests so they still trace with real script object. The added tests are only for registration APIs and test error messages.

Our design is that the abstract implementation should be in Python. This is much better in terms of usability. But this also has implications for custom op that takes script object as input, which is detailed later in this stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122622
Approved by: https://github.com/zou3519
ghstack dependencies: #122619, #122620, #122621
2024-04-02 23:52:17 +00:00
ydwu4
46c7235406 add tensor queue example (#122621)
This PR adds a tensor queue example for later use. It doesn't touch any existing logic. It refactors the tests a little bit to avoid importing the library in unittest setUp.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122621
Approved by: https://github.com/zou3519
ghstack dependencies: #122619, #122620
2024-04-02 23:52:17 +00:00
blegouix
ccfc87b199 include scheduler_on_plateau in optim.h (#121722)
Fixes #121593
Co-authored-by: Jane Xu <janeyx@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121722
Approved by: https://github.com/albanD
2024-03-27 19:45:25 +00:00
Mu-Chu Lee
2367d0dacd [AOTInductor] Add tensor_constantX to pass constant buffer update's check (#122562) (#122690)
Summary:

During tracing, some constants (tensor_constant{idx}) are being generated internally.
Those constants are neither parameters or buffers, and users have zero control on them.

To accomodate this, we should allow users not passing in those constants generated internally but still be able the constants in the model.

Test Plan:
Included in commit.
```
build/bin/test_aot_inductor
```

Reviewed By: zoranzhao

Differential Revision: D55354548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122690
Approved by: https://github.com/khabinov
2024-03-26 23:25:15 +00:00
PyTorch MergeBot
55f36d1ada Revert "[AOTInductor] Add tensor_constantX to pass constant buffer update's check (#122562)"
This reverts commit 57a3d00b06.

Reverted https://github.com/pytorch/pytorch/pull/122562 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/122562#issuecomment-2019262415))
2024-03-26 02:18:19 +00:00
Mu-Chu Lee
57a3d00b06 [AOTInductor] Add tensor_constantX to pass constant buffer update's check (#122562)
Summary:
During tracing, some constants (tensor_constant{idx}) are being generated internally.
Those constants are neither parameters or buffers, and users have zero control on them.

To accomodate this, we should allow users not passing in those constants generated internally but still be able the constants in the model.

Test Plan:
Included in commit.
```
build/bin/test_aot_inductor
```

Differential Revision: D55286634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122562
Approved by: https://github.com/chenyang78, https://github.com/khabinov
2024-03-25 22:05:20 +00:00
David Berard
2d9cee20a2 [jit] AliasDB type hash - don't always return 0 (#121874)
This hash was missing an assignment, so for almost all types it was returning "0".

c10::flat_hash_map turns out to have really bad behavior with a terrible hash like this, nearly exponential in memory usage.

Differential Revision: [D54916424](https://our.internmc.facebook.com/intern/diff/D54916424)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121874
Approved by: https://github.com/eellison
2024-03-14 23:16:08 +00:00
angelayi
e8836759d0 [export] Add effect token to export (#121424)
Following the creation of effect tokens (https://github.com/pytorch/pytorch/pull/120296), we want to now add support for these tokens in export because the calling/returning convention has changed. The inputs are now `(tokens, params, buffers, constants, user_inputs)` and the outputs are `(tokens, buffer_mutations, user_mutations, user_outputs)`. The graph looks something like:
```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %attr : [num_users=2] = placeholder[target=attr]
    %arg1_1 : [num_users=2] = placeholder[target=arg1_1]
    %with_effects : [num_users=2] = call_function[target=torch._higher_order_ops.effects.with_effects](args = (%arg0_1, _TorchScriptTesting.takes_foo.default, %attr, %arg1_1), kwargs = {})
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects, 0), kwargs = {})
    %getitem_1 : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects, 1), kwargs = {})
    %with_effects_1 : [num_users=2] = call_function[target=torch._higher_order_ops.effects.with_effects](args = (%getitem, _TorchScriptTesting.takes_foo.default, %attr, %getitem_1), kwargs = {})
    %getitem_2 : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects_1, 0), kwargs = {})
    %getitem_3 : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects_1, 1), kwargs = {})
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg1_1, %getitem_3), kwargs = {})
    return (getitem_2, add)
```

During unlifting, we will first remove the tokens and with_effect calls using the `remove_effect_tokens` pass. (cc @SherlockNoMad on the pass to remove tokens). This is so that this won't change the calling conventions when retracing. The graph after unlifting looks something like:
```
graph():
    %attr_1 : [num_users=2] = get_attr[target=attr]
    %arg1_1 : [num_users=2] = placeholder[target=arg1_1]
    %takes_foo_default_1 : [num_users=1] = call_function[target=torch.ops._TorchScriptTesting.takes_foo.default](args = (%attr_1, %arg1_1), kwargs = {})
    %takes_foo_default : [num_users=1] = call_function[target=torch.ops._TorchScriptTesting.takes_foo.default](args = (%attr_1, %takes_foo_default_1), kwargs = {})
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg1_1, %takes_foo_default), kwargs = {})
    return (add,)
```

Serialization support will be added in a followup.
Note: tokens only affect custom ops that take in ScriptObjects, not ScriptObject methods yet.

Differential Revision: [D54639390](https://our.internmc.facebook.com/intern/diff/D54639390)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121424
Approved by: https://github.com/tugsbayasgalan
2024-03-09 02:43:26 +00:00
Will Constable
f85d3a022c [C10D] Fix pointToPoint op Flight Recording (#120270)
Fix and test issues with both coalesced and individual send/recv ops

Considered an alternate approach and then ditched it
 - alternate approach: #119757
 - reason ditched: prefer recording individual collective events inside
   coalescing region instead of just the event at the end of the region,
   which also would not have tensor sizes or opnames without additional
   state variables added

Another approach also ditched
- record events on workEnqueue instead of initWork
- reason ditched: too messy to get input/output shapes tagged on
  recording when recording in workEnqueue.  Adding the info onto the
  Work obj would be possible, but adds to overhead of copying Works
  which we do on every collective. We can get info off the input/output
  tensors directly in initWork, but we don't want to keep refs to those
  tensors alive while the work is Enqueued, so we'd have to specifically
  copy size lists or something.

This PR instead avoids creating a work inside pointToPoint when
coalescing is active. Instead, only at endCoalescing() is a work finally
intialized and enqueued.  But it adds a record() call inside
pointToPoint() instead of creating a work, during coalescing. This
record() call picks up tensor shapes and op names.

It ALSO changes initWork to accept a 'record' argument. This defaults to
false, and should only be set to true if the caller ensures the work
will be enqueued by workEnqueue, ensuring its cuda events are live when
used by flight recorder's update_state().

The testing uncovers some odd pre-existing behavior and leaves them
alone for now. We could change some of these
- seq starts off at 1, not 0 for first op (but this is inconistent)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120270
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #120724
2024-02-29 01:03:31 +00:00
Shuqiang Zhang
39f0a5ecc9 [c10d] simplify the dump timeout logic and unify the async call (#120331)
Summary:
The current dump timeout logic is a bit cumbersome as it needs 2 times: 1.
timeout, 2. wake up time. And in theory the caller just needs to wait
for a max of timeout value for the dump and declare the dump to be
either successful or not. Also we unify the async call using std::async
instead of a customized async lauch function for each operation.
Test Plan:
Unit tests
Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120331
Approved by: https://github.com/wconstab
2024-02-23 19:46:40 +00:00
cyy
1aad5c98b4 [structural binding][5/N] Replace std::tie with structural binding (#120142)
This PR follows https://github.com/pytorch/pytorch/pull/119774, it is a continued work to clean up std::tie.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120142
Approved by: https://github.com/albanD
2024-02-21 22:32:55 +00:00
soulitzer
312ce35c1f Rename singleton int to nested int (#119661)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119661
Approved by: https://github.com/ezyang
2024-02-16 19:21:17 +00:00
cyy
47a2e6b6b8 Fix C++20 build (#112333)
Currently C++20 fails because of incorrect template initialization order. This PR adjusted the order of theses classes and a constructor to address the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112333
Approved by: https://github.com/albanD
2024-02-13 05:10:19 +00:00
Sahdev Zala
110919c984 Check QNNPACK support for the platform before running test (#119139)
Do not run test ConstantPropagation.CustomClassesCanBePropagated on a platform where QNNPACK is not supported.

For example, this test fails on M1 Mac because QNNPACK is not supported on M1 Mac:
[----------] 1 test from ConstantPropagation
[ RUN      ] ConstantPropagation.CustomClassesCanBePropagated
unknown file: Failure
as described in more details in the issue #88613.

After the PR, test passes successfully as below:
[----------] 1 test from ConstantPropagation
[ RUN      ] ConstantPropagation.CustomClassesCanBePropagated
[       OK ] ConstantPropagation.CustomClassesCanBePropagated (0 ms)
[----------] 1 test from ConstantPropagation (0 ms total)

Fixes #88613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119139
Approved by: https://github.com/jcaip
2024-02-12 20:21:07 +00:00
suo
82248f0b1c [export] improve FakeTensor serialization (#119531)
Recently we made it possible to serialize ExportedPrograms with fake parameters/buffers/etc.

The serialization regime was kind of whacky; basically we serialized a stub and reassembled the FakeTensor using metadata that we had stashed elsewhere in the Graph state.

This was bad for a few reasons:
- Storing the metadata separately from the actual serialized object caused situations where you could have one but not the other. An example case is if you had a FakeTensor contained inside a TorchBind object—there was no obviously place to store the metadata for this. This actually happens—TensorQueue in fbgemm does this.
- It created an annoying cycle: we had to deserialize the Graph's tensor metadata in order to deserialize (potentially faked) constants, but we need constants in order to deserialize the Graph.

This fixes all that. The basic idea is to patch the reducer function for FakeTensor at serialization time, and serialize a copy of the FakeTensor metadata. We already are policing BC for the TensorMeta schema struct so it's not a net increase in the BC surface.

As a bonus, I fixed a weird bug with torchbind tracing where we were accidentally reinterpreting a torch.ScriptObject as a torch.ScriptModule (which was the root cause of some weird behavior @bahuang was seeing last week).

Differential Revision: [D53601251](https://our.internmc.facebook.com/intern/diff/D53601251/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119531
Approved by: https://github.com/zhxchen17
2024-02-12 19:28:08 +00:00
Ke Wen
b2043c0543 [c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421)
Part 2 and last part of #118674:
Introduce actual "single-device" code change to ProcessGroupNCCL.

assert size == 1 and test refactor have been done in #119099.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119421
Approved by: https://github.com/shuqiangzhang
2024-02-12 18:45:49 +00:00
PyTorch MergeBot
0342b227e5 Revert "[c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421)"
This reverts commit f3e7d80993.

Reverted https://github.com/pytorch/pytorch/pull/119421 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/119421#issuecomment-1938169747))
2024-02-12 07:34:20 +00:00
Ke Wen
f3e7d80993 [c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421)
Part 2 and last part of #118674:
Introduce actual "single-device" code change to ProcessGroupNCCL.

assert size == 1 and test refactor have been done in #119099.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119421
Approved by: https://github.com/shuqiangzhang
2024-02-09 20:23:20 +00:00
Ke Wen
029a16c41f [c10d] PGNCCL refactor part 1: adds assert size==1 (#119099)
Breaking #118674 into multiple smaller PRs.
This is the first one.
It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119099
Approved by: https://github.com/wconstab, https://github.com/XilunWu
2024-02-07 22:29:29 +00:00
PyTorch MergeBot
9d46fe603d Revert "[c10d] PGNCCL refactor part 1: adds assert size==1 (#119099)"
This reverts commit 4ab852b6c5.

Reverted https://github.com/pytorch/pytorch/pull/119099 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/119099#issuecomment-1930839754))
2024-02-06 22:14:36 +00:00
Ke Wen
4ab852b6c5 [c10d] PGNCCL refactor part 1: adds assert size==1 (#119099)
Breaking #118674 into multiple smaller PRs.
This is the first one.
It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119099
Approved by: https://github.com/wconstab
2024-02-06 06:59:47 +00:00
lancerts
576383c2eb Add torch check for dtype within bilinear (#118900)
Fixes https://github.com/pytorch/pytorch/issues/117237
Short-term fix, when dtype does not match, it will be reflected in the torch check.

@ezyang a cpp test case is added
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118900
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-02-03 00:02:00 +00:00
Mu-Chu Lee
2b48891e62 [AOTInductor] Add Runtime Constant-folding for AOTInductor (#118765)
Summary:
Add Runtime Constant-folding for AOTInductor.
This also include the invocation of constant folding at load time.

The constant folding lowering is a 2-step process.
First, we split the graph into 2 modules, one of it is the constant module, which doesn't depend on any input and the whole module could be inferred (constant-folded) one-time and be reused. The constant module, is lowered, and being codegen-ed as usual and cached (let's call this constant code). The constant code reuses the whole lowering/profiling/etc. process, only difference is that we do not generate any headers or initialization for the constant code.
Second, after handling the constant module, we take care of the main module (which is the part that would depend on the user input.) For the main module, we take in one additional component, the constant code, compare with a normal lowering. Addition step we do here is that, we inject the constant code into the codegen-ed main module, and create the caller for the main module to consume the result of the constant module.

Test Plan: Unit tests included in commit.

Differential Revision: D53274382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118765
Approved by: https://github.com/chenyang78
2024-02-01 04:54:25 +00:00
Boyuan Feng
b369888bec Replace constraints with dynamic_shapes in caffe2/test/cpp & torchrec/distributed/tests/test_pt2 (#118026)
Summary: `constraints` argument for `torch.export` has been deprecated in favor of the `dynamic_shapes` argument. This PR updates the use of the deprecated API in `caffe2/test/cpp` and `torchrec/distributed/test/test_pt2`.

Test Plan: CI

Differential Revision: D52977354

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118026
Approved by: https://github.com/chenyang78
2024-01-23 22:15:15 +00:00
suo
4057d005ff Initial torchbind support in PT2 (#117697)
This PR adds the bare minimum functionality to get torchbind working in an e2e testable way on PT2.

It implements:
* ProxyTensor support
* Simple torch.export support (proxytensor-only path, e.g. non-strict).
* add some tests exercising the path.

Because all this is not fully baked, I hide the functionality behind a feature flag (`enable_torchbind_tracing()`) so it does not affect regular users for now.

Still on the agenda:
* Dynamo support
* Actual FakeMode support
* Mutability support

Hoping to get this first bit in as a standalone, as it will unblock some more extensive experimentation/testing going on internally.

Differential Revision: [D51825372](https://our.internmc.facebook.com/intern/diff/D51825372/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117697
Approved by: https://github.com/SherlockNoMad
2024-01-19 06:28:20 +00:00
Ke Wen
c16e6e4cf7 [ProcessGroup] Make watchdog check work queue more frequently (#117297)
Today watchdog's sleep interval is 1s. That's a bit long compared to modern GPU link's (or network link's) speed.

Take DDP and Ampere for example:

DDP's bucket size = 25 MB
Ampere's NVLink speed = 250 GB/s

25 MB / 250 GB/s = 100 ms.
So we are updating the interval to 100 ms.

Update:
25 MB / 250 GB/s = 0.1 ms
But let's see how it goes so far between making the checking more aggressive.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117297
Approved by: https://github.com/fduwjj
2024-01-19 02:33:31 +00:00
Ke Wen
6d96beb6be [c10d] Remove health check (#117699)
https://github.com/pytorch/pytorch/pull/114916 and https://github.com/pytorch/pytorch/pull/116222 added support for eager NCCL comm init (performed as soon as `init_process_group` is called).

If any user cares about the time difference and want to see NCCL init errors early, they can use eager init now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117699
Approved by: https://github.com/wconstab
2024-01-18 02:14:49 +00:00
fduwjj
ca4df16fdd [c10d] Make DebugInfoWriter Singleton across all PG objects (#116489)
Previously, we have the writer register to each NCCL PG(backend), so for every pg, we have a NCCL PG instance, so if we use some customized writer when multiple sub-PGs are used, we need to ensure user to register the writer for every backend which indicates a bad UX. Furthermore, the debug info is global, so it does not make sense to have the writer for each instance. We even have a static mutex in the `dumpDebuggingInfo` to ensure we serialize the write, that makes it more obvious that we can make the writer a singleton so that we only have one writer instance for all PG instances.

Although the rationale is clear, the implementation may vary a lot. So this PR is RFC for now to see if this implementation makes sense or not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116489
Approved by: https://github.com/kwen2501
2024-01-03 03:42:54 +00:00
Bin Bao
feafbcf437 [AOTI][refactor] Refactor model runner API (#116047)
Summary: 1) make proxy executor as a private member; 2) use std::string instead of char*

Differential Revision: [D52301106](https://our.internmc.facebook.com/intern/diff/D52301106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116047
Approved by: https://github.com/khabinov
2023-12-21 01:05:37 +00:00
Xilun Wu
0b0b9b3275 [c10d][libuv] add partial read test for libuv backend and fix an error which only happens when partially reading a buffer (#116141)
**Test Plan**
1. build pytorch
2. execute `TORCH_CPP_LOG_LEVEL=INFO build/bin/TCPStoreTest --gtest_filter=TCPStoreTest.testLibUVPartialRead` from the pytorch root directory.

without the change:
<img width="761" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/1942e3c2-a9c1-4fe4-87e8-7e21f4d8f9aa">

with the change:
<img width="747" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/f3e96a5b-0ed1-49bd-9184-bb8a5ebebc33">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116141
Approved by: https://github.com/wconstab
2023-12-20 18:37:55 +00:00
Bin Bao
fabf9433e7 [AOTI][refactor] Organize model runner files (#116022)
Summary: Move runner util files into a subdirectory and put AOTIModelContainerRunnerCpu into a separate file

Differential Revision: [D52300693](https://our.internmc.facebook.com/intern/diff/D52300693)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116022
Approved by: https://github.com/khabinov
2023-12-20 15:35:34 +00:00
Nikita Shulga
d7caef7996 [CI] Update clang-format (#116002)
To 17.0.6 build using https://github.com/pytorch/test-infra/blob/main/.github/workflows/clang-tidy-linux.yml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116002
Approved by: https://github.com/suo
2023-12-18 14:58:46 +00:00
Mu-Chu Lee
c285ca7916 [AOTInductor] Add updaing constant buffer to active buffer. (#116001)
Summary:
Refactor update inactive constant buffer to allow updating with active
buffer.

Test Plan:
Existing test to test inactive buffer updates.
UpdateConstantsCuda in cpp test for active buffer updates.

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116001
Approved by: https://github.com/chenyang78
2023-12-18 11:49:03 +00:00
soulitzer
4d8ad4fb82 Move SingletonSymNodeImpl from c10 to aten (#114895)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114895
Approved by: https://github.com/jbschlosser
2023-12-13 20:01:18 +00:00
Wongboo
68f74dd162 Add python and C++ support for LPPool3d (#114199)
Add python and C++ support for LPPool3d to Fixes #114114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114199
Approved by: https://github.com/mikaylagawarecki
2023-12-08 18:18:44 +00:00
Will Constable
7562b45454 Reland "[C10D] Use future for flight recorder dump (#115176)" (#115332)
Replaces the "always sleep 30 sec before abort" with "wait up to 30 sec
for the future to complete then abort". The difference in this case is
the abort happens as soon as the dump finishes up to a maximum, instead
of always waiting the maximum.

Allows multiple calls to dump, which will be serialized.

Renames tryWriteDebugInfo to launchAsyncDebugDump in spirit of the
change to support more than one launch and to always launch rather than
only launching on the first call.

Adds a test for dumping on timeout.

This reverts commit ac7d14baad.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115332
Approved by: https://github.com/fduwjj
2023-12-07 21:20:58 +00:00
Shaltiel Shmidman
ee8b33f7d5 Fixed crash when calling pad_packed_tensor when packed with cuda tensors and ensure_sorted=false due to indexing with tensors on different devices (#115028)
Fixes #115027

Fix in csrc as done in the python code [here](https://github.com/pytorch/pytorch/blob/main/torch/nn/utils/rnn.py#L338).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115028
Approved by: https://github.com/drisspg
2023-12-07 18:09:18 +00:00
suo
686a3e0bf0 [pytorch][PR] introduce WeakHashRef (#115216)
We would like weak dictionaries that have `torch.ScriptObject` keys. Similar to tensors, we need to override the behavior of the ref to dot he right thing under comparison.

This change also makes it so that WeakIdKeyDictionary works with a pluggable ref_type.

Differential Revision: [D51828205](https://our.internmc.facebook.com/intern/diff/D51828205/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115216
Approved by: https://github.com/albanD
2023-12-07 17:48:11 +00:00
PyTorch MergeBot
ac7d14baad Revert "[C10D] Use future for flight recorder dump (#115176)"
This reverts commit 0e07e3dbe4.

Reverted https://github.com/pytorch/pytorch/pull/115176 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the test_timeout_dumps is failing in trunk 0e07e3dbe4 ([comment](https://github.com/pytorch/pytorch/pull/115176#issuecomment-1844076455))
2023-12-07 02:09:58 +00:00
Will Constable
0e07e3dbe4 [C10D] Use future for flight recorder dump (#115176)
Replaces the "always sleep 30 sec before abort" with "wait up to 30 sec
for the future to complete then abort".  The difference in this case is
the abort happens as soon as the dump finishes up to a maximum, instead
of always waiting the maximum.

Allows multiple calls to dump, which will be serialized.

Renames `tryWriteDebugInfo` to `launchAsyncDebugDump` in spirit of the
change to support more than one launch and to always launch rather than
only launching on the first call.

Adds a test for dumping on timeout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115176
Approved by: https://github.com/zdevito
2023-12-06 23:42:19 +00:00
Mu-Chu Lee
80527c0cf2 [AOTInductor] Double buffering for Weights (#114446)
Summary:
This adds function to model container doing weight swapping with double buffering.

There are 2 parts for double buffering
a) Write constants into inactive buffer
b) Swap active buffer

For (a), we write the constants into the buffer that's currently not in use, and store the information in both constants map and the corresponding constant array to read.
For (b), we obtain the lock, and activate the constant map/constant array that is inactive, and flag the one that's currently in use to inactive.

Test Plan:
test/cpp/aot_inductor/test.cpp

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D51543732](https://our.internmc.facebook.com/intern/diff/D51543732)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114446
Approved by: https://github.com/chenyang78, https://github.com/eellison
2023-12-05 22:31:56 +00:00
FFFrog
541591dd79 Add the appropriate check on div_value to the cpp frontend (#114671)
Fixes #114334

As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114671
Approved by: https://github.com/mikaylagawarecki
2023-12-04 01:28:11 +00:00
Chip Turner
9cc040fef6 Switch env variable use in test harnesses to the non-deprecated names to fix warnings (#114880)
Previously:

```
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
```

With this PR, those warnings disappear.  They were introduced in #114077

This change was generated with this sed script, applied with `sed -i -f /tmp/x **/*.{py,hpp,cpp,cc}` and hand inspected.

```
s/\bNCCL_BLOCKING_WAIT\b/TORCH_NCCL_BLOCKING_WAIT/g
s/\bNCCL_ENABLE_TIMING\b/TORCH_NCCL_ENABLE_TIMING/g
s/\bNCCL_DESYNC_DEBUG\b/TORCH_NCCL_DESYNC_DEBUG/g
s/\bNCCL_ASYNC_ERROR_HANDLING\b/TORCH_NCCL_ASYNC_ERROR_HANDLING/g
s/\bENABLE_NCCL_HEALTH_CHECK\b/TORCH_ENABLE_NCCL_HEALTH_CHECK/g
s/\bNCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK\b/TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK/g
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114880
Approved by: https://github.com/kwen2501
2023-12-01 20:08:23 +00:00
Scott Wolchok
165f4f6ccf [PyTorch] Redirect c10::optional to std::optional (#101995)
We have C++17 now!

I am intentionally dropping the `c10::optional<c10::ArrayRef>` size optimization. It was intended to improve dispatch, but thanks to D34602980 / #70864 we don't use `optional<ArrayRef>` in function arguments anymore anyway.

Differential Revision: [D46079028](https://our.internmc.facebook.com/intern/diff/D46079028/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101995
Approved by: https://github.com/malfet, https://github.com/Skylion007, https://github.com/ezyang
2023-11-30 02:46:41 +00:00
Chip Turner
066e072524 Retry #112889 (Opportunistically use ncclCommSplit when creating new NCCL groups) (#114385)
- [c10d] (retry) Opportunistically use `ncclCommSplit` when creating new NCCL groups (#112889)
- Guard use of `split_from` with a `hasattr` check for cases when NCCL (or RCCL) lacks `ncclCommSplit`

Fixes cause of revert of original PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114385
Approved by: https://github.com/huydhn
2023-11-23 07:00:00 +00:00
PyTorch MergeBot
b927a4e2ca Revert "Opportunistically use ncclCommSplit when creating new NCCL groups (#112889)"
This reverts commit 64a5372e6c.

Reverted https://github.com/pytorch/pytorch/pull/112889 on behalf of https://github.com/huydhn due to Sorry for reverting you change, but it is failing ROCm distributed jobs in trunk 4d07428ede ([comment](https://github.com/pytorch/pytorch/pull/112889#issuecomment-1823214376))
2023-11-22 17:43:51 +00:00
Chip Turner
64a5372e6c Opportunistically use ncclCommSplit when creating new NCCL groups (#112889)
Currently `ncclCommInitRankConfig` is always used when creating new
communicator groups.  This is wasteful as it creates non-shared pairs
of endpoint queues as well as costs time to re-establish
communication.

This change is transparent and opportunistic; when `dist.new_group` is
called, it will use the existing, healthy world process group to
select the right ranks to include in the process group.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112889
Approved by: https://github.com/kwen2501
2023-11-21 21:03:52 +00:00
Pavan Balaji
8f8722e3f1 [nccl-pg] Avoid using NCCL_ prefix for non-NCCL env variables (#114077)
NCCL_ prefix should only be used for NCCL library's environment variables.  We currently use a few environment variables in PyTorch with the NCCL_ prefix that are the NCCL library does not understand.

This patch renames such environment variables to use the TORCH_NCCL_ prefix instead.  We still maintain the old NCCL_ variables, but throw a warning when they are used.

The following env changes have been made:

`NCCL_BLOCKING_WAIT` -> `TORCH_NCCL_BLOCKING_WAIT`
`NCCL_ENABLE_TIMING` -> `TORCH_NCCL_ENABLE_TIMING`
`NCCL_DESYNC_DEBUG` -> `TORCH_NCCL_DESYNC_DEBUG`
`NCCL_ASYNC_ERROR_HANDLING` -> `TORCH_NCCL_ASYNC_ERROR_HANDLING`
`ENABLE_NCCL_HEALTH_CHECK` -> `TORCH_ENABLE_NCCL_HEALTH_CHECK`
`NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK` -> `TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK`

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114077
Approved by: https://github.com/fduwjj
2023-11-21 07:23:42 +00:00
cyy
226384b460 [2/N] Cleanup header inclusions in torch_cpu by iwyu (#109964)
Further cleaning up of torch_cpu header inclusions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109964
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2023-11-19 20:56:32 +00:00
Pavan Balaji
958f3b0df6 [nccl-pg] Migrate to getCvar* functions for env variable checking (#113797)
Summary:
The getCvar* functions allow us to provide multiple environment variables for the same value.  This allows us to deprecate some variables in favor of others, while still allowing users to temporarily use the old variables for some time.

Test Plan: OSS CI

Reviewed By: fduwjj, XilunWu

Differential Revision: D51225487

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113797
Approved by: https://github.com/fduwjj
2023-11-19 03:48:58 +00:00
Andrzej Kotlowski
0885c58296 Add Bfloat16 scalar support to gloo backend (#113557)
There was missing support for bfloat scalars. When I use gloo backend
`torch.distributed.init_process_group(backend='gloo')`
and run
`torch.nn.parallel.DistributedDataParallel(model)`
and _model_ has Bfloat16 features I receive following error:
`RuntimeError: Invalid scalar type`

This change fix this issue.
c10::BFloat16 defines conversions from/to float, so calculations are made on float for bfloat.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113557
Approved by: https://github.com/XilunWu, https://github.com/jgong5
2023-11-17 21:16:54 +00:00
fduwjj
015fd2eb41 [NCCL PG] Add dumping flight recorder in the NCCL watchdog timeout (#113678)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113678
Approved by: https://github.com/XilunWu
ghstack dependencies: #113503
2023-11-17 07:00:41 +00:00
Mu-Chu Lee
eddce3c054 [AOTInductor] Rename model_runner to model_container_runner (#111324)
Summary:
We rename the model_runner to model_container_runner to prepare for
adding tests of pure model without container.

Test Plan:
commit itself is a test.

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111324
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2023-11-16 19:14:22 +00:00
fduwjj
5fb1d8f18a [NCCL PG] Enable storing nccl traces into storage and make it configurable (#113503)
This PR is to enable the store of NCCL flight recorder to storage and make it configurable by letting users register their own way of storing the debug info. We will then provide users a script to offline parse and process the dumped blobs.

One thing, this PR is not trying to resolve is to decide where to dump the debug info. I will send a follow-up PR to address that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113503
Approved by: https://github.com/zdevito
2023-11-16 07:44:15 +00:00
Aleksei Nikiforov
b526aae95a test_lazy: skip HashTest.Scalar (#112747)
Fixes #99883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112747
Approved by: https://github.com/huydhn
2023-11-16 01:22:58 +00:00
fduwjj
f9114193bd [NCCL PG] ADD a separate monitoring thread to ensure we collect debug info and check watchdog heartbeat (#112518)
This PR has the following goals:
1. Detect unhealthy nccl watchdog thread by implementing a heartbeat. NCCL watchdog sometimes can hang for several reasons such as nccl/cuda API bugs or unexpected blocking behaviors. This is the last resort to ensure that we don't silently keep the training job run for hours.
2. Sometimes, the process gets stuck in the destroy of NCCL PG, and this PR will ensure that we will eventually abort it after some time (by default 2 mins)
3. Once heartbeat cannot be heard, we dump debug information (for now, we just use the flight recorder implemented in https://github.com/pytorch/pytorch/pull/110960/files) to disk. (How and where to dump the debug info will be addressed in the following PR).
4. Finally, we initiate std::abort via `LOG(FATAL)` to kill the process.

To clarify further what this PR is trying to solve, we first list are four cases when a NCCL PG can end up with:
- case 1: ncclwatchdog gets stuck (maybe some blocking API) and heartbeat monitor kills it during regular heartbeat monitor loop.
- case 2: ncclwatchdog timeout and desync report or destroy kicked in(let's call it shutdown) but this shutdown takes so long and heartbeat believes it has to kills the process anyway.
- case 3: ncclwatchdog aborts the process (heartbeat monitor not involved)
- case 4: program exits cleanly (heartbeat monitor not involved)

As we can see here, this PR is trying to address case one and two and we also want to ensure adding one more monitor thread does not interfere what we are currently doing in case three and four. That's why we added two flags `terminateHeartbeatMonitorThread_` and `collectiveDebugInfoMode_`.

For case three and four, either `monitorWakeUpCV_` will be waked up in the destructor or `terminateHeartbeatMonitorThread_` will be set to true. So that monitor thread will just exit ASAP.

For case one, both `terminateHeartbeatMonitorThread_` and `collectiveDebugInfoMode_` will still false when monitor thread see there are no heartbeat, so it will directly kill the process. For case two, either `terminateHeartbeatMonitorThread_` and `collectiveDebugInfoMode_` will be true, the monitor thread will wait extra time before killing the process.

Differential Revision: [D51146305](https://our.internmc.facebook.com/intern/diff/D51146305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112518
Approved by: https://github.com/kwen2501, https://github.com/wconstab
2023-11-10 04:41:14 +00:00
Nikita Shulga
88920b26be [Cmake] Check that gcc-9.4 or newer is used (#112858)
As this is the oldest gcc that is fully compatible with C++17 standard.
- Replace number of conditional version with simpler `if(CMAKE_COMPILER_IS_GNUCXX)` or `append_cxx_flag_if_supported`.
- As `-Wsuggest-override` condition was hidden before incorrect guard, add missing `override` keywords to `torch::autograd::PyFunctionTensorPostAccGradHooks::apply_with_saved` , `caffe2::python::TensorFeeder::Feed` and `cafee2::NetObserverReporterPrint::report```

Fixes https://github.com/pytorch/pytorch/issues/101839

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112858
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-11-06 17:19:53 +00:00
PyTorch MergeBot
679ca510b0 Revert "[Cmake] Check that gcc-9.4 or newer is used (#112858)"
This reverts commit ad894cd072.

Reverted https://github.com/pytorch/pytorch/pull/112858 on behalf of https://github.com/PaliC due to breaking internal tests (check diff for test page) ([comment](https://github.com/pytorch/pytorch/pull/112858#issuecomment-1795485009))
2023-11-06 16:56:09 +00:00
Nikita Shulga
ad894cd072 [Cmake] Check that gcc-9.4 or newer is used (#112858)
As this is the oldest gcc that is fully compatible with C++17 standard.
- Replace number of conditional version with simpler `if(CMAKE_COMPILER_IS_GNUCXX)` or `append_cxx_flag_if_supported`.
- As `-Wsuggest-override` condition was hidden before incorrect guard, add missing `override` keywords to `torch::autograd::PyFunctionTensorPostAccGradHooks::apply_with_saved` , `caffe2::python::TensorFeeder::Feed` and `cafee2::NetObserverReporterPrint::report```

Fixes https://github.com/pytorch/pytorch/issues/101839

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112858
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-11-04 05:40:08 +00:00
Nikita Shulga
8665a51baf Initialize logging facility when running ProcessGroupNCCLTest (#112809)
If code is compiled without `glog`, there are no way to control log levels other than explicitly calling `c10::initLogging()`

Test plan: Run `TORCH_CPP_LOG_LEVEL=0 ./bin/ProcessGroupNCCLTest` and observe extra log messages

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112809
Approved by: https://github.com/fduwjj
2023-11-03 02:26:13 +00:00
Pritam Damania
e66ec5843f [RESUBMIT] Cleanup error reporting for ProcessGroupNCCL (#112419)
Continuing some of the work from https://github.com/pytorch/pytorch/pull/108191, I realized majority of errors raised from ProcessGroupNCCL were just generic RuntimeError.

In this PR, I've added appropriate error types to all the exceptions raised from ProcessGroupNCCL.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112419
Approved by: https://github.com/fduwjj
2023-10-31 05:58:21 +00:00
Aaron Enye Shi
63c089b09d [c10] Move profiler clock to libc10 for timestamps (#111972)
Summary:
Move the profiler's Approximate Clock from libtorch to libc10. The main reason is to allow c10 features to get time.

The clock is using TSC when available for performance. CUDA Caching Allocator's implementation of memory snapshot will add the timestamps to memory events with this same clock in subsequent diff.

Test Plan: CI

Differential Revision: D50601935

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111972
Approved by: https://github.com/davidberard98
2023-10-27 16:18:40 +00:00
PyTorch MergeBot
abe172e268 Revert "Cleanup error reporting for ProcessGroupNCCL (#111979)"
This reverts commit b29c658265.

Reverted https://github.com/pytorch/pytorch/pull/111979 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing multigpu test in trunk b29c658265 ([comment](https://github.com/pytorch/pytorch/pull/111979#issuecomment-1781919184))
2023-10-26 21:29:40 +00:00
angelayi
b126adcdee [aotinductor] Pass TorchIR to AOTInductor (#110020)
Updates `_export.aot_compile` to pass a torch IR graph to inductor, allowing inductor to now run the pre_grad_passes, and reuse more of inductor's code.
Also updates the API to only return the `so_path`, and not returning the exported program. The pytree call spec is now serialized and placed inside of the generated model code. When calling the model, because there is no c++ pytree implementation linked yet, we can access the call specs through `get_call_spec()`, and call pytree flatten/unflattenin python.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110020
Approved by: https://github.com/desertfire
2023-10-26 15:54:31 +00:00
Jeff Daily
28c0b07d19 [ROCm] remove HCC references (#111975)
- rename `__HIP_PLATFORM_HCC__` to `__HIP_PLATFORM_AMD__`
- rename `HIP_HCC_FLAGS` to `HIP_CLANG_FLAGS`
- rename `PYTORCH_HIP_HCC_LIBRARIES` to `PYTORCH_HIP_LIBRARIES`
- workaround in tools/amd_build/build_amd.py until submodules are updated

These symbols have had a long deprecation cycle and will finally be removed in ROCm 6.0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111975
Approved by: https://github.com/ezyang, https://github.com/hongxiayang
2023-10-26 02:39:10 +00:00
Pritam Damania
b29c658265 Cleanup error reporting for ProcessGroupNCCL (#111979)
Continuing some of the work from https://github.com/pytorch/pytorch/pull/108191, I realized majority of errors raised from ProcessGroupNCCL were just generic RuntimeError.

In this PR, I've added appropriate error types to all the exceptions raised from ProcessGroupNCCL.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111979
Approved by: https://github.com/fduwjj
2023-10-26 01:39:54 +00:00
cyy
f9cc7f6a1c Enable Wno-unused-private-field,Wunused-lambda-capture and fix CUDA warnings (#110856)
This PR enables Wno-unused-private-field,Wunused-lambda-capture  and some CUDA warnings were fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110856
Approved by: https://github.com/albanD, https://github.com/malfet
2023-10-25 03:39:05 +00:00
zdevito
babb6c6ac4 nccl flight recorder (#110960)
Keep a buffer of the last 16384 nccl work actions, including the stack
trace that launched the event.

When torch._C._distributed_c10d._dump_nccl_trace(), it an dump these to
a pickled archive.

For each action we get:
process_group_id, seq_id, collective_name, size_of_first_tensor, stack trace

state - issued, started, completed (based on cuda events and queried if
necessary when the dump is requested)

I tested that it is possible to query event state when the streams are
otherwise stuck.

Differential Revision: [D50138956](https://our.internmc.facebook.com/intern/diff/D50138956)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110960
Approved by: https://github.com/wconstab
2023-10-24 07:12:21 +00:00