Commit Graph

2136 Commits

Author SHA1 Message Date
Bin Bao
4bf1cd6961 [aotinductor] Rename aot_runtime to aoti_runtime (#110007)
Summary: Make the naming more explicit

Differential Revision: D49593528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110007
Approved by: https://github.com/houseroad
2023-09-26 00:46:54 +00:00
cyy
265acd4bea Clean up CMake target linking (#109959)
This PR cleans up more CMake target linking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109959
Approved by: https://github.com/ezyang
2023-09-25 01:37:14 +00:00
cyy
dee100945e [2/N] Move c10::variant to std::variant (#109723)
This PR moves most of c10::variant calls to std::variant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109723
Approved by: https://github.com/ezyang
2023-09-24 02:47:43 +00:00
Bin Bao
8856c1628e [inductor] Change AOTInductor to return output tensors (#109790)
Summary:
Change AOTInductor to directly return output tensors instead of taking pre-allocated output tensors to return the results. This gives several benefits:

* It makes sure AOTInductor has the same behavior when managing the output tensors as the default Inductor, which is widely tested and thus more reliable.
* As we have debugged before, there are cases we still have to codegen extra copy_ ops to fill the pre-allocated output tensors which doesn't make sense for performance.
* With the coming enhanced memory planning, this again will make sure the memory planning logic is the between AOTInductor and Inductor, which will greatly simplify the problem and improve the reliability.

This change also combines D49494954 from Yang and https://github.com/pytorch/pytorch/pull/109560 from Angela.

Differential Revision: D49502318

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109790
Approved by: https://github.com/chenyang78
2023-09-22 02:31:52 +00:00
cyy
e9e93c5350 [Reland] Move torch::make_unique to std::make_unique (#109780)
We can first try to move torch::make_unique to std::make_unique despite reverting of #108866 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109780
Approved by: https://github.com/ezyang
2023-09-21 18:30:21 +00:00
Bin Bao
9c2715bbb2 [inductor] Clean up AOTInductor runtime ABI (#109678)
Summary: Change the AOTInductor runtime interface to avoid referring to aten data structures directly, mostly at::Tensor and ProxyExecutor. This a combination of https://github.com/pytorch/pytorch/pull/109436,  https://github.com/pytorch/pytorch/pull/109498, https://github.com/pytorch/pytorch/pull/109450, https://github.com/pytorch/pytorch/pull/109606, plus a few internal build changes.

Reviewed By: frank-wei

Differential Revision: D49374820

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109678
Approved by: https://github.com/frank-wei, https://github.com/chenyang78
2023-09-21 00:25:24 +00:00
PyTorch MergeBot
1cc052bcab Revert "[1/N] Add -Wdeprecated and related fixes (#108626)"
This reverts commit a53a677b4d.

Reverted https://github.com/pytorch/pytorch/pull/108626 on behalf of https://github.com/clee2000 due to I'm getting errors internally that look like the below on x86_64-apple-ios-simulator with clang 16 ([comment](https://github.com/pytorch/pytorch/pull/108626#issuecomment-1728102447))
2023-09-20 16:49:11 +00:00
cyy
a53a677b4d [1/N] Add -Wdeprecated and related fixes (#108626)
This PR adds -Wdeprecated to CMake warnings and fixes related issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108626
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2023-09-19 09:24:04 +00:00
PyTorch MergeBot
525e4f42d0 Revert "replace torch::make_unique with std::make_unique (#108866)"
This reverts commit 03e35efbf7.

Reverted https://github.com/pytorch/pytorch/pull/108866 on behalf of https://github.com/clee2000 due to Sorry but I found more usages of `torch::make_unique` internally, I can go change all of these, but I'd prefer if that gets done before this gets merged ([comment](https://github.com/pytorch/pytorch/pull/108866#issuecomment-1722577925))
2023-09-17 21:57:30 +00:00
cyy
4c208c1475 Remove unneeded linking in CMake targets (#109192)
This PR removes unused library dependencies, help refactoring in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109192
Approved by: https://github.com/ezyang
2023-09-15 19:43:25 +00:00
cyy
03e35efbf7 replace torch::make_unique with std::make_unique (#108866)
It should be safe to remove the old torch::make_unique functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108866
Approved by: https://github.com/albanD
2023-09-14 20:52:26 +00:00
Yang Chen
9cd4548f01 AOTInductor dynamic shape (#109012)
Summary: This PR adds dynamic-shape support for AOTInductor

* On the runtime/interface side, we added two structs, StaticDimInfo
and DynamicDimInfo, to hold values for static and dynamic dimensions,
respectively. Dynamic dimensions are tracked by an unordered map field
defined in AOTInductorModelBase. At inference time, the inference run
method will assign the current real dimensional value to each dynamic
dimension before executing any kernel.

* On the CUDA wrapper codegen side, we generate dynamic symbols
appropriately for shape computations. We simulate kernel launch grids
in the C++ land by re-using the grid functions from the Python world.
The returned grid configs, which may contain symbolic expressions,
are printed out in their C++ forms via the CppPrinter. Note that
when dynamic shapes are involved, we have to compute grid configs
for each kernel at runtime in the same way as we do for launching
the corresponding Triton kernel. Otherwise, we may end up with
memory-access failures or mis-computations caused by invalid indices
for fetching or storing data in device memory.

Differential Revision: D49100472

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109012
Approved by: https://github.com/khabinov, https://github.com/desertfire, https://github.com/hl475
2023-09-14 08:00:30 +00:00
FFFrog
003c5bb156 Add checks to num_layers for RNN, LSTM, GRU (#108853)
Fixes #108223

As the title shown

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108853
Approved by: https://github.com/mikaylagawarecki
2023-09-09 19:33:52 +00:00
shibo19
a5e1d38025 add check for torch_arg (#108397)
Fixes https://github.com/pytorch/pytorch/issues/108219
add check for torch_arg marco, as for inchannel/outchannel/groups, it should be greater than 0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108397
Approved by: https://github.com/mikaylagawarecki
2023-09-08 23:18:27 +00:00
FFFrog
f30f9fec87 Fix the issue described by #106769 (#108340)
Fixes #106769

Align the behavior of the C++ interface with the Python interface

1. Remove some checks in C++ frontend api ,which duplicate with below
50fa5880e8/aten/src/ATen/native/RNN.cpp (L676-L690)
3. Add some checks
4. support 1D
5. Add Test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108340
Approved by: https://github.com/mikaylagawarecki
2023-09-08 22:22:09 +00:00
Mu-Chu Lee
30a33b76b9 [AOTInductor] Include constants in AOTInductor .so file. (#108473)
Summary:
Include constants in AOTInductor .so file.
Added some difference:
1) serialize with ctypes instead of the native of torch.storage
2) Use the underlying for_blob instead of from_blob to construct Tensor.

Test Plan:
Unit tests:
```
test/inductor/test_aot_inductor.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108473
Approved by: https://github.com/angelayi
2023-09-08 03:49:53 +00:00
cyy
e4f3e5434f [Reland] Elimates c10::guts::to_string (#108748)
Reland of PR #108480, after relanding another blocking PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108748
Approved by: https://github.com/huydhn
2023-09-07 13:35:17 +00:00
Bin Bao
60bd30ee0b [inductor] Move AOTInductor runtime headers (#108564)
Summary: Move AOTInductor runtime header files into its own subdirectory, to separate them from to-be-added libtorch C interface.

Reviewed By: frank-wei

Differential Revision: D48905038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108564
Approved by: https://github.com/frank-wei
2023-09-06 11:50:41 +00:00
PyTorch MergeBot
8da04e023e Revert "Eliminate c10::guts::to_string (#108480)"
This reverts commit 4146be192e.

Reverted https://github.com/pytorch/pytorch/pull/108480 on behalf of https://github.com/huydhn due to Sorry for reverting this, but this is needed to keep trunk green after https://github.com/pytorch/pytorch/pull/108479 was reverted.  Both will need to be relanded ([comment](https://github.com/pytorch/pytorch/pull/108480#issuecomment-1707067595))
2023-09-05 18:04:53 +00:00
shibo19
03aac0bff6 add input check at the beginning for C++ API interpolate (#108506)
Fixes https://github.com/pytorch/pytorch/issues/108346
add the input check to the beginning for  C++ API `interpolate`, raise an error when got an invalid input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108506
Approved by: https://github.com/ezyang
2023-09-05 17:56:17 +00:00
cyy
4146be192e Eliminate c10::guts::to_string (#108480)
This PR replace c10::guts::to_string with std::to_string. The major part of changes is using void* as optimizer state key since string is used only for serialization and using pointers as hashing keys is more efficient than a string.
Some other guts functions in the affected source files are also replaced.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108480
Approved by: https://github.com/Skylion007
2023-09-04 08:12:53 +00:00
Sherlock Huang
b9dfdc091b [AOTInductor][Reland] Proxy Executor for Extern Fallback kernels (#107279) (#108350)
Summary:

This is a prototype for running extern fallback kernels with a host side proxy executor.

Sample of generated cpp wrapper call:
```
        at::Tensor buf0;  // output buffer
        void* tensor_args_var_0[] = {&arg0_1, &arg0_1, &arg1_1, &arg0_1, &arg1_1, &buf0};
        int64_t int_args_var_1[] = {81, 81, 7, 7, 7, 81};
        proxy_executor->call_function("buf0", int_args_var_1, tensor_args_var_0);
```

- In my current implementation, proxy executor interprets the raw pointers according to the ops schema.
This assumes that custom op MUST have a valid schema registered to Dispatcher. (I would like to validate this assumption)
- I am using callboxed() API of the custom kernels. This is inevitable, as we wish to have a single call_function API for all possible custom kernels.

- These are all the input argument types I have support so far.
       union Argument {
         # Bool value does not matter
         1: bool asNone;
         2: TensorArgument asTensor;
         3: list<TensorArgument> asTensors;
         5: i64 asInt;
         7: list<i64> asInts;
         8: double asFloat;
         9: list<double> asFloats;
         10: string asString;
         10.5: list<string> asStrings;
         11: SymIntArgument asSymInt;
         12: list<SymIntArgument> asSymInts;
         13: ScalarType asScalarType;
         14: MemoryFormat asMemoryFormat;
         15: Layout asLayout;
         16: Device asDevice;
         17: bool asBool;
         18: list<bool> asBools;
       }

- Need a policy for handling unpopulated argument with default values. Here are the options, and it has BC  implications.
1. requires exported fx graph to explicitly populate default values, if users doesn't specify.
2. requires cpp wrapper to explicitly populate default values, if fx graph doesn't specify.
3. Proxy executor look up from opSchema for default values.

For fixing T162112344

Test Plan:
frontend:
buck2 run mode/dev-sand mode/inplace -c fbcode.enable_gpu_sections=True sigmoid/frontend:export_main

test:
 buck2 run mode/dev-sand //deeplearning/aot_inductor/test:test_custom_ops

backend:
buck2 run mode/dev-nosan //deeplearning/aot_inductor/fb:main

buck2 test 'fbcode//mode/opt' fbcode//caffe2/torch/fb/model_transform/experimental/benchmark/test:test_aot_inductor_benchmark -- --exact 'caffe2/torch/fb/model_transform/experimental/benchmark/test:test_aot_inductor_benchmark - test_aot_inductor_benchmark_cmf30x (caffe2.torch.fb.model_transform.experimental.benchmark.test.test_aot_inductor_benchmark.AOTInductorBenchmark)'

Reviewed By: suo

Differential Revision: D48747417

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108350
Approved by: https://github.com/izaitsevfb
2023-09-02 17:14:10 +00:00
Bin Bao
06d74e6b24 Revert "[AOTInductor] Include constants in AOTInductor .so file. (#10… (#108349)
This reverts commit c3239442a3 due to internal test failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108349
Approved by: https://github.com/aakhundov, https://github.com/zhxchen17
2023-08-31 16:26:02 +00:00
Pritam Damania
704b0b3c67 [RESUBMIT] Standardize on error types for distributed errors. (#108191)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've ensured added these error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191
Approved by: https://github.com/H-Huang
2023-08-30 21:47:39 +00:00
Emmanuel Menage
fe1f26af8a Add support for PickleOpCode::APPEND in torch unpickler (#104027)
Reviewed By: qiminglu

Differential Revision: D46760650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104027
Approved by: https://github.com/ezyang
2023-08-30 14:24:50 +00:00
Mu-Chu Lee
c3239442a3 [AOTInductor] Include constants in AOTInductor .so file. (#107718)
Summary:
Include the constants into AOTInductor .so file.
We do not modify existing API signatures but create necessary format with weight lifted out instead.

Test Plan:
test/inductor/test_aot_inductor.py

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107718
Approved by: https://github.com/angelayi, https://github.com/eellison
2023-08-29 22:37:30 +00:00
FFFrog
78810d78e8 Fix the coredump described by #106702 (#108002)
Fixes #106702 and add some tests

As shown by [maxUnpool1d](https://pytorch.org/docs/master/generated/torch.nn.MaxUnpool1d)(`MaxUnpool2d`, `MaxUnpool3d` also), `Input` and `Output` support `(N,C,*)` or `(C,*)`, but the c++ api currently supports the former, and the latter will cause a coredump.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108002
Approved by: https://github.com/albanD
2023-08-29 17:14:16 +00:00
Rodrigo Kumpera
fe2cda64dc [C10D] Implement new libuv backend for TCPStore. (#108066)
The new backend is currently under a flag 'use_libuv' in TCPStore constructor
to reduce the impact on existing users as we test it.

This is a reland of #105870 with a fix for a bad test.

Differential Revision: [D48742554](https://our.internmc.facebook.com/intern/diff/D48742554)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108066
Approved by: https://github.com/H-Huang, https://github.com/fduwjj
2023-08-29 14:55:14 +00:00
PyTorch MergeBot
d4ff06ec84 Revert "Standardize on error types for distributed errors. (#107651)"
This reverts commit 0e2317479b.

Reverted https://github.com/pytorch/pytorch/pull/107651 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing inductor test in trunk for one of its model moco ([comment](https://github.com/pytorch/pytorch/pull/107651#issuecomment-1696578138))
2023-08-28 23:58:33 +00:00
Pritam Damania
0e2317479b Standardize on error types for distributed errors. (#107651)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've ensured added these error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651
Approved by: https://github.com/H-Huang
2023-08-28 21:58:15 +00:00
PyTorch MergeBot
d3f92ca9e9 Revert "[C10D] Implement new libuv backend for TCPStore. (#105870)"
This reverts commit 3c841163ce.

Reverted https://github.com/pytorch/pytorch/pull/105870 on behalf of https://github.com/huydhn due to I think the distributed failure is related as this is now failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/105870#issuecomment-1683117192))
2023-08-17 23:41:00 +00:00
Rodrigo Kumpera
3c841163ce [C10D] Implement new libuv backend for TCPStore. (#105870)
The new backend is currently under a flag 'use_libuv' in TCPStore constructor
to reduce the impact on existing users as we test it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105870
Approved by: https://github.com/H-Huang
2023-08-17 20:40:32 +00:00
soulitzer
d9dc4b2b4c [BE] Add missing override to remove build warning spam (#107191)
```
In file included from /local/pytorch3/test/cpp/api/optim.cpp:7:
local/pytorch3/test/cpp/api/support.h:44:3: warning: '~WarningCapture' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  ~WarningCapture() {
  ^
local/pytorch3/c10/util/Exception.h:167:11: note: overridden virtual function is here
  virtual ~WarningHandler() = default;
  ```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107191
Approved by: https://github.com/janeyx99
2023-08-15 17:32:34 +00:00
hongxyan
df8493455e [ROCm] enable test_api (test_libtorch) cpp unit tests (#106712)
This is part of effort to enable missed cpp tests for ROCm platform.
In this change,
- enabled test_libtorch cpp tests (more than 3107 tests)
- fixed missing dependency: libcaffe2_nvrtc.so required by FunctionalTest.Conv1d
- test_api binary is changed to exclude failed tests InitTest and IntegrationTest - to revisit later

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106712
Approved by: https://github.com/jithunnair-amd, https://github.com/kit1980
2023-08-14 20:09:34 +00:00
angelayi
5b13c779d4 [AOTInductor] Remove call to aot_autograd when receiving ExportedProgram (#105977)
https://github.com/pytorch/pytorch/issues/105555

Existing flow first exports and then calls torch._inductor.aot_compile. However, export calls aot_autograd with the core aten decomposition table, and then torch._inductor.aot_compile calls aot_autograd again with the inductor decomposition table. The 2nd calling of aot_autograd is supposedly causing some problems, and seems excessive, so instead we will create a new function, torch._export.aot_compiler which will export using the inductor decomposition table, pass it to inductor's compile_fx_aot, and because it has already been exported, avoid recalling aot_autograd.

```
def aot_compile(
    f: Callable,
    args: Tuple[Any],
    kwargs: Optional[Dict[str, Any]] = None,
    constraints: Optional[List[Constraint]] = None,
) -> Tuple[str, ExportedProgram]:
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105977
Approved by: https://github.com/desertfire, https://github.com/zhxchen17, https://github.com/eellison
2023-08-04 15:35:23 +00:00
Edward Z. Yang
7b9d250f06 Change _dynamo.export to be export(f)(*args, **kwargs) (#106109)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106109
Approved by: https://github.com/voznesenskym
2023-07-27 21:41:13 +00:00
Rodrigo Kumpera
174b0c22cb [C10D] Remove watchKey functionality from the Store. (#105014)
The feature was never fully finished and never got any adoption but
TCPStore pays the cost of twice the number of tcp connections anyway.

While the cost of all those idle connections is minimal is doesn't come for free:

- It increases the likelyhood of a connection refused failure during the initialization stampede.
- TCPStore uses poll for checking for socket availability which scales linearly on the number of sockets regardless of their status.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105014
Approved by: https://github.com/fduwjj
2023-07-21 21:18:55 +00:00
Richard Zou
ed6de45563 Fix Tensor::register_hook behavior on undefined tensors (#105587)
When the hook registered by Tensor::register_hook (in C++) gets passed
an undefined tensor, it raises an internal assert in debug mode.
The cause is that we attempt to construct an OptionalTensorRef
(4448c78a5d/aten/src/ATen/core/Tensor.h (L68))
which asserts that the passed-in TensorBase is defined.

The fix is that we create a new TensorRef class to convert the
TensorBase into a Tensor without bumping the refcount (which is what
OptionalTensorRef does). We cannot reuse OptionalTensorRef because
OptionalTensorRef represents `optional<Tensor>` that cannot hold an
Undefined Tensor.

For some more historical context, it looks like this behavior was introduced
in #63612

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105587
Approved by: https://github.com/soulitzer
2023-07-21 14:37:21 +00:00
Justin Chu
73e1455327 [BE] Enable ruff's UP rules and autoformat test/ (#105434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105434
Approved by: https://github.com/albanD
2023-07-19 20:36:06 +00:00
Bin Bao
b10de43c0a Add aot_inductor as a test backend for benchmarking (#105221)
Summary:
Original PR at https://github.com/pytorch/pytorch/pull/104977. Landing from fbcode instead.

Add an aot_inductor backend (Export+AOTInductor) in the benchmarking harness. Note it is not a dynamo backend.

Moved files from torch/_inductor/aot_inductor_include to torch/csrc/inductor as a more standard way for exposing headers
Created a caching function in benchmarks/dynamo/common.py for compiling, loading and caching the .so file, as a proxy for a pure C++ deployment, but easier for benchmarking.

Differential Revision: D47452591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105221
Approved by: https://github.com/jansel
2023-07-18 13:16:36 +00:00
Howard Huang
9165d46b89 DDP + C10D sparse all_reduce changes (#103916) (#104256)
Summary:

reland of https://github.com/pytorch/pytorch/pull/103916

## Changes

prototyping sparse allreduce using the sparse dispatch key. When passing in sparse tensors into `dist.allreduce()` we can execute our dispatched function.

prior to this change, passing a sparse tensor into `allreduce()` will error out with `Tensor must be dense...`

## Example script

```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 this_script.py

import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    a = torch.tensor([[0, 2.], [3, 0]]).to(rank)
    a = a.to_sparse()
    print(f"rank {rank} - a: {a}")
    dist.all_reduce(a)

if __name__ == "__main__":
    main()
```

output:
```
rank 1 - a: tensor(indices=tensor([[0, 1],
                       [1, 0]]),
       values=tensor([2., 3.]),
       device='cuda:1', size=(2, 2), nnz=2, layout=torch.sparse_coo)
allreduce_sparse_cuda_
tensor.is_sparse() = 1
in ProcessGroupNCCL::allreduceSparse
rank 0 - a: tensor(indices=tensor([[0, 1],
                       [1, 0]]),
       values=tensor([2., 3.]),
       device='cuda:0', size=(2, 2), nnz=2, layout=torch.sparse_coo)
allreduce_sparse_cuda_
tensor.is_sparse() = 1
in ProcessGroupNCCL::allreduceSparse
```

Test Plan:
Testing commands (OSS):

```
# python
pytest test/distributed/test_c10d_nccl.py -vsk test_sparse_allreduce_ops

# c++
build/bin/ProcessGroupNCCLTest --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce
```

Testing commands (internal, ondemand GPU):
ddp tests:
```
buck build mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/distributed:c10d --show-full-output

# Get the .par file from the previous command and use it below
TORCH_SHOW_CPP_STACKTRACE=1 /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_ddp_set_sparse_metadata
```

c10d tests:
```
# build tests and run with log output (python)
buck build mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/distributed:c10d --show-full-output
NCCL_DEBUG=WARN /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_sparse_allreduce_ops

# python
NCCL_DEBUG=WARN buck test mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/distributed:c10d -- --exact 'caffe2/test/distributed:c10d - test_sparse_allreduce_ops (test_c10d_nccl.ProcessGroupNCCLTest)'

# c++
NCCL_DEBUG=WARN buck run mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/cpp/c10d:ProcessGroupNCCLTest -- --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce
```

Differential Revision: D47056695

Pulled By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104256
Approved by: https://github.com/rohan-varma
2023-06-28 00:37:52 +00:00
Yang Chen
d2281e38ae Adds the initial support for AOTInductor model and interface (#104202)
This PR combines the C++ code for the AOTInductor's model and interface with Bin Bao's changes to AOTInductor codegen.

It adds a number of AOTInductor C interfaces that can be used by an inference runtime. Under the hood of the interfaces, the model code generated by the AOTInductor's codegen is wrapped into a class, AOTInductorModel, which manages tensors and run the model inference.

On top of AOTInductorModel, we provide one more abstract layer, AOTInductorModelContainer, which allows the user to have multiple inference runs concurrently for the same model.

This PR also adjusts the compilation options for AOT codegen, particularly some fbcode-related changes such as libs to be linked and header-file search paths.

Note that this is the very first version of the AOTInductor model and interface, so many features (e.g. dynamic shape) are incomplete. We will support those missing features in in future PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104202
Approved by: https://github.com/desertfire
2023-06-27 00:37:26 +00:00
PyTorch MergeBot
436d035dc7 Revert "DDP + C10D sparse all_reduce changes (#103916)"
This reverts commit fed5fba6e4.

Reverted https://github.com/pytorch/pytorch/pull/103916 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/103916#issuecomment-1608412325))
2023-06-26 22:37:58 +00:00
Howard Huang
fed5fba6e4 DDP + C10D sparse all_reduce changes (#103916)
Summary:
## Changes

prototyping sparse allreduce using the sparse dispatch key. When passing in sparse tensors into `dist.allreduce()` we can execute our dispatched function.

prior to this change, passing a sparse tensor into `allreduce()` will error out with `Tensor must be dense...`

## Example script

```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 this_script.py

import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    a = torch.tensor([[0, 2.], [3, 0]]).to(rank)
    a = a.to_sparse()
    print(f"rank {rank} - a: {a}")
    dist.all_reduce(a)

if __name__ == "__main__":
    main()
```

output:
```
rank 1 - a: tensor(indices=tensor([[0, 1],
                       [1, 0]]),
       values=tensor([2., 3.]),
       device='cuda:1', size=(2, 2), nnz=2, layout=torch.sparse_coo)
allreduce_sparse_cuda_
tensor.is_sparse() = 1
in ProcessGroupNCCL::allreduceSparse
rank 0 - a: tensor(indices=tensor([[0, 1],
                       [1, 0]]),
       values=tensor([2., 3.]),
       device='cuda:0', size=(2, 2), nnz=2, layout=torch.sparse_coo)
allreduce_sparse_cuda_
tensor.is_sparse() = 1
in ProcessGroupNCCL::allreduceSparse
```

Test Plan:
Testing commands (OSS):

```
# python
pytest test/distributed/test_c10d_nccl.py -vsk test_sparse_allreduce_ops

# c++
build/bin/ProcessGroupNCCLTest --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce
```

Testing commands (internal, ondemand GPU):
ddp tests:
```
buck build mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d --show-full-output

# Get the .par file from the previous command and use it below
TORCH_SHOW_CPP_STACKTRACE=1 /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_ddp_set_sparse_metadata
```

c10d tests:
```
# build tests and run with log output (python)
buck build mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d --show-full-output
NCCL_DEBUG=WARN /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_sparse_allreduce_ops

# python
NCCL_DEBUG=WARN buck test mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d -- --exact 'caffe2/test/distributed:c10d - test_sparse_allreduce_ops (test_c10d_nccl.ProcessGroupNCCLTest)'

# c++
NCCL_DEBUG=WARN buck run mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/cpp/c10d:ProcessGroupNCCLTest -- --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce
```

Differential Revision: D46724856

Pulled By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103916
Approved by: https://github.com/rohan-varma
2023-06-26 20:42:17 +00:00
cyy
483f748dd5 [BE] Enforce missing override keyword (#104032)
This PR enables `-Winconsistent-missing-destructor-override` and `-Winconsistent-missing-override`
and fixes violations.

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 47e904e</samp>

This pull request updates the code of various classes and operators in the `caffe2` and `aten` subdirectories to use the `override` specifier instead of the `virtual` keyword for destructors and other virtual functions that override a base class function. This improves the code readability, quality, and consistency with C++ best practices. It also modifies the `./CMakeLists.txt` file to enable warnings for these specifiers, but disable errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104032
Approved by: https://github.com/malfet
2023-06-24 02:34:24 +00:00
Edward Z. Yang
bc6ec97e02 Switch dynamic_shapes to True by default (#103597)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103597
Approved by: https://github.com/voznesenskym
2023-06-15 15:16:20 +00:00
PaDarochek
b00d388ada Update test_misc.cpp (#97768)
Potential null dereference after dynamic cast was found during static analysis.

**Description:**
Dereference of `ctx` is performed in `TORCH_CHECK` on line 1176, while `ctx` pointer may equal `nullptr`.
Previous `TORCH_CHECK` on line 1175 checks the value of `ctx_ptr` pointer that may be of type that cannot be casted to `TestContext*`. In such case, `dynamic_cast` returns `nullptr` despite `ctx_ptr` is not equal to `nullptr`.

**Fix:**

- Check `ctx` instead of `ctx_ptr` for equality to zero.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97768
Approved by: https://github.com/kit1980
2023-06-13 16:14:11 +00:00
Daniil Kutz
e6fc7d814d Segmentation fault in flatbuffers when parsing malformed modules (#95221)
Fixes #95061, #95062

Add Flatbuffer verification before parsing to avoid crashing on malformed modules. Flatbuffers doesn't perform boundary checks at runtime for the sake of performance, so when parsing untrusted modules it is highly recommended to verify overall buffer integrity.

This bug can be triggered both by C++ (`torch::jit::load`, `torch::jitload_jit_module_from_file`) and Python  API (`torch.jit.load`, `torch.jit.jit_module_from_flatbuffer`).

Crash files to reproduce:
[crash-1feb368861083e3d242e5c3fcb1090869f4819c4.txt](https://github.com/pytorch/pytorch/files/10795267/crash-1feb368861083e3d242e5c3fcb1090869f4819c4.txt)
[crash-7e8ffd314223be96b43ca246d3d3481702869455.txt](https://github.com/pytorch/pytorch/files/10795268/crash-7e8ffd314223be96b43ca246d3d3481702869455.txt)
[crash-ad4d7c6183af8f34fe1cb5c8133315c6389c409f.txt](https://github.com/pytorch/pytorch/files/10795279/crash-ad4d7c6183af8f34fe1cb5c8133315c6389c409f.txt)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95221
Approved by: https://github.com/qihqi, https://github.com/davidberard98
2023-05-24 21:16:19 +00:00
Michael Voznesensky
39f52c0218 Switch AOT Inductor test to export, add dynamic, fix invocation bug (#101585)
Fixes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101585
Approved by: https://github.com/ngimel, https://github.com/desertfire
2023-05-17 05:52:08 +00:00
Howard Huang
a206e8b027 [small BE] update NcclTest dim size (#101127)
Previously input dimensions are fixed to 3x3, this is a small change to make that configurable. Will be used in future additions to nccl tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101127
Approved by: https://github.com/rohan-varma
2023-05-15 23:05:10 +00:00