Commit Graph

64995 Commits

Author SHA1 Message Date
Bin Bao
3058700f7f [aotinductor] Add AOTIModelRunner as a utility class (#110891)
Summary: Introduce a utility class AOTIModelRunner to take care of running an AOTInductor-compiled model. It handles things like dlopen-ing a model, initializing the model container, setting up inputs and outputs, and destroying the model container.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110891
Approved by: https://github.com/chenyang78
ghstack dependencies: #110652
2023-10-11 15:58:28 +00:00
Bin Bao
b17c247eb1 [aotinductor] Update the cpp test example (#110652)
Summary: Store inputs and outputs in Python, and load them back to run the compiled model in C++ and compare the outputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110652
Approved by: https://github.com/chenyang78
2023-10-11 15:58:28 +00:00
ydwu4
3062e267b1 [cond] Add more tests for valid inputs of cond (#110727)
This PR adds a parametrized test for cond. It tests that cond can be traced with valid inputs. Specifically, valid inputs are a combination of (a sketch of one such call follows the lists below):
- pred (Python boolean, boolean tensor, int tensor, scalar tensor)
- true_fn/false_fn (func, obj, nn_module)
- operands (0 or more tensor inputs), tested with 0 and 2
- closures (0 or more tensor closures), tested with 0 and 2
- nested_level (no nesting or level-2 nested cond)

What this test doesn't cover:
- pred: symbolic boolean expression as predicate
- true_fn/false_fn: functions that mutate intermediate tensors
- operands: non-tensor operands such as float, int
- closures: nn_module attribute closures, python constant closures
- nested_level: 3+
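
A minimal sketch of one valid combination being tested (variable names are assumed; in this revision the cond op may need to be imported from `functorch.experimental.control_flow` rather than used as `torch.cond`):

```python
import torch
from functorch.experimental.control_flow import cond

def true_fn(x, y):
    return x + y

def false_fn(x, y):
    return x - y

def f(pred, x, y):
    # pred: boolean tensor; operands: a tuple of tensors
    return cond(pred, true_fn, false_fn, (x, y))

compiled = torch.compile(f, backend="eager", fullgraph=True)
out = compiled(torch.tensor(True), torch.randn(3), torch.randn(3))
```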

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110727
Approved by: https://github.com/zou3519
2023-10-11 15:56:13 +00:00
Nikita Shulga
ef19824db8 Suppress warnings in tensorpipe.h (#111012)
To fix distributed compilation with clang-15

Fixes https://github.com/pytorch/pytorch/issues/110974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111012
Approved by: https://github.com/huydhn, https://github.com/drisspg, https://github.com/Skylion007
2023-10-11 15:41:30 +00:00
Nikita Shulga
f2d476843e [MPS][BE] Avoid redispatch in sign_out (#110955)
By calling `at::mps::sign_outf` rather than `at::sign_out`, which calls the dispatcher again.
Also, do not copy output unnecessarily.

### <samp>🤖 Generated by Copilot at f942e74</samp>

> _Metal tensors rise from the ashes_
> _`sign` and `sgn` unleash their flashes_
> _MPSFunctions reign supreme_
> _In the header of the metal dream_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110955
Approved by: https://github.com/kulinseth, https://github.com/albanD
2023-10-11 15:10:21 +00:00
Sam Larsen
fc1105b282 [inductor] Implement Fx graph caching to improve warm compilation time. (#103453)
Summary: Implement an on-disk cache to save and reuse compiled FX graphs. This implementation does not handle tensors with symbolic shapes; that will be done in a follow-up PR.
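
A hedged sketch of how the cache might be enabled (the config flag name is an assumption based on this PR, not a settled public interface):

```python
import torch
import torch._inductor.config as inductor_config

# Opt in to the on-disk FX graph cache (flag name assumed from this PR).
inductor_config.fx_graph_cache = True

@torch.compile
def fn(x):
    return (x.sin() + 1).cos()

fn(torch.randn(8))  # first run populates the cache; a warm run can reuse it
```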

Test Plan:
* New unit tests exercising saving and load from the cache.
* New unit tests to exercise the cache key calculations.
* Ran several benchmarks to see cache hit and resulting compilation times.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103453
Approved by: https://github.com/eellison, https://github.com/Chillee
2023-10-11 14:39:14 +00:00
rzou
2cf9782912 [generate_opcheck_tests] Add some reasonable defaults (#110977)
Summary:
Make it easier to add `generate_opcheck_tests` by adding defaults for
the failures_dict location, the additional decorators, and the test
utils.

Test Plan:
Existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110977
Approved by: https://github.com/williamwen42
ghstack dependencies: #110951
2023-10-11 14:28:05 +00:00
Bin Bao
4abfa22812 [aotinductor] Add a perf smoke test for AOTInductor (#110972)
Summary: To prevent perf regression like the one caused by #110510

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110972
Approved by: https://github.com/chenyang78
2023-10-11 13:30:05 +00:00
PyTorch MergeBot
98c329b19e Revert "[core ATen IR] Add decompositions for max, min, var_mean (#110906)"
This reverts commit 9606cda64e.

Reverted https://github.com/pytorch/pytorch/pull/110906 on behalf of https://github.com/SS-JIA due to Breaks internal CI ([comment](https://github.com/pytorch/pytorch/pull/110906#issuecomment-1757490740))
2023-10-11 11:41:21 +00:00
igm503
95ff51d8ed [MPS] Add support for Softshrink to MPS Backend (#110814)
Adds the softshrink activation function to the MPS backend.
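
A minimal usage sketch (assumes a build where MPS is available):

```python
import torch
import torch.nn.functional as F

if torch.backends.mps.is_available():
    x = torch.randn(8, device="mps")
    y = F.softshrink(x, lambd=0.5)  # now dispatches to the MPS backend
```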
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110814
Approved by: https://github.com/kulinseth
2023-10-11 07:55:39 +00:00
Rohan Varma
de370eb313 [Distributed] Small nits to apply_optimizer_in_backward (#110903)
Clarifies a few things in the documentation.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110903
Approved by: https://github.com/janeyx99
2023-10-11 07:45:45 +00:00
PyTorch MergeBot
0821868110 Revert "[export] Get export APIs ready for PTC (#110410)"
This reverts commit b96ea9f361.

Reverted https://github.com/pytorch/pytorch/pull/110410 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/110410#issuecomment-1757017249))
2023-10-11 07:31:51 +00:00
Huy Do
74029fae9d Fix broken period workflow after #110976 (#111013)
Fixes my typo from #110976.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111013
Approved by: https://github.com/kit1980, https://github.com/malfet
2023-10-11 06:40:18 +00:00
Ramin Azarmehr
056d6247c7 [MPS] Use Metal Events to synchronize buffers in MPSAllocator (Part 1) (#106938)
- This PR is the first part of a bigger change to use `MPSEvent` to synchronize shared buffers between CPU/GPU.
- Add APIs to record and wait for `MPSEvents` in `MPSAllocator`.
- Use a container list for Buffer Pools to simplify iterating over them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106938
Approved by: https://github.com/kulinseth
2023-10-11 06:13:05 +00:00
Angela Yi
b96ea9f361 [export] Get export APIs ready for PTC (#110410)
Summary:
https://docs.google.com/document/d/1QJJEGnj2nHGPODlw38BEG3KLLCOTfdOVjPrNQbz_LM8/edit#bookmark=id.lp80wfshq130
Changes:
* `torch.export` will return a functional ATen graph w/o decompositions
* `exported_program.run_decompositions(decomposition_table)` will optionally take a decomposition table and run decompositions on the exported program, returning a new exported program. By default we will run the Core ATen decomposition table (see the sketch after the code block below).

Calling convention for Executorch stays the same:
```
pre_autograd_graph = capture_pre_autograd_graph(f, args, ...)
aten_graph_no_decomps = torch.export.export(pre_autograd_graph, args, ...)
# Within to_edge we decompose to core aten and then convert to edge
edge_graph = exir.to_edge(aten_graph_no_decomps)
```
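
A hedged sketch of the new two-step API described above (the decomposition-table import is an assumption):

```python
import torch
from torch._decomp import core_aten_decompositions

class M(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.silu(x)

ep = torch.export.export(M(), (torch.randn(4),))        # functional ATen graph, no decompositions
ep = ep.run_decompositions(core_aten_decompositions())  # new ExportedProgram, decomposed to Core ATen
```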

Test Plan: CI

Differential Revision: D49742989

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110410
Approved by: https://github.com/ydwu4
2023-10-11 06:10:07 +00:00
Michael Voznesensky
1e7947b3e0 Revert "Reland 3rd try [finishing colesbury's PR 100642] Guard on nn.Module dicts and type (#109323)" + Forward fixes + test (#110964)
This reverts commit f786fbdebd.

Forward fixes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110964
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2023-10-11 05:16:47 +00:00
Nikita Shulga
e49ea87162 Fix socket.cpp compilation using gcc-9.4 (#111002)
Otherwise the following error is thrown when attempting to compile with WERROR enabled:
```
In file included from /home/nshulga/git/pytorch/pytorch/torch/csrc/distributed/c10d/socket.cpp:30:
/home/nshulga/git/pytorch/pytorch/third_party/fmt/include/fmt/chrono.h:340:24: warning: redundant redeclaration of ‘constexpr’ static data member ‘fmt::v10::detail::codecvt_result<CodeUnit>::max_size’ [-Wdeprecated]
  340 | constexpr const size_t codecvt_result<CodeUnit>::max_size;
      |                        ^~~~~~~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/third_party/fmt/include/fmt/chrono.h:335:33: note: previous declaration of ‘fmt::v10::detail::codecvt_result<CodeUnit>::max_size’
  335 |   static constexpr const size_t max_size = 32;
      |                                 ^~~~~~~~
```
or the following if using clang as the host compiler
```
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/distributed/c10d/socket.cpp:30:
/Users/nshulga/git/pytorch/pytorch/third_party/fmt/include/fmt/chrono.h:340:50: warning: out-of-line definition of constexpr static data member is redundant in C++17 and is deprecated [-Wdeprecated]
constexpr const size_t codecvt_result<CodeUnit>::max_size;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111002
Approved by: https://github.com/drisspg
2023-10-11 05:16:00 +00:00
wz337
a614281ea9 Add current_device() to torch.cpu (#110987)
To better support device-agnostic code, add a "cpu" return value for `current_device()` in torch.cpu so that we won't run into `AttributeError: module 'torch.cpu' has no attribute 'current_device'`.
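
A small device-agnostic sketch of what this enables (the module-selection pattern is illustrative, not from the PR):

```python
import torch

device_module = torch.cuda if torch.cuda.is_available() else torch.cpu
current = device_module.current_device()  # "cpu" on CPU-only builds, a device index on CUDA
print(current)
```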

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110987
Approved by: https://github.com/wanchaol
2023-10-11 05:13:10 +00:00
soulitzer
110382bacf Make NestedTensor compilable with eager backend (#109171)
In this PR:
- Adds support for strides for jagged tensors (design doc for this coming soon)
- NestedTensor skips automatic dynamic
- Makes use of @bdhirsh's subclass fakification logic by adding the __tensor_{un,}flatten__ functions.
- Additional logic for fakification: since the existing subclass fakification logic does not handle the case where the outer tensor has an additional dimension, we insert one-off logic to (1) insert an extra SingletonSymInt onto the fakified NestedTensor and (2) make sure we call track_symint on the sizes of both the inner and outer tensors during guard creation.

Remaining things that are weird:
- Still need to skip some logic in meta utils for some reason (I was going to write this up more, but decided not to since we're not able to do this anyway for an immediate reason: we cannot arbitrarily compare singleton ints). For now I'm just following Brian's advice from [here](https://github.com/pytorch/pytorch/pull/109171#discussion_r1328137070).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109171
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2023-10-11 04:47:10 +00:00
drisspg
e0dbaa04d2 Fix the meta func for mem_eff_backward (#110893)
Fixes #110832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110893
Approved by: https://github.com/eellison
2023-10-11 02:58:54 +00:00
andrewor14
0e551bbcd7 [quant][pt2] Preserve source_fn_stack after QAT fusion (#110899)
Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT.test_qat_preserve_source_fn_stack

Reviewers: jerryzh168, kimishpatel

Subscribers: jerryzh168, kimishpatel, supriyar

Differential Revision: [D50101253](https://our.internmc.facebook.com/intern/diff/D50101253)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110899
Approved by: https://github.com/jerryzh168
2023-10-11 02:55:52 +00:00
Tugsbayasgalan Manlaibaatar
5aee22e0e0 Move export.constrain_as_* to torch._constrain_as_* (#110757)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110757
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #109859
2023-10-11 02:37:55 +00:00
soulitzer
c9eb8d8d90 Add set_checkpoint_debug_enabled that overrides local setting (#110728)
People access activation checkpointing through many layers of config, and it is not always guaranteed that all the layers of wrapping around checkpoint properly propagate all the kwargs, e.g. debug mode. This context manager offers an alternative way to enable debug mode that bypasses the need for all layers to propagate kwargs.
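
A minimal sketch of the context-manager usage (the checkpointed function is assumed):

```python
import torch
from torch.utils.checkpoint import checkpoint, set_checkpoint_debug_enabled

def block(x):
    return torch.relu(x) * 2

x = torch.randn(4, requires_grad=True)
# Force debug mode on for any checkpoint call in this region, regardless of
# whether intermediate wrappers propagate the debug kwarg.
with set_checkpoint_debug_enabled(True):
    out = checkpoint(block, x, use_reentrant=False)
    out.sum().backward()
```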
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110728
Approved by: https://github.com/albanD
ghstack dependencies: #110673, #110674, #110675, #110676
2023-10-11 02:12:31 +00:00
Michael Voznesensky
02f6a8126e Support a simple subset of functions as backward hooks on intermediate tensors (#109537)
The main thrust of the initial effort here was to capture `register_hook` calls on tensors in compiled regions. The first part of this was done in https://github.com/pytorch/pytorch/pull/108903, wherein we added support for register_hook on input tensors.

The distinction between input and intermediary is due to implementation differences.

There are 2 kinds of hooks:

1) Hooks on objects with sources (inputs, params)
2) Hooks on objects w/o sources (intermediaries, and outputs).

Note: As outputs can be made simple by how dynamo handles residuals, they could actually be handled as if they were inputs, but, for the sake of this PR, we will refer to hooks as either hooks on inputs (sourced), or hooks on intermediaries (not sourced).

**The plan:**

For tensors w/ a source: (The PR above)
We record registered hooks, store them as a global, and associate them with the tensor in residuals. This means that when dynamo goes to create the frame, where we produce bytecode to stitch together our PT2-modified bytecode with the original eager code, we call register_hook. This registration of hooks in residuals is sound because (a) it happens right after a PT2 frame region ends and (b) we know that the tensor is alive in f_locals, f_globals, or a module in the user's invoking frame. This means we can soundly know it will be around to invoke register_hook on. As long as we guard on the identity of the lifted function, this is sound to do.

For tensors w/o a source: (This PR)

Ostensibly, the most correct and complete solution would be to smuggle hooks into a runtime wrapper in aot_autograd, where all the items the hooks close over are lifted to inputs as necessary and passed alongside the user provided function. This is necessary so that we can properly trace out and capture all the mutations within the user defined hook at backwards time.

This is too complicated, so we limited the scope of this initial PR to a simple subset of hooks (a sketch of the resulting user-facing pattern follows the flow below):

- Hooks must have a source (be known to us already, not a lambda or an intermediary-defined function)
- We must be tracing under compiled autograd

**The flow**:

We use the HOP added in https://github.com/pytorch/pytorch/pull/109690/files, referred to as the HOP below.

1) We intercept register_hook calls and wrap the user defined fn in the HOP
2) We write a `_register_hook_trampoline` to the graph that is a local no-arg function that is invoked as a call_function in the dynamo graph
3) aot_autograd inlines through it during its trace, and sees the HOP
4) the HOP preserves itself in the graph - it does not get traced into
5) During backwards, compiled_autograd installs the HOP under a hook call
6) When compiled_autograd enters compilation over its generated graph, dynamo traces the contents of the hook
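
A rough sketch of the resulting user-facing pattern (names, backends, and the compiled-autograd enabling mechanism are assumptions; the hook must be a sourced function):

```python
import torch
import torch._dynamo.compiled_autograd as compiled_autograd

def scale_grad(grad):              # sourced hook: a module-level function, not a lambda
    return grad * 2

@torch.compile(backend="aot_eager")
def fn(x):
    y = x.sin()                    # intermediate tensor: no source
    y.register_hook(scale_grad)    # intercepted and wrapped in the HOP described above
    return y.cos().sum()

def bwd_compiler(gm):              # compiled autograd compiles the generated backward graph
    return torch.compile(gm, backend="eager")

x = torch.randn(4, requires_grad=True)
with compiled_autograd.enable(bwd_compiler):
    fn(x).backward()
```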

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109537
Approved by: https://github.com/ezyang
2023-10-11 01:35:37 +00:00
Jon Chuang
79212430df feat(inductor): fx graph debug should display device (#110346)
Device mismatch issues are the root cause of https://github.com/pytorch/pytorch/issues/107006; hence, make device-related scheduling issues easier to diagnose.
Also, format single-kwarg graphs to be more concise.

Example rendering:
![image](https://github.com/pytorch/pytorch/assets/9093549/1b59a994-f2df-45c9-8cb7-37eb3ba12654)

CC code owners: @ngimel @jansel @shunting314 @mlazos @peterbell10

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110346
Approved by: https://github.com/eellison
2023-10-11 00:34:55 +00:00
Edward Z. Yang
24bf9aeb6b Fix arange with dynamic end argument. (#110979)
Fixes https://github.com/pytorch/pytorch/issues/93468

There are a few extra tests that are sort of unrelated, but I ended up writing them while working on the fix and decided to keep them. The big idea here is to split the `_check` so that `expect_true` works; I could have probably also improved the symbolic reasoning, but I'm lazy. One small logging fix too.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110979
Approved by: https://github.com/Skylion007
2023-10-11 00:32:34 +00:00
leslie-fang-intel
a11d4a8378 [Reland] [Inductor] Break the loop fusion when node2 depends on node1 mutations (#110677)
Reland PR https://github.com/pytorch/pytorch/pull/109172 which has been reverted in https://github.com/pytorch/pytorch/pull/110622

Differential Revision: [D50097373](https://our.internmc.facebook.com/intern/diff/D50097373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110677
Approved by: https://github.com/jgong5, https://github.com/ezyang
2023-10-11 00:26:45 +00:00
PyTorch MergeBot
314a502eb0 Revert "Reland "[C10] PG observability hooks. (#108815)" (#110907)"
This reverts commit 7678cd22af.

Reverted https://github.com/pytorch/pytorch/pull/110907 on behalf of https://github.com/huydhn due to Sorry for reverting this, but macos job in trunk starts failing after this 7678cd22af ([comment](https://github.com/pytorch/pytorch/pull/110907#issuecomment-1756497387))
2023-10-11 00:23:42 +00:00
Huy Do
2edc75a669 Add a workflow to release Android binaries (#110976)
This adds 2 jobs to build PyTorch Android with and without the lite interpreter:

* Keeps the list of currently supported ABIs: armeabi-v7a, arm64-v8a, x86, x86_64
* Passes all the tests on the emulator
* Ran the test app on an emulator and my Android phone (`arm64-v8a`) without any issue
![Screenshot_20231010-114453](https://github.com/pytorch/pytorch/assets/475357/57e12188-1675-44d2-a259-9f9577578590)
* Run on AWS https://us-west-2.console.aws.amazon.com/devicefarm/home#/mobile/projects/b531574a-fb82-40ae-b687-8f0b81341ae0/runs/5fce6818-628a-4099-9aab-23e91a212076
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110976
Approved by: https://github.com/atalman
2023-10-11 00:19:33 +00:00
Jon Chuang
5aa96fd336 [dynamo] list index: add more list types to testing, support namedtuple, improve error handling (#110919)
Follow up: #110817

Minor improvements as discussed in prev PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110919
Approved by: https://github.com/ezyang
2023-10-11 00:16:39 +00:00
SS-JIA
9606cda64e [core ATen IR] Add decompositions for max, min, var_mean (#110906)
## Context

Add decompositions for `aten.max`, `aten.min`, and `aten.var_mean`. These operators follow a pattern of returning a tuple of outputs from two component operators:

```
aten.max(x) -> return aten.amax(x), aten.argmax(x)
aten.min(x) -> return aten.amin(x), aten.argmin(x)
aten.var_mean(x) -> return aten.var(x), aten.mean(x)
```

For `var_mean`, the `refs` implementation was doing something similar, so I changed it to call `torch.` ops instead, as was done for other `refs` implementations previously. cc: @peterbell10 @lezcano

Note that Inductor lowers all these directly, so they are excluded from the Inductor decomp table.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110906
Approved by: https://github.com/manuelcandales
2023-10-11 00:06:24 +00:00
PyTorch MergeBot
3100d3e661 Revert "[inductor] Implement Fx graph caching to improve warm compilation time. (#103453)"
This reverts commit 8a8668e1ae.

Reverted https://github.com/pytorch/pytorch/pull/103453 on behalf of https://github.com/kit1980 due to The newly added test fails on internal builds ([comment](https://github.com/pytorch/pytorch/pull/103453#issuecomment-1756449919))
2023-10-10 23:21:59 +00:00
cyy
f98d6ad8b3 [1/N] Apply clang-tidy to aten/src/ATen/core/ (#110861)
It is time to clang-tidy ATen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110861
Approved by: https://github.com/Skylion007
2023-10-10 23:20:58 +00:00
Will Constable
ca03f36233 Change ProcessGroupNCCL default timeout to 10 min (#110947)
Avoid changing the default for other backends, as the CPU backend (Gloo) may need longer timeouts.

Motivated by trying to save cluster time when encountering collective hangs. Generally collectives should time out within seconds, so 10 minutes (like the previous 30) should still provide ample headroom for edge cases.
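
Jobs that legitimately need a longer NCCL timeout can still set one explicitly at init time (a sketch; rendezvous/env setup omitted):

```python
import datetime
import torch.distributed as dist

# Override the new 10-minute default for this process group.
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(minutes=30))
```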
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110947
Approved by: https://github.com/xw285cornell, https://github.com/fduwjj
2023-10-10 22:28:39 +00:00
Tugsbayasgalan Manlaibaatar
cd275dc24f Remove RangeConstraints in favor of ValueRanges (#109859)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109859
Approved by: https://github.com/avikchaudhuri
2023-10-10 22:22:05 +00:00
Jerry Zhang
7a69e3d30b [fx][subgraph_matcher] Add a matcher that supports name to node map (#110743)
Summary:
We want the matcher to return a name -> node map for the target graph so that we can refer to nodes by name; this is useful for downstream applications like quantization.

This also lets us use the torch API as the source of truth instead of matching the ATen API directly.

Test Plan:
python test/fx/test_matcher_utils.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110743
Approved by: https://github.com/SherlockNoMad
2023-10-10 22:21:24 +00:00
Ramil Nugmanov
91eeb77260 StackDataset batched sampling (#110694)
Optimizes minibatch loading by supporting batched sampling.
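
A small sketch of StackDataset usage that the batched-sampling path can speed up (dataset contents assumed):

```python
import torch
from torch.utils.data import DataLoader, StackDataset, TensorDataset

images = TensorDataset(torch.randn(1000, 3, 8, 8))
labels = TensorDataset(torch.randint(0, 10, (1000,)))

ds = StackDataset(image=images, label=labels)  # dict-style samples
loader = DataLoader(ds, batch_size=64)
batch = next(iter(loader))                     # {"image": ..., "label": ...}
```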

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110694
Approved by: https://github.com/ejguan
2023-10-10 22:05:51 +00:00
Joel Schlosser
ac01304e22 pin_memory support for NT (#110404)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110404
Approved by: https://github.com/cpuhrsch, https://github.com/albanD
ghstack dependencies: #110292
2023-10-10 21:58:19 +00:00
Joel Schlosser
43ea782af3 Multiprocessing support for NT (#110292)
Fixes #110161

Allows NTs to be used in DataLoaders with `num_workers > 1`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110292
Approved by: https://github.com/cpuhrsch, https://github.com/albanD
2023-10-10 21:58:19 +00:00
Aaron Bockover
7f2d25c547 [ONNX] bump onnx submodule to rel-1.15.0 (#110663)
- onnx==1.15.0rc1
- onnxscript==0.1.0.dev20231006
- ort-nightly==1.17.0.dev20231005001
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110663
Approved by: https://github.com/ezyang, https://github.com/thiagocrepaldi
2023-10-10 21:44:09 +00:00
rzou
3a29cdc5e6 [optests] Add dontGenerateOpCheckTests and is_inside_opcheck_mode (#110951)
This PR adds the following helper functions for generated opcheck tests:
- dontGenerateOpCheckTests is a decorator that skips generation of the
  opcheck tests for the generated function
- is_inside_opcheck_mode lets us query if we are in a generated test.
  Useful for fast debugging out-of-tree without needing to update
  PyTorch.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110951
Approved by: https://github.com/williamwen42
2023-10-10 21:43:43 +00:00
wz337
d9eb5a57aa [FSDP] Change _create_chunk_dtensor in fsdp/_shard_utils.py to use public API from DTensor (#110831)
This PR:
1) updates _create_chunk_dtensor() in _shard_utils.py to use public APIs from DTensor. This will avoid the global_size calculation error from using DTensor.from_local() for uneven-sharded parameters, as described in https://github.com/pytorch/pytorch/issues/110762
2) updates test/distributed/fsdp/test_fsdp_dtensor_state_dict.py to include unit test for a model with uneven sharding.

cc. @wanchaol, @fegin

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110831
Approved by: https://github.com/wanchaol, https://github.com/fegin
2023-10-10 21:04:27 +00:00
Jon Chuang
6e770c0dda [dynamo] Add itertools.repeat via polyfill (#110953)
Fixes https://github.com/pytorch/pytorch/issues/110286
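
A small sketch of the kind of code this lets dynamo trace (the function body is assumed; exact supported forms may vary):

```python
import itertools
import torch

@torch.compile(backend="eager")
def fn(x):
    out = x
    for v in itertools.repeat(2, 3):  # counted repeat, handled via a polyfill
        out = out * v
    return out

print(fn(torch.ones(2)))  # tensor([8., 8.])
```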

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110953
Approved by: https://github.com/ezyang
2023-10-10 20:40:33 +00:00
PyTorch MergeBot
02a02a23ee Revert "Move at::{Refcounted,}MapAllocator to c10 (#109881)"
This reverts commit 0341deb1c7.

Reverted https://github.com/pytorch/pytorch/pull/109881 on behalf of https://github.com/albanD due to It does break buck build ([comment](https://github.com/pytorch/pytorch/pull/109881#issuecomment-1756195823))
2023-10-10 20:39:12 +00:00
Khushi Agrawal
495f77be7a [cpu] explicitly vectorize digamma (#110217)
### Benchmarking results
```python
[-------------- torch.digamma(x) Benchmark -------------]
                                        |  implicitly vectorized |     explicitly  vectorized
1 threads: -----------------------------------------------------------------------
      dtype torch.float16 - n : 100     |        3.8      |      3.5
      dtype torch.float16 - n : 200     |        5.8      |      5.3
      dtype torch.float16 - n : 500     |       11.8      |     10.7
      dtype torch.float16 - n : 1000    |       22.0      |     19.6
      dtype torch.float16 - n : 10000   |      203.6      |    179.7
      dtype torch.float32 - n : 100     |        3.8      |      3.6
      dtype torch.float32 - n : 200     |        5.7      |      5.5
      dtype torch.float32 - n : 500     |       11.1      |     11.1
      dtype torch.float32 - n : 1000    |       20.6      |     20.5
      dtype torch.float32 - n : 10000   |      191.7      |    189.6
      dtype torch.float64 - n : 100     |        3.8      |      3.7
      dtype torch.float64 - n : 200     |        5.9      |      5.7
      dtype torch.float64 - n : 500     |       11.9      |     11.7
      dtype torch.float64 - n : 1000    |       22.1      |     21.7
      dtype torch.float64 - n : 10000   |      203.6      |    199.7
      dtype torch.bfloat16 - n : 100    |        3.7      |      3.5
      dtype torch.bfloat16 - n : 200    |        5.6      |      5.3
      dtype torch.bfloat16 - n : 500    |       11.2      |     10.6
      dtype torch.bfloat16 - n : 1000   |       20.8      |     19.5
      dtype torch.bfloat16 - n : 10000  |      190.0      |    179.7

Times are in microseconds (us).
```

### Benchmarking config
Machine: Intel(R) Core(TM) i7-10870H CPU @ 2.20GHz
<p>

```python
>>> import torch
>>> print(f"Torch config: {torch.__config__.show()}")
Torch config: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - CPU capability usage: AVX2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CXX_COMPILER=/usr/local/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_PYTORCH_QNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.2.0, USE_CUDA=0, USE_CUDNN=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=0, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=OFF,
```

</p>

Script -
```python
import torch
import pickle
from torch.utils import benchmark
from itertools import product

device = 'cpu'
dtypes = (torch.float16, torch.float32, torch.float64, torch.bfloat16)
n = (100, 200, 500, 1000, 10000)

result = []

for dtype, num in product(dtypes, n):
    x = torch.rand(num, dtype=dtype, device='cpu')
    torch.digamma(x)
    stmt = 'torch.digamma(x)'
    measurement = benchmark.Timer(
        stmt=stmt,
        globals={'x': x},
        label=stmt + " Benchmark",
        sub_label=f"dtype {dtype} - n : {num}",
        description="vectorized",
    ).blocked_autorange(min_run_time=10)

    result.append(measurement)

fname_prefix = "benchmark_digamma_"

benchmark.Compare(result).print()
with open(fname_prefix+"vectorized", "wb") as f:
    pickle.dump(result, f)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110217
Approved by: https://github.com/sanchitintel, https://github.com/vfdev-5, https://github.com/ezyang
2023-10-10 20:31:25 +00:00
Will Constable
7678cd22af Reland "[C10] PG observability hooks. (#108815)" (#110907)
This reverts commit ff0358b038.

(original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below)

Expose a set of observability hooks into C10D such that our users can detect collective failures both faster and more easily.

The design is similar to NCCL desync debug in that it minimizes the overhead by doing most of the work off the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

    register_collective_start_hook
    register_collective_end_hook
    register_process_group_hook

The process group hook exposes PG creation on the member ranks; the hooks are called inline from the PG creation code. This is fine since this happens during initialization and only a limited number of times.

The collective start/end hooks are fired from a single background thread, which reads events from a C++ queue and dispatches them.

Queue notification is, oddly, done using a pipe; this is needed so Python can abort the thread on shutdown and keep it as a background thread, which is not possible with more reasonable choices like a condvar.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110907
Approved by: https://github.com/fduwjj
2023-10-10 20:09:40 +00:00
Jon Chuang
84ad3ed7b2 [dynamo] add config for displaying all guard failures (#110927)
Fixes https://github.com/pytorch/pytorch/issues/110879

Example output:
```
('Recompiling function fn in /home/jonch/Desktop/Programming/mlsys/pytorch/test/dynamo/test_misc.py:4578', 'triggered by the following guard failures: ["___check_type_id(L[\'obj\'], 94834370481168)", "L[\'obj\'].x == -0.5"]')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110927
Approved by: https://github.com/lezcano
2023-10-10 19:57:44 +00:00
DanilBaibak
8cf1a02e80 Revert [Profiler] Improve the docstring for export_memory_timeline (#110978)
Revert [Profiler] Improve the docstring for export_memory_timeline
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110978
Approved by: https://github.com/huydhn, https://github.com/aaronenyeshi
2023-10-10 19:57:25 +00:00
soulitzer
bc49b1e50b [reland] Use is_symbolic instead of testing isinstance in some place (#110676)
reland of https://github.com/pytorch/pytorch/pull/110372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110676
Approved by: https://github.com/ezyang
ghstack dependencies: #110673, #110674, #110675
2023-10-10 19:37:17 +00:00
soulitzer
df9a6bcaef [reland] Symintify guards.cpp (#110675)
reland of https://github.com/pytorch/pytorch/pull/110371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110675
Approved by: https://github.com/ezyang
ghstack dependencies: #110673, #110674
2023-10-10 19:37:17 +00:00