Commit Graph

37529 Commits

Author SHA1 Message Date
Aaron Gokaslan
5a1216bb2e [BE]: Update ruff to 0.4.1 (#124549)
Update ruff to 0.4.1 .
This version fixes a lot false negatives/false positives, is 20-40% faster, and has various other bug fixes.

Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds courtesy of https://astral.sh/blog/ruff-v0.4.0

| Repository                                         | Linter (v0.3) | Linter (v0.4) | Formatter (v0.3) | Formatter (v0.4) |
|----------------------------------------------------|---------------|---------------|------------------|------------------|
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 328.7         | 251.8         | 351.1            | 274.9            |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549
Approved by: https://github.com/ezyang
2024-04-21 14:06:23 +00:00
Edward Z. Yang
f34905f61d Assert that TracingContext is available when set_example_value is called (#124284)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124284
Approved by: https://github.com/Chillee
ghstack dependencies: #124105, #124059, #124176, #124283
2024-04-21 11:23:13 +00:00
Edward Z. Yang
0e6367dd44 Factor var_to_range assignments to _update_var_to_range helper (#124283)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124283
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #124105, #124059, #124176
2024-04-21 11:23:13 +00:00
Colin Peppler
cbf420b67a [inductor] for UserDefinedTritonKernels don't mark all inputs as mutating (#124425)
Take this example:
```
def _mul2(x):
    y = torch.empty_like(x)
    mul2_kernel[(10,)](
        in_ptr0=x, out_ptr=y,
        n_elements=x.numel(), BLOCK_SIZE=1,
    )
    return y

def f(x):
    for _ in range(4):
        x = _mul2(x)
    return x + 1
```

Currently, the codegen will show up like this. Notice, how we allocate 5 buffers of the same size.
```
# Source Nodes: [triton_kernel_wrapper_mutation], Original ATen: []
buf0 = empty_strided_cuda((10, ), (1, ), torch.float32)
mul2_kernel_0.run(in_ptr0=arg0_1, out_ptr=reinterpret_tensor(buf0, (10, ), (1, ), 0) ...)

# Source Nodes: [triton_kernel_wrapper_mutation_1], Original ATen: []
buf4 = empty_strided_cuda((10, ), (1, ), torch.float32)
mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf0, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf4, (10, ), (1, ), 0) ...)

# Source Nodes: [triton_kernel_wrapper_mutation_2], Original ATen: []
buf8 = empty_strided_cuda((10, ), (1, ), torch.float32)
mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf4, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf8, (10, ), (1, ), 0) ...)

# Source Nodes: [triton_kernel_wrapper_mutation_3], Original ATen: []
buf12 = empty_strided_cuda((10, ), (1, ), torch.float32)
mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf8, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf12, (10, ), (1, ), 0) ...)

# Source Nodes: [add], Original ATen: [aten.add]
buf16 = empty_strided_cuda((10, ), (1, ), torch.float32)
triton_poi_fused_add_0.run(buf12, buf16, 10, grid=grid(10), stream=stream0)...)
return (buf16, )
```

With this PR, we want to see this. Notice, how we only allocate 2 buffers this time. The other 3 buffers are re-used.
```
# Source Nodes: [triton_kernel_wrapper_mutation], Original ATen: []
buf0 = empty_strided_cuda((10, ), (1, ), torch.float32)
mul2_kernel_0.run(in_ptr0=arg0_1, out_ptr=reinterpret_tensor(buf0, (10, ), (1, ), 0), ...)
del arg0_1

# Source Nodes: [triton_kernel_wrapper_mutation_1], Original ATen: []
buf2 = empty_strided_cuda((10, ), (1, ), torch.float32)
mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf0, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf2, (10, ), (1, ), 0) ...)

# Source Nodes: [triton_kernel_wrapper_mutation_2], Original ATen: []
buf4 = buf0; del buf0  # reuse
mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf2, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf4, (10, ), (1, ), 0) ...)

# Source Nodes: [triton_kernel_wrapper_mutation_3], Original ATen: []
buf6 = buf2; del buf2  # reuse
mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf4, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf6, (10, ), (1, ), 0) ...)
del buf4

# Source Nodes: [add], Original ATen: [aten.add]
buf8 = buf6; del buf6  # reuse
triton_poi_fused_add_0.run(buf8, 10, grid=grid(10), stream=stream0)
return (buf8, )
```

Differential Revision: [D56379307](https://our.internmc.facebook.com/intern/diff/D56379307)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124425
Approved by: https://github.com/oulgen
2024-04-21 06:00:14 +00:00
Yanbo Liang
0d90d4d613 [Dynamo] Fix NamedTuple hasattr bug (#124531)
Fixes #124402

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124531
Approved by: https://github.com/jansel
2024-04-21 04:36:22 +00:00
FFFrog
d6f88105ce Fix the problem about load_state_dict with unexpected key whose prefix matches a valid key (#124385)
Fixes https://github.com/pytorch/pytorch/issues/123510

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124385
Approved by: https://github.com/mikaylagawarecki
2024-04-20 23:19:25 +00:00
Edward Z. Yang
afa78ad08c Call writeline from writelines (#124515)
This makes it more convenient to add a breakpoint here.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124515
Approved by: https://github.com/albanD
2024-04-20 15:45:30 +00:00
Animesh Jain
a32eac345f [dynamo] Return gm.forward for eager backend (#124109)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124109
Approved by: https://github.com/yanboliang, https://github.com/jansel
ghstack dependencies: #124445
2024-04-20 14:11:05 +00:00
Animesh Jain
febc4d8759 [dynamo][easy] forbid_in_graph check to use getattr_static (#124445)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124445
Approved by: https://github.com/yanboliang, https://github.com/jansel
2024-04-20 14:11:05 +00:00
Yanbo Liang
a3e3693afc [Dynamo] Fix missing bracket in ListVariable (#124532)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124532
Approved by: https://github.com/williamwen42
2024-04-20 08:26:30 +00:00
Michael Lazos
0d0b5b2655 Enable dynamo rosenbrock sparse tests (#124542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124542
Approved by: https://github.com/yf225
ghstack dependencies: #124540, #124541
2024-04-20 05:54:41 +00:00
Michael Lazos
184f16016e Enable dynamo-traced deepcopy test for RMSprop (#124541)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124541
Approved by: https://github.com/yf225
ghstack dependencies: #124540
2024-04-20 05:54:41 +00:00
Michael Lazos
6a730698e2 Enable dynamo-traced Adamax tests (#124540)
Enabling tests related to https://github.com/pytorch/pytorch/issues/121178

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124540
Approved by: https://github.com/yf225
2024-04-20 05:54:41 +00:00
drisspg
f1cbaf1764 Adds LSE output for templated-attention-hop if inputs require grad (#124308)
Adds LSE output for templated-attention-hop if inputs require grad

Prep PR for adding autograd support to templated-attention-hop. The kernel needs to output the LSE during the forward which will be used during backwards.

### Output code
https://gist.github.com/drisspg/2aea3ce5db75811e7e143eeecb774d8a

## Before
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|----------------|
| Average |     1.159 |              |             |             |             |            |               |                |
| Max     |     1.342 |           16 |          16 |         512 |         512 |         64 | noop          | torch.bfloat16 |
| Min     |     1.016 |            1 |          16 |         512 |         512 |         64 | relative_bias | torch.bfloat16 |

## After
 Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod   | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|-------------|----------------|
| Average |     1.155 |              |             |             |             |            |             |                |
| Max     |     1.339 |           16 |          16 |         512 |         512 |         64 | noop        | torch.bfloat16 |
| Min     |     1.009 |            1 |          16 |         512 |         512 |         64 | head_bias   | torch.bfloat16 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124308
Approved by: https://github.com/Chillee
2024-04-20 05:45:56 +00:00
Oguz Ulgen
0d64b82f0b Make CompiledFxGraph portable between machines (#124438)
As we prepare FxGraphCache to move to remote, we need to make sure there's no data that is on the disk.

Differential Revision: [D56363808](https://our.internmc.facebook.com/intern/diff/D56363808)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124438
Approved by: https://github.com/jansel
2024-04-20 05:26:14 +00:00
Shunting Zhang
c5a4ba2257 [inductor] consider pointwise nodes when deciding reduction hint (#124131)
In certain **rare** scenarios, inductor can generate a reduction kernel with really bad perf. E.g., if
- the reduction kernel contains a reduction node followed by a pointwise node
- And the pointwise node use a transposed layout.
- the reduction node is an inner reduction
- and rnumel <= 1024 ,

then inductor will generate a persistent reduction kernel and it causes really bad perf when doing tl.store for the pointwise node since we use a very skinny tile `(XBLOCK=1, RBLOCK=next_power_of_2(rnumel))` .

I've tried a few version of fix.
- The first version is, if I found any pointwise node in a reduction kernel uses a non-contiguous dependency, we use ReductionHint.DEFAULT. This cause 8s compilation time increase for huggingface with no perf wins... The reason is ReductionHint.DEFAULT does more autotunings.
- Then I changed the code to be more specific. We change the hint from INNER to DEFAULT if we are sure that the pointwise kernel can use a >1 stride for the lowest dimension. Kernels meet this condition should mostly have really bad perf anyways.

The situation mentioned above is rare. But it's reported by internal users. I'll also run one more perf test.

Testing script: https://gist.github.com/shunting314/9d3389891fa43633b49b8b7564ad6d8b . Something equivalent is also added as a unit test.

For this specific test from user reports, we improve the mentioned reduction kernels perf by **4.14x** (451us -> 109us)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124131
Approved by: https://github.com/jansel
2024-04-20 05:07:56 +00:00
Xiaodong Wang
57f64197f3 Reduce warning msg in torch.profiler (#124469)
Summary: This is actually quite noisy and my logs are full of this soft assertion msg. Maybe making it log once?

Test Plan:
On AMD GPU side, I got a lot of those warnings:
```
W0415 01:40:45.109864 917160 collection.cpp:602] Warning: Memcpy ? (? -> ?) (function operator())”
```
So just suppress the excessive logs

Reviewed By: aaronenyeshi, yoyoyocmu

Differential Revision: D55602788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124469
Approved by: https://github.com/aaronenyeshi
2024-04-20 04:45:12 +00:00
Scott Wolchok
3d8b903d95 [PyTorch] Remove ArrayRefTensor::numel_ (#124516)
ArrayRefTensor::numel_ is redundant with the size of the contained MiniArrayRef. Reclaiming the space entirely would break ABI compatibility, but at least we have 4-8 bytes for future expansion.

Differential Revision: [D56366829](https://our.internmc.facebook.com/intern/diff/D56366829/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D56366829/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124516
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-04-20 02:44:20 +00:00
Andrew Gu
f9fce110af [FSDP2][ez] Removed error check for swap tensors flag (#124513)
Since `DTensor` uses `swap_tensors` path automatically now, we can remove this check for the global flag.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124513
Approved by: https://github.com/weifengpy
ghstack dependencies: #124319, #120256
2024-04-20 00:46:36 +00:00
Andrew Gu
1c2cb36811 [FSDP2] Added CPU offloading (#120256)
#### Overview
This PR adds CPU offloading via the `offload_policy: OffloadPolicy` argument.
- We incur one H2D copy for each parameter before all-gather.
- We incur one D2H copy for each gradient after reduce-scatter.
- We run optimizer on CPU.

#### Example (Mixed Precision and CPU Offloading)
This example uses a small 125M numel model, which is not too representative. We can try to run with a larger model like Llama-7B. However, since the current optimizer step is already too slow, we may want to patch a faster CPU optimizer.

Forward
![Screenshot 2024-02-21 at 10 36 29 AM](https://github.com/pytorch/pytorch/assets/31054793/00ed95db-3a55-49bb-ac98-9b9162feaacd)
![Screenshot 2024-02-21 at 10 39 12 AM](https://github.com/pytorch/pytorch/assets/31054793/10e29854-1907-4001-b3dc-aab6c3bf153c)

Backward
![Screenshot 2024-02-21 at 10 37 47 AM](https://github.com/pytorch/pytorch/assets/31054793/7039cb2e-eb78-4f53-b83f-67bae61ebddd)
![Screenshot 2024-02-21 at 10 38 44 AM](https://github.com/pytorch/pytorch/assets/31054793/e34615d6-6b6b-4995-aef1-9c7563034799)

Overall CPU (CPU optimizer step dominates)
![Screenshot 2024-02-21 at 10 39 47 AM](https://github.com/pytorch/pytorch/assets/31054793/7a2a929a-3a40-4b35-891b-016cf57e8079)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120256
Approved by: https://github.com/weifengpy
ghstack dependencies: #124319
2024-04-20 00:42:58 +00:00
soulitzer
cf5ca58e7f [NJT] Inline through torch.nested.nested_tensor_from_jagged instead of graph break (#124343)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124343
Approved by: https://github.com/jbschlosser
2024-04-19 23:13:59 +00:00
Laith Sakka
acbf888a13 rename sl to strobelight (#124455)
Summary:
TORCH_COMPILE_SL_PROFILE ->TORCH_COMPILE_STROBELIGHT
SL_MAX_STACK_LENGTH -> COMPILE_STROBELIGHT_MAX_STACK_LENGTH
SL_MAX_PROFILE_TIME -> COMPILE_STROBELIGHT_MAX_PROFILE_TIME
profile_with_sl() -> strobelight()
compiletime_sl_profile_meta() -> compiletime_strobelight_meta()

Test Plan:
1. run and verify
```
TORCH_COMPILE_STROBELIGHT=TRUE buck2 run  @//mode/inplace  @//mode/opt  //caffe2/fb/strobelight:compiletime_profiler_example
```
2. run and verify
```
buck2 run  @//mode/inplace  @//mode/opt  //caffe2/fb/strobelight:function_profiler_example --local-only
```
3. run and verify truncated stack for
```
TORCH_COMPILE_STROBELIGHT=TRUE COMPILE_STROBELIGHT_MAX_STACK_LENGTH=1 buck2 run  @//mode/inplace  @//mode/opt  //caffe2/fb/strobelight:compiletime_profiler_example
```
4. add infinite loop in _verify and verify samples for
```
COMPILE_STROBELIGHT_MAX_PROFILE_TIME=30 TORCH_COMPILE_STROBELIGHT=TRUE buck2 run  @//mode/inplace  @//mode/opt  //caffe2/fb/strobelight:compiletime_profiler_example
```

Reviewed By: oulgen

Differential Revision: D56327139

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124455
Approved by: https://github.com/oulgen
2024-04-19 22:50:13 +00:00
PyTorch MergeBot
0feab7d6c3 Revert "Build device generic torch.Stream and torch.Event based on c10::Stream/Event (#123611)"
This reverts commit cb17721899.

Reverted https://github.com/pytorch/pytorch/pull/123611 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))
2024-04-19 22:44:26 +00:00
PyTorch MergeBot
929242a15c Revert "torch.mtia module for MTIA device backend (#123612)"
This reverts commit d7e1bf9ff9.

Reverted https://github.com/pytorch/pytorch/pull/123612 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))
2024-04-19 22:44:26 +00:00
cdzhan
f8f7cfbeee Add __torch_function__ support for generated tensor methods/property of PrivateUse1 (#121723)
support following case:
```python
import torch
...
class CustomFooTensor(torch.Tensor):
  @classmethod
  def __torch_function__(cls, func, types, args=(), kwargs=None):
    ...
a = CustomFooTensor([3])
print(a.is_foo)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121723
Approved by: https://github.com/albanD
2024-04-19 22:34:34 +00:00
rzou
f0560f7b3b [opcheck] Stop doing test_aot_dispatch_static by default (#124495)
Motivations:
- this is pretty redundant with test_aot_dispatch_dynamic.
- The user story for opcheck is that a user should use opcheck to see
  if their operator was "registered correctly". If a user's custom op
  only supports dynamic shapes, then it's a bit awkward for
  one of the tests (e.g. `test_aot_dispatch_static`) to fail.
- We've already stopped running test_aot_dispatch_static in all of
  our opcheck tests.

Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124495
Approved by: https://github.com/williamwen42
ghstack dependencies: #124180, #124200, #124299, #124134, #124199, #124403, #124414
2024-04-19 21:57:22 +00:00
rzou
37d18966ea [custom_op] set some tags when constructing the op (#124414)
- the op is automatically "pt2-compliant"
- In general we want to turn on needs_fixed_stride_order for all customm
  ops, but this needs some more work, so we're just going to turn it on
  for the new custom op API.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124414
Approved by: https://github.com/albanD
ghstack dependencies: #124180, #124200, #124299, #124134, #124199, #124403
2024-04-19 21:57:22 +00:00
Andrew Gu
1900f79b72 [FSDP2] Added set_reshard_after_backward (#124319)
This PR adds a `set_reshard_after_backward` method to allow disabling resharding after backward. `reshard_after_backward=False` can be used with `reshard_after_forward=False` to implement "ZeRO-1", where there is only all-gather on the first microbatch forward and reduce-scatter on the last microbatch backward.

```
for microbatch_idx, microbatch in dataloader:
    is_last_microbatch = microbatch_idx == num_microbatches - 1
    model.set_requires_gradient_sync(is_last_microbatch)
    model.set_reshard_after_backward(is_last_microbatch)
    model.set_is_last_backward(is_last_microbatch)
    microbatch_fwd_bwd(model, microbatch, microbatch_idx)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124319
Approved by: https://github.com/weifengpy
2024-04-19 21:49:35 +00:00
Pian Pawakapan
10b9d4d19c [export] handle Dim.lower = 0, 1 for ep.run_decompositions() (#123602)
Summary:
With pre-dispatch export and ep.run_decompositions(), range constraints are updated through looking at ShapeEnv.var_to_range. However the lower bounds on these may be incorrect - analysis on un-specialized symbols are done with lower bounds of 2, which mismatch with user-specified bounds (may be 0, 1).

This updates `_get_updated_range_constraints()` to use the old range constraints if possible.

Test Plan: Existing pre-dispatch/dynamic shapes test case.

Differential Revision: D55899872

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123602
Approved by: https://github.com/tugsbayasgalan
2024-04-19 21:29:36 +00:00
eellison
000d55870a Enable in oss (#124031)
Biggest movement is 4% HF inference, 9% TIMM inference. Note, this is max-autotune mode so we are more tolerant of compilation increases. We could improve compilation time by limiting:
```
# Take how many of the top triton kernels to benchmark epilogue
max_epilogue_benchmarked_choices = 3
```

There is a hf_Whisper failure which you can repro on main without this stack with `TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --accuracy --training --only hf_Whisper`. When you turn off epilogue fusion, it fixes the accuracy. I bisected the failure to an epilogue, however when you compare the results of that epilogue with the corresponding separate kernels the results of the output are equivalent.

Inference:

<img width="1686" alt="image" src="https://github.com/pytorch/pytorch/assets/11477974/0b240080-cd33-4c08-89d3-583103b1fb0c">

Training:

<img width="1329" alt="Screenshot 2024-04-16 at 6 16 30 PM" src="https://github.com/pytorch/pytorch/assets/11477974/db0afcc9-7288-4c27-84ce-4fc1a5690788">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124031
Approved by: https://github.com/Chillee, https://github.com/shunting314
ghstack dependencies: #124030, #122642, #123229, #122825
2024-04-19 20:28:55 +00:00
eellison
179108f14d Use separate flags for MultiTemplates from BenchmarkFusion (#122825)
Two changes:
- Make the flag for multi template buffer independent from benchmark fusion. While benchmark fusion can be useful, the compilation time/performance trade offs are different than for just templates, which we'd like to enable by default.
- Dont do MultiTemplateBuffers/benchmark-fusion for templates which have custom input gen fn's (currently which only exist internally). Threading the custom input gn fns to benchmark fusion is NYI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122825
Approved by: https://github.com/shunting314
ghstack dependencies: #124030, #122642, #123229
2024-04-19 19:50:42 +00:00
IvanKobzarev
73f56e1e81 [sym_shapes][perf] Do not calculate hint in advice_is_size (#124472)
Differential Revision: [D56352412](https://our.internmc.facebook.com/intern/diff/D56352412)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124472
Approved by: https://github.com/ezyang
2024-04-19 19:10:24 +00:00
PyTorch MergeBot
f87c788a34 Revert "Capture triton kernel in execution trace (#124140)"
This reverts commit 89407eca3b.

Reverted https://github.com/pytorch/pytorch/pull/124140 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/124140#issuecomment-2067137104))
2024-04-19 19:05:44 +00:00
IvanKobzarev
761de37ab7 [sym_shape][perf] eval_static: guards, unbacked compute once (#124217)
Differential Revision: [D56212345](https://our.internmc.facebook.com/intern/diff/D56212345)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124217
Approved by: https://github.com/ezyang
2024-04-19 19:03:04 +00:00
Zhuoran Zhao
b0d83726bd [5/x][AMD][Lowering Enablement] Hipifying aoti code_wrapper (#124241)
Summary: as title

Test Plan:
CI & unit test

patch on top of https://www.internalfb.com/phabricator/paste/view/P1214895953 to test

Differential Revision: D56223917

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124241
Approved by: https://github.com/jansel, https://github.com/desertfire
2024-04-19 18:57:38 +00:00
rzou
25c65d6642 Change register_autograd to reflect ordering of setup_context and backward (#124403)
old: `register_autograd(setup_context, backward, /)`
new: `register_autograd(backward, /, *, setup_context=None)`

Motivations:
- We introduce these APIs as "give us a backward and use setup_context
  to save things for backward".
- setup_context isn't always necessary.

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124403
Approved by: https://github.com/albanD
ghstack dependencies: #124180, #124200, #124299, #124134, #124199
2024-04-19 17:56:30 +00:00
rzou
a8e17b2d4d Move schema inference to torch._library (#124199)
After this PR, we can delete torch._custom_op/torch._custom_ops (except
there are external libraries depending it).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124199
Approved by: https://github.com/albanD
ghstack dependencies: #124180, #124200, #124299, #124134
2024-04-19 17:56:30 +00:00
rzou
a78450a00b Excise uses of the old custom ops APIs (#124134)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124134
Approved by: https://github.com/albanD
ghstack dependencies: #124180, #124200, #124299
2024-04-19 17:56:26 +00:00
eellison
9489019085 Small fixes for deferred epilogue (#123229)
Two small fixes:

- preserve rng around compile_fx_inner
- Now that will precompile in the background while lowering multiple templates in parallel, we no longer can allocate inputs at the beginning of the function because we will have multiple sets of inputs allocated at the same time. Instead, allocate them when needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123229
Approved by: https://github.com/shunting314
ghstack dependencies: #124030, #122642
2024-04-19 17:41:29 +00:00
eellison
39fc280dce Dont precompile already seen keys, limit epilogue choices (#122642)
Two changes:
- in epilogue benchmark fusion, only take top 6 choices. There were basically no choices taken after this in HF.
- Share a single precompilation function among matmuls with same key.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122642
Approved by: https://github.com/shunting314
ghstack dependencies: #124030
2024-04-19 17:34:22 +00:00
JackCaoG
7ae835eee4 Enable SourcelessBuilder to build GraphModule generated by make_fx (#123673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123673
Approved by: https://github.com/ezyang, https://github.com/anijain2305
ghstack dependencies: #123680
2024-04-19 17:23:51 +00:00
Michael Lazos
68a027f144 Fixes for 123400 (#123406)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123406
Approved by: https://github.com/janeyx99
ghstack dependencies: #123324, #123404, #123405, #124309
2024-04-19 17:20:57 +00:00
Michael Lazos
5050e627dc Defer marking_static_address (#124309)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124309
Approved by: https://github.com/anijain2305
ghstack dependencies: #123324, #123404, #123405
2024-04-19 17:20:57 +00:00
Michael Lazos
1531a29fb9 Enable tests related to 116061 (#123405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123405
Approved by: https://github.com/janeyx99
ghstack dependencies: #123324, #123404
2024-04-19 17:20:54 +00:00
Michael Lazos
406d99e46c Fix for 117147 (#123404)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123404
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
ghstack dependencies: #123324
2024-04-19 17:20:50 +00:00
Michael Lazos
203d111c54 Enable dynamo test_forloop_goes_right_direction_multi_gpu (#123324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123324
Approved by: https://github.com/janeyx99
2024-04-19 17:20:41 +00:00
ydwu4
293f756cdc Support aot_export torchbind op (#123370)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123370
Approved by: https://github.com/zou3519
ghstack dependencies: #123367
2024-04-19 17:17:27 +00:00
ydwu4
e62169a8fa Support torchbind op dispatch in python (#123367)
We override the `__call__` method and register fake, functional, proxy default dispatch mode implementation in its python_key_mode_table.

The idea is:
1. when inputs contains FakeScriptObject,  we dispatch it through _get_dispatch mechanism. We implement dispatch mode keys automatically in the operator's constructor.
2. when inputs are not fakified, we dispatch through the original c++ dispatcher.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123367
Approved by: https://github.com/zou3519
2024-04-19 17:17:27 +00:00
eellison
136f8378e1 Re-land precompile triton templates (#124030)
Re-land precompile triton templates. This got reverted because we were precompiling templates without checking the cache. I have since added logic and a test to ensure we do not precompile if there is a cache hit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124030
Approved by: https://github.com/shunting314, https://github.com/nmacchioni, https://github.com/yoyoyocmu
2024-04-19 17:03:33 +00:00
rzou
bad8d25881 Add torch.library.register_kernel (#124299)
This mirrors the .register_kernel method on the object produced by the
custom_op decorator.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124299
Approved by: https://github.com/albanD
ghstack dependencies: #124180, #124200
2024-04-19 13:54:21 +00:00
rzou
3918dfedc5 [custom_op] Rename register_impl to register_kernel (#124200)
Motivation:
- The API is used for registering an implementation for a specific
  device type.
- "impl" is ambiguous and can be confused with Library.impl.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124200
Approved by: https://github.com/albanD
ghstack dependencies: #124180
2024-04-19 13:54:21 +00:00
rzou
22a2f676c3 [custom_op] add ability to provide manual schema (#124180)
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124180
Approved by: https://github.com/albanD
2024-04-19 13:54:13 +00:00
GdoongMathew
8b1ad51881 Better Error Message in ChainedScheduler and SequentialLR (#121633)
Fixes #121577

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121633
Approved by: https://github.com/janeyx99
2024-04-19 13:37:41 +00:00
Jesse Cai
c9db59e9e4 [sparse] Add fast semi-structured spasification kernels (#122350)
This PR adds in fast semi-structured sparsification kernels to PyTorch.

These kernels allow for accelerated semi-structured sparsification
kernels in PyTorch.

The kernels have been added as aten native functions

In particular, three new functions have been added:

* `torch._sparse_semi_structured_tile`

This function will return the packed representation and metadata for
both X and X', as well as the thread masks. Note that this applies 2:4
sparsity in a 4x4 tile instead of a 1x4 strip as usual.

* `torch._sparse_semi_structured_apply`

This function takes in an input tensor and thread masks from the above
function and returns a packed representation and metadata from applying
thread masks to the input tensor.

* `torch._sparse_semi_structured_apply_dense`

This function does the same thing as above but instead of returning the
tensor in the sparse representation it returns it in the dense
representation

The subclasses have also been updated to add a new
`prune_dense_static_sort`
classmethod to create sparse tensors with this format. I've added some
additional documentatino on how to calculate the compressed tensors
needed to create a SparseSemiStructuredTensor oneself.

To this end, there are two new helper functions added:
`sparse_semi_structured_tile`
`compute_compressed_swizzled_bitmask`

Differential Revision: [D56190801](https://our.internmc.facebook.com/intern/diff/D56190801)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122350
Approved by: https://github.com/cpuhrsch
2024-04-19 13:31:58 +00:00
Cen Zhao
96724a769b [ptd] drop ncclGroupStart/end for ncclCommInit (#124363) (#124416)
Summary:

```
ncclGroupStart()
ncclCommInit(..)
ncclGroupEnd()
```

above pattern is only needed when we have *single-thread* to manage multiple GPUs

in our case, we always have 1 process managing 1 GPU, we don't need group operation.

Test Plan: CI

Differential Revision: D56274975

Co-authored-by: Cen Zhao <cenzhao@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124416
Approved by: https://github.com/shuqiangzhang
2024-04-19 13:12:42 +00:00
chilli
8e280862ff Add custom joint graph passes (#124443)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124443
Approved by: https://github.com/aorenste, https://github.com/malfet
2024-04-19 11:54:46 +00:00
Jane Xu
b412b75b42 [optim] add fused_adam/adamw_kernel support for CPU device (#123074)
On par with `CUDA` implementation.

For `autocast` logic, same with `CUDA` + `Fused Adam`:
 - check inf in `gradscalar.step`
 - In fused kernel, if there is `inf`, do nothing. If not, unscale the grad ( also write back) and update the param.

**TestPlan**:
```
# extend CUDA only test for CPU fused adagrad
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_torch.py -k test_grad_scaling_autocast_fused

# extend fused test
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
python test_optim.py -k test_can_load_older_state_dict

# newly added test (follow 6b1f13ea2f/test/test_cuda.py (L1108))
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
```

**Benchmark**:
**5.1x** on 56 core SPR
**Parameter-size=1M**
**Nparams=10**
[test script](https://gist.github.com/zhuhaozhe/ef9a290ad3f8f4067b3373a3bdaa33e7)

```
numactl -C 0-55 -m 0 python bench_adam.py
non-fused 6.0174267292022705 s
fused 1.1787631511688232 s
```

**Note: Fused kernel accuracy**
The accuracy failure in CI shows a little higher than default tolerance
```
2024-04-02T06:09:16.2213887Z Mismatched elements: 21 / 64 (32.8%)
2024-04-02T06:09:16.2214339Z Greatest absolute difference: 1.5735626220703125e-05 at index (6, 6) (up to 1e-05 allowed)
2024-04-02T06:09:16.2214813Z Greatest relative difference: 1.0073336852656212e-05 at index (4, 1) (up to 1.3e-06 allowed)
```
I have debug it step by step and unfortunately we may not able to make the `fused kernel` exactly same with `non fused` one due to compiler optimizations.
For example, in non-fused impl
```
exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
```
and in fused impl
```
  exp_avg_sq_ptr[d] = scalar_t(beta2) * exp_avg_sq_ptr[d];
  //  std::cout << "exp_avg_sq " <<   exp_avg_sq_ptr[d] << std::endl;
  exp_avg_sq_ptr[d] = exp_avg_sq_ptr[d] +
      scalar_t(exp_avg_sq_grad_coefficient) * grad_val * grad_val;
```
If I keep `std::cout`, I can get exactly same results in UT
```
===============param
0.6796758770942688
0.6796758770942688
```
But when I comment out it, there will be a difference
```
===============param
0.6796758770942688
0.6796759366989136
```
So I will make the tolerance a little higher than default one.

Co-authored-by: Jane Xu <janeyx@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123074
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-04-19 11:14:04 +00:00
Boyuan Feng
9a71d12d92 [CUDAGraphTree] Support mutated inputs from prior cudagraph pool (#123231)
# PR
This PR supports mutating inputs in cudagraph trees, if these inputs are outputs from previous cudagraph. Please check #121861 for more details.

# Note on Optimistic Mutation Check
To determine whether applying cudagraph, we need to check input mutations, falling into four categories: a) no mutation, b) mutation on parameters/buffers, c) mutation on cudagraph recorded tensors, d) mutation on non-cudagraph recorded tensors. We can apply cudagraph for type a,b,c but cannot for type d. This input mutation types depends on function, current_node, and inputs.

Since `check_for_mutation` is slow, there is a trade-off on making type c or d faster.
- To make type d) faster, we want to `check_for_mutation` and call eager function early. However, this adds unnecessary overhead to type a, b, c due to the extra check.
- To make type c) faster, we want to skip `check_for_mutation` at the beginning and only `check_for_mutation` before `record_function` for a new function. This removes the overhead of `check_for_mutation` for type a, b, c. However, this adds extra overhead to type d due to `check_invariants` for all children nodes.

Instead, we design optimistic mutation check. The assumption is that, given a function and a node, the input mutation types usually remain the same across inputs. So, if we have ever detect a function on a node with type d, we will never detect it as type c. The detailed design is:
- [Slow Path] On the first invocation of a function on a node, we run `check_for_mutation` once and cache the input mutation type as `non_cudagraph_managed_mutation[node_id][func_id]`.
- [Fast Path] On the subsequent invocations of a function on a node, we skip `check_for_mutation`. For `non_cudagraph_managed_mutation[node_id][func_id]` as true, we directly call eager function. Otherwise, we `check_variants` and call cudagraph function.
- [Slow Path] Before `record_function`, we run `check_for_mutation` again.

**Q1: Would there be overhead for type a,b,c,d?**
A: No. We only check input mutation types for the first invocation of a function on a node.

**Q2: If a function happens to be type c during the first invocation on a node, could we detect it as type d in the future?**
A: Yes. This is done by `check_invariants` and guarantees the correctness.

**Q3: If a function happens to be type d during the first invocation on a node, could it still be recognized as type c in the future?**
A: No. But this should happen rarely according to our assumption. In the rare case that it happens, there would not be any correctness issues and the performance is the same as the eager (or inductor optimized) function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123231
Approved by: https://github.com/eellison
2024-04-19 10:32:12 +00:00
Tobias Ringwald
58e403c739 Added a docstring for torch.Size.numel. (#124186)
Fixes #61231. Fixes #124167.

This PR documents a rather long-standing issue w.r.t. unexpected behavior of `torch.Size.numel`, first reported almost 5 years ago.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124186
Approved by: https://github.com/janeyx99
2024-04-19 09:23:02 +00:00
PyTorch MergeBot
520bc1080e Revert "[Profiler] Unify the device(CUDA, XPU, PrivateUse1) in torch profiler post processing (#123247)"
This reverts commit 768ce2cdda.

Reverted https://github.com/pytorch/pytorch/pull/123247 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123247#issuecomment-2066152611))
2024-04-19 09:09:03 +00:00
Xuehai Pan
a6f044a490 [dynamo, 3.8-3.9] support dataclass with frozen=True in Python 3.8/3.9 (#124393)
Closes #114966

Frozen field assignment in `__init__` in Python 3.8-3.9:

f5bd65ed37/Lib/dataclasses.py (L402-L411)

```python
import builtins

BUILTINS = builtins

def _field_assign(frozen, name, value, self_name):
    # If we're a frozen class, then assign to our fields in __init__
    # via object.__setattr__.  Otherwise, just use a simple
    # assignment.
    #
    # self_name is what "self" is called in this function: don't
    # hard-code "self", since that might be a field name.
    if frozen:
        return f'BUILTINS.object.__setattr__({self_name},{name!r},{value})'
    return f'{self_name}.{name}={value}'
```

Frozen field assignment in `__init__` in Python 3.10+:

812245ecce/Lib/dataclasses.py (L436-L445)

```python
__dataclass_builtins_object__ = object

def _field_assign(frozen, name, value, self_name):
    # If we're a frozen class, then assign to our fields in __init__
    # via object.__setattr__.  Otherwise, just use a simple
    # assignment.
    #
    # self_name is what "self" is called in this function: don't
    # hard-code "self", since that might be a field name.
    if frozen:
        return f'__dataclass_builtins_object__.__setattr__({self_name},{name!r},{value})'
    return f'{self_name}.{name}={value}'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124393
Approved by: https://github.com/jansel
2024-04-19 05:10:33 +00:00
Nikita Shulga
1ba85b34dd [AOTI] Enbale mmaped weights when CUDA is used (#124346)
By refactoring the logic that returns the start to constant pointer into `_get_constants_start()` method and call it from both CUDA and CPU readers

It has no runtime impact, but export time is down from 10m to 3m if mmaped weights are used on AWS p4d.24xlarge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124346
Approved by: https://github.com/mikekgfb, https://github.com/desertfire
2024-04-19 04:47:27 +00:00
Kiuk Chung
87f44d70b1 [torch/distributed] Check gloo availability when doing isinstance(pg,… (#124233)
Fixes a bug where a reference to `_ProcessGroupWrapper` is used without first checking whether gloo is available. This fails on pytorch builds that do not include gloo becuase `_ProcessGroupWrapper` is only pybinded when building with `USE_GLOO=1`. Therefore, creation of a new process group fails with a `NameError` when only NCCL is available as the backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124233
Approved by: https://github.com/rohan-varma, https://github.com/d4l3k
2024-04-19 04:07:00 +00:00
Chen, Zejun
768ce2cdda [Profiler] Unify the device(CUDA, XPU, PrivateUse1) in torch profiler post processing (#123247)
This PR unifies the CUDA, XPU and PrivateUse1 in the torch profiler. Now CUDA, XPU and PrivateUse1 can together use string object `use_device` to distinguish each other and share one device path for calculating kineto time durations and memory statistics for post processing.

#suppress-api-compatibility-check

Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123247
Approved by: https://github.com/aaronenyeshi, https://github.com/gujinghui
2024-04-19 03:31:13 +00:00
rraminen
803a08f8ae [ROCm] Add cublasGemmAlgo_t -> hipblasGemmAlgo_t (#121030)
This PR is to add cublasGemmAlgo_t -> hipblasGemmAlgo_t to cuda_to_hip_mappings.py.
It is required for DeepSpeed transformer extension build on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121030
Approved by: https://github.com/jeffdaily, https://github.com/ezyang
2024-04-19 02:57:16 +00:00
rzou
889e3eeed3 Avoid cuda init to FakeTensorMode (#124413)
Also partially fixes #122109

This PR:
- We add a C++ flag (only_lift_cpu_tensors) to toggle the
  torch.tensor(1, device='cuda') ctor strategy.
  When false (default), it does the current PyTorch behavior
  of unconditionally constructing a concrete CUDA tensor then calling
  lift_fresh on it. When true, we instead construct a concrete CPU
  tensor, call lift_fresh, and then call Tensor.to(device) (under any ambient
  modes).
- FakeTensorMode flips this flag depending on if CUDA is available or
  not. We don't unconditionally set the flag to True because that is
  likely BC-breaking.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124413
Approved by: https://github.com/eellison
2024-04-19 02:39:35 +00:00
chilli
e620c3e814 Optimized templated attention to use exp2 (#124356)
0.705 (vs. FA2) to 0.860 after this change.

<img width="1270" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/d58f57ba-e50e-44ea-8a8a-4f13b8650adf">

to

<img width="1277" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/f1945b67-0cfc-463c-a2f6-5812b90677fe">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124356
Approved by: https://github.com/drisspg
2024-04-19 01:58:19 +00:00
Tristan Rice
ddd0ed1b43 distributed: templated ring attention (#124215)
This adds a templated version of the ring attention forwards function as well as tests it with memory efficient attention. This doesn't add support for memory efficient attention in DTensor. That will be added in a follow up PR.

This templating is also a POC of how to support other attention ops such as Jagged/nested tensor and as well how to implement striped attention in a scalable way.

Misc changes:

* Fixes all_to_all_single autograd implementation with CUDA + adds NCCL test
* Adds compile support to the ring attention implementations (required some tweaks to process groups)

Test plan:

```
pytest test/distributed/_tensor/test_attention.py
pytest test/distributed/test_functional_api.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124215
Approved by: https://github.com/wanchaol
2024-04-19 00:57:08 +00:00
Bin Bao
4946638f06 [AOTI] Add ABI-compatiblity tests (#123848)
Summary: In AOTInductor generated CPU model code, there can be direct references to some aten/c10 utility functions and data structures, e.g. at::vec and c10::Half. These are performance critical and thus it doesn't make sense to create C shim for them. Instead, we make sure they are implemented in a header-only way, and use this set of tests to guard future changes.

There are more header files to be updated, but we will do it in other followup PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123848
Approved by: https://github.com/jansel
ghstack dependencies: #123847
2024-04-19 00:51:24 +00:00
JackCaoG
9ed9b22ec0 Implement efficient_conv_bn_eval_decomp_graph_transform to handle conv and bn fusion after decomp (#123680)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123680
Approved by: https://github.com/ezyang, https://github.com/youkaichao
2024-04-19 00:22:25 +00:00
Shuqiang Zhang
ca6a0e1348 [c10d] remove the env of TORCH_NCCL_ABORT_IN_DESTROY_PG (#124334)
Summary:
This ENV was introduced to safely rollout the behavior change in destroy
process group (e.g., call ncclCommsAbort). Now that this behavior change
were already rolled out, we no longer need this env and we should clean
up it to keep our code cleaner
Test Plan:
Modified/existing ut pass

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124334
Approved by: https://github.com/wconstab
2024-04-18 23:42:55 +00:00
eellison
e4f6340f21 realize inputs to mem bound mm decomposition (#123165)
Differential Revision: [D55639709](https://our.internmc.facebook.com/intern/diff/D55639709)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123165
Approved by: https://github.com/jackiexu1992
2024-04-18 23:10:04 +00:00
Mikayla Gawarecki
5ba6bb7b2f Add swap_tensors path to nn parametrizations (#124130)
Fixes #123859

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124130
Approved by: https://github.com/albanD
2024-04-18 22:22:08 +00:00
Wei Wei
87f651c7e7 fix cpu test errors (#124116)
Similar fix is from @int3 but not landed. Credit to @int3 too.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124116
Approved by: https://github.com/chenyang78
2024-04-18 20:30:58 +00:00
ydwu4
2e48b39603 Fix example_value of map (#124203)
Previously, we didn't expand the shape of example_value of map to the same as inputs (edit: the first mapped dimension). This pr fixes this bug. To make this easier, we change _call_function_and_unflatten_output to accept example_values directly instead of retrieving them from the variable trackers.

Also remove a redundant call function node in strict_mode higher order op in dynamo.

Test Plan:
existing tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124203
Approved by: https://github.com/ezyang, https://github.com/zou3519
2024-04-18 19:18:36 +00:00
PyTorch MergeBot
4a0900d04b Revert "[NJT] Inline through torch.nested.nested_tensor_from_jagged instead of graph break (#124343)"
This reverts commit ef93402f61.

Reverted https://github.com/pytorch/pytorch/pull/124343 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124343#issuecomment-2064937192))
2024-04-18 18:55:48 +00:00
Sheng Fu
89407eca3b Capture triton kernel in execution trace (#124140)
Summary: This DIFF is to capture triton kernels in execution trace.

Test Plan: buck test  mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2

Differential Revision: D56162599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124140
Approved by: https://github.com/briancoutinho
2024-04-18 18:38:26 +00:00
angelayi
74bedbb9e1 [export] Serialize rational symint ranges (#123884)
Some symints result in rational ranges like 10/3 which runs into an error ([example](https://www.internalfb.com/intern/everpaste/?handle=GMG2AxkeoFUrh-UDAFcE8pKPgjoUbsIXAAAB)).

Ed will eventually get rid(?) of these rational ranges but as a workaround export can just clamp the results during serialization time
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123884
Approved by: https://github.com/zhxchen17
2024-04-18 18:20:11 +00:00
Aaron Orenstein
37215a4fa2 Fix memory leak in pattern_matcher (#124345)
#121313 changed precompiled patterns so they are more integrated with the pattern matching code.  This resulted with a list of "known" patterns (with their example data) being stored globally. Unfortunately since small FakeTensors store a constant of the original tensor it meant that we leaked cuda tensors in the example data.

Fix this by clearing out the constant storage for the example data that we keep around.

Fixes #124081

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124345
Approved by: https://github.com/xuzhao9
2024-04-18 17:38:12 +00:00
egienvalue
d7e1bf9ff9 torch.mtia module for MTIA device backend (#123612)
MTIA device has its own Module in PyTorch now.
torch.mtia has following APIs similar to other backends. The lazy_init is also supported.
```
__all__ = [
    "init",
    "is_available",
    "synchronize",
    "device_count",
    "current_device",
    "current_stream",
    "default_stream",
    "set_stream",
    "stream",
    "device",
]

```
------------
For device management. We expand AccleratorHooksInterface to support generic device management and it can be used in both C++ and PyThon.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int : ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ...
```

---------
Adding get_device_module API to retrieve device modules for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
---------
@exported-using-ghexport

Differential Revision: [D52923602](https://our.internmc.facebook.com/intern/diff/D52923602/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
2024-04-18 17:38:06 +00:00
egienvalue
cb17721899 Build device generic torch.Stream and torch.Event based on c10::Stream/Event (#123611)
This diff intends to build device generic torch.Stream and torch.Event for newly added accelerators in PyTorch.
------------
**torch.Stream APIs**
```
# Defined in torch/csrc/Stream.cpp
class Stream(_StreamBase):
    stream_id: _int  # Stream id
    device_index: _int
    device_type: _int

    device: _device  # The device of the stream

    @overload
    def __new__(self, device: Optional[DeviceLikeType] = None, priority: _int = 0) -> Stream: ...
    @overload
    def __new__(self, stream_id: _int, device_index: _int, device_type: _int, priority: _int = 0) -> Stream: ...
    def query(self) -> _bool: ...
    def synchronize(self) -> None: ...
    def wait_event(self, event: Event) -> None: ...
    def wait_stream(self, other: Stream) -> None: ...
    def record_event(self, event: Optional[Event] = None) -> Event: ...
    def query(self) -> None: ...
    def synchronize(self) -> None: ...
    def __hash__(self) -> _int: ...
    def __repr__(self) -> str: ...
    def __eq__(self, other: object) -> _bool: ...
```
------------------
**torch.Event APIs**:
- IPC related APIs are not implemented, since many device backends don't support it, but we leave interfaces there for future adaption of torch.cuda.Stream.
- currently only the enable_timing is supported, since it is the most common one used in other device backends. We have to refactor the event flag system in PyTorch to support more fancy flag.
- elapsedTime API is added to c10::Event

```
# Defined in torch/csrc/Event.cpp
class Event(_EventBase):

    device: _device  # The device of the Event
    event_id: _int # The raw event created by device backend

    def __new__(self,
        device: Optional[DeviceLikeType] = None,
        enable_timing: _bool = False,
        blocking: _bool = False,
        interprocess: _bool = False) -> Event: ...
    @classmethod
    def from_ipc_handle(self, device: DeviceLikeType, ipc_handle: bytes) -> Event: ...
    def record(self, stream: Optional[Stream] = None) -> None: ...
    def wait(self, stream: Optional[Stream] = None) -> None: ...
    def query(self) -> _bool: ...
    def elapsed_time(self, other: Event) -> _float: ...
    def synchronize(self) -> None: ...
    def ipc_handle(self) -> bytes: ...
    def __repr__(self) -> str: ...
```

-----------

c10::Event provides new APIs
- calculate **elapsedTime**.
- Get raw event id
- Synchronize event.

```
  double elapsedTime(const Event& event) const {
    return impl_.elapsedTime(event.impl_);
  }

  void* eventId() const {
    return impl_.eventId();
  }

  void synchronize() const {
    return impl_.synchronize();
  }
```
----------
TODO: need to find a good way to test them in PyTorch with API mocks.

Differential Revision: [D55351839](https://our.internmc.facebook.com/intern/diff/D55351839/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123611
Approved by: https://github.com/albanD
2024-04-18 17:35:09 +00:00
Jason Ansel
7a6edb0b66 Possible fix for einops warning (#124084)
See https://github.com/arogozhnikov/einops/issues/315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124084
Approved by: https://github.com/peterbell10
2024-04-18 17:09:50 +00:00
Zhengxu Chen
e1062f5738 [export] Add a printer to unflattened module. (#124315)
Summary: add a helper method to print graph in every level of unflattened module.

Test Plan: {F1489609684}

Differential Revision: D56263195

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124315
Approved by: https://github.com/tugsbayasgalan
2024-04-18 16:35:51 +00:00
Boyuan Feng
aa2da0cdd2 [Export] Add runtime assert to non-strict export (#123681)
This PR moves insert_deferred_runtime_asserts from dynamo to torch.fx.passes and uses it to add runtime assertion for non-strict export.

Differential Revision: D55944267

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123681
Approved by: https://github.com/tugsbayasgalan, https://github.com/angelayi
2024-04-18 16:13:27 +00:00
soulitzer
ef93402f61 [NJT] Inline through torch.nested.nested_tensor_from_jagged instead of graph break (#124343)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124343
Approved by: https://github.com/jbschlosser
2024-04-18 14:42:54 +00:00
Andrew Gu
bbb6e36495 [FSDP2] Fixed set_requires_gradient_sync's recurse arg (#124318)
The `recurse` argument was not being respected for `set_requires_gradient_sync`. This PR fixes that.

The previous unit test did not have nested FSDP modules with managed parameters, so the `recurse=False` was not being exercised. We augment the unit test to try only disabling gradient sync for the root module and not children.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124318
Approved by: https://github.com/weifengpy
ghstack dependencies: #120952, #124293
2024-04-18 14:21:57 +00:00
rzou
1542874311 Delete qualname from custom_op decorator (#124092)
I forgot to delete this in an earlier PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124092
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064, #124065, #124066, #124071, #124089
2024-04-18 12:48:04 +00:00
rzou
648c39c47d Add OpOverload.redispatch; use it in new custom ops API (#124089)
A kernel has "dispatcher convention" if there is an additional keyset
arg at the beginning of the argument list. This PR:
- adds a way to register kernels with dispatcher_convention using
  Library.impl (pass dispatcher_convention = True)
- adds OpOverload.redispatch

We use both of the above in the new custom ops API: we register the
autograd kernel in dispatcher convention so that we can actually call
redispatch like how pytorch built-in ops do it.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124089
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064, #124065, #124066, #124071
2024-04-18 12:48:04 +00:00
rzou
645173a0b5 Add torch.library.register_autograd (#124071)
Allows registering autograd for all custom op entry points:
- the new-style custom op API (custom_op)
- the old-style torch.library APIs
- C++ operator registration

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124071
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064, #124065, #124066
2024-04-18 12:47:59 +00:00
rzou
8135c4b921 torch.library.register_fake now accepts more types (#124066)
We allow it to accept:
- a string with the op name
- an opoverload
- a new-style custom op

If any of these are referring to a new-style custom op (created with the
custom_op decorator), then we dispatch to CustomOpDef.register_fake.
Otherwise, we do what we previously did.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124066
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064, #124065
2024-04-18 12:47:55 +00:00
xinan.lin
6fcbeb3489 [ATen] Add CPU fp16 support for nll_loss and cross_entropy_loss (#123256)
Add CPU FP16 support for nll_loss and cross_entropy_loss.
Resolve issue #123328.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123256
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet
2024-04-18 11:44:38 +00:00
IvanKobzarev
d59f1da62f [sym_shapes][perf] _find not update unchanged replacements (#124274)
Differential Revision: [D56236380](https://our.internmc.facebook.com/intern/diff/D56236380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124274
Approved by: https://github.com/ezyang
2024-04-18 08:32:02 +00:00
IvanKobzarev
9eba1995d0 [sym_shapes][perf] Use sympy xreplace instead of subs (#124208)
https://github.com/sympy/sympy/issues/22240

Differential Revision: [D56207553](https://our.internmc.facebook.com/intern/diff/D56207553)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124208
Approved by: https://github.com/ezyang, https://github.com/lezcano
2024-04-18 08:19:03 +00:00
PyTorch MergeBot
2b82345e48 Revert "Re-land precompile triton templates (#124030)"
This reverts commit 030bb13fe8.

Reverted https://github.com/pytorch/pytorch/pull/124030 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124030#issuecomment-2063191117))
2024-04-18 07:21:41 +00:00
Animesh Jain
704fac5618 [dynamo][cpp-guard] Reland Attempt 1 - Enable cpp guard manager (#124231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124231
Approved by: https://github.com/jansel
ghstack dependencies: #124230, #124237
2024-04-18 06:36:20 +00:00
PyTorch MergeBot
6e86a40694 Revert "[Dynamo] Check for __bool__ attribute before accessing it (#120943)"
This reverts commit dd7aeedb72.

Reverted https://github.com/pytorch/pytorch/pull/120943 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/120943#issuecomment-2063098295))
2024-04-18 06:34:32 +00:00
PyTorch MergeBot
8ff85b42f9 Revert "Add swap_tensors path to nn parametrizations (#124130)"
This reverts commit 64f6ddf12c.

Reverted https://github.com/pytorch/pytorch/pull/124130 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124130#issuecomment-2063074856))
2024-04-18 06:12:54 +00:00
Zhuoran Zhao
8ad66e05d2 [4/x][AMD][Lowering Enablement] Enabling meta internal AOTInductor compilation on ROCM (#124123)
Summary: as title

Test Plan: CI & unit test

Differential Revision: D56163334

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124123
Approved by: https://github.com/chenyang78, https://github.com/jansel
2024-04-18 04:19:37 +00:00
xinan.lin
c9ab9248ce [Inductor Intel GPU backend Upstream] Generalize device-bias code in (#124249)
Generalize device-bias code in tirton_utils.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124249
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/jansel
2024-04-18 03:54:31 +00:00
Yanan Cao (PyTorch)
27daa110c8 Back out "Refresh OpOverloadPacket if a new OpOverload gets added (#123578)" (#124324)
Summary:
Original commit changeset: 528276bc8a92

Original Phabricator Diff: D56057952

Differential Revision: D56271240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124324
Approved by: https://github.com/davidberard98
2024-04-18 03:33:54 +00:00