Commit Graph

70496 Commits

Author SHA1 Message Date
Valentine233
db36d21f5c Add SDPA pattern for HuggingFace models BF16 (#121202)
### Description

- Add pattern for bf16 input type with fp32 attention mask. (Example model: ElectraForCausalLM)
- Add pattern with batch_size=1 to avoid some clones in graph. (Example model: text-classification+prajjwal1-bert-tiny)

### Newly matched models
Dtype: bf16, machine: SPR

#### Dynamo HuggingFace models

- ElectraForCausalLM (speedup=2.09x)
- ElectraForQuestionAnswering (speedup=4.22x)
- AlbertForQuestionAnswering (speedup=1.36x)
- AlbertForMaskedLM (speedup=1.39x)

#### OOB HuggingFace models

- multiple-choice+google-electra-base-discriminator
- text-classification+prajjwal1-bert-tiny
- text-classification+prajjwal1-bert-mini
- text-classification+google-electra-base-generator
- text-classification+bert-large-cased
- casual-language-modeling+xlm-roberta-base
- text-classification+roberta-base
- text-classification+xlm-roberta-base
- text-classification+albert-base-v2
- token-classification+google-electra-base-generator
- masked-language-modeling+bert-base-cased

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121202
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-03-07 07:40:00 +00:00
Oguz Ulgen
953c6c37cb Wrap remote cache creation with a try-catch (#121340)
Summary: In production I am seeing errors like "AttributeError: module 'triton.runtime' has no attribute 'fb_memcache'", likely due to some package skew. Until this is resolved, let's wrap this code with a try-catch.
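A minimal sketch of the guard pattern described above (not the actual Inductor code); the import inside the try block stands in for the remote cache creation and is illustrative only:

```python
# Illustrative guard for remote-cache creation: with skewed packages the
# fb_memcache attribute/module may be missing, so degrade gracefully instead
# of failing compilation with an AttributeError/ImportError.
def maybe_get_remote_cache():
    try:
        from triton.runtime import fb_memcache  # may not exist with mismatched triton builds
        return fb_memcache
    except (ImportError, AttributeError):
        return None  # no remote cache; compilation proceeds without it
```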

Test Plan: CI

Differential Revision: D54604339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121340
Approved by: https://github.com/aakhundov
2024-03-07 07:05:49 +00:00
Chen_Liqing
291ce86a6c Modify StorageImplCreateHelper (#118459)
I want to use tensor.untyped_storage()[a:b] for the ``PrivateUse1`` backend, but it fails. The code goes into ``THPStorage_get``:
bb6eba189f/torch/csrc/Storage.cpp (L525-L540)

Here ``torch`` creates a new ``c10::StorageImpl`` but does not take the ``PrivateUse1`` backend into account.
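A hedged sketch of the failing pattern, assuming a PrivateUse1 backend has been registered under the default ``privateuseone`` device name (without such a backend this will not run):

```python
# Sketch only: slicing the untyped storage of a tensor that lives on a
# PrivateUse1 device. Before this PR, THPStorage_get built the new
# c10::StorageImpl without consulting the PrivateUse1 backend.
import torch

t = torch.ones(8, device="privateuseone")  # requires a registered PrivateUse1 backend
s = t.untyped_storage()[4:16]              # byte-slice of the storage
```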

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118459
Approved by: https://github.com/albanD
2024-03-07 06:26:55 +00:00
Xia, Weiwen
f848e9c646 [Quant][Inductor] Fix q/dq per channel lowering with 64-bit qparams (#120984)
Fixes #120869

Fix the lowering of `quantize_per_channel` and `dequantize_per_channel` with float64 scale and int64 zero point.
The generated code is incorrect without explicit type conversion. Add type conversion to the lowering pass, i.e., float64 (double) -> float32 and int64 -> int32.
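A hedged illustration of the qparam dtypes involved, using the eager quantize op only to show the 64-bit inputs; it is not the Inductor lowering itself:

```python
# Per-channel qparams commonly arrive as float64 scales and int64 zero points;
# the lowering now casts them to float32/int32 before generating code.
import torch

x = torch.randn(2, 4)
scales = torch.tensor([0.1, 0.2, 0.3, 0.4], dtype=torch.float64)  # 64-bit scales
zero_points = torch.tensor([0, 1, 2, 3], dtype=torch.int64)       # 64-bit zero points
q = torch.quantize_per_channel(x, scales, zero_points, axis=1, dtype=torch.quint8)
print(q.dtype, q.q_per_channel_scales().dtype)
```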

**Test plan**
python test/inductor/test_cpu_repro.py -k test_per_channel_fake_quant_module_uint8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120984
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2024-03-07 06:23:52 +00:00
Yeounoh Chung
4f9d4e1ab0 [DTensor][XLA] refactor DTensor _xla API (#113214)
In response to the change pytorch/xla#5776 and #92909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113214
Approved by: https://github.com/wanchaol
2024-03-07 06:18:05 +00:00
cyy
c723514ef4 [CUDACachingAllocator] Simplify update_stat and avoid casts (#120964)
update_stat in CUDACachingAllocator.cpp was split into increase and decrease functions in this PR to simplify the implementation and avoid type casts throughout the code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120964
Approved by: https://github.com/albanD
2024-03-07 05:55:38 +00:00
drisspg
55232c4e1c Make CausalBias a torch.Tensor subclass again (#121358)
# Summary
This was removed in #116071 in order to enable compile support, and re-adding it still seems to work with compile.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121358
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
2024-03-07 05:20:47 +00:00
Xilun Wu
df2ad1fecc [dtensor][debug] have visualize_sharding correctly print for sub-mesh DTensor (#121216)
**Summary**
In `visualize_sharding` we chose to only print on rank 0 (the global rank), which means calling `visualize_sharding` will never print anything when the dtensor object's mesh doesn't include rank 0 (i.e. a sub-mesh). This PR has `visualize_sharding` always print on the rank whose mesh coordinate is (0, 0, ..., 0) instead of the rank whose global rank is 0.
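A hedged sketch of the call (the module path and launch setup are assumptions; run under torchrun with 4 ranks):

```python
# After this PR, visualize_sharding prints from the rank at mesh coordinate
# (0, ..., 0) of the DTensor's own mesh, so it still prints when that mesh is
# a sub-mesh that does not contain global rank 0.
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Shard
from torch.distributed._tensor.debug import visualize_sharding

mesh = init_device_mesh("cuda", (4,))
dt = distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)])
visualize_sharding(dt)
```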

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121216
Approved by: https://github.com/wanchaol
ghstack dependencies: #121179, #120260
2024-03-07 04:50:15 +00:00
Xilun Wu
77873f6fe5 [dtensor][1/N] add torchrec even row-wise sharding example (#120260)
**Summary**
Our goal is to demonstrate DTensor's capability to represent TorchRec's parameter sharding. Currently this is done with `ShardedTensor`, and theoretically `DTensor` can replace it with minor changes.

This PR serves as a start of this effort by adding an example test that represents TorchRec's `ShardingType.ROW_WISE` using DTensor (a minimal sketch follows below). Note that this PR only covers the even sharding case.
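A hedged sketch of the idea (not copied from the example script): a TorchRec-style embedding table that is row-wise sharded across ranks can be expressed as a DTensor sharded on dim 0.

```python
# Run under torchrun with 4 ranks; each rank then holds an even 256-row shard.
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Shard

mesh = init_device_mesh("cuda", (4,))
table = torch.randn(1024, 64)                                # (num_embeddings, embedding_dim)
row_wise_table = distribute_tensor(table, mesh, [Shard(0)])  # row-wise == shard dim 0
```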

**Test Run**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e row-wise`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120260
Approved by: https://github.com/wanchaol
ghstack dependencies: #121179
2024-03-07 04:50:15 +00:00
Xilun Wu
9cc0f23e5c [dtensor][debug] allow visualize_sharding to print header (#121179)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121179
Approved by: https://github.com/wanchaol
2024-03-07 04:50:06 +00:00
jmarin
a2854ae904 Bugfix consume_prefix_in_state_dict_if_present function to keep the order of the state_dict (#117464)
This PR keeps the same key order as the original state_dict, as the issue creator proposed. It also fixes a bug in how ``_metadata`` is handled (see below), along with other small changes to properly remove the prefix when it is present.

In the original code, ``_metadata`` was handled as a ``key``.

```
    # also strip the prefix in metadata if any.
    if "_metadata" in state_dict:
```

This is not the case; ``_metadata`` is actually an ``attribute``. Hence, the previous condition is changed to:

```
    # also strip the prefix in metadata if any.
    if hasattr(state_dict, "_metadata"):
```

This PR also includes the necessary test.
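A hedged usage sketch of the helper being fixed here; the DataParallel wrapper is just a convenient way to produce the ``module.`` prefix:

```python
# Strip the "module." prefix added by DataParallel/DDP while keeping the
# original key order and the state_dict's _metadata attribute intact.
import torch
from torch.nn.modules.utils import consume_prefix_in_state_dict_if_present

model = torch.nn.Linear(4, 2)
state_dict = torch.nn.DataParallel(model).state_dict()   # keys: "module.weight", "module.bias"
consume_prefix_in_state_dict_if_present(state_dict, "module.")
model.load_state_dict(state_dict)                        # keys: "weight", "bias", order preserved
```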

Fixes #106942

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117464
Approved by: https://github.com/mikaylagawarecki
2024-03-07 04:00:49 +00:00
Aaron Orenstein
edd80f87b8 Prevent infinite recursion within Tensor.__repr__ (#120206)
`Tensor.__repr__` calls functions which can perform logging, and that logging ends up logging `self` (via `__repr__`), causing an infinite loop. Instead of logging all the args in FakeTensor.dispatch, log the actual parameters (and use `id` to log the tensor itself).

The change to torch/testing/_internal/common_utils.py came up during testing - in some ways of running the test, `parts` was `('test', 'test_testing.py')`, so `i` was 0 and we were doing a join on `()`, which caused an error.

Repro:
```
import torch
from torch.testing import make_tensor
from torch._subclasses.fake_tensor import FakeTensor, FakeTensorMode
t = torch.sparse_coo_tensor(((0, 1), (1, 0)), (1, 2), size=(2, 2))
t2 = FakeTensor.from_tensor(t, FakeTensorMode())
print(repr(t2))
```
and run with `TORCH_LOGS=+all`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120206
Approved by: https://github.com/yanboliang, https://github.com/pearu
2024-03-07 02:24:45 +00:00
laith sakka
eb4d87f237 graph break on sparse tensors constructions (#120458)
Fixes some tests in https://github.com/pytorch/pytorch/issues/119780.
sparse_bsc_tensor is not supported:
https://github.com/pytorch/pytorch/pull/117907

More details about the issue are here:
https://docs.google.com/document/d/1EIb4qG88-SjVFn5TloLERliYdxIu2hwYoAA8skjOVfo/edit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120458
Approved by: https://github.com/ezyang
2024-03-07 02:17:41 +00:00
Wanchao Liang
1a28ebffb3 [TP] Introduce Sequence Parallel Style for LayerNorm/RMSNorm/Dropout (#121295)
As titled, this PR introduces a dedicated `ParallelStyle` to shard the
nn.LayerNorm/nn.Dropout/RMSNorm layers. We were mainly using manual
distribute_module calls before when sharding the RMSNorm layer, but I
think we should have a dedicated TP API to easily shard those layers,
instead of users manually using DTensors.

I call this SequenceParallel, which might bring some confusion since we
technically "deprecated" a SequenceParallel style months ago. But this
time the SequenceParallel style is significantly different from the
previous one (which used to shard two consecutive Linear layers). I
believe getting the name right is the first priority, instead of
worrying about reusing the old name.
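A hedged usage sketch, assuming the API surface introduced by this PR and a torchrun launch with 8 ranks; the module and layer names are illustrative:

```python
# Apply the new SequenceParallel style to norm/dropout layers via the usual
# parallelize_module plan, instead of hand-rolling distribute_module calls.
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import parallelize_module, SequenceParallel

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.norm = nn.LayerNorm(16)
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        return self.dropout(self.norm(x))

tp_mesh = init_device_mesh("cuda", (8,))
block = Block()
parallelize_module(block, tp_mesh, {"norm": SequenceParallel(), "dropout": SequenceParallel()})
```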

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121295
Approved by: https://github.com/awgu, https://github.com/tianyu-l
ghstack dependencies: #121294
2024-03-07 02:04:59 +00:00
Eddie Yan
967dd31621 [cuDNN] Cleanup cuDNN < 8.1 ifdefs (#120862)
Follow-up of #95722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120862
Approved by: https://github.com/Skylion007
2024-03-07 01:46:25 +00:00
briancoutinho
b9087f8571 [profiler] Add execution_trace_observer as an optional argument to profiler (#119912)
# Update Profiler API to collect Execution Traces

## TLDR
We would like to simplify collecting Execution Trace and Kineto together. Execution Trace and Kineto both provide meaningful information that can be combined to enable benchmarking, performance analysis and simulating new hardware.
```
import torch

def main():
    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        …
        execution_trace_observer=ExecutionTraceObserver() # <<<<<<< NEW
    ) as prof:
        ...
        prof.step()
```

See test/profiler/test_profiler.py 'test_execution_trace_with_kineto' for an example of using this API.

## What are Execution Traces?
[Chakra Execution Traces](https://github.com/mlcommons/chakra/wiki) offer a graph based representation of AI/ML workloads.  It stands apart from conventional AI/ML frameworks by focusing on replay benchmarks, simulators, and emulators, prioritizing agile performance modeling and adaptable methodologies.
- Chakra is part of the MLCommons industry standard and is being adopted by other companies besides NVIDIA too.
- At Meta we have instrumented the PyPer framework to collect Execution Traces. More details on our [PyTorch implementation of Chakra can be found here](https://github.com/mlcommons/chakra/wiki).

Chakra essentially enables benchmarking and co-design for ML models without having to reproduce entire software stacks, and helps companies collaborate [[chakra paper](https://arxiv.org/pdf/2305.14516.pdf)].

## Why correlate Execution Trace with PyTorch/Kineto Trace

Execution Traces and Kineto traces provide different types of information, and combining them is valuable. While PyTorch ETs focus on CPU operators with explicit dependencies between them, Kineto traces encode GPU operators with their start and end times. In addition, collecting them at different times would be inaccurate, since several operations (NCCL, embedding lookup) are data dependent and may not match correctly.
Thus, it makes sense to collect both ET and Kineto together. The problem is that there are currently two separate code paths.

## Proposal
The proposal is to modify the PyTorch profiler (Kineto) API to enable execution traces to be collected simultaneously; see the TLDR section.

# Testing
Updated the unit test for collecting Kineto and Execution Trace together.
- Check that the collected ET has the right range of events.
- Compare two sets of IDs - record function IDs in ET and external IDs in Kineto - and check that they have a constant difference.

```
pytest test/profiler/test_profiler.py  -k test_execution_trace_with_kineto -rP

Running 1 items in this shard

test/profiler/test_profiler.py [W execution_trace_observer.cpp:682] Enabling Execution Trace Observer
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[W execution_trace_observer.cpp:694] Disabling Execution Trace Observer
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119912
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
2024-03-07 01:30:26 +00:00
Lucas Pasqualin
eb1145436a [DCP] Adds main in format utils (#120128)
Adds main in format utils. Usage:

`python -m torch.distributed.checkpoint.format_utils dcp_to_torch dcp_dir torch_file.pt`

or

`python -m torch.distributed.checkpoint.format_utils torch_to_dcp torch_file.pt dcp_dir`

Differential Revision: [D53791355](https://our.internmc.facebook.com/intern/diff/D53791355/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120128
Approved by: https://github.com/fegin, https://github.com/wz337
2024-03-07 01:18:17 +00:00
cyy
5cc511f72f Use c10::irange and fix other index types in ForeachReduceOp.cu (#121123)
This PR follows the suggestions in #121066 and changes most loops to c10::irange.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121123
Approved by: https://github.com/soulitzer
2024-03-07 00:11:27 +00:00
Xiaodong Wang
c268ce4a6d Make ATen-cpu cuda/rocm agnostic (#121082)
Summary: This rocm-specific logic makes aten-cpu code diverge between rocm and cuda. This is not good because we won't be able to share aten-cpu.so between rocm and cuda. More specifically, it prevents us from building aten-hip by default, which would require us to set up rocm-specific rules, an extra burden for our build system.

Test Plan: sandcastle + oss ci

Differential Revision: D54453492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121082
Approved by: https://github.com/jeffdaily, https://github.com/aaronenyeshi, https://github.com/albanD
2024-03-06 23:51:40 +00:00
Yichen Yan
e50ded03a6 Use type check for also is_not (#113859)
Handle `is_not` for:

9647a251cb/torch/_dynamo/variables/builtin.py (L1314-L1317)

I noticed https://github.com/pytorch/pytorch/issues/111713 exists; I think there's no harm in landing this first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113859
Approved by: https://github.com/Skylion007
2024-03-06 23:12:42 +00:00
Wanchao Liang
a88356f45c [dtensor] make add_.Tensor/div_.Scalar to be linear pointwise instead (#121294)
add_.Tensor and div_.Scalar should support linearity so that we can delay materializing the partial results.

This fixes the additional collective in the layernorm layer that we have seen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121294
Approved by: https://github.com/tianyu-l
2024-03-06 22:52:18 +00:00
Edward Z. Yang
2f064d895c Switch TORCH_TRACE to accept a directory by default (#121331)
A directory is better because it works smoothly with distributed
runs; otherwise you'd need to modify torchrun to set up distinct
log names for each file.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D54597814](https://our.internmc.facebook.com/intern/diff/D54597814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121331
Approved by: https://github.com/albanD
2024-03-06 22:46:18 +00:00
Andrew Gu
372f192050 [DTensor] Initialized RNG tracker if needed (#121328)
Since we are already checking if the RNG tracker is initialized, there is no real performance difference between erroring vs. just initializing a default RNG tracker (which we choose to be the `OffsetBasedRNGTracker`).

```
pytest test/distributed/_composable/fsdp/test_fully_shard_init.py -k test_meta
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121328
Approved by: https://github.com/wanchaol
ghstack dependencies: #120351
2024-03-06 22:21:44 +00:00
Denis Yaroshevskiy
b0e2ed4d67 removing some macros (#120314)
Summary: I will be making some changes in the surrounding code; they are going to be easier without these macros.

Differential Revision: D54001770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120314
Approved by: https://github.com/zhxchen17
2024-03-06 22:06:05 +00:00
Lourencom
69cedc16c5 Add padding dimension checks and tests (#121298)
Fixes #121093

Previously, calling the following functions with invalid padding dimensions would cause a segmentation fault:
```
torch._C._nn.replication_pad1d, torch._C._nn.replication_pad2d, torch._C._nn.replication_pad3d
```

To fix this, condition checks were added so these functions raise a runtime error with a debug message specifying the required dimensions instead.
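A hedged sketch of the behavior change (the exact error message wording is an assumption):

```python
# A 2-D input is invalid for replication_pad3d; this used to segfault and
# should now raise a RuntimeError describing the required dimensions.
import torch

x = torch.randn(2, 3)
try:
    torch._C._nn.replication_pad3d(x, (1, 1, 1, 1, 1, 1))
except RuntimeError as e:
    print("rejected:", e)
```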
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121298
Approved by: https://github.com/mikaylagawarecki
2024-03-06 21:55:34 +00:00
Yifu Wang
d7a5e59647 [dynamo] support group=None when rewriting collectives (#121043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121043
Approved by: https://github.com/awgu
2024-03-06 21:37:19 +00:00
lezcano
3fee05f242 Triage the remaining fallbacks (#121312)
Building off work from @amjames. There may be some misclassifications; feel free to flag them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121312
Approved by: https://github.com/jansel
2024-03-06 21:23:47 +00:00
Andrew Gu
e865700f6a [FSDP2] Added initial meta-device init support (#120351)
This PR adds initial support for meta-device initialization for pre-training without loading from a state dict. The idea is to allow `fully_shard(module)` to return and still have sharded parameters on meta device. Then, the user is free to initialize them as they please, e.g. using `to_empty()`.

We override `_apply` to achieve the following:
- Reshard the parameters to ensure that sharded parameters are registered (for correctness) -- we will always need this
- Pad new local tensors and use the padded local tensors (to handle uneven sharding) -- we will remove this once `DTensor` pads its local tensor

We use the `swap_tensors` path in `_apply`. For now, this requires setting `torch.__future__.set_swap_module_params_on_conversion(True)`; however, in the future, this may be enabled by default for wrapper subclasses and will not need any explicit API call. If requiring this call is too intrusive in the short term, we can also call it in `_apply` or when importing `fully_shard`.

```
# Pre-training flow (no checkpoint)
global_mesh = init_device_mesh(..., mesh_dim_names=("dp", "tp"))
dp_mesh, tp_mesh = global_mesh["dp"], global_mesh["tp"]
with torch.device("meta"):
  model = ...
  parallelize_module(model, tp_mesh, ...)
  fully_shard(model, mesh=dp_mesh, ...)
for param in model.parameters():
  assert param.device.type == "meta"

model.to_empty(device="cuda")
random.manual_seed(42, global_mesh)
for module in model.modules():
  if hasattr(module, "reset_parameters"):
    module.reset_parameters()
```

This PR includes some minor changes to allow the user to similarly cast the module to a different dtype after construction time but before forward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120351
Approved by: https://github.com/wanchaol
2024-03-06 21:18:25 +00:00
Johannes Aalto
3cf02c5e06 [Dev Container] Fix container build by preventing conda prompt (#121128)
Without this, the build will freeze with the prompt:
  Proceed ([y]/n)?

I'm using rootless podman in vscode instead of docker, but I think that should not affect this. Or does conda somehow detect Docker but not Podman? Anyway, this should not break anything.

Btw, I also had to uncomment the line "remoteUser": "root" in devcontainer.json to finish the post-installation properly, but I guess there might be other workarounds - and perhaps you don't want to run as root if your container has root privileges.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121128
Approved by: https://github.com/drisspg
2024-03-06 20:50:40 +00:00
angelayi
58ac4a2007 Remove llava from ci_expected_accuracy as it's flaky (#121322)
https://github.com/pytorch/pytorch/pull/121029 added it to the CI, but the test is flaky on HUD: it alternates between fail_accuracy and fail_to_run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121322
Approved by: https://github.com/desertfire
2024-03-06 20:47:01 +00:00
PyTorch MergeBot
23fb37fa41 Revert "[export] Serialize union fields with single entry dict. (#121263)"
This reverts commit 7feabe9b73.

Reverted https://github.com/pytorch/pytorch/pull/121263 on behalf of https://github.com/osalpekar due to A large number of inductor benchmarking jobs failing starting this PR. See for details: 7feabe9b73 ([comment](https://github.com/pytorch/pytorch/pull/121263#issuecomment-1981680049))
2024-03-06 19:58:55 +00:00
Tobias Ringwald
76f3663efe Fixed a memory leak when calling from_numpy on a numpy array with an … (#121156)
…unsupported dtype.

Fixes #121138.

The lambda function that DECREFs the object is not called when the dtype conversion function throws. This PR moves the conversion before the INCREF, which prevents the memory leak.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121156
Approved by: https://github.com/soulitzer, https://github.com/albanD
2024-03-06 19:37:38 +00:00
Kurman Karabukaev
360761f7d0 [Torchelasic] Create root log directory by default (#121257)
Summary:
After the refactoring in https://github.com/pytorch/pytorch/pull/120691, the default behavior unintentionally changed from creating a tempdir for logging to not capturing any logs from the torch Elastic Agent.

Reverting the behavior to:
- making a tempdir when the log dir is not specified
- allowing a non-empty root log dir
    - Note: in case the attempt folder exists, it will be pruned here: https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/multiprocessing/api.py#L294

Differential Revision: D54531851

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121257
Approved by: https://github.com/d4l3k
2024-03-06 18:50:38 +00:00
Thiago Crepaldi
418568d2e3 Add Float8 support to onnx exporter (#121281)
Fixes #106877

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121281
Approved by: https://github.com/BowenBao, https://github.com/titaiwangms
2024-03-06 18:46:56 +00:00
cyy
5a2527db22 [Clang-tidy header][22/N] Fix clang-tidy warnings in aten/src/ATEN/*.{cpp,h} (#121102)
This PR continues to fix clang-tidy warnings in aten/src/ATEN/*, following #120763.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121102
Approved by: https://github.com/Skylion007
2024-03-06 18:36:31 +00:00
Michael Lazos
c5ef4df274 guard on grads being None in compiled optimizers (#121291)
Fixes #115607

We were missing guards for when the grads were set to `None`. So if we compiled the optimizer with grads set to their proper values, and then ran with the grads set to `None`, we'd continuously run the `None` version, because all of its guards would pass and it would be ordered before the correct version in the cache.
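A hedged repro-style sketch of the scenario (a single optimizer whose compiled step runs once with grads populated and once after grads are set to None):

```python
# Without the new guards, the cache entry compiled for one grad state could be
# picked for the other, since nothing in the guards distinguished grads=None
# from populated grads.
import torch

model = torch.nn.Linear(4, 4)
opt = torch.optim.Adam(model.parameters())

@torch.compile
def step():
    opt.step()

model(torch.randn(2, 4)).sum().backward()
step()                              # compiled with grads populated
opt.zero_grad(set_to_none=True)
step()                              # grads are None; guards must tell this apart
```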

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121291
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
2024-03-06 18:33:23 +00:00
Zhengxu Chen
7feabe9b73 [export] Serialize union fields with single entry dict. (#121263)
Summary: Remove the "$type" and "$value" fields; instead, serialize union fields directly as a single-entry {type: value} dict.
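A hedged illustration of the schema change; the field and variant names below are made up, not the real export schema:

```python
# Union fields go from explicit "$type"/"$value" keys to a single-entry dict.
before = {"sym_int": {"$type": "as_int", "$value": 3}}
after = {"sym_int": {"as_int": 3}}
assert len(after["sym_int"]) == 1  # exactly one entry identifies the union variant
```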

Test Plan: CI

Differential Revision: D54553770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121263
Approved by: https://github.com/tugsbayasgalan
2024-03-06 18:16:16 +00:00
PaulZhang12
c66d68ba51 [PT2] Add tolist() to FunctionalTensor for torch.export (#121242)
Adding tolist() to FunctionalTensor for torch.export of TorchRec data types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121242
Approved by: https://github.com/ezyang
2024-03-06 18:10:44 +00:00
Simon Fan
05c256849b [compiled autograd] support custom ops backed by c++ autograd::Function (#120681)
- Adds support for custom ops backed by C++ custom autograd functions, e.g. fbgemm
- Includes files more granularly to avoid namespace pollution and circular imports

Limitations:
- Requires the user to audit their code and opt in their custom autograd::Function via autograd::Function::is_traceable, and possibly an additional compiled_args + apply_with_saved implementation. This was the only way I could think of for soundness.
- Will throw if we can't hash the saved_data, i.e. for any type other than list and dict that is not implemented in at::IValue::hash b0cfa96e82/aten/src/ATen/core/ivalue.cpp (L364)
- Can technically silently fail if the typeid hash and the typeid string name of the custom autograd::Function both collide at the same time, and an identical autograd graph containing a different custom autograd::Function with an identical implementation is called. This case seems extremely unlikely, and the only alternative to hash collisions I can think of is compiling with reflection.
- Tensors not saved via save_variables are not lifted, and are specialized on the TensorImpl*'s hash (treated as a memory address). If needed, we can lift them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120681
Approved by: https://github.com/jansel
2024-03-06 18:01:56 +00:00
blorange-amd
b27d76949b [ROCm] Enable several fake_crossref UTs on ROCm (#121112)
Enabled unit tests:

test_ops::TestFakeTensorCUDA::test_fake_crossref_backward_amp_linalg_norm_subgradients_at_zero_cuda_float32
test_ops::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_linalg_norm_subgradients_at_zero_cuda_float32
test_ops::TestFakeTensorCUDA::test_fake_crossref_backward_amp_norm_nuc_cuda_float32
test_ops::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_norm_nuc_cuda_float32
test_ops::TestFakeTensorCUDA::test_fake_crossref_backward_amp_svd_cuda_float32
test_ops::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_svd_cuda_float32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121112
Approved by: https://github.com/ezyang
2024-03-06 17:36:47 +00:00
PyTorch MergeBot
b529c19bdf Revert "Batch Norm Consolidation (#116092)"
This reverts commit 5680f565d5.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/jeffdaily due to broke ROCm, PR signal was clean but trunk was not, the merge should have been blocked but wasn't ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1981373237))
2024-03-06 17:10:01 +00:00
jyomu
8dd4b6a78c Fix venv compatibility issue by updating python_lib_path (#121103)
sys.executable is the absolute path of the executable binary for the Python interpreter, and referencing it here may not be appropriate. Instead, sys.base_exec_prefix is more suitable, and with this change the library is correctly resolved when using a venv. I have tested it with a venv created by rye.

https://docs.python.org/3.6/library/sys.html#sys.executable

> A string giving the absolute path of the executable binary for the Python interpreter, on systems where this makes sense. If Python is unable to retrieve the real path to its executable, [sys.executable](https://docs.python.org/3.6/library/sys.html#sys.executable) will be an empty string or None.

https://docs.python.org/3.6/library/sys.html#sys.exec_prefix

> A string giving the site-specific directory prefix where the platform-dependent Python files are installed; by default, this is also '/usr/local'. This can be set at build time with the --exec-prefix argument to the configure script. Specifically, all configuration files (e.g. the pyconfig.h header file) are installed in the directory exec_prefix/lib/pythonX.Y/config, and shared library modules are installed in exec_prefix/lib/pythonX.Y/lib-dynload, where X.Y is the version number of Python, for example 3.2.

https://docs.python.org/3.6/library/sys.html#sys.base_exec_prefix

> Set during Python startup, before site.py is run, to the same value as [exec_prefix](https://docs.python.org/3.6/library/sys.html#sys.exec_prefix). If not running in a [virtual environment](https://docs.python.org/3.6/library/venv.html#venv-def), the values will stay the same; if site.py finds that a virtual environment is in use, the values of [prefix](https://docs.python.org/3.6/library/sys.html#sys.prefix) and [exec_prefix](https://docs.python.org/3.6/library/sys.html#sys.exec_prefix) will be changed to point to the virtual environment, whereas [base_prefix](https://docs.python.org/3.6/library/sys.html#sys.base_prefix) and [base_exec_prefix](https://docs.python.org/3.6/library/sys.html#sys.base_exec_prefix) will remain pointing to the base Python installation (the one which the virtual environment was created from).
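A quick way to see these values from inside a venv (standard library only):

```python
# In a venv, sys.executable and sys.exec_prefix point into the venv, while
# sys.base_exec_prefix keeps pointing at the base installation where the
# platform-dependent files (pyconfig.h, lib-dynload) actually live.
import sys

print("executable:       ", sys.executable)
print("exec_prefix:      ", sys.exec_prefix)
print("base_exec_prefix: ", sys.base_exec_prefix)
```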
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121103
Approved by: https://github.com/ezyang
2024-03-06 17:00:46 +00:00
mingfeima
a427d90411 add int4 packed gemm support on CPU device (#117475)
This patch adds int4 packed gemm support on CPU; both `avx512` and `avx2` are supported. It is used to speed up https://github.com/pytorch-labs/gpt-fast

The default perf measured on Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) is `16.13 sec total, 12.40 tokens/sec`

* WOQ int4 on avx512: `5.92 sec total, 33.79 tokens/sec`
* WOQ int4 on avx2: `6.90 sec total, 29.00 tokens/sec`

WOQ int4 is measured with method: https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#int4-weight-only-quantization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117475
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-03-06 16:25:53 +00:00
Guilherme Leobas
54d92f2e37 Add jacrev support in torch.compile (#121146)
The changes are simple: moved a few entries in trace_rules.py and included tests to compare the graph generated by jacrev.
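A hedged minimal sketch of the newly supported pattern (fullgraph=True reflects the intent of the PR; treat it as illustrative):

```python
# Compile a jacrev-transformed function; for a scalar-output f this computes cos(x).
import torch
from torch.func import jacrev

def f(x):
    return torch.sin(x).sum()

compiled_jac = torch.compile(jacrev(f), fullgraph=True)
print(compiled_jac(torch.randn(3)))
```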

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121146
Approved by: https://github.com/zou3519
2024-03-06 16:05:33 +00:00
vfdev-5
49d1fd31cf Fuse nodes with sizes (s0*s1*...,) and (s0, s1, s2, ...) (#120077)
Description:
- This PR tries to fuse nodes with compatible sizes, for example `node1: (s0, s1, s2)` and `node2: (s0 * s1 * s2)`. On `main` these two nodes cannot be fused due to their different sizes. With this PR we recompute node2's size, body, etc. using node1's indexing constraints and are thus able to fuse the two nodes.
- This should influence only the cpu device.

Example:
```python
from unittest.mock import patch
import torch
from torch._inductor.graph import GraphLowering
from torch._inductor import config

# Force multple scheduler nodes creation to fuse them
config.realize_opcount_threshold = 1

@torch.compile(fullgraph=True, dynamic=True)
def fn(x: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    o1 = x * w1.view(1, 1, 1, -1)
    o2 = x * w2.view(1, 1, 1, -1)
    output = o1 + o2
    return output

in_nodes = []
outputs = []
run_node = GraphLowering.run_node

graph_lowering_obj = None

def run_node_alt(self, n):
    global graph_lowering_obj

    graph_lowering_obj = self
    in_nodes.append(n)
    output = run_node(self, n)
    outputs.append(output)

    return output

x = torch.rand(1, 3, 32, 32)
w1 = torch.randn(32)
w2 = torch.randn(32)

with patch.object(GraphLowering, "run_node", run_node_alt):
    fn(x, w1, w2)

print("graph_lowering_obj.buffers:", graph_lowering_obj.buffers)
print("graph_lowering_obj.scheduler:", graph_lowering_obj.scheduler.nodes)
```

Output on `main`:
```
graph_lowering_obj.buffers: [ComputedBuffer(name='buf0', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg1_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul,
  origins={mul}
)), ComputedBuffer(name='buf1', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg4_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul_1,
  origins={mul_1}
)), ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(buf0, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(buf1, i3 + i1 * s0**2 + i2 * s0)
      tmp2 = tmp0 + tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=add,
  origins={add}
))]
graph_lowering_obj.scheduler: [FusedSchedulerNode(nodes=buf0_buf1), SchedulerNode(name='buf2')]
```
Output on this PR:
```
graph_lowering_obj.buffers: [ComputedBuffer(name='buf0', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg1_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul,
  origins={mul}
)), ComputedBuffer(name='buf1', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg4_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul_1,
  origins={mul_1}
)), ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(buf0, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(buf1, i3 + i1 * s0**2 + i2 * s0)
      tmp2 = tmp0 + tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=add,
  origins={add}
))]
graph_lowering_obj.scheduler: [FusedSchedulerNode(nodes=buf0_buf1_buf2)]
```

Context:
While working on https://github.com/pytorch/pytorch/pull/120411 (upsampling bicubic decomposition), I saw an extra for-loop in the generated C++ code summing up two buffers. Exploring the cause, it happened because the buffer's op count went beyond `config.realize_opcount_threshold`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120077
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/peterbell10
2024-03-06 12:19:45 +00:00
Yukio Siraichi
aa0b0944d5 [dynamo] Re-dispatch torch.Tensor.new into torch.Tensor.new_empty method. (#121075)
Fix: https://github.com/pytorch/xla/issues/6009

This PR adds another case to the `TensorVariable.method_new` special handling, where it
re-dispatches `new` into `new_empty`.

Since we are using fake tensors, the `new` call doesn't actually get to the corresponding
backend (e.g. XLA). So, things like the following might happen:

```python
@torch.compile(backend="openxla")
def foo(x):
    new_x = x.new(*x.size())

    # new_x.device() == "xla"
    # x.device() == "xla:0"

    return new_x + x

a = torch.arange(10)
foo(a.to(xm.xla_device()))
```

Resulting in the following error:

```python
Traceback (most recent call last):
  ...
  File "torch/_dynamo/utils.py", line 1654, in get_fake_value
    ret_val = wrap_fake_exception(
  File "torch/_dynamo/utils.py", line 1190, in wrap_fake_exception
    return fn()
  File "torch/_dynamo/utils.py", line 1655, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
  File "torch/_dynamo/utils.py", line 1776, in run_node
    raise RuntimeError(make_error_message(e)).with_traceback(
  File "torch/_dynamo/utils.py", line 1758, in run_node
    return node.target(*args, **kwargs)
  File "torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "torch/_subclasses/fake_tensor.py", line 885, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "torch/_subclasses/fake_tensor.py", line 1224, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
  File "torch/_subclasses/fake_tensor.py", line 955, in _cached_dispatch_impl
    output = self._dispatch_impl(func, types, args, kwargs)
  File "torch/_subclasses/fake_tensor.py", line 1445, in _dispatch_impl
    return self.wrap_meta_outputs_with_default_device_logic(
  File "torch/_subclasses/fake_tensor.py", line 1575, in wrap_meta_outputs_with_default_device_logic
    return tree_map(wrap, r)
  File "torch/utils/_pytree.py", line 900, in tree_map
    return treespec.unflatten(map(func, *flat_args))
  File "torch/utils/_pytree.py", line 736, in unflatten
    leaves = list(leaves)
  File "torch/_subclasses/fake_tensor.py", line 1550, in wrap
    ) = FakeTensor._find_common_device(func, flat_args)
  File "torch/_subclasses/fake_tensor.py", line 625, in _find_common_device
    merge_devices(arg)
  File "torch/_subclasses/fake_tensor.py", line 620, in merge_devices
    raise RuntimeError(
torch._dynamo.exc.TorchRuntimeError: Failed running call_function <built-in function add>(*(FakeTensor(..., device='xla', size=(10,), dtype=torch.int64), FakeTensor(..., device='xla:0', size=(10,), dtype=torch.int64)), **{}):
Unhandled FakeTensor Device Propagation for aten.add.Tensor, found two different devices xla, xla:0
```

Using `new_empty` instead fixes this error because it uses the device from the source
tensor, rather than inferring it from the current dispatch key set.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121075
Approved by: https://github.com/jansel
2024-03-06 11:49:27 +00:00
Animesh Jain
e3bd6efe72 [dynamo][guards-cpp-refactor] Prevent duplication of leaf guards (#121164)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121164
Approved by: https://github.com/jansel
ghstack dependencies: #121121, #121147, #121154
2024-03-06 08:36:45 +00:00
Animesh Jain
b6b2d5b00a [dynamo][guards-cpp-refactor] Pass source name for debug ease (#121154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121154
Approved by: https://github.com/jansel
ghstack dependencies: #121121, #121147
2024-03-06 08:36:45 +00:00
Animesh Jain
52d89d8491 [dynamo][guards-cpp-refactor] Simplify DictGuardManager by removing KeyValueDictGuardManager (#121147)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121147
Approved by: https://github.com/jansel
ghstack dependencies: #121121
2024-03-06 08:36:45 +00:00
Animesh Jain
af7f55ffc8 [dynamo][guards-cpp-refactor] Add argnames in pybind'ings (#121121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121121
Approved by: https://github.com/jansel
2024-03-06 08:36:45 +00:00