Motivation:
- The API is used for registering an implementation for a specific
device type.
- "impl" is ambiguous and can be confused with Library.impl.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124200
Approved by: https://github.com/albanD
ghstack dependencies: #124180
This PR adds fast semi-structured sparsification kernels to PyTorch.
These kernels allow for accelerated semi-structured sparsification
in PyTorch.
The kernels have been added as aten native functions.
In particular, three new functions have been added:
* `torch._sparse_semi_structured_tile`
This function will return the packed representation and metadata for
both X and X', as well as the thread masks. Note that this applies 2:4
sparsity in a 4x4 tile instead of a 1x4 strip as usual.
* `torch._sparse_semi_structured_apply`
This function takes in an input tensor and thread masks from the above
function and returns a packed representation and metadata from applying
thread masks to the input tensor.
* `torch._sparse_semi_structured_apply_dense`
This function does the same thing as above, but instead of returning the
tensor in the sparse representation, it returns it in the dense
representation.
The subclasses have also been updated to add a new
`prune_dense_static_sort`
classmethod to create sparse tensors with this format. I've added some
additional documentation on how to calculate the compressed tensors
needed to create a SparseSemiStructuredTensor oneself.
To this end, there are two new helper functions added:
* `sparse_semi_structured_tile`
* `compute_compressed_swizzled_bitmask`
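For reference, a hypothetical usage sketch of the entry points described above (the exact number and order of the returned tensors is an assumption here; treat it as illustrative rather than the precise signatures):
```python
import torch

dense = torch.randn(128, 128, dtype=torch.float16, device="cuda")

# Pack X and X' and compute the per-thread masks (2:4 within each 4x4 tile).
outs = torch._sparse_semi_structured_tile(dense)
threads_masks = outs[-1]  # assumption: the thread masks are the last output

# Re-apply the cached masks to another tensor of the same shape,
# getting back the packed representation and metadata...
packed_new = torch._sparse_semi_structured_apply(dense * 2, threads_masks)

# ...or the pruned tensor materialized in the dense layout.
pruned_dense = torch._sparse_semi_structured_apply_dense(dense * 2, threads_masks)
```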
Differential Revision: [D56190801](https://our.internmc.facebook.com/intern/diff/D56190801)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122350
Approved by: https://github.com/cpuhrsch
Summary:
```
ncclGroupStart()
ncclCommInit(..)
ncclGroupEnd()
```
The above pattern is only needed when we have a *single thread* managing multiple GPUs.
In our case, we always have one process managing one GPU, so we don't need the group operations.
Test Plan: CI
Differential Revision: D56274975
Co-authored-by: Cen Zhao <cenzhao@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124416
Approved by: https://github.com/shuqiangzhang
On par with the `CUDA` implementation.
The `autocast` logic is the same as for `CUDA` + `Fused Adam`:
- check for `inf` in `gradscaler.step`
- in the fused kernel, if there is an `inf`, do nothing; otherwise, unscale the grad (also writing it back) and update the param.
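As a usage sketch (the module and hyperparameters are illustrative; `fused=True` on CPU parameters is what this PR enables):
```python
import torch

model = torch.nn.Linear(64, 64)  # CPU module
opt = torch.optim.Adam(model.parameters(), lr=1e-3, fused=True)  # fused CPU Adam

x = torch.randn(8, 64)
loss = model(x).sum()
loss.backward()
opt.step()
opt.zero_grad()
```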
**TestPlan**:
```
# extend CUDA only test for CPU fused adagrad
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_torch.py -k test_grad_scaling_autocast_fused
# extend fused test
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
python test_optim.py -k test_can_load_older_state_dict
# newly added test (follow 6b1f13ea2f/test/test_cuda.py (L1108))
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
```
**Benchmark**:
**5.1x** speedup on a 56-core SPR
**Parameter-size=1M**
**Nparams=10**
[test script](https://gist.github.com/zhuhaozhe/ef9a290ad3f8f4067b3373a3bdaa33e7)
```
numactl -C 0-55 -m 0 python bench_adam.py
non-fused 6.0174267292022705 s
fused 1.1787631511688232 s
```
**Note: Fused kernel accuracy**
The accuracy failure in CI shows a mismatch slightly above the default tolerance:
```
2024-04-02T06:09:16.2213887Z Mismatched elements: 21 / 64 (32.8%)
2024-04-02T06:09:16.2214339Z Greatest absolute difference: 1.5735626220703125e-05 at index (6, 6) (up to 1e-05 allowed)
2024-04-02T06:09:16.2214813Z Greatest relative difference: 1.0073336852656212e-05 at index (4, 1) (up to 1.3e-06 allowed)
```
I have debugged it step by step and, unfortunately, we may not be able to make the `fused kernel` produce exactly the same results as the `non-fused` one due to compiler optimizations.
For example, in the non-fused impl:
```
exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
```
and in the fused impl:
```
exp_avg_sq_ptr[d] = scalar_t(beta2) * exp_avg_sq_ptr[d];
// std::cout << "exp_avg_sq " << exp_avg_sq_ptr[d] << std::endl;
exp_avg_sq_ptr[d] = exp_avg_sq_ptr[d] +
    scalar_t(exp_avg_sq_grad_coefficient) * grad_val * grad_val;
```
If I keep the `std::cout`, I get exactly the same results in the UT:
```
===============param
0.6796758770942688
0.6796758770942688
```
But when I comment it out, there is a difference:
```
===============param
0.6796758770942688
0.6796759366989136
```
So I will make the tolerance a little higher than the default.
Co-authored-by: Jane Xu <janeyx@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123074
Approved by: https://github.com/jgong5, https://github.com/janeyx99
# PR
This PR supports mutating inputs in cudagraph trees, if these inputs are outputs from previous cudagraph. Please check #121861 for more details.
# Note on Optimistic Mutation Check
To determine whether to apply cudagraph, we need to check for input mutations, which fall into four categories: a) no mutation, b) mutation on parameters/buffers, c) mutation on cudagraph-recorded tensors, d) mutation on non-cudagraph-recorded tensors. We can apply cudagraph for types a, b, and c, but not for type d. The input mutation type depends on the function, the current_node, and the inputs.
Since `check_for_mutation` is slow, there is a trade-off between making type c and type d faster.
- To make type d) faster, we want to `check_for_mutation` and call the eager function early. However, this adds unnecessary overhead to types a, b, c due to the extra check.
- To make type c) faster, we want to skip `check_for_mutation` at the beginning and only `check_for_mutation` before `record_function` for a new function. This removes the overhead of `check_for_mutation` for types a, b, c. However, this adds extra overhead to type d due to `check_invariants` for all child nodes.
Instead, we design an optimistic mutation check. The assumption is that, given a function and a node, the input mutation type usually remains the same across inputs. So, if we have ever detected a function on a node as type d, we will never detect it as type c. The detailed design is:
- [Slow Path] On the first invocation of a function on a node, we run `check_for_mutation` once and cache the input mutation type as `non_cudagraph_managed_mutation[node_id][func_id]`.
- [Fast Path] On subsequent invocations of a function on a node, we skip `check_for_mutation`. If `non_cudagraph_managed_mutation[node_id][func_id]` is true, we directly call the eager function. Otherwise, we `check_invariants` and call the cudagraph function.
- [Slow Path] Before `record_function`, we run `check_for_mutation` again.
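A schematic sketch of this dispatch logic (illustrative pseudocode only; the helper names loosely follow the description above but do not match the actual cudagraph-trees implementation):
```python
# (node_id, func_id) -> True if the function mutates non-cudagraph-managed inputs (type d)
non_cudagraph_managed_mutation = {}

def run(func, node, inputs):
    key = (node.id, func.id)
    if key not in non_cudagraph_managed_mutation:
        # Slow path: first invocation of this function on this node.
        non_cudagraph_managed_mutation[key] = has_non_cudagraph_managed_mutation(func, inputs)
    if non_cudagraph_managed_mutation[key]:
        return run_eager(func, inputs)               # type d: fall back to eager
    if is_recorded(func, node) and check_invariants(node, inputs):
        return replay_cudagraph(func, node, inputs)  # fast path for types a, b, c
    # Slow path: re-check right before recording a new cudagraph.
    if has_non_cudagraph_managed_mutation(func, inputs):
        non_cudagraph_managed_mutation[key] = True
        return run_eager(func, inputs)
    return record_function(func, node, inputs)
```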
**Q1: Would there be overhead for type a,b,c,d?**
A: No. We only check input mutation types for the first invocation of a function on a node.
**Q2: If a function happens to be type c during the first invocation on a node, could we detect it as type d in the future?**
A: Yes. This is done by `check_invariants` and guarantees correctness.
**Q3: If a function happens to be type d during the first invocation on a node, could it still be recognized as type c in the future?**
A: No. But this should happen rarely according to our assumption. In the rare case that it happens, there would not be any correctness issues and the performance is the same as the eager (or inductor optimized) function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123231
Approved by: https://github.com/eellison
Closes #114966
Frozen field assignment in `__init__` in Python 3.8-3.9:
f5bd65ed37/Lib/dataclasses.py (L402-L411)
```python
import builtins
BUILTINS = builtins

def _field_assign(frozen, name, value, self_name):
    # If we're a frozen class, then assign to our fields in __init__
    # via object.__setattr__. Otherwise, just use a simple
    # assignment.
    #
    # self_name is what "self" is called in this function: don't
    # hard-code "self", since that might be a field name.
    if frozen:
        return f'BUILTINS.object.__setattr__({self_name},{name!r},{value})'
    return f'{self_name}.{name}={value}'
```
Frozen field assignment in `__init__` in Python 3.10+:
812245ecce/Lib/dataclasses.py (L436-L445)
```python
__dataclass_builtins_object__ = object

def _field_assign(frozen, name, value, self_name):
    # If we're a frozen class, then assign to our fields in __init__
    # via object.__setattr__. Otherwise, just use a simple
    # assignment.
    #
    # self_name is what "self" is called in this function: don't
    # hard-code "self", since that might be a field name.
    if frozen:
        return f'__dataclass_builtins_object__.__setattr__({self_name},{name!r},{value})'
    return f'{self_name}.{name}={value}'
```
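For context, a small pure-Python example exercising the generated `__init__` shown above (illustrative, not from the PR):
```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class Point:
    x: int
    y: int

p = Point(1, 2)
# Direct assignment like `p.x = 3` raises dataclasses.FrozenInstanceError;
# __init__ itself assigns fields via object.__setattr__ (spelled
# BUILTINS.object.__setattr__ on 3.8-3.9 and
# __dataclass_builtins_object__.__setattr__ on 3.10+).
```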
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124393
Approved by: https://github.com/jansel
Fixes a bug where a reference to `_ProcessGroupWrapper` is used without first checking whether gloo is available. This fails on PyTorch builds that do not include gloo because `_ProcessGroupWrapper` is only pybinded when building with `USE_GLOO=1`. Therefore, creation of a new process group fails with a `NameError` when only NCCL is available as the backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124233
Approved by: https://github.com/rohan-varma, https://github.com/d4l3k
This PR unifies CUDA, XPU, and PrivateUse1 in the torch profiler. CUDA, XPU, and PrivateUse1 now all use the string `use_device` to distinguish one another, and share one device path for calculating kineto time durations and memory statistics during post-processing.
#suppress-api-compatibility-check
Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123247
Approved by: https://github.com/aaronenyeshi, https://github.com/gujinghui
Also partially fixes #122109
This PR:
- We add a C++ flag (only_lift_cpu_tensors) to toggle the
torch.tensor(1, device='cuda') ctor strategy.
When false (default), it does the current PyTorch behavior
of unconditionally constructing a concrete CUDA tensor then calling
lift_fresh on it. When true, we instead construct a concrete CPU
tensor, call lift_fresh, and then call Tensor.to(device) (under any ambient
modes).
- FakeTensorMode flips this flag depending on whether CUDA is available or
not. We don't unconditionally set the flag to True because that would
likely be BC-breaking.
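A minimal sketch of the behavior this enables (assuming a machine without CUDA; the flag itself is C++-internal and toggled by FakeTensorMode):
```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    # With only_lift_cpu_tensors enabled, no concrete CUDA tensor is ever
    # constructed: a CPU tensor is created, lifted, and moved to 'cuda'
    # under the fake mode.
    t = torch.tensor(1, device="cuda")
    print(t.device)  # cuda:0, even though no GPU is present
```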
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124413
Approved by: https://github.com/eellison
This adds a templated version of the ring attention forward function and tests it with memory-efficient attention. This doesn't add support for memory-efficient attention in DTensor; that will be added in a follow-up PR.
This templating is also a POC of how to support other attention ops, such as jagged/nested tensor attention, as well as how to implement striped attention in a scalable way.
Misc changes:
* Fixes all_to_all_single autograd implementation with CUDA + adds NCCL test
* Adds compile support to the ring attention implementations (required some tweaks to process groups)
Test plan:
```
pytest test/distributed/_tensor/test_attention.py
pytest test/distributed/test_functional_api.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124215
Approved by: https://github.com/wanchaol
Summary: In AOTInductor generated CPU model code, there can be direct references to some aten/c10 utility functions and data structures, e.g. at::vec and c10::Half. These are performance critical and thus it doesn't make sense to create C shim for them. Instead, we make sure they are implemented in a header-only way, and use this set of tests to guard future changes.
There are more header files to be updated, but we will do it in other followup PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123848
Approved by: https://github.com/jansel
ghstack dependencies: #123847
Summary:
This env var was introduced to safely roll out the behavior change in destroy
process group (e.g., call ncclCommsAbort). Now that this behavior change
has already been rolled out, we no longer need this env var, and we should clean
it up to keep our code cleaner.
Test Plan:
Modified/existing ut pass
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124334
Approved by: https://github.com/wconstab
Previously, we didn't expand the shape of the example_value of map to match the inputs (edit: the first mapped dimension). This PR fixes that bug. To make this easier, we change _call_function_and_unflatten_output to accept example_values directly instead of retrieving them from the variable trackers.
Also removes a redundant call_function node in the strict_mode higher-order op in dynamo.
Test Plan:
existing tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124203
Approved by: https://github.com/ezyang, https://github.com/zou3519
#121313 changed precompiled patterns so they are more integrated with the pattern-matching code. This resulted in a list of "known" patterns (with their example data) being stored globally. Unfortunately, since small FakeTensors store the original tensor as a constant, this meant that we leaked CUDA tensors in the example data.
Fix this by clearing out the constant storage for the example data that we keep around.
Fixes #124081
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124345
Approved by: https://github.com/xuzhao9
The MTIA device now has its own module in PyTorch.
torch.mtia has the following APIs, similar to other backends. lazy_init is also supported.
```
__all__ = [
    "init",
    "is_available",
    "synchronize",
    "device_count",
    "current_device",
    "current_stream",
    "default_stream",
    "set_stream",
    "stream",
    "device",
]
```
------------
For device management, we expand AcceleratorHooksInterface to support generic device management; it can be used in both C++ and Python.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int : ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ...
```
---------
Adds a get_device_module API to retrieve the device module for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
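A minimal usage sketch (the device string is illustrative; calling `get_device_module` with no argument is assumed to return the module for the current accelerator):
```python
import torch

mod = torch.get_device_module("cuda")  # e.g. returns torch.cuda
print(mod.device_count(), mod.current_device())
mod.synchronize()
```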
---------
@exported-using-ghexport
Differential Revision: [D52923602](https://our.internmc.facebook.com/intern/diff/D52923602/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
This diff builds a device-generic torch.Stream and torch.Event for newly added accelerators in PyTorch.
------------
**torch.Stream APIs**
```
# Defined in torch/csrc/Stream.cpp
class Stream(_StreamBase):
    stream_id: _int  # Stream id
    device_index: _int
    device_type: _int
    device: _device  # The device of the stream

    @overload
    def __new__(self, device: Optional[DeviceLikeType] = None, priority: _int = 0) -> Stream: ...
    @overload
    def __new__(self, stream_id: _int, device_index: _int, device_type: _int, priority: _int = 0) -> Stream: ...
    def query(self) -> _bool: ...
    def synchronize(self) -> None: ...
    def wait_event(self, event: Event) -> None: ...
    def wait_stream(self, other: Stream) -> None: ...
    def record_event(self, event: Optional[Event] = None) -> Event: ...
    def __hash__(self) -> _int: ...
    def __repr__(self) -> str: ...
    def __eq__(self, other: object) -> _bool: ...
```
------------------
**torch.Event APIs**:
- IPC-related APIs are not implemented, since many device backends don't support them, but we leave the interfaces there for future adaptation of torch.cuda.Stream.
- Currently only enable_timing is supported, since it is the flag most commonly used in other device backends. We would have to refactor the event flag system in PyTorch to support fancier flags.
- An elapsedTime API is added to c10::Event.
```
# Defined in torch/csrc/Event.cpp
class Event(_EventBase):
    device: _device  # The device of the Event
    event_id: _int  # The raw event created by device backend

    def __new__(self,
                device: Optional[DeviceLikeType] = None,
                enable_timing: _bool = False,
                blocking: _bool = False,
                interprocess: _bool = False) -> Event: ...
    @classmethod
    def from_ipc_handle(self, device: DeviceLikeType, ipc_handle: bytes) -> Event: ...
    def record(self, stream: Optional[Stream] = None) -> None: ...
    def wait(self, stream: Optional[Stream] = None) -> None: ...
    def query(self) -> _bool: ...
    def elapsed_time(self, other: Event) -> _float: ...
    def synchronize(self) -> None: ...
    def ipc_handle(self) -> bytes: ...
    def __repr__(self) -> str: ...
```
-----------
c10::Event provides new APIs to:
- calculate **elapsedTime**
- get the raw event id
- synchronize the event
```
double elapsedTime(const Event& event) const {
  return impl_.elapsedTime(event.impl_);
}

void* eventId() const {
  return impl_.eventId();
}

void synchronize() const {
  return impl_.synchronize();
}
```
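A minimal usage sketch of the device-generic objects (the device type shown is an assumption; any backend implementing the hooks should work):
```python
import torch

s = torch.Stream(device="cuda")
e = torch.Event(enable_timing=True)
e.record(s)        # record the event on the stream
s.synchronize()    # block until the stream's work is done
print(e.query())   # True once the recorded work has completed
```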
----------
TODO: need to find a good way to test them in PyTorch with API mocks.
Differential Revision: [D55351839](https://our.internmc.facebook.com/intern/diff/D55351839/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123611
Approved by: https://github.com/albanD
The `recurse` argument was not being respected for `set_requires_gradient_sync`. This PR fixes that.
The previous unit test did not have nested FSDP modules with managed parameters, so `recurse=False` was not being exercised. We augment the unit test to disable gradient sync only for the root module and not its children.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124318
Approved by: https://github.com/weifengpy
ghstack dependencies: #120952, #124293
A kernel has "dispatcher convention" if there is an additional keyset
arg at the beginning of the argument list. This PR:
- adds a way to register kernels with dispatcher_convention using
Library.impl (pass dispatcher_convention = True)
- adds OpOverload.redispatch
We use both of the above in the new custom ops API: we register the
autograd kernel in dispatcher convention so that we can call
redispatch the way PyTorch built-in ops do.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124089
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064, #124065, #124066, #124071
We allow it to accept:
- a string with the op name
- an opoverload
- a new-style custom op
If any of these refers to a new-style custom op (created with the
custom_op decorator), then we dispatch to CustomOpDef.register_fake.
Otherwise, we do what we did previously.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124066
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064, #124065
Fixes https://github.com/pytorch/pytorch/issues/119607 for 3.11+.
In 3.11+, `_PyFrame_FastToLocalsWithError` could implicitly run `COPY_FREE_VARS` on the original frame, leading to double increfs since the dynamo shadow frame can rerun `COPY_FREE_VARS`. So the solution is to skip the first `COPY_FREE_VARS` instruction in the shadow frame if it was already executed in the original frame.
Also move the location for clearing the original frame in 3.12 to handle error cases more thoroughly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124238
Approved by: https://github.com/jansel
Removed a bunch of skips. I also updated test_forloop_goes_right_direction to *not* use the closure when dynamo is tracing; the reason is that testing the disabled optimizer doesn't actually test anything.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123322
Approved by: https://github.com/janeyx99
ghstack dependencies: #123498
Motivations:
- This makes things more consistent: using a Library object, you should
be able to use all of the registration APIs that tie registrations to
the lifetime of the Library.
- I need this for the next PR up in the stack, where we will have
torch.library.register_fake support both CustomOpDef (from the new
custom ops API) and other custom ops.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124065
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064
Previously, if someone used `register_fake` to add a fake impl for an
operator defined in C++, we would require them to add a
`m.set_python_module(<module>)` call to C++. This was to avoid
situations where a user imported the C++ operator without importing the
fake impl.
This "breaks" open registration: there's no way to add a fake impl
outside of a repository that defines an operator, so we want to turn
this behavior off by default in open source.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124064
Approved by: https://github.com/albanD
ghstack dependencies: #123937
I'm going to set up some extra behavior when we set the example value, so
I need a convenient place to interpose. I cannot easily do it on
meta itself because it's a generic dict with no interposition point.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124176
Approved by: https://github.com/oulgen
ghstack dependencies: #124105, #124059
This PR:
- adds a new torch.library.register_fake and deprecates
torch.library.impl_abstract. The motivation is that we have a lot of
confusion around the naming so we are going to align the naming with
the actual subsystem (FakeTensor).
- renames `m.impl_abstract_pystub("fbgemm_gpu.sparse_ops")` to
`m.has_python_registration("fbgemm_gpu.sparse_ops")`. No deprecation
here yet; I need to test how this works with static initialization.
- Renames a bunch of internals to match (e.g. abstractimplpystub ->
pystub)
I'm scared to rename the Python-side internal APIs (e.g.
torch._library.abstract_impl) because of torch.package concerns. I'll do
that in its own isolated PR next just in case it causes problems.
DEPRECATION NOTE: torch.library.impl_abstract was renamed to
torch.library.register_fake. Please use register_fake. We'll delete
impl_abstract in a future version of PyTorch.
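A minimal sketch of the new name in use (toy custom op; mirrors the previous impl_abstract usage):
```python
import torch

@torch.library.custom_op("mylib::twice", mutates_args=())
def twice(x: torch.Tensor) -> torch.Tensor:
    return x * 2

@torch.library.register_fake("mylib::twice")
def _(x):
    # Fake impl: only describes output metadata, no real compute.
    return torch.empty_like(x)
```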
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123937
Approved by: https://github.com/albanD
Two changes:
- In epilogue benchmark fusion, only take the top 6 choices; there were basically no choices taken after this in HF.
- Share a single precompilation function among matmuls with the same key.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122642
Approved by: https://github.com/shunting314
ghstack dependencies: #124030
Summary: When applying FSDP-2 to the FM-FB benchmark with the FullModel model, we ran into an error because one of the output tensors of a forward pass is None. I double-checked that the same output tensor is also None in FSDP-1, so we just need to handle the None properly here.
Test Plan:
See that in the internal diff.
Differential Revision: D56087956
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123988
Approved by: https://github.com/awgu
This PR adds an `unshard(async_op: bool = False)` API to manually unshard the parameters via all-gather. This can be used for reordering the all-gather with other collectives (e.g. all-to-all).
This currently requires the user to set `TORCH_NCCL_AVOID_RECORD_STREAMS=1` to avoid `recordStream` from `ProcessGroupNCCL` and get expected memory behaviors.
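A sketch of reordering the all-gather manually (assumes a torchrun launch with an initialized process group; the module and shapes are illustrative, and the handle-based wait follows the `async_op` semantics above):
```python
import torch
import torch.distributed as dist
from torch.distributed._composable.fsdp import fully_shard

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())
model = fully_shard(torch.nn.Linear(16, 16, device="cuda"))

handle = model.unshard(async_op=True)   # launch the parameter all-gather early
# ... other collectives (e.g. all-to-all) can be issued here to overlap ...
handle.wait()                           # parameters are unsharded after this
out = model(torch.randn(8, 16, device="cuda"))
```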
Differential Revision: [D56148725](https://our.internmc.facebook.com/intern/diff/D56148725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120952
Approved by: https://github.com/wanchaol
- Add serial marker for individual tests so the test file can be removed from the CI serial list
- Run serial-marked tests first, in serial
- Run all other tests afterwards, in parallel
- Slowly reduce the list and mark individual tests as serial instead
- Hope the # of serial tests is small so sharding evenness doesn't get too messed up
- Hopefully can do 3 procs for sm86 and cpu?

serial no longer looks like a real word to me
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124085
Approved by: https://github.com/seemethere, https://github.com/malfet
Summary: https://github.com/pytorch/pytorch/pull/123452 added
backward support to this op by turning it into
CompositeImplicitAutograd, which meant it gets decomposed during
export/compile. However, this is not desirable behavior for the
PTQ case when we try to lower the model. This commit enables
QAT without breaking PTQ by refactoring the impl into a separate
op that does have backward support.
Test Plan:
python test/test_quantization.py -k test_decomposed_choose_qparams_per_token_asymmetric_backward
Reviewers: jerryzh168, digantdesai, zou3519
Subscribers: jerryzh168, digantdesai, zou3519, supriyar
Differential Revision: [D56192116](https://our.internmc.facebook.com/intern/diff/D56192116)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124178
Approved by: https://github.com/digantdesai
**Summary**
We wrap DTensor's local tensor in `LocalShardsWrapper` for torchrec's table-wise sharding. The exception is on non-participating ranks: there, the local tensor is an empty torch.Tensor object. The reason for this design is to avoid the complexity of supporting the empty-tensor case in `LocalShardsWrapper`.
**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e table-wise`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122853
Approved by: https://github.com/wz337
ghstack dependencies: #120265, #121392, #122843
**Summary**
Always wrap the local tensor in a `LocalShardsWrapper`. This is for uniformity and makes it easier to adopt DTensor as a wrapper for the local shard(s) representation. To support more tensor ops over `LocalShardsWrapper`, users need to extend its `__torch_dispatch__`.
**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e row-wise-even`
**Result**
```
Row-wise even sharding example in DTensor
         Col 0-15
-------  ----------
Row 0-1  cuda:0
Row 2-3  cuda:1
Row 4-5  cuda:2
Row 6-7  cuda:3
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122843
Approved by: https://github.com/wz337
ghstack dependencies: #120265, #121392
**Summary**
This PR serves as the start of this effort by adding an example test that represents TorchRec's `ShardingType.TABLE_WISE` using DTensor.
**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e table-wise`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120265
Approved by: https://github.com/wanchaol
Fixes https://github.com/pytorch/pytorch/issues/98921
There were two issues detected:
- `MultiStepLR`: issue is described in https://github.com/pytorch/pytorch/issues/98921, this is resolved by allowlisting `collections.Counter`
- `OneCycleLR`: `state_dict['anneal_func']` is either `<function OneCycleLR._annealing_cos at 0x7f364186f5b0>` or
`<function OneCycleLR._annealing_linear at 0x7f39aa483640>` depending on the `anneal_func` kwarg.
This leads to `WeightsUnpickler error: Unsupported class __builtin__.getattr` from the `weights_only` Unpickler.
Fixed the above in a BC-compatible manner by adding `OneCycleLR._anneal_func_type` as a string attribute and removing `OneCycleLR.anneal_func`.
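A small repro of the now-working path (the model and optimizer are placeholders):
```python
import torch

model = torch.nn.Linear(2, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[5, 10])

torch.save(sched.state_dict(), "sched.pt")
# Previously raised "Unsupported class" for collections.Counter under weights_only.
state = torch.load("sched.pt", weights_only=True)
sched.load_state_dict(state)
```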
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123775
Approved by: https://github.com/albanD, https://github.com/malfet
Summary:
Pass the process group name and desc to the NCCL communicator in order to access pg information in the NCCL layer.
The information is passed as the commDesc string (i.e. "<pg_desc>:<pg_name>").
This functionality is only enabled when NCCL_COMM_DESCRIPTION is defined.
Differential Revision: D55703310
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124149
Approved by: https://github.com/shuqiangzhang
Fixes #104729
This improves the compiled-mode performance of Softmax (by 20%) and other operations (like batchnorm) that invoke the reduce_all function, thereby also improving BERT inference by around 8%.
Tested on a Graviton 3 instance (c7g.4xl). Tests were run single-threaded.
Script attached below.
Command: `OMP_NUM_THREADS=1 LRU_CACHE_CAPACITY=1024 DNNL_DEFAULT_FPMATH_MODE=BF16 python TestSoftmax.py`
[TestSoftmax.txt](https://github.com/pytorch/pytorch/files/14910754/TestSoftmax.txt)
```python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

model = nn.Softmax().eval()
compiled_model = torch.compile(model)
inputs = torch.randn(1024, 1024)

with torch.set_grad_enabled(False):
    for _ in range(50):
        compiled_model(inputs)  # Warmup
    print("Warmup over")
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        with record_function("model_inference"):
            for _ in range(100):
                compiled_model(inputs)
    print(prof.key_averages().table(sort_by="self_cpu_time_total"))
    # Check if the compiled model inference and the eager model inference are similar using torch.allclose
    print(torch.allclose(compiled_model(inputs), model(inputs)))
```
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123584
Approved by: https://github.com/jgong5, https://github.com/malfet
I found that returning the copy is actually useful in situations where you might do something like:
```
ret = _copy_state_dict(obj, cache)
ret.update(some_other_values)
```
and would like `cache` not to change structure as a result of `ret.update(some_other_values)`. Open to suggestions here; not returning a copy might force the user to make some additional copies for this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123567
Approved by: https://github.com/wz337
Summary: `matrix_instr_nonkdim` and `waves_per_eu` are AMD-specific launch configs that can't be treated as function input args.
Test Plan:
HIP_VISIBLE_DEVICES=7 numactl --cpunodebind=1 --membind=1 buck2 run mode/{opt,amd-gpu} -c fbcode.triton_backend=amd -c fbcode.enable_gpu_sections=true -c fbcode.rocm_arch=mi300 //hammer/modules/sequential/encoders/tests:hstu_bench -- --torch-compile=True
the E2E works well on the magic model
Differential Revision: D56165438
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124146
Approved by: https://github.com/aakhundov
Summary: With the merge of D55925068, we introduced an overflow issue when recording a trace using dyno gputrace. This is because it is possible for TorchOps to be enumerated but not have an end time if they were still running when the recording ended. By default these events have an end time set to INT_MIN. When computing duration() for such events as end - start, we get an overflow resulting in a very long duration. This was avoided before because we were dividing INT_MIN by 1000 while converting between microseconds and nanoseconds. This change introduces a patch for TorchOps; a future PR will add a more universal guard in kineto.
Test Plan:
Trace recorded using resnet test.
Trace:
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/0/1713199267/localhost/libkineto_activities_2247224.json.gz&bucket=gpu_traces
Differential Revision: D56144914
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124080
Approved by: https://github.com/aaronenyeshi
Summary:
As part of the work of unifying the process group identifier, log <group_name, group_desc> instead of the pg uid in the profiler.
- group_name remains the unique identifier, e.g. "0", "1"
- group_desc is the user-specified name, e.g. "fsdp"
Reviewed By: aaronenyeshi, kwen2501
Differential Revision: D55610682
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124035
Approved by: https://github.com/aaronenyeshi
Fixes #121200
This PR introduces AcceleratorOutOfMemoryError for all privateuse1 backends. For Python, there is a PyError object that will be set only when privateuse1 is registered. All privateuse1 backends can then use this error for memory errors. More error types may follow in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121702
Approved by: https://github.com/guangyey, https://github.com/albanD
**Overview**
This PR adds pre/post-all-gather extensions to FSDP2.
- The pre/post-all-gather extensions are specified at the tensor-level on the `sharded_param._local_tensor` (i.e. the tensor wrapped by the sharded `DTensor`). If the user has a tensor-subclass parameter on the module passed to FSDP that preserves the subclass through the sharding ops (e.g. `new_zeros`, `chunk`, etc.), then the `sharded_param._local_tensor` will naturally be of that subclass.
- The pre-all-gather function has signature:
```
def fsdp_pre_all_gather(self) -> Tuple[Tuple[torch.Tensor, ...], Any]
```
- The first return value is a `Tuple[torch.Tensor, ...]` of the all-gather inputs. It is a tuple since a subclass could contribute >1 inner tensors.
- The second return value is any optional metadata needed to pass through to the post-all-gather.
- The post all-gather function has signature:
```
def fsdp_post_all_gather(
    self,
    all_gather_outputs: Tuple[torch.Tensor, ...],
    metadata: Any,
    param_dtype: torch.dtype,
    *,
    out: Optional[torch.Tensor] = None,
) -> Union[Tuple[torch.Tensor, Tuple[torch.Tensor, ...]], None]:
```
- The `all_gather_outputs` are exactly the all-gathered versions of the `fsdp_pre_all_gather` 1st return value (representing the all-gather inputs). We make sure to unflatten these back to ND for the user.
- The `metadata` is the `fsdp_pre_all_gather` 2nd return value, untouched.
- The `param_dtype` is the parameter dtype based on the passed-in `MixedPrecisionPolicy`. Namely, if no policy is passed in, then `param_dtype` is the original dtype, and otherwise, it is the `MixedPrecisionPolicy.param_dtype`.
- If `out` is not specified, then the return value has type `Tuple[torch.Tensor, Tuple[torch.Tensor, ...]]`. The first tuple item is the unsharded parameter (e.g. re-wrapping into some subclass). The second tuple item is a tuple of unsharded inner tensors that FSDP should free during reshard. These should be derived from the all-gather outputs.
- The `out` argument is required due to FSDP's `resize_` usage. We require an in-place variant for the backward all-gather. Here, `out` will be exactly the object returned as the first tuple item in the out-of-place variant mentioned before. The unsharded inner tensors will be allocated before calling `fsdp_post_all_gather`. When `out` is specified, the `fsdp_post_all_gather` should return `None`. If the post-all-gather does not do any out-of-place ops, then the `out` variant can just be a no-op since the unsharded inner tensors will be the same as the all-gather outputs, which FSDP directly writes to after all-gather. (E.g., this is the case for both float8 and `NF4Tensor`.)
- We check for `fsdp_pre_all_gather` and `fsdp_post_all_gather` directly via `hasattr` to accommodate monkey patching so that we do not strictly require the user to use a tensor subclass. The monkey patch must happen after the local tensors have been finalized (after applying FSDP and after any meta-device init).
- For now, we require that all gradients in one FSDP parameter group share the same dtype. This is fine for float8 and `NF4Tensor` use cases. If this requirement is too strict, then in the future we can issue 1 reduce-scatter per dtype per group.
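A schematic skeleton of the two extension points (the signatures follow the description above; the subclass and its dtype logic are purely illustrative, and a real implementation would be a proper wrapper tensor subclass such as float8 or `NF4Tensor`):
```python
from typing import Any, Optional, Tuple, Union
import torch

class MyShardedTensor(torch.Tensor):
    def fsdp_pre_all_gather(self) -> Tuple[Tuple[torch.Tensor, ...], Any]:
        # Contribute one inner tensor to the all-gather; no extra metadata.
        return (self.to(torch.float16),), None

    def fsdp_post_all_gather(
        self,
        all_gather_outputs: Tuple[torch.Tensor, ...],
        metadata: Any,
        param_dtype: torch.dtype,
        *,
        out: Optional[torch.Tensor] = None,
    ) -> Union[Tuple[torch.Tensor, Tuple[torch.Tensor, ...]], None]:
        (gathered,) = all_gather_outputs
        if out is not None:
            # In-place variant (backward): the unsharded inner tensor is the
            # all-gather output itself, so there is nothing more to do.
            return None
        unsharded = gathered.to(param_dtype)
        # Return (unsharded param, inner tensors FSDP may free on reshard).
        return unsharded, (gathered,)
```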
**Design Notes**
- We assume that the `sharded_param._local_tensor` is padded on dim-0.
- This assumption should not block immediate use cases, and when we pad the `DTensor._local_tensor` by default, this assumption will always be true.
- This assumption allows us to call `sharded_param._local_tensor.fsdp_pre_all_gather()`; i.e. it tells us from which tensor object to invoke `fsdp_pre_all_gather()`.
- Suppose we want to compose with CPU offloading. Then, CPU offloading's H2D copy should run first, i.e. `sharded_param._local_tensor.to("cuda").fsdp_pre_all_gather()`, where `_local_tensor.to("cuda")` should return an instance of the subclass so that it still defines `fsdp_pre_all_gather()`. Note that in this case, the subclass instance on GPU is a temporary, which means caching values on it would not be possible. One possibility would be to have `.to("cuda")` move any cached values too.
- `fsdp_post_all_gather` can either return an unsharded parameter that aliases with the all-gather output or does not alias, but there is no way to know a priori.
- If the unsharded parameter aliases with the all-gather output, then we should _not_ free the all-gather output in `unshard`.
- If the unsharded parameter does not alias with the all-gather output, then we prefer to free the all-gather output in `unshard` to avoid holding the unneeded temporary.
- One approach is for eager-mode to check for this alias (by comparing data pointers). However, this might be adversarial to full-graph compilation. The compromise for simplicity can be to always free the all-gather output in `reshard`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122908
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #119302
This PR is part of the FSDP extensions work. For subclasses such as for QLoRA's `NF4Tensor` (using block-wise quantization) that have multiple inner tensors per parameter, we must generalize to allow each parameter to contribute >1 all-gather inputs and hence have >1 all-gather outputs.
This PR does this generalization by converting `FSDPParam.all_gather_input: torch.Tensor` to `FSDPParam.all_gather_inputs: List[torch.Tensor]`. Unfortunately, since we need to preserve the mapping from all-gather inputs/outputs to their source parameter, we have to introduce `List[List]` instead of simply `List` in several places. Furthermore, we still require the flattened 1D `List` for `torch.split` calls, introducing some redundancy between data structures. Nonetheless, I do not see a way to avoid this if we want the generalization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119302
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
Summary: Modify fresh_inductor_cache() to clear cached state before mocking the toplevel cache_dir directory. Any lru_caches (or otherwise) can use the @clear_on_fresh_inductor_cache decorator to register the cache for clearing. Also change the base inductor TestCase class to use fresh_inductor_cache(). Previously that TestCase was only mocking the subdirectory within the toplevel cache dir designated for the FX graph cache artifacts.
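A minimal sketch of the decorator pattern described above (the import path is an assumption based on torch._inductor.utils):
```python
import functools
from torch._inductor.utils import clear_on_fresh_inductor_cache, fresh_inductor_cache

@clear_on_fresh_inductor_cache
@functools.lru_cache(None)
def cached_lookup(key: str) -> str:
    # Stand-in for an expensive computation whose result depends on cache-dir state.
    return key.upper()

with fresh_inductor_cache():
    # Inside this block the top-level cache dir is mocked and caches registered
    # via the decorator (like cached_lookup) have been cleared.
    cached_lookup("conv_template")
```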
Test Plan:
- New unit test
- All existing inductor tests will exercise fresh_inductor_cache()
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122661
Approved by: https://github.com/oulgen
Fixes #123597
There's a sizable comment in the PR about why this is needed, but essentially the launch path is extremely perf-sensitive (running `launch` takes ~30 microseconds, and according to the linked issue, regressing it to 33us is worth 6% overall on torchbench). The `bin.launch_metadata` call doesn't look super expensive, but microseconds matter, and this is only useful when we have a launch hook installed (which seems pretty rare?). This change is worth about 2us, and when combined with the other diff in the stack seems to completely eliminate the torchbench regression.
Differential Revision: [D56046347](https://our.internmc.facebook.com/intern/diff/D56046347)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123841
Approved by: https://github.com/jansel, https://github.com/shunting314
Summary:
note: breaking the original diff D55225818 into 3 parts (top-level renaming, higher-order-op subgraphs, constant input de/serialization) because of its size.
Stacked PR to restore original names to placeholder nodes, replacing the default names arg0_1, arg1_1, ...
This PR supports constant argument placeholder (e.g. forward(self, x, y=1)) names and de/serialization, by adding a name field for ConstantArguments in the graph signature, and ConstantInputSpec in the input specs for serialization.
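Illustrative of the naming being preserved (toy module; the exact placeholder names after export are an assumption):
```python
import torch

class M(torch.nn.Module):
    def forward(self, x, y=1):
        return x + y

ep = torch.export.export(M(), (torch.randn(3),))
print([n.name for n in ep.graph.nodes if n.op == "placeholder"])  # e.g. ["x"] rather than ["arg0_1"]
```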
Test Plan: verification checks on placeholder names for all export() calls, unit test in test/export/test_export.py
Differential Revision: D55506949
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123590
Approved by: https://github.com/angelayi, https://github.com/zhxchen17