Summary: In production I am seeing errors like "AttributeError: module 'triton.runtime' has no attribute 'fb_memcache'", likely due to some package skew. Until this is resolved, let's wrap this code in a try-except.
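A minimal sketch of that guard (illustrative only; the actual change wraps the corresponding call site inside PyTorch, and the names below just mirror the error message):
```python
# Hedged sketch: tolerate package skew instead of crashing in production.
try:
    import triton.runtime as triton_runtime

    fb_memcache = triton_runtime.fb_memcache  # may be missing with skewed packages
except (ImportError, AttributeError):
    fb_memcache = None  # degrade gracefully; callers must handle the None case
```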
Test Plan: CI
Differential Revision: D54604339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121340
Approved by: https://github.com/aakhundov
**Summary**
In `visualize_sharding` we chose to only print on rank 0 (global rank), which means calling `visualize_sharding` will never print anything when the dtensor object's mesh doesn't include rank 0 (i.e. a sub-mesh). This PR has `visualize_sharding` always print on the rank whose mesh coordinate is (0, 0, ..., 0) instead of the rank whose global rank is 0.
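A hedged sketch of the new condition (not the PR's exact code), using `DeviceMesh.get_coordinate()`:
```python
from torch.distributed.device_mesh import DeviceMesh

def should_print_on_this_rank(mesh: DeviceMesh) -> bool:
    # get_coordinate() returns this rank's coordinate in the mesh, or None if the
    # rank does not participate in the (sub-)mesh at all.
    coord = mesh.get_coordinate()
    return coord is not None and all(c == 0 for c in coord)
```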
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121216
Approved by: https://github.com/wanchaol
ghstack dependencies: #121179, #120260
**Summary**
Our goal is to demonstrate DTensor's capability to represent TorchRec's parameter sharding. Currently this is done with `ShardedTensor`, and theoretically `DTensor` can replace it with minor changes.
This PR starts this effort by adding an example test that represents TorchRec's `ShardingType.ROW_WISE` using DTensor. Note that this PR only covers the even sharding case.
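As a hedged sketch of the idea (illustrative only, not the example file itself), assuming a 4-rank launch such as the torchrun command in the Test Run section below:
```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Shard

mesh = init_device_mesh("cuda", (4,))
table = torch.randn(1024, 64)  # even case: 1024 rows divide evenly across 4 ranks
# ShardingType.ROW_WISE maps to sharding the embedding table on dim 0.
rw_table = distribute_tensor(table, mesh, placements=[Shard(0)])
```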
**Test Run**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e row-wise`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120260
Approved by: https://github.com/wanchaol
ghstack dependencies: #121179
This PR keeps the keys in the same order as in the original state_dict, as the issue creator proposed. It also fixes a bug in how ``_metadata`` is handled (see below), along with other small changes to properly remove the prefix when it is present.
In the original code, ``_metadata`` was handled as a ``key``.
```
# also strip the prefix in metadata if any.
if "_metadata" in state_dict:
```
This is not the case: ``_metadata`` is actually an ``attribute`` of the state_dict object, not a key. Hence, the previous condition is changed to:
```
# also strip the prefix in metadata if any.
if hasattr(state_dict, "_metadata"):
```
This PR also includes the necessary test.
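Below is a hedged sketch of the corrected logic (a hypothetical helper named `strip_prefix`, not the PR's exact code): re-inserting every entry in its original iteration order preserves the key ordering, and ``_metadata`` is accessed as an attribute rather than looked up as a key.
```python
from collections import OrderedDict

def strip_prefix(state_dict: OrderedDict, prefix: str) -> None:
    # Pop and re-insert every key in iteration order so the original ordering
    # of the state_dict is preserved, stripping the prefix where present.
    for key in list(state_dict.keys()):
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        state_dict[new_key] = state_dict.pop(key)

    # _metadata lives as an attribute on the OrderedDict returned by
    # nn.Module.state_dict(), so check for it with hasattr, not `in`.
    if hasattr(state_dict, "_metadata"):
        metadata = state_dict._metadata
        for key in list(metadata.keys()):
            new_key = key[len(prefix):] if key.startswith(prefix) else key
            metadata[new_key] = metadata.pop(key)
```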
Fixes #106942
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117464
Approved by: https://github.com/mikaylagawarecki
`Tensor.__repr__` calls functions which can perform logging, and that logging ends up formatting `self` (via `__repr__` again), causing an infinite loop. Instead of logging all the args in FakeTensor.dispatch, log the actual parameters (and use `id` to log the tensor itself).
The change to torch/testing/_internal/common_utils.py came up during testing: in some ways of running the test, `parts` was `('test', 'test_testing.py')`, so `i` was 0 and we were doing a join on `()`, which caused an error.
Repro:
```
import torch
from torch.testing import make_tensor
from torch._subclasses.fake_tensor import FakeTensor, FakeTensorMode
t = torch.sparse_coo_tensor(((0, 1), (1, 0)), (1, 2), size=(2, 2))
t2 = FakeTensor.from_tensor(t, FakeTensorMode())
print(repr(t2))
```
and run with `TORCH_LOGS=+all`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120206
Approved by: https://github.com/yanboliang, https://github.com/pearu
As titled, this PR introduces a dedicated `ParallelStyle` to shard the
nn.LayerNorm/nn.Dropout/RMSNorm layers. Before this, we were mainly using manual
distribute_module calls when sharding the RMSNorm layer, but I think we
should have a dedicated TP API to easily shard those layers, instead of
users manually working with DTensors.
I call this SequenceParallel, which might bring some confusion since we
technically "deprecated" a SequenceParallel style months ago. But this
time the SequenceParallel style is significantly different from the
previous one (which used to shard two consecutive Linear layers). I
believe getting the name right is the first priority, rather than
worrying about reusing the old name.
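A hedged usage sketch, assuming a model with a norm submodule named "norm" and an existing 1-D tensor-parallel DeviceMesh (everything other than `SequenceParallel` and `parallelize_module` is illustrative):
```python
import torch.nn as nn
from torch.distributed.device_mesh import DeviceMesh
from torch.distributed.tensor.parallel import parallelize_module, SequenceParallel

def shard_norm_layer(model: nn.Module, tp_mesh: DeviceMesh) -> nn.Module:
    # Shard the activations of the "norm" submodule on the sequence dimension
    # using the new SequenceParallel style instead of manual distribute_module calls.
    return parallelize_module(model, tp_mesh, {"norm": SequenceParallel()})
```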
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121295
Approved by: https://github.com/awgu, https://github.com/tianyu-l
ghstack dependencies: #121294
# Update Profiler API to collect Execution Traces
## TLDR
We would like to simplify collecting an Execution Trace and a Kineto trace together. Execution Traces and Kineto traces both provide meaningful information that can be combined to enable benchmarking, performance analysis and simulating new hardware.
```
import torch

def main():
    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        …
        execution_trace_observer=ExecutionTraceObserver() # <<<<<<< NEW
    ) as prof:
        ...
        prof.step()
```
See test/profiler/test_profiler.py 'test_execution_trace_with_kineto' for an example of using this API.
## What are Execution Traces?
[Chakra Execution Traces](https://github.com/mlcommons/chakra/wiki) offer a graph based representation of AI/ML workloads. It stands apart from conventional AI/ML frameworks by focusing on replay benchmarks, simulators, and emulators, prioritizing agile performance modeling and adaptable methodologies.
- Chakra is part of the MLCommons industry standard and is being adopted by other companies besides NVIDIA too.
- At Meta we have instrumented the PyPer framework to collect Execution Traces. More details on our [PyTorch implementation of Chakra can be found here](https://github.com/mlcommons/chakra/wiki).
Chakra essentially enables benchmarking and co-design for ML models without having to reproduce entire software stacks, and it helps companies collaborate [[chakra paper](https://arxiv.org/pdf/2305.14516.pdf)].
## Why correlate Execution Trace with PyTorch/Kineto Trace
Execution Traces and Kineto traces provide different types of information, and combining them is valuable. While PyTorch ETs focus on CPU operators with explicit dependencies between them, Kineto traces encode GPU operators with their start and end times. In addition, collecting them at different timestamps would be inaccurate, as several operations (NCCL, embedding lookup) are data dependent and may not match correctly.
Thus, it makes sense to collect both ET and Kineto together. The problem is that today there are two separate code paths.
## Proposal
The proposal is to modify the PyTorch profiler (Kineto) API to enable an Execution Trace to be collected simultaneously; see the TLDR section.
# Testing
Updated the unit test for collecting Kineto and Execution Trace together.
- Check that the collected ET has the right range of events.
- Compare two sets of IDs - record function IDs in the ET and external IDs in Kineto - and check that they differ by a constant offset (a sketch of this check follows the list).
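A hedged sketch of that constant-offset check (a hypothetical helper; the real assertion lives in `test_execution_trace_with_kineto`):
```python
def have_constant_offset(et_record_func_ids, kineto_external_ids):
    # The two ID sequences should line up one-to-one with a single fixed offset.
    pairs = zip(sorted(et_record_func_ids), sorted(kineto_external_ids))
    offsets = {kineto_id - et_id for et_id, kineto_id in pairs}
    return len(offsets) == 1
```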
```
pytest test/profiler/test_profiler.py -k test_execution_trace_with_kineto -rP
Running 1 items in this shard
test/profiler/test_profiler.py [W execution_trace_observer.cpp:682] Enabling Execution Trace Observer
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[W execution_trace_observer.cpp:694] Disabling Execution Trace Observer
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119912
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
Summary: This ROCm-specific logic makes the aten-cpu code diverge between ROCm and CUDA. This is not good because we won't be able to share aten-cpu.so between ROCm and CUDA. More specifically, it will prevent us from building aten-hip by default, which would require us to set up ROCm-specific rules, an extra burden for our build system.
Test Plan: sandcastle + oss ci
Differential Revision: D54453492
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121082
Approved by: https://github.com/jeffdaily, https://github.com/aaronenyeshi, https://github.com/albanD
Since we are already checking whether the RNG tracker is initialized, there is no real performance difference between erroring and just initializing a default RNG tracker (which we choose to be the `OffsetBasedRNGTracker`).
```
pytest test/distributed/_composable/fsdp/test_fully_shard_init.py -k test_meta
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121328
Approved by: https://github.com/wanchaol
ghstack dependencies: #120351
Fixes #121093
Previously, calling the following functions with invalid padding dimensions would cause a segmentation fault:
```
torch._C._nn.replication_pad1d, torch._C._nn.replication_pad2d, torch._C._nn.replication_pad3d
```
To fix this, condition checks were added that raise a runtime error with a descriptive message specifying the required dimensions, instead of segfaulting.
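A hedged illustration of the new behavior (not the PR's test; the exact failing inputs are in #121093):
```python
import torch

# A call whose input dimensionality doesn't match the requested padding now
# raises a RuntimeError with a descriptive message instead of crashing.
x = torch.randn(2, 3)  # too few dimensions for a 3-D replication pad
try:
    torch._C._nn.replication_pad3d(x, (1, 1, 1, 1, 1, 1))
except RuntimeError as e:
    print(e)
```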
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121298
Approved by: https://github.com/mikaylagawarecki
This PR adds initial support for meta-device initialization for pre-training without loading from a state dict. The idea is to allow `fully_shard(module)` to return and still have sharded parameters on meta device. Then, the user is free to initialize them as they please, e.g. using `to_empty()`.
We override `_apply` to achieve the following:
- Reshard the parameters to ensure that sharded parameters are registered (for correctness) -- we will always need this
- Pad new local tensors and use the padded local tensors (to handle uneven sharding) -- we will remove this once `DTensor` pads its local tensor
We use the `swap_tensors` path in `_apply`. For now, this requires setting `torch.__future__.set_swap_module_params_on_conversion(True)`; however, in the future, this may be enabled by default for wrapper subclasses and will not need any explicit API call. If requiring this call is too intrusive in the short term, we can also call it in `_apply` or when importing `fully_shard`.
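For reference, enabling the swap_tensors path mentioned above is a single call to the existing `torch.__future__` API (nothing new is added here):
```python
import torch

# Required for now so that _apply() swaps the module's parameters via swap_tensors.
torch.__future__.set_swap_module_params_on_conversion(True)
```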
```
# Pre-training flow (no checkpoint)
global_mesh = init_device_mesh(..., mesh_dim_names=("dp", "tp"))
dp_mesh, tp_mesh = global_mesh["dp"], global_mesh["tp"]
with torch.device("meta"):
    model = ...
parallelize_module(model, tp_mesh, ...)
fully_shard(model, mesh=dp_mesh, ...)
for param in model.parameters():
    assert param.device.type == "meta"
model.to_empty(device="cuda")
random.manual_seed(42, global_mesh)
for module in model.modules():
    if hasattr(module, "reset_parameters"):
        module.reset_parameters()
```
This PR includes some minor changes to allow the user to similarly cast the module to a different dtype after construction time but before forward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120351
Approved by: https://github.com/wanchaol
Without this the build will freeze with the prompt:
Proceed ([y]/n)?
I'm using rootless podman in VS Code instead of Docker, but I think that should not affect this.
...or does conda somehow detect Docker but not Podman? Anyway, this should not break anything.
Btw, I also had to uncomment the line `"remoteUser": "root"` in devcontainer.json for the post-installation to finish properly, but I guess there might be other workarounds; and perhaps you don't want to run as root if your container has root privileges.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121128
Approved by: https://github.com/drisspg
Fixes #115607
We were missing guards for the case where the grads were set to `None`. So if we compiled the optimizer with grads set to their proper values, and then again with the grads set to `None`, we'd continuously run the `None` version, because all of its guards would pass and it would be ordered before the correct version in the cache.
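A hedged repro sketch (illustrative only, not the PR's test): compile an optimizer step once with grads populated and once with grads set to `None`; the fix adds guards so the second case doesn't silently reuse the wrong cached graph.
```python
import torch

model = torch.nn.Linear(4, 4)
opt = torch.optim.Adam(model.parameters())

@torch.compile
def step():
    opt.step()

model(torch.randn(2, 4)).sum().backward()
step()                           # compiled with grads set to proper values
opt.zero_grad(set_to_none=True)
step()                           # grads are None; needs its own guarded entry
```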
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121291
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
- Adds support for custom ops backed by C++ custom autograd functions, e.g. fbgemm
- Includes files more granularly to avoid namespace pollution and circular imports
Limitations:
- Requires the user to audit their code and opt in their custom autograd::Function via autograd::Function::is_traceable, and possibly add a compiled_args + apply_with_saved implementation. This was the only way I could think of to keep this sound.
- Will throw if we can't hash the saved_data, i.e. for any non-implemented type other than list and dict in at::IValue::hash b0cfa96e82/aten/src/ATen/core/ivalue.cpp (L364)
- Can technically silently fail if both the typeid hash and the typeid string name of the custom autograd::Function collide at the same time, and an identical autograd graph containing a different custom autograd::Function with an identical implementation is called. This case seems extremely unlikely, and the only alternative to hash collisions I can think of is compiling with reflection.
- Tensors not saved via save_variables are not lifted, and are specialized on the TensorImpl*'s hash (treated as a memory address). If needed, we can lift them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120681
Approved by: https://github.com/jansel
Fix: https://github.com/pytorch/xla/issues/6009
This PR adds another case to the `TensorVariable.method_new` special case, where it
re-dispatches `new` into `new_empty`.
Since we are using fake tensors, the `new` call doesn't actually get to the corresponding
backend (e.g. XLA). So, things like the following might happen:
```python
@torch.compile(backend="openxla")
def foo(x):
    new_x = x.new(*x.size())
    # new_x.device() == "xla"
    # x.device() == "xla:0"
    return new_x + x

a = torch.arange(10)
foo(a.to(xm.xla_device()))
```
Resulting in the following error:
```python
Traceback (most recent call last):
...
File "torch/_dynamo/utils.py", line 1654, in get_fake_value
ret_val = wrap_fake_exception(
File "torch/_dynamo/utils.py", line 1190, in wrap_fake_exception
return fn()
File "torch/_dynamo/utils.py", line 1655, in <lambda>
lambda: run_node(tx.output, node, args, kwargs, nnmodule)
File "torch/_dynamo/utils.py", line 1776, in run_node
raise RuntimeError(make_error_message(e)).with_traceback(
File "torch/_dynamo/utils.py", line 1758, in run_node
return node.target(*args, **kwargs)
File "torch/utils/_stats.py", line 20, in wrapper
return fn(*args, **kwargs)
File "torch/_subclasses/fake_tensor.py", line 885, in __torch_dispatch__
return self.dispatch(func, types, args, kwargs)
File "torch/_subclasses/fake_tensor.py", line 1224, in dispatch
return self._cached_dispatch_impl(func, types, args, kwargs)
File "torch/_subclasses/fake_tensor.py", line 955, in _cached_dispatch_impl
output = self._dispatch_impl(func, types, args, kwargs)
File "torch/_subclasses/fake_tensor.py", line 1445, in _dispatch_impl
return self.wrap_meta_outputs_with_default_device_logic(
File "torch/_subclasses/fake_tensor.py", line 1575, in wrap_meta_outputs_with_default_device_logic
return tree_map(wrap, r)
File "torch/utils/_pytree.py", line 900, in tree_map
return treespec.unflatten(map(func, *flat_args))
File "torch/utils/_pytree.py", line 736, in unflatten
leaves = list(leaves)
File "torch/_subclasses/fake_tensor.py", line 1550, in wrap
) = FakeTensor._find_common_device(func, flat_args)
File "torch/_subclasses/fake_tensor.py", line 625, in _find_common_device
merge_devices(arg)
File "torch/_subclasses/fake_tensor.py", line 620, in merge_devices
raise RuntimeError(
torch._dynamo.exc.TorchRuntimeError: Failed running call_function <built-in function add>(*(FakeTensor(..., device='xla', size=(10,), dtype=torch.int64), FakeTensor(..., device='xla:0', size=(10,), dtype=torch.int64)), **{}):
Unhandled FakeTensor Device Propagation for aten.add.Tensor, found two different devices xla, xla:0
```
Using `new_empty` instead fixes this error, because it uses the device from the source
tensor instead of inferring it from the current dispatch key set.
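A small hedged illustration of that property (runnable on CPU, independent of XLA):
```python
import torch

x = torch.arange(10)
y = x.new_empty(x.size())
# new_empty inherits the device and dtype of the source tensor, so on XLA the
# result stays on "xla:0" rather than being placed on a bare "xla" device.
assert y.device == x.device and y.dtype == x.dtype
```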
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121075
Approved by: https://github.com/jansel