Per discussion with @malfet, only allow weights_only unpickler to load NJT if `torch.nested` and `torch._dynamo` are imported
(this is slightly weird as technically `torch.nested` is actually imported by default and `torch._dynamo.decorators._DimRange` is actually what needs to be imported)
we can't import this from `torch.nested` as this would
- undo dynamo lazy import
- cause circular import
===========================
Redo of https://github.com/pytorch/pytorch/pull/140304 caused issues as `torch.nested._internal.foo` needs to be imported, which causes issues like
```python
torch/_weights_only_unpickler.py", line 339, in load
if full_path in _get_allowed_globals():
torch/_weights_only_unpickler.py", line 188, in _get_allowed_globals
torch.nested._internal.nested_tensor.NestedTensor
AttributeError: module 'torch.nested' has no attribute '_internal'
```
**This likely wasn't caught in our CI because imports are global during unit tests(?), so we use subprocess to properly test this time**
Differential Revision: [D65961691](https://our.internmc.facebook.com/intern/diff/D65961691)
@jbschlosser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140739
Approved by: https://github.com/malfet
# Motivation
This pr is an extension of #131758. As described in #131758, these changes are looking to make distributed UTs more accessible to users of all device types.
It is a demonstration of a few changes discussed by @kwen2501 and @jgong5 in the discussion for #131758(https://github.com/pytorch/pytorch/pull/131758#discussion_r1762422784)
This PR contains two types of changes, the first is to the common distributed folder where we have added a new class derived from MultiProcessTestCase which helps abstracts out the process group creation /deletion and other functionality for a given device.
The new generalized content can be added by deriving from this base class.
Also includes other misc changes for gaudi support
The second changed file is test_functional_api. a test file in common distributed. This file is a POC for how we can use this new class to write more device agnostic distributed test cases.
The following changes have been made to test_functional_api.py:
-Functionality has been added to test for non cuda devices using intel HPU as an example
-Multiple set up steps previously required by MultiProcessTestCase have been abstracted out
-Misc adaptations to allow for general call to accelerators while adding test skips instead explicitly skipping for multiple GPUs
-Skipifhpu flags have been added to enable skipping a few Multithreaded test cases which are as yet not supported on HPUs
NOTE: Within test functional api, there are tests which require the use of some multithreading functions which are as yet not supported on HPUs. These have been skipped for hpu using skipHPU decorator.
I will be raising a separate PR to improve usability pf said decorators in a device agnostic setting in the manner suggested by @kwen2501 in a comment on this PR.
This pr is a cleaned up version of a previous PR(#136988) which I closed due to human error. I have addressed some of the comments made by @kwen2501 in this as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138216
Approved by: https://github.com/kwen2501, https://github.com/guangyey
Faced with an annoying string of warnings like this when running tests,
<img width="1644" alt="Screenshot 2024-11-15 at 11 23 21 AM" src="https://github.com/user-attachments/assets/91ff4e1d-3c29-4510-9a61-46e7df68a212">
My choices seem to be (1) call destroy_process_group() at the end of
each test fn, (2) do this in some wrapper, (3) do it in the base test
class.
Since tests in MultiProcessTestCase are responsible for calling
init_process_group themselves, they should also be responsible for
calling destroy (or at least method (3) would be asymmetric and may
result in double-destroy).
But it doesn't feel worth it to go add a destroy call manually to each
test, and try/except for a possible second destroy call seems like a
happy middle ground.
Note: tests that want to ensure that destroy runs cleanly can and should
still call destroy _inside_ the test, and this change does not affect
that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140820
Approved by: https://github.com/fegin
Faced with an annoying string of warnings like this when running tests,
<img width="1644" alt="Screenshot 2024-11-15 at 11 23 21 AM" src="https://github.com/user-attachments/assets/91ff4e1d-3c29-4510-9a61-46e7df68a212">
My choices seem to be (1) call destroy_process_group() at the end of
each test fn, (2) do this in some wrapper, (3) do it in the base test
class.
Since tests in MultiProcessTestCase are responsible for calling
init_process_group themselves, they should also be responsible for
calling destroy (or at least method (3) would be asymmetric and may
result in double-destroy).
But it doesn't feel worth it to go add a destroy call manually to each
test, and try/except for a possible second destroy call seems like a
happy middle ground.
Note: tests that want to ensure that destroy runs cleanly can and should
still call destroy _inside_ the test, and this change does not affect
that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140820
Approved by: https://github.com/fegin
ghstack dependencies: #140460, #140815
This PR adds caching for user defined triton kernels by putting the transitive closure of source code in node.meta along with constant arguments.
One HUGE hack we do here is a node looks like
```
triton_kernel_wrapper_functional_proxy = torch.ops.higher_order.triton_kernel_wrapper_functional(kernel_idx = 0, constant_args_idx = 1, grid = [(1, 1, 1)], tma_descriptor_
metadata = {}, kwargs = {'in_ptr0': arg0_1, 'in_ptr1': arg1_1, 'out_ptr': arg0_1}, tensors_to_clone = ['out_ptr']);
```
so we use regex to remove `kernel_idx = 0, constant_args_idx = 1` parts as they are not relevant to cache hash. This is horrible and I'd like to eventually not use pickle as a hashing alternative but this is a longer project.
Differential Revision: [D65895744](https://our.internmc.facebook.com/intern/diff/D65895744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140326
Approved by: https://github.com/zou3519
Implementation of the `softmax_backward_data` operator for the CPU backend produces incorrect results when the `output` argument is non-contiguous.
Here is a test case that demonstrates this issue:
```python
torch.manual_seed(0)
op = torch.ops.aten._softmax_backward_data
grad_output = torch.ones(3, 3, 3)
temp = torch.randn(3, 10, 3)
out = temp[:, :3, :]
out = out.contiguous()
print(out.is_contiguous())
grad_input = op(grad_output, out, 1, torch.float32)
print(grad_input)
```
In this test case, the variable `grad_input` yields incorrect results if the line `out = out.contiguous()` is commented out. With this fix, `grad_input` consistently produces the same results whenever `output` is contiguous.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139740
Approved by: https://github.com/zou3519
Functionally two decorators are very similar, but one should rely on expectedFailure as much as possible to get signal when something is fixed.
- Move `product_version` variable from `test_mps` to common_utils, but call it `MACOS_VERSION`
- Introduce `skipIfMPSOnMacOS13` to decorate the hard crashes that happens only on MacOS13 (which at this point will not get any fixes and will be deprecated soon)
- Add `device_type='mps'` to all `skipIfMPS` per https://github.com/pytorch/pytorch/issues/140560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139940
Approved by: https://github.com/janeyx99, https://github.com/huydhn
@ezyang noticed this exercises a multithreading bug that is causing tests to become disabled:
```
2024-11-13T21:05:55.8363582Z inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_fft_ihfftn_cpu_int32 /opt/conda/envs/py_3.9/lib/python3.9/site-packages/_pytest/threadexception.py:73: PytestUnhandledThreadExceptionWarning: Exception in thread Thread-3
2024-11-13T21:05:55.8364857Z
2024-11-13T21:05:55.8364974Z Traceback (most recent call last):
2024-11-13T21:05:55.8365491Z File "/opt/conda/envs/py_3.9/lib/python3.9/threading.py", line 980, in _bootstrap_inner
2024-11-13T21:05:55.8366003Z self.run()
2024-11-13T21:05:55.8366371Z File "/opt/conda/envs/py_3.9/lib/python3.9/threading.py", line 917, in run
2024-11-13T21:05:55.8366858Z self._target(*self._args, **self._kwargs)
2024-11-13T21:05:55.8367518Z File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fbscribelogger/__init__.py", line 176, in _run_event_loop
2024-11-13T21:05:55.8368189Z self.loop.run_until_complete(self.task)
2024-11-13T21:05:55.8368774Z File "/opt/conda/envs/py_3.9/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
2024-11-13T21:05:55.8369348Z return future.result()
2024-11-13T21:05:55.8369980Z File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fbscribelogger/__init__.py", line 214, in _worker
2024-11-13T21:05:55.8370603Z message = await asyncio.wait_for(
2024-11-13T21:05:55.8371090Z File "/opt/conda/envs/py_3.9/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
2024-11-13T21:05:55.8371573Z return await fut
2024-11-13T21:05:55.8372156Z File "/opt/conda/envs/py_3.9/lib/python3.9/asyncio/queues.py", line 166, in get
2024-11-13T21:05:55.8372613Z await getter
2024-11-13T21:05:55.8374010Z RuntimeError: Task <Task pending name='Task-1' coro=<FbScribeLogger._worker() running at /opt/conda/envs/py_3.9/lib/python3.9/site-packages/fbscribelogger/__init__.py:214> cb=[_run_until_complete_cb() at /opt/conda/envs/py_3.9/lib/python3.9/asyncio/base_events.py:184]> got Future <Future pending> attached to a different loop
2024-11-13T21:05:55.8375366Z
2024-11-13T21:05:55.8375603Z warnings.warn(pytest.PytestUnhandledThreadExceptionWarning(msg))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140717
Approved by: https://github.com/ezyang, https://github.com/zxiiro
Summary:
`with_comms()` is mostly used as a decorator with an optional input argument `eager_init`. The problem of a decorator with input argument is that it has to be used with invocation always, i.e., you have to use as `with_comms()` rather than `with_comms` which majority of the existing usages.
This diff tries to provide a solution such that we could use `with_comms`, `with_comms()`, `with_comms(eager_init=False)`, and `with_comms(eager_init=True)`.
Test Plan: Contbuild & OSS CI
Differential Revision: D65385700
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139637
Approved by: https://github.com/wz337
This PR updates OpInfo-based tests for NJTs:
* Adds extensive coverage across non-contiguous NJTs (both non-contiguous transposed and non-contiguous with holes)
* The `_sample_njts()` helper that `sample_input_func`s utilize now produces non-contig NJTs as well
* Utilizes a `SampleInput`-based xfail system for granular classification of bugs. For example, it's possible to indicate that a class of ops is expected to fail only on non-contig with holes NJT inputs.
* I decided on adding `SampleInput`s and utilizing this system over using test parametrization for two reasons:
* Test perf - adding `SampleInput`s is faster than generating entire new tests
* Avoiding the possibility of `sample_input_func`s not respecting the non-contig test parameter - this would result in silently incorrect passing of these tests. Keeping the responsibility for `SampleInput` generation firmly within each `OpInfo`'s `sample_input_func` means weirdness like this isn't possible
* Improves `SampleInput` naming for a bunch of `sample_input_func`s. This makes it easier to xfail them as needed. For example, binary / unary / other ops now use the new `_describe_njt()` helper to get a string repr that uniquely defines the type of NJT being passed to the op
* Adds appropriate `XFailRule`s to get tests passing for forward / backward / forward compile / backward compile. In general, each xfail corresponds to some bug that needs to be fixed
```python
# Represents a rule indicating how to xfail a particular test. It allows granularity
# at the device, dtype, op, and individual sample levels. This flexibility allows entire
# bugs to be represented by a single rule, even if this corresponds with multiple conceptual
# test cases across multiple ops.
@dataclass
class XFailRule:
# expected error type
error_type: TypeVar = Exception
# expected error message
error_msg: str = ".*"
# function to indicate whether the rule applies; return True if so
match_fn: Callable[[torch.device, torch.dtype, OpInfo, SampleInput], bool] = None
# optional name for identifying the rule
name: str = ""
def match(self, device, dtype, op, sample) -> bool:
return self.match_fn(device, dtype, op, sample)
```
Example:
```python
# Bug when broadcasting a binary op with non-contiguous with holes NJT + dense
# tensor with 1 in ragged dim.
XFailRule(
error_type=RuntimeError,
error_msg="cannot call binary pointwise function .* with inputs of shapes",
match_fn=lambda device, dtype, op, sample: (
isinstance(op, BinaryUfuncInfo)
and "noncontig_holes" in sample.name
and "broadcasting 1 over ragged" in sample.name
),
name="binary_noncontig_holes_broadcasting_1_over_ragged",
),
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138370
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
ghstack dependencies: #140160
Summary:
I was looking into why the non-standard bool value will fail for msort - it makes sense for argsort and sort to fail, because we're randomly generating uint8 so the order will be different (and thus the indices will be different). But msort should work.
After some digging, it's interesting that even though scalar_t is bool, when the actual value is a uint8_t, the comparison will treat them as signed. I tried lhs=255 and rhs=0: lhs < rhs is equivalent to -1 < 0 which is true (but it's supposed to be False)
Therefore we add an explicit type cast.
Test Plan: Remove the test skip
Differential Revision: D65472170
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139870
Approved by: https://github.com/Skylion007, https://github.com/davidberard98
Based on discussion here: https://github.com/pytorch/pytorch/pull/138731
Introducing ability for subclass implement type convertion to expected_type.
```
def __coerce_same_metadata_as_tangent__(
self, expected_metadata: Any, expected_type: Optional[Type] = None
):
```
Here if `expected_type=None` means `SubclassClass` is expected.
E.g. for `DTensor` we may find tangent `AsyncCollectiveTensor` where we expected `Tensor` - in this case
`expected_type=Tensor` will be called during runtime
Adding implementation to AsyncCollectiveTensor, that just triggers `wait()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139095
Approved by: https://github.com/bdhirsh
This allows Configs to handle setting their defaults (or overriding
themselves) via environment variables.
The environment variables are resolved at install time (which is usually
import time). This is done 1) to avoid any race conditions between
threads etc..., but 2) to help encourage people to just go modify the
configs directly, vs overriding environment variables to change
pytorch behaviour.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138956
Approved by: https://github.com/ezyang
ghstack dependencies: #138766
As MacOS-15 or newer supports those out of the box. This significantly reduces memory requirements and improves performance for some stable diffision networks.
Test plan: Run
```python
from diffusers import StableDiffusionXLPipeline, AutoencoderKL, EulerAncestralDiscreteScheduler
import torch
import time
vae = AutoencoderKL.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0",
subfolder='vae',
torch_dtype=torch.bfloat16,
force_upcast=False).to('mps')
pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", vae=vae,
torch_dtype=torch.bfloat16, variant="fp16").to('mps')
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
start_time = time.time()
start_mps_mem = torch.mps.driver_allocated_memory()
image = pipe(prompt="Spherical cow in vacuum",
num_inference_steps=10,
guidance_scale=8,
generator=torch.Generator("mps").manual_seed(42),
).images[0]
end_mps_mem = torch.mps.driver_allocated_memory()
run_time = time.time() - start_time
print(f"run time in {run_time:.2f} sec, end_mps_mem {end_mps_mem/1024.0**2:.2f} Mb mem increase {(end_mps_mem-start_time)/1024.0**2:.2f} Mb")
image.save(f'bfloat16.png')
```
Before the change total memory use were 16Gb and needed 65 sec to complete, after it drops down to 14Gb and takes 50 sec to finish on M2Pro, though generated image remains the same:

Fixes https://github.com/pytorch/pytorch/issues/139389
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139791
Approved by: https://github.com/drisspg, https://github.com/Skylion007
ghstack dependencies: #139788, #139784, #139763
This teaches install_config_module (and the underlying code) to
understands Config objects. Additionally we've added a JK option to this
which resolves the JK.
This config gets stored within the _ConfigEntry class and is evaluated
when __getattr__ is called. If justknobs is set, it'll call
justknobs_check to see the result.
Due to preceeding work, basically everything works correctly here and we
had to update a couple of tests, and modify the getattr behaviour.
Note that we are updating the justknob_check function to support a
default option, to make default work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138766
Approved by: https://github.com/ezyang
I'm sick of reductions not working properly - spotty dim coverage, missing backwards, etc. This PR fixes quite a bit.
It applies to the following ops:
* `sum` / `mean` / `prod`
* `all` / `any`
* `amin` / `amax`
* `min` / `max`
* `argmin` / `argmax`
The general reduction logic has been factored out into a helper `_apply_reduction(func, func_name, identity_element, *args, **kwargs)`. The idea is that by providing a valid identity element, we can utilize conversions to padded dense when needed for reducing over the ragged dim.
Extensive test coverage includes:
* reductions across ragged dim
* reductions across non-batch, non-ragged dims
* reductions across both batch and ragged dims
* multiple dim reductions (for ops that support this)
* full reduction -> scalar
Bonus: the PR includes backwards fixes for `sum` and `mean`, which have never worked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139317
Approved by: https://github.com/cpuhrsch
**Background:** The `@parametrize` decorator enjoys widespread usage as a convenient tool for ensuring extensive test coverage. One particular feature that makes this easy is the ability to stack such decorators, testing over the cross-product of inputs. Example:
```python
class MyTestClass(TestCase):
@parametrize("x", range(3))
@parametrize("y", [False, True])
def test_foo(self, x, y):
# Invoked with:
# x=0, y=False
# x=1, y=False
# x=2, y=False
# x=0, y=True
# x=1, y=True
# x=2, y=True
...
```
Note that the `@ops` and `@modules` decorators employ the same underlying machinery for parametrizing over `OpInfo` / `ModuleInfo` entries. These decorators also parametrize over op-specific `device` / `dtype` info *according to what is supported for each op*.
```python
class MyTestClass(TestCase):
@ops(op_db)
def test_foo(self, op, device, dtype):
# Invoked each OpInfo in the db along with each device / dtype that corresponds
# with this op according to the OpInfo entry.
...
```
Note that this in contrast to the naive cross product between ops and devices / dtypes, which would generate too many tests. Certain use cases benefit from a similar type of flexible parametrization that is more intelligent than simple cross-product composition. It is expensive to generate / run too many tests, even if the unneeded ones are skipped appropriately.
This PR attempts to generalize such flexible parametrization and satisfy these use cases through the introduction of a `@reparametrize` decorator, which operates on an existing parametrizer and allows for customized on-the-fly parametrization through the use of an `adapter_fn`. Examples:
```python
# adapter_fn that adds a new arg
def include_is_even_arg(test_name, param_kwargs):
x = param_kwargs["x"]
is_even = x % 2 == 0
new_param_kwargs = dict(param_kwargs)
new_param_kwargs["is_even"] = is_even
is_even_suffix = "_even" if is_even else "_odd"
new_test_name = f"{test_name}{is_even_suffix}"
yield (new_test_name, new_param_kwargs)
# adapter_fn that excludes certain values
def exclude_odds(test_name, param_kwargs):
x = param_kwargs["x"]
is_even = x % 2 == 0
yield None if not is_even else (test_name, param_kwargs)
class MyTestClass(TestCase):
@reparametrize(parametrize("x", range(5)), include_is_even_arg)
def test_foo(self, x, is_even):
# Invoked with both the x value and the new is_even arg
...
@reparametrize(parametrize("x", range(5)), exclude_odds)
def test_bar(self, x):
# Only invoked with even x values
...
```
For a more real-world use case, imagine you want to write a set of OpInfo tests that parametrize over additional op-specific things beyond `device` / `dtype` (in NJT's case, this includes contiguity type, whether to operate over the batch / ragged / other dims, etc.). The `@reparametrize` decorator allows you to customize the `@ops` parametrization to add in these additional args as they make sense on a per-op basis.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138369
Approved by: https://github.com/janeyx99
We should only pass the `device_id` when the backend is `nccl`. Otherwise, we would run into the following error:
```
RuntimeError: No backend for the parent process group or its backend does not support splitting
```
This also fixes test failure is not asserted when using `with_comms()` or `with_comms(eager_init=False)`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139097
Approved by: https://github.com/XilunWu
This PR adds FlexAttention + NJT support. In particular:
* To handle raggedness, treats the packed sequence dim of input NJTs as a giant "stacked sequence". To ensure user `score_mod` / `mask_mod` functions can still be written in the original NJT sequence space, this PR handles conversions for indices within the giant "stacked sequence" -> sequence relative indices automatically.
* Provides `py_impls` for `NestedTensor` to the HOPs for flex attention forward / backward that simply wrap / unwrap NJTs appropriately
* Adds barebones `new_empty()` support to NJT since FlexAttention utilizes this repeatedly; right now, only `new_empty()` with a shape of `()` is supported
* Tests that FlexAttention with a causal mask matches causal SDPA
* Adds a new public API for FlexAttention usage:
* `create_nested_block_mask(mask_mod, B, H, njt, BLOCK_SIZE, _compile)` - NJT analogue for `create_block_mask()` that utilizes the `njt`'s ragged structure to create an appropriately-sized block mask (e.g. `(1, 1, total_seqlen, total_seqlen)`). This function handles the index conversion from "stacked sequence" space -> relative sequence space.
* Minor note: as this is a public API, this function is purposefully named with "nested" instead of "njt" to keep the latter as an informal, mostly internal-only term.
Example usage:
```python
def causal_mask(b, h, q_idx, kv_idx):
return q_idx >= kv_idx
query = ... # NJT of shape (B, H, S*, D)
key = ... # NJT of shape (B, H, S*, D)
value = ... # NJT of shape (B, H, S*, D)
# create_nested_block_mask() automatically converts indices from "stacked sequence" space -> relative sequence space
block_mask = create_nested_block_mask(causal_mask, 1, 1, query) # block mask conceptual shape is (B, H, sum(S*), sum(S*))
output = flex_attention(query, key, value, block_mask=block_mask)
def causal_score_mod(score, b, h, q_idx, kv_idx):
return torch.where(q_idx >= kv_idx, score, float("-inf"))
# flex_attention() automatically converts indices from "stacked sequence" space -> relative sequence space for NJT inputs
output2 = flex_attention(query, key, value, score_mod=causal_score_mod)
```
TODO:
* ~~Determine the right level of abstraction for public API helpers + move them alongside other helpers~~ Verify this with others though
* ~~Some cleanup~~
* ~~`njt_score_mod_adapter`~~
* ~~Q: should `create_njt_block_mask()` call `njt_mask_mod_adapter()` so we don't need two calls?~~
* Can we avoid materializing the `sum(s)` length `seq_idx` used for conversion between stacked sequence -> sequence relative indices?
* Not for now, although future work may deepen the integration between Flex + NJT (possibly requiring custom templates). We should try to cache this though.
* ~~Demonstrate non-causal mask~~
* Support non-contiguous NJTs with holes (**booted to future PR**)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136792
Approved by: https://github.com/drisspg
ghstack dependencies: #138841
dot reference implementation should be consistent with the cpu / cuda implementations since it may be used for meta dispatch
i.e.
```python
import torch
x = torch.tensor([1,2,3], dtype=torch.float32)
y = torch.tensor([4,5,6], dtype=torch.float16)
x.dot(y)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: dot : expected both vectors to have same dtype, but found Float and Half
```
However the below does not raise an exception
```python
x.to("meta").dot(y.to("meta"))
```
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138596
Approved by: https://github.com/bdhirsh
This PR combines a number of cleanups in one PR. If any of the specific cleanups don't seem to make sense, let me know and I can remove them.
Cleanups
- This PR adds a set of test suites for the config module code, which handles basically all the APIs and ways it is used. Please let me know if you see anything critical that is not tested that I missed. This test suite is primarily used as the regression test suite for later changes in this diff. Note that there is some dynamo specific testing of the config module, but it isn't as verbose.
- I removed all internal usage of shallow_copy_dict. Those usages could all use the deep copy, and did not depend on the reference behavior of certain config values that shallow_copy_dict allows.
- I removed shallow copy semantics for configuration with a deprecation warning. I think this requires a release note, so hopefully I did that correctly. Let me know if we want to continue to expose shallow copy value semantics, but I just can't find a case where I expect anyone would want it. It also complicated later internal changes to the API (i.e. breaking apart various layers of the config changes).
- I fixed what I believe is a bug in how hashes are calculated on configs. In particular, if you got the hash, then made a config change, and then got the hash again, it would not update the hash. @oulgen, please let me know if I'm misunderstanding this behavior and it is desired.
- I switched our multiple implementations of iterating through the dictionary to a single one. This is primarily to make later changes easier, but it also makes it clear how inconsistent our various config ignoring options are. Let me know if people would be interested in me unifying the various options for ignoring config values.
- I updated the test patcher (not the performance critical one, just the normal one), to use __setattr__ and __getattr__ to remove direct API access to the underlying config fetcher.
For release notes, Not sure exactly how to communicate this, but something like
"ConfigModule.to_dict, and ConfigModule.shallow_copy_dict no longer retain their shallow copy semantics, which allowed reference values objects to be modified. If you wish to modify the config object, call load_config explicitly".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138377
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/jovianjaison
This has the benefit that
1) It's much easier to aggregate test failure repros into say a CSV or shell script from scuba
2) We can do analysis (eg. set different two sets of tests across two PRs)
3) We can get results faster at the test-level granularity instead of job-level granularity we see in the HUD/GH.
I tested this by introducing a breaking change, adding ci-scribe label and then verifying that the failed tests were logged to scuba: https://fburl.com/scuba/torch_open_source_signpost/w6qt7qr9
I then reverted the breaking change and published this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138394
Approved by: https://github.com/ezyang
Add an optional `eager_init` flag to `with_comms`.
When `eager_init` is True and backend is `nccl`, we pass the `device_id` to `init_process_group()` for eager initialization.
Otherwise, `device_id` is still `None` and this goes through the normal lazy call.
Default for `eager_init` is False.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138108
Approved by: https://github.com/kwen2501
The logsumexp tensor was considered for internal use only but apparently exposed to unit tests and inductors.
The stream should be selected after picking the current device. Otherwise the code is checking the default device's architecture.
Fixes#131316#137414
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137717
Approved by: https://github.com/drisspg
Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Our matmul support is abysmal - let's at least get this working and do it performantly later.
Bonus: implements `bmm` as well.
jagged <-> padded dense conversions are utilized when possible, and an unbind-based fallback otherwise (the former works with torch.compile and the latter doesn't). Some testing is missing because we don't have factory function support yet :(
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138121
Approved by: https://github.com/cpuhrsch
This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). A few notes:
- Here we assume the availability of the host-side TMA API added to upstream Triton in https://github.com/triton-lang/triton/pull/4498. As of time of writing, this is not a part of the PT2 OSS Triton pin (although back-ported internally). OSS Triton pin update should be done in December 2024.
- To capture the chain of calls `t.data_ptr() --> create_{1d,2d}_tma_descriptor(ptr, ...) --> kernel[grid](tma_desc, ...)`, we add three new variable trackers: `DataPtrVariable`, `CreateTMADescriptorVariable` (for the function), `TMADescriptorVariable` (for TMA descriptor object). This is to maintain the path back from the Triton kernel to the Tensor from which the TMA descriptor has been created.
- The newly introduced variables have `reconstruct` methods used in case of graph breaks.
- The `tma_descriptor_metadata` extracted from the captured `create_{1d,2d}_tma_descriptor` calls is propagated through the HOPs in Dynamo and AOTAutograd to be used by the downstream compiler (e.g., Inductor). See the unit tests for how the captured HOP arguments look like.
- In the Dynamo-captured fx graph, we replace the TMA descriptor arguments of the Triton kernel by the underlying Tensors, to be able to track the input/output relationships in terms of Tensors.
- In the Triton kernel mutation analysis pass (in AOTAutograd), we use the `tt.experimental_descriptor_store` TTIR op to detect mutations of the underlying tensors via TMA descriptors. So that downstream AOTAutograd can perform functionalizations as required.
- JIT Inductor and AOT Inductor support will be implemented in follow-up PRs.
Differential Revision: [D64404928](https://our.internmc.facebook.com/intern/diff/D64404928)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137677
Approved by: https://github.com/zou3519
### Fix 1: Throw async error during init wait
Previously we just busy wait for `ncclSuccess`, if the nonblocking init encountered error, we never report that. Added detection of async error via `ncclGetAsyncError`.
### Fix 2: Add wait after comm split
```
// After calling ncclCommSplit in non-blocking mode, we should wait for the
// source communicator to be out of ncclInProgress state.
// Reason 1:
// it's unsafe to call new operations on the parent comm while it's in
// ncclInProgress state.
// Reason 2:
// as of NCCL 2.23, the ptr value of child comm will not be filled until the
// state of parent comm is ncclSuccess. This may change in the future. See:
// https://github.com/NVIDIA/nccl/issues/1472
```
This wait does not mean the child comm is ready for use, neither does it block till that point.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137741
Approved by: https://github.com/shuqiangzhang
A proposal addressing Issue #1489: **Optimizer should track parameter names and not id.**
(also mentioned in here: [[RFC] Introducing FQNs/clarity eyeglasses to optim state_dict](https://dev-discuss.pytorch.org/t/rfc-introducing-fqns-clarity-to-optim-state-dict/1552)
## Summary
This PR introduces a backward-compatible enhancement where optimizers track parameter names instead of just their id.
Optimizers can be initialized with `named_parameters()` as:
```python
optimizer = optim.SGD(model.named_parameters(), lr=0.01, momentum=0.9)
```
This allows for greater clarity and ease when handling optimizers, as the parameters' names are preserved within the optimizer’s `state_dict` as:
```
state_dict =
{
'state': {
0: {'momentum_buffer': tensor(...), ...},
1: {'momentum_buffer': tensor(...), ...},
},
'param_groups': [
{
'lr': 0.01,
'weight_decay': 0,
...
'params': [0,1]
'param_names' ['layer.weight', 'layer.bias'] (optional)
}
]
}
```
Loading `state_dict` is not changed (backward-compatible) and the `param_names` key will be ignored.
## Key Features
#### Named Parameters in Optimizer Initialization:
Optimizers can accept the output of `model.named_parameters()` during initialization, allowing them to store parameter names directly.
#### Parameter Names in `state_dict`:
The parameter names are saved as a list in the optimizer’s `state_dict` with key `param_names`, alongside the `params` indices, ensuring seamless tracking of both names and parameters.
## Backward Compatibility
#### No Breaking Changes:
This change is fully backward-compatible. The added `param_names` key in the optimizer's `state_dict` is ignored when loading a state to the optimizer.
#### Customization with Hooks:
For more control, the loaded state_dict can be modified using a custom `register_load_state_dict_pre_hook`, providing flexibility for different design needs.
## Documentation Updates
Please refer to the documentation changes for more details on how this feature is implemented and how it can be used effectively.
## Solution Example:
A suggested solution to the problem mentioned in #1489, for the same parameters but in a different order.
The following `register_load_state_dict_pre_hook` should be added to the optimizer before loading to enable loading the state dict :
```python
def adapt_state_dict_ids(optimizer, state_dict):
# assuming a single param group.
current_state_group = optimizer.state_dict()['param_groups'][0]
loaded_state_group = state_dict['param_groups'][0]
# same number of params, same names, only different ordering
current_state_name_to_id_mapping = {} # mapping -- param_name: id
for i, name in enumerate(current_state_group['param_names']):
current_state_name_to_id_mapping[name] = current_state_group['params'][i]
# changing the ids of the loaded state dict to match the order of the given state dict.
for i, name in enumerate(current_state_group['param_names']):
loaded_state_group['params'][i] = current_state_name_to_id_mapping[name]
return state_dict
```
In this code, the loaded `state_dict` ids are adapted to match the order of the current optimizer `state_dict`.
Both the previous and the current optimizers are required to be initiated with `named_parameters()` to have the 'param_names' key in the dict.
### Note
This is my first contribution to PyTorch, and I wish to receive feedback or suggestions for improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134107
Approved by: https://github.com/janeyx99
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:
## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.
## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5) <-- no additional context yet
del work <-- additional context shows up
```
### Debug process
Chasing it down to destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we hit the road down to:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)
When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard.
## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.
`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
ghstack dependencies: #137161
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:
## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.
## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5) <-- no additional context yet
del work <-- additional context shows up
```
### Debug process
Chasing it down to destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we hit the road down to:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)
When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard.
## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.
`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
Function `_get_pg_default_device` is being used outside of `distributed_c10d.py`.
A concern is that people may not be aware of what it actually does, due to bad naming of this function:
`Return the device to use with ``group`` for control flow usage (object collectives, barrier).`
The remediation is as follows:
- Added a deprecation warning to `_get_pg_default_device`;
- Added a private function `_get_object_coll_device` to undertake what it does;
- Added a `_device_capability` function for users who want to query the device support of a PG -- it returns a plain list, no more "default" choice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136790
Approved by: https://github.com/H-Huang
Fixes#130154
This PR takes the strategy outlined in the above issue and clears out any cached sizes / strides PyCapsules before serialization. This affects the default subclass serialization logic.
The PyCapsule issue also affects `deepcopy`, so that's fixed here as well.
Note: I originally tried utilizing a context manager to remove / restore cached PyCapsules after serialization, but in practice the state returned from `_reduce_ex_internal()` references the actual `tensor.__dict__()`, so the problem persists once the cached values are restored. Instead, we have to be careful to remove the cached values in the right place so they're not re-cached when pulling out size / stride information for serialization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137030
Approved by: https://github.com/albanD
## Motivation
The FSDP common code for FSDP UT execution is mostly written with cuda device in mind. However other devices such the intel Gaudi supports most of the functionality. We are generalizing the base content so that the UT content can be used for non-cuda device execution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133209
Approved by: https://github.com/kwen2501
Fix two more leaks of the same variety as #136507 (see that PR desc and attached gdoc for debug details).
This time, also add a test-time check that helped to discover new leaks and ensure we won't accidently regress.
Adds `check_tensor_leak` util which internally asserts no tensors are being kept alive by other objects involved in py ref cycles.
Uses objgraph for a nice debug utility when a leak is found.
Credit to @H-Huang for pointing out objdump and helping debug the 'param_group["intermediates"]` leak.
I manually confirmed that all 3 of the leaks identified/fixed so far are caught by the unit test and checker.
Sample output, if I re-introduce a leak by commenting out `del param_group["intermediates"]` in _backward.py,
and run `python test/distributed/pipelining/test_schedule_multiproc.py -k test_schedule_with_native_zero_bubble`:
```
warnings.warn(
/data/users/whc/pytorch/torch/testing/_internal/common_utils.py:5341: UserWarning: 34 tensors were found in the garbage. Did you introduce a reference cycle?
warnings.warn(
/data/users/whc/pytorch/torch/testing/_internal/common_utils.py:5347: UserWarning: Dumping first 1 objgraphs of leaked tensors rendered to png
Graph written to /tmp/objgraph-ztz642h3.dot (19 nodes)
Graph viewer (xdot) not found, generating a png instead
Image generated as /tmp/objgraph-ztz642h3.png
```
rendering of ` /tmp/objgraph-ztz642h3.png`:
<img width="1671" alt="image" src="https://github.com/user-attachments/assets/9098ff29-224c-4533-935b-83c210ac2e22">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136584
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
ghstack dependencies: #136507
Co-authored-by: Howard Huang <howardhuang@fb.com>
Related: #132695
This PR uses padded dense <-> jagged conversions to handle binary pointwise broadcasting of (NT, T) and (T, NT). This includes:
* `(B, j0, D) + (1, 1, 1)`
* `(B, j0, D) + (B, 1, 1)`
* `(B, j0, D) + (B, 1, D)`
* etc.
This PR also adds (hacky) support for bool inputs to the jagged <-> padded dense conversions. The underlying CUDA kernels do not support integer / bool inputs; so the following workaround is employed: `convert input -> half, run conversion kernel, convert output -> bool`. Note that this bool support is needed specifically for the backward formula of `fmax`, and likely others.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133021
Approved by: https://github.com/cpuhrsch
Fixes#93843
`EmbeddingBag()` / `embedding_bag()` support 1D inputs with offsets to handle raggedness. NJT is a natural fit here as it already maintains offsets of the same form. This PR updates the python-side to support NJT and adds corresponding OpInfo-based NJT tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135888
Approved by: https://github.com/cpuhrsch
Summary:
To facilitate PSS-2 upgrade, this uses `ndt.NDArray` instead of `nd.ndarray` in type annotations. In Numpy-1.19 (PSS-1) it's an alias to `nd.ndarray` -- a noop.
In Numpy-1.24, `ndt.NDArray` a proper generic type, and without this change uses of `nd.ndarray` generate this Pyre type error:
```counterexample
Invalid type parameters [24]: Generic type `np.ndarray` expects 2 type parameters.
```
Test Plan: Sandcastle plus visual inspection
Differential Revision: D62977370
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136288
Approved by: https://github.com/kit1980
skip_if_rocm is used only in multiprocess case (when UT test class is a child of MultiProcessTestCase). Each individual process can exit with a skip code. If used for single process UT, it will cause the UT to fail as the process returns a non-zero exit code. Use skipIfRocm in single process UTs.
To avoid the above confusion, this PR renamed skip_if_rocm to skip_if_rocm_multiprocess.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136161
Approved by: https://github.com/jithunnair-amd, https://github.com/kwen2501, https://github.com/fegin
`rms_norm()` is a nice-to-have for ViT :)
This PR:
* SymInt-ifies `rms_norm()`, allowing NJT to use the same decomp.
* Adds torch_function-based input validation logic for nested-specific stuff (no normalization supported over the ragged dim for now) on the python NJT side.
* Adds multi-dim support (on non-ragged, non-batch dims) to `mean()` for NJT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135872
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #125947
Fixes#136090
* Add support for isin to tensor half dtypes for CPU (just add a few extra dispatches).
* Seems like the CUDA implementation for bfloat16 was mostly compiled and available all along (it just calls sort internally AND unique). To enable it, we just need to remove an assert to access it (since sort's functionality was updated since the assert was added) and add missing dtype support to unique.
* This unlocks more GPU functionality with minimal code bloat. I also added CPU kernels for the dtypes for parity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136114
Approved by: https://github.com/malfet
This PR is a replacement for https://github.com/pytorch/pytorch/pull/133085 for pushing a quick fix for RMSNorm.
The original author is @kkontny
Previous PR summary:
Since FP16 has quite small dynamic range it is very easy to overflow while computing `at::pow(input, 2)` , and it happens in real world computation.
I've tried to use `nn.RMSNorm` fused implementation instead of `LlamaRMSNorm` inside `transformers` implementation of Llama (`src/transformers/models/llama/modeling_llama.py`). It started to give wrong answers in Fp16 while still giving good in FP32. I figured out happens due to overflow while computing square of the input tensor.
Original `LLamaRMSNorm` implementation upcasts input to fp32 to prevent this and give better numerical stability.
```
class LlamaRMSNorm(nn.Module):
def __init__(self, hidden_size, eps=1e-6):
"""
LlamaRMSNorm is equivalent to T5LayerNorm
"""
super().__init__()
self.weight = nn.Parameter(torch.ones(hidden_size))
self.variance_epsilon = eps
def forward(self, hidden_states):
input_dtype = hidden_states.dtype
hidden_states = hidden_states.to(torch.float32)
variance = hidden_states.pow(2).mean(-1, keepdim=True)
hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
return self.weight * hidden_states.to(input_dtype)
```
Proposed commit fixed the issue. FP16 in RMSNorm has to be treated in special way, to be usable in real world implementations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134106
Approved by: https://github.com/mikaylagawarecki, https://github.com/eqy
Notable changes:
1. Enable CudaGraph related tests
2. Fix UT problems
3. EXPERIMENTAL Navi31 support. User should enable Navi31 support with Env Var `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`
Know Problem:
1. `test/test_transformers.py` will massive failures and/or NaN outputs with `--use-pytest`
+ Update: Confirmed skip `class TestSDPAPrivateUse1Only` can fix the problem with `--use-pytest`
Note:
AOTriton 0.7b adds support to nestedtenosrs+SDPA but need more work (and consequently a separate PR) to enable it.
Fixes#133540
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134498
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet
reland of https://github.com/pytorch/pytorch/pull/133113
I have to create a new PR because the previous reverted PR could not either be rebased, or imported successfully :(
----
Moving DTensor to be in the public namespace, to formally add the documentation page that includes all the public APIs. This includes:
* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next PRs)
* To preserve the BC for users still using the torch.distributed._tensor, I added a shim script to redirect old path calls to the new module
The BC preserving is evidented by the fact that all DTensor tests are still working without changing the public imports. So it's safe to land the changes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203
Approved by: https://github.com/tianyu-l
Hi,
I noticed the `unfold` operator was missing on MaskedTensor.
I tested that my change works when calling unfold and backward on a `MaskedTensor` but I didn't find the tests for the dispatch of such operation. Where is it?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262
Approved by: https://github.com/cpuhrsch
This PR is slightly a revival / update to the discussion from https://github.com/pytorch/pytorch/pull/98960:
Part of FSDP2's tracing strategy right now is that:
(1) it is painful/difficult to handle the case where we have multiple graph input tensors that are aliased to each other and at least one of them is duplicated
(2) we already have longstanding in logic to remove duplicate input tensors from the graph in dynamo. Morally, FSDP2 gives us duplicate input tensors in the backward graph for every `unsharded_param`, because we have (a) the `unsharded_param` being closed over by the backward hook to resize/allgather, and (b) the same `unsharded_param` being saved for backward by autograd (we now guarantee in the partitioner that we will always save the base tensor for backward and recompute views)
(3) However, we were still seeing cases where the `unsharded_param` showed up twice in the backward graph inputs, as distinct tensor objects (with different python ids) instead of being true duplicates that dynamo can de-dup.
It turns on that this was because we were `.detach()`ing the `unsharded_param` in AOTDispatcher before plumbing it through the compiled forward (and so autograd would save a detach'd version of the `unsharded_param`). This is precisely because of the logic from https://github.com/pytorch/pytorch/pull/98960.
However, re-reading the detailed comments, it seems unnecessary to do a detach() on a graph input that is a (leaf) `nn.Parameter`, even if it happens to get no gradients in the backward. Since it is a leaf, we don't have to worry about the autograd engine "continuing to backprop through the graph beyond the current tensor" (the leaf has no other grad_fn for autograd to backprop through).
So this PR makes us a bit less aggressive about calling detach() on inputs: we only do it when:
(1) our graph input statically will get a `None` gradient (and also has no metadata mutations, the existing state)
(2) **and** our graph input is a non-leaf tensor (so detach()ing is actually required to prevent autograd from incorrectly backpropping past the non-leaf.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134193
Approved by: https://github.com/yf225
Co-authored-by: Will Feng <yf225@cornell.edu>
Summary:
In graph of TestXNNPACKQuantizer.test_dynamic_linear_with_con test, some quantized_decomposed.quantize_per_tensor.default ops are becoming quantized_decomposed.dequantize_per_tensor.tensor ops when using the new training ir.
This is because we lift params/buffers before calling make_fx. So previously, for the graph that’s passed to make_fx,`graph.L__self___linear1.weight` is a tensor
now in training ir, graph.L__self___linear1.weight is a FakeTensor. This caused the node overload to be different.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_dynamic_linear_with_conv
```
Differential Revision: D61364547
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134525
Approved by: https://github.com/tugsbayasgalan, https://github.com/jerryzh168
Summary:
When exporting for training with `tolist`, we do not hit `FunctionalTensor.tolist` since we do not functionalize. Unfortunately, this means we hit `FakeTensor.tolist`, which creates unbacked symints that are not backed by proxies.
Rather than trying to patch up this low-level implementation, we replace it with essentially what `FunctionalTensor.tolist` does, which is higher-level: we essentially desugar to `item()` calls and let it take care of unbacked symints.
Test Plan:
Some expected failures are gone now.
Also found a test for `tolist` that was written when `FunctionalTensor.tolist` was implemented but not really doing much; repurposed it now to exercise more modes.
Differential Revision: D62197742
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135131
Approved by: https://github.com/ezyang
Differential Revision: D61506212
Use `skipCUDAIf` from `torch.testing._internal.common_device_type` if we create the test class with `instantiate_device_type_tests`.
`instantiate_device_type_tests` would make sure the class has attr device_type, which works with`skipCUDAIf` from `torch.testing._internal.common_device_type`.
Also skipping test_vertical_pointwise_reduction_fusion for cpu test class, since the test expects cuda.
FAILED [0.0026s] test/inductor/test_unbacked_symints.py::TestUnbackedSymintsCPU::test_vertical_pointwise_reduction_fusion_cpu - AttributeError: 'TestUnbackedSymintsCPU' object has no attribute 'device'
repro:
```
CUDA_VISIBLE_DEVICES="" pytest test/inductor/test_unbacked_symints.py -k cpu -v
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133936
Approved by: https://github.com/ColinPeppler, https://github.com/desertfire
The issue:
Const propagation checks only if arguments do not have FakeTensor. If argument is Subclass, it will pass this condition.
As a result Const Propogation execution happens without FakeTensorMode and having tensor factories inside Subclass.__torch_dispatch__ results that this Tensor is not Fakified.
Solution:
If we have subclasses arguments, do not count that const propagation is doable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134855
Approved by: https://github.com/zou3519
This essentially undoes large skips on everything but MacOS Sequoia to nn.modules made by https://github.com/pytorch/pytorch/pull/128393
Instead it uses existing `xfail`, but guards it on `_macos15_or_newer` boolean
Before the change if run on MacOS 14:
```
% python3 ../test/test_modules.py -v -k Hardswish 2>&1|tail -n3
Ran 57 tests in 0.053s
OK (skipped=32)
```
After
```
% python3 ../test/test_modules.py -v -k Hardswish 2>&1|tail -n3
Ran 57 tests in 0.229s
OK (skipped=10, expected failures=2)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134858
Approved by: https://github.com/janeyx99
Hi,
I noticed the `unfold` operator was missing on MaskedTensor.
I tested that my change works when calling unfold and backward on a `MaskedTensor` but I didn't find the tests for the dispatch of such operation. Where is it?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262
Approved by: https://github.com/cpuhrsch
See #121528 for additional context.
In #120682, we moved the attention kernels from meta_registrations to fake_impls with the intent of fixing the device handling for seed/offset: these are typically on CPU. We needed to put the registrations in fake_impls to do this because meta_registrations doesn't have a way to specify device, whereas fake_impls does. But when we tried to actually fix the device types (#120839), we had to revert the PR because it broke cudagraph handling (during which seed/offset _are_ on CUDA).
Now, we want to put the registrations back in meta_registrations so that we can call these kernels with meta tensors. The use case is later in this stack - we want to be able to use the flop counter with these kernels.
Also - I specifically skip the `compare_tensor_meta()` check in test_fake / test_fake_autocast tests for the `_efficient_attention_forward` and `_flash_attention_forward` kernels, which fails because of the device mismatch from the seed/offset tensors. Then we can un-skip these opinfos. I verified that the efficient_attention_forward bug (#120842) is now caught by these opinfos if I revert the fix from this PR.
Differential Revision: [D61687369](https://our.internmc.facebook.com/intern/diff/D61687369)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134288
Approved by: https://github.com/drisspg
Which fixes BatchNorm behavior for if called with empty tensors on MPS backed. Removed `expectedFailureMPS` in test_nn.py, deleted expected failure in `test_mps.py` and adjusted `skipIfMPS` to `expectedFailureMPS` in BatchNorm2d OpInfo decorator, but restrict it only to the memory format tests
Test Plan: CI + `python3 -c "import torch; print(torch.nn.BatchNorm2d(3, device='mps')(torch.rand(0, 3, 2, 2, device='mps')))"`
Fixes https://github.com/pytorch/pytorch/issues/134423
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134540
Approved by: https://github.com/Skylion007, https://github.com/albanD
## Context
In some user Triton kernels, we have this set-up for whatever reason.
```
@triton.jit
def mykernel(
param0,
param1,
param2,
param3: tl.constexpr, # autotuned
param4, # non-constexpr
):
...
```
This is an edge case because it's a general practice to declare all constexprs params at the end.
And this will be an issue for AOTI because it fails to codegen all 4 params. That will surface as a device-side error: CUDA IMA, invalid argument...
```
> void* kernel_args_var_0[] = {&var_0, &var_1, &var_2};
---
< CUdeviceptr var_3;
< AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_get_data_ptr(buf0, reinterpret_cast<void**>(&var_3)));
< void* kernel_args_var_0[] = {&var_0, &var_1, &var_2, &var_3};
```
## Root-cause
* `kernel.constexpr` from the Kernel side-table contains the indices for all `constexpr` params that includes autotuned params.
* `raw_args`, that gets passed to wrapper codegen, excludes autotuned args.
* In the wrapper codegen, we try to find non-constexpr args using `kernel.constexpr` & `raw_args`. This is okay unless there's a `raw_arg` after an autotuned param in the function signature.
79b7fff188/torch/_inductor/codegen/cpp_wrapper_cuda.py (L118-L126)
## Fix
We try to fix this, by calculating the right constexprs wrt `raw_args`.
An illustration
```
raw_args: [arg0, arg1, arg2, arg4]
kernel.arg_names: [param0, param1, param2, param3, param4]
kernel.constexprs: [3] # param3 is autotuned; this is correct wrt kernel.arg_names
constexpr_indices: [] # this is correct wrt raw_args
```
Differential Revision: [D61831625](https://our.internmc.facebook.com/intern/diff/D61831625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134520
Approved by: https://github.com/oulgen
This PR increases test coverage by including the tests in `test/test_nn.py` in the test suite of MPS.
Some of the tests are decorated with `@expectedFailureMPS` for various reasons. Either that the op is not implemented, or that the outputs do not align. Those tests that contain differing results should be investigated further to rule out any live bugs.
```bash
$ python test/run_test.py --mps --verbose -k TestNN
Running test batch 'tests to run' cost 84.76 seconds
```
Ref #133520
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134184
Approved by: https://github.com/albanD, https://github.com/malfet
I had a night mare rewriting tests in test_misc.py specifically :
1. graphs can have comments that refers to my files "/lsakka/.." we really dont care about comments add option to ignore comments.
2. empty lines added when EXPECTTEST_ACCEPT=1 are changed with linter causing tests to fail or linter fail!
add flag to ignore empty lines.
3. EXPECTTEST_ACCEPT fails when the text have some not readable characters. those should not effect comparing strings, also those causes weird diffs comments when tests fails. I removed ansi_escape chars https://github.com/pytorch/pytorch/pull/133045
this is used in
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134248
Approved by: https://github.com/aorenste
ghstack dependencies: #133639, #134364
Windows file path use `\` as delimiter, it is also a escape character. We need translate all path `\` to `/`. which like Linux.
Reproduce UT:
```cmd
pytest test\dynamo\test_higher_order_ops.py -v -k test_vmap_grad_vmap_guard_fail
```
Error msg:
```cmd
________________________________________________________________________________________________________ HigherOrderOpVmapGuardTests.test_vmap_grad_vmap_guard_fail _________________________________________________________________________________________________________
Traceback (most recent call last):
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\logging_utils.py", line 89, in test_fn
fn(self, records)
File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py", line 2714, in test_vmap_grad_vmap_guard_fail
munge_exc(record.getMessage()),
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 5252, in munge_exc
s = re.sub(file, os.path.basename(file), s)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\re.py", line 209, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\re.py", line 303, in _compile
p = sre_compile.compile(pattern, flags)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_compile.py", line 788, in compile
p = sre_parse.parse(p, flags)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 955, in parse
p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 444, in _parse_sub
itemsappend(_parse(source, state, verbose, nested + 1,
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 526, in _parse
code = _escape(source, this, state)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 370, in _escape
raise source.error("incomplete escape %s" % escape, len(escape))
re.error: incomplete escape \x at position 2
To execute this test, run the following from the base repo dir:
python test\dynamo\test_higher_order_ops.py HigherOrderOpVmapGuardTests.test_vmap_grad_vmap_guard_fail
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
--------------------------------------------------------------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------------------------------------------------------------
frames [('total', 2), ('ok', 2)]
inductor []
inline_call []
stats [('calls_captured', 38), ('unique_graphs', 2)]
--------------------------------------------------------------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------------------------------------------------------------
V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles] Recompiling function fn in D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py:2699
V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles] triggered by the following guard failure(s):
V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles] - 0/0: torch._functorch.pyfunctorch.compare_functorch_state([('Vmap', 1, 'error')]) # _dynamo\output_graph.py:479 in init_ambient_guards
========================================================================================================================== short test summary info ==========================================================================================================================
FAILED [0.7452s] test/dynamo/test_higher_order_ops.py::HigherOrderOpVmapGuardTests::test_vmap_grad_vmap_guard_fail - re.error: incomplete escape \x at position 2
```
Local test passed:
<img width="860" alt="image" src="https://github.com/user-attachments/assets/90f0d780-0639-4c03-8d7c-6f227c93a3fc">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134348
Approved by: https://github.com/jansel
This PR switches to cuDSS library and has the same purpose of #127692, which is to add Sparse CSR tensor support to linalg.solve.
Fixes#69538
Minimum example of usage:
```
import torch
if __name__ == '__main__':
spd = torch.rand(4, 3)
A = spd.T @ spd
b = torch.rand(3).to(torch.float64).cuda()
A = A.to_sparse_csr().to(torch.float64).cuda()
x = torch.linalg.solve(A, b)
print((A @ x - b).norm())
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129856
Approved by: https://github.com/amjames, https://github.com/lezcano, https://github.com/huydhn
Co-authored-by: Zihang Fang <zhfang1108@gmail.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
As you can see, 'privateuse1' appears many times in out-of-tree extension codebase. I think that everything about the device type should be as same as other in-tree backends after registering the privateuse1 backend.
For example, after registering a privateuse1 backend named "foo", you should allow "foo" to be passed in as a valid device type.
```diff
- instantiate_device_type_tests(TestIndexing, globals(), only_for='privateuse1')
- instantiate_device_type_tests(NumpyTests, globals(), only_for='privateuse1')
+ instantiate_device_type_tests(TestIndexing, globals(), only_for='foo')
+ instantiate_device_type_tests(NumpyTests, globals(), only_for='foo')
```
> https://github.com/Ascend/pytorch/blob/master/test/test_indexing.py#L1654-L1655
The change is to map privateuse1 backend name to 'privateuse1' when calling `filter_desired_device_types()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133082
Approved by: https://github.com/albanD
Summary: Defaulting TORCH_NCCL_DUMP_ON_TIMEOUT to "true" and adding a kilswitch in case we need to kill this feature in production.
Test Plan: Tests pass manually but need futher testing before this is rolled out fully everywhere.
Differential Revision: D61136320
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133237
Approved by: https://github.com/c00w
# UPDATE:
This is take 3 of https://github.com/pytorch/pytorch/pull/131863 which was landed via co dev but not applying correclty
# Summary
Changes the stance of SDPA on what to do for fully masked out rows
## Current Behavior
Several PyTorch users have expressed frustration over this issue:
- https://github.com/pytorch/pytorch/issues/41508
- https://github.com/pytorch/pytorch/issues/103749
- https://github.com/pytorch/pytorch/issues/103963
These are significant issues with extensive discussion but no satisfactory resolution. The PyTorch team's consensus, as stated here:
https://github.com/pytorch/pytorch/issues/24816#issuecomment-524415617
Can be paraphrased as follows:
When passing in fully masked out rows, attention becomes ambiguous. We have two main options:
1. Uniformly attend to all values:
```python
scores[masked_out_rows] = 1 / len(row)
out[masked_out_rows] = 1 / len(row) * value
```
2. Decide that attention between no queries (masked) and no keys (masked) is meaningless:
```python
output[fully_masked_rows] = NaN
```
We went with option 2. Partially because it was easier to implement, but also people argued that users can slice the output to remove the NaNs:
``` Python
>fill_value = -float("inf")
>row0 = torch.randn(4)
>row1 = torch.tensor([(fill_value for _ in range(4)])
>matrix = torch.stack([row0, row1]).requires_grad_(True)
>out = torch.softmax(matrix, 1)
>out = out[0]
>print(out)
tensor([0.5377, 0.2729, 0.0692, 0.1201])
```
Cool, problem solved. But what happends when you call backwards..
```Python
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[3.0957e-08, 1.4157e-08, 7.7802e-10, 1.3713e-08],
[ nan, nan, nan, nan]])
```
Those pesky NaNs are back!
## Why do we see NaNs today?
The core of the problem revolves around using softmax function in sdpa:
```python
> row = torch.tensor([(-float("inf")) for _ in range(4)])
> torch.softmax(row, 0)
tensor([nan, nan, nan, nan])
```
## Quick Aside: Masking in Attention
Attention itself doesn't have a concept of masking. The `sdpa` function has an argument called `attn_mask`, which would be more accurately named `attn_bias`. This is because we don't actually "mask" entries when computing attention. Instead, due to implementation details([performance](https://github.com/pytorch/pytorch/issues/25110#issuecomment-524519087)), we add a value to the masked-out query/key pairs.
We use a large negative number (typically -inf) to decrease the attention weight, as softmax assigns more weight to larger values.
## Alternative Approaches
If we use a very large negative number instead of -inf:
```python
> row = torch.tensor([(-1e6) for _ in range(4)])
> torch.softmax(row, 0)
tensor([0.2500, 0.2500, 0.2500, 0.2500])
```
However if users always remembered to "slice" out their outputs i.e.:
```Python
>fill_value = -1e6
>...
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[-0.0563, -0.0564, 0.1613, -0.0486],
[ 0.0000, 0.0000, 0.0000, 0.0000]])
```
This would bring us back into a better state.
## A Third Option
We don't necessarily need to alter the behavior of softmax for -inf or very large negative numbers. The fundamental goal is to exclude certain query/key pairs from attention, regardless of the underlying implementation.
This PR implements the new semantic for masking w/ attention in fully masked-out rows:
```python
out[masked_out_rows] = 0
```
**Important Note**: This idea isn't entirely new. The [MaskedTensor](https://pytorch.org/tutorials/prototype/maskedtensor_overview#safe-softmax) prototype, a tensor subclass, was designed to handle such cases. However, it remains a prototype feature and hasn't gained widespread adoption.
## Details
This PR stack does 3 things:
1. Adds a PRIVATE _safe_softmax op
2. Updates semantic for flash_cpu fused kernel
3. Updates semantic for efficient_cuda fused kernel
_safe_softmax is not supposed to be used generically and is only meant to be used within the context of SDPA. Due to this fact instead of decomposing softmax and checking for -inf rows we instead "cheat" and use nan_to_num.
Why I think this is okay? (please find a counter point if avail)
There are multiple ways NaNs can emerge. For the fully masked out rows case nan_to_num works. But what if there were other NaNs, wouldn't this silently remove them?
The only case that this can happen is if the input itself had a NaN or an Inf
For example:
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = torch.finfo(torch.float16).max
print(a.softmax(-1))
```
Will return
`tensor([0., 1., 0., 0.], dtype=torch.float16)`
Where
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = float("inf")
a.softmax(-1)
```
returns:
`tensor([nan, nan, nan, nan], dtype=torch.float16)`
If we dont want to even allow for the possibility of "inf" or "NaN" attention scores to be converted to 0 then we can implemented it something like this
```Python
max = torch.max(a, dim=-1, keepdim=True)
exp = torch.exp(a - max.values)
denom = torch.sum(exp, dim=-1, keepdim=True)
softmax = exp / denom
softmax = torch.where(max.values == float('-inf'), 0.0, softmax)
```
however we would be paying for this in math performance.
## Why Now
I think one point that has substantially changed where PyTorch should lie on this argument is the fact that we have fused implementations for SDPA now. And these fused implementations allow us to easily and performantly support this new semantic.
Differential Revision: [D61418679](https://our.internmc.facebook.com/intern/diff/D61418679)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133882
Approved by: https://github.com/soulitzer
Moving DTensor to be in the public namespace, to formally add the
documentation page that includes all the public APIs. This includes:
* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next
PRs)
* To preserve the BC for users still using the `torch.distributed._tensor`,
I added a shim script to redirect old path calls to the new module
The BC preserving is evidented by the fact that all DTensor tests are still
working without changing the public imports. So it's safe to land the
changes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113
Approved by: https://github.com/XilunWu
ghstack dependencies: #133305, #133306
This PR adds the foreach impl for Adafactor knowing that there are many ways to improve its runtime perf today (by adding more foreach support). After this PR:
- we have a foreach flag for Adafactor
- It is NOT the default. Why not? It is only slightly faster + uses O(n) more memory where n is the number of params in your max param group. People tend to use Adafactor for memory efficiency.
Next steps:
- make torch.compile possible on it
- make it faster (by adding more foreach apis)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132336
Approved by: https://github.com/albanD
ghstack dependencies: #133360
# Summary
Changes the stance of SDPA on what to do for fully masked out rows
## Current Behavior
Several PyTorch users have expressed frustration over this issue:
- https://github.com/pytorch/pytorch/issues/41508
- https://github.com/pytorch/pytorch/issues/103749
- https://github.com/pytorch/pytorch/issues/103963
These are significant issues with extensive discussion but no satisfactory resolution. The PyTorch team's consensus, as stated here:
https://github.com/pytorch/pytorch/issues/24816#issuecomment-524415617
Can be paraphrased as follows:
When passing in fully masked out rows, attention becomes ambiguous. We have two main options:
1. Uniformly attend to all values:
```python
scores[masked_out_rows] = 1 / len(row)
out[masked_out_rows] = 1 / len(row) * value
```
2. Decide that attention between no queries (masked) and no keys (masked) is meaningless:
```python
output[fully_masked_rows] = NaN
```
We went with option 2. Partially because it was easier to implement, but also people argued that users can slice the output to remove the NaNs:
``` Python
>fill_value = -float("inf")
>row0 = torch.randn(4)
>row1 = torch.tensor([(fill_value for _ in range(4)])
>matrix = torch.stack([row0, row1]).requires_grad_(True)
>out = torch.softmax(matrix, 1)
>out = out[0]
>print(out)
tensor([0.5377, 0.2729, 0.0692, 0.1201])
```
Cool, problem solved. But what happends when you call backwards..
```Python
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[3.0957e-08, 1.4157e-08, 7.7802e-10, 1.3713e-08],
[ nan, nan, nan, nan]])
```
Those pesky NaNs are back!
## Why do we see NaNs today?
The core of the problem revolves around using softmax function in sdpa:
```python
> row = torch.tensor([(-float("inf")) for _ in range(4)])
> torch.softmax(row, 0)
tensor([nan, nan, nan, nan])
```
## Quick Aside: Masking in Attention
Attention itself doesn't have a concept of masking. The `sdpa` function has an argument called `attn_mask`, which would be more accurately named `attn_bias`. This is because we don't actually "mask" entries when computing attention. Instead, due to implementation details([performance](https://github.com/pytorch/pytorch/issues/25110#issuecomment-524519087)), we add a value to the masked-out query/key pairs.
We use a large negative number (typically -inf) to decrease the attention weight, as softmax assigns more weight to larger values.
## Alternative Approaches
If we use a very large negative number instead of -inf:
```python
> row = torch.tensor([(-1e6) for _ in range(4)])
> torch.softmax(row, 0)
tensor([0.2500, 0.2500, 0.2500, 0.2500])
```
However if users always remembered to "slice" out their outputs i.e.:
```Python
>fill_value = -1e6
>...
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[-0.0563, -0.0564, 0.1613, -0.0486],
[ 0.0000, 0.0000, 0.0000, 0.0000]])
```
This would bring us back into a better state.
## A Third Option
We don't necessarily need to alter the behavior of softmax for -inf or very large negative numbers. The fundamental goal is to exclude certain query/key pairs from attention, regardless of the underlying implementation.
This PR implements the new semantic for masking w/ attention in fully masked-out rows:
```python
out[masked_out_rows] = 0
```
**Important Note**: This idea isn't entirely new. The [MaskedTensor](https://pytorch.org/tutorials/prototype/maskedtensor_overview#safe-softmax) prototype, a tensor subclass, was designed to handle such cases. However, it remains a prototype feature and hasn't gained widespread adoption.
## Details
This PR stack does 3 things:
1. Adds a PRIVATE _safe_softmax op
2. Updates semantic for flash_cpu fused kernel
3. Updates semantic for efficient_cuda fused kernel
_safe_softmax is not supposed to be used generically and is only meant to be used within the context of SDPA. Due to this fact instead of decomposing softmax and checking for -inf rows we instead "cheat" and use nan_to_num.
Why I think this is okay? (please find a counter point if avail)
There are multiple ways NaNs can emerge. For the fully masked out rows case nan_to_num works. But what if there were other NaNs, wouldn't this silently remove them?
The only case that this can happen is if the input itself had a NaN or an Inf
For example:
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = torch.finfo(torch.float16).max
print(a.softmax(-1))
```
Will return
`tensor([0., 1., 0., 0.], dtype=torch.float16)`
Where
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = float("inf")
a.softmax(-1)
```
returns:
`tensor([nan, nan, nan, nan], dtype=torch.float16)`
If we dont want to even allow for the possibility of "inf" or "NaN" attention scores to be converted to 0 then we can implemented it something like this
```Python
max = torch.max(a, dim=-1, keepdim=True)
exp = torch.exp(a - max.values)
denom = torch.sum(exp, dim=-1, keepdim=True)
softmax = exp / denom
softmax = torch.where(max.values == float('-inf'), 0.0, softmax)
```
however we would be paying for this in math performance.
## Why Now
I think one point that has substantially changed where PyTorch should lie on this argument is the fact that we have fused implementations for SDPA now. And these fused implementations allow us to easily and performantly support this new semantic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131060
Approved by: https://github.com/jbschlosser
This PR fixes flaky internal tests:
- The AutoHeuristic test was sometimes failing because it required autotuning to happen for mixed_mm which didn't end up happening when there was a fx graph cache hit.
- The tests inside pattern_matcher failed because in some cases pad_mm decided to pad which made the mixed_mm pattern not match anymore (instead of cast -> mm, it was cast -> pad -> mm), and the tests also fail when is_big_gpu is false (which I haven't found an explanation for).
Differential Revision: [D60972176](https://our.internmc.facebook.com/intern/diff/D60972176)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133015
Approved by: https://github.com/Chillee, https://github.com/eellison
It's possible to construct an NJT with "holes" by specifying both `offsets` and `lengths` metadata. When `nt.clone(memory_format=torch.contiguous_format)` is called on such an NJT, the result should be an NJT without holes.
This PR fixes this in simplistic way using `unbind()`, which isn't really supported in `torch.compile`. The longer term solution involves writing a proper kernel to support this.
NB: Another limitation is that the returned NJT does not have the same ragged structure as the input. While we could manually hack the nested int registry (or update the union find when that lands), this is the first instance where a NJT with holes and an NJT without holes could have the same ragged structure, and getting those to play nicely together requires some fairly involved updates. For now, this PR punts on these updates until we can clean this up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132776
Approved by: https://github.com/ani300, https://github.com/soulitzer
ghstack dependencies: #131898, #131704, #131937
Move the slow test json to be in the pytorch/pytorch repo and make a job that will update it weekly. The job uses the same environment as the commit hash. It uses similar code to the hash updates, but the hash update contains a lot of code that is specific to the hash update, so I chose to pick out the parts that are relevant
Remove references to the old file and set up testing to read from the new file instead
The old update cadence was every day, the new one is every week
The auto slow test infra + the lack of pinning between pytorch and test-infra makes it really hard to tell if a test started failing because of a change or because of the slow test json changing. While this can have benefits, like disable test issues being effective everywhere immediately, it can also be very confusing, especially since we don't have the same insight into slow tests like we do for disable issues.
Example PR made: https://github.com/pytorch/pytorch/pull/132383 (with all the changes from this PR because it was working on top of this)
We should just get rid of this at some point in favor of the slowTest decorator, but there are some tests that take 5+ minutes to run and I don't want to track them down right now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132379
Approved by: https://github.com/huydhn
More context in [#132471](https://github.com/pytorch/pytorch/issues/132471) and https://github.com/pytorch/pytorch/issues/132366.
TLDR:
When cuda is available and users move tensors to cuda, we cannot really reuse the default pg if default pg is gloo, as lots of collectives are not supported on gloo for cuda tensors. For example, `dtensor.full_tensor()` would result in a mysterious SIGTERM when all_gather a cuda tensor using gloo. Without the change in this PR, users would have to know the context and explicitly move the cuda tensor to cpu before invoking most collectives, which I think is not so ideal UX.
Therefore, given most collectives are not supported on gloo for cuda tensors, we should init a new pg if the default pg is gloo when torch.cuda.is_available() and device_type is cuda.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132709
Approved by: https://github.com/awgu, https://github.com/wanchaol
Combines contributions from https://github.com/pytorch/pytorch/pull/130505
Some context can be found in this large comment block:
a5b64d39fd/test/dynamo/test_subclasses.py (L1667-L1681)
Changes in this PR
- For each tensor fakified, check the nested int registry in eager, and eagerly symbolicize if that tensor has already been associated with nested int in eager.
- Adds a separate counter stored on FakeTensorMode as a fake analog to _tensor_id_counter (which keeps track of unique tensors). This counter is initialized to the global eager tensor id counter upon creation of the FakeTensorMode, and needs to be reset when the same FakeTensorMode is reused to trace again (in this PR, we piggyback on the epoch incrementing logic).
- (refactor) Today, we store FakeTensor -> symbolic nested int in the global registry. With this PR, symbolic nested int is stored directly on the FakeTensor. (Eager still caches nested int in the registry, though we should avoid this at some point.)
Basically unchanged, but worth noting:
- `__tensor_unflatten__` is still responsible for determining whether we should cache for now. The logic is somewhat simplified.
- to_copy is still using the trick of updating two different tensors in the registry to point to the same nested int. This is kind of broken, but we try to leave it as is, and plan a better fix with the UnionFind stack.
Differential Revision: [D60406772](https://our.internmc.facebook.com/intern/diff/D60406772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130292
Approved by: https://github.com/bdhirsh
ghstack dependencies: #131916, #131803
Before setting up float8 numeric parity test, I have to set up regular TP numeric parity test, preferrably testing 10 iterations
this PR sets a baseline of TP numerics. I can verify fp8 on top of it
Summary:
Test Plan:
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132543
Approved by: https://github.com/tianyu-l
ghstack dependencies: #132350
Summary:
This is a reland attempt of [#131431](https://github.com/pytorch/pytorch/pull/131431), as, in its original form, the PR has caused issues internally.
We currently don't support some of the `triton.autotune` arguments when compiling user-written Triton kernels with PT2. In this PR, we're adding a flag to circumvent it. This is to unblock internal compilation in some cases. The flag is supplied with the docs mentioning why it is not a good idea to set it.
Test Plan:
```
python test/inductor/test_triton_kernels.py -k test_triton_kernel_
autotune_with_unsupported_args
...
----------------------------------------------------------------------
Ran 3 tests in 3.636s
OK
```
Differential Revision: D60701839
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132562
Approved by: https://github.com/chenyang78
Add similar semantics for creating a buffer object similar to creating a parameter. This is done by introducing a new Buffer class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same as the register_buffer method has not been changed. The persistent parameter in the Buffer type is to indicate whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new Buffer type recognized by inductor and dynamo. Remaining changes are test changes to make sure that the Buffer type can be used as a drop in replacement for register_buffer as it just leads to register_buffer being called. The addition of this new functionality still allows for normal tensors to be used as buffers so these changes are intended to be backwards compatible.
Fixes#35735
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125971
Approved by: https://github.com/albanD, https://github.com/anijain2305, https://github.com/mlazos
This PR utilizes the info from the existing OpInfo database `op_db` to contribute to general NJT testing.
* New tests in `TestNestedTensorOpInfo`
* `test_forward()` - compares forward output to an unbind-based reference
* `test_backward()` - compares forward output and grads to an unbind-based reference
* `test_forward_compile()` - compares forward compile output (`backend="aot_eager_decomp_partition"`) to eager
* `test_backward_compile()` - compares forward compile output (`backend="aot_eager_decomp_partition"`) and grads to eager
* To avoid adding a bunch of NJT-specific stuff to the `OpInfo` structure, this PR translates `op_db` -> a NJT-specific `njt_op_db`.
* `UnaryUfuncInfo`s utilize a new `sample_inputs_unary_njt_pointwise()` which iterates through a comprehensive list of NJTs: contiguous / non-contiguous, dims 2, 3, and 4, transposed / not, etc.
* `BinaryUfuncInfo`s utilize a new `sample_inputs_binary_njt_pointwise()` which iterates through a comprehensive list of NJTs: contiguous / non-contiguous, dims 2, 3, and 4, transposed / not, etc.
* `ReductionOpInfo`s utilize a new `sample_inputs_njt_reduction()` which covers full reductions, reductions over the jagged dim, and reductions over the non-jagged dim
* Several xfails were added to get things passing
TODO (future PRs):
* Pass non-contiguous / non-contiguous with holes NJTs (maybe we should have separate tests for these? most ops don't support NJTs with holes today)
* Mixed (NT, T), (T, NT) inputs for binary ops
* Handle other types of OpInfos (beyond unary pointwise, binary pointwise, and reduction) by manually by writing sample_inputs_funcs
* Address all xfails via fixes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131704
Approved by: https://github.com/soulitzer
ghstack dependencies: #131898
**Background**: `TestCase.assertEqual()` is commonly used during test case validation. Historically, to support NSTs, the logic was written to compare two nested tensors by unbinding them and comparing their components. This logic applied to NJTs as well, which in practice meant that two NJTs with different nested ints in their shapes could compare equal if their components were equal.
This PR changes the above logic so that NJTs are no longer unbound during comparison, allowing them to receive full shape validation. This makes `TestCase.assertEqual()` stricter for NJTs, requiring them to have the same nested ints in their shapes to compare equal.
Note that some tests rely on the old, looser behavior. To address this, the PR introduces a base `NestedTensorTestCase` that defines a helper function `assertEqualIgnoringNestedInts()` so that these tests can explicitly opt in to the looser comparison behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131898
Approved by: https://github.com/soulitzer
Add a new label `ci-test-showlocals` and add it to test config filter.
If the PR is labeled with `ci-test-showlocals` or "ci-test-showlocals"
present in the PR comment, the test config filter will set a environment
variable `TEST_SHOWLOCALS`. Then `pytest` will show local variables on
failures for better debugging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131981
Approved by: https://github.com/malfet
ghstack dependencies: #131151
------
As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI.
Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily.
Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361
```text
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000
@classmethod
def eval(cls, base, divisor):
# python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full
# Assert triggered by inequality solver
# assert base.is_integer, base
# assert divisor.is_integer, divisor
# We don't provide the same error message as in Python because SymPy
# makes it difficult to check the types.
if divisor.is_zero:
raise ZeroDivisionError("division by zero")
if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in (
int_oo,
-int_oo,
sympy.oo,
-sympy.oo,
):
return sympy.nan
if base is sympy.nan or divisor is sympy.nan:
return sympy.nan
if base.is_zero:
return sympy.S.Zero
if base.is_integer and divisor == 1:
return base
if base.is_integer and divisor == -1:
return sympy.Mul(base, -1)
if (
isinstance(base, sympy.Number)
and isinstance(divisor, sympy.Number)
and (
base in (int_oo, -int_oo, sympy.oo, -sympy.oo)
or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo)
)
):
r = float(base) / float(divisor)
if r == math.inf:
return int_oo
elif r == -math.inf:
return -int_oo
elif math.isnan(r):
return sympy.nan
else:
return sympy.Integer(math.floor(r))
if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer):
return sympy.Integer(int(base) // int(divisor))
if isinstance(base, FloorDiv):
return FloorDiv(base.args[0], base.args[1] * divisor)
# Expands (x + y) // b into x // b + y // b.
# This only works if floor is an identity, i.e. x / b is an integer.
for term in sympy.Add.make_args(base):
quotient = term / divisor
if quotient.is_integer and isinstance(divisor, sympy.Integer):
# NB: this is correct even if the divisor is not an integer, but it
# creates rational expressions that cause problems with dynamic
# shapes.
return FloorDiv(base - term, divisor) + quotient
try:
gcd = sympy.gcd(base, divisor)
if gcd != 1:
> return FloorDiv(
sympy.simplify(base / gcd), sympy.simplify(divisor / gcd)
)
base = -1.00000000000000
cls = FloorDiv
divisor = -1.00000000000000
gcd = 1.00000000000000
quotient = 1.00000000000000
term = -1.00000000000000
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {}
@wraps(func)
def wrapper(*args, **kwargs):
try:
> retval = cfunc(*args, **kwargs)
E RecursionError: maximum recursion depth exceeded in comparison
E
E To execute this test, run the following from the base repo dir:
E python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float
E
E This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
args = (FloorDiv, -1.00000000000000, -1.00000000000000)
cfunc = <functools._lru_cache_wrapper object at 0x7fc5303173a0>
func = <function Function.__new__ at 0x7fc530317280>
kwargs = {}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151
Approved by: https://github.com/ezyang
https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR get reverted a couple of times because we see post-land test failures that we don't see before merge. This PR only reset dynamo before each tests in `test_ops_gradients.py` to make it easier to land.
Eventually after we reset dynamo in each individual test files, we can move the change to the base class (TestCase) and remove the change in individual test files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131397
Approved by: https://github.com/zou3519
ghstack dependencies: #131551, #131388, #131372
https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR get reverted a couple of times because we see post-land test failures that we don't see before merge. This PR only reset dynamo before each tests in `test_module.py` to make it easier to land.
Eventually after we reset dynamo in each individual test files, we can move the change to the base class (TestCase) and remove the change in individual test files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131372
Approved by: https://github.com/zou3519
ghstack dependencies: #131551, #131388
Add a new label `ci-test-showlocals` and add it to test config filter.
If the PR is labeled with `ci-test-showlocals` or "ci-test-showlocals"
present in the PR comment, the test config filter will set a environment
variable `TEST_SHOWLOCALS`. Then `pytest` will show local variables on
failures for better debugging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131981
Approved by: https://github.com/malfet
https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR get reverted a couple of times because we see post-land test failures that we don't see before merge. This PR only reset dynamo before each tests in `test_torch.py` to make it easier to land.
Eventually after we reset dynamo in each individual test files, we can move the change to the base class (TestCase) and remove the change in individual test files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131388
Approved by: https://github.com/zou3519
ghstack dependencies: #131551
------
As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI.
Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily.
Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361
```text
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000
@classmethod
def eval(cls, base, divisor):
# python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full
# Assert triggered by inequality solver
# assert base.is_integer, base
# assert divisor.is_integer, divisor
# We don't provide the same error message as in Python because SymPy
# makes it difficult to check the types.
if divisor.is_zero:
raise ZeroDivisionError("division by zero")
if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in (
int_oo,
-int_oo,
sympy.oo,
-sympy.oo,
):
return sympy.nan
if base is sympy.nan or divisor is sympy.nan:
return sympy.nan
if base.is_zero:
return sympy.S.Zero
if base.is_integer and divisor == 1:
return base
if base.is_integer and divisor == -1:
return sympy.Mul(base, -1)
if (
isinstance(base, sympy.Number)
and isinstance(divisor, sympy.Number)
and (
base in (int_oo, -int_oo, sympy.oo, -sympy.oo)
or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo)
)
):
r = float(base) / float(divisor)
if r == math.inf:
return int_oo
elif r == -math.inf:
return -int_oo
elif math.isnan(r):
return sympy.nan
else:
return sympy.Integer(math.floor(r))
if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer):
return sympy.Integer(int(base) // int(divisor))
if isinstance(base, FloorDiv):
return FloorDiv(base.args[0], base.args[1] * divisor)
# Expands (x + y) // b into x // b + y // b.
# This only works if floor is an identity, i.e. x / b is an integer.
for term in sympy.Add.make_args(base):
quotient = term / divisor
if quotient.is_integer and isinstance(divisor, sympy.Integer):
# NB: this is correct even if the divisor is not an integer, but it
# creates rational expressions that cause problems with dynamic
# shapes.
return FloorDiv(base - term, divisor) + quotient
try:
gcd = sympy.gcd(base, divisor)
if gcd != 1:
> return FloorDiv(
sympy.simplify(base / gcd), sympy.simplify(divisor / gcd)
)
base = -1.00000000000000
cls = FloorDiv
divisor = -1.00000000000000
gcd = 1.00000000000000
quotient = 1.00000000000000
term = -1.00000000000000
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {}
@wraps(func)
def wrapper(*args, **kwargs):
try:
> retval = cfunc(*args, **kwargs)
E RecursionError: maximum recursion depth exceeded in comparison
E
E To execute this test, run the following from the base repo dir:
E python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float
E
E This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
args = (FloorDiv, -1.00000000000000, -1.00000000000000)
cfunc = <functools._lru_cache_wrapper object at 0x7fc5303173a0>
func = <function Function.__new__ at 0x7fc530317280>
kwargs = {}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151
Approved by: https://github.com/ezyang
This PR creates these `GroupedSchedulerNode`s:
- One for each all-gather code block (cast + copy-in + all-gather)
- One for each all-gather-wait code block (all-gather-wait + copy-out)
- One for each reduce-scatter code block (copy-in + reduce-scatter)
- One for each reduce-scatter-wait code block (reduce-scatter-wait)
This serves two goals:
- Prevent outside ops from being fused into these op groups, in order to have more predicable memory usage.
- Make it easier to specify the dependency e.g. from `i+1` all-gather group node to the `i` all-gather-wait group node, to enforce FSDP2 comm ordering (i.e. "serialization of comms").
The actual "reorder-for-FSDP-compute-comm-overlap" PR will come next.
Test commands:
- `pytest -rA test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131510
Approved by: https://github.com/yifuwang
#109581
At this point, the vanilla implementation (the default) is good.
Docs: https://docs-preview.pytorch.org/pytorch/pytorch/129905/generated/torch.optim.Adafactor.html#torch.optim.Adafactor
Specifically, the impl in this PR, which attempts to replicate the paper,
```
optim = torch.optim.Adafactor([weight])
```
is close enough to https://pytorch-optimizers.readthedocs.io/en/latest/optimizer/#pytorch_optimizer.AdaFactor
```
optim_c = AdaFactor([weight], betas=(0, 0.999), scale_parameter=False)
```
is close enough to https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adafactor
```
optim = keras.optimizers.Adafactor(learning_rate=0.01)
```
The three results respectively for the same randomly generated weights:
```
# ours
tensor([[ 0.3807594, -0.3912092],
[ 0.0762539, 0.5377805],
[ 0.2459473, 0.4662207]])
# pytorch-optimizer
tensor([[ 0.3807592, -0.3912172],
[ 0.0762507, 0.5377818],
[ 0.2459457, 0.4662213]])
# keras
array([[ 0.38076326, -0.39121315],
[ 0.0762547 , 0.5377859 ],
[ 0.24594972, 0.46622536]], dtype=float32)
```
This gives me confidence to move forward in speeding up the implementation now that a baseline has been established. If you're curious about differences:
* keras assigns step_size (rho_t in their code) to `min(lr, 1 / sqrt(step)` whereas the OG impl uses a hardcoded 0.01 instead of lr. We do the same thing as keras, but our lr default is 0.01.
* We differ from the pytorch-optimizers default in that our default will not track momentum (thus `beta1=0`) and we do not apply parameter scaling.
<details>
Keras collab: https://colab.research.google.com/drive/1i3xF8ChL7TWKJGV_5v_5nMhXKnYmQQ06?usp=sharing
My script repro:
```
import torch
from pytorch_optimizer import AdaFactor
torch.set_printoptions(precision=7)
weight = torch.tensor([[ 0.37697506, -0.39500135],
[ 0.07246649, 0.53399765],
[ 0.24216151, 0.46243715]], dtype=torch.float32)
# bias = torch.tensor([0, 0], dtype=torch.float32)
weight.grad = torch.tensor([[-0.5940447, -0.7743838],
[-0.5940447, -0.7743838],
[-0.5940447, -0.7743838]], dtype=torch.float32)
# bias.grad = torch.tensor([-2.5027974, 1.5422692], dtype=torch.float32)
weight_c = weight.clone()
weight_c.grad = weight.grad.clone()
optim = torch.optim.Adafactor([weight])
optim.step()
print(weight)
optim_c = AdaFactor([weight_c], betas=(0, 0.999), scale_parameter=False)
optim_c.step()
print(weight_c)
```
<details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129905
Approved by: https://github.com/albanD
------
As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI.
Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily.
Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361
```text
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000
@classmethod
def eval(cls, base, divisor):
# python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full
# Assert triggered by inequality solver
# assert base.is_integer, base
# assert divisor.is_integer, divisor
# We don't provide the same error message as in Python because SymPy
# makes it difficult to check the types.
if divisor.is_zero:
raise ZeroDivisionError("division by zero")
if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in (
int_oo,
-int_oo,
sympy.oo,
-sympy.oo,
):
return sympy.nan
if base is sympy.nan or divisor is sympy.nan:
return sympy.nan
if base.is_zero:
return sympy.S.Zero
if base.is_integer and divisor == 1:
return base
if base.is_integer and divisor == -1:
return sympy.Mul(base, -1)
if (
isinstance(base, sympy.Number)
and isinstance(divisor, sympy.Number)
and (
base in (int_oo, -int_oo, sympy.oo, -sympy.oo)
or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo)
)
):
r = float(base) / float(divisor)
if r == math.inf:
return int_oo
elif r == -math.inf:
return -int_oo
elif math.isnan(r):
return sympy.nan
else:
return sympy.Integer(math.floor(r))
if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer):
return sympy.Integer(int(base) // int(divisor))
if isinstance(base, FloorDiv):
return FloorDiv(base.args[0], base.args[1] * divisor)
# Expands (x + y) // b into x // b + y // b.
# This only works if floor is an identity, i.e. x / b is an integer.
for term in sympy.Add.make_args(base):
quotient = term / divisor
if quotient.is_integer and isinstance(divisor, sympy.Integer):
# NB: this is correct even if the divisor is not an integer, but it
# creates rational expressions that cause problems with dynamic
# shapes.
return FloorDiv(base - term, divisor) + quotient
try:
gcd = sympy.gcd(base, divisor)
if gcd != 1:
> return FloorDiv(
sympy.simplify(base / gcd), sympy.simplify(divisor / gcd)
)
base = -1.00000000000000
cls = FloorDiv
divisor = -1.00000000000000
gcd = 1.00000000000000
quotient = 1.00000000000000
term = -1.00000000000000
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {}
@wraps(func)
def wrapper(*args, **kwargs):
try:
> retval = cfunc(*args, **kwargs)
E RecursionError: maximum recursion depth exceeded in comparison
E
E To execute this test, run the following from the base repo dir:
E python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float
E
E This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
args = (FloorDiv, -1.00000000000000, -1.00000000000000)
cfunc = <functools._lru_cache_wrapper object at 0x7fc5303173a0>
func = <function Function.__new__ at 0x7fc530317280>
kwargs = {}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151
Approved by: https://github.com/ezyang
Summary: We currently don't support some of the `@triton.autotune` arguments when compiling user-written Triton kernels with PT2. In this PR, we're adding a flag to circumvent it. This is to unblock internal compilation in some cases. The flag is supplied with the docs mentioning why it is not a good idea to set it.
Test Plan:
```
python test/inductor/test_triton_kernels.py -k test_triton_kernel_
autotune_with_unsupported_args
...
----------------------------------------------------------------------
Ran 3 tests in 3.636s
OK
```
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131431
Approved by: https://github.com/oulgen, https://github.com/zou3519
Fixes#130284Fixes#130653
- Add `torch.library.register_vmap` to custom ops
- Add `register_vmap` for operators in ops in custom_op_db.
- Make `torch.autograd.Function` support kwarg-only kwargs for vmap
- test operators in op_db with `tests/test_vmap`.
- change `test_vmap` to allow custom `out_dim` and allow "None" in `out_dim` when testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589
Approved by: https://github.com/zou3519
Fixes#130284Fixes#130653
- Add `torch.library.register_vmap` to custom ops
- Add `register_vmap` for operators in ops in custom_op_db.
- Make `torch.autograd.Function` support kwarg-only kwargs for vmap
- test operators in op_db with `tests/test_vmap`.
- change `test_vmap` to allow custom `out_dim` and allow "None" in `out_dim` when testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589
Approved by: https://github.com/zou3519
This is an updated PR to equip cond with the autograd feature and replaces the old [PR](https://github.com/pytorch/pytorch/pull/126007)
@ydwu4 I tried to incorporate your requests already.
Currently there are two problems that I struggle with solving:
1. There seems to be an import issue when trying to import cond in `torch/__init__.py`, see [here](8a704035c9/torch/__init__.py (L1914-L1916)). Therefore, I had to comment those lines, which resolved the import issues, but I believe cond is not proberly exposed as torch.cond.
2. I am not entirely sure how to deal with the opinfo test in `hop_db.py`
Co-authored-by: Yidi Wu <yidi@meta.com>
Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126911
Approved by: https://github.com/ydwu4
Earlier the signature of dequantize ops for decomposed quantized Tensor was changed for wider use-cases where the output dtype can be different from torch.float and needs to be passed during dequantization.
Please refer: https://github.com/pytorch/pytorch/pull/121450
However, setting of correct output dtype for dequantize ops was still missing in convert_pt2e flow.
This change enables the users to use PT2E quantization flow with non torch.float unquantized dtype, such as torch.bfloat16.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128953
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
This is an updated PR to equip cond with the autograd feature and replaces the old [PR](https://github.com/pytorch/pytorch/pull/126007)
@ydwu4 I tried to incorporate your requests already.
Currently there are two problems that I struggle with solving:
1. There seems to be an import issue when trying to import cond in `torch/__init__.py`, see [here](8a704035c9/torch/__init__.py (L1914-L1916)). Therefore, I had to comment those lines, which resolved the import issues, but I believe cond is not proberly exposed as torch.cond.
2. I am not entirely sure how to deal with the opinfo test in `hop_db.py`
Co-authored-by: Yidi Wu <yidi@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126911
Approved by: https://github.com/ydwu4
# Summary
- This removes a bunch of example score mods that were primarily used for testing and places them directly in the test file. We should follow up with merging test_flex_decode and test_flash when the velocity slows down a little
- Fixes a bug with indexing on block mask
- Adds some doc strings to helper funcs and fixes some misc typing things
- Forces functions passed to `create_block_mask` to mask_mods and updates tests files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130871
Approved by: https://github.com/joydddd, https://github.com/Chillee
Summary: `test/distributed/_composable/test_replicate_with_compiler.py` exercises inductor. This change introduces a version of MultiProcessTestCase that derives from the inductor TestCase class to make sure we always get a clean cache dir.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129494
Approved by: https://github.com/eellison
This PR resolves several sets of `_scaled_mm` test failures:
- `scale_a` and `scale_b` are now required arguments, so the function `sample_inputs_scaled_mm` must supply them
- `_scaled_mm` does not support `"meta"` device, so it should be skipped in `test_meta.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130897
Approved by: https://github.com/drisspg
Summary:
- Updated SparseAdam to run test_state_dict_deterministic unit test.
- Made gradients sparse while keeping weights dense in the above test.
Test Plan:
- Ran test_optim.py locally.
Fixes#116507
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130645
Approved by: https://github.com/janeyx99
This PR adds support to use expandable segments with private memory pools which should unblock using it with cuda graphs and cuda graph trees. Currently, the allocator silently avoids using expandable segments when allocating in a private pool due to checkpoint saving/restoring not meshing well with how we keep track of unmapped blocks.
The PR itself is pretty short, most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work.
Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial due to the fact that each memory pool functions as a single segment of memory with a contiguous block of memory addresses that can grow and shrink as needed, avoiding fragmentation from allocating multiple non-contiguous segments that may not be merged together.
The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it's split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned back to Cuda when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments is similar to that for normal blocks and does all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to Cuda.
With Cuda Graph Trees and private memory pools, we need the ability to take checkpoints of the current state of the memory allocator after each graph capture as well as reapplying the state before capturing a new graph after replaying a captured graph so that the new cuda graph capture has access to the state of the allocator at the point after replaying a previously captured graph so it can reuse empty blocks and allocate new ones.
As mentioned in a below comment, memory in a private pool is cached until the private pool is destroyed and allocations can only grow from extra graph captures, any freeing of memory would result in invalid memory addresses and would break cuda graphs.
One implementation detail to note for unmapped blocks with expandable segments is that unmapped blocks are kept track in a member variable `unmapped` of a `BlockPool`. `unmapped` is *not* part of the checkpointed state of the caching allocator and isn't restored when reapplying checkpoints since we never free/unmap memory back to cuda and is persisted across graph captures / replays.
Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in the segment to capture the state of every segment, which is then saved and kept for when it is needed to be reapplied. For expandable blocks, the last block in every segment will be an unallocated unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint.
Reapplying the checkpoints works by freeing all allocated blocks and merging them into a single block per segment, then for each segment, we manually split and allocate all blocks from the checkpoint and then free the blocks marked as unallocated in the checkpoint state. For expandable segments, we need to make some modifications to not split unmapped blocks and avoid manually mapping then freeing unmapped blocks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068
Approved by: https://github.com/eqy, https://github.com/eellison
This PR introduces AutoHeuristic, a framework to collect results from autotuning, learn a heuristic as a machine learning model (a regression tree), and then ship the learned heuristic by generating the regression tree to code.
The heuristics have been learned on artificial/random data that has been collected with the `gen_data_pad_mm.py` script. The `gen_pad_mm_a100.sh` scripts can then be used to learn a heuristic and generate it to code.
The best model is decided by doing a grid search over various values for `max_depth` and `min_samples_leaf` and choosing the model with the highest number of correct predicitons on the validation set.
The heuristic can return "unsure" which means that it is not sure which choice is the best choice and as a result autotuning will happen.
On A100 only tensors where each dimension is >= 512 are considered. For smaller tensors the heuristics that I learned returned "unsure" too often.
The results for randomly generated data and huggingface look as follows:
`max_wrong_speedup` is max(`wrong_speedups`) where `wrong_speedups` contains all the speedups one could have achieved for those examples where the heuristic made a wrong choice, i.e. a `max_wrong_speedup` of 1.37 means that the heuristic selected a choice, but the other choice would have been 1.37x faster. `gman_wrong_speedup` is the geomean of `wrong_speedups`.
The heuristic is learned as a regression tree, that returns higher values for better choices. The threshold decides how much better the better choice has to be for it to be returned, i.e. on A100 if the better choice is less than 1.702530x better than the other choice, "unsure" will be returned. This threshold is determined using the validation set.
A100
```
max_depth min_samples_leaf dataset correct wrong unsure total max_wrong_speedup gman_wrong_speedup threshold
15 5.0 10 train 2730 4 3023 5757 1.372220 1.193873 1.702530
16 5.0 10 val 878 0 1042 1920 NaN NaN 1.702530
17 5.0 10 test 925 2 993 1920 1.741708 1.354954 1.702530
18 5.0 10 hf-train 14 0 22 36 NaN NaN 1.702530
19 5.0 10 hf-inf 7 0 1 8 NaN NaN 1.702530
```
The numbers for huggingface only include tensors where each dim is >=512. If all tensors would have been included there would have been the following number of matmuls, where at least one dimension is unaligned:
A100 hf-train: 60
A100 hf-inf: 10
## Results on running huggingface locally
This only includes models where the learned heuristic made at least one decision. For the examples here, it takes around 0.25-0.3 seconds to perform autotuning for the padded and unpadded version, so each decision that the heuristic makes saves around 0.25-0.3 seconds.
#pad_mm_autotuning is the number of times autotuning happened in pad_mm and #heuristic_made_decision is the number of times the heuristic made a decision (i.e. it didn't return "unsure").
I ran huggingface locally, each model 5 times and took the median speedup and compilation_latency.
Results on huggingface training
```
name speedup_heuristic speedup_baseline speedup_diff compilation_latency_heuristic compilation_latency_baseline compilation_latency_diff comp_latency_reduction% #pad_mm_autotuning #heuristic_made_decision
BartForCausalLM 1.19 (+/- 0.00) 1.19 (+/- 0.00) -0.00 40.33 (+/- 1.13) 40.95 (+/- 0.78) -0.62 1.52 3 2
BartForConditionalGeneration 1.53 (+/- 0.06) 1.47 (+/- 0.05) 0.06 81.93 (+/- 5.20) 82.23 (+/- 1.92) -0.30 0.36 3 1
BlenderbotSmallForCausalLM 1.86 (+/- 0.04) 1.86 (+/- 0.00) 0.00 36.76 (+/- 0.49) 37.62 (+/- 1.33) -0.87 2.31 3 2
CamemBert 2.36 (+/- 0.01) 2.35 (+/- 0.01) 0.01 97.60 (+/- 1.91) 98.69 (+/- 1.35) -1.09 1.11 2 1
DistillGPT2 2.57 (+/- 0.01) 2.57 (+/- 0.01) 0.00 57.33 (+/- 0.77) 58.26 (+/- 1.41) -0.93 1.59 3 2
PLBartForCausalLM 2.07 (+/- 0.01) 2.06 (+/- 0.01) 0.01 32.54 (+/- 0.83) 34.65 (+/- 0.71) -2.11 6.10 3 2
PLBartForConditionalGeneration 1.87 (+/- 0.00) 1.88 (+/- 0.00) -0.01 58.45 (+/- 1.24) 58.95 (+/- 1.92) -0.50 0.85 3 1
RobertaForCausalLM 2.39 (+/- 0.01) 2.40 (+/- 0.01) -0.01 97.38 (+/- 1.52) 97.69 (+/- 1.18) -0.31 0.32 2 1
TrOCRForCausalLM 1.70 (+/- 0.00) 1.70 (+/- 0.00) -0.00 44.79 (+/- 1.33) 45.25 (+/- 1.08) -0.46 1.01 3 2
Mean difference in speedup: 0.01
Mean compilation latency saved: -0.80s
Mean compilation latency reduction: 1.68%
```
Results on huggingface inference
```
name speedup_heuristic speedup_baseline speedup_diff compilation_latency_heuristic compilation_latency_baseline compilation_latency_diff comp_latency_reduction% #pad_mm_autotuning #heuristic_made_decision
BartForCausalLM 1.11 (+/- 0.00) 1.11 (+/- 0.00) 0.00 19.02 (+/- 0.28) 19.40 (+/- 0.35) -0.38 1.95 3 2
BartForConditionalGeneration 1.26 (+/- 0.01) 1.23 (+/- 0.03) 0.03 36.84 (+/- 0.40) 36.55 (+/- 0.75) 0.30 -0.81 3 1
BlenderbotSmallForCausalLM 1.87 (+/- 0.02) 1.87 (+/- 0.01) 0.00 17.53 (+/- 0.31) 18.03 (+/- 0.43) -0.49 2.74 3 2
DistillGPT2 2.50 (+/- 0.02) 2.50 (+/- 0.01) 0.00 16.16 (+/- 0.29) 16.40 (+/- 0.18) -0.24 1.46 3 2
PLBartForCausalLM 1.93 (+/- 0.01) 1.94 (+/- 0.01) -0.00 15.30 (+/- 0.22) 16.01 (+/- 0.71) -0.71 4.43 3 2
PLBartForConditionalGeneration 1.98 (+/- 0.01) 1.98 (+/- 0.01) 0.00 25.90 (+/- 0.32) 26.58 (+/- 0.62) -0.67 2.53 3 1
TrOCRForCausalLM 1.61 (+/- 0.00) 1.62 (+/- 0.00) -0.01 21.38 (+/- 0.37) 21.85 (+/- 0.16) -0.47 2.16 3 2
Mean difference in speedup: 0.00
Mean compilation latency saved: -0.38s
Mean compilation latency reduction: 2.07%
```
For now, the heuristic can only be applied to decide whether to pad for mm. One could also learn heuristics for bmm and addmm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128643
Approved by: https://github.com/Chillee, https://github.com/eellison
Fixes#125224
For large ranges, calls to CUDA `randint` use a different `unroll_factor` to generate random ints. This `unroll_factor` was not considered correctly in the calculation of the Philox offsets. Thus, some of the random states were reused, resulting in lower entropy (see #125224).
This also affects multiple other random functions, such as `torch.rand` and `torch.randn`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126066
Approved by: https://github.com/eqy, https://github.com/lezcano
This involved beefing up the Python dispatcher to handle torch_dispatch.
Given a HOP and a torch_dispatch Tensor subclass:
- the HOP will show up in the subclass's `__torch_dispatch__`
- you can also use HOP.py_impl to register a rule for the HOP x
subclass interaction
- (coming soon) we'll offer a way to open register HOP x subclass
interaction without needing to touch the subclass's
`__torch_dispatch__` or the HOP's .py_impl.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130606
Approved by: https://github.com/ydwu4
When run some test cases on the privateuse1 device, the device_suffix in a test_name is 'privateuse1' sometimes.
For examples, a test_name is 'test_Dropout1d_npu', while it would be 'test_Dropout1d_privateuse1' sometimes.
When setUpClass() didn't set it, the device_suffix would be "privateuse1".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130091
Approved by: https://github.com/zou3519
Summary: Previously, remove_effect_tokens pass didn't pass kwargs to the internal nodes. This PR fix it and add a test for it.
Test Plan: buck2 run caffe2/test:test_export -- -r test_remove_effect_token_kwargs
Reviewed By: angelayi
Differential Revision: D59603147
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130491
Approved by: https://github.com/angelayi
This PR is to update the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8`, both for CPU, CUDA and MPS, which can help decouple int4 model checkpoint with different ISAs and different platforms in `gpt-fast`. The advantage is int4 model checkpoint can be shared in different test machines, without re-generating in one certain platform. Meanwhile, the size of input `weight` can be reduced to `1 / 8`.
Before this PR, packed weight stored in CUDA specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. CPU packed weight viewed as the SAME shape but stored in different layout: `[n/64][k][32]`, dtype uint8. Weight is strongly coupled with platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar). And users cannot use a generated weight in another different ISA or platform, because when loading weight into devices, the compute format is different.

Now, we use common serialized layout (`[n][k/2] uint8`) for different devices or ISAs as input `weight` of `_convert_weight_to_int4pack`, and each back chooses how to interpret as compute layout.

### Performance
Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores)
There is no obvious regression of this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima
This PR adds support to use expandable segments with private memory pools which should unblock using it with cuda graphs and cuda graph trees. Currently, the allocator silently avoids using expandable segments when allocating in a private pool due to checkpoint saving/restoring not meshing well with how we keep track of unmapped blocks.
The PR itself is pretty short, most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work.
Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial due to the fact that each memory pool functions as a single segment of memory with a contiguous block of memory addresses that can grow and shrink as needed, avoiding fragmentation from allocating multiple non-contiguous segments that may not be merged together.
The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it's split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned back to Cuda when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments is similar to that for normal blocks and does all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to Cuda.
With Cuda Graph Trees and private memory pools, we need the ability to take checkpoints of the current state of the memory allocator after each graph capture as well as reapplying the state before capturing a new graph after replaying a captured graph so that the new cuda graph capture has access to the state of the allocator at the point after replaying a previously captured graph so it can reuse empty blocks and allocate new ones.
As mentioned in a below comment, memory in a private pool is cached until the private pool is destroyed and allocations can only grow from extra graph captures, any freeing of memory would result in invalid memory addresses and would break cuda graphs.
One implementation detail to note for unmapped blocks with expandable segments is that unmapped blocks are kept track in a member variable `unmapped` of a `BlockPool`. `unmapped` is *not* part of the checkpointed state of the caching allocator and isn't restored when reapplying checkpoints since we never free/unmap memory back to cuda and is persisted across graph captures / replays.
Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in the segment to capture the state of every segment, which is then saved and kept for when it is needed to be reapplied. For expandable blocks, the last block in every segment will be an unallocated unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint.
Reapplying the checkpoints works by freeing all allocated blocks and merging them into a single block per segment, then for each segment, we manually split and allocate all blocks from the checkpoint and then free the blocks marked as unallocated in the checkpoint state. For expandable segments, we need to make some modifications to not split unmapped blocks and avoid manually mapping then freeing unmapped blocks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068
Approved by: https://github.com/zdevito, https://github.com/eqy