Pull Request resolved: https://github.com/pytorch/pytorch/pull/77696
https://github.com/pytorch/pytorch/pull/63619 added a RECORD_FUNCTION guard to make calls to `Engine::evaluate_function` visible regardless of the underlying op. While useful, this creates an event that looks like a forward call, which somewhat complicates stitching forward and backward ops. I don't want to add complexity (and therefore work) on the hot path; instead it's fairly straightforward to stitch things back together in post-processing. This PR simply propagates the sequence number and forward thread id info up to the `evaluate_function` event.
Differential Revision: [D36302562](https://our.internmc.facebook.com/intern/diff/D36302562/)
Approved by: https://github.com/aaronenyeshi
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76253
We're observing a large QPS regression on the original PR https://github.com/pytorch/pytorch/pull/72302. For the training job we had, it regressed from 720k QPS to 450k QPS (see the test plan in FB internal). We suspect this is because the API was changed from `_record_function_enter` to `_record_function_enter_new`, and we're running experiments to confirm that. Will add more details when the runs in the test plan have finished. For now, it's better to revert the diff to unblock internal use cases, and we can think about how to reland it later.
Original commit changeset: dc9939f1fa6d
Original Phabricator Diff: D35257354
Test Plan:
on trunk: f338665947
with this diff: f338502850
Reviewed By: malfet, robieta
Differential Revision: D35853300
fbshipit-source-id: dd38042aeacb848f66756491a4c849c7c652a0e1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71708
In Python 3.2, a number of unittest assert methods were deprecated.
In Python 3.11, these deprecated asserts are removed completely. The files in this change still use the deprecated asserts.
Switch over to the syntax supported from 3.2 onwards.
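For illustration, a hedged example of the kind of change involved (the exact call sites are internal and not shown here):
```python
import unittest

class ExampleTest(unittest.TestCase):
    def test_values(self):
        x = 2 + 2
        # Deprecated since Python 3.2 and later removed:
        #   self.assertEquals(x, 4)
        #   self.assertRegexpMatches("abc123", r"\d+")
        # Supported spellings:
        self.assertEqual(x, 4)
        self.assertRegex("abc123", r"\d+")

if __name__ == "__main__":
    unittest.main()
```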
Test Plan: Tested on the internal test suite runner.
Reviewed By: ajtulloch
Differential Revision: D33503694
fbshipit-source-id: a150f296033260acf8365d77b837ce0679f57361
(cherry picked from commit abf60ed97409265222915d8265aaabedd625fd93)
Summary:
Description of the new behavior is in PythonFallbackKernel.cpp.
The updated test makes sure that we only call alias on the first Tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73925
Reviewed By: samdow
Differential Revision: D34862940
Pulled By: albanD
fbshipit-source-id: 4d020e41c8bb8b10262dcafd524e84a5ad4d7af0
(cherry picked from commit 0aa6b56dbd3dcee830453fb02cd6c83ab7a8be06)
Summary:
Minimal example that deadlocks before but not after:
```python
import torch
from torch.autograd import Function

class Foo(Function):
    @staticmethod
    def forward(ctx, x):
        return x.clone()

    @staticmethod
    def backward(ctx, gO):
        return gO.clone()

def get_out():
    inp = torch.rand(2, requires_grad=True)

    # The python function is first so that it runs
    # last in the backward pass
    right = Foo.apply(inp)

    # An op that creates new memory
    left1 = inp.clone()
    # An op that saves its input
    left2 = left1 ** 2

    # Inplace modify so that the backward for
    # left2 always raises an error
    left1 += 1

    # An op that takes both sides as input.
    # After running, both sides' last op will be in
    # the ready queue, and the op for left will run
    # first as it was executed last during the forward
    out = left2 + right

    return out

# Nothing should be a global variable here as, from what
# I can see, Python leaks all the global objects
get_out().sum().backward()
```
Since this requires the python interpreter to die, it is hard to test in CI.
Let me know if you have an idea how to do it though.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73961
Reviewed By: malfet
Differential Revision: D34752747
Pulled By: albanD
fbshipit-source-id: 1a537b1f733e161e8d3ff053cd432b37b34d432a
(cherry picked from commit 17943e4c04c782d81deab439e010195f04e75bbd)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72301
First step in resolving #35026.
This adds `PythonRecordFunction`, which is a `torch::CustomClassHolder` for `at::RecordFunction`, to keep the ATen code free of torch includes. It also adds a new, currently unused internal API function, `_record_function_enter_new`, which returns the torchbind object.
Once the FC period has expired, `torch.profiler.record_function` will be updated to use this new internal API. Then, once the BC period has expired, the cpp_custom_type_hack-based API can be removed.
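For context, a small example of the public profiling API whose internals this PR is preparing to change; user-facing usage is unaffected and this snippet is not part of the diff:
```python
import torch
from torch.profiler import profile, record_function

# The public record_function API stays the same; only the internal entry point
# (_record_function_enter vs. _record_function_enter_new) differs.
with profile() as prof:
    with record_function("my_matmul"):
        a = torch.randn(128, 128)
        b = torch.randn(128, 128)
        c = a @ b

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```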
Test Plan: Imported from OSS
Reviewed By: dagitses
Differential Revision: D34586311
Pulled By: robieta
fbshipit-source-id: d3eb9ffad7b348548a2b22c75203a92d1cb5115b
(cherry picked from commit 92d2ca808e5fbd20c9d6645dcabc3f059f9ef2d3)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72688
Refactor how we decide what to run on the CPU queue.
The Lazy Tensor device was moved there because it is always present as a device guard, which would make the number of devices 1 all the time (forcing the creation of a worker thread).
FYI wconstab, you most likely don't care about this unless you ever use multiple Lazy devices.
This should slightly improve perf when running backward with Lazy Tensors, as the work will be done in the main thread rather than in a worker thread.
Test Plan: Imported from OSS
Reviewed By: soulitzer
Differential Revision: D34180245
Pulled By: albanD
fbshipit-source-id: 88c5d5bdd631ad01bf271d720d1eab69aba84fc0
(cherry picked from commit da7e9b902f)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72008
Fixes #71119
Technically BC-breaking: when an input does not require grad, it was previously returned as-is rather than as a view, since a view wasn't needed. Now we will also return a view in that case (whether or not forward AD runs).
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D33859553
Pulled By: soulitzer
fbshipit-source-id: 81b3fa371f4c0904630878500aa190492c562367
(cherry picked from commit ee74bc8234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71707
Why?
- detach should behave like jax.stop_gradient in functorch. Because it does not detach all the way through, functorch (as well as a Tensor subclass wrapping a Tensor subclass) won't see it after the first layer/subclass handles it.
How?
- This PR changes detach to dispatch all the way through to the backend (see the sketch after this list).
- This PR also modifies native::detach to call shallow_copy_and_detach instead of native::alias. This is because today the semantics of detach and alias are different: they differ only by allow_tensor_metadata_change. In the future, we may choose to deprecate this flag.
- NB: Before and after this PR, detach() shows up twice in torch_dispatch: https://github.com/pytorch/pytorch/issues/71725. This is not a regression, so I didn't want to fix it in this PR because it is weird to fix.
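A rough sketch of what "dispatching all the way through" means from the Python side. This uses `TorchDispatchMode` (a later public API, assumed here only as a stand-in for a Tensor subclass or functorch layer):
```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class LogOps(TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        # Any op that actually reaches the backend/Python dispatch key is visible here.
        print("saw:", func)
        return func(*args, **(kwargs or {}))

x = torch.randn(3)
with LogOps():
    y = x.detach()  # after this change, detach is observable at this level
```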
Test Plan: - added new tests; run existing tests
Reviewed By: albanD
Differential Revision: D33752860
Pulled By: zou3519
fbshipit-source-id: 40cc2dc8232e75a02586a4ba5b0ef5f16cb76617
(cherry picked from commit f88aae426e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69827
In general, the current pattern allows for implementing optimisations
for all the backends in a common place (see for example the optimisation
for empty matrices).
After this PR, `torch.svd` is implemented in terms of `linalg.svd` and
`linalg.svdvals`, as expected. This makes it differentiable in the case
when `compute_uv=False`, although this is not particularly important, as
`torch.svd` will eventually be deprecated.
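For reference, a small example of the relationship described above from the user's perspective (this is not the internal implementation):
```python
import torch

A = torch.randn(5, 3, dtype=torch.float64)

# linalg.svd / linalg.svdvals are the primitives torch.svd is now built on.
U, S, Vh = torch.linalg.svd(A, full_matrices=False)  # U: 5x3, S: 3, Vh: 3x3
S_only = torch.linalg.svdvals(A)                      # what compute_uv=False needs

torch.testing.assert_close(U @ torch.diag(S) @ Vh, A)
torch.testing.assert_close(S, S_only)
```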
This PR also instantiates smaller `U` / `V` when calling cusolver_gesvdj
in the cases when `full_matrices=False` or `compute_uv=False`.
The memory for the auxiliary `U` and `V` needed by some cuSOLVER routines in the cases above is allocated via raw allocators rather than through fully fledged tensors, as it's just a blob of memory the algorithm requests.
As the code is better structured now, it was easier to see that `U` and
`Vh` needn't be allocated when calling `svd_cusolver_gesvd`.
Now `linalg.svdvals` works as expected wrt the `out=` parameter.
Note that in the test `test_svd_memory_allocation` we were
passing a tensor of the wrong size and dtype and the test seemed to
pass...
This PR also changes the backward formula to avoid saving the input
matrix, as it's not necessary. In a follow up PR, I will clean the
backward formula and make it more numerically stable and efficient.
This PR also does a number of memory optimisations here and there, and fixes
the call to cusolver_gesvd, which was incorrect for m <= n. To test
this path, I compiled the code with a flag to unconditionally execute
the `if (!gesvdj_convergence_check.empty())` branch, and all the tests
passed.
I also took this chance to simplify the tests for these functions in
`test_linalg.py`, as we had lots of tests that were testing some
functionality that is already currently tested in the corresponding
OpInfos. I used xwang233's feature to test both MAGMA and CUDA
backends. This is particularly good for SVD, as cuSOLVER is always
chosen over MAGMA when available, so testing MAGMA otherwise would be
tricky.
cc jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233 Lezcano
Test Plan: Imported from OSS
Reviewed By: mikaylagawarecki
Differential Revision: D33751983
Pulled By: mruberry
fbshipit-source-id: 11d48d977946345583d33d14fb11a170a7d14fd2
(cherry picked from commit a1860bd567)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71569
Not sure if this is the right API
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D33695395
Pulled By: soulitzer
fbshipit-source-id: 652b5758f15d901f98ff0da94e977030c7f3415b
(cherry picked from commit 9421a6846a)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71531
Based on the comment above the original internal assert, this is the desired check. There were two options here:
1. Don't error, and automatically make jvp return a view for that tensor output (this is easier than I originally thought: https://github.com/pytorch/pytorch/pull/71531#discussion_r789211877)
2. Error (which is what this PR currently does)
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D33695399
Pulled By: soulitzer
fbshipit-source-id: dba49890a55ad1dd59ed5c41faa96bf7cfc9e562
(cherry picked from commit fdb0f266f5)
Summary:
When default hooks are set, they are pushed onto a stack.
When nesting context managers, only the inner-most hooks are applied.
There is special care needed to update the TLS code. See also https://github.com/pytorch/pytorch/issues/70940 (i.e. do we need to be storing the enabled flag as well?)
Fixes https://github.com/pytorch/pytorch/issues/70134
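A minimal sketch of the nesting behaviour, assuming the `torch.autograd.graph.saved_tensors_hooks` context manager is the default-hooks API in question:
```python
import torch

def pack_outer(t):
    print("outer pack")
    return t

def unpack_outer(t):
    print("outer unpack")
    return t

def pack_inner(t):
    print("inner pack")
    return t

def unpack_inner(t):
    print("inner unpack")
    return t

x = torch.randn(3, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack_outer, unpack_outer):
    with torch.autograd.graph.saved_tensors_hooks(pack_inner, unpack_inner):
        # Tensors saved for backward here are packed with the inner-most hooks only.
        y = x.pow(2)
# During backward, the same tensors are unpacked with the inner hooks;
# the outer hooks never fire for this save.
y.sum().backward()
```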
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70932
Reviewed By: mruberry
Differential Revision: D33530370
Pulled By: albanD
fbshipit-source-id: 3197d585d77563f36c175d3949115a0776b309f4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68631
This PR:
- Adds the check that the storage numel of the base and tangent tensors is the same. This is to support the case when as_strided reveals elements that aren't indexable by the input tensor (see the illustrative snippet after this list).
- Skips the check when batched tensors are involved, because using as_strided to reveal elements that are not indexable by the input tensor is already not allowed under vmap.
- Adds tests for the above two cases, as well as an edge case regarding the conj bit (what about the neg bit?)
For functorch:
- we need to copy the batching rule implemented here
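A small illustrative snippet (assumed shapes, not taken from the PR's tests) of why storage numel matters here:
```python
import torch

# A view may index only part of its storage...
base = torch.arange(10.)[:2]            # 2 visible elements, storage numel is 10
# ...yet as_strided can legally reveal elements the view itself cannot index:
widened = base.as_strided((5,), (1,))   # reads storage elements 0..4
print(base.storage().size(), widened)   # 10  tensor([0., 1., 2., 3., 4.])
# For forward AD, the tangent therefore needs the same storage numel as the
# base; otherwise as_strided on the dual tensor has nowhere to read from.
```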
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D32899678
Pulled By: soulitzer
fbshipit-source-id: 54db9550dd2c93bc66b8fb2d36ce40799ebba794
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69592
Currently, the forward AD function for `copy_` (in `VariableTypeManual`) does not handle the broadcasting case. ~EDIT: that is a design decision, not a bug. In this PR, we make that clear as a comment.~
Note: `broadcast_to` does not have a batching rule in core, so the ops that rely on `copy_` to broadcast will still fail batched forward grad computation.
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D33020603
Pulled By: soulitzer
fbshipit-source-id: 09cb702bffc74061964a9c05cfef5121f8164814
Summary:
This fixes the case when `torch.inference_mode` is called with `mode=False` (disabled). When used as a decorator, it ignored the argument and enabled inference mode anyway.
`_DecoratorContextManager` is changed so that the new instance created for each decorated call is a copy of the original instead of a new instance with default parameters.
I also added more tests to cover this case.
Current behaviour:
```python
>>> import torch
>>> x = torch.ones(1, 2, 3, requires_grad=True)
>>> @torch.inference_mode(mode=False)
... def func(x):
...     return x * x
...
>>> out = func(x)
>>> out.requires_grad
False
```
New behaviour (fixed):
```python
>>> import torch
>>> x = torch.ones(1, 2, 3, requires_grad=True)
>>> @torch.inference_mode(mode=False)
... def func(x):
...     return x * x
...
>>> out = func(x)
>>> out.requires_grad
True
```
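A minimal sketch of the mechanism behind the fix (an assumed simplification, not the actual PyTorch source): when used as a decorator, the context manager clones itself per call instead of re-instantiating with default arguments, so flags like `mode=False` are preserved.
```python
import functools

class _DecoratorContextManager:
    """Simplified stand-in for the real helper in torch.autograd.grad_mode."""

    def __call__(self, func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # The fix: clone *this* instance (keeping its arguments) instead of
            # constructing a fresh one with default parameters.
            with self.clone():
                return func(*args, **kwargs)
        return wrapper

    def clone(self):
        return self.__class__()


class toy_inference_mode(_DecoratorContextManager):
    """Toy version that just records whether it was entered with mode=True."""

    def __init__(self, mode=True):
        self.mode = mode

    def clone(self):
        return self.__class__(self.mode)  # preserve mode=False across calls

    def __enter__(self):
        print(f"entering with mode={self.mode}")

    def __exit__(self, *exc):
        return False


@toy_inference_mode(mode=False)
def func(x):
    return x * x

func(3)  # prints "entering with mode=False" (previously mode would reset to the default)
```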
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68617
Reviewed By: mrshenli
Differential Revision: D32958434
Pulled By: albanD
fbshipit-source-id: 133c69970ef8bffb9fc9ab5142dedcffc4c32945