Xiao Wang
21fd2bc32e
Allow setting TORCH_LINALG_PREFER_CUSOLVER=1 to prefer cusolver as linear algebra library globally ( #106226 )
...
Setting `TORCH_LINALG_PREFER_CUSOLVER=1` lets users prefer cuSOLVER as the linear algebra backend globally, e.g. in container use cases. The switch is not enabled by default, so it does not change any existing default behavior.
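A minimal usage sketch, assuming a CUDA build of PyTorch; the variable is read from the environment, so in practice it would be set in the container/launch environment before the process starts:
```python
# Minimal sketch: opt in to cuSOLVER globally via the environment variable.
# Setting it before importing torch mirrors setting it in the container env.
import os
os.environ["TORCH_LINALG_PREFER_CUSOLVER"] = "1"

import torch

a = torch.randn(512, 512, device="cuda")
q, r = torch.linalg.qr(a)  # linalg ops now prefer the cuSOLVER backend where applicable
```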
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106226
Approved by: https://github.com/lezcano
2023-07-30 09:38:46 +00:00
Rohan Varma
c11412b4a8
[DDP] Support optim in backward after DDP init ( #105995 )
...
This allows in-backward optimizers to be configured after DDP init, in addition to before DDP init, as was previously supported.
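A hedged sketch of the flow this enables; `_apply_optimizer_in_backward` is a private API, so its exact import path and signature may differ, and an initialized process group is assumed:
```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.optim import _apply_optimizer_in_backward  # private API

# Assumes torch.distributed.init_process_group(...) has already been called.
model = nn.Linear(8, 8).cuda()
ddp_model = DDP(model)  # DDP init happens first

# Configure the in-backward optimizer *after* DDP init (previously this had to
# happen before wrapping with DDP).
_apply_optimizer_in_backward(
    torch.optim.SGD, ddp_model.parameters(), optimizer_kwargs={"lr": 0.01}
)

ddp_model(torch.randn(4, 8, device="cuda")).sum().backward()
# Parameters are updated as their gradients become ready during backward.
```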
Differential Revision: [D47783347](https://our.internmc.facebook.com/intern/diff/D47783347/ )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105995
Approved by: https://github.com/fegin
2023-07-29 01:36:25 +00:00
Michael Lazos
bd669d52d2
Print env var name instead of flag name for commandline repros ( #106223 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106223
Approved by: https://github.com/seemethere , https://github.com/malfet
2023-07-28 23:22:27 +00:00
Richard Zou
dad65d09f2
Update custom op API ( #105947 )
...
As described in
https://docs.google.com/document/d/1aGWtgxV3HppuxQAdddyPrs74_aEntpkYt9MalnCKnhk/edit
This PR changes the CustomOp API to be private and adds new public
wrappers around it so that the user does not need to know about the
"CustomOp" object. We've effectively changed the "CustomOp" object to be
some metadata about the operator that the user does not directly
interact with.
The "updated custom op API" is in torch._custom_ops. Pending good customer
feedback, we will promote this module to torch.custom_ops.
NB: I cannot move around the older torch._custom_op APIs yet because
people are already using them.
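A hedged sketch of the wrapper-based usage described above; decorator names follow the `torch._custom_ops` module, exact signatures may differ, and `mylib` is a hypothetical namespace:
```python
import numpy as np
import torch
from torch import Tensor
from torch._custom_ops import custom_op, impl, impl_abstract

@custom_op("mylib::numpy_sin")
def numpy_sin(x: Tensor) -> Tensor:
    ...  # schema only; no CustomOp object is exposed to the user

@impl("mylib::numpy_sin")
def numpy_sin_impl(x):
    return torch.from_numpy(np.sin(x.cpu().numpy())).to(x.device)

@impl_abstract("mylib::numpy_sin")
def numpy_sin_meta(x):
    return torch.empty_like(x)

y = torch.ops.mylib.numpy_sin(torch.randn(3))
```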
Test Plan:
- I changed all of our tests to use the new `torch._custom_ops` module
instead of the old CustomOp API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105947
Approved by: https://github.com/soulitzer
2023-07-28 13:30:58 +00:00
Nikita Karetnikov
b812e35a75
[pt2] add meta for argsort.stable, use sort samples in OpInfo ( #106025 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106025
Approved by: https://github.com/ezyang , https://github.com/zou3519
2023-07-27 03:49:17 +00:00
Masaki Kozuki
e773f28ee3
Reland "Add forward mode AD to out-place foreach functions ( #102409 ) ( #106043 )
...
This relands forward-mode AD for out-of-place foreach functions, finally.
rel:
- #102409
- #105504
- #58833
- #100695
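A minimal sketch of what this enables at the Python level, assuming the usual `torch.autograd.forward_ad` dual-number API:
```python
import torch
import torch.autograd.forward_ad as fwAD

primals = [torch.randn(3) for _ in range(4)]
tangents = [torch.randn(3) for _ in range(4)]

with fwAD.dual_level():
    duals = [fwAD.make_dual(p, t) for p, t in zip(primals, tangents)]
    outs = torch._foreach_sinh(duals)  # out-of-place foreach op
    jvps = [fwAD.unpack_dual(o).tangent for o in outs]

# JVP of sinh(x) is cosh(x) * tangent, matching the generated code below.
expected = [torch.cosh(p) * t for p, t in zip(primals, tangents)]
```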
---
# Generated Foreach
```c++
::std::vector<at::Tensor> _foreach_sinh(c10::DispatchKeySet ks, at::TensorList self) {
auto self_ = unpack(self, "self", 0);
[[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );
std::vector<bool> _any_has_forward_grad_result(self.size());
for (const auto& i : c10::irange(self.size())) {
_any_has_forward_grad_result[i] = isFwGradDefined(self[i]);
}
std::shared_ptr<ForeachSinhBackward0> grad_fn;
if (_any_requires_grad) {
grad_fn = std::shared_ptr<ForeachSinhBackward0>(new ForeachSinhBackward0(), deleteNode);
grad_fn->set_next_edges(collect_next_edges( self ));
grad_fn->self_ = make_saved_variable_list(self);
grad_fn->self_size_ = self.size();
}
#ifndef NDEBUG
std::vector<c10::optional<Storage>> self__storage_saved(self_.size());
for (const Tensor& tensor : self_)
self__storage_saved.push_back(
tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size());
for (size_t i=0; i<self_.size(); i++)
if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr();
#endif
auto _tmp = ([&]() {
at::AutoDispatchBelowADInplaceOrView guard;
return at::redispatch::_foreach_sinh(ks & c10::after_autograd_keyset, self_);
})();
auto result = std::move(_tmp);
#ifndef NDEBUG
for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_))
TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage()));
}
for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_))
TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr());
}
#endif
if (grad_fn) {
set_history(flatten_tensor_args( result ), grad_fn);
}
std::vector<c10::optional<at::Tensor>> result_new_fw_grad_opts(self.size(), c10::nullopt);
for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) {
if (_any_has_forward_grad_result[i]) {
auto self_t_raw = toNonOptFwGrad(self[i]);
auto self_tensor = toNonOptTensor(self[i]);
auto self_t = (self_t_raw.defined() || !self_tensor.defined())
? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options());
auto self_p = toNonOptPrimal(self[i]);
result_new_fw_grad_opts[i] = (self_t.conj() * self_p.cosh().conj()).conj();
}
}
for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) {
auto& result_new_fw_grad_opt = result_new_fw_grad_opts[i];
if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result[i].defined()) {
// The hardcoded 0 here will need to be updated once we support multiple levels.
result[i]._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false);
}
}
return result;
}
::std::vector<at::Tensor> _foreach_norm_Scalar(c10::DispatchKeySet ks, at::TensorList self, const at::Scalar & ord) {
auto self_ = unpack(self, "self", 0);
[[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );
std::vector<bool> _any_has_forward_grad_result(self.size());
for (const auto& i : c10::irange(self.size())) {
_any_has_forward_grad_result[i] = isFwGradDefined(self[i]);
}
std::shared_ptr<ForeachNormBackward0> grad_fn;
if (_any_requires_grad) {
grad_fn = std::shared_ptr<ForeachNormBackward0>(new ForeachNormBackward0(), deleteNode);
grad_fn->set_next_edges(collect_next_edges( self ));
grad_fn->ord = ord;
grad_fn->self_ = make_saved_variable_list(self);
grad_fn->self_size_ = self.size();
}
#ifndef NDEBUG
std::vector<c10::optional<Storage>> self__storage_saved(self_.size());
for (const Tensor& tensor : self_)
self__storage_saved.push_back(
tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size());
for (size_t i=0; i<self_.size(); i++)
if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr();
#endif
auto _tmp = ([&]() {
at::AutoDispatchBelowADInplaceOrView guard;
return at::redispatch::_foreach_norm(ks & c10::after_autograd_keyset, self_, ord);
})();
auto result = std::move(_tmp);
#ifndef NDEBUG
for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_))
TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage()));
}
for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_))
TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr());
}
#endif
if (grad_fn) {
set_history(flatten_tensor_args( result ), grad_fn);
}
std::vector<c10::optional<at::Tensor>> result_new_fw_grad_opts(self.size(), c10::nullopt);
for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) {
if (_any_has_forward_grad_result[i]) {
auto self_t_raw = toNonOptFwGrad(self[i]);
auto self_tensor = toNonOptTensor(self[i]);
auto self_t = (self_t_raw.defined() || !self_tensor.defined())
? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options());
auto self_p = toNonOptPrimal(self[i]);
result_new_fw_grad_opts[i] = norm_jvp(self_p, self_t, ord, result[i]);
}
}
for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) {
auto& result_new_fw_grad_opt = result_new_fw_grad_opts[i];
if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result[i].defined()) {
// The hardcoded 0 here will need to be updated once we support multiple levels.
result[i]._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false);
}
}
if (grad_fn) {
grad_fn->result = result;
}
return result;
}
```
# Reference
```c++
at::Tensor sinh(c10::DispatchKeySet ks, const at::Tensor & self) {
auto& self_ = unpack(self, "self", 0);
[[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );
[[maybe_unused]] auto _any_has_forward_grad_result = (isFwGradDefined(self));
std::shared_ptr<SinhBackward0> grad_fn;
if (_any_requires_grad) {
grad_fn = std::shared_ptr<SinhBackward0>(new SinhBackward0(), deleteNode);
grad_fn->set_next_edges(collect_next_edges( self ));
grad_fn->self_ = SavedVariable(self, false);
}
#ifndef NDEBUG
c10::optional<Storage> self__storage_saved =
self_.has_storage() ? c10::optional<Storage>(self_.storage()) : c10::nullopt;
c10::intrusive_ptr<TensorImpl> self__impl_saved;
if (self_.defined()) self__impl_saved = self_.getIntrusivePtr();
#endif
auto _tmp = ([&]() {
at::AutoDispatchBelowADInplaceOrView guard;
return at::redispatch::sinh(ks & c10::after_autograd_keyset, self_);
})();
auto result = std::move(_tmp);
#ifndef NDEBUG
if (self__storage_saved.has_value() &&
!at::impl::dispatch_mode_enabled() &&
!at::impl::tensor_has_dispatch(self_))
TORCH_INTERNAL_ASSERT(self__storage_saved.value().is_alias_of(self_.storage()));
if (self__impl_saved && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(self_))
TORCH_INTERNAL_ASSERT(self__impl_saved == self_.getIntrusivePtr());
if (result.has_storage() && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result)) {
TORCH_INTERNAL_ASSERT(result.storage().use_count() == 1, "function: sinh");
}
if (!at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result))
TORCH_INTERNAL_ASSERT(result.use_count() <= 1, "function: sinh");
#endif
if (grad_fn) {
set_history(flatten_tensor_args( result ), grad_fn);
}
c10::optional<at::Tensor> result_new_fw_grad_opt = c10::nullopt;
if (_any_has_forward_grad_result && (result.defined())) {
auto self_t_raw = toNonOptFwGrad(self);
auto self_tensor = toNonOptTensor(self);
auto self_t = (self_t_raw.defined() || !self_tensor.defined())
? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options());
auto self_p = toNonOptPrimal(self);
result_new_fw_grad_opt = (self_t.conj() * self_p.cosh().conj()).conj();
}
if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result.defined()) {
// The hardcoded 0 here will need to be updated once we support multiple levels.
result._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false);
}
return result;
}
at::Tensor norm_Scalar(c10::DispatchKeySet ks, const at::Tensor & self, const at::Scalar & p) {
auto& self_ = unpack(self, "self", 0);
[[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );
[[maybe_unused]] auto _any_has_forward_grad_result = (isFwGradDefined(self));
std::shared_ptr<NormBackward0> grad_fn;
if (_any_requires_grad) {
grad_fn = std::shared_ptr<NormBackward0>(new NormBackward0(), deleteNode);
grad_fn->set_next_edges(collect_next_edges( self ));
grad_fn->p = p;
grad_fn->self_ = SavedVariable(self, false);
}
#ifndef NDEBUG
c10::optional<Storage> self__storage_saved =
self_.has_storage() ? c10::optional<Storage>(self_.storage()) : c10::nullopt;
c10::intrusive_ptr<TensorImpl> self__impl_saved;
if (self_.defined()) self__impl_saved = self_.getIntrusivePtr();
#endif
auto _tmp = ([&]() {
at::AutoDispatchBelowADInplaceOrView guard;
return at::redispatch::norm(ks & c10::after_autograd_keyset, self_, p);
})();
auto result = std::move(_tmp);
#ifndef NDEBUG
if (self__storage_saved.has_value() &&
!at::impl::dispatch_mode_enabled() &&
!at::impl::tensor_has_dispatch(self_))
TORCH_INTERNAL_ASSERT(self__storage_saved.value().is_alias_of(self_.storage()));
if (self__impl_saved && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(self_))
TORCH_INTERNAL_ASSERT(self__impl_saved == self_.getIntrusivePtr());
if (result.has_storage() && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result)) {
TORCH_INTERNAL_ASSERT(result.storage().use_count() == 1, "function: norm_Scalar");
}
if (!at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result))
TORCH_INTERNAL_ASSERT(result.use_count() <= 1, "function: norm_Scalar");
#endif
if (grad_fn) {
set_history(flatten_tensor_args( result ), grad_fn);
}
throw_error_for_complex_autograd(result, "norm");
c10::optional<at::Tensor> result_new_fw_grad_opt = c10::nullopt;
if (_any_has_forward_grad_result && (result.defined())) {
auto self_t_raw = toNonOptFwGrad(self);
auto self_tensor = toNonOptTensor(self);
auto self_t = (self_t_raw.defined() || !self_tensor.defined())
? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options());
auto self_p = toNonOptPrimal(self);
result_new_fw_grad_opt = norm_jvp(self_p, self_t, p, result);
}
if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result.defined()) {
// The hardcoded 0 here will need to be updated once we support multiple levels.
result._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false);
}
if (grad_fn) {
grad_fn->result_ = SavedVariable(result, true);
}
return result;
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106043
Approved by: https://github.com/soulitzer
2023-07-27 03:13:24 +00:00
Fuzzkatt
b69e5302b5
add skip if sm < 80 check ( #105888 )
...
Fixes an issue where `test_schema_correctness_nn_functional_scaled_dot_product_attention_cuda_bfloat16` from `test_schema_check.py` was run on V100, but CUDA bfloat16 support does not exist for sm < 80. Adds a skip for sm < 80 to the failing test. cc @ptrblck @eqy
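A hedged sketch of the kind of guard this adds; the actual test uses PyTorch's internal test decorators, but the underlying capability check looks roughly like this:
```python
import unittest
import torch

def sm_at_least(major, minor=0):
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (major, minor)

class TestSDPASchema(unittest.TestCase):
    @unittest.skipIf(not sm_at_least(8), "CUDA bfloat16 SDPA requires sm80+ (e.g. not V100)")
    def test_scaled_dot_product_attention_bfloat16(self):
        q = k = v = torch.randn(1, 1, 4, 8, device="cuda", dtype=torch.bfloat16)
        torch.nn.functional.scaled_dot_product_attention(q, k, v)
```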
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105888
Approved by: https://github.com/kit1980
2023-07-26 21:25:24 +00:00
Rohan Varma
4137d6e499
[Composable FSDP] Enable HSDP ( #105206 )
...
The sharding strategy needs to be passed to `_init_process_group_state` to enable HSDP for composable FSDP.
Differential Revision: [D47462394](https://our.internmc.facebook.com/intern/diff/D47462394/ )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105206
Approved by: https://github.com/awgu , https://github.com/fegin
2023-07-26 21:03:55 +00:00
drisspg
c4b7311fc2
Meff Attn Bias ( #104310 )
...
# Summary
### Review Points
- Automatically pad tensors to create aligned masks when seqlen_kv is not a multiple of 16. This causes a memory spike of roughly 2x the attn_mask size, which could in theory be big. It appears, though, that doing this + mem_eff is faster than no_pad + math, so it seems to be worth it.
- Uses expand to view the attn_mask in 4D. This is a little different from how we enforce q, k, v to be viewed in 4D prior to calling. Also does not support the (b * n_heads, seq_len_q, seq_len_kv) case.
- Should enable #96099
### Profiling
I ran a bunch of comparisons between sdpa.MATH and sdpa.MemEffAttention. I added an attn_bias of shape (1, 1, seqlen_q, seqlen_k). For these experiments, seqlen_q == seqlen_k. These were all run on an A100 80GB GPU.
Configs:
```
# Run a bunch of experiments
batch_sizes = [8, 16, 32]
num_heads = [16, 32]
max_seq_lens = [15, 64, 128, 512, 555, 1024]
embed_dims = [32, 64, 128]
dtypes = [torch.float16, torch.bfloat16, torch.float32]
pad_percentages = [None]
backends = [SDPBackend.EFFICIENT_ATTENTION, SDPBackend.MATH]
run_backward = True
attn_mask = True
```
The function calls `sdpa(**inputs).sum().backward()`.
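A hedged sketch of one such measurement, using the `torch.backends.cuda.sdp_kernel` context manager to pin the backend; shapes follow the configs above, with a sequence length that is not a multiple of 16 so the mask-padding path is exercised:
```python
import torch
import torch.nn.functional as F
from torch.backends.cuda import sdp_kernel

B, H, L, D = 8, 32, 555, 64  # seq_len 555 is not a multiple of 16
q = torch.randn(B, H, L, D, device="cuda", dtype=torch.float16, requires_grad=True)
k = torch.randn_like(q, requires_grad=True)
v = torch.randn_like(q, requires_grad=True)
attn_bias = torch.randn(1, 1, L, L, device="cuda", dtype=torch.float16)

with sdp_kernel(enable_flash=False, enable_math=False, enable_mem_efficient=True):
    out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_bias)
out.sum().backward()
```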
I calculated the geomean speedup of the efficient attention path over the math path for all these configs:
`Geomean Speedup: 1.977`
An example comparison with batch_size = 8, num_heads = 32, embed_dim = 64, and dtype = torch.float16:
[Figure: sdpa_compare_fused.png]
This was done using the current state of the branch, where we force alignment of the mask when the last dim is not divisible by 16, which shows up in the seq_len = 15 and 555 cases.
The full data can be found here:
[attn_mask_sweep.csv](https://github.com/pytorch/pytorch/files/11962399/attn_mask_sweep.csv )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104310
Approved by: https://github.com/cpuhrsch
2023-07-26 15:51:59 +00:00
Masaki Kozuki
72f2c87a5a
[foreach] Set SavedVariable.is_output to true for grad_fn->result_ ( #105504 )
...
fixes #105502
The scope of this pull request is out-of-place foreach functions that depend on their output tensorlist for backward, such as `_foreach_exp`. An example of the generated code with this update is as follows:
```c++
variable_list ForeachExpBackward0::apply(variable_list&& grads) {
std::lock_guard<std::mutex> lock(mutex_);
TORCH_CHECK(!result_released_, ERR_BACKWARD_TWICE);
IndexRangeGenerator gen;
auto self_ix = gen.range(self_size_);
variable_list grad_inputs(gen.size());
auto result = unpack_list(result_, shared_from_this());
if (task_should_compute_output({ self_ix })) {
std::vector<Tensor> grad_result;
grad_result.reserve(grads.size());
for (const auto & i : c10::irange(grads.size())) {
if (grads[i].defined()) {
grad_result.emplace_back(grads[i] * result[i].conj());
} else {
grad_result.emplace_back(Tensor());
}
}
copy_range(grad_inputs, self_ix, grad_result);
}
return grad_inputs;
}
::std::vector<at::Tensor> _foreach_exp(c10::DispatchKeySet ks, at::TensorList self) {
auto self_ = unpack(self, "self", 0);
[[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );
std::shared_ptr<ForeachExpBackward0> grad_fn;
if (_any_requires_grad) {
grad_fn = std::shared_ptr<ForeachExpBackward0>(new ForeachExpBackward0(), deleteNode);
grad_fn->set_next_edges(collect_next_edges( self ));
grad_fn->self_size_ = self.size();
}
#ifndef NDEBUG
std::vector<c10::optional<Storage>> self__storage_saved(self_.size());
for (const Tensor& tensor : self_)
self__storage_saved.push_back(
tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size());
for (size_t i=0; i<self_.size(); i++)
if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr();
#endif
auto _tmp = ([&]() {
if ((isFwGradDefinedTensorList(self))) {
static c10::OperatorName full_name("aten::_foreach_exp", "");
static c10::optional<c10::OperatorHandle> opt_op = c10::Dispatcher::singleton().findSchema(full_name);
return impl::run_jit_decomposition_with_args_for_jvp<::std::vector<at::Tensor>>("_foreach_exp", *opt_op, ks, self);
} else {
at::AutoDispatchBelowADInplaceOrView guard;
return at::redispatch::_foreach_exp(ks & c10::after_autograd_keyset, self_);
}
})();
auto result = std::move(_tmp);
#ifndef NDEBUG
for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_))
TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage()));
}
for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_))
TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr());
}
#endif
if (grad_fn) {
set_history(flatten_tensor_args( result ), grad_fn);
}
if (grad_fn) {
grad_fn->result_ = make_saved_variable_list(result, true);
}
return result;
}
```
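For illustration, a minimal Python-level sketch of the behavior this code supports: the backward of `_foreach_exp` reads the saved outputs (`result_`), so backprop through the op has to see correctly saved output tensors.
```python
import torch

xs = [torch.randn(3, requires_grad=True) for _ in range(2)]
ys = torch._foreach_exp(xs)  # outputs are saved for backward (grad of exp is exp(x))
torch.autograd.backward(ys, [torch.ones_like(y) for y in ys])
assert torch.allclose(xs[0].grad, ys[0])
```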
A bit of context:
- https://github.com/pytorch/pytorch/pull/105368#issuecomment-1640912479
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105504
Approved by: https://github.com/soulitzer
2023-07-26 14:29:32 +00:00
Mikayla Gawarecki
e18d53e2df
Added ModuleInfo test for meta device ctx init ( #105871 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105871
Approved by: https://github.com/albanD
2023-07-26 01:57:54 +00:00
PyTorch MergeBot
340ec1f460
Revert "Meff Attn Bias ( #104310 )"
...
This reverts commit 5453508115 .
Reverted https://github.com/pytorch/pytorch/pull/104310 on behalf of https://github.com/DanilBaibak due to PR introduced cuda OOM issue ([comment](https://github.com/pytorch/pytorch/pull/104310#issuecomment-1650171538 ))
2023-07-25 16:37:32 +00:00
drisspg
5453508115
Meff Attn Bias ( #104310 )
...
# Summary
### Review Points
- Automatically pad tensors to create aligned masks when seqlen_kv is not a multiple of 16. This causes a memory spike of roughly 2x the attn_mask size, which could in theory be big. It appears, though, that doing this + mem_eff is faster than no_pad + math, so it seems to be worth it.
- Uses expand to view the attn_mask in 4D. This is a little different from how we enforce q, k, v to be viewed in 4D prior to calling. Also does not support the (b * n_heads, seq_len_q, seq_len_kv) case.
- Should enable #96099
### Profiling
I ran a bunch of comparisons between sdpa.MATH and sdpa.MemEffAttention. I added an attn_bias of shape (1, 1, seqlen_q, seqlen_k). For these experiments, seqlen_q == seqlen_k. These were all run on an A100 80GB GPU.
Configs:
```
# Run a bunch of experiments
batch_sizes = [8, 16, 32]
num_heads = [16, 32]
max_seq_lens = [15, 64, 128, 512, 555, 1024]
embed_dims = [32, 64, 128]
dtypes = [torch.float16, torch.bfloat16, torch.float32]
pad_percentages = [None]
backends = [SDPBackend.EFFICIENT_ATTENTION, SDPBackend.MATH]
run_backward = True
attn_mask = True
```
The function calls `sdpa(**inputs).sum().backward()`.
I calculated the geomean speedup of the efficient attention path over the math path for all these configs:
`Geomean Speedup: 1.977`
An example comparison with batch_size = 8, num_heads = 32, embed_dim = 64, and dtype = torch.float16:
[Figure: sdpa_compare_fused.png]
This was done using the current state of the branch, where we force alignment of the mask when the last dim is not divisible by 16, which shows up in the seq_len = 15 and 555 cases.
The full data can be found here:
[attn_mask_sweep.csv](https://github.com/pytorch/pytorch/files/11962399/attn_mask_sweep.csv )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104310
Approved by: https://github.com/cpuhrsch
2023-07-24 22:19:26 +00:00
yanbing-j
a54043516f
Add SparseCsrCPU and SparseCsrCUDA dispatch to sum.dim_IntList ( #99292 )
...
This PR adds support for sum.dim_IntList for sparse (CSR) tensors, as requested in https://github.com/pytorch/pytorch/issues/98796 .
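A minimal sketch of the newly dispatched call (the exact layout of the result follows the PR and may differ from what the comment suggests):
```python
import torch

crow = torch.tensor([0, 2, 4])
col = torch.tensor([0, 1, 0, 1])
vals = torch.tensor([1., 2., 3., 4.])
csr = torch.sparse_csr_tensor(crow, col, vals, size=(2, 2))

# sum over an explicit dim now dispatches for the SparseCsrCPU/SparseCsrCUDA keys
s = csr.sum(dim=0)
```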
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99292
Approved by: https://github.com/mingfeima , https://github.com/rusty1s , https://github.com/cpuhrsch
2023-07-24 17:30:58 +00:00
Nikita Karetnikov
944db0357d
Unify multilabel_margin_loss_shape_check on CPU and CUDA ( #105645 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105645
Approved by: https://github.com/ezyang
2023-07-23 02:16:29 +00:00
Nikita Karetnikov
eac9e1b35f
[OpInfo] add reference and error inputs for multilabel_margin_loss ( #105523 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105523
Approved by: https://github.com/ezyang
2023-07-23 02:16:29 +00:00
Aaron Gokaslan
6d43c89f37
[BE]: Update Ruff to 0.0.280 ( #105724 )
...
Removes unused loop values in Python dictionary iteration. Automated fix from Ruff master.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105724
Approved by: https://github.com/ezyang , https://github.com/janeyx99
2023-07-22 23:03:34 +00:00
Yanbo Liang
0ad93a3d56
Fix aten.logspace decomposition ( #105201 )
...
Fixes #104118
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105201
Approved by: https://github.com/ezyang
2023-07-22 04:10:20 +00:00
Jane Xu
803d42e457
add lerp cpu support for half ( #105607 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105607
Approved by: https://github.com/albanD
2023-07-21 20:29:05 +00:00
Justin Chu
4cc1745b13
[BE] f-stringify torch/ and scripts ( #105538 )
...
This PR is a follow-up to the pyupgrade series, converting more strings to f-strings using `flynt`.
- https://docs.python.org/3/reference/lexical_analysis.html#f-strings
- https://pypi.org/project/flynt/
Command used:
```
flynt torch/ -ll 120
flynt scripts/ -ll 120
flynt tools/ -ll 120
```
`collect_env.py` was excluded.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105538
Approved by: https://github.com/ezyang , https://github.com/malfet
2023-07-21 19:35:24 +00:00
Yanbo Liang
4c73016ff2
[Dynamo] Enable torch._dynamo.config.suppress_errors by default ( #105307 )
...
Summary:
We are working toward full model compilation, where, when a compilation error happens, we just fall back to eager mode rather than erroring out.
But at the same time, we should fix these issues if they are bugs. We will:
* log warnings in OSS;
* log warnings and write them into Scuba in fbcode;
to prevent us from ignoring these issues.
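For reference, a brief sketch of the flag whose default this changes (set explicitly here only for illustration):
```python
import torch

# With suppress_errors enabled (now the default), a compilation failure inside
# torch.compile logs a warning and falls back to eager instead of raising.
torch._dynamo.config.suppress_errors = True

@torch.compile
def f(x):
    return x.sin() + 1

print(f(torch.randn(4)))
```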
Test Plan: Manual test
Differential Revision: D47506314
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105307
Approved by: https://github.com/jansel
2023-07-21 19:17:46 +00:00
eqy
29f856e3e0
Kill process in wait_for_process if SIGINT fails to terminate it ( #105625 )
...
#98035 adds some additional logic to `wait_for_process` that catches a timeout exception and sends `SIGINT` to the process before waiting on it again with a timeout. However, if the additional wait times out again, the wait call in the `finally` block (which does not have a timeout) has the potential to hang indefinitely.
This PR kills the process if a second timeout exception occurs after the `SIGINT` signal is sent.
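A hedged sketch of the resulting control flow; the real helper lives in PyTorch's test tooling and carries more bookkeeping:
```python
import signal
import subprocess

def wait_for_process(p: subprocess.Popen, timeout: float = 300.0) -> int:
    try:
        return p.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        p.send_signal(signal.SIGINT)  # give the process a chance to shut down cleanly
        try:
            return p.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            p.kill()  # second timeout: force-kill instead of hanging in the final wait
            return p.wait()
```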
CC @clee2000 @ptrblck @xwang233 @kwen2501
Also hoping that this has the potential to reduce turnaround time for distributed timeouts like those seen in https://hud.pytorch.org/pr/pytorch/pytorch/105274#15148799113
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105625
Approved by: https://github.com/ezyang
2023-07-21 10:11:58 +00:00
Andrey Talman
c6653b65d8
Back out "Make adding buffers more like adding parameters ( #104069 )" ( #105581 )
...
Summary:
D47537831 is breaking pyper tests: https://fb.workplace.com/groups/802176577445480/posts/1018902842439518/
with `TypeError: register_buffer() takes 3 positional arguments but 4 were given`
Original commit changeset: d4b4069fbd38
Original Phabricator Diff: D47537831
Test Plan:
```
buck2 run //caffe2/torch/fb/training_toolkit/integration_tests/training_lifecycle/cogwheel_tests/pyper_release_v2:cogwheel_smallworld_inline_cvr_infer_pyper_pyper__canary_offline_training-launcher -- --run-harness-in-tupperware --build-fbpkg ads_dper3 --build-fbpkg training_platform
```
Reviewed By: atalman
Differential Revision: D47600140
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105581
Approved by: https://github.com/mikaylagawarecki
2023-07-20 03:39:53 +00:00
Yukio Siraichi
0b6de0eb1c
Improve validator module behavior if Z3 is not installed. ( #105168 )
...
Fixes: #105143
In summary, the changes are:
- Check whether Z3 is installed when the module is loaded
- Consistently use the name "translation validation" (not "validator")
- Skip tests if Z3 is not installed
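A minimal sketch of the pattern, assuming the check boils down to whether the `z3` module is importable:
```python
import unittest

try:
    import z3  # noqa: F401
    HAS_Z3 = True
except ImportError:
    HAS_Z3 = False

@unittest.skipIf(not HAS_Z3, "translation validation requires Z3")
class TestTranslationValidation(unittest.TestCase):
    def test_placeholder(self):
        pass
```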
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105168
Approved by: https://github.com/ezyang
2023-07-19 13:11:22 +00:00
Justin Chu
be03a56955
[BE] Enable ruff's UP rules and autoformat testing/ ( #105425 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105425
Approved by: https://github.com/malfet
2023-07-18 21:04:39 +00:00
mingfeima
5e942ac5ec
add bfloat16 support for reflection and replication padding ( #102949 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102949
Approved by: https://github.com/cpuhrsch
2023-07-18 13:01:09 +00:00
ekamiti
32d422f335
Make adding buffers more like adding parameters ( #104069 )
...
Adds semantics for creating a buffer object analogous to creating a parameter, via a new `Buffer` class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same: the `register_buffer` method is not changed, and using the `Buffer` type just leads to `register_buffer` being called. The `persistent` parameter of the `Buffer` type indicates whether the buffer should be persistent. The other non-test changes make the new `Buffer` type recognized by inductor and dynamo; the remaining changes are test changes verifying that the `Buffer` type can be used as a drop-in replacement for `register_buffer`. Normal tensors can still be used as buffers, so these changes are intended to be backwards compatible.
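A sketch of the intended usage per the description above; the exact import location of `Buffer` is an assumption (shown here as `torch.nn.Buffer`), and note that this change was later backed out (see the revert earlier in this log):
```python
import torch
import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        # Assigning a Buffer registers it, mirroring how assigning an
        # nn.Parameter registers a parameter. Under the hood this just
        # leads to register_buffer() being called.
        self.running_stat = nn.Buffer(torch.zeros(10), persistent=False)
        # Equivalent to:
        # self.register_buffer("running_stat", torch.zeros(10), persistent=False)

m = MyModule()
assert "running_stat" in dict(m.named_buffers())
```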
Fixes #35735
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104069
Approved by: https://github.com/mikaylagawarecki
2023-07-17 17:59:05 +00:00
Kurt Mohler
ffce2492af
Remove set_default_dtype calls from jit and ops tests ( #105072 )
...
Part of #68972
This only attempts to avoid setting the default dtype for `test_jit.py` and `test_ops.py`. There are other tests, like `test_nn.py`, which will be addressed in follow-up PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105072
Approved by: https://github.com/ezyang
2023-07-15 03:18:33 +00:00
Nikita Karetnikov
0a6888243b
multi_margin_loss: check weight shape, make contiguous on CPU, add tests (#104852 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104852
Approved by: https://github.com/ezyang
2023-07-14 21:16:09 +00:00
Nikita Karetnikov
de67b52a88
Unify multi_margin_loss_shape_check on CPU and CUDA ( #104851 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104851
Approved by: https://github.com/ezyang
2023-07-14 21:16:09 +00:00
Nikita Karetnikov
0c89596e4f
[OpInfo] add reference and error inputs for multi_margin_loss ( #104850 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104850
Approved by: https://github.com/ezyang
2023-07-14 21:16:09 +00:00
mingfeima
a66f08d626
enable channels last for replication padding on CPU ( #102597 )
...
Enable channels-last support for replication padding on CPU. This patch adds channels-last support for ReplicationPad2d/3d on the CPU backend. The following test cases will pass with this patch:
```
python test_modules.py TestModuleCPU.test_memory_format_nn_ReplicationPad2d_cpu_float32
python test_modules.py TestModuleCPU.test_memory_format_nn_ReplicationPad3d_cpu_float32
```
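A small sketch of the newly supported path; the expectation that the output preserves the channels-last layout is an assumption based on the description above:
```python
import torch
import torch.nn as nn

pad = nn.ReplicationPad2d((2, 2, 2, 2))
x = torch.randn(128, 64, 56, 56).to(memory_format=torch.channels_last)
y = pad(x)
# With this patch, the channels-last (NHWC) path is used on CPU.
assert y.is_contiguous(memory_format=torch.channels_last)
```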
The following benchmark results were gathered on an Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, with 20 cores per socket.
### single core inference
```
(before)
ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NHWC: 0.339 ms
ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NHWC: 82.935 ms
(after)
ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NHWC: 0.324 ms
ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NHWC: 16.717 ms
```
### single socket inference
```
(before)
ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NHWC: 0.135 ms
ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NHWC: 7.203 ms
(after)
ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NHWC: 0.029 ms
ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NHWC: 3.174 ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102597
Approved by: https://github.com/CaoE , https://github.com/cpuhrsch
2023-07-14 03:44:55 +00:00
mingfeima
f73757d551
enable channels last for reflection padding on CPU ( #102518 )
...
Add channels-last support for reflection padding on CPU. The following test cases will pass with this patch:
```
python test_modules.py TestModuleCPU.test_memory_format_nn_ReflectionPad2d_cpu_float32
python test_modules.py TestModuleCPU.test_memory_format_nn_ReflectionPad3d_cpu_float32
```
The following benchmark results were gathered on an Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, with 20 cores per socket.
### single core inference
```
(before)
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NHWC: 0.356 ms
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NHWC: 86.821 ms
(after)
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NHWC: 0.328 ms
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NHWC: 16.806 ms
```
### single socket inference
```
(before)
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NHWC: 0.142 ms
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NHWC: 7.367 ms
(after)
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NHWC: 0.027 ms
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NHWC: 3.181 ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102518
Approved by: https://github.com/CaoE , https://github.com/cpuhrsch
2023-07-13 16:22:31 +00:00
Richard Barnes
0faf8ed49f
Skip TS backend in FBCODE ( #104354 )
...
Summary:
Fixes:
```
Traceback (most recent call last):
File "/data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/1a4194a16794cc72/caffe2/test/__torch__/torch#link-tree/torch/testing/_internal/common_device_type.py", line 543, in setUpClass
torch._lazy.ts_backend.init()
File "/data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/1a4194a16794cc72/caffe2/test/__torch__/torch#link-tree/torch/_lazy/ts_backend.py", line 6, in init
torch._C._lazy_ts_backend._init()
RuntimeError: TorchScript backend not yet supported in FBCODE/OVRSOURCE builds
```
Test Plan: Sandcastle
Differential Revision: D47093028
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104354
Approved by: https://github.com/malfet
2023-07-13 02:46:58 +00:00
Rodrigo Kumpera
246dc0d9f2
[MTPG] Use TLS propagation to enable MTPG from bwd. ( #104735 )
...
We use PyTorch's built-in tls propagation in ThreadLocalState to forward the world object
from the fwd thread to the bwd thread.
This further closes the gap on enabling FSDP.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104735
Approved by: https://github.com/rohan-varma
2023-07-12 18:47:02 +00:00
albanD
08cbfb2a58
Avoid tensor creation and use scalar overload ( #104264 )
...
I would expect this preserves the behavior but there might be weird edge cases?
@mruberry might know?
The aim is to fix https://github.com/pytorch/pytorch/pull/104254 (and make `1 ** t` capturable via cudagraph)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104264
Approved by: https://github.com/zou3519
2023-07-12 18:11:27 +00:00
Joel Schlosser
ece19bf018
Update run_test.py to use TEST_WITH_SLOW_GRADCHECK flag ( #104819 )
...
Finishes the job from #104537 . See https://github.com/pytorch/pytorch/pull/104537#pullrequestreview-1520065008
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104819
Approved by: https://github.com/huydhn
2023-07-11 21:58:46 +00:00
Joel Schlosser
c2e286daf9
Testing: Print test reproduction command on failure ( #104537 )
...
MS2 of the Reproducible Testing BE initiative. For context, this is the ask:
```
Another thing that would be really great as we start to have more dependent
systems or types of tests (functorch, dynamo, crossref) would be to have a
minimally reproducible version of the test (something at the end of the HUD
comment like: "Run python test/test_file.py -k test_name" but also if you need
flags, like crossref it would be like "Run <flag to run crossref> python test/..." ). I'll
often go through the test infra to find the flags that I need to pass when
something only breaks crossref/dynamo tests.
```
Implementation details:
* Adds a new flag `PRINT_REPRO_ON_FAILURE` that is settable through the environment variable `PYTORCH_PRINT_REPRO_ON_FAILURE=1`
* **Default is ON but I can be persuaded otherwise**
* When the flag is enabled, our base `TestCase` will wrap the test method in a context manager that catches any non-skip exceptions and appends a repro string to the exception message. The repro includes setting of necessary test flags through env vars. Example:
```
To execute this test, run the following from the base repo dir:
PYTORCH_TEST_WITH_CROSSREF=1 python test/test_ops.py -k test_foo_add_cuda_float32
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```
* To keep track of flag settings, this PR introduces a new `TestEnvironment` class that defines global flags by querying related environment variables. Flag and env var names are purposefully kept searchable via full names. Example usages:
```python
TestEnvironment.def_flag("TEST_WITH_TORCHINDUCTOR", env_var="PYTORCH_TEST_WITH_INDUCTOR")
# can track implication relationships to avoid adding unnecessary flags to the repro
TestEnvironment.def_flag(
"TEST_WITH_TORCHDYNAMO",
env_var="PYTORCH_TEST_WITH_DYNAMO",
implied_by_fn=lambda: TEST_WITH_TORCHINDUCTOR or TEST_WITH_AOT_EAGER)
# can use include_in_repro=False to keep the flag from appearing in the repro command
TestEnvironment.def_flag(
"DISABLE_RUNNING_SCRIPT_CHK", env_var="PYTORCH_DISABLE_RUNNING_SCRIPT_CHK", include_in_repro=False)
# the default default value is False, but this can be changed
TestEnvironment.def_flag(
"PRINT_REPRO_ON_FAILURE", env_var="PYTORCH_PRINT_REPRO_ON_FAILURE", default=(not IS_FBCODE), include_in_repro=False)
```
* AFAICT it is only feasible to achieve this from within the test framework rather than at the CI level. This is because CI / `run_test.py` are unaware of individual test cases. Implementing it in our base `TestCase` class has the broadest area of effect, as it's not isolated to e.g. OpInfo tests.
* I couldn't find an easy way to test the logic via `test_testing.py`, as the logic for extracting the test filename doesn't work for generated test classes. I'm open to ideas on testing this, however.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104537
Approved by: https://github.com/ezyang , https://github.com/janeyx99 , https://github.com/huydhn
2023-07-10 21:24:02 +00:00
Jerry Zhang
1a661639f7
[quant] Support integer implementations for adaptive_avg_pool2d ( #104226 )
...
Summary:
This is needed for representing quantized models in the pt2 export quantization flow.
Test Plan:
tested by opinfo, python test/test_ops.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104226
Approved by: https://github.com/jgong5 , https://github.com/andrewor14
2023-07-07 19:36:31 +00:00
Rohan Varma
0bf39d5663
[FSDP] Option for eval in fp32/bf16 ( #104682 )
...
In https://github.com/pytorch/pytorch/pull/97645 and some follow up diffs, we made FSDP run in full precision in eval mode, even if mixed precision was specified.
However, this is probably not the best idea, and we should provide a flag to give users more control over this. This PR adds an env var `FSDP_FULL_PREC_IN_EVAL`, defaulting it to off; users who want to run eval in fp32 can toggle it before wrapping the model in FSDP:
`os.environ["FSDP_FULL_PREC_IN_EVAL"] = "1"`
Verified that unit tests, the APS workflow, and TNT workloads can run eval appropriately with this change.
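Expanding that snippet into a fuller, hedged sketch (assumes a process group has already been initialized):
```python
import os
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# Opt in to full-precision eval *before* wrapping the model with FSDP.
os.environ["FSDP_FULL_PREC_IN_EVAL"] = "1"

model = FSDP(
    nn.Linear(8, 8).cuda(),
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
)
model.eval()  # eval forward now runs in fp32 despite the bf16 mixed-precision config
```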
Differential Revision: [D47246556](https://our.internmc.facebook.com/intern/diff/D47246556/ )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104682
Approved by: https://github.com/awgu
2023-07-07 08:14:23 +00:00
Rodrigo Kumpera
17ab4f85e9
[c10d] Adopt allgather_into_tensor_coalesced for NCCL. ( #103086 )
...
This is done by adding c10d::_allgather_into_tensor_coalesced wrapper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103086
Approved by: https://github.com/rohan-varma
2023-07-06 15:05:55 +00:00
Jerry Zhang
611febf6cf
[quant] Support integer implementations for max_pool2d ( #104225 )
...
Summary:
This is needed for representing quantized models in the pt2 export quantization flow.
Test Plan:
tested by opinfo, python test/test_ops.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104225
Approved by: https://github.com/kimishpatel
2023-07-05 23:54:07 +00:00
Yukio Siraichi
40b8d10d5e
Re-land: Turn translation validation on for tests and accuracy runs by default. ( #104467 )
...
Re-landing: #103611
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104467
Approved by: https://github.com/malfet
2023-07-05 19:01:50 +00:00
PyTorch MergeBot
8958f041be
Revert "Add forward mode AD to out-place foreach functions ( #102409 )"
...
This reverts commit e2ec0ba404 .
Reverted https://github.com/pytorch/pytorch/pull/102409 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it is failing some tests in trunk e799f565eb ([comment](https://github.com/pytorch/pytorch/pull/102409#issuecomment-1615254393 ))
2023-06-30 22:46:57 +00:00
PyTorch MergeBot
a2a8b4d415
Revert "Turn translation validation on for tests and accuracy runs by default. ( #103611 )"
...
This reverts commit e311bed2a8 .
Reverted https://github.com/pytorch/pytorch/pull/103611 on behalf of https://github.com/malfet due to Broke inductor tests ([comment](https://github.com/pytorch/pytorch/pull/103611#issuecomment-1614850276 ))
2023-06-30 15:54:18 +00:00
Xilun Wu
e799f565eb
[DTensor][TP][Random] Introduce TensorParallelRNGTracker to integrate parallel RNG state with Tensor Parallel ( #103910 )
...
This PR enables the automatic use of `TensorParallelRNGTracker` in the Tensor Parallel API. Some unit tests will be added to cover it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103910
Approved by: https://github.com/wanchaol , https://github.com/fduwjj
2023-06-30 08:06:41 +00:00
Wanchao Liang
958bd3a549
[fake_pg] remove init barrier env var ( #104428 )
...
We can now remove the env var, since the init barrier is disabled by default.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104428
Approved by: https://github.com/wconstab , https://github.com/fduwjj
2023-06-30 05:04:26 +00:00
Masaki Kozuki
e2ec0ba404
Add forward mode AD to out-place foreach functions ( #102409 )
...
The major difference from in-place support is that some out-of-place functions have their derivatives spelled out in derivatives.yaml, which requires some changes in `load_derivatives.py` and some special handling in various places for the others, whose derivatives are generated by `torchgen`.
rel:
- #58833
- #100695
---
# Generated Foreach
```c++
::std::vector<at::Tensor> _foreach_sinh(c10::DispatchKeySet ks, at::TensorList self) {
auto self_ = unpack(self, "self", 0);
[[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );
std::vector<bool> _any_has_forward_grad_result(self.size());
for (const auto& i : c10::irange(self.size())) {
_any_has_forward_grad_result[i] = isFwGradDefined(self[i]);
}
std::shared_ptr<ForeachSinhBackward0> grad_fn;
if (_any_requires_grad) {
grad_fn = std::shared_ptr<ForeachSinhBackward0>(new ForeachSinhBackward0(), deleteNode);
grad_fn->set_next_edges(collect_next_edges( self ));
grad_fn->self_ = make_saved_variable_list(self);
grad_fn->self_size_ = self.size();
}
#ifndef NDEBUG
std::vector<c10::optional<Storage>> self__storage_saved(self_.size());
for (const Tensor& tensor : self_)
self__storage_saved.push_back(
tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size());
for (size_t i=0; i<self_.size(); i++)
if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr();
#endif
auto _tmp = ([&]() {
at::AutoDispatchBelowADInplaceOrView guard;
return at::redispatch::_foreach_sinh(ks & c10::after_autograd_keyset, self_);
})();
auto result = std::move(_tmp);
#ifndef NDEBUG
for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_))
TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage()));
}
for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_))
TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr());
}
#endif
if (grad_fn) {
set_history(flatten_tensor_args( result ), grad_fn);
}
std::vector<c10::optional<at::Tensor>> result_new_fw_grad_opts(self.size(), c10::nullopt);
for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) {
if (_any_has_forward_grad_result[i]) {
auto self_t_raw = toNonOptFwGrad(self[i]);
auto self_tensor = toNonOptTensor(self[i]);
auto self_t = (self_t_raw.defined() || !self_tensor.defined())
? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options());
auto self_p = toNonOptPrimal(self[i]);
result_new_fw_grad_opts[i] = (self_t.conj() * self_p.cosh().conj()).conj();
}
}
for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) {
auto& result_new_fw_grad_opt = result_new_fw_grad_opts[i];
if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result[i].defined()) {
// The hardcoded 0 here will need to be updated once we support multiple levels.
result[i]._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false);
}
}
return result;
}
::std::vector<at::Tensor> _foreach_norm_Scalar(c10::DispatchKeySet ks, at::TensorList self, const at::Scalar & ord) {
auto self_ = unpack(self, "self", 0);
[[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );
std::vector<bool> _any_has_forward_grad_result(self.size());
for (const auto& i : c10::irange(self.size())) {
_any_has_forward_grad_result[i] = isFwGradDefined(self[i]);
}
std::shared_ptr<ForeachNormBackward0> grad_fn;
if (_any_requires_grad) {
grad_fn = std::shared_ptr<ForeachNormBackward0>(new ForeachNormBackward0(), deleteNode);
grad_fn->set_next_edges(collect_next_edges( self ));
grad_fn->ord = ord;
grad_fn->self_ = make_saved_variable_list(self);
grad_fn->self_size_ = self.size();
}
#ifndef NDEBUG
std::vector<c10::optional<Storage>> self__storage_saved(self_.size());
for (const Tensor& tensor : self_)
self__storage_saved.push_back(
tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size());
for (size_t i=0; i<self_.size(); i++)
if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr();
#endif
auto _tmp = ([&]() {
at::AutoDispatchBelowADInplaceOrView guard;
return at::redispatch::_foreach_norm(ks & c10::after_autograd_keyset, self_, ord);
})();
auto result = std::move(_tmp);
#ifndef NDEBUG
for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_))
TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage()));
}
for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_))
TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr());
}
#endif
if (grad_fn) {
set_history(flatten_tensor_args( result ), grad_fn);
}
std::vector<c10::optional<at::Tensor>> result_new_fw_grad_opts(self.size(), c10::nullopt);
for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) {
if (_any_has_forward_grad_result[i]) {
auto self_t_raw = toNonOptFwGrad(self[i]);
auto self_tensor = toNonOptTensor(self[i]);
auto self_t = (self_t_raw.defined() || !self_tensor.defined())
? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options());
auto self_p = toNonOptPrimal(self[i]);
result_new_fw_grad_opts[i] = norm_jvp(self_p, self_t, ord, result[i]);
}
}
for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) {
auto& result_new_fw_grad_opt = result_new_fw_grad_opts[i];
if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result[i].defined()) {
// The hardcoded 0 here will need to be updated once we support multiple levels.
result[i]._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false);
}
}
if (grad_fn) {
grad_fn->result = result;
}
return result;
}
```
# Reference
```c++
at::Tensor sinh(c10::DispatchKeySet ks, const at::Tensor & self) {
auto& self_ = unpack(self, "self", 0);
[[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );
[[maybe_unused]] auto _any_has_forward_grad_result = (isFwGradDefined(self));
std::shared_ptr<SinhBackward0> grad_fn;
if (_any_requires_grad) {
grad_fn = std::shared_ptr<SinhBackward0>(new SinhBackward0(), deleteNode);
grad_fn->set_next_edges(collect_next_edges( self ));
grad_fn->self_ = SavedVariable(self, false);
}
#ifndef NDEBUG
c10::optional<Storage> self__storage_saved =
self_.has_storage() ? c10::optional<Storage>(self_.storage()) : c10::nullopt;
c10::intrusive_ptr<TensorImpl> self__impl_saved;
if (self_.defined()) self__impl_saved = self_.getIntrusivePtr();
#endif
auto _tmp = ([&]() {
at::AutoDispatchBelowADInplaceOrView guard;
return at::redispatch::sinh(ks & c10::after_autograd_keyset, self_);
})();
auto result = std::move(_tmp);
#ifndef NDEBUG
if (self__storage_saved.has_value() &&
!at::impl::dispatch_mode_enabled() &&
!at::impl::tensor_has_dispatch(self_))
TORCH_INTERNAL_ASSERT(self__storage_saved.value().is_alias_of(self_.storage()));
if (self__impl_saved && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(self_))
TORCH_INTERNAL_ASSERT(self__impl_saved == self_.getIntrusivePtr());
if (result.has_storage() && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result)) {
TORCH_INTERNAL_ASSERT(result.storage().use_count() == 1, "function: sinh");
}
if (!at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result))
TORCH_INTERNAL_ASSERT(result.use_count() <= 1, "function: sinh");
#endif
if (grad_fn) {
set_history(flatten_tensor_args( result ), grad_fn);
}
c10::optional<at::Tensor> result_new_fw_grad_opt = c10::nullopt;
if (_any_has_forward_grad_result && (result.defined())) {
auto self_t_raw = toNonOptFwGrad(self);
auto self_tensor = toNonOptTensor(self);
auto self_t = (self_t_raw.defined() || !self_tensor.defined())
? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options());
auto self_p = toNonOptPrimal(self);
result_new_fw_grad_opt = (self_t.conj() * self_p.cosh().conj()).conj();
}
if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result.defined()) {
// The hardcoded 0 here will need to be updated once we support multiple levels.
result._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false);
}
return result;
}
at::Tensor norm_Scalar(c10::DispatchKeySet ks, const at::Tensor & self, const at::Scalar & p) {
auto& self_ = unpack(self, "self", 0);
[[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );
[[maybe_unused]] auto _any_has_forward_grad_result = (isFwGradDefined(self));
std::shared_ptr<NormBackward0> grad_fn;
if (_any_requires_grad) {
grad_fn = std::shared_ptr<NormBackward0>(new NormBackward0(), deleteNode);
grad_fn->set_next_edges(collect_next_edges( self ));
grad_fn->p = p;
grad_fn->self_ = SavedVariable(self, false);
}
#ifndef NDEBUG
c10::optional<Storage> self__storage_saved =
self_.has_storage() ? c10::optional<Storage>(self_.storage()) : c10::nullopt;
c10::intrusive_ptr<TensorImpl> self__impl_saved;
if (self_.defined()) self__impl_saved = self_.getIntrusivePtr();
#endif
auto _tmp = ([&]() {
at::AutoDispatchBelowADInplaceOrView guard;
return at::redispatch::norm(ks & c10::after_autograd_keyset, self_, p);
})();
auto result = std::move(_tmp);
#ifndef NDEBUG
if (self__storage_saved.has_value() &&
!at::impl::dispatch_mode_enabled() &&
!at::impl::tensor_has_dispatch(self_))
TORCH_INTERNAL_ASSERT(self__storage_saved.value().is_alias_of(self_.storage()));
if (self__impl_saved && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(self_))
TORCH_INTERNAL_ASSERT(self__impl_saved == self_.getIntrusivePtr());
if (result.has_storage() && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result)) {
TORCH_INTERNAL_ASSERT(result.storage().use_count() == 1, "function: norm_Scalar");
}
if (!at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result))
TORCH_INTERNAL_ASSERT(result.use_count() <= 1, "function: norm_Scalar");
#endif
if (grad_fn) {
set_history(flatten_tensor_args( result ), grad_fn);
}
throw_error_for_complex_autograd(result, "norm");
c10::optional<at::Tensor> result_new_fw_grad_opt = c10::nullopt;
if (_any_has_forward_grad_result && (result.defined())) {
auto self_t_raw = toNonOptFwGrad(self);
auto self_tensor = toNonOptTensor(self);
auto self_t = (self_t_raw.defined() || !self_tensor.defined())
? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options());
auto self_p = toNonOptPrimal(self);
result_new_fw_grad_opt = norm_jvp(self_p, self_t, p, result);
}
if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result.defined()) {
// The hardcoded 0 here will need to be updated once we support multiple levels.
result._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false);
}
if (grad_fn) {
grad_fn->result_ = SavedVariable(result, true);
}
return result;
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102409
Approved by: https://github.com/soulitzer
2023-06-30 04:51:43 +00:00
Wanchao Liang
8457703e8d
lazy init device mesh in fsdp ( #104447 )
...
Since FSDP state is lazily initialized, we also need to lazily initialize the device mesh; otherwise the DeviceMesh allgather check would trigger a mismatch in allgather counts in FSDP tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104447
Approved by: https://github.com/wconstab
2023-06-30 04:40:16 +00:00
Yukio Siraichi
e311bed2a8
Turn translation validation on for tests and accuracy runs by default. ( #103611 )
...
This PR turns translation validation on by default for tests and accuracy benchmark
runs. It also installs Z3 on CI.
The main changes are:
- Add `--no-translation-validation` as an option in _test/run_tests.py_
- Set `PYTORCH_TEST_WITH_TV` environment variable
- Add `TEST_WITH_TV` variable in _torch/testing/_internal/common_utils.py_
- Turn translation validation on for accuracy benchmarks in _benchmarks/dynamo/common.py_
- Add Z3 installation on CI scripts
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103611
Approved by: https://github.com/ezyang
2023-06-30 01:32:21 +00:00