Commit Graph

59519 Commits

zhouzaida
b51f92ebda [Docs] Fix docstring format (#99396)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99396
Approved by: https://github.com/awgu
2023-04-28 01:10:07 +00:00
zhi.cai
64efd88845 Add directly referenced header files for "ceil_div.h" (#99607)
std::enable_if_t is defined in <type_traits>. Directly including the headers for the symbols you reference is good programming style.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99607
Approved by: https://github.com/albanD, https://github.com/kit1980
2023-04-28 01:05:05 +00:00
Bin Bao
3e87fc521b [CI] Start to collect inference perf with cpp_wrapper ON (#100187)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100187
Approved by: https://github.com/huydhn
2023-04-28 01:03:09 +00:00
Masaki Kozuki
674018903d per-Tensor grad_fn for in-place foreach functions (#96405)
Generate a `grad_fn` for each (tuple of) `Tensor`(s) at the same index for `_foreach_foo_`, so that each `grad_fn` is `FooBackward`.

The current status of foreach functions' backward support for the record:
- out-place: implemented, but without optimized implementations like their forward path
- in-place: not implemented. I think this check 7eaaefafb3/torchgen/api/autograd.py (L309-L311) is partly responsible, but the difference in signature between out-place and in-place (see https://github.com/pytorch/pytorch/pull/96405#discussion_r1154690940) would prevent in-place from reusing the out-place versions (the logic is around 7eaaefafb3/torchgen/api/autograd.py (L495-L500))

```c++
void _foreach_abs_(c10::DispatchKeySet ks, at::TensorList self) {
  auto self_ = unpack(self, "self", 0);
  #ifndef NDEBUG
  std::vector<c10::optional<Storage>> self__storage_saved(self_.size());
  for (const Tensor& tensor : self_)
    self__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size());
  for (size_t i=0; i<self_.size(); i++)
    if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr();
  #endif
  {
    at::AutoDispatchBelowAutograd guard;
    at::redispatch::_foreach_abs_(ks & c10::after_autograd_keyset, self_);
  }
  #ifndef NDEBUG
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_))
      AT_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage()));
  }
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_))
      AT_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr());
  }
  #endif
}
```

Related:
- #95431
- #95765 for multiple `grad_fn`s logic

---

Examples: outputs for `_foreach_add_.List`, `_foreach_addcmul_.ScalarList`, and `_foreach_exp_`

```c++
void _foreach_addcmul__ScalarList(c10::DispatchKeySet ks, at::TensorList self, at::TensorList tensor1, at::TensorList tensor2, at::ArrayRef<at::Scalar> scalars) {
  auto self_ = unpack(self, "self", 0);
  auto tensor1_ = unpack(tensor1, "tensor1", 1);
  auto tensor2_ = unpack(tensor2, "tensor2", 2);
  auto _any_requires_grad = compute_requires_grad( self, tensor1, tensor2 );

  (void)_any_requires_grad;
  std::vector<c10::optional<at::Tensor>> original_selfs(self.size());
  std::vector<std::shared_ptr<AddcmulBackward0>> grad_fns;
  if (_any_requires_grad) {
    for (const auto& i : c10::irange( self.size() )) {
      const auto ith_requires_grad = compute_requires_grad(self[i], tensor1[i], tensor2[i]);
      check_inplace(self[i], ith_requires_grad);
      grad_fns.push_back([&]() -> std::shared_ptr<AddcmulBackward0> {
          if (!ith_requires_grad) {
              return nullptr;
          } else {
              auto grad_fn = std::shared_ptr<AddcmulBackward0>(new AddcmulBackward0(), deleteNode);
              grad_fn->set_next_edges(collect_next_edges( self[i], tensor1[i], tensor2[i] ));
              return grad_fn;
          }
      }());
    }
    if (!grad_fns.empty()) {

        for (const auto& i : c10::irange(grad_fns.size())) {
            auto grad_fn = grad_fns[i];
            if (grad_fn != nullptr) {
                grad_fn->self_scalar_type = self[i].scalar_type();
                grad_fn->tensor1_scalar_type = tensor1[i].scalar_type();
                if (grad_fn->should_compute_output(1)) {
                  grad_fn->tensor2_ = SavedVariable(tensor2[i], false);
                }
                grad_fn->value = scalars[i];
                if (grad_fn->should_compute_output(2)) {
                  grad_fn->tensor1_ = SavedVariable(tensor1[i], false);
                }
                grad_fn->tensor2_scalar_type = tensor2[i].scalar_type();
            }
        }
    }
  }
  #ifndef NDEBUG
  std::vector<c10::optional<Storage>> self__storage_saved(self_.size());
  for (const Tensor& tensor : self_)
    self__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size());
  for (size_t i=0; i<self_.size(); i++)
    if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr();
  std::vector<c10::optional<Storage>> tensor1__storage_saved(tensor1_.size());
  for (const Tensor& tensor : tensor1_)
    tensor1__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> tensor1__impl_saved(tensor1_.size());
  for (size_t i=0; i<tensor1_.size(); i++)
    if (tensor1_[i].defined()) tensor1__impl_saved[i] = tensor1_[i].getIntrusivePtr();
  std::vector<c10::optional<Storage>> tensor2__storage_saved(tensor2_.size());
  for (const Tensor& tensor : tensor2_)
    tensor2__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> tensor2__impl_saved(tensor2_.size());
  for (size_t i=0; i<tensor2_.size(); i++)
    if (tensor2_[i].defined()) tensor2__impl_saved[i] = tensor2_[i].getIntrusivePtr();
  #endif
  {
    at::AutoDispatchBelowAutograd guard;
    at::redispatch::_foreach_addcmul_(ks & c10::after_autograd_keyset, self_, tensor1_, tensor2_, scalars);
  }
  #ifndef NDEBUG
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage()));
  }
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr());
  }
  for (size_t i=0; i<tensor1_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (tensor1__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(tensor1_))
      TORCH_INTERNAL_ASSERT(tensor1__storage_saved[i].value().is_alias_of(tensor1_[i].storage()));
  }
  for (size_t i=0; i<tensor1_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (tensor1__impl_saved[i] && !at::impl::tensorlist_has_dispatch(tensor1_))
      TORCH_INTERNAL_ASSERT(tensor1__impl_saved[i] == tensor1_[i].getIntrusivePtr());
  }
  for (size_t i=0; i<tensor2_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (tensor2__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(tensor2_))
      TORCH_INTERNAL_ASSERT(tensor2__storage_saved[i].value().is_alias_of(tensor2_[i].storage()));
  }
  for (size_t i=0; i<tensor2_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (tensor2__impl_saved[i] && !at::impl::tensorlist_has_dispatch(tensor2_))
      TORCH_INTERNAL_ASSERT(tensor2__impl_saved[i] == tensor2_[i].getIntrusivePtr());
  }
  #endif
  if (!grad_fns.empty()) {
      auto differentiable_outputs = flatten_tensor_args( self );
      TORCH_INTERNAL_ASSERT(differentiable_outputs.size() == grad_fns.size());
      for (const auto& i : c10::irange(grad_fns.size())) {
          auto grad_fn = grad_fns[i];
          if (grad_fn != nullptr) {
              rebase_history(differentiable_outputs[i], grad_fns[i]);
          }
      }
  }
}

```

```c++
void _foreach_add__List(c10::DispatchKeySet ks, at::TensorList self, at::TensorList other, const at::Scalar & alpha) {
  auto self_ = unpack(self, "self", 0);
  auto other_ = unpack(other, "other", 1);
  auto _any_requires_grad = compute_requires_grad( self, other );

  (void)_any_requires_grad;
  std::vector<c10::optional<at::Tensor>> original_selfs(self.size());
  std::vector<std::shared_ptr<AddBackward0>> grad_fns;
  if (_any_requires_grad) {
    for (const auto& i : c10::irange( self.size() )) {
      const auto ith_requires_grad = compute_requires_grad(self[i], other[i]);
      check_inplace(self[i], ith_requires_grad);
      grad_fns.push_back([&]() -> std::shared_ptr<AddBackward0> {
          if (!ith_requires_grad) {
              return nullptr;
          } else {
              auto grad_fn = std::shared_ptr<AddBackward0>(new AddBackward0(), deleteNode);
              grad_fn->set_next_edges(collect_next_edges( self[i], other[i] ));
              return grad_fn;
          }
      }());
    }
    if (!grad_fns.empty()) {

        for (const auto& i : c10::irange(grad_fns.size())) {
            auto grad_fn = grad_fns[i];
            if (grad_fn != nullptr) {
                grad_fn->other_scalar_type = other[i].scalar_type();
                grad_fn->alpha = alpha;
                grad_fn->self_scalar_type = self[i].scalar_type();
            }
        }
    }
  }
  #ifndef NDEBUG
  std::vector<c10::optional<Storage>> self__storage_saved(self_.size());
  for (const Tensor& tensor : self_)
    self__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size());
  for (size_t i=0; i<self_.size(); i++)
    if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr();
  std::vector<c10::optional<Storage>> other__storage_saved(other_.size());
  for (const Tensor& tensor : other_)
    other__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> other__impl_saved(other_.size());
  for (size_t i=0; i<other_.size(); i++)
    if (other_[i].defined()) other__impl_saved[i] = other_[i].getIntrusivePtr();
  #endif
  {
    at::AutoDispatchBelowAutograd guard;
    at::redispatch::_foreach_add_(ks & c10::after_autograd_keyset, self_, other_, alpha);
  }
  #ifndef NDEBUG
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage()));
  }
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr());
  }
  for (size_t i=0; i<other_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (other__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(other_))
      TORCH_INTERNAL_ASSERT(other__storage_saved[i].value().is_alias_of(other_[i].storage()));
  }
  for (size_t i=0; i<other_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (other__impl_saved[i] && !at::impl::tensorlist_has_dispatch(other_))
      TORCH_INTERNAL_ASSERT(other__impl_saved[i] == other_[i].getIntrusivePtr());
  }
  #endif
  if (!grad_fns.empty()) {
      auto differentiable_outputs = flatten_tensor_args( self );
      TORCH_INTERNAL_ASSERT(differentiable_outputs.size() == grad_fns.size());
      for (const auto& i : c10::irange(grad_fns.size())) {
          auto grad_fn = grad_fns[i];
          if (grad_fn != nullptr) {
              rebase_history(differentiable_outputs[i], grad_fns[i]);
          }
      }
  }
}

...

void _foreach_exp_(c10::DispatchKeySet ks, at::TensorList self) {
  auto self_ = unpack(self, "self", 0);
  auto _any_requires_grad = compute_requires_grad( self );

  (void)_any_requires_grad;
  std::vector<c10::optional<at::Tensor>> original_selfs(self.size());
  std::vector<std::shared_ptr<ExpBackward0>> grad_fns;
  if (_any_requires_grad) {
    for (const auto& i : c10::irange( self.size() )) {
      const auto ith_requires_grad = compute_requires_grad(self[i]);
      check_inplace(self[i], ith_requires_grad);
      grad_fns.push_back([&]() -> std::shared_ptr<ExpBackward0> {
          if (!ith_requires_grad) {
              return nullptr;
          } else {
              auto grad_fn = std::shared_ptr<ExpBackward0>(new ExpBackward0(), deleteNode);
              grad_fn->set_next_edges(collect_next_edges( self[i] ));
              return grad_fn;
          }
      }());
    }
  }
  #ifndef NDEBUG
  std::vector<c10::optional<Storage>> self__storage_saved(self_.size());
  for (const Tensor& tensor : self_)
    self__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size());
  for (size_t i=0; i<self_.size(); i++)
    if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr();
  #endif
  {
    at::AutoDispatchBelowAutograd guard;
    at::redispatch::_foreach_exp_(ks & c10::after_autograd_keyset, self_);
  }
  #ifndef NDEBUG
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage()));
  }
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr());
  }
  #endif
  if (!grad_fns.empty()) {
      auto differentiable_outputs = flatten_tensor_args( self );
      TORCH_INTERNAL_ASSERT(differentiable_outputs.size() == grad_fns.size());
      for (const auto& i : c10::irange(grad_fns.size())) {
          auto grad_fn = grad_fns[i];
          if (grad_fn != nullptr) {
              rebase_history(differentiable_outputs[i], grad_fns[i]);
          }
      }
  }
  if (!grad_fns.empty()) {

      for (const auto& i : c10::irange(grad_fns.size())) {
          auto grad_fn = grad_fns[i];
          if (grad_fn != nullptr) {
              grad_fn->result_ = SavedVariable(self[i], true, self[i].is_view());
          }
      }
  }
}

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96405
Approved by: https://github.com/soulitzer
2023-04-28 00:55:04 +00:00
Edward Z. Yang
9a3e411a41 More rigorous mixed overloads on SymInt (#100008)
Previously, the change to aten/src/ATen/native/LossNLL.cpp eventually resulted in a double / SymInt division, which ended up calling the int64_t / SymInt overload, truncating the double (bad!). By adding overloads for all the int/float types, we prevent this situation from happening in the future.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100008
Approved by: https://github.com/albanD
2023-04-28 00:54:44 +00:00
Sergii Dymchenko
9dcabe293a Delete pytorch/caffe2/contrib/docker-ubuntu-14.04 (#100155)
It's not used anywhere AFAIK and only triggers security issue scanners.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100155
Approved by: https://github.com/huydhn
2023-04-28 00:41:37 +00:00
Andrew Gu
d1fbd33c70 [FSDP] Remove unneeded disable of tf32 (#100179)
I recall needing to disable tf32, but I cannot repro the issue anymore. Nowhere else in our unit tests do we disable tf32, so we can try to get rid of this disabling.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100179
Approved by: https://github.com/rohan-varma
2023-04-28 00:14:40 +00:00
Andrew Gu
1f4183e275 [FSDP] Subtest sharding strategy in test_fsdp_grad_acc.py (#100178)
Let us make the unit test faster.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100178
Approved by: https://github.com/rohan-varma
2023-04-28 00:14:40 +00:00
Ke Wen
ae40a6c735 [c10d] Comment out ddp_hook_with_optimizer_parity tests (#100215)
This is a mirror PR of D45339293

Summary:
These tests cause the following errors internally for an unknown reason:
```
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_adam'
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_adamw'
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_sgd'
```
Commenting these tests out to unblock other PRs.

Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100215
Approved by: https://github.com/wz337, https://github.com/fduwjj
2023-04-28 00:05:46 +00:00
Michael Voznesensky
a145a3332c Add tensor to fake clone snapshot for immutable source of truth (#100128)
There's a longstanding, well known mutability bug in dynamo, https://github.com/pytorch/pytorch/issues/93610 (and more issues, but this is the one I had at hand).

Ops that do in place mutation of tensors will mutate their corresponding FakeTensors.

So, for example, if you do `t_` on a tensor, you will reverse its strides. This, in turn, means that the FakeTensor's strides are now also reversed, say, if you are trying to torch.compile:

```
class F(torch.nn.Module):
            def forward(self, x, y):
                x = x.t_()
                y = y.t_()
                return (x + y,)
```

However, we recently introduced accessing the fake_tensor memo/cache to get the symbolic shape values for sizes and strides during guard installation time.

This means that tensors captured with a given size and stride, say, for x above, size (3, 3) and stride (3, 1), will get their memo updated to size (3, 3), stride (1, 3). Now, whenever you access this value for anything, it reflects its current state in the tracing, as opposed to the state at which we initially started tracing.

This causes us to produce guards that are never valid; for the example above, the guard `x.stride()[0] == 3`.

The solution is to not allow mutation to affect the fake tensors we use as source of truth here. We can do this by forcing a clone of the fake tensor at builder time, and storing that as the source of truth for our dynamic sizes and strides during guard installation.
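
For illustration (not from the PR), here is how an in-place `t_()` mutates the size/stride metadata that guards would otherwise read back from the memoized fake tensor:

```python
import torch

x = torch.randn(3, 4)
print(x.size(), x.stride())  # torch.Size([3, 4]) (4, 1)
x.t_()                       # in-place transpose mutates the tensor's metadata
print(x.size(), x.stride())  # torch.Size([4, 3]) (1, 4)
# Guards derived from the mutated metadata would never match the sizes/strides
# the function was originally traced with.
```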

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100128
Approved by: https://github.com/ezyang
2023-04-27 23:58:15 +00:00
Yanli Zhao
ca1cf434e7 Not flatten states when use_orig_param is True and sharding is NO_SHARD (#100189)
When use_orig_param is True and sharding is NO_SHARD, parameters and states are not flattened, so optimizer states should not be flattened either. The unit test will fail without the fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100189
Approved by: https://github.com/awgu
2023-04-27 23:47:01 +00:00
chunyuan
3241fbd627 Inductor cpp wrapper: support LinearBinary (#99957)
Support `mkldnn::_linear_pointwise.binary` in cpp wrapper for GPT-J.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99957
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
2023-04-27 23:46:42 +00:00
TamirCohen
0221198790 Added Typechecking to input tensor in RNN (#100100)
The input tensor of the RNN forward must have the same type as the weights.
When passing a tensor of type long, the error is:
    `RuntimeError: expected scalar type Long but found Float`
which is misleading because it suggests converting something to Long, while the correct solution is to convert the input to Float (the type of the weights).

The new error:
    `RuntimeError: input must have the type torch.float32, got type torch.int64`

is correct and more informative.
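
An illustrative repro of the dtype mismatch (not taken from the PR's tests):

```python
import torch

rnn = torch.nn.RNN(input_size=4, hidden_size=8, batch_first=True)
x = torch.randint(0, 10, (2, 5, 4))  # int64 input, while the RNN weights are float32
# rnn(x)   # old error: "expected scalar type Long but found Float" (misleading)
#          # new error: "input must have the type torch.float32, got type torch.int64"
out, h = rnn(x.float())  # the fix on the user side: cast the input to the weights' dtype
```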

Fixes #99998

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100100
Approved by: https://github.com/drisspg
2023-04-27 23:36:57 +00:00
Driss Guessous
b8d7a28e1a refactor test_sdpa into two test classes to account for failure modes (#100121)
### Summary
This PR creates a new TestSDPAFailureModes test class in order to better separate what each test is trying to do.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100121
Approved by: https://github.com/malfet, https://github.com/ngimel
2023-04-27 21:42:40 +00:00
Daniel Dale
477ca1789c Avoid elementwise dispatch of gradient unscaling/validation ops in _foreach_non_finite_check_and_unscale_cpu_ (#100108)
Fixes [#82206](https://github.com/pytorch/pytorch/issues/82206)

When executing a `ShardedGradScaler` step in the context of `cpu_offload`, [the function](ecd2c71871/torch/distributed/fsdp/sharded_grad_scaler.py (L151-L152)) `_foreach_non_finite_check_and_unscale_cpu_` is grindingly slow. This issue is due to the elementwise op dispatching/redispatching/execution that is engendered by the current approach to gradient tensor validation:
ecd2c71871/torch/distributed/fsdp/sharded_grad_scaler.py (L159-L163)

The subsequent `isinf` and `isnan` checks with associated `any` checks result in unscalable elementwise op dispatches:
ecd2c71871/torch/distributed/fsdp/sharded_grad_scaler.py (L173-L181)

This inefficiency is of course hidden in the current FSDP tests given their (appropriately) trivial parameter dimensionality. In the perf analysis below, the example test configures only the final `Linear(4, 8)` module parameters to require grad, so there are 40 elements to iterate through. However, if one increases the dimensionality to a still-modest 320008 elements (changing the final module to `Linear(40000,8)`), the execution time/cpu cost of the test is dominated by the elementwise op dispatching/redispatching/execution of the `any` validation ops in this function.

To characterize the current behavior, I use a slightly modified version of an existing `ShardedGradScaler` test [^1]. The following modifications to the test are made to allow the analysis:

1. Run just `CUDAInitMode.CUDA_BEFORE` for clarity instead of additional scenarios
2. Increase the final module to `Linear(40000, 8)` (along with modifying the preceding module to make the dimensions work)
3. For the cProfile run (but not valgrind or perf) the test runs just a single [`_train_for_several_steps`](ecd2c71871/torch/testing/_internal/common_fsdp.py (L926-L934)) step per rank (instead of 2 steps)
4. I temporarily reduce `init_scale` further to ensure we don't hit any `infs`, short-circuiting our analysis

### Current behavior

The most relevant call subgraph:
![callgrind_subgraph_elementwise_dispatch](https://user-images.githubusercontent.com/7462936/234656744-b7ca81b2-ce5b-4035-9918-0ad57d3689d3.png)

Note that:

1. Instead of dispatching to the relevant autograd op and then redispatching to the relevant CPU op implementation 8 times per test (2 train steps x 2 `any` calls per parameter per step x 2 orig parameters), we (I believe unnecessarily) call the relevant dispatch flow elementwise, so 640016 times! (only 1 node in this trace, so 320008 elements/2 x 2 train steps x 2 calls per element per step).
2. Nearly 50% of the relative (inclusive) instruction reads for the entire test in `callgrind` are executed by the `isnan` (320008 execs), `isinf` (320008 execs) and `any` (640016 execs) calls.
3. The `any` pre-dispatch entry point IRs (`torch::autograd::THPVariable_any`) vs actual op implementation IRs (`at::native::structured_any_all_out::impl`) are below to give one a sense of the relative dispatch and op execution cost in an elementwise context[^3].
![THPVariable_any_op_elementwise_dispatch_absolute_IR](https://user-images.githubusercontent.com/7462936/234656886-3c017ee3-8a04-4a7d-bdf8-6c690de42c92.png)
![structured_any_all_out_impl_absolute_IR](https://user-images.githubusercontent.com/7462936/234656915-0b203bb7-bd05-4ceb-a38b-67b0d4862aa7.png)

Using cprofile stats:

```bash
python -c "import pstats; stats=pstats.Stats('/tmp/fsdp_cprofile_8wa9uw39.stats'); stats.print_stats()"
...
ncalls  tottime  	percall  	cumtime  	percall  filename:lineno(function)
1   	20.159   	20.159   	66.805   	66.805 	 torch/distributed/fsdp/sharded_grad_scaler.py:151(_foreach_non_finite_check_and_unscale_cpu_)
160004  18.427    	0.000   	18.427    	0.000 	 {built-in method torch.isinf}
160004  6.026    	0.000    	6.026    	0.000 	 {built-in method torch.isnan}
```
We see that a single step of the scaler runs for more than a minute. Though there is non-trivial cProfile overhead, we can infer from this that per-element op dispatches/executions are on the order of 100 ns.

On the order of 100 nanoseconds per dispatch is acceptable if we're using typical tensor access patterns, but if we're dispatching each element for each op, obviously everything is going to come to a grinding halt for many practical use cases.

(Given the cost of this function is currently O(n) in the number of gradient elements, feel free to set `TORCH_SHOW_DISPATCH_TRACE=1` if you want to make this function cry 🤣)

I've attached a flamegraph at the bottom of the PR[^2] that more intuitively demonstrates the manner and extent of resource consumption attributable to this function with just a modest number of gradient elements.
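
For intuition, here is a toy sketch (not the actual `sharded_grad_scaler.py` code) of the difference between per-element dispatches and one fused check per gradient tensor:

```python
import torch

grads = [torch.randn(8), torch.randn(4)]  # stand-ins for CPU-offloaded sharded grads

# Before (in spirit): per-element checks, i.e. one isinf/isnan/any dispatch per scalar.
found_inf_slow = any(
    bool(torch.isinf(v).any() or torch.isnan(v).any())
    for g in grads
    for v in g  # iterating a tensor yields 0-d tensors, each triggering its own dispatch
)

# After (in spirit): one fused check per gradient tensor, O(#tensors) dispatches.
found_inf_fast = any(
    bool(torch.logical_or(torch.isinf(g), torch.isnan(g)).any()) for g in grads
)
assert found_inf_slow == found_inf_fast
```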

### After the loop refactor in this PR:

The most relevant call subgraph:
![callgrind_subgraph_elementwise_dispatch_fix](https://user-images.githubusercontent.com/7462936/234657001-0a448756-b4ce-468e-9f91-1d21597df057.png)

Note that:

1. Less than 0.4% of the relative (inclusive) instruction reads for the entire test in `callgrind` are executed by the `isnan` (4 execs), `isinf` (4 execs) and `any` (8 execs) calls (versus ~50% and 320008, 320008, 640016 respectively above)
2. The `any` pre-dispatch entry point IRs (`torch::autograd::THPVariable_any`) vs actual op implementation IRs (`at::native::structured_any_all_out::impl`) reflect far less overhead (of secondary importance to item number 1)
![THPVariable_any_op_elementwise_dispatch_absolute_IR_fix](https://user-images.githubusercontent.com/7462936/234659454-b1e262cf-d291-4d44-aff2-e27efe284e9c.png)
![structured_any_all_out_impl_absolute_IR_fix](https://user-images.githubusercontent.com/7462936/234657154-91fa7cb8-e39e-48c7-abf0-cc58f06c0ae1.png)

Using cprofile stats:

```bash
python -c "import pstats; stats=pstats.Stats('/tmp/fsdp_cprofile_pfap7nwk.stats'); stats.print_stats()"
...
ncalls  tottime  	percall  	cumtime  	percall  	filename:lineno(function)
1    	0.013    	0.013    	0.109    	0.109 		torch/distributed/fsdp/sharded_grad_scaler.py:151(_foreach_non_finite_check_and_unscale_cpu_)
2    	0.022    	0.011    	0.022    	0.011 		{built-in method torch.isinf}
2    	0.018    	0.009    	0.018    	0.009 		{built-in method torch.isnan}
```
We can see our function runtime has dropped from more than a minute to ~100ms.

### Assumptions associated with this loop refactor:

The key assumptions here are:

1. The grads are always on CPU in this function so any MTA-safe constraints ([`can_use_fast_route`](efc3887ea5/aten/src/ATen/native/cuda/AmpKernels.cu (L110-L111)) relating to the relevant CUDA kernel path selection, i.e. slower `TensorIterator` gpu kernel vs `multi_tensor_apply_kernel`) do not apply in this context

2. We've already filtered by dtype and device and can assume the presence of a single CPU device. Unless one manually creates separate CPU devices with non-default indexes (which I don't think FSDP supports and which should be validated prior to this function), device equality should always be `True` for `cpu`-type devices, so we just need to check that the current device is of `cpu` type. [^4].

![elementwise_dispatch](https://user-images.githubusercontent.com/7462936/234660413-8c96ef90-7a23-4307-b8ed-c1fbf932f1e9.svg)

[^1]: `TestShardedGradScalerParityWithDDP.test_fsdp_ddp_parity_with_grad_scaler_offload_true_none_mixed_precision_use_orig_params` test in `test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py`
[^2]: Note the native frame stacks for `torch::autograd::THPVariable_isinf`, `torch::autograd::THPVariable_isnan`, `torch::autograd::THPVariable_any` in particular.
[^3]: There's more `TensorIterator` etc. setup overhead further up the stack beyond `structured_any_all_out`, but roughly speaking
[^4]: Device equality is based on [type and index combination](efc3887ea5/c10/core/Device.h (L47-L51)), CPU device type is -1 by default (`None` on the python side) and is intended to [always be 0](cf21240f67/c10/core/Device.h (L29)) if set explicitly. Though technically, unless in debug mode, this constraint isn't [actually validated](bb4e9e9124/c10/core/Device.h (L171-L184)), so one can actually manually create separate `cpu` devices with invalid indices. I suspect it's safe to ignore that potential incorrect/unusual configuration in this context but let me know if you'd like to add another `cpu` device equality check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100108
Approved by: https://github.com/awgu
2023-04-27 21:33:27 +00:00
AllenTiTaiWang
1dba53cbab [ONNX] Refactor test_op_consistenct.py and test_fx_op_consistency.py (#100172)
## Summary
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 9255aa3</samp>

This pull request refactors the ONNX operator testing code to use a common module `./test/onnx/onnx_test_common.py` that defines constants, types, classes, and functions for testing ONNX operators. This improves the code quality, readability, and maintainability.

## Walkthrough
### <samp>🤖 Generated by Copilot at 9255aa3</samp>

*  Refactor the common code for testing ONNX operators from different files into `./test/onnx/onnx_test_common.py` ([link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-1b38383dc1a0228a835d83bb7c4ba2d0c1bcd41297be5c6336572c525846166eL10-R24), [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-1b38383dc1a0228a835d83bb7c4ba2d0c1bcd41297be5c6336572c525846166eR33), [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-1b38383dc1a0228a835d83bb7c4ba2d0c1bcd41297be5c6336572c525846166eR367-R623))
* Remove the unused and duplicated imports, constants, types, and classes for testing ONNX operators from `./test/onnx/test_fx_op_consistency.py` and `./test/onnx/test_op_consistency.py` ([link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-db2f78a51511bb172cbfde1b2f68272b8b33049abe2571cded27bcd0f3ae5fa4L28-R29), [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-db2f78a51511bb172cbfde1b2f68272b8b33049abe2571cded27bcd0f3ae5fa4L43-R42), [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-e968c9cb6fc6631cab526cb3a9fe66358c4c6e757e2a223a224b976471bcb753L28-R29), [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-e968c9cb6fc6631cab526cb3a9fe66358c4c6e757e2a223a224b976471bcb753L43-R44))
* Import the `unittest`, `opinfo_core`, and `onnx_test_common` modules and the `fixme`, `skip`, and `xfail` functions in `./test/onnx/test_fx_op_consistency.py` and `./test/onnx/test_op_consistency.py` ( [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-db2f78a51511bb172cbfde1b2f68272b8b33049abe2571cded27bcd0f3ae5fa4R36), [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-e968c9cb6fc6631cab526cb3a9fe66358c4c6e757e2a223a224b976471bcb753L37-R37))
* Update the references to the constants, types, functions, and classes for testing ONNX operators in `./test/onnx/test_fx_op_consistency.py` and `./test/onnx/test_op_consistency.py` to use the definitions from `./test/onnx/onnx_test_common.py` ([link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-db2f78a51511bb172cbfde1b2f68272b8b33049abe2571cded27bcd0f3ae5fa4L324-R80), [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-db2f78a51511bb172cbfde1b2f68272b8b33049abe2571cded27bcd0f3ae5fa4L389-R135), [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-db2f78a51511bb172cbfde1b2f68272b8b33049abe2571cded27bcd0f3ae5fa4L405-R151), [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-db2f78a51511bb172cbfde1b2f68272b8b33049abe2571cded27bcd0f3ae5fa4L455-R204), [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-e968c9cb6fc6631cab526cb3a9fe66358c4c6e757e2a223a224b976471bcb753L333-R107), [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-e968c9cb6fc6631cab526cb3a9fe66358c4c6e757e2a223a224b976471bcb753L434-R183), [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-e968c9cb6fc6631cab526cb3a9fe66358c4c6e757e2a223a224b976471bcb753L448-R197), [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-e968c9cb6fc6631cab526cb3a9fe66358c4c6e757e2a223a224b976471bcb753L494-R246))
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100172
Approved by: https://github.com/justinchuby
2023-04-27 21:32:04 +00:00
ydwu4
61917a006d Make DimConstraints create actionable message (#100103)
This PR makes the summary of dimension constraints actionable. Before the PR, it would print:
```
torch.fx.experimental.symbolic_shapes: [WARNING] Summary of dimension constraints:
The following dimensions have been specialized and CANNOT be dynamic.
NOTE: Specializations will happen by default with `assume_static_by_default=True`.
        L['c'].size()[1] == 3
        L['a'].size()[2] == 3
        L['a'].size()[1] == 3
        L['b'].size()[2] == 2
        L['b'].size()[1] == 2
        L['c'].size()[2] == 3

The following dimensions CAN be dynamic.
You can use the following code to specify the constraints they must satisfy:
'''
constraints=[
        dynamic_dim(L['c'], 0) == dynamic_dim(L['a'], 0),
        2 <= dynamic_dim(L['b'], 0),
        2 <= dynamic_dim(L['a'], 0),
]
'''
```
Users need to initialize the L environment manually and copy the constraints over. After the PR, we have:
```
[2023-04-26 05:43:12,849] torch._dynamo.eval_frame: [WARNING] Summary of dimension constraints:
The following dimensions have been specialized and CANNOT be dynamic.
NOTE: Specializations will happen by default with `assume_static_by_default=True`.
'''
def specializations(a, b, c):
    return (a.size()[2] == 3 and
    c.size()[1] == 3 and
    a.size()[1] == 3 and
    c.size()[2] == 3 and
    b.size()[2] == 2 and
    b.size()[1] == 2)

'''

The following dimensions CAN be dynamic.
You can use the following code to specify the constraints they must satisfy:
'''
def specify_constraints(a, b, c):
    return [
        2 <= dynamic_dim(b, 0),
        dynamic_dim(c, 0) == dynamic_dim(a, 0),
        2 <= dynamic_dim(a, 0),
    ]
'''
```

Here, `specify_constraints` has the same input signature as the user's code. This allows users to copy-paste and run the code to generate the constraints before exporting, as shown below:
```
def specify_constraints(a, b, c):
    return [
        2 <= dynamic_dim(b, 0),
        dynamic_dim(c, 0) == dynamic_dim(a, 0),
        2 <= dynamic_dim(a, 0),
    ]
torch._dynamo.export(my_dyn_fn, x, y, z, constraints=specify_constraints(x, y, z))
```

Implementation-wise, this PR also:
1. changes shape_env.produce_guards to produce_guards_and_constraints,
2. adds contraints_export_fn hooks.
The purpose is to surface the DimConstraints to dynamo.export, where we can reliably get the original function's signature.

The alternative to the above is to get the function signature before creating SHAPE_ENV guard (https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/output_graph.py#L227) and pass it to DimConstraints, but I couldn't recover the signature before creating SHAPE_ENV because the frame's f_globals/locals don't contain the original function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100103
Approved by: https://github.com/guangy10, https://github.com/tugsbayasgalan
2023-04-27 21:24:18 +00:00
Zain Rizvi
d5f15d3515 Check for debug mode (#92707)
It works by validating that debug builds actually trigger debug-level asserts.

It turns out most of our debug jobs today don't actually build in debug mode (causing the test to fail). The PR also fixes that.

Contributes to https://github.com/pytorch/pytorch/issues/88842
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92707
Approved by: https://github.com/malfet, https://github.com/albanD
2023-04-27 20:57:18 +00:00
wbigat
b02aa5e71d [Feature] storage resize_ support custom device. (#99882)
Fixes #99326

Support storage resize_ for custom devices by calling dispatched tensor operations.

@ezyang this PR is another case that was brought up in issue #99326; please take a moment to review this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99882
Approved by: https://github.com/ezyang
2023-04-27 20:18:35 +00:00
Brian Hirsh
9834358e0f Get SchemaCheckMode to error on ops that return inputs directly. Expose as a dynamo backend, eager_debug (#99744)
Talked to @zou3519 and @ezyang on what the right UX is: tentatively, adding a new dynamo backend is cheap and simple, so it seems worth doing. And longer term, we agreed (?) that it's worth seeing if we can get custom ops sanity asserts to run more automatically, instead of needing a separate backend.

Side comment: that actually seems tough: the mode detects secret mutations by cloning every input to every op, running the op, and checking that the data matches between the real input and the cloned input. So I doubt we'll be able to make that behavior always-on? It would need some config at least.
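
A minimal usage sketch, assuming the backend is registered under the name given in the title (`eager_debug`):

```python
import torch

def f(x):
    return x.sin() + 1

# Intended to run eagerly while SchemaCheckMode sanity-checks each op.
checked = torch.compile(f, backend="eager_debug")
print(checked(torch.randn(4)))
```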

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99744
Approved by: https://github.com/albanD, https://github.com/ezyang, https://github.com/zou3519
2023-04-27 20:12:42 +00:00
Brian Hirsh
1f2d00e537 move SchemaCheckMode to torch/_subclasses (#99743)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99743
Approved by: https://github.com/albanD
2023-04-27 20:12:41 +00:00
Jason Ansel
884c5c86f1 Pass torch.compile mode/options to all backends (#99645)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99645
Approved by: https://github.com/anijain2305
2023-04-27 19:41:26 +00:00
AllenTiTaiWang
7295ab6746 [ONNX] Add test_fx_op_consistency.py (#99465)
Add op consistency test for fx exporter. There will be another PR to work around the limitations of https://github.com/pytorch/pytorch/issues/99534
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99465
Approved by: https://github.com/justinchuby
2023-04-27 19:39:32 +00:00
Angela Yi
d06b93b0c7 Decompose arange.default to arange.start_step (#99739)
The aten op arange.default is not in the core aten IR, and should decompose into the arange.start_step op.
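
A quick illustration of the equivalence the decomposition relies on:

```python
import torch

a = torch.ops.aten.arange.default(5)
b = torch.ops.aten.arange.start_step(0, 5, 1)  # start=0, end=5, step=1
assert torch.equal(a, b)
```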
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99739
Approved by: https://github.com/SherlockNoMad
2023-04-27 19:06:36 +00:00
Li-Huai (Allan) Lin
a67fa845bd [vmap] Fix searchsorted batch rule (#99698)
Hopefully there are no other missed cases.

Fixes #99603
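
A minimal example of the kind of batched call the rule needs to handle (illustrative, not the PR's regression test):

```python
import torch

sorted_rows = torch.sort(torch.randn(4, 10), dim=-1).values  # per-sample sorted sequences
queries = torch.randn(4, 3)
idx = torch.vmap(torch.searchsorted)(sorted_rows, queries)
print(idx.shape)  # torch.Size([4, 3])
```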

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99698
Approved by: https://github.com/kshitij12345, https://github.com/zou3519
2023-04-27 19:03:41 +00:00
Nikita Shulga
991b1c0286 Do not use --extra-index-url in testing wheels (#100183)
Should prevent regressions like the ones reported in https://github.com/pytorch/pytorch/issues/100104 from sneaking in undetected.

Same for `install_triton_wheel.sh` - always use packages from https://download.pytorch.org/whl/

### <samp>🤖 Generated by Copilot at deda821</samp>

> _`pip install` changed_
> _Only use PyTorch nightly_
> _Snowflake packages_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100183
Approved by: https://github.com/kit1980, https://github.com/pmeier
2023-04-27 18:48:02 +00:00
Max Ren
151d76cc23
[quant][pt2e] remove dropout from fx quant
Differential Revision: D45250152
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99935
2023-04-27 11:22:41 -07:00
Nikita Shulga
089b085c32 Optimize periodic jobs (#100182)
Split the existing 4-hour schedule into two 8-hour ones.
Schedule x86 macOS tests every 8 hours and exclude them from leak checks.
Schedule iOS tests every 8 hours and exclude them from leak checks as well.

Remove the iOS Metal job, as it is already covered by the ARM64 MPS job as well as the x86 and arm64 vanilla jobs, and it never caught any regressions in the last 60 days, based on data from running the following query on Rockset:
```sql
SELECT started_at,
      DATE_DIFF(
            'MINUTE',
            PARSE_TIMESTAMP_ISO8601(started_at),
            PARSE_TIMESTAMP_ISO8601(completed_at)
        ) as duration,
    conclusion, name, html_url, torchci_classification
  FROM commons.workflow_job
  WHERE
  workflow_name = 'periodic' and
  name like 'ios-12% % build (default, 1, 1, macos-12)' and
  url like 'https://api.github.com/repos/pytorch/pytorch/%'
  and conclusion = 'failure'
  order by started_at desc, run_id;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100182
Approved by: https://github.com/PaliC, https://github.com/huydhn
2023-04-27 18:07:28 +00:00
Chien-Chin Huang
01de8ee845 [SPMD][Easy] Add time counter in graph_optimization_pass (#99969)
This gives an idea of how expensive the pass is.

Differential Revision: [D45255366](https://our.internmc.facebook.com/intern/diff/D45255366/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99969
Approved by: https://github.com/lessw2020
2023-04-27 17:56:07 +00:00
Rohan Varma
87db02ea38 [DDP] Perform input casting in pre forward (#100131)
This is so that replicate can also cast its inputs, which it currently does not. The next diff will change the replicate pre-hook to support this.

Differential Revision: [D45335179](https://our.internmc.facebook.com/intern/diff/D45335179/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100131
Approved by: https://github.com/zhaojuanmao
2023-04-27 17:34:46 +00:00
Ke Wen
ae0eb2342d [Experimental] Remove store barrier after PG init (#99937)
Store-based barrier is not scalable.
Experimenting to see if removing it breaks any CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99937
Approved by: https://github.com/kumpera, https://github.com/H-Huang
2023-04-27 17:23:10 +00:00
Angela Yi
7bece142a9 [export] Port over const prop pass (#100102)
Stacked on top of https://github.com/pytorch/pytorch/pull/100000
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100102
Approved by: https://github.com/gmagogsfm
2023-04-27 17:06:47 +00:00
Iris
fad2f6edab [PTD][Checkpoint] Upstream fsspec storage read/write to PT (#98387)
Remove sync_files.
Remove single_file_per_rank; we will add it back once we resolve the issue: https://github.com/pytorch/pytorch/issues/98386

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98387
Approved by: https://github.com/fegin
2023-04-27 16:47:28 +00:00
Chien-Chin Huang
b94a0ba5bb [SPMD] Add embedding dense backward prop rule for positional embedding (#100038)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100038
Approved by: https://github.com/mrshenli
2023-04-27 16:31:51 +00:00
pbialecki
8fe91d16b0 Remove CUDA 11.6 note from complex docs (#100118)
Removes note in the complex docs pointing to the CUDA 11.6 wheels introduced in https://github.com/pytorch/pytorch/pull/80363.
Background: this warning was added via https://github.com/pytorch/pytorch/issues/79876 which pointed out a slow compilation time in 11.3. The 11.6 pip wheels were thus recommended but are not built anymore as our current support is 11.7, 11.8 (and 12.1 experimental in nightlies).

The note is confusing users as it doesn't explain why 11.6 is needed.
Reference: https://discuss.pytorch.org/t/complex-numbers-cuda-11-6-documentation-warning/178588/1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100118
Approved by: https://github.com/msaroufim
2023-04-27 16:26:27 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar
02f059c2b7 Add private _export API (#99992)
Differential Revision: D45279206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99992
Approved by: https://github.com/angelayi, https://github.com/gmagogsfm
2023-04-27 16:24:14 +00:00
Larry Liu
f5853342ea [dynamo][numpy] Handle return value being numpy ndarray (#99560)
On top of #95849, this PR handles a special case that arises when dealing with numpy.

Consider the following example:

```
def f(x: torch.Tensor) -> np.ndarray:
	a = x.numpy()
	return a.T
```
In the previous PR this would error out because we translate `a.T` into a call on `torch_np.ndarray.T`, which is also a `torch_np.ndarray`.

This PR handles this case, by conditionally converting a `torch_np.ndarray` to `np.ndarray` before returning, to match the original behavior.

The compiled version will be:

```
def f(x):
    ___tmp_0 = __compiled_fn_0(x)
    if isinstance(___tmp_0, torch_np.ndarray):
        return ___tmp_0.tensor.numpy()
    else:
        return ___tmp_0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99560
Approved by: https://github.com/jansel, https://github.com/yanboliang
2023-04-27 16:18:35 +00:00
Larry Liu
687afeb686 [dynamo][numpy] Add NumpyTensorVariable to translate ndarray attribute calls to tensor attributes (#95849)
Issue: #93684

# Problem

Reduce graph breaks when dynamo compiles python functions containing numpy functions and ndarray operations.

# Design (as I know it)

* Use torch_np.ndarray (a wrapper of tensor) to back a `VariableTracker`: `NumpyTensorVariable`.
* Translate all attribute and method calls on ndarray to the torch_np.ndarray equivalent.

This PR adds `NumpyTensorVariable` and supports:
1.  tensor to ndarray, ndarray to tensor
2. numpy functions such as numpy.meshgrid()
3. ndarray attributes such as `itemsize`, `stride`

Next PR will handle returning `np.ndarray` and add support for ndarray methods
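
A hypothetical example (not from the PR) of the mixed torch/NumPy code this targets:

```python
import numpy as np
import torch

def f(x: torch.Tensor):
    a = x.numpy()                 # tensor -> ndarray
    g = np.meshgrid(a, a)[0]      # a numpy function applied to the ndarray
    return g.itemsize, g.T.shape  # ndarray attribute accesses

# With NumpyTensorVariable these accesses are meant to trace with fewer graph breaks.
print(torch.compile(f)(torch.arange(3.0)))
```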
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95849
Approved by: https://github.com/ezyang
2023-04-27 16:18:35 +00:00
Yanbo Liang
d855b6aed6 [Dynamo] Add unit test for explicitly calling __call__ (#100146)
@wconstab As we discussed last Friday, I added the unit test for explicitly calling __call__ and added a comment to explain why we redirect ```UserMethodVariable.call_function``` to ```NNModuleVariable.call_method``` for a certain case.
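
Roughly the scenario under test (a sketch, not the actual unit test):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

m = M()
fn = torch.compile(lambda x: m.__call__(x))  # explicit __call__ instead of m(x)
print(fn(torch.ones(2)))                     # expected to trace the same as m(x)
```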

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100146
Approved by: https://github.com/wconstab
2023-04-27 15:47:11 +00:00
Li-Huai (Allan) Lin
cb569dbccd Fix cat forward-AD tests (#99596)
Fixes #94115

Not sure where to add the test. There is an existing sample input, but it apparently doesn't fail any test.

6580b160d3/torch/testing/_internal/common_methods_invocations.py (L2043)

Edited: Found the skipper and xfailed some failures, which are pre-existing and unrelated to the fix in question. (Those failures are from the gradgrad check, while the fix is for forward-AD.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99596
Approved by: https://github.com/soulitzer
2023-04-27 15:21:26 +00:00
Peter Bell
659dcc5e71 [inductor] Fix argmin/max with duplicate values (#99920)
Fixes #99879

This adds `minimum_with_index` helper functions to compute the minimum
value and index simultaneously, with a preference for the smaller
index which is required to match eager in case of duplicates.
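
For reference, the eager tie-breaking the helper has to reproduce (illustrative):

```python
import torch

x = torch.tensor([[3.0, 1.0, 1.0, 2.0]])
print(torch.argmin(x, 1))                              # tensor([1]) -- smaller index wins among duplicate minima
print(torch.compile(lambda t: torch.argmin(t, 1))(x))  # with this fix the compiled result should match eager
```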

I also replace the mask-and-sum hack with a `tl.reduce` using the previously mentioned helper. This additionally fixes the indices being added together in the case of duplicates.

As an example, this is the kernel generated for `torch.argmin(x, 1)`:
```python
def triton_(in_ptr0, out_ptr0, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr):
    xnumel = 1028 # dynamic_shapes=False
    rnumel = 1028 # dynamic_shapes=False
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    rbase = tl.arange(0, RBLOCK)[None, :]
    x0 = xindex
    _tmp1 = tl.full([XBLOCK, RBLOCK], float("inf"), tl.float32)
    _tmp1_index = tl.full([XBLOCK, RBLOCK], 9223372036854775807, tl.int64)
    for roffset in range(0, rnumel, RBLOCK):
        rindex = roffset + rbase
        rmask = rindex < rnumel
        r1 = rindex
        tmp0 = tl.load(in_ptr0 + (r1 + (1028*x0)), rmask & xmask, eviction_policy='evict_last', other=0)
        _tmp1_next, _tmp1_index_next = triton_helpers.minimum_with_index(
            _tmp1, _tmp1_index, tmp0, rindex
        )
        _tmp1 = tl.where(rmask & xmask, _tmp1_next, _tmp1)
        _tmp1_index = tl.where(rmask & xmask, _tmp1_index_next, _tmp1_index)
    _, tmp1_tmp = triton_helpers.min_with_index(_tmp1, _tmp1_index, 1)
    tmp1 = tmp1_tmp[:, None]
    tl.store(out_ptr0 + x0, tmp1, xmask)
```

Or for a persistent reduction, it generates:
```python
    tmp0 = tl.load(in_ptr0 + (r1 + (1024*x0)), rmask & xmask, other=0)
    tmp2 = tl.where(rmask & xmask, tmp0, float("inf"))
    tmp3 = tl.broadcast_to(rindex, tmp2.shape)
    _, tmp4_tmp = triton_helpers.min_with_index(tmp2, tmp3, 1)
    tmp4 = tmp4_tmp[:, None]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99920
Approved by: https://github.com/ngimel
2023-04-27 15:10:50 +00:00
Peter Bell
f9c3fcd1df [inductor] Fix nan-handling of max and min reductions (#99881)
This adds helpers that replace Triton's `minimum`, `maximum`, `min` and `max` with the correct NaN propagation. I also removed `ops.int_minimum` in favor of `ops.minimum` because we can just omit the nan-checks by checking the dtype.
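
The eager NaN behavior the helpers reproduce (illustrative):

```python
import torch

x = torch.tensor([1.0, float("nan"), 3.0])
print(x.max())                              # tensor(nan) -- eager max propagates NaN
print(torch.compile(lambda t: t.max())(x))  # the compiled reduction should match eager
```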

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99881
Approved by: https://github.com/ngimel
2023-04-27 15:10:50 +00:00
Peter Bell
ed2eb13d76 [inductor] Create triton_helpers module for helper functions (#99880)
This changes codegen of `torch.prod` from:
```python
   tl.reduce(tmp2, 1, _prod_accumulate)[:, None]
```
where `_prod_accumulate` is defined elsewhere, to

```python
   triton_helpers.prod(tmp2, 1)[:, None]
```

A quirk I uncovered though is that `TritonCodeCache` breaks if you
define any new symbol beginning with `triton_`, since it assumes that
must be the kernel name. Instead, I've made the kernel name an
explicit argument to `async_compile.triton` so it doesn't have to guess.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99880
Approved by: https://github.com/ngimel
2023-04-27 15:10:50 +00:00
Rodrigo Kumpera
ad21890f8f [c10d] Scalable PG initiation. (#99931)
Add use_local_synchronization argument to new_group.

When this argument is True, it changes new_group to do a store_barrier only on the ranks that are part of the group and not the whole cluster.

This addresses both scalability and composability problems associated with new_group.

Fixes #81291.
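
A minimal sketch of the new flag (assumes ranks are launched with e.g. `torchrun --nproc_per_node=2`):

```python
import torch.distributed as dist

dist.init_process_group(backend="gloo")
# Only the member ranks perform the store barrier while creating the subgroup:
subgroup = dist.new_group(ranks=[0, 1], use_local_synchronization=True)
dist.destroy_process_group()
```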

This is relanding #84224
As part of the original PR I did a quick benchmark of creating 3 PGs per rank using both functions and perf is the following:

new_group use_local_synchronization=False:
| World Size | Time (in secs) |
| --- | ----------- |
| 4 | 0.12 |
| 8 | 0.25 |
| 16 | 0.51 |
| 32 | 0.87 |
| 64 | 1.50 |
| 128 | 2.87 |

new_group use_local_synchronization=True:
| World Size | Time (in secs) |
| --- | ----------- |
| 4 | 0.05 |
| 8 | 0.04 |
| 16 | 0.03 |
| 32 | 0.03 |
| 64 | 0.04 |
| 128 | 0.04 |

Scaling for `use_local_synchronization=False` is sublinear because the number of process groups created as a multiple of world_size decreases as we go up. It's 6 with world_size 4 and 192 with world_size 128.

Scaling for `use_local_synchronization=True` is constant as the number of store barriers executed per rank remains constant at 3.

Setup:

1 AWS host, backend gloo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99931
Approved by: https://github.com/xw285cornell
2023-04-27 13:44:02 +00:00
Nikita Vedeneev
2eab5abb50 sparse.sum backward: short circuit on zero/empty grad (#98838)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98838
Approved by: https://github.com/pearu
2023-04-27 12:06:11 +00:00
Edward Z. Yang
67e0913de9 Add support for serializing real tensor data in after aot minifier (#99834)
The new minifier script looks like this:

```
import torch._dynamo.repro.after_aot
reader = torch._dynamo.repro.after_aot.InputReader(save_dir='/tmp/tmpcsngx39e')
buf0 = reader.storage('e2b39c716c0d4efb9fa57375a3902b9dab666893', 16)
t0 = reader.tensor(buf0, (4,))
args = [t0]
mod = make_fx(Repro(), tracing_mode='real')(*args)
```

The real tensor data is stored in the storages folder of the checkpoint dump directory. If you delete this folder or it is otherwise missing, we will transparently fall back to generating random data like before. The tensors are serialized using the content store from #99809, which means each storage is content-addressed and we will automatically deduplicate equivalent data (which is useful if you keep dumping out, e.g., your parameters). We don't use the tensor serialization capability from the content store; instead, all of the tensor metadata is stored inline inside the repro script (so that everything is in one file if you lose the checkpointed tensors).

We also add a stable_hash option to the content store, where we use a slow SHA-1 sum over the data on the CPU side to compute a hash that is stable across systems with the same endianness.

Out of rage, I also added support for Dtype.itemsize property access.
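
For example (assuming the property landed as described):

```python
import torch

print(torch.float32.itemsize)  # 4
print(torch.float16.itemsize)  # 2
```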

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99834
Approved by: https://github.com/voznesenskym
2023-04-27 11:52:13 +00:00
Nikita Vedeneev
5cfaea15c4 relu/threshold backward for sparse: enable 0-nnz grads (#98935)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98935
Approved by: https://github.com/pearu
2023-04-27 10:57:05 +00:00
Sergii Dymchenko
c2402a9257 Change caffe2 branch links to main (#100129)
Just a change

pytorch/tree/master -> pytorch/tree/main
pytorch/blob/master -> pytorch/blob/main
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100129
Approved by: https://github.com/huydhn
2023-04-27 10:31:50 +00:00
Jiong Gong
77a37a54ce Include all mkl/mkldnn related test files to CPU ATen backend (#99592)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99592
Approved by: https://github.com/kit1980
2023-04-27 10:26:01 +00:00
Will Constable
100a25d021 Basic dynamo support for traceable collectives (#94440)
Make traceable collectives work with torchdynamo,
bypassing problems with tracing the AsyncTensor subclass.

Accept a suboptimal solution for now, and optimize it later.
For now, wait happens immediately, which generally forces an early sync.

Later, find a way either in dynamo or AOT stack to handle
AsyncCollectiveTensor to get the wait in the optimal place.

Note on implementation:
- Dynamo traces 'user-level' fc APIs that are designed to behave differently
  in eager vs compiled. In eager, there will be work-obj registration and
  a wrapper subclass will insert a 'wait' call at the appropriate time.
  In compile/trace mode, wait will be immediately called, and work-obj
  registration is required to be handled by the compile backend at runtime.
- Dynamo needs to trace into some of the helper functions in the 'user-level'
  API, such as '_expand_group', which is essentially a constant transformation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94440
Approved by: https://github.com/kumpera
2023-04-27 05:38:36 +00:00