Previously, the change to aten/src/ATen/native/LossNLL.cpp eventually resulted in a double / SymInt division, which ended up calling the int64_t / SymInt overload and truncating the double (bad!). By adding overloads for all the int/float types, we prevent this situation from happening in the future.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100008
Approved by: https://github.com/albanD
This is a mirror PR of D45339293
Summary:
These tests cause the following errors internally, for reasons that are not yet understood:
```
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_adam'
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_adamw'
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_sgd'
```
Commenting these tests out to unblock other PRs.
Test Plan: Sandcastle
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100215
Approved by: https://github.com/wz337, https://github.com/fduwjj
There's a longstanding, well-known mutability bug in dynamo: https://github.com/pytorch/pytorch/issues/93610 (and more issues, but this is the one I had at hand).
Ops that do in place mutation of tensors will mutate their corresponding FakeTensors.
So, for example, if you call `t_` on a tensor, you will reverse its strides. This, in turn, means that the FakeTensor's strides are now also reversed, say, if you are trying to torch.compile:
```
class F(torch.nn.Module):
    def forward(self, x, y):
        x = x.t_()
        y = y.t_()
        return (x + y,)
```
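For reference, the metadata mutation itself is easy to reproduce standalone in eager mode (illustration only, not part of the PR):
```python
import torch

x = torch.randn(3, 3)
print(x.stride())  # (3, 1)
x.t_()             # in-place transpose mutates the tensor's metadata
print(x.stride())  # (1, 3) -- the same object now reports swapped strides
```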
However, we recently introduced accessing the fake_tensor memo/cache to get the symbolic shape values for sizes and strides during guard installation time.
This means that a tensor captured with a given size and stride, say, for x above, size (3, 3) and stride (3, 1), will get its memo updated to size (3, 3), stride (1, 3). Now, whenever you access this value for anything, it reflects the tensor's current state in the tracing, as opposed to its state when we started tracing.
This causes us to produce guards that are never valid; for the example above, a guard that `x.stride()[0] == 3`.
The solution is to not allow mutation to affect the fake tensors we use as source of truth here. We can do this by forcing a clone of the fake tensor at builder time, and storing that as the source of truth for our dynamic sizes and strides during guard installation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100128
Approved by: https://github.com/ezyang
When use_orig_params is True and the sharding strategy is NO_SHARD, parameters and states are not flattened, so optimizer states should not be flattened either. The unit test fails without this fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100189
Approved by: https://github.com/awgu
The input tensor of the RNN forward must have the same type as the weights.
When passing a tensor of type long, the error is:
`RuntimeError: expected scalar type Long but found Float`
which is misleading, because it suggests converting something to Long, while the correct solution is to convert the input to Float (the type of the weights).
The new error:
`RuntimeError: input must have the type torch.float32, got type torch.int64`
is correct and more informative.
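For illustration, a minimal snippet that exercises this path (the shapes and dtypes here are just an example):
```python
import torch

rnn = torch.nn.RNN(input_size=4, hidden_size=8)   # weights are float32 by default
x = torch.ones(5, 3, 4, dtype=torch.long)         # long input does not match the weights
try:
    rnn(x)
except RuntimeError as e:
    print(e)        # with this change, the message reports the input dtype and the expected one
out, h = rnn(x.float())  # the actual fix: cast the input to the weights' dtype
```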
Fixes #99998
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100100
Approved by: https://github.com/drisspg
Fixes [#82206](https://github.com/pytorch/pytorch/issues/82206)
When executing a `ShardedGradScaler` step in the context of `cpu_offload`, [the function](ecd2c71871/torch/distributed/fsdp/sharded_grad_scaler.py (L151-L152)) `_foreach_non_finite_check_and_unscale_cpu_` is grindingly slow. This issue is due to the elementwise op dispatching/redispatching/execution that is engendered by the current approach to gradient tensor validation:
ecd2c71871/torch/distributed/fsdp/sharded_grad_scaler.py (L159-L163)
The subsequent `isinf` and `isnan` checks with associated `any` checks result in unscalable elementwise op dispatches:
ecd2c71871/torch/distributed/fsdp/sharded_grad_scaler.py (L173-L181)
This inefficiency is of course hidden in the current FSDP tests given their (appropriately) trivial parameter dimensionality. In the perf analysis below, the example test configures only the final `Linear(4, 8)` module parameters to require grad, so there are 40 elements to iterate through. However, if one increases the dimensionality to a still-modest 320008 elements (changing the final module to `Linear(40000,8)`), the execution time/cpu cost of the test is dominated by the elementwise op dispatching/redispatching/execution of the `any` validation ops in this function.
To characterize the current behavior, I use a slightly modified version of an existing `ShardedGradScaler` test [^1]. The following modifications to the test are made to allow the analysis:
1. Run just `CUDAInitMode.CUDA_BEFORE` for clarity instead of additional scenarios
2. Increase the final module to `Linear(40000, 8)` (along with modifying the preceding module to make the dimensions work)
3. For the cProfile run (but not valgrind or perf) the test runs just a single [`_train_for_several_steps`](ecd2c71871/torch/testing/_internal/common_fsdp.py (L926-L934)) step per rank (instead of 2 steps)
4. I temporarily reduce `init_scale` further to ensure we don't hit any `infs`, short-circuiting our analysis
### Current behavior
The most relevant call subgraph:

Note that:
1. Instead of dispatching to the relevant autograd op and then redispatching to the relevant CPU op implementation 8 times per test (2 train steps x 2 `any` calls per parameter per step x 2 orig parameters), we (I believe unnecessarily) go through the relevant dispatch flow elementwise, so 640016 times! (Only 1 node in this trace, so 320008 elements/2 x 2 train steps x 2 calls per element per step.)
2. Nearly 50% of the relative (inclusive) instruction reads for the entire test in `callgrind` are executed by the `isnan` (320008 execs), `isinf` (320008 execs) and `any` (640016 execs) calls.
3. The `any` pre-dispatch entry point IRs (`torch::autograd::THPVariable_any`) vs actual op implementation IRs (`at::native::structured_any_all_out::impl`) are below to give one a sense of the relative dispatch and op execution cost in an elementwise context[^3].


Using cprofile stats:
```bash
python -c "import pstats; stats=pstats.Stats('/tmp/fsdp_cprofile_8wa9uw39.stats'); stats.print_stats()"
...
ncalls tottime percall cumtime percall filename:lineno(function)
1 20.159 20.159 66.805 66.805 torch/distributed/fsdp/sharded_grad_scaler.py:151(_foreach_non_finite_check_and_unscale_cpu_)
160004 18.427 0.000 18.427 0.000 {built-in method torch.isinf}
160004 6.026 0.000 6.026 0.000 {built-in method torch.isnan}
```
We see that a single step of the scaler runs for more than a minute. Though there is non-trivial cprofile overhead, we can infer from this that per-element op dispatches/executions are on the order of a 100ns.
On the order of 100 nanoseconds per dispatch is acceptable if we're using typical tensor access patterns, but if we're dispatching each element for each op, obviously everything is going to come to a grinding halt for many practical use cases.
(Given the cost of this function is currently O(n) in the number of gradient elements, feel free to set `TORCH_SHOW_DISPATCH_TRACE=1` if you want to make this function cry 🤣)
I've attached a flamegraph at the bottom of the PR[^2] that more intuitively demonstrates the manner and extent of resource consumption attributable to this function with just a modest number of gradient elements.
### After the loop refactor in this PR:
The most relevant call subgraph:

Note that:
1. Less than 0.4% of the relative (inclusive) instruction reads for the entire test in `callgrind` are executed by the `isnan` (4 execs), `isinf` (4 execs) and `any` (8 execs) calls (versus ~50% and 320008, 320008, 640016 respectively above)
2. The `any` pre-dispatch entry point IRs (`torch::autograd::THPVariable_any`) vs actual op implementation IRs (`at::native::structured_any_all_out::impl`) reflect far less overhead (of secondary importance to item number 1)


Using cprofile stats:
```bash
python -c "import pstats; stats=pstats.Stats('/tmp/fsdp_cprofile_pfap7nwk.stats'); stats.print_stats()"
...
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.013 0.013 0.109 0.109 torch/distributed/fsdp/sharded_grad_scaler.py:151(_foreach_non_finite_check_and_unscale_cpu_)
2 0.022 0.011 0.022 0.011 {built-in method torch.isinf}
2 0.018 0.009 0.018 0.009 {built-in method torch.isnan}
```
We can see our function runtime has dropped from more than a minute to ~100ms.
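To make the shape of the change concrete, here is a simplified before/after sketch of the validation pattern (illustrative only, not the actual FSDP code):
```python
import torch

grads = [torch.randn(1000), torch.randn(8)]

# Before (conceptually): one isinf/isnan/any dispatch per element, i.e. O(n) op dispatches.
found_non_finite_slow = any(
    bool(torch.isinf(g[i]).any() or torch.isnan(g[i]).any())
    for g in grads
    for i in range(g.numel())
)

# After: a handful of tensor-level dispatches per gradient tensor.
found_non_finite_fast = any(
    bool(torch.isinf(g).any() or torch.isnan(g).any()) for g in grads
)
assert found_non_finite_slow == found_non_finite_fast
```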
### Assumptions associated with this loop refactor:
The key assumptions here are:
1. The grads are always on CPU in this function so any MTA-safe constraints ([`can_use_fast_route`](efc3887ea5/aten/src/ATen/native/cuda/AmpKernels.cu (L110-L111)) relating to the relevant CUDA kernel path selection, i.e. slower `TensorIterator` gpu kernel vs `multi_tensor_apply_kernel`) do not apply in this context
2. We've already filtered by dtype and device and can assume the presence of a single CPU device. Unless separate CPU devices with non-default indexes are manually created (which I don't think FSDP supports and which should be validated prior to this function), device equality should always be `True` for `cpu`-type devices, so we should only need to check that the current device is of `cpu` type. [^4].

[^1]: `TestShardedGradScalerParityWithDDP.test_fsdp_ddp_parity_with_grad_scaler_offload_true_none_mixed_precision_use_orig_params` test in `test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py`
[^2]: Note the native frame stacks for `torch::autograd::THPVariable_isinf`, `torch::autograd::THPVariable_isnan`, `torch::autograd::THPVariable_any` in particular.
[^3]: There's more `TensorIterator` etc. setup overhead further up the stack beyond `structured_any_all_out`, but roughly speaking this gives a sense of the relative cost
[^4]: Device equality is based on [type and index combination](efc3887ea5/c10/core/Device.h (L47-L51)), CPU device type is -1 by default (`None` on the python side) and is intended to [always be 0](cf21240f67/c10/core/Device.h (L29)) if set explicitly. Though technically, unless in debug mode, this constraint isn't [actually validated](bb4e9e9124/c10/core/Device.h (L171-L184)), so one can actually manually create separate `cpu` devices with invalid indices. I suspect it's safe to ignore that potential incorrect/unusual configuration in this context but let me know if you'd like to add another `cpu` device equality check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100108
Approved by: https://github.com/awgu
This PR makes the summary of dimension constraints actionable. Before this PR, it would print:
```
torch.fx.experimental.symbolic_shapes: [WARNING] Summary of dimension constraints:
The following dimensions have been specialized and CANNOT be dynamic.
NOTE: Specializations will happen by default with `assume_static_by_default=True`.
L['c'].size()[1] == 3
L['a'].size()[2] == 3
L['a'].size()[1] == 3
L['b'].size()[2] == 2
L['b'].size()[1] == 2
L['c'].size()[2] == 3
The following dimensions CAN be dynamic.
You can use the following code to specify the constraints they must satisfy:
'''
constraints=[
    dynamic_dim(L['c'], 0) == dynamic_dim(L['a'], 0),
    2 <= dynamic_dim(L['b'], 0),
    2 <= dynamic_dim(L['a'], 0),
]
'''
```
Users need to initialize the `L` environment manually and copy the constraints over. After this PR, we have:
```
[2023-04-26 05:43:12,849] torch._dynamo.eval_frame: [WARNING] Summary of dimension constraints:
The following dimensions have been specialized and CANNOT be dynamic.
NOTE: Specializations will happen by default with `assume_static_by_default=True`.
'''
def specializations(a, b, c):
    return (a.size()[2] == 3 and
            c.size()[1] == 3 and
            a.size()[1] == 3 and
            c.size()[2] == 3 and
            b.size()[2] == 2 and
            b.size()[1] == 2)
'''
The following dimensions CAN be dynamic.
You can use the following code to specify the constraints they must satisfy:
'''
def specify_constraints(a, b, c):
    return [
        2 <= dynamic_dim(b, 0),
        dynamic_dim(c, 0) == dynamic_dim(a, 0),
        2 <= dynamic_dim(a, 0),
    ]
'''
```
Here, `specify_constraints` has the same input signature as the user's code. This allows users to copy-paste and run the code to generate the constraints before exporting, as shown below:
```
def specify_constraints(a, b, c):
    return [
        2 <= dynamic_dim(b, 0),
        dynamic_dim(c, 0) == dynamic_dim(a, 0),
        2 <= dynamic_dim(a, 0),
    ]

torch._dynamo.export(my_dyn_fn, x, y, z, constraints=specify_constraints(x, y, z))
```
Implementation-wise, this PR also:
1. changes `shape_env.produce_guards` to `produce_guards_and_constraints`,
2. adds `contraints_export_fn` hooks.
The purpose is to surface the DimConstraints to dynamo.export, where we can reliably get the original function's signature.
The alternative to the above is to get the function signature before creating SHAPE_ENV guard (https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/output_graph.py#L227) and pass it to DimConstraints, but I couldn't recover the signature before creating SHAPE_ENV because the frame's f_globals/locals don't contain the original function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100103
Approved by: https://github.com/guangy10, https://github.com/tugsbayasgalan
Talked to @zou3519 and @ezyang about what the right UX is: tentatively, adding a new dynamo backend is cheap and simple, so it seems worth doing. And longer term, we agreed (?) that it's worth seeing if we can get the custom-ops sanity asserts to run more automatically, instead of needing a separate backend.
Side comment: that actually seems tough: the mode detects secret mutations by cloning every input to every op, running the op, and checking that the data matches between the real input and the cloned input. So I doubt we'll be able to make that behavior always-on? It would need some config at least.
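For concreteness, a rough sketch of that detection strategy (illustrative; not the actual mode's implementation):
```python
import torch

def check_no_hidden_mutation(op, *args):
    # Clone every tensor input, run the op, then verify the originals are untouched.
    clones = [a.clone() if isinstance(a, torch.Tensor) else a for a in args]
    out = op(*args)
    for a, c in zip(args, clones):
        if isinstance(a, torch.Tensor) and not torch.equal(a, c):
            raise AssertionError(f"{op} secretly mutated one of its inputs")
    return out

check_no_hidden_mutation(torch.add, torch.ones(2), torch.ones(2))      # passes
# check_no_hidden_mutation(torch.Tensor.add_, torch.ones(2), torch.ones(2))  # would raise
```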
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99744
Approved by: https://github.com/albanD, https://github.com/ezyang, https://github.com/zou3519
Split the existing 4-hour schedule into two 8-hour ones.
Schedule x86 macOS tests every 8 hours and exclude them from leak
checks.
Schedule iOS tests every 8 hours and exclude them from leak checks as
well.
Remove the iOS Metal job, as its coverage is already provided by the ARM64 MPS job as well
as by the x86 and arm64 vanilla jobs, and as it has not caught any regressions in the
last 60 days, based on data from running the following query on RockSet:
```sql
SELECT started_at,
    DATE_DIFF(
        'MINUTE',
        PARSE_TIMESTAMP_ISO8601(started_at),
        PARSE_TIMESTAMP_ISO8601(completed_at)
    ) as duration,
    conclusion, name, html_url, torchci_classification
FROM commons.workflow_job
WHERE
    workflow_name = 'periodic' and
    name like 'ios-12% % build (default, 1, 1, macos-12)' and
    url like 'https://api.github.com/repos/pytorch/pytorch/%'
    and conclusion = 'failure'
order by started_at desc, run_id;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100182
Approved by: https://github.com/PaliC, https://github.com/huydhn
On top of #95849, this PR handles a special case when dealing with numpy: returning an `np.ndarray` from the compiled function.
Consider the following example:
```
def f(x: torch.Tensor) -> np.ndarray:
    a = x.numpy()
    return a.T
```
With the previous PR this would error out, because we translate `a.T` into a call on `torch_np.ndarray.T`, whose result is also a `torch_np.ndarray`.
This PR handles this case, by conditionally converting a `torch_np.ndarray` to `np.ndarray` before returning, to match the original behavior.
The compiled version will be:
```
def f(x):
    ___tmp_0 = __compiled_fn_0(x)
    if isinstance(___tmp_0, torch_np.ndarray):
        return ___tmp_0.tensor.numpy()
    else:
        return ___tmp_0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99560
Approved by: https://github.com/jansel, https://github.com/yanboliang
Issue: #93684
# Problem
Reduce graph breaks when dynamo compiles python functions containing numpy functions and ndarray operations.
# Design (as I know it)
* Use `torch_np.ndarray` (a wrapper around a tensor) to back a `VariableTracker`: `NumpyTensorVariable`.
* Translate all attribute and method calls on ndarray to their `torch_np.ndarray` equivalents.
This PR adds `NumpyTensorVariable` and supports:
1. tensor to ndarray, ndarray to tensor
2. numpy functions such as numpy.meshgrid()
3. ndarray attributes such as `itemsize`, `stride`
The next PR will handle returning `np.ndarray` and add support for ndarray methods.
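An illustrative (hypothetical) example of the kind of mixed tensor/ndarray code this work targets:
```python
import numpy as np
import torch

def f(x: torch.Tensor) -> torch.Tensor:
    a = x.numpy()                     # 1. tensor -> ndarray
    xs, ys = np.meshgrid(a, a)        # 2. a numpy function on ndarrays
    scale = a.itemsize                # 3. an ndarray attribute
    return torch.from_numpy(xs + ys) * scale

out = torch.compile(f)(torch.arange(4.0))  # hypothetical usage; support may still be partial
```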
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95849
Approved by: https://github.com/ezyang
@wconstab As we discussed last Friday, I added a unit test for explicitly calling `__call__` and added a comment to explain why we redirect `UserMethodVariable.call_function` to `NNModuleVariable.call_method` in a certain case.
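A minimal sketch of the pattern the new unit test exercises (the actual test differs):
```python
import torch

class Mod(torch.nn.Module):
    def forward(self, x):
        return x + 1

def fn(mod, x):
    # Explicitly going through __call__ rather than mod(x); dynamo should
    # handle both spellings the same way.
    return mod.__call__(x)

opt_fn = torch.compile(fn)
print(opt_fn(Mod(), torch.ones(3)))
```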
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100146
Approved by: https://github.com/wconstab
This adds helpers that replace Triton's `minimum`, `maximum`, `min` and
`max` with correct NaN propagation. I also removed
`ops.int_minimum` in favor of `ops.minimum`, because we can simply omit
the NaN checks by checking the dtype.
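A minimal sketch of what such a NaN-propagating helper can look like (assumed name and shape, not necessarily the code that landed):
```python
import triton
import triton.language as tl

@triton.jit
def nan_propagating_minimum(a, b):
    # Propagate NaN from either operand (x != x is True only for NaN),
    # otherwise fall back to Triton's plain minimum.
    return tl.where(a != a, a, tl.where(b != b, b, tl.minimum(a, b)))
```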
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99881
Approved by: https://github.com/ngimel
This changes codegen of `torch.prod` from:
```python
tl.reduce(tmp2, 1, _prod_accumulate)[:, None]
```
where `_prod_accumulate` is defined elsewhere, to
```python
triton_helpers.prod(tmp2, 1)[:, None]
```
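A plausible definition of such a helper, sketched here with assumed names, pairs a combine function with `tl.reduce`:
```python
import triton
import triton.language as tl

@triton.jit
def prod_combine(a, b):
    # combine function used by the reduction
    return a * b

@triton.jit
def prod(input, dim):
    # reduce `input` along `dim` by multiplying elements together
    return tl.reduce(input, dim, prod_combine)
```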
A quirk I uncovered though is that `TritonCodeCache` breaks if you
define any new symbol beginning with `triton_`, since it assumes that
must be the kernel name. Instead, I've made the kernel name an
explicit argument to `async_compile.triton` so it doesn't have to guess.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99880
Approved by: https://github.com/ngimel
Add use_local_synchronization argument to new_group.
When this argument is True, new_group performs a store barrier only on the ranks that are part of the group, not on the whole cluster.
This addresses both the scalability and composability problems associated with new_group.
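Usage is a one-argument change (sketch; assumes the default process group is already initialized):
```python
import torch.distributed as dist

# Only ranks 0 and 1 participate in the store barrier; other ranks are not involved at all.
pg = dist.new_group(ranks=[0, 1], use_local_synchronization=True)
```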
Fixes #81291.
This is relanding #84224
As part of the original PR I did a quick benchmark of creating 3 PGs per rank with both settings, and perf is the following:
new_group use_local_synchronization=False:
| World Size | Time (in secs) |
| --- | ----------- |
| 4 | 0.12 |
| 8 | 0.25 |
| 16 | 0.51 |
| 32 | 0.87 |
| 64 | 1.50 |
| 128 | 2.87 |
new_group use_local_synchronization=True:
| World Size | Time (in secs) |
| --- | ----------- |
| 4 | 0.05 |
| 8 | 0.04 |
| 16 | 0.03 |
| 32 | 0.03 |
| 64 | 0.04 |
| 128 | 0.04 |
Scaling for `use_local_synchronization=False` is sub linear because the number of process groups created as a multiple of world_size decreases as we go up. It's 6 with world_size 4 and 192 with world_size 128.
Scaling for `use_local_synchronization=True` is constant as the number of store barriers executed per rank remains constant at 3.
Setup:
1 AWS host, backend gloo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99931
Approved by: https://github.com/xw285cornell
The new minifier script looks like this:
```
import torch._dynamo.repro.after_aot
reader = torch._dynamo.repro.after_aot.InputReader(save_dir='/tmp/tmpcsngx39e')
buf0 = reader.storage('e2b39c716c0d4efb9fa57375a3902b9dab666893', 16)
t0 = reader.tensor(buf0, (4,))
args = [t0]
mod = make_fx(Repro(), tracing_mode='real')(*args)
```
The real tensor data is stored in the storages folder of the checkpoint dump directory. If you delete this folder / it is otherwise missing, we will transparently fall back to generating random data like before. The tensors are serialized using content store from #99809, which means each storage is content-addressed and we will automatically deduplicate equivalent data (which is useful if you keep dumping out, e.g., your parameters.) We don't use the tensor serialization capability from content store, instead all of the tensor metadata is stored inline inside the repro script (so that everything is in one file if you lose the checkpointed tensors).
We also add a stable_hash option to content store, where we use a slow SHA-1 sum on the data on the CPU side to compute a hash that is stable across systems with the same endianness.
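A rough sketch of what a stable, content-addressed key for tensor data can look like (illustrative; not the actual content-store code):
```python
import hashlib
import torch

def stable_key(t: torch.Tensor) -> str:
    # Hash the raw bytes on the CPU side; stable across runs on same-endianness systems.
    return hashlib.sha1(t.detach().cpu().contiguous().numpy().tobytes()).hexdigest()

print(stable_key(torch.arange(4.0)))
```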
Out of rage, I also added support for Dtype.itemsize property access.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99834
Approved by: https://github.com/voznesenskym
Make traceable collectives work with torchdynamo,
bypassing problems with tracing the AsyncTensor subclass.
Accept a suboptimal solution for now, and optimize it later.
For now, wait happens immediately, which generally forces an early sync.
Later, find a way either in dynamo or AOT stack to handle
AsyncCollectiveTensor to get the wait in the optimal place.
Note on implementation:
- Dynamo traces 'user-level' functional collective APIs that are designed to behave differently
in eager vs compiled mode. In eager mode, there will be work-obj registration and
a wrapper subclass will insert a 'wait' call at the appropriate time.
In compile/trace mode, wait is called immediately, and work-obj
registration is required to be handled by the compile backend at runtime.
- Dynamo needs to trace into some of the helper functions in the 'user-level'
api, such as '_expand_group' which is essentially a constant transformation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94440
Approved by: https://github.com/kumpera