Due to implicit conversion shenanigans, having both IntArrayRef
and SymIntArrayRef overloads makes {} ambiguous. While we could
fix this by making a single unified type that accepts all the overloads
we want, an easier fix was to just push the SymIntArrayRef overload
to its own name.
Signed-off-by: Edward Z. Yang <ezyangfb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79281
Approved by: https://github.com/suo
The fairseq diff is split into two parts.
The first diff (this one)
This diff adds a mask left-align function that checks the mask condition for nested tensors. It is necessary for TorchScript deployment.
The second diff (D37082681)
Forks the inference path inside the forward function. If a checkpoint file has been loaded and inference is being performed, BT is deployed; otherwise, the fairseq path is taken.
Reviewed By: mikekgfb
Differential Revision: D36057338
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79186
Approved by: https://github.com/erichan1
This PR heavily simplifies the code of `linalg.solve`. At the same time,
this implementation saves quite a few copies of the input data in some
cases (e.g. when A is contiguous).
We also implement it in such a way that the derivative goes from
computing two LU decompositions and two LU solves to no LU
decompositions and one LU solve. It also avoids a number of
copies the derivative was unnecessarily performing (at least the copy of
two matrices).
On top of this, we add a `left` kw-only arg that allows the user to
solve `XA = B` rather concisely.
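For readers unfamiliar with the right-hand-side form, here is a minimal pure-Python sketch of the underlying identity `XA = B  <=>  A^T X^T = B^T`. It is an illustration only and is not how this PR implements `linalg.solve`; the helper names (`solve_left`, `solve_right`, `matmul`, `transpose`) are hypothetical, and `A` is assumed to be square and invertible. With the new argument, the equivalent call would presumably be `torch.linalg.solve(A, B, left=False)`.

```python
from fractions import Fraction


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve_left(a, b):
    """Solve A X = B by Gauss-Jordan elimination (A assumed invertible)."""
    n = len(a)
    # Build the augmented matrix [A | B] with exact rational arithmetic.
    aug = [[Fraction(v) for v in ra] + [Fraction(v) for v in rb]
           for ra, rb in zip(a, b)]
    for col in range(n):
        # Find a row with a non-zero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = 1 / aug[col][col]
        aug[col] = [v * inv for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def solve_right(a, b):
    """Solve X A = B by solving A^T X^T = B^T and transposing back."""
    return transpose(solve_left(transpose(a), transpose(b)))


A = [[2, 1], [0, 1]]
B = [[4, 5], [2, 3]]
X = solve_right(A, B)
print(matmul(X, A) == B)
```

The `left` keyword spares the caller from writing the two transposes by hand.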
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74046
Approved by: https://github.com/nikitaved, https://github.com/IvanYashchuk, https://github.com/mruberry
Summary:
- Exception handling was not performed in forked subgraph execution
- The forked subgraph runtime can throw a runtime exception. The future returned by prim::fork needs to capture such exceptions so that aten::wait can handle them.
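The behaviour described above can be sketched by analogy with Python's standard `concurrent.futures`: an exception raised inside the forked task is stored on the future and re-raised when the caller waits on it. This is only an analogy under that assumption, not the Static Runtime code; `forked_subgraph` and `run_and_wait` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def forked_subgraph(x):
    # Stand-in for a forked subgraph that may fail at run time.
    if x < 0:
        raise RuntimeError("negative input in forked subgraph")
    return x * 2


def run_and_wait(x):
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Analogous to prim::fork: the call returns a future immediately.
        future = pool.submit(forked_subgraph, x)
        try:
            # Analogous to aten::wait: the stored exception is re-raised here.
            return future.result()
        except RuntimeError as err:
            return f"caught: {err}"


print(run_and_wait(3))
print(run_and_wait(-1))
```

If the future did not carry the exception, the failure would be lost on the worker thread and the waiting side would have nothing to handle.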
Test Plan:
local test cases:
- buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators
- buck test mode/opt caffe2/benchmarks/static_runtime:static_runtime_cpptest
- buck test mode/opt caffe2/test:static_runtime
Async execution of the subgraph is tested by adding PyTorch profiler hooks to the StaticRuntime execution via the code below. Async execution in the threadpool is verified by checking the trace.
    from torch.profiler import profile, ProfilerActivity
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        static_runtime_module(inputs)
    prof.export_chrome_trace("trace.json")
Differential Revision: D37072493
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79292
Approved by: https://github.com/mikeiovine
Summary:
Add a feature that allows users to specify namespaces for operators and kernels.
# Feature
There's a feature request to allow the DSL to:
1. take in an operator namespace other than `aten`.
2. take in a kernel that is in a namespace other than `at::native`.
For both features, we only allow the user to specify a single layer of namespace for the sake of simplicity. If the user specifies `custom::function` as the kernel, the codegen will depend on `custom::native::function`, where `native` is hardcoded.
# Proposal
For feature 1, add a `namespace` attribute to the data class `NativeFunction`. The namespace will be extracted by matching the pattern "::" on the `func` variable. For `NativeFunctionsGroup` there's an assumption that all variants (function, inplace, out) have the same namespace. By default (if not specified) the namespace will be "aten".
For feature 2, add a `namespace` attribute to the `BackendMetadata` class, and similarly match the pattern "::" on the kernel field. Remove the `cpp_namespace` field from the `register_dispatch_key` data class. By default (if not specified) the namespace for a kernel will be "at::native".
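The parsing rules in the proposal can be sketched roughly as follows. This is a standalone illustration, not the actual codegen: the function names and constants are hypothetical, and the only rules assumed are the ones stated above (match "::", allow a single namespace layer, default to "aten" and "at::native", and hardcode `native` for custom kernels).

```python
DEFAULT_OP_NAMESPACE = "aten"
DEFAULT_KERNEL_NAMESPACE = "at::native"


def split_op_namespace(func):
    """Return (namespace, schema) for a `func` entry such as 'custom::gelu(...)'."""
    # Only the operator name (before the argument list) may carry a namespace.
    name, paren, rest = func.partition("(")
    if "::" not in name:
        return DEFAULT_OP_NAMESPACE, func
    namespace, _, op_name = name.partition("::")
    if "::" in op_name:
        raise ValueError("only a single layer of namespace is supported")
    return namespace, op_name + paren + rest


def split_kernel_namespace(kernel):
    """Return (cpp_namespace, kernel_name) for a dispatch entry."""
    if "::" not in kernel:
        return DEFAULT_KERNEL_NAMESPACE, kernel
    namespace, _, kernel_name = kernel.partition("::")
    if "::" in kernel_name:
        raise ValueError("only a single layer of namespace is supported")
    # `native` is hardcoded: custom::function -> custom::native::function
    return f"{namespace}::native", kernel_name


print(split_op_namespace("custom::gelu(Tensor self) -> Tensor"))
print(split_kernel_namespace("custom::gelu_out_cpu"))
print(split_kernel_namespace("gelu_out_cpu"))
```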
Test Plan:
Example yaml entries:
```
- func: custom::gelu.out(Tensor self, *, str approximate='none', Tensor(a!) out) -> Tensor(a!)
  structured: True
  structured_inherits: TensorIteratorBase
  device_check: NoCheck   # TensorIterator
  python_module: nn
  dispatch:
    CPU: custom::gelu_out_cpu
    CUDA: custom::gelu_out_cuda
    MPS: custom::gelu_out_mps
- func: custom::gelu_(Tensor(a!) self, *, str approximate='none') -> Tensor(a!)
  structured_delegate: gelu.out
  device_check: NoCheck   # TensorIterator
  python_module: nn
  dispatch:
    NestedTensorCPU, NestedTensorCUDA: custom::NestedTensor_gelu_
- func: custom::gelu(Tensor self, *, str approximate='none') -> Tensor
  structured_delegate: gelu.out
  device_check: NoCheck   # TensorIterator
  python_module: nn
  dispatch:
    MkldnnCPU: custom::mkldnn_gelu
    QuantizedCPU: custom::gelu_quantized_cpu
    NestedTensorCPU, NestedTensorCUDA: custom::NestedTensor_gelu
```
see generated code:
`RegisterCPU.cpp`:
```
TORCH_LIBRARY_IMPL(aten, CPU, m) {
  ...
}
TORCH_LIBRARY_IMPL(custom, CPU, m) {
  m.impl("gelu", TORCH_FN(wrapper_gelu));
  m.impl("gelu.out", TORCH_FN(wrapper_gelu_out_out));
  m.impl("gelu_", TORCH_FN(wrapper_gelu_));
};
```
```
struct structured_gelu_out_cpu_inplace final : public custom::native::structured_gelu_out_cpu {
    structured_gelu_out_cpu_inplace(Tensor& self) : outputs_{std::ref(self)} {}
    void set_output_strided(
        int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
        TensorOptions options, DimnameList names
    ) override {
        const auto& out = outputs_[output_idx].get();
        check_inplace(out, sizes, options);
        auto maybe_proxy = maybe_create_proxy(out, sizes, strides, options);
        if (C10_UNLIKELY(maybe_proxy.has_value())) {
            proxy_outputs_[output_idx] = c10::ExclusivelyOwned<Tensor>(std::move(maybe_proxy).value());
        }
        if (!names.empty()) {
            namedinference::propagate_names(outputs_[output_idx], names);
        }
        // super must happen after, so that downstream can use maybe_get_output
        // to retrieve the output
        custom::native::structured_gelu_out_cpu::set_output_raw_strided(output_idx, sizes, strides, options, names);
    }
    void set_output_raw_strided(
        int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
        TensorOptions options, DimnameList names
    ) override {
        const auto& out = outputs_[output_idx].get();
        check_inplace(out, sizes, options);
        if (!names.empty()) {
            namedinference::propagate_names(outputs_[output_idx], names);
        }
        // super must happen after, so that downstream can use maybe_get_output
        // to retrieve the output
        custom::native::structured_gelu_out_cpu::set_output_raw_strided(output_idx, sizes, strides, options, names);
    }
    const Tensor& maybe_get_output(int64_t output_idx) override {
        return proxy_outputs_[output_idx].has_value() ? **proxy_outputs_[output_idx] : outputs_[output_idx].get();
    }
    std::array<std::reference_wrapper<Tensor>, 1> outputs_;
    std::array<c10::optional<c10::ExclusivelyOwned<Tensor>>, 1> proxy_outputs_;
};
```
`RegisterSchema.cpp`
```
TORCH_LIBRARY(aten, m) {
  ...
}
TORCH_LIBRARY(custom, m) {
  m.def("gelu.out(Tensor self, *, str approximate='none', Tensor(a!) out) -> Tensor(a!)");
  m.def("gelu_(Tensor(a!) self, *, str approximate='none') -> Tensor(a!)");
  m.def("gelu(Tensor self, *, str approximate='none') -> Tensor");
};
```
Differential Revision: D36558459
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78015
Approved by: https://github.com/bdhirsh
Summary:
`clamp` constrains a value to lie between two other values.
Use `clamp` instead of `min(max())` to calculate the coordinate.
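As a rough illustration of the coordinate calculation, here is a one-dimensional Python analogy of replication padding. The actual change lives in the Vulkan shader, so everything below (function names, the 1-D simplification) is an assumption made for the sketch.

```python
def clamp(value, low, high):
    """Constrain value to the closed interval [low, high]."""
    return min(max(value, low), high)


def replication_pad_source_index(out_index, pad_before, size):
    """Map an output coordinate to the input coordinate it replicates."""
    return clamp(out_index - pad_before, 0, size - 1)


def replication_pad_1d(row, pad_before, pad_after):
    """Pad a list by repeating its edge values (1-D analogy of replication_pad2d)."""
    out_len = len(row) + pad_before + pad_after
    return [row[replication_pad_source_index(i, pad_before, len(row))]
            for i in range(out_len)]


print(replication_pad_1d([1, 2, 3], 2, 1))  # [1, 1, 1, 2, 3, 3]
```

Coordinates that fall in the padded region are pulled back to the nearest valid index, which is exactly what `min(max(...))` computed before.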
Test Plan:
```
buck run //xplat/caffe2:pt_vulkan_api_test_binAppleMac
[ RUN ] VulkanAPITest.replication_pad2d
[ OK ] VulkanAPITest.replication_pad2d (161 ms)
```
Differential Revision: D37063026
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79291
Approved by: https://github.com/SS-JIA
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/
Bug fixes and minor refactor
Squashed commits to work around the GitHub API
Commits that are actually in this PR from the devel branch:
```
4c60e7dff22a494632370e5df55c011007340d06 Add examples infrastructure for using nvFuser in a standalone program (#1725)
02a05d98334ffa580d73ccb28fdb8c577ad296fe Fix issue #1751 (#1753)
8a69aa320bd7629e1709fe5ceb7104d2c88ec84c Refactor NvFuser transpose API to match eager mode behavior (#1746)
ffdf6b7709048170d768217fcd7083fc8387f932 Remove BroadcastWithoutStride. (#1738)
02bab16035e70734450c02124f5cdaa95cf5749d Fix flipping of a boolean flag (#1745)
465d66890c8242e811224359cbdb1c2915490741 cleanup (#1744)
26d354e68720bc7dd2d3b1338ac01b707a230b6a fixing noncontig broadcast (#1742)
856b6b2f9073662dd98ca22ba6c3540e20eb1cdd Add IterDomainBuilder (#1736)
1fd974f912cd4c1e21cbd16e2abb23598d66a02f fixing warning for gcc7 (#1732)
de2740a43a869f8272c2648e091d7b8235097db9 disabling complex in python tests for #1730 (#1733)
fbbbe0a2e7c7a63e0e2719b8bfccb759b714221a fixing MSVC build (#1728)
b5feee5e2b28be688dbddc766f3c0220389c8175 Fix the fused reduction runtime kernel (#1729)
5247682dff5980bb66edf8d3aac25dea2ef2ced5 Re-entrant GroupedGridReduction (#1727)
```
RUN_TORCHBENCH: nvfuser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79147
Approved by: https://github.com/davidberard98
Summary:
- Initial support for fork was done on the JIT interpreter. This patch enables async execution on Static Runtime.
- For each forked node, a separate runtime is created for the execution of the subgraph. Async execution is handled by the aten::ParallelThreadPoolNative threadpool.
- aten::wait waits for the future returned by fork to complete.
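The structure described in these bullets can be sketched with Python's standard thread pool. This is a loose analogy rather than the Static Runtime implementation: `SubgraphRuntime`, `fork`, `wait`, and `main_graph` are hypothetical names, and a `ThreadPoolExecutor` stands in for the native threadpool.

```python
from concurrent.futures import ThreadPoolExecutor


class SubgraphRuntime:
    """Hypothetical stand-in for a runtime that owns one forked subgraph."""

    def __init__(self, subgraph):
        self.subgraph = subgraph

    def run(self, *args):
        return self.subgraph(*args)


def fork(pool, subgraph, *args):
    # A separate runtime is created for each forked node, then scheduled
    # on the shared thread pool; the caller gets a future back right away.
    runtime = SubgraphRuntime(subgraph)
    return pool.submit(runtime.run, *args)


def wait(future):
    # Block until the forked subgraph has produced its result.
    return future.result()


def main_graph(xs):
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [fork(pool, lambda v: v * v, x) for x in xs]
        return [wait(f) for f in futures]


print(main_graph([1, 2, 3]))
```

Giving each fork its own runtime object keeps per-execution state from being shared between subgraphs that run concurrently.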
Test Plan:
local test cases:
- buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators
- buck test mode/opt caffe2/benchmarks/static_runtime:static_runtime_cpptest
- buck test mode/opt caffe2/test:static_runtime
Async execution of the subgraph is tested by adding PyTorch profiler hooks to the StaticRuntime execution via the code below. Async execution in the threadpool is verified by checking the trace.
    from torch.profiler import profile, ProfilerActivity
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        static_runtime_module(inputs)
    prof.export_chrome_trace("trace.json")
Differential Revision: D37044513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79211
Approved by: https://github.com/mikeiovine
Relates to #76700
**Overview:** Revised and added new test cases to `test_print_latest_commits.py` to test the functionality of `isGreen`. All tests are successful!
**Test Plan:** Check that everything is passing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79223
Approved by: https://github.com/janeyx99