Currently any function with a default dtype other than None has to be
manually entered into this function. Instead, this reads the default
directly from `native_functions.yaml`. In order to do this, I also
change `PythonSignatureGroup` to take `tensor_options_args` from the
functional variant since the out variant doesn't actually have tensor
options arguments to take the default values from.
Also note that we need to use `default_init` instead of `default`,
because the out variant doesn't have a `tensor_options` argument to
extract the default value from, and the `PythonSignature` objects
wouldn't match otherwise.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81479
Approved by: https://github.com/albanD
The composite kernel we generate for `view_copy` is special-cased a bit for efficiency, to avoid extra clones in some cases.
That logic was slightly wrong, though, and is fixed here (it needs to mirror the logic in `reshape()`).
The bug manifested as a debug assert firing for Lazy Tensor, which I confirmed no longer fires when running this script:
```
# ran with "python test_ltc_only_torch.py --device=lazy --sync=1 --nvtx=1"
import torch
import torch._lazy
from torch._lazy.ts_backend import init as init_ts_backend
init_ts_backend()
torch.manual_seed(42)
from transformers import BertForSequenceClassification
def parse_args():
    import argparse
    parser = argparse.ArgumentParser(description='')
    parser.add_argument('--device', type=str, default='cuda')
    parser.add_argument('--sync', type=bool, default=False)
    parser.add_argument('--nvtx', type=bool, default=False)
    return parser.parse_args()
args = parse_args()
device = args.device
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', return_dict=True)
from transformers import AdamW
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text_batch = ["I love Pixar.", "I don't care for Pixar."]
encoding = tokenizer(text_batch, return_tensors='pt', padding=True, truncation=True)
input_ids = encoding['input_ids'].to(device)
attention_mask = encoding['attention_mask'].to(device)
model = model.to(device)
model.train()
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
{'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
{'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5)
labels = torch.tensor([1,0]).unsqueeze(0).to(device)
for _ in range(6):
    torch.cuda.nvtx.range_push(f'Iter{_}')
    torch.cuda.nvtx.range_push('F')
    outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
    if args.sync:
        torch._lazy.mark_step()
        torch._lazy.wait_device_ops()
    torch.cuda.nvtx.range_pop()
    loss = outputs.loss
    torch.cuda.nvtx.range_push('B')
    optimizer.zero_grad()
    loss.backward()
    if args.sync:
        torch._lazy.mark_step()
        torch._lazy.wait_device_ops()
    torch.cuda.nvtx.range_pop()
    torch.cuda.nvtx.range_push('O')
    optimizer.step()
    if args.sync:
        torch._lazy.mark_step()
        torch._lazy.wait_device_ops()
    torch.cuda.nvtx.range_pop()
    torch.cuda.nvtx.range_pop()
torch._lazy.mark_step()
torch._lazy.wait_device_ops()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81553
Approved by: https://github.com/ezyang
Summary:
In #77710 I introduced a hack to allow static dispatch to take namespaces. Now that ops and kernels carry their own namespaces, we no longer have to pass a namespace into `static_dispatch()`; instead we generate the ops in `Functions.h` using the kernel's namespace. After this diff:
If we have a yaml file looking like this:
```
- func: op_1(Tensor(a) self) -> Tensor(a)
  dispatch:
    CPU: at::op_1_kernel # ATen kernel

- func: op_2(Tensor(a) self) -> Tensor(a)
  dispatch:
    CPU: custom::op_2_kernel # custom kernel
```
`Functions.h` will contain the following C++ APIs:
```
TORCH_API inline at::Tensor & op_1(at::Tensor & self) {
  return at::cpu::op_1_kernel(self);
}
TORCH_API inline at::Tensor & op_2(at::Tensor & self) {
  return custom::cpu::op_2_kernel(self);
}
```
Test Plan: Rely on CI
Differential Revision: D37900753
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81581
Approved by: https://github.com/iseeyuan
This PR is doing a few interrelated things, all of which are necessary to get correctness. Read the comment in torch/fx/experimental/proxy_tensor.py for the high level overview.
Let's break down the parts of this PR:
* Bug fix where `enable_torch_dispatch_mode` with `None` doesn't work. This makes `enable_torch_dispatch_mode(current_mode.inner)` work, which is the basis for how we temporarily disable fake tensor mode.
* Bug fix for when fake tensor mode is combined with a non-mode tensor subclass. This could actually be ablated from this PR, but it affects where the logic for allowing non-fake-tensor inputs with `lift` goes, so it's all in here in one go. There are some relevant tests for the fix in fake tensor, but it turns out I didn't need this because I'm always using proxy tensors as a mode (which ensures the ordering is right).
* New `lift_fresh` view operator. Note that, like `lift`, we have to manually write the functionalize kernel for these functions.
* The actual change: save constants when we see them in proxy tensor mode, and then propagate them as we go (because otherwise you'll handle mutations on constants incorrectly; see the test, and the sketch after this list).
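A minimal sketch of the constant-handling behavior (assuming a build where `make_fx` is importable from `torch.fx.experimental.proxy_tensor`; the lift op may show up as `lift_fresh` or `lift_fresh_copy` depending on the version, and the test in this PR is more thorough):
```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx

def f(x):
    # From the tracer's point of view `c` is a constant tensor: proxy tensor
    # mode saves it and inserts a lift op so later ops on it are tracked.
    c = torch.tensor([1.0, 2.0, 3.0])
    c.mul_(2)      # mutating a traced constant must not corrupt the original
    return x + c

gm = make_fx(f)(torch.randn(3))
print(gm.code)     # the constant appears as an attribute plus a lift_fresh* call
```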
This is mildly BC-breaking if anyone was previously interposing on
`at::lift`, but that operator is relatively new, and I checked
functorch, which has no explicit reference to `lift`. So I think it
should not be too disruptive.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81192
Approved by: https://github.com/samdow, https://github.com/bdhirsh
There are small typos in:
- caffe2/python/recurrent.py
- test/distributed/test_c10d_nccl.py
- test/test_fx.py
- torch/csrc/jit/runtime/autodiff.cpp
- torchgen/gen.py
Fixes:
- Should read `propagation` rather than `propogation`.
- Should read `multiplied` rather than `multuplied`.
- Should read `eliminate` rather than `elminate`.
- Should read `dispatcher` rather than `disaptcher`.
Semi-automated pull request generated by
https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81435
Approved by: https://github.com/ngimel
Summary:
A followup to #78015 and #79733. In those PRs I introduced custom namespace support into:
* `Register<DispatchKey>.cpp`
* `RegisterSchema.cpp`
* `NativeFunctions.h`
This PR extracts out logic that generates schema registration code (used in `RegisterSchema.cpp`) into a function so that it can be easily tested and reused. Added unit test to cover the logic as well.
Test Plan: Rely on newly added unit tests.
Differential Revision: D37581186
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80780
Approved by: https://github.com/iseeyuan
`ProxyTorchDispatchMode` was added recently as part of `make_fx`, which was secretly causing the meta tensor calls used inside of functionalization to get baked into the graph. It also wasn't caught because the functionalization tests in core don't use `make_fx`, and the tests in functorch aren't as comprehensive.
Now that `make_fx` is in core, I also ported the functionalization test infra over to use it, which would have caught the regression. This also makes the tests cleaner, since mode-based tracing lets us pick up factory functions in the trace output.
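Roughly the kind of test this enables, sketched with the public `torch.func.functionalize` wrapper from later releases rather than the exact test-infra helpers in this PR:
```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx
from torch.func import functionalize  # torch.func API from later releases

def f(x):
    tmp = torch.zeros(2)   # factory call: picked up by mode-based tracing
    tmp.add_(x)            # in-place op: rewritten to a functional op
    return tmp

gm = make_fx(functionalize(f))(torch.ones(2))
print(gm.code)  # graph contains the zeros factory plus a functional add, no add_
```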
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80416
Approved by: https://github.com/ezyang, https://github.com/albanD
This should fix the last issue that @anijain2305 hit when running ResNet with TorchDynamo <> functionalization.
Today, if you try to call an `OpOverloadPacket` from Python with some arguments, we use the types of those arguments to perform overload resolution. With some functional variants of ops, this can be ambiguous.
Today this affects just one op, `_fused_moving_avg_obs_fq_helper`, although it could potentially affect e.g. `native_batch_norm` in the future.
Example:
```
# There are technically two overloads:
# torch.ops.aten._fused_moving_avg_obs_fq_helper.default (returns 2 arguments, mutates 4 of its inputs in place)
# torch.ops.aten._fused_moving_avg_obs_fq_helper.functional (returns 6 arguments, mutates none of its inputs)
# We pick the wrong one - no way to know that we should pick the functional one, just from the call site.
outs = torch.ops.aten._fused_moving_avg_obs_fq_helper(a, a, a, a, a, a, a, 1.0, 0, 1, 0)
# raises an error - tries to call the overload with only 2 returns
return _fused_moving_avg_obs_fq_helper_functional[5]
```
Specifically, functionalization bakes `_fused_moving_avg_obs_fq_helper.functional` into the graph, but when AOTAutograd tries to compile with TorchScript it has to strip the overload name (TS doesn't know how to parse overload names directly, so we remove them and let TS infer the right overload at runtime), and at that point it picks the wrong one.
The situation is pretty similar to inplace: `ops.aten.add` and `ops.aten.add_` represent two different `OverloadPacket` objects; they can't be overloads of the same op, because their schemas would be ambiguous (the alias annotations are different, but that isn't enough to disambiguate).
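For reference, the inplace analogy is visible directly from Python (a quick illustration; the `overloads()` helper on `OpOverloadPacket` is assumed to be available in your build):
```python
import torch

# add and add_ are two distinct OpOverloadPacket objects,
# not two overloads inside the same packet.
print(torch.ops.aten.add)               # aten.add
print(torch.ops.aten.add_)              # aten.add_
print(torch.ops.aten.add.overloads())   # e.g. ['Tensor', 'Scalar', 'out', ...]

# A specific overload can always be selected explicitly, sidestepping resolution:
x = torch.randn(2)
y = torch.ops.aten.add.Tensor(x, x)
```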
In this PR, I try to fix the situation in a pretty similar way to how we handle `inplace` in the data model: `inplace` ops get their own base operator name, but they are represented as a flag inside of `BaseOperatorName` in the data model.
Two other important changes that I made as part of this PR:
(1) Originally, there were ~100 different `*_functional` operators: e.g. we had operators named `resize.functional` and `zero.functional`. The `_functional` bit isn't actually necessary in most cases: it's only necessary for operators that **also** have a `SchemaKind.mutable` variant, where `_fused_moving_avg_obs_fq_helper` is the only op that fits that description today. So I removed the unnecessary notion of "functional" from those other ops. I also added a bunch of assertions to force this restriction.
I think that makes more sense in the long run, because it eliminates an unnecessary difference in the model. E.g. we don't have `add_.Tensor` and `add.Tensor_functional`. We just have `add_.Tensor` and `add.Tensor`.
(2) I noticed that we actually still weren't pairing up a bunch of `_foreach` operators correctly, because their input arguments were different (`self` vs. `tensors`). Since they're private APIs, I went ahead and changed the argument names directly so they get matched up. Before this PR, we were generating a separate `_foreach_add` and `_foreach_add.functional` variant in a bunch of cases that really did the same thing (but happened to have a different name for the first argument).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80556
Approved by: https://github.com/ezyang, https://github.com/albanD
In essence, this is a composite of `eq` -> `all` -> `item`.
`native/mps/operators/Equal.cpp` is an almost verbatim copy of `native/cuda/Equal.cpp`.
Also fixes codegen by generating the `MPSFunctions` headers.
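Conceptually, the composite does something like the following rough Python equivalent (ignoring the extra checks the real kernel performs):
```python
import torch

def equal_like(a: torch.Tensor, b: torch.Tensor) -> bool:
    # torch.equal returns False outright on shape mismatch; eq would broadcast.
    if a.shape != b.shape:
        return False
    return torch.eq(a, b).all().item()
```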
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80195
Approved by: https://github.com/albanD
This expands the `AT_DISPATCH` macros to enable writing your own
`AT_DISPATCH_SWITCH` statements with multiple `AT_DISPATCH_CASE`
labels. So, where previously you may have written:
```cpp
if (iter.common_dtype() == kBool) {
  my_bool_kernel(iter);
} else {
  AT_DISPATCH_INTEGRAL_TYPES(iter.common_dtype(), "my_kernel", [&] {
    ...
  });
}
```
You can now instead write
```cpp
AT_DISPATCH_SWITCH(iter.common_dtype(), "my_kernel",
  AT_DISPATCH_CASE(kBool, [&] { my_bool_kernel(iter); })
  AT_DISPATCH_CASE_INTEGRAL_TYPES([&] { ... })
);
```
The macro syntax is a bit ugly; however, the benefits are:
- Greater flexibility, as the kernel code doesn't have to be shared
for all dtypes.
- Selective build and RECORD_KERNEL_FUNCTION work even for single
dtype specializations such as the bool case in the example.
- The compiler sees a single switch for all types, which should be
easier to optimize into a jump table.
- We also now get errors if the same scalar type is handled twice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79978
Approved by: https://github.com/ezyang
When constructing a lazy::Node, [null operands (optional values that aren't included) are dropped](30fb2c4aba/torch/csrc/lazy/core/ir.cpp (L82-L84)), so it’s possible for the stored operand list to be a different length than the one that was passed into the constructor.
This can become a problem during the call to `CanBeReused` in the autogen `LazyIr.h` code. For example:
```
bool CanBeReused(const torch::lazy::Value& input, const c10::optional<torch::lazy::Value>& weight, const c10::optional<torch::lazy::Value>& bias, const c10::optional<torch::lazy::Value>& running_mean, const c10::optional<torch::lazy::Value>& running_var, const bool& training, const double& momentum, const double& eps) const {
  size_t i = 0;
  std::cout << "Num operands: " << operands().size() << std::endl;
  return (operand(i++) == input &&
          operand(i++) == weight.value_or(kNullValue) &&
          operand(i++) == bias.value_or(kNullValue) &&
          operand(i++) == running_mean.value_or(kNullValue) &&
          operand(i++) == running_var.value_or(kNullValue) &&
          this->training == training &&
          this->momentum == momentum &&
          this->eps == eps);
}
```
Here we operate under the assumption that the number of operands stored in the `lazy::Node` is equal to the number of operands originally passed into the constructor. Recall that we drop any null operands though, so it’s possible to inadvertently access an invalid index at this point.
This PR addresses the issue by adding a new `nullable_operand` method, which falls back to a null value instead of producing an index error when going out of bounds.
This should solve the issue found at https://github.com/pytorch/pytorch/pull/79637#issuecomment-1162044545
cc: @antoniojkim @ke1337 @wconstab @desertfire
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80060
Approved by: https://github.com/desertfire
Due to implicit conversion shenanigans, having both IntArrayRef
and SymIntArrayRef overloads makes {} ambiguous. While we could
fix this by making a single unified type that accepts all the overloads
we want, an easier fix was to just push the SymIntArrayRef overload
to its own name.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79281
Approved by: https://github.com/suo
Summary:
Adding a feature to allow users to specify namespaces for operators and kernels.
# Feature
There's a feature request to allow the DSL to:
1. take an operator namespace other than `aten`.
2. take a kernel that lives in a namespace other than `at::native`.
For both features, we only allow a single level of namespace for the sake of simplicity. If the user specifies `custom::function` as the kernel, the codegen will depend on `custom::native::function`, where `native` is hardcoded.
# Proposal
For feature 1, add a `namespace` attribute to the data class `NativeFunction`. The namespace is extracted by matching the pattern "::" on the `func` variable. For `NativeFunctionsGroup` there's an assumption that all variants (function, inplace, out) have the same namespace. By default (if not specified) the namespace is `aten`.
For feature 2, add a `namespace` attribute to the `BackendMetadata` class, similarly matching the pattern "::" on the kernel field, and remove the `cpp_namespace` field from the `register_dispatch_key` data class. By default (if not specified) the namespace for a kernel is `at::native`.
Test Plan:
Example yaml entries:
```
- func: custom::gelu.out(Tensor self, *, str approximate='none', Tensor(a!) out) -> Tensor(a!)
  structured: True
  structured_inherits: TensorIteratorBase
  device_check: NoCheck # TensorIterator
  python_module: nn
  dispatch:
    CPU: custom::gelu_out_cpu
    CUDA: custom::gelu_out_cuda
    MPS: custom::gelu_out_mps

- func: custom::gelu_(Tensor(a!) self, *, str approximate='none') -> Tensor(a!)
  structured_delegate: gelu.out
  device_check: NoCheck # TensorIterator
  python_module: nn
  dispatch:
    NestedTensorCPU, NestedTensorCUDA: custom::NestedTensor_gelu_

- func: custom::gelu(Tensor self, *, str approximate='none') -> Tensor
  structured_delegate: gelu.out
  device_check: NoCheck # TensorIterator
  python_module: nn
  dispatch:
    MkldnnCPU: custom::mkldnn_gelu
    QuantizedCPU: custom::gelu_quantized_cpu
    NestedTensorCPU, NestedTensorCUDA: custom::NestedTensor_gelu
```
see generated code:
`RegisterCPU.cpp`:
```
TORCH_LIBRARY_IMPL(aten, CPU, m) {
  ...
}

TORCH_LIBRARY_IMPL(custom, CPU, m) {
  m.impl("gelu", TORCH_FN(wrapper_gelu));
  m.impl("gelu.out", TORCH_FN(wrapper_gelu_out_out));
  m.impl("gelu_", TORCH_FN(wrapper_gelu_));
};
```
```
struct structured_gelu_out_cpu_inplace final : public custom::native::structured_gelu_out_cpu {
  structured_gelu_out_cpu_inplace(Tensor& self) : outputs_{std::ref(self)} {}

  void set_output_strided(
      int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
      TensorOptions options, DimnameList names
  ) override {
    const auto& out = outputs_[output_idx].get();
    check_inplace(out, sizes, options);
    auto maybe_proxy = maybe_create_proxy(out, sizes, strides, options);
    if (C10_UNLIKELY(maybe_proxy.has_value())) {
      proxy_outputs_[output_idx] = c10::ExclusivelyOwned<Tensor>(std::move(maybe_proxy).value());
    }
    if (!names.empty()) {
      namedinference::propagate_names(outputs_[output_idx], names);
    }
    // super must happen after, so that downstream can use maybe_get_output
    // to retrieve the output
    custom::native::structured_gelu_out_cpu::set_output_raw_strided(output_idx, sizes, strides, options, names);
  }

  void set_output_raw_strided(
      int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
      TensorOptions options, DimnameList names
  ) override {
    const auto& out = outputs_[output_idx].get();
    check_inplace(out, sizes, options);
    if (!names.empty()) {
      namedinference::propagate_names(outputs_[output_idx], names);
    }
    // super must happen after, so that downstream can use maybe_get_output
    // to retrieve the output
    custom::native::structured_gelu_out_cpu::set_output_raw_strided(output_idx, sizes, strides, options, names);
  }

  const Tensor& maybe_get_output(int64_t output_idx) override {
    return proxy_outputs_[output_idx].has_value() ? **proxy_outputs_[output_idx] : outputs_[output_idx].get();
  }

  std::array<std::reference_wrapper<Tensor>, 1> outputs_;
  std::array<c10::optional<c10::ExclusivelyOwned<Tensor>>, 1> proxy_outputs_;
};
```
`RegisterSchema.cpp`
```
TORCH_LIBRARY(aten, m) {
  ...
}

TORCH_LIBRARY(custom, m) {
  m.def("gelu.out(Tensor self, *, str approximate='none', Tensor(a!) out) -> Tensor(a!)");
  m.def("gelu_(Tensor(a!) self, *, str approximate='none') -> Tensor(a!)");
  m.def("gelu(Tensor self, *, str approximate='none') -> Tensor");
};
```
Differential Revision: D36558459
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78015
Approved by: https://github.com/bdhirsh
Package config/template files with torchgen
This PR packages native_functions.yaml, tags.yaml and ATen/templates
with torchgen.
This PR:
- adds a step to setup.py to copy the relevant files over into torchgen
- adds a docstring for torchgen (so `import torchgen; help(torchgen)`
says something)
- adds a helper function in torchgen so you can get the torchgen root
directory (and figure out where the packaged files are)
- changes some scripts to explicitly pass the location of torchgen,
which will be helpful for the first item in the Future section.
Future
======
- torchgen, when invoked from the command line, should use sources
in torchgen/packaged instead of aten/src. I'm unable to do this because
people (aka PyTorch CI) invoke `python -m torchgen.gen` without
installing torchgen.
- the source of truth for all of these files should be in torchgen.
This is a bit annoying to execute on due to potential merge conflicts
and dealing with merge systems
- CI and testing. The way things are set up right now is really fragile,
we should have a CI job for torchgen.
Test Plan
=========
I ran the following locally:
```
python -m torchgen.gen -s torchgen/packaged
```
and verified that it outputted files.
Furthermore, I did a setup.py install and checked that the files are
actually being packaged with torchgen.
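A rough way to check the packaging from Python; the `packaged` subdirectory layout shown here is an assumption based on how recent wheels ship these files:
```python
import os
import torchgen

# Locate the installed torchgen package and the files copied in by setup.py.
root = os.path.dirname(torchgen.__file__)
yaml_path = os.path.join(root, "packaged", "ATen", "native", "native_functions.yaml")
print(yaml_path, os.path.exists(yaml_path))
```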
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78942
Approved by: https://github.com/ezyang
Summary:
Add a script to go through the view ops in `native_functions.yaml`, auto-register them in static runtime, and auto-generate an op unit test for each.
Overall there are 96 grouped view ops: 21 are already registered by hand; 9 (including sparse ops, training-related ops, etc.) are not targets for static runtime; 30 have list args or list returns; and 7 have non-basic argument types such as `Dimname` or `MemoryFormat`. In summary, this script auto-generates 29 view ops for now.
Run `buck run //caffe2/torch/fb/jit:gen_static_runtime_ops` to generate the static runtime ops; the results with this script are:
```
total grouped native ops: 1582
grouped native ops with out variant: 548
generated functions groups with out variant: 241
view grouped native ops: 96
generated functions view groups: 29
overall generated : 270
```
The generated view ops are added in D36258968
Test Plan:
Generate static runtime ops: `buck run //caffe2/torch/fb/jit:gen_static_runtime_ops`
Unit tests: `buck run mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Differential Revision: D36258767
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77105
Approved by: https://github.com/mikeiovine
Add codegen infrastructure to generate IR nodes for non-native ops.
The proposed change is to add a `non_native` key to the `{backend}_native_functions.yaml` file that contains schema definitions similar to what is found in `native_functions.yaml`. e.g.
```
non_native:
...
- func: expand(Tensor input, int[] size, bool is_scalar_expand) -> Tensor
...
```
these definitions are parsed into a `LazyIrSchema` that can be used for generating IR nodes using `GenLazyIR`.
Fixes #74628
CC: @wconstab @desertfire @henrytwo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76535
Approved by: https://github.com/wconstab
For static dispatch we currently hardcode the namespace of backend-specific C++ functions to `at`, e.g., `at::cpu::add()`. This PR extends it to accept a namespace from the call site. This is a temporary solution; in the long run we want to introduce custom namespaces into the codegen system, e.g., we should be able to add `at::` to `native_functions.yaml` and parse it into `NativeFunction`. That needs a bit more design.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77710
Approved by: https://github.com/ezyang
Previously, when codegening ops like `zeros_` or `ones_`, we'd hit a `Code below assumes there is at least one tensor arg` error. The check is not entirely correct, which is what causes the error to be thrown: ops like the ones mentioned pass in a `device` parameter that can be used in place of the "first tensor".
CC: @wconstab @desertfire @henrytwo @ke1337
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76917
Approved by: https://github.com/desertfire
Partially fixes #69813
This PR does mainly 3 things:
1. Introduces new methods for the `MetaBase` API:
- `set_output_strided`: creates proxy tensors with exact strides, if strides don't match
- `set_output_contiguous`: alias for `set_output_strided` with contiguous strides
- `set_output_raw_strided`: does not create proxy tensors
2. Modifies codegen for handling proxy tensors:
- Creates a new field for out-of-place kernels: `proxy_output_`
- Implements `set_output_strided` by creating a proxy tensor if necessary
- Passes the proxy tensor to the `IMPL` function
- Copies the result back to the real output at the end, whenever a proxy was created
3. Replaces `set_output` with `set_output_raw_strided` for `TensorIterator*`
- Needed, since it overrides `set_output`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76096
Approved by: https://github.com/ezyang
Unfortunately the built-in pprint module supports pretty-printing of dataclasses only from Python 3.10. The code I wrote in the `__str__` method of OpInfo does the same job and should also work for any dataclass. For now I've put it there, but we could create a function and put it somewhere accessible to other dataclasses as well. Also, the max width (80) is currently hardcoded, but ideally it would be a parameter of the function.
When you call `print` on an OpInfo you get:
```
OpInfo(name = '__getitem__',
ref = None,
aliases = (),
variant_test_name = '',
op = <slot wrapper '__getitem__' of 'torch._C._TensorBase' objects>,
method_variant = <slot wrapper '__getitem__' of 'torch._C._TensorBase' objects>,
inplace_variant = None,
skips = (<torch.testing._internal.common_methods_invocations.DecorateInfo object at 0x7f463acbca90>,
<torch.testing._internal.common_methods_invocations.DecorateInfo object at 0x7f463acbcae0>),
decorators = (<torch.testing._internal.common_methods_invocations.DecorateInfo object at 0x7f463acbca90>,
<torch.testing._internal.common_methods_invocations.DecorateInfo object at 0x7f463acbcae0>),
sample_inputs_func = <function sample_inputs_getitem at 0x7f463acc6af0>,
reference_inputs_func = None,
error_inputs_func = None,
sample_inputs_sparse_coo_func = <function _DecoratorContextManager.__call__.<locals>.decorate_context at 0x7f463acc6b80>,
sample_inputs_sparse_csr_func = <function _DecoratorContextManager.__call__.<locals>.decorate_context at 0x7f463acc6c10>,
dtypes = {torch.int16,
torch.float64,
torch.int32,
torch.int64,
torch.complex64,
torch.float16,
torch.bfloat16,
torch.uint8,
torch.complex128,
torch.bool,
torch.float32,
torch.int8},
dtypesIfCUDA = {torch.int16,
torch.float64,
torch.int32,
torch.int64,
torch.complex64,
torch.float16,
torch.bfloat16,
torch.uint8,
torch.complex128,
torch.bool,
torch.float32,
torch.int8},
dtypesIfROCM = {torch.int16,
torch.float64,
torch.int32,
torch.int64,
torch.complex64,
torch.float16,
torch.bfloat16,
torch.uint8,
torch.complex128,
torch.bool,
torch.float32,
torch.int8},
backward_dtypes = {torch.int16,
torch.float64,
torch.int32,
torch.int64,
torch.complex64,
torch.float16,
torch.bfloat16,
torch.uint8,
torch.complex128,
torch.bool,
torch.float32,
torch.int8},
backward_dtypesIfCUDA = {torch.int16,
torch.float64,
torch.int32,
torch.int64,
torch.complex64,
torch.float16,
torch.bfloat16,
torch.uint8,
torch.complex128,
torch.bool,
torch.float32,
torch.int8},
backward_dtypesIfROCM = {torch.int16,
torch.float64,
torch.int32,
torch.int64,
torch.complex64,
torch.float16,
torch.bfloat16,
torch.uint8,
torch.complex128,
torch.bool,
torch.float32,
torch.int8},
supports_out = False,
supports_autograd = True,
supports_gradgrad = True,
supports_fwgrad_bwgrad = True,
supports_inplace_autograd = False,
supports_forward_ad = True,
gradcheck_wrapper = <function OpInfo.<lambda> at 0x7f463a7a40d0>,
check_batched_grad = True,
check_batched_gradgrad = True,
check_batched_forward_grad = True,
check_inplace_batched_forward_grad = True,
gradcheck_nondet_tol = 0.0,
gradcheck_fast_mode = None,
aten_name = '__getitem__',
decomp_aten_name = None,
aten_backward_name = None,
assert_autodiffed = False,
autodiff_nonfusible_nodes = ['aten::__getitem__'],
autodiff_fusible_nodes = [],
supports_sparse = False,
supports_scripting = False,
supports_sparse_csr = False,
test_conjugated_samples = True,
test_neg_view = True,
assert_jit_shape_analysis = False,
supports_expanded_weight = False)
```
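For reference, here is a standalone sketch of such a generic dataclass formatter; it is not the exact code added to `OpInfo.__str__`, and the alignment details and the `width` parameter are illustrative:
```python
import dataclasses
from pprint import pformat

def format_dataclass(obj, width: int = 80) -> str:
    """Format any dataclass as 'Name(field = value, ...)', one field per line."""
    assert dataclasses.is_dataclass(obj)
    cls = type(obj).__name__
    pad = " " * (len(cls) + 1)  # align fields under the opening paren
    parts = []
    for f in dataclasses.fields(obj):
        value = pformat(getattr(obj, f.name), width=width)
        # indent wrapped value lines so they line up under the '=' sign
        value = value.replace("\n", "\n" + pad + " " * (len(f.name) + 3))
        parts.append(f"{f.name} = {value}")
    return cls + "(" + (",\n" + pad).join(parts) + ")"
```
Something like `print(format_dataclass(op_info))` would then produce output in the style of the dump above.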
cc @ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76810
Approved by: https://github.com/ezyang
Summary: Currently OpKind is stored as an object field called op_ for each IR
node, and one usage of op_ is to avoid dynamic_cast in NodeCast when we
need to downcast a base-node pointer into a concrete sub-node pointer.
As a result, we need to construct and pass in an op when downcasting
nodes, and this becomes quite annoying when we start to implement
trie-based IR node reuse. More importantly, the op for each subclass
should be unique for that subclass and thus making it a const static field
is a more logical design.
In this PR, we still keep the object-level op_ for easier XLA adoption. As
future work, we can come back to remove op_, make the op() method
virtual, and get rid of OpKind in all the node constructors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76711
Approved by: https://github.com/wconstab, https://github.com/JackCaoG
This PR turns the previously introduced `ITensorList` into a more general `IList`
class. It is a container wrapper for arbitrary types (given their appropriate
implementations).
In summary, I have:
- Renamed `ITensorList` (its iterators and macros, for consistency) to `IList`
- Made `IList` a templated class (over an arbitrary type `T`), given that the type:
- Specializes `IListTagImpl<T, Tag>` for all `IListTag`
- Introduced type aliases (for both list and iterator types):
- `at::ITensorList` -> `c10::IList<at::Tensor>`
- `at::IOptTensorRefList` -> `c10::IList<at::OptionalTensorRef>`
- Added support for `Tensor?[]` in the structured codegen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69606
Approved by: https://github.com/ezyang
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76203
Request for comments:
This change adds extra code generator support to generate out variant wrappers for operators with unstructured kernels.
The current version generates 105 new out variant wrappers in addition to the existing 136 auto-generated out variant wrappers.
This change shows that a simple tweak can increase the generated op coverage to 16% (241/1559) among all native ops described in native_functions.yaml, no matter whether they are structured or not.
Command to generate out variant wrappers.
```
buck run //caffe2/torch/fb/jit:gen_static_runtime_ops
```
- AFTER this change
```
total grouped native ops: 1559
structured grouped native ops: 545
generated grouped native ops: 241
```
- BEFORE this change
```
total grouped native ops: 1503
structured grouped native ops: 540
generated grouped native ops: 136
```
To enable CI tests and make it easier to review, the generated ops are added in a separate diff: D35945633
More details:
We added a block list to skip generating around 10 operations that are deprecated or whose unit test would fail. All generated ops *compile*, but the generated unit tests may not pass due to the lack of hand-picked test input values for certain ops. Among the 42 ops whose unit test did not pass, 1 ("index_select") duplicates an existing op; 32 were fixed; and 9 were removed and blocked from generation, either because they are not commonly used in internal models (e.g. "cholesky", "linalg_householder_product", the sparse kernel "sspaddmm") or because they cause errors in static runtime (e.g. "conj_physical" leads to an error in the memory planner, and "binary_cross_entropy").
Test Plan:
OP generation:
```buck run //caffe2/torch/fb/jit:gen_static_runtime_ops```
Test generated ops:
```buck run mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest```
Reviewed By: tenpercent
Differential Revision: D34913736
fbshipit-source-id: a6f408321653c3589ae1c76826177fc403d59c44
(cherry picked from commit 6f4501730478dbaeeea7f3ad4f9d29bf6787e7c1)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76471
Make `node_base_ctor_call` produce the entire node-base constructor call.
Previously it produced only the beginning of the call, which was unintended.
Addresses part of https://github.com/pytorch/xla/issues/3472
Test Plan: Imported from OSS
Reviewed By: qihqi, ngimel
Differential Revision: D35980436
Pulled By: wconstab
fbshipit-source-id: a443cf593ac7c35b2b65e72b82907e88e1e71c7a
(cherry picked from commit 360ad6d82a7e8303b8a60e61b177dabf0131ea8b)
This **roughly** corresponds to Goal 3.2 in https://docs.google.com/document/d/1iiLNwR5ohAsw_ymfnOpDsyF6L9RTUaHMpD8YLw-jxEw/edit#
Namely:
It adds the following:
* SymbolicIntNode interface
* LazySymbolicIntNode implementation
* Lazy `narrow_copy` implementation
* Support for `SymInt` in the codegen (which needed to be added)
* Test (below)
```cpp
TEST(LazyDynamicOpsTest, NarrowCopy) {
  auto x = torch::rand({5, 10, 10}).to(kLazy);
  const size_t Y_DIM = 3;
  const size_t X_DIM_INDEX = 2;
  auto y = torch::rand({Y_DIM}).to(kLazy);
  auto ly = torch::lazy::TryGetLtcTensor(y);
  auto dim_node = MakeNode<SizeNode>(ly->GetIrValue(), 0);
  auto lmn = new torch::lazy::SymbolicIntNode(dim_node);
  auto z = x.narrow_copy(X_DIM_INDEX, 0, lmn->toSymInt());
  AllClose(z.cpu(), x.cpu().narrow_copy(X_DIM_INDEX, 0, Y_DIM));
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75759
Approved by: https://github.com/wconstab