This PR does not include an NVFuser frontend cache, but it decouples the Python API from the backend Fusion IR. Instead of exposing the Fusion IR directly, the frontend records the requested definition so that it can be replayed to build a Fusion when one does not already exist in a cache. Another PR will be put up to include the actual caching.
The main change in the Python Frontend is that the NVFuser Fusion IR is no longer directly defined by the interface. Currently, there is a direct connection between the Python API and the creation of the Fusion IR and Object. This means the user defines TensorViews and Scalars, and calls Arith Functions (IR Expressions) on those IR Values. The goal is to disconnect the Python API from directly specifying the Fusion IR and to enable caching of the IR, so a Fusion Object is not necessarily built every time a Fusion Definition is seen.
The FusionDefinition in Python will mostly look the same, except the Definition is now recorded in a lightweight representation called a "Recording" of Records. If the Definition is not already cached, the Records are executed to build the Fusion IR. Initially, there is no caching because I am trying to bring up the representation first and get it working correctly.
This is what the Records look like. The Records are functors that are called only if it is necessary to build the Fusion IR. They are defined in:
torch/csrc/jit/codegen/cuda/python_frontend/fusion_record.h
**Tensor Definition Record**
_Note: The Tensor Definition will change for runtime contiguity caching, I am just matching what is already there for now._
```
struct InputTensorRecord : RecordFunctor {
InputTensorRecord(
std::vector<size_t> _outputs,
std::vector<int64_t> _symbolic_sizes,
std::vector<bool> _contiguous_info,
NvfDataType _dtype)
: RecordFunctor({}, std::move(_outputs)),
symbolic_sizes(std::move(_symbolic_sizes)),
contiguous_info(std::move(_contiguous_info)),
dtype(_dtype) {}
void operator()(FusionDefinition& fd) final {
auto tv = TensorViewBuilder()
.ndims(symbolic_sizes.size())
.contiguity(contiguous_info)
.shape(symbolic_sizes)
.dtype(dtype)
.build();
fd.fusion_state.at(outputs.at(0)) = tv;
fd.addInput(tv);
}
std::vector<int64_t> symbolic_sizes;
std::vector<bool> contiguous_info;
NvfDataType dtype;
};
```
**Generic Templatized Op Record Definition**
Op Records are notable because they record Fusion IR arith functions as the `fusion_op_`.
```
template <class OutType, class... ArgTypes>
struct OpRecord : RecordFunctor {
OpRecord(
std::vector<size_t> _args,
std::vector<size_t> _outputs,
std::function<OutType(ArgTypes...)> fusion_op)
: RecordFunctor(std::move(_args), std::move(_outputs)),
fusion_op_(fusion_op) {}
template <class TupleType, std::size_t... Is>
OutType opFunc(
FusionDefinition& fd,
TupleType& tp,
std::index_sequence<Is...>) {
return fusion_op_(
dynamic_cast<typename std::tuple_element<Is, TupleType>::type>(
fd.fusion_state.at(args.at(Is)))...);
}
void operator()(FusionDefinition& fd) final {
using arg_tuple_t = std::tuple<ArgTypes...>;
auto indices =
std::make_index_sequence<std::tuple_size<arg_tuple_t>::value>();
arg_tuple_t inputs;
auto output = opFunc(fd, inputs, indices);
fd.fusion_state.at(outputs.at(0)) = output;
}
private:
std::function<OutType(ArgTypes...)> fusion_op_;
};
```
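The record-and-replay flow described above can be sketched in plain Python. This is only an illustration of the idea, not the NVFuser implementation: `InputRecord`, `OpRecord`, `Builder`, and `replay` are hypothetical names, and strings stand in for Fusion IR values.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Builder:
    """Hypothetical stand-in for the object that holds the built IR."""
    state: list                                   # IR values, indexed like fusion_state
    inputs: list = field(default_factory=list)    # values registered as inputs


@dataclass
class InputRecord:
    """Records an input definition; `output` is an index into the state table."""
    output: int
    name: str

    def __call__(self, builder: Builder) -> None:
        value = f"input({self.name})"             # a string stands in for a TensorView
        builder.state[self.output] = value
        builder.inputs.append(value)


@dataclass
class OpRecord:
    """Records an op by argument and output indices plus the callable to run later."""
    args: List[int]
    output: int
    op: Callable[..., str]

    def __call__(self, builder: Builder) -> None:
        operands = (builder.state[i] for i in self.args)
        builder.state[self.output] = self.op(*operands)


def replay(recording, num_values: int) -> Builder:
    """Execute every record in order to build the (toy) IR."""
    builder = Builder(state=[None] * num_values)
    for record in recording:
        record(builder)
    return builder


# Recording is cheap: nothing is built until replay() is called.
recording = [
    InputRecord(output=0, name="t0"),
    OpRecord(args=[0], output=1, op=lambda x: f"abs({x})"),
]
built = replay(recording, num_values=2)
print(built.state[1])  # abs(input(t0))
```

Because the recording only holds indices and callables, a future cache could compare recordings and skip `replay` entirely when a matching Fusion already exists.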
Perhaps the most confusing aspect of the Python Frontend is the `FusionDefinition`. The C++ class that it is bound to is purposely very lightweight. To make sure users don't have to touch more than one file when adding new ops (assuming an appropriate Record has already been defined), the Python bindings effectively create functions that act on the `FusionDefinition` and appear as part of the class in Python, but are not part of the class in C++.
Here is an example of a Unary Op Macro. It is creating the binding to a lambda function that effectively appears as a FusionDefinition operation in Python. The other way to do this would have been to create a class method directly in the `FusionDefinition` C++ and have a separate binding to that method.
```
#define NVFUSER_PYTHON_BINDING_UNARY_OP(op_str, op_name) \
nvf_ops.def( \
op_str, \
[](nvfuser::FusionDefinition::Operators& self, \
nvfuser::Tensor* input) -> nvfuser::Tensor* { \
nvfuser::Tensor* output = new nvfuser::Tensor( \
self.fusion_definition->recording_state.size()); \
self.fusion_definition->recording_state.emplace_back(output); \
self.fusion_definition->recording.emplace_back( \
new nvfuser::OpRecord<NvfTensorView*, NvfTensorView*>( \
{input->index}, \
{output->index}, \
static_cast<NvfTensorView* (*)(NvfTensorView*)>( \
torch::jit::fuser::cuda::op_name))); \
return output; \
}, \
py::return_value_policy::reference); \
```
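As a rough Python analogy for what the macro does, the sketch below attaches operations to a lightweight class from the outside, so that adding an op does not require editing the class itself. All names here (`FusionDefinitionSketch`, `OperatorsSketch`, `bind_unary_op`) are hypothetical, and tuples stand in for Record objects.

```python
class OperatorsSketch:
    """Deliberately empty: ops are attached from outside the class body."""

    def __init__(self, fd):
        self.fusion_definition = fd


class FusionDefinitionSketch:
    def __init__(self):
        self.recording = []      # recorded (op_name, input_index, output_index) tuples
        self.num_values = 0      # plays the role of recording_state.size()
        self.ops = OperatorsSketch(self)

    def new_value(self):
        """Reserve the next state index, like emplacing a new State."""
        index = self.num_values
        self.num_values += 1
        return index


def bind_unary_op(op_name):
    """Attach a unary op to OperatorsSketch, similar in spirit to the binding macro."""

    def op(self, input_index):
        fd = self.fusion_definition
        output_index = fd.new_value()
        fd.recording.append((op_name, input_index, output_index))
        return output_index

    setattr(OperatorsSketch, op_name, op)


# One line per new op; the class definitions above stay untouched.
for name in ("abs", "neg"):
    bind_unary_op(name)

fd = FusionDefinitionSketch()
t0 = fd.new_value()
t1 = fd.ops.abs(t0)
t2 = fd.ops.neg(t1)
print(fd.recording)  # [('abs', 0, 1), ('neg', 1, 2)]
```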
Here is the `FusionDefinition` class edited for brevity. The playing of the records will be found under the `exit()` method where exit refers to exiting of the Python Context Manager. A `FusionDefinition` is captured through a context manager like the following:
```
fusion = Fusion()
with FusionDefinition(fusion) as fd:
t0 = fd.define_tensor(sizes=[5], strides=[1])
t1 = fd.ops.abs(t0)
fd.add_output(t1)
```
```
class FusionDefinition {
public:
FusionDefinition(FusionOwner* fusion_owner)
: fusion_owner_(fusion_owner),
prev_fusion_(nullptr),
recording(),
recording_state(),
fusion_state(),
ops(this) {}
// Context Manager Methods
FusionDefinition* enter() {
prev_fusion_ = FusionGuard::getCurFusion();
FusionGuard::setCurFusion(fusionPtr());
return this;
}
void exit() {
// Found in the Python Bindings, currently.
//for (auto& record : recording) {
// auto functor = record.get();
// (*functor)(self);
//}
FusionGuard::setCurFusion(prev_fusion_);
prev_fusion_ = nullptr;
}
void addInput(torch::jit::fuser::cuda::Val* input) {
fusionPtr()->addInput(input);
}
void addOutput(torch::jit::fuser::cuda::Val* output) {
fusionPtr()->addOutput(output);
}
Fusion* fusionPtr() {
return fusion_owner_->fusionPtr();
}
private:
FusionOwner* fusion_owner_;
Fusion* prev_fusion_;
public:
std::vector<std::unique_ptr<RecordFunctor>> recording;
std::vector<std::unique_ptr<State>> recording_state;
std::vector<NvfVal*> fusion_state;
struct Operators {
Operators(FusionDefinition* fd) : fusion_definition(fd) {}
// Python operations are effectively bound here.
FusionDefinition* fusion_definition;
};
Operators ops;
};
```
The Fusion IR doesn’t have `define_tensor` or `define_scalar` functions. I made up both the functions and their names for the Python `FusionDefinition` as a more understandable and convenient way to define input tensors and scalars. `TensorView` objects and Fusion IR `Val` objects are not typically defined outside of a Fusion IR `Expr` output (typically arith function outputs), except for inputs to a graph. Mechanically speaking, there are two things you need to do to define an input in the Fusion IR: define the IR `TensorView`/`Val` object, and then record that object as an input in the `Fusion` object that encapsulates the Fusion IR. Since the `FusionDefinition` does not correspond one-to-one with the Fusion IR, and `define_tensor` and `define_scalar` are made-up functions, I decided to combine the `Val` object creation and the recording of the input in the `Fusion` object into one step, to reduce the amount of syntax required to define a Fusion in the Python interface.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81578
Approved by: https://github.com/jjsjann123, https://github.com/IvanYashchuk, https://github.com/SherlockNoMad
# Summary
This change fixes a bug that was encountered when trying to add more backward formulas for nested tensor ops. If a derivative is defined that stores the "result" for use in the backward, the output of the forward op is saved using:
```
if (grad_fn) {
grad_fn->result_ = SavedVariable(result, true);
}
```
SavedVariable calls a series of functions which in turn call shallow_copy_and_detach, and when c179597753/c10/core/TensorImpl.cpp (L533) is hit this calls sizes_custom(), which is not implemented and errors. I also noticed that, since the storage format for nested_tensor is different (not `storage_` but two tensors instead), we should actually be calling the NestedTensorImpl constructor.
This PR overrides shallow_copy_and_detach from the derived class and ensures that shallow copy works correctly.
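The failure mode and the fix can be modeled with a small Python toy. This is not the C++ code from the PR: `TensorImplSketch` and `NestedTensorImplSketch` are hypothetical classes that only mimic the shape of the problem (a base-class copy path that queries sizes, and a derived class whose sizes query is unimplemented).

```python
class TensorImplSketch:
    """Toy base class: a dense tensor with one storage and one size list."""

    def __init__(self, storage, sizes):
        self.storage = storage
        self._sizes = sizes

    def sizes_custom(self):
        return self._sizes

    def shallow_copy_and_detach(self):
        # The base path shares the storage and queries the sizes.
        return TensorImplSketch(self.storage, self.sizes_custom())


class NestedTensorImplSketch(TensorImplSketch):
    """Toy derived class: data lives in two members instead of `storage`."""

    def __init__(self, buffer, nested_sizes):
        super().__init__(storage=None, sizes=None)
        self.buffer = buffer
        self.nested_sizes = nested_sizes

    def sizes_custom(self):
        raise NotImplementedError("a nested tensor has no single size")

    def shallow_copy_and_detach(self):
        # Override: go through the derived constructor and share both members.
        return NestedTensorImplSketch(self.buffer, self.nested_sizes)


nested = NestedTensorImplSketch(buffer=[1.0, 2.0, 3.0], nested_sizes=[[1], [2]])

# Forcing the base-class path reproduces the error described above.
try:
    TensorImplSketch.shallow_copy_and_detach(nested)
    base_path_failed = False
except NotImplementedError:
    base_path_failed = True

# The override returns the derived type and shares the underlying buffer.
shallow = nested.shallow_copy_and_detach()
print(base_path_failed, type(shallow).__name__, shallow.buffer is nested.buffer)
```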
## Update
- Added the softmax derivative in this PR because that is a direct use case that was blocked by not having shallow_copy_and_detach work correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81838
Approved by: https://github.com/soulitzer
Summary:
Enabled the AT_POCKETFFT_ENABLED flag and added the appropriate dependencies to aten-cpu.
Moved MKL files from
`aten_cpu_source_non_codegen_list` to
`aten_native_source_non_codegen_list`
Test Plan:
Built test binaries for both Android and iOS targets, then submitted benchmarks:
### iOS
`fbcode/aibench/specifications/frameworks/pytorch/ios/build.sh`
Submitted benchmarks with the new binaries supporting pocketfft here:
https://www.internalfb.com/intern/aibench/details/245253003946591
### Android
`fbcode/aibench/specifications/frameworks/pytorch/android/arm64/build.sh`
Submitted benchmarks with the new binaries supporting pocketfft here:
https://www.internalfb.com/intern/aibench/details/406253690682941
### Build Size Impact
Success: igios-pika on D37790257-V7
☷[pocket fft] turning on pocketfft flag☷
Diff: https://fburl.com/diff/exkploof
Unigraph Explorer: https://fburl.com/mbex/aipdzaqo
Changes for variation [arm64 + 3x assets]:
```
Compressed  : -473 B (-0.00%) => 86.69 MiB
Uncompressed: +2.4 KiB (+0.00%) => 187.71 MiB
```
Reviewed By: kimishpatel
Differential Revision: D37790257
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81670
Approved by: https://github.com/kit1980
Summary:
Stacked conv2d uses empty_with_tail_padding from Factory.cpp. This was
available only in aten_native_cpu but linking directly against that is not
wise since aten_native_cpu contains references to other ops that are part of
aten_cpu and selective build.
Thus for the purpose of padded tensor allocation I decided to bring this out in
aten_cpu.
Test Plan:
CI
buck build
fbsource//xplat/caffe2/fb/custom_ops/ferraris:stacked_conv2dAndroid#arm-64,shared
Reviewed By: cccclai
Differential Revision: D37787231
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81356
Approved by: https://github.com/cccclai
- Created may_alias method in FunctionSchema to publicize aliasing information about inputs and outputs of a schema.
- Tested may_alias methods for basic functionality, exceptions, and wildcard functionality.
**Cases where elements of a container alias another argument will be handled with a new may_contain_alias method, which will be created in a later PR.**
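To make the intent of `may_alias` concrete, here is a simplified Python model of the check. It is an assumption-laden sketch rather than the `FunctionSchema` implementation: `SchemaSketch` is hypothetical, each argument carries an optional set of alias names, and `"*"` stands for the wildcard set.

```python
from typing import FrozenSet, List, Optional

WILDCARD = "*"  # assumption: a single marker models the wildcard alias set

AliasSet = Optional[FrozenSet[str]]  # None means the argument has no alias annotation


class SchemaSketch:
    """Hypothetical schema: alias sets for each input and each output."""

    def __init__(self, inputs: List[AliasSet], outputs: List[AliasSet]):
        self.inputs = inputs
        self.outputs = outputs

    def may_alias(self, input_index: int, output_index: int) -> bool:
        # Invalid indices are reported as errors (the "exceptions" case).
        if not 0 <= input_index < len(self.inputs):
            raise IndexError("invalid input index")
        if not 0 <= output_index < len(self.outputs):
            raise IndexError("invalid output index")
        lhs = self.inputs[input_index]
        rhs = self.outputs[output_index]
        if lhs is None or rhs is None:
            return False           # no annotation on one side: no known aliasing
        if WILDCARD in lhs or WILDCARD in rhs:
            return True            # a wildcard may alias anything annotated
        return bool(lhs & rhs)     # otherwise the alias sets must intersect


# An in-place style op: the first input and the output share alias set "a".
inplace = SchemaSketch(inputs=[frozenset({"a"}), None], outputs=[frozenset({"a"})])
print(inplace.may_alias(0, 0), inplace.may_alias(1, 0))  # True False

# A wildcard input may alias an annotated output.
wild = SchemaSketch(inputs=[frozenset({WILDCARD})], outputs=[frozenset({"b"})])
print(wild.may_alias(0, 0))  # True
```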
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80918
Approved by: https://github.com/davidberard98
Summary:
This diff integrates the UCC process group as a native component of PyTorch Distributed core. It is based on the existing torch-ucc (https://github.com/facebookresearch/torch_ucc) as the wrapper for the UCC collective communication library.
The environment and CMake variables are named to mirror those of the existing process groups such as NCCL and Gloo. Specifically,
- USE_UCC: enables UCC PG. This defaults to OFF, so there is no breakage of existing builds that do not have UCX/UCC external libraries.
- USE_SYSTEM_UCC: uses external UCX and UCC shared libraries that are set accordingly with UCX_HOME and UCC_HOME.
Currently, this diff only supports USE_SYSTEM_UCC=ON, i.e., requiring users to specify external libraries for UCX and UCC. In subsequent diffs, we will add UCX and UCC repos as third-party dependencies in pytorch/third-party.
Test Plan:
Passed Torch-UCC tests that invoke UCC process group. For example:
$ sh test/start_test.sh test/torch_allreduce_test.py --backend gloo --use-cuda
...
Test allreduce: succeeded
Differential Revision: D36973688
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79918
Approved by: https://github.com/kwen2501, https://github.com/kingchc
- Added overloads to is_mutable method in FunctionSchema to tell whether an argument at index is mutable or an argument with name is mutable.
- Created SchemaInfo subclass of FunctionSchema with constructors from FunctionSchema and from const char* signature.
- Tested is_mutable method overloads in new test_schema_info.cpp file.
**Note that this pr is used to set up SchemaInfo. Implementation for SchemaInfo will be addressed in later commits**
Differential Revision: [D37651384](https://our.internmc.facebook.com/intern/diff/D37651384)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80734
Approved by: https://github.com/davidberard98
Similar to [scipy.sparse.spdiags](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.spdiags.html#scipy-sparse-spdiags)
Part of #70926
In other functions (e.g. [torch.diagonal](https://pytorch.org/docs/stable/generated/torch.diagonal.html#torch.diagonal)), diagonals of a tensor are referenced using the offset and the two dimensions that the diagonal is taken with respect to.
Here the reference implementation from scipy only considers matrix output, so even if we only support 2-D output at first, it may be useful to consider how the dimensions corresponding to each diagonal would be specified for higher dimensional output.
The proposed torch signature implies that all offsets refer to the diagonals with respect to the only two dimensions of the output:
```
torch.sparse.spdiags(Tensor diagonals, IntTensor offsets, int[] shape, Layout? layout=None) -> SparseTensor
```
Above it is required that `diagonals.ndimension() == 2`, `offsets.ndimension() == 1`, `offsets.shape[0] == diagonals.shape[0]` and `len(shape) == 2`.
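For the 2-D case, the placement rule can be illustrated with a dense, pure-Python sketch. `spdiags_dense` is a hypothetical helper, not the proposed operator: it assumes the scipy convention in which `diagonals[k][j]` lands in column `j` of the diagonal with offset `offsets[k]`, and it returns nested lists instead of a sparse tensor.

```python
def spdiags_dense(diagonals, offsets, shape):
    """Place rows of `diagonals` on the given diagonals of a dense 2-D result.

    Assumes the scipy.sparse.spdiags convention: diagonals[k][j] goes to
    column j (row j - offsets[k]); entries falling outside `shape` are dropped.
    """
    if len(shape) != 2:
        raise ValueError("only 2-D output is supported in this sketch")
    if len(diagonals) != len(offsets):
        raise ValueError("need exactly one offset per row of diagonals")
    rows, cols = shape
    out = [[0] * cols for _ in range(rows)]
    for values, offset in zip(diagonals, offsets):
        for col, value in enumerate(values):
            row = col - offset
            if 0 <= row < rows and col < cols:
                out[row][col] = value
    return out


dense = spdiags_dense(
    diagonals=[[1, 2, 3], [1, 2, 3], [1, 2, 3]],
    offsets=[0, -1, 1],
    shape=(3, 3),
)
for line in dense:
    print(line)
# [1, 2, 0]
# [1, 2, 3]
# [0, 2, 3]
```

Note how the superdiagonal (offset `1`) uses the trailing values `2, 3` while the subdiagonal (offset `-1`) uses the leading values `1, 2`; that follows directly from indexing by column.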
This would need to be altered for the case where `len(shape) > 2`. One option is:
```
torch.sparse.spdiags(Tensor[] diagonals, IntTensor[] offsets, IntTensor dims, int[] shape, Layout? layout=None) -> SparseTensor
```
Here `offsets` and `diagonals` become lists of tensors, and the `IntTensor dims` argument is introduced. This would require that `len(diagonals) == len(offsets) == dims.shape[0]`, `dims.ndimension() == 2` and `dims.shape[1] == 2`. The same restrictions as in the 2-D case above also apply pairwise to the elements of `diagonals` and `offsets` (that is, `diagonals[i].ndimension() == 2`, `offsets[i].ndimension() == 1` and `offsets[i].shape[0] == diagonals[i].shape[0]` for all i). This form of the signature would construct the sparse result by placing the values from `diagonals[i][j]` into the diagonal with offset `offsets[i][j]` taken with respect to dimensions `dims[i]`. The specialization back to the original signature for the 2-D case could be seen as allowing the single row of `dims` to default to `[0, 1]` when only one `diagonals`, `offsets` pair is provided and `shape` is 2-D. This option allows the rows of different input elements `diagonals[i]` to have different lengths, which may be appropriate as the max length of a diagonal along different dimension pairs will be different.
Another option is to specify the dimensions the diagonal is taken with respect to for each offset. This signature would look like:
```
torch.sparse.spdiags(Tensor diagonals, IntTensor offsets, IntTensor dims, int[] shape, Layout? layout=None) -> SparseTensor
```
Here, `diagonals` is still 2-D with dimension 0 matching the length of the 1-D `offsets`, and the tensor input `dims` is also 2-D with dimension 0 matching the length of `offsets` and the second dimension fixed at `2`. In this case the sparse result is constructed by placing the elements from `diagonals[i]` into the output diagonal `output.diagonal(offsets[i], dim0=dims[i][0], dim1=dims[i][1])` (with some additional consideration that makes it more complicated than simply assigning to that view). The specialization from this back to the 2-D form could be seen as assuming `dims = [[0, 1], [0, 1]... len(offsets) times ]` when `len(shape) == 2`.
In both proposed signatures for the N-D case, the specialization back to the 2-D signature is a bit of a stretch for typical default-argument logic. However, I think the first is the better choice as it offers more flexibility.
I think some discussion is required about:
- [x] Should the N-D output case be implemented from the outset
- [x] If not, should the future addition of the N-D output case be considered when designing the interface.
- [x] Other thoughts on the signature which includes the `dims` information for the N-D output case.
**Resolution**: Since no one has requested N-D output support, I think it is fine to restrict this to sparse matrix generation. Should a request for N-D support come later, an overload accepting the additional `dims` could be added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78439
Approved by: https://github.com/nikitaved, https://github.com/cpuhrsch, https://github.com/pearu
Summary:
This patch makes broadcast a custom op so that it can be passed through the
dispatcher. It's one part of the effort to route comm ops to the dispatcher
such that tracing mechanisms that rely on the dispatcher can trace them,
e.g., LazyTensor and AOTAutograd.
Test Plan:
python test/distributed/test_c10d_nccl.py -k test_broadcast_ops
python test/distributed/test_c10d_gloo.py -k test_broadcast_basics
...and other existing distributed tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76722
Approved by: https://github.com/pritamdamania87
Summary:
Previously, there was no support for masked_fill for quantized tensors.
This PR introduces the feature for quantized CPU tensors. A
corresponding test case was added to test the impl.
Most of the implementation is copied over from the fp kernel and adapted
for quantized tensors.
TODO: implement masked_fill for QuantizedCUDA tensors.
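Conceptually, filling a quantized tensor means quantizing the fill value with the tensor's own parameters and writing that integer wherever the mask is set. The pure-Python sketch below illustrates this under stated assumptions: per-tensor affine quantization into an unsigned 8-bit range, and hypothetical helper names (`quantize`, `masked_fill_quantized`). It is not the kernel added in this PR.

```python
def quantize(value, scale, zero_point, qmin=0, qmax=255):
    """Affine-quantize one float: clamp(round(value / scale) + zero_point).

    Note: Python's round() rounds exact halves to the nearest even integer.
    """
    q = round(value / scale) + zero_point
    return max(qmin, min(qmax, q))


def masked_fill_quantized(int_values, mask, value, scale, zero_point):
    """Return a copy of `int_values` with the quantized `value` where mask is True."""
    fill = quantize(value, scale, zero_point)
    return [fill if m else v for v, m in zip(int_values, mask)]


# Integer representation of a small quantized tensor (scale=0.5, zero_point=10).
int_repr = [10, 12, 20]
filled = masked_fill_quantized(
    int_repr, mask=[True, False, True], value=2.0, scale=0.5, zero_point=10
)
print(filled)  # [14, 12, 14]

# Dequantizing the filled entries recovers the requested value.
print([(q - 10) * 0.5 for q in filled])  # [2.0, 1.0, 2.0]
```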
Test Plan:
```
python test/test_quantization.py -k test_qtensor_masked_fill
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78368
Approved by: https://github.com/vkuzo
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/
A few bigger updates:
1. Initial support of cp.async and cp.async.wait: https://github.com/csarofeen/pytorch/pull/1619
2. Emulate ampere's mma 16816 with Turing's mma 1688, for a unified interface: https://github.com/csarofeen/pytorch/pull/1643
3. Extending the infrastructure to support mma operators on turing and ampere arch: https://github.com/csarofeen/pytorch/pull/1440
Commits that are actually in this PR from the csarofeen branch:
```
* dd2325294e236c5082c642819a1103bcfe4561a3 (csarofeen/devel) Fusion Segmenter: Unify single kernel and multi-kernel runtime path (#1710)
* b3d1c3f446355a2d276bac8272e7aa8b5bb6b1f0 Fix missing cooperative launch (#1726)
* dc670a226cbe52be46cecef47001f38bf9a09433 Async gmem copy support on sm80+ (#1619)
* 5e6a8dab5a71aefe0548bbfa15d1a93c556d23fe Add turing mma support and test (#1643)
* d6d6b7d3f10dd91dafa4cdbd5e460bbb38173af4 Fix rFactor when there are indirect root domain(s), and refactor (#1723)
* 7093e39150c6d80e0f9f767d56654714a2e8a927 Mma op integration on ampere (#1440)
* fade8da55e60a118c5595378896d34b862b2fcc3 patch python test for bfloat16 (#1724)
* 8fbd0b18743a72ac10478857c3d2351204375685 Fine-grained kernel profiling (#1720)
* 77c1b4fa633f9e631d267923f4537336fa328939 Adding dry run mode to skip arch dependent checks (#1702)
* 151d95b97bebefc94199bb4a53423ede32b55451 More precise concretization analysis (#1719)
* f4d3630ed54d7069dd377a64be1f91013b285b66 Enable complex python tests (#1667)
* 4ceeee509774cc2ce6c834a4dc1e313f71d94503 Minor bugfix in transform_rfactor.cpp (#1715)
* 3675c70faf218e86d2c78dbd3874b175a3b0a203 Separate root domain and rfactor domain in TransformPrinter (#1716)
* f68b830d5def65dadfe29d4edf52fc703369c84a Fix scheduling with polymorphic broadcast (#1714)
* 4ab5ef7ae2cfd8fffad1e1d882ae7c50631211dc updating_ci_machine (#1718)
* 56585c58b1ff338704cafb0cd6be2b3d536bed5a Merge pull request #1711 from csarofeen/upstream_master_bump_0517
* 174d453d3be0c11a5acb0fff3b3f36e19cfdaf81 Allow using nvFuser on CUDA extension (#1701)
* 18bee67495454b9a79625799776e746bd5e81c4c Validate LOOP concrete IDs have complete IterDomains (#1676)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78244
Approved by: https://github.com/csarofeen, https://github.com/malfet
Summary:
Names of analogous files in quantized directory (previously snake case) were inconsistent with
their non-quantized filename counterparts (pascal case). This is the first of a series of PRs that changes
all files in quantized (and sub-directories) dir to have pascal case.
`aten/src/ATen/native/quantized/qconv_unpack.cpp` has not been renamed yet
because (for reasons currently unknown) after making the name change, `import torch` produces the below error (`qlinear_unpack.cpp` renaming also seems to fail some phabricator CI tests for similar reasons). We suspect that these may be undefined errors and will revisit naming these files in a future PR.
```
terminate called after throwing an instance of 'c10::Error'
what(): Type c10::intrusive_ptr<ConvPackedParamsBase<2> > could not be converted to any of the known types.
Exception raised from operator() at ../aten/src/ATen/core/jit_type.h:1735 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x55 (0x7f26745c0c65 in /data/users/dzdang/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xb1 (0x7f26745bdcd1 in /data/users/dzdang/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0x1494e24 (0x7f2663b14e24 in /data/users/dzdang/pytorch/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0xfed0bc (0x7f266366d0bc in /data/users/dzdang/pytorch/torch/lib/libtorch_cpu.so)
frame #4: c10::detail::infer_schema::make_function_schema(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, c10::ArrayRef<c10::detail::infer_schema::ArgumentDef>, c10::ArrayRef<c10::detail::infer_schema::ArgumentDef>) + 0x5a (0x7f266366d71a in /data/users/dzdang/pytorch/torch/lib/libtorch_cpu.so)
frame #5: c10::detail::infer_schema::make_function_schema(c10::ArrayRef<c10::detail::infer_schema::ArgumentDef>, c10::ArrayRef<c10::detail::infer_schema::ArgumentDef>) + 0x7b (0x7f266366e06b in /data/users/dzdang/pytorch/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x1493f32 (0x7f2663b13f32 in /data/users/dzdang/pytorch/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0xe227dd (0x7f26634a27dd in /data/users/dzdang/pytorch/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x14e0a (0x7f268c934e0a in /lib64/ld-linux-x86-64.so.2)
..........................truncated.............
```
Test Plan:
```
python test/test_quantization.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77037
Approved by: https://github.com/jerryzh168