Summary:
Enables the use of `NoneType` arguments in the `inputs` tuple passed to the export API
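As a rough sketch of what this enables (the toy model and shapes are illustrative, not from the PR):
```python
import torch

class Model(torch.nn.Module):
    def forward(self, x, y=None):
        return x if y is None else x + y

# None can now be passed for an optional input inside the args tuple.
torch.onnx.export(Model(), (torch.randn(2, 3), None), "model.onnx")
```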
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45792
Reviewed By: heitorschueroff
Differential Revision: D24312784
Pulled By: bzinodev
fbshipit-source-id: 1717e856b56062add371af7dc09cdd9c7b5646da
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46755
As reported in https://github.com/pytorch/pytorch/issues/41324, there is a bug in DDP when `find_unused_parameters=True` and 2 or more parameters share the same gradient accumulator.
In the reducer, we currently keep a mapping of grad accumulator to index and populate it with `map[accumulator] = index`, but this overwrites the stored index when two parameters share the same accumulator. To fix this, switch the mapping values to a vector of indices so that all indices sharing an accumulator are retained.
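The pattern, sketched in Python for illustration (the actual fix lives in the C++ reducer; the names here are stand-ins):
```python
from collections import defaultdict

# Toy setup: two parameters tied to the same gradient accumulator object.
shared_accumulator = object()
params = [("w1", shared_accumulator), ("w2", shared_accumulator)]

# Buggy pattern: a plain map keeps only the last index per accumulator.
acc_to_index = {}
for index, (name, acc) in enumerate(params):
    acc_to_index[acc] = index          # index 0 is overwritten by index 1

# Fixed pattern: hold every index that shares the accumulator.
acc_to_indices = defaultdict(list)
for index, (name, acc) in enumerate(params):
    acc_to_indices[acc].append(index)  # [0, 1]
```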
ghstack-source-id: 115453567
Test Plan: Added UT
Reviewed By: pritamdamania87
Differential Revision: D24497388
fbshipit-source-id: d32dfa9c5cd0b7a8df13c7873d5d28917b766640
Summary:
Related https://github.com/pytorch/pytorch/issues/38349
This PR implements `column_stack` as a composite op built from `torch.reshape` and `torch.hstack`, and makes `row_stack` an alias of `torch.vstack`; both are illustrated below.
Todo
- [x] docs
- [x] alias pattern for `row_stack`
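For reference, the expected behavior (matching the NumPy functions these ops mirror):
```python
import torch

a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])

torch.column_stack((a, b))   # 1-D inputs become columns:
# tensor([[1, 4],
#         [2, 5],
#         [3, 6]])

torch.row_stack((a, b))      # alias of torch.vstack
# tensor([[1, 2, 3],
#         [4, 5, 6]])
```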
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46313
Reviewed By: ngimel
Differential Revision: D24585471
Pulled By: mruberry
fbshipit-source-id: 62fc0ffd43d051dc3ecf386a3e9c0b89086c1d1c
Summary:
If no annotation is given, we want the error message to show users that the type was inferred rather than annotated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46969
Test Plan:
Added a new test case that throws an error with the expected error message
Fixes https://github.com/pytorch/pytorch/issues/46326
Reviewed By: ZolotukhinM
Differential Revision: D24614450
Pulled By: gmagogsfm
fbshipit-source-id: dec555a53bfaa9cdefd3b21b5142f5e522847504
Summary:
Preserve PYBIND11 configuration options in `torch._C._PYBIND11_COMPILER_TYPE` and use them when building extensions
Also, use f-strings in `torch.utils.cpp_extension`
"Fixes" https://github.com/pytorch/pytorch/issues/46367
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46415
Reviewed By: VitalyFedyunin
Differential Revision: D24605949
Pulled By: malfet
fbshipit-source-id: 87340f2ed5308266a46ef8f0317316227dab9d4d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47035
Chillee thought the `from math import inf, nan` string at the top of `.code` was annoying, so here's an alternative: put those values into `globals` before we `exec`.
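A minimal sketch of the idea (not the actual FX code):
```python
import math

# Instead of emitting "from math import inf, nan" in the generated source,
# seed the globals dict that exec runs against.
src = "def forward(x):\n    return x + inf"
globals_dict = {"inf": math.inf, "nan": math.nan}
exec(compile(src, "<fx-generated>", "exec"), globals_dict)
print(globals_dict["forward"](1.0))  # inf
```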
Test Plan: Imported from OSS
Reviewed By: dzhulgakov
Differential Revision: D24611278
Pulled By: jamesr66a
fbshipit-source-id: c25ef89e649bdd3e79fe91aea945a30fa7106961
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46773
Changed the constructor of RemoteModule to accept a `remote_device` arg in the following format:
"<workername>/<device>" (e.g., "trainer0/cpu", "ps0/cuda:0")
This arg merges the original `on` and `device` args.
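A sketch of the new constructor call, assuming RPC is already initialized and a worker named "trainer0" exists:
```python
import torch.nn as nn
from torch.distributed.nn import RemoteModule

remote_linear = RemoteModule(
    remote_device="trainer0/cuda:0",  # "<workername>/<device>"
    module_cls=nn.Linear,
    args=(20, 30),
)
```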
Original PR issue: RemoteDevice Format #46554
ghstack-source-id: 115448051
Test Plan: buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule
Reviewed By: pritamdamania87
Differential Revision: D24482562
fbshipit-source-id: 5acfc73772576a4b674df27625bf560b8f8e67c1
Summary:
Plus two minor fixes to `torch/csrc/Module.cpp`:
- Use iterator of type `Py_ssize_t` for array indexing in `THPModule_initNames`
- Fix clang-tidy warning of unneeded defaultGenerator copy by capturing it as `const auto&`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47025
Reviewed By: samestep
Differential Revision: D24605907
Pulled By: malfet
fbshipit-source-id: c276567d320758fa8b6f4bd64ff46d2ea5d40eff
Summary:
WIP: add support for different memory sizes in size_based_partition, so that it can handle logical devices with different memory sizes. Compared to the original size_based_partition, the new one also supports mapping partitions to logical devices: multiple partitions can be mapped to one device if its memory size allows. A unit test, test_different_size_partition, is also added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46919
Reviewed By: gcatron, VitalyFedyunin
Differential Revision: D24603511
Pulled By: scottxu0730
fbshipit-source-id: 1ba37338ae054ad846b425fbb7e631d3b6c500b6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46955
Initially we were thinking of adding an `invalidate_quantized_float_parameters` option to free the memory
of quantized floating-point parameters, but it turns out we will do a module swap, just like in eager mode, for the modules
that are quantized, so the old floating-point module will not be referenced after quantization. Therefore this feature
is only needed for functionals; since most people use quantization with modules, we may not need it.
We'll revisit if we find a need for it.
Test Plan: Imported from OSS
Reviewed By: supriyar
Differential Revision: D24579400
fbshipit-source-id: fbb0e567405dc0604a2089fc001573affdade986
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46897
These APIs implicitly assumed that the GPU index for a rank equals the rank index, but
that is not necessarily true. For example, the first GPU could be used for a
different purpose, with rank 0 using GPU 1, rank 1 using GPU 2, etc. Thus, we
mandate that the user specify the device to use via `torch.cuda.set_device()`
before making calls to this API. This expectation should be okay since we
clearly document it, and we expect the user to set this for
DistributedDataParallel as well.
Also adds/tidies up some documentation.
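Sketched expectation (the rank-to-device mapping here is illustrative):
```python
import torch
import torch.distributed as dist

def init_worker(rank, world_size, device_id):
    # Each rank picks its device explicitly; GPU index need not equal rank.
    torch.cuda.set_device(device_id)   # e.g. rank 0 -> GPU 1, rank 1 -> GPU 2
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
```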
ghstack-source-id: 115359633
Test Plan: Modified unittests
Reviewed By: divchenko
Differential Revision: D24556177
fbshipit-source-id: 7e826007241eba0fde3019180066ed56faf3c0ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46686
I was trying to page this code back in after a while and some things
stuck out as unnecessarily confusing.
1. Improve documentation of closures and fork stuff to be more accurate
to how we use them today.
2. Change `prim::LocalVariableScope` to `prim::ListComprehension`. It is
only ever used for list comprehensions, and in general the nodes
emitted by `ir_emitter` should correspond to concrete operations or
language features rather than semantic constraints.
3. Change the somewhat mysterious "inputs" and "attributes" argument
names throughout the codebase to be the more obvious "args" and "kwargs"
that they generally represent (I think "inputs" and "attributes" come
from the AST naming).
Test Plan: Imported from OSS
Reviewed By: navahgar, jamesr66a
Differential Revision: D24464197
Pulled By: suo
fbshipit-source-id: 1f4b1475b58b5690a0b204e705caceff969533b4
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46779
Test Plan: Used it in some notebooks.
Reviewed By: suo
Differential Revision: D24574005
Pulled By: dreiss
fbshipit-source-id: 78ba7a2bdb859fef5633212b73c7a3eb2cfbc380
Summary:
Caffe2 and Torch currently do not have a consistent mechanism for determining whether a kernel has launched successfully. The result is difficult-to-detect or silent errors. This diff provides functionality to fix that. Subsequent diffs on the stack fix the identified issues.
Kernel launch errors may arise when launch parameters (number of blocks, number of threads, shared memory size, or stream id) are invalid for the hardware, or for other reasons. Unless these launch errors are specifically checked for, CUDA will silently fail and return garbage answers that can affect downstream computation. Catching launch errors therefore matters.
Launches are currently checked by placing
```
AT_CUDA_CHECK(cudaGetLastError());
```
somewhere below the kernel launch. This is bad for two reasons.
1. The check may be performed at a site distant to the kernel launch, making debugging difficult.
2. The separation of the launch from the check means that it is difficult for humans and static analyzers to determine whether the check has taken place.
This diff defines a macro:
```
#define TORCH_CUDA_KERNEL_LAUNCH_CHECK() AT_CUDA_CHECK(cudaGetLastError())
```
which clearly indicates the check.
This diff also introduces a new test which analyzes code to identify kernel launches and determines whether the line immediately following the launch contains `TORCH_CUDA_KERNEL_LAUNCH_CHECK();`.
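A hedged Python sketch of what such a check could look like (the actual test may differ):
```python
import re

# Find CUDA triple-angle-bracket launches and report any launch whose next
# statement is not the check macro.
LAUNCH_RE = re.compile(r"<<<[^>]*>>>\s*\([^;]*\)\s*;")

def unchecked_launches(source: str):
    unchecked = []
    for match in LAUNCH_RE.finditer(source):
        rest = source[match.end():].lstrip()
        if not rest.startswith("TORCH_CUDA_KERNEL_LAUNCH_CHECK();"):
            unchecked.append(match.group(0))
    return unchecked
```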
A search of the Caffe2 codebase identifies 104 instances of `AT_CUDA_CHECK(cudaGetLastError());` while the foregoing test identifies 1,467 launches which are not paired with a check. Visual inspection indicates that few of these are false positives, highlighting the need for some sort of static analysis system.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46474
Test Plan:
The new test is run with:
```
buck test //caffe2/test:kernel_launch_checks -- --print-passing-details
```
And should be launched automatically with the other land tests. (TODO: Is it?)
The test is currently set up only to provide warnings but can later be adjusted to require checks.
Otherwise, I rely on the existing test frameworks to ensure that changes resulting from reorganizing existing launch checks don't cause regressions.
Reviewed By: ngimel
Differential Revision: D24309971
Pulled By: r-barnes
fbshipit-source-id: 0dc97984a408138ad06ff2bca86ad17ef2fdf0b6
Summary:
Hello there 👋
I believe there is a typo in the typing of the `bool` argument of the `_ConvNd` constructor.
The typing of the attribute is correct, but the constructor argument, while typed the same way, is not the value that will be assigned to `self.bias`.
This PR simply corrects that.
Any feedback is welcome!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46828
Reviewed By: izdeby
Differential Revision: D24550435
Pulled By: ezyang
fbshipit-source-id: ab10f1a5b29a912cb23fc321a51e78b04a8391e3
Summary:
`graph` is automatically cached even when the underlying graph changes; this PR hard-codes a fix for that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46960
Reviewed By: mrshenli
Differential Revision: D24582185
Pulled By: bwasti
fbshipit-source-id: 16aeeba251830886c92751dd5c9bda8699d62803
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46701
Provide 2 built-in implementations of C++ comm hook.
Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115319061
Test Plan: waitforbuildbot
Reviewed By: pritamdamania87
Differential Revision: D24382504
fbshipit-source-id: 1c1ef56620f91ab37a1707c5589f1d0eb4455bb3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46566
Only provides an interface. Some built-in implementations will be provided in a follow-up commit.
Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115319038
Test Plan: waitforbuildbot
Reviewed By: pritamdamania87
Differential Revision: D24379460
fbshipit-source-id: 8382dc4185c7c01d0ac5b3498e1bead785bccec5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46786
Previously we only supported static quant; this PR adds support for other types of quantization.
Note that QAT is actually orthogonal to these quant types; this refers to the convert step, where we
convert the observed module to a quantized module.
For QAT, the user will provide a CustomModule -> FakeQuantizedCustomModule mapping in prepare_custom_config_dict
and a FakeQuantizedCustomModule -> static/dynamic/weight_only quantized CustomModule mapping in convert_custom_config_dict, sketched below.
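Roughly, the mapping shape described above (class names and dict keys here are hypothetical placeholders, not the exact fx API):
```python
import torch.nn as nn

# Hypothetical module classes standing in for a user's custom modules.
class CustomModule(nn.Module): ...
class FakeQuantizedCustomModule(nn.Module): ...
class QuantizedCustomModule(nn.Module): ...

prepare_custom_config_dict = {
    "float_to_observed_custom_module_class": {
        CustomModule: FakeQuantizedCustomModule,
    },
}
convert_custom_config_dict = {
    "observed_to_quantized_custom_module_class": {
        FakeQuantizedCustomModule: QuantizedCustomModule,
    },
}
```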
Test Plan: Imported from OSS
Reviewed By: raghuramank100
Differential Revision: D24514701
fbshipit-source-id: 2918be422dd76093d67a6df560aaaf949b7f338c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46751
Currently we assume the first input for add/mul is a Node (Tensor), but that might not be the case.
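For illustration, the kind of pattern this covers (toy module, not from the PR):
```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        # The first input to add is a Python scalar, not a Node/Tensor.
        return 1.0 + x
```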
Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_quantized_add
python test/test_quantization.py TestQuantizeFxOps.test_quantized_mul
python test/test_quantization.py TestQuantizeFxOps.test_quantized_add_relu
python test/test_quantization.py TestQuantizeFxOps.test_quantized_mul_relu
Imported from OSS
Reviewed By: raghuramank100
Differential Revision: D24494456
fbshipit-source-id: ef5e23ba60eb22a57771791f4934306b25c27c01
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46772
When running `buck run caffe2/benchmarks/operator_benchmark/pt:qactivation_test -- --use_jit`, I encountered the following error P146518683. The error was traced down to the fact that `torch.allclose` does not work with quantized tensors (the error was triggered by this particular multiplication https://fburl.com/diffusion/8vw647o6, since native mul cannot work with a float scalar and a quantized tensor).
Minimum example to reproduce:
```
(Pdb) input = torch.ones(5)
(Pdb) aa = torch.quantize_per_tensor(input, scale=1.0, zero_point=0, dtype=torch.quint8)
(Pdb) bb = torch.quantize_per_tensor(input, scale=1.0, zero_point=0, dtype=torch.quint8)
(Pdb) torch.allclose(aa, bb)
Comparison exception: promoteTypes with quantized numbers is not handled yet; figure out what the correct rules should be, offending types: QUInt8 Float
```
Here the proposed fix is to compare quantized tensors strictly within `_compare_tensors_internal`.
The other two possible fixes are:
1. convert quantized tensors to float tensors before sending them to `torch.allclose` (sketched below)
2. change `torch.allclose` to handle quantized tensors.
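For illustration, alternative 1 applied to the repro above would look roughly like:
```python
import torch

x = torch.ones(5)
aa = torch.quantize_per_tensor(x, scale=1.0, zero_point=0, dtype=torch.quint8)
bb = torch.quantize_per_tensor(x, scale=1.0, zero_point=0, dtype=torch.quint8)

# Dequantize before comparing in floating point.
torch.allclose(aa.dequantize(), bb.dequantize())  # True
```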
Test Plan: buck run caffe2/benchmarks/operator_benchmark/pt:qactivation_test -- --use_jit
Reviewed By: kimishpatel
Differential Revision: D24506723
fbshipit-source-id: 6426ea2a88854b4fb89abef0edd2b49921283796
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46895
Bug: models after the FX graph mode quant prepare step lost information,
such as the extra attributes defined in `Quantizer.save_state`,
if the user performed `copy.deepcopy` on them. The information was lost
because `GraphModule` does not copy attributes which are not present on
`nn.Module` by default.
Fix: define a custom `__deepcopy__` method on observed models and
whitelist the attributes we care about.
This is needed because users sometimes run `copy.deepcopy` on their
models during non-quantization related preparations, and we should make
sure that quantization related state survives these calls.
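A hedged sketch of the approach (attribute names are hypothetical; the real whitelist comes from `Quantizer.save_state`):
```python
import copy
import torch
import torch.fx

class ObservedGraphModule(torch.fx.GraphModule):
    # Hypothetical stand-ins for the quantization state to preserve.
    _preserved_attrs = ("_qconfig_map", "_patterns")

    def __deepcopy__(self, memo):
        # GraphModule's default copy keeps only standard nn.Module state, so
        # rebuild from a throwaway module carrying the full __dict__, then
        # restore the whitelisted attributes explicitly.
        fake = torch.nn.Module()
        fake.__dict__ = copy.deepcopy(self.__dict__, memo)
        copied = ObservedGraphModule(fake, self.graph)
        for name in self._preserved_attrs:
            if hasattr(self, name):
                setattr(copied, name, copy.deepcopy(getattr(self, name)))
        return copied
```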
Test Plan:
```
python test/test_quantization.py TestQuantizeFx.test_deepcopy
python test/test_quantization.py TestQuantizeFx.test_standalone_module
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D24556035
fbshipit-source-id: f7a6b28b6d2225fa6189016f967f175f6733b124
Summary:
WIP: This PR adds sparse_nn_partition to the Partitioner class. It includes logical device assignment for all DAG nodes. The basic idea is to run size_based_partition separately for embedding nodes and non-embedding nodes. A unit test is also added in test_fx_experimental.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46390
Reviewed By: gcatron
Differential Revision: D24555415
Pulled By: scottxu0730
fbshipit-source-id: 8772af946d5226883759a02a1c827cfdfce66097
Summary:
This is the second attempt at replacing flatten tensors with flatten loops in `TensorExprKernel::generateStmt`. The first attempt (https://github.com/pytorch/pytorch/pull/46539) resulted in a build failure due to an exception that gets thrown during inline.
The build failure occurred because of an inline step that was supposed to happen on the unflattened tensors. That step was necessary earlier because every flattened tensor had a corresponding unflattened tensor that had to be inlined; it is no longer needed now that we do not keep two tensors (flattened and unflattened). Removed this inline.
Checked Python and C++ tests on CPU as well as CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46737
Reviewed By: anjali411, izdeby
Differential Revision: D24534529
Pulled By: navahgar
fbshipit-source-id: 8b131a6be076fe94ed369550d9f54d3879fdfefd
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/5029
Support single element tuples in to_backend
Test Plan: new unit test for to_glow
Reviewed By: andrewmillspaugh
Differential Revision: D24539869
fbshipit-source-id: fb385a7448167b2b948e70f6af081bcf78f338dc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46795
Add an fburl link to the error message for missing ops so users can debug the issue themselves.
Test Plan: fburl.com/missing_ops
Reviewed By: iseeyuan
Differential Revision: D24519992
fbshipit-source-id: d2d16db7e9d9c84ce2c4600532eb253c30b31971
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46512
1. Merge 1-line PythonCommHook constructor into the header for simplicity.
2. Move the implementation of PythonCommHook destructor from the header file to cpp file.
3. Rename the processFuture method to parseHookResult for readability.
4. Simplify some comments.
Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115161086
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_sparse_gradients
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_with_then_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_future_passing_gpu_gloo
Reviewed By: jiayisuse
Differential Revision: D24374282
fbshipit-source-id: c8dbdd764bca5b3fa247708f1218cb5ff3e321bb