Summary:
This is a second attempt to use the graph executor to run forward on a gradient graph. This gives us a second chance to profile intermediate tensors introduced by autodiff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52136
Reviewed By: pbelevich
Differential Revision: D26693978
Pulled By: Krovatkin
fbshipit-source-id: 91dde8009a210950af8e5173668ada241e16dd52
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52534
Currently, linear_dynamic_fp16 has a signature that's tied to fbgemm/qnnpack.
We'll need to produce a pattern equivalent to linear_dynamic_fp16 to support extensions
to other backends.
Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_linear_dynamic_fp16
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D26557726
fbshipit-source-id: 270c9f781f73c79416a092b7831294cabca84b0c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52619
Runs this test suite with nccl_async_error_handling enabled. It is the
default for many distributed training jobs, and can also help catch
errors/hangs in tests more easily. We don't expect any changes in the
existing tests since they shouldn't have any hangs.
Also removes a commented-out line.
ghstack-source-id: 122595646
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D26588108
fbshipit-source-id: a57bbe2ae5a0c86731d77be45756b17151618eb6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52779
1. makes the return type of the weight comparison APIs match the return
type of the activation comparison APIs:
```
# before
{layer_name: {model_name: weight_tensor}}
{layer_name: {model_name: [activation_tensor]}}
# after
{layer_name: {model_name: [weight_tensor]}}
{layer_name: {model_name: [activation_tensor]}}
```
2. makes a type alias for the type, so future changes are easier
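For illustration, a minimal sketch of what such an alias could look like (the alias name `NSComparisonResultType` is an assumption, not the actual code):
```python
from typing import Dict, List

import torch

# hypothetical alias for the shared return type of the weight and
# activation comparison APIs:
#   {layer_name: {model_name: [comparison_tensor, ...]}}
NSComparisonResultType = Dict[str, Dict[str, List[torch.Tensor]]]
```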
Test Plan:
```
mypy torch/quantization
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```
Imported from OSS
Reviewed By: hx89
Differential Revision: D26652639
fbshipit-source-id: eb1f04d6913cedf88d628f362468875ae9ced928
Summary:
Addresses one item in https://github.com/pytorch/pytorch/issues/46321
## Background
This is a test version of the RL RPC example defined [here](https://github.com/pytorch/examples/blob/master/distributed/rpc/rl/main.py) and [here](https://pytorch.org/tutorials/intermediate/rpc_tutorial.html), with the following differences:
* It defines and uses a `DummyEnv` to avoid a dependency on `gym`. The `DummyEnv` simply returns random states & rewards for a small number of iterations.
* It removes the `ArgumentParser` and utilizes `RpcAgentTestFixture` + hard-coded constants for configuration and launching.
* It changes the worker names to match what the internal Thrift RPC tests expect.
The code is purposefully kept very similar to the original example code outside of these differences.
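As a rough sketch of the kind of `DummyEnv` described above (the state dimension, episode length, and return shape here are assumptions, not the actual test code):
```python
import torch

class DummyEnv:
    """Stand-in for gym: random states & rewards for a fixed number of steps."""
    def __init__(self, state_dim=4, num_iters=10):
        self.state_dim = state_dim
        self.num_iters = num_iters
        self.step_count = 0

    def reset(self):
        self.step_count = 0
        return torch.randn(self.state_dim)

    def step(self, action):
        self.step_count += 1
        state = torch.randn(self.state_dim)
        reward = torch.rand(1).item()
        done = self.step_count >= self.num_iters
        return state, reward, done
```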
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52393
Test Plan:
```
pytest test/distributed/rpc/test_tensorpipe_agent.py -k test_rl_rpc -vs
pytest test/distributed/rpc/test_process_group_agent.py -k test_rl_rpc -vs
```
Reviewed By: glaringlee
Differential Revision: D26515435
Pulled By: jbschlosser
fbshipit-source-id: 548548c4671fe353d83c04108580d807108ca76e
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51730
I've added `scatter_add` and `scatter_add.dimname` to the promote list, as well as test cases for the former op.
However, it seems that `scatter_add` [doesn't support named tensors yet](8b0cb5ede3/aten/src/ATen/native/NamedTensor.cpp (L356-L358)) (thanks t-vi for the pointer):
```python
dev = 'cuda'
torch.scatter_add(torch.zeros(2, 2, 2, dtype=torch.float16, device=dev, names=('N', 'C', 'L')),
'C',
torch.randint(0, 2, (2, 2, 2), device=dev),
torch.randn((2, 2, 2), dtype=torch.float32, device=dev))
> RuntimeError: scatter_add: You passed a dimname (string) to this op in place of a dimension index but it does not yet support this behavior. Please pass a dimension index to work around this.
```
which raised this error after adding this test case.
I'm thus unsure whether I should also remove `scatter_add.dimname` from the promote list.
In any case, once named tensors are supported, a potential test could be added as:
```python
("scatter_add", (torch.zeros(2, 2, 2, dtype=torch.float16, device=dev, names=('N', 'C', 'L')),
'C',
torch.randint(0, 2, (2, 2, 2), device=dev),
torch.randn((2, 2, 2), dtype=torch.float32, device=dev))),
```
CC mcarilli ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52133
Reviewed By: ejguan
Differential Revision: D26440392
Pulled By: ngimel
fbshipit-source-id: f4ee2d0b9e1f81afb6f94261c497cf2bf79ec115
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52804
`rpc.get_worker_info` used to only take a string in v1.6. We recently
allowed it to accept `int` and `WorkerInfo` as well, but the previous check
on `worker_name` is no longer correct. This commit adds an explicit
`not None` check.
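One likely pitfall with the old check is truthiness: in Python the integer 0 (a valid rank) is falsy. A small illustration of why an explicit `not None` check matters (a sketch, not the actual diff):
```python
def is_given_truthy(worker_name=None):
    return bool(worker_name)        # old-style check (sketch)

def is_given_not_none(worker_name=None):
    return worker_name is not None  # new explicit check (sketch)

# rank 0 is a valid worker id but is falsy, so the old check misfires:
assert not is_given_truthy(0)
assert is_given_not_none(0)
```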
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D26655089
Pulled By: mrshenli
fbshipit-source-id: fa1545bd6dd2b33bc1e919de46b94e799ab9719c
Summary:
This way, we can have a mapping from the test files we directly execute (the tests [here](https://github.com/pytorch/pytorch/blob/master/test/run_test.py#L20)) to the test suites that we store data for in XML reports.
This will come in use later for categorizing the tests we run in CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52791
Reviewed By: samestep
Differential Revision: D26655086
Pulled By: janeyx99
fbshipit-source-id: 94be32f80d7bc0ea1a7a11d4c4b1d3d8e774c5ea
Summary:
Apple recently announced ML Compute, a new framework available in macOS Big Sur, which enables users to accelerate the training of neural networks on Mac hardware. This PR is the first in a series of PRs that will enable the integration with ML Compute. Most of the integration code will live in a separate subrepo named `mlc`.
The integration with `mlc` (ML Compute) will be very similar to that of xla. We rely on registering our ops through:
```cpp
TORCH_LIBRARY_IMPL(aten, PrivateUse1, m) {
  m.impl_UNBOXED(<op_schema_name>, &customized_op_kernel)
  ...
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50634
Reviewed By: malfet
Differential Revision: D26614213
Pulled By: smessmer
fbshipit-source-id: 3b492b346c61cc3950ac880ac01a82fbdddbc07b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51807
Implemented `torch.linalg.multi_dot`, similar to [numpy.linalg.multi_dot](https://numpy.org/doc/stable/reference/generated/numpy.linalg.multi_dot.html).
This function does not support broadcasting or batched inputs at the moment.
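For reference, a small usage sketch (arbitrary shapes):
```python
import torch

A = torch.randn(10, 100)
B = torch.randn(100, 5)
C = torch.randn(5, 50)

# multi_dot chooses the cheapest multiplication order automatically
out = torch.linalg.multi_dot([A, B, C])
assert out.shape == (10, 50)
```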
**NOTE**
numpy.linalg.multi_dot allows the first and last tensors to have more than 2 dimensions despite their docs stating these must be either 1D or 2D. This PR diverges from NumPy in that it enforces this restriction.
**TODO**
- [ ] Benchmark against NumPy
- [x] Add OpInfo testing
- [x] Remove unnecessary copy for out= argument
Test Plan: Imported from OSS
Reviewed By: nikithamalgifb
Differential Revision: D26375734
Pulled By: heitorschueroff
fbshipit-source-id: 839642692424c4b1783606c76dd5b34455368f0b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52632
Distributed tests run in a multiprocessing environment, where a parent
process drives the tests through several child processes. As a result, when a
child process fails, the parent only prints the following:
```
Process 0 exited with error code 10
```
The child process also logs its own exception, but it is cumbersome to go
through the logs and track it down.
To alleviate this, I've added a bunch of pipes for each child process so that
the child process writes the error to the pipe before exiting and the parent
process can read the appropriate error from the pipe and display it.
The new output printed by the parent is as follows:
```
> RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
File "torch/testing/_internal/common_distributed.py", line 361, in _run
getattr(self, test_name)()
File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
fn()
File "test_c10d.py", line 789, in test_broadcast_checks
pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1
Process 1 exited with error code 10 and exception:
Traceback (most recent call last):
File "torch/testing/_internal/common_distributed.py", line 361, in _run
getattr(self, test_name)()
File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
fn()
File "test_c10d.py", line 789, in test_broadcast_checks
pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1
Process 2 exited with error code 10 and exception:
Traceback (most recent call last):
File "torch/testing/_internal/common_distributed.py", line 361, in _run
getattr(self, test_name)()
File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
fn()
File "test_c10d.py", line 789, in test_broadcast_checks
pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1
Process 3 exited with error code 10 and exception:
Traceback (most recent call last):
File "torch/testing/_internal/common_distributed.py", line 361, in _run
getattr(self, test_name)()
File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
fn()
File "test_c10d.py", line 789, in test_broadcast_checks
pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1
```
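Conceptually, the pipe mechanism looks something like this simplified sketch (the real code in torch/testing/_internal/common_distributed.py differs in detail):
```python
import multiprocessing as mp
import traceback

def _run_child(test_fn, error_pipe):
    try:
        test_fn()
    except Exception:
        # write the traceback into the pipe before exiting with an error code
        error_pipe.send(traceback.format_exc())
        raise SystemExit(10)

def run_test(test_fn):
    parent_conn, child_conn = mp.Pipe()
    proc = mp.Process(target=_run_child, args=(test_fn, child_conn))
    proc.start()
    proc.join()
    if proc.exitcode != 0 and parent_conn.poll():
        # surface the child's traceback from the parent process
        raise RuntimeError(
            f"Process exited with error code {proc.exitcode} and exception:\n"
            + parent_conn.recv()
        )
```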
ghstack-source-id: 122273793
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D26589274
fbshipit-source-id: 7b7a71ec790b216a89db7c157377f426531349a5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52391
There are 2 ways DDP can throw the exception refactored here:
1) Unused params in the forward pass. We provide `find_unused_parameters=True` for this.
2) All params used in fwd pass, but not all outputs used in loss computation. There are a few workarounds for this but we do not provide native support.
Previously, these 2 issues were combined into 1 error message, but that has historically resulted in confusion, with users reporting getting this error even when they enable `find_unused_parameters=True` (which they expect to fix this error). As a result, there is additional churn when debugging these issues because the true cause, (1) vs. (2), is not known.
This commit helps to fix the issue by separating out the 2 error messages depending on whether we ran with unused parameter detection or not. Hopefully this makes the error messages much clearer and more actionable.
error msg with `find_unused_params=True`:
```
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. Since `find_unused_parameters=True` is enabled, this likely means that not all `forward` outputs participate in computing loss. You can fix this by making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
```
error msg without `find_unused_params` specified:
```
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
```
ghstack-source-id: 122097900
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D26496688
fbshipit-source-id: 4a9eeeda10293da13d94a692d10cb954e4506d7c
Summary:
This PR adds functionality to skip a test based on CUDA version.
This way, we can be more specific when skipping a test, such as when the test only fails for a particular CUDA version.
This allows us to re-enable, for other CUDA versions such as 10.1 and 11.1, the tests that were skipped for CUDA 11.2.
I tested this locally (by using 11.0 instead of 11.2), but will run all the CI to make sure it works.
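For illustration, a sketch of the kind of helper this adds (the decorator name and signature here are assumptions; the real helper may differ):
```python
import unittest

import torch

def skip_if_cuda_version(versions, reason="skipped for this CUDA version"):
    # torch.version.cuda is e.g. "11.2", or None on CPU-only builds
    cuda = torch.version.cuda
    return unittest.skipIf(cuda is not None and cuda in versions, reason)

# usage (hypothetical):
# @skip_if_cuda_version(["11.2"])
# def test_foo(self): ...
```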
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52359
Reviewed By: walterddr
Differential Revision: D26487951
Pulled By: janeyx99
fbshipit-source-id: 45c71cc6105ffd9985054880009cf68ea5ef3f6a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52179
Rename debug to reference. We'll use this to produce a reference quantized model
that can be used as a common interface between the PyTorch quantized model and backends.
Test Plan:
python test/test_quantization.py TestQuantizeFx
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D26424656
fbshipit-source-id: a0299b023f6ba7d98f5750724c517b0ecb987b35
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51386
add stats such as rebuilt bucket stats, unused parameter stats, and performance stats to the ddp logging data
1. GPU time stats are not collected for single process multiple devices in this diff, as that requires events to be created and recorded on multiple devices
2. use the at::cuda::event API for safer calls (see the sketch after this list)
3. events may not be created in the autograd hook if the hook is not triggered by the user's code, e.g., the user runs in non-sync mode in some iterations. So we check whether events were created before synchronizing, and skip invalid results.
4. users may not set the device upfront, so we explicitly set the proper device before creating events in our prepare_forward() and prepare_backward() calls
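For context, a Python-level sketch of the event-based timing pattern involved (the actual change uses the C++ at::cuda::event API; this is an illustration, not the diff's code):
```python
import torch

if torch.cuda.is_available():
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    torch.randn(1024, 1024, device="cuda") @ torch.randn(1024, 1024, device="cuda")
    end.record()

    # synchronize before reading timings; skip if events were never recorded
    torch.cuda.synchronize()
    print(f"elapsed: {start.elapsed_time(end):.3f} ms")
```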
ghstack-source-id: 121933566
Test Plan: unit tests
Reviewed By: SciPioneer
Differential Revision: D26158645
fbshipit-source-id: ce5f15187802eba76accb980449be68902c10178
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52384
Adds a simple UT with unittest that we can modify once we enable DDP backward without requiring all parameters to receive gradients.
ghstack-source-id: 122001930
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D26482479
fbshipit-source-id: c80bdeea7cf9db35390e385084ef28d64ed239eb
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50790.
Added `min()` & `max()` support for `Float16` & `BFloat16`.
CUDA already supported these ops on `Float16`, so the other three combinations had to be enabled.
`OpInfo`s for `min` & `max` were also added, and their sample inputs were removed from `method_tests()`.
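For reference, a quick illustration of the newly enabled combinations (a sketch, assuming a build containing this PR):
```python
import torch

# CPU min()/max() on BFloat16 and Float16, enabled by this PR
x = torch.randn(5).to(torch.bfloat16)
y = torch.randn(5).to(torch.float16)
print(x.min(), x.max())
print(y.min(), y.max())
```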
### MORE INFO
The (slightly) long-term goal is to add dispatch for `min()` & `max()` related operations on CPU & CUDA for `Float16` & `BFloat16`,
wherever they aren't present already:
1. `amin()`
2. `argmax()`
3. `amax()`
4. `argmin()`
5. `torch._aminmax()`
6. `torch.clamp()` on CPU. Was already supported on CUDA
7. `min()` (in this PR)
8. `max()` (in this PR)
9. `minimum()`
10. `maximum()`
I'll submit separate PRs for the other ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51244
Reviewed By: jbschlosser
Differential Revision: D26503455
Pulled By: anjali411
fbshipit-source-id: c32247f214e9272ca2e4322a23337874e737b140
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52264
When CPU fusion is enabled without LLVM support in PyTorch, it causes a huge slowdown (> 50x). This PR makes the LLVM backend the default backend for TE. An error is now reported if CPU fusion is enabled without LLVM support, to avoid this performance regression.
This PR also updates the tests to not use LLVM, so that the old flow continues to be exercised. This is necessary because the tests run in CI do not have LLVM.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52314
Reviewed By: ejguan
Differential Revision: D26491294
Pulled By: navahgar
fbshipit-source-id: 74561db1207da805d6d28039450db046ba2988fb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52130
We have patterns like (F.linear, F.relu) which need to match
to (toq.linear_relu). So, we need to match subgraphs.
This PR does the following:
* defines a "subgraph" as (start_node, end_node). The current assumption
is that subgraphs are simple, there is always a path from start_node to
end_node, and we can ignore any non-input args/kwargs of these nodes
for the purposes of matching and copying things. An example one node
subgraph is (F.linear, F.linear). An example two node subgraph
is (F.linear, F.relu).
* changes the matching logic to iterate over subgraphs instead of nodes
* changes the NS core APIs to use subgraph pairs instead of node pairs:
1. for weights, we match on the start node
2. for unshadowed activations, we observe the end nodes
3. for shadowed activations, we copy the subgraph of a to graph c
TODO(before review) write up better, not ready for review yet
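For illustration, the (start_node, end_node) pair could be represented like this (hypothetical class name; the actual NS code may differ):
```python
from typing import NamedTuple

import torch.fx

# hypothetical representation of the subgraph pairs described above
class NSSubgraph(NamedTuple):
    start_node: torch.fx.Node
    end_node: torch.fx.Node

# a one-node subgraph has start_node == end_node (e.g. a lone F.linear call);
# a two-node subgraph spans e.g. an F.linear call into the following F.relu
```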
Test Plan:
TODO before land: better test plan
Imported from OSS
Reviewed By: raghuramank100
Differential Revision: D26403092
fbshipit-source-id: e49aaad4b02b8d60589435848bee422b8f41937a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52092
Adds a very simple toy sparsenn model, and enables
its inspection with the new NS APIs.
Test Plan:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_sparsenn_compare_activations
python test/test_quantization.py TestFXNumericSuiteCoreAPIs.test_sparsenn_shadow
```
Imported from OSS
Reviewed By: raghuramank100
Differential Revision: D26403095
fbshipit-source-id: 3c3650aca47186deb32f2b3f1d87a0716d1ad9d1
Summary:
Now that `masked_fill` CUDA is migrated, the skips on `masked_scatter` can be removed.
Reference: https://github.com/pytorch/pytorch/issues/33152
**Note**:
The tensor shapes for `masked_scatter` were decreased from (M, M) to (S, S) and so on.
With shapes of M: **96.53s**
```
test/test_ops.py ........................................ssssssssssss........................ssssssssssss........................ [100%]
=============================================================== 88 passed, 24 skipped, 7981 deselected in 96.53s (0:01:36) ================================================================
```
With shapes of S: **46.53s**
```
test/test_ops.py ........................................ssssssssssss........................ssssssssssss........................ [100%]
==================================================================== 88 passed, 24 skipped, 7981 deselected in 46.53s =====================================================================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52035
Reviewed By: VitalyFedyunin
Differential Revision: D26369476
Pulled By: anjali411
fbshipit-source-id: 7a79d5a609b0019f8fe9ce6452924dd33390dce1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52377
Add QNNPACK specific packed params for sparse linear.
Add sparse linear dynamic op with appropriate registration.
Add python side LinearDynamic module for sparsity.
Add tests to validate sparse linear qnnpack kernels.
Note that since these tests are mostly run on x86 platforms, and
given that the 1x4 sparse kernels are implemented in both SSE and ARM,
LinearDynamic at the moment defaults to the 1x4 pattern.
The plan is to add another diff that will allow a global override for the 8x1 pattern
so that the prepare/convert flow can work for exporting models for mobile.
Test Plan: buck run caffe2/torch/fb/model_optimization:sparsity_test
Reviewed By: z-a-f
Differential Revision: D26491944
fbshipit-source-id: b98839b4c62664e1fabbb0cbeb2e5c1bd5903b4d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52031
Closes https://github.com/pytorch/pytorch/issues/52020
Ensures that we can profile collectives in DDP by propagating the profiler threadLocalState appropriately. As described in the above issue, this previously wouldn't work because the profiler would only be enabled on the main thread.
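For reference, the kind of end-to-end profiling this enables (a minimal single-process sketch, not this diff's test code):
```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# minimal world_size=1 setup so the example is self-contained
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(8, 8))
with torch.autograd.profiler.profile() as prof:
    model(torch.randn(4, 8)).sum().backward()  # collectives are now recorded too
print(prof.key_averages().table(sort_by="cpu_time_total"))

dist.destroy_process_group()
```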
ghstack-source-id: 121818080
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D26356192
fbshipit-source-id: 0158b5833a3f857a0b4b2943ae3037e9d998dfd1
Summary:
Add QNNPACK specific packed params for sparse linear.
Add sparse linear dynamic op with appropriate registration.
Add python side LinearDynamic module for sparsity.
Add tests to validate sparse linear qnnpack kernels.
Note that since these tests are mostly run on x86 platforms, and
given that the 1x4 sparse kernels are implemented in both SSE and ARM,
LinearDynamic at the moment defaults to the 1x4 pattern.
The plan is to add another diff that will allow a global override for the 8x1 pattern
so that the prepare/convert flow can work for exporting models for mobile.
Test Plan: buck run caffe2/torch/fb/model_optimization:sparsity_test
Reviewed By: z-a-f
Differential Revision: D26263480
fbshipit-source-id: 04ab60aec624d1ecce8cfb38b79c7e94f501cdf6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52302
Adds the basic functionality for the three Numeric Suite core APIs to work on FX models:
1. comparing weights
2. comparing activations, with same input fed to both models
3. comparing activations, with nodes of A shadowing nodes of B
Note: there are a lot of TODOs in the code, and some/most of the APIs and implementation details may change as we iterate. This is just the first PR.
Test Plan:
We have unit test coverage for all of the APIs; for now this is with toy models:
```
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
```
Reviewed By: raghuramank100
Differential Revision: D26463013
Pulled By: vkuzo
fbshipit-source-id: e454115099ad18e4037d3c54986951cdffcab367
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51648
The following code will throw during the call to `traced(5)`:
```python
import torch
import torch.fx as fx
from torch import nn

class M(nn.Module):
    def __init__(self):
        super(M, self).__init__()
        self.W = torch.nn.Parameter(torch.randn(5))

    def forward(self, x):
        return torch.dot(self.W, x)

traced = fx.symbolic_trace(M())
traced(5)
```
Traceback before:
```
Traceback (most recent call last):
File "test/tinytest.py", line 26, in <module>
traced(5)
File "/home/ansley/local/pytorch/torch/fx/graph_module.py", line 338, in wrapped_call
return self._cls_call(self, *args, **kwargs)
File "/home/ansley/local/pytorch/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "<eval_with_key_0>", line 4, in forward
TypeError: dot(): argument 'tensor' (position 2) must be Tensor, not int
```
Traceback after:
```
Traceback (most recent call last):
File "/home/ansley/local/pytorch/torch/fx/graph_module.py", line 338, in wrapped_call
return torch.nn.Module.__call__(self, *args, **kwargs)
File "/home/ansley/local/pytorch/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "<eval_with_key_1>", line 4, in forward
dot_1 = torch.dot(w, x); w = x = None
TypeError: dot(): argument 'tensor' (position 2) must be Tensor, not int
Call using an FX-traced Module, line 4 of the traced Module’s generated forward function:
w = self.W
dot_1 = torch.dot(w, x); w = x = None
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
relu_1 = dot_1.relu(); dot_1 = None
return relu_1
```
(Note that the same `TypeError` is thrown despite modifying the traceback.)
Test Plan: Imported from OSS
Reviewed By: jamesr66a
Differential Revision: D26424005
Pulled By: ansley
fbshipit-source-id: 368f46ba81fb3111bd09654825bb2ac5595207d1