Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44000
This wasn't documented, so add a doc saying all ranks are used when
ranks=None
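For illustration, a minimal sketch of the documented behavior (assuming this refers to `torch.distributed.new_group`; the single-process gloo setup is only to make the snippet self-contained):
```python
import os
import torch.distributed as dist

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"
dist.init_process_group("gloo", rank=0, world_size=1)

# With ranks=None, the new group contains all ranks of the default group.
group = dist.new_group(ranks=None)
```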
ghstack-source-id: 111206308
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D23465034
fbshipit-source-id: 4c51f37ffcba3d58ffa5a0adcd5457e0c5676a5d
Summary:
* Implement tuple sort by traversing contained IValue types and generating a lambda function as the comparator for sort.
* Tuples and class objects can now nest arbitrarily within each other and still be sortable
Fixes https://github.com/pytorch/pytorch/issues/43219
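A minimal sketch of the newly sortable case:
```python
import torch
from typing import List, Tuple

@torch.jit.script
def sort_pairs(pairs: List[Tuple[int, str]]) -> List[Tuple[int, str]]:
    # Tuples are compared element-wise, as in Python; the generated
    # comparator traverses the contained IValue types.
    return sorted(pairs)

print(sort_pairs([(3, "c"), (1, "a"), (2, "b")]))  # [(1, 'a'), (2, 'b'), (3, 'c')]
```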
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43448
Reviewed By: eellison
Differential Revision: D23352273
Pulled By: gmagogsfm
fbshipit-source-id: b6efa8d00e112178de8256da3deebdba7d06c0e1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44773
The model is created and prepared using fx APIs and then scripted for training.
In order to test QAT on a scripted model, we need to be able to disable/enable
its fake_quant and observer modules.
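A hedged sketch of the toggling being tested (using the eager prepare_qat path rather than the fx API mentioned above, purely to keep the snippet self-contained):
```python
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, 1)

    def forward(self, x):
        return self.conv(x)

m = M().train()
m.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
m = torch.quantization.prepare_qat(m)
scripted = torch.jit.script(m)

# Toggle fake-quant and observer submodules on the scripted model.
scripted.apply(torch.quantization.disable_fake_quant)
scripted.apply(torch.quantization.disable_observer)
scripted.apply(torch.quantization.enable_fake_quant)
scripted.apply(torch.quantization.enable_observer)
```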
Test Plan:
python test/test_quantization.py TestQuantizeFx.test_qat_and_script
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23741354
fbshipit-source-id: 3fee7aa9b049d9901313b977710f4dc1c4501532
Summary:
[test all]
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44330
Part of relanding PR #41954, this refactor separates initialize_bucket_views and populate_bucket_views_out, as they do different things and are called by different callsites as well
ghstack-source-id: 112257271
Test Plan: unit tests
Reviewed By: mrshenli
Differential Revision: D23583347
fbshipit-source-id: a5f2041b2c4f2c2b5faba1af834c7143eaade938
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44393
torch.quantile now correctly propagates NaN, and torch.nanquantile is implemented, similar to numpy.nanquantile.
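A small illustration of the behavior described:
```python
import torch

t = torch.tensor([1., 2., float("nan"), 4.])
print(torch.quantile(t, 0.5))     # tensor(nan): NaN propagates
print(torch.nanquantile(t, 0.5))  # tensor(2.): NaN values are ignored
```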
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D23649613
Pulled By: heitorschueroff
fbshipit-source-id: 5201d076745ae1237cedc7631c28cf446be99936
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33394 .
This PR does two things:
1. Implement CUDA scatter reductions with revamped GPU atomic operations.
2. Remove support for divide and subtract for CPU reduction, as was discussed with ngimel.
I've also updated the docs to reflect the existence of only multiply and add.
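For illustration, a minimal sketch of the remaining reductions (assuming the in-place `scatter_` overload with a `reduce` argument):
```python
import torch

base = torch.zeros(3)
index = torch.tensor([0, 1, 0])
src = torch.tensor([1., 2., 3.])

# Only "add" and "multiply" are supported; "divide" and "subtract" are gone.
base.scatter_(0, index, src, reduce="add")
print(base)  # tensor([4., 2., 0.])
```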
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41977
Reviewed By: mruberry
Differential Revision: D23748888
Pulled By: ngimel
fbshipit-source-id: ea643c0da03c9058e433de96db02b503514c4e9c
Summary:
Enabled type checking in common_distributed by using tensors of ints
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44821
Test Plan: Run python test/test_type_hints.py; errors are no longer ignored by mypy.ini
Reviewed By: walterddr
Differential Revision: D23747466
Pulled By: alanadakotashine
fbshipit-source-id: 820fd502d7ff715728470fbef0be90ae7f128dd6
Summary:
Adds a new optimization to the IRSimplifier which changes this pattern:
```
for ...
if ...
do thing;
```
into:
```
if ...
for ...
do thing;
```
This should be almost strictly better.
There are many cases where this isn't safe to do, hence the tests; most obviously, when the condition depends on something modified within the loop.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44764
Reviewed By: mruberry
Differential Revision: D23734463
Pulled By: nickgg
fbshipit-source-id: 51617e837de96b354fb702d0090ac65ddc523d36
Summary:
PyObject_IsSubclass may set the Python live-exception bit if the given object is not a class. `IsNamedTuple` is currently using it incorrectly, which may trip up all subsequent Python operations in a debug-build Python. A normal release-build Python is not affected because `assert` is a no-op in release builds.
Fixes https://github.com/pytorch/pytorch/issues/43577
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44769
Reviewed By: jamesr66a
Differential Revision: D23725584
Pulled By: gmagogsfm
fbshipit-source-id: 2dabd4f8667a045d5bf75813500876c6fd81542b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44586
**Summary**
This commit disallows plain `Optional` type annotations without
any contained types both in type comments and in-line as
Python3-style type annotations.
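A small sketch of the accepted form (a bare `Optional` is now rejected at script time):
```python
import torch
from typing import Optional

@torch.jit.script
def f(x: Optional[int]) -> int:
    # `x: Optional` without a contained type would now be a compile error.
    return 0 if x is None else x

print(f(None), f(3))  # 0 3
```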
**Test Plan**
This commit adds a unit test for these two situations.
Test Plan: Imported from OSS
Reviewed By: gmagogsfm
Differential Revision: D23721517
Pulled By: SplitInfinity
fbshipit-source-id: ead411e94aa0ccce227af74eb0341e2a5331370a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43796
This diff adds an option for the process group NCCL backend to pick high priority cuda streams.
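A hedged sketch of the intended usage (the `Options` field name is assumed from this diff's description; constructing the process group requires a real store, rank, and size on a CUDA machine):
```python
import torch.distributed as dist

opts = dist.ProcessGroupNCCL.Options()
opts.is_high_priority_stream = True  # assumed field: pick high-priority CUDA streams
# pg = dist.ProcessGroupNCCL(store, rank, world_size, opts)
```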
Test Plan: waitforsandcastle
Reviewed By: jiayisuse
Differential Revision: D23404286
fbshipit-source-id: b79ae097b7cd945a26e8ba1dd13ad3147ac790eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44577
I would like to move this to cmake so that I can depend on it
happening from other parts of the build.
This PR pulls out the logic for determining the version string and
writing the version file into its own module. `setup.py` still receives
the version string and uses it as before, but now the code for writing
out `torch/version.py` lives in a custom command in torch/CMakeLists.txt
I noticed a small inconsistency in how version info is populated.
`TORCH_BUILD_VERSION` is populated from `setup.py` at configuration
time, while `torch/version.py` is written at build time. So if, e.g., you
configured cmake on one git rev and then built on another, the
two versions would be inconsistent.
This does not appear to matter, so I opted to preserve the existing
behavior.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23734781
Pulled By: suo
fbshipit-source-id: 4002c9ec8058503dc0550f8eece2256bc98c03a4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44585
**Summary**
This commit disallows plain `Tuple` type annotations without any
contained types both in type comments and in-line as Python3-style
type annotations.
**Test Plan**
This commit adds a unit test for these two situations.
Test Plan: Imported from OSS
Reviewed By: gmagogsfm
Differential Revision: D23721515
Pulled By: SplitInfinity
fbshipit-source-id: e11c77a4fac0b81cd535c37a31b9f4129c276592
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44584
**Summary**
This commit extends the work done in #38130 and disallows plain
Python3-style `List` type annotations.
**Test Plan**
This commit extends `TestList.test_no_element_type_annotation` to the
Python3-style type annotation.
Test Plan: Imported from OSS
Reviewed By: gmagogsfm
Differential Revision: D23721514
Pulled By: SplitInfinity
fbshipit-source-id: 48957868286f44ab6d5bf5e1bf97f0a4ebf955df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44334
**Summary**
This commit detects and prohibits the case in which `typing.Dict` is
used as an annotation without type arguments (i.e. `typing.Dict` rather than `typing.Dict[K, V]`).
At present, `typing.Dict` is always assumed to have two arguments, and
when it is used without them, `typing.Dict.__args__` is nonempty and
contains some `typing.TypeVar` instances, which have no JIT type equivalent.
Consequently, trying to convert `typing.Dict` to a JIT type results in
a `c10::DictType` with `nullptr` for its key and value types, which can cause
a segmentation fault.
This is fixed by returning a `DictType` from
`jit.annotations.try_ann_to_type` only if the key and value types are converted
successfully to a JIT type and returning `None` otherwise.
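A small sketch of the accepted form (a bare `Dict` annotation is now a compile-time error rather than a potential segfault):
```python
import torch
from typing import Dict

@torch.jit.script
def f(d: Dict[str, int]) -> int:
    # `d: Dict` without key/value types would now be rejected up front.
    return d["a"]

print(f({"a": 1}))  # 1
```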
**Test Plan**
This commit adds a unit test to `TestDict` that checks that plain `Dict`
annotations throw an error.
**Fixes**
This commit closes #43530.
Test Plan: Imported from OSS
Reviewed By: gmagogsfm
Differential Revision: D23610766
Pulled By: SplitInfinity
fbshipit-source-id: 036b10eff6e3206e0da3131cfb4997d8189c4fec
Summary:
Unifies a number of partial solutions to the thread and block dimension extent masking, including the NoThreadIdxWriter and my last fix https://github.com/pytorch/pytorch/issues/44325. The NoThreadIdxWriter is gone in favour of tracking the current loop extents and masking any statements that have a lower rank than the launch parameters in any Block or Thread dimension, which handles both the "no" and "smaller" axis binding cases.
For example it will transform the following:
```
for i in 0..10 // blockIdx.x
for j in 0..10 // threadIdx.x
do thing(i, j);
for k in 0..5 // threadIdx.x
do other thing(i, k);
```
Into:
```
do thing(blockIdx.x, threadIdx.x);
if (threadIdx.x < 5) {
do other thing(blockIdx.x, threadIdx.x);
}
```
It also handles the case where statements are not bound by any axis, e.g.
```
do outer thing;
for i in 0..10 // blockIdx.x
for j in 0..10 // threadIdx.x
do thing(i, j);
do other thing(i);
```
will become:
```
if (blockIdx.x < 1) {
if (threadIdx.x < 1) {
do outer thing;
}
}
syncthreads();
do thing(blockIdx.x, threadIdx.x);
syncthreads();
if (threadIdx.x < 1) {
do other thing(blockIdx.x);
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44733
Reviewed By: mruberry
Differential Revision: D23736878
Pulled By: nickgg
fbshipit-source-id: 52d08626ae8043d53eb937843466874d479a6768
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44703
The description of this public function should be in the header file.
Also fix some typos.
Test Plan: N/A.
Reviewed By: pritamdamania87
Differential Revision: D23703661
fbshipit-source-id: 24ae63de9498e321b31dfb2efadb44183c6370df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44663
The new API returns the type of the data object referenced by this
`RRef`. On the owner, this is the same as `type(rref.local_value())`.
On a user, this will trigger an RPC to fetch the `type` object from
the owner. After this function is run once, the `type` object is
cached by the `RRef`, and subsequent invocations no longer trigger
RPC.
closes #33210
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D23691990
Pulled By: mrshenli
fbshipit-source-id: a2d87cd601a691dd75164b6bcd7315245e9cf6bd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44439
Adds a test to ddp_under_dist_autograd_test to ensure that the uneven
inputs join() API works properly when DDP + RPC are combined. We test that when
running in "outside DDP" mode (DDP applied to the whole hybrid module) we can
correctly process uneven inputs across different trainers.
ghstack-source-id: 112156980
Test Plan: CI
Reviewed By: albanD
Differential Revision: D23612409
fbshipit-source-id: f1e328c096822042daaba263aa8747a9c7e89de7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44749
Ensure fx module is scriptable after calling prepare_qat on it
Test Plan:
python test/test_quantization.py TestQuantizeFx.test_qat_and_script
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23718380
fbshipit-source-id: abf63ffb21e707f7def8f6c88246877f5aded58c
Summary:
The subclass sets "self.last_epoch" when this is set in the parent class's init function. Why would we need to set last_epoch twice? I think calling "super" resets last_epoch anyway, so I am not sure why we would want to include this in the subclass. Am I missing something?
For the record, I am just a Pytorch enthusiast. I hope my question isn't totally silly.
Fixes #{issue number}
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44613
Reviewed By: albanD
Differential Revision: D23691770
Pulled By: mrshenli
fbshipit-source-id: 080d9acda86e1a2bfaafe2c6fcb8fc1544f8cf8a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44566
The Delegate objects were confusing. They were supposed to be a way to
configure how tracing works, but in some cases they appeared necessary
for constructing graphs, which was not true. This makes the organization
clearer by removing Delegate and moving its functionality into a Tracer class,
similar to how pickle has a Pickler class.
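A hedged sketch of the resulting organization (customizing tracing by subclassing `Tracer`, analogous to subclassing `pickle.Pickler`; the leaf-module rule here is only an example):
```python
import torch
import torch.fx as fx

class MyTracer(fx.Tracer):
    def is_leaf_module(self, m, module_qualified_name):
        # Treat ReLU as an opaque leaf instead of tracing into it.
        return isinstance(m, torch.nn.ReLU) or super().is_leaf_module(m, module_qualified_name)

net = torch.nn.Sequential(torch.nn.Linear(2, 2), torch.nn.ReLU())
print(MyTracer().trace(net))
```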
Test Plan: Imported from OSS
Reviewed By: jamesr66a
Differential Revision: D23683177
Pulled By: zdevito
fbshipit-source-id: 7605a34e65dfac9a487c0bada39a23ca1327ab00
Summary:
There's an annoying O(N^2) in the module export logic that makes saving some models (if they have many classes) take an eternity.
I'm not super familiar with this code to properly untangle the deps and make it a pure hash lookup. So I just added a side lookup table for raw pointers. It's still quadratic, but it's O(num_classes^2) instead of O(num_classes * num_references) which already gives huge savings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44589
Test Plan:
Tested with one of the offending models - just loading and saving a TorchScript file:
```
Before:
load 1.9239683151245117
save 165.74712467193604
After:
load 1.9409027099609375
save 1.4711427688598633
```
Reviewed By: suo
Differential Revision: D23675278
Pulled By: dzhulgakov
fbshipit-source-id: 8f3fa7730941085ea20d9255b49a149ac1bf64fe
Summary:
This is a reup of https://github.com/pytorch/pytorch/issues/43885 with an extra commit that should fix the bugs that caused it to be reverted. Read that PR for general context.
The issue here was that we were still using the side maps `tensor_to_stmt_` and `stmt_to_tensor_` which get invalidated by any transform of the IR (rather than just any transform that isn't computeInline). I added a comment about this but didn't actually address our usages of it.
I've removed these maps and changed the `getLoopBodyFor` and `getLoopStatementsFor` helpers to search the root stmt directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44231
Reviewed By: albanD
Differential Revision: D23689688
Pulled By: nickgg
fbshipit-source-id: 1c6009a880f8c0cebf2300fd06b5cc9322bffbf9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44654
Previously we weren't creating a fallback graph as intended in specialize autograd zero, so if a Tensor failed one of our undefinedness checks we would run the backward normally without reprofiling & optimizing.
Test Plan: Imported from OSS
Reviewed By: jamesr66a
Differential Revision: D23691764
Pulled By: eellison
fbshipit-source-id: 10c6fa79518c84a6f5ef2bfbd9ea10843af751eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44326
Part of relanding PR #41954, this refactoring moves the rebuild_buckets call from the end of the first iteration to the beginning of the second iteration
ghstack-source-id: 112011490
Test Plan: unit tests
Reviewed By: mrshenli
Differential Revision: D23583017
fbshipit-source-id: ef67f79437a820d9b5699b651803622418499a83
Summary:
This PR adds dilation to the _ConvTransposeNd._output_padding method and tests it using a variety of input sizes.
Fixes https://github.com/pytorch/pytorch/issues/14272
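An illustration of the fixed path (requesting an explicit output size with `dilation != 1`, which previously mis-computed output_padding):
```python
import torch
import torch.nn as nn

conv = nn.ConvTranspose2d(4, 4, kernel_size=3, stride=2, dilation=2)
x = torch.randn(1, 4, 8, 8)
# output_padding is derived from output_size, now accounting for dilation.
y = conv(x, output_size=[20, 20])
print(y.shape)  # torch.Size([1, 4, 20, 20])
```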
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43793
Reviewed By: zou3519
Differential Revision: D23493313
Pulled By: ezyang
fbshipit-source-id: bca605c428cbf3a97d3d24316d8d7fde4bddb307
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42390
**Summary**
This commit extends support for properties to include
ScriptModules.
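A minimal sketch of the newly supported pattern:
```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.base = 2

    @property
    def doubled(self) -> int:
        return 2 * self.base

    def forward(self, x: int) -> int:
        # Property access inside a scripted method now compiles.
        return x + self.doubled

m = torch.jit.script(M())
print(m(1))  # 5
```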
**Test Plan**
This commit adds a unit test that has a ScriptModule with
a user-defined property.
`python test/test_jit_py3.py TestScriptPy3.test_module_properties`
Test Plan: Imported from OSS
Reviewed By: eellison, mannatsingh
Differential Revision: D22880298
Pulled By: SplitInfinity
fbshipit-source-id: 74f6cb80f716084339e2151ca25092b6341a1560
Summary:
We were hitting an assert error when an empty `List[List[int]]` was passed in - this fixes that error by not recursing into 0-element tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44652
Reviewed By: ZolotukhinM
Differential Revision: D23688247
Pulled By: eellison
fbshipit-source-id: d48ea24893044fae96bc39f76c0f1f9726eaf4c7
Summary:
This PR:
- updates div to perform true division
- makes torch.true_divide an alias of torch.div
This follows on work in previous PyTorch releases that first deprecated div performing "integer" or "floor" division, then prevented it by throwing a runtime error.
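A small illustration of the new behavior:
```python
import torch

a = torch.tensor([5, 3])
b = torch.tensor([2, 2])
print(torch.div(a, b))          # tensor([2.5000, 1.5000]): true division
print(torch.true_divide(a, b))  # same result: true_divide is now an alias
```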
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42907
Reviewed By: ngimel
Differential Revision: D23622114
Pulled By: mruberry
fbshipit-source-id: 414c7e3c1a662a6c3c731ad99cc942507d843927
Summary:
* Support sequence type (de)serialization, enables onnx shape inference on sequence nodes.
* Fix shape inference with block input/output: e.g. Loop and If nodes.
* Fix bugs in symbolic discovered by coverage of onnx shape inference.
* Improve debuggability: added more jit logs. For simplicity, the default log level, when jit logging is enabled, will not dump IR graphs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43929
Reviewed By: albanD
Differential Revision: D23674604
Pulled By: bzinodev
fbshipit-source-id: ab6aacb16d0e3b9a4708845bce27c6d65e567ba7
Summary:
When caller / callee pairs are inserted into the mapping, verify that
the arity of the buffer access is consistent with its declared rank.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44561
Test Plan: CI, test_tensorexpr --gtest_filter=TensorExprTest.DetectInlineRankMismatch
Reviewed By: albanD
Differential Revision: D23684342
Pulled By: asuhan
fbshipit-source-id: dd3a0cdd4c2492853fa68381468e0ec037136cab
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43389.
This PR replaces the old ELU formula in the docs, which yields wrong results for negative alphas, with a new one that fixes the issue and relies on cases notation, making the formula more straightforward.
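For reference, the formula in cases notation reads:
```latex
\mathrm{ELU}(x) =
\begin{cases}
x, & \text{if } x > 0 \\
\alpha \left( e^{x} - 1 \right), & \text{if } x \le 0
\end{cases}
```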
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43764
Reviewed By: ailzhang
Differential Revision: D23425532
Pulled By: albanD
fbshipit-source-id: d0931996e5667897d926ba4fc7a8cc66e8a66837
Summary:
Improve simplification of nested Min and Max patterns.
Specifically, handles the following pattern simplications:
* `Max(A, Max(A, Const)) => Max(A, Const)`
* `Max(Min(A, B), Min(A, C)) => Min(A, Max(B, C))`
* `Max(Const, Max(A, OtherConst)) => Max(A, Max(Const, OtherConst))`
- This case can have an arbitrarily long chain of Max ops. For example: `Max(5, Max(x, Max(y, Max(z, 8)))) => Max(Max(Max(x, 8), y), z)`
Similarly, for the case of Min as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44142
Reviewed By: albanD
Differential Revision: D23644486
Pulled By: navahgar
fbshipit-source-id: 42bd241e6c2af820566744c8494e5dee172107f4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44562
Add a note that torch.median returns the smaller of the two middle elements for even-sized input, and refer users to torch.quantile for the mean of the middle values.
fixes https://github.com/pytorch/pytorch/issues/39520
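A small illustration of the documented behavior:
```python
import torch

t = torch.tensor([1., 2., 3., 4.])
print(torch.median(t))         # tensor(2.): smaller of the two middle values
print(torch.quantile(t, 0.5))  # tensor(2.5000): mean of the two middle values
```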
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D23657208
Pulled By: heitorschueroff
fbshipit-source-id: 2747aa652d1e7f10229d9299b089295aeae092c2
Summary:
We run remove-profile-nodes and specialize-types before batch_mm, so we cannot run peepholes on the type information of tensors, since these properties have not been guarded and thus are not guaranteed to be correct.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44565
Reviewed By: albanD
Differential Revision: D23661538
Pulled By: eellison
fbshipit-source-id: 0dd23a65714f047f49b4db4ec582b21870925fe1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44622
Remove an extra empty line in the warning comments.
Test Plan: N/A
Reviewed By: rohan-varma
Differential Revision: D23674070
fbshipit-source-id: 4ee570590c66a72fb808e9ee034fb773b833efcd
Summary:
This adds HIP version info to the `collect_env.py` output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44106
Reviewed By: VitalyFedyunin
Differential Revision: D23652341
Pulled By: zou3519
fbshipit-source-id: a1f5bce8da7ad27a1277a95885934293d0fd43c5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44442
I noticed lock contention on startup as lookupByLiteral() was
calling registerPendingOperators() - some calls were holding the
lock for 10+ ms, as operators were being registered.
canonicalSchemaString() was using ostringstream, which isn't typically
particularly fast (partly because of C++ spec locale requirements).
If we replace it with regular C++ string appends, it's somewhat faster
(which isn't hard when comparing with stringstream), albeit with a bit
more codegen.
This cuts out 1.4 seconds spent under the
OperatorRegistry lock (as part of registerPendingOperators) in the
first couple of minutes of run time (mostly front-loaded) when running
sync SGD.
As an example, before:
registerPendingOperators 12688 usec for 2449 operators
After:
registerPendingOperators 6853 usec for 2449 operators
ghstack-source-id: 111862971
Test Plan: buck test mode/dev-nosan caffe2/test/cpp/...
Reviewed By: ailzhang
Differential Revision: D23614515
fbshipit-source-id: e712f9dac5bca0b1876e11fb8f0850402f03873a
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44219
Rebasing https://github.com/pytorch/pytorch/pull/44288 and fixing the git history.
This allows users to benchmark code without having to specify how long to run the benchmark. It runs the benchmark until the variance (IQR / median) is low enough that we can be confident in the measurement.
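A hedged sketch of the intended usage (assuming the method added here is `Timer.adaptive_autorange`):
```python
from torch.utils.benchmark import Timer

t = Timer(stmt="torch.ones(128, 128).sum()", setup="import torch")
# Keeps measuring until IQR / median is small enough; no run count needed.
print(t.adaptive_autorange())
```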
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44607
Test Plan: There are unit tests, and we manually tested using Examples posted in git.
Reviewed By: robieta
Differential Revision: D23671208
Pulled By: bitfort
fbshipit-source-id: d63184290b88b26fb81c2452e1ae701c7d513d12
Summary:
This fixes a `katex` error I was getting trying to build the docs:
```
ParseError: KaTeX parse error: Undefined control sequence: \0 at position 55: …gin{cases}
```
This failure was introduced in https://github.com/pytorch/pytorch/issues/42523.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44481
Reviewed By: colesbury
Differential Revision: D23627700
Pulled By: mruberry
fbshipit-source-id: 9cc09c687a7d9349da79a0ac87d6c962c9cfbe2d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44337
Add a new run_method to mobile Module which is variadic (takes any number of arguments) to match the full JIT.
ghstack-source-id: 111909068
Test Plan: Added new unit test to test_jit test suite
Reviewed By: linbinyu, ann-ss
Differential Revision: D23585763
fbshipit-source-id: 007cf852290f03615b78c35aa6f7a21287ccff9e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44588
1) SOURCE_DUMP crashes when invoked on a backward graph since
`prim::GradOf` nodes can't be printed as sources (they don't have
schema).
2) Dumping graph each time we execute an optimized plan produces lots of
output in tests where we run the graph multiple times (e.g.
benchmarks). Outputting that at the lowest level of verbosity seems like overkill.
3) Duplicated log statement is removed.
Differential Revision: D23666812
Test Plan: Imported from OSS
Reviewed By: bertmaher
Pulled By: ZolotukhinM
fbshipit-source-id: b9a30e34fd39c85f3e13c3f1e3594e157e1c130f
Summary:
**BC-breaking note**
This change is BC-breaking for C++ callers of linspace and logspace if they were providing a steps argument that could not be converted to an optional.
**PR note**
This PR deprecates calling linspace and logspace without setting steps explicitly by:
- updating the documentation to warn that not setting steps is deprecated
- warning (once) when linspace and logspace are called without steps being specified
A test for this behavior is added to test_tensor_creation_ops. The warning only appears once per process, however, so the test would pass even if no warning were thrown. Ideally there would be a mechanism to force all warnings, including those from TORCH_WARN_ONCE, to trigger.
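A small illustration of the deprecation:
```python
import torch

print(torch.linspace(0, 1, steps=5))  # explicit steps: no warning
torch.linspace(0, 1)  # warns once (and errors in later releases): steps omitted
```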
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43860
Reviewed By: izdeby
Differential Revision: D23498980
Pulled By: mruberry
fbshipit-source-id: c48d7a58896714d184cb6ff2a48e964243fafc90
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44340
Changed the constructor of GradBucket to pass the input by const
reference and hence avoided unnecessary explicit move semantics. Since
previously the declaration and definition were separated, passing the input
tensor vector by value looked quite bizarre.
Test Plan: buck test caffe2/torch/lib/c10d:ProcessGroupGlooTest
Reviewed By: pritamdamania87
Differential Revision: D23569939
fbshipit-source-id: db761d42e76bf938089a0b38e98e76a05bcf4162
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44339
Moved the inline implementations of GradBucket class to the header for
succinctness and readability. This coding style is also consistent with
reducer.h under the same directory.
Test Plan: buck test caffe2/torch/lib/c10d:ProcessGroupGlooTest
Reviewed By: pritamdamania87
Differential Revision: D23569701
fbshipit-source-id: 237d9e2c5f63a6bcac829d0fcb4a5ba3bede75e5
Summary:
Follow up to https://github.com/pytorch/pytorch/pull/36404
Adding prim::device and prim::dtype to the list of skipped peepholes when we run inlining. In the long term, another fix may be to not encode shape / dtype info on the traced graph at all, because it is not guaranteed to be correct; this is currently blocked by ONNX.
Partial fix for https://github.com/pytorch/pytorch/issues/43134
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43363
Reviewed By: glaringlee
Differential Revision: D23383987
Pulled By: eellison
fbshipit-source-id: 2e9c5160d39d690046bd9904be979d58af8d3a20
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44564
Before this change we sometimes inlined autodiff subgraph containing
fusion groups. This happened because we didn't look for 'unsupported'
nodes recursively (maybe we should), but fusion groups were inside
if-nodes.
The problem was detected by bertmaher in 'LearningToPaint' benchmark
investigation where this bug caused us to keep constantly hitting
fallback paths of the graph.
Test Plan: Imported from OSS
Reviewed By: bwasti
Differential Revision: D23657049
Pulled By: ZolotukhinM
fbshipit-source-id: 7c853424f6dce4b5c344d6cd9c467ee04a8f167e
Summary:
Fix an issue where loops of different sizes are bound to the same Cuda dimension / metavar.
More info and tests coming soon...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44325
Reviewed By: colesbury
Differential Revision: D23628859
Pulled By: nickgg
fbshipit-source-id: 3621850a4cc38a790b62ad168d32e7a0e2462fad
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43043
This adds support for rpc_sync in TorchScript, in a way similar to
rpc_async.
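A hedged single-process sketch, modeled on the existing rpc_async TorchScript pattern (worker names and env setup are only for demonstration):
```python
import os
import torch
import torch.distributed.rpc as rpc

@torch.jit.script
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return x + y

@torch.jit.script
def call_add(to: str, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # rpc_sync is now callable from TorchScript, mirroring rpc_async.
    return rpc.rpc_sync(to, add, (x, y))

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29501"
rpc.init_rpc("worker0", rank=0, world_size=1)
print(call_add("worker0", torch.ones(2), torch.ones(2)))  # tensor([2., 2.])
rpc.shutdown()
```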
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D23252039
Pulled By: wanchaol
fbshipit-source-id: 8a05329cb8a24079b2863178b73087d47273914c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44537
Originally, the `min_val`, `max_val`, `min_vals`, `max_vals`
attributes of observers were Tensors but not buffers. They had custom
state_dict save/load code to ensure their state was saved.
At some point, these attributes became buffers, and the custom
save/load code remained. This introduced a subtle bug:
* create model A, move it to a device (cpu/cuda) and save its state_dict
* create model B, load its state dict.
* `min_val|min_vals|max_val|max_vals` would always be loaded to model A's device, even if the rest of model B was on a different device
* the above is inconsistent with how save/load on different devices is expected to work (see https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-model-across-devices)
In practice, the case people would sometimes hit is:
* model A is on CPU, state dict is saved
* model B is created and moved to GPU, state_dict from model A is loaded
* assertions throw when operations are attempted across different devices
This PR fixes the behavior by removing the custom save/load where
possible and letting the default `nn.Module` save/load code handle
device assignment. We special case `PerChannelMinMaxObserver` and its
children to allow for loading buffers of different size, which is
normal.
There are some followups to also enable this for HistogramObserver
and FakeQuantize, which can be done in separate PRs due to higher
complexity.
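A small sketch of the fixed behavior (requires a CUDA machine):
```python
import torch
from torch.quantization import MinMaxObserver

obs_cpu = MinMaxObserver()
obs_cpu(torch.randn(4))
state = obs_cpu.state_dict()

obs_gpu = MinMaxObserver().cuda()
obs_gpu.load_state_dict(state)
# min_val/max_val now follow the destination module's device:
print(obs_gpu.min_val.device)  # cuda:0
```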
Test Plan:
```
python test/test_quantization.py TestObserver.test_state_dict_respects_device_affinity
```
Imported from OSS
Reviewed By: raghuramank100
Differential Revision: D23644493
fbshipit-source-id: 0dbb6aa309ad569a91a663b9ee7e44644080032e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44486
SmoothL1Loss had a completely different (and incorrect, see #43228) path when target.requires_grad was True.
This PR does the following:
1) adds derivative support for target via the normal derivatives.yaml route
2) kills the different (and incorrect) path for when target.requires_grad was True
3) modifies the SmoothL1Loss CriterionTests to verify that the target derivative is checked.
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D23630699
Pulled By: gchanan
fbshipit-source-id: 0f94d1a928002122d6b6875182867618e713a917
Summary:
Add new transforms `sliceHead` and `sliceTail` to `LoopNest`, for example:
Before transformation:
```
for x in 0..10:
A[x] = x*2
```
After `sliceHead(x, 4)`:
```
for x in 0..4:
A[x] = x*2
for x in 4..10:
A[x] = x*2
```
After `sliceTail(x, 1)`:
```
for x in 0..4:
A[x] = x*2
for x in 4..9:
A[x] = x*2
for x in 9..10:
A[x] = x*2
```
`sliceHead(x, 10)` and `sliceTail(x, 10)` are no-ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43854
Test Plan: Tests are added in `test_loopnest.cpp`, the tests cover the basic transformations, and also tests the combination with other transformations such as `splitWithTail`.
Reviewed By: nickgg
Differential Revision: D23417366
Pulled By: cheng-chang
fbshipit-source-id: 06c6348285f2bafb4be3286d1642bfbe1ea499bf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44235
Removes nonvariadic run_method() from mobile Module entirely (to be later replaced by a variadic version). All use cases should have been migrated to use get_method() and Method::operator() in D23436351
ghstack-source-id: 111848220
Test Plan: CI
Reviewed By: iseeyuan
Differential Revision: D23484577
fbshipit-source-id: 602fcde61e13047a34915b509da048b9550103b1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44202
In preparation for changing mobile run_method() to be variadic, this diff:
* Implements get_method() for mobile Module, which is similar to find_method but expects the method to exist.
* Replaces calls to the current nonvariadic implementation of run_method() by calling get_method() and then invoking the operator() overload on Method objects.
ghstack-source-id: 111848222
Test Plan: CI, and all the unit tests which currently contain run_method that are being changed.
Reviewed By: iseeyuan
Differential Revision: D23436351
fbshipit-source-id: 4655ed7182d8b6f111645d69798465879b67a577
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43025
- Use new overloads that better reflect the arguments to interpolate.
- More uniform interface for upsample ops allows simplifying the Python code.
- Also reorder overloads in native_functions.yaml to give them priority.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37177
ghstack-source-id: 106938111
Test Plan:
test_nn has pretty good coverage.
Relying on CI for ONNX, etc.
Didn't test FC because this change is *not* forward compatible.
To ensure backwards compatibility, I ran this code before this change
```python
def test_func(arg):
    interp = torch.nn.functional.interpolate
    with_size = interp(arg, size=(16,16))
    with_scale = interp(arg, scale_factor=[2.1, 2.2], recompute_scale_factor=False)
    with_compute = interp(arg, scale_factor=[2.1, 2.2])
    return (with_size, with_scale, with_compute)

traced_func = torch.jit.trace(test_func, torch.randn(1,1,1,1))
sample = torch.randn(1, 3, 7, 7)
output = traced_func(sample)
assert not torch.allclose(output[1], output[2])
torch.jit.save(traced_func, "model.pt")
torch.save((sample, output), "data.pt")
```
then this code after this change
```python
model = torch.jit.load("model.pt")
sample, golden = torch.load("data.pt")
result = model(sample)
for r, g in zip(result, golden):
    assert torch.allclose(r, g)
```
Reviewed By: AshkanAliabadi
Differential Revision: D21209991
fbshipit-source-id: 5b2ebb7c3ed76947361fe532d1dbdd6faa3544c8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44471
L1Loss had a completely different (and incorrect, see #43228) path when target.requires_grad was True.
This PR does the following:
1) adds derivative support for target via the normal derivatives.yaml route
2) kills the different (and incorrect) path for when target.requires_grad was True
3) modifies the L1Loss CriterionTests to verify that the target derivative is checked.
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D23626008
Pulled By: gchanan
fbshipit-source-id: 2828be16b56b8dabe114962223d71b0e9a85f0f5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44500
Some user models are using those operators. Unblock them while keeping the ops selective.
Test Plan: CI
Reviewed By: linbinyu
Differential Revision: D23634769
fbshipit-source-id: 55841d1b07136b6a27b6a39342f321638dc508cd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44525
Since `TEST_SKIPS` is a global multiprocessing.Manager dict, this was causing
issues when one test would fail and make the rest of the tests fail during
setup due to networking errors.
See the failed CI job: https://app.circleci.com/pipelines/github/pytorch/pytorch/212491/workflows/0450151d-ca09-4cf6-863d-272de6ed917f/jobs/7389065 for an example, where `test_ddp_backward` failed but then caused the rest of the tests to fail at the line `test_skips.update(TEST_SKIPS)`.
To fix this issue, at the end of every test we revert `TEST_SKIPS` back to a regular dict and redo the conversion to a `multiprocessing.Manager` dict in the next test, which prevents these errors.
ghstack-source-id: 111844724
Test Plan: CI
Reviewed By: malfet
Differential Revision: D23641618
fbshipit-source-id: 27ce823968ece9804bb4dda898ffac43ef732b89
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44437
MSELoss had a completely different (and incorrect, see https://github.com/pytorch/pytorch/issues/43228) path when target.requires_grad was True.
This PR does the following:
1) adds derivative support for target via the normal derivatives.yaml route
2) kills the different (and incorrect) path for when target.requires_grad was True
3) modifies the MSELoss CriterionTests to verify that the target derivative is checked.
TODO:
1) do we still need check_criterion_jacobian when we run grad/gradgrad checks?
2) ensure the Module tests check when target.requires_grad
3) do we actually test when reduction='none' and reduction='mean'?
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D23612166
Pulled By: gchanan
fbshipit-source-id: 4f74d38d8a81063c74e002e07fbb7837b2172a10
Summary:
Fixes a bug in the NNC registerizer for Cuda where it would hoist reads out of a conditional context when trying to cache them. As a quick fix, prevent scalar replacement if a usage is within a condition.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44223
Reviewed By: gchanan
Differential Revision: D23551247
Pulled By: nickgg
fbshipit-source-id: 17a7bf2be4c8c3dd8a9ab7997dce9aea200c3685
Summary:
Previously we were not removing profiling nodes in graphs that required grad and contained diff graphs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44420
Reviewed By: bertmaher
Differential Revision: D23607482
Pulled By: eellison
fbshipit-source-id: af095f3ed8bb3c5d09610f38cc7d1481cbbd2613
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44493
This function allows to execute a graph exactly as it is, without going
through a graph executor which would run passes on the graph before
interpreting it. I found this feature extremely helpful when I worked on
a stress-testing script to shake out bugs from the TE fuser: I needed to
execute a very specific set of passes on a graph and nothing else, and
then execute exactly that graph.
Test Plan: Imported from OSS
Reviewed By: jamesr66a
Differential Revision: D23632505
Pulled By: ZolotukhinM
fbshipit-source-id: ea81fc838933743e2057312d3156b77284d832ef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44411
This basically aborts errored NCCL communicators if either blocking
wait or async error handling is enabled. Otherwise we might abort NCCL
communicators where neither is enabled, and this may result in subsequent GPU
operations using corrupted data.
ghstack-source-id: 111839264
Test Plan: Successful Flow run: f217591683
Reviewed By: jiayisuse
Differential Revision: D23605382
fbshipit-source-id: 6c16f9626362be3b0ce2feaf0979b2dff97ce61b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44410
See #44052 for context. One of the cumprod_backward overloads was unused
so I just deleted it.
Test Plan: - `pytest test/test_autograd.py -v`
Reviewed By: mrshenli
Differential Revision: D23605503
Pulled By: zou3519
fbshipit-source-id: f9c5b595e62d2d6e71f26580ba96df15cc9de4f7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44427
Closes https://github.com/pytorch/pytorch/issues/44425
DDP join API currently does not work properly with `model.no_sync()`, see https://github.com/pytorch/pytorch/issues/44425 for details. This PR fixes the problem via the approach mentioned in the issue, namely scheduling an allreduce that tells joined ranks whether to sync in the backwards pass or not. Tests are added for skipping gradient synchronization for various `sync_interval`s.
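A hedged single-process sketch of the pattern this fixes (`sync_interval` and the toy model are assumptions for illustration):
```python
import contextlib
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29502"
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(2, 2))
sync_interval = 2  # only synchronize gradients every other iteration

# join() handles uneven inputs across ranks; no_sync() skips gradient
# synchronization; after this fix the two compose correctly.
with model.join():
    for i in range(5):
        ctx = model.no_sync() if i % sync_interval else contextlib.nullcontext()
        with ctx:
            model(torch.randn(4, 2)).sum().backward()
```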
ghstack-source-id: 111786479
Reviewed By: pritamdamania87
Differential Revision: D23609070
fbshipit-source-id: e8716b7881f8eee95e3e3499283e716bd3d7fe76
Summary:
This PR fixes three OpInfo-related bugs and moves some functions from TestTorchMathOps to be tested using the OpInfo pattern. The bugs are:
- A skip test path in test_ops.py incorrectly formatted its string argument
- Decorating the tests in common_device_type.py was incorrectly always applying decorators to the original test, not the op-specific variant of the test. This could cause the same decorator to be applied multiple times, overriding past applications.
- make_tensor was incorrectly constructing tensors in some cases
The functions moved are:
- asin
- asinh
- sinh
- acosh
- tan
- atan
- atanh
- tanh
- log
- log10
- log1p
- log2
In a follow-up PR more or all of the remaining functions in TestTorchMathOps will be refactored as OpInfo-based tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44277
Reviewed By: mrshenli, ngimel
Differential Revision: D23617361
Pulled By: mruberry
fbshipit-source-id: edb292947769967de9383f6a84eb327f027509e0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44224
The purpose of this file is to help developers on PT distributed get
up to speed on the code structure and layout of PT Distributed.
ghstack-source-id: 111644842
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D23548377
fbshipit-source-id: 561d5b8e257642de172def8fdcc1311fae20690b
Summary:
To help with further typing, move dynamically added native contributions from `torch.autograd` to `torch._C._autograd`
Fix invalid error handling pattern in
89ac30afb8/torch/csrc/autograd/init.cpp (L13-L15)
`PyImport_ImportModule` already raises a Python exception, so nullptr should be returned to properly propagate the error to the Python runtime.
All native methods/types in `torch/autograd/__init__.py` are populated only after `torch._C._init_autograd()` has been called.
Use f-strings instead of `.format` in test_type_hints.py
Fixes https://github.com/pytorch/pytorch/issues/44450
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44451
Reviewed By: ezyang
Differential Revision: D23618261
Pulled By: malfet
fbshipit-source-id: fa5f739d7cff8410641128b55b810318c5f636ae
Summary:
Previously the specialized types were copied over to the fallback function, even though the tensors in the fallback path were not of those types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44434
Reviewed By: SplitInfinity
Differential Revision: D23611943
Pulled By: eellison
fbshipit-source-id: 2ea88a97529409f6c5c4c1f59a14b623524933de
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44347
Cloned from Pull Request resolved: https://github.com/pytorch/pytorch/pull/44097, because the original author Sinan has completed his internship and is now unable to submit this diff.
As johnsonpaul mentioned in D23277575 (7d517cf96f), it looks like all processes were allocating memory on GPU 0.
I was able to reproduce it by running the `test_ddp_comm_hook_allreduce_with_then_hook_nccl` unit test of `test_c10d.py` and running `nvidia-smi` while the test was running. The issue was reproduced as:
```
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 3132563 C python 777MiB |
| 0 3132564 C python 775MiB |
| 4 3132564 C python 473MiB |
+-----------------------------------------------------------------------------+
```
I realized that, as we initialize ProcessGroupNCCL, both processes were initially allocating memory on GPU 0.
We later also realized that I had forgotten the `isHighPriority` input of `getStreamFromPool`, so `futureNCCLCallbackStreams_.push_back(std::make_shared<at::cuda::CUDAStream>(at::cuda::getStreamFromPool(device_index)));` was just creating a vector of GPU 0 streams. After I changed `at::cuda::getStreamFromPool(device_index)` to `at::cuda::getStreamFromPool(false, device_index)`, `nvidia-smi` looked like:
```
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 673925 C python 771MiB |
| 0 673926 C python 771MiB |
| 1 673925 C python 771MiB |
| 1 673926 C python 771MiB |
| 2 673925 C python 771MiB |
| 2 673926 C python 771MiB |
| 3 673925 C python 771MiB |
| 3 673926 C python 771MiB |
| 4 673925 C python 771MiB |
| 4 673926 C python 771MiB |
| 5 673925 C python 771MiB |
| 5 673926 C python 771MiB |
| 6 673925 C python 771MiB |
| 6 673926 C python 771MiB |
| 7 673925 C python 707MiB |
| 7 673926 C python 623MiB |
+-----------------------------------------------------------------------------+
```
This confirms that we were just getting GPU 0 streams for the callback. I think this does not explain the `fp16_compress` stability issue, because we were able to reproduce that even without any then callback and just calling copy from fp32 to fp16 before allreduce. However, this can explain other issues where `allreduce` was not on par with `no_hook`. I'll run some additional simulations with this diff.
I tried to replace `getStreamFromPool` with `getDefaultCUDAStream(deviceIndex)`, and it wasn't causing additional memory usage. In this diff, I temporarily solved the issue by just initializing null pointers for each device in the constructor and setting the callback stream for the corresponding devices inside `ProcessGroupNCCL::getNCCLComm`. After the fix, it looks like the memory issue was resolved:
```
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2513142 C python 745MiB |
| 4 2513144 C python 747MiB |
+-----------------------------------------------------------------------------+
```
I could use a dictionary instead of a vector for `futureNCCLCallbackStreams_`, but since the number of devices is fixed, I don't think it is necessary. Please let me know what you think in the comments.
ghstack-source-id: 111485483
Test Plan:
`test_c10d.py` and some perf tests. Also check `nvidia-smi` while running tests to validate memory looks okay.
This diff also fixes the regression in HPC tests as we register a hook:
{F322730175}
See https://fb.quip.com/IGuaAbD8bnvy for details.
Reviewed By: pritamdamania87
Differential Revision: D23495436
fbshipit-source-id: ad08e1d94343252224595d7c8a279fe75e244822
Summary:
This PR fixes unexpected `SystemError` when warnings are emitted and warning filters are set.
## Current behavior
```
$ python -Werror
>>> import torch
>>> torch.range(1, 3)
UserWarning: torch.range is deprecated in favor of torch.arange and will be removed in 0.5. Note that arange generates values in [start; end), not [start; end].
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
SystemError: <built-in method range of type object at 0x7f38c7703a60> returned a result with an error set
```
## Expected behavior
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UserWarning: torch.range is deprecated and will be removed in a future release because its behavior is inconsistent with Python's range builtin. Instead, use torch.arange, which produces values in [start, end).
```
## Note
Python exception must be raised if `PyErr_WarnEx` returns `-1` ([python docs](https://docs.python.org/3/c-api/exceptions.html#issuing-warnings)). This PR fixes warnings raised in the following code:
```py
import torch
torch.range(1, 3)
torch.autograd.Variable().volatile
torch.autograd.Variable().volatile = True
torch.tensor(torch.tensor([]))
torch.tensor([]).new_tensor(torch.tensor([]))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44371
Reviewed By: mrshenli
Differential Revision: D23598410
Pulled By: albanD
fbshipit-source-id: 2fbcb13fe4025dbebaf1fd837d4c8e0944e05010
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44398
These end up executing the same tests, so no reason to have them separate.
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D23600855
Pulled By: gchanan
fbshipit-source-id: 0952492771498bf813f1bf8e1d7c8dce574ec965
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43958
There is not any difference between these tests (I'm merging them), so let's merge them in the JIT as well.
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D23452337
Pulled By: gchanan
fbshipit-source-id: e6d13cdb164205eec3dbb7cdcd0052b02c961778
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44381
Perhaps this was necessary when the test was originally introduced, but it's difficult to figure out what is actually tested. And I don't think we actually use NotImplementedErrors.
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D23598646
Pulled By: gchanan
fbshipit-source-id: aa18154bfc4969cca22323e61683a301198823be
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44226
**Summary**
At present, the `share_types` argument to `create_script_module` is used
to decide whether to reuse a previously created type for a top-level
module that has not yet been compiled. However, that setting does not apply
to the compilation of submodules of the top-level module; types are
still reused if possible.
This commit modifies `create_script_module` so that the `share_types`
flag is honoured during submodule compilation as well.
**Test Plan**
This commit adds a unit test to `TestTypeSharing` that checks that
submodule types are not shared or reused when `share_types` is set to
`False`.
**Fixes**
This commit fixes #43605.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D23602371
Pulled By: SplitInfinity
fbshipit-source-id: b909b8b6abbe3b4cb9be8319ac263ade90e83bd3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44352
**Summary**
This commit adds support for `del` with class instances. If a class
implements `__delitem__`, then `del class_instance[key]` is syntactic
sugar for `class_instance.__delitem__(key)`.
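A minimal sketch of the new sugar:
```python
import torch
from typing import Dict

@torch.jit.script
class Bag(object):
    def __init__(self):
        self.items: Dict[str, int] = {"a": 1}

    def __delitem__(self, key: str):
        del self.items[key]

@torch.jit.script
def use() -> int:
    b = Bag()
    del b["a"]  # desugars to b.__delitem__("a")
    return len(b.items)

print(use())  # 0
```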
**Test Plan**
This commit adds a unit test to TestClassTypes to test this feature.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D23603102
Pulled By: SplitInfinity
fbshipit-source-id: 28ad26ddc9a693a58a6c48a0e853a1c7cf5c9fd6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43384
Much like the FileStoreTests, the HashStoreTests were also run in a single blob and threw exceptions upon failure. This modularizes the test by separating each function into separate gtest test cases.
ghstack-source-id: 111690834
Test Plan: Confirmed that the tests pass on devvm.
Reviewed By: jiayisuse
Differential Revision: D23257579
fbshipit-source-id: 7e821f0e9ee74c8b815f06facddfdb7dc2724294
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43383
FileStore Test currently has a large blob of tests that throw
exceptions upon failure. This PR modularizes each test so they can run
independently, and migrates the framework to gtest.
ghstack-source-id: 111690831
Test Plan: Confirmed tests pass on devvm
Reviewed By: jiayisuse
Differential Revision: D22879473
fbshipit-source-id: 6fa5468e594a53c9a6b972757068dfc41645703e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43382
StoreTestCommon defines standard helper functions that are used by all of our Store tests. These helpers currently throw exceptions upon failure, this PR changes them to use gtest assertions instead.
ghstack-source-id: 111690833
Test Plan: Tested the 2 PR's above this on devvm
Reviewed By: jiayisuse
Differential Revision: D22828156
fbshipit-source-id: 9e116cf2904e05ac0342a441e483501e00aad3dd
Summary:
Follow up to https://github.com/pytorch/pytorch/pull/41946/, to suggest enumerating the module as an alternative if a user tries indexing into a ModuleList/Sequential with a non-integer literal
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43361
Reviewed By: mrshenli
Differential Revision: D23602388
Pulled By: eellison
fbshipit-source-id: 51fa28d5bc45720529b3d45e92d367ee6c9e3316
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44400
This diff does the same thing as D23549149 (398409f072), with a fix included for the OSS CI job pytorch_windows_vs2019_py36_cuda10.1_test1.
ghstack-source-id: 111679745
Test Plan:
- CI
- OSS CI
Reviewed By: xcheng16
Differential Revision: D23601050
fbshipit-source-id: 8ebdcd8fdc5865078889b54b0baeb397a90ddc40
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44163
In this PR, we introduce a new environment variable
(NCCL_ASYNC_ERROR_HANDLING), which guards the asynchronous error handling
feature. We intend to eventually turn this feature on by default for all users,
but this is a temporary solution so that the change in behavior from hanging to
crashing does not become the default for users all of a sudden.
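Enabling the feature is just a matter of setting the environment variable before initializing the process group, e.g.:
```python
import os

os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"  # opt in to crash-on-error
```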
ghstack-source-id: 111637788
Test Plan:
CI/Sandcastle. We will turn on this env var by default in
torchelastic and HPC trainer soon.
Reviewed By: jiayisuse
Differential Revision: D23517895
fbshipit-source-id: e7cd244b2ddf2dc0800ff7df33c73a6f00b63dcc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41054
**This Commit:**
ProcessGroupNCCL destructor now blocks until all WorkNCCL objects have either been aborted or completed and removed from the work vector.
**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.
ghstack-source-id: 111614314
Test Plan:
1. **DDP Sanity Check**: First we have a sanity check based on the PyTorch DDP benchmark. This verifies that the baseline DDP training with NCCL for standard CU workloads works well (esp. with standard models like Resnet50 and BERT). Here is a sample Flow: f213293473
1. **HPC Performance Benchmarks**: This stack has undergone thorough testing and profiling on the Training Cluster with varying numbers of nodes. It introduces only a 1-1.5% QPS regression (~200-400 QPS for 8-64 GPUs).
1. **HPC Accuracy Benchmarks**: We've confirmed NE parity with the existing NCCL/DDP stack without this change.
1. **Kernel-Specific Benchmarks**: We have profiled other approaches for this system (such as cudaStreamAddCallback) and performed microbenchmarks to confirm the current solution is optimal.
1. **Sandcastle/CI**: Apart from the recently fixed ProcessGroupNCCL tests, we will also introduce a new test for desynchronization scenarios.
Reviewed By: jiayisuse
Differential Revision: D22054298
fbshipit-source-id: 2b95a4430a4c9e9348611fd9cbcb476096183c06
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41053
**This Commit:**
Some minor refactoring - added helper to check if `WorkNCCL` objects have timed out. Adding a new finish function to ProcessGroupNCCL::WorkNCCL that avoids notifying CV and uses `lock_guard`. Also renaming the timeoutCVMutex mutex to be more descriptive.
**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.
ghstack-source-id: 111614315
Test Plan: See D22054298 for verification of correctness and performance
Reviewed By: jiayisuse
Differential Revision: D21943520
fbshipit-source-id: b27ee329f0da6465857204ee9d87953ed6072cbb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41052
**This Commit:**
Watchdog Thread checks for error-ed or timed out `WorkNCCL` objects and aborts all associated NCCL Communicators. For now, we also process these aborted communicators as with the existing Watchdog logic (by adding them to abortedCommIds and writing aborted communicator ids to the store.)
**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.
ghstack-source-id: 111614313
Test Plan: See D22054298 for verification of correctness and performance
Reviewed By: jiayisuse
Differential Revision: D21943151
fbshipit-source-id: 337bfcb8af7542c451f1e4b3dcdfc5870bdec453
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41051
**This Commit:**
In the workCleanupThread, we process completion and exception handling for workNCCL objects corresponding to collective calls that have either completed GPU Execution, or have already thrown an exception. This way, we throw an exception from the workCleanupThread for failed GPU operations. This approach replaces the previous (and lower performance) approach of enqueuing a callback on the CUDA stream to process failures.
**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.
ghstack-source-id: 111614319
Test Plan: See D22054298 for verification of correctness and performance
Reviewed By: jiayisuse
Differential Revision: D21938498
fbshipit-source-id: df598365031ff210afba57e0c7be865e3323ca07
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41050
**This Commit:**
We introduce a workVector to track live workNCCL objects corresponding to collective operations. Further, we introduce a workCleanupLoop, which busy-polls the vector of workNCCL objects and removes them upon completion.
**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.
Test Plan: See D22054298 for verification of correctness and performance
Reviewed By: jiayisuse
Differential Revision: D21916637
fbshipit-source-id: f8cadaab0071aaad1c4e31f9b089aa23cba0cfbe
Summary:
This should prevent torch_python from linking the entire cudnn library statically just to query its version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44402
Reviewed By: seemethere
Differential Revision: D23602720
Pulled By: malfet
fbshipit-source-id: 185b15b789bd48b1df178120801d140ea54ba569
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42488
Currently, ProcessGroupGloo tests do not emit logs if the test was
skipped due to CUDA not being available or there not being enough CUDA devices. This PR clarifies
the reason for skipping through these logs.
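For illustration (not from this PR), the general shape of skipping with an explicit reason so it surfaces in the logs; the test class and method names are hypothetical:
```
import unittest

import torch

class ProcessGroupGlooCudaTest(unittest.TestCase):  # hypothetical test class
    @unittest.skipIf(
        not torch.cuda.is_available(),
        "CUDA not available; skipping ProcessGroupGloo CUDA test",
    )
    def test_allreduce_cuda(self):
        ...
```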
ghstack-source-id: 111638111
Test Plan: tested on devvm and devgpu
Reviewed By: jiayisuse
Differential Revision: D22879396
fbshipit-source-id: d483ca46b5e22ed986521262c11a1c6dbfbe7efd
Summary:
This PR fixes three OpInfo-related bugs and moves some functions from TestTorchMathOps to be tested using the OpInfo pattern. The bugs are:
- A skip test path in test_ops.py incorrectly formatted its string argument
- Decorating the tests in common_device_type.py was incorrectly always applying decorators to the original test, not the op-specific variant of the test. This could cause the same decorator to be applied multiple times, overriding past applications.
- make_tensor was incorrectly constructing tensors in some cases
The functions moved are:
- asin
- asinh
- sinh
- acosh
- tan
- atan
- atanh
- tanh
- log
- log10
- log1p
- log2
In a follow-up PR, most or all of the remaining functions in TestTorchMathOps will be refactored as OpInfo-based tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44277
Reviewed By: ngimel
Differential Revision: D23568330
Pulled By: mruberry
fbshipit-source-id: 03e69fccdbfd560217c34ce4e9a5f20e10d05a5e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44315
I find it more intuitive to dump the optimized graph if we have one;
when I first saw the unoptimized graph being dumped I thought we had failed to
apply any optimizations.
Test Plan: Observe output by hand
Reviewed By: Lilyjjo
Differential Revision: D23578813
Pulled By: bertmaher
fbshipit-source-id: e2161189fb0e1cd53aae980a153aea610871662a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44162
This diff exports the Node::isBefore/isAfter methods to the Python API.
Test Plan: Tested locally. Please let me know if there is a set of unit tests to be passed.
Reviewed By: soumith
Differential Revision: D23514448
fbshipit-source-id: 7ef709b036370217ffebef52fd93fbd68c464e89
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41769
Currently the tests in `test_distributed` only work with `fork` mode multiprocessing; this PR introduces support for `spawn` mode multiprocessing as well (while keeping the `fork` mode intact).
Motivations for the change:
1) Spawn multiprocessing is the default on macOS, so it better emulates how macOS users would use distributed
2) With Python 3.8+, spawn is the default on Linux, so we should have test coverage for this
3) PT multiprocessing suggests using spawn/forkserver over fork for sharing CUDA tensors: https://pytorch.org/docs/stable/multiprocessing.html
4) Spawn is better supported with respect to certain sanitizers such as TSAN, so adding this sanitizer coverage may help us uncover issues.
How it is done:
1) Move `test_distributed` tests in `_DistTestBase` class to a shared file `distributed_test` (similar to how the RPC tests are structured)
2) For `Barrier`, refactor the setup of temp directories, as the current version did not work with spawn: each process would get a different randomly generated directory and thus would write to different barriers.
3) Add all the relevant builds to run internally and in OSS.
Running test_distributed with spawn mode in OSS can be done with:
`python test/run_test.py -i distributed/test_distributed_spawn -v`
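For illustration (not from this PR), the spawn entry point used by the new mode; unlike fork, each child re-imports the module, so shared state (e.g. the barrier temp directory) must be passed in explicitly:
```
import torch.multiprocessing as mp

def run_test(rank, world_size, tmp_dir):
    # per-process test body; tmp_dir is passed explicitly rather than inherited
    print(f"rank {rank} of {world_size} using {tmp_dir}")

if __name__ == "__main__":
    world_size = 2
    mp.spawn(run_test, args=(world_size, "/tmp/barrier"), nprocs=world_size)
```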
Reviewed By: izdeby
Differential Revision: D22408023
fbshipit-source-id: e206be16961fd80438f995e221f18139d7e6d2a9
Summary:
1) Ports nonzero from THC to ATen
2) Replaces most thrust uses with cub, to avoid synchronization and to improve performance. There is still one necessary synchronization point: communicating the number of nonzero elements from GPU to CPU.
3) Slightly changes the algorithm: we now first compute the number of nonzeros and then allocate a correctly sized output, instead of allocating a full-sized output (to account for possibly all elements being nonzero) as was done before.
4) Unfortunately, since the last transforms are still done with thrust, 2) is slightly beside the point; however, it is a step towards a future without thrust.
5) Hard-limits the number of elements in the input tensor to MAX_INT. The previous implementation allocated a Long tensor with size ndim*nelements, which would be at least 16 GB for a tensor with MAX_INT elements. It is reasonable to say that larger tensors could not be used anyway.
Benchmarking is done for tensors with approximately half non-zeros
<details><summary>Benchmarking script</summary>
<p>
```
import torch
from torch.utils._benchmark import Timer
from torch.utils._benchmark import Compare
import sys

device = "cuda"
results = []
for numel in (1024 * 128,):  # , 1024 * 1024, 1024 * 1024 * 128
    inp = torch.randint(2, (numel,), device="cuda", dtype=torch.float)
    for ndim in range(2, 3):  # (1, 4)
        if ndim == 1:
            shape = (numel,)
        elif ndim == 2:
            shape = (1024, numel // 1024)
        else:
            shape = (1024, 128, numel // 1024 // 128)
        inp = inp.reshape(shape)
        repeats = 3
        timer = Timer(stmt="torch.nonzero(inp, as_tuple=False)", label="Nonzero",
                      sub_label=f"number of elts {numel}",
                      description=f"ndim {ndim}", globals=globals())
        for i in range(repeats):
            results.append(timer.blocked_autorange())
        print(f"\rnumel {numel} ndim {ndim}", end="")
        sys.stdout.flush()
comparison = Compare(results)
comparison.print()
```
</p>
</details>
### Results
Before:
```
[--------------------------- Nonzero ---------------------------]
| ndim 1 | ndim 2 | ndim 3
1 threads: ------------------------------------------------------
number of elts 131072 | 55.2 | 71.7 | 90.5
number of elts 1048576 | 113.2 | 250.7 | 497.0
number of elts 134217728 | 8353.7 | 23809.2 | 54602.3
Times are in microseconds (us).
```
After:
```
[-------------------------- Nonzero --------------------------]
| ndim 1 | ndim 2 | ndim 3
1 threads: ----------------------------------------------------
number of elts 131072 | 48.6 | 79.1 | 90.2
number of elts 1048576 | 64.7 | 134.2 | 161.1
number of elts 134217728 | 3748.8 | 7881.3 | 9953.7
Times are in microseconds (us).
```
There's a real regression for smallish 2D tensors due to the added work of computing the number of nonzero elements; however, for other sizes there are significant gains, and there are drastically lower memory requirements. Perf gains would be even larger for tensors with fewer nonzeros.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44259
Reviewed By: izdeby
Differential Revision: D23581955
Pulled By: ngimel
fbshipit-source-id: 0b99a767fd60d674003d83f0848dc550d7a363dc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44217
Move the tests to static ones as well
Test Plan:
python test/test_quantization.py TestStaticQuantizedModule.test_embedding_bag_api
Imported from OSS
Reviewed By: raghuramank100
Differential Revision: D23547386
fbshipit-source-id: 41f81c31e1613098ecf6a7eff601c7dcd4b09c76
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44208
Add a quantized module in the static quantization namespace. Embedding
quantization requires only weights to be quantized, so it is static.
Internally it calls the embedding_bag_byte op with the offsets set to correspond to the
indices.
Future PR will move EmbeddingBag quantization from dynamic to static as well.
Test Plan:
python test/test_quantization.py test_embedding_api
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D23547384
fbshipit-source-id: eddc6fb144b4a771060e7bab5853656ccb4443f0
Summary:
Fixes a bug where FP16 values could be incorrectly cast to a half type that doesn't have a cast operator, by inserting the CUDA-specific cast to float during handling of the Cast node rather than as a wrapper around printing Loads and Stores. Two main changes: the HalfChecker now inserts the casts to float explicitly in the IR, and the PrioritizeLoad mutator now consumes both Loads and a Cast which immediately precedes a Load.
Tested with test_jit_fuser_te.py and test_tensorexpr.py, plus the C++ tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44209
Reviewed By: izdeby
Differential Revision: D23575577
Pulled By: nickgg
fbshipit-source-id: 808605aeb2af812758f96f9fdc11b07e08053b46
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44042
Missed one case last time
Test Plan: Imported from OSS
Reviewed By: vkuzo
Differential Revision: D23479345
fbshipit-source-id: 30e6713120c494e9fab5584de4df9b25bec83d32
Summary:
When backward ops execute via the autograd engine's evaluate_function(), fn.release_variables() is called to release the SavedVariables. For eager-mode ops, this releases the saved inputs that were required for the backward grad function. However, with TorchScript, we get a DifferentiableGraph, and DifferentiableGraphBackward() doesn't implement release_variables(). This causes the SavedVariables to stay alive longer than necessary. Implement release_variables() for DifferentiableGraphBackward to release these SavedVariables early.
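For context, a minimal eager-mode sketch (not from this PR) of the SavedVariable lifecycle that DifferentiableGraphBackward was not participating in:
```
import torch

class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)  # x is held as a SavedVariable
        return x * x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_out

x = torch.randn(1000, requires_grad=True)
loss = Square.apply(x).sum()
loss.backward()  # the eager engine calls release_variables(), freeing the saved x promptly
```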
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42994
Reviewed By: izdeby
Differential Revision: D23503172
Pulled By: albanD
fbshipit-source-id: d87127498cfa72883ae6bb31d0e6c7056c4c36d4
Summary:
This test is failing consistently on linux-bionic-rocm3.7-py3.6-test2. Relevant log snippet:
```
03:43:11 FAIL: test_addcmul_cuda_float16 (__main__.TestForeachCUDA)
03:43:11 ----------------------------------------------------------------------
03:43:11 Traceback (most recent call last):
03:43:11 File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 818, in wrapper
03:43:11 method(*args, **kwargs)
03:43:11 File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 258, in instantiated_test
03:43:11 result = test(self, *args)
03:43:11 File "test_foreach.py", line 83, in test_addcmul
03:43:11 self._test_pointwise_op(device, dtype, torch._foreach_addcmul, torch._foreach_addcmul_, torch.addcmul)
03:43:11 File "test_foreach.py", line 58, in _test_pointwise_op
03:43:11 self.assertEqual(tensors, expected)
03:43:11 File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1153, in assertEqual
03:43:11 exact_dtype=exact_dtype, exact_device=exact_device)
03:43:11 File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1127, in assertEqual
03:43:11 self.assertTrue(result, msg=msg)
03:43:11 AssertionError: False is not true : Tensors failed to compare as equal! With rtol=0.001 and atol=1e-05, found 10 element(s) (out of 400) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.00048828125 (-0.46484375 vs. -0.46533203125), which occurred at index (11, 18).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44304
Reviewed By: malfet, izdeby
Differential Revision: D23578316
Pulled By: mruberry
fbshipit-source-id: 558eecf42677383e7deaa4961e12ef990ffbe28c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44233
**Summary**
By default, scripting tries to share concrete and JIT types across
compilations. However, this can lead to incorrect results if a module
extends `torch.jit.ScriptModule`, and injects instance variables into
methods defined using `define`.
This commit detects when this has happened and disables type sharing
for the compilation of the module that uses `define` in `__init__`.
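For illustration, a hypothetical reconstruction (not taken from the issue) of the failure mode: the method source differs per instance, so sharing one JIT type would be incorrect:
```
import torch

class M(torch.jit.ScriptModule):
    def __init__(self, bias):
        super().__init__()
        # the define()'d method bakes in a per-instance value
        self.define(f"def forward(self, x):\n    return x + {bias}\n")

a, b = M(1.0), M(2.0)  # after this commit, compiled with separate types
```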
**Test Plan**
This commit adds a test to TestTypeSharing that tests this scenario.
**Fixes**
This commit fixes #43580.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23553870
Pulled By: SplitInfinity
fbshipit-source-id: d756e87fcf239befa0012998ce29eeb25728d3e1
Summary:
When var and std are called without args (other than unbiased) they currently call into TH or THC. This PR:
- Removes the THC var_all and std_all functions and updates CUDA var and std to use the ATen reduction
- Fixes var's docs, which listed its arguments in the incorrect order
- Adds new tests comparing var and std with their NumPy counterparts
Performance appears to have improved as a result of this change. I ran experiments on 1D tensors, 1D tensors with every other element viewed ([::2]), 2D tensors and 2D transposed tensors. Some notable datapoints:
- torch.randn((8000, 8000))
- var measured 0.0022215843200683594s on CUDA before the change
- var measured 0.0020322799682617188s on CUDA after the change
- torch.randn((8000, 8000)).T
- var measured .015128850936889648 on CUDA before the change
- var measured 0.001912832260131836 on CUDA after the change
- torch.randn(8000 ** 2)
- std measured 0.11031460762023926 on CUDA before the change
- std measured 0.0017833709716796875 on CUDA after the change
Timings for var and std are, as expected, similar.
On the CPU, however, the performance change from making the analogous update was more complicated, and ngimel and I decided not to remove CPU var_all and std_all. ngimel wrote the following script that showcases how single-threaded CPU inference would suffer from this change:
```
import torch
import numpy as np
from torch.utils._benchmark import Timer
from torch.utils._benchmark import Compare
import sys

base = 8
multiplier = 1

def stdfn(a):
    meanv = a.mean()
    ac = a - meanv
    return torch.sqrt(((ac * ac).sum()) / a.numel())

results = []
num_threads = 1
for _ in range(7):
    size = base * multiplier
    input = torch.randn(size)
    tasks = [("torch.var(input)", "torch_var"),
             ("torch.var(input, dim=0)", "torch_var0"),
             ("stdfn(input)", "stdfn"),
             ("torch.sum(input, dim=0)", "torch_sum0")]
    timers = [Timer(stmt=stmt, num_threads=num_threads, label="Index", sub_label=f"{size}",
                    description=label, globals=globals()) for stmt, label in tasks]
    repeats = 3
    for i, timer in enumerate(timers * repeats):
        results.append(timer.blocked_autorange())
        print(f"\r{i + 1} / {len(timers) * repeats}", end="")
        sys.stdout.flush()
    multiplier *= 10
print()
comparison = Compare(results)
comparison.print()
```
The TH timings using this script on my devfair are:
```
[------------------------------ Index ------------------------------]
| torch_var | torch_var0 | stdfn | torch_sum0
1 threads: ----------------------------------------------------------
8 | 16.0 | 5.6 | 40.9 | 5.0
80 | 15.9 | 6.1 | 41.6 | 4.9
800 | 16.7 | 12.0 | 42.3 | 5.0
8000 | 27.2 | 72.7 | 51.5 | 6.2
80000 | 129.0 | 715.0 | 133.0 | 18.0
800000 | 1099.8 | 6961.2 | 842.0 | 112.6
8000000 | 11879.8 | 68948.5 | 20138.4 | 1750.3
```
and the ATen timings are:
```
[------------------------------ Index ------------------------------]
| torch_var | torch_var0 | stdfn | torch_sum0
1 threads: ----------------------------------------------------------
8 | 4.3 | 5.4 | 41.4 | 5.4
80 | 4.9 | 5.7 | 42.6 | 5.4
800 | 10.7 | 11.7 | 43.3 | 5.5
8000 | 69.3 | 72.2 | 52.8 | 6.6
80000 | 679.1 | 676.3 | 129.5 | 18.1
800000 | 6770.8 | 6728.8 | 819.8 | 109.7
8000000 | 65928.2 | 65538.7 | 19408.7 | 1699.4
```
which demonstrates that performance is analogous to calling the existing var and std with `dim=0` on a 1D tensor. This would be a significant performance hit. Another simple script shows the performance is mixed when using multiple threads, too:
```
import torch
import time

# Benchmarking var and std, 1D with varying sizes
base = 8
multiplier = 1
op = torch.var
reps = 1000
for _ in range(7):
    size = base * multiplier
    t = torch.randn(size)
    elapsed = 0
    for _ in range(reps):
        start = time.time()
        op(t)
        end = time.time()
        elapsed += end - start
    multiplier *= 10
    print("Size: ", size)
    print("Avg. elapsed time: ", elapsed / reps)
```
```
var cpu TH vs ATen timings
Size: 8
Avg. elapsed time: 1.7853736877441406e-05 vs 4.9788951873779295e-06 (ATen wins)
Size: 80
Avg. elapsed time: 1.7803430557250977e-05 vs 6.156444549560547e-06 (ATen wins)
Size: 800
Avg. elapsed time: 1.8569469451904296e-05 vs 1.2302875518798827e-05 (ATen wins)
Size: 8000
Avg. elapsed time: 2.8756141662597655e-05 vs. 6.97789192199707e-05 (TH wins)
Size: 80000
Avg. elapsed time: 0.00026622867584228516 vs. 0.0002447957992553711 (ATen wins)
Size: 800000
Avg. elapsed time: 0.0010556647777557374 vs 0.00030616092681884767 (ATen wins)
Size: 8000000
Avg. elapsed time: 0.009990205764770508 vs 0.002938544034957886 (ATen wins)
std cpu TH vs ATen timings
Size: 8
Avg. elapsed time: 1.6681909561157225e-05 vs. 4.659652709960938e-06 (ATen wins)
Size: 80
Avg. elapsed time: 1.699185371398926e-05 vs. 5.431413650512695e-06 (ATen wins)
Size: 800
Avg. elapsed time: 1.768803596496582e-05 vs. 1.1279821395874023e-05 (ATen wins)
Size: 8000
Avg. elapsed time: 2.7791500091552735e-05 vs 7.031106948852539e-05 (TH wins)
Size: 80000
Avg. elapsed time: 0.00018650460243225096 vs 0.00024368906021118164 (TH wins)
Size: 800000
Avg. elapsed time: 0.0010522041320800782 vs 0.0003039860725402832 (ATen wins)
Size: 8000000
Avg. elapsed time: 0.009976618766784668 vs. 0.0029211788177490234 (ATen wins)
```
These results show the TH solution still performs better than the ATen solution with default threading for some sizes.
It seems like removing CPU var_all and std_all will require an improvement in ATen reductions. https://github.com/pytorch/pytorch/issues/40570 has been updated with this information.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43858
Reviewed By: zou3519
Differential Revision: D23498981
Pulled By: mruberry
fbshipit-source-id: 34bee046c4872d11c3f2ffa1b5beee8968b22050
Summary:
This PR adds the following aliases:
- not_equal for torch.ne
- greater for torch.gt
- greater_equal for torch.ge
- less for torch.lt
- less_equal for torch.le
These aliases are consistent with NumPy's naming for these functions.
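Usage is straightforward (a quick sketch):
```
import torch

a = torch.tensor([1, 2, 3])
b = torch.tensor([3, 2, 1])

torch.not_equal(a, b)      # alias of torch.ne
torch.greater(a, b)        # alias of torch.gt
torch.greater_equal(a, b)  # alias of torch.ge
torch.less(a, b)           # alias of torch.lt
torch.less_equal(a, b)     # alias of torch.le
```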
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43870
Reviewed By: zou3519
Differential Revision: D23498975
Pulled By: mruberry
fbshipit-source-id: 78560df98c9f7747e804a420c1e53fd1dd225002
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44132
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43985
Added
```
def(detail::SelectiveStr<true>, ...)
impl(detail::SelectiveStr<true>, ...)
```
in torch/library, which can also be used for other templated selective registration.
Size saves for this diff:
fbios-pika: 78 KB
igios: 87 KB
Test Plan: Imported from OSS
Reviewed By: ljk53, smessmer
Differential Revision: D23459774
Pulled By: iseeyuan
fbshipit-source-id: 86d34cfe8e3f852602f203db06f23fa99af2c018
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44048
Inline the fork-wait calls to make sure we can see the ops to be quantized in the main graph.
Also fix the InlineForkWait JIT pass to account for the case where the aten::wait call isn't present in the main graph and we return a future tensor from the subgraph.
Example:
```
graph(%self.1 : __torch__.dper3.core.interop.___torch_mangle_6325.DperModuleWrapper,
      %argument_1.1 : Tensor,
      %argument_2.1 : Tensor):
  %3 : Future[Tensor[]] = prim::fork_0(%self.1, %argument_1.1, %argument_2.1) # :0:0
  return (%3)
with prim::fork_0 = graph(%self.1 : __torch__.dper3.core.interop.___torch_mangle_5396.DperModuleWrapper,
      %argument_1.1 : Tensor,
      %argument_2.1 : Tensor):
  %3 : __torch__.dper3.core.interop.___torch_mangle_6330.DperModuleWrapper = prim::GetAttr[name="x"](%self.1)
  %4 : __torch__.dper3.core.interop.___torch_mangle_5397.DperModuleWrapper = prim::GetAttr[name="y"](%self.1)
  %5 : __torch__.dper3.core.interop.___torch_mangle_6327.DperModuleWrapper = prim::GetAttr[name="z"](%4)
  %6 : Tensor = prim::CallMethod[name="forward"](%5, %argument_1.1, %argument_2.1) # :0:0
  %7 : None = prim::CallMethod[name="forward"](%3, %6) # :0:0
  %8 : Tensor[] = prim::ListConstruct(%6)
  return (%8)
```
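For reference (illustrative, not from this PR), the TorchScript-level pattern that produces such graphs, including the case where the future is returned without an aten::wait in the main graph:
```
import torch

@torch.jit.script
def child(x):
    return x * 2

@torch.jit.script
def parent(x):
    fut = torch.jit.fork(child, x)  # shows up as prim::fork in the graph
    return torch.jit.wait(fut)      # shows up as aten::wait

@torch.jit.script
def parent_no_wait(x):
    # future returned directly: no aten::wait in the main graph
    return torch.jit.fork(child, x)
```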
Test Plan:
python test/test_quantization.py test_interface_with_fork
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D23481003
fbshipit-source-id: 2e756be73c248319da38e053f021888b40593032
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44008
embedding_bag requires only quantization of weights (no dynamic quantization of inputs),
so the type of quantization is essentially static (without calibration).
This will enable pyper to do fc and embedding_bag quantization using the same API call.
Test Plan:
python test/test_quantization.py test_embedding_bag
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D23467019
fbshipit-source-id: 41a61a17ee34bcb737ba5b4e19fb7a576d4aeaf9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43989
When we trace the model it produces an aten::embedding_bag node in the graph.
Add the necessary passes in graph mode to support quantizing it as well.
Test Plan:
python test/test_quantization.py TestQuantizeDynamicJitOps.test_embedding_bag
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D23460485
fbshipit-source-id: 328c5e1816cfebb10ba951113f657665b6d17575
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44137
We only insert guards on Tensor types, so we rely on the output
of a node being uniquely determined by its input types.
We bail if any non-Tensor input affects the output type
and cannot be reasoned about statically.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23543602
Pulled By: eellison
fbshipit-source-id: abd6fe0b1fd7fe6fc251694d4cd442b19c032dd7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44125
In `Quantizer._prepare`, `observed` was used for two different variables
with different types. Making the names a bit cleaner and removing the
name conflict.
Test Plan:
```
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
```
Imported from OSS
Reviewed By: dskhudia
Differential Revision: D23504109
fbshipit-source-id: 0f73eac3d6dd5f72ad5574a4d47d33808a70174a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44165
Allows convolutions to be quantized if the `torch.backends.cudnn.benchmark`
flag is set.
Not for land yet, just testing.
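For reference, the flag in question (one line):
```
import torch

torch.backends.cudnn.benchmark = True  # cuDNN autotuner; gates conv quantization in this pass
```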
Test Plan:
in the gist below, the resulting graph now has quantized convolutions
https://gist.github.com/vkuzo/622213cb12faa0996b6700b08d6ab2f0
Imported from OSS
Reviewed By: supriyar
Differential Revision: D23518775
fbshipit-source-id: 294f678c6afbd3feeb89b7a6655bc66ac9f8bfbc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44227
As title
ghstack-source-id: 111490242
Test Plan: CI
Reviewed By: xcheng16
Differential Revision: D23549149
fbshipit-source-id: fad742a8d4e6f844f83495514cd60ff2bf0d5bcb
Summary:
Update the repeat op so that the inputs to the sizes argument can be a mixture of dynamic and constant inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43430
Reviewed By: houseroad
Differential Revision: D23494257
Pulled By: bzinodev
fbshipit-source-id: 90c5e90e4f73e98f3a9d5c8772850e72cecdf0d4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43906
This method returns a list of RRefs of remote parameters that can be fed into the DistributedOptimizer.
Original PR issue: RemoteModule enhancements #40550
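A hedged usage sketch (constructor details and import path are assumptions based on the RemoteModule API of the time):
```
import torch
from torch.distributed.nn.api.remote_module import RemoteModule
from torch.distributed.optim import DistributedOptimizer

# assumes the RPC framework is already initialized and "worker1" exists
remote_linear = RemoteModule("worker1/cpu", torch.nn.Linear, args=(10, 10))
opt = DistributedOptimizer(
    torch.optim.SGD,
    remote_linear.remote_parameters(),  # list of RRefs to the remote weights
    lr=0.01,
)
```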
Test Plan: buck test caffe2/test/distributed/rpc:process_group_agent -- RemoteModule
Reviewed By: rohan-varma
Differential Revision: D23399586
fbshipit-source-id: 4b0f1ccf2e47c8a9e4f79cb2c8668f3cdbdff820
Summary:
Duplicate of https://github.com/pytorch/pytorch/issues/41413
This PR initiates the process of updating the TorchScript backend interface used by the ONNX exporter:
- Replace the JIT lower-graph pass with the freeze-module pass.
- Enable ScriptModule tests for ONNX operator tests (ORT backend) and model tests by default.
- Replace the JIT remove_inplace_ops pass with remove_mutation, and consolidate all passes for handling in-place ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43791
Reviewed By: houseroad
Differential Revision: D23421872
Pulled By: bzinodev
fbshipit-source-id: a98710c45ee905748ec58385e2a232de2486331b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44092
Instead, submodules and weights are installed directly on the
graph_module by transferring the original modules. This makes it more
likely that scripting will succeed (since we no longer have submodules
that are not used in the trace). It also prevents layered transforms
from having to special case handling of the `root` module. GraphModules
can now be re-traced as part of the input to other transforms.
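For illustration (using the then-experimental torch.fx tracing entry point; names are assumptions):
```
import torch
import torch.fx

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.lin(x).relu()

gm = torch.fx.symbolic_trace(M())  # submodules/weights now live on the GraphModule itself
gm2 = torch.fx.symbolic_trace(gm)  # GraphModules can be re-traced by later transforms
scripted = torch.jit.script(gm)    # no dangling unused submodules, so scripting is likelier to work
```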
Test Plan: Imported from OSS
Reviewed By: jamesr66a
Differential Revision: D23504210
Pulled By: zdevito
fbshipit-source-id: f79e5c4cbfc52eb0ffb5d6ed89b37ce35a7dc467
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44052
Summary
=======
This PR registers the following backwards functions as operators:
- slice_backward
- select_backward
- gather_backward
- index_select_backward (the backward function for index_select)
- select_index_backward (previously known as index_select_backward, but is actually the backward function for max.dim, min.dim, etc)
In the future, I'd like to register more backward functions as operators
so that we can write batching rules for the backward functions. Batching
rules for backward functions makes it so that we can compute batched
gradients.
Motivation
==========
The rationale behind this PR is that a lot of backwards functions (27 in total)
are incompatible with BatchedTensor due to using in-place operations.
Sometimes we can allow the in-place operations, but other times we can't.
For example, consider select_backward:
```
Tensor select_backward(const Tensor& grad, IntArrayRef input_sizes, int64_t dim, int64_t index) {
  auto grad_input = at::zeros(input_sizes, grad.options());
  grad_input.select(dim, index).copy_(grad);
  return grad_input;
}
```
and consider the following code:
```
x = torch.randn(5, requires_grad=True)

def select_grad(v):
    torch.autograd.grad(x[0], x, v)

vs = torch.randn(B0)
batched_grads = vmap(select_grad)(vs)
```
For the batched gradient use case, `grad` is a BatchedTensor.
The physical version of `grad` has size `(B0,)`.
However, select_backward creates a `grad_input` of shape `(5)`, and
tries to copy `grad` to a slice of it.
Other approaches
================
I've considered the following:
- register select_backward as an operator (this PR)
- have a branch inside select_backward for if `grad` is batched.
- this is OK, but what if we have more tensor extensions that want to override this?
- modify select_backward to work with BatchedTensor, by creating a new operator for the "select + copy_ behavior".
- select + copy_ isn't used elsewhere in derivative formulas so this doesn't seem useful
Test Plan
=========
- `pytest test/test_autograd.py -v`
- Registering backward functions may impact performance. I benchmarked
select_backward to see if registering it as an operator led to any noticeable
performance overheads: https://gist.github.com/zou3519/56d6cb53775649047b0e66de6f0007dc.
The TL;DR is that the overhead is pretty minimal.
Test Plan: Imported from OSS
Reviewed By: ezyang, fbhuba
Differential Revision: D23481183
Pulled By: zou3519
fbshipit-source-id: 125af62eb95824626dc83d06bbc513262ee27350
Summary:
A rework of `computeInline` which makes it work a bit better, particularly when combined with other transformations. Previously we stored Functions that were inlined and then deferred the actual inlining of the function body until prepareForCodegen was called. This had an issue when transformations were applied to the LoopNest: the function body could differ from what appears in the root_stmt, resulting in inlining that a) fails, b) reverses other transformations, or c) does a weird, unpredictable combination of the two.
This PR changes that behaviour so that the inlining occurs in the root stmt immediately, which means it reflects any previous transformations, and any future transformations have a true view of the internal IR. It also has the benefit that inspecting the root statement gives an accurate view of it without needing to call prepareForCodegen. I also removed the difference between `computeInline` and `computeInlineWithRand`, and we handle calls to `rand()` in all branches.
This is a rework of https://github.com/pytorch/pytorch/issues/38696, with the agreed changes from ZolotukhinM and zheng-xq: we should only inline if the dimensions are trivial (ie. they are vars not exprs).
This PR is mostly tests, and I fixed a bunch of bugs I found along the way. Partial list:
* When inlining an expression involving rand, we would create random vars equal to the dimensionality of the enclosing Tensor, not the produced Tensor - meaning we'd use an incorrect value if the inlined tensor was smaller. E.g. `X[i] = rand(); A[i, j] = X[i]` would produce a tensor where `A[0, 0] != A[0, 1]`. This is fixed by inserting the Let binding of the random variable at the correct loop body.
* When inlining we'd replace all calls to `rand()` rather than just those present in the Tensor being inlined.
* `rand()` was treated symbolically by the simplifier and we would aggregate or cancel calls to `rand()`. Have fixed the hasher to hash all calls to `rand()` distinctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43885
Reviewed By: gmagogsfm
Differential Revision: D23503636
Pulled By: nickgg
fbshipit-source-id: cdbdc902b7a14d269911d978a74a1c11eab004fa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44139
Also, make sure that we're checking that condition when we're starting a
new fusion group, not only when we merge a node into an existing fusion
group. Oh, and one more: add a test checking that we're rejecting graphs
with unspecified shapes.
Differential Revision: D23507510
Test Plan: Imported from OSS
Reviewed By: bertmaher
Pulled By: ZolotukhinM
fbshipit-source-id: 9c268825ac785671d7c90faf2aff2a3e5985ac5b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44115
Fixes device affinity in the FX prepare pass for QAT. Before this PR, observers
were always created on CPU. After this PR, observers are created on the
same device as the rest of the model. This will enable QAT prepare to
work regardless of whether users move the model to cuda before or after
calling this pass.
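A hedged sketch of the flow this enables (using the later-stabilized prepare_qat_fx entry point; the FX API at the time of this commit was still experimental and may have differed):
```
import torch
from torch.quantization import get_default_qat_qconfig
from torch.quantization.quantize_fx import prepare_qat_fx

model = torch.nn.Sequential(torch.nn.Conv2d(3, 3, 1)).cuda().train()
prepared = prepare_qat_fx(model, {"": get_default_qat_qconfig("fbgemm")})
# after this PR, the inserted observers live on cuda, matching the model
```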
Test Plan:
```
python test/test_quantization.py TestQuantizeFx.test_qat_prepare_device_affinity
```
Imported from OSS
Reviewed By: supriyar
Differential Revision: D23502291
fbshipit-source-id: ec4ed20c21748a56a25e3395b35ab8640d71b5a8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43298
IR emitter uses `ModuleValue` to represent ScriptModules and emit IR for
attribute access, submodule access, etc.
`ModuleValue` relies on two pieces of information, the JIT type of the
module, and the `ConcreteModuleType`, which encapsulates Python-only
information about the module.
ScriptModules loaded from a package used to create a dummy
ConcreteModuleType without any info in it. This led to divergences in
behavior during compilation.
This PR makes the two ways of constructing a ConcreteModuleType equivalent,
modulo any py-only information (which, by definition, is never present in
packaged files anyway).
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23228738
Pulled By: suo
fbshipit-source-id: f6a660f42272640ca1a1bb8c4ee7edfa2d1b07cc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43284
The IR emitter looks for attributes on modules like:
1. Check the JIT type for the attribute
2. Check the originating Python class, in order to fulfill requests for, e.g. static methods or ignored methods.
In the case where you do:
```
inner_module = torch.jit.load("inner.pt")
wrapped = Wrapper(inner_module) # wrap the loaded ScriptModule in an nn.Module
torch.jit.script(wrapped)
```
The IR emitter may check for attributes on `inner_module`. There is no
originating Python class for `inner_module`, since it was directly
compiled from the serialized format.
Due to a bug in the code, we don't guard for this case and a segfault
results if the wrapper asks for an undefined attribute. The lookup in
this case looks like:
1. Check the JIT type for the attribute (not there!)
2. Check the originating Python class (this is a nullptr! segfault!)
This PR guards this case and properly just raises an attribute missing
compiler error instead of segfaulting.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23224337
Pulled By: suo
fbshipit-source-id: 0cf3060c427f2253286f76f646765ec37b9c4c49
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44083
Match on the complete schema of a node instead of its node kind when deciding to fuse it. Previously we matched on node kind, which could fail with something like `aten::add(int, int)`: if a new overload was added to an op without corresponding NNC support, we would still fuse it.
Follow ups are:
- bail when an output tensor type isn't uniquely determined by the input types (e.g. aten::add, where the second input could be either a float or an int)
- remove NNC lowering for _tanh_backward & _sigmoid_backward
- validate that we support all of the overloads here. I optimistically added ops that included Tensors; it's possible that we do not support every overload here. This isn't a regression, and this PR is at least improving our failures in that regard.
I can do any of these as part of this PR if desired, but there are a number of failures people have run into that this PR fixes so I think it would be good to land this sooner than later.
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D23503704
Pulled By: eellison
fbshipit-source-id: 3ce971fb1bc3a7f1cbaa38f1ed853e2db3d67c18
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43965
As part of a larger effort to unify the API between the lite interpreter and full JIT:
- implement torch::jit::mobile::Method, a proxy for torch::jit::mobile::Function
- add support for overloaded operator() to mobile Method and Function
- mobile find_method now returns a c10::optional<Method> (so signature matches full jit)
- moves some implementation of Function from module.cpp to function.cpp
ghstack-source-id: 111161942
Test Plan: CI
Reviewed By: iseeyuan
Differential Revision: D23330762
fbshipit-source-id: bf0ba0d711d9566c92af31772057ecd35983ee6d
Summary:
Polishes DDP join API docstrings and makes a few minor cosmetic changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43973
Reviewed By: zou3519
Differential Revision: D23467238
Pulled By: rohan-varma
fbshipit-source-id: faf0ee56585fca5cc16f6891ea88032336b3be56
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44036
Running replaceAtenConvolution on older traced models won't work, as the
_convolution signature has changed and replaceAtenConvolution was
changed to account for that.
But we did not preserve the old behavior when making that change. This change
restores the old behavior while keeping the new one.
Test Plan: Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23476775
fbshipit-source-id: 73a0c2b7387f2a8d82a8d26070d0059972126836
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44035
change
Also added a test so as to capture such cases in the future.
Test Plan:
python test/test_xnnpack_integration.py
Imported from OSS
Reviewed By: iseeyuan
Differential Revision: D23476773
fbshipit-source-id: a62c4429351c909245106a70b4c60b1bacffa817
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44060
Right now it skips grad checks as well.
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D23484018
Pulled By: gchanan
fbshipit-source-id: 24a8f1af41f9918aaa62bc3cd78b139b2f8de1e1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44054
**Summary**
This commit improves the error message that is printed when an
`Optional` type annotation with an unsupported contained type is
encountered. At present, the `Optional` is printed as-is, and
`Optional[T]` is syntactic sugar for `Union[T, None]`, so that is what
shows up in the error message and can be confusing. This commit modifies
the error message so that it prints `T` instead of `Union[T, None]`.
**Test Plan**
Continuous integration.
Example of old message:
```
AssertionError: Unsupported annotation typing.Union[typing.List, NoneType] could not be resolved.
```
Example of new message:
```
AssertionError: Unsupported annotation typing.Union[typing.List, NoneType] could not be resolved because typing.List could not be resolved.
```
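For context, a minimal snippet (hedged; not from the commit) that hits this error path, since the bare `List` inside the `Optional` cannot be resolved:
```
from typing import List, Optional

import torch

@torch.jit.script
def f(x: Optional[List]) -> int:  # raises the error above
    return 0
```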
**Fixes**
This commit fixes #42859.
Test Plan: Imported from OSS
Reviewed By: gmagogsfm
Differential Revision: D23490365
Pulled By: SplitInfinity
fbshipit-source-id: 2aa9233718e78cf1ba3501ae11f5c6f0089e29cd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44078
When PyTorch mobile inference fails and throws an exception, if the caller catches it and does not crash the app, we are not able to track the inference failures.
So we are adding native soft error reporting to capture all the failures occurring during module loading and running, including both crashing and non-crashing failures. Since c10::Error has good error-message stack handling (D21202891 (a058e938f9)), we are utilizing it for the error handling and message printout.
ghstack-source-id: 111307080
Test Plan:
Verified that the soft error reporting is sent through module.cpp when an operator is missing, making sure a logview mid is generated with a stack trace: https://www.internalfb.com/intern/logview/details/facebook_android_softerrors/5dd347d1398c1a9a73c804b20f7c2179/?selected-logview-tab=latest.
Error message with context is logged below:
```
soft_error.cpp [PyTorchMobileInference] : Error occured during model running entry point: Could not run 'aten::embedding' with arguments from the 'CPU' backend. 'aten::embedding' is only available for these backends: [BackendSelect, Named, Autograd, Autocast, Batched, VmapMode].
BackendSelect: fallthrough registered at xplat/caffe2/aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Named: registered at xplat/caffe2/aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Autograd: fallthrough registered at xplat/caffe2/aten/src/ATen/core/VariableFallbackKernel.cpp:31 [backend fallback]
Autocast: fallthrough registered at xplat/caffe2/aten/src/ATen/autocast_mode.cpp:253 [backend fallback]
Batched: registered at xplat/caffe2/aten/src/ATen/BatchingRegistrations.cpp:317 [backend fallback]
VmapMode: fallthrough registered at xplat/caffe2/aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
Exception raised from reportError at xplat/caffe2/aten/src/ATen/core/dispatch/OperatorEntry.cpp:261 (m
```
Reviewed By: iseeyuan
Differential Revision: D23428636
fbshipit-source-id: 82d5d9c054300dff18d144f264389402d0b55a8a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43734
Following the additional GH comments on the original PR https://github.com/pytorch/pytorch/pull/43307.
ghstack-source-id: 111327130
Test Plan: Run `python test/distributed/test_c10d.py`
Reviewed By: smessmer
Differential Revision: D23380288
fbshipit-source-id: 4b8889341c57b3701f0efa4edbe1d7bbc2a82ced
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44055
There is no functional change here. Another patch will rename NewCriterionTest to CriterionTest.
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D23482572
Pulled By: gchanan
fbshipit-source-id: de364579067e2cc9de7df6767491f8fa3a685de2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44050
We don't actually turn on the CTCLoss tests since they fail, but this allows you to toggle check_forward_only and lets the code actually run.
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D23481091
Pulled By: gchanan
fbshipit-source-id: f2a3b0a2dee27341933c5d25f1e37a878b04b9f6
Summary:
This PR adds a new test suite, test_ops.py, designed for generic tests across all operators with OpInfos. It currently has two kinds of tests:
- it validates that the OpInfo has the correct supported dtypes by verifying that unsupported dtypes throw an error and supported dtypes do not
- it runs grad and gradgrad checks on each op and its variants (method and inplace) that has an OpInfo
This is a significant expansion and simplification of the current autogenerated autograd tests, which spend considerable time processing their inputs. As an alternative, this PR extends OpInfos with "SampleInputs" that are much easier to use. These sample inputs are analogous to the existing tuples in `method_tests()`.
Future PRs will extend OpInfo-based testing to other uses of `method_tests()`, like test_jit.py, to ensure that new operator tests can be implemented entirely using an OpInfo.
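For a flavor of the idea (an illustration of the concept, not the exact internal API), a sample input bundles everything needed for one call of the op under test:
```
import torch

class SampleInput:  # illustrative stand-in for the test-infra class
    def __init__(self, input, args=(), kwargs=None):
        self.input = input          # tensor passed as the first argument
        self.args = args            # remaining positional args
        self.kwargs = kwargs or {}

def sample_inputs_add(device, dtype):
    t = torch.randn(3, 3, device=device, dtype=dtype)
    u = torch.randn(3, 3, device=device, dtype=dtype)
    return [SampleInput(t, args=(u,))]
```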
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43451
Reviewed By: albanD
Differential Revision: D23481723
Pulled By: mruberry
fbshipit-source-id: 0c2cdeacc1fdaaf8c69bcd060d623fa3db3d6459
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44073
We don't have proper support for it yet on the NNC side or in the JIT IR->NNC lowering.
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D23487905
Pulled By: ZolotukhinM
fbshipit-source-id: da0da7478fc8ce7b455176c95d8fd610c94352c1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43961
Currently we're removing prim::profile nodes and embedding the type info
directly in the IR right before the fuser, because it is difficult to
fuse in a presence of prim::profile nodes. It turns out that BatchMM has
a similar problem: it doesn't work when there are prim::profile nodes in
the graph. These two passes run next to each other, so we could simply
remove prim::profile nodes slightly earlier: before the BatchMM pass.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D23453266
Pulled By: ZolotukhinM
fbshipit-source-id: 92cb50863962109b3c0e0112e56c1f2cb7467ff1
Summary:
To avoid conflicts, this PR does not remove all imports. More are coming in further PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43808
Reviewed By: wanchaol
Differential Revision: D23436675
Pulled By: ailzhang
fbshipit-source-id: ccc21a1955c244f0804277e9e47e54bfd23455cd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43972
It is useful when debugging to disable the NNC backend to see whether
the bug is there or in the fuser logic.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23455624
Pulled By: ZolotukhinM
fbshipit-source-id: f7c0452a29b860afc806e2d58acf35aa89afc060
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43270
`torch.conj` is a very commonly used operator for complex tensors, but it's mathematically a no-op for real tensors. Switching to tensorflow gradients for complex tensors (as discussed in #41857) would involve adding `torch.conj()` to the backward definitions for a lot of operators. In order to preserve autograd performance for real tensors and maintain numpy compatibility for `torch.conj`, this PR updates `torch.conj()`, which behaves the same for complex tensors but performs a view / returns the `self` tensor for tensors of non-complex dtypes. The documentation states that the returned tensor for a real input shouldn't be mutated. We could perhaps return an immutable tensor for this case in the future when that functionality is available (zdevito ezyang).
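A quick sketch of the new behavior (hedged; aliasing semantics are as described above):
```
import torch

c = torch.tensor([1 + 2j, 3 - 4j])
torch.conj(c)        # negates the imaginary part, as before

r = torch.randn(3)
out = torch.conj(r)  # after this PR: no copy for real dtypes
# per the docs, `out` should not be mutated for real inputs
```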
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D23460493
Pulled By: anjali411
fbshipit-source-id: 3b3bf0af55423b77ff2d0e29f5d2c160291ae3d9