Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73863
This PR fully aligns the convert function with the design: https://github.com/pytorch/rfcs/blob/master/RFC-0019-Extending-PyTorch-Quantization-to-Custom-Backends.md
and simplifies the implementation of the convert function by always producing a reference quantized model (with reference patterns) first,
and then lowering the model to a quantized model that is runnable with the PyTorch native backends (fbgemm/qnnpack).
This PR makes convert.py much easier to understand than the previous implementation, and we are able to remove the majority of the code
in quantization_patterns.py as well (in follow-up PRs).
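For context, a minimal sketch of the two-step flow this PR standardizes (produce a reference quantized model, then lower it for the native backend); exact API names and signatures vary across PyTorch versions, so treat this as illustrative rather than the PR's exact interface:
```python
import copy
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx, convert_to_reference_fx

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 16),)

prepared = prepare_fx(model, get_default_qconfig_mapping("fbgemm"), example_inputs)
prepared(*example_inputs)  # calibration

# Step 1: reference quantized model (quantize/dequantize ops + reference patterns).
reference = convert_to_reference_fx(copy.deepcopy(prepared))
# Step 2: lowering to a model runnable with the native fbgemm/qnnpack backend
# (after this PR, convert_fx performs both steps internally).
quantized = convert_fx(prepared)
```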
Test Plan:
```
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
python test/test_quantization.py TestFXNumericSuiteCoreAPIsModels
```
and other internal/oss regression tests
Imported from OSS
Reviewed By: andrewor14
Differential Revision: D34778506
fbshipit-source-id: 0678b66addf736039a8749b352f6f569caca962b
(cherry picked from commit 33ec9caf23f3ab373d827117efbd9db0668b2437)
Summary:
The ONNX spec for Einsum requires all inputs to have the same dtype.
The PyTorch runtime does not allow executing aten::einsum with
mismatched types by default, so the export would never succeed.
However, when the model is wrapped in `torch.autocast()`,
the run succeeds and the ONNX converter creates an Einsum ONNX node
with mismatched input types, which is not allowed by the aforementioned schema.
This PR adds onnx::Einsum to the Autocast enabled list, so that it outputs lower-precision tensors.
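A hypothetical repro sketch (assumes a CUDA device is available); it illustrates einsum receiving mixed-precision inputs under autocast, which is the situation the export hits:
```python
import torch

a = torch.randn(4, 8, device="cuda")          # float32
b = torch.randn(8, 16, device="cuda").half()  # float16
with torch.autocast("cuda"):
    # With einsum on the autocast list, both inputs are cast to the lower-precision
    # dtype, so the exported ONNX Einsum node sees a single dtype.
    out = torch.einsum("ij,jk->ik", a, b)
print(out.dtype)
```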
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71916
Reviewed By: ngimel
Differential Revision: D34629666
Pulled By: malfet
fbshipit-source-id: ec757bb87190a5b7512969e10a32450e9e1f87a1
(cherry picked from commit 7f2b5a6408ae34a6b9f858c3e9f5970b64ca1b4b)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73842
**Overview**
This cleans up the `ZeroRedundancyOptimizer` tests. I apologize for the sweeping formatting changes mixed in with the actually-beneficial changes. It was convenient to unify the formatting while doing a deep comb through the full test file.
The main non-formatting changes include:
- Using `parametrize` instead of manually including `for` loops over possible argument values (see the sketch after this list)
- Removing the `DEVICE` global variable, which was used only for the `TestZeroRedundancyOptimizerSingleRank` tests, in favor of consistent usage of `self.device` in both `TestZeroRedundancyOptimizerSingleRank` and `TestZeroRedundancyOptimizerDistributed`
- Moving `assert ... == ...` to `self.assertEqual(..., ...)` when the assert is part of the test's correctness
- Removing the `if self.rank >= self.world_size or (torch.cuda.is_available() and torch.cuda.device_count() < 2):` conditional guards in favor of `common_distributed.skip_if_no_gpu` for `TestZeroRedundancyOptimizerDistributed`
- For `TestZeroRedundancyOptimizerDistributed`, `self.device` is `torch.device(self.rank)` if CUDA is available, while `self.world_size` is at least 2, even if `torch.cuda.device_count() == 1`.
- The problematic case is exactly when `torch.cuda.device_count() == 1` but `self.world_size == 2` since then calling `self.device` on rank 1 will error. The existing conditional guard prevented this case for some tests, but it was not used consistently (e.g. `test_multiple_groups()`), which is most likely the reason for the hangs and resulting test flakiness. (From my experience landing the recent ZeRO constructor changes, the Windows environment uses a world size of 2 but only has 1 device available.)
- A more robust solution is to always use the `skip_if_no_gpu` decorator as long as the test uses `self.device` and CUDA is available. This is in line with the recommended SPSD usage of ZeRO.
- Renaming `test_multiple_groups()` to `test_nondefault_process_group()`
- The existing `test_multiple_groups()` was slightly misnamed. Also, it is only nontrivial for a world size of (at least) 4 since it tests using a process group including only even ranks. It was marked as flaky on Windows, and I believe this is because of the world size and `torch.cuda.device_count()` mismatch. Now, the test only uses GPU if there are enough available and falls back to CPU otherwise, which is safe since the test uses Gloo backend.
- There was also a duplicated section, which I was unsure how to non-naively de-duplicate. The top half and bottom half are identical even though they claim to target fitting into the broadcast bucket and not fitting into the broadcast bucket:
1d497114e7/test/distributed/optim/test_zero_redundancy_optimizer.py (L658-L684)
- Changing `_test_zero_model_parallel()` to not use CPU
- This is my own fault, having introduced this inefficiency last summer. It makes more sense to simply designate one of the two GPUs for a process to be its default device rather than routing through CPU.
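As a reference for the `parametrize` change above, here is a minimal sketch (hypothetical test, assuming `torch.testing._internal.common_utils`) of how parametrized tests replace manual `for` loops:
```python
from torch.testing._internal.common_utils import (
    TestCase, instantiate_parametrized_tests, parametrize, run_tests,
)

class TestExample(TestCase):
    # Each (maximize, momentum) combination becomes its own generated test case.
    @parametrize("maximize", [False, True])
    @parametrize("momentum", [0.0, 0.9])
    def test_something(self, maximize, momentum):
        self.assertIsInstance(maximize, bool)
        self.assertIsInstance(momentum, float)

instantiate_parametrized_tests(TestExample)

if __name__ == "__main__":
    run_tests()
```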
**Questions**
- How might we limit the runs for `test_ddp_zero_overlap()`? Because it parameterizes over many values, it contributes significantly to the time-to-signal. However, it is an experimental feature, so it is not critical that the tests run every time.
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D34675709
Pulled By: awgu
fbshipit-source-id: 71ce9ac968fb34415cd65206855b4bb5e67754fb
(cherry picked from commit 34e3dd0a184318ea9f63a1ee20cd14b111af3501)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72301
First step in resolving #35026.
This adds `PythonRecordFunction`, which is a `torch::CustomClassHolder`
for `at::RecordFunction`, to keep the ATen code free of torch includes.
It also adds a new, currently unused internal API function,
`_record_function_enter_new`, which returns the torchbind object.
Once the FC (forward compatibility) period expires, `torch.profiler.record_function` will
be updated to use this new internal API. Then, once the BC (backward compatibility) period
expires, the cpp_custom_type_hack-based API can be removed.
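For reference, a minimal sketch of the user-facing `torch.profiler.record_function` API whose internals this stack migrates (illustrative usage only):
```python
import torch
from torch.profiler import profile, record_function

with profile() as prof:
    with record_function("my_block"):  # annotates this region in the profiler trace
        torch.randn(64, 64).mm(torch.randn(64, 64))

print(prof.key_averages().table(sort_by="cpu_time_total"))
```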
Test Plan: Imported from OSS
Reviewed By: dagitses
Differential Revision: D34586311
Pulled By: robieta
fbshipit-source-id: d3eb9ffad7b348548a2b22c75203a92d1cb5115b
(cherry picked from commit 92d2ca808e5fbd20c9d6645dcabc3f059f9ef2d3)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73741
There are probably more perf improvements that could be made, for example reusing more quantities from the forward pass and doing more things in place, but in the spirit of improving coverage, this is probably OK for now.
Note: I didn't do anything with half_to_float, but CUDA (locally) hasn't complained yet.
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D34690141
Pulled By: soulitzer
fbshipit-source-id: fe934e191fee2c8e956d7a5f4b553923adf1b33f
(cherry picked from commit ae49aff7f7c8496e04a3ce7667d8f068ca0a52ec)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73535
**Overview**
- This adds FSDP gradient accumulation without `no_sync()`, which comparatively has more network bandwidth demand but less GPU memory requirement per worker.
- This fixes a bug in the `no_sync()` testing, where the CPU offloading and backward prefetch arguments were not propagating to the `FullyShardedDataParallel` constructor.
- This adds `p_assert()` (taken from Fairscale), which prints the assert error message before raising the `AssertionError`. It is meant to be used when running in the autograd backward context since otherwise the error message is swallowed, giving an unhelpful error like:
```
<built-in method run_backward of torch._C._EngineBase object at 0x7f1fd518dc80> returned NULL without setting an error
```
NOTE: Gradient accumulation without `no_sync()` is not currently compatible with CPU offloading.
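A minimal sketch (hypothetical training loop, assuming an initialized process group and an FSDP-wrapped model) contrasting the two accumulation modes:
```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def accumulate(model: FSDP, optimizer, batches, use_no_sync: bool):
    optimizer.zero_grad()
    if use_no_sync:
        # Less communication, more GPU memory: gradients stay unsharded locally
        # until the final micro-batch triggers the reduce-scatter.
        with model.no_sync():
            for batch in batches[:-1]:
                model(batch).sum().backward()
        model(batches[-1]).sum().backward()
    else:
        # More communication, less memory: every backward reduce-scatters gradients.
        for batch in batches:
            model(batch).sum().backward()
    optimizer.step()
```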
**Test Plan**
I augmented the tests to cover gradient accumulation that interleaves iterations accumulating with and without `no_sync()`.
After this diff:
- QPS (ResNet): f328439897
- QPS (RoBERTa): f328440141
- Accuracy: f328442119
Before this diff (trunk):
- QPS (ResNet): f328432756
- QPS (RoBERTa): f328436766
- Accuracy: f328437896
Test Plan: Imported from OSS
Reviewed By: zhaojuanmao
Differential Revision: D34533546
Pulled By: awgu
fbshipit-source-id: 821d762dfad5f2b1e59adcb8e5cb7c277399040c
(cherry picked from commit 746a5ea2720dcf87c376229b405a318396fe5769)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73771
The runtime for this test doesn't actually depend on the timeout value
specified here. As a result, we increase the timeout to avoid flakiness.
https://ossci-raw-job-status.s3.amazonaws.com/log/4666724994 is an example of
where this test failed due to a small timeout as reported in
https://github.com/pytorch/pytorch/issues/70546
ghstack-source-id: 150507765
Test Plan:
1) waitforbuildbot
2) run the unit test
Reviewed By: mrshenli
Differential Revision: D34632204
fbshipit-source-id: ffe0f40d08f7a36f90f30f493a189608897bbb4c
(cherry picked from commit a4920a4bfcbd26967567b55ee8417e994d53df49)
Summary:
Implement hierarchical model averaging proposed in https://github.com/pytorch/pytorch/issues/71325.
Unit tests are added. Since I don't have access to 4-GPU machines in the open-source environment, I expect that a branch with the `ci-all` prefix can run the test that requires 4 GPUs.
In the future, the internals of `PeriodicModelAverager` can be simplified into an implementation of a specialized hierarchical model averaging, where `period_group_size_dict` has only a single pair of period and world size.
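A minimal sketch of how the hierarchical averager might be used (assuming it is exposed under `torch.distributed.algorithms.model_averaging.hierarchical_model_averager`; names and signatures are illustrative):
```python
from collections import OrderedDict

import torch.distributed.algorithms.model_averaging.hierarchical_model_averager as hma

# Assumption: 16 ranks total. Average within groups of 4 ranks every 2 steps,
# and across all 16 ranks every 8 steps.
averager = hma.HierarchicalModelAverager(
    period_group_size_dict=OrderedDict([(2, 4), (8, 16)]),
    warmup_steps=10,
)

# Inside the training loop, after optimizer.step():
#     averager.average_parameters(model.parameters())
```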
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73285
Reviewed By: mrshenli
Differential Revision: D34457792
Pulled By: rohan-varma
fbshipit-source-id: 39a6c5bf8a2852b6394a56abbad17b8a909b9fba
(cherry picked from commit 5f543d46103edb515db199dbb80db43c85665f29)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72843
# [Debug Story] Training Hanging and DDP Bucketing
**What are the characteristics of the hanging training instance?**
The model uses TorchRec `PooledEmbeddingArch` and corresponding sharding solution.
The model config difference to trigger this hanging issue is turning on position weighted embedding tables.
A feature processor module, `GroupedPositionWeightedModule`, is constructed on all ranks, but `GroupedPositionWeightedModule.forward(...)` is only [called on a subset of the ranks of the whole world](https://fburl.com/code/yqrmtvli).
**What was the initial manifested error?**
The training was stuck in the first iteration.
**What are useful debugging tools this time?**
After turning off [static_graph in DDP](https://fburl.com/code/4io81p5i), we saw sparse feature lengths becoming negative after all-to-all collectives; the hang became a fatal failure.
After turning on [torch.distributed DETAIL debugging mode](https://fburl.com/code/cp8e28mm), we saw 2 trainers send out mismatched collectives, one doing an all-to-all, the other doing an all-reduce. So we knew the negative values came from an all-to-all being matched with an all-reduce; the real error had happened earlier, namely the wrong timing of either the all-reduce or the all-to-all.
With more logging added inside DDP, it turned out that DDP decided to do the all-reduce at different times on different ranks.
**What is DDP bucketing?**
Once a gradient is ready on a rank, DDP uses all-reduce to synchronize the average of this gradient across all ranks.
Say we have 4 tensor ops. A, B, C, D.
In the most naive version, we could do one synchronization when all gradients in the full backward graph are ready.
The time sequence would be,
* D.grad
* C.grad
* B.grad
* A.grad
* All reduce on [D.grad, C.grad, B.grad, A.grad].
But that would be a huge waste of communication channel bandwidth.
With DDP bucketing, we can start some gradient synchronization earlier, bucket by bucket. The above time sequence now becomes:
* D.grad
* C.grad
* All reduce on [D.grad, C.grad].
* B.grad
* A.grad
* All reduce on [B.grad, A.grad].
Because gradient computation now overlaps with communication, the bucketing technique gives better DDP execution performance.
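For concreteness, a minimal sketch (hypothetical model, assuming an already-initialized process group) of the DDP settings that interact with bucketing in this story:
```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

model = torch.nn.Sequential(             # stand-ins for ops A, B, C, D
    torch.nn.Linear(1024, 1024),
    torch.nn.Linear(1024, 1024),
    torch.nn.Linear(1024, 1024),
    torch.nn.Linear(1024, 1024),
)
ddp_model = DDP(
    model,
    bucket_cap_mb=25,              # controls how many gradients share one all-reduce
    find_unused_parameters=True,   # needed when some ranks skip sub-modules, as here
)
```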
**What exactly went wrong in this case?**
1. The bucketing doesn’t honor backward graph execution order.
2. There are other collectives comm ops in backward graph.
3. There are unused parameters (i.e unused sub-module) in subset ranks of the whole world.
Using the above example again, we have 4 tensor ops. A, B, C, D.
Say we have 2 trainers, trainer_0 and trainer_1.
B is the feature processor module.
B only runs on trainer_0 (both forward and backward), but not on trainer_1.
C is the all-to-all (pooled embeddings distribution).
C sends out an all-to-all collective in both its forward and backward pass.
Assume all other ops run on both trainers.
trainer_0 op sequence is,
A, B (feature preproc), C (all-to-all), D | D.grad, C.grad (reverse all-to-all), B.grad (feature proc grads), A.grad
trainer_1 op sequence is,
A, C (all-to-all), D | D.grad, C.grad (reverse all-to-all), A.grad
Even though the correct bucketing should be (same bucketing for both ranks),
* bucket_0, [D.grad, C.grad]
* bucket_1, [B.grad, A.grad]
but because of 1), they end up like,
* bucket_0, [B.grad, D.grad]
* bucket_1, [C.grad, A.grad]
Combined with 2) and 3), the time sequence could look like:
(check mark represents the gradient is ready)
(bucket is ready to do synchronization if all its enclosing gradients are ready)
* trainer_0
* t0,
* D.grad
* bucket_0, [B.grad, D.grad ✓]
* t1,
* **C.grad all-to-all**
* C.grad ✓
* bucket_1, [C.grad ✓, A.grad]
* t2
* B.grad
* bucket_0, [B.grad ✓, D.grad ✓] ✓
* t3
* All-reduce for bucket_0
* t4
* A.grad
* bucket_1, [C.grad ✓, A.grad ✓] ✓
* trainer_1
* t0,
* D.grad
* bucket_0, [B.grad ✓, D.grad ✓] ✓. (Because B is not used on trainer_1, DDP marks its gradient as ready immediately.)
* t1,
* **All-reduce for bucket_0**
* t2
* C.grad all-to-all
* bucket_1, [C.grad ✓, A.grad]
* t3
* A.grad
* bucket_1, [C.grad ✓, A.grad ✓] ✓
This is why trainer_0 all-to-all is matched up with trainer_1 all-reduce.
**What is the solution for fixing DDP?**
Disable DDP bucketing for the first iteration. D34051938
This works because, after the first iteration, buckets are rebuilt based on the real backward-graph execution order.
So the slower gradient synchronization only affects the first iteration.
Test Plan:
buck build mode/dev-nosan caffe2/test/distributed:distributed_gloo_spawn
BACKEND=gloo WORLD_SIZE=3 buck-out/gen/caffe2/test/distributed/distributed_gloo_spawn\#binary.par -r test_ddp_logging_data_cpu
P484179296
buck build mode/dev-nosan caffe2/test/distributed:distributed_nccl_spawn
BACKEND=nccl WORLD_SIZE=2 buck-out/gen/caffe2/test/distributed/distributed_nccl_spawn\#binary.par -r test_ddp_logging_data_cpu -r test_ddp_get_bucket_sizes
P484177200
Reviewed By: zhaojuanmao
Differential Revision: D34051938
fbshipit-source-id: 0c7f35875687095c3199f19990e73a8349b6e5b9
(cherry picked from commit bb8f11306ea51c2bd3ffd3ab001d62ce369a08ee)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73642
Re-land of https://github.com/pytorch/pytorch/pull/73471, which was reverted due to lack of `to_sparse(sparse_dim)` support.
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D34580353
Pulled By: cpuhrsch
fbshipit-source-id: a8a4ea381daeb80d8365fe931af9f55a7e789ea1
(cherry picked from commit 5a3cf8110980e5a10dbb687e87e67d5524ebf2f5)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73116
Users may need summon_full_params() to get the original parameters.
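A minimal sketch (assuming an FSDP-wrapped `fsdp_model`; in recent versions `summon_full_params` is exposed as a context manager on `FullyShardedDataParallel`):
```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Inside the context, parameters are gathered back to their full, unsharded form.
with FSDP.summon_full_params(fsdp_model):
    for name, param in fsdp_model.named_parameters():
        print(name, tuple(param.shape))
```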
ghstack-source-id: 150134237
Test Plan: CI
Reviewed By: rohan-varma
Differential Revision: D34353034
fbshipit-source-id: ac69cc032da177903cd9969094f3f82dc6a61636
(cherry picked from commit 55d34fdee3778110a165a13ae987d0339e8d33c7)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73551
Rename to better indicate what it is.
ghstack-source-id: 150166352
Test Plan: CI
Reviewed By: awgu
Differential Revision: D34537964
fbshipit-source-id: 5465003c2a2fd6f1a2646c375bc7c11d297e3f9e
(cherry picked from commit 9f11bdef88c7886b59fedc939e7149872ad73453)
Summary:
This PR introduces the `cuSolverSP` backend for `linalg.solve` with sparse CSR input matrices. The motivation comes from the issue: https://github.com/pytorch/pytorch/issues/69538.
`cuSolver` provides the [`cusolverSp<t>csrlsvluHost`](https://docs.nvidia.com/cuda/cusolver/index.html#cusolver-lt-t-gt-csrlsvlu) API; a few things to note:
1. As mentioned in the documentation: `only CPU (Host) path is provided.` The profiling below confirms that no GPU kernels are launched for acceleration.
2. Since only a `host` path is provided, the CPU path uses `csrlsvluHost` (but requires PyTorch to be installed/built with CUDA support).
3. The documentation mentions that reordering helps performance, but it isn't clear how much it affects it. There are several reordering options; we stick with `reorder = 0` as the default choice.
`cuSolver` also has the [`csrlsvqr`](https://docs.nvidia.com/cuda/cusolver/index.html#cusolver-lt-t-gt-csrlsvqr) function, which provides a `device` path to solve the linear system. This function is used for the CUDA path in this PR.
**Gist:**
For CPU Path: we call [`csrlsvluHost` function of cuSolver](https://docs.nvidia.com/cuda/cusolver/index.html#cusolver-lt-t-gt-csrlsvlu).
For CUDA Path: we call [`csrlsvqr` function of cuSolver](https://docs.nvidia.com/cuda/cusolver/index.html#cusolver-lt-t-gt-csrlsvqr).
**Profiling:** (on a sparse input tensor of size 1000 x 1000, with a vector of length 1000), for the `csrlsvlu` function (showing that there is no GPU acceleration)
```cpp
==3999651== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 2.1440us 1 2.1440us 2.1440us 2.1440us [CUDA memcpy HtoD]
API calls: 99.72% 1.07199s 9 119.11ms 500ns 1.07164s cudaFree
0.11% 1.2182ms 398 3.0600us 140ns 137.94us cuDeviceGetAttribute
0.06% 674.45us 4 168.61us 165.50us 173.64us cuDeviceTotalMem
0.03% 357.07us 4 89.268us 2.7800us 201.89us cudaMalloc
0.03% 309.29us 1 309.29us 309.29us 309.29us cudaGetDeviceProperties
0.01% 160.47us 332 483ns 350ns 3.3300us cudaFuncSetAttribute
0.01% 115.12us 4 28.780us 26.290us 33.410us cuDeviceGetName
0.00% 28.591us 5 5.7180us 440ns 16.921us cudaGetDevice
0.00% 22.061us 4 5.5150us 871ns 18.690us cudaDeviceSynchronize
0.00% 20.370us 18 1.1310us 410ns 6.9900us cudaEventDestroy
0.00% 16.390us 1 16.390us 16.390us 16.390us cudaMemcpy
0.00% 11.540us 2 5.7700us 1.4900us 10.050us cuDeviceGetPCIBusId
0.00% 10.510us 18 583ns 430ns 1.6200us cudaEventCreateWithFlags
0.00% 7.9100us 21 376ns 290ns 700ns cudaDeviceGetAttribute
0.00% 1.4300us 6 238ns 150ns 590ns cuDeviceGet
0.00% 1.2200us 4 305ns 190ns 500ns cuDeviceGetCount
0.00% 900ns 1 900ns 900ns 900ns cuInit
0.00% 860ns 4 215ns 180ns 260ns cuDeviceGetUuid
0.00% 240ns 1 240ns 240ns 240ns cuDriverGetVersion
0.00% 230ns 1 230ns 230ns 230ns cudaGetDeviceCount
```
Script:
```python
import torch

def solve(x, other, out):
    torch.linalg.solve(x, other, out=out)

if __name__ == "__main__":
    dense_inp = torch.randn((1000, 1000), dtype=torch.float64)
    # Set 50% of the values to 0 randomly
    dense_inp = torch.nn.functional.dropout(dense_inp, p=0.5)
    sparse_inp = dense_inp.to_sparse_csr()
    other = torch.randint(100, (1000,), dtype=torch.float64)
    out = torch.randint(1, (1000,), dtype=torch.float64)
    solve(sparse_inp, other, out)
```
The following error is raised when the function is used on a CPU device with PyTorch built/installed without CUDA support:
```python
/home/krshrimali/pytorch/torch/autograd/profiler.py:151: UserWarning: CUDA is not available, disabling CUDA profiling
warn("CUDA is not available, disabling CUDA profiling")
Traceback (most recent call last):
File "/home/krshrimali/pytorch/test_sp.py", line 17, in <module>
solve(x, other, out)
File "/home/krshrimali/pytorch/test_sp.py", line 5, in solve
torch.linalg.solve(x, other, out=out)
RuntimeError: PyTorch was not built with CUDA support. Please use PyTorch built CUDA support
```
**Performance Comparison** (vs SciPy's [`scipy.sparse.linalg.spsolve`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.spsolve.html)):
Time taken by `scipy.sparse.linalg.spsolve` : 0.595 seconds
On CPU: Time taken by `torch.linalg.solve` : 4.565 seconds
On CUDA: Time taken by `torch.linalg.solve`: 1.838 seconds
The inputs are of dimensions: (17281, 17281) and (17281, 1), and were taken from https://math.nist.gov/MatrixMarket/extreme.html.
Thanks to IvanYashchuk for helping me with the PR, and guiding me through it.
cc: IvanYashchuk pearu nikitaved cpuhrsch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71399
Reviewed By: VitalyFedyunin
Differential Revision: D33767740
Pulled By: cpuhrsch
fbshipit-source-id: a945f065210cd719096eb8d7cdbf8e8937c2fce9
(cherry picked from commit f4f35c17da414e1ca6c6d91402933521857aa1ea)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73366
Adds a state_dict() save/reload to the parity-with-DDP test to ensure
checkpointing doesn't cause issues with accuracy/model params.
ghstack-source-id: 150114251
Test Plan: CI
Reviewed By: fegin
Differential Revision: D34434358
fbshipit-source-id: fb0787486b383cfcbec7cc1325a486c8d9b1e2ea
(cherry picked from commit e3bcc7733cb5a497a640007044b1138dfee3a532)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73371
This PR allows the pybind-ed class `WorkerInfo` to be pickled. The class is pickled into a tuple of worker name and rank in the format `(NAME, ID)`. This allows `WorkerInfo` to be passed as an argument to RPC calls.
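A minimal sketch (hypothetical two-worker setup, assuming `rpc.init_rpc` has already been called) of passing a `WorkerInfo` as an RPC argument, which relies on it being picklable as `(NAME, ID)`:
```python
import torch.distributed.rpc as rpc

def greet(info: rpc.WorkerInfo) -> str:
    return f"hello from {info.name} (id={info.id})"

# On an initialized worker:
#     me = rpc.get_worker_info()
#     reply = rpc.rpc_sync("worker1", greet, args=(me,))
```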
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D34458153
Pulled By: H-Huang
fbshipit-source-id: 7b8f99960bdc0e24021e252d8c8138bcb53f698c
(cherry picked from commit 8fb119bf760eef9f313a44e9287c9253cbb09cae)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73324
Implements `state_dict` and `load_state_dict` APIs for FSDP, with the following limitations:
1. Does not support `state_dict_device` (i.e. specifying which device params should be on), which fairscale currently supports
2. Does not yet support offload of state_dict onto CPU
3. Loads state_dict on all ranks currently. In the future we could add support for loading this on only rank 0, to avoid redundancy across ranks, as usually only one rank is responsible for saving/loading the model. Along with (2), this would enable `state_dict` to be called on larger models.
As discussed in the FSDP checkpoint API proposal, `state_dict` will basically be a `full_state_dict` where full parameters are returned on all ranks. This implies that the model must actually be able to fit on a single GPU.
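A minimal usage sketch (assuming an FSDP-wrapped `fsdp_model`); since `state_dict()` returns full parameters on all ranks, the full model must fit on a single GPU:
```python
# Save: every rank gets the full, unsharded state dict.
state = fsdp_model.state_dict()

# Load: called on all ranks (per limitation 3 above).
fsdp_model.load_state_dict(state)
```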
ghstack-source-id: 150012240
Test Plan: ci
Reviewed By: zhaojuanmao
Differential Revision: D34433514
fbshipit-source-id: 3eb1d679b2236264f9f423e761d1720f9aaec73a
(cherry picked from commit a451d5a08ebfa14a229a25fea35b9ca59fe91a59)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73276
There are two major cases when find_unused_parameters=True:
1. The grad-ready order does not change over iterations; in this case, enabling bucket rebuilding after the first iteration can potentially improve performance.
2. The grad-ready order changes over iterations; in this case, whether we use a static or a dynamic bucket order in the first iteration does not matter much, since the order changes every iteration.
ghstack-source-id: 149820812
Test Plan: unit tests
Reviewed By: rohan-varma
Differential Revision: D34410523
fbshipit-source-id: 73284c3629ff2696de76681f070b74ad2bb01f1b
(cherry picked from commit fa3a54bdd659669b776439190039ad889cf3371f)
Summary:
- Target Sha1: ae108ef49aa5623b896fc93d4298c49d1750d9ba
- Make USE_XNNPACK a dependent option on cmake minimum version 3.12
- Print USE_XNNPACK under the cmake options summary, and print its
availability from collect_env.py
- Skip XNNPACK based tests when XNNPACK is not available
- Add SkipIfNoXNNPACK wrapper to skip tests
- Update cmake version for xenial-py3.7-gcc5.4 image to 3.12.4
- This is required for the backwards compatibility test.
The PyTorch op schema is XNNPACK-dependent; see
aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp for an
example. The nightly version is assumed to have USE_XNNPACK=ON,
so with this change we ensure that the test build can also
have XNNPACK.
- HACK: skipping test_xnnpack_integration tests on ROCM
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72642
Reviewed By: kimishpatel
Differential Revision: D34456794
Pulled By: digantdesai
fbshipit-source-id: 85dbfe0211de7846d8a84321b14fdb061cd6c037
(cherry picked from commit 6cf48e7b64d6979962d701b5d493998262cc8bfa)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73314
We need to synchronize the all_gather stream. The added test fails without this
fix.
ghstack-source-id: 149800363
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D34430602
fbshipit-source-id: 4ce07e2d098a4f07ac640285db1d0ff64fd42232
(cherry picked from commit 24c756e7bba69017b9358bf824589b2aeb366b5e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73115
matmul for [B, M, K] x [K, N] was mapped to mm by folding the first two dims of tensor1, giving [BxM, K] x [K, N], but when M and K are transposed it's better to use bmm to avoid data movement.
We could generalize the condition under which we don't fold (see the comment for more details), but we are being conservative here to be cautious about potential unintended regressions.
Test Plan:
In the following simple test case, before this diff:
0.00652953577041626 0.003044447898864746
The permutation takes about the same time as the GEMM itself.
After this diff:
0.002983328104019165 0.0030336639881134034
The permutation overhead essentially went away.
```
import torch
import torch.nn.functional as F
# Note: benchmark_torch_function is a local benchmarking helper (not defined in
# this snippet) that returns the measured time and the call's result.

B = 128
M = 1024
N = 128
K = 1024
X = torch.rand(B, K, M).cuda()
b = torch.rand(N).cuda()
W = torch.rand(N, K).cuda()
X = X.permute(0, 2, 1)  # transposed, non-contiguous view: matmul should use bmm instead of folding to mm
Y = F.linear(X, W, b)
X_contiguous = X.contiguous()
Y_ref = F.linear(X_contiguous, W, b)
torch.testing.assert_close(Y, Y_ref)
t1, _ = benchmark_torch_function(F.linear, X, W, b, 0)
t2, _ = benchmark_torch_function(F.linear, X_contiguous, W, b, 0)
print(t1, t2)
```
Reviewed By: ngimel
Differential Revision: D34350990
fbshipit-source-id: 73e99f785a405cf7a92b909b16f2022b48b1660f
(cherry picked from commit bec995b899710991bb2a304a8009a67f38244114)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166
This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
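A minimal sketch of the new APIs (assuming they are exposed under `torch.distributed`, mirroring the `TORCH_DISTRIBUTED_DEBUG` environment variable):
```python
import torch.distributed as dist

dist.set_debug_level(dist.DebugLevel.DETAIL)   # same effect as TORCH_DISTRIBUTED_DEBUG=DETAIL
print(dist.get_debug_level())                  # DebugLevel.DETAIL

# Re-read the level from the environment variable:
dist.set_debug_level_from_env()
```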
ghstack-source-id: 149778566
Test Plan: Run the existing unit tests.
Reviewed By: rohan-varma
Differential Revision: D34371226
fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70140
[Design Doc for Expanded Weights](https://gist.github.com/samdow/fa0a164fec7963f93ff45284989cfc55) <-- gives an overview of the design for Expanded Weights
Introduces the ExpandedWeights mechanism and user-facing API without any custom-implemented, faster rules.
- User facing API is in `_stateless.py` (with documentation)
- Testing is in test_expanded_weights
- The rest is the implementation of the erroring fallback plus the mechanism for registering faster per-sample grad rules. Only linear is implemented here, but they are all implemented in #70141
Test Plan: Imported from OSS
Reviewed By: mikaylagawarecki
Differential Revision: D34350950
Pulled By: samdow
fbshipit-source-id: 69c664b0bc3dff6951358d79d7e5d94882f7aef2
(cherry picked from commit ae1620d3b6)