Summary:
On a CPU-only build of PyTorch, `torch._C._jit_set_nvfuser_enabled(False)` would throw an error even though it is a no-op there. With this fix:
```
>>> torch._C._jit_set_nvfuser_enabled(False)
False
>>> torch._C._jit_set_nvfuser_enabled(True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: Running CUDA fuser is only supported on CUDA builds.
>>>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71358
Reviewed By: eellison
Differential Revision: D33601135
Pulled By: jansel
fbshipit-source-id: c764df2fa197ce7b4f71e5df0a91cd988766e99c
(cherry picked from commit a801df9321)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71091
Fixes https://github.com/pytorch/pytorch/issues/65394
The masked sum on a full input tensor (of any layout) with an all-true mask is the same as the sum on the strided input tensor (after applying `to_dense` to sparse inputs).
Since masked sum uses `torch.sparse.sum`, its reduction behavior ought, for the simplicity of the masked reduction implementations, to be defined by the behavior of `torch.sum`. This PR implements that behavioral connection with respect to the directional summation of empty sparse tensors, which correspond to all-zero strided tensors.
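A minimal sketch of the resulting invariant, assuming the post-fix behavior (shapes and values are illustrative):
```python
import torch

# A directional sum over an empty sparse tensor (nnz == 0, i.e. the sparse
# form of an all-zero strided tensor) should agree with torch.sum on the
# dense equivalent, matching the all-true-mask case described above.
t = torch.zeros(2, 3).to_sparse()              # sparse COO tensor, nnz == 0
dense_result = torch.sum(t.to_dense(), dim=0)  # tensor([0., 0., 0.])
sparse_result = torch.sparse.sum(t, dim=0)     # sparse: one sparse dim remains
assert torch.equal(sparse_result.to_dense(), dense_result)
```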
cc nikitaved pearu cpuhrsch
Test Plan: Imported from OSS
Reviewed By: davidberard98
Differential Revision: D33651750
Pulled By: cpuhrsch
fbshipit-source-id: 703891bff88c8da6270b4272f5d2da81688db67d
(cherry picked from commit 53f97e80f7)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69060
Saved-variable-hooks checkpointing was added in https://github.com/pytorch/pytorch/pull/69508; this PR adds some tests for DDP.
Specifically, we can support almost all DDP use cases with this new API, such as dynamic module with find_unused_parameters=True. One case remains to be supported, which is static_graph + non-reentrant based checkpointing. The underlying reason this does not work is https://github.com/pytorch/pytorch/issues/58111.
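A hedged sketch of the now-supported use case (module and names are illustrative; assumes a process group is already initialized):
```python
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(10, 10)

    def forward(self, x):
        # use_reentrant=False selects the saved-variable-hooks based
        # (non-reentrant) checkpointing added in #69508
        return checkpoint(self.layer, x, use_reentrant=False)

# With a process group initialized, dynamic modules now work under DDP:
# model = torch.nn.parallel.DistributedDataParallel(
#     Model(), find_unused_parameters=True)
```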
ghstack-source-id: 147219887
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D32712126
fbshipit-source-id: ba5ae9ca77fd8929ee020c7dc97838bae9a1931b
(cherry picked from commit 9c7f93e217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71462
Fixes
```
6 aienv/aienv_ig_reels_base:caffe2/modules/detectron/upsample_nearest_op.h:65:1: error: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Werror,-Wpass-failed=transform-warning]
6 deep_entity_classification/si_dec_gnn:caffe2/modules/detectron/upsample_nearest_op.h:65:1: error: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Werror,-Wpass-failed=transform-warning]
6 feed_recommendation_infra/multifeed_execution_graph_service_nosan:caffe2/modules/detectron/upsample_nearest_op.h:65:1: error: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Werror,-Wpass-failed=transform-warning]
12 mobile_cv/mobile-vision_experimental:caffe2/modules/detectron/upsample_nearest_op.h:65:1: error: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Werror,-Wpass-failed=transform-warning]
30 mobile_cv/mobile-vision_xraymobilev2_detection_caffe2:caffe2/modules/detectron/upsample_nearest_op.h:65:1: error: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Werror,-Wpass-failed=transform-warning]
42 aienv/aienv:caffe2/modules/detectron/upsample_nearest_op.h:65:1: error: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Werror,-Wpass-failed=transform-warning]
128 feed_recommendation_infra/multifeed_recagg_dev:caffe2/modules/detectron/upsample_nearest_op.h:65:1: error: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Werror,-Wpass-failed=transform-warning]
136 fluent2/fblearner_flow_projects_fluent2_nosan:caffe2/modules/detectron/upsample_nearest_op.h:65:1: error: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Werror,-Wpass-failed=transform-warning]
1338 f6/f6_nosan:caffe2/modules/detectron/upsample_nearest_op.h:65:1: error: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Werror,-Wpass-failed=transform-warning]
```
Test Plan: Sandcastle
Reviewed By: luciang
Differential Revision: D33641869
fbshipit-source-id: 8424849cfac5cb0109272dec2086863067bbde66
(cherry picked from commit d18429905c)
Summary:
Reference https://github.com/pytorch/pytorch/issues/69991
Refactored such that only the `out` variant copies the result into `out`; otherwise we just return the result of the composite functions as-is.
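A hedged sketch of the pattern (operator names are illustrative, not the actual functions touched by this PR):
```python
import torch

def my_op(x: torch.Tensor) -> torch.Tensor:
    # stand-in for the composite computation; its result is returned as-is
    return x * 2

def my_op_out(x: torch.Tensor, out: torch.Tensor) -> torch.Tensor:
    result = my_op(x)
    out.resize_(result.shape)
    out.copy_(result)  # only the out= variant performs a copy
    return out
```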
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70894
Reviewed By: samdow
Differential Revision: D33641742
Pulled By: zou3519
fbshipit-source-id: 671be13b31a7fff3afc0b7976706a5ecfc51ccac
(cherry picked from commit e7d5ac9af3)
Summary:
The sccache compilation log is often misleading.
We can move it to its own group so people don't see it right away.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71444
Reviewed By: atalman
Differential Revision: D33659650
Pulled By: janeyx99
fbshipit-source-id: f22fd21640a8747beeacce8857bbb8281efd76f4
(cherry picked from commit e25970abf9)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70266
Addresses some of the issues mentioned in
https://github.com/pytorch/pytorch/issues/65638. The ShardedLinear
implementation only supports 2D inputs, whereas `nn.Linear` supports
arbitrary dimensions for inputs and outputs. As a result, in this PR I've
added support to ensure that ShardedLinear handles arbitrary input dims as
well.
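For reference, the `nn.Linear` behavior being matched:
```python
import torch
import torch.nn as nn

# nn.Linear accepts any number of leading dimensions; only the last
# dimension must equal in_features.
linear = nn.Linear(16, 32)
x = torch.randn(4, 8, 5, 16)
assert linear(x).shape == (4, 8, 5, 32)
```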
ghstack-source-id: 147206607
Test Plan: waitforbuildbot
Reviewed By: wanchaol
Differential Revision: D33267630
fbshipit-source-id: 0460994c3aa33348b80547d9274206ef90cb29b6
(cherry picked from commit 7c289e1dbf)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71461
After the operator versioning work, the version in the model file is used for operator versioning, while bytecode_version is used for bytecode versioning (for the bytecode schema). They are two separate things now, and this comparison is not needed.
ghstack-source-id: 147209286
Test Plan: CI
Reviewed By: iseeyuan, tugsbayasgalan
Differential Revision: D33648592
fbshipit-source-id: beaa136a728f88435176a00c07b2d521210f107f
(cherry picked from commit e90e650e1a)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70615
This adds `at::detail::empty_meta` and
`at::detail::empty_strided_meta` to complement the CPU API.
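These helpers back the Python-level meta-device factories; a quick illustration of what they construct:
```python
import torch

# A meta tensor carries shape, stride, and dtype but allocates no storage.
t = torch.empty(3, 4, device="meta")
print(t.shape, t.device)  # torch.Size([3, 4]) meta
```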
Test Plan: Imported from OSS
Reviewed By: samdow
Differential Revision: D33623678
Pulled By: ngimel
fbshipit-source-id: 59e003116361fb547ec2c633bbc15a7973e21d0e
(cherry picked from commit b4f5836fa1)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70614
This creates an `empty_strided_generic` function which, similar to
`empty_generic`, is a device-independent tensor constructor. This also
adds `at::detail::empty_strided_cpu` to complement
`at::detail::empty_cpu`.
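The Python-level counterpart of the strided factory, for context:
```python
import torch

# Allocate an uninitialized tensor with an explicit size and stride
# (here a column-major 2x3 layout).
t = torch.empty_strided((2, 3), (1, 2))
print(t.stride())  # (1, 2)
```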
Test Plan: Imported from OSS
Reviewed By: samdow
Differential Revision: D33623679
Pulled By: ngimel
fbshipit-source-id: 85994e88d664870bf425f398dfcdfc467885c694
(cherry picked from commit 2ff2a89df5)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71447
Changes the nightly build trigger to be based on pushes to the `nightly`
branch instead of on tagged pushes. This aligns it with our current
CircleCI trigger and should make it easily viewable using tools like
https://hud.pytorch.org/ci/pytorch/pytorch/nightly
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D33647102
Pulled By: seemethere
fbshipit-source-id: c6757da35b7ec2d68bf36160dd7f3cb9ed040899
(cherry picked from commit 99b7b22650)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70613
This refactors `at::detail::empty_cpu` to use only `TensorBase` so you
can construct tensors without including `Tensor.h`. It also adds a
`TensorOptions` version to reduce friction in operators moving from
the `at::empty` API.
Test Plan: Imported from OSS
Reviewed By: samdow
Differential Revision: D33623682
Pulled By: ngimel
fbshipit-source-id: 7a7b08bc2ed06830a3d698197a0c8389a096dc1d
(cherry picked from commit 2e17ad0bbd)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71443
The cogwheel test inline_cvr_infer_canary_pyper_model_publish is timing out.
The convert_fx call takes > 20 mins for the local and local_ro submodules, which used to take ~2 mins.
Test Plan:
FBLearner flow run
* the following cmd took 1113 seconds before the diff and 5002 seconds after:
```
flow-cli clone-locally 320014219 --run-as-secure-group pytorch_at_scale --operators pyper_model_publish_workflow.pyper_model_publish_workflow.process_torch_package_model_files.process_non_sparse_parameters[0]
```
Cogwheel test
* Cogwheel test with packages in B3588 (the last good run) took 4694.48s
* Cogwheel test with packages in B3590 (the first timeout) took 13975.83s
* Cogwheel test with the following packages took 4535.04s
  * all packages in B3588 except the model publish
  * the model publish built with D33469839 (043e84b3d2) reversed (created D33633570)
Reviewed By: albanD, jerryzh168
Differential Revision: D33633570
fbshipit-source-id: dc5e777c48a90c551641a3f79126461f6a60449e
(cherry picked from commit 03ab65023a)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71417
I accidentally changed CPU_INSTANT_EVENT to CPU_OP, which broke TensorBoard.
Test Plan: Make memory profiling unit test check this case.
Reviewed By: aaronenyeshi
Differential Revision: D33637286
fbshipit-source-id: c95945f6b85cd4168820bd4d2a9203274a0a5bd6
(cherry picked from commit b1e258672a)
Summary:
In graph_executor.cpp, line 963, a '\n' is missing in GRAPH_DEBUG, which all the other GRAPH_DEBUG calls here include. The output in GRAPH_DEBUG looks off without it:
```
[DEBUG graph_executor.cpp:963] After CheckInplace (end of runOptimization)graph(%0 : Float(*, *, *, *, requires_grad=0, device=cpu),
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70421
Reviewed By: Gamrix
Differential Revision: D33596430
Pulled By: davidberard98
fbshipit-source-id: 0e7c3c02ce44bf925f0c45e96a382104059fe397
(cherry picked from commit 55899528a2)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71431
Adds a PR trigger, based on paths, to the binary build workflows to make
it easier to test and verify changes to those workflows without adding a
bunch of skipped checks to the majority of our workflows.
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: atalman
Differential Revision: D33641276
Pulled By: seemethere
fbshipit-source-id: 0ed65cbcebf06dfe998f81d67df817250dd1a716
(cherry picked from commit 598b55fd18)
Summary:
* Use `pytorchmergebot` credentials to do the merge
* Infer the sync branch name from the workflow rather than hardcoding it
* Move common functions from `syncbranches.py` to `gitutils.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71420
Reviewed By: bigfootjon
Differential Revision: D33638846
Pulled By: malfet
fbshipit-source-id: a568fd9ca04f4f142a7f5f64363e9516f5f4ef1c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69407
This generates aten_interned_strings.h from `native_functions.yaml`,
which is more like how it was originally done. The items deleted from
`interned_strings.h` are duplicates that need to be removed in order
for the code to compile; some of the remaining items may still be out
of date, but even if that's the case it is fairly benign.
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D32923636
Pulled By: albanD
fbshipit-source-id: a0fd6b3714e70454c5f4ea9b19da5e047d2a4687
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70850
We support both, so we want to ensure both continue to work.
ghstack-source-id: 146960552
Test Plan: Tested manually. A subsequent diff adds this test configuration to CI.
Reviewed By: malfet
Differential Revision: D33297464
fbshipit-source-id: 70e1431d0907d480c576239af93ef57036d5e4d7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70848
This is the C10 library, and it's the main lib we are building
here. While here, use `local_defines` instead of `copts` for this
definition. Both `copts` and `local_defines` apply only to the
compilation units in the library, and not transitively.
ghstack-source-id: 146998039
Test Plan: We are relying on CI to verify this doesn't cause any problems.
Reviewed By: malfet
Differential Revision: D33429420
fbshipit-source-id: b3fc84c0588bd43346e3f9f77e851d293bde9428
Summary:
This PR adds a persistent filesystem cache for jitted kernels. The cache is disabled on Windows because it relies on POSIX headers.
The cache writes, by default, to `~/.cache/torch/kernels`, but the location can be controlled by setting the `PYTORCH_KERNEL_CACHE_PATH` environment variable. A separate environment variable, `USE_PYTORCH_KERNEL_CACHE`, disables all caching logic when set to zero.
The use of a persistent filesystem cache dramatically lowers the "first call time" for an operator after it has been compiled, because it skips (most of) the jit compilation process. On systems where we compile only to ptx, that ptx still has to be just-in-time compiled by the driver API, so an additional latency of around 10 milliseconds is expected at first call time. On systems which compile to SASS, the additional first-call latency is about one millisecond. This compares with times of 150+ milliseconds for just-in-time kernel compilation.
Files in the cache use a mostly human-readable name that includes an SHA1 hash of the CUDA C string used to generate them. Note that this is not an SHA1 hash of the file's contents, because the contents are the compiled ptx or SASS. No verification is done when the file is loaded to ensure the kernel is what's expected, but it's far more likely you'll be struck by a meteor than observe two file names conflicting. Using SHA1 hashes to generate unique ids this way is a common practice (GitHub does it, too).
This cache design could be reused by other fusion systems and should allow us to jiterate more operations without fear of regressing the "incremental development" scenario where users tweak or extend their programs slightly, rerun them, and repeat that process again and again. Without a cache, each run of the program would have to recompile every jitted kernel; with this cache we expect a negligible impact on the user experience.
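A hedged sketch of the naming scheme described above (the exact file-name format is illustrative, not the actual implementation):
```python
import hashlib

def cache_file_name(kernel_name: str, cuda_c_source: str) -> str:
    # Mostly human-readable prefix, plus a SHA1 hash of the generating
    # CUDA C source string (not of the compiled ptx/SASS contents).
    digest = hashlib.sha1(cuda_c_source.encode("utf-8")).hexdigest()
    return f"{kernel_name}_{digest}"

print(cache_file_name("my_fused_kernel", "__global__ void k() {}"))
```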
cc kshitij12345, xwang233
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71350
Reviewed By: ngimel
Differential Revision: D33626671
Pulled By: mruberry
fbshipit-source-id: d55df53416fbe46348623846f699f9b998e6c318
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70612
The device information is embedded in the `DataPtr` returned from the
allocator, so this argument is completely ignored.
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D33623681
Pulled By: ngimel
fbshipit-source-id: bea64707bb17d46debb0ed7c1175493df56fee77
Summary:
This PR enables the `test_block_triangular` tests on the CPU.
These tests revealed a problem with how the nnz == 0 case was handled. Now we return a tensor filled with NaNs on both CUDA and CPU.
cc nikitaved pearu cpuhrsch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71304
Reviewed By: davidberard98
Differential Revision: D33600482
Pulled By: cpuhrsch
fbshipit-source-id: d09cb619f8b6e54b9f07eb16765ad1c183c42487
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71369
Suppress `-Wimplicit-int-float-conversion` in `TypeSafeSignMath.h` when building with clang
Test Plan: CI check
Reviewed By: r-barnes
Differential Revision: D33612983
fbshipit-source-id: cff1239bc252d4a2f54a50a2bbcd48aeb8bf31ca
Summary:
This PR removes the PyTorch nightly dependencies of TorchBench CI. Instead, it relies on the bisection script to install TorchBench dependencies (https://github.com/pytorch/benchmark/pull/694).
This will unblock TorchBench CI users when the nightly build fails (e.g., https://github.com/pytorch/pytorch/issues/71260)
RUN_TORCHBENCH: resnet18
TORCHBENCH_BRANCH: xz9/optimize-bisection
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71323
Reviewed By: wconstab
Differential Revision: D33591713
Pulled By: xuzhao9
fbshipit-source-id: f1308ea33ece1f18196c993b40978351160ccc0c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71356
Suppress the remaining header-based warnings in `caffe2/c10` when building with `clang`
Test Plan: CI pass
Reviewed By: r-barnes
Differential Revision: D33600097
fbshipit-source-id: e1c0d84a0bad768eb03e047d62b5379cf28b48e2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71072
This PR replaces the old logic of loading frozen torch through CPython by loading zipped torch modules directly onto the deploy interpreter. We embed the zip file as a section of the ELF binary and read it back from the interpreter executable. Then we insert the zip file directly into the `sys.path` of each initialized interpreter. Python's implicit ZipImporter module can load modules from a zip file as long as it is on `sys.path`.
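The standard-library mechanism this relies on, in miniature:
```python
import sys
import zipfile

# Python's built-in zipimport machinery imports modules from any zip
# archive that appears on sys.path.
with zipfile.ZipFile("modules.zip", "w") as zf:
    zf.writestr("hello.py", "MESSAGE = 'loaded from a zip'\n")

sys.path.insert(0, "modules.zip")
import hello
print(hello.MESSAGE)  # loaded from a zip
```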
Test Plan: buck test //caffe2/torch/csrc/deploy:test_deploy
Reviewed By: shunting314
Differential Revision: D32442552
fbshipit-source-id: 627f0e91e40e72217f3ceac79002e1d8308735d5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71254
When we configure linear and relu with the same qconfig, we currently have utility functions to also
generate a qconfig for the fused linear-relu module, but this code was not called in the correct order,
which resulted in unexpected behavior. This PR fixes the issue. Please see the test case for more details.
(The test case is from Supriya.)
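A hedged sketch of the configuration involved, using the FX graph-mode quantization API of this era (the model and backend choice are illustrative):
```python
import torch
from torch.ao.quantization import get_default_qat_qconfig

# Linear and ReLU share one qconfig; the fix ensures the fused LinearReLU
# module generated during QAT fusion is swapped using that same qconfig.
qconfig = get_default_qat_qconfig("fbgemm")
qconfig_dict = {"object_type": [
    (torch.nn.Linear, qconfig),
    (torch.nn.ReLU, qconfig),
]}
# prepared = prepare_qat_fx(model.train(), qconfig_dict)
```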
Test Plan:
python test/test_quantization.py TestQuantizeFx.test_fused_module_qat_swap
Imported from OSS
Reviewed By: supriyar
Differential Revision: D33558321
fbshipit-source-id: d95114dc4b77264e603c262c2da02a3de4acba69
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71290
The existing code called an out-of-line hash function on a constant. This is just going to get the same random-looking 64-bit integer every time, so I just changed the constant to an integer I generated with `hex(random.randint(0x1000000000000000, 0xFFFFFFFFFFFFFFFF))` to get the same effect but without the runtime hashing.
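The generator one-liner, runnable as-is:
```python
import random

# Produce a fixed random-looking 64-bit constant once, offline, instead of
# hashing a constant at runtime.
print(hex(random.randint(0x1000000000000000, 0xFFFFFFFFFFFFFFFF)))
```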
ghstack-source-id: 146991945
Test Plan: CI
Reviewed By: wconstab
Differential Revision: D33574676
fbshipit-source-id: d6ce1e1cc0db67dfede148b7e3173508ec311ea8