pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Richard Barnes	3ece9fb45d	Check all CUDA API calls for errors in torch/ (#81560 ) Summary: Original commit changeset: 0bb770d2cdb2 Original Phabricator Diff: D35194935 (`79e5b053b6`) Differential Revision: D35291874 Pull Request resolved: https://github.com/pytorch/pytorch/pull/81560 Approved by: https://github.com/ezyang	2022-10-28 00:40:48 +00:00
Nikita Shulga	82c8365c16	[BE] Delete `TH_DISALLOW_COPY_AND_ASSIGN` (#87743 ) Replace it with `AT_DISALLOW_COPY_AND_ASSIGN` and delete the header that contained this define Pull Request resolved: https://github.com/pytorch/pytorch/pull/87743 Approved by: https://github.com/atalman, https://github.com/ngimel	2022-10-26 03:31:56 +00:00
alexmsettle	00b8c7e63b	New feature for issue #85575 . (#86514 ) Introduced RECORD_OUTPUTS() macro that goes with RECORD_FUNCTION(). It is used to capture the output tensors from a kernel launch. The tensors automatically get passed to the profiler using record_function methods. This allows the profiler to track the tensors that flow into and out of each op. Fixes #85575 cc @robieta @chaekit @aaronenyeshi @ngimel @nbcsm @guotuofeng @guyang3532 @gaoteng-git @tiffzhaofb Pull Request resolved: https://github.com/pytorch/pytorch/pull/86514 Approved by: https://github.com/robieta	2022-10-24 20:02:56 +00:00
Natalia Gimelshein	272747db36	attempted fix for nvrtc with lovelace (#87611 ) Fixes #87595 (maybe?) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87611 Approved by: https://github.com/malfet, https://github.com/atalman	2022-10-24 18:41:38 +00:00
Nikita Shulga	c28cdb53ea	[BE] Delete BUILD_SPLIT_CUDA option (#87502) We link with cuDNN and cuBLAS dynamically for all configs anyway, since statically linked cuDNN is a different library than the dynamically linked one, increases the default memory footprint, etc. In addition, libtorch_cuda, even when compiled for all GPU architectures, no longer approaches the 2Gb binary size limit, so BUILD_SPLIT_CUDA can go away. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87502 Approved by: https://github.com/atalman	2022-10-22 06:00:59 +00:00
Kazuaki Ishizaki	d80a5f9a96	Fix typo under torch directory (#87274 ) This PR fixes typo in .md files under torch directory Pull Request resolved: https://github.com/pytorch/pytorch/pull/87274 Approved by: https://github.com/albanD	2022-10-21 14:22:20 +00:00
Zachary DeVito	f56ce8dbad	[allocator] Move getFreeMutex (#87237) It isn't used in all the allocators, and this change makes that clearer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87237 Approved by: https://github.com/wconstab	2022-10-19 18:00:40 +00:00
Nikita Shulga	3924aa75b1	[BE] Extend linter to detect DOS newlines (#86973) Fix DOS newlines in `onednn/decompose_silu.[cpp|h]` introduced by https://github.com/pytorch/pytorch/pull/85591 as well as one in `.github/PULL_REQUEST_TEMPLATE.md` Pull Request resolved: https://github.com/pytorch/pytorch/pull/86973 Approved by: https://github.com/huydhn, https://github.com/izaitsevfb	2022-10-15 00:20:42 +00:00
Ivan Yashchuk	fd80684784	Add nvFuser support for torch.Tensor.view (#84634 ) This is an alternative to https://github.com/pytorch/pytorch/pull/83739. While PrimTorch has `view` as a reference, we would like to use nvFuser's implementation for `view` for now. Later we might transition to PrimTorch's `torch._refs.view`. See `test_nvprims_view` for examples of things that are now sent to nvFuser. Note that nvFuser's `view` is a copy-like operation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84634 Approved by: https://github.com/kevinstephano, https://github.com/mruberry	2022-10-14 12:08:02 +00:00
sanchitintel	974ad8fa6c	Add BFloat16 dtype support for oneDNN Graph JIT fuser (#85591)

## BFloat16 dtype support for faster inference with TorchScript using oneDNN Graph

Intel Xeon Cooper Lake platform & beyond support the `AVX512_BF16` ISA, which is essentially native BFloat16 support. oneDNN Graph delivers high inference performance with BFloat16 on such machines.

While oneDNN Graph can still be used with BFloat16 on older machines that lack `avx512_bf16` ISA but support `avx512bw`, `avx512vl` & `avx512dq` ISAs, the BF16 performance on these older machines will be significantly poorer (probably even poorer than Float32), as they lack native BF16 support.

Currently, [AMP support for eager mode & JIT mode is divergent in PyTorch](https://github.com/pytorch/pytorch/issues/75956). So, for using oneDNN Graph with BFloat16, eager-mode AMP should be leveraged by turning off AMP for JIT mode, using `torch._C._jit_set_autocast_mode(False)` in python code, so as to avoid conflicts.

Please use the following environment variable to view JIT logs - `PYTORCH_JIT_LOG_LEVEL=">>graph_helper:>>graph_fuser:>>kernel:>>interface"`

## Changes being made in this PR

1. This PR does NOT change the `oneDNN` commit or the `ideep` files. While the `ideep` commit is being updated, only files pertaining to oneDNN Graph are being updated. oneDNN Graph is being upgraded to version 0.5.2 (alpha patch release 2). To put things into perspective, `ideep` is a git submodule of PyTorch. `oneDNN Graph` is a git submodule of `ideep` (`ideep/mkl-dnn`), and oneDNN is a git submodule of oneDNN Graph (`ideep/mkl-dnn/third_party/oneDNN`).
2. Unit-tests are being updated. We now use the [existing dtypes decorator](https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/common_device_type.py#L123-L131).
3. Suggestions made by @eellison in the [FP32 PR](https://github.com/pytorch/pytorch/pull/68111#pullrequestreview-896719477) are being incorporated/addressed -

| Action-item | Status |
| :--- | ---: |
| checkInputCompatibility follow up | Fixed |
| the mayConvertScalarInputToTensor logic we can consider | Added type promotion code |
| fix up fixConvOptionalBias | The current approach seems correct |
| Use opinfo tests | using dtypes decorator. Will use `OpInfo` in a subsequent PR, if that'd be possible. Should we create a list of ops from opDB that are supported by oneDNN Graph, and add it to `common_methods_invocations.py`? |
| inferDevice torch_check call | not necessary now, perhaps, as only CPU is supported, for now? We'd add it by the beta release of oneDNN Graph, though, so that by then, users might be able to use other fusers with oneDNN Graph (NNC/TensorExpr are already compatible with the oneDNN Graph fuser). We can still add it, if you'd insist. |
| not checking shapes of input mkldnn tensor to llga guard | Those checks should not be present because oneDNN Graph may use blocked or channels-last layout, so those strides would be different. They're only skipped if an LLGA subgraph's output is input to another LLGA subgraph, which enables LLGA to choose an optimal layout between them. |
| fix test failures with respect to unsupported inputs | We'll address them with the upcoming release of oneDNN Graph beta version |

4. More PyTorch ops are being mapped to oneDNN Graph

## Example of using oneDNN Graph with BFloat16

```python
import torch

# Assuming we have a model of the name 'model'
example_input = torch.rand(1, 3, 224, 224)

# enable oneDNN Graph
torch.jit.enable_onednn_fusion(True)
# Disable AMP for JIT
torch._C._jit_set_autocast_mode(False)
with torch.no_grad(), torch.cpu.amp.autocast():
    model = torch.jit.trace(model, (example_input))
    model = torch.jit.freeze(model)
    # 2 warm-ups (2 for tracing/scripting with an example, 3 without an example)
    model(example_input)
    model(example_input)

    # speedup would be observed in subsequent runs.
    model(example_input)
```

## TorchBench based Benchmarks

URL: https://github.com/sanchitintel/benchmark/tree/onednn_graph_benchmark (instructions present at URL).
Batch-size(s): TorchBench-default for each model
Baseline: PyTorch JIT OFI FP32
Machine: Intel(R) Xeon(R) Platinum 8371HC (Cooper Lake)
Sockets used: 1
Number of cores on one socket: 26
Intel OpenMP & tcmalloc were preloaded

#### Benchmark results with single thread

| name | Latency of PyTorch JIT OFI FP32 (s) | Latency of oneDNN Graph BF16 (s) | % change |
| :--- | ---: | ---: | ---: |
| test_eval[alexnet-cpu-jit] | 1.063851 | 0.509820 | -52.1% |
| test_eval[mnasnet1_0-cpu-jit] | 0.218435 | 0.107100 | -51.0% |
| test_eval[mobilenet_v2-cpu-jit] | 0.114467 | 0.058359 | -49.0% |
| test_eval[mobilenet_v3_large-cpu-jit] | 0.233873 | 0.117614 | -49.7% |
| test_eval[resnet18-cpu-jit] | 0.160584 | 0.075854 | -52.8% |
| test_eval[resnet50-cpu-jit] | 1.652846 | 0.713373 | -56.8% |
| test_eval[resnext50_32x4d-cpu-jit] | 0.471174 | 0.209431 | -55.6% |
| test_eval[shufflenet_v2_x1_0-cpu-jit] | 0.310306 | 0.167090 | -46.2% |
| test_eval[squeezenet1_1-cpu-jit] | 0.161247 | 0.045684 | -71.7% |
| test_eval[timm_efficientnet-cpu-jit] | 1.643772 | 0.800099 | -51.3% |
| test_eval[timm_regnet-cpu-jit] | 5.732272 | 2.333417 | -59.3% |
| test_eval[timm_resnest-cpu-jit] | 1.366464 | 0.715252 | -47.7% |
| test_eval[timm_vision_transformer-cpu-jit] | 0.508521 | 0.271598 | -46.6% |
| test_eval[timm_vovnet-cpu-jit] | 2.756692 | 1.125033 | -59.2% |
| test_eval[vgg16-cpu-jit] | 0.711533 | 0.312344 | -56.1% |

#### Benchmark results with 26 threads

| name | Latency of PyTorch JIT OFI FP32 (s) | Latency of oneDNN Graph BF16 (s) | % change |
| :--- | ---: | ---: | ---: |
| test_eval[alexnet-cpu-jit] | 0.062871 | 0.034198 | -45.6% |
| test_eval[mnasnet1_0-cpu-jit] | 0.022490 | 0.008172 | -63.7% |
| test_eval[mobilenet_v2-cpu-jit] | 0.012730 | 0.005866 | -53.9% |
| test_eval[mobilenet_v3_large-cpu-jit] | 0.025948 | 0.010346 | -60.1% |
| test_eval[resnet18-cpu-jit] | 0.011194 | 0.005726 | -48.9% |
| test_eval[resnet50-cpu-jit] | 0.124662 | 0.045599 | -63.4% |
| test_eval[resnext50_32x4d-cpu-jit] | 0.034737 | 0.015214 | -56.2% |
| test_eval[shufflenet_v2_x1_0-cpu-jit] | 0.028820 | 0.012517 | -56.6% |
| test_eval[squeezenet1_1-cpu-jit] | 0.012557 | 0.003876 | -69.1% |
| test_eval[timm_efficientnet-cpu-jit] | 0.203177 | 0.051879 | -74.5% |
| test_eval[timm_regnet-cpu-jit] | 0.452050 | 0.151113 | -66.6% |
| test_eval[timm_resnest-cpu-jit] | 0.117072 | 0.052848 | -54.9% |
| test_eval[timm_vision_transformer-cpu-jit] | 0.046048 | 0.023275 | -49.5% |
| test_eval[timm_vovnet-cpu-jit] | 0.213187 | 0.077482 | -63.7% |
| test_eval[vgg16-cpu-jit] | 0.044726 | 0.021998 | -50.8% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85591 Approved by: https://github.com/jgong5, https://github.com/frank-wei, https://github.com/chunyuan-w	2022-10-13 20:36:59 +00:00
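The "% change" column in the benchmark tables above appears to be the relative latency change against the FP32 baseline. A minimal sketch of that arithmetic, assuming this definition (the helper name `pct_change` is hypothetical and not part of the PR), using two rows from the single-thread table:

```python
def pct_change(baseline_s: float, new_s: float) -> float:
    """Relative change of new_s versus baseline_s, in percent (negative = faster)."""
    return (new_s - baseline_s) / baseline_s * 100.0


# Latencies in seconds, copied from the single-thread table.
alexnet = pct_change(1.063851, 0.509820)
squeezenet = pct_change(0.161247, 0.045684)

print(f"alexnet: {alexnet:.1f}%")
print(f"squeezenet1_1: {squeezenet:.1f}%")
```

A negative value means the oneDNN Graph BF16 run had lower latency than the baseline.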
Nikita Shulga	9eb4f9dd17	Tweak test tolerances to be compatible with A10G (#86538 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86538 Approved by: https://github.com/ngimel	2022-10-11 23:31:48 +00:00
Jeff Daily	8db30255c3	[ROCm] set nvfuser default to disabled, keep CI (#86369 ) Bug fix. nvfuser is functional for ROCm on gfx906, but some tests are failing for other gfx targets. Disable nvfuser until all features are verified. Users may still opt-in by setting the known env var PYTORCH_JIT_ENABLE_NVFUSER=1. This PR sets this env var for the github actions workflow for ROCm since all current CI hosts are gfx906. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86369 Approved by: https://github.com/huydhn	2022-10-11 20:55:58 +00:00
jjsjann123	dd6dd03ff2	Enable output allocation cache (#86100) Cherry-picked from devel branch: https://github.com/csarofeen/pytorch/pull/2010 Turns on the accidentally disabled output allocation cache [#2002](https://github.com/csarofeen/pytorch/issues/2002). Updates the safety check for the allocation cache by iterating over all IterDomains on outputs, and enables cache re-use only when no extent value is a consumer of fusion inputs (i.e., output sizes do not depend on scalar inputs). Pull Request resolved: https://github.com/pytorch/pytorch/pull/86100 Approved by: https://github.com/csarofeen	2022-10-10 23:31:21 +00:00
Kevin Stephano	b14f1d7bb8	Add Skip List for Aten Ops that are fused in nvFuser. (#86101 ) This Skip List (tuple) is added under the nvprims context manager. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86101 Approved by: https://github.com/jjsjann123, https://github.com/mruberry	2022-10-07 03:55:13 +00:00
Ivan Yashchuk	68a6113248	Add nvFuser support for torch.native_batch_norm (#85562 ) This PR adds nvFuser's implementation for batch_norm as there's no reference yet (https://github.com/pytorch/pytorch/pull/81191) and no in-place copy support (https://github.com/pytorch/pytorch/pull/84545). Pull Request resolved: https://github.com/pytorch/pytorch/pull/85562 Approved by: https://github.com/kevinstephano, https://github.com/ngimel	2022-10-03 15:03:08 +00:00
Edward Z. Yang	3638089755	Ported reshape to symints and added a shim for BC (#85998 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85998 Approved by: https://github.com/ezyang	2022-10-02 17:46:00 +00:00
PyTorch MergeBot	a0b1693996	Revert "Update `amax/amin/norm/count_nonzero` signatures with `int[*]? dim` (#83300 )" This reverts commit `1c0f0b33a0`. Reverted https://github.com/pytorch/pytorch/pull/83300 on behalf of https://github.com/jeffdaily due to The commit breaks nvfuser tests	2022-09-28 17:04:53 +00:00
Kurt Mohler	1c0f0b33a0	Update `amax/amin/norm/count_nonzero` signatures with `int[]? dim` (#83300) Changes `dim` arg to use `int[]?` type for the following functions in `native_functions.yaml`: * `amax` * `amin` * `norm` * `frobenius_norm` * `native_norm` * `count_nonzero` Part of #29137 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83300 Approved by: https://github.com/ngimel, https://github.com/albanD, https://github.com/kulinseth	2022-09-28 01:56:37 +00:00
PyTorch MergeBot	572dd862c4	Revert "Update `amax/amin/norm/count_nonzero` signatures with `int[*]? dim` (#83300 )" This reverts commit `8c7c7ed322`. Reverted https://github.com/pytorch/pytorch/pull/83300 on behalf of https://github.com/huydhn due to The commit pin breaks XLA test somehow	2022-09-28 01:36:43 +00:00
Kurt Mohler	8c7c7ed322	Update `amax/amin/norm/count_nonzero` signatures with `int[]? dim` (#83300) Changes `dim` arg to use `int[]?` type for the following functions in `native_functions.yaml`: * `amax` * `amin` * `norm` * `frobenius_norm` * `native_norm` * `count_nonzero` Part of #29137 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83300 Approved by: https://github.com/ngimel, https://github.com/albanD, https://github.com/kulinseth	2022-09-27 23:50:04 +00:00
S. Song	101f10d7ca	Cherry pick sorting patch (#85620) Fixes https://github.com/csarofeen/pytorch/issues/1947 Cherry-picked patch for torchbench issues where fusion segmenter asserts in nvfuser: 1. Test that the groups come in the same order as they are merged. 2. Fix detection of un-mappable root domains: ComputeAtRootDomainMap flags domains that should not be mapped due to reductions. Previously, checking whether a domain potentially causes an invalid mapping was only done with one domain in each group of domains found to be mappable so far. That's not actually sufficient, as the unmappable domain set is created just once with no root mapping information. The fix is to check all consumer domains of a producer tensor. Another small fix addresses a different problem discovered after the first fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85620 Approved by: https://github.com/csarofeen, https://github.com/davidberard98	2022-09-27 15:53:01 +00:00
jjsjann123	0e582fbfcc	[NVFuser] Upstream push 0907 (#84626)

Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Codegen changes include:

- codegen improvement:
  i. improved view support on pointwise and transpose scheduler
  ii. grouped grid welford added for better outer-norm grid persistence in normalization
- misc:
  i. new composite ops added: variance_mean, arange
  ii. fixes misaligned address for transpose scheduler
  iii. refactor on separation of compilation API from execution API to prepare us for async compilation
  iv. double type support on expression evaluator
  v. PYTORCH_NVFUSER_DUMP refactor to save PTX and CUBIN

Commits that are in this PR from the devel branch:

```
89330aa23aa804340b2406ab58899d816e3dc3d2 Tensor factories must set the output shape as its input (#1939)
b2fd01ea9346712c6d6f623ca6addbc4888d008e arange support (#1933)
56c00fd3922dad7dfc57351ad7d780f0f2f8e4ed Double support on all expression evaluators (#1937)
371f28223e57fe3f6b5e50a0a45177e6a5c0785c Improve trivial reduction merge support (#1931)
1d0c26790e5647920b40d419d26815bbe310b3a6 Test `rand` in a fusion with zero tensor input (#1932)
0dab160fb2177d178eef3148c6a529e0855009e9 Fix softmax bwd sizes. (#1890)
ef98f360f6d3e3e1cc662ecb65202d88150f128d Fix a bug (#1936)
63132a0c56508c550084b07fb76a3df865102d00 Propagate permissive mapping information into indexing pass (#1929)
b4ac2c88d78078ee4d8b21c4fc51645b5710a282 Map IterationDomains through view operations. (#1919)
c0a187a7619d7cf9dc920294e15461791e8d6d4d do not use deprecated functions (#1935)
88de85e758c5e4afb7b6e746573c0d9a53b4cea7 Upstream cherry pick fixes 0811 (#1934)
b247dcf7c57dc6ac3f7a799b0a6beb7770536a74 Separate kernel compilation API from kernel execution API (#1914)
b34e3b93ee1a8030730c14af3995dd95665af07d Fix `ir_utils::hasBlockSync` + misc fixes in transpose scheduler (#1924)
14a53e6707f43bf760494c238a46386d69830822 Nullary RNGOp (#1892)
3c3c89e638f5172cafb0761f22bacd1fd695eec3 Misc fixes/tuning for transpose scheduler (#1912)
20cf109c8b44d48f61977e35bae94368985144ac Grouped grid welford (#1921)
6cf7eb024c9e53c358cbe56597e117bad56efefd Transpose scheduler small dim sizes better support (#1910)
9341ea9a5bf42f9b14ccad0c94edbc79fc5bb552 Disabled ViewPersistentShmoo sizes that results in NAN (#1922)
057237f66deeea816bb943d802a97c1b7e4414ab Fix CUDA driver error: misaligned address for transpose scheduler (#1918)
3fb3d80339e4f794767a53eb8fdd61e64cf404a2 Add variance_mean function using Welford (#1907)
98febf6aa3b8c6fe4fdfb2864cda9e5d30089262 Remove DisableOption::UnrollWithRng (#1913)
ee8ef33a5591b534cf587d347af11e48ba7a15d4 Minor fix for the debug interface of using PTX directly (#1917)
6e8f953351f9dabfd1f991d8431cecb6c2ce684d Add PYTORCH_NVFUSER_DUMP options to save PTX and CUBIN (#1916)
5eefa9a72385f6a4b145680a9dcc52d7e8293763 dopt is only available since nvrtc 11.7 (#1915)
2ec8fc711eafc72451eebf0f5e2a98a38bf3f6ef Kill computeAtBetween (#1911)
d0d106a1d9af118d71673173674e875be35d259d Improve view support on pointwise and transpose scheduler (#1906)
e71e1ecefe67219846070590bbed54bbc7416b79 Fix name clash of RNG with shared memory (#1904)
3381793a253689abf224febc73fd3fe2a0dbc921 Fix mutator and sameAs for expanded IterDomain (#1902)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D39324552](https://our.internmc.facebook.com/intern/diff/D39324552)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84626 Approved by: https://github.com/malfet	2022-09-23 20:29:48 +00:00
Ivan Yashchuk	308b26fe4d	Add nvFuser support for transpose (#84629) `torch._refs.t`, `torch._refs.transpose`, `torch._refs.permute` should all be working now with the nvFuser executor. They would also work with graphs processed by AOT Autograd, as these functions are registered to the aten->ref mapping via the "register_decomposition" decorator: `07d398fb26/torch/_refs/__init__.py (L3125-L3126)` `07d398fb26/torch/_refs/__init__.py (L3143-L3144)` `07d398fb26/torch/_refs/__init__.py (L2548-L2549)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/84629 Approved by: https://github.com/ngimel	2022-09-21 12:45:15 +00:00
Kevin Stephano	39f482acdf	Add a reset() method to nvFuser FusionCache to enable proper resetting during tests. (#85319 ) Fixes issue Jie found in his PR: https://github.com/pytorch/pytorch/pull/84626#issuecomment-1250745334 Pull Request resolved: https://github.com/pytorch/pytorch/pull/85319 Approved by: https://github.com/jjsjann123	2022-09-20 16:10:05 +00:00
Kevin Stephano	b8418e02eb	Create Cache for Fusion Reuse in NVFuser in Python Frontend for Primtorch (#85045 ) This PR does the following: - Replaces the `FusionOwner` with a `FusionCache` and `FusionInterface`. The `FusionCache` is a singleton that contains a cache of Fusions based on the `FusionDefinition`. It replaces the TorchScript graph caching that looked up a Fusion based on a stringified and canonicalized representation of the TorchScript graph with a prefix tree of statements in the `FusionDefinition`. The `FusionInterface` is an object that represents a Fusion in python. It can also query the cache based on id. - The ability to print out a mechanically derived definition, in python, for the user to use when debugging was added. - Replaces the python `examples` directory with true python tests under `test/test_nvfuser_frontend.py`. - Adds a set of C++ tests under the `test` directory to verify the `FusionCache`, `FusionDefinition`, and parts of the `RecordFunctor` child classes. - Adds a README file to explain how to use the Python Frontend While there are 3,000+ line edits, the bulk of the changes were repetitive line changes to the python bindings for each operation. An identical PR to #83267 to avoid tooling issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85045 Approved by: https://github.com/davidberard98	2022-09-17 10:52:54 +00:00
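The commit above describes caching Fusions in a prefix tree keyed on the statements of a `FusionDefinition`. As a rough illustration of that data structure only, here is a pure-Python sketch; the class names, the `get_or_build` method, and the use of plain strings as statements are all hypothetical and do not reflect nvFuser's actual C++ implementation:

```python
from typing import Any, Callable, Dict, Hashable, Sequence


class _TrieNode:
    """One node per statement; a node may hold a cached value."""

    def __init__(self) -> None:
        self.children: Dict[Hashable, "_TrieNode"] = {}
        self.has_value = False
        self.value: Any = None


class PrefixTreeCache:
    """Hypothetical cache keyed by a sequence of hashable statements."""

    def __init__(self) -> None:
        self._root = _TrieNode()
        self.builds = 0  # number of cache misses that triggered a build

    def get_or_build(
        self,
        statements: Sequence[Hashable],
        build: Callable[[Sequence[Hashable]], Any],
    ) -> Any:
        # Walk (and extend) the trie one statement at a time.
        node = self._root
        for stmt in statements:
            node = node.children.setdefault(stmt, _TrieNode())
        # Build only on the first visit to this exact statement sequence.
        if not node.has_value:
            node.value = build(statements)
            node.has_value = True
            self.builds += 1
        return node.value


cache = PrefixTreeCache()
first = cache.get_or_build(("add", "mul"), lambda s: list(s))
second = cache.get_or_build(("add", "mul"), lambda s: list(s))
shorter = cache.get_or_build(("add",), lambda s: list(s))
print(first is second, cache.builds)
```

Definitions that share a leading run of statements share trie nodes, so a lookup walks one node per statement instead of stringifying and comparing whole graphs.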
Aidyn-A	5271494ef2	[CUDA graphs] Fixes errors in RNG seed (#84967) Fixes #84614 Prior to this PR, CUDAGraph did not store the RNG seed, which is why `torch.cuda.manual_seed(new_seed)` would only reset the offset without updating the seed, keeping whatever value was used during graph capture. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84967 Approved by: https://github.com/ngimel	2022-09-14 19:56:12 +00:00
PyTorch MergeBot	94b67f4cd8	Revert "Create Cache for Fusion Reuse in NVFuser in Python Frontend for Primtorch (#83267 )" This reverts commit `ec916bf6af`. Reverted https://github.com/pytorch/pytorch/pull/83267 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally	2022-09-14 17:40:22 +00:00
Kevin Stephano	ec916bf6af	Create Cache for Fusion Reuse in NVFuser in Python Frontend for Primtorch (#83267 ) This PR does the following: - Replaces the `FusionOwner` with a `FusionCache` and `FusionInterface`. The `FusionCache` is a singleton that contains a cache of Fusions based on the `FusionDefinition`. It replaces the TorchScript graph caching that looked up a Fusion based on a stringified and canonicalized representation of the TorchScript graph with a prefix tree of statements in the `FusionDefinition`. The `FusionInterface` is an object that represents a Fusion in python. It can also query the cache based on id. - The ability to print out a mechanically derived definition, in python, for the user to use when debugging was added. - Replaces the python `examples` directory with true python tests under `test/test_nvfuser_frontend.py`. - Adds a set of C++ tests under the `test` directory to verify the `FusionCache`, `FusionDefinition`, and parts of the `RecordFunctor` child classes. - Adds a README file to explain how to use the Python Frontend While there are 3,000+ line edits, the bulk of the changes were repetitive line changes to the python bindings for each operation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83267 Approved by: https://github.com/jjsjann123, https://github.com/davidberard98	2022-09-13 23:28:39 +00:00
jjsjann123	1a33e944b5	nvfuser torchbench patch (#84411 ) 1. Patching nvfuser_execute to take aten nvprim fallback when no cuda tensors are provided as inputs 2. Extending support of nvfuser python API on cpu scalar tensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84411 Approved by: https://github.com/ngimel, https://github.com/kevinstephano, https://github.com/IvanYashchuk	2022-09-07 05:22:37 +00:00
Jeff Daily	6efadf7e7e	[ROCm] guard ROCm-only files in NVFUSER_RUNTIME_FILES (#84312 ) Addresses comment in #82498 as a follow-up PR. https://github.com/pytorch/pytorch/pull/82498#discussion_r958745967 Pull Request resolved: https://github.com/pytorch/pytorch/pull/84312 Approved by: https://github.com/jjsjann123	2022-08-31 18:26:24 +00:00
Jeff Daily	d09486ab23	[ROCm] enable nvfuser (#82498 ) ### Description The nvfuser is enabled for ROCm. ### Testing CI label ciflow/trunk covers the newly enabled ROCm functionality as well as any CUDA regressions caused by these changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82498 Approved by: https://github.com/jjsjann123, https://github.com/davidberard98	2022-08-30 21:50:39 +00:00
Ivan Yashchuk	90161c23cf	Add nvfuser support for squeeze (#84117) "_refs.squeeze" and "_refs.unsqueeze" now work with nvfuser executor tests. Similarly to `_refs.reshape`, we need to explicitly save the concrete shape on the trace to pass that info to nvfuser, as it gets lost in translation (https://github.com/pytorch/pytorch/pull/83739#discussion_r950352124). Pull Request resolved: https://github.com/pytorch/pytorch/pull/84117 Approved by: https://github.com/ngimel	2022-08-30 20:36:11 +00:00
Edward Z. Yang	ad44670fa1	Back out "Revert D38984222: Don't introduce new overload for SymInt (#83628 )" (#84173 ) Also Back out "Revert D39075159: [acc_tensor] Use SymIntArrayRef for overloaded empty.memory_format's signature" Original commit changeset: dab4a9dba4fa Original commit changeset: dcaf16c037a9 Original Phabricator Diff: D38984222 Original Phabricator Diff: D39075159 Also update Metal registrations for C++ registration changes. Also update NNPI registration to account for tightened schema checking Differential Revision: [D39084762](https://our.internmc.facebook.com/intern/diff/D39084762/) NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39084762/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/84173 Approved by: https://github.com/Krovatkin	2022-08-29 18:01:07 +00:00
Ivan Yashchuk	3aae6ff1e1	Add nvprims.var_mean (#83508)

This PR adds an nvfuser-specific primitive - `var_mean`. Interpretation `torch.var_mean` -> `torch.ops.nvprims.var_mean` is handled by the `TorchRefsNvfuserCapabilityMode` context manager. I moved some helper code from `_prims/__init__.py` to `_prims_common`. Correctness is tested with OpInfo tests (see `PythonRefInfo("ops.nvprims.var_mean"`). Layer norm reference now uses `torch.var_mean` instead of `torch._refs.var_mean` to allow interception.

Here's a simple comparison of performance with this PR and master (on 3080ti):

```py
import torch
from torch._prims.context import TorchRefsNvfuserCapabilityMode
from torch.fx.experimental.proxy_tensor import make_fx
from torch._prims.executor import execute

def func(a):
    return torch.native_layer_norm(a, (1024,), None, None, 1e-6)

a = torch.randn(10, 512, 1024, dtype=torch.float16, device="cuda")

with TorchRefsNvfuserCapabilityMode():
    gm = make_fx(func)(a)

for _ in range(10):
    execute(gm, a, executor="strictly_nvfuser")
```

run with `PYTORCH_NVFUSER_DUMP=dump_eff_bandwidth python script.py`

```py
# WITH THIS PR
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.033792 ms, achieved: 621.818 GB/s
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.032608 ms, achieved: 644.396 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.03072 ms, achieved: 684 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s

# ON MASTER
# kernel1 run in 0.05632 ms, achieved: 373.091 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.043808 ms, achieved: 479.649 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
```

So this PR gives about 35% improvement in performance using nvfuser executor with this specific normalized shape.

Also this PR fixes https://github.com/pytorch/pytorch/issues/83506 (see the change in `torch/csrc/jit/python/pybind_utils.cpp`).

Ref. https://github.com/pytorch/pytorch/issues/80187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83508 Approved by: https://github.com/ngimel	2022-08-28 18:45:25 +00:00
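The "about 35%" figure can be sanity-checked from the effective-bandwidth numbers dumped above. A small sketch, assuming the improvement is measured as the ratio of median achieved bandwidth (the choice of the median and the helper name `improvement_pct` are assumptions, not something the PR states):

```python
from statistics import median

# Achieved bandwidth in GB/s, copied from the dump in the commit message.
with_pr = [641.25, 621.818, 641.25, 644.396, 661.935,
           661.935, 641.25, 684.0, 661.935, 661.935]
on_master = [373.091, 477.209, 477.209, 477.209, 479.649,
             488.571, 477.209, 488.571, 488.571, 488.571]


def improvement_pct(new, old):
    """Percent increase of the median of `new` over the median of `old`."""
    return (median(new) / median(old) - 1.0) * 100.0


gain = improvement_pct(with_pr, on_master)
print(f"median bandwidth improvement: {gain:.1f}%")
```

Under this assumption the gain comes out in the mid-30s percent, which is consistent with the claim in the commit message.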
PyTorch MergeBot	b159a5230f	Revert "Add nvprims.var_mean (#83508 )" This reverts commit `7e7694b661`. Reverted https://github.com/pytorch/pytorch/pull/83508 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally	2022-08-28 11:30:27 +00:00
Ivan Yashchuk	7e7694b661	Add nvprims.var_mean (#83508)

This PR adds an nvfuser-specific primitive - `var_mean`. Interpretation `torch.var_mean` -> `torch.ops.nvprims.var_mean` is handled by the `TorchRefsNvfuserCapabilityMode` context manager. I moved some helper code from `_prims/__init__.py` to `_prims_common`. Correctness is tested with OpInfo tests (see `PythonRefInfo("ops.nvprims.var_mean"`). Layer norm reference now uses `torch.var_mean` instead of `torch._refs.var_mean` to allow interception.

Here's a simple comparison of performance with this PR and master (on 3080ti):

```py
import torch
from torch._prims.context import TorchRefsNvfuserCapabilityMode
from torch.fx.experimental.proxy_tensor import make_fx
from torch._prims.executor import execute

def func(a):
    return torch.native_layer_norm(a, (1024,), None, None, 1e-6)

a = torch.randn(10, 512, 1024, dtype=torch.float16, device="cuda")

with TorchRefsNvfuserCapabilityMode():
    gm = make_fx(func)(a)

for _ in range(10):
    execute(gm, a, executor="strictly_nvfuser")
```

run with `PYTORCH_NVFUSER_DUMP=dump_eff_bandwidth python script.py`

```py
# WITH THIS PR
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.033792 ms, achieved: 621.818 GB/s
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.032608 ms, achieved: 644.396 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.03072 ms, achieved: 684 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s

# ON MASTER
# kernel1 run in 0.05632 ms, achieved: 373.091 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.043808 ms, achieved: 479.649 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
```

So this PR gives about 35% improvement in performance using nvfuser executor with this specific normalized shape.

Also this PR fixes https://github.com/pytorch/pytorch/issues/83506 (see the change in `torch/csrc/jit/python/pybind_utils.cpp`).

Ref. https://github.com/pytorch/pytorch/issues/80187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83508 Approved by: https://github.com/ngimel	2022-08-27 09:05:20 +00:00
PyTorch MergeBot	c7edcd6968	Revert "Don't introduce new overload for SymInt (#83628 )" This reverts commit `9790d90e4b`. Reverted https://github.com/pytorch/pytorch/pull/83628 on behalf of https://github.com/malfet due to Breaks internal builds, see D39076487	2022-08-27 01:23:17 +00:00
Edward Z. Yang	9790d90e4b	Don't introduce new overload for SymInt (#83628)

Previously, we introduced new SymInt overloads for every function we wanted. This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.

This PR takes a simpler but more risky approach: just take the original function and change its ints to SymInts.

This is BC-breaking in the following ways:

* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change. Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually. This will definitely break XLA (see companion PR https://github.com/pytorch/xla/pull/3914). Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.

This is not BC-breaking in the following ways:

* The user facing C++ API remains compatible. Even if a function changes from int to SymInt, the default C++ binding still takes only ints (e.g., `at::empty(IntArrayRef, ...)`). To call with SymInts, you must call `at::empty_symint` instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types); as long as you're not doing string equality (which you shouldn't be), these parse to the same underlying type.

Structure of the PR:

* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it as if it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other:
  * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
    * When we do schema validation of C++ operator registration, we must compare against true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`). This is handled with cloneWithRealTypes before we check for schema differences.
    * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
  * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where there is work to do. Finally, because the signature of `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions to call. Typically, I insert a new dispatch call when perf doesn't matter, or use `at::compositeexplicitautograd` namespace to handle other cases.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I change how unboxing logic works slightly. Previously, we interpret the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload).
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh	2022-08-26 01:35:40 +00:00
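The "symint kwarg" idea described in the message above can be illustrated with a small sketch. This is not the actual codegen from the PR: the function name `cpp_type_for` and its lookup table are hypothetical, and only the C++ type names (`int64_t`, `c10::SymInt`, `at::IntArrayRef`, `c10::SymIntArrayRef`) are real.

```py
# Hypothetical illustration of a codegen helper with a `symint` switch.
# The function and its table are invented for this sketch; they are not
# copied from PyTorch's code generator.
def cpp_type_for(schema_type: str, *, symint: bool) -> str:
    """Map a schema type name to the C++ type a generated signature would use."""
    table = {
        "int": "int64_t",
        "int[]": "at::IntArrayRef",
        # A schema-level SymInt is only exposed as c10::SymInt when the
        # caller asks for it; otherwise it is treated like a plain int.
        "SymInt": "c10::SymInt" if symint else "int64_t",
        "SymInt[]": "c10::SymIntArrayRef" if symint else "at::IntArrayRef",
    }
    return table[schema_type]


# The default user-facing binding keeps taking plain ints ...
print(cpp_type_for("SymInt[]", symint=False))
# ... while a *_symint style binding sees the symbolic type.
print(cpp_type_for("SymInt[]", symint=True))
```

The point of the switch is that a single schema entry can yield both an int-based signature, in the spirit of `at::empty(IntArrayRef, ...)`, and a SymInt-based one, in the spirit of `at::empty_symint`, which is why every call site has to choose one explicitly.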
jjsjann123	b21a6ff639	[NVFuser] Upstream push 0811 (#83239)

Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Code changes include:

- codegen improvements:
  1. double support in expression evaluator
- bug fixes:
  1. dropout fix - rework RNG to support broadcasted dropout (Fixes #82784)
  2. expand fix - Patch expand+reduction, expand+view, rework view analysis and guard
- scheduler:
  1. manual transpose schedule example
  2. WIP transpose scheduler

Commits in this PR from the devel branch:

```
b7435afcd22c917713c2f41a7237bc26e1183f14 Transpose scheduler, step 1 (#1854)
8a45dbf72034684eb8e18b1835b533e90b68f184 Add an example on how to manually schedule transpose (#1889)
83dbf56a9554b2efbd5416461d938fff477b0b27 Patch dropout fix (#1898)
69d3519a532250719b1aa8341b50e067b181b42d Expand+Reduction, Expand+View support, rework View analysis and guards (#1883)
15091c488e96343bdc49e3990acbf238a3b3da51 Rework RNG to correctly support broadcasted dropout (#1888)
aafe2d048aaac596e503596a41303423619f3954 Make ExpressionEvaluator support Double (#1885)
```

RUN_TORCHBENCH: nvfuser
Differential Revision: [D38657074](https://our.internmc.facebook.com/intern/diff/D38657074)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83239
Approved by: https://github.com/davidberard98	2022-08-25 02:23:22 +00:00
PyTorch MergeBot	a7edf71360	Revert "Don't introduce new overload for SymInt (#83628 )" This reverts commit `8fae7027b3`. Reverted https://github.com/pytorch/pytorch/pull/83628 on behalf of https://github.com/malfet due to breaking internal builds, see https://www.internalfb.com/diff/D38984222	2022-08-25 00:49:40 +00:00
Edward Z. Yang	8fae7027b3	Don't introduce new overload for SymInt (#83628)

Previously, we introduced new SymInt overloads for every function we wanted. This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.

This PR takes a simpler but more risky approach: just take the original function and change its ints to SymInts.

This is BC-breaking in the following ways:

* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change. Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually. This will definitely break XLA (see companion PR https://github.com/pytorch/xla/pull/3914). Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.

This is not BC-breaking in the following ways:

* The user facing C++ API remains compatible. Even if a function changes from int to SymInt, the default C++ binding still takes only ints (e.g., `at::empty(IntArrayRef, ...)`). To call with SymInts, you must call `at::empty_symint` instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types); as long as you're not doing string equality (which you shouldn't be), these parse to the same underlying type.

Structure of the PR:

* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it as if it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other:
  * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
    * When we do schema validation of C++ operator registration, we must compare against true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`). This is handled with cloneWithRealTypes before we check for schema differences.
    * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
  * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where there is work to do. Finally, because the signature of `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions to call. Typically, I insert a new dispatch call when perf doesn't matter, or use `at::compositeexplicitautograd` namespace to handle other cases.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I change how unboxing logic works slightly. Previously, we interpret the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload).
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh	2022-08-23 22:04:07 +00:00
jjsjann123	1407e6728c	Nvfuser python api patch take 2 (#83684) Relanding #83645. The previous attempt broke bf16 kernel codegen on CUDA toolkit 10.2. Added a shortcut to disable bf16 tests on pre-CUDA 11 builds. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83684 Approved by: https://github.com/ngimel	2022-08-19 16:05:39 +00:00
Peter Bell	b14df5334d	CMake: List python source files as codegen dependencies (#83683 ) The pyi, selected_mobile_ops and nvfuser code generators were missing some dependencies outright. The autograd codegen had some effort to list out specific files that it depends on, but this has clearly fallen out of sync so it's safer to just depend on the entire folder. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83683 Approved by: https://github.com/albanD	2022-08-18 23:34:59 +00:00
PyTorch MergeBot	f84e087d5e	Revert "fixing define_constant pybind signature to match std::complex scalar (#83645 )" This reverts commit `278c726458`. Reverted https://github.com/pytorch/pytorch/pull/83645 on behalf of https://github.com/albanD due to broke master test	2022-08-18 14:00:42 +00:00
jjsjann123	278c726458	fixing define_constant pybind signature to match std::complex scalar (#83645) Fixes #83576 Previously, a complex scalar was defined as a boolean, which generated wrong results. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83645 Approved by: https://github.com/ezyang, https://github.com/kevinstephano	2022-08-18 04:52:33 +00:00
Jeff Daily	ff5fe9e622	[ROCm] enable jiterator (#77982 ) ### Description Enables jiterator for ROCm builds. This includes necessary porting when hiprtc and nvrtc behavior differed. This also ported ROCm versus CUDA differences w.r.t. MAX_DIMS and NUM_THREADS from the non-jiterator code paths into jiterator. ### Testing CI with ciflow/trunk label to force running ROCm workflows that are currently trunk-only. Pull Request resolved: https://github.com/pytorch/pytorch/pull/77982 Approved by: https://github.com/ngimel	2022-08-15 16:04:09 +00:00
jjsjann123	a395f6e842	Limits constant chunk propagation for pw-node-only (#83083 ) Fixes #82889 Disables constant chunk propagation on non-pointwise ops, since it could change semantics and give invalid graphs. TODO: - [x] python test for the breakage. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83083 Approved by: https://github.com/davidberard98	2022-08-11 15:45:05 +00:00
Ivan Yashchuk	7191ae58a7	Add nvfuser support for prims.sign and refs.sign (#83167 ) This short PR adds nvFuser support for `prims.sign` and consequently `refs.sign`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83167 Approved by: https://github.com/ngimel	2022-08-11 10:58:32 +00:00
jjsjann123	df741c589f	[NVFuser] Upstream push 0809 (#83067)

Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Code changes include:

- codegen improvements:
  1. removes un-necessary sync from redundant thread compute analysis
  2. symmetric API for BestEffortReplay
  3. support merge on trivial reductions
  4. Ampere async copy improvements
- bug fixes:
  1. vectorization bug fixes
  2. type inference patch: fixes upstream #81725
  3. segmenter bug fix with deterministic iteration ordering
- parser update
  1. added leaky_relu
- scheduler
  1. normalization scheduler clean up.
  2. simplifies matmul scheduling with new transform propagator
  3. merge all dimensions in PW scheduler
  4. various gemm related improvements
- debuggability
  1. nsight compute support
  2. debug dump for InlinePropagator
  3. Add `UnaryOpType::Print`

Squashed commits to WAR github API

Commits that are actually in this PR from the devel branch:

```
dfe02f3faed4c64477e5f5c678f21f33415d0195 Merge remote-tracking branch 'csarofeen/devel' into HEAD
16173732ecfafc4797e93c2449cfb778015a6c7a Add `TensorViewBuilder::shape(std::vector<Val*> shape)` (#1884)
7cfb7796bdcf055eb61d600b7b5c9df292950290 Merge pull request #1887 from csarofeen/upstream_merge_0803
3399f6de62061d30781de50ef1862bbfb1615173 Merge remote-tracking branch 'origin/viable/strict' into HEAD
01208f5bba3bc158d41ccbefa0ee2c5ceea7aedb Add `UnaryOpType::Print` which can be helpful for debugging (#1878)
0646522454aa715ef164c88a73fb8bdddc706805 Remove redundant TORCH_INTERNAL_ASSERT in lower_magic_zero.cpp (#1881)
7bc76aa219293a59e4166e258d76289fe13633ca Fix most inlined propagator for mismatched dims (#1875)
501f4aa270bf4dd47b0d2f4860bc6f23ebc32a38 Nonaffine swizzle formulation ep.2: Loop swizzle variant. (#1826)
d863d690f923047a85b5229a787118708f810741 Ampere async copy ep.2: circular buffering extension to support pipelined matmul operand load (#1827)
e0ae11a61c87cd998e88ddd79a496548171c31e0 Larger sized mma instructions to support full vectorization (#1824)
9bb4cf7a66b098f04c9d95a2d34ab2bceee151b3 fragment iteration to support fully unrolled mma ops (#1823)
a48270a18dc2d3accc2626758d14d5858ae55032 Merge all dims in pointwise scheduler (#1872)
172fb3673fb4aaf4c1e889922a4fc5c06cbd59f7 Make MostInlined and BestEffort inline propagation no longer assert replayed (#1868)
a64462a5ac2fcf57a177bf36b0f26c61a4e252a4 Allow trivial reduction to be merged (#1871)
440102bcda6eb1dcd42d5fa5aeab9d6b049956bc Symmetric API for BestEffortReplay (#1870)
d1caf330c08ea8002f7133ca655bbd5b28c4eb98 Some misc cleanups/refactor split out from #1854 (#1867)
1013eda50be38eac96c00ba781340ac199d5a136 Remove some welford specific logic. (#1864)
51589d36be5a101d06e641fe0400b39028b7cb81 Some cleanups on tests and heuristics params (#1866)
a6b3e70da5dee51dbc246347228ea21384e46ac3 Segmenter bug fix, and deterministic iteration ordering. (#1865)
1b665b9b5e562d6f0caba5e7319e83e5df64104f Add nullptr checks to IrBuilder (#1861)
1cd9451d7493f631c2837ba07c1ea93a74e83a15 Simplify matmul scheduling with the new transform propagator. (#1817)
bbc1fb9b8c454f557ab9fcf5b1c3cef9b9e136d0 Add leaky_relu operation (#1852)
e842a9bab5e9f7289b7ce33ee37a682b22373f49 Minor cleanup in pointwise scheduler (#1858)
9ee850ca2f7f51dd5269bffb1255e485f809282d Fix stringstream usage (#1857)
20a36c1e4f28c4ff9837e56784be2686d17435f3 Improve nsight compute support (#1855)
405910308301097297b55c34d560aab6a360e897 Remove debugging `true ||` from getPointwiseHeuristics (#1822)
01117bfe8fdfacdbfdcfba9a624cdf900fe044d4 Misc cleanup (#1853)
5cc64943dc381a568223140bce0f22163c01e29f Apply the magic-zero protection to each indexed domain individually for predicate indexing (#1846)
92e6f0207e3a89fe90fd5cd3ffc575dfd766ba00 Cleanup normalization scheduler (#1845)
db89c6591a2f21130599a93675e0615e55564e41 Type inference patch (#1848)
102fe93a4605ca465cda26ebaee4ba1af2026901 Add debug dump for InlinePropagator (#1847)
b7a4d93d375a6e2ddef483763c93ffddc62ec452 Redundant thread compute analysis to avoid un-necessary sync insertion (#1687)
942be5b256056d0e02877361b814ae6af32ca15f Upstream ci build fixes (#1842)
0b83645915029d67f9345aa4649b8c6f62b0061b Fix vectorization bug introduced in #1831 (#1840)
63630f1ae091180e541932a9d9dc598e0a9902dd Move MaxProducerPosUpdater into InlinePropagator::tearDown (#1825)
9135a963c01d97ba34b1a7d2f106e78a13fd6651 Fix transpose benchmark dtype (#1839)
2c9a6c02312d5bf4f83cde653b847b4f85849432 Add extra configurability to `parallelizeAllLike` (#1831)
```

RUN_TORCHBENCH: nvfuser
Differential Revision: [D38543000](https://our.internmc.facebook.com/intern/diff/D38543000)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83067
Approved by: https://github.com/davidberard98	2022-08-10 21:02:56 +00:00
Kurt Mohler	5ca9b2b6fa	Enable `dim=None` for `torch.var` (#82765 ) ### Description Add support for `dim=None` in `torch.var` ### Issue Part of #29137 ### Testing N/A Pull Request resolved: https://github.com/pytorch/pytorch/pull/82765 Approved by: https://github.com/albanD	2022-08-04 20:47:27 +00:00
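As a rough illustration of what `dim=None` means for a variance reduction, the pure-Python sketch below contrasts reducing over every element with reducing along one dimension. It assumes `dim=None` behaves like omitting `dim` (reduce over all dimensions) and that the sample variance with Bessel's correction is wanted. It uses nested lists and the standard library rather than `torch`, so it mirrors the semantics only, not the implementation.

```py
import statistics

# A 2x3 "tensor" as nested lists; the values are illustrative only.
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# dim=None analogue: flatten everything, then take one sample variance.
flat = [value for row in x for value in row]
var_all = statistics.variance(flat)

# dim=1 analogue: one sample variance per row.
var_rows = [statistics.variance(row) for row in x]

print(var_all, var_rows)
```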

304 Commits