Summary:
Fixes #42598
Max and min operators for quantized tensors only support per-tensor
quantized tensors. Previously, an exception was thrown further down the
stack of involved operator calls. This PR adds an earlier termination
with a clearer error message.
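For illustration, a minimal sketch of the behavior this change targets; the tensor sizes, quantization parameters, and the exact error type/message are assumptions and not taken from the PR:
```python
import torch

x = torch.randn(3, 4)

# Per-tensor quantization: max/min are supported.
q_per_tensor = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.quint8)
print(torch.max(q_per_tensor))

# Per-channel quantization: expected to fail early with a clearer message after this change.
scales = torch.full((3,), 0.1)
zero_points = torch.zeros(3, dtype=torch.long)
q_per_channel = torch.quantize_per_channel(x, scales, zero_points, axis=0, dtype=torch.quint8)
try:
    torch.max(q_per_channel)
except RuntimeError as e:
    print(e)
```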
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79036
Approved by: https://github.com/vkuzo
Summary:
`test_qtensor_fill` previously only tested the quantized fill op for CPU
tensors, with a comment saying that `copy_` only works for CPU tensors.
Quantized CUDA tensors are now supported, so the test case is
amended to include CUDA tensors.
Test Plan:
```
python test/test_quantization.py -k test_qtensor_fill_per_tensor
```
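A minimal sketch of the kind of case the amended test now covers, assuming `quantize_per_tensor` accepts CUDA inputs on the build under test; sizes and quantization parameters are illustrative:
```python
import torch

scale, zero_point = 0.1, 5
devices = ["cpu"] + (["cuda"] if torch.cuda.is_available() else [])
for device in devices:
    x = torch.empty(4, 4, device=device)
    qx = torch.quantize_per_tensor(x, scale, zero_point, torch.quint8)
    qx.fill_(3.0)  # fill with a float value; stored as its quantized representation
    print(device, qx.int_repr()[0, 0].item())
```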
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78662
Approved by: https://github.com/vkuzo
This PR adds testing of references with "aten" and "nvfuser" executors using `torch._prims.executor.make_traced`.
Many tests are skipped even for "aten" executor because of https://github.com/pytorch/pytorch/issues/78923.
I limited the dtypes for the nvfuser executor tests because they are slow due to compilation overhead (the full run took about 30 minutes). With only `float32` and `int32` dtypes, the nvfuser tests take about 5 minutes.
```
58 passed, 2507 skipped, 28162 deselected, 79 xfailed, 5 warnings in 297.58s (0:04:57)
```
The 58 passed tests mean that 29 references now work correctly with the nvfuser executor.
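For reference, a minimal sketch of how a reference can be run through `make_traced`, assuming the wrapped callable accepts an `executor` keyword as used by these tests:
```python
import torch
import torch._refs as refs
from torch._prims.executor import make_traced

# Wrap a reference implementation so it can be traced and run by a chosen backend.
traced_add = make_traced(refs.add)

a = torch.randn(4, device="cuda")
b = torch.randn(4, device="cuda")
out_aten = traced_add(a, b, executor="aten")
out_nvfuser = traced_add(a, b, executor="nvfuser")  # nvfuser path requires CUDA inputs
```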
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78926
Approved by: https://github.com/mruberry
Summary:
All of the current Vulkan API tests of the GRU op use the same H_in and H_out sizes (H_in = 384 and H_out = 384), which means these tests don't exercise the behavior when H_in != H_out.
There is indeed a bug: H_in is used at some point to split the weights/biases when it should have been H_out (the hidden_size).
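As a sketch of why the split must use H_out: for `torch.nn.GRU`, the input-to-hidden weight has shape (3 * H_out, H_in), so the per-gate chunks have size H_out along dim 0. The sizes below are hypothetical:
```python
import torch

H_in, H_out = 384, 512  # hypothetical sizes with H_in != H_out
gru = torch.nn.GRU(input_size=H_in, hidden_size=H_out)

# weight_ih_l0 has shape (3 * H_out, H_in); the per-gate split must use H_out, not H_in.
w_ir, w_iz, w_in = gru.weight_ih_l0.split(H_out, dim=0)
assert w_ir.shape == (H_out, H_in)
```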
Differential Revision: D36895889
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78945
Approved by: https://github.com/SS-JIA
This PR introduces selective build to the lightweight dispatch CI job. As a result, we can no longer run the `test_lite_interpreter_runtime` test suite, because it requires some other operators.
From now on, if we add a new unit test in `test_codegen_unboxing`, we will have to export the operators for the unit-test model and add them to `lightweight_dispatch_ops.yaml`. This can be automated by introducing tracing-based selective build, but that is left for a follow-up PR.
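A rough sketch of one way to export the operator list for a unit-test model, assuming `torch.jit.export_opnames` covers the ops the codegen-unboxing test needs; the model and the printed names are illustrative, and the yaml format itself is not shown:
```python
import torch

class TestModel(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1

scripted = torch.jit.script(TestModel())
# Root operators used by the model; these names would be added to lightweight_dispatch_ops.yaml.
print(torch.jit.export_opnames(scripted))
```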
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78983
Approved by: https://github.com/kit1980
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/
A few bigger updates:
1. Initial support of cp.async and cp.async.wait: https://github.com/csarofeen/pytorch/pull/1619
2. Emulate Ampere's mma 16816 with Turing's mma 1688, for a unified interface: https://github.com/csarofeen/pytorch/pull/1643
3. Extend the infrastructure to support mma operators on Turing and Ampere architectures: https://github.com/csarofeen/pytorch/pull/1440
Commits that are actually in this PR from the csarofeen branch:
```
* dd2325294e236c5082c642819a1103bcfe4561a3 (csarofeen/devel) Fusion Segmenter: Unify single kernel and multi-kernel runtime path (#1710)
* b3d1c3f446355a2d276bac8272e7aa8b5bb6b1f0 Fix missing cooperative launch (#1726)
* dc670a226cbe52be46cecef47001f38bf9a09433 Async gmem copy support on sm80+ (#1619)
* 5e6a8dab5a71aefe0548bbfa15d1a93c556d23fe Add turing mma support and test (#1643)
* d6d6b7d3f10dd91dafa4cdbd5e460bbb38173af4 Fix rFactor when there are indirect root domain(s), and refactor (#1723)
* 7093e39150c6d80e0f9f767d56654714a2e8a927 Mma op integration on ampere (#1440)
* fade8da55e60a118c5595378896d34b862b2fcc3 patch python test for bfloat16 (#1724)
* 8fbd0b18743a72ac10478857c3d2351204375685 Fine-grained kernel profiling (#1720)
* 77c1b4fa633f9e631d267923f4537336fa328939 Adding dry run mode to skip arch dependent checks (#1702)
* 151d95b97bebefc94199bb4a53423ede32b55451 More precise concretization analysis (#1719)
* f4d3630ed54d7069dd377a64be1f91013b285b66 Enable complex python tests (#1667)
* 4ceeee509774cc2ce6c834a4dc1e313f71d94503 Minor bugfix in transform_rfactor.cpp (#1715)
* 3675c70faf218e86d2c78dbd3874b175a3b0a203 Separate root domain and rfactor domain in TransformPrinter (#1716)
* f68b830d5def65dadfe29d4edf52fc703369c84a Fix scheduling with polymorphic broadcast (#1714)
* 4ab5ef7ae2cfd8fffad1e1d882ae7c50631211dc updating_ci_machine (#1718)
* 56585c58b1ff338704cafb0cd6be2b3d536bed5a Merge pull request #1711 from csarofeen/upstream_master_bump_0517
* 174d453d3be0c11a5acb0fff3b3f36e19cfdaf81 Allow using nvFuser on CUDA extension (#1701)
* 18bee67495454b9a79625799776e746bd5e81c4c Validate LOOP concrete IDs have complete IterDomains (#1676)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78244
Approved by: https://github.com/csarofeen, https://github.com/malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78164
This PR finishes moving over the python tracer to use the unified event type. Things that changed:
1) The hacky after-the-fact splicing of python events in profiler_kineto.cpp is gone and python events now simply fold into the rest. (Yay!!!) This is a major BE win.
2) Added `ExtraFields<EventType::PyCall>` and `ExtraFields<EventType::PyCCall>`
3) The enter events (time + TraceKey) are now handled by RecordQueue for performance.
4) Python tracing now uses TSC for lower overhead.
Simplifications in profiler_python.cpp with respect to part 1:
1) Rather than ValueCache emitting an intermediate value_t that gets further converted, load methods can now directly emit ExtraFields<...>
2) The complicated replay in profiler_python.cpp is replaced with a much simpler (and safer) pass to just pair start and end times.
3) During post-processing we can now use `CallTypeHelper::map` to automatically pull in all events, instead of having to loop over the entries for each type manually. This will make it simpler to add new types of Python events later.
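As a usage-level illustration (not the internals described above), a minimal sketch that exercises the Python tracer through the public profiler API; `with_stack=True` is what enables Python call collection:
```python
import torch
from torch.profiler import profile, ProfilerActivity

def work():
    x = torch.randn(128, 128)
    return torch.mm(x, x)

# with_stack=True turns on the Python tracer so Python calls are recorded
# alongside the other profiler events.
with profile(activities=[ProfilerActivity.CPU], with_stack=True) as prof:
    work()

print(prof.key_averages(group_by_stack_n=5).table(sort_by="cpu_time_total", row_limit=5))
```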
Differential Revision: [D36515869](https://our.internmc.facebook.com/intern/diff/D36515869/)
Approved by: https://github.com/aaronenyeshi