This PR adds a test for the previous PR in this stack: #109904. In summary, it calls
functions decorated with `@record_shapeenv_event` that don't have an explicit `ShapeEnv`
parameter, with arguments that don't hold a `ShapeEnv` instance.
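As a rough sketch of what such a decorator has to handle (the names below are illustrative stand-ins, not the actual PyTorch implementation), the recorder searches a call's arguments for a `ShapeEnv` instance; the test exercises the path where none is found:

```python
import functools

class ShapeEnv:
    """Minimal stand-in; the real ShapeEnv lives in torch.fx.experimental."""
    def __init__(self):
        self.events = []

def record_shapeenv_event(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Search positional and keyword arguments for a ShapeEnv; when no
        # instance is present, the call must still go through unrecorded.
        shape_env = next(
            (a for a in list(args) + list(kwargs.values())
             if isinstance(a, ShapeEnv)),
            None,
        )
        if shape_env is not None:
            shape_env.events.append((fn.__name__, args, kwargs))
        return fn(*args, **kwargs)
    return wrapper

@record_shapeenv_event
def add(x, y):
    return x + y

@record_shapeenv_event
def scale(env, x):
    return x * 2
```

Calling `add(1, 2)`, where no argument holds a `ShapeEnv`, is exactly the case the new test covers.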
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109944
Approved by: https://github.com/ezyang
Summary:
This PR adds support for the `_scaled_dot_product_flash_attention` fallback kernel.
Note that in `abi_compatible` mode, we retrieve outputs by passing
output argument pointers rather than relying on `std::get`.
It also fixes an issue related to dynamic shapes, where we wrongly
queried undefined dynamic symbols.
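The two output conventions can be illustrated with a small Python analogy (hypothetical function names and math; the real code is C++ in the AOTInductor runtime):

```python
def attention_tuple(q, k, v):
    # Tuple-return style: the caller unpacks by position, analogous to
    # std::get<i>(out) on a std::tuple return value.
    out = [qi * ki + vi for qi, ki, vi in zip(q, k, v)]
    logsumexp = float(sum(out))
    return out, logsumexp

def attention_outptr(q, k, v, out, logsumexp):
    # ABI-stable style: the caller allocates the outputs and passes them in;
    # the callee writes through the output "pointers" (mutable containers
    # here), so no C++ tuple layout crosses the ABI boundary.
    out[:] = [qi * ki + vi for qi, ki, vi in zip(q, k, v)]
    logsumexp[0] = float(sum(out))
```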
Test Plan: ci
Reviewed By: frank-wei
Differential Revision: D49620191
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110085
Approved by: https://github.com/desertfire
Summary: We are trying to use wire messages to pass Python objects like KJT. In order for JIT to be able to unpickle them, we need to provide a type resolver as well as an object loader. This diff modifies the interface to make that possible.
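The stdlib `pickle` module has an analogous hook, `Unpickler.find_class`, which sketches what a type resolver does (this is not the TorchScript API; `LoadedFraction` is a made-up stand-in type):

```python
import io
import pickle
from fractions import Fraction

class LoadedFraction:
    """Made-up local type that the resolver substitutes at load time."""
    def __init__(self, *args):
        self.args = args

class ResolvingUnpickler(pickle.Unpickler):
    # The "type resolver": maps pickled (module, name) pairs to local types
    # so the loader can reconstruct objects it couldn't otherwise import.
    RESOLVER = {("fractions", "Fraction"): LoadedFraction}

    def find_class(self, module, name):
        if (module, name) in self.RESOLVER:
            return self.RESOLVER[(module, name)]
        return super().find_class(module, name)

payload = pickle.dumps(Fraction(1, 2))
obj = ResolvingUnpickler(io.BytesIO(payload)).load()
```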
Test Plan:
Rely on current CI to make sure existing usage doesn't break.
An end-to-end test will follow in the next diff.
Differential Revision: D49438569
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109730
Approved by: https://github.com/davidberard98
After https://github.com/pytorch/test-infra/pull/4589, we can now query Dr.CI to get the list of flaky failures there. This change queries the Dr.CI API endpoint and checks whether a failure is flaky using the `is_flaky` function.
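The shape of the check can be sketched as follows (the response schema and field names here are hypothetical, not the actual Dr.CI API): given failures Dr.CI has already classified as flaky, match a job's name and failure line against them.

```python
def is_flaky(job_name, failure_line, drci_classifications):
    """Return True if Dr.CI classified this exact failure as flaky.

    drci_classifications is assumed to look like
    {"FLAKY": [{"name": ..., "failure_captures": [...]}, ...]}.
    """
    for flaky in drci_classifications.get("FLAKY", []):
        captures = flaky.get("failure_captures") or []
        if flaky.get("name") == job_name and captures \
                and captures[0] in failure_line:
            return True
    return False
```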
Because the change is relatively large, I'm breaking it down to several smaller PRs in this order:
* [x] This PR queries Dr.CI and adds `is_flaky` check
* [ ] Clean up the flaky rules logic because it has already been implemented on Dr.CI
* [ ] Clean up the broken trunk logic for the same reason
### Testing
* Create a new `drci_mocks.json` file to capture the JSON response from the Dr.CI API endpoint. The API requires `DRCI_BOT_KEY`.
* `pytest -v test_trymerge.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110054
Approved by: https://github.com/clee2000
@crcrpar's last attempt to fix the 0-size problem unfortunately did not pass all cases. See my comment in https://github.com/pytorch/pytorch/issues/100701. When we have a tail tensor of size 0, the old code would mess with the chunk logic to check the previous tensor's length. This is flawed because:
1. if the previous tensor was also 0-sized (i.e., a tensor list like [tensor, tensor, tensor, ..., 0-sized tensor, 0-sized tensor]), chunks would still be 0 and the nested for loop would be skipped.
2. the nested for loop introduces side effects on tensorListMeta that _shouldn't_ be there! This can corrupt the computation in unexpected ways that are hard to reason through.
We noticed that the problem had not been fixed due to an internal report. This PR solves the issue by:
- removing the finagling of chunks when the tail tensor is 0-sized
- adding a surefire way for the kernel to be launched in the case where the last tensor is 0-sized AND there's content in the metadata, signifying there is stuff to compute still.
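The control flow can be sketched in pure Python (a simplified stand-in for the CUDA multi-tensor-apply logic; constants and names are illustrative):

```python
CHUNK = 4      # elements handled per chunk
MAX_META = 3   # launch when this many chunks are batched up

def multi_tensor_apply(lengths, launch):
    meta = []  # pending (tensor_index, chunk_index) entries
    for t, n in enumerate(lengths):
        chunks = (n + CHUNK - 1) // CHUNK  # a 0-sized tensor contributes 0 chunks
        for c in range(chunks):
            meta.append((t, c))
            is_last_chunk = t == len(lengths) - 1 and c == chunks - 1
            if len(meta) == MAX_META or is_last_chunk:
                launch(list(meta))
                meta.clear()
    # The fix: when the tail tensor is 0-sized, the inner loop never runs
    # for it, so any batched-up metadata must still be launched here.
    if meta:
        launch(list(meta))
```

With `lengths = [4, 4, 0]`, the inner loop never fires for the tail tensor, so without the final check the pending chunks for the first two tensors would silently be dropped.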
## test plan
As I went through the code, I also added some comments explaining what's up and modified our tensor inputs to ensure that this case is tested in the test_parity test in test_foreach.py. Yes, I do realize there is quite a bit of duplication and that this file could be due for a refactor. That said, the primary goal of this PR is to fix the pretty egregious bug and refactoring can be a followup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109402
Approved by: https://github.com/albanD
Sequence numbers must be associated with a Work object
if we want to use them to report collective progress.
The API surface change introduces Work::getSequenceNumber, which
should eventually be exposed to Python.
The bulk of this change makes gloo always use the sequence number
and weaves it through the dozens of subclasses of Work.
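The idea can be illustrated in a few lines of Python (the real API is C++; these class and method names are a loose analogy):

```python
import itertools

class Work:
    """Handle for an in-flight collective, tagged with a sequence number."""
    def __init__(self, seq):
        self._seq = seq

    def get_sequence_number(self):
        # With a monotonically increasing number on every Work object,
        # callers can report how far the collectives have progressed.
        return self._seq

class ProcessGroup:
    def __init__(self):
        self._seq = itertools.count()

    def allreduce(self, tensor):
        # Every issued collective consumes the next sequence number.
        return Work(next(self._seq))
```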
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109136
Approved by: https://github.com/fduwjj
# Summary
Introduced a BC-breaking change in #109533 when self is a view of the value. By using the copy_() op inside fill_ we were hitting `assert_no_partial_overlap` in tensor iterator.
Ideally we would be able to avoid this check when value.numel() == 1. But rather than monkeying around with tensor iterator, I just clone the input instead.
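A plain-Python analogy of the overlap hazard (an illustration, not the tensor-iterator code): naively copying out of a view that aliases the destination corrupts the result, while cloning the source first is always safe.

```python
def fill_from(dest, src_view, start):
    # Naive elementwise copy, like an unchecked partially-overlapping copy_():
    # if src_view aliases dest, later reads see already-overwritten data.
    for i, v in enumerate(src_view):
        dest[start + i] = v

def fill_from_cloned(dest, src_view, start):
    # Snapshotting the source first (the "clone") breaks the aliasing.
    for i, v in enumerate(list(src_view)):
        dest[start + i] = v
```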
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109835
Approved by: https://github.com/mikaylagawarecki
Extend the metric library to allow setting global metrics on a process level, which will always be emitted.
The current use case is to include shard information every time a metric is emitted by run_test.py.
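A minimal sketch of the idea (hypothetical function names, not the actual metric-library API): module-level globals that get merged into every metric emitted for the lifetime of the process.

```python
_global_metrics = {}

def add_global_metric(name, value):
    # Set once per process, e.g. shard info at the start of run_test.py.
    _global_metrics[name] = value

def emit_metric(name, info):
    # Every emitted metric carries the global fields without each call
    # site having to pass them explicitly.
    return {"metric_name": name, **_global_metrics, **info}
```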
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110035
Approved by: https://github.com/clee2000
Fixes #109196
When we have a split reduction and the tensor is not an even multiple of the split size,
we use `ops.masked` to pad to an even multiple. In the case here we generated:
```python
tmp5 = tl.where(mask, tmp4, 0)
```
which implicitly promotes our boolean value to `int32`. The fix is to give the default
value the same dtype as `result`.
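The promotion pitfall can be shown in plain Python with a hypothetical `masked` helper (the real fix is in Inductor's Triton codegen, not this function): a bare `0` as the masked-out default mixes ints into a boolean result, while building the default from the result's type keeps it boolean.

```python
def masked(values, mask, other):
    # Keep each value where the mask holds, otherwise use the default.
    return [v if m else other for v, m in zip(values, mask)]

vals = [True, False, True]
mask = [True, True, False]

promoted = masked(vals, mask, 0)                # bare 0 -> mixed bool/int
typed = masked(vals, mask, type(vals[0])())     # default matches the "dtype"
```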
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109325
Approved by: https://github.com/lezcano
Replacing https://github.com/pytorch/pytorch/pull/109553 as it got reverted.
This PR enables training with the new 2D flow and adds an associated test. In addition, this PR moves the FSDP-specific parts of tensor/parallel/_data_parallel_utils.py back to tensor/parallel/fsdp.py to avoid a circular dependency for ddp.py and test/distributed/tensor/parallel/test_ddp_2d_parallel.py.
state_dict related changes would be in later PRs.
cc. @fegin, @fduwjj, @wanchaol, @awgu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110034
Approved by: https://github.com/fduwjj
## Context
Introduce a core decomposition for `aten.floor_divide` into other `aten` ops, and add it to the core ATen decomposition table.
This replaces the decomposition of `floor_divide` that was used by Inductor. I noticed there was a note on that decomposition
```
# TorchInductor-only decomposition. It should not be taken to core.
# See https://github.com/pytorch/torchdynamo/pull/1120
```
but couldn't discern the reason why this is the case. cc: @lezcano
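One way such a decomposition can be expressed in terms of more primitive ops is truncating division plus a sign correction (a scalar sketch of the general shape, not necessarily the exact decomposition added here):

```python
import math

def floor_divide(a, b):
    # Truncating quotient, then correct downward when the signs differ
    # and the division is inexact -- equivalent to math.floor(a / b).
    q = int(a / b)
    if q * b != a and (a < 0) != (b < 0):
        q -= 1
    return q
```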
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110046
Approved by: https://github.com/peterbell10
This flag is requested by @Chillee, who is seeing recompilations with simple gpt experiments. We are observing recompilations because the `_parameters` ordered dict keeps changing from run to run, and it's unclear why that is happening.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110039
Approved by: https://github.com/Chillee
ghstack dependencies: #110023
Summary:
Saw this issue when running PyTorch Vulkan on an LSTM model:
https://www.internalfb.com/phabricator/paste/view/P834993118
Found that we don't always do the Vulkan transfer on `at::cat`.
Test Plan:
(Not running the LSTM model yet, since there are other crashes.)
```
[yipjustin@47884.od /data/sandcastle/boxes/fbsource (3fd2308f8|remote/fbcode/warm_fbcode_od_stable...)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*cat*"
Building: finished in 0.1 sec (100%) 330/330 jobs, 0/330 updated
Total time: 0.2 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *cat*
[==========] Running 43 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 43 tests from VulkanAPITest
[ RUN ] VulkanAPITest.replication_pad2d
[ OK ] VulkanAPITest.replication_pad2d (102 ms)
[ RUN ] VulkanAPITest.cat_4d_dim0_invalidinputs_exceptions
[ OK ] VulkanAPITest.cat_4d_dim0_invalidinputs_exceptions (67 ms)
[ RUN ] VulkanAPITest.cat_4d_dim0_samebatch_success
[ OK ] VulkanAPITest.cat_4d_dim0_samebatch_success (111 ms)
[ RUN ] VulkanAPITest.cat_4d_dim0_diffbatch_success
[ OK ] VulkanAPITest.cat_4d_dim0_diffbatch_success (76 ms)
[ RUN ] VulkanAPITest.cat_4d_dim0_singledepth_success
[ OK ] VulkanAPITest.cat_4d_dim0_singledepth_success (40 ms)
[ RUN ] VulkanAPITest.cat_4d_dim0_singletensor_success
[ OK ] VulkanAPITest.cat_4d_dim0_singletensor_success (7 ms)
[ RUN ] VulkanAPITest.cat_4d_dim0_twotensors_success
[ OK ] VulkanAPITest.cat_4d_dim0_twotensors_success (30 ms)
[ RUN ] VulkanAPITest.cat_4d_dim0_negdim_success
[ OK ] VulkanAPITest.cat_4d_dim0_negdim_success (78 ms)
[ RUN ] VulkanAPITest.cat_4d_dim1_negdim_success
[ OK ] VulkanAPITest.cat_4d_dim1_negdim_success (130 ms)
[ RUN ] VulkanAPITest.cat_4d_dim2_negdim_success
[ OK ] VulkanAPITest.cat_4d_dim2_negdim_success (75 ms)
[ RUN ] VulkanAPITest.cat_4d_dim3_negdim_success
[ OK ] VulkanAPITest.cat_4d_dim3_negdim_success (68 ms)
[ RUN ] VulkanAPITest.cat_4d_dim1_texture2d_success
[ OK ] VulkanAPITest.cat_4d_dim1_texture2d_success (2 ms)
[ RUN ] VulkanAPITest.cat_4d_dim1_singledepth_success
[ OK ] VulkanAPITest.cat_4d_dim1_singledepth_success (65 ms)
[ RUN ] VulkanAPITest.cat_4d_dim1_singletensor_success
[ OK ] VulkanAPITest.cat_4d_dim1_singletensor_success (8 ms)
[ RUN ] VulkanAPITest.cat_4d_dim1_bat1_mult4ch_success
[ OK ] VulkanAPITest.cat_4d_dim1_bat1_mult4ch_success (9 ms)
[ RUN ] VulkanAPITest.cat_4d_dim1_bat2_mult4ch_success
[ OK ] VulkanAPITest.cat_4d_dim1_bat2_mult4ch_success (18 ms)
[ RUN ] VulkanAPITest.cat_4d_dim1_mult4ch_mixed_success
[ OK ] VulkanAPITest.cat_4d_dim1_mult4ch_mixed_success (60 ms)
[ RUN ] VulkanAPITest.cat_4d_dim2_sameheight_success
[ OK ] VulkanAPITest.cat_4d_dim2_sameheight_success (80 ms)
[ RUN ] VulkanAPITest.cat_4d_dim2_diffheight_success
[ OK ] VulkanAPITest.cat_4d_dim2_diffheight_success (69 ms)
[ RUN ] VulkanAPITest.cat_4d_dim2_singledepth_success
[ OK ] VulkanAPITest.cat_4d_dim2_singledepth_success (12 ms)
[ RUN ] VulkanAPITest.cat_4d_dim2_invalidinputs_exceptions
[ OK ] VulkanAPITest.cat_4d_dim2_invalidinputs_exceptions (63 ms)
[ RUN ] VulkanAPITest.cat_4d_dim3_invalidinputs_exceptions
[ OK ] VulkanAPITest.cat_4d_dim3_invalidinputs_exceptions (86 ms)
[ RUN ] VulkanAPITest.cat_4d_dim3_samewidth_success
[ OK ] VulkanAPITest.cat_4d_dim3_samewidth_success (117 ms)
[ RUN ] VulkanAPITest.cat_4d_dim3_diffwidth_success
[ OK ] VulkanAPITest.cat_4d_dim3_diffwidth_success (72 ms)
[ RUN ] VulkanAPITest.cat_3d_dim0_mult4ch_success
[ OK ] VulkanAPITest.cat_3d_dim0_mult4ch_success (12 ms)
[ RUN ] VulkanAPITest.cat_3d_dim0_diff_channel_success
[ OK ] VulkanAPITest.cat_3d_dim0_diff_channel_success (28 ms)
[ RUN ] VulkanAPITest.cat_3d_dim0_same_channel_success
[ OK ] VulkanAPITest.cat_3d_dim0_same_channel_success (15 ms)
[ RUN ] VulkanAPITest.cat_3d_dim1_diffheight_success
[ OK ] VulkanAPITest.cat_3d_dim1_diffheight_success (21 ms)
[ RUN ] VulkanAPITest.cat_3d_dim1_same_height_success
[ OK ] VulkanAPITest.cat_3d_dim1_same_height_success (10 ms)
[ RUN ] VulkanAPITest.cat_3d_dim2_diffwidth_success
[ OK ] VulkanAPITest.cat_3d_dim2_diffwidth_success (21 ms)
[ RUN ] VulkanAPITest.cat_3d_dim2_samewidth_success
[ OK ] VulkanAPITest.cat_3d_dim2_samewidth_success (11 ms)
[ RUN ] VulkanAPITest.cat_3d_dim0_negdim_success
[ OK ] VulkanAPITest.cat_3d_dim0_negdim_success (25 ms)
[ RUN ] VulkanAPITest.cat_3d_dim1_negdim_success
[ OK ] VulkanAPITest.cat_3d_dim1_negdim_success (23 ms)
[ RUN ] VulkanAPITest.cat_3d_dim2_negdim_success
[ OK ] VulkanAPITest.cat_3d_dim2_negdim_success (10 ms)
[ RUN ] VulkanAPITest.cat_2d_dim0_same_height_success
[ OK ] VulkanAPITest.cat_2d_dim0_same_height_success (3 ms)
[ RUN ] VulkanAPITest.cat_2d_dim0_diff_height_success
[ OK ] VulkanAPITest.cat_2d_dim0_diff_height_success (2 ms)
[ RUN ] VulkanAPITest.cat_2d_dim1_same_width_success
[ OK ] VulkanAPITest.cat_2d_dim1_same_width_success (3 ms)
[ RUN ] VulkanAPITest.cat_2d_dim1_diff_width_success
[ OK ] VulkanAPITest.cat_2d_dim1_diff_width_success (4 ms)
[ RUN ] VulkanAPITest.cat_2d_dim0_negdim_success
[ OK ] VulkanAPITest.cat_2d_dim0_negdim_success (3 ms)
[ RUN ] VulkanAPITest.cat_2d_dim1_negdim_success
[ OK ] VulkanAPITest.cat_2d_dim1_negdim_success (3 ms)
[ RUN ] VulkanAPITest.cat_1d_dim0_same_width_success
[ OK ] VulkanAPITest.cat_1d_dim0_same_width_success (52 ms)
[ RUN ] VulkanAPITest.cat_1d_dim0_diff_width_success
[ OK ] VulkanAPITest.cat_1d_dim0_diff_width_success (0 ms)
[ RUN ] VulkanAPITest.cat_1d_dim0_negdim_success
[ OK ] VulkanAPITest.cat_1d_dim0_negdim_success (0 ms)
[----------] 43 tests from VulkanAPITest (1717 ms total)
[----------] Global test environment tear-down
[==========] 43 tests from 1 test suite ran. (1717 ms total)
[ PASSED ] 43 tests.
YOU HAVE 4 DISABLED TESTS
```
Differential Revision: D49566743
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109936
Approved by: https://github.com/SS-JIA
For tests that TD prioritizes, we should track what their ordering _would have been_ if none of the TD heuristics had applied to it.
This is useful for two reasons:
1. It lets us better understand how TD may have contributed to that test running sooner.
2. It's possible that heuristics actually mark a test as less important than the default sorting would have claimed (the default sorts tests in a fixed order). This will let us track how often that happens.
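The bookkeeping can be sketched as follows (a hypothetical helper, not the actual target-determination code): for each test in the TD-prioritized order, record the position it would have had under the fixed default ordering.

```python
def with_default_positions(default_order, prioritized_order):
    """Pair each prioritized test with its would-have-been default rank."""
    default_pos = {name: i for i, name in enumerate(default_order)}
    return [(name, default_pos[name]) for name in prioritized_order]
```

A test that TD moved ahead of its default rank was prioritized; one whose default rank is lower than its new position was effectively demoted by the heuristics.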
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110031
Approved by: https://github.com/clee2000