This PR represents `torch.cfloat`/`torch.chalf` as the `float2`/`half2` Metal types and modifies `SCATTER_OPS_TEMPLATE`/`GATHER_OPS_TEMPLATE` to accept a third argument, a fully specialized `cast` function, which is a no-op for regular types but special-cased for float->complex and complex->float conversions.
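For reference, a minimal usage sketch of what this enables, assuming an MPS-capable machine; the shapes and values are illustrative only:
```python
import torch

if torch.backends.mps.is_available():
    src = torch.tensor([1 + 1j, 2 + 2j, 3 + 3j, 4 + 4j], dtype=torch.cfloat, device="mps")
    dst = torch.zeros(4, dtype=torch.cfloat, device="mps")
    idx = torch.tensor([3, 2, 1, 0], device="mps")
    # gather/scatter on complex dtypes now go through the updated Metal templates
    dst.scatter_(0, idx, src)
    print(dst.gather(0, idx))
```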
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115727
Approved by: https://github.com/kulinseth
This PR fixes two bugs:
1) Constant folding a Triton kernel results in the kernel's inputs being returned back without any modification, so constant folding is disabled for Triton kernels. This needs more investigation.
2) NoneLayout buffers should not be deleted as they do not exist
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115908
Approved by: https://github.com/aakhundov, https://github.com/jansel
Make `SkipFilesVariable` handle only function types, and route skipped classes to `UserDefinedClassVariable`. The reasons behind this are:
* We'd like to remove `is_allowed`, so the allowed/disallowed torch classes need a proper place to be handled. Under the current architecture they could go in either `SkipFilesVariable` or `UserDefinedClassVariable`, but it's confusing to have two places doing one thing.
  - Going forward, `SkipFilesVariable` will only handle functions, and it will probably be renamed to `SkippedFunctionVariable` in follow-up PRs.
  - Dispatch will be done by the value's type, so all torch class handling will move to `UserDefinedClassVariable` in the next PR.
* We'd like to merge the in-graph/skip/inline tracing decisions into a single API, `trace_rules.lookup`, so we probably have to limit its input to functions to better organize the `VariableBuilder._wrap` logic.
  - As a next step, `skipfiles.check` will be merged into `trace_rules.lookup`, and the skipfile check will happen before wrapping values into the correct variable tracker.
  - Although `TorchCtxManagerClassVariable` is currently decided by `trace_rules.lookup`, it will be refactored out in follow-up PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115963
Approved by: https://github.com/jansel
Summary:
This change makes the backward pass of `DTensor.from_local()` treat a `Partial()` target placement as a pass-through (the gradient stays `Replicate()`), for the following reasons (see the sketch below):
1. When we run the backward pass of `DTensor.from_local`, if the target placement is `Partial()` (i.e. from a user's manual overwrite rather than from torch_dispatch), we keep the grad as `Replicate()`, because converting the gradients back to `Partial()` is meaningless.
2. The current div logic leads to wrong numerical values in the above case.
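As a rough illustration of the behavior described above, here is a minimal sketch. It assumes a 2-GPU environment with `torch.distributed` already initialized, and that `Partial`/`Replicate` are importable from `torch.distributed._tensor` (the names used in this PR); exact import paths may differ across versions.
```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import DTensor, Partial, Replicate  # assumed import path

mesh = init_device_mesh("cuda", (2,))
local = torch.randn(4, 4, device="cuda", requires_grad=True)

# Forward: the user manually marks the local shard as a pending (Partial) sum.
dt = DTensor.from_local(local, mesh, [Partial()])
out = dt.redistribute(mesh, [Replicate()]).to_local()
out.sum().backward()

# Backward: with this change the gradient flowing through from_local is kept as
# Replicate() (pass-through) instead of being divided to re-create a Partial(),
# which previously produced wrong numerical values.
print(local.grad)
```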
Test Plan:
**CI**:
CI Tests
**Unit test**:
`buck2 test mode/dev-nosan //caffe2/test/distributed/_tensor:redistribute`
- Passed
**With model training**:
```
# We tested the case where the input tensor is manually overwritten as Partial() and
# the output tensor is manually overwritten to Shard() and then converted to local.
# Before the change: numerical value not correct
Forward pass:
collective: ReduceScatter
Backward pass:
collective: AllGather + div by process group size
# After the change: div is removed as expected.
Forward pass:
collective: ReduceScatter
Backward pass:
collective: AllGather
```
Differential Revision: D52175709
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115967
Approved by: https://github.com/wanchaol
Motivation: it would be nice to be able to write tests against the metrics in log_compilation_event; currently it dumps logs (or logs to a database in fbcode), which are hard to use in unit tests.
This change:
* always records the information in torch._dynamo.utils.record_compilation_metrics, which logs into a limited-size deque to prevent the list of metrics from getting too long (a sketch follows below)
* if config.log_compilation_metrics is set, calls back into the original log_compilation_event function
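A minimal sketch of the recording scheme described above; the names `_compilation_metrics`, the `maxlen` value, and the config flag stand-in are illustrative assumptions, not the exact identifiers in torch._dynamo.utils.
```python
import collections

LOG_COMPILATION_METRICS = True   # stands in for config.log_compilation_metrics
_MAX_METRICS = 64
_compilation_metrics = collections.deque(maxlen=_MAX_METRICS)

def log_compilation_event(metrics):
    # Stand-in for the original logger, which dumps to logs (or a database in fbcode).
    print(metrics)

def record_compilation_metrics(metrics):
    # Always keep the most recent entries in memory so unit tests can inspect them;
    # the bounded deque prevents the list from growing without limit.
    _compilation_metrics.append(metrics)
    if LOG_COMPILATION_METRICS:
        log_compilation_event(metrics)
```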
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115788
Approved by: https://github.com/yanboliang
## summary
`zip(inputs, self.input_layouts, self.desired_input_layouts)` is used in `_prepare_input_fn`, and similarly in `_prepare_output_fn`. Without an assertion, any unmatched dimension in inputs/outputs is silently dropped by `zip`, potentially causing unexpected behaviors (see the sketch below).
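A minimal sketch of the length check described above; the attribute names follow the summary and the surrounding class is assumed, not shown.
```python
def _prepare_input_fn(self, inputs, device_mesh):
    # zip() silently drops trailing items when lengths differ, so validate first.
    assert len(inputs) == len(self.input_layouts) == len(self.desired_input_layouts), (
        "module inputs, input_layouts and desired_input_layouts should have the same length, "
        f"got {len(inputs)}, {len(self.input_layouts)}, {len(self.desired_input_layouts)}"
    )
    for inp, layout, desired in zip(inputs, self.input_layouts, self.desired_input_layouts):
        ...  # redistribute each input from `layout` to `desired`
```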
## test plan
`python test/distributed/tensor/parallel/test_tp_style.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115957
Approved by: https://github.com/wanchaol
Summary:
c2_protobuf_rule ([here](https://fburl.com/code/iyiulpmv)) is broken on buck2, ultimately due to the following error:
> .\./caffe2.proto: File does not reside within any path specified using --proto_path (or -I). You must specify a --proto_path which encompasses this file. Note that the proto_path must be an exact prefix of the .proto file names -- protoc is too dumb to figure out when two paths (e.g. absolute and relative) are equivalent (it's harder than you think).
The root cause is differences in how buck1 and buck2 handle `%SRCDIR%` (absolute versus relative paths). This diff fixes the build.
Test Plan:
# Before
```
buck2 build arvr/mode/win/opt //xplat/caffe2:caffe2.pb.h
```
```
More details at https://www.internalfb.com/intern/buck/build/c6550454-ae6d-479e-9d08-016e544ef050
BUILD SUCCEEDED
```
```
Action failed: fbsource//xplat/caffe2:caffe2.pb.h (genrule)
Remote command returned non-zero exit code <no exit code>
Reproduce locally: frecli cas download-action 5df17cf64b7e2fc5ab090c91e1129f2f3cad36dc72c7c182ab052af23d3f32aa:145
stdout:
stderr:
OUTMISS: Missing outputs: buck-out/v2/gen/fbsource/dd87aacb8683145b/xplat/caffe2/caffe2.pb.h/out/caffe2.pb.h
```
# After
Buck1 still works
```
buck1 build arvr/mode/win/opt //xplat/caffe2:caffe2.pb.h
```
Buck2 works
```
buck2 build arvr/mode/win/opt //xplat/caffe2:caffe2.pb.h
```
```
Buck UI: https://www.internalfb.com/buck2/e5dae607-325a-4eab-b0c9-66fe4e9a6254
BUILD SUCCEEDED
```
Differential Revision: D52218365
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115954
Approved by: https://github.com/mcr229
Some typos resulted in the note section not being rendered properly; this couldn't be seen from the last PR directly, as the last PR only showed the first commit's documentation :(
Also makes the `parallelize_module` doc example more concrete (a usage sketch follows below).
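For context, here is a small usage sketch in the spirit of that doc example; the module, mesh size, and plan keys are illustrative assumptions rather than the exact example in the docs.
```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel, parallelize_module

mesh = init_device_mesh("cuda", (8,))
mlp = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))
# Shard the first linear column-wise and the second row-wise across the 8 ranks.
tp_mlp = parallelize_module(mlp, mesh, {"0": ColwiseParallel(), "2": RowwiseParallel()})
```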
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115974
Approved by: https://github.com/wz337
The PyTorch build breaks when building from tip on ppc64le with the following error:
```
pytorch/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp:863:46: error: no matching function for call to 'at::vec::DEFAULT::Vectorized<c10::qint8>::dequantize(at::vec::DEFAULT::Vectorized&, at::vec::DEFAULT::Vectorized&)'
```
This patch fixes the build issue reported in #115165.
Fixes #115165
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115729
Approved by: https://github.com/albanD
Summary: This is useful for comparing the Triton kernels generated by two different invocations of torch.compile on the same model (e.g., checking whether serial compile and parallel compile generate identical Triton kernels).
Test Plan:
Unit test:
buck2 test mode/opt //caffe2/torch/fb/module_factory/sync_sgd/tests:test_torchdynamo_wrapper -- --print-passing-details >& ~/tmp/log.test
PyPer Mast job:
https://www.internalfb.com/mast/job/sw-951074659-OfflineTraining_87587a4e
See the *.py files generated in:
pyper_traces/tree/torchinductor_traces/sw-951074659-OfflineTraining_87587a4e/4623
Differential Revision: D52221500
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115979
Approved by: https://github.com/yanboliang
Summary: During inference, the intermediate graphs produced for optimization are not used, so the executor's graph is the only graph we need to keep around; these two flags let the other graphs be released.
Test Plan:
The flags are all off by default.
baseline
```
buck run mode/opt-clang sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true
I1212 10:24:20.407408 401092 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 182863 Kb
```
```
buck run mode/opt-clang sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true --torch_jit_release_profiling_graph_after_optimization=true
I1212 10:31:37.663487 464000 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 186127 Kb
```
```
buck run mode/opt-clang sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true --torch_jit_release_profiling_graph_after_optimization=true --torch_jit_execution_plan_avoid_extra_graph_copy=true
I1212 10:29:42.848093 447218 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 129451 Kb
```
Differential Revision: D52081631
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115657
Approved by: https://github.com/houseroad
As titled: when using SAC + torch.compile, the check currently only looks for functional tensors, not for any tensor subclasses, so SAC under torch.compile would ignore tensor types like tensor subclasses. Fixed in this PR (a sketch of the broadened check follows below).
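A minimal sketch of the broadened check; the helper name and call site are assumptions about the checkpoint internals rather than the exact code in this PR.
```python
import torch
from torch._subclasses.functional_tensor import FunctionalTensor
from torch.utils._python_dispatch import is_traceable_wrapper_subclass

def _policy_should_see(t: torch.Tensor) -> bool:
    # Previously only the functional-tensor wrapper used under torch.compile was
    # recognized; other tensor subclasses were silently ignored by the SAC policy.
    return isinstance(t, FunctionalTensor) or is_traceable_wrapper_subclass(t)
```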
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115960
Approved by: https://github.com/bdhirsh
Summary:
Refactor the inactive constant buffer update to allow updating with the active buffer.
Test Plan:
Existing tests cover inactive buffer updates; UpdateConstantsCuda in the cpp test covers active buffer updates.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116001
Approved by: https://github.com/chenyang78
Summary:
This change makes the input tensor contiguous for DTensor reduce-scatter in the case where no padding is needed.
No exception is thrown during training, but without this change we ran into a numerical correctness issue (a sketch of the idea follows below).
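A minimal sketch of the idea, assuming a plain process-group reduce-scatter call; the function name and call path are illustrative, not the actual DTensor internals.
```python
import torch
import torch.distributed as dist

def reduce_scatter_no_padding(local_tensor, group):
    # Collectives expect a contiguous buffer; a non-contiguous view (e.g. produced
    # by an upstream transpose) can yield numerically wrong results without erroring.
    if not local_tensor.is_contiguous():
        local_tensor = local_tensor.contiguous()
    out = local_tensor.new_empty(local_tensor.shape[0] // dist.get_world_size(group),
                                 *local_tensor.shape[1:])
    dist.reduce_scatter_tensor(out, local_tensor, group=group)
    return out
```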
Test Plan:
**CI**
CI test
**WHEN model test**:
- Verified loss for each iteration within the expected range.
- Verified NE on-par with this change with 4B training data.
Differential Revision: D52170822
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115847
Approved by: https://github.com/wanchaol
- Add call to `at::globalContext().userEnabledMkldnn()` to `apply_mkldnn_matmul_heur`
- Surround calls to `mkldnn_matmul` with `try {} catch {}`
- Print a warning and fall back to BLAS (by calling `at::globalContext().setUserEnabledMkldnn(false)`) if `mkldnn_matmul()` fails
Test plan: On Linux arm run:
```shell
$ sudo chmod 400 /sys; python -c "import torch;m=torch.nn.Linear(1, 32);print(torch.__version__);print(m(torch.rand(32, 1)))"
Error in cpuinfo: failed to parse the list of possible processors in /sys/devices/system/cpu/possible
Error in cpuinfo: failed to parse the list of present processors in /sys/devices/system/cpu/present
Error in cpuinfo: failed to parse both lists of possible and present processors
2.3.0.dev20231215
bad err=11 in Xbyak::Error
bad err=11 in Xbyak::Error
/home/ubuntu/miniconda3/envs/py311/lib/python3.11/site-packages/torch/nn/modules/linear.py:116: UserWarning: mkldnn_matmul failed, switching to BLAS gemm:internal error (Triggered internally at /pytorch/aten/src/ATen/native/LinearAlgebra.cpp:1509.)
return F.linear(input, self.weight, self.bias)
tensor([[-0.5183, 0.2279, -0.4035, ..., -0.3446, 0.0938, -0.2113],
[-0.5111, 0.2362, -0.3821, ..., -0.3536, 0.1011, -0.2159],
[-0.6387, 0.0894, -0.7619, ..., -0.1939, -0.0282, -0.1344],
...,
[-0.6352, 0.0934, -0.7516, ..., -0.1983, -0.0247, -0.1366],
[-0.4790, 0.2733, -0.2862, ..., -0.3939, 0.1338, -0.2365],
[-0.5702, 0.1682, -0.5580, ..., -0.2796, 0.0412, -0.1782]],
grad_fn=<AddmmBackward0>)
```
Fixes https://github.com/pytorch/pytorch/issues/114750
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115936
Approved by: https://github.com/lezcano