This PR makes libtorch behave the same as PyTorch when loading optimizer state from archive. With PyTorch, options of parameter groups are loaded from the archive, which is missing currently in libtorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125215
Approved by: https://github.com/janeyx99
Summary:
We found that some dumps are missing when monitoring thread timeout.
This is likely due to multiple PGs could still dump the same records
at the same time. So we should allow only PG0 to actualy dump
Test Plan:
unit test
python test/run_test.py --cpp --verbose -i cpp/ProcessGroupNCCLErrorsTest
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125356
Approved by: https://github.com/c-p-i-o
This PR continues to clean clang-tidy warnings in torch/csrc/distributed/c10d, following #124701. In addition, libfmt dependency is added in CMake code to enable using it in the headers. The libfmt has to be added as private dependency to torch_cuda and torch_hip because they include torch/csrc/distributed/c10d/Utils.hpp which uses libfmt.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124987
Approved by: https://github.com/malfet
Summary: Now that we have reached nanosecond granularity, we can now remove the temporary guards that were previously required for nanosecond precision.
Test Plan: Regression should cover this change
Reviewed By: aaronenyeshi
Differential Revision: D56444570
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124734
Approved by: https://github.com/aaronenyeshi
Summary: In AOTInductor generated CPU model code, there can be direct references to some aten/c10 utility functions and data structures, e.g. at::vec and c10::Half. These are performance critical and thus it doesn't make sense to create C shim for them. Instead, we make sure they are implemented in a header-only way, and use this set of tests to guard future changes.
There are more header files to be updated, but we will do it in other followup PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123848
Approved by: https://github.com/jansel
ghstack dependencies: #123847
Summary:
Kineto traces use microsecond level granularity because of chrome tracing defaults to that precision. Fix by adding preprocessor flag to TARGETS and BUCK files. Also remove any unnecessary ns to us conversions made in the profiler itself.
This diff contains profiler changes only. Libkineto changes found in D54964435.
Test Plan:
Check JSON and chrome tracing to make sure values are as expected. Tracing with flags enabled should have ns precision. Tracings without flags should be same as master.
Zoomer: https://www.internalfb.com/intern/zoomer/?profiling_run_fbid=796886748550189
Ran key_averages() to make sure FunctionEvent code working as expected:
-- ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
ProfilerStep* 0.74% 3.976ms 64.40% 346.613ms 69.323ms 0.000us 0.00% 61.710ms 12.342ms 5
Optimizer.zero_grad#SGD.zero_grad 0.76% 4.109ms 0.76% 4.109ms 821.743us 0.000us 0.00% 0.000us 0.000us 5
## forward ## 6.89% 37.057ms 27.19% 146.320ms 29.264ms 0.000us 0.00% 58.708ms 11.742ms 5
aten::conv2d 0.22% 1.176ms 7.74% 41.658ms 157.199us 0.000us 0.00% 27.550ms 103.962us 265
aten::convolution 0.79% 4.273ms 7.52% 40.482ms 152.762us 0.000us 0.00% 27.550ms 103.962us 265
aten::_convolution 0.69% 3.688ms 6.73% 36.209ms 136.637us 0.000us 0.00% 27.550ms 103.962us 265
aten::cudnn_convolution 6.04% 32.520ms 6.04% 32.520ms 122.719us 27.550ms 8.44% 27.550ms 103.962us 265
aten::add_ 2.42% 13.045ms 2.42% 13.045ms 30.694us 12.700ms 3.89% 12.700ms 29.882us 425
aten::batch_norm 0.19% 1.027ms 8.12% 43.717ms 164.971us 0.000us 0.00% 16.744ms 63.185us 265
aten::_batch_norm_impl_index 0.31% 1.646ms 7.93% 42.691ms 161.096us 0.000us 0.00% 16.744ms 63.185us 265
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Differential Revision: D55925068
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123650
Approved by: https://github.com/aaronenyeshi
Summary:
Kineto traces use microsecond level granularity because of chrome tracing defaults to that precision. Fix by adding preprocessor flag to TARGETS and BUCK files. Also remove any unnecessary ns to us conversions made in the profiler itself.
This diff contains profiler changes only. Libkineto changes found in D54964435.
Test Plan:
Check JSON and chrome tracing to make sure values are as expected. Tracing with flags enabled should have ns precision. Tracings without flags should be same as master.
Tracing with flags enabled: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_37_22.4155151.pt.trace.json.gz&bucket=gpu_traces
Tracing without flags enabled: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_39_15.4166047.pt.trace.json.gz&bucket=gpu_traces
Tracing on main: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_42_43.4177559.pt.trace.json.gz&bucket=gpu_traces
Ran key_averages() to make sure FunctionEvent code working as expected:
-- ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
ProfilerStep* 0.74% 3.976ms 64.40% 346.613ms 69.323ms 0.000us 0.00% 61.710ms 12.342ms 5
Optimizer.zero_grad#SGD.zero_grad 0.76% 4.109ms 0.76% 4.109ms 821.743us 0.000us 0.00% 0.000us 0.000us 5
## forward ## 6.89% 37.057ms 27.19% 146.320ms 29.264ms 0.000us 0.00% 58.708ms 11.742ms 5
aten::conv2d 0.22% 1.176ms 7.74% 41.658ms 157.199us 0.000us 0.00% 27.550ms 103.962us 265
aten::convolution 0.79% 4.273ms 7.52% 40.482ms 152.762us 0.000us 0.00% 27.550ms 103.962us 265
aten::_convolution 0.69% 3.688ms 6.73% 36.209ms 136.637us 0.000us 0.00% 27.550ms 103.962us 265
aten::cudnn_convolution 6.04% 32.520ms 6.04% 32.520ms 122.719us 27.550ms 8.44% 27.550ms 103.962us 265
aten::add_ 2.42% 13.045ms 2.42% 13.045ms 30.694us 12.700ms 3.89% 12.700ms 29.882us 425
aten::batch_norm 0.19% 1.027ms 8.12% 43.717ms 164.971us 0.000us 0.00% 16.744ms 63.185us 265
aten::_batch_norm_impl_index 0.31% 1.646ms 7.93% 42.691ms 161.096us 0.000us 0.00% 16.744ms 63.185us 265
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Differential Revision: D55087993
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122425
Approved by: https://github.com/aaronenyeshi
This PR updates the error message in autograd when an input tensor does not set to `require_grad`. The original message does not contain the index info, making users hard to debug.
The error message style consists with that on line 105-109.
Co-authored-by: Jeffrey Wan <soulitzer@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123154
Approved by: https://github.com/soulitzer
This PR only adds abstract class registration logic without touching existing tests so they still trace with real script object. The added tests are only for registration APIs and test error messages.
Our design is that the abstract implementation should be in Python. This is much better in terms of usability. But this also has implications for custom op that takes script object as input, which is detailed later in this stack.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122622
Approved by: https://github.com/zou3519
ghstack dependencies: #122619, #122620, #122621
Summary:
During tracing, some constants (tensor_constant{idx}) are being generated internally.
Those constants are neither parameters or buffers, and users have zero control on them.
To accomodate this, we should allow users not passing in those constants generated internally but still be able the constants in the model.
Test Plan:
Included in commit.
```
build/bin/test_aot_inductor
```
Reviewed By: zoranzhao
Differential Revision: D55354548
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122690
Approved by: https://github.com/khabinov
Summary:
During tracing, some constants (tensor_constant{idx}) are being generated internally.
Those constants are neither parameters or buffers, and users have zero control on them.
To accomodate this, we should allow users not passing in those constants generated internally but still be able the constants in the model.
Test Plan:
Included in commit.
```
build/bin/test_aot_inductor
```
Differential Revision: D55286634
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122562
Approved by: https://github.com/chenyang78, https://github.com/khabinov
Fix and test issues with both coalesced and individual send/recv ops
Considered an alternate approach and then ditched it
- alternate approach: #119757
- reason ditched: prefer recording individual collective events inside
coalescing region instead of just the event at the end of the region,
which also would not have tensor sizes or opnames without additional
state variables added
Another approach also ditched
- record events on workEnqueue instead of initWork
- reason ditched: too messy to get input/output shapes tagged on
recording when recording in workEnqueue. Adding the info onto the
Work obj would be possible, but adds to overhead of copying Works
which we do on every collective. We can get info off the input/output
tensors directly in initWork, but we don't want to keep refs to those
tensors alive while the work is Enqueued, so we'd have to specifically
copy size lists or something.
This PR instead avoids creating a work inside pointToPoint when
coalescing is active. Instead, only at endCoalescing() is a work finally
intialized and enqueued. But it adds a record() call inside
pointToPoint() instead of creating a work, during coalescing. This
record() call picks up tensor shapes and op names.
It ALSO changes initWork to accept a 'record' argument. This defaults to
false, and should only be set to true if the caller ensures the work
will be enqueued by workEnqueue, ensuring its cuda events are live when
used by flight recorder's update_state().
The testing uncovers some odd pre-existing behavior and leaves them
alone for now. We could change some of these
- seq starts off at 1, not 0 for first op (but this is inconistent)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120270
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #120724
Summary:
The current dump timeout logic is a bit cumbersome as it needs 2 times: 1.
timeout, 2. wake up time. And in theory the caller just needs to wait
for a max of timeout value for the dump and declare the dump to be
either successful or not. Also we unify the async call using std::async
instead of a customized async lauch function for each operation.
Test Plan:
Unit tests
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120331
Approved by: https://github.com/wconstab
Do not run test ConstantPropagation.CustomClassesCanBePropagated on a platform where QNNPACK is not supported.
For example, this test fails on M1 Mac because QNNPACK is not supported on M1 Mac:
[----------] 1 test from ConstantPropagation
[ RUN ] ConstantPropagation.CustomClassesCanBePropagated
unknown file: Failure
as described in more details in the issue #88613.
After the PR, test passes successfully as below:
[----------] 1 test from ConstantPropagation
[ RUN ] ConstantPropagation.CustomClassesCanBePropagated
[ OK ] ConstantPropagation.CustomClassesCanBePropagated (0 ms)
[----------] 1 test from ConstantPropagation (0 ms total)
Fixes#88613
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119139
Approved by: https://github.com/jcaip
Recently we made it possible to serialize ExportedPrograms with fake parameters/buffers/etc.
The serialization regime was kind of whacky; basically we serialized a stub and reassembled the FakeTensor using metadata that we had stashed elsewhere in the Graph state.
This was bad for a few reasons:
- Storing the metadata separately from the actual serialized object caused situations where you could have one but not the other. An example case is if you had a FakeTensor contained inside a TorchBind object—there was no obviously place to store the metadata for this. This actually happens—TensorQueue in fbgemm does this.
- It created an annoying cycle: we had to deserialize the Graph's tensor metadata in order to deserialize (potentially faked) constants, but we need constants in order to deserialize the Graph.
This fixes all that. The basic idea is to patch the reducer function for FakeTensor at serialization time, and serialize a copy of the FakeTensor metadata. We already are policing BC for the TensorMeta schema struct so it's not a net increase in the BC surface.
As a bonus, I fixed a weird bug with torchbind tracing where we were accidentally reinterpreting a torch.ScriptObject as a torch.ScriptModule (which was the root cause of some weird behavior @bahuang was seeing last week).
Differential Revision: [D53601251](https://our.internmc.facebook.com/intern/diff/D53601251/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119531
Approved by: https://github.com/zhxchen17
Summary:
Add Runtime Constant-folding for AOTInductor.
This also include the invocation of constant folding at load time.
The constant folding lowering is a 2-step process.
First, we split the graph into 2 modules, one of it is the constant module, which doesn't depend on any input and the whole module could be inferred (constant-folded) one-time and be reused. The constant module, is lowered, and being codegen-ed as usual and cached (let's call this constant code). The constant code reuses the whole lowering/profiling/etc. process, only difference is that we do not generate any headers or initialization for the constant code.
Second, after handling the constant module, we take care of the main module (which is the part that would depend on the user input.) For the main module, we take in one additional component, the constant code, compare with a normal lowering. Addition step we do here is that, we inject the constant code into the codegen-ed main module, and create the caller for the main module to consume the result of the constant module.
Test Plan: Unit tests included in commit.
Differential Revision: D53274382
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118765
Approved by: https://github.com/chenyang78
Summary: `constraints` argument for `torch.export` has been deprecated in favor of the `dynamic_shapes` argument. This PR updates the use of the deprecated API in `caffe2/test/cpp` and `torchrec/distributed/test/test_pt2`.
Test Plan: CI
Differential Revision: D52977354
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118026
Approved by: https://github.com/chenyang78
This PR adds the bare minimum functionality to get torchbind working in an e2e testable way on PT2.
It implements:
* ProxyTensor support
* Simple torch.export support (proxytensor-only path, e.g. non-strict).
* add some tests exercising the path.
Because all this is not fully baked, I hide the functionality behind a feature flag (`enable_torchbind_tracing()`) so it does not affect regular users for now.
Still on the agenda:
* Dynamo support
* Actual FakeMode support
* Mutability support
Hoping to get this first bit in as a standalone, as it will unblock some more extensive experimentation/testing going on internally.
Differential Revision: [D51825372](https://our.internmc.facebook.com/intern/diff/D51825372/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117697
Approved by: https://github.com/SherlockNoMad
Today watchdog's sleep interval is 1s. That's a bit long compared to modern GPU link's (or network link's) speed.
Take DDP and Ampere for example:
DDP's bucket size = 25 MB
Ampere's NVLink speed = 250 GB/s
25 MB / 250 GB/s = 100 ms.
So we are updating the interval to 100 ms.
Update:
25 MB / 250 GB/s = 0.1 ms
But let's see how it goes so far between making the checking more aggressive.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117297
Approved by: https://github.com/fduwjj
Previously, we have the writer register to each NCCL PG(backend), so for every pg, we have a NCCL PG instance, so if we use some customized writer when multiple sub-PGs are used, we need to ensure user to register the writer for every backend which indicates a bad UX. Furthermore, the debug info is global, so it does not make sense to have the writer for each instance. We even have a static mutex in the `dumpDebuggingInfo` to ensure we serialize the write, that makes it more obvious that we can make the writer a singleton so that we only have one writer instance for all PG instances.
Although the rationale is clear, the implementation may vary a lot. So this PR is RFC for now to see if this implementation makes sense or not.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116489
Approved by: https://github.com/kwen2501
Summary:
Refactor update inactive constant buffer to allow updating with active
buffer.
Test Plan:
Existing test to test inactive buffer updates.
UpdateConstantsCuda in cpp test for active buffer updates.
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116001
Approved by: https://github.com/chenyang78