Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45318
When calling `then()` from WorkNCCL, record the input data pointers in `futureNCCLCallbackStream_` before executing the input callback.
Note that the recording cannot be added directly to the lambda used by addCallback in ProcessGroupNCCL.hpp. This is because the future value in that context has type pyobject rather than TensorList, and casting between the two would require pybind, introducing a Python dependency that should not be allowed in the c10d library.
I considered creating a util function in a separate file to support this type casting and placing it under the torch/csrc directory, where a Python dependency is allowed. However, torch/csrc depends on c10d, so this would create a circular dependency.
Finally, a `record_stream_cb_` member is added to FutureNCCL, with a default value of nullptr. A default `record_stream_cb_` implementation is added to `PythonFutureWrapper`, where a Python dependency is allowed.
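A minimal sketch of the shape this takes (the member name comes from the summary above; the callback signature and struct layout are assumptions, not copied from the diff):
```
#include <functional>
#include <ATen/core/ivalue.h>

// Hedged sketch: FutureNCCL carries an optional hook that, when set, records
// the future value's tensors on the callback stream before a then() callback
// runs. The c10d library leaves it nullptr; PythonFutureWrapper (where a
// Python dependency is allowed) installs an implementation that can unpack
// pyobject values. The std::function signature here is an assumption.
using RecordStreamCallback = std::function<void(const c10::IValue& value)>;

struct FutureNCCL {
  RecordStreamCallback record_stream_cb_ = nullptr;  // default: no recording
};
```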
In addition, a few lines are reformatted by lint.
caffe2/torch/csrc/distributed/c10d/init.cpp is only reformatted.
Closes: https://github.com/pytorch/pytorch/issues/44203
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- ProcessGroupNCCLTest
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_accumulate_gradients_no_sync_allreduce_with_then_hook
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_with_then_hook_nccl
Reviewed By: pritamdamania87
Differential Revision: D23910257
fbshipit-source-id: 66920746c41f3a27a3689f22e2a2d9709d0faa15
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46501
Gradients in this method will not be modified.
ghstack-source-id: 114851646
Test Plan: waitforbuildbot
Reviewed By: pritamdamania87
Differential Revision: D24374300
fbshipit-source-id: a2941891008f9f197a5234b50260218932d2d37d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46601
* except excluded tests and magic methods.
https://github.com/pytorch/pytorch/issues/38731
Previously, we'd only run these tests for in-place operations. Since this runs many more tests, the following issues that came up while running them were fixed:
- Updated schema of conj() to reflect existing behaviour.
- Updated deepEquals method in check_alias_annotation.cpp to reuse the overloaded `==` operator; the previous implementation did not cover all types of IValues.
- Corrected the order in which inputs are passed during autograd testing of 'view' & 'reshape'.
- Substituted aten::ger with the function it is aliased to, aten::outer, for testing, since the alias annotation checking code doesn't handle aliased operators properly.
ghstack-source-id: 114830903
Test Plan: Ran all tests in test:jit and verified they pass.
Reviewed By: eellison
Differential Revision: D24424955
fbshipit-source-id: 382d7e2585911b81b1573f21fff1d54a5e9a2054
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46356
Adding the flag `-Werror=cast-function-type` to ensure we don't allow
any invalid function-type casts (e.g., invalid PyCFunction casts).
For more details see: https://github.com/pytorch/pytorch/issues/45419
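As an illustration, a minimal sketch of the kind of cast this flag rejects and the accepted workaround (the method name here is hypothetical):
```
#include <Python.h>

// A METH_VARARGS | METH_KEYWORDS handler takes 3 arguments, so a direct cast
// to the 2-argument PyCFunction type trips -Werror=cast-function-type.
static PyObject* my_method(PyObject* self, PyObject* args, PyObject* kwargs) {
  Py_RETURN_NONE;
}

static PyMethodDef methods[] = {
    // error: cast between incompatible function types
    // {"my_method", (PyCFunction)my_method, METH_VARARGS | METH_KEYWORDS, nullptr},
    {"my_method",
     (PyCFunction)(void (*)(void))my_method,  // accepted: void(*)(void) matches everything
     METH_VARARGS | METH_KEYWORDS,
     nullptr},
    {nullptr, nullptr, 0, nullptr}};
```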
ghstack-source-id: 114632980
Test Plan: waitforbuildbot
Reviewed By: albanD
Differential Revision: D24319759
fbshipit-source-id: 26ce4650c220e8e9dd3550245f214c7e6c21a5dc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46227
Follow up from https://github.com/pytorch/pytorch/issues/45419: in
this PR I've removed as many PyCFunction casts as I could from the codebase.
The only ones I didn't remove were those with `METH_VARARGS | METH_KEYWORDS`,
which take 3 parameters instead of 2 and had to be cast. Example:
`{"copy_", (PyCFunction)(void(*)(void))THPStorage_(copy_), METH_VARARGS | METH_KEYWORDS, nullptr},`
ghstack-source-id: 114632704
Test Plan: waitforbuildbot
Reviewed By: albanD
Differential Revision: D24269435
fbshipit-source-id: 025cfd43a9a2a3e59f6b2951c1a78749193d77cf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46219
- Refactor StaticRuntime and group common data structures, the jit graph, and the script module into a separate struct `InferenceModule`:
```
struct InferenceModule {
explicit InferenceModule(const torch::jit::Module& m);
explicit InferenceModule(std::shared_ptr<torch::jit::Graph> g);
torch::jit::Module module;
std::shared_ptr<torch::jit::Graph> graph;
std::unique_ptr<c10::FunctionSchema> schema;
std::unordered_map<Value*, size_t> value_to_reg;
std::vector<size_t> input_regs; // inputs to the graph
std::vector<size_t> output_regs; // outputs of the graph
std::vector<size_t> internals;
};
```
which is stored in the PyTorchPredictor as well as in the static runtime, and is shared across threads. This is what's left inside the static runtime:
```
mutable std::vector<IValue> reg_;
// The nodes we need to run
std::vector<ProcessedNode> nodes_;
```
`reg_` holds all the weights and activations, and differs across threads at runtime. `nodes_` holds the op nodes and input/output registers, and is for now the same across threads. We could potentially put other stateful data structures in it, so I kept it inside the static runtime. It could easily be moved into the `InferenceModule` if we decide not to put anything else into `ProcessedNode`.
- Added StaticRuntimeOptions so we can toggle certain optimizations on/off, for testing and benchmarking. `cleanup_activations` is an example.
- Integration with PyTorchPredictor. Added a lockfree stack in the PyTorchPredictor to hold all the static runtime instances. Benchmark shows that the `push` and `pop` combo takes about 80 ns, which is quite acceptable.
This diff focuses on threading model only. Benchmarks will be separate.
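A hedged sketch of how the pieces above fit together (`InferenceModule`, `StaticRuntimeOptions`, and `cleanup_activations` come from this summary; the StaticRuntime constructor and run() signatures are assumptions for illustration):
```
#include <torch/script.h>

void example() {
  // One shared, immutable InferenceModule (graph, schema, register layout),
  // created once and reused across threads.
  auto inference_module =
      std::make_shared<InferenceModule>(torch::jit::load("model.pt"));

  // One lightweight StaticRuntime per thread, each owning its own reg_.
  StaticRuntimeOptions opts;
  opts.cleanup_activations = true;  // toggle this optimization for benchmarking
  StaticRuntime runtime(inference_module, opts);

  std::vector<c10::IValue> inputs{torch::randn({1, 32})};
  auto output = runtime.run(inputs);  // run() signature is an assumption
}
```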
Reviewed By: bwasti
Differential Revision: D24237078
fbshipit-source-id: fd0d6347f02b4526ac17dec1f731db48424bade1
Summary:
This diff changes `TensorExprKernel::generateStmt` to use flattened loops instead of flattened tensors.
Checked all tests on CPU as well as CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46539
Reviewed By: nickgg
Differential Revision: D24395956
Pulled By: navahgar
fbshipit-source-id: f3792903f2069bda37b571c9f0a840e6fb02f189
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46308
This PR adds a hand-optimized version of the DeepAndWide model with the goal
of estimating the overhead of the static runtime. While the static runtime is
currently much faster than the existing JIT interpreter, it would be
useful to understand how close we are to an absolutely zero-overhead
system. Currently, this "ideal" implementation is 2x faster than the
static runtime at batch size 1.
Full benchmark results:
```
Running build/bin/static_runtime_bench
Run on (24 X 2394.71 MHz CPU s)
CPU Caches:
L1 Data 32K (x24)
L1 Instruction 32K (x24)
L2 Unified 4096K (x24)
L3 Unified 16384K (x24)
------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------
BM_deep_wide_base/1 59518 ns 59500 ns 10909
BM_deep_wide_base/8 74635 ns 74632 ns 9317
BM_deep_wide_base/20 82186 ns 82147 ns 9119
BM_deep_wide_fast/1 13851 ns 13851 ns 49825 << new
BM_deep_wide_fast/8 22497 ns 22497 ns 32089 << new
BM_deep_wide_fast/20 23868 ns 23841 ns 31184 << new
BM_deep_wide_jit_graph_executor/1 62786 ns 62786 ns 10835
BM_deep_wide_jit_graph_executor/8 76730 ns 76718 ns 7529
BM_deep_wide_jit_graph_executor/20 78886 ns 78883 ns 8769
BM_deep_wide_jit_profiling_executor/1 69504 ns 69490 ns 10309
BM_deep_wide_jit_profiling_executor/8 75718 ns 75715 ns 9199
BM_deep_wide_jit_profiling_executor/20 75364 ns 75364 ns 9010
BM_deep_wide_static/1 40324 ns 40318 ns 17232
BM_deep_wide_static/8 50327 ns 50319 ns 13335
BM_deep_wide_static/20 53075 ns 53071 ns 12855
BM_deep_wide_static_threaded/threads:8 6258 ns 49873 ns 14008
```
PS: The implementation could probably be optimized even more.
Differential Revision: D24300702
Test Plan: Imported from OSS
Reviewed By: dzhulgakov
Pulled By: ZolotukhinM
fbshipit-source-id: 7870bdef127c39d11bcaa4f03a60eb80a46be58e
Summary:
Adds new rules to the NNC IRSimplifier to take care of the following cases:
* Comparisons which are symbolic but have a constant difference. This is most useful in cases like `if (x > x + 4) ...`, which we can now eliminate.
* Simplification of `Mod` nodes, including simple rules such as `0 % x` and `x % 1`, but also factorization of both sides to find common symbolic multiples. E.g. `(x * y) % x` can be cancelled out to `0`.
See tests for many more examples!
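For example, a minimal sketch of exercising the new `Mod` factorization rule, mirroring the pattern used in the NNC simplifier tests (the header path and `KernelScope` setup are assumptions about the test harness):
```
#include <torch/csrc/jit/tensorexpr/ir_simplifier.h>

using namespace torch::jit::tensorexpr;

void example() {
  KernelScope kernel_scope;
  VarHandle x("x", kInt);
  VarHandle y("y", kInt);

  // Both sides share the symbolic factor x, so the simplifier cancels the
  // whole expression down to the constant 0.
  ExprHandle simplified = IRSimplifier::simplify((x * y) % x);
}
```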
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46412
Reviewed By: navahgar
Differential Revision: D24396151
Pulled By: nickgg
fbshipit-source-id: abb954dc930867d62010dcbcd8a4701430733715
Summary: Interval midpoint calculations can overflow (integers). This fixes such an instance.
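The standard form of this fix, for reference (a generic sketch; the actual call site isn't shown in this summary):
```
#include <cstdint>

// (lo + hi) / 2 can overflow when the sum exceeds the integer range;
// lo + (hi - lo) / 2 yields the same midpoint without the overflowing sum.
int64_t midpoint(int64_t lo, int64_t hi) {
  return lo + (hi - lo) / 2;  // safe for lo <= hi
}
```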
Test Plan: Standard test rigs.
Reviewed By: iseeyuan
Differential Revision: D24392608
fbshipit-source-id: 0face1133d99cea342abbf8884b14262d50b0826
Summary:
1. Added CudaFusionGuard as the custom TypeCheck for nvfuser; enabled dynamic shape support with the profiling executor;
2. Dropped support for the legacy fuser;
3. Re-enabled nvfuser tests;
4. Added registration for profiling records to allow profiling on user-specified nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46452
Reviewed By: zou3519, anjali411
Differential Revision: D24364642
Pulled By: ngimel
fbshipit-source-id: daf53a9a6b6636e1ede420a3a6d0397d4a8b450b
Summary:
Refactor foreach APIs to use overloads in case of scalar list inputs.
Tested via unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45673
Reviewed By: heitorschueroff
Differential Revision: D24053424
Pulled By: izdeby
fbshipit-source-id: 35976cc50b4acfe228a32ed26cede579d5621cde
Summary:
`getGraphExecutorOptimize` mandates that we don't run any optimizations beyond what's required to run graphs. In this scenario, we don't want to do any profiling, as the profiling information will never be used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46479
Reviewed By: ZolotukhinM
Differential Revision: D24368292
Pulled By: Krovatkin
fbshipit-source-id: a2c7618d459efca9cb0700c4d64d829b352792a8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45164
This PR implements `fft2`, `ifft2`, `rfft2` and `irfft2`. These are the last functions required for `torch.fft` to match `numpy.fft`. If you look at either NumPy or SciPy you'll see that the 2-dimensional variants are identical to `*fftn` in every way, except for the default value of `axes`. In fact you can even use `fft2` to do general n-dimensional transforms.
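A hedged sketch of that equivalence using the generated C++ bindings (availability and exact signatures of `torch::fft_fft2`/`torch::fft_fftn` in a given build are assumptions):
```
#include <torch/torch.h>

void fft2_equivalence() {
  auto x = torch::randn({3, 5, 5}, torch::kComplexDouble);

  // fft2 is fftn with dim defaulted to the last two dimensions.
  auto a = torch::fft_fft2(x);
  auto b = torch::fft_fftn(x, /*s=*/c10::nullopt,
                           /*dim=*/std::vector<int64_t>{-2, -1});
  TORCH_CHECK(torch::allclose(a, b));
}
```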
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D24363639
Pulled By: mruberry
fbshipit-source-id: 95191b51a0f0b8e8e301b2c20672ed4304d02a57
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46383
The old `USE_METAL` macro is actually being used by Caffe2. Here we introduce a new macro to enable Metal in PyTorch.
ghstack-source-id: 114499392
Test Plan:
- Circle CI
- The Person Segmentation model works
Reviewed By: linbinyu
Differential Revision: D24322018
fbshipit-source-id: 4e5548afba426b49f314366d89b18ba0c7e745ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46498
The name of the first arg, "ddp_model", is misleading, because it is actually the reducer of the DDP model rather than the entire model.
This method is called in the file caffe2/torch/nn/parallel/distributed.py:
`dist._register_comm_hook(self.reducer, state, hook)`
ghstack-source-id: 114531188
(Note: this ignores all push blocking failures!)
Test Plan: waitforbuildbot
Reviewed By: pritamdamania87
Differential Revision: D24372827
fbshipit-source-id: dacb5a59e87400d93a2f35da43560a591ebc5499
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45858
When cloning a module that has `__setstate__`/`__getstate__` methods, we need to load these methods to initialize the module.
Test Plan: Imported from OSS
Reviewed By: suo
Differential Revision: D24116524
Pulled By: bzinodev
fbshipit-source-id: a5111638e2dc903781f6468838c000850d1f9a74
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46445
Fix an instance in which a truncated integer prevents downstream type safety checks.
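A generic illustration of the failure mode (the diff's actual call site is not shown in this summary; the names here are hypothetical):
```
#include <cstdint>

// Narrowing a 64-bit value into int silently drops high bits, so any
// downstream check against the truncated copy can pass when it shouldn't.
void check_size(int64_t numel) {
  int truncated = static_cast<int>(numel);  // wrong: loses bits for numel > INT_MAX
  int64_t preserved = numel;                // right: keep the full width
  (void)truncated;
  (void)preserved;
}
```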
Test Plan: I'm not sure what's appropriate here.
Reviewed By: albanD
Differential Revision: D24339292
fbshipit-source-id: 15748ec64446344ff1a8344005385906d3484d7c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46347
Added error messages and workarounds for the case where a NamedTuple is returned from a function of a class in PyTorch Mobile.
To identify the error cases (returning a NamedTuple type), I used the following conditions:
1) ins.op == RET (for returning)
2) type->kind() == TypeKind::TupleType (for pruning non-tuple types)
3) type->cast<TupleType>()->name() (for pruning the plain Tuple type)
- I could have used the type's string (str() or repr_str()) directly, but instead checked whether it has a name attribute. Please comment on this choice.
[Information on Tuple and NamedTuple types]
1. Tuple
type->str(): (int, int)
type->repr_str(): Tuple[int, int]
type->kind(): TypeKind::TupleType # differs from non-tuple types
type->cast<NamedType>(): True
type->cast<NamedType>()->name(): False # differs from NamedTuple
2. NamedTuple
type->str(): __torch__.myNamedTuple
type->repr_str(): __torch__.myNamedTuple
type->kind(): TypeKind::TupleType # differs from non-tuple types
type->cast<NamedType>(): True
type->cast<TupleType>()->name(): True # differs from Tuple
(From the next diff, I will handle the other error cases: 1) returning List<module class>, Dict<module class> and 2) accessing Module class's member functions)
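A condensed sketch of the predicate that conditions 2) and 3) describe (the RET-opcode check is elided, and `isNamedTuple` is a hypothetical helper name, not the diff's actual code):
```
#include <ATen/core/jit_type.h>

using c10::TupleType;
using c10::TypeKind;
using c10::TypePtr;

// Hedged sketch: a tuple type that also carries a name is a NamedTuple,
// which is the unsupported return type to report on mobile.
bool isNamedTuple(const TypePtr& type) {
  return type->kind() == TypeKind::TupleType &&
         type->cast<TupleType>()->name().has_value();
}
```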
ghstack-source-id: 114361762
Test Plan:
[Added test results]
buck test mode/dev caffe2/test:mobile -- 'test_unsupported_return'
Summary
Pass: 2
ListingSuccess: 1
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/7036874440497926
[Whole test results]
buck test mode/dev caffe2/test:mobile -- 'test'
Summary
Pass: 11
ListingSuccess: 1
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/4503599664074084
Reviewed By: iseeyuan
Differential Revision: D24291962
fbshipit-source-id: a1a9e24e41a5f1e067738f59f1eae34d07cba31a
Summary:
This diff restores the previous behavior of silently allowing overflow when inserting instructions. The behavior was changed recently in https://github.com/pytorch/pytorch/issues/45382, but that started to break some existing use cases that have overflow problems.
This restores the original behavior but throws a warning, to unblock existing use cases where overflow happens.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46369
Reviewed By: kwanmacher, wanchaol, fbhuba
Differential Revision: D24324345
Pulled By: gmagogsfm
fbshipit-source-id: 1c0fac421d4de38f070e21059bbdc1b788575bdf
Summary:
Opset 9 supports a no-op squeeze when the dim is not 1 only under static axes.
Updated the test case that was setting dynamic axes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45369
Reviewed By: anjali411
Differential Revision: D24280180
Pulled By: bzinodev
fbshipit-source-id: d7cda88ab338a1c41a68052831dcebe739a3843c