Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45390
Tensor objects should always refer to their Function's bufs. Currently
we never create a Tensor with a buffer different than of its function,
but having it in two places seems incorrect and dangerous.
Differential Revision: D23952865
Test Plan: Imported from OSS
Reviewed By: nickgg
Pulled By: ZolotukhinM
fbshipit-source-id: e63fc26d7078427514649d9ce973b74ea635a94a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45388
Classes defined in these files are closely related, so it is reasonable
to have them all in one file. The change is purely a code move.
Differential Revision: D23952867
Test Plan: Imported from OSS
Reviewed By: nickgg
Pulled By: ZolotukhinM
fbshipit-source-id: 12cfaa968bdfc4dff00509e34310a497c7b59155
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44433
Not entirely sure why, but changing the type of beta from `float` to `double in autocast_mode.cpp and FunctionsManual.h fixes my compiler errors, failing instead at link time
fixing some type errors, updated fn signature in a few more files
removing my usage of Scalar, making beta a double everywhere instead
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D23636720
Pulled By: bdhirsh
fbshipit-source-id: caea2a1f8dd72b3b5fd1d72dd886b2fcd690af6d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44664
Closes https://github.com/pytorch/pytorch/issues/39971. This PR adds support for functions decorated with `rpc.functions.async_execution` to be profiled over RPC as builtins, jit functions, and blocking python UDFs currently can be. The reasoning for this is to provide complete feature support in terms of RPC profiling and the various types of functions users can run.
To enable this, the PR below this enables calling `disableProfiler()` safely from another thread. We use that functionality to defer disabling the profiler on the server until the future corresponding to the RPC request completes (rather than only the blocking `processRPC` call as was done previously). Since when the future completes we've kicked off the async function and the future corresponding to it has completed, we are able to capture any RPCs the function would have called and the actual work done on the other node.
For example, if the following async function is ran on a server over RPC:
```
def slow_add(x, y):
time.sleep(1)
return torch.add(x, y)
rpc.functions.async_execution
def slow_async_add(to, x, y):
return rpc.rpc_async(to, slow_add, args=(x, y))
```
we expect to see the original RPC profiled, the nested RPC profiled, and the actual torch.add() work. All of these events should be recorded with the correct node id. Here is an example profiling output:
```
------------------------------------------------------------------------------------------------------------------------- --------------- --------------- --------------- --------
------- --------------- --------------- ---------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg Number of Calls Node ID
------------------------------------------------------------------------------------------------------------------------- --------------- --------------- --------------- --------
------- --------------- --------------- --------------- rpc_async#slow_async_add(worker1 -> worker2) 0.00% 0.000us 0 1.012s
1.012s 1 1
aten::empty 7.02% 11.519us 7.02% 11.519us 11.519us 1 1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3) 0.00% 0.000us 0 1.006s
1.006s 1 2 rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty 7.21% 11.843us 7.21% 11.843us
11.843us 1 2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add 71.94% 118.107us 85.77% 140.802us 140.802us 1 3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty 13.82% 22.695us 13.82% 22.695us
22.695us 1 3 ------------------------------------------------------------------------------------------------------------------------- --------------- --------------- --------------- --------
------- --------------- --------------- ---------------
Self CPU time total: 164.164us
```
This PR also moves a bunch of the profiling logic to `rpc/utils.cpp` to declutter `request_callback` code.
ghstack-source-id: 112868470
Test Plan:
```
rvarm1@devbig978:fbcode (52dd34f6)$ buck test mode/no-gpu mode/dev-nosan //caffe2/test/distributed/rpc:process_group_agent -- test_rpc_profiling_async_function --print-passing-details --stress-runs 1
```
Reviewed By: mrshenli
Differential Revision: D23638387
fbshipit-source-id: eedb6d48173a4ecd41d70a9c64048920bd4807c4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45264
Context for why we are porting to gtest in: https://github.com/pytorch/pytorch/pull/45018.
This PR completes the process of porting and removes unused files/macros.
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D23901392
Pulled By: suo
fbshipit-source-id: 89526890e1a49462f3f77718f4ee273c5bc578ba
Summary:
The Cuda HalfChecker casts up all loads and stores of Half to Float, so we do math in Float on the device. It didn't cast up HalfImmediate (ie. constants) so they could insert mixed-size ops. Fix is to do that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45213
Reviewed By: ezyang
Differential Revision: D23885287
Pulled By: nickgg
fbshipit-source-id: 912991d85cc06ebb282625cfa5080d7525c8eba9
Summary:
A lot of changes are in this update, some highlights:
- Added Doxygen config file
- Split the fusion IR (higher level TE like IR) from kernel IR (lower level CUDA like IR)
- Improved latency with dynamic shape handling for the fusion logic
- Prevent recompilation for pointwise + reduction fusions when not needed
- Improvements to inner dimension reduction performance
- Added input -> kernel + kernel launch parameters cache, added eviction policy
- Added reduction fusions with multiple outputs (still single reduction stage)
- Fixed code generation bugs for symbolic tiled GEMM example
- Added thread predicates to prevent shared memory form being loaded multiple times
- Improved sync threads placements with shared memory and removed read before write race
- Fixes to FP16 reduction fusions where output would come back as FP32
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45218
Reviewed By: ezyang
Differential Revision: D23905183
Pulled By: soumith
fbshipit-source-id: 12f5ad4cbe03e9a25043bccb89e372f8579e2a79
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45144
Moves prim ops from C10 back to JIT.
These were originally moved to C10 from JIT in D19237648 (f362cd510d)
ghstack-source-id: 112775781
Test Plan:
buck test //caffe2/test/cpp/jit:jit
https://pxl.cl/1l22N
buck test adsatlas/gavel/lib/ata_processor/tests:ata_processor_test
https://pxl.cl/1lBxD
Reviewed By: iseeyuan
Differential Revision: D23697598
fbshipit-source-id: 36d1eb8c346e9b161ba6af537a218440a9bafd27
Summary:
We need to check if dtypes differ in scalar type or lanes to decide between
Cast and Broadcast.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45179
Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.SimplifyBroadcastTermExpander
Reviewed By: bwasti
Differential Revision: D23873316
Pulled By: asuhan
fbshipit-source-id: ca141be67e10c2b6c5f2ff9c11e42dcfc62ac620
Summary:
Arithmetic operations on Bool aren't fully supported in the evaluator. Moreover,
such semantics can be implemented by the client code through insertion of
explicit casts to widen and narrow to the desired types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44677
Test Plan:
test_tensorexpr --gtest_filter=TensorExprTest.ExprDisallowBoolArithmetic
python test/test_jit_fuser_te.py
Reviewed By: agolynski
Differential Revision: D23801412
Pulled By: asuhan
fbshipit-source-id: fff5284e3a216655dbf5a9a64d1cb1efda271a36
Summary:
When doing a splitWithMask we only mask if the loop extent is not cleanly divide by the split factor. However, the logic does not simplify so any nontrivial loop extents will always cause a mask to be added, e.g. if the loop had been previously split. Unlike splitWithTail, the masks added by splitWithMask are always overhead and we don't have the analysis to optimize them out if they are unnecessary, so it's good to avoid inserting them if we can.
The fix is just to simplify the loop extents before doing the extent calculation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45141
Reviewed By: ezyang
Differential Revision: D23869170
Pulled By: nickgg
fbshipit-source-id: 44686fd7b802965ca4f5097b0172a41cf837a1f5
Summary:
combineMultilane used the wrong order when ramp was on the left hand side,
which matters for subtract.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45157
Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.SimplifyRampSubBroadcast
Reviewed By: ailzhang
Differential Revision: D23851751
Pulled By: asuhan
fbshipit-source-id: 864d1611e88769fb43327ef226bb3310017bf858
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44653
This changes the profiler per a discussion with ilia-cher offline that enables `disableProfiler()` event consolidation logic to be called from different threads (i.e. threads where the profiler was not explicitly enabled). This is needed to support the functionality enabled by D23638387 where we defer profiling event collection until executing an async callback that can execute on a different thread, to support RPC async function profiling.
This is done by introducing 2 flags `cleanupTLSState` and `consolidate` which controls whether we should clean up thread local settings (we don't do this when calling `disableProfiler()` on non-main threads) and whether we should consolidate all profiled events. Backwards compatiblity is ensured since both options are true by default.
Added a test in `test_misc.cpp` to test this.
ghstack-source-id: 112605620
Reviewed By: mrshenli
Differential Revision: D23638499
fbshipit-source-id: f5bbb0d41ef883c5e5870bc27e086b8b8908f46b
Summary:
This flag simply allows users to get fusion groups that will *eventually* have shapes (such that `getOperation` is a valid).
This is useful for doing early analysis and compiling just in time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44401
Reviewed By: ZolotukhinM
Differential Revision: D23656140
Pulled By: bwasti
fbshipit-source-id: 9a26c202752399d1932ad7d69f21c88081ffc1e5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39111
In our present alias analysis, we consider any Value that enter another container as entering the heap, and thus aliasing all other heap values of the same type. There are a number of advantages to this approach:
- it is not to hard to maintain the aliasDb implementation
- it is much easier from an op schema perspective - there are many composite list ops registered internally and externally that would be tricky to register and get right if we did something more complicated
- It limits the size of the AliasDb, because a container of size 10 only contains a single memory dag element instead of 10 elements.
The downside is that we have are unable to handle the simple and extremely common case of a list of tensors being used in an ATen op.
In an example like:
```
def foo(input):
x = torch.tensor([1, 2, 3, 4])
y = [x, x]
input.add_(1)
return torch.cat(y)
```
we will consider x to be written to. any write to any wildcard element (an element that enters a tuple, an element that is taken from a list) will mark x as written to. This can be limiting for our ability to create a functional subset and fuse graphs - as a result, 4 of TorchVision classification models could not be functionalized.
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D23828003
Pulled By: eellison
fbshipit-source-id: 9109fcb6f2ca20ca897cae71683530285da9d537
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45018
Now that https://github.com/pytorch/pytorch/pull/44795 has landed, we
can convert the bulk of our cpp tests to use gtest APIs. Eventually
we'll want to get rid of our weird harness for cpp tests entirely in
favor of using regular gtest everywhere. This PR demonstrates some of
the benefits of this approach:
1. You don't need to register your test twice (once to define it, once
in tests.h).
2. Consequently, it's easier to have many individual test cases.
Failures can be reported independently (rather than having huge
functions to test entire modules.
3. Some nicer testing APIs, notably test fixtures.
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D23802297
Pulled By: suo
fbshipit-source-id: 774255da7716294ac573747dcd5e106e5fe3ac8f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45014
Pull Request resolved: https://github.com/pytorch/tensorpipe/pull/219
Pull Request resolved: https://github.com/pytorch/tensorpipe/pull/212
+ Introduce buffer.h defining the buffer struct(s). The `CpuBuffer`
struct is always defined, while the `CudaBuffer` struct is defined
only when `TENSORPIPE_SUPPORTS_CUDA` is true.
+ Update all channels to take a `CpuBuffer` or `CudaBuffer` for
`send`/`recv` rather than a raw pointer and a length.
+ Make the base `Channel`/`Context` classes templated on `TBuffer`,
effectively creating two channel hierarchies (one for CPU channels,
one for CUDA channels).
+ Update the Pipe and the generic channel tests to use the new API. So
far, generic channel tests are CPU only, and tests for the CUDA IPC
channel are (temporarily) disabled. A subsequent PR will take care of
refactoring tests so that generic tests work for CUDA channels. An
other PR will add support for CUDA tensors in the Pipe.
Differential Revision: D23598033
Test Plan: Imported from OSS
Reviewed By: lw
Pulled By: beauby
fbshipit-source-id: 1d6c3f91e288420858835cd5e7962e8da051b44b
Summary:
A previous fix for masking Cuda dimensions (https://github.com/pytorch/pytorch/issues/44733) changed the behaviour of inserting thread synchronization barriers in the Cuda CodeGen, causing the CudaSharedMemReduce_1 to be flaky and ultimately disabled.
The issue is working out where these barriers must be inserted - solving this optimally is very hard, and I think not possible without dependency analysis we don't have, so I've changed our logic to be quite pessimistic. We'll insert barriers before and after any blocks that have thread dimensions masked (even between blocks that have no data dependencies). This should be correct, but it's an area we could improve performance. To address this somewhat I've added a simplifier pass that removes obviously unnecessary syncThreads.
To avoid this test being flaky again, I've added a check against the generated code to ensure there is a syncThread in the right place.
Also fixed a couple of non-functional but clarity issues in the generated code: fixed the missing newline after Stores in the CudaPrinter, and prevented the PrioritizeLoad mutator from pulling out loads contained within simple Let statements (such as those produced by the Registerizer).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44909
Reviewed By: agolynski
Differential Revision: D23800565
Pulled By: nickgg
fbshipit-source-id: bddef1f40d8d461da965685f01d00b468d8a2c2f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44861
We were redefining things like ASSERT_EQ to take a _VA_ARGS_ parameter, so compiling these files with gtest (instead of pytorch's custom python-based cpp test infra) fails.
Test Plan: buck build //caffe2/test/cpp/tensorexpr
Reviewed By: asuhan
Differential Revision: D23711293
fbshipit-source-id: 8af14fa7c1f1e8169d14bb64515771f7bc3089e5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44795
Today, we build our cpp tests twice, once as a standalone gtest binary,
and once linked in `libtorch_python` so we can call them from
`test_jit.py`.
This is convenient (it means that `test_jit.py` is a single entry point
for all our tests), but has a few drawbacks:
1. We can't actually use the gtest APIs, since we don't link gtest into
`libtorch_python`. We're stuck with the subset that we want to write
polyfills for, and an awkward registration scheme where you have to
write a test then include it in `tests.h`).
2. More seriously, we register custom operators and classes in these
tests. In a world where we may be linking many `libtorch_python`s, this
has a tendency to cause errors with `libtorch`.
So now, only tests that explicitly require cooperation with Python are
built into `libtorch_python`. The rest are built into
`build/bin/test_jit`.
There are tests which require that we define custom classes and
operators. In these cases, I've built thm into separate `.so`s that we
call `torch.ops.load_library()` on.
Test Plan: Imported from OSS
Reviewed By: SplitInfinity, ZolotukhinM
Differential Revision: D23735520
Pulled By: suo
fbshipit-source-id: d146bf4e7eb908afa6f96b394e4d395d63ad72ff
Summary:
Adds a pass to the IR Simplifier which fuses together the bodies of Cond statements which have identical conditions. e.g.
```
if (i < 10) {
do_thing_1;
} else {
do_thing_2;
}
if (i < 10) {
do_thing_3;
}
```
is transformed into:
```
if (i < 10) {
do_thing_1;
do_thing_3;
} else {
do_thing_2;
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44886
Reviewed By: glaringlee
Differential Revision: D23768565
Pulled By: nickgg
fbshipit-source-id: 3fe40d91e82bdfff8dcb8c56a02a4fd579c070df
Summary:
Adds a new optimization to the IRSimplifier which changes this pattern:
```
for ...
if ...
do thing;
```
into:
```
if ...
for ...
do thing;
```
Which should be almost strictly better.
There are many cases where this isn't safe to do, hence tests. Most obviously when the condition depends on something modified within the loop.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44764
Reviewed By: mruberry
Differential Revision: D23734463
Pulled By: nickgg
fbshipit-source-id: 51617e837de96b354fb702d0090ac65ddc523d36
Summary:
Unifies a number of partial solutions to the thread and block dimension extent masking, including the NoThreadIdxWriter and my last fix https://github.com/pytorch/pytorch/issues/44325. The NoThreadIdxWriter is gone in favour of tracking the current loop extents and masking any statements that have a lower rank than the launch parameters in any Block or Thread dimension, which handles both the "no" and "smaller" axis binding cases.
For example it will transform the following:
```
for i in 0..10 // blockIdx.x
for j in 0..10 // threadIdx.x
do thing(i, j);
for k in 0..5 // threadIdx.x
do other thing(i, k);
```
Into:
```
do thing(blockIdx.x, threadIdx.x);
if (threadIdx.x < 5) {
do other thing(blockIdx.x, threadIdx.x);
}
```
And handle the case where statements are not bound by any axis, eg.
```
do outer thing;
for i in 0..10 // blockIdx.x
for j in 0..10 // threadIdx.x
do thing(i, j);
do other thing(i);
```
will become:
```
if (blockIdx.x < 1) {
if (threadIdx.x < 1) {
do outer thing;
}
}
syncthreads();
do thing(blockIdx.x, threadIdx.x);
syncthreads();
if (threadIdx.x < 1) {
do other thing(blockIdx.x);
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44733
Reviewed By: mruberry
Differential Revision: D23736878
Pulled By: nickgg
fbshipit-source-id: 52d08626ae8043d53eb937843466874d479a6768
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44702
Original commit changeset: c6bd6d277aca
This diff caused windows build to fail due to a compiler bug in VS2019 (lambda capture constant int value). This back out works around the issue with explicit capture of const int value.
Test Plan: Tested and previously landed.
Reviewed By: mruberry
Differential Revision: D23703215
fbshipit-source-id: f9ef23be97540bc9cf78a855295fb8c69f360459
Summary:
This is a reup https://github.com/pytorch/pytorch/issues/43885 with an extra commit which should fix the bugs that caused it to be reverted. Read that for general context.
The issue here was that we were still using the side maps `tensor_to_stmt_` and `stmt_to_tensor_` which get invalidated by any transform of the IR (rather than just any transform that isn't computeInline). I added a comment about this but didn't actually address our usages of it.
I've removed these maps and changed the `getLoopBodyFor` and `getLoopStatementsFor` helpers to search the root stmt directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44231
Reviewed By: albanD
Differential Revision: D23689688
Pulled By: nickgg
fbshipit-source-id: 1c6009a880f8c0cebf2300fd06b5cc9322bffbf9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44252
Add tracing to DPP client. Because DPP requests are async, we need to be able to start a trace event in one thread and potentially end in a different thread. RecordFunction and LibgpumonObserver previously assume each trace event starts and finishes in the same thread. So they use a thread local context to track enter and exit call backs. Async events breaks this assumption. This change attaches the event context to the RecordFunction object so we do not need to use thread local context.
Test Plan:
Tested with dpp perf test and able to collect trace.
{F307824044}
Reviewed By: ilia-cher
Differential Revision: D23323486
fbshipit-source-id: 4b6ca6c0e32028fb38a476cd1f44c17a001fc03b
Summary:
When caller / callee pairs are inserted into the mapping, verify that
the arity of the buffer access is consistent with its declared rank.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44561
Test Plan: CI, test_tensorexpr --gtest_filter=TensorExprTest.DetectInlineRankMismatch
Reviewed By: albanD
Differential Revision: D23684342
Pulled By: asuhan
fbshipit-source-id: dd3a0cdd4c2492853fa68381468e0ec037136cab
Summary:
Improve simplification of nested Min and Max patterns.
Specifically, handles the following pattern simplications:
* `Max(A, Max(A, Const)) => Max(A, Const)`
* `Max(Min(A, B), Min(A, C)) => Min(A, Max(B, C))`
* `Max(Const, Max(A, OtherConst) => Max(A, Max(Const, OtherConst))`
- This case can have an arbitrarily long chain of Max ops. For example: `Max(5, Max(x, Max(y, Max(z, 8)))) => Max(Max(Max(x, 8), y), z)`
Similarly, for the case of Min as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44142
Reviewed By: albanD
Differential Revision: D23644486
Pulled By: navahgar
fbshipit-source-id: 42bd241e6c2af820566744c8494e5dee172107f4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44337
Add a new run_method to mobile Module which is variadic (takes any number of arguments) to match full jit.
ghstack-source-id: 111909068
Test Plan: Added new unit test to test_jit test suite
Reviewed By: linbinyu, ann-ss
Differential Revision: D23585763
fbshipit-source-id: 007cf852290f03615b78c35aa6f7a21287ccff9e
Summary:
Fix an issue where loops of different sizes are bound to the same Cuda dimension / metavar.
Coming soon more info and tests...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44325
Reviewed By: colesbury
Differential Revision: D23628859
Pulled By: nickgg
fbshipit-source-id: 3621850a4cc38a790b62ad168d32e7a0e2462fad