Summary:
When the input to an indexing operation is a boolean, for example `array[True] = value`,
the resulting index_put node needs to be converted to a masked_scatter or masked_fill node depending on the type of value being assigned: if the value is a single scalar we use masked_fill, and if the value is a tensor of the appropriate size we use masked_scatter.
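A minimal eager-mode sketch (with assumed shapes and values) of the two cases the export pass has to distinguish:
```python
import torch

x = torch.zeros(2, 3)
mask = torch.tensor([[True, False, True], [False, True, False]])

# Scalar value: the index_put is converted to masked_fill.
x[mask] = 1.0
# Tensor value of the appropriate size: the index_put is converted to masked_scatter.
x[mask] = torch.tensor([1.0, 2.0, 3.0])
```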
Fixes https://github.com/pytorch/pytorch/issues/34054
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45584
Reviewed By: VitalyFedyunin
Differential Revision: D24116921
Pulled By: bzinodev
fbshipit-source-id: ebd66e06d62e15f0d49c8191d9997f55edfa520e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45942
We only need to keep track of this for traversing the autograd graph
when find_unused_parameters=True. Without that setting, we were still populating and keeping this
mapping in memory, which occupies sizeof(pointer) * (number of grad accumulators)
of extra memory.
ghstack-source-id: 114219289
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D24154407
fbshipit-source-id: 220d723e262f36590a03a3fd2dab47cbfdb87d40
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46221
The RPC framework only allowed sending RPCs based on provided
WorkerInfo or name. When using RPC with DDP, sometimes it might just be easier
to refer to everything in terms of ranks since DDP doesn't support names yet.
As a result, it would be helpful to support a `to` parameter in the RPC APIs
that also allows specifying a rank.
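A minimal sketch of what such a call could look like, assuming RPC has already been initialized with at least two workers (using a plain integer rank as `to` is the new part):
```python
import torch
import torch.distributed.rpc as rpc

# Previously `to` had to be a WorkerInfo or a worker name; now a rank works too.
ret = rpc.rpc_sync(to=1, func=torch.add, args=(torch.ones(2), 3))
```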
ghstack-source-id: 114207172
Test Plan:
1) waitforbuildbot
2) Unit Tests
Reviewed By: mrshenli
Differential Revision: D24264989
fbshipit-source-id: 5edf5d92e2bd2f213471dfe7c74eebfa9efc9f70
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46116
Ideally I would just use one of the existing preprocessor flags such as `FBCODE_CAFFE2`, but this implies a whole bunch of other things elsewhere, so it is not really a solution for ovrsource.
Test Plan: CI green, we are able to disable it internally with `-DNVALGRIND`
Reviewed By: malfet
Differential Revision: D24227360
fbshipit-source-id: 24a3b393cf46d6a16acca0a9ec52610d4bb8704f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46250
Previously the type of GetAttr nodes was getting set incorrectly and wasn't matching the module type
Test Plan:
Existing quantization tests
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D24279872
fbshipit-source-id: 2b2e3027f6e9ad8ba9e9b7937bd5cc5daaf6e17c
Summary:
The record_stream method was hard-coded for the CUDA device. Define record_stream in native_functions.yaml to enable dynamic dispatch to different device backends.
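A minimal sketch of the dispatched API, assuming a CUDA device is available (other backends can now provide their own implementation):
```python
import torch

s = torch.cuda.Stream()
t = torch.empty(1024, device="cuda")
with torch.cuda.stream(s):
    t.add_(1)
# Tell the caching allocator that t is still in use on stream s.
t.record_stream(s)
```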
Fixes https://github.com/pytorch/pytorch/issues/36556
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44301
Reviewed By: glaringlee
Differential Revision: D23763954
Pulled By: ezyang
fbshipit-source-id: e6d24f5e7892b56101fa858a6cad2abc5cdc4293
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46112
### Summary
This PR adds support for running TorchScript models on iOS GPU via Metal (inference only). The feature is currently in a prototype state; API changes are expected. The tutorial and documents will be added once it goes to beta.
allow-large-files
- Users API
```
auto module = torch::jit::load(model);
module.eval();
at::Tensor input = at::ones({1,3,224,224}, at::ScalarType::Float).metal();
auto output = module.forward({input}).toTensor().cpu();
```
- Supported Models
- Person Segmentation v106 (FB Internal)
- Mobilenetv2
- Supported Operators
- aten::conv2d
- aten::addmm
- aten::add.Tensor
- aten::sub.Tensor
- aten::mul.Tensor
- aten::relu
- aten::hardtanh
- aten::hardtanh_
- aten::sigmoid
- aten::max_pool2d
- aten::adaptive_avg_pool2d
- aten::reshape
- aten::t
- aten::view
- aten::log_softmax.int
- aten::upsample_nearest2d.vec
- Supported Devices
- Apple A9 and above
- iOS 10.2 and above
- CMake scripts
- `IOS_ARCH=arm64 ./scripts/build_ios.sh -DUSE_METAL=ON`
### Test Plan
- Circle CI
ghstack-source-id: 114155638
Test Plan:
1. Sandcastle CI
2. Circle CI
Reviewed By: dreiss
Differential Revision: D23236555
fbshipit-source-id: 98ffc48b837e308bc678c37a9a5fd8ae72d11625
Summary:
Fixes two bugs reported by https://github.com/pytorch/pytorch/issues/45953 in the NNC Cuda codegen which could break when using Half floats:
1. The Registerizer will generate new scalars with the type of the load being replaced, and doesn't have Cuda specific logic to avoid using the half type. I've added a quick mutator to coerce these to float, similar to the existing load casting rules.
2. We weren't handling explicit casts to Half inserted by the user (in the report, the "user" being the JIT). This is addressed by replacing them with casts to Float, since that's the type we do Half math in.
Fixes https://github.com/pytorch/pytorch/issues/45953.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46129
Reviewed By: glaringlee
Differential Revision: D24253639
Pulled By: nickgg
fbshipit-source-id: 3fef826eab00355c81edcfabb1030332cae595ac
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46036
Previously, this function didn't do error-bounds checking on the GetItem (GET_ITEM) calls, which led to issues like https://github.com/pytorch/pytorch/issues/46020.
A better solution would be to use pybind, but given that writing the file dominates the cost of bounds checking, this is strictly better.
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D24228370
Pulled By: gchanan
fbshipit-source-id: f5d0a3d21ff12b4380beefe1e9954fa81ea2f567
Summary:
Fixes a crash bug in the IRSimplifier when the LHS is a Term (e.g. 2x) and the RHS is a Polynomial (e.g. 2x+1).
This case crashes 100% of the time so I guess it's not very common in models we've been benchmarking.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46108
Reviewed By: agolynski
Differential Revision: D24226593
Pulled By: nickgg
fbshipit-source-id: ef454c855ff472febaeba16ec34891df932723c0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45933
Occasionally users run DDP with models with unused params, in this
case we would like to surface an error message telling them to run with
find_unused_parameters=True. However, a recent change to the rebuild_buckets logic (https://github.com/pytorch/pytorch/pull/44798) made
it so that we raise a size mismatch error when this happens, but the
information about unused parameters is likely to be more useful and likely to
be the most common case of failure. Prefer raising this error over the
subsequent size mismatch errors.
ghstack-source-id: 113914759
Test Plan: Added unittest
Reviewed By: mrshenli
Differential Revision: D24151256
fbshipit-source-id: 5d349a988b4aac7d3e0ef7b3cd84dfdcbe9db675
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45926
torch/csrc/cuda/nccl.cpp is compiled as part of the torch_cuda library, so calling this function from ProcessGroupNCCL.cpp avoids linking a second instance of libnccl.a into torch_python.
Fixes a similar issue to https://github.com/pytorch/pytorch/issues/42517
ghstack-source-id: 113910530
Test Plan: waitforsandcastle
Reviewed By: jiayisuse
Differential Revision: D24147802
fbshipit-source-id: d8901fdb31bdc22ddca2364f8050844639a1beb3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45783
After the previous device maps commits, `pipeWrite` might throw. In
this case, if we increment active calls before `pipeWrite` on the
caller, that active call won't be decremented properly when `pipeWrite`
throws. As a result, `shutdown` can silently timeout. I noticed this
as some tests take more than 60s to finish.
This commit extracts the tensor device checking logic out of pipeWrite,
and makes sure the error is thrown before the active call count is
incremented.
Differential Revision: D24094803
Test Plan: Imported from OSS
Reviewed By: mruberry
Pulled By: mrshenli
fbshipit-source-id: d30316bb23d2afd3ba4f5540c3bd94a2ac10969b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46003
sparse is confusing because it is used in training for sparse gradients
Test Plan: Imported from OSS
Reviewed By: radkris-git, qizzzh
Differential Revision: D24178248
fbshipit-source-id: 0a2b595f3873d33b2ce25839b6eee31d2bfd3b0d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45997
The current sparse field used in the float module is for sparse gradients, which is not applicable
to inference. The sparse field in the quantized ops denotes pruned weights.
Test Plan:
python test/test_quantization.py TestQuantizeDynamicJitOps.test_embedding_bag
Imported from OSS
Reviewed By: qizzzh
Differential Revision: D24176543
fbshipit-source-id: a05b4ff949e0375462ae411947f68076e1b460d2
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45558
This assertion failure is caused by the incorrect implementation of ``aten::set_grad_enabled`` in [torch/csrc/jit/runtime/register_special_ops.cpp](https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/runtime/register_special_ops.cpp#L436). The current implementation is:
```cpp
Operator(
"aten::set_grad_enabled(bool val) -> ()",
[](Stack* stack) {
torch::GradMode::set_enabled(pop(stack).toBool());
push(stack, IValue());
},
aliasAnalysisConservative()),
```
which pushes a ``None`` onto the evaluation stack after calling ``set_enabled``. But this behavior is incorrect, as the signature says this function won't return a value. I guess the original author might have been confused by the behavior of Python, which pushes a ``None`` onto the evaluation stack when the function definition does not end with a return statement with an explicit result value.
If ``aten::set_grad_enabled`` pushes a ``None`` onto the evaluation stack, each time it's called, the evaluation stack will accumulate an extra ``None``. In our case, ``with torch.no_grad():`` will cause ``aten::set_grad_enabled`` to be called twice, so when the ``forward`` method finishes, the evaluation stack will be ``[None, None, Tensor]``. But the return statement of ``GraphFunction::operator()`` in [torch/csrc/jit/api/function_impl.cpp](https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/api/function_impl.cpp#L51) is ``return stack.front();``, which will try to extract a tensor out of a ``None``, thus causing the assertion failure.
The solution is simple, just remove the push in the implementation of ``aten::set_grad_enabled``.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45559
Reviewed By: albanD
Differential Revision: D24142153
Pulled By: SplitInfinity
fbshipit-source-id: 75aad0e38bd912a437f7e1a1ee89ab4445e35b5d
Summary:
Adds a new transform to the NNC compiler, which adds support for buffer access caching. All accesses within a provided scope are redirected to a cache which is initialized or written back as necessary at the boundaries of that scope. For TVM fans, this is essentially a combination of cache_reads and cache_writes. E.g. it can do this kind of thing:
Before:
```
for (int i = 0; i < 64; i++) {
  for (int j = 0; j < 64; j++) {
    A[i, j] = i * j;
  }
}
for (int i_1 = 0; i_1 < 20; i_1++) {
  for (int j_1 = 0; j_1 < 10; j_1++) {
    B[i_1, j_1] = (A(i_1 + 30, j_1 + 40)) + (A(i_1 + 31, j_1 + 41));
  }
}
```
After `cacheAccesses(A->buf(), "A_local", j_loop);`
```
for (int i = 0; i < 64; i++) {
  for (int j = 0; j < 64; j++) {
    A[i, j] = i * j;
  }
}
for (int i_1 = 0; i_1 < 20; i_1++) {
  for (int i_2 = 0; i_2 < 2; i_2++) {
    for (int j_1 = 0; j_1 < 11; j_1++) {
      A_local[i_2, j_1] = A[(i_2 + i_1) + 30, j_1 + 40];
    }
  }
  for (int j_2 = 0; j_2 < 10; j_2++) {
    B[i_1, j_2] = (A_local[1, j_2 + 1]) + (A_local[0, j_2]);
  }
}
```
Or this reduction:
```
for (int l1 = 0; l1 < 4; l1++) {
  sum[l1] = 0.f;
  for (int n1_1 = 0; n1_1 < 3; n1_1++) {
    for (int m1_1 = 0; m1_1 < 2; m1_1++) {
      sum[l1] = (sum[l1]) + (scale[(6 * l1 + 2 * n1_1) + m1_1]);
    }
  }
}
```
After `l.cacheAccesses(d->buf(), "d_local", n_loop);`:
```
for (int l1 = 0; l1 < 4; l1++) {
  Allocate(d_local, float, {1});
  sum[l1] = 0.f;
  d_local[0] = 0.f;
  for (int n1_1 = 0; n1_1 < 3; n1_1++) {
    for (int m1_1 = 0; m1_1 < 2; m1_1++) {
      d_local[0] = (d_local[0]) + (scale[(6 * l1 + 2 * n1_1) + m1_1]);
    }
  }
  sum[l1] = (sum[l1]) + (d_local[0]);
  Free(d_local);
}
```
I had originally planned to write `cacheReads` and `cacheWrites` wrappers so we could use them just like their TVM cousins, but they just ended up being big masses of checking that reads or writes weren't present. Didn't feel too useful so I removed them, but let me know.
This is based on bounds inference and inherits a few bugs present in that functionality, which I will address in a followup.
While working on this I realized that it overlaps heavily with `computeAt`: which is really just `cacheReads` + `computeInline`. I'm considering refactoring computeAt to be a wrapper around those two transforms. ZolotukhinM opinions on this?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45869
Reviewed By: mruberry
Differential Revision: D24195276
Pulled By: nickgg
fbshipit-source-id: 36a58ae265f346903187ebc4923637b628048155
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45791
Most of the lowering for log1p and lgamma already existed, add JIT integration.
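A minimal sketch of the kind of TorchScript function these ops can now participate in, assuming the profiling executor and the tensor-expression fuser are enabled:
```python
import torch

@torch.jit.script
def f(x):
    return torch.log1p(x) * torch.lgamma(x)

x = torch.rand(1024)
for _ in range(3):  # profiling runs happen before the fused graph kicks in
    f(x)
```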
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24169536
Pulled By: eellison
fbshipit-source-id: a009c77a3471f3b5d378bad5de6d8e0880e9da3c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45790
Making sure that more tests invoke a run with a Fusion Group.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24169534
Pulled By: eellison
fbshipit-source-id: a2666df53fbb12c64571e960f59dbe94df2437e4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45789
Making sure that more tests invoke a run with a Fusion Group.
Test Plan: Imported from OSS
Reviewed By: Krovatkin
Differential Revision: D24169535
Pulled By: eellison
fbshipit-source-id: 54d7af434772ba52144b12d15d32ae30460c0c3c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45788
We were only running the traced graph once, which would not yet have been fused at that point. We should run for num_profiled_runs + 1, and also assert that all nodes in the graph were fused.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24169537
Pulled By: eellison
fbshipit-source-id: 8499bb1a5bd9d2221b1f1c54d6352558cf07ba9a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45847
Original PR here https://github.com/pytorch/pytorch/pull/45084. Created this one because I was having problems with ghstack.
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D24136629
Pulled By: heitorschueroff
fbshipit-source-id: dd7c7540a33f6a19e1ad70ba2479d5de44abbdf9
Summary:
This enables the cuda fuser on ROCm and enables tests for them.
Part of this patch is based on work of Rohith Nallamaddi, thank you.
Errors are my own, of course.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45965
Reviewed By: seemethere
Differential Revision: D24170457
Pulled By: walterddr
fbshipit-source-id: 3dd25b3501a41d2f00acba3ce8642ce51c49c9a6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45892
Previously we were using a hashtable (`std::unordered_map` in OSS, `folly::F14FastMap` in fb) for the workspace, a container for all the IValues in the graph. Hashtable-based lookups can be expensive. This diff replaces the hashtable with `std::vector`, and extra bookkeeping is introduced to keep track of the indices of graph inputs/outputs in `StaticRuntime` and op inputs/outputs in `ProcessedNode`.
Reviewed By: dzhulgakov
Differential Revision: D24098763
fbshipit-source-id: 337f835ee144985029b5fa2ab98f9bcc5e3606b6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45948
No functionality changes expected, it's just a preparation for further
changes in the LoopNest interface.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24156000
Pulled By: ZolotukhinM
fbshipit-source-id: f95ab07aac0aba128bc4ed5376a3251ac9c31c06
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45946
Also, make these functions static - they are not using anything from
`LoopNest` and can be applied to any `Stmt`.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24156002
Pulled By: ZolotukhinM
fbshipit-source-id: 1c7d205f85a2a1684e07eb836af662f10d0a50fc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45936
`Tensor` has been a view into a `Function` that was supposed to be used
for a more general case when we have multiple computations over the same
domain (aka multiple output functions). We have never got to a point
where we need this and now have other ideas in mind on how to support
this case if need be. For now, let's just nuke `Function` to reduce the
overall system complexity.
The change should not affect any existing behavior.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24153214
Pulled By: ZolotukhinM
fbshipit-source-id: 26d5f11db5d661ff5e1135f4a49eff1c6d4c1bd5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45900
Use `torch::cuda::nccl::all2all` from `ProcessGroupNCCL.cpp`
Fixes https://github.com/pytorch/pytorch/issues/42517
Here is a NCCL dependency graph:
```
libnccl.a --> libtorch_cuda.so ---> libtorch_python.so
    |                                       ^
    |                                       |
    --------> libc10d.a --------------------
```
When a static library is linked into a dynamic library or an executable, the linker removes all unused/duplicate symbols from that library, unless the `-whole-archive` option is used. Before https://github.com/pytorch/pytorch/pull/42514, all NCCL calls made from `ProcessGroupNCCL.cpp` were also made from `torch/csrc/cuda/nccl.cpp`, which is compiled as part of `libtorch_cuda.so`.
But adding `ncclSend`/`ncclRecv` to `ProcessGroupNCCL.cpp` forced the linker to embed those into `libtorch_python.so`, which also resulted in linking other dependent symbols into the library.
This PR adds `nccl[Send|Recv]` call to `torch_cuda.so` by implementing `all2all` in `torch_cuda` and thus avoids double linking the static library.
A more involved, but less error-prone, solution would be to use the wrappers exported in the `torch::cuda::nccl` namespace instead of making direct NCCL API calls.
Test Plan: Imported from OSS
Reviewed By: mingzhe09088
Differential Revision: D24138011
Pulled By: malfet
fbshipit-source-id: 33305197fc7d8707b7fd3a66b543f7733b9241a1
Summary:
This is a rewrite of the Registerizer, supporting scalar replacement in *vastly* more situations. As a refresher, the registerizer does this:
Before:
```
A[0] = 0;
for (int x = 0; x < 10; x++) {
  A[0] = (A[0]) + x;
}
```
After:
```
int A_ = 0;
for (int x = 0; x < 10; x++) {
  A_ = x + A_;
}
A[0] = A_;
```
This can greatly reduce the number of accesses to main memory in a kernel. There are cases where doing this gets complicated, and the existing implementation bails out whenever encountering multiple partial overlaps of the same buffer, or conditional accesses under any circumstances. This makes it much less useful in the presence of complex (i.e. real-world, not example) kernels. This new version should work optimally in almost all cases (I have a few minor follow-ups).
I tested this version extensively, and found quite a few bugs in the original implementation I'd prefer not to back port fixes for - so I'm in favor of landing this even if we don't immediately see a perf win. I believe the killer app for this kind of optimization is fused reductions and we haven't enabled many examples of that yet.
It is safe to move two accesses of the same Tensor element to a local scalar Var if between all usages of the element there are no other Loads or Stores that may refer to it. In the comments I refer to this as overlapping the access, or "cutting" the existing AccessInfo. In the case where a candidate for registerization is cut, it may be possible to finalize the access early by writing it back to the Tensor and then create a new scalar variable after the overlapping access is complete. We will attempt to do this when it saves memory accesses.
There are a few cases that make this more challenging:
- For: Loops change the number of real usages of a buffer by the loop extent, but only if we can pull the definition and finalization of the scalar variable out of the loop block. For loops often create accesses which are conditional on a loop var and will overlap large ranges of elements.
E.g. Before:
```
A[0] = 2;
for (int x1 = 0; x1 < 10; x1++) {
A[0] = (A[0]) + x1;
}
for (int x2 = 1; x2 < 10; x2++) {
A[x2] = A[x2 - 1];
}
for (int x3 = 0; x3 < 10; x3++) {
A[0] = (A[0]) + x3;
}
```
After:
```
int A_1 = 2;
for (int x1 = 0; x1 < 10; x1++) {
  A_1 = A_1 + x1;
}
A[0] = A_1;
for (int x2 = 1; x2 < 10; x2++) {
  A[x2] = A[x2 - 1];
}
int A_2 = A[0];
for (int x3 = 0; x3 < 10; x3++) {
  A_2 = A_2 + x3;
}
A[0] = A_2;
```
- Cond: Conditions complicate lifting scalars out of internal scopes. Generally we cannot lift an access outside of a conditional scope unless there is already a reference to that same access at the higher scope, since we don't know if the condition was guarding an array access not safe at the higher scope. In the comments I refer to this as the condition "hiding" the access, and the outer access "unhiding" it.
E.g. this example:
```
if (x<5 ? 1 : 0) {
  A[x] = (A[x]) + 1;
}
A[x] = (A[x]) + 1;
if (x>5 ? 1 : 0) {
  A[x] = (A[x]) + 1;
}
```
The A[x] access can be registerized due to the unconditional access between the two conditions:
```
int A_1 = A[x];
if (x<5 ? 1 : 0) {
  A_1 = A_1 + 1;
}
A_1 = A_1 + 1;
if (x>5 ? 1 : 0) {
  A_1 = A_1 + 1;
}
A[x] = A_1;
```
But this example has no accesses that can be registerized:
```
if (x<5 ? 1 : 0) {
  A[x] = (A[x]) + 1;
}
if (x>5 ? 1 : 0) {
  A[x] = (A[x]) + 1;
}
```
- IfThenElse: Same situation as Cond, except since IfThenElse is an Expr rather than a Stmt we cannot insert the scalar definition or finalizer within the conditional scope. Accesses inside an IfThenElse can be safely combined with external accesses but cannot exist completely within.
E.g in this example the `B[x]` cannot be registerized as there is no safe place to define it.
```
A[x] = IfThenElse(x<3 ? 1 : 0, (B[x]) + (B[x]), B[x]);
```
But the equivalent kernel using Cond can be registerized:
```
if (x<3 ? 1 : 0) {
  float B_1 = B[x];
  A[x] = B_1 + B_1;
} else {
  A[x] = B[x];
}
```
- Let: Accesses dependent on local variables via Let Stmts, or loop vars, cannot be raised outside of the scope of the dependent var.
E.g. no accesses in this example can be registerized:
```
for (int x = 0; x < 10; x++) {
  int y = 30;
  A[y] = x + (A[y]);
}
```
But they can in this example:
```
int y = 30;
for (int x = 0; x < 10; x++) {
  A[y] = x + (A[y]);
}
```
**Testing**
The majority of this PR is tests, over 3k lines of them, because there are many different rules to consider and they can interact together more or less arbitrarily. I'd greatly appreciate any ideas for situations we could encounter that are not covered by the tests.
**Performance**
Still working on it, will update. In many FastRRNS sub kernels this diff reduces the number of total calls to Store or Load by 4x, but since those kernels use Concat very heavily (meaning a lot of branches) the actual number encountered by any particular thread on GPU is reduced only slightly. Overall perf improved by a very small amount.
Reductions are where this optimization should really shine, and in particular, the more complex the kernel gets (with extra fusions, etc.), the better this version of the registerizer should do compared to the existing version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45574
Reviewed By: albanD
Differential Revision: D24151517
Pulled By: nickgg
fbshipit-source-id: 9f0b2d98cc213eeea3fda16fee3d144d49fd79ae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45929
We were checking `and` when we should have been checking `or`.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24148804
Pulled By: eellison
fbshipit-source-id: 9c394ea10ac91a588169d934b1e3208512c71b9d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45857
Fix for https://github.com/pytorch/pytorch/issues/45627
Op was calling `insert` instead of `insert_or_assign`, so it wouldn't overwrite an existing key.
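A minimal sketch of the behavior being fixed; the assumption here is that the affected op is the scripted dict item-assignment path from the linked issue:
```python
import torch
from typing import Dict

@torch.jit.script
def overwrite(d: Dict[str, int]) -> Dict[str, int]:
    d["k"] = 1
    d["k"] = 2  # with insert_or_assign this now overwrites the existing value
    return d

print(overwrite({}))  # {'k': 2}
```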
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D24148805
Pulled By: eellison
fbshipit-source-id: bf39c71d5d928890b82cff1a9a0985dc47c1ffac
Summary:
Currently, a GraphRoot instance doesn't have an associated stream. Streaming backward synchronization logic assumes the instance ran on the default stream, and tells consumer ops to sync with the default stream. If the gradient the GraphRoot instance passes to consumer backward ops was populated on a non-default stream, we have a race condition.
The race condition can exist even if the user doesn't give a manually populated gradient:
```python
with torch.cuda.stream(side_stream):
# loss.backward() implicitly synthesizes a one-element 1.0 tensor on side_stream
# GraphRoot passes it to consumers, but consumers first sync on default stream, not side_stream.
loss.backward()
# Internally to backward(), streaming-backward logic takes over, stuff executes on the same stream it ran on in forward,
# and the side_stream context is irrelevant. GraphRoot's interaction with its first consumer(s) is the spot where
# the side_stream context causes a problem.
```
This PR fixes the race condition by associating a GraphRoot instance, at construction time, with the current stream(s) on the device(s) of the grads it will pass to consumers. (I think this relies on GraphRoot executing in the main thread, before backward thread(s) fork, because the grads were populated on the main thread.)
The test demonstrates the race condition. It fails reliably without the PR's GraphRoot diffs and passes with the GraphRoot diffs.
With the GraphRoot diffs, manually populating an incoming-gradient arg for `backward` (or `torch.autograd.grad`) and the actual call to `autograd.backward` will have the same stream-semantics relationship as any other pair of ops:
```python
# implicit population is safe
with torch.cuda.stream(side_stream):
loss.backward()
# explicit population in side stream then backward in side stream is safe
with torch.cuda.stream(side_stream):
kickoff_grad = torch.ones_like(loss)
loss.backward(gradient=kickoff_grad)
# explicit population in one stream then backward kickoff in another stream
# is NOT safe, even with this PR's diffs, but that unsafety is consistent with
# stream-semantics relationship of any pair of ops
kickoff_grad = torch.ones_like(loss)
with torch.cuda.stream(side_stream):
loss.backward(gradient=kickoff_grad)
# Safe, as you'd expect for any pair of ops
kickoff_grad = torch.ones_like(loss)
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
loss.backward(gradient=kickoff_grad)
```
This PR also adds the last three examples above to cuda docs and references them from autograd docstrings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45787
Reviewed By: nairbv
Differential Revision: D24138376
Pulled By: albanD
fbshipit-source-id: bc4cd9390f9f0358633db530b1b09f9c1080d2a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45665
Fixes #43944
Note that the codegen doesn't use a proper parser so, in the same way as with lists, the string `, ` cannot appear in defaults or it will be interpreted as a splitting point between arguments.
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D24141835
Pulled By: ezyang
fbshipit-source-id: 578127861fd2504917f4486c44100491a2c40343
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45639
`StaticRuntime::run_individual` mimics the caffe2 operator benchmark `SimpleNet::TEST_Benchmark`, so we can get accurate information on the operator breakdown. We found that the PyTorch AutogradProfiler adds a lot of overhead to small models, such as the adindexer precomputation_merge net: 100% for batch_size 1, 33% for batch_size 20. This implementation adds very little overhead, as shown in the test plan.
Test Plan: Test results are fb internal only.
Reviewed By: yinghai, dzhulgakov
Differential Revision: D24012088
fbshipit-source-id: f32eb420aace93e2de421a15e4209fce6a3d90f0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45262
**Summary**
This commit adds an API for ignoring arbitrary module attributes during
scripting. A class attribute named `ignored_attributes` containing names
of attributes to ignore can be added to the class of the instance being
scripted. Attributes ignored in this fashion cannot be used in
`forward`, methods used by `forward` or by `exported` methods. They
are, however, copied to the `RecursiveScriptModule` wrapper and can be
used by `ignored` methods and regular Python code.
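A minimal sketch of the API as described above; the exact spelling of the class attribute is taken from this summary and should be treated as an assumption:
```python
import torch

class M(torch.nn.Module):
    ignored_attributes = ["debug_blob"]  # names of attributes to skip during scripting

    def __init__(self):
        super().__init__()
        self.debug_blob = object()  # not scriptable, but ignored by the compiler

    def forward(self, x):
        return x + 1

scripted = torch.jit.script(M())
print(scripted.debug_blob)  # still available from regular Python code
```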
**Test Plan**
This commit adds unit tests to `TestScriptPy3` to test this new API.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D23971882
Pulled By: SplitInfinity
fbshipit-source-id: 8c81fb415fde7b78aa2f87e5d83a477e876a7cc3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45899
Use function polymorphism to avoid repeated casts.
I.e., instead of using `NCCL_CHECK(from_nccl_result(...))`, add a variant of the function that takes `ncclResult_t` as an input argument.
Add a non-pointer variant of `to_nccl_comm` to avoid the `*to_nccl_comm(&comm)` pattern.
Test Plan: Imported from OSS
Reviewed By: walterddr
Differential Revision: D24138012
Pulled By: malfet
fbshipit-source-id: 7f62a03e108cbe455910e86e894afdd1c27e8ff1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45867
In most cases the lock ordering was: hold a lock in local autograd and
then hold a lock in DistAutogradContext.
In the case of `set_exception_without_signal` the lock order was reversed, and as
a result we saw potential deadlock issues in our TSAN tests. To fix this, I
removed the lock and instead just used std::atomic exchange.
In addition to this, I fixed TestE2E to ensure that we use the appropriate
timeout.
TestE2EProcessGroup was flaky for these two reasons and now is fixed.
ghstack-source-id: 113592709
Test Plan: waitforbuildbot.
Reviewed By: albanD
Differential Revision: D24120962
fbshipit-source-id: 12447b84ceae772b91e9a183c90d1e6340f44e66
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44921
This diff adds support for process group point-to-point operations on the NCCL backend, based on ncclSend/ncclRecv. See https://github.com/pytorch/pytorch/issues/43995 for more context.
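A minimal sketch of the user-facing calls this enables, assuming a two-rank NCCL process group is already initialized and each rank has selected its own CUDA device:
```python
import torch
import torch.distributed as dist

rank = dist.get_rank()
t = torch.ones(4, device="cuda")
if rank == 0:
    dist.send(t, dst=1)   # point-to-point send on the NCCL backend
else:
    dist.recv(t, src=0)   # matching receive on rank 1
```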
ghstack-source-id: 113592785
Test Plan: unittest
Reviewed By: jiayisuse
Differential Revision: D23709848
fbshipit-source-id: cdf38050379ecbb10450f3394631317b41163258
Summary:
Instead of dynamically loading `caffe2_nvrtc`, lazyNVRTC provides the same functionality by binding all the hooks to a lazy-bind implementation, very similar to shared-library jump tables:
On the first call, each function from the list tries to get a global handle to the respective shared library and replaces itself with the dynamically resolved symbol, using the following template:
```
auto fn = reinterpret_cast<decltype(&NAME)>(getCUDALibrary().sym(C10_SYMBOLIZE(NAME)));
if (!fn)
  throw std::runtime_error("Can't get " C10_SYMBOLIZE(NAME));
lazyNVRTC.NAME = fn;
return fn(...);
```
Fixes https://github.com/pytorch/pytorch/issues/31985
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45674
Reviewed By: ezyang
Differential Revision: D24073946
Pulled By: malfet
fbshipit-source-id: 1479a75e5200e14df003144625a859d312885874
Summary:
* Add a pass at the end of runCleanupPasses to annotate `aten::warn` so that each has its unique id
* Enhanced the interpreter so that it tracks which `aten::warn` instructions have been executed before and skips them (see the sketch after this list)
* Improved insertInstruction so that it correctly checks for overflow
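A minimal sketch of the observable change, assuming the new per-instruction tracking matches Python's default "warn once" semantics:
```python
import warnings
import torch

@torch.jit.script
def f(x):
    warnings.warn("x is expected to be positive")
    return x * 2

for _ in range(3):
    f(torch.ones(2))  # the TorchScript warning is now reported once, not three times
```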
Fixes https://github.com/pytorch/pytorch/issues/45108
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45382
Reviewed By: mrshenli
Differential Revision: D24060677
Pulled By: gmagogsfm
fbshipit-source-id: 9221bc55b9ce36b374bdf614da3fe47496b481c1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45726
FB has an old internal platform that uses some random llvm version
that looks sort of like llvm 7. I've guarded that with the appropriate
LLVM_VERSION_PATCH.
I've also swapped out some of our uses of ThreadSafeModule/ThreadSafeContext
for the variants without ThreadSafe in the name. As far as I can tell we
weren't using the bundled locks anyways, but I'm like 85% sure this is OK since
we compile under the Torch JIT lock anyways.
Test Plan: unit tests
Reviewed By: ZolotukhinM, asuhan
Differential Revision: D24072697
fbshipit-source-id: 7f56b9f3cbe5e6d54416acdf73876338df69ddb2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45543
This PR adds documentation for the c10d Store to the public docs. Previously these docs were missing although we exposed a lightly-used (but potentially useful) Python API for our distributed key-value store.
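A minimal sketch of the Python key-value store API being documented, using a single-process TCPStore for illustration:
```python
from datetime import timedelta
import torch.distributed as dist

store = dist.TCPStore("127.0.0.1", 29500, world_size=1, is_master=True,
                      timeout=timedelta(seconds=30))
store.set("first_key", "first_value")
print(store.get("first_key"))  # b'first_value'
```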
ghstack-source-id: 113409195
Test Plan: Will verify screenshots by building the docs.
Reviewed By: pritamdamania87
Differential Revision: D24005598
fbshipit-source-id: 45c3600e7c3f220710e99a0483a9ce921d75d044
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45464
Usage of Symbols to find arguments requires one to generate a nonsense symbol for inputs which don't already have one. The intention of Symbols appears to be something of an interned string, but the namespace component doesn't apply to an argument. In order to access the arguments by name without adding new symbols, versions of those functions with std::string input were added. These can be proven valid based on the existing codepath. Additionally, a hasNamedInput convenience function was added to remove the necessity of a try/catch block in user code.
The primary motivation is to be able to easily handle the variable number of arguments in glow, so that the arange op may be implemented.
Reviewed By: eellison
Differential Revision: D23972315
fbshipit-source-id: 3e0b41910cf07e916186f1506281fb221725a91b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44678
This is a prototype PR that introduces 4 bit qtensors. The new dtype added for this is c10::quint4x2
The underlying storage for this is still uint8_t, so we pack 2 4-bit values in a byte while quantizing it.
This change uses most of the existing scaffolding for qtensor storage. We allocate storage
based on the dtype before creating a new qtensor.
It also adds a dispatch mechanism for this dtype so we can use this to get the bitwidth, qmin and qmax info
while quantizing and packing the qtensor (when we add 2-bit qtensor)
Kernels that use this dtype should be aware of the packing format.
Test Plan:
Locally tested
```
x = torch.ones((100, 100), dtype=torch.float)
qx_8bit = torch.quantize_per_tensor(x, scale=1.0, zero_point=2, dtype=torch.quint8)
qx = torch.quantize_per_tensor(x, scale=1.0, zero_point=2, dtype=torch.quint4x2)
torch.save(x, "temp.p")
print('Size float (B):', os.path.getsize("temp.p"))
os.remove('temp.p')
torch.save(qx_8bit, "temp.p")
print('Size quantized 8bit(B):', os.path.getsize("temp.p"))
os.remove('temp.p')
torch.save(qx, "temp.p")
print('Size quantized 4bit(B):', os.path.getsize("temp.p"))
os.remove('temp.p')
```
Size float (B): 40760
Size quantized 8bit(B): 10808
Size quantized 4bit(B): 5816
Imported from OSS
Reviewed By: raghuramank100
Differential Revision: D23993134
fbshipit-source-id: 073bf262f9680416150ba78ed2d932032275946d
Summary:
This modifies the default bailout depth to 20, which gives us reasonable performance in the benchmarks we considered (fastrnns, maskrcnn, hub/benchmark, etc.).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45710
Reviewed By: robieta
Differential Revision: D24071861
Pulled By: Krovatkin
fbshipit-source-id: 472aacc136f37297b21f577750c1d60683a6c81e
Summary:
We are trying to build libtorch statically (BUILD_SHARED_LIBS=OFF) then link it into a DLL. Our setup hits the infinite loop mentioned [here](54c05fa34e/torch/csrc/autograd/engine.cpp (L228)) because we build with `BUILD_SHARED_LIBS=OFF` but still link it all into a DLL at the end of the day.
This PR fixes the issue by changing the condition to guard on which windows runtime the build links against using the `CAFFE2_USE_MSVC_STATIC_RUNTIME` flag. `CAFFE2_USE_MSVC_STATIC_RUNTIME` defaults to ON when `BUILD_SHARED_LIBS=OFF`, so backwards compatibility is maintained.
I'm not entirely confident I understand the subtleties of the windows runtime versus linking setup, but this setup works for us and should not affect the existing builds.
Fixes https://github.com/pytorch/pytorch/issues/44470
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43532
Reviewed By: mrshenli
Differential Revision: D24053767
Pulled By: albanD
fbshipit-source-id: 1127fefe5104d302a4fc083106d4e9f48e50add8
Summary:
* Support propagating `dim_param` in ONNX by encoding it as a `ShapeSymbol` in the `SymbolicShape` of outputs. If export is called with `dynamic_axes` provided, shape inference starts with these axes set as dynamic (see the export sketch after this list).
* Add new test file `test_pytorch_onnx_shape_inference.py`, reusing all test cases from `test_pytorch_onnx_onnxruntime.py`, but focusing on validating shapes for all nodes in the graph. Currently this is not enabled in CI, since there are still quite a few existing issues and corner cases to fix. The test defaults to running only at opset 12.
* Bug fixes, such as div, _len, and peephole.cpp passes for PackPadded, and LogSoftmaxCrossEntropy.
* This PR depends on existing PR such as 44332.
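A minimal export sketch showing `dynamic_axes`, which shape inference now treats as symbolic dimensions (the model and names here are illustrative):
```python
import torch

model = torch.nn.Linear(4, 2)
x = torch.randn(3, 4)
torch.onnx.export(
    model, (x,), "linear.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
    opset_version=12,
)
```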
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44920
Reviewed By: eellison
Differential Revision: D23958398
Pulled By: bzinodev
fbshipit-source-id: 00479d9bd19c867d526769a15ba97ec16d56e51d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45343
The current default dynamic quant observer is not correct, since we don't accumulate
min/max and we don't need to calculate qparams.
Test Plan: Imported from OSS
Reviewed By: supriyar
Differential Revision: D23933995
fbshipit-source-id: 3ff497c9f5f74c687e8e343ab9948d05ccbba09b
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45586
Test Plan: The unit test has been softened to be less platform sensitive.
Reviewed By: mruberry
Differential Revision: D24025415
Pulled By: robieta
fbshipit-source-id: ee986933b984e736cf1525e1297de6b21ac1f0cf
Summary:
This is an attempt at refactoring the `torch.distributed` implementation. The goal is to push the Python layer's global states (like `_default_pg`) into the C++ layer so that `torch.distributed` becomes more TorchScript-friendly.
This PR adds the skeleton of the C++ implementation; at the moment it is not included in any build (and won't be until the method implementations are filled in). If you see any related test failures, feel free to revert.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45547
Reviewed By: izdeby
Differential Revision: D24024213
Pulled By: gmagogsfm
fbshipit-source-id: 2762767f63ebef43bf58e17f9447d53cf119f05f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45585
I discovered this bug when I was trying to print the graph to a file. Turns out I had to close the file, but flushing should be a good safeguard in case other users forget.
Test Plan:
Tested with and without flushing.
with P144064292
without P144064767
Reviewed By: mortzur
Differential Revision: D24023819
fbshipit-source-id: 39574b3615feb28e5b5939664c04ddfb1257706a
Summary:
Export of embedding bag with dynamic list of offsets.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44693
Reviewed By: malfet
Differential Revision: D23831980
Pulled By: bzinodev
fbshipit-source-id: 3eaff1a0f20d1bcfb8039e518d78c491be381e1a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45377
This PR adds a C++ implementation of the TripletMarginWithDistanceLoss, for which the Python implementation was introduced in PR #43680. It's based on PR #44072, but I'm resubmitting this to unlink it from Phabricator.
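For reference, a minimal sketch of the Python API (from #43680) that this C++ implementation mirrors:
```python
import torch
import torch.nn as nn

loss_fn = nn.TripletMarginWithDistanceLoss(
    distance_function=nn.PairwiseDistance(), margin=1.0)
anchor, positive, negative = (torch.randn(8, 128, requires_grad=True) for _ in range(3))
loss = loss_fn(anchor, positive, negative)
loss.backward()
```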
Test Plan: Imported from OSS
Reviewed By: izdeby
Differential Revision: D24003973
fbshipit-source-id: 2d9ada7260a6f27425ff2fdbbf623dad0fb79405
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44826
As described in https://github.com/pytorch/pytorch/issues/43690, there
is a need for DDP to be able to ignore certain parameters in the module (not
install allreduce hooks) for certain use cases. `find_unused_parameters` is
sufficient from a correctness perspective, but we can get better performance
with this upfront list if users know which params are unused, since we won't
have to traverse the autograd graph every iteration.
To enable this, we add a field `parameters_to_ignore` to DDP init and don't
pass in that parameter to reducer if that parameter is in the given list.
ghstack-source-id: 113210109
Test Plan: Added unittest
Reviewed By: xw285cornell, mrshenli
Differential Revision: D23740639
fbshipit-source-id: a0411712a8b0b809b9c9e6da04bef2b955ba5314
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45461
This PR disables autograd for all C -> C, R -> C functions which are not included in the whitelist `GRADIENT_IMPLEMENTED_FOR_COMPLEX`. In practice, there will be a RuntimeError during forward computation when the outputs are differentiable:
```
>>> x=torch.randn(4, 4, requires_grad=True, dtype=torch.cdouble)
>>> x.pow(3)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: pow does not support automatic differentiation for outputs with complex dtype.
```
The implicit assumption here is that all the C -> R functions have correct backward definitions. So before merging this PR, the following functions must be tested and verified to have correct backward definitions:
`torch.abs` (updated in #39955 ), `torch.angle`, `torch.norm`, `torch.irfft`, `torch.istft`.
Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D23998156
Pulled By: anjali411
fbshipit-source-id: 370eb07fe56ac84dd8e2233ef7bf3a3eb8aeb179
Summary:
This PR allows Timer to collect deterministic instruction counts for (some) snippets. Because of the intrusive nature of Valgrind (effectively replacing the CPU with an emulated one) we have to perform our measurements in a separate process. This PR writes a `.py` file containing the Timer's `setup` and `stmt`, and executes it within a `valgrind` subprocess along with a plethora of checks and error handling. There is still a bit of jitter around the edges due to the Python glue that I'm using, but the PyTorch signal is quite good and thus this provides a low friction way of getting signal. I considered using JIT as an alternative, but:
A) Python specific overheads (e.g. parsing) are important
B) JIT might do rewrites which would complicate measurement.
Consider the following bit of code, related to https://github.com/pytorch/pytorch/issues/44484:
```
from torch.utils._benchmark import Timer
counts = Timer(
    "x.backward()",
    setup="x = torch.ones((1,)) + torch.ones((1,), requires_grad=True)"
).collect_callgrind()
for c, fn in counts[:20]:
    print(f"{c:>12} {fn}")
```
```
812800 ???:_dl_update_slotinfo
355600 ???:update_get_addr
308300 work/Python/ceval.c:_PyEval_EvalFrameDefault'2
304800 ???:__tls_get_addr
196059 ???:_int_free
152400 ???:__tls_get_addr_slow
138400 build/../c10/core/ScalarType.h:c10::typeMetaToScalarType(caffe2::TypeMeta)
126526 work/Objects/dictobject.c:_PyDict_LoadGlobal
114268 ???:malloc
101400 work/Objects/unicodeobject.c:PyUnicode_FromFormatV
85900 work/Python/ceval.c:_PyEval_EvalFrameDefault
79946 work/Objects/typeobject.c:_PyType_Lookup
72000 build/../c10/core/Device.h:c10::Device::validate()
70000 /usr/include/c++/8/bits/stl_vector.h:std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector()
66400 work/Objects/object.c:_PyObject_GenericGetAttrWithDict
63000 ???:pthread_mutex_lock
61200 work/Objects/dictobject.c:PyDict_GetItem
59800 ???:free
58400 work/Objects/tupleobject.c:tupledealloc
56707 work/Objects/dictobject.c:lookdict_unicode_nodummy
```
Moreover, if we backport this PR to 1.6 (just copy the `_benchmarks` folder) and load those counts as `counts_1_6`, then we can easily diff them:
```
print(f"Head instructions: {sum(c for c, _ in counts)}")
print(f"1.6 instructions: {sum(c for c, _ in counts_1_6)}")
count_dict = {fn: c for c, fn in counts}
for c, fn in counts_1_6:
_ = count_dict.setdefault(fn, 0)
count_dict[fn] -= c
count_diffs = sorted([(c, fn) for fn, c in count_dict.items()], reverse=True)
for c, fn in count_diffs[:15] + [["", "..."]] + count_diffs[-15:]:
print(f"{c:>8} {fn}")
```
```
Head instructions: 7609547
1.6 instructions: 6059648
169600 ???:_dl_update_slotinfo
101400 work/Objects/unicodeobject.c:PyUnicode_FromFormatV
74200 ???:update_get_addr
63600 ???:__tls_get_addr
46800 work/Python/ceval.c:_PyEval_EvalFrameDefault
33512 work/Objects/dictobject.c:_PyDict_LoadGlobal
31800 ???:__tls_get_addr_slow
31700 build/../aten/src/ATen/record_function.cpp:at::RecordFunction::RecordFunction(at::RecordScope)
28300 build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionSignature::parse(_object*, _object*, _object*, _object**, bool)
27800 work/Objects/object.c:_PyObject_GenericGetAttrWithDict
27401 work/Objects/dictobject.c:lookdict_unicode_nodummy
24115 work/Objects/typeobject.c:_PyType_Lookup
24080 ???:_int_free
21700 work/Objects/dictobject.c:PyDict_GetItemWithError
20700 work/Objects/dictobject.c:PyDict_GetItem
...
-3200 build/../c10/util/SmallVector.h:at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&, bool)
-3400 build/../aten/src/ATen/native/TensorIterator.cpp:at::TensorIterator::resize_outputs(at::TensorIteratorConfig const&)
-3500 /usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:std::unique_lock<std::mutex>::unlock()
-3700 build/../torch/csrc/utils/python_arg_parser.cpp:torch::PythonArgParser::raw_parse(_object*, _object*, _object**)
-4207 work/Objects/obmalloc.c:PyMem_Calloc
-4500 /usr/include/c++/8/bits/stl_vector.h:std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector()
-4800 build/../torch/csrc/autograd/generated/VariableType_2.cpp:torch::autograd::VariableType::add__Tensor(at::Tensor&, at::Tensor const&, c10::Scalar)
-5000 build/../c10/core/impl/LocalDispatchKeySet.cpp:c10::impl::ExcludeDispatchKeyGuard::ExcludeDispatchKeyGuard(c10::DispatchKey)
-5300 work/Objects/listobject.c:PyList_New
-5400 build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionParameter::check(_object*, std::vector<pybind11::handle, std::allocator<pybind11::handle> >&)
-5600 /usr/include/c++/8/bits/std_mutex.h:std::unique_lock<std::mutex>::unlock()
-6231 work/Objects/obmalloc.c:PyMem_Free
-6300 work/Objects/listobject.c:list_repeat
-11200 work/Objects/listobject.c:list_dealloc
-28900 build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionSignature::parse(_object*, _object*, _object**, bool)
```
Remaining TODOs:
* Include a timer in the generated script for cuda sync.
* Add valgrind to CircleCI machines and add a unit test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44717
Reviewed By: soumith
Differential Revision: D24010742
Pulled By: robieta
fbshipit-source-id: df6bc765f8efce7193893edba186cd62b4b23623
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45482
Working on some models that need these ops on lite interpreter.
Test Plan: locally build and load/run the TS model without problem.
Reviewed By: iseeyuan
Differential Revision: D23906581
fbshipit-source-id: 01b9de2af2046296165892b837bc14a7e5d59b4e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45520
With this change `Load`s and `Store`s no longer accept `Placeholder`s in
their constructor and `::make` functions and can only be built with
`Buf`.
`Placeholder` gets its own `store`, `load`, `storeWithMask`, and
`loadWithMask` methods for more convenient construction.
Test Plan: Imported from OSS
Reviewed By: glaringlee
Differential Revision: D23998789
Pulled By: ZolotukhinM
fbshipit-source-id: 3fe018e00c1529a563553b2b215f403b34aea912
Summary:
This might be an alternative to reverting https://github.com/pytorch/pytorch/issues/45396 .
The obvious rough edge is that I'm not really seeing the work group limits that TensorExpr produces.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45506
Reviewed By: zhangguanheng66
Differential Revision: D23991410
Pulled By: Krovatkin
fbshipit-source-id: 11d3fc4600e4bffb1d1192c6b8dd2fe22c1e064e
Summary:
Updated `cholesky_backward` to work correctly for complex input.
Note that the current implementation gives the conjugate of what JAX would return. anjali411, is that the correct thing to do?
Ref. https://github.com/pytorch/pytorch/issues/44895
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45267
Reviewed By: bwasti
Differential Revision: D23975269
Pulled By: anjali411
fbshipit-source-id: 9908b0bb53c411e5ad24027ff570c4f0abd451e6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45488
model_name logging was broken; the issue comes from the recent change that assigned the method name to the module name, and this diff fixes it.
ghstack-source-id: 113103942
Test Plan:
made sure that now the model_name is logged from module_->name().
verified with one model which does not contain the model metadata, and the model_name field is logged as below:
09-28 21:59:30.065 11530 12034 W module.cpp: TESTINGTESTING run() module = __torch__.Model
09-28 21:59:30.065 11530 12034 W module.cpp: TESTINGTESTING metadata does not have model_name assigning to __torch__.Model
09-28 21:59:30.066 11530 12034 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod log model_name = __torch__.Model
09-28 21:59:30.066 11530 12034 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod log method_name = labels
09-28 21:59:30.068 11530 12034 W MobileModuleQPLObserver.cpp: TESTINGTESTING onExitRunMethod()
Reviewed By: linbinyu
Differential Revision: D23984165
fbshipit-source-id: 5b00f50ea82106b695c2cee14029cb3b2e02e2c8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45479
Add a top-level boolean attribute called mobile_optimized to the model that is set to true if it is optimized.
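A minimal sketch of how the attribute can be checked after optimization; the attribute name comes from this summary, and the model here is just an illustrative placeholder:
```python
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

scripted = torch.jit.script(torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.ReLU()))
optimized = optimize_for_mobile(scripted)
print(getattr(optimized, "mobile_optimized", False))  # True after optimization
```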
Test Plan: buck test //caffe2/test:mobile passes
Reviewed By: kimishpatel
Differential Revision: D23956728
fbshipit-source-id: 79c5931702208b871454319ca2ab8633596b1eb8