Summary:
Possibly fixes https://github.com/pytorch/pytorch/issues/46764.
In many places, computing the number of tensor elements is written as
```
int64_t numel = std::accumulate(oldshape.begin(), oldshape.end(), 1,
                                std::multiplies<int64_t>());
```
This computes the product using the type of the `1` literal, which is `int`. When there are more than INT_MAX elements, the result overflows. In https://github.com/pytorch/pytorch/issues/46746, the tensor that was sent to reshape had 256^4 elements, so its element count was computed as `0` and the reshape was not done correctly.
I've audited the usages of std::accumulate and changed them to use int64_t as the `init` type.
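For reference, a minimal sketch of the corrected pattern (variable and function names are illustrative):
```
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Using an int64_t init value keeps the whole accumulation in 64 bits, so
// shapes with more than INT_MAX elements no longer wrap around to 0.
int64_t numel_of(const std::vector<int64_t>& shape) {
  return std::accumulate(shape.begin(), shape.end(),
                         static_cast<int64_t>(1),
                         std::multiplies<int64_t>());
}
```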
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46997
Reviewed By: albanD
Differential Revision: D24624654
Pulled By: ngimel
fbshipit-source-id: 3d9c5e6355531a9df6b10500eec140e020aac77e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46967
Tests under `test/distributed/_pipeline/sync` use pytest, and
specifying the `-f` option for such tests, e.g. `python test/run_test.py
-i distributed/_pipeline/sync/skip/test_api -- -f`, doesn't work.
The equivalent pytest option is `-x`. To resolve this issue, I've updated
`run_test.py` to replace `-f` with `-x` for pytest-based tests.
More details in https://github.com/pytorch/pytorch/issues/46782
Closes: https://github.com/pytorch/pytorch/issues/46782
ghstack-source-id: 115440558
Test Plan:
1) waitforbuildbot
2) `python test/run_test.py -i distributed/_pipeline/sync/skip/test_api -- -f`
Reviewed By: malfet
Differential Revision: D24584556
fbshipit-source-id: bd87f5b4953504e5659fe72fc8615e126e5490ff
Summary:
Recently the ort-nightly package has become unstable and is causing issues with CI tests. Switching to the release package for now for stability, until the situation improves.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46595
Reviewed By: houseroad
Differential Revision: D24566175
Pulled By: bzinodev
fbshipit-source-id: dcf36e976daeeb17465df88f28bc9673eebbb7b7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47002
There was no good reason for TypeDerived.h (CPUType.h) codegen
to exist after static dispatch was deleted, and now that we
have the Math alias key, the TypeDefault.h header is not needed either.
Sorry to anyone who was using these out of tree.
I didn't entirely delete TypeDefault.h as it has a use in
a file that I can't conveniently compile-test locally. I will
kill it entirely in a follow-up.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D24596583
Pulled By: ezyang
fbshipit-source-id: b5095d3509098ff74f836c5d0c272db0b2d226aa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47000
There is a new invariant that emit_body is only ever called when
strategy is 'use_derived', which means we can delete a bunch of code.
This removes the last use of TypeXXX.h headers.
Note that this change makes sense, as the TypeDefault entries are
registered as Math entries, which means they automatically populate
Autograd (and we no longer have to register them ourselves). Ailing
did all the hard work; this is just the payoff.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D24596584
Pulled By: ezyang
fbshipit-source-id: 6fa754b5f16e75cf2dcbf437887c0fdfda5e44b1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46991
This change is motivated by a problem bdhirsh observed, which is
that in internal builds that include both SchemaRegister.cpp
and TypeDefault.cpp, some operators have their schemas defined
multiple times. Instead of dumping schema registrations into
multiple files, it seems better to just toggle how many schemas
we write into TypeDefault.cpp.
ljk53 observes that technically SchemaRegister.cpp is only needed by
the full-JIT frontend, and not by the light interpreter (to resolve schema
lookups). However, in practice, the registration file seems to be
unconditionally loaded. This change will make it harder to do the
optimization where we drop schemas in the light interpreter, but you
probably want to architect this differently (similar to per-op
registrations, DON'T do any registrations in ATen, and then write out
the schema registrations in a separate library).
I took this opportunity to also simplify the TypeDefault generation
logic by reworking things so that we only ever call with None argument
when registering. Soon, we should be able to just split these
files up entirely.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: ljk53
Differential Revision: D24593704
Pulled By: ezyang
fbshipit-source-id: f01ea22a3999493da77b6e254d188da0ce9adf2f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46970
Now that catchall declarations are reinterpreted as registrations to the
Math dispatch key, we can simplify the code generation logic by directly
generating to Math and bypassing the catchall logic. This also helps
avoid bugs where we incorrectly classify some kernels as Math and others
as not, even though they get registered in the same way.
Bill of changes:
- Give Math its own unique TORCH_LIBRARY_IMPL
- Make it so NativeFunction.dispatch is always non-None. Simplify
downstream conditionals accordingly
- When parsing NativeFunction, fill in missing dispatch with a
singleton Math entry (pointing to the cpp.name!)
One thing that is a little big about this change is that a lot of kernels
which previously didn't report as "math" now report as math. I picked
a setting for these booleans that made sense to me, but I'm not sure
if e.g. XLA will handle it 100% correctly.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D24592391
Pulled By: ezyang
fbshipit-source-id: 2e3355f19f9525698864312418df08411f30a85d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46938
It turns out that after https://github.com/pytorch/pytorch/pull/42194
landed we no longer actually generate any registrations into this
file. That means it's completely unnecessary.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: IvanKobzarev
Differential Revision: D24573518
Pulled By: ezyang
fbshipit-source-id: b41ada9e394b780f037f5977596a36b896b5648c
Summary:
Enables the use of NoneType arguments in the inputs tuple passed to the export API
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45792
Reviewed By: heitorschueroff
Differential Revision: D24312784
Pulled By: bzinodev
fbshipit-source-id: 1717e856b56062add371af7dc09cdd9c7b5646da
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46755
As reported in https://github.com/pytorch/pytorch/issues/41324, there is a bug in DDP when `find_unused_parameters=True` and 2 or more parameters share the same gradient accumulator.
In the reducer, we currently keep a mapping of grad accumulator to index and populate it with `map[accumulator] = index`, but this overwrites indices when the accumulator is the same. To fix this, switch the mapping values to a vector of indices so that all indices sharing the same accumulator are kept.
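A minimal sketch of the data-structure change (type and variable names are illustrative, not the actual reducer code):
```
#include <cstddef>
#include <unordered_map>
#include <vector>

struct GradAccumulator {};  // stand-in for the autograd gradient accumulator node

// Before: map[accumulator] = index, so a second parameter sharing the same
// accumulator silently overwrote the first index.
// After: every index that shares the accumulator is kept.
using AccToIndices =
    std::unordered_map<const GradAccumulator*, std::vector<std::size_t>>;

void record(AccToIndices& map, const GradAccumulator* acc, std::size_t index) {
  map[acc].push_back(index);
}
```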
ghstack-source-id: 115453567
Test Plan: Added UT
Reviewed By: pritamdamania87
Differential Revision: D24497388
fbshipit-source-id: d32dfa9c5cd0b7a8df13c7873d5d28917b766640
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46469
There are some "fast_pass" function calls, where the symbols in `ATen/native` are directly referenced from outside of native at linking stage. This PR is to decouple one of the fast pass from native, while keeping the same functionality. `scalar_to_tensor` is included through `ATen/ATen.h`, which could be referenced by any cpp file including this header.
ghstack-source-id: 114485740
Test Plan: CI
Reviewed By: ezyang
Differential Revision: D24361863
fbshipit-source-id: 28d658688687b6cde286a6e6933ab33a4b3cf9ec
Summary:
Related https://github.com/pytorch/pytorch/issues/38349
This PR implements `column_stack` as a composite op of `torch.reshape` and `torch.hstack`, and makes `row_stack` an alias of `torch.vstack`.
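A small usage sketch of the new ops, shown with the libtorch C++ API (assuming the generated `torch::column_stack` / `torch::row_stack` bindings; the Python entry points are `torch.column_stack` and `torch.row_stack`):
```
#include <torch/torch.h>

int main() {
  auto a = torch::tensor({1, 2, 3});
  auto b = torch::tensor({4, 5, 6});

  // column_stack reshapes 1-D inputs into columns and concatenates them
  // horizontally: [[1, 4], [2, 5], [3, 6]]
  auto cols = torch::column_stack({a, b});

  // row_stack is an alias of vstack: [[1, 2, 3], [4, 5, 6]]
  auto rows = torch::row_stack({a, b});
  return 0;
}
```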
Todo
- [x] docs
- [x] alias pattern for `row_stack`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46313
Reviewed By: ngimel
Differential Revision: D24585471
Pulled By: mruberry
fbshipit-source-id: 62fc0ffd43d051dc3ecf386a3e9c0b89086c1d1c
Summary:
If there is no annotation given, we want to show users that the type is inferred
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46969
Test Plan:
Added a new test case that throws an error with the expected error message
Fixes https://github.com/pytorch/pytorch/issues/46326
Reviewed By: ZolotukhinM
Differential Revision: D24614450
Pulled By: gmagogsfm
fbshipit-source-id: dec555a53bfaa9cdefd3b21b5142f5e522847504
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46985.
Can someone comment on whether the "Run flake8" step should fail if `flake8` produces errors? This PR makes sure the errors are still shown, but [the job linked from the issue](https://github.com/pytorch/pytorch/runs/1320258832) also shows that the failure of that step seems to have caused the "Add annotations" step not to run.
Is this what we want, or should I instead revert back to the `--exit-zero` behavior (in this case by just removing the `-o pipefail` from this PR) that we had before https://github.com/pytorch/pytorch/issues/46740? And if the latter, then (how) should I modify this `flake8-py3` job to make sure it fails when `flake8` fails (assuming it didn't already do that?)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46990
Reviewed By: VitalyFedyunin
Differential Revision: D24593573
Pulled By: samestep
fbshipit-source-id: 361392846de9fadda1c87d2046cf8d26861524ca
Summary:
The FullyConnectedDNNLowPOp::Y_int32_ vectors consume between 1GB and 2GB on one of FB's larger applications. By adding tracing I noticed that the number of elements in each instance oscillates wildly over time. Since the buffer backing a vector is only ever extended by a resize operation (its capacity stays at the high-water mark), memory is wasted whenever the element count drops. So as a simple optimization, I added code to right-size the buffer backing the vector when the number of elements is less than half the vector capacity at that point; this doesn't affect the existing elements.
There is of course a memory/CPU tradeoff here - with the change we are doing more mallocs and frees. I added tracing to measure how many times we grow or shrink per second: it's about 100 per second on average, which is not a great deal.
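A minimal sketch of the right-sizing heuristic described above (illustrative only, not the actual operator code):
```
#include <cstdint>
#include <vector>

// If the live element count has dropped below half of the current capacity,
// release the excess backing store; the existing elements are left untouched.
void maybe_right_size(std::vector<std::int32_t>& buf) {
  if (buf.size() < buf.capacity() / 2) {
    buf.shrink_to_fit();  // note: shrink_to_fit is a non-binding request
  }
}
```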
Test Plan:
Memory growth impact: over 24 hours and after the startup period, the memory consumed by this code grows from 0.85GB to 1.20GB vs 0.95GB to 1.75GB in the baseline. [ source: https://fburl.com/scuba/heap_profiles/wm47kpfe ]
https://pxl.cl/1pHlJ
Reviewed By: jspark1105
Differential Revision: D24592098
fbshipit-source-id: 7892b35f24e42403653a74a1a9d06cbc7ee866b9
Summary:
Updates mul_scalar shader to support the new Vulkan API, and adds a new op for it using the new API.
Also adds an in-place version for the op.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47021
Test Plan:
Unit test included. To build & run:
```
BUILD_CUSTOM_PROTOBUF=OFF \
BUILD_TEST=ON \
USE_EIGEN_FOR_BLAS=OFF \
USE_FBGEMM=OFF \
USE_MKLDNN=OFF \
USE_NNPACK=OFF \
USE_NUMPY=OFF \
USE_OBSERVERS=OFF \
USE_PYTORCH_QNNPACK=OFF \
USE_QNNPACK=OFF \
USE_VULKAN=ON \
USE_VULKAN_API=ON \
USE_VULKAN_SHADERC_RUNTIME=ON \
USE_VULKAN_WRAPPER=OFF \
MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python3 setup.py develop --cmake && ./build/bin/vulkan_api_test
```
Reviewed By: AshkanAliabadi
Differential Revision: D24624729
Pulled By: SS-JIA
fbshipit-source-id: 97e76e4060307a9a24311ac51dca8812e4471249
Summary:
Preserve PYBIND11 configuration options in `torch._C._PYBIND11_COMPILER_TYPE` and use them when building extensions
Also, use f-strings in `torch.utils.cpp_extension`
"Fixes" https://github.com/pytorch/pytorch/issues/46367
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46415
Reviewed By: VitalyFedyunin
Differential Revision: D24605949
Pulled By: malfet
fbshipit-source-id: 87340f2ed5308266a46ef8f0317316227dab9d4d
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41768
The fault was that a NULL `tau` would get passed to the LAPACK function. This PR fixes that by checking, at the beginning of the function, whether `tau` contains 0 elements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46700
Reviewed By: albanD
Differential Revision: D24616427
Pulled By: mruberry
fbshipit-source-id: 92e8f1489b113c0ceeca6e54dea8b810a51a63c3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46833
Implicit integer conversions are causing compiler warnings. Since in this case the logs make it pretty clear that the `unsigned` types won't overflow despite 64-bit inputs, we fix the issue by making the downconversion explicit.
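For illustration, the kind of change involved (not the exact code in the diff):
```
#include <cstdint>

// Making the narrowing explicit: the value is known (from the logs) to fit in
// 32 bits, and the explicit cast silences the implicit-conversion warning.
unsigned int to_unsigned32(std::int64_t n) {
  return static_cast<unsigned int>(n);
}
```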
Test Plan: Standard test rig.
Reviewed By: malfet
Differential Revision: D24481377
fbshipit-source-id: 4422538286d8ed2beb65065544016fd430394ff8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46850
So far, in the error messages for mismatched kernel signatures, we showed the location where the second kernel came from,
but not the location of the first kernel. This PR now shows the locations of both.
ghstack-source-id: 115468616
Test Plan: waitforsandcastle
Reviewed By: ezyang
Differential Revision: D24540368
fbshipit-source-id: 3b4474062879d17f9bb7870ad3814343edc1b755
Summary:
This PR disables the test_softmax and test_softmax_results tests in test_nn.py that were enabled in https://github.com/pytorch/pytorch/issues/46363. The softmax tests are causing failures on gfx906 machines. Disabling them until we root-cause and fix them on gfx906.
cc: jeffdaily ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46793
Reviewed By: izdeby
Differential Revision: D24539211
Pulled By: ezyang
fbshipit-source-id: 633cb9dc497ad6359af85b85a711c4549d772b2a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46590
This operator is very similar to LengthsToRanges but doesn't pack the offsets next to the original lengths.
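A sketch of the computation (an exclusive prefix sum over the lengths); the actual Caffe2 operator may differ in details such as element types:
```
#include <cstddef>
#include <cstdint>
#include <vector>

// offsets[i] is the start position of segment i; unlike LengthsToRanges,
// the lengths themselves are not packed next to the offsets in the output.
std::vector<int64_t> lengths_to_offsets(const std::vector<int64_t>& lengths) {
  std::vector<int64_t> offsets(lengths.size());
  int64_t running = 0;
  for (std::size_t i = 0; i < lengths.size(); ++i) {
    offsets[i] = running;
    running += lengths[i];
  }
  return offsets;
}
```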
Reviewed By: yf225
Differential Revision: D24419746
fbshipit-source-id: aa8b014588bb22eced324853c545f8684086c4e4
Summary: I was reading/looking into how LocalSession works and realized that the workspace type being passed around was the bound function on TaskGroup instead of the actual type. This meant that all workspaces for LocalSession would always be global, because they'd never match the private workspace type.
Test Plan: <not sure, could use some suggestions>
Reviewed By: cryptopic
Differential Revision: D24458428
fbshipit-source-id: 0f87874babe9c1ddff25b5363b443f9ca37e03c1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47035
Chillee thought the `from math import inf, nan` string at the top of `.code` was annoying, so here's an alternative way to do it: put those values in `globals` before we `exec`.
Test Plan: Imported from OSS
Reviewed By: dzhulgakov
Differential Revision: D24611278
Pulled By: jamesr66a
fbshipit-source-id: c25ef89e649bdd3e79fe91aea945a30fa7106961
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46773
Changed the constructor of RemoteModule to accept a `remote_device` arg in the following format:
"<workername>/<device>" (e.g., "trainer0/cpu", "ps0/cuda:0")
This arg merges the original `on` and `device` args.
Original PR issue: RemoteDevice Format #46554
ghstack-source-id: 115448051
Test Plan: buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule
Reviewed By: pritamdamania87
Differential Revision: D24482562
fbshipit-source-id: 5acfc73772576a4b674df27625bf560b8f8e67c1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44266
If PyTorchStreamWriter is writing to a file in a non-existent path, it throws an exception. During unwinding, the destructor calls writeEndOfFile() and throws again. To avoid this double exception, a check-and-throw is added in the constructor. In that case the destructor will not be called and the exception can go through the unwinding.
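For context, a minimal illustration of why the double throw is a problem (generic C++, not the actual PyTorchStreamWriter code):
```
#include <stdexcept>

// If a destructor throws while the stack is already unwinding from an earlier
// exception, std::terminate() is called. Validating in the constructor means
// that on failure the object is never fully constructed, so its destructor
// does not run during unwinding and the original exception can propagate.
struct Writer {
  explicit Writer(bool path_ok) {
    if (!path_ok) {
      throw std::runtime_error("output path does not exist");
    }
  }
  ~Writer() noexcept(false) {
    // Imagine finalization work here that can throw, like writeEndOfFile().
  }
};
```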
Test Plan: python test/test_jit.py TestSaveLoad.test_save_nonexit_file
Reviewed By: dreiss
Differential Revision: D23560770
Pulled By: iseeyuan
fbshipit-source-id: 51b24403500bdab3578c7fd5e017780467a5d06a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46401
Broader context about selective/custom build available at https://fb.quip.com/2oEzAR5MKqbD and https://fb.workplace.com/groups/pytorch.mobile.team/permalink/735794523641956/
Basically, we want to be able to trace full operator names (with overload name). The current observer infra picks up the operator name from the schema, which doesn't seem to include the overload name. To ensure consistency with the existing uses and to accommodate the new use-case, this diff adds a new overload to accept an `OperatorHandle` object, and the code in `before()` eagerly resolves it to an `OperatorName` object (which can be cached in a member variable) as well as a string (view) operator-name which has the same semantics as before.
Why do we pass in an `OperatorHandle` but then resolve it to an `OperatorName`? This might come across as a strange design choice (and it is), but it is grounded in practicality.
It is not reasonable to cache an `OperatorHandle` object but caching an `OperatorName` object is reasonable since it holds all the data itself.
An initial version of this change was trying to test this change in the `xplat` repo, which didn't work. Thanks to ilia-cher for pointing out that the dispatcher observing mechanism is disabled under a compile time flag (macro) for xplat.
ghstack-source-id: 114360747
Test Plan:
`buck test fbcode/caffe2/fb/test:record_function_test` succeeds. Also replicated this test in OSS in the file `test_misc.cpp` where the rest of the `RecordFunction` subsystem is being tested.
Ran benchmark as requested by ilia-cher
{P146511280}
Reviewed By: ilia-cher
Differential Revision: D24315241
fbshipit-source-id: 239f3081e6aa2e26c3021a7dd61f328b723b03d9
Summary:
Plus two minor fixes to `torch/csrc/Module.cpp`:
- Use iterator of type `Py_ssize_t` for array indexing in `THPModule_initNames`
- Fix clang-tidy warning of unneeded defaultGenerator copy by capturing it as `const auto&`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47025
Reviewed By: samestep
Differential Revision: D24605907
Pulled By: malfet
fbshipit-source-id: c276567d320758fa8b6f4bd64ff46d2ea5d40eff
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46977
Clean up a few TODOs in the new python binding codegen.
Get rid of the _simple_type() hack and the uses of cpp_type_str.
Now python argument type strings and PythonArgParser unpacking methods
are directly generated from the original Type model.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D24589209
Pulled By: ljk53
fbshipit-source-id: b2a6c3911d58eae49c031d319c8ea6f804e2cfde
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46976
Technically, it's not semantics-preserving, e.g. the emission of
'requires_grad' is no longer gated by 'has_tensor_return' - there is no
guarantee that is_like_or_new_function functions all have a tensor return.
But the output is identical, so there might be some invariant - we could
also add an assertion to fail loudly when it's broken.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D24589211
Pulled By: ljk53
fbshipit-source-id: 47c7e43b080e4e67a526fde1a8a53aae99df4432
Summary:
WIP: add support for different memory sizes in size_based_partition, so that size_based_partition can support logical devices with different memory sizes. Compared to the original size_based_partition, the new one also supports partition-to-logical-device mapping. Multiple partitions can be mapped to one device if the memory size allows it. A unit test, test_different_size_partition, is also added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46919
Reviewed By: gcatron, VitalyFedyunin
Differential Revision: D24603511
Pulled By: scottxu0730
fbshipit-source-id: 1ba37338ae054ad846b425fbb7e631d3b6c500b6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46955
Initially we were thinking of adding an `invalidate_quantized_float_parameters` option to free the memory
of the quantized floating point parameters, but it turns out we will do module swaps just like in eager mode for the modules
that are quantized, so the old floating point module will not be referenced after quantization. Therefore this feature
is only needed for functionals; since most people are using quantization with modules, we may not need it.
We'll revisit after we find there is a need for this.
Test Plan: Imported from OSS
Reviewed By: supriyar
Differential Revision: D24579400
fbshipit-source-id: fbb0e567405dc0604a2089fc001573affdade986