Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66167
Sometimes, due to a desync, the PG wrapper's monitored barrier fails. In
this case it is useful to print info about the collective that was attempting
to run along with the actual error.
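As a rough illustration (not from this PR), a minimal sketch of the kind of desync that hits this path, assuming the wrapper is enabled via TORCH_DISTRIBUTED_DEBUG=DETAIL and the script is launched with torchrun so RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT are set:
```
import os

import torch
import torch.distributed as dist

# Assumption: setting the debug level here (before init) enables the PG wrapper;
# exporting TORCH_DISTRIBUTED_DEBUG=DETAIL before launch works as well.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
dist.init_process_group("gloo")

t = torch.ones(1)
if dist.get_rank() == 0:
    # Only rank 0 issues a collective; the other ranks never join, so the
    # wrapper's monitored barrier times out. With this change, the error
    # should also describe the all_reduce that was about to run.
    dist.all_reduce(t)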
ghstack-source-id: 140037653
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D31353021
fbshipit-source-id: e2a515326c9314c98119978d5566eb5431cca96c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66166
These methods should be private.
ghstack-source-id: 139782587
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D31353020
fbshipit-source-id: 583fb315cc2cacc37df3d29cd5793b42558930b3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65345
FooType::get() can return a const reference. Inconveniently, converting shared_ptr<FooType> to shared_ptr<Type> requires a copy & refcount bump, so to properly take advantage of this in unshapedType() we need to take a const Type& in isSubtypeOf(), which is good practice anyway -- don't require a shared_ptr if you don't need to take ownership.
ghstack-source-id: 140044165
Test Plan:
CI
perf says c10::unshapedType time decreased from 2.8% to 2.2% during static runtime startup, though I expect this to be generally beneficial.
Reviewed By: hlu1
Differential Revision: D31027361
fbshipit-source-id: 676feb81db9f74ad7b8651d8774f4ecb4cfa6ab8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65346
Tidying up the top sources of reference count decrements seen during static runtime startup.
ghstack-source-id: 140027349
Test Plan:
CI
perf now shows under 2% of time spent in ~__shared_count instead of about 5%.
Reviewed By: suo
Differential Revision: D31057277
fbshipit-source-id: 9a16daf2e655fda80d4ec21290b30f02ba63d8da
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66277
Previously, it was grouped together with tests related to `MapDataPipe`, but it should be with `IterDataPipe`.
cc VitalyFedyunin ejguan NivekT
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D31485823
Pulled By: NivekT
fbshipit-source-id: d13d8c28cbfc305da0e3033d4109a0f971281a02
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66275
Once this is added to Core, TorchData's PR will not need a custom class and can use this wrapper instead.
cc VitalyFedyunin ejguan NivekT
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D31485822
Pulled By: NivekT
fbshipit-source-id: 790de27629c89c0ca7163a8ee5a09ee8b8233340
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66051
Make the error message clearer when quantized embedding is converted
with an unsupported dtype. This is helpful when debugging quantization
errors on new models.
Test Plan:
```
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(1, 1)

m = M().eval()
m.qconfig = torch.quantization.QConfig(
    activation=torch.quantization.MinMaxObserver.with_args(dtype=torch.qint8),
    weight=torch.quantization.MinMaxObserver.with_args(dtype=torch.qint8))
m.embedding.qconfig = m.qconfig
mp = torch.quantization.prepare(m)
mq = torch.quantization.convert(m)
# error message now includes the incorrect dtype
```
Imported from OSS
Reviewed By: dagitses
Differential Revision: D31472848
fbshipit-source-id: 86f6d90bc0ad611aa9d1bdae24497bc6f3d2acaa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66050
Adds the dtype to an error message when trying to quantize something
other than a float. This is useful for debugging quantization tools on
new models.
Test Plan:
```
import torch

x = torch.randn(1, 1, 1, 1, dtype=torch.double)
xq = torch.quantize_per_tensor(x, 0.01, 0, torch.quint8)
# error message now includes Double
```
Imported from OSS
Reviewed By: dagitses
Differential Revision: D31472849
fbshipit-source-id: 2331ffacefcbc6f8eca79694757d740de74a0f1d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66049
Enables quantized add with broadcasting. As pointed out by jamesr66a,
this was disabled but TensorIterator already supports it. Added a test
case to verify.
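A minimal sketch of the now-supported broadcasting case (not the actual test), assuming the usual `torch.ops.quantized.add(qa, qb, scale, zero_point)` entry point:
```
import torch

a = torch.randn(4, 3)
b = torch.randn(1, 3)  # broadcasts against `a` along dim 0
qa = torch.quantize_per_tensor(a, scale=0.1, zero_point=0, dtype=torch.quint8)
qb = torch.quantize_per_tensor(b, scale=0.1, zero_point=0, dtype=torch.quint8)

# Quantized add now goes through TensorIterator's broadcasting support.
qc = torch.ops.quantized.add(qa, qb, 0.1, 0)
print(qc.shape)  # torch.Size([4, 3])
```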
Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_qadd_broadcast
```
Imported from OSS
Reviewed By: dagitses
Differential Revision: D31472850
fbshipit-source-id: a3b16d9000487918db743525d22db6864330762b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66108
BC-breaking change: intT is now longT (which aligns it more accurately with how
the types are referred to in C++). The benefit of this is that we can idiomatically
express all C++ dtypes (with intT now mapping to int32_t). These types are needed
for ufunc codegen in a later patch.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D31385761
Pulled By: ezyang
fbshipit-source-id: ec6f3a0953794313470dbe14911f23ac116be425
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66149
The updated logic can infer the rank of the slice output when only the rank is known for the slice input. This enables cases where `ConstantValueMap::HasRank(input)` is `True` while `ConstantValueMap::HasShape(input)` is `False`.
Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D31423232
Pulled By: ezyang
fbshipit-source-id: 516e3916aa71afda2b10e44620636e42ed837236
Co-authored-by: BowenBao <bowbao@microsoft.com>
Summary:
Hi, I'm looking forward to contributing to PyTorch, so starting with a minor fix in the documentation for `index_add`.
Currently, in the documentation for `index_add_` (please see https://pytorch.org/docs/master/generated/torch.Tensor.index_add_.html#torch.Tensor.index_add_):
1. The `tensor` attribute was pointing to the `torch.tensor` class, which IMO (though it may not be a big deal) is unintentional.
2. The `dim` attribute is pointing to `torch.Tensor.dim`, which again IMO is unintentional.
This PR suggests a correction for the first point above: rename the `tensor` attribute to `input` so that it doesn't point to the `torch.tensor` class. (I've verified that other ops like `scatter` use `input`, so this should not break consistency in the documentation.) I couldn't find an appropriate fix for the second point, since renaming `dim` to something else would break consistency (almost all other ops in PyTorch use `dim` as the attribute name).
I may be wrong here, so please let me know if there is any feedback or an alternate fix for this.
_Note:_ I plan to fix this behavior for `index_copy_` (https://pytorch.org/docs/master/generated/torch.Tensor.index_copy_.html#torch.Tensor.index_copy_) once and if this PR is approved.
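For context, a small usage sketch of the op whose documentation is being touched (not part of this PR), showing the `dim`, `index`, and source-tensor arguments:
```
import torch

x = torch.zeros(5, 3)
index = torch.tensor([0, 4, 2])
src = torch.ones(3, 3)

# Accumulate rows of `src` into rows 0, 4 and 2 of `x` along dim 0.
x.index_add_(0, index, src)
```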
To the reviewers, please help me tag the correct person who could help review this PR.
cc: krshrimali mruberry zou3519
cc brianjo mruberry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65806
Reviewed By: dagitses, mruberry
Differential Revision: D31431182
Pulled By: zou3519
fbshipit-source-id: 66ced9677ac3bc71d672d13366f9f567ecea0a2d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65958
zhxchen17 added a `pickle` pybind for the trt engine, which allows us to save and load an nn.Module with a trt engine in fbcode. This diff, though, explicitly serializes/deserializes the engine in `__getstate__` and `__setstate__` so that in OSS people can also save and load TRTModule directly.
Test Plan: buck test mode/dev-nosan caffe2/torch/fb/fx2trt:test_fx2trt
Reviewed By: wushirong
Differential Revision: D31309429
fbshipit-source-id: 9068e2ae6375ed0e1bb55b0e9d582b8d9c049dbf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65959
Gives some more control over the output dtype of a trt engine. Previously the output would be fp16 if fp16_mode was turned on. This diff allows the engine to generate fp32 output with fp16_mode=True.
Test Plan: CI
Reviewed By: kflu, wushirong
Differential Revision: D31243929
fbshipit-source-id: 09c752e6f382d6ad169da66878d9a9277c134869
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66131
Turns out that a model with 72k instructions causes about 0.5MiB of additional memory overhead (if there's an 8 byte memory overhead per instruction). This is not necessary if we're building w/o eager symbolication support. This change eliminates the 8 byte `debug_handle` if the build is w/o eager symbolication support.
ghstack-source-id: 140045478
(Note: this ignores all push blocking failures!)
Test Plan:
```
buck build -c "pt.enable_eager_symbolication"=1 //xplat/caffe2/fb/lite_predictor:lite_predictor
buck build //xplat/caffe2/fb/lite_predictor:lite_predictor
```
Reviewed By: kimishpatel
Differential Revision: D31387784
fbshipit-source-id: af56787ad833b990a46b79ab021e512edaa22143
Summary:
Noticed that the `periodic-pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7-slow-gradcheck` job has a `ciflow/default` label, but does not have a `ciflow/scheduled` label.
Added asserts to enforce that jobs with a non-trivial is_scheduled property do not have the default label and do have the scheduled label.
Rename `periodic-pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7-slow-gradcheck` to `periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66300
Reviewed By: seemethere
Differential Revision: D31493323
Pulled By: malfet
fbshipit-source-id: 194c1d7a4e659847d94a547b87a0d7d08e66406d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65326
parallel_for and parallel_reduce currently share some common code in
all backends, specifically for detecting if it should run in parallel
or not. This moves all the backend-specific code into a single
`internal::invoke_parallel` function and makes the `parallel_`
functions common to all backends.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D31124495
fbshipit-source-id: 65c3d2af42a8860cc4d6349566085c9fa8d8c6f0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66258
Installing libgnutls30 has shown to be good when confronted with the
CERT issue related to deb.nodesource.com
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: dagitses
Differential Revision: D31477789
Pulled By: seemethere
fbshipit-source-id: f87ae4c098771acc505db14e3982d8858cf7326f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66015
Fixes https://github.com/pytorch/pytorch/issues/61982 by cloning
tensors in DDPSink. This only applies once for static_graph and generally for unused
params, which already have overhead, so the perf hit should not be an issue. Will
verify with a benchmark.
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D31346633
fbshipit-source-id: 5b9245ade628565cffe01731f6a0dcbb6126029b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65517
This change retrofits `GetAlwaysAliveValues` into `ValueGroup` to group the values used by a graph into three groups as follows:
- input_aliases: values that are either inputs or contain aliases of inputs or constants.
- output_aliases: values that are either outputs or contain aliases of outputs and are not in input_aliases.
- Values that don't show up in input_aliases or output_aliases are internally created and consumed within the graph.
`output_aliases` is the only new group introduced by this change, and a follow-up diff will use it to preallocate output Tensors to improve Static Runtime's performance.
Test Plan: Added `ValueGroup.Init` to cover the updated code path. Note that there was no test for `GetAlwaysAliveValues` before.
Reviewed By: hlu1
Differential Revision: D30940955
fbshipit-source-id: 2cb065ecda0f447a61e64a7cf70cc7c6947f7dfc
Summary: Adding a test to ensure non-vanilla SGD treats complex numbers as two real numbers in R^2, as per issue 65711 on GitHub.
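A minimal sketch of the property under test (not the actual test code), assuming SGD with momentum already accepts complex parameters: stepping on a complex parameter should match stepping on its view as two real numbers.
```
import torch

p_c = torch.randn(3, dtype=torch.complex64, requires_grad=True)
p_r = torch.view_as_real(p_c.detach()).clone().requires_grad_(True)

opt_c = torch.optim.SGD([p_c], lr=0.1, momentum=0.9)
opt_r = torch.optim.SGD([p_r], lr=0.1, momentum=0.9)

for _ in range(3):
    opt_c.zero_grad()
    opt_r.zero_grad()
    (p_c * p_c.conj()).real.sum().backward()  # |p|^2 as a real-valued loss
    p_r.pow(2).sum().backward()               # same loss on the R^2 view
    opt_c.step()
    opt_r.step()

# The complex parameter, viewed as pairs of reals, should track p_r.
print(torch.allclose(torch.view_as_real(p_c), p_r))
```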
Test Plan:
```buck test mode/dev caffe2/test:optim -- 'test_sgd_complex'```
https://pxl.cl/1QLxw
Reviewed By: albanD
Differential Revision: D31477212
fbshipit-source-id: 500678e561a05ac96759223b4c87a37cab26c6a6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66021
A builtin library consists of a list of frozen modules and a list of builtin modules. For tensorrt, it's quite simple since we only have a single builtin module, tensorrt.tensorrt. But it can be complex for libraries like numpy, which contain multiple builtin modules (np.core._multiarray_umath, np.random.mtrand, etc.), if we want to add them as torch::deploy builtins. We enhance the macro that registers builtin libraries to accept a variable number of builtin modules. We can use this macro to register frozentorch, frozenpython, and tensorrt for now, and can also use it to register libraries like numpy later on.
The enhanced macro now looks as follows. Although we don't need to worry about backward compatibility for now, this enhanced version is fully compatible with the previous one; the previous version is just the special case where the library contains no builtin modules.
```
REGISTER_TORCH_DEPLOY_BUILTIN(library_name_without_quote, frozen_modules_list,
builtin_module_name_1, builtin_module_init_function_1, ...,
builtin_module_name_N, builtin_module_init_function_N)
```
ghstack-source-id: 140007970
Test Plan:
1. Play around with interactive_embedded_interpreter.cpp to import torch._C, tensorrt.tensorrt etc inside the embedded interpreter.
2. Enhance test_builtin_registry.cpp
3. Run test_deploy.cpp and test_deploy_gpu.cpp
Reviewed By: suo
Differential Revision: D31349390
fbshipit-source-id: 70a1fcf660341180fc4d5195aed15ceb07c2bef7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66218
This stack of diffs reduces the memory used by LLVMCodeGen object.
Here are the numbers on model `294738512`: (this is the number reported as `Memory turnover after freeze_module:` in the output)
```
Before: 123343496
After : 121566008
```
So, there is a reduction of about `~1.77MB` with this change of making `PytorchLLVMJIT` a singleton.
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM, hlu1
Differential Revision: D31445798
Pulled By: navahgar
fbshipit-source-id: c860d36456b2c5d3e21010c1217e2948326f666d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65671
Tentative implementation that uses dist.gather_object to collect shards from all ranks and then "merge" them. The merge is done on dst_rank by first padding the sharded tensors to the size of the full tensor based on their metadata (offsets, lengths), and then summing these padded tensors together.
Also considered concatenating the sharded tensors without padding to minimize memory footprint (assuming padding will increase memory), but that may not be flexible enough for arbitrary sharding (e.g. sharding along multiple dimensions).
Another option is constructing the padded tensor on each rank and reducing to rank 0. This is the easiest implementation, but it incurs higher memory usage and communication payload. Please let me know if this alternative is preferred.
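A minimal sketch of the pad-and-sum merge described above, using a hypothetical `merge_shards` helper and plain tensors standing in for the gathered shard metadata:
```
import torch

def merge_shards(shards, full_size):
    # shards: list of (offsets, lengths, local_tensor) gathered on dst_rank,
    # e.g. via dist.gather_object.
    merged = torch.zeros(full_size)
    for offsets, lengths, local_tensor in shards:
        padded = torch.zeros(full_size)
        idx = tuple(slice(o, o + l) for o, l in zip(offsets, lengths))
        padded[idx] = local_tensor
        # Shards are disjoint, so summing the padded tensors rebuilds the full tensor.
        merged += padded
    return merged

# Example: a 4x4 tensor sharded into two 2x4 row shards.
shards = [
    ((0, 0), (2, 4), torch.ones(2, 4)),
    ((2, 0), (2, 4), torch.full((2, 4), 2.0)),
]
full = merge_shards(shards, (4, 4))
```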
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang gcramer23
Test Plan:
Imported from OSS
python test/distributed/_sharded_tensor/test_sharded_tensor.py -v -k test_gather
did not manage to test on oss, but tested in fbcode by reserving on demand gpu
arc patch D31197611
modify the test with 2 gpus as on-demand gpu only has 2 cores (D31227986)
buck test -c fbcode.enable_gpu_sections=true mode/dev-nosan caffe2/test/distributed/_sharded_tensor:sharded_tensor -- test_gather
buck-out/gen/caffe2/test/distributed/_sharded_tensor/sharded_tensor#binary.par test_sharded_tensor.TestShardedTensorChunked.test_gather
{F667213605}
Reviewed By: dagitses, pritamdamania87
Differential Revision: D31197611
Pulled By: dracifer
fbshipit-source-id: cf98b4a2d7838b11b9582eb23f826bb0fa38a7f4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65758
The same change has already been made for conv2d; the proper algorithm is both
faster and more precise.
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D31257872
Pulled By: ngimel
fbshipit-source-id: 6ff3a7a00a05b66f83d45cc820bd0c230cb8de6d
Summary:
Enable testing of `torch.Tensor.resize_`.
The negative view test is skipped as the test doesn't work with resize_; see
https://github.com/pytorch/pytorch/issues/65945.
cc mruberry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66135
Reviewed By: dagitses
Differential Revision: D31444263
Pulled By: mruberry
fbshipit-source-id: 00c7fe05df28fba01508b31adb3ed4fdcf4d0326
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65542
Add docstring for torch.fx.passes.split_module that conforms to Google Python Style conventions.
Changed original example to the example from this diff:
https://www.internalfb.com/diff/D24925283 (9734c042b8)
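For reference, a small usage sketch of the documented API (a simplified variant of the docstring example, with a hypothetical first-op/rest partition callback):
```
import torch
from torch.fx import symbolic_trace
from torch.fx.passes.split_module import split_module

class MyModule(torch.nn.Module):
    def forward(self, x):
        x = x + 1
        return torch.relu(x)

m = MyModule()
traced = symbolic_trace(m)

# Put the first call node into partition 0 and everything else into partition 1.
seen = []
def split_callback(node):
    seen.append(node)
    return 0 if len(seen) == 1 else 1

split = split_module(traced, m, split_callback)
print(split)  # GraphModule containing submod_0 and submod_1
```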
Test Plan:
Ran buck test //caffe2/test:fx. No errors detected
https://pxl.cl/1QCch
Reviewed By: jamesr66a
Differential Revision: D31145694
fbshipit-source-id: 8e54f3b1be3dca1c4d414fdeeab71b9f2b5d9f3e
Summary:
These utils are prerequisites for Lazy Node base class.
- set up new torch/csrc/lazy, test/cpp/lazy dirs
- add source files to build_variables.bzl in new lazy_core_sources var
- create new test_lazy binary
Fixes https://github.com/pytorch/pytorch/issues/65636
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66181
Original commit changeset: 3d0d5377d71e
Test Plan:
Run PyTorch XLA corresponding PR in XLA CI:
https://github.com/pytorch/xla/pull/3148/files
Reviewed By: suo
Differential Revision: D31416438
fbshipit-source-id: 58a6a49c5bc30134bc6bae2e42778f359b9a8f40