Summary:
PR opened just to run the CI tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44465
Reviewed By: ngimel
Differential Revision: D23907565
Pulled By: mruberry
fbshipit-source-id: 620661667877f1e9a2bab17d19988e2dc986fc0f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44846
The save function traverses the model state dict to pick out the observer stats
The load function traverses the module hierarchy to load the state dict into module attributes, depending on the observer type.
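The save/load split described above can be illustrated with plain dictionaries. This is only a minimal sketch, not the code from this PR: the helper names `pick_observer_stats` and `load_observer_stats`, the `"activation_post_process"` key marker, and the nested dict standing in for a module hierarchy are all assumptions made for illustration, and the per-observer-type handling is left out.

```python
# Hypothetical sketch: plain dicts stand in for a model state dict and a
# module hierarchy. None of these helper names come from the PyTorch API.
OBSERVER_MARKER = "activation_post_process"  # assumed naming convention


def pick_observer_stats(state_dict):
    """Return only the entries whose key path mentions an observer."""
    return {k: v for k, v in state_dict.items() if OBSERVER_MARKER in k}


def load_observer_stats(modules, observer_stats):
    """Walk a nested dict of 'modules' and set the matching attributes."""
    for key, value in observer_stats.items():
        *path, attr = key.split(".")
        node = modules
        for name in path:
            node = node[name]  # descend one level in the hierarchy
        node[attr] = value


state_dict = {
    "conv.weight": [1.0, 2.0],
    "conv.activation_post_process.min_val": -0.5,
    "conv.activation_post_process.max_val": 0.75,
}
stats = pick_observer_stats(state_dict)

# A freshly prepared "model" whose observer has no statistics yet.
fresh = {"conv": {"activation_post_process": {}}}
load_observer_stats(fresh, stats)
```

Only the two observer entries are saved; the weight is ignored, and loading fills the empty observer in the fresh hierarchy.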
Test Plan:
python test/test_quantization.py TestQuantizeFx.test_save_observer_state_dict
Imported from OSS
Reviewed By: raghuramank100
Differential Revision: D23746821
fbshipit-source-id: 05c571b62949a2833602d736a81924d77e7ade55
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45390
Tensor objects should always refer to their Function's bufs. Currently
we never create a Tensor with a buffer different from that of its Function,
but having it in two places seems incorrect and dangerous.
Differential Revision: D23952865
Test Plan: Imported from OSS
Reviewed By: nickgg
Pulled By: ZolotukhinM
fbshipit-source-id: e63fc26d7078427514649d9ce973b74ea635a94a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45388
Classes defined in these files are closely related, so it is reasonable
to have them all in one file. The change is purely a code move.
Differential Revision: D23952867
Test Plan: Imported from OSS
Reviewed By: nickgg
Pulled By: ZolotukhinM
fbshipit-source-id: 12cfaa968bdfc4dff00509e34310a497c7b59155
Summary:
In the profiler, CUDA did not report self time, so for composite functions there was no way to determine which function was actually taking the time. In addition, the reported "total cuda time" was frequently more than the total wallclock time. This PR adds "self CUDA time" to the profiler and computes total CUDA time from self CUDA time, similar to how it's done for CPU. It also makes slight formatting changes to make the table more compact. Before:
```
--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                  Self CPU total % Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls
--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
aten::matmul          0.17%            890.805us        99.05%           523.401ms        5.234ms          49.91%           791.184ms        7.912ms          100
aten::mm              98.09%           518.336ms        98.88%           522.511ms        5.225ms          49.89%           790.885ms        7.909ms          100
aten::t               0.29%            1.530ms          0.49%            2.588ms          25.882us         0.07%            1.058ms          10.576us         100
aten::view            0.46%            2.448ms          0.46%            2.448ms          12.238us         0.06%            918.936us        4.595us          200
aten::transpose       0.13%            707.204us        0.20%            1.058ms          10.581us         0.03%            457.802us        4.578us          100
aten::empty           0.14%            716.056us        0.14%            716.056us        7.161us          0.01%            185.694us        1.857us          100
aten::as_strided      0.07%            350.935us        0.07%            350.935us        3.509us          0.01%            156.380us        1.564us          100
aten::stride          0.65%            3.458ms          0.65%            3.458ms          11.527us         0.03%            441.258us        1.471us          300
--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 528.437ms
CUDA time total: 1.585s
Recorded timeit time: 789.0814 ms
```
Note that the recorded timeit time (with proper CUDA syncs) is half the "CUDA time total" reported by the profiler.
After:
```
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Name                  Self CPU %    Self CPU      CPU total %   CPU total     CPU time avg  Self CUDA     Self CUDA %   CUDA total    CUDA time avg # of Calls
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
aten::matmul          0.15%         802.716us     99.06%        523.548ms     5.235ms       302.451us     0.04%         791.151ms     7.912ms       100
aten::mm              98.20%        519.007ms     98.91%        522.745ms     5.227ms       790.225ms     99.63%        790.848ms     7.908ms       100
aten::t               0.27%         1.406ms       0.49%         2.578ms       25.783us      604.964us     0.08%         1.066ms       10.662us      100
aten::view            0.45%         2.371ms       0.45%         2.371ms       11.856us      926.281us     0.12%         926.281us     4.631us       200
aten::transpose       0.15%         783.462us     0.22%         1.173ms       11.727us      310.016us     0.04%         461.282us     4.613us       100
aten::empty           0.11%         591.603us     0.11%         591.603us     5.916us       176.566us     0.02%         176.566us     1.766us       100
aten::as_strided      0.07%         389.270us     0.07%         389.270us     3.893us       151.266us     0.02%         151.266us     1.513us       100
aten::stride          0.60%         3.147ms       0.60%         3.147ms       10.489us      446.451us     0.06%         446.451us     1.488us       300
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 528.498ms
CUDA time total: 793.143ms
Recorded timeit time: 788.9832 ms
```
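The reason summing per-op totals overshoots wallclock time can be sketched in a few lines. This is an illustration of the arithmetic only, not the profiler's implementation: the `Event` class and `walk` helper are hypothetical, and the two durations are the per-call CUDA averages for `aten::matmul` and `aten::mm` from the table above, rounded to microseconds.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """Hypothetical profiler event; not a PyTorch class."""
    name: str
    total_us: float  # time spent in this op, including its children
    children: List["Event"] = field(default_factory=list)

    @property
    def self_us(self) -> float:
        # Self time excludes whatever the direct children account for.
        return self.total_us - sum(c.total_us for c in self.children)


def walk(event):
    """Yield an event and all of its descendants."""
    yield event
    for child in event.children:
        yield from walk(child)


# aten::matmul is a thin wrapper that dispatches to aten::mm.
mm = Event("aten::mm", 7908.0)
matmul = Event("aten::matmul", 7912.0, [mm])

naive_total = sum(e.total_us for e in walk(matmul))  # counts mm's time twice
self_total = sum(e.self_us for e in walk(matmul))    # counts each microsecond once
```

Summing totals gives roughly twice the real duration because the child's time is also included in the parent's total; summing self times recovers the parent's total.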
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45209
Reviewed By: zou3519
Differential Revision: D23925491
Pulled By: ngimel
fbshipit-source-id: 7f9c49238d116bfd2db9db3e8943355c953a77d0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44970
Right now, when RecordFunction is not active (usual case),
we do two TLS accesses (check for thread local callbacks, and check for
thread local boolean).
Experimenting with reducing number of TLS accesses in RecordFunction
constructor.
Test Plan: record_function_benchmark
Reviewed By: dzhulgakov
Differential Revision: D23791165
Pulled By: ilia-cher
fbshipit-source-id: 6137ce4bface46f540ece325df9864fdde50e0a4
Summary:
To support anomaly detection for test time spikes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45457
Reviewed By: malfet
Differential Revision: D23975628
Pulled By: walterddr
fbshipit-source-id: f28d0f12559070004d637d5bde83289f029b15b8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45069
`torch.abs` is a `C -> R` function for complex input. Following the general semantics in torch, the in-place version of abs should be disabled for complex input.
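The `C -> R` point can be seen with Python's built-in complex numbers, used here as a stand-in for torch tensors (an assumption made so the sketch needs no dependencies): the result of `abs` has a different type than its input, so there is no natural way to write it back into complex storage in place.

```python
# Built-in complex numbers stand in for complex tensors in this sketch.
z = 3 + 4j
magnitude = abs(z)  # the modulus of a complex number is a real number

values = [3 + 4j, 1j, -2 + 0j]
magnitudes = [abs(v) for v in values]  # a list of floats, not complex values
```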
Test Plan: Imported from OSS
Reviewed By: glaringlee, malfet
Differential Revision: D23818397
Pulled By: anjali411
fbshipit-source-id: b23b8d0981c53ba0557018824d42ed37ec13d4e2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45233
**Summary**
This commit modifies `TestClassType.test_properties` to check that
properties on class types can be ignored with the same syntax as
ignoring properties on `Modules`.
**Test Plan**
`python test/test_jit.py TestClassType.test_properties`
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D23971885
Pulled By: SplitInfinity
fbshipit-source-id: f2228f61fe26dff219024668cc0444a2baa8834c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45232
**Summary**
This commit updates the TorchScript language reference to include
documentation on recently-added TorchScript enums. It also removed
`torch.no_grad` from the list of known unsupported `torch` modules and
classes because it is now supported.
**Test Plan**
Continuous integration.
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D23971884
Pulled By: SplitInfinity
fbshipit-source-id: 5e2c164ed59bc0926b11201106952cff86e9356e
Summary:
Inline pytorch into wrapper, which is especially helpful in combination
with dead code elimination to reduce IR size and compilation times when
a lot of parameters are unused.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45445
Test Plan: CI
Reviewed By: ZolotukhinM
Differential Revision: D23969009
Pulled By: asuhan
fbshipit-source-id: a21509d07e4c130b6aa6eae5236bb64db2748a3d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43612
**Summary**
This commit modifies the `torch._C._jit_to_backend` function so that it
accepts `ScriptModules` as inputs. It already returns `ScriptModules`
(as opposed to C++ modules), so this makes sense and makes the API more
intuitive.
**Test Plan**
Continuous integration, which includes unit tests and out-of-tree tests
for custom backends.
**Fixes**
This commit fixes #41432.
Test Plan: Imported from OSS
Reviewed By: suo, jamesr66a
Differential Revision: D23339854
Pulled By: SplitInfinity
fbshipit-source-id: 08ecef729c4e1e6bddf3f483276947fc3559ea88
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45280
Performance is the same on CPU; on CUDA it is only 1-1.05x slower. This change is necessary for future nan ops, including nan(min|max|median).
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D23908796
Pulled By: heitorschueroff
fbshipit-source-id: c2b57acbe924cfa59fbd85216811f29f4af05088
Summary:
Stumbled upon a little gem in the audio conversion for `SummaryWriter.add_audio()`: two Python `for` loops to convert a float array to little-endian int16 samples. On my machine, this took 35 seconds for a 30-second 22.05 kHz excerpt. The same can be done directly in numpy in 1.65 milliseconds. (No offense, I'm glad that the functionality was there!)
Would also be ready to extend this to support stereo waveforms, or should this become a separate PR?
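To make the target format concrete, here is a small sketch of what "little-endian int16 samples" means at the byte level. It deliberately uses the standard library's `struct` module rather than numpy, so it is not the vectorized code from this PR; the function name `floats_to_int16_le`, the clamping to [-1.0, 1.0], and the scale factor of 32767 are assumptions made for illustration.

```python
import struct


def floats_to_int16_le(samples):
    """Pack floats (assumed to lie in [-1.0, 1.0]) as little-endian int16 bytes."""
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))   # clamp out-of-range samples (assumption)
        ints.append(int(s * 32767))  # scale to the int16 range (assumption)
    # "<" selects little-endian byte order, "h" is a signed 16-bit integer.
    return struct.pack("<%dh" % len(ints), *ints)


pcm = floats_to_int16_le([0.0, 1.0, -1.0])
```

Each sample becomes two bytes with the low byte first, so full-scale positive (32767, or 0x7FFF) is stored as `ff 7f`.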
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44201
Reviewed By: J0Nreynolds
Differential Revision: D23831002
Pulled By: edward-io
fbshipit-source-id: 5c8f1ac7823d1ed41b53c4f97ab9a7bac33ea94b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45214
In verbose mode, the package exporter produces an HTML visualization
of a module's dependencies to make it easier to trim out unneeded code
or debug the inclusion of things that cannot be exported.
Test Plan: Imported from OSS
Reviewed By: suo
Differential Revision: D23873525
Pulled By: zdevito
fbshipit-source-id: 6801991573d8dd5ab8c284e09572b36a35e1e5a4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45402
Previous diffs in this stack implemented the getNumKeys and deleteKey
APIs in the c10d Store as well as added tests at the C++ layer. This diff adds
tests at the Python level in test_c10d.py
ghstack-source-id: 112997161
Test Plan: Running these new python tests as well as previous C++ tests
Reviewed By: mrshenli
Differential Revision: D23955729
fbshipit-source-id: c7e0af7c884de2d488320e2a1d94aec801a782e5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45401
Added a DeleteKey API for the TCP Store
ghstack-source-id: 112997162
Test Plan:
Modified the existing get/set test to use delete. Verified that the
correct keys were deleted and that the numKeys API returned the right values.
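The kind of check described in this test plan can be sketched against a toy in-memory store. This is not the c10d TCPStore and makes no claim about its behavior: the `ToyStore` class, the snake_case method names `delete_key` and `num_keys`, and the boolean return value of `delete_key` are all assumptions made for illustration.

```python
class ToyStore:
    """In-memory stand-in for a key-value store; not the c10d TCPStore."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]

    def delete_key(self, key):
        # Report whether a key was actually removed (assumed semantics).
        if key in self._data:
            del self._data[key]
            return True
        return False

    def num_keys(self):
        return len(self._data)


store = ToyStore()
store.set("key0", b"value0")
store.set("key1", b"value1")

before = store.num_keys()
deleted = store.delete_key("key0")
deleted_again = store.delete_key("key0")  # the key is already gone
after = store.num_keys()
```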
Reviewed By: mrshenli
Differential Revision: D23955730
fbshipit-source-id: 5c9f82be34ff4521c59f56f8d9c1abf775c67f9f
Summary: As title.
Test Plan:
FBL job without this diff failed:
f221545832
Error message:
```
NonRetryableException: AssertionError: Label is missing in training stage for HistogramBinningCalibration
```
FBL job with canary package built in this diff is running without failure:
f221650379
Reviewed By: chenshouyuan
Differential Revision: D23959508
fbshipit-source-id: c077230de29f7abfd092c84747eaabda0b532bcc
Summary:
Recent changes to the seq_num correlation behavior in the profiler (PR https://github.com/pytorch/pytorch/issues/42565) have changed the behavior of emit_nvtx(record_shapes=True), which no longer prints the operator name properly.
This PR dumps the operator name in roctx traces irrespective of the assigned sequence number; the change applies to ROCm only.
cc: jeffdaily sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45229
Reviewed By: zou3519
Differential Revision: D23932902
Pulled By: albanD
fbshipit-source-id: c782667ff002b70b51f1cc921afd1b1ac533b39d
Summary:
This PR cleans up some of the rough edges around `Timer` and `Compare`
* Moves `Measurement` to be dataclass based
* Adds a bunch of type annotations. MyPy is now happy.
* Allows missing entries in `Compare`. This is one of the biggest usability issues with `Compare` right now, both from an API perspective and because the current failure mode is really unpleasant.
* Greatly expands the testing of `Compare`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45361
Test Plan: Changes to Timer are covered under existing tests, changes to `Compare` are covered by the expanded `test_compare` method.
Reviewed By: bwasti
Differential Revision: D23966816
Pulled By: robieta
fbshipit-source-id: 826969f73b42f72fa35f4de3c64d0988b61474cd
Summary:
Export of view op with dynamic input shape is broken when using tensors with a 0-dim.
This fix removes symbolic use of static input size to fix this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43558
Reviewed By: ailzhang
Differential Revision: D23965090
Pulled By: bzinodev
fbshipit-source-id: 628e9d7ee5d53375f25052340ca6feabf7ba7c53
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45291
It's not necessary; you can just check whether the dtype is integral.
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D23911963
Pulled By: gchanan
fbshipit-source-id: 230139e1651eb76226f4095e31068dded30e03e8
Summary: Adding support for type double to caffe2 MeanOp and MeanGradientOp.
Test Plan:
All tests passed.
Example FBL job failed without this diff:
f221169563
Error message:
```
c10::Error: [enforce fail at mean_op.h:72] . Mean operator only supports 32-bit float, but input was of type double (Error from operator:
input: "dpsgd_8/Copy_3" input: "dpsgd_8/Copy_4" output: "dpsgd_8/Mean_2" name: "" type: "Mean" device_option { device_type: 0 device_id: 0 })
```
Example FBL job is running without failure with the canary package built from this diff:
f221468723
Reviewed By: chenshouyuan
Differential Revision: D23956222
fbshipit-source-id: 6c81bbc390d812ae0ac235e7d025141c8402def1