Commit Graph

10 Commits

Author SHA1 Message Date
Nikolay Korovaiko
a7e22b4c6a add bailout checks to checkScript (#32802)
Summary:
This adds enough infrastructure to run bailout checks in `checkScript`. I'll need to figure out the best way to enable it for nightly builds now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32802

Differential Revision: D19974718

Pulled By: Krovatkin

fbshipit-source-id: 40485503f6d3ae14edcce98e1eec1f0559f3ad08
2020-02-21 21:18:54 -08:00
Rohan Varma
6cb9e6b015 Back out "Revert D19871946: [distributed] pass in timeout to TCP store when initializing" (#33434)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33434

Reland of https://github.com/pytorch/pytorch/pull/33325, since the
unit test was flaky and failed on land.
To ensure that the test is not flaky, I bumped the timeout so the rendezvous
does not time out (timing out the rendezvous in 1s led to the flakiness). I also
generalized our mechanism for retrying on errors to include retrying on errors
caused by a timeout during rendezvous.
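
A rough illustration of the initialization path this affects (a minimal sketch; the host, port, and timeout values are placeholders):

```python
import datetime
import torch.distributed as dist

# Minimal sketch: the timeout passed at initialization is forwarded to the
# underlying TCPStore, so a slow rendezvous is not cut off by a too-short window.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29500",
    rank=0,
    world_size=1,
    timeout=datetime.timedelta(seconds=30),
)
```
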
ghstack-source-id: 98558377

Test Plan: Added UT test_tcp_store_timeout_set

Differential Revision: D19935390

fbshipit-source-id: 56ccf8c333dd2f954a33614d35cd1642d4e9473a
2020-02-19 17:17:17 -08:00
ptrblck
1e3664b6ef Remove c/pdist tests from _internal/common_utils.py (#33409)
Summary:
* remove brute_test from `torch/testing/_internal/common_utils.py`
* add these tests as internal tests to `test_torch.py`

CC ailzhang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33409

Differential Revision: D19951729

Pulled By: ailzhang

fbshipit-source-id: b1126aaf26fa64a0f17cbb582dc8038b79cfe3eb
2020-02-19 10:27:30 -08:00
Pritam Damania
fd684cc312 Use torch.set_default_dtype in test_data_parallel and rename dtype2prec (#32962)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32962

As per gchanan's comments on
https://github.com/pytorch/pytorch/pull/30445, I've used
`torch.set_default_dtype` in test_data_parallel instead of specifying
dtype=torch.double everywhere. I also renamed dtype2prec to dtype2prec_DONTUSE.
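
For reference, the pattern the tests switch to looks roughly like this (a minimal sketch, not the actual test code):

```python
import torch

# Set the default dtype once instead of passing dtype=torch.double everywhere.
torch.set_default_dtype(torch.double)
x = torch.randn(3, 3)           # now created as float64 by default
assert x.dtype == torch.double
```
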
ghstack-source-id: 98388429

Test Plan: waitforbuildbot

Differential Revision: D19714374

fbshipit-source-id: eb55bbca33881625636ba9ea6dd4cb692f25668e
2020-02-15 14:07:54 -08:00
ptrblck
a64d0ffe81 Use int64 in pdist kernel to handle batches >= 46342 #30583 (#31593)
Summary:
Currently `torch.pdist` yields an illegal CUDA memory access for batch sizes >= 46342 as reported by SsnL in https://github.com/pytorch/pytorch/issues/30583.
Thanks for the minimal code reproduction, btw! ;)

Reason for this bug:
The calculation of `i` in the [`pdist_kernel_cuda_impl`](46ad80c839/aten/src/ATen/native/cuda/DistanceKernel.cu (L112)) might overflow if a tensor with a batch size >= 46342 is passed to `torch.pdist`.

Detailed description:
* `result` is resized to `n * (n - 1) / 2 = 1073767311` elements ([line of code](46ad80c839/aten/src/ATen/native/Distance.cpp (L140)))
* `grid` is initialized as `result.numel()` ([line of code](46ad80c839/aten/src/ATen/native/cuda/DistanceKernel.cu (L246)))
* `k` is assigned from `blockIdx.x` as an `int32` ([line of code](46ad80c839/aten/src/ATen/native/cuda/DistanceKernel.cu (L108)))
* `i` is calculated using `2 * k >= 2147534622` ([line of code](46ad80c839/aten/src/ATen/native/cuda/DistanceKernel.cu (L112))), which overflows, since `2147534622 > 2147483647 (int32_max)`.

Using `const int64_t k = blockIdx.x;` would solve the illegal memory access. This also seems to be done for [`cdist_kernel_cuda_impl`](46ad80c839/aten/src/ATen/native/cuda/DistanceKernel.cu (L198-L201)).
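
As a quick sanity check of the arithmetic above, a minimal Python sketch (assuming one CUDA block per pair, as described):

```python
# Not part of the fix; this only verifies the overflow arithmetic described above.
INT32_MAX = 2**31 - 1
n = 46342                     # smallest failing batch size
num_pairs = n * (n - 1) // 2  # 1073767311 -> result.numel() and the grid size
k_max = num_pairs - 1         # largest blockIdx.x
assert 2 * k_max > INT32_MAX  # `2 * k` wraps around when `k` is an int32
```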

However, we might expect a slowdown, so I've timed the current PyTorch master vs. this PR:
(tested with `x = torch.randn(x.size(0), 128)` on a V100)

| x.size(0) | int32 idx | int64 idx | slowdown |
|-----------|-----------|-----------|----------|
| 50000     | -         | 4.4460    | -        |
| 25000     | 1.02522   | 1.10869   | 7.53%    |
| 12500     | 0.25182   | 0.27277   | 7.68%    |
| 6250      | 0.06291   | 0.06817   | 7.72%    |
| 3125      | 0.01573   | 0.01704   | 7.69%    |
| 1562      | 0.00393   | 0.00426   | 7.75%    |
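
A sketch of how such timings could be reproduced (assumed methodology; not the author's exact benchmark script):

```python
import time
import torch

# Hedged sketch: average wall-clock time of torch.pdist on the GPU for a given batch size.
def time_pdist(n, dim=128, iters=10):
    x = torch.randn(n, dim, device="cuda")
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        torch.pdist(x)
    torch.cuda.synchronize()
    return (time.time() - start) / iters
```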

While checking the backward kernel, it seems I'm triggering another error once the input size exceeds a limit:
```python
x = torch.randn(1449, 1, device='cuda', requires_grad=True)
out = torch.pdist(x)
out.mean().backward()
> RuntimeError: CUDA error: invalid configuration argument
```
An input of shape `[<=1448, 1]` still works.

I'll take another look at this issue. Let me know if the potential fix should go into this PR or if I should open a new issue.

CC ngimel, csarofeen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31593

Differential Revision: D19825571

Pulled By: ngimel

fbshipit-source-id: ace9ccab49f3cf0ce894cdb6daef0795e2e8ec03
2020-02-11 12:00:39 -08:00
George Guanheng Zhang
f4fbe9549d Revert D19800021: [pytorch][PR] Improve error message for assertWarnsRegex
Test Plan: revert-hammer

Differential Revision: D19800021

Original commit changeset: 1c31ae785c8f

fbshipit-source-id: d7b340d678562c25a84d48be66c576075000b50d
2020-02-10 12:17:52 -08:00
Peter Bell
c917a247a8 Improve error message for assertWarnsRegex (#33099)
Summary:
`assertWarnsRegex` now prints out any warnings that it caught while failing to find a matching warning. This makes it easier to debug tests by just looking at the CI logs.
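
The usage pattern is the same as the standard-library method of the same name; a minimal sketch using plain `unittest` (not PyTorch's test harness):

```python
import unittest
import warnings

class ExampleTest(unittest.TestCase):
    def test_warns(self):
        # Fails if no warning matching the regex is raised; the change above makes
        # the PyTorch helper's failure message also list the warnings it did catch.
        with self.assertWarnsRegex(UserWarning, "deprecated"):
            warnings.warn("this API is deprecated", UserWarning)
```
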
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33099

Differential Revision: D19800021

Pulled By: ezyang

fbshipit-source-id: 1c31ae785c8ffc5d47619aff6597e479263be2de
2020-02-10 07:27:59 -08:00
Richard Zou
6209412647 Add option to use ninja to compile ahead-of-time cpp_extensions (#32495)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32495

Background
------------------------------
Previously, ninja was used to compile+link inline cpp_extensions, while
ahead-of-time cpp_extensions were compiled with distutils. This PR adds
the ability to compile (but not link) ahead-of-time cpp_extensions with ninja.

The main motivation for this is to speed up cpp_extension builds: distutils
does not make use of parallelism. With this PR, using the new option, on my machine,
- torchvision compilation goes from 3m43s to 49s
- nestedtensor compilation goes from 2m0s to 28s.

User-facing changes
------------------------------

I added a `use_ninja` flag to BuildExtension. This defaults to
`True`. When `use_ninja` is True:
- it will attempt to use ninja.
- If we cannot use ninja, then this throws a warning and falls back to
distutils.
- Situations where we cannot use ninja: Windows (NYI; I'll open a new issue
for this), or if ninja cannot be found on the system.
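
A minimal `setup.py` sketch of the new flag (the extension name and sources are placeholders; `BuildExtension.with_options` is assumed here as the way to pass the flag through `cmdclass`):

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name="my_extension",
    ext_modules=[CppExtension("my_extension", ["my_extension.cpp"])],
    # use_ninja defaults to True; set it to False to force the distutils path
    cmdclass={"build_ext": BuildExtension.with_options(use_ninja=True)},
)
```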

Implementation Details
------------------------------

This PR makes this change in two steps. Please let me know if it would be
easier to review if I split this up into a stacked diff.
Those changes are:
1) refactor _write_ninja_file to separate the policy (what compiler flags
to pass) from the mechanism (how to write the ninja file and do compilation).
2) call _write_ninja_file and _run_ninja_build while building
ahead-of-time cpp_extensions. These are only used to compile objects;
distutils still handles the linking.

Change 1: refactor _write_ninja_file to separate policy from mechanism
- I split _write_ninja_file into: _write_ninja_file and
_write_ninja_file_to_build_library
- I renamed _build_extension_module to _run_ninja_build

Change 2: Call _write_ninja_file while building ahead-of-time
cpp_extensions
- _write_ninja_file_and_compile_objects calls _write_ninja_file to only
build object files.
- We monkey-patch distutils.CCompiler.compile to call
_write_ninja_file_and_compile_objects
- distutils still handles the linking step. The linking step is not a
bottleneck so it was not a concern.
- This change only works on Unix-based systems. Our code for Windows
goes down a different codepath and I did not want to mess with that.
- If a system does not support ninja, we raise a warning and fall back
to the original compilation path.

Test Plan
------------------------------

Ad-hoc testing
- I built torchvision using pytorch master and printed out the build
commands. Next, I used this branch to build torchvision and looked at
the ninja file. I compared the ninja file with the build commands and
asserted that they were functionally the same.
- I repeated the above for pytorch/nestedtensor.

PyTorch test suite
- I split `test_cpp_extensions` into `test_cpp_extensions_aot` and
`test_cpp_extensions_jit`. The AOT (ahead-of-time) version tests
ahead-of-time compilation and the JIT version tests just-in-time compilation
(not to be confused with TorchScript).
- `test_cpp_extensions_aot` gets run TWICE by run_test.py, once with
a module that was built with ninja, and once with a module that was
built without ninja.
- run_test.py asserts that when we are building with use_ninja=True,
ninja is actually available on the system.

Test Plan: Imported from OSS

Differential Revision: D19730432

Pulled By: zou3519

fbshipit-source-id: 819590d01cf65e8da5a1e8019b8b3084792fee90
2020-02-05 18:49:29 -08:00
davidriazati
2060e0a9dd Split serialization tests to their own file (#32241)
Summary:
Stacked PRs
 * #32244 - Make zip serialization the default
 * **#32241 - Split serialization tests to their own file**

This makes them all easier to run as a batch. This PR is just a code move / fixing up imports. There are still some serialization tests in `test_torch.py` as part of `TestDeviceType`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32241

Pulled By: driazati

Differential Revision: D19415826

fbshipit-source-id: a3f6cfe1626ff2f9b9631c409bf525bd32e4639b
2020-01-28 15:04:05 -08:00
Pritam Damania
f050b16dd9 Move pytorch distributed tests to separate folder for contbuild. (#30445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30445

Create distributed and rpc directories under caffe/test for better management
of unit tests.

Differential Revision: D18702786

fbshipit-source-id: e9daeed0cfb846ef68806f6decfcb57c0e0e3606
2020-01-22 21:16:59 -08:00