Summary:
This diff introduces a set of changes that makes it possible for the host to get assertions from CUDA devices. This includes the introduction of
**`CUDA_KERNEL_ASSERT2`**
A preprocessor macro to be used within a CUDA kernel that, upon an assertion failure, writes the assertion message, file, line number, and possibly other information to UVM (Managed memory). Once this is done, the original assertion is triggered, which places the GPU in a Bad State requiring recovery. In my tests, data written to UVM appears there before the GPU reaches the Bad State and is still accessible from the host after the GPU is in this state.
Messages are written to a multi-message buffer which can, in theory, hold many assertion failures. I've done this as a precaution in case there are several, but I don't actually know whether that is possible and a simpler design which holds only a single message may well be all that is necessary.
**`TORCH_DSA_KERNEL_ARGS`**
This preprocessor macro is added as an _argument_ to a kernel function's signature. It expands to supply the standardized names of all the arguments needed by `C10_CUDA_COMMUNICATING_KERNEL_ASSERTION` to handle device-side assertions. This includes, e.g., the name of the pointer to the UVM memory the assertion would be written to. This macro abstracts the arguments so there is a single point of change if the system needs to be modified.
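To make the intended usage concrete, here is a minimal sketch of a kernel that uses both macros. The kernel name, body, and include path are illustrative assumptions, not code from this diff:
```c++
#include <cstdint>
#include <c10/cuda/CUDADeviceAssertion.h>  // assumed home of the new macros

// Hypothetical kernel showing where the two macros sit. TORCH_DSA_KERNEL_ARGS
// expands to the standardized assertion-related parameters (e.g. the pointer
// to the UVM buffer); CUDA_KERNEL_ASSERT2 records the failure to UVM before
// triggering the ordinary device-side assert.
__global__ void fill_positive_kernel(
    float* out,
    const float* in,
    int64_t numel,
    TORCH_DSA_KERNEL_ARGS) {
  const int64_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < numel) {
    CUDA_KERNEL_ASSERT2(in[i] > 0.0f);  // message, file, line land in UVM on failure
    out[i] = in[i];
  }
}
```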
**`c10::cuda::get_global_cuda_kernel_launch_registry()`**
This host-side function returns a singleton object that manages the host's part of the device-side assertions. Upon allocation, the singleton allocates sufficient UVM (Managed) memory to hold information about several device-side assertion failures. The singleton also provides methods for getting the current traceback (used to identify where a kernel was launched from). To avoid consuming all the host's memory, the singleton stores launches in a circular buffer; a unique "generation number" is used to ensure that kernel launch failures map to their actual launch points (in case the circular buffer wraps before the failure is detected).
**`TORCH_DSA_KERNEL_LAUNCH`**
This host-side preprocessor macro replaces the standard
```
kernel_name<<<blocks, threads, shmem, stream>>>(args)
```
invocation with
```
TORCH_DSA_KERNEL_LAUNCH(blocks, threads, shmem, stream, args);
```
Internally, it fetches the UVM (Managed) pointer and generation number from the singleton and appends these to the standard argument list. It also checks that the kernel launches correctly. This abstraction on kernel launches can be modified to provide additional safety/logging.
**`c10::cuda::c10_retrieve_device_side_assertion_info`**
This host-side function checks, when called, that no kernel assertions have occurred. If one has, it raises an exception with:
1. Information (file, line number) about which kernel was launched.
2. Information (file, line number, message) about the device-side assertion.
3. Information (file, line number) about where the failure was detected.
**Checking for device-side assertions**
Device-side assertions are most likely to be noticed by the host when a CUDA API call such as `cudaDeviceSynchronize` is made and fails with a `cudaError_t` indicating
> CUDA error: device-side assert triggered CUDA kernel errors
Therefore, we rewrite `C10_CUDA_CHECK()` to include a call to `c10_retrieve_device_side_assertion_info()`. To make the code cleaner, most of the logic of `C10_CUDA_CHECK()` is now contained within a new function `c10_cuda_check_implementation()` to which `C10_CUDA_CHECK` passes the preprocessor information about filenames, function names, and line numbers. (In C++20 we can use `std::source_location` to eliminate macros entirely!)
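Roughly, the described split looks like the sketch below. This is an illustration of the shape only; the exact argument list and namespace of `c10_cuda_check_implementation` are assumptions based on the description above:
```c++
// Sketch: the macro only captures call-site context; the implementation
// function translates the cudaError_t and, on a device-side assert, calls
// c10_retrieve_device_side_assertion_info to build the rich error message.
#define C10_CUDA_CHECK(EXPR)                              \
  do {                                                    \
    const cudaError_t __err = (EXPR);                     \
    c10::cuda::c10_cuda_check_implementation(             \
        __err, __FILE__, __func__, __LINE__,              \
        /*include_device_assertions=*/true);              \
  } while (0)
```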
# Notes on special cases
* Multiple assertions from the same block are recorded
* Multiple assertions from different blocks are recorded
* Launching kernels from many threads on many streams seems to be handled correctly
* If two processes are using the same GPU and one of the processes fails with a device-side assertion, the other process continues without issue
* X Multiple assertions from separate kernels on different streams seem to be recorded, but we can't reproduce the test condition
* X Multiple assertions from separate devices should all be shown upon exit, but we've been unable to generate a test that produces this condition
Differential Revision: D37621532
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84609
Approved by: https://github.com/ezyang, https://github.com/malfet
… as equivalent replacements for std::is_pod and std::is_pod_v because they are deprecated in C++20.
When consuming libtorch header files in a project that uses C++20, there are warnings about std::is_pod being deprecated. This patch fixes that issue.
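For context, a POD type is one that is both trivial and standard-layout, so the usual drop-in replacement is the conjunction of those two traits. A minimal illustration (the trait name here is made up; it is not necessarily what the patch adds):
```c++
#include <type_traits>

// std::is_pod is deprecated in C++20. A POD type is, by definition, a type
// that is both trivial and standard-layout, so the replacement trait is the
// conjunction of the two non-deprecated traits.
template <typename T>
inline constexpr bool is_pod_like_v =
    std::is_trivial_v<T> && std::is_standard_layout_v<T>;

struct Plain { int x; double y; };
static_assert(is_pod_like_v<Plain>, "Plain is trivial and standard-layout");
```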
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88918
Approved by: https://github.com/ezyang
One PR towards #89205.
The content is mostly from PR #38465, but the expression is slightly changed to make it faster.
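For reference, here is the identity behind the expression used in `log1p_v0` below (a derivation note added for context, not part of the original submission). Writing $z = x + iy$,

$$\log(1+z) = \tfrac{1}{2}\log\bigl(|1+z|^2\bigr) + i\,\arg(1+z) = \tfrac{1}{2}\,\operatorname{log1p}\bigl(x(x+2) + y^2\bigr) + i\,\operatorname{atan2}(y,\, x+1),$$

since $|1+z|^2 - 1 = (1+x)^2 + y^2 - 1 = x(x+2) + y^2$, and applying `log1p` to that small quantity is more accurate than applying `log` to $|1+z|^2$ when $z$ is near zero.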
Here is some benchmarking code:
```c++
#include <complex>
#include <iostream>
#include <chrono>
// main.cc
template <typename T>
inline std::complex<T> log1p_v0(const std::complex<T> &z) {
  // this PR
  T x = z.real();
  T y = z.imag();
  T theta = std::atan2(y, x + T(1));
  T r = x * (x + T(2)) + y * y;
  return {T(0.5) * std::log1p(r), theta};
}

template <typename T>
inline std::complex<T> log1p_v1(const std::complex<T> &z) {
  // PR #38465
  T x = z.real();
  T y = z.imag();
  std::complex<T> p1 = z + T(1);
  T r = std::abs(p1);
  T a = std::arg(p1);
  T rm1 = (x * x + y * y + x * T(2)) / (r + 1);
  return {std::log1p(rm1), a};
}

template <typename T>
inline std::complex<T> log1p_v2(const std::complex<T> &z) {
  // naive, but numerically inaccurate
  return std::log(T(1) + z);
}

int main() {
  int n = 1000000;
  std::complex<float> res(0.0, 0.0);
  std::complex<float> input(0.5, 2.0);

  auto start = std::chrono::system_clock::now();
  for (int i = 0; i < n; i++) {
    res += log1p_v0(input);
  }
  auto end = std::chrono::system_clock::now();
  auto elapsed = end - start;
  std::cout << "time for v0: " << elapsed.count() << '\n';

  start = std::chrono::system_clock::now();
  for (int i = 0; i < n; i++) {
    res += log1p_v1(input);
  }
  end = std::chrono::system_clock::now();
  elapsed = end - start;
  std::cout << "time for v1: " << elapsed.count() << '\n';

  start = std::chrono::system_clock::now();
  for (int i = 0; i < n; i++) {
    res += log1p_v2(input);
  }
  end = std::chrono::system_clock::now();
  elapsed = end - start;
  std::cout << "time for v2: " << elapsed.count() << '\n';

  std::cout << res << '\n';
}
```
Compiling with `g++ main.cc` and running the resulting binary produces the following timings:
```
time for v0: 237812271
time for v1: 414524941
time for v2: 360585994
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89214
Approved by: https://github.com/lezcano
Fixes #43144
This uses the Backend system added by [82682](https://github.com/pytorch/pytorch/pull/82682) to change allocators dynamically during code execution. This will allow us to use RMM, to use CUDA managed memory for portions of the code that do not fit in GPU memory, to write static memory allocators that reduce fragmentation while training models, and to improve interoperability with external DL compilers/libraries.
For example, we could have the following allocator in c++
```c++
#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>
extern "C" {
void* my_malloc(ssize_t size, int device, cudaStream_t stream) {
  void* ptr;
  std::cout << "alloc " << size << std::endl;
  cudaMalloc(&ptr, size);
  return ptr;
}

void my_free(void* ptr) {
  std::cout << "free " << std::endl;
  cudaFree(ptr);
}
}
```
Compile it as a shared library
```
nvcc allocator.cc -o alloc.so -shared --compiler-options '-fPIC'
```
And use it from PyTorch as follows
```python
import torch
# Init caching
# b = torch.zeros(10, device='cuda')
new_alloc = torch.cuda.memory.CUDAPluggableAllocator('alloc.so', 'my_malloc', 'my_free')
old = torch.cuda.memory.get_current_allocator()
torch.cuda.memory.change_current_allocator(new_alloc)
b = torch.zeros(10, device='cuda')
# This will error since the current allocator was already instantiated
torch.cuda.memory.change_current_allocator(old)
```
Things to discuss
- How to test this, needs compiling external code ...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86786
Approved by: https://github.com/albanD
We will need this to implement a convolution meta function that
is SymInt aware. I use templates so that regular convolution code
is not affected by the change. No tests for symbolic ints directly; that will
come in a subsequent PR which also needs to refactor fake tensors.
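A hypothetical illustration of the templating approach described above (the helper name and signature are invented for the example; the real change parametrizes the existing convolution shape logic rather than adding a new function):
```c++
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative only: the output-size arithmetic is written once over an
// integer-like type T, so instantiating with int64_t leaves the existing
// eager path untouched while a c10::SymInt instantiation enables a
// SymInt-aware meta function. Dilation is omitted for brevity.
template <typename T>
std::vector<T> conv_output_size_like(
    const std::vector<T>& input_size,   // e.g. {N, C_in, H, W}
    const std::vector<T>& weight_size,  // e.g. {C_out, C_in, kH, kW}
    const std::vector<T>& stride,
    const std::vector<T>& padding) {
  std::vector<T> out{input_size[0], weight_size[0]};
  for (std::size_t d = 2; d < input_size.size(); ++d) {
    out.push_back(
        (input_size[d] + 2 * padding[d - 2] - weight_size[d]) / stride[d - 2] + 1);
  }
  return out;
}

// Usage with plain integers; a symbolic build would instantiate the same template:
// auto sizes = conv_output_size_like<int64_t>({1, 3, 32, 32}, {8, 3, 3, 3}, {1, 1}, {0, 0});
```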
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89069
Approved by: https://github.com/SherlockNoMad
Currently, all of the distributed errors are thrown from the `TORCH_CHECK` macro, which throws a generic `RuntimeError`. This change introduces a new error type, `DistBackendError`, which derives from `RuntimeError` and signifies that there was an error with the backend communication library. This allows for better error handling and analysis at higher levels in the stack. Motivation: https://docs.google.com/document/d/1j6VPOkC6znscliFuiDWMuMV1_fH4Abgdq7TCHMcXai4/edit#heading=h.a9rc38misyx8
Changes:
- Introduce the new error type `DistBackendError`
- Update `C10D_NCCL_CHECK`
Sample script to demonstrate new error type
```python
# python -m torch.distributed.run --nproc_per_node=2 <script>.py
import torch
import torch.distributed as dist
if __name__ == "__main__":
    dist.init_process_group("nccl")
    dist.broadcast(torch.tensor([1, 2, 3]).cuda(), 0)
```
Differential Revision: [D40998803](https://our.internmc.facebook.com/intern/diff/D40998803)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88134
Approved by: https://github.com/rohan-varma
Summary: Add an option to disable TORCH_WARN; some ops can trigger spammy TORCH_WARN logs, which is not desired in certain scenarios.
Test Plan:
Tested with
`-pt.disable_warn = 1` and `-pt.disable_warn = 0`;
verified that TORCH_WARN and TORCH_WARN_ONCE are properly handled.
Tested with
`-pt.strip_error_messages = 1, -pt.disable_warn = 0`;
verified that error-message stripping is respected when a warning is printed.
Differential Revision: D40321550
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87188
Approved by: https://github.com/kurtamohler, https://github.com/ezyang
Along the way, I undid making sparse/dense dim symint (they're
dimensions, so they should be static).
Also symintify `set_indices_and_values_unsafe`.
There is a bit of a nontrivial infra change here: previously, we didn't populate the strides field on sparse tensors. It is now populated with "empty" strides, which meant that sparse tensors were falsely reporting they were non-overlapping dense/contiguous. I added a hack to work around this case.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88573
Approved by: https://github.com/anjali411
I saw some missed optimization opportunities in C10 using std::move and thought I would submit a PR to fix them. In particular, there are a lot of them dealing with the symbolic operators, which are used in quite a few places, including in loops.
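A small, generic illustration of the kind of missed opportunity being fixed (not an actual hunk from this PR):
```c++
#include <string>
#include <utility>
#include <vector>

// When a function receives an expensive-to-copy value (e.g. something holding
// a vector of symbolic sizes) and only needs to store it, moving it into the
// member avoids a copy on every call, which matters inside loops.
struct SizesHolder {
  std::vector<std::string> names_;

  void set_names(std::vector<std::string> names) {
    names_ = std::move(names);  // before: names_ = names; (extra copy)
  }
};

int main() {
  SizesHolder h;
  h.set_names({"batch", "channels", "height", "width"});
  return 0;
}
```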
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88512
Approved by: https://github.com/ezyang
This also comes with some bug fixes that were uncovered from doing
this:
- Forward device calls to inner tensor in FunctionalTensorWrapper
- Make legacyExtractDispatchKey exclude Functionalize, so that
it can get at the real device type key. This is noncontroversial.
- Stop stripping dense from key set. The reason for this is
FunctionalTensorWrapper may be used in contexts where people
query if it is dense or not. If it doesn't report this correctly
(from the dispatch key), it will cause errors. This caused some
torchbench models to fail when I did one-pass tracing.
- Save and restore reapply views TLS correctly
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88063
Approved by: https://github.com/bdhirsh
Summary:
Improved the roundup_power2_divisions knob so it allows better control of rounding in the PyTorch CUDA Caching Allocator.
This new version allows setting the number of divisions per power-of-two interval, starting from 1MB and ending at 64GB and above. An example use case is when rounding is desirable for small allocations, but there are also very large allocations which are persistent and thus would not benefit from rounding, only taking up extra space.
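As a usage illustration (hedged: this is the bracketed per-interval syntax documented for `PYTORCH_CUDA_ALLOC_CONF` in later PyTorch releases, and the script name is made up; the exact syntax introduced by this diff may differ):
```
PYTORCH_CUDA_ALLOC_CONF=roundup_power2_divisions:[256:8,512:4,1024:2,>:1] python train.py
```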
Test Plan: Tested locally
Differential Revision: D40103909
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87290
Approved by: https://github.com/zdevito
See strategy at PythonOpRegistrationTrampoline.cpp for the
big picture.
Along the way, I made OperatorHandle support == and hashing,
and slightly changed the low-level python_dispatch impl API
to disallow empty strings for the dispatch key, which had the knock-on
effect of requiring us to explicitly make sure we pass in
CompositeImplicitAutograd if we would have passed in "" (I didn't apply
this to the rest of the file because I'm lazy.)
Test strategy is we delete the logic for preventing Python op
registrations in torch from being skipped in a torchdeploy context
and show CI still works.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87162
Approved by: https://github.com/anjali411, https://github.com/bdhirsh
### Bug description
When `__SYCL_DEVICE_ONLY__` is defined, while building PyTorch, the output of the preprocessing step would not have the closing curly brace of the `extern "C"` block, as it has been incorrectly placed. Compilers don't seem to report an error or a warning for a missing closing brace of an `extern "C"` block.
### Impact of the bug
If `c10/macros/Macros.h` is included in a C++ file and, after the preprocessing stage, the preprocessed source file has some templated code after `extern "C" {`, then compilation might fail with the error `templates must have C++ linkage`, e.g. https://stackoverflow.com/questions/61717819/template-with-c-linkage-error-when-using-template-keyword-in-main-cpp/61717908#61717908 (its answer also has a small snippet of code to reproduce such an issue).
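A compilable illustration of the underlying constraint (a generic example, not the contents of `c10/macros/Macros.h`): templated code is only legal once the `extern "C"` block has been closed, which is why a misplaced closing brace surfaces as this compiler error in downstream code.
```c++
// If the closing brace below were missing (which is what happened under
// __SYCL_DEVICE_ONLY__ before this fix), the template that follows would be
// parsed inside the extern "C" block and compilation would fail with
// "templates must have C++ linkage".
extern "C" {
void c_linkage_declaration(int);  // plain C-linkage declarations are fine here
}  // <-- templated code is only legal once the block is closed

template <typename T>
T add_one(T x) {
  return x + 1;
}

int main() {
  return add_one(41) == 42 ? 0 : 1;
}
```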
### Solution in this PR
A one-liner bug fix that rectifies the placement of the closing curly brace (`}`), so that the `extern "C"` block ends properly when `__SYCL_DEVICE_ONLY__` is defined.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87853
Approved by: https://github.com/jgong5, https://github.com/kit1980, https://github.com/malfet
This refactor was prompted by challenges handling mixed int/float
operations in C++. A previous version of this patch
added overloads for each permutation of int/float and was unwieldy
(https://github.com/pytorch/pytorch/pull/87722/). This PR takes a different
approach.
The general outline of the patch is to combine the C++ types SymIntNode
and SymFloatNode into a single type, SymNode. This is type erased; we
no longer know statically in C++ whether we have an int or a float and have to test
it with the is_int()/is_float() virtual methods (see the sketch after the list below). This has a number of
knock-on effects.
- We no longer have C++ classes to bind to Python. Instead, we take an
entirely new approach to our Python API, where we have a SymInt/SymFloat
class defined entirely in Python, which hold a SymNode (which corresponds
to the C++ SymNode). However, SymNode is not pybind11-bound; instead,
it lives as-is in Python, and is wrapped into C++ SymNode using PythonSymNode
when it goes into C++. This implies a userland rename.
In principle, it is also possible for the canonical implementation of SymNode
to be written in C++, and then bound to Python with pybind11 (we have
this code, although it is commented out.) However, I did not implement
this as we currently have no C++ implementations of SymNode.
Because we do return SymInt/SymFloat from C++ bindings, the C++ binding
code needs to know how to find these classes. Currently, this is done
just by manually importing torch and getting the attributes.
- Because SymInt/SymFloat are easy Python wrappers, __sym_dispatch__ now
takes SymInt/SymFloat, rather than SymNode, bringing it in line with how
__torch_dispatch__ works.
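For orientation, a minimal sketch of the type-erased shape described above, using illustrative names only (the real `c10::SymNode` is a reference-counted handle with a much larger interface):
```c++
#include <cstdint>
#include <memory>
#include <stdexcept>

// Illustrative only: one erased node type instead of separate SymIntNode /
// SymFloatNode classes; callers test is_int()/is_float() at runtime.
struct SymNodeLike {
  virtual ~SymNodeLike() = default;
  virtual bool is_int() const = 0;
  virtual bool is_float() const = 0;
  virtual int64_t guard_int() const { throw std::runtime_error("not an int"); }
  virtual double guard_float() const { throw std::runtime_error("not a float"); }
};

struct ConstantIntNode : SymNodeLike {
  int64_t value;
  explicit ConstantIntNode(int64_t v) : value(v) {}
  bool is_int() const override { return true; }
  bool is_float() const override { return false; }
  int64_t guard_int() const override { return value; }
};

// The user-facing SymInt/SymFloat wrappers (defined in Python in this PR)
// would each hold a pointer to one of these nodes.
int main() {
  std::shared_ptr<SymNodeLike> n = std::make_shared<ConstantIntNode>(3);
  return n->is_int() && n->guard_int() == 3 ? 0 : 1;
}
```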
Some miscellaneous improvements:
- SymInt now has a constructor that takes SymNode. Note that this
constructor is ambiguous if you pass in a subclass of SymNode,
so an explicit downcast is necessary. This means toSymFloat/toSymInt
are no more. This is a mild optimization as it means rvalue reference
works automatically.
- We uniformly use the caster for c10::SymInt/SymFloat, rather than
going the long way via the SymIntNode/SymFloatNode.
- Removed some unnecessary toSymInt/toSymFloat calls in normalize_*
functions, pretty sure this doesn't do anything.
- guard_int is now a free function, since to guard on an int you cannot
assume the method exists. A function can handle both int and SymInt
inputs.
- We clean up the magic method definition code for SymInt/SymFloat/SymNode.
ONLY the user classes (SymInt/SymFloat) get magic methods; SymNode gets
plain methods; this is to help avoid confusion between the two types.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87817
Approved by: https://github.com/albanD, https://github.com/anjali411
This replaces the manual function pointers, making it easier to write
new drop-in allocators.
Note that most allocation goes through the Allocator interface, which
CUDAAllocator inherits from, and this arrangement avoids adding an
additional layer of dispatch along this pathway compared to what existed before.
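A rough sketch of what "an interface instead of manual function pointers" means here; the class and method names are illustrative, not the actual `CUDAAllocator` declaration:
```c++
#include <cstddef>
#include <cuda_runtime_api.h>

// Before: a bundle of manually-managed function pointers (alloc fn, free fn,
// ...). After: a single abstract interface that drop-in allocators subclass,
// so adding a new allocator means overriding virtuals rather than wiring up
// pointers.
class CudaAllocatorLike {
 public:
  virtual ~CudaAllocatorLike() = default;
  virtual void* raw_alloc(size_t nbytes) = 0;
  virtual void raw_delete(void* ptr) = 0;
};

class NaiveCudaAllocator final : public CudaAllocatorLike {
 public:
  void* raw_alloc(size_t nbytes) override {
    void* ptr = nullptr;
    cudaMalloc(&ptr, nbytes);
    return ptr;
  }
  void raw_delete(void* ptr) override {
    cudaFree(ptr);
  }
};
```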
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87251
Approved by: https://github.com/wconstab