Continuation of https://github.com/pytorch/pytorch/pull/88207
A compile time guard was preventing ActivityType::CUDA from being available on rocm. This caused both the GPU_FALLBACK and CUDA modes to be active at the same time. So operators were being charged gpu time for the hipEventRecord ranges and the actual kernel execution times. This caused incorrect (and often negative) cuda times, in e.g. table().
Previously a cmake variable was not being propagated to a '-D', causing an issue on Windows, which uses cuda but not cupti.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89785
Approved by: https://github.com/jeffdaily, https://github.com/malfet
This ensures that subsequent link commands involving mkl libraries
know where to find the libraries if they are in a non-standard
location (which is the case if you installed mkl via conda, which
is what our standard instructions recommend.)
This is kind of a hack, because the MKL libraries are not actually
guaranteed to be in $MKL_ROOT/lib (they are for the conda install
though). The real fix is to properly use the MKL targets from
FindMKL.cmake but thats its own can of fish. See
https://github.com/pytorch/pytorch/issues/73008
This fixes https://github.com/pytorch/audio/issues/2784
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89359
Approved by: https://github.com/soumith
Should fix#13362 and fix#83790
I think I've discovered the root cause of the intermittent nccl link
failures. If we look at the variable name in the redefinition error:
```
_02021d91_11_sendrecv_cu_0bc7b9c8_11152
```
this is the name of the file being compiled + some form of unique ID.
As part of NCCL's build process, the same file is compiled multiple
times with different macro definitions depending on which operator and
dtype are being compiled, e.g.
```
nvcc -DNCCL_OP=0 -DNCCL_TYPE=0 -dc sendrecv.cu -o sendrecv_sum_i8.o
```
Since the filename parts are the same, then if the unique IDs also
happen to collide then the entire identifier will collide and the link
fails. So the fix here is to generate a unique `.cu` file for each
object file. I've implemented this as a `.patch` file that gets
applied from our cmake code, but if we instead fork nccl that would be
cleaner.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84245
Approved by: https://github.com/janeyx99, https://github.com/malfet
Summary: In [PR 84755](https://github.com/pytorch/pytorch/pull/84755), @cccclai noticed and mentioned the presence of `message(STATUS...)` logging in caffe2/CMakeLists.txt and suggested moving it to the file cmake/Summary.cmake. This PR addresses that comment/suggestion.
Test Plan: Ran the build as `USE_NUMPY=0 USE_DISTRIBUTED=0 USE_CUDA=0 TRACING_BASED=1 python setup.py develop`
and saw the follwing being printed:
```
-- BUILD_MOBILE_AUTOGRAD : OFF
-- BUILD_LITE_INTERPRETER: OFF
-- INTERN_BUILD_MOBILE :
-- TRACING_BASED : 1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84814
Approved by: https://github.com/cccclai
# Summary:
- I added a new submodule Cutlass pointing to 2.10 release. The inclusion of flash_attention code should be gated by the flag: USE_FLASH_ATTENTION. This is defaulted to off resulting in flash to not be build anywhere. This is done on purpose since we don't have A100 machines to compile and test on.
- Only looked at CMake did not attempt bazel or buck yet.
- I included the mha_fwd from flash_attention that has ben refactored to use cutlass 2.10. There is currently no backwards kernel on this branch. That would be a good follow up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81434
Approved by: https://github.com/cpuhrsch
We're no longer building Caffe2 mobile as part of our CI, and it adds a lot of clutter to our make files. Any lingering internal dependencies will use the buck build and so wont be effected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84338
Approved by: https://github.com/dreiss
Since #83173 was merged I have noticed some CI being slowed down by
the nccl building step. e.g. if there are no C++ changes then sccache
compiles everything else very quickly and nccl becomes the limiting
factor.
This re-enables parallel builds with some safeguards to protect
against oversubscription. When `make` is the parent build system, we
can use `$(MAKE)` and the `make` jobserver will coordinate job
allocation with the sub-process. For other build systems, this calls
`make` with the `-l` flag which should prevent it launching jobs when
the system load average is already too high.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83696
Approved by: https://github.com/malfet
This problem updates the the PR [#73040](https://github.com/pytorch/pytorch/pull/73040)
The compilation error in pyTorch with ROCm is successful with these changes when `NDEBUG` is enabled.
Solution:
For HIP we keep `__device__ __assert_fail()`
and for host side compilation we want to use the `__assert_fail()` from the glibc library.
Tested the code by compiling with below steps
```
python3 tools/amd_build/build_amd.py
python3 setup.py develop --cmake-only
cmake -DHIP_HIPCC_FLAGS_RELEASE="-DNDEBUG" build
cmake --build build
```
The UT test_fixed_cuda_assert_async is still skipped due performance overhead.
cc @jithunnair-amd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81790
Approved by: https://github.com/shintaro-iwasaki, https://github.com/jeffdaily, https://github.com/malfet
And use it throughout the CMakeLists and rectify `IF(APPLE)`/`IF(GNU_CXX_VERSION VERSION_GREATER A.B)` and so on
Also, add `target_compile_options_if_supported` and use it in `Dependencies.cmake` as well as in test's `CMakeListst.txt`
Delete `-Wno-unknown-warning-option` to test that conditions indeed working as expected
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82883
Approved by: https://github.com/seemethere
- Modifies the current cmake build definitions to use `find_package` to find UCX and UCC installed in the system
- Install UCX and UCC in CUDA dockers
- Build PyTorch with `USE_UCC=1` in pipelines
- Currently, we are not running unit tests with the UCC PG. Those tests will be added in future PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81583
Approved by: https://github.com/vtlam, https://github.com/malfet
And use it throughout the CMakeLists and rectify `IF(APPLE)`/`IF(GNU_CXX_VERSION VERSION_GREATER A.B)` and so on
Also, add `target_compile_options_if_supported` and use it in `Dependencies.cmake` as well as in test's `CMakeListst.txt`
Delete `-Wno-unknown-warning-option` to test that conditions indeed working as expected
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82883
Approved by: https://github.com/seemethere
By extending regex to match any character other than not just version
On Ubuntu version string looks as follows:
```
$ objcopy --version
GNU objcopy (GNU Binutils for Ubuntu) 2.30
```
And on some CentOSes it looks as
```
$ objcopy --version
GNU objcopy (GNU Binutils) 2.37
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82774
Approved by: https://github.com/ngimel
### Description
These changes were made to assure, that the code that tests the vector instruction set extensions not only compiles but also runs to detect it properly for MSVC:
- INCLUDE(CheckCSourceRuns) instead of INCLUDE(CheckCSourceCompiles)
- INCLUDE(CheckCXXSourceRuns) instead of INCLUDE(CheckCXXSourceCompiles)
- CHECK_C_SOURCE_RUNS instead of CHECK_C_SOURCE_COMPILES
- CHECK_CXX_SOURCE_RUNS instead of CHECK_CXX_SOURCE_COMPILES
### Issue
#82553
### Testing
I tried the [code changes](86246b3c58) on a copy of [FindAVX.cmake](https://github.com/pytorch/pytorch/blob/master/cmake/Modules/FindAVX.cmake) in my repository [convolution-benchmarks](https://github.com/JohT/convolution-benchmarks) and could verify that the detection works properly now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82554
Approved by: https://github.com/malfet
To fix#78540 I committed #78983 which is reverted due to internal CI failure. Then I comitted #79215 which was only fixing the failure but didn't have the full feature of #78983. This PR is another try.
This PR adds script to dump all operators from test models and automatically write into `lightweight_dispatch_ops.yaml`. This way we don't have to manually update the yaml file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80791
Approved by: https://github.com/raziel
RocksDB 7 starts to use C++17 in header.
We should make this configurable, in case user needs higher std version.
List of files to changed is found by `git grep 'CMAKE_[^_]*_STANDARD'`.
Doc string is from CMake code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75519
Approved by: https://github.com/malfet
Summary:
This diff integrates UCC process group as a native component of Pytorch Distributed core. It is based on the existing torch-ucc (https://github.com/facebookresearch/torch_ucc) as the wrapper for UCC collective communication library.
The environment and cmake variables are named in mirroring to the existing process groups such as NCCL and Gloo. Specifically,
- USE_UCC: enables UCC PG. This defaults to OFF, so there is no breakage of existing builds that do not have UCX/UCC external libraries.
- USE_SYSTEM_UCC: uses external UCX and UCC shared libraries that are set accordingly with UCX_HOME and UCC_HOME.
Currently, this diff only supports USE_SYSTEM_UCC=ON, i.e., requiring users to specify external libraries for UCX and UCC. In subsequent diffs, we will add UCX and UCC repos as third-party dependencies in pytorch/third-party.
Test Plan:
Passed Torch-UCC tests that invoke UCC process group. For example:
$ sh test/start_test.sh test/torch_allreduce_test.py --backend gloo --use-cuda
...
Test allreduce: succeeded
Differential Revision: D36973688
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79918
Approved by: https://github.com/kwen2501, https://github.com/kingchc