Summary:
Python never closes the shared libraries it `dlopen`s. This means that calling `load` or `load_inline` (i.e. building a JIT C++ extension) with the same C++ extension name twice in the same Python process will never re-load the library, even if the compiled source code and the underlying shared library have changed. The only way to circumvent this is to create a new library and load it under a new module name.
I fix this, of course, by introducing a layer of indirection. Loading a JIT C++ extension now goes through an `ExtensionVersioner`, which hashes the contents of the source files as well as the build flags and, if this hash has changed, bumps an internal version stored for each module name. A version bump results in the ninja file being edited and a new shared library, effectively a new C++ extension, being compiled. To that end, the version is appended as `_v<version>` to the extension name for all versions greater than zero.
One caveat is that if you were to update your code many times and always re-load it in the same process, you may end up with quite a lot of shared library objects in your extension's folder under `/tmp`. I imagine this isn't too bad, since extensions are typically small and there isn't really a good way for us to garbage collect old libraries, since we don't know what still has handles to them.
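The versioning scheme described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the method name `bump_version_if_changed`, the helper `versioned_name`, and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib


class ExtensionVersioner:
    """Hypothetical sketch: keeps one version counter per extension name."""

    def __init__(self):
        # Maps extension name -> (last seen hash, current version).
        self._entries = {}

    def bump_version_if_changed(self, name, source_files, build_flags):
        digest = self._hash(source_files, build_flags)
        entry = self._entries.get(name)
        if entry is None:
            version = 0  # first time this name is seen
        elif entry[0] != digest:
            version = entry[1] + 1  # sources or flags changed
        else:
            version = entry[1]  # nothing changed, reuse the version
        self._entries[name] = (digest, version)
        return version

    @staticmethod
    def _hash(source_files, build_flags):
        hasher = hashlib.sha256()
        for path in source_files:
            with open(path, "rb") as source:
                hasher.update(source.read())
            hasher.update(b"\0")  # separator between entries
        for flag in build_flags:
            hasher.update(flag.encode("utf-8"))
            hasher.update(b"\0")
        return hasher.hexdigest()


def versioned_name(name, version):
    # Version 0 keeps the plain name; later versions get a `_v<version>` suffix.
    return name if version == 0 else "{}_v{}".format(name, version)
```

Under these assumptions, re-loading unchanged sources keeps the same name, while any edit to a source file or build flag yields a fresh module name such as `my_extension_v1`.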
Fixes https://github.com/pytorch/pytorch/issues/11398 CC
ezyang gchanan soumith fmassa
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11725
Differential Revision: D9948244
Pulled By: goldsborough
fbshipit-source-id: 695bbdc1f1597c5e4306a45cd8ba46f15c941383
Summary:
Two improvements to C++ extensions:
1. In verbose mode, show the ninja build output (the exact compile commands, very useful)
2. When raising an error, don't show the `CalledProcessError` that shows ninja failing, only show the `RuntimeError` with the captured stdout
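The two behaviours listed above can be sketched as below. This is a Python 3 illustration only: `run_build` is a hypothetical stand-in for the code that invokes ninja, and unlike a real verbose build it prints the output after the command finishes rather than streaming it.

```python
import subprocess


def run_build(command, verbose=False):
    """Run a build command and surface failures as a single RuntimeError."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # fold stderr into the captured stdout
            check=True,
        )
    except subprocess.CalledProcessError as error:
        message = "Error building extension"
        if error.output:
            message += ": " + error.output.decode(errors="replace")
        # `from None` keeps the CalledProcessError out of the traceback.
        raise RuntimeError(message) from None
    output = result.stdout.decode(errors="replace")
    if verbose:
        print(output, end="")
    return output
```

A call such as `run_build(["ninja", "-v"], verbose=True)` would then show the compile commands on success and raise only the `RuntimeError` carrying the captured output on failure.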
soumith fmassa ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11724
Differential Revision: D9922459
Pulled By: goldsborough
fbshipit-source-id: 5b319bf24348eabfe5f4c55d6d8e799b9abe523a
Summary:
A couple fixes I deem necessary to the TorchScript C++ API after writing the tutorial:
1. When I was creating the custom op API, I created `torch/op.h` as the one-stop header for creating custom ops. I now notice that there is no good header for the TorchScript C++ story altogether, i.e. when you just want to load a script module in C++ without any custom ops necessarily. The `torch/op.h` header suits that purpose just as well of course, but I think we should rename it to `torch/script.h`, which seems like a great name for this feature.
2. The current API for the CMake we provided was that we defined a bunch of variables like `TORCH_LIBRARY_DIRS` and `TORCH_INCLUDES` and then expected users to add those variables to their targets. We also had a CMake function that did that for you automatically. I now realized a much smarter way of doing this is to create an `IMPORTED` target for the libtorch library in CMake, and then add all this stuff to the link interface of that target. Then all downstream users have to do is `target_link_libraries(my_target torch)` and they get all the proper includes, libraries and compiler flags added to their target. This means we can get rid of the CMake function and all that stuff. orionr AFAIK this is a much, much better way of doing all of this, no?
3. Since we distribute libtorch with `-D_GLIBCXX_USE_CXX11_ABI=0`, dependent libraries must set this flag too. I now add this to the interface compile options of this imported target.
4. Fixes to JIT docs.
These could likely be 4 different PRs but given the release I wouldn't mind landing them all asap.
zdevito dzhulgakov soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11682
Differential Revision: D9839431
Pulled By: goldsborough
fbshipit-source-id: fdc47b95f83f22d53e1995aa683e09613b4bfe65
Summary:
I noticed warnings from within pybind11 being shown when building C++ extensions. This can be avoided by including non-user-supplied headers with `-isystem` instead of `-I`.
I hope this works on Windows.
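As a rough sketch of the idea, assuming a GCC- or Clang-style compiler (which suppress most warnings from headers found via `-isystem`), the flag construction could look like this. The function name `include_flags` and its parameters are hypothetical.

```python
def include_flags(user_paths, system_paths):
    """Build include flags: `-I` for user headers, `-isystem` for the rest."""
    flags = ["-I{}".format(path) for path in user_paths]
    for path in system_paths:
        # Headers found through -isystem are treated as system headers.
        flags += ["-isystem", path]
    return flags
```

For example, `include_flags(["src"], ["/path/to/torch/include"])` yields `["-Isrc", "-isystem", "/path/to/torch/include"]`.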
soumith ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11459
Differential Revision: D9764444
Pulled By: goldsborough
fbshipit-source-id: b288572106078f347f0342f158f9e2b63a58c235
Summary:
Currently we assume that the cudnn includes and libraries live under the `CUDA_HOME` root, but this is not always true. So we now support a `CUDNN_HOME`/`CUDNN_PATH` environment variable that can have its own `/include` and `/lib64` folder.
This means cudnn extensions now also get support on the FAIR cluster.
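A minimal sketch of such a lookup is shown below. The function name `cudnn_paths` is hypothetical, and giving `CUDNN_HOME` precedence over `CUDNN_PATH` is an assumption of this example, not something the summary states.

```python
import os


def cudnn_paths(environ=None):
    """Return (include_dir, library_dir) for cuDNN, or None if unset."""
    environ = os.environ if environ is None else environ
    # Assumption: CUDNN_HOME wins if both variables are set.
    home = environ.get("CUDNN_HOME") or environ.get("CUDNN_PATH")
    if not home:
        return None
    return os.path.join(home, "include"), os.path.join(home, "lib64")
```

When neither variable is set the caller would fall back to the `CUDA_HOME` layout described above.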
soumith fmassa
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10922
Differential Revision: D9526856
Pulled By: goldsborough
fbshipit-source-id: 5c64a5ff7cd428eb736381c24736006b21f8b6db
Summary:
Prior to this diff, there have been two ways of compiling the bulk of the torch codebase. There was no interaction between them - you had to pick one or the other.
1) with setup.py. This method
- used the setuptools C extension functionality
- worked on all platforms
- did not build test_jit/test_api binaries
- did not include the C++ api
- always included python functionality
- produced _C.so
2) with cpp_build. This method
- used CMake
- did not support Windows or ROCM
- was capable of building the test binaries
- included the C++ api
- did not build the python functionality
- produced libtorch.so
This diff combines the two.
1) cpp_build/CMakeLists.txt has become torch/CMakeLists.txt. This build
- is CMake-based
- works on all platforms
- builds the test binaries
- includes the C++ api
- does not include the python functionality
- produces libtorch.so
2) the setup.py build
- compiles the python functionality
- calls into the CMake build to build libtorch.so
- produces _C.so, which has a dependency on libtorch.so
In terms of code changes, this mostly means extending the cmake build to support the full variety of environments and platforms. There are also a small number of changes related to the fact that there are now two shared objects - in particular, windows requires annotating some symbols with dllimport/dllexport, and doesn't allow exposing thread_local globals directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8792
Reviewed By: ezyang
Differential Revision: D8764181
Pulled By: anderspapitto
fbshipit-source-id: abec43834f739049da25f4583a0794b38eb0a94f
Summary:
Any flags linking libraries only take effect on inputs preceding them,
so we have to call `$cxx $in $ldflags -o $out` instead of the other way
around.
This was probably not detected so far since the torch libraries are
already loaded when loading JIT-compiled extensions, so this only has an
effect on third-party libraries.
This also matches our behavior on windows.
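The ordering can be illustrated with a small helper that assembles the link command in the same order as the `$cxx $in $ldflags -o $out` rule. The helper `link_command` and the example library name are hypothetical.

```python
def link_command(cxx, inputs, ldflags, output):
    """Assemble a link command with library flags after the input files."""
    # Inputs come first so that the libraries named in ldflags can
    # resolve the symbols those inputs reference.
    return [cxx] + list(inputs) + list(ldflags) + ["-o", output]
```

For example, `link_command("c++", ["ext.o"], ["-shared", "-lfoo"], "ext.so")` places `-lfoo` after `ext.o`.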
Closes https://github.com/pytorch/pytorch/pull/9021
Reviewed By: soumith
Differential Revision: D8694049
Pulled By: ezyang
fbshipit-source-id: e35745fc3b89bf39c14f07ce90d6bd18e6a3d7cc
* Have PyTorch depend on minimal libcaffe2.so instead of libATen.so
* Build ATen tests as a part of Caffe2 build
* Hopefully cufft and nvcc fPIC fixes
* Make ATen install components optional
* Add tests back for ATen and fix TH build
* Fixes for test_install.sh script
* Fixes for cpp_build/build_all.sh
* Fixes for aten/tools/run_tests.sh
* Switch ATen cmake calls to USE_CUDA instead of NO_CUDA
* Attempt at fix for aten/tools/run_tests.sh
* Fix typo in last commit
* Fix valgrind call after pushd
* Be forgiving about USE_CUDA disable like PyTorch
* More fixes on the install side
* Link all libcaffe2 during test run
* Make cuDNN optional for ATen right now
* Potential fix for non-CUDA builds
* Use NCCL_ROOT_DIR environment variable
* Pass -fPIC through nvcc to base compiler/linker
* Remove THCUNN.h requirement for libtorch gen
* Add Mac test for -Wmaybe-uninitialized
* Potential Windows and Mac fixes
* Move MSVC target props to shared function
* Disable cpp_build/libtorch tests on Mac
* Disable sleef for Windows builds
* Move protos under BUILD_CAFFE2
* Remove space from linker flags passed with -Wl
* Remove ATen from Caffe2 dep libs since directly included
* Potential Windows fixes
* Preserve options while sleef builds
* Force BUILD_SHARED_LIBS flag for Caffe2 builds
* Set DYLD_LIBRARY_PATH and LD_LIBRARY_PATH for Mac testing
* Pass TORCH_CUDA_ARCH_LIST directly in cuda.cmake
* Fixes for the last two changes
* Potential fix for Mac build failure
* Switch Caffe2 to build_caffe2 dir to not conflict
* Cleanup FindMKL.cmake
* Another attempt at Mac cpp_build fix
* Clear cpp-build directory for Mac builds
* Disable test in Mac build/test to match cmake
* Split libATen.so into libATen_cpu.so and libATen_cuda.so
Previously, ATen could be built with either CPU-only support, or
CPU/CUDA support, but only via a compile-time flag, requiring
two separate builds. This means that if you have a program which
indirectly uses a CPU-only build of ATen, and a CPU/CUDA-build of
ATen, you're gonna have a bad time. And you might want a CPU-only
build of ATen, because it is 15M (versus the 300M of a CUDA build).
This commit splits libATen.so into two libraries, CPU/CUDA, so
that it's not necessary to do a full rebuild to get CPU-only
support; instead, if you link against libATen_cpu.so only, you
are CPU-only; if you additionally link/dlopen libATen_cuda.so,
this enables CUDA support. This makes ATen's dynamic library
structure more similar to Caffe2's. libATen.so is no more
(this is BC BREAKING)
The general principle for how this works is that we introduce
a *hooks* interface, which introduces a dynamic dispatch indirection
between a call site and implementation site of CUDA functionality,
mediated by a static initialization registry. This means that we can continue
to, for example, lazily initialize CUDA from Context (a core, CPU class) without
having a direct dependency on the CUDA bits. Instead, we look up
in the registry if, e.g., CUDA hooks have been loaded (this loading
process happens at static initialization time), and if they
have been, we dynamically dispatch to this class. We similarly use
the hooks interface to handle Variable registration.
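The hooks pattern can be pictured with the following Python analogy. It is only a sketch of the idea: the real mechanism is C++ and relies on static initialization, and the names `register_hooks` and `get_cuda_hooks` as well as the method set are assumptions made for illustration.

```python
class CUDAHooksInterface:
    """Default hooks, used when no CUDA library has registered itself."""

    def has_cuda(self):
        return False

    def init_cuda(self):
        raise RuntimeError("CUDA requested but the CUDA library is not loaded")


# Registry filled in by whichever library provides an implementation.
_HOOKS_REGISTRY = {}


def register_hooks(name, factory):
    _HOOKS_REGISTRY[name] = factory


def get_cuda_hooks():
    # The CPU side only knows the interface; it looks up the registry
    # and falls back to the default hooks if nothing was registered.
    factory = _HOOKS_REGISTRY.get("CUDAHooks")
    return factory() if factory is not None else CUDAHooksInterface()


class CUDAHooks(CUDAHooksInterface):
    """Stand-in for the implementation that lives in the CUDA library."""

    def has_cuda(self):
        return True

    def init_cuda(self):
        return "cuda initialized"
```

In this analogy, loading the CUDA library corresponds to calling `register_hooks("CUDAHooks", CUDAHooks)`; before that call, `get_cuda_hooks().has_cuda()` returns `False`.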
We introduce a new invariant: if the backend of a type has not
been initialized (e.g., its library has not been dlopened; for
CUDA, this also includes CUDA initialization), then the Type
pointers in the context registry are NULL. If you access the
registry directly you must maintain this invariant.
There are a few potholes along the way. I document them here:
- Previously, PyTorch maintained a separate registry for variable
types, because no provision for them was made in the Context's
type_registry. Now that we have the hooks mechanism, we can easily
have PyTorch register variables in the main registry. The code
has been refactored accordingly.
- There is a subtle ordering issue between Variable and CUDA.
We permit libATen_cuda.so and PyTorch to be loaded in either
order (in practice, CUDA is always loaded "after" PyTorch, because
it is lazily initialized.) This means that, when CUDA types are
loaded, we must subsequently also initialize their Variable equivalents.
Appropriate hooks were added to VariableHooks to make this possible;
similarly, getVariableHooks() is not referentially transparent, and
will change behavior after Variables are loaded. (This is different
to CUDAHooks, which is "burned in" after you try to initialize CUDA.)
- The cmake is adjusted to separate dependencies into either CPU
or CUDA dependencies. The generator scripts are adjusted to either
generate a file as a CUDA (cuda_file_manager) or CPU file (file_manager).
- I changed all native functions which were CUDA-only (the cudnn functions)
to have dispatches for CUDA only (making it permissible to not specify
all dispatch options.) This uncovered a bug in how we were handling
native functions which dispatch on a Type argument; I introduced a new
self_ty keyword to handle this case. I'm not 100% happy about it
but it fixed my problem.
This also exposed the fact that set_history incompletely handles
heterogeneous return tuples combining Tensor and TensorList. I
swapped this codegen to use flatten() (at the possible cost of
a slight perf regression, since we're allocating another vector now
in this code path).
- thc_state is no longer a public member of Context; use getTHCState() instead
- This PR comes with Registry from Caffe2, for handling static initialization.
I needed to make a bunch of fixes to Registry to make it more portable
- No more ##__VA_ARGS__ token pasting; instead, it is mandatory to pass at
least one argument to the var-args. CUDAHooks and VariableHooks pass a nullary
struct CUDAHooksArgs/VariableHooksArgs to solve the problem. We must get rid of
token pasting because it does not work with MSVC.
- It seems MSVC is not willing to generate code for constructors of template
classes at use sites which cross DLL boundaries. So we explicitly instantiate
the class to get around the problem. This involved tweaks to the boilerplate
generating macros, and also required us to shuffle around namespaces a bit,
because you can't specialize a template unless you are in the same namespace as
the template.
- Insertion of AT_API to appropriate places where the registry must be exported
- We have a general problem which is that on recent Ubuntu distributions,
--as-needed is enabled for shared libraries, which is (cc @apaszke who was
worrying about this in #7160 see also #7160 (comment)). For now, I've hacked
this up in the PR to pass -Wl,--no-as-needed to all of the spots necessary to
make CI work, but a more sustainable solution is to attempt to dlopen
libATen_cuda.so when CUDA functionality is requested.
- The JIT tests somehow manage to try to touch CUDA without loading libATen_cuda.so. So
we pass -Wl,--no-as-needed when linking libATen_cuda.so to _C.so
- There is a very subtle linking issue with lapack, which is solved by making sure libATen_cuda.so links against LAPACK. There's a comment in aten/src/ATen/CMakeLists.txt about this as well as a follow-up bug at #7353
- autogradpp used AT_CUDA_ENABLED directly. We've expunged these uses and added
a few more things to CUDAHooks (getNumGPUs)
- Added manualSeedAll to Generator so that we can invoke it polymorphically (it
only does something different for CUDAGenerator)
- There's a new cuda/CUDAConfig.h header for CUDA-only ifdef macros (AT_CUDNN_ENABLED, most prominently)
- CUDAHooks/VariableHooks structs live in at namespace because Registry's
namespace support is not good enough to handle it otherwise (see Registry
changes above)
- There's some modest moving around of native functions in ReduceOps and
UnaryOps to get the CUDA-only function implementations into separate files, so
they are only compiled into libATen_cuda.so. sspaddmm needed a separate CUDA
function due to object linkage boundaries.
- Some direct uses of native functions in CUDA code have to go away, since these
functions are not exported, so you have to go through the dispatcher
(at::native::empty_like to at::empty_like)
- Code in THC/THCS/THCUNN now properly uses the THC_API macro instead of TH_API
(which matters now that TH and THC are not in the same library)
- Added code debt in torch/_thnn/utils.py and other THNN parsing code to handle
both TH_API and THC_API
- TensorUtils.h is now properly exported with AT_API
- Dead uses of TH_EXPORTS and co expunged; we now use ATen_cpu_exports and
ATen_cuda_exports (new, in ATenCUDAGeneral.h) consistently
- Fix some incorrect type annotations on _cudnn_rnn_backward, where we didn't
declare a type as possibly undefined when we should have. We didn't catch this
previously because optional annotations are not tested on "pass-through" native
ATen ops (which don't have dispatch). Upstream issue at #7316
- There's a new cmake macro aten_compile_options for applying all of our
per-target compile time options. We use this on the cpu and cuda libraries.
- test/test_cpp_extensions.py can be run directly by invoking it with Python,
assuming you've set up your PYTHONPATH correctly
- type_from_string does some new funny business to only query for all valid CUDA
types (which causes CUDA initialization) when we see "torch.cuda." in the
requested string
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Last mile libtorch fixes
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* pedantic fix
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Rename autograd namespace to torch and change torch.h into python.h
* Include torch.h instead of python.h in test/cpp/api
* Change some mentions of torch.h to python.h in C++ extensions
* Set paths directly, without find_path
Adds ability to JIT compile C++ extensions from strings
>>> from torch.utils.cpp_extension import load_inline
>>> source = '''
at::Tensor sin_add(at::Tensor x, at::Tensor y) {
  return x.sin() + y.sin();
}
'''
>>> module = load_inline(name='inline_extension', cpp_sources=source, functions='sin_add')
Fixes #7012
* Inline JIT C++ Extensions
* jit_compile_sources -> jit_compile
* Split up test into CUDA and non-CUDA parts
* Documentation fixes
* Implement prologue and epilogue generation
* Remove extra newline
* Only create the CUDA source file when cuda_sources is passed
* Add support for dotted names in CPP Extensions
* Modify tests for cpp extensions
Test that dotted names work
* Py2 fixes
* Make run_test cpp_extensions Win-compatible
* Create FileBaton to synchronize distributed JIT C++ extension builds
* Move FileBaton to its own file
* Autoformat code
* Respect verbose flag in cpp_extension._prepare_ldflags
* Change cpp_extensions.py to make it work on Windows
* Fix linting
* Show python paths
* Debug
* Debug 1
* set PYTHONPATH
* Add ATen into library
* expose essential libs and functions, and copy _C.lib
* Specify dir in header
* Update check_abi for MSVC
* Activate cl environment to compile cpp extensions
* change version string
* Redirect stderr to stdout
* Add monkey patch for windows
* Remove unnecessary self
* Fix various issues
* Append necessary flags
* add /MD flag to cuda
* Install ninja
* Use THP_API instead of THP_CLASS
* Beautify the paths
* Revert "Use THP_API instead of THP_CLASS"
This reverts commit dd7e74c44db48e4c5f85bb8e3c698ff9de71ba2d.
* Use THP_API instead of THP_CLASS(new)
* Also pass torch includes to nvcc build
* Export ATen/cuda headers with install
* Refactor flags common to C++ and CUDA
* Improve tests for C++/CUDA extensions
* Export .cuh files under THC
* Refactor and clean cpp_extension.py slightly
* Include ATen in cuda extension test
* Clarifying comment in cuda_extension.cu
* Replace cuda_extension.cu with cuda_extension_kernel.cu in setup.py
* Copy compile args in C++ extension and add second kernel
* Conditionally add -std=c++11 to cuda_flags
* Also export cuDNN headers
* Add comment about deepcopy
This PR adds support for convenient CUDA integration in our C++ extension mechanism. This mainly involved figuring out how to get setuptools to use nvcc for CUDA files and the regular C++ compiler for C++ files. I've added a mixed C++/CUDA test case which works great.
I've also added CUDAExtension and CppExtension functions that construct a setuptools.Extension with "usually the right" arguments, which reduces the boilerplate required to write an extension even more. This matters especially for CUDA, where library_dirs (CUDA_HOME/lib64) and libraries (cudart) have to be specified as well.
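To illustrate what "usually the right" arguments means for the CUDA case, here is a sketch of a helper that fills in the CUDA library directory and `cudart`. The helper `cuda_extension_kwargs` is hypothetical and only builds the keyword arguments; it is not the implementation from this PR.

```python
import os


def cuda_extension_kwargs(name, sources, cuda_home, **kwargs):
    """Return keyword arguments suitable for `setuptools.Extension`."""
    # Keep whatever the caller passed and append the CUDA defaults.
    library_dirs = list(kwargs.get("library_dirs", []))
    library_dirs.append(os.path.join(cuda_home, "lib64"))
    libraries = list(kwargs.get("libraries", []))
    libraries.append("cudart")
    kwargs.update(
        name=name,
        sources=list(sources),
        library_dirs=library_dirs,
        libraries=libraries,
    )
    return kwargs
```

The result could then be passed on as `setuptools.Extension(**cuda_extension_kwargs(...))`, so that a user only has to supply the extension name and source files.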
Next step is to enable this with our "JIT" mechanism.
NOTE: I've had to write a small find_cuda_home function to find the CUDA install directory. This logic is kind of a duplicate of tools/setup_helpers/cuda.py, but that's not available in the shipped PyTorch distribution. The function is also fairly short. Let me know if it's fine to duplicate this logic.
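A function of this kind might look roughly like the sketch below. The lookup order (the `CUDA_HOME` environment variable, then a default install path, then the location of `nvcc` on the `PATH`) and the default `/usr/local/cuda` are assumptions for illustration, not necessarily what the PR does.

```python
import os
import shutil


def find_cuda_home(environ=None, default="/usr/local/cuda"):
    """Guess the CUDA install directory, or return None if not found."""
    environ = os.environ if environ is None else environ
    # 1. An explicit CUDA_HOME always wins.
    home = environ.get("CUDA_HOME")
    if home:
        return home
    # 2. Fall back to a conventional install location.
    if os.path.exists(default):
        return default
    # 3. Derive it from nvcc, which lives in <cuda_home>/bin/nvcc.
    nvcc = shutil.which("nvcc", path=environ.get("PATH"))
    if nvcc is not None:
        return os.path.dirname(os.path.dirname(os.path.realpath(nvcc)))
    return None
```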
* CUDA support for C++ extensions with setuptools
* Remove printf in CUDA test kernel
* Remove -arch flag in test/cpp_extensions/setup.py
* Put wrap_compile into BuildExtension
* Add guesses for CUDA_HOME directory
* export PATH to CUDA location in test.sh
* On Python2, sys.platform has the linux version number
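The last item refers to the fact that Python 2 reports `sys.platform` as e.g. `linux2` while Python 3.3+ reports `linux`, so an equality check against `"linux"` fails on Python 2. A prefix check works on both; the helper name `is_linux` is made up for this example.

```python
import sys


def is_linux(platform=None):
    """Return True on Linux regardless of a trailing version number."""
    platform = sys.platform if platform is None else platform
    # Matches both "linux" (Python 3.3+) and "linux2" (Python 2).
    return platform.startswith("linux")
```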