Summary:
always build nccl from within the main cmake build, rather than via a separate invocation in build_pytorch_libs.sh. Use the existing caffe2 codepaths
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13150
Differential Revision: D12815674
Pulled By: anderspapitto
fbshipit-source-id: a710b6f242d159b9816911a25ee2c4b8c3f855aa
Summary:
We want to move _C into the same cmake invocation that builds
libcaffe2 and libtorch. However, _C depends on THD and c10d, which in
turn depend on libcaffe2. That means that we can't move _C into that
cmake file unless we do these two first. This change does so.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12775
Differential Revision: D10457374
Pulled By: anderspapitto
fbshipit-source-id: 2c1aa3b8a418a73d2112e93c7da53a2e70cf7bba
Summary:
rather than pass a list through a text file
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12951
Differential Revision: D10528309
Pulled By: anderspapitto
fbshipit-source-id: d94befcd61b6304815859694b623046f256462df
Summary:
This is used to patch our cmake cuda scripts - should be in the installation script.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13013
Reviewed By: ir413
Differential Revision: D10519104
Pulled By: Yangqing
fbshipit-source-id: 542049224ea41068f32d4c0f6399c7e8b684f764
Summary:
This PR implements a DataLoader API for the C++ frontend.
The components present in this API largely match the Python API. It consists of:
- `Dataset`s: Conceptually a function from a set of indices to a batch of examples;
- `Transform`s: A functional transformation of a dataset. A `Map<D, T>` for Dataset `D` and transform `T` is itself a dataset;
- `Sampler`s: Specify a strategy for generating indices for a new batch;
- A `DataLoader`, with the ability to automatically parallelize fetching of samples across multiple worker threads;
Note that collation functions fall naturally out of the `Map<Dataset, Transform>` abstraction.
Things that are missing right now that maybe should be added:
- Memory pinning for CUDA tensors
The API was designed to be generalizable to almost any kind of dataset, transform or sampling strategy, while providing a convenient API out of the box. To achieve this, it is quite heavily templatized on various possible input types.
There are many parts to this PR! Right now, I would like feedback on:
- Your impression of the general usability of the API;
- Your impression of which parts seem too complex or overthought;
- The implementation of the parallelization aspects of the DataLoader. I've followed the Python implementation in some matters, but also differ in others. I think my implementation is a little cleaner and decouples components slightly better than the Python dataloader.
I haven't added too many comments yet, as this is fresh out of the oven. Let me know if anything is unclear from the code itself.
There also aren't any tests yet. I will write a comprehensive test suite once we agree on the API and implementation.
apaszke ezyang The controller you requested could not be found. pietern
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11918
Reviewed By: ezyang
Differential Revision: D9998881
Pulled By: goldsborough
fbshipit-source-id: 22cf357b63692bea42ddb1cc2abc71dae5030aea
Summary:
There will be a link error when the caffe2 doesn't use its protobuf under third_party. The pytorch will always link that protobuf. The pytorch doesn't use the protobuf directly. We could remove it from
the list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12451
Differential Revision: D10262676
Pulled By: ezyang
fbshipit-source-id: c2ff3fdf757fc21ed689e7f663c082064b1a0bca
Summary:
There are still a few work to be done:
- Move logging and unify AT_WARN with LOG(ERROR).
- A few header files are still being plumbed through, need cleaning.
- caffe2::EnforceNotMet aliasing is not done yet.
- need to unify the macros. See c10/util/Exception.h
This is mainly a codemod and not causing functional changes. If you find your job failing and trace back to this diff, usually it can be fixed by the following approaches:
(1) add //caffe2/c10:c10 to your dependency (or transitive dependency).
(2) change objects such as at::Error, at::Optional to the c10 namespace.
(3) change functions to the c10 namespace. Especially, caffe2::MakeString is not overridden by the unified c10::str function. Nothing else changes.
Please kindly consider not reverting this diff - it involves multiple rounds of rebasing and the fix is usually simple. Contact jiayq@ or AI Platform Dev for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12354
Reviewed By: orionr
Differential Revision: D10238910
Pulled By: Yangqing
fbshipit-source-id: 7794d5bf2797ab0ca6ebaccaa2f7ebbd50ff8f32
Summary:
Properly set cmake python_library and include_dirs hints, so that systems with multiple version of python can still find the correct libraries and header files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12569
Differential Revision: D10359910
Pulled By: soumith
fbshipit-source-id: 2238dcbed7aac8a818c9435e6bba46cda5f81cad
Summary:
- Removed the old nccl file
- Make open-source NCCL a submodule
- CMake to make NCCL itself
NCCL2 now is in the default build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12359
Reviewed By: orionr, yns88
Differential Revision: D10219665
Pulled By: teng-li
fbshipit-source-id: 134ff47057512ba617b48bf390c1c816fff3f881
Summary:
Previously, we were only enabling Flush-To-Zero (FTZ) and
Denormals-Are-Zero (DAZ) when compiling with SSE3 enabled. After,
Christian's patch (https://github.com/pytorch/pytorch/pull/12109) we
won't be compiling core files with SSE3 or SSE4 enabled, to better
support older AMD processors.
This moves the FTZ and DAZ code behind a runtime CPU check in
preparation for that change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12386
Differential Revision: D10222237
Pulled By: colesbury
fbshipit-source-id: 7ffe32561ab965e1e5f9eb6e679602bbf4775538
Summary:
- Removed the old nccl file
- Make open-source NCCL a submodule
- CMake to make NCCL itself
NCCL2 now is in the default build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12312
Differential Revision: D10190845
Pulled By: teng-li
fbshipit-source-id: 08d42253b774149a66919d194f88b34628c39bae
Summary:
This variable is already being used so this just serves to document that. I think it's an important variable, too, so it should definitely be documented there somewhere.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12265
Differential Revision: D10162261
Pulled By: soumith
fbshipit-source-id: e0d01e012c2fedea63372de9967a8eaa3745fe94
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11970
Adds an ATen-core-headers target, which caffe2_cpu_internal depends
on, and makes ATen-core depend on caffe2_headers. If you link against
ATen-core, you must ALSO link against caffe2_cpu_internal; if you
link against caffe2_cpu_internal, you must ALSO link against ATen-core,
otherwise you'll have undefined symbols.
Then, we merge template data<T>() method with Caffe2 implementation,
demonstrating that includes to Caffe2 (core) from ATen/core are working
Reviewed By: jerryzh168
Differential Revision: D9967509
fbshipit-source-id: 3d220c38b2c3c646f8ff2884fdcc889fa9276c7a
Summary:
This does 6 things:
- add c10/util/Registry.h as the unified registry util
- cleaned up some APIs such as export condition
- fully remove aten/core/registry.h
- fully remove caffe2/core/registry.h
- remove a bogus aten/registry.h
- unifying all macros
- set up registry testing in c10
Also, an important note that we used to mark the templated Registry class as EXPORT - this should not happen, because one should almost never export a template class. This PR fixes that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12077
Reviewed By: ezyang
Differential Revision: D10050771
Pulled By: Yangqing
fbshipit-source-id: 417b249b49fed6a67956e7c6b6d22374bcee24cf
Summary:
This unifies our versions across setup.py, libtorch, and libcaffe2. CMake has a default version (bumped to 1.0.0) that can be overridden by setup.py. The versions are also printed as a part of cmake/Summary.cmake to make sure they are correct.
cc Yangqing ezyang soumith goldsborough pjh5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12053
Differential Revision: D10041878
Pulled By: orionr
fbshipit-source-id: a98a01771f6c008d1016ab63ab785c3a88c3ddb0
Summary:
Currently the C++ API and C++ extensions are effectively two different, entirely orthogonal code paths. This PR unifies the C++ API with the C++ extension API by adding an element of Python binding support to the C++ API. This means the `torch/torch.h` included by C++ extensions, which currently routes to `torch/csrc/torch.h`, can now be rerouted to `torch/csrc/api/include/torch/torch.h` -- i.e. the main C++ API header. This header then includes Python binding support conditioned on a define (`TORCH_WITH_PYTHON_BINDINGS`), *which is only passed when building a C++ extension*.
Currently stacked on top of https://github.com/pytorch/pytorch/pull/11498
Why is this useful?
1. One less codepath. In particular, there has been trouble again and again due to the two `torch/torch.h` header files and ambiguity when both ended up in the include path. This is now fixed.
2. I have found that it is quite common to want to bind a C++ API module back into Python. This could be for simple experimentation, or to have your training loop in Python but your models in C++. This PR makes this easier by adding pybind11 support to the C++ API.
3. The C++ extension API simply becomes richer by gaining access to the C++ API headers.
soumith ezyang apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11510
Reviewed By: ezyang
Differential Revision: D9998835
Pulled By: goldsborough
fbshipit-source-id: 7a94b44a9d7e0377b7f1cfc99ba2060874d51535
Summary:
Python never closes shared library it `dlopen`s. This means that calling `load` or `load_inline` (i.e. building a JIT C++ extension) with the same C++ extension name twice in the same Python process will never re-load the library, even if the compiled source code and the underlying shared library have changed. The only way to circumvent this is to create a new library and load it under a new module name.
I fix this, of course, by introducing a layer of indirection. Loading a JIT C++ extension now goes through an `ExtensionVersioner`, which hashes the contents of the source files as well as build flags, and if this hash changed, bumps an internal version stored for each module name. A bump in the version will result in the ninja file being edited and a new shared library and effectively a new C++ extension to be compiled. For this the version name is appended as `_v<version>` to the extension name for all versions greater zero.
One caveat is that if you were to update your code many times and always re-load it in the same process, you may end up with quite a lot of shared library objects in your extension's folder under `/tmp`. I imagine this isn't too bad, since extensions are typically small and there isn't really a good way for us to garbage collect old libraries, since we don't know what still has handles to them.
Fixes https://github.com/pytorch/pytorch/issues/11398 CC The controller you requested could not be found.
ezyang gchanan soumith fmassa
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11725
Differential Revision: D9948244
Pulled By: goldsborough
fbshipit-source-id: 695bbdc1f1597c5e4306a45cd8ba46f15c941383
Summary:
The PR aims to resolve issues related to BUILD_PYTHON and BUILD_TEST after FULL_CAFFE2 is removed on Windows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11385
Reviewed By: orionr
Differential Revision: D9884906
Pulled By: mingzhe09088
fbshipit-source-id: fc114c0cbff6223f1ec261161e4caecc1fef5dd6
Summary:
I'm just doing the honors and bumping the version to 1.0.0.
1.0 preview and RC releases will have the 1.0.0.dev{date} tag
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11717
Reviewed By: SsnL
Differential Revision: D9840857
Pulled By: soumith
fbshipit-source-id: 4c9c2e01dccb3c521dab26c49e1569d970a87ace
Summary:
This way it shows up in all current and future setup.py commands, as otherwise we'd have to override every once to have them all call copy_protos. This is needed because the nightly packages still do not include caffe2_pb2, because setup.py bdist does not go through setup.py install or setup.py develop
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11726
Reviewed By: orionr
Differential Revision: D9844075
Pulled By: pjh5
fbshipit-source-id: 57b469e48010aacd0c08c214ba8a7e5d757feefa
Summary:
Currently, because of some setup.py logic, `ninja` caching of the `generate_code.py` build step was broken. This resulted in `generate_code.py` running every single time builds were happening, regardless of whether inputs changed.
This updated logic fixes the input caching
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11644
Reviewed By: orionr
Differential Revision: D9814348
Pulled By: soumith
fbshipit-source-id: 2012960908d0f600488d410094095cfd72adc34f
Summary:
This makes torch.distributed works for CPU only build.
Also added one more CI test case to cover MPI CPU build.
All CI tests should cover this change
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11513
Differential Revision: D9784546
Pulled By: teng-li
fbshipit-source-id: 0976a6b0fd199670926f0273e17ad7d2805e42e7
Summary:
This whitelists train/eval functions in script modules, and tests that nested nn.Modules still work.
This also changes the code for calling python functions from script to allow non-tensor inputs/outputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11505
Differential Revision: D9765466
Pulled By: zdevito
fbshipit-source-id: 1177bff931324422b69e18fa0bbaa82e3c98ec69
Summary:
This speeds up incremental builds by doing the following changes:
- Uses `rsync` instead of `cp` (when `rsync` is found) which is a bit smarter in doing "maybe copy"
- Introduces a `rebuild` mode which does not rerun `cmake` in `build_pytorch_libs.sh`.
*Note: `rebuild` should only be used if you dont add / remove files to the build, as `cmake` is not rerun*
Current no-op rebuild speedup:
- 1m 15s -> 20s
There are some lingering bugs. No-op rebuilds rerun `cmake` for two rebuilds (likely that cmake logic is dependent on the install folder, hence kicking off rebuild).
So what you see
```
python setup.py rebuild develop # first time - ~5 mins
python setup.py rebuild develop # second time - ~3 mins
python setup.py rebuild develop # third time - ~2 mins
python setup.py rebuild develop # fourth time - ~20 seconds
python setup.py rebuild develop # fifth time - ~20 seconds
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11487
Differential Revision: D9769087
Pulled By: soumith
fbshipit-source-id: 20fbecde33af6426149c13767e8734fb3be783c5
Summary:
Add flags for LMDB and LevelDB, default `OFF`. These can be enabled with
```
USE_LMDB=1 USE_LEVELDB=1 python setup.py build_deps
```
Also add a flag to build Caffe2 ops, which is default `ON`. Disable with
```
NO_CAFFE2_OPS=1 python setup.py build_deps
```
cc Yangqing soumith pjh5 mingzhe09088
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11462
Reviewed By: soumith
Differential Revision: D9758156
Pulled By: orionr
fbshipit-source-id: 95fd206d72fdf44df54fc5d0aeab598bff900c63
Summary:
This PR is stacked on https://github.com/pytorch/pytorch/pull/10610, and only adds changes in one file `.jenkins/pytorch/test.sh`, where we now build the custom op tests and run them.
I'd also like to take this PR to discuss whether the [`TorchConfig.cmake`](https://github.com/pytorch/pytorch/blob/master/cmake/TorchConfig.cmake.in) I made is robust enough (we will also see in the CI) orionr Yangqing dzhulgakov what do you think?
Also ezyang for CI changes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10611
Differential Revision: D9597627
Pulled By: goldsborough
fbshipit-source-id: f5af8164c076894f448cef7e5b356a6b3159f8b3
Summary:
Continuing pjh5's work to remove FULL_CAFFE2 flag completely.
With these changes you'll be able to also do something like
```
NO_TEST=1 python setup.py build_deps
```
and this will skip building tests in caffe2, aten, and c10d. By default the tests are built.
cc mingzhe09088 Yangqing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11321
Reviewed By: mingzhe09088
Differential Revision: D9694950
Pulled By: orionr
fbshipit-source-id: ff5c4937a23d1a263378a196a5eda0cba98af0a8
Summary:
The next function I'm moving to C++ is `sync_params`. It is stacked on top of https://github.com/pytorch/pytorch/pull/9729, so some changes will go away when it lands and I rebase.
I also split code into a `.h` and `.cpp` file for better code organization.
The controller you requested could not be found. pietern apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9805
Differential Revision: D9688604
Pulled By: goldsborough
fbshipit-source-id: 4467104d3f9e2354425503b9e4edbd59603e20a8
Summary:
* purge hcSPARSE now that rocSPARSE is available
* integrate a custom hcc and HIP
* hcc brings two important compiler fixes (fixes hundreds of unit tests)
* HIP brings a smart dispatcher that allows us to avoid a lot of static_casts (we haven't yet removed the automatic static_casts but this catches some occurrences the script did not catch)
* mark 5 unit tests skipping that have regressed w/ the new hcc (we don't know yet what is at fault)
* optimize bitonic sort - the comparator is always an empty struct - therefore passing it by value saves at least 3 bytes. It also removes an ambiguity around passing references to `__global__` functions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11198
Differential Revision: D9652340
Pulled By: ezyang
fbshipit-source-id: f5af1d891189da820e3d13b7bed91a7a43154690
Summary:
Now that we're building everything together, making all distributed flags conditional of USE_DISTRIBUTED being set.
cc pietern The controller you requested could not be found. cpuhrsch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11221
Reviewed By: Yangqing
Differential Revision: D9664267
Pulled By: orionr
fbshipit-source-id: a296cda5746ad150028c97160f8beacba955ff73
Summary:
- In Python 2, use of `/` (regardless of int/float/Tensor) causes a compiler error if
`from __future__ import division` is not imported in the file.
- The / operator is universally set to do "true" division for integers
- Added a `prim::FloorDiv` operator because it is used in loop unrolling.
The error if users use '/' in python 2 without importing from __future__
occurs when building the JIT AST.
cc apaszke zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11016
Differential Revision: D9613527
Pulled By: zou3519
fbshipit-source-id: 0cebf44d5b8c92e203167733692ad33c4ec9dac6
Summary:
Will use USE_DISTRIBUTED for both c10d and THD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11237
Differential Revision: D9647825
Pulled By: teng-li
fbshipit-source-id: 06e0ec9b5e2f8f38780fc88718f8499463e9e969
Summary:
* improve docker packages (install OpenBLAS to have at-compile-time LAPACK functionality w/ optimizations for both Intel and AMD CPUs)
* integrate rocFFT (i.e., enable Fourier functionality)
* fix bugs in ROCm caused by wrong warp size
* enable more test sets, skip the tests that don't work on ROCm yet
* don't disable asserts any longer in hipification
* small improvements
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10893
Differential Revision: D9615053
Pulled By: ezyang
fbshipit-source-id: 864b4d27bf089421f7dfd8065e5017f9ea2f7b3b
Summary:
Added MPI group support.
And this will make all previous group test cases of MPI passed.
Also, release the MPI thread level support by serializing different PG's MPI ops. This is required.
The build is fixed too
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11128
Differential Revision: D9602188
Pulled By: teng-li
fbshipit-source-id: 1d618925ae5fb7b47259b23051cc181535aa7497