Summary:
Rehash of previous attempts. This tries a different approach where we accept the install as specified in cmake (leaving bin/ include/ and lib/ alone), and then try to adjust the rest of the files to this more standard layout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16414
Differential Revision: D13863635
Pulled By: zdevito
fbshipit-source-id: 23725f5c64d7509bf3ca8f472dcdcad074de9828
Summary:
Some renaming and renamespacing also took place. I was originally planning not to do anything, but it turns out that it was easier to make HIPify work by using a namespace CUDACachingAllocator:: rather than THCCachingAllocator_, since :: is a word boundary but _ is not.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16119
Reviewed By: smessmer
Differential Revision: D13718768
fbshipit-source-id: 884a481d99027fd3e34471c020f826aa12225656
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16117
This means I can move it to c10_cuda with minimal fuss.
Reviewed By: smessmer
Differential Revision: D13717836
fbshipit-source-id: a94c7dc649af64542480fc1c226b289588886c00
Summary:
Otherwise, it won't work if we sync on this event.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8219
Reviewed By: pietern
Differential Revision: D13788657
Pulled By: teng-li
fbshipit-source-id: 8c96e9691ed2441d7a685fb7ae8fece906f58daf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14248
This diff also introduces a horrifying hack to override CUDA's DeviceGuardImpl
with a HIPGuardImplMasqueradingAsCUDA, to accommodate PyTorch's current
behavior of pretending CUDA is HIP when you build with ROCm enabled.
Reviewed By: bddppq
Differential Revision: D13145293
fbshipit-source-id: ee0e207b6fd132f0d435512957424a002d588f02
Summary:
Anywhere we used #include "foo.h", we now say #include <foo.h>
Paths are adjusted to be rooted out of aten/src, torch/lib, or
the root level directory.
I modified CMakeLists.txt by hand to remove TH and THC from
the include paths.
I used the following script to do the canonicalization:
```
import subprocess
import re
import os.path
files = subprocess.check_output(['git', 'ls-files']).decode('utf-8').rstrip().split('\n')
for fn in files:
if not any(fn.endswith(suff) for suff in ['.cu', '.cpp', '.in', '.h', '.hpp', '.cu', '.cuh', '.cc']):
continue
if not any(fn.startswith(pref) for pref in ["aten/", "torch/"]):
continue
with open(fn, 'r') as f:
c = f.read()
def fmt(p):
return "#include <{}>".format(p)
def repl(m):
p = m.group(1)
if p in ["dlfcn.h", "unistd.h", "nvrtc.h", "cuda.h", "cuda_runtime.h", "cstdint", "cudnn.h", "Python.h", "cusparse.h", "cuda_runtime_api.h", "cuda_fp16.h", "cublas_v2.h", "stdint.h", "curand_kernel.h"]:
return fmt(p)
if any(p.startswith(pref) for pref in ["torch/csrc", "c10/", "ATen/", "caffe2/", "TH/", "THC/", "Eigen/", "gtest/", "zdl/", "gloo/", "onnx/", "miopen/"]):
return fmt(p)
for root in ["aten/src", "torch/lib", ""]:
for bad_root in [os.path.dirname(fn), "aten/src/TH", "aten/src/THC", "torch/csrc"]:
new_p = os.path.relpath(os.path.join(bad_root, p), root)
if not new_p.startswith("../") and (os.path.exists(os.path.join(root, new_p)) or os.path.exists(os.path.join(root, new_p + ".in"))):
return fmt(new_p)
print("ERROR: ", fn, p)
return m.group(0)
new_c = re.sub(r'#include "([^"]+)"', repl, c)
if new_c != c:
print(fn)
with open(fn, 'w') as f:
f.write(new_c)
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14849
Reviewed By: dzhulgakov
Differential Revision: D13363445
Pulled By: ezyang
fbshipit-source-id: 52361f878a672785f9306c9e9ab2513128092b68
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13342
This PR introduces a few new concepts:
- DeviceGuardImplInterface, and implementations for CPU and CUDA, which
provide a generic interface for interfacing with device and stream state,
without requiring a direct dependency on the code in question.
- InlineDeviceGuard, a general template for generating both specialized
and dynamically dispatched device guard implementations. Dynamic
dispatch is done by specializing it on a VirtualGuardImpl.
- Provide a device-independent DeviceGuard class, which can be used even
from CPU code. It uses the aforementioned dynamic dispatch.
- CUDA-specialized CUDAGuard class, which doesn't have a dynamic dispatch
but can only be used from CUDA.
- StreamGuard, which is the same as above, but for streams rather than
devices.
- Optional variants of all the aforementioned guards, which are a no-op if
no device/stream is specified
- CUDAMultiStreamGuard, specifically for the case when we want to set
a device on every guard.
There are some subtle semantic changes, which have been thoroughly documented
in the class definition.
BC-breaking changes:
- Move constructor/assignment have been removed from all device guard
implementations.
- In some cases where you previously wrote 'set_device' (or 'set_stream'), you now must write
'reset_device', because if you switch devices/device types, the stream/device on the
previous device is unset. This is different from previous behavior.
- CUDAGuard no longer handles streams, or multiple streams. Use CUDAStreamGuard
or CUDAMultiStreamGuard as appropriate for your use case.
Reviewed By: dzhulgakov
Differential Revision: D12849620
fbshipit-source-id: f61956256f0b12be754b3234fcc73c2abc1be04e
Summary:
Followup to #12841
Changed these to not require type dispatch:
tensor.type().is_cuda() -> tensor.is_cuda()
tensor.type().is_sparse() -> tensor.is_sparse()
isVariable(tensor.type()) -> tensor.is_variable()
This probably does not affect performance
very much in most cases but it is nice to have.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13590
Reviewed By: ezyang
Differential Revision: D12929301
Pulled By: zou3519
fbshipit-source-id: 8ac5c6200c579dd7a44fb4ee58fc9bb170feb1d7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13275
This resulted in a bunch of knock-on changes, which I will now
describe:
- s/original_index/original_device/
- s/last_index/last_device/
- A bunch of places that used set_index, now use CUDAGuard (which does have
set_index) because they were CUDA-specific code.
Major caveat: DeviceGuard doesn't *actually* work non-CUDA/CPU devices, To make
that happen, I plan on totally replacing the implementation of DeviceGuard; what
I mostly care about here is wrangling the API into an acceptable state.
Reviewed By: gchanan
Differential Revision: D12832080
fbshipit-source-id: 7de068c7cec35663dc8a533026a626331336e61d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13232
DeviceGuard should be device agnostic, which means that it shouldn't
assume that int64_t means select the CUDA device.
Reviewed By: gchanan
Differential Revision: D10858024
fbshipit-source-id: b40e8337e4046906fd8f83a95e6206367fb29dbe
Summary:
always build nccl from within the main cmake build, rather than via a separate invocation in build_pytorch_libs.sh. Use the existing caffe2 codepaths
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13150
Differential Revision: D12815674
Pulled By: anderspapitto
fbshipit-source-id: a710b6f242d159b9816911a25ee2c4b8c3f855aa
Summary:
clang-format-6 run on all cpp,cc,c,cu,cxx,hpp,hxx,h files under /c10d and /thd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13138
Differential Revision: D10857742
Pulled By: teng-li
fbshipit-source-id: f99bc62f56019c05acdfa8e8c4f0db34d23b4c52
Summary:
We want to move _C into the same cmake invocation that builds
libcaffe2 and libtorch. However, _C depends on THD and c10d, which in
turn depend on libcaffe2. That means that we can't move _C into that
cmake file unless we do these two first. This change does so.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12775
Differential Revision: D10457374
Pulled By: anderspapitto
fbshipit-source-id: 2c1aa3b8a418a73d2112e93c7da53a2e70cf7bba
Summary:
rather than pass a list through a text file
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12951
Differential Revision: D10528309
Pulled By: anderspapitto
fbshipit-source-id: d94befcd61b6304815859694b623046f256462df
Summary:
This makes a few changes wrt Type, with the ultimate goal of removing Type from the public Methods/Functions. In particular:
1) Removes factory functions from Type, into TypeExtendedInterface.
2) sparse_coo_tensor is now a first class at:: namespace function, with TensorOptions overloads.
3) We move from Type-based sparse_coo_tensor dispatch to function-based.
Note we still require a number of changes to get rid of tType in the public interface, in particular TensorOptions needs to support CUDA vs non-CUDA dispatch. That is coming in a future patch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12025
Reviewed By: ezyang
Differential Revision: D10017205
Pulled By: gchanan
fbshipit-source-id: 00807a37b09ed33f0656aaa165bb925abb026320
Summary:
```
Use intrusive_ptr in Storage; replace unique_ptr<Storage> with Storage
This patch does two major changes:
- It replaces the use of Retainable in Storage with a new implementation
based on intrusive_ptr. This will be necessary because Caffe2 will
be using this class to implement intrusive_ptrs, and we need to
line these up for the merge. One good thing about the new implementation is
that the default copy/move constructors/assignment operators and destructor
work automatically, instead of needing to be hardcoded into Storage/Tensor.
- It replaces all places where we returned std::unique_ptr<Storage> with
Storage, collapsing an unnecessary double indirection that is no longer
necessary now that we have correctly working copy/move constructors.
I didn't initially want to do step (2), but it was very important to
eliminate all bare uses of new Storage and new StorageImpl, and this making
the API change was the most straightforward way to do this.
HOW TO FIX YOUR CODE IN THE NEW API
- You no longer need to dereference the result of tensor.storage() to pass
it to set. So, instead of:
x.set_(*y.storage());
just write:
x.set_(y.storage());
- If you were accessing methods on StorageImpl via the pImpl() method, you
must use the dot operator to run pImpl(). Even better; just drop pImpl,
we now have method forwarding. So, instead of:
storage->pImpl()->data();
just do:
storage->data();
// storage.pImpl()->data() works too but is not as recommended
- storage->getDevice() is no more; instead use storage->device().index()
MISC CODE UPDATES
- retain, release, weak_retain, weak_release and weak_lock are now
reimplemented using the "blessed API", and renamed to make it
clearer that their use is discouraged.
- nvcc OS X and general OS X portability improvements to intrusive_ptr
- A new comment in intrusive_ptr describing how stack allocated
intrusive_ptr_targets work differently than heap allocated ones
from c10::make_intrusive
CAVEAT EMPTOR
- THStorage_weakRetain used to work on strong pointers, but it NO LONGER
works with intrusive_ptr. You must reclaim the strong pointer into a
real strong pointer, construct a weak pointer from it, and then release
the strong and weak pointers. See StorageSharing.cpp for an example.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10488
Reviewed By: gchanan
Differential Revision: D9306134
Pulled By: ezyang
fbshipit-source-id: 02d58ef62dab8e4da6131e1a24834a65c21048e2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10268
Running torch.distributed.init_process_group fails with more than ~64 processes, with various errors like connection refused or connection reset by peer. After some digging, it looks like the root cause is that all workers have to connect to master via TCP (both in Zeus init and in DataChannelTCP - look for `connect()`), and the listening socket only has a backlog of 64.
I increased the backlog to 1024, that seems like enough for reasonable purposes (the hard limit is 65535 in /proc/sys/net/core/somaxconn). There's probably a more correct way to do this that involves retries when connection is refused.
Reviewed By: soumith
Differential Revision: D9182216
fbshipit-source-id: 2f71c4995841db26c670cec344f1e3c7a80a7936
Summary:
This PR fixes#9418 .
Openmpi 1.10 segfaults in MPI_Bcast with CUDA buffer. And it's a retired openmpi version.
I've tested on 2.1.1 and 3.0.0 and they work well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10015
Reviewed By: soumith
Differential Revision: D9088103
Pulled By: ailzhang
fbshipit-source-id: fc0a45e5cd016093ef0dbb9f371cbf67170d7045
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9895
The primary goal here was to remove THTensor::_dim, which isn't part of the API moving forward.
Instead, we provide 3 options for getting the dimensionality (this is temporary although non-trivial to remove!):
```
nDimension corresponds to the "true" ATen dimension. TODO: implement.
nDimensionLegacyNoScalars correpsonds to the ATen dimension, except scalars are viewed as 1-dimensional tensors.
nDimensionLegacyAll corresponds to the ATen dimension, except scalars are viewed as 1-dimensional tensors
and tensors with a dimension of size zero are collapsed to 0-dimensional tensors.
```
So in this patch, nDimension -> nDimensionLegacyNoScalars and _dim/_nDimension goes to nDimensionLegacyAll.
These are just codemods.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9835
Reviewed By: ezyang
Differential Revision: D8999338
Pulled By: gchanan
fbshipit-source-id: a4d676ac728f6f36ca09604a41e888d545ae9311
Summary:
It was only used to toggle refcounting, but we ALWAYS
refcount tensors.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9494
Differential Revision: D8875169
Pulled By: ezyang
fbshipit-source-id: 3a8618fb288334e62942bbaf388f3c9e473e7524
Summary:
Storage views were previously used to implement CUDA IPC sharing,
but they weren't necessary. The new strategy is described in
Note [CUDA IPC and the caching allocator].
This also fixes an unrelated bug, where we weren't actually using
the Tensor forking pickler, because we didn't register a pickler
for torch.Tensor.
Fixes#9447. Fixes#46.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
CC apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9466
Reviewed By: apaszke
Differential Revision: D8859698
Pulled By: ezyang
fbshipit-source-id: 3362cb92f6ae4aa37084c57d79b31004bd0b4a97
Summary:
See Note [Supervisor deleter] for how SupervisedPtr works.
This design is not the obvious one, but there were a lot of
constraints feeding into it:
- It must support the reallocation usage-pattern, where, given
an existing Storage, we allocate a new region of memory,
copy the existing data to it, and then deallocate the old
region of memory.
- Creation of a deleter for memory MUST avoid dynamic allocations
in the common case. We've done some benchmarking in Caffe2
where dynamic allocation for deleters is ruinously expensive,
and it's really hard to avoid these performance tarpits in
very general function wrappers like std::function or
folly::Function (while benchmarking this, we discovered that
folly::Function's move constructor was way more expensive
than it should be).
- We need to be able to deallocate data that comes from external
sources, e.g., dlpack and numpy tensors. Most notably,
you often cannot deallocate these with merely the void*
data pointer; you need some extra, out-of-band information
(e.g., the managing struct) to deallocate it. Sometimes,
you may even want to resize data living in an external source!
- The "core" allocators need to support being wrapped in a Thrust
allocator, so you need to be implement the following two functions:
char* allocate(size_t);
void deallocate(char*, size_t);
- We need to support tensors which contain non-POD, non-trivially
copyable data; specifically tensors of std::string. This is
an upcoming requirement from Caffe2. It's dirty AF, but
it's really useful.
- It should use C++ standard library types like std::unique_ptr
(which is hugely problematic because std::unique_ptr doesn't
call the deleter when the pointer is null.)
Here is the billing of changes:
- Built-in support for realloc() has been DROPPED ENTIRELY.
Instead, you're expected to allocate and then copy from
the old memory to the new memory if you want to do a
reallocation. This is what you'd generally have expected
to occur; and axing realloc() from the design lets us avoid
some tricky correctness issues with std::realloc(), namely
the fact that we must refuse the realloc if the type of the
elements are not trivially copyeable. If it really matters,
we can add this back, but there really needs to be a good
explanation WHY you need fast resizing reallocations (by in
large, people don't resize their storages, and it should
be acceptable to have a performance degradation when they
do).
- TH_STORAGE_FREEMEM is no more; instead, if you want a
storage which doesn't free its result, you just give it
an empty deleter.
- What we used to call an "allocator" (really, a combined
object for allocating/deleting) has been split into two
concepts, an allocator, and a smart pointer (SupervisedPtr)
which knows how to delete data.
- Unlike previously, where THAllocator/THCDeviceAllocator
could have a per-tensor context storing extra information
(e.g., a pointer to the metadata you need to actually
free the tensor), there is no context in the allocator or
the deleter of the smart pointer; instead, the smart
pointer directly holds an owning reference to the
metadata necessary to free the data. This metadata
is *freshly manufactured* upon every allocation, which
permits us to resize tensors even in the absence of
built-in support for realloc().
- By default, allocators don't support "raw" allocations
and deallocations with raw pointers. This is because
some allocations may return a different context every
time, in which case you need to reconstruct the context
at delete time (because all you got was a void*, not
a unique_ptr that carries the deleter).
- The diff between at::Allocator and THCDeviceAllocator is a
bit larger:
- It used to return a cudaError_t. Now, allocators
are expected to check the error status immediately and throw
an exception if there was an error. It turns out that this
is what was immediately done after all occurrences of
allocate/release, so it wasn't a big deal (although some
subsidiary interfaces had to themselves be converted to
not return cudaError_t).
There is one notable exception to this, and it is how
we handle CUDA OOM: if this occurs, we attempt to return
unused memory to the system and try again. This is now
handled by a catch-all try-catch block. The cost of
catching the exception is probably the least of your worries
if you're about to OOM.
- It used to take the CUDA stream to perform the allocation
on as an argument. However, it turned out that all call
sites, this stream was the stream for the current device.
So we can push this into the allocator (and the choice,
in the future, could be made explicitly by twiddling
thread local state.)
- It held two extra methods, emptyCache and cacheInfo, specifically
for interacting with some state in THCCachingAllocator.
But this "generality" was a lie, since THCCachingAllocator
was the only allocator that actually implemented these
methods, and there is actually a bunch of code in THC
which assumes that it is the caching allocator that is
the underlying allocator for CUDA allocations. So I
folded these two methods into this interface as
THCCachingAllocator_emptyCache and THCCachingAllocator_cacheInfo.
- It held its context directly inside the THCDeviceAllocator
struct. This context has been moved out into whatever
is holding the at::Allocator*.
- The APIs for getting at allocators/deleters is now a little different.
- Previously there were a bunch of static variables you could get
the address of (e.g., &THDefaultAllocator); now there is a
function getTHDefaultAllocator().
- Some "allocators" didn't actually know how to allocate (e.g.,
the IPC "allocator"). These have been deleted; instead, you
can wrap the produced pointers into SupervisedPtr using
an appropriate makeSupervisedPtr() static method.
- Storage sharing was a lot of work to wrangle, but I think I've
tamed the beast.
- THMapAllocator and its "subclasses" have been refactored to
be proper, honest to goodness C++ classes. I used the enum
argument trick to get "named" constructors. We use inheritance
to add refcounting and management (in libshm). What we previously
called the "Context" class (Context has been dropped from the name)
is now the supervisor for the data.
- Sometimes, we need to pull out the file descriptor from a
tensor. Previously, it was pulled out of the allocator context.
Now, we pull it out of the supervisor of the SupervisorPtr,
using the static method fromSupervisedPtr(), which uses the
deleter as the typeid, and refines the type if it matches.
- I renamed the std::function deleter into
InefficientStdFunctionSupervisor, to emphasize the fact that it does
a dynamic allocation to save the std::function deleter.
TODO:
- Windows libshm is in shambles and needs to be fixed.
Perhaps for the future:
- newFromFd is now unconditionally calling cudaPointerGetAttributes
even though this is unnecessary, because we know what the device
is from higher up in the callstack. We can fix this by making
newWithDataAndAllocator also take an explicit device argument.
- Consider statically distinguishing between allocators that
support raw_allocate/raw_deallocate, and those which don't.
The Thrust constraint applies only to the CUDA device allocator;
you never need to allocate CPU memory this way
- Really want to get rid of storage views. Ugh.
Nontrivial bugs I noticed when preparing this patch:
- I forgot to placement-new unique pointers and attempted to
assign them directly on uninitialized memory; very bad! Sam
Gross has encouraged me to replace this with a proper constructor
but I keep putting it off, because once everything goes in
StorageImpl there really will be a proper constructor.
- I rewrote a number of APIs to use newWithDataAndAllocator
instead of newWithAllocator, calling the allocator at the
call site (because they required "allocation context" which
we no longer give to "allocators"). When I did this, I forgot
to insert the multiplication with sizeof(real) to scale from
numels to number of bytes.
- The implementation of swap on storages was missing it for
scalarType and backend. It was benign (because the only case
we call swap is when these are the same), but I fixed it anyway.
- I accidentally returned a nullptr unique_ptr with no deleter,
even though there was a legitimate one. This matters, because
some code still shoves its hands in the deleter context to
get extra metadata about the function.
- I used std::move() on a unique_ptr, and then did a boolean
test on the pointer aftewards (always false!)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9358
Reviewed By: SsnL
Differential Revision: D8811822
Pulled By: ezyang
fbshipit-source-id: 4befe2d12c3e7fd62bad819ff52b054a9bf47c75
* Created TensorOptions
Storing the type in TensorOptions to solve the Variable problem
Created convenience creation functions for TensorOptions and added tests
Converted zeros to TensorOptions
Converted rand to TensorOptions
Fix codegen for TensorOptions and multiple arguments
Put TensorOptions convenience functions into torch namespace too
All factory functions except *_like support TensorOptions
Integrated with recent JIT changes
Support *_like functions
Fix in place modification
Some cleanups and fixes
Support sparse_coo_tensor
Fix bug in Type.cpp
Fix .empty calls in C++ API
Fix bug in Type.cpp
Trying to fix device placement
Make AutoGPU CPU compatible
Remove some auto_gpu.h uses
Fixing some headers
Fix some remaining CUDA/AutoGPU issues
Fix some AutoGPU uses
Fixes to dispatch_tensor_conversion
Reset version of new variables to zero
Implemented parsing device strings
Random fixes to tests
Self review cleanups
flake8
Undo changes to variable.{h,cpp} because they fail on gcc7.2
Add [cuda] tag to tensor_options_cuda.cpp
Move AutoGPU::set_index_from into .cpp file because Windows is stupid and sucks
Fix linker error in AutoGPU.cpp
Fix bad merge conflict in native_functions.yaml
Fixed caffe2/contrib/aten
Fix new window functions added to TensorFactories.cpp
* Removed torch::TensorOptions
Added code to generate wrapper functions for factory methods
Add implicit constructor from Backend to TensorOptions
Remove Var() from C++ API and use torch:: functions
Use torch:: functions more subtly in C++ API
Make AutoGPU::set_device more exception safe
Check status directly in DynamicCUDAHooksInterface
Rename AutoGPU to DeviceGuard
Removed set_requires_grad from python_variables.h and warn appropriately in Variable::set_requires_grad
remove python_default_init: self.type()
Add back original factory functions, but with deprecation warnings
Disable DeviceGuard for a couple functions in ATen
Remove print statement
Fix DeviceGuard construction from undefined tensor
Fixing CUDA device compiler issues
Moved as many methods as possible into header files
Dont generate python functions for deprecated factories
Remove merge conflict artefact
Fix tensor_options_cuda.cpp
Fix set_requires_grad not being checked
Fix tensor_new.h
TEMPORARILY put some methods in .cpp files to see if it solves issues on windows and mac
Fix bug in DeviceGuard.h
Missing includes
TEMPORARILY moving a few more methods into .cpp to see if it fixes windows
Fixing linker errors
* Fix up SummaryOps to use new factories
Undo device agnostic behavior of DeviceGuard
Use -1 instead of optional for default device index
Also move DeviceGuard methods into header
Fixes around device index after optional -> int32_t switch
Fix use of DeviceGuard in new_with_tensor_copy
Fix tensor_options.cpp
* Fix Type::copy(
* Remove test_non_float_params from ONNX tests
* Set requires_grad=False in ONNX tests that use ints
* Put layout/dtype/device on Tensor
* Post merge fixes
* Change behavior of DeviceGuard to match AutoGPU
* Fix C++ API integration tests
* Fix flip functions
* Have PyTorch depend on minimal libcaffe2.so instead of libATen.so
* Build ATen tests as a part of Caffe2 build
* Hopefully cufft and nvcc fPIC fixes
* Make ATen install components optional
* Add tests back for ATen and fix TH build
* Fixes for test_install.sh script
* Fixes for cpp_build/build_all.sh
* Fixes for aten/tools/run_tests.sh
* Switch ATen cmake calls to USE_CUDA instead of NO_CUDA
* Attempt at fix for aten/tools/run_tests.sh
* Fix typo in last commit
* Fix valgrind call after pushd
* Be forgiving about USE_CUDA disable like PyTorch
* More fixes on the install side
* Link all libcaffe2 during test run
* Make cuDNN optional for ATen right now
* Potential fix for non-CUDA builds
* Use NCCL_ROOT_DIR environment variable
* Pass -fPIC through nvcc to base compiler/linker
* Remove THCUNN.h requirement for libtorch gen
* Add Mac test for -Wmaybe-uninitialized
* Potential Windows and Mac fixes
* Move MSVC target props to shared function
* Disable cpp_build/libtorch tests on Mac
* Disable sleef for Windows builds
* Move protos under BUILD_CAFFE2
* Remove space from linker flags passed with -Wl
* Remove ATen from Caffe2 dep libs since directly included
* Potential Windows fixes
* Preserve options while sleef builds
* Force BUILD_SHARED_LIBS flag for Caffe2 builds
* Set DYLD_LIBRARY_PATH and LD_LIBRARY_PATH for Mac testing
* Pass TORCH_CUDA_ARCH_LIST directly in cuda.cmake
* Fixes for the last two changes
* Potential fix for Mac build failure
* Switch Caffe2 to build_caffe2 dir to not conflict
* Cleanup FindMKL.cmake
* Another attempt at Mac cpp_build fix
* Clear cpp-build directory for Mac builds
* Disable test in Mac build/test to match cmake
* Remove stale THD README
* Move common THD dependency into THD/base
The master_worker directory now no longer contains files that are
needed for building other parts of THD.
Changelist:
- Move *.c to *.cpp
- Change includes of ".c" to ".cpp"
- A bunch of cmake configuration modifying CMAKE_C_FLAGS changed
to CMAKE_CXX_FLAGS or add_compile_options, because if you do CMAKE_C_FLAGS it only applies when you compile C code
- Explicitly cast void* to T* in a number of places
- Delete extern "C" { ... } blocks; instead, properly apply TH_API to everything that should have it (TH_API handles extern "C")
- Stop using stdatomic.h, instead, use <atomic>. This resulted in a bunch of placement-new/delete to be "totally properly correct"
- Refactor of THLongStorageView to not have static constructor methods (since it no longer has a copy/move constructor)
- Documentation about how the TH C interface (and extern C business) works
- Note that THD master_worker mode is dead
- C++ headers in TH libraries are given .hpp suffix, to make it less likely that you'll confuse them with the C-compatible headers (now suffixed .h)
- New function THCStream_stream and THCStream_device to project out fields of THCStream instead of accessing fields directly
- New function THStorage_(retainIfLive), which is equivalent to a retain but only if the refcount is greater than zero.
- In general, I tried to avoid using hpp headers outside of ATen/TH. However, there were a few places where I gave up and depended on the headers for my own sanity. See Note [TH abstraction violation] for all the sites where this occurred. All other sites were refactored to use functions
- Some extra Werror fixes (char* versus const char*)