Commit Graph

224 Commits

Roy Li
80a7eac79e Remove Type::elementSizeInBytes
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/17785
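
A hedged before/after sketch of what call sites look like once the Type-level accessor is gone; the replacement shown here (c10::elementSize on the tensor's scalar type) is an assumption for illustration, not taken from the PR:

```
#include <ATen/ATen.h>

size_t item_size_in_bytes(const at::Tensor& t) {
  // old (removed): return t.type().elementSizeInBytes();
  return c10::elementSize(t.scalar_type());  // per-element size derived from the dtype
}
```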

Reviewed By: ezyang

Differential Revision: D14379074

fbshipit-source-id: 60727f187d61eb571b144bd6eed4dd4908da0b51
2019-03-15 12:56:02 -07:00
Roy Li
7aae51cded Replace tensor.type().scalarType() calls with tensor.scalar_type()
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/17515
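
A minimal before/after sketch of the codemod (illustrative):

```
#include <ATen/ATen.h>

at::ScalarType dtype_of(const at::Tensor& t) {
  // old: at::ScalarType st = t.type().scalarType();  // goes through the Type object
  return t.scalar_type();                             // reads the scalar type directly
}
```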

Reviewed By: ezyang

Differential Revision: D14233250

fbshipit-source-id: 6c7af8d2291c0c2b148001b30cf03834f34366c0
2019-03-08 14:08:18 -08:00
Zachary DeVito
21193bf123 try to get rid of tmp_install (#16414)
Summary:
Rehash of previous attempts. This tries a different approach where we accept the install as specified in cmake (leaving bin/, include/, and lib/ alone), and then try to adjust the rest of the files to this more standard layout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16414

Differential Revision: D13863635

Pulled By: zdevito

fbshipit-source-id: 23725f5c64d7509bf3ca8f472dcdcad074de9828
2019-01-29 17:29:40 -08:00
Edward Yang
e936a69085 Move THCCachingAllocator to c10_cuda. (#16119)
Summary:
Some renaming and renamespacing also took place. I was originally planning not to do anything, but it turns out that it was easier to make HIPify work by using a namespace CUDACachingAllocator:: rather than THCCachingAllocator_, since :: is a word boundary but _ is not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/16119
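
A sketch of what call sites look like after the renamespacing described above (the header path is an assumption):

```
#include <c10/cuda/CUDACachingAllocator.h>

void release_cached_blocks() {
  // old: THCCachingAllocator_emptyCache();
  c10::cuda::CUDACachingAllocator::emptyCache();  // the namespaced form after the move
}
```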

Reviewed By: smessmer

Differential Revision: D13718768

fbshipit-source-id: 884a481d99027fd3e34471c020f826aa12225656
2019-01-24 12:06:56 -08:00
Edward Yang
24b50f1411 Remove unnecessary includes and headers from THCCachingAllocator, move to at::cuda:: namespace (#16117)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16117

This means I can move it to c10_cuda with minimal fuss.

Reviewed By: smessmer

Differential Revision: D13717836

fbshipit-source-id: a94c7dc649af64542480fc1c226b289588886c00
2019-01-24 12:06:54 -08:00
Teng Li
fc5b79cd1c CUDA event should only be recorded after NCCL group (#8219)
Summary:
Otherwise, it won't work if we sync on this event.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8219

Reviewed By: pietern

Differential Revision: D13788657

Pulled By: teng-li

fbshipit-source-id: 8c96e9691ed2441d7a685fb7ae8fece906f58daf
2019-01-23 14:18:26 -08:00
Edward Yang
2d485ffb17 Move CUDAGuard, CUDAStream and CUDAGuardImpl to c10/cuda (#14248)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14248

This diff also introduces a horrifying hack to override CUDA's DeviceGuardImpl
with a HIPGuardImplMasqueradingAsCUDA, to accommodate PyTorch's current
behavior of pretending CUDA is HIP when you build with ROCm enabled.

Reviewed By: bddppq

Differential Revision: D13145293

fbshipit-source-id: ee0e207b6fd132f0d435512957424a002d588f02
2018-12-12 11:24:26 -08:00
Edward Yang
517c7c9861 Canonicalize all includes in PyTorch. (#14849)
Summary:
Anywhere we used #include "foo.h", we now say #include <foo.h>
Paths are adjusted to be rooted out of aten/src, torch/lib, or
the root level directory.

I modified CMakeLists.txt by hand to remove TH and THC from
the include paths.

I used the following script to do the canonicalization:

```
  import subprocess
  import re
  import os.path

  files = subprocess.check_output(['git', 'ls-files']).decode('utf-8').rstrip().split('\n')
  for fn in files:
      if not any(fn.endswith(suff) for suff in ['.cu', '.cpp', '.in', '.h', '.hpp', '.cu', '.cuh', '.cc']):
          continue
      if not any(fn.startswith(pref) for pref in ["aten/", "torch/"]):
          continue
      with open(fn, 'r') as f:
          c = f.read()
      def fmt(p):
          return "#include <{}>".format(p)
      def repl(m):
          p = m.group(1)
          if p in ["dlfcn.h", "unistd.h", "nvrtc.h", "cuda.h", "cuda_runtime.h", "cstdint", "cudnn.h", "Python.h", "cusparse.h", "cuda_runtime_api.h", "cuda_fp16.h", "cublas_v2.h", "stdint.h", "curand_kernel.h"]:
              return fmt(p)
          if any(p.startswith(pref) for pref in ["torch/csrc", "c10/", "ATen/", "caffe2/", "TH/", "THC/", "Eigen/", "gtest/", "zdl/", "gloo/", "onnx/", "miopen/"]):
              return fmt(p)
          for root in ["aten/src", "torch/lib", ""]:
              for bad_root in [os.path.dirname(fn), "aten/src/TH", "aten/src/THC", "torch/csrc"]:
                  new_p = os.path.relpath(os.path.join(bad_root, p), root)
                  if not new_p.startswith("../") and (os.path.exists(os.path.join(root, new_p)) or os.path.exists(os.path.join(root, new_p + ".in"))):
                      return fmt(new_p)
          print("ERROR: ", fn, p)
          return m.group(0)
      new_c = re.sub(r'#include "([^"]+)"', repl, c)
      if new_c != c:
          print(fn)
          with open(fn, 'w') as f:
              f.write(new_c)
```

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14849

Reviewed By: dzhulgakov

Differential Revision: D13363445

Pulled By: ezyang

fbshipit-source-id: 52361f878a672785f9306c9e9ab2513128092b68
2018-12-08 19:38:30 -08:00
Teng Li
20e395a130 Fixed uninitialized warning (#14001)
Summary:
Fixing: https://github.com/pytorch/pytorch/issues/12014
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14001

Differential Revision: D13078583

Pulled By: teng-li

fbshipit-source-id: 6c8d663da81bc3e564f0643926d67260df828dd8
2018-11-14 21:37:11 -08:00
Edward Yang
72da09bb4d Canonicalize THD includes with .. in them
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13980

Reviewed By: jerryzh168

Differential Revision: D13062706

fbshipit-source-id: 100e10d1bae7efc3e13f029708c2c1dd053ce074
2018-11-14 17:43:56 -08:00
Edward Yang
a440629f14 Remove defunct build.sh/THConfig.cmake (#13953)
Summary:
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13953

Differential Revision: D13056128

Pulled By: ezyang

fbshipit-source-id: 9fd17f4fe000ac06144b04be996ef6849de2bafa
2018-11-14 07:42:09 -08:00
Edward Yang
e35418b3be New implementations of DeviceGuard, StreamGuard and MultiStreamGuard (with CUDA specializations) (#13342)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13342

This PR introduces a few new concepts:

- DeviceGuardImplInterface, and implementations for CPU and CUDA, which
  provide a generic interface for interfacing with device and stream state,
  without requiring a direct dependency on the code in question.
- InlineDeviceGuard, a general template for generating both specialized
  and dynamically dispatched device guard implementations.  Dynamic
  dispatch is done by specializing it on a VirtualGuardImpl.
- Provide a device-independent DeviceGuard class, which can be used even
  from CPU code. It uses the aforementioned dynamic dispatch.
- CUDA-specialized CUDAGuard class, which doesn't have a dynamic dispatch
  but can only be used from CUDA.
- StreamGuard, which is the same as above, but for streams rather than
  devices.
- Optional variants of all the aforementioned guards, which are a no-op if
  no device/stream is specified
- CUDAMultiStreamGuard, specifically for the case when we want to set
  a device on every guard.

There are some subtle semantic changes, which have been thoroughly documented
in the class definition.

BC-breaking changes:

- Move constructor/assignment have been removed from all device guard
  implementations.
- In some cases where you previously wrote 'set_device' (or 'set_stream'), you now must write
  'reset_device', because if you switch devices/device types, the stream/device on the
  previous device is unset.  This is different from previous behavior.
- CUDAGuard no longer handles streams, or multiple streams.  Use CUDAStreamGuard
  or CUDAMultiStreamGuard as appropriate for your use case.
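
A minimal usage sketch of the guard family described above; header paths and exact signatures are assumptions based on this summary, not the PR code:

```
#include <ATen/DeviceGuard.h>
#include <ATen/cuda/CUDAContext.h>
#include <ATen/cuda/CUDAGuard.h>

void with_guards(at::Device cuda_device, at::cuda::CUDAStream stream) {
  at::DeviceGuard device_guard(cuda_device);             // device-agnostic, dynamically dispatched
  at::cuda::CUDAGuard cuda_guard(cuda_device.index());   // CUDA-only; handles devices, not streams
  at::cuda::CUDAStreamGuard stream_guard(stream.unwrap()); // streams now get their own guard
  device_guard.reset_device(at::Device(at::kCPU));       // switching device types uses reset_device
}
```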

Reviewed By: dzhulgakov

Differential Revision: D12849620

fbshipit-source-id: f61956256f0b12be754b3234fcc73c2abc1be04e
2018-11-11 12:11:10 -08:00
Richard Zou
e60a7c2c88 codemod tensor.type().is_cuda(), tensor.type().is_sparse() (#13590)
Summary:
Followup to #12841

Changed these to not require type dispatch:
tensor.type().is_cuda() -> tensor.is_cuda()
tensor.type().is_sparse() -> tensor.is_sparse()
isVariable(tensor.type()) -> tensor.is_variable()

This probably does not affect performance
very much in most cases but it is nice to have.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13590
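
A sketch of the codemod in code form (illustrative):

```
#include <ATen/ATen.h>

bool needs_special_path(const at::Tensor& t) {
  bool on_gpu   = t.is_cuda();      // was: t.type().is_cuda()
  bool sparse   = t.is_sparse();    // was: t.type().is_sparse()
  bool variable = t.is_variable();  // was: isVariable(t.type())
  return on_gpu || sparse || variable;
}
```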

Reviewed By: ezyang

Differential Revision: D12929301

Pulled By: zou3519

fbshipit-source-id: 8ac5c6200c579dd7a44fb4ee58fc9bb170feb1d7
2018-11-07 07:27:42 -08:00
Teng Li
ce6edbfbd9 Fixed NCCL backend not being built (#13653)
Summary:
A regression caused by the NCCL build refactoring earlier.

CC fmassa

Fixing: https://github.com/facebookresearch/maskrcnn-benchmark/issues/122
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13653

Differential Revision: D12952555

Pulled By: teng-li

fbshipit-source-id: b42e2a88fff83c9ddd58eeb33e933f1f59f51c52
2018-11-06 19:33:49 -08:00
Edward Yang
0aaff5eaf9 Replace CUDA-specific set_index(_from) method from DeviceGuard with set_device. (#13275)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13275

This resulted in a bunch of knock-on changes, which I will now
describe:

- s/original_index/original_device/
- s/last_index/last_device/
- A bunch of places that used set_index, now use CUDAGuard (which does have
  set_index) because they were CUDA-specific code.

Major caveat: DeviceGuard doesn't *actually* work for non-CUDA/CPU devices. To make
that happen, I plan on totally replacing the implementation of DeviceGuard; what
I mostly care about here is wrangling the API into an acceptable state.
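
A before/after sketch of the API change (signatures are assumptions based on the description above):

```
#include <ATen/DeviceGuard.h>
#include <ATen/cuda/CUDAGuard.h>

void switch_device(at::Device current) {
  at::DeviceGuard guard(current);
  // old (removed): guard.set_index(1);        // the index implicitly meant "CUDA device 1"
  guard.set_device(at::Device(at::kCUDA, 1));  // the device type is now explicit

  at::cuda::CUDAGuard cuda_guard(0);
  cuda_guard.set_index(1);  // CUDA-specific code keeps an index-based setter on CUDAGuard
}
```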

Reviewed By: gchanan

Differential Revision: D12832080

fbshipit-source-id: 7de068c7cec35663dc8a533026a626331336e61d
2018-10-31 07:55:13 -07:00
Edward Yang
e5d56659ec Delete DeviceGuard(int64_t) constructor. (#13232)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13232

DeviceGuard should be device agnostic, which means that it shouldn't
assume that int64_t means select the CUDA device.
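
A sketch of what the calling convention looks like after the removal (illustrative):

```
#include <ATen/DeviceGuard.h>

void pin_to_cuda_1() {
  // old (removed): at::DeviceGuard guard(1);       // the bare int implicitly meant CUDA:1
  at::DeviceGuard guard(at::Device(at::kCUDA, 1));  // the device type is spelled out
}
```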

Reviewed By: gchanan

Differential Revision: D10858024

fbshipit-source-id: b40e8337e4046906fd8f83a95e6206367fb29dbe
2018-10-31 07:55:11 -07:00
Anders Papitto
380d2dfb27 absorb nccl (#13150)
Summary:
Always build nccl from within the main cmake build, rather than via a separate invocation in build_pytorch_libs.sh. Use the existing caffe2 codepaths.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13150

Differential Revision: D12815674

Pulled By: anderspapitto

fbshipit-source-id: a710b6f242d159b9816911a25ee2c4b8c3f855aa
2018-10-29 12:04:32 -07:00
Teng Li
4f94d82c7f clang-format on c10d and THD (#13138)
Summary:
clang-format-6 run on all cpp,cc,c,cu,cxx,hpp,hxx,h files under /c10d and /thd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13138

Differential Revision: D10857742

Pulled By: teng-li

fbshipit-source-id: f99bc62f56019c05acdfa8e8c4f0db34d23b4c52
2018-10-25 14:16:47 -07:00
Anders Papitto
69906afaee absorb THD into main cmake build (#12775)
Summary:
We want to move _C into the same cmake invocation that builds
libcaffe2 and libtorch. However, _C depends on THD and c10d, which in
turn depend on libcaffe2. That means that we can't move _C into that
cmake file unless we do these two first. This change does so.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12775

Differential Revision: D10457374

Pulled By: anderspapitto

fbshipit-source-id: 2c1aa3b8a418a73d2112e93c7da53a2e70cf7bba
2018-10-24 21:28:37 -07:00
Anders Papitto
2dacf28b66 link libgloo_cuda.a explicitly from setup.py (#12951)
Summary:
rather than pass a list through a text file
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12951

Differential Revision: D10528309

Pulled By: anderspapitto

fbshipit-source-id: d94befcd61b6304815859694b623046f256462df
2018-10-24 13:19:46 -07:00
Gregory Chanan
0947712e5d Move Factory functions from Type to TypeExtendedInterface. (#12025)
Summary:
This makes a few changes wrt Type, with the ultimate goal of removing Type from the public Methods/Functions.  In particular:
1) Removes factory functions from Type, into TypeExtendedInterface.
2) sparse_coo_tensor is now a first class at:: namespace function, with TensorOptions overloads.
3) We move from Type-based sparse_coo_tensor dispatch to function-based.

Note we still require a number of changes to get rid of Type in the public interface; in particular, TensorOptions needs to support CUDA vs non-CUDA dispatch. That is coming in a future patch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12025
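
A usage sketch of the first-class factory with a TensorOptions overload (the values and sizes here are illustrative, and the exact overload set is an assumption):

```
#include <ATen/ATen.h>

at::Tensor make_sparse() {
  // 2 x nnz coordinate matrix and matching values for a 4x4 sparse tensor
  at::Tensor indices = at::randint(0, 4, {2, 3}, at::kLong);
  at::Tensor values  = at::rand({3});
  return at::sparse_coo_tensor(indices, values, {4, 4},
                               at::TensorOptions().dtype(at::kFloat));
}
```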

Reviewed By: ezyang

Differential Revision: D10017205

Pulled By: gchanan

fbshipit-source-id: 00807a37b09ed33f0656aaa165bb925abb026320
2018-09-25 09:40:17 -07:00
Yangqing Jia
c6a14b1edd Revert D9985212: [pytorch][PR] [minor] remove a remaining todo line deletion in THD cmake
Differential Revision:
D9985212

Original commit changeset: 5f8e7ac94101

fbshipit-source-id: 1783cbfc91008ab3db36bad7c1bf51e16da7fb2d
2018-09-21 11:25:53 -07:00
Yangqing Jia
0d9be2135f remove a remaining todo line deletion in THD cmake (#11920)
Summary:
TSIA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11920

Differential Revision: D9985212

Pulled By: Yangqing

fbshipit-source-id: 5f8e7ac94101177740e791f44eaa8c8ec55a908c
2018-09-21 00:40:20 -07:00
Edward Yang
227635142f Delete THD master_worker (#10731)
Summary:
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10731

Differential Revision: D9423675

Pulled By: ezyang

fbshipit-source-id: 37221e11d84cc3672b944af598ea229a1d4c38cc
2018-08-22 08:54:36 -07:00
Edward Yang
19031c68dc Use intrusive_ptr in Storage; replace unique_ptr<Storage> with Storage (#10488)
Summary:
```
Use intrusive_ptr in Storage; replace unique_ptr<Storage> with Storage

This patch does two major changes:

- It replaces the use of Retainable in Storage with a new implementation
  based on intrusive_ptr.  This will be necessary because Caffe2 will
  be using this class to implement intrusive_ptrs, and we need to
  line these up for the merge.  One good thing about the new implementation is
  that the default copy/move constructors/assignment operators and destructor
  work automatically, instead of needing to be hardcoded into Storage/Tensor.

- It replaces all places where we returned std::unique_ptr<Storage> with
  Storage, collapsing an unnecessary double indirection that is no longer
  necessary now that we have correctly working copy/move constructors.

I didn't initially want to do step (2), but it was very important to
eliminate all bare uses of new Storage and new StorageImpl, and this making
the API change was the most straightforward way to do this.

HOW TO FIX YOUR CODE IN THE NEW API

- You no longer need to dereference the result of tensor.storage() to pass
  it to set.  So, instead of:

      x.set_(*y.storage());

  just write:

      x.set_(y.storage());

- If you were accessing methods on StorageImpl via the pImpl() method, you
  must use the dot operator to run pImpl().  Even better; just drop pImpl,
  we now have method forwarding.  So, instead of:

      storage->pImpl()->data();

  just do:

      storage->data();
      // storage.pImpl()->data() works too but is not as recommended

- storage->getDevice() is no more; instead use storage->device().index()

MISC CODE UPDATES

- retain, release, weak_retain, weak_release and weak_lock are now
  reimplemented using the "blessed API", and renamed to make it
  clearer that their use is discouraged.

- nvcc OS X and general OS X portability improvements to intrusive_ptr

- A new comment in intrusive_ptr describing how stack allocated
  intrusive_ptr_targets work differently than heap allocated ones
  from c10::make_intrusive

CAVEAT EMPTOR

- THStorage_weakRetain used to work on strong pointers, but it NO LONGER
  works with intrusive_ptr.  You must reclaim the strong pointer into a
  real strong pointer, construct a weak pointer from it, and then release
  the strong and weak pointers.  See StorageSharing.cpp for an example.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10488

Reviewed By: gchanan

Differential Revision: D9306134

Pulled By: ezyang

fbshipit-source-id: 02d58ef62dab8e4da6131e1a24834a65c21048e2
2018-08-21 21:39:55 -07:00
Adam Lerer
18e298305e Increase TCP listen queue size from 64 to 1024 (#10268)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10268

Running torch.distributed.init_process_group fails with more than ~64 processes, with various errors like connection refused or connection reset by peer. After some digging, it looks like the root cause is that all workers have to connect to master via TCP (both in Zeus init and in DataChannelTCP - look for `connect()`), and the listening socket only has a backlog of 64.

I increased the backlog to 1024, that seems like enough for reasonable purposes (the hard limit is 65535 in /proc/sys/net/core/somaxconn). There's probably a more correct way to do this that involves retries when connection is refused.
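
A minimal sketch of the change at the socket level (the constant and function names here are illustrative, not the actual THD code):

```
#include <sys/socket.h>

constexpr int kListenBacklog = 1024;  // previously 64

int start_listening(int listen_fd) {
  // the kernel still caps the backlog at net.core.somaxconn
  return listen(listen_fd, kListenBacklog);
}
```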

Reviewed By: soumith

Differential Revision: D9182216

fbshipit-source-id: 2f71c4995841db26c670cec344f1e3c7a80a7936
2018-08-07 08:26:06 -07:00
Ailing Zhang
371a786b18 Errors out when Openmpi < 2.x.x with distributed. (#10015)
Summary:
This PR fixes #9418.
Open MPI 1.10 segfaults in MPI_Bcast with a CUDA buffer, and it is a retired Open MPI version.
I've tested on 2.1.1 and 3.0.0 and they work well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10015
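
One way such a guard can be expressed, sketched here under the assumption of a compile-time check against Open MPI's own version macros (the PR may implement it differently):

```
#include <mpi.h>

#if defined(OPEN_MPI) && defined(OMPI_MAJOR_VERSION) && (OMPI_MAJOR_VERSION < 2)
#error "Open MPI older than 2.x.x is unsupported: MPI_Bcast on CUDA buffers segfaults on 1.10"
#endif
```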

Reviewed By: soumith

Differential Revision: D9088103

Pulled By: ailzhang

fbshipit-source-id: fc0a45e5cd016093ef0dbb9f371cbf67170d7045
2018-07-31 12:24:40 -07:00
Gregory Chanan
1af1b0c2a5 Remove THTensor::_dim, temporarily remove THTensor_nDimension. (#9895)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9895

The primary goal here was to remove THTensor::_dim, which isn't part of the API moving forward.
Instead, we provide 3 options for getting the dimensionality (this is temporary although non-trivial to remove!):
```
nDimension                 corresponds to the "true" ATen dimension. TODO: implement.
nDimensionLegacyNoScalars  corresponds to the ATen dimension, except scalars are viewed as 1-dimensional tensors.
nDimensionLegacyAll        corresponds to the ATen dimension, except scalars are viewed as 1-dimensional tensors
                           and tensors with a dimension of size zero are collapsed to 0-dimensional tensors.
```
So in this patch, nDimension -> nDimensionLegacyNoScalars, and _dim/_nDimension go to nDimensionLegacyAll.
These are just codemods.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9835
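
A worked example of the three accessors for a few shapes, following the semantics quoted above (the values are derived from that description, not from running the code):

```
                        nDimension   nDimensionLegacyNoScalars   nDimensionLegacyAll
scalar (0-dim) tensor       0                   1                        1
tensor of shape {0}         1                   1                        0
tensor of shape {2, 3}      2                   2                        2
```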

Reviewed By: ezyang

Differential Revision: D8999338

Pulled By: gchanan

fbshipit-source-id: a4d676ac728f6f36ca09604a41e888d545ae9311
2018-07-27 08:56:38 -07:00
idansc
029cf1d78a Improve error messages of wrong dimensions (#9694)
Summary:
Updated the error message terms _matrices_ and _vectors_ to _2D tensors_ and _1D tensors_ respectively.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9694

Differential Revision: D8949589

Pulled By: ezyang

fbshipit-source-id: 2cdcd72e0e9a4459f3691c133bb16ef218b5cf3f
2018-07-23 10:10:55 -07:00
Christian Puhrsch
c1ee8835b6 Constructors and member functions for THStorage (#9357)
Summary:
Added on top of ezyang's https://github.com/pytorch/pytorch/pull/9278
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9357

Reviewed By: ezyang

Differential Revision: D8863934

Pulled By: cpuhrsch

fbshipit-source-id: a45c955c0b1e9e0866749b3a7e8a36de931bdff1
2018-07-18 15:56:26 -07:00
Edward Yang
d2d43824cd Delete flag from THTensor. (#9494)
Summary:
It was only used to toggle refcounting, but we ALWAYS
refcount tensors.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9494

Differential Revision: D8875169

Pulled By: ezyang

fbshipit-source-id: 3a8618fb288334e62942bbaf388f3c9e473e7524
2018-07-17 11:25:41 -07:00
Will Feng
ad74006ffa Pass THDRequest as void* pointer to THDRequest_free (#9398)
Summary:
This fixes https://github.com/pytorch/pytorch/issues/9054.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9398

Reviewed By: ezyang

Differential Revision: D8827778

Pulled By: yf225

fbshipit-source-id: 862287802cb69c6ac71ff4df19cadb89b1face1d
2018-07-16 19:25:22 -07:00
Edward Yang
976f9253a5 Eliminate storage views. (#9466)
Summary:
Storage views were previously used to implement CUDA IPC sharing,
but they weren't necessary.  The new strategy is described in
Note [CUDA IPC and the caching allocator].

This also fixes an unrelated bug, where we weren't actually using
the Tensor forking pickler, because we didn't register a pickler
for torch.Tensor.

Fixes #9447.  Fixes #46.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

CC apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9466

Reviewed By: apaszke

Differential Revision: D8859698

Pulled By: ezyang

fbshipit-source-id: 3362cb92f6ae4aa37084c57d79b31004bd0b4a97
2018-07-16 15:40:24 -07:00
Edward Yang
cffca2926b Introduce SupervisedPtr, delete THAllocator and THCDeviceAllocator (#9358)
Summary:
See Note [Supervisor deleter] for how SupervisedPtr works.
This design is not the obvious one, but there were a lot of
constraints feeding into it:

- It must support the reallocation usage-pattern, where, given
  an existing Storage, we allocate a new region of memory,
  copy the existing data to it, and then deallocate the old
  region of memory.

- Creation of a deleter for memory MUST avoid dynamic allocations
  in the common case.  We've done some benchmarking in Caffe2
  where dynamic allocation for deleters is ruinously expensive,
  and it's really hard to avoid these performance tarpits in
  very general function wrappers like std::function or
  folly::Function (while benchmarking this, we discovered that
  folly::Function's move constructor was way more expensive
  than it should be).

- We need to be able to deallocate data that comes from external
  sources, e.g., dlpack and numpy tensors.  Most notably,
  you often cannot deallocate these with merely the void*
  data pointer; you need some extra, out-of-band information
  (e.g., the managing struct) to deallocate it.  Sometimes,
  you may even want to resize data living in an external source!

- The "core" allocators need to support being wrapped in a Thrust
  allocator, so you need to implement the following two functions:

    char* allocate(size_t);
    void deallocate(char*, size_t);

- We need to support tensors which contain non-POD, non-trivially
  copyable data; specifically tensors of std::string.  This is
  an upcoming requirement from Caffe2.  It's dirty AF, but
  it's really useful.

- It should use C++ standard library types like std::unique_ptr
  (which is hugely problematic because std::unique_ptr doesn't
  call the deleter when the pointer is null.)

Here is the billing of changes:

- Built-in support for realloc() has been DROPPED ENTIRELY.
  Instead, you're expected to allocate and then copy from
  the old memory to the new memory if you want to do a
  reallocation.  This is what you'd generally have expected
  to occur; and axing realloc() from the design lets us avoid
  some tricky correctness issues with std::realloc(), namely
  the fact that we must refuse the realloc if the type of the
  elements are not trivially copyable.  If it really matters,
  we can add this back, but there really needs to be a good
  explanation WHY you need fast resizing reallocations (by in
  large, people don't resize their storages, and it should
  be acceptable to have a performance degradation when they
  do).

- TH_STORAGE_FREEMEM is no more; instead, if you want a
  storage which doesn't free its result, you just give it
  an empty deleter.

- What we used to call an "allocator" (really, a combined
  object for allocating/deleting) has been split into two
  concepts: an allocator, and a smart pointer (SupervisedPtr)
  which knows how to delete data.

    - Unlike previously, where THAllocator/THCDeviceAllocator
      could have a per-tensor context storing extra information
      (e.g., a pointer to the metadata you need to actually
      free the tensor), there is no context in the allocator or
      the deleter of the smart pointer; instead, the smart
      pointer directly holds an owning reference to the
      metadata necessary to free the data.  This metadata
      is *freshly manufactured* upon every allocation, which
      permits us to resize tensors even in the absence of
      built-in support for realloc().

    - By default, allocators don't support "raw" allocations
      and deallocations with raw pointers.  This is because
      some allocations may return a different context every
      time, in which case you need to reconstruct the context
      at delete time (because all you got was a void*, not
      a unique_ptr that carries the deleter).

- The diff between at::Allocator and THCDeviceAllocator is a
  bit larger:

    - It used to return a cudaError_t.  Now, allocators
      are expected to check the error status immediately and throw
      an exception if there was an error.  It turns out that this
      is what was immediately done after all occurrences of
      allocate/release, so it wasn't a big deal (although some
      subsidiary interfaces had to themselves be converted to
      not return cudaError_t).

      There is one notable exception to this, and it is how
      we handle CUDA OOM: if this occurs, we attempt to return
      unused memory to the system and try again.  This is now
      handled by a catch-all try-catch block.  The cost of
      catching the exception is probably the least of your worries
      if you're about to OOM.

    - It used to take the CUDA stream to perform the allocation
      on as an argument.  However, it turned out that all call
      sites, this stream was the stream for the current device.
      So we can push this into the allocator (and the choice,
      in the future, could be made explicitly by twiddling
      thread local state.)

    - It held two extra methods, emptyCache and cacheInfo, specifically
      for interacting with some state in THCCachingAllocator.
      But this "generality" was a lie, since THCCachingAllocator
      was the only allocator that actually implemented these
      methods, and there is actually a bunch of code in THC
      which assumes that it is the caching allocator that is
      the underlying allocator for CUDA allocations.  So I
      folded these two methods into this interface as
      THCCachingAllocator_emptyCache and THCCachingAllocator_cacheInfo.

    - It held its context directly inside the THCDeviceAllocator
      struct.  This context has been moved out into whatever
      is holding the at::Allocator*.

- The APIs for getting at allocators/deleters is now a little different.

    - Previously there were a bunch of static variables you could get
      the address of (e.g., &THDefaultAllocator); now there is a
      function getTHDefaultAllocator().

    - Some "allocators" didn't actually know how to allocate (e.g.,
      the IPC "allocator").  These have been deleted; instead, you
      can wrap the produced pointers into SupervisedPtr using
      an appropriate makeSupervisedPtr() static method.

- Storage sharing was a lot of work to wrangle, but I think I've
  tamed the beast.

    - THMapAllocator and its "subclasses" have been refactored to
      be proper, honest to goodness C++ classes.  I used the enum
      argument trick to get "named" constructors.  We use inheritance
      to add refcounting and management (in libshm).  What we previously
      called the "Context" class (Context has been dropped from the name)
      is now the supervisor for the data.

    - Sometimes, we need to pull out the file descriptor from a
      tensor.  Previously, it was pulled out of the allocator context.
      Now, we pull it out of the supervisor of the SupervisedPtr,
      using the static method fromSupervisedPtr(), which uses the
      deleter as the typeid, and refines the type if it matches.

- I renamed the std::function deleter into
  InefficientStdFunctionSupervisor, to emphasize the fact that it does
  a dynamic allocation to save the std::function deleter.

TODO:

- Windows libshm is in shambles and needs to be fixed.

Perhaps for the future:

- newFromFd is now unconditionally calling cudaPointerGetAttributes
  even though this is unnecessary, because we know what the device
  is from higher up in the callstack.  We can fix this by making
  newWithDataAndAllocator also take an explicit device argument.

- Consider statically distinguishing between allocators that
  support raw_allocate/raw_deallocate, and those which don't.
  The Thrust constraint applies only to the CUDA device allocator;
  you never need to allocate CPU memory this way

- Really want to get rid of storage views. Ugh.

Nontrivial bugs I noticed when preparing this patch:

- I forgot to placement-new unique pointers and attempted to
  assign them directly on uninitialized memory; very bad!  Sam
  Gross has encouraged me to replace this with a proper constructor
  but I keep putting it off, because once everything goes in
  StorageImpl there really will be a proper constructor.

- I rewrote a number of APIs to use newWithDataAndAllocator
  instead of newWithAllocator, calling the allocator at the
  call site (because they required "allocation context" which
  we no longer give to "allocators").  When I did this, I forgot
  to insert the multiplication with sizeof(real) to scale from
  numels to number of bytes.

- The implementation of swap on storages was missing it for
  scalarType and backend.  It was benign (because the only case
  we call swap is when these are the same), but I fixed it anyway.

- I accidentally returned a nullptr unique_ptr with no deleter,
  even though there was a legitimate one.  This matters, because
  some code still shoves its hands in the deleter context to
  get extra metadata about the function.

- I used std::move() on a unique_ptr, and then did a boolean
  test on the pointer afterwards (always false!)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/9358
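
A minimal sketch of the allocator / SupervisedPtr split described above; every name here is an illustrative stand-in rather than the actual ATen type, and the point is only that the returned pointer carries its own supervisor, so nothing has to be looked up in the allocator at free time:

```
#include <cstdlib>
#include <memory>

// the data pointer plus an owning "supervisor" that knows how to free it
struct SupervisedPtr {
  void* data = nullptr;
  std::unique_ptr<void, void (*)(void*)> supervisor{nullptr, [](void*) {}};
};

// allocators only allocate; there is no realloc() and no per-allocator context
struct Allocator {
  virtual ~Allocator() = default;
  virtual SupervisedPtr allocate(std::size_t nbytes) = 0;
};

struct MallocAllocator final : Allocator {
  SupervisedPtr allocate(std::size_t nbytes) override {
    SupervisedPtr ptr;
    void* p = std::malloc(nbytes);
    ptr.data = p;
    // here the supervisor is the data itself; for dlpack/numpy buffers it would
    // instead own the external managing struct needed to free them
    ptr.supervisor = std::unique_ptr<void, void (*)(void*)>(p, [](void* q) { std::free(q); });
    return ptr;
  }
};
```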

Reviewed By: SsnL

Differential Revision: D8811822

Pulled By: ezyang

fbshipit-source-id: 4befe2d12c3e7fd62bad819ff52b054a9bf47c75
2018-07-15 15:11:18 -07:00
Yu Feng
22ba8726da Mention MPICH_MAX_THREAD_SAFETY=multiple. (#8580)
Currently, this is a common step to enable thread level 3 (MPI_THREAD_MULTIPLE) support on MPICH-based systems.
2018-06-26 12:40:48 -04:00
Peter Goldsborough
4f37a6481d Fix DeviceGuard usage in THD (#8622) 2018-06-18 18:33:54 -04:00
Peter Goldsborough
372d1d6735
Create ATen tensors via TensorOptions (#7869)
* Created TensorOptions

Storing the type in TensorOptions to solve the Variable problem

Created convenience creation functions for TensorOptions and added tests

Converted zeros to TensorOptions

Converted rand to TensorOptions

Fix codegen for TensorOptions and multiple arguments

Put TensorOptions convenience functions into torch namespace too

All factory functions except *_like support TensorOptions

Integrated with recent JIT changes

Support *_like functions

Fix in place modification

Some cleanups and fixes

Support sparse_coo_tensor

Fix bug in Type.cpp

Fix .empty calls in C++ API

Fix bug in Type.cpp

Trying to fix device placement

Make AutoGPU CPU compatible

Remove some auto_gpu.h uses

Fixing some headers

Fix some remaining CUDA/AutoGPU issues

Fix some AutoGPU uses

Fixes to dispatch_tensor_conversion

Reset version of new variables to zero

Implemented parsing device strings

Random fixes to tests

Self review cleanups

flake8

Undo changes to variable.{h,cpp} because they fail on gcc7.2

Add [cuda] tag to tensor_options_cuda.cpp

Move AutoGPU::set_index_from into .cpp file because Windows is stupid and sucks

Fix linker error in AutoGPU.cpp

Fix bad merge conflict in native_functions.yaml

Fixed caffe2/contrib/aten

Fix new window functions added to TensorFactories.cpp

* Removed torch::TensorOptions

Added code to generate wrapper functions for factory methods

Add implicit constructor from Backend to TensorOptions

Remove Var() from C++ API and use torch:: functions

Use torch:: functions more subtly in C++ API

Make AutoGPU::set_device more exception safe

Check status directly in DynamicCUDAHooksInterface

Rename AutoGPU to DeviceGuard

Removed set_requires_grad from python_variables.h and warn appropriately in Variable::set_requires_grad

remove python_default_init: self.type()

Add back original factory functions, but with deprecation warnings

Disable DeviceGuard for a couple functions in ATen

Remove print statement

Fix DeviceGuard construction from undefined tensor

Fixing CUDA device compiler issues

Moved as many methods as possible into header files

Don't generate python functions for deprecated factories

Remove merge conflict artefact

Fix tensor_options_cuda.cpp

Fix set_requires_grad not being checked

Fix tensor_new.h

TEMPORARILY put some methods in .cpp files to see if it solves issues on windows and mac

Fix bug in DeviceGuard.h

Missing includes

TEMPORARILY moving a few more methods into .cpp to see if it fixes windows

Fixing linker errors

* Fix up SummaryOps to use new factories

Undo device agnostic behavior of DeviceGuard

Use -1 instead of optional for default device index

Also move DeviceGuard methods into header

Fixes around device index after optional -> int32_t switch

Fix use of DeviceGuard in new_with_tensor_copy

Fix tensor_options.cpp

* Fix Type::copy(

* Remove test_non_float_params from ONNX tests

* Set requires_grad=False in ONNX tests that use ints

* Put layout/dtype/device on Tensor

* Post merge fixes

* Change behavior of DeviceGuard to match AutoGPU

* Fix C++ API integration tests

* Fix flip functions
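
A usage sketch of the TensorOptions-based factories this change introduces (overload details are assumptions based on the notes above):

```
#include <ATen/ATen.h>

at::Tensor make_tensors() {
  auto options = at::TensorOptions()
                     .dtype(at::kFloat)
                     .device(at::Device(at::kCUDA, 0));
  at::Tensor z = at::zeros({2, 3}, options);             // factory functions take TensorOptions
  at::Tensor r = at::rand({2, 3}, options);
  return at::empty_like(z, options.dtype(at::kDouble));  // *_like variants accept options too
}
```
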
2018-06-16 00:40:35 -07:00
Soumith Chintala
dc186cc9fe
Remove NO_* and WITH_* across codebase, except in setup.py (#8555)
* remove legacy options from CMakeLists

* codemod WITH_ to USE_ for WITH_CUDA, WITH_CUDNN, WITH_DISTRIBUTED, WITH_DISTRIBUTED_MW, WITH_GLOO_IBVERBS, WITH_NCCL, WITH_ROCM, WITH_NUMPY

* cover SYSTEM_NCCL, MKLDNN, NNPACK, C10D, NINJA

* removed NO_* variables and hotpatch them only in setup.py

* fix lint
2018-06-15 12:29:48 -04:00
Soumith Chintala
7251d70c5b fixed THD NO_CUDA (#8539) 2018-06-15 09:09:23 -04:00
Yangqing Jia
991bdd7f13 [build] remove the use of NO_CUDA (#8300)
* Only remove NO_CUDA from CMakeLists.txt

* @ezyang's catch
2018-06-12 12:14:36 -04:00
Teng Li
80b6f9edd6 [THD] fix broken THD build with NCCL (#8323) 2018-06-10 23:48:10 -04:00
Yangqing Jia
37073f8be0 [build] Remove /torch/lib/THD/cmake in favor of /cmake (#7159)
* Remove /torch/lib/THD/cmake in favor of /cmake

* path fix

* Explicitly marking gloo to use cuda

* Fix gloo path in THD
2018-06-08 17:55:12 -04:00
Peter Goldsborough
04a3616de0 Replace std::size_t with size_t (#8093) 2018-06-04 11:10:44 -04:00
Orion Reblitz-Richardson
4bf0202cac
[build] Have PyTorch depend on minimal libcaffe2.so instead of libATen.so (#7399)
* Have PyTorch depend on minimal libcaffe2.so instead of libATen.so

* Build ATen tests as a part of Caffe2 build

* Hopefully cufft and nvcc fPIC fixes

* Make ATen install components optional

* Add tests back for ATen and fix TH build

* Fixes for test_install.sh script

* Fixes for cpp_build/build_all.sh

* Fixes for aten/tools/run_tests.sh

* Switch ATen cmake calls to USE_CUDA instead of NO_CUDA

* Attempt at fix for aten/tools/run_tests.sh

* Fix typo in last commit

* Fix valgrind call after pushd

* Be forgiving about USE_CUDA disable like PyTorch

* More fixes on the install side

* Link all libcaffe2 during test run

* Make cuDNN optional for ATen right now

* Potential fix for non-CUDA builds

* Use NCCL_ROOT_DIR environment variable

* Pass -fPIC through nvcc to base compiler/linker

* Remove THCUNN.h requirement for libtorch gen

* Add Mac test for -Wmaybe-uninitialized

* Potential Windows and Mac fixes

* Move MSVC target props to shared function

* Disable cpp_build/libtorch tests on Mac

* Disable sleef for Windows builds

* Move protos under BUILD_CAFFE2

* Remove space from linker flags passed with -Wl

* Remove ATen from Caffe2 dep libs since directly included

* Potential Windows fixes

* Preserve options while sleef builds

* Force BUILD_SHARED_LIBS flag for Caffe2 builds

* Set DYLD_LIBRARY_PATH and LD_LIBRARY_PATH for Mac testing

* Pass TORCH_CUDA_ARCH_LIST directly in cuda.cmake

* Fixes for the last two changes

* Potential fix for Mac build failure

* Switch Caffe2 to build_caffe2 dir to not conflict

* Cleanup FindMKL.cmake

* Another attempt at Mac cpp_build fix

* Clear cpp-build directory for Mac builds

* Disable test in Mac build/test to match cmake
2018-05-24 07:47:27 -07:00
Deyu Fu
20041e2704 better cache for nccl resources (#6970)
Allow more than one device list to be stored.
2018-05-10 19:42:36 +02:00
gchanan
36a3f0995b Remove THDTensorDescriptor_newFromTH{X}Tensor. (#7287)
They don't seem to be used and we are moving to a single TensorImpl model.
2018-05-04 12:22:19 -07:00
Pieter Noordhuis
ebebfce681 Minor THD cleanup (#7161)
* Remove stale THD README

* Move common THD dependency into THD/base

The master_worker directory no longer contains files that are
needed for building other parts of THD.
2018-05-02 07:29:27 -07:00
Paul Jesse Hellemn
7968ee0f59
Removing references to CUDA_SDK_ROOT_DIR to see if it breaks anything (#7125) 2018-05-01 07:52:16 -07:00
Luca Antiga
0703357723 Don't build THD/master_worker if not explicitly requested (#7081) 2018-04-29 13:17:09 -04:00
Edward Z. Yang
4caea64d72
Make all of TH and THC C++. (#6913)
Changelist:

- Move *.c to *.cpp
- Change includes of ".c" to ".cpp"
- A bunch of cmake configuration modifying CMAKE_C_FLAGS was changed
to CMAKE_CXX_FLAGS or add_compile_options, because CMAKE_C_FLAGS only applies when you compile C code
- Explicitly cast void* to T* in a number of places
- Delete extern "C" { ... } blocks; instead, properly apply TH_API to everything that should have it (TH_API handles extern "C")
- Stop using stdatomic.h, instead, use <atomic>. This resulted in a bunch of placement-new/delete to be "totally properly correct"
- Refactor of THLongStorageView to not have static constructor methods (since it no longer has a copy/move constructor)
- Documentation about how the TH C interface (and extern C business) works
- Note that THD master_worker mode is dead
- C++ headers in TH libraries are given .hpp suffix, to make it less likely that you'll confuse them with the C-compatible headers (now suffixed .h)
- New function THCStream_stream and THCStream_device to project out fields of THCStream instead of accessing fields directly
- New function THStorage_(retainIfLive), which is equivalent to a retain but only if the refcount is greater than zero.
- In general, I tried to avoid using hpp headers outside of ATen/TH. However, there were a few places where I gave up and depended on the headers for my own sanity. See Note [TH abstraction violation] for all the sites where this occurred. All other sites were refactored to use functions
- Some extra Werror fixes (char* versus const char*)
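
A small sketch of the stdatomic.h to <atomic> switch mentioned in the changelist, using an illustrative struct rather than the real THStorage; because the struct is malloc'd, the std::atomic member has to be placement-new'd instead of assigned into uninitialized memory:

```
#include <atomic>
#include <cstdlib>
#include <new>

struct RefcountedBlock {
  std::atomic<int> refcount;  // was: atomic_int from <stdatomic.h>
};

RefcountedBlock* new_block() {
  auto* b = static_cast<RefcountedBlock*>(std::malloc(sizeof(RefcountedBlock)));
  new (&b->refcount) std::atomic<int>(1);  // placement-new, as the changelist describes
  return b;
}

void free_block(RefcountedBlock* b) {
  using AtomicInt = std::atomic<int>;
  b->refcount.~AtomicInt();  // explicit destructor call to match the placement-new
  std::free(b);
}
```
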
2018-04-28 07:45:02 -04:00