Commit Graph

17 Commits

Author SHA1 Message Date
Edward Yang
cffca2926b Introduce SupervisedPtr, delete THAllocator and THCDeviceAllocator (#9358)
Summary:
See Note [Supervisor deleter] for how SupervisedPtr works.
This design is not the obvious one, but there were a lot of
constraints feeding into it:

- It must support the reallocation usage-pattern, where, given
  an existing Storage, we allocate a new region of memory,
  copy the existing data to it, and then deallocate the old
  region of memory.

- Creation of a deleter for memory MUST avoid dynamic allocations
  in the common case.  We've done some benchmarking in Caffe2
  where dynamic allocation for deleters is ruinously expensive,
  and it's really hard to avoid these performance tarpits in
  very general function wrappers like std::function or
  folly::Function (while benchmarking this, we discovered that
  folly::Function's move constructor was way more expensive
  than it should be).

- We need to be able to deallocate data that comes from external
  sources, e.g., dlpack and numpy tensors.  Most notably,
  you often cannot deallocate these with merely the void*
  data pointer; you need some extra, out-of-band information
  (e.g., the managing struct) to deallocate it.  Sometimes,
  you may even want to resize data living in an external source!

- The "core" allocators need to support being wrapped in a Thrust
  allocator, so you need to implement the following two functions:

    char* allocate(size_t);
    void deallocate(char*, size_t);
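
  For illustration, a minimal sketch of such a wrapper, assuming an
  allocator that exposes the raw_allocate/raw_deallocate entry points
  discussed below (the shim itself is hypothetical):

    #include <cstddef>

    // Hedged sketch: adapt a raw-capable allocator to the
    // two-function interface Thrust expects.
    template <typename RawAllocator>
    struct ThrustAllocatorShim {
      RawAllocator* allocator;  // non-owning
      char* allocate(size_t n) {
        return static_cast<char*>(allocator->raw_allocate(n));
      }
      void deallocate(char* p, size_t /*n*/) {
        allocator->raw_deallocate(p);
      }
    };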

- We need to support tensors which contain non-POD, non-trivially
  copyable data; specifically tensors of std::string.  This is
  an upcoming requirement from Caffe2.  It's dirty AF, but
  it's really useful.

- It should use C++ standard library types like std::unique_ptr
  (which is hugely problematic because std::unique_ptr doesn't
  call the deleter when the pointer is null.)
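
  To make the hazard concrete, a minimal sketch:

    #include <memory>

    // std::unique_ptr only invokes its deleter when get() != nullptr,
    // so a deleter that must always run (e.g., to release an
    // out-of-band context) is silently skipped for a null pointer.
    void demo() {
      auto release = [](void* /*ctx*/) { /* must-run cleanup */ };
      std::unique_ptr<void, void (*)(void*)> p(nullptr, release);
    }  // p dies here, but the deleter is NOT called: p was null.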

Here is the billing of changes:

- Built-in support for realloc() has been DROPPED ENTIRELY.
  Instead, you're expected to allocate and then copy from
  the old memory to the new memory if you want to do a
  reallocation.  This is what you'd generally have expected
  to occur; and axing realloc() from the design lets us avoid
  some tricky correctness issues with std::realloc(), namely
  the fact that we must refuse the realloc if the type of the
  elements is not trivially copyable.  If it really matters,
  we can add this back, but there really needs to be a good
  explanation WHY you need fast resizing reallocations (by and
  large, people don't resize their storages, and it should
  be acceptable to have a performance degradation when they
  do).

- TH_STORAGE_FREEMEM is no more; instead, if you want a
  storage which doesn't free its result, you just give it
  an empty deleter.
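
  Concretely, a sketch (the exact construction API is elided here):

    // A storage over memory it does not own simply carries a deleter
    // that does nothing; there is no TH_STORAGE_FREEMEM flag to set.
    void noop_deleter(void* /*ptr*/) {}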

- What we used to call an "allocator" (really, a combined
  object for allocating/deleting) has been split into two
  concepts, an allocator, and a smart pointer (SupervisedPtr)
  which knows how to delete data.

    - Unlike previously, where THAllocator/THCDeviceAllocator
      could have a per-tensor context storing extra information
      (e.g., a pointer to the metadata you need to actually
      free the tensor), there is no context in the allocator or
      the deleter of the smart pointer; instead, the smart
      pointer directly holds an owning reference to the
      metadata necessary to free the data.  This metadata
      is *freshly manufactured* upon every allocation, which
      permits us to resize tensors even in the absence of
      built-in support for realloc().

    - By default, allocators don't support "raw" allocations
      and deallocations with raw pointers.  This is because
      some allocators may return a different context every
      time, in which case you need to reconstruct the context
      at delete time (because all you got was a void*, not
      a unique_ptr that carries the deleter).
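
  Putting those two points together, the rough shape of the thing (a
  sketch, not the literal definition):

    #include <memory>

    // The data pointer travels with an owning "supervisor" holding
    // whatever out-of-band state is needed to free it.  A plain
    // function pointer as the deleter keeps the common case free of
    // dynamic allocation.
    using DeleterFnPtr = void (*)(void*);

    struct SupervisedPtrSketch {
      void* data;                                      // what tensors see
      std::unique_ptr<void, DeleterFnPtr> supervisor;  // owns the metadata
    };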

- The diff between at::Allocator and THCDeviceAllocator is a
  bit larger:

    - It used to return a cudaError_t.  Now, allocators
      are expected to check the error status immediately and throw
      an exception if there was an error.  It turns out that this
      is what was immediately done after all occurrences of
      allocate/release, so it wasn't a big deal (although some
      subsidiary interfaces had to themselves be converted to
      not return cudaError_t).

      There is one notable exception to this, and it is how
      we handle CUDA OOM: if this occurs, we attempt to return
      unused memory to the system and try again.  This is now
      handled by a catch-all try-catch block.  The cost of
      catching the exception is probably the least of your worries
      if you're about to OOM.
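
      A sketch of that control flow (the helper name is hypothetical;
      THCCachingAllocator_emptyCache is named below, its signature
      assumed):

        void* allocate_with_retry(size_t size) {
          try {
            return cuda_raw_allocate(size);    // hypothetical helper
          } catch (...) {
            // Return cached blocks to the driver and try once more;
            // a second failure propagates to the caller.
            THCCachingAllocator_emptyCache();
            return cuda_raw_allocate(size);
          }
        }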

    - It used to take the CUDA stream to perform the allocation
      on as an argument.  However, it turned out that at all call
      sites, this stream was the stream for the current device.
      So we can push this into the allocator (and the choice,
      in the future, could be made explicitly by twiddling
      thread local state.)

    - It held two extra methods, emptyCache and cacheInfo, specifically
      for interacting with some state in THCCachingAllocator.
      But this "generality" was a lie, since THCCachingAllocator
      was the only allocator that actually implemented these
      methods, and there is actually a bunch of code in THC
      which assumes that it is the caching allocator that is
      the underlying allocator for CUDA allocations.  So I
      pulled these two methods out of the generic interface and
      exposed them as THCCachingAllocator_emptyCache and
      THCCachingAllocator_cacheInfo.

    - It held its context directly inside the THCDeviceAllocator
      struct.  This context has been moved out into whatever
      is holding the at::Allocator*.

- The APIs for getting at allocators/deleters are now a little different.

    - Previously there were a bunch of static variables you could get
      the address of (e.g., &THDefaultAllocator); now there is a
      function getTHDefaultAllocator().

    - Some "allocators" didn't actually know how to allocate (e.g.,
      the IPC "allocator").  These have been deleted; instead, you
      can wrap the produced pointers into SupervisedPtr using
      an appropriate makeSupervisedPtr() static method.
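
      For instance (names hypothetical, and the makeSupervisedPtr
      signature is assumed):

        // Externally produced memory: the void* alone cannot free
        // it, so the managing struct rides along as the supervisor.
        struct ExternalBuffer { void* data; /* bookkeeping */ };

        void release_external(ExternalBuffer*);  // hypothetical

        void delete_external(void* ctx) {
          release_external(static_cast<ExternalBuffer*>(ctx));
        }

        // usage: makeSupervisedPtr(buf->data, buf, &delete_external)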

- Storage sharing was a lot of work to wrangle, but I think I've
  tamed the beast.

    - THMapAllocator and its "subclasses" have been refactored to
      be proper, honest to goodness C++ classes.  I used the enum
      argument trick to get "named" constructors.  We use inheritance
      to add refcounting and management (in libshm).  What we previously
      called the "Context" class (Context has been dropped from the name)
      is now the supervisor for the data.

    - Sometimes, we need to pull out the file descriptor from a
      tensor.  Previously, it was pulled out of the allocator context.
      Now, we pull it out of the supervisor of the SupervisedPtr,
      using the static method fromSupervisedPtr(), which uses the
      deleter as the typeid, and refines the type if it matches.
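
      Roughly, reusing the SupervisedPtrSketch from above (deleter and
      helper names hypothetical):

        struct THMapAllocator;                // the class above
        void delete_th_map_allocator(void*);  // its unique deleter

        // Each supervisor type has its own deleter function, so the
        // deleter's address doubles as a cheap type tag.
        THMapAllocator* from_supervised_ptr(SupervisedPtrSketch& ptr) {
          if (ptr.supervisor.get_deleter() != &delete_th_map_allocator)
            return nullptr;  // not ours; refuse to refine the type
          return static_cast<THMapAllocator*>(ptr.supervisor.get());
        }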

- I renamed the std::function deleter to
  InefficientStdFunctionSupervisor, to emphasize the fact that it does
  a dynamic allocation to store the std::function deleter.
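
  Roughly (a sketch of the pattern, not the actual class):

    #include <functional>

    // One dynamic allocation per storage, just to keep the
    // std::function alive until deletion time; hence "Inefficient".
    struct InefficientStdFunctionSupervisorSketch {
      void* data;
      std::function<void(void*)> fn;
      ~InefficientStdFunctionSupervisorSketch() { fn(data); }
    };

    void delete_inefficient_std_function(void* ctx) {
      delete static_cast<InefficientStdFunctionSupervisorSketch*>(ctx);
    }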

TODO:

- Windows libshm is in shambles and needs to be fixed.

Perhaps for the future:

- newFromFd is now unconditionally calling cudaPointerGetAttributes
  even though this is unnecessary, because we know what the device
  is from higher up in the callstack.  We can fix this by making
  newWithDataAndAllocator also take an explicit device argument.

- Consider statically distinguishing between allocators that
  support raw_allocate/raw_deallocate, and those which don't.
  The Thrust constraint applies only to the CUDA device allocator;
  you never need to allocate CPU memory this way.

- Really want to get rid of storage views. Ugh.

Nontrivial bugs I noticed when preparing this patch:

- I forgot to placement-new unique pointers and attempted to
  assign them directly on uninitialized memory; very bad!  Sam
  Gross has encouraged me to replace this with a proper constructor
  but I keep putting it off, because once everything goes in
  StorageImpl there really will be a proper constructor.

- I rewrote a number of APIs to use newWithDataAndAllocator
  instead of newWithAllocator, calling the allocator at the
  call site (because they required "allocation context" which
  we no longer give to "allocators").  When I did this, I forgot
  to insert the multiplication with sizeof(real) to scale from
  numels to number of bytes.

- The implementation of swap on storages failed to swap the
  scalarType and backend fields.  It was benign (because the only case
  we call swap is when these are the same), but I fixed it anyway.

- I accidentally returned a nullptr unique_ptr with no deleter,
  even though there was a legitimate one.  This matters, because
  some code still shoves its hands in the deleter context to
  get extra metadata about the function.

- I used std::move() on a unique_ptr, and then did a boolean
  test on the pointer afterwards (always false!)
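
  That last one, in miniature:

    #include <memory>
    #include <utility>

    void consume(std::unique_ptr<int>);  // hypothetical sink

    void demo(std::unique_ptr<int> p) {
      consume(std::move(p));
      if (p) {
        // never reached: a moved-from unique_ptr is guaranteed null
      }
    }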

Pull Request resolved: https://github.com/pytorch/pytorch/pull/9358

Reviewed By: SsnL

Differential Revision: D8811822

Pulled By: ezyang

fbshipit-source-id: 4befe2d12c3e7fd62bad819ff52b054a9bf47c75
2018-07-15 15:11:18 -07:00
Edward Yang
d0d1820814 Add weak pointer and finalizer support directly to THStorage. (#9148)
Summary:
The underlying use-case is the file descriptor to storage cache in
torch.multiprocessing.reductions.  Previously, this was implemented by wrapping
an existing allocator with a "weak ref" allocator which also knew to null out
the weak reference when the storage died.  This is terribly oblique, and
prevents us from refactoring the allocators to get rid of per-storage allocator
state.

So instead of going through this fiasco, we instead directly implement weak
pointers and finalizers in THStorage.  Weak pointers to THStorage retain the
THStorage struct, but not the data_ptr.  When all strong references die,
data_ptr dies and the finalizers get invoked.
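
Schematically, the semantics described above (a sketch; field and
function names are illustrative):

    #include <atomic>

    struct THStorageSketch {
      std::atomic<int> refcount;   // strong refs: guard data_ptr
      std::atomic<int> weakcount;  // weak refs, +1 while strongs exist
      void* data_ptr;
      // ... finalizers, allocator, size, etc.
    };

    void run_finalizers(THStorageSketch*);  // hypothetical
    void free_data(THStorageSketch*);       // hypothetical

    void release_weak(THStorageSketch* s) {
      if (--s->weakcount == 0) {
        delete s;             // the struct itself dies last
      }
    }

    void release_strong(THStorageSketch* s) {
      if (--s->refcount == 0) {
        run_finalizers(s);    // finalizers fire exactly once
        free_data(s);         // data_ptr dies with the last strong ref
        release_weak(s);      // ...but the struct may outlive the data
      }
    }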

There is one major hazard in this patch, which is what happens if you
repeatedly call _weak_ref on a storage.  For cleanliness, we no longer
shove our grubby fingers into the finalizer struct to see if there is already
a Python object for the weak reference and return it; we just create a new one
(no one is checking these Python objects for identity).  This means if you
keep calling it, we'll keep piling on finalizers.  That's bad! But I am
not going to fix it until it is actually a problem for someone, because
then we need to add another caching layer.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9148

Differential Revision: D8729106

Pulled By: ezyang

fbshipit-source-id: 69710ca3b7c7e05069090e1b263f8b6b9f1cf72f
2018-07-10 06:25:33 -07:00
Edward Yang
4d57a1750c Unify THStorage and THCStorage structs. (#9107)
Summary:
Closes https://github.com/pytorch/pytorch/pull/9107

Some details about how this was done:

- For now, the allocators for CPU and CUDA are different (unifying
  the allocators is a bigger change to make, I'll contribute this in
  a later patch).  To smooth this over, the allocator field now
  stores a void* instead of THAllocator* or THCDeviceAllocator*; to
  make this clear the field is renamed to allocatorVoidPtr.

- Some THStorage functions which were generated per-scalar are now
  generalized, and thus moved out of the generic/ library.  This way
  they can be called directly from a non-code-generated at::Storage.

- THCState is moved into a C++ header.  This is actually not really
  related to this particular diff, but I'll need it soon to replace
  THAllocator/THCDeviceAllocator with at::Allocator (C++, so I can't
  mention it in a C header file.)

- THPPointer needs to be adjusted, since there is no more type refinement
  between THStorage/THCStorage for it to template match over.  This
  is a little tricky, because I can't refer to THCStorage_free unless
  we actually compile with CUDA.  So there's two copies of the function
  now: one for the CPU build, one for the CUDA build.  If we ever split
  CUDA/non-CUDA Python builds, you will have to indirect this through some
  dynamic dispatch.

I want to soon replace the THCDeviceAllocator pointers in
THCState with at::Allocator, but I can't reference a C++ namespaced type
from C code, so THCState needs to move.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Closes https://github.com/pytorch/pytorch/pull/9087

Reviewed By: orionr

Differential Revision: D8712072

Pulled By: ezyang

fbshipit-source-id: c6e1ea236cd1df017b42a7fffb2dbff20d50a284
2018-07-02 17:09:52 -07:00
gchanan
7926313235
Have a single THStorage and THCStorage type. (#8030)
No longer generate data-type specific Storage types, since all Storage types are now identical anyway.
For (some) backwards compatibility and documentation purposes, the Real names, e.g. THLongStorage, are now #defined as aliases to the single THStorage type.
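
Presumably along these lines (a sketch of the aliasing described above):

    /* compatibility aliases: the Real-named types all refer to the
       single THStorage type */
    #define THLongStorage  THStorage
    #define THFloatStorage THStorage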
2018-06-02 11:05:02 -04:00
Edward Z. Yang
4caea64d72
Make all of TH and THC C++. (#6913)
Changelist:

- Move *.c to *.cpp
- Change includes of ".c" to ".cpp"
- A bunch of cmake configuration modifying CMAKE_C_FLAGS was changed
to CMAKE_CXX_FLAGS or add_compile_options, because CMAKE_C_FLAGS only applies when you compile C code
- Explicitly cast void* to T* in a number of places
- Delete extern "C" { ... } blocks; instead, properly apply TH_API to everything that should have it (TH_API handles extern "C")
- Stop using stdatomic.h; instead, use <atomic>. This required a bunch of placement-new/delete to be "totally properly correct"
- Refactor of THLongStorageView to not have static constructor methods (since it no longer has a copy/move constructor)
- Documentation about how the TH C interface (and extern C business) works
- Note that THD master_worker mode is dead
- C++ headers in TH libraries are given .hpp suffix, to make it less likely that you'll confuse them with the C-compatible headers (now suffixed .h)
- New functions THCStream_stream and THCStream_device to project out fields of THCStream instead of accessing fields directly
- New function THStorage_(retainIfLive), which is equivalent to a retain but only if the refcount is greater than zero (see the sketch after this list).
- In general, I tried to avoid using hpp headers outside of ATen/TH. However, there were a few places where I gave up and depended on the headers for my own sanity. See Note [TH abstraction violation] for all the sites where this occurred. All other sites were refactored to use functions.
- Some extra Werror fixes (char* versus const char*)
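
A sketch of the retainIfLive idea flagged above (illustrative; the real
function is generated per scalar type via the THStorage_() macro):

    #include <atomic>

    // Only take a new strong reference if the object is still live;
    // compare-exchange prevents resurrecting a refcount that a
    // concurrent release has already driven to zero.
    bool retain_if_live(std::atomic<int>& refcount) {
      int count = refcount.load();
      while (count > 0) {
        if (refcount.compare_exchange_weak(count, count + 1)) {
          return true;   // we now hold a fresh reference
        }
        // on failure, count was reloaded; loop and re-check
      }
      return false;      // refcount already hit zero
    }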
2018-04-28 07:45:02 -04:00
Zachary DeVito
d985cf46f1
Add workaround to fix include warnings in Python 2 builds. (#6716) 2018-04-24 12:30:19 -07:00
Sam Gross
7588893ce2
Some additional clean-ups (#5505)
- Remove some uses of mega-header THP.h
- Use HANDLE_TH_ERRORS in functions that may throw
- Move NumPy includes to common header
- Delete unused allocator
2018-03-05 17:45:02 -05:00
Sam Gross
4bce69be22
Implement Variable.storage() (#3765)
This still uses THPStorage, but avoids touching THPTensor
2017-11-20 14:18:07 -05:00
andreh7
cc8fd5bde1 added #define __STDC_FORMAT_MACROS to tensor and storage code templates to avoid problems with gcc 4.8.5 (#3629) 2017-11-10 15:21:33 -05:00
peterjc123
aa911939a3 Improve Windows Compatibility (for csrc/scripts) (#2941) 2017-11-08 19:51:35 +01:00
Adam Paszke
67f94557ff Expose torch.HalfTensor 2017-02-27 19:35:47 -05:00
Sam Gross
1af9a9637f Refactor copy and release GIL during copy (#286) 2016-12-11 21:54:58 +01:00
Adam Paszke
0c9670ddf0 Allow remapping storages at load time and serialize data in little endian order 2016-10-04 12:54:55 -07:00
Sam Gross
1486d880b0 Add Storage.from_buffer
from_buffer is similar to numpy's frombuffer. It decodes a Python
buffer object into a Storage object. For byte and char storages, it
simply copies the bytes.
2016-09-07 15:32:33 -07:00
Adam Paszke
f9d186d33a Add initial version of multiprocessing module 2016-08-31 19:46:08 -07:00
Adam Paszke
4a5b66de9a Add copy and type methods to Tensors 2016-05-05 22:44:43 +02:00
Adam Paszke
842e1b6358 Add exception handling 2016-05-05 20:58:13 +02:00