49 Commits

Author SHA1 Message Date
dolpm
4ac2ee573d [sigmoid] memory planner C10 deps (#151275)
Summary: perf-sensitive util functions for use in our memory planner

Test Plan: CI

Differential Revision: D73002726

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151275
Approved by: https://github.com/georgiaphillips
2025-04-24 01:46:32 +00:00
Yu, Guangye
d5ce5c9509 Reuse format_size utils (#149383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149383
Approved by: https://github.com/malfet
2025-03-24 03:06:27 +00:00
Marko Radmilac
c65ee728f0 Initial implementation of host memory stats (#147660)
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.

This change tries very hard not to disrupt the initial design of the allocator, and it uses the existing locking mechanism, whenever possible, to gather statistics "for free". The only deviation from that is on the "slow path", where we incur CUDA calls anyway, so taking a short lock is not going to hurt performance much, especially in the steady state where most allocations will come from the cache.

As mentioned before, this is the first PR, to introduce the concept and to see if it fits the right paradigm. We can always add more later.

Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in CUDA caching allocator, in order to maintain symmetry.
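
As a rough illustration of the kind of counter such a Stat structure could hold, here is a minimal sketch. The name `StatSketch` and its fields are assumptions made for this example and are not taken from the c10 sources; the point is only that current usage, peak usage, and running totals can be updated with a few integer operations while a lock is already held.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical counter for illustration only; not the actual c10 definition.
struct StatSketch {
  int64_t current = 0;    // amount in use right now
  int64_t peak = 0;       // highest value `current` has reached
  int64_t allocated = 0;  // running total of all increases
  int64_t freed = 0;      // running total of all decreases

  // Record an allocation of `amount` units.
  void increase(int64_t amount) {
    current += amount;
    peak = std::max(peak, current);
    allocated += amount;
  }

  // Record a release of `amount` units.
  void decrease(int64_t amount) {
    current -= amount;
    freed += amount;
  }
};
```

Because each update is a handful of additions and one comparison, calling `increase` or `decrease` inside a critical section that the allocator already takes adds very little work.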

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
2025-03-05 16:13:19 +00:00
PyTorch MergeBot
a983b2b11a Revert "Initial implementation of host memory stats (#147660)"
This reverts commit 945e359fc1.

Reverted https://github.com/pytorch/pytorch/pull/147660 on behalf of https://github.com/mradmila due to There is an issue with ambiguous definition of Stat structure when different C++ tools are used. Backing out for now. ([comment](https://github.com/pytorch/pytorch/pull/147660#issuecomment-2692346379))
2025-03-01 18:05:45 +00:00
Marko Radmilac
945e359fc1 Initial implementation of host memory stats (#147660)
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.

This change tries very hard not to disrupt the initial design of the allocator, and it uses the existing locking mechanism, whenever possible, to gather statistics "for free". The only deviation from that is on the "slow path", where we incur CUDA calls anyway, so taking a short lock is not going to hurt performance much, especially in the steady state where most allocations will come from the cache.

As mentioned before, this is the first PR, to introduce the concept and to see if it fits the right paradigm. We can always add more later.

Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in CUDA caching allocator, in order to maintain symmetry.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
2025-02-28 18:36:44 +00:00
cyy
dca443835e Enable more readability-redundant checks (#143963)
These checks help simplify the code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143963
Approved by: https://github.com/albanD
2024-12-30 14:49:33 +00:00
cyy
a2bc2e38f9 Use clang-tidy 17 (#139678)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139678
Approved by: https://github.com/Skylion007
2024-11-05 16:00:25 +00:00
cyy
38d3c27849 [1/N] Enable cppcoreguidelines-special-member-functions (#137405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137405
Approved by: https://github.com/ezyang
2024-10-23 00:16:53 +00:00
Richard Barnes
542f7c8383 Eliminate C10_NODISCARD (#138336)
Test Plan: Sandcastle

Reviewed By: swolchok

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138336
Approved by: https://github.com/Skylion007
2024-10-19 02:54:06 +00:00
PyTorch MergeBot
7e8dace0de Revert "[ROCm] remove caffe2 from hipify (#137157)"
This reverts commit 40d8260745.

Reverted https://github.com/pytorch/pytorch/pull/137157 on behalf of https://github.com/xw285cornell due to this is breaking internal where we still use caffe2 ([comment](https://github.com/pytorch/pytorch/pull/137157#issuecomment-2400466131))
2024-10-08 17:45:45 +00:00
Jeff Daily
40d8260745 [ROCm] remove caffe2 from hipify (#137157)
- Remove all "MasqueradingAsCUDA" files and classes.
- Do not rename "CUDA" classes to "HIP".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137157
Approved by: https://github.com/eqy
2024-10-05 12:48:54 +00:00
Aaron Enye Shi
f42d5b6dca [Memory Snapshot] Make recordAnnotations callback initialize lazily (#129242)
Summary: Make the recordAnnotations' Record function callback lazily initialize when record memory history starts. This will help reduce the impact on Time To First Batch metric.

Test Plan: CI and ran locally.

Differential Revision: D58875576

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129242
Approved by: https://github.com/zdevito
2024-06-22 04:05:55 +00:00
Aaron Enye Shi
b5d541609d [Memory Snapshot] Add recordAnnotations to capture record_function annotations (#129072)
Summary:
Add new traceEvents into Memory Snapshot for record_function annotations. These will capture both the profiler's step annotation as well as user annotations.

Test Plan:
CI

Pulled By:
aaronenyeshi

Differential Revision: D55941362

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129072
Approved by: https://github.com/zdevito
2024-06-19 18:05:41 +00:00
PyTorch MergeBot
718bb9016f Revert "[Memory Snapshot] Add recordAnnotations to capture record_function annotations (#124179)"
This reverts commit 187aeaeabf.

Reverted https://github.com/pytorch/pytorch/pull/124179 on behalf of https://github.com/clee2000 due to test_tensorexpr.py::TestTensorExprFuser::test_simple_add is causing a segfault https://github.com/pytorch/pytorch/actions/runs/9097383783/job/25007155440 187aeaeabf, test was skipped due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/124179#issuecomment-2112948246))
2024-05-15 16:11:47 +00:00
Aaron Enye Shi
187aeaeabf [Memory Snapshot] Add recordAnnotations to capture record_function annotations (#124179)
Summary: Add new traceEvents into Memory Snapshot for record_function annotations. These will capture both the profiler's step annotation as well as user annotations.

Test Plan:
CI

New Snapshot Generated:
devvm2184.cco0.facebook.com.Apr_19_13_27_14.3072800.snapshot.pickle

Snippet of Snapshot device_traces show `ProfilerStep#0`, and `## forward ##` annotations:
```
[[{'action': 'user_defined',
   'addr': 0,
   'size': 0,
   'stream': 0,
   'time_us': 1713558427168556,
   'frames': [{'name': 'START', 'filename': 'ProfilerStep#0', 'line': 0}]},
  {'action': 'user_defined',
   'addr': 0,
   'size': 0,
   'stream': 0,
   'time_us': 1713558427168738,
   'frames': [{'name': 'END', 'filename': 'ProfilerStep#0', 'line': 0}]},
  {'action': 'user_defined',
   'addr': 0,
   'size': 0,
   'stream': 0,
   'time_us': 1713558427168865,
   'frames': [{'name': 'START', 'filename': 'ProfilerStep#1', 'line': 0}]},
  {'action': 'user_defined',
   'addr': 0,
   'size': 0,
   'stream': 0,
   'time_us': 1713558427168920,
   'frames': [{'name': 'START', 'filename': '## forward ##', 'line': 0}]},
  {'action': 'alloc',
   'addr': 140166073581568,
   'size': 3211264,
   'stream': 0,
   'time_us': 1713558427172978,
   'frames': [{'name': '_conv_forward',
     'filename': '/mnt/xarfuse/uid-416185/235d4caf-seed-nspid4026531836_cgpid32884718-ns-4026531840/torch/nn/modules/conv
```

Differential Revision: D55941362

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124179
Approved by: https://github.com/zdevito
2024-05-15 14:19:40 +00:00
cyy
507611f9ae [CUDACachingAllocator] Turn Allocator::allocate into non-const (#120969)
Ideally, the method should be non-const since it changes the allocator state. Some const_casts are also removed along the way.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120969
Approved by: https://github.com/albanD
2024-03-05 09:53:05 +00:00
Aaron Orenstein
d280b6ae58 Ensure that deleter is called even for a no-data tensor. (#117418)
Summary:

When using a custom deleter InefficientStdFunctionContext was using a
std::unique_ptr<> to store the pointer and call the deleter - but this failed to
call the deleter if the pointer was null. Since we have a separate holder class
anyway take out the std::unique_ptr<> and call the deleter directly.
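
The underlying behavior can be reproduced in isolation: the destructor of `std::unique_ptr` invokes its deleter only when the stored pointer is non-null. The sketch below uses hypothetical names (`counting_deleter`, `AlwaysDeleteHolder`) that are not from the PyTorch sources; it contrasts a `unique_ptr` with a plain holder that calls the deleter unconditionally.

```cpp
#include <memory>

// Hypothetical call counter so the effect is observable.
static int g_deleter_calls = 0;
void counting_deleter(void* /*ptr*/) { ++g_deleter_calls; }

// Returns how many times the deleter ran when a unique_ptr holding `ptr`
// was destroyed. For a null `ptr` the deleter is skipped.
int deleter_calls_via_unique_ptr(void* ptr) {
  g_deleter_calls = 0;
  {
    std::unique_ptr<void, void (*)(void*)> holder(ptr, &counting_deleter);
  }  // holder is destroyed here
  return g_deleter_calls;
}

// Hypothetical holder that always calls its deleter, even for nullptr.
struct AlwaysDeleteHolder {
  void* ptr;
  void (*deleter)(void*);
  ~AlwaysDeleteHolder() { deleter(ptr); }
};

// Same measurement as above, using the unconditional holder.
int deleter_calls_via_holder(void* ptr) {
  g_deleter_calls = 0;
  {
    AlwaysDeleteHolder holder{ptr, &counting_deleter};
  }  // holder is destroyed here
  return g_deleter_calls;
}
```

With a null pointer, `deleter_calls_via_unique_ptr` reports zero calls while `deleter_calls_via_holder` reports one, which matters when the deleter releases resources that are not tied to the data pointer itself.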

Fixes #117273

Test Plan:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117418
Approved by: https://github.com/wjakob, https://github.com/yanboliang
2024-01-22 23:27:27 +00:00
Edward Yang
b4a35632f9 Add function to materialize COW storages (#117053)
Summary: From Kurt Mohler, see https://github.com/pytorch/pytorch/pull/113396 (manually imported due to ghimport problems)

Test Plan: sandcastle, OSS CI

Differential Revision: D52610522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117053
Approved by: https://github.com/malfet, https://github.com/kurtamohler
2024-01-10 15:34:16 +00:00
cyy
1544c37520 [7/N] Fixes clang-tidy warnings in c10/{core,util}/*.h (#115495)
This PR continues to fix clang-tidy warnings for headers in c10/core and c10/util.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115495
Approved by: https://github.com/malfet
2023-12-19 02:14:30 +00:00
cyy
99f222372b [5/N] Fixes clang-tidy warnings in c10/{core,util}/*.h (#115354)
This PR continues to fix clang-tidy warnings for headers in c10/core and c10/util.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115354
Approved by: https://github.com/Skylion007
2023-12-09 17:16:04 +00:00
cyy
7b8084d1c6 [5/N] Fixes clang-tidy warnings in c10/core/*.h (#115232)
This PR continues to fix clang-tidy warnings for headers in c10/core.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115232
Approved by: https://github.com/Skylion007
2023-12-07 15:48:03 +00:00
PyTorch MergeBot
f36d09fcb7 Revert "Add function to materialize COW storages (#113396)"
This reverts commit e2f090086b.

Reverted https://github.com/pytorch/pytorch/pull/113396 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/113396#issuecomment-1818769090))
2023-11-20 10:26:01 +00:00
Kurt Mohler
e2f090086b Add function to materialize COW storages (#113396)
Part of #109833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113396
Approved by: https://github.com/ezyang
2023-11-17 01:58:51 +00:00
PyTorch MergeBot
9c7391ea36 Revert " [1/N] Apply clang-tidy to c10 cuda files (#111137)"
This reverts commit 43b023694e.

Reverted https://github.com/pytorch/pytorch/pull/111137 on behalf of https://github.com/malfet due to Was reverted internally due to the failures in torch.cuda.memory_stats(device=0) (presumably) ([comment](https://github.com/pytorch/pytorch/pull/111137#issuecomment-1769274103))
2023-10-18 20:32:53 +00:00
cyy
43b023694e [1/N] Apply clang-tidy to c10 cuda files (#111137)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111137
Approved by: https://github.com/zou3519, https://github.com/Skylion007
2023-10-17 04:52:50 +00:00
PyTorch MergeBot
cca31f1797 Revert "implement a function to convert a storage to copy-on-write (#100819)"
This reverts commit aec11b8c80.

Reverted https://github.com/pytorch/pytorch/pull/100819 on behalf of https://github.com/jeanschmidt due to added tests are breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/100819#issuecomment-1547929531))
2023-05-15 14:10:23 +00:00
mikey dagitses
aec11b8c80 implement a function to convert a storage to copy-on-write (#100819)
implement a function to convert a storage to copy-on-write

Summary:
This will be used in the _lazy_clone() operator as well as reshape().

Test Plan: 100% coverage of reachable lines.

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/100819).
* #100821
* #100820
* __->__ #100819

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100819
Approved by: https://github.com/ezyang
2023-05-12 17:45:04 +00:00
mikey dagitses
4431509a54 introduce c10::DataPtr::mutable_get() and use it in c10 (#98217)
Differential Revision: [D44629940](https://our.internmc.facebook.com/intern/diff/D44629940/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98217
Approved by: https://github.com/ezyang
2023-04-04 02:26:18 +00:00
Zachary DeVito
48490cec28 [memory profiling] Move Context object to c10 (#96280)
Minor refactor so that a follow-up PR can have objects that meet the GatheredContext
interface without having to depend on CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96280
Approved by: https://github.com/eellison
2023-03-12 07:24:14 +00:00
cyy
bfe5e1258b avoid unnecessary static_cast (#93898)
avoid unnecessary static_cast
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93898
Approved by: https://github.com/Skylion007
2023-02-03 03:44:43 +00:00
Aaron Gokaslan
700941f683 Fixup c10 headers with clang-tidy (#91407)
Clang-tidy was not applied properly to headers in c10, as documented in #91406. These are the easy automated fixes that came out of applying clang-tidy to the c10 part of the code base. cc @ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91407
Approved by: https://github.com/ezyang
2022-12-28 11:12:22 +00:00
Codrin Popa
0aedda25bc [PyTorch] Reporting OOM events to the Pytorch Profiler. (#80050)
Summary: Similar to reporting alloc and dealloc events in the PyTorch profiler, we are now reporting Out of Memory events as well. This is useful for performance troubleshooting.

Test Plan: Added test_oom_tracing to test/test_profiler.py

Differential Revision: D36268132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80050
Approved by: https://github.com/robieta
2022-07-20 16:51:39 +00:00
Jiewen Tan
ab0d9b18e9 [LT] Support Tensor.is_alias_of
Summary:
Tensor.is_alias_of relies on Storage to perform its check. However, LTCTensorImpl was
not implemented with that in mind. This commit adds a fake storage to LazyTensor
as a marker to identify LazyTensors that point to the same storage. The reason
why this is not done in LTCTensorImpl is that LazyTensor maintains the view ops/alias
logic in the LazyTensor class instead of relying on TensorImpl to do the check.

Test Plan:
./build/bin/test_lazy --gtest_filter=LazyOpsTest.IsAliasOf

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75246

Approved by: https://github.com/bdhirsh
2022-04-14 07:28:03 +00:00
Han Guangyun
8bbcef5096 Report more information for memory profiling (#61282)
Summary:
Report pointed memory size, total allocated memory, total reserved size all in one report.

`ptr` and `alloc_size` will be used for associating with op trace.
`allocated_size`, `reserved_size` will be used for memory trace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61282

Reviewed By: ejguan

Differential Revision: D29796282

Pulled By: chaekit

fbshipit-source-id: 5314c867632d3af1fa9a3811b35eaa5e931a5d87
2021-08-04 15:03:14 -07:00
Scott Wolchok
44cc873fba [PyTorch] Autoformat c10 (#56830)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56830

Opt into formatting on GitHub and format everything. This is a trial run before turning on formatting for more and eventually all of the codebase.

Test Plan: CI

Reviewed By: zertosh

Differential Revision: D27979080

fbshipit-source-id: a80f0c48691c08ae8ca0af06377b87e6a2351151
2021-04-30 21:23:28 -07:00
Ilia Cherniavskii
a94fb71b12 Memory profiling (#37775)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37775

Adding memory usage into profiler table output

Test Plan:
BUILD_BINARY=1 USE_BLAS=MKL USE_MKLDNN=0 USE_CUDA=0 python setup.py develop install --cmake

```
import torch
import torchvision.models as models
model = models.resnet18()
inp = torch.randn(5, 3, 224, 224)

with torch.autograd.profiler.profile(profile_memory=True, record_shapes=True) as prof:
    model(inp)

print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_memory_usage", row_limit=15))
```

```
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem Total    Number of Calls  Input Shapes
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
resize_                      0.37%            577.936us        0.37%            577.936us        9.796us          339.03 Mb        59               [[0]]
empty                        0.69%            1.061ms          0.74%            1.139ms          5.556us          47.42 Mb         205              []
stride                       0.00%            0.853us          0.00%            0.853us          0.853us          19.53 Kb         1                [[5, 1000]]
empty_strided                0.01%            21.393us         0.02%            26.033us         5.207us          252 b            5                []
is_complex                   0.02%            37.425us         0.02%            37.425us         1.291us          208 b            29               [[]]
masked_select                0.04%            55.333us         0.06%            93.616us         46.808us         120 b            2                [[30], [30]]
conv2d                       0.01%            18.009us         9.62%            14.902ms         14.902ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
convolution                  0.01%            12.436us         9.61%            14.884ms         14.884ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
_convolution                 0.03%            52.381us         9.60%            14.871ms         14.871ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
size                         0.00%            5.429us          0.00%            5.429us          0.339us          0 b              16               [[5, 3, 224, 224]]
contiguous                   0.00%            1.934us          0.00%            1.934us          0.967us          0 b              2                [[5, 3, 224, 224]]
_convolution_nogroup         0.02%            27.505us         9.57%            14.814ms         14.814ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
_nnpack_available            0.02%            34.267us         0.02%            34.267us         1.713us          0 b              20               []
thnn_conv2d                  0.01%            13.274us         9.54%            14.771ms         14.771ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
thnn_conv2d_forward          5.98%            9.264ms          19.02%           29.446ms         14.723ms         0 b              2                [[5, 3, 224, 224], [64, 3, 7, 7], [
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 154.855ms
```

Reviewed By: ngimel

Differential Revision: D21384248

Pulled By: ilia-cher

fbshipit-source-id: 31359cce2aa06f6255ed1ad8c60d03cb640bfec3
2020-05-19 15:48:48 -07:00
Allan Di Wu
f538cd627a Install HugePagesArena to optimize pytorch prediction performance (#37640)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37640

Enable an oversize arena to reduce memory fragmentation. Memory requests with a large size (configurable with FLAGS_caffe2_oversize_threshold) are fulfilled from a dedicated arena, separate from the existing huge page arena.

Two additional parameters are introduced to configure the 2-phase decay of the memory arena:
- caffe2_dirty_decay_ms
- caffe2_muzzy_decay_ms

In the current jemalloc implementation, oversized allocations are purged immediately whether or not they are placed in an arena. Therefore we need to extend the decay time to indefinite. Currently we set the default for caffe2_muzzy_decay_ms to -1.

We now enable the arena allocator statically. To ensure it is correctly installed regardless of static initialization order, we add a priority flag in c10::SetAllocator, and only higher priority allocators can overwrite existing ones.
ghstack-source-id: 103276877
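
The priority rule can be sketched as follows. This is a toy model under stated assumptions, not the c10 implementation: `AllocatorSketch` and `AllocatorSlot` are hypothetical names, and the sketch assumes a strictly higher priority is needed to replace an installed allocator (how ties are handled is not stated in this message).

```cpp
#include <cstdint>

// Hypothetical stand-in for an allocator interface.
struct AllocatorSketch {
  const char* name;
};

// Hypothetical registration slot for one device type.
class AllocatorSlot {
 public:
  // Installs `alloc` if the slot is empty or `priority` is strictly higher
  // than the priority of the installed allocator (assumed tie rule).
  // Returns true if `alloc` was installed.
  bool set(AllocatorSketch* alloc, uint8_t priority) {
    if (current_ != nullptr && priority <= priority_) {
      return false;
    }
    current_ = alloc;
    priority_ = priority;
    return true;
  }

  AllocatorSketch* get() const { return current_; }

 private:
  AllocatorSketch* current_ = nullptr;
  uint8_t priority_ = 0;
};
```

With this rule the outcome does not depend on registration order: if the arena allocator registers with priority 1 and the default allocator with priority 0, the arena allocator ends up installed whichever static initializer runs first.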

Test Plan:
buck test mode/dev //caffe2/caffe2/fb/init:huge_pages_allocator_test

Benchmarking known CV model that benefits from page arena:
```
PyTorchModelBench.cpp:183] test / base : 86.9532%
```

By adjusting `dirty_decay_ms` and `muzzy_decay_ms`, we have the following plots:
https://pxl.cl/15SWW
https://pxl.cl/15TnL

From the figures above we can see that performance does not change much until the dirty decay time is indefinite (set to -1). Setting either the muzzy decay or the dirty decay time to -1 reaches the best performance, regardless of which one it is. Even setting the decay time to a very long value (100s, which is longer than the run) does not change the performance by much.

## Observe performance difference in production with a variety of models (WIP)

Reviewed By: dzhulgakov

Differential Revision: D21258581

fbshipit-source-id: c006f8b94f28aef0666e52f48d4e82cf0d3a48af
2020-05-06 17:27:10 -07:00
Gemfield
d9115b533a remove needless ## in REGISTER_ALLOCATOR definition. (#19261)
Summary:
remove needless ## in REGISTER_ALLOCATOR definition.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19261

Differential Revision: D15002025

Pulled By: soumith

fbshipit-source-id: 40614b1d79d1fe05ccf43f0ae5aab950e4c875c2
2019-04-18 22:44:09 -07:00
Edward Yang
474adf5458 Minor doc updates in c10/core/Allocator.h (#17164)
Summary:
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17164

Differential Revision: D14154393

Pulled By: ezyang

fbshipit-source-id: 59d8276d4bb4e7cadb4382769b75e5348ed388de
2019-02-20 14:36:15 -08:00
Dmytro Dzhulgakov
51dd2000cd unify c2 and TH allocator (#16892)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16892

Replaces https://github.com/pytorch/pytorch/pull/14517

Merged caffe2 and TH CPU Allocators. Mostly using the code from caffe2 allocators.
`memset` of caffe2 allocator is gone now. These two allocators should be almost the same.

Baseline:
```
Running ./tensor_allocation
Run on (48 X 2501 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 256K (x24)
  L3 Unified 30720K (x2)
-------------------------------------------------------------------------
Benchmark                                  Time           CPU Iterations
-------------------------------------------------------------------------
BM_MakeStorageImpl                       148 ns        148 ns    4676594
BM_StorageImplCtor                        54 ns         54 ns   12957810
BM_MallocStorageImpl                      62 ns         62 ns   11254745
BM_TensorImplCtor                         22 ns         22 ns   31939472
BM_MallocTensorImpl                      105 ns        105 ns    6505661
BM_Malloc_1                               43 ns         43 ns   16464905
BM_MakeTensorFromStorage                 126 ns        126 ns    5586116
BM_MakeVariableFromTensor                236 ns        236 ns    2995528
BM_ATenCPUTensorAllocationSmall1         319 ns        319 ns    2268884
BM_ATenCPUTensorAllocationSmall2         318 ns        318 ns    2163332
BM_ATenCPUTensorAllocationMedium1        403 ns        403 ns    1663228
BM_ATenCPUTensorAllocationMedium2        448 ns        448 ns    1595004
BM_ATenCPUTensorAllocationBig1           532 ns        532 ns    1352634
BM_ATenCPUTensorAllocationBig2          4486 ns       4486 ns     160978
```
Changed:
```
Running ./tensor_allocation
Run on (48 X 2501 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 256K (x24)
  L3 Unified 30720K (x2)
-------------------------------------------------------------------------
Benchmark                                  Time           CPU Iterations
-------------------------------------------------------------------------
BM_MakeStorageImpl                       141 ns        141 ns    4803576
BM_StorageImplCtor                        55 ns         55 ns   13129391
BM_MallocStorageImpl                      64 ns         64 ns   11088143
BM_TensorImplCtor                         23 ns         23 ns   31616273
BM_MallocTensorImpl                      101 ns        101 ns    7017585
BM_Malloc_1                               39 ns         39 ns   18523954
BM_MakeTensorFromStorage                 118 ns        118 ns    5877919
BM_MakeVariableFromTensor                452 ns        452 ns    1565722
BM_ATenCPUTensorAllocationSmall1         384 ns        384 ns    1819763
BM_ATenCPUTensorAllocationSmall2         389 ns        389 ns    1857483
BM_ATenCPUTensorAllocationMedium1        425 ns        425 ns    1646284
BM_ATenCPUTensorAllocationMedium2        430 ns        430 ns    1561319
BM_ATenCPUTensorAllocationBig1           508 ns        508 ns    1309969
BM_ATenCPUTensorAllocationBig2          3799 ns       3799 ns     173674
```

lstm benchmark:
Before:
```
INFO:lstm_bench:Iter: 1 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 21 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 41 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 61 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 81 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 101 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 121 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 141 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 161 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 181 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 201 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 221 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 241 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 261 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 281 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 301 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 321 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 341 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 361 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 381 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Done. Total EPS excluding 1st iteration: 0.8k
```

After:
```
INFO:lstm_bench:Iter: 1 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 21 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 41 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 61 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 81 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 101 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 121 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 141 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 161 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 181 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 201 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 221 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 241 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 261 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 281 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 301 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 321 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 341 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 361 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 381 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Done. Total EPS excluding 1st iteration: 0.8k
```

Reviewed By: ezyang

Differential Revision: D13202632

fbshipit-source-id: db6d2ec756ed15b0732b15396c82ad42302bb79d
2019-02-12 21:16:34 -08:00
Dmytro Dzhulgakov
4d4c5273de Fix and add testing for nullptr allocator in c2->pt conversion (#16857)
Summary:
Fixes a bug that occurs when a tensor is created on the Caffe2 side, then passed to PT and resized. Now we just initialize the allocator correctly.

Note that the code in raw_mutable_data() is still necessary because of non-resizable tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16857

Reviewed By: houseroad

Differential Revision: D14019469

Pulled By: dzhulgakov

fbshipit-source-id: 14d3a3b946d718bbab747ea376903646b885706a
2019-02-11 23:21:02 -08:00
Edward Yang
e48ffa84d8 Add compare_exchange_deleter to DataPtr/UniqueVoidPtr (#16513)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16513

compare_exchange_deleter makes it easier to replace a
deleter on a DataPtr with a new one, without requiring
allocating another closure to hold the old deleter.
See comment for details.
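
The idea can be sketched with a small owning holder. The class below is a hypothetical illustration, not the DataPtr/UniqueVoidPtr code: the names `OwnedContext`, `plain_free`, and `traced_free` are assumptions. The deleter is swapped in place only when the currently installed deleter is the expected one, so no wrapper closure around the old deleter is needed.

```cpp
#include <cstdlib>

using DeleterFn = void (*)(void*);

// Hypothetical deleters used for the illustration.
static int g_traced_frees = 0;
void plain_free(void* p) { std::free(p); }
void traced_free(void* p) {
  ++g_traced_frees;  // stand-in for extra bookkeeping
  std::free(p);
}

// Hypothetical owner of a context pointer plus the function that frees it.
class OwnedContext {
 public:
  OwnedContext(void* ctx, DeleterFn deleter) : ctx_(ctx), deleter_(deleter) {}
  OwnedContext(const OwnedContext&) = delete;
  OwnedContext& operator=(const OwnedContext&) = delete;
  ~OwnedContext() {
    if (ctx_ != nullptr && deleter_ != nullptr) {
      deleter_(ctx_);
    }
  }

  // Replaces the deleter with `desired` only if the current deleter equals
  // `expected`. Returns true if the swap happened.
  bool compare_exchange_deleter(DeleterFn expected, DeleterFn desired) {
    if (deleter_ != expected) {
      return false;
    }
    deleter_ = desired;
    return true;
  }

  DeleterFn deleter() const { return deleter_; }

 private:
  void* ctx_;
  DeleterFn deleter_;
};
```

A caller that knows memory came from `plain_free` can ask for `traced_free` instead; if some other deleter is installed, the call fails and the holder is left untouched.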

This diff was originally landed as part of D13762540
(#16226), but we are reverting that diff in D13863610 (#16510)

Reviewed By: smessmer

Differential Revision: D13864245

fbshipit-source-id: 56eda4748238dd3a5130ba6434fda463fe7c690e
2019-01-31 17:40:04 -08:00
Edward Yang
279238f0b8 Back out "Delete duplicate copy of THCCachingAllocator." (#16510)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16510

This diff was supposed to be memory usage neutral, but based on
some internal flows involving cuDNN, it was not. Reverting pending
further investigation.

Original commit changeset: 03f1ebf7f11c

Reviewed By: xw285cornell

Differential Revision: D13863610

fbshipit-source-id: 15517e255fd6b0c064b65fb99f0ef19742236cfd
2019-01-29 15:44:19 -08:00
Edward Yang
792cb774f1 Delete duplicate copy of THCCachingAllocator. (#16226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16226

Now that the caching allocator is moved to c10_cuda, we can
delete the duplicate copy from Caffe2.

Reviewed By: dzhulgakov, smessmer

Differential Revision: D13762540

fbshipit-source-id: 03f1ebf7f11c68c19aa0d66110156fe228da6138
2019-01-24 12:06:57 -08:00
Edward Yang
e936a69085 Move THCCachingAllocator to c10_cuda. (#16119)
Summary:
Some renaming and renamespacing also took place. I was originally planning not to do anything, but it turns out that it was easier to make HIPify work by using a namespace CUDACachingAllocator:: rather than THCCachingAllocator_, since :: is a word boundary but _ is not.
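
The word-boundary point can be checked with a regular expression. The sketch below assumes the rewriting tool matches identifiers as whole words (in regex terms, surrounded by `\b`); the helper `matches_whole_word` and the sample function name `raw_alloc` are hypothetical and chosen only for illustration.

```cpp
#include <regex>
#include <string>

// Returns true if `name` occurs in `source` as a whole word, i.e. with a
// regex word boundary on both sides. `name` is assumed to contain only
// identifier characters, so it needs no escaping.
bool matches_whole_word(const std::string& source, const std::string& name) {
  const std::regex pattern("\\b" + name + "\\b");
  return std::regex_search(source, pattern);
}
```

For the text `CUDACachingAllocator::raw_alloc(n)` the name `CUDACachingAllocator` matches as a whole word, because `:` is not a word character. For `THCCachingAllocator_raw_alloc(n)` the prefix `THCCachingAllocator` does not match, because `_` counts as a word character and no boundary exists after the prefix.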

Pull Request resolved: https://github.com/pytorch/pytorch/pull/16119

Reviewed By: smessmer

Differential Revision: D13718768

fbshipit-source-id: 884a481d99027fd3e34471c020f826aa12225656
2019-01-24 12:06:56 -08:00
Sebastian Messmer
d408324350 Move files to/from c10/core and c10/util (#15316)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15316

This starts cleaning up the files in c10 according to the module structure we decided on.

Move to c10/util:
- Half.h, Half-inl.h, Half.cpp, bitcasts.h

Move to c10/core:
- Device.h, Device.cpp
- DeviceType.h, DeviceType.cpp

i-am-not-moving-c2-to-c10

Reviewed By: dzhulgakov

Differential Revision: D13498493

fbshipit-source-id: dfcf1c490474a12ab950c72ca686b8ad86428f63
2019-01-10 16:22:22 -08:00
Sebastian Messmer
9e9e87c19e Move TensorImpl to c10 (yay!)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14795

Reviewed By: ezyang

Differential Revision: D13336856

fbshipit-source-id: 5375d0e42312ff7564f4df06210a5e49542d59e3
2018-12-11 21:01:38 -08:00
Sebastian Messmer
fb6806f6e9 Remove at references in c10 Allocator.h (#14434)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14434

The referenced classes now live in c10, so we don't need to specify their namespace.

Reviewed By: ezyang

Differential Revision: D13224015

fbshipit-source-id: 6d154b8e3f9a1e38ff0407dbb1151f5c1d5df260
2018-11-29 11:07:22 -08:00
Sebastian Messmer
3a71d5ee49 Move Allocator.h to c10
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14059

Reviewed By: ezyang

Differential Revision: D13081606

fbshipit-source-id: d6ad59ad4e3d363268cd4307b6c999a168681246
2018-11-27 12:59:44 -08:00