Commit Graph

41 Commits

Yuanyuan Chen
f231be25c6 Mark unused parameters in C++ code (#164912)
This PR comments out the names of unused parameters in C++ declarations to improve code readability.
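A minimal before/after sketch of the pattern (the function is hypothetical):

```cpp
#include <cstddef>
#include <cstdio>

// The unused parameter keeps its name as a comment, which documents
// its meaning while silencing -Wunused-parameter.
void resize_storage(std::size_t new_size, bool /*preserve_data*/) {
  std::printf("resizing to %zu bytes\n", new_size);
}
```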

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164912
Approved by: https://github.com/Skylion007
2025-10-09 06:23:25 +00:00
cyy
8fa81a6066 Enable misc-use-internal-linkage check and apply fixes (#148948)
Enables the clang-tidy rule [`misc-use-internal-linkage`](https://clang.llvm.org/extra/clang-tidy/checks/misc/use-internal-linkage.html). This check was introduced in Clang-Tidy 18 and is available thanks to the recent update to Clang-Tidy 19.

The check flags functions and variables that are used only within their translation unit so that they can be marked static. As a result, undesired symbols are not leaked into other units, more link-time optimisations become possible, and the resulting binaries may be smaller.

Most of the detected violations were fixed by adding static. In other cases the symbols were indeed consumed by other files, so their declaring headers were included instead. A few declarations were simply wrong and have been fixed.
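A hypothetical illustration of what the check flags and how the fix looks:

```cpp
// helper() is used only in this .cpp file, but without internal
// linkage its symbol is still exported from the translation unit.
int helper(int x) { return x * 2; }  // flagged by misc-use-internal-linkage

// The fix: internal linkage keeps the symbol local to this TU,
// enabling the benefits described above.
static int helper_fixed(int x) { return x * 2; }

int consume(int x) { return helper(x) + helper_fixed(x); }
```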

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148948
Approved by: https://github.com/Skylion007
2025-03-12 14:22:56 +00:00
cyy
09291817b2 Fix extra semicolon warning (#148291)
Fixes #ISSUE_NUMBER
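A hypothetical instance of the warning being fixed:

```cpp
// clang's -Wextra-semi flags the stray semicolon after a member
// function definition.
struct Logger {
  void flush() {};  // warning: extra ';' after member function definition
  void sync() {}    // fixed
};
```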

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148291
Approved by: https://github.com/Skylion007
2025-03-03 18:51:44 +00:00
cyy
38d3c27849 [1/N] Enable cppcoreguidelines-special-member-functions (#137405)
Fixes #ISSUE_NUMBER
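A hypothetical sketch of what cppcoreguidelines-special-member-functions enforces: declaring any special member function obliges you to address all of them (the rule of five).

```cpp
class Buffer {
 public:
  Buffer() = default;
  ~Buffer() = default;                        // declaring the destructor...
  Buffer(const Buffer&) = delete;             // ...means the copy and move
  Buffer& operator=(const Buffer&) = delete;  // operations must also be
  Buffer(Buffer&&) = delete;                  // explicitly defaulted,
  Buffer& operator=(Buffer&&) = delete;       // defined, or deleted.
};
```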

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137405
Approved by: https://github.com/ezyang
2024-10-23 00:16:53 +00:00
cyy
a2396b2dd8 [2/N] Fix extra warnings brought by clang-tidy-17 (#137459)
Follows #137407

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137459
Approved by: https://github.com/Skylion007
2024-10-08 19:05:02 +00:00
cyy
507611f9ae [CUDACachingAllocator] Turn Allocator::allocate into non-const (#120969)
Ideally, the method should be non-const since it changes the allocator state. Some const_casts are also removed along the way.
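A simplified before/after sketch of the signature change (not the real c10 API):

```cpp
#include <cstddef>

// Before: a const allocate() forces stateful implementations into
// const_cast, because the method does mutate allocator bookkeeping.
struct BeforeAllocator {
  virtual void* allocate(std::size_t n) const = 0;  // misleadingly const
  virtual ~BeforeAllocator() = default;
};

// After: non-const allocate() honestly reflects the mutation.
struct AfterAllocator {
  virtual void* allocate(std::size_t n) = 0;  // may mutate allocator state
  virtual ~AfterAllocator() = default;
};
```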

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120969
Approved by: https://github.com/albanD
2024-03-05 09:53:05 +00:00
Edward Yang
b4a35632f9 Add function to materialize COW storages (#117053)
Summary: From Kurt Mohler, see https://github.com/pytorch/pytorch/pull/113396 (manually imported due to ghimport problems)
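A conceptual sketch of what materializing a copy-on-write storage means (illustrative only, not the actual c10 implementation):

```cpp
#include <cstdlib>
#include <cstring>

// A COW storage shares its bytes until materialization gives it an
// exclusive copy.
struct CowStorage {
  void* data;
  std::size_t nbytes;
  bool shared;  // true while the buffer is lazily aliased

  void materialize() {
    if (!shared) return;  // already owns its bytes exclusively
    void* own = std::malloc(nbytes);
    std::memcpy(own, data, nbytes);
    data = own;           // future writes invisible to other aliases
    shared = false;
  }
};
```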

Test Plan: sandcastle, OSS CI

Differential Revision: D52610522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117053
Approved by: https://github.com/malfet, https://github.com/kurtamohler
2024-01-10 15:34:16 +00:00
PyTorch MergeBot
f36d09fcb7 Revert "Add function to materialize COW storages (#113396)"
This reverts commit e2f090086b.

Reverted https://github.com/pytorch/pytorch/pull/113396 on behalf of https://github.com/DanilBaibak due to breaking an internal build ([comment](https://github.com/pytorch/pytorch/pull/113396#issuecomment-1818769090))
2023-11-20 10:26:01 +00:00
Kurt Mohler
e2f090086b Add function to materialize COW storages (#113396)
Part of #109833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113396
Approved by: https://github.com/ezyang
2023-11-17 01:58:51 +00:00
PyTorch MergeBot
9c7391ea36 Revert " [1/N] Apply clang-tidy to c10 cuda files (#111137)"
This reverts commit 43b023694e.

Reverted https://github.com/pytorch/pytorch/pull/111137 on behalf of https://github.com/malfet due to Was reverted internally due to the failures in torch.cuda.memory_stats(device=0) (presumably) ([comment](https://github.com/pytorch/pytorch/pull/111137#issuecomment-1769274103))
2023-10-18 20:32:53 +00:00
cyy
43b023694e [1/N] Apply clang-tidy to c10 cuda files (#111137)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111137
Approved by: https://github.com/zou3519, https://github.com/Skylion007
2023-10-17 04:52:50 +00:00
cyy
fa65ae8f56 cleanup unused include (#93359)
Used the `include-what-you-use` tool to find and remove unused includes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93359
Approved by: https://github.com/malfet
2023-02-04 02:15:50 +00:00
cyy
bfe5e1258b avoid unnecessary static_cast (#93898)
avoid unnecessary static_cast
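Hypothetical examples of the removed pattern:

```cpp
#include <cstdint>

// When the conversion is already implicit and lossless, the
// static_cast only adds noise.
void example(int n) {
  int64_t widened = static_cast<int64_t>(n);  // unnecessary cast
  int64_t same = n;                           // implicit widening suffices
  (void)widened;
  (void)same;
}
```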
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93898
Approved by: https://github.com/Skylion007
2023-02-03 03:44:43 +00:00
cyy
37f7c00a8a More fixes and improved clang-tidy checkers (#93213)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93213
Approved by: https://github.com/Skylion007
2023-02-01 14:44:17 +00:00
cyy
045d1de02d Fix some code issues (#92760)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92760
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-01-24 08:19:03 +00:00
Codrin Popa
0aedda25bc [PyTorch] Reporting OOM events to the Pytorch Profiler. (#80050)
Summary: Similar to reporting alloc and dealloc events in the PyTorch profiler, we are now reporting Out of Memory events as well. This is useful for performance troubleshooting.

Test Plan: Added test_oom_tracing to test/test_profiler.py

Differential Revision: D36268132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80050
Approved by: https://github.com/robieta
2022-07-20 16:51:39 +00:00
mikey dagitses
844a4b47df extract out //c10/core:alloc_cpu (#70859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70859

ghstack-source-id: 147642534

Test Plan: Extracting code unmodified to a new library; relying on CI to validate.

Reviewed By: malfet

Differential Revision: D33329688

fbshipit-source-id: f60327467d197ec1862fb3554f8b83e6c84cab5c
(cherry picked from commit f82e7c0e9b)
2022-01-27 07:34:52 +00:00
mikey dagitses
fc6a488e9a extract out //c10/core:alignment (#70858)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70858

ghstack-source-id: 147642533

Test Plan: Extracted a constant to a new header, trusting CI build to validate.

Reviewed By: malfet

Differential Revision: D33329689

fbshipit-source-id: 8697bb81a5cc3366462ebdf1f214b62d478fa77c
(cherry picked from commit 16663847e1)
2022-01-27 07:34:52 +00:00
Richard Barnes
29d759948e use irange for loops 2 (#66746)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66746

Modified loops in files under fbsource/fbcode/caffe2/ from the format

`for(TYPE var=x0;var<x_max;var++)`

to the format

`for(const auto var: irange(x_max))`

This was achieved by running r-barnes's loop upgrader script (D28874212), with some modifications to exclude all files under /torch/jit; a number of reversions and unused-variable warning suppressions were added by hand.
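The rewrite in sketch form (c10::irange is the real utility from c10/util/irange.h; the surrounding function is illustrative):

```cpp
#include <cstdint>
#include <c10/util/irange.h>

void example(int64_t n) {
  // Before: classic index loop.
  for (int64_t i = 0; i < n; i++) { /* ... */ }

  // After: irange yields each index as a const value, so the loop
  // variable cannot be mutated accidentally inside the body.
  for (const auto i : c10::irange(n)) { (void)i; }
}
```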

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D31705361

fbshipit-source-id: 33fd22eb03086d114e2c98e56703e8ec84460268
2021-12-10 04:26:23 -08:00
Xue Li
2f099c7555 Revert D30652629: use irange for loops
Test Plan: revert-hammer

Differential Revision:
D30652629 (687c2267d4)

Original commit changeset: 0ae6c4bbbb55

fbshipit-source-id: 5c4f067b584a021c8c9656454d1ee60999600fb3
2021-10-15 15:23:10 -07:00
Richard Barnes
687c2267d4 use irange for loops (#66234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66234

Modified loops in files under fbsource/fbcode/caffe2/ from the format

`for(TYPE var=x0;var<x_max;var++)`

to the format

`for(const auto var: irange(x_max))`

This was achieved by running r-barnes's loop upgrader script (D28874212), with some modifications to exclude all files under /torch/jit; a number of reversions and unused-variable warning suppressions were added by hand.

bypass_size_limit
allow-large-files

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D30652629

fbshipit-source-id: 0ae6c4bbbb554bad42e372792a6430e1acf15e3e
2021-10-15 13:50:33 -07:00
Han Guangyun
8bbcef5096 Report more information for memory profiling (#61282)
Summary:
Report the pointed-to memory size, total allocated memory, and total reserved memory all in one report.

`ptr` and `alloc_size` will be used for associating events with an op trace.
`allocated_size` and `reserved_size` will be used for the memory trace.
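A hypothetical shape of the combined record (field names follow this message, not necessarily the actual code):

```cpp
#include <cstdint>

struct MemoryEvent {
  void* ptr;               // associates the event with an op trace
  int64_t alloc_size;      // size of this allocation (or free)
  int64_t allocated_size;  // total memory currently allocated
  int64_t reserved_size;   // total memory currently reserved
};
```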

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61282

Reviewed By: ejguan

Differential Revision: D29796282

Pulled By: chaekit

fbshipit-source-id: 5314c867632d3af1fa9a3811b35eaa5e931a5d87
2021-08-04 15:03:14 -07:00
Richard Barnes
ee44d73e59 Modernize override (#61744)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61744
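In sketch form, the modernization this applies:

```cpp
// Redundant 'virtual' is dropped and 'override' added, so the compiler
// checks that the signature really overrides a base-class method.
struct Base {
  virtual void reset() {}
  virtual ~Base() = default;
};
struct Derived : Base {
  void reset() override {}  // was: virtual void reset() {}
};
```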

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29717320

fbshipit-source-id: 6eea4295ee2e5572ab337620be412376fcc2f3cc
2021-07-23 23:04:46 -07:00
Nikita Shulga
a9b0a921d5 Disable avoid-non-const-global-variables lint check (#62008)
Summary:
The GoogleTest `TEST` macro is non-compliant with this check, as is `DEFINE_DISPATCH`.

All changes except the ones to `.clang-tidy` were generated using the following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`;  do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```
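For context, a sketch of why test code trips the check:

```cpp
#include <gtest/gtest.h>

// TEST expands to a test class plus a non-const global registration
// object, which is exactly what the now-disabled check flags.
TEST(AllocatorTest, AddsUp) {
  EXPECT_EQ(2 + 2, 4);
}
```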

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008

Reviewed By: driazati, r-barnes

Differential Revision: D29838584

Pulled By: malfet

fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
2021-07-22 18:04:40 -07:00
Nikita Shulga
635d864b26 Fix modernize-use-equals-default nolint failures in torch/csrcs (#61142)
Summary:
Test-plan: Compile + clang-tidy
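For reference, the pattern modernize-use-equals-default enforces, in sketch form:

```cpp
struct Options {
  Options() {}  // flagged: empty special member written by hand
};
struct OptionsFixed {
  OptionsFixed() = default;  // preferred: compiler-generated
};
```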

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61142

Reviewed By: VitalyFedyunin

Differential Revision: D29529372

Pulled By: malfet

fbshipit-source-id: 2ccde7712a51c28243b16bbb4d1d68086e0414a6
2021-07-06 09:46:46 -07:00
Ilia Cherniavskii
047ae6b713 [profiler][small] CUDA synchronize guard, minor fix (#58254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58254

Don't use CUDA synchronize when profiling in CPU-only mode.
Minor fixes (clarified a docstring, fixed spammy logging).
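A sketch of the guard logic (names are hypothetical):

```cpp
#include <cuda_runtime.h>

// Synchronize the device only when CUDA activity is being profiled,
// so CPU-only profiling avoids the device stall entirely.
void maybe_synchronize(bool profiling_cuda) {
  if (profiling_cuda) {
    cudaDeviceSynchronize();  // needed for accurate GPU-side timings
  }
}
```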

(Note: this ignores all push blocking failures!)

Test Plan: manual + CI

Reviewed By: gdankel, chaekit

Differential Revision: D28423667

Pulled By: ilia-cher

fbshipit-source-id: 04c71727f528ae8e2e0ff90e88271608d291bc69
2021-05-13 19:21:56 -07:00
Nikita Shulga
087049000b Make c10 clang-tidy clean (#55870)
Summary:
This change was autogenerated by running:
```
% find c10 -iname "*.cpp" -exec python3 tools/clang_tidy.py -c build -x {} -s \;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55870

Reviewed By: janeyx99

Differential Revision: D27728617

Pulled By: malfet

fbshipit-source-id: bede4d7f0c106d51394d1e9efddf01bf894421c5
2021-04-14 11:23:28 -07:00
Caroline Chen
2f91cda37e Modify error message (#53525)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53525

Reviewed By: mthrok

Differential Revision: D26900045

Pulled By: carolineechen

fbshipit-source-id: 387301381603d37d24cc829c8fed38123f268c0b
2021-03-09 09:12:00 -08:00
Ilia Cherniavskii
f1f9b049d8 [profiler] Support top-level memory events (#51421)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51421

Explicitly mark, in the profiler output, memory events that did not happen within an operator context.

Test Plan: python test/test_profiler.py -k test_memory_profiler

Reviewed By: ngimel

Differential Revision: D26166518

Pulled By: ilia-cher

fbshipit-source-id: 3c14d3ac25a7137733ea7cc65f0eb48693a98f5e
2021-02-04 04:14:15 -08:00
Hao Lu
4976208e73 [caffe2] Register BlackBoxPredictor AllocationArenaPool as CPUCachingAllocator (#48161)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48161

- Register BlackBoxPredictor AllocationArenaPool as CPUCachingAllocator
- Use the AllocationArenaPool in both BlackBoxPredictor and StaticRuntime

Test Plan:
```
buck run //caffe2/caffe2/fb/predictor:black_box_predictor_test
buck run //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```
AF canary:
https://www.internalfb.com/intern/ads/canary/431021257540238874/

Reviewed By: dzhulgakov

Differential Revision: D24977611

fbshipit-source-id: 33ba596b43c1e558c3ab237a0feeae93565b2d35
2020-11-30 15:03:34 -08:00
Kimish Patel
a09e1098e7 Profiling allocator for mobile. (#43951)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43951

AllocationPlan: stores the sequence of allocations, their sizes, and the lifetimes of the allocations. It also stores the total size, total_size, of a single memory blob required to satisfy all the allocations, as well as the offsets into that blob corresponding to each allocation. Thus the allocation plan contains:
- allocation sizes
- allocation lifetimes
- allocation offsets
- total size

AllocationPlanner: takes a pointer to an AllocationPlan and fills it up with the plan, i.e. sizes, lifetimes, offsets, and the total size. This is done via WithProfileAllocationsGuard, which takes an AllocationPlan*, constructs an AllocationPlanner*, and sets the thread-local allocation_planner to it. MobileCPUAllocator profiles allocations via allocation_planner. In WithValidateAllocationsGuard, the allocations profiled in the allocation plan are validated.

CPUProfilingAllocator: the application owns the CPUProfilingAllocator. Using WithProfilingAllocatorGuard, it passes in both the CPUProfilingAllocator and the AllocationPlan created earlier. The CPUProfilingAllocator then manages allocations and frees according to the plan. Allocations that are not managed by the CPUProfilingAllocator are routed through c10::alloc_cpu and c10::free_cpu.
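Putting the guards together, a usage sketch: the guard and class names are taken from the description above, but the header path and exact signatures are assumptions.

```cpp
#include <c10/mobile/CPUProfilingAllocator.h>

void run_inference() {}  // hypothetical workload

void profile_and_replay() {
  c10::AllocationPlan plan;
  {
    c10::WithProfileAllocationsGuard profile_guard(&plan);
    run_inference();  // allocations are recorded into 'plan'
  }
  {
    bool valid = false;
    c10::WithValidateAllocationsGuard validate_guard(&plan, &valid);
    run_inference();  // recorded plan is validated against a fresh run
  }
  c10::CPUProfilingAllocator profiling_allocator;
  {
    c10::WithProfilingAllocatorGuard alloc_guard(&profiling_allocator, &plan);
    run_inference();  // allocations and frees now follow the plan
  }
}
```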

Test Plan:
cpu_profiling_allocator_test on mobile.

Imported from OSS

Reviewed By: dreiss

Differential Revision: D23451019

fbshipit-source-id: 98bf1dbcfa8fcfb83d505ac01095e84a3f5b778d
2020-10-06 09:09:54 -07:00
Kimish Patel
6e55a26e10 Move mobile specific CPUCachingAllocator to c10/mobile folder. (#45364)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45364

Also adds some more comments about the usage, limitations, and cons.

Test Plan: Build and run benchmark binary.

Reviewed By: gchanan

Differential Revision: D23944193

fbshipit-source-id: 30d4f4991d2185a0ab768d94c846d73730fc0835
2020-09-29 11:33:26 -07:00
Kimish Patel
2a08566b8f Simple caching allocator for CPU. (#42006)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42006

This PR introduces a simple CPU caching allocator. It is specifically
intended for mobile use cases and for inference. Nothing in the
implementation prevents other use cases, but its simplicity may not be
suitable everywhere. It simply tracks allocations by size and relies on
deterministic, repeatable behavior in which allocations of the same
sizes are made on every inference.
Thus, when a pointer is freed after the first allocation, instead of
returning the memory to the system, the allocator caches it for
subsequent use. Memory is freed automatically at the end of the
process, or it can be explicitly freed.
At the moment this is enabled only in DefaultMobileCPUAllocator.
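A minimal sketch of the size-bucketed caching idea, under the stated assumption of repeatable allocation patterns (not the actual DefaultMobileCPUAllocator code):

```cpp
#include <cstdlib>
#include <map>
#include <vector>

class SimpleCachingAllocator {
  std::map<std::size_t, std::vector<void*>> cache_;  // size -> free blocks
 public:
  void* allocate(std::size_t size) {
    auto& bucket = cache_[size];
    if (!bucket.empty()) {       // reuse a cached block of the same size
      void* p = bucket.back();
      bucket.pop_back();
      return p;
    }
    return std::malloc(size);    // first time: go to the system allocator
  }
  void release(void* ptr, std::size_t size) {
    cache_[size].push_back(ptr); // cache instead of returning to the OS
  }
};
```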

Test Plan:
android test: cpu_caching_allocator_test

Imported from OSS

Reviewed By: dreiss

Differential Revision: D22726976

fbshipit-source-id: 9a38b1ce34059d5653040a1c3d035bfc97609e6c
2020-08-21 19:09:22 -07:00
Ilia Cherniavskii
a94fb71b12 Memory profiling (#37775)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37775

Adding memory usage into profiler table output

Test Plan:
BUILD_BINARY=1 USE_BLAS=MKL USE_MKLDNN=0 USE_CUDA=0 python setup.py
develop install --cmake

```
import torch
import torchvision.models as models
model = models.resnet18()
inp = torch.randn(5, 3, 224, 224)

with torch.autograd.profiler.profile(profile_memory=True, record_shapes=True) as prof:
    model(inp)

print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_memory_usage", row_limit=15))
```

```
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem Total    Number of Calls  Input Shapes
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
resize_                      0.37%            577.936us        0.37%            577.936us        9.796us          339.03 Mb        59               [[0]]
empty                        0.69%            1.061ms          0.74%            1.139ms          5.556us          47.42 Mb         205              []
stride                       0.00%            0.853us          0.00%            0.853us          0.853us          19.53 Kb         1                [[5, 1000]]
empty_strided                0.01%            21.393us         0.02%            26.033us         5.207us          252 b            5                []
is_complex                   0.02%            37.425us         0.02%            37.425us         1.291us          208 b            29               [[]]
masked_select                0.04%            55.333us         0.06%            93.616us         46.808us         120 b            2                [[30], [30]]
conv2d                       0.01%            18.009us         9.62%            14.902ms         14.902ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
convolution                  0.01%            12.436us         9.61%            14.884ms         14.884ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
_convolution                 0.03%            52.381us         9.60%            14.871ms         14.871ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
size                         0.00%            5.429us          0.00%            5.429us          0.339us          0 b              16               [[5, 3, 224, 224]]
contiguous                   0.00%            1.934us          0.00%            1.934us          0.967us          0 b              2                [[5, 3, 224, 224]]
_convolution_nogroup         0.02%            27.505us         9.57%            14.814ms         14.814ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
_nnpack_available            0.02%            34.267us         0.02%            34.267us         1.713us          0 b              20               []
thnn_conv2d                  0.01%            13.274us         9.54%            14.771ms         14.771ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
thnn_conv2d_forward          5.98%            9.264ms          19.02%           29.446ms         14.723ms         0 b              2                [[5, 3, 224, 224], [64, 3, 7, 7], [
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 154.855ms
```

Reviewed By: ngimel

Differential Revision: D21384248

Pulled By: ilia-cher

fbshipit-source-id: 31359cce2aa06f6255ed1ad8c60d03cb640bfec3
2020-05-19 15:48:48 -07:00
Allan Di Wu
f538cd627a Install HugePagesArena to optimize pytorch prediction performance (#37640)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37640

Enable an oversize arena to reduce memory fragmentation. Memory requests with large sizes (configurable with FLAGS_caffe2_oversize_threshold) are fulfilled from a dedicated arena separate from the existing huge-page arena.

Two additional parameters are introduced to configure the 2-phase decay of the memory arena:
- caffe2_dirty_decay_ms
- caffe2_muzzy_decay_ms

In the current jemalloc implementation, oversized allocations are immediately purged regardless of whether they are placed in an arena. Therefore we need to extend the decay time indefinitely; currently we set the default for caffe2_muzzy_decay_ms to -1.

We now enable the arena allocator statically. To ensure it is correctly installed regardless of static initialization order, we add a priority flag in c10::SetAllocator, and only higher priority allocators can overwrite existing ones.
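A sketch of the resulting routing policy; the flag name comes from this message, while the arena helpers and the default value are hypothetical stand-ins for the jemalloc arena machinery:

```cpp
#include <cstddef>
#include <cstdlib>

std::size_t FLAGS_caffe2_oversize_threshold = 1 << 20;  // assumed default

void* oversize_arena_alloc(std::size_t n) { return std::malloc(n); }   // stand-in
void* huge_page_arena_alloc(std::size_t n) { return std::malloc(n); }  // stand-in

void* route_alloc(std::size_t nbytes) {
  if (nbytes >= FLAGS_caffe2_oversize_threshold) {
    return oversize_arena_alloc(nbytes);  // dedicated oversize arena
  }
  return huge_page_arena_alloc(nbytes);   // existing huge-page arena
}
```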
ghstack-source-id: 103276877

Test Plan:
buck test mode/dev //caffe2/caffe2/fb/init:huge_pages_allocator_test

Benchmarking known CV model that benefits from page arena:
```
PyTorchModelBench.cpp:183] test / base : 86.9532%
```

By adjusting ```dirty_decay_ms``` and ```muzzy_decay_ms```, we have the following plots:
https://pxl.cl/15SWW
https://pxl.cl/15TnL

From the figures above we can see that performance does not change much until the dirty decay time is indefinite (set to -1). Setting either the muzzy or the dirty decay time to -1 reaches the best performance, regardless of which one it is. Even setting the decay time to something very long (100s, longer than the run) does not change performance by much.

## Observe performance difference in production with a variety of models (WIP)

Reviewed By: dzhulgakov

Differential Revision: D21258581

fbshipit-source-id: c006f8b94f28aef0666e52f48d4e82cf0d3a48af
2020-05-06 17:27:10 -07:00
Ashkan Aliabadi
006f1a32f8 Mobile CPU allocator. (#36032)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36032

QNNPACK and XNNPACK may access the input and/or output tensors out of bounds.
This is by design, chosen to make the implementation of micro-kernels
both simpler and faster as a result of not having to individually handle the
corner cases where the number of processed elements is not a multiple of SIMD
register width.  This behavior will trigger ASAN though, and may result in a
segfault if the accessed memory location just so happens to fall on a page
the current process has no read access to.  Here we define a custom allocator
that allocates the extra storage required to keep this behavior safe.  This
allocator could have been restricted to QNNPACK and XNNPACK only, but that
would have negative performance ramifications, as input tensors must now be
reallocated, and copied over, if the tensor is not allocated with this
allocator to begin with.  Making this allocator the default on mobile builds
minimizes the probability of unnecessary reallocations and copies, and
also enables acceleration of operations where the output tensor is allocated
outside of the function doing the implementation, wherein the implementation
cannot simply re-allocate the output with the guarding allocator.
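A minimal sketch of the padding idea; the guard width and the helper name are assumptions, not the actual implementation:

```cpp
#include <cstddef>
#include <cstdlib>

constexpr std::size_t kGuardBytes = 16;  // assumed padding width

void* guarded_alloc(std::size_t nbytes) {
  // Over-allocate so out-of-bounds reads by QNNPACK/XNNPACK
  // micro-kernels stay inside memory owned by this process.
  return std::malloc(nbytes + kGuardBytes);
}
```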

Test Plan: Imported from OSS

Differential Revision: D20970217

Pulled By: AshkanAliabadi

fbshipit-source-id: 65cca2d38d7c0cef63c732f393016f50f1fa5199
2020-04-23 11:03:03 -07:00
Dmytro Dzhulgakov
d9dcfacd9e Improve CPUAllocator OOM message (#20618)
Summary:
Spotted while debugging some problem.

Before
```
>>> torch.empty(10**15)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: [enforce fail at CPUAllocator.cpp:56] posix_memalign(&data, gAlignment, nbytes) == 0. 12 vs 0
```

After
```
>>> torch.empty(10**15)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: [enforce fail at CPUAllocator.cpp:65] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 4000000000000000 bytes. Error code 12 (Cannot allocate memory)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20618

Reviewed By: ezyang

Differential Revision: D15390400

Pulled By: dzhulgakov

fbshipit-source-id: 31f448303e4bd5f8c2bad8ca0f05bcece22a4b5e
2019-05-17 16:14:49 -07:00
Mikhail Zolotukhin
4211f674f0 Cleanup includes in c10/core/CPUAllocator.cpp. (#19885)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19885
ghimport-source-id: 1f7d228ac8d1dd9aeb70446a6baf63f62f195663

Differential Revision: D15118516

Pulled By: ZolotukhinM

fbshipit-source-id: bdaf56d97db9e70cbd36ca03349f6eabfbac2668
2019-05-06 16:06:19 -07:00
Edward Yang
b3b692a80a Don't have malloc-free pairs that cross DLL boundaries. (#17302)
Summary:
See https://blogs.msdn.microsoft.com/oldnewthing/20060915-04/?p=29723
for more background on this requirement on Windows.

Fixes #17239.
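A small sketch of the convention this enforces (illustrative names): the module that allocates also exports the matching free, because on Windows each DLL may be linked against a different C runtime heap, so malloc/free pairs must not cross module boundaries.

```cpp
#include <cstddef>
#include <cstdlib>

extern "C" void* lib_alloc(std::size_t n) { return std::malloc(n); }
extern "C" void lib_free(void* p) { std::free(p); }
// A consuming DLL calls lib_free(p), never its own free(p).
```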

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

cc xkszltl peterjc123
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17302

Differential Revision: D14150067

Pulled By: ezyang

fbshipit-source-id: 9dc16ca781ff17515b8df1bb55492477e7843d4c
2019-02-20 20:31:41 -08:00
Dmytro Dzhulgakov
11816affab Safety check for negative alloc_cpu() attempt (#17071)
Summary:
Some legacy TH code was relying on alloc to throw when called with a negative number! E.g. `torch.linspace(0, 1, -1)`. This breaks the ASAN build. I still believe alloc should receive a size_t, but I added a safety enforce inside.

This should fix ASAN. I'll follow up separately with a proper fix for empty_cpu (which is probably the right place to do it).
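A sketch of the safety enforce, with plain C++ standing in for the actual enforce macro (the function name is hypothetical):

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>

void* alloc_cpu_checked(std::ptrdiff_t nbytes) {
  if (nbytes < 0) {  // the enforce: reject negative sizes up front
    throw std::runtime_error("alloc_cpu() called with a negative size");
  }
  return std::malloc(static_cast<std::size_t>(nbytes));
}
```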
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17071

Differential Revision: D14074157

Pulled By: dzhulgakov

fbshipit-source-id: 3ed3bdb873e446edecb558e1df491310fd7179e3
2019-02-13 22:41:13 -08:00
Dmytro Dzhulgakov
51dd2000cd unify c2 and TH allocator (#16892)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16892

Replaces https://github.com/pytorch/pytorch/pull/14517

Merged the caffe2 and TH CPU allocators, mostly using the code from the caffe2 allocators.
The `memset` of the caffe2 allocator is gone now; the two allocators should otherwise be almost the same.

Baseline:
```
Running ./tensor_allocation
Run on (48 X 2501 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 256K (x24)
  L3 Unified 30720K (x2)
-------------------------------------------------------------------------
Benchmark                                  Time           CPU Iterations
-------------------------------------------------------------------------
BM_MakeStorageImpl                       148 ns        148 ns    4676594
BM_StorageImplCtor                        54 ns         54 ns   12957810
BM_MallocStorageImpl                      62 ns         62 ns   11254745
BM_TensorImplCtor                         22 ns         22 ns   31939472
BM_MallocTensorImpl                      105 ns        105 ns    6505661
BM_Malloc_1                               43 ns         43 ns   16464905
BM_MakeTensorFromStorage                 126 ns        126 ns    5586116
BM_MakeVariableFromTensor                236 ns        236 ns    2995528
BM_ATenCPUTensorAllocationSmall1         319 ns        319 ns    2268884
BM_ATenCPUTensorAllocationSmall2         318 ns        318 ns    2163332
BM_ATenCPUTensorAllocationMedium1        403 ns        403 ns    1663228
BM_ATenCPUTensorAllocationMedium2        448 ns        448 ns    1595004
BM_ATenCPUTensorAllocationBig1           532 ns        532 ns    1352634
BM_ATenCPUTensorAllocationBig2          4486 ns       4486 ns     160978
```
Changed:
```
Running ./tensor_allocation
Run on (48 X 2501 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 256K (x24)
  L3 Unified 30720K (x2)
-------------------------------------------------------------------------
Benchmark                                  Time           CPU Iterations
-------------------------------------------------------------------------
BM_MakeStorageImpl                       141 ns        141 ns    4803576
BM_StorageImplCtor                        55 ns         55 ns   13129391
BM_MallocStorageImpl                      64 ns         64 ns   11088143
BM_TensorImplCtor                         23 ns         23 ns   31616273
BM_MallocTensorImpl                      101 ns        101 ns    7017585
BM_Malloc_1                               39 ns         39 ns   18523954
BM_MakeTensorFromStorage                 118 ns        118 ns    5877919
BM_MakeVariableFromTensor                452 ns        452 ns    1565722
BM_ATenCPUTensorAllocationSmall1         384 ns        384 ns    1819763
BM_ATenCPUTensorAllocationSmall2         389 ns        389 ns    1857483
BM_ATenCPUTensorAllocationMedium1        425 ns        425 ns    1646284
BM_ATenCPUTensorAllocationMedium2        430 ns        430 ns    1561319
BM_ATenCPUTensorAllocationBig1           508 ns        508 ns    1309969
BM_ATenCPUTensorAllocationBig2          3799 ns       3799 ns     173674
```

lstm benchmark:
Before:
```
INFO:lstm_bench:Iter: 1 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 21 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 41 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 61 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 81 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 101 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 121 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 141 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 161 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 181 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 201 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 221 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 241 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 261 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 281 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 301 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 321 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 341 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 361 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 381 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Done. Total EPS excluding 1st iteration: 0.8k
```

After:
```
INFO:lstm_bench:Iter: 1 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 21 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 41 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 61 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 81 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 101 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 121 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 141 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 161 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 181 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 201 / 390. Entries Per Second: 0.8k.
INFO:lstm_bench:Iter: 221 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 241 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 261 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 281 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 301 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 321 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 341 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 361 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Iter: 381 / 390. Entries Per Second: 0.7k.
INFO:lstm_bench:Done. Total EPS excluding 1st iteration: 0.8k
```

Reviewed By: ezyang

Differential Revision: D13202632

fbshipit-source-id: db6d2ec756ed15b0732b15396c82ad42302bb79d
2019-02-12 21:16:34 -08:00