pytorch/docs/source/notes
Bert Maher 03342af3a3 Add env variable to bypass CUDACachingAllocator for debugging (#45294)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45294

While tracking down a recent memory corruption bug, we found that
cuda-memcheck wasn't flagging the bad accesses. ngimel pointed out that
this is because we use a caching allocator, so many "out of bounds" accesses
still land in a valid slab.

This PR adds an environment variable (`PYTORCH_NO_CUDA_MEMORY_CACHING`) that,
when set, bypasses the caching allocator's caching logic so that every
allocation goes straight to cudaMalloc. cuda-memcheck then sees the real
bounds of each allocation and can flag out-of-bounds accesses.
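
To illustrate why caching hides these errors, here is a minimal toy model in Python. It is not the real CUDACachingAllocator: `ToyDevice`, `CachingAllocator`, `DirectAllocator`, and `overrun_detected` are hypothetical names, and the "checker" is assumed to know only about regions handed out by the driver (the stand-in for cudaMalloc).

```python
from dataclasses import dataclass, field


@dataclass
class ToyDevice:
    """Stand-in for the driver: records every region it hands out."""
    regions: list = field(default_factory=list)  # (start, size) pairs
    next_addr: int = 0

    def driver_malloc(self, size):
        start = self.next_addr
        self.regions.append((start, size))
        # Leave an unmapped gap so neighbouring driver regions never touch.
        self.next_addr += size + 256
        return start

    def checker_sees_valid(self, addr):
        # A memcheck-style tool only knows driver-level regions.
        return any(s <= addr < s + n for s, n in self.regions)


class CachingAllocator:
    """Carves small allocations out of one large slab (toy version)."""
    SLAB_SIZE = 1024

    def __init__(self, device):
        self.slab_start = device.driver_malloc(self.SLAB_SIZE)
        self.offset = 0

    def malloc(self, size):
        if self.offset + size > self.SLAB_SIZE:
            raise MemoryError("toy slab exhausted")
        addr = self.slab_start + self.offset
        self.offset += size
        return addr


class DirectAllocator:
    """Forwards every request to the driver, as the bypass does."""

    def __init__(self, device):
        self.device = device

    def malloc(self, size):
        return self.device.driver_malloc(size)


def overrun_detected(allocator_cls):
    device = ToyDevice()
    alloc = allocator_cls(device)
    buf = alloc.malloc(64)
    alloc.malloc(64)       # a neighbouring allocation
    bad_addr = buf + 64    # one byte past the end of buf
    return not device.checker_sees_valid(bad_addr)


print("caching:", overrun_detected(CachingAllocator))
print("direct: ", overrun_detected(DirectAllocator))
```

In this model the overrun is missed with the caching allocator, because the bad address still lies inside the slab, and caught when each allocation is its own driver region.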

Test Plan:
Insert some memory errors and run a test under cuda-memcheck;
observe that cuda-memcheck flags an error where expected.

Specifically, I removed the output-masking logic here:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/cuda_codegen.cpp#L819-L826

And ran:
```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 cuda-memcheck pytest -k test_superslomo test_jit_fuser_te.py
```
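
The same invocation could be driven from a Python script, for example in a debugging harness. This is only a sketch: `memcheck_invocation` is a hypothetical helper, and it assumes the variable must be present in the child process's environment before the process starts.

```python
import os


def memcheck_invocation(test_file, keyword, base_env=None):
    """Build argv and environment for a cuda-memcheck run with caching off."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: the variable must be set before the test process launches.
    env["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"
    argv = ["cuda-memcheck", "pytest", "-k", keyword, test_file]
    return argv, env


argv, env = memcheck_invocation("test_jit_fuser_te.py", "test_superslomo")
print(" ".join(argv))
# To actually launch it (requires cuda-memcheck and pytest on PATH):
#   import subprocess; subprocess.run(argv, env=env, check=False)
```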

Reviewed By: ngimel

Differential Revision: D23964734

Pulled By: bertmaher

fbshipit-source-id: 04efd11e8aff037b9edde80c70585cb820ee6e39
2020-09-28 11:40:04 -07:00
amp_examples.rst Reference amp tutorial (recipe) from core amp docs (#44725) 2020-09-16 11:37:58 -07:00
autograd.rst Doc note for complex (#41252) 2020-07-16 08:53:27 -07:00
broadcasting.rst [docs] Update broadcasting and cuda semantics notes (#6904) 2018-04-24 13:41:24 -04:00
cpu_threading_runtimes.svg Update CPU threading doc (#33083) 2020-02-11 14:13:51 -08:00
cpu_threading_torchscript_inference.rst Upgrade MKL-DNN to DNNL v1.2 (#32422) 2020-03-26 22:07:59 -07:00
cpu_threading_torchscript_inference.svg Threading and CPU Inference note 2019-07-29 15:45:49 -07:00
cuda.rst Add env variable to bypass CUDACachingAllocator for debugging (#45294) 2020-09-28 11:40:04 -07:00
ddp.rst Fix wrong link in docs/source/notes/ddp.rst (#40484) 2020-06-28 13:55:56 -07:00
extending.rst Don't materialize output grads (#41821) 2020-08-11 04:27:07 -07:00
faq.rst Revert "Revert D21337640: [pytorch][PR] Split up documentation into subpages and clean up some warnings" (#37778) 2020-05-04 14:32:35 -07:00
large_scale_deployments.rst Move ThreadLocalDebugInfo to c10 (#37774) 2020-05-11 19:27:41 -07:00
multiprocessing.rst Update docs for master to remove Python 2 references (#36336) 2020-04-16 10:15:48 -07:00
randomness.rst Update determinism documentation (#41692) 2020-08-31 21:06:24 -07:00
serialization.rst Makes the use of the term "module" consistent through the serialization note (#41563) 2020-07-16 14:59:49 -07:00
windows.rst Correct the windows docs (#43479) 2020-08-25 13:41:24 -07:00