Commit Graph

810 Commits

Richard Barnes
8558c0e612 Eliminate narrowing conversion (#46730)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46730

A narrowing conversion on `last_idx` raises a compiler warning. This fixes that.
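For context, a minimal sketch of the kind of implicit narrowing being silenced; apart from the `last_idx` name, the values and types here are illustrative, not taken from the PR:

```
#include <cstdint>

int64_t last_idx = 100000;                        // hypothetical value
int32_t narrowed = last_idx;                      // implicit narrowing: warns under -Wconversion / MSVC C4244
int32_t fixed = static_cast<int32_t>(last_idx);   // explicit cast (or widening the target type) silences it
```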

Test Plan: Standard pre-commit test rig.

Reviewed By: EscapeZero

Differential Revision: D24481497

fbshipit-source-id: f3e913b586738add59c422c3cf65035d87fc9e34
2020-10-22 20:08:59 -07:00
Ivan Kobzarev
3112e23428 [py][vulkan][reland] Add is_vulkan to py api, add vulkan to device type parsing (#46655)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46655

Test Plan: Imported from OSS

Pulled By: IvanKobzarev

Reviewed By: mrshenli

Differential Revision: D24448984

fbshipit-source-id: 5000846a06077f7a5a06dd51da422d2a42f70820
2020-10-22 09:35:50 -07:00
Shen Li
cebe87fe3a Revert D24379422: [py][vulkan] Add is_vulkan to py api, add vulkan to device type parsing
Test Plan: revert-hammer

Differential Revision:
D24379422 (e8fbe54cf5)

Original commit changeset: afab89bb9e17

fbshipit-source-id: 743c77e453239f10c155c67490cba5a42ab42f58
2020-10-21 08:23:05 -07:00
Ivan Kobzarev
e8fbe54cf5 [py][vulkan] Add is_vulkan to py api, add vulkan to device type parsing (#46511)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46511

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D24379422

Pulled By: IvanKobzarev

fbshipit-source-id: afab89bb9e17c50934083598262bbe14ea82e893
2020-10-20 20:04:24 -07:00
Wanchao Liang
e8ff0f6c5c [c10] add operator= of intrusive_ptr to weak_intrusive_ptr (#44045)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44045

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23632281

Pulled By: wanchaol

fbshipit-source-id: ea50427fc261f0c77ddaac2e73032827320d7077
2020-10-17 03:35:44 -07:00
Tao Xu
5da4a08675 [GPU] Add metal to DispatchKeySet (#46455)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46455

After PR https://github.com/pytorch/pytorch/pull/46236 landed, `aten::copy_` can no longer be dispatched to Metal kernels.
ghstack-source-id: 114499399

Test Plan:
- Sandcastle CI
- Circle CI

Reviewed By: IvanKobzarev, ailzhang

Differential Revision: D24356769

fbshipit-source-id: 8660ca5be663fdc8985d9eb710ddaadbb43b0ddd
2020-10-16 16:33:26 -07:00
Nikita Shulga
b5702e2350 Fix out-of-bounds access for caching allocator calls (#46439)
Summary:
In `assertValidDevice()`, compare the device index to `caching_allocator.device_allocator` rather than to `device_no`.

Fixes potential crashes when the caching allocator is accessed before being initialized, for example by calling something like:
`python -c "import torch;print(torch.cuda.memory_stats(0))"`

Fixes https://github.com/pytorch/pytorch/issues/46437

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46439

Reviewed By: ngimel

Differential Revision: D24350717

Pulled By: malfet

fbshipit-source-id: 714e6e74f7c2367a9830b0292478270192f07a7f
2020-10-16 08:24:46 -07:00
Tao Xu
053c252c66 Update COMPILE_TIME_MAX_DEVICE_TYPES to 12 (#46327)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46327

### Summary

Update `COMPILE_TIME_MAX_DEVICE_TYPES` to 12 now that the new Metal backend has landed.

### Test Plan

- Circle CI

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D24309189

Pulled By: xta0

fbshipit-source-id: eec076b7e4fc94bab11840318821aa554447e541
2020-10-15 02:09:17 -07:00
Kimish Patel
4aaad88790 Bug fixes in profiling allocator (#45993)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45993

Fixes some bugs exposed via the updated test and validation code.
Also enables this test to run on CI instead of only as a mobile test.

Test Plan:
cpu_profiling_allocator_test

Imported from OSS

Reviewed By: dzhulgakov

Differential Revision: D24172599

fbshipit-source-id: da0d2e1d1dec87b476bf39a1c2a2ffa0e4b5df66
2020-10-14 22:45:04 -07:00
Basil Hosmer
d22455128f [dispatcher] avoid autograd fixup step on non-backend keys (#46135)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46135

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D24235974

Pulled By: bhosmer

fbshipit-source-id: 21215b31146673caae904bb82395858419641633
2020-10-13 23:33:15 -07:00
Tao Xu
a277c097ac [iOS][GPU] Add Metal/MPSCNN support on iOS (#46112)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46112

### Summary

This PR adds support for running TorchScript models on iOS GPUs via Metal (inference only). The feature is currently in a prototype state; API changes are expected. The tutorial and documentation will be added once it goes to beta.

allow-large-files

- Users API

```
  auto module = torch::jit::load(model);
  module.eval();
  at::Tensor input = at::ones({1,3,224,224}, at::ScalarType::Float).metal();
  auto output = module.forward({input}).toTensor().cpu();
```
- Supported Models
    - Person Segmentation v106 (FB Internal)
    - Mobilenetv2

- Supported Operators
    - aten::conv2d
    - aten::addmm
    - aten::add.Tensor
    - aten::sub.Tensor
    - aten::mul.Tensor
    - aten::relu
    - aten::hardtanh
    - aten::hardtanh_
    - aten::sigmoid
    - aten::max_pool2d
    - aten::adaptive_avg_pool2d
    - aten::reshape
    - aten::t
    - aten::view
    - aten::log_softmax.int
    - aten::upsample_nearest2d.vec

- Supported Devices
    - Apple A9 and above
    - iOS 10.2 and above

- CMake scripts
    - `IOS_ARCH=arm64 ./scripts/build_ios.sh -DUSE_METAL=ON`

### Test Plan

- Circle CI

ghstack-source-id: 114155638

Test Plan:
1. Sandcastle CI
2. Circle CI

Reviewed By: dreiss

Differential Revision: D23236555

fbshipit-source-id: 98ffc48b837e308bc678c37a9a5fd8ae72d11625
2020-10-13 01:46:56 -07:00
Ailing Zhang
0ddcc0ce35 Add alias dispatch key DefaultBackend. (#45718)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45718

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D24165892

Pulled By: ailzhang

fbshipit-source-id: ed28bf62b7c6320d966fd10b7a44b14efffe2f62
2020-10-09 12:02:44 -07:00
Yinghai Lu
c9caa828f5 Throw special exception when backend compilation is met with fatal error (#45952)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45952

Pull Request resolved: https://github.com/pytorch/glow/pull/4967

When Glow compilation meets a nonrecoverable fatal error (e.g. the hardware is busted), we would like to throw a special exception rather than the normal `caffe2::EnforceNotMet`, so that we can signal the upper-layer application to handle it differently.
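A rough sketch of the pattern being described; the exception class name and call sites here are hypothetical, not the ones the PR adds:

```
#include <stdexcept>
#include <string>

struct BackendFatalError : std::runtime_error {
  explicit BackendFatalError(const std::string& msg) : std::runtime_error(msg) {}
};

void compileOnBackend(bool hardware_busted) {
  if (hardware_busted) {
    // Nonrecoverable: let the upper-layer application tell this apart from an
    // ordinary caffe2::EnforceNotMet and react differently.
    throw BackendFatalError("backend hardware is unusable");
  }
  // ... normal compilation; recoverable failures keep using the usual exception ...
}
```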

Test Plan: Manually introduce an error, add LOG(FATAL) in the special exception path, and wait for the application to fatal.

Reviewed By: ipiszy

Differential Revision: D24156792

fbshipit-source-id: 4ae21bb0d36c89eac331fc52dd4682826b3ea180
2020-10-08 00:46:01 -07:00
Xiao Wang
fcc7f272de maximum number of threads per block for sm_86 is 1536 (#45889)
Summary:
according to https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45889

Reviewed By: albanD

Differential Revision: D24131188

Pulled By: ngimel

fbshipit-source-id: 31d3038f7b1bc403751448c62b19609573c67a49
2020-10-06 12:01:33 -07:00
Kimish Patel
a09e1098e7 Profiling allocator for mobile. (#43951)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43951

AllocationPlan: Stores the sequence of allocations, their sizes
                and lifetimes. Along with this it also stores the
                total size of a single memory blob, total_size,
                required to satisfy all the allocations, and the
                offsets into that blob corresponding to each
                allocation.
                Thus the allocation plan contains:
                - allocation sizes
                - allocation lifetimes
                - allocation offsets
                - total size
AllocationPlanner: Takes a pointer to the allocation plan and fills
                   it up with the plan, i.e. sizes, lifetimes, offsets,
                   and total size.
                   This is done via WithProfileAllocationsGuard, which
                   takes in an AllocationPlan*, constructs an
                   AllocationPlanner*, and sets the thread-local
                   allocation_planner to it.
                   MobileCPUAllocator profiles allocations via
                   allocation_planner.
                   In WithValidateAllocationsGuard, allocations profiled
                   in the allocation plan are validated.
CPUProfilingAllocator:
The application owns the CPUProfilingAllocator.
Using WithProfilingAllocatorGuard, it passes in both the CPUProfilingAllocator
and the AllocationPlan created earlier. The CPUProfilingAllocator then
manages allocations and frees according to the plan. Allocations that
are not managed by the CPUProfilingAllocator are routed through
c10::alloc_cpu and c10::free_cpu.
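A minimal usage sketch of the profile → validate → replay flow described above. The guard and plan names come from this description, but the exact headers, constructor signatures, the success flag, and the `run_inference`/`module`/`input` helpers are assumptions:

```
c10::AllocationPlan plan;

// 1. Profile one inference to record allocation sizes, lifetimes, and offsets.
{
  c10::WithProfileAllocationsGuard profile_guard(&plan);
  run_inference(module, input);  // hypothetical helper standing in for a real forward pass
}

// 2. Validate that a re-run matches the recorded plan.
{
  bool success = true;
  c10::WithValidateAllocationsGuard validate_guard(&plan, &success);
  run_inference(module, input);
}

// 3. Replay: serve subsequent inferences out of the pre-planned blob.
c10::CPUProfilingAllocator profiling_allocator;
{
  c10::WithProfilingAllocatorGuard allocator_guard(&profiling_allocator, &plan);
  run_inference(module, input);
}
```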

Test Plan:
cpu_profiling_allocator_test on mobile.

Imported from OSS

Reviewed By: dreiss

Differential Revision: D23451019

fbshipit-source-id: 98bf1dbcfa8fcfb83d505ac01095e84a3f5b778d
2020-10-06 09:09:54 -07:00
peterjc123
b1373a74e0 Don't export enums for CUDA sources on Windows (#45829)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45829

Reviewed By: VitalyFedyunin

Differential Revision: D24113130

Pulled By: ezyang

fbshipit-source-id: 8356c837ed3a790efecf8dfcc8fb6ee6f45bd6e2
2020-10-06 08:04:36 -07:00
Xiaodong Wang
2fbe5971b3 [pytorch/cuda] apply 16-bit mask to the index for device guard registry (#45485)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45485

Essentially this is the problem reported by ezyang: https://fb.workplace.com/groups/llvm.gcc/permalink/4053565044692080. There are two proposed fixes:
* https://github.com/pytorch/pytorch/pull/44883: this doesn't work because it trips a static_assert
```
caffe2/c10/core/TensorOptions.h:553:1: error: static_assert failed due to requirement 'sizeof(c10::TensorOptions) <= sizeof(long) * 2' "TensorOptions must fit in 128-bits"
static_assert( sizeof(TensorOptions) <= sizeof(int64_t) * 2,
^
```
* https://github.com/pytorch/pytorch/pull/44885: to be tested

This diff is a temporary hack to work around the problem. Without this patch:

```
  volatile size_t device_type = static_cast<size_t>(type);
  auto p = device_guard_impl_registry[device_type].load();
  C10_LOG_FIRST_N(WARNING, 10) << "XDW-fail: " << cntr << ", Device type: " << type << ", type cast: " << device_type  << ", guard: " << p;

// output
XDW-fail: 1129, Device type: cuda, type cast: 65537, guard: 0

```

Another workaround is D23788441, which changes -O3 to -O2, so this seems to be a miscompilation by nvcc or the host compiler.
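The workaround named in the title then amounts to masking the index before it touches the registry; an illustrative sketch (the actual diff may phrase it differently):

```
// Illustrative only: clamp the miscompiled index to 16 bits so the bogus
// 65537 (0x10001) collapses back to 1 (cuda) before indexing the registry.
size_t device_type = static_cast<size_t>(type) & 0xFFFF;
auto p = device_guard_impl_registry[device_type].load();
```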

Reviewed By: ezyang

Differential Revision: D23972356

fbshipit-source-id: ab91fbbfccb6389052de216f95cf9a8265445aea
2020-10-05 22:37:47 -07:00
Xiang Gao
2fa062002e CUDA BFloat16 infrastructure (#44925)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44925

Reviewed By: agolynski

Differential Revision: D23783910

Pulled By: ngimel

fbshipit-source-id: dacac2ad87d58056bdc68bfe0b7ab1de5c2af0d8
2020-10-02 16:21:30 -07:00
Supriya Rao
04526a49d3 [quant] creating quint4x2 dtype for quantized tensors (#44678)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44678

This is a prototype PR that introduces 4-bit qtensors. The new dtype added for this is `c10::quint4x2`.
The underlying storage is still uint8_t, so we pack two 4-bit values into a byte while quantizing.

This change uses most of the existing scaffolding for qtensor storage. We allocate storage
based on the dtype before creating a new qtensor.

It also adds a dispatch mechanism for this dtype so we can use it to get the bitwidth, qmin, and qmax info
while quantizing and packing the qtensor (and later when we add 2-bit qtensors).

Kernels that use this dtype should be aware of the packing format.
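For illustration, packing two 4-bit quantized values into one uint8_t byte looks roughly like this; the nibble order and helper names are assumptions, not taken from the PR:

```
#include <cstdint>

// Illustrative packing sketch; the actual kernel layout may differ.
uint8_t pack_quint4x2(uint8_t lo, uint8_t hi) {
  // Each input is an already-quantized value in [0, 15].
  return static_cast<uint8_t>((lo & 0x0F) | ((hi & 0x0F) << 4));
}

uint8_t unpack_lo(uint8_t packed) { return static_cast<uint8_t>(packed & 0x0F); }
uint8_t unpack_hi(uint8_t packed) { return static_cast<uint8_t>(packed >> 4); }
```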

Test Plan:
Locally tested
```
import os

import torch

x = torch.ones((100, 100), dtype=torch.float)
qx_8bit = torch.quantize_per_tensor(x, scale=1.0, zero_point=2, dtype=torch.quint8)
qx = torch.quantize_per_tensor(x, scale=1.0, zero_point=2, dtype=torch.quint4x2)

torch.save(x, "temp.p")
print('Size float (B):', os.path.getsize("temp.p"))
os.remove('temp.p')

torch.save(qx_8bit, "temp.p")
print('Size quantized 8bit(B):', os.path.getsize("temp.p"))
os.remove('temp.p')

torch.save(qx, "temp.p")
print('Size quantized 4bit(B):', os.path.getsize("temp.p"))
os.remove('temp.p')
```

Size float (B): 40760
Size quantized 8bit(B): 10808
Size quantized 4bit(B): 5816

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23993134

fbshipit-source-id: 073bf262f9680416150ba78ed2d932032275946d
2020-10-01 23:53:34 -07:00
Abaho Katabarwa
de3a48013a Use CAFFE2_USE_MSVC_STATIC_RUNTIME to determine when to avoid waiting for global destructors on Windows (#43532)
Summary:
We are trying to build libtorch statically (BUILD_SHARED_LIBS=OFF) and then link it into a DLL. Our setup hits the infinite loop mentioned [here](54c05fa34e/torch/csrc/autograd/engine.cpp (L228)) because we build with `BUILD_SHARED_LIBS=OFF` but still link it all into a DLL at the end of the day.

This PR fixes the issue by changing the condition to guard on which Windows runtime the build links against, using the `CAFFE2_USE_MSVC_STATIC_RUNTIME` flag. `CAFFE2_USE_MSVC_STATIC_RUNTIME` defaults to ON when `BUILD_SHARED_LIBS=OFF`, so backwards compatibility is maintained.

I'm not entirely confident I understand the subtleties of the Windows runtime versus linking setup, but this setup works for us and should not affect the existing builds.

Fixes https://github.com/pytorch/pytorch/issues/44470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43532

Reviewed By: mrshenli

Differential Revision: D24053767

Pulled By: albanD

fbshipit-source-id: 1127fefe5104d302a4fc083106d4e9f48e50add8
2020-10-01 16:41:14 -07:00
Kimish Patel
6e55a26e10 Move mobile specific CPUCachingAllocator to c10/mobile folder. (#45364)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45364

Plus add some more comments about the usage, limitations and cons.

Test Plan: Build and run benchmark binary.

Reviewed By: gchanan

Differential Revision: D23944193

fbshipit-source-id: 30d4f4991d2185a0ab768d94c846d73730fc0835
2020-09-29 11:33:26 -07:00
Mike Ruberry
ab5edf21b0 Revert D23789657: [wip] fast typeMeta/ScalarType conversion approach 2
Test Plan: revert-hammer

Differential Revision:
D23789657 (1ed1a2f5b0)

Original commit changeset: 5afdd52d24bd

fbshipit-source-id: 6d827be8895bcb39c8e85342eee0f7a3f5056c76
2020-09-29 09:40:53 -07:00
Basil Hosmer
1ed1a2f5b0 [wip] fast typeMeta/ScalarType conversion approach 2 (#44965)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44965

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23789657

Pulled By: bhosmer

fbshipit-source-id: 5afdd52d24bd097891ff4a7313033f7bd400165e
2020-09-29 02:39:36 -07:00
Bert Maher
03342af3a3 Add env variable to bypass CUDACachingAllocator for debugging (#45294)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45294

While tracking down a recent memory corruption bug we found that
cuda-memcheck wasn't finding the bad accesses, and ngimel pointed out that
it's because we use a caching allocator so a lot of "out of bounds" accesses
land in a valid slab.

This PR adds a runtime knob (`PYTORCH_NO_CUDA_MEMORY_CACHING`) that, when set,
bypasses the caching allocator's caching logic so that allocations go straight
to cudaMalloc.  This way, cuda-memcheck will actually work.

Test Plan:
Insert some memory errors and run a test under cuda-memcheck;
observe that cuda-memcheck flags an error where expected.

Specifically I removed the output-masking logic here:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/cuda_codegen.cpp#L819-L826

And ran:
```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 cuda-memcheck pytest -k test_superslomo test_jit_fuser_te.py
```

Reviewed By: ngimel

Differential Revision: D23964734

Pulled By: bertmaher

fbshipit-source-id: 04efd11e8aff037b9edde80c70585cb820ee6e39
2020-09-28 11:40:04 -07:00
Sebastian Messmer
78fcde9c50 Trace scattered tensor options arguments (#44071)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44071

Previously, tracing re-gathered ScalarType, Layout, Device, bool into a TensorOptions object and called `tracer::addInput()` on the gathered TensorOptions argument. `tracer::addInput()` then scattered them again and added the individual scattered arguments to the traced graph. This PR avoids the extraneous gathering and re-scattering step and calls `tracer::addInput()` on the individual arguments directly. This avoids the perf hit of an unnecessary gathering step.

This applies to both c10-full and non-c10-full ops. In the case of c10-full ops, the tracing kernel takes scattered arguments and we can directly pass them to `tracer::addInput()`. In the case of non-c10-full ops, the kernel takes a `TensorOptions` argument but we still call `tracer::addInput()` on the scattered arguments.
ghstack-source-id: 112825793

Test Plan:
waitforsandcastle

vs master: https://www.internalfb.com/intern/fblearner/details/216129483/

vs previous diff: https://www.internalfb.com/intern/fblearner/details/216170069/

Reviewed By: ezyang

Differential Revision: D23486638

fbshipit-source-id: e0b53e6673cef8d7f94158e718301eee261e5d22
2020-09-25 09:04:06 -07:00
Sebastian Messmer
2ac7de7d53 Remove hacky_wrapper from BackendSelect kernels (#44062)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44062

Previously, BackendSelect kernels were still written in the legacy way, i.e. they took one TensorOptions argument instead of scattered dtype, layout, device, pin_memory, and they used hacky_wrapper to be callable. This caused a re-wrapping step. Calling into a BackendSelect kernel required taking the individual scattered arguments and packing them into a TensorOptions, and the kernel itself then gathered them again for redispatch.

Now with this PR, BackendSelect kernels are written in the new way and no hacky_wrapper or rewrapping is needed for them.
ghstack-source-id: 112825789

Test Plan:
vs master: https://www.internalfb.com/intern/fblearner/details/216117032/

vs previous diff: https://www.internalfb.com/intern/fblearner/details/216170194/

Reviewed By: ezyang

Differential Revision: D23484192

fbshipit-source-id: e8fb49c4692404b6b775d18548b990c4cdddbada
2020-09-25 09:04:03 -07:00
Hong Xu
b470fa4500 Add complex number support for binary logical operators (#43174)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43174

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23684425

Pulled By: mruberry

fbshipit-source-id: 4857b16e18ec4c65327136badd7f04c74e32d330
2020-09-23 23:03:00 -07:00
Rohan Varma
70d2e4d1f6 [RPC profiling] Allow disableProfiler() to be called from another thread. (#44653)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44653

Per an offline discussion with ilia-cher, this changes the profiler so that the `disableProfiler()` event consolidation logic can be called from different threads (i.e. threads where the profiler was not explicitly enabled). This is needed to support the functionality enabled by D23638387, where we defer profiling event collection until executing an async callback that can run on a different thread, to support RPC async function profiling.

This is done by introducing two flags, `cleanupTLSState` and `consolidate`, which control whether we should clean up thread-local settings (we don't do this when calling `disableProfiler()` on non-main threads) and whether we should consolidate all profiled events. Backwards compatibility is ensured since both options are true by default.

Added a test in `test_misc.cpp` to test this.
ghstack-source-id: 112605620

Reviewed By: mrshenli

Differential Revision: D23638499

fbshipit-source-id: f5bbb0d41ef883c5e5870bc27e086b8b8908f46b
2020-09-22 21:16:58 -07:00
Ailing Zhang
92f8f75c59 Add alias dispatch key Math. (#44354)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44354

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23591481

Pulled By: ailzhang

fbshipit-source-id: 6e93c4ec99a07f3fc920ba2d09dc222e6ced5adf
2020-09-21 11:10:39 -07:00
Ailing Zhang
a47e3697ab Use iterator of DispatchKeySet. (#44682)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44682

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23698387

Pulled By: ailzhang

fbshipit-source-id: 4fa140db9254c2c9c342bf1c8dfd952469b0b779
2020-09-18 13:34:27 -07:00
Kimish Patel
f605d7581e Implement better caching allocator for segmentation usecase. (#44618)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44618

This diff refactors the caching allocator into a virtual class so that its
behavior can be overridden.

Test Plan: https://www.internalfb.com/intern/fblearner/details/218419618?tab=Experiment%20Results

Reviewed By: dreiss

Differential Revision: D23672902

fbshipit-source-id: 976f02922178695fab1c87f453fcb59142c258ec
2020-09-17 08:56:14 -07:00
Louis Feng
eb75cfb9c0 Back out "Revert D23323486: DPP Async Tracing" plus windows build fix. (#44702)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44702

Original commit changeset: c6bd6d277aca

The original diff caused the Windows build to fail due to a compiler bug in VS2019 (lambda capture of a constant int value). This backout works around the issue by explicitly capturing the const int value.

Test Plan: Tested and previously landed.

Reviewed By: mruberry

Differential Revision: D23703215

fbshipit-source-id: f9ef23be97540bc9cf78a855295fb8c69f360459
2020-09-16 11:32:11 -07:00
Mike Ruberry
7036e91abd Revert D23323486: DPP Async Tracing
Test Plan: revert-hammer

Differential Revision:
D23323486 (71673b31f9)

Original commit changeset: 4b6ca6c0e320

fbshipit-source-id: c6bd6d277aca070bef2de3522c2a60e23b4395ad
2020-09-15 01:19:23 -07:00
Louis Feng
71673b31f9 DPP Async Tracing (#44252)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44252

Add tracing to the DPP client. Because DPP requests are async, we need to be able to start a trace event in one thread and potentially end it in a different thread. RecordFunction and LibgpumonObserver previously assumed that each trace event starts and finishes in the same thread, so they used a thread-local context to track enter and exit callbacks. Async events break this assumption. This change attaches the event context to the RecordFunction object so we do not need to use a thread-local context.

Test Plan:
Tested with dpp perf test and able to collect trace.

{F307824044}

Reviewed By: ilia-cher

Differential Revision: D23323486

fbshipit-source-id: 4b6ca6c0e32028fb38a476cd1f44c17a001fc03b
2020-09-14 18:43:14 -07:00
kshitij12345
c68a99bd61 [numpy] Add torch.exp2 (#44184)
Summary:
Reference https://github.com/pytorch/pytorch/issues/42515

TODO
* [x] Add tests
* [x] Add docs
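A quick usage sketch from the C++ side, assuming the generated `torch::exp2` binding mirrors the new Python-visible `torch.exp2`:

```
#include <torch/torch.h>
#include <iostream>

int main() {
  auto x = torch::tensor({0.0, 1.0, 3.0});
  std::cout << torch::exp2(x) << std::endl;  // elementwise 2**x -> 1, 2, 8
}
```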

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44184

Reviewed By: ngimel

Differential Revision: D23674237

Pulled By: mruberry

fbshipit-source-id: 7f4fb1900fad3051cd7fc9d3d7f6d985c5fb093c
2020-09-14 04:05:37 -07:00
Caleb Thomas
dd4bbe1a79 Add iterator like functionality for DispatchKeySet (#44066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44066

Add an STL input iterator to DispatchKeySet:
* The iterator iterates from the first non-undefined DispatchKey
up to NumDispatchKeys.
* The iterator is invalidated once the underlying DispatchKeySet is invalidated.

Note: see http://www.cplusplus.com/reference/iterator/ for a comparison of
the different iterator categories.
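A small sketch of what the iterator enables; the initializer-list construction is an assumption made for illustration:

```
#include <c10/core/DispatchKeySet.h>
#include <iostream>

int main() {
  c10::DispatchKeySet ks({c10::DispatchKey::CPU, c10::DispatchKey::CUDA});

  // Range-based for works because begin()/end() return the new input iterator,
  // which skips Undefined and walks up to NumDispatchKeys.
  for (c10::DispatchKey k : ks) {
    std::cout << c10::toString(k) << std::endl;
  }
}
```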

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23611405

Pulled By: linux-jedi

fbshipit-source-id: 131b287d60226a1d67a6ee0f88571f8c4d29f9c3
2020-09-11 15:08:15 -07:00
Ailing Zhang
39bb455e36 Update fallback kernel for Autograd keys. (#44349)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44349

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23589807

Pulled By: ailzhang

fbshipit-source-id: 0e4b0bf3e07bb4e35cbf1bda22f7b03193eb3dc4
2020-09-11 12:04:52 -07:00
Giuseppe Ottaviano
6324ef4ced [caffe2] Speed up compilation of aten-op.cc (#44440)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44440

`aten-op.cc` takes a long time to compile due to the large generated constructor. For each case, the `std::function` constructor and the initialization functions are inlined, producing a huge amount of intermediate code that takes a long time to optimize, given that many compiler optimization passes are superlinear in the function size.

This diff moves each case to a separate function, so that each one is cheap to optimize, and the constructor is just a large jump table, which is easy to optimize.
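The shape of the change, sketched generically; this is the pattern, not the actual generated aten-op.cc code:

```
#include <functional>
#include <string>
#include <unordered_map>

// Each former case body becomes a small out-of-line function that the
// compiler can optimize in isolation.
static void run_add() { /* ... call the corresponding ATen op ... */ }
static void run_mul() { /* ... */ }

struct AtenOpTable {
  std::unordered_map<std::string, std::function<void()>> dispatch_;
  AtenOpTable() {
    // The constructor is now just a jump table: it stores pointers instead of
    // inlining every case body into one enormous function.
    dispatch_["add"] = run_add;
    dispatch_["mul"] = run_mul;
  }
};
```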

Reviewed By: dzhulgakov

Differential Revision: D23593741

fbshipit-source-id: 1ce7a31cda10d9b0c9d799716ea312a291dc0d36
2020-09-09 21:21:48 -07:00
Ailing Zhang
4aacfab221 Resolve Autograd key for disable_variable_dispatch flag. (#44268)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44268

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23561042

Pulled By: ailzhang

fbshipit-source-id: 6f35cd9a543bea3f9e294584f1db7c3622ebb741
2020-09-08 21:27:52 -07:00
Ailing Zhang
1b2da9ed82 Expose alias key info in dumpState and update test_dispatch. (#44081)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44081

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23492794

Pulled By: ailzhang

fbshipit-source-id: 27a2978591900463bda2e92e0201c9fd719f9792
2020-09-06 18:43:05 -07:00
Xiang Gao
263412e536 Rename is_complex_t -> is_complex (#39906)
Summary:
`is_complex_t` is a bad name. For example, the standard library has `std::is_same` but not `std::is_same_t`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39906

Reviewed By: mrshenli

Differential Revision: D22665013

Pulled By: anjali411

fbshipit-source-id: 4b71745f5e2ea2d8cf5845d95ada4556c87e040d
2020-09-01 21:04:19 -07:00
Ailing Zhang
224232032c Move Autograd to an alias dispatch key (#43070)
Summary:
This PR moves `DispatchKey::Autograd` to an alias dispatch key mapping to `AutogradCPU, AutogradCUDA, AutogradXLA, AutogradOther, AutogradPrivate*` keys.

A few things are handled in this PR:
- Update alias dispatch key mapping and precompute dispatchTable logic
- Move `Autograd` key from `always_included` set to TensorImpl constructor.
- Update the `dummyTensor` constructor to take `requires_grad` as an optional argument so that it's closer to the real usage in op_registration_test.
- Use `BackendSelect` key for both backend select before and after autograd layer. (1 liner in backend_select codegen)

A few planned followups ordered by priority:
- [cleanup] Update `test_dispatch.py` to include testing `Autograd`.
- [cleanup] Add Math alias key and move catchAll to Math. (to remove 2.2 in `computeDispatchTableEntryWithDebug`)
- [new feature] Add support for Math in native_functions.yaml
- [cleanup] Add iterator like functionality to DispatchKeySet
- [cleanup/large] Only add Autograd backend keys when tensor requires grad. (cc: ljk53 ?)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43070

Reviewed By: ezyang

Differential Revision: D23281535

Pulled By: ailzhang

fbshipit-source-id: 9ad00b17142e9b83304f63cf599f785500f28f71
2020-09-01 09:05:29 -07:00
Ashkan Aliabadi
4e39c310eb Move torch/csrc/utils/hash.h to c10/util/hash.h. (#42503)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42503

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23252331

Pulled By: AshkanAliabadi

fbshipit-source-id: 3c4c0e27b9a7eec8560e374c2a3ba5f1c65dae48
2020-08-29 17:47:00 -07:00
Nikita Shulga
d10056652b Enable torch.half for lt and masked_select (#43704)
Summary:
Enable testing of those options in `TestTorchDeviceTypeCPU.test_logical_cpu` and `TestTorchDeviceTypeCPU.test_masked_select_cpu_float16`.
Add `view_as_real` testing for the `torch.complex32` type.
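A rough C++-API sketch of the newly enabled half-precision path, mirroring what the updated Python tests exercise:

```
#include <torch/torch.h>
#include <iostream>

int main() {
  auto a = torch::tensor({1.0, 2.0, 3.0}, torch::kHalf);
  auto b = torch::tensor({2.0, 2.0, 1.0}, torch::kHalf);
  auto mask = torch::lt(a, b);                               // elementwise a < b on half tensors (CPU)
  std::cout << torch::masked_select(a, mask) << std::endl;   // picks elements of a where the mask is true
}
```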

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43704

Reviewed By: albanD

Differential Revision: D23373070

Pulled By: malfet

fbshipit-source-id: 00f17f23b48513379a414227aea91e2d3c0dd5f9
2020-08-29 02:37:26 -07:00
Kimish Patel
dc5d365514 Fix bug in caching allocator. (#43719)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43719

Accidentally this slipped through: the with-guard did not update the current
context.

Test Plan: cpu_caching_allocator_test

Reviewed By: linbinyu

Differential Revision: D23374453

fbshipit-source-id: 1d3ef21cc390d0a8bde98fb1b5c2175b40ab571b
2020-08-28 11:56:23 -07:00
Jiakai Liu
3a0e35c9f2 [pytorch] deprecate static dispatch (#43564)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43564

Static dispatch was originally introduced for mobile selective build.

Since we have added selective build support for dynamic dispatch and
tested it in FB production for months, we can deprecate static dispatch
to reduce the complexity of the codebase.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23324452

Pulled By: ljk53

fbshipit-source-id: d2970257616a8c6337f90249076fca1ae93090c7
2020-08-27 14:52:48 -07:00
Yi Zhang
47e1b7a8f1 Set CONSTEXPR_EXCEPT_WIN_CUDA as const while it is not constexpr (#43380)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42467

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43380

Reviewed By: albanD

Differential Revision: D23278930

Pulled By: pbelevich

fbshipit-source-id: 6ce0bc9fd73cd0ead46c414fdea5f6fb7e9fec3e
2020-08-22 03:25:37 -07:00
Basil Hosmer
915fd1c8fc centralize autograd dispatch key set (#43387)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43387

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D23258687

Pulled By: bhosmer

fbshipit-source-id: 3718f74fc7324db027f87eda0b90893a960aa56e
2020-08-22 00:46:02 -07:00
Kimish Patel
2a08566b8f Simple caching allocator for CPU. (#42006)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42006

This PR introduces a simple CPU caching allocator. It is specifically
intended for mobile use cases and for inference. Nothing in the
implementation prevents it from being used elsewhere, but its simplicity
may not be suitable everywhere.
It simply tracks allocations by size and relies on deterministic,
repeatable behavior where allocations of the same sizes are made on every
inference.
Thus, after the first allocation, when the pointer is returned to the
allocator, instead of returning it to the system the allocator caches it
for subsequent use.
Memory is freed automatically at the end of the process, or it can be
explicitly freed.
At the moment this is enabled only in DefaultMobileCPUAllocator.
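The core idea, sketched generically (not the actual DefaultMobileCPUAllocator code): keep freed blocks keyed by size and hand them back when the same size is requested again.

```
#include <cstdlib>
#include <unordered_map>
#include <vector>

// Illustrative sketch of caching-by-size; the real allocator also handles
// thread-safety, guards, and explicit purging.
struct SimpleCachingAllocator {
  std::unordered_map<size_t, std::vector<void*>> free_by_size_;
  std::unordered_map<void*, size_t> size_of_;

  void* allocate(size_t n) {
    auto& bucket = free_by_size_[n];
    if (!bucket.empty()) {            // reuse a cached block of the same size
      void* p = bucket.back();
      bucket.pop_back();
      return p;
    }
    void* p = std::malloc(n);         // first time we see this size
    size_of_[p] = n;
    return p;
  }

  void release(void* p) {             // cache instead of freeing
    free_by_size_[size_of_[p]].push_back(p);
  }

  ~SimpleCachingAllocator() {         // everything is really freed at the end
    for (auto& kv : size_of_) std::free(kv.first);
  }
};
```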

Test Plan:
android test: cpu_caching_allocator_test

Imported from OSS

Reviewed By: dreiss

Differential Revision: D22726976

fbshipit-source-id: 9a38b1ce34059d5653040a1c3d035bfc97609e6c
2020-08-21 19:09:22 -07:00
Mike Ruberry
3aec1185e0 Enables bfloat16 x [float16, complex64, complex128] type promotion (#43324)
Summary:
Implements bfloat16 type promotion consistent with JAX (see https://jax.readthedocs.io/en/latest/type_promotion.html), addressing issue https://github.com/pytorch/pytorch/issues/43049.

- bfloat16 x float16 -> float32
- bfloat16 x complex64 -> complex64
- bfloat16 x complex128 -> complex128

Existing tests, after updates, are sufficient to validate the new behavior.
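The new rules can also be spot-checked through the promotion helper in the C++ API; a small sketch, assuming `c10::promoteTypes` as the entry point:

```
#include <c10/core/ScalarType.h>
#include <iostream>

int main() {
  using c10::ScalarType;
  // bfloat16 x float16 -> float32, per the JAX-style rules above.
  std::cout << c10::toString(c10::promoteTypes(ScalarType::BFloat16, ScalarType::Half)) << "\n";
  // bfloat16 x complex64 -> complex64
  std::cout << c10::toString(c10::promoteTypes(ScalarType::BFloat16, ScalarType::ComplexFloat)) << "\n";
}
```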

cc xuhdev

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43324

Reviewed By: albanD

Differential Revision: D23259823

Pulled By: mruberry

fbshipit-source-id: ca9c2c7d0325faced1f884f3c37edf8fa8c8b089
2020-08-21 10:48:04 -07:00