Summary:
This was missing, so the incorrect `name` passed into `_to_worker_info` was not printed in the error message.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31969
Differential Revision: D19331927
Pulled By: rohan-varma
fbshipit-source-id: e74d47daec3224c2d9b9da3c0a6404cfa67baf65
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31858
Trying to upgrade the docker image but ran into the following error:
```
Running test_nn ... [2020-01-04 18:05:12.537860]
Traceback (most recent call last):
File "test_nn.py", line 45, in <module>
from common_cuda import TEST_CUDA, TEST_MULTIGPU, TEST_CUDNN, TEST_CUDNN_VERSION
File "/var/lib/jenkins/workspace/test/common_cuda.py", line 16, in <module>
import numba.cuda
File "/opt/conda/lib/python3.6/site-packages/numba/__init__.py", line 178, in <module>
_ensure_llvm()
File "/opt/conda/lib/python3.6/site-packages/numba/__init__.py", line 100, in _ensure_llvm
raise ImportError(msg)
ImportError: Numba requires at least version 0.30.0 of llvmlite.
Installed version is 0.28.0.
```
Test Plan: Imported from OSS
Differential Revision: D19282923
Pulled By: ljk53
fbshipit-source-id: bdeefbf4f6c0c97df622282f76e77eb1eadba436
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31031
This activation will be needed for the LSTM implementation.
Also includes the QNNPack implementation.
Test Plan: Imported from OSS
Differential Revision: D19334280
Pulled By: z-a-f
fbshipit-source-id: ae14399765a47afdf9b1e072d3967c24ff473e8d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31857
According to mingbowan, we will change to a string docker image version
because the tag is no longer an integer now that the docker image build
job has moved to CircleCI:
http://ossci-docker.s3-website.us-east-1.amazonaws.com/pytorch.html
Test Plan: - with stacked PR
Differential Revision: D19282726
Pulled By: ljk53
fbshipit-source-id: 7a12ae89a11cf15163b905734d50fed6dc98cb07
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31995
Fixes #31906.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D19331259
Pulled By: ezyang
fbshipit-source-id: 5d24bf3555e632211a9b6f8e50ff241603c18b3d
Summary:
Fix for https://github.com/pytorch/pytorch/issues/19420
So after actually writing a C++ JSON dumping class, I figured that
a faster and cleaner way would be to simply rewrite the Python without
the JSON module, since the JSON that we need to output is so simple.
For now I decided to not touch the `parse_cpu_trace` function since
only changing `export_chrome_trace` shows a 4x speedup.
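For context, the approach is roughly the following: a minimal sketch of hand-formatting the trace JSON, where the helper name and event fields are made up for illustration and this is not the actual `export_chrome_trace` code.
```python
# Sketch: emit chrome-trace events via string formatting instead of the json
# module. `events` is assumed to be a list of dicts with name/start/end/tid.
def dump_chrome_trace(events, path):
    chunks = []
    for evt in events:
        chunks.append(
            '{{"name": "{}", "ph": "X", "ts": {}, "dur": {}, '
            '"tid": {}, "pid": "CPU functions", "args": {{}}}}'.format(
                evt["name"], evt["start"], evt["end"] - evt["start"], evt["tid"]
            )
        )
    # Write the whole array in one go; no json module involved.
    with open(path, "w") as f:
        f.write("[" + ",\n".join(chunks) + "]")
```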
Here's the script I used for benchmarking:
``` python
import time
import torch
x = torch.ones(2, 2)
start = time.time()
with torch.autograd.profiler.profile() as prof:
    for _ in range(10000):
        x * x
for i in range(50):
    prof.export_chrome_trace("trace.json")
stop = time.time()
print(stop-start)
```
master branch (using json dump) -> 8.07515025138855
new branch (without json dump) -> 2.0943689346313477
I checked the trace file generated in the [test](https://github.com/pytorch/pytorch/blob/master/test/test_autograd.py#L2659)
and it does work fine.
Please let me know what you think.
If you still insist on the C++ version I can send a new patch soon enough.
CC ezyang rgommers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30724
Differential Revision: D19298955
Pulled By: ezyang
fbshipit-source-id: b0d7324ea5f90884ab8a00dd272f3aa3d9bc0427
Summary:
Fix https://github.com/pytorch/pytorch/issues/24704.
Benchmark script :
```
import torch
import torch.nn as nn
import time
torch.manual_seed(0)
def _time():
    return time.time()

device = "cpu"

# warm up
for n in [10, 100, 1000]:
    input = torch.randn(128, n, requires_grad=False, device=device)
    for i in range(1000):
        input.geometric_(0.5)

for n in [1, 10, 100, 1000]:
    fwd_t = 0
    input = torch.randn(128, n, requires_grad=False, device=device)
    for i in range(10000):
        t1 = _time()
        input.geometric_(0.5)
        t2 = _time()
        fwd_t = fwd_t + (t2 - t1)
    fwd_avg = fwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.4f (ms)." % (n, fwd_avg))
```
Test device: **skx-8180**.
Before:
```
input size(128, 1) forward time is 0.0092 (ms).
input size(128, 10) forward time is 0.0802 (ms).
input size(128, 100) forward time is 0.7994 (ms).
input size(128, 1000) forward time is 7.8403 (ms).
```
After:
```
input size(128, 1) forward time is 0.0088 (ms).
input size(128, 10) forward time is 0.0781 (ms).
input size(128, 100) forward time is 0.7815 (ms).
input size(128, 1000) forward time is 7.7163 (ms).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31878
Differential Revision: D19314510
Pulled By: ezyang
fbshipit-source-id: 2d95bf9938c8becf280890acf9e37223ddd08a39
Summary:
VitalyFedyunin, this PR ports the LogSigmoid activation to ATen.
Test script:
```
import torch
import torch.nn as nn
import time
torch.manual_seed(0)
def _time():
    return time.time()

device = "cpu"
m = nn.LogSigmoid()

# warm up
for n in [1, 10, 100, 1000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.randn(128, n, device=device)
    for i in range(1000):
        output = m(input)
        output.backward(grad_output)

for n in [1, 10, 100, 1000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.randn(128, n, device=device)
    fwd_t = 0
    bwd_t = 0
    for i in range(10000):
        t1 = _time()
        output = m(input)
        t2 = _time()
        output.backward(grad_output)
        t3 = _time()
        fwd_t = fwd_t + (t2 - t1)
        bwd_t = bwd_t + (t3 - t2)
    fwd_avg = fwd_t / 10000 * 1000
    bwd_avg = bwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.2f (ms); backward avg time is %.2f (ms)."
          % (n, fwd_avg, bwd_avg))
```
**Before:**
```
input size(128, 1) forward time is 0.02 (ms); backward avg time is 0.02 (ms).
input size(128, 10) forward time is 0.10 (ms); backward avg time is 0.03 (ms).
input size(128, 100) forward time is 0.90 (ms); backward avg time is 0.09 (ms).
input size(128, 1000) forward time is 9.04 (ms); backward avg time is 0.87 (ms).
```
**After:**
```
input size(128, 1) forward time is 0.02 (ms); backward avg time is 0.02 (ms).
input size(128, 10) forward time is 0.02 (ms); backward avg time is 0.02 (ms).
input size(128, 100) forward time is 0.04 (ms); backward avg time is 0.03 (ms).
input size(128, 1000) forward time is 0.28 (ms); backward avg time is 0.07 (ms).
```
**OMP_NUM_THREADS=1:**
```
Before:
input size(128, 1) forward time is 0.02 (ms); backward avg time is 0.02 (ms).
input size(128, 10) forward time is 0.10 (ms); backward avg time is 0.03 (ms).
input size(128, 100) forward time is 0.88 (ms); backward avg time is 0.10 (ms).
input size(128, 1000) forward time is 8.72 (ms); backward avg time is 0.81 (ms).
After:
input size(128, 1) forward time is 0.01 (ms); backward avg time is 0.02 (ms).
input size(128, 10) forward time is 0.02 (ms); backward avg time is 0.02 (ms).
input size(128, 100) forward time is 0.07 (ms); backward avg time is 0.03 (ms).
input size(128, 1000) forward time is 0.63 (ms); backward avg time is 0.15 (ms).
```
Fix https://github.com/pytorch/pytorch/issues/24724, https://github.com/pytorch/pytorch/issues/24725.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30958
Differential Revision: D19275111
Pulled By: ezyang
fbshipit-source-id: bbfe82e58fb27a4fb21c1914c6547a9050072e5c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31962
I added precision tests for CUDA half, float, and double.
The precision for CUDA half seems bad, but I checked the numbers against
previous versions of pytorch. The output of CUDA Half linspace+logspace
are exactly the same when compared with 1.2.0.
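For reference, the comparison can be sanity-checked from Python along these lines; this is only a sketch, not the actual test, and the sizes are arbitrary:
```python
import torch

if torch.cuda.is_available():
    # Generate the same sequence in half and double precision and look at the
    # worst-case absolute error; CUDA half is noticeably less accurate.
    ref = torch.linspace(0, 10, steps=50, dtype=torch.double, device="cuda")
    half = torch.linspace(0, 10, steps=50, dtype=torch.half, device="cuda")
    print((half.double() - ref).abs().max())
```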
Test Plan: - Run CI
Differential Revision: D19320182
Pulled By: zou3519
fbshipit-source-id: 38d3d4dea2807875ed0b0ec2b93b19c10a289988
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31162
This should help us resolve a multitude of weird segfaults and crashes
when PyTorch is imported along with other packages. Those would often
happen because libtorch symbols were exposed globally and could be used
as a source of relocations in shared libraries loaded after libtorch.
Fixes #3059.
Some of the subtleties in preparing this patch:
* Getting ASAN to play ball was a pain in the ass. The basic problem is that when we load with `RTLD_LOCAL`, we may now load a library multiple times into the address space; this happens when we have custom C++ extensions. Since the libraries are usually identical, this is usually benign, but it is technically undefined behavior and UBSAN hates it. I applied a few fixes to get things to "work" correctly: I preload libstdc++ (so that it is seen consistently over all library loads) and turned off vptr checks entirely. Another possibility is that we should have a mode where we use RTLD_GLOBAL to load _C, which would be acceptable in environments where you're sure C++ lines up correctly. There's a long comment in the test script going into more detail about this.
* Making some of our shared library dependencies load with `RTLD_LOCAL` breaks them. OpenMPI and MKL don't work; they play linker shenanigans to look up their symbols, which doesn't work when loaded locally, and if we load a library with `RTLD_LOCAL` we aren't able to subsequently see it with `ctypes`. To solve this problem, we employ a clever device invented by apaszke: we create a dummy library `torch_global_deps` with dependencies on all of the libraries which need to be loaded globally, and then load that with `RTLD_GLOBAL`. As long as none of these libraries have C++ symbols, we can avoid confusion about the C++ standard library (see the sketch below).
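A rough sketch of the resulting load order, written with plain `ctypes` for illustration; the library path is a placeholder, and in an installed wheel the real step happens during `import torch` itself:
```python
import ctypes

# 1. Load the dummy dependency library with RTLD_GLOBAL so that the C-only
#    dependencies it pulls in (MKL, OpenMPI, ...) stay visible process-wide.
try:
    ctypes.CDLL("libtorch_global_deps.so", mode=ctypes.RTLD_GLOBAL)
except OSError:
    pass  # placeholder path not found here; torch handles the real one on import

# 2. The main extension (_C) is then loaded the normal way, which on Linux
#    now uses RTLD_LOCAL, keeping libtorch's C++ symbols out of the global
#    symbol table.
import torch  # noqa: E402
```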
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D19262579
Test Plan: Imported from OSS
Pulled By: ezyang
fbshipit-source-id: 06a48a5d2c9036aacd535f7e8a4de0e8fe1639f2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31161
Previously, it wasn't necessary to specify `DT_NEEDED` in C++ extensions on Linux (aka pass `-l` flags) because all of the symbols would have already been loaded with `RTLD_GLOBAL`, so there wouldn't be any undefined symbols. But when we switch to loading `_C` with `RTLD_LOCAL`, it's now necessary for all the C++ extensions to know what libraries to link with. The resulting code is clearer and more uniform, so it's a win all around.
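For illustration, an extension's `setup.py` now has to spell out its link dependencies explicitly, roughly like this; the extension name, source file, and the extra `libraries` entry are made up, and `CppExtension` already fills in the core torch libraries:
```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name="my_ext",
    ext_modules=[
        CppExtension(
            name="my_ext",
            sources=["my_ext.cpp"],
            # With _C loaded RTLD_LOCAL there are no globally visible torch
            # symbols to fall back on, so every needed library has to become a
            # DT_NEEDED entry of the extension.
            libraries=["c10"],
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```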
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D19262578
Pulled By: ezyang
fbshipit-source-id: a893cc96f2e9aad1c064a6de4f7ccf79257dec3f
Summary:
Special case for `norm` out where p == 2: instead of calling `pow`,
we use multiplication as a faster code path.
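The idea, sketched at the Python level for clarity (the actual change is inside the norm kernel, not in Python):
```python
import torch

x = torch.randn(1024)

# General path: |x|^p summed, then the 1/p root.
general = x.abs().pow(2).sum().pow(0.5)

# p == 2 special case: multiplication avoids the pow call entirely.
special = (x * x).sum().sqrt()

print(torch.allclose(general, special))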
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31903
Differential Revision: D19312749
Pulled By: ngimel
fbshipit-source-id: 73732b7b37a243a14438609784795b920271a0b5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31800
If we know that two constants are the same object, we can ignore other constraints and pool them together. This fixes an issue introduced by the other PR where quantization relied on constant pooling happening for correctness.
Test Plan: Imported from OSS
Differential Revision: D19269499
Pulled By: eellison
fbshipit-source-id: 9d4396125aa6899cb081863d463d4f024135cbf4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31501
We have a number of places in our code base where we should be checking if it's safe to change the alias relationship between two sets of values. This PR adds an api to Alias Db to consolidate the logic, and refactors Constant Pooling and `CSE` to use the new api. Next steps: add api usage in peephole.cpp where applicable.
Happy to bikeshed `AliasDb::safeToChangeAliasingRelationship`. Previously I suggested `AliasDb::safeToIntroduceAliasing`; however, that's not quite accurate, because this API also handles cases where it is unsafe to remove aliasing.
Alternate suggestions: `safeToChangeAliasing`, `validToChangeAliasing`, `validToChangeAliasingRelationship`
Related: https://github.com/pytorch/pytorch/issues/28360
Test Plan: Imported from OSS
Differential Revision: D19254413
Pulled By: eellison
fbshipit-source-id: 17f7f52ad2d1526d303132767cbbb32f8189ae15
Summary:
This is a first-pass attempt at documenting `IValue` to help with problems like the one in #17165. Most users are probably concerned with
* how to make an `IValue` that matches the input type to their graph (most of the constructors are pretty self-explanatory, so as long as they are in the docs I think it's enough)
* how to extract the results after running their graph (there is a small note on the behavior of `.toX()` based on confusions we've had in the past)
Preview:
https://driazati.github.io/pytorch_doc_previews/31904/api/structc10_1_1_i_value.html#exhale-struct-structc10-1-1-i-value
There are also some random CSS fixes to clean up the style.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31904
Pulled By: driazati
Differential Revision: D19318733
fbshipit-source-id: b29dae3349d5a7ea5a3b8e09cd23f7ff8434edb4
Summary:
This hooks up `inspect` so that Python functions get their parameter
names attached instead of naming them `0, 1, 2, ...`. This also fixes
issue #28537 where `ignore` functions were improperly typing `self`.
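The parameter names come from Python's own introspection; a minimal sketch of the kind of information `inspect` provides (the example function is hypothetical, not the TorchScript frontend code):
```python
import inspect

def scale(self, factor, bias=0.0):
    return self * factor + bias

sig = inspect.signature(scale)
print(list(sig.parameters))            # ['self', 'factor', 'bias']
print(sig.parameters["bias"].default)  # 0.0
```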
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29300
Pulled By: driazati
Differential Revision: D19256434
fbshipit-source-id: 6a1fe7bd0afab708b8439517798955d0abfeb44c
Summary:
Stacked PRs
* **#31908 - Remove C++ docs contributing page**
* #31905 - Add doc previewing instructions
We should have one source of truth for contribution instructions (CONTRIBUTING.md).
This PR moves the instructions from the C++ doc pages there instead of keeping
a separate page.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31908
Pulled By: driazati
Differential Revision: D19296366
fbshipit-source-id: c1daf004259342bd09e09dea3b80e34db47066ec
Summary:
Stacked PRs
* #31908 - Remove C++ docs contributing page
* **#31905 - Add doc previewing instructions**
This adds some instructions on how to get started with GitHub Pages so you can show reviewers your documentation changes. Hopefully we can delete this eventually and build docs automatically on relevant PRs in CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31905
Pulled By: driazati
Differential Revision: D19296364
fbshipit-source-id: df47fa1a8d7be029c3efcf6521298583ad9f7a95
Summary:
Fix https://github.com/pytorch/pytorch/issues/24684.
Benchmark script :
```
import torch
import torch.nn as nn
import time
torch.manual_seed(0)
def _time():
    return time.time()

device = "cpu"

# warm up
for n in [10, 100, 1000]:
    input = torch.randn(128, n, requires_grad=False, device=device)
    for i in range(1000):
        input.cauchy_()

for n in [1, 10, 100, 1000]:
    fwd_t = 0
    input = torch.randn(128, n, requires_grad=False, device=device)
    for i in range(10000):
        t1 = _time()
        input.cauchy_()
        t2 = _time()
        fwd_t = fwd_t + (t2 - t1)
    fwd_avg = fwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.4f (ms)." % (n, fwd_avg))
```
Test device: **skx-8180**.
Before:
```
input size(128, 1) forward time is 0.0071 (ms).
input size(128, 10) forward time is 0.0596 (ms).
input size(128, 100) forward time is 0.5798 (ms).
input size(128, 1000) forward time is 5.8395 (ms).
```
After:
```
input size(128, 1) forward time is 0.0070 (ms).
input size(128, 10) forward time is 0.0583 (ms).
input size(128, 100) forward time is 0.5714 (ms).
input size(128, 1000) forward time is 5.7674 (ms).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31824
Differential Revision: D19314411
Pulled By: ezyang
fbshipit-source-id: 58098546face3e5971b023f702cfe44ff1cccfbc
Summary:
VitalyFedyunin, this PR ports the Softplus activation to ATen.
**Test script:**
```
import torch
import torch.nn as nn
import time
torch.manual_seed(0)
def _time():
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

device = "cpu"
m = nn.Softplus()
if torch.cuda.is_available():
    device = "cuda"
    m = m.cuda()

# warm up
for n in [100, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.ones(128, n, device=device)
    for i in range(1000):
        output = m(input)
        output.backward(grad_output)

for n in [100, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.ones(128, n, device=device)
    fwd_t = 0
    bwd_t = 0
    for i in range(10000):
        t1 = _time()
        output = m(input)
        t2 = _time()
        output.backward(grad_output)
        t3 = _time()
        fwd_t = fwd_t + (t2 - t1)
        bwd_t = bwd_t + (t3 - t2)
    fwd_avg = fwd_t / 10000 * 1000
    bwd_avg = bwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.2f (ms); backward avg time is %.2f (ms)."
          % (n, fwd_avg, bwd_avg))
```
Test Device: CPU: skx-8180, GPU: Tesla P40.
Performance:
Before:
```
GPU:
input size(128, 100) forward time is 0.06 (ms); backward avg time is 0.12 (ms).
input size(128, 10000) forward time is 0.06 (ms); backward avg time is 0.18 (ms).
CPU:
input size(128, 100) forward time is 1.16 (ms); backward avg time is 0.69 (ms).
input size(128, 10000) forward time is 60.19 (ms); backward avg time is 31.86 (ms).
```
After:
```
GPU:
input size(128, 100) forward time is 0.05 (ms); backward avg time is 0.11 (ms).
input size(128, 10000) forward time is 0.06 (ms); backward avg time is 0.17 (ms).
CPU:
input size(128, 100) forward time is 0.43 (ms); backward avg time is 0.16 (ms).
input size(128, 10000) forward time is 1.65 (ms); backward avg time is 0.83 (ms).
```
`OMP_NUM_THREADS=1:`
```
Before:
input size(128, 100) forward time is 0.53 (ms); backward avg time is 0.28 (ms).
input size(128, 10000) forward time is 51.33 (ms); backward avg time is 25.48 (ms).
After:
input size(128, 100) forward time is 0.44 (ms); backward avg time is 0.16 (ms).
input size(128, 10000) forward time is 42.05 (ms); backward avg time is 13.97 (ms).
```
Fix https://github.com/pytorch/pytorch/issues/24633, https://github.com/pytorch/pytorch/issues/24634, https://github.com/pytorch/pytorch/issues/24766, https://github.com/pytorch/pytorch/issues/24767.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30504
Differential Revision: D19274913
Pulled By: ezyang
fbshipit-source-id: 21b29e8459dcba5a040cc68333887b45a858328e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31897
The previous version only used AVX2. The `_simd` version uses AVX-512 if the CPU is capable of it.
Test Plan: Unittest
Reviewed By: tracelogfb
Differential Revision: D19291499
fbshipit-source-id: 3b1ee0ba756e5c9defbd5caf7f68982d9b2ca06c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31031
This activation will be needed for the LSTM implementation.
Also includes the QNNPack implementation.
Test Plan: Imported from OSS
Differential Revision: D18903453
Pulled By: z-a-f
fbshipit-source-id: 0050b1cebb1ddb179b7ecbcb114fe70705070f67
Summary:
Fixes https://github.com/pytorch/pytorch/issues/31412
The root cause is `plan_caches` being resized in one thread while another holds a reference to an existing `CuFFTParamsLRUCache`, which then becomes invalidated.
I was able to reproduce the crash very reliably without this fix applied, and I no longer see it with the fix. Being a race condition, it's hard to say for sure, though.
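For reference, a rough sketch of the kind of multi-threaded workload that exercises the plan cache concurrently (written against today's `torch.fft` API; this is not the reproducer from the issue):
```python
import threading
import torch

def worker(n):
    x = torch.randn(n, device="cuda")
    for _ in range(100):
        # Different signal sizes create different cuFFT plans, so several
        # threads hit the per-device plan cache at the same time.
        torch.fft.rfft(x)

threads = [threading.Thread(target=worker, args=(2 ** (8 + i),)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```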
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31861
Differential Revision: D19312314
Pulled By: ezyang
fbshipit-source-id: 06e4561128d503f2d70cdfe1982be0f3db2a8cf8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31313
This is a bugfix. The reason we couldn't enable the constexpr-ness for it before is that it was buggy,
and without constexpr it crashed at runtime rather than at compile time, which unfortunately seems to have slipped past our CI...
ghstack-source-id: 96380160
Test Plan: Now it works even when enabling constexpr for it
Differential Revision: D19087471
fbshipit-source-id: 28be107389f4507d35d08eab4b089a405690529b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31026
This is error prone and probably wrong. Since we don't use LeftRight on the hot path anymore, let's remove this.
ghstack-source-id: 96369644
Test Plan: none
Differential Revision: D18902165
fbshipit-source-id: 7b9478cd7cc071f403d75da20c7c889c27248b5c
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31911
Test Plan:
* CI builds including GPU and OSS-build tests
* The `defined(__HIP_DEVICE_COMPILE__)` instance a few lines below is proof that this is a define/undef flag, not a define01 flag.
Reviewed By: hlu1
Differential Revision: D19296560
fbshipit-source-id: 1c45069aec534b0bf4a87751a74680675c985e06
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31147
The goal here is to add more tests of the current autograd behavior to make sure no regressions are introduced when modifying it.
Do let me know if you think of other corner cases I missed.
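As one example of the flavor of behavior being pinned down (an illustrative check, not one of the tests added in this PR): gradients accumulate into `.grad` across backward calls unless it is cleared in between.
```python
import torch

x = torch.ones(3, requires_grad=True)
(x * 2).sum().backward()
(x * 3).sum().backward()
# Without zeroing x.grad between the two calls, the gradients accumulate.
assert torch.equal(x.grad, torch.full((3,), 5.0))
```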
Test Plan: Imported from OSS
Differential Revision: D19301082
Pulled By: albanD
fbshipit-source-id: 2cb07dcf99e56eb1f2c56a179796f2e6042d5a2d