Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62200
This commit brings back the `RemoveInplaceOps` pass removed in D29523283 (dec5aa2260) that apparently had a bunch of internal users.
Test Plan: danthe3rd
Reviewed By: danthe3rd
Differential Revision: D29833316
fbshipit-source-id: 6cf13d463ab0a5e50ba3eb3243f79a9c51623809
Summary:
This PR adds a **private** squid proxy (note that the internal ELB is only accessible from the private VPC subnets of the GitHub runners) that is deployed specifically for PyTorch CI on GitHub runners.
```
dig $SQUID_PROXY
10.0.x.x
10.0.x.x
```
http_proxy and https_proxy are compatible with the following http clients:
- curl
- wget
- python
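For example, a minimal Python sketch of routing a download through the proxy (illustrative only; `SQUID_PROXY` is read from the environment, as in the curl examples below):
```
import os
import urllib.request

# Minimal sketch: build an opener that sends http/https traffic through the
# squid egress cache whose address is taken from $SQUID_PROXY.
proxy = os.environ["SQUID_PROXY"]
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": proxy, "https": proxy})
)
with opener.open("http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz") as resp:
    data = resp.read()
```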
Existing cache policy:
refresh_pattern -i .(7z|deb|rpm|exe|zip|tar|tgz|gz|ram|rar|bin|tiff|bz2|run|csv|sh)$ 1440 80% 2880
It uses the standard squid `refresh_pattern` directive for caching requests. In our setup, objects are cached for at least 1440 minutes (1 day) and at most 2880 minutes (2 days), with a last-modified factor of 80% (see the squid docs). Please refer to pytorch/test-infra for details.
Right now, the proxy only applies to the build and test steps, to limit the scope and make build and test more reliable with an egress cache.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62244
Test Plan:
```
# first time, cache miss (4min20s)
http_proxy=$SQUID_PROXY https_proxy=$SQUID_PROXY curl -v -L http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz --output /tmp/tmp_mnist.zip
100 9680k 100 9680k 0 0 37836 0 0:04:21 0:04:21 --:--:-- 29908
# second time, cache hit (0s)
http_proxy=$SQUID_PROXY https_proxy=$SQUID_PROXY curl -v -L http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz --output /tmp/tmp_mnist.zip
100 9680k 100 9680k 0 0 103M 0 --:--:-- --:--:-- --:--:-- 103M
```
Load Test Plan:
```
# ab load test with `-n 100` requests
ab -X $SQUID_PROXY -n 100 http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Concurrency Level: 1
Time taken for tests: 9.044 seconds
Complete requests: 100
Failed requests: 0
Total transferred: 991326300 bytes
HTML transferred: 991242200 bytes
Requests per second: 11.06 [#/sec] (mean)
Time per request: 90.442 [ms] (mean)
Time per request: 90.442 [ms] (mean, across all concurrent requests)
Transfer rate: 107040.50 [Kbytes/sec] received
```
Reviewed By: malfet
Differential Revision: D29928698
Pulled By: zhouzhuojie
fbshipit-source-id: 4ee78be0abe35411666c6121991b0addded57106
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62111
This base class will be passed to the post-localSGD optimizer in the next PR. This way, the same post-localSGD optimizer can choose different model averaging algorithms.
Proposal: https://github.com/pytorch/pytorch/issues/59699
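As a rough usage sketch (assuming an already-initialized process group and that this module path and `PeriodicModelAverager` signature match the related PRs), the base class lets a post-localSGD optimizer accept any averager that implements the same interface:
```
import torch
from torch.distributed.algorithms.model_averaging.averagers import PeriodicModelAverager

# Rough sketch only: assumes torch.distributed.init_process_group(...) was already called.
# PeriodicModelAverager is one concrete subclass of the base averager class introduced here.
model = torch.nn.Linear(10, 10)
averager = PeriodicModelAverager(period=4, warmup_steps=100)
for step in range(20):
    # ... run the local optimizer step here ...
    averager.average_parameters(model.parameters())  # all-reduces params every `period` steps after warmup
```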
ghstack-source-id: 134489187
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager
Reviewed By: rohan-varma
Differential Revision: D29884954
fbshipit-source-id: 1dc5e35c58895902991567f633afd621c7108938
Summary:
This PR enables the softmax calculation with the `bfloat16` data type when the softmax is not computed along the last dim.
* Use a bf16 specialization for the forward calculation to reduce the bf16/fp32 casts in the vec template.
* Lift the bf16 limitation for the backward calculation.
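For example, a small sketch of the newly supported case:
```
import torch

# Softmax on a bfloat16 tensor reduced along a non-last dimension (dim=0 here),
# exercising both the forward and the backward path.
x = torch.randn(8, 16, dtype=torch.bfloat16, requires_grad=True)
y = torch.softmax(x, dim=0)
y.sum().backward()
print(x.grad.dtype)  # torch.bfloat16
```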
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60371
Reviewed By: ejguan
Differential Revision: D29563109
Pulled By: cpuhrsch
fbshipit-source-id: f6b439fa3850a6c633f35db65ea3d735b747863e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61992
This test was previously not enabled for static graph. To ensure this feature is supported with DDPSink, enable it for static graph, which currently passes outputs through DDPSink.
ghstack-source-id: 134471406
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D29830887
fbshipit-source-id: 2d3f750d9eb4289558ed21acccd172d83d9b82cc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61942
This PR changes is_reference=True for conv to produce a pattern consisting of dequant - float conv - quant instead of a reference conv module. This is useful for future transformations to custom backends, and it also helps simplify the implementation of convert in the future.
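For context, an illustrative sketch of the resulting pattern (not the exact FX graph emitted by convert; scale/zero_point values are arbitrary):
```
import torch

# dequant - float conv - quant reference pattern, written out eagerly for illustration.
x_q = torch.quantize_per_tensor(torch.randn(1, 3, 8, 8), scale=0.1, zero_point=0, dtype=torch.quint8)
w = torch.randn(4, 3, 3, 3)

x = x_q.dequantize()                                      # dequant
y = torch.nn.functional.conv2d(x, w)                      # float conv
y_q = torch.quantize_per_tensor(y, 0.1, 0, torch.quint8)  # quant
```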
Test Plan:
python test/test_quantization.py TestQuantizeFxOps
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D29810656
fbshipit-source-id: 549237a62bfda4341a2a7474c124f5e33350e267
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62292
This PR adds pytree support for namedtuples. The challenge about namedtuple
is that each namedtuple class is actually different. This PR does the
following:
- It adds a namedtuple flatten/unflatten. The flatten function returns
a context that is the actual type of the namedtuple subclass. The
unflatten function uses that type to reconstruct the namedtuple.
- Special-cases all pytree logic to consider all namedtuples the same.
This is done by creating a `_get_node_type(pytree)` helper function that
returns `namedtuple` if `pytree` is any namedtuple subclass. The effect
of this is that all namedtuple subclasses will go through the namedtuple
flatten/unflatten functions.
- Adds a `_namedtuple_flatten_spec` function for FX pytrees. This function
flattens the namedtuple based on the spec and is equivalent to the
`_tuple_flatten_spec`.
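For illustration, a small sketch of the resulting behavior (torch.utils._pytree is a private module, so the exact API may differ across versions):
```
import collections
import torch.utils._pytree as pytree

# Any namedtuple subclass is flattened to its fields and reconstructed
# as the same subclass on unflatten.
Point = collections.namedtuple("Point", ["x", "y"])
leaves, spec = pytree.tree_flatten(Point(1, 2))   # leaves == [1, 2]
restored = pytree.tree_unflatten(leaves, spec)
assert isinstance(restored, Point) and restored == Point(1, 2)
```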
Test Plan
- new tests in test/test_pytree.py and test/test_fx.py
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D29947302
Pulled By: zou3519
fbshipit-source-id: 19c00665b13546642c315df0f243ad99b8e7ff7c
Summary:
As MKL is only available on the x86_64 platform, clone the header-only PocketFFT
library and use it as the FFT provider.
Fixes https://github.com/pytorch/pytorch/issues/62107
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62222
Reviewed By: ejguan
Differential Revision: D29938718
Pulled By: malfet
fbshipit-source-id: ac0bd98b5090d6c8a26c36c4e34a4d6e1d9f1a92
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62225
Rewrote the preprocess function for Android NNAPI delegate.
Previously, `preprocess()` called `convert_model_to_nnapi()` using Pybind and returned a NnapiModule that is serialized for mobile. Now, `preprocess()` calls a sub-function of `convert_model_to_nnapi()` and returns several preprocessed items (that were previously components of NnapiModule).
The returned dictionary contains:
- "shape_compute_module": torch::jit::Module
- "ser_model": torch::Tensor
- "weights": List[torch.Tensor]
- "inp_mem_fmts": List[int]
- "out_mem_fmts": List[int]
**Purpose and Future:**
The purpose of these changes is to move more of the implementation from bytecode and TorchScript to the delegate API, since bytecode is less efficient.
Now, only the shape computation uses bytecode. In the future, shape computation will be moved out of TorchScript as well.
**nnapi_backend_preprocess.cpp:** preprocess implementation
**prepare.py**: refactored a portion of `convert_model_to_nnapi()` to `process_for_nnapi()`, so preprocess can get components of NnapiModule
**Test:**
Ran `python test/test_jit.py TestNnapiBackend` and `python test/test_nnapi.py` on OSS successfully
ghstack-source-id: 134444190
Test Plan: Ran `python test/test_jit.py TestNnapiBackend` and `python test/test_nnapi.py` on OSS successfully
Reviewed By: raziel
Differential Revision: D29922279
fbshipit-source-id: cadcf8908d8a745dc7abbe286e97d6ead937d4ab
Summary:
Which, at the time of creating this PR, points to 7e51592129
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62276
Reviewed By: ngimel
Differential Revision: D29940950
Pulled By: malfet
fbshipit-source-id: 59c6fda76a9023af3adbfb5a96b83ca50950df6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62249
Parameters and grads passed to torch.optim.functional should always match; we should skip the parameters that have None gradients to avoid a size mismatch.
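For illustration, a minimal sketch of the skipping logic (not the actual distributed optimizer code):
```
import torch

# Skip parameters whose gradient is None so that the parameter and gradient
# lists passed to a functional optimizer stay the same length.
params = [torch.nn.Parameter(torch.randn(2)) for _ in range(3)]
params[1].grad = torch.ones(2)  # only one parameter received a gradient

params_with_grad = [p for p in params if p.grad is not None]
grads = [p.grad for p in params_with_grad]
assert len(params_with_grad) == len(grads) == 1
```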
ghstack-source-id: 134452467
Test Plan: test_dist_optim_none_grads
Reviewed By: mrshenli
Differential Revision: D29929653
fbshipit-source-id: 4ca6167fecdfe1db422236655edee3aa59b8b044
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62224
The underlying operator allows both args and kwargs, but we only expose args in this convenience method. This brings them in line while not changing any existing programs.
Test Plan: CI
Reviewed By: gunchu
Differential Revision: D29920830
fbshipit-source-id: f4b2aa88d4a679e33595625b7ef355e4d14e54c4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62281
Closes gh-24646, Closes gh-24647
There is no `TensorIterator` equivalent to these kernels so this is just
migrating the existing kernels over to the ATen style.
I've benchmarked for contiguous tensors with this script:
```
import torch
shape = (10, 10, 100, 100)
x = torch.randn(*shape, device='cuda')
w = torch.randn((10, 1, 5, 5), device='cuda')
for _ in range(100):
    torch.nn.functional.conv2d(x, w, groups=10)
```
and similarly for backwards. I see these as the same to within measurement error.
| | Master (us) | This PR (us) |
|------------------:|:-------------------:|:--------------------:|
| Forward | 133.5 | 133.6 |
| Backward (input) | 1,102 | 1,119 |
| Backward (weight) | 2,220 | 2,217 |
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D29943062
Pulled By: ngimel
fbshipit-source-id: fc5d16496eb733743face7c5a14e532d7b8ee26a
Summary:
PowKernel.cu is the single slowest file to compile in all of pytorch, taking
7 m 34 s on my machine. After investigating, I discovered that the case with
complex inputs and a cpu scalar for the first argument takes more than half that
time just on its own.
Noting that [`thrust::pow`] for complex is just `exp(log(base) * exponent)`,
we can improve this kernel by precomputing `log(base)` on cpu and computing
only the `exp` on CUDA. This is faster in both runtime and compile time.
For 1 million elements, master takes 61.6 us vs 56.9 us with this PR.
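For reference, a small sketch of the identity this relies on:
```
import torch

# For complex numbers, base ** exponent == exp(log(base) * exponent),
# so log(base) can be computed once on CPU when the base is a cpu scalar
# and only the exp needs to run on CUDA.
base = torch.tensor(2.0 + 1.0j)
exponent = torch.randn(4, dtype=torch.complex64)
lhs = base ** exponent
rhs = torch.exp(torch.log(base) * exponent)
print(torch.allclose(lhs, rhs, rtol=1e-4, atol=1e-5))  # True, up to floating-point error
```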
I also noticed that the constant exponent case is implemented twice, once in
`gpu_kernel_with_scalars` and again in `pow_tensor_scalar_kernel`. Further, the
`Pow.cpp` code detects cpu-scalar exponents and redispatches to the `tensor_scalar`
overload, making the `gpu_kernel_with_scalars` version dead code. Now instead,
we unconditionally run `tensor_tensor` and it will call into `tensor_scalar` if appropriate.
With these changes, PowKernel.cu takes just 2 m 30 s to compile.
[`thrust::pow`]: 368266e80e/thrust/detail/complex/cpow.h (L33)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62260
Reviewed By: ejguan
Differential Revision: D29938789
Pulled By: ngimel
fbshipit-source-id: 7ab7d81ececc92a9e6e62e60b0a4f2e6e3146df8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62277
This PR changes is_reference=True for linear to produce a pattern consisting of dequant - float linear - quant instead of a reference linear module. This is useful for future transformations to custom backends, and it also helps simplify the implementation of convert in the future.
Test Plan:
python test/test_quantization.py TestQuantizeFxOps
Imported from OSS
Reviewed By: ejguan
Differential Revision: D29941079
fbshipit-source-id: 84bdfc0bb872c34fc345875e545c8b323e77c41e
Summary:
When coming across the short runtime of a periodic job on this PR, I realized the current setup for running smoke tests on PRs was flawed. Previously, as an attempt at better future compatibility, our conditional for running only smoke tests was USE_CUDA=1 on Windows.
This is BAD and has unintended consequences, such as misleading results when a ci/scheduled workflow is triggered but fails to run the full test suite, e.g., with PR https://github.com/pytorch/pytorch/issues/62266: https://github.com/pytorch/pytorch/actions/runs/1071698069
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62288
Reviewed By: seemethere, ejguan
Differential Revision: D29945540
Pulled By: janeyx99
fbshipit-source-id: 3cc91511c151f7348872b039c94d7752b6ea4692
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62290
No longer needed.
Fixes nightly failures that we're observing as well:
```
Jul 27 07:33:02 Found conflicts! Looking for incompatible packages.
Jul 27 07:33:02 This can take several minutes. Press CTRL-C to abort.
Jul 27 07:33:02 failed
Jul 27 07:33:02
Jul 27 07:33:02 UnsatisfiableError: The following specifications were found
Jul 27 07:33:02 to be incompatible with the existing python installation in your environment:
Jul 27 07:33:02
Jul 27 07:33:02 Specifications:
Jul 27 07:33:02
Jul 27 07:33:02 - conda-package-handling=1.6.0 -> python[version='>=2.7,<2.8.0a0|>=3.6,<3.7.0a0|>=3.7,<3.8.0a0|>=3.8,<3.9.0a0']
Jul 27 07:33:02
Jul 27 07:33:02 Your python: python=3.9
```
From: https://app.circleci.com/pipelines/github/pytorch/pytorch/356478/workflows/2102acf1-c92a-4a59-919c-61d32d3bcd71/jobs/15027876
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: driazati
Differential Revision: D29946501
Pulled By: seemethere
fbshipit-source-id: 3e9182f4cbcf2aab185dbbc21b7a6171746e2281
Summary:
Following up on https://github.com/pytorch/pytorch/issues/61768.
Currently the printout is hugely long because each test case logs a status code OK even when no exception was raised.
This should be avoided when no exception is raised from send_to_scribe.
This removes the log printing when the response contains no error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62285
Reviewed By: zhouzhuojie
Differential Revision: D29944461
Pulled By: walterddr
fbshipit-source-id: fc3c2b88bba27c68521cef7079ca2b6197d2d58b
Summary:
Part of the fix for https://github.com/pytorch/pytorch/issues/12013
Checks whether the input and output sizes are non-zero in order to allow the Bilinear layer to accept batches of size 0. The if-check covers both input and output dim sizes since the `_trilinear` function is written to work for both the forward and backward of Bilinear.
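For illustration, a minimal sketch of the case this enables:
```
import torch

# A Bilinear layer applied to inputs with batch size 0.
m = torch.nn.Bilinear(20, 30, 40)
x1 = torch.randn(0, 20)
x2 = torch.randn(0, 30)
out = m(x1, x2)
print(out.shape)  # torch.Size([0, 40])
```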
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47106
Reviewed By: ejguan
Differential Revision: D29935589
Pulled By: jbschlosser
fbshipit-source-id: 607d3352bd4f88e2528c64408f04999960be049d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62059
GetOperatorCost in Workspace exposes flops and bytes_written only. Make an additional piece, bytes_read, available from OperatorSchema::Cost.
Test Plan:
Added the two additional pieces in the unit test testGetOperatorCost in workspace_test
buck test caffe2/caffe2/python:workspace_test -- testGetOperatorCost
buck test //aml/ml_foundation/exp_platform/large_scale_training/distributed_hogwild/auto_device_placement/tests/...
buck test //aiplatform/training/autotuning/tests/...
buck test //aiplatform/training/pipelining/tests/...
buck test //deeplearning/fblsim/tests/...
Flow tests:
ADP Greedy: f288078287
ADP MILP: f288079278
Reviewed By: CrazySherman, xtaofb
Differential Revision: D29860676
fbshipit-source-id: 8b3a9f2bf17c0dae48cfe2800e8821bf441e0b03
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62207
The workspace was getting held back due to permission-denied errors; let's
ensure we have a chown'd / clean workspace for all render_test_results
runs.
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: walterddr, janeyx99
Differential Revision: D29915232
Pulled By: seemethere
fbshipit-source-id: dd9fcc9c00d9665569bd8cfa57e5d2d8da965aac
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61887
1) Introduced a `sandcastle_skip_if` decorator that ensures these
tests simply pass on Sandcastle.
2) Fixed all test files under `test/distributed` to not use `unittest.skip`.
The overall goal is to avoid using skips, since Sandcastle flags these tests as
continuously skipping.
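For context, a minimal sketch of what such a decorator can look like (not the exact PyTorch implementation):
```
import functools

# Instead of calling unittest.skip, turn the test into a trivially passing one
# when the condition holds, so internal CI does not flag it as continuously skipping.
def sandcastle_skip_if(condition, reason):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if condition:
                print(f"Skipping {fn.__name__}: {reason}")  # report, but let the test pass
                return None
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```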
ghstack-source-id: 134382237
Test Plan: waitforbuildbot
Reviewed By: SciPioneer
Differential Revision: D29784152
fbshipit-source-id: 17b4df6c5a55ff1d1e8e1de128fa679c3dfbcb7d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62234
There was a typo that we didn't catch until recently, hence this fix.
Reviewed By: 842974287
Differential Revision: D29924190
fbshipit-source-id: ee6259fcd41358aefe9680b419acc87c0c2821cb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62006
Closes gh-24646, gh-24647
There is no `TensorIterator` equivalent to these kernels so this is just
migrating the existing kernels over to the ATen style.
I've benchmarked for contiguous tensors with this script:
```
import torch
shape = (10, 10, 100, 100)
x = torch.randn(*shape, device='cuda')
w = torch.randn((10, 1, 5, 5), device='cuda')
for _ in range(100):
    torch.nn.functional.conv2d(x, w, groups=10)
```
and similarly for backwards. I see these as the same to within measurement error.
| | Master (us) | This PR (us) |
|------------------:|:-------------------:|:--------------------:|
| Forward | 133.5 | 133.6 |
| Backward (input) | 1,102 | 1,119 |
| Backward (weight) | 2,220 | 2,217 |
Test Plan: Imported from OSS
Reviewed By: jbschlosser
Differential Revision: D29883676
Pulled By: ngimel
fbshipit-source-id: 9b2ac62cdd8a84e1a23ffcd66035b2b2fe2374d8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61892
This PR changes is_reference=True for linear to produce a pattern consisting of dequant - float linear - quant instead of a reference linear module. This is useful for future transformations to custom backends, and it also helps simplify the implementation of convert in the future.
Test Plan:
python test/test_quantization.py TestQuantizeFxOps
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D29810657
fbshipit-source-id: 949615bbc017bc454d81c8a6b2bdec53badaab19
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62213
Added sanity checks in preprocess function for Android NNAPI delegate.
`preprocess()` requires some input metadata passed through its `method_compile_spec` function argument.
`preprocess()` now throws specific error messages, if it cannot find the correct input arguments.
Example error message:
```
RuntimeError: method_compile_spec does not contain the "forward" key.
method_compile_spec should contain a Tensor or Tensor List which bundles input parameters: shape, dtype, quantization, and dimorder.
For input shapes, use 0 for run/load time flexible input.
method_compile_spec must use the following format: {"forward": {"inputs": at::Tensor}} OR {"forward": {"inputs": c10::List<at::Tensor>}}
```
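For illustration, a rough sketch of lowering with a compile spec in the expected format (assumptions noted in the comments):
```
import torch

# Rough sketch only: assumes the NNAPI delegate backend is registered in this build and
# that torch._C._jit_to_backend is the (private) lowering entry point; the PReLU module
# is just an illustrative example.
module = torch.jit.script(torch.nn.PReLU())
example_input = torch.zeros(1, 3, 224, 224)  # bundles shape/dtype/memory-format metadata
compile_spec = {"forward": {"inputs": example_input}}
lowered = torch._C._jit_to_backend("nnapi", module, compile_spec)
```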
nnapi_backend_preprocess.cpp: contains sanity check implementation
test_backend_nnapi.py: sanity check unit tests
Test: Ran `python test/test_jit.py TestNnapiBackend` in OSS successfully.
TODO: Using Tensors to pass input parameters is a temporary hack. When a dedicated object is implemented, update the sanity check error message.
ghstack-source-id: 134339282
Test Plan: Ran `python test/test_jit.py TestNnapiBackend` in OSS successfully.
Reviewed By: raziel, iseeyuan
Differential Revision: D29917004
fbshipit-source-id: 0d5c6b35889c556cda905ffc29c25c5422ae9ee4