Summary:
Thanks adefazio for the feedback. This adds a note to the Contribution guide so that folks don't start working on code without checking with the maintainers first.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23513
Differential Revision: D16546685
Pulled By: soumith
fbshipit-source-id: 1ee8ade963703c88374aedecb8c9e5ed39d7722d
Summary:
This is still a work in progress.
There are several more items to add to complete this doc, including
- [x] LHS indexing, index assignments.
- [x] Tensor List.
- [x] ~Shape/Type propagation.~
- [x] FAQs
Please review and share your thoughts; feel free to add anything you think should be included as well. cc houseroad spandantiwari lara-hdr neginraoof
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23185
Differential Revision: D16459647
Pulled By: houseroad
fbshipit-source-id: b401c005f848d957541ba3b00e00c93ac2f4609b
Summary:
With this change you can now list multiple interfaces separated by
commas. ProcessGroupGloo creates a single Gloo context for every device
in the list (a context represents a connection to every other
rank). For every collective that is called, it selects the context
in a round-robin fashion. The number of worker threads responsible for
executing the collectives is set to twice the number of devices.
If you have a single physical interface and wish to employ increased
parallelism, you can also specify
`GLOO_SOCKET_IFNAME=eth0,eth0,eth0,eth0`. This makes ProcessGroupGloo
use 4 connections per rank, 4 I/O threads, and 8 worker threads
responsible for executing the collectives.
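For illustration, a minimal sketch of how the interface list is configured (the interface names, rendezvous address, and world size below are placeholders):
```
import os
import torch
import torch.distributed as dist

# Placeholder interface names; repeat a single physical interface
# (e.g. "eth0,eth0,eth0,eth0") to get more connections and worker threads.
os.environ["GLOO_SOCKET_IFNAME"] = "eth0,eth1"

dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:23456",  # placeholder rendezvous address
    rank=0,
    world_size=1,
)

# Collectives issued through this process group are spread across the
# configured devices in a round-robin fashion.
dist.all_reduce(torch.zeros(1))
```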
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22978
ghstack-source-id: 87006270
Differential Revision: D16339962
fbshipit-source-id: 9aa1dc93d8e131c1714db349b0cbe57e9e7266f1
Summary:
Covers fleet-wide profiling, API logging, etc.
It's my first time writing rst, so suggestions are definitely welcome.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23010
Differential Revision: D16456721
Pulled By: dzhulgakov
fbshipit-source-id: 3d3018f41499d04db0dca865bb3a9652d8cdf90a
Summary:
I manually went through all functions in `torch.*` and corrected any mismatches between the arguments mentioned in the docs and the ones actually taken by the function. This fixes https://github.com/pytorch/pytorch/issues/8698.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22973
Differential Revision: D16419602
Pulled By: yf225
fbshipit-source-id: 5562c9b0b95a0759abee41f967c45efacf2267c2
Summary:
This cleans up the `torch.utils.tensorboard` API to remove all kwargs usage (which isn't clear to the user) and removes the "experimental" warning in prep for our 1.2 release.
We also don't need the additional PyTorch version checks now that we are in the codebase itself.
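For reference, a minimal usage sketch of the cleaned-up API (the log directory, tag, and values are placeholders):
```
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/example")   # placeholder log directory
for step in range(10):
    writer.add_scalar("loss/train", 1.0 / (step + 1), step)
writer.close()
```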
cc ezyang lanpa natalialunova
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21786
Reviewed By: natalialunova
Differential Revision: D15854892
Pulled By: orionr
fbshipit-source-id: 06b8498826946e578824d4b15c910edb3c2c20c6
Summary:
# What is this?
This is an implementation of the AdamW optimizer as implemented in [the fastai library](803894051b/fastai/callback.py) and as initially introduced in the paper [Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101). It decouples the weight decay regularization step from the optimization step during training.
There have already been several abortive attempts to push this into pytorch in some form or fashion: https://github.com/pytorch/pytorch/pull/17468, https://github.com/pytorch/pytorch/pull/10866, https://github.com/pytorch/pytorch/pull/3740, https://github.com/pytorch/pytorch/pull/4429. Hopefully this one goes through.
# Why is this important?
Via a simple reparameterization, it can be shown that L2 regularization has a weight decay effect in the case of SGD optimization. Because of this, L2 regularization became synonymous with the concept of weight decay. However, it can be shown that the equivalence of L2 regularization and weight decay breaks down for more complex adaptive optimization schemes. It was shown in the paper [Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101) that this is the reason why models trained with SGD achieve better generalization than those trained with Adam. Weight decay is a very effective regularizer. L2 regularization, in and of itself, is much less effective. By explicitly decaying the weights, we can achieve state-of-the-art results while also taking advantage of the quick convergence properties that adaptive optimization schemes have.
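To make the distinction concrete, here is a minimal sketch (not the actual implementation in this PR) of a single Adam-style update with and without decoupled decay; momentum and bias correction are omitted, and `lr`/`wd` are arbitrary placeholder values:
```
import torch

p = torch.randn(10)            # a parameter
g = torch.randn(10)            # gradient of the loss w.r.t. p
v = torch.rand(10) + 1e-8      # Adam's running second-moment estimate
lr, wd = 1e-3, 1e-2            # placeholder learning rate and weight decay

# Adam + L2 regularization: the decay term is folded into the gradient,
# so it is also divided by the adaptive denominator sqrt(v).
g_l2 = g + wd * p
p_l2 = p - lr * g_l2 / (v.sqrt() + 1e-8)

# AdamW (decoupled weight decay): the gradient step is adaptive, but the
# decay is applied directly to the weights, unscaled.
p_adamw = p - lr * g / (v.sqrt() + 1e-8) - lr * wd * p
```
For plain SGD the two formulations coincide (up to a rescaling of the decay factor), which is exactly why the distinction only shows up with adaptive optimizers.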
# How was this tested?
There were test cases added to `test_optim.py` and I also ran a [little experiment](https://gist.github.com/mjacar/0c9809b96513daff84fe3d9938f08638) to validate that this implementation is equivalent to the fastai implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21250
Differential Revision: D16060339
Pulled By: vincentqb
fbshipit-source-id: ded7cc9cfd3fde81f655b9ffb3e3d6b3543a4709
Summary:
This is a modified version of https://github.com/pytorch/pytorch/pull/14705, since the commit structure of that PR is quite messy.
1. Add `IterableDataset`.
2. So we now have two DataLoader modes: `Iterable` and `Map`.
    1. `Iterable` if the `dataset` is an instance of `IterableDataset`.
    2. `Map` otherwise.
3. Add better support for non-batch loading (i.e., `batch_size=None` and `batch_sampler=None`). This is useful for things like bulk loading.
4. Refactor `DataLoaderIter` into two classes, `_SingleProcessDataLoaderIter` and `_MultiProcessingDataLoaderIter`. Rename some methods to be more generic, e.g., `get_batch` -> `get_data`.
5. Add `torch.utils.data.get_worker_info`, which returns worker information in a worker process (e.g., worker id, dataset object copy, etc.) and can be used in `IterableDataset.__iter__` and `worker_init_fn` to do per-worker configuration (see the sketch after this list).
6. Add `ChainDataset`, which is the analog of `ConcatDataset` for `IterableDataset`.
7. Import `torch.utils.data` in `torch/__init__.py`.
8. Add data loader examples and documentation.
9. Use `get_worker_info` to detect whether we are in a worker process in `default_collate`.
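To illustrate items 1 and 5 above, a minimal sketch of an `IterableDataset` that uses `get_worker_info` to shard its range across workers (the range bounds and worker count are placeholders):
```
import math
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class RangeIterableDataset(IterableDataset):
    def __init__(self, start, end):
        self.start, self.end = start, end

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: yield the full range.
            iter_start, iter_end = self.start, self.end
        else:
            # Worker process: each worker handles a disjoint slice.
            per_worker = int(math.ceil((self.end - self.start) / info.num_workers))
            iter_start = self.start + info.id * per_worker
            iter_end = min(iter_start + per_worker, self.end)
        return iter(range(iter_start, iter_end))

ds = RangeIterableDataset(0, 8)
loader = DataLoader(ds, batch_size=None, num_workers=2)  # non-batch loading
print(list(loader))  # items 0..7, order depends on worker scheduling
```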
Closes https://github.com/pytorch/pytorch/issues/17909, https://github.com/pytorch/pytorch/issues/18096, https://github.com/pytorch/pytorch/issues/19946, and some of https://github.com/pytorch/pytorch/issues/13023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19228
Reviewed By: bddppq
Differential Revision: D15058152
fbshipit-source-id: 9e081a901a071d7e4502b88054a34b450ab5ddde
Summary:
Accidentally rebased the old PR and made it too messy. Find it here (https://github.com/pytorch/pytorch/pull/19274).
Creating this PR for comments. The model is still WIP, but I want to get some feedback before moving too far. The transformer model depends on several modules, like MultiheadAttention (landed).
Transformer is implemented based on the paper (https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf). Users have the flexibility to build a transformer with self-defined and/or built-in components (i.e., encoder, decoder, encoder_layer, decoder_layer). Users can use the Transformer class to build a standard transformer model and modify sub-layers as needed.
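For context, a minimal usage sketch of the proposed module (dimensions, batch size, and sequence lengths are arbitrary placeholders):
```
import torch
import torch.nn as nn

# A standard transformer assembled from the built-in encoder/decoder stacks.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.rand(10, 32, 512)  # (source length, batch, d_model)
tgt = torch.rand(20, 32, 512)  # (target length, batch, d_model)
out = model(src, tgt)          # -> (20, 32, 512)
```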
Add a few unit tests for the transformer module, as follows:
TestNN.test_Transformer_cell
TestNN.test_transformerencoderlayer
TestNN.test_transformerdecoderlayer
TestNN.test_transformer_args_check
TestScript.test_scriptmodule_transformer_cuda
There is another demonstration example applying the transformer module to the word language modeling problem: https://github.com/pytorch/examples/pull/555
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20170
Differential Revision: D15417983
Pulled By: zhangguanheng66
fbshipit-source-id: 7ce771a7e27715acd9a23d60bf44917a90d1d572
Summary:
Something flaky is going on with `test_inplace_view_saved_output` on Windows.
With my PR #20598 applied, the test fails, even though there is no obvious reason it should be related, so the PR was reverted.
Based on commenting out various parts of my change and re-building, I think the problem is with the name -- renaming everything from `T` to `asdf` seems to make the test stop failing. I can't be sure that this is actually the case though, since I could just be seeing patterns in non-deterministic build output...
I spoke with colesbury offline and we agreed that it is okay to just disable this test on Windows for now and not block landing the main change. He will look into why it is failing.
**Test Plan:** I will wait to make sure the Windows CI suite passes before landing this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21175
Differential Revision: D15566970
Pulled By: umanwizard
fbshipit-source-id: edf223375d41faaab0a3a14dca50841f08030da3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20665
Add gelu activation forward on CPU in PyTorch.
Compared to the current Python implementation of gelu used in the BERT model:
```
def gelu(self, x):
    return x * 0.5 * (1.0 + torch.erf(x / self.sqrt_two))
```
the `torch.nn.functional.gelu` function can reduce the forward time from 333ms to 109ms (with MKL) / 112ms (without MKL) for input size = [64, 128, 56, 56] on a devvm.
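For reference, a minimal sketch of calling the new native function (input size taken from the benchmark above):
```
import torch
import torch.nn.functional as F

x = torch.randn(64, 128, 56, 56)
y = F.gelu(x)  # native forward, replacing the erf-based Python version
```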
Reviewed By: zheng-xq
Differential Revision: D15400974
fbshipit-source-id: f606b43d1dd64e3c42a12c4991411d47551a8121
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21196
we'll add `quantize(quantizer)` as a tensor method later, when we expose `quantizer` in the Python frontend
Python
```
torch.quantize_linear(t, ...)
```
C++
```
at::quantize_linear(t, ...)
```
Differential Revision: D15577123
fbshipit-source-id: d0abeea488418fa9ab212f84b0b97ee237124240
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21156
we'll add `quantize(quantizer)` as a tensor method later, when we expose `quantizer` in the Python frontend
Python
```
torch.quantize_linear(t, ...)
```
C++
```
at::quantize_linear(t, ...)
```
Differential Revision: D15558784
fbshipit-source-id: 0b194750c423f51ad1ad5e9387a12b4d58d969a9