Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42766
**Summary**
Some python tests are missing in `caffe2/test/TARGETS`, add them to be more comprehension.
According to [run_test.py](https://github.com/pytorch/pytorch/blob/master/test/run_test.py#L125), some tests are slower. Slow tests are added as independent targets and others are put together into one `others` target. The reason is because we want to reduce overhead, especially for code covarge collection. Tests in one target can be run as a bundle, and then coverage can be collected together. Typically coverage collection procedure is time-expensive, so this helps us save time.
Test Plan:
Run all the new test targets locally in dev server and record the time they cost.
**Statistics**
```
# jit target
real 33m7.694s
user 653m1.181s
sys 58m14.160s
--------- Compare to Initial Jit Target runtime: ----------------
real 32m13.057s
user 613m52.843s
sys 54m58.678s
```
```
# others target
real 9m2.920s
user 164m21.927s
sys 12m54.840s
```
```
# serialization target
real 4m21.090s
user 23m33.501s
sys 1m53.308s
```
```
# tensorexpr
real 11m28.187s
user 33m36.420s
sys 1m15.925s
```
```
# type target
real 3m36.197s
user 51m47.912s
sys 4m14.149s
```
Reviewed By: malfet
Differential Revision: D22979219
fbshipit-source-id: 12a30839bb76a64871359bc024e4bff670c5ca8b
Summary:
Per title. Also updates our guidance for adding aliases to clarify interned_string and method_test requirements. The alias is tested by extending test_clamp to also test clip.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42770
Reviewed By: ngimel
Differential Revision: D23020655
Pulled By: mruberry
fbshipit-source-id: f1d8e751de9ac5f21a4f95d241b193730f07b5dc
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41656
For the CPU version, this is a regression introduced in https://github.com/pytorch/pytorch/issues/10980 which vectorized the `grid_sampler_2d` implementation. It uses the AVX2 gather intrinsic which for `float` requires 32-bit indexing to match the number of floats in the AVX register. There is also an `i64gather_ps` variant but this only utilizes half of the vector width so would be expected to give worse performance in the more likely case where 32-bit indexing is acceptable. So, I've left the optimised AVX version as-is and reinstated the old non-vectorized version as a fallback.
For the CUDA version, this operation has never supported 32-bit indexing so this isn't a regression. I've templated the kernel on index type and added 64-bit variants. Although I gather in some places a simple `TORCH_CHECK(canUse32BitIndexMath(...))` is used instead. So, there is a decision to be made here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41923
Reviewed By: glaringlee
Differential Revision: D22925931
Pulled By: zou3519
fbshipit-source-id: 920816107aae26360c5e7f4e9c729fa9057268bb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42189
Rehash of https://github.com/pytorch/pytorch/pull/28811, which was several months old.
As part of addressing https://github.com/pytorch/pytorch/issues/23232, this PR adds support for the following APIs:
`allgather_object` and `gather_object` to support gather/allgather of generic, pickable Python objects. This has been a long-requested feature so PyTorch should provide these helpers built-in.
The methodology is what is proposed in the original issue:
1) Pickle object to ByteTensor using torch.save
2) Comm. tensor sizes
3) Copy local ByteTensor into a tensor of maximal size
4) Call tensor-based collectives on the result of (3)
5) Unpickle back into object using torch.load
Note that the API is designed to match other than supporting `async_op`. For now, it is a blocking call. If we see demand to support `async_op`, we will have to make more progress on merging work/future to support this.
If this is a suitable approach, we can support `scatter`, `broadcast` in follow up PRs.
ghstack-source-id: 109322433
Reviewed By: mrshenli
Differential Revision: D22785387
fbshipit-source-id: a265a44ec0aa3aaffc3c6966023400495904c7d8
Summary:
This PR creates a new namespace, torch.fft (torch::fft) and puts a single function, fft, in it. This function is analogous to is a simplified version of NumPy's [numpy.fft.fft](https://numpy.org/doc/1.18/reference/generated/numpy.fft.fft.html?highlight=fft#numpy.fft.fft) that accepts no optional arguments. It is intended to demonstrate how to add and document functions in the namespace, and is not intended to deprecate the existing torch.fft function.
Adding this namespace was complicated by the existence of the torch.fft function in Python. Creating a torch.fft Python module makes this name ambiguous: does it refer to a function or module? If the JIT didn't exist, a solution to this problem would have been to make torch.fft refer to a callable class that mimicked both the function and module. The JIT, however, cannot understand this pattern. As a workaround it's required to explicitly `import torch.fft` to access the torch.fft.fft function in Python:
```
import torch.fft
t = torch.randn(128, dtype=torch.cdouble)
torch.fft.fft(t)
```
See https://github.com/pytorch/pytorch/issues/42175 for future work. Another possible future PR is to get the JIT to understand torch.fft as a callable class so it need not be imported explicitly to be used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41911
Reviewed By: glaringlee
Differential Revision: D22941894
Pulled By: mruberry
fbshipit-source-id: c8e0b44cbe90d21e998ca3832cf3a533f28dbe8d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42528
It seems it was an oversight that they weren't run. This allows to simplify our auto-generation logic as now all test suites are run in both modes.
ghstack-source-id: 109229969
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D22922151
fbshipit-source-id: 0766a6970c927efb04eee4894b73d4bcaf60b97f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40822
Summary of the entire stack:
--
This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.
This refactoring aims to address these problems by:
- Avoiding global state, defaults/override, traps, if/elses, ... and have a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).
It provides further advantages:
- It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...
Summary of this commit
--
This is the last step of removing TEST_CONFIG. As there was no one left using it, there is really not much to it.
ghstack-source-id: 109229471
Test Plan: Sandcastle and CircleCI
Reviewed By: pritamdamania87
Differential Revision: D22307778
fbshipit-source-id: 0d9498d9367eec671e0a964ce693015f73c5638c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40821
Summary of the entire stack:
--
This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.
This refactoring aims to address these problems by:
- Avoiding global state, defaults/override, traps, if/elses, ... and have a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).
It provides further advantages:
- It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...
Summary of this commit
--
This change continues the work towards removing TEST_CONFIG, by taking a few functions that were accepting the agent name (as obtained from TEST_CONFIG) and then did a bunch of if/elses on it, and replace them by new abstract methods on the fixtures, so that these functions become "decentralized".
ghstack-source-id: 109229472
Test Plan: Sandcastle and CircleCI
Reviewed By: pritamdamania87
Differential Revision: D22307776
fbshipit-source-id: 9e1f6edca79aacf0bcf9d83d50ce9e0d2beec0dd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40820
Summary of the entire stack:
--
This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.
This refactoring aims to address these problems by:
- Avoiding global state, defaults/override, traps, if/elses, ... and have a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).
It provides further advantages:
- It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...
Summary of this commit
--
Now that no one is using the generic fixture anymore (i.e., the fixture that looks up the agent's name in the global TEST_CONFIG) we can make it abstract, i.e., have its methods become no-ops and add decorators that will require all subclasses to provide new implementations of those methods. This is a first step towards removing TEST_CONFIG.
ghstack-source-id: 109229475
Test Plan: Sandcastle and CircleCI
Reviewed By: pritamdamania87
Differential Revision: D22307777
fbshipit-source-id: e52abd915c37894933545eebdfdca3ecb9559926
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40819
Summary of the entire stack:
--
This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.
This refactoring aims to address these problems by:
- Avoiding global state, defaults/override, traps, if/elses, ... and have a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).
It provides further advantages:
- It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...
Summary of this commit
--
This diff removes the two decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which were used to skip tests. They were only used to prevent the TensorPipe agent from running tests that were using the process group agent's options. The converse (preventing the PG agent from using the TP options) is achieved by having those tests live in a `TensorPipeAgentRpcTest` class. So here we're doing the same for process group, by moving those tests to a `ProcessGroupAgentRpcTest` class.
ghstack-source-id: 109229473
Test Plan: Sandcastle and CircleCI
Reviewed By: pritamdamania87
Differential Revision: D22283179
fbshipit-source-id: b9315f9fd67f35e88fe1843faa161fc53a4133c4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40818
Summary of the entire stack:
--
This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.
This refactoring aims to address these problems by:
- Avoiding global state, defaults/override, traps, if/elses, ... and have a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).
It provides further advantages:
- It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...
Summary of this commit
--
This diff does the changes described above for the process group agent. It defines a fixture for it (instead of using the generic fixture in its default behavior) and then merges all the entry points into a single script. Note that after this change there won't be anymore a "vanilla" RPC test: all test scripts now specify what agent they are using. This puts all agents on equal standing.
ghstack-source-id: 109229474
Test Plan: Sandcastle and CircleCI
Reviewed By: pritamdamania87
Differential Revision: D22283182
fbshipit-source-id: 7e3626bbbf37d88b892077a03725f0598576b370
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40817
Summary of the entire stack:
--
This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.
This refactoring aims to address these problems by:
- Avoiding global state, defaults/override, traps, if/elses, ... and have a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).
It provides further advantages:
- It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...
Summary of this commit
--
This diff does the changes described above for the faulty agent, which is its own strange beast. It merges all the test entry points (i.e., the combinations of agent, suite and fork/spawn) into a single file. It also modifies the test suites that are intended to be run only on the faulty agent, which used to inherit from its fixture, to inherit from the generic fixture, as they will be mixed in with the faulty fixture at the very end, inside the entry point script.
ghstack-source-id: 109229477
Test Plan: Sandcastle and CircleCI
Reviewed By: pritamdamania87
Differential Revision: D22283178
fbshipit-source-id: 72659efe6652dac8450473642a578933030f2c74
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40816
Summary of the entire stack:
--
This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.
This refactoring aims to address these problems by:
- Avoiding global state, defaults/override, traps, if/elses, ... and have a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).
It provides further advantages:
- It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...
Summary of this commit
--
This diff does the changes described above for the TensorPipe agent. It fixes its fixture (making it inherit from the generic fixture) and merges all the entry point scripts into a single one, so that it's easier to have a clear overview of all the test suites which we run on TensorPipe (you'll notice that many are missing: the JIT ones, the remote module one, ...).
ghstack-source-id: 109229476
Test Plan: Sandcastle and CircleCI
Reviewed By: pritamdamania87
Differential Revision: D22283180
fbshipit-source-id: d5e9f9f4e6d4bfd6fbcae7ae56eed63d2567a02f
Summary:
For CUDA >= 10.2, the `CUBLAS_WORKSPACE_CONFIG` environment variable must be set to either `:4096:8` or `:16:8` to ensure deterministic CUDA stream usage. This PR adds some logic inside `torch.set_deterministic()` to raise an error if this environment variable is not set properly and CUDA >= 10.2.
Issue https://github.com/pytorch/pytorch/issues/15359
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41377
Reviewed By: malfet
Differential Revision: D22758459
Pulled By: ezyang
fbshipit-source-id: 4b96f1e9abf85d94ba79140fd927bbd0c05c4522
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42276
This commit converts `_wait_all_workers()` to `_all_gather()` by
allowing each worker to provide its own data object. The `_all_gather()`
function blocks and returns the gathered results. This API can be
converted to `rpc.barrier()` latter.
Test Plan: Imported from OSS
Reviewed By: lw
Differential Revision: D22853480
Pulled By: mrshenli
fbshipit-source-id: 9d506813b9fd5b7c144885e2b76a863cbd19466a
Summary:
Raise and assert used to have a hard-coded error message "Exception". User provided error message was ignored. This PR adds support to represent user's error message in TorchScript.
This breaks backward compatibility because now we actually need to script the user's error message, which can potentially contain unscriptable expressions. Such programs can break when scripting, but saved models can still continue to work.
Increased an op count in test_mobile_optimizer.py because now we need aten::format to form the actual exception message.
This is built upon an WIP PR: https://github.com/pytorch/pytorch/pull/34112 by driazati
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41907
Reviewed By: ngimel
Differential Revision: D22778301
Pulled By: gmagogsfm
fbshipit-source-id: 2b94f0db4ae9fe70c4cd03f4048e519ea96323ad
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42141
Update alias db in-place instead of having to construct alias db from scratch on each change, causing O(n^2) behavior.
Description from https://github.com/pytorch/pytorch/pull/37106 holds pretty well:
"""
Recomputing the aliasdb on every fusion iteration + in every subblock
is hugely expensive. Instead, update it in-place when doing fusion.
The graph fuser pass operates by pushing nodes into a fusion group. So
we start with
`x, y = f(a, b, c)`
and end with:
```
x_out, y_out = prim::fusionGroup(a, b, c)
x_in, y_in = f(a_in, b_in, c_in)
-> x_in, y_in
```
We destroy the x and y Value*s in the process. This operation is
easy to express as an update to the aliasDb--x_out just takes on all
the aliasing information x used to have. In particular, since we know
f and prim::fusionGroup are purely functional, we don't have to mess
with any write information.
"""
The one difficulty here is mapping x, y to x_out, y_out is not trivial in merging nodes into the autodiff subgraph node.
There are a few options:
- attempt to make all subgraph utils & ir cloning logic update a map
- mirror the subgraph utils implementation in create_autodiff_subgraph
- uniquely map x, y and x_in, y_in so you can back out the correspondence.
I went with the third option.
This shouldn't affect the results of the pass at all. LMK if you think there's anything else I should be doing to test, I was thinking about maybe exposing an option to run create autodiff subgraphs without the post processor and check that the alias db was correctly updated.
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D22798377
Pulled By: eellison
fbshipit-source-id: 9a133bcaa3b051c0fb565afb23a3eed56dbe71f9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42222
This change adds the necessary passes to perform FP16 dynamic quantization.
We skip inserting observers for activations based on the dtype (torch.float16) and only insert the Fp16Observer for weights
Test Plan:
python test/test_quantization.py TestQuantizeJitOps
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D22849220
fbshipit-source-id: 2c53594ecd2485e9e3dd0b380eceaf7c5ab5fc50
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42139
A bunch of tests were failing with buck since we would output to
stdout and buck would fail parsing stdout in some cases.
Moving these print statements to stderr fixes this issue.
ghstack-source-id: 108606579
Test Plan: Run the offending unit tests.
Reviewed By: mrshenli
Differential Revision: D22779135
fbshipit-source-id: 789af3b16a03b68a6cb12377ed852e5b5091bbad
Summary:
View ops as outputs of differentiable subgraphs can cause incorrect differentiation. For now, do not include them in the subgraph. This was observed with our autograd tests for MultiheadAttention and nn.Transformer, which currently fail with the legacy executor. This commit fixes those test failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42027
Reviewed By: pbelevich
Differential Revision: D22798133
Pulled By: eellison
fbshipit-source-id: 2f6c08953317bbe013933c6faaad20100376c039
Summary:
xref gh-39002 which handled the reading but not the writing of the onnx expect files, and the last comment in that PR which points out `XXX` was suboptimal.
xref [this comment](https://github.com/pytorch/pytorch/pull/37091#discussion_r456460168) which pointed out the problem.
This PR:
- replaces `XXX` with `CURRENT_VERSION` in the stored files
- ensures that updating the results with the `--accept` flag will maintain the change
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41910
Reviewed By: pbelevich
Differential Revision: D22758671
Pulled By: ezyang
fbshipit-source-id: 47c345c66740edfc8f0fb9ff358047a41e19b554
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41930
As title
ghstack-source-id: 108517079
Test Plan: CI
Reviewed By: jerryzh168
Differential Revision: D22698386
fbshipit-source-id: 4f748c9bae4a0b615aa69c7cc8d8e451e5d26863
Summary:
Reland PR https://github.com/pytorch/pytorch/issues/40056
A new overload of upsample_linear1d_backward_cuda was added in a recent commit, so I had to add the nondeterministic alert to it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41538
Reviewed By: zou3519
Differential Revision: D22608376
Pulled By: ezyang
fbshipit-source-id: 54a2aa127e069197471f1feede6ad8f8dc6a2f82
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41558
The problem was due to non-deterministic destruction order of two global static variables: the mutexes used by glog and the RPC agent (which was still set because we didn't call `rpc.shutdown()`). When the TensorPipe RPC agent shuts down some callbacks may fire with an error and thus attempt to log something. If the mutexes have already been destroyed this causes a SIGABRT.
Fixes https://github.com/pytorch/pytorch/issues/41474
ghstack-source-id: 108231453
Test Plan: Verified in https://github.com/pytorch/pytorch/issues/41474.
Reviewed By: fmassa
Differential Revision: D22582779
fbshipit-source-id: 63e34d8a020c4af996ef079cfb7041b2474e27c9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37587
Lifting RecordFunction up into the dispatcher code
Test Plan: Imported from OSS
Differential Revision: D21374246
fbshipit-source-id: 19f9c1719e6fd3990e451c5bbd771121e91128f7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41575
Fixes https://github.com/pytorch/pytorch/issues/34294
This updates the C++ argument parser to correctly handle `TensorList` operands. I've also included a number of updates to the testing infrastructure, this is because we're now doing a much more careful job of testing the signatures of aten kernels, using the type information about the arguments as read in from `Declarations.yaml`. The changes to the tests are required because we're now only checking for `__torch_function__` attributes on `Tensor`, `Optional[Tensor]` and elements of `TensorList` operands, whereas before we were checking for `__torch_function__` on all operands, so the relatively simplistic approach the tests were using before -- assuming all positional arguments might be tensors -- doesn't work anymore. I now think that checking for `__torch_function__` on all operands was a mistake in the original design.
The updates to the signatures of the `lambda` functions are to handle this new, more stringent checking of signatures.
I also added override support for `torch.nn.functional.threshold` `torch.nn.functional.layer_norm`, which did not yet have python-level support.
Benchmarks are still WIP.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34725
Reviewed By: mruberry
Differential Revision: D22357738
Pulled By: ezyang
fbshipit-source-id: 0e7f4a58517867b2e3f193a0a8390e2ed294e1f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41218
This test doesn't assert anything and was accidentally committed as
part of a larger diff a few months ago.
ghstack-source-id: 107882848
Test Plan: CI
Reviewed By: ezyang
Differential Revision: D22469852
fbshipit-source-id: 0baa23da56b08200e16cf66df514566223dd9b15
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41217
Fixes this flaky test. Due to the possibility of callback
finishCreatingOwnerRRef running after request_callback has processed and
created the owner RRef, we could actually end up with 0 owners on the node,
since the callback removes from the owners_ map. In this case, shutdown is fine
since there are no owners. On the other hand, if the callback runs first, there
will be 1 owner which we will delete in shutdown when we detect it has no
forks. So either way, shutdown works fine and we don't need to enforce there to
be 1 owner.
ghstack-source-id: 107883497
Test Plan: Ran the test 500 times with TSAN.
Reviewed By: ezyang
Differential Revision: D22469806
fbshipit-source-id: 02290d6d5922f91a9e2d5ede21d1cf1c4598cb46
Summary:
This test function is confusing since our `assertEqual` behavior allows for tolerance to be specified, and this is a redundant mechanism.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41514
Reviewed By: ngimel
Differential Revision: D22569348
Pulled By: mruberry
fbshipit-source-id: 2b2ff8aaa9625a51207941dfee8e07786181fe9f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41287
Profiler tests that test profiling with builtin functions and `test_callback_simple` test has been broken for a while. This diff fixes that by preferring c10 ops to non-c10 ops in our operation matching logic.
The result of this is that these ops go through the c10 dispatch and thus have profiling enabled. For `test_callback_simple` this results in the effect that we choose `aten::add.Tensor` over `aten::add.Int` which fixes the type issue.
Test Plan:
Ensured that the tests are no longer flaky by running them a bunch
of times.
Reviewed By: vincentqb
Differential Revision: D22489197
fbshipit-source-id: 8452b93e4d45703453f77d968350c0d32f3f63fe
Summary:
Per title. `lgamma` produces a different result for `-inf` compared to scipy, so there comparison is skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41225
Differential Revision: D22473346
Pulled By: ngimel
fbshipit-source-id: e4ebda1b10e2a061bd4cef38d1d7b5bf0f581790
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39631
Background:
Currently, we cannot send ScriptModule over RPC as an argument.
Otherwise, it would hit the following error:
> _pickle.PickleError: ScriptModules cannot be deepcopied using
> copy.deepcopy or saved using torch.save. Mixed serialization of
> script and non-script modules is not supported. For purely
> script modules use my_script_module.save(<filename>) instead.
Failed attempt:
tried to install `torch.jit.ScriptModule` to RPC's
dispatch table, but it does not work as the dispatch table only
matches exact types and using base type `torch.jit.ScriptModule`
does not work for derived typed.
Current solution:
The current solution exposes `_enable_jit_rref_pickle` and
`_disable_jit_rref_pickle` APIs to toggle the `allowJitRRefPickle`
flag. See `test_pickle_script_module_with_rref` as an example.
Test Plan: Imported from OSS
Differential Revision: D21920870
Pulled By: mrshenli
fbshipit-source-id: 4d58afce5d0b4b81249b383c173488820b1a47d6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40902
See the bottom of this stack for context.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D22360210
Pulled By: suo
fbshipit-source-id: 4275127173a36982ce9ad357aa344435b98e1faf
Summary:
There's a regression in MIOpen in ROCm3.5 that results in failure of autocast tests. Skipping the tests for now and will re-enable once the fixes are in MIOpen.
ezyang jeffdaily sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41088
Differential Revision: D22419823
Pulled By: xw285cornell
fbshipit-source-id: 347fb9a03368172fe0b263d14d27ee0c3efbf4f6