Summary:
Fixes https://github.com/pytorch/pytorch/issues/62600
Adds a `bazel --config=no-tty` option, which is useful for less verbose output in environments that don't implement a full TTY, such as CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62601
Reviewed By: soulitzer
Differential Revision: D30070154
Pulled By: malfet
fbshipit-source-id: 5b89af8441c3c6c7ca7e9a0ebdfddee00c9ab576
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61927
This is a refactor of `SavedVariable.cpp` to prevent ever defining the `data_` tensor if default hooks are set.
Before the refactor:
```c++
data_ = variable.tensor_data(); // this is wasteful if hooks are defined
register_hooks(Engine::get_default_engine().get_default_saved_variable_hooks());
```
After the refactor:
```c++
if (get_default_hooks_()) {
  save_metadata_(variable);
  register_hooks_(get_default_hooks_(), variable);
  return;
}
save_metadata_(variable);
data_ = variable.tensor_data(); // only needed if hooks are not defined
```
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D29848524
Pulled By: Varal7
fbshipit-source-id: abca1eee37a17b47841e28d8a576490913fce1ce
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61899
Tracking issue: #55070
This PR also removes `at::_cumprod`, which was the "backend" for `at::cumprod`.
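For reference, the user-facing op is unaffected; a minimal sketch of `at::cumprod`'s Python surface:
```
import torch

# Only the internal at::_cumprod helper is removed; torch.cumprod
# (backed by at::cumprod) behaves as before.
x = torch.tensor([1.0, 2.0, 3.0, 4.0])
print(torch.cumprod(x, dim=0))  # tensor([ 1.,  2.,  6., 24.])
```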
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D29939489
Pulled By: ezyang
fbshipit-source-id: d5e4a6dfa6c79e4b135508ea13c2d11bd0684f63
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61899
Tracking issue: #55070
This PR also removes `at::_cumprod`, which was the "backend" for `at::cumprod`.
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D29939152
Pulled By: ezyang
fbshipit-source-id: b3379033a1ffe3c7bc8216d16d089d388ea559ba
Summary:
Enables Gelu bf16/fp32 in the CPU path using the MKL-DNN implementation; users don't need to call `to_mkldnn()` explicitly. The new Gelu fp32 path performs better than the original one.
Adds Gelu backward for https://github.com/pytorch/pytorch/pull/53615.
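A usage sketch of the enabled path (shapes are arbitrary; the dispatch to MKL-DNN is implicit, per the summary above):
```
import torch
import torch.nn.functional as F

# fp32 gelu on CPU, no explicit to_mkldnn() needed; this also exercises
# the newly added gelu backward.
x32 = torch.randn(64, 256, requires_grad=True)
F.gelu(x32).sum().backward()

# bf16 gelu on CPU.
x16 = torch.randn(64, 256, dtype=torch.bfloat16)
y16 = F.gelu(x16)
```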
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58525
Reviewed By: ejguan
Differential Revision: D29940369
Pulled By: ezyang
fbshipit-source-id: df9598262ec50e5d7f6e96490562aa1b116948bf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62628
This fixes the following error when the ROCm compiler is used:
```
caffe2/aten/src/ATen/core/TensorAccessor.h:160:5: error: throw is prohibited in AMP-restricted functions
TORCH_CHECK_INDEX(
^
```
Test Plan: CI
Reviewed By: zhouzhuojie, seemethere
Differential Revision: D30059737
fbshipit-source-id: d094ee608768db41fcc91d044c2c6d7d29f33fe4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62632
Update the caffe2/core/context.h to directly use `at::mt19937` instead of the
`at::CPUGeneratorImpl` wrapper class from the ATen-cpu library.
Using `at::CPUGeneratorImpl` causes circular dependencies between the ATen and
caffe2 code. In particular the `at::CPUGeneratorImpl::get_state()` logic
depends on CPU Tensor functionality that currently depends on code from
caffe2.
Test Plan:
The RNG behavior should be identical to the previous code (perhaps even
faster, since we now avoid virtual function calls).
buck test //caffe2/caffe2:caffe2_test_cpu \
//caffe2/caffe2/python: //caffe2/caffe2/fb/operators:
Differential Revision: D29915701
fbshipit-source-id: f9b2eab8d3b21b2224d30bcf52be9c0e7eb7cd0a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62618
ShardedTensor implicitly initialized RRefs to remote shards if the RPC
framework was initialized. However, there are use cases where the RPC
framework is initialized for a different purpose, and users would prefer that
ShardedTensor not initialize RRefs as well.
As a result, I've made RRef initialization explicit in the ShardedTensor APIs.
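A hedged sketch of the explicit opt-in; the module path, the `ChunkShardingSpec` usage, and the `init_rrefs` keyword are assumptions based on this summary:
```
import torch.distributed._sharded_tensor as sharded_tensor
from torch.distributed._sharding_spec import ChunkShardingSpec

spec = ChunkShardingSpec(
    dim=0,
    placements=["rank:0/cuda:0", "rank:1/cuda:1"],
)
# RRefs to remote shards are no longer created implicitly just because
# RPC is initialized; the caller opts in explicitly (keyword name assumed):
st = sharded_tensor.empty(spec, 10, 20, init_rrefs=True)
```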
ghstack-source-id: 134889287
Test Plan:
1) waitforbuildbot
2) unit tests.
Reviewed By: wanchaol
Differential Revision: D30056833
fbshipit-source-id: 9b2433a38dafa1888589c5b72ed93b6f0ee51639
Summary:
The BLAS library is found by cmake/Dependencies.cmake, and then the LAPACK
library is found by FindLAPACK.cmake, which in turn calls FindBLAS.cmake.
This means we search for BLAS twice, and the two searches might find
different things. Setting a few variables avoids this.
cc seemethere
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49647
Reviewed By: seemethere, ejguan
Differential Revision: D29943680
Pulled By: malfet
fbshipit-source-id: 3cbc350ea645a1a28dd92c19e5ee7f9eecdeff59
Summary: The comment above makes it seem intentional, so just ignore it.
Test Plan: NFC
Reviewed By: smeenai
Differential Revision: D30057632
fbshipit-source-id: 45929b4eeeefdf22f5c7c4dd603229635f9da31b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62242
1) Add a state_dict hook to ensure ShardedTensors are
added to a state_dict.
2) Add a pre-load state_dict hook to ensure ShardedTensors are added back to a
module at load time.
3) Add a `with_load_process_group` context manager for load time (a sketch
follows this list).
4) Added ser-de capability to ShardedTensor.
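A hedged sketch of the load-time flow; the import path, the checkpoint name, and `model` are assumptions, and the context-manager name is taken from item 3:
```
import torch
import torch.distributed as dist
from torch.distributed._sharded_tensor import with_load_process_group

# Deserialize ShardedTensors (item 4) under an explicit process group,
# then let the pre-load hook (item 2) restore them onto the module.
pg = dist.new_group(ranks=[0, 1])
with with_load_process_group(pg):
    state_dict = torch.load("checkpoint.pt")
model.load_state_dict(state_dict)
```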
ghstack-source-id: 134860967
Test Plan:
1) unit tests
2) waitforbuildbot
Reviewed By: wanchaol
Differential Revision: D29927881
fbshipit-source-id: b1ef8872ed91e9cb0e2d5dd17d2764678ab89f0c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62610
Prefixes intermediate build image tags with `build-` so that ECR lifecycle
policies can automatically clean them up.
Policy to automatically clean up images prefixed with `build-`: b02dd818f9
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: walterddr
Differential Revision: D30055952
Pulled By: seemethere
fbshipit-source-id: 328b9c94ffc02877d088d0118a19c732f580838b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62564
If the user runs code that registers default saved tensor hooks from
multiple threads, it will fail with a nice error message most of the
time. This commit handles the very rare case where a race condition
would have made it fail silently.
Relanding previous PR #61957
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D30045406
Pulled By: Varal7
fbshipit-source-id: d04f74c99affbbf655e53cfc2acd42f7c5b4e6eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62592
Reland #62510
`GradBucket` is an important class defined in both C++ and Python, used for PyTorch Distributed Training. We need to rename the following methods for simplicity:
1) get_index -> index
2) is_the_last_bucket_to_allreduce -> is_last
3) get_per_parameter_tensors -> gradients
4) get_model_params_for_bucket -> parameters
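A small sketch exercising the renamed methods on a bucket, as they would appear inside a communication hook body (the surrounding hook plumbing is omitted):
```
def inspect_bucket(bucket):
    # New names on GradBucket, per the list above:
    idx = bucket.index()          # was: get_index()
    last = bucket.is_last()       # was: is_the_last_bucket_to_allreduce()
    grads = bucket.gradients()    # was: get_per_parameter_tensors()
    params = bucket.parameters()  # was: get_model_params_for_bucket()
    return idx, last, list(zip(params, grads))
```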
ghstack-source-id: 134848352
Test Plan: unit test
Reviewed By: andwgu
Differential Revision: D30049431
fbshipit-source-id: 1bcac331aa30e529b7230e3891bc811c531b0ea9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62239
Added a naive implementation of Vulkan softmax (not using shared memory).
Based on the naive implementation of mean, found here:
2565a33c98/aten/src/ATen/native/vulkan/glsl/mean.glsl
Test Plan:
After building:
```
build/bin/vulkan_api_test
```
```
[ RUN ] VulkanAPITest.softmax
[ OK ] VulkanAPITest.softmax (180 ms)
```
Reviewed By: SS-JIA
Differential Revision: D29793150
fbshipit-source-id: 4f9d8e1dae8a43cbcb7063b095fa4726df06c929
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61083
BFloat16 `nansum` is already supported on CUDA, so it seems reasonable to
support it on CPU as well. This also changes `test_nansum` to compare against
`torch.sum`, since NumPy doesn't support BFloat16. Note that
`test_nansum_vs_numpy` still checks against NumPy, so that path is still
being tested.
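A quick sketch of the newly supported path:
```
import torch

# BFloat16 nansum on CPU: NaNs are treated as zero, matching CUDA.
x = torch.tensor([1.0, float("nan"), 2.0], dtype=torch.bfloat16)
print(torch.nansum(x))  # tensor(3., dtype=torch.bfloat16)
```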
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D30006227
Pulled By: heitorschueroff
fbshipit-source-id: 1449730e1936417e7de1f8b3cf8cdcc15518873c
Summary: If we let torch_deploy get put into libomnibus, it hides the symbols we need to link against.
Test Plan: buck test //caffe2/torch/csrc/deploy:test_deploy_from_python -- --exact 'caffe2/torch/csrc/deploy:test_deploy_from_python - test_deploy_from_python (caffe2.torch.csrc.deploy.test_deploy_from_python.TestDeployFromPython)' --run-disabled
Reviewed By: wconstab
Differential Revision: D30031134
fbshipit-source-id: e5c2f740f17abafec7d01c57c664bd71a00b6f61
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62201
Changes `int` to `uint` to match the runtime's bytecode type. This only affects C++, since Python doesn't have uints, IIRC. Also changes the behavior of the functions from returning -1 with a warning to throwing an exception. I wasn't sure what the proper behavior here would be (returning UINT_MAX seemed gross), so feedback is appreciated.
Test Plan: ci
Reviewed By: raziel
Differential Revision: D29914072
fbshipit-source-id: 1bb08702fc301d7c7612b5ad7205a6dbe855c890
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62436
## Problem
Given two modules and a tracer that indiscriminately marks all modules as a leaf:
```
class InnerModule(torch.nn.Module):
    def forward(self, t):
        return t + t

class MyModule(torch.nn.Module):
    def __init__(self, inner):
        super().__init__()
        self.inner = inner

    def forward(self, t):
        x = self.inner(t)
        y = self.inner(t)
        return x + y

class MyTracer(torch.fx.Tracer):
    def is_leaf_module(self, module, name):
        return True
```
One might generally expect the following behavior (note call_module nodes):
```
print(">> Outer GraphModule (with inner module as nn.Module):")
inner = InnerModule()
m = MyModule(inner)
gm = torch.fx.GraphModule(m, MyTracer().trace(m))
print(gm.graph.print_tabular())
>> Outer GraphModule (with inner module as nn.Module):
opcode name target args kwargs
------------- ------- ----------------------- ---------------- --------
placeholder t t () {}
call_module inner inner (t,) {}
call_module inner_1 inner (t,) {}
call_function add <built-in function add> (inner, inner_1) {}
output output output (add,) {}
None
```
However, when the inner module is first symbolically traced, the symbolic trace of the outer module ignores `is_leaf_module` entirely and traces through the whole module (note the call_function nodes).
```
print(">> Inner module as GraphModule:")
inner = InnerModule()
inner_gm = torch.fx.GraphModule(inner, MyTracer().trace(inner))
print(inner_gm.graph.print_tabular())
print(">> Outer GraphModule (with inner module as GraphModule):")
m = MyModule(inner_gm)
gm = torch.fx.GraphModule(m, MyTracer().trace(m))
print(gm.graph.print_tabular())
>> Inner module as GraphModule:
opcode name target args kwargs
------------- ------ ----------------------- ------ --------
placeholder t t () {}
call_function add <built-in function add> (t, t) {}
output output output (add,) {}
None
>> Outer GraphModule (with inner module as GraphModule):
opcode name target args kwargs
------------- ------ ----------------------- ------------ --------
placeholder t t () {}
call_function add <built-in function add> (t, t) {}
call_function add_1 <built-in function add> (t, t) {}
call_function add_2 <built-in function add> (add, add_1) {}
output output output (add_2,) {}
None
```
This is surprising behavior and at first glance violates the tracer's intent. As I understand it, `torch.fx.symbolic_trace.Tracer.trace` intends to patch `torch.nn.Module.__call__` with a `module_call_wrapper()` that records a `call_module` node if the module is a leaf, and otherwise executes `torch.fx._symbolic_trace._orig_module_call` (which is set to `torch.nn.Module.__call__` at module load time).
**Every submodule should be a leaf, but no `call_module` nodes are created when that submodule is a `GraphModule`. Why?**
Upon further inspection, I found:
- The constructor for GraphModule includes a path to `GraphModule.recompile()` via the setter for a `fx.Graph`:
```
inner_gm = torch.fx.GraphModule(inner, MyTracer().trace(inner))
File "/torch/fx/graph_module.py", line 252, in __init__
self.graph = graph
File "/torch/nn/modules/module.py", line 1183, in __setattr__
object.__setattr__(self, name, value)
File "/torch/fx/graph_module.py", line 277, in graph
self.recompile()
```
- `recompile()` wraps the `__call__` method by holding a reference to the `__call__` method at the time of recompilation:
```
cls = type(self)
cls_call = cls.__call__
...
def wrapped_call(self, *args, **kwargs):
    try:
        return cls_call(self, *args, **kwargs)
    except Exception as e:
        ...
cls.__call__ = wrapped_call
```
- Recompilation of the inner GraphModule happens on initialization, before creation or tracing of the outer module. Adding some old-fashioned print debug statements gives:
```
Inner Module:
_orig_module_call: <function Module._call_impl at 0x7faaebfee8b0>
recompile: cls.__call__ now wraps _orig_module_call, <function Module._call_impl at 0x7faaebfee8b0>
Outer Module:
_orig_module_call: <function Module._call_impl at 0x7faaebfee8b0>
tracing: patching method <class 'torch.nn.modules.module.Module'>.__call__ <function Module._call_impl at 0x7faaebfee8b0> with <function Module._call_impl at 0x7fa9d42bce50>
outer module MRO before tracing:
(0) <class '__main__.MyModule'>: <function Module._call_impl at 0x7faaebfee8b0>
(1) <class 'torch.nn.modules.module.Module'>: <function Module._call_impl at 0x7faaebfee8b0>
(2) <class 'object'>: <method-wrapper '__call__' of type object at 0x7fac3cd15f00>
outer module MRO during tracing:
(0) <class '__main__.MyModule'>: <function Module._call_impl at 0x7fa9d42bce50>
(1) <class 'torch.nn.modules.module.Module'>: <function Module._call_impl at 0x7fa9d42bce50>
(2) <class 'object'>: <method-wrapper '__call__' of type object at 0x7fac3cd15f00>
inner module MRO before tracing:
(0) <class 'torch.fx.graph_module.GraphModule.__new__.<locals>.GraphModuleImpl'>: <function x.y.z.wrapped_call at 0x7fa9d42a8670>
(1) <class 'torch.fx.graph_module.GraphModule'>: <function Module._call_impl at 0x7faaebfee8b0>
(2) <class 'torch.nn.modules.module.Module'>: <function Module._call_impl at 0x7faaebfee8b0>
(3) <class 'object'>: <method-wrapper '__call__' of type object at 0x7fac3cd15f00>
inner module MRO during tracing:
(0) <class 'torch.fx.graph_module.GraphModule.__new__.<locals>.GraphModuleImpl'>: <function x.y.z.wrapped_call at 0x7fa9d42a8670>
(1) <class 'torch.fx.graph_module.GraphModule'>: <function Module._call_impl at 0x7fa9d42bce50>
(2) <class 'torch.nn.modules.module.Module'>: <function Module._call_impl at 0x7fa9d42bce50>
(3) <class 'object'>: <method-wrapper '__call__' of type object at 0x7fac3cd15f00>
```
- The outer module is patched correctly, but the inner module's first element in its MRO is the `wrapped_call` from `recompile` that still invokes `<function Module._call_impl at 0x7faaebfee8b0>` directly. Therefore, no call_module nodes are created.
## In Practice
In practice, this behavior affects the ability of `torch.package` to package `GraphModules` whose submodules are `GraphModules`. In our case, the `GraphModule` submodules are not passed through a constructor, but created separately and installed on the root `GraphModule` via `setattr`. This means that prior to packaging, there appear to be no issues with the module, since the root's graph was created before any call_module targets were replaced with `GraphModules`.
When unpackaging such a model with `torch.package`, `torch.fx.graph_module._deserialize_graph_module` uses an inline `KeepModules` tracer that sets all submodules to leaves; the unpackaged module is implicitly and surprisingly inlined in the process.
## Potential Solution
This behavior was previously not understood by us, so the current workaround is a gnarly process of wrapping each submodule in an `nn.Module` with a manually installed forward method.
Changing `wrapped_call` to `return super(type(self), self).__call__(*args, **kwargs)` whenever `__call__` is inherited at least appears to solve the issue. Does this seem like an acceptable approach?
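A sketch of that proposal applied to the `recompile()` snippet above; whether `__call__` was inherited is recorded before wrapping (this illustrates the idea, not necessarily the final fix):
```
cls = type(self)
# Record, before wrapping, whether this class defines its own __call__.
has_own_call = "__call__" in vars(cls)
cls_call = cls.__call__

def wrapped_call(self, *args, **kwargs):
    try:
        if has_own_call:
            return cls_call(self, *args, **kwargs)
        # __call__ was inherited: resolve through the MRO at call time, so
        # a tracer's patch of Module.__call__ is actually picked up.
        return super(cls, self).__call__(*args, **kwargs)
    except Exception:
        # (original error-enhancing logic elided)
        raise

cls.__call__ = wrapped_call
```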
## Other Thoughts
- Repeated calls to `recompile` create nested `wrapped_call`s, all for the purpose of error handling. This nesting seems unnecessary ¯\\_(ツ)\_/¯
- If a root module with an overridden `__call__` method is symbolically traced, the override is ignored
Test Plan:
```
buck test:
✓ ListingSuccess: caffe2/test:fx - main (12.570)
✓ Pass: caffe2/test:fx - test_tracing_graphmodules_as_leaf_submodules (test_fx.TestFX) (11.982)
```
Reviewed By: ansley
Differential Revision: D29997935
fbshipit-source-id: 1988fbb025b14188da26a3e73e94fb789c3c1f74
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62532
This method is not stable at this time, so avoid releasing it when the DDP communication hook feature is released as stable.
ghstack-source-id: 134787831
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_hook_with_optimizer_parity
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_hook_then_optimizer_nccl
Reviewed By: rohan-varma
Differential Revision: D30031222
fbshipit-source-id: e03a8e13fee5116a5ddd724eb76316ee98f2a676
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62563
Expose a pair of functions to Python users: torch.autograd.graph.set_saved_tensors_default_hooks(pack, unpack) and torch.autograd.graph.reset_saved_tensors_default_hooks().
These functions control the hooks applied to saved tensors: all tensors saved in that context will be packed using the pack function, then unpacked accordingly when needed.
Currently, this works by simply calling register_hooks (cf #60975) directly at the end of the constructor of a SavedVariable. This could be optimized further by not performing the copy before registering default hooks, but this would require a small refactor. Edit: the refactor is done in #61927.
A current limitation is that if users create tensors in this context, they will not be able to register additional hooks on the saved tensor.
For instance, to perform something like #28997, one could define a pack function that saves to disk whenever the tensor size is too big and returns a filename, then unpack simply reads the content of the file and outputs a tensor, e.g.:
```
import os
import uuid

import torch

def pack(x):
    name = os.path.join(tmp_dir, str(uuid.uuid4()))
    torch.save(x, name)
    return name

def unpack(name):
    return torch.load(name)
```
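A minimal registration sketch using `pack`/`unpack` from above (the `tmp_dir` setup is an assumption):
```
import tempfile

import torch

tmp_dir = tempfile.mkdtemp()  # consumed by pack() above

torch.autograd.graph.set_saved_tensors_default_hooks(pack, unpack)
a = torch.randn(5, requires_grad=True)
y = (a * a).sum()  # 'a' is saved for backward via pack()
y.backward()       # the saved tensor is restored via unpack()
torch.autograd.graph.reset_saved_tensors_default_hooks()
```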
Relanding previous PR: https://github.com/pytorch/pytorch/pull/61834
Original PR led to timeout error in: https://www.internalfb.com/mast/job/yuguo-release_canary_offline_training-inlinecvrp_a-canary_offline_train_28a7ecfc
Now passing: https://www.internalfb.com/mast/job/quach-release_canary_offline_training-inlinecvrp_a-canary_offline_train_9bb57e98
The difference in the new version is that we don't need to acquire the GIL when calling `PyDefaultSavedVariableHooks::get_hooks`.
Test Plan: Imported from OSS
Reviewed By: iramazanli
Differential Revision: D30045405
Pulled By: Varal7
fbshipit-source-id: 7f6c07af3a56fe8835d5edcc815c15ea4fb4e332
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62515
**Summary**
This commit fixes an obvious typo in `Normalization.cu` I found while
working on #62452. Since that PR will not be landed anytime soon, I
thought it would be prudent to land this fix.
**Test Plan**
Continuous integration.
Test Plan: Imported from OSS
Reviewed By: makslevental
Differential Revision: D30027324
Pulled By: SplitInfinity
fbshipit-source-id: 9d368a54c13f8e2bf6f6956dfb2bee974226f726
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62510
`GradBucket` is an important class defined in both C++ and Python, used for PyTorch Distributed Training. We need to rename the following methods for simplicity:
1) get_index -> index
2) is_the_last_bucket_to_allreduce -> is_last
3) get_per_parameter_tensors -> gradients
4) get_model_params_for_bucket -> parameters
Test Plan:
Ran the comprehensive tests locally with the following results:
https://pxl.cl/1Ml8b
The two timeout test failures are most likely environment-related and fail
only on my devserver.
Reviewed By: SciPioneer
Differential Revision: D30024161
fbshipit-source-id: 07e6072a2f7b81f731425d9b71f8c8b60d383b0f
Summary:
Sobol was modified to not drop the first point. This update reflects that behavior in the docstring.
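A sketch of the documented behavior (the first draw now includes the origin rather than dropping it):
```
import torch

engine = torch.quasirandom.SobolEngine(dimension=3)
print(engine.draw(1))  # tensor([[0., 0., 0.]]) -- the first point is kept
```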
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62548
Reviewed By: qingfeng10
Differential Revision: D30035627
Pulled By: Balandat
fbshipit-source-id: 64c659ea30c0c929778da3b58041875834e25e9a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60886
Remove all the hypothesis tests from test_adaptive_avg_pool2d_nhwc, test_adaptive_avg_pool, and test_adaptive_avg_pool3d_ndhwc.
Test Plan: I tested it with `buck test //caffe2/test:quantization` and all three tests passed. The tests that failed are test_conv2d_api (test_quantized_functional.py) and test_conv3d_api (test_quantized_functional.py).
Reviewed By: wanchaol, jerryzh168
Differential Revision: D29432184
fbshipit-source-id: 2a4c540d2c169aec69cf8d143d5a155394885745
Summary:
**Overview:**
This adds two approaches to overlapping `DistributedDataParallel.backward()` with `ZeroRedundancyOptimizer.step()` by providing two hook constructors: `hook_with_zero_step()` and `hook_with_zero_step_interleaved()`. The former waits for all backward computation to finish before starting optimizer computation, while the latter launches a partial optimizer computation using the contents of a gradient bucket once that bucket's all-reduce completes. The two approaches each suffer from their own weaknesses, and which one to use depends on the specific hardware configuration.
Both approaches can share changes to `ZeroRedundancyOptimizer`. A user should pass `overlap_with_ddp=True` to `ZeroRedundancyOptimizer`, construct a DDP communication hook using either `hook_with_zero_step()` or `hook_with_zero_step_interleaved()`, and register that communication hook. `ZeroRedundancyOptimizer.step()` should still be called in the training loop, though the optimizer computation and communication actually originate from the communication hook. Currently, the first two iterations are vacuous, meaning they do not result in parameter updates and the inputs are ignored. This is required to finalize the DDP bucket strategy and to then initialize the `ZeroRedundancyOptimizer`'s local optimizer based on that bucketing.
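A hedged wiring sketch; the hook-constructor module path, `allreduce_hook`, and the training-loop names (`MyModel`, `rank`, `loader`, `loss_fn`) are assumptions, not taken from this PR:
```
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.distributed.algorithms.ddp_comm_hooks.default_hooks import allreduce_hook
from torch.distributed.algorithms.ddp_comm_hooks.ddp_zero_hook import hook_with_zero_step

model = DDP(MyModel().cuda(rank), device_ids=[rank])
zero = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.Adam,
    overlap_with_ddp=True,  # named in this summary
    lr=1e-3,
)
model.register_comm_hook(None, hook_with_zero_step(allreduce_hook, model, zero))

for inputs, targets in loader:
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    zero.step()  # optimizer work actually originates from the comm hook;
                 # the first two iterations are vacuous (see above)
```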
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62157
Test Plan:
The existing `ZeroRedundancyOptimizer` tests pass, and new unit tests for both hooks pass:
- ~~`test_ddp_with_zero_step_parity_cpu`~~ (removed for now due to flakiness in CI -- under investigation, could possibly be similar Gloo issue as with `hook_with_zero_step_interleaved()`)
- `test_ddp_with_zero_step_parity_gpu`
- `test_ddp_with_zero_step_interleaved_parity_gpu`
These were tested on the AI AWS cluster.
An analogous `test_ddp_with_zero_step_interleaved_parity_cpu` is missing due to existing bugs with Gloo. See https://github.com/pytorch/pytorch/pull/62302.
Both approaches have been verified using an internal accuracy benchmark.
Reviewed By: mrshenli
Differential Revision: D29971046
Pulled By: andwgu
fbshipit-source-id: a7234c23c7ea253f144a698fd7e3c0fe039de5e8
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60747
Enhances the C++ versions of `Transformer`, `TransformerEncoderLayer`, and `TransformerDecoderLayer` to support callables as their activation functions. The old way of specifying the activation function still works as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62342
Reviewed By: malfet
Differential Revision: D30022592
Pulled By: jbschlosser
fbshipit-source-id: d3c62410b84b1bd8c5ed3a1b3a3cce55608390c4
Summary:
This should address part of https://github.com/pytorch/pytorch/issues/62357.
1. Rename all generated files to 'generated-*' to make their origin clear; the filename will not be part of the CI workflow name
2. Remove all 'pytorch-' prefixes from names
3. Make sure the build/test shell scripts are adapted to the new names
The next change should reduce more device-related naming.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62402
Reviewed By: malfet
Differential Revision: D30021959
Pulled By: walterddr
fbshipit-source-id: 64b21a2020e25a507101c09c010cb593d8ac4146
Summary:
This PR: (1) enables the use of a system-provided Intel TBB for building PyTorch, (2) removes `tbb::task_scheduler_init` references, since it was removed from TBB a while ago, and (3) marks the implementation of `_internal_set_num_threads` with a TODO, as it requires a revision that fixes its thread-allocation logic.
Tested with `test/run_test`; no new tests are introduced since there are no behavioral changes (the removal of `tbb::task_scheduler_init` has no impact on runtime behavior).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61934
Reviewed By: malfet
Differential Revision: D29805416
Pulled By: cbalioglu
fbshipit-source-id: 22042b428b57b8fede9dfcc83878d679a19561dd