Original PR: #77295
Original commit message:
On GPU, conv errors if not all its inputs have the same dtype.
In the case of autocasting during freezing, what we see is:
1) inputs to conv are cast to half
2) inputs to batchnorm are not cast, so many of them are still float
3) we try to fold conv + batchnorm by finding a new weight and bias such that conv(input, new_weight, new_bias) is equivalent to the original conv -> batchnorm.
If conv previously had no bias (it is optional), then during freezing we temporarily create a zero-valued bias as a placeholder for conv_bias. We want to construct it with the same dtype as conv's weight input, to avoid dtype-mismatch errors on GPU.
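A minimal sketch of the folding math and the placeholder bias, assuming a 2-D conv; the helper name `fold_conv_bn` is illustrative and not the actual freezing-pass code:
```python
import torch

def fold_conv_bn(conv_w, conv_b, bn_mean, bn_var, bn_gamma, bn_beta, eps=1e-5):
    # If conv had no bias, create a zero placeholder with the *weight's* dtype
    # and device, so that under autocast (weight already half) the folded bias
    # does not reintroduce a float/half mismatch on GPU.
    if conv_b is None:
        conv_b = torch.zeros(conv_w.size(0), dtype=conv_w.dtype, device=conv_w.device)
    # Fold the batchnorm affine transform into the conv weight and bias.
    scale = (bn_gamma / torch.sqrt(bn_var + eps)).to(conv_w.dtype)
    new_w = conv_w * scale.reshape(-1, 1, 1, 1)
    new_b = (conv_b - bn_mean.to(conv_b.dtype)) * scale + bn_beta.to(conv_b.dtype)
    return new_w, new_b
```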
Reland changes:
There is a memory leak from the CUDA caching allocator as a side effect of this fix. The leak causes the test to fail, though for some reason it did not fail on CI in the last PR. This reland skips the affected tests for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77617
Approved by: https://github.com/eellison
Summary:
This PR creates a best practices guideline for debugging quantization
accuracy. The content here comes from https://fburl.com/gdoc/nzlzxeaf,
with experimental and Meta-only parts left out.
For now, a lot of the debugging is manual, with the Numeric Suite being the
only tool we have to help the user find root causes of quantization
inaccuracies. As we build additional tools for equalization detection,
outlier detection, etc., we will add them to this page.
Test plan:
```
cd docs
make html
cd build/html
python -m http.server
// result renders well in browser
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77536
Approved by: https://github.com/hx89
Partially fixes #69813
This PR does mainly 3 things:
1. Introduces new methods for the `MetaBase` API:
- `set_output_strided`: creates proxy tensors with exact strides, if strides don't match
- `set_output_contiguous`: alias for `set_output_strided` with contiguous strides
- `set_output_raw_strided`: does not create proxy tensors
2. Modifies codegen for handling proxy tensors:
- Creates a new field for out-of-place kernels: `proxy_output_`
- Implements `set_output_strided` by creating a proxy tensor if necessary
- Passes the proxy tensor to the `IMPL` function
- Copies the result back to the real output at the end, whenever a proxy was created (a rough Python analogue follows this list)
3. Replaces `set_output` with `set_output_raw_strided` for `TensorIterator*`
- Needed, since `TensorIterator` overrides `set_output`
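As a rough Python analogue of what the generated C++ does when a proxy output is needed (the function and parameter names here are illustrative, not the real codegen):
```python
import torch

def run_structured_kernel_with_proxy(out, sizes, strides, impl):
    # If the provided `out` tensor does not already have the sizes/strides the
    # meta function computed, allocate a proxy with exactly those strides.
    needs_proxy = tuple(out.shape) != tuple(sizes) or out.stride() != tuple(strides)
    if needs_proxy:
        target = torch.empty_strided(sizes, strides, dtype=out.dtype, device=out.device)
    else:
        target = out
    impl(target)  # the IMPL function writes into the proxy (or directly into `out`)
    if needs_proxy:
        # Copy the proxy's result back into the real output at the end.
        out.resize_(sizes)
        out.copy_(target)
    return out
```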
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76096
Approved by: https://github.com/ezyang
Summary: We were handling constant attrs in a few different ways before, leading to confusion and missed handling for fused dtypes. This diff consolidates some of that code and fixes the current breakage.
Test Plan: CI. Recently broken tests now pass.
Differential Revision: D36335238
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77401
Approved by: https://github.com/jaybean-dev, https://github.com/jamesr66a
- Uses state dict / load state dict hooks to ensure that modules wrapped with `CheckpointWrapper` can be loaded into a module that is not checkpoint-wrapped.
This is needed because a training run may use activation checkpointing and save a `state_dict`, while a future run may not want to wrap modules with activation checkpointing, or may change the checkpoint wrapping structure. To support this, we add hooks that remove / add the relevant prefix as needed.
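A minimal sketch of the prefix handling, assuming the wrapper stores the inner module under an attribute like `_checkpoint_wrapped_module` (that name and the hook signatures are illustrative rather than the exact implementation):
```python
_PREFIX = "_checkpoint_wrapped_module."  # assumed wrapper attribute name

def post_state_dict_hook(module, state_dict, prefix, *args):
    # Strip the wrapper prefix so a checkpoint-wrapped module's state_dict can
    # be loaded into a module that was never wrapped.
    for key in list(state_dict.keys()):
        if key.startswith(prefix + _PREFIX):
            new_key = prefix + key[len(prefix + _PREFIX):]
            state_dict[new_key] = state_dict.pop(key)
    return state_dict

def pre_load_state_dict_hook(module, state_dict, prefix, *args):
    # Re-insert the wrapper prefix so a plain state_dict can be loaded into a
    # CheckpointWrapper-wrapped module.
    for key in list(state_dict.keys()):
        if key.startswith(prefix):
            new_key = prefix + _PREFIX + key[len(prefix):]
            state_dict[new_key] = state_dict.pop(key)
```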
Tests are added to ensure we can load a CheckpointWrapper-wrapped module's state_dict into both another CheckpointWrapper module and a local (unwrapped) module. state_dict with FSDP is also verified.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77224
Approved by: https://github.com/zhaojuanmao
Otherwise, it's possible to build TensorPipe with one version of libuv
and gloo with another.
Also, delete the strange `GLOO_INSTALL` logic, as none of the install artifacts are actually packaged as part of PyTorch (it was probably only used by Caffe2 builds).
This helps solve the problem of compiling PyTorch for M1, where `libuv` is not available in conda.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77312
Approved by: https://github.com/seemethere
gather_object is problematic when used with Tensors, as they can unpickle on the wrong
device and lead to deadlocks or spurious failures.
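A minimal sketch of the usual mitigation when gathering objects that contain tensors over a NCCL process group (assumes the process group is already initialized and each rank owns one GPU):
```python
import torch
import torch.distributed as dist

def gather_metrics(rank, world_size, metrics):
    # gather_object pickles the object into a tensor for communication; unless
    # each rank has pinned its CUDA device, that tensor can land (and unpickle)
    # on the wrong GPU and hang or fail. Pinning the device per rank avoids
    # this; keeping tensors inside `metrics` on CPU is another common workaround.
    torch.cuda.set_device(rank)
    gathered = [None] * world_size if rank == 0 else None
    dist.gather_object(metrics, gathered, dst=0)
    return gathered
```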
This change introduces an RPC workaround for EFA when initializing TensorPipe until
they properly address it.
Fixes #73935
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77272
Approved by: https://github.com/pritamdamania87
Summary: The root module may have different forward functions. The current implementation assumes that only the function `forward` can be traced. In this PR, we add a function-name attribute to the `Tracer` class so that users can trace other functions (see the sketch below).
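A short sketch of the intended usage; the attribute name `traced_func_name` is assumed here and may differ from the exact spelling in the PR:
```python
import torch
from torch import fx

class TwoHeads(torch.nn.Module):
    def forward(self, x):
        return x + 1

    def other_forward(self, x):
        return x * 2

tracer = fx.Tracer()
tracer.traced_func_name = "other_forward"  # assumed attribute added by this PR
graph = tracer.trace(TwoHeads())
gm = fx.GraphModule(TwoHeads(), graph)
print(gm.code)  # the generated code reflects `other_forward`, not `forward`
```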
Test Plan:
python3 test/test_fx.py TestFX.test_trace_multiple_funcs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77502
Approved by: https://github.com/jamesr66a
On GPU, conv errors if not all its inputs have the same dtype.
In the case of autocasting during freezing, what we see is:
1) inputs to conv are cast to half
2) inputs to batchnorm are not cast, so many of them are still float
3) we try to fold conv + batchnorm by finding a new weight and bias such that conv(input, new_weight, new_bias) is equivalent to the original conv -> batchnorm.
If conv previously had no bias (it is optional), then during freezing we temporarily create a zero-valued bias as a placeholder for conv_bias. We want to construct it with the same dtype as conv's weight input, to avoid dtype-mismatch errors on GPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77295
Approved by: https://github.com/eellison
Per title.
Before this PR, `flip` throws errors on invalid inputs from the ATen implementation itself, not from error checks happening in prims/refs.
We should make sure that prims/refs do all the necessary error checking (@mruberry is going to test that by moving reference error-input testing to call meta implementations instead of real ones).
In general, most error checking should live in refs; prim meta functions should propagate the necessary properties but assume that they are getting valid inputs. The checks on the inputs should happen in refs, where they can be traced to the necessary guards or lead to RuntimeErrors during tracing. A sketch of this split follows.
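A minimal sketch of the intended split, with the validation in a hypothetical ref-style `flip_ref` and plain `torch.flip` standing in for the prim it would dispatch to:
```python
import torch

def flip_ref(a, dims):
    # Ref level: validate user input so the prim / meta function can assume
    # it only ever sees valid arguments.
    rank = max(a.dim(), 1)
    seen = set()
    for d in dims:
        if not -rank <= d < rank:
            raise RuntimeError(
                f"Dimension out of range (expected to be in range of "
                f"[{-rank}, {rank - 1}], but got {d})")
        d = d % rank
        if d in seen:
            raise RuntimeError(f"dim {d} appears multiple times in the list of dims")
        seen.add(d)
    # With inputs validated, dispatch to the underlying implementation
    # (a prim in the real refs; plain torch.flip here for illustration).
    return torch.flip(a, tuple(dims))
```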
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77500
Approved by: https://github.com/mruberry