pytorch/test/distributed
Saurabh Mishra 134dfbeaef [DCP] DTensor slice dequantization with proper block alignment (#163532)
Summary:
When loading quantized tensors with DTensor slicing, dequantization produced numerically incorrect results because quantization blocks were mapped to slice coordinates incorrectly. The previous implementation calculated block boundaries relative to the sliced tensor's dimensions instead of the original full tensor's dimensions, so scale factors were applied to the wrong tensor regions.

This fix addresses the issue by:

1. **Proper coordinate mapping**: Added `_get_slice_to_block_mapping()` to correctly map tensor slices to quantization blocks using global coordinates from the full tensor shape.

2. **Block-aligned dequantization**: Updated `_dequantize_tensor()` to use proper block intersection logic, ensuring scale factors are applied to the correct portions of sliced tensors.

The fix ensures that when DTensor requests a slice of a quantized tensor, the dequantization correctly identifies which quantization blocks intersect with the requested slice and applies the appropriate scale factors to the right tensor regions.
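
The coordinate mapping described above can be sketched in plain Python. This is a minimal illustration, not the code from the PR: the names `slice_to_block_mapping` and `dequantize_slice`, the square `block_size`, and the nested-list tensors are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def slice_to_block_mapping(
    slice_start: int, slice_stop: int, block_size: int
) -> List[Tuple[int, int, int]]:
    """Map a 1-D slice [slice_start, slice_stop), given in GLOBAL coordinates
    of the full tensor, to the quantization blocks it intersects.

    Returns (block_index, local_start, local_stop) tuples, where the local
    range is relative to the start of the slice.
    """
    if slice_stop <= slice_start:
        return []
    mapping = []
    first_block = slice_start // block_size
    last_block = (slice_stop - 1) // block_size
    for block in range(first_block, last_block + 1):
        # Intersect the block's global extent with the requested slice.
        global_start = max(slice_start, block * block_size)
        global_stop = min(slice_stop, (block + 1) * block_size)
        mapping.append(
            (block, global_start - slice_start, global_stop - slice_start)
        )
    return mapping


def dequantize_slice(
    q_slice: Sequence[Sequence[float]],
    scales: Sequence[Sequence[float]],
    row_range: Tuple[int, int],
    col_range: Tuple[int, int],
    block_size: int,
) -> List[List[float]]:
    """Dequantize a 2-D slice by applying each block's scale factor only to
    the region of the slice that the block actually covers."""
    row_blocks = slice_to_block_mapping(row_range[0], row_range[1], block_size)
    col_blocks = slice_to_block_mapping(col_range[0], col_range[1], block_size)
    out = [[0.0] * len(row) for row in q_slice]
    for block_row, r0, r1 in row_blocks:
        for block_col, c0, c1 in col_blocks:
            scale = scales[block_row][block_col]
            for r in range(r0, r1):
                for c in range(c0, c1):
                    out[r][c] = q_slice[r][c] * scale
    return out
```

For a 4x4 tensor with 2x2 blocks, a slice covering global rows 1 to 3 spans two block rows, so `slice_to_block_mapping(1, 3, 2)` yields one entry per block row. If the same slice were described in slice-relative coordinates (rows 0 to 2), both rows would map to block row 0, which is the failure mode the summary describes.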

Test Plan:
Tested with DTensor configurations where quantized tensors are sliced across different dimensions. Verified that:
1. Dequantized tensor values are numerically correct
2. Block boundaries are properly calculated relative to the full tensor shape
3. Scale factors are applied to the correct tensor regions
4. The tensor shapes map is built efficiently using only metadata
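
The first three checks can be illustrated with a small self-contained consistency test: dequantizing a slice should give the same values as slicing the fully dequantized tensor. The sketch below is an assumption-laden toy (a 4x4 nested-list tensor, 2x2 blocks, and hypothetical helper names), not the test from the PR. The `use_global=False` path only mimics the described failure mode of computing block indices relative to the slice.

```python
BLOCK = 2  # hypothetical block size for this toy example
Q = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
SCALES = [[0.5, 2.0], [4.0, 8.0]]  # one scale per 2x2 block


def scale_at(row, col):
    """Scale factor of the block containing element (row, col)."""
    return SCALES[row // BLOCK][col // BLOCK]


# Reference: dequantize the full tensor, then slice it.
FULL = [[Q[r][c] * scale_at(r, c) for c in range(4)] for r in range(4)]


def reference(r0, r1, c0, c1):
    return [row[c0:c1] for row in FULL[r0:r1]]


def dequant_slice(r0, r1, c0, c1, use_global=True):
    """Dequantize only the requested slice.

    use_global=True looks up blocks with full-tensor coordinates;
    use_global=False uses slice-relative coordinates instead.
    """
    out = []
    for r in range(r0, r1):
        row = []
        for c in range(c0, c1):
            rr, cc = (r, c) if use_global else (r - r0, c - c0)
            row.append(Q[r][c] * scale_at(rr, cc))
        out.append(row)
    return out


def all_ranges(n):
    """Every non-empty half-open range [a, b) within [0, n)."""
    return [(a, b) for a in range(n) for b in range(a + 1, n + 1)]


global_mismatches = []
local_mismatches = []
for r0, r1 in all_ranges(4):
    for c0, c1 in all_ranges(4):
        expected = reference(r0, r1, c0, c1)
        if dequant_slice(r0, r1, c0, c1, use_global=True) != expected:
            global_mismatches.append((r0, r1, c0, c1))
        if dequant_slice(r0, r1, c0, c1, use_global=False) != expected:
            local_mismatches.append((r0, r1, c0, c1))
```

With global coordinates every slice matches the reference. With slice-relative coordinates, slices that do not start at the tensor origin, such as rows 1 to 3, pick up the wrong scale factors.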

Correctness validation using https://github.com/wwwjn/torchtitan/blob/dsv3-sd-test/tests/fsdp_dequantized_load.py
```
{
  "model.layers.0.mlp.gate_proj.weight": {
    "mse": 4.30626645453458e-11,
    "mae": 9.98388827611052e-07,
    "max_abs_diff": 0.0009703934192657471,
    "cosine_similarity": 1.010810375213623,
    "relative_error": 0.001330620958469808,
    "kl_divergence_1_to_2": "6.563401e-08",
    "kl_divergence_2_to_1": "-6.522914e-08",
    "js_divergence": 1.3711876079014476e-10,
    "shape": [
      18432,
      7168
    ],
    "t1_stats": {
      "min": -0.4453125,
      "max": 0.30859375,
      "mean": -1.2592146958922967e-05
    },
    "t2_stats": {
      "min": -0.44529813528060913,
      "max": 0.3085886240005493,
      "mean": -1.2624391274584923e-05
    }
  },
  "model.layers.0.mlp.up_proj.weight": {
    "mse": 2.5534721906361746e-11,
    "mae": 3.118609583907528e-06,
    "max_abs_diff": 0.00047551095485687256,
    "cosine_similarity": 1.038962483406067,
    "relative_error": 0.0013681650161743164,
    "kl_divergence_1_to_2": "-5.8253768e-08",
    "kl_divergence_2_to_1": "5.8747577e-08",
    "js_divergence": NaN,
    "shape": [
      18432,
      7168
    ],
    "t1_stats": {
      "min": -0.228515625,
      "max": 0.2333984375,
      "mean": 8.862222955485777e-08
    },
    "t2_stats": {
      "min": -0.2285017967224121,
      "max": 0.23338991403579712,
      "mean": 8.824501662729745e-08
    }
  },
  "model.layers.0.mlp.down_proj.weight": {
    "mse": 2.2803769289536646e-11,
    "mae": 2.8916260816913564e-06,
    "max_abs_diff": 0.0008973777294158936,
    "cosine_similarity": 1.0376262664794922,
    "relative_error": 0.001346255769021809,
    "kl_divergence_1_to_2": "1.2744896e-07",
    "kl_divergence_2_to_1": "-1.2736885e-07",
    "js_divergence": 5.992362162032805e-11,
    "shape": [
      7168,
      18432
    ],
    "t1_stats": {
      "min": -0.54296875,
      "max": 0.546875,
      "mean": -2.9487239316949854e-07
    },
    "t2_stats": {
      "min": -0.5429964661598206,
      "max": 0.5469087362289429,
      "mean": -2.9507478416235244e-07
    }
  }
}
```
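
For readers unfamiliar with the metric names in the output above, the following sketch shows the standard textbook definitions of four of them. It is not taken from the linked validation script, which may compute them differently (for example on tensors in a lower precision); the function name `compare_tensors` and the flat-list inputs are assumptions made for illustration.

```python
import math
from typing import Dict, Sequence


def compare_tensors(a: Sequence[float], b: Sequence[float]) -> Dict[str, float]:
    """Compare two flattened tensors of equal length using standard metrics."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    denom = norm_a * norm_b
    return {
        # Mean squared error.
        "mse": sum(d * d for d in diffs) / n,
        # Mean absolute error.
        "mae": sum(abs(d) for d in diffs) / n,
        # Largest element-wise absolute difference.
        "max_abs_diff": max(abs(d) for d in diffs),
        # Cosine of the angle between the two vectors (NaN if either is zero).
        "cosine_similarity": dot / denom if denom != 0.0 else float("nan"),
    }
```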

https://www.internalfb.com/intern/testinfra/testrun/3940649985202645

Differential Revision: D82975005

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163532
Approved by: https://github.com/wwwjn
2025-09-23 16:48:16 +00:00
_composable [FSDP2] idempotent reset_sharded_param: no-op if _local_tensor is already padded (#163130) 2025-09-18 09:20:37 +00:00
_pycute [CuTe] Change the logic of pycute manipulation ops like coalesce, complement from co-lex to lex (#162690) 2025-09-16 19:53:45 +00:00
_shard [BE][6/6] fix typos in test/ (test/distributed/) (#157640) 2025-07-11 14:09:37 +00:00
_tools [ROCm] Enabling several UTs (#161715) 2025-09-09 15:49:21 +00:00
algorithms [BE][PYFMT] migrate PYFMT for test/[a-h]*/ to ruff format (#144555) 2025-06-24 04:53:54 +00:00
bin
checkpoint [DCP] DTensor slice dequantization with proper block alignment (#163532) 2025-09-23 16:48:16 +00:00
elastic capturing exit codes after sigterm/sigkill from torch elastic. (#160908) 2025-09-17 17:41:35 +00:00
flight_recorder [fr] Fix one error in analysis script when subPG world size is smaller than global size (#156156) 2025-06-17 21:17:58 +00:00
fsdp Simplify BFLOAT16_AVAILABLE (#163445) 2025-09-22 07:31:46 +00:00
launcher Support XPU in --nproc-per-node option to torchrun (#159474) 2025-09-12 08:32:04 +00:00
nn/jit PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
optim [BE] fix remaining flake8 v7 warnings (#159044) 2025-07-25 02:56:34 +00:00
pipelining port distributed pipeline test files for Intel GPU (#159033) 2025-08-25 05:24:27 +00:00
rpc [BE][6/6] fix typos in test/ (test/distributed/) (#157640) 2025-07-11 14:09:37 +00:00
tensor Add basic tests for torch.distributed.tensor._utils.compute_global_tensor_info (#162968) 2025-09-23 14:56:32 +00:00
_test_template.py [C10D] Fix spelling of MultiProcContinuousTest (#160892) 2025-08-19 20:17:19 +00:00
argparse_util_test.py
test_backends.py API to retrieve default distributed backend from device (#140536) 2024-11-22 11:01:53 +00:00
test_c10d_common.py [Reland][2/N]Port several test files under test/distributed to Intel GPU (#159473) 2025-09-17 06:42:27 +00:00
test_c10d_functional_native.py [Reland][2/N]Port several test files under test/distributed to Intel GPU (#159473) 2025-09-17 06:42:27 +00:00
test_c10d_gloo.py [C10d][Gloo] Enable complex datatype support in ProcessGroupGloo (#156633) 2025-09-05 21:24:36 +00:00
test_c10d_logger.py add device generalisation support for distributed tests (#152471) 2025-06-20 07:35:42 +00:00
test_c10d_nccl.py Simplify BFLOAT16_AVAILABLE (#163445) 2025-09-22 07:31:46 +00:00
test_c10d_object_collectives.py Update test_c10d_object_collectives.py with DistributedTestBase class (#145056) 2025-02-13 03:57:59 +00:00
test_c10d_ops_nccl.py [ROCm] Enabling several UTs (#161715) 2025-09-09 15:49:21 +00:00
test_c10d_pypg.py [c10d] block_current_stream: correctness fixes (#158757) 2025-07-21 22:23:44 +00:00
test_c10d_spawn_gloo.py PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
test_c10d_spawn_nccl.py PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
test_c10d_spawn_ucc.py PEP585 update - test (#145176) 2025-01-22 04:48:28 +00:00
test_c10d_spawn.py Add __main__ guards to distributed tests (#154628) 2025-06-04 14:39:57 +00:00
test_c10d_ucc.py [BE][PYFMT] migrate PYFMT for test/[a-h]*/ to ruff format (#144555) 2025-06-24 04:53:54 +00:00
test_collective_utils.py [C10D] add _summarize_ranks util (#160284) 2025-08-28 00:17:53 +00:00
test_composability.py [C10D] Fix spelling of MultiProcContinuousTest (#160892) 2025-08-19 20:17:19 +00:00
test_compute_comm_reordering.py [ROCm] Enabling several UTs (#161715) 2025-09-09 15:49:21 +00:00
test_control_collectives.py [BE][PYFMT] migrate PYFMT for test/[a-h]*/ to ruff format (#144555) 2025-06-24 04:53:54 +00:00
test_cupy_as_tensor.py [ROCm] Enabling several UTs (#161715) 2025-09-09 15:49:21 +00:00
test_data_parallel.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_device_mesh.py [Reland][2/N]Port several test files under test/distributed to Intel GPU (#159473) 2025-09-17 06:42:27 +00:00
test_dist2.py [c10d] Fix setGroupName and setGroupDesc in group_split and merge_remote_group (#159429) 2025-07-30 19:55:55 +00:00
test_distributed_spawn.py Remove NO_MULTIPROCESSING_SPAWN checks (#146705) 2025-02-28 05:53:19 +00:00
test_dynamo_distributed.py [Reland][2/N]Port several test files under test/distributed to Intel GPU (#159473) 2025-09-17 06:42:27 +00:00
test_fake_pg.py Don't require FakeStore to be passed into fake backend (#162164) 2025-09-04 16:43:49 +00:00
test_functional_api.py Revert "Fix decorators skipping NCCL tests (#158846)" 2025-09-10 20:51:31 +00:00
test_inductor_collectives.py [Reland][2/N]Port several test files under test/distributed to Intel GPU (#159473) 2025-09-17 06:42:27 +00:00
test_launcher.py Fix unused Python variables in test/[a-d]* (#134665) 2024-12-13 22:13:12 +00:00
test_multi_threaded_pg.py add device generalization support for distributed tests (#156796) 2025-07-16 09:37:03 +00:00
test_nccl.py [C10D] Fix spelling of MultiProcContinuousTest (#160892) 2025-08-19 20:17:19 +00:00
test_nvshmem_triton.py Revert "[SymmMem] Promote @requires_nvshmem instead of enable_triton (#163423)" 2025-09-22 05:35:41 +00:00
test_nvshmem.py [SymmMem] Fix memory allocation hold-up (#162680) 2025-09-19 20:19:47 +00:00
test_p2p_ipc.py [ROCm] Enabling several UTs (#161715) 2025-09-09 15:49:21 +00:00
test_pg_wrapper.py [BE][6/6] fix typos in test/ (test/distributed/) (#157640) 2025-07-11 14:09:37 +00:00
test_run.py 154849 Add support to handle SIGUSR1 and SIGUSR2 in multiprocessing (#160690) 2025-09-09 22:23:06 +00:00
test_serialization.py distributed/serialization: add experimental streaming torch.save/load methods (#146555) 2025-02-07 18:08:11 +00:00
test_store.py [Reland][2/N]Port several test files under test/distributed to Intel GPU (#159473) 2025-09-17 06:42:27 +00:00
test_symmetric_memory.py [ROCm][SymmMem] re-enable UTs (#162811) 2025-09-16 15:35:39 +00:00