81 Commits

Author SHA1 Message Date
Yuanyuan Chen
99c8640b5d [1/N] Change C-style casts to static_cast or reinterpret_cast (#165750)
This series of changes tries to convert C-style casts into their C++ alternatives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165750
Approved by: https://github.com/Skylion007
2025-10-20 23:27:13 +00:00
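As a rough illustration of the kind of conversion commit 99c8640b5d describes, the sketch below replaces two C-style casts with named C++ casts. The helper names `toU32` and `asChars` are hypothetical and are not taken from the PR.

```cpp
#include <cstdint>

// Hypothetical helper: narrowing an integer value.
// C-style form: (std::uint32_t)n
// static_cast states that this is an ordinary value conversion.
inline std::uint32_t toU32(std::uint64_t n) {
  return static_cast<std::uint32_t>(n);
}

// Hypothetical helper: viewing raw bytes as char data.
// C-style form: (const char*)bytes
// reinterpret_cast marks the pointer reinterpretation explicitly.
inline const char* asChars(const unsigned char* bytes) {
  return reinterpret_cast<const char*>(bytes);
}
```

The named casts make the intent searchable and let the compiler reject conversions that a C-style cast would silently allow.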
PyTorch MergeBot
ab82456c16 Revert "[1/N] Change C-style casts to static_cast or reinterpret_cast (#165750)"
This reverts commit e1e8491b31.

Reverted https://github.com/pytorch/pytorch/pull/165750 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/165750#issuecomment-3422413890))
2025-10-20 14:51:58 +00:00
Yuanyuan Chen
e1e8491b31 [1/N] Change C-style casts to static_cast or reinterpret_cast (#165750)
This series of changes tries to convert C-style casts into their C++ alternatives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165750
Approved by: https://github.com/Skylion007
2025-10-20 04:36:19 +00:00
Yuanyuan Chen
ecb53078fa Turn some const strings into constexpr in C++ code (#165203)
This PR turns more const strings into constexpr.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165203
Approved by: https://github.com/Skylion007
2025-10-13 20:25:20 +00:00
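A minimal sketch of what turning a const string into a constexpr one can look like, assuming C++17 and `std::string_view`. The constant `kPrefix` and the helper `hasPrefix` are hypothetical, and the PR may use a different type.

```cpp
#include <string_view>

// Before (hypothetical): const std::string kPrefix = "rpc_";  // built at run time
// After: a constexpr view is a compile-time constant with no dynamic initialization.
constexpr std::string_view kPrefix = "rpc_";

// Hypothetical helper that can itself be evaluated at compile time.
constexpr bool hasPrefix(std::string_view name) {
  return name.substr(0, kPrefix.size()) == kPrefix;
}

// Checked by the compiler, not at run time.
static_assert(hasPrefix("rpc_sync"));
```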
cyy
419a7e197d [6/N] Fix Wextra-semi warning (#139605)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139605
Approved by: https://github.com/ezyang
2024-11-04 13:43:16 +00:00
cyy
bbff667e32 [Distributed] [13/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#136713)
Follows #136528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136713
Approved by: https://github.com/kwen2501
2024-09-27 10:11:53 +00:00
cyy
f048569c24 [Distributed] [11/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#136439)
Follows #131671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136439
Approved by: https://github.com/kwen2501
2024-09-24 13:05:15 +00:00
FFFrog
8c4e1148b8 Refactoring byte_order (#135558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135558
Approved by: https://github.com/mikaylagawarecki
2024-09-11 21:06:43 +00:00
cyy
95dbbf713e [Distributed] [9/N] Fix clang-tidy warnings in torch/csrc/distributed/rpc (#130109)
Follows #125102

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130109
Approved by: https://github.com/ezyang
2024-07-16 04:23:42 +00:00
cyy
30875953a4 [1/N] Remove inclusion of c10/util/string_utils.h (#128300)
As a first step to remove it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128300
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-06-10 23:40:47 +00:00
cyy
ac603bc2f8 [Reland] Eliminate invocations of c10::stoi,c10::stod,c10::stoull,c10::stoll (#109566)
This is reland of #87603 with definitions of c10::stoXX kept for further investigation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109566
Approved by: https://github.com/huydhn
2023-09-19 07:15:25 +00:00
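For readers unfamiliar with the replacement, the sketch below shows a call site using the standard conversion functions directly. `ParsedEndpoint` and `parseEndpoint` are hypothetical names invented for this illustration.

```cpp
#include <string>

// Hypothetical result type for the illustration.
struct ParsedEndpoint {
  int port;
  unsigned long long id;
};

// Hypothetical call site: where code previously called c10::stoi / c10::stoull,
// it can call the standard functions instead.
inline ParsedEndpoint parseEndpoint(const std::string& port, const std::string& id) {
  // std::stoi and std::stoull throw std::invalid_argument or
  // std::out_of_range when the input cannot be converted.
  return ParsedEndpoint{std::stoi(port), std::stoull(id)};
}
```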
PyTorch MergeBot
4d44d8c00a Revert "Eliminate c10::stoi,c10::stod,c10::stoull,c10::stoll (#109179)"
This reverts commit 852f1b8417.

Reverted https://github.com/pytorch/pytorch/pull/109179 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this is breaking periodic buck build, so please fix the issue and reland the change https://github.com/pytorch/pytorch/actions/runs/6207458526/job/16852695272 ([comment](https://github.com/pytorch/pytorch/pull/109179#issuecomment-1724168571))
2023-09-18 18:41:12 +00:00
cyy
852f1b8417 Eliminate c10::stoi,c10::stod,c10::stoull,c10::stoll (#109179)
We can remove these functions in favor of std ones.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109179
Approved by: https://github.com/colesbury
2023-09-16 07:22:50 +00:00
Daniil Kutz
585ce32ca1 Heap buffer overflow in distributed/rpc module (#105537)
Hi! We've been fuzzing the PyTorch project with [sydr-fuzz](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch).
We've found a couple of heap-buffer-overflows in the `distributed/rpc` module.

PyTorch version: 0f1621df1a

OS: Ubuntu 20.04

### How to reproduce

1.  Build docker from this [Dockerfile](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch) and run the container.
2.  Then run `message_deserialize-afl++` fuzzing target on provided crash-inputs ([crash-056826339f6da8dbb97c944178e94494369a9e22.zip](https://github.com/pytorch/pytorch/files/12096151/crash-056826339f6da8dbb97c944178e94494369a9e22.zip), [crash-4f85db9f19fe152c0018f6675c3b4c122227058f.zip](https://github.com/pytorch/pytorch/files/12096160/crash-4f85db9f19fe152c0018f6675c3b4c122227058f.zip)):
```
unzip crash-4f85db9f19fe152c0018f6675c3b4c122227058f.zip
/message_deserialize-afl++ crash-4f85db9f19fe152c0018f6675c3b4c122227058f
```

### Heap buffer overflow in torch/csrc/jit/serialization/pickle.cpp:144

[crash-056826339f6da8dbb97c944178e94494369a9e22.zip](https://github.com/pytorch/pytorch/files/12096151/crash-056826339f6da8dbb97c944178e94494369a9e22.zip)

```asan
    "==7614==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60b001b58355 at pc 0x0000005d1147 bp 0x7fffffffa610 sp 0x7fffffff9de0",
    "READ of size 256 at 0x60b001b58355 thread T0",
    "    #0 0x5d1146 in __asan_memcpy /llvm-project-llvmorg-14.0.6/compiler-rt/lib/asan/asan_interceptors_memintrinsics.cpp:22:3",
    "    #1 0xd1cd19f in torch::jit::unpickle(char const*, unsigned long, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&))::$_3::operator()(char*, unsigned long) const /pytorch/torch/csrc/jit/serialization/pickle.cpp:144:9",
    "    #2 0xd1cd19f in unsigned long std::__invoke_impl<unsigned long, torch::jit::unpickle(char const*, unsigned long, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&))::$_3&, char*, unsigned long>(std::__invoke_other, torch::jit::unpickle(char const*, unsigned long, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&))::$_3&, char*&&, unsigned long&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/invoke.h:60:14",
    "    #3 0xd27aa48 in std::function<unsigned long (char*, unsigned long)>::operator()(char*, unsigned long) const /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/std_function.h:622:14",
    "    #4 0xd27a61c in torch::jit::Unpickler::readSlowWithBuffer(char*, unsigned long) /pytorch/torch/csrc/jit/serialization/unpickler.cpp:1047:23",
    "    #5 0xd2698b8 in unsigned char torch::jit::Unpickler::read<unsigned char>() /pytorch/torch/csrc/jit/serialization/unpickler.h:111:7",
    "    #6 0xd268816 in torch::jit::Unpickler::readOpCode() /pytorch/torch/csrc/jit/serialization/unpickler.h:130:38",
    "    #7 0xd268816 in torch::jit::Unpickler::run() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:238:17",
    "    #8 0xd268522 in torch::jit::Unpickler::parse_ivalue() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:204:3",
    "    #9 0xd1c8502 in torch::jit::unpickle(std::function<unsigned long (char*, unsigned long)>, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch/torch/csrc/jit/serialization/pickle.cpp:126:20",
    "    #10 0xd1c8dbd in torch::jit::unpickle(char const*, unsigned long, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch/torch/csrc/jit/serialization/pickle.cpp:136:10",
    "    #11 0xe56b16d in torch::distributed::rpc::readWrappedPayload(std::vector<char, std::allocator<char> >&, torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/rpc/utils.cpp:515:18",
    "    #12 0xe3d8f29 in torch::distributed::autograd::RpcWithProfilingReq::fromMessage(torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/autograd/rpc_messages/rpc_with_profiling_req.cpp:112:24",
    "    #13 0xe55f692 in torch::distributed::rpc::deserializeRequest(torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/rpc/utils.cpp:138:14",
    "    #14 0x6120a8 in LLVMFuzzerTestOneInput /message_deserialize.cc:192:27",
    "    #15 0x535de1 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15",
    "    #16 0x51fcec in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6",
    "    #17 0x525a3b in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9",
    "    #18 0x54eff2 in main /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10",
    "    #19 0x7ffff7a37082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)",
    "    #20 0x51a60d in _start (/message_deserialize_fuzz+0x51a60d)",
    "",
    "0x60b001b58355 is located 0 bytes to the right of 101-byte region [0x60b001b582f0,0x60b001b58355)",
    "allocated by thread T0 here:",
    "    #0 0x60c7bd in operator new(unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/asan/asan_new_delete.cpp:95:3",
    "    #1 0x62c7fd in std::_Vector_base<char, std::allocator<char> >::_M_allocate(unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:346:20",
    "    #2 0x62c7fd in void std::vector<char, std::allocator<char> >::_M_range_initialize<unsigned char const*>(unsigned char const*, unsigned char const*, std::forward_iterator_tag) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:1582:14",
    "    #3 0x612913 in std::vector<char, std::allocator<char> >::vector<unsigned char const*, void>(unsigned char const*, unsigned char const*, std::allocator<char> const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:657:4",
    "    #4 0x611c4a in LLVMFuzzerTestOneInput /message_deserialize.cc:181:21",
    "    #5 0x535de1 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15",
    "    #6 0x51fcec in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6",
    "    #7 0x525a3b in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9",
    "    #8 0x54eff2 in main /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10",
    "    #9 0x7ffff7a37082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)",
    "",
    "SUMMARY: AddressSanitizer: heap-buffer-overflow /llvm-project-llvmorg-14.0.6/compiler-rt/lib/asan/asan_interceptors_memintrinsics.cpp:22:3 in __asan_memcpy",
    "Shadow bytes around the buggy address:",
    "  0x0c1680363010: 00 00 00 fa fa fa fa fa fa fa fa fa 00 00 00 00",
    "  0x0c1680363020: 00 00 00 00 00 00 00 00 00 00 fa fa fa fa fa fa",
    "  0x0c1680363030: fa fa 00 00 00 00 00 00 00 00 00 00 00 00 00 fa",
    "  0x0c1680363040: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00",
    "  0x0c1680363050: 00 00 00 00 00 fa fa fa fa fa fa fa fa fa 00 00",
    "=>0x0c1680363060: 00 00 00 00 00 00 00 00 00 00[05]fa fa fa fa fa",
    "  0x0c1680363070: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00",
    "  0x0c1680363080: 05 fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "  0x0c1680363090: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "  0x0c16803630a0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "  0x0c16803630b0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "Shadow byte legend (one shadow byte represents 8 application bytes):",
    "  Addressable:           00",
    "  Partially addressable: 01 02 03 04 05 06 07",
    "  Heap left redzone:       fa",
    "  Freed heap region:       fd",
    "  Stack left redzone:      f1",
    "  Stack mid redzone:       f2",
    "  Stack right redzone:     f3",
    "  Stack after return:      f5",
    "  Stack use after scope:   f8",
    "  Global redzone:          f9",
    "  Global init order:       f6",
    "  Poisoned by user:        f7",
    "  Container overflow:      fc",
    "  Array cookie:            ac",
    "  Intra object redzone:    bb",
    "  ASan internal:           fe",
    "  Left alloca redzone:     ca",
    "  Right alloca redzone:    cb",
    "==7614==ABORTING"
```

### Heap-buffer-overflow in aten/src/ATen/core/ivalue.h:432

[crash-4f85db9f19fe152c0018f6675c3b4c122227058f.zip](https://github.com/pytorch/pytorch/files/11553011/crash-4f85db9f19fe152c0018f6675c3b4c122227058f.zip)

```asan
    "==60983==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6150001e4108 at pc 0x000000601877 bp 0x7fffffff9fd0 sp 0x7fffffff9fc8",
    "READ of size 4 at 0x6150001e4108 thread T0",
    "    #0 0x601876 in c10::IValue::isTensor() const /pytorch/aten/src/ATen/core/ivalue.h:432:27",
    "    #1 0x601876 in c10::IValue::destroy() /pytorch/aten/src/ATen/core/ivalue.h:1148:9",
    "    #2 0x699f72 in c10::IValue::~IValue() /pytorch/aten/src/ATen/core/ivalue.h:236:5",
    "    #3 0x699f72 in void std::_Destroy<c10::IValue>(c10::IValue*) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_construct.h:140:19",
    "    #4 0x699f72 in void std::_Destroy_aux<false>::__destroy<c10::IValue*>(c10::IValue*, c10::IValue*) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_construct.h:152:6",
    "    #5 0x699f72 in void std::_Destroy<c10::IValue*>(c10::IValue*, c10::IValue*) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_construct.h:184:7",
    "    #6 0x699f72 in void std::_Destroy<c10::IValue*, c10::IValue>(c10::IValue*, c10::IValue*, std::allocator<c10::IValue>&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/alloc_traits.h:738:7",
    "    #7 0x699f72 in std::vector<c10::IValue, std::allocator<c10::IValue> >::_M_erase_at_end(c10::IValue*) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:1796:6",
    "    #8 0x699e4a in std::vector<c10::IValue, std::allocator<c10::IValue> >::_M_erase(__gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue, std::allocator<c10::IValue> > >, __gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue, std::allocator<c10::IValue> > >) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/vector.tcc:191:4",
    "    #9 0xea5b11e in torch::jit::Unpickler::readInstruction() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:454:14",
    "    #10 0xea57d97 in torch::jit::Unpickler::run() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:251:27",
    "    #11 0xea579f1 in torch::jit::Unpickler::parse_ivalue() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:204:3",
    "    #12 0xe9a435e in torch::jit::unpickle(std::function<unsigned long (char*, unsigned long)>, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch/torch/csrc/jit/serialization/pickle.cpp:126:20",
    "    #13 0xe9a471c in torch::jit::unpickle(char const*, unsigned long, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch/torch/csrc/jit/serialization/pickle.cpp:136:10",
    "    #14 0xfcd034b in torch::distributed::autograd::PropagateGradientsReq::fromMessage(torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/autograd/rpc_messages/propagate_gradients_req.cpp:54:18",
    "    #15 0xfe720ff in torch::distributed::rpc::deserializeRequest(torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/rpc/utils.cpp:132:14",
    "    #16 0x5c5c93 in LLVMFuzzerTestOneInput /message_deserialize.cc:192:27",
    "    #17 0x5c2bfd in ExecuteFilesOnyByOne /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:255:7",
    "    #18 0x5c2a08 in LLVMFuzzerRunDriver /AFLplusplus/utils/aflpp_driver/aflpp_driver.c",
    "    #19 0x5c25c8 in main /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:300:10",
    "    #20 0x7ffff7a37082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)",
    "    #21 0x50237d in _start (/message_deserialize_afl+0x50237d)",
    "",
    "0x6150001e4108 is located 8 bytes to the right of 512-byte region [0x6150001e3f00,0x6150001e4100)",
    "allocated by thread T0 here:",
    "    #0 0x5bfbfa in operator new(unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/asan/asan_new_delete.cpp:95:3",
    "",
    "SUMMARY: AddressSanitizer: heap-buffer-overflow /pytorch/aten/src/ATen/core/ivalue.h:432:27 in c10::IValue::isTensor() const",
    "Shadow bytes around the buggy address:",
    "  0x0c2a800347d0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "  0x0c2a800347e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00",
    "  0x0c2a800347f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00",
    "  0x0c2a80034800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00",
    "  0x0c2a80034810: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00",
    "=>0x0c2a80034820: fa[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "  0x0c2a80034830: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "  0x0c2a80034840: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "  0x0c2a80034850: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "  0x0c2a80034860: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "  0x0c2a80034870: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "Shadow byte legend (one shadow byte represents 8 application bytes):",
    "  Addressable:           00",
    "  Partially addressable: 01 02 03 04 05 06 07",
    "  Heap left redzone:       fa",
    "  Freed heap region:       fd",
    "  Stack left redzone:      f1",
    "  Stack mid redzone:       f2",
    "  Stack right redzone:     f3",
    "  Stack after return:      f5",
    "  Stack use after scope:   f8",
    "  Global redzone:          f9",
    "  Global init order:       f6",
    "  Poisoned by user:        f7",
    "  Container overflow:      fc",
    "  Array cookie:            ac",
    "  Intra object redzone:    bb",
    "  ASan internal:           fe",
    "  Left alloca redzone:     ca",
    "  Right alloca redzone:    cb",
    "==60983==ABORTING"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105537
Approved by: https://github.com/albanD
2023-07-20 16:56:49 +00:00
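The first trace in 585ce32ca1 shows a 256-byte `memcpy` read from a 101-byte region. One general defense against that class of bug is a reader that clamps every copy to the bytes actually remaining. The sketch below shows that idea only; `BoundedReader` is a hypothetical class, and this is not the fix merged in #105537.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

// Hypothetical reader over a fixed buffer. Names are illustrative and are not
// taken from pickle.cpp.
class BoundedReader {
 public:
  BoundedReader(const char* data, std::size_t size) : data_(data), size_(size) {}

  // Copies at most `len` bytes into `out`, never reading past the end of the
  // source buffer. Returns the number of bytes copied (0 once exhausted).
  std::size_t read(char* out, std::size_t len) {
    if (pos_ >= size_) {
      return 0;
    }
    const std::size_t n = std::min(len, size_ - pos_);
    std::memcpy(out, data_ + pos_, n);
    pos_ += n;
    return n;
  }

 private:
  const char* data_;
  std::size_t size_;
  std::size_t pos_ = 0;
};
```

A caller that asks for more bytes than remain gets a short read and can treat it as a malformed message instead of touching memory outside the buffer. This only helps when the `size` passed to the reader matches the real allocation.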
Nikita Shulga
a229e78544 [BE] Enforce sign-compare (#96723)
A number of OSS PRs were reverted because of new signed-unsigned comparison warnings, which are treated as errors in some internal builds.
Not sure how those selective rules are applied, but this PR removes `-Wno-sign-compare` from PyTorch codebase.

The only tricky part of this PR was making sure that non-ASCII character detection works for both signed and unsigned chars here:
6e3d51b08a/torch/csrc/jit/serialization/python_print.cpp (L926)

Exclude several files from sign-compare if flash attention is used, due to the violation in cutlass, to be fixed by https://github.com/NVIDIA/cutlass/pull/869
Do not try to fix sign compare violations in caffe2 codebase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96723
Approved by: https://github.com/albanD
2023-03-15 06:04:20 +00:00
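The non-ASCII detection mentioned in a229e78544 is tricky because plain `char` may be signed or unsigned depending on the platform. A minimal sketch of a check that behaves the same in both cases follows; `hasNonAscii` is a hypothetical helper, not the code in `python_print.cpp`.

```cpp
#include <string>

// Hypothetical helper. Converting to unsigned char first means bytes >= 0x80
// compare as large positive values whether plain char is signed or unsigned,
// and the comparison does not mix signed and unsigned operands.
inline bool hasNonAscii(const std::string& s) {
  for (char c : s) {
    if (static_cast<unsigned char>(c) > 0x7F) {
      return true;
    }
  }
  return false;
}
```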
cyy
a405c6993f [submodule] update libfmt to tag 9.1.0 (#93219)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93219
Approved by: https://github.com/malfet, https://github.com/Skylion007, https://github.com/albanD
2023-02-08 17:21:39 +00:00
Aaron Gokaslan
3916d7a575 Apply modernize-use-emplace to aten, c10, torch (#91077)
Apply clang-tidy check modernize-use-emplace. This is slightly more efficient by using an inplace constructor and is the recommended style in parts of the codebase covered by clang-tidy. This just manually applies the check to the rest of the codebase. Pinging @ezyang as this is related to my other PRs he reviewed like #89000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91077
Approved by: https://github.com/ezyang
2022-12-19 07:49:56 +00:00
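A small sketch of the rewrite that modernize-use-emplace performs, using a hypothetical function `makeEntries` rather than code from the PR:

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical example of the pattern the check rewrites.
inline std::vector<std::pair<std::string, int>> makeEntries() {
  std::vector<std::pair<std::string, int>> entries;
  // Before: entries.push_back(std::pair<std::string, int>("rank", 0));
  // After: the element is constructed in place from the constructor arguments,
  // so no temporary pair is created and moved.
  entries.emplace_back("rank", 0);
  entries.emplace_back("world_size", 4);
  return entries;
}
```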
Kazuaki Ishizaki
e0c194f10b Fix typos in messages under torch (#88961)
This PR fixes typos in messages and params in C++ source and header files under the `torch` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88961
Approved by: https://github.com/albanD
2022-11-14 19:06:41 +00:00
Michael Andreas Dagitses
f96d96a7fc turn on -Werror=type-limits in our Bazel CPU build
Summary:
We also fix any existing issues.

Test Plan: Built locally, rely on CI to confirm.

Reviewers: malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79139

Approved by: https://github.com/seemethere, https://github.com/osalpekar, https://github.com/albanD
2022-06-10 10:04:08 +00:00
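For context, `-Wtype-limits` flags comparisons whose result is fixed by the range of the operand type. The sketch below shows one common case and a possible fix; `inBounds` is a hypothetical helper, and the issues actually fixed in f96d96a7fc may look different.

```cpp
#include <cstddef>

// Flagged form (hypothetical): an unsigned value is always >= 0, so the first
// comparison is always true:
//   bool ok = (idx >= 0) && (idx < size);
// Fixed form: keep only the comparison that can actually fail.
inline bool inBounds(std::size_t idx, std::size_t size) {
  return idx < size;
}
```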
Scott Wolchok
82f7f8d471 [PyTorch] Adopt IValue::toTupleRef() where obvious (#65505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65505

Generated with

`fastmod -m 'toTuple\(\)(\s*)->' 'toTupleRef()${1}.'`

, followed by

`fastmod '(std::move\(.*)toTupleRef\(\).' '${1}toTuple()->'`

to unbreak 2 callsites.
ghstack-source-id: 142065835

Test Plan: CI

Reviewed By: gchanan

Differential Revision: D31131025

fbshipit-source-id: 54457ae5bbeb38db9c7f196d469b98521c3d3f34
2021-11-02 10:22:18 -07:00
Scott Wolchok
e88d1c4f10 [PyTorch] Add tuple inline storage (#64066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64066

I noticed a bunch of time being spent heap-allocating Tuples
in the unpickler. 1-, 2-, and 3-element Tuples are apparently common
enough that they get their own bytecode instructions, so I decided to
try also giving them their own representation. We store up to 3
IValues inline in `Tuple` rather than doing a second heap allocation
for a `std::vector<IValue>`.
ghstack-source-id: 140695395

Test Plan:
Added automated tests for TupleElements.

Pixel 3 before: https://www.internalfb.com/intern/aibench/details/761596366576284
Pixel 3 after: https://www.internalfb.com/intern/aibench/details/591414145082422
We went from 347 ms to 302 ms.

Reviewed By: dhruvbird

Differential Revision: D30592622

fbshipit-source-id: 93625c54c9dca5f765ef6d5c191944179cb281a8
2021-10-15 12:16:51 -07:00
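The inline-storage idea in e88d1c4f10 can be sketched as a container that keeps up to three elements inside the object and only uses a `std::vector` for larger sizes. `InlineElements` below is a hypothetical simplification that assumes a default-constructible, copyable element type; it is not the `TupleElements` class from the PR.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical container: up to N elements live inside the object itself;
// larger sizes fall back to a std::vector.
// Assumes T is default-constructible and copyable.
template <typename T, std::size_t N = 3>
class InlineElements {
 public:
  explicit InlineElements(const std::vector<T>& values) : size_(values.size()) {
    if (size_ <= N) {
      // Small case: copy into the in-object array.
      for (std::size_t i = 0; i < size_; ++i) {
        inline_[i] = values[i];
      }
    } else {
      // Large case: keep the elements in the vector.
      heap_ = values;
    }
  }

  bool isInline() const { return size_ <= N; }
  std::size_t size() const { return size_; }

  const T& operator[](std::size_t i) const {
    return isInline() ? inline_[i] : heap_[i];
  }

 private:
  std::size_t size_;
  std::array<T, N> inline_{};
  std::vector<T> heap_;
};
```

The trade-off is a larger object in exchange for skipping a separate element allocation in the common small cases.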
Pavel Belevich
ee8a6c1d14 Replace std::unordered_map<c10::Device, c10::Device> with DeviceMap (#64393)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64393

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse agolynski SciPioneer H-Huang mrzzd cbalioglu gcramer23

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D30708384

Pulled By: pbelevich

fbshipit-source-id: 1c565727e4f09cd9e560874dd90aa403470b4a97
2021-09-02 01:36:19 -07:00
Nikita Shulga
a9b0a921d5 Disable avoid-non-const-global-variables lint check (#62008)
Summary:
The GoogleTest `TEST` macro is non-compliant with this check, as is `DEFINE_DISPATCH`.

All changes but the ones to `.clang-tidy` are generated using the following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`;  do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008

Reviewed By: driazati, r-barnes

Differential Revision: D29838584

Pulled By: malfet

fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
2021-07-22 18:04:40 -07:00
Mike Guo
6ecc1a4c4f Make pytorch clang-tidy clean (#60649)
Summary:
This PR suppresses clang-tidy warnings in the codebase (for now) so that we can re-enable clang-tidy checks on master.

I ran this script to add the `NOLINTNEXTLINE` comments (on a devserver):
```bash
python3 setup.py develop

# Uses same script that's run on CI and adds the -j (parallel), -s (add comments), -k (continue if diagnostic errors are found) options
python3 tools/clang_tidy.py \
  -j \
  -s \
  -k \
  -v \
  --paths torch/csrc/ \
  -g"-torch/csrc/jit/passes/onnx/helper.cpp" \
  -g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \
  -g"-torch/csrc/jit/serialization/onnx.cpp" \
  -g"-torch/csrc/jit/serialization/export.cpp" \
  -g"-torch/csrc/jit/serialization/import.cpp" \
  -g"-torch/csrc/jit/serialization/import_legacy.cpp" \
  -g"-torch/csrc/onnx/init.cpp" \
  -g"-torch/csrc/cuda/nccl.*" \
  -g"-torch/csrc/cuda/python_nccl.cpp" \
  -g"-torch/csrc/autograd/FunctionsManual.cpp" \
  -g"-torch/csrc/generic/*.cpp" \
  -g"-torch/csrc/jit/codegen/cuda/runtime/*" \
  -g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
  -g"-torch/csrc/deploy/interpreter/interpreter.h" \
  -g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
  -g"-torch/csrc/deploy/interpreter/test_main.cpp"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60649

Test Plan: Verified changes by re-running the script (without the `-s` option) and seeing no warnings/errors.

Reviewed By: walterddr, janeyx99

Differential Revision: D29504258

Pulled By: 1ntEgr8

fbshipit-source-id: 78310b30ee8213b73ddb4771ad874665323e7a4e
2021-07-01 12:21:07 -07:00
Rohan Varma
d433a55c94 Replace throw std::runtime_error with torch_check in torch/csrc/distributed (#59683)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59683

Replaces usages of throw std::runtime_error("foo") with the better
TORCH_CHECK(false, "foo") which allows C++ stacktraces to show up when
TORCH_SHOW_CPP_STACKTRACES=1. This will hopefully provide much better debugging
information when debugging crashes/flaky tests.
ghstack-source-id: 131167210

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D28981327

fbshipit-source-id: 677f569e28600263cab18759eb1b282e0391aa7b
2021-06-11 11:15:49 -07:00
Richard Barnes
3979cb0656 irange for size_t (#55320)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55320

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27572577

fbshipit-source-id: 97710fd2bb1303006b05828a0d1343b0b59ccb03
2021-06-03 01:04:13 -07:00
Luca Wehrstedt
0422e67336 Use Devices instead of DeviceIndexes in TensorPipe agent (#57294)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57294

With the advent of CPUs in the device maps, and to be more generic (e.g., to support AMD GPUs), and to avoid conversions when passing to Future and RRef and such, it's easier to use Devices instead of DeviceIndices. This started by just migrating the TensorPipe agent but the RPC layer is quite intertwined so I had to migrate a lot of stuff.
ghstack-source-id: 127916562

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28092733

fbshipit-source-id: 024dcb3648c5898ab13e770413c43958f04f1a8a
2021-05-01 16:12:55 -07:00
Luca Wehrstedt
311ad5e3af Merge CUDAFuture into ivalue::Future (#57052)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57052

This PR caps a stack whose goal was to merge CUDAFuture into ivalue::Future. CUDAFuture used to be a subclass of ivalue::Future, which was already pretty good, but it meant that in several places we needed `#ifdef`s or registries in order to create the right type of class, which was annoying. We've made CUDAFuture device-agnostic, by using generic helpers, so that it doesn't depend on CUDA. Now all its code can be inserted into ivalue::Future.

This PR does this very naively, by copy-pasting CUDAFuture's code into the (previously empty) virtual methods of ivalue::Future. This helps ensure the correctness of this PR, as it's straightforward to see it behaves exactly like before. However we probably want to polish it a bit later to iron out some wrinkles.
ghstack-source-id: 127713138

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28036829

fbshipit-source-id: 3e5b16402f5dc245c1fcb9d7bf06db64dcb0d2a3
2021-04-29 09:31:52 -07:00
Nikita Shulga
eac02f85cf Fix more clang-tidy errors (#57235)
Summary:
In my last PR I missed the CUDA and distributed folders; this PR fixes that.
This change is autogenerated by `python tools/clang_tidy.py -s`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57235

Reviewed By: janeyx99

Differential Revision: D28084444

Pulled By: malfet

fbshipit-source-id: bf222f69ee90c7872c3cb0931e8cdb84f0cb3cda
2021-04-28 23:29:10 -07:00
Shen Li
1ee54cc7b4 Add devices argument to RRef constructor (#57085)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57085

PR #54932 fixed the CUDA RPC for RRef when RRef is created through
RPC. But besides that use case, RRef can also be created locally
by directly passing in a value, which would bypass the CUDA stream
synchronization in #54932.

This commit covers the above gap by adding a `devices` argument
to RRef constructor. The RRef will then use this argument to
choose between `CUDAFuture` and `ivalue::Future` to hold the value.
When `devices` is specified and non-empty, `CUDAFuture` will be
used, and the `devices` will be passed to that `CUDAFuture`.

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D28050001

Pulled By: mrshenli

fbshipit-source-id: 2316b419fa69aa4dcd444050f0b74e61c3d0af1e
2021-04-28 19:11:10 -07:00
Mike Ruberry
c0ac0fef4e Revert D27448156: irange for size_t
Test Plan: revert-hammer

Differential Revision:
D27448156 (041b4431b2)

Original commit changeset: 585da57d4de9

fbshipit-source-id: 8e047c29f391c0166e0a1a87c3fb2a0854377365
2021-04-03 19:14:00 -07:00
Richard Barnes
041b4431b2 irange for size_t (#55163)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55163

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27448156

fbshipit-source-id: 585da57d4de91c692b6360d65f7b8a66deb0f8c1
2021-04-02 23:22:29 -07:00
Can Balioglu
2130f4ccc4 Use c10::ArrayRef instead of std::vector for the jit::unpickle's tensor_table. (#54428)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54428

Using c10::ArrayRef as the parameter type makes the API more flexible and allows the caller to leverage small-buffer optimizations (e.g. c10::SmallVector, std::array) for performance critical cases.

Test Plan: No behavioral changes. Run the existing unit and integration tests.

Reviewed By: suo

Differential Revision: D27232222

fbshipit-source-id: 7b13bc6bd02257097ca119077028fbccc68cc925
2021-03-22 15:31:47 -07:00
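To illustrate why a non-owning array view makes an API more flexible, the sketch below uses a minimal stand-in type. `ArrayView` and `sumIds` are hypothetical; `c10::ArrayRef` has a richer interface than what is shown here.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical, minimal stand-in for an array view: a non-owning pointer plus
// a length. It does not copy or own the elements.
template <typename T>
class ArrayView {
 public:
  ArrayView(const T* data, std::size_t size) : data_(data), size_(size) {}
  ArrayView(const std::vector<T>& v) : data_(v.data()), size_(v.size()) {}
  template <std::size_t N>
  ArrayView(const std::array<T, N>& a) : data_(a.data()), size_(N) {}

  const T* begin() const { return data_; }
  const T* end() const { return data_ + size_; }
  std::size_t size() const { return size_; }

 private:
  const T* data_;
  std::size_t size_;
};

// Hypothetical consumer: accepts any contiguous source without requiring the
// caller to build a std::vector first.
inline long sumIds(ArrayView<int> ids) {
  long total = 0;
  for (int id : ids) {
    total += id;
  }
  return total;
}
```

A caller holding a stack-allocated `std::array` can pass it directly, which avoids a heap allocation. The view must not outlive the container it points into.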
generatedunixname89002005325676
f2b4b0e9eb [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D27184963

fbshipit-source-id: 65355a12697c8bd996b86947e3e0aeb0ee4eff3f
2021-03-19 05:16:43 -07:00
Ilia Cherniavskii
3b1e3103ca Remove usage of onEachDevice from legacy profiler (#54125)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54125

Fixes https://github.com/pytorch/pytorch/issues/48987

Test Plan:
python setup.py clean
TORCH_CUDA_ARCH_LIST="6.0" USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install --cmake 2>&1 | tee ~/output.txt
python test/test_profiler.py -v

python setup.py clean
USE_CUDA=0 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install --cmake 2>&1 | tee ~/output.txt
python test/test_profiler.py -v

+ CI

Reviewed By: rohan-varma

Differential Revision: D27109481

Pulled By: ilia-cher

fbshipit-source-id: 3fba8bc55deafeed1ab4680b311e927f40eaf99c
2021-03-18 12:19:51 -07:00
Pritam Damania
40eea6d9d1 Support device map for distributed autograd while using TensorPipe. (#44859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44859

TensorPipe's `set_device_map` option was applied during the forward
pass. However, if we ran the backward pass for the graph we would not
automatically pick up the reverse device mapping.

As a result, users had to specify both the forward and backward device
mappings, which is tedious.

In this PR, I've added this functionality such that TensorPipe automatically
picks up the reverse device mapping during the backward pass. This is done by
storing the appropriate device mapping in the "recv" autograd function for
distributed autograd.
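
A rough sketch of what "reverse device mapping" means here, assuming a device map is a plain dictionary from sender device to receiver device. `reverse_device_map` is a hypothetical helper for illustration; the check for non-invertible maps is an assumption, not something this PR states.

```python
def reverse_device_map(device_map):
    """Invert a forward map (sender device -> receiver device)."""
    reverse = {}
    for src, dst in device_map.items():
        if dst in reverse:
            # Assumption: two sources mapping to one target cannot be
            # inverted unambiguously, so the sketch rejects it.
            raise ValueError(f"device {dst!r} is the target of two sources")
        reverse[dst] = src
    return reverse
```

With a forward map `{0: 1, 1: 0}` the backward pass would use `{1: 0, 0: 1}`, so gradients travel back to the device each tensor came from.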

#Closes: https://github.com/pytorch/pytorch/issues/44170
ghstack-source-id: 119950842

Test Plan:
1) waitforbuildbot
2) Unit test added.

Reviewed By: mrshenli

Differential Revision: D23751975

fbshipit-source-id: 2717d0ef5bde3db029a6172d98aad95734d52140
2021-01-27 13:01:44 -08:00
Luca Wehrstedt
186fe48d6e Format RPC files with clang-format (#50367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50367

This had already been done by mrshenli on Friday (#50236, D25847892 (f9f758e349)) but over the weekend Facebook's internal clang-format version got updated and this changed the format, hence we need to re-apply it. Note that this update also affected the JIT files, which are the other module enrolled in clang-format (see 8530c65e25, D25849205 (8530c65e25)).
ghstack-source-id: 119656866

Test Plan: Shouldn't include functional changes. In any case, there's CI.

Reviewed By: mrshenli

Differential Revision: D25867720

fbshipit-source-id: 3723abc6c35831d7a8ac31f74baf24c963c98b9d
2021-01-11 08:59:19 -08:00
Shen Li
008206decc Replace FutureMessage with ivalue::Future in RRefContext (#49960)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49960

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25730530

Pulled By: mrshenli

fbshipit-source-id: 5d54572c653592d79c40aed616266c87307a1ad8
2021-01-07 19:50:19 -08:00
Ilia Cherniavskii
f7a8bf2855 Use libkineto in profiler (#46470)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46470

Adding ability to use Kineto (CUPTI) to profile CUDA kernels

Test Plan:
USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
python test/test_profiler.py

python test/test_autograd.py -k test_profile
python test/test_autograd.py -k test_record

```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                       Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                      sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                       Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                            aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                            aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                          aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                    aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                            aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                        cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                  cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                               aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                           aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                       cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                              aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
```

benchmark: https://gist.github.com/ilia-cher/a5a9eb6b68504542a3cad5150fc39b1a

Reviewed By: Chillee

Differential Revision: D25142223

Pulled By: ilia-cher

fbshipit-source-id: b0dff46c28da5fb0a8e01cf548aa4f2b723fde80
2020-11-25 04:32:16 -08:00
Pritam Damania
781e0ed835 Support RRef.backward() for Owner RRefs. (#46641)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46641

Second part of https://github.com/pytorch/pytorch/pull/46568, allows
RRef.backward() to work for owner RRefs.
ghstack-source-id: 115440252

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D24441300

fbshipit-source-id: 64af28e6b6ae47ea27e611a148f217bc344a4c5b
2020-11-07 21:25:32 -08:00
Shen Li
96d48178c8 Make pipeWrite and pipeRead noexcept (#45783)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45783

After the previous device maps commits, `pipeWrite` might throw. In
this case, if we increment active calls before `pipeWrite` on the
caller, that active call won't be decremented properly when `pipeWrite`
throws. As a result, `shutdown` can silently time out. I noticed this
because some tests take more than 60s to finish.

This commit extracts the tensor device checking logic out of `pipeWrite`
and makes sure the error is thrown before the active call count is
incremented.
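
The ordering can be sketched in Python as follows. `ToyAgent` and its methods are hypothetical names for illustration only; the real logic lives in the C++ TensorPipe agent.

```python
import threading


class ToyAgent:
    """Hypothetical model of an agent that tracks in-flight calls."""

    def __init__(self, device_map):
        self.device_map = device_map
        self.active_calls = 0
        self._lock = threading.Lock()

    def _check_devices(self, tensor_devices):
        # Validation that used to live inside the write path.
        for device in tensor_devices:
            if device != "cpu" and device not in self.device_map:
                raise ValueError(f"no device mapping for {device}")

    def _pipe_write(self, tensor_devices):
        # Stand-in for the write itself; by contract it does not raise.
        pass

    def send(self, tensor_devices):
        # Fail first: an error here leaves the counter untouched.
        self._check_devices(tensor_devices)
        with self._lock:
            self.active_calls += 1
        self._pipe_write(tensor_devices)

    def on_response(self):
        with self._lock:
            self.active_calls -= 1
```

Because the check runs before the increment, a rejected call never leaves a dangling active call for `shutdown` to wait on.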

Differential Revision: D24094803

Test Plan: Imported from OSS

Reviewed By: mruberry

Pulled By: mrshenli

fbshipit-source-id: d30316bb23d2afd3ba4f5540c3bd94a2ac10969b
2020-10-08 18:53:51 -07:00
Ilia Cherniavskii
f5c95d5cf1 Source code level attribution in profiler (#43898)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43898

Adding with_source parameter to enable tracking source code
(filename and line) in profiler for eager, torchscript and autograd
modes

Test Plan:
python test/test_profiler.py
```
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Source Location
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  --------------------------------------------
ts_method_1                          10.43%           235.364us        36.46%           822.920us        822.920us        1                test/test_profiler.py(70): test_source
aten::add                            7.52%            169.833us        8.88%            200.439us        200.439us        1                test/test_profiler.py(69): test_source
aten::normal_                        6.26%            141.380us        6.26%            141.380us        141.380us        1                test/test_profiler.py(67): test_source
aten::add                            5.80%            130.830us        8.41%            189.800us        63.267us         3                test/test_profiler.py(72): test_source
aten::sum                            5.02%            113.340us        8.39%            189.475us        189.475us        1                test/test_profiler.py(64): ts_method_1
aten::add                            4.58%            103.346us        6.33%            142.847us        142.847us        1                test/test_profiler.py(62): ts_method_1
aten::mul                            4.05%            91.498us         9.62%            217.113us        217.113us        1                test/test_profiler.py(71): test_source
aten::add                            4.03%            90.880us         5.60%            126.405us        126.405us        1                test/test_profiler.py(58): ts_method_2
aten::empty                          3.49%            78.735us         3.49%            78.735us         19.684us         4                test/test_profiler.py(72): test_source
```

Reviewed By: ngimel

Differential Revision: D23432664

Pulled By: ilia-cher

fbshipit-source-id: 83ad7ebe0c2502494d3b48c4e687802db9c77615
2020-09-30 00:57:35 -07:00
Rohan Varma
27ab9bc0f9 [RPC profiling] Extend RPC profiling to support async function execution over RPC. (#44664)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44664

Closes https://github.com/pytorch/pytorch/issues/39971. This PR adds support for functions decorated with `rpc.functions.async_execution` to be profiled over RPC as builtins, jit functions, and blocking python UDFs currently can be. The reasoning for this is to provide complete feature support in terms of RPC profiling and the various types of functions users can run.

To enable this, the PR below this enables calling `disableProfiler()` safely from another thread. We use that functionality to defer disabling the profiler on the server until the future corresponding to the RPC request completes (rather than only the blocking `processRPC` call as was done previously). Since when the future completes we've kicked off the async function and the future corresponding to it has completed, we are able to capture any RPCs the function would have called and the actual work done on the other node.
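
The deferral can be illustrated with a small sketch that uses Python's standard `concurrent.futures.Future` in place of the RPC future. `ToyProfiler` and `process_request` are hypothetical names, not the server's API.

```python
from concurrent.futures import Future


class ToyProfiler:
    """Hypothetical stand-in for the server-side profiler state."""

    def __init__(self):
        self.enabled = False

    def enable(self):
        self.enabled = True

    def disable(self):
        self.enabled = False


def process_request(profiler, start_async_work):
    profiler.enable()
    fut = start_async_work()
    # Disable only when the future completes, not when this call returns,
    # so work triggered by the async function is still profiled.
    fut.add_done_callback(lambda _f: profiler.disable())
    return fut
```

Until the returned future is completed the profiler stays enabled, which is what allows the nested RPC and the remote work to be captured.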

For example, if the following async function is run on a server over RPC:

```
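import time

import torch
import torch.distributed.rpc as rpc

def slow_add(x, y):
    time.sleep(1)
    return torch.add(x, y)

# The decorator tells RPC that this function returns a Future.
@rpc.functions.async_execution
def slow_async_add(to, x, y):
    return rpc.rpc_async(to, slow_add, args=(x, y))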
```

we expect to see the original RPC profiled, the nested RPC profiled, and the actual torch.add() work. All of these events should be recorded with the correct node id. Here is an example profiling output:

```
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  --------
-------  ---------------  ---------------  ---------------
Name                                                                                                                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Node ID
-------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  --------
-------  ---------------  ---------------  ---------------                                                                                                                            rpc_async#slow_async_add(worker1 -> worker2)                                                                               0.00%            0.000us          0                1.012s
         1.012s           1                1
aten::empty                                                                                                                7.02%            11.519us         7.02%            11.519us         11.519us         1                1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)                             0.00%            0.000us          0                1.006s
         1.006s           1                2                                                                                                                                          rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty                                                        7.21%            11.843us         7.21%            11.843us
         11.843us         1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add        71.94%           118.107us        85.77%           140.802us        140.802us        1                3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty      13.82%           22.695us         13.82%           22.695us
         22.695us         1                3                                                                                                                                          -------------------------------------------------------------------------------------------------------------------------  ---------------  ---------------  ---------------  --------
-------  ---------------  ---------------  ---------------
Self CPU time total: 164.164us
```

This PR also moves a bunch of the profiling logic to `rpc/utils.cpp` to declutter `request_callback` code.
ghstack-source-id: 112868470

Test Plan:
```
rvarm1@devbig978:fbcode  (52dd34f6)$ buck test mode/no-gpu mode/dev-nosan //caffe2/test/distributed/rpc:process_group_agent -- test_rpc_profiling_async_function --print-passing-details --stress-runs 1
```

Reviewed By: mrshenli

Differential Revision: D23638387

fbshipit-source-id: eedb6d48173a4ecd41d70a9c64048920bd4807c4
2020-09-25 13:19:26 -07:00
Lucas Hosseini
af3fc9725d Extract rpc/tensorpipe_utils.{cpp,h} from rpc/utils.{cpp,h} (#44803)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44803

Test Plan: CI

Reviewed By: lw

Differential Revision: D23732022

fbshipit-source-id: 5b839c7997bbee162a14d03414ee32baabbc8ece
2020-09-18 13:51:43 -07:00
Shen Li
a9754fb860 Use TP Tensor.metadata to carry device info (#44396)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44396

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D23602576

Pulled By: mrshenli

fbshipit-source-id: c639789979b2b71fc165efbcf70f37b4c39469df
2020-09-11 08:33:22 -07:00
Shen Li
06aaf8c20d Add set_device_map to TensorPipeOptions to support GPU args (#42637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42637

This commit enables sending non-CPU tensors through RPC using the
TensorPipe backend. Users can configure device mappings by calling
`set_device_map` on `TensorPipeRpcBackendOptions`. Internally,
the `init_rpc` API verifies the correctness of device mappings. It
will shut down RPC if the check fails, or proceed and pass global
mappings to `TensorPipeAgent` if the check succeeds. For serde,
we added a device indices field to TensorPipe read and write buffers,
which should be either empty (all tensors must be on CPU) or match
the tensors in order and number in the RPC message. This commit
does not yet implement zero-copy: the tensor is always moved to CPU
on the sender and then moved to the specified device on the receiver.
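
The rule for the device indices field can be sketched as below. `resolve_devices` is a hypothetical helper, and rendering an index as `cuda:<index>` is an assumption made for readability.

```python
def resolve_devices(num_tensors, device_indices):
    """Return the target device for each tensor in an RPC message."""
    if not device_indices:
        # Empty field: every tensor is expected to be on CPU.
        return ["cpu"] * num_tensors
    if len(device_indices) != num_tensors:
        # Otherwise the indices must match the tensors one to one, in order.
        raise ValueError(
            f"got {len(device_indices)} device indices for {num_tensors} tensors"
        )
    return [f"cuda:{index}" for index in device_indices]
```

For a message with two tensors, an empty field gives `["cpu", "cpu"]` and `[1, 0]` gives `["cuda:1", "cuda:0"]`; a field of any other length is rejected.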

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D23011572

Pulled By: mrshenli

fbshipit-source-id: 62b617eed91237d4e9926bc8551db78b822a1187
2020-08-14 18:46:55 -07:00
Nikita Shulga
a6b703cc89 Make torch_cpu compileable when USE_TENSORPIPE is not set. (#40846)
Summary:
Forward-declare `tensorpipe::Message` class in utils.h
Guard TensorPipe specific methods in utils.cpp with `#ifdef USE_TENSORPIPE`
Pass `USE_TENSORPIPE` as private flag to `torch_cpu` library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40846

Differential Revision: D22338864

Pulled By: malfet

fbshipit-source-id: 2ea2aea84527ae7480e353afb55951a068b3b980
2020-07-07 07:02:57 -07:00
Rohan Varma
14f7e95c1a Add prefix of remote events for RPC profiling (#40066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40066

Builds on top of the previous PR to ensure that all remotely profiled events are prefixed with the key for the RPC that generated them.

The key is generated by the result of `_build_rpc_profiling_key` in `rpc/internal.py` and prefixed onto the event name. In order to do this, we set the current-key when creating the RPC in Python, retrieve the currently-set key in C++ and save a GloballyUniqueId -> key mapping to an in-memory map. When we receive an RPC with profiling information, we expect to receive this ID back, and look up the corresponding profiling key in the map.

The key is then added to all the remote events.
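
A minimal sketch of the ID-to-key map and the prefixing step, under stated assumptions: `ProfilingKeyRegistry` and `prefix_remote_events` are hypothetical names, the ID is modelled as a tuple, and the `#remote_op: ` separator is borrowed from profiler output shown for later RPC profiling commits in this log.

```python
import threading


class ProfilingKeyRegistry:
    """Hypothetical in-memory map from a globally unique RPC id to its key."""

    def __init__(self):
        self._lock = threading.Lock()
        self._keys = {}

    def save(self, rpc_id, key):
        # Guarded by a lock because several RPCs may be created at once.
        with self._lock:
            self._keys[rpc_id] = key

    def lookup(self, rpc_id):
        with self._lock:
            return self._keys[rpc_id]


def prefix_remote_events(key, event_names):
    # Prepend the RPC's profiling key to every remotely profiled event.
    return [f"{key}#remote_op: {name}" for name in event_names]
```

When a response carrying the ID arrives, the caller would look up the key and apply it to each remote event name before adding the events to the local profile.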

Tested by adding tests to ensure the key is added to all the remote events. Also added a UT which tests it under the multi-threading scenario, to ensure that the mapping's correctness is maintained when several RPCs are in the process of being created at once.
ghstack-source-id: 106316106

Test Plan: Unit test

Differential Revision: D22040035

fbshipit-source-id: 9215feb06084b294edbfa6e03385e13c1d730c43
2020-06-22 11:01:07 -07:00
Rohan Varma
7e82382ad5 Allow profiler to be enabled remotely with RPC (#38748)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38748

This diff contains the message scaffolding and profiler changes in order to be able to remotely run the profiler across different nodes and aggregate the results on a single node.

As discussed, we have implemented this by creating new message types that, similar to autograd messages, wrap the profiling information with the original message, and send this new message over the wire. On the receiving end, this wrapped message is detected, we fetch the original message from it, and process the original message with the profiler enabled. When sending a response with profiling information, we serialize the profiled `Events` and send them back over RPC. When such a message is received, the events profiled on the remote node are stored (added back to the local profiler).
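
The wrap, unwrap, and respond flow can be sketched in Python. All names here (`Message`, `ProfilingRequest`, `ProfilingResponse`, `handle`, `echo`) are hypothetical, and events are modelled as plain strings rather than serialized profiler `Event`s.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    msg_type: str
    payload: str


@dataclass
class ProfilingRequest:
    """Wraps the original request together with the profiler config."""
    profiler_config: dict
    original: Message


@dataclass
class ProfilingResponse:
    """Wraps the original response together with the profiled events."""
    original: Message
    events: List[str]


def handle(incoming, process):
    if isinstance(incoming, ProfilingRequest):
        events = []  # filled in while the "profiler" is enabled
        reply = process(incoming.original, events)
        return ProfilingResponse(reply, events)
    # Unwrapped messages are processed as before, without profiling.
    return process(incoming, None)


def echo(message, events):
    # Example handler: records one event when profiling is requested.
    if events is not None:
        events.append("echo")
    return Message("resp", message.payload)
```

The sender would unwrap a `ProfilingResponse` the same way, handing `original` to the normal response path and adding `events` to its local profiler.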

Changes in this PR:
- New message types (run_with_profiling_req, run_with_profiling_resp) to send profiling info over the wire. Message parsing logic is added to handle these wrapped types.
- Handling of sending profiler data over the wire, in particular, the attributes of the `ProfilerConfig` and the serialized profiled `Event`s
- The logic for wrapping RPC messages is deduped with that in `rpc_with_autograd`, and the common payload wrapping/unwrapping logic is moved to helper functions in `rpc/utils.cpp`
- Changes in `autograd/utils.cpp` to detect if we have enabled the profiler and are sending an RPC, if so, uses the above new message types
- Changes in request_callback to parse and turn on the profiler in a thread-local fashion
- Serialization and deserialization of profiling `Events`, and support to add the remote events to the thread-local profiler
- Introduction of the concept of `node_id`, which, as discussed with ilia-cher, will be used along with the `Event`'s handle attribute to distinguish between events. When there are events from different nodes, this node information is rendered in the profile output (e.g. when printing tables); otherwise it is not, since it is irrelevant.
- Some changes to profiler.cpp to add useful helper methods/guards
- toHere() is now profiled for RRefs
- Unittests
ghstack-source-id: 106134626

Test Plan: Added unittests, existing profiler unittests.

Differential Revision: D19510010

fbshipit-source-id: 044347af992f19a9e3b357c9567f6fc73e988157
2020-06-18 17:01:57 -07:00
Wanchao Liang
6c56671fd9 [jit] avoid pre-convert tensor to cpu in pickling (#38898)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38898

Pickling will pickle the tensor meta info, and it's up to the jit
exporter or other upstream users of the pickler to decide how to write
the actual tensor data.

This PR moves the call to getWritableTensorData to the upper level so that rpc
and TensorPipe can leverage it by pickling only the tensor metadata, without
converting the tensor from GPU to CPU.
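
As an analogy only, the same split between metadata and payload can be shown with Python's standard `pickle` hooks `persistent_id` and `persistent_load`. This is not the JIT pickler; `FakeTensor`, `MetaPickler`, `MetaUnpickler`, `dumps`, and `loads` are hypothetical names.

```python
import io
import pickle


class FakeTensor:
    """Hypothetical tensor: a device tag plus some data."""

    def __init__(self, device, data):
        self.device = device
        self.data = data


class MetaPickler(pickle.Pickler):
    def __init__(self, file):
        super().__init__(file)
        self.tensor_table = []  # payloads kept out of the pickle stream

    def persistent_id(self, obj):
        if isinstance(obj, FakeTensor):
            # Write only metadata; the caller decides how to ship the data.
            self.tensor_table.append(obj)
            return ("tensor", len(self.tensor_table) - 1, obj.device)
        return None  # everything else is pickled normally


class MetaUnpickler(pickle.Unpickler):
    def __init__(self, file, tensor_table):
        super().__init__(file)
        self.tensor_table = tensor_table

    def persistent_load(self, pid):
        tag, index, _device = pid
        if tag != "tensor":
            raise pickle.UnpicklingError(f"unknown persistent id: {pid!r}")
        return self.tensor_table[index]


def dumps(obj):
    buffer = io.BytesIO()
    pickler = MetaPickler(buffer)
    pickler.dump(obj)
    return buffer.getvalue(), pickler.tensor_table


def loads(data, tensor_table):
    return MetaUnpickler(io.BytesIO(data), tensor_table).load()
```

Here `dumps` returns the pickle bytes and a separate tensor table, so a transport could send the table entries directly from their device instead of copying them to CPU first.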

Test Plan: Imported from OSS

Differential Revision: D21879866

Pulled By: wanchaol

fbshipit-source-id: 75f7ff4073e4ad15b6588973dcbdc48f97a8329f
2020-06-07 21:28:33 -07:00