Re: Monitors crash largely due to the structure of pg-upmap-primary

Decompile and edit the pool number in the crush map?

Mind you, this is speculation; I have not tried it and cannot warrant that it will not blow up.
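
For reference, the decompile/edit/recompile round trip would look roughly like the following. This is purely a sketch, untested for this case, and it assumes the stale references are actually visible somewhere in the decompiled CRUSH text:

    ceph osd getcrushmap -o crush.bin         # grab the current CRUSH map
    crushtool -d crush.bin -o crush.txt       # decompile it to editable text
    # edit crush.txt, then recompile and inject it back
    crushtool -c crush.txt -o crush.new.bin
    ceph osd setcrushmap -i crush.new.bin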

> On Feb 24, 2025, at 10:31 AM, Michal Strnad <michal.strnad@xxxxxxxxx> wrote:
> 
> Hi.
> 
> We thought of that as well, but are we able to create a pool with a specific ID? We believe the pool ID cannot be specified, since the auto-incrementing ID counter is part of the protection against recycling old IDs.
> 
> Thx
> Michal
> 
> 
> On 2/24/25 16:14, Anthony D'Atri wrote:
>> Re-create pool 24 with the necessary number of PGs, remove the upmaps, then delete the temporary pool? A rough sketch follows below.
>> Mind you, I would think that deleting a pool should itself remove the upmaps for the PGs it comprised.
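>> 
>> Roughly something like this, untested; the pool name is made up, pg_num is 4096 because 24.ffc implies at least 0xffd (4093) PGs, and it only helps if the new pool actually comes back with ID 24:
>> 
>>     ceph osd pool create tmp-upmap-fix 4096 4096
>>     ceph osd rm-pg-upmap-primary 24.ffc     # repeat for every leftover entry
>>     ceph osd pool rm tmp-upmap-fix tmp-upmap-fix --yes-i-really-really-mean-it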
>>> On Feb 24, 2025, at 9:12 AM, Michal Strnad <michal.strnad@xxxxxxxxx> wrote:
>>> 
>>> Hi.
>>> 
>>> We ran into an issue with "pg-upmap-primary" that is causing our monitors to crash en masse (around 1,000 crashes per day). According to [1], it should be possible to remove these pg-upmap-primary entries. Unfortunately, we are unable to do so, because the PGs, and therefore the pool, no longer exist.
>>> 
>>> root@xxxxxxxxxx ~ # ceph osd dump | grep 'pg_upmap_primary' | grep 24.ffc
>>> pg_upmap_primary 24.ffc 232
>>> root@xxxxxxxxxx ~ # ceph osd rm-pg-upmap-primary 24.ffc
>>> Error ENOENT: pgid '24.ffc' does not exist
>>> root@xxxxxxxxxx ~ # ceph pg dump | grep "^24\."
>>> dumped all
>>> 
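>>> For completeness: enumerating every leftover entry for the deleted pool and feeding it back into rm-pg-upmap-primary, roughly as below, presumably fails with the same ENOENT for each of them:
>>> 
>>>     ceph osd dump | awk '$1 == "pg_upmap_primary" && $2 ~ /^24\./ {print $2}' \
>>>         | xargs -r -n1 ceph osd rm-pg-upmap-primary
>>> 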
>>> Is there any way to remove these structures?
>>> 
>>> We also tried upgrading from our current version, 18.2.1, to 18.2.4, but on our three-node test cluster this left one of the three monitors, along with a third of the OSDs, unable to start because of the same structure. Restarting the daemons didn't help.
>>> 
>>> Does anyone have a solution or an idea? This is becoming quite a problem for us.
>>> 
>>> Below, I am attaching one of the many monitor crash logs.
>>> 
>>> Thank you very much for any advice!
>>> 
>>> Of course, we have also created a ticket in the tracker [https://tracker.ceph.com/issues/69760], where the same information I’m sending in this email is documented.
>>> 
>>> Michal
>>> 
>>> [1] https://tracker.ceph.com/issues/61948#note-32
>>> 
>>> {
>>> "assert_condition": "pg_upmap_primaries.empty()",
>>> "assert_file": "/builddir/build/BUILD/ceph-18.2.1/src/osd/OSDMap.cc",https://tracker.ceph.com/issues/69760
>>> "assert_func": "void OSDMap::encode(ceph::buffer::v15_2_0::list&, uint64_t) const",
>>> "assert_line": 3239,
>>> "assert_msg": "/builddir/build/BUILD/ceph-18.2.1/src/osd/OSDMap.cc: In function 'void OSDMap::encode(ceph::buffer::v15_2_0::list&, uint64_t) const' thread 7f94216a3640 time 2025-02-02T19:16:03.629964+0100\n/builddir/build/BUILD/ceph-18.2.1/src/osd/OSDMap.cc: 3239: FAILED ceph_assert(pg_upmap_primaries.empty())\n",
>>> "assert_thread_name": "ms_dispatch",
>>> "backtrace": [
>>> "/lib64/libc.so.6(+0x54db0) [0x7f9429054db0]",
>>> "/lib64/libc.so.6(+0xa365c) [0x7f94290a365c]",
>>> "raise()",
>>> "abort()",
>>> "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7f9429d630df]",
>>> "/usr/lib64/ceph/libceph-common.so.2(+0x163243) [0x7f9429d63243]",
>>> "/usr/lib64/ceph/libceph-common.so.2(+0x1a0f38) [0x7f9429da0f38]",
>>> "(OSDMonitor::reencode_full_map(ceph::buffer::v15_2_0::list&, unsigned long)+0xe2) [0x55ca54957e22]",
>>> "(OSDMonitor::get_version_full(unsigned long, unsigned long, ceph::buffer::v15_2_0::list&)+0x1de) [0x55ca549596ae]",
>>> "(OSDMonitor::build_latest_full(unsigned long)+0x2a3) [0x55ca549599a3]",
>>> "(OSDMonitor::check_osdmap_sub(Subscription*)+0xc8) [0x55ca5495be98]",
>>> "(Monitor::handle_subscribe(boost::intrusive_ptr<MonOpRequest>)+0xf04) [0x55ca54834dd4]",
>>> "(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x6a6) [0x55ca548359c6]",
>>> "(Monitor::_ms_dispatch(Message*)+0x779) [0x55ca54836d59]",
>>> "/usr/bin/ceph-mon(+0x2f3dfe) [0x55ca547f1dfe]",
>>> "(DispatchQueue::entry()+0x52a) [0x7f9429f5766a]",
>>> "/usr/lib64/ceph/libceph-common.so.2(+0x3e7321) [0x7f9429fe7321]",
>>> "/lib64/libc.so.6(+0xa1912) [0x7f94290a1912]",
>>> "/lib64/libc.so.6(+0x3f450) [0x7f942903f450]"
>>> ],
>>> "ceph_version": "18.2.1",
>>> "crash_id": "2025-02-02T18:16:03.632571Z_f5516ed0-6df5-4267-bada-71f5d8d764ba",
>>> "entity_name": "mon.mon001-clX",
>>> "os_id": "centos",
>>> "os_name": "CentOS Stream",
>>> "os_version": "9",
>>> "os_version_id": "9",
>>> "process_name": "ceph-mon",
>>> "stack_sig": "772ef523b041edc5147d1d9905926fb794d32b2635368a8199f6e2e4f2d688bf",
>>> "timestamp": "2025-02-02T18:16:03.632571Z",
>>> "utsname_hostname": "app001.clX",
>>> "utsname_machine": "x86_64",
>>> "utsname_release": "5.14.0-402.el9.x86_64",
>>> "utsname_sysname": "Linux",
>>> "utsname_version": "#1 SMP PREEMPT_DYNAMIC Thu Dec 21 19:46:35 UTC 2023"
>>> }
> 
> -- 
> Michal Strnad
> Storage specialist
> CESNET a.l.e.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



