Re: Reef: rgw daemon crashes

Eugen Block <eblock@xxxxxx> · Fri, 07 Feb 2025 21:30:01 +0000

To me it looks like a memory leak that wasn't present in 16.2.11 (the
previous Ceph version on this cluster). The usage hasn't changed, so
it must be Ceph. I've been watching podman stats for a while: the rgw
process uses more and more memory until it caps at around 5 or 6 GB,
then it respawns and runs for a couple of hours. But there's no OOM
killer or anything, and the host has plenty of free RAM. This host also
runs OSDs, but only for barely used HDD pools.
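
For anyone who wants to track this kind of growth over time rather than watching podman stats by hand, here is a minimal sketch, assuming a Linux host and a known radosgw PID. The function names are hypothetical and not part of Ceph. It logs the resident memory from /proc/<pid>/status and the number of mappings from /proc/<pid>/maps next to vm.max_map_count. The mapping count is included because the failed assertion sits in Boost.Context's protected_fixedsize_stack allocator, which sets up a guard page for every stack it hands out; whether that limit is actually involved here is an assumption to verify, not something this report confirms.

```python
#!/usr/bin/env python3
"""Sample RSS and memory-map count of one process from /proc (Linux only)."""
import sys
import time
from pathlib import Path


def parse_vmrss_kib(status_text: str) -> int:
    """Return the VmRSS value in KiB from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line looks like "VmRSS:\t  123456 kB"; the number is field 1.
            return int(line.split()[1])
    raise ValueError("VmRSS not found")


def count_maps(maps_text: str) -> int:
    """Count mappings; /proc/<pid>/maps has one mapping per non-empty line."""
    return sum(1 for line in maps_text.splitlines() if line.strip())


def sample(pid: int) -> tuple[int, int]:
    """Return (rss_kib, map_count) for the given PID."""
    proc = Path("/proc") / str(pid)
    rss_kib = parse_vmrss_kib((proc / "status").read_text())
    map_count = count_maps((proc / "maps").read_text())
    return rss_kib, map_count


def main(argv: list[str]) -> None:
    if len(argv) < 2:
        sys.exit(f"usage: {argv[0]} PID [INTERVAL_SECONDS]")
    pid = int(argv[1])
    interval = float(argv[2]) if len(argv) > 2 else 60.0
    limit = int(Path("/proc/sys/vm/max_map_count").read_text())
    while True:
        rss_kib, map_count = sample(pid)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} rss={rss_kib / 1024:.0f}MiB maps={map_count}/{limit}",
              flush=True)
        time.sleep(interval)


if __name__ == "__main__":
    main(sys.argv)
```

Run it on the host as root with the PID of the radosgw process inside the container (reading another user's /proc/<pid>/maps needs the privileges), and redirect the output to a file so the last samples before a respawn are kept.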

Quoting Eugen Block <eblock@xxxxxx>:

Hi,

yesterday I upgraded a customer cluster to Reef (18.2.4). The
upgrade went quite well and nothing happened for hours, until it did.
One of the two RGW daemons has crashed twice in the last 12 hours.
Here's one backtrace:

---snip---
Feb 06 23:19:58 storage09 conmon[2501983]: radosgw: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.4/rpm/el9/BUILD/ceph-18.2.4/redhat-linux-build/boost/include/boost/context/posix/protected_fixedsize_stack.hpp:70: boost::context::stack_context boost::context::basic_protected_fixedsize_stack<traitsT>::allocate() [with traitsT = boost::context::stack_traits]: Assertion `0 == result' failed.
Feb 06 23:19:58 storage09 conmon[2501983]: *** Caught signal (Aborted) **
Feb 06 23:19:58 storage09 conmon[2501983]:  in thread 7f77a10b2640 thread_name:radosgw
Feb 06 23:19:58 storage09 conmon[2501983]:  ceph version 18.2.4 (e7ad5345525c7aa95470c26863873b581076945d) reef (stable)
Feb 06 23:19:58 storage09 conmon[2501983]:  1: /lib64/libc.so.6(+0x3e6f0) [0x7f78adeba6f0]
Feb 06 23:19:58 storage09 conmon[2501983]:  2: /lib64/libc.so.6(+0x8b94c) [0x7f78adf0794c]
Feb 06 23:19:58 storage09 conmon[2501983]:  3: raise()
Feb 06 23:19:58 storage09 conmon[2501983]:  4: abort()
Feb 06 23:19:58 storage09 conmon[2501983]:  5: /lib64/libc.so.6(+0x2871b) [0x7f78adea471b]
Feb 06 23:19:58 storage09 conmon[2501983]:  6: /lib64/libc.so.6(+0x37386) [0x7f78adeb3386]
Feb 06 23:19:58 storage09 conmon[2501983]:  7: /usr/bin/radosgw(+0x361cb2) [0x561ed9ef3cb2]
Feb 06 23:19:58 storage09 conmon[2501983]:  8: /usr/bin/radosgw(+0x361db8) [0x561ed9ef3db8]
Feb 06 23:19:58 storage09 conmon[2501983]:  9: /usr/bin/radosgw(+0x36e15e) [0x561ed9f0015e]
Feb 06 23:19:58 storage09 conmon[2501983]:  10: /usr/bin/radosgw(+0x357558) [0x561ed9ee9558]
Feb 06 23:19:58 storage09 conmon[2501983]:  11: /usr/bin/radosgw(+0x34546c) [0x561ed9ed746c]
Feb 06 23:19:58 storage09 conmon[2501983]:  12: /usr/bin/radosgw(+0x358f0a) [0x561ed9eeaf0a]
Feb 06 23:19:58 storage09 conmon[2501983]:  13: /usr/bin/radosgw(+0xb705de) [0x561eda7025de]
Feb 06 23:19:58 storage09 conmon[2501983]:  14: /usr/bin/radosgw(+0x3c6aed) [0x561ed9f58aed]
Feb 06 23:19:58 storage09 conmon[2501983]:  15: /lib64/libstdc++.so.6(+0xdbad4) [0x7f78ae258ad4]
Feb 06 23:19:58 storage09 conmon[2501983]:  16: /lib64/libc.so.6(+0x89c02) [0x7f78adf05c02]
Feb 06 23:19:58 storage09 conmon[2501983]:  17: /lib64/libc.so.6(+0x10ec40) [0x7f78adf8ac40]
---snip---

I didn't find anything helpful in the tracker, only a year-old report on
this list [0] that never got a response. The other daemon
on a different host seems to be stable for now. This is not a
multi-site deployment, just two RGWs for a single zone.
Any comments/pointers are appreciated! I can file a tracker issue if
this is something new.

Thanks!
Eugen

[0] https://www.spinics.net/lists/ceph-users/msg80956.html
