Re: radosgw hang under pressure

Hi Peter,

If you can reproduce and have debug symbols installed, I'd be interested to see the output of this tool:


https://github.com/markhpc/uwpmp/


It might need slightly different compile instructions if you have a newer version of Go.  I can send you an executable offline if needed.  Since RGW can potentially have a fairly insane number of threads with the default settings, it will gather samples pretty slowly.  Just start out by collecting something like 100 samples:


sudo ./unwindpmp -n 100 -p `pidof radosgw` > foo.txt


Hopefully that will help diagnose where all of the threads are spending time in the code.  uwpmp also has a much faster libdw backend (-b libdw), but its callgraphs aren't always accurate, so I would stick with the default unwind backend for now.
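

If you do want to compare backends later, it's the same invocation with the backend flag swapped in (just a sketch; the output filename is arbitrary):


sudo ./unwindpmp -b libdw -n 100 -p `pidof radosgw` > foo_libdw.txt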


Mark


On 6/12/23 12:15, grin wrote:
Hello,

ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)

There is a single (test) radosgw serving plenty of test traffic. When under heavy req/s ("heavy" in a low sense, about 1k req/s) it pretty reliably hangs: low-traffic threads seem to work (like handling occasional PUTs), but GETs are completely nonresponsive, and all attention seems to be spent on futexes.
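
The load is roughly of this shape (a sketch with a hypothetical bucket/object name, assuming the beast frontend on its default port 7480 and anonymous read access on the test bucket):

# fire ~1k concurrent GETs at the gateway
for i in `seq 1 1000`; do
    curl -s -o /dev/null http://localhost:7480/testbucket/testobj &
done
wait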

The effect is extremely similar to
https://ceph-users.ceph.narkive.com/I4uFVzH9/radosgw-civetweb-hangs-once-around-850-established-connections (subject: "Radosgw (civetweb) hangs once around 850 established connections"),
except this is quincy, so it's beast instead of civetweb. The effect is the same as described there, except the cluster is much smaller (about 20-40 OSDs).

I observed that when I start radosgw -f with debug 20/20 it almost never hangs, so my guess is some ugly race condition. However, I am a bit clueless about how to actually debug it, since debugging makes the problem go away. Debug 1 (the default) with -d seems to hang after a while, but it's not that simple to induce; I'm still testing under 4/4.
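
For reference, the invocations look roughly like this (a sketch; debug_rgw is the standard Ceph debug setting and can also be set in ceph.conf):

radosgw -f --debug-rgw=20/20    # almost never hangs
radosgw -f --debug-rgw=4/4      # current test setting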

Also, I do not see much that can be configured about beast.

To answer the questions from the original (2016) thread:
- Debian stable
- no visible limits issue
- no obvious memory leak observed
- no other visible resource shortage
- strace says everyone's waiting on futexes, about 600-800 threads, apart from the one serving occasional PUTs (see the sketch after this list)
- the TCP port doesn't respond.
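
The thread and futex observations come from something like this (a minimal sketch, assuming a single radosgw process):

ls /proc/`pidof radosgw`/task | wc -l                  # total thread count
sudo strace -f -p `pidof radosgw` -e trace=futex -o strace.out
grep -c FUTEX_WAIT strace.out                          # futex waits captured in the trace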

IRC didn't react. ;-)

Thanks,
Peter

--
Best Regards,
Mark Nelson
Head of R&D (USA)

Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: mark.nelson@xxxxxxxxx

We are hiring: https://www.clyso.com/jobs/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



