I've also encountered this issue on a cluster yesterday; one CPU got stuck in an infinite loop in get_obj_data::flush and it stopped serving requests. I've updated the tracker issue accordingly.

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Aug 21, 2019 at 3:55 PM Vladimir Brik
<vladimir.brik@xxxxxxxxxxxxxxxx> wrote:
>
> Hello
>
> I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically,
> the radosgw process on those machines starts consuming 100% of 5 CPU
> cores for days at a time, even though the machine is not being used for
> data transfers (nothing in the radosgw logs, a couple of KB/s of
> network traffic).
>
> This situation can affect any number of our rados gateways, lasts from
> a few hours to a few days, and stops either on its own or when the
> radosgw process is restarted.
>
> Does anybody have an idea what might be going on, or how to debug it?
> I don't see anything obvious in the logs. Perf top says the CPU is
> consumed by the radosgw shared object in the symbol
> get_obj_data::flush, which, if I interpret things correctly, is called
> from a symbol with a long name containing the substring
> "boost9intrusive9list_impl".
>
> This is our configuration:
> rgw_frontends = civetweb num_threads=5000 port=443s
> ssl_certificate=/etc/ceph/rgw.crt
> error_log_file=/var/log/ceph/civetweb.error.log
>
> (The error log file doesn't exist.)
>
>
> Thanks,
>
> Vlad
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
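For anyone hitting the same symptom, a minimal sketch of how the spinning thread can be pinned down, along the lines of the perf-top observation above. This assumes perf and gdb are installed on the gateway host and the daemon is named "radosgw"; the file names and the 30-second sampling window are arbitrary choices, not anything from the thread:

```shell
# Find the oldest radosgw process on this host (empty if none is running).
pid=$(pgrep -o radosgw || true)

if [ -n "$pid" ]; then
    # Sample call graphs for 30 seconds, then print the hottest symbols.
    # A stuck loop shows up as one symbol (e.g. get_obj_data::flush)
    # dominating the report.
    perf record -g -p "$pid" -- sleep 30
    perf report --stdio | head -40

    # Snapshot every thread's backtrace without restarting the daemon,
    # to see what is calling the hot symbol.
    gdb -p "$pid" --batch -ex 'thread apply all bt' > radosgw-stacks.txt
else
    echo "radosgw is not running on this host"
fi
```

Attaching gdb briefly pauses the process, so on a loaded gateway it is safer to run this during a quiet period.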