I've also encountered this issue on a cluster yesterday; one CPU got stuck in an infinite loop in get_obj_data::flush and it stopped serving requests. I've updated the tracker issue accordingly.

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Aug 21, 2019 at 3:55 PM Vladimir Brik
<vladimir.brik@xxxxxxxxxxxxxxxx> wrote:
>
> Hello
>
> I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically,
> the radosgw process on those machines starts consuming 100% of 5 CPU
> cores for days at a time, even though the machine is not being used for
> data transfers (nothing in the radosgw logs, a couple of KB/s of
> network traffic).
>
> This situation can affect any number of our rados gateways, lasts from
> a few hours to a few days, and stops either on its own or when the
> radosgw process is restarted.
>
> Does anybody have an idea what might be going on, or how to debug it?
> I don't see anything obvious in the logs. Perf top says the CPU is
> consumed by the radosgw shared object in the symbol
> get_obj_data::flush, which, if I interpret things correctly, is called
> from a symbol with a long name containing the substring
> "boost9intrusive9list_impl".
>
> This is our configuration:
> rgw_frontends = civetweb num_threads=5000 port=443s
> ssl_certificate=/etc/ceph/rgw.crt
> error_log_file=/var/log/ceph/civetweb.error.log
>
> (The error log file doesn't exist.)
>
>
> Thanks,
>
> Vlad
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
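For anyone hitting the same symptom, a minimal sketch of how the spinning thread can be pinned down, along the lines of the perf-top observation above. This assumes perf and gdb are installed on the gateway host and the daemon is named "radosgw"; the file names and the 30-second sampling window are arbitrary choices, not anything from the thread:

```shell
# Find the oldest radosgw process on this host (empty if none is running).
pid=$(pgrep -o radosgw || true)

if [ -n "$pid" ]; then
    # Sample call graphs for 30 seconds, then print the hottest symbols.
    # A stuck loop shows up as one symbol (e.g. get_obj_data::flush)
    # dominating the report.
    perf record -g -p "$pid" -- sleep 30
    perf report --stdio | head -40

    # Snapshot every thread's backtrace without restarting the daemon,
    # to see what is calling the hot symbol.
    gdb -p "$pid" --batch -ex 'thread apply all bt' > radosgw-stacks.txt
else
    echo "radosgw is not running on this host"
fi
```

Attaching gdb briefly pauses the process, so on a loaded gateway it is safer to run this during a quiet period.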