I've also seen this problem once on Nautilus, with no obvious reason for the slowness.

In my case it was a rather old cluster that was upgraded all the way from firefly.

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Feb 18, 2020 at 5:52 PM Wido den Hollander <wido@xxxxxxxx> wrote:
>
>
> On 8/27/19 11:49 PM, Bryan Stillwell wrote:
> > We've run into a problem on our test cluster this afternoon which is running Nautilus (14.2.2). It seems that any time PGs move on the cluster (from marking an OSD down, setting the primary-affinity to 0, or by using the balancer), a large number of the OSDs in the cluster peg the CPU cores they're running on for a while, which causes slow requests. From what I can tell, it appears to be related to slow peering caused by osd_pg_create() taking a long time.
> >
> > This was seen on quite a few OSDs while waiting for peering to complete:
> >
> > # ceph daemon osd.3 ops
> > {
> >     "ops": [
> >         {
> >             "description": "osd_pg_create(e179061 287.7a:177739 287.9a:177739 287.e2:177739 287.e7:177739 287.f6:177739 287.187:177739 287.1aa:177739 287.216:177739 287.306:177739 287.3e6:177739)",
> >             "initiated_at": "2019-08-27 14:34:46.556413",
> >             "age": 318.25234538000001,
> >             "duration": 318.25241895300002,
> >             "type_data": {
> >                 "flag_point": "started",
> >                 "events": [
> >                     {
> >                         "time": "2019-08-27 14:34:46.556413",
> >                         "event": "initiated"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:34:46.556413",
> >                         "event": "header_read"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:34:46.556299",
> >                         "event": "throttled"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:34:46.556456",
> >                         "event": "all_read"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:12.456901",
> >                         "event": "dispatched"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:12.456903",
> >                         "event": "wait for new map"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:40:01.292346",
> >                         "event": "started"
> >                     }
> >                 ]
> >             }
> >         },
> > ...snip...
> >         {
> >             "description": "osd_pg_create(e179066 287.7a:177739 287.9a:177739 287.e2:177739 287.e7:177739 287.f6:177739 287.187:177739 287.1aa:177739 287.216:177739 287.306:177739 287.3e6:177739)",
> >             "initiated_at": "2019-08-27 14:35:09.908567",
> >             "age": 294.900191001,
> >             "duration": 294.90068416899999,
> >             "type_data": {
> >                 "flag_point": "delayed",
> >                 "events": [
> >                     {
> >                         "time": "2019-08-27 14:35:09.908567",
> >                         "event": "initiated"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:09.908567",
> >                         "event": "header_read"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:09.908520",
> >                         "event": "throttled"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:09.908617",
> >                         "event": "all_read"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:12.456921",
> >                         "event": "dispatched"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:12.456923",
> >                         "event": "wait for new map"
> >                     }
> >                 ]
> >             }
> >         }
> >     ],
> >     "num_ops": 6
> > }
> >
> > That "wait for new map" message made us think something was getting hung up on the monitors, so we restarted them all without any luck.
> >
> > I'll keep investigating, but so far my Google searches aren't pulling anything up, so I wanted to see if anyone else is running into this.
>
> I've seen this twice now on a ~1400 OSD cluster running Nautilus.
>
> I created a bug report for this: https://tracker.ceph.com/issues/44184
>
> Did you make any progress on this or run into it a second time?
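Not yet, but something that might help narrow it down is checking how widespread it is per host: walk the OSD admin sockets, list any long-running osd_pg_create ops, and compare each OSD's newest osdmap epoch against the cluster's. This is only a rough sketch, not a tested tool: it assumes the default /var/run/ceph socket layout, jq installed, and an arbitrary 30-second age threshold.

#!/usr/bin/env bash
# Rough sketch: for every OSD admin socket on this host, list in-flight
# osd_pg_create ops older than 30s and show how far the OSD's osdmap lags
# behind the cluster. Assumes /var/run/ceph/ceph-osd.*.asok and jq.

cluster_epoch=$(ceph osd dump -f json | jq .epoch)
echo "cluster osdmap epoch: ${cluster_epoch}"

for sock in /var/run/ceph/ceph-osd.*.asok; do
    osd=$(basename "$sock" .asok)

    # Same dump as "ceph daemon osd.N ops" above, filtered to slow pg-create ops.
    ceph --admin-daemon "$sock" ops 2>/dev/null \
      | jq -r --arg osd "$osd" '
          .ops[]
          | select(.description | startswith("osd_pg_create"))
          | select(.age > 30)
          | "\($osd): age=\(.age | floor)s flag_point=\(.type_data.flag_point)"'

    # "wait for new map" suggests the OSD has not yet caught up to the epoch
    # named in the osd_pg_create message, so show how far behind it is.
    newest=$(ceph --admin-daemon "$sock" status 2>/dev/null | jq .newest_map)
    echo "${osd}: newest_map=${newest} (behind by $((cluster_epoch - newest)))"
done

If the OSDs turn out to be hundreds of epochs behind while the mons are idle, that would point at map/pg-create processing on the OSD side rather than the monitors, which would line up with the CPU-pegging Bryan describes.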
>
> Wido
>
> > Thanks,
> > Bryan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx