On 8/27/19 11:49 PM, Bryan Stillwell wrote:
> We've run into a problem on our test cluster this afternoon which is running Nautilus (14.2.2). It seems that any time PGs move on the cluster (from marking an OSD down, setting the primary-affinity to 0, or by using the balancer), a large number of the OSDs in the cluster peg the CPU cores they're running on for a while which causes slow requests. From what I can tell it appears to be related to slow peering caused by osd_pg_create() taking a long time.
>
> This was seen on quite a few OSDs while waiting for peering to complete:
>
> # ceph daemon osd.3 ops
> {
>     "ops": [
>         {
>             "description": "osd_pg_create(e179061 287.7a:177739 287.9a:177739 287.e2:177739 287.e7:177739 287.f6:177739 287.187:177739 287.1aa:177739 287.216:177739 287.306:177739 287.3e6:177739)",
>             "initiated_at": "2019-08-27 14:34:46.556413",
>             "age": 318.25234538000001,
>             "duration": 318.25241895300002,
>             "type_data": {
>                 "flag_point": "started",
>                 "events": [
>                     {
>                         "time": "2019-08-27 14:34:46.556413",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2019-08-27 14:34:46.556413",
>                         "event": "header_read"
>                     },
>                     {
>                         "time": "2019-08-27 14:34:46.556299",
>                         "event": "throttled"
>                     },
>                     {
>                         "time": "2019-08-27 14:34:46.556456",
>                         "event": "all_read"
>                     },
>                     {
>                         "time": "2019-08-27 14:35:12.456901",
>                         "event": "dispatched"
>                     },
>                     {
>                         "time": "2019-08-27 14:35:12.456903",
>                         "event": "wait for new map"
>                     },
>                     {
>                         "time": "2019-08-27 14:40:01.292346",
>                         "event": "started"
>                     }
>                 ]
>             }
>         },
> ...snip...
>         {
>             "description": "osd_pg_create(e179066 287.7a:177739 287.9a:177739 287.e2:177739 287.e7:177739 287.f6:177739 287.187:177739 287.1aa:177739 287.216:177739 287.306:177739 287.3e6:177739)",
>             "initiated_at": "2019-08-27 14:35:09.908567",
>             "age": 294.900191001,
>             "duration": 294.90068416899999,
>             "type_data": {
>                 "flag_point": "delayed",
>                 "events": [
>                     {
>                         "time": "2019-08-27 14:35:09.908567",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2019-08-27 14:35:09.908567",
>                         "event": "header_read"
>                     },
>                     {
>                         "time": "2019-08-27 14:35:09.908520",
>                         "event": "throttled"
>                     },
>                     {
>                         "time": "2019-08-27 14:35:09.908617",
>                         "event": "all_read"
>                     },
>                     {
>                         "time": "2019-08-27 14:35:12.456921",
>                         "event": "dispatched"
>                     },
>                     {
>                         "time": "2019-08-27 14:35:12.456923",
>                         "event": "wait for new map"
>                     }
>                 ]
>             }
>         }
>     ],
>     "num_ops": 6
> }
>
>
> That "wait for new map" message made us think something was getting hung up on the monitors, so we restarted them all without any luck.
>
> I'll keep investigating, but so far my google searches aren't pulling anything up so I wanted to see if anyone else is running into this?
>

I've seen this twice now on a ~1400 OSD cluster running Nautilus.

I created a bug report for this: https://tracker.ceph.com/issues/44184

Did you make any progress on this or run into it a second time?

Wido

> Thanks,
> Bryan
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
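
For anyone comparing their own `ceph daemon osd.N ops` output against the dump quoted above, here is a rough Python sketch that finds where an op spent most of its time by measuring the gap between consecutive events. The helper names (`event_gaps`, `longest_gap`) and the `SAMPLE` constant are made up for illustration and are not part of Ceph; the sample is a trimmed copy of the first op from the dump, keeping only the event list. It assumes the timestamps always have the `YYYY-MM-DD HH:MM:SS.ffffff` form seen in this thread.

```python
import json
from datetime import datetime

# Trimmed copy of the first op from the quoted dump (event list only).
SAMPLE = """
{
  "ops": [
    {
      "type_data": {
        "flag_point": "started",
        "events": [
          {"time": "2019-08-27 14:34:46.556413", "event": "initiated"},
          {"time": "2019-08-27 14:34:46.556413", "event": "header_read"},
          {"time": "2019-08-27 14:34:46.556299", "event": "throttled"},
          {"time": "2019-08-27 14:34:46.556456", "event": "all_read"},
          {"time": "2019-08-27 14:35:12.456901", "event": "dispatched"},
          {"time": "2019-08-27 14:35:12.456903", "event": "wait for new map"},
          {"time": "2019-08-27 14:40:01.292346", "event": "started"}
        ]
      }
    }
  ]
}
"""

# Assumed timestamp layout, taken from the dump in this thread.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def event_gaps(op):
    """Return (from_event, to_event, seconds) for each consecutive event pair."""
    events = op["type_data"]["events"]
    gaps = []
    for prev, cur in zip(events, events[1:]):
        start = datetime.strptime(prev["time"], TIME_FORMAT)
        end = datetime.strptime(cur["time"], TIME_FORMAT)
        # Gaps can be slightly negative: "throttled" is stamped before
        # "header_read" in the dump even though it is listed after it.
        gaps.append((prev["event"], cur["event"], (end - start).total_seconds()))
    return gaps


def longest_gap(op):
    """Return the consecutive event pair with the largest time difference."""
    return max(event_gaps(op), key=lambda gap: gap[2])


if __name__ == "__main__":
    for op in json.loads(SAMPLE)["ops"]:
        src, dst, secs = longest_gap(op)
        print(f"{src} -> {dst}: {secs:.1f}s")
```

For the first op in the dump, the largest gap is between "wait for new map" and "started", a little under 289 seconds out of the roughly 318 second duration, with most of the remainder (about 26 seconds) between "all_read" and "dispatched". To run this against a live OSD, the `SAMPLE` string could be replaced with the JSON read from the command's standard output.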