I've also seen this problem once on Nautilus, with no obvious reason for the slowness.

In my case it was a rather old cluster that was upgraded all the way from firefly.

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Feb 18, 2020 at 5:52 PM Wido den Hollander <wido@xxxxxxxx> wrote:
>
>
> On 8/27/19 11:49 PM, Bryan Stillwell wrote:
> > We've run into a problem on our test cluster this afternoon which is running Nautilus (14.2.2). It seems that any time PGs move on the cluster (from marking an OSD down, setting the primary-affinity to 0, or by using the balancer), a large number of the OSDs in the cluster peg the CPU cores they're running on for a while, which causes slow requests. From what I can tell, it appears to be related to slow peering caused by osd_pg_create() taking a long time.
> >
> > This was seen on quite a few OSDs while waiting for peering to complete:
> >
> > # ceph daemon osd.3 ops
> > {
> >     "ops": [
> >         {
> >             "description": "osd_pg_create(e179061 287.7a:177739 287.9a:177739 287.e2:177739 287.e7:177739 287.f6:177739 287.187:177739 287.1aa:177739 287.216:177739 287.306:177739 287.3e6:177739)",
> >             "initiated_at": "2019-08-27 14:34:46.556413",
> >             "age": 318.25234538000001,
> >             "duration": 318.25241895300002,
> >             "type_data": {
> >                 "flag_point": "started",
> >                 "events": [
> >                     {
> >                         "time": "2019-08-27 14:34:46.556413",
> >                         "event": "initiated"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:34:46.556413",
> >                         "event": "header_read"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:34:46.556299",
> >                         "event": "throttled"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:34:46.556456",
> >                         "event": "all_read"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:12.456901",
> >                         "event": "dispatched"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:12.456903",
> >                         "event": "wait for new map"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:40:01.292346",
> >                         "event": "started"
> >                     }
> >                 ]
> >             }
> >         },
> > ...snip...
> >         {
> >             "description": "osd_pg_create(e179066 287.7a:177739 287.9a:177739 287.e2:177739 287.e7:177739 287.f6:177739 287.187:177739 287.1aa:177739 287.216:177739 287.306:177739 287.3e6:177739)",
> >             "initiated_at": "2019-08-27 14:35:09.908567",
> >             "age": 294.900191001,
> >             "duration": 294.90068416899999,
> >             "type_data": {
> >                 "flag_point": "delayed",
> >                 "events": [
> >                     {
> >                         "time": "2019-08-27 14:35:09.908567",
> >                         "event": "initiated"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:09.908567",
> >                         "event": "header_read"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:09.908520",
> >                         "event": "throttled"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:09.908617",
> >                         "event": "all_read"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:12.456921",
> >                         "event": "dispatched"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:12.456923",
> >                         "event": "wait for new map"
> >                     }
> >                 ]
> >             }
> >         }
> >     ],
> >     "num_ops": 6
> > }
> >
> > That "wait for new map" message made us think something was getting hung up on the monitors, so we restarted them all without any luck.
> >
> > I'll keep investigating, but so far my Google searches aren't pulling anything up, so I wanted to see if anyone else is running into this.
>
> I've seen this twice now on a ~1400 OSD cluster running Nautilus.
>
> I created a bug report for this: https://tracker.ceph.com/issues/44184
>
> Did you make any progress on this or run into it a second time?
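Not yet, but something that might help narrow it down is checking how widespread it is per host: walk the OSD admin sockets, list any long-running osd_pg_create ops, and compare each OSD's newest osdmap epoch against the cluster's. This is only a rough sketch, not a tested tool: it assumes the default /var/run/ceph socket layout, jq installed, and an arbitrary 30-second age threshold.

#!/usr/bin/env bash
# Rough sketch: for every OSD admin socket on this host, list in-flight
# osd_pg_create ops older than 30s and show how far the OSD's osdmap lags
# behind the cluster. Assumes /var/run/ceph/ceph-osd.*.asok and jq.

cluster_epoch=$(ceph osd dump -f json | jq .epoch)
echo "cluster osdmap epoch: ${cluster_epoch}"

for sock in /var/run/ceph/ceph-osd.*.asok; do
    osd=$(basename "$sock" .asok)

    # Same dump as "ceph daemon osd.N ops" above, filtered to slow pg-create ops.
    ceph --admin-daemon "$sock" ops 2>/dev/null \
      | jq -r --arg osd "$osd" '
          .ops[]
          | select(.description | startswith("osd_pg_create"))
          | select(.age > 30)
          | "\($osd): age=\(.age | floor)s flag_point=\(.type_data.flag_point)"'

    # "wait for new map" suggests the OSD has not yet caught up to the epoch
    # named in the osd_pg_create message, so show how far behind it is.
    newest=$(ceph --admin-daemon "$sock" status 2>/dev/null | jq .newest_map)
    echo "${osd}: newest_map=${newest} (behind by $((cluster_epoch - newest)))"
done

If the OSDs turn out to be hundreds of epochs behind while the mons are idle, that would point at map/pg-create processing on the OSD side rather than the monitors, which would line up with the CPU-pegging Bryan describes.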
>
> Wido
>
> > Thanks,
> > Bryan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx