Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

Hi Manuel,

I'm looking at the ticket for this issue (
https://tracker.ceph.com/issues/51463) and tried to reproduce it.  This was
initially trivial to do with vstart (rados bench paused for many seconds
after stopping an osd), but it turns out that was because the vstart
ceph.conf includes `osd_fast_shutdown = false`.  Once I enabled that again
(as it is by default on a normal cluster) I did not see any
noticeable interruption when an OSD was stopped.

Can you confirm what osd_fast_shutdown and osd_fast_shutdown_notify_mon are
set to on your cluster?
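
For reference, something like the following should show the effective values
(assuming the defaults haven't been overridden per daemon; osd.0 is just an
example id):

    ceph config get osd osd_fast_shutdown
    ceph config get osd osd_fast_shutdown_notify_mon
    # or, on an OSD host, what the running daemon actually uses:
    ceph daemon osd.0 config show | grep fast_shutdown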

The intent is that when an OSD goes down, it will no longer accept
messenger connection attempts, and peer OSDs will inform the monitor with a
flag indicating the OSD is definitely dead (vs slow or unresponsive).  This
will allow the peering process to skip waiting for the read lease to time
out.  If you're seeing the laggy or 'waiting for readable' state, then that
isn't happening, probably because the OSD shutdown isn't working as
originally intended.

If it's not one of those two options, maybe you can include a 'ceph config
dump' (or just a list of the changed values at least) so we can see what
else might be affecting OSD shutdown...
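
For example (the admin socket 'config diff' command, if your version has it,
shows only the values that differ from the built-in defaults):

    ceph config dump
    # or, per daemon:
    ceph daemon osd.0 config diff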

Thanks!
sage


On Thu, Oct 28, 2021 at 7:23 AM Manuel Lausch <manuel.lausch@xxxxxxxx>
wrote:

> Hello Istvan,
>
> The state "waiting for readable" seems to be related to the read lease
> mechanism documented here:
> https://docs.ceph.com/en/latest/dev/osd_internals/stale_read/
>
> The only parameter I know of to tune around this is
> "osd_pool_default_read_lease_ratio", which defaults to 0.8.
>
> My cluster complains about slow ops after 5 seconds due to its
> configuration. On starting, stopping, or restarting OSDs I see slow
> ops of up to about 15 seconds. On clusters with default settings this
> would not show up as a slow op because it is under the default 32
> second limit.
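>
> (Presumably via osd_op_complaint_time, e.g. something like
>     ceph config set osd osd_op_complaint_time 5
> if the central config database is used.)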
>
> We plan to set this parameter to 0.15 to mitigate the issue before we
> upgrade to Ceph Octopus and later releases. I would also be happy if
> someone with deeper knowledge could have a look at this issue.
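>
> (For reference, assuming this is applied via the central config database
> rather than ceph.conf, something like
>     ceph config set global osd_pool_default_read_lease_ratio 0.15
> should set it cluster-wide.)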
>
>
> Regarding your cluster: I don't see what triggers it to run into this
> state. Did you stop/start some OSDs beforehand?
> Could you share some more details?
>
>
> Manuel
>
>
>
>
> On Thu, 28 Oct 2021 11:19:15 +0000
> "Szabo, Istvan (Agoda)" <Istvan.Szabo@xxxxxxxxx> wrote:
>
> > Hi,
> >
> > I found a couple of emails on the mailing list that might be related
> > to this issue, but there it came up after an OSD restart or recovery.
> > My cluster is on Octopus 15.2.14 at the moment.
> >
> > When slow ops start to come in I can see laggy PGs, and gradually the
> > 3 RADOS gateways behind the haproxy start to die; once the slow ops
> > are gone they restart, but it causes user interruption.
> >
> > Digging deeper into the affected OSD's historic slow ops log I can
> > see a couple of slow ops where 32 seconds were spent in the "waiting
> > for readable" event. Is there anything that can be done about this?
> > To avoid OSD restarts I'm currently running with
> > osd_op_thread_suicide_timeout=2000 and osd_op_thread_timeout=90 on
> > all OSDs.
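> >
> > (In case it's useful: the historic slow ops shown below can be pulled
> > from the OSD admin socket with something like
> >     ceph daemon osd.<id> dump_historic_slow_ops
> > where <id> is the affected OSD.)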
> >
> > Could it be that I'm using too small a number of PGs?
> > The data pool is 4+2 EC (host failure domain) across 7 nodes, and the
> > PG number is currently 128 based on the balancer suggestion.
> >
> > An example slow op:
> >
> >         {
> >             "description": "osd_op(client.141400841.0:290235021 28.17s0 28:e94c28ab:::9213182a-14ba-48ad-bde9-289a1c0c0de8.6034919.1_%2fWHITELABEL-1%2fPAGETPYE-5%2fDEVICE-4%2fLANGUAGE-38%2fSUBTYPE-0%2f148341:head [create,setxattr user.rgw.idtag (56) in=14b,setxattr user.rgw.tail_tag (56) in=17b,writefull 0~17706,setxattr user.rgw.manifest (375) in=17b,setxattr user.rgw.acl (123) in=12b,setxattr user.rgw.content_type (10) in=21b,setxattr user.rgw.etag (32) in=13b,setxattr user.rgw.x-amz-meta-storagetimestamp (40) in=36b,call rgw.obj_store_pg_ver in=19b,setxattr user.rgw.source_zone (4) in=20b] snapc 0=[] ondisk+write+known_if_redirected e37602)",
> >             "initiated_at": "2021-10-28T16:52:44.652426+0700",
> >             "age": 3258.4773889160001,
> >             "duration": 32.445993113,
> >             "type_data": {
> >                 "flag_point": "started",
> >                 "client_info": {
> >                     "client": "client.141400841",
> >                     "client_addr": "10.118.199.3:0/462844935",
> >                     "tid": 290235021
> >                 },
> >                 "events": [
> >                     {
> >                         "event": "initiated",
> >                         "time": "2021-10-28T16:52:44.652426+0700",
> >                         "duration": 0
> >                     },
> >                     {
> >                         "event": "throttled",
> >                         "time": "2021-10-28T16:52:44.652426+0700",
> >                         "duration": 6.1143999999999996e-05
> >                     },
> >                     {
> >                         "event": "header_read",
> >                         "time": "2021-10-28T16:52:44.652487+0700",
> >                         "duration": 2.2340000000000001e-06
> >                     },
> >                     {
> >                         "event": "all_read",
> >                         "time": "2021-10-28T16:52:44.652489+0700",
> >                         "duration": 1.1519999999999999e-06
> >                     },
> >                     {
> >                         "event": "dispatched",
> >                         "time": "2021-10-28T16:52:44.652490+0700",
> >                         "duration": 3.146e-06
> >                     },
> >                     {
> >                         "event": "queued_for_pg",
> >                         "time": "2021-10-28T16:52:44.652493+0700",
> >                         "duration": 4.0936000000000002e-05
> >                     },
> >                     {
> >                         "event": "reached_pg",
> >                         "time": "2021-10-28T16:52:44.652534+0700",
> >                         "duration": 1.9236e-05
> >                     },
> >                     {
> >                         "event": "waiting for readable",
> >                         "time": "2021-10-28T16:52:44.652554+0700",
> >                         "duration": 32.430086821000003
> >                     },
> >                     {
> >                         "event": "reached_pg",
> >                         "time": "2021-10-28T16:53:17.082640+0700",
> >                         "duration": 0.00010452500000000001
> >                     },
> >                     {
> >                         "event": "started",
> >                         "time": "2021-10-28T16:53:17.082745+0700",
> >                         "duration": 0.015673919000000001
> >                     },
> >                     {
> >                         "event": "done",
> >                         "time": "2021-10-28T16:53:17.098419+0700",
> >                         "duration": 32.445993113
> >                     }
> >                 ]
> >             }
> >         }
> >
> > Thanks in advance.
>
>
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


