Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

Okay, I traced one slow op through the logs, and the problem was that the
PG was laggy.  That happened because osd.122, the one you stopped, was
marked down in the OSDMap but *not* dead.  It looks like that happened
because the OSD took the 'clean shutdown' path instead of the fast stop.
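
If you want to double-check that, the OSDMap itself should show it.  I am
going from memory on where the field ends up in the dump, so treat this as
a pointer rather than gospel:

  # ceph osd dump -f json-pretty

and look for osd.122 under osd_xinfo: a dead_epoch of 0 would mean the
mons only ever recorded it as down, never as dead.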

Have you tried enabling osd_fast_shutdown = true *after* you raised
require_osd_release to octopus?  When you tested before, it would have led
to slow requests because the new dead_epoch field in the OSDMap, which the
read leases rely on, was not being encoded, so peering had to wait for the
read lease to time out even though the stopped OSD really had died.
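
In case it is useful, roughly what I have in mind (assuming you set
options through the config database; adjust if you normally use ceph.conf
instead):

  # ceph osd dump | grep require_osd_release

and, if that still shows something older than octopus:

  # ceph osd require-osd-release octopus
  # ceph config set global osd_fast_shutdown true

Since you are on 16.2.6 you could also take require_osd_release straight
to pacific.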

I'm not entirely sure whether this is the same cluster as the earlier
one, but given the logs you sent, my suggestion is to enable
osd_fast_shutdown and try again.  If you still get slow requests, can you
capture the logs again?
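
Same drill as last time for the logging, something along these lines (the
paths just assume the default log location):

  # ceph config set osd debug_osd 10
  # ceph config set osd debug_ms 1
  # ceph config set mon debug_mon 10
  # ceph config set mon debug_ms 1
  ... stop and restart the OSD to reproduce ...
  # ceph-post-file /var/log/ceph/*.log

and then 'ceph config rm' the debug settings once you have captured it.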

Thanks!
sage


On Fri, Nov 12, 2021 at 7:33 AM Manuel Lausch <manuel.lausch@xxxxxxxx>
wrote:

> Hi Sage,
>
> I uploaded a lot of debug logs from the OSDs and Mons:
> ceph-post-file: 4ebc2eeb-7bb1-48c4-bbfa-ed581faca74f
>
> At 13:24:25 I stopped OSD 122 and one minute later I started it again.
> In both cases I got slow ops.
>
> Currently I am running the upstream version (without crude patches):
> ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific
> (stable)
>
> I hope you can work with it.
>
>
> Here is the current config:
>
> # ceph config dump
> WHO     MASK  LEVEL     OPTION                                          VALUE     RO
> global        advanced  osd_fast_shutdown                               false
> global        advanced  osd_fast_shutdown_notify_mon                    false
> global        dev       osd_pool_default_read_lease_ratio               0.800000
> global        advanced  paxos_propose_interval                          1.000000
>   mon         advanced  auth_allow_insecure_global_id_reclaim           true
>   mon         advanced  mon_warn_on_insecure_global_id_reclaim          false
>   mon         advanced  mon_warn_on_insecure_global_id_reclaim_allowed  false
>   mgr         advanced  mgr/balancer/active                             true
>   mgr         advanced  mgr/balancer/mode                               upmap
>   mgr         advanced  mgr/balancer/upmap_max_deviation                1
>   mgr         advanced  mgr/progress/enabled                            false     *
>   osd         dev       bluestore_fsck_quick_fix_on_mount               true
>
> # cat /etc/ceph/ceph.conf
> [global]
>     # The following parameters are defined in the service.properties like
> below
>     # ceph.conf.globa.osd_max_backfills: 1
>
>
>   bluefs bufferd io = true
>   bluestore fsck quick fix on mount = false
>   cluster network = 10.88.26.0/24
>   fsid = 72ccd9c4-5697-478c-99f6-b5966af278c6
>   max open files = 131072
>   mon host = 10.88.7.41 10.88.7.42 10.88.7.43
>   mon max pg per osd = 600
>   mon osd down out interval = 1800
>   mon osd down out subtree limit = host
>   mon osd initial require min compat client = luminous
>   mon osd min down reporters = 2
>   mon osd reporter subtree level = host
>   mon pg warn max object skew = 100
>   osd backfill scan max = 16
>   osd backfill scan min = 8
>   osd deep scrub stride = 1048576
>   osd disk threads = 1
>   osd heartbeat min size = 0
>   osd max backfills = 1
>   osd max scrubs = 1
>   osd op complaint time = 5
>   osd pool default flag hashpspool = true
>   osd pool default min size = 1
>   osd pool default size = 3
>   osd recovery max active = 1
>   osd recovery max single start = 1
>   osd recovery op priority = 3
>   osd recovery sleep hdd = 0.0
>   osd scrub auto repair = true
>   osd scrub begin hour = 5
>   osd scrub chunk max = 1
>   osd scrub chunk min = 1
>   osd scrub during recovery = true
>   osd scrub end hour = 23
>   osd scrub load threshold = 1
>   osd scrub priority = 1
>   osd scrub thread suicide timeout = 0
>   osd snap trim priority = 1
>   osd snap trim sleep = 1.0
>   public network = 10.88.7.0/24
>
> [mon]
>   mon allow pool delete = false
>   mon health preluminous compat warning = false
>   osd pool default flag hashpspool = true
>
>
>
>
> On Thu, 11 Nov 2021 09:16:20 -0600
> Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> > Hi Manuel,
> >
> > Before giving up and putting in an off switch, I'd like to understand
> > why it is taking as long as it is for the PGs to go active.
> >
> > Would you consider enabling debug_osd=10 and debug_ms=1 on your OSDs,
> > and debug_mon=10 + debug_ms=1 on the mons, and reproducing this
> > (without the patch applied this time of course!)?  The logging will
> > slow things down a bit but hopefully the behavior will be close
> > enough to what you see normally that we can tell what is going on
> > (and presumably picking out the pg that was most laggy will highlight
> > the source(s) of the delay).
> >
> > sage
> >
> > On Wed, Nov 10, 2021 at 4:41 AM Manuel Lausch <manuel.lausch@xxxxxxxx>
> > wrote:
> >
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


