Hi Sage,

I uploaded a lot of debug logs from the OSDs and mons:
ceph-post-file: 4ebc2eeb-7bb1-48c4-bbfa-ed581faca74f

At 13:24:25 I stopped OSD 122, and one minute later I started it again.
In both cases I got slow ops.

I am currently running the upstream version (without the crude patches):
ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)

I hope you can work with it.

Here is the current config:

# ceph config dump
WHO     MASK  LEVEL     OPTION                                          VALUE     RO
global        advanced  osd_fast_shutdown                               false
global        advanced  osd_fast_shutdown_notify_mon                    false
global        dev       osd_pool_default_read_lease_ratio               0.800000
global        advanced  paxos_propose_interval                          1.000000
mon           advanced  auth_allow_insecure_global_id_reclaim           true
mon           advanced  mon_warn_on_insecure_global_id_reclaim          false
mon           advanced  mon_warn_on_insecure_global_id_reclaim_allowed  false
mgr           advanced  mgr/balancer/active                             true
mgr           advanced  mgr/balancer/mode                               upmap
mgr           advanced  mgr/balancer/upmap_max_deviation                1
mgr           advanced  mgr/progress/enabled                            false
* osd         dev       bluestore_fsck_quick_fix_on_mount               true

# cat /etc/ceph/ceph.conf
[global]
# The following parameters are defined in the service.properties like below
# ceph.conf.globa.osd_max_backfills: 1
bluefs bufferd io = true
bluestore fsck quick fix on mount = false
cluster network = 10.88.26.0/24
fsid = 72ccd9c4-5697-478c-99f6-b5966af278c6
max open files = 131072
mon host = 10.88.7.41 10.88.7.42 10.88.7.43
mon max pg per osd = 600
mon osd down out interval = 1800
mon osd down out subtree limit = host
mon osd initial require min compat client = luminous
mon osd min down reporters = 2
mon osd reporter subtree level = host
mon pg warn max object skew = 100
osd backfill scan max = 16
osd backfill scan min = 8
osd deep scrub stride = 1048576
osd disk threads = 1
osd heartbeat min size = 0
osd max backfills = 1
osd max scrubs = 1
osd op complaint time = 5
osd pool default flag hashpspool = true
osd pool default min size = 1
osd pool default size = 3
osd recovery max active = 1
osd recovery max single start = 1
osd recovery op priority = 3
osd recovery sleep hdd = 0.0
osd scrub auto repair = true
osd scrub begin hour = 5
osd scrub chunk max = 1
osd scrub chunk min = 1
osd scrub during recovery = true
osd scrub end hour = 23
osd scrub load threshold = 1
osd scrub priority = 1
osd scrub thread suicide timeout = 0
osd snap trim priority = 1
osd snap trim sleep = 1.0
public network = 10.88.7.0/24

[mon]
mon allow pool delete = false
mon health preluminous compat warning = false
osd pool default flag hashpspool = true

On Thu, 11 Nov 2021 09:16:20 -0600 Sage Weil <sage@xxxxxxxxxxxx> wrote:

> Hi Manuel,
>
> Before giving up and putting in an off switch, I'd like to understand
> why it is taking as long as it is for the PGs to go active.
>
> Would you consider enabling debug_osd=10 and debug_ms=1 on your OSDs,
> and debug_mon=10 + debug_ms=1 on the mons, and reproducing this
> (without the patch applied this time of course!)? The logging will
> slow things down a bit but hopefully the behavior will be close
> enough to what you see normally that we can tell what is going on
> (and presumably picking out the pg that was most laggy will highlight
> the source(s) of the delay).
>
> sage
>
> On Wed, Nov 10, 2021 at 4:41 AM Manuel Lausch <manuel.lausch@xxxxxxxx>
> wrote:
>

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
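
For reference, reproducing this capture boils down to roughly the following
(a minimal sketch only: it assumes a package-based systemd deployment where
the OSD runs as ceph-osd@122 and that the debug levels are set through the
centralized config database; "ceph tell <daemon> injectargs" would work
just as well on already-running daemons):

  # enable the debug levels Sage asked for
  # ceph config set osd debug_osd 10
  # ceph config set osd debug_ms 1
  # ceph config set mon debug_mon 10
  # ceph config set mon debug_ms 1

  # reproduce: stop one OSD, wait about a minute, start it again
  # systemctl stop ceph-osd@122
  # systemctl start ceph-osd@122

  # revert the debug settings afterwards
  # ceph config rm osd debug_osd
  # ceph config rm osd debug_ms
  # ceph config rm mon debug_mon
  # ceph config rm mon debug_ms

  # upload the resulting logs for the developers
  # ceph-post-file /var/log/ceph/

ceph-post-file uploads the logs to the Ceph developers' drop box and prints
a tag like the one at the top of this mail, which can then be shared on the
list.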