Hello,

I recently added 2 OSD nodes to my Nautilus cluster, increasing the OSD count from 32 to 48 - all 12TB HDDs with NVMe for DB. I generally keep an ssh session open where I can run 'watch ceph -s', and my observations are mostly based on what I saw from watching this. Even with 10Gb networking, rebalancing 529 PGs took 10 days, during which there were always a few PGs undersized+degraded, frequent flashes of slow ops, occasional OSD restarts, and a steadily growing scrub and deep-scrub backlog. When the backfills completed I had 24 missed deep-scrubs and 10 missed scrubs.

I suspect that this is because of some settings I had fiddled with, so this post may be an advertisement for what not to do to your cluster. However, I'd like to know if my understanding is accurate. In short, I think I had my config set up so that there was contention from too many processes trying to do things to some OSDs all at once:

- osd_scrub_during_recovery: I think I had this set to true for the first 9 days, but set it to false when I started to realize that it might be causing contention.
- osd_max_scrubs: I had this set high (global:30, osd:10). At some earlier time, when I had a scrub backlog, I thought these were counts for simultaneous scrubs across all OSDs rather than per OSD. Now I see why the default is 1 (walking these back to the defaults is sketched in the P.S. below).
  - Assumption: on an HDD, multiple competing scrubs cause excessive seeking and thus compound the impact on scrub progress.
- osd_max_backfills: I had bumped this up as well (global:30, osd:10), thinking it would speed up the rebalancing of my PGs onto the new OSDs.
  - Same thinking as for osd_max_scrubs: compounding contention, further compounded by the scrub activity that would have been inhibited if osd_scrub_during_recovery had been false.

I believe that all of this also resulted in my EC PGs (8+2) becoming degraded. My assumption here is that collisions between deep-scrubs and backfills sometimes locked the backfill process out of a piece of an EC PG, causing backfill to rebuild instead of copy.

The good news is that I haven't lost any data and, other than the scrub backlog, things seem to be working smoothly. It seems like, with 1 or 2 scrubs (deep or regular) running, each scrub takes about 2 hours, and as the scrubs progress more scrub deadlines are missed, so it's not a steady march to zero (see the P.P.S. below for listing and manually kicking the overdue PGs).

Please feel free to comment. I'd be glad to know if I'm on the right track, as we expect the cluster to double in size over the next 12 to 18 months.

Thanks.

-Dave

--
Dave Hall
Binghamton University
kdhall@xxxxxxxxxxxxxx
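
P.S. In case it's useful to anyone reading the archive: the overrides above can be walked back to the defaults through the Nautilus 'ceph config' interface. This is only a sketch based on the settings I described (osd.0 in the last command is just an example daemon):

    # Remove the global and per-osd overrides so the built-in defaults apply
    # again (defaults: osd_max_scrubs=1, osd_max_backfills=1,
    # osd_scrub_during_recovery=false)
    ceph config rm global osd_max_scrubs
    ceph config rm osd osd_max_scrubs
    ceph config rm global osd_max_backfills
    ceph config rm osd osd_max_backfills
    ceph config rm global osd_scrub_during_recovery

    # Confirm what is left in the config database
    ceph config dump | grep -E 'scrub|backfill'

    # Spot-check what a running daemon actually sees
    # (run on the host where osd.0 lives)
    ceph daemon osd.0 config get osd_max_scrubs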
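
P.P.S. For chewing through the backlog itself, 'ceph health detail' lists the PGs that are behind, and individual PGs can be queued by hand. Again just a sketch; the PG id is a made-up example:

    # List the PGs flagged as overdue for scrub / deep scrub
    ceph health detail | grep -E 'not (deep-)?scrubbed since'

    # Manually queue a deep scrub (or a regular scrub) on one PG, e.g. 7.1a
    ceph pg deep-scrub 7.1a
    ceph pg scrub 7.1a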