Re: Prioritize recovery over backfilling

Hi Sage,

It would be nice to have this one backported to Luminous, if it's easy.

Cheers,
Frédéric.

On 7 June 2018 at 13:33, Sage Weil <sage@xxxxxxxxxxxx> wrote:

On Wed, 6 Jun 2018, Caspar Smit wrote:
Hi all,

We have a Luminous 12.2.2 cluster with 3 nodes, and I recently added a node
to it.

osd-max-backfills is at the default of 1, so backfilling didn't go very fast,
but that doesn't matter.

Once it started backfilling everything looked ok:

~300 pgs in backfill_wait
~10 pgs backfilling (roughly the number of new OSDs)

But I noticed the number of degraded objects increasing a lot. I presume a PG
that is in the backfill_wait state doesn't accept any new writes anymore?
Hence the increasing degraded objects?

So far so good, but once in a while I noticed a random OSD flapping (they come
back up automatically). This isn't because the disk is saturated, but because
of a driver/controller/kernel incompatibility that 'hangs' the disk for a
short time (scsi abort_task error in syslog). Investigating further, I noticed
this was already the case before the node expansion.

These flapping OSDs result in lots of PG states that are a bit worrying:

            109 active+remapped+backfill_wait
            80  active+undersized+degraded+remapped+backfill_wait
            51  active+recovery_wait+degraded+remapped
            41  active+recovery_wait+degraded
            27  active+recovery_wait+undersized+degraded+remapped
            14  active+undersized+remapped+backfill_wait
            4   active+undersized+degraded+remapped+backfilling

I think the recovery_wait is more important than the backfill_wait, so I
would like to prioritize those, because the recovery_wait states were
triggered by the flapping OSDs.
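For reference, on Luminous the recovery/backfill ordering can also be nudged manually. The commands below are a sketch, assuming a 12.2.x release that includes the force-recovery/cancel-force-recovery commands; the PG ID 2.5 is a placeholder:

```shell
# List PGs and their states in brief form, filtering for those
# waiting on recovery.
ceph pg dump pgs_brief | grep recovery_wait

# Ask the cluster to schedule recovery of a specific PG ahead of
# backfills. (Placeholder PG ID; substitute one from the listing above.)
ceph pg force-recovery 2.5

# The priority bump can be undone again:
ceph pg cancel-force-recovery 2.5
```

Note these commands only reorder the existing queue; they don't preempt recovery work that has already started on a lower-priority PG.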

Just a note: this is fixed in mimic.  Previously, we would choose the 
highest-priority PG to start recovery on at the time, but once recovery 
had started, the appearance of a new PG with a higher priority (e.g., 
because it finished peering after the others) wouldn't preempt/cancel the 
other PG's recovery, so you would get behavior like the above.

Mimic implements that preemption, so you should not see behavior like 
this.  (If you do, then the function that assigns a priority score to a 
PG needs to be tweaked.)

sage
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


