Re: Question about recovery priority

Hallo Josh, thanks for your feedback!

On 9/22/22 14:44, Josh Baergen wrote:
Hi Fulvio,

https://docs.ceph.com/en/quincy/dev/osd_internals/backfill_reservation/
describes the prioritization and reservation mechanism used for
recovery and backfill. AIUI, unless a PG is below min_size, all
backfills for a given pool will be at the same priority.
force-recovery will modify the PG priority but doing so can have a
very delayed effect because a given backfill can be waiting behind a
bunch of other backfills that have acquired partial reservations,
which in turn are waiting behind other backfills that have partial
reservations, etc. etc. Once one is doing degraded backfill, they've
lost a lot of control over their system.
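
A quick way to check whether a forced priority has actually been
applied is to look for the forced_recovery / forced_backfill flags in
the PG state; a rough sketch, where 85.12 is only a placeholder PG id:

   # request a higher priority for one PG, then check its state flags
   ceph pg force-recovery 85.12
   ceph pg dump pgs_brief 2>/dev/null | grep ^85.12
   # list every PG that currently carries a forced flag
   ceph pg dump pgs_brief 2>/dev/null | grep forced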

Yes, I had found that page, which together with
https://docs.ceph.com/en/quincy/dev/osd_internals/recovery_reservation/
explains the mechanism that makes reservations wait behind one another.
However, I am still on Nautilus, and "sed -e 's/quincy/nautilus/' <URL>" leads to much shorter and less detailed pages, so I assumed Nautilus was far behind Quincy in managing this. In any case, I guess it is a good reason to upgrade and take advantage of the newer developments.

Rather than ripping out hosts like you did here, operators that want
to retain control will drain hosts without degradation.
https://github.com/digitalocean/pgremapper is one tool that can help
with this, though depending on the size of the system one can
sometimes simply downweight the host and then wait.
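
For the plain downweight-and-wait variant, a minimal sketch (the host
bucket name and OSD ids below are placeholders, and pgremapper's own
subcommands are documented in its README):

   # lower the CRUSH weight of a whole host so its data is moved off
   # while every object stays fully replicated (misplaced, not degraded)
   ceph osd crush reweight-subtree cephstor05 0
   # ...or one OSD at a time, e.g. osd.100..osd.119 on that host
   for id in $(seq 100 119); do ceph osd crush reweight osd.$id 0; done
   # then watch the misplaced (rather than degraded) count drain away
   ceph -s | grep -E 'misplaced|degraded'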

Thanks for the pointer to pgremapper, I will give it a try once the current data movement has finished: will it still work after I upgrade to Pacific?

You are correct, it would be best to drain OSDs cleanly, and I see pgremapper has an option for this, great! However, in my cluster (14 servers with ~20 disks each, ~3 PB raw space: ~1 PB for Cinder, ~0.9 PB for RGW) I see that draining (by reweighting to 0) works nicely and predictably for replicated pools (1-2 days), but is terribly slow for my 6+4 EC RGW pool (more than a week): that is why I normally reweight part of the way and then rip out 1 or 2 OSDs once I am fed up.

(By the way, the choice of 6+4 goes back a few years and was picked mainly as a compromise between the space lost to redundancy and the resilience to failures, when the cluster was much smaller: I should run a few extensive tests and see whether a different k+m would be worth trying.)
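
For reference, when comparing profiles I would read the current
settings back with something like the following (the pool and profile
names are placeholders for my actual ones):

   # which erasure-code profile does the RGW data pool use?
   ceph osd pool get default.rgw.buckets.data erasure_code_profile
   # dump it: the output includes k, m, the plugin and the failure domain
   ceph osd erasure-code-profile get my-ec-profile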

  Thanks again!

			Fulvio



Josh

On Thu, Sep 22, 2022 at 6:35 AM Fulvio Galeazzi <fulvio.galeazzi@xxxxxxx> wrote:

Hallo all,
       taking advantage of the redundancy of my EC pool, I destroyed a
couple of servers in order to reinstall them with a new operating system.
    I am on Nautilus (but will soon upgrade to Pacific), and today I am
not in "emergency mode": this is just for my education.  :-)

"ceph pg dump" shows a couple pg's with 3 missing chunks, some other
with 2, several with 1 missing chunk: that's fine and expected.
Having looked at it for a while, I think I understand the recovery queue
is unique: there is no internal higher priority for 3-missing-chunks PGs
wrt 1-missing-chunk PGs, right?
I tried to issue "ceph pg force-recovery" on the few worst-degraded PGs
but, apparently, numbers of 3-missing 2-missing and 1-missing are going
down at the same relative speed.
     Is this expected? Can I do something to "guide" the process?
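
To be concrete, by "worst-degraded" I mean something like the
following (jq is assumed to be available, the exact JSON layout of
"ceph pg dump" may differ between releases, and the PG ids are
placeholders):

   # sort PGs by the number of degraded objects, highest first
   ceph pg dump -f json 2>/dev/null \
     | jq -r '(.pg_map.pg_stats // .pg_stats)[] | [.pgid, .stat_sum.num_objects_degraded] | @tsv' \
     | sort -k2 -rn | head -5
   # then force the top few (placeholder PG ids)
   ceph pg force-recovery 85.1a 85.3f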

Thanks for your hints

                         Fulvio

--
Fulvio Galeazzi
GARR-CSD Department
skype: fgaleazzi70
tel.: +39-334-6533-250

--
Fulvio Galeazzi
GARR-CSD Department
tel.: +39-334-6533-250
skype: fgaleazzi70
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
