Re: Prioritized pool recovery

Hmm, I didn't realize we already had this functionality. It's changing
quite a lot at the moment, though, so be aware that it will likely
need to be reconfigured later.

On Sun, May 5, 2019 at 10:40 AM Kyle Brantley <kyle@xxxxxxxxxxxxxx> wrote:
>
> I've been running luminous / ceph-12.2.11-0.el7.x86_64 on CentOS 7 for about a month now, and have had a few occasions where I needed to recreate the OSDs on a server. (No, I'm not planning on doing this routinely...)
>
> What I've noticed is that recovery is generally staggered so that the pools on the cluster finish around the same time (+/- a few hours). What I'm hoping to do is prioritize specific pools over others, so that Ceph recovers all of pool 1 before moving on to pool 2, for example.
>
> In the docs, recovery_{,op}_priority both have roughly the same description, which is "the priority set for recovery operations," along with a valid range of 1-63 and a default of 5. That doesn't tell me whether a value of 1 is considered a higher priority than 63, and it doesn't tell me how it fits in with other Ceph operations.

I'm not seeing this in the Luminous docs; are you sure? The source
code indicates that in Luminous the range is 0-254. (As I said, things
have changed, so in the current master build it seems to be -10 to 10
and configured a bit differently.)

The 1-63 values generally apply to op priorities within the OSD and
are used as part of a weighted priority queue when selecting the next
op to work on out of those available. You may have been looking at
osd_recovery_op_priority, which is on that scale and should apply to
individual recovery messages/ops, but it won't schedule PGs
differently.
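
For what it's worth, osd_recovery_op_priority is a daemon config
option rather than a pool property, so if you did want to tweak it
(untested by me; the value is just illustrative) it would be
something along these lines, or the equivalent in ceph.conf:

    # lower the 1-63 op-queue weight for individual recovery ops so
    # they compete less with client ops
    ceph tell osd.* injectargs '--osd-recovery-op-priority 1'

The pool-level recovery_priority is a separate knob and is what the
reservation scheduling below is about.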

> Questions:
> 1) If I have pools 1-4, what would I set these values to in order to backfill pools 1, 2, 3, and then 4 in order?

So if I'm reading the code right, they just need to be different
weights, and the higher value will win when trying to get a
reservation if there's a queue of them. (However, it's possible that
lower-priority pools will send off requests first and get to do one
or two PGs, and then the higher-priority pool will get to do all of
its work before that pool continues.)
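
Hedging a bit since I haven't run this myself, but with your pool
names that would look something like:

    # a higher value should win the recovery/backfill reservation first
    ceph osd pool set cephfs_metadata recovery_priority 4
    ceph osd pool set vm recovery_priority 3
    ceph osd pool set storage recovery_priority 2
    ceph osd pool set cephfs_data recovery_priority 1

The absolute numbers shouldn't matter much as long as the ordering
does.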

> 2) Assuming this is possible, how do I ensure that backfill isn't prioritized over client I/O?

This is an ongoing issue, but I don't think the pool prioritization
will change the existing mechanisms.
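
If recovery does start stepping on client traffic, the usual
throttles are still the thing to reach for; the values here are just
illustrative, and they can also be set in ceph.conf:

    # fewer concurrent backfills/recovery ops per OSD, plus a short
    # sleep between recovery ops, leaves more headroom for client I/O
    ceph tell osd.* injectargs '--osd-max-backfills 1'
    ceph tell osd.* injectargs '--osd-recovery-max-active 1'
    ceph tell osd.* injectargs '--osd-recovery-sleep 0.1'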

> 3) Is there a command that enumerates the weights of the current operations (so that I can observe what's going on)?

"ceph osd pool ls detail" will include them.

>
> For context, my pools are:
> 1) cephfs_metadata
> 2) vm (RBD pool, VM OS drives)
> 3) storage (RBD pool, VM data drives)
> 4) cephfs_data
>
> These are sorted by both size (smallest to largest) and criticality of recovery (most to least critical). If there's a critique of this setup or a better way of organizing it, suggestions are welcome.
>
> Thanks,
> --Kyle
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


