Re: RFC: (deep-)scrub manager module

Hi All,

I'd like to add a few points to the discussion, covering the current
implementation of scrub scheduling, what I have started working on, and
what I have just learned (from this thread and from a short brainstorm
with Josh) that I should add to my plans.
Sorry if it is a bit hand-wavy... I will try to follow up with a more
detailed design doc later on.

*current implementation:*
* scheduling is indeed at the OSD level (and, if possible, I'd like to
  keep it that way, to reduce system complexity);
* deep and shallow scrubs share the same scheduling queue (*** I plan to
  rethink that);
* a PG is scheduled based on the following information:
  - the last deep/shallow scrub stamp (which for 'must' scrubs is set to
    the beginning of time);
  - a (dynamically calculated) target scrub time; basically "last stamp +
    the pool's configured min scrub interval";
  - a deadline ("last stamp + the pool's max scrub interval");
* PGs that are 'ripe' for scrubbing are sorted by their target times;
* the selected PG is given a chance to acquire ("reserve", in scrub-code
  parlance) its replicas' "scrub resources" (the 'max concurrent scrubs'
  counter managed by the OSD); *failing* that (i.e. one of the replicas is
  too busy to handle the additional scrub request), the PG is penalized,
  and will only be considered for scrubbing after all ripe jobs are
  handled (or if "pardoned" after a few minutes).

The actual implementation has some additional details, mainly regarding
rescheduling. All in all, it
is a bit confusing and less than optimal.
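As an illustration only, the queue behaviour described above could be
sketched roughly as follows (the class, the penalty mechanism, and all
names here are simplified stand-ins I made up, not the actual Ceph code
or option names):

```python
from dataclasses import dataclass, field

@dataclass(order=True)
class ScrubJob:
    target: float                          # last stamp + pool min interval
    deadline: float                        # last stamp + pool max interval
    pgid: str = field(compare=False)
    penalized_until: float = field(default=0.0, compare=False)

def ripe_queue(jobs, now):
    """Ripe PGs sorted by target time; penalized PGs go to the back,
    i.e. they are considered only after all other ripe jobs."""
    ripe = [j for j in jobs if j.target <= now]
    ok = sorted(j for j in ripe if j.penalized_until <= now)
    penalized = sorted(j for j in ripe if j.penalized_until > now)
    return ok + penalized

def on_reservation_failure(job, now, penalty_secs=300.0):
    """A replica refused the 'scrub resources' reservation: penalize
    the PG for a few minutes (until 'pardoned')."""
    job.penalized_until = now + penalty_secs
```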

*implementation in progress:*
* same basic design (OSD-level, penalizing PGs with busy replicas, keeping
  the user-facing interface of 'scrub resources');
* each PG has the following associated scheduling data:
  - a target time: "last scrub + min interval", as before. Note that here
    it is calculated once, and only changes on:
    - scrub termination;
    - operator commands;
    - configuration changes;
  - a "not before" time point, which determines the "ripeness" of the PG
    to be considered for scrubbing; it is not the primary sort key for the
    "ready to scrub" queue;
* the "not before" is reset whenever the 'target time' changes, and can
  only be pushed forward afterwards, e.g.:
  - if a PG failed to start scrubbing;
  - ...
  Each modification path of the "not before" takes the deadline into
  account;
* ripe PGs are sorted by "target time", then "not before";
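A minimal sketch of these "not before" mechanics (hypothetical names
throughout; the deadline clamp is my reading of "takes the deadline into
account", not necessarily what the real code will do):

```python
from dataclasses import dataclass

@dataclass
class SchedEntry:
    pgid: str
    target: float       # last scrub + min interval; recomputed only on
                        # scrub end, operator command, or config change
    not_before: float   # ripeness gate; secondary sort key only
    deadline: float     # last scrub + max interval

def push_not_before(entry, now, delay):
    """Delay a PG that failed to start scrubbing; the 'not before' only
    moves forward, and never past the deadline."""
    entry.not_before = min(max(entry.not_before, now + delay), entry.deadline)

def ready_queue(entries, now):
    """Ripe = the 'not before' has passed; sort by target, then 'not before'."""
    ripe = [e for e in entries if e.not_before <= now]
    return sorted(ripe, key=lambda e: (e.target, e.not_before))
```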

*Some ideas I was considering, and new ones following this thread / Josh:*
- separating shallow/deep scrubs, maybe through a revised resource-counting
  system that treats deep scrubs as using more "resource units" than
  shallow scrubs; I am not sure what the correct objectives are
  (comments appreciated).
- marking "urgent" scrubs in the "replica - I need your resources"
  reservation requests. The idea is that if a replica is forced to refuse
  a reservation request for an urgent scrub, that replica would not (for
  some predefined period) grant any non-urgent reservation requests,
  allowing the urgent scrub to take priority when requested again
  (presumably after running scrub sessions have completed).
- allowing some dynamic changes to the resource counters, e.g. to
compensate for blocked scrubs.
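To make the first two ideas a bit more concrete, here is a rough sketch
(all names, the unit costs, and the back-off window are made up for
illustration; nothing here is an existing Ceph class or option):

```python
class ScrubReservations:
    """Replica-side reservation counter where a deep scrub consumes more
    "resource units" than a shallow one, and a refused *urgent* request
    temporarily blocks non-urgent grants so the urgent retry wins."""

    SHALLOW_COST = 1
    DEEP_COST = 2            # deep scrubs weigh more (made-up ratio)

    def __init__(self, max_units, urgent_backoff=60.0):
        self.max_units = max_units
        self.in_use = 0
        self.urgent_backoff = urgent_backoff
        self.no_regular_until = 0.0   # set when an urgent request is refused

    def request(self, deep, urgent, now):
        cost = self.DEEP_COST if deep else self.SHALLOW_COST
        if not urgent and now < self.no_regular_until:
            return False     # hold capacity for the urgent scrub's retry
        if self.in_use + cost > self.max_units:
            if urgent:
                # refuse, but stop granting regular requests for a while
                self.no_regular_until = now + self.urgent_backoff
            return False
        self.in_use += cost
        return True

    def release(self, deep):
        self.in_use -= self.DEEP_COST if deep else self.SHALLOW_COST
```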

Please CC me on ideas/comments re scrub. I don't usually read every post in
the group.

Ronen






On Tue, Jun 21, 2022 at 11:00 AM Frank Schilder <frans@xxxxxx> wrote:

> Hi all,
>
> sounds like something like the following algorithm could be a long-term
> goal:
>
> - separate (deep-)scrub queues per pool (see below for possible scheduling
> conflicts and alternatives)
> - config settings for global/osd and mask class:dev_class (if present)
> apply; note that this should be sufficient, because no pool can contain
> OSDs from multiple device classes and performance across a single device
> class can be expected to be close to identical; if absolutely necessary,
> one could always create a device class per pool
>
> - all PGs of a pool get scheduled for (deep-)scrub every
> (DEEP_)SCRUB_INTERVAL (property of a pool) seconds (what would be
> reasonable defaults comparable to current behaviour? 1 month for scrub, 3
> months for deep scrub?)
> - the event of all PGs of a pool being scheduled is logged as "pool xyz
> (deep-)scrub start"
> - the event of a (deep-)scrub of the last PG of a pool being completed is
> logged as "pool xyz (deep-)scrub end", if (deep-)scrubbing does not end
> before the next schedule event, no such message will be present (for
> diagnosis), note that with "oldest scrub stamp first" we will still get
> everything scrubbed, just not in the expected time interval
>
> - the algorithm per queue is as close as possible to "oldest (deep-)scrub
> stamp first"; here I see a potential for scheduling conflicts if several
> pools with different (DEEP_)SCRUB_INTERVAL live on the same device class -
> which PG from different pools gets priority if they live on the same OSD;
> this problem could be so severe that (DEEP_)SCRUB_INTERVAL might need to be
> a property of a device class with (deep-)scrub queues per device-class
> instead of per pool
> - with such an algorithm one could probably implement many if not all
> config options that Anthony suggested
>
> Now, something like this might be a (too?) big project and possibly too
> detailed. A few first simple steps, as in
>
> > Good point. The pain point of not having the oldest scrubbed PG be top
> priority next scrub cycle
> > can be fixed at OSD level as well. For us it was easiest to fix it with
> a daemon. Not sure how easy
> > it is to fix this in the OSD scrub code. It might not hurt to have it
> fixed in two places though.
>
> might already be sufficient and more detailed control becomes less
> interesting. With high priority I would ask for using the oldest
> (deep-)scrub date stamp in the OSD code for (deep-)scrub-priority to
> increase predictability (remove random PG selection, if PGs have the same
> date stamp, use lowest ID as tie breaker). Then, the existing config
> options together with device-class masks allow already good control over
> how aggressive scrubbing is.
>
> By the way, Stefan, when you write "daemon", is this a daemon you
> implemented yourself on your own installation, or is it a daemon provided
> together with ceph?
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Josh Salomon <jsalomon@xxxxxxxxxx>
> Sent: 20 June 2022 12:17:05
> To: Stefan Kooman
> Cc: Frank Schilder; Anthony D'Atri; Ceph Developers; ceph-users
> Subject: Re:  Re: RFC: (deep-)scrub manager module
>
> Replying on an older email in this thread.
>
> Yes, for sure this is an interesting scrub strategy I hadn't thought
> about. With the reasoning behind it: make sure you scrub as many PGs as
> you can when allowed to, not wasting time (as long as it's evenly
> distributed among OSDs). Do I get that right?
>
> Yes - this is what I meant
>
> As you mention priorities: should it have some sort of fairness
> algorithm that avoids situations where a pool might not be scrubbed at
> all because of the constraints imposed? I can imagine a heavily loaded
> cluster, with spinning disks, where a high priority pool might get
> scrubbed, but a lower priority pool might not. In that case you might
> want to scrub PGs from the lower priority pool every x period to avoid
> the pool never being scrubbed at all. This might make things (overly)
> complicated though.
>
> I think this was answered, but I used priority which may be misleading - I
> meant different frequencies, so one pool is scrubbed every week and another
> every month. Then when you select the PGs for scrubbing you use the date of
> the latest scrub, so this will increase the priority of the pools with a
> lower frequency and I believe will prevent starving in all cases. Still
> need to understand what happens when the scrub frequency is, for example, a
> week and the process (with all the constraints such as running only on
> limited times) takes more than a week. I still believe that if at any time
> we look just at the PGs which were not scrubbed for the longest time we
> will prevent starvation.
> If we allow only one scrub per OSD, the way I see the scheduling algorithm
> is something like:
>
> initial state, no scrub executing:
>   OSD_group = all OSDs
>   selected = true
> while (selected):
>   selected = false
>   try to select the PG which is marked for scrubbing, was not scrubbed
>     for the longest period, and has all of its PG_OSDs in OSD_group
>   if a PG was selected:
>     mark it for scrubbing
>     remove its PG_OSDs from OSD_group
>     selected = true
>   end if
> end while
> wait until a scrub_complete event and restart the while loop.
>
> // handling a scrub_complete event should add the PG_OSDs back to OSD_group
>
> Regards,
>
> Josh
>
>
> On Mon, Jun 20, 2022 at 12:13 PM Stefan Kooman <stefan@xxxxxx> wrote:
> On 6/20/22 10:34, Frank Schilder wrote:
> > Hi all,
> >
> > I guess moving scrubbing to the MGRs is quite an exercise if it's
> currently scheduled by OSDs. With the current functionality, simply having
> a manager module issue "ceph osd deep-scrub" commands will only ask the OSD
> to schedule a scrub, not execute it in a specific order at specific times.
>
> We currently have a daemon that does just that: ask the OSD for a
> (deep-)scrub. It has config options to tune the amount of concurrent
> (deep-)scrubs. It's not a hard limit, as it might take some time before
> a scrub is actually performed. So for short periods of time it might
> overshoot this target. But it issues (deep-)scrubs based on oldest
> timestamp, which is already a big win.
>
> >
> >  From my personal experience, this effort might be a bit too much and a
> few simple changes might be enough. I don't see any performance issues with
> executing scrub on our cluster and I wonder under what circumstances such
> occur.
>
> For one: When RocksDB performance is degraded (particularly PGs that
> hold a lot of OMAP). Then *scrubs*, can be even more of a problem than
> deep-scrubs. Something we learned about very recently.
>
> I would assume that scrub IO is part of the performance planning. My
> guess is, that pools where deep scrub impacts performance tend to have
> too few PGs (too many objects per PG). On small PGs, a deep scrub
> usually finishes quickly, even if it needs to be repeated due to client
> IO causing a pre-emption (osd_scrub_max_preemptions).
>
> > What I'm missing though is a bit more systematic approach to scrubbing
> in the same way that Anthony is suggesting. Right now it looks like that
> PGs are chosen randomly instead of "least recently scrubbed first". This is
> important, because it makes scrub completion unpredictable. Random number
> generators do not produce a perfect equi-distribution with the effect that
> some PGs will have to wait several almost complete cluster-scrub cycles
> before being deep-scrubbed eventually. This is probably a main reason why
> "not deep scrubbed in time" messages pop up. It's a few PGs that have to
> wait forever.
>
> That confirms our observation. This one is the easiest to solve, and
> highest on the list.
>
> > This unpredictability is particularly annoying if you need to wait for
> all PGs being deep-scrubbed when upgrading over several versions of ceph.
> It is basically not possible to say a cluster with N PGs and certain scrub
> settings will complete a full deep-scrub of all PGs after so many days.
>
> For such scenarios it would help a lot to have as many parallel scrubs,
> evenly distributed, as possible to speed up this process (like Josh
> suggested).
>
> >
> >  From my perspective, adding this level of predictability would be an
> interesting improvement and possibly simple to implement. Instead of doing
> (deep-)scrubs randomly, one could consider a more systematic strategy of
> always queueing the "least-recently deep-scrubbed (active)" PGs first. This
> should already allow to reduce scrub frequency, because a complete scrub
> does not imply that a majority of PGs is scrubbed several times before the
> last one is picked. Plus, a completion of a complete scrub of a healthy
> cluster is now something one can more or less compute as a number in days.
> One could just look at the top of the queue to query the oldest scrub date
> stamp to have a reliable estimate. Per-device type scrub frequency should
> then be sufficient to adapt to device performance or other parameters, like
> failure probability.
>
> Scrub frequency with respect to how likely a device might fail is
> interesting. Besides having the option to configure this per pool type
> (importance of data).
>
> >
> > It's basically the first two single-starred points of Anthony, where I
> would be OK if it's still the OSDs doing this scheduling as long as the
> scrub time stamp is treated as a global priority, that is, every primary
> OSD holds a few of its least-recently scrubbed PGs in a/the global queue
> ordered by scrub time stamp and the top PGs are picked and only kept at the
> top of the queue if it violates the osd_max_scrubs setting. As an
> extension, one could introduce scrub queues per device class. Using config
> option masks (
> https://docs.ceph.com/en/quincy/rados/configuration/ceph-conf/#configuration-sections)
> it is already possible to configure scrub options by device class, which I
> personally consider sufficient.
>
> Good point. The pain point of not having the oldest scrubbed PG be top
> priority next scrub cycle can be fixed at OSD level as well. For us it
> was easiest to fix it with a daemon. Not sure how easy it is to fix this
> in the OSD scrub code. It might not hurt to have it fixed in two places
> though.
>
> Thanks for your input!
>
> Gr. Stefan
> _______________________________________________
> Dev mailing list -- dev@xxxxxxx
> To unsubscribe send an email to dev-leave@xxxxxxx
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
>



