Re: [ceph-users] Re: RFC: (deep-)scrub manager module

inside
Regards,

Josh


On Tue, Jun 21, 2022 at 7:25 PM Ronen Friedman <rfriedma@xxxxxxxxxx> wrote:
Hi All,

I'd like to add a few points to the discussion: the current implementation of scrub scheduling, what I have started working on,
and what I just learned (from this thread and from a short brainstorm with Josh) that I should add to my plans.
Sorry if this is a bit hand-wavy... I will try to follow up with a more detailed design doc later on:

current implementation:
* scheduling is indeed at the OSD level (and - if possible - I'd like to keep it that way, to reduce system complexity);
* deep/shallow scrubs share the same scheduling queue (*** I plan to rethink that);
* a PG is scheduled based on the following information:
  - the last d/s scrub stamp (which for 'must' scrubs is set to the beginning of time);
  - a (dynamically calculated) target scrub scheduled time; basically - "last + min-conf from pool";
  - a deadline (last + max from pool)
* PGs that are 'ripe' for scrubbing are sorted by their target times;
* the selected PG is given a chance to acquire ("reserve" in scrub code parlance) its replicas' "scrub resources" (that 'max concurrent scrubs' counter
  managed by the OSD); failing that (i.e. one of the replicas is too busy to handle the additional scrub request), the
  PG is penalized, and will only be considered for scrubbing after all ripe jobs are handled (or if "pardoned" after
  a few minutes).
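
A minimal sketch of the selection step described in the list above, for illustration only. All names (`PgJob`, `pick_next`, `reserve_replicas`) are hypothetical and are not taken from the OSD code; the deadline and the timed "pardon" are left out, and a 'must' scrub is modelled simply as a last-scrub stamp of 0.

```python
from dataclasses import dataclass


@dataclass
class PgJob:
    pgid: str
    last_scrub: float        # last scrub stamp in seconds; 0.0 models a 'must' scrub
    penalized: bool = False  # set when a replica refused the reservation

    def target(self, min_interval: float) -> float:
        # "last + min-conf from pool"
        return self.last_scrub + min_interval


def pick_next(jobs, now, min_interval, reserve_replicas):
    """Return the job to scrub next, or None if no ripe PG could reserve its replicas.

    reserve_replicas(pgid) -> bool is a stand-in for the replica reservation step.
    """
    ripe = [j for j in jobs if j.target(min_interval) <= now]
    # Penalized PGs are only looked at after all other ripe PGs;
    # within each group, the earliest target time comes first.
    ripe.sort(key=lambda j: (j.penalized, j.target(min_interval)))
    for job in ripe:
        if reserve_replicas(job.pgid):
            job.penalized = False
            return job
        job.penalized = True  # a replica was too busy: penalize and try the next PG
    return None
```

Here a PG whose replicas are busy is skipped in favour of the next ripe PG, and stays at the back of the ripe set until a later reservation succeeds.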

The actual implementation has some additional details, mainly regarding rescheduling. All in all, it
is a bit confusing and less than optimal.

implementation in progress:
* same basic design (OSD-level, penalizing PGs with busy replicas, keeping the user-facing interface
  of 'scrub resources');
* each PG has the following associated scheduling data:
  - target time: last-scrub + min, as before. Note that here it is calculated once, and only changes on:
    - scrub termination;
    - operator commands;
    - configuration changes;
  - a "not before" time point, which determines the "ripeness" of the PG, i.e. when it may be considered for scrubbing; it is
    not the primary sort key for the "ready to scrub" queue;
* the "not before" is set once the 'target time' changes, and can only be pushed forward afterwards:
  - if a PG failed to start scrubbing;
  - ...
  Each modification path of the "not before" takes the deadline into account;
* ripe PGs are sorted based on "target time", then "not before".
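
A small sketch of how the per-PG scheduling data above could fit together. The names (`SchedEntry`, `set_target`, `delay`, `ripe_queue`) are hypothetical. The way the deadline is "taken into account" is not specified in the text; clamping "not before" at the deadline is only an assumption made for this example.

```python
from dataclasses import dataclass


@dataclass
class SchedEntry:
    pgid: str
    target: float      # last scrub + min interval
    deadline: float    # last scrub + max interval
    not_before: float  # earliest time at which the PG counts as ripe


def set_target(entry, last_scrub, min_interval, max_interval):
    """Recompute on scrub termination, operator command or config change."""
    entry.target = last_scrub + min_interval
    entry.deadline = last_scrub + max_interval
    entry.not_before = entry.target  # reset together with the target


def delay(entry, now, backoff):
    """Push 'not before' forward, e.g. after a failed scrub start.

    Assumption: the delay never moves 'not before' past the deadline,
    and 'not before' never moves backwards.
    """
    proposed = max(entry.not_before, now + backoff)
    entry.not_before = max(entry.not_before, min(proposed, entry.deadline))


def ripe_queue(entries, now):
    """Ripe PGs, ordered by target time first and 'not before' second."""
    ripe = [e for e in entries if e.not_before <= now]
    return sorted(ripe, key=lambda e: (e.target, e.not_before))
```

Because the target time stays fixed between those three events, a PG that was delayed keeps its place in the ordering once it becomes ripe again.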

Some ideas I was considering, and new ones following this thread / Josh:
- separating shallow/deep scrubs. Maybe through a revised resource counting system, that treats deep-scrubs
  as using more "resource units" than shallow scrubs; not sure what the correct objectives are (appreciate comments).
- marking "urgent" scrubs in the "replica - I need your resources" reservation requests. The idea is that if a replica
  is forced to refuse a reservation request for an urgent scrub, that replica would not (for some predefined period) grant any non-urgent
  reservation requests, allowing the urgent scrub to have priority when requested again (supposedly after running scrub
  sessions have completed).
  [Josh, inline:] Very good point - I actually thought about it after our talk today 😀. Just one comment: I am not sure you need a predefined period. The event that releases this mode should be tied to executing the urgent scrubs (possibly keep a list of all refused urgent scrubs and wait until the list is empty), since urgent scrubs will not disappear as time passes. Obviously changes in PGs need to be handled, but this should be manageable as part of the data movement process.
- allowing some dynamic changes to the resource counters, e.g. to compensate for blocked scrubs.
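
To make the first two ideas above concrete, here is a rough sketch of a replica-side reservation counter. Everything in it is hypothetical: the class name, the unit costs (1 for shallow, 2 for deep) and the rule that urgent mode ends when the list of refused urgent scrubs is empty (the variant suggested in the inline comment, rather than a predefined period). Handling of PG changes is not covered.

```python
SHALLOW_UNITS = 1  # assumed cost of a shallow scrub
DEEP_UNITS = 2     # assumed cost of a deep scrub


class ReplicaReservations:
    def __init__(self, capacity_units):
        self.capacity = capacity_units
        self.in_use = {}             # pgid -> units currently held
        self.refused_urgent = set()  # urgent requests that had to be refused

    def request(self, pgid, deep=False, urgent=False):
        """Return True if the reservation is granted."""
        units = DEEP_UNITS if deep else SHALLOW_UNITS
        # While any refused urgent scrub is still pending, hold back non-urgent ones.
        if self.refused_urgent and not urgent:
            return False
        free = self.capacity - sum(self.in_use.values())
        if units <= free:
            self.in_use[pgid] = units
            self.refused_urgent.discard(pgid)  # this urgent scrub is now served
            return True
        if urgent:
            self.refused_urgent.add(pgid)  # remember it so it gets priority later
        return False

    def release(self, pgid):
        """Called when the scrub session for pgid ends."""
        self.in_use.pop(pgid, None)
```

With a capacity of 2 units, one deep scrub fills the replica; an urgent request refused during that time then blocks non-urgent requests until it has been granted.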

Please CC me on ideas/comments re scrub. I don't usually read every post in the group.

Ronen


  



On Tue, Jun 21, 2022 at 11:00 AM Frank Schilder <frans@xxxxxx> wrote:
Hi all,

sounds like something like the following algorithm could be a long-term goal:

- separate (deep-)scrub queues per pool (see below for possible scheduling conflicts and alternatives)
- config settings for global/osd and mask class:dev_class (if present) apply; note that this should be sufficient, because no pool can contain OSDs from multiple device classes and performance across a single device class can be expected to be close to identical; if absolutely necessary, one could always create a device class per pool

- all PGs of a pool get scheduled for (deep-)scrub every (DEEP_)SCRUB_INTERVAL (property of a pool) seconds (what would be reasonable defaults comparable to current behaviour? 1 month for scrub, 3 months for deep scrub?)
- the event of all PGs of a pool being scheduled is logged as "pool xyz (deep-)scrub start"
- the event of a (deep-)scrub of the last PG of a pool being completed is logged as "pool xyz (deep-)scrub end"; if (deep-)scrubbing does not end before the next schedule event, no such message will be present (useful for diagnosis). Note that with "oldest scrub stamp first" everything will still get scrubbed, just not within the expected time interval

- the algorithm per queue is as close as possible to "oldest (deep-)scrub stamp first". Here I see a potential for scheduling conflicts if several pools with different (DEEP_)SCRUB_INTERVAL live on the same device class: which PG gets priority if PGs from different pools live on the same OSD? This problem could be so severe that (DEEP_)SCRUB_INTERVAL might need to be a property of a device class, with (deep-)scrub queues per device class instead of per pool
- with such an algorithm one could probably implement many if not all config options that Anthony suggested

Now, something like this might be a (too?) big project and possibly too detailed. A few first simple steps, as in

> Good point. The pain point of not having the oldest scrubbed PG be top priority next scrub cycle
> can be fixed at OSD level as well. For us it was easiest to fix it with a daemon. Not sure how easy
> it is to fix this in the OSD scrub code. It might not hurt to have it fixed in two places though.

might already be sufficient, making more detailed control less interesting. As a high priority, I would ask for the OSD code to use the oldest (deep-)scrub date stamp as the (deep-)scrub priority, to increase predictability (remove random PG selection; if PGs have the same date stamp, use the lowest ID as tie breaker). Then the existing config options together with device-class masks already allow good control over how aggressive scrubbing is.
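
The deterministic ordering asked for here ("oldest stamp first, lowest ID as tie breaker") can be expressed as a single sort key. This is only an illustration: the record layout is assumed, and "lowest ID" is interpreted as comparing the pool number first and the hexadecimal PG seed second.

```python
def scrub_priority_key(pg):
    """Sort key: oldest deep-scrub stamp first, then lowest PG ID.

    pg is assumed to be a dict with 'pgid' (e.g. "2.1f") and
    'last_deep_scrub_stamp' (any comparable value, e.g. epoch seconds).
    """
    pool, seed = pg["pgid"].split(".")
    return (pg["last_deep_scrub_stamp"], int(pool), int(seed, 16))


def scrub_order(pgs):
    """Return PG IDs in the order they would be picked for scrubbing."""
    return [pg["pgid"] for pg in sorted(pgs, key=scrub_priority_key)]
```

Parsing the ID numerically avoids the string-sort pitfall where "10.0" would sort before "2.0".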

By the way, Stefan, when you write "daemon", is this a daemon you implemented yourself on your own installation, or is it a daemon provided together with ceph?

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Josh Salomon <jsalomon@xxxxxxxxxx>
Sent: 20 June 2022 12:17:05
To: Stefan Kooman
Cc: Frank Schilder; Anthony D'Atri; Ceph Developers; ceph-users
Subject: Re: [ceph-users] Re: RFC: (deep-)scrub manager module

Replying on an older email in this thread.

> Yes, for sure this is an interesting scrub strategy I hadn't thought
> about. With the reasoning behind it: make sure you scrub as many PGs as
> you can when allowed to, not wasting time (as long as it's evenly
> distributed among OSDs). Do I get that right?

Yes - this is what I meant

> As you mention priorities: should it have some sort of fairness
> algorithm that avoids situations where a pool might not be scrubbed at
> all because of the constraints imposed? I can imagine a heavily loaded
> cluster, with spinning disks, where a high priority pool might get
> scrubbed, but a lower priority pool might not. In that case you might
> want to scrub PGs from the lower priority pool every x period to avoid
> the pool never being scrubbed at all. This might make things (overly)
> complicated though.

I think this was answered, but I used the word "priority", which may be misleading - I meant different frequencies, so one pool is scrubbed every week and another every month. Then when you select the PGs for scrubbing you use the date of the latest scrub, so this will increase the priority of the pools with a lower frequency and, I believe, will prevent starvation in all cases. We still need to understand what happens when the scrub frequency is, for example, a week and the process (with all the constraints, such as running only at limited times) takes more than a week. I still believe that if at any time we look just at the PGs which were not scrubbed for the longest time, we will prevent starvation.
If we allow only one scrub per OSD, the way I see the scheduling algorithm is something like:
initial state, no scrub executing:
OSD_group = all OSDs
selected = true
while (selected):
  selected = false
  try to select the PG which is marked for scrubbing, was not scrubbed for the longest period, and has all its PG_OSDs in OSD_group
  if a PG was selected:
    mark it as being scrubbed
    remove its PG_OSDs from OSD_group
    selected = true
  end if
end while
wait until a scrub_complete event and restart the while loop (with selected = true).

// scrub_complete event handling should add the PG_OSDs back to OSD_group
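
The pseudocode above translates almost directly into Python. The sketch below is one possible reading of it, with hypothetical names and an assumed tuple layout for the candidate PGs. Since the set of free OSDs only shrinks while PGs are being selected, one pass over the candidates in "oldest stamp first" order gives the same result as the repeated while loop.

```python
def select_scrubs(candidates, free_osds):
    """Pick PGs to scrub now, allowing at most one scrub per OSD.

    candidates: iterable of (pgid, last_scrub_stamp, osds) for PGs marked for scrubbing.
    free_osds:  set of OSD ids not currently scrubbing; updated in place.
    Returns the list of selected pgids, oldest stamp first.
    """
    started = []
    for pgid, _stamp, osds in sorted(candidates, key=lambda c: c[1]):
        if set(osds) <= free_osds:               # all PG_OSDs are still in OSD_group
            free_osds.difference_update(osds)    # remove PG_OSDs from OSD_group
            started.append(pgid)
    return started


def on_scrub_complete(osds, free_osds):
    """scrub_complete handling: return the PG's OSDs to the free group."""
    free_osds.update(osds)
```

After each scrub_complete event, `on_scrub_complete` would be called and `select_scrubs` run again over the PGs that are still waiting.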

Regards,

Josh


On Mon, Jun 20, 2022 at 12:13 PM Stefan Kooman <stefan@xxxxxx> wrote:
On 6/20/22 10:34, Frank Schilder wrote:
> Hi all,
>
> I guess moving scrubbing to the MGRs is quite an exercise if it's currently scheduled by OSDs. With the current functionality, simply having a manager module issue "ceph osd deep-scrub" commands will only ask the OSD to schedule a scrub, not execute it in a specific order at specific times.

We currently have a daemon that does just that: ask the OSD for a
(deep-)scrub. It has config options to tune the amount of concurrent
(deep-)scrubs. It's not a hard limit, as it might take some time before
a scrub is actually performed. So for short periods of time it might
overshoot this target. But it issues (deep-)scrubs based on oldest
timestamp, which is already a big win.
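
For readers who want a picture of what such a daemon does, here is a rough sketch of a single planning step. It is not the daemon described above. The record fields (`pgid`, `state`, `last_deep_scrub_stamp`) are assumptions loosely modelled on per-PG statistics, stamps are assumed to be strings in one sortable ISO-8601 format, and the function only builds `ceph pg deep-scrub <pgid>` command lines without running them.

```python
def plan_deep_scrubs(pg_stats, max_concurrent):
    """Return command lines requesting deep-scrubs, oldest stamp first.

    pg_stats:       list of dicts with 'pgid', 'state' and 'last_deep_scrub_stamp'.
    max_concurrent: target number of concurrent deep-scrubs (a soft limit).
    """
    # PGs whose state already shows a deep scrub count against the budget.
    running = sum(1 for pg in pg_stats if "scrubbing+deep" in pg["state"])
    budget = max(0, max_concurrent - running)
    # Only active PGs that are not scrubbing at all are candidates.
    idle = [pg for pg in pg_stats
            if "active" in pg["state"] and "scrubbing" not in pg["state"]]
    idle.sort(key=lambda pg: pg["last_deep_scrub_stamp"])
    return [["ceph", "pg", "deep-scrub", pg["pgid"]] for pg in idle[:budget]]
```

Because each command only asks the OSD to schedule a scrub, a request may not show up in the PG state by the next planning step, which is how the soft limit can be overshot for short periods.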

>
>  From my personal experience, this effort might be a bit too much and a few simple changes might be enough. I don't see any performance issues with executing scrub on our cluster and I wonder under what circumstances such occur.

For one: when RocksDB performance is degraded (particularly for PGs that
hold a lot of OMAP). Then *scrubs* can be even more of a problem than
deep-scrubs. Something we learned about very recently.

I would assume that scrub IO is part of the performance planning. My
guess is, that pools where deep scrub impacts performance tend to have
too few PGs (too many objects per PG). On small PGs, a deep scrub
usually finishes quickly, even if it needs to be repeated due to client
IO causing a pre-emption (osd_scrub_max_preemptions).

> What I'm missing though is a somewhat more systematic approach to scrubbing, in the same way that Anthony is suggesting. Right now it looks like PGs are chosen randomly instead of "least recently scrubbed first". This is important, because it makes scrub completion unpredictable. Random number generators do not produce a perfect equi-distribution, with the effect that some PGs will have to wait several almost complete cluster-scrub cycles before eventually being deep-scrubbed. This is probably a main reason why "not deep scrubbed in time" messages pop up. It's a few PGs that have to wait forever.

That confirms our observation. This one is the easiest to solve, and
highest on the list.

> This unpredictability is particularly annoying if you need to wait for all PGs being deep-scrubbed when upgrading over several versions of ceph. It is basically not possible to say a cluster with N PGs and certain scrub settings will complete a full deep-scrub of all PGs after so many days.

For such scenarios it would help a lot to have as many parallel scrubs,
evenly distributed, as possible to speed up this process (like Josh
suggested).

>
> From my perspective, adding this level of predictability would be an interesting improvement and possibly simple to implement. Instead of doing (deep-)scrubs randomly, one could consider a more systematic strategy of always queueing the "least-recently deep-scrubbed (active)" PGs first. This should already make it possible to reduce scrub frequency, because a complete scrub would no longer imply that a majority of PGs are scrubbed several times before the last one is picked. Plus, the completion time of a full scrub of a healthy cluster becomes something one can more or less compute as a number of days. One could just look at the top of the queue and query the oldest scrub date stamp to get a reliable estimate. Per-device-type scrub frequency should then be sufficient to adapt to device performance or other parameters, like failure probability.

Scrub frequency with respect to how likely a device might fail is
interesting. Besides having the option to configure this per pool type
(importance of data).

>
> It's basically the first two single-starred points of Anthony, where I would be OK if it's still the OSDs doing this scheduling as long as the scrub time stamp is treated as a global priority, that is, every primary OSD holds a few of its least-recently scrubbed PGs in a/the global queue ordered by scrub time stamp, and the top PGs are picked and only kept at the top of the queue if picking them would violate the osd_max_scrubs setting. As an extension, one could introduce scrub queues per device class. Using config option masks (https://docs.ceph.com/en/quincy/rados/configuration/ceph-conf/#configuration-sections) it is already possible to configure scrub options by device class, which I personally consider sufficient.

Good point. The pain point of not having the oldest scrubbed PG be top
priority next scrub cycle can be fixed at OSD level as well. For us it
was easiest to fix it with a daemon. Not sure how easy it is to fix this
in the OSD scrub code. It might not hurt to have it fixed in two places
though.

Thanks for your input!

Gr. Stefan
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
