Re: Backfill Performance for

Josh Baergen <jbaergen@xxxxxxxxxxxxxxxx> · Tue, 8 Aug 2023 11:39:00 -0600

Hi Jonathan,

> - All PGs seem to be backfilling at the same time which seems to be in
> violation of osd_max_backfills. I understand that there should be 6 readers
> and 6 writers at a time, but I'm seeing a given OSD participate in more
> than 6 PG backfills. Is an OSD only considered as backfilling if it is not
> present in both the UP and ACTING groups (e.g. it will have its data
> altered)?

Say you have a PG that looks like this:
1.7ffe  active+remapped+backfill_wait  [983,1112,486]  983  [983,1423,1329]  983

If this is a replicated cluster, 983 (the primary OSD) will be the
data read source, and 1423/1329 will of course be targets. If this is
EC, then 1112 will be the read source for the 1423 backfill, and 486
will be the read source for the 1329 backfill. (Unless the PG is
degraded, in which case backfill reads may become normal PG reads.)

Backfill locks are taken on the primary OSD (983 in the example above)
and then all the backfill targets (1329, 1423). Locks are _not_ taken
on read sources for EC backfills, so it's possible to have any number
of backfills reading from a single OSD during EC backfill with no
direct control over this.
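
To make the locking rule above concrete, here is a toy model in Python. It
is not Ceph code: the dict layout and the function names are made up for
illustration, and the second and third PGs are hypothetical. It only counts
which OSDs would hold a backfill lock (primary plus targets) versus which
are read from without one.

```python
from collections import Counter


def lock_holders(primary, targets):
    """OSDs that take a backfill lock for one PG: the primary plus all targets."""
    return [primary, *targets]


def count_backfill_load(pgs):
    """Return (locked, unlocked_reads), two Counters keyed by OSD id.

    Each PG is a dict with 'primary', 'targets' and 'read_sources'.
    A read source that is not the primary or a target holds no lock.
    """
    locked = Counter()
    unlocked_reads = Counter()
    for pg in pgs:
        holders = lock_holders(pg["primary"], pg["targets"])
        locked.update(holders)
        for osd in pg["read_sources"]:
            if osd not in holders:
                unlocked_reads[osd] += 1
    return locked, unlocked_reads


pgs = [
    # The EC reading of the example PG above.
    {"primary": 983, "targets": [1423, 1329], "read_sources": [1112, 486]},
    # Hypothetical extra PGs that also happen to read from OSD 1112.
    {"primary": 12, "targets": [34], "read_sources": [1112]},
    {"primary": 56, "targets": [78], "read_sources": [1112]},
]

locked, unlocked_reads = count_backfill_load(pgs)
print("locks held per OSD:", dict(locked))
print("reads without a lock per OSD:", dict(unlocked_reads))
```

In this model OSD 1112 never holds a lock, so osd_max_backfills never
limits how many backfills read from it. In the replicated case the read
source is the primary, which does hold a lock, so it is limited.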

> - Some PGs are recovering at a much slower rate than others (some as little
> as kilobytes per second) despite the disks being all of a similar speed. Is
> there some way to dig into why that may be?

Where I would start with this is looking at whether the read sources
or write targets are overloaded at the disk level.
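
As a rough sketch of that check, assuming Linux OSD hosts: the script below
samples /proc/diskstats twice and reports how busy each block device was
over the interval, which is roughly what iostat shows as %util. The
function names are mine, and mapping device names back to OSD ids is left
out.

```python
import time


def parse_io_ticks(diskstats_text):
    """Map device name -> cumulative milliseconds spent doing I/O.

    In /proc/diskstats the device name is the 3rd column and the
    'time spent doing I/Os (ms)' counter is the 13th column.
    """
    ticks = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 13:
            ticks[fields[2]] = int(fields[12])
    return ticks


def utilisation(before, after, interval_ms):
    """Percentage of the interval during which each device was busy."""
    return {
        dev: 100.0 * (after[dev] - before[dev]) / interval_ms
        for dev in after
        if dev in before
    }


def sample(interval_s=5.0, path="/proc/diskstats"):
    """Take two snapshots interval_s seconds apart and return busy percentages."""
    with open(path) as f:
        before = parse_io_ticks(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_io_ticks(f.read())
    return utilisation(before, after, interval_s * 1000.0)
```

Running `sample()` on the hosts holding the read sources and write targets
of a slow PG, and looking for devices pinned near 100, would be one way to
spot an overloaded disk.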

> - In general, the recovery is happening very slowly (between 1 and 5
> objects per second per PG). Is it possible the settings above are too
> aggressive and causing performance degradation due to disk thrashing?

Maybe - which settings are appropriate depends on your configuration
(replicated vs. EC). If you have a replicated pool, then those
settings are probably way too aggressive, and max backfills should be
reduced. If it's EC, the max backfills might be OK. In either case,
the sleep should be increased, though it's unlikely that the sleep
setting is affecting per-PG backfill speed that much (it could
make it uneven).

> - Currently, all misplaced PGs are backfilling, if I were to change some of
> the settings above (specifically `osd_max_backfills`) would that
> essentially pause backfilling PGs or will those backfills have to end and
> then start over when it is done waiting?

It effectively pauses backfill.
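
If you do decide to lower it at runtime, here is a minimal sketch, assuming
your release has the centralized config database (`ceph config set`). The
helper names and the value 1 are only illustrative, and the script defaults
to a dry run that prints the command instead of executing it.

```python
import subprocess


def set_max_backfills_cmd(value, who="osd"):
    """Build the argv for `ceph config set <who> osd_max_backfills <value>`."""
    if value < 1:
        raise ValueError("osd_max_backfills must be at least 1")
    return ["ceph", "config", "set", who, "osd_max_backfills", str(value)]


def apply(cmd, dry_run=True):
    """Print the command, and only execute it when dry_run is False."""
    print(" ".join(cmd))
    if dry_run:
        return None
    return subprocess.run(cmd, check=True)


# Illustrative value only; pick one that suits your pool type.
apply(set_max_backfills_cmd(1))
```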

> - Given that all PGs are backfilling simultaneously there is no way to
> prioritize one PG over another (we have some disks with very high usage
> that we're trying to reduce). Would reducing those max backfills allow for
> proper prioritization of PGs with force-backfill?

There's no great way to affect backfill prioritization. The backfill
lock acquisition I noted above is blocking without backoff, so
high-priority backfills could be waiting in line for a while until
they get a chance to run.

> - We have had some OSDs restart during the process and their misplaced
> object count is now zero but they are incrementing their recovering objects
> bytes. Is that expected and is there a way to estimate when that will
> complete?

Not sure - this gets messy.

FWIW, this situation is one of the reasons why we built
https://github.com/digitalocean/pgremapper (inspired by a procedure
and some tooling that CERN built for the same reason). You might be
interested in https://github.com/digitalocean/pgremapper#example---cancel-all-backfill-in-the-system-as-a-part-of-an-augment,
or using cancel-backfill plus an undo-upmaps loop.

Josh