Hey Josh, adding the dev list where you may get more input.
Generally I think your analysis is correct about the current behavior.
In particular if another copy of a shard is available, backfill or
recovery will read from just that copy, not the rest of the OSDs.
Otherwise, k shards must be read to reconstruct the data (for
Reed-Solomon family erasure codes).
IIRC it doesn't matter whether it's a data or parity shard; the
path is the same.
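
To make the two read paths concrete, here is a toy Python model (not
Ceph code; the function name and arguments are made up for
illustration). It assumes we know which shards currently have a
readable copy. Which k shards get picked for reconstruction is really
up to the EC plugin; the sketch just takes the lowest-numbered ones.

```python
def choose_read_sources(wanted_shard, available, k):
    """Toy model of picking read sources for one EC shard.

    wanted_shard: index of the shard being backfilled or recovered.
    available:    dict mapping shard index -> OSD id with a readable copy.
    k:            number of data shards in the EC profile.
    """
    # Another copy of the shard exists: read from just that OSD.
    # Data and parity shards are treated identically here.
    if wanted_shard in available:
        return [available[wanted_shard]]
    # Otherwise k shards are needed to reconstruct the missing one.
    if len(available) < k:
        raise ValueError("fewer than k shards readable; cannot reconstruct")
    # Assumption: simply take the k lowest-numbered shards.
    return [available[s] for s in sorted(available)[:k]]
```

With k=3 and every shard readable, asking for shard 2 returns only
the OSD holding shard 2; once that copy is gone, three other OSDs are
returned instead.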
With respect to reservations, it seems like an oversight that
we don't reserve other shards for backfilling. We reserve all
shards for recovery [0].
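
As a rough illustration of that gap, here is a toy Python sketch (not
Ceph code; the function and its arguments are hypothetical) of which
OSDs end up holding a reservation in each case, going by the
description in this thread: primary plus backfill targets for
backfill, all shards for recovery.

```python
def osds_to_reserve(kind, primary, acting, backfill_targets):
    """Toy model: set of OSDs that hold a reservation for one PG."""
    if kind == "backfill":
        # Local reservation on the primary, remote on the targets only.
        return {primary} | set(backfill_targets)
    if kind == "recovery":
        # Every shard is reserved, not just the ones receiving data.
        return set(acting) | set(backfill_targets)
    raise ValueError("kind must be 'backfill' or 'recovery'")


# Example PG from the original message: acting [0,1,2,3,4], target OSD 5.
backfill_set = osds_to_reserve("backfill", 0, [0, 1, 2, 3, 4], [5])
recovery_set = osds_to_reserve("recovery", 0, [0, 1, 2, 3, 4], [5])
```

In this model OSD 2, the source of the backfill reads, is in
recovery_set but not in backfill_set, which is the case described in
question 2 below.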
On the other hand, overload from recovery is handled better in
Pacific and beyond with mclock-based QoS, which provides much
more effective control of recovery traffic [1][2].
In prior versions, the osd_recovery_sleep option was the best
way to get more fine-grained control of recovery and backfill
traffic, but this was not dynamic at all. osd_max_backfills
set an upper limit on backfill parallelism. mclock supersedes
both of these when it's enabled, since it can handle bursting
and throttling itself.
Josh
[0] https://github.com/ceph/ceph/blob/v16.2.1/src/osd/PeeringState.cc#L5914-L5921
[1] https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#dmclock-qos
[2] https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/
On 4/19/21 12:24 PM, Josh Baergen wrote:
Hey all,
I wanted to confirm my understanding of some of the mechanics of
backfill in EC pools. I've yet to find a document that outlines this
in detail; if there is one, please send it my way. :) Some of what I
write below is likely in the "well, duh" category, but I tended
towards completeness.
First off, I understand that backfill reservations work the same way
between replicated pools and EC pools. A local reservation is taken on
the primary OSD, then a remote reservation on the backfill target(s),
before the backfill is allowed to begin. Until this point, the
backfill is in the backfill_wait state.
The differences appear once the backfill actually begins. Let's
say we have an EC 3:2 PG that's backfilling from OSD 2 to OSD 5
(formatted here like pgs_brief):
1.1 active+remapped+backfilling [0,1,5,3,4] 0 [0,1,2,3,4] 0
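
To spell out how I'm reading that line, here is a small Python sketch
(my own helper, not part of any Ceph tooling). It assumes the columns
are pgid, state, up set, up primary, acting set, acting primary, that
every entry in both sets is a plain integer OSD id, and that position
in the set is the EC shard index.

```python
def shard_moves(pgs_brief_line):
    """Return (shard, source_osd, target_osd) for each shard whose
    OSD differs between the acting set and the up set."""
    pgid, state, up, up_primary, acting, acting_primary = pgs_brief_line.split()
    up_osds = [int(x) for x in up.strip("[]").split(",")]
    acting_osds = [int(x) for x in acting.strip("[]").split(",")]
    # Data currently lives on the acting OSD and is headed to the up OSD.
    return [
        (shard, src, dst)
        for shard, (dst, src) in enumerate(zip(up_osds, acting_osds))
        if src != dst
    ]


line = "1.1 active+remapped+backfilling [0,1,5,3,4] 0 [0,1,2,3,4] 0"
moves = shard_moves(line)
```

For the line above, moves is [(2, 2, 5)]: shard 2 is moving from
OSD 2 to OSD 5, and OSD 0 is the primary for both sets.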
The question in my mind was: Where is the data for this backfill
coming from? In replicated pools, all reads come from the primary.
However, in this case, the primary does not have the data in question;
the primary has to either read the EC chunk from OSD 2, or it has to
reconstruct it by reading from 3 of the OSDs in the acting set.
Based on observation, I _think_ this is what happens:
1. As long as the PG is not degraded, the backfill read is simply
forwarded by the primary to OSD 2.
2. Once the PG becomes degraded, the backfill read needs to use the
reconstructing path, and begins reading from 3 of the OSDs in the
acting set.
Questions:
1. Can anyone confirm or correct my description of how EC backfill
operates? In particular, in case 2 above, does it matter whether OSD 2
is the cause of degradation, for example? Does the read still get
forwarded to a single OSD when it's parity chunks that are being moved
via backfill?
2. I'm curious as to why a 3rd reservation, for the source OSD, wasn't
introduced as a part of EC in Ceph. We've occasionally seen an OSD
become overloaded because several backfills were reading from it
simultaneously, and there's no way to control this via the normal
osd_max_backfills mechanism. Is anyone aware of discussions to this
effect?
Thanks!
Josh
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx