Hi Jonathan,

> - All PGs seem to be backfilling at the same time which seems to be in
> violation of osd_max_backfills. I understand that there should be 6 readers
> and 6 writers at a time, but I'm seeing a given OSD participate in more
> than 6 PG backfills. Is an OSD only considered as backfilling if it is not
> present in both the UP and ACTING groups (e.g. it will have its data
> altered)?

Say you have a PG that looks like this:

  1.7ffe  active+remapped+backfill_wait  [983,1112,486]  983  [983,1423,1329]  983

If this is a replicated cluster, 983 (the primary OSD) will be the data read
source, and 1423/1329 will of course be targets. If this is EC, then 1112
will be the read source for the 1423 backfill, and 486 will be the read
source for the 1329 backfill. (Unless the PG is degraded, in which case
backfill reads may become normal PG reads.)

Backfill locks are taken on the primary OSD (983 in the example above) and
then on all of the backfill targets (1329, 1423). Locks are _not_ taken on
read sources for EC backfills, so any number of backfills can end up reading
from a single OSD during EC backfill, with no direct control over this.

> - Some PGs are recovering at a much slower rate than others (some as little
> as kilobytes per second) despite the disks being all of a similar speed. Is
> there some way to dig into why that may be?

I would start by looking at whether the read sources or write targets are
overloaded at the disk level.

> - In general, the recovery is happening very slowly (between 1 and 5
> objects per second per PG). Is it possible the settings above are too
> aggressive and causing performance degradation due to disk thrashing?

Maybe - which settings are appropriate depends on your configuration
(replicated vs. EC). If you have a replicated pool, those settings are
probably way too aggressive and max backfills should be reduced; if it's EC,
the max backfills might be OK. In either case the sleep should be increased,
though it's unlikely that the sleep setting is affecting per-PG backfill
speed that much (it could make it uneven, however).

> - Currently, all misplaced PGs are backfilling, if I were to change some of
> the settings above (specifically `osd_max_backfills`) would that
> essentially pause backfilling PGs or will those backfills have to end and
> then start over when it is done waiting?

It effectively pauses backfill - the affected PGs go back to backfill_wait
and resume where they left off rather than starting over.

> - Given that all PGs are backfilling simultaneously there is no way to
> prioritize one PG over another (we have some disks with very high usage
> that we're trying to reduce). Would reducing those max backfills allow for
> proper prioritization of PGs with force-backfill?

There's no great way to affect backfill prioritization. The backfill lock
acquisition I noted above is blocking without backoff, so high-priority
backfills could be waiting in line for a while until they get a chance to
run.

> - We have had some OSDs restart during the process and their misplaced
> object count is now zero but they are incrementing their recovering objects
> bytes. Is that expected and is there a way to estimate when that will
> complete?

Not sure - this gets messy.

FWIW, this situation is one of the reasons why we built
https://github.com/digitalocean/pgremapper (inspired by a procedure and some
tooling that CERN built for the same reason). You might be interested in
https://github.com/digitalocean/pgremapper#example---cancel-all-backfill-in-the-system-as-a-part-of-an-augment,
or in using cancel-backfill plus an undo-upmaps loop.
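In case it helps, here's a rough sketch of how I'd ratchet things down at
runtime while watching for disk saturation. This assumes a release with
centralized config (Nautilus or later; on older releases you'd use
"ceph tell osd.* injectargs" instead), and the specific values (2 backfills,
0.1s sleep) are only illustrative, not a recommendation for your hardware:

  # Reduce concurrent backfills per OSD; PGs over the new limit drop back
  # to backfill_wait rather than restarting from scratch.
  ceph config set osd osd_max_backfills 2

  # Add throttling between recovery/backfill ops (use the _ssd/_hybrid
  # variants if that matches your media).
  ceph config set osd osd_recovery_sleep_hdd 0.1

  # Confirm a given OSD picked up the change.
  ceph tell osd.983 config get osd_max_backfills

  # Look for overloaded read sources / write targets: high latencies here,
  # or high %util in iostat -x on the OSD hosts, point at the disks that
  # are the bottleneck.
  ceph osd perf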
Josh