> On 1 December 2016 at 23:10, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> Hi Bartek,
>
> On Thu, 1 Dec 2016, Bartłomiej Święcki wrote:
> > We're currently being hit by an issue with cluster recovery. The
> > cluster size has been significantly extended (~50% new OSDs) and
> > recovery started. During recovery there was a hardware failure and we
> > ended up with some PGs in peered state with size < min_size
> > (inactive). Those peered PGs are waiting for backfill, but the
> > cluster still prefers recovery of recovery_wait PGs - in our case it
> > could be a few hours before all recovery is finished (we're pushing
> > recovery to its limits to keep the downtime as short as possible).
> > Those peered PGs stay blocked during this time and the whole cluster
> > struggles to operate at a reasonable level.
> >
> > We're running hammer 0.94.6 there, and from the code it looks like
> > recovery will always have higher priority (jewel seems similar). The
> > documentation only says that log-based recovery must finish before
> > backfills. Is this requirement needed for data consistency, or
> > something else?
>
> For a given single PG, it will do its own log recovery (to bring acting
> OSDs fully up to date) before starting backfill, but between PGs there's
> no dependency.
>
> > Ideally we'd like it to be this order: undersized inactive (size <
> > min_size) recovery_wait => undersized inactive (size < min_size)
> > wait_backfill => degraded recovery_wait => degraded wait_backfill =>
> > remapped wait_backfill. Changing the priority calculation doesn't
> > seem to be that hard, but would it end up with inconsistent data?
>
> We could definitely improve this, yeah. The prioritization code is
> based around PG::get_{recovery,backfill}_priority() and the
> OSD_{RECOVERY,BACKFILL}_* #defines in PG.h. It currently assumes (log)
> recovery is always higher priority than backfill, but as you say we can
> do better. As long as everything maps into a 0..255 priority value it
> should be fine.
>
> Getting the undersized inactive PGs bumped to the top should be a
> simple tweak (just force a top-value priority in that case)...

Would be very, very welcome. Inactive PGs should always get priority,
imho. I find it difficult to explain to people why Ceph doesn't already
do this, and I've seen it happen multiple times.

Would this be just a fix in PG.cc, in get_backfill_priority() and
get_recovery_priority()? It seems so.

Wido

>
> sage
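
For illustration, here is a minimal sketch (not the actual Ceph code) of
the tweak Sage describes: force a top-value priority in the two
PG.cc priority methods whenever the PG is inactive (acting set below
min_size). The base values, the *_FORCED_INACTIVE constants, and the
PgState struct are all assumptions standing in for the real PG state;
only the 0..255 range constraint comes from the thread.

    // Sketch only: assumed priority bands within the required 0..255 range.
    static const unsigned OSD_BACKFILL_PRIORITY_BASE                 = 100;
    static const unsigned OSD_BACKFILL_DEGRADED_PRIORITY_BASE        = 140;
    static const unsigned OSD_RECOVERY_PRIORITY_BASE                 = 180;
    static const unsigned OSD_BACKFILL_PRIORITY_FORCED_INACTIVE      = 254;
    static const unsigned OSD_RECOVERY_PRIORITY_FORCED_INACTIVE      = 255;

    // Hypothetical stand-in for the PG state the real
    // PG::get_{recovery,backfill}_priority() methods would consult.
    struct PgState {
      unsigned acting_size;  // OSDs currently in the acting set
      unsigned min_size;     // pool min_size
      bool     degraded;
    };

    // Inactive PGs (acting < min_size, so client I/O is blocked) jump
    // straight to the top; everything else keeps the existing
    // recovery-over-backfill ordering.
    unsigned get_recovery_priority(const PgState& pg) {
      if (pg.acting_size < pg.min_size)
        return OSD_RECOVERY_PRIORITY_FORCED_INACTIVE;
      return OSD_RECOVERY_PRIORITY_BASE;
    }

    unsigned get_backfill_priority(const PgState& pg) {
      if (pg.acting_size < pg.min_size)
        return OSD_BACKFILL_PRIORITY_FORCED_INACTIVE;
      return pg.degraded ? OSD_BACKFILL_DEGRADED_PRIORITY_BASE
                         : OSD_BACKFILL_PRIORITY_BASE;
    }

With this shape the priorities come out in the order Bartłomiej asks
for: inactive recovery (255) > inactive backfill (254) > degraded
recovery (180) > degraded backfill (140) > plain/remapped backfill
(100), while recovery still outranks backfill for active PGs.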