On Mon, 5 Dec 2016, Bartłomiej Święcki wrote:
> Hi,
>
> I made a quick draft of how the new priority code could look like, please
> let me know if that's a good direction:
>
> https://github.com/ceph/ceph/compare/master...ovh:wip-rework-recovery-priorities
>
> Haven't tested it yet though, so no PR yet; will do it today.

That looks reasonable to me!

> A side question: Is there any reason why pool_recovery_priority is not
> adjusting backfill priority?
> Maybe it would be beneficial to include it there too?

I'm guessing not... Sam?

sage

>
> Regards,
> Bartek
>
>
> On 12/01/2016 11:10 PM, Sage Weil wrote:
> > Hi Bartek,
> >
> > On Thu, 1 Dec 2016, Bartłomiej Święcki wrote:
> > > We're currently being hit by an issue with cluster recovery. The
> > > cluster size has been significantly extended (~50% new OSDs) and
> > > recovery started.
> > > During recovery there was a HW failure and we ended up with some PGs
> > > in the peered state with size < min_size (inactive).
> > > Those peered PGs are waiting for backfill, but the cluster still
> > > prefers recovery of recovery_wait PGs; in our case it could take even
> > > a few hours before all recovery is finished (we're speeding up
> > > recovery up to the limits to get the downtime as short as possible).
> > > Those peered PGs are blocked during this time and the whole cluster
> > > struggles to operate at a reasonable level.
> > >
> > > We're running hammer 0.94.6 there, and from the code it looks like
> > > recovery will always have higher priority (jewel seems similar).
> > > The documentation only says that log-based recovery must finish
> > > before backfills.
> > > Is this requirement needed for data consistency or something else?
> >
> > For a given single PG, it will do its own log recovery (to bring acting
> > OSDs fully up to date) before starting backfill, but between PGs there
> > is no dependency.
> >
> > > Ideally we'd like it to be this order: undersized inactive (size <
> > > min_size) recovery_wait => undersized inactive (size < min_size)
> > > wait_backfill => degraded recovery_wait => degraded wait_backfill =>
> > > remapped wait_backfill.
> > > Changing the priority calculation doesn't seem to be that hard, but
> > > would it end up with inconsistent data?
> >
> > We could definitely improve this, yeah. The prioritization code is
> > based around PG::get_{recovery,backfill}_priority() and the
> > OSD_{RECOVERY,BACKFILL}_* #defines in PG.h. It currently assumes (log)
> > recovery is always higher priority than backfill, but as you say we
> > can do better. As long as everything maps into a 0..255 priority value
> > it should be fine.
> >
> > Getting the undersized inactive PGs bumped to the top should be a
> > simple tweak (just force a top-value priority in that case)...
> >
> > sage
> >
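To make the desired ordering concrete, here is a minimal, self-contained sketch of the idea being discussed. This is not Ceph code: the PGInfo struct, the constant names, and their values are illustrative stand-ins for the real OSD_{RECOVERY,BACKFILL}_* #defines and PG state in PG.h. It shows how a single priority function could force undersized inactive PGs to the top of the queue while keeping every result inside the 0..255 range:

```cpp
#include <algorithm>

// Illustrative stand-ins for the OSD_{RECOVERY,BACKFILL}_* #defines in
// PG.h; the actual names and values in Ceph differ.
constexpr int PRIORITY_FORCED   = 255; // undersized inactive: jump the queue
constexpr int RECOVERY_BASE     = 180; // log-based recovery of degraded PGs
constexpr int BACKFILL_DEGRADED = 140; // backfill that repairs degraded PGs
constexpr int BACKFILL_REMAPPED = 100; // backfill that only moves remapped data

// Hypothetical summary of a PG's situation, for illustration only.
struct PGInfo {
  bool inactive_undersized; // peered, size < min_size: client I/O is blocked
  bool degraded;            // fewer copies than pool size
  bool needs_log_recovery;  // fixable from the PG log (vs. needing backfill)
  int  pool_priority_bump;  // per-pool adjustment, e.g. pool_recovery_priority
};

int compute_priority(const PGInfo& pg) {
  // Inactive PGs block I/O, so they outrank everything else, whether they
  // need log recovery or backfill; recovery still edges out backfill.
  if (pg.inactive_undersized)
    return pg.needs_log_recovery ? PRIORITY_FORCED : PRIORITY_FORCED - 1;

  int base;
  if (pg.needs_log_recovery)
    base = RECOVERY_BASE;
  else
    base = pg.degraded ? BACKFILL_DEGRADED : BACKFILL_REMAPPED;

  // Clamp so a per-pool bump can never overtake the forced tier or
  // escape the 0..255 range.
  return std::min(base + pg.pool_priority_bump, PRIORITY_FORCED - 2);
}
```

With tiers laid out like this, the ordering from the email falls out of simple integer comparison: undersized inactive recovery, then undersized inactive backfill, then degraded recovery, degraded backfill, and finally plain remapped backfill, with pool_recovery_priority able to nudge ordering within (but not across) the forced tier.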