Ok, so here we go: https://github.com/ceph/ceph/pull/12389
I included pool_recovery_priority in the backfill priority as well. I also
noticed that the priority could be negative, so I made sure the code handles
such cases properly.
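Roughly, the handling looks something like this (a simplified, illustrative
sketch, not the exact code from the PR -- the helper name is made up):

  // Illustrative only: combine a base priority with the pool's
  // recovery_priority (which may be negative) and clamp the result
  // into the valid 0..255 range.
  static unsigned adjust_priority(int base, int pool_recovery_priority)
  {
    int p = base + pool_recovery_priority;  // may go below 0 or above 255
    if (p < 0)
      p = 0;
    else if (p > 255)
      p = 255;
    return static_cast<unsigned>(p);
  }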
I've been testing this locally and backfill recovery of inactive PGs is much
better now.
While testing I found another problem causing I/O stalls, also confirmed
on 10.2.3.
I made the cluster end up with some unfound objects during recovery by
bringing all copies down; after bringing one OSD back up, some requests ended
up in "waiting for missing objects". If I left the cluster to fully recover,
this eventually went away, but the I/O stayed blocked even though there was
no inactive PG. On a different test run, when I manually restarted the OSD
that had requests "waiting for missing objects", fio (which I used to
generate I/O load on the cluster) unblocked, so I guess those requests could
have been unblocked earlier.
Is this something we could also improve?
Thanks,
Bartek
On 12/05/2016 04:18 PM, Sage Weil wrote:
On Mon, 5 Dec 2016, Bartłomiej Święcki wrote:
Hi,
I made a quick draft of what the new priority code could look like, please let me
know if that's a good direction:
https://github.com/ceph/ceph/compare/master...ovh:wip-rework-recovery-priorities
I haven't tested it yet, though, so no PR yet; will do it today.
That looks reasonable to me!
A side question: is there any reason why pool_recovery_priority doesn't
adjust the backfill priority?
Maybe it would be beneficial to include it there too?
I'm guessing not... Sam?
sage
Regards,
Bartek
On 12/01/2016 11:10 PM, Sage Weil wrote:
Hi Bartek,
On Thu, 1 Dec 2016, Bartłomiej Święcki wrote:
We're currently being hit by an issue with cluster recovery. The cluster size
has been significantly extended (~50% new OSDs) and recovery started.
During recovery there was a HW failure and we ended up with some PGs in the
peered state with size < min_size (inactive).
Those peered PGs are waiting for backfill, but the cluster still prefers
recovery of recovery_wait PGs - in our case this could be even a few hours
before all recovery is finished (we're pushing recovery to its limits to keep
the downtime as short as possible). Those peered PGs are blocked during this
time and the whole cluster just struggles to operate at a reasonable level.
We're running hammer 0.94.6 there, and from the code it looks like recovery
will always have a higher priority (jewel seems similar).
The documentation only says that log-based recovery must finish before
backfill.
Is this requirement needed for data consistency or something else?
For a given single PG, it will do its own log recovery (to bring the acting
OSDs fully up to date) before starting backfill, but between PGs there's
no dependency.
Ideally we'd like it to be this order: undersized inactive (size < min_size)
recovery_wait => undersized inactive (size < min_size) wait_backfill =>
degraded recovery_wait => degraded wait_backfill => remapped wait_backfill.
Changing the priority calculation doesn't seem to be that hard, but would it
end up with inconsistent data?
We could definitely improve this, yeah. The prioritization code is based
around PG::get_{recovery,backfill}_priority(), and the
OSD_{RECOVERY,BACKFILL}_* #defines in PG.h. It currently assumes (log)
recovery is always higher priority than backfill, but as you say we can do
better. As long as everything maps into a 0..255 priority value it should
be fine.
Getting the undersized inactive bumped to the top should be a simple tweak
(just force a top-value priority in that case)...
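To make that concrete, a sketch along these lines (the function name and the
constants are purely illustrative, not the actual values from PG.h) would put
the undersized inactive PGs at the top of the queue:

  #include <algorithm>

  // Hypothetical sketch, not the actual PG::get_backfill_priority():
  // pick a base value per state so that undersized inactive PGs sort
  // above degraded ones, which sort above plain remapped backfill,
  // then fold in the pool's recovery_priority and clamp to 0..255.
  unsigned backfill_priority_sketch(bool inactive,   // acting < min_size
                                    bool degraded,   // acting < pool size
                                    int pool_recovery_priority)
  {
    int base = 100;        // remapped wait_backfill (illustrative)
    if (inactive)
      base = 220;          // force a near-top priority, as suggested above
    else if (degraded)
      base = 140;          // degraded wait_backfill
    int p = base + pool_recovery_priority;
    return std::max(0, std::min(p, 255));
  }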
sage