Re: Clarification on sequence of recovery and client ops after OSDs rejoin cluster (also, slow requests)

On 09/14/2017 12:44 AM, Florian Haas wrote:
On Thu, Sep 14, 2017 at 2:47 AM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
On 09/13/2017 03:40 AM, Florian Haas wrote:

So we have a client that is talking to OSD 30. OSD 30 was never down;
OSD 17 was. OSD 30 is also the preferred primary for this PG (via
primary affinity). The OSD now says that

- it does itself have a copy of the object,
- so does OSD 94,
- but that the object is "also" missing on OSD 17.

So I'd like to ask firstly: what does "also" mean here?


Nothing, it's just included in all the log messages in the loop looking
at whether objects are missing.

OK, maybe the "also" can be removed to reduce potential confusion?

Sure

Secondly, if the local copy is current, and we have no fewer than
min_size copies of the object, and recovery is meant to be a background
operation,
then why is the recovery in the I/O path here? Specifically, why is
that the case on a write, where the object is being modified anyway,
and the modification then needs to be replicated out to OSDs 17 and
94?


Mainly because recovery pre-dated the concept of min_size. We realized
this was a problem during luminous development, but did not complete the
fix for it in time for luminous. Nice analysis of the issue though!
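
In the meantime, a quick way to confirm that a blocked request really
is stuck behind recovery of a degraded/missing object is to look at
the in-flight ops on the acting primary. This is just a sketch, and
the OSD id and log path below are only examples:

# run on the node hosting the acting primary (osd.30 in the example above)
ceph daemon osd.30 dump_ops_in_flight | grep -iE 'degraded|missing'

# the OSD's slow request warnings carry the same state, e.g.
# "currently waiting for degraded object"
grep 'waiting for degraded object' /var/log/ceph/ceph-osd.30.log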

Well I wasn't quite done with the analysis yet, I just wanted to check
whether my initial interpretation was correct.

So, here's what this behavior causes, if I understand things correctly:

- We have a bunch of objects that need to be recovered onto the
just-returned OSD(s).
- Clients access some of these objects while they are pending recovery.
- When that happens, recovery of those objects gets reprioritized.
Simplistically speaking, they get to jump the queue.

Did I get that right?

Yes

If so, let's zoom out a bit now and look at RBD's most frequent use
case, virtualization. While the OSDs were down, the RADOS objects that
were created or modified would have come from whatever virtual
machines were running at that time. When the OSDs return, there's a
very good chance that those same VMs are still running. While they're
running, they of course continue to access the same RBDs, and are
quite likely to access the same *data* as before on those RBDs — data
that now needs to be recovered.

So that means that there is likely a solid majority of to-be-recovered
RADOS objects that needs to be moved to the front of the queue at some
point during the recovery. Which, in the extreme, renders the
prioritization useless: if I have, say, 1,000 objects that need to be
recovered but 998 have been moved to the "front" of the queue, the
queue is rather meaningless.

This is more of an issue with write-intensive RGW buckets, since the
bucket index object is a single bottleneck: if it needs recovery, all
further writes to that shard of the bucket index are blocked on it
until the recovery completes.
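
If you want to check whether a particular bucket's index object is
among the objects pending recovery, here is a rough sketch (the pool
name assumes a default zone; the bucket name, <marker> and <pgid> are
placeholders):

# index objects are named .dir.<marker> (or .dir.<marker>.<shard> when sharded)
radosgw-admin bucket stats --bucket=mybucket | grep -E '"id"|"marker"'

# map an index object to its PG, then see whether that PG is still recovering
ceph osd map default.rgw.buckets.index .dir.<marker>.0
ceph pg <pgid> query | grep -iE 'degraded|missing'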

Again, on the assumption that this correctly describes what Ceph
currently does, do you have suggestions for how to mitigate this? It
seems to me that the only actual remedy for this issue in
Jewel/Luminous would be to not access objects pending recovery, but as
just pointed out, that's a rather unrealistic goal.

In luminous you can force the OSDs to backfill (which does not block
I/O) instead of using log-based recovery. This requires scanning the
disk to see which objects are missing, instead of looking at the pg
log, so recovery will take longer. That is feasible for all-SSD
setups, but with pure HDDs it may be too slow, depending on how much
durability you are willing to trade for availability.

You can do this by setting:

osd min pg log entries = 1
osd max pg log entries = 2
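
Those go in the [osd] section of ceph.conf. As a sketch, you should
also be able to inject them at runtime, but do verify that the values
actually took effect on the OSDs:

ceph tell osd.* injectargs '--osd_min_pg_log_entries 1 --osd_max_pg_log_entries 2'

# on each OSD host, confirm the running values (osd.30 is just an example)
ceph daemon osd.30 config show | grep pg_log_entries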

I'm working on the fix (aka async recovery) for mimic. This won't be
backportable unfortunately.

OK — is there any more information on this that is available and
current? A quick search turned up a Trello card
(https://trello.com/c/jlJL5fPR/199-osd-async-recovery), a mailing list
post (https://www.spinics.net/lists/ceph-users/msg37127.html), a slide
deck (https://www.slideshare.net/jupiturliu/ceph-recovery-improvement-v02),
a stale PR (https://github.com/ceph/ceph/pull/11918), and an inactive
branch (https://github.com/jdurgin/ceph/commits/wip-async-recovery),
but I was hoping for something a little more detailed. Thanks in
advance for any additional insight you can share here!

There's a description of the idea here:

https://github.com/jdurgin/ceph/commit/15c4c7134d32f2619821f891ec8b8e598e786b92

Josh



