On Thu, Sep 14, 2017 at 2:47 AM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
> On 09/13/2017 03:40 AM, Florian Haas wrote:
>>
>> So we have a client that is talking to OSD 30. OSD 30 was never down;
>> OSD 17 was. OSD 30 is also the preferred primary for this PG (via
>> primary affinity). The OSD now says that
>>
>> - it does itself have a copy of the object,
>> - so does OSD 94,
>> - but that the object is "also" missing on OSD 17.
>>
>> So I'd like to ask firstly: what does "also" mean here?
>
> Nothing, it's just included in all the log messages in the loop looking
> at whether objects are missing.

OK, maybe the "also" can be removed to reduce potential confusion?

>> Secondly, if the local copy is current, and we have no fewer than
>> min_size objects, and recovery is meant to be a background operation,
>> then why is the recovery in the I/O path here? Specifically, why is
>> that the case on a write, where the object is being modified anyway,
>> and the modification then needs to be replicated out to OSDs 17 and 94?
>
> Mainly because recovery pre-dated the concept of min_size. We realized
> this was a problem during luminous development, but did not complete
> the fix for it in time for luminous. Nice analysis of the issue though!

Well, I wasn't quite done with the analysis yet; I just wanted to check
whether my initial interpretation was correct. So, here's what this
behavior causes, if I understand things correctly:

- We have a bunch of objects that need to be recovered onto the
  just-returned OSD(s).
- Clients access some of these objects while they are pending recovery.
- When that happens, recovery of those objects gets reprioritized.
  Simplistically speaking, they get to jump the queue.

Did I get that right?
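Just to make sure we're talking about the same thing, here's a toy
sketch of my mental model in Python. Everything in it (RecoveryQueue,
client_write, recover_one) is made up for illustration and is
emphatically not Ceph code:

from collections import deque

class RecoveryQueue:
    """Toy model, not Ceph code: objects missing on a just-returned OSD
    sit in a background recovery queue, but a client op that touches one
    of them makes it jump the queue, and the op blocks until that object
    has been recovered."""

    def __init__(self, missing_objects):
        self.pending = deque(missing_objects)  # background recovery order
        self.blocked_ops = []                  # client ops waiting on recovery

    def client_write(self, obj, op):
        if obj in self.pending:
            # Degraded object: its recovery is reprioritized and the write
            # waits for it, i.e. recovery ends up in the client I/O path.
            self.pending.remove(obj)
            self.pending.appendleft(obj)
            self.blocked_ops.append((obj, op))
        else:
            op()  # fully recovered object: the write proceeds immediately

    def recover_one(self):
        # One step of background recovery: take the next object off the
        # queue (in real life, push it to the peer OSDs) and release any
        # client ops that were blocked on it.
        if not self.pending:
            return
        obj = self.pending.popleft()
        for blocked_obj, op in list(self.blocked_ops):
            if blocked_obj == obj:
                op()
        self.blocked_ops = [b for b in self.blocked_ops if b[0] != obj]

If most of the pending objects see a client_write() before
recover_one() gets around to them, practically everything ends up at
the "front" of the queue, which is where I'm going below.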
If so, let's zoom out a bit now and look at RBD's most frequent use
case, virtualization. While the OSDs were down, the RADOS objects that
were created or modified would have come from whatever virtual machines
were running at that time. When the OSDs return, there's a very good
chance that those same VMs are still running. While they're running,
they of course continue to access the same RBDs, and are quite likely
to access the same *data* as before on those RBDs — data that now needs
to be recovered.

So that means that there is likely a solid majority of to-be-recovered
RADOS objects that needs to be moved to the front of the queue at some
point during the recovery. Which, in the extreme, renders the
prioritization useless: if I have, say, 1,000 objects that need to be
recovered but 998 have been moved to the "front" of the queue, the
queue is rather meaningless.

Again, on the assumption that this correctly describes what Ceph
currently does, do you have suggestions for how to mitigate this? It
seems to me that the only actual remedy for this issue in
Jewel/Luminous would be to not access objects pending recovery, but as
just pointed out, that's a rather unrealistic goal.

> I'm working on the fix (aka async recovery) for mimic. This won't be
> backportable unfortunately.

OK — is there any more information on this that is available and
current? A quick search turned up a Trello card
(https://trello.com/c/jlJL5fPR/199-osd-async-recovery), a mailing list
post (https://www.spinics.net/lists/ceph-users/msg37127.html), a slide
deck (https://www.slideshare.net/jupiturliu/ceph-recovery-improvement-v02),
a stale PR (https://github.com/ceph/ceph/pull/11918), and an inactive
branch (https://github.com/jdurgin/ceph/commits/wip-async-recovery),
but I was hoping for something a little more detailed. Thanks in
advance for any additional insight you can share here!

Cheers,
Florian