Re: Clarification on sequence of recovery and client ops after OSDs rejoin cluster (also, slow requests)

On Fri, Sep 15, 2017 at 8:58 AM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
>> OK, maybe the "also" can be removed to reduce potential confusion?
>
>
> Sure

That'd be great. :)

>> - We have a bunch of objects that need to be recovered onto the
>> just-returned OSD(s).
>> - Clients access some of these objects while they are pending recovery.
>> - When that happens, recovery of those objects gets reprioritized.
>> Simplistically speaking, they get to jump the queue.
>>
>> Did I get that right?
>
>
> Yes
>
>> If so, let's zoom out a bit now and look at RBD's most frequent use
>> case, virtualization. While the OSDs were down, the RADOS objects that
>> were created or modified would have come from whatever virtual
>> machines were running at that time. When the OSDs return, there's a
>> very good chance that those same VMs are still running. While they're
>> running, they of course continue to access the same RBDs, and are
>> quite likely to access the same *data* as before on those RBDs — data
>> that now needs to be recovered.
>>
>> So that means that there is likely a solid majority of to-be-recovered
>> RADOS objects that needs to be moved to the front of the queue at some
>> point during the recovery. Which, in the extreme, renders the
>> prioritization useless: if I have, say, 1,000 objects that need to be
>> recovered but 998 have been moved to the "front" of the queue, the
>> queue is rather meaningless.
>
>
> This is more of an issue with write-intensive RGW buckets, since the
> bucket index object is a single bottleneck if it needs recovery, and
> all further writes to a shard of a bucket index will be blocked on that
> bucket index object.

Well, yes, the impact may be even worse on RGW, but you do agree that
the problem exists for RBD too, correct? (The hard evidence points
that way.)

>> Again, on the assumption that this correctly describes what Ceph
>> currently does, do you have suggestions for how to mitigate this? It
>> seems to me that the only actual remedy for this issue in
>> Jewel/Luminous would be to not access objects pending recovery, but as
>> just pointed out, that's a rather unrealistic goal.
>
>
> In luminous you can force the osds to backfill (which does not block
> I/O) instead of using log-based recovery. This requires scanning
> the disk to see which objects are missing, instead of looking at the pg
> log, so it will take longer to recover. This is feasible for all-SSD
> setups, but with pure HDDs it may be prohibitively slower, depending
> on how willing you are to trade off durability for availability.
>
> You can do this by setting:
>
> osd pg log min entries = 1
> osd pg log max entries = 2
>
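
Good to know, thanks. To make sure I'd apply that correctly: I assume
those two options go into the [osd] section of ceph.conf and need the
OSDs restarted (or the equivalent injectargs incantation) to take
effect, i.e. something like

    [osd]
    osd pg log min entries = 1
    osd pg log max entries = 2

Please correct me if that assumption is off.
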
>>> I'm working on the fix (aka async recovery) for mimic. This won't be
>>> backportable unfortunately.
>>
>>
>> OK — is there any more information on this that is available and
>> current? A quick search turned up a Trello card
>> (https://trello.com/c/jlJL5fPR/199-osd-async-recovery), a mailing list
>> post (https://www.spinics.net/lists/ceph-users/msg37127.html), a slide
>> deck
>> (https://www.slideshare.net/jupiturliu/ceph-recovery-improvement-v02),
>> a stale PR (https://github.com/ceph/ceph/pull/11918), and an inactive
>> branch (https://github.com/jdurgin/ceph/commits/wip-async-recovery),
>> but I was hoping for something a little more detailed. Thanks in
>> advance for any additional insight you can share here!
>
>
> There's a description of the idea here:
>
> https://github.com/jdurgin/ceph/commit/15c4c7134d32f2619821f891ec8b8e598e786b92

Thanks! That raises another question:

"Until now, this recovery process was synchronous - it blocked writes
to an object until it was recovered."

So this affects only writes. In that case I'm really not following the
reasoning behind the current behavior: why wait for the recovery of an
object that you're about to clobber anyway? Naïvely, an object like
that looks like a candidate for *eviction* from the recovery queue,
not promotion to a higher priority. Is this because the write could be
a partial write, whereas recovery needs to cover the full object?

All of this comes with the disclaimer that I have no detailed
knowledge of the internals, so it is pure handwaving, but wouldn't a
more logical sequence of events look roughly like this:

1. Are all replicas of the object available? If so, goto 4.
2. Is the write a full object write? If so, goto 4.
3. Read the local copy of the object, splice in the partial write,
making it a full object write.
4. Evict the object from the recovery queue.
5. Replicate the write.

Forgive the silly use of goto; I'm wary of email clients mangling
indentation if I were to write this as a nested if block. :)
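
Or, at the risk of the very whitespace mangling I just complained
about, here is the same control flow as a Python-flavoured sketch.
This is purely illustrative pseudocode on my part --
all_replicas_available(), read_local_copy(), splice(), replicate() and
the recovery_queue are made-up names standing in for whatever the OSD
actually does internally:

    def handle_client_write(obj, write):
        # 1 + 2: if some replica is still missing the object and the
        # write is only partial, turn it into a full-object write (3)
        # by splicing the new data into the local copy.
        if not all_replicas_available(obj) and not write.is_full_object():
            write = splice(read_local_copy(obj), write)
        # 4: the object no longer needs separate recovery; drop it
        # from the recovery queue (a no-op if it was never queued).
        recovery_queue.evict(obj)
        # 5: replicate the (now full-object) write as usual.
        replicate(obj, write)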

Again, thanks for the continued insight!

Cheers,
Florian
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



