Re: Clarification on sequence of recovery and client ops after OSDs rejoin cluster (also, slow requests)

On Fri, Sep 15, 2017 at 10:37 PM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
>> So this affects just writes. Then I'm really not following the
>> reasoning behind the current behavior. Why would you want to wait for
>> the recovery of an object that you're about to clobber anyway? Naïvely
thinking, an object like that would look like a candidate for
>> *eviction* from the recovery queue, not promotion to a higher
>> priority. Is this because the write could be a partial write, whereas
>> recovery would need to cover the full object?
>
>
> Generally most writes are partial writes - for RBD that's almost always
> the case - often writes are 512b or 4kb. It's also true for e.g. RGW
> bucket index updates (adding an omap key/value pair).

Sure, makes sense.

>> This is all under the disclaimer that I have no detailed
>> knowledge of the internals so this is all handwaving, but would a more
>> logical sequence of events not look roughly like this:
>>
>> 1. Are all replicas of the object available? If so, goto 4.
>> 2. Is the write a full object write? If so, goto 4.
>> 3. Read the local copy of the object, splice in the partial write,
>> making it a full object write.
>> 4. Evict the object from the recovery queue.
>> 5. Replicate the write.
>>
>> Forgive the silly use of goto; I'm wary of email clients mangling
>> indentation if I were to write this as a nested if block. :)
>
>
> This might be a useful optimization in some cases, but it would be
> rather complex to add to the recovery code. It may be worth considering
> at some point - same with deletes or other cases where the previous data
> is not needed.

Uh, yeah, waiting for an object to recover just so you can then delete
it, and blocking the delete I/O in the process, does also seem rather
strange.
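
Incidentally, here's the nested-if version of the sketch I quoted above,
with deletes thrown in, as a self-contained Python toy. Purely
illustrative, of course: every name and data structure in it is made up,
and it has nothing to do with how the actual recovery code is organized.

# Toy model of the sequence sketched above. Objects are fixed-size byte
# strings, replicas are plain dicts, and the "recovery queue" is just a
# set of object names. None of this is actual Ceph code.
OBJ_SIZE = 4096

def apply_write(existing, offset, data):
    buf = bytearray(existing or b"\0" * OBJ_SIZE)
    buf[offset:offset + len(data)] = data
    return bytes(buf)

def handle_write(name, offset, data, local, replicas, recovery_queue):
    all_present = all(name in r for r in replicas)        # 1. replicas ok?
    full_write = offset == 0 and len(data) == OBJ_SIZE    # 2. full object?
    if not all_present and not full_write:
        # 3. splice the partial write into the local copy, turning it into
        #    a full-object write, so recovering the old data becomes moot.
        data = apply_write(local.get(name), offset, data)
        offset = 0
    recovery_queue.discard(name)                          # 4. evict
    local[name] = apply_write(local.get(name), offset, data)  # 5. replicate
    for r in replicas:
        r[name] = apply_write(r.get(name), offset, data)

def handle_delete(name, local, replicas, recovery_queue):
    # No point recovering data we're about to throw away.
    recovery_queue.discard(name)
    local.pop(name, None)
    for r in replicas:
        r.pop(name, None)

Whether something like that is actually feasible inside the OSD, given
ordering, op tracking and all the rest, I can't judge; it's just the
mental model I was getting at.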

I think we do agree that any instance of I/O being blocked upward of
30s in a VM is really, really bad, but the way you describe it, I see
little chance for a Ceph-deploying cloud operator to ever make a
compelling case to their customers that such a thing is unlikely to
happen. And I'm not even sure that a knee-jerk reaction to buy faster
hardware would be a very prudent investment: it's basically all just a
function of (a) how much I/O happens on the cluster during an outage,
and (b) how many nodes/OSDs are affected by that outage. Neither is very
predictable, and only (b) is something you have any influence over in
a cloud environment. Beyond a certain threshold of either (a) or (b),
the probability of *recovery* slowing a significant number of VMs to a
crawl approaches 1.
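
To put a (very crude) number on that: say a fraction f of the objects in
the pool is still degraded and a VM touches n distinct objects before
recovery catches up. Even assuming independent, uniformly spread writes,
which is being generous to the cluster, the chance of hitting at least
one blocked write is 1 - (1 - f)^n, and that heads for 1 very quickly:

# Back-of-envelope model only: f = fraction of objects still degraded,
# n = distinct objects a VM writes to before recovery completes.
def p_blocked(f, n):
    return 1 - (1 - f) ** n

for f in (0.01, 0.05, 0.20):
    for n in (100, 1000, 10000):
        print(f"f={f:.2f}  n={n:>5}  P(>=1 blocked write)={p_blocked(f, n):.3f}")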

For an RGW bucket index pool, that's usually a small enough amount of
data that you can sprinkle a few fast drives throughout your cluster,
create a CRUSH rule with a separate root (pre-Luminous) or one that
uses device classes (Luminous and later), and then assign that rule to
the pool. But for RBD storage, that's usually not an option — not at
non-prohibitive cost, anyway.
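
For reference, the device-class variant on Luminous boils down to a
handful of CLI calls; here's a rough sketch driving the ceph CLI from
Python. The OSD ids, class name, rule name and pool name are all
placeholders for whatever your cluster uses, and Luminous usually
assigns the hdd/ssd/nvme classes automatically anyway, so the tagging
step may well be redundant:

#!/usr/bin/env python3
# Rough sketch only; OSD ids, class/rule/pool names are placeholders.
import subprocess

def ceph(*args):
    subprocess.run(("ceph",) + args, check=True)

# Tag the fast drives with a device class (often auto-detected on
# Luminous, in which case this step is unnecessary).
for osd in ("osd.10", "osd.22", "osd.34"):
    ceph("osd", "crush", "set-device-class", "nvme", osd)

# Replicated CRUSH rule restricted to that class, host failure domain.
ceph("osd", "crush", "rule", "create-replicated",
     "rgw-index-fast", "default", "host", "nvme")

# Point the bucket index pool at the new rule (the pool name depends on
# your zone setup).
ceph("osd", "pool", "set", "default.rgw.buckets.index",
     "crush_rule", "rgw-index-fast")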

Can you share your suggested workaround / mitigation strategy for
users that are currently being bitten by this behavior? If async
recovery lands in mimic with no chance of a backport, then it'll be a
while before LTS users get any benefit out of it.

Cheers,
Florian
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



