Re: Clarification on sequence of recovery and client ops after OSDs rejoin cluster (also, slow requests)

(Sorry for top-posting; this email client isn't great at editing)


The mitigation strategy I mentioned before, forcing backfill, could be backported to jewel, but I don't think it's a very good option for RBD users without SSDs.


In luminous there is a command (something like 'ceph pg force-recovery') that you can use to prioritize recovery of particular PGs (and thus, with some scripting, particular rbd images). This would at least let you limit the scope of affected images. A couple of folks from OVH added it for just this purpose.
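
Roughly, the scripting I have in mind would look something like the Python below. Treat it as an untested sketch rather than a ready-made tool: it assumes the Luminous spelling 'ceph pg force-recovery' and that 'ceph osd map ... --format json' reports a 'pgid' field, so double-check both against your version.

#!/usr/bin/env python
# Untested sketch: find the PGs backing a single RBD image and ask the
# cluster to recover them first. The command spellings and the 'pgid'
# JSON field are assumptions based on Luminous; verify on your cluster.
import json
import subprocess
import sys

def sh(*cmd):
    return subprocess.check_output(cmd).decode()

pool, image = sys.argv[1], sys.argv[2]

# 'rbd info' reports the object name prefix (block_name_prefix) for the image.
info = json.loads(sh('rbd', 'info', '--format', 'json',
                     '%s/%s' % (pool, image)))
prefix = info['block_name_prefix']

# Map every data object of the image to its PG. Note that this lists the
# whole pool, which can take a while on a big cluster.
pgs = set()
for obj in sh('rados', '-p', pool, 'ls').splitlines():
    if obj.startswith(prefix):
        m = json.loads(sh('ceph', 'osd', 'map', pool, obj,
                          '--format', 'json'))
        pgs.add(m['pgid'])

# Push those PGs to the front of the recovery queue.
if pgs:
    sh('ceph', 'pg', 'force-recovery', *sorted(pgs))
    print('requested force-recovery for %d PGs' % len(pgs))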


Neither of these is an ideal workaround, but I haven't thought of a better one for existing versions.


Josh


Sent from Nine

From: Florian Haas <florian@xxxxxxxxxxx>
Sent: Sep 15, 2017 3:43 PM
To: Josh Durgin
Cc: ceph-users@xxxxxxxxxxxxxx; Christian Theune
Subject: Re: [ceph-users] Clarification on sequence of recovery and client ops after OSDs rejoin cluster (also, slow requests)

On Fri, Sep 15, 2017 at 10:37 PM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
>> So this affects just writes. Then I'm really not following the
>> reasoning behind the current behavior. Why would you want to wait for
>> the recovery of an object that you're about to clobber anyway?
>> Thinking naïvely, an object like that would look like a candidate for
>> *eviction* from the recovery queue, not promotion to a higher
>> priority. Is this because the write could be a partial write, whereas
>> recovery would need to cover the full object?
>
>
> Generally, most writes are partial writes - for RBD that's almost always
> the case - often writes are 512B or 4KB. It's also true for e.g. RGW
> bucket index updates (adding an omap key/value pair).

Sure, makes sense.

>> This is all under the disclaimer that I have no detailed
>> knowledge of the internals so this is all handwaving, but would a more
>> logical sequence of events not look roughly like this:
>>
>> 1. Are all replicas of the object available? If so, goto 4.
>> 2. Is the write a full object write? If so, goto 4.
>> 3. Read the local copy of the object, splice in the partial write,
>> making it a full object write.
>> 4. Evict the object from the recovery queue.
>> 5. Replicate the write.
>>
>> Forgive the silly use of goto; I'm wary of email clients mangling
>> indentation if I were to write this as a nested if block. :)
>
>
> This might be a useful optimization in some cases, but it would be
> rather complex to add to the recovery code. It may be worth considering
> at some point - same with deletes or other cases where the previous data
> is not needed.

Uh, yeah, waiting for an object to recover just so you can then delete
it, and blocking the delete I/O in the process, also seems rather
strange.
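
Just to restate the sequence I proposed above in code form, here is a
purely illustrative Python sketch. Every name in it is invented for
the example, and it is emphatically not a claim about how the actual
OSD write path is structured:

# Illustrative only -- none of these names correspond to Ceph internals.
class DegradedObject(object):
    def __init__(self, local_data, all_replicas_available):
        self.local_data = bytearray(local_data)  # the primary's local copy
        self.all_replicas_available = all_replicas_available

def handle_write(obj, offset, payload, object_size, recovery_queue,
                 replicate):
    full_write = (offset == 0 and len(payload) == object_size)
    if not obj.all_replicas_available and not full_write:
        # Step 3: splice the partial write into the local copy so that it
        # becomes a full-object write and the old contents are irrelevant.
        data = bytearray(obj.local_data)
        data[offset:offset + len(payload)] = payload
        offset, payload = 0, bytes(data)
    # Steps 4 and 5: drop the object from the recovery queue (a no-op if
    # it wasn't queued) and replicate the write.
    recovery_queue.discard(obj)
    replicate(obj, offset, payload)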

I think we do agree that any instance of I/O being blocked upward of
30s in a VM is really, really bad, but the way you describe it, I see
little chance for a Ceph-deploying cloud operator to ever make a
compelling case to their customers that such a thing is unlikely to
happen. And I'm not even sure if a knee-jerk reaction to buy faster
hardware would be a very prudent investment: it's basically all just a
function of (a) how much I/O happens on a cluster during an outage,
and (b) how many nodes/OSDs will be affected by that outage. Neither
is very predictable, and only (b) is something you have any influence
over in a cloud environment. Beyond a certain threshold of either (a)
or (b), the probability of *recovery* slowing a significant number of
VMs to a crawl approaches 1.

For an rgw bucket index pool, that's usually a small enough amount of
data that you can sprinkle a few fast drives throughout your cluster,
create a ruleset with a separate root (pre-Luminous) or with device
classes (Luminous and later; roughly what the sketch below shows), and
then assign that ruleset to the pool. But for RBD storage, that's usually
not an option — not at non-prohibitive cost, anyway.
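
Concretely, the Luminous device-class variant boils down to roughly
the following (a sketch only; the command spellings are as I remember
them from the Luminous docs, and the pool and rule names are just
examples):

#!/usr/bin/env python
# Sketch: pin an rgw bucket index pool to SSD OSDs via device classes
# (Luminous and later). Pool and rule names are examples only.
import subprocess

def ceph(*args):
    subprocess.check_call(('ceph',) + args)

# Replicated CRUSH rule that only picks OSDs of device class 'ssd',
# spreading replicas across hosts under the 'default' root.
ceph('osd', 'crush', 'rule', 'create-replicated',
     'rgw-index-ssd', 'default', 'host', 'ssd')

# Point the bucket index pool at that rule.
ceph('osd', 'pool', 'set', 'default.rgw.buckets.index',
     'crush_rule', 'rgw-index-ssd')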

Can you share your suggested workaround / mitigation strategy for
users that are currently being bitten by this behavior? If async
recovery lands in mimic with no chance of a backport, then it'll be a
while before LTS users get any benefit out of it.

Cheers,
Florian

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
