Re: Clarification on sequence of recovery and client ops after OSDs rejoin cluster (also, slow requests)

(Sorry for top posting, this email client isn't great at editing)


The mitigation strategy I mentioned before, forcing backfill, could be backported to jewel, but I don't think it's a very good option for RBD users without SSDs.


In Luminous there is a command, 'ceph pg force-recovery', that you can use to prioritize recovery of particular PGs (and thus RBD images, with some scripting). This would at least let you limit the scope of affected images. A couple of folks from OVH added it for just this purpose.
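
To sketch what I mean by 'some scripting': something along these lines should do it on Luminous (untested; the pool and image names are placeholders, and the JSON output of 'ceph osd map' may differ slightly between versions):

#!/usr/bin/env python
# Rough sketch (untested): force-recover the PGs backing one RBD image on
# Luminous. Pool/image names are placeholders; assumes the rbd, rados and
# ceph CLIs and admin credentials are available.
import json
import subprocess

POOL = 'rbd'         # placeholder
IMAGE = 'vm-disk-1'  # placeholder

def run(cmd):
    return subprocess.check_output(cmd).decode()

# Object name prefix of the image's data objects (e.g. 'rbd_data.<id>').
info = json.loads(run(['rbd', 'info', '--format', 'json',
                       '%s/%s' % (POOL, IMAGE)]))
prefix = info['block_name_prefix']

# Map each of the image's objects to its PG. (Listing a big pool with
# 'rados ls' is slow; good enough for a one-off.)
pgids = set()
for obj in run(['rados', '-p', POOL, 'ls']).splitlines():
    if obj.startswith(prefix):
        mapping = json.loads(run(['ceph', 'osd', 'map', POOL, obj,
                                  '--format', 'json']))
        pgids.add(mapping['pgid'])

# Bump those PGs to the front of the recovery queue.
if pgids:
    run(['ceph', 'pg', 'force-recovery'] + sorted(pgids))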


Neither of these is an ideal workaround, but I haven't thought of a better one for existing versions.


Josh


Sent from Nine

From: Florian Haas <florian@xxxxxxxxxxx>
Sent: Sep 15, 2017 3:43 PM
To: Josh Durgin
Cc: ceph-users@xxxxxxxxxxxxxx; Christian Theune
Subject: Re: [ceph-users] Clarification on sequence of recovery and client ops after OSDs rejoin cluster (also, slow requests)

On Fri, Sep 15, 2017 at 10:37 PM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
>> So this affects just writes. Then I'm really not following the
>> reasoning behind the current behavior. Why would you want to wait for
>> the recovery of an object that you're about to clobber anyway? Naïvely,
>> an object like that would look like a candidate for *eviction* from
>> the recovery queue, not promotion to a higher
>> priority. Is this because the write could be a partial write, whereas
>> recovery would need to cover the full object?
>
>
> Generally, most writes are partial writes; for RBD that's almost always
> the case, with writes often being just 512 B or 4 KB. It's also true for
> e.g. RGW bucket index updates (adding an omap key/value pair).

Sure, makes sense.
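
To make the distinction concrete, here is roughly what the two cases look like through the librados Python bindings (just a sketch; the conf path, pool and object names are placeholders):

# Minimal sketch of a partial vs. full-object write through the librados
# Python bindings; conf path, pool and object names are placeholders.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')  # placeholder pool name

obj = 'rbd_data.deadbeef.0000000000000001'  # made-up object name

# Partial write: 4 KiB somewhere inside a (typically 4 MiB) object;
# this is what most RBD client I/O looks like.
ioctx.write(obj, b'x' * 4096, offset=8192)

# Full-object write: replaces the whole object, so the previous contents
# would not be needed.
ioctx.write_full(obj, b'y' * (4 * 1024 * 1024))

ioctx.close()
cluster.shutdown()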

>> This is all under the disclaimer that I have no detailed
>> knowledge of the internals so this is all handwaving, but would a more
>> logical sequence of events not look roughly like this:
>>
>> 1. Are all replicas of the object available? If so, goto 4.
>> 2. Is the write a full object write? If so, goto 4.
>> 3. Read the local copy of the object, splice in the partial write,
>> making it a full object write.
>> 4. Evict the object from the recovery queue.
>> 5. Replicate the write.
>>
>> Forgive the silly use of goto; I'm wary of email clients mangling
>> indentation if I were to write this as a nested if block. :)
>
>
> This might be a useful optimization in some cases, but it would be
> rather complex to add to the recovery code. It may be worth considering
> at some point - same with deletes or other cases where the previous data
> is not needed.

Uh, yeah, waiting for an object to recover just so you can then delete
it, and blocking the delete I/O in the process, does also seem rather
strange.
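
Just to spell out the sequence I proposed above as a toy model (purely illustrative, not actual OSD code; all the names are invented):

# Purely illustrative toy model of the sequence proposed above; nothing
# here is actual OSD code and all the names are invented.
OBJECT_SIZE = 4 * 1024 * 1024

def handle_write(local_copy, data, offset, all_replicas_present,
                 recovery_queue, key):
    full_write = (offset == 0 and len(data) == OBJECT_SIZE)
    if not all_replicas_present and not full_write:
        # Step 3: splice the partial write into the local copy, turning
        # it into a full-object write.
        data = local_copy[:offset] + data + local_copy[offset + len(data):]
        offset = 0
    # Step 4: this object no longer has to be recovered before the write.
    recovery_queue.discard(key)
    # Step 5: replicate (here: just hand back what would be sent).
    return offset, data

# Toy usage: a 4 KiB write at offset 8192 into a degraded object.
queue = {'obj-1', 'obj-2'}
off, payload = handle_write(bytearray(OBJECT_SIZE), b'x' * 4096, 8192,
                            all_replicas_present=False,
                            recovery_queue=queue, key='obj-1')
assert off == 0 and len(payload) == OBJECT_SIZE and 'obj-1' not in queue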

I think we do agree that any instance of I/O being blocked upward of
30s in a VM is really really bad, but the way you describe it, I see
little chance for a Ceph-deploying cloud operator to ever make a
compelling case to their customers that such a thing is unlikely to
happen. And I'm not even sure if a knee-jerk reaction to buy faster
hardware would be a very prudent investment: it's basically all just a
function of (a) how much I/O happens on the cluster during an outage and
(b) how many nodes/OSDs are affected by that outage. Neither is very
predictable, and only (b) is something you have any influence over in
a cloud environment. Beyond a certain threshold of either (a) or (b),
the probability of *recovery* slowing a significant number of VMs to a
crawl approaches 1.

For an RGW bucket index pool, the amount of data is usually small
enough that you can sprinkle a few fast drives throughout your cluster,
create a ruleset with a separate root (pre-Luminous) or one using
device classes (Luminous and later), and then assign that ruleset to
the pool. But for RBD storage that's usually not an option, at least
not without prohibitive cost.
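
For the archives, on Luminous the device-class variant boils down to roughly the following; the rule and pool names are placeholders:

# Sketch: pin an RGW bucket index pool to SSD-class OSDs on Luminous.
# Rule and pool names are placeholders; needs ceph admin credentials.
import subprocess

def ceph(*args):
    subprocess.check_call(('ceph',) + args)

# Replicated CRUSH rule restricted to the 'ssd' device class, with host
# as the failure domain.
ceph('osd', 'crush', 'rule', 'create-replicated',
     'rgw-index-ssd', 'default', 'host', 'ssd')

# Point the bucket index pool at that rule; the data then migrates to
# the SSD OSDs.
ceph('osd', 'pool', 'set', 'default.rgw.buckets.index',
     'crush_rule', 'rgw-index-ssd')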

Can you share your suggested workaround / mitigation strategy for
users that are currently being bitten by this behavior? If async
recovery lands in mimic with no chance of a backport, then it'll be a
while before LTS users get any benefit out of it.

Cheers,
Florian

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
