Re: Clarification on sequence of recovery and client ops after OSDs rejoin cluster (also, slow requests)

On 09/15/2017 01:57 AM, Florian Haas wrote:
On Fri, Sep 15, 2017 at 8:58 AM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
This is more of an issue with write-intensive RGW buckets, since the
bucket index object is a single bottleneck if it needs recovery, and
all further writes to a shard of a bucket index will be blocked on that
bucket index object.
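
To make that concrete, here is a toy sketch of why the index is a single hot object. The hashing below is invented, not RGW's actual shard selection, and the ".dir.<bucket_id>.<shard>" naming merely resembles the real index object naming pattern; the point is only that every write into a bucket also updates exactly one index shard object, so all keys mapping to a degraded shard object queue up behind its recovery:

# Toy illustration of the bucket index bottleneck. The hashing is made
# up -- it is not RGW's actual shard selection -- and the object names
# only resemble RGW's ".dir.<bucket_id>.<shard>" pattern.
import hashlib

def index_shard_object(bucket_id: str, key: str, num_shards: int) -> str:
    """Map an object key to the bucket index shard object its write updates."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return '.dir.{}.{}'.format(bucket_id, h % num_shards)

# With few shards, many writes funnel through the same index object.
# If that object needs recovery, every one of these writes blocks on it.
for key in ('img_001', 'img_002', 'img_003', 'img_004'):
    print(key, '->', index_shard_object('default.1234.5', key, num_shards=2))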

Well, yes, the impact may be even worse on RGW, but you do
agree that the problem does exist for RBD too, correct? (The hard
evidence points to that.)

Yes, of course it still exists for RBD or other uses.

<snip>

There's a description of the idea here:

https://github.com/jdurgin/ceph/commit/15c4c7134d32f2619821f891ec8b8e598e786b92

Thanks! That raises another question:

"Until now, this recovery process was synchronous - it blocked writes
to an object until it was recovered."

So this affects just writes. Then I'm really not following the
reasoning behind the current behavior. Why would you want to wait for
the recovery of an object that you're about to clobber anyway? Naïvely
thinking, an object like that looks like a candidate for
*eviction* from the recovery queue, not promotion to a higher
priority. Is this because the write could be a partial write, whereas
recovery would need to cover the full object?

Generally, most writes are partial writes; for RBD that's almost always
the case, and writes are often just 512 bytes or 4 KB. It's also true
for e.g. RGW bucket index updates (adding an omap key/value pair).
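
To make the scale concrete, here is a minimal python-rados sketch (the pool and object names are made up, and it assumes a reachable cluster with a standard ceph.conf). A 512-byte write at an offset and a single omap update each touch a tiny fraction of an object, yet with synchronous recovery either op blocks until the whole target object is recovered:

# Minimal python-rados sketch: a small partial write and an omap update.
# Pool/object names are hypothetical; assumes /etc/ceph/ceph.conf and a
# reachable cluster. Under synchronous recovery, either op would block
# if its target object were still being recovered.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')  # hypothetical pool name

    # Partial write: 512 bytes at a 4 KB offset -- a tiny fraction of a
    # typical 4 MB RBD object, but it still waits on full-object recovery.
    ioctx.write('rbd_data.deadbeef.0000000000000001', b'\x00' * 512, 4096)

    # Bucket-index-style update: set a single omap key/value pair.
    with rados.WriteOpCtx() as op:
        ioctx.set_omap(op, ('obj_key',), ('obj_metadata',))
        ioctx.operate_write_op(op, '.dir.default.1234.5.0')

    ioctx.close()
finally:
    cluster.shutdown()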

This is all under the disclaimer that I have no detailed
knowledge of the internals, so this is all handwaving, but would a more
logical sequence of events not look roughly like this:

1. Are all replicas of the object available? If so, goto 4.
2. Is the write a full object write? If so, goto 4.
3. Read the local copy of the object, splice in the partial write,
making it a full object write.
4. Evict the object from the recovery queue.
5. Replicate the write.

Forgive the silly use of goto; I'm wary of email clients mangling
indentation if I were to write this as a nested if block. :)
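
Rendered as structured code, purely to pin down the handwaving above (every name here is invented, the "replicas" are plain dicts, and replication is simplified to pushing the full object; this is emphatically not Ceph's actual write path):

# Structured rendering of the goto sequence above. All names are
# stand-ins: objects are byte strings, each replica is a dict of
# oid -> data, and the recovery queue is a set of oids.

def splice(full_copy: bytes, offset: int, data: bytes) -> bytes:
    """Step 3: turn a partial write into a full-object write."""
    return full_copy[:offset] + data + full_copy[offset + len(data):]

def handle_write(oid, offset, data, local_store, replicas, recovery_queue,
                 object_size):
    full_write = (offset == 0 and len(data) == object_size)
    # 1. Are all replicas of the object available?
    degraded = any(oid not in osd for osd in replicas)
    # 2./3. If degraded and the write is partial, promote it to a
    # full-object write by splicing it into the primary's local copy.
    if degraded and not full_write:
        data, offset = splice(local_store[oid], offset, data), 0
    # 4. The old replicas are obsolete now; drop the object from the
    # recovery queue instead of waiting on it.
    recovery_queue.discard(oid)
    # 5. Replicate the (possibly promoted) write; simplified here to
    # applying it locally and pushing the resulting full object.
    local_store[oid] = splice(local_store.get(oid, b'\x00' * object_size),
                              offset, data)
    for osd in replicas:
        osd[oid] = local_store[oid]

# Tiny usage example: one degraded replica, a 4-byte partial write.
primary = {'obj1': b'AAAAAAAA'}
replicas = [{'obj1': b'AAAAAAAA'}, {}]    # second OSD is missing obj1
queue = {'obj1'}
handle_write('obj1', 2, b'BBBB', primary, replicas, queue, object_size=8)
print(primary['obj1'], queue)             # b'AABBBBAA' set()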

This might be a useful optimization in some cases, but it would be
rather complex to add to the recovery code. It may be worth considering
at some point - same with deletes or other cases where the previous data
is not needed.

Josh
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



