Re: chaos monkeys

On 10/09/2012 12:16 PM, Gregory Farnum wrote:
<also moved to ceph-devel>
On Tue, Oct 9, 2012 at 9:59 AM, Sam Lang <sam.lang@xxxxxxxxxxx> wrote:
On 10/09/2012 11:46 AM, Gregory Farnum wrote:

On Tue, Oct 9, 2012 at 9:43 AM, Sam Lang <sam.lang@xxxxxxxxxxx> wrote:


Could we add some other chaos monkeys to the network/storage infrastructure
besides ms_inject_socket_failures?  In particular, I would like to add
ms_inject_delay_msg and ms_inject_reorder_msgs.  I think those could
potentially help flush out some bugs, such as:

https://github.com/ceph/ceph/commit/fa66eaa162542ac01752ada91a46051dde060831


You're going to have to explain these more — ordered delivery over a
connection is one of the guarantees that the messaging layer provides,
so that doesn't sound like a configurable we're going to add.


That's true, but there's no guarantee that the source will always send them
in the same order.  The bug I linked above is a good example: the MDS was
sending out two messages, one the open session reply and the other the stale
session async message.  The bug only shows up when the stale message comes
before the open session reply, which is possible in some cases.  The stale
message originates from a timer expiring, and the open session reply is sent
after the journal commit, so the timing (and ordering) of those two messages
can vary based on when the timer thread gets scheduled to execute, how long
the journal commit takes, etc.
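
To make the race concrete, here is a rough Python sketch (not Ceph code). It assumes a single FIFO connection and two independent send paths: a timer that emits the stale message and a journal commit that emits the open session reply. The function name and the time values are hypothetical, chosen only for illustration.

```python
import heapq


def send_order(timer_fires_at, journal_commit_done_at):
    """Return the order in which the two messages enter one FIFO connection.

    Both arguments are hypothetical timestamps (in arbitrary units) at which
    each independent path hands its message to the connection.
    """
    events = [
        (timer_fires_at, "stale"),               # sent when the timer expires
        (journal_commit_done_at, "open_reply"),  # sent after the journal commit
    ]
    heapq.heapify(events)

    wire = []
    while events:
        _, msg = heapq.heappop(events)
        wire.append(msg)  # the connection preserves enqueue order
    return wire


# Fast commit, late timer: the reply goes out first.
print(send_order(timer_fires_at=5.0, journal_commit_done_at=1.0))
# Slow commit, early timer: the stale message goes out first.
print(send_order(timer_fires_at=1.0, journal_commit_done_at=5.0))
```

The connection never reorders anything here; the two orders come entirely from when each path reaches its send call.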

Reordering messages at the destination would act to simulate all the
asynchronous paths like this that exist in our code.

The sending messenger also maintains ordering invariants. The endpoint
(the MDS) might not dispatch them in the same order all the time, but
that's at a different semantic layer and is not something we can
simulate inside the messenger — it requires semantic knowledge of
which messages are okay to reorder. If we just did random reordering
like you're suggesting, absolutely everything would break.
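
As a small illustration of why blind reordering is a problem, the following Python sketch assumes a receiver that expects consecutive sequence numbers on a connection. That receiver, the message tuples, and the function name are all hypothetical and are not modeled on the actual messenger code.

```python
import random


def first_out_of_order(messages, rng=None):
    """Deliver (seq, payload) pairs and report the first ordering violation.

    If rng is given, the delivery order is shuffled first, standing in for
    random reordering at the receiver. Returns the first sequence number
    that arrives out of order, or None if everything arrived in order.
    """
    delivered = list(messages)
    if rng is not None:
        rng.shuffle(delivered)

    expected = 0
    for seq, _payload in delivered:
        if seq != expected:
            return seq  # the in-order assumption is violated here
        expected += 1
    return None


msgs = [(i, f"m{i}") for i in range(20)]
print(first_out_of_order(msgs))                    # in-order delivery
print(first_out_of_order(msgs, random.Random(1)))  # shuffled delivery
```

Any consumer written against the in-order guarantee fails in the same way once delivery is shuffled without knowledge of which messages are independent.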

Putting a delay on the sender would avoid reordering messages that have a semantic dependency, while still allowing delay-caused reordering among those that have none.
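
A minimal Python sketch of that idea follows. It assumes that messages with a dependency are sent sequentially from the same path, and that the injected delay is applied before each send on the sending side. The function name, the path names, and the delay model are hypothetical; this is not an implementation of ms_inject_delay_msg.

```python
import random


def wire_order(paths, max_delay, seed):
    """Simulate sender-side delay injection on one FIFO connection.

    paths maps a path name to a list of (gap, msg) pairs that the path sends
    one after another. Before each send, a random delay in [0, max_delay] is
    added to that path's clock. Returns the messages in wire order.
    """
    rng = random.Random(seed)
    sends = []
    for _name, msgs in paths.items():
        clock = 0.0
        for gap, msg in msgs:
            clock += gap + rng.uniform(0, max_delay)
            # len(sends) breaks ties so that equal timestamps keep send order
            sends.append((clock, len(sends), msg))
    sends.sort()
    return [msg for _, _, msg in sends]


independent = {
    "journal": [(1.0, "open_reply")],
    "timer": [(1.2, "stale")],
}
dependent = {"one_path": [(0.1, "first"), (0.1, "second")]}

for seed in range(3):
    print(wire_order(independent, max_delay=1.0, seed=seed),
          wire_order(dependent, max_delay=1.0, seed=seed))
```

Because a delay on a path also holds back everything that path sends later, "first" always precedes "second", while "open_reply" and "stale" can swap from run to run.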

You're right that reordering at the receiver won't work, but it would be nice to have more concrete examples. The only example I can come up with is the unsafe/safe messages from the MDS to the client. Even in that case it looks like we handle it by throwing away the unsafe message. What other examples exist? Caps issue/revoke?

-sam



