On Mon, Nov 10, 2014 at 2:21 PM, Jason <jasons@xxxxxxxxxx> wrote:
> I have searched the list archives and have seen a couple of references
> to this question, but no real solution, unfortunately...
>
> We are running multiple ceph clusters, pretty much as media appliances.
> As such, the number of nodes is variable, and all of the nodes are
> symmetric (i.e. same CPU power, memory, disk space). As a result, we
> are running a monitor and an OSD (connected to an SSD RAID) on each of
> the systems. The number of nodes is typically small, on the order of
> five to a dozen. As the node count gets higher, we plan not to run
> monitors on all nodes.
>
> Our pools are typically set up with a replication size of 2 or 3, with
> a min_size of 1.
>
> The problem occurs when a single node goes down, such that its monitor
> and OSD stop at once. For a client (especially a writer) on another
> node, there is a pretty consistent 20-second delay until further
> operations go through. This is a delay that we cannot easily survive.
>
> If I first bring down the OSD, then wait a few seconds, and then bring
> down the monitor, the system behaves with only a few seconds of delay.
> However, we can't always guarantee a graceful shutdown (such as when a
> node is rebooted, loses network connectivity, or loses power).
>
> Note that I get exactly the same behavior if I stop an OSD on one
> system while stopping a monitor on another...
>
> Previous discussions similar to this have touched upon the "osd
> heartbeat grace" setting, which is conspicuously set to 20 seconds. I
> have tried changing it, along with other related settings, to no
> avail -- whatever I do, the delay remains at 20 seconds.

It sounds like you may also have clients that are taking a while to
decide that the monitor they were connected to is dead and to pick
another one. You probably need to lower the "mon client ping timeout"
and "mon client ping interval" settings from their defaults of 30 and
10 seconds, respectively.

(You'll also want to run your experiments while watching ceph -w or
similar, so you can see when the system detects the failure, how long
it takes to compensate, and how long clients take after that to
complete any blocked writes.)

-Greg
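
For reference, a minimal ceph.conf sketch of the kind of tuning
discussed above. The option names and defaults are those mentioned in
the thread; the specific values are illustrative assumptions only, and
shortening any failure-detection interval trades faster failover for a
higher chance of flapping on a busy or briefly partitioned node:

    [global]
        # How often a client pings its monitor (default 10 seconds) and
        # how long it waits for a reply before treating that monitor as
        # dead and hunting for another one (default 30 seconds).
        # Example values only.
        mon client ping interval = 3
        mon client ping timeout = 10

    [osd]
        # The "osd heartbeat grace" setting from the original question:
        # how long peer OSDs go without a heartbeat before reporting an
        # OSD down to the monitors (default 20 seconds). Example value
        # only.
        osd heartbeat grace = 10

Watching ceph -w while abruptly killing a node, as suggested above,
should then show how much sooner the dead OSD and monitor are detected
and how long clients still block afterwards.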