On Mon, Aug 31, 2015 at 12:16 PM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> On Mon, Aug 24, 2015 at 6:38 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>> On Mon, Aug 24, 2015 at 11:35 AM, Simon Hallam <sha@xxxxxxxxx> wrote:
>>> Hi Greg,
>>>
>>> The MDSs detected that the other one went down and started the replay.
>>>
>>> I did some further testing with 20 client machines. Of the 20, 5 hung
>>> with the following errors:
>>>
>>> [Aug24 10:53] ceph: mds0 caps stale
>>> [Aug24 10:54] ceph: mds0 caps stale
>>> [Aug24 10:58] ceph: mds0 hung
>>> [Aug24 11:03] ceph: mds0 came back
>>> [ +8.803334] libceph: mon2 10.15.0.3:6789 socket closed (con state OPEN)
>>> [ +0.000018] libceph: mon2 10.15.0.3:6789 session lost, hunting for new mon
>>> [Aug24 11:04] ceph: mds0 reconnect start
>>> [ +0.084938] libceph: mon2 10.15.0.3:6789 session established
>>> [ +0.008475] ceph: mds0 reconnect denied
>>
>> Oh, this might be a kernel bug, failing to ask for mdsmap updates when
>> the connection goes away. Zheng, does that sound familiar?
>> -Greg
>>
>
> I reproduced this locally (using SIGSTOP to stop the monitor). I think
> the root cause is that the kernel client does not implement
> CEPH_FEATURE_MSGR_KEEPALIVE2, so it could not reliably detect that the
> network cable had been unplugged. It kept waiting for new events from
> the disconnected connection.

Yeah, the userspace client maintains an ongoing MDSMap subscription with
the monitors in order to hear about this. It puts more load on the
monitors, but right now that's the solution we're going with: the
monitor times out the MDS, publishes a series of new maps (pushed to the
clients) in order to activate a standby, and the clients see that they
need to connect to the new MDS instance.
-Greg
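
For readers unfamiliar with KEEPALIVE2: the feature adds a timestamped
keepalive/ack exchange on each messenger connection, so a peer that stops
acking can be declared dead even though the TCP socket still looks open
(as it does when a cable is pulled or a daemon is SIGSTOPped). The sketch
below illustrates that general pattern in Python; the class and method
names, intervals, and wire format are illustrative assumptions, not the
actual Ceph messenger code.

    # Sketch of the "keepalive with an acked timestamp" pattern that
    # CEPH_FEATURE_MSGR_KEEPALIVE2 enables. Hypothetical names throughout.
    import time

    KEEPALIVE_INTERVAL = 10.0   # seconds between probes (assumed value)
    KEEPALIVE_TIMEOUT = 30.0    # give up after this long without an ack

    class Connection:
        def __init__(self, sock):
            self.sock = sock
            self.last_ack = time.monotonic()

        def send_keepalive(self):
            # The probe carries a timestamp; the peer echoes it back in an ack.
            stamp = time.monotonic()
            self.sock.send(("KEEPALIVE2 %.6f\n" % stamp).encode())

        def handle_keepalive_ack(self, stamp):
            # Called by the read path when the peer's ack arrives.
            self.last_ack = time.monotonic()

        def is_dead(self):
            # Without acked probes, an unplugged cable leaves the socket
            # "open" and the client just keeps waiting for events.
            return time.monotonic() - self.last_ack > KEEPALIVE_TIMEOUT

    def keepalive_loop(conn):
        while not conn.is_dead():
            conn.send_keepalive()
            time.sleep(KEEPALIVE_INTERVAL)
        # Peer is presumed dead: drop the session and fetch fresh maps
        # (e.g. the latest MDSMap) instead of waiting on the stale connection.
        conn.sock.close()

Without an acked probe like this, the client only notices the failure when
the TCP stack finally gives up, which can take long enough that it misses
the new MDS's reconnect window, consistent with the "mds0 hung" followed by
"reconnect denied" in the log above.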