> On Sep 1, 2015, at 16:13, Simon Hallam <sha@xxxxxxxxx> wrote:
>
> Hi Greg, Zheng,
>
> Is this fixed in a later version of the kernel client? Or would it be wise for us to start using the fuse client?
>
> Cheers,

I just wrote a fix: https://github.com/ceph/ceph-client/commit/33b68dde7f27927a7cb1a7691e3c5b6f847ffd14

Yes, you should try ceph-fuse if this bug causes problems for you.

Regards
Yan, Zheng

>
> Simon
>
>> -----Original Message-----
>> From: Gregory Farnum [mailto:gfarnum@xxxxxxxxxx]
>> Sent: 31 August 2015 13:02
>> To: Yan, Zheng
>> Cc: Simon Hallam; Zheng Yan; ceph-users@xxxxxxxxxxxxxx
>> Subject: Re: Testing CephFS
>>
>> On Mon, Aug 31, 2015 at 12:16 PM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>>> On Mon, Aug 24, 2015 at 6:38 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>>>> On Mon, Aug 24, 2015 at 11:35 AM, Simon Hallam <sha@xxxxxxxxx> wrote:
>>>>> Hi Greg,
>>>>>
>>>>> The MDSs detected that the other one went down and started the replay.
>>>>>
>>>>> I did some further testing with 20 client machines. Of the 20 client machines, 5 hung with the following error:
>>>>>
>>>>> [Aug24 10:53] ceph: mds0 caps stale
>>>>> [Aug24 10:54] ceph: mds0 caps stale
>>>>> [Aug24 10:58] ceph: mds0 hung
>>>>> [Aug24 11:03] ceph: mds0 came back
>>>>> [ +8.803334] libceph: mon2 10.15.0.3:6789 socket closed (con state OPEN)
>>>>> [ +0.000018] libceph: mon2 10.15.0.3:6789 session lost, hunting for new mon
>>>>> [Aug24 11:04] ceph: mds0 reconnect start
>>>>> [ +0.084938] libceph: mon2 10.15.0.3:6789 session established
>>>>> [ +0.008475] ceph: mds0 reconnect denied
>>>>
>>>> Oh, this might be a kernel bug, failing to ask for mdsmap updates when
>>>> the connection goes away. Zheng, does that sound familiar?
>>>> -Greg
>>>>
>>>
>>> I reproduced this locally (use SIGSTOP to stop the monitor).
>>> I think the root cause is that the kernel client does not implement
>>> CEPH_FEATURE_MSGR_KEEPALIVE2. So the kernel client couldn't reliably
>>> detect the event that the network cable got unplugged. It kept waiting
>>> for new events from the disconnected connection.
>>
>> Yeah, the userspace client maintains an ongoing MDSMap subscription
>> from the monitors in order to hear about this. It puts more load on
>> the monitors but right now that's the solution we're going with: the
>> monitor times out the MDS, publishes a series of new maps (pushed to
>> the clients) in order to activate a standby, and the clients see that
>> they need to connect to the new MDS instance.
>> -Greg
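
For readers following the thread, the failure mode described above (a client that keeps waiting on a connection whose peer has silently gone away) can be illustrated with a rough, transport-free sketch. This is not the kernel client's code and not the fix in the commit linked above; the names KeepaliveTracker and simulate, and the timing values, are hypothetical and chosen only for illustration. The idea is that the client records when the peer last acknowledged a keepalive and gives up on the connection once no ack has arrived within a timeout.

```python
from dataclasses import dataclass


@dataclass
class KeepaliveTracker:
    """Tracks keepalive acks for one connection (hypothetical helper)."""

    timeout: float         # seconds without an ack before the peer is presumed dead
    last_ack: float = 0.0  # time of the most recent ack (or of connection setup)

    def on_ack(self, now: float) -> None:
        # The peer acknowledged a keepalive, so it was alive at `now`.
        self.last_ack = now

    def is_stale(self, now: float) -> bool:
        # No ack within `timeout` seconds: stop waiting on this connection.
        return now - self.last_ack > self.timeout


def simulate(ack_times, check_times, timeout):
    """Return the first check time at which the connection is declared dead,
    or None if acks keep arriving in time."""
    tracker = KeepaliveTracker(timeout=timeout)
    acks = sorted(ack_times)
    i = 0
    for now in sorted(check_times):
        # Deliver every ack that arrived up to this check.
        while i < len(acks) and acks[i] <= now:
            tracker.on_ack(acks[i])
            i += 1
        if tracker.is_stale(now):
            return now
    return None
```

For example, with acks arriving at t=10, 20 and 30 and then stopping (as if the monitor had been paused with SIGSTOP), checks every 10 seconds and a 30-second timeout, simulate([10, 20, 30], range(10, 101, 10), 30) reports the connection dead at the first check more than 30 seconds after the last ack. Without such a check the client has no signal that the peer is gone, which matches the hang seen in the logs.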