Re: Testing CephFS

"Yan, Zheng" <ukernel@xxxxxxxxx> · Mon, 31 Aug 2015 19:16:51 +0800

On Mon, Aug 24, 2015 at 6:38 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Mon, Aug 24, 2015 at 11:35 AM, Simon  Hallam <sha@xxxxxxxxx> wrote:
>> Hi Greg,
>>
>> The MDSs detected that the other one went down and started the replay.
>>
>> I did some further testing with 20 client machines. Of the 20 client machines, 5 hung with the following error:
>>
>> [Aug24 10:53] ceph: mds0 caps stale
>> [Aug24 10:54] ceph: mds0 caps stale
>> [Aug24 10:58] ceph: mds0 hung
>> [Aug24 11:03] ceph: mds0 came back
>> [  +8.803334] libceph: mon2 10.15.0.3:6789 socket closed (con state OPEN)
>> [  +0.000018] libceph: mon2 10.15.0.3:6789 session lost, hunting for new mon
>> [Aug24 11:04] ceph: mds0 reconnect start
>> [  +0.084938] libceph: mon2 10.15.0.3:6789 session established
>> [  +0.008475] ceph: mds0 reconnect denied
>
> Oh, this might be a kernel bug, failing to ask for mdsmap updates when
> the connection goes away. Zheng, does that sound familiar?
> -Greg
>

I reproduced this locally (using SIGSTOP to stop the monitor). I think
the root cause is that the kernel client does not implement
CEPH_FEATURE_MSGR_KEEPALIVE2, so it cannot reliably detect that the
network cable has been unplugged. It keeps waiting for new events on
the dead connection.
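
To illustrate why an acknowledged keepalive matters, here is a minimal
Python sketch. It is not the kernel client's code: the class name, the
30-second timeout and the fake clock are all hypothetical. The idea is
only that if the peer acknowledges each keepalive, the sender can compare
the time of the last ack against a timeout and declare the connection
dead, instead of waiting forever on a socket that still looks open.

```python
import time


class KeepaliveTracker:
    """Tracks keepalive acks on one connection (illustrative only)."""

    def __init__(self, timeout_s, now=time.monotonic):
        self.timeout_s = timeout_s  # hypothetical value, not a Ceph default
        self._now = now
        self.last_ack = now()

    def on_ack(self):
        # Called whenever the peer acknowledges one of our keepalives.
        self.last_ack = self._now()

    def is_stale(self):
        # True once no ack has arrived for longer than the timeout.
        return self._now() - self.last_ack > self.timeout_s


class FakeClock:
    """Manually advanced clock so the example is deterministic."""

    def __init__(self):
        self.t = 0.0

    def __call__(self):
        return self.t

    def advance(self, seconds):
        self.t += seconds


clock = FakeClock()
conn = KeepaliveTracker(timeout_s=30.0, now=clock)

clock.advance(10)
conn.on_ack()                       # peer is alive and answering
stale_while_acked = conn.is_stale()

clock.advance(31)                   # cable pulled: no more acks arrive
stale_after_silence = conn.is_stale()
```

Without the ack there is nothing to compare against, which would match
the behaviour seen here: the client sits on the old session until
something else wakes it up.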

Regards
Yan, Zheng

>>
>> 10.15.0.3 was the active MDS at the time I unplugged the Ethernet cable.
>>
>>
>> This was the output of ceph -w as I ran the test (I've removed a lot of the pg remapping):
>>
>> 2015-08-24 11:02:39.547529 mon.1 [INF] mon.ceph2 calling new monitor election
>> 2015-08-24 11:02:40.011995 mon.0 [INF] mon.ceph1 calling new monitor election
>> 2015-08-24 11:02:45.245869 mon.0 [INF] mon.ceph1@0 won leader election with quorum 0,1
>> 2015-08-24 11:02:45.257440 mon.0 [INF] HEALTH_WARN; 1 mons down, quorum 0,1 ceph1,ceph2
>> 2015-08-24 11:02:45.535369 mon.0 [INF] monmap e1: 3 mons at {ceph1=10.15.0.1:6789/0,ceph2=10.15.0.2:6789/0,ceph3=10.15.0.3:6789/0}
>> 2015-08-24 11:02:45.535444 mon.0 [INF] pgmap v15803: 8256 pgs: 8256 active+clean; 1248 GB data, 2503 GB used, 193 TB / 196 TB avail; 47 B/s wr, 0 op/s
>> 2015-08-24 11:02:45.535541 mon.0 [INF] mdsmap e38: 1/1/1 up {0=ceph3=up:active}, 2 up:standby
>> 2015-08-24 11:02:45.535629 mon.0 [INF] osdmap e197: 36 osds: 36 up, 36 in
>> 2015-08-24 11:03:01.946397 mon.0 [INF] mdsmap e39: 1/1/1 up {0=ceph2=up:replay}, 1 up:standby
>> 2015-08-24 11:03:02.993880 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:reconnect
>> 2015-08-24 11:03:02.993930 mon.0 [INF] mdsmap e40: 1/1/1 up {0=ceph2=up:reconnect}, 1 up:standby
>> 2015-08-24 11:03:51.461248 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:rejoin
>> 2015-08-24 11:03:55.807131 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:active
>> 2015-08-24 11:03:55.807195 mon.0 [INF] mdsmap e42: 1/1/1 up {0=ceph2=up:active}, 1 up:standby
>> 2015-08-24 11:06:48.036736 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:active
>> 2015-08-24 11:06:48.036799 mon.0 [INF] mdsmap e43: 1/1/1 up {0=ceph2=up:active}, 1 up:standby
>> *<cable plugged back in>*
>> 2015-08-24 11:13:13.230714 mon.0 [INF] osd.32 10.15.0.3:6832/11565 boot
>> 2015-08-24 11:13:13.230765 mon.0 [INF] osdmap e212: 36 osds: 25 up, 25 in
>> 2015-08-24 11:13:13.230809 mon.0 [INF] mds.? 10.15.0.3:6833/16993 up:boot
>> 2015-08-24 11:13:13.230837 mon.0 [INF] mdsmap e47: 1/1/1 up {0=ceph2=up:active}, 2 up:standby
>> 2015-08-24 11:13:30.799429 mon.2 [INF] mon.ceph3 calling new monitor election
>> 2015-08-24 11:13:30.826158 mon.0 [INF] mon.ceph1 calling new monitor election
>> 2015-08-24 11:13:30.926331 mon.0 [INF] mon.ceph1@0 won leader election with quorum 0,1,2
>> 2015-08-24 11:13:30.968739 mon.0 [INF] mdsmap e47: 1/1/1 up {0=ceph2=up:active}, 2 up:standby
>> 2015-08-24 11:13:28.383203 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24155 10.10.10.95:0/3238635414 after 625.375507 (allowed interval 45)
>> 2015-08-24 11:13:29.721653 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24146 10.10.10.99:0/3454703638 after 626.713952 (allowed interval 45)
>> 2015-08-24 11:13:31.113004 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24140 10.10.10.60:0/359606080 after 628.105302 (allowed interval 45)
>> 2015-08-24 11:13:50.933020 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24152 10.10.10.67:0/3475305031 after 647.925323 (allowed interval 45)
>> 2015-08-24 11:13:51.037681 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24149 10.10.10.68:0/22416725 after 648.029988 (allowed interval 45)
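
The denied lines above show how late each client was: they tried to
reconnect more than 600 seconds after the failover, against an allowed
interval of 45. A small Python sketch for pulling those numbers out of
the log follows. The regular expression is my own guess based on the
lines quoted here and may not match other Ceph versions; parse_denied is
a hypothetical helper, not part of any Ceph tool.

```python
import re

# Pattern derived from the "denied reconnect attempt" lines in this thread.
DENIED_RE = re.compile(
    r"denied reconnect attempt \(mds is (?P<state>[^)]+)\) "
    r"from (?P<client>client\.\d+) (?P<addr>\S+) "
    r"after (?P<elapsed>[\d.]+) \(allowed interval (?P<allowed>\d+)\)"
)


def parse_denied(line):
    """Return the fields of a denied-reconnect line, or None if no match."""
    m = DENIED_RE.search(line)
    if m is None:
        return None
    return {
        "client": m.group("client"),
        "addr": m.group("addr"),
        "mds_state": m.group("state"),
        "elapsed_s": float(m.group("elapsed")),
        "allowed_s": int(m.group("allowed")),
    }


sample = (
    "2015-08-24 11:13:28.383203 mds.0 [INF] denied reconnect attempt "
    "(mds is up:active) from client.24155 10.10.10.95:0/3238635414 "
    "after 625.375507 (allowed interval 45)"
)
info = parse_denied(sample)
# How far past the reconnect window this client was, in seconds.
late_by = info["elapsed_s"] - info["allowed_s"]
```

Running this over the five lines should show that all of the hung
clients missed the window by roughly ten minutes, i.e. they only
noticed the new MDS after the cable was plugged back in.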
>>
>> I did just notice that none of the times match up, so I may try again once I fix ntp/chrony and see if that makes a difference.
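
Until the clocks are synced, intervals measured within a single log are
still usable, since they all come from the same clock. As a rough sketch
(log_time is a hypothetical helper, and it assumes every line starts with
the timestamp format that "ceph -w" printed above), the time the MDS took
to go from replay to active can be computed like this:

```python
from datetime import datetime

# Timestamp layout of the "ceph -w" lines quoted in this thread.
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def log_time(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.ffffff' stamp of a log line."""
    date_part, time_part = line.split()[:2]
    return datetime.strptime(f"{date_part} {time_part}", TS_FORMAT)


replay = ("2015-08-24 11:03:01.946397 mon.0 [INF] mdsmap e39: 1/1/1 up "
          "{0=ceph2=up:replay}, 1 up:standby")
active = ("2015-08-24 11:03:55.807195 mon.0 [INF] mdsmap e42: 1/1/1 up "
          "{0=ceph2=up:active}, 1 up:standby")

# Seconds between the standby entering replay and becoming active.
failover_s = (log_time(active) - log_time(replay)).total_seconds()
```

Comparing these against the client-side dmesg times would still need
the hosts' clocks to agree, so fixing ntp/chrony first makes sense.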
>>
>> Cheers,
>>
>> Simon
>>
>>> -----Original Message-----
>>> From: Gregory Farnum [mailto:gfarnum@xxxxxxxxxx]
>>> Sent: 21 August 2015 12:16
>>> To: Simon Hallam
>>> Cc: ceph-users@xxxxxxxxxxxxxx
>>> Subject: Re:  Testing CephFS
>>>
>>> On Thu, Aug 20, 2015 at 11:07 AM, Simon  Hallam <sha@xxxxxxxxx> wrote:
>>> > Hey all,
>>> >
>>> > We are currently testing CephFS on a small (3 node) cluster.
>>> >
>>> > The setup is currently:
>>> >
>>> > Each server has 12 OSDs, 1 Monitor and 1 MDS running on it:
>>> > The servers are running: 0.94.2-0.el7
>>> > The clients are running: Ceph: 0.80.10-1.fc21, Kernel: 4.0.6-200.fc21.x86_64
>>> >
>>> > ceph -s
>>> >     cluster 4ed5ecdd-0c5b-4422-9d99-c9e42c6bd4cd
>>> >      health HEALTH_OK
>>> >      monmap e1: 3 mons at {ceph1=10.15.0.1:6789/0,ceph2=10.15.0.2:6789/0,ceph3=10.15.0.3:6789/0}
>>> >             election epoch 20, quorum 0,1,2 ceph1,ceph2,ceph3
>>> >      mdsmap e12: 1/1/1 up {0=ceph3=up:active}, 2 up:standby
>>> >      osdmap e389: 36 osds: 36 up, 36 in
>>> >       pgmap v19370: 8256 pgs, 3 pools, 51217 MB data, 14035 objects
>>> >             95526 MB used, 196 TB / 196 TB avail
>>> >                 8256 active+clean
>>> >
>>> > Our Ceph.conf is relatively simple at the moment:
>>> >
>>> > cat /etc/ceph/ceph.conf
>>> > [global]
>>> > fsid = 4ed5ecdd-0c5b-4422-9d99-c9e42c6bd4cd
>>> > mon_initial_members = ceph1, ceph2, ceph3
>>> > mon_host = 10.15.0.1,10.15.0.2,10.15.0.3
>>> > mon_pg_warn_max_per_osd = 1000
>>> > auth_cluster_required = cephx
>>> > auth_service_required = cephx
>>> > auth_client_required = cephx
>>> > filestore_xattr_use_omap = true
>>> > osd_pool_default_size = 2
>>> >
>>> >
>>> > When I pulled the plug on the master MDS last time (ceph1), it stopped all
>>> > IO until I plugged it back in. I was under the assumption that the MDS
>>> > would fail over to the other 2 MDSs and IO would continue?
>>> >
>>> > Is there something I need to do to allow the MDSs to fail over to each
>>> > other without too much interruption? Or is this because of the clients'
>>> > Ceph version?
>>>
>>> That's quite strange. How long did you wait for it to fail over? Did
>>> the output of "ceph -s" (or "ceph -w", whichever) change during that
>>> time?
>>> By default the monitors should have detected the MDS was dead after 30
>>> seconds and put one of the other MDS nodes into replay and active.
>>>
>>> ...I wonder if this is because you lost a monitor at the same time as
>>> the MDS. What kind of logging do you have available from during your
>>> test?
>>> -Greg
>>>
>>> >
>>> > Cheers,
>>> >
>>> > Simon Hallam
>>> > Linux Support & Development Officer
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com