On Fri, Sep 25, 2015 at 2:55 AM, Goncalo Borges <goncalo@xxxxxxxxxxxxxxxxxxx> wrote:
> Hi All...
>
> I have some questions about client sessions in CephFS.
>
> 1./ My setup:
>
> a. ceph 9.0.3
> b. 32 OSDs distributed in 4 servers (8 OSDs per server).
> c. 'osd pool default size = 3' and 'osd pool default min size = 2'
> d. a single mds
> e. dedicated pools for data and metadata
>
> 2./ I had 5 clients mounting CephFS, but one of them was not ok. An 'ls -l /cephfs' was simply hanging.

At this stage it would be good to do a "ceph daemon mds.<id> dump_ops_in_flight" on the MDS and a "ceph daemon <path to /var/run/client.<whatever>.asok> mds_requests" on the client (assuming it's fuse). This is likely to show you specifically what is stuck.

> 3./ I thought I would be able to check the session using the following command, but nothing showed up. Is this supposed to happen?
>
> # cephfs-table-tool 0 show session
> {
>     "0": {
>         "data": {
>             "Sessions": []
>         },
>         "result": 0
>     }
> }

cephfs-table-tool is a disaster recovery tool for inspecting on-disk structures. It is not reporting on the state of a live system. You want "ceph daemon mds.<id> session ls".

> 4./ I could only recover the problematic client by restarting the mds. The mds log generated by the restart follows at the end. There are no relevant logs before the restart.
> - It seems the mds denies CLIENT5's attempt to reconnect, and thinks its session is closed.
> - Then the mds does some re-connection operations to the Ceph cluster, and once that is over, CLIENT5 is allowed to reconnect.
>
> 5./ My questions:
> - Is there a simple command for me to check which sessions are active? 'cephfs-table-tool 0 show session' does not seem to work.

As above.

> - Is there a way for me to cross check which sessions belong to which clients (IPs)?

Yes, addresses are in the "session ls" output.

> - Is there a way for me to understand / debug what is happening with a particular session? In this case, I would be interested in understanding why CLIENT5's session was closed...

The interesting bit will almost certainly be from before the restart, unfortunately. It sounds like everything was working properly after the restart.

> - Is there a way to reestablish a session without having to restart the mds?

I guess you mean while keeping the client alive? Because in general, if I was going to restart something, it would be the client, not the MDS.

If you think a particular client session has got stuck, you can also use "ceph daemon mds.<id> session evict" to kill it, whereupon if you are lucky the client will re-establish a session.
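
To put those together, a rough sketch of what that looks like in practice. The MDS name "a" and the socket path are just placeholders: use your own daemon name, whatever socket actually exists under /var/run/ceph/ on the client, and the id reported by "session ls" (the evict argument form can vary between versions; client.34104 from your log is used as the example id):

    # On the MDS host: list live sessions (ids and client addresses) and in-flight ops
    ceph daemon mds.a session ls
    ceph daemon mds.a dump_ops_in_flight

    # On a ceph-fuse client: show the requests that client still has outstanding
    ceph daemon /var/run/ceph/ceph-client.<whatever>.asok mds_requests

    # On the MDS host: evict a stuck session by the id shown in "session ls"
    ceph daemon mds.a session evict 34104
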
Cheers,
John

>
> TIA
> Cheers
> Goncalo
>
> 2015-09-25 11:09:47.700359 7f30608cf700 -1 mds.0.18 *** got signal Terminated ***
> 2015-09-25 11:09:47.734970 7f30608cf700 1 mds.0.18 suicide. wanted down:dne, now up:active
> 2015-09-25 11:09:48.958860 7f8eb9e7c800 0 ceph version 9.0.3 (7295612d29f953f46e6e88812ef372b89a43b9da), process ceph-mds, pid 15265
> 2015-09-25 11:09:49.028184 7f8eb9e7c800 -1 mds.-1.0 log_to_monitors {default=true}
> 2015-09-25 11:09:50.164166 7f8eb53ac700 1 mds.-1.0 handle_mds_map standby
> 2015-09-25 11:09:50.226510 7f8eb53ac700 1 mds.0.19 handle_mds_map i am now mds.0.19
> 2015-09-25 11:09:50.226518 7f8eb53ac700 1 mds.0.19 handle_mds_map state change up:standby --> up:replay
> 2015-09-25 11:09:50.226527 7f8eb53ac700 1 mds.0.19 replay_start
> 2015-09-25 11:09:50.226534 7f8eb53ac700 1 mds.0.19 recovery set is
> 2015-09-25 11:09:50.246491 7f8eb21a7700 0 mds.0.cache creating system inode with ino:100
> 2015-09-25 11:09:50.246704 7f8eb21a7700 0 mds.0.cache creating system inode with ino:1
> 2015-09-25 11:09:50.577356 7f8eacb90700 1 mds.0.19 replay_done
> 2015-09-25 11:09:50.577379 7f8eacb90700 1 mds.0.19 making mds journal writeable
> 2015-09-25 11:09:51.373407 7f8eb53ac700 1 mds.0.19 handle_mds_map i am now mds.0.19
> 2015-09-25 11:09:51.373417 7f8eb53ac700 1 mds.0.19 handle_mds_map state change up:replay --> up:reconnect
> 2015-09-25 11:09:51.373426 7f8eb53ac700 1 mds.0.19 reconnect_start
> 2015-09-25 11:09:51.373429 7f8eb53ac700 1 mds.0.19 reopen_log
> 2015-09-25 11:09:51.373440 7f8eb53ac700 1 mds.0.server reconnect_clients -- 4 sessions
> 2015-09-25 11:09:51.373812 7f8eab583700 0 -- <MDS_IP>:6801/15265 >> <CLIENT1_IP>:0/2270303053 pipe(0x4558000 sd=30 :6801 s=0 pgs=0 cs=0 l=0 c=0x4309fa0).accept peer addr is really <CLIENT1_IP>:0/2270303053 (socket is <CLIENT1_IP>:47157/0)
> 2015-09-25 11:09:51.374121 7f8eab583700 1 mds.0.19 ms_verify_authorizer: cannot decode auth caps bl of length 0
> 2015-09-25 11:09:51.374372 7f8eab482700 1 mds.0.19 ms_verify_authorizer: cannot decode auth caps bl of length 0
> 2015-09-25 11:09:51.374756 7f8eb53ac700 0 log_channel(cluster) log [DBG] : reconnect by client.34107 <CLIENT1_IP>:0/2270303053 after 0.001268
> 2015-09-25 11:09:51.376911 7f8eab17f700 1 mds.0.19 ms_verify_authorizer: cannot decode auth caps bl of length 0
> 2015-09-25 11:09:51.378109 7f8eb53ac700 0 log_channel(cluster) log [DBG] : reconnect by client.34185 <CLIENT2_IP>:0/5899 after 0.004635
> 2015-09-25 11:09:51.384038 7f8eb53ac700 0 log_channel(cluster) log [DBG] : reconnect by client.34176 <CLIENT3_IP>:0/21438 after 0.010561
> 2015-09-25 11:09:51.404033 7f8eaaf7d700 1 mds.0.19 ms_verify_authorizer: cannot decode auth caps bl of length 0
> 2015-09-25 11:09:51.414088 7f8eaae7c700 1 mds.0.19 ms_verify_authorizer: cannot decode auth caps bl of length 0
> 2015-09-25 11:09:51.414864 7f8eb53ac700 0 log_channel(cluster) log [DBG] : reconnect by client.34182 <CLIENT4_IP>:0/3941 after 0.041392
> 2015-09-25 11:09:51.414889 7f8eb53ac700 1 mds.0.19 reconnect_done
> 2015-09-25 11:09:51.428669 7f8eb53ac700 1 mds.0.server session is closed, ignoring reconnect, sending close
> 2015-09-25 11:09:51.428689 7f8eb53ac700 0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.34104 <CLIENT5_IP>:0/24520 (session is closed)
> 2015-09-25 11:09:51.442231 7f8eaae7c700 0 -- <MDS_IP>:6801/15265 >> <CLIENT5_IP>:0/24520 pipe(0x4565800 sd=34 :6801 s=2 pgs=2522 cs=1 l=0 c=0x430a520).fault, server, going to standby
> 2015-09-25 11:09:52.516919 7f8eb53ac700 1 mds.0.19 handle_mds_map i am now mds.0.19
> 2015-09-25 11:09:52.516926 7f8eb53ac700 1 mds.0.19 handle_mds_map state change up:reconnect --> up:rejoin
> 2015-09-25 11:09:52.516930 7f8eb53ac700 1 mds.0.19 rejoin_start
> 2015-09-25 11:09:52.517110 7f8eb53ac700 1 mds.0.19 rejoin_joint_start
> 2015-09-25 11:09:52.517468 7f8eb53ac700 1 mds.0.19 rejoin_done
> 2015-09-25 11:09:53.594284 7f8eb53ac700 1 mds.0.19 handle_mds_map i am now mds.0.19
> 2015-09-25 11:09:53.594292 7f8eb53ac700 1 mds.0.19 handle_mds_map state change up:rejoin --> up:active
> 2015-09-25 11:09:53.594301 7f8eb53ac700 1 mds.0.19 recovery_done -- successful recovery!
> 2015-09-25 11:09:53.594536 7f8eb53ac700 1 mds.0.19 active_start
> 2015-09-25 11:09:53.594752 7f8eb53ac700 1 mds.0.19 cluster recovered.
> 2015-09-25 11:09:53.635288 7f8eaa775700 1 mds.0.19 ms_verify_authorizer: cannot decode auth caps bl of length 0
> 2015-09-25 11:09:53.635329 7f8eaa775700 0 -- <MDS_IP>:6801/15265 >> <CLIENT5_IP>:0/24520 pipe(0x45da000 sd=37 :6801 s=0 pgs=0 cs=0 l=0 c=0x430a3c0).accept connect_seq 0 vs existing 1 state standby
> 2015-09-25 11:09:53.635352 7f8eaa775700 0 -- <MDS_IP>:6801/15265 >> <CLIENT5_IP>:0/24520 pipe(0x45da000 sd=37 :6801 s=0 pgs=0 cs=0 l=0 c=0x430a3c0).accept peer reset, then tried to connect to us, replacing
> 2015-09-25 11:09:53.675605 7f8eaa775700 1 mds.0.19 ms_verify_authorizer: cannot decode auth caps bl of length 0
>
>
> --
> Goncalo Borges
> Research Computing
> ARC Centre of Excellence for Particle Physics at the Terascale
> School of Physics A28 | University of Sydney, NSW 2006
> T: +61 2 93511937
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com