On Fri, Sep 25, 2015 at 2:55 AM, Goncalo Borges <goncalo@xxxxxxxxxxxxxxxxxxx> wrote:
> Hi All...
>
> I have some questions about client sessions in CephFS.
>
> 1./ My setup:
>
> a. ceph 9.0.3
> b. 32 OSDs distributed in 4 servers (8 OSDs per server).
> c. 'osd pool default size = 3' and 'osd pool default min size = 2'
> d. a single mds
> e. dedicated pools for data and metadata
>
> 2./ I had 5 clients mounting CephFS, but one of them was not ok. An 'ls -l /cephfs' was simply hanging.

At this stage it would be good to do a "ceph daemon mds.<id> dump_ops_in_flight" on the MDS and a "ceph daemon <path to /var/run/client.<whatever>.asok> mds_requests" on the client (assuming it's fuse). This is likely to show you specifically what is stuck.

> 3./ I thought I would be able to check the session using the following command, but nothing showed up. Is this supposed to happen?
>
> # cephfs-table-tool 0 show session
> {
>     "0": {
>         "data": {
>             "Sessions": []
>         },
>         "result": 0
>     }
> }

cephfs-table-tool is a disaster recovery tool for inspecting on-disk structures. It is not reporting on the state of a live system. You want "ceph daemon mds.<id> session ls".

> 4./ I could only recover the problematic client by restarting the mds. The mds log generated by the restart follows at the end. There are no relevant logs before the restart.
> - It seems the mds denies CLIENT5's attempt to reconnect, and thinks its session is closed.
> - Then the mds does some re-connection operations to the Ceph cluster, and once that is over, CLIENT5 is allowed to reconnect.
>
> 5./ My questions:
> - Is there a simple command for me to check which sessions are active? 'cephfs-table-tool 0 show session' does not seem to work.

As above.

> - Is there a way for me to cross check which sessions belong to which clients (IPs)?

Yes, addresses are in the "session ls" output.

> - Is there a way for me to understand / debug what is happening with a particular session? In this case, I would be interested in understanding why CLIENT5's session was closed...

The interesting bit will almost certainly be from before the restart, unfortunately. It sounds like everything was working properly after the restart.

> - Is there a way to reestablish a session without having to restart the mds?

I guess you mean while keeping the client alive? Because in general, if I was going to restart something, it would be the client, not the MDS.

If you think a particular client session has got stuck, you can also use "ceph daemon mds.<id> session evict" to kill it, whereupon if you are lucky the client will re-establish a session.
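
To put those together, a rough sketch of what that looks like in practice. The MDS name "a" and the socket path are just placeholders: use your own daemon name, whatever socket actually exists under /var/run/ceph/ on the client, and the id reported by "session ls" (the evict argument form can vary between versions; client.34104 from your log is used as the example id):

    # On the MDS host: list live sessions (ids and client addresses) and in-flight ops
    ceph daemon mds.a session ls
    ceph daemon mds.a dump_ops_in_flight

    # On a ceph-fuse client: show the requests that client still has outstanding
    ceph daemon /var/run/ceph/ceph-client.<whatever>.asok mds_requests

    # On the MDS host: evict a stuck session by the id shown in "session ls"
    ceph daemon mds.a session evict 34104
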
Cheers,
John

>
> TIA
> Cheers
> Goncalo
>
> 2015-09-25 11:09:47.700359 7f30608cf700 -1 mds.0.18 *** got signal Terminated ***
> 2015-09-25 11:09:47.734970 7f30608cf700 1 mds.0.18 suicide. wanted down:dne, now up:active
> 2015-09-25 11:09:48.958860 7f8eb9e7c800 0 ceph version 9.0.3 (7295612d29f953f46e6e88812ef372b89a43b9da), process ceph-mds, pid 15265
> 2015-09-25 11:09:49.028184 7f8eb9e7c800 -1 mds.-1.0 log_to_monitors {default=true}
> 2015-09-25 11:09:50.164166 7f8eb53ac700 1 mds.-1.0 handle_mds_map standby
> 2015-09-25 11:09:50.226510 7f8eb53ac700 1 mds.0.19 handle_mds_map i am now mds.0.19
> 2015-09-25 11:09:50.226518 7f8eb53ac700 1 mds.0.19 handle_mds_map state change up:standby --> up:replay
> 2015-09-25 11:09:50.226527 7f8eb53ac700 1 mds.0.19 replay_start
> 2015-09-25 11:09:50.226534 7f8eb53ac700 1 mds.0.19 recovery set is
> 2015-09-25 11:09:50.246491 7f8eb21a7700 0 mds.0.cache creating system inode with ino:100
> 2015-09-25 11:09:50.246704 7f8eb21a7700 0 mds.0.cache creating system inode with ino:1
> 2015-09-25 11:09:50.577356 7f8eacb90700 1 mds.0.19 replay_done
> 2015-09-25 11:09:50.577379 7f8eacb90700 1 mds.0.19 making mds journal writeable
> 2015-09-25 11:09:51.373407 7f8eb53ac700 1 mds.0.19 handle_mds_map i am now mds.0.19
> 2015-09-25 11:09:51.373417 7f8eb53ac700 1 mds.0.19 handle_mds_map state change up:replay --> up:reconnect
> 2015-09-25 11:09:51.373426 7f8eb53ac700 1 mds.0.19 reconnect_start
> 2015-09-25 11:09:51.373429 7f8eb53ac700 1 mds.0.19 reopen_log
> 2015-09-25 11:09:51.373440 7f8eb53ac700 1 mds.0.server reconnect_clients -- 4 sessions
> 2015-09-25 11:09:51.373812 7f8eab583700 0 -- <MDS_IP>:6801/15265 >> <CLIENT1_IP>:0/2270303053 pipe(0x4558000 sd=30 :6801 s=0 pgs=0 cs=0 l=0 c=0x4309fa0).accept peer addr is really <CLIENT1_IP>:0/2270303053 (socket is <CLIENT1_IP>:47157/0)
> 2015-09-25 11:09:51.374121 7f8eab583700 1 mds.0.19 ms_verify_authorizer: cannot decode auth caps bl of length 0
> 2015-09-25 11:09:51.374372 7f8eab482700 1 mds.0.19 ms_verify_authorizer: cannot decode auth caps bl of length 0
> 2015-09-25 11:09:51.374756 7f8eb53ac700 0 log_channel(cluster) log [DBG] : reconnect by client.34107 <CLIENT1_IP>:0/2270303053 after 0.001268
> 2015-09-25 11:09:51.376911 7f8eab17f700 1 mds.0.19 ms_verify_authorizer: cannot decode auth caps bl of length 0
> 2015-09-25 11:09:51.378109 7f8eb53ac700 0 log_channel(cluster) log [DBG] : reconnect by client.34185 <CLIENT2_IP>:0/5899 after 0.004635
> 2015-09-25 11:09:51.384038 7f8eb53ac700 0 log_channel(cluster) log [DBG] : reconnect by client.34176 <CLIENT3_IP>:0/21438 after 0.010561
> 2015-09-25 11:09:51.404033 7f8eaaf7d700 1 mds.0.19 ms_verify_authorizer: cannot decode auth caps bl of length 0
> 2015-09-25 11:09:51.414088 7f8eaae7c700 1 mds.0.19 ms_verify_authorizer: cannot decode auth caps bl of length 0
> 2015-09-25 11:09:51.414864 7f8eb53ac700 0 log_channel(cluster) log [DBG] : reconnect by client.34182 <CLIENT4_IP>:0/3941 after 0.041392
> 2015-09-25 11:09:51.414889 7f8eb53ac700 1 mds.0.19 reconnect_done
> 2015-09-25 11:09:51.428669 7f8eb53ac700 1 mds.0.server session is closed, ignoring reconnect, sending close
> 2015-09-25 11:09:51.428689 7f8eb53ac700 0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.34104 <CLIENT5_IP>:0/24520 (session is closed)
> 2015-09-25 11:09:51.442231 7f8eaae7c700 0 -- <MDS_IP>:6801/15265 >> <CLIENT5_IP>:0/24520 pipe(0x4565800 sd=34 :6801 s=2 pgs=2522 cs=1 l=0 c=0x430a520).fault, server, going to standby
> 2015-09-25 11:09:52.516919 7f8eb53ac700 1 mds.0.19 handle_mds_map i am now mds.0.19
> 2015-09-25 11:09:52.516926 7f8eb53ac700 1 mds.0.19 handle_mds_map state change up:reconnect --> up:rejoin
> 2015-09-25 11:09:52.516930 7f8eb53ac700 1 mds.0.19 rejoin_start
> 2015-09-25 11:09:52.517110 7f8eb53ac700 1 mds.0.19 rejoin_joint_start
> 2015-09-25 11:09:52.517468 7f8eb53ac700 1 mds.0.19 rejoin_done
> 2015-09-25 11:09:53.594284 7f8eb53ac700 1 mds.0.19 handle_mds_map i am now mds.0.19
> 2015-09-25 11:09:53.594292 7f8eb53ac700 1 mds.0.19 handle_mds_map state change up:rejoin --> up:active
> 2015-09-25 11:09:53.594301 7f8eb53ac700 1 mds.0.19 recovery_done -- successful recovery!
> 2015-09-25 11:09:53.594536 7f8eb53ac700 1 mds.0.19 active_start
> 2015-09-25 11:09:53.594752 7f8eb53ac700 1 mds.0.19 cluster recovered.
> 2015-09-25 11:09:53.635288 7f8eaa775700 1 mds.0.19 ms_verify_authorizer: cannot decode auth caps bl of length 0
> 2015-09-25 11:09:53.635329 7f8eaa775700 0 -- <MDS_IP>:6801/15265 >> <CLIENT5_IP>:0/24520 pipe(0x45da000 sd=37 :6801 s=0 pgs=0 cs=0 l=0 c=0x430a3c0).accept connect_seq 0 vs existing 1 state standby
> 2015-09-25 11:09:53.635352 7f8eaa775700 0 -- <MDS_IP>:6801/15265 >> <CLIENT5_IP>:0/24520 pipe(0x45da000 sd=37 :6801 s=0 pgs=0 cs=0 l=0 c=0x430a3c0).accept peer reset, then tried to connect to us, replacing
> 2015-09-25 11:09:53.675605 7f8eaa775700 1 mds.0.19 ms_verify_authorizer: cannot decode auth caps bl of length 0
>
>
> --
> Goncalo Borges
> Research Computing
> ARC Centre of Excellence for Particle Physics at the Terascale
> School of Physics A28 | University of Sydney, NSW 2006
> T: +61 2 93511937
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com