Hi All...
I have some questions about client session in CephFS.
1./ My setup:
a. ceph 9.0.3
b. 32 OSDs distributed in 4 servers (8 OSD per server).
c. 'osd pool default size = 3' and 'osd pool default min size = 2'
d. a single mds
e. dedicated pools for data and metadata
2./ I had 5 clients mounting Cephfs, but one of them was not ok. An
'ls -l /cephfs was simply hanging'
3./ I thought I would be able to check the session using, the
following command, but nothing showed up. Is this supposed to
happen?
# cephfs-table-tool 0 show session
{
"0": {
"data": {
"Sessions": []
},
"result": 0
}
}
4./ I could only recover the problematic client by restarting the
mds. The mds log, generated by the restart, follows at the end.
There is no relevant logs before the restart.
- It seems the mds denies CLIENT5 attempt to reconnect, and
thinks its session is closed.
- Then the mds does some re-connection operations to the Ceph
Cluster, and once that is over, CLIENT5 is then allowed to
reconnect.
5./ My questions:
- Is there a simple command for me to check which sessions are
active? 'cephfs-table-tool 0 show session' does not seem to work
- Is there a way for me to cross check which sessions belong to
which clients (IPs)?
- Is there a way for me to understand / debug what is happening
with a particular session? In this case, I would be interested in
understanding which CLIENT5 was closed...
- Is there a way to reestablish a session without having to
restart the mds?
TIA
Cheers
Goncalo
2015-09-25 11:09:47.700359 7f30608cf700 -1 mds.0.18 ***
got signal Terminated ***
2015-09-25 11:09:47.734970 7f30608cf700 1 mds.0.18 suicide.
wanted down:dne, now up:active
2015-09-25 11:09:48.958860 7f8eb9e7c800 0 ceph version 9.0.3
(7295612d29f953f46e6e88812ef372b89a43b9da), process ceph-mds, pid
15265
2015-09-25 11:09:49.028184 7f8eb9e7c800 -1 mds.-1.0
log_to_monitors {default=true}
2015-09-25 11:09:50.164166 7f8eb53ac700 1 mds.-1.0 handle_mds_map
standby
2015-09-25 11:09:50.226510 7f8eb53ac700 1 mds.0.19 handle_mds_map
i am now mds.0.19
2015-09-25 11:09:50.226518 7f8eb53ac700 1 mds.0.19 handle_mds_map
state change up:standby --> up:replay
2015-09-25 11:09:50.226527 7f8eb53ac700 1 mds.0.19 replay_start
2015-09-25 11:09:50.226534 7f8eb53ac700 1 mds.0.19 recovery set
is
2015-09-25 11:09:50.246491 7f8eb21a7700 0 mds.0.cache creating
system inode with ino:100
2015-09-25 11:09:50.246704 7f8eb21a7700 0 mds.0.cache creating
system inode with ino:1
2015-09-25 11:09:50.577356 7f8eacb90700 1 mds.0.19 replay_done
2015-09-25 11:09:50.577379 7f8eacb90700 1 mds.0.19 making mds
journal writeable
2015-09-25 11:09:51.373407 7f8eb53ac700 1 mds.0.19 handle_mds_map
i am now mds.0.19
2015-09-25 11:09:51.373417 7f8eb53ac700 1 mds.0.19 handle_mds_map
state change up:replay --> up:reconnect
2015-09-25 11:09:51.373426 7f8eb53ac700 1 mds.0.19
reconnect_start
2015-09-25 11:09:51.373429 7f8eb53ac700 1 mds.0.19 reopen_log
2015-09-25 11:09:51.373440 7f8eb53ac700 1 mds.0.server
reconnect_clients -- 4 sessions
2015-09-25 11:09:51.373812 7f8eab583700 0 --
<MDS_IP>:6801/15265 >> <CLIENT1_IP>:0/2270303053
pipe(0x4558000 sd=30 :6801 s=0 pgs=0 cs=0 l=0 c=0x4309fa0).accept
peer addr is really <CLIENT1_IP>:0/2270303053 (socket is
<CLIENT1_IP>:47157/0)
2015-09-25 11:09:51.374121 7f8eab583700 1 mds.0.19
ms_verify_authorizer: cannot decode auth caps bl of length 0
2015-09-25 11:09:51.374372 7f8eab482700 1 mds.0.19
ms_verify_authorizer: cannot decode auth caps bl of length 0
2015-09-25 11:09:51.374756 7f8eb53ac700 0 log_channel(cluster)
log [DBG] : reconnect by client.34107
<CLIENT1_IP>:0/2270303053 after 0.001268
2015-09-25 11:09:51.376911 7f8eab17f700 1 mds.0.19
ms_verify_authorizer: cannot decode auth caps bl of length 0
2015-09-25 11:09:51.378109 7f8eb53ac700 0 log_channel(cluster)
log [DBG] : reconnect by client.34185 <CLIENT2_IP>:0/5899
after 0.004635
2015-09-25 11:09:51.384038 7f8eb53ac700 0 log_channel(cluster)
log [DBG] : reconnect by client.34176 <CLIENT3_IP>:0/21438
after 0.010561
2015-09-25 11:09:51.404033 7f8eaaf7d700 1 mds.0.19
ms_verify_authorizer: cannot decode auth caps bl of length 0
2015-09-25 11:09:51.414088 7f8eaae7c700 1 mds.0.19
ms_verify_authorizer: cannot decode auth caps bl of length 0
2015-09-25 11:09:51.414864 7f8eb53ac700 0 log_channel(cluster)
log [DBG] : reconnect by client.34182 <CLIENT4_IP>:0/3941
after 0.041392
2015-09-25 11:09:51.414889 7f8eb53ac700 1 mds.0.19 reconnect_done
2015-09-25 11:09:51.428669 7f8eb53ac700 1 mds.0.server session
is closed, ignoring reconnect, sending close
2015-09-25 11:09:51.428689 7f8eb53ac700 0 log_channel(cluster)
log [INF] : denied reconnect attempt (mds is up:reconnect) from
client.34104 <CLIENT5_IP>:0/24520 (session is closed)
2015-09-25 11:09:51.442231 7f8eaae7c700 0 --
<MDS_IP>:6801/15265 >> <CLIENT5_IP>:0/24520
pipe(0x4565800 sd=34 :6801 s=2 pgs=2522 cs=1 l=0
c=0x430a520).fault, server, going to standby
2015-09-25 11:09:52.516919 7f8eb53ac700 1 mds.0.19 handle_mds_map
i am now mds.0.19
2015-09-25 11:09:52.516926 7f8eb53ac700 1 mds.0.19 handle_mds_map
state change up:reconnect --> up:rejoin
2015-09-25 11:09:52.516930 7f8eb53ac700 1 mds.0.19 rejoin_start
2015-09-25 11:09:52.517110 7f8eb53ac700 1 mds.0.19
rejoin_joint_start
2015-09-25 11:09:52.517468 7f8eb53ac700 1 mds.0.19 rejoin_done
2015-09-25 11:09:53.594284 7f8eb53ac700 1 mds.0.19 handle_mds_map
i am now mds.0.19
2015-09-25 11:09:53.594292 7f8eb53ac700 1 mds.0.19 handle_mds_map
state change up:rejoin --> up:active
2015-09-25 11:09:53.594301 7f8eb53ac700 1 mds.0.19 recovery_done
-- successful recovery!
2015-09-25 11:09:53.594536 7f8eb53ac700 1 mds.0.19 active_start
2015-09-25 11:09:53.594752 7f8eb53ac700 1 mds.0.19 cluster
recovered.
2015-09-25 11:09:53.635288 7f8eaa775700 1 mds.0.19
ms_verify_authorizer: cannot decode auth caps bl of length 0
2015-09-25 11:09:53.635329 7f8eaa775700 0 --
<MDS_IP>:6801/15265 >> <CLIENT5_IP>:0/24520
pipe(0x45da000 sd=37 :6801 s=0 pgs=0 cs=0 l=0
c=0x430a3c0).accept connect_seq 0 vs existing 1 state standby
2015-09-25 11:09:53.635352 7f8eaa775700 0 --
<MDS_IP>:6801/15265 >> <CLIENT5_IP>:0/24520
pipe(0x45da000 sd=37 :6801 s=0 pgs=0 cs=0 l=0
c=0x430a3c0).accept peer reset, then tried to connect to us,
replacing
2015-09-25 11:09:53.675605 7f8eaa775700 1 mds.0.19
ms_verify_authorizer: cannot decode auth caps bl of length 0
--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW 2006
T: +61 2 93511937
|
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com