cephfs kernel client hangs

Hi,
  I have a Ceph 12.2.5 cluster running on 4 CentOS 7.3 servers with kernel 4.17.0, including 3 mons, 16 osds, and 2 mds (1 active + 1 standby). I have some clients that mount CephFS with the kernel client. Client A is running kernel 4.4.145, and the others are running kernel 4.12.8. All of them report ceph client version 0.94.
  My mount command is something like 'mount -t ceph mon1:6789:/dir1 /mnt/dir1 -o name=user1,secretfile=user1.secret'. Client A uses a different Ceph user than the other clients. While I was copying files on Client A yesterday, it hung and I could not umount it anymore. I then restarted the MDS service and eventually found that all the other clients had hung as well. The relevant logs are pasted below.
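
  In case it helps, I assume the way to see which session the MDS warnings refer to is something like the following, run on the active MDS host (mds.ceph-mon1 in my setup); 'session ls' should show each client's id, hostname, and the number of caps it holds:

ceph health detail                     # shows which client is failing to release caps
ceph daemon mds.ceph-mon1 session ls   # lists MDS sessions via the admin socket
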
ceph.audit.log:

2018-08-06 10:04:14.345909 7f8a9fa27700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 32.978931 secs

2018-08-06 10:04:14.345936 7f8a9fa27700  0 log_channel(cluster) log [WRN] : slow request 32.978931 seconds old, received at 2018-08-06 10:03:41.366871: client_request(client.214486:3553922 getattr pAsLsXsFs #0x10000259db6 2018-08-06 10:03:44.346116 caller_uid=0, caller_gid=99{}) currently failed to rdlock, waiting

2018-08-06 10:04:44.346568 7f8a9fa27700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 62.979643 secs

2018-08-06 10:04:44.346593 7f8a9fa27700  0 log_channel(cluster) log [WRN] : slow request 62.979643 seconds old, received at 2018-08-06 10:03:41.366871: client_request(client.214486:3553922 getattr pAsLsXsFs #0x10000259db6 2018-08-06 10:03:44.346116 caller_uid=0, caller_gid=99{}) currently failed to rdlock, waiting

2018-08-06 10:04:44.347651 7f8a9fa27700  0 log_channel(cluster) log [WRN] : client.214486 isn't responding to mclientcaps(revoke), ino 0x10000259db6 pending pFc issued pFcb, sent 62.980452 seconds ago

2018-08-06 12:59:24.589157 mds.ceph-mon1 mds.0 10.211.121.61:6804/4260949067 257 : cluster [WRN] client.214486 isn't responding to mclientcaps(revoke), ino 0x1000025a20d pending pFc issued pFcb, sent 7683.252197 seconds ago

2018-08-06 13:00:00.000152 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8150 : cluster [WRN] overall HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow requests

2018-08-06 13:21:14.618192 mds.ceph-mon1 mds.0 10.211.121.61:6804/4260949067 258 : cluster [WRN] 12 slow requests, 1 included below; oldest blocked for > 11853.251231 secs

2018-08-06 13:21:14.618203 mds.ceph-mon1 mds.0 10.211.121.61:6804/4260949067 259 : cluster [WRN] slow request 7683.171918 seconds old, received at 2018-08-06 11:13:11.446184: client_request(client.213537:308353 setfilelockrule 2, type 2, owner 12714292720879014315, pid 19091, start 0, length 0, wait 1 #0x100000648cc 2018-08-06 11:13:11.445425 caller_uid=48, caller_gid=48{}) currently acquired locks

2018-08-06 13:24:59.623303 mds.ceph-mon1 mds.0 10.211.121.61:6804/4260949067 260 : cluster [WRN] 12 slow requests, 1 included below; oldest blocked for > 12078.256355 secs

2018-08-06 13:24:59.623316 mds.ceph-mon1 mds.0 10.211.121.61:6804/4260949067 261 : cluster [WRN] slow request 7683.023058 seconds old, received at 2018-08-06 11:16:56.600168: client_request(client.213537:308354 setfilelockrule 2, type 2, owner 12714292687012008619, pid 19092, start 0, length 0, wait 1 #0x100000648cc 2018-08-06 11:16:56.599432 caller_uid=48, caller_gid=48{}) currently acquired locks

ceph-mds.log:

2018-08-06 15:09:57.700198 7f8a9ea25700 -1 received  signal: Terminated from  PID: 1 task name: /usr/lib/systemd/systemd --switched-root --system --deserialize 21  UID: 0

2018-08-06 15:09:57.700228 7f8a9ea25700 -1 mds.ceph-mon1 *** got signal Terminated ***

2018-08-06 15:09:57.700232 7f8a9ea25700  1 mds.ceph-mon1 suicide.  wanted state up:active

2018-08-06 15:09:57.704117 7f8a9ea25700  1 mds.0.52 shutdown: shutting down rank 0

2018-08-06 15:10:48.244347 7fa6dee9e1c0  0 set uid:gid to 167:167 (ceph:ceph)

2018-08-06 15:10:48.244368 7fa6dee9e1c0  0 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable), process (unknown), pid 2683453

2018-08-06 15:10:48.246713 7fa6dee9e1c0  0 pidfile_write: ignore empty --pid-file

2018-08-06 15:10:52.753614 7fa6d7d62700  1 mds.ceph-mon1 handle_mds_map standby

ceph.log:

2018-08-06 15:09:57.792010 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8158 : cluster [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)

2018-08-06 15:09:57.792151 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8159 : cluster [INF] Health check cleared: MDS_CLIENT_LATE_RELEASE (was: 1 clients failing to respond to capability release)

2018-08-06 15:09:57.792244 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8160 : cluster [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report slow requests)

2018-08-06 15:09:57.942937 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8163 : cluster [INF] Standby daemon mds.ceph-mds assigned to filesystem cephfs as rank 0

2018-08-06 15:09:57.943174 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8164 : cluster [WRN] Health check failed: insufficient standby MDS daemons available (MDS_INSUFFICIENT_STANDBY)

2018-08-06 15:10:51.601347 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8184 : cluster [INF] daemon mds.ceph-mds is now active in filesystem cephfs as rank 0

2018-08-06 15:10:52.563221 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8186 : cluster [INF] Health check cleared: FS_DEGRADED (was: 1 filesystem is degraded)

2018-08-06 15:10:52.563320 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8187 : cluster [INF] Health check cleared: MDS_INSUFFICIENT_STANDBY (was: insufficient standby MDS daemons available)

2018-08-06 15:10:52.563371 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8188 : cluster [INF] Cluster is now healthy

2018-08-06 15:10:49.574055 mds.ceph-mds mds.0 10.211.132.103:6804/3490526127 10 : cluster [WRN] evicting unresponsive client docker38 (213525), after waiting 45 seconds during MDS startup

2018-08-06 15:10:49.574168 mds.ceph-mds mds.0 10.211.132.103:6804/3490526127 11 : cluster [WRN] evicting unresponsive client docker74 (213534), after waiting 45 seconds during MDS startup

2018-08-06 15:10:49.574259 mds.ceph-mds mds.0 10.211.132.103:6804/3490526127 12 : cluster [WRN] evicting unresponsive client docker73 (213537), after waiting 45 seconds during MDS startup


  'client.214486:3553922' is client A, and docker38, 73, and 74 are the other clients. All the clients hung, and I could not umount/ls/cd the mounted dirs. I think docker38, 73, and 74 were evicted because I restarted the MDS without any barrier operations (see this). But why did client A hang in the first place? Is there any way to deal with a hung mounted dir other than rebooting the server?
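
  From the documentation I am guessing that recovering without a reboot would look something like the commands below, but I am not sure whether that is safe, or whether the kernel client can actually recover once its caps have been revoked (214486 and /mnt/dir1 are just the id/path from my example above):

ceph tell mds.0 client evict id=214486   # evict the stuck session; by default this also blacklists the client
ceph osd blacklist ls                    # check (and later remove) the resulting blacklist entry
umount -f /mnt/dir1                      # on the hung client, force the unmount
umount -l /mnt/dir1                      # or lazily detach it if -f still blocks

  Is this the right approach, or is there a better way?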


Thanks

