cephfs kernel client hangs

Hi,
  I have a Ceph 12.2.5 cluster running on 4 CentOS 7.3 servers with kernel 4.17.0, including 3 mons, 16 osds and 2 mds (1 active + 1 backup). Several clients mount cephfs in kernel mode: client A runs kernel 4.4.145, the others run kernel 4.12.8, and all of them use ceph client version 0.94.
  My mount command is something like 'mount -t ceph mon1:6789:/dir1 /mnt/dir1 -o name=user1,secretfile=user1.secret' (spelled out below). Client A uses a different cephx user than the other clients. While I was copying files on client A yesterday, it hung and I could not umount it anymore. I then restarted the mds service and found that all the other clients had hung as well.
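For completeness, this is roughly what the mount and the secret file look like on client A (the other clients mount the same way with their own cephx user; the key is only a placeholder here, the real file holds the base64 key of client.user1 on a single line):

    # user1.secret -- one line, just the key, no section header:
    #   <base64 key of client.user1>
    mount -t ceph mon1:6789:/dir1 /mnt/dir1 -o name=user1,secretfile=user1.secret

Here are the logs.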
ceph.audit.log:

2018-08-06 10:04:14.345909 7f8a9fa27700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 32.978931 secs

2018-08-06 10:04:14.345936 7f8a9fa27700  0 log_channel(cluster) log [WRN] : slow request 32.978931 seconds old, received at 2018-08-06 10:03:41.366871: client_request(client.214486:3553922 getattr pAsLsXsFs #0x10000259db6 2018-08-06 10:03:44.346116 caller_uid=0, caller_gid=99{}) currently failed to rdlock, waiting

2018-08-06 10:04:44.346568 7f8a9fa27700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 62.979643 secs

2018-08-06 10:04:44.346593 7f8a9fa27700  0 log_channel(cluster) log [WRN] : slow request 62.979643 seconds old, received at 2018-08-06 10:03:41.366871: client_request(client.214486:3553922 getattr pAsLsXsFs #0x10000259db6 2018-08-06 10:03:44.346116 caller_uid=0, caller_gid=99{}) currently failed to rdlock, waiting

2018-08-06 10:04:44.347651 7f8a9fa27700  0 log_channel(cluster) log [WRN] : client.214486 isn't responding to mclientcaps(revoke), ino 0x10000259db6 pending pFc issued pFcb, sent 62.980452 seconds ago

2018-08-06 12:59:24.589157 mds.ceph-mon1 mds.0 10.211.121.61:6804/4260949067 257 : cluster [WRN] client.214486 isn't responding to mclientcaps(revoke), ino 0x1000025a20d pending pFc issued pFcb, sent 7683.252197 seconds ago

2018-08-06 13:00:00.000152 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8150 : cluster [WRN] overall HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow requests

2018-08-06 13:21:14.618192 mds.ceph-mon1 mds.0 10.211.121.61:6804/4260949067 258 : cluster [WRN] 12 slow requests, 1 included below; oldest blocked for > 11853.251231 secs

2018-08-06 13:21:14.618203 mds.ceph-mon1 mds.0 10.211.121.61:6804/4260949067 259 : cluster [WRN] slow request 7683.171918 seconds old, received at 2018-08-06 11:13:11.446184: client_request(client.213537:308353 setfilelock rule 2, type 2, owner 12714292720879014315, pid 19091, start 0, length 0, wait 1 #0x100000648cc 2018-08-06 11:13:11.445425 caller_uid=48, caller_gid=48{}) currently acquired locks

2018-08-06 13:24:59.623303 mds.ceph-mon1 mds.0 10.211.121.61:6804/4260949067 260 : cluster [WRN] 12 slow requests, 1 included below; oldest blocked for > 12078.256355 secs

2018-08-06 13:24:59.623316 mds.ceph-mon1 mds.0 10.211.121.61:6804/4260949067 261 : cluster [WRN] slow request 7683.023058 seconds old, received at 2018-08-06 11:16:56.600168: client_request(client.213537:308354 setfilelock rule 2, type 2, owner 12714292687012008619, pid 19092, start 0, length 0, wait 1 #0x100000648cc 2018-08-06 11:16:56.599432 caller_uid=48, caller_gid=48{}) currently acquired locks

ceph-mds.log:

2018-08-06 15:09:57.700198 7f8a9ea25700 -1 received  signal: Terminated from  PID: 1 task name: /usr/lib/systemd/systemd --switched-root --system --deserialize 21  UID: 0

2018-08-06 15:09:57.700228 7f8a9ea25700 -1 mds.ceph-mon1 *** got signal Terminated ***

2018-08-06 15:09:57.700232 7f8a9ea25700  1 mds.ceph-mon1 suicide.  wanted state up:active

2018-08-06 15:09:57.704117 7f8a9ea25700  1 mds.0.52 shutdown: shutting down rank 0

2018-08-06 15:10:48.244347 7fa6dee9e1c0  0 set uid:gid to 167:167 (ceph:ceph)

2018-08-06 15:10:48.244368 7fa6dee9e1c0  0 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable), process (unknown), pid 2683453

2018-08-06 15:10:48.246713 7fa6dee9e1c0  0 pidfile_write: ignore empty --pid-file

2018-08-06 15:10:52.753614 7fa6d7d62700  1 mds.ceph-mon1 handle_mds_map standby

ceph.log:

2018-08-06 15:09:57.792010 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8158 : cluster [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)

2018-08-06 15:09:57.792151 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8159 : cluster [INF] Health check cleared: MDS_CLIENT_LATE_RELEASE (was: 1 clients failing to respond to capability release)

2018-08-06 15:09:57.792244 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8160 : cluster [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report slow requests)

2018-08-06 15:09:57.942937 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8163 : cluster [INF] Standby daemon mds.ceph-mds assigned to filesystem cephfs as rank 0

2018-08-06 15:09:57.943174 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8164 : cluster [WRN] Health check failed: insufficient standby MDS daemons available (MDS_INSUFFICIENT_STANDBY)

2018-08-06 15:10:51.601347 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8184 : cluster [INF] daemon mds.ceph-mds is now active in filesystem cephfs as rank 0

2018-08-06 15:10:52.563221 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8186 : cluster [INF] Health check cleared: FS_DEGRADED (was: 1 filesystem is degraded)

2018-08-06 15:10:52.563320 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8187 : cluster [INF] Health check cleared: MDS_INSUFFICIENT_STANDBY (was: insufficient standby MDS daemons available)

2018-08-06 15:10:52.563371 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8188 : cluster [INF] Cluster is now healthy

2018-08-06 15:10:49.574055 mds.ceph-mds mds.0 10.211.132.103:6804/3490526127 10 : cluster [WRN] evicting unresponsive client docker38 (213525), after waiting 45 seconds during MDS startup

2018-08-06 15:10:49.574168 mds.ceph-mds mds.0 10.211.132.103:6804/3490526127 11 : cluster [WRN] evicting unresponsive client docker74 (213534), after waiting 45 seconds during MDS startup

2018-08-06 15:10:49.574259 mds.ceph-mds mds.0 10.211.132.103:6804/3490526127 12 : cluster [WRN] evicting unresponsive client docker73 (213537), after waiting 45 seconds during MDS startup


  'client.214486' in the logs is client A, and docker38, docker73 and docker74 are the other clients. All of the clients are hung: I cannot umount, ls or cd into the mounted dirs. I think docker38, 73 and 74 were evicted because I restarted the MDS without any barrier operations (see this). But how did client A end up hung in the first place? And is there any way to deal with a hung mounted dir other than rebooting the server?
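The only things I can think of trying myself, short of rebooting client A, are a forced/lazy umount on the client and evicting its session on the MDS side. I have not verified that either actually recovers a kernel mount stuck on a cap revoke, so the commands below are just what I am considering:

    # on client A (umount -f may simply hang again; umount -l only detaches the mount point)
    umount -f /mnt/dir1
    umount -l /mnt/dir1

    # on the cluster, evict client A's session (client id 214486) from the active MDS;
    # as far as I understand, luminous also blacklists the client's address by default on eviction
    ceph tell mds.0 client evict id=214486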


Thanks

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
