Hi,

On 12/05/17 17:58, Dan Jakubiec wrote:
> Is this a configuration problem or a bug?

We had massive problems with both Kraken (Feb-Sept 2017) and Luminous (12.2.0), seeing the same behaviour as you. Our ceph.conf contained only defaults, except that we had to crank up mds_cache_size and mds_bal_fragment_size_max. Using directory fragmentation (dirfrag) and multi-MDS did not change anything.

Even with Luminous (12.2.0), a single rsync over a large directory tree could essentially kill CephFS for all clients within seconds, and even a waiting period of more than 8 hours did not help. Since the cluster was semi-productive and we couldn't take the downtime, we resorted to unmounting CephFS on all clients, flushing the journal, and re-mounting.

Interestingly, with 12.2.1 on kernel 4.13 this no longer occurs (the "MDS lagging behind" still happens, but it recovers within minutes and the rsync does not need to be aborted). I'm not sure whether 12.2.1 fixed it by itself, or whether it was your config changes, which we applied at the same time:

    mds_session_autoclose = 10
    mds_reconnect_timeout = 10
    mds_blacklist_interval = 10
    mds_session_blacklist_on_timeout = false
    mds_session_blacklist_on_evict = false

Regards,
Daniel
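
PS: In case it helps, a minimal sketch of how these settings could be laid out in ceph.conf. The [mds] section placement and the two cache/fragment values are placeholders for illustration only, not necessarily what we actually run:

    [mds]
        # Placeholder values only -- size these to your cluster's memory and
        # workload; we only noted above that we had to raise them.
        mds_cache_size = 1000000
        mds_bal_fragment_size_max = 200000

        # Eviction/blacklist settings mentioned above, values as listed:
        mds_session_autoclose = 10
        mds_reconnect_timeout = 10
        mds_blacklist_interval = 10
        mds_session_blacklist_on_timeout = false
        mds_session_blacklist_on_evict = false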