Hi,

On 12/05/17 17:58, Dan Jakubiec wrote:
> Is this a configuration problem or a bug?

We had massive problems with both Kraken (Feb-Sept 2017) and Luminous (12.2.0), seeing the same behaviour as you. Our ceph.conf contained only defaults, except that we had to crank up mds_cache_size and mds_bal_fragment_size_max. Using directory fragmentation (dirfrag) and multi-MDS did not change anything.

Even with Luminous (12.2.0), a single rsync over a large directory tree could essentially kill CephFS for all clients within seconds, and even a waiting period of more than 8 hours did not help. Since the cluster was semi-productive and we couldn't take the downtime, we resorted to unmounting CephFS on all clients, flushing the journal, and re-mounting.

Interestingly, with 12.2.1 on kernel 4.13 this no longer occurs (the "MDS lagging behind" still happens, but it recovers within minutes and the rsync does not need to be aborted). I'm not sure whether 12.2.1 fixed it by itself, or whether it was your config changes, which we applied at the same time:

    mds_session_autoclose = 10
    mds_reconnect_timeout = 10
    mds_blacklist_interval = 10
    mds_session_blacklist_on_timeout = false
    mds_session_blacklist_on_evict = false

Regards,
Daniel
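
PS: In case it helps, a minimal sketch of how these settings could be laid out in ceph.conf. The [mds] section placement and the two cache/fragment values are placeholders for illustration only, not necessarily what we actually run:

    [mds]
        # Placeholder values only -- size these to your cluster's memory and
        # workload; we only noted above that we had to raise them.
        mds_cache_size = 1000000
        mds_bal_fragment_size_max = 200000

        # Eviction/blacklist settings mentioned above, values as listed:
        mds_session_autoclose = 10
        mds_reconnect_timeout = 10
        mds_blacklist_interval = 10
        mds_session_blacklist_on_timeout = false
        mds_session_blacklist_on_evict = false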