Thanks, that helps. It looks like the problem is that the MDS is not automatically trimming its cache fast enough. Please try bumping mds_cache_trim_threshold:

    bin/ceph config set mds mds_cache_trim_threshold 512K
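For anyone trying this, a quick way to confirm the running MDS actually picked up the override and to keep an eye on the cache while the workload runs might look like the following (mds.a is just a placeholder for your MDS daemon name):

    # confirm the running MDS sees the new value
    ceph daemon mds.a config get mds_cache_trim_threshold

    # check how large the MDS cache currently is
    ceph daemon mds.a cache status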
That did help, somewhat. I removed the aggressive recall settings I had set earlier and only set this option instead. The cache size seems quite stable now, although it is still growing in the long run (but at least no longer strictly monotonically).
However, my client processes are now basically in a constant I/O wait state and the CephFS is slow for everybody. After I restarted the copy job, I got around 4k reqs/s, which then dropped to about 100 reqs/s with everybody waiting their turn. So yes, it does seem to help, but it increases latency by an order of magnitude.
As always, it would be great if these options were documented somewhere. Google has like five results, one of them being this thread. ;-)
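(For anyone who wants to watch the same thing: ceph fs status reports a per-MDS request rate, and the admin socket exposes the raw counters. mds.a and the 2-second interval below are only examples.)

    # per-MDS request rate and cache size, refreshed every 2 seconds
    watch -n 2 ceph fs status

    # raw request/reply counters over the admin socket
    ceph daemon mds.a perf dump mds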
Increase it further if it's not aggressive enough. Please let us know if that helps. It shouldn't be necessary to do this, so I'll create a tracker ticket once we confirm that's the issue.
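(A rough sketch of what "increase it further" could look like, under the same assumption of an MDS named mds.a; 1M is only an example value, not a recommendation:)

    # raise the trim threshold further and check that the daemon picked it up
    ceph config set mds mds_cache_trim_threshold 1M
    ceph daemon mds.a config get mds_cache_trim_threshold

    # drop the override again later if it turns out to hurt latency too much
    ceph config rm mds mds_cache_trim_threshold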