Re: [Ceph-users] Re: MDS failing under load with large cache sizes

Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> · Mon, 5 Aug 2019 09:21:35 +0200

Hi,

> You can also try increasing the aggressiveness of the MDS recall but
> I'm surprised it's still a problem with the settings I gave you:
>
> ceph config set mds mds_recall_max_caps 15000
> ceph config set mds mds_recall_max_decay_rate 0.75

I finally had the chance to try the more aggressive recall settings, but 
they did not change anything. As soon as the client starts copying files 
again, the numbers go up and I get a health message that the client is 
failing to respond to cache pressure.

After this week of idle time, the dns/inos numbers (what does dns stand 
for anyway?) settled at around 8000k. That's basically the "idle" 
number that it goes back to when the client stops copying files. Though, 
for some weird reason, this number gets (quite) a bit higher every time 
(last time it was around 960k). Of course, I wouldn't expect it to go 
back all the way to zero, because that would mean dropping the entire 
cache for no reason, but it's still quite high and the same after 
restarting the MDS and all clients, which doesn't make a lot of sense to 
me. After resuming the copy job, the number went up to 20M in just the 
time it takes to write this email. There must be a bug somewhere.

> Can you share two captures of `ceph daemon mds.X perf dump` about 1
> second apart.

I attached the requested perf dumps.
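
For anyone who wants to compare the two captures themselves, here is a minimal sketch of how the counter deltas between them could be listed. It only assumes that each dump is a nested JSON object with numeric leaves. The helper names (flatten, diff_dumps, print_changes) are made up for illustration and are not part of any Ceph tooling.

```python
import json


def flatten(node, prefix=""):
    """Yield (dotted_key, value) for every numeric leaf of a nested dict."""
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            yield path, value


def diff_dumps(first, second):
    """Return {dotted_key: second - first} for counters whose value changed."""
    before = dict(flatten(first))
    after = dict(flatten(second))
    return {
        key: after[key] - before[key]
        for key in after
        if key in before and after[key] != before[key]
    }


def print_changes(path_first, path_second):
    """Load two dump files and print changed counters, largest change first."""
    with open(path_first) as f1, open(path_second) as f2:
        changes = diff_dumps(json.load(f1), json.load(f2))
    for key, delta in sorted(changes.items(), key=lambda kv: -abs(kv[1])):
        print(f"{delta:>+15} {key}")
```

Calling print_changes("perf_dump_1.json", "perf_dump_2.json") would list every counter that moved between the two captures. Since they were taken about one second apart, the deltas read roughly as per-second rates.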

Thanks!

Attachment: perf_dump_1.json (application/json)
Attachment: perf_dump_2.json (application/json)