4k req/s is too fast for a create workload on one MDS. That must
include other operations like getattr.
That is rsync going through millions of files checking which ones need
updating. Right now there are not actually any create operations, since
I restarted the copy job.
I wouldn't expect such extreme latency issues. Please share:
ceph config dump
ceph daemon mds.X cache status
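For reference, mds.X stands for the name of the active MDS daemon; it can be
looked up with "ceph fs status", and the daemon command has to be run on the
host where that MDS is running. Something along these lines (output file name
is just an example):

    ceph fs status                      # shows the active MDS name(s)
    ceph config dump > config_dump.txt
    ceph daemon mds.X cache status      # run on the host where mds.X lives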
Config dump: https://pastebin.com/1jTrjzA9
Cache status:
{
    "pool": {
        "items": 127688932,
        "bytes": 20401092561
    }
}
And the two perf dumps taken one second apart again, please.
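One way to capture those, with mds.X again as a placeholder for the active
daemon and arbitrary output file names:

    ceph daemon mds.X perf dump > perf_dump_1.json
    sleep 1
    ceph daemon mds.X perf dump > perf_dump_2.json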
Perf dump 1: https://pastebin.com/US3y6JEJ
Perf dump 2: https://pastebin.com/Mm02puje
Also, you said you removed the aggressive recall changes. I assume you didn't
reset them all the way back to the defaults, but just went back to the first
suggested change (10k/1.0)?
Either seems to work.
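For reference, the "aggressive recall" changes discussed above are presumably
the MDS cap-recall options; assuming the 10k/1.0 refers to mds_recall_max_caps
and mds_recall_max_decay_rate, applying the first suggested change and later
reverting to the defaults would look roughly like this:

    # first suggested change (10k/1.0)
    ceph config set mds mds_recall_max_caps 10000
    ceph config set mds mds_recall_max_decay_rate 1.0

    # back to the built-in defaults
    ceph config rm mds mds_recall_max_caps
    ceph config rm mds mds_recall_max_decay_rate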
I added two more MDSs to split the workload and got a steady 150 reqs/s after
that. Then I noticed that I still had a max segments setting left over from
one of my earlier attempts at fixing the cache runaway issue, and after
removing that I got 250-500 reqs/s, sometimes up to 1.5k (per MDS).

However, to generate the dumps for you, I changed my max_mds setting back to 1
and reqs/s went down to 80. After re-adding the two active MDSs, I am back at
higher numbers, although not quite as high as before. But I seem to remember
that last time it took several minutes, if not longer, until all MDSs received
approximately equal load.
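For completeness, and assuming the leftover "max segments" override was
mds_log_max_segments and the filesystem is simply called "cephfs", the steps
described above map to roughly this sequence:

    # add two more active ranks
    ceph fs set cephfs max_mds 3

    # drop the old journal segment override from an earlier experiment
    ceph config rm mds mds_log_max_segments

    # temporarily back to a single active MDS for the dumps, then up again
    ceph fs set cephfs max_mds 1
    ceph fs set cephfs max_mds 3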