4k req/s is too fast for a create workload on one MDS. That must
include other operations like getattr.
That is rsync going through millions of files checking which ones need
updating. Right now there are not actually any create operations, since
I restarted the copy job.
I wouldn't expect such extreme latency issues. Please share:
ceph config dump
ceph daemon mds.X cache status
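For reference, mds.X stands for the name of the active MDS daemon; it can be
looked up with "ceph fs status", and the daemon command has to be run on the
host where that MDS is running. Something along these lines (output file name
is just an example):

    ceph fs status                      # shows the active MDS name(s)
    ceph config dump > config_dump.txt
    ceph daemon mds.X cache status      # run on the host where mds.X lives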
Config dump: https://pastebin.com/1jTrjzA9
Cache status:
{
    "pool": {
        "items": 127688932,
        "bytes": 20401092561
    }
}
And the two perf dumps taken one second apart again, please.
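One way to capture those, with mds.X again as a placeholder for the active
daemon and arbitrary output file names:

    ceph daemon mds.X perf dump > perf_dump_1.json
    sleep 1
    ceph daemon mds.X perf dump > perf_dump_2.json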
Perf dump 1: https://pastebin.com/US3y6JEJ
Perf dump 2: https://pastebin.com/Mm02puje
Also, you said you removed the aggressive recall changes. I assume you didn't
reset them all the way back to the defaults, but just went back to the first
suggested change (10k/1.0)?
Either seems to work.
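For reference, the "aggressive recall" changes discussed above are presumably
the MDS cap-recall options; assuming the 10k/1.0 refers to mds_recall_max_caps
and mds_recall_max_decay_rate, applying the first suggested change and later
reverting to the defaults would look roughly like this:

    # first suggested change (10k/1.0)
    ceph config set mds mds_recall_max_caps 10000
    ceph config set mds mds_recall_max_decay_rate 1.0

    # back to the built-in defaults
    ceph config rm mds mds_recall_max_caps
    ceph config rm mds mds_recall_max_decay_rate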
I added two more MDSs to split the workload and got a steady 150 reqs/s after
that. Then I noticed that I still had a max segments setting left over from
one of my earlier attempts at fixing the cache runaway issue, and after
removing that I got 250-500 reqs/s, sometimes up to 1.5k (per MDS).

However, to generate the dumps for you, I changed my max_mds setting back to 1
and reqs/s went down to 80. After re-adding the two active MDSs, I am back at
higher numbers, although not quite as high as before. But I seem to remember
that last time it took several minutes, if not longer, until all MDSs received
approximately equal load.
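For completeness, and assuming the leftover "max segments" override was
mds_log_max_segments and the filesystem is simply called "cephfs", the steps
described above map to roughly this sequence:

    # add two more active ranks
    ceph fs set cephfs max_mds 3

    # drop the old journal segment override from an earlier experiment
    ceph config rm mds mds_log_max_segments

    # temporarily back to a single active MDS for the dumps, then up again
    ceph fs set cephfs max_mds 1
    ceph fs set cephfs max_mds 3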