Tuning Ceph MDS cache settings

Hello ceph-users. I'm operating a moderately large Ceph cluster with CephFS. We currently have 288 OSDs, all on 10 TB drives, and are getting ready to migrate another 432 drives into the cluster (I'll have more questions on that later). Our workload is highly distributed: containerized clients running across 32 hosts, amounting to more than 30k "clients". We're running six active MDS daemons, each with about 32 GB of cache. Metadata is stored with replica 3, and actual data with replica 2 (not sure if this matters for this discussion).
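
For reference, this is roughly how those pieces are set on our end; the fs and pool names below are just illustrative, not necessarily our real ones:

    ceph fs set cephfs max_mds 6
    ceph config set mds mds_cache_memory_limit 34359738368   # ~32 GiB per MDS daemon
    ceph osd pool set cephfs_metadata size 3                 # metadata pool, replica 3
    ceph osd pool set cephfs_data size 2                     # data pool, replica 2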

Generally speaking, performance is pretty OK. We realize we're making some compromises on memory for the MDS cache, and we're definitely compromising on memory available to the OSDs, intentionally limiting them a bit while we wait on additional resources.

While examining performance under load at scale, I noticed a marked performance improvement whenever I restarted certain MDS daemons. I was able to reproduce the improvement by issuing a "daemon mds.blah cache drop". The performance bump lasts for quite a long time--far longer than it takes for the cache to "fill" again according to the stats.
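
For clarity, this is roughly the sequence I go through when I see the bump (the mds name is just a placeholder, as above):

    ceph daemon mds.blah cache drop       # this alone reproduces the improvement
    ceph daemon mds.blah cache status     # this is what I watch to see the cache "fill" back up
    ceph daemon mds.blah perf dump        # plus the perf counters for rss and inode counts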

I've run across a couple of settings, but can't find much documentation on them. I'm wondering if someone can help explain to me why I might see that bump, and what I might be able to tune to increase performance there.

For reference, the majority of client accesses are into large directories of the form:

/root/files/hash[0:2]/hash[0:4]/hash
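
In other words, something like this (the actual hash function we use doesn't matter for the question; sha256 here is purely for illustration):

    h=$(echo -n "$filename" | sha256sum | awk '{print $1}')
    path="/root/files/${h:0:2}/${h:0:4}/${h}"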

I realize that this impacts locking and fragmentation. I'm hoping someone can help decipher some of the MDS config options so that I can see where making some changes might help.

Additionally, I noted a few client options that raised some questions. First, "client use random mds": according to the Mimic docs, this is "false" by default. If that's the case, how does a client choose an MDS to communicate with? And does it stick with that MDS forever? When I look at our MDS daemons, they all list connections from every client. We're using ceph-fuse after having serious issues with the kernel driver on RHEL7 (the mount would go stale for unknown reasons, and a full system reboot was required to clear the held capabilities from the MDS cluster and recover the mount on the affected system).
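
For what it's worth, this is roughly how I'm looking at it (mds name and admin socket path are placeholders; the exact socket name on the client hosts varies):

    ceph daemon mds.blah session ls        # every rank lists sessions from (nearly) every client
    # and on a client host, checking what the fuse client actually has set:
    ceph daemon /var/run/ceph/ceph-client.*.asok config get client_use_random_mds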

There's also "client caps release delay", which seems like it might help us if we decreased it, so a client wouldn't hold on to a directory for as long as it does by default. There are a few cache options, too, that I want to understand better.
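
If we did try lowering it, I assume it would be something along these lines (value in seconds, picked arbitrarily for illustration; please correct me if this isn't how it should be applied to the fuse clients):

    ceph config set client client_caps_release_delay 2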

I know this is long, but hopefully it makes sense and someone can give me a few pointers. If you need additional information to comment, please feel free to ask!

--
Jonathan Woytek
http://www.dryrose.com
KB3HOZ
PGP:  462C 5F50 144D 6B09 3B65  FCE8 C1DC DEC4 E8B6 AABC
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
