Hi,

We're looking at similar issues here and I was composing a mail just as you sent this. I'm just a user -- hopefully a dev will correct me where I'm wrong.

1. A CephFS capability ("cap") is a way to delegate permission for a client to do IO on a file, knowing that other clients are not also accessing that file. These caps need to be tracked so they can later be revoked when other clients want to access the same files. (I didn't find a doc on CephFS caps, so this is a guess and may well be wrong.)

2. If you set debug_mds = 3 you can see memory usage and how many caps are delegated in total (example commands at the end of this mail). Here's a sample log line:

  mds.0.cache check_memory_usage total 7988108, rss 7018088, heap -457420, malloc -1747875 mmap 0, baseline -457420, buffers 0, max 1048576, 332739 / 332812 inodes have caps, 335839 caps, 1.0091 caps per inode

It seems there is an int overflow in the heap and malloc measures on our server :( Anyway, once the MDS has delegated (I think) 90% of its max caps, it will start asking clients to give some back. If those clients don't release caps, or don't release them fast enough, you'll see...

3. "failing to respond to capability release" and "failing to respond to cache pressure". These can be caused by two different things: an old client -- maybe 3.14 is too old, like Wido said -- or a busy client. We have a trivial bash script that creates many small files in a loop; that client grabs new caps faster than it can release them.

3.b. BTW, our old friend updatedb seems to trigger the same problem, grabbing caps very quickly as it indexes CephFS. updatedb.conf is configured with PRUNEFS="... fuse ...", but CephFS mounts have type fuse.ceph-fuse. We'll need to add "ceph" to that list too (config snippet at the end of this mail).

4. "mds cache size = 5000000" is going to use a lot of memory! We have an MDS with just 8GB of RAM and it goes OOM after delegating around 1 million caps. (This is with mds cache size = 100000, btw.)

4.b. "mds cache size" is used for more than one purpose: it sets the size of the MDS LRU _and_ it sets the maximum number of client caps. Those seem like two completely different things... why is it the same config option?!

For me there are still a couple of things missing related to CephFS caps and memory usage:

 - a hard limit on the number of caps per client (to prevent a busy/broken client from DoS'ing the MDS)
 - an automatic way to forcibly revoke caps from a misbehaving client, e.g. revoke them and put the client into read-only or even no-IO mode
 - AFAICT, "mds mem max" has been unused since before argonaut -- we should remove it completely since it is confusing (PR incoming...)
 - the MDS should eventually auto-tune the mds cache size to fit the amount of available memory.
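In case it's useful, here is roughly what we run to watch this (from memory, so double-check the exact syntax for your release; "mds.0" is just a placeholder for your MDS name, and the daemon command has to run on the host where that MDS's admin socket lives):

  # bump MDS debugging at runtime so the check_memory_usage lines show up in the MDS log
  ceph tell mds.0 injectargs '--debug_mds 3'

  # list client sessions, including num_caps per client, to spot who is hoarding caps
  ceph daemon mds.0 session ls

The second one gives the same per-session output you attached, so you can watch whether that client's num_caps keeps growing while the warning is active.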
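And for the updatedb issue, the workaround on our side is just a one-line edit to /etc/updatedb.conf, something like the following (the exact entries depend on your distro's defaults and on how CephFS is mounted: ceph-fuse mounts report type fuse.ceph-fuse, the kernel client reports type ceph):

  PRUNEFS = "... fuse fuse.ceph-fuse ceph ..."

That stops mlocate from walking the whole CephFS tree every night and grabbing caps for everything it touches.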
Best Regards,

Dan

On Fri, Jul 3, 2015 at 10:25 AM, Mathias Buresch <mathias.buresch@xxxxxxxxxxxx> wrote:
> Hi there,
>
> maybe you could be so kind and help me with the following issue:
>
> We are running CephFS, but there is repeatedly a problem with the MDS.
>
> Sometimes the following error occurs: "mds0: Client 701782 failing to respond
> to capability release"
> Listing the session information shows that the "num_caps" on that client is
> much higher than on the other clients (see also the attachment).
>
> The problem is that the load on one of the servers increases to really high
> values (80 to 100), independent of which client is complaining.
>
> I guess my problem is also that I don't really understand the meaning of
> those "capabilities".
>
> Following facts (let me know if you need more):
>
> CephFS client, MDS, MON, OSD all on the same server
> Kernel client (kernel: 3.14.16-031416-generic)
> MDS config
>
> only raised "mds cache size = 5000000" (because before there was the error
> "failing to respond to cache pressure")
>
>
> Best regards
> Mathias

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com