Re: cephfs mds millions of caps

Webert de Souza Lima <webert.boss@xxxxxxxxx> · Fri, 15 Dec 2017 08:54:28 -0200

Hello, Mr. Yan

On Thu, Dec 14, 2017 at 11:36 PM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:

> The client holds so many capabilities because the kernel keeps lots of
> inodes in its cache. The kernel does not trim inodes by itself if it is
> under no memory pressure. It seems you have set the mds_cache_size config
> to a large value.

Yes, I have set mds_cache_size = 3000000
I usually set this value according to the number of ceph.dir.rentries in cephfs. Isn't that a good approach?

I have two directories in the cephfs root, and their ceph.dir.rentries values sum to 4670933, so I would set mds_cache_size to 5M (if the MDS server had enough RAM for that).

# getfattr -d -m ceph.dir.* index
# file: index
ceph.dir.entries="776"
ceph.dir.files="0"
ceph.dir.rbytes="52742318965"
ceph.dir.rctime="1513334528.09909569540"
ceph.dir.rentries="709233"
ceph.dir.rfiles="459512"
ceph.dir.rsubdirs="249721"
ceph.dir.subdirs="776"

# getfattr -d -m ceph.dir.* mail
# file: mail
ceph.dir.entries="786"
ceph.dir.files="1"
ceph.dir.rbytes="15000378101390"
ceph.dir.rctime="1513334524.0993982498"
ceph.dir.rentries="3961700"
ceph.dir.rfiles="3531068"
ceph.dir.rsubdirs="430632"
ceph.dir.subdirs="785"
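As a rough sketch of the sizing rule described above, the following Python snippet parses getfattr-style output and sums ceph.dir.rentries across directories. The helper names (parse_getfattr, total_rentries) are made up for illustration, and the sample text only repeats the two rentries values from the dumps above.

```python
import re

# Matches attribute lines such as: ceph.dir.rentries="709233"
_XATTR_LINE = re.compile(r'^(ceph\.dir\.\w+)="([^"]*)"$')


def parse_getfattr(dump):
    """Parse `getfattr -d` style output into {path: {attr_name: value}}."""
    result = {}
    current = None
    for raw in dump.splitlines():
        line = raw.strip()
        if line.startswith("# file: "):
            # A new "# file:" header starts the attributes of another path.
            current = line[len("# file: "):]
            result[current] = {}
            continue
        match = _XATTR_LINE.match(line)
        if match and current is not None:
            result[current][match.group(1)] = match.group(2)
    return result


def total_rentries(dump):
    """Sum ceph.dir.rentries over every directory found in the dump."""
    return sum(
        int(attrs["ceph.dir.rentries"])
        for attrs in parse_getfattr(dump).values()
    )


# Only the rentries lines from the two dumps above are repeated here.
SAMPLE = """\
# file: index
ceph.dir.rentries="709233"

# file: mail
ceph.dir.rentries="3961700"
"""

print(total_rentries(SAMPLE))  # 709233 + 3961700 = 4670933
```

Whether that total is actually a sensible value for mds_cache_size is exactly the question being asked here; the snippet only automates the arithmetic.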

> mds cache size isn't large enough, so mds does not ask
> the client to trim its inode cache either.

I think you mean that the mds cache IS large enough, right? So it doesn't bother the clients. 

> This can affect performance. We should make the mds recognize idle clients and ask them to trim their caps more aggressively.

One recurrent problem I have, which I guess is caused by a network issue (ceph cluster in vrack), is that my MDS servers keep switching which one is active.
This happens after a lease_timeout occurs in the mon; then I get "dne in the mds map" from the active MDS and it shuts itself down.
Even though I use standby-replay, the standby takes from 15 minutes up to 2 hours to take over as active. I can see that it loads all inodes (by issuing "perf dump mds" on the mds daemon).
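For anyone who wants to watch those numbers while the standby takes over, here is a minimal Python sketch that picks a few counters out of the JSON that "perf dump mds" prints. The counter names "inodes" and "caps" are my assumption about what sits in the "mds" section and may differ between releases; the daemon name passed to read_perf_dump is hypothetical, and the example JSON uses made-up numbers.

```python
import json
import subprocess


def mds_counters(perf_dump_json, names=("inodes", "caps")):
    """Pick selected counters from the "mds" section of a perf dump.

    The default counter names are assumptions; missing ones come back as None.
    """
    section = json.loads(perf_dump_json).get("mds", {})
    return {name: section.get(name) for name in names}


def read_perf_dump(daemon):
    """Run `ceph daemon <daemon> perf dump mds` on the MDS host, return its JSON text.

    `daemon` is something like "mds.a" (hypothetical name); this needs access
    to the daemon's admin socket, so it is not called in the example below.
    """
    completed = subprocess.run(
        ["ceph", "daemon", daemon, "perf", "dump", "mds"],
        check=True,
        capture_output=True,
        text=True,
    )
    return completed.stdout


# Made-up numbers, for illustration only (not taken from my cluster).
EXAMPLE = '{"mds": {"inodes": 1200, "caps": 300}}'
print(mds_counters(EXAMPLE))
```

On a real MDS host one would call mds_counters(read_perf_dump("mds.a")) in a loop to see how the inode count grows during replay.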

So the question is: if the number of caps were as low as it is supposed to be (around 300k) instead of 5M, would the MDS become active faster after such a failure?

Regards,
Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com