On Fri, Dec 15, 2017 at 8:46 PM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> On Fri, Dec 15, 2017 at 6:54 PM, Webert de Souza Lima
> <webert.boss@xxxxxxxxx> wrote:
>> Hello, Mr. Yan
>>
>> On Thu, Dec 14, 2017 at 11:36 PM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>>>
>>> The client holds so many capabilities because the kernel keeps lots
>>> of inodes in its cache. The kernel does not trim inodes by itself if
>>> it has no memory pressure. It seems you have set the mds_cache_size
>>> config to a large value.
>>
>> Yes, I have set mds_cache_size = 3000000.
>> I usually set this value according to the number of ceph.dir.rentries
>> in CephFS. Isn't that a good approach?
>>
>> I have 2 directories in the CephFS root; the sum of their
>> ceph.dir.rentries is 4670933, for which I would set mds_cache_size to
>> 5M (if I had enough RAM for that in the MDS server).
>>
>> # getfattr -d -m ceph.dir.* index
>> # file: index
>> ceph.dir.entries="776"
>> ceph.dir.files="0"
>> ceph.dir.rbytes="52742318965"
>> ceph.dir.rctime="1513334528.09909569540"
>> ceph.dir.rentries="709233"
>> ceph.dir.rfiles="459512"
>> ceph.dir.rsubdirs="249721"
>> ceph.dir.subdirs="776"
>>
>> # getfattr -d -m ceph.dir.* mail
>> # file: mail
>> ceph.dir.entries="786"
>> ceph.dir.files="1"
>> ceph.dir.rbytes="15000378101390"
>> ceph.dir.rctime="1513334524.0993982498"
>> ceph.dir.rentries="3961700"
>> ceph.dir.rfiles="3531068"
>> ceph.dir.rsubdirs="430632"
>> ceph.dir.subdirs="785"
>>
>>> The mds cache size isn't large enough, so the mds does not ask the
>>> client to trim its inode cache either. This can affect performance.
>>> We should make the mds recognize idle clients and ask them to trim
>>> their caps more aggressively.
>>
>> I think you mean that the mds cache IS large enough, right? So it
>> doesn't bother the clients.

Yes, I mean the cache config is large enough.

>>> This can affect performance.
>>> We should make the mds recognize idle clients and ask them to trim
>>> their caps more aggressively.
>>
>> One recurrent problem I have, which I guess is caused by a network
>> issue (the Ceph cluster is in a vRack), is that my MDS servers start
>> switching over which one is the active.
>> This happens after a lease_timeout occurs in the mon; then I get "dne
>> in the mds map" from the active MDS and it suicides.
>> Even though I use standby-replay, the standby takes from 15 minutes
>> up to 2 hours to take over as active. I can see it loading all the
>> inodes (by issuing "perf dump mds" on the mds daemon).
>>
>> So, the question is: if the number of caps were as low as it is
>> supposed to be (around 300k) instead of 5M, would the MDS become
>> active faster in such a failure case?

300k is already quite a lot; opening that many inodes takes a long
time. Does your mail server really open so many files?

> Yes, mds recovery should be faster when clients hold fewer caps.
> Recent kernel clients and ceph-fuse should trim their caches
> aggressively when the mds recovers.

I checked the 4.4 kernel; it includes the code that trims the cache
when the mds recovers.

> Regards
> Yan, Zheng
>
>>
>> Regards,
>>
>> Webert Lima
>> DevOps Engineer at MAV Tecnologia
>> Belo Horizonte - Brasil
>> IRC NICK - WebertRLZ

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
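[Editor's note] For context, the setting discussed throughout the thread is the inode-count cache limit that Webert set to 3000000. A minimal ceph.conf fragment for that era would look like the sketch below; note that in Luminous and later this option was superseded by the byte-based mds_cache_memory_limit:

```ini
[mds]
# Maximum number of inodes the MDS will keep in its cache
# (pre-Luminous option; newer releases use mds_cache_memory_limit).
mds cache size = 3000000
```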
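[Editor's note] Webert's sizing rule in the thread (set mds_cache_size to roughly the total ceph.dir.rentries of the top-level directories, rounded up) can be sketched as a small shell calculation using the rentries values reported above. Rounding up to the next whole million is my assumption about what "set mds_cache_size to 5M" means, not something the thread states explicitly:

```shell
# ceph.dir.rentries values taken from the getfattr output in the thread
index_rentries=709233
mail_rentries=3961700

total=$((index_rentries + mail_rentries))
echo "total rentries: $total"

# Round up to the next whole million as a cache-size target
# (assumed interpretation of "I would set mds_cache_size to 5M").
suggested=$(( ((total + 999999) / 1000000) * 1000000 ))
echo "suggested mds_cache_size: $suggested"
```

On a live system the per-directory values would come from `getfattr --only-values -n ceph.dir.rentries <dir>`, as in the `getfattr -d -m ceph.dir.*` commands shown in the thread.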