Hello Eugen,
thank you very much for the full explanation.
This fixed our cluster, and I am sure it will help a lot of people around
the world, since this problem occurs everywhere.
I think this should be added to the documentation:
https://docs.ceph.com/en/latest/cephfs/cache-configuration/#mds-recall
or better:
https://docs.ceph.com/en/quincy/cephfs/health-messages/#mds-client-recall-mds-health-client-recall-many
Best wishes!
Malte
On 09.08.22 at 16:34, Eugen Block wrote:
> Hi,
>
>> did you have some success with modifying the mentioned values?
>
> yes, the SUSE team helped identify the issue; I can share their
> explanation:
>
> ---snip---
> Every second (the mds_cache_trim_interval config param) the MDS runs
> its "cache trim" procedure. One of the steps of this procedure is
> "recall client state". During this step it checks every client (session)
> to see whether it needs to recall caps. There are several criteria for this:
>
> 1) the cache is full (exceeds mds_cache_memory_limit) and needs some
> inodes to be released;
> 2) the client exceeds mds_max_caps_per_client (1M by default);
> 3) the client is inactive.
>
> To determine whether a client (session) is inactive, the session's
> cache_liveness parameter is checked and compared with the value:
>
> (num_caps >> mds_session_cache_liveness_magnitude)
>
> where mds_session_cache_liveness_magnitude is a config param (10 by
> default).
> If cache_liveness is smaller than this calculated value, the session is
> considered inactive and the MDS sends a "recall caps" request for all
> cached caps (actually the recall value is `num_caps -
> mds_min_caps_per_client` (100)).
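>
> As a small Python sketch (only an illustration of the description above,
> not the actual MDS code; the parameter defaults are the ones mentioned
> above):
>
>   def caps_to_recall(num_caps, cache_liveness, cache_full=False,
>                      mds_max_caps_per_client=1_000_000,
>                      mds_session_cache_liveness_magnitude=10,
>                      mds_min_caps_per_client=100):
>       # criterion 3: the session counts as inactive when its liveness
>       # drops below (num_caps >> mds_session_cache_liveness_magnitude)
>       inactive = cache_liveness < (num_caps >> mds_session_cache_liveness_magnitude)
>       if cache_full or num_caps > mds_max_caps_per_client or inactive:
>           # recall everything above mds_min_caps_per_client
>           return max(num_caps - mds_min_caps_per_client, 0)
>       return 0
>
>   caps_to_recall(20_000, cache_liveness=12.9)
>   # -> 19900 (20000 >> 10 is 19, and 12.9 < 19, so the session counts as inactive)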
>
> And if the client is not releasing the caps fast enough, the next second
> this repeats, i.e. the MDS sends "recall caps" with a high value again,
> and so on; the "total" counter of "recall caps" for the session keeps
> growing, eventually exceeding the mon warning limit.
> There is a throttling mechanism, controlled by the
> mds_recall_max_decay_threshold parameter (126K by default), which should
> reduce the rate at which the "recall caps" counter grows, but it looks
> like it is not enough in this case.
>
> From the collected sessions, I see that during that 30 minute period
> the total num_caps for that client decreased by about 3500.
> ...
> Here is an example. A client has 20k caps cached. At some moment
> the server decides the client is inactive (because the session's
> cache_liveness value is low). It starts to ask the client to release
> caps down to the mds_min_caps_per_client value (100 by default). For
> this, every second it sends recall_caps asking to release `num_caps -
> mds_min_caps_per_client` caps (but not more than `mds_recall_max_caps`,
> which is 30k by default). The client starts to release, but at a rate
> of e.g. only 100 caps per second.
>
> So in the first second the MDS sends recall_caps = 20k - 100,
> in the second second recall_caps = (20k - 100) - 100,
> in the third second recall_caps = (20k - 200) - 100,
> and so on.
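>
> The same arithmetic as a short Python sketch (just an illustration of
> the example, not the MDS implementation):
>
>   # A client holding 20k caps releases only 100 caps per second while
>   # the MDS keeps asking, every second, for everything above the minimum.
>   num_caps = 20_000
>   mds_min_caps_per_client = 100
>   mds_recall_max_caps = 30_000       # default mentioned above
>   released_per_second = 100
>
>   asked_last_minute = 0
>   for second in range(60):
>       asked = min(num_caps - mds_min_caps_per_client, mds_recall_max_caps)
>       asked_last_minute += asked     # roughly what the session's recall_caps tracks
>       num_caps -= released_per_second
>   print(asked_last_minute)           # prints 1017000, i.e. ~1 million caps requested in one minute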
>
> And every time it sends recall_caps it updates the session's recall_caps
> value, which tracks how many caps were asked to be recalled over the
> last minute (a decaying counter with a 60 s half-life, as the session ls
> output further down shows). I.e. the counter grows quickly, eventually
> exceeding mds_recall_warning_threshold, which is 128K by default, and
> Ceph starts to report the "failing to respond to cache pressure" warning
> in the status.
>
> Now, after we set mds_recall_max_caps to 3K, in this situation the MDS
> server sends only 3K recall_caps per second, and the maximum value the
> session's recall_caps value can reach (if the MDS is sending 3K every
> second for at least one minute) is about 60 * 3K = 180K. I.e. it is
> still possible to reach mds_recall_warning_threshold, but only if a
> client is not "responding" for a long period, and as your experiments
> show, that is not the case.
> ---snip---
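>
> Using the same one-minute approximation, a candidate mds_recall_max_caps
> can be sanity-checked against the warning threshold the MDS actually
> reports (a rough sketch only; it ignores the decay of the counter and
> whatever the client manages to release in the meantime):
>
>   # 262144 is the mds_recall_warning_threshold value reported in the
>   # MDS log message further down in this thread.
>   mds_recall_warning_threshold = 262144
>
>   for mds_recall_max_caps in (30_000, 10_000, 3_000):
>       worst_case = 60 * mds_recall_max_caps   # one minute of recall requests
>       print(mds_recall_max_caps, worst_case,
>             worst_case > mds_recall_warning_threshold)
>   # 30000 -> 1800000 True, 10000 -> 600000 True, 3000 -> 180000 False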
>
> So what helped us here was to decrease mds_recall_max_caps in 1k steps,
> starting at 10000. That didn't reduce the warnings, so I decreased it
> further to 3000, and I haven't seen those warnings since. I also
> decreased the mds_cache_memory_limit back again, since increasing it
> hadn't helped here.
>
> Regards,
> Eugen
>
>
> Quoting Malte Stroem <malte.stroem@xxxxxxxxx>:
>
>> Hello Eugen,
>>
>> did you have some success with modifying the mentioned values?
>>
>> Or some others from:
>>
>> https://docs.ceph.com/en/latest/cephfs/cache-configuration/
>>
>> Best,
>> Malte
>>
>> On 15.06.22 at 14:12, Eugen Block wrote:
>>> Hi *,
>>>
>>> I finally caught some debug logs during the cache pressure warnings.
>>> In the meantime I had doubled the mds_cache_memory_limit to 128 GB,
>>> which decreased the number of cache pressure messages significantly,
>>> but they still appear a few times per day.
>>>
>>> Turning on debug logs for a few seconds results in a 1 GB file, but I
>>> found this message:
>>>
>>> 2022-06-15 10:07:34.254 7fdbbd44a700 2 mds.beacon.stmailmds01b-8
>>> Session chead015:cephfs_client (2757628057) is not releasing caps
>>> fast enough. Recalled caps at 390118 > 262144
>>> (mds_recall_warning_threshold).
>>>
>>> So now I know which limit is being hit here; the question is what to
>>> do about it. Should I increase mds_recall_warning_threshold (default
>>> 256k), or should I maybe increase mds_recall_max_caps (currently at
>>> 60k, default is 50k)? Any other suggestions? I'd appreciate any
>>> comments.
>>>
>>> Thanks,
>>> Eugen
>>>
>>>
>>> Quoting Eugen Block <eblock@xxxxxx>:
>>>
>>>> Hi,
>>>>
>>>> I'm currently debugging a recurring issue with multi-active MDS.
>>>> The cluster is still on Nautilus and can't be upgraded at this time.
>>>> There have been many discussions about "cache pressure" and I was
>>>> able to find the right settings a couple of times, but before I
>>>> change too much in this setup I'd like to ask for your opinion. I'll
>>>> add some information at the end.
>>>> So we have 16 active MDS daemons spread over 2 servers for one
>>>> cephfs (8 daemons per server) with mds_cache_memory_limit = 64GB;
>>>> the MDS servers are mostly idle except for some short peaks. Each of
>>>> the MDS daemons uses around 2 GB according to 'ceph daemon mds.<MDS>
>>>> cache status', so we're nowhere near the 64GB limit. There are
>>>> currently 25 servers that mount the cephfs as clients.
>>>> Watching the ceph health output I can see that the reported clients
>>>> with cache pressure change over time, so they are not actually stuck
>>>> but just don't respond as quickly as the MDS would like them to (I
>>>> assume). For some of the mentioned clients I see high values for
>>>> .recall_caps.value in the 'daemon session ls' output (at the bottom).
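>>>>
>>>> For reading those values: as far as I understand, each value/halflife
>>>> pair in the session ls output is a decaying counter, i.e. hits are
>>>> added to the value and the value halves every `halflife` seconds. A
>>>> small Python sketch of that idea (just my reading of the output, not
>>>> Ceph code):
>>>>
>>>>   class DecayingCounter:
>>>>       """Toy model of the value/halflife pairs in 'session ls'."""
>>>>       def __init__(self, halflife):
>>>>           self.halflife = halflife
>>>>           self.value = 0.0
>>>>           self.last = 0.0
>>>>
>>>>       def get(self, now):
>>>>           # exponential decay: the value halves every `halflife` seconds
>>>>           return self.value * 0.5 ** ((now - self.last) / self.halflife)
>>>>
>>>>       def hit(self, now, amount):
>>>>           self.value = self.get(now) + amount
>>>>           self.last = now
>>>>
>>>>   # e.g. recall_caps has halflife 60: a value of ~790k drops to ~395k
>>>>   # one minute after the MDS stops recalling caps from that session.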
>>>>
>>>> The docs basically state this:
>>>>> When the MDS needs to shrink its cache (to stay within
>>>>> mds_cache_size), it sends messages to clients to shrink their
>>>>> caches too. The client is unresponsive to MDS requests to release
>>>>> cached inodes. Either the client is unresponsive or has a bug
>>>>
>>>> To me it doesn't seem like the MDS servers are near the cache size
>>>> limit, so it has to be the clients, right? In a different setup it
>>>> helped to decrease the client_oc_size from 200MB to 100MB, but then
>>>> there's also client_cache_size with 16K default. I'm not sure what
>>>> the best approach would be here. I'd appreciate any comments on how
>>>> to size the various cache/caps/threshold configurations.
>>>>
>>>> Thanks!
>>>> Eugen
>>>>
>>>>
>>>> ---snip---
>>>> # ceph daemon mds.<MDS> session ls
>>>>
>>>> "id": 2728101146,
>>>> "entity": {
>>>> "name": {
>>>> "type": "client",
>>>> "num": 2728101146
>>>> },
>>>> [...]
>>>> "nonce": 1105499797
>>>> }
>>>> },
>>>> "state": "open",
>>>> "num_leases": 0,
>>>> "num_caps": 16158,
>>>> "request_load_avg": 0,
>>>> "uptime": 1118066.210318422,
>>>> "requests_in_flight": 0,
>>>> "completed_requests": [],
>>>> "reconnecting": false,
>>>> "recall_caps": {
>>>> "value": 788916.8276369586,
>>>> "halflife": 60
>>>> },
>>>> "release_caps": {
>>>> "value": 8.814981576458962,
>>>> "halflife": 60
>>>> },
>>>> "recall_caps_throttle": {
>>>> "value": 27379.27162576508,
>>>> "halflife": 1.5
>>>> },
>>>> "recall_caps_throttle2o": {
>>>> "value": 5382.261925615086,
>>>> "halflife": 0.5
>>>> },
>>>> "session_cache_liveness": {
>>>> "value": 12.91841737465921,
>>>> "halflife": 300
>>>> },
>>>> "cap_acquisition": {
>>>> "value": 0,
>>>> "halflife": 10
>>>> },
>>>> [...]
>>>> "used_inos": [],
>>>> "client_metadata": {
>>>> "features": "0x0000000000003bff",
>>>> "entity_id": "cephfs_client",
>>>>
>>>>
>>>> # ceph fs status
>>>>
>>>> cephfs - 25 clients
>>>> ======
>>>> +------+--------+----------------+---------------+-------+-------+
>>>> | Rank | State | MDS | Activity | dns | inos |
>>>> +------+--------+----------------+---------------+-------+-------+
>>>> | 0 | active | stmailmds01d-3 | Reqs: 89 /s | 375k | 371k |
>>>> | 1 | active | stmailmds01d-4 | Reqs: 64 /s | 386k | 383k |
>>>> | 2 | active | stmailmds01a-3 | Reqs: 9 /s | 403k | 399k |
>>>> | 3 | active | stmailmds01a-8 | Reqs: 23 /s | 393k | 390k |
>>>> | 4 | active | stmailmds01a-2 | Reqs: 36 /s | 391k | 387k |
>>>> | 5 | active | stmailmds01a-4 | Reqs: 57 /s | 394k | 390k |
>>>> | 6 | active | stmailmds01a-6 | Reqs: 50 /s | 395k | 391k |
>>>> | 7 | active | stmailmds01d-5 | Reqs: 37 /s | 384k | 380k |
>>>> | 8 | active | stmailmds01a-5 | Reqs: 39 /s | 397k | 394k |
>>>> | 9 | active | stmailmds01a | Reqs: 23 /s | 400k | 396k |
>>>> | 10 | active | stmailmds01d-8 | Reqs: 74 /s | 402k | 399k |
>>>> | 11 | active | stmailmds01d-6 | Reqs: 37 /s | 399k | 395k |
>>>> | 12 | active | stmailmds01d | Reqs: 36 /s | 394k | 390k |
>>>> | 13 | active | stmailmds01d-7 | Reqs: 80 /s | 397k | 393k |
>>>> | 14 | active | stmailmds01d-2 | Reqs: 56 /s | 414k | 410k |
>>>> | 15 | active | stmailmds01a-7 | Reqs: 25 /s | 390k | 387k |
>>>> +------+--------+----------------+---------------+-------+-------+
>>>> +-----------------+----------+-------+-------+
>>>> | Pool | type | used | avail |
>>>> +-----------------+----------+-------+-------+
>>>> | cephfs_metadata | metadata | 25.4G | 16.1T |
>>>> | cephfs_data | data | 2078G | 16.1T |
>>>> +-----------------+----------+-------+-------+
>>>> +----------------+
>>>> | Standby MDS |
>>>> +----------------+
>>>> | stmailmds01b-5 |
>>>> | stmailmds01b-2 |
>>>> | stmailmds01b-3 |
>>>> | stmailmds01b |
>>>> | stmailmds01b-7 |
>>>> | stmailmds01b-8 |
>>>> | stmailmds01b-6 |
>>>> | stmailmds01b-4 |
>>>> +----------------+
>>>> MDS version: ceph version 14.2.22-404-gf74e15c2e55
>>>> (f74e15c2e552b3359f5a51482dfd8b049e262743) nautilus (stable)
>>>> ---snip---
>>>
>>>
>>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx