Re: Multi-active MDS cache pressure

Hi,

This thread contains some really insightful information. Thanks, Eugen, for
sharing the explanation from the SUSE team. The docs can definitely be updated
with this; it might help a lot of people.
Could you help create a tracker for this? I'd like to add the info to the docs
and push a PR for it.

I agree, it's really valuable information. I'm quite busy this week but I'd be happy to create the tracker tomorrow or Friday.


Zitat von Dhairya Parmar <dparmar@xxxxxxxxxx>:

Hi there,

This thread contains some really insightful information. Thanks, Eugen, for
sharing the explanation from the SUSE team. The docs can definitely be updated
with this; it might help a lot of people.
Could you help create a tracker for this? I'd like to add the info to the docs
and push a PR for it.

On Wed, Aug 10, 2022 at 1:45 AM Malte Stroem <malte.stroem@xxxxxxxxx> wrote:

Hello Eugen,

thank you very much for the full explanation.

This fixed our cluster, and I am sure it will help a lot of people around
the world, since this is a problem occurring everywhere.

I think this should be added to the documentation:

https://docs.ceph.com/en/latest/cephfs/cache-configuration/#mds-recall

or better:


https://docs.ceph.com/en/quincy/cephfs/health-messages/#mds-client-recall-mds-health-client-recall-many

Best wishes!
Malte

Am 09.08.22 um 16:34 schrieb Eugen Block:
> Hi,
>
>> did you have some success with modifying the mentioned values?
>
> yes, the SUSE team helped identify the issue. I can share their
> explanation:
>
> ---snip---
> Every second (the mds_cache_trim_interval config param) the MDS runs its
> "cache trim" procedure. One of the steps of this procedure is "recall
> client state". During this step it checks every client (session) to see
> whether it needs to recall caps. There are several criteria for this:
>
> 1) the cache is full (exceeds mds_cache_memory_limit) and needs some
> inodes to be released;
> 2) the client exceeds mds_max_caps_per_client (1M by default);
> 3) the client is inactive.
>
> To determine whether a client (session) is inactive, the session's
> cache_liveness parameter is checked and compared with the value:
>
>    (num_caps >> mds_session_cache_liveness_magnitude)
>
> where mds_session_cache_liveness_magnitude is a config param (10 by
> default).
> If cache_liveness is smaller than this calculated value, the session is
> considered inactive and the mds sends a "recall caps" request for all
> cached caps (actually the recall value is `num_caps -
> mds_min_caps_per_client` (100)).
>
> And if the client is not releasing the caps fast enough, this repeats the
> next second, i.e. the mds will send "recall caps" with a high value again
> and so on, and the "total" counter of "recall caps" for the session will
> grow, eventually exceeding the mon warning limit.
> There is a throttling mechanism, controlled by the
> mds_recall_max_decay_threshold parameter (126K by default), which should
> reduce the rate at which the "recall caps" counter grows, but it looks
> like it is not enough in this case.
>
> From the collected sessions, I see that during that 30-minute period
> the total num_caps for that client decreased by about 3500.
> ...
> Here is an example. A client has 20k caps cached. At some moment
> the server decides the client is inactive (because the session's
> cache_liveness value is low). It starts to ask the client to release
> caps down to the mds_min_caps_per_client value (100 by default). For
> this, every second it sends recall_caps asking to release `caps_num -
> mds_min_caps_per_client` caps (but not more than `mds_recall_max_caps`,
> which is 30k by default). The client starts to release, but at a rate of
> e.g. only 100 caps per second.
>
> So in the first second the mds sends recall_caps = 20k - 100
> the second second recall_caps = (20k - 100) - 100
> the third second recall_caps = (20k - 200) - 100
> and so on
>
> And every time it sends recall_caps it updates the session's recall_caps
> value, which tracks how many caps were asked to be recalled within the
> last minute. I.e. the counter grows quickly, eventually exceeding
> mds_recall_warning_threshold, which is 128K by default, and ceph starts
> to report the "failing to respond to cache pressure" warning in the status.
>
> Now, after we set mds_recall_max_caps to 3K, in this situation the mds
> server sends only 3K recall_caps per second, and the maximum value the
> session's recall_caps counter may reach (if the mds is sending 3K every
> second for at least one minute) is 60 * 3K = 180K. I.e. it is still
> possible to reach mds_recall_warning_threshold, but only if a client is
> not "responding" for a long period, and as your experiments show this is
> not the case.
> ---snip---
>
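> To make the inactive-session check above a bit more concrete, here is a
> rough Python sketch of it (my own simplification, not part of the SUSE
> write-up; the parameter names and defaults are the ones mentioned above,
> and the helper name is made up):
>
> MDS_SESSION_CACHE_LIVENESS_MAGNITUDE = 10   # default
> MDS_MIN_CAPS_PER_CLIENT = 100               # default
> MDS_RECALL_MAX_CAPS = 30_000                # default per the explanation above
>
> def caps_to_recall(num_caps: int, cache_liveness: float) -> int:
>     """How many caps the MDS would ask an inactive session to release."""
>     # A session counts as inactive if its liveness is below
>     # num_caps >> mds_session_cache_liveness_magnitude.
>     if cache_liveness >= (num_caps >> MDS_SESSION_CACHE_LIVENESS_MAGNITUDE):
>         return 0
>     # Recall everything above the per-client minimum, capped at mds_recall_max_caps.
>     return max(0, min(num_caps - MDS_MIN_CAPS_PER_CLIENT, MDS_RECALL_MAX_CAPS))
>
> # With the numbers from the session dump further down in the thread
> # (num_caps 16158, cache_liveness ~12.9; 16158 >> 10 is 15), the session is
> # treated as inactive and asked to drop 16058 caps:
> print(caps_to_recall(16158, 12.9))
>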
> So what helped us here was to decrease mds_recall_max_caps in 1k steps,
> starting with 10000. This didn't reduce the warnings, so I went down to
> 3000, and I haven't seen those warnings since then. I also decreased the
> mds_cache_memory_limit again; it wasn't what was helping here.
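>
> To get a feeling for why 3K was enough, here is a rough Python sketch that
> replays the counter arithmetic from the example above (again just my own
> illustration, not the actual MDS code: the decay is approximated once per
> second, the client releases only 100 caps/s, and 128K is the warning
> threshold quoted above):
>
> HALFLIFE = 60.0                  # recall_caps halflife, see the session dump below
> WARNING_THRESHOLD = 128 * 1024   # mds_recall_warning_threshold
> MIN_CAPS_PER_CLIENT = 100        # mds_min_caps_per_client
> RELEASE_RATE = 100               # caps the slow client actually releases per second
>
> def seconds_until_warning(recall_max_caps, caps_cached=20_000, horizon=300):
>     """First second at which the session's recall_caps counter exceeds the
>     warning threshold, or None if it stays below it within the horizon."""
>     counter, caps = 0.0, caps_cached
>     for t in range(1, horizon + 1):
>         counter *= 0.5 ** (1.0 / HALFLIFE)                            # decay
>         counter += min(caps - MIN_CAPS_PER_CLIENT, recall_max_caps)   # recall sent
>         caps = max(MIN_CAPS_PER_CLIENT, caps - RELEASE_RATE)          # slow release
>         if counter > WARNING_THRESHOLD:
>             return t
>     return None
>
> print(seconds_until_warning(30_000))   # default: threshold crossed within seconds
> print(seconds_until_warning(3_000))    # 3K: only after ~a minute of a client
>                                        # that keeps ignoring the recalls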
>
> Regards,
> Eugen
>
>
> Zitat von Malte Stroem <malte.stroem@xxxxxxxxx>:
>
>> Hello Eugen,
>>
>> did you have some success with modifying the mentioned values?
>>
>> Or some others from:
>>
>> https://docs.ceph.com/en/latest/cephfs/cache-configuration/
>>
>> Best,
>> Malte
>>
>> Am 15.06.22 um 14:12 schrieb Eugen Block:
>>> Hi *,
>>>
>>> I finally caught some debug logs during the cache pressure warnings.
>>> In the meantime I had doubled the mds_cache_memory_limit to 128 GB,
>>> which decreased the number of cache pressure messages significantly, but
>>> they still appear a few times per day.
>>>
>>> Turning on debug logs for a few seconds results in a 1 GB file, but I
>>> found this message:
>>>
>>> 2022-06-15 10:07:34.254 7fdbbd44a700  2 mds.beacon.stmailmds01b-8
>>> Session chead015:cephfs_client (2757628057) is not releasing caps
>>> fast enough. Recalled caps at 390118 > 262144
>>> (mds_recall_warning_threshold).
>>>
>>> So now I know which limit is being hit here; the question is what to do
>>> about it. Should I increase mds_recall_warning_threshold (default
>>> 256k), or should I maybe increase mds_recall_max_caps (currently at
>>> 60k, default is 50k)? Any other suggestions? I'd appreciate any
>>> comments.
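>>>
>>> For reference, the current values of the relevant settings can be read
>>> from a running daemon via the admin socket, e.g. with a small Python
>>> wrapper like this (the daemon name is just the one from the log line
>>> above):
>>>
>>> import json
>>> import subprocess
>>>
>>> MDS = "mds.stmailmds01b-8"   # adjust to the daemon you want to inspect
>>> PARAMS = [
>>>     "mds_recall_max_caps",
>>>     "mds_recall_warning_threshold",
>>>     "mds_recall_max_decay_threshold",
>>>     "mds_min_caps_per_client",
>>>     "mds_max_caps_per_client",
>>> ]
>>>
>>> for param in PARAMS:
>>>     # 'ceph daemon <mds> config get <option>' returns a small JSON object
>>>     out = subprocess.check_output(["ceph", "daemon", MDS, "config", "get", param])
>>>     print(param, "=", json.loads(out)[param])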
>>>
>>> Thanks,
>>> Eugen
>>>
>>>
>>> Zitat von Eugen Block <eblock@xxxxxx>:
>>>
>>>> Hi,
>>>>
>>>> I'm currently debugging a recurring issue with multi-active MDS.
>>>> The cluster is still on Nautilus and can't be upgraded at this time.
>>>> There have been many discussions about "cache pressure" and I was
>>>> able to find the right settings a couple of times, but before I
>>>> change too much in this setup I'd like to ask for your opinion. I'll
>>>> add some information at the end.
>>>> So we have 16 active MDS daemons spread over 2 servers for one
>>>> cephfs (8 daemons per server) with mds_cache_memory_limit = 64GB;
>>>> the MDS servers are mostly idle except for some short peaks. Each of
>>>> the MDS daemons uses around 2 GB according to 'ceph daemon mds.<MDS>
>>>> cache status', so we're nowhere near the 64GB limit. There are
>>>> currently 25 servers that mount the cephfs as clients.
>>>> Watching the ceph health output I can see that the set of clients
>>>> reported with cache pressure changes over time, so they are not actually
>>>> stuck but just don't respond as quickly as the MDS would like them to
>>>> (I assume). For some of the mentioned clients I see high values for
>>>> .recall_caps.value in the 'daemon session ls' output (at the bottom).
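>>>>
>>>> A quick way to spot those sessions is to sort the 'session ls' output by
>>>> that value, e.g. with a small Python helper along these lines (the daemon
>>>> name is just an example):
>>>>
>>>> import json
>>>> import subprocess
>>>>
>>>> MDS = "mds.stmailmds01a-3"   # any active MDS daemon
>>>> sessions = json.loads(
>>>>     subprocess.check_output(["ceph", "daemon", MDS, "session", "ls"]))
>>>>
>>>> # Highest recall_caps first -- these are the clients the MDS keeps nagging.
>>>> top = sorted(sessions, key=lambda s: s["recall_caps"]["value"], reverse=True)
>>>> for s in top[:10]:
>>>>     print("client.%d num_caps=%d recall_caps=%.0f liveness=%.2f" % (
>>>>         s["id"], s["num_caps"], s["recall_caps"]["value"],
>>>>         s["session_cache_liveness"]["value"]))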
>>>>
>>>> The docs basically state this:
>>>>> When the MDS needs to shrink its cache (to stay within
>>>>> mds_cache_size), it sends messages to clients to shrink their
>>>>> caches too. The client is unresponsive to MDS requests to release
>>>>> cached inodes. Either the client is unresponsive or has a bug
>>>>
>>>> To me it doesn't seem like the MDS servers are near the cache size
>>>> limit, so it has to be the clients, right? In a different setup it
>>>> helped to decrease the client_oc_size from 200MB to 100MB, but then
>>>> there's also client_cache_size with 16K default. I'm not sure what
>>>> the best approach would be here. I'd appreciate any comments on how
>>>> to size the various cache/caps/threshold configurations.
>>>>
>>>> Thanks!
>>>> Eugen
>>>>
>>>>
>>>> ---snip---
>>>> # ceph daemon mds.<MDS> session ls
>>>>
>>>>     "id": 2728101146,
>>>>     "entity": {
>>>>       "name": {
>>>>         "type": "client",
>>>>         "num": 2728101146
>>>>       },
>>>> [...]
>>>>         "nonce": 1105499797
>>>>       }
>>>>     },
>>>>     "state": "open",
>>>>     "num_leases": 0,
>>>>     "num_caps": 16158,
>>>>     "request_load_avg": 0,
>>>>     "uptime": 1118066.210318422,
>>>>     "requests_in_flight": 0,
>>>>     "completed_requests": [],
>>>>     "reconnecting": false,
>>>>     "recall_caps": {
>>>>       "value": 788916.8276369586,
>>>>       "halflife": 60
>>>>     },
>>>>     "release_caps": {
>>>>       "value": 8.814981576458962,
>>>>       "halflife": 60
>>>>     },
>>>>     "recall_caps_throttle": {
>>>>       "value": 27379.27162576508,
>>>>       "halflife": 1.5
>>>>     },
>>>>     "recall_caps_throttle2o": {
>>>>       "value": 5382.261925615086,
>>>>       "halflife": 0.5
>>>>     },
>>>>     "session_cache_liveness": {
>>>>       "value": 12.91841737465921,
>>>>       "halflife": 300
>>>>     },
>>>>     "cap_acquisition": {
>>>>       "value": 0,
>>>>       "halflife": 10
>>>>     },
>>>> [...]
>>>>     "used_inos": [],
>>>>     "client_metadata": {
>>>>       "features": "0x0000000000003bff",
>>>>       "entity_id": "cephfs_client",
>>>>
>>>>
>>>> # ceph fs status
>>>>
>>>> cephfs - 25 clients
>>>> ======
>>>> +------+--------+----------------+---------------+-------+-------+
>>>> | Rank | State  |      MDS       |    Activity   |  dns  |  inos |
>>>> +------+--------+----------------+---------------+-------+-------+
>>>> |  0   | active | stmailmds01d-3 | Reqs:   89 /s |  375k |  371k |
>>>> |  1   | active | stmailmds01d-4 | Reqs:   64 /s |  386k |  383k |
>>>> |  2   | active | stmailmds01a-3 | Reqs:    9 /s |  403k |  399k |
>>>> |  3   | active | stmailmds01a-8 | Reqs:   23 /s |  393k |  390k |
>>>> |  4   | active | stmailmds01a-2 | Reqs:   36 /s |  391k |  387k |
>>>> |  5   | active | stmailmds01a-4 | Reqs:   57 /s |  394k |  390k |
>>>> |  6   | active | stmailmds01a-6 | Reqs:   50 /s |  395k |  391k |
>>>> |  7   | active | stmailmds01d-5 | Reqs:   37 /s |  384k |  380k |
>>>> |  8   | active | stmailmds01a-5 | Reqs:   39 /s |  397k |  394k |
>>>> |  9   | active |  stmailmds01a  | Reqs:   23 /s |  400k |  396k |
>>>> |  10  | active | stmailmds01d-8 | Reqs:   74 /s |  402k |  399k |
>>>> |  11  | active | stmailmds01d-6 | Reqs:   37 /s |  399k |  395k |
>>>> |  12  | active |  stmailmds01d  | Reqs:   36 /s |  394k |  390k |
>>>> |  13  | active | stmailmds01d-7 | Reqs:   80 /s |  397k |  393k |
>>>> |  14  | active | stmailmds01d-2 | Reqs:   56 /s |  414k |  410k |
>>>> |  15  | active | stmailmds01a-7 | Reqs:   25 /s |  390k |  387k |
>>>> +------+--------+----------------+---------------+-------+-------+
>>>> +-----------------+----------+-------+-------+
>>>> |       Pool      |   type   |  used | avail |
>>>> +-----------------+----------+-------+-------+
>>>> | cephfs_metadata | metadata | 25.4G | 16.1T |
>>>> |   cephfs_data   |   data   | 2078G | 16.1T |
>>>> +-----------------+----------+-------+-------+
>>>> +----------------+
>>>> |  Standby MDS   |
>>>> +----------------+
>>>> | stmailmds01b-5 |
>>>> | stmailmds01b-2 |
>>>> | stmailmds01b-3 |
>>>> |  stmailmds01b  |
>>>> | stmailmds01b-7 |
>>>> | stmailmds01b-8 |
>>>> | stmailmds01b-6 |
>>>> | stmailmds01b-4 |
>>>> +----------------+
>>>> MDS version: ceph version 14.2.22-404-gf74e15c2e55
>>>> (f74e15c2e552b3359f5a51482dfd8b049e262743) nautilus (stable)
>>>> ---snip---
>>>
>>>
>>>
>
>
>



--
Dhairya Parmar

He/Him/His

Associate Software Engineer, CephFS

Red Hat Inc. <https://www.redhat.com/>

dparmar@xxxxxxxxxx



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


