Re: Multi-active MDS cache pressure

Hi there,

This thread contains some really insightful information. Thanks, Eugen, for
sharing the explanation from the SUSE team. The documentation can definitely
be updated with this; it would help a lot of people.
Could you create a tracker for this? I'd like to add the information to the
docs and push a PR for it.

On Wed, Aug 10, 2022 at 1:45 AM Malte Stroem <malte.stroem@xxxxxxxxx> wrote:

> Hello Eugen,
>
> thank you very much for the full explanation.
>
> This fixed our cluster and I am sure it will help a lot of people around
> the world, since this is a problem occurring everywhere.
>
> I think this should be added to the documentation:
>
> https://docs.ceph.com/en/latest/cephfs/cache-configuration/#mds-recall
>
> or better:
>
>
> https://docs.ceph.com/en/quincy/cephfs/health-messages/#mds-client-recall-mds-health-client-recall-many
>
> Best wishes!
> Malte
>
> On 09.08.22 16:34, Eugen Block wrote:
> > Hi,
> >
> >> did you have some success with modifying the mentioned values?
> >
> > yes, the SUSE team helped identify the issue. I can share their
> > explanation:
> >
> > ---snip---
> > Every second (mds_cache_trim_interval config param) the mds is running
> > "cache trim" procedure. One of the steps of this procedure is "recall
> > client state". During this step it checks every client (session) if it
> > needs to recall caps. There are several criteria for this:
> >
> > 1) the cache is full (exceeds mds_cache_memory_limit) and needs some
> > inodes to be released;
> > 2) the client exceeds mds_max_caps_per_client (1M by default);
> > 3) the client is inactive.
> >
> > To determine a client's (session's) inactivity, the session's
> > cache_liveness parameter is checked and compared with the value:
> >
> >    (num_caps >> mds_session_cache_liveness_magnitude)
> >
> > where mds_session_cache_liveness_magnitude is a config param (10 by
> > default).
> > If cache_liveness is smaller than this calculated value, the session is
> > considered inactive and the mds sends a "recall caps" request for all
> > cached caps (actually the recall value is `num_caps -
> > mds_min_caps_per_client(100)`).
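A minimal sketch of the inactivity check described above (my own model, not
Ceph code; the constants are the defaults quoted in this explanation):

```python
# Sketch of the MDS "recall client state" decision described above.
# Defaults per the explanation: magnitude 10, min caps per client 100.
MDS_SESSION_CACHE_LIVENESS_MAGNITUDE = 10
MDS_MIN_CAPS_PER_CLIENT = 100

def is_session_inactive(num_caps: int, cache_liveness: float) -> bool:
    # A session counts as inactive when its liveness falls below
    # num_caps >> mds_session_cache_liveness_magnitude.
    return cache_liveness < (num_caps >> MDS_SESSION_CACHE_LIVENESS_MAGNITUDE)

def recall_amount(num_caps: int) -> int:
    # For an inactive session the MDS recalls everything above the minimum.
    return max(num_caps - MDS_MIN_CAPS_PER_CLIENT, 0)

# A client holding 20k caps with low liveness (e.g. ~12.9, as in the
# session dump later in this thread) is asked to drop 19900 caps:
print(is_session_inactive(20000, 12.9))  # 12.9 < (20000 >> 10) == 19 -> True
print(recall_amount(20000))              # 19900
```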
> >
> > And if the client is not releasing the caps fast enough, the mds repeats
> > this the next second, i.e. it will send "recall caps" with a high value
> > again and so on, and the "total" counter of "recall caps" for the session
> > will grow, eventually exceeding the mon warning limit.
> > There is a throttling mechanism, controlled by the
> > mds_recall_max_decay_threshold parameter (126K by default), which should
> > reduce the growth rate of the "recall caps" counter, but it looks like it
> > is not enough in this case.
> >
> > From the collected sessions, I see that during that 30-minute period
> > the total num_caps for that client decreased by about 3500.
> > ...
> > Here is an example. A client has 20k caps cached. At some moment
> > the server decides the client is inactive (because the session's
> > cache_liveness value is low). It starts to ask the client to release
> > caps down to the mds_min_caps_per_client value (100 by default). For this,
> > every second it sends recall_caps asking to release `caps_num -
> > mds_min_caps_per_client` caps (but not more than `mds_recall_max_caps`,
> > which is 30k by default). The client starts to release, but only at a
> > rate of e.g. 100 caps per second.
> >
> > So in the first second the mds sends recall_caps = 20k - 100
> > the second second recall_caps = (20k - 100) - 100
> > the third second recall_caps = (20k - 200) - 100
> > and so on
> >
> > And every time it sends recall_caps it updates the session's recall_caps
> > value, which tracks how many recall_caps were sent in the last
> > minute. I.e. the counter grows quickly, eventually exceeding
> > mds_recall_warning_threshold, which is 128K by default, and ceph starts
> > to report the "failing to respond to cache pressure" warning in the status.
> >
> > Now, after we set mds_recall_max_caps to 3K, in this situation the mds
> > server sends only 3K recall_caps per second, and the maximum value the
> > session's recall_caps value may reach (if the mds is sending 3K every
> > second for at least one minute) is 60 * 3K = 180K. I.e. it is still
> > possible to reach mds_recall_warning_threshold, but only if a client is
> > not "responding" for a long period, and as your experiments show, that
> > is not the case.
> > ---snip---
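To make the arithmetic in the explanation concrete, here is a small
simulation of the described feedback loop (my own sketch: it models the
recall_caps counter as a simple sliding one-minute sum rather than Ceph's
actual decaying counter, and uses the defaults quoted above):

```python
# Simulate the per-second recall loop described above: a client holding
# 20k caps releases only 100 caps/s while the MDS keeps asking every second.
# recall_caps is approximated as the sum of requests sent in the last 60s.
from collections import deque

def seconds_until_warning(start_caps, release_rate, recall_max_caps,
                          warning_threshold=131072, min_caps=100):
    caps = start_caps
    window = deque()  # recall requests sent within the last 60 seconds
    for t in range(1, 3600):
        want = min(caps - min_caps, recall_max_caps)
        if want <= 0:
            return None  # client caught up before the warning triggered
        window.append(want)
        if len(window) > 60:
            window.popleft()
        if sum(window) > warning_threshold:
            return t  # counter exceeds mds_recall_warning_threshold here
        caps -= min(release_rate, caps - min_caps)
    return None

# Default mds_recall_max_caps (30k): the counter blows past 128K in seconds.
print(seconds_until_warning(20000, 100, 30000))  # -> 7

# With mds_recall_max_caps = 3000 the window sum is bounded by 60 * 3000 =
# 180K, so the threshold is reached only after a sustained ~44s of sending.
print(seconds_until_warning(20000, 100, 3000))   # -> 44
```

In this simplified model the warning fires within seven seconds at the
default recall rate, but only after about 44 seconds of continuous
non-response once the per-second recall is capped at 3K, which matches the
behaviour described above.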
> >
> > So what helped us here was to decrease mds_recall_max_caps in 1k steps,
> > starting with 10000. This didn't reduce the warnings, so I decreased it
> > to 3000 and haven't seen those warnings since. I also decreased the
> > mds_cache_memory_limit again, as it wasn't helping here.
> >
> > Regards,
> > Eugen
> >
> >
> > Quoting Malte Stroem <malte.stroem@xxxxxxxxx>:
> >
> >> Hello Eugen,
> >>
> >> did you have some success with modifying the mentioned values?
> >>
> >> Or some others from:
> >>
> >> https://docs.ceph.com/en/latest/cephfs/cache-configuration/
> >>
> >> Best,
> >> Malte
> >>
> >> On 15.06.22 14:12, Eugen Block wrote:
> >>> Hi *,
> >>>
> >>> I finally caught some debug logs during the cache pressure warnings.
> >>> In the meantime I had doubled the mds_cache_memory_limit to 128 GB,
> >>> which decreased the number of cache pressure messages significantly,
> >>> but they still appear a few times per day.
> >>>
> >>> Turning on debug logs for a few seconds results in a 1 GB file, but I
> >>> found this message:
> >>>
> >>> 2022-06-15 10:07:34.254 7fdbbd44a700  2 mds.beacon.stmailmds01b-8
> >>> Session chead015:cephfs_client (2757628057) is not releasing caps
> >>> fast enough. Recalled caps at 390118 > 262144
> >>> (mds_recall_warning_threshold).
> >>>
> >>> So now I know which limit is reached here; the question is what to do
> >>> about it. Should I increase mds_recall_warning_threshold (default
> >>> 256k), or should I maybe increase mds_recall_max_caps (currently at
> >>> 60k; default is 50k)? Any other suggestions? I'd appreciate any
> >>> comments.
> >>>
> >>> Thanks,
> >>> Eugen
> >>>
> >>>
> >>> Quoting Eugen Block <eblock@xxxxxx>:
> >>>
> >>>> Hi,
> >>>>
> >>>> I'm currently debugging a recurring issue with multi-active MDS.
> >>>> The cluster is still on Nautilus and can't be upgraded at this time.
> >>>> There have been many discussions about "cache pressure" and I was
> >>>> able to find the right settings a couple of times, but before I
> >>>> change too much in this setup I'd like to ask for your opinion. I'll
> >>>> add some information at the end.
> >>>> So we have 16 active MDS daemons spread over 2 servers for one
> >>>> cephfs (8 daemons per server) with mds_cache_memory_limit = 64GB,
> >>>> the MDS servers are mostly idle except for some short peaks. Each of
> >>>> the MDS daemons uses around 2 GB according to 'ceph daemon mds.<MDS>
> >>>> cache status', so we're nowhere near the 64GB limit. There are
> >>>> currently 25 servers that mount the cephfs as clients.
> >>>> Watching ceph health I can see that the reported clients with
> >>>> cache pressure change, so they are not actually stuck but just don't
> >>>> respond as quickly as the MDS would like them to (I assume). For
> >>>> some of the mentioned clients I see high values for
> >>>> .recall_caps.value in the 'daemon session ls' output (at the bottom).
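Sessions like that can be picked out of the `session ls` JSON without
scrolling through the full dump. A small sketch (my own helper, not part of
Ceph; field names as in the Nautilus output further down, threshold taken
from the warning message above):

```python
import json

# Flag sessions whose recall_caps decay counter exceeds a threshold --
# the pattern visible in the session dump below.
def noisy_sessions(session_ls_json: str, threshold: float = 262144):
    sessions = json.loads(session_ls_json)
    return [(s["id"], s["recall_caps"]["value"])
            for s in sessions
            if s["recall_caps"]["value"] > threshold]

# Abbreviated example record, shaped like the 'session ls' output below.
sample = json.dumps([{
    "id": 2728101146,
    "num_caps": 16158,
    "recall_caps": {"value": 788916.83, "halflife": 60},
    "release_caps": {"value": 8.81, "halflife": 60},
}])
print(noisy_sessions(sample))  # [(2728101146, 788916.83)]
```

Feeding it the output of `ceph daemon mds.<MDS> session ls` would list the
client IDs the MDS is recalling most aggressively.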
> >>>>
> >>>> The docs basically state this:
> >>>>> When the MDS needs to shrink its cache (to stay within
> >>>>> mds_cache_size), it sends messages to clients to shrink their
> >>>>> caches too. The client is unresponsive to MDS requests to release
> >>>>> cached inodes. Either the client is unresponsive or has a bug
> >>>>
> >>>> To me it doesn't seem like the MDS servers are near the cache size
> >>>> limit, so it has to be the clients, right? In a different setup it
> >>>> helped to decrease the client_oc_size from 200MB to 100MB, but then
> >>>> there's also client_cache_size with 16K default. I'm not sure what
> >>>> the best approach would be here. I'd appreciate any comments on how
> >>>> to size the various cache/caps/threshold configurations.
> >>>>
> >>>> Thanks!
> >>>> Eugen
> >>>>
> >>>>
> >>>> ---snip---
> >>>> # ceph daemon mds.<MDS> session ls
> >>>>
> >>>>     "id": 2728101146,
> >>>>     "entity": {
> >>>>       "name": {
> >>>>         "type": "client",
> >>>>         "num": 2728101146
> >>>>       },
> >>>> [...]
> >>>>         "nonce": 1105499797
> >>>>       }
> >>>>     },
> >>>>     "state": "open",
> >>>>     "num_leases": 0,
> >>>>     "num_caps": 16158,
> >>>>     "request_load_avg": 0,
> >>>>     "uptime": 1118066.210318422,
> >>>>     "requests_in_flight": 0,
> >>>>     "completed_requests": [],
> >>>>     "reconnecting": false,
> >>>>     "recall_caps": {
> >>>>       "value": 788916.8276369586,
> >>>>       "halflife": 60
> >>>>     },
> >>>>     "release_caps": {
> >>>>       "value": 8.814981576458962,
> >>>>       "halflife": 60
> >>>>     },
> >>>>     "recall_caps_throttle": {
> >>>>       "value": 27379.27162576508,
> >>>>       "halflife": 1.5
> >>>>     },
> >>>>     "recall_caps_throttle2o": {
> >>>>       "value": 5382.261925615086,
> >>>>       "halflife": 0.5
> >>>>     },
> >>>>     "session_cache_liveness": {
> >>>>       "value": 12.91841737465921,
> >>>>       "halflife": 300
> >>>>     },
> >>>>     "cap_acquisition": {
> >>>>       "value": 0,
> >>>>       "halflife": 10
> >>>>     },
> >>>> [...]
> >>>>     "used_inos": [],
> >>>>     "client_metadata": {
> >>>>       "features": "0x0000000000003bff",
> >>>>       "entity_id": "cephfs_client",
> >>>>
> >>>>
> >>>> # ceph fs status
> >>>>
> >>>> cephfs - 25 clients
> >>>> ======
> >>>> +------+--------+----------------+---------------+-------+-------+
> >>>> | Rank | State  |      MDS       |    Activity   |  dns  |  inos |
> >>>> +------+--------+----------------+---------------+-------+-------+
> >>>> |  0   | active | stmailmds01d-3 | Reqs:   89 /s |  375k |  371k |
> >>>> |  1   | active | stmailmds01d-4 | Reqs:   64 /s |  386k |  383k |
> >>>> |  2   | active | stmailmds01a-3 | Reqs:    9 /s |  403k |  399k |
> >>>> |  3   | active | stmailmds01a-8 | Reqs:   23 /s |  393k |  390k |
> >>>> |  4   | active | stmailmds01a-2 | Reqs:   36 /s |  391k |  387k |
> >>>> |  5   | active | stmailmds01a-4 | Reqs:   57 /s |  394k |  390k |
> >>>> |  6   | active | stmailmds01a-6 | Reqs:   50 /s |  395k |  391k |
> >>>> |  7   | active | stmailmds01d-5 | Reqs:   37 /s |  384k |  380k |
> >>>> |  8   | active | stmailmds01a-5 | Reqs:   39 /s |  397k |  394k |
> >>>> |  9   | active |  stmailmds01a  | Reqs:   23 /s |  400k |  396k |
> >>>> |  10  | active | stmailmds01d-8 | Reqs:   74 /s |  402k |  399k |
> >>>> |  11  | active | stmailmds01d-6 | Reqs:   37 /s |  399k |  395k |
> >>>> |  12  | active |  stmailmds01d  | Reqs:   36 /s |  394k |  390k |
> >>>> |  13  | active | stmailmds01d-7 | Reqs:   80 /s |  397k |  393k |
> >>>> |  14  | active | stmailmds01d-2 | Reqs:   56 /s |  414k |  410k |
> >>>> |  15  | active | stmailmds01a-7 | Reqs:   25 /s |  390k |  387k |
> >>>> +------+--------+----------------+---------------+-------+-------+
> >>>> +-----------------+----------+-------+-------+
> >>>> |       Pool      |   type   |  used | avail |
> >>>> +-----------------+----------+-------+-------+
> >>>> | cephfs_metadata | metadata | 25.4G | 16.1T |
> >>>> |   cephfs_data   |   data   | 2078G | 16.1T |
> >>>> +-----------------+----------+-------+-------+
> >>>> +----------------+
> >>>> |  Standby MDS   |
> >>>> +----------------+
> >>>> | stmailmds01b-5 |
> >>>> | stmailmds01b-2 |
> >>>> | stmailmds01b-3 |
> >>>> |  stmailmds01b  |
> >>>> | stmailmds01b-7 |
> >>>> | stmailmds01b-8 |
> >>>> | stmailmds01b-6 |
> >>>> | stmailmds01b-4 |
> >>>> +----------------+
> >>>> MDS version: ceph version 14.2.22-404-gf74e15c2e55
> >>>> (f74e15c2e552b3359f5a51482dfd8b049e262743) nautilus (stable)
> >>>> ---snip---
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> ceph-users mailing list -- ceph-users@xxxxxxx
> >>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>


-- 
*Dhairya Parmar*

He/Him/His

Associate Software Engineer, CephFS

Red Hat Inc. <https://www.redhat.com/>

dparmar@xxxxxxxxxx


