Re: MDS daemons stuck in resolve, please help

Frank Schilder <frans@xxxxxx> · Tue, 7 Sep 2021 12:34:55 +0000

Dear Dan,

thanks for the fast reply!

> ... when you set mds_recall_max_decay_rate there is a side effect that all session recall_caps_throttle's are re-initialized

OK, something like this could be a problem with the number of clients we have. I guess, next time I wait for a service window and try it out without load on it. Or upgrade first :)

> ... I've cc'd Patrick.

Thanks a lot! It would be really good if we could resolve the mystery of extra snapshots in pool con-fs2-data2.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dan van der Ster <dan@xxxxxxxxxxxxxx>
Sent: 07 September 2021 14:20:58
To: Frank Schilder; Patrick Donnelly
Cc: ceph-users
Subject: Re:  Re: MDS daemons stuck in resolve, please help

Hi,

On Tue, Sep 7, 2021 at 1:55 PM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Dan,
>
> I think I need to be a bit more precise. When I do the following (mimic 13.2.10, latest):
>
> # ceph config dump | grep mds_recall_max_decay_rate
> # [no output]
> # ceph config get mds.0 mds_recall_max_decay_rate
> 2.500000
> # ceph config set mds mds_recall_max_decay_rate 2.5
> #
>
> the MDS cluster immediately becomes unresponsive. Worse yet, newly spawned MDS daemons also get stuck and are marked down after the beacon time-out. Clearly, having the *same* value either as default or explicitly present in the config data base leads to different behaviour. How is this possible unless its a bug or it leads to execution of different code paths? Expected behaviour clearly is: nothing happens. This is independent of current load.

The code is here:
https://github.com/ceph/ceph/blob/mimic/src/mds/SessionMap.cc#L1050
when you set mds_recall_max_decay_rate there is a side effect that all
session recall_caps_throttle's are re-initialized. I don't understand
why, but this clearly causes your MDS to stall. I assume this should
be a one-off -- once things recover and you have a good recall config
for your workloads, there shouldn't be any more stalling/side effects.

It is of course possible that mimic is missing other fixes, that we
have in nautilus++, that permit the config I discuss. I quickly
checked and 13.2.10 has the code to `recall caps incrementally`, so it
doesn't look obviously broken to me. But you might want to study those
differences to see what mimic lacks in terms of known recall
improvements.

> The ceph fs status is currently
>
> # ceph fs status
> con-fs2 - 1642 clients
> =======
> +------+--------+---------+---------------+-------+-------+
> | Rank | State  |   MDS   |    Activity   |  dns  |  inos |
> +------+--------+---------+---------------+-------+-------+
> |  0   | active | ceph-23 | Reqs:  434 /s | 2354k | 2266k |
> |  1   | active | ceph-12 | Reqs:    6 /s | 3036k | 2960k |
> |  2   | active | ceph-08 | Reqs:  513 /s | 1751k | 1613k |
> |  3   | active | ceph-15 | Reqs:  523 /s | 1460k | 1365k |
> +------+--------+---------+---------------+-------+-------+

OK those reqs look reasonable -- our MDSs have stable activity above
1000 or 2000 reqs/sec. Here's an example from our cluster right now:

+------+--------+----------------------+---------------+-------+-------+
| Rank | State  |         MDS          |    Activity   |  dns  |  inos |
+------+--------+----------------------+---------------+-------+-------+
|  0   | active | cephcpu21-46bb400fc8 | Reqs: 1550 /s | 44.5M | 18.5M |
|  1   | active | cephcpu21-0c370531cf | Reqs: 1509 /s | 14.2M | 14.1M |
|  2   | active | cephcpu21-4a93514bf3 | Reqs:  686 /s | 11.2M | 11.2M |
+------+--------+----------------------+---------------+-------+-------+

> +---------------------+----------+-------+-------+
> |         Pool        |   type   |  used | avail |
> +---------------------+----------+-------+-------+
> |    con-fs2-meta1    | metadata | 1372M | 1356G |
> |    con-fs2-meta2    |   data   |    0  | 1356G |
> |     con-fs2-data    |   data   | 1361T | 6035T |
> | con-fs2-data-ec-ssd |   data   |  239G | 4340G |
> |    con-fs2-data2    |   data   | 23.6T | 5487T |
> +---------------------+----------+-------+-------+
> +-------------+
> | Standby MDS |
> +-------------+
> |   ceph-16   |
> |   ceph-14   |
> |   ceph-13   |
> |   ceph-17   |
> |   ceph-10   |
> |   ceph-24   |
> |   ceph-09   |
> |   ceph-11   |
> +-------------+
> MDS version: ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)
>
> It seems like it improved a bit, but it is still way below the averages I saw before trying the cache trimming settings. I would usually have 2 MDSes with an average activity of 2000-4000 requests per second with peaks at 10K and higher (the highest I have seen was 18K) and 2 MDSes a bit less busy. All this with exactly the same IO pattern from clients, nothing changed on the client side during my attempts to set the cache trimming values.
>
> I wasn't implying a relation between snap trimming and caps recall. What I said is that after the change and roll-back of the cache trimming parameters, it looks like the snapshot trimming on (one of) the fs data pools seems to have stopped - i.e. something within ceph stopped working properly as a fall-out of the parameter changes and the cluster did not recover by itself yet.
>
> Snapshots themselves cause an extreme performance drop. There seems to be a bug in the kernel client that makes it spin over ceph_update_snap_trace and here over sort like crazy, here a perf record of the critical section:
>
> +   99.32%     0.00%  kworker/0:2   [kernel.kallsyms]    [k] ret_from_fork_nospec_begin
> +   99.32%     0.00%  kworker/0:2   [kernel.kallsyms]    [k] kthread
> +   99.32%     0.00%  kworker/0:2   [kernel.kallsyms]    [k] worker_thread
> +   99.32%     0.00%  kworker/0:2   [kernel.kallsyms]    [k] process_one_work
> +   99.31%     0.00%  kworker/0:2   [libceph]            [k] ceph_con_workfn
> +   99.30%     0.00%  kworker/0:2   [libceph]            [k] try_read
> +   99.27%     0.00%  kworker/0:2   [ceph]               [k] dispatch
> +   99.26%     0.00%  kworker/0:2   [ceph]               [k] handle_reply
> -   98.94%     0.06%  kworker/0:2   [ceph]               [k] ceph_update_snap_trace
>    - 98.88% ceph_update_snap_trace
>       - 90.03% rebuild_snap_realms
>          - 90.01% rebuild_snap_realms
>             - 89.54% build_snap_context
>                + 36.11% sort
>                  15.59% __x86_indirect_thunk_rax
>                  14.64% cmpu64_rev
>                  13.03% __x86_indirect_thunk_r13
>                  3.84% generic_swap
>                  0.64% ceph_create_snap_context
>         3.51% _raw_qspin_lock
>         2.47% __list_del_entry
>         1.36% ceph_queue_cap_snap
>         0.51% __ceph_caps_used
>
> I'm pretty sure its spinning over the exact same data over and over again because of the following observation. If I make a fresh mount, the client actually performs with high performance initially. It starts slowing down dramatically as the cache fills up. This has also been reported in other threads:
>
> https://tracker.ceph.com/issues/44100
> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/ELWPK3QGARFTVAFULFOUDOTLUGIL4HLP/
>
> I cannot see how it is not a bug that operations with no cache are fast and with cache are slow. This issue is present in latest stable kernels, currently I test with 5.9.9-1.el7.elrepo.x86_64.
>
> This is why I am so concerned now that the cache trimming parameter change caused some internal degradation that, in turn, now leads to snapshots piling up and killing performance completely. It would be very helpful to know how ceph fs handles snapshots and how I can confirm that either everything functions as expected, or I have a problem. I'm afraid, having fs data pools with inconsistent snapshot counts points to a severe degradation.
>
> Maybe you could point one of the ceph fs devs to this problem?

Yeah I certainly can't add more to clarify what you asked above; I
simply don't know the snapshot code enough to speculate what might be
going wrong here. I've cc'd Patrick.

Cheers, dan

>
> Thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dan van der Ster <dan@xxxxxxxxxxxxxx>
> Sent: 06 September 2021 11:33
> To: Frank Schilder
> Cc: ceph-users
> Subject: Re:  Re: MDS daemons stuck in resolve, please help
>
> Hi Frank,
>
> That's unfortunate! Most of those options relax warnings and relax
> when a client is considered having too many caps.
> The option mds_recall_max_caps might be CPU intensive -- the MDS would
> be busy recalling caps if indeed you have clients which are hammering
> the MDSs with metadata workloads.
> What is your current `ceph fs status` output? If you have very active
> users, perhaps you can ask them to temporarily slow down and see the
> impact on your cluster?
>
> I'm not aware of any relation between caps recall and snap trimming.
> We don't use snapshots (until now some pacific tests) so I can't say
> if that is relevant to this issue.
>
> -- dan
>
>
>
>
> On Mon, Sep 6, 2021 at 11:18 AM Frank Schilder <frans@xxxxxx> wrote:
> >
> > Hi Dan,
> >
> > unfortunately, setting these parameters crashed the MDS cluster and we now have severe performance issues. Particularly bad is mds_recall_max_decay_rate. Even just setting it to the default value immediately makes all MDS daemons unresponsive and get failed by the MONs. I already set the mds beacon time-out to 10 minutes to avoid MDS daemons getting marked down too early when they need to trim a large (oversized) cache. The formerly active then failed daemons never recover, I have to restart them manually to get them back as stand-bys.
> >
> > We are running mimic-13.2.10. Does explicitly setting mds_recall_max_decay_rate enable a different code path in this version?
> >
> > I tried to fix the situation by removing all modified config pars (ceph config rm ...) again and doing a full restart of all daemons, first all stand-bys and then the active ones one by one. Unfortunately, this did not help. In addition, it looks like one of our fs data pools does not purge snapshots any more:
> >
> > pool 12 'con-fs2-meta1' no removed_snaps list shown
> > pool 13 'con-fs2-meta2' removed_snaps [2~18e,191~2c,1be~144,303~1,305~1,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~1]
> > pool 14 'con-fs2-data' removed_snaps [2~18e,191~2c,1be~144,303~1,305~1,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~1]
> > pool 17 'con-fs2-data-ec-ssd' removed_snaps [2~18e,191~2c,1be~144,303~1,305~1,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~1]
> > pool 19 'con-fs2-data2' removed_snaps [2d6~1,2d8~1,2da~1,2dc~1,2de~1,2e0~1,2e2~1,2e4~1,2e6~1,2e8~1,2ea~18,303~1,305~1,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~1]
> >
> > con-fs2-meta2 is the primary data pool. It does not store actual file data, we have con-fs2-data2 set as data pool on the fs root. Its the new recommended 3-pool layout with the meta-data- and the primary data pool storing meta-data only.
> >
> > The MDS daemons report 12 snapshots and if I interpret the removed_snaps info correctly, the pools con-fs2-meta2, con-fs2-data and con-fs2-data-ec-ssd store 12 snapshots. However, pool con-fs2-data2 has at least 20. We use rolling snapshots and it looks like the snapshots are not purged any more since I tried setting the MDS trimming parameters. This, in turn, is potentially a reason for the performance degradation we experience at the moment.
> >
> > I would be most grateful if you could provide some pointers as to what to look for with regards of why snapshots don't disappear and/or what might have happened to our MDS daemons performance wise.
> >
> > Thanks and best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > [... truncated]
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx