Re: MDS daemons stuck in resolve, please help

Hi Dan,

I think I need to be a bit more precise. When I do the following (mimic 13.2.10, latest):

# ceph config dump | grep mds_recall_max_decay_rate
# [no output]
# ceph config get mds.0 mds_recall_max_decay_rate
2.500000
# ceph config set mds mds_recall_max_decay_rate 2.5
# 

the MDS cluster immediately becomes unresponsive. Worse yet, newly spawned MDS daemons also get stuck and are marked down after the beacon time-out. Clearly, having the *same* value either as the default or explicitly present in the config database leads to different behaviour. How is this possible unless it's a bug, or unless the explicit setting triggers a different code path? The expected behaviour is clearly: nothing happens. This is independent of the current load.
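I can only speculate about the mechanism, but one way "same value, different behaviour" can happen is if explicitly setting a value fires config-change notification handlers that merely reading the default never does. A toy sketch of that idea (entirely my assumption, not Ceph's actual code):

```python
# Toy model (my assumption, not Ceph's implementation): a config store
# that dispatches change handlers on every explicit set(), even when the
# new value equals the compiled-in default. Any expensive handler then
# runs only once the value is explicitly present in the database.
class ConfigStore:
    def __init__(self, defaults):
        self.defaults = defaults      # compiled-in defaults
        self.db = {}                  # explicitly set values
        self.handlers = []            # change observers

    def get(self, key):
        return self.db.get(key, self.defaults[key])

    def set(self, key, value):
        self.db[key] = value
        # Fires unconditionally -- no comparison against the default.
        for handler in self.handlers:
            handler(key, value)

store = ConfigStore({"mds_recall_max_decay_rate": 2.5})
fired = []
store.handlers.append(lambda k, v: fired.append((k, v)))

assert store.get("mds_recall_max_decay_rate") == 2.5  # default; no handler ran
store.set("mds_recall_max_decay_rate", 2.5)           # same value, but...
assert fired == [("mds_recall_max_decay_rate", 2.5)]  # ...the handler fired anyway
```

If something like this is going on, `ceph config rm` would remove the explicit entry but would not undo whatever state the handler already changed, which would fit what I observed.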

The ceph fs status is currently

# ceph fs status
con-fs2 - 1642 clients
=======
+------+--------+---------+---------------+-------+-------+
| Rank | State  |   MDS   |    Activity   |  dns  |  inos |
+------+--------+---------+---------------+-------+-------+
|  0   | active | ceph-23 | Reqs:  434 /s | 2354k | 2266k |
|  1   | active | ceph-12 | Reqs:    6 /s | 3036k | 2960k |
|  2   | active | ceph-08 | Reqs:  513 /s | 1751k | 1613k |
|  3   | active | ceph-15 | Reqs:  523 /s | 1460k | 1365k |
+------+--------+---------+---------------+-------+-------+
+---------------------+----------+-------+-------+
|         Pool        |   type   |  used | avail |
+---------------------+----------+-------+-------+
|    con-fs2-meta1    | metadata | 1372M | 1356G |
|    con-fs2-meta2    |   data   |    0  | 1356G |
|     con-fs2-data    |   data   | 1361T | 6035T |
| con-fs2-data-ec-ssd |   data   |  239G | 4340G |
|    con-fs2-data2    |   data   | 23.6T | 5487T |
+---------------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
|   ceph-16   |
|   ceph-14   |
|   ceph-13   |
|   ceph-17   |
|   ceph-10   |
|   ceph-24   |
|   ceph-09   |
|   ceph-11   |
+-------------+
MDS version: ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)

It seems to have improved a bit, but it is still way below the averages I saw before trying the cache-trimming settings. I would usually have 2 MDSes with an average activity of 2000-4000 requests per second, with peaks at 10K and higher (the highest I have seen was 18K), and 2 MDSes somewhat less busy. All this with exactly the same IO pattern from clients; nothing changed on the client side during my attempts to set the cache-trimming values.

I wasn't implying a relation between snap trimming and caps recall. What I said is that, after the change and roll-back of the cache-trimming parameters, snapshot trimming on (at least one of) the fs data pools seems to have stopped - i.e. something within ceph stopped working properly as a fall-out of the parameter changes, and the cluster has not recovered by itself yet.

Snapshots themselves cause an extreme performance drop. There seems to be a bug in the kernel client that makes it spin in ceph_update_snap_trace and, within that, in sort like crazy. Here is a perf record of the critical section:

+   99.32%     0.00%  kworker/0:2   [kernel.kallsyms]    [k] ret_from_fork_nospec_begin
+   99.32%     0.00%  kworker/0:2   [kernel.kallsyms]    [k] kthread
+   99.32%     0.00%  kworker/0:2   [kernel.kallsyms]    [k] worker_thread
+   99.32%     0.00%  kworker/0:2   [kernel.kallsyms]    [k] process_one_work
+   99.31%     0.00%  kworker/0:2   [libceph]            [k] ceph_con_workfn
+   99.30%     0.00%  kworker/0:2   [libceph]            [k] try_read
+   99.27%     0.00%  kworker/0:2   [ceph]               [k] dispatch
+   99.26%     0.00%  kworker/0:2   [ceph]               [k] handle_reply
-   98.94%     0.06%  kworker/0:2   [ceph]               [k] ceph_update_snap_trace
   - 98.88% ceph_update_snap_trace
      - 90.03% rebuild_snap_realms
         - 90.01% rebuild_snap_realms
            - 89.54% build_snap_context
               + 36.11% sort
                 15.59% __x86_indirect_thunk_rax
                 14.64% cmpu64_rev
                 13.03% __x86_indirect_thunk_r13
                 3.84% generic_swap
                 0.64% ceph_create_snap_context
        3.51% _raw_qspin_lock
        2.47% __list_del_entry
        1.36% ceph_queue_cap_snap
        0.51% __ceph_caps_used

I'm pretty sure it's spinning over the exact same data over and over again, because of the following observation: if I make a fresh mount, the client actually performs very well initially. It starts slowing down dramatically as the cache fills up. This has also been reported in other threads:

https://tracker.ceph.com/issues/44100
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/ELWPK3QGARFTVAFULFOUDOTLUGIL4HLP/

I cannot see how it is not a bug that operations are fast with an empty cache and slow with a full one. This issue is present in the latest stable kernels; I currently test with 5.9.9-1.el7.elrepo.x86_64.
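My reading of the perf profile (an assumption on my part; I have not worked through the kernel code in detail) is that every incoming snap trace rebuilds the realms and re-sorts the full snapshot list, so the total work grows roughly quadratically with the amount of cached snap state - which would match the "fast when cold, slow when warm" behaviour exactly. A back-of-the-envelope cost model:

```python
import math

# Toy cost model (my assumption about the mechanism, not the kernel code):
# if every snap-trace update re-sorts the whole snapshot list, the total
# work over n updates is a sum of n sorts, i.e. O(n^2 log n) comparisons.
def rebuild_cost(n_updates):
    """Comparisons if each of n updates re-sorts the full list (n log n per sort)."""
    return sum(int(k * math.log2(k)) for k in range(2, n_updates + 1))

cold = rebuild_cost(100)    # fresh mount: little cached snap state
warm = rebuild_cost(1000)   # cache filled: 10x the cached state
# Total cost grows far faster than the 10x growth in cached state.
assert warm > 100 * cold
```

If the snap context were instead cached and only invalidated on actual changes, each update would be near-constant cost, and a full cache would not hurt.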

This is why I am so concerned now that the cache-trimming parameter change caused some internal degradation that, in turn, now leads to snapshots piling up and killing performance completely. It would be very helpful to know how ceph fs handles snapshots and how I can confirm that either everything functions as expected or I have a problem. I'm afraid that fs data pools with inconsistent snapshot counts point to a severe degradation.
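To make the "inconsistent snapshot counts" claim checkable, one can parse the removed_snaps intervals (as quoted further down in this thread) and count the gaps between removed ranges - each gap should be a still-live snapshot ID. A small sketch, assuming my interpretation of the interval format (start~length, in hex) is correct:

```python
# Count gaps between removed_snaps intervals; each gap is a snap ID that
# has not been removed yet. Interval format assumed to be start~length, hex.
def live_gaps(removed):
    ivals = []
    for part in removed.strip("[]").split(","):
        start, length = (int(x, 16) for x in part.split("~"))
        ivals.append((start, start + length - 1))   # inclusive range
    ivals.sort()
    gaps = 0
    for (s1, e1), (s2, e2) in zip(ivals, ivals[1:]):
        gaps += s2 - e1 - 1   # snap IDs strictly between adjacent intervals
    return gaps

meta2 = "[2~18e,191~2c,1be~144,303~1,305~1,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~1]"
data2 = "[2d6~1,2d8~1,2da~1,2dc~1,2de~1,2e0~1,2e2~1,2e4~1,2e6~1,2e8~1,2ea~18,303~1,305~1,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~1]"
assert live_gaps(meta2) == 12   # matches the 12 snapshots the MDS reports
assert live_gaps(data2) == 20   # con-fs2-data2 still holds at least 20
```

Run against the pool listing below, this gives 12 live snapshots for con-fs2-meta2/con-fs2-data/con-fs2-data-ec-ssd but 20 for con-fs2-data2, which is the inconsistency I am worried about.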

Maybe you could point one of the ceph fs devs to this problem?

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dan van der Ster <dan@xxxxxxxxxxxxxx>
Sent: 06 September 2021 11:33
To: Frank Schilder
Cc: ceph-users
Subject: Re:  Re: MDS daemons stuck in resolve, please help

Hi Frank,

That's unfortunate! Most of those options relax warnings and relax
when a client is considered to have too many caps.
The option mds_recall_max_caps might be CPU intensive -- the MDS would
be busy recalling caps if indeed you have clients which are hammering
the MDSs with metadata workloads.
What is your current `ceph fs status` output? If you have very active
users, perhaps you can ask them to temporarily slow down and see the
impact on your cluster?

I'm not aware of any relation between caps recall and snap trimming.
We don't use snapshots (until now some pacific tests) so I can't say
if that is relevant to this issue.

-- dan




On Mon, Sep 6, 2021 at 11:18 AM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Dan,
>
> unfortunately, setting these parameters crashed the MDS cluster and we now have severe performance issues. Particularly bad is mds_recall_max_decay_rate: even just setting it to its default value immediately makes all MDS daemons unresponsive and gets them failed by the MONs. I had already set the mds beacon time-out to 10 minutes to avoid MDS daemons getting marked down too early when they need to trim a large (oversized) cache. The formerly active, then failed daemons never recover; I have to restart them manually to get them back as stand-bys.
>
> We are running mimic-13.2.10. Does explicitly setting mds_recall_max_decay_rate enable a different code path in this version?
>
> I tried to fix the situation by removing all modified config pars (ceph config rm ...) again and doing a full restart of all daemons, first all stand-bys and then the active ones one by one. Unfortunately, this did not help. In addition, it looks like one of our fs data pools does not purge snapshots any more:
>
> pool 12 'con-fs2-meta1' no removed_snaps list shown
> pool 13 'con-fs2-meta2' removed_snaps [2~18e,191~2c,1be~144,303~1,305~1,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~1]
> pool 14 'con-fs2-data' removed_snaps [2~18e,191~2c,1be~144,303~1,305~1,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~1]
> pool 17 'con-fs2-data-ec-ssd' removed_snaps [2~18e,191~2c,1be~144,303~1,305~1,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~1]
> pool 19 'con-fs2-data2' removed_snaps [2d6~1,2d8~1,2da~1,2dc~1,2de~1,2e0~1,2e2~1,2e4~1,2e6~1,2e8~1,2ea~18,303~1,305~1,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~1]
>
> con-fs2-meta2 is the primary data pool. It does not store actual file data; we have con-fs2-data2 set as data pool on the fs root. It's the new recommended 3-pool layout, with the meta-data pool and the primary data pool storing meta-data only.
>
> The MDS daemons report 12 snapshots and, if I interpret the removed_snaps info correctly, the pools con-fs2-meta2, con-fs2-data and con-fs2-data-ec-ssd store 12 snapshots. However, pool con-fs2-data2 has at least 20. We use rolling snapshots, and it looks like snapshots have not been purged since I tried setting the MDS trimming parameters. This, in turn, is potentially a reason for the performance degradation we are experiencing at the moment.
>
> I would be most grateful if you could provide some pointers as to what to look for with regard to why snapshots don't disappear and/or what might have happened to our MDS daemons performance-wise.
>
> Thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> [... truncated]
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



