Re: MDS Upgrade from 17.2.5 to 17.2.6 not possible

On Wed, May 17, 2023 at 9:26 PM Henning Achterrath <achhen@xxxxxxxxxxx>
wrote:

> Hi all,
>
> we did a major update from Pacific to Quincy (17.2.5) a month ago
> without any problems.
>
> Now we have tried a minor update from 17.2.5 to 17.2.6 (ceph orch
> upgrade). It gets stuck at the MDS upgrade phase. At this point the
> cluster tries to scale down the MDSs (ceph fs set max_mds 1). We waited
> a few hours.
>
Just an FYI (if you use cephadm to carry out upgrades): reducing max_mds to
1 can be disastrous, especially for huge CephFS deployments, because the
cluster cannot quickly reduce the number of active MDSs to 1, and a single
active MDS cannot easily handle the load of all clients. To avoid this, you
can upgrade the MDSs without reducing max_mds; the fail_fs option needs to
be set to true prior to the upgrade. There's a note at the beginning of the
"Starting the Upgrade" section that might help you understand this better.

https://docs.ceph.com/en/latest/cephadm/upgrade/#starting-the-upgrade
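
If I remember correctly this is exposed as a mgr config option, so the
workflow is roughly as sketched below (please double-check the linked docs
for the exact option name and whether it is available on your release
before relying on it):

    # tell cephadm to fail the file system instead of reducing max_mds
    ceph config set mgr mgr/orchestrator/fail_fs true
    # then kick off the upgrade as usual
    ceph orch upgrade start --ceph-version 17.2.6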


> We are running two active MDSs with one standby. No subdir pinning is
> configured. CephFS data pool: 575 TB
>
> While upgrading, the rank 1 MDS remains in the stopping state. During
> this state clients are not able to reconnect, so we paused the upgrade,
> set max_mds back to 2 and failed rank 1. After that, the standby became
> active.
> In the logs of the rank 1 MDS (stuck in the stopping state) we can see:
> waiting for strays to migrate
>
> In our second try, we evicted all clients first, without success.
>
> We take daily snapshots of / and rotate them via the snapshot scheduler
> after one week.
>
> Is there a way to get rid of the stray entries without scaling down the
> MDSs, or do we have to wait longer?
>
> We had about the same number of strays before we did the major upgrade,
> so it is a bit curious.
>
> Current output from ceph perf dump
>
> Rank0:
>
> "num_strays": 417304,
>          "num_strays_delayed": 3,
>          "num_strays_enqueuing": 0,
>          "strays_created": 567879,
>          "strays_enqueued": 561803,
>          "strays_reintegrated": 13751,
>          "strays_migrated": 4,
>
>
> Rank1:
>
> ceph daemon mds.fdi-cephfs.ceph-service-13.rwdkqs perf dump | grep stray
>
>          "num_strays": 172528,
>          "num_strays_delayed": 0,
>          "num_strays_enqueuing": 0,
>          "strays_created": 418365,
>          "strays_enqueued": 396142,
>          "strays_reintegrated": 67406,
>          "strays_migrated": 4,
>
>
>
> Any help would be appreciated.
>
>
> best regards
> Henning
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



