Re: ceph Pacific - MDS activity freezes when one of the MDSs is restarted

Emmanuel Jaep <emmanuel.jaep@xxxxxxxxx> · Thu, 25 May 2023 09:15:34 +0200

Hi Wes,

thanks for the heads-up.

Best,

Emmanuel

On Wed, May 24, 2023 at 5:47 PM Wesley Dillingham <wes@xxxxxxxxxxxxxxxxx>
wrote:

> There was a memory issue with standby-replay. It may have been resolved
> since, and the fix may be in 16.2.10 (not sure); the suggestion at the
> time was to avoid standby-replay.
>
> Perhaps a dev can chime in on that status. Your MDSs look pretty inactive.
> I would consider scaling them down (potentially to single active if your
> workload allows).
>
> The MDSs have an intricate update process when you use multiple active
> daemons; make sure to read the docs on that if you aren't using cephadm
> and want to attempt an upgrade.
>
> A standby-replay daemon can only take over for a single rank (it tracks a
> single active MDS), whereas a regular standby can take over for any rank.
> More here:
> https://docs.ceph.com/en/latest/cephfs/standby/#configuring-standby-replay
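For reference, the two settings discussed here (number of active ranks and standby-replay) map to `ceph fs set` commands. The sketch below only builds the command lines and does not run them. The file system name `cephfs` is an assumption (the thread does not state it), and the helper name is hypothetical. Check the documentation linked above against your release before running anything.

```python
import shlex


def mds_layout_commands(fs_name: str, max_mds: int, standby_replay: bool) -> list[str]:
    """Return the `ceph fs set` command lines for a given MDS layout.

    Nothing is executed; the strings are meant to be reviewed first.
    """
    if max_mds < 1:
        raise ValueError("max_mds must be at least 1")
    flag = "true" if standby_replay else "false"
    return [
        shlex.join(["ceph", "fs", "set", fs_name, "max_mds", str(max_mds)]),
        shlex.join(["ceph", "fs", "set", fs_name, "allow_standby_replay", flag]),
    ]


# "cephfs" is a placeholder name, not taken from the thread.
for cmd in mds_layout_commands("cephfs", max_mds=1, standby_replay=False):
    print(cmd)
```

The example values (a single active rank, standby-replay disabled) correspond to the suggestion in this message; other layouts only change the two arguments.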
>
> Respectfully,
>
> *Wes Dillingham*
> wes@xxxxxxxxxxxxxxxxx
> LinkedIn <http://www.linkedin.com/in/wesleydillingham>
>
>
> On Wed, May 24, 2023 at 10:33 AM Eugen Block <eblock@xxxxxx> wrote:
>
> > Hi,
> >
> > using standby-replay daemons is something to test, as it can have a
> > negative impact; it really depends on the actual workload. We stopped
> > using standby-replay in all clusters we (help) maintain. In one
> > specific case with many active MDSs and a high load, the failover time
> > decreased and was "cleaner" for the client application.
> > Also, do you know why you use a multi-active MDS setup? Was that a
> > requirement for subtree pinning (otherwise multiple active daemons
> > would balance the hell out of each other) or maybe just an experiment?
> > Depending on the workload, pinning might have been necessary; maybe you
> > would impact performance if you removed 3 MDS daemons? As an
> > alternative, you can also deploy multiple MDS daemons per host
> > (count_per_host), which can utilize the server better. I'm not sure
> > as of which Pacific version that is available; I just tried it
> > successfully on 16.2.13. That way you could still maintain the required
> > number of MDS daemons (if it's still 7) and also have enough standby
> > daemons. Of course, that means that if one MDS host goes down, all its
> > daemons will also be unavailable. But we used this feature in an older
> > version (customized Nautilus) quite successfully in a customer cluster.
> > There are many things to consider here, just wanted to share a couple
> > of thoughts.
> >
> > Regards,
> > Eugen
> >
> > Quoting Hector Martin <marcan@xxxxxxxxx>:
> >
> > > Hi,
> > >
> > > On 24/05/2023 22.02, Emmanuel Jaep wrote:
> > >> Hi Hector,
> > >>
> > >> thank you very much for the detailed explanation and link to the
> > >> documentation.
> > >>
> > >> Given our current situation (7 active MDSs and 1 standby MDS):
> > >> RANK  STATE      MDS         ACTIVITY     DNS    INOS   DIRS   CAPS
> > >>  0    active  icadmin012  Reqs:   82 /s  2345k  2288k  97.2k   307k
> > >>  1    active  icadmin008  Reqs:  194 /s  3789k  3789k  17.1k   641k
> > >>  2    active  icadmin007  Reqs:   94 /s  5823k  5369k   150k   257k
> > >>  3    active  icadmin014  Reqs:  103 /s   813k   796k  47.4k   163k
> > >>  4    active  icadmin013  Reqs:   81 /s  3815k  3798k  12.9k   186k
> > >>  5    active  icadmin011  Reqs:   84 /s   493k   489k  9145    176k
> > >>  6    active  icadmin015  Reqs:  374 /s  1741k  1669k  28.1k   246k
> > >>       POOL         TYPE     USED  AVAIL
> > >> cephfs_metadata  metadata  8547G  25.2T
> > >>   cephfs_data      data     223T  25.2T
> > >> STANDBY MDS
> > >>  icadmin006
> > >>
> > > >> I would probably be better off with:
> > > >>
> > > >>    1. only 3 active MDSs (ranks 0 to 2)
> > > >>    2. 3 standby-replay daemons mirroring ranks 0 to 2
> > > >>    3. 2 'regular' standby MDSs
> > >>
> > >> Of course, this raises the question of storage and performance.
> > >>
> > >> Since I would be moving from 7 active MDSs to 3:
> > >>
> > >>    1. each new active MDS will have to store more than twice the data
> > >>    2. the load will be more than twice as high
> > >>
> > >> Am I correct?
> > >
> > > Yes, that is correct. The MDSes don't store data locally but do
> > > cache/maintain it in memory, so you will either have higher memory load
> > > for the same effective cache size, or a lower cache size for the same
> > > memory load.
> > >
> > > If you have 8 total MDSes, I'd go for 4+4. You don't need non-replay
> > > standbys if you have a standby replay for each active MDS. As far as I
> > > know, if you end up with an active and its standby both failing, some
> > > > other standby-replay MDS will still be stolen to take care of that
> > > > rank, so the cluster will eventually become healthy again after the
> > > > replay time.
> > >
> > > With 4 active MDSes down from the current 7, the load per MDS will be a
> > > bit less than double.
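As a rough check of the figures above, the per-MDS load can be estimated by dividing the current total evenly across the new number of active ranks. This is a minimal sketch that assumes load spreads perfectly evenly across ranks (in practice it depends on subtree distribution and pinning). The request rates are copied from the status output quoted above, and the function name is made up for illustration.

```python
# Request rates (Reqs/s) for ranks 0..6, copied from the status output above.
current_reqs = [82, 194, 94, 103, 81, 84, 374]


def per_mds_scale(old_active: int, new_active: int) -> float:
    """Factor by which the average per-MDS load grows when going from
    old_active to new_active ranks, assuming perfectly even balancing."""
    if old_active <= 0 or new_active <= 0:
        raise ValueError("rank counts must be positive")
    return old_active / new_active


total = sum(current_reqs)
for new_active in (3, 4):
    factor = per_mds_scale(len(current_reqs), new_active)
    print(f"{new_active} active: x{factor:.2f} per MDS, "
          f"~{total / new_active:.0f} req/s each on average")
```

Going from 7 ranks to 3 gives a factor of about 2.33 (more than double), while going to 4 gives 1.75 (a bit less than double), consistent with the estimates in the thread.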
> > >
> > >>
> > >> Emmanuel
> > >>
> > > >> On Wed, May 24, 2023 at 2:31 PM Hector Martin <marcan@xxxxxxxxx> wrote:
> > >>
> > >>> On 24/05/2023 21.15, Emmanuel Jaep wrote:
> > >>>> Hi,
> > >>>>
> > >>>> we are currently running a ceph fs cluster at the following version:
> > >>>> MDS version: ceph version 16.2.10
> > >>>> (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)
> > >>>>
> > >>>> The cluster is composed of 7 active MDSs and 1 standby MDS:
> > >>>> RANK  STATE      MDS         ACTIVITY     DNS    INOS   DIRS   CAPS
> > >>>>  0    active  icadmin012  Reqs:   73 /s  1938k  1880k  85.3k  92.8k
> > >>>>  1    active  icadmin008  Reqs:  206 /s  2375k  2375k  7081    171k
> > >>>>  2    active  icadmin007  Reqs:   91 /s  5709k  5256k   149k   299k
> > >>>>  3    active  icadmin014  Reqs:   93 /s   679k   664k  40.1k   216k
> > >>>>  4    active  icadmin013  Reqs:   86 /s  3585k  3569k  12.7k   197k
> > >>>>  5    active  icadmin011  Reqs:   72 /s   225k   221k  8611    164k
> > >>>>  6    active  icadmin015  Reqs:   87 /s  1682k  1610k  27.9k   274k
> > >>>>       POOL         TYPE     USED  AVAIL
> > >>>> cephfs_metadata  metadata  8552G  22.3T
> > >>>>   cephfs_data      data     226T  22.3T
> > >>>> STANDBY MDS
> > >>>>  icadmin006
> > >>>>
> > > >>>> When I restart one of the active MDSs, the standby MDS becomes
> > > >>>> active and its state becomes "replay". So far, so good!
> > >>>>
> > > >>>> However, only one of the other "active" MDSs seems to remain active.
> > > >>>> All activities drop from the other ones:
> > >>>> RANK  STATE      MDS         ACTIVITY     DNS    INOS   DIRS   CAPS
> > >>>>  0    active  icadmin012  Reqs:    0 /s  1938k  1881k  85.3k  9720
> > >>>>  1    active  icadmin008  Reqs:    0 /s  2375k  2375k  7080   2505
> > >>>>  2    active  icadmin007  Reqs:    2 /s  5709k  5256k   149k  26.5k
> > >>>>  3    active  icadmin014  Reqs:    0 /s   679k   664k  40.1k  3259
> > >>>>  4    replay  icadmin006                  801k   801k  1279      0
> > >>>>  5    active  icadmin011  Reqs:    0 /s   225k   221k  8611   9241
> > >>>>  6    active  icadmin015  Reqs:    0 /s  1682k  1610k  27.9k  34.8k
> > >>>>       POOL         TYPE     USED  AVAIL
> > >>>> cephfs_metadata  metadata  8539G  22.8T
> > >>>>   cephfs_data      data     225T  22.8T
> > >>>> STANDBY MDS
> > >>>>  icadmin013
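One way to spot this condition from a script is to look for ranks whose state is anything other than `active` in the status table. The following is a minimal sketch, assuming the plain-text layout shown above (rank, state, and daemon name in the first three columns). The function name is made up, and the plain-text layout is not a stable interface, so a structured output format would be preferable where available.

```python
def non_active_ranks(status_text: str) -> dict[int, tuple[str, str]]:
    """Map rank -> (state, daemon) for every rank row whose state is not 'active'."""
    result = {}
    for line in status_text.splitlines():
        fields = line.split()
        # Rank rows start with an integer rank; headers, pool rows and
        # standby daemon names do not.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        rank, state, daemon = int(fields[0]), fields[1], fields[2]
        if state != "active":
            result[rank] = (state, daemon)
    return result


# Excerpt of the status output quoted above.
sample = """\
RANK  STATE      MDS         ACTIVITY     DNS    INOS   DIRS   CAPS
 3    active  icadmin014  Reqs:    0 /s   679k   664k  40.1k  3259
 4    replay  icadmin006                  801k   801k  1279      0
 5    active  icadmin011  Reqs:    0 /s   225k   221k  8611   9241
STANDBY MDS
 icadmin013
"""
print(non_active_ranks(sample))
```

For the excerpt above, only rank 4 (daemon icadmin006, state replay) is reported.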
> > >>>>
> > > >>>> In effect, the cluster becomes almost unavailable until the newly
> > > >>>> promoted MDS finishes rejoining the cluster.
> > >>>>
> > > >>>> Obviously, this defeats the purpose of having 7 MDSs.
> > > >>>> Is this expected behavior?
> > > >>>> If not, what configuration items should I check to go back to
> > > >>>> "normal" operations?
> > >>>>
> > >>>
> > > >>> Please ignore my previous email, I read too quickly. I see you do
> > > >>> have a standby. However, that does not allow fast failover with
> > > >>> multiple MDSes.
> > > >>>
> > > >>> For fast failover of any active MDS, you need one standby-replay
> > > >>> daemon for *each* active MDS. Each standby-replay MDS follows one
> > > >>> active MDS's rank only, you can't have one standby-replay daemon
> > > >>> following all ranks. What you have right now is probably a regular
> > > >>> standby daemon, which can take over any failed MDS, but requires
> > > >>> waiting for the replay time.
> > >>>
> > >>> See:
> > >>>
> > >>>
> >
> https://docs.ceph.com/en/latest/cephfs/standby/#configuring-standby-replay
> > >>>
> > >>> My explanation for the zero ops from the previous email still holds:
> > >>> it's likely that most clients will hang if any MDS rank is
> > >>> down/unavailable.
> > >>>
> > >>> - Hector
> > >>> _______________________________________________
> > >>> ceph-users mailing list -- ceph-users@xxxxxxx
> > >>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> > >>>
> > >
> > > - Hector
> >
> >
> >
>