So I guess I'll end up doing:

ceph fs set cephfs max_mds 4
ceph fs set cephfs allow_standby_replay true

On Wed, May 24, 2023 at 4:13 PM Hector Martin <marcan@xxxxxxxxx> wrote:
> Hi,
>
> On 24/05/2023 22.02, Emmanuel Jaep wrote:
> > Hi Hector,
> >
> > thank you very much for the detailed explanation and link to the
> > documentation.
> >
> > Given our current situation (7 active MDSs and 1 standby MDS):
> > RANK  STATE   MDS         ACTIVITY      DNS    INOS   DIRS   CAPS
> >  0    active  icadmin012  Reqs:  82 /s  2345k  2288k  97.2k  307k
> >  1    active  icadmin008  Reqs: 194 /s  3789k  3789k  17.1k  641k
> >  2    active  icadmin007  Reqs:  94 /s  5823k  5369k  150k   257k
> >  3    active  icadmin014  Reqs: 103 /s  813k   796k   47.4k  163k
> >  4    active  icadmin013  Reqs:  81 /s  3815k  3798k  12.9k  186k
> >  5    active  icadmin011  Reqs:  84 /s  493k   489k   9145   176k
> >  6    active  icadmin015  Reqs: 374 /s  1741k  1669k  28.1k  246k
> > POOL             TYPE      USED   AVAIL
> > cephfs_metadata  metadata  8547G  25.2T
> > cephfs_data      data      223T   25.2T
> > STANDBY MDS
> > icadmin006
> >
> > I would probably be better off:
> >
> > 1. having only 3 active MDSs (ranks 0 to 2)
> > 2. configuring 3 standby-replay MDSs to mirror ranks 0 to 2
> > 3. having 2 'regular' standby MDSs
> >
> > Of course, this raises the question of storage and performance.
> >
> > Since I would be moving from 7 active MDSs to 3:
> >
> > 1. each new active MDS will have to store more than twice the data
> > 2. the load will be more than twice as high
> >
> > Am I correct?
>
> Yes, that is correct. The MDSes don't store data locally but do
> cache/maintain it in memory, so you will either have higher memory load
> for the same effective cache size, or a lower cache size for the same
> memory load.
>
> If you have 8 total MDSes, I'd go for 4+4. You don't need non-replay
> standbys if you have a standby-replay for each active MDS. As far as I
> know, if you end up with an active and its standby both failing, some
> other standby-replay MDS will still be stolen to take care of that rank,
> so the cluster will eventually become healthy again after the replay time.
>
> With 4 active MDSes down from the current 7, the load per MDS will be a
> bit less than double.
>
> > Emmanuel
> >
> > On Wed, May 24, 2023 at 2:31 PM Hector Martin <marcan@xxxxxxxxx> wrote:
> >
> >> On 24/05/2023 21.15, Emmanuel Jaep wrote:
> >>> Hi,
> >>>
> >>> we are currently running a ceph fs cluster at the following version:
> >>> MDS version: ceph version 16.2.10
> >>> (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)
> >>>
> >>> The cluster is composed of 7 active MDSs and 1 standby MDS:
> >>> RANK  STATE   MDS         ACTIVITY      DNS    INOS   DIRS   CAPS
> >>>  0    active  icadmin012  Reqs:  73 /s  1938k  1880k  85.3k  92.8k
> >>>  1    active  icadmin008  Reqs: 206 /s  2375k  2375k  7081   171k
> >>>  2    active  icadmin007  Reqs:  91 /s  5709k  5256k  149k   299k
> >>>  3    active  icadmin014  Reqs:  93 /s  679k   664k   40.1k  216k
> >>>  4    active  icadmin013  Reqs:  86 /s  3585k  3569k  12.7k  197k
> >>>  5    active  icadmin011  Reqs:  72 /s  225k   221k   8611   164k
> >>>  6    active  icadmin015  Reqs:  87 /s  1682k  1610k  27.9k  274k
> >>> POOL             TYPE      USED   AVAIL
> >>> cephfs_metadata  metadata  8552G  22.3T
> >>> cephfs_data      data      226T   22.3T
> >>> STANDBY MDS
> >>> icadmin006
> >>>
> >>> When I restart one of the active MDSs, the standby MDS becomes active
> >>> and its state becomes "replay". So far, so good!
> >>>
> >>> However, only one of the other "active" MDSs seems to remain active.
> >>> All activities drop from the other ones:
> >>> RANK  STATE   MDS         ACTIVITY     DNS    INOS   DIRS   CAPS
> >>>  0    active  icadmin012  Reqs: 0 /s   1938k  1881k  85.3k  9720
> >>>  1    active  icadmin008  Reqs: 0 /s   2375k  2375k  7080   2505
> >>>  2    active  icadmin007  Reqs: 2 /s   5709k  5256k  149k   26.5k
> >>>  3    active  icadmin014  Reqs: 0 /s   679k   664k   40.1k  3259
> >>>  4    replay  icadmin006               801k   801k   1279   0
> >>>  5    active  icadmin011  Reqs: 0 /s   225k   221k   8611   9241
> >>>  6    active  icadmin015  Reqs: 0 /s   1682k  1610k  27.9k  34.8k
> >>> POOL             TYPE      USED   AVAIL
> >>> cephfs_metadata  metadata  8539G  22.8T
> >>> cephfs_data      data      225T   22.8T
> >>> STANDBY MDS
> >>> icadmin013
> >>>
> >>> In effect, the cluster becomes almost unavailable until the newly
> >>> promoted MDS finishes rejoining the cluster.
> >>>
> >>> Obviously, this defeats the purpose of having 7 MDSs.
> >>> Is this expected behavior?
> >>> If not, what configuration items should I check to go back to "normal"
> >>> operations?
> >>
> >> Please ignore my previous email, I read too quickly. I see you do have a
> >> standby. However, that does not allow fast failover with multiple MDSes.
> >>
> >> For fast failover of any active MDS, you need one standby-replay daemon
> >> for *each* active MDS. Each standby-replay MDS follows one active MDS's
> >> rank only; you can't have one standby-replay daemon following all ranks.
> >> What you have right now is probably a regular standby daemon, which can
> >> take over any failed MDS, but requires waiting for the replay time.
> >>
> >> See:
> >> https://docs.ceph.com/en/latest/cephfs/standby/#configuring-standby-replay
> >>
> >> My explanation for the zero ops from the previous email still holds:
> >> it's likely that most clients will hang if any MDS rank is
> >> down/unavailable.
> >>
> >> - Hector
>
> - Hector

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
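A minimal sketch of the reconfiguration described at the top of the thread, assuming the filesystem is named cephfs (as above) and all 8 MDS daemons stay deployed; the two settings are independent, so their order should not matter, and the final status check is just the usual way to confirm that ranks 4-6 have stopped and that the freed daemons have attached as standby-replay followers:

    # drop from 7 active ranks to 4; the cluster stops the extra ranks itself
    ceph fs set cephfs max_mds 4

    # allow the remaining standby daemons to follow active ranks as standby-replay
    ceph fs set cephfs allow_standby_replay true

    # verify: 4 ranks shown as "active", 4 daemons shown as "standby-replay"
    ceph fs status cephfs

Note that a stopping rank exports its metadata subtrees to the surviving ranks, so reducing max_mds on a busy filesystem is probably best done during a quiet period.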