Re: ceph Pacific - MDS activity freezes when one of the MDSs is restarted

Emmanuel Jaep <emmanuel.jaep@xxxxxxxxx> · Wed, 24 May 2023 15:02:35 +0200

Hi Hector,

thank you very much for the detailed explanation and link to the
documentation.

Given our current situation (7 active MDSs and 1 standby MDS):
RANK  STATE      MDS         ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active  icadmin012  Reqs:   82 /s  2345k  2288k  97.2k   307k
 1    active  icadmin008  Reqs:  194 /s  3789k  3789k  17.1k   641k
 2    active  icadmin007  Reqs:   94 /s  5823k  5369k   150k   257k
 3    active  icadmin014  Reqs:  103 /s   813k   796k  47.4k   163k
 4    active  icadmin013  Reqs:   81 /s  3815k  3798k  12.9k   186k
 5    active  icadmin011  Reqs:   84 /s   493k   489k  9145    176k
 6    active  icadmin015  Reqs:  374 /s  1741k  1669k  28.1k   246k
      POOL         TYPE     USED  AVAIL
cephfs_metadata  metadata  8547G  25.2T
  cephfs_data      data     223T  25.2T
STANDBY MDS
 icadmin006

I would probably be better off:

   1. having only 3 active MDSs (ranks 0 to 2)
   2. configuring 3 standby-replay MDSs to mirror ranks 0 to 2
   3. keeping 2 'regular' standby MDSs
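
For illustration, here is a minimal sketch of how the layout listed above could be requested, assuming the file system is named "cephfs" (the name is never given in this thread, so it is hypothetical) and that the same 8 MDS daemons stay in service. It only prints the two `ceph fs set` commands and checks the daemon count; the order of the two commands is an assumption, not a recommendation.

```python
# Hypothetical file system name: the thread never states it.
FS_NAME = "cephfs"


def layout_commands(fs_name, active):
    """Return the ceph CLI commands for `active` ranks with standby-replay."""
    return [
        # Let spare MDS daemons follow active ranks as standby-replay.
        f"ceph fs set {fs_name} allow_standby_replay true",
        # Reduce the number of active ranks.
        f"ceph fs set {fs_name} max_mds {active}",
    ]


total_daemons = 8    # 7 active + 1 standby in the status output above
active = 3           # ranks 0 to 2
standby_replay = 3   # one follower per active rank
regular_standby = total_daemons - active - standby_replay

for cmd in layout_commands(FS_NAME, active):
    print(cmd)
print(f"regular standbys left: {regular_standby}")
```

With 8 daemons, 3 active ranks and 3 standby-replay followers leave exactly the 2 regular standbys from point 3.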

Of course, this raises the question of storage and performance.

Since I would be moving from 7 active MDSs to 3:

   1. each new active MDS will have to store more than twice the data
   2. the load will be more than twice as high
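
As a rough check of these two points, the sketch below redoes the arithmetic with the figures from the status output at the top of this message. It assumes the total request rate and inode count stay constant and spread evenly over the remaining ranks, which the uneven per-rank numbers suggest is optimistic. The INOS column is, as far as I understand, the number of inodes each MDS holds in its cache rather than data stored on the MDS host.

```python
# Figures copied from the `ceph fs status` output above (ranks 0 to 6).
reqs_per_s = [82, 194, 94, 103, 81, 84, 374]         # ACTIVITY column
inodes_k = [2288, 3789, 5369, 796, 3798, 489, 1669]  # INOS column, thousands

current_active = len(reqs_per_s)  # 7 active MDSs today
target_active = 3                 # proposed number of active MDSs

# Assumption: totals stay constant and split evenly across ranks.
scale = current_active / target_active
avg_reqs_now = sum(reqs_per_s) / current_active
avg_reqs_after = sum(reqs_per_s) / target_active
avg_inodes_after_k = sum(inodes_k) / target_active

print(f"scale factor per MDS: {scale:.2f}x")
print(f"avg requests/s per MDS: {avg_reqs_now:.0f} -> {avg_reqs_after:.0f}")
print(f"avg cached inodes per MDS after: {avg_inodes_after_k:.0f}k")
```

Under that even-split assumption the factor is 7/3, so "more than twice" holds on average; the actual per-rank load depends on how subtrees end up distributed.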

Am I correct?

Emmanuel

On Wed, May 24, 2023 at 2:31 PM Hector Martin <marcan@xxxxxxxxx> wrote:

> On 24/05/2023 21.15, Emmanuel Jaep wrote:
> > Hi,
> >
> > we are currently running a ceph fs cluster at the following version:
> > MDS version: ceph version 16.2.10
> > (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)
> >
> > The cluster is composed of 7 active MDSs and 1 standby MDS:
> > RANK  STATE      MDS         ACTIVITY     DNS    INOS   DIRS   CAPS
> >  0    active  icadmin012  Reqs:   73 /s  1938k  1880k  85.3k  92.8k
> >  1    active  icadmin008  Reqs:  206 /s  2375k  2375k  7081    171k
> >  2    active  icadmin007  Reqs:   91 /s  5709k  5256k   149k   299k
> >  3    active  icadmin014  Reqs:   93 /s   679k   664k  40.1k   216k
> >  4    active  icadmin013  Reqs:   86 /s  3585k  3569k  12.7k   197k
> >  5    active  icadmin011  Reqs:   72 /s   225k   221k  8611    164k
> >  6    active  icadmin015  Reqs:   87 /s  1682k  1610k  27.9k   274k
> >       POOL         TYPE     USED  AVAIL
> > cephfs_metadata  metadata  8552G  22.3T
> >   cephfs_data      data     226T  22.3T
> > STANDBY MDS
> >  icadmin006
> >
> > When I restart one of the active MDSs, the standby MDS takes over its
> > rank and enters the "replay" state. So far, so good!
> >
> > However, only one of the other "active" MDSs seems to remain active. All
> > activities drop from the other ones:
> > RANK  STATE      MDS         ACTIVITY     DNS    INOS   DIRS   CAPS
> >  0    active  icadmin012  Reqs:    0 /s  1938k  1881k  85.3k  9720
> >  1    active  icadmin008  Reqs:    0 /s  2375k  2375k  7080   2505
> >  2    active  icadmin007  Reqs:    2 /s  5709k  5256k   149k  26.5k
> >  3    active  icadmin014  Reqs:    0 /s   679k   664k  40.1k  3259
> >  4    replay  icadmin006                  801k   801k  1279      0
> >  5    active  icadmin011  Reqs:    0 /s   225k   221k  8611   9241
> >  6    active  icadmin015  Reqs:    0 /s  1682k  1610k  27.9k  34.8k
> >       POOL         TYPE     USED  AVAIL
> > cephfs_metadata  metadata  8539G  22.8T
> >   cephfs_data      data     225T  22.8T
> > STANDBY MDS
> >  icadmin013
> >
> > In effect, the cluster becomes almost unavailable until the newly promoted
> > MDS finishes rejoining the cluster.
> >
> > Obviously, this defeats the purpose of having 7 MDSs.
> > Is this the expected behavior?
> > If not, what configuration items should I check to go back to "normal"
> > operations?
> >
>
> Please ignore my previous email, I read too quickly. I see you do have a
> standby. However, that does not allow fast failover with multiple MDSes.
>
> For fast failover of any active MDS, you need one standby-replay daemon
> for *each* active MDS. Each standby-replay MDS follows one active MDS's
> rank only; you can't have one standby-replay daemon following all ranks.
> What you have right now is probably a regular standby daemon, which can
> take over any failed MDS, but requires waiting for the replay time.
>
> See:
>
> https://docs.ceph.com/en/latest/cephfs/standby/#configuring-standby-replay
>
> My explanation for the zero ops from the previous email still holds:
> it's likely that most clients will hang if any MDS rank is
> down/unavailable.
>
> - Hector
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>