Re: ceph Pacific - MDS activity freezes when one of the MDSs is restarted

Emmanuel Jaep <emmanuel.jaep@xxxxxxxxx> · Fri, 9 Jun 2023 15:49:31 +0200

Hi Eugen,

thanks for the response! :-)
We have (kind of) solved the problem immediately at hand. The whole process
was stuck because the MDSes were actually getting 'killed'. In fact, the
amount of RAM we allocated to the MDSes was insufficient to accommodate the
logs' complete replay. Therefore, the MDSes were not able to replay the
logs. As the standby server was having the same issue (same config –
leading to the same behavior), the system was stuck in an endless loop.

This got resolved by allowing MDSes to use more RAM. Once the system was
back to a stable state, I managed to reduce the max number of segments
before trimming. Since the system is now stable, I'm not entirely convinced
I should "rock the boat" any further.
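
For reference, here is a minimal sketch of the two kinds of change described
above. It assumes the memory setting involved was mds_cache_memory_limit and
the segment setting was mds_log_max_segments; the thread does not name the
options, and if the daemons were killed by a container or systemd memory
limit, that limit would have to be raised elsewhere. The values are
placeholders, not the ones used on this cluster. The script only prints the
commands unless APPLY=1 is set.

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
# Set APPLY=1 in the environment to run against a real cluster.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Placeholder values (assumptions, not taken from this thread).
CACHE_LIMIT_BYTES=$((16 * 1024 * 1024 * 1024))   # 16 GiB cache target
MAX_SEGMENTS=128                                 # journal segments kept before trimming

tune_mds() {
    # Raise the MDS cache memory target (a target, not a hard cap).
    run ceph config set mds mds_cache_memory_limit "$CACHE_LIMIT_BYTES"
    # Lower the number of journal segments kept before trimming,
    # so there is less journal to replay after a restart.
    run ceph config set mds mds_log_max_segments "$MAX_SEGMENTS"
}

tune_mds
```

The current values can be read back with "ceph config get mds <option>"
before changing anything.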

Thanks for your help,

Emmanuel

On Thu, Jun 8, 2023 at 12:04 PM Eugen Block <eblock@xxxxxx> wrote:

> Hi,
>
> sorry for not responding earlier.
>
> > Pardon my ignorance, I'm not quite sure I know what you mean by subtree
> > pinning. I quickly googled it and saw it was a new feature in Luminous.
> > We are running Pacific. I would assume this feature was not out yet.
>
> Luminous is older than Pacific, so the feature would be available for
> your cluster.
>
> > I can definitely try. However, I tried to lower the max number of mds.
> > Unfortunately, one of the MDSs seems to be stuck in "stopping" state for
> > more than 12 hours now.
>
> It does sound like reducing max_mds is causing issues. Other users
> with a high client load have reported similar problems during Ceph
> upgrades, where max_mds has to be reduced to 1 as well. Can you share
> more details about the MDS utilization (are those standalone servers
> or services colocated with OSDs, for example?), how many CephFS
> clients there are (ceph fs status), what kind of workload they
> produce, and the general Ceph load (ceph -s)? That would give a better
> impression of what's going on there. To check whether and how you use
> pinning, you could check out the docs [1] and see if any (upper-level)
> directory returns something for the getfattr commands. Or maybe
> someone documented using setfattr for your CephFS, perhaps in the
> command history?
>
> [1]
>
> https://docs.ceph.com/en/quincy/cephfs/multimds/#setting-subtree-partitioning-policies
>
> Quoting Emmanuel Jaep <emmanuel.jaep@xxxxxxxxx>:
>
> > Hi Eugen,
> >
> > Also, do you know why you use a multi-active MDS setup?
> > To be completely candid, I don't really know why this choice was made. I
> > assume the goal was to provide fault-tolerance and load-balancing.
> >
> > Was that a requirement for subtree pinning (otherwise multiple active
> > daemons would balance the hell out of each other) or maybe just an
> > experiment?
> > Pardon my ignorance, I'm not quite sure I know what you mean by subtree
> > pinning. I quickly googled it and saw it was a new feature in Luminous.
> > We are running Pacific. I would assume this feature was not out yet.
> >
> > Depending on the workload pinning might have been necessary, maybe you
> > would impact performance if you removed 3 MDS daemons?
> > I can definitely try. However, I tried to lower the max number of mds.
> > Unfortunately, one of the MDSs seems to be stuck in "stopping" state for
> > more than 12 hours now.
> >
> > Best,
> >
> > Emmanuel
> >
> > On Wed, May 24, 2023 at 4:34 PM Eugen Block <eblock@xxxxxx> wrote:
> >
> >> Hi,
> >>
> >> using standby-replay daemons is something to test as it can have a
> >> negative impact, it really depends on the actual workload. We stopped
> >> using standby-replay in all clusters we (help) maintain, in one
> >> specific case with many active MDSs and a high load the failover time
> >> decreased and was "cleaner" for the client application.
> >> Also, do you know why you use a multi-active MDS setup? Was that a
> >> requirement for subtree pinning (otherwise multiple active daemons
> >> would balance the hell out of each other) or maybe just an experiment?
> >> Depending on the workload pinning might have been necessary, maybe you
> >> would impact performance if you removed 3 MDS daemons? As an
> >> alternative you can also deploy multiple MDS daemons per host
> >> (count_per_host) which can utilize the server better, not sure which
> >> Pacific version that is, I just tried successfully on 16.2.13. That
> >> way you could still maintain the required number of MDS daemons (if
> >> it's still 7) and also have enough standby daemons. But that of
> >> course means in case one MDS host goes down all its daemons will also
> >> be unavailable. But we used this feature in an older version
> >> (customized Nautilus) quite successfully in a customer cluster.
> >> There are many things to consider here, just wanted to share a couple
> >> of thoughts.
> >>
> >> Regards,
> >> Eugen
> >>
> >> Quoting Hector Martin <marcan@xxxxxxxxx>:
> >>
> >> > Hi,
> >> >
> >> > On 24/05/2023 22.02, Emmanuel Jaep wrote:
> >> >> Hi Hector,
> >> >>
> >> >> thank you very much for the detailed explanation and link to the
> >> >> documentation.
> >> >>
> >> >> Given our current situation (7 active MDSs and 1 standby MDS):
> >> >> RANK  STATE      MDS         ACTIVITY     DNS    INOS   DIRS   CAPS
> >> >>  0    active  icadmin012  Reqs:   82 /s  2345k  2288k  97.2k   307k
> >> >>  1    active  icadmin008  Reqs:  194 /s  3789k  3789k  17.1k   641k
> >> >>  2    active  icadmin007  Reqs:   94 /s  5823k  5369k   150k   257k
> >> >>  3    active  icadmin014  Reqs:  103 /s   813k   796k  47.4k   163k
> >> >>  4    active  icadmin013  Reqs:   81 /s  3815k  3798k  12.9k   186k
> >> >>  5    active  icadmin011  Reqs:   84 /s   493k   489k  9145    176k
> >> >>  6    active  icadmin015  Reqs:  374 /s  1741k  1669k  28.1k   246k
> >> >>       POOL         TYPE     USED  AVAIL
> >> >> cephfs_metadata  metadata  8547G  25.2T
> >> >>   cephfs_data      data     223T  25.2T
> >> >> STANDBY MDS
> >> >>  icadmin006
> >> >>
> >> >> I would probably be better off:
> >> >>
> >> >>    1. having only 3 active MDSs (ranks 0 to 2)
> >> >>    2. configuring 3 standby-replay daemons to mirror ranks 0 to 2
> >> >>    3. having 2 'regular' standby MDSs
> >> >>
> >> >> Of course, this raises the question of storage and performance.
> >> >>
> >> >> Since I would be moving from 7 active MDSs to 3:
> >> >>
> >> >>    1. each new active MDS will have to store more than twice the data
> >> >>    2. the load will be more than twice as high
> >> >>
> >> >> Am I correct?
> >> >
> >> > Yes, that is correct. The MDSes don't store data locally but do
> >> > cache/maintain it in memory, so you will either have higher memory
> >> > load for the same effective cache size, or a lower cache size for the
> >> > same memory load.
> >> >
> >> > If you have 8 total MDSes, I'd go for 4+4. You don't need non-replay
> >> > standbys if you have a standby replay for each active MDS. As far as I
> >> > know, if you end up with an active and its standby both failing, some
> >> > other standby-replay MDS will still be stolen to take care of that
> >> > rank, so the cluster will eventually become healthy again after the
> >> > replay time.
> >> >
> >> > With 4 active MDSes down from the current 7, the load per MDS will be
> >> > a bit less than double.
> >> >
> >> >>
> >> >> Emmanuel
> >> >>
> >> >> On Wed, May 24, 2023 at 2:31 PM Hector Martin <marcan@xxxxxxxxx> wrote:
> >> >>
> >> >>> On 24/05/2023 21.15, Emmanuel Jaep wrote:
> >> >>>> Hi,
> >> >>>>
> >> >>>> we are currently running a ceph fs cluster at the following version:
> >> >>>> MDS version: ceph version 16.2.10
> >> >>>> (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)
> >> >>>>
> >> >>>> The cluster is composed of 7 active MDSs and 1 standby MDS:
> >> >>>> RANK  STATE      MDS         ACTIVITY     DNS    INOS   DIRS   CAPS
> >> >>>>  0    active  icadmin012  Reqs:   73 /s  1938k  1880k  85.3k  92.8k
> >> >>>>  1    active  icadmin008  Reqs:  206 /s  2375k  2375k  7081    171k
> >> >>>>  2    active  icadmin007  Reqs:   91 /s  5709k  5256k   149k   299k
> >> >>>>  3    active  icadmin014  Reqs:   93 /s   679k   664k  40.1k   216k
> >> >>>>  4    active  icadmin013  Reqs:   86 /s  3585k  3569k  12.7k   197k
> >> >>>>  5    active  icadmin011  Reqs:   72 /s   225k   221k  8611    164k
> >> >>>>  6    active  icadmin015  Reqs:   87 /s  1682k  1610k  27.9k   274k
> >> >>>>       POOL         TYPE     USED  AVAIL
> >> >>>> cephfs_metadata  metadata  8552G  22.3T
> >> >>>>   cephfs_data      data     226T  22.3T
> >> >>>> STANDBY MDS
> >> >>>>  icadmin006
> >> >>>>
> >> >>>> When I restart one of the active MDSs, the standby MDS becomes active
> >> >>>> and its state becomes "replay". So far, so good!
> >> >>>>
> >> >>>> However, only one of the other "active" MDSs seems to remain active.
> >> >>>> All activities drop from the other ones:
> >> >>>> RANK  STATE      MDS         ACTIVITY     DNS    INOS   DIRS   CAPS
> >> >>>>  0    active  icadmin012  Reqs:    0 /s  1938k  1881k  85.3k  9720
> >> >>>>  1    active  icadmin008  Reqs:    0 /s  2375k  2375k  7080   2505
> >> >>>>  2    active  icadmin007  Reqs:    2 /s  5709k  5256k   149k  26.5k
> >> >>>>  3    active  icadmin014  Reqs:    0 /s   679k   664k  40.1k  3259
> >> >>>>  4    replay  icadmin006                  801k   801k  1279      0
> >> >>>>  5    active  icadmin011  Reqs:    0 /s   225k   221k  8611   9241
> >> >>>>  6    active  icadmin015  Reqs:    0 /s  1682k  1610k  27.9k  34.8k
> >> >>>>       POOL         TYPE     USED  AVAIL
> >> >>>> cephfs_metadata  metadata  8539G  22.8T
> >> >>>>   cephfs_data      data     225T  22.8T
> >> >>>> STANDBY MDS
> >> >>>>  icadmin013
> >> >>>>
> >> >>>> In effect, the cluster becomes almost unavailable until the newly
> >> >>>> promoted MDS finishes rejoining the cluster.
> >> >>>>
> >> >>>> Obviously, this defeats the purpose of having 7 MDSs.
> >> >>>> Is this expected behavior?
> >> >>>> If not, what configuration items should I check to go back to
> >> >>>> "normal" operations?
> >> >>>>
> >> >>>
> >> >>> Please ignore my previous email, I read too quickly. I see you do
> >> >>> have a standby. However, that does not allow fast failover with
> >> >>> multiple MDSes.
> >> >>>
> >> >>> For fast failover of any active MDS, you need one standby-replay
> >> >>> daemon for *each* active MDS. Each standby-replay MDS follows one
> >> >>> active MDS's rank only; you can't have one standby-replay daemon
> >> >>> following all ranks. What you have right now is probably a regular
> >> >>> standby daemon, which can take over any failed MDS, but requires
> >> >>> waiting for the replay time.
> >> >>> See:
> >> >>>
> >> >>> https://docs.ceph.com/en/latest/cephfs/standby/#configuring-standby-replay
> >> >>>
> >> >>> My explanation for the zero ops from the previous email still holds:
> >> >>> it's likely that most clients will hang if any MDS rank is
> >> >>> down/unavailable.
> >> >>>
> >> >>> - Hector
> >> >>> _______________________________________________
> >> >>> ceph-users mailing list -- ceph-users@xxxxxxx
> >> >>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >> >>>
> >> >
> >> > - Hector