There was a memory issue with standby-replay that may have been resolved
since; the fix may be in 16.2.10 (I'm not sure). The suggestion at the time
was to avoid standby-replay.

Perhaps a dev can chime in on that status. Your MDSs look pretty inactive. I
would consider scaling them down (potentially to a single active MDS if your
workload allows).

The MDSs have an intricate update process when you use multiple active
daemons; make sure to read the docs on that if you aren't using cephadm and
want to attempt an upgrade.

A standby-replay daemon can only take over for a single rank (it tracks a
single active MDS), whereas a regular standby can take over for any rank.
More here:
https://docs.ceph.com/en/latest/cephfs/standby/#configuring-standby-replay

Respectfully,

*Wes Dillingham*
wes@xxxxxxxxxxxxxxxxx
LinkedIn <http://www.linkedin.com/in/wesleydillingham>


On Wed, May 24, 2023 at 10:33 AM Eugen Block <eblock@xxxxxx> wrote:

> Hi,
>
> Using standby-replay daemons is something to test, as it can have a
> negative impact; it really depends on the actual workload. We stopped
> using standby-replay in all clusters we (help) maintain. In one
> specific case with many active MDSs and a high load, the failover time
> decreased and was "cleaner" for the client application.
> Also, do you know why you use a multi-active MDS setup? Was that a
> requirement for subtree pinning (otherwise multiple active daemons
> would balance the hell out of each other) or maybe just an experiment?
> Depending on the workload, pinning might have been necessary; maybe you
> would impact performance if you removed 3 MDS daemons? As an
> alternative, you can also deploy multiple MDS daemons per host
> (count_per_host), which can utilize the server better. I'm not sure which
> Pacific version introduced that; I just tried it successfully on 16.2.13.
> That way you could still maintain the required number of MDS daemons (if
> it's still 7) and also have enough standby daemons.
> But that of course means that if one MDS host goes down, all its daemons
> will also be unavailable. But we used this feature in an older version
> (customized Nautilus) quite successfully in a customer cluster.
> There are many things to consider here, just wanted to share a couple
> of thoughts.
>
> Regards,
> Eugen
>
> Quoting Hector Martin <marcan@xxxxxxxxx>:
>
> > Hi,
> >
> > On 24/05/2023 22.02, Emmanuel Jaep wrote:
> >> Hi Hector,
> >>
> >> thank you very much for the detailed explanation and link to the
> >> documentation.
> >>
> >> Given our current situation (7 active MDSs and 1 standby MDS):
> >> RANK  STATE   MDS         ACTIVITY         DNS   INOS   DIRS   CAPS
> >>  0    active  icadmin012  Reqs:   82 /s  2345k  2288k  97.2k   307k
> >>  1    active  icadmin008  Reqs:  194 /s  3789k  3789k  17.1k   641k
> >>  2    active  icadmin007  Reqs:   94 /s  5823k  5369k   150k   257k
> >>  3    active  icadmin014  Reqs:  103 /s   813k   796k  47.4k   163k
> >>  4    active  icadmin013  Reqs:   81 /s  3815k  3798k  12.9k   186k
> >>  5    active  icadmin011  Reqs:   84 /s   493k   489k   9145   176k
> >>  6    active  icadmin015  Reqs:  374 /s  1741k  1669k  28.1k   246k
> >> POOL             TYPE      USED   AVAIL
> >> cephfs_metadata  metadata  8547G  25.2T
> >> cephfs_data      data       223T  25.2T
> >> STANDBY MDS
> >> icadmin006
> >>
> >> I would probably be better off:
> >>
> >> 1. having only 3 active MDSs (ranks 0 to 2)
> >> 2. configuring 3 standby-replay daemons to mirror ranks 0 to 2
> >> 3. having 2 'regular' standby MDSs
> >>
> >> Of course, this raises the question of storage and performance.
> >>
> >> Since I would be moving from 7 active MDSs to 3:
> >>
> >> 1. each new active MDS will have to store more than twice the data
> >> 2. the load will be more than twice as high
> >>
> >> Am I correct?
> >
> > Yes, that is correct. The MDSes don't store data locally but do
> > cache/maintain it in memory, so you will either have higher memory load
> > for the same effective cache size, or a lower cache size for the same
> > memory load.
> >
> > If you have 8 total MDSes, I'd go for 4+4 (4 active, 4 standby-replay).
> > You don't need non-replay
> > standbys if you have a standby-replay for each active MDS. As far as I
> > know, if you end up with an active and its standby both failing, some
> > other standby-replay MDS will still be stolen to take care of that rank,
> > so the cluster will eventually become healthy again after the replay
> > time.
> >
> > With 4 active MDSes down from the current 7, the load per MDS will be a
> > bit less than double.
> >
> >>
> >> Emmanuel
> >>
> >> On Wed, May 24, 2023 at 2:31 PM Hector Martin <marcan@xxxxxxxxx> wrote:
> >>
> >>> On 24/05/2023 21.15, Emmanuel Jaep wrote:
> >>>> Hi,
> >>>>
> >>>> we are currently running a ceph fs cluster at the following version:
> >>>> MDS version: ceph version 16.2.10
> >>>> (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)
> >>>>
> >>>> The cluster is composed of 7 active MDSs and 1 standby MDS:
> >>>> RANK  STATE   MDS         ACTIVITY         DNS   INOS   DIRS   CAPS
> >>>>  0    active  icadmin012  Reqs:   73 /s  1938k  1880k  85.3k  92.8k
> >>>>  1    active  icadmin008  Reqs:  206 /s  2375k  2375k   7081   171k
> >>>>  2    active  icadmin007  Reqs:   91 /s  5709k  5256k   149k   299k
> >>>>  3    active  icadmin014  Reqs:   93 /s   679k   664k  40.1k   216k
> >>>>  4    active  icadmin013  Reqs:   86 /s  3585k  3569k  12.7k   197k
> >>>>  5    active  icadmin011  Reqs:   72 /s   225k   221k   8611   164k
> >>>>  6    active  icadmin015  Reqs:   87 /s  1682k  1610k  27.9k   274k
> >>>> POOL             TYPE      USED   AVAIL
> >>>> cephfs_metadata  metadata  8552G  22.3T
> >>>> cephfs_data      data       226T  22.3T
> >>>> STANDBY MDS
> >>>> icadmin006
> >>>>
> >>>> When I restart one of the active MDSs, the standby MDS becomes active
> >>>> and its state becomes "replay". So far, so good!
> >>>>
> >>>> However, only one of the other "active" MDSs seems to remain active.
> >>>> All activities drop from the other ones:
> >>>> RANK  STATE   MDS         ACTIVITY         DNS   INOS   DIRS   CAPS
> >>>>  0    active  icadmin012  Reqs:    0 /s  1938k  1881k  85.3k   9720
> >>>>  1    active  icadmin008  Reqs:    0 /s  2375k  2375k   7080   2505
> >>>>  2    active  icadmin007  Reqs:    2 /s  5709k  5256k   149k  26.5k
> >>>>  3    active  icadmin014  Reqs:    0 /s   679k   664k  40.1k   3259
> >>>>  4    replay  icadmin006                  801k   801k   1279      0
> >>>>  5    active  icadmin011  Reqs:    0 /s   225k   221k   8611   9241
> >>>>  6    active  icadmin015  Reqs:    0 /s  1682k  1610k  27.9k  34.8k
> >>>> POOL             TYPE      USED   AVAIL
> >>>> cephfs_metadata  metadata  8539G  22.8T
> >>>> cephfs_data      data       225T  22.8T
> >>>> STANDBY MDS
> >>>> icadmin013
> >>>>
> >>>> In effect, the cluster becomes almost unavailable until the newly
> >>>> promoted MDS finishes rejoining the cluster.
> >>>>
> >>>> Obviously, this defeats the purpose of having 7 MDSs.
> >>>> Is this expected behavior?
> >>>> If not, what configuration items should I check to go back to "normal"
> >>>> operations?
> >>>>
> >>>
> >>> Please ignore my previous email, I read too quickly. I see you do have
> >>> a standby. However, that does not allow fast failover with multiple
> >>> MDSes.
> >>>
> >>> For fast failover of any active MDS, you need one standby-replay daemon
> >>> for *each* active MDS. Each standby-replay MDS follows one active MDS's
> >>> rank only; you can't have one standby-replay daemon following all
> >>> ranks.
> >>> What you have right now is probably a regular standby daemon, which can
> >>> take over any failed MDS, but requires waiting for the replay time.
> >>>
> >>> See:
> >>>
> >>> https://docs.ceph.com/en/latest/cephfs/standby/#configuring-standby-replay
> >>>
> >>> My explanation for the zero ops from the previous email still holds:
> >>> it's likely that most clients will hang if any MDS rank is
> >>> down/unavailable.
> >>>
> >>> - Hector
> >>> _______________________________________________
> >>> ceph-users mailing list -- ceph-users@xxxxxxx
> >>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >>>
> >
> > - Hector
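
The load estimates in this thread (going from 7 active MDSs to 4 or to 3) can
be sanity-checked with a few lines of Python. This is only a rough sketch: the
request rates are copied from the second `ceph fs status` table quoted above,
and the helper name `per_mds_load` is made up for illustration. It assumes the
total request rate stays the same and spreads perfectly evenly across the
active ranks, which a real cluster with subtree pinning or the balancer will
not do exactly.

```python
# Request rates (Reqs/s) per active rank, copied from the second
# `ceph fs status` table quoted in this thread (ranks 0 to 6).
reqs_per_rank = [82, 194, 94, 103, 81, 84, 374]


def per_mds_load(total_reqs: float, active_count: int) -> float:
    """Average request rate per active MDS, assuming a perfectly even spread.

    The even spread is an assumption made for this estimate only.
    """
    if active_count <= 0:
        raise ValueError("active_count must be positive")
    return total_reqs / active_count


total = sum(reqs_per_rank)
current = per_mds_load(total, len(reqs_per_rank))

# Compare the two layouts discussed in the thread: 4 active and 3 active.
for proposed in (4, 3):
    new = per_mds_load(total, proposed)
    print(
        f"{proposed} active: {new:.0f} req/s per MDS, "
        f"{new / current:.2f}x the current average"
    )
```

Under these assumptions the ratio depends only on the rank counts: 7/4 = 1.75
for four active MDSs ("a bit less than double") and 7/3, about 2.33, for three
("more than twice as high"). The same ratios apply to the cache each active
MDS would need to hold for the same effective cache size.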