Hi Wes,

thanks for the heads-up.

Best,

Emmanuel

On Wed, May 24, 2023 at 5:47 PM Wesley Dillingham <wes@xxxxxxxxxxxxxxxxx> wrote:

> There was a memory issue with standby-replay that may have been resolved
> since, and the fix is in 16.2.10 (not sure); the suggestion at the time
> was to avoid standby-replay.
>
> Perhaps a dev can chime in on that status. Your MDSs look pretty inactive.
> I would consider scaling them down (potentially to a single active MDS if
> your workload allows).
>
> The MDSs have an intricate upgrade process when you use multiple active
> daemons; make sure to read the docs on that if you aren't using cephadm
> and want to attempt an upgrade.
>
> A standby-replay daemon can only take over for a single rank (it tracks a
> single active MDS), whereas a standby can take over for any rank. More
> here:
> https://docs.ceph.com/en/latest/cephfs/standby/#configuring-standby-replay
>
> Respectfully,
>
> *Wes Dillingham*
> wes@xxxxxxxxxxxxxxxxx
> LinkedIn <http://www.linkedin.com/in/wesleydillingham>
>
>
> On Wed, May 24, 2023 at 10:33 AM Eugen Block <eblock@xxxxxx> wrote:
>
> > Hi,
> >
> > using standby-replay daemons is something to test, as it can have a
> > negative impact; it really depends on the actual workload. We stopped
> > using standby-replay in all clusters we (help) maintain. In one
> > specific case with many active MDSs and a high load, the failover time
> > decreased and was "cleaner" for the client application.
> > Also, do you know why you use a multi-active MDS setup? Was that a
> > requirement for subtree pinning (otherwise multiple active daemons
> > would balance the hell out of each other) or maybe just an experiment?
> > Depending on the workload, pinning might have been necessary; maybe you
> > would impact performance if you removed 3 MDS daemons? As an
> > alternative, you can also deploy multiple MDS daemons per host
> > (count_per_host), which can utilize the server better. I'm not sure
> > which Pacific version introduced that; I just tried it successfully on
> > 16.2.13.
> > That way you could still maintain the required number of MDS daemons
> > (if it's still 7) and also have enough standby daemons. But that of
> > course means that if one MDS host goes down, all its daemons will also
> > be unavailable. We used this feature in an older version (customized
> > Nautilus) quite successfully in a customer cluster.
> > There are many things to consider here, I just wanted to share a couple
> > of thoughts.
> >
> > Regards,
> > Eugen
> >
> > Quoting Hector Martin <marcan@xxxxxxxxx>:
> >
> > > Hi,
> > >
> > > On 24/05/2023 22.02, Emmanuel Jaep wrote:
> > >> Hi Hector,
> > >>
> > >> thank you very much for the detailed explanation and link to the
> > >> documentation.
> > >>
> > >> Given our current situation (7 active MDSs and 1 standby MDS):
> > >> RANK  STATE   MDS         ACTIVITY       DNS    INOS   DIRS   CAPS
> > >> 0     active  icadmin012  Reqs:  82 /s   2345k  2288k  97.2k  307k
> > >> 1     active  icadmin008  Reqs: 194 /s   3789k  3789k  17.1k  641k
> > >> 2     active  icadmin007  Reqs:  94 /s   5823k  5369k  150k   257k
> > >> 3     active  icadmin014  Reqs: 103 /s   813k   796k   47.4k  163k
> > >> 4     active  icadmin013  Reqs:  81 /s   3815k  3798k  12.9k  186k
> > >> 5     active  icadmin011  Reqs:  84 /s   493k   489k   9145   176k
> > >> 6     active  icadmin015  Reqs: 374 /s   1741k  1669k  28.1k  246k
> > >> POOL             TYPE      USED   AVAIL
> > >> cephfs_metadata  metadata  8547G  25.2T
> > >> cephfs_data      data      223T   25.2T
> > >> STANDBY MDS
> > >> icadmin006
> > >>
> > >> I would probably be better off:
> > >>
> > >>    1. having only 3 active MDSs (ranks 0 to 2)
> > >>    2. configuring 3 standby-replay daemons to follow ranks 0 to 2
> > >>    3. having 2 'regular' standby MDSs
> > >>
> > >> Of course, this raises the question of storage and performance.
> > >>
> > >> Since I would be moving from 7 active MDSs to 3:
> > >>
> > >>    1. each new active MDS will have to store more than twice the data
> > >>    2. the load will be more than twice as high
> > >>
> > >> Am I correct?
> > >
> > > Yes, that is correct.
> > > The MDSes don't store data locally but do cache/maintain it in
> > > memory, so you will either have higher memory load for the same
> > > effective cache size, or a lower cache size for the same memory load.
> > >
> > > If you have 8 total MDSes, I'd go for 4+4. You don't need non-replay
> > > standbys if you have a standby replay for each active MDS. As far as I
> > > know, if you end up with an active and its standby both failing, some
> > > other standby-replay MDS will still be stolen to take care of that
> > > rank, so the cluster will eventually become healthy again after the
> > > replay time.
> > >
> > > With 4 active MDSes down from the current 7, the load per MDS will be
> > > a bit less than double.
> > >
> > >>
> > >> Emmanuel
> > >>
> > >> On Wed, May 24, 2023 at 2:31 PM Hector Martin <marcan@xxxxxxxxx> wrote:
> > >>
> > >>> On 24/05/2023 21.15, Emmanuel Jaep wrote:
> > >>>> Hi,
> > >>>>
> > >>>> we are currently running a ceph fs cluster at the following version:
> > >>>> MDS version: ceph version 16.2.10
> > >>>> (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)
> > >>>>
> > >>>> The cluster is composed of 7 active MDSs and 1 standby MDS:
> > >>>> RANK  STATE   MDS         ACTIVITY       DNS    INOS   DIRS   CAPS
> > >>>> 0     active  icadmin012  Reqs:  73 /s   1938k  1880k  85.3k  92.8k
> > >>>> 1     active  icadmin008  Reqs: 206 /s   2375k  2375k  7081   171k
> > >>>> 2     active  icadmin007  Reqs:  91 /s   5709k  5256k  149k   299k
> > >>>> 3     active  icadmin014  Reqs:  93 /s   679k   664k   40.1k  216k
> > >>>> 4     active  icadmin013  Reqs:  86 /s   3585k  3569k  12.7k  197k
> > >>>> 5     active  icadmin011  Reqs:  72 /s   225k   221k   8611   164k
> > >>>> 6     active  icadmin015  Reqs:  87 /s   1682k  1610k  27.9k  274k
> > >>>> POOL             TYPE      USED   AVAIL
> > >>>> cephfs_metadata  metadata  8552G  22.3T
> > >>>> cephfs_data      data      226T   22.3T
> > >>>> STANDBY MDS
> > >>>> icadmin006
> > >>>>
> > >>>> When I restart one of the active MDSs, the standby MDS becomes
> > >>>> active and its state becomes "replay". So far, so good!
> > >>>>
> > >>>> However, only one of the other "active" MDSs seems to remain
> > >>>> active. All activity drops from the other ones:
> > >>>> RANK  STATE   MDS         ACTIVITY       DNS    INOS   DIRS   CAPS
> > >>>> 0     active  icadmin012  Reqs:   0 /s   1938k  1881k  85.3k  9720
> > >>>> 1     active  icadmin008  Reqs:   0 /s   2375k  2375k  7080   2505
> > >>>> 2     active  icadmin007  Reqs:   2 /s   5709k  5256k  149k   26.5k
> > >>>> 3     active  icadmin014  Reqs:   0 /s   679k   664k   40.1k  3259
> > >>>> 4     replay  icadmin006                 801k   801k   1279   0
> > >>>> 5     active  icadmin011  Reqs:   0 /s   225k   221k   8611   9241
> > >>>> 6     active  icadmin015  Reqs:   0 /s   1682k  1610k  27.9k  34.8k
> > >>>> POOL             TYPE      USED   AVAIL
> > >>>> cephfs_metadata  metadata  8539G  22.8T
> > >>>> cephfs_data      data      225T   22.8T
> > >>>> STANDBY MDS
> > >>>> icadmin013
> > >>>>
> > >>>> In effect, the cluster becomes almost unavailable until the newly
> > >>>> promoted MDS finishes rejoining the cluster.
> > >>>>
> > >>>> Obviously, this defeats the purpose of having 7 MDSs.
> > >>>> Is this expected behavior?
> > >>>> If not, what configuration items should I check to go back to
> > >>>> "normal" operations?
> > >>>>
> > >>>
> > >>> Please ignore my previous email, I read too quickly. I see you do
> > >>> have a standby. However, that does not allow fast failover with
> > >>> multiple MDSes.
> > >>>
> > >>> For fast failover of any active MDS, you need one standby-replay
> > >>> daemon for *each* active MDS. Each standby-replay MDS follows one
> > >>> active MDS's rank only; you can't have one standby-replay daemon
> > >>> following all ranks. What you have right now is probably a regular
> > >>> standby daemon, which can take over any failed MDS, but requires
> > >>> waiting for the replay time.
> > >>>
> > >>> See:
> > >>>
> > >>> https://docs.ceph.com/en/latest/cephfs/standby/#configuring-standby-replay
> > >>>
> > >>> My explanation for the zero ops from the previous email still holds:
> > >>> it's likely that most clients will hang if any MDS rank is
> > >>> down/unavailable.
> > >>>
> > >>> - Hector
> > >>> _______________________________________________
> > >>> ceph-users mailing list -- ceph-users@xxxxxxx
> > >>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> > >>>
> > >
> > > - Hector
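
A rough check of the load figures discussed in this thread, as a minimal Python sketch. It assumes the client load would spread evenly across the remaining active ranks, which the CephFS balancer and subtree pinning do not guarantee. The request rates are copied from the `ceph fs status` listing in the follow-up message (7 active, 1 standby); the helper names are made up for illustration.

```python
# Request rates (req/s) per active MDS rank, copied from the
# `ceph fs status` output quoted in the thread (ranks 0 to 6).
reqs_per_rank = [82, 194, 94, 103, 81, 84, 374]


def avg_load_per_mds(rates, active_count):
    """Average req/s per MDS, ASSUMING the total spreads evenly."""
    return sum(rates) / active_count


def scale_factor(old_active, new_active):
    """How much the per-MDS share grows when reducing the active count."""
    return old_active / new_active


if __name__ == "__main__":
    for n in (7, 4, 3):
        avg = avg_load_per_mds(reqs_per_rank, n)
        factor = scale_factor(len(reqs_per_rank), n)
        print(f"{n} active MDSs: {avg:.1f} req/s on average, x{factor:.2f}")
```

Under that even-spread assumption, going from 7 to 4 active MDSs multiplies the per-MDS share by 1.75 ("a bit less than double"), while going from 7 to 3 multiplies it by about 2.33 ("more than twice"). The same factors apply to the metadata each MDS has to cache.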