So I guess I'll end up doing:

ceph fs set cephfs max_mds 4
ceph fs set cephfs allow_standby_replay true

On Wed, May 24, 2023 at 4:13 PM Hector Martin <marcan@xxxxxxxxx> wrote:
> Hi,
>
> On 24/05/2023 22.02, Emmanuel Jaep wrote:
> > Hi Hector,
> >
> > thank you very much for the detailed explanation and link to the
> > documentation.
> >
> > Given our current situation (7 active MDSs and 1 standby MDS):
> > RANK  STATE   MDS         ACTIVITY      DNS    INOS   DIRS   CAPS
> >  0    active  icadmin012  Reqs:  82 /s  2345k  2288k  97.2k  307k
> >  1    active  icadmin008  Reqs: 194 /s  3789k  3789k  17.1k  641k
> >  2    active  icadmin007  Reqs:  94 /s  5823k  5369k  150k   257k
> >  3    active  icadmin014  Reqs: 103 /s  813k   796k   47.4k  163k
> >  4    active  icadmin013  Reqs:  81 /s  3815k  3798k  12.9k  186k
> >  5    active  icadmin011  Reqs:  84 /s  493k   489k   9145   176k
> >  6    active  icadmin015  Reqs: 374 /s  1741k  1669k  28.1k  246k
> > POOL             TYPE      USED   AVAIL
> > cephfs_metadata  metadata  8547G  25.2T
> > cephfs_data      data      223T   25.2T
> > STANDBY MDS
> > icadmin006
> >
> > I would probably be better off:
> >
> > 1. having only 3 active MDSs (ranks 0 to 2)
> > 2. configuring 3 standby-replay MDSs to mirror ranks 0 to 2
> > 3. having 2 'regular' standby MDSs
> >
> > Of course, this raises the question of storage and performance.
> >
> > Since I would be moving from 7 active MDSs to 3:
> >
> > 1. each new active MDS will have to store more than twice the data
> > 2. the load will be more than twice as high
> >
> > Am I correct?
>
> Yes, that is correct. The MDSes don't store data locally but do
> cache/maintain it in memory, so you will either have higher memory load
> for the same effective cache size, or a lower cache size for the same
> memory load.
>
> If you have 8 total MDSes, I'd go for 4+4. You don't need non-replay
> standbys if you have a standby-replay for each active MDS. As far as I
> know, if you end up with an active and its standby both failing, some
> other standby-replay MDS will still be stolen to take care of that rank,
> so the cluster will eventually become healthy again after the replay time.
>
> With 4 active MDSes down from the current 7, the load per MDS will be a
> bit less than double.
>
> > Emmanuel
> >
> > On Wed, May 24, 2023 at 2:31 PM Hector Martin <marcan@xxxxxxxxx> wrote:
> >
> >> On 24/05/2023 21.15, Emmanuel Jaep wrote:
> >>> Hi,
> >>>
> >>> we are currently running a ceph fs cluster at the following version:
> >>> MDS version: ceph version 16.2.10
> >>> (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)
> >>>
> >>> The cluster is composed of 7 active MDSs and 1 standby MDS:
> >>> RANK  STATE   MDS         ACTIVITY      DNS    INOS   DIRS   CAPS
> >>>  0    active  icadmin012  Reqs:  73 /s  1938k  1880k  85.3k  92.8k
> >>>  1    active  icadmin008  Reqs: 206 /s  2375k  2375k  7081   171k
> >>>  2    active  icadmin007  Reqs:  91 /s  5709k  5256k  149k   299k
> >>>  3    active  icadmin014  Reqs:  93 /s  679k   664k   40.1k  216k
> >>>  4    active  icadmin013  Reqs:  86 /s  3585k  3569k  12.7k  197k
> >>>  5    active  icadmin011  Reqs:  72 /s  225k   221k   8611   164k
> >>>  6    active  icadmin015  Reqs:  87 /s  1682k  1610k  27.9k  274k
> >>> POOL             TYPE      USED   AVAIL
> >>> cephfs_metadata  metadata  8552G  22.3T
> >>> cephfs_data      data      226T   22.3T
> >>> STANDBY MDS
> >>> icadmin006
> >>>
> >>> When I restart one of the active MDSs, the standby MDS becomes active
> >>> and its state becomes "replay". So far, so good!
> >>>
> >>> However, only one of the other "active" MDSs seems to remain active.
> >>> All activities drop from the other ones:
> >>> RANK  STATE   MDS         ACTIVITY     DNS    INOS   DIRS   CAPS
> >>>  0    active  icadmin012  Reqs: 0 /s   1938k  1881k  85.3k  9720
> >>>  1    active  icadmin008  Reqs: 0 /s   2375k  2375k  7080   2505
> >>>  2    active  icadmin007  Reqs: 2 /s   5709k  5256k  149k   26.5k
> >>>  3    active  icadmin014  Reqs: 0 /s   679k   664k   40.1k  3259
> >>>  4    replay  icadmin006               801k   801k   1279   0
> >>>  5    active  icadmin011  Reqs: 0 /s   225k   221k   8611   9241
> >>>  6    active  icadmin015  Reqs: 0 /s   1682k  1610k  27.9k  34.8k
> >>> POOL             TYPE      USED   AVAIL
> >>> cephfs_metadata  metadata  8539G  22.8T
> >>> cephfs_data      data      225T   22.8T
> >>> STANDBY MDS
> >>> icadmin013
> >>>
> >>> In effect, the cluster becomes almost unavailable until the newly
> >>> promoted MDS finishes rejoining the cluster.
> >>>
> >>> Obviously, this defeats the purpose of having 7 MDSs.
> >>> Is this expected behavior?
> >>> If not, what configuration items should I check to go back to "normal"
> >>> operations?
> >>
> >> Please ignore my previous email, I read too quickly. I see you do have a
> >> standby. However, that does not allow fast failover with multiple MDSes.
> >>
> >> For fast failover of any active MDS, you need one standby-replay daemon
> >> for *each* active MDS. Each standby-replay MDS follows one active MDS's
> >> rank only; you can't have one standby-replay daemon following all ranks.
> >> What you have right now is probably a regular standby daemon, which can
> >> take over any failed MDS, but requires waiting for the replay time.
> >>
> >> See:
> >> https://docs.ceph.com/en/latest/cephfs/standby/#configuring-standby-replay
> >>
> >> My explanation for the zero ops from the previous email still holds:
> >> it's likely that most clients will hang if any MDS rank is
> >> down/unavailable.
> >>
> >> - Hector
>
> - Hector

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
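A minimal sketch of the reconfiguration described at the top of the thread, assuming the filesystem is named cephfs (as above) and all 8 MDS daemons stay deployed; the two settings are independent, so their order should not matter, and the final status check is just the usual way to confirm that ranks 4-6 have stopped and that the freed daemons have attached as standby-replay followers:

    # drop from 7 active ranks to 4; the cluster stops the extra ranks itself
    ceph fs set cephfs max_mds 4

    # allow the remaining standby daemons to follow active ranks as standby-replay
    ceph fs set cephfs allow_standby_replay true

    # verify: 4 ranks shown as "active", 4 daemons shown as "standby-replay"
    ceph fs status cephfs

Note that a stopping rank exports its metadata subtrees to the surviving ranks, so reducing max_mds on a busy filesystem is probably best done during a quiet period.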