Hi,

On 24/05/2023 22.02, Emmanuel Jaep wrote:
> Hi Hector,
>
> thank you very much for the detailed explanation and link to the
> documentation.
>
> Given our current situation (7 active MDSs and 1 standby MDS):
>
> RANK  STATE    MDS         ACTIVITY        DNS    INOS   DIRS   CAPS
>  0    active   icadmin012  Reqs:   82 /s  2345k  2288k  97.2k   307k
>  1    active   icadmin008  Reqs:  194 /s  3789k  3789k  17.1k   641k
>  2    active   icadmin007  Reqs:   94 /s  5823k  5369k   150k   257k
>  3    active   icadmin014  Reqs:  103 /s   813k   796k  47.4k   163k
>  4    active   icadmin013  Reqs:   81 /s  3815k  3798k  12.9k   186k
>  5    active   icadmin011  Reqs:   84 /s   493k   489k   9145   176k
>  6    active   icadmin015  Reqs:  374 /s  1741k  1669k  28.1k   246k
>
>       POOL         TYPE      USED   AVAIL
> cephfs_metadata   metadata   8547G  25.2T
>   cephfs_data       data      223T  25.2T
>
> STANDBY MDS
>  icadmin006
>
> I would probably be better off:
>
> 1. having only 3 active MDSs (ranks 0 to 2)
> 2. configuring 3 standby-replay daemons to mirror ranks 0 to 2
> 3. having 2 'regular' standby MDSs
>
> Of course, this raises the question of storage and performance.
>
> Since I would be moving from 7 active MDSs to 3:
>
> 1. each new active MDS will have to store more than twice the data
> 2. the load will be more than twice as high
>
> Am I correct?

Yes, that is correct. The MDSes don't store data locally but do
cache/maintain it in memory, so you will either have a higher memory
load for the same effective cache size, or a lower cache size for the
same memory load.

If you have 8 MDSes in total, I'd go for 4+4. You don't need non-replay
standbys if you have a standby-replay daemon for each active MDS. As far
as I know, if you end up with an active MDS and its standby both
failing, some other standby-replay MDS will still be stolen to take care
of that rank, so the cluster will eventually become healthy again after
the replay time.

With 4 active MDSes, down from the current 7, the load per MDS will be a
bit less than double.

> Emmanuel
>
> On Wed, May 24, 2023 at 2:31 PM Hector Martin <marcan@xxxxxxxxx> wrote:
>
>> On 24/05/2023 21.15, Emmanuel Jaep wrote:
>>> Hi,
>>>
>>> we are currently running a ceph fs cluster at the following version:
>>> MDS version: ceph version 16.2.10
>>> (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)
>>>
>>> The cluster is composed of 7 active MDSs and 1 standby MDS:
>>>
>>> RANK  STATE    MDS         ACTIVITY        DNS    INOS   DIRS   CAPS
>>>  0    active   icadmin012  Reqs:   73 /s  1938k  1880k  85.3k  92.8k
>>>  1    active   icadmin008  Reqs:  206 /s  2375k  2375k   7081   171k
>>>  2    active   icadmin007  Reqs:   91 /s  5709k  5256k   149k   299k
>>>  3    active   icadmin014  Reqs:   93 /s   679k   664k  40.1k   216k
>>>  4    active   icadmin013  Reqs:   86 /s  3585k  3569k  12.7k   197k
>>>  5    active   icadmin011  Reqs:   72 /s   225k   221k   8611   164k
>>>  6    active   icadmin015  Reqs:   87 /s  1682k  1610k  27.9k   274k
>>>
>>>       POOL         TYPE      USED   AVAIL
>>> cephfs_metadata   metadata   8552G  22.3T
>>>   cephfs_data       data      226T  22.3T
>>>
>>> STANDBY MDS
>>>  icadmin006
>>>
>>> When I restart one of the active MDSs, the standby MDS becomes active
>>> and its state becomes "replay". So far, so good!
>>>
>>> However, only one of the other "active" MDSs seems to remain active. All
>>> activities drop from the other ones:
>>>
>>> RANK  STATE    MDS         ACTIVITY       DNS    INOS   DIRS   CAPS
>>>  0    active   icadmin012  Reqs:   0 /s  1938k  1881k  85.3k   9720
>>>  1    active   icadmin008  Reqs:   0 /s  2375k  2375k   7080   2505
>>>  2    active   icadmin007  Reqs:   2 /s  5709k  5256k   149k  26.5k
>>>  3    active   icadmin014  Reqs:   0 /s   679k   664k  40.1k   3259
>>>  4    replay   icadmin006                 801k   801k   1279      0
>>>  5    active   icadmin011  Reqs:   0 /s   225k   221k   8611   9241
>>>  6    active   icadmin015  Reqs:   0 /s  1682k  1610k  27.9k  34.8k
>>>
>>>       POOL         TYPE      USED   AVAIL
>>> cephfs_metadata   metadata   8539G  22.8T
>>>   cephfs_data       data      225T  22.8T
>>>
>>> STANDBY MDS
>>>  icadmin013
>>>
>>> In effect, the cluster becomes almost unavailable until the newly
>>> promoted MDS finishes rejoining the cluster.
>>>
>>> Obviously, this defeats the purpose of having 7 MDSs.
>>> Is this the expected behavior?
>>> If not, what configuration items should I check to go back to "normal"
>>> operations?
>>>
>>
>> Please ignore my previous email, I read too quickly. I see you do have a
>> standby. However, that does not allow fast failover with multiple MDSes.
>>
>> For fast failover of any active MDS, you need one standby-replay daemon
>> for *each* active MDS. Each standby-replay MDS follows one active MDS's
>> rank only; you can't have one standby-replay daemon following all ranks.
>> What you have right now is probably a regular standby daemon, which can
>> take over any failed MDS, but requires waiting for the replay time.
>>
>> See:
>>
>> https://docs.ceph.com/en/latest/cephfs/standby/#configuring-standby-replay
>>
>> My explanation for the zero ops from the previous email still holds:
>> it's likely that most clients will hang if any MDS rank is
>> down/unavailable.
>>
>> - Hector

- Hector
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
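
For reference, the 4+4 layout suggested above comes down to two
"ceph fs set" calls plus a status check. The sketch below is not taken
from the thread: it assumes the filesystem is named cephfs (only the
pool names appear above), so substitute your actual filesystem name.

  # Reduce the number of active ranks from 7 to 4; Ceph stops the extra
  # ranks one at a time until only ranks 0-3 remain active.
  ceph fs set cephfs max_mds 4

  # Let the remaining standby daemons each follow one active rank as
  # standby-replay, so a failover skips most of the replay wait.
  ceph fs set cephfs allow_standby_replay true

  # Check the result: 4 active ranks, each with a standby-replay daemon.
  ceph fs status cephfs

Lowering max_mds migrates the stopped ranks' subtrees onto the remaining
active MDSes, so it is best done during a quiet period.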