Re: [CephFS] Completely exclude some MDS rank from directory processing

Well... only because I had this discussion in the back of my mind when I watched the video yesterday. ;-)

Cheers,
Frédéric.

----- On 22 Nov 24, at 8:59, Eugen Block eblock@xxxxxx wrote:

> Then you were clearly paying more attention than me. ;-) We had some
> maintenance going on during that talk, so I couldn't really focus
> entirely on listening. But thanks for clarifying!
> 
> Quoting Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>:
> 
>> Hi Eugen,
>>
>> During the talk you mentioned, Dan said there's a hard-coded
>> limit of 256 MDSs per cluster. So with one active and one
>> standby-ish MDS per filesystem, that would be 128 filesystems at
>> most per cluster.
>> Mark said he got to 120, but things start to get wacky by 80. :-)
>>
>> More fun to come, for sure.
>>
>> Cheers,
>> Frédéric.
>>
>> [1] https://youtu.be/qiCE1Ifws80?t=2602
>>
>> ----- On 21 Nov 24, at 9:36, Eugen Block eblock@xxxxxx wrote:
>>
>>> I'm not aware of any hard limit for the number of Filesystems, but
>>> that doesn't really mean very much. IIRC, last week during a Clyso
>>> talk at Eventbrite I heard someone say that they had deployed around
>>> 200 Filesystems or so; I don't remember if it was a production
>>> environment or just a lab. I assume that you would more likely be
>>> limited by the number of OSDs/PGs than by the number of Filesystems,
>>> since 200 Filesystems require at least 400 pools. But maybe someone
>>> else has more experience scaling CephFS that way. What we did was
>>> scale the number of active MDS daemons for one CephFS. I believe in
>>> the end the customer had 48 MDS daemons on three MDS servers: 16 of
>>> them were active with directory pinning, and at that time they had 16
>>> standby-replay and 16 standby daemons. But it turned out that
>>> standby-replay didn't help their use case, so we disabled it.
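>>>
>>> If you go the multiple-Filesystems route, a rough sketch would look
>>> like the following (the fs and pool names are just placeholders, and
>>> depending on the release you may first have to allow multiple
>>> Filesystems):
>>>
>>> ceph fs flag set enable_multiple true
>>> ceph osd pool create cephfs2_metadata
>>> ceph osd pool create cephfs2_data
>>> ceph fs new cephfs2 cephfs2_metadata cephfs2_data
>>> # standby-replay is toggled per filesystem, e.g. to disable it again:
>>> ceph fs set cephfs2 allow_standby_replay false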
>>>
>>> Can you show the entire 'ceph fs status' output? And maybe also 'ceph
>>> fs dump'?
>>>
>>> Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
>>>
>>>>>
>>>>> Just for testing purposes, have you tried pinning some other
>>>>> directory to rank 1? Does it still break the CephFS if you stop it?
>>>>
>>>>
>>>> Yes, nothing changed.
>>>>
>>>> It's not a problem that the FS hangs when one of the ranks goes down; we
>>>> will have standby-replay for all ranks. What I don't like is that a rank
>>>> which is not pinned to a dir still handles some IO for that dir, or from
>>>> clients which work with that dir.
>>>> I mean that I can't robustly and fully separate client IO by rank.
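>>>>
>>>> (By standby-replay I mean roughly this, assuming the filesystem is fs1:
>>>>
>>>> ceph fs set fs1 allow_standby_replay true
>>>> ceph fs status fs1   # each active rank should then get a standby-replay daemon
>>>>
>>>> so failover itself is not my concern.)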
>>>>
>>>>> Would it be an option to rather use multiple Filesystems instead of
>>>>> multi-active for one CephFS?
>>>>
>>>>
>>>> Yes, it's an option, but it is much more complicated in our case. Btw, do
>>>> you know how many different FSs can be created in one cluster? Do you
>>>> know of any potential problems with 100-200 FSs in one cluster?
>>>>
>>>> On Wed, Nov 20, 2024 at 17:50, Eugen Block <eblock@xxxxxx> wrote:
>>>>
>>>>> Ah, I misunderstood, I thought you wanted an even distribution across
>>>>> both ranks.
>>>>> Just for testing purposes, have you tried pinning some other directory
>>>>> to rank 1? Does it still break the CephFS if you stop it? I'm not sure
>>>>> if you can prevent rank 1 from participating, I haven't looked into
>>>>> all the configs in quite a while. Would it be an option to rather use
>>>>> multiple Filesystems instead of multi-active for one CephFS?
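>>>>>
>>>>> For the test I mean something like this (untested here, just to
>>>>> illustrate; whether getfattr returns the vxattr depends on your client
>>>>> version):
>>>>>
>>>>> setfattr -n ceph.dir.pin -v 1 /fs-mountpoint/dir2
>>>>> getfattr -n ceph.dir.pin /fs-mountpoint/dir2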
>>>>>
>>>>> Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
>>>>>
>>>>> > No, it's not a typo, it's a misleading example. :)
>>>>> >
>>>>> > dir1 and dir2 are pinned to rank 0, but the FS and dir1/dir2 can't work
>>>>> > without rank 1.
>>>>> > Rank 1 is used for something when I work with these dirs.
>>>>> >
>>>>> > Ceph 16.2.13; the metadata balancer and policy-based balancing are not
>>>>> > used.
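>>>>> >
>>>>> > (Just to show how I check which rank is authoritative for the dirs: on
>>>>> > the MDS host, with <mds-id> being the daemon name from 'ceph fs status',
>>>>> > e.g. b or c, I run
>>>>> >
>>>>> > ceph daemon mds.<mds-id> get subtrees | grep -A 5 '/dir1'
>>>>> >
>>>>> > and look at the 'auth_first' field of the matching subtree.)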
>>>>> >
>>>>> > On Wed, Nov 20, 2024 at 16:33, Eugen Block <eblock@xxxxxx> wrote:
>>>>> >
>>>>> >> Hi,
>>>>> >>
>>>>> >> > After pinning:
>>>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
>>>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
>>>>> >>
>>>>> >> Is this a typo? If not, you did pin both directories to the same rank.
>>>>> >>
>>>>> >> Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
>>>>> >>
>>>>> >> > Hi,
>>>>> >> >
>>>>> >> > I'm trying to distribute all top-level dirs in CephFS across different
>>>>> >> > MDS ranks.
>>>>> >> > I have two active MDSs with ranks *0* and *1*, and I have two top-level
>>>>> >> > dirs, */dir1* and */dir2*.
>>>>> >> >
>>>>> >> > After pinning:
>>>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
>>>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
>>>>> >> >
>>>>> >> > I can see the following DNS and INOS distribution:
>>>>> >> > RANK  STATE   MDS     ACTIVITY     DNS    INOS   DIRS   CAPS
>>>>> >> >  0    active   c   Reqs:    127 /s  12.6k  12.5k   333    505
>>>>> >> >  1    active   b   Reqs:    11 /s    21     24     19      1
>>>>> >> >
>>>>> >> > When I write to dir1, I can see a small number of Reqs: on rank 1.
>>>>> >> >
>>>>> >> > Events in the journal of the MDS with rank 1:
>>>>> >> > cephfs-journal-tool --rank=fs1:1 event get list
>>>>> >> >
>>>>> >> > 2024-11-20T12:24:42.045056+0300 0xc5c1cb UPDATE:  (scatter_writebehind)
>>>>> >> >   A2037D53
>>>>> >> > 2024-11-20T12:24:46.935934+0300 0xc5c629 SESSION:  ()
>>>>> >> > 2024-11-20T12:24:47.192012+0300 0xc5c7cd UPDATE:  (lock inest accounted scatter stat update)
>>>>> >> > 2024-11-20T12:24:47.904717+0300 0xc5ca0b SESSION:  ()
>>>>> >> > 2024-11-20T12:26:46.912719+0300 0xc5ca98 SESSION:  ()
>>>>> >> > 2024-11-20T12:26:47.910806+0300 0xc5cc3c SESSION:  ()
>>>>> >> > 2024-11-20T12:27:35.746239+0300 0xc5ccc9 SESSION:  ()
>>>>> >> > 2024-11-20T12:28:46.923812+0300 0xc5ce63 SESSION:  ()
>>>>> >> > 2024-11-20T12:28:47.903066+0300 0xc5d007 SESSION:  ()
>>>>> >> > 2024-11-20T12:29:08.063326+0300 0xc5d094 EXPORT:  ()
>>>>> >> >   di1/A2037D53
>>>>> >> > 2024-11-20T12:30:46.909621+0300 0xc5d96f SESSION:  ()
>>>>> >> > 2024-11-20T12:30:47.908050+0300 0xc5db13 SESSION:  ()
>>>>> >> > 2024-11-20T12:32:46.907649+0300 0xc5dba0 SESSION:  ()
>>>>> >> > 2024-11-20T12:32:47.905962+0300 0xc5dd44 SESSION:  ()
>>>>> >> > 2024-11-20T12:34:44.349348+0300 0xc5ddd1 SESSIONS:  ()
>>>>> >> >
>>>>> >> > But the main problem: when I stop MDS rank 1 (without any kind of
>>>>> >> > standby), the FS hangs for all operations.
>>>>> >> > Is this expected? Is it possible to completely exclude rank 1 from
>>>>> >> > processing dir1 so that IO does not stop when rank 1 goes down?
>>>>> >>
>>>>> >>
>>>>> >>
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



