Re: [CephFS] Completely exclude some MDS rank from directory processing

Hi Frank, thanks!

> it might be that you are expecting too much from ceph. The design of the
> filesystem was not some grand plan with every detail worked out. It was
> more the classic evolutionary approach, something working was screwed on
> top of rados and things evolved from there on.


I had some hope that it was just a configuration problem in my
environment.

> Specifically, rank 0 is critical.


Yes, because we can't re-pin the root of the FS to some other rank, it was
clear that rank 0 is critical. But unfortunately, as we can see, all ranks
are critical for stable operation in any directory.

Thu, Nov 21, 2024 at 14:46, Frank Schilder <frans@xxxxxx>:

> Hi Alexander,
>
> it might be that you are expecting too much from ceph. The design of the
> filesystem was not some grand plan with every detail worked out. It was
> more the classic evolutionary approach, something working was screwed on
> top of rados and things evolved from there on.
>
> It is possible that the code and pin-separation are not as clean as one
> would imagine. Here is what I observe before and after pinning everything
> explicitly:
>
> - before pinning:
>   * high MDS load for no apparent reason - the balancer was just going in
> circles
>   * stopping an MDS would basically bring all IO down
>
> - after pinning:
>   * low MDS load, better user performance, much faster restarts
>   * stopping an MDS does not kill all IO immediately, some IO continues,
> however, eventually every client gets stuck
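>
> In case it's useful, "pinning everything explicitly" can be done with a
> loop along these lines (a minimal sketch; the mount point /mnt/cephfs and
> the rank count are assumptions, adjust them to your layout):
>
>   # Distribute all top-level directories round-robin across the active
>   # ranks, so the balancer has nothing left to move around.
>   ranks=2; i=0
>   for d in /mnt/cephfs/*/; do
>       setfattr -n ceph.dir.pin -v $((i % ranks)) "$d"
>       i=$((i + 1))
>   done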
>
> There is apparently still communication between all ranks about all
> clients, and it is a bit annoying that some of this communication is
> blocking. I'm not sure if it has to be blocking or if one could make
> these requests to the down rank asynchronous. My impression is that ceph
> internals are rather bad at making stuff asynchronous. So if something in
> the MDS cluster is not healthy, sooner or later IO will stop, waiting for
> some blocking request to the unhealthy MDS. There seems to be no such thing
> as IO on other healthy MDSes continuing as usual.
>
> Specifically, rank 0 is critical.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Eugen Block <eblock@xxxxxx>
> Sent: Thursday, November 21, 2024 9:36 AM
> To: Александр Руденко
> Cc: ceph-users@xxxxxxx
> Subject:  Re: [CephFS] Completely exclude some MDS rank from
> directory processing
>
> I'm not aware of any hard limit for the number of Filesystems, but
> that doesn't really mean very much. IIRC, last week during a Clyso
> talk at Eventbrite I heard someone say that they deployed around 200
> Filesystems or so; I don't remember if it was a production environment
> or just a lab environment. I assume that you would probably be limited
> by the number of OSDs/PGs rather than by the number of Filesystems;
> 200 Filesystems require at least 400 pools. But maybe someone else has
> more experience in scaling CephFS that way. What we did was to scale
> the number of active MDS daemons for one CephFS. I believe in the end
> the customer had 48 MDS daemons on three MDS servers, 16 of them were
> active with directory pinning, at that time they had 16 standby-replay
> and 16 standby daemons. But it turned out that standby-replay didn't
> help their use case, so we disabled standby-replay.
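>
> If you do go the multiple-Filesystems route, each FS needs its own
> metadata and data pool; the per-FS setup is roughly the following
> (pool and FS names below are placeholders):
>
>   ceph osd pool create fs42_metadata
>   ceph osd pool create fs42_data
>   ceph fs new fs42 fs42_metadata fs42_data
>   # standby-replay can be toggled per filesystem:
>   ceph fs set fs42 allow_standby_replay false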
>
> Can you show the entire 'ceph fs status' output? And maybe also 'ceph
> fs dump'?
>
> Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
>
> >>
> >> Just for testing purposes, have you tried pinning rank 1 to some other
> >> directory? Does it still break the CephFS if you stop it?
> >
> >
> > Yes, nothing changed.
> >
> > It's not a problem that the FS hangs when one of the ranks goes down;
> > we will have standby-replay for all ranks. What I don't like is that a
> > rank which is not pinned to some dir still handles some IO of this dir,
> > or IO from clients which work with this dir.
> > I mean that I can't robustly and fully separate client IO by ranks.
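> >
> > For what it's worth, the subtree-to-rank mapping can be inspected on a
> > running MDS like this (the mds name is a placeholder):
> >
> >   ceph tell mds.<name> get subtrees | jq '.[] | [.dir.path, .auth_first, .export_pin]'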
> >
> >> Would it be an option to rather use multiple Filesystems instead of
> >> multi-active for one CephFS?
> >
> >
> > Yes, it's an option, but it is much more complicated in our case. Btw, do
> > you know how many different FSs can be created in one cluster? Maybe you
> > know of potential problems with 100-200 FSs in one cluster?
> >
> Wed, Nov 20, 2024 at 17:50, Eugen Block <eblock@xxxxxx>:
> >
> >> Ah, I misunderstood, I thought you wanted an even distribution across
> >> both ranks.
> >> Just for testing purposes, have you tried pinning rank 1 to some other
> >> directory? Does it still break the CephFS if you stop it? I'm not sure
> >> if you can prevent rank 1 from participating; I haven't looked into
> >> all the configs in quite a while. Would it be an option to rather use
> >> multiple Filesystems instead of multi-active for one CephFS?
> >>
> >> Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
> >>
> >> > No, it's not a typo. It's a misleading example.
> >> >
> >> > dir1 and dir2 are pinned to rank 0, but the FS and dir1/dir2 can't
> >> > work without rank 1.
> >> > Rank 1 is used for something when I work with these dirs.
> >> >
> >> > ceph 16.2.13, metadata balancer and policy-based balancing not used.
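> >> >
> >> > The pin values themselves can be read back on the client to rule out
> >> > a mis-set attribute (assuming a recent enough client):
> >> >
> >> >   getfattr -n ceph.dir.pin /fs-mountpoint/dir1
> >> >   getfattr -n ceph.dir.pin /fs-mountpoint/dir2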
> >> >
> >> > Wed, Nov 20, 2024 at 16:33, Eugen Block <eblock@xxxxxx>:
> >> >
> >> >> Hi,
> >> >>
> >> >> > After pinning:
> >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
> >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
> >> >>
> >> >> is this a typo? If not, you did pin both directories to the same
> >> >> rank.
> >> >>
> >> >> Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
> >> >>
> >> >> > Hi,
> >> >> >
> >> >> > I'm trying to distribute all top-level dirs in CephFS across
> >> >> > different MDS ranks.
> >> >> > I have two active MDS with ranks *0* and *1*, and I have 2 top
> >> >> > dirs, */dir1* and */dir2*.
> >> >> >
> >> >> > After pinning:
> >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
> >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
> >> >> >
> >> >> > I can see the following DNS and INOS distribution:
> >> >> > RANK  STATE   MDS     ACTIVITY     DNS    INOS   DIRS   CAPS
> >> >> >  0    active   c   Reqs:    127 /s  12.6k  12.5k   333    505
> >> >> >  1    active   b   Reqs:    11 /s    21     24     19      1
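> >> >> > (output of "ceph fs status", MDS section)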
> >> >> >
> >> >> > When I write to dir1, I can see a small number of Reqs on rank 1.
> >> >> >
> >> >> > Events in journal of MDS with rank 1:
> >> >> > cephfs-journal-tool --rank=fs1:1 event get list
> >> >> >
> >> >> > 2024-11-20T12:24:42.045056+0300 0xc5c1cb UPDATE:  (scatter_writebehind)
> >> >> >   A2037D53
> >> >> > 2024-11-20T12:24:46.935934+0300 0xc5c629 SESSION:  ()
> >> >> > 2024-11-20T12:24:47.192012+0300 0xc5c7cd UPDATE:  (lock inest accounted
> >> >> > scatter stat update)
> >> >> > 2024-11-20T12:24:47.904717+0300 0xc5ca0b SESSION:  ()
> >> >> > 2024-11-20T12:26:46.912719+0300 0xc5ca98 SESSION:  ()
> >> >> > 2024-11-20T12:26:47.910806+0300 0xc5cc3c SESSION:  ()
> >> >> > 2024-11-20T12:27:35.746239+0300 0xc5ccc9 SESSION:  ()
> >> >> > 2024-11-20T12:28:46.923812+0300 0xc5ce63 SESSION:  ()
> >> >> > 2024-11-20T12:28:47.903066+0300 0xc5d007 SESSION:  ()
> >> >> > 2024-11-20T12:29:08.063326+0300 0xc5d094 EXPORT:  ()
> >> >> >   di1/A2037D53
> >> >> > 2024-11-20T12:30:46.909621+0300 0xc5d96f SESSION:  ()
> >> >> > 2024-11-20T12:30:47.908050+0300 0xc5db13 SESSION:  ()
> >> >> > 2024-11-20T12:32:46.907649+0300 0xc5dba0 SESSION:  ()
> >> >> > 2024-11-20T12:32:47.905962+0300 0xc5dd44 SESSION:  ()
> >> >> > 2024-11-20T12:34:44.349348+0300 0xc5ddd1 SESSIONS:  ()
> >> >> >
> >> >> > But the main problem: when I stop MDS rank 1 (without any kind of
> >> >> > standby), the FS hangs for all actions.
> >> >> > Is this correct? Is it possible to completely exclude rank 1 from
> >> >> > processing dir1, so that IO does not stop when rank 1 goes down?
> >> >> > _______________________________________________
> >> >> > ceph-users mailing list -- ceph-users@xxxxxxx
> >> >> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >> >>
> >> >>
> >> >> _______________________________________________
> >> >> ceph-users mailing list -- ceph-users@xxxxxxx
> >> >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >> >>
> >>
> >>
> >>
> >>
>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



