Hi Frank, thanks!

> it might be that you are expecting too much from ceph. The design of the
> filesystem was not some grand plan with every detail worked out. It was
> more the classic evolutionary approach, something working was screwed on
> top of rados and things evolved from there on.

There was some hope that it was just a configuration problem in my environment)

> Specifically rank 0 is critical.

Yes, because we can't re-pin the root of the FS to some other rank. It was
clear that rank 0 is critical. But unfortunately, as we can see, all ranks
are critical for stable work in any directory.

On Thu, 21 Nov 2024 at 14:46, Frank Schilder <frans@xxxxxx> wrote:

> Hi Alexander,
>
> it might be that you are expecting too much from ceph. The design of the
> filesystem was not some grand plan with every detail worked out. It was
> more the classic evolutionary approach, something working was screwed on
> top of rados and things evolved from there on.
>
> It is possible that the code and pin-separation is not as clean as one
> would imagine. Here is what I observe before and after pinning everything
> explicitly:
>
> - before pinning:
>   * high MDS load for no apparent reason - the balancer was just going in
>     circles
>   * stopping an MDS would basically bring all IO down
>
> - after pinning:
>   * low MDS load, better user performance, much faster restarts
>   * stopping an MDS does not kill all IO immediately, some IO continues,
>     however, eventually every client gets stuck
>
> There is apparently still communication between all ranks about all
> clients, and it is a bit annoying that some of this communication is
> blocking. Not sure if it has to be blocking or if one could make the
> requests to the down rank asynchronous. My impression is that ceph
> internals are rather bad at making stuff asynchronous. So if something in
> the MDS cluster is not healthy, sooner or later IO will stop, waiting for
> some blocking request to the unhealthy MDS. There seems to be no such
> thing as IO on other healthy MDSes continuing as usual.
>
> Specifically rank 0 is critical.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Eugen Block <eblock@xxxxxx>
> Sent: Thursday, November 21, 2024 9:36 AM
> To: Александр Руденко
> Cc: ceph-users@xxxxxxx
> Subject: Re: [CephFS] Completely exclude some MDS rank from directory processing
>
> I'm not aware of any hard limit for the number of Filesystems, but
> that doesn't really mean very much. IIRC, last week during a Clyso
> talk at Eventbrite I heard someone say that they deployed around 200
> Filesystems or so; I don't remember if it was a production environment
> or just a lab environment. I assume that you would probably be limited
> by the number of OSDs/PGs rather than by the number of Filesystems;
> 200 Filesystems require at least 400 pools. But maybe someone else has
> more experience in scaling CephFS that way. What we did was to scale
> the number of active MDS daemons for one CephFS. I believe in the end
> the customer had 48 MDS daemons on three MDS servers, 16 of them were
> active with directory pinning; at that time they had 16 standby-replay
> and 16 standby daemons. But it turned out that standby-replay didn't
> help their use case, so we disabled standby-replay.
>
> Can you show the entire 'ceph fs status' output? And maybe also 'ceph
> fs dump'?
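(For reference, the explicit pinning described above is done with the
ceph.dir.pin virtual xattr. A minimal sketch, assuming the filesystem is
named fs1 and mounted at /fs-mountpoint as in the examples further down;
the rank numbers are purely illustrative:

  # pin each top-level directory to a dedicated rank
  setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
  setfattr -n ceph.dir.pin -v 1 /fs-mountpoint/dir2

  # read the pin back to verify it (the vxattr is readable on recent clients)
  getfattr -n ceph.dir.pin /fs-mountpoint/dir1

  # watch how requests and inodes are distributed across the ranks
  ceph fs status fs1

As noted above, the root of the FS cannot be re-pinned away from rank 0,
which is one reason rank 0 stays critical even with everything pinned.)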
>
> Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
>
> >> Just for testing purposes, have you tried pinning rank 1 to some other
> >> directory? Does it still break the CephFS if you stop it?
> >
> > Yes, nothing changed.
> >
> > It's no problem that the FS hangs when one of the ranks goes down, we will
> > have standby-replay for all ranks. I don't like that a rank which is not
> > pinned to some dir still handles some IO of this dir or from clients which
> > work with this dir.
> > I mean that I can't robustly and fully separate client IO by ranks.
> >
> >> Would it be an option to rather use multiple Filesystems instead of
> >> multi-active for one CephFS?
> >
> > Yes, it's an option. But it is much more complicated in our case. Btw, do
> > you know how many different FSs can be created in one cluster? Maybe you
> > know of some potential problems with 100-200 FSs in one cluster?
> >
> > On Wed, 20 Nov 2024 at 17:50, Eugen Block <eblock@xxxxxx> wrote:
> >
> >> Ah, I misunderstood, I thought you wanted an even distribution across
> >> both ranks.
> >> Just for testing purposes, have you tried pinning rank 1 to some other
> >> directory? Does it still break the CephFS if you stop it? I'm not sure
> >> if you can prevent rank 1 from participating, I haven't looked into
> >> all the configs in quite a while. Would it be an option to rather use
> >> multiple Filesystems instead of multi-active for one CephFS?
> >>
> >> Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
> >>
> >> > No, it's not a typo. It's a misleading example)
> >> >
> >> > dir1 and dir2 are pinned to rank 0, but the FS and dir1, dir2 can't
> >> > work without rank 1.
> >> > Rank 1 is used for something when I work with these dirs.
> >> >
> >> > ceph 16.2.13, metadata balancer and policy-based balancing not used.
> >> >
> >> > On Wed, 20 Nov 2024 at 16:33, Eugen Block <eblock@xxxxxx> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> > After pinning:
> >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
> >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
> >> >>
> >> >> is this a typo? If not, you did pin both directories to the same rank.
> >> >>
> >> >> Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
> >> >>
> >> >> > Hi,
> >> >> >
> >> >> > I try to distribute all top-level dirs in CephFS across different
> >> >> > MDS ranks.
> >> >> > I have two active MDS with ranks 0 and 1, and I have 2 top dirs like
> >> >> > /dir1 and /dir2.
> >> >> >
> >> >> > After pinning:
> >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
> >> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
> >> >> >
> >> >> > I can see the following INOS and DNS distribution:
> >> >> > RANK  STATE   MDS  ACTIVITY       DNS    INOS   DIRS  CAPS
> >> >> >  0    active   c   Reqs: 127 /s  12.6k  12.5k    333   505
> >> >> >  1    active   b   Reqs:  11 /s     21     24     19     1
> >> >> >
> >> >> > When I write to dir1, I can see a small number of Reqs on rank 1.
> >> >> >
> >> >> > Events in the journal of the MDS with rank 1:
> >> >> > cephfs-journal-tool --rank=fs1:1 event get list
> >> >> >
> >> >> > 2024-11-20T12:24:42.045056+0300 0xc5c1cb UPDATE: (scatter_writebehind)
> >> >> >   A2037D53
> >> >> > 2024-11-20T12:24:46.935934+0300 0xc5c629 SESSION: ()
> >> >> > 2024-11-20T12:24:47.192012+0300 0xc5c7cd UPDATE: (lock inest accounted scatter stat update)
> >> >> > 2024-11-20T12:24:47.904717+0300 0xc5ca0b SESSION: ()
> >> >> > 2024-11-20T12:26:46.912719+0300 0xc5ca98 SESSION: ()
> >> >> > 2024-11-20T12:26:47.910806+0300 0xc5cc3c SESSION: ()
> >> >> > 2024-11-20T12:27:35.746239+0300 0xc5ccc9 SESSION: ()
> >> >> > 2024-11-20T12:28:46.923812+0300 0xc5ce63 SESSION: ()
> >> >> > 2024-11-20T12:28:47.903066+0300 0xc5d007 SESSION: ()
> >> >> > 2024-11-20T12:29:08.063326+0300 0xc5d094 EXPORT: ()
> >> >> >   di1/A2037D53
> >> >> > 2024-11-20T12:30:46.909621+0300 0xc5d96f SESSION: ()
> >> >> > 2024-11-20T12:30:47.908050+0300 0xc5db13 SESSION: ()
> >> >> > 2024-11-20T12:32:46.907649+0300 0xc5dba0 SESSION: ()
> >> >> > 2024-11-20T12:32:47.905962+0300 0xc5dd44 SESSION: ()
> >> >> > 2024-11-20T12:34:44.349348+0300 0xc5ddd1 SESSIONS: ()
> >> >> >
> >> >> > But the main problem: when I stop MDS rank 1 (without any kind of
> >> >> > standby), the FS hangs for all actions.
> >> >> > Is this correct? Is it possible to completely exclude rank 1 from
> >> >> > processing dir1 and not stop IO when rank 1 goes down?
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
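(Side note on the standby-replay and multiple-filesystems points discussed
above; a rough sketch of the relevant commands, where fs1 is the filesystem
name used earlier in the thread and fs2, fs2_meta, fs2_data are placeholder
names:

  # give every active rank a warm standby-replay follower;
  # turn it off again if it does not help the workload
  ceph fs set fs1 allow_standby_replay true
  ceph fs set fs1 allow_standby_replay false

  # every additional filesystem needs its own metadata and data pool,
  # which is where "200 Filesystems require at least 400 pools" comes from
  ceph osd pool create fs2_meta
  ceph osd pool create fs2_data
  ceph fs new fs2 fs2_meta fs2_data
)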