> Hm, the same test worked for me with version 16.2.13... I mean, I only
> do a few writes from a single client, so this may be an invalid test,
> but I don't see any interruption.

I tried many times and I'm sure that my test is correct. Yes, writes can
stay active for some time after rank 1 goes down, maybe for tens of
seconds. And listing files (ls) can keep working for a while for dirs
which were listed before the rank went down, but only for a few seconds.

Before shutting down rank 1 I run the write like this:

while true; do dd if=/dev/vda of=/cephfs-mount/dir1/`uuidgen` count=1 oflag=direct; sleep 0.000001; done

Maybe it depends on the RPS...
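(For anyone who wants to reproduce this, below is a slightly expanded sketch of the same loop. It is only an illustration: /cephfs-mount/dir1 is the pinned directory from the test above, /dev/zero replaces /dev/vda so it does not depend on a local block device, and the failure timestamps just make the interruption window after stopping rank 1 easier to see.)

# sketch of the reproduction loop: write small unique files into the pinned
# directory with O_DIRECT and print a timestamp whenever a write fails
MNT=/cephfs-mount/dir1   # CephFS kernel mount, directory pinned to rank 0

while true; do
  if ! dd if=/dev/zero of="$MNT/$(uuidgen)" bs=4k count=1 \
        oflag=direct status=none 2>/dev/null; then
    echo "$(date '+%T.%N') write failed"
  fi
  sleep 0.001
done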
Fri, 22 Nov 2024 at 14:48, Eugen Block <eblock@xxxxxx>:

> Hm, the same test worked for me with version 16.2.13... I mean, I only
> do a few writes from a single client, so this may be an invalid test,
> but I don't see any interruption.
>
> Zitat von Eugen Block <eblock@xxxxxx>:
>
> > I just tried to reproduce the behaviour but failed to do so. I have
> > a Reef (18.2.2) cluster with multi-active MDS. Don't mind the
> > hostnames, this cluster was deployed with Nautilus.
> >
> > # mounted the FS
> > mount -t ceph nautilus:/ /mnt -o name=admin,secret=****,mds_namespace=secondfs
> >
> > # created and pinned directories
> > nautilus:~ # mkdir /mnt/dir1
> > nautilus:~ # mkdir /mnt/dir2
> >
> > nautilus:~ # setfattr -n ceph.dir.pin -v 0 /mnt/dir1
> > nautilus:~ # setfattr -n ceph.dir.pin -v 0 /mnt/dir2
> >
> > I stopped all standby daemons while writing into /mnt/dir1, then I
> > also stopped rank 1. But the writes were not interrupted (until I
> > stopped them). You're on Pacific, I'll see if I can reproduce it
> > there.
> >
> > Zitat von Александр Руденко <a.rudikk@xxxxxxxxx>:
> >
> >>> Can you show the entire 'ceph fs status' output? And maybe also
> >>> 'ceph fs dump'?
> >>
> >> Nothing special, just a small test cluster.
> >>
> >> fs1 - 10 clients
> >> ===
> >> RANK  STATE   MDS  ACTIVITY    DNS    INOS   DIRS  CAPS
> >>  0    active  a    Reqs: 0 /s  18.7k  18.4k  351   513
> >>  1    active  b    Reqs: 0 /s  21     24     16    1
> >> POOL      TYPE      USED   AVAIL
> >> fs1_meta  metadata  116M   3184G
> >> fs1_data  data      23.8G  3184G
> >> STANDBY MDS
> >>     c
> >>
> >> fs dump
> >>
> >> e48
> >> enable_multiple, ever_enabled_multiple: 1,1
> >> default compat: compat={},rocompat={},incompat={1=base v0.20,2=client
> >> writeable ranges,3=default file layouts on dirs,4=dir inode in separate
> >> object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
> >> anchor table,9=file layout v2,10=snaprealm v2}
> >> legacy client fscid: 1
> >>
> >> Filesystem 'fs1' (1)
> >> fs_name fs1
> >> epoch 47
> >> flags 12
> >> created 2024-10-15T18:55:10.905035+0300
> >> modified 2024-11-21T10:55:12.688598+0300
> >> tableserver 0
> >> root 0
> >> session_timeout 60
> >> session_autoclose 300
> >> max_file_size 1099511627776
> >> required_client_features {}
> >> last_failure 0
> >> last_failure_osd_epoch 943
> >> compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> >> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
> >> uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline
> >> data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> >> max_mds 2
> >> in 0,1
> >> up {0=12200812,1=11974933}
> >> failed
> >> damaged
> >> stopped
> >> data_pools [7]
> >> metadata_pool 6
> >> inline_data disabled
> >> balancer
> >> standby_count_wanted 1
> >> [mds.a{0:12200812} state up:active seq 13 addr
> >> [v2:10.7.1.115:6842/1955635987,v1:10.7.1.115:6843/1955635987] compat
> >> {c=[1],r=[1],i=[7ff]}]
> >> [mds.b{1:11974933} state up:active seq 5 addr
> >> [v2:10.7.1.116:6840/536741454,v1:10.7.1.116:6841/536741454] compat
> >> {c=[1],r=[1],i=[7ff]}]
> >>
> >> Standby daemons:
> >>
> >> [mds.c{-1:11704322} state up:standby seq 1 addr
> >> [v2:10.7.1.117:6848/84247504,v1:10.7.1.117:6849/84247504] compat
> >> {c=[1],r=[1],i=[7ff]}]
> >>
> >> Thu, 21 Nov 2024 at 11:36, Eugen Block <eblock@xxxxxx>:
> >>
> >>> I'm not aware of any hard limit for the number of Filesystems, but
> >>> that doesn't really mean very much. IIRC, last week during a Clyso
> >>> talk at Eventbrite I heard someone say that they deployed around 200
> >>> Filesystems or so, I don't remember if it was a production environment
> >>> or just a lab environment. I assume that you would probably be limited
> >>> by the number of OSDs/PGs rather than by the number of Filesystems,
> >>> 200 Filesystems require at least 400 pools. But maybe someone else has
> >>> more experience in scaling CephFS that way. What we did was to scale
> >>> the number of active MDS daemons for one CephFS. I believe in the end
> >>> the customer had 48 MDS daemons on three MDS servers, 16 of them were
> >>> active with directory pinning, at that time they had 16 standby-replay
> >>> and 16 standby daemons. But it turned out that standby-replay didn't
> >>> help their use case, so we disabled standby-replay.
> >>>
> >>> Can you show the entire 'ceph fs status' output? And maybe also
> >>> 'ceph fs dump'?
> >>>
> >>> Zitat von Александр Руденко <a.rudikk@xxxxxxxxx>:
> >>>
> >>>>> Just for testing purposes, have you tried pinning rank 1 to some
> >>>>> other directory? Does it still break the CephFS if you stop it?
> >>>>
> >>>> Yes, nothing changed.
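(For reference, the "one top-level dir per rank" variant of that test, plus a way to check where the subtrees actually ended up, would look roughly like the sketch below. The paths, fs name and daemon names (a = rank 0, b = rank 1) are taken from this thread; the exact JSON fields printed by 'get subtrees' can differ between releases, so the grep patterns may need adjusting.)

# pin one top-level directory to each active rank
# (setting the pin to -1 clears it again)
setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
setfattr -n ceph.dir.pin -v 1 /fs-mountpoint/dir2

# ask each active MDS which subtrees it is authoritative for
ceph tell mds.a get subtrees | grep -E '"path"|"export_pin"'
ceph tell mds.b get subtrees | grep -E '"path"|"export_pin"'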
> >>>>
> >>>> It's no problem that the FS hangs when one of the ranks goes down, we
> >>>> will have standby-replay for all ranks. What I don't like is that a
> >>>> rank which is not pinned to some dir still handles some IO of this
> >>>> dir, or of clients which work with this dir.
> >>>> I mean that I can't robustly and fully separate client IO by ranks.
> >>>>
> >>>>> Would it be an option to rather use multiple Filesystems instead of
> >>>>> multi-active for one CephFS?
> >>>>
> >>>> Yes, it's an option. But it is much more complicated in our case.
> >>>> Btw, do you know how many different FSs can be created in one
> >>>> cluster? Maybe you know some potential problems with 100-200 FSs in
> >>>> one cluster?
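(For what it's worth, mechanically the multiple-filesystems route just means one metadata pool and one data pool per FS, enough MDS daemons to give each FS its own active rank, and the multiple-filesystems flag set once; according to the fs dump above, enable_multiple is already set on this cluster. The pool and fs names below are made up for illustration.)

# allow more than one filesystem in the cluster
# (one-time flag; older releases may ask for --yes-i-really-mean-it)
ceph fs flag set enable_multiple true

# each additional filesystem needs its own metadata and data pools,
# and a spare MDS daemon to become active for it
ceph osd pool create fs2_meta
ceph osd pool create fs2_data
ceph fs new fs2 fs2_meta fs2_data

# clients select the filesystem at mount time, e.g. with the kernel client
mount -t ceph <mon-host>:/ /mnt/fs2 -o name=admin,secret=****,mds_namespace=fs2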
> >>>> Wed, 20 Nov 2024 at 17:50, Eugen Block <eblock@xxxxxx>:
> >>>>
> >>>>> Ah, I misunderstood, I thought you wanted an even distribution across
> >>>>> both ranks.
> >>>>> Just for testing purposes, have you tried pinning rank 1 to some other
> >>>>> directory? Does it still break the CephFS if you stop it? I'm not sure
> >>>>> if you can prevent rank 1 from participating, I haven't looked into
> >>>>> all the configs in quite a while. Would it be an option to rather use
> >>>>> multiple Filesystems instead of multi-active for one CephFS?
> >>>>>
> >>>>> Zitat von Александр Руденко <a.rudikk@xxxxxxxxx>:
> >>>>>
> >>>>> > No, it's not a typo. It's a misleading example)
> >>>>> >
> >>>>> > dir1 and dir2 are pinned to rank 0, but the FS and dir1, dir2 can't
> >>>>> > work without rank 1.
> >>>>> > rank 1 is used for something when I work with these dirs.
> >>>>> >
> >>>>> > ceph 16.2.13, metadata balancer and policy based balancing not used.
> >>>>> >
> >>>>> > Wed, 20 Nov 2024 at 16:33, Eugen Block <eblock@xxxxxx>:
> >>>>> >
> >>>>> >> Hi,
> >>>>> >>
> >>>>> >> > After pinning:
> >>>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
> >>>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
> >>>>> >>
> >>>>> >> is this a typo? If not, you did pin both directories to the same
> >>>>> >> rank.
> >>>>> >>
> >>>>> >> Zitat von Александр Руденко <a.rudikk@xxxxxxxxx>:
> >>>>> >>
> >>>>> >> > Hi,
> >>>>> >> >
> >>>>> >> > I am trying to distribute all top-level dirs in CephFS across
> >>>>> >> > different MDS ranks. I have two active MDS with ranks *0* and
> >>>>> >> > *1*, and I have 2 top dirs like */dir1* and */dir2*.
> >>>>> >> >
> >>>>> >> > After pinning:
> >>>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
> >>>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
> >>>>> >> >
> >>>>> >> > I can see the following INOS and DNS distribution:
> >>>>> >> > RANK  STATE   MDS  ACTIVITY      DNS    INOS   DIRS  CAPS
> >>>>> >> >  0    active  c    Reqs: 127 /s  12.6k  12.5k  333   505
> >>>>> >> >  1    active  b    Reqs: 11 /s   21     24     19    1
> >>>>> >> >
> >>>>> >> > When I write to dir1 I can see a small amount of Reqs: on rank 1.
> >>>>> >> >
> >>>>> >> > Events in the journal of the MDS with rank 1:
> >>>>> >> > cephfs-journal-tool --rank=fs1:1 event get list
> >>>>> >> >
> >>>>> >> > 2024-11-20T12:24:42.045056+0300 0xc5c1cb UPDATE: (scatter_writebehind)
> >>>>> >> > A2037D53
> >>>>> >> > 2024-11-20T12:24:46.935934+0300 0xc5c629 SESSION: ()
> >>>>> >> > 2024-11-20T12:24:47.192012+0300 0xc5c7cd UPDATE: (lock inest accounted
> >>>>> >> > scatter stat update)
> >>>>> >> > 2024-11-20T12:24:47.904717+0300 0xc5ca0b SESSION: ()
> >>>>> >> > 2024-11-20T12:26:46.912719+0300 0xc5ca98 SESSION: ()
> >>>>> >> > 2024-11-20T12:26:47.910806+0300 0xc5cc3c SESSION: ()
> >>>>> >> > 2024-11-20T12:27:35.746239+0300 0xc5ccc9 SESSION: ()
> >>>>> >> > 2024-11-20T12:28:46.923812+0300 0xc5ce63 SESSION: ()
> >>>>> >> > 2024-11-20T12:28:47.903066+0300 0xc5d007 SESSION: ()
> >>>>> >> > 2024-11-20T12:29:08.063326+0300 0xc5d094 EXPORT: ()
> >>>>> >> > di1/A2037D53
> >>>>> >> > 2024-11-20T12:30:46.909621+0300 0xc5d96f SESSION: ()
> >>>>> >> > 2024-11-20T12:30:47.908050+0300 0xc5db13 SESSION: ()
> >>>>> >> > 2024-11-20T12:32:46.907649+0300 0xc5dba0 SESSION: ()
> >>>>> >> > 2024-11-20T12:32:47.905962+0300 0xc5dd44 SESSION: ()
> >>>>> >> > 2024-11-20T12:34:44.349348+0300 0xc5ddd1 SESSIONS: ()
> >>>>> >> >
> >>>>> >> > But the main problem: when I stop the MDS with rank 1 (without
> >>>>> >> > any kind of standby), the FS hangs for all actions.
> >>>>> >> > Is this correct? Is it possible to completely exclude rank 1 from
> >>>>> >> > processing dir1 and not stop IO when rank 1 goes down?

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
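(A closing note on the "FS hangs when rank 1 goes down" part: whatever role rank 1 plays for the pinned dirs, the usual way to ride out a rank failure is to keep standby or standby-replay daemons around, as already planned earlier in the thread. A rough sketch, using the fs name from the thread; how the extra daemons get deployed depends on how the MDSs are managed.)

# let a standby daemon follow each active rank and replay its journal
ceph fs set fs1 allow_standby_replay true

# warn if fewer than the wanted number of standby daemons are available
ceph fs set fs1 standby_count_wanted 2

# verify which daemons are active, standby-replay and standby
ceph fs status fs1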