And, Eugen, try watching ceph fs status during the write. I can see the following INOS, DNS and Reqs distribution:

RANK  STATE   MDS  ACTIVITY      DNS    INOS   DIRS  CAPS
 0    active  c    Reqs: 127 /s  12.6k  12.5k  333   505
 1    active  b    Reqs:  11 /s  21     24     19    1

I mean that even if I pin all top dirs (of course without re-pinning on the next levels) to rank 1, I see some amount of Reqs on rank 1.

On Tue, 26 Nov 2024 at 12:01, Александр Руденко <a.rudikk@xxxxxxxxx> wrote:

>> Hm, the same test worked for me with version 16.2.13... I mean, I only do a few writes from a single client, so this may be an invalid test, but I don't see any interruption.
>
> I tried many times and I'm sure that my test is correct.
> Yes, writes can stay active for some time after rank 1 goes down, maybe tens of seconds. And listing files (ls) can keep working for a while for dirs which were listed before the rank went down, but only for a few seconds.
>
> Before shutting down rank 1 I start the writes like this:
>
> while true; do dd if=/dev/vda of=/cephfs-mount/dir1/`uuidgen` count=1 oflag=direct; sleep 0.000001; done
>
> Maybe it depends on the RPS...
>
> On Fri, 22 Nov 2024 at 14:48, Eugen Block <eblock@xxxxxx> wrote:
>
>> Hm, the same test worked for me with version 16.2.13... I mean, I only do a few writes from a single client, so this may be an invalid test, but I don't see any interruption.
>>
>> Quoting Eugen Block <eblock@xxxxxx>:
>>
>> > I just tried to reproduce the behaviour but failed to do so. I have a Reef (18.2.2) cluster with multi-active MDS. Don't mind the hostnames, this cluster was deployed with Nautilus.
>> >
>> > # mounted the FS
>> > mount -t ceph nautilus:/ /mnt -o name=admin,secret=****,mds_namespace=secondfs
>> >
>> > # created and pinned directories
>> > nautilus:~ # mkdir /mnt/dir1
>> > nautilus:~ # mkdir /mnt/dir2
>> >
>> > nautilus:~ # setfattr -n ceph.dir.pin -v 0 /mnt/dir1
>> > nautilus:~ # setfattr -n ceph.dir.pin -v 0 /mnt/dir2
>> >
>> > I stopped all standby daemons while writing into /mnt/dir1, then I also stopped rank 1. But the writes were not interrupted (until I stopped them). You're on Pacific, I'll see if I can reproduce it there.
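For anyone repeating this reproduction, a minimal sketch using the paths and fs name (secondfs) from Eugen's example above; getfattr needs the attr package, and the last command is an assumption on my part that a reasonably recent release exposes "get subtrees" via ceph tell:

# confirm the pin attribute actually stuck on each top-level directory (expect "0")
getfattr -n ceph.dir.pin /mnt/dir1
getfattr -n ceph.dir.pin /mnt/dir2

# refresh the per-rank Reqs/DNS/INOS counters every second while the writes run
watch -n 1 ceph fs status secondfs

# show which rank is authoritative for each subtree and what its export pin is
ceph tell mds.secondfs:0 get subtrees | jq '.[] | [.dir.path, .auth_first, .export_pin]'

auth_first in that output is the rank that currently owns the subtree, which is a more direct check than inferring ownership from the Reqs counters.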
>> > Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
>> >
>> >>> Can you show the entire 'ceph fs status' output? And maybe also 'ceph fs dump'?
>> >>
>> >> Nothing special, just a small test cluster.
>> >>
>> >> fs1 - 10 clients
>> >> ===
>> >> RANK  STATE   MDS  ACTIVITY    DNS    INOS   DIRS  CAPS
>> >>  0    active  a    Reqs: 0 /s  18.7k  18.4k  351   513
>> >>  1    active  b    Reqs: 0 /s  21     24     16    1
>> >>   POOL       TYPE      USED   AVAIL
>> >> fs1_meta   metadata    116M   3184G
>> >> fs1_data   data       23.8G   3184G
>> >> STANDBY MDS
>> >>      c
>> >>
>> >> fs dump
>> >>
>> >> e48
>> >> enable_multiple, ever_enabled_multiple: 1,1
>> >> default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
>> >> legacy client fscid: 1
>> >>
>> >> Filesystem 'fs1' (1)
>> >> fs_name fs1
>> >> epoch 47
>> >> flags 12
>> >> created 2024-10-15T18:55:10.905035+0300
>> >> modified 2024-11-21T10:55:12.688598+0300
>> >> tableserver 0
>> >> root 0
>> >> session_timeout 60
>> >> session_autoclose 300
>> >> max_file_size 1099511627776
>> >> required_client_features {}
>> >> last_failure 0
>> >> last_failure_osd_epoch 943
>> >> compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
>> >> max_mds 2
>> >> in 0,1
>> >> up {0=12200812,1=11974933}
>> >> failed
>> >> damaged
>> >> stopped
>> >> data_pools [7]
>> >> metadata_pool 6
>> >> inline_data disabled
>> >> balancer
>> >> standby_count_wanted 1
>> >> [mds.a{0:12200812} state up:active seq 13 addr [v2:10.7.1.115:6842/1955635987,v1:10.7.1.115:6843/1955635987] compat {c=[1],r=[1],i=[7ff]}]
>> >> [mds.b{1:11974933} state up:active seq 5 addr [v2:10.7.1.116:6840/536741454,v1:10.7.1.116:6841/536741454] compat {c=[1],r=[1],i=[7ff]}]
>> >>
>> >> Standby daemons:
>> >>
>> >> [mds.c{-1:11704322} state up:standby seq 1 addr [v2:10.7.1.117:6848/84247504,v1:10.7.1.117:6849/84247504] compat {c=[1],r=[1],i=[7ff]}]
>> >>
>> >> On Thu, 21 Nov 2024 at 11:36, Eugen Block <eblock@xxxxxx> wrote:
>> >>
>> >>> I'm not aware of any hard limit for the number of Filesystems, but that doesn't really mean very much. IIRC, last week during a Clyso talk at Eventbrite I heard someone say that they deployed around 200 Filesystems or so; I don't remember if it was a production environment or just a lab environment. I assume that you would probably be limited by the number of OSDs/PGs rather than by the number of Filesystems: 200 Filesystems require at least 400 pools. But maybe someone else has more experience in scaling CephFS that way. What we did instead was to scale the number of active MDS daemons for one CephFS. I believe in the end the customer had 48 MDS daemons on three MDS servers; 16 of them were active with directory pinning, and at that time they had 16 standby-replay and 16 standby daemons. But it turned out that standby-replay didn't help their use case, so we disabled it.
>> >>>
>> >>> Can you show the entire 'ceph fs status' output? And maybe also 'ceph fs dump'?
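On the multiple-filesystems idea: each additional CephFS needs its own metadata and data pool, which is where the 400-pools-for-200-Filesystems figure comes from. A rough sketch, with made-up names (fs2, fs2_meta, fs2_data, /mnt2) purely for illustration:

# on orchestrator-managed clusters this creates both pools and the MDS daemons
ceph fs volume create fs2

# or manually, with explicitly created pools
ceph osd pool create fs2_meta
ceph osd pool create fs2_data
ceph fs new fs2 fs2_meta fs2_data

# clients then select the filesystem at mount time (mds_namespace=, or fs= on newer clients)
mount -t ceph nautilus:/ /mnt2 -o name=admin,secret=****,mds_namespace=fs2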
>> >>> Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
>> >>>
>> >>>>> Just for testing purposes, have you tried pinning rank 1 to some other directory? Does it still break the CephFS if you stop it?
>> >>>>
>> >>>> Yes, nothing changed.
>> >>>>
>> >>>> It's not a problem that the FS hangs when one of the ranks goes down; we will have standby-replay for all ranks. What I don't like is that a rank which is not pinned to a dir still handles some IO for that dir, or for clients which work with that dir. I mean that I can't robustly and fully separate client IO by ranks.
>> >>>>
>> >>>>> Would it be an option to rather use multiple Filesystems instead of multi-active for one CephFS?
>> >>>>
>> >>>> Yes, it's an option, but it is much more complicated in our case. Btw, do you know how many different FSs can be created in one cluster? Maybe you know of some potential problems with 100-200 FSs in one cluster?
>> >>>>
>> >>>> On Wed, 20 Nov 2024 at 17:50, Eugen Block <eblock@xxxxxx> wrote:
>> >>>>
>> >>>>> Ah, I misunderstood, I thought you wanted an even distribution across both ranks.
>> >>>>> Just for testing purposes, have you tried pinning rank 1 to some other directory? Does it still break the CephFS if you stop it? I'm not sure if you can prevent rank 1 from participating; I haven't looked into all the configs in quite a while. Would it be an option to rather use multiple Filesystems instead of multi-active for one CephFS?
>> >>>>>
>> >>>>> Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
>> >>>>>
>> >>>>> > No, it's not a typo, it's a misleading example :)
>> >>>>> >
>> >>>>> > dir1 and dir2 are pinned to rank 0, but the FS and dir1/dir2 can't work without rank 1.
>> >>>>> > Rank 1 is used for something when I work with these dirs.
>> >>>>> >
>> >>>>> > Ceph 16.2.13; the metadata balancer and policy-based balancing are not used.
>> >>>>> >
>> >>>>> > On Wed, 20 Nov 2024 at 16:33, Eugen Block <eblock@xxxxxx> wrote:
>> >>>>> >
>> >>>>> >> Hi,
>> >>>>> >>
>> >>>>> >> > After pinning:
>> >>>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
>> >>>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
>> >>>>> >>
>> >>>>> >> is this a typo? If not, you did pin both directories to the same rank.
>> >>>>> >>
>> >>>>> >> Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
>> >>>>> >>
>> >>>>> >> > Hi,
>> >>>>> >> >
>> >>>>> >> > I am trying to distribute all top-level dirs in CephFS across different MDS ranks. I have two active MDS with ranks *0* and *1*, and I have 2 top dirs, */dir1* and */dir2*.
>> >>>>> >> >
>> >>>>> >> > After pinning:
>> >>>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
>> >>>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
>> >>>>> >> >
>> >>>>> >> > I can see the following INOS and DNS distribution:
>> >>>>> >> > RANK  STATE   MDS  ACTIVITY      DNS    INOS   DIRS  CAPS
>> >>>>> >> >  0    active  c    Reqs: 127 /s  12.6k  12.5k  333   505
>> >>>>> >> >  1    active  b    Reqs:  11 /s  21     24     19    1
>> >>>>> >> >
>> >>>>> >> > When I write to dir1 I can see a small amount of Reqs on rank 1.
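What the test presumably intends is one top-level directory per rank; a short sketch with the same paths as above (the last line is only shown for completeness and clears an explicit pin so the subtree falls back to the default balancing policy):

setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1    # pin dir1 to rank 0
setfattr -n ceph.dir.pin -v 1 /fs-mountpoint/dir2    # pin dir2 to rank 1
setfattr -n ceph.dir.pin -v -1 /fs-mountpoint/dir2   # -1 removes the explicit pin again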
>> >>>>> >> > Events in the journal of the MDS with rank 1:
>> >>>>> >> > cephfs-journal-tool --rank=fs1:1 event get list
>> >>>>> >> >
>> >>>>> >> > 2024-11-20T12:24:42.045056+0300 0xc5c1cb UPDATE: (scatter_writebehind)
>> >>>>> >> >   A2037D53
>> >>>>> >> > 2024-11-20T12:24:46.935934+0300 0xc5c629 SESSION: ()
>> >>>>> >> > 2024-11-20T12:24:47.192012+0300 0xc5c7cd UPDATE: (lock inest accounted scatter stat update)
>> >>>>> >> > 2024-11-20T12:24:47.904717+0300 0xc5ca0b SESSION: ()
>> >>>>> >> > 2024-11-20T12:26:46.912719+0300 0xc5ca98 SESSION: ()
>> >>>>> >> > 2024-11-20T12:26:47.910806+0300 0xc5cc3c SESSION: ()
>> >>>>> >> > 2024-11-20T12:27:35.746239+0300 0xc5ccc9 SESSION: ()
>> >>>>> >> > 2024-11-20T12:28:46.923812+0300 0xc5ce63 SESSION: ()
>> >>>>> >> > 2024-11-20T12:28:47.903066+0300 0xc5d007 SESSION: ()
>> >>>>> >> > 2024-11-20T12:29:08.063326+0300 0xc5d094 EXPORT: ()
>> >>>>> >> >   di1/A2037D53
>> >>>>> >> > 2024-11-20T12:30:46.909621+0300 0xc5d96f SESSION: ()
>> >>>>> >> > 2024-11-20T12:30:47.908050+0300 0xc5db13 SESSION: ()
>> >>>>> >> > 2024-11-20T12:32:46.907649+0300 0xc5dba0 SESSION: ()
>> >>>>> >> > 2024-11-20T12:32:47.905962+0300 0xc5dd44 SESSION: ()
>> >>>>> >> > 2024-11-20T12:34:44.349348+0300 0xc5ddd1 SESSIONS: ()
>> >>>>> >> >
>> >>>>> >> > But the main problem: when I stop the MDS with rank 1 (without any kind of standby), the FS hangs for all actions.
>> >>>>> >> > Is this correct? Is it possible to completely exclude rank 1 from processing dir1 and not stop IO when rank 1 goes down?
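Regarding the standby-replay plan mentioned earlier in the thread, a minimal sketch assuming the fs name fs1 from above; the second command is only a read-only summary of rank 1's journal, in the spirit of the event listing quoted above:

# give each active rank a standby-replay follower so a failed rank is taken over quickly
ceph fs set fs1 allow_standby_replay true

# read-only per-type event counts for rank 1's journal
cephfs-journal-tool --rank=fs1:1 event get summary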