And, Eugen, try watching ceph fs status during the write. I can see the following INOS, DNS and Reqs distribution:

RANK  STATE   MDS  ACTIVITY      DNS    INOS   DIRS  CAPS
 0    active  c    Reqs: 127 /s  12.6k  12.5k  333   505
 1    active  b    Reqs:  11 /s  21     24     19    1

I mean that even if I pin all top dirs (of course without re-pinning on the next levels) to rank 1, I see some amount of Reqs on rank 1.

On Tue, 26 Nov 2024 at 12:01, Александр Руденко <a.rudikk@xxxxxxxxx> wrote:

>> Hm, the same test worked for me with version 16.2.13... I mean, I only do a few writes from a single client, so this may be an invalid test, but I don't see any interruption.
>
> I tried many times and I'm sure that my test is correct.
> Yes, writes can stay active for some time after rank 1 goes down, maybe tens of seconds. And listing files (ls) can keep working for a while for dirs which were listed before the rank went down, but only for a few seconds.
>
> Before shutting down rank 1 I start the writes like this:
>
> while true; do dd if=/dev/vda of=/cephfs-mount/dir1/`uuidgen` count=1 oflag=direct; sleep 0.000001; done
>
> Maybe it depends on the RPS...
>
> On Fri, 22 Nov 2024 at 14:48, Eugen Block <eblock@xxxxxx> wrote:
>
>> Hm, the same test worked for me with version 16.2.13... I mean, I only do a few writes from a single client, so this may be an invalid test, but I don't see any interruption.
>>
>> Quoting Eugen Block <eblock@xxxxxx>:
>>
>> > I just tried to reproduce the behaviour but failed to do so. I have a Reef (18.2.2) cluster with multi-active MDS. Don't mind the hostnames, this cluster was deployed with Nautilus.
>> >
>> > # mounted the FS
>> > mount -t ceph nautilus:/ /mnt -o name=admin,secret=****,mds_namespace=secondfs
>> >
>> > # created and pinned directories
>> > nautilus:~ # mkdir /mnt/dir1
>> > nautilus:~ # mkdir /mnt/dir2
>> >
>> > nautilus:~ # setfattr -n ceph.dir.pin -v 0 /mnt/dir1
>> > nautilus:~ # setfattr -n ceph.dir.pin -v 0 /mnt/dir2
>> >
>> > I stopped all standby daemons while writing into /mnt/dir1, then I also stopped rank 1. But the writes were not interrupted (until I stopped them). You're on Pacific, I'll see if I can reproduce it there.
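For anyone repeating this reproduction, a minimal sketch using the paths and fs name (secondfs) from Eugen's example above; getfattr needs the attr package, and the last command is an assumption on my part that a reasonably recent release exposes "get subtrees" via ceph tell:

# confirm the pin attribute actually stuck on each top-level directory (expect "0")
getfattr -n ceph.dir.pin /mnt/dir1
getfattr -n ceph.dir.pin /mnt/dir2

# refresh the per-rank Reqs/DNS/INOS counters every second while the writes run
watch -n 1 ceph fs status secondfs

# show which rank is authoritative for each subtree and what its export pin is
ceph tell mds.secondfs:0 get subtrees | jq '.[] | [.dir.path, .auth_first, .export_pin]'

auth_first in that output is the rank that currently owns the subtree, which is a more direct check than inferring ownership from the Reqs counters.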
>> > Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
>> >
>> >>> Can you show the entire 'ceph fs status' output? And maybe also 'ceph fs dump'?
>> >>
>> >> Nothing special, just a small test cluster.
>> >>
>> >> fs1 - 10 clients
>> >> ===
>> >> RANK  STATE   MDS  ACTIVITY    DNS    INOS   DIRS  CAPS
>> >>  0    active  a    Reqs: 0 /s  18.7k  18.4k  351   513
>> >>  1    active  b    Reqs: 0 /s  21     24     16    1
>> >>   POOL       TYPE      USED   AVAIL
>> >> fs1_meta   metadata    116M   3184G
>> >> fs1_data   data       23.8G   3184G
>> >> STANDBY MDS
>> >>      c
>> >>
>> >> fs dump
>> >>
>> >> e48
>> >> enable_multiple, ever_enabled_multiple: 1,1
>> >> default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
>> >> legacy client fscid: 1
>> >>
>> >> Filesystem 'fs1' (1)
>> >> fs_name fs1
>> >> epoch 47
>> >> flags 12
>> >> created 2024-10-15T18:55:10.905035+0300
>> >> modified 2024-11-21T10:55:12.688598+0300
>> >> tableserver 0
>> >> root 0
>> >> session_timeout 60
>> >> session_autoclose 300
>> >> max_file_size 1099511627776
>> >> required_client_features {}
>> >> last_failure 0
>> >> last_failure_osd_epoch 943
>> >> compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
>> >> max_mds 2
>> >> in 0,1
>> >> up {0=12200812,1=11974933}
>> >> failed
>> >> damaged
>> >> stopped
>> >> data_pools [7]
>> >> metadata_pool 6
>> >> inline_data disabled
>> >> balancer
>> >> standby_count_wanted 1
>> >> [mds.a{0:12200812} state up:active seq 13 addr [v2:10.7.1.115:6842/1955635987,v1:10.7.1.115:6843/1955635987] compat {c=[1],r=[1],i=[7ff]}]
>> >> [mds.b{1:11974933} state up:active seq 5 addr [v2:10.7.1.116:6840/536741454,v1:10.7.1.116:6841/536741454] compat {c=[1],r=[1],i=[7ff]}]
>> >>
>> >> Standby daemons:
>> >>
>> >> [mds.c{-1:11704322} state up:standby seq 1 addr [v2:10.7.1.117:6848/84247504,v1:10.7.1.117:6849/84247504] compat {c=[1],r=[1],i=[7ff]}]
>> >>
>> >> On Thu, 21 Nov 2024 at 11:36, Eugen Block <eblock@xxxxxx> wrote:
>> >>
>> >>> I'm not aware of any hard limit for the number of Filesystems, but that doesn't really mean very much. IIRC, last week during a Clyso talk at Eventbrite I heard someone say that they deployed around 200 Filesystems or so; I don't remember if it was a production environment or just a lab environment. I assume that you would probably be limited by the number of OSDs/PGs rather than by the number of Filesystems: 200 Filesystems require at least 400 pools. But maybe someone else has more experience in scaling CephFS that way. What we did instead was to scale the number of active MDS daemons for one CephFS. I believe in the end the customer had 48 MDS daemons on three MDS servers; 16 of them were active with directory pinning, and at that time they had 16 standby-replay and 16 standby daemons. But it turned out that standby-replay didn't help their use case, so we disabled it.
>> >>>
>> >>> Can you show the entire 'ceph fs status' output? And maybe also 'ceph fs dump'?
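On the multiple-filesystems idea: each additional CephFS needs its own metadata and data pool, which is where the 400-pools-for-200-Filesystems figure comes from. A rough sketch, with made-up names (fs2, fs2_meta, fs2_data, /mnt2) purely for illustration:

# on orchestrator-managed clusters this creates both pools and the MDS daemons
ceph fs volume create fs2

# or manually, with explicitly created pools
ceph osd pool create fs2_meta
ceph osd pool create fs2_data
ceph fs new fs2 fs2_meta fs2_data

# clients then select the filesystem at mount time (mds_namespace=, or fs= on newer clients)
mount -t ceph nautilus:/ /mnt2 -o name=admin,secret=****,mds_namespace=fs2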
>> >>> Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
>> >>>
>> >>>>> Just for testing purposes, have you tried pinning rank 1 to some other directory? Does it still break the CephFS if you stop it?
>> >>>>
>> >>>> Yes, nothing changed.
>> >>>>
>> >>>> It's not a problem that the FS hangs when one of the ranks goes down; we will have standby-replay for all ranks. What I don't like is that a rank which is not pinned to a dir still handles some IO for that dir, or for clients which work with that dir. I mean that I can't robustly and fully separate client IO by ranks.
>> >>>>
>> >>>>> Would it be an option to rather use multiple Filesystems instead of multi-active for one CephFS?
>> >>>>
>> >>>> Yes, it's an option, but it is much more complicated in our case. Btw, do you know how many different FSs can be created in one cluster? Maybe you know of some potential problems with 100-200 FSs in one cluster?
>> >>>>
>> >>>> On Wed, 20 Nov 2024 at 17:50, Eugen Block <eblock@xxxxxx> wrote:
>> >>>>
>> >>>>> Ah, I misunderstood, I thought you wanted an even distribution across both ranks.
>> >>>>> Just for testing purposes, have you tried pinning rank 1 to some other directory? Does it still break the CephFS if you stop it? I'm not sure if you can prevent rank 1 from participating; I haven't looked into all the configs in quite a while. Would it be an option to rather use multiple Filesystems instead of multi-active for one CephFS?
>> >>>>>
>> >>>>> Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
>> >>>>>
>> >>>>> > No, it's not a typo, it's a misleading example :)
>> >>>>> >
>> >>>>> > dir1 and dir2 are pinned to rank 0, but the FS and dir1/dir2 can't work without rank 1.
>> >>>>> > Rank 1 is used for something when I work with these dirs.
>> >>>>> >
>> >>>>> > Ceph 16.2.13; the metadata balancer and policy-based balancing are not used.
>> >>>>> >
>> >>>>> > On Wed, 20 Nov 2024 at 16:33, Eugen Block <eblock@xxxxxx> wrote:
>> >>>>> >
>> >>>>> >> Hi,
>> >>>>> >>
>> >>>>> >> > After pinning:
>> >>>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
>> >>>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
>> >>>>> >>
>> >>>>> >> is this a typo? If not, you did pin both directories to the same rank.
>> >>>>> >>
>> >>>>> >> Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
>> >>>>> >>
>> >>>>> >> > Hi,
>> >>>>> >> >
>> >>>>> >> > I am trying to distribute all top-level dirs in CephFS across different MDS ranks. I have two active MDS with ranks *0* and *1*, and I have 2 top dirs, */dir1* and */dir2*.
>> >>>>> >> >
>> >>>>> >> > After pinning:
>> >>>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
>> >>>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
>> >>>>> >> >
>> >>>>> >> > I can see the following INOS and DNS distribution:
>> >>>>> >> > RANK  STATE   MDS  ACTIVITY      DNS    INOS   DIRS  CAPS
>> >>>>> >> >  0    active  c    Reqs: 127 /s  12.6k  12.5k  333   505
>> >>>>> >> >  1    active  b    Reqs:  11 /s  21     24     19    1
>> >>>>> >> >
>> >>>>> >> > When I write to dir1 I can see a small amount of Reqs on rank 1.
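What the test presumably intends is one top-level directory per rank; a short sketch with the same paths as above (the last line is only shown for completeness and clears an explicit pin so the subtree falls back to the default balancing policy):

setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1    # pin dir1 to rank 0
setfattr -n ceph.dir.pin -v 1 /fs-mountpoint/dir2    # pin dir2 to rank 1
setfattr -n ceph.dir.pin -v -1 /fs-mountpoint/dir2   # -1 removes the explicit pin again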
>> >>>>> >> > Events in the journal of the MDS with rank 1:
>> >>>>> >> > cephfs-journal-tool --rank=fs1:1 event get list
>> >>>>> >> >
>> >>>>> >> > 2024-11-20T12:24:42.045056+0300 0xc5c1cb UPDATE: (scatter_writebehind)
>> >>>>> >> >   A2037D53
>> >>>>> >> > 2024-11-20T12:24:46.935934+0300 0xc5c629 SESSION: ()
>> >>>>> >> > 2024-11-20T12:24:47.192012+0300 0xc5c7cd UPDATE: (lock inest accounted scatter stat update)
>> >>>>> >> > 2024-11-20T12:24:47.904717+0300 0xc5ca0b SESSION: ()
>> >>>>> >> > 2024-11-20T12:26:46.912719+0300 0xc5ca98 SESSION: ()
>> >>>>> >> > 2024-11-20T12:26:47.910806+0300 0xc5cc3c SESSION: ()
>> >>>>> >> > 2024-11-20T12:27:35.746239+0300 0xc5ccc9 SESSION: ()
>> >>>>> >> > 2024-11-20T12:28:46.923812+0300 0xc5ce63 SESSION: ()
>> >>>>> >> > 2024-11-20T12:28:47.903066+0300 0xc5d007 SESSION: ()
>> >>>>> >> > 2024-11-20T12:29:08.063326+0300 0xc5d094 EXPORT: ()
>> >>>>> >> >   di1/A2037D53
>> >>>>> >> > 2024-11-20T12:30:46.909621+0300 0xc5d96f SESSION: ()
>> >>>>> >> > 2024-11-20T12:30:47.908050+0300 0xc5db13 SESSION: ()
>> >>>>> >> > 2024-11-20T12:32:46.907649+0300 0xc5dba0 SESSION: ()
>> >>>>> >> > 2024-11-20T12:32:47.905962+0300 0xc5dd44 SESSION: ()
>> >>>>> >> > 2024-11-20T12:34:44.349348+0300 0xc5ddd1 SESSIONS: ()
>> >>>>> >> >
>> >>>>> >> > But the main problem: when I stop the MDS with rank 1 (without any kind of standby), the FS hangs for all actions.
>> >>>>> >> > Is this correct? Is it possible to completely exclude rank 1 from processing dir1 and not stop IO when rank 1 goes down?
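Regarding the standby-replay plan mentioned earlier in the thread, a minimal sketch assuming the fs name fs1 from above; the second command is only a read-only summary of rank 1's journal, in the spirit of the event listing quoted above:

# give each active rank a standby-replay follower so a failed rank is taken over quickly
ceph fs set fs1 allow_standby_replay true

# read-only per-type event counts for rank 1's journal
cephfs-journal-tool --rank=fs1:1 event get summary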