Re: [CephFS] Completely exclude some MDS rank from directory processing

I just tried to reproduce the behaviour but failed to do so. I have a Reef (18.2.2) cluster with multi-active MDS. Don't mind the hostnames; this cluster was deployed with Nautilus.

# mounted the FS
mount -t ceph nautilus:/ /mnt -o name=admin,secret=****,mds_namespace=secondfs

# created and pinned directories
nautilus:~ # mkdir /mnt/dir1
nautilus:~ # mkdir /mnt/dir2

nautilus:~ # setfattr -n ceph.dir.pin -v 0 /mnt/dir1
nautilus:~ # setfattr -n ceph.dir.pin -v 0 /mnt/dir2
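
To double-check, ceph.dir.pin can be read back as a virtual xattr
(a small sketch; getfattr is part of the attr package):

# confirm the pins took effect
nautilus:~ # getfattr -n ceph.dir.pin /mnt/dir1
nautilus:~ # getfattr -n ceph.dir.pin /mnt/dir2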

I stopped all standby daemons while writing into /mnt/dir1, then I also stopped rank 1, but the writes were not interrupted (until I stopped them myself). You're on Pacific; I'll see if I can reproduce it there.
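
For reference, a rough sketch of the write/stop test; the daemon names
are illustrative placeholders (on a cephadm-managed cluster it would be
'ceph orch daemon stop mds.<name>' instead of systemctl):

# shell 1: continuous writes into the pinned directory
while true; do dd if=/dev/zero of=/mnt/dir1/testfile bs=4M count=10 oflag=direct; done

# shell 2: stop the standby daemon(s) first, then the rank 1 daemon
systemctl stop ceph-mds@<standby-id>
systemctl stop ceph-mds@<rank1-id>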

Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:


Can you show the entire 'ceph fs status' output? And maybe also 'ceph
fs dump'?


Nothing special, just a small test cluster.
fs1 - 10 clients
===
RANK  STATE   MDS     ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active   a   Reqs:    0 /s  18.7k  18.4k   351    513
 1    active   b   Reqs:    0 /s    21     24     16      1
  POOL      TYPE     USED  AVAIL
fs1_meta  metadata   116M  3184G
fs1_data    data    23.8G  3184G
STANDBY MDS
     c


fs dump

e48
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client
writeable ranges,3=default file layouts on dirs,4=dir inode in separate
object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'fs1' (1)
fs_name fs1
epoch 47
flags 12
created 2024-10-15T18:55:10.905035+0300
modified 2024-11-21T10:55:12.688598+0300
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
required_client_features {}
last_failure 0
last_failure_osd_epoch 943
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline
data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 2
in 0,1
up {0=12200812,1=11974933}
failed
damaged
stopped
data_pools [7]
metadata_pool 6
inline_data disabled
balancer
standby_count_wanted 1
[mds.a{0:12200812} state up:active seq 13 addr [v2:10.7.1.115:6842/1955635987,v1:10.7.1.115:6843/1955635987] compat {c=[1],r=[1],i=[7ff]}]
[mds.b{1:11974933} state up:active seq 5 addr [v2:10.7.1.116:6840/536741454,v1:10.7.1.116:6841/536741454] compat {c=[1],r=[1],i=[7ff]}]


Standby daemons:

[mds.c{-1:11704322} state up:standby seq 1 addr [v2:10.7.1.117:6848/84247504,v1:10.7.1.117:6849/84247504] compat {c=[1],r=[1],i=[7ff]}]

Thu, Nov 21, 2024 at 11:36, Eugen Block <eblock@xxxxxx>:

I'm not aware of any hard limit on the number of Filesystems, but
that doesn't really mean very much. IIRC, last week during a Clyso
talk at Eventbrite I heard someone say that they had deployed around
200 Filesystems; I don't remember if it was a production or just a
lab environment. I assume you would be limited by the number of
OSDs/PGs rather than by the number of Filesystems, since 200
Filesystems require at least 400 pools (one metadata and one data
pool each). But maybe someone else has more experience scaling CephFS
that way. What we did instead was scale the number of active MDS
daemons for one CephFS. I believe in the end the customer had 48 MDS
daemons on three MDS servers: 16 of them were active with directory
pinning, and at that time they had 16 standby-replay and 16 standby
daemons. But it turned out that standby-replay didn't help their use
case, so we disabled it.
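
If you do experiment with many Filesystems, the per-FS overhead is
mainly the two pools plus the MDS daemons. A rough sketch (pool and
FS names are made up):

# every additional FS needs its own metadata and data pool
ceph osd pool create fs2_meta
ceph osd pool create fs2_data
ceph fs new fs2 fs2_meta fs2_data

# for comparison, scaling active MDS within a single FS:
ceph fs set fs1 max_mds 2
# and disabling standby-replay again if it doesn't help:
ceph fs set fs1 allow_standby_replay false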

Can you show the entire 'ceph fs status' output? And maybe also 'ceph
fs dump'?

Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:

>>
>> Just for testing purposes, have you tried pinning rank 1 to some other
>> directory? Does it still break the CephFS if you stop it?
>
>
> Yes, nothing changed.
>
> It's not a problem that the FS hangs when one of the ranks goes down; we
> will have standby-replay for all ranks. What I don't like is that a rank
> which is not pinned to some dir still handles some IO of this dir, or
> from clients which work with this dir.
> I mean that I can't robustly and fully separate client IO by rank.
>
>> Would it be an option to rather use multiple Filesystems instead of
>> multi-active for one CephFS?
>
>
> Yes, it's an option, but it is much more complicated in our case. Btw, do
> you know how many different FSs can be created in one cluster? Do you
> know of any potential problems with 100-200 FSs in one cluster?
>
> Wed, Nov 20, 2024 at 17:50, Eugen Block <eblock@xxxxxx>:
>
>> Ah, I misunderstood; I thought you wanted an even distribution across
>> both ranks.
>> Just for testing purposes, have you tried pinning rank 1 to some other
>> directory? Does it still break the CephFS if you stop it? I'm not sure
>> if you can prevent rank 1 from participating, I haven't looked into
>> all the configs in quite a while. Would it be an option to rather use
>> multiple Filesystems instead of multi-active for one CephFS?
>>
>> Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
>>
>> > No, it's not a typo. It's a misleading example. :)
>> >
>> > dir1 and dir2 are pinned to rank 0, but the FS and dir1/dir2 can't
>> > work without rank 1.
>> > Rank 1 is used for something when I work with these dirs.
>> >
>> > Ceph 16.2.13; the metadata balancer and policy-based balancing are not used.
>> >
>> > Wed, Nov 20, 2024 at 16:33, Eugen Block <eblock@xxxxxx>:
>> >
>> >> Hi,
>> >>
>> >> > After pinning:
>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
>> >>
>> >> Is this a typo? If not, you did pin both directories to the same
>> >> rank.
>> >>
>> >> Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
>> >>
>> >> > Hi,
>> >> >
>> >> > I'm trying to distribute all top-level dirs in CephFS across
>> >> > different MDS ranks.
>> >> > I have two active MDS with ranks *0* and *1*, and I have 2 top-level
>> >> > dirs, */dir1* and */dir2*.
>> >> >
>> >> > After pinning:
>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
>> >> >
>> >> > I can see the following DNS and INOS distribution:
>> >> > RANK  STATE   MDS     ACTIVITY     DNS    INOS   DIRS   CAPS
>> >> >  0    active   c   Reqs:    127 /s  12.6k  12.5k   333    505
>> >> >  1    active   b   Reqs:    11 /s    21     24     19      1
>> >> >
>> >> > When I write to dir1, I can see a small number of Reqs: on rank 1.
>> >> >
>> >> > Events in the journal of MDS rank 1:
>> >> > cephfs-journal-tool --rank=fs1:1 event get list
>> >> >
>> >> > 2024-11-20T12:24:42.045056+0300 0xc5c1cb UPDATE:  (scatter_writebehind)
>> >> >   A2037D53
>> >> > 2024-11-20T12:24:46.935934+0300 0xc5c629 SESSION:  ()
>> >> > 2024-11-20T12:24:47.192012+0300 0xc5c7cd UPDATE:  (lock inest accounted
>> >> > scatter stat update)
>> >> > 2024-11-20T12:24:47.904717+0300 0xc5ca0b SESSION:  ()
>> >> > 2024-11-20T12:26:46.912719+0300 0xc5ca98 SESSION:  ()
>> >> > 2024-11-20T12:26:47.910806+0300 0xc5cc3c SESSION:  ()
>> >> > 2024-11-20T12:27:35.746239+0300 0xc5ccc9 SESSION:  ()
>> >> > 2024-11-20T12:28:46.923812+0300 0xc5ce63 SESSION:  ()
>> >> > 2024-11-20T12:28:47.903066+0300 0xc5d007 SESSION:  ()
>> >> > 2024-11-20T12:29:08.063326+0300 0xc5d094 EXPORT:  ()
>> >> >   di1/A2037D53
>> >> > 2024-11-20T12:30:46.909621+0300 0xc5d96f SESSION:  ()
>> >> > 2024-11-20T12:30:47.908050+0300 0xc5db13 SESSION:  ()
>> >> > 2024-11-20T12:32:46.907649+0300 0xc5dba0 SESSION:  ()
>> >> > 2024-11-20T12:32:47.905962+0300 0xc5dd44 SESSION:  ()
>> >> > 2024-11-20T12:34:44.349348+0300 0xc5ddd1 SESSIONS:  ()
>> >> >
>> >> > But the main problem: when I stop MDS rank 1 (without any kind of
>> >> > standby), the FS hangs for all actions.
>> >> > Is this correct? Is it possible to completely exclude rank 1 from
>> >> > processing dir1, so that IO doesn't stop when rank 1 goes down?






_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



