And, Eugen, try to watch 'ceph fs status' during the write (e.g. with a
small loop as sketched below the table). I can see the following DNS,
INOS and Reqs distribution:
RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
0 active c Reqs: 127 /s 12.6k 12.5k 333 505
1 active b Reqs: 11 /s 21 24 19 1
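
A quick way to watch this while the write loop runs (just a sketch, using
the fs name fs1 from the fs dump below):

watch -n 1 'ceph fs status fs1'
# or, without watch:
while true; do ceph fs status fs1 | grep -E 'RANK|active'; sleep 1; done
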
I mean that even if I pin all top dirs (without repinning anything at the
next levels, of course) to a single rank, I still see some amount of Reqs
on the rank they are not pinned to.
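
For completeness, this is roughly how I pin every existing top-level dir in
one go and check the result (just a sketch, assuming the /cephfs-mount
mount point from the dd test below; subdirs inherit the parent's pin unless
they are explicitly repinned):

for d in /cephfs-mount/*/; do
  setfattr -n ceph.dir.pin -v 1 "$d"    # pin to rank 1 (or whichever rank)
done
getfattr -n ceph.dir.pin /cephfs-mount/*/   # verify; dirs without a pin may report 'No such attribute'
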
Tue, Nov 26, 2024 at 12:01, Александр Руденко <a.rudikk@xxxxxxxxx>:
> Hm, the same test worked for me with version 16.2.13... I mean, I only
> do a few writes from a single client, so this may be an invalid test,
> but I don't see any interruption.
I tried many times and I'm sure that my test is correct.
Yes, writes can stay active for some time after rank 1 goes down,
maybe tens of seconds. And listing files (ls) can keep working for a while
for dirs that were listed before the rank went down, but only for a few
seconds.
Before shutting down rank 1 I run writes like this:
while true; do dd if=/dev/vda of=/cephfs-mount/dir1/`uuidgen` count=1 oflag=direct; sleep 0.000001; done
Maybe it depends on the RPS...
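
If it really depends on the request rate, a few parallel copies of the same
loop should make it easier to reproduce (just a sketch of the test above
fanned out over several background jobs):

for i in 1 2 3 4; do
  ( while true; do
      dd if=/dev/vda of=/cephfs-mount/dir1/`uuidgen` count=1 oflag=direct status=none
    done ) &
done
wait   # or stop the jobs with kill %1 %2 %3 %4
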
Fri, Nov 22, 2024 at 14:48, Eugen Block <eblock@xxxxxx>:
Hm, the same test worked for me with version 16.2.13... I mean, I only
do a few writes from a single client, so this may be an invalid test,
but I don't see any interruption.
Quoting Eugen Block <eblock@xxxxxx>:
> I just tried to reproduce the behaviour but failed to do so. I have
> a Reef (18.2.2) cluster with multi-active MDS. Don't mind the
> hostnames, this cluster was deployed with Nautilus.
>
> # mounted the FS
> mount -t ceph nautilus:/ /mnt -o name=admin,secret=****,mds_namespace=secondfs
>
> # created and pinned directories
> nautilus:~ # mkdir /mnt/dir1
> nautilus:~ # mkdir /mnt/dir2
>
> nautilus:~ # setfattr -n ceph.dir.pin -v 0 /mnt/dir1
> nautilus:~ # setfattr -n ceph.dir.pin -v 0 /mnt/dir2
>
> I stopped all standby daemons while writing into /mnt/dir1, then I
> also stopped rank 1. But the writes were not interrupted (until I
> stopped them). You're on Pacific, I'll see if I can reproduce it
> there.
>
> Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
>
>>>
>>> Can you show the entire 'ceph fs status' output? And maybe also
>>> 'ceph fs dump'?
>>
>>
>> Nothing special, just a small test cluster.
>> fs1 - 10 clients
>> ===
>> RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
>> 0 active a Reqs: 0 /s 18.7k 18.4k 351 513
>> 1 active b Reqs: 0 /s 21 24 16 1
>> POOL TYPE USED AVAIL
>> fs1_meta metadata 116M 3184G
>> fs1_data data 23.8G 3184G
>> STANDBY MDS
>> c
>>
>>
>> fs dump
>>
>> e48
>> enable_multiple, ever_enabled_multiple: 1,1
>> default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
>> legacy client fscid: 1
>>
>> Filesystem 'fs1' (1)
>> fs_name fs1
>> epoch 47
>> flags 12
>> created 2024-10-15T18:55:10.905035+0300
>> modified 2024-11-21T10:55:12.688598+0300
>> tableserver 0
>> root 0
>> session_timeout 60
>> session_autoclose 300
>> max_file_size 1099511627776
>> required_client_features {}
>> last_failure 0
>> last_failure_osd_epoch 943
>> compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
>> max_mds 2
>> in 0,1
>> up {0=12200812,1=11974933}
>> failed
>> damaged
>> stopped
>> data_pools [7]
>> metadata_pool 6
>> inline_data disabled
>> balancer
>> standby_count_wanted 1
>> [mds.a{0:12200812} state up:active seq 13 addr [v2:10.7.1.115:6842/1955635987,v1:10.7.1.115:6843/1955635987] compat {c=[1],r=[1],i=[7ff]}]
>> [mds.b{1:11974933} state up:active seq 5 addr [v2:10.7.1.116:6840/536741454,v1:10.7.1.116:6841/536741454] compat {c=[1],r=[1],i=[7ff]}]
>>
>>
>> Standby daemons:
>>
>> [mds.c{-1:11704322} state up:standby seq 1 addr [v2:10.7.1.117:6848/84247504,v1:10.7.1.117:6849/84247504] compat {c=[1],r=[1],i=[7ff]}]
>>
>> Thu, Nov 21, 2024 at 11:36, Eugen Block <eblock@xxxxxx>:
>>
>>> I'm not aware of any hard limit for the number of Filesystems, but
>>> that doesn't really mean very much. IIRC, last week during a Clyso
>>> talk at Eventbrite I heard someone say that they deployed around 200
>>> Filesystems or so, I don't remember if it was a production environment
>>> or just a lab environment. I assume that you would probably be limited
>>> by the number of OSDs/PGs rather than by the number of Filesystems,
>>> 200 Filesystems require at least 400 pools. But maybe someone else has
>>> more experience in scaling CephFS that way. What we did was to scale
>>> the number of active MDS daemons for one CephFS. I believe in the end
>>> the customer had 48 MDS daemons on three MDS servers, 16 of them were
>>> active with directory pinning, at that time they had 16 standby-replay
>>> and 16 standby daemons. But it turned out that standby-replay didn't
>>> help their use case, so we disabled standby-replay.
>>>
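>>> (For reference, each Filesystem needs its own metadata and data pool,
>>> which is where the 400-pool figure comes from; a rough sketch, with fs2
>>> and the pool names being placeholders:
>>>   ceph osd pool create fs2_meta
>>>   ceph osd pool create fs2_data
>>>   ceph fs new fs2 fs2_meta fs2_data
>>> )
>>>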
>>> Can you show the entire 'ceph fs status' output? And maybe also
>>> 'ceph fs dump'?
>>>
>>> Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
>>>
>>>>>
>>>>> Just for testing purposes, have you tried pinning rank 1 to some other
>>>>> directory? Does it still break the CephFS if you stop it?
>>>>
>>>>
>>>> Yes, nothing changed.
>>>>
>>>> It's no problem that the FS hangs when one of the ranks goes down, we
>>>> will have standby-replay for all ranks. What I don't like is that a
>>>> rank which is not pinned to some dir still handles some IO of this dir
>>>> or from clients which work with this dir.
>>>> I mean that I can't robustly and fully separate client IO by ranks.
>>>>
>>>>> Would it be an option to rather use multiple Filesystems instead of
>>>>> multi-active for one CephFS?
>>>>
>>>>
>>>> Yes, it's an option. But it is much more complicated in our case. Btw,
>>>> do you know how many different FS can be created in one cluster? Maybe
>>>> you know some potential problems with 100-200 FSs in one cluster?
>>>>
>>>> Wed, Nov 20, 2024 at 17:50, Eugen Block <eblock@xxxxxx>:
>>>>
>>>>> Ah, I misunderstood, I thought you wanted an even distribution across
>>>>> both ranks.
>>>>> Just for testing purposes, have you tried pinning rank 1 to some other
>>>>> directory? Does it still break the CephFS if you stop it? I'm not sure
>>>>> if you can prevent rank 1 from participating, I haven't looked into
>>>>> all the configs in quite a while. Would it be an option to rather use
>>>>> multiple Filesystems instead of multi-active for one CephFS?
>>>>>
>>>>> Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
>>>>>
>>>>> > No, it's not a typo, it's a misleading example. :)
>>>>> >
>>>>> > dir1 and dir2 are pinned to rank 0, but the FS and dir1/dir2 can't
>>>>> > work without rank 1.
>>>>> > Rank 1 is used for something when I work with these dirs.
>>>>> >
>>>>> > ceph 16.2.13, metadata balancer and policy based balancing not used.
>>>>> >
>>>>> > Wed, Nov 20, 2024 at 16:33, Eugen Block <eblock@xxxxxx>:
>>>>> >
>>>>> >> Hi,
>>>>> >>
>>>>> >> > After pinning:
>>>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
>>>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
>>>>> >>
>>>>> >> is this a typo? If not, you did pin both directories to the same
>>>>> >> rank.
>>>>> >>
>>>>> >> Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
>>>>> >>
>>>>> >> > Hi,
>>>>> >> >
>>>>> >> > I try to distribute all top-level dirs in CephFS across different
>>>>> >> > MDS ranks.
>>>>> >> > I have two active MDS with ranks 0 and 1, and I have 2 top dirs
>>>>> >> > like /dir1 and /dir2.
>>>>> >> >
>>>>> >> > After pinning:
>>>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
>>>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
>>>>> >> >
>>>>> >> > I can see the following INOS and DNS distribution:
>>>>> >> > RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
>>>>> >> > 0 active c Reqs: 127 /s 12.6k 12.5k 333 505
>>>>> >> > 1 active b Reqs: 11 /s 21 24 19 1
>>>>> >> >
>>>>> >> > When I write to dir1 I can see a small amount of Reqs on rank 1.
>>>>> >> >
>>>>> >> > Events in journal of MDS with rank 1:
>>>>> >> > cephfs-journal-tool --rank=fs1:1 event get list
>>>>> >> >
>>>>> >> > 2024-11-20T12:24:42.045056+0300 0xc5c1cb UPDATE: (scatter_writebehind) A2037D53
>>>>> >> > 2024-11-20T12:24:46.935934+0300 0xc5c629 SESSION: ()
>>>>> >> > 2024-11-20T12:24:47.192012+0300 0xc5c7cd UPDATE: (lock inest accounted scatter stat update)
>>>>> >> > 2024-11-20T12:24:47.904717+0300 0xc5ca0b SESSION: ()
>>>>> >> > 2024-11-20T12:26:46.912719+0300 0xc5ca98 SESSION: ()
>>>>> >> > 2024-11-20T12:26:47.910806+0300 0xc5cc3c SESSION: ()
>>>>> >> > 2024-11-20T12:27:35.746239+0300 0xc5ccc9 SESSION: ()
>>>>> >> > 2024-11-20T12:28:46.923812+0300 0xc5ce63 SESSION: ()
>>>>> >> > 2024-11-20T12:28:47.903066+0300 0xc5d007 SESSION: ()
>>>>> >> > 2024-11-20T12:29:08.063326+0300 0xc5d094 EXPORT: () di1/A2037D53
>>>>> >> > 2024-11-20T12:30:46.909621+0300 0xc5d96f SESSION: ()
>>>>> >> > 2024-11-20T12:30:47.908050+0300 0xc5db13 SESSION: ()
>>>>> >> > 2024-11-20T12:32:46.907649+0300 0xc5dba0 SESSION: ()
>>>>> >> > 2024-11-20T12:32:47.905962+0300 0xc5dd44 SESSION: ()
>>>>> >> > 2024-11-20T12:34:44.349348+0300 0xc5ddd1 SESSIONS: ()
>>>>> >> >
>>>>> >> > But the main problem: when I stop MDS rank 1 (without any kind of
>>>>> >> > standby) the FS hangs for all actions.
>>>>> >> > Is this correct? Is it possible to completely exclude rank 1 from
>>>>> >> > processing dir1 and not stop IO when rank 1 goes down?