Re: [CephFS] Completely exclude some MDS rank from directory processing

Hi,

> I mean that even if I pin all top dirs (of course without repinning on next levels) to rank 1 - I see some amount of Reqs on rank 1.

I assume you mean that if you pin all top dirs to rank 0, you still see IO on rank 1? I still can't reproduce that: I waited for 15 minutes or so with rank 1 down, but I could still read from and write to the rank-0-pinned dirs, and no IO was visible on rank 1.
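
Roughly, the kind of check I mean (paths and fs name are from my test setup quoted below; the exact commands are just a sketch):

# keep writing into a rank-0-pinned directory while rank 1 is down
while true; do dd if=/dev/zero of=/mnt/dir1/`uuidgen` bs=4k count=1 oflag=direct; sleep 1; done

# in a second shell, watch the per-rank request counters
watch -n 2 ceph fs status secondfs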

But what I don't fully understand yet is this: I have a third directory which is unpinned:

ll /mnt/
total 0
drwxr-xr-x 2 root root 4023 29. Nov 09:41 dir1
drwxr-xr-x 2 root root    0 22. Nov 12:18 dir2
drwxr-xr-x 2 root root   11 29. Nov 09:34 dir3

dir1 and dir2 are pinned to rank 0, dir3 is unpinned:

getfattr -n ceph.dir.pin /mnt/dir3
# file: mnt/dir3
ceph.dir.pin="-1"

Shouldn't rank 0 take over dir3 as well since it's the only active rank left? I couldn't read/write into dir3 until I brought another mds daemon back up.
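
(To rule that out, I could of course pin dir3 to rank 0 up front as well, same setfattr as for the other two:

setfattr -n ceph.dir.pin -v 0 /mnt/dir3

but I would still expect the remaining active rank to take over unpinned subtrees by itself.)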

On 26.11.24 at 10:08, Александр Руденко wrote:
And, Eugen, try to watch ceph fs status during writes.

I can see the following DNS, INOS and Reqs distribution:
RANK  STATE   MDS    ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active   c   Reqs:  127 /s  12.6k  12.5k    333    505
 1    active   b   Reqs:   11 /s     21     24     19      1

I mean that even if I pin all top dirs (of course without repinning on next levels) to rank 1 - I see some amount of Reqs on rank 1.
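
(Roughly, the pinning I mean looks like this, with /cephfs-mount as the mount point and rank 0 just as an example:

for d in /cephfs-mount/*/; do setfattr -n ceph.dir.pin -v 0 "$d"; done

and then I watch the Reqs column in ceph fs status while clients write.)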


Tue, Nov 26, 2024, 12:01 Александр Руденко <a.rudikk@xxxxxxxxx>:

        Hm, the same test worked for me with version 16.2.13... I mean,
        I only do a few writes from a single client, so this may be an
        invalid test, but I don't see any interruption.


    I tried many times and I'm sure that my test is correct.
    Yes, writes can stay active for some time after rank 1 goes down,
    maybe tens of seconds. And listing files (ls) can keep working for a
    while for dirs which were listed before the rank went down, but only
    for a few seconds.

    Before shutting down rank 1, I run writes in this way:

    while true; do dd if=/dev/vda of=/cephfs-mount/dir1/`uuidgen` count=1 oflag=direct; sleep 0.000001; done

    Maybe it depends on the RPS...

    Fri, Nov 22, 2024, 14:48 Eugen Block <eblock@xxxxxx>:

        Hm, the same test worked for me with version 16.2.13... I mean,
        I only do a few writes from a single client, so this may be an
        invalid test, but I don't see any interruption.

        Quoting Eugen Block <eblock@xxxxxx>:

        > I just tried to reproduce the behaviour but failed to do so. I have
        > a Reef (18.2.2) cluster with multi-active MDS. Don't mind the
        > hostnames, this cluster was deployed with Nautilus.
        >
        > # mounted the FS
        > mount -t ceph nautilus:/ /mnt -o
        > name=admin,secret=****,mds_namespace=secondfs
        >
        > # created and pinned directories
        > nautilus:~ # mkdir /mnt/dir1
        > nautilus:~ # mkdir /mnt/dir2
        >
        > nautilus:~ # setfattr -n ceph.dir.pin -v 0 /mnt/dir1
        > nautilus:~ # setfattr -n ceph.dir.pin -v 0 /mnt/dir2
        >
        > I stopped all standby daemons while writing into /mnt/dir1, then I
        > also stopped rank 1. But the writes were not interrupted (until I
        > stopped them). You're on Pacific, I'll see if I can reproduce it
        > there.
        >
        > Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
        >
        >>>
        >>> Can you show the entire 'ceph fs status' output? And maybe
        >>> also 'ceph fs dump'?
        >>
        >>
        >> Nothing special, just a small test cluster.
        >> fs1 - 10 clients
        >> ===
        >> RANK  STATE   MDS    ACTIVITY     DNS    INOS   DIRS   CAPS
        >>  0    active   a   Reqs:    0 /s  18.7k  18.4k    351    513
        >>  1    active   b   Reqs:    0 /s     21     24     16      1
        >>  POOL      TYPE     USED  AVAIL
        >> fs1_meta  metadata   116M  3184G
        >> fs1_data    data    23.8G  3184G
        >> STANDBY MDS
        >>     c
        >>
        >>
        >> fs dump
        >>
        >> e48
        >> enable_multiple, ever_enabled_multiple: 1,1
        >> default compat: compat={},rocompat={},incompat={1=base v0.20,2=client
        >> writeable ranges,3=default file layouts on dirs,4=dir inode in separate
        >> object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
        >> anchor table,9=file layout v2,10=snaprealm v2}
        >> legacy client fscid: 1
        >>
        >> Filesystem 'fs1' (1)
        >> fs_name fs1
        >> epoch 47
        >> flags 12
        >> created 2024-10-15T18:55:10.905035+0300
        >> modified 2024-11-21T10:55:12.688598+0300
        >> tableserver 0
        >> root 0
        >> session_timeout 60
        >> session_autoclose 300
        >> max_file_size 1099511627776
        >> required_client_features {}
        >> last_failure 0
        >> last_failure_osd_epoch 943
        >> compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable
        >> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
        >> uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline
        >> data,8=no anchor table,9=file layout v2,10=snaprealm v2}
        >> max_mds 2
        >> in 0,1
        >> up {0=12200812,1=11974933}
        >> failed
        >> damaged
        >> stopped
        >> data_pools [7]
        >> metadata_pool 6
        >> inline_data disabled
        >> balancer
        >> standby_count_wanted 1
        >> [mds.a{0:12200812} state up:active seq 13 addr
        >> [v2:10.7.1.115:6842/1955635987,v1:10.7.1.115:6843/1955635987] compat
        >> {c=[1],r=[1],i=[7ff]}]
        >> [mds.b{1:11974933} state up:active seq 5 addr
        >> [v2:10.7.1.116:6840/536741454,v1:10.7.1.116:6841/536741454] compat
        >> {c=[1],r=[1],i=[7ff]}]
        >>
        >>
        >> Standby daemons:
        >>
        >> [mds.c{-1:11704322} state up:standby seq 1 addr
        >> [v2:10.7.1.117:6848/84247504,v1:10.7.1.117:6849/84247504] compat
        >> {c=[1],r=[1],i=[7ff]}]
        >>
        >> Thu, Nov 21, 2024, 11:36 Eugen Block <eblock@xxxxxx>:
        >>
        >>> I'm not aware of any hard limit for the number of Filesystems, but
        >>> that doesn't really mean very much. IIRC, last week during a Clyso
        >>> talk at Eventbrite I heard someone say that they deployed around
        >>> 200 Filesystems or so, I don't remember if it was a production
        >>> environment or just a lab environment. I assume that you would
        >>> probably be limited by the number of OSDs/PGs rather than by the
        >>> number of Filesystems, 200 Filesystems require at least 400 pools.
        >>> But maybe someone else has more experience in scaling CephFS that
        >>> way. What we did was to scale the number of active MDS daemons for
        >>> one CephFS. I believe in the end the customer had 48 MDS daemons
        >>> on three MDS servers, 16 of them were active with directory
        >>> pinning, at that time they had 16 standby-replay and 16 standby
        >>> daemons. But it turned out that standby-replay didn't help their
        >>> use case, so we disabled standby-replay.
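        >>>
        >>> (Just as a rough sketch of why it's two pools per Filesystem, with
        >>> made-up pool names:
        >>>
        >>> ceph osd pool create fs2_meta
        >>> ceph osd pool create fs2_data
        >>> ceph fs new fs2 fs2_meta fs2_data
        >>>
        >>> so 200 Filesystems means at least 400 pools.)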
        >>>
        >>> Can you show the entire 'ceph fs status' output? And maybe also
        >>> 'ceph fs dump'?
        >>>
        >>> Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
        >>>
        >>>>>
        >>>>> Just for testing purposes, have you tried pinning rank 1 to some
        >>>>> other directory? Does it still break the CephFS if you stop it?
        >>>>
        >>>>
        >>>> Yes, nothing changed.
        >>>>
        >>>> It's no problem that the FS hangs when one of the ranks goes down;
        >>>> we will have standby-replay for all ranks. What I don't like is
        >>>> that a rank which is not pinned to some dir still handles some IO
        >>>> of this dir, or from clients which work with this dir.
        >>>> I mean that I can't robustly and fully separate client IO by ranks.
        >>>>
        >>>>> Would it be an option to rather use multiple Filesystems instead
        >>>>> of multi-active for one CephFS?
        >>>>
        >>>>
        >>>> Yes, it's an option, but it is much more complicated in our case.
        >>>> Btw, do you know how many different filesystems can be created in
        >>>> one cluster? Maybe you know of some potential problems with
        >>>> 100-200 FSs in one cluster?
        >>>>
        >>>> Wed, Nov 20, 2024, 17:50 Eugen Block <eblock@xxxxxx>:
        >>>>
        >>>>> Ah, I misunderstood, I thought you wanted an even distribution
        >>>>> across both ranks.
        >>>>> Just for testing purposes, have you tried pinning rank 1 to some
        >>>>> other directory? Does it still break the CephFS if you stop it?
        >>>>> I'm not sure if you can prevent rank 1 from participating, I
        >>>>> haven't looked into all the configs in quite a while. Would it be
        >>>>> an option to rather use multiple Filesystems instead of
        >>>>> multi-active for one CephFS?
        >>>>>
        >>>>> Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
        >>>>>
        >>>>> > No, it's not a typo. It's a misleading example :)
        >>>>> >
        >>>>> > dir1 and dir2 are pinned to rank 0, but the FS and dir1/dir2
        >>>>> > can't work without rank 1.
        >>>>> > Rank 1 is used for something when I work with these dirs.
        >>>>> >
        >>>>> > ceph 16.2.13; the metadata balancer and policy-based balancing
        >>>>> > are not used.
        >>>>> >
        >>>>> > Wed, Nov 20, 2024, 16:33 Eugen Block <eblock@xxxxxx>:
        >>>>> >
        >>>>> >> Hi,
        >>>>> >>
        >>>>> >> > After pinning:
        >>>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
        >>>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
        >>>>> >>
        >>>>> >> is this a typo? If not, you did pin both directories to the
        >>>>> >> same rank.
        >>>>> >>
        >>>>> >> Quoting Александр Руденко <a.rudikk@xxxxxxxxx>:
        >>>>> >>
        >>>>> >> > Hi,
        >>>>> >> >
        >>>>> >> > I'm trying to distribute all top-level dirs in CephFS across
        >>>>> >> > different MDS ranks.
        >>>>> >> > I have two active MDS with ranks 0 and 1, and I have 2 top
        >>>>> >> > dirs like /dir1 and /dir2.
        >>>>> >> >
        >>>>> >> > After pinning:
        >>>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir1
        >>>>> >> > setfattr -n ceph.dir.pin -v 0 /fs-mountpoint/dir2
        >>>>> >> >
        >>>>> >> > I can see the following INOS and DNS distribution:
        >>>>> >> > RANK  STATE   MDS    ACTIVITY     DNS    INOS   DIRS   CAPS
        >>>>> >> >  0    active   c   Reqs:  127 /s  12.6k  12.5k    333    505
        >>>>> >> >  1    active   b   Reqs:   11 /s     21     24     19      1
        >>>>> >> >
        >>>>> >> > When I write to dir1, I can see a small amount of Reqs on rank 1.
        >>>>> >> >
        >>>>> >> > Events in journal of MDS with rank 1:
        >>>>> >> > cephfs-journal-tool --rank=fs1:1 event get list
        >>>>> >> >
        >>>>> >> > 2024-11-20T12:24:42.045056+0300 0xc5c1cb UPDATE:  (scatter_writebehind)
        >>>>> >> >   A2037D53
        >>>>> >> > 2024-11-20T12:24:46.935934+0300 0xc5c629 SESSION:  ()
        >>>>> >> > 2024-11-20T12:24:47.192012+0300 0xc5c7cd UPDATE:  (lock inest accounted scatter stat update)
        >>>>> >> > 2024-11-20T12:24:47.904717+0300 0xc5ca0b SESSION:  ()
        >>>>> >> > 2024-11-20T12:26:46.912719+0300 0xc5ca98 SESSION:  ()
        >>>>> >> > 2024-11-20T12:26:47.910806+0300 0xc5cc3c SESSION:  ()
        >>>>> >> > 2024-11-20T12:27:35.746239+0300 0xc5ccc9 SESSION:  ()
        >>>>> >> > 2024-11-20T12:28:46.923812+0300 0xc5ce63 SESSION:  ()
        >>>>> >> > 2024-11-20T12:28:47.903066+0300 0xc5d007 SESSION:  ()
        >>>>> >> > 2024-11-20T12:29:08.063326+0300 0xc5d094 EXPORT:  ()
        >>>>> >> >   di1/A2037D53
        >>>>> >> > 2024-11-20T12:30:46.909621+0300 0xc5d96f SESSION:  ()
        >>>>> >> > 2024-11-20T12:30:47.908050+0300 0xc5db13 SESSION:  ()
        >>>>> >> > 2024-11-20T12:32:46.907649+0300 0xc5dba0 SESSION:  ()
        >>>>> >> > 2024-11-20T12:32:47.905962+0300 0xc5dd44 SESSION:  ()
        >>>>> >> > 2024-11-20T12:34:44.349348+0300 0xc5ddd1 SESSIONS:  ()
        >>>>> >> >
        >>>>> >> > But the main problem: when I stop MDS rank 1 (without any
        >>>>> >> > kind of standby), the FS hangs for all actions.
        >>>>> >> > Is this correct? Is it possible to completely exclude rank 1
        >>>>> >> > from processing dir1 and not stop IO when rank 1 goes down?
        >>>>>
        >>>>>
        >>>>>
        >>>>>
        >>>
        >>>
        >>>
        >>>



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



