Re: CephFS subvolumes not inheriting ephemeral distributed pin

Hi Patrick,

A few other follow-up questions.

Is directory fragmentation applicable only when multiple active MDS daemons
are enabled for a Ceph FS?

Will directory fragmentation and distribution of fragments among active MDS
daemons still happen if we turn off the balancer for a Ceph FS volume with
`ceph fs set midline-a balance_automate false`? In Squid, the CephFS automatic
metadata load (sometimes called “default”) balancer is disabled by default. (
https://docs.ceph.com/en/latest/releases/squid/)
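
For reference, the two settings we intend to combine are the ones already
shown in the quoted thread below (volume `midline-a`, subvolumegroup `csi`):

# turn off the automatic metadata balancer for the volume
ceph fs set midline-a balance_automate false
# rely only on the distributed ephemeral pin on the subvolumegroup
ceph fs subvolumegroup pin midline-a csi distributed 1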

Is there a way for us to ensure that the directory tree of a Subvolume
(Kubernetes PV) is part of the same fragment and handled by a single MDS, so
that a client's operations are served by one MDS?
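
(We know a single subvolume can be explicitly pinned to one rank, roughly like
the sketch below, using the volume/group names from the setup quoted further
down; what we are asking is whether the distributed pin alone already gives
this per-subvolume locality.)

# hypothetical example: pin subvol1 to rank 0
ceph fs subvolume pin midline-a subvol1 export 0 --group_name csi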

What is the trigger to start fragmenting directories within a
Subvolumegroup?

With `balance_automate` set to false and the ephemeral distributed pin
enabled for a Subvolumegroup, can we expect (almost) equal distribution of
Subvolumes (Kubernetes PVs) among the active MDS daemons, and stable
operation without hotspot migrations?
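
If it helps, this is how we plan to check the fragment distribution: a quick
jq pass over the `get subtrees` output quoted below (the MDS daemon names are
from our test cluster):

# count how many /volumes/csi fragments each rank owns (run per active MDS)
ceph tell mds.midline.server1.njyfcn get subtrees | \
  jq '[.[] | select(.dir.path == "/volumes/csi") | .export_pin_target]
      | group_by(.) | map({rank: .[0], fragments: length})'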


Regards,
Rajmohan R


On Wed, Nov 20, 2024 at 1:27 PM Patrick Donnelly <pdonnell@xxxxxxxxxx>
wrote:

> On Tue, Nov 19, 2024 at 9:20 PM Rajmohan Ramamoorthy
> <ram.rajmohanr@xxxxxxxxx> wrote:
> >
> > ```
> > Subvolumes do not "inherit" the distributed ephemeral pin. What you
> > should expect below is that the "csi" subvolumegroup will be
> > fragmented and distributed across the ranks. Consequently, the
> > subvolumes will also be distributed across ranks as part of the
> > subtrees rooted at each fragment of the "csi" subvolumegroup
> > (directory).
> > ```
> >
> > How is subvolumegroup fragmentation handled?
>
> Fragmentation is automatically applied (to a minimum level) when a
> directory is marked with the distributed ephemeral pin.
>
> > Are the subvolumes equally
> > distributed across all available active MDS?
>
> As the documentation says, it's a consistent hash of the fragments
> (which include the subvolumes which fall into those fragments) across
> ranks.
>
> > In the following scenario,
> > will 3 of the subvolumes be mapped to each of the MDS daemons?
>
> You cannot say. It depends how the fragments are hashed.
>
> > Will setting the ephemeral distributed pin on Subvolumegroup ensure that
> > the subvolumes in it will be equally distributed across MDS ?
>
> Approximately.
>
> > We are looking at the
> > ceph-csi use case for Kubernetes. PVs (subvolumes) are dynamically created
> > by Kubernetes.
>
> This is an ideal use-case for the distributed ephemeral pin.
>
> > # Ceph FS configuration
> >
> > ceph fs subvolumegroup create midline-a csi
> > ceph fs subvolumegroup pin midline-a csi distributed 1
> >
> > ceph fs subvolume create midline-a subvol1 csi
> > ceph fs subvolume create midline-a subvol2 csi
> > ceph fs subvolume create midline-a subvol3 csi
> > ceph fs subvolume create midline-a subvol4 csi
> > ceph fs subvolume create midline-a subvol5 csi
> > ceph fs subvolume create midline-a subvol6 csi
> >
> > # ceph fs ls
> > name: midline-a, metadata pool: fs-midline-metadata-a, data pools: [fs-midline-data-a]
> >
> > # ceph fs subvolumegroup ls midline-a
> > [
> > {
> > "name": "csi"
> > }
> > ]
> >
> > # ceph fs subvolume ls midline-a csi
> > [
> > {
> > "name": "subvol4"
> > },
> > {
> > "name": "subvol2"
> > },
> > {
> > "name": "subvol3"
> > },
> > {
> > "name": "subvol5"
> > },
> > {
> > "name": "subvol6"
> > },
> > {
> > "name": "subvol1"
> > }
> > ]
> >
> > # ceph fs status
> > midline-a - 2 clients
> > =========
> > RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
> > 0 active midline.server1.njyfcn Reqs: 0 /s 514 110 228 36
> > 1 active midline.server2.lpnjmx Reqs: 0 /s 47 22 17 6
> > POOL TYPE USED AVAIL
> > fs-midline-metadata-a metadata 25.4M 25.9T
> > fs-midline-data-a data 216k 25.9T
> > STANDBY MDS
> > midline.server3.wsbxsh
> > MDS version: ceph version 19.2.0 (16063ff2022298c9300e49a547a16ffda59baf13) squid (stable)
> >
> > Following is the subtree output from the MDSs. The directory fragments do not
> > seem to be equally mapped to the MDSs.
> >
> > # ceph tell mds.midline.server1.njyfcn get subtrees | jq
> > [
> > {
> > "is_auth": true,
> > "auth_first": 0,
> > "auth_second": -2,
> > "export_pin": -1,
> > "distributed_ephemeral_pin": false,
> > "random_ephemeral_pin": false,
> > "export_pin_target": -1,
> > "dir": {
> > "path": "",
> > "dirfrag": "0x1",
> > "snapid_first": 2,
> > "projected_version": "1240",
> > "version": "1240",
> > "committing_version": "0",
> > "committed_version": "0",
> > "is_rep": false,
> > "dir_auth": "0",
> > "states": [
> > "auth",
> > "dirty",
> > "complete"
> > ],
> > "is_auth": true,
> > "auth_state": {
> > "replicas": {
> > "1": 1
> > }
> > },
> > "replica_state": {
> > "authority": [
> > 0,
> > -2
> > ],
> > "replica_nonce": 0
> > },
> > "auth_pins": 0,
> > "is_frozen": false,
> > "is_freezing": false,
> > "pins": {
> > "child": 1,
> > "subtree": 1,
> > "subtreetemp": 0,
> > "replicated": 1,
> > "dirty": 1,
> > "waiter": 0,
> > "authpin": 0
> > },
> > "nref": 4
> > }
> > },
> > {
> > "is_auth": true,
> > "auth_first": 0,
> > "auth_second": -2,
> > "export_pin": -1,
> > "distributed_ephemeral_pin": false,
> > "random_ephemeral_pin": false,
> > "export_pin_target": -1,
> > "dir": {
> > "path": "~mds0",
> > "dirfrag": "0x100",
> > "snapid_first": 2,
> > "projected_version": "1232",
> > "version": "1232",
> > "committing_version": "0",
> > "committed_version": "0",
> > "is_rep": false,
> > "dir_auth": "0",
> > "states": [
> > "auth",
> > "dirty",
> > "complete"
> > ],
> > "is_auth": true,
> > "auth_state": {
> > "replicas": {}
> > },
> > "replica_state": {
> > "authority": [
> > 0,
> > -2
> > ],
> > "replica_nonce": 0
> > },
> > "auth_pins": 0,
> > "is_frozen": false,
> > "is_freezing": false,
> > "pins": {
> > "child": 1,
> > "subtree": 1,
> > "subtreetemp": 0,
> > "dirty": 1,
> > "waiter": 0,
> > "authpin": 0
> > },
> > "nref": 3
> > }
> > },
> > {
> > "is_auth": false,
> > "auth_first": 1,
> > "auth_second": -2,
> > "export_pin": -1,
> > "distributed_ephemeral_pin": true,
> > "random_ephemeral_pin": false,
> > "export_pin_target": 1,
> > "dir": {
> > "path": "/volumes/csi",
> > "dirfrag": "0x100000006ae.11*",
>
> This fragment is pinned to rank 1 (export_pin_target).
>
> > "snapid_first": 2,
> > "projected_version": "50",
> > "version": "50",
> > "committing_version": "50",
> > "committed_version": "50",
> > "is_rep": false,
> > "dir_auth": "1",
> > "states": [],
> > "is_auth": false,
> > "auth_state": {
> > "replicas": {}
> > },
> > "replica_state": {
> > "authority": [
> > 1,
> > -2
> > ],
> > "replica_nonce": 1
> > },
> > "auth_pins": 0,
> > "is_frozen": false,
> > "is_freezing": false,
> > "pins": {
> > "ptrwaiter": 0,
> > "request": 0,
> > "child": 0,
> > "frozen": 0,
> > "subtree": 1,
> > "replicated": 0,
> > "dirty": 0,
> > "waiter": 0,
> > "authpin": 0,
> > "tempexporting": 0
> > },
> > "nref": 1
> > }
> > },
> > {
> > "is_auth": true,
> > "auth_first": 0,
> > "auth_second": -2,
> > "export_pin": -1,
> > "distributed_ephemeral_pin": true,
> > "random_ephemeral_pin": false,
> > "export_pin_target": 0,
> > "dir": {
> > "path": "/volumes/csi",
> > "dirfrag": "0x100000006ae.10*",
>
> This fragment is pinned to rank 0 (export_pin_target).
>
> > "snapid_first": 2,
> > "projected_version": "52",
> > "version": "52",
> > "committing_version": "50",
> > "committed_version": "50",
> > "is_rep": false,
> > "dir_auth": "0",
> > "states": [
> > "auth",
> > "dirty",
> > "complete"
> > ],
> > "is_auth": true,
> > "auth_state": {
> > "replicas": {}
> > },
> > "replica_state": {
> > "authority": [
> > 0,
> > -2
> > ],
> > "replica_nonce": 0
> > },
> > "auth_pins": 0,
> > "is_frozen": false,
> > "is_freezing": false,
> > "pins": {
> > "subtree": 1,
> > "dirty": 1,
> > "waiter": 0,
> > "authpin": 0
> > },
> > "nref": 2
> > }
> > },
> > {
> > "is_auth": true,
> > "auth_first": 0,
> > "auth_second": -2,
> > "export_pin": -1,
> > "distributed_ephemeral_pin": true,
> > "random_ephemeral_pin": false,
> > "export_pin_target": 0,
> > "dir": {
> > "path": "/volumes/csi",
> > "dirfrag": "0x100000006ae.01*",
>
> This fragment is pinned to rank 0 (export_pin_target).
>
> > "snapid_first": 2,
> > "projected_version": "136",
> > "version": "136",
> > "committing_version": "82",
> > "committed_version": "82",
> > "is_rep": false,
> > "dir_auth": "0",
> > "states": [
> > "auth",
> > "dirty",
> > "complete"
> > ],
> > "is_auth": true,
> > "auth_state": {
> > "replicas": {
> > "1": 1
> > }
> > },
> > "replica_state": {
> > "authority": [
> > 0,
> > -2
> > ],
> > "replica_nonce": 0
> > },
> > "auth_pins": 0,
> > "is_frozen": false,
> > "is_freezing": false,
> > "pins": {
> > "child": 1,
> > "frozen": 0,
> > "subtree": 1,
> > "replicated": 1,
> > "dirty": 1,
> > "authpin": 0
> > },
> > "nref": 4
> > }
> > }
> > ]
> >
> > # ceph tell mds.midline.server2.lpnjmx get subtrees | jq
> > [
> > {
> > "is_auth": true,
> > "auth_first": 1,
> > "auth_second": -2,
> > "export_pin": -1,
> > "distributed_ephemeral_pin": false,
> > "random_ephemeral_pin": false,
> > "export_pin_target": -1,
> > "dir": {
> > "path": "~mds1",
> > "dirfrag": "0x101",
> > "snapid_first": 2,
> > "projected_version": "332",
> > "version": "332",
> > "committing_version": "0",
> > "committed_version": "0",
> > "is_rep": false,
> > "dir_auth": "1",
> > "states": [
> > "auth",
> > "dirty",
> > "complete"
> > ],
> > "is_auth": true,
> > "auth_state": {
> > "replicas": {}
> > },
> > "replica_state": {
> > "authority": [
> > 1,
> > -2
> > ],
> > "replica_nonce": 0
> > },
> > "auth_pins": 0,
> > "is_frozen": false,
> > "is_freezing": false,
> > "pins": {
> > "child": 1,
> > "subtree": 1,
> > "subtreetemp": 0,
> > "dirty": 1,
> > "waiter": 0,
> > "authpin": 0
> > },
> > "nref": 3
> > }
> > },
> > {
> > "is_auth": true,
> > "auth_first": 1,
> > "auth_second": -2,
> > "export_pin": -1,
> > "distributed_ephemeral_pin": true,
> > "random_ephemeral_pin": false,
> > "export_pin_target": 1,
> > "dir": {
> > "path": "/volumes/csi",
> > "dirfrag": "0x100000006ae.11*",
> > "snapid_first": 2,
> > "projected_version": "66",
> > "version": "66",
> > "committing_version": "50",
> > "committed_version": "50",
> > "is_rep": false,
> > "dir_auth": "1",
> > "states": [
> > "auth",
> > "dirty",
> > "complete"
> > ],
> > "is_auth": true,
> > "auth_state": {
> > "replicas": {
> > "0": 1
> > }
> > },
> > "replica_state": {
> > "authority": [
> > 1,
> > -2
> > ],
> > "replica_nonce": 0
> > },
> > "auth_pins": 0,
> > "is_frozen": false,
> > "is_freezing": false,
> > "pins": {
> > "ptrwaiter": 0,
> > "child": 1,
> > "frozen": 0,
> > "subtree": 1,
> > "importing": 0,
> > "replicated": 1,
> > "dirty": 1,
> > "authpin": 0
> > },
> > "nref": 4
> > }
> > },
> > {
> > "is_auth": false,
> > "auth_first": 0,
> > "auth_second": -2,
> > "export_pin": -1,
> > "distributed_ephemeral_pin": false,
> > "random_ephemeral_pin": false,
> > "export_pin_target": -1,
> > "dir": {
> > "path": "",
> > "dirfrag": "0x1",
> > "snapid_first": 2,
> > "projected_version": "0",
> > "version": "1216",
> > "committing_version": "0",
> > "committed_version": "0",
> > "is_rep": false,
> > "dir_auth": "0",
> > "states": [],
> > "is_auth": false,
> > "auth_state": {
> > "replicas": {}
> > },
> > "replica_state": {
> > "authority": [
> > 0,
> > -2
> > ],
> > "replica_nonce": 1
> > },
> > "auth_pins": 0,
> > "is_frozen": false,
> > "is_freezing": false,
> > "pins": {
> > "child": 1,
> > "subtree": 1
> > },
> > "nref": 2
> > }
> > }
> > ]
>
> This all looks as expected. You can verify the location of the subvolumes
> by:
>
> ceph tell mds.X dump tree /volumes/csi/X 0
>
> and check auth_first. You should see that they are approximately
> uniformly distributed.
>
> --
> Patrick Donnelly, Ph.D.
> He / Him / His
> Red Hat Partner Engineer
> IBM, Inc.
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



