Hi Patrick,

A few other follow-up questions:

Is directory fragmentation applicable only when multiple active MDS daemons are enabled for a CephFS file system?

Will directory fragmentation and distribution of fragments among the active MDS daemons still happen if we turn off the balancer for a CephFS volume (`ceph fs set midline-a balance_automate false`)? In Squid, the CephFS automatic metadata load (sometimes called "default") balancer is now disabled by default (https://docs.ceph.com/en/latest/releases/squid/).

Is there a way for us to ensure that the directory tree of a subvolume (Kubernetes PV) stays within the same fragment and is handled by a single MDS, so that a client's operations are served by one MDS?

What is the trigger to start fragmenting directories within a subvolumegroup?

With `balance_automate` set to false and the distributed ephemeral pin enabled on a subvolumegroup, can we expect an (almost) equal distribution of subvolumes (Kubernetes PVs) among the active MDS daemons and stable operation without hotspot migrations?

Regards,
Rajmohan R

On Wed, Nov 20, 2024 at 1:27 PM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> On Tue, Nov 19, 2024 at 9:20 PM Rajmohan Ramamoorthy <ram.rajmohanr@xxxxxxxxx> wrote:
> >
> > ```
> > Subvolumes do not "inherit" the distributed ephemeral pin. What you should expect below is that the "csi" subvolumegroup will be fragmented and distributed across the ranks. Consequently, the subvolumes will also be distributed across ranks as part of the subtrees rooted at each fragment of the "csi" subvolumegroup (directory).
> > ```
> >
> > How is subvolumegroup fragmentation handled?
>
> Fragmentation is automatically applied (to a minimum level) when a directory is marked with the distributed ephemeral pin.
>
> > Are the subvolumes equally distributed across all available active MDS?
>
> As the documentation says, it's a consistent hash of the fragments (which include the subvolumes which fall into those fragments) across ranks.
>
> > In the following scenario, will 3 of the subvolumes be mapped to each of the MDS?
>
> You cannot say. It depends how the fragments are hashed.
>
> > Will setting the ephemeral distributed pin on Subvolumegroup ensure that the subvolumes in it will be equally distributed across MDS?
>
> Approximately.
>
> > We are looking at ceph-csi use case for Kubernetes. PVs (subvolumes) are dynamically created by Kubernetes.
>
> This is an ideal use-case for the distributed ephemeral pin.
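For reference, a minimal, untested sketch of where the settings asked about above can be inspected; the mount point /mnt/cephfs is a hypothetical path, and output field names can differ between releases:

```
# Metadata balancer toggle for the volume (disabled by default in Squid):
ceph fs set midline-a balance_automate false
ceph fs get midline-a        # inspect the volume's current flags/settings

# The distributed ephemeral pin on the subvolumegroup can also be set (and,
# on recent clients, read back) as a vxattr on a mounted client
# (assumes a mount at /mnt/cephfs):
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/volumes/csi
getfattr -n ceph.dir.pin.distributed /mnt/cephfs/volumes/csi

# MDS options that govern when directory fragments are split or merged:
ceph config get mds mds_bal_split_size
ceph config get mds mds_bal_split_rd
ceph config get mds mds_bal_split_wr
ceph config get mds mds_bal_merge_size
ceph config get mds mds_bal_fragment_size_max
```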
> > > # Ceph FS configuration > > > > ceph fs subvolumegroup create midline-a csi > > ceph fs subvolumegroup pin midline-a csi distributed 1 > > > > ceph fs subvolume create midline-a subvol1 csi > > ceph fs subvolume create midline-a subvol2 csi > > ceph fs subvolume create midline-a subvol3 csi > > ceph fs subvolume create midline-a subvol4 csi > > ceph fs subvolume create midline-a subvol5 csi > > ceph fs subvolume create midline-a subvol6 csi > > > > # ceph fs ls > > name: midline-a, metadata pool: fs-midline-metadata-a, data pools: > [fs-midline-data-a ] > > > > # ceph fs subvolumegroup ls midline-a > > [ > > { > > "name": "csi" > > } > > ] > > > > # ceph fs subvolume ls midline-a csi > > [ > > { > > "name": "subvol4" > > }, > > { > > "name": "subvol2" > > }, > > { > > "name": "subvol3" > > }, > > { > > "name": "subvol5" > > }, > > { > > "name": "subvol6" > > }, > > { > > "name": "subvol1" > > } > > ] > > > > # ceph fs status > > midline-a - 2 clients > > ========= > > RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS > > 0 active midline.server1.njyfcn Reqs: 0 /s 514 110 228 36 > > 1 active midline.server2.lpnjmx Reqs: 0 /s 47 22 17 6 > > POOL TYPE USED AVAIL > > fs-midline-metadata-a metadata 25.4M 25.9T > > fs-midline-data-a data 216k 25.9T > > STANDBY MDS > > midline.server3.wsbxsh > > MDS version: ceph version 19.2.0 > (16063ff2022298c9300e49a547a16ffda59baf13) squid (stable) > > > > Following are the subtrees output from the MDSs. The directory fragments > does not > > seem to be equally mapped to MDS. > > > > # ceph tell mds.midline.server1.njyfcn get subtrees | jq > > [ > > { > > "is_auth": true, > > "auth_first": 0, > > "auth_second": -2, > > "export_pin": -1, > > "distributed_ephemeral_pin": false, > > "random_ephemeral_pin": false, > > "export_pin_target": -1, > > "dir": { > > "path": "", > > "dirfrag": "0x1", > > "snapid_first": 2, > > "projected_version": "1240", > > "version": "1240", > > "committing_version": "0", > > "committed_version": "0", > > "is_rep": false, > > "dir_auth": "0", > > "states": [ > > "auth", > > "dirty", > > "complete" > > ], > > "is_auth": true, > > "auth_state": { > > "replicas": { > > "1": 1 > > } > > }, > > "replica_state": { > > "authority": [ > > 0, > > -2 > > ], > > "replica_nonce": 0 > > }, > > "auth_pins": 0, > > "is_frozen": false, > > "is_freezing": false, > > "pins": { > > "child": 1, > > "subtree": 1, > > "subtreetemp": 0, > > "replicated": 1, > > "dirty": 1, > > "waiter": 0, > > "authpin": 0 > > }, > > "nref": 4 > > } > > }, > > { > > "is_auth": true, > > "auth_first": 0, > > "auth_second": -2, > > "export_pin": -1, > > "distributed_ephemeral_pin": false, > > "random_ephemeral_pin": false, > > "export_pin_target": -1, > > "dir": { > > "path": "~mds0", > > "dirfrag": "0x100", > > "snapid_first": 2, > > "projected_version": "1232", > > "version": "1232", > > "committing_version": "0", > > "committed_version": "0", > > "is_rep": false, > > "dir_auth": "0", > > "states": [ > > "auth", > > "dirty", > > "complete" > > ], > > "is_auth": true, > > "auth_state": { > > "replicas": {} > > }, > > "replica_state": { > > "authority": [ > > 0, > > -2 > > ], > > "replica_nonce": 0 > > }, > > "auth_pins": 0, > > "is_frozen": false, > > "is_freezing": false, > > "pins": { > > "child": 1, > > "subtree": 1, > > "subtreetemp": 0, > > "dirty": 1, > > "waiter": 0, > > "authpin": 0 > > }, > > "nref": 3 > > } > > }, > > { > > "is_auth": false, > > "auth_first": 1, > > "auth_second": -2, > > "export_pin": -1, > > "distributed_ephemeral_pin": true, > > 
"random_ephemeral_pin": false, > > "export_pin_target": 1, > > "dir": { > > "path": "/volumes/csi", > > "dirfrag": "0x100000006ae.11*", > > This fragment is pinned to rank 1 (export_pin_target). > > > "snapid_first": 2, > > "projected_version": "50", > > "version": "50", > > "committing_version": "50", > > "committed_version": "50", > > "is_rep": false, > > "dir_auth": "1", > > "states": [], > > "is_auth": false, > > "auth_state": { > > "replicas": {} > > }, > > "replica_state": { > > "authority": [ > > 1, > > -2 > > ], > > "replica_nonce": 1 > > }, > > "auth_pins": 0, > > "is_frozen": false, > > "is_freezing": false, > > "pins": { > > "ptrwaiter": 0, > > "request": 0, > > "child": 0, > > "frozen": 0, > > "subtree": 1, > > "replicated": 0, > > "dirty": 0, > > "waiter": 0, > > "authpin": 0, > > "tempexporting": 0 > > }, > > "nref": 1 > > } > > }, > > { > > "is_auth": true, > > "auth_first": 0, > > "auth_second": -2, > > "export_pin": -1, > > "distributed_ephemeral_pin": true, > > "random_ephemeral_pin": false, > > "export_pin_target": 0, > > "dir": { > > "path": "/volumes/csi", > > "dirfrag": "0x100000006ae.10*", > > This fragment is pinned to rank 0 (export_pin_target). > > > "snapid_first": 2, > > "projected_version": "52", > > "version": "52", > > "committing_version": "50", > > "committed_version": "50", > > "is_rep": false, > > "dir_auth": "0", > > "states": [ > > "auth", > > "dirty", > > "complete" > > ], > > "is_auth": true, > > "auth_state": { > > "replicas": {} > > }, > > "replica_state": { > > "authority": [ > > 0, > > -2 > > ], > > "replica_nonce": 0 > > }, > > "auth_pins": 0, > > "is_frozen": false, > > "is_freezing": false, > > "pins": { > > "subtree": 1, > > "dirty": 1, > > "waiter": 0, > > "authpin": 0 > > }, > > "nref": 2 > > } > > }, > > { > > "is_auth": true, > > "auth_first": 0, > > "auth_second": -2, > > "export_pin": -1, > > "distributed_ephemeral_pin": true, > > "random_ephemeral_pin": false, > > "export_pin_target": 0, > > "dir": { > > "path": "/volumes/csi", > > "dirfrag": "0x100000006ae.01*", > > This fragment is pinned to rank 0 (export_pin_target). 
> > > "snapid_first": 2, > > "projected_version": "136", > > "version": "136", > > "committing_version": "82", > > "committed_version": "82", > > "is_rep": false, > > "dir_auth": "0", > > "states": [ > > "auth", > > "dirty", > > "complete" > > ], > > "is_auth": true, > > "auth_state": { > > "replicas": { > > "1": 1 > > } > > }, > > "replica_state": { > > "authority": [ > > 0, > > -2 > > ], > > "replica_nonce": 0 > > }, > > "auth_pins": 0, > > "is_frozen": false, > > "is_freezing": false, > > "pins": { > > "child": 1, > > "frozen": 0, > > "subtree": 1, > > "replicated": 1, > > "dirty": 1, > > "authpin": 0 > > }, > > "nref": 4 > > } > > } > > ] > > > > # ceph tell mds.midline.server2.lpnjmx get subtrees | jq > > [ > > { > > "is_auth": true, > > "auth_first": 1, > > "auth_second": -2, > > "export_pin": -1, > > "distributed_ephemeral_pin": false, > > "random_ephemeral_pin": false, > > "export_pin_target": -1, > > "dir": { > > "path": "~mds1", > > "dirfrag": "0x101", > > "snapid_first": 2, > > "projected_version": "332", > > "version": "332", > > "committing_version": "0", > > "committed_version": "0", > > "is_rep": false, > > "dir_auth": "1", > > "states": [ > > "auth", > > "dirty", > > "complete" > > ], > > "is_auth": true, > > "auth_state": { > > "replicas": {} > > }, > > "replica_state": { > > "authority": [ > > 1, > > -2 > > ], > > "replica_nonce": 0 > > }, > > "auth_pins": 0, > > "is_frozen": false, > > "is_freezing": false, > > "pins": { > > "child": 1, > > "subtree": 1, > > "subtreetemp": 0, > > "dirty": 1, > > "waiter": 0, > > "authpin": 0 > > }, > > "nref": 3 > > } > > }, > > { > > "is_auth": true, > > "auth_first": 1, > > "auth_second": -2, > > "export_pin": -1, > > "distributed_ephemeral_pin": true, > > "random_ephemeral_pin": false, > > "export_pin_target": 1, > > "dir": { > > "path": "/volumes/csi", > > "dirfrag": "0x100000006ae.11*", > > "snapid_first": 2, > > "projected_version": "66", > > "version": "66", > > "committing_version": "50", > > "committed_version": "50", > > "is_rep": false, > > "dir_auth": "1", > > "states": [ > > "auth", > > "dirty", > > "complete" > > ], > > "is_auth": true, > > "auth_state": { > > "replicas": { > > "0": 1 > > } > > }, > > "replica_state": { > > "authority": [ > > 1, > > -2 > > ], > > "replica_nonce": 0 > > }, > > "auth_pins": 0, > > "is_frozen": false, > > "is_freezing": false, > > "pins": { > > "ptrwaiter": 0, > > "child": 1, > > "frozen": 0, > > "subtree": 1, > > "importing": 0, > > "replicated": 1, > > "dirty": 1, > > "authpin": 0 > > }, > > "nref": 4 > > } > > }, > > { > > "is_auth": false, > > "auth_first": 0, > > "auth_second": -2, > > "export_pin": -1, > > "distributed_ephemeral_pin": false, > > "random_ephemeral_pin": false, > > "export_pin_target": -1, > > "dir": { > > "path": "", > > "dirfrag": "0x1", > > "snapid_first": 2, > > "projected_version": "0", > > "version": "1216", > > "committing_version": "0", > > "committed_version": "0", > > "is_rep": false, > > "dir_auth": "0", > > "states": [], > > "is_auth": false, > > "auth_state": { > > "replicas": {} > > }, > > "replica_state": { > > "authority": [ > > 0, > > -2 > > ], > > "replica_nonce": 1 > > }, > > "auth_pins": 0, > > "is_frozen": false, > > "is_freezing": false, > > "pins": { > > "child": 1, > > "subtree": 1 > > }, > > "nref": 2 > > } > > } > > ] > > This all looks as expected. You can verify the location of the subvolumes > by: > > ceph tell mds.X dump tree /volumes/csi/X 0 > > and check auth_first. You should see that they are approximately > uniformly distributed. 
>
> --
> Patrick Donnelly, Ph.D.
> He / Him / His
> Red Hat Partner Engineer
> IBM, Inc.
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx