Re: CephFS subvolumes not inheriting ephemeral distributed pin

On Tue, Nov 19, 2024 at 9:20 PM Rajmohan Ramamoorthy
<ram.rajmohanr@xxxxxxxxx> wrote:
>
> ```
> Subvolumes do not "inherit" the distributed ephemeral pin. What you
> should expect below is that the "csi" subvolumegroup will be
> fragmented and distributed across the ranks. Consequently, the
> subvolumes will also be distributed across ranks as part of the
> subtrees rooted at each fragment of the "csi" subvolumegroup
> (directory).
> ```
>
> How is subvolumegroup fragmentation handled?

Fragmentation is automatically applied (to a minimum level) when a
directory is marked with the distributed ephemeral pin.
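
If you want to see this directly: the subvolumegroup pin is effectively the
ceph.dir.pin.distributed extended attribute on the group directory, and the
fragments the MDS creates for it can be listed with "dirfrag ls". A rough
sketch, assuming the filesystem is mounted at /mnt/cephfs (a placeholder)
and using the MDS name from your output below:

# Inspect (or set) the distributed pin via the vxattr on the group directory.
getfattr -n ceph.dir.pin.distributed /mnt/cephfs/volumes/csi
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/volumes/csi

# List the fragments the MDS created for the pinned directory.
ceph tell mds.midline.server1.njyfcn dirfrag ls /volumes/csi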

> Are the subvolumes equally
> distributed across all available active MDS?

As the documentation says, the fragments (and with them the subvolumes
that fall into those fragments) are distributed across the ranks by a
consistent hash.

> In the following scenario,
> will 3 of the subvolumes be mapped to each of the MDS?

You cannot say in advance; it depends on how the fragments are hashed.

> Will setting the ephemeral distributed pin on a subvolumegroup ensure that
> the subvolumes in it are equally distributed across the MDSs?

Approximately.

> We are looking at
> the ceph-csi use case for Kubernetes. PVs (subvolumes) are dynamically created
> by Kubernetes.

This is an ideal use-case for the distributed ephemeral pin.

> # Ceph FS configuration
>
> ceph fs subvolumegroup create midline-a csi
> ceph fs subvolumegroup pin midline-a csi distributed 1
>
> ceph fs subvolume create midline-a subvol1 csi
> ceph fs subvolume create midline-a subvol2 csi
> ceph fs subvolume create midline-a subvol3 csi
> ceph fs subvolume create midline-a subvol4 csi
> ceph fs subvolume create midline-a subvol5 csi
> ceph fs subvolume create midline-a subvol6 csi
>
> # ceph fs ls
> name: midline-a, metadata pool: fs-midline-metadata-a, data pools: [fs-midline-data-a ]
>
> # ceph fs subvolumegroup ls midline-a
> [
> {
> "name": "csi"
> }
> ]
>
> # ceph fs subvolume ls midline-a csi
> [
> {
> "name": "subvol4"
> },
> {
> "name": "subvol2"
> },
> {
> "name": "subvol3"
> },
> {
> "name": "subvol5"
> },
> {
> "name": "subvol6"
> },
> {
> "name": "subvol1"
> }
> ]
>
> # ceph fs status
> midline-a - 2 clients
> =========
> RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
> 0 active midline.server1.njyfcn Reqs: 0 /s 514 110 228 36
> 1 active midline.server2.lpnjmx Reqs: 0 /s 47 22 17 6
> POOL TYPE USED AVAIL
> fs-midline-metadata-a metadata 25.4M 25.9T
> fs-midline-data-a data 216k 25.9T
> STANDBY MDS
> midline.server3.wsbxsh
> MDS version: ceph version 19.2.0 (16063ff2022298c9300e49a547a16ffda59baf13) squid (stable)
>
> Following is the subtree output from the MDSs. The directory fragments do not
> seem to be equally mapped to the MDSs.
>
> # ceph tell mds.midline.server1.njyfcn get subtrees | jq
> [
> {
> "is_auth": true,
> "auth_first": 0,
> "auth_second": -2,
> "export_pin": -1,
> "distributed_ephemeral_pin": false,
> "random_ephemeral_pin": false,
> "export_pin_target": -1,
> "dir": {
> "path": "",
> "dirfrag": "0x1",
> "snapid_first": 2,
> "projected_version": "1240",
> "version": "1240",
> "committing_version": "0",
> "committed_version": "0",
> "is_rep": false,
> "dir_auth": "0",
> "states": [
> "auth",
> "dirty",
> "complete"
> ],
> "is_auth": true,
> "auth_state": {
> "replicas": {
> "1": 1
> }
> },
> "replica_state": {
> "authority": [
> 0,
> -2
> ],
> "replica_nonce": 0
> },
> "auth_pins": 0,
> "is_frozen": false,
> "is_freezing": false,
> "pins": {
> "child": 1,
> "subtree": 1,
> "subtreetemp": 0,
> "replicated": 1,
> "dirty": 1,
> "waiter": 0,
> "authpin": 0
> },
> "nref": 4
> }
> },
> {
> "is_auth": true,
> "auth_first": 0,
> "auth_second": -2,
> "export_pin": -1,
> "distributed_ephemeral_pin": false,
> "random_ephemeral_pin": false,
> "export_pin_target": -1,
> "dir": {
> "path": "~mds0",
> "dirfrag": "0x100",
> "snapid_first": 2,
> "projected_version": "1232",
> "version": "1232",
> "committing_version": "0",
> "committed_version": "0",
> "is_rep": false,
> "dir_auth": "0",
> "states": [
> "auth",
> "dirty",
> "complete"
> ],
> "is_auth": true,
> "auth_state": {
> "replicas": {}
> },
> "replica_state": {
> "authority": [
> 0,
> -2
> ],
> "replica_nonce": 0
> },
> "auth_pins": 0,
> "is_frozen": false,
> "is_freezing": false,
> "pins": {
> "child": 1,
> "subtree": 1,
> "subtreetemp": 0,
> "dirty": 1,
> "waiter": 0,
> "authpin": 0
> },
> "nref": 3
> }
> },
> {
> "is_auth": false,
> "auth_first": 1,
> "auth_second": -2,
> "export_pin": -1,
> "distributed_ephemeral_pin": true,
> "random_ephemeral_pin": false,
> "export_pin_target": 1,
> "dir": {
> "path": "/volumes/csi",
> "dirfrag": "0x100000006ae.11*",

This fragment is pinned to rank 1 (export_pin_target). (A jq one-liner that
summarizes all of the fragment targets follows this dump.)

> "snapid_first": 2,
> "projected_version": "50",
> "version": "50",
> "committing_version": "50",
> "committed_version": "50",
> "is_rep": false,
> "dir_auth": "1",
> "states": [],
> "is_auth": false,
> "auth_state": {
> "replicas": {}
> },
> "replica_state": {
> "authority": [
> 1,
> -2
> ],
> "replica_nonce": 1
> },
> "auth_pins": 0,
> "is_frozen": false,
> "is_freezing": false,
> "pins": {
> "ptrwaiter": 0,
> "request": 0,
> "child": 0,
> "frozen": 0,
> "subtree": 1,
> "replicated": 0,
> "dirty": 0,
> "waiter": 0,
> "authpin": 0,
> "tempexporting": 0
> },
> "nref": 1
> }
> },
> {
> "is_auth": true,
> "auth_first": 0,
> "auth_second": -2,
> "export_pin": -1,
> "distributed_ephemeral_pin": true,
> "random_ephemeral_pin": false,
> "export_pin_target": 0,
> "dir": {
> "path": "/volumes/csi",
> "dirfrag": "0x100000006ae.10*",

This fragment is pinned to rank 0 (export_pin_target).

> "snapid_first": 2,
> "projected_version": "52",
> "version": "52",
> "committing_version": "50",
> "committed_version": "50",
> "is_rep": false,
> "dir_auth": "0",
> "states": [
> "auth",
> "dirty",
> "complete"
> ],
> "is_auth": true,
> "auth_state": {
> "replicas": {}
> },
> "replica_state": {
> "authority": [
> 0,
> -2
> ],
> "replica_nonce": 0
> },
> "auth_pins": 0,
> "is_frozen": false,
> "is_freezing": false,
> "pins": {
> "subtree": 1,
> "dirty": 1,
> "waiter": 0,
> "authpin": 0
> },
> "nref": 2
> }
> },
> {
> "is_auth": true,
> "auth_first": 0,
> "auth_second": -2,
> "export_pin": -1,
> "distributed_ephemeral_pin": true,
> "random_ephemeral_pin": false,
> "export_pin_target": 0,
> "dir": {
> "path": "/volumes/csi",
> "dirfrag": "0x100000006ae.01*",

This fragment is pinned to rank 0 (export_pin_target).

> "snapid_first": 2,
> "projected_version": "136",
> "version": "136",
> "committing_version": "82",
> "committed_version": "82",
> "is_rep": false,
> "dir_auth": "0",
> "states": [
> "auth",
> "dirty",
> "complete"
> ],
> "is_auth": true,
> "auth_state": {
> "replicas": {
> "1": 1
> }
> },
> "replica_state": {
> "authority": [
> 0,
> -2
> ],
> "replica_nonce": 0
> },
> "auth_pins": 0,
> "is_frozen": false,
> "is_freezing": false,
> "pins": {
> "child": 1,
> "frozen": 0,
> "subtree": 1,
> "replicated": 1,
> "dirty": 1,
> "authpin": 0
> },
> "nref": 4
> }
> }
> ]
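
As an aside, rather than reading through the whole dump, you can pull out
just the fragment-to-rank mapping with jq (field names taken from the
get subtrees output above):

ceph tell mds.midline.server1.njyfcn get subtrees | \
  jq '.[] | select(.distributed_ephemeral_pin) | {dirfrag: .dir.dirfrag, target: .export_pin_target}'
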
>
> # ceph tell mds.midline.server2.lpnjmx get subtrees | jq
> [
> {
> "is_auth": true,
> "auth_first": 1,
> "auth_second": -2,
> "export_pin": -1,
> "distributed_ephemeral_pin": false,
> "random_ephemeral_pin": false,
> "export_pin_target": -1,
> "dir": {
> "path": "~mds1",
> "dirfrag": "0x101",
> "snapid_first": 2,
> "projected_version": "332",
> "version": "332",
> "committing_version": "0",
> "committed_version": "0",
> "is_rep": false,
> "dir_auth": "1",
> "states": [
> "auth",
> "dirty",
> "complete"
> ],
> "is_auth": true,
> "auth_state": {
> "replicas": {}
> },
> "replica_state": {
> "authority": [
> 1,
> -2
> ],
> "replica_nonce": 0
> },
> "auth_pins": 0,
> "is_frozen": false,
> "is_freezing": false,
> "pins": {
> "child": 1,
> "subtree": 1,
> "subtreetemp": 0,
> "dirty": 1,
> "waiter": 0,
> "authpin": 0
> },
> "nref": 3
> }
> },
> {
> "is_auth": true,
> "auth_first": 1,
> "auth_second": -2,
> "export_pin": -1,
> "distributed_ephemeral_pin": true,
> "random_ephemeral_pin": false,
> "export_pin_target": 1,
> "dir": {
> "path": "/volumes/csi",
> "dirfrag": "0x100000006ae.11*",
> "snapid_first": 2,
> "projected_version": "66",
> "version": "66",
> "committing_version": "50",
> "committed_version": "50",
> "is_rep": false,
> "dir_auth": "1",
> "states": [
> "auth",
> "dirty",
> "complete"
> ],
> "is_auth": true,
> "auth_state": {
> "replicas": {
> "0": 1
> }
> },
> "replica_state": {
> "authority": [
> 1,
> -2
> ],
> "replica_nonce": 0
> },
> "auth_pins": 0,
> "is_frozen": false,
> "is_freezing": false,
> "pins": {
> "ptrwaiter": 0,
> "child": 1,
> "frozen": 0,
> "subtree": 1,
> "importing": 0,
> "replicated": 1,
> "dirty": 1,
> "authpin": 0
> },
> "nref": 4
> }
> },
> {
> "is_auth": false,
> "auth_first": 0,
> "auth_second": -2,
> "export_pin": -1,
> "distributed_ephemeral_pin": false,
> "random_ephemeral_pin": false,
> "export_pin_target": -1,
> "dir": {
> "path": "",
> "dirfrag": "0x1",
> "snapid_first": 2,
> "projected_version": "0",
> "version": "1216",
> "committing_version": "0",
> "committed_version": "0",
> "is_rep": false,
> "dir_auth": "0",
> "states": [],
> "is_auth": false,
> "auth_state": {
> "replicas": {}
> },
> "replica_state": {
> "authority": [
> 0,
> -2
> ],
> "replica_nonce": 1
> },
> "auth_pins": 0,
> "is_frozen": false,
> "is_freezing": false,
> "pins": {
> "child": 1,
> "subtree": 1
> },
> "nref": 2
> }
> }
> ]

This all looks as expected. You can verify the location of the subvolumes with:

ceph tell mds.<mds-name> dump tree /volumes/csi/<subvolume> 0

and check auth_first. You should see that they are approximately
uniformly distributed.
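
For example, a rough way to script that check (a sketch using the MDS and
subvolume names from your output; note that "dump tree" only reports what
is in that MDS's cache, so repeat it against the other rank for any
subvolume that comes back empty):

for sv in subvol1 subvol2 subvol3 subvol4 subvol5 subvol6; do
  echo -n "$sv: "
  ceph tell mds.midline.server1.njyfcn dump tree "/volumes/csi/$sv" 0 | grep -m1 '"auth_first"'
done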

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



