Ceph MDS randomly hangs when pg_num is reduced

Hi,

I have a CephFS cluster:
```
> ceph -s

  cluster:
    id:     e78987f2-ef1c-11ed-897d-cf8c255417f0
    health: HEALTH_WARN
            85 pgs not deep-scrubbed in time
            85 pgs not scrubbed in time

  services:
    mon: 5 daemons, quorum datastone05,datastone06,datastone07,datastone10,datastone09 (age 2w)
    mgr: datastone05.iitngk(active, since 2w), standbys: datastone06.wjppdy
    mds: 2/2 daemons up, 1 hot standby
    osd: 22 osds: 22 up (since 3d), 22 in (since 4w); 8 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 115 pgs
    objects: 49.08M objects, 16 TiB
    usage:   35 TiB used, 2.0 PiB / 2.1 PiB avail
    pgs:     3807933/98160678 objects misplaced (3.879%)
             107 active+clean
             8   active+remapped+backfilling

  io:
    client:   224 MiB/s rd, 79 MiB/s wr, 844 op/s rd, 33 op/s wr
    recovery: 8.8 MiB/s, 24 objects/s
```

The pool and PG autoscaler status:

```
> ceph osd pool autoscale-status

POOL                SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
cephfs.myfs.meta  28802M                2.0         2119T  0.0000                                  4.0      16              on         False
cephfs.myfs.data  16743G                2.0         2119T  0.0154                                  1.0      32              on         False
rbd                  19                 2.0         2119T  0.0000                                  1.0      32              on         False
.mgr               3840k                2.0         2119T  0.0000                                  1.0       1              on         False
```
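
All four pools have the autoscaler enabled and none are marked BULK, so the pg_num reduction is being driven by the autoscaler. For reference, this is how I would check and, if necessary, pause the autoscaler on the data pool; using pg_num_min as a floor to block further merges is only my assumption about a possible workaround:

```
# Show the current autoscale mode of the data pool
> ceph osd pool get cephfs.myfs.data pg_autoscale_mode

# Stop the autoscaler from changing pg_num on this pool
> ceph osd pool set cephfs.myfs.data pg_autoscale_mode off

# Or keep the autoscaler on but set a floor so it cannot merge further
# (assumption: pick a value at or above the pool's current pg_num)
> ceph osd pool set cephfs.myfs.data pg_num_min 64
```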

The pool details:

```
> ceph osd pool ls detail

pool 1 'cephfs.myfs.meta' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 3639 lfor 0/3639/3637 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 2 'cephfs.myfs.data' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 66 pgp_num 58 pg_num_target 32 pgp_num_target 32 autoscale_mode on last_change 5670 lfor 0/5661/5659 flags hashpspool,selfmanaged_snaps stripe_width 0 application cephfs
pool 3 'rbd' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 486 lfor 0/486/478 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 4 '.mgr' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 39 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
```
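
Pool 2 is the interesting one: pg_num 66, pgp_num 58, pg_num_target 32, i.e. a PG merge from 66 down to 32 is still in progress, which is presumably what the 8 remapped+backfilling PGs in `ceph -s` above are. To watch how far the merge has gone I use something like the following (assuming the default mgr progress module is enabled):

```
# Current pg_num of the data pool; it steps down as the merge proceeds
> ceph osd pool get cephfs.myfs.data pg_num

# Ongoing background events reported by the mgr, including PG merges and backfill
> ceph progress
```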

Whenever the pg_num of a pool is being reduced like this, there is a chance that the active MDS hangs. Has anyone seen this behaviour?
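
If it helps with debugging, the next time the MDS hangs I plan to capture its in-flight operations via the admin socket on the host running the active MDS (the daemon name below is a placeholder; on a containerized/cephadm deployment the `ceph daemon` call has to be run inside the MDS container or a cephadm shell):

```
# Overall MDS and cluster state at the time of the hang
> ceph fs status
> ceph health detail

# On the active MDS host: operations currently stuck in the MDS
# (<mds-name> is a placeholder for the active daemon's name)
> ceph daemon mds.<mds-name> dump_ops_in_flight
> ceph daemon mds.<mds-name> dump_blocked_ops
```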
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


