Hello
As previously described here, we have an all-flash NVMe Ceph cluster (16.2.6) with currently only the CephFS service configured.
The current setup is:
54 nodes with 1 NVMe each, 2 partitions per NVMe.
8 MDSs (7 active, 1 standby).
MDS cache memory limit set to 128 GB.
It's a hyperconverged K8s cluster; the OSDs run on K8s worker nodes, so I set "osd memory target" to 16 GB.
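For reference, those two memory limits correspond to settings along these lines (just a sketch, values in bytes; the exact way we apply them may differ):

$ ceph config set mds mds_cache_memory_limit 137438953472   # 128 GiB
$ ceph config set osd osd_memory_target 17179869184         # 16 GiB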
For the last couple of weeks, we have had major slowdowns with many MDS_SLOW_REQUEST and MDS_SLOW_METADATA_IO warnings:
[WRN] Health check update: 7 MDSs report slow requests (MDS_SLOW_REQUEST)
[WRN] Health check update: 6 MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)
[WRN] MDS_SLOW_METADATA_IO: 6 MDSs report slow metadata IOs
[WRN] mds.icadmin012(mds.4): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 164 secs
[WRN] mds.icadmin014(mds.6): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 616 secs
[WRN] mds.icadmin015(mds.5): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 145 secs
[WRN] mds.icadmin011(mds.2): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 449 secs
[WRN] mds.icadmin013(mds.1): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 650 secs
[WRN] mds.icadmin008(mds.0): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 583 secs
We noticed that the cephfs_metadata pool had only 16 PGs, so we set autoscale_mode to off and increased the number of PGs to 256. With this
change, the number of SLOW messages has decreased drastically.
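For completeness, the change was done with something like this (a sketch; if pgp_num does not follow pg_num on its own, it can be set the same way):

$ ceph osd pool set cephfs_metadata pg_autoscale_mode off
$ ceph osd pool set cephfs_metadata pg_num 256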
$ ceph osd pool ls detail
pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 121056 lfor 0/16088/16086 flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
pool 2 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096 autoscale_mode on last_change 121056 lfor 0/0/213 flags hashpspool stripe_width 0 target_size_ratio 0.2 application cephfs
pool 3 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode off last_change 139312 lfor 0/92367/138900 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 256 recovery_priority 5 application cephfs
$ ceph -s
  cluster:
    id:     cc402f2e-2444-473e-adab-fe7b38d08546
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum icadmin006,icadmin007,icadmin008 (age 7w)
    mgr: icadmin006(active, since 3M), standbys: icadmin007, icadmin008
    mds: 7/7 daemons up, 1 standby
    osd: 110 osds: 108 up (since 21h), 108 in (since 23h)

  data:
    volumes: 1/1 healthy
    pools:   3 pools, 4353 pgs
    objects: 331.16M objects, 81 TiB
    usage:   246 TiB used, 88 TiB / 334 TiB avail
    pgs:     4350 active+clean
             2    active+clean+scrubbing+deep
             1    active+clean+scrubbing
Is there any mechanism to increase the number of PGs automatically in such a situation, or is this something we have to do manually?
Is 256 a good value in our case? We have 80 TB of data with more than 300M files.
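For what it's worth, a rough check against the common ~100 PG replicas per OSD rule of thumb (our assumption, not an official target) gives:

$ echo "scale=1; (4096 + 256 + 1) * 3 / 108" | bc
120.9

so 256 doesn't look unreasonable from a per-OSD point of view, but we are not sure how the ~300M metadata objects factor in.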
Thank you for your help,
--
Yoann Moulin
EPFL IC-IT