Hello
As previously described here, we have an all-flash NVMe Ceph cluster (16.2.6) with currently only the CephFS service configured.
The current setup is:
54 nodes with 1 NVMe each, 2 partitions per NVMe.
8 MDSs (7 active, 1 standby).
MDS cache memory limit set to 128 GB.
It's a hyperconverged K8s cluster; the OSDs run on K8s worker nodes, so I set "osd memory target" to 16 GB.
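For reference, those two memory limits correspond to settings along these lines (just a sketch, values in bytes; the exact way we apply them may differ):

$ ceph config set mds mds_cache_memory_limit 137438953472   # 128 GiB
$ ceph config set osd osd_memory_target 17179869184         # 16 GiB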
For the last couple of weeks, we have had major slowdowns with many MDS_SLOW_REQUEST and MDS_SLOW_METADATA_IO warnings:
[WRN] Health check update: 7 MDSs report slow requests (MDS_SLOW_REQUEST)
[WRN] Health check update: 6 MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)
[WRN] MDS_SLOW_METADATA_IO: 6 MDSs report slow metadata IOs
[WRN] mds.icadmin012(mds.4): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 164 secs
[WRN] mds.icadmin014(mds.6): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 616 secs
[WRN] mds.icadmin015(mds.5): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 145 secs
[WRN] mds.icadmin011(mds.2): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 449 secs
[WRN] mds.icadmin013(mds.1): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 650 secs
[WRN] mds.icadmin008(mds.0): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 583 secs
We noticed that the cephfs_metadata pool had only 16 PGs, so we set autoscale_mode to off and increased the number of PGs to 256. With this
change, the number of SLOW messages has decreased drastically.
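For completeness, the change was done with something like this (a sketch; if pgp_num does not follow pg_num on its own, it can be set the same way):

$ ceph osd pool set cephfs_metadata pg_autoscale_mode off
$ ceph osd pool set cephfs_metadata pg_num 256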
$ ceph osd pool ls detail
pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 121056 lfor 0/16088/16086 flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
pool 2 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096 autoscale_mode on last_change 121056 lfor 0/0/213 flags hashpspool stripe_width 0 target_size_ratio 0.2 application cephfs
pool 3 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode off last_change 139312 lfor 0/92367/138900 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 256 recovery_priority 5 application cephfs
$ ceph -s
  cluster:
    id:     cc402f2e-2444-473e-adab-fe7b38d08546
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum icadmin006,icadmin007,icadmin008 (age 7w)
    mgr: icadmin006(active, since 3M), standbys: icadmin007, icadmin008
    mds: 7/7 daemons up, 1 standby
    osd: 110 osds: 108 up (since 21h), 108 in (since 23h)

  data:
    volumes: 1/1 healthy
    pools:   3 pools, 4353 pgs
    objects: 331.16M objects, 81 TiB
    usage:   246 TiB used, 88 TiB / 334 TiB avail
    pgs:     4350 active+clean
             2    active+clean+scrubbing+deep
             1    active+clean+scrubbing
Is there any mechanism to increase the number of PGs automatically in such a situation, or is this something we have to do manually?
Is 256 a good value in our case? We have 80 TB of data with more than 300M files.
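For what it's worth, a rough check against the common ~100 PG replicas per OSD rule of thumb (our assumption, not an official target) gives:

$ echo "scale=1; (4096 + 256 + 1) * 3 / 108" | bc
120.9

so 256 doesn't look unreasonable from a per-OSD point of view, but we are not sure how the ~300M metadata objects factor in.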
Thank you for your help,
--
Yoann Moulin
EPFL IC-IT