If you have recovery io, then the system is not done recovering from the
failed disk or from some other failure, for example from the other OSDs
that flapped as a result of the recovery load.
If so, you may want to lower the recovery speed via the following
options (an example command is sketched below):
osd_max_backfills
osd_recovery_max_active
osd_recovery_sleep
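For example, something like this (illustrative values only, adjust to
your hardware; on Jewel the values are injected at runtime with
injectargs and made persistent in the [osd] section of ceph.conf):

    ceph tell osd.* injectargs \
        '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-sleep 0.1'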
Also, if the above values are already low and you are using pure HDD
disks, you may want to consider, in the longer term, adding SSD db/wal
devices to them, as pure HDDs can get overloaded during recovery.
/maged
On 09/07/2022 15:18, Michael Eichenberger wrote:
Hi all,
We have currently run out of ideas as to what causes a 50% decrease in
disk I/O performance after removing an OSD from our cluster.
Performance is measured regularly by a canary virtual machine that runs
an hourly disk I/O benchmark. Performance degradation is also reported
by customers on other virtual machines.
Symptoms:
- After taking a failed disk out of our ceph cluster ('ceph osd out X'),
the canary VM measures a 50% performance degradation.
- Finishing re-balancing did not have an impact on performance
- 'recovery io' as reported by ceph status is as high as usual
- 'client io' as reported by ceph status is significantly lower than
usual; peaks are approximately a factor of 10 lower than 'recovery io',
which was not the case before.
Actions/checks done, all without impact on performance:
- Logs do not show any indication of failures or irregularities
(but we found a flapping OSD, which we also took out without
further performance impact!).
- No full or near-full OSDs; PGs are balanced across OSDs.
- Network operates in usual manner (and did not change); no saturation
or high usage on links; bonds are ok; MTU settings checked and ok.
- Crush map does not show any unexpected entries.
- Reboot of mons (one after another).
- Reboot of storage nodes (one after another).
Cluster information:
- Version: ceph version 10.2.11 (Jewel)
- Operating System: CentOS 7
- 3 mons, 180 OSDs on 8 storage nodes
- 382 TB used, 661 TB / 1043 TB avail
- OSDs on NVMe, SSD and HDD devices; pools are mapped to a single
device type (no mixed pools).
- All OSDs use filestore
Since we have currently run out of ideas about what could be causing
these performance problems, we would appreciate any hint that increases
the probability of finding a solution!
Thanks in advance.
With best regards, Michael
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx