Re: 50% performance drop after disk failure

If you have recovery io, then the system is not done recovering from the failed disk or from some other failure, for example from the other OSDs that flapped as a result of the recovery load.

If so, you may want to lower the recovery speed via the following options (a runtime example follows the list):

osd_max_backfills
osd_recovery_max_active
osd_recovery_sleep
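
A minimal sketch of how these can be applied at runtime across all OSDs, assuming Jewel-era injectargs syntax; the values are illustrative only and should be reverted once recovery has finished (the same settings can also be made persistent in ceph.conf under [osd]):

# illustrative values only: lower means slower recovery and less client impact
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-sleep 0.1'

# check what an OSD is actually running with (run on the host of osd.0; osd.0 is just an example id)
ceph daemon osd.0 config get osd_max_backfills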

Also, if the above values are already low and you are using pure HDD disks, you may in the longer term want to consider adding SSD db/wal devices to them, as pure HDDs can get heavily loaded during recovery.

/maged

On 09/07/2022 15:18, Michael Eichenberger wrote:
Hi all,

We have currently run out of ideas as to what causes a 50% performance decrease in disk io after removing an OSD from our cluster.

Performance is measured regularly by a canary virtual machine which runs an hourly disk IO measurement. Customers also report performance degradation on other virtual machines.

Symptoms:
- After taking a failed disk out of our ceph cluster ('ceph osd out X'),
  the canary VM measures a 50% performance degradation.
- Performance did not recover after re-balancing finished.
- 'recovery io' as reported by ceph status is as high as usual.
- 'client io' as reported by ceph status is significantly lower than
  usual; peaks are approximately a factor of 10 lower than 'recovery io',
  which was not the case before (observed via the commands sketched below).
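
For reference, a sketch of the commands we read these figures from (exact output layout may differ between releases):

ceph -s              # cluster summary, including the 'client io' and 'recovery io' lines
ceph osd pool stats  # per-pool breakdown of client vs recovery throughput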

Actions/checks done, all without impact on performance (verification
commands are sketched after this list):
- Logs do not show any indication of failures or irregularities
  (but we found a flapping OSD, which we also took out without
  further performance impact!).
- No full or near-full OSDs; PGs are balanced across OSDs.
- Network operates in the usual manner (and did not change); no saturation
  or high usage on links; bonds are ok; MTU settings checked and ok.
- Crush map does not show any unexpected entries.
- Reboot of mons (one after another).
- Reboot of storage nodes (one after another).
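
A sketch of the commands behind these checks, assuming Jewel-era CLI syntax:

ceph health detail   # warnings, flapping or near-full OSDs
ceph osd df          # per-OSD utilisation and PG count (fullness / balance)
ceph osd tree        # host/OSD layout and weights as seen by crush
ceph osd crush dump  # full crush map for closer inspection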

Cluster information:
- Version: ceph version 10.2.11 (Jewel)
- Operating System: CentOS 7
- 3 mons, 180 OSD on 8 storage nodes
- 382 TB used, 661 TB / 1043 TB avail
- OSDs on NVMe, SSD and HDD; pools are mapped
  to a single device type each (no mixed pools); see the sketch below.
- All OSDs use filestore.
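
For completeness, a sketch of how the pool-to-device-type mapping can be inspected (pool name 'rbd' is only an example; Jewel has no device classes, so the mapping is typically done with separate crush roots or rules per media type):

ceph osd crush rule dump              # rules and the crush roots they select
ceph osd pool get rbd crush_ruleset   # which ruleset a given pool uses (Jewel naming)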

Since we have currently run out of ideas as to what could cause these performance troubles, we would appreciate any hint that increases the probability of finding a solution!

Thanks in advance.

With best regards, Michael
