Re: 50% performance drop after disk failure

If you have recovery io, then the system is not done recovering from the failed disk or from some other failure, for example from the other OSDs that flapped as a result of the recovery load.

If so, you may want to lower the recovery speed via the following options (a runtime example follows the list):

osd_max_backfills
osd_recovery_max_active
osd_recovery_sleep
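
A minimal sketch of how these can be applied at runtime across all OSDs, assuming Jewel-era injectargs syntax; the values are illustrative only and should be reverted once recovery has finished (the same settings can also be made persistent in ceph.conf under [osd]):

# illustrative values only: lower means slower recovery and less client impact
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-sleep 0.1'

# check what an OSD is actually running with (run on the host of osd.0; osd.0 is just an example id)
ceph daemon osd.0 config get osd_max_backfills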

Also, if the above values are already low and you are using pure HDD disks, you may in the longer term want to consider adding SSD db/wal devices to them, as pure HDDs can get heavily loaded during recovery.

/maged

On 09/07/2022 15:18, Michael Eichenberger wrote:
Hi all,

We have currently run out of ideas as to what causes a 50% performance decrease in disk io after removing an OSD from our cluster.

Performance is measured regularly by a canary virtual machine which runs an hourly disk IO measurement. Customers also report performance degradation on other virtual machines.

Symptoms:
- After taking a failed disk out of our ceph cluster ('ceph osd out X'),
  the canary VM measures a 50% performance degradation.
- Performance did not recover after re-balancing finished.
- 'recovery io' as reported by ceph status is as high as usual.
- 'client io' as reported by ceph status is significantly lower than
  usual; peaks are approximately a factor of 10 lower than 'recovery io',
  which was not the case before (observed via the commands sketched below).
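
For reference, a sketch of the commands we read these figures from (exact output layout may differ between releases):

ceph -s              # cluster summary, including the 'client io' and 'recovery io' lines
ceph osd pool stats  # per-pool breakdown of client vs recovery throughput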

Actions/checks done, all without impact on performance (verification
commands are sketched after this list):
- Logs do not show any indication of failures or irregularities
  (but we found a flapping OSD, which we also took out without
  further performance impact!).
- No full or near-full OSDs; PGs are balanced across OSDs.
- Network operates in the usual manner (and did not change); no saturation
  or high usage on links; bonds are ok; MTU settings checked and ok.
- Crush map does not show any unexpected entries.
- Reboot of mons (one after another).
- Reboot of storage nodes (one after another).
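
A sketch of the commands behind these checks, assuming Jewel-era CLI syntax:

ceph health detail   # warnings, flapping or near-full OSDs
ceph osd df          # per-OSD utilisation and PG count (fullness / balance)
ceph osd tree        # host/OSD layout and weights as seen by crush
ceph osd crush dump  # full crush map for closer inspection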

Cluster information:
- Version: ceph version 10.2.11 (Jewel)
- Operating System: CentOS 7
- 3 mons, 180 OSD on 8 storage nodes
- 382 TB used, 661 TB / 1043 TB avail
- OSDs on NVMe, SSD and HDD; pools are mapped
  to a single device type each (no mixed pools); see the sketch below.
- All OSDs use filestore.
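
For completeness, a sketch of how the pool-to-device-type mapping can be inspected (pool name 'rbd' is only an example; Jewel has no device classes, so the mapping is typically done with separate crush roots or rules per media type):

ceph osd crush rule dump              # rules and the crush roots they select
ceph osd pool get rbd crush_ruleset   # which ruleset a given pool uses (Jewel naming)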

Since we have currently run out of ideas as to what could cause these performance troubles, we would appreciate any hint that increases the probability of finding a solution!

Thanks in advance.

With best regards, Michael
