Re: 50% performance drop after disk failure

Hi Michael,
I have the following starting points (some example commands are sketched below the list):
- Have the read and write latencies also changed?
- What does the utilisation of the hard disks look like (iostat)?
- Check the fragmentation and saturation of the file system.
- What about the LevelDB compaction?
- Has the CPU load on the nodes changed?
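
For example, a minimal set of checks (a rough sketch; it assumes the sysstat package is installed on the storage nodes, an admin keyring is available, and <id> stands for an OSD id):

  ceph osd perf        # per-OSD commit/apply latency as seen by Ceph
  iostat -x 5          # per-device utilisation, await and queue depth
  sar -u 5 3           # CPU load samples on a storage node
  # A LevelDB/omap compaction can be triggered per OSD, e.g. with
  # 'ceph tell osd.<id> compact' -- test on a single OSD first and
  # check that your release supports this tell command.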

Regards, Joachim 


On 09.07.2022 at 15:18, Michael Eichenberger <michael.eichenberger@xxxxxxxxxxxxxxxxx> wrote:

Hi all,

We have currently run out of ideas as to what is causing a 50% drop in
disk I/O performance after taking an OSD out of our cluster.

Performance is measured regularly by a canary virtual machine that runs
an hourly disk I/O benchmark. Customers on other virtual machines also
report the performance degradation.
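
A measurement of this kind can be reproduced with something like the following (only an illustration, not our exact benchmark; the use of fio and its parameters here are assumptions):

  # 60 seconds of 4k random writes with direct I/O against a 1 GB test file
  fio --name=canary --ioengine=libaio --direct=1 --rw=randwrite \
      --bs=4k --size=1G --runtime=60 --time_based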

Symptoms:
- After taking the failed disk out of our Ceph cluster ('ceph osd out X'),
  the canary VM measures a 50% performance degradation.
- Finishing the re-balancing did not have any impact on performance.
- 'recovery io' as reported by 'ceph status' is as high as usual.
- 'client io' as reported by 'ceph status' is significantly lower than
  usual; peaks are approximately a factor of 10 lower than 'recovery io',
  which was not the case before (we watch these rates as sketched below).
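
The client/recovery io rates above are simply taken from watching the cluster status, e.g.:

  ceph -w              # stream the status and io rates continuously
  watch -n 5 ceph -s   # or poll the status summary every 5 seconds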

Actions/checks done, all without impact on performance:
- Logs do not show any indication of failures or irregularities
  (but we found a flapping OSD, which we also took out, without
  further performance impact!).
- No full or near-full OSDs; PGs are balanced across the OSDs
  (checked with the standard commands sketched below).
- The network operates as usual (and did not change); no saturation
  or high usage on the links; bonds are ok; MTU settings checked and ok.
- The CRUSH map does not show any unexpected entries.
- Reboot of the mons (one after another).
- Reboot of the storage nodes (one after another).
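
For reference, the OSD fullness/balance and CRUSH checks above amount to looking at the standard command output, e.g.:

  ceph health detail    # would flag full / near-full OSDs
  ceph osd df           # per-OSD utilisation and PG counts
  ceph osd tree         # CRUSH hierarchy, weights, up/in state
  ceph osd crush dump   # full CRUSH map as JSON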

Cluster information:
- Version: Ceph 10.2.11 (Jewel)
- Operating system: CentOS 7
- 3 mons, 180 OSDs on 8 storage nodes
- 382 TB used, 661 TB / 1043 TB available
- OSDs on NVMe, SSD and spinning disks; pools are mapped
  to a single device type (no mixed pools)
- All OSDs use FileStore

Since we have currently run out of ideas about what could be causing these
performance problems, we would appreciate any hint that increases the
probability of finding a solution!

Thanks in advance.

With best regards, Michael

--
stepping stone AG
Wasserwerkgasse 7
CH-3011 Bern

Telefon: +41 31 332 53 63
www.stepping-stone.ch
michael.eichenberger@xxxxxxxxxxxxxxxxx


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
