Hi all,
We have run out of ideas as to what is causing a 50% performance decrease
in disk IO after taking an OSD out of our cluster.
Performance is measured regularly by a canary virtual machine that runs an
hourly disk IO benchmark. The degradation is also reported by customers on
other virtual machines.
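For illustration only (we are not reproducing the actual canary job here),
a benchmark of this kind could look roughly like the following fio
invocation, with purely example parameters:

  # illustrative canary-style benchmark, parameters are examples only
  fio --name=canary --filename=/var/tmp/canary.fio \
      --rw=randrw --rwmixread=70 --bs=4k --size=1G \
      --ioengine=libaio --iodepth=32 --direct=1 \
      --runtime=60 --time_based --group_reporting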
Symptoms:
- After taking a failed disk out of our Ceph cluster ('ceph osd out X'),
the canary VM measures a 50% performance drop.
- The completion of the re-balancing had no impact on performance.
- 'recovery io' as reported by ceph status is as high as usual.
- 'client io' as reported by ceph status is significantly lower than
usual; peaks are approximately a factor of 10 lower than 'recovery io',
which was not the case before (see the commands sketched below).
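For reference, a sketch of how these figures can be read, together with the
recovery/backfill throttles that govern the recovery vs. client balance
(standard ceph CLI; osd.0 is only an example id, and the daemon command has
to be run on the host carrying that OSD):

  # 'client io' and 'recovery io' lines in the status output
  ceph -s
  ceph -w    # continuous view of the same counters

  # current recovery/backfill throttles of one OSD
  ceph daemon osd.0 config show | \
    egrep 'osd_max_backfills|osd_recovery_max_active|osd_recovery_op_priority|osd_client_op_priority'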
Actions/checks done, all without impact on performance:
- Logs do not show any indication of failures or irregularities
(but we found a flapping OSD, which we also took out, without
further performance impact!).
- No full or near-full OSDs; PGs are balanced across the OSDs (see the
command sketch after this list).
- The network operates as usual (and has not changed); no saturation
or high usage on the links; bonds are OK; MTU settings checked and OK.
- The CRUSH map does not show any unexpected entries.
- Reboot of the mons (one after another).
- Reboot of the storage nodes (one after another).
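For completeness, a sketch of the standard commands behind the fullness /
PG-balance and flapping-OSD checks above (what we actually ran may differ
in detail):

  ceph health detail   # full / near-full warnings, stuck PGs
  ceph osd df tree     # per-OSD utilisation and PG counts
  ceph osd perf        # commit/apply latency per OSD, slow disks stand out
  ceph osd tree        # up/down state and weight of every OSD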
Cluster information:
- Version: ceph version 10.2.11 (Jewel)
- Operating System: CentOS 7
- 3 mons, 180 OSDs on 8 storage nodes
- 382 TB used, 661 TB / 1043 TB avail
- OSDs on NVMe, SSD and spinning disks; pools are mapped
to a single device type (no mixed pools).
- All OSDs use FileStore.
Since we have run out of ideas about what could be causing these performance
problems, we would appreciate any hint that increases the probability of
finding a solution!
Thanks in advance.
With best regards, Michael
--
stepping stone AG
Wasserwerkgasse 7
CH-3011 Bern
Telefon: +41 31 332 53 63
www.stepping-stone.ch
michael.eichenberger@xxxxxxxxxxxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx