> during scrubbing, OSD latency spikes to 300-600 ms,

I have seen Ceph clusters spike to several seconds per IO
operation, as they were designed for the same goals.

> resulting in sluggish performance for all VMs. Additionally,
> some OSDs fail during the scrubbing process.

Most likely they time out because of IO congestion rather than
failing outright.

> In such instances, promptly halting the scrubbing resolves the
> issue.

> (6 SSD node + 6 HDD node) All nodes are connected through 10G
> bonded link, i.e. 10Gx2=20Gb for each node.
>
>   64 SSD + 42 HDD = 106 OSDs
>
>   one-ssd         256  active+clean
>   one-hdd         512  active+clean
>   cloudstack.hdd  512  active+clean

Your Ceph cluster has been optimized for high latency and IO
congestion, goals that are surprisingly quite common, and it is
performing well given its design parameters (it is far from
full; if it becomes fuller it will achieve its goals even
better).

https://www.sabi.co.uk/blog/15-one.html?150305#150305
  "How many VMs per disk arm?"
https://www.sabi.co.uk/blog/15-one.html?150329#150329
  "CERN's old large disk discussion and IOPS-per-TB"

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
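P.S. For the record, "promptly halting the scrubbing" is usually
done with the standard cluster-wide flags; a sketch of the usual
procedure (operational commands against a live cluster, adapt to
your situation):

```shell
# Stop scheduling new scrubs and deep-scrubs cluster-wide;
# in-flight scrub chunks still drain, but client IO recovers.
ceph osd set noscrub
ceph osd set nodeep-scrub

# Once latency is back to normal, re-enable them:
ceph osd unset noscrub
ceph osd unset nodeep-scrub

# Longer term, throttling scrubs is gentler than banning them,
# e.g. sleep between scrub chunks (value in seconds):
ceph tell 'osd.*' injectargs '--osd_scrub_sleep 0.1'
```

Banning scrubs indefinitely just defers the IO cost and forfeits
data-consistency checking, so the throttle is the better
long-term answer.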
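P.P.S. The arithmetic behind the "VMs per disk arm" link is
worth spelling out; a back-of-envelope sketch where the 100
IOPS-per-arm budget, 3x replication, and 200 VMs are illustrative
assumptions of mine, not figures from the cluster above:

```python
# Back-of-envelope "VMs per disk arm" arithmetic.
# All inputs except HDD_OSDS are assumed, not measured.

HDD_OSDS = 42        # HDD OSDs in the cluster described above
IOPS_PER_ARM = 100   # rough small-random-IO budget of one 7200rpm arm
REPLICATION = 3      # assumed replica count: each write lands on 3 arms

# Aggregate random-write IOPS the HDD tier can offer to clients:
client_iops = HDD_OSDS * IOPS_PER_ARM / REPLICATION
print(client_iops)   # 1400.0

# A deep scrub reads every object on the same arms, so even a
# modest per-VM demand leaves very little headroom:
VMS = 200            # hypothetical VM count
iops_per_vm = client_iops / VMS
print(iops_per_vm)   # 7.0
```

Seven random IOPS per VM is why "sluggish performance for all
VMs" during scrubbing is the expected outcome of this design,
not an anomaly.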