Thank you all for your answers, this was really helpful!

Stefan Priebe wrote:
> yes we have the same issues and switched to seagate for those reasons.
> you can fix at least a big part of it by disabling the write cache of
> those drives - generally speaking it seems the toshiba firmware is
> broken.
> I was not able to find a newer one.

Good to know that we're not alone :) I also looked for a newer
firmware, to no avail.

Igor Fedotov wrote:
> Benoit, wondering what are the write cache settings in your case?
>
> And do you see any difference after disabling it if any?

Write cache is enabled on all our OSDs (including the HGST drives that
don't have a latency issue). To see if disabling write cache on the
Toshiba drives would help, I turned it off on all 12 drives in one of
our OSD nodes:

```
# disable the volatile on-drive write cache on sda through sdl
for disk in /dev/sd{a..l}; do hdparm -W0 $disk; done
```

and left it on in the remaining nodes. I used `rados bench write` to
create some load on the cluster and looked at

```
# average OSD commit latency per host; the join pulls the hostname
# label in from ceph_osd_metadata
avg by (hostname) (ceph_osd_commit_latency_ms * on (ceph_daemon) group_left (hostname) ceph_osd_metadata)
```

in Prometheus. The hosts with write cache _enabled_ had a commit
latency around 145ms, while the host with write cache _disabled_ had a
commit latency around 25ms. So disabling it definitely helps!

Mark Nelson wrote:
> This isn't the first time I've seen drive cache cause problematic
> latency issues, and not always from the same manufacturer.
> Unfortunately it seems like you really have to test the drives you
> want to use before deploying them to make sure you don't run into
> issues.

That's very true! Data sheets and even public benchmarks can be quite
deceiving, and two hard drives that seem to have similar performance
profiles can perform very differently within a Ceph cluster. Lesson
learned.

Cheers,

--
Ben
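
P.S. For anyone who wants to reproduce the test: I used plain `rados
bench` for the load. A minimal sketch of what that looks like (the pool
name, PG count, and the runtime/size/thread parameters below are just
example values, not necessarily what I ran):

```
# create a throwaway pool for the benchmark (example pool name and PG count)
ceph osd pool create testbench 32 32

# write for 60 seconds with 16 concurrent 4 MiB objects; --no-cleanup
# keeps the objects so you can run seq/rand read benchmarks afterwards
rados bench -p testbench 60 write -b 4194304 -t 16 --no-cleanup

# remove the benchmark objects and drop the pool when you're done
rados -p testbench cleanup
ceph osd pool delete testbench testbench --yes-i-really-really-mean-it
```

While it runs you can watch the Prometheus query above and see how the
hosts compare.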
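
P.P.S. One caveat with the hdparm approach: `-W0` only changes the
running state, so the cache comes back after a reboot. You can check
the current state with `hdparm -W` (no value), and if you want the
setting to stick you need some boot-time hook. A udev rule is one way;
the sketch below is untested, and the rule file name and device match
are just examples to adapt to your distro and drive naming:

```
# query the current setting; prints "write-caching = 0 (off)" or "1 (on)"
hdparm -W /dev/sda

# example udev rule to re-apply -W0 on every boot/hotplug for sda..sdl
cat > /etc/udev/rules.d/99-disable-write-cache.rules <<'EOF'
ACTION=="add|change", KERNEL=="sd[a-l]", RUN+="/usr/sbin/hdparm -W0 /dev/%k"
EOF
```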