Yes and no... bluestore doesn't seem to work optimally. For example, it has no filestore-like journal watermarking and flushes the deferred write queue only every 32 writes (deferred_batch_ops). When it does, it basically waits for the HDD to commit, slowing down all further writes. Even worse, I found it basically untunable: when I tried to increase that limit to 1024, OSDs waited for 1024 writes to accumulate and then flushed them in one batch, which led to a HUGE write stall (tens of seconds). Committing every 32 writes is probably good for the thing they gently call "tail latency" (sudden latency spikes!), but it has the downside that the latency is just consistently high :-P (ok, consistently average).

In my small cluster with HGST drives and Intel SSDs for WAL+DB, I found the single-thread write latency (fio -iodepth=1 -ioengine=rbd) to be similar to a cluster without SSDs at all: it gave me only ~40-60 iops. As I understand it, this is exactly because bluestore flushes data every 32 writes and waits for the HDDs to commit all the time.

One thing that helped me a lot was disabling the drives' volatile write cache (`hdparm -W 0 /dev/sdXX`). After doing that I get ~500-600 iops for the single-thread load! Which looks like it's finally committing data through the WAL correctly. My guess is that this is because HGST drives, in addition to the normal volatile write cache, have a feature called "Media Cache" which lets the HDD acknowledge random writes by writing them to a temporary area on the platters without much seeking, and it only gets enabled when you disable the volatile cache.

At the same time, deferred writes do help performance slightly when you don't have an SSD. But the difference we're talking about is on the order of tens of iops (30 vs 40), so it's not noticeable in the SSD era :).
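For reference, here's roughly how I test this (device names, pool name and image name are placeholders — adjust for your hosts; this obviously needs a running cluster and an existing RBD image):

```shell
# Disable the volatile write cache on the data HDDs.
# /dev/sd{b,c,d,e} is just an example set of devices.
for dev in /dev/sd{b,c,d,e}; do
    hdparm -W 0 "$dev"
done

# Single-thread random-write latency test against an RBD image
# (pool "rbd" and image "testimg" are placeholders).
fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite \
    -runtime=60 -pool=rbd -rbdname=testimg
```

Note that `hdparm -W` usually doesn't survive a reboot, so you'd want a udev rule or init script to make it stick.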
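For anyone who wants to poke at these knobs anyway, they live in the [osd] section of ceph.conf. The values below are purely illustrative, not a recommendation — as I said, raising the batch size gave me multi-second stalls:

```ini
[osd]
# How many deferred writes accumulate before bluestore flushes
# the batch to the slow device (HDD-specific variant).
bluestore_deferred_batch_ops_hdd = 64
# Writes at or below this size take the deferred (WAL-first) path.
bluestore_prefer_deferred_size_hdd = 32768
```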
What about the size of WAL/DB: 1) you don't need to put them on separate partitions, bluestore automatically allocates the available space; 2) 8TB disks only use 16-17 GB for WAL+DB in my case. The SSD partitions I've allocated for OSDs are just 20 GB each, and that's also OK because bluestore can spill parts of its DB over to the main data device when it runs out of space on the SSD partition.

On February 14, 2019, 6:40:35 GMT+03:00, John Petrini <jpetrini@xxxxxxxxxxxx> wrote:
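P.S. To check actual DB consumption and whether an OSD has spilled over to the main device, something like this should work (osd.0 is a placeholder; run it on the OSD host):

```shell
# BlueFS perf counters: db_used_bytes vs db_total_bytes shows how full
# the SSD partition is; a nonzero slow_used_bytes means the DB has
# spilled over onto the main (HDD) device.
ceph daemon osd.0 perf dump bluefs | \
    grep -E 'db_(total|used)_bytes|slow_used_bytes'

# Cluster-wide per-OSD usage overview.
ceph osd df
```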
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com