Re: Bluestore HDD Cluster Advice

Yes and no... bluestore doesn't seem to work really optimally. For example, it has no filestore-like journal watermark throttling, and it flushes the deferred write queue after every 32 writes (bluestore_deferred_batch_ops). When it does that, it basically waits for the HDD to commit, slowing down all further writes. Even worse, I found it to be basically untunable: when I tried to increase that limit to 1024, OSDs waited for 1024 writes to accumulate and then flushed them all in one batch, which led to a HUGE write stall (tens of seconds). Committing every 32 writes is probably good for the thing they gently call "tail latency" (sudden latency spikes!), but it has the downside that the latency is just consistently high :-P (ok, consistently average).
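For reference, the knob in question can be set in ceph.conf (the value shown is just the default; as described above, raising it significantly can trade steady latency for big batch-flush stalls, so tune with care):

```ini
[osd]
# Default is 32: the deferred write queue is flushed after this many ops.
# Raising it (e.g. to 1024) made my OSDs accumulate the whole batch and
# flush it at once, causing write stalls of tens of seconds.
bluestore_deferred_batch_ops = 32
```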

In my small cluster with HGST drives and Intel SSDs for WAL+DB, I found the single-thread write latency (fio -iodepth=1 -ioengine=rbd) to be similar to a cluster without SSDs at all; it gave me only ~40-60 iops. As I understand it, this is exactly because bluestore flushes data every 32 writes and waits for the HDDs to commit all the time. One thing that helped me a lot was disabling the drives' volatile write cache (`hdparm -W 0 /dev/sdXX`). After doing that I get ~500-600 iops for the single-thread load! Which looks like it's finally committing data through the WAL correctly. My guess is that this is because HGST drives, in addition to the normal volatile write cache, have a feature called "Media Cache" which lets the HDD acknowledge random writes by writing them to a temporary area on the platters without many seeks, and this feature only gets enabled when you disable the volatile cache.
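For anyone who wants to reproduce this, here's a sketch of the test (pool/image/device names are placeholders; fio's rbd engine needs librbd support, and the hdparm setting is not persistent across power cycles, so reapply it at boot via a udev rule or similar):

```shell
# Single-thread (QD=1) 4k random-write latency test against an RBD image.
# --pool and --rbdname are placeholders -- substitute your own.
fio --name=qd1-randwrite --ioengine=rbd --pool=rbd --rbdname=testimg \
    --rw=randwrite --bs=4k --iodepth=1 --direct=1 \
    --runtime=60 --time_based

# Disable the volatile write cache on each OSD's data disk (per drive).
hdparm -W 0 /dev/sdb
```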

At the same time, deferred writes do slightly help performance when you don't have an SSD. But the difference we're talking about is tens of iops (30 vs 40), so it's not noticeable in the SSD era :).

So - in theory, yes, deferred writes should be acknowledged as soon as they hit the WAL. In practice, bluestore is a big mess of threads, locks and extra writes, so this is not always so. In fact, I would recommend trying bcache as an option; it may work better, although I haven't tested it myself yet :-)
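For reference, a rough bcache setup sketch (untested here, as said; device names are placeholders):

```shell
# Create the cache device (SSD partition) and backing device (HDD) in one
# shot; creating them together attaches them automatically.
make-bcache -C /dev/nvme0n1p1 -B /dev/sdX

# Writeback mode is what actually absorbs random write latency.
echo writeback > /sys/block/bcache0/bcache/cache_mode

# Then deploy the OSD on the bcache device, e.g.:
ceph-volume lvm create --data /dev/bcache0
```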

What about the size of WAL/DB:

1) you don't need to put them on separate partitions - if you give bluestore a single fast partition, it automatically allocates space for both WAL and DB from it

2) 8TB disks only take 16-17 GB for WAL+DB in my case. The SSD partitions I've allocated per OSD are just 20GB, and that's also OK because bluestore can spill parts of its DB over to the main data device when it runs out of space on the SSD partition.
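You can check how much DB/WAL space your own OSDs actually use from the BlueFS perf counters (osd.0 is a placeholder; run on the OSD host; a nonzero slow_used_bytes means the DB has spilled over to the main data device):

```shell
# Per-OSD BlueFS space usage via the admin socket.
ceph daemon osd.0 perf dump bluefs \
  | grep -E 'db_used_bytes|wal_used_bytes|slow_used_bytes'
```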

On 14 February 2019 at 06:40:35 GMT+03:00, John Petrini <jpetrini@xxxxxxxxxxxx> wrote:
Okay that makes more sense, I didn't realize the WAL functioned in a similar manner to filestore journals (though now that I've had another read of Sage's blog post, New in Luminous: BlueStore, I notice he does cover this). Is this to say that writes are acknowledged as soon as they hit the WAL?
 
Also this raises another question regarding sizing. The Ceph documentation suggests allocating as much available space as possible to blocks.db but what about WAL? We'll have 120GB per OSD available on each SSD. Any suggestion on how we might divvy that between the WAL and DB?
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
