Re: Bluestore HDD Cluster Advice

Nick Fisk <nick@xxxxxxxxxx> · Fri, 22 Feb 2019 11:31:18 -0000

>Yes and no... bluestore seems to not work really optimal. For example,
>it has no filestore-like journal waterlining and flushes the deferred
>write queue just every 32 writes (deferred_batch_ops). And when it does
>that it's basically waiting for the HDD to commit and slowing down all
>further writes. And even worse, I found it to be basically untunable
>because when I tried to increase that limit to 1024 - OSDs waited for
>1024 writes to accumulate and then started to flush them in one batch
>which led to a HUGE write stall (tens of seconds). Commiting every 32
>writes is probably good for the thing they gently call "tail latency"
>(sudden latency spikes!) But it has the downside of that the latency is
>just consistently high :-P (ok, consistently average). 

What IO size are you testing, Bluestore will only defer writes under 32kb is size by default. Unless you are writing sequentially,
only limited amount of buffering via SSD is going to help, you will eventually hit the limits of the disk. Could you share some more
details as I'm interested in in this topic as well.

>
>In my small cluster with HGST drives and Intel SSDs for WAL+DB I've
>found the single-thread write latency (fio -iodepth=1 -ioengine=rbd) to
>be similar to a cluster without SSDs at all, it gave me only ~40-60
>iops. As I understand this is exactly because bluestore is flushing data
>each 32 writes and waiting for HDDs to commit all the time. One thing
>that helped me a lot was to disable the drives' volatile write cache
>(`hdparm -W 0 /dev/sdXX`). After doing that I have ~500-600 iops for the
>single-thread load! Which looks like it's finally committing data using
>the WAL correctly. My guess is that this is because HGST drives, in
>addition to a normal volatile write cache, have the thing called "Media
>Cache" which allows the HDD to acknowledge random writes by writing them
>to a temporary place on the platters without doing much seeks, and this
>thing gets enabled only when you disable the volatile cache. 

Interesting, will have to investigate this further!!! I wish there were more details around this technology from HGST

>
>At the same time, deferred writes slightly help performance when you
>don't have SSD. But the difference we talking is like tens of iops (30
>vs 40), so it's not noticeable in the SSD era :).

What size IO's are these you are testing with? I see a difference going from around 50IOPs up to over a thousand for a single
threaded 4kb sequential test.

>
>So - in theory yes, deferred writes should be acknowledged by the WAL.
>In practice, bluestore is a big mess of threads, locks and extra writes,
>so this is not always so. In fact, I would recommend you trying bcache
>as an option, it may work better, although I've not tested it myself yet
>:-) 
>
>What about the size of WAL/DB: 
>
>1) you don't need to put them on separate partitions, bluestore
>automatically allocates the available space 
>
>2) 8TB disks only take 16-17 GB for WAL+DB in my case. SSD partitions I
>have allocated for OSDs are just 20GB and it's also OK because bluestore
>can move parts of its DB to the main data device when it runs out of
>space on SSD partition.

Careful here, Bluestore will only migrate the next level of its DB if it can fit the entire DB on the flash device. These cutoff's
are around 3GB,30GB,300GB by default, so anything in-between will not be used. In your example a 20GB flash partition will mean that
a large amount of RocksDB will end up on the spinning disk (slowusedBytes)

>
>14 февраля 2019 г. 6:40:35 GMT+03:00, John Petrini
><http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com> пишет: 
>
>> Okay that makes more sense, I didn't realize the WAL functioned in a similar manner to filestore journals (though now that I've
had another read of Sage's blog post, New in Luminous: BlueStore, I notice he does >cover this). Is this to say that writes are
acknowledged as soon as they hit the WAL?
>> 
>> Also this raises another question regarding sizing. The Ceph documentation suggests allocating as much available space as
possible to blocks.db but what about WAL? We'll have 120GB per OSD available on each >SSD. Any suggestion on how we might divvy that
between the WAL and DB?

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com