On 3/5/19 4:23 PM, Vitaliy Filippov wrote:
Testing -rw=write without -sync=1 or -fsync=1 (or -fsync=32 for batch
IO, or just fio -ioengine=rbd from outside a VM) is rather pointless -
you're benchmarking the RBD cache, not Ceph itself. The RBD cache
coalesces your writes into big sequential writes, and of course
bluestore is faster in that case - it has no double write for big
writes.
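If you want numbers that reflect the actual Ceph write path, run the
test with syncs enabled against the rbd engine directly; something
along these lines (the pool and image names here are just placeholders
for whatever you have):

fio -ioengine=rbd -pool=rbd -rbdname=testimg -direct=1 -fsync=1 \
    -rw=randwrite -bs=4k -iodepth=32 -runtime=60 -name=test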
I'll probably try to test these settings - I'm also interested in
random write iops in an all-flash bluestore cluster :) but I don't
think any rocksdb options will help. I found bluestore pretty
untunable in terms of performance :)
For random writes, you often end up bottlenecked in the kv sync thread
so long as you aren't generally CPU bound. Anything you can do to
reduce the work being done in the kv sync thread usually helps. A big
one is making sure you are hitting onodes in the bluestore cache rather
than the rocksdb cache or disk, i.e. having enough onode cache available
for the dataset being benchmarked.
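As a rough sketch (the sizes depend entirely on your RAM per OSD and the
working set, so treat the values as placeholders), that means giving
bluestore a generous cache in ceph.conf:

[osd]
# illustrative values only - size so the hot onodes fit in cache
bluestore_cache_size_ssd = 8589934592
# fraction of that cache reserved for metadata (onodes) vs rocksdb
bluestore_cache_meta_ratio = 0.5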
The best thing to do for me was to disable CPU powersaving (set the
governor to performance + cpupower idle-set -D 1). Your CPUs become
frying pans, but write IOPS increase 2-3 times, especially
single-thread write IOPS, which are the worst-case scenario and at the
same time the thing applications usually need. Test it with fio
-ioengine=rbd -bs=4k -iodepth=1.
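Concretely, something like this (needs the cpupower tool, and it does
not persist across reboots unless you put it in a startup script):

cpupower frequency-set -g performance   # performance governor on all CPUs
cpupower idle-set -D 1                  # disable the deeper C-states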
Yep, this is a big one. I've asked vendors for clarification on whether
we can actually recommend doing this, but haven't gotten a clear answer yet. :/
Another thing I've done on my cluster was to set
`bluestore_min_alloc_size_ssd` to 4096. The reason is that it defaults
to 16 KB, which means all writes below 16 KB use the same deferred
write path as on HDDs. Deferred writes only increase the write
amplification factor on SSDs and lower performance. You have to
recreate OSDs after changing this variable - it's only applied at the
time of OSD creation.
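In config terms it's just one line, but again, it only takes effect
when an OSD is created, so existing OSDs have to be destroyed and
redeployed to pick it up:

[osd]
bluestore_min_alloc_size_ssd = 4096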
Decreasing the min_alloc size isn't always a win, but it can be in some
cases. Originally bluestore_min_alloc_size_ssd was set to 4096, but we
increased it to 16384 because at the time our metadata path was slow and
increasing it resulted in a pretty significant performance win (along
with increasing the WAL buffers in rocksdb to reduce write
amplification). Since then we've improved the metadata path to the
point where, at least on our test nodes, performance was pretty close
between min_alloc_size = 16k and min_alloc_size = 4k the last time
I looked. It might be a good idea to drop it back down to 4k now, but I
think we need to be careful because there are tradeoffs.
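For reference, that WAL buffer tuning goes through
bluestore_rocksdb_options. The values below are only an illustration of
which knobs are involved, not a recommendation, and note that setting
this option replaces the whole default string rather than merging with
it:

[osd]
bluestore_rocksdb_options = compression=kNoCompression,write_buffer_size=268435456,max_write_buffer_number=8,min_write_buffer_number_to_merge=2,recycle_log_file_num=4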
You can see some of the original work we did in 2016 looking at this on
our performance test cluster here:
https://docs.google.com/spreadsheets/d/1YPiiDu0IxQdB4DcVVz8WON9CpWX9QOy5r-XmYJL0Sys/edit?usp=sharing
And follow-up work in 2017 here:
https://drive.google.com/file/d/0B2gTBZrkrnpZVXpzR2JNRmR0WFE/view?usp=sharing
It might be time to revisit this.
I'm also currently trying another performance fix, kind of... but it
involves patching ceph's code, so I'll share it later if I succeed.
Would you consider sharing what your idea is? There are absolutely
areas where performance can be improved, but often they involve
tradeoffs in some respect.
Hello list,
while the performance of sequential 4k writes on bluestore is very high,
and even higher than on filestore, I was wondering what I can do to
optimize the random write pattern as well.
While using:
fio --rw=write --iodepth=32 --ioengine=libaio --bs=4k --numjobs=4
--filename=/tmp/test --size=10G --runtime=60 --group_reporting
--name=test --direct=1
I get 36000 IOPS on bluestore versus 11500 on filestore.
Using randwrite gives me 17000 on filestore and only 9500 on bluestore.
This is on all-flash / SSD hardware running luminous 12.2.10.