Re: optimize bluestore for random write i/o

On 3/5/19 4:23 PM, Vitaliy Filippov wrote:
Testing -rw=write without -sync=1 or -fsync=1 (or -fsync=32 for batched IO, or just fio -ioengine=rbd from outside a VM) is rather pointless - you're benchmarking the RBD cache, not Ceph itself. The RBD cache coalesces your writes into big sequential writes, and of course bluestore is faster in that case - it has no double write for big writes.
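(For example, a sync-write test along those lines might look roughly like the following; the client, pool, image, and device names are just placeholders:

# from outside a VM, straight against librbd, so no page cache or guest cache involved:
fio -ioengine=rbd -clientname=admin -pool=rbd -rbdname=testimg \
    -rw=randwrite -bs=4k -iodepth=1 -runtime=60 -name=synctest

# or inside a VM / against a raw block device, forcing a flush after every write:
fio -ioengine=libaio -direct=1 -fsync=1 -rw=randwrite -bs=4k -iodepth=1 \
    -filename=/dev/vdb -runtime=60 -name=synctest
)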

I'll probably try to test these settings - I'm also interested in random write iops in an all-flash bluestore cluster :) but I don't think any rocksdb options will help. I found bluestore pretty untunable in terms of performance :)


For random writes, you often end up bottlenecked in the kv sync thread as long as you aren't generally CPU bound.  Anything you can do to reduce the work being done in the kv sync thread usually helps.  A big one is making sure you are hitting onodes in the bluestore cache rather than in the rocksdb cache or on disk, i.e. having enough onode cache available for the dataset being benchmarked.
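On Luminous the bluestore cache split can be nudged toward onodes with something roughly like the following in ceph.conf.  The size and ratios here are illustrative only - check your release's defaults and how much RAM per OSD you can actually spare before copying them:

[osd]
# total bluestore cache per SSD-backed OSD, in bytes (8 GB here, purely as an example)
bluestore_cache_size_ssd = 8589934592
# bias the split toward the onode/metadata cache instead of the rocksdb block cache
bluestore_cache_meta_ratio = 0.8
bluestore_cache_kv_ratio = 0.2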



The best thing to do for me was to disable CPU powersaving (set the governor to performance + cpupower idle-set -D 1). Your CPUs become frying pans, but write IOPS increase 2-3 times - especially single-thread write IOPS, which are both the worst-case scenario and the thing applications usually need. Test it with fio -ioengine=rbd -bs=4k -iodepth=1.
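(Concretely that is roughly the following; how you make it persist across reboots varies by distro, and the pool/image names in the fio line are placeholders:

cpupower frequency-set -g performance   # set the performance governor on all cores
cpupower idle-set -D 1                  # disable idle states with exit latency above 1 us

fio -ioengine=rbd -pool=rbd -rbdname=testimg -rw=randwrite -bs=4k -iodepth=1 -runtime=60 -name=qd1
)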


Yep, this is a big one.  I've asked for clarification from vendors if we can actually recommend doing this but haven't gotten a clear answer yet. :/



Another thing that I've done on my cluster was to set `bluestore_min_alloc_size_ssd` to 4096. The reason is that it defaults to 16 KB, which means all writes below 16 KB take the same deferred write path as on HDDs. On SSDs, deferred writes only increase the write amplification factor and lower performance. You have to recreate your OSDs after changing this variable - it is only applied at OSD creation time.
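(The change itself is a one-liner in ceph.conf, but again it only takes effect for OSDs created after it is set - existing OSDs have to be destroyed and redeployed to pick it up:

[osd]
# only read at mkfs time, i.e. when the OSD is first created
bluestore_min_alloc_size_ssd = 4096
)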


Decreasing the min_alloc size isn't always a win, but it can be in some cases.  Originally bluestore_min_alloc_size_ssd was set to 4096, but we increased it to 16384 because at the time our metadata path was slow and increasing it resulted in a pretty significant performance win (along with increasing the WAL buffers in rocksdb to reduce write amplification).  Since then we've improved the metadata path to the point where, at least on our test nodes, performance was pretty close between min_alloc_size = 16k and min_alloc_size = 4k the last time I looked.  It might be a good idea to drop it back down to 4k now, but I think we need to be careful because there are tradeoffs.
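The rocksdb write-buffer side of that tuning lives in bluestore_rocksdb_options.  A rough sketch of what bumping the buffers can look like is below; the values are illustrative and this option replaces the whole default tuning string for your release, so merge it with the existing string rather than copying this verbatim:

[osd]
# illustrative only: bigger memtables before flush, at the cost of extra RAM per OSD
bluestore_rocksdb_options = compression=kNoCompression,write_buffer_size=268435456,max_write_buffer_number=4,min_write_buffer_number_to_merge=1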


You can see some of the original work we did in 2016 looking at this on our performance test cluster here:


https://docs.google.com/spreadsheets/d/1YPiiDu0IxQdB4DcVVz8WON9CpWX9QOy5r-XmYJL0Sys/edit?usp=sharing


And follow-up work in 2017 here:


https://drive.google.com/file/d/0B2gTBZrkrnpZVXpzR2JNRmR0WFE/view?usp=sharing


It might be time to revisit again.



I'm also currently trying another performance fix, kind of... but it involves patching ceph's code, so I'll share it later if I succeed.


Would you consider sharing what your idea is?  There are absolutely areas where performance can be improved, but often they involve tradeoffs in some respect.



Hello list,

While the performance of 4k sequential writes on bluestore is very high,
and even higher than on filestore, I was wondering what I can do to
optimize random write patterns as well.

While using:
fio --rw=write --iodepth=32 --ioengine=libaio --bs=4k --numjobs=4
--filename=/tmp/test --size=10G --runtime=60 --group_reporting
--name=test --direct=1

I get 36000 IOPS on bluestore versus 11500 on filestore.

Using --rw=randwrite gives me 17000 IOPS on filestore and only 9500 on bluestore.

This is on an all-flash / SSD cluster running Luminous 12.2.10.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
