On 3/5/19 4:23 PM, Vitaliy Filippov wrote:
Testing -rw=write without -sync=1 or -fsync=1 (or -fsync=32 for batch
IO, or just fio -ioengine=rbd from outside a VM) is rather pointless -
you're benchmarking the RBD cache, not Ceph itself. The RBD cache
coalesces your writes into big sequential writes, and of course
bluestore is faster in that case - it has no double write for big
writes.
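If you want numbers that reflect the actual Ceph write path, run the
test with syncs enabled against the rbd engine directly; something
along these lines (the pool and image names here are just placeholders
for whatever you have):

fio -ioengine=rbd -pool=rbd -rbdname=testimg -direct=1 -fsync=1 \
    -rw=randwrite -bs=4k -iodepth=32 -runtime=60 -name=test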
I'll probably try to test these settings - I'm also interested in
random write iops in an all-flash bluestore cluster :) but I don't
think any rocksdb options will help. I found bluestore pretty
untunable in terms of performance :)
For random writes, you often end up bottlenecked in the kv sync thread
so long as you aren't generally CPU bound. Anything you can do to
reduce the work being done in the kv sync thread usually helps. A big
one is making sure you are hitting onodes in the bluestore cache rather
than the rocksdb cache or disk, i.e. having enough onode cache available
for the dataset being benchmarked.
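As a rough sketch (the sizes depend entirely on your RAM per OSD and the
working set, so treat the values as placeholders), that means giving
bluestore a generous cache in ceph.conf:

[osd]
# illustrative values only - size so the hot onodes fit in cache
bluestore_cache_size_ssd = 8589934592
# fraction of that cache reserved for metadata (onodes) vs rocksdb
bluestore_cache_meta_ratio = 0.5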
The best thing to do for me was to disable CPU powersaving (set the
governor to performance + cpupower idle-set -D 1). Your CPUs become
frying pans, but write IOPS increase 2-3 times, especially
single-thread write IOPS, which are the worst-case scenario and at the
same time the thing applications usually need. Test it with fio
-ioengine=rbd -bs=4k -iodepth=1.
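Concretely, something like this (needs the cpupower tool, and it does
not persist across reboots unless you put it in a startup script):

cpupower frequency-set -g performance   # performance governor on all CPUs
cpupower idle-set -D 1                  # disable the deeper C-states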
Yep, this is a big one. I've asked vendors for clarification on whether
we can actually recommend doing this, but haven't gotten a clear answer yet. :/
Another thing I've done on my cluster was to set
`bluestore_min_alloc_size_ssd` to 4096. The reason is that it defaults
to 16 KB, which means all writes below 16 KB use the same deferred
write path as on HDDs. Deferred writes only increase the write
amplification factor on SSDs and lower performance. You have to
recreate OSDs after changing this variable - it's only applied at the
time of OSD creation.
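In config terms it's just one line, but again, it only takes effect
when an OSD is created, so existing OSDs have to be destroyed and
redeployed to pick it up:

[osd]
bluestore_min_alloc_size_ssd = 4096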
Decreasing the min_alloc size isn't always a win, but it can be in some
cases. Originally bluestore_min_alloc_size_ssd was set to 4096, but we
increased it to 16384 because at the time our metadata path was slow and
increasing it resulted in a pretty significant performance win (along
with increasing the WAL buffers in rocksdb to reduce write
amplification). Since then we've improved the metadata path to the
point where, at least on our test nodes, performance was pretty close
between min_alloc_size = 16k and min_alloc_size = 4k the last time
I looked. It might be a good idea to drop it back down to 4k now, but I
think we need to be careful because there are tradeoffs.
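For reference, that WAL buffer tuning goes through
bluestore_rocksdb_options. The values below are only an illustration of
which knobs are involved, not a recommendation, and note that setting
this option replaces the whole default string rather than merging with
it:

[osd]
bluestore_rocksdb_options = compression=kNoCompression,write_buffer_size=268435456,max_write_buffer_number=8,min_write_buffer_number_to_merge=2,recycle_log_file_num=4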
You can see some of the original work we did in 2016 looking at this on
our performance test cluster here:
https://docs.google.com/spreadsheets/d/1YPiiDu0IxQdB4DcVVz8WON9CpWX9QOy5r-XmYJL0Sys/edit?usp=sharing
And follow-up work in 2017 here:
https://drive.google.com/file/d/0B2gTBZrkrnpZVXpzR2JNRmR0WFE/view?usp=sharing
It might be time to revisit this.
I'm also currently trying another performance fix, kind of... but it
involves patching ceph's code, so I'll share it later if I succeed.
Would you consider sharing what your idea is? There are absolutely
areas where performance can be improved, but often they involve
tradeoffs in some respect.
Hello list,
while the performance of sequential 4k writes on bluestore is very high,
and even higher than on filestore, I was wondering what I can do to
optimize the random write pattern as well.
While using:
fio --rw=write --iodepth=32 --ioengine=libaio --bs=4k --numjobs=4
--filename=/tmp/test --size=10G --runtime=60 --group_reporting
--name=test --direct=1
I get 36000 IOPS on bluestore versus 11500 on filestore.
Using randwrite gives me 17000 on filestore and only 9500 on bluestore.
This is on all-flash / SSD hardware running luminous 12.2.10.