Re: Slow rbd reads (fast writes) with luminous + bluestore

On 11/28/18 8:36 AM, Florian Haas wrote:
On 14/08/2018 15:57, Emmanuel Lacour wrote:
On 13/08/2018 at 16:58, Jason Dillaman wrote:
See [1] for ways to tweak the bluestore cache sizes. I believe that by
default, bluestore will not cache any data but instead will only
attempt to cache its key/value store and metadata.
I assume so too, since the default ratio caches as much k/v data as
possible (up to 512M), and the HDD cache is 1G by default.

I tried increasing the HDD cache to 4G and it does seem to be used; the
4 OSD processes now use 20GB.

In general, however, I would think that attempting to have bluestore
cache data is just an attempt to optimize to the test instead of
actual workloads. Personally, I think it would be more worthwhile to
just run 'fio --ioengine=rbd' directly against a pre-initialized image
after you have dropped the cache on the OSD nodes.
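For reference, a minimal fio invocation of that sort might look roughly
like the following (pool and image names are placeholders, not taken
from this thread; adjust runtime and queue depth to taste):

fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimage \
    --name=rbd-randread --rw=randread --bs=4k --iodepth=32 \
    --runtime=60 --time_based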
So with bluestore, I assume we need to rely more on the client page
cache (at least when using a VM), whereas with the old filestore both
the OSD and the client caches were used.
As for benchmarking, I ran a real benchmark here with the expected
application workload for this new cluster, and the results are OK for us :)


Thanks for your help Jason.
Shifting over a discussion from IRC and taking the liberty to resurrect
an old thread, as I just ran into the same (?) issue. I see
*significantly* reduced performance on RBD reads, compared to writes
with the same parameters. "rbd bench --io-type read" gives me 8K IOPS
(with the default 4K I/O size), whereas "rbd bench --io-type write"
produces more than twice that.
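For context, the invocations in question are along these lines (the
exact flags aren't quoted here; the image spec and the random pattern
are placeholders):

rbd bench --io-type read --io-size 4096 --io-pattern rand rbd/testimage
rbd bench --io-type write --io-size 4096 --io-pattern rand rbd/testimage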

I should probably add that while my end result of doing an "rbd bench
--io-type read" is about half of what I get from a write benchmark, the
intermediate ops/sec output fluctuates from > 30K IOPS (about twice the
write IOPS) to about 3K IOPS (about 1/6 of what I get for writes). So
really, my read IOPS are all over the map (and terrible on average),
whereas my write IOPS are not stellar, but consistent.

This is an all-bluestore cluster on spinning disks with Luminous, and
I've tried the following things:

- run rbd bench with --rbd_readahead_disable_after_bytes=0 and
--rbd_readahead_max_bytes=4194304 (per
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008271.html)

- configure OSDs with a larger bluestore_cache_size_hdd (4G; default is 1G)

- configure OSDs with bluestore_cache_kv_ratio = .49, so that rather
than using 1%/99%/0% for metadata/KV data/objects, the OSDs use
1%/49%/50% (all three attempts are sketched as config fragments below)
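
For reference, here is roughly what those three attempts look like as
config fragments (values are the ones quoted above; the readahead
options were passed on the rbd bench command line, but could equally
live under [client]):

[client]
rbd readahead disable after bytes = 0
rbd readahead max bytes = 4194304

[osd]
# 4G instead of the 1G default (value in bytes)
bluestore cache size hdd = 4294967296
# shift cache space from KV data towards object data: 1%/49%/50%
bluestore cache kv ratio = 0.49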

None of the above produced any tangible improvement. Benchmark results
are at http://paste.openstack.org/show/736314/ if anyone wants to take a
look.

I'd be curious to see if anyone has a suggestion on what else to try.
Thanks in advance!


Hi Florian,


By default bluestore will cache buffers on reads but not on writes (unless there are hints):


Option("bluestore_default_buffered_read", Option::TYPE_BOOL, Option::LEVEL_ADVANCED)
    .set_default(true)
    .set_flag(Option::FLAG_RUNTIME)
    .set_description("Cache read results by default (unless hinted NOCACHE or WONTNEED)"),

    Option("bluestore_default_buffered_write", Option::TYPE_BOOL, Option::LEVEL_ADVANCED)
    .set_default(false)
    .set_flag(Option::FLAG_RUNTIME)
    .set_description("Cache writes by default (unless hinted NOCACHE or WONTNEED)"),


This is one area where bluestore is a lot more confusing for users than filestore was.  There was a lot of concern about enabling the buffer cache on writes by default because there's some associated overhead (potentially both during writes and in the mempool thread when trimming the cache).  It might be worth enabling bluestore_default_buffered_write and seeing if it helps reads.  You'll probably also want to pay attention to writes, though.

I think we might want to consider enabling it by default, but we should go through and do a lot of careful testing first.  FWIW, I did have it enabled when testing the new memory target code (and the not-yet-merged age-binned autotuning).  It was doing OK in my tests, but I didn't do an apples-to-apples comparison with it off.
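
In case it's useful, here's a sketch of how one might flip that option
for testing (example commands only, not something from Florian's setup;
the runtime injection works because the option carries FLAG_RUNTIME):

# runtime change on all OSDs, not persistent across restarts
ceph tell osd.* injectargs '--bluestore_default_buffered_write=true'

# or persist it in ceph.conf on the OSD hosts and restart the OSDs
[osd]
bluestore default buffered write = true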


Mark



Cheers,
Florian
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



