Re: Bluestore caching, flawed by design?

Mark Nelson <mnelson@xxxxxxxxxx> · Mon, 2 Apr 2018 10:48:03 -0500

On 04/01/2018 07:59 PM, Christian Balzer wrote:

Hello,

firstly, Jack pretty much correctly correlated my issues to Mark's points,
more below.

On Sat, 31 Mar 2018 08:24:45 -0500 Mark Nelson wrote:

On 03/29/2018 08:59 PM, Christian Balzer wrote:

Hello,

my crappy test cluster was rendered inoperable by an IP renumbering
that wasn't planned and forced on me during a DC move, so I decided to
start from scratch and explore the fascinating world of Luminous/bluestore
and all the assorted bugs. ^_-
(yes I could have recovered the cluster by setting up a local VLAN with
the old IPs, extract the monmap, etc, but I consider the need for a
running monitor a flaw, since all the relevant data was present in the
leveldb).

Anyways, while I've read about bluestore OSD cache in passing here, the
back of my brain was clearly still hoping that it would use pagecache/SLAB
like other filesystems.
Which after my first round of playing with things clearly isn't the case.

This strikes me as a design flaw and regression because:
Bluestore's cache is not broken by design.

During further tests I verified something that caught my attention out of
the corner of my eye when glancing at atop output of the OSDs during my fio
runs.

Consider this fio run, after having done the same with write to populate
the file and caches (1GB per OSD default on the test cluster, 20 OSDs
total on 5 nodes):
---
$ fio --size=8G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 \
    --rw=randread --name=fiojob --blocksize=4M --iodepth=32
---

This is being run against a kernel mounted RBD image.
On the Luminous test cluster it will read the data from the disks,
completely ignoring the pagecache on the host (as expected and desired)
AND the bluestore cache.

On a Jewel based test cluster with filestore the reads will be served from
the pagecaches of the OSD nodes, not only massively improving speed but
more importantly spindle contention.

Filestore absolutely will be able to do better than bluestore in the 
case where a single OSD benefits by utilizing all of the memory in a 
node even at the expense of other OSDs.  One situation where this could 
be the case is RGW bucket indexes, but even there the better solution 
imho is to shard the buckets.  I'd argue though that you need to be 
careful about how you approach this.  Let's say you have a single node 
with multiple OSDs and one of those OSDs has a big set of temporarily 
hot read data.  If you let that OSD use up most of the memory on the
node to cache the data set, all of the other OSDs have to give up 
something:  Namely cached onodes.  That means that once your hot data is 
no longer hot, all of those other OSDs will need to perform future onode 
reads from disk.  Whether or not it's beneficial to cache the hot data 
set depends on how long it's going to stay hot and how likely those 
other OSDs are going to have a read/write operation at some point in the 
future.  I'd argue that if you assume a generally mixed workload that 
generally spans multiple OSDs, you are much better off ignoring the hot 
data and simply keeping the onodes cached.
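
To make that trade-off concrete, here is a toy model. It is only a sketch: it uses a plain LRU (not bluestore's actual cache policy), and the OSD count, cache capacities and key names are all made-up numbers for illustration. It compares one shared cache against fixed per-OSD caches when OSD 0 streams a burst of temporarily hot data, and counts how many onodes of the other OSDs are still cached afterwards.

```python
from collections import OrderedDict


class LRUCache:
    """Minimal LRU cache; stands in for either a shared or a per-OSD cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def access(self, key):
        """Touch a key. Returns True on a hit, False on a miss (key is inserted)."""
        if key in self.items:
            self.items.move_to_end(key)
            return True
        self.items[key] = True
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry
        return False


def onodes_surviving(shared, n_osds=4, per_osd_capacity=100,
                     onodes_per_osd=50, hot_blocks=1000):
    """Count onodes of OSDs 1..n-1 still cached after OSD 0 reads hot data.

    All sizes are hypothetical and chosen only to make the effect visible.
    """
    if shared:
        pool = LRUCache(n_osds * per_osd_capacity)
        caches = [pool] * n_osds          # every OSD uses the same cache
    else:
        caches = [LRUCache(per_osd_capacity) for _ in range(n_osds)]

    # Warm every OSD's onode working set.
    for osd in range(n_osds):
        for i in range(onodes_per_osd):
            caches[osd].access(("onode", osd, i))

    # OSD 0 serves a burst of hot object data.
    for block in range(hot_blocks):
        caches[0].access(("data", 0, block))

    return sum(
        ("onode", osd, i) in caches[osd].items
        for osd in range(1, n_osds)
        for i in range(onodes_per_osd)
    )


if __name__ == "__main__":
    print("shared cache, onodes left:", onodes_surviving(shared=True))
    print("per-OSD caches, onodes left:", onodes_surviving(shared=False))
```

In this model the shared cache ends up holding only OSD 0's data, so the other OSDs have to go back to disk for every onode, while the per-OSD caches keep their onodes at the cost of OSD 0 missing on most of its hot data. Which outcome is preferable depends on how long the data stays hot, as described above.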

I suspect that the more common case where bluestore looks bad is when 
someone is benchmarking reads on a single filestore OSD vs a single 
bluestore OSD and doesn't bother giving bluestore a large portion of the 
memory on the node.  Filestore can look faster than bluestore in that 
case, especially if the data set is relatively small and can fit 
entirely in memory.  In the case where you've configured bluestore to 
use most of your available memory, bluestore should be pretty close.  
For some configurations/workloads potentially faster.
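
As a rough sketch of what "configuring bluestore to use most of your available memory" could look like in practice, the helper below derives a per-OSD value for the `bluestore_cache_size` option from a node's RAM and OSD count. The function name, the 50% reserve and the 1 GiB per-OSD overhead are assumptions for illustration, not recommendations; the OSD process uses memory beyond its cache, so some headroom has to be left.

```python
GIB = 1024 ** 3


def per_osd_cache_bytes(total_ram_bytes, num_osds,
                        reserve_percent=50, per_osd_overhead_bytes=1 * GIB):
    """Hypothetical sizing helper, not part of Ceph.

    reserve_percent: share of RAM kept back for the OS, recovery spikes, etc.
    per_osd_overhead_bytes: assumed non-cache memory used by each OSD process.
    """
    if num_osds <= 0:
        raise ValueError("num_osds must be positive")
    if not 0 <= reserve_percent < 100:
        raise ValueError("reserve_percent must be in [0, 100)")

    usable = total_ram_bytes * (100 - reserve_percent) // 100
    per_osd = usable // num_osds - per_osd_overhead_bytes
    return max(per_osd, 0)  # never suggest a negative cache size


if __name__ == "__main__":
    # Assumed example node: 64 GiB RAM, 4 OSDs.
    size = per_osd_cache_bytes(64 * GIB, 4)
    print("[osd]")
    print("bluestore_cache_size = %d" % size)
```

With the assumed 64 GiB node and 4 OSDs this works out to 7 GiB of cache per OSD. The point is only that the value has to be derived per node; on non-uniform hardware a single cluster-wide number will not fit every host.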

My guess is that bluestore treats "direct" differently than the kernel
accessing a filestore based OSD and I'm not sure what the "correct"
behavior here is.
But somebody migrating to bluestore with such a use case and plenty of RAM
on their OSD nodes is likely to notice this and is not going to be happy
about it.

Like I said earlier, it's all about trade-offs.  The pagecache gives you 
a lot of flexibility and on slower devices the price you pay isn't 
terribly high.  On faster devices it's a bigger issue.
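
To get a rough feel for that price, the sketch below times reads that are served from the pagecache through a syscall (`os.pread`) against copies out of a buffer already held in the process. It is not Ceph code, the block and file sizes are arbitrary, and Python's interpreter overhead blurs the result, so treat the numbers as indicative at best.

```python
import os
import tempfile
import time

BLOCK = 4096          # bytes per read
BLOCKS = 2048         # 8 MiB test file, small enough to stay in pagecache


def make_file():
    """Create a temporary file filled with random data and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(BLOCK * BLOCKS))
    return path


def read_via_syscall(path, rounds):
    """Read every block with pread(); each read is one syscall."""
    fd = os.open(path, os.O_RDONLY)
    try:
        total = 0
        for _ in range(rounds):
            for i in range(BLOCKS):
                total += len(os.pread(fd, BLOCK, i * BLOCK))
        return total
    finally:
        os.close(fd)


def read_from_process_cache(buf, rounds):
    """Copy every block out of an in-process buffer; no syscalls involved."""
    view = memoryview(buf)
    total = 0
    for _ in range(rounds):
        for i in range(BLOCKS):
            total += len(bytes(view[i * BLOCK:(i + 1) * BLOCK]))
    return total


def compare(rounds=5):
    path = make_file()
    try:
        with open(path, "rb") as f:
            buf = f.read()  # also pulls the file into the pagecache
        t0 = time.perf_counter()
        via_syscall = read_via_syscall(path, rounds)
        t1 = time.perf_counter()
        in_process = read_from_process_cache(buf, rounds)
        t2 = time.perf_counter()
        return {
            "bytes_syscall": via_syscall,
            "bytes_in_process": in_process,
            "syscall_seconds": t1 - t0,
            "in_process_seconds": t2 - t1,
        }
    finally:
        os.unlink(path)


if __name__ == "__main__":
    for name, value in compare().items():
        print(name, value)
```

Both paths return the same number of bytes; only the time per read differs, and that difference is the per-call overhead being discussed.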

I'm not totally convinced that some of the trade-offs we've made with
bluestore's cache implementation are optimal, but I think you should
consider cooling your rhetoric down.

1. Completely new users may think that bluestore defaults are fine and
waste all that RAM in their machines.
What does "wasting" RAM mean in the context of a node running ceph? Are
you upset that other applications can't come in and evict bluestore
onode, OMAP, or object data from cache?

What Jack pointed out, unless you go around and start tuning things,
all available free RAM won't be used for caching.

This raises another point: since this is per-process data, and going by
some bluestore threads I skimmed here, if you raise the cache to use
most RAM during normal ops you're likely to be visited by the evil OOM
witch during heavy recovery ops.

Whereas the good ole pagecache would just get evicted in that scenario.

2. Having a per OSD cache is inefficient compared to a common cache like
pagecache, since an OSD that is busier than others would benefit from a
shared cache more.
It's only "inefficient" if you assume that using the pagecache, and more
generally, kernel syscalls, is free.  Yes the pagecache is convenient
and yes it gives you a lot of flexibility, but you pay for that
flexibility if you are trying to do anything fast.

For instance, take the new KPTI patches in the kernel for meltdown. Look
at how badly it can hurt MyISAM database performance in MariaDB:

I, like many others here, have decided that all the Meltdown and Spectre
patches are a bit pointless on pure OSD nodes, because if somebody on the
node is running random code you're already in deep doodoo.

That being said, I will totally concur that syscalls aren't free.
However, given the latencies induced by the rather long/complex code paths
IOs have to traverse within Ceph, how much of a gain would you say
eliminating these particular calls did achieve?

https://mariadb.org/myisam-table-scan-performance-kpti/

MyISAM does not have a dedicated row cache and instead caches row data
in the page cache as you suggest Bluestore should do for its data.
Look at how badly KPTI hurts performance (~40%). Now look at ARIA with a
dedicated 128MB cache (less than 1%).  KPTI is a really good example of
how much this stuff can hurt you, but syscalls, context switches, and
page faults were already expensive even before meltdown.  Not to mention
that right now bluestore keeps onodes and buffers stored in its cache
in an unencoded form.

That last bit is quite relevant of course.

Here's a couple of other articles worth looking at:

https://eng.uber.com/mysql-migration/
https://www.scylladb.com/2018/01/07/cost-of-avoiding-a-meltdown/
http://www.brendangregg.com/blog/2018-02-09/kpti-kaiser-meltdown-performance.html

3. A uniform OSD cache size of course will be a nightmare when having
non-uniform HW, either with RAM or number of OSDs.
Non-Uniform hardware is a big reason that pinning dedicated memory to
specific cores/sockets is really nice vs relying on potentially remote
memory page cache reads.  A long time ago I was responsible for
validating the performance of CXFS on an SGI Altix UV distributed
shared-memory supercomputer.  As it turns out, we could achieve about
22GB/s writes with XFS (a huge number at the time), but CXFS was 5-10x
slower.  A big part of that turned out to be the kernel distributing
page cache across the Numalink5 interconnects to remote memory.  The
problem can potentially happen on any NUMA system to varying degrees.

I could regale you with even more ancient stories when I was working with
DEC VMSclusters. ^o^ But that's not here and now.

As for pinning, I'm doing this mostly on compute nodes, to basically keep
a good number of cores free on the NUMA node(s) that handle HW interrupts,
the kernel tends to do a decent enough job from there on.

And while you definitely have a point, modern designs like Epyc pretty much
negate NUMA issues.
Never mind that performance/latency conscious Ceph users already do stick
to single NUMA CPUs for SSD/NVMe servers if possible.

Personally I have two primary issues with bluestore's memory
configuration right now:

1) It's too complicated for users to figure out where to assign memory
and in what ratios.  I'm attempting to improve this by making
bluestore's cache autotuning so the user just gives it a number and
bluestore will try to work out where it should assign memory.

This would be very helpful (as in ratio of # of OSDs/total RAM).
Otherwise you wind up with non-uniformity issues again.

And especially _if_ it can also drop caches in low memory situations
voluntarily.

2) In the case where a subset of OSDs are really hot (maybe RGW bucket
accesses) you might want some OSDs to get more memory than others.  I
think we can tackle this better if we migrate to a one-osd-per-node
sharded architecture (likely based on seastar), though we'll still need
to be very aware of remote memory.  Given that this is fairly difficult
to do well, we're probably going to be better off just dedicating a
static pool to each shard initially.

I'm wondering if and how such a sharding can be realized while still
keeping the OSD (storage device really) the smallest failure domain and
not just the host.
Because I'm betting you that some people have specialty use cases
depending on that (not me for a change).

Christian

Mark
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
