Re: Bluestore caching, flawed by design?

Mark Nelson <mnelson@xxxxxxxxxx> · Sat, 31 Mar 2018 08:24:45 -0500

On 03/29/2018 08:59 PM, Christian Balzer wrote:

> Hello,
>
> my crappy test cluster was rendered inoperational by an IP renumbering
> that wasn't planned and forced on me during a DC move, so I decided to
> start from scratch and explore the fascinating world of Luminous/bluestore
> and all the assorted bugs. ^_-
> (yes I could have recovered the cluster by setting up a local VLAN with
> the old IPs, extract the monmap, etc, but I consider the need for a
> running monitor a flaw, since all the relevant data was present in the
> leveldb).
>
> Anyways, while I've read about bluestore OSD cache in passing here, the
> back of my brain was clearly still hoping that it would use pagecache/SLAB
> like other filesystems.
> Which after my first round of playing with things clearly isn't the case.
>
> This strikes me as a design flaw and regression because:

Bluestore's cache is not broken by design.

I'm not totally convinced that some of the trade-offs we've made with 
bluestore's cache implementation are optimal, but I think you should 
consider cooling your rhetoric down.

> 1. Completely new users may think that bluestore defaults are fine and
> waste all that RAM in their machines.

What does "wasting" RAM mean in the context of a node running ceph? Are 
you upset that other applications can't come in and evict bluestore 
onode, OMAP, or object data from cache?

> 2. Having a per OSD cache is inefficient compared to a common cache like
> pagecache, since an OSD that is busier than others would benefit from a
> shared cache more.

It's only "inefficient" if you assume that using the pagecache, and more
generally, kernel syscalls, is free.  Yes, the pagecache is convenient,
and yes, it gives you a lot of flexibility, but you pay for that
flexibility if you are trying to do anything fast.

For instance, take the new KPTI patches in the kernel for Meltdown. Look
at how badly they can hurt MyISAM database performance in MariaDB:

https://mariadb.org/myisam-table-scan-performance-kpti/

MyISAM does not have a dedicated row cache and instead caches row data
in the page cache, as you suggest Bluestore should do for its data.
Look at how badly KPTI hurts performance (~40%). Now look at Aria with a
dedicated 128MB cache (less than 1%).  KPTI is a really good example of
how much this stuff can hurt you, but syscalls, context switches, and
page faults were already expensive even before Meltdown.  Not to mention
that right now bluestore keeps onodes and buffers stored in its cache
in an unencoded form.
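
A rough way to see the per-read cost being described is to time the same
small reads once through a syscall (served from the page cache) and once
from an in-process cache. This is only a Python sketch, not Ceph code: the
file, block size, and function names are all made up for illustration, and
the absolute numbers depend heavily on kernel, mitigations, and hardware.

```python
import os
import tempfile
import time

BLOCK = 4096      # illustrative block size, not a bluestore value
N_BLOCKS = 256


def make_file():
    """Create a small temp file; once written it sits in the page cache."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for i in range(N_BLOCKS):
            f.write(bytes([i % 256]) * BLOCK)
    return path


def read_via_syscall(fd, i):
    """One pread() per block: a kernel entry even on a page cache hit."""
    return os.pread(fd, BLOCK, i * BLOCK)


def read_via_cache(cache, i):
    """In-process lookup: no syscall at all."""
    return cache[i]


def compare(rounds=50):
    """Return (seconds via syscalls, seconds via cache, data identical?)."""
    path = make_file()
    fd = os.open(path, os.O_RDONLY)
    try:
        cache = {i: read_via_syscall(fd, i) for i in range(N_BLOCKS)}

        t0 = time.perf_counter()
        for _ in range(rounds):
            for i in range(N_BLOCKS):
                read_via_syscall(fd, i)
        t_sys = time.perf_counter() - t0

        t0 = time.perf_counter()
        for _ in range(rounds):
            for i in range(N_BLOCKS):
                read_via_cache(cache, i)
        t_cache = time.perf_counter() - t0

        same = all(cache[i] == read_via_syscall(fd, i)
                   for i in range(N_BLOCKS))
    finally:
        os.close(fd)
        os.unlink(path)
    return t_sys, t_cache, same


if __name__ == "__main__":
    t_sys, t_cache, same = compare()
    print(f"syscall: {t_sys:.4f}s  in-process: {t_cache:.4f}s  same={same}")
```

Both paths return identical bytes; the only thing that differs is whether
each read has to cross into the kernel.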

Here are a couple of other articles worth looking at:

https://eng.uber.com/mysql-migration/
https://www.scylladb.com/2018/01/07/cost-of-avoiding-a-meltdown/
http://www.brendangregg.com/blog/2018-02-09/kpti-kaiser-meltdown-performance.html

> 3. A uniform OSD cache size of course will be a nightmare when having
> non-uniform HW, either with RAM or number of OSDs.

Non-uniform hardware is a big reason that pinning dedicated memory to
specific cores/sockets is really nice vs relying on page cache reads
from potentially remote memory.  A long time ago I was responsible for
validating the performance of CXFS on an SGI Altix UV distributed
shared-memory supercomputer.  As it turns out, we could achieve about
22GB/s writes with XFS (a huge number at the time), but CXFS was 5-10x
slower.  A big part of that turned out to be the kernel distributing
page cache across the NUMAlink 5 interconnects to remote memory.  The
problem can potentially happen on any NUMA system to varying degrees.

Personally I have two primary issues with bluestore's memory 
configuration right now:

1) It's too complicated for users to figure out where to assign memory
and in what ratios.  I'm attempting to improve this by making
bluestore's cache autotune, so the user just gives it a number and
bluestore will try to work out where it should assign memory.

2) In the case where a subset of OSDs is really hot (maybe RGW bucket
accesses) you might want some OSDs to get more memory than others.  I
think we can tackle this better if we migrate to a one-OSD-per-node
sharded architecture (likely based on seastar), though we'll still need
to be very aware of remote memory.  Given that this is fairly difficult
to do well, we're probably going to be better off just dedicating a
static pool to each shard initially.
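
To make the two issues concrete, here is a small Python sketch of the
arithmetic involved. It is not Ceph code and not a description of how any
autotuner works: every function name, the load numbers, and the floor value
are hypothetical. The ratio split is only loosely modeled on the kv/meta
ratio options in Luminous (bluestore_cache_kv_ratio and
bluestore_cache_meta_ratio), with data getting whatever is left.

```python
GIB = 1 << 30


def static_pools(node_cache_bytes, osd_ids):
    """Equal, fixed share per OSD (the 'static pool per shard' approach)."""
    share = node_cache_bytes // len(osd_ids)
    return {osd: share for osd in osd_ids}


def weighted_pools(node_cache_bytes, load_by_osd, floor_bytes):
    """Give every OSD a floor, then split the remainder by integer load.

    load_by_osd is a hypothetical per-OSD hotness score (e.g. recent ops).
    """
    n = len(load_by_osd)
    spare = node_cache_bytes - floor_bytes * n
    if spare < 0:
        raise ValueError("floor does not fit in the node budget")
    total_load = sum(load_by_osd.values())
    pools = {}
    for osd, load in load_by_osd.items():
        # With no load information at all, fall back to an even split.
        extra = spare * load // total_load if total_load else spare // n
        pools[osd] = floor_bytes + extra
    return pools


def split_by_ratio(osd_cache_bytes, kv_ratio, meta_ratio):
    """Split one OSD's budget into kv / meta / data; data takes the rest."""
    if kv_ratio < 0 or meta_ratio < 0 or kv_ratio + meta_ratio > 1:
        raise ValueError("ratios must be non-negative and sum to at most 1")
    kv = int(osd_cache_bytes * kv_ratio)
    meta = int(osd_cache_bytes * meta_ratio)
    return {"kv": kv, "meta": meta, "data": osd_cache_bytes - kv - meta}


if __name__ == "__main__":
    # 8 GiB of cache on a node with four OSDs, one of them much hotter.
    print(static_pools(8 * GIB, [0, 1, 2, 3]))
    hot = weighted_pools(8 * GIB, {0: 6, 1: 1, 2: 1, 3: 0}, floor_bytes=GIB)
    print(hot)
    print(split_by_ratio(hot[0], kv_ratio=0.25, meta_ratio=0.5))
```

The static version is trivial to reason about and never moves memory
between shards. The weighted version shows what "hot OSDs get more" would
mean numerically, but it says nothing about where that memory physically
lives, which is the hard part on a NUMA machine.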

Mark