Re: Bluestore caching, flawed by design?

John Hearns <hearnsj@xxxxxxxxxxxxxx> · Mon, 2 Apr 2018 08:27:57 +0200

> A long time ago I was responsible for validating the performance of CXFS on an SGI Altix UV distributed shared-memory supercomputer.  As it turns out, we could achieve about 22GB/s writes with XFS (a huge number at the time), but CXFS was 5-10x slower.  A big part of that turned out to be the kernel distributing page cache across the Numalink5 interconnects to remote memory.
> The problem can potentially happen on any NUMA system to varying degrees.

That's very interesting. I used to manage Itanium Altixes and then a UV system. That work sounds very interesting.
I set up cpusets on the UV system, which gave a big performance increase since user jobs had CPUs and memory close to each other.
I also had a boot cpuset on the first blade, which had the fibrechannel HBA, so I guess that had a similar effect in that the CXFS processes were local to the IO card.
UV was running SuSE - sorry.

On the subject of memory allocation, GPFS uses an amount of pagepool memory. The advice given always seems to be to make this large.
There is one fixed pagepool on a server, even if it has multiple NSDs.
How does this compare to Ceph memory allocation?

On 31 March 2018 at 15:24, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
On 03/29/2018 08:59 PM, Christian Balzer wrote:

Hello,

my crappy test cluster was rendered inoperational by an IP renumbering
that wasn't planned and forced on me during a DC move, so I decided to
start from scratch and explore the fascinating world of Luminous/bluestore
and all the assorted bugs. ^_-
(yes I could have recovered the cluster by setting up a local VLAN with
the old IPs, extract the monmap, etc, but I consider the need for a
running monitor a flaw, since all the relevant data was present in the
leveldb).

Anyways, while I've read about bluestore OSD cache in passing here, the
back of my brain was clearly still hoping that it would use pagecache/SLAB
like other filesystems.
Which after my first round of playing with things clearly isn't the case.

This strikes me as a design flaw and regression because:

Bluestore's cache is not broken by design.

I'm not totally convinced that some of the trade-offs we've made with bluestore's cache implementation are optimal, but I think you should consider cooling your rhetoric down.

1. Completely new users may think that bluestore defaults are fine and
waste all that RAM in their machines.

What does "wasting" RAM mean in the context of a node running ceph? Are you upset that other applications can't come in and evict bluestore onode, OMAP, or object data from cache?

2. Having a per OSD cache is inefficient compared to a common cache like
pagecache, since an OSD that is busier than others would benefit from a
shared cache more.

It's only "inefficient" if you assume that using the pagecache, and more generally, kernel syscalls, is free.  Yes the pagecache is convenient and yes it gives you a lot of flexibility, but you pay for that flexibility if you are trying to do anything fast.

For instance, take the new KPTI patches in the kernel for Meltdown. Look at how badly they can hurt MyISAM database performance in MariaDB:

https://mariadb.org/myisam-table-scan-performance-kpti/

MyISAM does not have a dedicated row cache and instead caches row data in the page cache as you suggest Bluestore should do for its data.  Look at how badly KPTI hurts performance (~40%). Now look at ARIA with a dedicated 128MB cache (less than 1%).  KPTI is a really good example of how much this stuff can hurt you, but syscalls, context switches, and page faults were already expensive even before Meltdown.  Not to mention that right now bluestore keeps onodes and buffers stored in its cache in an unencoded form.

Here's a couple of other articles worth looking at:

https://eng.uber.com/mysql-migration/

https://www.scylladb.com/2018/01/07/cost-of-avoiding-a-meltdown/

http://www.brendangregg.com/blog/2018-02-09/kpti-kaiser-meltdown-performance.html
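As a rough illustration of the syscall-cost argument above, here is a minimal sketch that times reading the same 4 KiB block through os.pread (one syscall per read, served from the kernel page cache) against fetching it from an in-process dict. The function name time_reads, the block size, and the iteration count are assumptions made for this sketch and are not taken from bluestore. Interpreter overhead affects both loops, so treat the numbers only as an indication of the direction of the effect, not as a measurement of Ceph.

```python
import os
import tempfile
import time

BLOCK = 4096          # illustrative block size, not a bluestore setting
ITERATIONS = 20000    # illustrative loop count


def time_reads(iterations=ITERATIONS):
    """Return (seconds spent in pread loop, seconds spent in dict-lookup loop)."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(os.urandom(BLOCK))
        tmp.flush()
        fd = os.open(tmp.name, os.O_RDONLY)
        try:
            # Path 1: every read crosses into the kernel, even though the
            # data is almost certainly already in the page cache.
            start = time.perf_counter()
            for _ in range(iterations):
                data = os.pread(fd, BLOCK, 0)
            via_syscall = time.perf_counter() - start

            # Path 2: the block is held in a userspace cache, so a read
            # is a plain dictionary lookup with no syscall.
            cache = {0: data}
            start = time.perf_counter()
            for _ in range(iterations):
                data = cache[0]
            via_cache = time.perf_counter() - start
        finally:
            os.close(fd)
    return via_syscall, via_cache


if __name__ == "__main__":
    sys_t, cache_t = time_reads()
    print(f"pread via page cache: {sys_t:.4f}s, in-process cache: {cache_t:.4f}s")
```

The absolute times depend heavily on the kernel, the mitigations enabled, and the hardware, which is the point being made above about KPTI.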

3. A uniform OSD cache size of course will be a nightmare when having
non-uniform HW, either with RAM or number of OSDs.

Non-uniform hardware is a big reason that pinning dedicated memory to specific cores/sockets is really nice vs relying on page cache reads from potentially remote memory.  A long time ago I was responsible for validating the performance of CXFS on an SGI Altix UV distributed shared-memory supercomputer.  As it turns out, we could achieve about 22GB/s writes with XFS (a huge number at the time), but CXFS was 5-10x slower.  A big part of that turned out to be the kernel distributing page cache across the Numalink5 interconnects to remote memory.  The problem can potentially happen on any NUMA system to varying degrees.
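For the non-uniform hardware concern in point 3, one way to avoid a single cluster-wide cache size is to derive a per-OSD value for each node from its RAM and OSD count. The following is only a sketch of that arithmetic: the helper per_osd_cache_bytes, the 4 GiB reserve, the 50% cache fraction, and the two example hosts are all hypothetical, and feeding the result into a per-host bluestore_cache_size setting is an assumption about how one might apply it.

```python
GIB = 1024 ** 3


def per_osd_cache_bytes(node_ram_bytes, osd_count,
                        reserve_bytes=4 * GIB, cache_fraction=0.5):
    """Divide a node's cache budget evenly among its OSDs.

    reserve_bytes and cache_fraction are illustrative guesses,
    not Ceph defaults or recommendations.
    """
    if osd_count <= 0:
        raise ValueError("osd_count must be positive")
    # Keep some RAM back for the OS and for non-cache OSD memory.
    usable = max(node_ram_bytes - reserve_bytes, 0)
    return int(usable * cache_fraction) // osd_count


# Hypothetical mixed cluster: (hostname, RAM in bytes, number of OSDs).
NODES = [
    ("small-node", 64 * GIB, 12),
    ("big-node", 128 * GIB, 8),
]

if __name__ == "__main__":
    for host, ram, osds in NODES:
        size = per_osd_cache_bytes(ram, osds)
        print(f"[{host}] bluestore_cache_size = {size}")
```

With these made-up inputs the small node ends up with 2.5 GiB per OSD and the big node with 7.75 GiB per OSD, which shows how far apart sensible values can be on mixed hardware.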

Personally I have two primary issues with bluestore's memory configuration right now:

1) It's too complicated for users to figure out where to assign memory and in what ratios.  I'm attempting to improve this by making bluestore's cache autotuning so the user just gives it a number and bluestore will try to work out where it should assign memory.

2) In the case where a subset of OSDs are really hot (maybe RGW bucket accesses) you might want some OSDs to get more memory than others.  I think we can tackle this better if we migrate to a one-osd-per-node sharded architecture (likely based on seastar), though we'll still need to be very aware of remote memory.  Given that this is fairly difficult to do well, we're probably going to be better off just dedicating a static pool to each shard initially.
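To make the "in what ratios" problem from point 1 concrete, here is a small sketch of how a single per-OSD cache budget might be divided between key/value (RocksDB) cache, onode metadata, and object data. It assumes, as a simplification, that the data share is whatever the other two ratios leave over. The function split_cache and the ratio values in the example are hypothetical and are not bluestore defaults.

```python
def split_cache(total_bytes, kv_ratio, meta_ratio):
    """Split one cache budget into kv, meta, and data shares.

    The data share is the remainder, so the three parts always
    add up to total_bytes.
    """
    if kv_ratio < 0 or meta_ratio < 0 or kv_ratio + meta_ratio > 1:
        raise ValueError("ratios must be non-negative and sum to at most 1")
    kv = int(total_bytes * kv_ratio)
    meta = int(total_bytes * meta_ratio)
    data = total_bytes - kv - meta
    return {"kv": kv, "meta": meta, "data": data}


if __name__ == "__main__":
    # Illustrative numbers only: a 3 GiB budget, half to kv, a quarter to meta.
    shares = split_cache(3 * 1024 ** 3, kv_ratio=0.5, meta_ratio=0.25)
    for name, size in shares.items():
        print(f"{name}: {size} bytes")
```

An autotuning cache as described above would replace the fixed kv_ratio and meta_ratio arguments with values adjusted at runtime, leaving the user to supply only total_bytes.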

Mark

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com