Re: Bluestore caching, flawed by design?

John Hearns <hearnsj@xxxxxxxxxxxxxx> · Mon, 2 Apr 2018 08:33:35 +0200

Christian, you mention single socket systems for storage servers.
I have often thought that the Xeon-D would be ideal as a building block for storage servers:
https://www.intel.com/content/www/us/en/products/processors/xeon/d-processors.html
Low power, and a complete System-On-Chip with 10gig Ethernet.

I haven't been following these processors lately. Is anyone building Ceph clusters using them?

On 2 April 2018 at 02:59, Christian Balzer <chibi@xxxxxxx> wrote:

Hello,

firstly, Jack pretty much correctly correlated my issues to Mark's points,

more below.

On Sat, 31 Mar 2018 08:24:45 -0500 Mark Nelson wrote:

> On 03/29/2018 08:59 PM, Christian Balzer wrote:

>

> > Hello,

> >

> > my crappy test cluster was rendered inoperational by an IP renumbering

> > that wasn't planned and forced on me during a DC move, so I decided to

> > start from scratch and explore the fascinating world of Luminous/bluestore

> > and all the assorted bugs. ^_-

> > (yes I could have recovered the cluster by setting up a local VLAN with

> > the old IPs, extract the monmap, etc, but I consider the need for a

> > running monitor a flaw, since all the relevant data was present in the

> > leveldb).

> >

> > Anyways, while I've read about bluestore OSD cache in passing here, the

> > back of my brain was clearly still hoping that it would use pagecache/SLAB

> > like other filesystems.

> > Which after my first round of playing with things clearly isn't the case.

> >

> > This strikes me as a design flaw and regression because:

>

> Bluestore's cache is not broken by design.

>

During further tests I verified something that caught my attention out of

the corner of my eye when glancing at atop output of the OSDs during my fio

runs.

Consider this fio run, after having done the same with write to populate

the file and caches (1GB per OSD default on the test cluster, 20 OSDs

total on 5 nodes):

---

$ fio --size=8G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 \
    --rw=randread --name=fiojob --blocksize=4M --iodepth=32

---

This is being run against a kernel mounted RBD image.

On the Luminous test cluster it will read the data from the disks,

completely ignoring the pagecache on the host (as expected and desired)

AND the bluestore cache.

On a Jewel based test cluster with filestore the reads will be served from

the pagecaches of the OSD nodes, not only massively improving speed but

more importantly spindle contention.
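
As a rough back-of-the-envelope check on the numbers above, the aggregate bluestore cache can be compared with the size of the fio file. This is only a sketch: the even spread of the image across all OSDs is an assumption, and how each OSD divides its cache between metadata and object data is not modelled at all.

```python
# Back-of-the-envelope comparison of the fio working set and the bluestore cache.
# OSD count, cache size and file size are the figures quoted in the mail;
# the even spread of data across OSDs is an assumption made for illustration.
GIB = 1024 ** 3

osd_count = 20
cache_per_osd = 1 * GIB      # default on the test cluster, as stated above
fio_file_size = 8 * GIB      # --size=8G

aggregate_cache = osd_count * cache_per_osd
data_per_osd = fio_file_size / osd_count

print(f"aggregate cache: {aggregate_cache / GIB:.0f} GiB")
print(f"fio data per OSD (even spread assumed): {data_per_osd / GIB:.1f} GiB")
print(f"file no larger than aggregate cache: {fio_file_size <= aggregate_cache}")
```

In raw size terms the file is smaller than the combined caches, which is why reads going to the disks anyway stood out.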

My guess is that bluestore treats "direct" differently than the kernel

accessing a filestore based OSD and I'm not sure what the "correct"

behavior here is.

But somebody migrating to bluestore with such a use case and plenty of RAM

on their OSD nodes is likely to notice this and is not going to be happy about

it.

> I'm not totally convinced that some of the trade-offs we've made with

> bluestore's cache implementation are optimal, but I think you should

> consider cooling your rhetoric down.

>

> > 1. Completely new users may think that bluestore defaults are fine and

> > waste all that RAM in their machines.

>

> What does "wasting" RAM mean in the context of a node running ceph? Are

> you upset that other applications can't come in and evict bluestore

> onode, OMAP, or object data from cache?

>

What Jack pointed out: unless you go around and start tuning things,

all available free RAM won't be used for caching.

This raises another point: the cache is per-process data and, from skimming

over some bluestore threads here, if you go and raise the cache to use

most RAM during normal ops you're likely to be visited by the evil OOM

witch during heavy recovery ops.

Whereas the good ole pagecache would just get evicted in that scenario.
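
To make the OOM concern concrete, here is a minimal sketch of the headroom arithmetic. Every number in it (node RAM, OSD count, cache size, and the per-OSD non-cache overhead in normal operation and during recovery) is a hypothetical value chosen for illustration, not a measurement.

```python
# Hypothetical headroom check: all figures below are illustrative assumptions.
GIB = 1024 ** 3

def memory_headroom(total_ram, osd_count, cache_per_osd, overhead_per_osd):
    """Return bytes left once every OSD holds its cache plus non-cache overhead.

    A negative result means the node is overcommitted.
    """
    used = osd_count * (cache_per_osd + overhead_per_osd)
    return total_ram - used

node_ram = 64 * GIB   # assumed node size
osds = 4              # assumed OSDs per node
cache = 12 * GIB      # cache raised to use most of the RAM

# Assumed non-cache memory per OSD: 2 GiB normally, 5 GiB during heavy recovery.
normal = memory_headroom(node_ram, osds, cache, overhead_per_osd=2 * GIB)
recovery = memory_headroom(node_ram, osds, cache, overhead_per_osd=5 * GIB)

print(f"headroom in normal ops: {normal / GIB:.0f} GiB")
print(f"headroom during recovery: {recovery / GIB:.0f} GiB")
```

With these assumed figures the node is fine in normal operation and overcommitted during recovery. A fixed per-process cache does not shrink in that situation, whereas pagecache would simply be evicted.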

> > 2. Having a per OSD cache is inefficient compared to a common cache like

> > pagecache, since an OSD that is busier than others would benefit from a

> > shared cache more.

>

> It's only "inefficient" if you assume that using the pagecache, and more

> generally, kernel syscalls, is free.  Yes the pagecache is convenient

> and yes it gives you a lot of flexibility, but you pay for that

> flexibility if you are trying to do anything fast.

>

> For instance, take the new KPTI patches in the kernel for meltdown. Look

> at how badly it can hurt MyISAM database performance in MariaDB:

>

I, like many others here, have decided that all the Meltdown and Spectre

patches are a bit pointless on pure OSD nodes, because if somebody on the

node is running random code you're already in deep doodoo.

That being said, I will totally concur that syscalls aren't free.

However given the latencies induced by the rather long/complex code IOPS

have to traverse within Ceph, how much of a gain would you say

eliminating these particular calls did achieve?

> https://mariadb.org/myisam-table-scan-performance-kpti/

>

> MyISAM does not have a dedicated row cache and instead caches row data

> in the page cache as you suggest Bluestore should do for its data.

> Look at how badly KPTI hurts performance (~40%). Now look at ARIA with a

> dedicated 128MB cache (less than 1%).  KPTI is a really good example of

> how much this stuff can hurt you, but syscalls, context switches, and

> page faults were already expensive even before meltdown.  Not to mention

> that right now bluestore keeps onodes and buffers stored in its cache

> in an unencoded form.

>

That last bit is quite relevant of course.

> Here's a couple of other articles worth looking at:

>

> https://eng.uber.com/mysql-migration/

> https://www.scylladb.com/2018/01/07/cost-of-avoiding-a-meltdown/

> http://www.brendangregg.com/blog/2018-02-09/kpti-kaiser-meltdown-performance.html

>

> > 3. A uniform OSD cache size of course will be a nightmare when having

> > non-uniform HW, either with RAM or number of OSDs.

>

> Non-Uniform hardware is a big reason that pinning dedicated memory to

> specific cores/sockets is really nice vs relying on potentially remote

> memory page cache reads.  A long time ago I was responsible for

> validating the performance of CXFS on an SGI Altix UV distributed

> shared-memory supercomputer.  As it turns out, we could achieve about

> 22GB/s writes with XFS (a huge number at the time), but CXFS was 5-10x

> slower.  A big part of that turned out to be the kernel distributing

> page cache across the Numalink5 interconnects to remote memory.  The

> problem can potentially happen on any NUMA system to varying degrees.

>

I could regale you with even more ancient stories when I was working with

DEC VMSclusters. ^o^ But that's not here and now.

As for pinning, I'm doing this mostly on compute nodes, to basically keep

a good number of cores free on the NUMA node(s) that handle HW interrupts,

the kernel tends to do a decent enough job from there on.

And while you definitely have a point, modern designs like Epyc pretty much

negate NUMA issues.

Never mind that performance/latency conscious Ceph users already do stick

to single NUMA CPUs for SSD/NVMe servers if possible.

> Personally I have two primary issues with bluestore's memory

> configuration right now:

>

> 1) It's too complicated for users to figure out where to assign memory

> and in what ratios.  I'm attempting to improve this by making

> bluestore's cache autotuning so the user just gives it a number and

> bluestore will try to work out where it should assign memory.

>

This would be very helpful (as in ratio of # of OSDs/total RAM).

Otherwise you wind up with non-uniformity issues again.

And especially _if_ it can also drop caches in low memory situations

voluntarily.
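
The ratio mentioned above (number of OSDs to total RAM) can be sketched as a per-node calculation. This is only an illustration of the arithmetic, not how any autotuner works: the node names, RAM sizes, OSD counts and the fixed reserve are hypothetical.

```python
# Hypothetical per-node cache budget: node sizes and the reserve are assumptions.
GIB = 1024 ** 3

def cache_per_osd(total_ram, osd_count, reserve):
    """Split the RAM left after a fixed reserve evenly across a node's OSDs."""
    if osd_count <= 0:
        raise ValueError("osd_count must be positive")
    usable = max(total_ram - reserve, 0)
    return usable // osd_count

# Two non-uniform nodes, differing in both RAM and number of OSDs.
nodes = {
    "node-a": {"ram": 32 * GIB, "osds": 4},
    "node-b": {"ram": 64 * GIB, "osds": 12},
}
reserve = 8 * GIB  # assumed amount kept back for the OS and recovery spikes

budgets = {
    name: cache_per_osd(spec["ram"], spec["osds"], reserve)
    for name, spec in nodes.items()
}

for name, budget in budgets.items():
    print(f"{name}: {budget / GIB:.2f} GiB cache per OSD")
```

The node with more RAM ends up with the smaller per-OSD budget here because it carries three times as many OSDs, which is exactly the non-uniformity a single cluster-wide cache size cannot express.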

> 2) In the case where a subset of OSDs are really hot (maybe RGW bucket

> accesses) you might want some OSDs to get more memory than others.  I

> think we can tackle this better if we migrate to a one-osd-per-node

> sharded architecture (likely based on seastar), though we'll still need

> to be very aware of remote memory.  Given that this is fairly difficult

> to do well, we're probably going to be better off just dedicating a

> static pool to each shard initially.

>

I'm wondering if and how such a sharding can be realized while still

keeping the OSD (storage device really) the smallest failure domain and

not just the host.

Because I'm betting you that some people have specialty use cases

depending on that (not me for a change).

Christian

> Mark

> _______________________________________________

> ceph-users mailing list

> ceph-users@xxxxxxxxxxxxxx

> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--

Christian Balzer        Network/Systems Engineer

chibi@xxxxxxx           Rakuten Communications

