Re: Bluestore caching, flawed by design?

Christian Balzer <chibi@xxxxxxx> · Mon, 2 Apr 2018 16:16:51 +0900

Hello,

On Mon, 2 Apr 2018 08:33:35 +0200 John Hearns wrote:

> Christian, you mention single socket systems for storage servers.
> I often thought that the Xeon-D would be ideal as a building block for
> storage servers
> https://www.intel.com/content/www/us/en/products/processors/xeon/d-processors.html
> Low power, and a complete System-On-Chip with 10gig Ethernet.
>
If you (re)search the ML archives you should be able to find discussions
about this; I seem to remember them coming up before.

If you're going to have a typical setup of HDDs for storage and 1-2 SSDs
for journal/WAL/DB, they should do well enough. But in that scenario
you're likely not all that latency conscious about NUMA issues to begin
with, given that current CPU interlinks are quite decent.

They do however feel underpowered when paired with really fast (NVMe)
devices or more than 4 SSDs per node if you have a lot of small writes.

For example with a Jewel cluster and Intel DC S3610 SSDs this fio line:
---
$ fio --size=8G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 \
  --rw=randwrite --name=fiojob --blocksize=4K --iodepth=32
---
with this CPU:
Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz

will leave the SSDs only about 40% busy, but use about 2.5 cores (250% in
atop) per OSD process, leaving very few free CPU cycles to go around.
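
A rough back-of-the-envelope sketch of that CPU budget. Assumptions, not
measurements: the E5-2620 v3 has 6 cores/12 threads, atop counts one
hardware thread as 100%, and the helper name is made up for illustration.

```python
def max_busy_osds(hw_threads: float, cpu_per_osd: float, reserved: float = 0.0) -> float:
    """How many OSDs can run flat out before the CPU budget is used up.

    hw_threads  -- schedulable CPUs on the node (assumption: 100% in atop each)
    cpu_per_osd -- CPUs one busy OSD process consumes (2.5 in the fio run above)
    reserved    -- CPUs set aside for kernel, network and everything else
    """
    if cpu_per_osd <= 0:
        raise ValueError("cpu_per_osd must be positive")
    return max(hw_threads - reserved, 0.0) / cpu_per_osd


# Counting all 12 hardware threads of the E5-2620 v3
print(max_busy_osds(12, 2.5))
# Counting only the 6 physical cores
print(max_busy_osds(6, 2.5))
```

Either way you end up in the low single digits of fully loaded OSDs per
CPU, which fits with the "more than 4 SSDs per node" remark above.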

For high end single NUMA node systems I'd look at something with these:
Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz (6 cores/12 threads)
or just go Epyc and drown in PCIe lanes as well for a change.

Christian

> I haven't been following these processors lately. Is anyone building Ceph
> clusters using them?
> 
> On 2 April 2018 at 02:59, Christian Balzer <chibi@xxxxxxx> wrote:
> 
> >
> > Hello,
> >
> > firstly, Jack pretty much correctly correlated my issues to Mark's points,
> > more below.
> >
> > On Sat, 31 Mar 2018 08:24:45 -0500 Mark Nelson wrote:
> >  
> > > On 03/29/2018 08:59 PM, Christian Balzer wrote:
> > >  
> > > > Hello,
> > > >
> > > > my crappy test cluster was rendered inoperational by an IP renumbering
> > > > that wasn't planned and forced on me during a DC move, so I decided to
> > > > start from scratch and explore the fascinating world of Luminous/bluestore
> > > > and all the assorted bugs. ^_-
> > > > (yes I could have recovered the cluster by setting up a local VLAN with
> > > > the old IPs, extract the monmap, etc, but I consider the need for a
> > > > running monitor a flaw, since all the relevant data was present in the
> > > > leveldb).
> > > >
> > > > Anyways, while I've read about bluestore OSD cache in passing here, the
> > > > back of my brain was clearly still hoping that it would use pagecache/SLAB
> > > > like other filesystems.
> > > > Which after my first round of playing with things clearly isn't the case.
> > > >
> > > > This strikes me as a design flaw and regression because:  
> > >
> > > Bluestore's cache is not broken by design.
> > >  
> >
> > During further tests I verified something that caught my attention out of
> > the corner of my eye when glancing at atop output of the OSDs during my fio
> > runs.
> >
> > Consider this fio run, after having done the same with write to populate
> > the file and caches (1GB per OSD default on the test cluster, 20 OSDs
> > total on 5 nodes):
> > ---
> > $ fio --size=8G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
> > --rw=randread --name=fiojob --blocksize=4M --iodepth=32
> > ---
> >
> > This is being run against a kernel mounted RBD image.
> > On the Luminous test cluster it will read the data from the disks,
> > completely ignoring the pagecache on the host (as expected and desired)
> > AND the bluestore cache.
> >
> > On a Jewel based test cluster with filestore the reads will be served from
> > the pagecaches of the OSD nodes, not only massively improving speed but
> > more importantly spindle contention.
> >
> > My guess is that bluestore treats "direct" differently than the kernel
> > accessing a filestore based OSD and I'm not sure what the "correct"
> > behavior here is.
> > But somebody migrating to bluestore with such a use case and plenty of RAM
> > on their OSD nodes is likely to notice this and is not going to be happy
> > about it.
> >
> >  
> > > I'm not totally convinced that some of the trade-offs we've made with
> > > bluestore's cache implementation are optimal, but I think you should
> > > consider cooling your rhetoric down.
> > >  
> > > > 1. Completely new users may think that bluestore defaults are fine and
> > > > waste all that RAM in their machines.  
> > >
> > > What does "wasting" RAM mean in the context of a node running ceph? Are
> > > you upset that other applications can't come in and evict bluestore
> > > onode, OMAP, or object data from cache?
> > >  
> > What Jack pointed out, unless you go around and start tuning things,
> > all available free RAM won't be used for caching.
> >
> > This raises another point: since this is per-process memory, and judging
> > from skimming over some bluestore threads here, if you raise the cache to
> > use most of the RAM during normal ops you're likely to be visited by the
> > evil OOM witch during heavy recovery ops.
> >
> > Whereas the good ole pagecache would just get evicted in that scenario.
> >  
> > > > 2. Having a per OSD cache is inefficient compared to a common cache like
> > > > pagecache, since an OSD that is busier than others would benefit from a
> > > > shared cache more.
> > >
> > > It's only "inefficient" if you assume that using the pagecache, and more
> > > generally, kernel syscalls, is free.  Yes the pagecache is convenient
> > > and yes it gives you a lot of flexibility, but you pay for that
> > > flexibility if you are trying to do anything fast.
> > >
> > > For instance, take the new KPTI patches in the kernel for meltdown. Look
> > > at how badly it can hurt MyISAM database performance in MariaDB:
> > >  
> > I, like many others here, have decided that all the Meltdown and Spectre
> > patches are a bit pointless on pure OSD nodes, because if somebody on the
> > node is running random code you're already in deep doodoo.
> >
> > That being said, I will totally concur that syscalls aren't free.
> > However, given the latencies induced by the rather long/complex code path
> > IOs have to traverse within Ceph, how much of a gain would you say
> > eliminating these particular calls did achieve?
> >  
> > > https://mariadb.org/myisam-table-scan-performance-kpti/
> > >
> > > MyISAM does not have a dedicated row cache and instead caches row data
> > > in the page cache as you suggest Bluestore should do for its data.
> > > Look at how badly KPTI hurts performance (~40%). Now look at ARIA with a
> > > dedicated 128MB cache (less than 1%).  KPTI is a really good example of
> > > how much this stuff can hurt you, but syscalls, context switches, and
> > > page faults were already expensive even before meltdown.  Not to mention
> > > that right now bluestore keeps onodes and buffers stored in its cache
> > > in an unencoded form.
> > >  
> > That last bit is quite relevant of course.
> >  
> > > Here's a couple of other articles worth looking at:
> > >
> > > https://eng.uber.com/mysql-migration/
> > > https://www.scylladb.com/2018/01/07/cost-of-avoiding-a-meltdown/
> > > http://www.brendangregg.com/blog/2018-02-09/kpti-kaiser-meltdown-performance.html
> > >  
> > > > 3. A uniform OSD cache size of course will be a nightmare when having
> > > > non-uniform HW, either with RAM or number of OSDs.  
> > >
> > > Non-Uniform hardware is a big reason that pinning dedicated memory to
> > > specific cores/sockets is really nice vs relying on potentially remote
> > > memory page cache reads.  A long time ago I was responsible for
> > > validating the performance of CXFS on an SGI Altix UV distributed
> > > shared-memory supercomputer.  As it turns out, we could achieve about
> > > 22GB/s writes with XFS (a huge number at the time), but CXFS was 5-10x
> > > slower.  A big part of that turned out to be the kernel distributing
> > > page cache across the Numalink5 interconnects to remote memory.  The
> > > problem can potentially happen on any NUMA system to varying degrees.
> > >  
> > I could regale you with even more ancient stories when I was working with
> > DEC VMSclusters. ^o^ But that's not here and now.
> >
> > As for pinning, I'm doing this mostly on compute nodes, to basically keep
> > a good number of cores free on the NUMA node(s) that handle HW interrupts,
> > the kernel tends to do a decent enough job from there on.
> >
> > And while you definitely have a point, modern designs like Epyc pretty much
> > negate NUMA issues.
> > Never mind that performance/latency conscious Ceph users already do stick
> > to single NUMA CPUs for SSD/NVMe servers if possible.
> >  
> > > Personally I have two primary issues with bluestore's memory
> > > configuration right now:
> > >
> > > 1) It's too complicated for users to figure out where to assign memory
> > > and in what ratios.  I'm attempting to improve this by making
> > > bluestore's cache autotuning so the user just gives it a number and
> > > bluestore will try to work out where it should assign memory.
> > >  
> > This would be very helpful (as in ratio of # of OSDs/total RAM).
> > Otherwise you wind up with non-uniformity issues again.
> >
> > And especially _if_ it can also drop caches in low memory situations
> > voluntarily.
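
To illustrate the ratio idea (# of OSDs vs. total RAM) mentioned above,
here is a minimal sketch. Everything in it is an assumption for
illustration: the helper name, the node sizes and the headroom fraction.
The fraction exists only to leave room for recovery so the OOM killer
stays away. The result is the kind of byte value one might feed to
bluestore_cache_size.

```python
GIB = 1024 ** 3


def per_osd_cache_bytes(total_ram_gib: float, num_osds: int,
                        reserved_gib: float, cache_fraction: float = 0.5) -> int:
    """Split the RAM left after a fixed reservation evenly across OSDs.

    total_ram_gib  -- RAM installed in the node
    num_osds       -- OSD processes running on the node
    reserved_gib   -- RAM kept for the OS and OSD non-cache memory (assumption)
    cache_fraction -- share of the remainder handed to caches; the rest is
                      headroom for recovery spikes (assumption)
    """
    if num_osds <= 0:
        raise ValueError("num_osds must be positive")
    if not 0.0 <= cache_fraction <= 1.0:
        raise ValueError("cache_fraction must be between 0 and 1")
    usable_gib = max(total_ram_gib - reserved_gib, 0.0) * cache_fraction
    return int(usable_gib * GIB / num_osds)


# Hypothetical node: 64 GiB RAM, 4 OSDs, 16 GiB reserved
print(per_osd_cache_bytes(64, 4, 16))
```

A node with twice the OSDs and the same RAM gets half the per-OSD value,
which is exactly the non-uniformity that a single cluster-wide setting
cannot express.
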
> >  
> > > 2) In the case where a subset of OSDs are really hot (maybe RGW bucket
> > > accesses) you might want some OSDs to get more memory than others.  I
> > > think we can tackle this better if we migrate to a one-osd-per-node
> > > sharded architecture (likely based on seastar), though we'll still need
> > > to be very aware of remote memory.  Given that this is fairly difficult
> > > to do well, we're probably going to be better off just dedicating a
> > > static pool to each shard initially.
> > >  
> > I'm wondering if and how such a sharding can be realized while still
> > keeping the OSD (storage device really) the smallest failure domain and
> > not just the host.
> > Because I'm betting you that some people have specialty use cases
> > depending on that (not me for a change).
> >
> > Christian
> >  
> > > Mark
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@xxxxxxxxxxxxxx
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com  
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Rakuten Communications
> >  

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Rakuten Communications