On 03/29/2018 08:59 PM, Christian Balzer wrote:
> Hello,
>
> my crappy test cluster was rendered inoperational by an IP renumbering
> that wasn't planned and forced on me during a DC move, so I decided to
> start from scratch and explore the fascinating world of Luminous/bluestore
> and all the assorted bugs. ^_-
> (yes I could have recovered the cluster by setting up a local VLAN with
> the old IPs, extract the monmap, etc, but I consider the need for a
> running monitor a flaw, since all the relevant data was present in the
> leveldb).
>
> Anyways, while I've read about bluestore OSD cache in passing here, the
> back of my brain was clearly still hoping that it would use pagecache/SLAB
> like other filesystems.
> Which after my first round of playing with things clearly isn't the case.
>
> This strikes me as a design flaw and regression because:
Bluestore's cache is not broken by design.
I'm not totally convinced that some of the trade-offs we've made with
bluestore's cache implementation are optimal, but I think you should
consider cooling your rhetoric down.
> 1. Completely new users may think that bluestore defaults are fine and
> waste all that RAM in their machines.
What does "wasting" RAM mean in the context of a node running ceph? Are
you upset that other applications can't come in and evict bluestore
onode, OMAP, or object data from cache?
> 2. Having a per OSD cache is inefficient compared to a common cache like
> pagecache, since an OSD that is busier than others would benefit from a
> shared cache more.
It's only "inefficient" if you assume that using the pagecache, and more
generally, kernel syscalls, is free. Yes the pagecache is convenient
and yes it gives you a lot of flexibility, but you pay for that
flexibility if you are trying to do anything fast.
For instance, take the new KPTI patches in the kernel for Meltdown. Look
at how badly it can hurt MyISAM database performance in MariaDB:
https://mariadb.org/myisam-table-scan-performance-kpti/
MyISAM does not have a dedicated row cache and instead caches row data
in the page cache, as you suggest Bluestore should do for its data.
Look at how badly KPTI hurts performance (~40%). Now look at Aria with a
dedicated 128MB cache (less than 1%). KPTI is a really good example of
how much this stuff can hurt you, but syscalls, context switches, and
page faults were already expensive even before Meltdown. Not to mention
that right now bluestore keeps onodes and buffers stored in its cache
in an unencoded form.
Here are a couple of other articles worth looking at:
https://eng.uber.com/mysql-migration/
https://www.scylladb.com/2018/01/07/cost-of-avoiding-a-meltdown/
http://www.brendangregg.com/blog/2018-02-09/kpti-kaiser-meltdown-performance.html
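
To make the syscall argument above concrete, here is a minimal sketch (not bluestore's actual cache) of a dedicated in-process LRU cache sitting in front of `os.pread()`. The `BlockCache` class, the `BLOCK` size, and `run_demo()` are hypothetical names invented for this illustration. It only counts how many reads reach the kernel; it makes no claim about timings.

```python
import os
import tempfile
from collections import OrderedDict

BLOCK = 4096  # hypothetical block size, chosen only for this sketch


class BlockCache:
    """Tiny in-process LRU cache in front of os.pread() (illustration only)."""

    def __init__(self, fd, capacity_blocks):
        self.fd = fd
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # block number -> bytes
        self.syscalls = 0            # pread() calls actually issued

    def read_block(self, n):
        if n in self.blocks:
            # Cache hit: served from user space, no syscall at all.
            self.blocks.move_to_end(n)
            return self.blocks[n]
        # Cache miss: go to the kernel (page cache or disk).
        self.syscalls += 1
        data = os.pread(self.fd, BLOCK, n * BLOCK)
        self.blocks[n] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
        return data


def run_demo():
    """Read a small hot set repeatedly; return (logical reads, syscalls)."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(os.urandom(BLOCK * 8))
        f.flush()
        cache = BlockCache(f.fileno(), capacity_blocks=4)
        reads = 0
        for _ in range(100):
            for n in (0, 1, 2, 3):
                cache.read_block(n)
                reads += 1
        return reads, cache.syscalls


if __name__ == "__main__":
    reads, syscalls = run_demo()
    print(f"{reads} logical reads, {syscalls} pread() syscalls")
```

With a hot set that fits in the cache, only the first touch of each block reaches the kernel. Relying on the page cache alone would instead pay for a syscall on every read, even when the data is already in memory.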
> 3. A uniform OSD cache size of course will be a nightmare when having
> non-uniform HW, either with RAM or number of OSDs.
Non-uniform hardware is a big reason that pinning dedicated memory to
specific cores/sockets is really nice vs relying on page cache reads
from potentially remote memory. A long time ago I was responsible for
validating the performance of CXFS on an SGI Altix UV distributed
shared-memory supercomputer. As it turns out, we could achieve about
22GB/s writes with XFS (a huge number at the time), but CXFS was 5-10x
slower. A big part of that turned out to be the kernel distributing
page cache across the NUMAlink 5 interconnects to remote memory. The
problem can potentially happen on any NUMA system to varying degrees.
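
On the sizing side of the non-uniform hardware concern, a per-node calculation can take some of the pain out of mixed RAM and OSD counts. The following is only a rough sketch: `per_osd_cache_bytes()` is a hypothetical helper, and the reserve and overhead defaults are made-up placeholders, not recommendations. The result is the kind of value one might feed into a per-node setting such as `bluestore_cache_size`.

```python
GIB = 1024 ** 3


def per_osd_cache_bytes(node_ram_gib, num_osds,
                        os_reserve_gib=4.0, osd_overhead_gib=1.0):
    """Derive a per-OSD cache size from a node's RAM and OSD count.

    os_reserve_gib and osd_overhead_gib are assumptions for this sketch:
    memory kept back for the OS, and non-cache memory used by each OSD.
    """
    if num_osds <= 0:
        raise ValueError("num_osds must be positive")
    available_gib = node_ram_gib - os_reserve_gib - num_osds * osd_overhead_gib
    if available_gib <= 0:
        return 0  # nothing left to dedicate to caches on this node
    return int(available_gib * GIB / num_osds)


if __name__ == "__main__":
    # Two hypothetical nodes with the same OSD count but different RAM.
    for ram in (64, 32):
        size = per_osd_cache_bytes(ram, 12)
        print(f"{ram} GiB node, 12 OSDs: {size / GIB:.2f} GiB cache per OSD")
```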
Personally I have two primary issues with bluestore's memory
configuration right now:
1) It's too complicated for users to figure out where to assign memory
and in what ratios. I'm attempting to improve this by making
bluestore's cache autotune, so the user just gives it a number and
bluestore will try to work out where it should assign memory.
2) In the case where a subset of OSDs is really hot (maybe RGW bucket
accesses) you might want some OSDs to get more memory than others. I
think we can tackle this better if we migrate to a one-OSD-per-node
sharded architecture (likely based on Seastar), though we'll still need
to be very aware of remote memory. Given that this is fairly difficult
to do well, we're probably going to be better off just dedicating a
static pool to each shard initially.
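
As a rough illustration of the "give it one number" idea in point 1, the sketch below splits a single memory budget across several caches in proportion to some demand signal, with a fixed floor for each. This is not the actual autotuning algorithm. `split_budget()`, the cache names, and the demand values are all hypothetical.

```python
def split_budget(total_bytes, demand, floor_bytes=0):
    """Split total_bytes across caches in proportion to their demand.

    demand maps a cache name to a non-negative weight (for example, bytes
    recently requested). Every cache first receives floor_bytes; the rest
    is shared out proportionally. Integer division may leave a few bytes
    unassigned.
    """
    names = list(demand)
    floor_total = floor_bytes * len(names)
    if floor_total > total_bytes:
        raise ValueError("floors exceed the total budget")
    spare = total_bytes - floor_total
    total_demand = sum(demand.values())
    shares = {}
    for name in names:
        if total_demand > 0:
            extra = spare * demand[name] // total_demand
        else:
            extra = spare // len(names)  # no signal yet: split evenly
        shares[name] = floor_bytes + extra
    return shares


if __name__ == "__main__":
    # Hypothetical demand weights for three caches sharing one budget.
    print(split_budget(1000, {"kv": 1, "meta": 3, "data": 0}, floor_bytes=100))
```

The same function could also hand out a static pool per shard, as in point 2, by passing equal weights for every shard.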
Mark