Re: Bluestore caching oddities, again

On 8/4/19 7:36 PM, Christian Balzer wrote:
Hello,

On Sun, 4 Aug 2019 06:34:46 -0500 Mark Nelson wrote:

On 8/4/19 6:09 AM, Paul Emmerich wrote:

On Sun, Aug 4, 2019 at 3:47 AM Christian Balzer <chibi@xxxxxxx> wrote:
2. Bluestore caching still broken
When writing data with the fios below, it isn't cached on the OSDs.
Worse, existing cached data that gets overwritten is removed from the
cache, which, while of course correct, can't be free in terms of allocation
overhead.
Why not do what any sensible person would expect from experience with
any other cache out there: cache writes in case the data gets read again
soon, and in the case of overwrites reuse the existing allocations?
This is by design.
BlueStore only populates its cache on reads, not on writes. The idea is
that a reasonable application does not read data it just wrote (and if it
does, it's already cached at a higher layer like the page cache or a cache
on the hypervisor).

Note that this behavior can be changed by setting
bluestore_default_buffered_write = true.
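
For example, a minimal sketch of flipping that switch (assuming a cluster
using the centralized config store; the option defaults to false, and
restarting the OSD is the conservative way to make sure it takes effect):

  # enable caching of newly written object data in BlueStore
  ceph config set osd bluestore_default_buffered_write true

  # verify what a running OSD actually sees
  ceph daemon osd.0 config get bluestore_default_buffered_write

  # conservative: restart the OSD so the new value definitely applies
  systemctl restart ceph-osd@0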

Thanks to Mark for his detailed reply.
Given those points, I assume that with HDD-backed (but SSD WAL/DB) OSDs it's
not actually a performance killer?


Not typically from the overhead perspective (i.e. CPU usage shouldn't be an issue unless you have a lot of HDDs and wimpy CPUs, or possibly if you are also doing EC/compression/encryption with lots of small IO).  The next question though is whether you are better off caching bluestore onodes vs rocksdb block cache vs object data.  When you have DB/WAL on the same device as the bluestore block, you typically want to prioritize rocksdb indexes/filters, bluestore onodes, rocksdb block cache, and bluestore data, in that order (the ratios here though are very workload dependent).

If you have HDD + SSD DB/WAL, you probably still want to cache the indexes/filters with high priority (these are relatively small and will reduce read amplification in the DB significantly!).  Caching bluestore onodes and the rocksdb block cache may be less important, since the SSDs may be able to handle the metadata reads fast enough to have little impact on the HDD side of things.  Not all SSDs are made equal though, and people often like to put multiple DB/WALs on a single SSD, so all of this can be pretty hardware dependent.  You'll also eat more CPU going this path due to encode/decode between bluestore and rocksdb and all of the work involved in finding the right key/value pair in rocksdb itself.

So there are definitely going to be hardware-dependent trade-offs (i.e. even if it's faster on HDD+SSD setups to focus on the bluestore buffer cache, you may eat more CPU per IO doing it).  Probably the take-away is that if you have really beefy CPUs and really beefy SSDs in an HDD+SSD setup, it may be worth trying a higher buffer cache ratio and seeing what happens.
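
If you want to experiment with that split yourself, here's a rough sketch of
the manual knobs (only honored when autotuning is disabled; option names as
of Nautilus, and the values below are placeholders, not recommendations):

  # turn off the PriorityCacheManager's automatic balancing
  ceph config set osd bluestore_cache_autotune false

  # total per-OSD cache for HDD-backed OSDs (example: 3 GiB)
  ceph config set osd bluestore_cache_size_hdd 3221225472

  # fraction for bluestore onode metadata
  ceph config set osd bluestore_cache_meta_ratio 0.5

  # fraction for the rocksdb block cache (indexes/filters/data blocks)
  ceph config set osd bluestore_cache_kv_ratio 0.3

  # whatever is left over (~20% here) goes to the bluestore data/buffer cache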


Note that with the PriorityCacheManager and OSD memory autotuning, if you enable bluestore_default_buffered_write and neither the rocksdb block cache nor the bluestore onode cache needs more memory, the rest automatically gets assigned to the bluestore buffer cache for object data.
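
In practice that means that with autotuning left on, you mostly just set an
overall memory target and then watch where the balancer puts things, e.g.
(a sketch; the mempool names below are what I'd expect to see, double-check
against your release):

  # per-OSD memory budget the autotuner tries to stay under (~8 GiB)
  ceph config set osd osd_memory_target 8589934592

  # see how the memory is actually being split up
  ceph daemon osd.0 dump_mempools
  #   bluestore_cache_onode / bluestore_cache_other -> metadata caches
  #   bluestore_cache_data                          -> buffered object data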


Mark

I'll test that of course, but a gut feeling or ballpark figure would be
appreciated by probably more people than just me.
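
Roughly the kind of test I have in mind (a sketch only, against a
hypothetical RBD image called "test", not the exact fio jobs from my
first mail):

  # write a data set larger than the client-side caches
  fio --name=seed --ioengine=rbd --clientname=admin --pool=rbd \
      --rbdname=test --rw=randwrite --bs=4k --size=8G

  # then read it back and compare with/without
  # bluestore_default_buffered_write, while watching the OSD-side buffer
  # cache via "ceph daemon osd.N dump_mempools"
  fio --name=readback --ioengine=rbd --clientname=admin --pool=rbd \
      --rbdname=test --rw=randread --bs=4k --size=8G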

As for Paul's argument, I'm not buying it, because:
- It's a complete paradigm change when comparing it to filestore. Somebody
   migrating from FS to BS is likely to experience yet another performance
   decrease they didn't expect.
- Arguing for larger caches on the client only increases the cost of Ceph
   further. In that vein, BS currently can't utilize as much memory as FS
   did for caching in a safe manner.
- Use cases like a DB with enough caching to deal with the normal working
   set but doing some hourly crunching on data that exceeds it come to mind.
   One application here also processes written data once an hour, more than
   would fit in the VM pagecache, but currently comes from the FS pagecache.
- The overwrites of already cached data _clearly_ indicate hotness and
   thus should preferably be cached. That bit in particular is upsetting,
   initial write caching or not.
Regards,

Christian
FWIW, there's also a CPU usage and lock contention penalty for default
buffered write when using extremely fast flash storage.  A lot of my
recent work on improving cache performance and intelligence in bluestore
is to reduce contention in the onode/buffer cache and also significantly
reduce the impact of bluestore_default_buffered_write = true.  The
PriorityCacheManager was a big one, to do a better job of autotuning.
Another big one that recently merged was refactoring bluestore's caches
to trim on write (better memory behavior; shorter, more frequent trims;
trims distributed across threads) and not share a single lock between
the onode and buffer cache:


https://github.com/ceph/ceph/pull/28597


Ones still coming down the pipe are avoiding double-caching onodes in the
bluestore onode cache and the rocksdb block cache, and age-binning the
LRU caches to better redistribute memory between caches based on
relative age.  Age-binning is the piece that hopefully would let you cache on
write while still having the priority of those cached writes quickly
fall off if they are never read back (the more cache you have, the more
effective this would be at keeping the onode/omap ratios relatively higher).


Mark




Paul
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



