Re: Bluestore caching oddities, again

On 8/4/19 7:36 PM, Christian Balzer wrote:
Hello,

On Sun, 4 Aug 2019 06:34:46 -0500 Mark Nelson wrote:

On 8/4/19 6:09 AM, Paul Emmerich wrote:

On Sun, Aug 4, 2019 at 3:47 AM Christian Balzer <chibi@xxxxxxx> wrote:
2. Bluestore caching still broken
When writing data with the fios below, it isn't cached on the OSDs.
Worse, existing cached data that gets overwritten is removed from the
cache, which, while of course correct, can't be free in terms of allocation
overhead.
Why not do what any sensible person would expect from experience with
any other cache out there: cache writes in case the data gets read again
soon, and in the case of overwrites reuse the existing allocations?
This is by design.
BlueStore only populates its cache on reads, not on writes. The idea is
that a reasonable application does not read data it just wrote (and if it
does, it's already cached at a higher layer like the page cache or a cache
on the hypervisor).

Note that this behavior can be changed by setting
bluestore_default_buffered_write = true.
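
For example, a minimal sketch of flipping that switch (assuming a cluster
using the centralized config store; the option defaults to false, and
restarting the OSD is the conservative way to make sure it takes effect):

  # enable caching of newly written object data in BlueStore
  ceph config set osd bluestore_default_buffered_write true

  # verify what a running OSD actually sees
  ceph daemon osd.0 config get bluestore_default_buffered_write

  # conservative: restart the OSD so the new value definitely applies
  systemctl restart ceph-osd@0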

Thanks to Mark for his detailed reply.
Given those points, I assume that with HDD-backed (but SSD WAL/DB) OSDs it's
not actually a performance killer?


Not typically from the overhead perspective (i.e. CPU usage shouldn't be an issue unless you have a lot of HDDs and wimpy CPUs, or possibly if you are also doing EC/compression/encryption with lots of small IO).  The next question though is whether you are better off caching bluestore onodes vs rocksdb block cache vs object data.  When you have DB/WAL on the same device as the bluestore block, you typically want to prioritize rocksdb indexes/filters, bluestore onodes, rocksdb block cache, and bluestore data, in that order (the ratios here though are very workload dependent).

If you have HDD + SSD DB/WAL, you probably still want to cache the indexes/filters with high priority (these are relatively small and will reduce read amplification in the DB significantly!).  Caching bluestore onodes and the rocksdb block cache may be less important, since the SSDs may be able to handle the metadata reads fast enough to have little impact on the HDD side of things.  Not all SSDs are made equal though, and people often like to put multiple DB/WALs on a single SSD, so all of this can be pretty hardware dependent.  You'll also eat more CPU going this path due to encode/decode between bluestore and rocksdb and all of the work involved in finding the right key/value pair in rocksdb itself.

So there are definitely going to be hardware-dependent trade-offs (i.e. even if it's faster on HDD+SSD setups to focus on the bluestore buffer cache, you may eat more CPU per IO doing it).  Probably the take-away is that if you have really beefy CPUs and really beefy SSDs in an HDD+SSD setup, it may be worth trying a higher buffer cache ratio and seeing what happens.
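
If you want to experiment with that split yourself, here's a rough sketch of
the manual knobs (only honored when autotuning is disabled; option names as
of Nautilus, and the values below are placeholders, not recommendations):

  # turn off the PriorityCacheManager's automatic balancing
  ceph config set osd bluestore_cache_autotune false

  # total per-OSD cache for HDD-backed OSDs (example: 3 GiB)
  ceph config set osd bluestore_cache_size_hdd 3221225472

  # fraction for bluestore onode metadata
  ceph config set osd bluestore_cache_meta_ratio 0.5

  # fraction for the rocksdb block cache (indexes/filters/data blocks)
  ceph config set osd bluestore_cache_kv_ratio 0.3

  # whatever is left over (~20% here) goes to the bluestore data/buffer cache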


Note that with the PriorityCacheManager and OSD memory autotuning, if you enable bluestore_default_buffered_write and neither the rocksdb block cache nor the bluestore onode cache needs more memory, the rest automatically gets assigned to the bluestore buffer cache for object data.
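
In practice that means that with autotuning left on, you mostly just set an
overall memory target and then watch where the balancer puts things, e.g.
(a sketch; the mempool names below are what I'd expect to see, double-check
against your release):

  # per-OSD memory budget the autotuner tries to stay under (~8 GiB)
  ceph config set osd osd_memory_target 8589934592

  # see how the memory is actually being split up
  ceph daemon osd.0 dump_mempools
  #   bluestore_cache_onode / bluestore_cache_other -> metadata caches
  #   bluestore_cache_data                          -> buffered object data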


Mark

I'll test that of course, but a gut feeling or ballpark figure would be
appreciated by probably more people than just me.
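
Roughly the kind of test I have in mind (a sketch only, against a
hypothetical RBD image called "test", not the exact fio jobs from my
first mail):

  # write a data set larger than the client-side caches
  fio --name=seed --ioengine=rbd --clientname=admin --pool=rbd \
      --rbdname=test --rw=randwrite --bs=4k --size=8G

  # then read it back and compare with/without
  # bluestore_default_buffered_write, while watching the OSD-side buffer
  # cache via "ceph daemon osd.N dump_mempools"
  fio --name=readback --ioengine=rbd --clientname=admin --pool=rbd \
      --rbdname=test --rw=randread --bs=4k --size=8G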

As for Paul's argument, I'm not buying it, because:
- It's a complete paradigm change when comparing it to filestore. Somebody
   migrating from FS to BS is likely to experience yet another performance
   decrease they didn't expect.
- Arguing for larger caches on the client only increases the cost of Ceph
   further. In that vein, BS currently can't utilize as much memory as FS
   did for caching in a safe manner.
- Use cases like a DB with enough caching to deal with the normal working
   set but doing some hourly crunching on data that exceeds it come to mind.
   One application here also processes written data once an hour, more than
   would fit in the VM pagecache, but currently comes from the FS pagecache.
- The overwrites of already cached data _clearly_ indicate hotness and
   thus should preferably be cached. That bit in particular is upsetting,
   initial write caching or not.
Regards,

Christian
FWIW, there's also a CPU usage and lock contention penalty for default
buffered write when using extremely fast flash storage.  A lot of my
recent work on improving cache performance and intelligence in bluestore
is to reduce contention in the onode/buffer cache and also significantly
reduce the impact of bluestore_default_buffered_write = true.  The
PriorityCacheManager was a big one, to do a better job of autotuning.
Another big one that recently merged was refactoring bluestore's caches
to trim on write (better memory behavior; shorter, more frequent trims;
trims distributed across threads) and not share a single lock between
the onode and buffer cache:


https://github.com/ceph/ceph/pull/28597


Ones still coming down the pipe are avoiding double-caching onodes in the
bluestore onode cache and the rocksdb block cache, and age-binning the
LRU caches to better redistribute memory between caches based on
relative age.  Age-binning is the piece that hopefully would let you cache on
write while still having the priority of those cached writes quickly
fall off if they are never read back (the more cache you have, the more
effective this would be at keeping the onode/omap ratios relatively higher).


Mark




Paul
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



